-
Compiler-controlled prefetch
-
An alternative to hardware prefetching.
-
Some CPUs include prefetching instructions.
-
These instructions request that data be moved into either a register or cache.
-
These special instructions can either be faulting or non-faulting.
-
Non-faulting instructions do nothing (no-op) if the memory access would cause an exception.
-
Of course, prefetching does not help if it interferes with normal CPU memory access or operation.
-
Thus, the cache must be nonblocking (also called lockup-free).
-
This allows the CPU to overlap execution with the prefetching of data.
-
While this approach yields better prefetch "hit" rates than hardware prefetch, it does so at the expense of executing more instructions.
-
Thus, the compiler tends to concentrate on prefetching data that are likely to be cache misses anyway.
-
Loops are key targets since they operate over large data spaces and their data accesses can be inferred from the loop index in advance.
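-
As a rough sketch of the idea, the GCC/Clang __builtin_prefetch intrinsic can stand in for a machine-level prefetch instruction; the prefetch distance below is an arbitrary illustrative value:
```c
#include <stddef.h>

/* Sum an array while prefetching data several iterations ahead so the
 * memory accesses overlap with computation.  PREFETCH_DISTANCE is a
 * tuning parameter chosen here only for illustration. */
#define PREFETCH_DISTANCE 16

double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* Non-binding hint: read access (0), moderate locality (1). */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
        sum += a[i];
    }
    return sum;
}
```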
-
Compiler optimizations
-
This method does NOT require any hardware modifications.
-
Yet it can be the most efficient way to eliminate cache misses.
-
The improvement results from better code and data organizations.
-
For example, code can be rearranged to avoid conflicts in a direct-mapped cache, and accesses to arrays can be reordered to operate on blocks of data rather than processing rows of the array.
-
Merging arrays
-
This method combines two separate arrays (that might conflict for a single block in the cache) into a single interleaved array.
-
This brings together corresponding elements in both arrays, which are likely to be referenced together.
-
Reorganizing and fetching them at the same time can reduce misses.
-
This technique reduces misses by improving spatial locality.
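-
A minimal sketch of the transformation in C, following the usual textbook example (the names val and key are illustrative):
```c
#define SIZE 1024

/* Before: two parallel arrays; val[i] and key[i] live in different
 * regions of memory and may compete for the same cache blocks. */
int val[SIZE];
int key[SIZE];

/* After merging: corresponding elements share a cache block, so a
 * reference to one usually brings in the other as well. */
struct merged {
    int val;
    int key;
};
struct merged merged_array[SIZE];
```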
-
Loop interchange
-
By switching the order in which loops execute, misses can be reduced due to improvements in spatial locality.
-
Loops can cause a miss on nearly every memory access when the inner-loop index j steps through the array with a long stride.
-
By switching the order of the loops, the stride is changed to 1, allowing the elements to be accessed in sequential order.
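-
A minimal sketch in C (dimensions are illustrative; x is stored in row-major order, as C requires):
```c
#define ROWS 5000
#define COLS 100

int x[ROWS][COLS];   /* row-major: x[j][i] and x[j][i+1] are adjacent */

/* Before: the inner index j selects the row, so consecutive accesses
 * are COLS words apart -- a long stride that can miss on every access. */
void before_interchange(void)
{
    for (int i = 0; i < COLS; i++)
        for (int j = 0; j < ROWS; j++)
            x[j][i] = 2 * x[j][i];
}

/* After interchange: the inner loop walks along a row with stride 1,
 * so every word of a cache block is used before the next block is fetched. */
void after_interchange(void)
{
    for (int j = 0; j < ROWS; j++)
        for (int i = 0; i < COLS; i++)
            x[j][i] = 2 * x[j][i];
}
```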
-
Loop Fusion
-
Many programs have separate loops that operate on the same data.
-
Combining these loops allows a program to take advantage of temporal locality by grouping operations on the same (cached) data together.
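-
A minimal sketch in C (the arrays a, b, c, and d are illustrative):
```c
#define N 1024

double a[N], b[N], c[N], d[N];

/* Before: two separate loops traverse a[] and c[]; by the time the
 * second loop runs, the elements touched by the first may be evicted. */
void separate_loops(void)
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] * c[i];
    for (int i = 0; i < N; i++)
        d[i] = a[i] + c[i];
}

/* After fusion: both statements use a[i] and c[i] while they are
 * still cached, exploiting temporal locality. */
void fused_loop(void)
{
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
        d[i] = a[i] + c[i];
    }
}
```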
-
Blocking
-
The above methods work well on array accesses that occur along one dimension only.
-
However, loops that access data along both rows and columns, such as matrix multiplication, remain a problem.
-
Unoptimized matrix multiplication of X = Y × Z requires the cache to hold, at a minimum, a row of X, a row of Y, and all of Z at once.
-
Capacity misses can occur for large matrices since it may not be possible to store all the elements of Z in the cache.
-
Blocking operates on B × B blocks (submatrices) rather than on entire rows or columns, and reduces the total number of memory words accessed by roughly a factor of B (the blocking factor).
-
Therefore, the matrix multiplication is performed by multiplying pairs of submatrices and accumulating the partial results.
-
Matrix Y benefits from spatial locality and Z benefits from temporal locality.
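-
A minimal sketch of a blocked multiplication x = y * z in C, with B chosen (illustratively) so that a few B-by-B submatrices fit in the cache:
```c
#define N 512
#define B 32   /* blocking factor; N is a multiple of B for simplicity */

/* File-scope arrays are zero-initialized, so x starts at 0. */
double x[N][N], y[N][N], z[N][N];

/* Blocked matrix multiplication: each (jj, kk) pair works on a B x B
 * submatrix of z, which stays in the cache and is reused (temporal
 * locality), while rows of y are read with stride 1 (spatial locality). */
void blocked_matmul(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```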
-
This method is also used to reduce the number of blocks that must be transferred between disk and main memory.
-
Therefore, the technique is effective for several levels of the hierarchy.
-
Given the widening gap between processor speed and memory access time, these last two techniques will only increase in importance over time.
-
Giving read misses priority
-
If a system has a write buffer, writes can be delayed to come after reads.
-
The system must, however, be careful to check the write buffer to see if the value being read is about to be written.
-
A simple method of dealing with this problem:
-
Stall reads until the write buffer is empty.
-
However, this method increases the read miss penalty considerably, since with write-through the write buffer is likely to contain blocks waiting to be written.
-
An alternative is to check the write buffer for conflicts on a read miss and let the read proceed ahead of the buffered writes when there is no conflict.
-
If the read's data is found waiting in the write buffer, it can be supplied from there; in cases like this, the write buffer acts as a victim cache.
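-
A rough C model of that policy (the structure and field names are hypothetical, not from any real design): on a read miss the controller scans the write buffer, and if the address is found the data is forwarded from the buffer instead of waiting for it to drain.
```c
#include <stdbool.h>
#include <stdint.h>

#define WRITE_BUFFER_ENTRIES 8

/* Hypothetical write-buffer entry: one buffered store. */
struct wb_entry {
    bool     valid;
    uint32_t addr;
    uint32_t data;
};

static struct wb_entry write_buffer[WRITE_BUFFER_ENTRIES];

/* On a read miss, check the write buffer for the requested address.
 * If it is found, forward the buffered value (victim-cache behavior);
 * otherwise the read can safely bypass the queued writes. */
bool read_hits_write_buffer(uint32_t addr, uint32_t *data_out)
{
    for (int i = 0; i < WRITE_BUFFER_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;   /* forward from buffer */
            return true;
        }
    }
    return false;   /* no conflict: the read may go ahead of the writes */
}
```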
-
Using subblocks to reduce fetch time
-
Tags can hurt performance by occupying too much space or by slowing down caches.
-
Using large blocks reduces the amount of storage needed for tags (and makes each tag shorter), saving space on the chip.
-
This may even reduce miss rate by reducing compulsory misses.
-
However, the miss penalty for large blocks is high, since the entire block must be moved between the cache and memory.
-
The solution is to divide each block into subblocks, each of which has a valid bit.
-
The tag applies to the entire block, but only a subblock needs to be fetched on a miss.
-
Therefore, a block can no longer be defined as the minimum unit transferred between cache and memory.
-
This results in a smaller miss penalty.
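-
A rough sketch of the bookkeeping in C (field names and sizes are hypothetical): one tag covers the whole block, but each subblock has its own valid bit, so a miss fetches only the missing subblock.
```c
#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS_PER_BLOCK 4

/* Hypothetical cache line: one tag, one valid bit per subblock. */
struct cache_line {
    uint32_t tag;
    bool     valid[SUBBLOCKS_PER_BLOCK];
    /* data storage omitted */
};

/* Returns true on a hit: the tag must match AND that subblock must be
 * valid.  On a miss, only the requested subblock is fetched and marked
 * valid, keeping the miss penalty small. */
bool lookup(struct cache_line *line, uint32_t tag, unsigned subblock)
{
    if (line->tag == tag && line->valid[subblock])
        return true;                        /* hit */
    if (line->tag != tag) {
        line->tag = tag;                    /* new block: clear all valid bits */
        for (int i = 0; i < SUBBLOCKS_PER_BLOCK; i++)
            line->valid[i] = false;
    }
    /* ...fetch only the requested subblock from memory (not shown)... */
    line->valid[subblock] = true;
    return false;                           /* miss */
}
```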
-
Early restart & critical word first
-
Unlike the previous two techniques, this strategy does NOT require extra hardware.
-
It optimizes the order in which the words of a block are fetched and when the desired word is delivered to the CPU.
-
Early restart
-
With early restart, the CPU gets the requested word (and thus resumes execution) as soon as it arrives in the cache, without waiting for the rest of the block.
-
Critical word first
-
Instead of starting the fetch of a block with its first word, the cache can fetch the requested word first and then fetch the rest of the block.
-
In conjunction with early restart, this reduces the miss penalty by allowing the CPU to continue execution while most of the block is still being fetched.
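-
A small sketch of the resulting transfer order: with critical word first, the requested word returns first and the rest of the block follows in wrap-around order (the block size below is illustrative).
```c
#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Print the order in which the words of a block arrive when the CPU
 * requests word `critical` and memory returns the block wrap-around. */
void fetch_order(int critical)
{
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%d ", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");
}

int main(void)
{
    fetch_order(5);   /* prints: 5 6 7 0 1 2 3 4 */
    return 0;         /* with early restart, the CPU resumes as soon as word 5 arrives */
}
```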
-
Nonblocking caches
-
A nonblocking cache, in conjunction with out-of-order execution, can allow the CPU to continue executing instructions after a data cache miss.
-
The cache continues to supply hits while processing read misses (hit under miss).
-
The instruction needing the missed data waits for the data to arrive.
-
Complex caches can even have multiple outstanding misses (miss under miss).
-
But this greatly increases cache complexity.
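-
Outstanding misses are commonly tracked in miss status holding registers (MSHRs); a rough C sketch of that bookkeeping follows (names and sizes are hypothetical).
```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_OUTSTANDING_MISSES 4   /* limits the depth of "miss under miss" */

/* Hypothetical miss status holding register: one outstanding miss. */
struct mshr {
    bool     valid;
    uint32_t block_addr;    /* block currently being fetched from memory */
};

static struct mshr mshrs[MAX_OUTSTANDING_MISSES];

/* On a new miss, try to allocate an MSHR.  If one is free, the miss is
 * handed to memory and the cache keeps servicing hits ("hit under miss")
 * and further misses ("miss under miss").  If all are busy, the cache
 * must finally stall the CPU. */
bool allocate_mshr(uint32_t block_addr)
{
    for (int i = 0; i < MAX_OUTSTANDING_MISSES; i++) {
        if (!mshrs[i].valid) {
            mshrs[i].valid = true;
            mshrs[i].block_addr = block_addr;
            return true;    /* miss accepted; execution may continue */
        }
    }
    return false;           /* no free MSHR: stall */
}
```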
-
Second-level caches
-
This method focuses on the interface between the cache and main memory.
-
We can add a second-level cache between main memory and a small, fast first-level cache.
-
This helps satisfy the desire to make the cache fast and large.
-
The second-level cache allows:
-
The smaller first-level cache to fit on the chip with the CPU and be fast enough to service requests in one or two CPU clock cycles.
-
Hits for many memory accesses that would otherwise go to main memory, lessening the effective miss penalty.
-
Performance of a multi-level cache:
-
The performance of a two-level cache is calculated in a similar way to the performance for a single level cache.
-
So the miss penalty for level 1 is calculated using the hit time, miss rate, and miss penalty for the level 2 cache.
-
For two-level caches, there are two miss rates:
-
Global miss rate
-
The number of misses in the cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 × Miss rate_L2).
-
Local miss rate
-
The number of misses in the cache divided by the total number of memory accesses to this cache (Miss rate_L2 for the second-level cache).
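-
A small worked example in C that ties these pieces together (all numbers are illustrative, not measurements):
```c
#include <stdio.h>

int main(void)
{
    /* Illustrative parameters only. */
    double hit_time_L1  = 1.0;    /* cycles */
    double miss_rate_L1 = 0.04;   /* local = global miss rate for L1 */
    double hit_time_L2  = 10.0;   /* cycles */
    double miss_rate_L2 = 0.50;   /* local miss rate of L2 */
    double miss_pen_L2  = 100.0;  /* cycles to main memory */

    /* The miss penalty seen by L1 is the time to service the access in L2. */
    double miss_pen_L1 = hit_time_L2 + miss_rate_L2 * miss_pen_L2;   /* 60 cycles */

    /* Average memory access time: HitTime_L1 + MissRate_L1 * MissPenalty_L1 */
    double amat = hit_time_L1 + miss_rate_L1 * miss_pen_L1;          /* 3.4 cycles */

    double global_miss_rate = miss_rate_L1 * miss_rate_L2;           /* 0.02 */

    printf("L1 miss penalty  = %.1f cycles\n", miss_pen_L1);
    printf("AMAT             = %.1f cycles\n", amat);
    printf("Global miss rate = %.3f\n", global_miss_rate);
    return 0;
}
```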
-
Note that the local miss rate for L2 is high because L2 sees only the accesses that miss in L1 (rather than all memory accesses).
-
In general, the global miss rate is a more useful measure since it indicates what fraction of the memory accesses that leave the CPU go all the way to memory.
-
Desirable characteristics for an L2 cache:
-
Much larger than the L1 cache
-
Since L2 contains the same data as L1, making L2 about the same size as L1 causes it to have a high local miss rate.
-
This is true because if we miss in L1, it is likely that we will miss in L2 as well, resulting in performance that is not much better than using main memory alone.
-
Therefore, it must be much larger.
-
Higher associativity
-
The main reason for using low associativity is to keep the hit time of a small, fast cache low.
-
The L2 cache need be neither small nor that fast, so it can benefit from the higher hit rate that more blocks per set provide.
-
Larger block size
-
This has the advantage of reducing compulsory misses that must go all the way to main memory.
-
Since the L2 cache is large, the increase in conflict misses that larger blocks cause in a smaller cache is minimal.
-
Inclusion
-
If all of the data in the L1 cache is also in the L2 cache, the L2 cache has the multilevel inclusion property.
-
Most caches enforce this property since it makes cache consistency easier to manage.
-
Consistency between I/O and the caches (and between caches in a multiprocessor) can be determined by checking the second-level cache.
-
Design of L1 and L2 caches
-
Although they can be designed separately, it is often helpful to know if there is going to be an L2 cache.
-
For example, write-through in L1 is much more effective if there is an L2 writeback cache to buffer repeated writes.
-
Similarly, a direct-mapped L1 cache can work fine if the L2 cache satisfies most of the conflict misses.
-
L2 cache summary
-
In general, cache design is a trade-off between fast hits and few misses.
-
For an L1 cache, fast hits are more important.
-
For L2, hits are much less frequent (it sees only L1 misses), so reducing misses becomes more important.
-
Therefore, larger caches with higher associativity and larger blocks are beneficial in L2 caches.