-
Handling exceptions.
-
Exceptions are difficult because instructions may now finish out of order.
-
Consider the following sequence:
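A representative sequence (the register operands are illustrative, chosen so that ADDF overwrites one of its own source registers):

    DIVF  F0, F2, F4       # long-latency divide, issued first
    ADDF  F10, F10, F8     # short operation; note it overwrites its source F10
    SUBF  F12, F12, F14    # short operation, independent of the divide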
-
In this example, ADDF and SUBF are expected to complete before DIVF.
-
Out-of-order completion.
-
Suppose SUBF caused an arithmetic exception at a point where ADDF has completed but DIVF has not.
-
The result is an imprecise exception. The fix here is to let the pipeline drain.
-
Worse, suppose DIVF had an exception after ADDF completed.
-
Since ADDF destroys one of its operands, we cannot restore the state to what it was before the DIVF instruction, even with software!
-
Handling exceptions, first solution (of four):
-
Ignore the problem (imprecise exceptions):
-
This may be fast and easy, but it's difficult to debug programs without precise exceptions.
-
Many modern CPUs (e.g., the DEC Alpha 21064, IBM Power-1, and MIPS R8000) provide a precise mode that allows only a single outstanding FP instruction at any time.
-
This mode is much slower than the imprecise mode, but it makes debugging possible.
-
Handling exceptions, second solution:
-
Buffer the results and delay commitment:
-
In this case, the CPU doesn't actually make any state (register or memory) changes until the instruction is guaranteed to finish.
-
This becomes difficult when the difference in running time among operations is large.
-
Lots of intermediate results have to be buffered (and forwarded, if necessary).
-
Handling exceptions, variations of the second solution:
-
History file:
-
This technique saves the original values of the registers that have been changed recently.
-
If an exception occurs, the original values can be retrieved from this cache (see the sketch below).
-
Note that the file has to have enough entries to hold one register modification per cycle for the duration of the longest-running instruction.
-
This is similar to the solution used on the VAX for autoincrement and autodecrement addressing.
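-
As a sketch of the history file on the earlier sequence (operands still illustrative), the entries it would record are:

    DIVF  F0, F2, F4       # old value of F0 saved when DIVF writes back
    ADDF  F10, F10, F8     # old F10 (the overwritten operand) saved first
    SUBF  F12, F12, F14    # old F12 saved first
    # If DIVF later raises an exception, the saved copies of F10 and F12 are
    # written back, restoring the state that existed before DIVF and making
    # the exception precise.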
-
Handling exceptions, variations of the second solution:
-
Future file:
-
This method stores the newer values of the registers (see the sketch below).
-
When all earlier instructions have completed, the main register file is updated from the future file.
-
On an exception, the main register file has the precise values for the interrupted state.
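-
A corresponding sketch of the future file on the same illustrative sequence:

    DIVF  F0, F2, F4       # result held in the future-file copy of F0
    ADDF  F10, F10, F8     # result written only to the future-file copy of F10
    SUBF  F12, F12, F14    # result written only to the future-file copy of F12
    # The main register file is updated from the future file only once all
    # earlier instructions have completed, so an exception in DIVF leaves the
    # architectural F10 and F12 untouched.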
-
Handling exceptions, third solution:
-
Keep enough information for the trap handler to create a precise sequence for the exception:
-
The instructions in the pipeline and the corresponding PCs must be saved.
-
After the exception, the software finishes any instructions that precede the latest instruction completed.
-
This technique is used in the SPARC architecture.
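-
An illustrative picture of the saved state, reusing the earlier scenario in which SUBF traps after ADDF has completed but DIVF has not:

    # saved PC of DIVF   - issued earlier, not yet complete
    # saved PC of ADDF   - complete
    # saved PC of SUBF   - raised the exception
    # The trap handler finishes DIVF in software (it precedes ADDF, the latest
    # completed instruction); every instruction before SUBF is then complete,
    # so the exception can be handled precisely.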
-
Handling exceptions, fourth solution:
-
Allow instruction issue only if it is known that all previous instructions will complete without causing an exception.
-
The floating-point functional units must determine, early in the EX stage (within the first couple of clock cycles), whether an exception is possible, in order to prevent the following instructions from completing.
-
Sometimes this requires stalling the pipeline in order to maintain precise interrupts.
-
The R4000 and Pentium solution.
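-
A sketch of the resulting behavior (operands illustrative): a later instruction that changes state is held at issue until the earlier FP operation is known to be safe.

    DIVF  F0, F2, F4       # the divide unit decides early in EX whether an
                           # exception is possible
    SW    R2, 0(R3)        # the store is not allowed to issue until the
                           # divide is known to be exception-free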
-
Avoid variable instruction lengths and running times whenever possible:
-
Variable-length instructions complicate hazard detection and precise exception handling.
-
Sometimes it is worth it because of the performance advantages, e.g., caches.
-
Caches cause instruction running times to vary when they miss.
-
Many times, the added complexity is dealt with by freezing the pipeline.
-
Avoid sophisticated addressing modes:
-
Addressing modes that update registers (e.g., autoincrement) complicate exceptions and hazard detection.
-
They also make it harder to restart instructions.
-
Allowing addressing modes with multiple memory accesses also complicates pipelining.
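-
For contrast, a minimal sketch of an autoincrement access and its load/store equivalent (VAX-style and MIPS-style syntax, respectively; purely illustrative):

    # Autoincrement mode: one instruction both loads and updates R2
    #   MOVL  (R2)+, R3
    # Load/store equivalent: the register update is a separate instruction,
    # which is easy to squash or restart after an exception
    LW    R3, 0(R2)
    ADDI  R2, R2, 4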
-
Don't allow self-modifying code:
-
Since it is possible that the instruction being modified is already in the pipeline, the address being written must constantly be checked.
-
If a match is found, then the pipeline must be flushed or the instruction updated!
-
Even if it's not in the pipeline, it could be in the instruction cache.
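-
A minimal sketch of the hazard (the label and registers are made up):

    LA    R1, patch        # address of an instruction (assembler pseudo-op)
    SW    R2, 0(R1)        # overwrites the instruction at 'patch'
    ...
    patch:
    ADD   R3, R4, R5       # may already have been fetched into the pipeline
                           # or the I-cache before the store completes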
-
Avoid implicitly setting condition codes (CCs) in instructions:
-
This makes it harder to avoid control hazards since it's impossible to determine if CCs are set on purpose or as a side effect.
-
For implementations that set the CC almost unconditionally:
-
This makes instruction reordering difficult, since it is hard to find instructions that can be scheduled between the condition evaluation and the branch.
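-
A sketch in a CC-style instruction set (illustrative mnemonics), showing why there is little to schedule between the compare and the branch when nearly every instruction sets the CC:

    CMP   R1, R2           # sets the condition code used by the branch
    # an unrelated ADD cannot be moved into this gap, because it would also
    # set the CC as a side effect and destroy the compare result
    BEQ   taken_path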
-
The MIPS R4000 pipeline has eight stages:
-
IF: First half of instruction fetch; PC selection occurs and the cache access is initiated.
-
IS: Second half of instruction fetch. This allows the cache access to take two cycles.
-
RF: Decode and register fetch, hazard checking, and I-cache hit detection.
-
EX: Execution (address calculation, ALU operations, branch target calculation, and condition evaluation).
-
DF/DS/TC: Data is fetched from the cache in the first two cycles (DF and DS); the third cycle (TC) performs the tag check to determine whether the cache access was a hit.
-
WB: Write back the result for loads and register-register operations.
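-
The stages overlap in the usual way; an illustrative timing chart:

    cycle:       1   2   3   4   5   6   7   8   9   10
    instr i      IF  IS  RF  EX  DF  DS  TC  WB
    instr i+1        IF  IS  RF  EX  DF  DS  TC  WB
    instr i+2            IF  IS  RF  EX  DF  DS  TC  WB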
-
Possible stalls and delays:
-
Load delay: two cycles.
-
The delay might seem to be three cycles, since the tag isn't checked until the end of the TC cycle.
-
However, the loaded value is forwarded at the end of DS on the assumption of a hit; if TC then indicates a miss, the data must be fetched from memory and the pipeline is backed up to get the real value.
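-
An illustrative timing for a load followed immediately by a dependent use (assuming the loaded value is forwarded at the end of DS):

    LW   R2, 0(R3)     # IF  IS  RF  EX  DF  DS  TC  WB
    ADD  R4, R2, R5    #     IF  IS  RF  --  --  EX  ...   (two stall cycles)
    # Two independent instructions scheduled between the load and the use
    # would hide the delay completely.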
-
Branch delay: three cycles (including one branch delay slot).
-
The branch is resolved during EX, giving a three-cycle delay.
-
The first cycle may be a regular branch delay slot (instruction always executed) or a branch-likely slot (instruction cancelled if the branch is not taken).
-
MIPS uses a predict-not-taken method, presumably because it requires the least hardware.
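-
An illustrative taken branch (register and label are made up):

    BEQZ  R1, target     # condition resolved in EX
    ADD   R4, R5, R6     # delay slot: executed whether or not the branch is taken
    # The next two instructions are fetched under predict-not-taken and are
    # squashed if the branch turns out to be taken, which accounts for the
    # three-cycle branch delay.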
-
Effects of longer pipeline:
-
In addition to the longer (and possibly more frequent) stalls just mentioned, the longer pipeline requires additional forwarding hardware.
-
It also requires more complex hazard detection to find dependencies in the additional stages.
-
Benefits of the longer pipeline:
-
The major benefit of a longer pipeline is that each stage can be shorter.
-
This means that the clock cycle can be shorter, allowing more instructions to be issued in a fixed time.
-
Of course, the added stalls might eat up this benefit, but the hope is that at least some speedup will be left.
-
Performance issues (integer only)
-
The ideal CPI for the pipelined CPU is 1.
-
Branch stalls are the biggest contributor to stalls.
-
Load stalls contribute very little.
-
This is probably because the compiler can usually reorganize code to avoid stalling on loads.
-
Since load latency is two cycles, though, the job is harder than it might be on processors with a single-cycle latency.
-
Pipelining is a good way to improve performance
-
These days, pipelining is one of the best ways to improve performance.
-
It allows the CPU to issue one instruction per cycle even when finishing an instruction takes many cycles.
-
This has been the major factor allowing consumer-level microprocessors to run at 150 MHz or higher.
-
In the next few weeks, we'll cover more ways to squeeze performance out of the CPU.