Getting at Even More Parallelism

Section 4.5 describes techniques such as loop unrolling, software pipelining, and trace scheduling that the compiler can use to uncover ILP.

This works as long as the behavior of branches is fairly predictable.

We'll now cover methods of increasing the ILP in programs whose branches are harder to predict.

These methods include conditional execution and speculative execution.

Conditional instructions

A conditional instruction refers to a condition that is evaluated as part of the instruction's execution.

Rather than use a branch to skip a single instruction, the CPU always executes the instruction but writes the result only if the condition is met.
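
The difference between the two forms can be sketched in a few lines of Python. This is only an illustration: the register names (R1, R2, R3), the mnemonics in the comments, and the helper functions are assumptions made for the sketch, not code from the original slides. Both functions implement "if R1 == 0 then R2 = R3" and return the new register state together with the number of instructions executed.

```python
def branch_version(regs):
    """Branch form (hypothetical sequence): BNEZ R1, skip / MOV R2, R3."""
    regs = dict(regs)
    executed = 1                      # the branch itself always executes
    if regs["R1"] == 0:               # branch not taken: fall into the move
        regs["R2"] = regs["R3"]
        executed += 1
    return regs, executed


def cmovz(regs, dest, src, cond):
    """Conditional move: always executes, writes dest only if regs[cond] == 0."""
    regs = dict(regs)
    value = regs[src]                 # operand is read regardless of the condition
    if regs[cond] == 0:               # the condition gates only the write-back
        regs[dest] = value
    return regs, 1                    # one instruction, no branch
```

When the condition holds, the branch form executes two instructions and the conditional form one; in both cases the conditional form never changes the instruction stream.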

Eliminating the branch gives two benefits:
The branch is not executed, reducing the instruction count by 1.
The branch delay is avoided.

A conditional instruction changes a control dependence into a data dependence.

This is beneficial because, in an integer pipeline, data dependences rarely cause stalls while control hazards do cause stalls.

Other benefits:

Conditional instructions help a lot with superscalar machines because such machines suffer even more from branch stalls.

This is true because conditional instructions can be scheduled as normal instructions.

Branches, on the other hand, often cannot be scheduled this way because they may cause a change in the instruction stream.

Scheduling conditional instructions freely allows more slots in a superscalar machine to be filled.

Conditional instructions are of even greater benefit in this respect on a VLIW machine.

Exceptions:

We must ensure a speculated instruction does not introduce an exception.

The instruction must have NO effect if the condition is not satisfied.

For example:

If R10 contains zero, then it is likely that the LW instruction will cause a protection violation if allowed to execute.

In DLX, memory accesses are not started until MEM.

Therefore, it is easy to evaluate the condition early (i.e., during EX) and prevent the access from happening in this case.
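
A minimal sketch of this ordering, assuming a conditional load whose condition is "the condition register is nonzero". The function names, the dictionary-based memory, and the `ProtectionViolation` class are hypothetical stand-ins for the hardware; the point is only that the condition is checked (EX) before any memory access is attempted (MEM).

```python
class ProtectionViolation(Exception):
    """Stands in for a terminating memory exception."""


def load_word(memory, address):
    # MEM stage: an access outside the mapped addresses faults.
    if address not in memory:
        raise ProtectionViolation(hex(address))
    return memory[address]


def conditional_load(regs, memory, dest, offset, base, cond):
    """Load regs[dest] from memory[regs[base] + offset] only if regs[cond] != 0."""
    # EX stage: compute the effective address and evaluate the condition.
    address = regs[base] + offset
    if regs[cond] == 0:
        return False                  # annulled: no access, no exception, no write
    # MEM stage is reached only when the condition is satisfied.
    regs[dest] = load_word(memory, address)
    return True
```

With the condition register equal to zero the load has no effect at all, which is exactly the requirement stated above.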

Limits to conditional instructions:
Executing them takes time.

A conditional instruction always requires time, even if the instruction is annulled.

Moving an instruction across a branch is essentially speculating on the outcome of the branch.

This may slow down a program if an instruction is executed but turned into a no-op, since another instruction could have executed in that slot.

Conditional instructions are always a win when the cycle that they occupy would have been idle anyway.

Trading a branch and move for a conditional move is usually a win.

Longer sequences may not be.
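
A rough cycle-count model makes the trade-off concrete. It is a sketch under strong assumptions that are not in the original notes: a single-issue machine, one cycle per instruction, a fixed branch penalty paid on every branch, and a guarded block of `block_len` instructions that executes with probability `p_execute`. All names and numbers are hypothetical.

```python
def branch_version_cycles(block_len, p_execute, branch_penalty):
    """Expected cycles when a branch guards the block."""
    # One cycle for the branch, the penalty, and the block only when it runs.
    return 1 + branch_penalty + p_execute * block_len


def predicated_version_cycles(block_len):
    """Cycles when every instruction in the block is made conditional."""
    # Every instruction occupies a slot, even when it is annulled.
    return block_len
```

Under this model a single conditional move never costs more than the branch-and-move pair, while a long block that rarely executes is cheaper to skip with a branch.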

The condition must be evaluated early.

As noted above, the condition must be known before the processor's state is changed, and the earlier the better.

Conditional instructions are difficult for multiple conditions.

These instructions work well for avoiding single branches.

However, the task is more difficult for two or more branch options since it requires additional instructions to compute the logical combination of the conditions.

Conditional instructions may impose a speed penalty.

This can be in one of two forms:

The cycle time for the entire CPU can be increased, or
A conditional instruction may take more clock cycles to execute than a non-conditional instruction.

Compiler-Directed Speculative Execution with Hardware Support

Conditional instructions are effective in eliminating control dependencies for small if-then blocks, as we have seen.

However, a more significant performance gain can be attained by moving larger blocks of code across (before) branches (see trace scheduling in Section 4.5).

This can create problems in two areas:

Registers that should not be modified (because of the branch) are modified anyway.
As with conditional instructions, exceptions that should not occur become possible.

Note that resumable exceptions (e.g., page faults) are not a problem if they occur in speculative code.

They may cause performance to suffer a bit, but correct programs are not terminated.

Non-resumable (terminating) exceptions are a problem and must be handled.

Three schemes for supporting more ambitious speculation without introducing erroneous exception behavior have been investigated:
Ignore exceptions

The simplest method for speculation is for the CPU and OS to ignore non-resumable exceptions for speculative instructions.

Rather than terminate the program, they return an undefined value for the instruction causing the exception.

If the exception-generating instruction was not speculative, the program is in error but it is allowed to continue!

But it will probably generate incorrect results.

If the exception-generating instruction was speculative, the speculative result will not be used and the program will run properly.

Either way, a correct program is not terminated improperly.
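
The behavior can be sketched as follows. This is an illustration only: the `UNDEFINED` sentinel, the dictionary-based memory, and both function names are assumptions made for the sketch. A faulting load returns an undefined value instead of terminating the program, and the hoisted (speculative) load's result is used only when the branch condition says it should be.

```python
UNDEFINED = object()   # stands in for "whatever value the hardware returns"


def load_ignoring_faults(memory, address):
    """A load under the 'ignore exceptions' scheme: a fault yields UNDEFINED."""
    return memory.get(address, UNDEFINED)


def speculated_if_then(memory, pointer_valid, address, default):
    """Models 'if pointer_valid: x = load(address)' with the load hoisted."""
    speculative_value = load_ignoring_faults(memory, address)  # moved above the branch
    if pointer_valid:
        return speculative_value      # result is used: load was needed
    return default                    # result is dropped: fault had no effect
```

A correct program (one that never uses the load when the pointer is invalid) gets the right answer; an incorrect program silently receives the undefined value instead of being terminated.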

Poison bits

Each register has a "poison bit" attached to it.

If a speculative instruction causes an exception, the exception is handled by setting the poison bit of its destination register.

If another speculative instruction uses a poisoned register as a source operand, its destination register poison bit is also set.

If a non-speculative instruction uses a poisoned register, an exception is generated.

It may, however, write to a poisoned register.
If this occurs, the poison bit is cleared.

This method generates exceptions for incorrect programs (at about the right place).
The complication is that the OS must be able to save, restore, and reset the poison bits, which requires special instructions.
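
The poison-bit rules above can be sketched as a small register-file model. The class and method names are hypothetical, and a Python `ZeroDivisionError` is used as a stand-in for any terminating exception; saving and restoring poison bits by the OS is not modeled.

```python
class PoisonedRegisterUse(Exception):
    """Raised when a non-speculative instruction reads a poisoned register."""


class RegisterFile:
    def __init__(self):
        self.value = {}    # register name -> value
        self.poison = {}   # register name -> poison bit

    def execute(self, dest, operation, sources=(), speculative=False):
        poisoned = [s for s in sources if self.poison.get(s, False)]
        if poisoned:
            if not speculative:
                # Non-speculative use of a poisoned register: raise the exception.
                raise PoisonedRegisterUse(poisoned[0])
            self.poison[dest] = True          # speculative use: propagate poison
            return
        try:
            result = operation(*[self.value[s] for s in sources])
        except ZeroDivisionError:             # stand-in for a terminating exception
            if not speculative:
                raise
            self.poison[dest] = True          # speculative fault: poison the result
            return
        self.value[dest] = result
        self.poison[dest] = False             # a successful write clears the bit
```

For example, a speculative divide by zero poisons its destination, a later speculative use propagates the poison, and only a non-speculative use finally raises the exception.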

Speculative instructions with renaming (buffering results).

Note that we had to introduce register copies in the previous schemes.

This approach (called boosting) provides renaming and buffering in the hardware (similar to Tomasulo's algorithm).

A boosted instruction is executed speculatively based on a branch.

Its results are forwarded to and used by other boosted instructions.

When the branch is reached, if the prediction is correct, the results are committed to the register file; otherwise, the buffered results are discarded.

Therefore, instructions that are control dependent on a branch can be executed before the branch.
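
A minimal sketch of the buffering idea, assuming a single level of boosting (one unresolved branch at a time). The `BoostBuffer` class and its method names are hypothetical; real hardware would use shadow registers rather than dictionaries.

```python
class BoostBuffer:
    """Holds results of boosted instructions until their branch resolves."""

    def __init__(self, regs):
        self.regs = regs      # architectural register file
        self.shadow = {}      # buffered (uncommitted) boosted results

    def boosted_read(self, name):
        # Boosted instructions see earlier boosted results first (forwarding).
        if name in self.shadow:
            return self.shadow[name]
        return self.regs[name]

    def boosted_write(self, dest, value):
        self.shadow[dest] = value     # the architectural register is untouched

    def resolve_branch(self, prediction_correct):
        if prediction_correct:
            self.regs.update(self.shadow)   # commit buffered results
        self.shadow.clear()                 # on a misprediction they are discarded
```

Non-boosted instructions read `regs` directly, so they never observe a boosted result before the branch has confirmed it.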

Hardware-Based Speculation

We now examine the combination of speculative execution and dynamic scheduling based on Tomasulo's algorithm.

We focus on floating-point operations but a similar structure can handle integer operations.

In order to support speculation, a change is necessary to Tomasulo's approach:

We must separate the process of completing execution and the bypassing of results among instructions from instruction commit (reg file or memory update).

This allows other (speculative) instructions to execute, but no results are committed until we know the instruction is no longer speculative.

We will allow instructions to execute out of order but force them to commit in order, which helps with handling exceptions properly.

A set of hardware buffers (reorder buffers) will be used to hold the results of instructions that have finished execution but have not committed.

The reorder buffer provides additional virtual registers and is a source of operands for instructions.

An additional step (commit) is added to Tomasulo's algorithm; the four steps are as follows:
Issue

Get a floating-point instruction, and issue it if there is a reservation station open and an empty slot in the reorder buffer.

Send the number of the reorder buffer assigned for the result to the reservation station so it can be used to tag the result.

Execute

Monitor the CDB while waiting for source registers to be ready.

When both operands are available, perform the operation.

Write result

Write the result on the CDB with the reorder buffer tag.

The result is stored into the reorder buffer as well as into any reservation stations waiting for the result.

The reorder buffer can also serve as a source register for operands similar to the CDB.

Commit

When the instruction reaches the head of the reorder buffer and its result is present in the buffer, update the register with the result (or write memory).

When an incorrectly predicted branch reaches the head of the reorder buffer, flush the reorder buffer and restart execution at the correct successor of the branch.

If the branch was correctly predicted, do nothing.
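
The issue, write-result, and commit steps can be sketched with a small reorder-buffer model. This is a simplified illustration under stated assumptions: reservation stations, the CDB, and stores to memory are not modeled, at most one instruction commits per call, and the class, field, and method names are all hypothetical.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()    # oldest instruction at the left (the head)
        self.next_tag = 0

    def issue(self, dest=None, is_branch=False):
        """Allocate an entry; return its tag, or None if the buffer is full."""
        if len(self.entries) >= self.size:
            return None           # no empty slot: issue must stall
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "is_branch": is_branch,
                             "done": False, "value": None, "mispredicted": False})
        return tag

    def write_result(self, tag, value=None, mispredicted=False):
        """Record a finished result under its tag (may happen out of order)."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry.update(done=True, value=value, mispredicted=mispredicted)
                return
        # The entry was flushed earlier: the result is simply dropped.

    def commit(self, regs):
        """Retire the head entry if it has finished; return its tag or None."""
        if not self.entries or not self.entries[0]["done"]:
            return None
        head = self.entries.popleft()
        if head["is_branch"]:
            if head["mispredicted"]:
                self.entries.clear()          # flush all speculative work
        elif head["dest"] is not None:
            regs[head["dest"]] = head["value"]   # in-order register update
        return head["tag"]
```

An instruction behind an unresolved branch may finish first, but its result stays in the buffer; if the branch turns out to be mispredicted, the flush removes it before it can reach the register file.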

This scheme has several advantages over dynamic scheduling alone.

First, instructions can "finish" out of order as long as they are not committed.

This means that the CPU can keep precise interrupts even while executing out of order since changes are committed in order.

Second, this method allows the CPU to speculatively execute instructions past a branch (but before the branch is executed), subsequently cancelling them if the branch is mispredicted.

Exception handling

Exceptions in this model are handled just before the instruction is ready to commit.

At that time, all previous instructions have committed and all later instructions have not committed.

Thus, the CPU can take a precise exception even if execution occurs out of order.
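
A short sketch of why in-order commit gives precise exceptions. It assumes every entry has already finished executing and that an exception is simply recorded in the entry until commit time; the function name and the dictionary fields are hypothetical.

```python
def commit_in_order(rob, regs):
    """Commit entries in program order until one carries an exception.

    rob is a list of dicts with 'dest', 'value', and 'exception' keys,
    oldest instruction first. Returns (index, exception) of the faulting
    entry, or (None, None) if everything committed.
    """
    for index, entry in enumerate(rob):
        if entry["exception"] is not None:
            # All earlier entries have committed; this one and all later
            # ones have not touched the register file.
            return index, entry["exception"]
        regs[entry["dest"]] = entry["value"]
    return None, None
```

Even if a later instruction finished executing before the faulting one, its result never reaches `regs`, so the state seen by the exception handler matches sequential execution.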

Speculation in multiple-issue CPUs:

The techniques that work in single-issue CPUs work in multiple-issue CPUs as well.

In fact, they may be more useful in such processors because of the longer delays and the greater need for speculation to fill empty slots.

Things to remember about CPU design

Lower CPI is not always faster

If the lower CPI comes at the expense of a longer clock cycle, it may slow the processor down.

This trade-off is almost invariably present, since lowering CPI using hardware means implementing more sophisticated techniques, which increase clock cycle time.

Designers are nevertheless inclined to focus on CPI because:

Simulation tools to evaluate the impact of enhancements that affect CPI are more readily available than tools to evaluate the impact on clock cycle time.

This is true largely because an accurate analysis of the impact on clock rate is not possible until the design is well underway.
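
The point that a lower CPI can still lose to a longer clock cycle follows from the usual relation CPU time = instruction count x CPI x clock cycle time. The numbers below are hypothetical, chosen only to illustrate the effect; they do not describe any real processor.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


# Hypothetical designs running the same one-million-instruction program.
baseline = cpu_time_ns(1_000_000, cpi=1.0, cycle_time_ns=10.0)
enhanced = cpu_time_ns(1_000_000, cpi=0.8, cycle_time_ns=13.0)

# The "enhanced" design has a 20% lower CPI but a 30% longer cycle,
# so it ends up slower overall.
slower = enhanced > baseline
```

Here the CPI improvement is more than cancelled by the longer cycle, so the enhanced design takes longer to run the program.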

Improve all parts of a multiple-issue CPU, not just one

As with uniprocessors, improving one aspect of a CPU does not help unless it was the bottleneck from the beginning.

For example, improving FP latency for a multiple-issue CPU does not help much unless something is done about branching.

Speculative execution is great but is of limited benefit unless there are additional registers to use (either implicitly or under compiler control).