The potential overlap among instructions is called ILP. It implies a lack of dependence between instructions.
We spent two weeks talking about pipelines that allow us to overlap independent instructions.
All of the techniques in this chapter will exploit parallelism among instruction sequences.
Instruction-level parallelism (ILP)
Exploiting ILP
We will discuss methods for exploiting any ILP that may exist in a program.
As we started to see last week, this can mean:
Executing instructions out of order, and
Other techniques that keep the functional units busy with useful work.
Limits to ILP
We'll also discuss things that limit the amount of ILP that a processor can exploit.
This limitation can come from either:
The processor (e.g., a limited number of functional units), or
The program (e.g., every instruction depends on the previous one).
CPI revisited:
The focus of the last chapter was on reducing RAW and control stalls. The techniques presented in this chapter to further reduce RAW and control stalls will increase the importance of dealing with WAR, WAW and structural stalls.
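Recall the usual decomposition of pipeline CPI from the last chapter; each technique below attacks one or more of these terms:

    Pipeline CPI = ideal CPI + structural stalls + RAW stalls
                 + WAR stalls + WAW stalls + control stalls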
Techniques we will investigate in this chapter:
Loop unrolling: reduces control stalls.
Basic pipeline scheduling: reduces RAW stalls.
Scoreboarding: reduces RAW stalls.
Register renaming: reduces both WAR and WAW stalls.
Dynamic branch prediction: reduces control stalls.
Issuing multiple instructions per cycle (superscalar): reduces ideal CPI by allowing more than one instruction to start each cycle.
Compiler dependence analysis: reduces ideal CPI and data stalls by having the compiler perform code scheduling.
Software pipelining and trace scheduling: reduce ideal CPI and data stalls by using previous executions to tune future executions.
Speculation: execute "possible" instructions so that, whichever way the program "really" goes, the CPU will have the proper state. This reduces all data and control stalls.
Dynamic memory disambiguation: reduces RAW stalls involving memory.
What is ILP, and where does it come from?
Basic blocks
A block of code with no branches into the code except at the start and no branches out of the code except at the end.
The average basic block is quite small.
Last chapter, we saw that the average dynamic branch frequency in integer programs was about 15%.
This means that between 6 and 7 instructions are executed between a pair of branches.
It is likely that these instructions depend on one another since the instructions tend to operate on the same data.
Therefore, we must exploit ILP across multiple basic blocks.
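For instance, in the hypothetical DLX fragment below (not from the original slides), the labeled load through the branch form one basic block, and each instruction depends on the one before it, so there is essentially no ILP inside the block:

    Target: LW    R2,0(R1)     ; block entry: branches come in only here
            ADDI  R2,R2,#1     ; RAW on R2 (needs the LW)
            SW    0(R1),R2     ; RAW on R2 (needs the ADDI)
            BNEZ  R2,Target    ; block exit: the only branch out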
Loop-level parallelism
One of the simplest and most common ways to increase parallelism is to exploit it among iterations of a loop.
In many loops, the iterations are independent: each iteration can overlap with any other iteration, even though individual iterations have few (if any) overlappable instructions. For example, a loop that adds a scalar to every element of an array has this property, since no iteration reads a value produced by another.
Many techniques exist for exploiting the ILP in loops.
Some are done statically by the compiler (loop unrolling) and some are done dynamically by the CPU.
In addition, vector processors can run simple loop operations very quickly.
Pipeline scheduling
The compiler seeks to separate a dependent instruction from the source instruction by a distance (in clock cycles) equal to the pipeline latency of the source.
To do this, the compiler must have intimate knowledge of the internal hardware workings.
This is one reason that code written for (say) the Intel 486 may not run optimally on the Pentium -- the pipeline latencies have likely changed.
Let's assume the following latencies:
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0 (forwarded to store)
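Here latency means the number of intervening cycles needed to avoid a stall. So, for example (a hypothetical back-to-back pair, not from the slides), with the FP ALU to FP ALU entry of 3, a dependent FP op issued immediately after its producer stalls for three cycles:

    ADDD  F4,F0,F2     ; FP ALU op producing F4
                       ; <3 stall cycles: no independent work to fill them>
    MULTD F6,F4,F8     ; FP ALU op consuming F4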
Given knowledge of the pipeline latencies, the compiler can reorder instructions so that stalls are avoided. For example:
Assume R1 holds the address of the array element with the highest address.
Assume F2 contains the scalar value, c.
Unscheduled DLX assembly:
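The listing itself is not reproduced in this text; the following sketch is the standard DLX form of this example, computing x[i] = x[i] + c over an array of doubles, walking the pointer in R1 down to zero:

    Loop:   LD    F0,0(R1)      ; F0 = current array element
            ADDD  F4,F0,F2      ; add the scalar c (in F2)
            SD    0(R1),F4      ; store the result
            SUBI  R1,R1,#8      ; step to the next element (8-byte doubles)
            BNEZ  R1,Loop       ; repeat until the pointer reaches 0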
Unscheduled analysis:
When we execute this code, it takes 10 cycles per iteration.
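Assuming the latency table above, plus one stall when an integer result feeds a branch and a one-cycle branch delay (assumptions consistent with the stated count), the cycles break down roughly as:

    Loop:   LD    F0,0(R1)      ; cycle 1
            (stall)             ; cycle 2  -- load double -> FP ALU: 1 stall
            ADDD  F4,F0,F2      ; cycle 3
            (stall)             ; cycle 4  -- FP ALU -> store double: 2 stalls
            (stall)             ; cycle 5
            SD    0(R1),F4      ; cycle 6
            SUBI  R1,R1,#8      ; cycle 7
            (stall)             ; cycle 8  -- BNEZ needs the SUBI result
            BNEZ  R1,Loop       ; cycle 9
            (stall)             ; cycle 10 -- branch delay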
Unscheduled/Scheduled code:
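The slide's listing is not reproduced here; a sketch of the standard scheduled version looks like this, with the SUBI hoisted into the load delay and the SD moved into the branch-delay slot:

    Loop:   LD    F0,0(R1)      ; cycle 1
            SUBI  R1,R1,#8      ; cycle 2 -- moved up to fill the load delay
            ADDD  F4,F0,F2      ; cycle 3
            (stall)             ; cycle 4 -- one FP ALU -> store cycle still exposed
            BNEZ  R1,Loop       ; cycle 5
            SD    8(R1),F4      ; cycle 6 -- delay slot; offset adjusted for the
                                ;           SUBI that already executed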
Execution time is reduced from 10 cycles to 6.
Note that the compiler modified the effective address for SD.
Pipeline scheduling & loop unrolling
We knocked the cycle count from 10 down to 6, but we really only did 3 cycles of work:
A load, an add, and a store.
The rest of the loop was overhead, i.e., the SUBI, the BNEZ, and a stall.
While it may seem we did work by updating the loop index, that's not really work, since it didn't produce any results in memory.
We need to get more operations into the loop relative to the number of branch and overhead instructions.
This is done by loop unrolling: replicating the loop body multiple times and adjusting the loop termination code, as sketched below.
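For example, unrolled four times (but not yet rescheduled), the body becomes the following sketch; note that a single SUBI and BNEZ now serve four elements:

    Loop:   LD    F0,0(R1)
            ADDD  F4,F0,F2
            SD    0(R1),F4      ; element at 0(R1)
            LD    F6,-8(R1)
            ADDD  F8,F6,F2
            SD    -8(R1),F8     ; element at -8(R1)
            LD    F10,-16(R1)
            ADDD  F12,F10,F2
            SD    -16(R1),F12   ; element at -16(R1)
            LD    F14,-24(R1)
            ADDD  F16,F14,F2
            SD    -24(R1),F16   ; element at -24(R1)
            SUBI  R1,R1,#32     ; one pointer update for four elements
            BNEZ  R1,Loop       ; one branch for four elements

Under the earlier assumptions this still stalls the same way within each copy (roughly 28 cycles for four elements, or 7 per element), but the overhead is amortized; scheduling the unrolled body, shown shortly, removes the remaining stalls.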
Loop unrolling involves creating multiple copies of the loop body.
It improves scheduling because:
It eliminates branches.
It allows instructions from different iterations to be scheduled together, exposing parallelism.
This allows the CPU to amortize the cost of updating indices across several iterations of the loop.
Loop unrolling also increases register usage.
Previous code unrolled 4 times and scheduled (assuming R1 was initially a multiple of 32).
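The slide's listing is not reproduced in this text; the standard form of the unrolled-and-scheduled loop is sketched below, with the loads hoisted to the top, the stores sunk, and the last two store offsets compensating for the hoisted SUBI:

    Loop:   LD    F0,0(R1)
            LD    F6,-8(R1)
            LD    F10,-16(R1)
            LD    F14,-24(R1)
            ADDD  F4,F0,F2
            ADDD  F8,F6,F2
            ADDD  F12,F10,F2
            ADDD  F16,F14,F2
            SD    0(R1),F4
            SD    -8(R1),F8
            SUBI  R1,R1,#32
            SD    16(R1),F12    ; 16 - 32 = -16
            BNEZ  R1,Loop
            SD    8(R1),F16     ; 8 - 32 = -24 (branch-delay slot)

Under the same assumptions this runs with no stalls: 14 cycles for four elements, or 3.5 cycles per element, versus 6 for the scheduled-but-not-unrolled loop.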
Note that displacement addressing mode makes it possible to keep the overhead very low.
In real code, we don't know the upper bound on the loop, i.e., how many times it will be executed. Suppose it is n, and we want to unroll the loop k times.
We create two consecutive loops (see the sketch after this list):
The first one contains the original code and executes (n mod k) times.
The second one contains the unrolled body and executes n/k times.
If the number of iterations is large, this method saves the CPU lots of time at the cost of larger code size.
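A sketch of that structure in DLX, under assumed conditions: the element count n is in R2, k = 4 (a power of two, so n mod k is a simple mask), the n < k edge case is ignored for brevity, and the labels and registers are hypothetical:

            ANDI  R3,R2,#3       ; R3 = n mod 4
            BEQZ  R3,Unrolled    ; nothing left over? skip the cleanup loop
    Clean:  LD    F0,0(R1)       ; first loop: the original body...
            ADDD  F4,F0,F2
            SD    0(R1),F4
            SUBI  R1,R1,#8
            SUBI  R3,R3,#1
            BNEZ  R3,Clean       ; ...executed (n mod 4) times
    Unrolled:
            ...                  ; second loop: the 4-way unrolled body shown
                                 ; earlier, executed n/4 times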
Although it was easy for us to recognize these transformations, programming the compiler to recognize them while guaranteeing that the transformed code is correct is not trivial.