The potential overlap among instructions is called ILP. It implies a lack of dependence between instructions.
We spent two weeks talking about pipelines that allow us to overlap independent instructions.
All of the techniques in this chapter will exploit parallelism among instruction sequences.
Instruction-level parallelism (ILP)
Exploiting ILP
We will discuss methods for exploiting any ILP that may exist in a program.
As we started to see last week, this can mean:
Executing instructions out of order, and
Other techniques that keep the functional units busy with useful work.
Limits to ILP
We'll also discuss things that limit the amount of ILP that a processor can exploit.
This limitation can come from either:
The processor (e.g., a limited number of functional units), or
The program (e.g., every instruction depends on the previous one).
CPI revisited:
The focus of the last chapter was on reducing RAW and control stalls. The techniques presented in this chapter to further reduce RAW and control stalls will increase the importance of dealing with WAR, WAW and structural stalls.
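Recall the usual decomposition of pipeline CPI from the last chapter; each technique below attacks one or more of these terms:

    Pipeline CPI = ideal CPI + structural stalls + RAW stalls
                 + WAR stalls + WAW stalls + control stalls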
Techniques we will investigate in this chapter:
Loop unrolling: reduces control stalls.
Basic pipeline scheduling: reduces RAW stalls.
Scoreboarding: reduces RAW stalls.
Register renaming: reduces both WAR and WAW stalls.
Dynamic branch prediction: reduces control stalls.
Issuing multiple instructions per cycle (superscalar): reduces ideal CPI by allowing more than one instruction to start each cycle.
Compiler dependence analysis: reduces ideal CPI and data stalls by having the compiler perform code scheduling.
Software pipelining and trace scheduling: reduce ideal CPI and data stalls by using previous executions to tune future executions.
Speculation: execute "possible" instructions so that, whichever way the program "really" goes, the CPU will have the proper state. This reduces all data and control stalls.
Dynamic memory disambiguation: reduces RAW stalls involving memory.
What is ILP, and where does it come from?
Basic blocks
A block of code with no branches into the code except at the start and no branches out of the code except at the end.
The average basic block is quite small.
Last chapter, we saw that the average dynamic branch frequency in integer programs was about 15%.
This means that between 6 and 7 instructions are executed between a pair of branches.
It is likely that these instructions depend on one another since the instructions tend to operate on the same data.
Therefore, we must exploit ILP across multiple basic blocks.
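For instance, in the hypothetical DLX fragment below (not from the original slides), the labeled load through the branch form one basic block, and each instruction depends on the one before it, so there is essentially no ILP inside the block:

    Target: LW    R2,0(R1)     ; block entry: branches come in only here
            ADDI  R2,R2,#1     ; RAW on R2 (needs the LW)
            SW    0(R1),R2     ; RAW on R2 (needs the ADDI)
            BNEZ  R2,Target    ; block exit: the only branch out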
Loop-level parallelism
One of the simplest and most common ways to increase parallelism is to exploit it among iterations of a loop.
In many loops, the iterations are independent: each iteration can overlap with any other iteration, even though individual iterations have few (if any) overlappable instructions. For example, a loop that adds a scalar to every element of an array has this property, since no iteration reads a value produced by another.
Many techniques exist for exploiting the ILP in loops.
Some are done statically by the compiler (loop unrolling) and some are done dynamically by the CPU.
In addition, vector processors can run simple loop operations very quickly.
Pipeline scheduling
The compiler seeks to separate a dependent instruction from the source instruction by a distance (in clock cycles) equal to the pipeline latency of the source.
To do this, the compiler must have intimate knowledge of the internal hardware workings.
This is one reason that code written for (say) the Intel 486 may not run optimally on the Pentium -- the pipeline latencies have likely changed.
Let's assume the following latencies:
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0 (forwarded to store)
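Here latency means the number of intervening cycles needed to avoid a stall. So, for example (a hypothetical back-to-back pair, not from the slides), with the FP ALU to FP ALU entry of 3, a dependent FP op issued immediately after its producer stalls for three cycles:

    ADDD  F4,F0,F2     ; FP ALU op producing F4
                       ; <3 stall cycles: no independent work to fill them>
    MULTD F6,F4,F8     ; FP ALU op consuming F4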
Given knowledge of the pipeline latencies, the compiler can reorder instructions so that stalls are avoided. For example:
Assume R1 holds the address of the array element with the highest address.
Assume F2 contains the scalar value, c.
Unscheduled DLX assembly:
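The listing itself is not reproduced in this text; the following sketch is the standard DLX form of this example, computing x[i] = x[i] + c over an array of doubles, walking the pointer in R1 down to zero:

    Loop:   LD    F0,0(R1)      ; F0 = current array element
            ADDD  F4,F0,F2      ; add the scalar c (in F2)
            SD    0(R1),F4      ; store the result
            SUBI  R1,R1,#8      ; step to the next element (8-byte doubles)
            BNEZ  R1,Loop       ; repeat until the pointer reaches 0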
Unscheduled analysis:
When we execute this code, it takes 10 cycles per iteration.
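Assuming the latency table above, plus one stall when an integer result feeds a branch and a one-cycle branch delay (assumptions consistent with the stated count), the cycles break down roughly as:

    Loop:   LD    F0,0(R1)      ; cycle 1
            (stall)             ; cycle 2  -- load double -> FP ALU: 1 stall
            ADDD  F4,F0,F2      ; cycle 3
            (stall)             ; cycle 4  -- FP ALU -> store double: 2 stalls
            (stall)             ; cycle 5
            SD    0(R1),F4      ; cycle 6
            SUBI  R1,R1,#8      ; cycle 7
            (stall)             ; cycle 8  -- BNEZ needs the SUBI result
            BNEZ  R1,Loop       ; cycle 9
            (stall)             ; cycle 10 -- branch delay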
Unscheduled/Scheduled code:
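The slide's listing is not reproduced here; a sketch of the standard scheduled version looks like this, with the SUBI hoisted into the load delay and the SD moved into the branch-delay slot:

    Loop:   LD    F0,0(R1)      ; cycle 1
            SUBI  R1,R1,#8      ; cycle 2 -- moved up to fill the load delay
            ADDD  F4,F0,F2      ; cycle 3
            (stall)             ; cycle 4 -- one FP ALU -> store cycle still exposed
            BNEZ  R1,Loop       ; cycle 5
            SD    8(R1),F4      ; cycle 6 -- delay slot; offset adjusted for the
                                ;           SUBI that already executed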
Execution time is reduced from 10 cycles to 6.
Note that the compiler modified the effective address for SD.
Pipeline scheduling & loop unrolling
We knocked the cycle count from 10 down to 6, but we really only did 3 cycles of work:
A load, an add, and a store.
The rest of the loop was overhead, i.e., the SUBI, the BNEZ, and a stall.
While it may seem we did work by updating the loop index, that's not really work, since it didn't produce any results in memory.
We need to get more operations into the loop relative to the number of branch and overhead instructions.
This is done by loop unrolling: replicating the loop body multiple times and adjusting the loop termination code, as sketched below.
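For example, unrolled four times (but not yet rescheduled), the body becomes the following sketch; note that a single SUBI and BNEZ now serve four elements:

    Loop:   LD    F0,0(R1)
            ADDD  F4,F0,F2
            SD    0(R1),F4      ; element at 0(R1)
            LD    F6,-8(R1)
            ADDD  F8,F6,F2
            SD    -8(R1),F8     ; element at -8(R1)
            LD    F10,-16(R1)
            ADDD  F12,F10,F2
            SD    -16(R1),F12   ; element at -16(R1)
            LD    F14,-24(R1)
            ADDD  F16,F14,F2
            SD    -24(R1),F16   ; element at -24(R1)
            SUBI  R1,R1,#32     ; one pointer update for four elements
            BNEZ  R1,Loop       ; one branch for four elements

Under the earlier assumptions this still stalls the same way within each copy (roughly 28 cycles for four elements, or 7 per element), but the overhead is amortized; scheduling the unrolled body, shown shortly, removes the remaining stalls.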
Loop unrolling involves creating multiple copies of the loop body.
It improves scheduling because:
It eliminates branches.
It allows instructions from different iterations to be scheduled together, exposing parallelism.
This allows the CPU to amortize the cost of updating indices across several iterations of the loop.
Loop unrolling also increases register usage.
Previous code unrolled 4 times and scheduled (assuming R1 was initially a multiple of 32).
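The slide's listing is not reproduced in this text; the standard form of the unrolled-and-scheduled loop is sketched below, with the loads hoisted to the top, the stores sunk, and the last two store offsets compensating for the hoisted SUBI:

    Loop:   LD    F0,0(R1)
            LD    F6,-8(R1)
            LD    F10,-16(R1)
            LD    F14,-24(R1)
            ADDD  F4,F0,F2
            ADDD  F8,F6,F2
            ADDD  F12,F10,F2
            ADDD  F16,F14,F2
            SD    0(R1),F4
            SD    -8(R1),F8
            SUBI  R1,R1,#32
            SD    16(R1),F12    ; 16 - 32 = -16
            BNEZ  R1,Loop
            SD    8(R1),F16     ; 8 - 32 = -24 (branch-delay slot)

Under the same assumptions this runs with no stalls: 14 cycles for four elements, or 3.5 cycles per element, versus 6 for the scheduled-but-not-unrolled loop.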
Note that displacement addressing mode makes it possible to keep the overhead very low.
In real code, we don't know the upper bound on the loop, i.e., how many times it will be executed. Suppose it is n, and we want to unroll the loop k times.
We create two consecutive loops (see the sketch after this list):
The first one contains the original code and executes (n mod k) times.
The second one contains the unrolled body and executes n/k times.
If the number of iterations is large, this method saves the CPU lots of time at the cost of larger code size.
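A sketch of that structure in DLX, under assumed conditions: the element count n is in R2, k = 4 (a power of two, so n mod k is a simple mask), the n < k edge case is ignored for brevity, and the labels and registers are hypothetical:

            ANDI  R3,R2,#3       ; R3 = n mod 4
            BEQZ  R3,Unrolled    ; nothing left over? skip the cleanup loop
    Clean:  LD    F0,0(R1)       ; first loop: the original body...
            ADDD  F4,F0,F2
            SD    0(R1),F4
            SUBI  R1,R1,#8
            SUBI  R3,R3,#1
            BNEZ  R3,Clean       ; ...executed (n mod 4) times
    Unrolled:
            ...                  ; second loop: the 4-way unrolled body shown
                                 ; earlier, executed n/4 times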
Although it was easy for us to recognize these transformations, programming the compiler to recognize them while guaranteeing that the transformed code is correct is not trivial.