- Can data path resources, such as the adder, be shared?
- Pipeline Issues:
- Separate instruction caches and data caches eliminate conflicts for memory access in IF and MEM.
- Note that the memory system must deliver 5x the bandwidth of the unpipelined version.
- The register file is used in two stages, reading in ID and writing in WB.
- Two reads and one write are required per clock.
- More importantly, what happens when a read and a write occur to the same register?
- What about branch instructions and the PC?
- Branches change the value of the PC -- but the condition is not evaluated until MEM!
- If the branch is taken, the instructions fetched behind the branch are invalid!
- This is clearly a serious problem that needs to be addressed.
- Pipelining decreases execution time but can increase cycle time.
- Throughput is increased since a single instruction (ideally) finishes every clock.
- However, it usually increases the latency of each instruction.
- Why?
- Imbalance among the pipe stages:
- The slowest stage determines the clock cycle time.
- Pipeline overhead:
- Pipeline register delay. Adding registers adds logic between each of the stages (plus constraints on setup and hold times for proper operation -- but we won't talk about those).
- Clock skew. The clock must be routed to possibly widely separated registers/latches, introducing delay in signal arrival times.
- In the limit, clock cycle time is bound by the sum of the clock skew and latch overhead.
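- Putting these terms together (a simple additive model, ignoring setup and hold margins): Clock cycle time >= max(stage delay) + pipeline register delay + clock skew.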
- Instruction regularity:
- With a pipeline, differences in instruction CPI can NOT be taken advantage of.
- In the unpipelined version, a store instruction finishes after MEM, in 4 clocks rather than 5. The same is true of ALU instructions.
- With pipelining, we cannot start the next instruction one clock earlier since it is already in the pipeline.
- Therefore, CPI may not be decreased by the full number of pipeline stages (the ideal case is usually not achievable).
- This effect reduces the maximum pipeline speedup.
- Pipelining can be thought of as reducing the CPI.
- This increases throughput even though clk cycle time is increased.
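- For example (illustrative numbers): if an unpipelined machine needs 50 ns per instruction and splitting it into 5 stages plus register delay and skew gives an 11 ns cycle, throughput improves from one instruction per 50 ns to (ideally) one per 11 ns, while each instruction's latency grows to 5 x 11 = 55 ns.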
- A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage.
- There are three types of hazards.
- Structural hazards
- These are conflicts over hardware resources.
- Data hazards
- These occur when an instruction needs data that is not yet available because a previous instruction has not computed (FP pipes) or stored it.
- Control hazards
- These occur for branch instructions since the branch condition (for compare and branch) and the branch target PC are not available in time to fetch an instruction on the next clock.
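- For example, since the branch is not resolved until MEM in this 5-stage pipe, up to 3 instructions (in IF, ID, and EX) have already been fetched behind a taken branch and must be thrown away.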
- Hazards in the pipeline may make it necessary to stall the pipeline.
- Stall definition:
- The simplest way to "fix" hazards is to stall the pipeline.
- This means suspending the pipeline for some instructions by one or more clk cycles.
- The stall delays all instructions issued after the instruction that was stalled.
- A pipeline stall is also called a pipeline bubble or simply bubble.
- Stall location:
- Note that a bubble need not be inserted at the start of an instruction.
- It can be inserted in the middle.
- A bubble is inserted whenever the pipeline must be suspended for one instruction to allow a previous instruction to proceed.
- The previous instruction MUST proceed in order for the hazard to clear.
- This is unlike a cache miss, in which all instructions (before and after) are stalled.
- No new instructions are fetched during a stall.
- To emphasize, all instructions issued later than the stalled instruction are stalled (see the sketch below).
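- A sketch of a simple one-cycle stall (clocks run left to right; ** marks the bubble):
      instr i   :  IF  ID  EX  MEM WB
      instr i+1 :      IF  ID  EX  MEM WB
      instr i+2 :          IF  **  ID  EX  MEM WB
      instr i+3 :              **  IF  ID  EX  MEM WB
- Instructions i and i+1 proceed normally; instruction i+2 and everything issued after it slip by one clock while the bubble flows down the pipe in place of useful work.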
- Effect on pipeline speedup:
- Pipeline stalls decrease performance from the ideal!
- Every cycle the pipeline is stalled results in a cycle in which an instruction is NOT issued.
- And thus a cycle in which an instruction is NOT completed.
- Let's start with the basic formula:
- Remember our performance equation: (IC * CPI * Clk cycle time):
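- Since the instruction count (IC) is the same with and without pipelining, it cancels, and:
      Speedup = Exec time unpipelined / Exec time pipelined = (CPI unpipelined * Clk cycle unpipelined) / (CPI pipelined * Clk cycle pipelined)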
- Note that pipelining can be thought of as decreasing CPI or decreasing clock cycle time. Let's focus on the former.
- Assuming the ideal CPI is 1:
- Let's ignore increases in clk cycle time (due to pipeline overhead):
- Let's further assume unpipelined CPI is equal to the depth of the pipeline (ignore shorter instruction CPIs).
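- Under these assumptions the pipelined CPI is 1 + pipeline stall cycles per instruction, the clock cycles are equal, and:
      Speedup = CPI unpipelined / CPI pipelined = Pipeline depth / (1 + Pipeline stall cycles per instruction)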
- If we include the effect of pipeline overhead, we get:
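- Keeping the decreasing-CPI view, where the overhead only stretches the pipelined clock cycle, one reasonable form is:
      Speedup = [Pipeline depth / (1 + Pipeline stall cycles per instruction)] * [Clk cycle unpipelined / Clk cycle pipelined]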
- and the rest can be derived.
- As long as CPI and clock cycle are calculated properly for both machines, this formula will hold true.
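- As a quick sanity check, here is a small Python sketch (a hypothetical helper with illustrative numbers, not part of the notes) that plugs values into the formulas above:

      # Pipeline speedup sketch: Speedup = (CPI_unp * T_unp) / (CPI_pipe * T_pipe),
      # assuming unpipelined CPI = pipeline depth and ideal pipelined CPI = 1.
      def speedup(depth, stalls_per_instr, clk_unpipelined, clk_pipelined):
          cpi_unp = depth                      # assumption: unpipelined CPI = pipeline depth
          cpi_pipe = 1.0 + stalls_per_instr    # ideal CPI of 1 plus stall cycles per instruction
          return (cpi_unp * clk_unpipelined) / (cpi_pipe * clk_pipelined)

      print(speedup(5, 0.2, 10, 10))   # ~4.17 = 5 / 1.2, ignoring overhead
      print(speedup(5, 0.2, 10, 11))   # ~3.79 once overhead stretches the pipelined cycle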