-
So far, the data hazards that would prevent instruction issue have been hidden by:
-
Forwarding.
-
Compiler scheduling that separated dependent instructions.
-
The latter is referred to as static scheduling.
-
Dynamic scheduling is also possible:
-
The CPU rearranges the instructions (while preserving dependences) to reduce stalls.
-
Dynamic scheduling has several advantages over static scheduling:
-
It handles dependences that are unknown at compile time (e.g., those involving memory references).
-
Allows code compiled with one pipeline in mind to run efficiently on a different pipeline.
-
The basic problem with the pipelining techniques used so far is that they all use in-order instruction issue.
-
When one instruction stalls, all instructions behind it stall as well.
-
Yet some of the instructions that follow may be able to proceed. For example:
-
Basic out-of-order execution:
-
There is no reason why the CPU can't execute the SUBD before the ADDD.
-
Doing so will reduce the penalty caused by the data stall.
-
However, this forces out-of-order completion, which complicates exception handling.
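-
The dependence argument can be made concrete with a small Python sketch that classifies the hazards between pairs of instructions. The three-instruction sequence is an assumed stand-in for the example (its exact registers are not given here): a long-latency DIVD, an ADDD that needs the DIVD result, and a SUBD that uses neither.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple[str, ...]

def dependences(earlier: Instr, later: Instr) -> list[str]:
    """Return the hazard types that `later` has with respect to `earlier`."""
    kinds = []
    if earlier.dest in later.srcs:
        kinds.append("RAW")  # later reads what earlier writes (true dependence)
    if later.dest in earlier.srcs:
        kinds.append("WAR")  # later overwrites what earlier still reads (antidependence)
    if later.dest == earlier.dest:
        kinds.append("WAW")  # both write the same register (output dependence)
    return kinds

# Hypothetical sequence in the style of the example above.
program = [
    Instr("DIVD", "F0", ("F2", "F4")),
    Instr("ADDD", "F10", ("F0", "F8")),
    Instr("SUBD", "F12", ("F8", "F14")),
]

for i, a in enumerate(program):
    for b in program[i + 1:]:
        print(f"{a.op} -> {b.op}: {dependences(a, b) or 'independent'}")
# DIVD -> ADDD: ['RAW']
# DIVD -> SUBD: independent
# ADDD -> SUBD: independent
```

ADDD and SUBD both read F8, which is not a hazard. If SUBD wrote F8 instead, the ADDD -> SUBD pair would report WAR, which is the case the scoreboard has to handle at write-back.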
-
To allow out-of-order execution, we split the ID stage into two halves:
-
Issue
-
Decode instructions and check for structural hazards.
-
Read operands
-
Wait until there are no data hazards, and then read the operands.
-
Note that we can still use forwarding to remove data hazards.
-
A design of this type may use an instruction queue to hold instructions that have been fetched but are waiting to be executed.
-
An instruction is considered to be in execution whenever it is in an EX stage.
-
Multiple instructions can be in execution at any given time.
-
This technique issues instructions in order (in-order issue).
-
However, an instruction can bypass other instructions waiting in the "read operands" phase, so execution can begin out of order.
-
This technique is called scoreboarding, after the scoreboard of the CDC 6600, the first machine to use one.
-
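Remember WAR hazards?
-
They could not occur in our DLX FP and integer pipelines, because every instruction read its operands in order, in ID. With out-of-order execution, they become possible.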
-
Goals:
-
The goal of this technique (and other dynamic scheduling methods) is to maintain an execution rate of one instruction per cycle.
-
This can be done by executing each instruction as soon as possible.
-
To accomplish this, the CPU must possess either multiple functional units or pipelined functional units (or both).
-
These are equivalent for the purposes of pipeline control.
-
Let's assume multiple functional units.
-
We'll focus the analysis of scoreboarding on operations involving the FP units.
-
The integer units rarely encounter hazards: the integer DLX pipeline stalls only when an instruction needs a value immediately after a load.
-
DLX implementation:
-
Every instruction goes through the scoreboard.
-
The scoreboard determines when an instruction can read its operands and write its results.
-
Therefore, all hazard detection and resolution are centralized in the scoreboard.
-
Pipeline stages of interest: the ID stage is replaced with two stages:
-
Issue (IS)
-
An instruction is issued if:
-
The functional unit is available and
-
No other active instruction has the same destination register.
-
This avoids WAW hazards and structural hazards.
-
When an instruction stalls in IS, the buffer between IF and IS fills up.
-
A one-entry buffer fills quickly!
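-
The issue test can be sketched in a few lines of Python. This is a minimal illustration, not the DLX hardware: the unit names ("Divide", "Add") and the dictionary representation of busy units and register result status are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    op: str                 # opcode, e.g. "ADDD"
    unit: str               # functional unit it needs (hypothetical names)
    dest: str               # destination register
    srcs: tuple[str, ...]   # source registers

def can_issue(instr: Instr, unit_busy: dict[str, bool],
              register_result: dict[str, str]) -> bool:
    """IS-stage test: issue only if there is no structural and no WAW hazard.

    unit_busy:       functional unit -> True while it holds an instruction
    register_result: register -> unit that will eventually write it
                     (registers with no pending writer are simply absent)
    """
    structural_ok = not unit_busy.get(instr.unit, False)
    waw_ok = instr.dest not in register_result
    return structural_ok and waw_ok

# DIVD F0,F2,F4 has issued and occupies the divider.
unit_busy = {"Divide": True, "Add": False}
register_result = {"F0": "Divide"}

print(can_issue(Instr("ADDD", "Add", "F10", ("F0", "F8")), unit_busy, register_result))    # True
print(can_issue(Instr("DIVD", "Divide", "F6", ("F2", "F4")), unit_busy, register_result))  # False
print(can_issue(Instr("ADDD", "Add", "F0", ("F6", "F8")), unit_busy, register_result))     # False
```

The first ADDD issues even though it needs F0 from the divider: RAW hazards do not block issue; they are resolved later, in RD. The second instruction is blocked by a structural hazard, the third by a WAW hazard on F0.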
-
Read Operands (RD)
-
The read operation is delayed until the operands are available.
-
-
This means that no previously issued but uncompleted instruction has the operand as its destination.
-
-
This resolves RAW hazards dynamically.
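-
A minimal Python sketch of the RD test follows. Because only *previously issued* instructions matter, it assumes each waiting instruction records, at issue time, which unit produces each source (the idea behind the Qj/Qk fields); re-checking register ownership at RD time could wrongly treat a later-issued writer as a producer. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Waiting:
    text: str
    qj: Optional[str]  # unit producing the first source (None = value available)
    qk: Optional[str]  # unit producing the second source (None = value available)

def can_read_operands(w: Waiting) -> bool:
    """RD-stage test: no earlier-issued, uncompleted instruction still owes a source."""
    return w.qj is None and w.qk is None

def result_written(unit: str, waiting: list[Waiting]) -> None:
    """When `unit` writes its result, every instruction waiting on it sees the value."""
    for w in waiting:
        if w.qj == unit:
            w.qj = None
        if w.qk == unit:
            w.qk = None

# ADDD F10,F0,F8 issued while DIVD F0,F2,F4 was still in the divider.
addd = Waiting("ADDD F10,F0,F8", qj="Divide", qk=None)
print(can_read_operands(addd))  # False: RAW hazard on F0
result_written("Divide", [addd])
print(can_read_operands(addd))  # True: F0 is now in the register file
```

Since forwarding is not used, the waiting instruction reads the value from the register file after the producer's write.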
-
Execution (EX)
-
When the result is ready, the functional unit notifies the scoreboard that execution has completed.
-
-
This step may take multiple cycles.
-
Write result (WB)
-
The scoreboard checks for WAR hazards and stalls the completing instruction if necessary.
-
In our earlier example, SUBD would be stalled in WB until ADDD reads its operands.
-
In general, writeback is stalled if:
-
A preceding instruction has not yet read its operands, and
-
One of those operands is the same register as the destination of the completing instruction.
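-
The two stall conditions above translate directly into a Python sketch. It assumes a hypothetical variant of the example in which SUBD writes F8, a register that the waiting ADDD has not yet read; the `Pending` record is an illustrative representation, not part of the DLX design.

```python
from dataclasses import dataclass

@dataclass
class Pending:
    text: str
    srcs: tuple[str, ...]
    operands_read: bool  # has this instruction passed the RD stage?

def can_write_result(dest: str, preceding: list[Pending]) -> bool:
    """WB-stage test: stall if a preceding instruction has not read its
    operands and one of them is the register about to be overwritten."""
    return not any(not p.operands_read and dest in p.srcs for p in preceding)

# ADDD F10,F0,F8 is still waiting for F0 from DIVD; SUBD F8,F8,F14 has finished.
addd = Pending("ADDD F10,F0,F8", ("F0", "F8"), operands_read=False)
print(can_write_result("F8", [addd]))  # False: SUBD stalls in WB (WAR on F8)
addd.operands_read = True              # ADDD reads F0 and the old F8
print(can_write_result("F8", [addd]))  # True: SUBD may now write F8
```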
-
The DLX pipeline now has six stages: IF IS RD EX MEM WB.
-
Note that forwarding is not used here.
-
But the penalty is small, since each result is written back as soon as it is available.
-
Instructions that do not need the MEM phase skip it.
-
Components of the system:
-
Instruction status
-
This component keeps track of the current stage of each instruction.
-
There is one entry for each instruction that has passed the IF stage but has not yet completed.
-
Functional unit status
-
This component holds the status of each functional unit.
-
"Busy" indicates whether or not the unit is busy.
-
"Op" indicates the operation being performed (some functional units can perform more than one operation).
-
Fi is the instruction's destination register; Fj and Fk are its source registers.
-
Qj and Qk indicate the functional units that will produce source registers Fj and Fk, if those values are not yet available.
-
Rj and Rk indicate whether Fj and Fk are ready and not yet read; they are set to No once the operands have been read.
-
It is these fields that are used to avoid WAR hazards.
-
Register result status
-
For each register, this holds the ID of the functional unit that will eventually write it.
-
If the register is not the destination of an issued instruction, the field will indicate no functional unit.
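-
The functional unit status and register result status tables can be sketched as Python data structures, together with the bookkeeping done when an instruction issues. The unit mix (an integer unit, two multipliers, an adder, and a divider) and the `issue` helper are assumptions for illustration; Rj/Rk follow the "ready and not yet read" meaning given above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # first source register
    fk: Optional[str] = None  # second source register
    qj: Optional[str] = None  # unit producing Fj (None if the value is available)
    qk: Optional[str] = None  # unit producing Fk
    rj: bool = False          # Fj ready and not yet read
    rk: bool = False          # Fk ready and not yet read

def issue(unit: str, op: str, fi: str, fj: str, fk: str,
          fu_status: dict[str, FUStatus], register_result: dict[str, str]) -> None:
    """Record an issued instruction (assumes the structural/WAW check passed)."""
    fu = fu_status[unit]
    fu.busy, fu.op, fu.fi, fu.fj, fu.fk = True, op, fi, fj, fk
    # Look up the producers BEFORE claiming the destination, so an
    # instruction such as MULTD F4,F4,F6 does not wait on itself.
    fu.qj = register_result.get(fj)
    fu.qk = register_result.get(fk)
    fu.rj = fu.qj is None
    fu.rk = fu.qk is None
    register_result[fi] = unit

fu_status = {name: FUStatus() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
register_result: dict[str, str] = {}  # register result status

issue("Divide", "DIVD", "F0", "F2", "F4", fu_status, register_result)
issue("Add", "ADDD", "F10", "F0", "F8", fu_status, register_result)
print(fu_status["Add"])
print(register_result)  # {'F0': 'Divide', 'F10': 'Add'}
```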
-
Let's look at snapshots of the three components of the scoreboard during execution.
-
Handling hazards, and distinguishing between RAW and WAR hazards:
-
RAW
-
We can detect RAW hazards by checking whether a source register is listed in the Register Result Status table.
-
If it is, we have a RAW hazard.
-
If a pending instruction is waiting for a value from the current instruction, the corresponding Rj/Rk field of the pending instruction is No.
-
WAR
-
Before writing the value, we check that no pending instruction still needs the previous value of the register we're about to overwrite.
-
If some pending instruction has that register as a source and its value is ready but not yet read, then its Rj or Rk is Yes, and the current instruction must stall (WAR).
-
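This is how we distinguish between RAW and WAR: No means the pending instruction is waiting for the new value (or has already read the old one), while Yes means it still needs the old value.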
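-
The flag-based WAR check can be sketched in Python over the functional unit status table. To let SUBD finish while ADDD is still waiting, this sketch assumes two hypothetical adder units, Add1 and Add2; the table holds only the fields the check needs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FUStatus:
    fi: Optional[str] = None  # destination register (None when the unit is idle)
    fj: Optional[str] = None  # source registers
    fk: Optional[str] = None
    rj: bool = False          # True: Fj is ready but has NOT been read yet
    rk: bool = False          # True: Fk is ready but has NOT been read yet

def can_write_without_war(unit: str, fu_status: dict[str, FUStatus]) -> bool:
    """Allow the write unless another unit still needs the old value of Fi.

    A matching source with R = No is a RAW consumer (or has already read),
    so it does not block; a matching source with R = Yes is a WAR hazard.
    """
    dest = fu_status[unit].fi
    return all(
        (f.fj != dest or not f.rj) and (f.fk != dest or not f.rk)
        for name, f in fu_status.items()
        if name != unit  # the completing unit already read its own operands
    )

fu_status = {
    "Divide": FUStatus("F0", "F2", "F4"),                      # DIVD F0,F2,F4: executing
    "Add1": FUStatus("F10", "F0", "F8", rj=False, rk=True),    # ADDD F10,F0,F8: waits for F0
    "Add2": FUStatus("F8", "F8", "F14"),                       # SUBD F8,F8,F14: finished
}
print(can_write_without_war("Add2", fu_status))    # False: ADDD still needs old F8 (WAR)
print(can_write_without_war("Divide", fu_status))  # True: ADDD waits for this F0 (RAW)
```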
-
Limitations:
-
ILP
-
If we can't find independent instructions to execute, scoreboarding (or any dynamic scheduling scheme for that matter) helps very little.
-
Size of the "issued" queue
-
This determines how far ahead the CPU can look for instructions to execute in parallel.
-
For now, we assume that a window cannot span a branch.
-
In other words, the window includes only instructions within a single basic block.
-
We'll show later how the window can be extended beyond a branch.
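-
The basic-block restriction on the window can be illustrated with a short Python sketch. Instructions are plain strings here, and the `is_branch` predicate and window size are hypothetical parameters supplied by the caller.

```python
from typing import Callable

def issue_window(instrs: list[str], start: int, size: int,
                 is_branch: Callable[[str], bool]) -> list[str]:
    """Candidates the CPU may examine: at most `size` instructions from
    `start`, ending at (and including) the first branch, so the window
    never crosses a basic-block boundary."""
    window = []
    for instr in instrs[start:start + size]:
        window.append(instr)
        if is_branch(instr):
            break  # the branch ends the basic block
    return window

program = ["LD F2,0(R1)", "ADDD F4,F2,F0", "SUBD F6,F8,F10",
           "BNEZ R1,loop", "MULTD F0,F2,F4"]
is_branch = lambda i: i.split()[0] in {"BEQZ", "BNEZ", "J"}

print(issue_window(program, 0, 8, is_branch))  # stops at BNEZ
print(issue_window(program, 0, 2, is_branch))  # limited by window size
```

A larger window only helps until it reaches the end of the basic block, which is why extending it past branches matters.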
-
Number, types, and speed of the functional units
-
This determines how often a structural hazard results in a stall.
-
The presence of antidependences and output dependences
-
WAR and WAW hazards limit the scoreboard more than RAW hazards.
-
RAW
hazards are problems for any technique.
-
But WAR and WAW hazards can be eliminated by means other than a scoreboard (for example, register renaming).