-
Performance of Branch Schemes
:
-
Here we assume that there are
no
other delays from data hazards (an ideal CPI of 1).
-
We can calculate pipeline stalls from branches by:
Pipeline stall cycles from branches = branch frequency × branch penalty.
-
Example:
-
Suppose we have a CPU that has a single branch delay slot.
-
This slot can be filled with a useful instruction
65%
of the time.
-
In addition, the branch condition is not known for
two
cycles beyond the delay slot.
-
If the branch is predicted correctly, there is no penalty.
-
If it is mispredicted, the two intervening instructions must be cancelled.
-
Forward branches are always
predicted not taken
, while backward branches are always
predicted taken
.
-
Forward branches make up
75%
of all branches, and branches are
20%
of all instructions.
-
Suppose
50%
of forward branches and
85%
of backward branches are taken.
-
What is the new CPI (assuming the original CPI is 1)?
-
Solution:
-
First, let's calculate the number of stall cycles.
-
For 35% of branch instructions, the delay slot isn't filled.
-
This adds 0.35 stall cycles per branch.
-
Forward branches are predicted not taken, so the 50% that are taken are mispredicted and suffer a 2-cycle penalty.
-
Since 75% of branches are forward, this contributes 0.75 × 0.5 × 2 = 0.75 cycles.
-
Similarly, the 15% of backward branches that are not taken are mispredicted and suffer a 2-cycle penalty, adding 0.25 × 0.15 × 2 = 0.075 cycles.
-
The total branch penalty is thus 0.35 +
0.75
+ 0.075 = 1.175 cycles.
-
Since branches make up 20% of all instructions, the penalty to the CPI is 0.2 × 1.175 = 0.235 cycles.
-
This makes the new CPI 1.235.
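-
The arithmetic above can be checked with a short script. This is only a sketch: every figure comes from the problem statement, and the variable names are made up for illustration.

```python
# Worked version of the CPI example; all figures come from the problem statement.
branch_freq = 0.20          # branches as a fraction of all instructions
unfilled_slot = 1 - 0.65    # fraction of branches whose delay slot is not filled
forward_frac = 0.75         # fraction of branches that are forward
forward_taken = 0.50        # forward branches are predicted not taken
backward_taken = 0.85       # backward branches are predicted taken
mispredict_penalty = 2      # cycles cancelled on a misprediction

# Stall cycles per branch, one term per source of delay.
slot_stalls = unfilled_slot * 1
forward_stalls = forward_frac * forward_taken * mispredict_penalty
backward_stalls = (1 - forward_frac) * (1 - backward_taken) * mispredict_penalty

penalty_per_branch = slot_stalls + forward_stalls + backward_stalls
new_cpi = 1.0 + branch_freq * penalty_per_branch

print(f"stall cycles per branch: {penalty_per_branch:.3f}")
print(f"new CPI: {new_cpi:.3f}")
```

Changing any one input (for example the delay slot fill rate) shows how sensitive the CPI is to that parameter.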
-
Compilers and static branch prediction
:
-
Having accurate information about branch behavior at compile time is also helpful for scheduling
data
hazards:
-
Suppose we knew that the branch was
almost always taken
and the value in R7 was not needed on the fall-through path:
-
The compiler could move ADD R7, R8, R9 after the load instruction.
-
Suppose we knew that the branch was
rarely taken
and the value in R4 was not needed on the taken path:
-
The compiler could move OR R4, R5, R6 after the load instruction.
-
These optimizations are in addition to any branch delay scheduling.
-
In order to reduce branch stall penalties, the compiler can:
-
Reorder instructions
:
-
As we have seen in previous examples.
-
Predict all branches taken
-
This is surprisingly effective since 85% of backward branches and 60% of forward branches are taken.
-
However, this still leaves more than a third of the branches improperly predicted.
-
For some programs, this method is excellent (< 10% mispredictions), but for others, it does badly (> 50%).
-
Predict forward not taken and backward taken
-
This scheme is similar to predicting all branches as taken except that it uses information about the types of branches.
-
Forward branches are likely part of
if-else
constructs, and may be less likely to be taken.
-
Backward branches are usually part of
loops
and thus more likely to be taken.
-
This is particularly true if the compiler
reorganizes
if-else
constructs to make the non-taken fork of the branch more likely.
-
However, this method won't perform much better than simply predicting
not-taken
.
-
Use profile information from previous runs
-
The compiler can instrument the code using the profile information from previous runs of the program.
-
It can build a higher performance program by predicting that branches taken in the practice run(s) will be taken in the final version.
-
It is not perfect since many branches are both taken and not taken in the course of execution.
-
But it does provide better prediction than other static methods.
-
Misprediction rates for this method range from 5% to 20%.
-
This is true even if
different
input data is used for the program.
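-
For comparison with the 5% to 20% range quoted for profiling, the misprediction rates of the fixed (non-profile) schemes can be estimated from the taken rates given earlier (85% backward, 60% forward). The sketch below additionally assumes that 75% of branches are forward, a mix borrowed from the earlier CPI example rather than stated for these figures. The function and variable names are hypothetical.

```python
def mispredict_rate(forward_frac, forward_taken, backward_taken,
                    predict_forward_taken, predict_backward_taken):
    """Fraction of all branches mispredicted by a fixed per-direction prediction."""
    fwd_miss = (1 - forward_taken) if predict_forward_taken else forward_taken
    bwd_miss = (1 - backward_taken) if predict_backward_taken else backward_taken
    return forward_frac * fwd_miss + (1 - forward_frac) * bwd_miss

FORWARD_FRAC = 0.75    # assumption: branch mix borrowed from the CPI example
FORWARD_TAKEN = 0.60   # from the text
BACKWARD_TAKEN = 0.85  # from the text

# Each scheme is (predict forward taken?, predict backward taken?).
schemes = {
    "predict taken": (True, True),
    "predict not taken": (False, False),
    "forward not taken, backward taken": (False, True),
}
rates = {
    name: mispredict_rate(FORWARD_FRAC, FORWARD_TAKEN, BACKWARD_TAKEN, fwd, bwd)
    for name, (fwd, bwd) in schemes.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Under these assumed numbers, predict-taken mispredicts a little over a third of all branches. The direction-based scheme only beats it when forward branches are taken less than half the time.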
-
Studies have shown that
profile-based prediction
is almost always better than
predict-taken
or other non-profile-based methods.
-
Since
profile-based prediction
is so good, why not use it?
-
Dynamic branch prediction
provides a better solution (which we'll discuss in a week or two).
-
Why is pipelining difficult?
-
Now that we've seen how pipelining can be done and how to detect and resolve hazards, the question arises: what's so hard about this?
-
Exceptions
-
Instruction set complications
-
Exceptions
-
The problem is that an instruction in the pipeline can raise an exception that may force other instructions in the pipeline to be aborted.
-
These other instructions may have
altered
the state of the machine.
-
More importantly, exceptions introduce the possibility that an exception in a later instruction (e.g., one in ID or EX) will prevent a previous instruction (e.g., one in MEM or WB) from completing.
-
Exception causes
-
I/O device requests
-
User OS service requests
-
Breakpoints
-
Integer arithmetic overflow/underflow
-
FP arithmetic anomaly
-
Page fault
-
Misaligned memory accesses
-
Memory protection violations
-
Hardware malfunctions
-
Undefined instructions
-
Exception characteristics
-
Synchronous vs. asynchronous
-
Does the exception come as a result of execution, at the same place for every run of a program with the same data and memory allocation?
-
Or is it generated external to the CPU?
-
Asynchronous
events can usually be handled after the completion of the current instruction, making them easier to handle.
-
User requested vs. coerced
-
Did the user request the exception, e.g., through an exception instruction?
-
Or did it happen as a result of something beyond the user program's control, e.g., a hardware event?
-
Coerced exceptions are harder to implement since they are not predictable.
-
User maskable vs. non-maskable
-
Can the user prevent the hardware from responding?
-
Note that the user can still choose to respond to maskable interrupts, in which case they behave like non-maskable interrupts.
-
In other words, maskable interrupts must still be handled properly.
-
Within vs. between instructions
-
Does the exception prevent instruction completion by occurring in the middle of execution?
-
Or is it recognized between instructions?
-
Exceptions occurring within instructions are usually synchronous, since the instruction triggers the exception.
-
Within
is more difficult to implement than
between
since an instruction interrupted in mid-execution must be restarted.
-
Resume vs. terminate
-
Does the exception stop the program from running?
-
Or must the program be restarted after the interrupt?
-
Restarting is harder (obviously), and is the more common case.
-
The most difficult case is handling interrupts within an instruction, where the instruction must be resumed.
-
In this case, another section of code (usually OS code) must be invoked to:
-
Save the state of the executing program.
-
Fix the cause of the exception.
-
Restore the state of the original program, and restart it as if nothing had happened.
-
Exceptions of this type occur for
virtual memory management
systems.
-
Machines that can perform these operations are called
restartable
.
-
For exceptions that occur
within
instructions (i.e. in EX or MEM) and must be
restarted
(e.g., after a page fault), the pipeline state must be saved.
-
Pipeline control accomplishes this by:
-
Inserting a trap instruction into the pipeline on the next IF.
-
Turning off all writes for the faulting instruction and the instructions following it in the pipeline.
-
Allowing previous instructions to complete.
-
Saving the PC of the faulting instruction so it can be restarted. (Done by the OS exception handling routine.)
-
This method requires as many PCs as there are delay slots, since the instructions currently in the pipeline may not be sequentially related!
-
In any case, we will have to save at least one PC value: the location of the faulting instruction.
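-
The pipeline-control steps above can be sketched as a small model. It is illustrative only: it assumes a five-stage pipeline with no delay slots (so a single saved PC suffices), and the `InFlight` record, the `take_exception` function, and the PC values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InFlight:
    pc: int
    stage: str
    writes_enabled: bool = True

def take_exception(pipeline, faulting_index):
    """Apply the pipeline-control steps for a fault.

    `pipeline` lists the in-flight instructions oldest first.
    Returns (saved_pc, next_fetch): the PC to restart from and what the
    next IF brings into the pipeline.
    """
    # Turn off writes for the faulting instruction and everything younger.
    for instr in pipeline[faulting_index:]:
        instr.writes_enabled = False
    # Older instructions keep writes_enabled=True and are allowed to complete.
    saved_pc = pipeline[faulting_index].pc
    return saved_pc, "TRAP"

pipeline = [
    InFlight(pc=100, stage="WB"),
    InFlight(pc=104, stage="MEM"),  # faulting instruction, e.g. a page fault
    InFlight(pc=108, stage="EX"),
    InFlight(pc=112, stage="ID"),
    InFlight(pc=116, stage="IF"),
]
saved_pc, next_fetch = take_exception(pipeline, faulting_index=1)
print(saved_pc, next_fetch, [i.writes_enabled for i in pipeline])
```

With delayed branches, the single `saved_pc` would have to become a list of PCs, for the reason given above.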
-
Precise vs. imprecise exceptions
-
A
precise
exception is one in which:
-
All instructions
before
the faulting instruction complete, AND
-
All instructions
following
the faulting instruction, including the faulting instruction, do not change the state of the machine.
-
Under this model, restarting is easy:
-
Simply re-execute the original faulting instruction.
-
Or, if it is not a resumable instruction, e.g., one that caused an integer overflow, start with the next instruction.
-
Often, precise exceptions are difficult because of
out-of-order
instruction completions and
out-of-order
exception occurrences.
-
This leads to
imprecise
exceptions.
-
This is true of floating point pipelines more so than integer pipelines.
-
In general, integer exceptions are precise, while FP exceptions may not be.
-
Exception ordering:
-
Suppose two consecutive instructions, an LW followed by an ADD, cause exceptions:
-
In this case, the memory exception (a page fault on the LW in MEM) comes in the same cycle as the overflow exception (on the ADD in EX).
-
Here, the exception from the first instruction (the page fault) should be handled and the second instruction cancelled.
-
What if fetching the ADD instruction caused a page fault?
-
Then, the ADD instruction page fault occurs
before
(in time) the LW page fault.
-
However, we must finish the LW before handling the ADD page fault (if we are implementing precise exceptions).
-
This is done by keeping an
exception vector
for each instruction:
-
If an exception is posted, it is added to the vector and all writes that affect system state are disabled.
-
When the instruction is about to exit the pipeline (MEM/WB), any pending exceptions for the instruction are examined.
-
If an instruction generates
multiple
exceptions, the exception occurring in the earliest stage takes precedence.
-
Note that, for the DLX, the faulting instruction has not updated any state (since all updates occur in WB).
-
Many CPUs support
both
precise and imprecise exception modes for performance reasons, since precise exception mode is much slower.
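-
The exception-vector mechanism described above can be sketched as follows. This is a simplified model, not the DLX hardware: it assumes an ideal five-stage pipeline with no stalls, so instruction i (counting from 0) is in stage s during cycle i + s. The function name and the exception labels are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def first_exception(program):
    """program: list of (name, {stage: exception}) in program order.

    Returns (raised_first, handled_first):
      raised_first  - (cycle, name, exception) of the earliest exception in time
      handled_first - (name, exception) actually handled first
    """
    # When each exception is posted to its instruction's vector.
    posted = sorted(
        (i + STAGES.index(stage), name, exc)
        for i, (name, faults) in enumerate(program)
        for stage, exc in faults.items()
    )
    # Instructions reach the check point (MEM/WB) in program order, so the
    # first instruction with a non-empty vector is handled first.
    for name, faults in program:
        if faults:
            # Within one instruction, the earliest stage takes precedence.
            earliest_stage = min(faults, key=STAGES.index)
            return posted[0], (name, faults[earliest_stage])
    return None, None

program = [
    ("LW", {"MEM": "data page fault"}),
    ("ADD", {"IF": "instruction page fault"}),
]
raised_first, handled_first = first_exception(program)
print("raised first in time:", raised_first)
print("handled first:", handled_first)
```

The ADD's fetch fault is posted two cycles before the LW's data fault, yet the LW's exception is the one handled first, which is the ordering that precise exceptions require.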
-
Instruction set complications
-
An instruction is
committed
when it is guaranteed to complete.
-
On DLX, all instructions are committed at the end of MEM.
-
Since no updates occur before instructions commit,
precise
interrupts are straightforward.
-
In most RISC systems, each instruction writes only one result.
-
This means that the instruction can be cancelled any time before the instruction is committed, with no harm to the system state.
-
This is not true for many CISC machines, e.g., the VAX.
-
On these machines, the system state may be modified well before the instruction or its predecessors are committed.
-
For example, if an instruction using autoincrement mode is aborted because of an exception, then the machine state may have been altered.
-
This leads to an
imprecise
exception, making it difficult to restart the instruction.
-
The situation is worse for instructions that access and write memory in multiple places.
-
These instructions can generate multiple faults.
-
Therefore, it becomes difficult to know where to resume.
-
For
string
instructions, the CPU must also know how far into the operation it was when the exception occurred.
-
This is usually solved by using general purpose registers as scratch space (their contents are saved and restored).
-
The general solution used by more complex instruction set machines is to pipeline the microcode.
-
In fact, RISC has often been compared to having the microcode as the actual assembly language.
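-
The point about string instructions keeping their progress in registers can be illustrated with a toy model. It is a sketch under stated assumptions: memory is a Python list, pages hold 4 cells, and `PageFault`, `string_copy`, and the register names are all hypothetical rather than taken from any real instruction set.

```python
class PageFault(Exception):
    def __init__(self, page):
        super().__init__(f"page {page} not resident")
        self.page = page

def string_copy(mem, regs, resident_pages, page_size=4):
    """Copy regs['count'] cells from regs['src'] to regs['dst'].

    Progress lives in the registers and is updated after every element,
    so the operation can be resumed after a fault without redoing work.
    """
    while regs["count"] > 0:
        for addr in (regs["src"], regs["dst"]):
            page = addr // page_size
            if page not in resident_pages:
                raise PageFault(page)  # no state changed for this element yet
        mem[regs["dst"]] = mem[regs["src"]]
        regs["src"] += 1
        regs["dst"] += 1
        regs["count"] -= 1

mem = list(range(8)) + [0] * 8
regs = {"src": 0, "dst": 8, "count": 8}
resident = {0, 2}   # pages 1 and 3 are not in memory yet
faults = []
while True:
    try:
        string_copy(mem, regs, resident)
        break
    except PageFault as fault:
        faults.append(fault.page)  # the "OS" brings the page in,
        resident.add(fault.page)   # then the instruction is resumed
print(mem[8:], faults, regs)
```

Because both pages are checked before the element is written, a fault never leaves a half-completed element behind, and the registers alone say where to resume.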
-
Multi-cycle operations:
-
Implementing instructions that vary widely in the number of clock cycles they take to complete makes building a pipeline more complex.