So far, we've discussed techniques for lowering ideal CPI to as close to 1 as possible.
In order to go below 1, though, the CPU must be capable of issuing more than one instruction per cycle.
This can be done in two ways:
In superscalar architectures, the processor tries to issue more than one instruction per cycle so as to keep all of the functional units busy.
There may be limitations on parallel issue, e.g.,
No more than one memory instruction per clock cycle.
A limit of a single branch per cycle.
In order to maximize the number of instructions issued per clock, both static (compiler) and dynamic (hardware) scheduling techniques are used.
In contrast, VLIW (Very Long Instruction Word) processors issue a fixed number of instructions per clock.
It's similar to taking (for example) four DLX instructions and turning them into one 128-bit instruction.
VLIW processors are inherently statically scheduled by the compiler.
We'll need to make sure that there are no data hazards between instructions issued together.
The easiest way to accomplish this is to allow dual issue of one integer instruction (ALU, load/store) and one floating point instruction.
Hardware requirements
Instruction alignment
We can require that instruction pairs be 64-bit aligned, and that the integer instruction be first.
We could relax this requirement, but it would increase the complexity of detecting hazards and thus the cost of the hardware.
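The alignment rule above can be sketched as a small legality check. This is only an illustration under stated assumptions: the `Instr` record, the `INT`/`FP` class labels, and the function name are hypothetical, and instructions are assumed to be 32 bits (4 bytes) wide as in DLX.

```python
from dataclasses import dataclass

# Hypothetical instruction classes for this sketch. "INT" stands for the
# integer slot (ALU ops, loads/stores); "FP" for FP operations.
INT, FP = "INT", "FP"

@dataclass
class Instr:
    addr: int  # byte address of the 32-bit instruction (assumed width)
    kind: str  # INT or FP

def can_issue_as_pair(first: Instr, second: Instr) -> bool:
    """True if the two instructions form a legal issue pair:
    the pair is 64-bit aligned and the integer instruction comes first."""
    aligned = first.addr % 8 == 0 and second.addr == first.addr + 4
    return aligned and first.kind == INT and second.kind == FP
```

With this restriction the issue logic only has to look at one fixed position for each instruction class, which is why relaxing it would make hazard detection more expensive.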
Arithmetic units & pipelines
The CPU must have sufficient FP hardware to support one issue/clk.
This means pipelined FP units (or multiple FP units or both).
Interactions between integer and FP
FP and integer operations are largely independent.
However, FP loads and stores, as well as moves between integer and FP registers, are issued as integer instructions, and these can cause problems.
They create contention for the FP register ports between FP loads/stores and FP ALU instructions, and they introduce a new RAW hazard.
The first can be handled by adding an extra port to the FP register file for memory operations.
For the second, we must detect the case in which an FP instruction is issued in the same cycle as the FP load that fetches a source operand for it (a RAW hazard).
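The same-cycle RAW check can be sketched as follows. The `Op` record and the register names are assumptions made for illustration; real issue logic would compare register fields in hardware rather than strings.

```python
from typing import NamedTuple, Optional, Tuple

class Op(NamedTuple):
    name: str                 # mnemonic, for readability only
    dest: Optional[str]       # destination register, or None (e.g. a store)
    srcs: Tuple[str, ...]     # source registers

def raw_within_pair(int_slot: Op, fp_slot: Op) -> bool:
    """True if the FP-slot instruction reads a register that the
    integer-slot instruction (e.g. an FP load) writes in the same cycle."""
    return int_slot.dest is not None and int_slot.dest in fp_slot.srcs
```

When the check fires, the pair cannot issue together; one simple policy (again an assumption, not something stated in these notes) is to issue the load alone and hold the FP instruction for a later cycle.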
Data and control hazards
In the simple DLX pipeline, loads had a latency of one clk cycle.
In the superscalar pipeline, the result of a load cannot be used on the same clk or the next clk cycle.
Hazards impose a penalty measured in cycles, not in instructions.
This means the next 3 instructions cannot use the result without a stall.
The same is true for branch delays.
Therefore, more ambitious compiler or hardware scheduling techniques and more complex instruction decoding (for branches) are needed.
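A short calculation shows where the figure of 3 instructions comes from. This is a sketch under the assumption that the load sits in the first slot of its issue packet; the function and parameter names are hypothetical.

```python
def instructions_in_delay_shadow(issue_width: int, delay_cycles: int,
                                 slot: int = 0) -> int:
    """Number of following instructions that cannot use a result
    without stalling, given a delay expressed in clock cycles.

    slot is the producer's position in its issue packet (0 = first)."""
    same_cycle = issue_width - 1 - slot        # rest of the same packet
    later_cycles = delay_cycles * issue_width  # full packets in the delay
    return same_cycle + later_cycles

# Single-issue DLX: a one-cycle load delay covers 1 instruction.
single = instructions_in_delay_shadow(issue_width=1, delay_cycles=1)
# Dual-issue DLX: the same one-cycle delay covers 3 instructions.
dual = instructions_in_delay_shadow(issue_width=2, delay_cycles=1)
```

The same arithmetic applies to a branch delay, which is why a fixed delay in cycles becomes harder to fill as the issue width grows.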
If the CPU cannot find a useful instruction for one of the two slots, the CPI rises above the dual-issue ideal of 0.5 and approaches 1.
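The effect of unfilled slots on CPI can be sketched with a simple model. It assumes no stalls and that every cycle issues either one or two instructions; `paired_fraction` is a hypothetical parameter naming the fraction of cycles in which both slots are filled.

```python
def cpi_dual_issue(paired_fraction: float) -> float:
    """CPI of a stall-free dual-issue machine when only a fraction of
    cycles manage to fill both issue slots."""
    if not 0.0 <= paired_fraction <= 1.0:
        raise ValueError("paired_fraction must be between 0 and 1")
    instructions_per_cycle = 1.0 + paired_fraction
    return 1.0 / instructions_per_cycle
```

For example, `cpi_dual_issue(1.0)` gives the ideal 0.5, while `cpi_dual_issue(0.0)` falls back to 1.0, the single-issue case.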
Static scheduling on a superscalar processor:
Loop unrolling and scheduling on a dual-issue DLX:
To schedule without delays, 5 copies of the loop body are unrolled, giving 2.4 clks per iteration.
Dynamic scheduling on a superscalar processor:
Dynamic scheduling can improve on these results to an even greater extent.
This is true because the CPU can dual issue instructions with dependencies and serialize them later using hazard detection logic.
Additional hardware can reduce delays through the elimination of WAR and WAW hazards and through memory disambiguation (as we saw with Tomasulo's approach).
Dynamic scheduling allows the CPU to keep the functional units busy as often as possible.
It also permits the CPU to run well on code that was not scheduled for superscalar execution.
Superscalar machines use hardware to reorder instructions and keep functional units busy.
In VLIW (Very Long Instruction Word) machines, all of this burden falls upon the compiler.
Each VLIW "instruction" is composed of multiple independent instructions, each of which executes on a different functional unit.
The functional units might include integer ALUs, FP ALUs, memory units, and a branch unit.
To control these units, the instruction must allocate 16 or more bits to each unit to describe the operation that the unit will run on each cycle.
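The per-unit fields can be pictured as fixed-width slots packed into one long word. The sketch below assumes exactly 16 bits per unit, a hypothetical set of five slots, and an all-zero field meaning "no operation"; none of these encodings come from a real machine.

```python
FIELD_BITS = 16  # assumed: the minimum field width mentioned above

# Hypothetical slot layout, lowest-order field first.
SLOTS = ("mem1", "mem2", "fp1", "fp2", "int_or_branch")
WORD_BITS = FIELD_BITS * len(SLOTS)  # width of one long instruction word

def pack_word(fields: dict) -> int:
    """Pack per-unit operation codes into one instruction word.
    Missing slots are encoded as 0 (assumed no-op)."""
    word = 0
    for i, name in enumerate(SLOTS):
        value = fields.get(name, 0)
        if not 0 <= value < (1 << FIELD_BITS):
            raise ValueError(f"field {name} does not fit in {FIELD_BITS} bits")
        word |= value << (i * FIELD_BITS)
    return word

def unpack_word(word: int) -> dict:
    """Split an instruction word back into its per-unit fields."""
    mask = (1 << FIELD_BITS) - 1
    return {name: (word >> (i * FIELD_BITS)) & mask
            for i, name in enumerate(SLOTS)}
```

Every word carries all five fields whether or not the compiler found work for each unit, so unused slots still occupy instruction memory.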
To keep the functional units busy, the compiler uncovers parallelism by unrolling loops and scheduling code across basic blocks.
The CPU can also help by providing forwarding.
Suppose we have a VLIW machine that could issue two memory references, two FP operations and one integer operation or branch instruction per clock:
Loop unrolled, branch delay ignored. 2.5 operations per clk.
Limits in multiple-issue processors:
Why stop at 5 instructions/clk? Why not 50?
Limits on available ILP in programs
There are usually not enough operations to fill all of the available slots.
It might seem that 5 independent instructions are sufficient in our example.
However, the memory, branch and FP units will likely be pipelined and have a multicycle latency.
For example, assume a latency of 6 clks for the FP units, and that two pipelined FP units are available.
This requires that we find 12 FP instructions that are independent of the most recently issued FP instruction!
If a branch has just a one-cycle latency, it results in a 5-instruction latency in our machine.
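Both figures follow from multiplying a latency in clocks by the number of operations that can issue per clock. A minimal sketch, with hypothetical function names:

```python
def independent_ops_needed(latency_clks: int, units: int) -> int:
    """Independent operations required to keep `units` pipelined
    functional units busy while one result is still in flight."""
    return latency_clks * units

def delay_in_instructions(delay_clks: int, issue_width: int) -> int:
    """A delay of `delay_clks` cycles expressed in issue slots."""
    return delay_clks * issue_width

fp_ops = independent_ops_needed(latency_clks=6, units=2)       # FP example
branch = delay_in_instructions(delay_clks=1, issue_width=5)    # branch example
```

The amount of independent work required grows with the product of latency and issue width, which is the core of the ILP limit described here.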
Hardware complexity
Additional functional units
We must duplicate the integer and FP units for multiple issue, but their cost scales linearly.
Added bandwidth to registers and memory
More register file ports are required to sustain the multiple issue.
For example, a single integer pipeline requires 3 ports to a register file.
Adding another pipeline requires 3 more ports.
A more significant problem is adding memory ports, which are much more expensive than register ports.
The complexity and access time penalties of a multiported memory hierarchy are probably the most serious hardware limitations of superscalar and VLIW.
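The register-port count above can be written as a one-line model. It assumes the 3 ports per integer pipeline are two read ports plus one write port, which is the usual arrangement for a two-source, one-destination instruction format; the function name is hypothetical.

```python
def register_file_ports(pipelines: int, reads_per_op: int = 2,
                        writes_per_op: int = 1) -> int:
    """Register file ports needed so that every pipeline can read its
    sources and write its result in the same cycle."""
    return pipelines * (reads_per_op + writes_per_op)

one_pipe = register_file_ports(1)   # single integer pipeline
two_pipes = register_file_ports(2)  # after adding a second pipeline
```

The port count grows linearly with issue width, but this model says nothing about the area and access-time cost of each added port, which is where the real difficulty lies.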
Scheduling hardware
This can range from pretty simple (VLIW) to very complex (superscalar).
Complex hardware slows the CPU down (longer cycle time) and makes it difficult to:
Verify the design.
Include fast functional units and large caches.
Limitations specific to superscalar or VLIW
Instruction issue logic is the primary challenge with superscalar.
Technical problems with VLIW:
Increase in code size from open slots (wasted bits for unused functional units), which raises memory bandwidth requirements unnecessarily.
A stall (e.g., a cache miss) in any functional unit causes the entire processor to stall because of the lock-step operation of VLIW.
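The code-size cost of open slots can be estimated from figures given earlier in these notes: a machine with 5 issue slots that sustains 2.5 operations per clk. The 16-bit field width is the minimum mentioned earlier and is used here as an assumption; the function name is hypothetical.

```python
def wasted_slot_fraction(ops_per_clk: float, slots_per_word: int) -> float:
    """Fraction of issue slots (and hence instruction bits) left empty."""
    return 1.0 - ops_per_clk / slots_per_word

FIELD_BITS = 16  # assumed width of each per-unit field

fraction = wasted_slot_fraction(ops_per_clk=2.5, slots_per_word=5)
wasted_bits_per_word = fraction * 5 * FIELD_BITS
```

Under these assumptions half of every instruction word fetched from memory carries no useful work.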
Logistical problems with VLIW:
Binary compatibility is a problem since machines with different issue widths and functional unit latencies require different versions of the code.