
A simple processor illustrates many of the basic components used in any digital system:

Datapath: The core; all other components are support units that either store the results of the datapath or determine what happens in the next cycle.

Memory:

A broad range of classes exist determined by the way data is accessed:

Read-only vs. read-write

Sequential vs. Random access

Single-ported vs. multiported access

Or by their data retention characteristics:

Stay tuned for a more extensive treatment of memories.

Control:

An FSM (sequential circuit) implemented using random logic, PLAs, or memories.

Interconnect and Input/Output:

Parasitic resistance, capacitance and inductance affect performance of wires both on and off the chip.

Growing die size increases the length of the on-chip interconnect, increasing the value of the parasitics.

Datapath elements include adders, multipliers, shifters, BFUs, etc.

The speed of these elements often dominates the overall system performance so optimization techniques are important.

However, as we will see, the task is nontrivial since there are multiple equivalent logic and circuit topologies to choose from, each with advantages and disadvantages in terms of speed, power, and area.

Also, optimizations focused on a single design level, e.g., transistor sizing, lead to inferior designs.

Let's start with addition, since it is a very common datapath element and often a speed-limiting element.

Optimizations can be applied at the logic or circuit level.

Logic-level optimizations try to rearrange the Boolean equations to produce a faster or smaller circuit, e.g., the carry lookahead adder.

Circuit-level optimizations manipulate transistor sizes and circuit topology to optimize speed.

Let's start with some basic definitions before considering optimizations:

$G = A \cdot B$ (generate)

Occurs when a carry-out $C_o$ is generated internally within the adder (independent of $C_i$).

$P = A + B$ (propagate)

Indicates that $C_i$ is propagated (passed) to $C_o$.

$P' = A \oplus B$ (propagate)

Used in some adders for the P term since it can be reused to generate the sum term.

$D = \overline{A} \cdot \overline{B}$ (delete)

Ensures that a carry bit will be deleted at $C_o$.

The Boolean expressions for $S$ and $C_o$ are:

$S = A\,\overline{B}\,\overline{C_i} + \overline{A}\,B\,\overline{C_i} + \overline{A}\,\overline{B}\,C_i + A\,B\,C_i = A \oplus B \oplus C_i$

$C_o = A \cdot B + A \cdot C_i + B \cdot C_i$

But $S$ and $C_o$ can be written in terms of $G$ and $P'$:

$C_o(G, P') = G + P' C_i$ (or $P$ in this case).

$S(G, P') = P' \oplus C_i$

Note that $G$ and $P'$ are independent of $C_i$.

(Also, $C_o$ and $S$ can be expressed in terms of delete ($D$).)
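The definitions above can be checked exhaustively, since a full adder has only eight input combinations. The following is a minimal sketch; the function names are illustrative, and delete is taken to be the usual $D = \overline{A} \cdot \overline{B}$.

```python
from itertools import product


def full_adder(a, b, ci):
    """Return (sum, carry_out) computed from the sum-of-products forms."""
    na, nb, nci = 1 - a, 1 - b, 1 - ci
    s = (a & nb & nci) | (na & b & nci) | (na & nb & ci) | (a & b & ci)
    co = (a & b) | (a & ci) | (b & ci)
    return s, co


def check_identities():
    """Verify the G / P / P' / D formulations for all 8 input combinations."""
    for a, b, ci in product((0, 1), repeat=3):
        s, co = full_adder(a, b, ci)
        g = a & b                # generate
        p = a | b                # propagate, OR form
        px = a ^ b               # propagate, XOR form (P')
        d = (1 - a) & (1 - b)    # delete (assumed definition)
        assert s == a ^ b ^ ci
        assert co == g | (px & ci)   # Co = G + P'.Ci
        assert co == g | (p & ci)    # Co = G + P.Ci also holds
        assert s == px ^ ci          # S = P' XOR Ci
        assert g + px + d == 1       # exactly one of G, P', D is active
        if d:
            assert co == 0           # delete kills the carry
    return True


print(check_identities())
```

Note that the sum requires the XOR form $P'$, whereas the carry accepts either form of propagate.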

The critical path (worst-case delay over all possible inputs) is a ripple from lsb to msb.

The delay in this case is proportional to the number of bits, $N$, in the input words:

$t_{adder} = (N - 1)\,t_{carry} + t_{sum}$

where $t_{carry}$ and $t_{sum}$ equal the propagation delays from $C_i$ to $C_o$ and $S$, respectively.

One possible worst-case bit pattern (from lsb to msb) is:

Convince yourself that this is true.
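As an aid to that exercise, here is a small sketch of a ripple-carry adder with a simplified timing model. The model is an assumption made for illustration: all inputs arrive at time 0, a stage in generate or delete produces its carry-out after one $t_{carry}$, and a propagating stage must wait for its carry-in. The example pattern (a carry generated at the lsb and propagated through every other stage) is one known worst case and is not necessarily the pattern in the original figure.

```python
def ripple_carry_add(a_bits, b_bits, ci=0):
    """Add two lsb-first bit lists; return (sum_bits, carry_out)."""
    carry = ci
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | ((a ^ b) & carry)
    return sum_bits, carry


def adder_delay(a_bits, b_bits, t_carry=1.0, t_sum=1.0):
    """Latest sum-bit arrival time under the simplified timing model."""
    arrival = 0.0   # time at which the carry into the current stage is valid
    worst = 0.0
    for a, b in zip(a_bits, b_bits):
        worst = max(worst, arrival + t_sum)
        if a ^ b:               # propagate: carry-out waits for carry-in
            arrival += t_carry
        else:                   # generate or delete: decided locally
            arrival = t_carry
    return worst


# lsb-first: A = 1 followed by zeros, B = all ones
a = [1, 0, 0, 0, 0, 0, 0, 0]
b = [1, 1, 1, 1, 1, 1, 1, 1]
print(ripple_carry_add(a, b))
print(adder_delay(a, b, t_carry=2.0, t_sum=3.0))  # (8 - 1) * 2 + 3
```

Under this model the delay for the pattern shown equals $(N - 1)\,t_{carry} + t_{sum}$, while an all-zero input (every stage in delete) settles after only $t_{carry} + t_{sum}$.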

Note that when optimizing this structure, it is far more important to optimize $t_{carry}$ than $t_{sum}$.

The inverting property of a full adder can be used to achieve this goal:

Thus,

$\overline{S}(A, B, C_i) = S(\overline{A}, \overline{B}, \overline{C_i})$

$\overline{C_o}(A, B, C_i) = C_o(\overline{A}, \overline{B}, \overline{C_i})$
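The inverting property states that complementing all three inputs of a full adder complements both outputs. A quick exhaustive check, written as a sketch with illustrative function names:

```python
from itertools import product


def fa(a, b, ci):
    """Full adder: return (sum, carry_out)."""
    return a ^ b ^ ci, (a & b) | (a & ci) | (b & ci)


def inverting_property_holds():
    """True if inverting all inputs inverts both S and Co in every case."""
    for a, b, ci in product((0, 1), repeat=3):
        s, co = fa(a, b, ci)
        s_inv, co_inv = fa(1 - a, 1 - b, 1 - ci)
        if s_inv != 1 - s or co_inv != 1 - co:
            return False
    return True


print(inverting_property_holds())
```

This is what allows a ripple chain to alternate between true and complemented stages, so that no inverter is needed in the carry path between bits.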

One possible (unoptimized) implementation:

$C_o$ is reused in the $S$ term as:

$S = A \cdot B \cdot C_i + (A + B + C_i)\,\overline{C_o}$

Even with some design tricks, e.g., placing the transistors on the critical path ($C_i$) closest to the output and using a symmetrical design, this implementation is slow.

The load capacitance on $C_o$ in the previous version consists of 2 diffusion capacitances (inverter) and 6 gate capacitances (next bit):

This version increases the load on $C_o$ to 4 diffusion capacitances and 2 internal (sum) gate capacitances, plus the 6 gate capacitances of the next bit.

Serial addition can be used if area is a concern:

In this case, you want equal Sum and Carry delays in order to minimize clock cycle time.
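A behavioral sketch of serial addition follows: a single full adder processes one bit pair per clock cycle, and a flip-flop holds the carry between cycles. The function name and the integer-based interface are assumptions for illustration only.

```python
def serial_add(a, b, n_bits):
    """Add two unsigned integers one bit per clock cycle, lsb first.

    Returns (result, final_carry). The variable `carry` models the
    carry flip-flop, which is cleared before the first cycle.
    """
    carry = 0
    result = 0
    for cycle in range(n_bits):
        a_bit = (a >> cycle) & 1
        b_bit = (b >> cycle) & 1
        s = a_bit ^ b_bit ^ carry                            # sum output
        carry = (a_bit & b_bit) | ((a_bit ^ b_bit) & carry)  # next state
        result |= s << cycle
    return result, carry


print(serial_add(5, 3, 4))
print(serial_add(15, 1, 4))
```

Because both the sum and the next carry must settle within each cycle, the slower of the two sets the clock period, which is why balanced delays are desirable here.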

Bit-level pipelining can be used to break the dependency between addition time and the number of bits by inserting FAs between each register bit.

The total transistor count is 26. This can be reduced to 24 by using an inverter for the XNOR (see Weste and Eshraghian for an 18-transistor implementation).

Note: Sum and Carry delay times are approximately equal.

Dynamic Adder Design: np-CMOS adder

Dynamic Adder Design: Manchester carry-chain adder.

A chain of pass transistors is used to implement the carry chain.

Precharge: All intermediate nodes, e.g., $C_{o,0}$, are charged to $V_{DD}$.

Evaluate: Node $C_{o,k}$ is discharged if there is an incoming carry, $C_{i,0}$, and the previous propagate signals, $P_0$ to $P_{k-1}$, are high.

Only 4 diffusion capacitances are present per node, but the distributed RC nature of the chain results in a delay that is quadratic in the number of bits.

Buffers and/or transistor sizing can be used to improve performance.

Consider the worst-case delay of the carry chain:

Elmore delay is given by:

The delay of the RC network is then:

$t_p = 0.69\,\big(C_1 R_1 + C_2 (R_1 + R_2) + C_3 (R_1 + R_2 + R_3) + C_4 (R_1 + R_2 + R_3 + R_4) + C_5 (R_1 + R_2 + R_3 + R_4 + R_5) + C_6 (R_1 + R_2 + R_3 + R_4 + R_5 + R_6)\big)$
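The expression above follows a regular pattern: each capacitance is multiplied by the total resistance between it and the driving end of the chain. A minimal sketch of that computation is given below, with unit resistances and capacitances chosen purely for illustration.

```python
def elmore_delay(resistances, capacitances):
    """Return 0.69 * sum_i C_i * (R_1 + ... + R_i) for an RC chain.

    Both lists are ordered from the driving end of the chain outward.
    """
    total = 0.0
    upstream_r = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r          # resistance between the source and node i
        total += c * upstream_r
    return 0.69 * total


# Uniform chain with R = C = 1 (illustrative units): delay ~ N(N + 1)/2
for n in (3, 6, 12):
    print(n, elmore_delay([1.0] * n, [1.0] * n))
```

For a uniform chain the sum reduces to $0.69\,RC\,N(N+1)/2$, which makes the quadratic dependence on the number of bits explicit.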

Since $R_1$ appears 6 times in the expression, it makes sense to minimize its contribution.

Note that reducing R by a factor, e.g., k, at each stage increases the capacitance by a factor k and increases area.

A k-factor of 1.5 reduces delay by 40% and increases area by 3.5X.

Assume $A_k$ and $B_k$ (for $k = 0 \ldots 3$) are set such that all $P_k$ (propagate) are high.

In this case, an incoming carry $C_{i,0} = 1$ propagates along the complete chain and $C_{o,3} = 1$.

In other words:

if ($P_0 P_1 P_2 P_3 = 1$) then $C_{o,3} = C_{i,0}$, else either DELETE or GENERATE occurred.
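The bypass condition can be checked against a plain ripple chain. The sketch below assumes a 4-bit block and the XOR form of propagate; the function names are illustrative. With the OR form, a stage with $A = B = 1$ would also count as propagating, and forwarding the incoming carry would then be incorrect.

```python
from itertools import product


def ripple_carry_out(a_bits, b_bits, ci):
    """Carry-out of a ripple chain over lsb-first bit sequences."""
    c = ci
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | ((a ^ b) & c)
    return c


def bypass_carry_out(a_bits, b_bits, ci):
    """Carry-out with a bypass path around the whole block."""
    block_propagate = all(a ^ b for a, b in zip(a_bits, b_bits))
    if block_propagate:
        return ci                               # bypass: forward Ci directly
    return ripple_carry_out(a_bits, b_bits, ci)  # a delete or generate occurred


def bypass_matches_ripple(n=4):
    """Compare both versions over every input combination of an n-bit block."""
    for bits in product((0, 1), repeat=2 * n + 1):
        a, b, ci = bits[:n], bits[n:2 * n], bits[2 * n]
        if bypass_carry_out(a, b, ci) != ripple_carry_out(a, b, ci):
            return False
    return True


print(bypass_matches_ripple())
```

The logical result is unchanged; the benefit is in timing, since the incoming carry no longer has to ripple through a block in which every stage propagates.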

Linear carry-select adder:

One way around waiting for the incoming carry is to compute the result for both possible values in advance and let the incoming carry select the correct result.

A square-root carry-select adder (delay = $O(N^{1/2})$) is constructed by increasing the number of input bits in each block from lsb to msb.
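A behavioral sketch of carry-select addition is shown below. Each block is evaluated for carry-in 0 and carry-in 1, and the carry arriving from the previous block picks one of the two results. The `block_sizes` parameter and helper names are assumptions for illustration: equal sizes model the linear version, and sizes that grow toward the msb model the square-root version.

```python
def to_bits(x, n):
    """Unsigned integer to an lsb-first list of n bits."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """lsb-first bit list to an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_block(a_bits, b_bits, ci):
    """Ripple-carry add one block; return (sum_bits, carry_out)."""
    sums, c = [], ci
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ c)
        c = (a & b) | ((a ^ b) & c)
    return sums, c


def carry_select_add(a_bits, b_bits, block_sizes, ci=0):
    """Carry-select addition over lsb-first bit lists."""
    if not (sum(block_sizes) == len(a_bits) == len(b_bits)):
        raise ValueError("block sizes must cover the operand width")
    sums, carry, start = [], ci, 0
    for size in block_sizes:
        a_blk = a_bits[start:start + size]
        b_blk = b_bits[start:start + size]
        result_if_0 = ripple_block(a_blk, b_blk, 0)  # precomputed for Ci = 0
        result_if_1 = ripple_block(a_blk, b_blk, 1)  # precomputed for Ci = 1
        blk_sums, carry = result_if_1 if carry else result_if_0  # the mux
        sums.extend(blk_sums)
        start += size
    return sums, carry


def add_with_blocks(a, b, block_sizes):
    """Integer wrapper: return (sum modulo 2**n, carry_out)."""
    n = sum(block_sizes)
    sums, carry = carry_select_add(to_bits(a, n), to_bits(b, n), block_sizes)
    return from_bits(sums), carry


print(add_with_blocks(0xBEEF, 0x1234, [4, 4, 4, 4]))     # equal blocks
print(add_with_blocks(0xBEEF, 0x1234, [2, 2, 3, 4, 5]))  # growing blocks
```

The two block partitions produce identical sums; they differ only in how well the block delays line up with the arrival time of the selecting carry, which a behavioral model like this does not capture.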

Carry lookahead adder (avoiding the ripple altogether):

Compute the carries to each stage in parallel.

Note that the low-order terms, e.g., $P_0$ and $G_0$, appear in the expression for every bit, making the fanout load large.
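To make the parallel computation concrete, the sketch below evaluates each carry from its fully expanded expression, $c_k = G_{k-1} + P_{k-1} G_{k-2} + \dots + P_{k-1} \cdots P_0\, c_0$, so that no carry is computed from another carry. The function names are illustrative, and the XOR form of propagate is assumed so that it can be reused for the sum.

```python
def lookahead_carries(a_bits, b_bits, c0=0):
    """Return [c_0, ..., c_N] for lsb-first bit lists.

    Every carry is built directly from the G/P signals and c_0.
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate (XOR form)
    carries = [c0]
    for k in range(1, len(g) + 1):
        c_k = 0
        for j in range(k):
            term = g[j]                  # carry generated at bit j ...
            for m in range(j + 1, k):
                term &= p[m]             # ... propagated through bits j+1..k-1
            c_k |= term
        term = c0                        # incoming carry ...
        for m in range(k):
            term &= p[m]                 # ... propagated through bits 0..k-1
        carries.append(c_k | term)
    return carries


def cla_add(a, b, n_bits, c0=0):
    """Integer wrapper: return (sum modulo 2**n_bits, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    b_bits = [(b >> i) & 1 for i in range(n_bits)]
    carries = lookahead_carries(a_bits, b_bits, c0)
    sum_bits = [a_bits[i] ^ b_bits[i] ^ carries[i] for i in range(n_bits)]
    return sum(bit << i for i, bit in enumerate(sum_bits)), carries[n_bits]


print(cla_add(11, 6, 4))
```

In the inner loops, `p[0]` and `g[0]` take part in the expression for every carry, which mirrors the fanout problem noted above.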

Carry lookahead adder:

One possible implementation without using simple logic gates.

The size and fan-in of the gates limit the lookahead width to about four bits.

The logarithmic lookahead adder: $O(\log_2 N)$ delay:

The number of logic levels is proportional to $\log_2 N$, fan-in is limited, and the layout is compact (jigsaw puzzle) (see Rabaey for details).
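One way to see the logarithmic depth is a parallel-prefix formulation in which (G, P) pairs of adjacent groups are merged with $(G, P) \circ (G', P') = (G + P\,G',\ P\,P')$. The sketch below uses a Kogge-Stone style arrangement, chosen here only because it is compact to write down; it is not necessarily the structure on the original slide, and the function names are illustrative.

```python
def prefix_carries(a_bits, b_bits, c0=0):
    """Return ([c_0, ..., c_N], number_of_prefix_levels), lsb-first inputs."""
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    g[0] |= p[0] & c0            # fold the incoming carry into bit 0
    dist, levels = 1, 0
    while dist < n:
        new_g, new_p = g[:], p[:]
        for i in range(dist, n):
            # merge the group ending at bit i with the group ending at i - dist
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2
        levels += 1
    return [c0] + g, levels      # g[i] is now the carry out of bit i


def prefix_add(a, b, n_bits, c0=0):
    """Integer wrapper: return (sum modulo 2**n_bits, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    b_bits = [(b >> i) & 1 for i in range(n_bits)]
    carries, _ = prefix_carries(a_bits, b_bits, c0)
    sum_bits = [a_bits[i] ^ b_bits[i] ^ carries[i] for i in range(n_bits)]
    return sum(bit << i for i, bit in enumerate(sum_bits)), carries[n_bits]


print(prefix_carries([1] * 16, [0] * 16, 1)[1])  # levels for N = 16
print(prefix_add(200, 100, 8))
```

Every merge has only two inputs per signal, so fan-in stays bounded while the number of levels grows as $\lceil \log_2 N \rceil$.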

Magnitude Comparators:

May be built from an adder, complementer (XOR gates) and a zero detect unit.

Think about the modifications necessary to make it a signed comparator (Hint: A couple of XOR gates).
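For the unsigned case, the adder-based structure can be sketched as follows: B is complemented with XOR gates, the adder computes $A + \overline{B} + 1$, the carry-out indicates $A \ge B$, and the zero detect indicates equality. The function name and return format are assumptions for illustration; the signed variant is left as posed above.

```python
def compare_unsigned(a, b, n_bits):
    """Compare two unsigned n-bit values using an adder-style datapath."""
    mask = (1 << n_bits) - 1
    b_complement = b ^ mask            # XOR each bit of B with 1
    total = a + b_complement + 1       # A - B, with carry-in set to 1
    carry_out = (total >> n_bits) & 1  # 1 when A >= B
    difference = total & mask
    is_equal = difference == 0         # zero detect on the adder output
    a_ge_b = carry_out == 1
    return {"eq": is_equal, "gt": a_ge_b and not is_equal, "lt": not a_ge_b}


print(compare_unsigned(9, 5, 4))
print(compare_unsigned(5, 5, 4))
print(compare_unsigned(2, 7, 4))
```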

Asynchronous: Based on the Toggle register.

Not a good choice for performance and testability (with no reset).

Replace AND gate with an adder for up/down counting capability.

Weste and Eshraghian also show a version that can be initialized.

Multiplication can be broken down into two steps:

Computation of partial products.

Accumulation of the shifted partial products.
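The two steps can be sketched for unsigned operands as follows: each multiplier bit is ANDed with the multiplicand to form a partial product, which is shifted to its bit position, and the shifted rows are then summed. The function names are illustrative.

```python
def partial_products(x, y, n_bits):
    """Step 1: one shifted partial product per bit of the multiplier y."""
    rows = []
    for i in range(n_bits):
        y_bit = (y >> i) & 1
        rows.append((x if y_bit else 0) << i)  # AND with y_i, shift by i
    return rows


def multiply_unsigned(x, y, n_bits):
    """Step 2: accumulate the shifted partial products."""
    return sum(partial_products(x, y, n_bits))


print(partial_products(13, 11, 4))
print(multiply_unsigned(13, 11, 4))
```

In hardware the accumulation is where the architectures differ: the rows may be added one per cycle or all at once in an array of adders.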

Multipliers may be classified by the format in which data words are accessed:

Serial

Serial/parallel

Parallel

The parallel form computes the partial products in parallel.

Parallel Unsigned Multiplication:

Parallel Signed Multiplication:


Serial Unsigned Multiplication:

Serial/Parallel Unsigned Multiplier shown in Weste and Eshraghian.

Right/left 1-bit shifter: