Constructing Zero-deficiency Parallel Prefix Adder of Minimum Depth
Haikun Zhu, Chung-Kuan Cheng, Ronald Graham
Department of Computer Science and Engineering
La Jolla, California 92093
Abstract—Parallel prefix adder is a general technique for
speeding up binary addition. In the unit delay model, we denote
the size and depth of an n-bit prefix adder C(n) as s_C(n) and
d_C(n), respectively. Snir proved that s_C(n) + d_C(n) ≥ 2n − 2 holds
for arbitrary prefix adders. Hence, a prefix adder is said to be
of zero-deficiency if s_C(n) + d_C(n) = 2n − 2. In this paper, we
first propose a new architecture of zero-deficiency prefix adder
dubbed Z(d), which provably has the minimal depth among all
kinds of zero-deficiency prefix adders. We then design a 64-bit
prefix adder Z64, which is derived from Z(d)|d=8, and compare
it against several classical prefix adders of the same bit width
in terms of area and delay using the logical effort method. The
results show that the proposed Z(d) adder is also promising in
practical VLSI design.
Binary adders are the most fundamental modules in com-
puter arithmetic design, and consequently have been investi-
gated extensively for decades. Quite a few classic fast adders,
such as the carry-skip adder, the carry-select adder, and
the carry-lookahead adder, were proposed in the past.
However, these fast adders are ad hoc in structure, and each
of them represents a unique area-time tradeoff in the design
space. The parallel prefix adder, on the other hand, represents a
class of general adder structures that exhibit flexible area-time
tradeoffs for adder design. Therefore, identifying the exact
area-delay tradeoff curve of the parallel prefix adder is an
interesting problem that has received much research attention.
In designing parallel prefix adders, it has been popular to
assume the unit delay timing model, in which the computation
nodes are arranged in levels that represent the signal timing.
Denoting the size (i.e., the number of computation nodes) and
depth of an n-bit prefix adder C(n) as s_C(n) and d_C(n),
respectively, Snir proved that s_C(n) + d_C(n) ≥ 2n − 2 holds
for arbitrary prefix circuits. He defined the deficiency of a
prefix circuit as
def(C(n)) = s_C(n) + d_C(n) − 2n + 2
Therefore, a prefix adder is said to be of zero-deficiency if
s_C(n) + d_C(n) = 2n − 2.
Snir’s theorem indicates that the solution space for parallel
prefix adders should look like Fig. 1. For loose depth con-
straints we can observe a linear tradeoff between the depth
and size which is exhibited by zero-deficiency prefix adders.
However, if the depth constraint is too tight, the size of
the prefix adders will grow dramatically and zero-deficiency
prefix adders no longer exist. It remains an open question to
find the zero-deficiency prefix adders of minimum depth, i.e.,
to identify the curve d = f(n) shown in Fig. 1.
Various zero-deficiency prefix circuits were proposed in the
past. Among them the most notable ones are Snir’s design,
the LYD(n) circuit and the LS(n) circuit. The main purpose of
these works has been to make the depth of zero-deficiency
prefix circuits as small as possible for a given width.
Zimmermann proposed a heuristic for prefix adder optimization
using depth-controlled compression and expansion. His approach
in many cases produces depth-size optimal or near-optimal prefix
adders. However, the optimality of his results is not guaranteed.
Fig. 1. Optimal depth-size tradeoffs of parallel prefix adders. For a given width n, the maximal depth of the prefix adders is n − 1 (serial prefix adder) while the minimal depth is ⌈log n⌉ (Sklansky adder). Axes: n (bit width) and d (depth level). Curves: d = n − 1, d = f(n), d = ⌈log n⌉. Region label: zero-deficiency prefix adders (linear depth-size tradeoff).
In this paper, we propose a new kind of zero-deficiency
prefix adder called Z(d) which provably has the minimal
depth for a given width. As in previous work, we adopt the
unit delay timing model since it is simple yet easy to
extend. We design our structure from an alternative
point of view, that is, by constructing a zero-deficiency prefix
adder of maximal width for a given depth. We then design
a 64-bit prefix adder derived from Z(d)|d=8, and compare it
against several classical prefix adders in terms of area and
delay using the logical effort method. The results show
that the Z(d) adder is promising in practical VLSI design.
The remainder of the paper is organized as follows. Sec-
tion II explains how binary addition is formulated as a parallel
prefix problem. In Section III, we first give a revised proof
for Snir’s lower bound theorem s_C(n) + d_C(n) ≥ 2n − 2,
and then discuss the properties of zero-deficiency adders. Our
main contribution lies in Section IV, where a new class of
zero-deficiency prefix circuits Z(d) is proposed. Section V
focuses on area and delay analysis using logical effort, as
well as comparisons. Section VI concludes the paper.
The generalized prefix problem is to compute
y_i = x_i • x_{i−1} • ··· • x_1 for 1 ≤ i ≤ n, given n inputs
x_1, x_2, ..., x_n and an arbitrary associative operator •.
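As a minimal illustration of this definition, the following sketch computes all prefixes serially for any associative operator. The function name `prefix_scan` is ours, not the paper's, and the serial loop is only one of many circuits that realize the same outputs.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def prefix_scan(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Return [y_1, ..., y_n] with y_i = x_i • x_{i-1} • ... • x_1.

    xs[0] plays the role of x_1. The newer input is always the left
    operand, matching the order in the definition, so operators that
    are associative but not commutative are handled correctly.
    """
    ys: List[T] = []
    for x in xs:
        ys.append(x if not ys else op(x, ys[-1]))
    return ys

# Integer addition is associative, so the prefixes are running sums.
running_sums = prefix_scan([1, 2, 3, 4], lambda a, b: a + b)
# String concatenation is associative but not commutative.
suffix_first = prefix_scan(["a", "b", "c"], lambda a, b: a + b)
print(running_sums, suffix_first)
```

The serial loop has depth n − 1; the prefix circuits discussed in this paper trade extra computation nodes for smaller depth while producing the same outputs.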
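Binary addition is usually expressed in terms of the carry
generation signal g_i, carry propagation signal p_i, carry signal
c_i and sum signal s_i at each bit position (1 ≤ i ≤ n):
\begin{align*}
g_i &= a_i b_i \\
p_i &= a_i \oplus b_i \\
s_i &= p_i \oplus c_{i-1} \\
c_i &= \begin{cases} g_1 & \text{if } i = 1 \\ g_i + p_i c_{i-1} & \text{if } 2 \le i \le n \end{cases}
\end{align*}
where A = a_n a_{n−1} ··· a_1 and B = b_n b_{n−1} ··· b_1 are the two
binary numbers. The concepts of carry generation and carry
propagation can be easily extended to a block of consecutive
bits:
\begin{align*}
G_{[i:j]} &= \begin{cases} g_i & \text{if } i = j \\ G_{[i:k]} + P_{[i:k]} G_{[k-1:j]} & \text{if } n \ge i > j \ge 1 \end{cases} \\
P_{[i:j]} &= \begin{cases} p_i & \text{if } i = j \\ P_{[i:k]} P_{[k-1:j]} & \text{if } n \ge i > j \ge 1 \end{cases}
\end{align*}
0-7803-8736-8/05/$20.00 ©2005 IEEE. ASP-DAC 2005
Fig. 2. An example of a parallel prefix adder: n = 16, d = 4, s = 32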
By introducing an associative operator •, the computation of
a pair of (G,P) signals and carry signals can be rewritten as:
\begin{align*}
(G,P)_{[i:j]} &= (G,P)_{[i:k]} \bullet (G,P)_{[k-1:j]} \\
&= \left(G_{[i:k]} + P_{[i:k]} G_{[k-1:j]},\; P_{[i:k]} P_{[k-1:j]}\right) \tag{9}
\end{align*}
Therefore, the carry signal generation, namely, the (G,P)[i:1]
signal generation in terms of equation (9) is exactly a prefix
computation problem. Since the generation of gi, pi signals
and sum bits siare just local operations, the performance of
the adder is determined by the prefix circuit of generating
(G,P) signals. In the sequel, we will only show the interme-
diate prefix circuit of the prefix adders. As an example, Fig. 2
shows a 16-bit prefix circuit of depth 5. Each computation
node (i.e., black node) in the figure is a (G,P) generator.
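The formulation above can be sketched in a few lines of Python. This is an illustrative model only: the names `combine` and `prefix_add` are ours, a zero carry-in is assumed, and the prefixes are computed serially, whereas a real adder would use one of the prefix circuits discussed in this paper.

```python
from typing import List, Tuple

GP = Tuple[int, int]  # (G, P) pair for a block of bits

def combine(hi: GP, lo: GP) -> GP:
    """Operator of equation (9): hi covers bits [i:k], lo covers [k-1:j]."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def prefix_add(a: int, b: int, n: int) -> int:
    """Add two n-bit numbers through (G, P) prefixes (carry-in assumed 0)."""
    bits_a = [(a >> i) & 1 for i in range(n)]  # index 0 holds bit 1 (the LSB)
    bits_b = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(bits_a, bits_b)]
    p = [x ^ y for x, y in zip(bits_a, bits_b)]

    # Serial computation of (G, P)[i:1]; any prefix circuit could replace
    # this loop without changing the result.
    prefix: List[GP] = []
    for gp in zip(g, p):
        prefix.append(gp if not prefix else combine(gp, prefix[-1]))

    carries = [G for G, _ in prefix]   # c_i = G[i:1]
    carry_in = [0] + carries[:-1]      # c_{i-1}, with c_0 = 0
    s = [pi ^ ci for pi, ci in zip(p, carry_in)]
    total = sum(bit << i for i, bit in enumerate(s))
    return total | (carries[-1] << n)  # append the carry-out above the MSB

print(prefix_add(15, 1, 4))
```

Because `combine` is associative, the order in which blocks are merged does not affect the result, which is what lets different prefix circuits compute the same carries.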
Generally, for an n-bit prefix circuit C(n), its size s_C(n) is
defined as the number of computation nodes while its depth
d_C(n) is defined as the level of the latest prefix output. The
white nodes in the figure are duplication nodes since they do
nothing but pass the signals. In the rest of the paper when
we say “a node”, we always refer to a computation (or black)
node.
III. PROPERTIES OF THE ZERO-DEFICIENCY PREFIX ADDER
Snir’s original proof for s_C(n) + d_C(n) ≥ 2n − 2 is by
mathematical induction and does not reveal the structural
information of zero-deficiency circuits. We notice that there
are some nice ideas in  and  which can be used to devise
an elegant proof for Snir’s theorem. In this section, we polish
these ideas and prove an enhanced version of Snir’s lower
bound.
Theorem 1: Let C(n) be an n-bit prefix circuit, with its
size and depth being denoted by s_C(n) and d_C(n), respectively.
Denote the depth of its most significant bit (MSB) output by
d^M_C(n). Then
\[ s_{C(n)} + d_{C(n)} \ge s_{C(n)} + d^{M}_{C(n)} \ge 2n - 2 \]
Proof: Consider the MSB output node in C(n). This
output node is actually the root of an alphabetical tree which
is upside-down with all the input nodes being its leaves. The
size of this tree is exactly n − 1, and its depth is d^M_C(n).
Including the LSB bit, at most d^M_C(n) + 1 prefix outputs can
be obtained from this tree. For each of the columns where the
prefix results are not ready, at least one extra node is needed
to generate the output. Thus, besides the tree, we need at least
n − (d^M_C(n) + 1) nodes for the outputs. Consequently, we have
\[ s_{C(n)} \ge (n - 1) + \left(n - (d^{M}_{C(n)} + 1)\right) = 2n - 2 - d^{M}_{C(n)} \]
which leads to
\[ s_{C(n)} + d^{M}_{C(n)} \ge 2n - 2 \]
The first inequality of the theorem follows since d_C(n) ≥ d^M_C(n).
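To make the bound concrete, the sketch below builds two textbook prefix structures, checks that they really compute all prefixes, and evaluates size, depth and deficiency. The node encoding (level, column, source column), the 0-based column numbering and all function names are our own illustrative choices, not the paper's notation.

```python
from typing import List, Tuple

Node = Tuple[int, int, int]  # (level, column, source column), columns 0-based

def serial_prefix(n: int) -> List[Node]:
    """Ripple structure: column i combines with column i-1 at level i."""
    return [(i, i, i - 1) for i in range(1, n)]

def sklansky_prefix(n: int) -> List[Node]:
    """Sklansky (divide-and-conquer) structure."""
    nodes: List[Node] = []
    level = 0
    while (1 << level) < n:
        for col in range(n):
            if (col >> level) & 1:
                src = ((col >> level) << level) - 1
                nodes.append((level + 1, col, src))
        level += 1
    return nodes

def is_prefix_circuit(nodes: List[Node], n: int) -> bool:
    """Track the bit range (hi, lo) held by each column, level by level."""
    span = [(c, c) for c in range(n)]
    for level in sorted({lv for lv, _, _ in nodes}):
        prev = list(span)  # values available from the previous level
        for lv, col, src in nodes:
            if lv == level:
                if prev[col][1] != prev[src][0] + 1:
                    return False  # the two ranges are not adjacent
                span[col] = (prev[col][0], prev[src][1])
    return all(span[c] == (c, 0) for c in range(n))

def size_depth_deficiency(nodes: List[Node], n: int) -> Tuple[int, int, int]:
    size = len(nodes)
    depth = max(lv for lv, _, _ in nodes)
    return size, depth, size + depth - (2 * n - 2)

serial_stats = size_depth_deficiency(serial_prefix(16), 16)
sklansky_stats = size_depth_deficiency(sklansky_prefix(16), 16)
print(serial_stats, sklansky_stats)
```

For n = 16 the serial structure has size 15 and depth 15, so its deficiency is zero, while the Sklansky structure reaches depth 4 at size 32, giving deficiency 6. Both satisfy the bound of Theorem 1.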
We now define a few new concepts about prefix circuits.
Definition 1: For a prefix circuit C(n), the binary alpha-
betical tree generating the MSB prefix output is called the
backbone of C(n). In addition, there is another tree whose
nodes are exactly all the prefix output nodes, with the first
input node being its root. This tree is called the affiliated
tree of C(n). The common part of the backbone and the
affiliated tree, that is, the path from the first input to the MSB
output, is called the ridge of C(n).
For illustration, the backbone of the prefix circuit in Fig. 2
is enclosed by a solid line loop, while the affiliated tree is
enclosed by a dashed line loop. Their common part, i.e., the
ridge, is highlighted with a heavy line.
According to the proof of Theorem 1, it is straightforward
to derive the following corollary:
Corollary 1: A prefix circuit C(n) of depth d is of zero-
deficiency if and only if
1) The backbone of C(n) has depth d, and its size is n−1.
2) The affiliated tree of C(n) has size n − 1, and it
has exactly one node per column (excluding the LSB
column). Each node of the affiliated tree is a prefix
output.
3) The ridge has d nodes, one node per level.
IV. PROPOSED Z(d) CIRCUIT
In this section, we propose a new class of zero-deficiency
prefix circuits, called Z(d), which have the minimum depth
among all zero-deficiency prefix circuits.
We will first construct a class of parameterized trees called
Tk(t) trees which will be used to form the backbone of
the Z(d) circuit. We then define the Ak(t) trees which will
be used to form the affiliated tree of Z(d). The Z(d) circuit
is constructed by assembling Tk(t) trees and Ak(t) trees.
The Tk(t) trees are defined recursively as shown in
Fig. 3. The parameter t (≥ 1) is the depth of the Tk(t) tree while
k represents the maximum number of black nodes in a single
column. As an example, Fig. 5(a) shows the T3(5) tree whose
depth is 5 and maximum number of nodes per column is 3,
and also shows how it is composed of T2(4), T2(3), T2(2)
and T1(1). Algorithm 1 formally presents how we generate
Tk(t) trees.
Fig. 3. The recursive definition of Tk(t) tree: (a) T1(t); (b) Tk(t), 1 ≤ k ≤ t.
Fig. 4. The recursive definition of Ak(t) tree: (a) A1(t); (b) Ak(t), 1 ≤ k ≤ t.
Algorithm 1 Generation of the Tk(t) tree
T_tree_generation(int t, int k)
for i = 1 to k − 1 do
    for j = i to t − k + i do
        Construct Ti(j) according to Fig. 3
    end for
end for
Construct Tk(t) according to Fig. 3
Algorithm 2 Generation of the Ak(t) tree
A_tree_generation(int t, int k)
for i = 1 to k − 1 do
    for j = i to t − k + i do
        Construct Ai(j) according to Fig. 4
    end for
end for
Construct Ak(t) according to Fig. 4
Following nearly the same recursive way of construction as
Tk(t) trees, we define the Ak(t) trees as shown in Fig. 4. For
Ak(t) tree, the parameter t is the lateral fan-out of the root,
while k+1 is its depth. We also give an example of A3(5) tree
shown in Fig. 5(b). Similar to the structure of T3(5), A3(5)
comprises A2(4), A2(3), A2(2) and A1(1). The algorithmic
description of Ak(t) trees is presented in Algorithm 2.
It is interesting to note that, T3(5) and A3(5) can be
assembled together to form a partial prefix adder, as shown
in Fig. 5(c). If the root of A3(5) is the prefix output of bit i,
then essentially we have the prefix outputs from bit i + 1 to
i+26. Generally, we can always combine a pair of Tk(t) tree
and Ak(t) tree to form a partial prefix adder of depth k + t,
as shown in Fig. 6. This is feasible because Tk(t) tree and
Ak(t) tree have the same width, which is due to the fact that
they follow the same recursive way of definition.
Theorem 2 gives the width of Tk(t) tree and Ak(t) tree.
Actually since T-tree and A-tree are defined recursively, their
width is a two dimensional integer recurrence. Equation (12)
is obtained by deriving a closed form formula of that recur-
rence. The detailed mathematical derivation can be found in
, and is omitted here due to lack of space.
Theorem 2: The Tk(t) tree and the Ak(t) tree have the same width,
N(k,t), given by equation (12), for 1 ≤ k ≤ t.
Furthermore, the size of the Tk(t) tree is N(k,t) − 1, while the
size of the Ak(t) tree is N(k,t) with one node per column.
Fig. 5. Examples of Tk(t) and Ak(t) trees: (a) T3(5); (b) A3(5); (c) Assembling of T3(5) and A3(5).
Fig. 6. Assembling of Tk(t) tree and Ak(t) tree.
Now we are ready to introduce our new zero-deficiency
circuit Z(d). Algorithm 3 along with Fig. 7 defines the Z(d)
circuit. It can be seen that essentially Z(d) is defined over its
depth d. The width of Z(d) is given in Theorem 3. Again,
the derivation is omitted and can be found in .
Theorem 3: The width of the Z(d) circuit, which we
denote by NZ(d), is
NZ(d) = F(d + 3) − 1 for d ≥ 1
where F(k) are the natural Fibonacci numbers.
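The width formula is easy to tabulate. The sketch below assumes the indexing F(1) = F(2) = 1, which is consistent with the width of 88 quoted later for Z(d)|d=8; the function names `fib` and `z_width` are ours.

```python
def fib(k: int) -> int:
    """Fibonacci numbers with F(1) = F(2) = 1 (assumed indexing)."""
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a

def z_width(d: int) -> int:
    """Width of the Z(d) circuit according to Theorem 3."""
    return fib(d + 3) - 1

# Widths for the depth range covered by Table I.
widths = {d: z_width(d) for d in range(3, 19)}
for d, n in widths.items():
    print(d, n)
```

Since the width follows the Fibonacci sequence, it grows exponentially with the depth d, with a growth factor approaching the golden ratio per level.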
In order to prove the optimality of Z(d) circuit, we shall
first show that the Z(d) circuit is indeed of zero-deficiency,
and then prove that it does have the minimal depth among
all possible zero-deficiency circuits. These two facts are
presented in Theorem 4 and Theorem 5 respectively. For
Theorem 4, the proof is relatively simple. We just count the
number of computation nodes in Z(d) and verify it satisfies
the definition of zero-deficiency. For Theorem 5, the proof is
by contradiction. The basic idea is to show that, given a prefix
circuit of depth d, if its ridge spans wider than that of Z(d),
it cannot be of zero-deficiency. The detailed
proofs are presented in .
Theorem 4: The parallel prefix circuit Z(d) shown in
Fig. 7 is of zero-deficiency.
Theorem 5: Z(d) has the maximum width for a given
depth d among all zero-deficiency prefix circuits.
As an example, Fig. 8 shows the Z(d) circuit for d = 8.
It is now clear that the curve of function d = f(n) is
just the inverse function of NZ(d) = F(d + 3) − 1 given
in Theorem 3. Table I shows the widths of Z(d) circuits for
3 ≤ d ≤ 18, with the results of Lin’s design and the
LYD(n) circuit listed for comparison. The numbers are
read as the maximal widths up to which zero-deficiency prefix
circuits of the specified type and depth can be constructed.
Clearly, our design dominates the other two, especially when
the width is large.
Fig. 7. A new class of zero-deficiency prefix circuits Z(d). Tree labels: T1(d − 1), T2(d − 2), ... and A1(d − 1), A2(d − 2), ....
Fig. 8. Example of Z(d) circuits: Z(d)|d=8.
Algorithm 3 Generation of the Z(d) circuit
Z_circuit_generation(int d) // d is the depth
for i = 1 to ⌈d/2⌉ − 1 do
    T_tree_generation(d − i, i); // call Algorithm 1
    A_tree_generation(d − i, i); // call Algorithm 2
end for
for i = ⌊d/2⌋ to 1 do
    T_tree_generation(i, i);
    A_tree_generation(i, i);
end for
Stitch all the T, A trees together to form Z(d) as shown in Fig. 7.
Since the Z(d) adder is provably optimal, it also
improves on the result produced by Zimmermann’s
algorithm. For example, given width 54 and depth 7, our
design has 99 nodes, while Zimmermann’s algorithm gives a
design of 104 nodes.
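The numbers in this example can be checked with a short calculation. The sketch below is only a sanity check under stated assumptions: Fibonacci indexing F(1) = F(2) = 1, a depth of exactly 7 for the 104-node design, and the premise that the minimum zero-deficiency depth for width n is the smallest d with NZ(d) ≥ n. All function names are ours.

```python
def z_width(d: int) -> int:
    """N_Z(d) = F(d + 3) - 1 with F(1) = F(2) = 1 (assumed indexing)."""
    a, b = 1, 1
    for _ in range(d + 2):
        a, b = b, a + b
    return a - 1

def min_zero_deficiency_depth(n: int) -> int:
    """Smallest d whose Z(d) is at least n bits wide (inverse of N_Z)."""
    d = 1
    while z_width(d) < n:
        d += 1
    return d

def zero_deficiency_size(n: int, d: int) -> int:
    """Node count forced by s + d = 2n - 2."""
    return 2 * n - 2 - d

def deficiency(size: int, depth: int, n: int) -> int:
    return size + depth - (2 * n - 2)

print(min_zero_deficiency_depth(54), zero_deficiency_size(54, 7))
print(deficiency(104, 7, 54))
```

Under these assumptions a width of 54 needs depth 7 and exactly 99 nodes, so a 104-node design of the same depth has deficiency 5.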
V. ANALYSIS AND COMPARISON
In this section, we compare a Z(d)-derived zero-deficiency
adder with several classical prefix adders in terms of both
delay and area, for 64-bit width. The selected prefix adders
include the Sklansky adder, the Brent-Kung (BK) adder, the
Kogge-Stone (KS) adder and the Han-Carlson (HC) adder.
These adders are well known to be fast because of small
logic depth or regular layout.
TABLE I. WIDTHS OF LS(n), LYD(n) AND Z(d) CIRCUITS.
We use the logical effort method for fast estimation of the adder
delay. The logical effort method is a shorthand for the RC delay
model yet provides reasonable accuracy. For a single-stage
gate, the logical effort method measures its delay in units of τ, which
is the intrinsic delay of an ideal inverter:
\[ D = D_{abs}/\tau = gh + p \]
where g is the logical effort of the gate, which is the ratio
of the input capacitance of the gate to the input capacitance
of an inverter with the same unit effective resistance; h is
the electrical effort of the gate, which is the ratio of load
capacitance to input capacitance; and p characterizes the parasitic
delay of the gate. Note that by incorporating wire capacitance
into h, the interconnect delay, as well as the fan-out effect, can be
easily considered. The overall path delay is simply the sum
of the gate delays:
\[ D = \sum_i g_i h_i + \sum_i p_i \]
The first term is called the path effort delay while the second term
is called the path parasitic delay.
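The delay model above amounts to a few lines of arithmetic. The following sketch uses illustrative stage values only: an inverter with g = 1 and p = 1 is the usual textbook normalization and is an assumption here, not a value taken from Table II. The names `stage_delay` and `path_delay` are ours.

```python
from typing import Iterable, Tuple

Stage = Tuple[float, float, float]  # (g, h, p) for one gate on the path

def stage_delay(g: float, h: float, p: float) -> float:
    """Delay of a single stage in units of tau: D = g*h + p."""
    return g * h + p

def path_delay(stages: Iterable[Stage]) -> Tuple[float, float, float]:
    """Return (path effort delay, path parasitic delay, total delay)."""
    stages = list(stages)
    effort = sum(g * h for g, h, _ in stages)
    parasitic = sum(p for _, _, p in stages)
    return effort, parasitic, effort + parasitic

# Assumed example: two inverters (g = 1, p = 1), each driving a load of
# four times its own input capacitance.
fo4 = stage_delay(1.0, 4.0, 1.0)
two_stage = path_delay([(1.0, 4.0, 1.0), (1.0, 4.0, 1.0)])
print(fo4, two_stage)
```

Wire capacitance enters this model only through h: a longer horizontal wire increases the load capacitance seen by the driving gate and therefore its electrical effort.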
In this study, we exactly follow the experimental settings in
. We assume that the (P,G) generators are implemented
using inverting static CMOS, as shown in Fig. 9. The transis-
tors are sized such that each pull down stack has unit effective
resistance. Note that there are two kinds of (P,G) generators.
The black cells in Fig. 9(a) calculate both P and G signals
while the gray cells in Fig. 9(b) only calculate G signals.
Both black cells and gray cells have two versions of opposite
polarity.
Table II lists the logical efforts, parasitic delays and circuit
area of black and gray cells. The parasitic delay is estimated
by counting the total transistor width on the output node. The
cell area is calculated by summing up all the transistor areas in
the cell in unit squares. Since alternating stages of the prefix
network use alternating polarities of inputs and outputs, all the
values in Table II are the average of the two polarities.
In analyzing the delay, two basic assumptions are made as
they were in .
• The wires are short enough so that distributed RC delay
can be neglected. Thus the wires are only considered as
capacitive load. This assumption is supported by .
• Vertical wires are short enough to be neglected. The
horizontal wire capacitance is measured as w units per
column spanned. It is estimated that w ≈ 0.5 for a
0.18 µm technology. For KS and HC adders, there are
a lot of parallel wires which have significant coupling
capacitance. Therefore, for these two adders w = 1 is used.
Fig. 9. Inverting CMOS logic: (a) black cell; (b) gray cell; (c) inverter.
TABLE II. LOGICAL EFFORT, PARASITIC DELAY AND AREA OF ADDER CIRCUIT. LE = Logical Effort, PD = Parasitic Delay, A = Area.
Fig. 10. Parallel prefix adders: (a) Sklansky; (b) Kogge-Stone; (c) Brent-Kung; (d) Han-Carlson.
For illustration purposes, Fig. 10 shows 16-bit Sklansky,
BK, KS and HC adders with critical paths identified. A few
buffers are inserted to decouple the capacitance load from the
critical path. These buffers have half the drive of an ordinary
gate and hence half the input capacitance. Note that in 
a fixed critical path is specified for a given adder structure
of various bit widths. This hardly holds because, as the bit
width increases, the critical paths vary due to increased wire
capacitance. Instead, we analyze the critical paths of 64-bit
adders by hand, and list them in Table III.
The 64-bit Z(d)-derived prefix adder, denoted as Z64 for
short, is generated as follows:
1) Generate Z(d)|d=8, whose width is 88;
2) Scan the nodes of Z(d)|d=8 one by one from level 0 to
level 8, and column 1 to column 88. Do the following
recursively for a selected node:
(a) If the fan-out of the node is larger than 4, slide
its branch of least lateral connection down by one level;
otherwise skip to the next node;
(b) If some nodes on the sliding-down branch exceed
level 8, the entire columns where the exceeding nodes
reside are deleted. Do local connection adjustment if
needed. After this step, a 72-bit prefix network whose
largest fan-out is limited to 4 is generated;
3) Discard the eight MSB columns, yielding a 64-bit prefix
network;
4) Further optimize the prefix adder by inserting buffers to
decouple load capacitance from the critical path.
TABLE III. CRITICAL PATHS OF VARIOUS 64-BIT ADDERS. (i,j)b denotes a black cell at level i, column j, similar for (i,j)g.
TABLE IV. DELAY AND AREA OF VARIOUS ADDERS.
Fig. 11 shows Z64 with the critical path highlighted with a heavy
line. As an example, we calculate its delay and area as
follows:
\begin{align*}
D_F &= 4\,(LE_{gray}^{gl} + LE_{buf}) + (2\,LE_{gray}^{gl} + LE_{buf}) \\
&\quad + (3\,LE_{gray}^{gl} + LE_{buf}) + (LE_{gray}^{gl} + LE_{buf}) \\
&\quad + (3\,LE_{gray}^{gl} + LE_{buf}) \\
&= 13\,LE_{gray}^{gl} + 8\,LE_{buf} = 30 \\
D_P &= 8\,PD_{gray} = 20 \\
D_{wire} &= 63\,w = 63 \times 0.5 = 31.5 \\
D_{total} &= D_F + D_P + D_{wire} = 81.5 \\
Area &= (\#\text{black cells}) \cdot A_{black} + (\#\text{gray cells}) \cdot A_{gray}
\end{align*}
The other four adder structures are evaluated in the same
way, and the results are listed in Table IV.
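The bookkeeping in this calculation can be replayed in a few lines. The sketch below only re-derives the coefficient counts and the totals from the figures quoted above; it does not use the individual logical effort values from Table II, and the variable names are ours.

```python
# Each bracketed term of D_F, written as
# (number of gray-cell efforts, number of buffer efforts).
terms = [(4, 4), (2, 1), (3, 1), (1, 1), (3, 1)]
gray_count = sum(gray for gray, _ in terms)
buf_count = sum(buf for _, buf in terms)

d_effort = 30.0       # D_F as quoted in the text
d_parasitic = 20.0    # D_P as quoted in the text
w = 0.5               # wire capacitance per column spanned
columns_spanned = 63
d_wire = columns_spanned * w
d_total = d_effort + d_parasitic + d_wire

# Per-cell parasitic delay implied by D_P = 8 * PD_gray = 20.
pd_gray = d_parasitic / 8

print(gray_count, buf_count, d_wire, d_total, pd_gray)
```

The counts come out as 13 gray-cell efforts and 8 buffer efforts, matching the collected form of D_F, and the three components add up to the quoted total of 81.5.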
Clearly Z64 and BK adders are the best two of the five in
terms of both delay and area. Sklansky’s problem is that the
fan-out grows exponentially as logic depth increases, resulting
in a large delay. KS and HC adders, on the other hand, suffer
from high coupling capacitance (w=1). The area of Sklansky,
KS and HC is much larger than that of Z64 and BK.
Compared to the BK adder, Z64 has smaller logic depth but
larger fan-out. These two factors effectively cancel out so that
BK and Z64 adders have nearly the same delay. However, we
conjecture that the Z64 adder is more power efficient than the BK
adder. The reason is that circuit cells at deep logic levels
tend to have a high activity rate and hence consume more power.
A detailed power analysis of the proposed Z(d) circuit is
left as future work.
In this paper, we have proposed a new class of zero-
deficiency prefix adder Z(d) which has the minimal depth