
A Family of Parallel-Prefix Modulo $2^n - 1$ Adders

G. Dimitrakopoulos†, H. T. Vergos†‡, D. Nikolos†‡, and C. Efstathiou§

†Computer Engineering and Informatics Dept., University of Patras, 26500 Patras, Greece

‡Computer Technology Institute, 3 Kolokotroni Str., 26221 Patras, Greece

§Informatics Dept., TEI of Athens, 12210 Egaleo, Athens, Greece.

Abstract

In this paper we first reveal the cyclic nature of idempotency in the case of modulo $2^n - 1$ addition. Then, based on this property, we derive for each $n$ a family of minimum-logic-depth modulo $2^n - 1$ adders, which allows several trade-offs between the number of operators, the internal wire length, and the fanout of internal nodes. Performance data, gathered using static CMOS implementations, reveal that the proposed architectures outperform all previously reported ones in terms of area and/or operation speed.

1 Introduction

Modulo $2^n - 1$ addition, or equivalently one's complement addition, plays an essential role in Residue Number System (RNS) applications [1], in fault-tolerant computer systems [2], in error detection in computer networks [3], and in floating-point arithmetic [4], [5].

In RNS each operand is encoded as a vector of residues, computed with respect to a set of $M$ pairwise relatively prime integers called the moduli. The latter form a set $W = \{m_1, m_2, \ldots, m_M\}$, which is called the base of the RNS. All integers $A, B$ with $0 \le A, B < \prod_{i=1}^{M} m_i$ have a unique RNS representation $A \overset{RNS}{\longleftrightarrow} \{A_1, A_2, \ldots, A_M\}$ and $B \overset{RNS}{\longleftrightarrow} \{B_1, B_2, \ldots, B_M\}$, where $A_i = \langle A \rangle_{m_i}$, $B_i = \langle B \rangle_{m_i}$ for $i = 1, 2, \ldots, M$, and $\langle x \rangle_{m_i}$ denotes the residue of $x$ modulo $m_i$. Multiplication, addition, and subtraction in RNS are described by $Z = A \diamond B \overset{RNS}{\longleftrightarrow} \{Z_1, Z_2, \ldots, Z_M\}$, where $Z_i = \langle A_i \diamond B_i \rangle_{m_i}$ and $\diamond$ denotes any of the aforementioned operations. Significant speedup over the corresponding binary operations can be achieved, since the $Z_i$ are computed in parallel, each in a separate arithmetic unit (channel), because their computation depends only on $A_i$, $B_i$, and $m_i$, and no carry propagation among the channels is required. Among the most popular three-moduli bases are $\{2^n, 2^n - 1, 2^n + 1\}$ and $\{2^n, 2^n - 1, 2^{n-1} - 1\}$ [6]–[8]. Therefore, a modulo $2^n - 1$ adder is essential in the most popular RNS implementations.

Modulo $2^n - 1$ adders also find wide applicability in fault-tolerant computer systems. They are commonly used for implementing residue, inverse residue, product (AN), and checksum arithmetic codes. For low-cost implementations of such codes, modulo $2^n - 1$ adders are used both for the encoding and the checking process [2]. Such codes are also used extensively in checksum computation and error detection in TCP/IP networks [3].

Recently, one's complement addition has also been employed in the design of high-speed floating-point arithmetic units [4], [5]. In [4] a 161-bit end-around carry (EAC) adder was used in the design of a single-pass floating-point multiplier, while in the dual-pass version of the design it was substituted by an 81-bit EAC adder. Additionally, Pillai et al. [5] have presented a triple-datapath architecture for floating-point addition, which employs 1's complement adders and offers significant savings in power dissipation.

Proceedings of the Application-Specific Systems, Architectures, and Processors (ASAP’03)

ISBN0-7695-1992-X/03 $17.00 © 2003 IEEE
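As a concrete illustration of the RNS encoding described earlier in this section, the following Python sketch encodes two integers over the base $\{2^n, 2^n - 1, 2^n + 1\}$ with $n = 4$, operates on each channel independently, and decodes the result. The function names `to_rns`, `rns_apply`, and `from_rns` are hypothetical, and the decoding step uses the standard Chinese Remainder Theorem, which is an assumption of this sketch rather than something the paper discusses. It requires Python 3.8 or later for `math.prod` and the modular inverse form of `pow`.

```python
import operator
from math import prod


def to_rns(x, moduli):
    """Encode x as its vector of residues, one per modulus (channel)."""
    return [x % m for m in moduli]


def rns_apply(xs, ys, moduli, op):
    """Apply op channel by channel; no information crosses between channels."""
    return [op(x, y) % m for x, y, m in zip(xs, ys, moduli)]


def from_rns(residues, moduli):
    """Decode with the Chinese Remainder Theorem (assumes pairwise coprime moduli)."""
    total_range = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = total_range // m
        total += r * partial * pow(partial, -1, m)  # modular inverse of partial mod m
    return total % total_range


if __name__ == "__main__":
    n = 4
    base = [2**n, 2**n - 1, 2**n + 1]  # {16, 15, 17}
    a, b = 100, 37
    for name, op in (("add", operator.add), ("sub", operator.sub), ("mul", operator.mul)):
        z = rns_apply(to_rns(a, base), to_rns(b, base), base, op)
        print(name, z, from_rns(z, base))
```

The results are only meaningful while the true result stays inside the dynamic range $[0, \prod m_i)$, which is 4080 for this base.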

Several solutions to the fast modulo $2^n - 1$ adder design problem have already been proposed. Single- and two-level carry-lookahead modulo $2^n - 1$ adders have been presented in [9]. To achieve even higher speeds, parallel-prefix carry-computation design approaches have been adopted in [10]–[14]. In [10] and [12] the required end-around carry operation is achieved by feeding back the carry output of a parallel-prefix integer addition unit via an extra prefix level. This technique, apart from adding an extra prefix level, also suffers from the unlimited fanout loading problem at the re-entering carry line. In [11] it was shown that modulo $2^n - 1$ addition can be performed by recirculating the generate and propagate signals, instead of the traditional end-around carry approach. In this way, the need for an additional prefix level is eliminated, and parallel-prefix modulo $2^n - 1$ adders with minimum logic depth (equal to that of the fastest integer parallel-prefix adders) are derived. Although the fundamental theory and a general architecture were presented in [11], no straightforward design method was given when $n \ne 2^k$. This task was left to the intuition of the designer. Extending the work of [11], a method was given in [14] to produce Kogge-Stone-like modulo $2^n - 1$ adders for every $n$. Finally, in [13], parallel-prefix adders similar to those of [11], using prefix operators of valency greater than or equal to 2, were presented, only for the case that $n$ is of the form $2^k$.

In this paper a novel carry-computation architecture for parallel-prefix modulo $2^n - 1$ adders, for arbitrary values of $n$, is introduced. The proposed architecture actually describes a whole family of adders, which exhibits minimum logic depth and small operator count. First, the basic theory of idempotency is extended to the case of modulo $2^n - 1$ addition, and it is shown that terms produced in a parallel-prefix tree can be further associated in a circular manner. Then, a systematic methodology for designing a family of modulo $2^n - 1$ adders is presented. Finally, static CMOS implementations are used to compare the proposed structures against previously reported parallel-prefix modulo $2^n - 1$ adders. Experimental results indicate that several members of the proposed family of modulo $2^n - 1$ adders can significantly reduce the area penalty of previously reported designs [14], while maintaining the highest speed compared to [10].

The rest of the paper is organized as follows. Some background material on parallel-prefix addition and the notation used are given in Section 2. The extension of the idempotency property is introduced in Section 3, while the family of modulo $2^n - 1$ adders is presented in Section 4. Quantitative results are presented in Section 5. Finally, conclusions are drawn in Section 6.

2 Background and Definitions

Suppose that $A = a_{n-1}a_{n-2}\ldots a_0$ and $B = b_{n-1}b_{n-2}\ldots b_0$ represent the two numbers to be added and $S = s_{n-1}s_{n-2}\ldots s_0$ their sum. An adder can be considered as a three-stage circuit. The preprocessing stage computes the carry-generate bits $g_i$, the carry-propagate bits $p_i$, and the half-sum bits $h_i$, for every $i$, $0 \le i \le n - 1$, according to:

$$g_i = a_i \cdot b_i, \qquad p_i = a_i + b_i, \qquad h_i = a_i \oplus b_i,$$

where ·, +, and ⊕ denote the logical AND, OR, and exclusive-OR operations respectively. The second stage of the adder, hereafter called the carry-computation unit, computes the carry signals $c_i$, for $0 \le i \le n - 1$, using the carry-generate and carry-propagate bits $g_i$ and $p_i$. The third stage computes the sum bits according to:

$$s_i = h_i \oplus c_{i-1}.$$
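The three stages above can be sketched in Python for ordinary integer addition. The preprocessing and sum stages follow the equations directly; the carry-computation unit is replaced here by the textbook ripple recurrence $c_i = g_i + p_i \cdot c_{i-1}$ with $c_{-1} = 0$, which is an assumption of this sketch chosen for brevity and not the parallel-prefix unit studied in the paper. The function names are hypothetical.

```python
def preprocess(a, b, n):
    """Stage 1: per-bit generate, propagate and half-sum lists (index 0 is the LSB)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x | y for x, y in zip(a_bits, b_bits)]
    h = [x ^ y for x, y in zip(a_bits, b_bits)]
    return g, p, h


def ripple_carries(g, p):
    """Stage 2 (assumed stand-in): c_i = g_i + p_i * c_{i-1}, with c_{-1} = 0."""
    carries, c = [], 0
    for g_i, p_i in zip(g, p):
        c = g_i | (p_i & c)
        carries.append(c)
    return carries


def add_integer(a, b, n):
    """Stage 3: s_i = h_i xor c_{i-1}. Returns the n-bit sum and the carry out."""
    g, p, h = preprocess(a, b, n)
    c = ripple_carries(g, p)
    s = 0
    for i in range(n):
        carry_in = c[i - 1] if i > 0 else 0
        s |= (h[i] ^ carry_in) << i
    return s, c[n - 1]
```

Any carry-computation unit that produces the same $c_i$ can be substituted for `ripple_carries` without touching the other two stages, which is why the remainder of the paper concentrates on that unit alone.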


[Figure 1. The (a) Kogge-Stone and (b) Ladner-Fischer Parallel-Prefix Structures. Both panels show bit columns 0 to 7 and prefix levels 0 to 3; panel (a) marks the operator (1,5).]

Since in all adders the first and the third stage are identical, in the following we concentrate on the carry-computation unit. Several algorithms have been proposed for the carry-computation problem. Carry computation is transformed into a prefix problem using the operator ◦, which associates pairs of generate and propagate signals and was defined in [15] as follows,

$$(g, p) \circ (g', p') = (g + p \cdot g',\; p \cdot p'). \tag{1}$$

In a series of associations of consecutive generate/propagate pairs $(g, p)$ the notation $(G_{k:j}, P_{k:j})$, $k > j$, is used to denote the group generate/propagate term produced out of bits $k, k - 1, \ldots, j$, that is,

$$(G_{k:j}, P_{k:j}) = (g_k, p_k) \circ (g_{k-1}, p_{k-1}) \circ \ldots \circ (g_{j+1}, p_{j+1}) \circ (g_j, p_j). \tag{2}$$

Although the ◦ operator is not commutative, the idempotency property [16],

$$(G_{i:j}, P_{i:j}) = (G_{i:k}, P_{i:k}) \circ (G_{m:j}, P_{m:j}) \tag{3}$$

holds for it, where $i > m \ge k > j$.

We define as length of a group generate/propagate term (or simply length) the number of distinct generate/propagate pairs $(g_k, p_k)$ that have been associated for its computation. The length of the group generate/propagate term $(G_{k:j}, P_{k:j})$ is obviously $k - j + 1$, $k > j$. When two group signals are further associated, the result will have a length equal to the sum of the lengths of the two operands minus any overlapping terms due to idempotency.

The parallel-prefix structures proposed by Kogge-Stone [17] and Ladner-Fischer [18] for an 8-bit carry-computation unit are shown in Figure 1. The operator ◦ is represented as a node (•), while white nodes are buffering nodes. In any parallel-prefix graph we will number the prefix levels from 0 (the $(g, p)$ signal-pair level) up to $m$ (the level that produces the carries) and the bit columns from 0 up to $n - 1$. Since the • nodes are placed on a grid of rows and columns, we can refer to any of them by the pair (prefix level, bit column). For example, in Figure 1(a) the operator (1,5) is marked. The prefix structures proposed by Kogge-Stone [17], Ladner-Fischer [18], and Knowles [19] are of special practical interest, since they offer minimum logic-depth solutions to the carry-computation problem.
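The prefix operator, the group terms, and a Kogge-Stone style carry computation can be sketched together in Python. The sketch assumes integer addition with no carry in, so that the carry out of bit $i$ is $G_{i:0}$; the level structure in `kogge_stone` (level $l$ combines column $i$ with column $i - 2^{l-1}$) is the usual textbook description of the Kogge-Stone graph and is not taken from Figure 1 itself. All function names are hypothetical.

```python
from functools import reduce


def op(left, right):
    """Prefix operator: (g, p) o (g', p') = (g + p*g', p*p')."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def pairs_from(a, b, n):
    """Bit-level (g_i, p_i) pairs of two n-bit operands, index 0 is the LSB."""
    return [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1) for i in range(n)]


def group(pairs, k, j):
    """(G_{k:j}, P_{k:j}) for k >= j: associate bits k, k-1, ..., j in that order."""
    return reduce(op, (pairs[t] for t in range(k, j - 1, -1)))


def kogge_stone(pairs):
    """Integer carries c_i = G_{i:0}, computed in ceil(log2 n) prefix levels."""
    n = len(pairs)
    current = list(pairs)
    distance = 1
    while distance < n:
        # Every column i >= distance absorbs the term that ends just below it.
        current = [op(current[i], current[i - distance]) if i >= distance else current[i]
                   for i in range(n)]
        distance *= 2
    return [g for g, _ in current]
```

For instance, `group(pairs, 4, 0)` and `op(group(pairs, 4, 1), group(pairs, 2, 0))` give the same pair for any input, because the overlapping bits 2 and 1 are absorbed by idempotency.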

3 The Case of Modulo $2^n - 1$ Addition

In this section the basic theory introduced in [11] is revisited, and a novel property of idempotency is introduced. According to [11], in the case of modulo $2^n - 1$ addition each carry $c_i$, $0 \le i \le n - 1$, is produced by combining the carry generate and propagate pairs using the formula,

$$G_i = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \ldots \circ (g_0, p_0) \circ (g_{n-1}, p_{n-1}) \circ \ldots \circ (g_{i+1}, p_{i+1}), \tag{4}$$

where $c_i = G_i$ and $c_{-1} = c_{n-1}$. It should be noted that, in contrast to integer addition, the number of pairs $(g_k, p_k)$ that have to be associated for the generation of each carry is equal to $n$.

Due to (4), the definition of a group generate/propagate term $(G_{k:j}, P_{k:j})$ is extended here to the case where $k < j$, $0 \le k, j \le n - 1$, and is defined as,

$$(G_{k:j}, P_{k:j}) = (G_{k:0}, P_{k:0}) \circ (G_{n-1:j}, P_{n-1:j}). \tag{5}$$

Therefore, in the general case ($k > j$ or $k < j$), the length of a group generate/propagate term $(G_{k:j}, P_{k:j})$ is equal to $\langle k - j + 1 \rangle_n$. Assuming an intermediate index $k$, $0 \le k \le n - 1$, each carry $c_i$ of (4) can be expressed as,

$$G_i = (G_{i:k}, P_{i:k}) \circ (G_{k-1:i+1}, P_{k-1:i+1}). \tag{6}$$

The following theorem reveals the cyclic nature of the idempotency property in the case of modulo $2^n - 1$ addition, and is used as the basis for the design of the family of adders proposed in this paper.
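Eq. (4) can be checked numerically with a short Python sketch. Each carry is obtained by associating all $n$ pairs in the circular order $i, i - 1, \ldots, 0, n - 1, \ldots, i + 1$, and the sum bits use $c_{-1} = c_{n-1}$. The function names are hypothetical, and the sequential `reduce` merely evaluates the formula; it says nothing about how a hardware prefix tree would be organised.

```python
from functools import reduce


def op(left, right):
    """Prefix operator: (g, p) o (g', p') = (g + p*g', p*p')."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def modular_carries(a, b, n):
    """Carries c_i of Eq. (4): all n pairs are associated, starting at bit i and wrapping."""
    pairs = [(((a >> t) & (b >> t)) & 1, ((a >> t) | (b >> t)) & 1) for t in range(n)]
    carries = []
    for i in range(n):
        order = [(i - t) % n for t in range(n)]  # i, i-1, ..., 0, n-1, ..., i+1
        group_generate, _ = reduce(op, (pairs[t] for t in order))
        carries.append(group_generate)
    return carries


def add_mod(a, b, n):
    """n-bit sum with s_i = h_i xor c_{i-1}, where c_{-1} = c_{n-1}."""
    h = a ^ b
    c = modular_carries(a, b, n)
    s = 0
    for i in range(n):
        s |= (((h >> i) & 1) ^ c[i - 1]) << i  # c[-1] is the last element, i.e. c_{n-1}
    return s
```

For example, `add_mod(5, 3, 3)` returns 1, matching $(5 + 3) \bmod 7$. The result is congruent to $A + B$ modulo $2^n - 1$, but the all-ones word can appear as a second representation of zero, as is usual for one's complement addition.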

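Theorem 1. Let

$$(G_{i:k}, P_{i:k}) = \begin{cases} (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \ldots \circ (g_k, p_k), & \text{if } i \ge k \\ (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \ldots \circ (g_0, p_0) \circ (g_{n-1}, p_{n-1}) \circ \ldots \circ (g_k, p_k), & \text{if } i < k. \end{cases}$$

Then it holds that

$$(G_{i:k}, P_{i:k}) \circ (G_{k-1:i+1}, P_{k-1:i+1}) \circ (G_{i:k}, P_{i:k}) = (G_{i:k}, P_{i:k}) \circ (G_{k-1:i+1}, P_{k-1:i+1}).$$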

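Proof. Unrolling the prefix operator ◦ it follows that,

$$\begin{aligned}
&(G_{i:k}, P_{i:k}) \circ (G_{k-1:i+1}, P_{k-1:i+1}) \circ (G_{i:k}, P_{i:k}) \\
&\quad = (G_{i:k} + P_{i:k} \cdot G_{k-1:i+1},\; P_{i:k} \cdot P_{k-1:i+1}) \circ (G_{i:k}, P_{i:k}) \\
&\quad = (G_{i:k} + P_{i:k} \cdot G_{k-1:i+1} + P_{i:k} \cdot P_{k-1:i+1} \cdot G_{i:k},\; P_{i:k} \cdot P_{k-1:i+1} \cdot P_{i:k}) \\
&\quad = (G_{i:k}(1 + P_{i:k} \cdot P_{k-1:i+1}) + P_{i:k} \cdot G_{k-1:i+1},\; P_{i:k} \cdot P_{k-1:i+1}) \\
&\quad = (G_{i:k} + P_{i:k} \cdot G_{k-1:i+1},\; P_{i:k} \cdot P_{k-1:i+1}) \\
&\quad = (G_{i:k}, P_{i:k}) \circ (G_{k-1:i+1}, P_{k-1:i+1}).
\end{aligned}$$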
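The algebra in the proof does not depend on which bits the two group terms cover, so the identity can be checked exhaustively over all Boolean values of the two operand pairs. The Python sketch below does this for the 16 combinations of $X = (G, P)$ and $Y = (G', P')$, and also confirms that the operator is associative, which the unrolling implicitly relies on. The function names are hypothetical.

```python
from itertools import product


def op(left, right):
    """Prefix operator: (g, p) o (g', p') = (g + p*g', p*p')."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


# Every Boolean (generate, propagate) pair, including (1, 0), which a real
# adder never produces but which the algebraic identity still covers.
ALL_PAIRS = list(product((0, 1), repeat=2))


def absorption_holds():
    """X o Y o X == X o Y for every choice of X and Y."""
    return all(op(op(x, y), x) == op(x, y) for x in ALL_PAIRS for y in ALL_PAIRS)


def associativity_holds():
    """(X o Y) o Z == X o (Y o Z) for every choice of X, Y and Z."""
    return all(op(op(x, y), z) == op(x, op(y, z))
               for x in ALL_PAIRS for y in ALL_PAIRS for z in ALL_PAIRS)
```

Both functions return `True`, so the identity of Theorem 1 holds for any two terms placed in the roles of $(G_{i:k}, P_{i:k})$ and $(G_{k-1:i+1}, P_{k-1:i+1})$.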

Theorem 1 simplifies group generate/propagate terms of length greater than $n$ to terms of length equal to $n$. For example, assume the case of a modulo $2^5 - 1$ adder. The association of $(G_{4:1}, P_{4:1})$ (a length-4 term) with $(G_{1:3}, P_{1:3})$ (a length-4 term) is expected to lead to a group term of length 7, since under the normal definition of idempotency only the overlapping term $(g_1, p_1)$ can be simplified. However, due to Theorem 1 the resulting term is $(G_{4:0}, P_{4:0})$, which is a length-5 term, since,

$$\begin{aligned}
(G_{4:1}, P_{4:1}) \circ (G_{1:3}, P_{1:3}) &= (G_{4:3}, P_{4:3}) \circ (G_{2:1}, P_{2:1}) \circ (G_{1:0}, P_{1:0}) \circ (G_{4:3}, P_{4:3}) \\
&= (G_{4:3}, P_{4:3}) \circ (G_{2:0}, P_{2:0}) \circ (G_{4:3}, P_{4:3}) \\
&= (G_{4:3}, P_{4:3}) \circ (G_{2:0}, P_{2:0}) = (G_{4:0}, P_{4:0}).
\end{aligned}$$

Theorem 1 is an extension of the basic idempotency property presented in [16]. The assumption that two group generate/propagate terms must meet or overlap in order to be associated can also be considered in a circular manner. Figure 2 explains the circular meet-or-overlap for the case of $(G_{4:1}, P_{4:1})$ and $(G_{1:3}, P_{1:3})$.
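The $n = 5$ example can be confirmed by brute force. The Python sketch below builds circular group terms by walking the bit indices downwards and wrapping from bit 0 to bit $n - 1$, then compares both sides of the example for every Boolean assignment of the five pairs. The helper `circular_group` and its length count `(k - j) % n + 1` are indexing choices of this sketch, not notation from the paper.

```python
from functools import reduce
from itertools import product


def op(left, right):
    """Prefix operator: (g, p) o (g', p') = (g + p*g', p*p')."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def circular_group(pairs, k, j):
    """(G_{k:j}, P_{k:j}) over bits k, k-1, ..., j, wrapping past bit 0 when k < j."""
    n = len(pairs)
    count = (k - j) % n + 1  # number of pairs walked, e.g. 4 for (k, j) = (1, 3), n = 5
    return reduce(op, (pairs[(k - t) % n] for t in range(count)))


def example_holds():
    """(G_{4:1}, P_{4:1}) o (G_{1:3}, P_{1:3}) == (G_{4:0}, P_{4:0}) for every input."""
    for assignment in product(product((0, 1), repeat=2), repeat=5):
        pairs = list(assignment)
        left = op(circular_group(pairs, 4, 1), circular_group(pairs, 1, 3))
        if left != circular_group(pairs, 4, 0):
            return False
    return True
```

`example_holds()` returns `True` over all 1024 assignments, so the eight-pair association collapses to the length-5 term exactly as the derivation states.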


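[Figure 2. Traditional and Circular Idempotency. The figure expands $(G_{4:1}, P_{4:1}) \circ (G_{1:3}, P_{1:3})$ into $(g_4, p_4) \circ (g_3, p_3) \circ (g_2, p_2) \circ (g_1, p_1) \circ (g_1, p_1) \circ (g_0, p_0) \circ (g_4, p_4) \circ (g_3, p_3)$ and reduces it to $(g_4, p_4) \circ (g_3, p_3) \circ (g_2, p_2) \circ (g_1, p_1) \circ (g_0, p_0)$.]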

4 The Proposed Design Methodology

A systematic methodology for designing a family of area-time efficient parallel-prefix modulo $2^n - 1$ adders is introduced in this section. All derived family members, i.e., prefix structures, have minimum logic depth equal to $m = \lceil \log_2 n \rceil$ prefix levels, and the number of operators employed for carry generation can vary according to the value of $n$ and the design selected in each case.

According to Eq. (4), the length of all carry equations in a modulo $2^n - 1$ adder is equal to $n$. Therefore, among the possible lengths of group generate/propagate terms that can be generated in at most $m - 1$ prefix levels, we seek those that allow us to build, at the $m$th level, group terms of length equal to $n$, or of length greater than $n$ that reduce to $n$ due to Theorem 1. In any other case the generation of a valid carry relation is impossible. Let $S_a$ and $S_b$ represent the lengths of any two group terms which are generated in the first $m - 1$ prefix levels and are selected to complete carry generation in the $m$th level. Then, the selected $S_a$ and $S_b$ should satisfy the following condition,

$$S_a + S_b \ge n. \tag{7}$$
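Condition (7) can be explored with a small Python sketch that lists the candidate length pairs for a given $n$. It assumes that every length from 1 up to $2^{m-1}$ can be produced within the first $m - 1$ prefix levels, which is a simplification made here; the paper restricts the choice through the Length Dependency Graph. The function names are hypothetical.

```python
def prefix_levels(n):
    """m = ceil(log2 n), computed with integer arithmetic to avoid rounding issues."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return (n - 1).bit_length()


def candidate_pairs(n):
    """Unordered pairs (S_a, S_b), S_a >= S_b, with S_a + S_b >= n (condition (7)).

    Assumption of this sketch: every length 1..2**(m-1) is available after
    the first m-1 prefix levels.
    """
    m = prefix_levels(n)
    longest = 2 ** (m - 1)
    return [(s_a, s_b)
            for s_a in range(1, longest + 1)
            for s_b in range(1, s_a + 1)
            if s_a + s_b >= n]
```

For $n = 10$ this gives $m = 4$ and 16 candidate pairs, among them $(8, 5)$. Pairs whose sum exceeds $n$ rely on Theorem 1 to collapse the surplus length.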

The way group generate/propagate terms can be produced in the first $m - 1$ prefix levels of the carry-computation unit can be represented graphically via a graph called the Length Dependency Graph. The Length Dependency Graph is the same for all values of $n$ with the same logic depth $m = \lceil \log_2 n \rceil$, and is denoted as $LDG_m$. The $LDG_m$ consists of $m$ levels and contains one vertex for each possible length of the group terms that can be produced in the first $m - 1$ levels of the carry-computation unit. The value inside each vertex is equal to the length that it represents. For example, $LDG_4$ is drawn in Figure 3(a). At level 0 only one vertex exists with value equal to 1, which represents the length-1 terms, i.e., the generate/propagate pairs $(g, p)$.

The edges of $LDG_m$ describe the way that each possible length of group terms can be produced. For example, the edge (4) → (6) with weights {2,3,4} implies that a length-6 group term can be produced on the 3rd prefix level by associating a length-4 term of the second level with a suitable term of length either 2, 3, or 4. The associations of terms of length 4 and 3, and of length 4 and 4, require the use of idempotency in order to produce a length-6 term.
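The edge weights in the example above can be reproduced with a short Python sketch. It rests on two assumptions that are not stated in this form in the paper: that every length from 1 to $2^l$ is available after $l$ prefix levels, and that associating terms of lengths $a$ and $w$ with an overlap of $t$ pairs yields a term of length $a + w - t$. The function names are hypothetical, and since Figure 3(a) is not reproduced here the sketch is only checked against the single edge quoted in the text.

```python
def lengths_up_to_level(level):
    """Assumption: after `level` prefix levels every length 1..2**level is available."""
    return set(range(1, 2 ** level + 1))


def edge_weights(source, target, source_level):
    """Map each usable partner length w to the overlap it needs.

    A length-`source` term combined with a length-w term, overlapping in t pairs,
    gives length source + w - t. An overlap t > 0 means idempotency is required.
    """
    weights = {}
    for w in sorted(lengths_up_to_level(source_level)):
        overlap = source + w - target
        if 0 <= overlap <= min(source, w):
            weights[w] = overlap
    return weights
```

`edge_weights(4, 6, 2)` returns `{2: 0, 3: 1, 4: 2}`: the partner lengths are the weights {2,3,4} of the edge (4) → (6), and the nonzero overlaps for lengths 3 and 4 are the cases that need idempotency.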

Therefore, for each pair of lengths $\{S_a, S_b\}$ that satisfies condition (7), a parallel-prefix carry-computation unit for a modulo $2^n - 1$ adder can be constructed by following the connections of $LDG_m$. To simplify the design procedure, for each selected pair $\{S_a, S_b\}$ the Design Graph $DG_{n,\{S_a,S_b\}}$ is extracted from the corresponding $LDG_m$. The $DG_{n,\{S_a,S_b\}}$ is a subgraph of $LDG_m$, and it is derived by following the paths of the $LDG_m$ that depart from the vertices with values $S_a$ and $S_b$, respectively, up to level 0. The vertices that do not belong to any of these paths are excluded.

For example, the $DG_{10,\{8,5\}}$ in the case of modulo $2^{10} - 1$ addition is derived from $LDG_4$ of Figure 3(a), and is presented in Figure 3(b). Since the pair {8,5} satisfies condition (7) for the
