628 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, VOL. 37, NO. 5, MAY 1990
A O(1) Algorithm for Modulo Addition
KHALED M. ELLEITHY AND MAGDY A. BAYOUMI
Abstract: An algorithm with O(1) time complexity for addition with large moduli in residue number system (RNS) based architectures is proposed. The addition is performed in a fixed number of stages that does not depend on the size of the modulus. The proposed modulo adder is considerably faster and more area-efficient than previous adders. Its implementation is modular and built from simple cells, which leads to an efficient VLSI realization.
I. INTRODUCTION
Recently, the residue number system (RNS) has been receiving increased attention due to its ability to support high-speed concurrent arithmetic [1]. Applications such as the fast Fourier transform, digital filtering, and image processing rely on the high-speed RNS arithmetic operations (addition and multiplication) and do not require the difficult RNS operations such as division and magnitude comparison. The technological advantages offered by VLSI have added a new dimension to the implementation of RNS-based architectures [2]. Several high-speed VLSI special-purpose digital signal processors have been successfully implemented [3]-[5].
Modulo addition is the computational kernel of RNS-based architectures. Subtraction is performed by adders using the additive inverse property [6]. Multiplication can be transformed into addition by several techniques [7]. Modulo addition is also the basic element in the conversion from RNS to binary using the Chinese remainder theorem (CRT) [6]. Banerji [8] analyzed modulo addition in MSI technology, and VLSI analyses of modulo addition have been reported in [9]-[11]. In general, lookup tables and PLAs have been the main logic modules used when the data granularity is the word. Such structures have been found to be efficient only for small moduli. For medium and large moduli, bit-level structures, where the data granularity is the bit, are more efficient [12].
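To make the two uses of modulo addition mentioned above concrete, here is a minimal Python sketch that is not taken from the paper. It assumes a small hypothetical moduli set (7, 11, 13): subtraction is done by adding the additive inverse $|m_i - r_i|_{m_i}$ digit by digit, and conversion back to binary uses the standard CRT formula, in which every accumulation step is itself a modulo-$M$ addition. The helper names `to_rns`, `rns_sub`, and `crt_to_binary` are illustrative only.

```python
from math import prod


def to_rns(x, moduli):
    """Residue digits r_i = |x|_{m_i}."""
    return tuple(x % m for m in moduli)


def rns_sub(a, b, moduli):
    """a - b, computed digit-wise as a + (additive inverse of b)."""
    neg_b = tuple((m - r) % m for r, m in zip(b, moduli))  # |m_i - r_i|_{m_i}
    return tuple((x + y) % m for x, y, m in zip(a, neg_b, moduli))


def crt_to_binary(digits, moduli):
    """CRT: X = | sum_i r_i * M_i * |M_i^{-1}|_{m_i} |_M with M_i = M / m_i."""
    M = prod(moduli)
    total = 0
    for r, m in zip(digits, moduli):
        Mi = M // m
        inv = pow(Mi, -1, m)                 # modular inverse (Python 3.8+)
        total = (total + r * Mi * inv) % M   # one modulo-M addition per term
    return total


moduli = (7, 11, 13)  # hypothetical pairwise-coprime moduli, M = 1001
diff = rns_sub(to_rns(500, moduli), to_rns(123, moduli), moduli)
print(crt_to_binary(diff, moduli))  # 377
```

A negative difference such as 5 - 9 comes back as its representative in $[0, M)$, here 997.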
In this paper, we present a modulo adder for medium and large moduli. It is based on a two-dimensional array of very simple cells (full adders). The modulo addition is performed with a fixed time delay, independent of the size of the moduli.
II. RESIDUE NUMBER SYSTEM (RNS)
In RNS, an integer $X$ can be represented by an $N$-tuple of residue digits $X = (r_1, r_2, \ldots, r_N)$, where $r_i = |X|_{m_i}$, with respect to a set of $N$ moduli $\{m_1, m_2, \ldots, m_N\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime, that is, $\gcd(m_i, m_j) = 1$ for $i \neq j$. It can then be shown that there is a unique representation for each number in the range $0 \le X < \prod_{i=1}^{N} m_i = M$, where $N$ is the number of moduli.
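The uniqueness condition can be checked numerically for small moduli. The sketch below is an illustration under assumed example moduli, not part of the paper: it tests pairwise coprimality and confirms, by brute force over $[0, M)$, that every $X$ gets a distinct residue tuple exactly when the moduli are pairwise coprime. The function names are hypothetical.

```python
from itertools import combinations
from math import gcd, prod


def pairwise_coprime(moduli):
    """GCD(m_i, m_j) = 1 for every pair i != j."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def to_rns(x, moduli):
    """X -> (r_1, ..., r_N) with r_i = |X|_{m_i}."""
    return tuple(x % m for m in moduli)


def representation_is_unique(moduli):
    """True if every X in [0, M) maps to a different residue tuple."""
    M = prod(moduli)
    return len({to_rns(x, moduli) for x in range(M)}) == M


print(pairwise_coprime((3, 5, 7)), representation_is_unique((3, 5, 7)))  # True True
print(pairwise_coprime((4, 6)), representation_is_unique((4, 6)))        # False False
```

With the non-coprime pair (4, 6), for example, X and X + 12 share the same residues, so only 12 of the 24 values in $[0, 24)$ are distinguishable.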
Manuscript received April 12, 1989. This work was supported in part by the National Science Foundation under Grant MIP-8809811. This letter was recommended by Associate Editor T. R. Viswanathan.
The authors are with The Center for Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, LA 70504.
IEEE Log Number 9034410.
Fig. 1. Modulo addition using two adders.

Fig. 2. Modulo addition using a lookup table.
The arithmetic operation on two integers $A$ and $B$ is equivalent to the arithmetic operation on their residue representations, that is:

$$|A \circ B|_M = \Big( \big||A|_{m_1} \circ |B|_{m_1}\big|_{m_1},\ \big||A|_{m_2} \circ |B|_{m_2}\big|_{m_2},\ \ldots,\ \big||A|_{m_N} \circ |B|_{m_N}\big|_{m_N} \Big)$$

where "$\circ$" can be addition, subtraction, or multiplication. It is therefore desirable to convert binary arithmetic on large integers into residue arithmetic on smaller residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.
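The digit-wise property above can be demonstrated with a short sketch. It assumes a hypothetical moduli set (31, 32, 33) and uses Python's `operator` functions for the three operations. Each residue digit is computed independently, so no carry ever passes between digits.

```python
import operator
from math import prod

moduli = (31, 32, 33)   # hypothetical pairwise-coprime moduli
M = prod(moduli)        # 32736


def to_rns(x):
    return tuple(x % m for m in moduli)


def rns_op(op, a, b):
    """Apply op to each residue digit independently: no carry between digits."""
    return tuple(op(x, y) % m for x, y, m in zip(a, b, moduli))


A, B = 1234, 567
for op in (operator.add, operator.sub, operator.mul):
    # the RNS form of |A o B|_M equals the digit-wise result
    assert to_rns(op(A, B) % M) == rns_op(op, to_rns(A), to_rns(B))
print("digit-wise add, sub and mul agree with |A o B|_M")
```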
2.1. The Modulo Addition
Generally, addition modulo $m$ has $2^n - m$ incorrect residue states, where $n = \lceil \log_2 m \rceil$. These states lie in the range $[m, 2^n - 1]$ and may be called overflow states. The corrected residue numbers can be obtained by two methods: employing a binary adder or a correction table. In the first method, a constant $(2^n - m)$ is added to correct the overflow residue states (generalized end-around carry), as shown in Fig. 1. The addition is performed as follows:

$$y = |x_1 + x_2|_m = \begin{cases} x_1 + x_2, & \text{if } x_1 + x_2 < m \\ x_1 + x_2 - m, & \text{if } x_1 + x_2 \ge m. \end{cases}$$
Two $n$-bit adders are used: the first computes $x_1 + x_2$, while the second computes $x_1 + x_2 - m$. The carry bit generated by the second adder indicates whether or not $x_1 + x_2$ is greater than or equal to $m$. A multiplexer, controlled by this carry, selects the correct output. In the second method, a lookup table is used to correct the $2^n - m$ incorrect residue states (Fig. 2). The first modulo addition algorithm has a time complexity of $O(\log n)$, and the second is not suitable for medium and large moduli because its table grows with $m$ (see Section IV).
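The two conventional methods can be modelled in a few lines of Python. This is a minimal behavioural sketch, not the circuits of Fig. 1 and Fig. 2: it assumes the second adder realizes $x_1 + x_2 - m$ by adding the constant $2^n - m$ and reading its carry-out, and the function names are hypothetical.

```python
def mod_add_two_adders(x1, x2, m):
    """Fig. 1 style: two binary adders and a carry-controlled multiplexer."""
    n = (m - 1).bit_length()             # n = ceil(log2 m), assuming m >= 2
    mask = (1 << n) - 1
    first = x1 + x2                      # adder 1: x1 + x2
    second = x1 + x2 + ((1 << n) - m)    # adder 2: x1 + x2 + (2^n - m)
    carry = second >> n                  # 1 exactly when x1 + x2 >= m
    return (second & mask) if carry else first   # the multiplexer


def build_correction_table(m):
    """Fig. 2 style: one entry per raw sum 0 .. 2m - 2, so the size grows with m."""
    return [s if s < m else s - m for s in range(2 * m - 1)]


table = build_correction_table(13)
print(mod_add_two_adders(9, 7, 13), table[9 + 7])  # 3 3
```

The carry-out of the second adder appears only after a full carry propagation across $n$ bits, which is where the $O(\log n)$ delay of the first method comes from, while the table size makes the second method impractical for large $m$.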
0098-4094/90/0500-0628$01.00 © 1990 IEEE
Authorized licensed use limited to: University of Bridgeport. Downloaded on February 24,2010 at 11:38:37 EST from IEEE Xplore. Restrictions apply.
Fig. 3. A modulo sum adder.
III. THE PROPOSED MODULO ADDER
The carry-save adder (CSA) [13] has been shown to be fast in multioperand addition. Basically, a CSA relies on the idea of not completing the addition at an intermediate stage but postponing carry propagation to the final stage. In the intermediate stages, numbers are represented as a sum and a carry, which avoids a complete (carry-propagate) addition.
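A minimal Python sketch of the CSA idea follows, assuming word-level bitwise operations stand in for a row of full adders. Each step turns three operands into a sum word and a carry word without propagating any carry, and only the last step performs an ordinary addition. The names `csa` and `multi_operand_add` are illustrative, not from the paper.

```python
def csa(x, y, z):
    """One carry-save (3:2) step on whole words, with no carry propagation.

    Returns (sum_word, carry_word) such that sum_word + carry_word == x + y + z.
    """
    s = x ^ y ^ z                             # bitwise full-adder sum
    c = ((x & y) | (x & z) | (y & z)) << 1    # bitwise majority, moved up one place
    return s, c


def multi_operand_add(values):
    """Reduce the operands with CSA steps; one ordinary addition at the very end."""
    vals = list(values)
    while len(vals) > 2:
        s, c = csa(vals.pop(), vals.pop(), vals.pop())
        vals.extend((s, c))
    return sum(vals)  # the only carry-propagate addition


print(csa(5, 6, 7))                              # (4, 14)
print(multi_operand_add([13, 7, 255, 1024, 3]))  # 1302
```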
The idea of representing a number as a carry and a sum can be used in modulo addition to obtain a scheme whose speed is constant, independent of the number of bits.
The modulo adder is used to add two numbers $A$ and $B$ modulo $m$. As Fig. 3 shows, $A$ is represented as a pair of numbers $(A_s, A_c)$, $B$ is likewise represented as $(B_s, B_c)$, and the output $C$ is represented as $(C_s, C_c)$. Each number is represented as a group of sum bits and a group of carry bits. The representation of $A$ by $A_s$ and $A_c$ is not unique; the only condition that must be satisfied is $|A_s + A_c|_m = |A|_m$. One possible representation is $A_s = |A|_m$, $A_c = 0$. The choice of representation has no effect on the complexity of the design. With such a representation, four numbers ($A_s$, $A_c$, $B_s$, $B_c$) need to be added, which requires two CSA steps. After this addition, we need to detect whether $2^n - m$ or $2(2^n - m)$ must be added to adjust the result. The adjustment process takes at most three steps. Since the adder has a fixed number of steps (five), no matter how many bits $A$ and $B$ have, it can be used in a multioperand pipelined addition scheme [14].
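The following sketch illustrates the redundant operands and the two CSA levels for the modulus of the paper's example, $m = 2050$. The split of $A = 1272$ and $B = 450$ into sum/carry pairs is a hypothetical choice (any pair with $|X_s + X_c|_m = |X|_m$ would do). Setting aside the overflow bits of weight $2^n$ and adding $\tau(2^n - m)$ instead relies on $2^n \equiv 2^n - m \pmod m$.

```python
m = 2050                    # modulus used in the paper's example
n = (m - 1).bit_length()    # 12, since 2^11 < 2050 <= 2^12
mask = (1 << n) - 1
d = (1 << n) - m            # correction constant 2^n - m = 2046


def csa(x, y, z):
    """One CSA step on n-bit inputs (any bit above n is dropped first)."""
    x, y, z = x & mask, y & mask, z & mask
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


# Hypothetical redundant representations of A = 1272 and B = 450.
A_s, A_c = 4000, 1372       # 5372 = 1272 + 2 * 2050
B_s, B_c = 3000, 1550       # 4550 =  450 + 2 * 2050

t1, t2 = csa(A_s, A_c, B_s)          # first CSA level
t3, t4 = csa(t1, t2, B_c)            # second CSA level (overflow bit of t2 set aside)
tau = (t2 >> n) + (t4 >> n)          # number of dropped 2^n bits: 0, 1 or 2
correction = tau * d                 # 0, 2^n - m, or 2(2^n - m)
print((t3 + (t4 & mask) + correction) % m)  # 1722, i.e. |1272 + 450|_2050
```

Because $m > 2^{n-1}$, even $2(2^n - m)$ still fits in $n$ bits, which is why one extra CSA row is enough to fold the correction back in.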
3.1. The Modulo Addition Algorithm
The proposed algorithm for modulo $m$ addition of two numbers can be described as follows.

Algorithm modulo_add (A, B, Result)

Input: Two variables $A$ and $B$ in modulo $m$. $A$ is represented as $A_s$ and $A_c$; $B$ is represented as $B_s$ and $B_c$. All variables are $n$-bit numbers ($2^{n-1} < m \le 2^n$).

Output: Variable Result, represented as $Result_s$ and $Result_c$. The relation between $A$, $B$, and Result is $|Result_s + Result_c|_m = |A + B|_m$.
Procedure:
begin
  {step 1}
  Do in parallel
  begin
    Call Sum(temp1, A_s, A_c, B_s)
    Call Carry(temp2, A_s, A_c, B_s)
  end
  {step 2}
  Do in parallel
  begin
    Call Sum(temp3, temp1, temp2, B_c)
    Call Carry(temp4, temp1, temp2, B_c)
  end
  {step 3}
  Case (temp2[n+1] + temp4[n+1]) of
    0: Do in parallel
       begin
         Result_s := temp3
         Result_c := temp4
       end
       exit
    1: Do in parallel
       begin
         Call Sum(temp5, temp3, temp4, (2^n - m))
         Call Carry(temp6, temp3, temp4, (2^n - m))
       end
    2: Do in parallel
       begin
         Call Sum(temp5, temp3, temp4, 2*(2^n - m))
         Call Carry(temp6, temp3, temp4, 2*(2^n - m))
       end
  end case
  {step 4}
  Case (temp6[n+1]) of
    0: Do in parallel
       begin
         Result_s := temp5
         Result_c := temp6
       end
       exit
    1: Do in parallel
       begin
         Call Sum(temp7, temp5, temp6, (2^n - m))
         Call Carry(temp8, temp5, temp6, (2^n - m))
       end
  end case
  {step 5}
  Case (temp8[n+1]) of
    0: Do in parallel
       begin
         Result_s := temp7
         Result_c := temp8
       end
    1: Do in parallel
       begin
         Call Sum(temp9, temp7, temp8, (2^n - m))
         Call Carry(temp10, temp7, temp8, (2^n - m))
       end
       Do in parallel
       begin
         Result_s := temp9
         Result_c := temp10
       end
  end case
end.
Sum(A, B, C, D)
begin
  Do in parallel (1 ≤ i ≤ n)
    A[i] := B[i] ⊕ C[i] ⊕ D[i]
end

Carry(A, B, C, D)
begin
  A[1] := 0
  Do in parallel (1 ≤ i ≤ n)
    A[i+1] := (B[i] ∧ C[i]) ∨ (B[i] ∧ D[i]) ∨ (C[i] ∧ D[i])
end
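To check the Sum and Carry definitions above, here is a bit-level Python sketch that keeps the paper's 1-based bit indexing (index 1 is the least significant bit, index 0 is unused). Returning a new list instead of writing into the first argument, and the names `sum_bits`, `carry_bits`, `to_bits`, and `from_bits`, are assumptions of this sketch.

```python
def to_bits(x, width):
    """Bit vector indexed like the paper: index 1 is the LSB (index 0 unused)."""
    return [0] + [(x >> (i - 1)) & 1 for i in range(1, width + 1)]


def from_bits(bits):
    return sum(b << (i - 1) for i, b in enumerate(bits) if i >= 1)


def sum_bits(B, C, D, n):
    """Sum: A[i] := B[i] xor C[i] xor D[i] for 1 <= i <= n (bits above n ignored)."""
    return [0] + [B[i] ^ C[i] ^ D[i] for i in range(1, n + 1)]


def carry_bits(B, C, D, n):
    """Carry: A[1] := 0; A[i+1] := majority(B[i], C[i], D[i]) for 1 <= i <= n."""
    A = [0] * (n + 2)
    for i in range(1, n + 1):
        A[i + 1] = (B[i] & C[i]) | (B[i] & D[i]) | (C[i] & D[i])
    return A


n = 4
X, Y, Z = (to_bits(v, n) for v in (13, 11, 7))
S, C = sum_bits(X, Y, Z, n), carry_bits(X, Y, Z, n)
print(from_bits(S), from_bits(C), C[n + 1])  # 1 30 1
```

The sum word has $n$ bits while the carry word has $n + 1$; bit $n + 1$ of the carry word is exactly the overflow bit that the Case statements test.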
An implementation of the algorithm is shown in Fig. 4.
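The whole procedure can be modelled in software to check its behaviour. The sketch below is a hypothetical Python rendering, not the hardware of Fig. 4. It assumes integer bitwise operations for Sum and Carry and $n = \lceil \log_2 m \rceil$, and it folds the three correction cases into one loop that stops as soon as no overflow bit remains.

```python
def modulo_add(A_s, A_c, B_s, B_c, m):
    """Sketch of the five-step modulo adder on n-bit integers.

    Returns (Result_s, Result_c, steps) with |Result_s + Result_c|_m = |A + B|_m.
    """
    n = (m - 1).bit_length()        # assumes m >= 2, so 2^(n-1) < m <= 2^n
    mask = (1 << n) - 1
    d = (1 << n) - m                # the correction constant 2^n - m

    def csa(x, y, z):
        # Sum/Carry act on bits 1..n only, so an input's overflow bit is dropped
        x, y, z = x & mask, y & mask, z & mask
        return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

    t1, t2 = csa(A_s, A_c, B_s)             # step 1
    t3, t4 = csa(t1, t2, B_c)               # step 2
    tau = (t2 >> n) + (t4 >> n)             # Case (temp2[n+1] + temp4[n+1])
    s, c, steps = t3, t4, 2
    const = tau * d                         # 0, 2^n - m or 2(2^n - m)
    while const:
        s, c = csa(s, c, const)             # steps 3, 4, 5
        steps += 1
        const = d if c >> n else 0          # a new overflow bit 2^n is worth d (mod m)
    return s, c & mask, steps


R_s, R_c, steps = modulo_add(4000, 1372, 3000, 1550, 2050)  # hypothetical split of 1272, 450
print((R_s + R_c) % 2050, steps <= 5)  # 1722 True
```

Running it over all $n$-bit operand quadruples for a few small moduli is a cheap way to confirm that the loop never needs more than three correction rounds, i.e., at most five steps in total.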
Fig. 4. Different stages of the modulo adder.
Theorem 1: The modulo adder scheme for adding two $n$-bit numbers modulo $m$ has an asymptotic time complexity of $O(1)$.
Proof: To prove that the number of steps is constant (five), we need to prove that the last carry becomes zero in five or fewer steps. The theorem is proved by induction on the number of bits $n$.

1) Basis step: for $n = 0$, no numbers are added, and the required number of steps is zero.

2) Induction hypothesis: assume, for a fixed arbitrary $n > 0$, that the maximum number of steps is five.

3) Induction step: for numbers with $n + 1$ bits, let $\tau$ = temp2[n+2] + temp4[n+2] be the number of overflow bits produced by steps 1 and 2. Then we have the following cases.

(a) $\tau = 0$: the carry propagation stopped at bit $n$, and the addition ends after at most five steps according to the induction hypothesis.

(b) $\tau = 1$: the correction in step 3 is $2^{n+1} - m$. Since $m > 2^n$, we have $2^{n+1} - m < 2^n$, which means that $(2^{n+1} - m)[n+1] = 0$. In the worst case, temp3[n+1] and temp4[n+1] are both equal to one. Then temp5[n+1] = 0 and temp6[n+2] = 1, and therefore temp8[n+2] = 0. In this case the correction is done in two steps (step 3 and step 4).

(c) $\tau = 2$: the correction in step 3 is $2(2^{n+1} - m)$. In the worst case, temp3[n+1], temp4[n+1], and $2(2^{n+1} - m)[n+1]$ are all equal to one. Then temp5[n+1] = 1, temp6[n+1] = 1, and $(2^{n+1} - m)[n+1] = 0$. At step 4, temp7[n+1] = 0 and temp8[n+2] = 1. At step 5, temp9[n+1] = 1 and temp10[n+2] = 0. In this case the correction is done in three steps (steps 3 to 5).
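The key fact behind cases (b) and (c) can be checked mechanically. Once a correction with the constant $2^n - m$ produces an overflow, the top bit of the new sum word is zero, because that constant's top bit is zero when $m > 2^{n-1}$. As a result, the next correction cannot overflow. The small exhaustive Python check below covers every constant below $2^{n-1}$, not only those of the form $2^n - m$, for a few word lengths. The function names are hypothetical.

```python
def csa(x, y, z, n):
    """One CSA step on the low n bits of each input."""
    mask = (1 << n) - 1
    x, y, z = x & mask, y & mask, z & mask
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def no_overflow_after_overflow(n):
    """For every constant d < 2^(n-1): if adding d yields a carry-out (bit n+1),
    then adding d again to the resulting sum/carry pair never does."""
    top = 1 << n
    for d in range(top >> 1):
        for s in range(top):
            for c in range(top):
                s1, c1 = csa(s, c, d, n)
                if c1 >> n and csa(s1, c1, d, n)[1] >> n:
                    return False
    return True


print(all(no_overflow_after_overflow(n) for n in range(1, 7)))  # True
```

With $\tau = 1$ this bounds the corrections at two steps. With $\tau = 2$ the first constant $2(2^n - m)$ may have its top bit set, so one more step is possible before the argument applies, giving three.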
As an example, the modulo addition of $A = 1272$ and $B = 450$ for $m = 2050$ is shown in Fig. 5. Here $n = 12$, since $2^{11} < 2050 \le 2^{12}$, so bit 13 is the overflow bit. The representation of $A$ and $B$ is not unique; one valid representation is shown in the figure, which traces the detailed modulo addition step by step.

Fig. 5. A detailed example for the modulo addition.

In step 1 we get temp2[13] = 1, and in step 2 we get temp4[13] = 1, which means that at step 3 we have to add $2(2^n - m)$. At step 3 we get temp6[13] = 1, which means that at step 4 we have to add $2^n - m$. At step 4 we get temp8[13] = 0, which means that the addition process stops at step 4. The result of step 4 is the final result.
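A step-by-step trace in the spirit of Fig. 5 can be produced with the sketch below. It prints the sum and carry words in binary after each step. The operand split in the usage line is a hypothetical representation of $A = 1272$ and $B = 450$, not necessarily the one shown in the figure, so the intermediate words will generally differ from the figure's.

```python
def trace_modulo_add(A_s, A_c, B_s, B_c, m):
    """Print the sum/carry words after every step (a sketch, not the paper's code)."""
    n = (m - 1).bit_length()
    mask, d = (1 << n) - 1, (1 << n) - m

    def csa(x, y, z):
        x, y, z = x & mask, y & mask, z & mask
        return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

    def show(step, s, c):
        # carry word shown with n + 1 bits so the overflow bit is visible
        print(f"step {step}: sum = {s:0{n}b}  carry = {c:0{n + 1}b}")

    t1, t2 = csa(A_s, A_c, B_s)
    show(1, t1, t2)
    s, c = csa(t1, t2, B_c)
    show(2, s, c)
    step, const = 2, ((t2 >> n) + (c >> n)) * d   # 0, 2^n - m or 2(2^n - m)
    while const:
        step += 1
        s, c = csa(s, c, const)
        show(step, s, c)
        const = d if c >> n else 0
    return s, c & mask


R_s, R_c = trace_modulo_add(4000, 1372, 3000, 1550, 2050)  # hypothetical split
print((R_s + R_c) % 2050)  # 1722
```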
IV. MODULO ADDER EVALUATION
Using the VLSI model of computation for asymptotic complexity [15], the proposed adder is compared with the two conventional adders. For adder I (Fig. 1), using the binary adder of Brent and Kung [16], the complexity measures are as follows:

$$\begin{aligned} A &= O(\log m \log\log m) = O(n \log n)\\ T &= O(\log\log m) = O(\log n)\\ AT^2 &= O(n \log^3 n). \end{aligned}$$

For adder II (Fig. 2), using the complexity analysis of the correction table of [17]:

$$\begin{aligned} A &= O(\log m \log\log m + m \log m) = O(n \log n + 2^n n) = O(n 2^n)\\ T &= O(\log\log m + \log m) = O(\log n + n) = O(n)\\ AT^2 &= O(n^3 2^n). \end{aligned}$$

For the proposed adder:

$$A = O(n), \qquad T = O(1), \qquad AT^2 = O(n).$$
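The $AT^2$ entries above follow directly from the area and time bounds, using $n = \lceil \log_2 m \rceil$; the same arithmetic shows that the proposed adder improves on adder I by a factor of $\log^3 n$ in $AT^2$:

```latex
\begin{aligned}
\text{Adder I:}\quad  & AT^2 = O(n\log n)\cdot O(\log n)^2 = O(n\log^3 n)\\
\text{Adder II:}\quad & AT^2 = O(n\,2^n)\cdot O(n)^2 = O(n^3\,2^n)\\
\text{Proposed:}\quad & AT^2 = O(n)\cdot O(1)^2 = O(n)
\end{aligned}
```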
V. CONCLUSIONS

The modulo adder introduced in this paper has a total time-delay complexity of $O(1)$ for adding two $n$-bit numbers modulo $m$. Based on the analysis of Section IV, this adder is the
fastest and the most area-efficient for large moduli. The proposed design has the following advantages.
1) It does not have any limitation on the size of the modulus.
2) It is quite modular, and it is a two-dimensional array of one type of cell (full adder).
3) It is easy to pipeline.
4) It is a very efficient architecture for the implementation of CRT decoding [14].
REFERENCES

[1] F. J. Taylor, "Residue arithmetic: A tutorial with examples," IEEE Comput. Mag., pp. 50-62, May 1984.
[2] M. A. Bayoumi, G. A. Jullien, and W. C. Miller, "A look-up table VLSI design methodology for RNS structures used in DSP applications," IEEE Trans. Circuits Syst., vol. CAS-34, pp. 604-616, June 1987.
[3] M. A. Bayoumi, "A high speed VLSI complex digital signal processor based on quadratic residue number system," in VLSI Signal Processing II. New York: IEEE Press, 1986, pp. 200-211.
[4] M. A. Bayoumi, "Digital filter VLSI systolic arrays over finite fields for DSP applications," in Proc. 6th IEEE Ann. Phoenix Conf. on Computers and Communications, pp. 194-199, Feb. 1987.
[5] W. Jenkins and E. Davidson, "A custom-designed integrated circuit for the realization of residue number digital filters," in Proc. ICASSP 1985, pp. 220-223, Mar. 1985.
[6] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology. New York: McGraw-Hill, 1967.
[7] M. A. Soderstrand and E. L. Fields, "Multipliers and residue number arithmetic digital filters," Electron. Lett., vol. 13, no. 6, pp. 164-166, Mar. 1977.
[8] D. K. Banerji, "A novel implementation method for addition and subtraction in residue number systems," IEEE Trans. Comput., vol. C-23, pp. 106-109, Jan. 1974.
[9] M. A. Bayoumi, G. A. Jullien, and W. C. Miller, "A VLSI implementation of residue adders," IEEE Trans. Circuits Syst., vol. CAS-34, pp. 284-288, Mar. 1987.
[10] M. A. Bayoumi, "VLSI PLA structures for residue number systems arithmetic implementations," in Proc. ISCAS 1987, 1987.
[11] C. L. Chiang and L. Johnsson, "Residue arithmetic and VLSI," in Proc. ICCD 83, pp. 80-83, Oct. 1983.
[12] K. M. Elleithy, "On bit-parallel processing for modulo arithmetic," VLSI Tech. Rep. TR8681, Ctr. Advanced Computer Studies, Univ. of Southwestern Louisiana, 1986.
[13] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 1978.
[14] K. M. Elleithy, "On the bit-parallel implementation for the Chinese remainder theorem," VLSI Tech. Rep. TR8781, Ctr. Advanced Computer Studies, Univ. of Southwestern Louisiana, 1987.
[15] G. Alia and E. Martinelli, "A VLSI algorithm for direct and reverse conversion from weighted binary number system to residue number system," IEEE Trans. Circuits Syst., vol. CAS-31, pp. 1033-1039, 1984.
[16] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. C-31, pp. 260-264, Mar. 1982.
[17] M. A. Bayoumi, "Lower bounds for VLSI implementation of residue number system architectures," Integration, The VLSI J., vol. 4, no. 4, pp. 263-269, Dec. 1986.
Adjusting the Parameters in Elliptic-Function Filters

H. J. ORCHARD

Abstract: When designing elliptic-function filters there is usually some margin in performance to be distributed over the defining parameters. A recent paper offered some comparatively complicated formulas for use in this stage of the design. However, a simpler method, originally due to Darlington, is available and is described briefly.

I. INTRODUCTION

The solution given by Vlcek and Unbehauen [1] to what they refer to as the "degree equation" for elliptic-function filters is exact, but it is rather complicated for the intended use in adjusting the parameters at the beginning of a design, and it involves computing several elliptic functions of rational fractions of a quarter-period. High precision is unnecessary at this stage of the design, and a much simpler formula, which allows one to achieve the same end result with a pocket calculator, was in fact given by Darlington [2] just 50 years ago, but seems to have been overlooked. The purpose of this note is to explain his formula in more detail than appeared in [2] and to expand on it slightly.

In the design of an elliptic-function filter one is given a specification for the passband ripple $\alpha_p$, the minimum stopband loss $\alpha_s$ (both in decibels), and the elliptic modulus $k = \omega_p/\omega_s$, where $\omega_p$ is the passband-edge frequency and $\omega_s$ is the stopband-edge frequency. From these one must first find the smallest integral value of the degree $n$ that can be used. This choice of $n$ will normally make the filter slightly better than the specification calls for, so one can adjust the values of $\alpha_p$, $\alpha_s$, and $k$ used in the design to allow some margin inside the specification. The relation between these four parameters, $\alpha_p$, $\alpha_s$, $k$, and $n$, is given by the "degree equation."

II. DEFINITION OF THE ELLIPTIC-FUNCTION FILTER

The power ratio for an elliptic-function filter is most conveniently defined by a pair of equations involving a parametric variable, exactly analogous to those for the well-known Chebyshev filter. The latter is defined by

$$10^{\alpha/10} = 1 + \varepsilon^2 \cos^2 nu \qquad (1a)$$
$$\Omega = \cos u \qquad (1b)$$

where $u$ is the parametric variable and $n$ is the degree. The passband edge is normalized to $\Omega = 1$, and the passband ripple is

$$\alpha_p = 10 \log(1 + \varepsilon^2)\ \text{dB}. \qquad (2)$$

The equations for the elliptic-function filter are obtained by replacing the cosines in (1) by the Jacobian cd elliptic functions and take the form¹

$$10^{\alpha/10} = 1 + \varepsilon^2 \operatorname{cd}^2(nuK_1/K;\ k_1) \qquad (3a)$$
$$\Omega = \operatorname{cd}(u;\ k). \qquad (3b)$$

As in (1), this definition holds for both odd and even values of the degree $n$. $K_1$ and $K$ are the real quarter-periods belonging to the elliptic functions with moduli $k_1$ and $k$ in (3a) and (3b), respectively. The frequency scale is still normalized to $\Omega = 1$ at $\omega_p$, rather than as in [1] and [2]; the latter normalization, though nicer for theoretical work, is a nuisance in practical design. The passband ripple is again given by (2).

In order for (3) to define a rational function, the period of $\operatorname{cd}(nuK_1/K;\ k_1)$, in the $u$ plane, must fit exactly $n$ times into the period rectangle of $\operatorname{cd}(u;\ k)$, just as the period strip of $\cos nu$ fits $n$ times into the period strip of $\cos u$. This

¹The cd function is the same as the sn function shifted by a real quarter-period, just as the cosine is the same as the sine shifted by $\pi/2$. Using the cd function rather than the sn function as in [2] allows one to describe both odd- and even-degree cases with one common formula.

Manuscript received April 28, 1989. This letter was recommended by Associate Editor T. R. Viswanathan.
The author is with the Electrical Engineering Department, University of California, Los Angeles, Los Angeles, CA 90024.
IEEE Log Number 8930187.
0098-4094/90/0500-0631$01.00 © 1990 IEEE