ARCHITECTURES FOR RNS ARITHMETIC DECODING*

K. M. Elleithy, M. A. Bayoumi, and K. P. Lee
The Center for Advanced Computer Studies
University of Southwestern Louisiana
Lafayette, LA 70504, U.S.A.
ABSTRACT

Decoding in Residue Number System (RNS) based architectures can be a bottleneck. A high speed and flexible modulo decoder is an essential computational element to maintain the advantages of RNS. In this paper, a fast and flexible modulo decoder, based on the Chinese Remainder Theorem (CRT), is presented. It decodes a set of residues into its equivalent representation in either the unsigned magnitude or the 2's complement binary number system. Two different architectures are analyzed: the first is based on Carry Save Adders (CSA), while the other is based on a modified structure of Carry Save Adders (MCSA). Both architectures are modular and are based on simple cells, which leads to efficient VLSI implementation. The proposed decoder is fast; it has a time complexity of $\Theta(\log N)$.
1. Introduction

Recently, RNS has received increased attention due to its ability to support high-speed concurrent arithmetic [1-3]. Applications such as the fast Fourier transform, digital filtering, and image processing utilize the efficiency of RNS arithmetic in addition and multiplication; they do not require the difficult RNS operations such as division and magnitude comparison. RNS has been employed efficiently in the implementation of several special purpose processors such as digital signal processors [4].

Since special purpose processors are associated with general purpose computers, binary-to-residue and residue-to-binary conversions become inherently important, and the conversion process should not offset the speed gain in RNS operations. While the binary-to-residue conversion does not pose a serious threat to the speed gain in RNS operations, the residue-to-binary conversion can be a bottleneck. It is mainly carried out employing the Chinese Remainder Theorem (CRT) [5,6]. Several implementations of the residue decoder have been reported [7-12]. In [12], the proposed residue decoders are based on biased addition, and take advantage of the fast addition speed of the CSA [13]. However, the conversion output is not in 2's complement form. The implementation in [11] requires that one of the moduli be a power of two; therefore, it may be limited in application. The residue decoders in [7,8] are based on using three moduli of the form $(2^n - 1, 2^n, 2^n + 1)$. Due to the limitation imposed on the number of moduli and their choice, they are limited in application. In [10], the residue decoder is based on the base extension technique; it uses modular lookup tables in its implementation. Since two moduli are fed into a lookup table, the moduli must not be large for the implementation to be feasible. In addition, it does not support residue to 2's complement binary number system conversion. Although lookup tables are used in this scheme, its time complexity is $O(p)$. In [14,15], the scheme used has a time complexity of $\Theta((\log N)^2)$. In [9], a scheme of $\Theta(\log NP)$ (where $P$ is the number of bits) is used to support only unsigned magnitude binary numbers.

In this paper, a $\Theta(\log n)$ residue decoder capable of decoding a set of residues to its equivalent representation in the unsigned magnitude or the 2's complement binary number system is introduced. Two different architectures, using CSAs based on [16] and MCSAs [17], are implemented. In the following section, the RNS theory is reviewed. Section 3 discusses how this fast and flexible residue decoder can be implemented. Section 4 evaluates the speed performance of this residue decoder.

*This work was supported in part by NSF grant No. MIP-8809811.
2. Residue Number System

In RNS, an integer $X$ can be represented by an $N$-tuple of residue digits,

$$X = (r_1, r_2, \ldots, r_N),$$

where $r_i = |X|_{m_i}$, with respect to a set of $N$ moduli $\{m_1, m_2, \ldots, m_N\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime, that is,

$$\mathrm{GCD}(m_i, m_j) = 1 \quad \text{for } i \neq j.$$

It can then be shown that there is a unique representation for each number in the range

$$0 \le X < \prod_{i=1}^{N} m_i = M,$$

where $N$ is the number of moduli.

The arithmetic operation on two integers $A$ and $B$ is equivalent to the arithmetic operation on their residue representations, that is,

$$|A \cdot B|_{m_i} = \big|\, |A|_{m_i} \cdot |B|_{m_i} \,\big|_{m_i}, \quad i = 1, 2, \ldots, N,$$

where "$\cdot$" can be addition, subtraction, or multiplication. Therefore, it is desirable to convert binary arithmetic on large integers to residue arithmetic on small residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.

For applications in digital signal processing, it is helpful to define a dynamic range for the RNS integers. The dynamic range is defined as $\left[-\frac{M-1}{2}, \frac{M-1}{2}\right]$ for $M$ odd and as $\left[-\frac{M}{2}, \frac{M}{2} - 1\right]$ for $M$ even, or more specifically, for $M$ odd,

$$X = \begin{cases} Z & \text{if } Z \le \frac{M-1}{2} \\ Z - M & \text{if } Z > \frac{M-1}{2} \end{cases}$$

and for $M$ even,

$$X = \begin{cases} Z & \text{if } Z < \frac{M}{2} \\ Z - M & \text{if } Z \ge \frac{M}{2} \end{cases}$$

where $Z$ is an integer within the legitimate range, $0 \le Z < M$. Any integer $X$ within the dynamic range can be represented by $N$ residue digits.
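The definitions in this section can be illustrated with a short Python sketch. It is not part of the original design; the moduli (3, 5, 7) and the helper names `to_residues`, `rns_op`, and `to_signed` are assumptions chosen only for illustration.

```python
from math import gcd, prod

def to_residues(x, moduli):
    """Encode integer x as its residue digits r_i = x mod m_i."""
    return tuple(x % m for m in moduli)

def rns_op(a, b, moduli, op):
    """Apply op digit by digit; each digit is reduced by its own modulus,
    so no carry travels between residue digits."""
    return tuple(op(ra, rb) % m for ra, rb, m in zip(a, b, moduli))

def to_signed(z, M):
    """Map Z in the legitimate range [0, M) to X in the dynamic range."""
    if M % 2:                           # M odd: [-(M-1)/2, (M-1)/2]
        return z if z <= (M - 1) // 2 else z - M
    return z if z < M // 2 else z - M   # M even: [-M/2, M/2 - 1]

moduli = (3, 5, 7)                      # illustrative, pairwise relatively prime
assert all(gcd(p, q) == 1 for i, p in enumerate(moduli) for q in moduli[i + 1:])
M = prod(moduli)                        # 105

a, b = to_residues(17, moduli), to_residues(4, moduli)
total = rns_op(a, b, moduli, lambda u, v: u + v)     # residues of 17 + 4
product = rns_op(a, b, moduli, lambda u, v: u * v)   # residues of 17 * 4
```

Each residue digit is processed independently, which is the property that lets RNS hardware run the digit operations concurrently.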
Authorized licensed use limited to: University of Bridgeport. Downloaded on February 24,2010 at 13:07:27 EST from IEEE Xplore. Restrictions apply.
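The conversion from RNS to the weighted binary number system is done by using the CRT, which states that

$$|X|_M = \left| \sum_{i=1}^{N} \hat{m}_i \left| \frac{r_i}{\hat{m}_i} \right|_{m_i} \right|_M \qquad (1)$$

where $\hat{m}_i = M / m_i$ and $\left| r_i / \hat{m}_i \right|_{m_i}$ denotes $r_i$ multiplied by the multiplicative inverse of $\hat{m}_i$ modulo $m_i$. Although the CRT provides a direct, fast, and simple conversion formula, the lack of a large and fast modulo $M$ adder has held back this approach.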
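As a software reference for the CRT conversion, the following minimal Python sketch evaluates the sum directly. The function name `crt_decode` is an assumption for illustration, and the sketch models only the arithmetic, not the hardware described in this paper.

```python
from math import prod

def crt_decode(residues, moduli):
    """Recover X in [0, M) from its residues using the CRT sum."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = M // m                   # m_hat_i = M / m_i
        inv = pow(m_hat, -1, m)          # inverse of m_hat_i modulo m_i (Python 3.8+)
        total += m_hat * ((r * inv) % m)
    return total % M                     # final modulo M reduction

moduli = (2, 3, 5, 7, 11, 13, 17, 23)
x = 1234567
residues = tuple(x % m for m in moduli)
decoded = crt_decode(residues, moduli)
```

The final `% M` is the step that a hardware decoder has to realize without a slow modulo $M$ adder.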
3. The Residue Decoder

The residue decoder based on the CRT can be implemented by a modulo $M$ adder tree. The modulo $M$ adders at each level are used to correct the partial sum so that it is within the legitimate range. Since a modulo $M$ adder is very slow, such an implementation may impose an overhead on the overall speed performance of an RNS processor. In addition, the CRT only converts residues to their binary representation in the legitimate range but not in the dynamic range. Therefore, conversion to the 2's complement binary number system requires a final correction.

In order to implement a high speed residue decoder that can perform conversion to both the unsigned magnitude and the 2's complement binary number systems, the following solutions are proposed:

1) The number of modulo $M$ adders or binary adders should be reduced to a minimum.
2) CSAs or MCSAs can be used wherever multioperand addition is required, due to their high addition speed.
3) Correction can be performed only at the last stage, and it supports conversion to both the unsigned magnitude and the 2's complement binary number systems.

For ease of residue decoder design, it is partitioned into 4 stages as shown in Figure 1. The inputs to the residue decoder are the residues and a control line, $C$, which determines whether the output is in the unsigned magnitude or the 2's complement number system.
Figure 1. Block Diagram of the Residue Decoder (inputs: the residues and the control line $C$; blocks shown: Partial Sum Adder, Range Determinator with BC outputs, Final Corrector; output: $|X|_M$)
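3.1. Partial Sum Generator

The inputs to this stage are the $N$ residues. The main function of this stage is to compute the partial sums $t_i$, where

$$t_i = \left|\, \hat{m}_i \left| \frac{r_i}{\hat{m}_i} \right|_{m_i} \right|_M, \qquad \hat{m}_i = \frac{M}{m_i}.$$

Since $m_i$ is usually small, the value of $t_i$ can be obtained by accessing a lookup table with a small address space. Hence, $r_i$ will serve as the ROM address input, and $t_i$ will be obtained from the ROM output.

In most cases, it is better to reduce the number of partial sums $t_i$ in order to reduce the complexity of the following stages and hence increase the speed of the residue decoder as a whole. Since a modulus $m_j$ can be represented by a $\lceil \log_2 m_j \rceil$-bit binary number, the $j$th residue can be written as

$$r_j = \sum_{k=0}^{\lceil \log_2 m_j \rceil - 1} 2^k b_j^k,$$

where $b_j^k \in \{0, 1\}$. By substituting $r_j$ in eq. (1), we can rewrite the CRT as follows:

$$|X|_M = \left| \sum_{j=1}^{N} \hat{m}_j \left| \frac{\sum_{k=0}^{\lceil \log_2 m_j \rceil - 1} 2^k b_j^k}{\hat{m}_j} \right|_{m_j} \right|_M$$

Hence, if we have a set of 8 moduli $\{2, 3, 5, 7, 11, 13, 17, 23\}$ with residues $\{r_1, r_2, r_3, r_4, r_5, r_6, r_7, r_8\}$, respectively, only 4 ROMs with 7-bit address inputs are needed to implement this level, and a modulo summation of 4 operands instead of 8 is needed, where each operand is the output of one ROM addressed by the bits of the residues grouped into it.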
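The ROM-based partial sum generation can be sketched in Python as follows. The grouping of the eight moduli into four pairs (`GROUPS`) is an assumption made for this illustration, chosen so that each address fits in 7 bits; the grouping used in the original design is not recoverable from the text. The names `partial_sum`, `build_rom`, `address_bits`, and `decode` are likewise hypothetical.

```python
from itertools import product
from math import prod

MODULI = (2, 3, 5, 7, 11, 13, 17, 23)
M = prod(MODULI)

def partial_sum(r, m):
    """t = | m_hat * |r / m_hat|_m |_M for one residue r of modulus m."""
    m_hat = M // m
    return (m_hat * ((r * pow(m_hat, -1, m)) % m)) % M

def build_rom(group):
    """ROM addressed by the residues of a group of moduli; each word holds
    the modulo M sum of the partial sums of that group."""
    return {
        address: sum(partial_sum(r, m) for r, m in zip(address, group)) % M
        for address in product(*(range(m) for m in group))
    }

def address_bits(group):
    """Address width: ceil(log2 m) bits per modulus in the group."""
    return sum((m - 1).bit_length() for m in group)

GROUPS = ((2, 23), (3, 17), (5, 13), (7, 11))    # assumed pairing, at most 7 bits each
ROMS = [build_rom(g) for g in GROUPS]

def decode(residue_of):
    """Read one partial sum per ROM and add them modulo M.
    residue_of maps each modulus to its residue."""
    t = [rom[tuple(residue_of[m] for m in g)] for rom, g in zip(ROMS, GROUPS)]
    return sum(t) % M, t

x = 987654
value, partial_sums = decode({m: x % m for m in MODULI})
```

With this grouping only four operands reach the partial sum adder, at the cost of larger ROM words per address.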
3.2. Partial Sum Adder

By far, the modulo $M$ summation of the partial sums $t_i$ poses the biggest challenge to the implementation of the residue decoder, due to the slow computational speed of the modulo $M$ adder. This stage can be implemented using two different approaches.

3.2.1. Implementation using CSA

A multilevel CSA tree consisting of $A - 2$ CSAs and a carry propagate adder (CPA) [13] is used to reduce $A$ partial sums $t_i$ to a sum, $S$. Let $l$ be the number of levels of a CSA tree, and $\theta(l)$ be the maximum number of operands that can be processed with an $l$-level CSA tree. We can compute $\theta(l)$ by the recursive formula provided by Avizienis [18],

$$\theta(l) = \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor \cdot 3 + \theta(l-1) \bmod 2 \quad \text{for } l = 2, 3, \ldots, \qquad (3)$$

with the initial value $\theta(1) = 3$.

A CSA tree for adding 6 operands is shown in Figure 2. The CPA is an $(m-1)$-bit two-level carry lookahead adder (CLA) [13], where

$$m = \lceil \log_2 (MA) \rceil.$$

Hence, the output $S$ is an $m$-bit number that is passed to the next stage. The complexity of the scheme is determined by Theorem 1.
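The recursion for $\theta(l)$ is easy to evaluate numerically. The following Python sketch (the function names `theta` and `levels_needed` are assumptions for illustration) tabulates the capacity of an $l$-level CSA tree and finds the number of levels needed for a given number of operands.

```python
def theta(levels):
    """Maximum number of operands an l-level CSA tree can reduce,
    following theta(l) = floor(theta(l-1)/2)*3 + theta(l-1) mod 2, theta(1) = 3."""
    t = 3
    for _ in range(levels - 1):
        t = (t // 2) * 3 + t % 2
    return t

def levels_needed(n_operands):
    """Smallest l with theta(l) >= n_operands."""
    level = 1
    while theta(level) < n_operands:
        level += 1
    return level

capacities = [theta(l) for l in range(1, 9)]   # 3, 4, 6, 9, 13, 19, 28, 42
```

For example, `levels_needed(6)` is 3, which agrees with a three-level tree for the six-operand case of Figure 2.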
Theorem 1: The addition of $N$ numbers using CSAs can be performed in $O(\log N)$ steps.

Proof: The number of levels in a CSA tree is determined by:

$$\theta(l) = \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor \cdot 3 + \theta(l-1) \bmod 2$$

To determine the number of levels required to add $N$ numbers, let us consider the following two cases:

(i) $\theta(l-1)$ is even, then:

$$\left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{\theta(l-1)}{2} \qquad (4)$$

$$\theta(l-1) \bmod 2 = 0 \qquad (5)$$

Substituting in (3) using (4) and (5), we have:

$$\theta(l) = \frac{3}{2}\,\theta(l-1) \qquad (6)$$

Since $\theta(1) = 3$, we can substitute in (6) to get successive values for $\theta(l)$ as follows:

$$\theta(l) = \left(\frac{3}{2}\right)^{l-1} \cdot 3 = \left(\frac{3}{2}\right)^{l} \cdot 2$$

$\theta(l)$ represents the number of operands that can be added using a CSA tree that has $l$ levels. Suppose that the number of operands is $N$, then:

$$N = \left(\frac{3}{2}\right)^{l} \cdot 2$$

Taking the logarithm of both sides we have:

$$\log N = l \log\frac{3}{2} + \log 2 \approx l \log\frac{3}{2}$$

Then, for large $N$:

$$l = \frac{1}{\log\frac{3}{2}} \cdot \log N$$

We can find constants $C_1 > 0$, $C_2 > 0$, and $N_0 \ge 0$, such that for all $N \ge N_0$ the following is true:

$$C_1 \log N \le \frac{1}{\log\frac{3}{2}} \cdot \log N \le C_2 \log N \qquad (7)$$

Then

$$C_1 \log N \le l \le C_2 \log N \quad \forall\, N \ge N_0 \qquad (8)$$

Possible values for $C_1$, $C_2$, and $N_0$ are 1, 2, 1 (taking logarithms to base 2, for which $1/\log_2\frac{3}{2} \approx 1.71$). Equation (8) means that $l = O(\log N)$.

(ii) $\theta(l-1)$ is odd, then:

$$3 \cdot \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{3}{2}\,\theta(l-1) - 1.5 \qquad (9)$$

$$\theta(l-1) \bmod 2 = 1 \qquad (10)$$

Substituting in (3) using (9) and (10), we get:

$$\theta(l) = \frac{1}{2}\left(3\,\theta(l-1) - 1\right) \qquad (11)$$

Since $\theta(1) = 3$, we can substitute in (11) to get successive values for $\theta(l)$ as follows:

$$\theta(l) = \left(\frac{3}{2}\right)^{l-1} \cdot 3 - \left( \left(\frac{3}{2}\right)^{l-2} + \left(\frac{3}{2}\right)^{l-3} + \cdots + 1 \right) \cdot 0.5 = \frac{4}{3}\left(\frac{3}{2}\right)^{l} + 1$$

Suppose that the number of operands is $N$, then:

$$N = \frac{4}{3}\left(\frac{3}{2}\right)^{l} + 1$$

Using the same analytical method used for the case of even $\theta(l-1)$, we can find constants $C_1$, $C_2$, and $N_0$, such that for all $N \ge N_0$ the following is true:

$$C_1 \log N \le \frac{1}{\log\frac{3}{2}} \cdot \log N \le C_2 \log N$$

From the previous analysis in both cases (i) and (ii), $N$ numbers can be added using CSAs in $O(\log N)$.

Figure 2. An Example for Partial Sum Adder for $A = 6$ (in the figure, a marked output is shifted left with a zero entering from the right; the tree output is $S$)

3.2.2. Implementation using MCSA

The MCSA is based on the idea of representing a number as a Carry and a Sum, similar to the CSA. It can be used in the modulo addition of two numbers to obtain a scheme that has a constant speed which does not depend on the number of bits. Basically, the CSA depends on the idea of not completing the addition process at a certain stage, but postponing it to the final stage. In the intermediate stages, numbers are represented as Sum and Carry to avoid the complete addition process. The MCSA is used to add two numbers $A$ and $B$ in modulo $m$. Figure 3.a shows that $A$ is represented as a pair of numbers $(A_s, A_c)$, $B$ is represented as $(B_s, B_c)$, and the output $C$ is represented as $(C_s, C_c)$. Each number is represented as a group of Sum bits and Carry bits. There is no unique representation for $A_s$ and $A_c$. The condition that needs to be satisfied is:

$$A = |A_s + A_c|_m$$

One possible representation is:

$$A_s = A, \quad A_c = 0$$

Figure 3.a. A Modified CSA (MCSA) (inputs $(A_s, A_c)$ and $(B_s, B_c)$; output $(C_s, C_c)$)
We need to add four numbers $(A_s, A_c, B_s, B_c)$, which needs two steps of CSA. After the addition process we need to detect whether $2^n - m$ or $2(2^n - m)$ is required to adjust the result. The adjusting process takes at most three steps. The proposed algorithm for modulo $m$ addition of two numbers can be described as follows:

Algorithm modulo-add (A, B, Result)

Input: Two variables A and B in modulo m. A is represented as A_s and A_c; B is represented as B_s and B_c. All variables are n-bit numbers.

Output: Variable Result, represented as Result_s and Result_c. The relation between A, B, and Result is: Result = |A + B|_m.

Procedure:
begin
    Do in parallel
    begin
        Call Sum(temp1, A_s, A_c, B_s)
        Call Carry(temp2, A_s, A_c, B_s)
    end
    Do in parallel
    begin
        Call Sum(temp3, temp1, temp2, B_c)
        Call Carry(temp4, temp1, temp2, B_c)
    end
    Case (temp2[n+1] + temp4[n+1]) of
        0: Do in parallel
           begin
               Result_s := temp3
               Result_c := temp4
           end
           exit
        1: Do in parallel
           begin
               Call Sum(temp5, temp3, temp4, (2^n - m))
               Call Carry(temp6, temp3, temp4, (2^n - m))
           end
        2: Do in parallel
           begin
               Call Sum(temp5, temp3, temp4, 2*(2^n - m))
               Call Carry(temp6, temp3, temp4, 2*(2^n - m))
           end
    end case
    Case (temp6[n+1]) of
        0: Do in parallel
           begin
               Result_s := temp5
               Result_c := temp6
           end
           exit
        1: Do in parallel
           begin
               Call Sum(temp7, temp5, temp6, (2^n - m))
               Call Carry(temp8, temp5, temp6, (2^n - m))
           end
    end case
    Case (temp8[n+1]) of
        0: Do in parallel
           begin
               Result_s := temp7
               Result_c := temp8
           end
        1: Do in parallel
           begin
               Call Sum(temp9, temp7, temp8, (2^n - m))
               Call Carry(temp10, temp7, temp8, (2^n - m))
           end
           Do in parallel
           begin
               Result_s := temp9
               Result_c := temp10
           end
    end case
end.

Sum(A, B, C, D)
begin
    Do in parallel (1 ≤ i ≤ n)
        A[i] := B[i] ⊕ C[i] ⊕ D[i]
end

Carry(A, B, C, D)
begin
    A[1] := 0
    Do in parallel (1 ≤ i ≤ n)
        A[i+1] := (B[i] ∧ C[i]) ∨ (B[i] ∧ D[i]) ∨ (C[i] ∧ D[i])
end
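The idea behind the algorithm can be sketched in Python. This is not the fixed-stage hardware of the paper: the sketch uses a loop in place of the fixed correction stages, and the names `csa`, `mcsa_add`, and `resolve` are assumptions for illustration. It assumes $2^{n-1} < m < 2^n$, so that $2(2^n - m)$ still fits in $n$ bits. Each carry shifted out of position $n$ has weight $2^n$; dropping it and adding $2^n - m$ instead lowers the represented value by exactly $m$, so the residue is preserved and the loop must terminate.

```python
def csa(x, y, z, n):
    """One carry-save step on three n-bit words.
    Returns (sum_word, carry_word, carry_out); both words are kept to n bits
    and carry_out is the bit shifted out of position n."""
    mask = (1 << n) - 1
    s = x ^ y ^ z                                  # bitwise sum
    c = ((x & y) | (x & z) | (y & z)) << 1         # majority, shifted left
    return s, c & mask, c >> n

def mcsa_add(a, b, m, n):
    """Add a = (a_s, a_c) and b = (b_s, b_c) modulo m in sum/carry form."""
    assert (1 << (n - 1)) < m < (1 << n)           # assumption of this sketch
    a_s, a_c = a
    b_s, b_c = b
    k = (1 << n) - m                               # stands in for a dropped carry
    s, c, out1 = csa(a_s, a_c, b_s, n)             # step 1
    s, c, out2 = csa(s, c, b_c, n)                 # step 2
    pending = out1 + out2                          # 0, 1 or 2 dropped carries
    steps = 2
    while pending:                                 # correction steps
        s, c, pending = csa(s, c, pending * k, n)
        steps += 1
    return (s, c), steps

def resolve(pair, m):
    """Collapse a sum/carry pair to the ordinary residue (for checking only)."""
    return (pair[0] + pair[1]) % m

# The example values used in this paper: A = 1272, B = 450, m = 2050, n = 12.
result, steps = mcsa_add((1000, 272), (450, 0), m=2050, n=12)
assert resolve(result, 2050) == (1272 + 450) % 2050
```

No carry propagates along the word in any step, which is why the delay of each stage does not depend on $n$.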
An implementation of the algorithm is shown in Figure 3.b.

Figure 3.b. Different Stages of the MCSA (correction inputs $2^n - M$ and $2(2^n - M)$; outputs Result_s[1..n] and Result_c[1..n])
Theorem 2: The modulo adder scheme for adding two $n$-bit numbers in modulo $m$ has an asymptotic time complexity $O(1)$.

Proof: To prove that the number of steps is constant (five), we need to prove that the last carry is equal to zero in five or fewer steps. Induction on the number of bits $n$ is used to prove the correctness of the theorem.

[1] Basis step: for $n = 0$, we do not add any numbers, and in this case the required number of steps is zero.

[2] Induction hypothesis: assume for a fixed arbitrary $n > 0$ that the maximum number of steps is five.

[3] Induction step: for numbers with $n+1$ bits let:

$$q = temp_2[n+1] + temp_4[n+2]$$

Then we have the following cases:

(a) $q = 0$: then the carry propagation stopped at bit $n$, and it ends after five steps at most according to the induction hypothesis.

(b) $q = 1$: then the correction is $2^{n+1} - m$ in step 3. Since $m > 2^n$, then $2^{n+1} - m < 2^n$, which means that $(2^{n+1} - m)[n+1] = 0$. In the worst case, $temp_3[n+1]$ and $temp_4[n+2]$ are equal to one. This means that $temp_5[n+1] = 0$ and $temp_6[n+2] = 1$, then $temp_8[n+2] = 0$. In this case the correction is done in two steps (step 3 and step 4).

(c) $q = 2$: then the correction is $2(2^{n+1} - m)$ in step 3. In the worst case, $temp_3[n+1]$, $temp_4[n+2]$, and $(2(2^{n+1} - m))[n+1]$ are equal to one. Then $temp_5[n+1] = 1$, $temp_6[n+2] = 1$, and $(2^{n+1} - m)[n+1] = 0$. At step 4, $temp_7[n+1] = 0$ and $temp_8[n+2] = 1$. At step 5, $temp_9[n+1] = 1$ and $temp_{10}[n+2] = 0$. In this case the correction is done in three steps (steps 3 to 5).

Since the adder has a fixed number of stages which does not depend on the operands' length, it can be used in the implementation of a pipelined multioperand modulo addition scheme [19].
Example: As an example, the modulo addition of $A = 1272$ and $B = 450$ for $m = 2050$ is shown in Figure 3.c. There is no unique representation for $A$ and $B$; one valid representation is shown in Figure 3.c, which gives the detailed modulo addition operation for this example. In step 1 we get $temp_2[13] = 1$, and in step 2 we get $temp_4[13] = 1$, which means that at step 3 we have to add $2(2^n - M)$. At step 3 we get $temp_6[13] = 1$, which means that at step 4 we have to add $2^n - M$. At step 4 we get $temp_8[13] = 0$, which means that the addition process stops at step 4. The result of step 4 is the final result.

Figure 3.c. A Detailed Example for the Modulo Addition ($M = 2050$, $n = 12$; the figure lists the bit patterns of $A_s$, $A_c$, $B_s$, $B_c$, the intermediate sum and carry words of steps 1 to 4, and the final Result_s and Result_c)
Theorem 3: Adding $n$ numbers $(y_1, y_2, \ldots, y_n)$ in modulo $M$ is equivalent to:

[1] Adding $(y_1, y_2)$ modulo $M$, ..., $(y_i, y_{i+1})$, ..., and $(y_{n-1}, y_n)$ gives $y_{12}, \ldots, y_{(n-1)n}$.

[2] Step [1] is repeated on $(y_{12}, y_{34}), \ldots, (y_{(n-3)(n-2)}, y_{(n-1)n})$.

[3] Step [2] is repeated $\lceil \log n \rceil - 2$ times to obtain one final output represented as a sum and carry.

Proof: To add two numbers $a$ and $b$ in modulo $M$ we have the following cases:

(i) $a < M$ and $b < M$, then $a = |a|_M$ and $b = |b|_M$, then:

$$|a + b|_M = \big|\, |a|_M + |b|_M \,\big|_M \qquad (12)$$

(ii) $a > M$ and $b < M$, then $b = |b|_M$ and $a = M + z$, then:

$$|a + b|_M = |M + z + b|_M = |z + b|_M \qquad (13)$$

Since $z < M$ and $b < M$, then from (12) and (13):

$$|a + b|_M = \big|\, |a|_M + |b|_M \,\big|_M$$

(iii) $a < M$ and $b > M$: like case (ii).

(iv) $a > M$ and $b > M$, then $a = M + z$ and $b = M + y$, then:

$$|a + b|_M = |M + z + M + y|_M = |z + y|_M = \big|\, |a|_M + |b|_M \,\big|_M$$

From the previous four cases:

$$|a + b|_M = \big|\, |a|_M + |b|_M \,\big|_M \qquad (14)$$

Since addition is associative, then:

$$|y_1 + y_2 + \cdots + y_n|_M = \big|\, |y_1 + y_2|_M + |y_3 + \cdots + y_n|_M \,\big|_M$$

We can further expand this expression using the same method to get the addition process on the right hand side in terms of only two operands added in modulo $M$.

Theorem 3 means that adding $n$ numbers in modulo $M$ can be performed using a binary tree consisting of units that are capable of adding only two numbers in modulo $M$. MCSAs are used as those building blocks to perform the addition process. Since the MCSA requires that inputs be represented in the form of sum and carry, this form should be enforced at all levels. The form will be enforced automatically for levels $\ge 2$, because the outputs of the previous levels are in the correct form. For the first level we have the following:

$$y_{i_s} = y_i, \quad y_{i_c} = 0 \quad \forall\, 1 \le i \le n$$

For the last stage the output is in the form of sum and carry, which is exactly the same form as CSAs. Figure 3.d shows the binary tree required to add $n$ numbers in modulo $M$.

Figure 3.d. Partial Sum Adder Implementation Using MCSAs (Stage 1, Stage 2, Stage 3, ..., Stage $\lceil \log n \rceil$)
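The tree structure implied by Theorem 3 can be sketched in Python. For brevity the sketch uses an ordinary two-operand modulo addition at each node in place of an MCSA cell, so it shows only the pairing and the number of levels; the function name `mod_tree_sum` is an assumption for illustration, and the input list is assumed to be non-empty.

```python
def mod_tree_sum(values, M):
    """Add the values modulo M with a binary tree of two-operand modulo adders.
    Returns (result, number_of_levels). Assumes a non-empty input list."""
    level = [v % M for v in values]
    depth = 0
    while len(level) > 1:
        # Pair neighbours; each pair goes through one two-operand modulo adder.
        nxt = [(level[i] + level[i + 1]) % M for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an unpaired operand passes to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth

result, depth = mod_tree_sum([5, 17, 23, 42, 8, 99, 64], M=101)
```

Seven operands are reduced in three levels, that is, $\lceil \log_2 7 \rceil$ levels, and every intermediate value stays below $M$.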
3.3. Range Determinator

This stage consists of three levels, namely ROM, Magnitude Comparator (MC), and Bit Corrector (BC). The major function of this stage is to determine the range of $S$ so that the appropriate value can be subtracted from $S$ to obtain the desired result. In order to accomplish this, 2 sets of values, as shown in Table I, have to be compared. For simplicity, we explain the first set and then the second set.

Since the input to this stage, $S$, is a large binary number, it is partitioned into groups of adjacent bits. For example, if $S$ is a 24-bit number, we can partition $S$ into 3 8-bit groups $G_1$, $G_2$, and $G_3$, where

$$G_1 = S_{7..0}, \quad G_2 = S_{15..8}, \quad \text{and} \quad G_3 = S_{23..16}.$$

Since each group is fed into a ROM module as an address input, the number of bits in each group should be small so that small ROMs that are fast and occupy a small silicon area are used to implement this level. However, the number of groups, $g$, should be kept as small as possible since the complexity of the MC cells is a function of the number of ROM modules, $g$. Hence, there are tradeoffs in choosing $g$ and the number of bits in each group.
As shown in Figure 4, the input to the $i$th ROM module of the first set of ROMs is $G_i$, and the outputs are the $B_i^j$'s and $C_i^k$'s. The function of this ROM module is depicted as follows:

$$B_i^j = \begin{cases} 0 & \text{if } G_i \le [jM-1]_i \\ 1 & \text{if } G_i > [jM-1]_i \end{cases} \quad \text{for } j = 1..A-1$$

$$C_i^k = \begin{cases} 0 & \text{if } G_i \ne [kM-1]_i \\ 1 & \text{if } G_i = [kM-1]_i \end{cases} \quad \text{for } k = 2..A-1$$

for $i = 1, 2, \ldots, g$, where $[jM-1]_i$ denotes the $i$th group of bits of the value $jM-1$.

Table I. Values Compared by Multimagnitude Comparators

Magnitude compared     1           2            ...   A-1
First set              M-1         2M-1         ...   (A-1)M-1
Second set (M odd)     (M-1)/2     (3M-1)/2     ...   ((2A-3)M-1)/2
Second set (M even)    M/2 - 1     3M/2 - 1     ...   (2A-3)M/2 - 1

Clearly, these ROM modules serve as a partial multimagnitude comparator that compares the input pattern $S$ to the first set of values as shown in Table I and produces $g(2A-3)$ outputs that are to be fed into the MC level.
The MC level consists of $(A-1)$ MC cells. This level takes its input from the ROM level and does further comparison so that a 2-level multimagnitude comparator is formed. The complexity of an MC cell is a function of the number of ROM modules. If we have $g$ ROM modules, then the Boolean equation for the $l$th MC cell is as follows:

$$MC_l^1 = B_g^l + B_{g-1}^l C_g^l + B_{g-2}^l C_g^l C_{g-1}^l + \cdots + B_2^l C_g^l C_{g-1}^l \cdots C_3^l + B_1^l C_g^l C_{g-1}^l \cdots C_3^l C_2^l \qquad (15)$$

Hence we have,

$$MC_l^1 = \begin{cases} 0 & \text{if } S < lM \\ 1 & \text{if } S \ge lM \end{cases} \quad \text{for } l = 1, 2, \ldots, A-1 \qquad (16)$$

Since $S$ may be larger than several of the values compared, the outputs of several MC cells may be set to 1; therefore, the BC level is used to ensure that only one output is set to one and also to indicate the appropriate range. In order to do so, $A$ identical $BC^1$ cells are needed, and their common Boolean equation is as follows:

$$BC_m^1 = \overline{MC_m^1 + \overline{MC_{m-1}^1}} \quad \text{for } m = 1, 2, \ldots, A, \quad \text{where } MC_0^1 = 1 \text{ and } MC_A^1 = 0 \qquad (17)$$

Hence, the range of $S$ is determined to be $(m-1)M \le S < mM$ if $BC_m^1 = 1$. Figure 4 shows the implementation of the multimagnitude comparator that compares $S$ with the first set of values shown in Table I and its BC level.
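The two-level comparison and the BC level can be modelled in Python as below. This is a behavioural sketch under assumptions: the group width `w`, the number of groups `g`, and the function names `groups`, `mc_cell`, and `range_one_hot` are chosen for illustration, and an equality flag is computed for every group rather than only for the subset stored in the ROMs.

```python
def groups(value, g, w):
    """Split value into g groups of w bits, least significant group first."""
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(g)]

def mc_cell(S, threshold, g, w):
    """Return 1 if S > threshold, using only per-group 'greater' (B)
    and 'equal' (C) flags, as a ROM level would supply them."""
    G, T = groups(S, g, w), groups(threshold, g, w)
    B = [gi > ti for gi, ti in zip(G, T)]
    C = [gi == ti for gi, ti in zip(G, T)]
    result, all_equal_above = False, True
    for i in reversed(range(g)):                 # most significant group first
        result = result or (all_equal_above and B[i])
        all_equal_above = all_equal_above and C[i]
    return int(result)

def range_one_hot(S, M, A, g, w):
    """BC outputs: entry m-1 is 1 exactly when (m-1)*M <= S < m*M."""
    # MC_0 = 1 and MC_A = 0 are the fixed boundary values.
    mc = [1] + [mc_cell(S, l * M - 1, g, w) for l in range(1, A)] + [0]
    return [mc[m - 1] & (1 - mc[m]) for m in range(1, A + 1)]

bc = range_one_hot(S=230, M=105, A=3, g=3, w=3)   # 210 <= 230 < 315
```

Here `bc` marks the third range, so the value to subtract for unsigned output is $2M$.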
Figure 4. Implementation of the First Part of the Range Determinator Stage

Thus far, the range determination enables the $S$ modulo $M$ operation to be performed by $S - (m-1)M$ if $BC_m^1$ is set to one. Therefore, only residue to unsigned magnitude number system conversion is possible. However, for residue to 2's complement number system conversion, the second set of values, as shown in Table I, has to be compared with $S$ by another multimagnitude comparator, which is done in the same way as previously explained. Figure 5 shows that the input to the $i$th ROM module of the second set of ROMs is $G_i$, and the outputs are the $D_i^j$'s and $E_i^k$'s. The function of this ROM module is depicted as follows:

For $M$ odd,

$$D_i^j = \begin{cases} 0 & \text{if } G_i \le \left[\frac{(2j-1)M-1}{2}\right]_i \\ 1 & \text{if } G_i > \left[\frac{(2j-1)M-1}{2}\right]_i \end{cases} \quad \text{for } j = 1..A-1$$

for $i = 1, 2, \ldots, g$, and for $M$ even,

$$D_i^j = \begin{cases} 0 & \text{if } G_i \le \left[\frac{(2j-1)M}{2} - 1\right]_i \\ 1 & \text{if } G_i > \left[\frac{(2j-1)M}{2} - 1\right]_i \end{cases} \quad \text{for } j = 1..A-1 \qquad (20)$$

for $i = 1, 2, \ldots, g$. In both cases, $E_i^k$ is set to one exactly when $G_i$ equals the $i$th group of bits of the $k$th compared value.

The MC level of this part is exactly the same as previously proposed, that is, it consists of $(A-1)$ $MC^2$ cells, and each $MC^2$ cell has the same Boolean equation as follows:

$$MC_l^2 = D_g^l + D_{g-1}^l E_g^l + D_{g-2}^l E_g^l E_{g-1}^l + \cdots + D_2^l E_g^l E_{g-1}^l \cdots E_3^l + D_1^l E_g^l E_{g-1}^l \cdots E_3^l E_2^l$$

Since a different set of values is compared with $S$, we have for $M$ odd,

$$MC_l^2 = \begin{cases} 0 & \text{if } S \le \frac{(2l-1)M-1}{2} \\ 1 & \text{if } S > \frac{(2l-1)M-1}{2} \end{cases}$$

and for $M$ even,

$$MC_l^2 = \begin{cases} 0 & \text{if } S < \frac{(2l-1)M}{2} \\ 1 & \text{if } S \ge \frac{(2l-1)M}{2} \end{cases}$$

for $l = 1, 2, \ldots, A-1$.

The BC level for this part of the design consists of $A$ $BC^2$ cells. Each of these cells has a control line $C$. If $C$ is equal to zero, then all the output lines of the BC level will be equal to one and residue to unsigned magnitude number system conversion will be performed; otherwise, only one of the BC level output lines will be equal to one, and thus residue to 2's complement number system conversion will be performed. The Boolean equation for a $BC^2$ cell is as follows:

$$BC_m^2 = \overline{C \cdot \overline{\,\overline{MC_m^2} \cdot MC_{m-1}^2\,}} \quad \text{for } m = 1, 2, \ldots, A, \quad \text{where } MC_0^2 = 1 \text{ and } MC_A^2 = 0$$

Therefore, with $C = 1$, the range of $S$ is determined to be $\frac{(2m-3)M+1}{2} \le S < \frac{(2m-1)M+1}{2}$ for $M$ odd, and $\frac{(2m-3)M}{2} \le S < \frac{(2m-1)M}{2}$ for $M$ even, if $BC_m^2 = 1$ (note that the lower bound is equal to zero when $m = 1$). Figure 5 shows the implementation of this part of the design.
3.4. Final Corrector

This stage consists of $A$ tristate multiplexers and a carry lookahead adder. The $BC^1$ input lines will be used to enable one of the tristate multiplexers while the $BC^2$ input lines will be used as the selectors of the multiplexers. If $BC_i^1$ is set, then $(i-1)M \le S < iM$. The lower bound, $(i-1)M$, will be subtracted from $S$ if conversion to the unsigned magnitude number system is desired, or if $S$ is less than $\frac{(2i-1)M+1}{2}$ for $M$ odd or $\frac{(2i-1)M}{2}$ for $M$ even; otherwise, the upper bound, $iM$, will be subtracted from $S$. The implementation of this stage is shown in Figure 6. The CLA is used to add the 2's complement of the value to be subtracted to $S$ and output the desired result.

Figure 5. Implementation of the Second Part of the Range Determinator Stage

Figure 6. Implementation of the Final Corrector Stage (2x1 tristate multiplexers driven by the $BC^1$ and $BC^2$ lines, followed by a carry lookahead adder that takes $S$ and outputs $|X|_M$)
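The selection rule of the final corrector can be written as a short behavioural sketch in Python. The function name `final_correct` and the flag `twos_complement` (standing in for the control line $C$) are assumptions for illustration; the range index is computed by integer division here instead of by the comparator levels.

```python
def final_correct(S, M, A, twos_complement):
    """Subtract the lower bound (i-1)*M or the upper bound i*M from S,
    where (i-1)*M <= S < i*M, following the rule of the final corrector."""
    i = S // M + 1                       # stands in for the one-hot BC^1 output
    assert 1 <= i <= A
    if M % 2:                            # M odd
        lower_half = S < ((2 * i - 1) * M + 1) // 2
    else:                                # M even
        lower_half = S < ((2 * i - 1) * M) // 2
    if not twos_complement or lower_half:
        return S - (i - 1) * M           # unsigned result, or non-negative value
    return S - i * M                     # negative value in the dynamic range

unsigned_value = final_correct(S=314, M=105, A=3, twos_complement=False)
signed_value = final_correct(S=314, M=105, A=3, twos_complement=True)
```

For $S = 314$ and $M = 105$ the unsigned result is 104, while the signed result is $-1$, since 104 lies in the upper half of the legitimate range.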
4. Performance Evaluation

[1] The partial sum generator is implemented using small ROMs. If the number of residues is $N$ and each residue is represented in $P$ bits, then $N$ ROMs are required. Each ROM stores values bounded by $M$, so the size of each ROM is $2^P \cdot \lceil \log M \rceil$ bits. The total area required for this stage is $N \cdot 2^P \cdot \lceil \log M \rceil$. Since ROMs have a constant time delay ($P$ is a small number) which does not depend on $N$, the delay of this stage is $\Theta(1)$.

[2] The partial sum adder is implemented in two different ways:

(a) Using CSAs: The complexity of the scheme is determined by Theorem 1. Since each CSA has a constant time delay, the total time required to add $N$ numbers in modulo $M$ is $\Theta(\log N)$.

(b) Using MCSAs: The number of levels required to perform the addition of $N$ numbers using a binary tree of MCSAs is $\lceil \log N \rceil$, as shown in Theorem 3. Since the time required at each level is constant (the MCSA has a constant delay), the total time required for this step using MCSAs is $\Theta(\log N)$.

[3] The range determinator consists of three different levels (Figure 4). The first level consists of $g$ ROMs. The second level is the MC cells, which are combinational circuits that can be represented with a two-level switching function. Finally, the last level is a two-stage combinational circuit. The three levels have a constant time delay that does not depend on $N$. The previous analysis shows that the range determinator has a time delay of $\Theta(1)$.

[4] The final corrector consists of two stages. In the first stage we have $A$ tristate multiplexers, which have a constant delay equivalent to two serial NAND gates. The second stage is a CLA, which has a constant delay; for a number of bits less than 64, the delay is equivalent to the delay of 12 serial NAND gates, as shown in [13]. For a larger number of bits we can still obtain a constant delay CLA. Hence the final corrector has a delay of $\Theta(1)$.

From cases [1]-[4] we see that all stages except the partial sum adder have a constant time delay which does not depend on the number of residues $N$. Only the second stage requires $\Theta(\log N)$ steps.
5. Conclusions

The residue decoder introduced in this paper has a total delay of $\lceil \log N \rceil$. In addition, it has several advantages as listed below:

1) The design is quite modular and consists of simple cells such as small ROMs and MC cells. This makes the implementation of the whole residue decoder in a single chip possible.
2) It doesn't have any limitation on the moduli used.
3) It is flexible since it can convert residues to either the unsigned magnitude or the 2's complement number system, and it is controlled by only one control line, $C$. This means that it can be applied to a wider area.
4) It is fast compared with most schemes proposed before since it has a time complexity of $\Theta(\log N)$.
5) It can be easily pipelined without any modifications.
Acknowledgement

The authors wish to thank the reviewers for their valuable comments.
References

[1] M. A. Bayoumi, "Digital Filter VLSI Systolic Arrays over Finite Fields for DSP Applications," Proc. of the 6th IEEE Annual Phoenix Conference on Computers and Communications, pp. 194-199, Feb. 1987.
[2] M. A. Bayoumi, G. A. Jullien, W. C. Miller, "A Lookup Table VLSI Design Methodology for RNS Structures Used in DSP Applications," IEEE Trans. on Circuits and Systems, pp. 604-616, Vol. 34, No. 6, June 1987.
[3] F. J. Taylor, "Residue Arithmetic: A Tutorial with Examples," IEEE Computer Magazine, pp. 50-62, May 1984.
[4] M. A. Bayoumi, "A High Speed VLSI Complex Digital Signal Processor Based on Quadratic Residue Number System," VLSI Signal Processing II, pp. 200-211, IEEE Press, 1986.
[5] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer Technology, New York: McGraw-Hill, 1967.
[6] W. K. Jenkins, "Techniques for Residue to Analog Conversion for Residue Encoded Digital Filters," IEEE Trans. Circuits Syst., vol. CAS-25, pp. 553-562, July 1978.
[7] S. Andraos and H. Ahmed, "A New Efficient Memoryless Residue to Binary Converter," IEEE Trans. Circuits Syst., vol. 35, Nov. 1988, pp. 1441-1444.
[8] K. M. Ibrahim and S. N. Saloum, "An Efficient Residue to Binary Converter Design," IEEE Trans. Circuits Syst., vol. 35, pp. 1156-1158, September 1988.
[9] S. Bandyopadhyay, G. A. Jullien, and A. Sengupta, "A Systolic Array for Fault-Tolerant Digital Signal Processing Using A Residue Number System Approach," Proc. of Intl. Conf. on Systolic Arrays, pp. 577-586, 1988.
[10] A. P. Shenoy and R. Kumaresan, "Residue to Binary Conversion for RNS Arithmetic Using Only Modular Lookup Tables," IEEE Trans. Circuits Syst., vol. 35, pp. 1158-1162, September 1988.
[11] T. V. Vu, "Efficient Implementations of the CRT for Sign Detection and Residue Decoding," IEEE Trans. Comp., vol. C-34, pp. 646-651, July 1985.
[12] C. N. Zhang, B. Shirazi, and D. Y. Y. Yun, "Parallel Designs for Chinese Remainder Conversion," Proc. IEEE 16th Annual Conf. on Parallel Processing, Aug. 1987.
[13] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 1978.
[14] R. M. Capocelli and R. Giancarlo, "Efficient VLSI Networks for Converting an Integer from Binary System and Vice Versa," IEEE Trans. Circuits Syst., vol. 35, Nov. 1988, pp. 1425-1430.
[15] G. Alia and E. Martinelli, "A VLSI Algorithm for Direct and Reverse Conversion from Weighted Binary Number System to Residue Number System," IEEE Trans. Circuits Syst., vol. CAS-31, 1984, pp. 1033-1039.
[16] K. P. Lee, M. A. Bayoumi and K. M. Elleithy, "A Fast and Flexible Residue Decoder Based on The Chinese Remainder Theorem," The 1989 International Symposium on Circuits and Systems.
[17] K. M. Elleithy, "On Bit-Parallel Processing for Modulo Arithmetic," VLSI Technical Report TR-86-8-1, The Center for Advanced Computer Studies, University of Southwestern Louisiana, 1986.
[18] A. Avizienis, "A Study of Redundant Number Representations for Parallel Digital Computers," Ph.D. Thesis, Univ. of Illinois, Urbana, Illinois, May 1960.
[19] K. M. Elleithy, "On the Bit-Parallel Implementation for the Chinese Remainder Theorem," VLSI Technical Report TR-87-8-1, The Center for Advanced Computer Studies, University of Southwestern Louisiana, 1987.