226  IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 39, NO. 4, APRIL 1992
Fast and Flexible Architectures for RNS Arithmetic Decoding

Khaled M. Elleithy, Member, IEEE, and Magdy A. Bayoumi, Senior Member, IEEE
Abstract: An implementation of a fast and flexible residue decoder for residue number system (RNS)-based architectures is proposed. The decoder is based on the Chinese Remainder Theorem (CRT). It decodes a set of residues to its equivalent representation in the weighted binary number system. The decoder is flexible, since the decoded data can be selected to be either an unsigned magnitude or a 2's complement binary number. Two different architectures are analyzed; the first is based on carry-save adders (CSA's), while the other is based on modulo adders (MA's). The implementation of both architectures is modular and is based on simple cells, which leads to an efficient VLSI realization. The proposed decoder is fast; it has a time complexity of $O(\log N)$, where $N$ is the number of moduli.

Keywords: Residue number system, Chinese remainder theorem, modulo adder, carry-save adder, residue decoding, finite field algorithm.
I. INTRODUCTION

RECENTLY, RNS has received increased attention due to its ability to support high-speed concurrent arithmetic [1]-[3]. Applications such as the fast Fourier transform, digital filtering, and image processing exploit the efficiency of RNS addition and multiplication; they do not require the difficult RNS operations such as division and magnitude comparison. RNS has been employed efficiently in the implementation of digital signal processors [1], [4].

Since special-purpose processors are associated with general-purpose computers, binary-to-residue and residue-to-binary conversions become inherently important, and the conversion process should not offset the speed gain in RNS operations. While the binary-to-residue conversion does not pose a serious threat to the high-speed RNS operations, the residue-to-binary conversion can be a bottleneck. The Chinese Remainder Theorem (CRT) [5], [6] is considered the main algorithm for the conversion process. Several implementations of the residue decoder have been reported [7]-[15]. The residue decoders in [7] and [8] are based on using three moduli of the form $(2^n - 1, 2^n, 2^n + 1)$, where $n$ is the number of bits. Due to the limitation imposed on the number and the choice of moduli, they are limited in application. In [9], a scheme of $O(\log NP)$ (where $P$ is the number of bits and $N$ is the number of moduli) is used to support only unsigned magnitude binary numbers. In [10], the residue decoder is based on the base extension technique; it uses only modular look-up tables in its implementation. Since look-up tables are used, the moduli must not be large for the implementation to be feasible. In addition, it does not support residue to 2's complement binary number system conversion. Although look-up tables are used in this scheme, its time complexity is $O(N^2)$. The implementation in [11] requires that one of the moduli be a power of two; therefore, it may be limited in application. In [12], the proposed residue decoders are based on biased addition and take advantage of the fast addition speed of CSA [16], but the conversion output is not in 2's complement form. In [13] and [14], the scheme used has a time complexity of $O((\log N)^2)$. In [15], the mixed-radix conversion algorithm is used with a time complexity of $O(N)$.

In this paper, an $O(\log N)$ residue decoder capable of decoding a set of residues to its equivalent representation in the unsigned magnitude or 2's complement binary number system is introduced. Two different architectures, using CSA's based on [17] and modulo adders (MA's) [18], are implemented. In the following section, the RNS theory is reviewed. Section III discusses how this fast and flexible residue decoder can be implemented. Section IV evaluates the speed performance of this residue decoder.

Manuscript received October 15, 1990; revised December 17, 1991. This work was supported in part by NSF Grant MIP-8809811. This paper was recommended by Associate Editor M. H. Etzel.
K. M. Elleithy is with the Computer Engineering Department, King Fahd University, Dhahran 31261, Saudi Arabia.
M. A. Bayoumi is with the Center for Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, LA 70504.
IEEE Log Number 9107521.
II. RESIDUE NUMBER SYSTEM

In RNS, an integer $X$ can be represented by an $N$-tuple of residue digits,

$$X = (r_1, r_2, \ldots, r_N)$$

where $r_i = |X|_{m_i}$, with respect to a set of $N$ moduli $\{m_1, m_2, \ldots, m_N\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime; that is,

$$\mathrm{GCD}(m_i, m_j) = 1, \quad \text{for } i \neq j.$$

It can then be shown that there is a unique representation for each number in the range $0 \le X < \prod_{i=1}^{N} m_i = M$, where $N$ is the number of moduli.
The arithmetic operation on two integers $A$ and $B$ is equivalent to the arithmetic operation on their residue representations, that is,

$$|A \circ B|_M = \left( \big|\, |A|_{m_1} \circ |B|_{m_1} \big|_{m_1}, \ldots, \big|\, |A|_{m_N} \circ |B|_{m_N} \big|_{m_N} \right)$$

where "$\circ$" can be addition, subtraction, or multiplication. Therefore, it is desirable to convert binary arithmetic on large integers to residue arithmetic on smaller residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.
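The carry-free property described above can be illustrated with a small sketch. The moduli set (3, 5, 7), the operand values, and the helper names `to_rns` and `rns_op` are assumptions chosen for illustration; they do not come from the paper.

```python
from math import gcd, prod

def to_rns(x, moduli):
    """Residue digits r_i = |x|_{m_i} of an integer x."""
    return tuple(x % m for m in moduli)

def rns_op(a, b, moduli, op):
    """Apply op digit by digit; each digit is reduced by its own modulus only."""
    return tuple(op(ra, rb) % m for ra, rb, m in zip(a, b, moduli))

moduli = (3, 5, 7)  # illustrative, pairwise relatively prime
assert all(gcd(p, q) == 1 for i, p in enumerate(moduli) for q in moduli[i + 1:])
M = prod(moduli)    # size of the legitimate range

a, b = 17, 4
ra, rb = to_rns(a, moduli), to_rns(b, moduli)
rns_sum = rns_op(ra, rb, moduli, lambda x, y: x + y)
rns_prod = rns_op(ra, rb, moduli, lambda x, y: x * y)
```

Each digit position is processed independently, which is why the positions could be evaluated in parallel in hardware. The results are valid as long as the true sum or product stays below $M$.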
For applications in digital signal processing, it is helpful to define a dynamic range for the RNS with positive and negative integers. The dynamic range is defined as $[-(M-1)/2, (M-1)/2]$ for $M$ odd, and as $[-M/2, M/2 - 1]$ for $M$ even, or more specifically, for $M$ odd,

$$X = \begin{cases} Z, & \text{if } Z \le \dfrac{M-1}{2} \\[4pt] Z - M, & \text{if } Z > \dfrac{M-1}{2} \end{cases}$$

and for $M$ even,

$$X = \begin{cases} Z, & \text{if } Z < \dfrac{M}{2} \\[4pt] Z - M, & \text{if } Z \ge \dfrac{M}{2} \end{cases}$$

where $Z$ is an integer within the legitimate range, $0 \le Z < M$. Any integer $X$ within the dynamic range can be represented by $N$ residue digits.
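As a minimal sketch of the mapping from the legitimate range to the dynamic range, the following function applies the two case definitions above. The function name `to_signed` is hypothetical.

```python
def to_signed(z, M):
    """Map Z in [0, M) to X in the dynamic range for M odd or even."""
    if not 0 <= z < M:
        raise ValueError("Z must lie in the legitimate range 0 <= Z < M")
    # Largest value that maps to itself: (M-1)/2 for M odd, M/2 - 1 for M even.
    largest_positive = (M - 1) // 2 if M % 2 else M // 2 - 1
    return z if z <= largest_positive else z - M
```

For example, with $M = 8$ the values 0 to 3 map to themselves and the values 4 to 7 map to $-4$ through $-1$.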
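The conversion from RNS to the weighted binary number system is done by using the CRT, which states that

$$|X|_M = \left| \sum_{j=1}^{N} \hat{m}_j \left| r_j \hat{m}_j^{-1} \right|_{m_j} \right|_M \qquad (1)$$

where $\hat{m}_j = M / m_j$, $\hat{m}_j^{-1}$ is the multiplicative inverse of $\hat{m}_j$ modulo $m_j$, and $\mathrm{GCD}(m_j, m_k) = 1$ for $j \neq k$. Although the CRT provides a direct, fast, and simple conversion formula, the lack of a large and fast modulo $M$ adder has held back this approach.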
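A direct software rendering of the CRT conversion can make the role of each term concrete. This is a sketch in the standard textbook form of the theorem, not the hardware decoder of the paper; the function name `crt_decode` is an assumption, and the three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
from math import prod

def crt_decode(residues, moduli):
    """Recover |X|_M from its residues with the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = M // m                    # product of all other moduli
        inverse = pow(m_hat, -1, m)       # multiplicative inverse of m_hat modulo m
        total += m_hat * ((r * inverse) % m)
    return total % M                      # the final modulo M reduction
```

The final `% M` is the step that, in hardware, calls for the large modulo $M$ adder mentioned above.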
III. THE RESIDUE DECODER

The residue decoder based on the CRT can be implemented by a modulo $M$ adder tree. The modulo $M$ adders at each level are used to correct the partial sum so that it will be within the legitimate range. Since the modulo $M$ adder is very slow, the implementation may pose an overhead to the overall speed performance of an RNS processor. In addition, the CRT only converts residues to their binary representation in the legitimate range but not in the dynamic range. Therefore, conversion to the 2's complement binary number system requires a final correction.

In order to implement a high-speed residue decoder that can perform conversion to both the unsigned magnitude and the 2's complement binary number system, the following solutions are proposed.

1) The number of modulo $M$ adders or binary adders should be reduced to a minimum.
2) CSA's can be used wherever multioperand addition is required, due to their high addition speed.
3) MA's can be used for multioperand addition, due to their constant speed in adding $n$-bit numbers in modulo $M$.
4) Correction can be performed only at the last stage, and it supports conversion to both the unsigned magnitude and the 2's complement binary number system.

For ease of design, the residue decoder is partitioned into four stages as shown in Fig. 1. The inputs to the residue decoder are the residues and a control line, $C$, which determines whether the output is in the unsigned magnitude or the 2's complement number system.
3.1 Partial Sum Generator

The inputs to this stage are the $N$ residues. The main function of this stage is to compute partial sums, $t_i$'s, where

$$t_i = \left| \hat{m}_i \left| r_i \hat{m}_i^{-1} \right|_{m_i} \right|_M$$

with $\hat{m}_i = M / m_i$. Since $m_i$ is usually small, the value of $t_i$ can be obtained by accessing a look-up table with a small address space. Hence, $r_i$ will serve as a ROM address input, and $t_i$ will be obtained from a ROM output as shown in Fig. 2.

In most cases, it is better to reduce the number of partial sums, $t_i$'s, in order to reduce the complexity at lower stages and hence increase the speed of the residue decoder as a whole. Since a modulus $m_j$ can be represented by a $\lceil \log_2 m_j \rceil$-bit binary number, the $j$th residue is

$$r_j = \sum_{k=0}^{\lceil \log_2 m_j \rceil - 1} 2^k b_k^j$$

where $b_k^j \in \{0, 1\}$. By substituting $r_j$ in (1), we can rewrite the CRT as follows:

$$|X|_M = \left| \sum_{j=1}^{N} \hat{m}_j \left| \hat{m}_j^{-1} \sum_{k=0}^{\lceil \log_2 m_j \rceil - 1} 2^k b_k^j \right|_{m_j} \right|_M \qquad (2)$$

Hence, if we have a set of 8 moduli $\{2, 3, 5, 7, 11, 13, 17, 23\}$ with residues $\{r_1, r_2, r_3, r_4, r_5, r_6, r_7, r_8\}$, respectively, only four ROM's with 7-bit address inputs are needed to implement this level, and modulo summation of four operands is required instead of eight, where each of the partial sums $t_1$, $t_2$, $t_3$, and $t_4$ combines the CRT terms of the residues that address one ROM.
Fig. 1. Block diagram of the residue decoder.

TABLE I
THE VALUE OF $\theta(l)$ VERSUS $l$ FOR PRACTICAL APPLICATIONS

Number of levels, $l$                    1   2   3   4   5    6    7    8
Maximum number of operands, $\theta(l)$  3   4   6   9   13   19   28   42
The previous implementation reduces the $N$ partial sums to a smaller number of partial sums, $\lambda$.
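The grouping of residues into shared look-up tables can be sketched as follows for the eight-moduli example. The particular pairing in `GROUPS` is an assumption made here so that every table needs at most 7 address bits; the pairing used by the authors is not recoverable from the text. Dictionaries stand in for the ROM's, and all function names are hypothetical.

```python
from itertools import product
from math import prod

MODULI = (2, 3, 5, 7, 11, 13, 17, 23)
GROUPS = ((23, 3), (17, 2), (13, 7), (11, 5))  # assumed pairing, not from the paper
M = prod(MODULI)

def crt_term(r, m):
    """Single CRT term for residue r of modulus m, reduced modulo M."""
    m_hat = M // m
    return (m_hat * ((r * pow(m_hat, -1, m)) % m)) % M

def width(m):
    """Number of address bits for a residue of modulus m, i.e. ceil(log2 m)."""
    return (m - 1).bit_length()

def address(residues, group):
    """Concatenate the residue bit fields of one group into a ROM address."""
    addr = 0
    for r, m in zip(residues, group):
        addr = (addr << width(m)) | r
    return addr

def build_rom(group):
    """Table holding the modulo-M sum of the CRT terms of one group."""
    return {
        address(rs, group): sum(crt_term(r, m) for r, m in zip(rs, group)) % M
        for rs in product(*(range(m) for m in group))
    }

ROMS = [build_rom(g) for g in GROUPS]

def partial_sums(x):
    """Look up the four partial sums for the residues of x."""
    return [rom[address([x % m for m in g], g)] for rom, g in zip(ROMS, GROUPS)]
```

Summing the four looked-up values modulo $M$ recovers $X$, so the later adder stage handles four operands rather than eight.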
3.2 Partial Sum Adder

By far, the modulo $M$ summation of the partial sums, $t_i$'s, poses the biggest challenge to the implementation of the residue decoder, due to the slow computational speed of the modulo $M$ adder. This stage can be implemented using two different approaches.

3.2.1 Implementation using CSA: A multilevel CSA tree consists of $N - 2$ CSA's and a carry propagate adder (CPA) [16]. These are used to reduce $\lambda$ partial sums, $t$'s, to a sum, $S$. Let $l$ be the number of levels in a CSA tree, and $\theta(l)$ be the maximum number of operands that can be processed with an $l$-level CSA tree. We can compute $\theta$ by the recursive formula provided by Avizienis [19]:

$$\theta(l) = 3 \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor + \theta(l-1) \bmod 2 \qquad (3)$$

for $l = 2, 3, \ldots$, and initially $\theta(1) = 3$. Table I summarizes the values of $\theta$ versus the values of $l$ for practical situations ($\theta(l) = 3, 4, 6, 9, 13, 19, 28, 42$ for $l = 1, \ldots, 8$). The CPA is an $(m - 1)$-bit two-level carry look-ahead adder (CLA) [16], where $m = \lceil \log_2 (M\lambda) \rceil$. Hence the output $S$ is an $m$-bit number that is passed to the next stage. A design example for this stage is shown in Fig. 3(a). The complexity of the scheme is determined by Theorem 1.
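The recursion for the capacity of a CSA tree is easy to evaluate numerically. The sketch below assumes the form $\theta(l) = 3\lfloor \theta(l-1)/2 \rfloor + \theta(l-1) \bmod 2$ with $\theta(1) = 3$, which is the form implied by the two cases of the proof of Theorem 1; the function names are hypothetical.

```python
def theta(levels):
    """Maximum number of operands an l-level CSA tree can reduce."""
    t = 3                                  # theta(1) = 3: one CSA takes three operands
    for _ in range(levels - 1):
        t = 3 * (t // 2) + t % 2
    return t

def levels_needed(n_operands):
    """Smallest number of CSA levels whose capacity covers n_operands."""
    level = 1
    while theta(level) < n_operands:
        level += 1
    return level
```

For instance, the eight partial sums of an eight-moduli system need `levels_needed(8)` levels, while the grouped version with four partial sums needs only `levels_needed(4)`.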
Theorem 1: The addition of $N$ numbers using CSA's can be performed in $O(\log N)$ steps.

Proof: The number of levels required for addition in a CSA tree is determined by (3). To determine the number of levels required to add $N$ numbers, let us consider the following two cases.

Case 1: $\theta(l-1)$ is even. Then

$$3 \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{3}{2}\theta(l-1) \qquad (4)$$

$$\theta(l-1) \bmod 2 = 0. \qquad (5)$$

Substituting in (3) using (4) and (5), we have

$$\theta(l) = \frac{3}{2}\theta(l-1). \qquad (6)$$

Since $\theta(1) = 3$, we can substitute in (6) to get successive values for $\theta(l)$ as follows:

$$\theta(2) = \tfrac{3}{2} \cdot 3$$
$$\theta(3) = \left(\tfrac{3}{2}\right)^2 \cdot 3$$
$$\theta(4) = \left(\tfrac{3}{2}\right)^3 \cdot 3$$
$$\theta(5) = \left(\tfrac{3}{2}\right)^4 \cdot 3$$
$$\vdots$$
$$\theta(l) = \left(\tfrac{3}{2}\right)^{l-1} \cdot 3 = \left(\tfrac{3}{2}\right)^{l} \cdot 2.$$

$\theta(l)$ represents the number of operands that can be added using a CSA tree that has $l$ levels. Suppose that the number of operands is $N$; then

$$N = \left(\tfrac{3}{2}\right)^{l} \cdot 2.$$

Taking the logarithm of both sides and neglecting the additive constant, we have

$$\log N = l \log \tfrac{3}{2}.$$

Then

$$l = \frac{1}{\log \frac{3}{2}} \log N.$$

We can find constants $C_1 > 0$, $C_2 > 0$, and $N_0 \ge 0$ such that for all $N \ge N_0$ the following is true:

$$C_1 \log N \le \frac{1}{\log \frac{3}{2}} \log N \le C_2 \log N. \qquad (7)$$

Then

$$C_1 \log N \le l \le C_2 \log N \quad \forall N \ge N_0. \qquad (8)$$

Possible values for $C_1$, $C_2$, and $N_0$ are 1, 2, and 1. Equation (8) means that $l = O(\log N)$.

Case 2: $\theta(l-1)$ is odd. Then

$$3 \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{3}{2}\theta(l-1) - 1.5 \qquad (9)$$

$$\theta(l-1) \bmod 2 = 1. \qquad (10)$$

Substituting in (3) using (9) and (10), we get

$$\theta(l) = \frac{3}{2}\theta(l-1) - \frac{1}{2}. \qquad (11)$$

Since $\theta(1) = 3$, we can substitute in (11) to get successive values for $\theta(l)$ as follows:

$$\theta(2) = \tfrac{3}{2} \cdot 3 - \tfrac{1}{2}$$
$$\theta(3) = \left(\tfrac{3}{2}\right)^2 \cdot 3 - \left(\tfrac{3}{2}\right) \cdot \tfrac{1}{2} - \tfrac{1}{2}$$
$$\theta(4) = \left(\tfrac{3}{2}\right)^3 \cdot 3 - \left(\tfrac{3}{2}\right)^2 \cdot \tfrac{1}{2} - \left(\tfrac{3}{2}\right) \cdot \tfrac{1}{2} - \tfrac{1}{2}$$
$$\vdots$$
$$\theta(l) = \left(\tfrac{3}{2}\right)^{l-1} \cdot 3 - \left( \left(\tfrac{3}{2}\right)^{l-2} + \left(\tfrac{3}{2}\right)^{l-3} + \cdots + 1 \right) \cdot 0.5 = \tfrac{4}{3}\left(\tfrac{3}{2}\right)^{l} + 1.$$

Suppose that the number of operands is $N$; then

$$N = \tfrac{4}{3}\left(\tfrac{3}{2}\right)^{l} + 1.$$

Using the same analytical method used for the case of even $\theta(l-1)$, we can find constants $C_1$, $C_2$, and $N_0 \ge 0$ such that for all $N \ge N_0$ the following is true:

$$C_1 \log N \le \frac{1}{\log \frac{3}{2}} \log N \le C_2 \log N.$$

From the previous analysis in both cases 1 and 2, $N$ numbers can be added using CSA's in $O(\log N)$. ∎

Fig. 3. (a) An example of partial sum adder for $\lambda = 6$. (b) A modulo sum adder. (c) Different stages of the modulo adder. (d) A detailed example for modulo addition. (e) Addition of partial products using modulo adders.

3.2.2 Implementation using Modulo Adder: The MA proposed in [18] is used to implement the partial sum adder. The idea of representing a number as a carry and a sum, borrowed from CSA, can be used in modulo addition to obtain a scheme that has a constant speed that does not depend on the number of bits. Basically, CSA depends not on completing the addition process at a certain stage, but on postponing it to the final stage. In the intermediate stages, numbers are represented as sum and carry to avoid the complete addition process.

The MA is used to add two numbers $A$ and $B$ in modulo $m$. Fig. 3(b) shows that $A$ is represented as a pair of numbers $(A_s, A_c)$, $B$ is represented as $(B_s, B_c)$, and the output $C$ is represented as $(C_s, C_c)$. Each number is represented as a group of sum bits and carry bits. There is no unique representation for $A_s$ and $A_c$. The condition that needs to be satisfied is

$$|A_s + A_c|_m = |A|_m.$$

One possible representation is $A_s = |A|_m$, $A_c = 0$.

We need to add four numbers $(A_s, A_c, B_s, B_c)$, which needs two steps of CSA. After the addition process we need to detect whether $-M$ or $2 \cdot (-M)$ is required to adjust the result. The adjusting process takes at most three steps. The proposed algorithm for modulo $m$ addition of two numbers can be described as follows.

Algorithm ModuloAdd (A, B, Result)
Input: Two variables A and B in modulo m. A is represented as A_s and A_c. B is represented as B_s and B_c. All variables are n-bit numbers.
Output: Variable Result represented as Result_s and Result_c. The relation between A, B, and Result is: Result = |A + B|_m.
Procedure:
begin
  Do in parallel
  begin
    Call Sum(temp1, A_s, A_c, B_s)
    Call Carry(temp2, A_s, A_c, B_s)
  end
  Do in parallel
  begin
    Call Sum(temp3, temp1, temp2, B_c)
    Call Carry(temp4, temp1, temp2, B_c)
  end
  Case (temp2[n + 1] + temp4[n + 1]) of
    0: Do in parallel
       begin
         Result_s := temp3
         Result_c := temp4
       end
       exit
    1: Do in parallel
       begin
         Call Sum(temp5, temp3, temp4, (2^n - m))
         Call Carry(temp6, temp3, temp4, (2^n - m))
       end
    2: Do in parallel
       begin
         Call Sum(temp5, temp3, temp4, 2*(2^n - m))
         Call Carry(temp6, temp3, temp4, 2*(2^n - m))
       end
  end case
  Case (temp6[n + 1]) of
    0: Do in parallel
       begin
         Result_s := temp5
         Result_c := temp6
       end
       exit
    1: Do in parallel
       begin
         Call Sum(temp7, temp5, temp6, (2^n - m))
         Call Carry(temp8, temp5, temp6, (2^n - m))
       end
  end case
  Case (temp8[n + 1]) of
    0: Do in parallel
       begin
         Result_s := temp7
         Result_c := temp8
       end
    1: Do in parallel
       begin
         Call Sum(temp9, temp7, temp8, (2^n - m))
         Call Carry(temp10, temp7, temp8, (2^n - m))
       end
       Do in parallel
       begin
         Result_s := temp9
         Result_c := temp10
       end
  end case
end.

Sum (A, B, C, D)
begin
  Do in parallel (1 ≤ i ≤ n)
    A[i] := B[i] ⊕ C[i] ⊕ D[i]
end.

Carry (A, B, C, D)
begin
  A[1] := 0
  Do in parallel (1 ≤ i ≤ n)
    A[i + 1] := (B[i] ∧ C[i]) ∨ (B[i] ∧ D[i]) ∨ (C[i] ∧ D[i])
end.
An implementation of the algorithm is shown in Fig. 3(c).

Theorem 2: The modulo adder scheme for adding two $n$-bit numbers in modulo $m$ has an asymptotic time complexity $\Theta(1)$.

Proof: To prove that the number of steps is constant (five), we need to prove that the last carry is equal to zero in five or fewer steps. Induction on the number of bits $n$ is used to prove the correctness of the theorem.

Basis step: $n = 0$ means that we do not add any numbers, and in this case the required number of steps is zero.

Induction hypothesis: assume for a fixed arbitrary $n \ge 0$ that the maximum number of steps is five.

Induction step: for numbers with $n + 1$ bits, let

$$\gamma = temp_2[n+1] + temp_4[n+2].$$

Then we have the following cases.

a) $\gamma = 0$: the carry propagation stopped at bit $n$, and it ends after five steps at most according to the induction hypothesis.

b) $\gamma = 1$: the correction is $2^{n+1} - m$ in step 3. Since $m > 2^n$, then $2^{n+1} - m < 2^n$, which means that $(2^{n+1} - m)[n] = 0$. The worst case is to have $temp_3[n+1]$ and $temp_4[n+2]$ equal to one. This means that $temp_5[n+1] = 0$ and $temp_6[n+2] = 1$; then $temp_8[n+2] = 0$. In this case the correction is done in two steps (step 3 and step 4).

c) $\gamma = 2$: the correction is $2 \cdot (2^{n+1} - m)$ in step 3. The worst case is $temp_3[n+1]$, $temp_4[n+2]$, and the corresponding bit of $2 \cdot (2^{n+1} - m)$ all equal to one. Then $temp_5[n+1] = 1$, $temp_6[n+1] = 1$, and the corresponding bit of $2^{n+1} - m$ is 0. At step 4, $temp_7[n+1] = 0$ and $temp_8[n+2] = 1$. At step 5, $temp_9[n+1] = 1$ and $temp_{10}[n+2] = 0$. In this case the correction is done in three steps (steps 3-5). ∎
Example: As an example, the modulo addition of $A = 1272$ and $B = 450$ for $m = 2050$ is shown in Fig. 3(d). There is no unique representation for $A$ and $B$; one valid representation is used in the figure, which shows the detailed modulo addition operation for this example. In step 1 we get $temp_2[13] = 1$, and in step 2 we get $temp_4[13] = 1$, which means that at step 3 we have to add $2(2^n - M)$. At step 3 we get $temp_6[13] = 1$, which means that at step 4 we have to add $2^n - M$. At step 4 we get $temp_8[13] = 0$, which means that the addition process stops at step 4. The result of step 4 is the final result.

The proposed modulo adder has the following advantages.

1) It does not have any limitation on the size of the modulus.
2) It is quite modular; it is a 2-D array of one type of cell (full adder).
3) It is easy to pipeline.
4) It is a very efficient architecture for the implementation of CRT decoding and modulo multiplication [20].
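The behavior of the ModuloAdd procedure can be explored with a small software model. This is a sketch under stated assumptions rather than the authors' circuit: operands are Python integers, the modulus is assumed to satisfy $2^{n-1} < m \le 2^n$ so that $2(2^n - m)$ still fits in $n$ bits, and the correction is written as a loop that runs until no carry-out remains instead of relying on the three-step bound claimed in the text. The names `csa` and `modulo_add` are hypothetical.

```python
def csa(x, y, z, n):
    """Carry-save add three n-bit words: (sum word, carry word, carry-out bit)."""
    mask = (1 << n) - 1
    s = x ^ y ^ z                               # bitwise sum
    c = ((x & y) | (x & z) | (y & z)) << 1      # majority, shifted left one place
    return s, c & mask, c >> n                  # the carry-out has weight 2**n

def modulo_add(a, b, m, n):
    """Add a = (a_s, a_c) and b = (b_s, b_c) modulo m in sum/carry form."""
    a_s, a_c = a
    b_s, b_c = b
    assert (1 << (n - 1)) < m <= (1 << n), "sketch assumes an n-bit modulus"
    assert max(a_s, a_c, b_s, b_c) < (1 << n), "operands must be n-bit words"
    k = (1 << n) - m                            # adding k replaces a dropped 2**n (mod m)
    s, c, out1 = csa(a_s, a_c, b_s, n)          # step 1
    s, c, out2 = csa(s, c, b_c, n)              # step 2
    pending = out1 + out2                       # dropped carry-outs still to be compensated
    steps = 2
    while pending:                              # correction steps
        s, c, pending = csa(s, c, pending * k, n)
        steps += 1
    return (s, c), steps

(res_s, res_c), steps_used = modulo_add((1272, 0), (450, 0), m=2050, n=12)
```

The returned pair satisfies $|s + c|_m = |A + B|_m$; it is not reduced to a single word, which matches the sum and carry output form used between adder levels.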
Theorem 3: Adding $n$ numbers $(y_1, y_2, \ldots, y_n)$ in modulo $M$ is equivalent to:

1) Adding $(y_1, y_2)$ modulo $M$, $\ldots$, $(y_i, y_{i+1})$, $\ldots$, and $(y_{n-1}, y_n)$ gives $y_{12}, \ldots, y_{(n-1)n}$.
2) Step 1 is repeated on $(y_{12}, y_{34}), \ldots, (y_{(n-3)(n-2)}, y_{(n-1)n})$.
3) Step 2 is repeated $\lceil \log N \rceil - 2$ times to obtain one final output represented as a sum and carry.

Proof: To add two numbers $a$ and $b$ in modulo $M$ we have the following cases:

i) $a < M$ and $b < M$; then $a = |a|_M$ and $b = |b|_M$, and

$$|a + b|_M = \big|\, |a|_M + |b|_M \big|_M. \qquad (12)$$

ii) $a > M$ and $b < M$; then $b = |b|_M$ and $a = M + x$, and

$$|a + b|_M = |M + x + b|_M = |x + b|_M. \qquad (13)$$

Since $x < M$ and $b < M$, then from (12) and (13):

$$|a + b|_M = \big|\, |a|_M + |b|_M \big|_M. \qquad (14)$$

iii) $a < M$ and $b > M$: like case ii).

iv) $a > M$ and $b > M$; then $a = M + x$ and $b = M + y$, and

$$|a + b|_M = |M + x + M + y|_M = |x + y|_M = \big|\, |a|_M + |b|_M \big|_M. \qquad (15)$$

From the previous four cases,

$$|a + b|_M = \big|\, |a|_M + |b|_M \big|_M. \qquad (16)$$

Since addition is associative, then

$$|y_1 + y_2 + \cdots + y_n|_M = \big| (y_1 + y_2) + (y_3 + y_4) + \cdots \big|_M.$$

Using (16) we have

$$|y_1 + y_2 + \cdots + y_n|_M = \begin{cases} \big|\, |y_1 + y_2|_M + \cdots + |y_{n-1} + y_n|_M \big|_M, & \text{if } n \text{ even} \\[4pt] \big|\, |y_1 + y_2|_M + \cdots + |y_{n-2} + y_{n-1}|_M + |y_n|_M \big|_M, & \text{if } n \text{ odd.} \end{cases}$$

We can further expand this expression using the same method to get the addition process on the right-hand side in terms of only two operands added in modulo $M$. ∎
Theorem 3 means that adding $n$ numbers in modulo $M$ can be performed using a binary tree consisting of units that are capable of adding only two numbers in modulo $M$. MA's are used as those building blocks to perform the addition process. Since the MA requires that inputs be represented in the form of sum and carry, this form should be enforced at all levels. The form will be enforced automatically for levels $\ge 2$, because the outputs of the previous levels are in the correct form. For the first level we have the following:

$$T_{i_s} = y_i, \quad T_{i_c} = 0, \quad \forall\, 1 \le i \le n.$$

For the last stage the output is in the form of sum and carry, which is exactly the same form we have using the CSA's. Fig. 3(e) shows the binary tree required to add $n$ numbers in modulo $M$.
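The tree organization implied by Theorem 3 can be sketched in a few lines. Here `mod_add` is a plain integer stand-in for a two-operand modulo adder (an assumption; the MA works on sum and carry pairs), and `tree_mod_sum` is a hypothetical name for the level-by-level pairing.

```python
def mod_add(a, b, M):
    """Stand-in for one two-operand modulo-M adder cell."""
    return (a + b) % M

def tree_mod_sum(values, M):
    """Reduce a non-empty list pairwise, level by level; return (sum mod M, levels)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [mod_add(level[i], level[i + 1], M) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                 # an unpaired operand passes to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0] % M, depth

result, depth = tree_mod_sum([100, 250, 7, 999, 42], 105)
```

The number of levels grows as $\lceil \log_2 n \rceil$, which is where the logarithmic delay of the partial sum adder comes from.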
3.3 Range Determinator

This stage consists of three levels, namely ROM, magnitude comparator (MC), and bit corrector (BC). The major function of this stage is to determine the range of $S$ so that the appropriate value can be subtracted from $S$ to obtain the correct result.

Since the input to this stage, $S$, is a large binary number, it is partitioned into groups of adjacent bits. For example, if $S$ is a 24-bit number, we can partition $S$ into three 8-bit groups $G_1$, $G_2$, and $G_3$, where $G_1 = S_{7 \ldots 0}$, $G_2 = S_{15 \ldots 8}$, and $G_3 = S_{23 \ldots 16}$. Since each group is fed into a ROM module as an address input, the number of bits in each group should be small, so that small ROM's that are fast and occupy a small silicon area can be used to implement this level. However, the number of groups, $g$, should be kept as small as possible, since the complexity of the MC cells is a function of the number of ROM modules, $g$. Hence, there are tradeoffs in choosing $g$ and the number of bits in each group. The following discussion is divided into two parts: sign magnitude and 2's complement.
3.3.1 Sign Magnitude: As shown in Fig. 4, the input to the $i$th ROM module is $G_i$, and the outputs are $B^i$'s and $C^i$'s. The function of this ROM module is depicted as follows, with $G_i$ compared against the corresponding bit group of $jM - 1$:

$$B_j^i = \begin{cases} 0, & \text{if } G_i \le (jM - 1) \\ 1, & \text{if } G_i > (jM - 1) \end{cases} \quad \text{for } j = 1, \ldots, \lambda - 1$$

$$C_j^i = \begin{cases} 0, & \text{if } G_i \neq (jM - 1) \\ 1, & \text{if } G_i = (jM - 1) \end{cases} \quad \text{for } j = 2, \ldots, \lambda - 1$$

for $i = 1, 2, \ldots, g$.

The ROM modules compare the input pattern $S$ to the first set of values in Table II and produce $g \cdot (2\lambda - 3)$ outputs that are fed to the MC level.

Fig. 4. Implementation of the first part of range determinator stage.

TABLE II
VALUES COMPARED BY MULTI-MAGNITUDE COMPARATORS

$j$                    1           2            ...   $\lambda - 1$
First set              $M-1$       $2M-1$       ...   $(\lambda-1)M - 1$
Second set, $M$ odd    $(M-1)/2$   $(3M-1)/2$   ...   $((2\lambda-3)M - 1)/2$
Second set, $M$ even   $M/2 - 1$   $3M/2 - 1$   ...   $(2\lambda-3)M/2 - 1$

The MC level consists of $(\lambda - 1)$ MC$^1$ cells. This level takes its input from the ROM level and does further comparison. The function of the MC$^1$ cells is determined by:

$$MC_j^1 = \begin{cases} 0, & \text{if } S < jM \\ 1, & \text{if } S \ge jM \end{cases} \quad \text{for } j = 1, 2, \ldots, \lambda - 1. \qquad (15)$$

The complexity of an MC cell is a function of the number of ROM modules. If we have $g$ ROM modules, the Boolean equation for the $j$th MC$^1$ cell is as follows:

$$MC_j^1 = B_j^g + B_j^{g-1} C_j^g + B_j^{g-2} C_j^g C_j^{g-1} + \cdots + B_j^2 C_j^g C_j^{g-1} \cdots C_j^3 + B_j^1 C_j^g C_j^{g-1} \cdots C_j^3 C_j^2. \qquad (16)$$

Since $S$ may be larger than several of the values compared, the outputs of several MC cells may be set to 1; therefore, the BC level is used to ensure that only one of the outputs of the BC$^1$ cells is set to one, and also to indicate the appropriate range. In order to do so, $\lambda$ identical BC$^1$ cells are needed, and their common Boolean equation is as follows:

$$BC_j^1 = MC_{j-1}^1 \oplus MC_j^1, \quad \text{for } j = 1, 2, \ldots, \lambda$$

$$\text{where } MC_0^1 = 1 \text{ and } MC_\lambda^1 = 0. \qquad (17)$$

Hence, the range of $S$ is determined to be $(m - 1)M \le S < mM$ if $BC_m^1 = 1$. Thus far, the range determination enables the $S$ modulus $M$ operation to be performed by $S - (m - 1)M$ if $BC_m^1$ is set to one.
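The three-level structure (ROM flags, magnitude comparator, bit corrector) can be modeled in software to check that the group-wise flags really reproduce the comparison $S \ge jM$. This sketch assumes that each ROM compares its bit group of $S$ with the same bit group of $jM - 1$, and that the bit corrector combines neighboring MC outputs with the boundary values $MC_0 = 1$ and $MC_\lambda = 0$; all function names and the test parameters are hypothetical.

```python
def split_groups(value, g, bits):
    """Split value into g groups of `bits` bits; index 0 is the least significant."""
    mask = (1 << bits) - 1
    return [(value >> (bits * i)) & mask for i in range(g)]

def mc_cell(s, reference, g=3, bits=8):
    """Return 1 if s > reference, using only per-group 'greater' and 'equal' flags."""
    s_groups = split_groups(s, g, bits)
    ref_groups = split_groups(reference, g, bits)
    B = [int(a > b) for a, b in zip(s_groups, ref_groups)]    # ROM 'greater' outputs
    C = [int(a == b) for a, b in zip(s_groups, ref_groups)]   # ROM 'equal' outputs
    result = 0
    equal_above = 1          # AND of the C flags of all more significant groups
    for i in reversed(range(g)):
        result |= B[i] & equal_above
        equal_above &= C[i]
    return result

def bit_corrector(mc_outputs):
    """Turn the thermometer-coded MC outputs into a one-hot range indicator."""
    padded = [1] + list(mc_outputs) + [0]      # MC_0 = 1, MC_lambda = 0
    return [padded[j - 1] ^ padded[j] for j in range(1, len(padded))]

def range_one_hot(s, M, lam, g=3, bits=8):
    """BC outputs for 0 <= s < lam*M; entry j-1 is set when (j-1)M <= s < jM."""
    mc = [mc_cell(s, j * M - 1, g, bits) for j in range(1, lam)]
    return bit_corrector(mc)
```

Comparing against $jM - 1$ with a strict "greater than" is what makes the MC output equal to the test $S \ge jM$.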
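3.3.2 Two's Complement: Fig. 5 shows that the input to the $i$th ROM module is $G_i$, and the outputs are $D^i$'s and $E^i$'s. The function of this ROM module is depicted as follows, with $G_i$ compared against the corresponding bit group of the threshold. For $M$ odd,

$$D_j^i = \begin{cases} 0, & \text{if } G_i \le \dfrac{(2j-1)M - 1}{2} \\[4pt] 1, & \text{if } G_i > \dfrac{(2j-1)M - 1}{2} \end{cases} \qquad E_j^i = \begin{cases} 0, & \text{if } G_i \neq \dfrac{(2j-1)M - 1}{2} \\[4pt] 1, & \text{if } G_i = \dfrac{(2j-1)M - 1}{2} \end{cases}$$

and for $M$ even,

$$D_j^i = \begin{cases} 0, & \text{if } G_i \le \dfrac{(2j-1)M}{2} - 1 \\[4pt] 1, & \text{if } G_i > \dfrac{(2j-1)M}{2} - 1 \end{cases} \qquad E_j^i = \begin{cases} 0, & \text{if } G_i \neq \dfrac{(2j-1)M}{2} - 1 \\[4pt] 1, & \text{if } G_i = \dfrac{(2j-1)M}{2} - 1 \end{cases}$$

for $j = 1, 2, \ldots, \lambda - 1$ and $i = 1, 2, \ldots, g$.

Fig. 5. Implementation of the second part of range determinator stage.

The MC level of this part consists of $\lambda$ MC$^2$ cells, and each MC$^2$ cell has the following function: for $M$ odd,

$$MC_i^2 = \begin{cases} 0, & \text{if } S \le \dfrac{(2i-1)M - 1}{2} \\[4pt] 1, & \text{if } S > \dfrac{(2i-1)M - 1}{2} \end{cases} \qquad (22)$$

and for $M$ even,

$$MC_i^2 = \begin{cases} 0, & \text{if } S \le \dfrac{(2i-1)M}{2} - 1 \\[4pt] 1, & \text{if } S > \dfrac{(2i-1)M}{2} - 1. \end{cases}$$

The Boolean equations are:

$$MC_j^2 = D_j^g + D_j^{g-1} E_j^g + D_j^{g-2} E_j^g E_j^{g-1} + \cdots$$

... magnitude number system conversion will be performed; otherwise, only one of the BC level output lines will be equal to one, and thus residue to 2's complement number system conversion will be performed. The Boolean equation for a ... where $MC_0^2 = 1$ and $MC_\lambda^2 = 0$. Therefore, the range of $S$ is determined to be $((2j-3)M + 1)/2 \le S < ((2j-1)M + 1)/2$ for $M$ odd, and $((2j-3)M)/2 \le S < ((2j-1)M)/2$ for $M$ even.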
3.4 Final Corrector

This stage consists of $\lambda$ tristate multiplexers and a carry look-ahead adder. The BC$^1$ input lines are used to enable one of the tristate multiplexers, while the BC$^2$ input lines are used as selectors of the multiplexers. If $BC_i^1$ is set, then $(i-1)M \le S < iM$. The lower bound, $(i-1)M$, is subtracted from $S$ if conversion to the unsigned magnitude number system is desired, or if $S$ is less than $((2i-1)M + 1)/2$ for $M$ odd or $((2i-1)M)/2$ for $M$ even; otherwise, the upper bound, $iM$, is subtracted from $S$. The implementation of this stage is shown in Fig. 6. The CLA is used to add the 2's complement of the value to be subtracted from $S$ and to output the desired result.
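The selection rule of the final corrector can be written out as a short function. This is a behavioral sketch only: integer division stands in for the range determinator, ordinary subtraction stands in for the CLA, and the name `final_correct` and its parameters are assumptions.

```python
def final_correct(s, M, lam, twos_complement):
    """Subtract the lower or upper range bound from s, with 0 <= s < lam*M."""
    i = s // M + 1                           # range index: (i-1)M <= s < iM
    assert 1 <= i <= lam, "s must be smaller than lam*M"
    if twos_complement:
        if M % 2:
            threshold = ((2 * i - 1) * M + 1) // 2
        else:
            threshold = ((2 * i - 1) * M) // 2
        if s >= threshold:
            return s - i * M                 # upper bound: negative result
    return s - (i - 1) * M                   # lower bound: result in [0, M)
```

With the control line selecting unsigned output, the result is simply $|S|_M$; with 2's complement output, values in the upper half of each range are mapped to negative numbers.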
Fig. 6. Implementation of the final corrector.

IV. PERFORMANCE EVALUATION

1) The partial sum generator is implemented using small ROM's. If the number of residues is $N$ and each residue is represented in $P$ bits, then $N$ ROM's are required. Each ROM stores values bounded by $M$, so the size of each ROM is $2^P \cdot \lceil \log M \rceil$. The total area required for this stage is $N \cdot 2^P \cdot \lceil \log M \rceil$. Since ROM's have a constant time delay ($P$ is a small number) that does not depend on $N$, the delay of this stage is $O(1)$.

2) The partial sum adder is implemented in two different ways.
   a) Using CSA's: The complexity of the scheme is determined by Theorem 1. Since each CSA has a constant time delay, the total time required to add $N$ numbers in modulo $M$ is $O(\log N)$.
   b) Using MA's: The number of levels required to perform the addition of $N$ numbers using a binary tree of MA's is $\lceil \log N \rceil$, as shown in Theorem 3. Since the required time at each level is constant (the MA has a constant delay), the total time required for this step using MA's is $O(\log N)$.

3) The range determinator consists of three different levels (Fig. 4). The first level consists of $g$ ROM's. The second level is the MC cells, which are combinational circuits that can be represented with a two-level switching function. Finally, the last level is a two-stage combinational circuit. The three levels have a constant time delay that does not depend on $N$. The previous analysis shows that the range determinator has a time delay of $O(1)$.

4) The final corrector consists of two stages. In the first stage we have $\lambda$ tristate multiplexers, which have a constant delay equivalent to two serial NAND gates. The second stage is a CLA that has a constant delay (for fewer than 64 bits, the delay is equivalent to the delay of 12 serial NAND gates, as shown in [16]). For larger numbers of bits we can still obtain a constant delay CLA. Hence the final corrector has a delay of $O(1)$.

From cases 1)-4) we see that all stages except the partial sum adder have a constant time delay, which does not depend on the number of residues $N$. Only the second stage requires $O(\log N)$ steps.
V. CONCLUSION

The residue decoder introduced in this paper has a total delay of $\lceil \log N \rceil$. In addition, it has several advantages, as listed below.

1) The design is quite modular and consists of simple cells such as small ROM's and MC cells. This makes the implementation of the whole residue decoder in a single chip possible.
2) It does not have any limitation on the moduli used.
3) It is flexible, since it can convert residues to either the unsigned magnitude or the 2's complement number system, and it is controlled by only a control line, $C$. This means that it can be applied to a wider range of applications.
4) It is fast compared with most schemes proposed before, since it has a time complexity of $O(\log N)$.
ACKNOWLEDGMENT

K. M. Elleithy would like to thank the King Fahd University of Petroleum and Minerals for support.
REFERENCES

[1] M. A. Bayoumi, "Digital filter VLSI systolic arrays over finite fields for DSP applications," in Proc. 6th IEEE Annual Phoenix Conf. Computers and Communications, pp. 194-199, Feb. 1987.
[2] M. A. Bayoumi, G. A. Jullien, and W. C. Miller, "A look-up table VLSI design methodology for RNS structures used in DSP applications," IEEE Trans. Circuits Syst., vol. CAS-34, pp. 604-616, June 1987.
[3] F. J. Taylor, "Residue arithmetic: A tutorial with examples," IEEE Comp. Mag., pp. 50-62, May 1984.
[4] M. A. Bayoumi, "A high speed VLSI complex digital signal processor based on quadratic residue number system," VLSI Signal Processing II, pp. 200-211, IEEE Press, 1986.
[5] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer Technology. New York: McGraw-Hill, 1967.
[6] W. K. Jenkins, "Techniques for residue to analog conversion for residue encoded digital filters," IEEE Trans. Circuits Syst., vol. CAS-25, pp. 553-562, July 1978.
[7] S. Andraos and H. Ahmed, "A new efficient memoryless residue to binary converter," IEEE Trans. Circuits Syst., vol. CAS-35, pp. 1441-1444, Nov. 1988.
[8] K. M. Ibrahim and S. N. Saloum, "An efficient residue to binary converter design," IEEE Trans. Circuits Syst., vol. CAS-35, pp. 1156-1158, Sept. 1988.
[9] S. Bandyopadhyay, G. A. Jullien, and A. Sengupta, "A systolic array for fault-tolerant digital signal processing using a residue number system approach," in Proc. Int. Conf. Systolic Arrays, pp. 577-586, 1988.
[10] A. P. Shenoy and R. Kumaresan, "Residue to binary conversion for RNS arithmetic using only modular look-up tables," IEEE Trans. Circuits Syst., vol. CAS-35, pp. 1158-1162, Sept. 1988.
[11] T. V. Vu, "Efficient implementations of the CRT for sign detection and residue decoding," IEEE Trans. Comp., vol. C-34, pp. 646-651, July 1985.
[12] C. N. Zhang, B. Shirazi, and D. Y. Y. Yun, "Parallel designs for Chinese remainder conversion," in Proc. IEEE 16th Annual Conf. Parallel Processing, Aug. 1987.
[13] R. M. Capocelli and R. Giancarlo, "Efficient VLSI networks for converting an integer from binary system and vice versa," IEEE Trans. Circuits Syst., vol. 35, pp. 1425-1430, Nov. 1988.
[14] G. Alia and E. Martinelli, "A VLSI algorithm for direct and reverse conversion from weighted binary number system to residue number system," IEEE Trans. Circuits Syst., vol. CAS-31, pp. 1033-1039, 1984.
[15] C. K. Koc and P. R. Cappello, "Systolic arrays for integer Chinese remaindering," in Proc. 9th Symp. Computer Arithmetic, pp. 216-223, Sept. 1989.
[16] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 1978.
[17] K. P. Lee, M. A. Bayoumi, and K. M. Elleithy, "A fast and flexible residue decoder based on the Chinese remainder theorem," presented at 1989 Int. Symp. Circuits and Systems.
[18] K. M. Elleithy, "On bit-parallel processing for modulo arithmetic," VLSI Tech. Rep. TR8681, The Center for Advanced Computer Studies, Univ. Southwestern Louisiana, 1986.
[19] A. Avizienis, "A study of redundant number representations for parallel digital computers," Ph.D. dissertation, Univ. Illinois, Urbana, IL, May 1960.
[20] K. M. Elleithy and M. A. Bayoumi, "A $O(\log n)$ algorithm for modulo multiplication," in Proc. 32nd Midwest Symp. Circuits and Systems, Aug. 1989.
[21] K. M. Elleithy, M. A. Bayoumi, and K. P. Lee, "$\theta(\log n)$ architectures for RNS arithmetic decoding," in Proc. 9th Symp. Computer Arithmetic, pp. 202-209, Sept. 1989.

Khaled M. Elleithy (S'89-M'90) received the B.Sc. degree in computer science and automatic control from Alexandria University in 1983, the M.Sc. degree in computer networks from the same university in 1986, and the M.Sc. and Ph.D. degrees in computer science from the Center for Advanced Computer Studies at the University of Southwestern Louisiana in 1988 and 1990, respectively.
From 1983 to 1986 he was with the Computer Science Department, Alexandria University, Egypt, as a lecturer. Since 1990 he has been working as an Assistant Professor at the Department of Computer Engineering, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia. His current research interests are in the areas of design automation, computer architecture, and VLSI. He has special interests in the area of formal high-level synthesis and has published more than a dozen research papers in these areas.
Dr. Elleithy is a member of Phi Kappa Phi.

Magdy A. Bayoumi (S'80-M'85-SM'87) received the B.Sc. and M.Sc. degrees in electrical engineering from Cairo University, Egypt, the M.Sc. degree in computer engineering from Washington University, St. Louis, and the Ph.D. degree in electrical engineering from the University of Windsor, Canada.
He is an Associate Professor in the Center for Advanced Computer Studies at the University of Southwestern Louisiana, where he has been a faculty member since 1985. His research interests include VLSI design methods and architectures, digital signal processing architectures, parallel algorithm design, and computer arithmetic. He is the chairman of the VLSI Systems and Applications CAS Technical Committee. He serves on the ASSP Technical Committee on VLSI Signal Processing. He was the secretary of the CS Technical Committee on VLSI Design. He is a member of the Technical Committee of the IEEE VLSI Signal Processing Workshop and the Application Specific Array Processors, 1992. He is the editor of the new series, Advances in VLSI Signal Processing (Ablex Publishing). He is the editor of the new book, Parallel Algorithms and Architectures for DSP Applications (Kluwer Academic Publishers, 1991).
Dr. Bayoumi won the USL 1988 Researcher of the Year award.