O (logN) architectures for RNS arithmetic decoding

Abstract

Decoding in residue-number-system (RNS)-based architectures can be a bottleneck. A high-speed, flexible modulo decoder is an essential computational element to maintain the advantages of RNS. A fast and flexible modulo decoder, based on the Chinese remainder theorem (CRT), is presented. It decodes a set of residues into its equivalent representation in either unsigned magnitude or two's-complement binary number system. Two different architectures are analyzed: the first one uses carry-save adders, and the other uses modified structure carry-save adders. Both architectures are modular and are based on simple cells, which leads to efficient VLSI implementation. The decoder has a time complexity of $\Theta(\log N)$.
K. M. Elleithy, M. A. Bayoumi, and K. P. Lee
The Center for Advanced Computer Studies
University of Southwestern Louisiana
Lafayette, LA 70504, U.S.A.
1. Introduction
Recently, RNS has received increased attention due to its ability to support high-speed concurrent arithmetic [1-3]. Applications such as fast Fourier transform, digital filtering, and image processing utilize the efficiencies of RNS arithmetic in addition and multiplication; they do not require the difficult RNS operations such as division and magnitude comparison. RNS has been employed efficiently in the implementation of several special purpose processors such as digital signal processors [4].
Since special purpose processors are associated with general purpose computers, binary-to-residue and residue-to-binary conversions become inherently important, and the conversion process should not offset the speed gain in RNS operations. While the binary-to-residue conversion does not pose a serious threat to the speed gain in RNS operations, the residue-to-binary conversion can be a bottleneck. It is mainly carried out employing the Chinese Remainder Theorem (CRT) [5,6]. Several implementations of the residue decoder have been reported [7-12]. In [12], the proposed residue decoders are based on biased addition, and take advantage of the fast addition speed of CSA [13]. But the conversion output is not in 2's complement form. The implementation in [11] requires that one of the moduli must be a power of two; therefore, it may be limited in application. The residue decoders in [7,8] are based on using three moduli in the form $(2^n - 1, 2^n, 2^n + 1)$. Due to the limitation imposed on the number of moduli and the choice of them, this approach is limited in application. In [10], the residue decoder is based on the base extension technique; it uses modular look-up tables in its implementation. Since two moduli are fed into a look-up table, the choice of moduli must not be large for the implementation to be feasible. In addition, it does not support residue to 2's complement binary number system conversion. Although look-up tables are used in this scheme, its time complexity is O(p). In [14,15], the scheme used has a time complexity of $\Theta((\log N)^2)$. In [9], a scheme of $\Theta(\log NP)$ (where P is the number of bits) is used to support only unsigned magnitude binary numbers.
The proposed decoder is fast,
This work was supported in part by NSF grant No. MIP-8809811.
In this paper, a $\Theta(\log n)$ residue decoder capable of decoding a set of residues to its equivalent representation in unsigned magnitude or 2's complement binary number system is introduced. Two different architectures using CSAs based on [16] and MCSA [17] are implemented.
In the following section, the RNS theory is reviewed. Section 3 discusses how this fast and flexible residue decoder can be implemented. Section 4 evaluates the speed performance of this residue decoder.
2. Residue Number System
In RNS, an integer X can be represented by an N-tuple of residue digits,
$$X = (r_1, r_2, \ldots, r_N),$$
where $r_i = |X|_{m_i}$, with respect to a set of N moduli $\{m_1, m_2, \ldots, m_N\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime, that is,
$$\gcd(m_i, m_j) = 1 \quad \text{for } i \neq j;$$
then it is shown that there is a unique representation for each number in the range
$$0 \le X < \prod_{i=1}^{N} m_i = M,$$
where N is the number of moduli.
The arithmetic operation on two integers A and B is equivalent to the arithmetic operation on their residue representations; that is, with $a_i = |A|_{m_i}$ and $b_i = |B|_{m_i}$, the i-th residue digit of $A \cdot B$ is $|a_i \cdot b_i|_{m_i}$, where '$\cdot$' can be addition, subtraction, or multiplication. Therefore, it is desired to convert binary arithmetic on large integers to residue arithmetic on small residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.
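The digit-wise arithmetic described above can be illustrated with a short Python sketch. The moduli (3, 5, 7) and the helper names `to_rns` and `rns_op` are illustrative assumptions, not taken from the paper.

```python
from math import gcd
from operator import add, sub, mul

# Illustrative pairwise-coprime moduli (an assumption); M = 3 * 5 * 7 = 105.
MODULI = (3, 5, 7)

def pairwise_coprime(moduli):
    """True when every pair of moduli has gcd 1."""
    return all(gcd(p, q) == 1
               for i, p in enumerate(moduli) for q in moduli[i + 1:])

def to_rns(x, moduli=MODULI):
    """Encode an integer as its tuple of residues."""
    return tuple(x % m for m in moduli)

def rns_op(a, b, op, moduli=MODULI):
    """Apply op digit by digit; no carry passes between residue digits."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, moduli))

if __name__ == "__main__":
    A, B = 17, 4
    print(rns_op(to_rns(A), to_rns(B), mul))  # same digits as to_rns(A * B)
```

Each residue digit is processed independently, which is what allows the operations to run in parallel.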
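For applications in digital signal processing, it is helpful to define a dynamic range for the RNS integers. The dynamic range is defined as $\left[-\frac{M-1}{2}, \frac{M-1}{2}\right]$ for M odd and $\left[-\frac{M}{2}, \frac{M}{2}-1\right]$ for M even, or more specifically, for M odd,
$$X = \begin{cases} Z & \text{if } Z \le \frac{M-1}{2} \\ Z - M & \text{if } Z > \frac{M-1}{2} \end{cases}$$
and for M even,
$$X = \begin{cases} Z & \text{if } Z < \frac{M}{2} \\ Z - M & \text{if } Z \ge \frac{M}{2} \end{cases}$$
where Z is an integer within the legitimate range, $0 \le Z < M$. Any integer, X, within the dynamic range can be represented by N residue digits.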
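As a small illustration of the mapping from the legitimate range to the dynamic range, the following Python sketch applies the two case definitions directly. The function name `to_signed` is a hypothetical helper introduced here.

```python
def to_signed(z, M):
    """Map Z in the legitimate range [0, M) to X in the dynamic range."""
    if not 0 <= z < M:
        raise ValueError("Z must satisfy 0 <= Z < M")
    if M % 2 == 1:
        # M odd: dynamic range is [-(M-1)/2, (M-1)/2]
        return z if z <= (M - 1) // 2 else z - M
    # M even: dynamic range is [-M/2, M/2 - 1]
    return z if z < M // 2 else z - M

if __name__ == "__main__":
    print(to_signed(53, 105))  # upper half of the legitimate range maps to a negative value
    print(to_signed(14, 30))
```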
Authorized licensed use limited to: University of Bridgeport. Downloaded on February 24,2010 at 13:07:27 EST from IEEE Xplore. Restrictions apply.
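The conversion from RNS to weighted binary number system is done by using the CRT, which states that
$$X = \left| \sum_{i=1}^{N} \hat{m}_i \left| \frac{r_i}{\hat{m}_i} \right|_{m_i} \right|_M \qquad (1)$$
where $\hat{m}_i = M / m_i$ and $\left| \frac{1}{\hat{m}_i} \right|_{m_i}$ denotes the multiplicative inverse of $\hat{m}_i$ modulo $m_i$.
Although the CRT provides a direct, fast, and simple conversion formula, the lack of a large and fast modulo M adder has held back this approach.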
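A direct software rendering of the CRT sum may help make the conversion concrete. This is a minimal sketch, assuming Python 3.8+ (for the three-argument `pow` with exponent -1); the name `crt_decode` and the moduli in the usage line are illustrative assumptions.

```python
from math import prod

def crt_decode(residues, moduli):
    """Residue-to-binary conversion by the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = M // m                    # product of all the other moduli
        inv = pow(m_hat, -1, m)           # multiplicative inverse of m_hat mod m
        total += m_hat * ((r * inv) % m)  # one CRT summand, always below M
    return total % M                      # the final modulo M reduction

if __name__ == "__main__":
    print(crt_decode((2, 2, 3), (3, 5, 7)))  # residues of 17 for moduli (3, 5, 7)
```

The single `% M` at the end is the large modulo M addition that the hardware schemes in the following sections try to make fast.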
3. The Residue Decoder
The residue decoder based on the CRT can be implemented by a modulo M adder tree. The modulo M adders at each level are used to correct the partial sum so that it is within the legitimate range. Since a modulo M adder is very slow, the possible implementation may pose an overhead to the overall speed performance of an RNS processor. In addition, the CRT only converts residues to their binary representation in the legitimate range but not in the dynamic range. Therefore, conversion to 2's complement binary number system requires a final correction.
In order to implement a high speed residue decoder that can perform conversion to both unsigned magnitude and 2's complement binary number system, the following solutions are proposed:
1) The number of modulo M adders or binary adders should be reduced to a minimum.
2) CSAs or MCSAs can be used wherever multi-operand addition is required, due to their high addition speed.
3) Correction can be performed only at the last stage, and it supports conversion to both unsigned magnitude and 2's complement binary number system.
For ease of residue decoder design, it is partitioned into 4 stages as shown in Figure 1. The inputs to the residue decoder are the residues and a control line, C, which determines whether the output is in unsigned magnitude or 2's complement number system.
Figure 1. Block Diagram of the Residue Decoder
3.1. Partial Sum Generator
The inputs to this stage are the N residues. The main function of this stage is to compute partial sums, $t_i$'s, where
$$t_i = \left| \hat{m}_i \left| \frac{r_i}{\hat{m}_i} \right|_{m_i} \right|_M, \qquad \hat{m}_i = \frac{M}{m_i}.$$
Since $m_i$ is usually small, the value of $t_i$ can be obtained by accessing a lookup table with a small address space. Hence, $r_i$ will serve as ROM address input, and $t_i$ will be obtained from ROM output.
In most cases, it is better to reduce the number of partial sums ($t_i$'s) in order to reduce the complexity at lower stages and hence increase the residue decoder's speed as a whole. Since a modulus $m_j$ can be represented by a $\lceil \log_2 m_j \rceil$-bit binary number, the j-th residue is
$$r_j = \sum_{k=0}^{\lceil \log_2 m_j \rceil - 1} 2^k b_j^k$$
where $b_j^k \in \{0, 1\}$. By substituting $r_j$ in eq. (1), we can rewrite the CRT as follows:
Hence, if we have a set of 8 moduli {2, 3, 5, 7, 11, 13, 17, 23} with residues $\{r_1, r_2, r_3, r_4, r_5, r_6, r_7, r_8\}$, respectively, only 4 ROMs with 7-bit address input are needed to implement this level, and modulo summation of 4 operands instead of 8 is needed.
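The grouping of residues into shared look-up tables can be sketched in Python as follows. The pairing of the moduli in `GROUPS` is an assumption chosen so that every table needs at most 7 address bits; the paper's own grouping is not recoverable from this copy. The helper names are hypothetical, and Python 3.8+ is assumed for `pow(x, -1, m)`.

```python
from math import prod

MODULI = (2, 3, 5, 7, 11, 13, 17, 23)
# Hypothetical pairing (indices into MODULI): (2,23), (3,17), (5,13), (7,11).
GROUPS = ((0, 7), (1, 6), (2, 5), (3, 4))
M = prod(MODULI)

def width(m):
    """Number of bits needed to hold a residue modulo m."""
    return (m - 1).bit_length()

def partial_sum(i, r):
    """CRT summand for residue r of modulus MODULI[i]."""
    m = MODULI[i]
    m_hat = M // m
    return m_hat * ((r * pow(m_hat, -1, m)) % m)

def build_rom(i, j):
    """One ROM addressed by the concatenated bits of residues i and j."""
    wj = width(MODULI[j])
    rom = {}
    for ri in range(MODULI[i]):
        for rj in range(MODULI[j]):
            rom[(ri << wj) | rj] = (partial_sum(i, ri) + partial_sum(j, rj)) % M
    return rom, wj

ROMS = [build_rom(i, j) for i, j in GROUPS]

def decode(residues):
    """Sum the four ROM outputs modulo M (stands in for the adder stages)."""
    total = 0
    for (i, j), (rom, wj) in zip(GROUPS, ROMS):
        total += rom[(residues[i] << wj) | residues[j]]
    return total % M

if __name__ == "__main__":
    x = 12345
    print(decode(tuple(x % m for m in MODULI)))
```

With this pairing the widest address is 2 + 5 = 3 + 4 = 7 bits, so four small tables replace eight and only four operands reach the adder stage.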
3.2. Partial Sum Adder
By far the modulo M summation of partial sums ($t_i$'s) poses the biggest challenge to the implementation of the residue decoder, due to the slow computational speed of the modulo M adder. This stage can be implemented using two different approaches.
3.2.1. Implementation using CSA
A multilevel CSA tree consisting of N-2 CSAs and a carry propagate adder, CPA [13], is used to reduce $\Lambda$ partial sums, t's, to a sum, S. Let $l$ be the number of levels in a CSA tree, and $\theta(l)$ be the maximum number of operands that can be processed with an $l$-level CSA tree. We can compute $\theta(l)$ by the recursive formula provided by Avizienis [18],
$$\theta(l) = \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor \cdot 3 + \theta(l-1) \bmod 2 \quad \text{for } l = 2, 3, \ldots, \text{ and initially } \theta(1) = 3 \qquad (3)$$
A CSA tree for adding 6 operands is shown in Figure 2.
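The recursion for the capacity of an l-level CSA tree is easy to evaluate numerically. The following Python sketch does so; the function names `theta` and `levels_needed` are illustrative choices, not taken from the paper.

```python
def theta(levels):
    """Maximum number of operands an l-level CSA tree can reduce to two."""
    t = 3                          # a single CSA level handles 3 operands
    for _ in range(levels - 1):
        t = (t // 2) * 3 + t % 2   # the recursion attributed to Avizienis
    return t

def levels_needed(n_operands):
    """Smallest number of CSA levels whose capacity covers n_operands."""
    l = 1
    while theta(l) < n_operands:
        l += 1
    return l

if __name__ == "__main__":
    print([theta(l) for l in range(1, 8)])
    print(levels_needed(6))
```

The capacities grow by a factor of roughly 3/2 per level, which is why the number of levels grows logarithmically with the number of operands.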
The CPA is an (m-1)-bit two-level carry lookahead adder, CLA [13], where:
$$m = \lceil \log_2 (M \Lambda) \rceil$$
Hence, the output S is an m-bit number that is passed to the next stage. The complexity of the scheme is determined by Theorem 1.
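Figure 2. An Example for Partial Sum Adder for $\Lambda$ = 6. (In the figure, a dash marks an output that is shifted left with a zero entering from the right.)

Theorem 1: The addition of N numbers using CSAs can be performed in $\Theta(\log N)$ steps.
Proof: The number of levels in a CSA tree is determined by:
$$\theta(l) = \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor \cdot 3 + \theta(l-1) \bmod 2$$
To determine the number of levels required to add N numbers, let us consider the following two cases:
(i) $\theta(l-1)$ is even, then:
$$3 \cdot \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{3}{2}\,\theta(l-1) \qquad (4)$$
$$\theta(l-1) \bmod 2 = 0 \qquad (5)$$
Substituting in (3) using (4) and (5), we have:
$$\theta(l) = \frac{3}{2}\,\theta(l-1) \qquad (6)$$
Since $\theta(1) = 3$, we can substitute in (6) to get successive values for $\theta(l)$ as follows:
$$\theta(l) = \left(\frac{3}{2}\right)^{l-1} \cdot 3 = \left(\frac{3}{2}\right)^{l} \cdot 2$$
$\theta(l)$ represents the number of operands that can be added using a CSA tree that has $l$ levels. Suppose that the number of operands is N, then:
$$N = \left(\frac{3}{2}\right)^{l} \cdot 2$$
Taking the logarithm of both sides and dropping the additive constant $\log 2$, which does not affect the asymptotic behavior, we have:
$$\log N \approx l \cdot \log\frac{3}{2}$$
Then:
$$l = \frac{1}{\log\frac{3}{2}} \cdot \log N$$
We can find constants $C_1 > 0$, $C_2 > 0$, and $N_0 \ge 0$, such that for all $N \ge N_0$ the following is true (taking logarithms to base 2):
$$C_1 \log N \le \frac{1}{\log\frac{3}{2}} \cdot \log N \le C_2 \log N \qquad (7)$$
Then
$$C_1 \log N \le l \le C_2 \log N \quad \forall\, N \ge N_0 \qquad (8)$$
Possible values for $C_1$, $C_2$ and $N_0$ are 1, 2, 1. Equation (8) means that $l = \Theta(\log N)$.
(ii) $\theta(l-1)$ is odd, then:
$$3 \cdot \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{3}{2}\,\theta(l-1) - 1.5 \qquad (9)$$
$$\theta(l-1) \bmod 2 = 1 \qquad (10)$$
Substituting in (3) using (9) and (10), we get:
$$\theta(l) = \frac{3}{2}\,\theta(l-1) - \frac{1}{2} \qquad (11)$$
Since $\theta(1) = 3$, we can substitute in (11) to get successive values for $\theta(l)$ as follows:
$$\theta(l) = \left(\frac{3}{2}\right)^{l-1} \cdot 3 - \left( \left(\frac{3}{2}\right)^{l-2} + \left(\frac{3}{2}\right)^{l-3} + \cdots + 1 \right) \cdot 0.5 = \frac{4}{3}\left(\frac{3}{2}\right)^{l} + 1$$
Suppose that the number of operands is N, then:
$$N = \frac{4}{3}\left(\frac{3}{2}\right)^{l} + 1$$
Using the same analytical method used for the case of even $\theta(l-1)$, we can find constants $C_1$, $C_2$, and $N_0$, such that for all $N \ge N_0$ the following is true:
$$C_1 \log N \le \frac{1}{\log\frac{3}{2}} \cdot \log N \le C_2 \log N$$
From the previous analysis in both cases (i) and (ii), N numbers can be added using CSAs in $\Theta(\log N)$.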
3.2.2. Implementation using MCSA
The MCSA is based on the idea of representing a number as a Carry and a Sum, similar to CSA. It can be used in the modulo addition of two numbers to obtain a scheme that has a constant speed which does not depend on the number of bits. Basically, CSA depends on the idea of not completing the addition process at a certain stage, but postponing it to the final stage. In the intermediate stages numbers are represented as Sum and Carry to avoid the complete addition process. The MCSA is used to add two numbers A and B in modulo m. Figure 3.a shows that A is represented as a pair of numbers $(A_s, A_c)$, B is represented as $(B_s, B_c)$, and the output C is represented as $(C_s, C_c)$. Each number is represented as a group of Sum bits and Carry bits. There is no unique representation for $A_s$ and $A_c$. The condition that needs to be satisfied is:
One possible representation is:

Figure 3.a. A Modified CSA (MCSA)
We need to add four numbers $(A_s, A_c, B_s, B_c)$, which needs two steps of CSA. After the addition process we need to detect if -M or 2*(-M) is required to adjust the result. The adjusting process takes at most three steps. The proposed algorithm for modulo m addition of two numbers can be described as follows:

Algorithm modulo-add (A, B, Result)
Input: Two variables A and B in modulo m. A is represented as As and Ac. B is represented as Bs and Bc. All variables are n-bit numbers.
Output: Variable Result represented as Results and Resultc. The relation between A, B, and Result is: Result = |A + B|m.
Procedure:
begin
    Do in parallel
    begin
        Call Sum(temp1, As, Ac, Bs)
        Call Carry(temp2, As, Ac, Bs)
    end
    Do in parallel
    begin
        Call Sum(temp3, temp1, temp2, Bc)
        Call Carry(temp4, temp1, temp2, Bc)
    end
    Case (temp2[n+1] + temp4[n+1]) of
        0: Do in parallel
           begin
               Results := temp3
               Resultc := temp4
           end
           exit
        1: Do in parallel
           begin
               Call Sum(temp5, temp3, temp4, (2^n - m))
               Call Carry(temp6, temp3, temp4, (2^n - m))
           end
        2: Do in parallel
           begin
               Call Sum(temp5, temp3, temp4, 2*(2^n - m))
               Call Carry(temp6, temp3, temp4, 2*(2^n - m))
           end
    end case
    Case (temp6[n+1]) of
        0: Do in parallel
           begin
               Results := temp5
               Resultc := temp6
           end
           exit
        1: Do in parallel
           begin
               Call Sum(temp7, temp5, temp6, (2^n - m))
               Call Carry(temp8, temp5, temp6, (2^n - m))
           end
    end case
    Case (temp8[n+1]) of
        0: Do in parallel
           begin
               Results := temp7
               Resultc := temp8
           end
        1: Do in parallel
           begin
               Call Sum(temp9, temp7, temp8, (2^n - m))
               Call Carry(temp10, temp7, temp8, (2^n - m))
           end
           Do in parallel
           begin
               Results := temp9
               Resultc := temp10
           end
    end case
end.

Sum (A, B, C, D)
begin
    Do in parallel (1 <= i <= n)
        A[i] := B[i] XOR C[i] XOR D[i]
end

Carry (A, B, C, D)
begin
    A[1] := 0
    Do in parallel (1 <= i <= n)
        A[i+1] := (B[i] AND C[i]) OR (B[i] AND D[i]) OR (C[i] AND D[i])
end
An implementation of the algorithm is shown in Figure 3.b.

Figure 3.b. Different Stages of the MCSA
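The modulo addition scheme can be approximated in software with the Python sketch below. It is not the paper's exact stage structure: the fixed case statements are replaced by a loop that repeats the correction until no carry leaves the n-bit word, and the step count is simply reported. The names `csa` and `mcsa_add` are hypothetical, the Sum is taken as the bitwise XOR and the Carry as the shifted majority (the usual carry-save definitions), and m is assumed to satisfy 2^(n-1) < m <= 2^n so that the correction constants fit in n bits.

```python
def csa(a, b, c):
    """3-to-2 carry-save step: returns (s, cy) with s + cy == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def mcsa_add(a_pair, b_pair, m, n):
    """Add A = (As, Ac) and B = (Bs, Bc) modulo m in sum/carry form.

    Returns ((Rs, Rc), steps) where (Rs + Rc) % m == (A + B) % m and both
    words fit in n bits. A carry out of bit n has weight 2^n, which is
    congruent to 2^n - m modulo m, so it is re-injected as that constant.
    """
    if not (1 << (n - 1)) < m <= (1 << n):
        raise ValueError("sketch assumes 2^(n-1) < m <= 2^n")
    mask = (1 << n) - 1
    if any(v > mask or v < 0 for v in (*a_pair, *b_pair)):
        raise ValueError("operands must be n-bit numbers")
    k = (1 << n) - m                      # correction constant 2^n - m
    t1, t2 = csa(a_pair[0], a_pair[1], b_pair[0])
    q = t2 >> n                           # carry out of step 1
    t2 &= mask
    s, c = csa(t1, t2, b_pair[1])
    q += c >> n                           # carry out of step 2
    c &= mask
    steps = 2
    while q:                              # add q * (2^n - m) and re-check
        s, c = csa(s, c, q * k)
        q = c >> n
        c &= mask
        steps += 1
    return (s, c), steps

if __name__ == "__main__":
    (rs, rc), steps = mcsa_add((1272, 0), (450, 0), 2050, 12)
    print((rs + rc) % 2050, steps)
```

Each dropped carry lowers the represented value by exactly m, so the loop terminates; the fixed bound of five steps is the subject of Theorem 2 and is not checked by this sketch.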
Theorem 2: The modulo adder scheme for adding two n-bit numbers in modulo m has an asymptotic time complexity O(1).
Proof: To prove that the number of steps is constant (five), we need to prove that the last carry is equal to zero in five or fewer steps. Induction on the number of bits n is used to prove the correctness of the theorem.
[1] Basis step: n = 0 means that we do not add any numbers, and in this case the required number of steps is zero.
[2] Induction hypothesis: assume for a fixed arbitrary n > 0 that the maximum number of steps is five.
[3] Induction step: for numbers with n+1 bits let:
q = temp2[n+1] + temp4[n+2]
Then we have the following cases:
(a) q = 0: then the carry propagation stopped at bit n, and it ends after five steps at most according to the induction hypothesis.
(b) q = 1: then the correction is $2^{n+1} - m$ in step 3. Since $m > 2^n$, then $2^{n+1} - m < 2^n$, which means that $(2^{n+1} - m)[n] = 0$. In the worst case temp3[n+1] and temp4[n+2] are equal to one. This means that temp5[n+1] = 0 and temp6[n+2] = 1, then temp8[n+2] = 0. In this case the correction is done in two steps (step 3 and step 4).
(c) q = 2: then the correction is $2 \cdot (2^{n+1} - m)$ in step 3. In the worst case temp3[n+1], temp4[n+2], and $2 \cdot (2^{n+1} - m)$ are equal to one. Then temp5[n+1] = 1, temp6[n+1] = 1 and $2^{n+1} - M$ = 0. At step 4 temp7[n+1] = 0 and temp8[n+2] = 1. At step 5 temp9[n+1] = 1 and temp10[n+2] = 0. In this case the correction is done in three steps (steps 3-5).
Since the adder has a fixed number of stages which does not depend on the operands' length, it can be used in the implementation of a pipelined multi-operand modulo addition scheme [19].
Example: As an example, the modulo addition of A = 1272 and B = 450 for m = 2050 is shown in Figure 3.c. There is no unique representation for A and B. One valid representation is shown in Figure 3.c, which also shows the detailed modulo addition operation for this example. In step 1 we get temp2[13] = 1, and in step 2 we get temp4[13] = 1, which means that at step 3 we have to add $2(2^n - M)$. At step 3 we get temp6[13] = 1, which means that at step 4 we have to add $2^n - M$. At step 4 we get temp8[13] = 0, which means that the addition process stops at step 4. The result of step 4 is the final result.
Figure 3.c. A Detailed Example for the Modulo Addition (M = 2050, N = 12)
Theorem 3: Adding n numbers $(y_1, y_2, \ldots, y_n)$ in modulo M is equivalent to:
[1] Adding $(y_1, y_2)$ modulo M, ..., $(y_i, y_{i+1})$, ..., and $(y_{n-1}, y_n)$, which gives $y_{12}, \ldots, y_{(n-1)n}$.
[2] Step [1] is repeated on $(y_{12}, y_{34}), \ldots, (y_{(n-3)(n-2)}, y_{(n-1)n})$.
[3] Step [2] is repeated $\lceil \log N \rceil - 2$ times to obtain one final output represented as a sum and carry.
Proof: To add two numbers a and b in modulo M we have the following cases:
(i) $a < M$ and $b < M$; then $a = |a|_M$ and $b = |b|_M$, so:
$$|a + b|_M = \left|\, |a|_M + |b|_M \,\right|_M \qquad (12)$$
(ii) $a > M$ and $b < M$; then $b = |b|_M$ and $a = M + z$, so:
$$|a + b|_M = |M + z + b|_M = |z + b|_M \qquad (13)$$
Since $z < M$ and $b < M$, then from (12) and (13):
$$|a + b|_M = \left|\, |a|_M + |b|_M \,\right|_M$$
(iii) $a < M$ and $b > M$: like case (ii).
(iv) $a > M$ and $b > M$; then $a = M + z$ and $b = M + y$, so:
$$|a + b|_M = |M + z + M + y|_M = |z + y|_M = \left|\, |a|_M + |b|_M \,\right|_M$$
From the previous four cases:
$$|a + b|_M = \left|\, |a|_M + |b|_M \,\right|_M \qquad (14)$$
Since addition is associative, (14) can be applied to any split of the n operands, for example $|y_1 + y_2 + \cdots + y_n|_M = \left|\, |y_1 + y_2|_M + |y_3 + \cdots + y_n|_M \,\right|_M$. We can further expand this expression using the same method to get the addition process on the right hand side in terms of only two operands added in modulo M.
Theorem 3 means that adding n numbers in modulo M can be performed using a binary tree consisting of units that are capable of adding only two numbers in modulo M. MCSAs are used as those building blocks to perform the addition process. Since the MCSA requires that inputs be represented in the form of sum and carry, this form should be enforced at all levels. The form will be enforced automatically for levels $\ge 2$, because the outputs of the previous levels are in the correct form. For the first level we have the following:
$$y_{is} = y_i, \quad y_{ic} = 0 \quad \forall\; 1 \le i \le n$$
For the last stage the output is in the form of sum and carry, which is exactly the same form as CSAs. Figure 3.d shows the binary tree required to add n numbers in modulo M.

Figure 3.d. Partial Sum Adder Implementation Using MCSAs (stages 1 through $\lceil \log n \rceil$)
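The pairwise reduction described by Theorem 3 can be sketched in Python as a binary tree of two-operand modulo additions. This is a behavioral illustration only: each tree node uses the built-in `%` rather than an MCSA cell, and the name `mod_add_tree` is a hypothetical helper.

```python
def mod_add_tree(values, M):
    """Reduce a non-empty list of operands pairwise, modulo M.

    Returns (result, depth) where depth is the number of tree levels used.
    """
    level = [v % M for v in values]
    depth = 0
    while len(level) > 1:
        # Add neighbours pairwise; an unpaired last operand is passed through.
        nxt = [(level[i] + level[i + 1]) % M
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth

if __name__ == "__main__":
    print(mod_add_tree([5, 9, 14, 3, 8], 11))
```

The depth equals the ceiling of the base-2 logarithm of the operand count, which is the level count used in the performance analysis.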
0
ifG'5
(vl]i
!
1
if
Gi
>
[
--l]i
D;l
=
for
j
=
l..A-1(20)
fori=1,2
,
.......,
g
The MC level of this part is exactly the same as previously proposed; that is, it consists of $\Lambda$ $MC^2$ cells, and each $MC^2$ cell has the same Boolean equation as follows:
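$$MC_l^2 = D_l^{g} + D_l^{g-1} E_l^{g} + D_l^{g-2} E_l^{g} E_l^{g-1} + \cdots + D_l^{2} E_l^{g} E_l^{g-1} \cdots E_l^{3} + D_l^{1} E_l^{g} E_l^{g-1} \cdots E_l^{3} E_l^{2}$$
Since a different set of values is compared with S, we have for M odd,
$$MC_l^2 = \begin{cases} 0 & \text{if } S \le \frac{(2l-1)M - 1}{2} \\ 1 & \text{if } S > \frac{(2l-1)M - 1}{2} \end{cases}$$
and for M even,
$$MC_l^2 = \begin{cases} 0 & \text{if } S \le \frac{(2l-1)M}{2} - 1 \\ 1 & \text{if } S > \frac{(2l-1)M}{2} - 1 \end{cases}$$
for $l = 1, 2, \ldots, \Lambda - 1$.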
The BC level for this part of the design consists of $\Lambda$ $BC^2$ cells. Each of these cells has a control line C. If C is equal to zero, then all the output lines of the BC level will be equal to one and residue to unsigned magnitude number system conversion will be performed; otherwise, only one of the BC level output lines will be equal to one, and thus residue to 2's complement number system conversion will be performed. The Boolean equation for a BC cell is as follows:
$$BC_m^2 = \overline{C} + \overline{MC_m^2} \cdot MC_{m-1}^2 \quad \text{for } m = 1, 2, \ldots, \Lambda$$
where $MC_0^2 = 1$ and $MC_\Lambda^2 = 0$.
Therefore, the range of S is determined to be
$$\frac{(2m-3)M + 1}{2} \le S < \frac{(2m-1)M + 1}{2}$$
for M odd, and
$$\frac{(2m-3)M}{2} \le S < \frac{(2m-1)M}{2}$$
for M even, if $BC_m^2 = 1$. (Note that the lower bound is equal to zero when m = 1.) Figure 5 shows the implementation of this part of the design.
3.4. Final Corrector
This stage consists of $\Lambda$ tristate multiplexers and a carry lookahead adder. The $BC^1$ input lines will be used to enable one of the tristate multiplexers, while the $BC^2$ input lines will be used as the selectors of the multiplexers. If $BC_i^1$ is set, then $(i-1)M \le S < iM$. The lower bound $(i-1)M$ will be subtracted from S if conversion to unsigned magnitude number system is desired, or if S is less than $\frac{(2i-1)M + 1}{2}$ for M odd or $\frac{(2i-1)M}{2}$ for M even; otherwise, the upper bound, $iM$, will be subtracted from S. The implementation of this stage is shown in Figure 6. The CLA is used to add the 2's complement of the value to be subtracted to S and output the desired result.
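The selection rule of the final corrector can be summarized in a behavioral Python sketch. The integer division that finds the interval index stands in for the range determinator, and the name `final_correct` and its `signed` flag (playing the role of the control line C) are assumptions introduced for illustration.

```python
def final_correct(S, M, lam, signed):
    """Map the raw sum S (0 <= S < lam * M) to the decoded output.

    signed=False gives the unsigned magnitude result in [0, M);
    signed=True gives the 2's complement (dynamic range) result.
    """
    i = S // M + 1                       # interval index: (i-1)M <= S < iM
    if not 1 <= i <= lam:
        raise ValueError("S is outside the range covered by lam intervals")
    lower, upper = (i - 1) * M, i * M
    if not signed:
        return S - lower
    # Midpoint of the interval; below it the value is non-negative.
    if M % 2 == 1:
        threshold = ((2 * i - 1) * M + 1) // 2
    else:
        threshold = (2 * i - 1) * M // 2
    return S - lower if S < threshold else S - upper

if __name__ == "__main__":
    print(final_correct(45, 30, 4, signed=False))
    print(final_correct(45, 30, 4, signed=True))
```

Subtracting either bound is a single addition of a precomputed 2's complement constant, which is why one adder suffices at this stage.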
Figure 5. Implementation of the Second Part of Range Determinator Stage

Figure 6. Implementation of Final Corrector Stage
4. Performance Evaluation
[1] The partial sum generator is implemented using small ROMs. If the number of residues is N and each residue is represented in P bits, then it is required to use N ROMs. Each ROM stores values bounded by M, so the size of each ROM is $2^P \cdot \lceil \log M \rceil$. The total area required for this stage is $N \cdot 2^P \cdot \lceil \log M \rceil$. Since ROMs have a constant time delay (P is a small number) which does not depend on N, the delay of this stage is O(1).
[2] The partial sum adder is implemented in two different ways:
(a) Using CSAs: The complexity of the scheme is determined by Theorem 1. Since each CSA has a constant time delay, the total time required to add N numbers in modulo M is $\Theta(\log N)$.
(b) Using MCSAs: The number of levels required to perform the addition of N numbers using a binary tree of MCSAs is $\lceil \log N \rceil$, as shown in Theorem 3. Since at each level the required time is constant (the MCSA has a constant time), the total time required for this step using MCSAs is $\Theta(\log n)$.
[3] The range determinator consists of three different levels (Figure 4). The first level consists of g ROMs. The second level is the MC cells, which are combinational circuits that can be represented with a two-level switching function. Finally, the last level is a two-stage combinational circuit. The three levels have a constant time delay that does not depend on N. The previous analysis shows that the range determinator has a time delay of $\Theta(1)$.
[4] The final corrector consists of two stages. In the first stage we have $\Lambda$ tristate multiplexers, which have a constant delay equivalent to two serial NAND gates. The second stage is a CLA which has a constant delay; for a number of bits less than 64 the delay is equivalent to the delay of 12 serial NAND gates, as shown in [13]. For a number of bits larger than this we can still obtain a constant delay CLA. Then the final corrector has a delay of $\Theta(1)$.
From cases [1]-[4] we see that all stages except the partial sum adder have a constant time delay which does not depend on the number of residues N. Only the second stage requires $\Theta(\log N)$ steps.
5. Conclusions
The residue decoder introduced in this paper has a total delay of $\lceil \log N \rceil$. In addition, it has several advantages as listed below:
1) The design is quite modular and consists of simple cells such as small ROMs and MC cells. This makes the implementation of the whole residue decoder in a single chip possible.
2) It doesn't have any limitation on the moduli used.
3) It is flexible, since it can convert residues to either unsigned magnitude or 2's complement number system, and it is controlled by only a control line, C. This means that it can be applied to a wider area.
4) It is fast compared with most schemes proposed before, since it has a time complexity of $\Theta(\log N)$.
5) It can be easily pipelined without any modifications.
Acknowledgement
The authors wish to thank the reviewers for their valuable comments.
References
[1] M. A. Bayoumi, "Digital Filter VLSI Systolic Arrays over Finite Fields for DSP Applications," Proc. of the 6th IEEE Annual Phoenix Conference on Computers and Communications, pp. 194-199, Feb. 1987.
[2] M. A. Bayoumi, G. A. Jullien, W. C. Miller, "A Look-up Table VLSI Design Methodology for RNS Structures Used in DSP Applications," IEEE Trans. on Circuits and Systems, pp. 604-616, Vol. 34, No. 6, June 1987.
[3] F. J. Taylor, "Residue Arithmetic: A Tutorial with Examples," IEEE Computer Magazine, pp. 50-62, May 1984.
[4] M. A. Bayoumi, "A High Speed VLSI Complex Digital Signal Processor Based on Quadratic Residue Number System," VLSI Signal Processing II, pp. 200-211, IEEE Press, 1986.
[5] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer Technology, New York: McGraw-Hill, 1967.
[6] W. K. Jenkins, "Techniques for Residue to Analog Conversion for Residue Encoded Digital Filters," IEEE Trans. Circuits Syst., vol. CAS-25, pp. 553-562, July 1978.
[7] S. Andraos and H. Ahmed, "A New Efficient Memoryless Residue to Binary Converter," IEEE Trans. Circuits Syst., vol. 35, Nov. 1988, pp. 1441-1444.
[8] K. M. Ibrahim and S. N. Saloum, "An Efficient Residue to Binary Converter Design," IEEE Trans. Circuits Syst., vol. 35, pp. 1156-1158, September 1988.
[9] S. Bandyopadhyay, G. A. Jullien, and A. Sengupta, "A Systolic Array for Fault-Tolerant Digital Signal Processing Using A Residue Number System Approach," Proc. of Intl. Conf. on Systolic Arrays, pp. 577-586, 1988.
[10] A. P. Shenoy and R. Kumaresan, "Residue to Binary Conversion for RNS Arithmetic Using Only Modular Look-up Tables," IEEE Trans. Circuits Syst., vol. 35, pp. 1158-1162, September 1988.
[11] T. V. Vu, "Efficient Implementations of the CRT for Sign Detection and Residue Decoding," IEEE Trans. Comp., vol. C-34, pp. 646-651, July 1985.
[12] C. N. Zhang, B. Shirazi, and D. Y. Y. Yun, "Parallel Designs for Chinese Remainder Conversion," Proc. IEEE 16th Annual Conf. on Parallel Processing, Aug. 1987.
[13] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 1978.
[14] R. M. Capocelli and R. Giancarlo, "Efficient VLSI Networks for Converting an Integer from Binary System and Vice Versa," IEEE Trans. Circuits Syst., vol. 35, Nov. 1988, pp. 1425-1430.
[15] G. Alia and E. Martinelli, "A VLSI Algorithm for Direct and Reverse Conversion from Weighted Binary Number System to Residue Number System," IEEE Trans. Circuits Syst., vol. CAS-31, 1984, pp. 1033-1039.
[16] K. P. Lee, M. A. Bayoumi and K. M. Elleithy, "A Fast and Flexible Residue Decoder Based on The Chinese Remainder Theorem," The 1989 International Symposium on Circuits and Systems.
[17] K. M. Elleithy, "On Bit-Parallel Processing for Modulo Arithmetic," VLSI Technical Report TR86-81, The Center for Advanced Computer Studies, University of Southwestern Louisiana, 1986.
[18] A. Avizienis, "A Study of Redundant Number Representations for Parallel Digital Computers," Ph.D Thesis, Univ. of Illinois, Urbana, Illinois, May 1960.
[19] K. M. Elleithy, "On the Bit-Parallel Implementation for the Chinese Remainder Theorem," VLSI Technical Report TR87-8-1, The Center for Advanced Computer Studies, University of Southwestern Louisiana, 1987.
Chapter
Digital Signal Processing, since its establishment as a discipline 30 years ago, has always received a great impetus from electronic technological advances. It often rides the crest of that wave and sometimes is responsible for pushing it.
Conference Paper
Full-text available
A O(log n) algorithm for large moduli multiplication for Residue Number System(FtNS) based architectures is proposed. The proposed modulo multiplier is much faster than previously proposed multipliers and more area efficient. The implementation of the multiplier is modular and is based on simple cells which leads to efficient VLSI realization. A VLSI implementation using 3 micron CMOS process shows that a pipelined n-bit modulo multiplication scheme can operate with a throughput of 30 M operation per second.
Conference Paper
Full-text available
Parallelism on the algorithmic, architectural, and arithmetic levels is exploited in the design of a residue number system (RNS) based architecture. The architecture is based on modulo processors. Each modulo processor is implemented by a two-dimensional systolic array composed of very simple cells. The decoding state is implemented using a two-dimensional array. The decoding bottleneck is eliminated. The whole architecture is pipelined, which leads to a high throughput rate. High speed algorithms for modulo addition, modulo multiplication, and RNS decoding are presented
Data
A #(fog n) algorithm for large moduli multiplication for Residue Number System(FtNS) based architectures is proposed. The proposed modulo multiplier is much faster than previously proposed multipliers and more area efficient. The implementation of the multiplier is modular and is based on simple cells which leads to efficient VLSI realization. A VLSI implementation using 3 micron CMOS process shows that a pipelined n-bit modulo multiplication scheme can operate with a throughput of 30 M operation per second.
Data
In this paper parallelism on the algorithmic, architec-tural, and arithmetic levels is exploited in the design of a Residue Number System (RNS) based architecture. The archi-tecture is based on modulo pro-cessors. Each modulo processor is implemented by two dimen-sional systolic array composed of very simple cells. The decod-ing stage is implemented using a 2-0 array, too. The decoding bottleneck is eliminated. The whole architecture is pipelined which leads to high throughput rate.
Data
Designing an optimal Residue Number System (RNS) processor in terms of area and speed depends on the choice of the system moduli. In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an aredtime objective function.
Data
Designing an optimal Residue Number System (RNS) processor in terms of area and speed depends on the choice of the system moduli. In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an aredtime objective function.
Article
A generalization of a new generic 4-modulus base for residue number systems (RNS) is presented in this paper. An efficient RNS to binary conversion algorithm and a hierarchical architecture for these bases are also described. The key features of our conversion architecture compared to previous published architectures of the same output range are a larger moduli set selection and savings on the critical delay, area and power. The FPGA implementation and the detailed proof supporting it are also discussed.
Conference Paper
A new architecture for modulo 2<sup>n</sup>+1 multi-operand addition (MOMA) of weighted operands is introduced. It is based on a translator circuit that enables the use of n-bit operations for performing the weighted multi-operand addition. Our experimental results indicate that the proposed MOMAs offer significant savings in execution time compared to previously proposed solutions, which require either two parallel additions or a carry-save adder tree with twice the depth of the proposed one. In most cases, the proposed MOMAs can also be implemented in less area.
Conference Paper
The contribution of this paper is twofold. We first show that an augmented diminished-1 adder can be used for the modulo 2<sup>n</sup>+1 addition of two n-bit operands in the weighted representation, if it is driven by operands whose sum has been decreased by 1. This scheme outperforms solutions that are based on the use of binary adders and/or weighted modulo 2<sup>n</sup>+1 adders in both area and delay terms. We then apply this scheme in the design of residue generators (RGs) and multi-operand modulo adders (MOMAs). The resulting arithmetic components remove at least a whole parallel adder out of the critical path of the currently most efficient proposals. Experimental results indicate savings of more than 30% in execution time and of approximately 19% in implementation area when the proposed architectures are used.
Conference Paper
An implementation of a fast and flexible residue decoder based on the Chinese remainder theorem (CRT) is proposed for decoding a set of residues to its equivalent representation in a weighted binary number system. This decoder is flexible, since the decoded data can be selected to be either an unsigned-magnitude or a two's-complement binary number. It is shown that this residue decoder is extremely fast compared to previously proposed residue decoders.
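As a rough software illustration of what such a decoder computes (not the hardware architecture of the paper), CRT decoding with a selectable unsigned or signed output can be sketched as follows. The function name `crt_decode`, the `signed` flag, and the convention that the upper half of the dynamic range represents negative values are assumptions made for this sketch.

```python
from math import prod

def crt_decode(residues, moduli, signed=False):
    """Decode residues into an integer with the Chinese remainder theorem.

    Assumes the moduli are pairwise coprime. Hypothetical helper for
    illustration; it models the arithmetic only, not a CSA-based circuit.
    """
    M = prod(moduli)                  # dynamic range of the RNS
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m                   # product of all other moduli
        inv = pow(Mi, -1, m)          # inverse of Mi modulo m (Python 3.8+)
        total += r * inv * Mi
    x = total % M                     # unsigned result in [0, M)
    # Assumed signed convention: values in the upper half of [0, M)
    # stand for negative numbers.
    if signed and x >= (M + 1) // 2:
        x -= M
    return x

# Moduli (3, 5, 7) give a dynamic range of 105.
print(crt_decode((2, 3, 2), (3, 5, 7)))               # residues of 23
print(crt_decode((1, 0, 2), (3, 5, 7), signed=True))  # residues of -5
```

The same residue set decodes to 100 in the unsigned interpretation and to -5 in the signed one, which is the kind of output selection the abstract refers to.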
Article
A description is given of a novel residue-to-binary converter. It converts the three-moduli residue number system (RNS) representation (2<sup>n</sup>-1, 2<sup>n</sup>, 2<sup>n</sup>+1) into binary representation. The conversion process depends on simple mathematical relationships without using mixed radix or the Chinese remainder theorem. These simple relationships provide a simpler hardware realization for the RNS-to-binary conversion.
Data
In this paper, an implementation of a fast and flexible residue decoder based on the Chinese Remainder Theorem, or CRT, is proposed to decode a set of residues to its equivalent representation in a weighted binary number system. This decoder is flexible since the decoded data can be selected to be either an unsigned-magnitude or a 2's-complement binary number. It is shown that this residue decoder is extremely fast compared to the previously proposed residue decoders.
Article
The effects of redundancy in each digital position of a number representation for arithmetic operations in parallel digital computers were investigated without the use of a carry-borrow identification. In the approach employed, a method of addition which has the desired properties is initially postulated, and from this description the properties of a number representation which permits the postulated method of addition were derived. A totally-parallel mode of addition is defined and postulated to be the required characteristic of a number representation. A class of signed-digit number representations which permit totally-parallel addition is developed. Various significant properties of these signed-digit representations are developed and the existence of methods for the execution of arithmetic operations is demonstrated. The logical design of an adder circuit for totally-parallel addition of two numbers in signed-digit representation is discussed. (C.J.G.)
Article
Thesis--University of Illinois. Includes bibliographical references (leaf 76). Vita. Photocopy by University Microfilms, 1970.
Conference Paper
Fault detection and correction using the Chinese remainder theorem for decoding is investigated. It is shown that this approach is well suited for implementation by VLSI circuits for digital signal processing using systolic architectures. A systolic array for multioperand residue addition is considered, and its application in error-tolerant digital signal processing is presented. It is shown that the array can be easily used for efficiently comparing a set of residues S = {x<sub>0</sub>, x<sub>1</sub>, ..., x<sub>N-1</sub>} to a known constant. This algorithm has been used to detect errors by checking whether S lies in the illegitimate range. The multioperand residue adder has been modified to design a variable modulus adder. An error-tolerant RNS finite-impulse response filter has been designed using this variable modulus adder. Three schemes for error detection and correction are proposed.
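The illegitimate-range check mentioned above can be illustrated with a small software sketch, assuming a redundant RNS in which extra moduli extend the range beyond the one used for data. The names `crt` and `in_illegitimate_range` and the split into information and redundant moduli are assumptions for this sketch; the systolic comparison array of the paper is not modeled.

```python
from math import prod

def crt(residues, moduli):
    """Plain CRT reconstruction; assumes pairwise coprime moduli."""
    M = prod(moduli)
    # pow(x, -1, m) computes a modular inverse (Python 3.8+).
    return sum(r * pow(M // m, -1, m) * (M // m)
               for r, m in zip(residues, moduli)) % M

def in_illegitimate_range(residues, info_moduli, redundant_moduli):
    """Return True if the decoded value falls outside the legitimate range.

    The legitimate range is assumed to be [0, product of info_moduli);
    anything above it signals a corrupted residue digit.
    """
    legit = prod(info_moduli)
    value = crt(residues, list(info_moduli) + list(redundant_moduli))
    return value >= legit

# Information moduli (3, 5) with one redundant modulus 7.
print(in_illegitimate_range((2, 3, 1), (3, 5), (7,)))  # residues of 8, no error
print(in_illegitimate_range((2, 3, 2), (3, 5), (7,)))  # last digit corrupted
```

In the second call the corrupted digit moves the decoded value to 23, outside the assumed legitimate range of 15, so the error is flagged.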
Article
In current high-speed digital signal-processing (DSP) architectures, the Residue Number System (RNS) has an important role to play. RNS implementations have a highly modular structure, and are not dependent upon large binary arithmetic elements. RNS implementations become more attractive when combined with the advantages offered by VLSI fabrication technology. In this paper, a novel design methodology has been developed for RNS structures, based on using look-up tables, which takes into consideration the unique features and requirements of RNS. The paper discusses the following three phases: 1) developing a look-up table layout model, which is used to derive relationships between the size of each modulus and both chip area and time; this model supports all types of moduli; 2) selecting the most efficient layout according to the design requirements; the procedure allows the designer to control the area, time, or the configuration of the memory module required for implementing a modulo look-up table; 3) proposing a set of multi-look-up table modules, to be used as building block units for implementing digital signal-processing architectures. The paper uses two examples to illustrate the use of the modules in phase 3).