IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 39, NO. 4, APRIL 1992

Fast and Flexible Architectures for RNS Arithmetic Decoding

Khaled M. Elleithy, Member, IEEE, and Magdy A. Bayoumi, Senior Member, IEEE
Abstract: An implementation of a fast and flexible residue decoder for residue number system (RNS)-based architectures is proposed. The decoder is based on the Chinese Remainder Theorem (CRT). It decodes a set of residues to its equivalent representation in the weighted binary number system. The decoder is flexible since the decoded data can be selected to be either an unsigned magnitude or a 2's complement binary number. Two different architectures are analyzed; the first is based on carry-save adders (CSA's), while the other is based on modulo adders (MA's). The implementation of both architectures is modular and is based on simple cells, which leads to efficient VLSI realization. The proposed decoder is fast; it has a time complexity of $O(\log N)$, where $N$ is the number of moduli.

Keywords: Residue number system, Chinese remainder theorem, modulo adder, carry-save adder, residue decoding, finite field algorithm.
I. INTRODUCTION

Recently, RNS has received increased attention due to its ability to support high-speed concurrent arithmetic [1]-[3]. Applications such as the fast Fourier transform, digital filtering, and image processing exploit the efficiency of RNS addition and multiplication; they do not require the difficult RNS operations such as division and magnitude comparison. RNS has been employed efficiently in the implementation of digital signal processors [1], [4].

Since special-purpose processors are associated with general-purpose computers, binary-to-residue and residue-to-binary conversions become inherently important, and the conversion process should not offset the speed gain in RNS operations. While binary-to-residue conversion does not pose a serious threat to high-speed RNS operations, residue-to-binary conversion can be a bottleneck. The Chinese Remainder Theorem (CRT) [5], [6] is considered the main algorithm for the conversion process. Several implementations of the residue decoder have been reported [7]-[15]. The residue decoders in [7] and [8] are based on three moduli of the form $(2^n - 1, 2^n, 2^n + 1)$, where $n$ is the number of bits. Because of the limitation imposed on the number and choice of moduli, these decoders are limited in application. In [9], a scheme of $O(\log NP)$ (where $P$ is the number of bits and $N$ is the number of moduli) supports only unsigned magnitude binary numbers. In [10], the residue decoder is based on the base extension technique; it uses only modular look-up tables in its implementation. Since look-up tables are used, the moduli must not be large for the implementation to be feasible. In addition, it does not support residue to 2's complement binary number system conversion. Although look-up tables are used in this scheme, its time complexity is $O(N^2)$. The implementation in [11] requires that one of the moduli be a power of two; therefore, it may be limited in application. In [12], the proposed residue decoders are based on biased addition and take advantage of the fast addition speed of the CSA [16], but the conversion output is not in 2's complement form. In [13] and [14], the scheme used has a time complexity of $O((\log N)^2)$. In [15], the mixed-radix conversion algorithm is used with a time complexity of $O(N)$.

In this paper, an $O(\log N)$ residue decoder capable of decoding a set of residues to its equivalent representation in the unsigned magnitude or 2's complement binary number system is introduced. Two different architectures, using CSA's based on [17] and modulo adders (MA's) [18], are implemented. In the following section, the RNS theory is reviewed. Section III discusses how this fast and flexible residue decoder can be implemented. Section IV evaluates the speed performance of this residue decoder.

Manuscript received October 15, 1990; revised December 17, 1991. This work was supported in part by NSF Grant MIP-8809811. This paper was recommended by Associate Editor M. H. Etzel.
K. M. Elleithy is with the Computer Engineering Department, King Fahd University, Dhahran 31261, Saudi Arabia.
M. A. Bayoumi is with the Center for Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, LA 70504.
IEEE Log Number 9107521.
II. RESIDUE NUMBER SYSTEM

In RNS, an integer $X$ can be represented by an $N$-tuple of residue digits,
$$X = (r_1, r_2, \ldots, r_N)$$
where $r_i = |X|_{m_i}$, with respect to a set of $N$ moduli $\{m_1, m_2, \ldots, m_N\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime; that is,
$$\mathrm{GCD}(m_i, m_j) = 1, \quad \text{for } i \neq j.$$
Then it can be shown that there is a unique representation for each number in the range $0 \le X < \prod_{i=1}^{N} m_i = M$, where $N$ is the number of moduli.
An arithmetic operation on two integers $A$ and $B$ is equivalent to the same operation on their residue representations; that is,
$$|A \circ B|_{m_i} = \big|\, |A|_{m_i} \circ |B|_{m_i} \big|_{m_i}, \quad i = 1, 2, \ldots, N,$$
where "$\circ$" can be addition, subtraction, or multiplication. Therefore, it is desirable to convert binary arithmetic on large integers to residue arithmetic on smaller residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.

1057-7130/92$03.00 © 1992 IEEE
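The carry-free property described above can be illustrated with a minimal sketch. The moduli `(3, 5, 7)` and the helper names `to_rns` and `rns_op` are assumptions chosen for illustration; they do not come from the paper.

```python
from itertools import combinations
from math import gcd, prod

MODULI = (3, 5, 7)  # assumed example moduli, pairwise relatively prime


def is_pairwise_coprime(moduli):
    """Check GCD(m_i, m_j) = 1 for every pair i != j."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def to_rns(x, moduli=MODULI):
    """Return the residue digits r_i = |x|_{m_i}."""
    return tuple(x % m for m in moduli)


def rns_op(a_res, b_res, op, moduli=MODULI):
    """Apply op digit by digit; no carry passes between residue digits."""
    return tuple(op(a, b) % m for a, b, m in zip(a_res, b_res, moduli))


M = prod(MODULI)  # dynamic range: 0 <= X < 105
a, b = 17, 4
sum_res = rns_op(to_rns(a), to_rns(b), lambda x, y: x + y)
prod_res = rns_op(to_rns(a), to_rns(b), lambda x, y: x * y)
```

Each residue digit is processed independently, so in hardware the per-modulus operations could run in parallel on narrow datapaths.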
For applications in digital signal processing, it is helpful to define a dynamic range for the RNS with positive and negative integers. The dynamic range is defined as $[-(M-1)/2, (M-1)/2]$ for $M$ odd, and as $[-M/2, M/2 - 1]$ for $M$ even, or more specifically, for $M$ odd,
$$X = \begin{cases} Z, & \text{if } Z \le \frac{M-1}{2} \\ Z - M, & \text{if } Z > \frac{M-1}{2} \end{cases}$$
and for $M$ even,
$$X = \begin{cases} Z, & \text{if } Z < \frac{M}{2} \\ Z - M, & \text{if } Z \ge \frac{M}{2} \end{cases}$$
where $Z$ is an integer within the legitimate range, $0 \le Z < M$. Any integer $X$ within the dynamic range can be represented by $N$ residue digits.
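The mapping from the legitimate range to the signed dynamic range can be written as a short sketch. The function name `to_signed` is an assumption for illustration only.

```python
def to_signed(z, M):
    """Map Z in the legitimate range [0, M) to X in the signed dynamic range."""
    if not 0 <= z < M:
        raise ValueError("z must satisfy 0 <= z < M")
    if M % 2 == 1:
        # M odd: dynamic range is [-(M-1)/2, (M-1)/2]
        return z if z <= (M - 1) // 2 else z - M
    # M even: dynamic range is [-M/2, M/2 - 1]
    return z if z < M // 2 else z - M
```

For example, with $M = 105$ the value $Z = 104$ represents $X = -1$.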
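The conversion from RNS to the weighted binary number system is done by using the CRT, which states that
$$|X|_M = \left| \sum_{j=1}^{N} \hat{m}_j \left| \hat{m}_j^{-1} r_j \right|_{m_j} \right|_M \tag{1}$$
where $\hat{m}_j = M/m_j$, $\hat{m}_j^{-1}$ is the multiplicative inverse of $\hat{m}_j$ modulo $m_j$, and $\mathrm{GCD}(m_j, m_k) = 1$ for $j \neq k$. Although the CRT provides a direct, fast, and simple conversion formula, the lack of a large and fast modulo $M$ adder has held back this approach.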
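As a software reference point, the CRT conversion can be sketched directly. This is a plain sequential illustration in the standard textbook form of the theorem, not the hardware decoder; the name `crt_decode` is an assumption.

```python
from math import prod


def crt_decode(residues, moduli):
    """Recover |X|_M from its residues using the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = M // m              # product of all the other moduli
        inv = pow(m_hat, -1, m)     # multiplicative inverse modulo m (Python 3.8+)
        total += m_hat * ((inv * r) % m)
    return total % M                # the final modulo-M reduction
```

The last line is the large modulo $M$ reduction that the text identifies as the costly step in hardware.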
III. THE RESIDUE DECODER

The residue decoder based on the CRT can be implemented by a modulo $M$ adder tree. The modulo $M$ adders at each level are used to correct the partial sum so that it stays within the legitimate range. Since the modulo $M$ adder is very slow, the implementation may add overhead to the overall speed performance of an RNS processor. In addition, the CRT only converts residues to their binary representation in the legitimate range, not in the dynamic range. Therefore, conversion to the 2's complement binary number system requires a final correction.

In order to implement a high-speed residue decoder that can perform conversion to both the unsigned magnitude and the 2's complement binary number system, the following solutions are proposed.

1) The number of modulo $M$ adders or binary adders should be reduced to a minimum.
2) CSA's can be used wherever multi-operand addition is required, due to their high addition speed.
3) MA's can be used for multi-operand addition, due to their constant speed in adding $n$-bit numbers in modulo $M$.
4) Correction can be performed only at the last stage, and it supports conversion to both the unsigned magnitude and the 2's complement binary number system.

For ease of design, the residue decoder is partitioned into four stages as shown in Fig. 1. The inputs to the residue decoder are the residues and a control line, $C$, which determines whether the output is in the unsigned magnitude or the 2's complement number system.
3.1 Partial Sum Generator

The inputs to this stage are the $N$ residues. The main function of this stage is to compute the partial sums $t_i$, where
$$t_i = \hat{m}_i \left| \hat{m}_i^{-1} r_i \right|_{m_i},$$
with $\hat{m}_i = M/m_i$ and $\hat{m}_i^{-1}$ the multiplicative inverse of $\hat{m}_i$ modulo $m_i$. Since $m_i$ is usually small, the value of $t_i$ can be obtained by accessing a look-up table with a small address space. Hence, $r_i$ serves as a ROM address input, and $t_i$ is obtained from the ROM output as shown in Fig. 2.
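The look-up table idea can be sketched as follows, with Python lists standing in for the ROM's. The table contents follow the standard CRT term for each residue; the names `build_partial_sum_roms` and `partial_sums` are assumptions for illustration.

```python
from math import prod


def build_partial_sum_roms(moduli):
    """One table per modulus: address r_i, content t_i = m_hat * |m_hat^-1 * r_i|_{m_i}."""
    M = prod(moduli)
    roms = []
    for m in moduli:
        m_hat = M // m
        inv = pow(m_hat, -1, m)  # inverse of m_hat modulo m (Python 3.8+)
        roms.append([m_hat * ((inv * r) % m) for r in range(m)])
    return roms


def partial_sums(residues, roms):
    """Read each t_i by using the residue as the ROM address."""
    return [rom[r] for rom, r in zip(roms, residues)]
```

Each table has only $m_i$ entries, and every stored value is smaller than $M$, so the address space stays small when the moduli are small.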
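In most cases, it is better to reduce the number of partial sums $t_i$ in order to reduce the complexity of the subsequent stages and hence increase the speed of the residue decoder as a whole. Since a modulus $m_j$ can be represented by a $\lceil \log_2 m_j \rceil$-bit binary number, the $j$th residue can be written as
$$r_j = \sum_{k=0}^{\lceil \log_2 m_j \rceil - 1} 2^k b_{jk}$$
where $b_{jk} \in \{0, 1\}$. By substituting $r_j$ in (1), we can rewrite the CRT as follows:
$$|X|_M = \left| \sum_{j=1}^{N} \hat{m}_j \left| \hat{m}_j^{-1} \sum_{k=0}^{\lceil \log_2 m_j \rceil - 1} 2^k b_{jk} \right|_{m_j} \right|_M \tag{2}$$
where $\hat{m}_j = M/m_j$ and $\hat{m}_j^{-1}$ is its multiplicative inverse modulo $m_j$.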
Hence, if we have a set of eight moduli $\{2, 3, 5, 7, 11, 13, 17, 23\}$ with residues $\{r_1, r_2, r_3, r_4, r_5, r_6, r_7, r_8\}$, respectively, only four ROM's with 7-bit address inputs are needed to implement this level, and modulo summation of four operands $t_1$, $t_2$, $t_3$, and $t_4$ is required instead of eight, where each of these operands is the combined partial sum delivered by one ROM.
Fig. 1. Block diagram of the residue decoder (inputs $r_N, r_{N-1}, \ldots, r_2, r_1$; output $|X|_M$).
TABLE I
THE VALUE OF $\theta(l)$ VERSUS $l$ FOR PRACTICAL APPLICATIONS

Number of levels, l    Maximum number of operands, θ(l)
1                      3
2                      4
3                      6
4                      9
5                      13
6                      19
7                      28
8                      42
The previous implementation means that the $N$ partial sums are reduced to a smaller number of partial sums, $\lambda$.
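The ROM-sharing idea for the eight-modulus example can be sketched as follows. The pairing rule used here (widest residue with narrowest) is an assumption made for illustration, since the paper's own grouping is not recoverable from this text; `residue_bits`, `pair_moduli`, and `grouped_partial_sums` are hypothetical helper names.

```python
from math import prod

MODULI = (2, 3, 5, 7, 11, 13, 17, 23)  # the eight moduli of the example


def residue_bits(m):
    """Bits needed to address residues 0 .. m-1."""
    return (m - 1).bit_length()


def pair_moduli(moduli):
    """Assumed pairing: narrowest residue with widest, to balance address widths."""
    ordered = sorted(moduli, key=residue_bits)
    return [(ordered[i], ordered[-1 - i]) for i in range(len(ordered) // 2)]


def grouped_partial_sums(x, pairs, M):
    """One combined partial sum (reduced modulo M) per shared ROM."""
    sums = []
    for pair in pairs:
        t = 0
        for m in pair:
            m_hat = M // m
            t += m_hat * ((pow(m_hat, -1, m) * (x % m)) % m)
        sums.append(t % M)
    return sums


PAIRS = pair_moduli(MODULI)
ADDRESS_BITS = [sum(residue_bits(m) for m in pair) for pair in PAIRS]
```

With this pairing every shared ROM needs at most 7 address bits, and four operands remain to be summed modulo $M$ instead of eight.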
3.2 Partial Sum Adder

By far, the modulo $M$ summation of the partial sums $t_i$ poses the biggest challenge to the implementation of the residue decoder, due to the slow computational speed of the modulo $M$ adder. This stage can be implemented using two different approaches.
3.2.1 Implementation using CSA: A multilevel CSA tree consists of $N - 2$ CSA's and a carry propagate adder (CPA) [16]. These are used to reduce the $\lambda$ partial sums $t_i$ to a sum, $S$. Let $l$ be the number of levels in a CSA tree, and let $\theta(l)$ be the maximum number of operands that can be processed with an $l$-level CSA tree. We can compute $\theta(l)$ by the recursive formula provided by Avizienis [19]:
$$\theta(l) = 3\left\lfloor \frac{\theta(l-1)}{2} \right\rfloor + \theta(l-1) \bmod 2 \tag{3}$$
for $l = 2, 3, \ldots$, and initially $\theta(1) = 3$. Table I summarizes the values of $\theta(l)$ versus $l$ for practical situations ($\theta(l) = 3, 4, 6, 9, 13, 19, 28, 42$ for $l = 1, \ldots, 8$). The CPA is an $(m-1)$-bit two-level carry lookahead adder (CLA) [16], where $m = \lceil \log_2 (M\lambda) \rceil$. Hence the output $S$ is an $m$-bit number that is passed to the next stage. A design example for this stage is shown in Fig. 3(a). The complexity of the scheme is determined by Theorem 1.
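The recurrence for the capacity of an $l$-level CSA tree, $\theta(l) = 3\lfloor \theta(l-1)/2 \rfloor + \theta(l-1) \bmod 2$ with $\theta(1) = 3$, can be evaluated with a few lines of code. The function names `max_operands` and `levels_needed` are assumptions for illustration.

```python
def max_operands(levels):
    """theta(l): largest operand count an l-level CSA tree can reduce to two."""
    theta = 3  # theta(1): a single CSA takes three operands
    for _ in range(levels - 1):
        theta = 3 * (theta // 2) + theta % 2
    return theta


def levels_needed(n_operands):
    """Smallest l with theta(l) >= n_operands (assumes n_operands >= 3)."""
    level = 1
    while max_operands(level) < n_operands:
        level += 1
    return level
```

Because $\theta(l)$ grows roughly by a factor of $3/2$ per level, `levels_needed` grows logarithmically with the operand count.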
Fig. 3. (a) An example of partial sum adder for $\lambda = 6$. (b) A modulo sum adder. (c) Different stages of the modulo adder. (d) A detailed example for modulo addition. (e) Addition of partial products using modulo adders.

Theorem 1: The addition of $N$ numbers using CSA's can be performed in $O(\log N)$ steps.

Proof: The number of levels required for addition in a CSA tree is determined by (3). To determine the number of levels required to add $N$ numbers, let us consider the following two cases.

Case (1): $\theta(l-1)$ is even. Then
$$3\left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{3}{2}\theta(l-1) \tag{4}$$
$$\theta(l-1) \bmod 2 = 0. \tag{5}$$
Substituting in (3) using (4) and (5), we have
$$\theta(l) = \frac{3}{2}\theta(l-1). \tag{6}$$
Since $\theta(1) = 3$, we can substitute in (6) to get successive values for $\theta(l)$ as follows:
$$\theta(2) = \tfrac{3}{2} \cdot 3, \quad \theta(3) = \left(\tfrac{3}{2}\right)^2 \cdot 3, \quad \theta(4) = \left(\tfrac{3}{2}\right)^3 \cdot 3, \quad \theta(5) = \left(\tfrac{3}{2}\right)^4 \cdot 3,$$
$$\theta(l) = \left(\tfrac{3}{2}\right)^{l-1} \cdot 3 = \left(\tfrac{3}{2}\right)^{l} \cdot 2.$$
$\theta(l)$ represents the number of operands that can be added using a CSA tree that has $l$ levels. Suppose that the number of operands is $N$; then
$$N = \left(\tfrac{3}{2}\right)^{l} \cdot 2.$$
Taking the logarithm of both sides and neglecting the additive constant, we have
$$\log N = l \cdot \log \tfrac{3}{2}.$$
Then
$$l = \frac{1}{\log \frac{3}{2}} \cdot \log N.$$
We can find constants $C_1 > 0$, $C_2 > 0$, and $N_0 \ge 0$ such that for all $N \ge N_0$ the following is true:
$$C_1 \log N \le \frac{1}{\log \frac{3}{2}} \cdot \log N \le C_2 \log N. \tag{7}$$
Then
$$C_1 \log N \le l \le C_2 \log N \quad \forall N \ge N_0. \tag{8}$$
Possible values for $C_1$, $C_2$, and $N_0$ are 1, 2, and 1. Equation (8) means that $l = O(\log N)$.

Case (2): $\theta(l-1)$ is odd. Then
$$3\left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{3}{2}\theta(l-1) - 1.5 \tag{9}$$
$$\theta(l-1) \bmod 2 = 1. \tag{10}$$
Substituting in (3) using (9) and (10), we get
$$\theta(l) = \frac{3}{2}\theta(l-1) - \frac{1}{2}. \tag{11}$$
Since $\theta(1) = 3$, we can substitute in (11) to get successive values for $\theta(l)$ as follows:
$$\theta(2) = \tfrac{3}{2} \cdot 3 - \tfrac{1}{2}$$
$$\theta(3) = \left(\tfrac{3}{2}\right)^2 \cdot 3 - \tfrac{3}{2} \cdot \tfrac{1}{2} - \tfrac{1}{2}$$
$$\theta(4) = \left(\tfrac{3}{2}\right)^3 \cdot 3 - \left(\tfrac{3}{2}\right)^2 \cdot \tfrac{1}{2} - \tfrac{3}{2} \cdot \tfrac{1}{2} - \tfrac{1}{2}$$
$$\theta(l) = \left(\tfrac{3}{2}\right)^{l-1} \cdot 3 - \left( \left(\tfrac{3}{2}\right)^{l-2} + \left(\tfrac{3}{2}\right)^{l-3} + \cdots + 1 \right) \cdot 0.5 = \tfrac{4}{3}\left(\tfrac{3}{2}\right)^{l} + 1.$$
Suppose that the number of operands is $N$; then
$$N = \tfrac{4}{3}\left(\tfrac{3}{2}\right)^{l} + 1.$$
Using the same analytical method used for the case of even $\theta(l-1)$, we can find constants $C_1$, $C_2$, and $N_0 \ge 0$ such that for all $N \ge N_0$ the following is true:
$$C_1 \log N \le \frac{1}{\log \frac{3}{2}} \cdot \log N \le C_2 \log N.$$
From the previous analysis in both cases 1 and 2, $N$ numbers can be added using CSA's in $O(\log N)$. □

3.2.2 Implementation using Modulo Adder: The MA proposed in [18] is used to implement the partial sum adder. The idea of representing a number as a carry and a sum, borrowed from the CSA, can be used in modulo addition to obtain a scheme with a constant speed that does not depend on the number of bits. Basically, the CSA does not complete the addition process at each stage but postpones it to the final stage. In the intermediate stages, numbers are represented as sum and carry to avoid the complete addition process.

The MA is used to add two numbers $A$ and $B$ in modulo $m$. Fig. 3(b) shows that $A$ is represented as a pair of numbers $(A_s, A_c)$, $B$ is represented as $(B_s, B_c)$, and the output $C$ is represented as $(C_s, C_c)$. Each number is represented as a group of sum bits and carry bits. There is no unique representation for $A_s$ and $A_c$. The condition that needs to be satisfied is
$$|A_s + A_c|_m = |A|_m.$$
One possible representation is
$$A_s = |A|_m, \quad A_c = 0.$$
We need to add four numbers $(A_s, A_c, B_s, B_c)$, which requires two CSA steps. After the addition process we need to detect whether $-m$ or $2 \cdot (-m)$ is required to adjust the result. The adjusting process takes at most three steps. The proposed algorithm for modulo $m$ addition of two numbers can be described as follows.

Algorithm Modulo-Add (A, B, Result)
Input: Two variables $A$ and $B$ in modulo $m$. $A$ is represented as $A_s$ and $A_c$. $B$ is represented as $B_s$ and $B_c$. All variables are $n$-bit numbers.
Output: Variable Result, represented as $Result_s$ and $Result_c$. The relation between $A$, $B$, and Result is: $Result = |A + B|_m$.
Procedure:
begin
  {step 1}
  Do in parallel
  begin
    Call Sum(temp1, A_s, A_c, B_s)
    Call Carry(temp2, A_s, A_c, B_s)
  end
  {step 2}
  Do in parallel
  begin
    Call Sum(temp3, temp1, temp2, B_c)
    Call Carry(temp4, temp1, temp2, B_c)
  end
  {step 3}
  Case (temp2[n + 1] + temp4[n + 1]) of
    0: Do in parallel
       begin
         Result_s := temp3
         Result_c := temp4
       end
       exit
    1: Do in parallel
       begin
         Call Sum(temp5, temp3, temp4, (2^n - m))
         Call Carry(temp6, temp3, temp4, (2^n - m))
       end
    2: Do in parallel
       begin
         Call Sum(temp5, temp3, temp4, 2*(2^n - m))
         Call Carry(temp6, temp3, temp4, 2*(2^n - m))
       end
  end case
  {step 4}
  Case (temp6[n + 1]) of
    0: Do in parallel
       begin
         Result_s := temp5
         Result_c := temp6
       end
       exit
    1: Do in parallel
       begin
         Call Sum(temp7, temp5, temp6, (2^n - m))
         Call Carry(temp8, temp5, temp6, (2^n - m))
       end
  end case
  {step 5}
  Case (temp8[n + 1]) of
    0: Do in parallel
       begin
         Result_s := temp7
         Result_c := temp8
       end
    1: Do in parallel
       begin
         Call Sum(temp9, temp7, temp8, (2^n - m))
         Call Carry(temp10, temp7, temp8, (2^n - m))
       end
       Do in parallel
       begin
         Result_s := temp9
         Result_c := temp10
       end
  end case
end.

Sum (A, B, C, D)
begin
  Do in parallel (1 <= i <= n)
    A[i] := B[i] ⊕ C[i] ⊕ D[i]
end.

Carry (A, B, C, D)
begin
  A[1] := 0
  Do in parallel (1 <= i <= n)
    A[i + 1] := (B[i] ∧ C[i]) ∨ (B[i] ∧ D[i]) ∨ (C[i] ∧ D[i])
end.
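The procedure above can be modeled in software with Python integers standing in for bit vectors. This is a minimal sketch under stated assumptions: the correction is written as a loop that repeats until no carry is dropped, rather than as the fixed step 3 to step 5 sequence, so it does not rely on the five-step bound; the word length is taken as $n = \lceil \log_2 m \rceil$; and the names `csa` and `modulo_add` are illustrative.

```python
def csa(x, y, z):
    """3:2 carry-save step: x + y + z == sum_bits + carry_bits."""
    sum_bits = x ^ y ^ z
    carry_bits = ((x & y) | (x & z) | (y & z)) << 1
    return sum_bits, carry_bits


def modulo_add(a, b, m):
    """Add a = (a_s, a_c) and b = (b_s, b_c) modulo m in sum/carry form.

    Returns (result_s, result_c, steps) with (result_s + result_c) % m
    equal to (a_s + a_c + b_s + b_c) % m. All inputs must be n-bit numbers.
    """
    n = (m - 1).bit_length()        # assumed word length
    mask = (1 << n) - 1
    corr = (1 << n) - m             # adding 2^n - m after dropping 2^n nets -m
    (a_s, a_c), (b_s, b_c) = a, b

    t_s, t_c = csa(a_s, a_c, b_s)            # step 1
    dropped = t_c >> n                       # carry out of bit n+1 (0 or 1)
    t_s, t_c = csa(t_s, t_c & mask, b_c)     # step 2
    dropped += t_c >> n
    steps = 2
    while dropped:                           # correction steps
        t_s, t_c = csa(t_s, t_c & mask, dropped * corr)
        dropped = t_c >> n
        steps += 1
    return t_s, t_c, steps


s, c, steps = modulo_add((1272, 0), (450, 0), 2050)
```

Every step is a single level of full adders, so no carry ever ripples across the word; the result stays in redundant sum/carry form and can feed another modulo adder directly.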
An implementation of the algorithm is shown in Fig. 3(c).

Theorem 2: The modulo adder scheme for adding two $n$-bit numbers in modulo $m$ has an asymptotic time complexity $\Theta(1)$.

Proof: To prove that the number of steps is constant (five), we need to prove that the last carry is equal to zero in five or fewer steps. Induction on the number of bits $n$ is used to prove the theorem.

Basis step: $n = 0$ means that we do not add any numbers, and in this case the required number of steps is zero.

Induction hypothesis: Assume for a fixed arbitrary $n \ge 0$ that the maximum number of steps is five.

Induction step: For numbers with $n + 1$ bits, let $\gamma = temp_2[n+1] + temp_4[n+2]$. Then we have the following cases.

a) $\gamma = 0$: The carry propagation stopped at bit $n$, and it ends after five steps at most according to the induction hypothesis.

b) $\gamma = 1$: The correction is $2^{n+1} - m$ in step 3. Since $m > 2^n$, then $2^{n+1} - m < 2^n$, which means that $(2^{n+1} - m)[n+1] = 0$. The worst case is to have $temp_3[n+1]$ and $temp_4[n+2]$ equal to one. This means that $temp_5[n+1] = 0$ and $temp_6[n+2] = 1$; then $temp_8[n+2] = 0$. In this case the correction is done in two steps (step 3 and step 4).

c) $\gamma = 2$: The correction is $2 \cdot (2^{n+1} - m)$ in step 3. The worst case is $temp_3[n+1]$, $temp_4[n+2]$, and $2 \cdot (2^{n+1} - m)[n+1]$ all equal to one. Then $temp_5[n+1] = 1$, $temp_6[n+1] = 1$, and $(2^{n+1} - m)[n+1] = 0$. At step 4, $temp_7[n+1] = 0$ and $temp_8[n+2] = 1$. At step 5, $temp_9[n+1] = 1$ and $temp_{10}[n+2] = 0$. In this case the correction is done in three steps (steps 3-5). □

Example: As an example, the modulo addition of $A = 1272$ and $B = 450$ for $m = 2050$ is shown in Fig. 3(d). There is no unique representation for $A$ and $B$; one valid representation is shown in Fig. 3(d), together with the detailed modulo addition operation for this example. In step 1 we get $temp_2[13] = 1$, and in step 2 we get $temp_4[13] = 1$, which means that at step 3 we have to add $2(2^n - m)$. At step 3 we get $temp_6[13] = 1$, which means that at step 4 we have to add $2^n - m$. At step 4 we get $temp_8[13] = 0$, which means that the addition process stops at step 4. The result of step 4 is the final result.

The proposed modulo adder has the following advantages.

1) It does not have any limitation on the size of the modulus.
2) It is quite modular; it is a 2-D array of one type of cell (full adder).
3) It is easy to pipeline.
4) It is a very efficient architecture for the implementation of CRT decoding and modulo multiplication [20].

Theorem 3: Adding $n$ numbers $(y_1, y_2, \ldots, y_n)$ in modulo $M$ is equivalent to:

1) Adding $(y_1, y_2)$ modulo $M$, $\ldots$, $(y_i, y_{i+1})$, $\ldots$, and $(y_{n-1}, y_n)$ gives $y_{12}, \ldots, y_{(n-1)n}$.
2) Step 1 is repeated on $(y_{12}, y_{34}), \ldots, (y_{(n-3)(n-2)}, y_{(n-1)n})$.
3) Step 2 is repeated $\lceil \log n \rceil - 2$ times to obtain one final output represented as a sum and carry.

Proof: To add two numbers $a$ and $b$ in modulo $M$ we have the following cases:

i) $a < M$ and $b < M$; then $a = |a|_M$ and $b = |b|_M$, so that
$$|a + b|_M = \big|\, |a|_M + |b|_M \big|_M. \tag{12}$$
ii) $a > M$ and $b < M$; then $b = |b|_M$ and $a = M + x$, so that
$$|a + b|_M = |M + x + b|_M = |x + b|_M. \tag{13}$$
Since $x < M$ and $b < M$, then from (12) and (13):
$$|a + b|_M = \big|\, |a|_M + |b|_M \big|_M. \tag{14}$$
iii) $a < M$ and $b > M$: like case ii).
iv) $a > M$ and $b > M$; then $a = M + x$ and $b = M + y$, so that
$$|a + b|_M = |M + x + M + y|_M = |x + y|_M = \big|\, |a|_M + |b|_M \big|_M. \tag{15}$$
From the previous four cases,
$$|a + b|_M = \big|\, |a|_M + |b|_M \big|_M. \tag{16}$$
Since addition is associative, then
$$|y_1 + y_2 + \cdots + y_n|_M = |(y_1 + y_2) + (y_3 + y_4) + \cdots + (y_{n-1} + y_n)|_M.$$
Using (16) we have
$$|y_1 + y_2 + \cdots + y_n|_M = \big|\, |y_1 + y_2|_M + |y_3 + y_4|_M + \cdots + |y_{n-1} + y_n|_M \big|_M.$$
We can further expand this expression using the same method to get the addition process on the right-hand side in terms of only two operands added in modulo $M$.
Theorem 3 means that adding $n$ numbers in modulo $M$ can be performed using a binary tree consisting of units that are capable of adding only two numbers in modulo $M$. MA's are used as those building blocks to perform the addition process. Since the MA requires that its inputs be represented in the form of sum and carry, this form should be enforced at all levels. The form is enforced automatically for levels $\ge 2$, because the outputs of the previous levels are in the correct form. For the first level we have the following:
$$T_{i_s} = y_i, \quad T_{i_c} = 0 \quad \forall\, 1 \le i \le n.$$
For the last stage the output is in the form of sum and carry, which is exactly the same form we have using the CSA's. Fig. 3(e) shows the binary tree required to add $n$ numbers in modulo $M$.
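The binary tree of two-operand modulo adders can be sketched as follows. The two-operand adder here is a compact software model of the sum/carry modulo adder whose correction loop runs until no carry is dropped, which is an assumption of this sketch rather than the fixed-step hardware; `modulo_add_pair` and `modulo_sum_tree` are illustrative names.

```python
def csa(x, y, z):
    """3:2 carry-save step: x + y + z == sum_bits + carry_bits."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def modulo_add_pair(a, b, m):
    """Two-operand modulo adder on (sum, carry) pairs; software model only."""
    n = (m - 1).bit_length()
    mask, corr = (1 << n) - 1, (1 << n) - m
    t_s, t_c = csa(a[0], a[1], b[0])
    dropped = t_c >> n
    t_s, t_c = csa(t_s, t_c & mask, b[1])
    dropped += t_c >> n
    while dropped:  # each dropped carry of weight 2^n is repaid with 2^n - m
        t_s, t_c = csa(t_s, t_c & mask, dropped * corr)
        dropped = t_c >> n
    return t_s, t_c


def modulo_sum_tree(values, m):
    """Reduce the operands pairwise, level by level; returns ((sum, carry), depth)."""
    level = [(v % m, 0) for v in values]  # first level: sum = y_i, carry = 0
    depth = 0
    while len(level) > 1:
        nxt = [modulo_add_pair(level[i], level[i + 1], m)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                # an odd operand passes through unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth
```

For five operands the tree has three levels, and for eight operands it also has three, matching the $\lceil \log n \rceil$ depth of a binary tree.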
3.3 Range Determinator

This stage consists of three levels, namely ROM, magnitude comparator (MC), and bit corrector (BC). The major function of this stage is to determine the range of $S$ so that the appropriate value can be subtracted from $S$ to obtain the correct result.

Since the input to this stage, $S$, is a large binary number, it is partitioned into groups of adjacent bits. For example, if $S$ is a 24-bit number, we can partition $S$ into three 8-bit groups $G_1$, $G_2$, and $G_3$, where $G_1 = S_{7 \ldots 0}$, $G_2 = S_{15 \ldots 8}$, and $G_3 = S_{23 \ldots 16}$. Since each group is fed into a ROM module as an address input, the number of bits in each group should be small so that small ROM's that are fast and occupy a small silicon area can be used to implement this level. However, the number of groups, $g$, should be kept as small as possible since the complexity of the MC cells is a function of the number of ROM modules, $g$. Hence, there are trade-offs in choosing $g$ and the number of bits in each group. The following discussion is divided into two parts: unsigned magnitude and 2's complement.
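3.3.1 Unsigned Magnitude: As shown in Fig. 4, the input to the $i$th ROM module is $G_i$, and the outputs are the $B^i$'s and $C^i$'s. The function of this ROM module is as follows:
$$B_j^i = \begin{cases} 0, & \text{if } G_i \le (jM - 1)_i \\ 1, & \text{if } G_i > (jM - 1)_i \end{cases} \quad \text{for } j = 1, \ldots, \lambda - 1;$$
$$C_j^i = \begin{cases} 0, & \text{if } G_i \neq (jM - 1)_i \\ 1, & \text{if } G_i = (jM - 1)_i \end{cases} \quad \text{for } j = 2, \ldots, \lambda - 1,$$
for $i = 1, 2, \ldots, g$, where $(jM - 1)_i$ denotes the bit group of the constant $jM - 1$ that occupies the same positions as $G_i$. The ROM modules compare the input pattern $S$ to the first set of values in Table II and produce $g \cdot (2\lambda - 3)$ outputs that are fed to the MC level.

Fig. 4. Implementation of the first part of range determinator stage.

TABLE II
VALUES COMPARED BY MULTIMAGNITUDE COMPARATORS

j                      1          2           ...   λ-1
First set              M-1        2M-1        ...   (λ-1)M-1
Second set (M odd)     (M-1)/2    (3M-1)/2    ...   ((2λ-3)M-1)/2
Second set (M even)    M/2-1      3M/2-1      ...   (2λ-3)M/2-1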
The MC level consists of $(\lambda - 1)$ MC$^1$ cells. This level takes its input from the ROM level and does further comparison. The function of the MC$^1$ cells is determined by:
$$MC_j^1 = \begin{cases} 0, & \text{if } S < jM \\ 1, & \text{if } S \ge jM \end{cases} \quad \text{for } j = 1, 2, \ldots, \lambda - 1. \tag{15}$$
The complexity of an MC cell is a function of the number of ROM modules. If we have $g$ ROM modules, the Boolean equation for the $j$th MC cell is as follows:
$$MC_j^1 = B_j^g + B_j^{g-1} C_j^g + B_j^{g-2} C_j^g C_j^{g-1} + \cdots + B_j^2 C_j^g C_j^{g-1} \cdots C_j^3 + B_j^1 C_j^g C_j^{g-1} \cdots C_j^3 C_j^2. \tag{16}$$
Since $S$ may be larger than several of the values compared, the outputs of several MC cells may be set to 1; therefore, the BC level is used to ensure that only one of its outputs is set to one and to indicate the appropriate range. In order to do so, $\lambda$ identical BC$^1$ cells are needed, and their common Boolean equation is as follows:
$$BC_j^1 = MC_j^1 \oplus MC_{j-1}^1, \quad \text{for } j = 1, 2, \ldots, \lambda \tag{17}$$
where $MC_0^1 = 1$ and $MC_\lambda^1 = 0$.

Hence, the range of $S$ is determined to be $(m - 1)M \le S < mM$ if $BC_m^1 = 1$. Thus far, the range determination enables the $S$ modulo $M$ operation to be performed by $S - (m - 1)M$ if $BC_m^1$ is set to one.
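The unsigned range determination can be modeled with a short sketch. It assumes the reading used here: each ROM reports "greater than" and "equal to" flags per bit group, an MC cell combines them from the most significant group downward, and a BC cell is the XOR of neighbouring MC outputs with the boundary values 1 and 0. The group width, the example values, and the function names are assumptions for illustration.

```python
def split_groups(value, group_bits, g):
    """Partition value into g groups of group_bits bits; index 0 is least significant."""
    mask = (1 << group_bits) - 1
    return [(value >> (i * group_bits)) & mask for i in range(g)]


def mc_cell(s, threshold, group_bits, g):
    """Return 1 if s > threshold, using only per-group '>' (B) and '==' (C) flags."""
    G = split_groups(s, group_bits, g)
    K = split_groups(threshold, group_bits, g)
    B = [int(x > k) for x, k in zip(G, K)]
    C = [int(x == k) for x, k in zip(G, K)]
    result, higher_equal = 0, 1
    for i in reversed(range(g)):       # most significant group first
        result |= B[i] & higher_equal  # greater here, all higher groups equal
        higher_equal &= C[i]
    return result


def bc_outputs(s, M, lam, group_bits, g):
    """One-hot range indicator: entry j-1 is set when (j-1)M <= s < jM."""
    mc = [1] + [mc_cell(s, j * M - 1, group_bits, g) for j in range(1, lam)] + [0]
    return [mc[j - 1] ^ mc[j] for j in range(1, lam + 1)]
```

For example, with $M = 105$, $\lambda = 4$, and three 3-bit groups, `bc_outputs(209, 105, 4, 3, 3)` marks the second range, so $S - M = 104$ is the value of $S$ modulo $M$.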
3.3.2 Two's Complement: Fig. 5 shows that the input to the $i$th ROM module is $G_i$, and the outputs are the $D^i$'s and $E^i$'s. The function of this ROM module is as follows. For $M$ odd,
$$D_j^i = \begin{cases} 0, & \text{if } G_i \le \left( \frac{(2j-1)M - 1}{2} \right)_i \\ 1, & \text{if } G_i > \left( \frac{(2j-1)M - 1}{2} \right)_i \end{cases} \qquad E_j^i = \begin{cases} 0, & \text{if } G_i \neq \left( \frac{(2j-1)M - 1}{2} \right)_i \\ 1, & \text{if } G_i = \left( \frac{(2j-1)M - 1}{2} \right)_i \end{cases}$$
and for $M$ even,
$$D_j^i = \begin{cases} 0, & \text{if } G_i \le \left( \frac{(2j-1)M}{2} - 1 \right)_i \\ 1, & \text{if } G_i > \left( \frac{(2j-1)M}{2} - 1 \right)_i \end{cases} \qquad E_j^i = \begin{cases} 0, & \text{if } G_i \neq \left( \frac{(2j-1)M}{2} - 1 \right)_i \\ 1, & \text{if } G_i = \left( \frac{(2j-1)M}{2} - 1 \right)_i \end{cases}$$
for $i = 1, 2, \ldots, g$ and $j = 1, 2, \ldots, \lambda - 1$, where the subscript $i$ on a constant denotes its bit group in the positions of $G_i$.

Fig. 5. Implementation of the second part of range determinator stage.

The MC level of this part consists of $\lambda$ MC$^2$ cells, and each MC$^2$ cell has the following function: for $M$ odd,
$$MC_j^2 = \begin{cases} 0, & \text{if } S \le \frac{(2j-1)M - 1}{2} \\ 1, & \text{if } S > \frac{(2j-1)M - 1}{2} \end{cases} \tag{22}$$
and for $M$ even,
$$MC_j^2 = \begin{cases} 0, & \text{if } S \le \frac{(2j-1)M}{2} - 1 \\ 1, & \text{if } S > \frac{(2j-1)M}{2} - 1. \end{cases}$$
The Boolean equations are:
$$MC_j^2 = D_j^g + D_j^{g-1} E_j^g + D_j^{g-2} E_j^g E_j^{g-1} + \cdots$$
When the control line $C$ selects unsigned conversion, all of the BC$^2$ level output lines are zero and residue to unsigned magnitude number system conversion will be performed; otherwise, only one of the BC level output lines will be equal to one, and thus residue to 2's complement number system conversion will be performed. The Boolean equation for a BC$^2$ cell is defined analogously to that of a BC$^1$ cell, with fixed boundary values of 1 and 0 for the MC$^2$ outputs at the two ends of the range. Therefore, the range of $S$ is determined to be $((2j-3)M + 1)/2 \le S < ((2j-1)M + 1)/2$ for $M$ odd, and $((2j-3)M)/2 \le S < ((2j-1)M)/2$ for $M$ even, if $BC_j^2 = 1$.

3.4 Final Corrector

This stage consists of $\lambda$ tristate multiplexers and a carry lookahead adder. The BC$^1$ input lines are used to enable one of the tristate multiplexers, while the BC$^2$ input lines are used as selectors of the multiplexers. If $BC_i^1$ is set, then $(i-1)M \le S < iM$. The lower bound $(i-1)M$ is subtracted from $S$ if conversion to an unsigned magnitude number system is desired, or if $S$ is less than $((2i-1)M + 1)/2$ for $M$ odd or $((2i-1)M)/2$ for $M$ even; otherwise, the upper bound, $iM$, is subtracted from $S$. The implementation of this stage is shown in Fig. 6. The CLA is used to add the 2's complement of the value to be subtracted from $S$ and output the desired result.

Fig. 6. Implementation of the final corrector ($2 \times 1$ multiplexers select between $-(i-1)M$ and $-iM$ for $i = 1, \ldots, \lambda$; the selected value and $S$ feed the carry lookahead adder).

IV. PERFORMANCE EVALUATION

1) The partial sum generator is implemented using small ROM's. If the number of residues is $N$ and each residue is represented in $P$ bits, then $N$ ROM's are required. Each ROM stores values bounded by $M$, so the size of each ROM is $2^P \cdot \lceil \log M \rceil$. The total area required for this stage is $N \cdot 2^P \cdot \lceil \log M \rceil$. Since ROM's have a constant time delay ($P$ is a small number) that does not depend on $N$, the delay of this stage is $O(1)$.
2) The partial sum adder is implemented in two different ways.
   a) Using CSA's: The complexity of the scheme is determined by Theorem 1. Since each CSA has a constant time delay, the total time required to add $N$ numbers in modulo $M$ is $O(\log N)$.
   b) Using MA's: The number of levels required to perform the addition of $N$ numbers using a binary tree of MA's is $\lceil \log N \rceil$, as shown in Theorem 3. Since the required time at each level is constant (the MA has a constant time), the total time required for this step using MA's is $O(\log N)$.
3) The range determinator consists of three different levels (Fig. 4). The first level consists of $g$ ROM's. The second level is the MC cells, which are combinational circuits that can be represented with a two-level switching function. Finally, the last level is a two-stage combinational circuit. The three levels have a constant time delay that does not depend on $N$. The previous analysis shows that the range determinator has a time delay of $\Theta(1)$.
4) The final corrector consists of two stages. In the first stage we have $\lambda$ tristate multiplexers, which have a constant delay equivalent to two serial NAND gates. The second stage is a CLA that has a constant delay (for fewer than 64 bits, the delay is equivalent to the delay of 12 serial NAND gates, as shown in [16]). For larger numbers of bits we can still obtain a constant-delay CLA. Then the final corrector has a delay of $O(1)$.

From cases 1)-4) we see that all stages except the partial sum adder have a constant time delay, which does not depend on the number of residues $N$. Only the second stage requires $O(\log N)$ steps.
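The four stages evaluated above can be traced end to end in a behavioral sketch. This is a sequential software model under stated assumptions, not the hardware: integer division stands in for the ROM, MC, and BC levels of the range determinator, the partial sums follow the standard CRT terms, and the keyword `twos_complement` plays the role of the control line $C$. The name `decode` is illustrative.

```python
from math import prod


def decode(residues, moduli, twos_complement=False):
    """Behavioral model of the four decoder stages."""
    M = prod(moduli)
    # Stage 1: partial sum generator (one table look-up per residue in hardware).
    t = [(M // m) * ((pow(M // m, -1, m) * r) % m)
         for r, m in zip(residues, moduli)]
    # Stage 2: partial sum adder; a plain sum, so 0 <= s < N * M.
    s = sum(t)
    # Stage 3: range determinator; find i with (i - 1) * M <= s < i * M.
    i = s // M + 1
    # Stage 4: final corrector; subtract the lower or the upper bound.
    if twos_complement:
        if M % 2:
            threshold = ((2 * i - 1) * M + 1) // 2
        else:
            threshold = ((2 * i - 1) * M) // 2
        return s - (i - 1) * M if s < threshold else s - i * M
    return s - (i - 1) * M
```

For example, `decode((2, 4, 6), (3, 5, 7), twos_complement=True)` gives $-1$, while the same residues decode to 104 in unsigned mode. Only stage 2 depends on the number of moduli; the other stages are single look-ups or fixed-depth logic.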
V. CONCLUSION

The residue decoder introduced in this paper has a total delay that grows as $\lceil \log N \rceil$, where $N$ is the number of moduli. In addition, it has several advantages, as listed below.

1) The design is quite modular and consists of simple cells such as small ROM's and MC cells. This makes the implementation of the whole residue decoder on a single chip possible.
2) It does not have any limitation on the moduli used.
3) It is flexible since it can convert residues to either the unsigned magnitude or the 2's complement number system, and it is controlled by only one control line, $C$. This means that it can be applied to a wider range of applications.
4) It is fast compared with most schemes proposed before, since it has a time complexity of $O(\log N)$.
ACKNOWLEDGMENT

K. M. Elleithy would like to thank the King Fahd University of Petroleum and Minerals for support.
REFERENCES
[1] M. A. Bayoumi, “Digital filter VLSI systolic arrays over finite fields for DSP applications,” in Proc. 6th IEEE Annual Phoenix Conf. Computers and Communications, pp. 194-199, Feb. 1987.
[2] M. A. Bayoumi, G. A. Jullien, and W. C. Miller, “A look-up table VLSI design methodology for RNS structures used in DSP applications,” IEEE Trans. Circuits Syst., vol. CAS-34, pp. 604-616, June 1987.
[3] F. J. Taylor, “Residue arithmetic: A tutorial with examples,” IEEE Comp. Mag., pp. 50-62, May 1984.
[4] M. A. Bayoumi, “A high speed VLSI complex digital signal processor based on quadratic residue number system,” VLSI Signal Processing II, pp. 200-211, IEEE Press, 1986.
[5] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer Technology. New York: McGraw-Hill, 1967.
[6] W. K. Jenkins, “Techniques for residue to analog conversion for residue encoded digital filters,” IEEE Trans. Circuits Syst., vol. CAS-25, pp. 553-562, July 1978.
[7] S. Andraos and H. Ahmed, “A new efficient memoryless residue to binary converter,” IEEE Trans. Circuits Syst., vol. CAS-35, pp. 1441-1444, Nov. 1988.
[8] K. M. Ibrahim and S. N. Saloum, “An efficient residue to binary converter design,” IEEE Trans. Circuits Syst., vol. CAS-35, pp. 1156-1158, Sept. 1988.
[9] S. Bandyopadhyay, G. A. Jullien, and A. Sengupta, “A systolic array for fault-tolerant digital signal processing using a residue number system approach,” in Proc. Int. Conf. Systolic Arrays, pp. 577-586, 1988.
[10] A. P. Shenoy and R. Kumaresan, “Residue to binary conversion for RNS arithmetic using only modular look-up tables,” IEEE Trans. Circuits Syst., vol. CAS-35, pp. 1158-1162, Sept. 1988.
[11] T. V. Vu, “Efficient implementations of the CRT for sign detection and residue decoding,” IEEE Trans. Comp., vol. C-34, pp. 646-651, July 1985.
[12] C. N. Zhang, B. Shirazi, and D. Y. Y. Yun, “Parallel designs for Chinese remainder conversion,” in Proc. IEEE 16th Annual Conf. Parallel Processing, Aug. 1987.
[13] R. M. Capocelli and R. Giancarlo, “Efficient VLSI networks for converting an integer from binary system and vice versa,” IEEE Trans. Circuits Syst., vol. 35, pp. 1425-1430, Nov. 1988.
[14] G. Alia and E. Martinelli, “A VLSI algorithm for direct and reverse conversion from weighted binary number system to residue number system,” IEEE Trans. Circuits Syst., vol. CAS-31, pp. 1033-1039, 1984.
[15] C. K. Koc and P. R. Cappello, “Systolic arrays for integer Chinese Remaindering,” in Proc. 9th Symp. Computer Arithmetic, pp. 216-223, Sept. 1989.
[16] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 1978.
[17] K. P. Lee, M. A. Bayoumi, and K. M. Elleithy, “A fast and flexible residue decoder based on the Chinese remainder theorem,” presented at the 1989 Int. Symp. Circuits and Systems.
[18] K. M. Elleithy, “On bit-parallel processing for modulo arithmetic,” VLSI Tech. Rep. TR86-8-1, The Center for Advanced Computer Studies, Univ. Southwestern Louisiana, 1986.
[19] A. Avizienis, “A study of redundant number representations for parallel digital computers,” Ph.D. dissertation, Univ. Illinois, Urbana, IL, May 1960.
[20] K. M. Elleithy and M. A. Bayoumi, “A θ(log n) algorithm for modulo multiplication,” in Proc. 32nd Midwest Symp. Circuits and Systems, Aug. 1989.
[21] K. M. Elleithy, M. A. Bayoumi, and K. P. Lee, “θ(log n) architectures for RNS arithmetic decoding,” in Proc. 9th Symp. Computer Arithmetic, pp. 202-209, Sept. 1989.

Khaled M. Elleithy (S’89-M’90-M’90) received the B.Sc. degree in computer science and automatic control from Alexandria University in 1983, the M.Sc. degree in computer networks from the same university in 1986, and the M.Sc. and Ph.D. degrees in computer science from the Center for Advanced Computer Studies at the University of Southwestern Louisiana in 1988 and 1990, respectively.
From 1983 to 1986 he was with the Computer Science Department, Alexandria University, Egypt, as a lecturer. Since 1990 he has been working as an Assistant Professor at the Department of Computer Engineering, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia. His current research interests are in the areas of design automation, computer architecture, and VLSI. He has special interests in the area of formal high-level synthesis and has published more than a dozen research papers in these areas.
Dr. Elleithy is a member of Phi Kappa Phi.

Magdy A. Bayoumi (S’80-M’M S’85-M’85-SM’87) received the B.Sc. and M.Sc. degrees in electrical engineering from Cairo University, Egypt, the M.Sc. degree in computer engineering from Washington University, St. Louis, and the Ph.D. degree in electrical engineering from the University of Windsor, Canada.
He is an associate professor in the Center for Advanced Computer Studies at the University of Southwestern Louisiana, where he has been a faculty member since 1985. His research interests include VLSI design methods and architectures, digital signal processing architectures, parallel algorithm design, and computer arithmetic. He is the chairman of the VLSI Systems and Applications CAS Technical Committee. He serves on the ASSP Technical Committee on VLSI Signal Processing. He was the secretary of the CS Technical Committee on VLSI Design. He is a member of the Technical Committee of the IEEE VLSI Signal Processing Workshop and the Application Specific Array Processors, 1992. He is the editor of the new series, Advances in VLSI Signal Processing (Ablex Publishing). He is the editor of the new book, Parallel Algorithms and Architectures for DSP Applications (Kluwer Academic Publishers, 1991).
Dr. Bayoumi won the USL 1988 Researcher of the Year award.