Choosing the System Moduli of RNS Arithmetic Processors
Khaled M. Elleithy
Computer Engineering Department
King Fahd University of Petroleum and Minerals
Dhahran 31261, Saudi Arabia
Abstract
Designing an optimal Residue Number System (RNS) processor in terms of area and speed depends on the choice of the system moduli. In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an area/time objective function.
1: Introduction
The Residue Number System (RNS) has received increased attention due to its ability to support high-speed concurrent arithmetic [1-5]. Although Digital Signal Processing (DSP) applications utilize the efficiencies of RNS arithmetic in addition and multiplication, they do not require the difficult RNS operations such as division and magnitude comparison. RNS has therefore been employed efficiently in the implementation of DSP processors. Since special-purpose processors are associated with general-purpose computers, binary-to-residue and residue-to-binary conversions become inherently important, and the conversion process should not offset the speed gain in RNS operations. While the binary-to-residue conversion does not pose a serious threat to high-speed RNS operations, the residue-to-binary conversion can be a bottleneck. The Chinese Remainder Theorem (CRT) [6] is considered the main algorithm for the conversion process. Several implementations of the residue decoder have been reported [2, 5, 7-10].
Designing an optimal RNS processor in terms of area and speed depends on the choice of the system moduli. Most of the implementations reported in the literature are based on a special moduli set. The residue decoders in [7, 8] are based on three moduli of the form $(2^n - 1, 2^n, 2^n + 1)$, where $n$ is the number of bits. Because of the limitation imposed on the number and choice of the moduli, these decoders are limited in application. In [9], the residue decoder is based on the base extension technique and uses only modular lookup tables in its implementation. Since lookup tables are used, the moduli must not be large for the implementation to be feasible. In addition, it does not support conversion from residue to the 2's complement binary number system. The implementation in [10] requires that one of the moduli be a power of two; therefore, it may be limited in application. Because of the constraints imposed on the chosen moduli set, such approaches have limited applications and the final design is not optimal in area and time.
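As a quick check of the special moduli set discussed above, the following Python sketch builds $(2^n - 1, 2^n, 2^n + 1)$ for a given $n$ and verifies that the three moduli are pairwise relatively prime. The function names are illustrative assumptions and are not taken from [7, 8].

```python
from itertools import combinations
from math import gcd, prod


def special_moduli(n):
    """Return the three-moduli set (2^n - 1, 2^n, 2^n + 1)."""
    return (2 ** n - 1, 2 ** n, 2 ** n + 1)


def pairwise_coprime(moduli):
    """True if every pair of moduli has a greatest common divisor of 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def dynamic_range(moduli):
    """Product M of the moduli, i.e. the number of representable values."""
    return prod(moduli)


if __name__ == "__main__":
    ms = special_moduli(4)
    print(ms, pairwise_coprime(ms), dynamic_range(ms))
```

For $n = 4$ the set is $(15, 16, 17)$ with $M = 4080$, slightly less than $2^{12}$, which illustrates how a fixed three-moduli form ties the dynamic range to a single parameter.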
In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an area/time objective function.
2: Residue Number System
In RNS, an integer $X$ can be represented by an $N$-tuple of residue digits,
$$X = (r_1, r_2, \ldots, r_N),$$
where $r_i = |X|_{m_i}$ is the residue of $X$ with respect to the modulus $m_i$, taken from a set of $N$ moduli $\{m_1, m_2, \ldots, m_N\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime, that is,
$$\mathrm{GCD}(m_i, m_j) = 1 \quad \forall\, i \neq j.$$
It can then be shown that there is a unique representation for each number in the range
$$0 \le X < \prod_{i=1}^{N} m_i = M.$$
The arithmetic operation on two integers $A$ and $B$ is equivalent to the arithmetic operation on the residue representation, that is,
$$|A \circ B|_{m_i} = \big|\, |A|_{m_i} \circ |B|_{m_i} \big|_{m_i}, \quad i = 1, \ldots, N,$$
1058-6393/97 $10.00 © 1997 IEEE
Authorized licensed use limited to: University of Bridgeport. Downloaded on February 24,2010 at 13:13:27 EST from IEEE Xplore. Restrictions apply.
where $\circ$ can be addition, subtraction, or multiplication. Therefore, it is desirable to convert binary arithmetic on large integers to residue arithmetic on smaller residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.
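The digit-wise nature of RNS arithmetic can be illustrated with a minimal Python sketch. The moduli $(3, 5, 7)$ and the helper names `to_rns` and `rns_op` are illustrative assumptions, not part of the paper; each residue digit is processed independently, which is what a parallel hardware implementation exploits.

```python
import operator


def to_rns(x, moduli):
    """Residue digits of x, one per modulus."""
    return tuple(x % m for m in moduli)


def rns_op(op, a_digits, b_digits, moduli):
    """Apply op digit by digit; no information passes between digits."""
    return tuple(op(ra, rb) % m for ra, rb, m in zip(a_digits, b_digits, moduli))


if __name__ == "__main__":
    moduli = (3, 5, 7)          # pairwise relatively prime, M = 105
    a, b = 17, 4
    ra, rb = to_rns(a, moduli), to_rns(b, moduli)
    for op in (operator.add, operator.sub, operator.mul):
        # The digit-wise result matches the residues of the integer result.
        print(op.__name__, rns_op(op, ra, rb, moduli), to_rns(op(a, b), moduli))
```

The results agree only as long as the true integer result stays inside the range $0 \le X < M$.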
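The conversion from RNS to the weighted binary number system is done using the Chinese Remainder Theorem (CRT), which states that:
$$|X|_M = \left| \sum_{j=1}^{N} \hat{m}_j \left| \frac{r_j}{\hat{m}_j} \right|_{m_j} \right|_M$$
where
$$M = \prod_{j=1}^{N} m_j, \qquad \hat{m}_j = \frac{M}{m_j}, \qquad \mathrm{GCD}(m_j, m_k) = 1 \quad \forall\, j \neq k,$$
and $|r_j / \hat{m}_j|_{m_j}$ denotes $r_j$ multiplied by the multiplicative inverse of $\hat{m}_j$ modulo $m_j$.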
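A small worked example may make the CRT statement concrete. The moduli $(3, 5, 7)$ are an illustrative choice, not one from the paper, so that $M = 105$. Writing $\hat{m}_j = M/m_j$ and $t_j$ for the $j$-th summand, the integer $X = 23$ has residues $(2, 3, 2)$ and is recovered as follows:

```latex
\begin{align*}
\hat{m}_1 &= 35, & |35^{-1}|_3 &= 2, & t_1 &= 35 \cdot |2 \cdot 2|_3 = 35, \\
\hat{m}_2 &= 21, & |21^{-1}|_5 &= 1, & t_2 &= 21 \cdot |3 \cdot 1|_5 = 63, \\
\hat{m}_3 &= 15, & |15^{-1}|_7 &= 1, & t_3 &= 15 \cdot |2 \cdot 1|_7 = 30, \\
X &= |35 + 63 + 30|_{105} = |128|_{105} = 23.
\end{align*}
```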
3: ROM Requirements
A typical architecture for an RNS-based processor is shown in Figure 1. The architecture consists of four stages. In the first stage a variable $X$ is converted to its RNS representation. In the second stage an arithmetic operation is performed. The third and fourth stages perform the conversion from the RNS representation to the weighted number system representation. The third stage is responsible for the computation of the $t_i$'s according to the following definition:
$$t_i = \hat{m}_i \left| \frac{r_i}{\hat{m}_i} \right|_{m_i}, \qquad \hat{m}_i = \frac{M}{m_i}.$$
The fourth stage is a modulo $M$ adder that adds the $t_i$ inputs and gives the final weighted value.
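The four stages can be modeled in software as a rough sketch. This is a behavioral model only, not the hardware of Figure 1; it assumes the $t$'s are the CRT summands $\hat{m}_i \, |r_i \cdot \hat{m}_i^{-1}|_{m_i}$, and the function name `rns_pipeline` is hypothetical.

```python
import operator
from math import prod


def rns_pipeline(a, b, moduli, op):
    """Behavioral model of the four-stage RNS datapath (requires Python 3.8+)."""
    M = prod(moduli)

    # Stage 1: binary-to-residue conversion of both operands.
    ra = [a % m for m in moduli]
    rb = [b % m for m in moduli]

    # Stage 2: independent mod-m_i processors.
    r = [op(x, y) % m for x, y, m in zip(ra, rb, moduli)]

    # Stage 3: compute the summands t_i (assumed CRT form, see lead-in).
    t = []
    for r_i, m_i in zip(r, moduli):
        m_hat = M // m_i
        inv = pow(m_hat, -1, m_i)      # multiplicative inverse of m_hat mod m_i
        t.append(m_hat * ((r_i * inv) % m_i))

    # Stage 4: modulo-M summation gives the weighted result.
    return sum(t) % M


if __name__ == "__main__":
    print(rns_pipeline(17, 4, (3, 5, 7), operator.mul))
```

In hardware, stage 3 is a table lookup per modulus, since each $t_i$ depends only on the single digit $r_i$; this is why the ROM sizes matter.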
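The ROM required for the RNS-based system can be divided into the following:
1. ROM required to implement processor $i$ mod $m_i$, $1 \le i \le n$. The processor implementation is shown in Figure 2. The ROM required for each processor is given by:
$$2^{\lceil \log m_i \rceil} \cdot 2^{\lceil \log m_i \rceil} \cdot \lceil \log m_i \rceil$$
The total memory requirement for the $n$ processors is given by:
$$\sum_{i=1}^{n} 2^{2 \lceil \log m_i \rceil} \cdot \lceil \log m_i \rceil \qquad (1)$$
2. ROM required to implement the $t_i$'s. The ROM size is given by:
$$\sum_{i=1}^{n} 2^{\lceil \log m_i \rceil} \cdot l \qquad (2)$$
where $l$ is the number of bits of the weighted output.
3. If the summation unit is implemented using ROM, as shown in Figure 3, the memory required is:
$$\left\{ 2^{\lceil \log m_1 \rceil} \cdot 2^{\lceil \log m_2 \rceil} \cdots 2^{\lceil \log m_n \rceil} \right\} \cdot l = 2^{\sum_{i=1}^{n} \lceil \log m_i \rceil} \cdot l \qquad (3)$$
The ROM implementation of the summation unit is impractical, since it needs a huge memory even for a small system.
4. The processors as well as the calculation of the $t_i$'s can be implemented using ROMs as given in equations (1) and (2). The total size of ROM is:
$$\sum_{i=1}^{n} 2^{2 \lceil \log m_i \rceil} \cdot \lceil \log m_i \rceil + \sum_{i=1}^{n} 2^{\lceil \log m_i \rceil} \cdot l$$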
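The ROM estimates above can be evaluated numerically with a short Python sketch. It assumes logarithms are base 2, that each processor ROM is addressed by two $\lceil \log m_i \rceil$-bit operands, and that each $t$-ROM is addressed by one residue digit and stores $l$-bit words; the function names are illustrative.

```python
def bits(m):
    """ceil(log2(m)) for an integer modulus m >= 1."""
    return (m - 1).bit_length()


def processor_rom(moduli):
    """Total ROM bits for the mod-m_i processors, as in equation (1)."""
    return sum(2 ** (2 * bits(m)) * bits(m) for m in moduli)


def t_rom(moduli, l):
    """ROM bits for the t lookup tables, assuming l-bit words per entry."""
    return sum(2 ** bits(m) * l for m in moduli)


def summation_rom(moduli, l):
    """ROM bits if the whole summation unit were a single table."""
    return 2 ** sum(bits(m) for m in moduli) * l


def total_rom(moduli, l):
    """Processors plus t tables (summation unit not implemented in ROM)."""
    return processor_rom(moduli) + t_rom(moduli, l)


if __name__ == "__main__":
    moduli, l = (3, 5, 7), 6    # illustrative: M = 105 >= 2**6 - 1
    print(processor_rom(moduli), t_rom(moduli, l),
          summation_rom(moduli, l), total_rom(moduli, l))
```

Even for this toy system the single-table summation unit is larger than the processors and $t$ tables combined, and its size grows exponentially with the total number of residue bits.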
4: Implementation of the Chinese Remainder Theorem
The Chinese Remainder Theorem (CRT) is considered the main algorithm for the conversion process. Several implementations of the CRT have been reported [2, 5, 7-10]. In [10], Vu introduced an algorithm for computing the CRT based on manipulating the form of the summands stored in ROMs. The main idea of the algorithm is that representing the summands in a different form does not require any special real-time processing, since they are stored in ROMs. The summands are stored in a form that can be used efficiently in the evaluation of the final summation modulo $M$. The following derivation is used:
Computing the first part of the sum is easy because the modulus $m_j$ is small, so it can be implemented using ROMs. However, the second part is obtained most easily when $m_j$ equals $2^k$ or $2^k - 1$. Since it is not unusual to have a power of 2 included in most practical systems of moduli, $m_j = 2^k$ is assumed. The following equation is valid:
4.1 Implementation details
One of the moduli must be equal to $2^k$. A hardware implementation is given in Figure 4. There are four levels in implementing the modulo $M$ adder, as follows:
1. At level 1 a mod $2^k$ adder is used. The speed depends on the speed of the multi-operand adder. An optimal adder of $\Theta(\log n)$ time complexity [3] can be used in this stage.
2. Level 2 is a small ROM.
3. Levels 3 and 4 are 2-operand adders.
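The sizes of the buses and ROMs used are as follows:
1. $q_i$ requires $\lceil \log m_i \rceil$ bits ($0 \le q_i \le m_i$).
2. … requires $\lceil \log (M / m_i) \rceil$ bits ($0 \le \ldots$).
3. $q$ requires $\lceil \log m_i \rceil$ bits ($q = \ldots$).
4. $r$ requires $\lfloor \log M \rfloor$ bits ($r = \ldots$).
5. $Y$ requires $\lfloor \log M \rfloor$ bits ($Y = \ldots$).
6. $Z$ requires $\lfloor \log M \rfloor$ bits.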
4.2 Choosing the System moduli
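4.2.1 Choosing the system moduli to minimize the ROM size: The following two cases are considered:
(1) ROM is used to obtain the $t_i$'s; then the problem is modeled as follows:
$$\text{minimize} \quad \sum_{i=1}^{n} 2^{\lceil \log m_i \rceil} \cdot l$$
$$\text{such that} \quad m_1 \cdot m_2 \cdots m_n \ge 2^l - 1, \qquad m_1, m_2, \ldots, m_n \text{ are relatively prime}$$
(2) ROM is used to obtain the $t_i$'s and implement the processors; then the problem is
$$\text{minimize} \quad \sum_{i=1}^{n} 2^{2 \lceil \log m_i \rceil} \cdot \lceil \log m_i \rceil + \sum_{i=1}^{n} 2^{\lceil \log m_i \rceil} \cdot l$$
$$\text{such that} \quad m_1 \cdot m_2 \cdots m_n \ge 2^l - 1, \qquad m_1, m_2, \ldots, m_n \text{ are relatively prime}$$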
As mentioned earlier, the problem can be solved using integer programming with an extra constraint: the moduli must be relatively prime. Because of this constraint it is not a standard integer programming problem; it needs some modifications and can be called Prime Integer Programming (PIP).
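For small word sizes the optimization can be explored by exhaustive search, as in the following Python sketch. This is not the integer programming procedure of the paper; it is a brute-force stand-in that assumes the objective is the total ROM size (processors plus $t$ tables, base-2 logarithms) and that bounds the search with the hypothetical parameters `max_modulus` and `max_count`.

```python
from itertools import combinations
from math import gcd, prod


def bits(m):
    """ceil(log2(m)) for an integer modulus m >= 1."""
    return (m - 1).bit_length()


def total_rom(moduli, l):
    """Assumed objective: processor ROMs plus t-table ROMs, in bits."""
    return sum(2 ** (2 * bits(m)) * bits(m) + 2 ** bits(m) * l for m in moduli)


def best_moduli(l, max_modulus=16, max_count=3):
    """Cheapest pairwise-coprime moduli set with product >= 2**l - 1, or None."""
    best = None
    for n in range(2, max_count + 1):
        for cand in combinations(range(2, max_modulus + 1), n):
            if prod(cand) < 2 ** l - 1:
                continue                       # dynamic range too small
            if any(gcd(a, b) != 1 for a, b in combinations(cand, 2)):
                continue                       # not relatively prime
            cost = total_rom(cand, l)
            if best is None or cost < best[0]:
                best = (cost, cand)
    return best


if __name__ == "__main__":
    print(best_moduli(7))
```

The search space grows combinatorially with `max_modulus` and `max_count`, which is one reason a proper integer programming formulation is attractive for larger output sizes.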
4.2.2 Special Case of $M = 2^l$: In this section a special case is considered. The required condition for $M$ is that $M \ge 2^l - 1$. If $M = 2^l$, the difficult problem of designing a modulo $M$ adder is reduced to the design of a multi-operand adder. The integer programming problem is represented as:
$$\text{minimize} \quad \sum_{i=1}^{n} 2^{2 \lceil \log m_i \rceil} \cdot \lceil \log m_i \rceil + \sum_{i=1}^{n} 2^{\lceil \log m_i \rceil} \cdot l$$
$$\text{such that} \quad m_1 \cdot m_2 \cdots m_n = 2^l, \qquad m_1, m_2, \ldots, m_n \text{ are relatively prime}$$
Solving this special case is as difficult as the general case. There may not be a feasible solution. If there is no feasible solution, the general problem is solved instead of the special case problem. If there is a feasible solution for the special case, its memory requirements are likely to be larger than those of the general case. In this case we have to choose between larger memory with a high-speed summation network on one side, and less memory with a more complicated summation network on the other.
4.2.3 Choosing one of the moduli equal to $2^k$: The implementation discussed in Section 4.1 requires one of the system moduli to be equal to $2^k$. In this case the problem is defined as follows:
$$\text{minimize} \quad \sum_{i=1}^{n} 2^{2 \lceil \log m_i \rceil} \cdot \lceil \log m_i \rceil + \sum_{i=1}^{n} 2^{\lceil \log m_i \rceil} \cdot l$$
$$\text{such that} \quad m_1 \cdot m_2 \cdots m_n \ge 2^l - 1, \qquad m_1, m_2, \ldots, m_n \text{ are relatively prime}, \qquad m_j = 2^k \text{ for some } j \text{ and } k$$
As in the previous case, there may not be a feasible solution. If there is no feasible solution, the general problem is solved instead of the special case problem. If there is a feasible solution for this special case, its memory requirements are likely to be larger than those of the general case. In this case we have to choose between larger memory with a high-speed summation network on one side, and less memory with a more complicated summation network on the other.
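The trade-off between the constrained and the general problem can be examined with a small brute-force Python sketch. As before, this is an exhaustive search over a bounded range and not the paper's integer programming method; the objective (total ROM with base-2 logarithms) and the names `search`, `max_modulus`, and `max_count` are assumptions made for illustration.

```python
from itertools import combinations
from math import gcd, prod


def bits(m):
    """ceil(log2(m)) for an integer modulus m >= 1."""
    return (m - 1).bit_length()


def rom_bits(moduli, l):
    """Assumed objective: processor ROMs plus t-table ROMs, in bits."""
    return sum(2 ** (2 * bits(m)) * bits(m) + 2 ** bits(m) * l for m in moduli)


def is_power_of_two(m):
    return m >= 2 and m & (m - 1) == 0


def search(l, need_power_of_two, max_modulus=16, max_count=3):
    """Cheapest feasible moduli set, optionally requiring one modulus 2**k."""
    best = None
    for n in range(2, max_count + 1):
        for cand in combinations(range(2, max_modulus + 1), n):
            if prod(cand) < 2 ** l - 1:
                continue
            if any(gcd(a, b) != 1 for a, b in combinations(cand, 2)):
                continue
            if need_power_of_two and not any(is_power_of_two(m) for m in cand):
                continue
            cost = rom_bits(cand, l)
            if best is None or cost < best[0]:
                best = (cost, cand)
    return best                     # None means no feasible solution


if __name__ == "__main__":
    general = search(7, need_power_of_two=False)
    special = search(7, need_power_of_two=True)
    print("general:", general)
    print("with 2^k modulus:", special)
```

Because the constrained feasible set is a subset of the general one, the constrained optimum can never need less ROM than the general optimum; when `search` returns `None` for the constrained case, the general problem is solved instead.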
4.3 New Algorithm for choosing the system moduli
From our previous analysis in Section 4.2 we see that the problem of choosing the system moduli is presented as an integer programming problem with the following objective function:
4.3.1 Algorithm
Step #1. Solve the objective function such that:
$m_1 \cdot m_2 \cdots m_n \ge 2^l - 1$
$m_1, m_2, \ldots, m_n$ are relatively prime
$m_i = 2^k$ for some $i$ and $k$
Step #2. Solve the objective function such that:
$m_1 \cdot m_2 \cdots m_n \ge 2^l - 1$
$m_1, m_2, \ldots, m_n$ are relatively prime
$m_j = 2^k$ for some $j$ and $k$
Step #3. Solve the objective function such that:
$m_1 \cdot m_2 \cdots m_n \ge 2^l - 1$
$m_1, m_2, \ldots, m_n$ are relatively prime
Step #4. Choosing the solution
(a) Speed
Sol 1 (if it exists) is the fastest
Sol 2 (if it exists) is slower than Sol 1
Sol 3 is the slowest
(b) Memory requirements
Sol 3 needs the least memory
Sol 2 needs memory equal to or greater than Sol 3
Sol 1 needs memory similar to Sol 3
(c) Implementation
Sol 1: uses an n-operand adder.
Sol 2: is the implementation of Figure 4.
Sol 3: is Elleithy's implementation [2].
4.3.2 Results: The problem of choosing the system moduli has been solved using the integer programming approach with the extra constraint of having a relatively prime moduli set. Table 1 shows the results for output sizes from 7 bits to 16 bits. The approach is general and can be used to obtain results for larger bit sizes. For each bit size a number of solutions are obtained. In Table 1 only the solutions with the minimum ROM size are shown.
5: Conclusions
Conversion from the RNS representation into the binary representation is a time-consuming process. In this paper the CRT is used as the main algorithm for the conversion process. Designing an optimal RNS processor in terms of area and speed depends on the choice of the system moduli. In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an area/time objective function.
Acknowledgments
The author would like to acknowledge King Fahd University of Petroleum and Minerals for support provided for this work.
References
[1] K. M. Elleithy and M. A. Bayoumi, "A Systolic Architecture for Modulo Multiplication," IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 42, no. 11, pp. 725-729, Nov. 1995.
[2] K. M. Elleithy and M. A. Bayoumi, "Fast and Flexible Architectures for RNS Arithmetic Decoding," IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 39, no. 4, pp. 226-235, April 1992.
[3] K. M. Elleithy and M. A. Bayoumi, "A Θ(1) Algorithm for Modulo Addition," IEEE Transactions on Circuits and Systems, vol. 37, no. 5, pp. 628-631, May 1990.
[4] K. M. Elleithy and M. A. Bayoumi, "A Θ(1) Algorithm for Modulo Multiplication," Proc. of the 32nd Midwest Symposium on Circuits and Systems, Aug. 1989.
[5] K. M. Elleithy, M. A. Bayoumi, and K. P. Lee, "Θ(log n) Architectures for RNS Arithmetic Decoding," Proc. of the 9th Symposium on Computer Arithmetic, pp. 202-209, Sep. 1989.
[6] S. Szabo and R. I. Tanaka, "Residue Arithmetic and its Applications to Computer Technology," New York: McGraw-Hill, 1967.
[7] Andraos and H. Ahmed, "A New Efficient Memoryless Residue to Binary Converter," IEEE Trans. Circuits Syst., vol. 35, pp. 1441-1444, Nov. 1988.
[8] M. Ibrahim and S. N. Saloum, "An Efficient Residue to Binary Converter Design," IEEE Trans. Circuits Syst., vol. 35, pp. 1156-1158, September 1988.
[9] P. Shenoy and R. Kumaresan, "Residue to Binary Conversion for RNS Arithmetic Using Only Modular Lookup Tables," IEEE Trans. Circuits Syst., vol. 35, pp. 1158-1162, September 1988.
[10] V. Vu, "Efficient Implementations of the CRT for Sign Detection and Residue Decoding," IEEE Trans. Comp., vol. C-34, pp. 646-651, July 1985.
[11] N. Zhang, B. Shirazi, and D. Y. Y. Yun, "Parallel Designs for Chinese Remainder Conversion," Proc. IEEE 16th Annual Conf. on Parallel Processing, Aug. 1987.
Figure 1. RNS-Based Architecture.
Figure 2. Processor $i$ mod $m_i$ Implementation.
Figure 3. Implementation of the CRT Decoder.
Figure 4. Implementation of the CRT decoder with one of the moduli equal to $2^k$.
Table 1: Optimal ROM size for RNS processors where the bus size varies between 7 and 16 bits.