Choosing the system moduli of RNS arithmetic processors

Abstract

Designing an optimal residue number system (RNS) processor in terms of area and speed depends on the choice of the system moduli. In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an area/time objective function.
Khaled M. Elleithy
Computer Engineering Department
King Fahd University of Petroleum and Minerals
Dhahran 31261, Saudi Arabia
1: Introduction
Residue Number System (RNS) has received increased attention due to its ability to support high-speed concurrent arithmetic [1-5]. Although Digital Signal Processing (DSP) applications utilize the efficiencies of RNS arithmetic in addition and multiplication, they do not require the difficult RNS operations such as division and magnitude comparison. RNS has been employed efficiently in the implementation of DSP processors. Since special purpose processors are associated with general purpose computers, binary-to-residue and residue-to-binary conversions become inherently important, and the conversion process should not offset the speed gain in RNS operations. While the binary-to-residue conversion does not pose a serious threat to the high-speed RNS operations, the residue-to-binary conversion can be a bottleneck. The Chinese Remainder Theorem (CRT) [6] is considered the main algorithm for the conversion process. Several implementations of the residue decoder have been reported [2,5,7-10].
Designing an optimal RNS processor in terms of area and speed depends on the choice of the system moduli. Most of the reported implementations in the literature are based on using a special moduli set. The residue decoders in [7,8] are based on using three moduli of the form $(2^n - 1, 2^n, 2^n + 1)$, where n is the number of bits. Due to the limitation imposed on the number of moduli and the choice of them, this approach is limited in application. In [9], the residue decoder is based on the base extension technique; it uses only modular look-up tables in its implementation. Since look-up tables are used, the chosen moduli must not be large for the implementation to be feasible. In addition, it does not support residue to 2's complement binary number system conversion. The implementation in [10] requires that one of the moduli be a power of two; therefore, it may be limited in application. Due to the constraints imposed on the chosen moduli set, such approaches have limited applications and the final design is not optimal in area and time.
In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an area/time objective function.
2: Residue Number System
In RNS, an integer, X, can be represented by an N-tuple of residue digits, $X = (r_1, r_2, \ldots, r_N)$, where $r_i = |X|_{m_i}$ with respect to a set of N moduli $\{m_1, m_2, \ldots, m_N\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime, that is, $\mathrm{GCD}(m_i, m_j) = 1 \ \forall i \neq j$; then it is shown that there is a unique representation for each number in the range of:
$$0 \le X < \prod_{i=1}^{N} m_i = M$$
The arithmetic operation on two integers A and B is equivalent to the arithmetic operation on the residue representation, that is,
1058-6393/97 $10.00 © 1997 IEEE
Authorized licensed use limited to: University of Bridgeport. Downloaded on February 24,2010 at 13:13:27 EST from IEEE Xplore. Restrictions apply.
where "$\circ$" can be addition, subtraction, or multiplication. Therefore, it is desired to convert binary arithmetic on large integers to residue arithmetic on smaller residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.
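As a minimal illustration of the digit-wise property described above, the following Python sketch encodes two integers into residues and applies an operation to each residue digit independently. The moduli set (3, 4, 5) and the helper names `to_rns` and `rns_op` are assumptions chosen for illustration; they do not appear in the paper.

```python
from itertools import combinations
from math import gcd, prod
import operator


def pairwise_coprime(moduli):
    """True when every pair of moduli has GCD 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def to_rns(x, moduli):
    """Residue digits of x with respect to each modulus."""
    return tuple(x % m for m in moduli)


def rns_op(a, b, moduli, op):
    """Apply op digit by digit; no carry passes between residue digits."""
    return tuple(op(x, y) % m for x, y, m in zip(a, b, moduli))


moduli = (3, 4, 5)          # hypothetical example set, M = 60
M = prod(moduli)
A, B = 7, 8
ra, rb = to_rns(A, moduli), to_rns(B, moduli)

rns_sum = rns_op(ra, rb, moduli, operator.add)
rns_diff = rns_op(ra, rb, moduli, operator.sub)
rns_prod = rns_op(ra, rb, moduli, operator.mul)
```

Each digit is computed without reference to the others, which is the property that allows the residue channels to run in parallel. The results agree with encoding `(A + B) % M`, `(A - B) % M`, and `(A * B) % M` directly.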
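The conversion from RNS to weighted binary number system is done using the Chinese Remainder Theorem (CRT), which states that:
$$|X|_M = \left| \sum_{j=1}^{N} \hat{m}_j \left| \frac{r_j}{\hat{m}_j} \right|_{m_j} \right|_M$$
where
$$M = \prod_{j=1}^{N} m_j, \qquad \hat{m}_j = \frac{M}{m_j}, \qquad \mathrm{GCD}(m_j, m_k) = 1 \ \forall j \neq k$$
and $\left| r_j / \hat{m}_j \right|_{m_j}$ denotes the product of $r_j$ and the multiplicative inverse of $\hat{m}_j$, taken modulo $m_j$.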
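The CRT reconstruction can be sketched in a few lines of Python. This is a plain software rendering of the textbook formula, not the hardware decoder discussed in the paper; the function name `crt_decode` is an assumption, and the modular inverse relies on the three-argument `pow` with a negative exponent (Python 3.8 or later).

```python
from math import prod


def crt_decode(residues, moduli):
    """Recover X in [0, M) from its residues; moduli must be pairwise coprime."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = M // m                  # m_hat_j = M / m_j
        inverse = pow(m_hat, -1, m)     # multiplicative inverse of m_hat_j mod m_j
        total += m_hat * ((r * inverse) % m)
    return total % M                    # final summation modulo M


decoded = crt_decode((1, 3, 2), (3, 4, 5))
```

With the hypothetical moduli (3, 4, 5), the residues (1, 3, 2) decode to 7.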
3: ROM Requirements
A typical architecture for an RNS-based processor is shown in Figure 1. The architecture consists of four stages. In the first stage a variable X is converted to its RNS representation. In the second stage an arithmetic operation is performed. The third and fourth stages perform the conversion from RNS representation to weighted number system representation. The third stage is responsible for the computation of the t's according to the following definition:
The fourth stage is a modulo adder that adds the t's inputs and gives the final weighted value.
The ROM required for the RNS-based system can be divided into the following:
1- ROM required to implement processor i mod $m_i$, $1 \le i \le n$. The processor implementation is shown in Figure 2. The ROM required for each processor is given by:
$$2^{\lceil \log m_i \rceil} * 2^{\lceil \log m_i \rceil} * \lceil \log m_i \rceil$$
The total memory requirement for the n processors is given by:
$$\sum_{i=1}^{n} 2^{2\lceil \log m_i \rceil} * \lceil \log m_i \rceil \qquad (1)$$
2- ROM required to implement the t's. The ROM size is given by:
3- If the summation unit is implemented using ROM, as shown in Figure 3, the memory required is:
$$\left\{ 2^{\lceil \log m_1 \rceil} * 2^{\lceil \log m_2 \rceil} * \cdots * 2^{\lceil \log m_n \rceil} \right\} * \ldots = \ldots \qquad (3)$$
The ROM implementation of the summation unit is infeasible, since it needs a huge memory even for a small-size system.
4- The processors as well as the t's calculation can be implemented using ROMs as given in equations 1 and 2. The total size of ROM is:
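To make the growth of these ROM sizes concrete, the sketch below evaluates them for a given moduli set. Only the processor term follows equation (1) as printed. The t-table and summation-unit functions rest on an assumption that is not recoverable from the text here: each table is taken to store words of $\lceil \log_2 M \rceil$ bits, with one address bit per input residue bit. All function names are hypothetical.

```python
from math import prod


def ceil_log2(m):
    """ceil(log2(m)) for m >= 2, i.e. bits needed for residues 0..m-1."""
    return (m - 1).bit_length()


def processor_rom_bits(moduli):
    """Equation (1): two-operand look-up table per modulus."""
    return sum(2 ** (2 * ceil_log2(m)) * ceil_log2(m) for m in moduli)


def t_table_rom_bits(moduli):
    """ASSUMPTION: one table per residue, word width ceil(log2 M) bits."""
    width = ceil_log2(prod(moduli))
    return sum(2 ** ceil_log2(m) * width for m in moduli)


def summation_rom_bits(moduli):
    """ASSUMPTION: a single table addressed by all residues, same word width."""
    width = ceil_log2(prod(moduli))
    address_bits = sum(ceil_log2(m) for m in moduli)
    return 2 ** address_bits * width


example = (3, 4, 5)   # hypothetical moduli set
sizes = (processor_rom_bits(example),
         t_table_rom_bits(example),
         summation_rom_bits(example))
```

Under these assumptions the single summation table grows exponentially in the total number of residue bits, while the per-modulus tables grow only with the largest modulus, which matches the remark that a ROM-based summation unit is not practical.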
4: Implementation of the Chinese Remainder Theorem
The Chinese Remainder Theorem (CRT) is considered the main algorithm for the conversion process. Several implementations of the CRT have been reported [2,5,7-10]. In [10] Vu introduced an algorithm for computing the CRT based on manipulating the form of the summands stored in ROMs. The main idea in the algorithm is that representing the summands in a different form does not require any special real-time processing, as they are stored in ROMs. The summands are stored in a form that can be efficiently used in the evaluation of the final summation modulo M. The following derivation is used:
Computing the first part of the sum is easy because the modulus $m_j$ is small and can be implemented using ROMs. However, the second part is obtained most easily when $m_j$ equals $2^k$ or $2^k - 1$. Since it is not unusual to have a power of 2 included in most practical systems of moduli, $m_j = 2^k$ is assumed. The following equation is valid:
4.1 Implementation details
One of the moduli must be equal to $2^k$. A hardware implementation is given in Figure 4. There are four levels in implementing the modulo M adder as follows:
1. At level 1 a Mod $2^k$ adder is used. The speed depends on the speed of the multi-operand adder. An optimal adder of $\Theta(\log n)$ time complexity [3] can be used in this stage.
2. Level 2 is a small size ROM.
3. Levels 3 and 4 are 2-operand adders.
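The reason a power-of-two modulus simplifies level 1 can be seen in a small software model: reduction modulo $2^k$ only discards the bits above position k, so no correction step is needed after the multi-operand sum. The sketch below models the arithmetic only, not the $\Theta(\log n)$ hardware adder; the function name is an assumption.

```python
def add_mod_pow2(operands, k):
    """Multi-operand addition modulo 2**k by keeping the low k bits."""
    mask = (1 << k) - 1
    total = 0
    for x in operands:
        total = (total + x) & mask   # overflow above bit k-1 is dropped
    return total


low_bits = add_mod_pow2([13, 7, 9], 4)
```

For the operands 13, 7, and 9 with k = 4, the result equals `(13 + 7 + 9) % 16`.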
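The sizes of the buses and ROMs used are as follows:
1. $q_i$ requires $\lceil \log m_i \rceil$ bits $(0 \le q_i \le m_i)$
2. $r_i$ requires $\lceil \log (M/m_i) \rceil$ bits $(0 \le r_i \le M/m_i)$
3. $q$ requires $\lceil \log m_i \rceil$ bits $(q = \ldots)$
4. $r$ requires $\lfloor \log M \rfloor$ bits $(r = \ldots)$
5. $Y$ requires $\lfloor \log M \rfloor$ bits $(Y = q(\ldots + 2^k - M))$
6. $Z$ requires $\lfloor \log M \rfloor$ bits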
4.2 Choosing the System Moduli
4.2.1 Choosing the system moduli to minimize the ROM size: The following two cases are considered:
(1) ROM is used to obtain the $t_i$'s; then the problem is modeled as follows:
(2) ROM is used to obtain the $t_i$'s and implement the processors; then the problem is
$$m_1 * m_2 * \cdots * m_n \ge 2^l - 1$$
such that $m_1, m_2, \ldots, m_n$ are relatively prime.
As mentioned earlier, the problem can be solved using integer programming, with the extra constraint that the moduli are relatively prime. Because of this constraint it is not a standard integer programming problem; it needs some modifications and can be called Prime Integer Programming (PIP).
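The two ingredients of the problem, the constraints and the ROM-size objective, can be checked for any candidate moduli set with a short Python sketch. The processor term follows equation (1). The t-table term assumes a word width of l bits per entry, matching the second sum in the special-case objective of section 4.2.2; applying it to the general case is an assumption, as are the function names.

```python
from itertools import combinations
from math import gcd, prod


def ceil_log2(m):
    """ceil(log2(m)) for m >= 2."""
    return (m - 1).bit_length()


def is_feasible(moduli, l):
    """Constraints: pairwise relatively prime and product >= 2**l - 1."""
    coprime = all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))
    return all(m >= 2 for m in moduli) and coprime and prod(moduli) >= 2 ** l - 1


def objective(moduli, l, include_processors=True):
    """ROM bits: case (1) is t tables only, case (2) adds the processors."""
    t_tables = sum(2 ** ceil_log2(m) * l for m in moduli)   # ASSUMED width l
    processors = sum(2 ** (2 * ceil_log2(m)) * ceil_log2(m) for m in moduli)
    return t_tables + (processors if include_processors else 0)


case1 = objective((3, 5), 4, include_processors=False)
case2 = objective((3, 5), 4)
```

The relative-primality test is what takes the problem outside ordinary integer programming: it is a condition on pairs of variables that cannot be written as a linear inequality.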
4.2.2 Special Case of $M = 2^l$: In this section a special case is considered. The required condition for M is that $M \ge 2^l - 1$. If $M = 2^l$, the difficult problem of designing a modulo M adder will be reduced to the design of a multi-operand adder. The integer programming problem is represented as:
minimize
$$\sum_{i=1}^{n} 2^{2\lceil \log m_i \rceil} * \lceil \log m_i \rceil + \sum_{i=1}^{n} \left( 2^{\lceil \log m_i \rceil} * l \right)$$
$$m_1 * m_2 * \cdots * m_n = 2^l - 1$$
such that $m_1, m_2, \ldots, m_n$ are relatively prime.
Solving this special case is as difficult as the general case. There may not be a feasible solution. If there is no feasible solution, the general problem is solved instead of the special case problem. If there is a feasible solution for the special case, it is likely that the memory requirements will be larger than in the general case. In this case we have to choose between larger memory with a high-speed realization of the summation network on the one side and, on the other side, less memory with a complicated summation network.
4.2.3 Choosing one of the moduli equal to $2^k$: The implementation discussed in section 4.1 requires one of the system moduli to be equal to $2^k$. In this case the problem is defined as follows:
$$m_1 * m_2 * \cdots * m_n \ge 2^l - 1$$
such that $m_1, m_2, \ldots, m_n$ are relatively prime and $m_j = 2^k$ for some j and k.
As in the previous case, there may not be a feasible solution. If there is no feasible solution, the general problem is solved instead of the special case problem. If there is a feasible solution for this special case, it is likely that the memory requirements will be larger than in the general case. In this case we have to choose between larger memory with a high-speed realization of the summation network on the one side and, on the other side, less memory with a complicated summation network.
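The memory comparison between the constrained and the general problem can be explored with a small exhaustive search. This sketch is a stand-in for the integer programming formulation, not the paper's solver: it enumerates pairwise relatively prime moduli up to a hypothetical bound `max_modulus`, and it uses the ROM cost of equation (1) plus a t-table term with an assumed word width of l bits.

```python
from math import gcd


def ceil_log2(m):
    """ceil(log2(m)) for m >= 2."""
    return (m - 1).bit_length()


def rom_cost(moduli, l):
    """Equation (1) plus t tables with an ASSUMED word width of l bits."""
    return sum(2 ** (2 * ceil_log2(m)) * ceil_log2(m) + 2 ** ceil_log2(m) * l
               for m in moduli)


def search(l, require_power_of_two=False, max_modulus=32):
    """Cheapest coprime set with product >= 2**l - 1 (l >= 2), or None."""
    target = 2 ** l - 1
    best = None

    def extend(chosen, product, start, step):
        nonlocal best
        if product >= target:
            # Adding more moduli can only increase the cost, so stop here.
            cost = rom_cost(chosen, l)
            if best is None or cost < best[0]:
                best = (cost, tuple(sorted(chosen)))
            return
        for m in range(start, max_modulus + 1, step):
            if all(gcd(m, c) == 1 for c in chosen):
                extend(chosen + [m], product * m, m + step, step)

    if require_power_of_two:
        k = 1
        while 2 ** k <= max_modulus:
            extend([2 ** k], 2 ** k, 3, 2)   # the other moduli must be odd
            k += 1
    else:
        extend([], 1, 2, 1)
    return best


general = search(4)
constrained = search(4, require_power_of_two=True)
```

A return value of `None` corresponds to the case with no feasible solution inside the search bound. Because the constrained search space is a subset of the general one, its best cost can never be lower than the general optimum, which is the trade-off described above.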
4.3 New Algorithm for choosing the system moduli
From our previous analysis in section 4.2 we see that the problem of choosing the system moduli is presented as an integer programming problem with the following objective function:
4.3.1 Algorithm
Step #1. Solve objective function such that:
$m_1 * m_2 * \cdots * m_n \ge 2^l - 1$
$m_1, m_2, \ldots, m_n$ are relatively prime
$m_i = 2^k$ for any j and k
Step #2. Solve objective function such that:
$m_1 * m_2 * \cdots * m_n \ge 2^l - 1$
$m_1, m_2, \ldots, m_n$ are relatively prime
$m_j = 2^k$ for any j and k
Step #3. Solve objective function such that:
$m_1 * m_2 * \cdots * m_n \ge 2^l - 1$
$m_1, m_2, \ldots, m_n$ are relatively prime
Step #4. Choosing solution
(a) Speed
Sol 1 (if exists) is the fastest
Sol 2 (if exists) is slower than Sol 1
Sol 3 is the slowest
(b) Memory requirements
Sol 3 needs the least memory
Sol 2 needs memory equal or greater than Sol 3
Sol 1 needs memory similar to Sol 3
(c) Implementation
Sol 1: using n-operands adder.
Sol 2: is the implementation of Figure 4.
Sol 3: is Elleithy's implementation [2].
4.3.2 Results: The problem of choosing the system moduli has been solved using the integer programming approach with the extra constraint of having a relatively prime moduli set. Table 1 shows the results for output sizes from 7 bits to 16 bits. The approach is general and can be used to obtain results for larger bit sizes. For each bit size a number of solutions are obtained. In Table 1 only the solutions with the minimum ROM size are shown.
5: Conclusions
Conversion from RNS representation into binary representation is a time-consuming process. In this paper the Chinese Remainder Theorem (CRT) is used as the main algorithm for the conversion process. Designing an optimal RNS processor in terms of area and speed depends on the choice of the system moduli. In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an area/time objective function.
Acknowledgments
The author would like to acknowledge King Fahd University of Petroleum and Minerals for support provided for this work.
References
1- K. M. Elleithy and M. A. Bayoumi, "A Systolic Architecture for Modulo Multiplication," IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 42, no. 11, pp. 725-729, Nov. 1995.
2- K. M. Elleithy and M. A. Bayoumi, "Fast and Flexible Architectures for RNS Arithmetic Decoding," IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 39, no. 4, pp. 226-235, April 1992.
3- K. M. Elleithy and M. A. Bayoumi, "A Θ(1) Algorithm for Modulo Addition," IEEE Transactions on Circuits and Systems, vol. 37, no. 5, pp. 628-631, May 1990.
4- K. M. Elleithy and M. A. Bayoumi, "A Θ(1) Algorithm for Modulo Multiplication," Proc. of the 32nd Midwest Symposium on Circuits and Systems, Aug. 1989.
5- K. M. Elleithy, M. A. Bayoumi, and K. P. Lee, "Θ(log n) Architectures for RNS Arithmetic Decoding," Proc. of the 9th Symposium on Computer Arithmetic, pp. 202-209, Sep. 1989.
6- S. Szabo and R. I. Tanaka, "Residue Arithmetic and its Applications to Computer Technology," New York: McGraw-Hill, 1967.
7- Andraos and H. Ahmed, "A New Efficient Memoryless Residue to Binary Converter," IEEE Trans. Circuits Syst., vol. 35, pp. 1441-1444, Nov. 1988.
8- M. Ibrahim and S. N. Saloum, "An Efficient Residue to Binary Converter Design," IEEE Trans. Circuits Syst., vol. 35, pp. 1156-1158, September 1988.
9- P. Shenoy and R. Kumaresan, "Residue to Binary Conversion for RNS Arithmetic Using Only Modular Look-up Tables," IEEE Trans. Circuits Syst., vol. 35, pp. 1158-1162, September 1988.
10- V. Vu, "Efficient Implementations of the CRT for Sign Detection and Residue Decoding," IEEE Trans. Comp., vol. C-34, pp. 646-651, July 1985.
11- N. Zhang, B. Shirazi, and D. Y. Y. Yun, "Parallel Designs for Chinese Remainder Conversion," Proc. IEEE 16th Annual Conf. on Parallel Processing, Aug. 1987.
Figure 1. RNS-Based Architecture.
Figure 2. Processor i mod $m_i$ Implementation.
Figure 3. Implementation of the CRT Decoder.
Figure 4. Implementation of the CRT decoder with one of the moduli equal to $2^k$.
Table 1: Optimal ROM size for RNS processors where the bus size varies between 7 and 16 bits.