
628    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, VOL. 37, NO. 5, MAY 1990
A O(1) Algorithm for Modulo Addition

KHALED M. ELLEITHY AND MAGDY A. BAYOUMI
Abstract - A O(1) algorithm for large modulo addition for residue number system (RNS) based architectures is proposed. The addition is done in a fixed number of stages which does not depend on the size of the modulus. The proposed modulo adder is much faster than previous adders and more area efficient. The implementation of the adder is modular and is based on simple cells, which leads to efficient VLSI realization.
I. INTRODUCTION
Recently, the residue number system (RNS) has been receiving increased attention due to its ability to support high-speed concurrent arithmetic [1]. Applications such as the fast Fourier transform, digital filtering, and image processing utilize the high-speed RNS arithmetic operations (addition and multiplication) and do not require the difficult RNS operations such as division and magnitude comparison. The technological advantages offered by VLSI have added a new dimension to the implementation of RNS-based architectures [2]. Several high-speed VLSI special-purpose digital signal processors have been successfully implemented [3]-[5].
Modulo addition represents the computational kernel for RNS-based architectures. Subtraction is performed by adders using the additive inverse property [6]. Multiplication can be transformed into addition by several techniques [7]. Also, modulo addition is the basic element in the conversion from RNS to binary using the Chinese remainder theorem (CRT) [6]. Banerji [8] analyzed modulo addition in MSI technology. A VLSI analysis of modulo addition has been reported in [9]-[11]. In general, lookup tables and PLAs have been the main logical modules used when the data granularity is the word. It has been found that such structures are only efficient for small moduli. For medium-size and large moduli, bit-level structures, where the data granularity is the bit, are more efficient [12].
In this paper, we present a modulo adder for medium-size and large moduli. It is based on a two-dimensional array of very simple cells (full adders). The modulo addition is performed in a fixed time delay independent of the size of the modulus.
II. RESIDUE NUMBER SYSTEM (RNS)

In RNS, an integer X can be represented by an N-tuple of residue digits

X = (r_1, r_2, ..., r_N)

where r_i = |X|_{m_i} with respect to a set of N moduli {m_1, m_2, ..., m_N}. In order to have a unique residue representation, the moduli must be pairwise relatively prime, that is,

GCD(m_i, m_j) = 1,  for i ≠ j.

Then it can be shown that there is a unique representation for each number in the range 0 ≤ X < ∏_{i=1}^{N} m_i = M, where N is the number of moduli.
Manuscript received April 12, 1989. This work was supported in part by the National Science Foundation under Grant MIP-8809811. This letter was recommended by Associate Editor T. R. Viswanathan.
The authors are with The Center for Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, LA 70504.
IEEE Log Number 9034410.
Fig. 1. Modulo addition using two adders.

Fig. 2. Modulo addition using a lookup table.
The arithmetic operation on two integers A and B is equivalent to the arithmetic operation on their residue representations, that is,

|A · B|_M = (||A|_{m_1} · |B|_{m_1}|_{m_1}, ||A|_{m_2} · |B|_{m_2}|_{m_2}, ..., ||A|_{m_N} · |B|_{m_N}|_{m_N})

where "·" can be addition, subtraction, or multiplication. It is desirable to convert binary arithmetic on large integers to residue arithmetic on smaller residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.
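As a concrete illustration of this carry-free, componentwise arithmetic, a small sketch (the moduli here are chosen arbitrarily for the example, not taken from the letter):

```python
# Componentwise RNS arithmetic: each residue digit is processed independently.
moduli = (7, 11, 13)        # pairwise relatively prime
M = 7 * 11 * 13             # dynamic range M = 1001

def to_rns(x):
    """Encode an integer as its N-tuple of residue digits."""
    return tuple(x % m for m in moduli)

def rns_op(a, b, op):
    """Apply op to each residue digit independently: no carry chain between digits."""
    return tuple(op(ra, rb) % m for ra, rb, m in zip(a, b, moduli))

a, b = 123, 456
s = rns_op(to_rns(a), to_rns(b), lambda x, y: x + y)
p = rns_op(to_rns(a), to_rns(b), lambda x, y: x * y)
assert s == to_rns((a + b) % M)   # matches binary addition modulo M
assert p == to_rns((a * b) % M)   # matches binary multiplication modulo M
```

Each digit position involves only a small modulus, so all N digit operations can run in parallel.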
2.1. The Modulo Addition
Generally, addition modulo m has 2^n - m (n = ⌈log2 m⌉) incorrect residue states. These states are in the range [m, 2^n - 1] and may be called overflow states. The corrected residue numbers can be obtained by two methods: employing a binary adder or a correction table. In the first method, a constant (2^n - m) is added to correct the overflow residue states (generalized end-around carry), as shown in Fig. 1. The addition is performed as follows:

y = |x1 + x2|_m = { x1 + x2,       if x1 + x2 < m
                  { x1 + x2 - m,   if x1 + x2 ≥ m.
Two n-bit adders are used; the first computes x1 + x2, while the second computes x1 + x2 - m (by adding the constant 2^n - m). The carry bit generated from the second adder indicates whether or not x1 + x2 ≥ m. A multiplexer, controlled by the carry, selects the correct output. In the second method, a lookup table is used to correct the (2^n - m) incorrect residue states (Fig. 2). The first algorithm for modulo addition has a time complexity of O(log n), and the second algorithm is not suitable for medium and large moduli.
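The first method can be modeled in a few lines of software; a minimal sketch (the function name is ours, not the letter's) of the two adders and the carry-controlled multiplexer:

```python
# Sketch of the first method (Fig. 1): two n-bit binary adders plus a multiplexer.
# The second adder adds the correction constant 2^n - m; its carry-out signals
# that x1 + x2 >= m and selects the corrected sum.
def mod_add_two_adders(x1, x2, m):
    n = m.bit_length()           # word length n, with 2^(n-1) <= m < 2^n here
    t1 = x1 + x2                 # first n-bit adder
    t2 = t1 + ((1 << n) - m)     # second adder: add 2^n - m
    carry = t2 >> n              # carry-out of the second adder
    return t2 & ((1 << n) - 1) if carry else t1

assert mod_add_two_adders(7, 8, 11) == 4   # 15 - 11
assert mod_add_two_adders(3, 4, 11) == 7   # no overflow, first adder wins
```

The carry-out test works because x1 + x2 ≥ m exactly when x1 + x2 + (2^n - m) ≥ 2^n.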
0098-4094/90/0500-0628$01.00 © 1990 IEEE
Fig. 3. A modulo sum adder.
III. THE PROPOSED MODULO ADDER
The carry-save adder (CSA) [13] has been proved to offer high speed in multioperand addition. Basically, CSA depends on the idea of not completing the addition process at a certain stage but postponing it to the final stage. In the intermediate stages, numbers are represented as sum and carry to avoid the complete addition process.

The idea of representing a number as a carry and a sum can be used in modulo addition to obtain a scheme with a constant speed that does not depend on the number of bits.
The modulo adder is used to add two numbers A and B in modulo m. Fig. 3 shows that A is represented as a pair of numbers (A_s, A_c), B is also represented as (B_s, B_c), and the output C is represented as (C_s, C_c). Each number is represented as a group of sum bits and carry bits. There is no unique representation for A_s and A_c; the condition that needs to be satisfied is

|A_s + A_c|_m = |A|_m.

One possible representation is A_s = |A|_m, A_c = 0.
The choice of a representation has no implication on the complexity of the design. With such a representation, four numbers (A_s, A_c, B_s, B_c) need to be added, and two steps of CSA are required. After the addition process we need to detect whether -m or 2·(-m) is required to adjust the result. The adjusting process takes at most three steps. Since the adder has a fixed number of steps (five) no matter how long A and B are, it can be used in a multioperand pipelined addition scheme [14].
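The carry-save representation described above can be demonstrated directly; a sketch (the function name `csa` is ours, not the letter's) of a 3:2 compressor, which reduces three operands to a (sum, carry) pair in one full-adder delay regardless of word length:

```python
# The carry-save idea: keep a number as a (sum, carry) pair so that three
# operands are compressed to two without propagating any carries.
def csa(x, y, z):
    s = x ^ y ^ z                              # bitwise sum (no carry propagation)
    c = ((x & y) | (x & z) | (y & z)) << 1     # carry bits, shifted one position left
    return s, c

a, b, d = 1272, 450, 999
s, c = csa(a, b, d)
assert s + c == a + b + d    # the pair preserves the exact total
```

Because no carry travels more than one position, the delay of this step is independent of the operand width, which is the property the proposed adder exploits.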
3.1. The Modulo Addition Algorithm
The proposed algorithm for modulo m addition of two numbers can be described as follows.

Algorithm modulo_add (A, B, Result)

Input: Two variables A and B in modulo m; A is represented as A_s and A_c, and B is represented as B_s and B_c. All variables are n-bit numbers (2^(n-1) < m ≤ 2^n).

Output: Variable Result, represented as Result_s and Result_c. The relation between A, B, and Result is: Result = |A + B|_m.
Procedure:
begin
  Do in parallel
  begin
    Call Sum(temp1, A_s, A_c, B_s)
    Call Carry(temp2, A_s, A_c, B_s)
  end
  Do in parallel
  begin
    Call Sum(temp3, temp1, temp2, B_c)
    Call Carry(temp4, temp1, temp2, B_c)
  end
  Case (temp2[n+1] + temp4[n+1]) of
  0: Do in parallel
     begin
       Result_s := temp3
       Result_c := temp4
     end
     exit
  1: Do in parallel
     begin
       Call Sum(temp5, temp3, temp4, (2^n - m))
       Call Carry(temp6, temp3, temp4, (2^n - m))
     end
  2: Do in parallel
     begin
       Call Sum(temp5, temp3, temp4, 2*(2^n - m))
       Call Carry(temp6, temp3, temp4, 2*(2^n - m))
     end
  end case
  Case (temp6[n+1]) of
  0: Do in parallel
     begin
       Result_s := temp5
       Result_c := temp6
     end
     exit
  1: Do in parallel
     begin
       Call Sum(temp7, temp5, temp6, (2^n - m))
       Call Carry(temp8, temp5, temp6, (2^n - m))
     end
  end case
  Case (temp8[n+1]) of
  0: Do in parallel
     begin
       Result_s := temp7
       Result_c := temp8
     end
  1: Do in parallel
     begin
       Call Sum(temp9, temp7, temp8, (2^n - m))
       Call Carry(temp10, temp7, temp8, (2^n - m))
     end
     Do in parallel
     begin
       Result_s := temp9
       Result_c := temp10
     end
  end case
end.
Sum (A, B, C, D)
begin
  Do in parallel (1 ≤ i ≤ n)
    A[i] := B[i] ⊕ C[i] ⊕ D[i]
end

Carry (A, B, C, D)
begin
  A[1] := 0
  Do in parallel (1 ≤ i ≤ n)
    A[i+1] := (B[i] ∧ C[i]) ∨ (B[i] ∧ D[i]) ∨ (C[i] ∧ D[i])
end
An implementation of the algorithm is shown in Fig. 4.
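The five-step scheme can be sketched in software. The names `csa` and `modulo_add` below are ours, not the letter's, and the while loop stands in for the fixed correction stages: every bit of weight 2^n is congruent to 2^n - m (mod m) because 2^(n-1) < m ≤ 2^n, and the letter's analysis bounds the corrections at three rounds for the hardware's bit widths.

```python
# Software sketch of the proposed modulo adder (hypothetical helper names).
# A and B arrive in redundant (sum, carry) form, as in Fig. 3.
def csa(x, y, z):
    """3:2 compressor: exact sum preserved as a (sum, carry) pair."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def modulo_add(As, Ac, Bs, Bc, m):
    n = m.bit_length()
    corr = (1 << n) - m          # the correction constant 2^n - m
    mask = (1 << n) - 1
    s, c = csa(As, Ac, Bs)       # step 1
    s, c = csa(s, c, Bc)         # step 2
    # Correction rounds: drop the overflow bits of weight 2^n and add
    # k * (2^n - m) instead, which subtracts k * m from the exact total.
    while (s >> n) or (c >> n):
        k = (s >> n) + (c >> n)
        s, c = csa(s & mask, c & mask, k * corr)
    return s, c                  # redundant result: |s + c|_m = |A + B|_m

# The example of Fig. 5: A = 1272, B = 450, m = 2050.
Rs, Rc = modulo_add(1272, 0, 450, 0, 2050)
assert (Rs + Rc) % 2050 == (1272 + 450) % 2050 == 1722
```

Each loop iteration reduces the exact total by a multiple of m while preserving its residue, so the returned pair always satisfies the output condition of the algorithm.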
Fig. 4. Different stages of the modulo adder.
Theorem 1: The modulo adder scheme for adding two n-bit numbers in modulo m has an asymptotic time complexity O(1).
Proof: To prove that the number of steps is constant (five), we need to prove that the last carry is equal to zero in five or fewer steps. Induction on the number of bits n is used to prove the correctness of the theorem.

1) Basis step: for n = 0, we do not add any numbers, and in this case the required number of steps is zero.

2) Induction hypothesis: assume for a fixed arbitrary n > 0 that the maximum number of steps is five.

3) Induction step: for numbers with n+1 bits, let

τ = temp2[n+2] + temp4[n+2].

Then we have the following cases.

(a) τ = 0: the carry propagation stopped at bit n, and the addition ends after five steps at most, according to the induction hypothesis.

(b) τ = 1: the correction is 2^(n+1) - m in step 3. Since m > 2^n, we have 2^(n+1) - m < 2^n, which means that (2^(n+1) - m)[n+1] = 0. The worst case is to have temp5[n+1] and temp6[n+2] equal to one. This means that temp7[n+1] = 0 and temp7[n+2] = 1, and then temp8[n+2] = 0. In this case the correction is done in two steps (steps 3 and 4).

(c) τ = 2: the correction is 2·(2^(n+1) - m) in step 3. The worst case is to have temp5[n+1], temp6[n+2], and 2·(2^(n+1) - m)[n+1] all equal to one. Then temp5[n+1] = 1, temp6[n+1] = 1, and (2^(n+1) - m)[n+1] = 0. At step 4, temp7[n+1] = 0 and temp8[n+2] = 1. At step 5, temp9[n+1] = 1 and temp10[n+2] = 0. In this case the correction is done in three steps (steps 3-5).
As an example, the modulo addition of A = 1272 and B = 450 for m = 2050 (n = 12) is shown in Fig. 5. There is no unique representation for A and B; one valid representation is shown in the figure, which traces the sum and carry words of each step at the bit level.

Fig. 5. A detailed example for the modulo addition.

In step 1 we get temp2[13] = 1, and in step 2 we get temp4[13] = 1, which means that at step 3 we have to add 2·(2^n - m). At step 3 we get temp6[13] = 1, which means that at step 4 we have to add 2^n - m. At step 4 we get temp8[13] = 0, which means that the addition process stops at step 4. The result of step 4 is the final result.
IV. MODULO ADDER EVALUATION

Using the VLSI model of computation for asymptotic complexity [15], a comparative study of the proposed adder is carried out. For adder I (Fig. 1), using the binary adder of Brent and Kung [16], the complexity measures are as follows:

A = O(log m log log m) = O(n log n)
T = O(log log m) = O(log n)
AT^2 = O(n (log n)^3).

For adder II (Fig. 2), using the complexity analysis of the lookup table of [17]:

A = O(log m log log m + m log m) = O(n log n + n 2^n) = O(n 2^n)
T = O(log log m + log m) = O(log n + n) = O(n)
AT^2 = O(n^3 2^n).

For the proposed adder:

A = O(n)
T = O(1)
AT^2 = O(n).
V. CONCLUSIONS

The modulo adder introduced in this paper has a total time-delay complexity of O(1) for adding two n-bit numbers in modulo m. Based on the analysis of Section IV, this adder is the
fastest and the most area efficient for large moduli. The proposed design has the following advantages.
1) It does not have any limitation on the size of the modulus.
2) It is quite modular, and it is a two-dimensional array of one type of cell (full adder).
3) It is easy to pipeline.
4) It is a very efficient architecture for the implementation of the CRT decoding [14].
REFERENCES

[1] F. J. Taylor, "Residue arithmetic: A tutorial with examples," IEEE Comput. Mag., pp. 50-62, May 1984.
[2] M. A. Bayoumi, G. A. Jullien, and W. C. Miller, "A look-up table VLSI design methodology for RNS structures used in DSP applications," IEEE Trans. Circuits Syst., vol. CAS-34, pp. 604-616, June 1987.
[3] M. A. Bayoumi, "A high speed VLSI complex digital signal processor based on quadratic residue number system," in VLSI Signal Processing II. New York: IEEE Press, 1986, pp. 200-211.
[4] M. A. Bayoumi, "Digital filter VLSI systolic arrays over finite fields for DSP applications," in Proc. 6th IEEE Ann. Phoenix Conf. on Computers and Communications, pp. 194-199, Feb. 1987.
[5] W. Jenkins and E. Davidson, "A custom-designed integrated circuit for the realization of residue number digital filters," in Proc. ICASSP 1985, pp. 220-223, Mar. 1985.
[6] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology. New York: McGraw-Hill, 1967.
[7] M. A. Soderstrand and E. L. Fields, "Multipliers and residue number arithmetic digital filters," Electron. Lett., vol. 13, no. 6, pp. 164-166, Mar. 1977.
[8] D. K. Banerji, "A novel implementation method for addition and subtraction in residue number systems," IEEE Trans. Comput., vol. C-23, pp. 106-109, Jan. 1974.
[9] M. A. Bayoumi, G. A. Jullien, and W. C. Miller, "A VLSI implementation of residue adders," IEEE Trans. Circuits Syst., vol. CAS-34, pp. 284-288, Mar. 1987.
[10] M. A. Bayoumi, "VLSI PLA structures for residue number systems arithmetic implementations," in Proc. ISCAS 1987, 1987.
[11] C. L. Chiang and L. Johnsson, "Residue arithmetic and VLSI," in Proc. ICCD 83, pp. 80-83, Oct. 1983.
[12] K. M. Elleithy, "On bit-parallel processing for modulo arithmetic," VLSI Tech. Rep. TR86-8-1, Ctr. Advanced Computer Studies, Univ. of Southwestern Louisiana, 1986.
[13] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 1978.
[14] K. M. Elleithy, "On the bit-parallel implementation for the Chinese remainder theorem," VLSI Tech. Rep. TR87-8-1, Ctr. Advanced Computer Studies, Univ. of Southwestern Louisiana, 1987.
[15] G. Alia and E. Martinelli, "A VLSI algorithm for direct and reverse conversion from weighted binary number system to residue number system," IEEE Trans. Circuits Syst., vol. CAS-31, pp. 1033-1039, 1984.
[16] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. C-31, pp. 260-264, Mar. 1982.
[17] M. A. Bayoumi, "Lower bounds for VLSI implementation of residue number system architectures," Integration, The VLSI J., vol. 4, no. 4, pp. 263-269, Dec. 1986.
Adjusting the Parameters in Elliptic-Function Filters

H. J. ORCHARD

Abstract - When designing elliptic-function filters there is usually some margin in performance to be distributed over the defining parameters. A recent paper offered some comparatively complicated formulas for use in this stage of the design. However, a simpler method, originally due to Darlington, is available and is described briefly.

I. INTRODUCTION

The solution given by Vlcek and Unbehauen [1] to what they refer to as the "degree equation" for elliptic-function filters is exact, but rather complicated for the intended use in adjusting the parameters at the beginning of a design, and involves computing several elliptic functions of rational fractions of a quarterperiod. High precision is unnecessary at this stage in the design, and a much simpler formula which allows one to achieve the same end result with a pocket calculator was in fact given by Darlington [2] just 50 years ago, but seems to have been overlooked. The purpose of this note is to explain his formula in more detail than appeared in [2] and to expand on it slightly.

In the design of an elliptic-function filter one is given a specification for the passband ripple, a_p, the minimum stopband loss, a_s (both in decibels), and the elliptic modulus k = ω_p/ω_s, where ω_p is the passband-edge frequency and ω_s is the stopband-edge frequency. From these one must first find the smallest integral value of the degree n that can be used. This choice for n will normally cause the filter to be slightly better than is called for in the specification, so one can adjust the values of a_p, a_s, and k that one uses in the design to allow some margin inside the specification. The relation between these four parameters, a_p, a_s, k, and n, is given by the "degree equation."

II. DEFINITION OF THE ELLIPTIC-FUNCTION FILTER

The power ratio for an elliptic-function filter is most conveniently defined by a pair of equations involving a parametric variable, exactly analogous to those for the well-known Chebyshev filter. The latter is defined by

10^(a/10) = 1 + ε^2 cos^2(nu)    (1a)
Ω = cos(u)                       (1b)

where u is the parametric variable and n is the degree. The passband edge is normalized to Ω = 1 and the passband ripple is

a_p = 10 log(1 + ε^2) dB.        (2)

The equations for the elliptic-function filter are obtained by replacing the cosines in (1) by the Jacobian cd elliptic functions and take the form¹

10^(a/10) = 1 + ε^2 cd^2(nuK_1/K; k_1)    (3a)
Ω = cd(u; k)                              (3b)

As in (1), this definition holds for both odd and even values of the degree n. K_1 and K are the real quarterperiods belonging to the elliptic functions with moduli k_1 and k in (3a) and (3b), respectively. The passband ripple is again given by (2). The frequency scale is still normalized to Ω = 1 at ω_p, rather than to sqrt(ω_p ω_s) as in [1] and [2]; the latter normalization, though nicer for theoretical work, is a nuisance in practical design.

In order for (3) to define a rational function, the period of cd(nuK_1/K; k_1) in the u plane must fit exactly n times into the period rectangle of cd(u; k), just as the period strip of cos(nu) fits n times into the period strip of cos(u).

¹The cd function is the same as the sn function shifted by a real quarterperiod, just as the cosine is the same as the sine shifted by π/2. Using the cd function rather than the sn function as in [2] allows one to describe both odd and even degree cases with one common formula.

Manuscript received April 28, 1989. This letter was recommended by Associate Editor T. R. Viswanathan.
The author is with the Electrical Engineering Department, University of California, Los Angeles, Los Angeles, CA 90024.
IEEE Log Number 8930187.

0098-4094/90/0500-0531$01.00 © 1990 IEEE
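The degree equation mentioned above can be checked numerically. The sketch below uses the standard closed form n ≥ K(k)·K(k_1')/(K(k')·K(k_1)), with k_1 = sqrt((10^(a_p/10) - 1)/(10^(a_s/10) - 1)); this formula is an assumption here (it is the usual statement of the degree equation, not quoted from this note), and `K` and `min_degree` are hypothetical helper names. K is computed with the arithmetic-geometric mean, so only the standard library is needed.

```python
import math

def K(k):
    """Complete elliptic integral of the first kind, modulus k (not parameter m),
    computed via the arithmetic-geometric mean."""
    a, b = 1.0, math.sqrt(1.0 - k * k)
    while abs(a - b) > 1e-15:
        a, b = (a + b) / 2.0, math.sqrt(a * b)
    return math.pi / (2.0 * a)

def min_degree(ap, as_, k):
    """Smallest integer degree n meeting passband ripple ap (dB), stopband
    loss as_ (dB), and selectivity k = wp/ws (the assumed degree equation)."""
    k1 = math.sqrt((10 ** (ap / 10.0) - 1.0) / (10 ** (as_ / 10.0) - 1.0))
    kp, k1p = math.sqrt(1.0 - k * k), math.sqrt(1.0 - k1 * k1)
    return math.ceil((K(k) * K(k1p)) / (K(kp) * K(k1)))

# 1-dB ripple, 40-dB stopband loss, wp/ws = 0.8
n = min_degree(1.0, 40.0, 0.8)
```

Because the exact ratio usually falls strictly between two integers, rounding n up is what creates the design margin the note discusses distributing over a_p, a_s, and k.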
Article
The efficient hardware implementation of residue number system (RNS) architectures has evolved based on the development of integrated circuit technology. The implementation of RNS adders is discussed in this paper. Three approaches (the binary adder, the look-up table, and the hybrid implementation) are analyzed in the scope of VLSI criteria where the performance measures are area and time. Two layout design procedures have been used. They are flexible and support any type of moduli. The implementation complexity depends on the form and size of the modulus, but, in general, the look-up table approach is preferable in both area and time for moduli up to five bits, while the binary adder and the hybrid approaches offer better performance for larger moduli.