Faster Interleaved Modular Multiplication
Based on Barrett and Montgomery
Reduction Methods
Miroslav Knežević, Member, IEEE,
Frederik Vercauteren, and
Ingrid Verbauwhede, Senior Member, IEEE
Abstract—This paper proposes two improved interleaved modular multiplication
algorithms based on Barrett and Montgomery modular reduction. The algorithms
are simple and especially suitable for hardware implementations. Four large sets
of moduli for which the proposed methods apply are given and analyzed from a
security point of view. By considering state-of-the-art attacks on public-key
cryptosystems, we show that the proposed sets are safe to use, in practice, for
both elliptic curve cryptography and RSA cryptosystems. We propose a hardware
architecture for the modular multiplier that is based on our methods. The results
show that concerning the speed, our proposed architecture outperforms the
modular multiplier based on standard modular multiplication by more than
50 percent. Additionally, our design consumes less area compared to the standard
solutions.
Index Terms—Modular multiplication, Barrett reduction, Montgomery reduction,
public-key cryptography.
1 INTRODUCTION
PUBLIC-KEY cryptography (PKC), a concept introduced by Diffie
and Hellman [10] in the mid 1970s, has gained its popularity
together with the rapid evolution of today’s digital communication
systems. The best-known public-key cryptosystems are based on
factoring, i.e., RSA [25], and on the discrete logarithm problem in a
large prime field (Diffie-Hellman, ElGamal, Schnorr, DSA) [19] or
on an elliptic curve (ECC/HECC) [15], [20], [14]. Based on the
hardness of the underlying mathematical problem, PKC usually
deals with large numbers ranging from a few hundred to a few
thousand bits in size. Consequently, efficient implementation
of PKC primitives has always been a challenge.
Modular multiplication forms the basis of modular exponentia-
tion which is the core operation of the RSA cryptosystem. It is also
present in many other cryptographic algorithms including those
based on ECC and HECC. In particular, if one uses projective
coordinates for ECC/HECC, modular multiplication remains the
most time-consuming operation for ECC. For efficient implemen-
tation of modular multiplication, the crucial operation is modular
reduction. Algorithms that are most commonly used for this
purpose are Barrett reduction [4] and Montgomery reduction [21].
In this study, we propose two interleaved modular multi-
plication algorithms based on Barrett and Montgomery modular
reduction. The methods are simple and especially suitable for
hardware implementations. Four large sets of moduli for which the
proposed methods apply are given and analyzed from a security
point of view. We propose a hardware architecture for the modular
multiplier that is based on our methods. The results show that
concerning the speed, our proposed architecture outperforms the
modular multiplier based on standard modular multiplication by
more than 50 percent. Additionally, our design consumes less area
compared to the standard solutions.
The remainder of this paper is structured as follows: Section 2
describes the algorithms of Barrett and Montgomery as the two
most commonly used reduction methods and presents a short
overview of related work. In Section 3, we show how precomputa-
tion can be omitted and the quotient evaluation simplified in
Barrett and Montgomery algorithms. Section 4 analyzes the
security implications, and in Section 5, we describe a hardware
implementation. Section 6 concludes the paper.
2 PRELIMINARIES
In this paper, we use the following notation. A multiple-precision $n$-bit integer $A$ is represented in radix $r$ as $A = (A_{n_w-1} \ldots A_0)_r$, where $r = 2^w$; $n_w$ represents the number of digits and is equal to $\lceil n/w \rceil$, where $w$ is the digit size; and $A_i$ is called a digit, with $A_i \in [0, r-1]$. A special case is $r = 2$ ($w = 1$), for which the representation $A = (A_{n-1} \ldots A_0)_2$ is called a bit representation.
To make the following discussion easier, we define the floor function for integers in the following manner. Let $U, M \in \mathbb{Z}$ and $M > 0$; then there exist integers $q$ and $Z$ such that $U = qM + Z$ and $0 \le Z < M$. The integer $q$ is called the quotient and is denoted by the floor function as
$$q = \lfloor U/M \rfloor. \qquad (1)$$
The integer $Z$ is called the remainder and can also be represented as $Z = U \bmod M$. Note here that the floor function always rounds toward negative infinity. This is very useful for hardware implementations, where the numbers are given in two’s complement representation. If the divisor is of type $2^s$, the floor function is just a simple shift to the right by $s$ positions.
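As a small illustration that is not part of the original text, the following Python sketch checks that floor division by $2^s$ coincides with an arithmetic right shift, also for negative operands. Python integers behave like unbounded two's complement values under `>>`; the helper name `floor_div_pow2` is hypothetical.

```python
def floor_div_pow2(u, s):
    """Return floor(u / 2**s) using only a right shift (rounds toward -infinity)."""
    return u >> s


def check_floor_shift(values, max_shift):
    """Compare the shift against floor division and check 0 <= remainder < 2**s."""
    for u in values:
        for s in range(max_shift + 1):
            q = floor_div_pow2(u, s)
            z = u - q * (1 << s)          # remainder Z = U - qM with M = 2**s
            if q != u // (1 << s) or not (0 <= z < (1 << s)):
                return False
    return True
```

The remainder check mirrors the definition $U = qM + Z$ with $0 \le Z < M$ given above.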
2.1 Classical and Montgomery Modular Multiplication
Methods
Given a modulus $M$ and two elements $X, Y \in \mathbb{Z}_M$, where $\mathbb{Z}_M$ is the ring of integers modulo $M$, the ordinary modular multiplication is defined as
$$X \otimes Y \triangleq X \cdot Y \bmod M.$$
Let the modulus $M$ be an $n_w$-digit integer, where the radix of each digit is $r = 2^w$. The classical modular multiplication algorithm computes $XY \bmod M$ by interleaving the multiplication and modular reduction phases, as shown in Algorithm 1. The value $q$ is called an intermediate quotient, while $Z$ represents an intermediate remainder. The calculation of $q$ at step 4 of the algorithm is done by utilizing integer division which is considered as an expensive operation, especially in hardware. The idea of using the precomputed reciprocal of the modulus $M$ and simple shift and multiplication operations instead of division was first introduced by Barrett [3], [4] in 1984. The original algorithm considers only reduction, assuming that the multiplication is performed beforehand. To explain the basic idea, we rewrite the intermediate quotient $q$ as
$$q = \left\lfloor \frac{Z}{M} \right\rfloor = \left\lfloor \frac{\frac{Z}{2^{n+\beta}} \cdot \frac{2^{n+\alpha}}{M}}{2^{\alpha-\beta}} \right\rfloor \ge \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor = \hat{q}. \qquad (2)$$
The value $\hat{q}$ represents an estimation of the intermediate quotient $q$. In most of the cryptographic applications, the modulus $M$ is fixed during the many modular multiplications, and hence, the value $\mu = \lfloor 2^{n+\alpha}/M \rfloor$ can be precomputed and reused multiple times. Since the value of $\hat{q}$ is an estimated value, some correction steps at the end of the modular multiplication algorithm have to be performed.
IEEE TRANSACTIONS ON COMPUTERS, VOL. 59, NO. 12, DECEMBER 2010 1715
The authors are with the Department of Electrical Engineering, Katholieke
Universiteit Leuven, ESAT/SCD-COSIC, Kasteelpark Arenberg 10, B-3001
Leuven-Heverlee, Belgium.
E-mail: {mknezevi, fvercaut, iverbauw}@esat.kuleuven.be.
Manuscript received 24 Apr. 2009; revised 18 Sept. 2009; accepted 23 Jan.
2010; published online 14 Apr. 2010.
Recommended for acceptance by P. Montuschi.
For information on obtaining reprints of this article, please send e-mail to:
tc@computer.org, and reference IEEECS Log Number TC-2009-04-0173.
Digital Object Identifier no. 10.1109/TC.2010.93.
0018-9340/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society
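Algorithm 1. Classical modular multiplication algorithm
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M = (M_{n_w-1} \ldots M_0)_r$, where $0 \le X, Y < M$, $2^{n-1} \le M < 2^n$, $r = 2^w$ and $n_w = \lceil n/w \rceil$.
Output: $Z = XY \bmod M$.
1: $Z \leftarrow 0$
2: for $i = n_w - 1$ downto 0 do
3:   $Z \leftarrow Zr + XY_i$
4:   $q \leftarrow \lfloor Z/M \rfloor$
5:   $Z \leftarrow Z - qM$
6: end for
7: return $Z$.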
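The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal software model under our own assumptions, not the authors' hardware description: Python's arbitrary-precision integers stand in for the multiple-precision registers, and the function name `classical_modmul` is hypothetical. Step 4 uses a true integer division, which is exactly the operation the rest of the paper tries to avoid.

```python
def classical_modmul(x, y, m, w):
    """Interleaved modular multiplication (Algorithm 1) with radix r = 2**w.

    Assumes 0 <= x, y < m. The digits of y are consumed most significant first.
    """
    n = m.bit_length()            # m is an n-bit modulus
    nw = -(-n // w)               # number of digits, ceil(n / w)
    mask = (1 << w) - 1
    z = 0
    for i in range(nw - 1, -1, -1):
        y_i = (y >> (w * i)) & mask   # digit Y_i
        z = (z << w) + x * y_i        # step 3: Z <- Z*r + X*Y_i
        q = z // m                    # step 4: exact intermediate quotient
        z -= q * m                    # step 5: intermediate remainder
    return z
```

Because the quotient is exact here, the intermediate remainder always stays in $[0, M)$ and no final correction is needed.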
To reduce the number of correction steps, Dhem [9] determines the values of $\alpha = w + 3$ and $\beta = -2$ for which the classical modular multiplication based on Barrett reduction needs at most one subtraction at the end of the algorithm. To make the following explanations easier, we outline a similar analysis as given in [9]. We assume that step 4 of Algorithm 1 is performed according to (2) and $\hat{q}$ is used instead.
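Analysis of Algorithm 1. Let us first consider the first iteration of Algorithm 1 ($i = 0$). We can find an integer $\gamma$ such that $Z_0 = XY_{n_w-1} < 2^{n+\gamma}$. This represents an upper bound of $Z$ ($Z_0$ for $i = 0$). The quotient $q = \lfloor Z_0/M \rfloor$ can now be written as
$$q = \left\lfloor \frac{Z_0}{M} \right\rfloor = \left\lfloor \frac{\frac{Z_0}{2^{n+\beta}} \cdot \frac{2^{n+\alpha}}{M}}{2^{\alpha-\beta}} \right\rfloor,$$
where $\alpha$ and $\beta$ are two variables. The estimation of the given quotient is now equal to
$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z_0}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \frac{\left\lfloor \frac{Z_0}{2^{n+\beta}} \right\rfloor \mu}{2^{\alpha-\beta}} \right\rfloor,$$
where $\mu = \lfloor 2^{n+\alpha}/M \rfloor$ is a constant and may be precomputed. Let us now define the quotient error as a function of the variables $\alpha$, $\beta$, and $\gamma$:
$$e = e(\alpha, \beta, \gamma) = q - \hat{q}.$$
Since $\frac{A}{B} \ge \left\lfloor \frac{A}{B} \right\rfloor > \frac{A}{B} - 1$ for any $A, B \in \mathbb{Z}$, we can write the following inequality:
$$\begin{aligned}
q = \left\lfloor \frac{Z_0}{M} \right\rfloor \ge \hat{q} &> \frac{\left\lfloor \frac{Z_0}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} - 1 \\
&> \frac{\left( \frac{Z_0}{2^{n+\beta}} - 1 \right) \left( \frac{2^{n+\alpha}}{M} - 1 \right)}{2^{\alpha-\beta}} - 1 \\
&= \frac{Z_0}{M} - \frac{Z_0}{2^{n+\alpha}} - \frac{2^{n+\beta}}{M} + \frac{1}{2^{\alpha-\beta}} - 1 \\
&\ge \left\lfloor \frac{Z_0}{M} \right\rfloor - \frac{Z_0}{2^{n+\alpha}} - \frac{2^{n+\beta}}{M} + \frac{1}{2^{\alpha-\beta}} - 1 \\
&= q - \frac{Z_0}{2^{n+\alpha}} - \frac{2^{n+\beta}}{M} + \frac{1}{2^{\alpha-\beta}} - 1.
\end{aligned}$$
Now, since $e \in \mathbb{Z}$, the quotient error can be estimated as
$$e = e(\alpha, \beta, \gamma) \le \left\lfloor 1 + \frac{Z_0}{2^{n+\alpha}} + \frac{2^{n+\beta}}{M} - \frac{1}{2^{\alpha-\beta}} \right\rfloor.$$
According to Algorithm 1, we have $Z_0 < 2^{n+\gamma}$ and $M \ge 2^{n-1}$. Hence, we can evaluate the quotient error as
$$e = e(\alpha, \beta, \gamma) \le \left\lfloor 1 + 2^{\gamma-\alpha} + 2^{\beta+1} - \frac{1}{2^{\alpha-\beta}} \right\rfloor.$$
Following the previous inequality, for $\alpha \ge \gamma + 1$ and $\beta \le -2$ both $2^{\gamma-\alpha}$ and $2^{\beta+1}$ are at most $1/2$, so the bound evaluates to 1, i.e., $e \le 1$.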
Next, we need to ensure that the intermediate remainder $Z_i$ does not grow uncontrollably as $i$ increases. Since $X < M$, $Y_i < 2^w$, $Z_i < M + eM$ and $M < 2^n$, after $i$ iterations, we have
$$\begin{aligned}
Z_i &= Z_{i-1} 2^w + XY_i \\
&< (M + eM) 2^w + M 2^w \\
&< (2 + e) 2^{n+w}.
\end{aligned}$$
Since we want to use the same value for $e$ during the algorithm, the next condition must hold:
$$Z_i < (2 + e) 2^{n+w} < 2^{n+\gamma}.$$
To minimize the quotient error ($e = 1$), we must choose $\gamma$ such that
$$3 \cdot 2^w < 2^\gamma.$$
In other words, we choose $\gamma \ge w + 2$. Now, according to the previous analysis, we can conclude that for $\alpha \ge \gamma + 1$, $\beta \le -2$ and $\gamma \ge w + 2$, we may realize a modular multiplication with only one correction step at the end of the whole process.
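To make the analysis concrete, the following Python sketch replaces the division in step 4 of Algorithm 1 by the estimate $\hat{q}$ from (2), using the parameters $\alpha = w + 3$ and $\beta = -2$. It is an illustrative model under our own assumptions (the function name `barrett_interleaved` and the returned error statistic are hypothetical and only serve to observe the quotient error $e = q - \hat{q}$).

```python
def barrett_interleaved(x, y, m, w):
    """Interleaved multiplication with Barrett-style quotient estimation.

    Assumes 0 <= x, y < m and n = m.bit_length() >= 2.
    Returns (x*y mod m, largest observed quotient error e = q - q_hat).
    """
    n = m.bit_length()
    nw = -(-n // w)                       # ceil(n / w)
    alpha, beta = w + 3, -2
    mu = (1 << (n + alpha)) // m          # precomputed once per modulus
    mask = (1 << w) - 1
    z = 0
    max_err = 0
    for i in range(nw - 1, -1, -1):
        y_i = (y >> (w * i)) & mask
        z = (z << w) + x * y_i
        # q_hat = floor( floor(Z / 2^(n+beta)) * mu / 2^(alpha-beta) )
        q_hat = ((z >> (n + beta)) * mu) >> (alpha - beta)
        max_err = max(max_err, z // m - q_hat)   # exact q used only for checking
        z -= q_hat * m
    if z >= m:                            # single correction step
        z -= m
    return z, max_err
```

The exact quotient `z // m` appears only to measure the error; an implementation would of course omit it.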
The only drawback of the proposed method is the size of the intermediate quotient $\hat{q}$ and the precomputed value $\mu$. Due to the parameters $\alpha$ and $\beta$ chosen in a given way, the size of $\hat{q}$ is $w + 2$ and $\mu$ is at most $w + 4$ bits. This introduces an additional overhead for the software implementations, while it can be easily overcome in the hardware implementations.
Montgomery’s algorithm [21] is the most commonly utilized modular multiplication algorithm today. In contrast to the classical modular multiplication, it utilizes right to left divisions. Given an $n$-digit odd modulus $M$ and an integer $U \in \mathbb{Z}_M$, the image or the Montgomery residue of $U$ is defined as $X = UR \bmod M$, where $R$, the Montgomery radix, is a constant relatively prime to $M$. If $X$ and $Y$ are the images of $U$ and $V$, respectively, the Montgomery multiplication of these two images is defined as
$$X \ast Y \triangleq XYR^{-1} \bmod M.$$
The result is the image of $UV \bmod M$ and needs to be converted back at the end of the process. For the sake of efficient implementation, one usually uses $R = r^{n_w}$, where $r = 2^w$ is the radix of each digit. Similar to a classical modular multiplication based on Barrett reduction, this algorithm uses a precomputed value $M' = -M^{-1} \bmod r = -M_0^{-1} \bmod r$. The algorithm is shown as follows:
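Algorithm 2. Montgomery modular multiplication algorithm
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M = (M_{n_w-1} \ldots M_0)_r$, $M' = -M_0^{-1} \bmod r$, where $0 \le X, Y < M$, $2^{n-1} \le M < 2^n$, $r = 2^w$, $\gcd(M, r) = 1$ and $n_w = \lceil n/w \rceil$.
Output: $Z = XY r^{-n_w} \bmod M$.
1: $Z \leftarrow 0$
2: for $i = 0$ to $n_w - 1$ do
3:   $Z \leftarrow Z + XY_i$
4:   $q_M \leftarrow (Z \bmod r) M' \bmod r$
5:   $Z \leftarrow (Z + q_M M)/r$
6: end for
7: if $Z \ge M$ then
8:   $Z \leftarrow Z - M$
9: end if
10: return $Z$.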
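Algorithm 2 can be modeled in Python as shown below. This is a sketch under our own assumptions: the function name `montgomery_modmul` is hypothetical, and the constant $M'$ is obtained with the built-in `pow(m, -1, r)`, which requires Python 3.8 or later.

```python
def montgomery_modmul(x, y, m, w):
    """Interleaved Montgomery multiplication (Algorithm 2), radix r = 2**w.

    Assumes m is odd and 0 <= x, y < m. Returns x*y*r^(-nw) mod m.
    """
    r = 1 << w
    n = m.bit_length()
    nw = -(-n // w)                        # ceil(n / w)
    m_prime = (-pow(m, -1, r)) % r         # M' = -M^(-1) mod r
    z = 0
    for i in range(nw):
        y_i = (y >> (w * i)) & (r - 1)     # digits taken least significant first
        z += x * y_i                       # step 3
        q_m = ((z & (r - 1)) * m_prime) & (r - 1)   # step 4
        z = (z + q_m * m) >> w             # step 5: division by r is exact
    if z >= m:
        z -= m
    return z
```

Step 4 is the single-precision multiplication by a precomputed value that the methods of Section 3 remove.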
2.2 Related Work
Before introducing related work, we note here that for the moduli
used in all common ECC cryptosystems, the modular reduction
can be done much faster than with the methods proposed by Barrett or
Montgomery, even without any multiplication. This is the reason
behind standardizing generalized Mersenne prime moduli (sums/
differences of a few powers of 2) [22], [1], [26].
The idea of simplifying an intermediate quotient evaluation
was first presented by Quisquater [23] at the rump session of
Eurocrypt ’90. The method is similar to the one of Barrett except
that the modulus $M$ is preprocessed before the modular multiplication
in such a way that the evaluation of the intermediate
quotient $q$ basically comes for free. Preprocessing requires some
extra memory and computational time, but the latter is negligible
when many modular multiplications are performed using the same
modulus.
Lenstra [16] points out that choosing moduli with a predeter-
mined portion is beneficial both for storage and computational
requirements. He proposes a way to generate RSA moduli with
any number of predetermined leading (trailing) bits, with the
fraction of specified bits only limited by security considerations.
Furthermore, Lenstra discusses security issues and concludes that
the resulting moduli do not seem to offer less security than regular
RSA moduli. In [13], Joye enhances the method for generating
RSA moduli with a predetermined portion proposed in [16].
In [12], Hars proposes a long modular multiplication method
that also simplifies an intermediate quotient evaluation. The
method is based on Quisquater’s algorithm and requires a
preprocessing of the modulus by increasing its length. The
algorithm contains conditional branches that depend on the sign
of the intermediate remainder. That increases the complexity of
the algorithm, especially concerning the hardware implementa-
tions where additional control logic needs to be added.
In this paper, we propose four sets of moduli that specifically
target efficient modular multiplication by means of classical
modular multiplication based on general Barrett reduction [9] and
Montgomery modular multiplication [21]. In addition to simplified
quotient evaluation, our algorithms do not require any additional
preprocessing. The algorithms are simple and especially suitable for
hardware implementations. They contain no conditional branches
inside the loop, and hence, require a very simple control logic. Note
that the same algorithms are applicable to general moduli if the
preprocessing described in [12] is performed beforehand.
The methods describing how to generate such moduli in case of
RSA are discussed in [16], [13]. Furthermore, from the sets
proposed in this paper, one can also choose the primes that
generate the RSA modulus to speed up a decryption of RSA by
means of the Chinese Remainder Theorem (CRT). In Section 4, we
discuss security issues concerning this matter.
3 THE PROPOSED MODULAR MULTIPLICATION
METHODS FOR INTEGERS
In both Barrett and Montgomery modular multiplications, the precomputed values of either modulus reciprocal ($\mu$) or modulus inverse ($M'$) are used in order to avoid multiple-precision divisions. However, single-precision multiplications still need to be performed (step 4 of Algorithms 1 and 2). This especially concerns the hardware implementations, as the multiplication with the precomputed values often occurs within the critical path of the whole design. Section 5 discusses this issue in more detail.
Let us, for now, assume that the precomputed values $\mu$ and $M'$ are both of type $\pm 2^\delta - \varepsilon$, where $\delta \in \mathbb{Z}$ and $\varepsilon \in \{0, 1\}$. By tuning $\mu$ and $M'$ to be of this special type, we transform a single-precision multiplication with these values into a simple shift operation in hardware. Therefore, we find sets of moduli for which the precomputed values are both of type $\pm 2^\delta - \varepsilon$.
3.1 Speeding Up Classical Modular Multiplication
Before describing the actual algorithm, we provide two lemmas to
make the following explanation easier:
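Lemma 1. Let $M = 2^n - \Delta$ be an $n$-digit positive integer in radix 2 representation and let $\mu = \lfloor 2^{n+\alpha}/M \rfloor$, where $\alpha \in \mathbb{N}$. If $0 < \Delta \le \left\lfloor \frac{2^n}{1 + 2^\alpha} \right\rfloor$, then
$$\mu = 2^\alpha. \qquad (3)$$
Proof of Lemma 1. Rewrite $2^{n+\alpha}$ as
$$2^{n+\alpha} = M 2^\alpha + 2^\alpha \Delta.$$
Since it is given that $0 < \Delta \le \left\lfloor \frac{2^n}{1 + 2^\alpha} \right\rfloor$, we conclude that $0 < 2^\alpha \Delta < M$. By the definition of Euclidean division, this shows that $\mu = 2^\alpha$. □
Lemma 2. Let $M = 2^{n-1} + \Delta$ be an $n$-digit positive integer in radix 2 representation and let $\mu = \lfloor 2^{n+\alpha}/M \rfloor$, where $\alpha \in \mathbb{N}$. If $0 < \Delta \le \left\lfloor \frac{2^{n-1}}{2^{\alpha+1} - 1} \right\rfloor$, then
$$\mu = 2^{\alpha+1} - 1. \qquad (4)$$
Proof of Lemma 2. Rewrite $2^{n+\alpha}$ as
$$2^{n+\alpha} = M (2^{\alpha+1} - 1) + 2^{n-1} - \Delta (2^{\alpha+1} - 1).$$
Since $0 < \Delta \le \left\lfloor \frac{2^{n-1}}{2^{\alpha+1} - 1} \right\rfloor$, we conclude that $0 \le 2^{n-1} - \Delta (2^{\alpha+1} - 1) < M$. By the definition of Euclidean division, this shows that $\mu = 2^{\alpha+1} - 1$. □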
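Both lemmas are easy to confirm numerically for small parameters. The sketch below, which is our own illustration with hypothetical function names, enumerates every admissible $\Delta$ for a given $n$ and $\alpha$ and checks that the reciprocal $\mu$ takes the value claimed in (3) and (4).

```python
def reciprocal(m, n, alpha):
    """mu = floor(2^(n+alpha) / M)."""
    return (1 << (n + alpha)) // m


def check_lemmas_1_2(n, alpha):
    """Exhaustively verify Lemmas 1 and 2 for all admissible Delta."""
    # Lemma 1: M = 2^n - Delta, 0 < Delta <= floor(2^n / (1 + 2^alpha))
    bound1 = (1 << n) // (1 + (1 << alpha))
    for delta in range(1, bound1 + 1):
        if reciprocal((1 << n) - delta, n, alpha) != 1 << alpha:
            return False
    # Lemma 2: M = 2^(n-1) + Delta, 0 < Delta <= floor(2^(n-1) / (2^(alpha+1) - 1))
    bound2 = (1 << (n - 1)) // ((1 << (alpha + 1)) - 1)
    for delta in range(1, bound2 + 1):
        if reciprocal((1 << (n - 1)) + delta, n, alpha) != (1 << (alpha + 1)) - 1:
            return False
    return True
```

An exhaustive check is only feasible for toy sizes such as $n = 16$; for cryptographic sizes the lemmas themselves provide the guarantee.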
The interleaved modular multiplication algorithm based on general Barrett reduction is given in Section 2. Now, according to Lemmas 1 and 2, we can define two sets of moduli for which the modular multiplication based on Barrett modular reduction can be improved. These sets are of type:
$$\begin{aligned}
S_1 &: M = 2^n - \Delta, \text{ where } 0 < \Delta \le \left\lfloor \frac{2^n}{1 + 2^\alpha} \right\rfloor; \\
S_2 &: M = 2^{n-1} + \Delta, \text{ where } 0 < \Delta \le \left\lfloor \frac{2^{n-1}}{2^{\alpha+1} - 1} \right\rfloor.
\end{aligned} \qquad (5)$$
Fig. 1 further illustrates the properties of the two proposed sets $S_1$ and $S_2$. As can be seen in the figure, approximately $\alpha$ bits of the modulus are fixed to be all 0s or all 1s, while the other $n - \alpha$ bits are arbitrarily chosen.$^1$
The proposed modular multiplication algorithm is shown in Algorithm 3. The parameters $\alpha$ and $\beta$ are important for the quotient evaluation. As we show later, to minimize the error in quotient evaluation, $\alpha$ and $\beta$ are chosen such that $\alpha = w + 3$ and $\beta = -2$. The same values of the parameters are obtained in [9] for the classical modular multiplication based on Barrett reduction.
Fig. 1. Binary representation of the proposed sets $S_1$ and $S_2$.
1. If $M_{n-\alpha-2} = 1$ for $M \in S_1$ ($M_{n-\alpha-1} = 0$ for $M \in S_2$), then the remaining $n - \alpha - 2$ ($n - \alpha - 1$) least significant bits can be arbitrarily chosen. Otherwise, if $M_{n-\alpha-2} = 0$ ($M_{n-\alpha-1} = 1$), then the remaining $n - \alpha - 2$ ($n - \alpha - 1$) least significant bits are chosen such that (5) is satisfied.
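Algorithm 3. Proposed interleaved modular multiplication based on generalized Barrett reduction ($\alpha = w + 3$ and $\beta = -2$)
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M \in S_1 \cup S_2$, where $0 \le X, Y < M$, $r = 2^w$ and $n_w = \lceil n/w \rceil$.
Output: $Z = XY \bmod M$.
$Z \leftarrow 0$
for $i = n_w - 1$ downto 0 do
  $Z \leftarrow Z 2^w + XY_i$
  $\hat{q} \leftarrow \lfloor Z/2^n \rfloor$ if $M \in S_1$; $\hat{q} \leftarrow \lfloor Z/2^{n-1} \rfloor$ if $M \in S_2$
  $Z \leftarrow Z - \hat{q} M$
end for
if $Z \ge M$ then
  $Z \leftarrow Z - M$ // At most 1 subtraction is needed.
end if
while $Z < 0$ do
  $Z \leftarrow Z + M$ // At most 2 additions are needed.
end while
return $Z$.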
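A software model of Algorithm 3 might look as follows. This is a sketch under stated assumptions rather than the authors' implementation: the name `proposed_barrett_modmul` and the flag `in_s1` are hypothetical, and the final correction is written with counting loops (instead of the single `if` of the listing) so that the number of corrections can be observed. Python's `>>` floors toward negative infinity, which matches the floor function used in the paper when $Z$ is negative.

```python
def proposed_barrett_modmul(x, y, m, w, in_s1):
    """Algorithm 3 for M in S1 (in_s1=True) or S2 (in_s1=False), alpha = w + 3.

    Assumes 0 <= x, y < m. Returns (x*y mod m, subtractions, additions).
    """
    n = m.bit_length()
    nw = -(-n // w)                        # ceil(n / w)
    shift = n if in_s1 else n - 1          # quotient is just a right shift
    mask = (1 << w) - 1
    z = 0
    for i in range(nw - 1, -1, -1):
        y_i = (y >> (w * i)) & mask
        z = (z << w) + x * y_i
        q_hat = z >> shift                 # no multiplication by mu needed
        z -= q_hat * m                     # may become negative for M in S2
    subs = adds = 0
    while z >= m:                          # correction, counted for inspection
        z -= m
        subs += 1
    while z < 0:
        z += m
        adds += 1
    return z, subs, adds
```

For example, with $n = 64$ and $w = 8$ (so $\alpha = 11$), a modulus in $S_1$ can be formed as $2^{64} - \Delta$ with $0 < \Delta \le \lfloor 2^{64}/(1 + 2^{11}) \rfloor$.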
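In contrast to the classical modular multiplication based on Barrett reduction where the quotient is evaluated as
$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor,$$
in our proposed algorithm, the evaluation basically comes for free:
$$\hat{q} = \begin{cases} \left\lfloor \frac{Z}{2^n} \right\rfloor, & \text{if } M \in S_1; \\ \left\lfloor \frac{Z}{2^{n-1}} \right\rfloor, & \text{if } M \in S_2. \end{cases}$$
This saves one single-precision multiplication and additionally increases the speed of the proposed modular multiplication algorithm.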
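Proof of Algorithm 3. To prove the correctness of the algorithm, we need to show that there exist $\alpha, \beta \in \mathbb{Z}$, such that $\hat{q}$ can indeed be represented as
$$\hat{q} = \begin{cases} \left\lfloor \frac{Z}{2^n} \right\rfloor, & \text{if } M \in S_1; \\ \left\lfloor \frac{Z}{2^{n-1}} \right\rfloor, & \text{if } M \in S_2. \end{cases}$$
As shown in the analysis of Algorithm 1, to have the minimized quotient error, the parameters $\alpha$ and $\beta$ need to be chosen such that $\alpha \ge w + 3$ and $\beta \le -2$. Let us first assume that $M \in S_1$. According to Lemma 1, it follows that $\mu = 2^\alpha$. Now, $\hat{q}$ becomes equal to
$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \mu}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor 2^\alpha}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor 2^\beta \right\rfloor.$$
For $\beta \le 0$, the previous equation becomes equivalent to
$$\hat{q} = \left\lfloor \frac{Z}{2^n} \right\rfloor.$$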
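For the case where $M \in S_2$, we have, according to Lemma 2, that $\mu = 2^{\alpha+1} - 1$. Now, $\hat{q}$ becomes equal to
$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor (2^{\alpha+1} - 1)}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor 2^{\beta+1} \left( 1 - \frac{1}{2^{\alpha+1}} \right) \right\rfloor.$$
To further simplify the proof, we choose $\beta = -2$ and the previous equation becomes equivalent to
$$\hat{q} = \left\lfloor \left\lfloor \frac{Z}{2^{n-2}} \right\rfloor \frac{1}{2} \left( 1 - \frac{1}{2^{\alpha+1}} \right) \right\rfloor.$$
If we choose $\alpha$ such that
$$2^{\alpha+1} > \max \left\{ \left\lfloor \frac{Z}{2^{n-2}} \right\rfloor \right\}, \qquad (6)$$
the expression of $\hat{q}$ simplifies to
$$\hat{q} = \begin{cases} \left\lfloor \frac{Z}{2^{n-1}} \right\rfloor - 1, & \text{if } 2 \mid \left\lfloor \frac{Z}{2^{n-2}} \right\rfloor; \\ \left\lfloor \frac{Z}{2^{n-1}} \right\rfloor, & \text{if } 2 \nmid \left\lfloor \frac{Z}{2^{n-2}} \right\rfloor. \end{cases} \qquad (7)$$
The inequality (6) can be written as
$$2^{\alpha+1} > \left\lfloor \frac{\max\{Z\}}{2^{n-2}} \right\rfloor,$$
where $\max\{Z\}$ is evaluated in the analysis of Algorithm 1 and given as $\max\{Z\} = (2 + e) 2^{n+w}$. To have the minimal error, we choose $e = 1$ and get the following relation:
$$2^{\alpha+1} > \left\lfloor \frac{3 \cdot 2^{n+w}}{2^{n-2}} \right\rfloor = \lfloor 3 \cdot 2^{w+2} \rfloor.$$
The latter inequality is satisfied for $\alpha \ge w + 3$.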
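The case distinction in (7) can be checked numerically. In the sketch below, which is our own illustration with hypothetical function names, `t` stands for $\lfloor Z/2^{n-2} \rfloor$, so that $\lfloor Z/2^{n-1} \rfloor$ equals `t // 2`; the check covers all positive `t` allowed by condition (6).

```python
def q_hat_s2(t, alpha):
    """floor(t * mu / 2^(alpha - beta)) with mu = 2^(alpha+1) - 1 and beta = -2."""
    return (t * ((1 << (alpha + 1)) - 1)) >> (alpha + 2)


def matches_eq7(alpha):
    """Check expression (7) for every t with 0 < t < 2^(alpha+1)."""
    for t in range(1, 1 << (alpha + 1)):
        # even t: floor(Z / 2^(n-1)) - 1, odd t: floor(Z / 2^(n-1))
        expected = t // 2 - 1 if t % 2 == 0 else t // 2
        if q_hat_s2(t, alpha) != expected:
            return False
    return True
```

The value `alpha = 11` corresponds to $w = 8$ under the choice $\alpha = w + 3$.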
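If, instead of (7), we use only $\hat{q} = \lfloor Z/2^{n-1} \rfloor$, the evaluation of the intermediate quotient $\hat{q}$ will, for $2 \mid \lfloor Z/2^{n-2} \rfloor$, become greater than or equal to the real intermediate quotient $q$. Due to this fact, $Z$ can become negative at the end of the current iteration. Hence, we need to consider the case where $Z < 0$. Let us prevent $Z$ from an uncontrollable decrease by putting a lower bound with $Z > -2^{n+\gamma}$, where $\gamma \in \mathbb{Z}$. Since $\frac{A}{B} \ge \left\lfloor \frac{A}{B} \right\rfloor > \frac{A}{B} - 1$ for any $A, B \in \mathbb{Z}$, we can write the following inequality (note that $Z < 0$ and $M > 0$):
$$\begin{aligned}
\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor &\le \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \\
&< \frac{\frac{Z}{2^{n+\beta}} \left( \frac{2^{n+\alpha}}{M} - 1 \right)}{2^{\alpha-\beta}} \\
&= \frac{Z}{M} - \frac{Z}{2^{n+\alpha}} \\
&< \left\lfloor \frac{Z}{M} \right\rfloor + 1 - \frac{Z}{2^{n+\alpha}} \\
&= q + 1 - \frac{Z}{2^{n+\alpha}} \\
&< q + 1 + 2^{\gamma-\alpha}.
\end{aligned}$$
Now, since $q, \hat{q}, e \in \mathbb{Z}$, we choose $\alpha \ge \gamma + 1$ and the quotient error gets estimated as $-1 \le e \le 0$. If in the next iteration, it again happens that $2 \mid \lfloor Z/2^{n-2} \rfloor$, the quotient error will become $-2 \le e \le 0$.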
Finally, to assure that $Z$ will remain within the bounds during the $i$th iteration, we write
$$\begin{aligned}
Z_i &= Z_{i-1} 2^w + XY_i \\
&= (Z_{i-2} - qM + eM) 2^w + XY_i \\
&> (0 + eM) 2^w + 0 \\
&> e 2^{n+w} > -2^{n+\gamma}.
\end{aligned}$$
The worst case is when $e = -2$, and then, it must hold $\gamma > w + 1$. By choosing $\alpha = w + 3$ and $\beta = -2$, all conditions are satisfied, and hence, $\hat{q}$ is indeed a good estimate of $q$. At most one subtraction or two additions at the correction step are required to obtain $Z = XY \bmod M$. □
3.2 Speeding Up Montgomery Modular Multiplication
Similar to Lemmas 1 and 2, we also have Lemmas 3 and 4 that are at the heart of the proposed modular multiplication algorithm based on Montgomery reduction.
Lemma 3. Let $M = \Delta 2^w + 1$ be an $n$-digit positive integer in radix 2 representation, i.e., $2^{n-w-1} \le \Delta < 2^{n-w}$, and let $M' = -M^{-1} \bmod 2^w$, where $w \in \mathbb{N}$, then
$$M' = -1. \qquad (8)$$
Proof of Lemma 3. Since $M \equiv 1 \bmod 2^w$, we clearly have $-M^{-1} \equiv -1 \bmod 2^w$. □
Lemma 4. Let $M = \Delta 2^w - 1$ be an $n$-digit positive integer in radix 2 representation, i.e., $2^{n-w-1} < \Delta \le 2^{n-w}$, and let $M' = -M^{-1} \bmod 2^w$, where $w \in \mathbb{N}$, then
$$M' = 1. \qquad (9)$$
Proof of Lemma 4. Since $M \equiv -1 \bmod 2^w$, we clearly have $-M^{-1} \equiv 1 \bmod 2^w$. □
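Lemmas 3 and 4 can be confirmed with a few lines of Python. The sketch is our own illustration (hypothetical function names; `pow(m, -1, r)` needs Python 3.8 or later). Note that the residue $-1 \bmod 2^w$ is represented as $2^w - 1$ when reduced to the range $[0, 2^w)$.

```python
def montgomery_constant(m, w):
    """M' = -M^(-1) mod 2^w, reduced to the range [0, 2^w)."""
    r = 1 << w
    return (-pow(m, -1, r)) % r


def check_lemmas_3_4(w, count):
    """Check M' for M = Delta*2^w + 1 and M = Delta*2^w - 1, Delta = 1..count."""
    r = 1 << w
    for delta in range(1, count + 1):
        if montgomery_constant(delta * r + 1, w) != r - 1:   # -1 mod 2^w
            return False
        if montgomery_constant(delta * r - 1, w) != 1:
            return False
    return True
```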
According to the previous two lemmas, we can easily find two sets of moduli for which the precomputation step in Montgomery multiplication can be excluded. The resulting algorithm is shown in Algorithm 4. The proposed sets are of type
$$\begin{aligned}
S_3 &: M = \Delta 2^w + 1, \text{ where } 2^{n-w-1} \le \Delta < 2^{n-w}; \\
S_4 &: M = \Delta 2^w - 1, \text{ where } 2^{n-w-1} < \Delta \le 2^{n-w}.
\end{aligned} \qquad (10)$$
Fig. 2 further illustrates the properties of the two proposed sets $S_3$ and $S_4$. As can be seen in the figure, $w - 1$ bits of the modulus are fixed to be all 0s or all 1s, while the other $n - w + 1$ bits are arbitrarily chosen. To fulfill the condition $\gcd(M, r) = 1$ (see Algorithm 2), the least significant bit of $M$ is set to 1.
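Algorithm 4. Proposed interleaved modular multiplication based on Montgomery modular reduction.
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M \in S_3 \cup S_4$, where $0 \le X, Y < M$, $r = 2^w$ and $n_w = \lceil n/w \rceil$.
Output: $Z = XY r^{-n_w} \bmod M$.
$Z \leftarrow 0$
for $i = 0$ to $n_w - 1$ do
  $Z \leftarrow Z + XY_i$
  $q_M \leftarrow -Z \bmod r$ if $M \in S_3$; $q_M \leftarrow Z \bmod r$ if $M \in S_4$
  $Z \leftarrow (Z + q_M M)/r$
end for
if $Z \ge M$ then
  $Z \leftarrow Z - M$
end if
return $Z$.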
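The listing above translates into the following Python model. It is a sketch under our own assumptions (the name `proposed_montgomery_modmul` and the flag `in_s3` are hypothetical). The Montgomery quotient is obtained by masking the low $w$ bits of $Z$ or of $-Z$, so no multiplication by $M'$ occurs.

```python
def proposed_montgomery_modmul(x, y, m, w, in_s3):
    """Algorithm 4 for M in S3 (M = 1 mod 2^w) or S4 (M = -1 mod 2^w).

    Assumes 0 <= x, y < m. Returns x*y*r^(-nw) mod m with r = 2**w.
    """
    mask = (1 << w) - 1
    n = m.bit_length()
    nw = -(-n // w)                      # ceil(n / w)
    z = 0
    for i in range(nw):
        y_i = (y >> (w * i)) & mask
        z += x * y_i
        # q_M = -Z mod r for S3, Z mod r for S4 (low w bits only)
        q_m = (-z) & mask if in_s3 else z & mask
        z = (z + q_m * m) >> w           # exact: Z + q_M*M is divisible by r
    if z >= m:
        z -= m
    return z
```

The result carries the usual factor $r^{-n_w}$ and has to be converted back from the Montgomery domain, exactly as for Algorithm 2.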
Due to the use of a special type of moduli, the evaluation of the intermediate Montgomery quotient is simplified compared to the original algorithm given in Algorithm 2. Since, in our case, the value of $M'$ is simply equal to 1 or $-1$, the Montgomery quotient $q_M = (Z \bmod r) M' \bmod r$ now becomes
$$q_M = \begin{cases} -Z \bmod r, & \text{if } M \in S_3; \\ Z \bmod r, & \text{if } M \in S_4. \end{cases}$$
Since $r = 2^w$, the evaluation of $q_M$ basically comes for free.
Proof of Algorithm 4. Follows immediately from Lemmas 3 and 4. □
4 SECURITY CONSIDERATIONS
In this section, we analyze the security implications of choosing primes in one of the sets $S_1$, $S_2$, $S_3$, $S_4$ for use in ECC/HECC and in RSA.
In the current state of the art, the security of ECC/HECC over finite fields GF($q$) only depends on the extension degree of the field [2]. Therefore, the security does not depend on the precise structure of the prime $p$. This is illustrated by the particular choices for $p$ that have been made in several standards such as SEC [26], NIST [22], ANSI [1]. In particular, the following primes have been proposed: $p_{192} = 2^{192} - 2^{64} - 1$, $p_{224} = 2^{224} - 2^{96} + 1$, $p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$, $p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$, and $p_{521} = 2^{521} - 1$. It is easy to verify that for $w \le 28$, all primes are in one of the proposed sets. As such, at least one of our methods applies for all primes included in the standards. In conclusion, choosing a prime of prescribed structure has no influence on the security of ECC/HECC.
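The membership claim can be checked mechanically. The following Python sketch is our own illustration: the function name `member_sets` is hypothetical, and it assumes $\alpha = w + 3$ (as in Algorithm 3) when testing the conditions (5) for $S_1$ and $S_2$, while $S_3$ and $S_4$ are tested through the residue of $M$ modulo $2^w$.

```python
STANDARD_PRIMES = {
    "p192": 2**192 - 2**64 - 1,
    "p224": 2**224 - 2**96 + 1,
    "p256": 2**256 - 2**224 + 2**192 + 2**96 - 1,
    "p384": 2**384 - 2**128 - 2**96 + 2**32 - 1,
    "p521": 2**521 - 1,
}


def member_sets(m, w):
    """Return the names of the proposed sets that contain the n-bit modulus m."""
    n = m.bit_length()
    alpha = w + 3                                   # assumption: alpha = w + 3
    found = []
    delta1 = (1 << n) - m                           # M = 2^n - Delta
    if 0 < delta1 <= (1 << n) // (1 + (1 << alpha)):
        found.append("S1")
    delta2 = m - (1 << (n - 1))                     # M = 2^(n-1) + Delta
    if 0 < delta2 <= (1 << (n - 1)) // ((1 << (alpha + 1)) - 1):
        found.append("S2")
    if m % (1 << w) == 1:                           # M = Delta*2^w + 1
        found.append("S3")
    if m % (1 << w) == (1 << w) - 1:                # M = Delta*2^w - 1
        found.append("S4")
    return found
```

Calling `member_sets(p, 28)` for each entry of `STANDARD_PRIMES` lists the sets to which that prime belongs for the digit size $w = 28$.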
The case of RSA requires a more detailed analysis than ECC/HECC. First, we assume that the modulus $N$ is chosen from one of the proposed sets. This is a special case of the security analysis given in [16] followed by the conclusion that the resulting moduli do not seem to offer less security than regular RSA moduli.
Next, we assume that the primes $p$ and $q$ that constitute the modulus $N = pq$ both are chosen in one of the sets $S_i$. To analyze the security implications of the restricted choice of $p$ and $q$, we first make a trivial observation. The number of $n$-bit primes in the sets $S_i$ for $n > 259 + w$ is large enough such that exhaustive listing of these sets is impossible, since a maximum of $w + 3$ bits are fixed. The security analysis then corresponds to attacks on RSA with partially known factorization. This problem has been analyzed extensively in the literature and the first results come from Rivest and Shamir [24] in 1985. They describe an algorithm that factors $N$ in polynomial time if $2/3$ of the bits of $p$ or $q$ are known. In 1995, Coppersmith [6] improves this bound to $3/5$.
Today’s best attacks all rely on variants of Coppersmith’s algorithm published in 1996 [8], [7]. A good overview of these algorithms is given in [17], [18]. The best results in this area are as follows: Let $N$ be an $n$-bit number, which is a product of two $n/2$-bit primes. If half of the bits of either $p$ or $q$ (or both) are known, then $N$ can be factored in polynomial time. If less than half of the bits are known, say $n/4 - \varepsilon$ bits, then the best algorithm simply guesses $\varepsilon$ bits, and then, applies the polynomial-time algorithm, leading to a running time exponential in $\varepsilon$. In practice, the values of $w$ (typically, $w \le 64$) and $n$ ($n \ge 1{,}024$) are always such that our proposed moduli remain secure against Coppersmith’s factorization algorithm, since at most $w + 3$ bits of $p$ and $q$ are known.
Finally, we consider a similar approach extended to moduli of the form $N = p^r q$, where $p$ and $q$ have the same bit size. This extension was proposed by Boneh et al. [5]. Assuming that $p$ and $q$ are of the same bit size, one needs a $1/(r+1)$-fraction of the most significant bits of $p$ in order to factor $N$ in polynomial time. In other words, for the case $r = 1$, we need half of the bits, whereas for, e.g., $r = 2$, we need only a third of the most significant bits of $p$. These results show that the primes $p, q \in S$, assembling an RSA modulus of the form $N = p^r q$, should be used with care. This is especially true when $r$ is large. Note that if $r \approx \log p$, the latter factoring method factors $N$ in polynomial time for any primes $p, q \in \mathbb{N}$.
Fig. 2. Binary representation of the proposed sets $S_3$ and $S_4$.
5 HARDWARE IMPLEMENTATION OF THE PROPOSED
ALGORITHMS
A typical architecture that describes an interleaved modular multiplier is shown in Fig. 3. Both Barrett and Montgomery algorithms can be implemented based on this architecture. The architecture consists of two multiple-precision multipliers ($\pi_1$ and $\pi_2$) and one single-precision multiplier ($\pi_3$). Apart from the multipliers, the architecture contains an additional adder denoted by $\Sigma$. Having two multiple-precision multipliers may seem redundant at first glance, but the multiplier $\pi_1$ uses data from $x$ and $y$ that are fixed during a single modular multiplication. Now, by running $\pi_1$ and $\pi_2$ in parallel, we speed up the whole multiplication process. If the target is a more compact design, one can also use a single multiple-precision multiplier which does not reduce the generality of our discussion.
Multipliers $\pi_1$ and $\pi_2$ perform multiplications at lines 3 and 5 of both Algorithms 1 and 2, respectively. A multiplication performed in step 4 of both algorithms is done by multiplier $\pi_3$. An eventual shift of the register $Z$ is handled by the controller. The exact schedule of the functional parts of the multiplier is as follows: $\pi_1 \to \Sigma \to \pi_1 \pi_2 \pi_3 \to \Sigma \to \Sigma \to \pi_1 \pi_2 \pi_3 \to \Sigma \to \Sigma \to \cdots$ In case of generalized Barrett reduction [9], the precomputed value $\mu$ is $\lambda = w + 4$ bits long, while for the case of Montgomery, the precomputed value $M'$ is $\lambda = w$ bits long. Due to the generalized Barrett’s algorithm, the multiplier $\pi_2$ uses the most significant $\lambda$ bits of the product calculated by $\pi_3$, while for the case of Montgomery, it uses the least significant $\lambda$ bits of the same product. This is indeed a reason for Montgomery’s multiplier being superior compared to the one of Barrett.
The critical path of the whole design occurs from the output of the register $Z$ to the input of the temporary register in $\pi_2$, passing through two single-precision multipliers and one adder (bold line). To show this in practice, we have synthesized 192, 256, and 512-bit multipliers, each with a digit size of 32 bits. The code was first written in GEZEL [11] and tested for its functionality, and then translated to VHDL and synthesized using the Synopsys Design Compiler version Y-2006.06. The library we used was the UMC 0.13 μm CMOS High-Speed standard cell library. The results can be found in Table 1. The size of the designs is given as the number of NAND gate equivalences (GEs).
TABLE 1. Synthesis Results for the Hardware Architectures of 192, 256, and 512-Bit Modular Multipliers
Fig. 3. Architecture for an interleaved modular multiplier based on Barrett or Montgomery reduction.
Fig. 4. Architecture for an interleaved modular multiplier based on modified Barrett or modified Montgomery reduction.
A major improvement of the new algorithms is the simplified
quotient evaluation. This fact results in the new proposed
architecture for the efficient modular multiplier, as shown in
Fig. 4. It consists of two multiple-precision multipliers ($\pi_1$ and $\pi_2$)
only. The most important difference is that there are no multi-
plications with the precomputed values, and hence, the critical
path contains one single-precision multiplier and one adder only
(bold line). To compare the performance with the architecture
proposed in Fig. 3, we have again synthesized a number of
multipliers using the same standard cell library.
The results are given in Table 1 and show that, in terms of clock frequency,
our proposed architecture outperforms the modular multiplier
based on standard Barrett’s reduction by up to 52 percent. Compared to the
architecture based on Montgomery’s reduction, it results in a
relative speedup of up to 31 percent. Additionally, designs based
on our algorithms demonstrate area savings ranging from 3.5 to
14 percent. Note here that the obtained results are based on the
synthesis only. After the place and route are performed, we
expect a decrease of the performance for both implemented
multipliers, and hence, believe that the relative speedup will
approximately remain the same.
Finally, it is interesting to consider a choice of the digit size.
As discussed in the previous section, the upper bound is decided
by security margins. A typical digit size of 8, 16, 32, or 64 bits
seems to provide a reasonable security margin for the RSA
modulus of 512 bits or more. On the other hand, with the increase
of digit size, the number of cycles decreases for the whole design
and the overall speedup is increasing. It is also obvious that the
larger digit size implies the larger circuit, and thus, the
performance trade-off concerning throughput and area would
be interesting to explore.
6 CONCLUSION
In this work, we proposed two interleaved modular multiplication
algorithms based on Barrett and Montgomery modular reductions.
We introduced two sets of moduli for the algorithm based on
Barrett reduction and two sets of moduli for the algorithm based on
Montgomery reduction. These sets contain moduli with a
prescribed number (typically, the digit size) of zero/one bits,
either in the most significant or least significant part. Due to this
choice, our algorithms have no precomputational phase and have a
simplified quotient evaluation, which makes them more flexible
and efficient than existing solutions.
Following the same principles as described in the paper, this
approach can be easily extended to finite fields of characteristic two.
ACKNOWLEDGMENTS
This work is supported in part by the IAP Programme P6/26
BCRYPT of the Belgian State, by FWO project G.0300.07, by the
European Commission under contract number ICT-2007-216676
ECRYPT NoE phase II, and by K.U. Leuven-BOF (OT/06/40).
REFERENCES
[1] ANSI, “ANSI X9.62 The Elliptic Curve Digital Signature Algorithm
(ECDSA),” http://www.ansi.org, 2010.
[2] R.M. Avanzi, H. Cohen, C. Doche, G. Frey, T. Lange, K. Nguyen, and F.
Vercauteren, Handbook of Elliptic and Hyperelliptic Curve Cryptography.
CRC Press, 2005.
[3] P. Barrett, “Communications Authentication and Security Using Public Key
Encryption—A Design for Implementation,” master’s thesis, Oxford Univ.,
1984.
[4] P. Barrett, “Implementing the Rivest Shamir and Adleman Public Key
Encryption Algorithm on a Standard Digital Signal Processor,” Proc. Ann.
Int’l Cryptology Conf. Advances in Cryptology (CRYPTO ’86), pp. 311-323,
1986.
[5] D. Boneh, G. Durfee, and N. Howgrave-Graham, “Factoring $N = p^r q$ for
Large $r$,” Proc. 19th Ann. Int’l Cryptology Conf. Advances in Cryptology
(CRYPTO ’99), pp. 326-337, 1999.
[6] D. Coppersmith, “Factoring with a Hint,” IBM Research Report RC 19905,
1995.
[7] D. Coppersmith, “Finding a Small Root of a Bivariate Integer Equation;
Factoring with High Bits Known,” Proc. Int’l Conf. Theory and Application of
Cryptographic Techniques (Eurocrypt ’96), 1996.
[8] D. Coppersmith, “Small Solutions to Polynomial Equations, and Low
Exponent Vulnerabilities,” J. Cryptology, vol. 10, no. 4, pp. 233-260, 1996.
[9] J.-F. Dhem, “Modified Version of the Barrett Algorithm,” technical report,
1994.
[10] W. Diffie and M.E. Hellman, “New Directions in Cryptography,” IEEE
Trans. Information Theory, vol. IT-22, no. 6, pp. 644-654, Nov. 1976.
[11] GEZEL, http://www.ee.ucla.edu/~schaum/gezel, 2010.
[12] L. Hars, “Long Modular Multiplication for Cryptographic Applica-
tions,” Proc. Int’l Workshop Cryptographic Hardware and Embedded Systems
(CHES ’04), pp. 218-254, 2004.
[13] M. Joye, “RSA Moduli with a Predetermined Portion: Techniques and
Applications,” Proc. Information Security Practice and Experience Conf.,
pp. 116-130, 2008.
[14] N. Koblitz, “A Family of Jacobians Suitable for Discrete Log Cryptosys-
tems,” Proc. Ann. Int’l Cryptology Conf. Advances in Cryptology (CRYPTO ’88),
pp. 94-99, 1988.
[15] N. Koblitz, “Elliptic Curve Cryptosystem,” Math. of Computation, vol. 48,
pp. 203-209, 1987.
[16] A. Lenstra, “Generating RSA Moduli with a Predetermined Portion,” Proc.
Advances in Cryptology (ASIACRYPT ’98), pp. 1-10, 1998.
[17] A. May, “New RSA Vulnerabilities Using Lattice Reduction Methods,”
PhD thesis, Univ. of Paderborn, 2003.
[18] A. May, “Using LLL-Reduction for Solving RSA and Factorization
Problems: A Survey,” http://www.informatik.tu-darmstadt.de/KP/
publications/07/lll.pdf, 2007.
[19] A. Menezes, P. van Oorschot, and S. Vanstone, Handbook of Applied
Cryptography. CRC Press, 1997.
[20] V. Miller, “Uses of Elliptic Curves in Cryptography,” Proc. Ann. Int’l
Cryptology Conf. Advances in Cryptology (CRYPTO ’85), pp. 417-426, 1985.
[21] P. Montgomery, “Modular Multiplication without Trial Division,” Math. of
Computation, vol. 44, no. 170, pp. 519-521, 1985.
[22] National Institute of Standards and Technology. FIPS 186-2: Digital
Signature Standard, Jan. 2000.
[23] J.-J. Quisquater, “Encoding System According to the So-Called RSA
Method, by Means of a Microcontroller and Arrangement Implementing
This System,” US Patent #5,166,978, 1992.
[24] R.L. Rivest and A. Shamir, “Efficient Factoring Based on Partial Informa-
tion,” Proc. Workshop Theory and Application of Cryptographic Techniques on
Advances in Cryptology—EUROCRYPT ’85, pp. 31-34, 1986.
[25] R.L. Rivest, A. Shamir, and L. Adleman, “A Method for Obtaining
Digital Signatures and Public-Key Cryptosystems,” Comm. ACM, vol. 21,
no. 2, pp. 120-126, 1978.
[26] Standards for Efficient Cryptography, “Elliptic Curve Cryptography,
Version 1.5, Draft,” http://www.secg.org, 2005.