
Faster Interleaved Modular Multiplication Based on Barrett and Montgomery Reduction Methods

Miroslav Knežević, Member, IEEE, Frederik Vercauteren, and Ingrid Verbauwhede, Senior Member, IEEE

Abstract—This paper proposes two improved interleaved modular multiplication

algorithms based on Barrett and Montgomery modular reduction. The algorithms

are simple and especially suitable for hardware implementations. Four large sets

of moduli for which the proposed methods apply are given and analyzed from a

security point of view. By considering state-of-the-art attacks on public-key

cryptosystems, we show that the proposed sets are safe to use, in practice, for

both elliptic curve cryptography and RSA cryptosystems. We propose a hardware

architecture for the modular multiplier that is based on our methods. The results

show that concerning the speed, our proposed architecture outperforms the

modular multiplier based on standard modular multiplication by more than

50 percent. Additionally, our design consumes less area compared to the standard

solutions.

Index Terms—Modular multiplication, Barrett reduction, Montgomery reduction,

public-key cryptography.


1 INTRODUCTION

PUBLIC-KEY cryptography (PKC), a concept introduced by Diffie

and Hellman [10] in the mid 1970s, has gained its popularity

together with the rapid evolution of today’s digital communication

systems. The best-known public-key cryptosystems are based on

factoring, i.e., RSA [25], and on the discrete logarithm problem in a

large prime field (Diffie-Hellman, ElGamal, Schnorr, DSA) [19] or

on an elliptic curve (ECC/HECC) [15], [20], [14]. Based on the

hardness of the underlying mathematical problem, PKC usually

deals with large numbers ranging from a few hundreds to a few

thousands of bits in size. Consequently, efficient implementation

of PKC primitives has always been a challenge.

Modular multiplication forms the basis of modular exponentia-

tion which is the core operation of the RSA cryptosystem. It is also

present in many other cryptographic algorithms including those

based on ECC and HECC. In particular, if one uses projective

coordinates for ECC/HECC, modular multiplication remains the

most time-consuming operation for ECC. For efficient implemen-

tation of modular multiplication, the crucial operation is modular

reduction. Algorithms that are most commonly used for this

purpose are Barrett reduction [4] and Montgomery reduction [21].

In this study, we propose two interleaved modular multi-

plication algorithms based on Barrett and Montgomery modular

reduction. The methods are simple and especially suitable for

hardware implementations. Four large sets of moduli for which the

proposed methods apply are given and analyzed from a security

point of view. We propose a hardware architecture for the modular

multiplier that is based on our methods. The results show that

concerning the speed, our proposed architecture outperforms the

modular multiplier based on standard modular multiplication by

more than 50 percent. Additionally, our design consumes less area

compared to the standard solutions.

The remainder of this paper is structured as follows: Section 2

describes the algorithms of Barrett and Montgomery as the two

most commonly used reduction methods and presents a short

overview of related work. In Section 3, we show how precomputa-

tion can be omitted and the quotient evaluation simplified in

Barrett and Montgomery algorithms. Section 4 analyzes the

security implications, and in Section 5, we describe a hardware

implementation. Section 6 concludes the paper.

2 PRELIMINARIES

In this paper, we use the following notation. A multiple-precision $n$-bit integer $A$ is represented in radix $r$ as $A = (A_{n_w-1} \ldots A_0)_r$, where $r = 2^w$; $n_w$ represents the number of digits and is equal to $\lceil n/w \rceil$, where $w$ is the digit size; and $A_i$ is called a digit, with $A_i \in [0, r-1]$. A special case is $r = 2$ ($w = 1$), where the representation $A = (A_{n-1} \ldots A_0)_2$ is called a bit representation.

To make the following discussion easier, we define the floor function for integers in the following manner. Let $U, M \in \mathbb{Z}$ and $M > 0$; then there exist integers $q$ and $Z$ such that $U = qM + Z$ and $0 \le Z < M$. The integer $q$ is called the quotient and is denoted by the floor function as

$$q = \lfloor U/M \rfloor. \qquad (1)$$

The integer $Z$ is called the remainder and can also be represented as $Z = U \bmod M$. Note here that the floor function always rounds toward negative infinity. This is very useful for hardware implementations, where the numbers are given in two's complement representation. If the divisor is of type $2^s$, the floor function is just a simple shift to the right by $s$ positions.
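As a small illustration of the remark about shifts, the following Python sketch compares floor division by $2^s$ with an arithmetic right shift. The helper name `floor_div_pow2` is hypothetical and chosen only for this example; Python integers behave like two's complement numbers of unbounded width, which is the assumption that makes the comparison meaningful.

```python
def floor_div_pow2(u: int, s: int) -> int:
    """Return floor(u / 2**s) using an arithmetic right shift."""
    return u >> s

# Python's // operator also rounds toward negative infinity,
# so the shift and the floor division agree for negative inputs too.
for u in (13, -13, 0, -1):
    assert floor_div_pow2(u, 2) == u // 4

print(floor_div_pow2(-13, 2))  # -4, because -13 = (-4) * 4 + 3
```

The remainder in the last line is 3, which lies in $[0, M)$ as required by the definition above.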

2.1 Classical and Montgomery Modular Multiplication

Methods

Given a modulus $M$ and two elements $X, Y \in \mathbb{Z}_M$, where $\mathbb{Z}_M$ is the ring of integers modulo $M$, the ordinary modular multiplication is defined as

$$X \otimes Y \triangleq X \cdot Y \bmod M.$$

Let the modulus $M$ be an $n_w$-digit integer, where the radix of each digit is $r = 2^w$. The classical modular multiplication algorithm computes $XY \bmod M$ by interleaving the multiplication and modular reduction phases, as shown in Algorithm 1. The value $q$ is called an intermediate quotient, while $Z$ represents an intermediate remainder. The calculation of $q$ at step 4 of the algorithm is done by utilizing integer division, which is considered an expensive operation, especially in hardware. The idea of using the precomputed reciprocal of the modulus $M$ and simple shift and multiplication operations instead of division was first introduced by Barrett [3], [4] in 1984. The original algorithm considers only reduction, assuming that the multiplication is performed beforehand. To explain the basic idea, we rewrite the intermediate quotient $q$ as

$$q = \left\lfloor \frac{Z}{M} \right\rfloor = \left\lfloor \frac{\frac{Z}{2^{n+\beta}} \cdot \frac{2^{n+\alpha}}{M}}{2^{\alpha-\beta}} \right\rfloor \ge \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor = \hat{q}. \qquad (2)$$

The value $\hat{q}$ represents an estimation of the intermediate quotient $q$. In most cryptographic applications, the modulus $M$ is fixed during the many modular multiplications, and hence, the value $\mu = \lfloor 2^{n+\alpha}/M \rfloor$ can be precomputed and reused multiple times. Since $\hat{q}$ is an estimated value, some correction steps at the end of the modular multiplication algorithm have to be performed.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 59, NO. 12, DECEMBER 2010 1715

The authors are with the Department of Electrical Engineering, Katholieke

Universiteit Leuven, ESAT/SCD-COSIC, Kasteelpark Arenberg 10, B-3001

Leuven-Heverlee, Belgium.

E-mail: {mknezevi, fvercaut, iverbauw}@esat.kuleuven.be.

Manuscript received 24 Apr. 2009; revised 18 Sept. 2009; accepted 23 Jan.

2010; published online 14 Apr. 2010.

Recommended for acceptance by P. Montuschi.

For information on obtaining reprints of this article, please send e-mail to:

tc@computer.org, and reference IEEECS Log Number TC-2009-04-0173.

Digital Object Identifier no. 10.1109/TC.2010.93.

0018-9340/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society

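Algorithm 1. Classical modular multiplication algorithm
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M = (M_{n_w-1} \ldots M_0)_r$, where $0 \le X, Y < M$, $2^{n-1} \le M < 2^n$, $r = 2^w$ and $n_w = \lceil n/w \rceil$.
Output: $Z = XY \bmod M$.
1: $Z \leftarrow 0$
2: for $i = n_w - 1$ downto 0 do
3:   $Z \leftarrow Zr + XY_i$
4:   $q \leftarrow \lfloor Z/M \rfloor$
5:   $Z \leftarrow Z - qM$
6: end for
7: return $Z$.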

To reduce the number of correction steps, Dhem [9] determines the values $\alpha = w + 3$ and $\beta = -2$ for which the classical modular multiplication based on Barrett reduction needs at most one subtraction at the end of the algorithm. To make the following explanations easier, we outline an analysis similar to the one given in [9]. We assume that step 4 of Algorithm 1 is performed according to (2) and $\hat{q}$ is used instead.
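To make the combination of Algorithm 1 with the estimate (2) concrete, here is a minimal Python sketch. It assumes $\alpha = w + 3$ and $\beta = -2$ as stated above, takes $n$ to be the bit length of $M$, and uses the hypothetical function name `barrett_interleaved_mul`; it is an illustration on arbitrary-precision integers, not the authors' implementation.

```python
import random

def barrett_interleaved_mul(x: int, y: int, m: int, w: int) -> int:
    """Algorithm 1 with step 4 replaced by the estimate q_hat of (2)."""
    n = m.bit_length()                 # 2**(n-1) <= m < 2**n
    n_w = -(-n // w)                   # ceil(n / w) digits
    alpha, beta = w + 3, -2            # parameter choice attributed to [9]
    mu = (1 << (n + alpha)) // m       # precomputed reciprocal
    digit_mask = (1 << w) - 1
    z = 0
    for i in range(n_w - 1, -1, -1):
        y_i = (y >> (i * w)) & digit_mask      # digit Y_i
        z = (z << w) + x * y_i                 # Z <- Z*r + X*Y_i
        q_hat = ((z >> (n + beta)) * mu) >> (alpha - beta)
        z -= q_hat * m
    if z >= m:                         # single correction step
        z -= m
    return z

# Usage with an illustrative 64-bit modulus and 8-bit digits.
m = random.getrandbits(64) | (1 << 63)
x, y = random.randrange(m), random.randrange(m)
print(barrett_interleaved_mul(x, y, m, 8) == (x * y) % m)
```

The only multiplication involving the precomputed value is the product with `mu`; this is the single-precision multiplication that the later sections aim to remove.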

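Analysis of Algorithm 1. Let us first consider the first iteration of Algorithm 1 ($i = 0$). We can find an integer $\gamma$ such that $Z_0 = XY_{n_w-1} < 2^{n+\gamma}$. This represents an upper bound of $Z$ ($Z_0$ for $i = 0$). The quotient $q = \lfloor Z_0/M \rfloor$ can now be written as

$$q = \left\lfloor \frac{Z_0}{M} \right\rfloor = \left\lfloor \frac{\frac{Z_0}{2^{n+\beta}} \cdot \frac{2^{n+\alpha}}{M}}{2^{\alpha-\beta}} \right\rfloor,$$

where $\alpha$ and $\beta$ are two variables. The estimation of the given quotient is now equal to

$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z_0}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \frac{\left\lfloor \frac{Z_0}{2^{n+\beta}} \right\rfloor \mu}{2^{\alpha-\beta}} \right\rfloor,$$

where $\mu = \lfloor 2^{n+\alpha}/M \rfloor$ is a constant and may be precomputed. Let us now define the quotient error as a function of the variables $\alpha$, $\beta$, and $\gamma$:

$$e = e(\alpha, \beta, \gamma) = q - \hat{q}.$$

Since $\frac{A}{B} \ge \left\lfloor \frac{A}{B} \right\rfloor > \frac{A}{B} - 1$ for any $A, B \in \mathbb{Z}$, we can write the following inequality:

$$\begin{aligned}
q = \left\lfloor \frac{Z_0}{M} \right\rfloor \ge \hat{q} &> \frac{\left\lfloor \frac{Z_0}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} - 1 \\
&> \frac{\left( \frac{Z_0}{2^{n+\beta}} - 1 \right) \left( \frac{2^{n+\alpha}}{M} - 1 \right)}{2^{\alpha-\beta}} - 1 \\
&= \frac{Z_0}{M} - \frac{Z_0}{2^{n+\alpha}} - \frac{2^{n+\beta}}{M} + \frac{1}{2^{\alpha-\beta}} - 1 \\
&\ge \left\lfloor \frac{Z_0}{M} \right\rfloor - \frac{Z_0}{2^{n+\alpha}} - \frac{2^{n+\beta}}{M} + \frac{1}{2^{\alpha-\beta}} - 1 \\
&= q - \frac{Z_0}{2^{n+\alpha}} - \frac{2^{n+\beta}}{M} + \frac{1}{2^{\alpha-\beta}} - 1.
\end{aligned}$$

Now, since $e \in \mathbb{Z}$, the quotient error can be estimated as

$$e = e(\alpha, \beta, \gamma) \le \left\lfloor 1 + \frac{Z_0}{2^{n+\alpha}} + \frac{2^{n+\beta}}{M} - \frac{1}{2^{\alpha-\beta}} \right\rfloor.$$

According to Algorithm 1, we have $Z_0 < 2^{n+\gamma}$ and $M \ge 2^{n-1}$. Hence, we can evaluate the quotient error as

$$e = e(\alpha, \beta, \gamma) \le \left\lfloor 1 + 2^{\gamma-\alpha} + 2^{\beta+1} - \frac{1}{2^{\alpha-\beta}} \right\rfloor.$$

Following the previous inequality, it is obvious that for $\alpha \ge \gamma + 1$ and $\beta \le -2$, it holds that $e = 1$.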

Next, we need to ensure that the intermediate remainder $Z_i$ does not grow uncontrollably as $i$ increases. Since $X < M$, $Y_i < 2^w$, $Z_i < M + eM$ and $M < 2^n$, after $i$ iterations, we have

$$\begin{aligned}
Z_i &= Z_{i-1} 2^w + XY_i \\
&< (M + eM) 2^w + M 2^w \\
&< (2 + e) 2^{n+w}.
\end{aligned}$$

Since we want to use the same value for $e$ during the algorithm, the next condition must hold:

$$Z_i < (2 + e) 2^{n+w} < 2^{n+\gamma}.$$

To minimize the quotient error ($e = 1$), we must choose $\gamma$ such that

$$3 \cdot 2^w < 2^\gamma.$$

In other words, we choose $\gamma \ge w + 2$. Now, according to the previous analysis, we can conclude that for $\alpha \ge \gamma + 1$, $\beta \le -2$ and $\gamma \ge w + 2$, we may realize a modular multiplication with only one correction step at the end of the whole process.

The only drawback of the proposed method is the size of the intermediate quotient $\hat{q}$ and the precomputed value $\mu$. Due to the parameters $\alpha$ and $\beta$ chosen in a given way, the size of $\hat{q}$ is $w + 2$ bits and $\mu$ is at most $w + 4$ bits. This introduces an additional overhead for software implementations, while it can be easily overcome in hardware implementations.

Montgomery's algorithm [21] is the most commonly utilized modular multiplication algorithm today. In contrast to the classical modular multiplication, it utilizes right to left divisions. Given an $n$-digit odd modulus $M$ and an integer $U \in \mathbb{Z}_M$, the image or the Montgomery residue of $U$ is defined as $X = UR \bmod M$, where $R$, the Montgomery radix, is a constant relatively prime to $M$. If $X$ and $Y$ are the images of $U$ and $V$, respectively, the Montgomery multiplication of these two images is defined as

$$X \ast Y \triangleq XYR^{-1} \bmod M.$$

The result is the image of $UV \bmod M$ and needs to be converted back at the end of the process. For the sake of efficient implementation, one usually uses $R = r^{n_w}$, where $r = 2^w$ is the radix of each digit. Similar to a classical modular multiplication based on Barrett reduction, this algorithm uses a precomputed value $M' = -M^{-1} \bmod r = -M_0^{-1} \bmod r$. The algorithm is shown as follows:

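Algorithm 2. Montgomery modular multiplication algorithm
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M = (M_{n_w-1} \ldots M_0)_r$, $M' = -M_0^{-1} \bmod r$, where $0 \le X, Y < M$, $2^{n-1} \le M < 2^n$, $r = 2^w$, $\gcd(M, r) = 1$ and $n_w = \lceil n/w \rceil$.
Output: $Z = XY r^{-n_w} \bmod M$.
1: $Z \leftarrow 0$
2: for $i = 0$ to $n_w - 1$ do
3:   $Z \leftarrow Z + XY_i$
4:   $q_M \leftarrow (Z \bmod r) M' \bmod r$
5:   $Z \leftarrow (Z + q_M M)/r$
6: end for
7: if $Z \ge M$ then
8:   $Z \leftarrow Z - M$
9: end if
10: return $Z$.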
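Algorithm 2 can be sketched in Python as follows. The function name `montgomery_mul` is hypothetical, $n$ is assumed to be the bit length of $M$, and the inverse is obtained with the built-in three-argument `pow` (negative exponents require Python 3.8 or later). This is an illustrative sketch on big integers rather than a digit-serial hardware model.

```python
import random

def montgomery_mul(x: int, y: int, m: int, w: int) -> int:
    """Algorithm 2: returns x * y * r**(-n_w) mod m for an odd modulus m."""
    n = m.bit_length()
    n_w = -(-n // w)                       # ceil(n / w) digits
    r = 1 << w
    m_prime = (-pow(m, -1, r)) % r         # M' = -M^{-1} mod r
    z = 0
    for i in range(n_w):
        y_i = (y >> (i * w)) & (r - 1)     # digit Y_i
        z += x * y_i
        q_m = ((z & (r - 1)) * m_prime) & (r - 1)
        z = (z + q_m * m) >> w             # exact division by r
    if z >= m:
        z -= m
    return z

# Usage: multiply back by R = r**n_w to compare with the plain product.
m = random.getrandbits(64) | (1 << 63) | 1
x, y = random.randrange(m), random.randrange(m)
big_r = 1 << (8 * 8)                        # r**n_w for w = 8, n = 64
print((montgomery_mul(x, y, m, 8) * big_r) % m == (x * y) % m)
```

The multiplication by `m_prime` in the loop is the step 4 product that disappears for the special moduli discussed in Section 3.2.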


2.2 Related Work

Before introducing related work, we note here that for the moduli

used in all common ECC cryptosystems, the modular reduction

can be done much faster than the one proposed by Barrett or

Montgomery, even without any multiplication. This is the reason

behind standardizing generalized Mersenne prime moduli (sums/

differences of a few powers of 2) [22], [1], [26].

The idea of simplifying an intermediate quotient evaluation

was first presented by Quisquater [23] at the rump session of

Eurocrypt ’90. The method is similar to the one of Barrett except

that the modulus $M$ is preprocessed before the modular multiplication in such a way that the evaluation of the intermediate quotient $q$ basically comes for free. Preprocessing requires some

extra memory and computational time, but the latter is negligible

when many modular multiplications are performed using the same

modulus.

Lenstra [16] points out that choosing moduli with a predeter-

mined portion is beneficial both for storage and computational

requirements. He proposes a way to generate RSA moduli with

any number of predetermined leading (trailing) bits, with the

fraction of specified bits only limited by security considerations.

Furthermore, Lenstra discusses security issues and concludes that

the resulting moduli do not seem to offer less security than regular

RSA moduli. In [13], Joye enhances the method for generating

RSA moduli with a predetermined portion proposed in [16].

In [12], Hars proposes a long modular multiplication method

that also simplifies an intermediate quotient evaluation. The

method is based on Quisquater’s algorithm and requires a

preprocessing of the modulus by increasing its length. The

algorithm contains conditional branches that depend on the sign

of the intermediate remainder. That increases the complexity of

the algorithm, especially concerning the hardware implementa-

tions where additional control logic needs to be added.

In this paper, we propose four sets of moduli that specifically

target efficient modular multiplication by means of classical

modular multiplication based on general Barrett reduction [9] and

Montgomery modular multiplication [21]. In addition to simplified

quotient evaluation, our algorithms do not require any additional

preprocessing. The algorithms are simple and especially suitable for

hardware implementations. They contain no conditional branches

inside the loop, and hence, require a very simple control logic. Note

that the same algorithms are applicable to general moduli if the

preprocessing described in [12] is performed beforehand.

The methods describing how to generate such moduli in case of

RSA are discussed in [16], [13]. Furthermore, from the sets

proposed in this paper, one can also choose the primes that

generate the RSA modulus to speed up a decryption of RSA by

means of the Chinese Remainder Theorem (CRT). In Section 4, we

discuss security issues concerning this matter.

3 THE PROPOSED MODULAR MULTIPLICATION METHODS FOR INTEGERS

In both Barrett and Montgomery modular multiplications, the precomputed values of either the modulus reciprocal ($\mu$) or the modulus inverse ($M'$) are used in order to avoid multiple-precision divisions. However, single-precision multiplications still need to be performed (step 4 of Algorithms 1 and 2). This especially concerns hardware implementations, as the multiplication with the precomputed values often occurs within the critical path of the whole design. Section 5 discusses this issue in more detail.

Let us, for now, assume that the precomputed values $\mu$ and $M'$ are both of type $\pm 2^\delta - \varepsilon$, where $\delta \in \mathbb{Z}$ and $\varepsilon \in \{0, 1\}$. By tuning $\mu$ and $M'$ to be of this special type, we transform a single-precision multiplication with these values into a simple shift operation in hardware. Therefore, we find sets of moduli for which the precomputed values are both of type $\pm 2^\delta - \varepsilon$.

3.1 Speeding Up Classical Modular Multiplication

Before describing the actual algorithm, we provide two lemmas to

make the following explanation easier:

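Lemma 1. Let $M = 2^n - \Delta$ be an $n$-digit positive integer in radix 2 representation and let $\mu = \lfloor 2^{n+\alpha}/M \rfloor$, where $\alpha \in \mathbb{N}$. If $0 < \Delta \le \left\lfloor \frac{2^n}{1 + 2^\alpha} \right\rfloor$, then

$$\mu = 2^\alpha. \qquad (3)$$

Proof of Lemma 1. Rewrite $2^{n+\alpha}$ as

$$2^{n+\alpha} = M 2^\alpha + 2^\alpha \Delta.$$

Since it is given that $0 < \Delta \le \left\lfloor \frac{2^n}{1 + 2^\alpha} \right\rfloor$, we conclude that $0 < 2^\alpha \Delta < M$. By the definition of Euclidean division, this shows that $\mu = 2^\alpha$. $\square$

Lemma 2. Let $M = 2^{n-1} + \Delta$ be an $n$-digit positive integer in radix 2 representation and let $\mu = \lfloor 2^{n+\alpha}/M \rfloor$, where $\alpha \in \mathbb{N}$. If $0 < \Delta \le \left\lfloor \frac{2^{n-1}}{2^{\alpha+1} - 1} \right\rfloor$, then

$$\mu = 2^{\alpha+1} - 1. \qquad (4)$$

Proof of Lemma 2. Rewrite $2^{n+\alpha}$ as

$$2^{n+\alpha} = M (2^{\alpha+1} - 1) + 2^{n-1} - \Delta (2^{\alpha+1} - 1).$$

Since $0 < \Delta \le \left\lfloor \frac{2^{n-1}}{2^{\alpha+1} - 1} \right\rfloor$, we conclude that $0 \le 2^{n-1} - \Delta (2^{\alpha+1} - 1) < M$. By the definition of Euclidean division, this shows that $\mu = 2^{\alpha+1} - 1$. $\square$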
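The two lemmas can be checked numerically. The sketch below uses illustrative sizes ($n = 64$, $\alpha = 11$, i.e., $w = 8$ with $\alpha = w + 3$) and the hypothetical helper `barrett_mu`; it evaluates $\mu$ for the largest $\Delta$ each lemma allows.

```python
def barrett_mu(m: int, alpha: int) -> int:
    """mu = floor(2**(n + alpha) / m), with n taken as the bit length of m."""
    n = m.bit_length()
    return (1 << (n + alpha)) // m

n, alpha = 64, 11  # illustrative sizes, not values from the paper

# Largest Delta allowed by Lemma 1: M = 2**n - Delta.
delta1 = (1 << n) // (1 + (1 << alpha))
m1 = (1 << n) - delta1
assert barrett_mu(m1, alpha) == 1 << alpha

# Largest Delta allowed by Lemma 2: M = 2**(n-1) + Delta.
delta2 = (1 << (n - 1)) // ((1 << (alpha + 1)) - 1)
m2 = (1 << (n - 1)) + delta2
assert barrett_mu(m2, alpha) == (1 << (alpha + 1)) - 1
```

In both cases $\mu$ is a power of two, or one less than a power of two, which is what allows the multiplication by $\mu$ to be replaced by shifts.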

The interleaved modular multiplication algorithm based on general Barrett reduction is given in Section 2. Now, according to Lemmas 1 and 2, we can define two sets of moduli for which the modular multiplication based on Barrett modular reduction can be improved. These sets are of type:

$$\begin{aligned}
S_1 &: M = 2^n - \Delta, \quad \text{where } 0 < \Delta \le \left\lfloor \frac{2^n}{1 + 2^\alpha} \right\rfloor; \\
S_2 &: M = 2^{n-1} + \Delta, \quad \text{where } 0 < \Delta \le \left\lfloor \frac{2^{n-1}}{2^{\alpha+1} - 1} \right\rfloor.
\end{aligned} \qquad (5)$$

Fig. 1 further illustrates the properties of the two proposed sets $S_1$ and $S_2$. As can be seen in the figure, approximately $\alpha$ bits of the modulus are fixed to be all 0s or all 1s, while the other $n - \alpha$ bits are arbitrarily chosen.$^1$

The proposed modular multiplication algorithm is shown in Algorithm 3. The parameters $\alpha$ and $\beta$ are important for the quotient evaluation. As we show later, to minimize the error in quotient evaluation, $\alpha$ and $\beta$ are chosen such that $\alpha = w + 3$ and $\beta = -2$. The same values of the parameters are obtained in [9] for the classical modular multiplication based on Barrett reduction.


Fig. 1. Binary representation of the proposed sets $S_1$ and $S_2$.

1. If $M_{n-\alpha-2} = 1$ for $M \in S_1$ ($M_{n-\alpha-1} = 0$ for $M \in S_2$), then the remaining $n - \alpha - 2$ ($n - \alpha - 1$) least significant bits can be arbitrarily chosen. Otherwise, if $M_{n-\alpha-2} = 0$ ($M_{n-\alpha-1} = 1$), then the remaining $n - \alpha - 2$ ($n - \alpha - 1$) least significant bits are chosen such that (5) is satisfied.

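Algorithm 3. Proposed interleaved modular multiplication based on generalized Barrett reduction ($\alpha = w + 3$ and $\beta = -2$)
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M \in S_1 \cup S_2$, where $0 \le X, Y < M$, $r = 2^w$ and $n_w = \lceil n/w \rceil$.
Output: $Z = XY \bmod M$.
$Z \leftarrow 0$
for $i = n_w - 1$ downto 0 do
  $Z \leftarrow Z 2^w + XY_i$
  $\hat{q} = \lfloor Z/2^n \rfloor$ if $M \in S_1$; $\hat{q} = \lfloor Z/2^{n-1} \rfloor$ if $M \in S_2$
  $Z \leftarrow Z - \hat{q}M$
end for
if $Z \ge M$ then
  $Z \leftarrow Z - M$ // At most 1 subtraction is needed.
end if
while $Z < 0$ do
  $Z \leftarrow Z + M$ // At most 2 additions are needed.
end while
return $Z$.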
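A minimal Python sketch of Algorithm 3 follows. It assumes the caller states which set the modulus belongs to through the hypothetical flag `in_s1`, takes $n$ as the bit length of $M$, and relies on Python's `>>` being an arithmetic (flooring) shift for negative integers; the sizes in the usage lines are illustrative only.

```python
import random

def proposed_barrett_mul(x: int, y: int, m: int, w: int, in_s1: bool) -> int:
    """Algorithm 3: the quotient estimate is a plain right shift of Z."""
    n = m.bit_length()
    n_w = -(-n // w)                       # ceil(n / w) digits
    shift = n if in_s1 else n - 1          # floor(Z / 2**n) or floor(Z / 2**(n-1))
    digit_mask = (1 << w) - 1
    z = 0
    for i in range(n_w - 1, -1, -1):
        y_i = (y >> (i * w)) & digit_mask
        z = (z << w) + x * y_i
        q_hat = z >> shift                 # no multiplication by mu
        z -= q_hat * m
    if z >= m:
        z -= m
    while z < 0:
        z += m
    return z

# Build one modulus from each set according to (5), with alpha = w + 3.
n, w = 64, 8
alpha = w + 3
m_s1 = (1 << n) - random.randrange(1, (1 << n) // (1 + (1 << alpha)) + 1)
m_s2 = (1 << (n - 1)) + random.randrange(
    1, (1 << (n - 1)) // ((1 << (alpha + 1)) - 1) + 1)
x, y = random.randrange(m_s1), random.randrange(m_s1)
print(proposed_barrett_mul(x, y, m_s1, w, True) == (x * y) % m_s1)
```

Compared with a Barrett-based loop that multiplies by a precomputed $\mu$, the only change inside the loop is the line computing `q_hat`.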

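In contrast to the classical modular multiplication based on Barrett reduction where the quotient is evaluated as

$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor,$$

in our proposed algorithm, the evaluation basically comes for free:

$$\hat{q} = \begin{cases} \left\lfloor \dfrac{Z}{2^n} \right\rfloor, & \text{if } M \in S_1; \\[2mm] \left\lfloor \dfrac{Z}{2^{n-1}} \right\rfloor, & \text{if } M \in S_2. \end{cases}$$

This saves one single-precision multiplication and additionally increases the speed of the proposed modular multiplication algorithm.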

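Proof of Algorithm 3. To prove the correctness of the algorithm, we need to show that there exist $\alpha, \beta \in \mathbb{Z}$, such that $\hat{q}$ can indeed be represented as

$$\hat{q} = \begin{cases} \left\lfloor \dfrac{Z}{2^n} \right\rfloor, & \text{if } M \in S_1; \\[2mm] \left\lfloor \dfrac{Z}{2^{n-1}} \right\rfloor, & \text{if } M \in S_2. \end{cases}$$

As shown in the analysis of Algorithm 1, to have the minimized quotient error, the parameters $\alpha$ and $\beta$ need to be chosen such that $\alpha \ge w + 3$ and $\beta \le -2$. Let us first assume that $M \in S_1$. According to Lemma 1, it follows that $\mu = 2^\alpha$. Now, $\hat{q}$ becomes equal to

$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \mu}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor 2^\alpha}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor 2^\beta \right\rfloor.$$

For $\beta \le 0$, the previous equation becomes equivalent to

$$\hat{q} = \left\lfloor \frac{Z}{2^n} \right\rfloor.$$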

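For the case where $M \in S_2$, we have, according to Lemma 2, that $\mu = 2^{\alpha+1} - 1$. Now, $\hat{q}$ becomes equal to

$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor (2^{\alpha+1} - 1)}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor 2^{\beta+1} \left( 1 - \frac{1}{2^{\alpha+1}} \right) \right\rfloor.$$

To further simplify the proof, we choose $\beta = -2$ and the previous equation becomes equivalent to

$$\hat{q} = \left\lfloor \left\lfloor \frac{Z}{2^{n-2}} \right\rfloor \frac{1}{2} \left( 1 - \frac{1}{2^{\alpha+1}} \right) \right\rfloor.$$

If we choose $\alpha$ such that

$$2^{\alpha+1} > \max \left\{ \left\lfloor \frac{Z}{2^{n-2}} \right\rfloor \right\}, \qquad (6)$$

the expression of $\hat{q}$ simplifies to

$$\hat{q} = \begin{cases} \left\lfloor \dfrac{Z}{2^{n-1}} \right\rfloor - 1, & \text{if } 2 \mid \left\lfloor \dfrac{Z}{2^{n-2}} \right\rfloor; \\[2mm] \left\lfloor \dfrac{Z}{2^{n-1}} \right\rfloor, & \text{if } 2 \nmid \left\lfloor \dfrac{Z}{2^{n-2}} \right\rfloor. \end{cases} \qquad (7)$$

The inequality (6) can be written as

$$2^{\alpha+1} > \left\lfloor \frac{\max\{Z\}}{2^{n-2}} \right\rfloor,$$

where $\max\{Z\}$ is evaluated in the analysis of Algorithm 1 and given as $\max\{Z\} = (2 + e) 2^{n+w}$. To have the minimal error, we choose $e = 1$ and get the following relation:

$$2^{\alpha+1} > \left\lfloor \frac{3 \cdot 2^{n+w}}{2^{n-2}} \right\rfloor = \lfloor 3 \cdot 2^{w+2} \rfloor.$$

The latter inequality is satisfied for $\alpha \ge w + 3$.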

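If, instead of (7), we use only $\hat{q} = \lfloor Z/2^{n-1} \rfloor$, the evaluation of the intermediate quotient $\hat{q}$ will, for $2 \mid \lfloor Z/2^{n-2} \rfloor$, become greater than or equal to the real intermediate quotient $q$. Due to this fact, $Z$ can become negative at the end of the current iteration. Hence, we need to consider the case where $Z < 0$. Let us prevent $Z$ from an uncontrollable decrease by putting a lower bound with $Z > -2^{n+\gamma}$, where $\gamma \in \mathbb{Z}$. Since $\frac{A}{B} \ge \left\lfloor \frac{A}{B} \right\rfloor > \frac{A}{B} - 1$ for any $A, B \in \mathbb{Z}$, we can write the following inequality (note that $Z < 0$ and $M > 0$):

$$\begin{aligned}
\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor &\le \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \\
&< \frac{\frac{Z}{2^{n+\beta}} \left( \frac{2^{n+\alpha}}{M} - 1 \right)}{2^{\alpha-\beta}} \\
&= \frac{Z}{M} - \frac{Z}{2^{n+\alpha}} \\
&< \left\lfloor \frac{Z}{M} \right\rfloor + 1 - \frac{Z}{2^{n+\alpha}} \\
&= q + 1 - \frac{Z}{2^{n+\alpha}} \\
&< q + 1 + 2^{\gamma-\alpha}.
\end{aligned}$$

Now, since $q, \hat{q}, e \in \mathbb{Z}$, we choose $\alpha \ge \gamma + 1$ and the quotient error gets estimated as $-1 \le e \le 0$. If in the next iteration, it again happens that $2 \mid \lfloor Z/2^{n-2} \rfloor$, the quotient error will become $-2 \le e \le 0$.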

Finally, to assure that $Z$ will remain within the bounds during the $i$th iteration, we write

$$\begin{aligned}
Z_i &= Z_{i-1} 2^w + XY_i \\
&= (Z_{i-2} - qM + eM) 2^w + XY_i \\
&> (0 + eM) 2^w + 0 \\
&> e 2^{n+w} > -2^{n+\gamma}.
\end{aligned}$$

The worst case is when $e = -2$, and then, it must hold that $\gamma > w + 1$. By choosing $\alpha = w + 3$ and $\beta = -2$, all conditions are satisfied, and hence, $\hat{q}$ is indeed a good estimate of $q$. At most one subtraction or two additions at the correction step are required to obtain $Z = XY \bmod M$. $\square$

3.2 Speeding Up Montgomery Modular Multiplication

Similar to Lemmas 1 and 2, we also have Lemmas 3 and 4 that are at the heart of the proposed modular multiplication algorithm based on Montgomery reduction.

Lemma 3. Let $M = \Delta 2^w + 1$ be an $n$-digit positive integer in radix 2 representation, i.e., $2^{n-w-1} \le \Delta < 2^{n-w}$, and let $M' = -M^{-1} \bmod 2^w$, where $w \in \mathbb{N}$; then

$$M' = -1. \qquad (8)$$

Proof of Lemma 3. Since $M \equiv 1 \bmod 2^w$, we clearly have $-M^{-1} \equiv -1 \bmod 2^w$. $\square$

Lemma 4. Let $M = \Delta 2^w - 1$ be an $n$-digit positive integer in radix 2 representation, i.e., $2^{n-w-1} < \Delta \le 2^{n-w}$, and let $M' = -M^{-1} \bmod 2^w$, where $w \in \mathbb{N}$; then

$$M' = 1. \qquad (9)$$

Proof of Lemma 4. Since $M \equiv -1 \bmod 2^w$, we clearly have $-M^{-1} \equiv 1 \bmod 2^w$. $\square$

According to the previous two lemmas, we can easily find two sets of moduli for which the precomputation step in Montgomery multiplication can be excluded. The resulting algorithm is shown in Algorithm 4. The proposed sets are of type

$$\begin{aligned}
S_3 &: M = \Delta 2^w + 1, \quad \text{where } 2^{n-w-1} \le \Delta < 2^{n-w}; \\
S_4 &: M = \Delta 2^w - 1, \quad \text{where } 2^{n-w-1} < \Delta \le 2^{n-w}.
\end{aligned} \qquad (10)$$

Fig. 2 further illustrates the properties of the two proposed sets $S_3$ and $S_4$. As can be seen in the figure, $w - 1$ bits of the modulus are fixed to be all 0s or all 1s, while the other $n - w + 1$ bits are arbitrarily chosen. To fulfill the condition $\gcd(M, r) = 1$ (see Algorithm 2), the least significant bit of $M$ is set to 1.

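Algorithm 4. Proposed interleaved modular multiplication based on Montgomery modular reduction.
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M \in S_3 \cup S_4$, where $0 \le X, Y < M$, $r = 2^w$ and $n_w = \lceil n/w \rceil$.
Output: $Z = XY r^{-n_w} \bmod M$.
$Z \leftarrow 0$
for $i = 0$ to $n_w - 1$ do
  $Z \leftarrow Z + XY_i$
  $q_M = -Z \bmod r$ if $M \in S_3$; $q_M = Z \bmod r$ if $M \in S_4$
  $Z \leftarrow (Z + q_M M)/r$
end for
if $Z \ge M$ then
  $Z \leftarrow Z - M$
end if
return $Z$.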
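Algorithm 4 can be sketched in Python along the same lines. The flag `in_s3` and the function name are hypothetical, $n$ is assumed to be the bit length of $M$, and the example moduli are built directly from the definitions in (10) with illustrative sizes.

```python
import random

def proposed_montgomery_mul(x: int, y: int, m: int, w: int, in_s3: bool) -> int:
    """Algorithm 4: q_M is -Z mod r (S3) or Z mod r (S4); no M' is needed."""
    n = m.bit_length()
    n_w = -(-n // w)                       # ceil(n / w) digits
    mask = (1 << w) - 1
    z = 0
    for i in range(n_w):
        y_i = (y >> (i * w)) & mask
        z += x * y_i
        q_m = (-z) & mask if in_s3 else z & mask
        z = (z + q_m * m) >> w             # exact division by r
    if z >= m:
        z -= m
    return z

# One modulus from each set, following (10), for n = 64 and w = 8.
n, w = 64, 8
delta = random.randrange(1 << (n - w - 1), 1 << (n - w))
m_s3 = (delta << w) + 1                    # M = Delta * 2**w + 1
m_s4 = ((delta + 1) << w) - 1              # M = Delta * 2**w - 1 with Delta + 1
x, y = random.randrange(m_s3), random.randrange(m_s3)
big_r = 1 << (w * (-(-n // w)))            # R = r**n_w
print((proposed_montgomery_mul(x, y, m_s3, w, True) * big_r) % m_s3 == (x * y) % m_s3)
```

Since `q_m` is obtained by masking (and, for $S_3$, negating) the low digit of `z`, no multiplication by a precomputed constant appears in the loop.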

Due to the use of a special type of moduli, the evaluation of the intermediate Montgomery quotient is simplified compared to the original algorithm given in Algorithm 2. As in our case the value of $M'$ is simply equal to 1 or $-1$, the Montgomery quotient $q_M = (Z \bmod r) M' \bmod r$ becomes now

$$q_M = \begin{cases} -Z \bmod r, & \text{if } M \in S_3; \\ Z \bmod r, & \text{if } M \in S_4. \end{cases}$$

Since $r = 2^w$, the evaluation of $q_M$ basically comes for free.

Proof of Algorithm 4. Follows immediately from Lemmas 3 and 4. $\square$

4 SECURITY CONSIDERATIONS

In this section, we analyze the security implications of choosing primes in one of the sets $S_1, S_2, S_3, S_4$ for use in ECC/HECC and in RSA.

In the current state of the art, the security of ECC/HECC over finite fields GF($q$) only depends on the extension degree of the field [2]. Therefore, the security does not depend on the precise structure of the prime $p$. This is illustrated by the particular choices for $p$ that have been made in several standards such as SEC [26], NIST [22], ANSI [1]. In particular, the following primes have been proposed: $p_{192} = 2^{192} - 2^{64} - 1$, $p_{224} = 2^{224} - 2^{96} + 1$, $p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$, $p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$, and $p_{521} = 2^{521} - 1$. It is easy to verify that for $w \le 28$, all primes are in one of the proposed sets. As such, at least one of our methods applies for all primes included in the standards. In conclusion, choosing a prime of prescribed structure has no influence on the security of ECC/HECC.
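The membership claim for the standardized primes can be checked mechanically. The sketch below assumes $\alpha = w + 3$ for the sets $S_1$ and $S_2$ and takes $n$ as the bit length of the prime; the function name `proposed_sets` and the dictionary `PRIMES` are hypothetical names introduced for this illustration.

```python
def proposed_sets(m: int, w: int) -> set:
    """Return the names of the proposed sets containing m, with alpha = w + 3."""
    n = m.bit_length()
    alpha = w + 3
    found = set()
    delta1 = (1 << n) - m                        # M = 2**n - Delta
    if 0 < delta1 <= (1 << n) // (1 + (1 << alpha)):
        found.add("S1")
    delta2 = m - (1 << (n - 1))                  # M = 2**(n-1) + Delta
    if 0 < delta2 <= (1 << (n - 1)) // ((1 << (alpha + 1)) - 1):
        found.add("S2")
    if m % (1 << w) == 1:                        # M = Delta * 2**w + 1
        found.add("S3")
    if m % (1 << w) == (1 << w) - 1:             # M = Delta * 2**w - 1
        found.add("S4")
    return found

PRIMES = {
    "p192": 2**192 - 2**64 - 1,
    "p224": 2**224 - 2**96 + 1,
    "p256": 2**256 - 2**224 + 2**192 + 2**96 - 1,
    "p384": 2**384 - 2**128 - 2**96 + 2**32 - 1,
    "p521": 2**521 - 1,
}

for name, p in PRIMES.items():
    print(name, sorted(proposed_sets(p, 28)))
```

For the $S_3$ and $S_4$ tests, the range condition on $\Delta$ in (10) holds automatically once $n$ is the bit length of an odd $M$ and $w < n$, so only the residue modulo $2^w$ is checked.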

The case of RSA requires a more detailed analysis than ECC/HECC. First, we assume that the modulus $N$ is chosen from one of the proposed sets. This is a special case of the security analysis given in [16], followed by the conclusion that the resulting moduli do not seem to offer less security than regular RSA moduli.

Next, we assume that the primes $p$ and $q$ that constitute the modulus $N = pq$ are both chosen in one of the sets $S_i$. To analyze the security implications of the restricted choice of $p$ and $q$, we first make a trivial observation. The number of $n$-bit primes in the sets $S_i$ for $n > 259 + w$ is large enough such that exhaustive listing of these sets is impossible, since a maximum of $w + 3$ bits are fixed. The security analysis then corresponds to attacks on RSA with partially known factorization. This problem has been analyzed extensively in the literature and the first results come from Rivest and Shamir [24] in 1985. They describe an algorithm that factors $N$ in polynomial time if $2/3$ of the bits of $p$ or $q$ are known. In 1995, Coppersmith [6] improves this bound to $3/5$.

Today's best attacks all rely on variants of Coppersmith's algorithm published in 1996 [8], [7]. A good overview of these algorithms is given in [17], [18]. The best results in this area are as follows: Let $N$ be an $n$-bit number, which is a product of two $n/2$-bit primes. If half of the bits of either $p$ or $q$ (or both) are known, then $N$ can be factored in polynomial time. If less than half of the bits are known, say $n/4 - \varepsilon$ bits, then the best algorithm simply guesses $\varepsilon$ bits, and then applies the polynomial-time algorithm, leading to a running time exponential in $\varepsilon$. In practice, the values of $w$ (typically, $w \le 64$) and $n$ ($n \ge 1{,}024$) are always such that our proposed moduli remain secure against Coppersmith's factorization algorithm, since at most $w + 3$ bits of $p$ and $q$ are known.
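As a worked instance of this argument, take the boundary values quoted above, $n = 1{,}024$ and $w = 64$. The estimate of the guessing effort as roughly $2^{\varepsilon}$ trials is an assumption made here for illustration; the text only states that the running time is exponential in $\varepsilon$.

```latex
\begin{aligned}
\text{bits needed for the polynomial-time attack:}\quad & n/4 = 1024/4 = 256,\\
\text{bits fixed by the proposed sets (at most):}\quad & w + 3 = 64 + 3 = 67,\\
\text{bits left to guess:}\quad & \varepsilon = n/4 - (w + 3) = 256 - 67 = 189,\\
\text{guessing effort (assumed } \approx 2^{\varepsilon}\text{):}\quad & 2^{189}\ \text{trials}.
\end{aligned}
```

Smaller digit sizes fix fewer bits and therefore leave even more bits to guess.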

Finally, we consider a similar approach extended to moduli of the form $N = p^r q$, where $p$ and $q$ have the same bit size. This extension was proposed by Boneh et al. [5]. Assuming that $p$ and $q$ are of the same bit size, one needs a $1/(r+1)$-fraction of the most significant bits of $p$ in order to factor $N$ in polynomial time. In other words, for the case $r = 1$, we need half of the bits, whereas for, e.g., $r = 2$, we need only a third of the most significant bits of $p$. These results show that the primes $p, q \in S$, assembling an RSA modulus of the form $N = p^r q$, should be used with care. This is especially true when $r$ is large. Note that if $r \approx \log p$, the latter factoring method factors $N$ in polynomial time for any primes $p, q \in \mathbb{N}$.

Fig. 2. Binary representation of the proposed sets $S_3$ and $S_4$.

5 HARDWARE IMPLEMENTATION OF THE PROPOSED ALGORITHMS

A typical architecture that describes an interleaved modular multiplier is shown in Fig. 3. Both Barrett and Montgomery algorithms can be implemented based on this architecture. The architecture consists of two multiple-precision multipliers ($\pi_1$ and $\pi_2$) and one single-precision multiplier ($\pi_3$). Apart from the multipliers, the architecture contains an additional adder denoted by $\Sigma$. Having two multiple-precision multipliers may seem redundant at first glance, but the multiplier $\pi_1$ uses data from $x$ and $y$ that are fixed during a single modular multiplication. Now, by running $\pi_1$ and $\pi_2$ in parallel, we speed up the whole multiplication process. If the target is a more compact design, one can also use a single multiple-precision multiplier, which does not reduce the generality of our discussion.

Multipliers $\pi_1$ and $\pi_2$ perform the multiplications at lines 3 and 5 of both Algorithms 1 and 2, respectively. The multiplication performed in step 4 of both algorithms is done by multiplier $\pi_3$. An eventual shift of the register $Z$ is handled by the controller. The exact schedule of the functional parts of the multiplier is as follows: $\pi_1 \to \Sigma \to \pi_1 \pi_2 \pi_3 \to \Sigma \to \Sigma \to \pi_1 \pi_2 \pi_3 \to \Sigma \to \Sigma \to \cdots$. In case of generalized Barrett reduction [9], the precomputed value $\mu$ is $\lambda = w + 4$ bits long, while for the case of Montgomery, the precomputed value $M'$ is $\lambda = w$ bits long. Due to the generalized Barrett's algorithm, the multiplier $\pi_2$ uses the most significant $\lambda$ bits of the product calculated by $\pi_3$, while for the case of Montgomery, it uses the least significant $\lambda$ bits of the same product. This is indeed a reason for Montgomery's multiplier being superior compared to the one of Barrett.

The critical path of the whole design occurs from the output of the register $Z$ to the input of the temporary register in $\pi_2$, passing through two single-precision multipliers and one adder (bold line). To show this in practice, we have synthesized 192, 256, and 512-bit multipliers, each with a digit size of 32 bits. The code was first written in GEZEL [11] and tested for its functionality, and then translated to VHDL and synthesized using the Synopsys Design Compiler version Y-2006.06. The library we used was the UMC 0.13 μm CMOS High-Speed standard cell library. The results can be found in Table 1. The size of the designs is given as the number of NAND gate equivalences (GEs).

TABLE 1. Synthesis Results for the Hardware Architectures of 192, 256, and 512-Bit Modular Multipliers

Fig. 3. Architecture for an interleaved modular multiplier based on Barrett or Montgomery reduction.

Fig. 4. Architecture for an interleaved modular multiplier based on modified Barrett or modified Montgomery reduction.

A major improvement of the new algorithms is the simplified

quotient evaluation. This fact results in the new proposed

architecture for the efficient modular multiplier, as shown in

Fig. 4. It consists of two multiple-precision multipliers ($\pi_1$ and $\pi_2$)

only. The most important difference is that there are no multi-

plications with the precomputed values, and hence, the critical

path contains one single-precision multiplier and one adder only

(bold line). To compare the performance with the architecture

proposed in Fig. 3, we have again synthesized a number of

multipliers using the same standard cell library.

The results are given in Table 1 and show that frequencywise

our proposed architecture outperforms the modular multiplier

based on standard Barrett’s reduction up to 52 percent. The

architecture based on Montgomery’s reduction results in a

relative speedup up to 31 percent. Additionally, designs based

on our algorithms demonstrate area savings in range from 3.5 to

14 percent. Note here that the obtained results are based on the

synthesis only. After the place and route are performed, we

expect a decrease of the performance for both implemented

multipliers, and hence, believe that the relative speedup will

approximately remain the same.

Finally, it is interesting to consider a choice of the digit size.

As discussed in the previous section, the upper bound is decided

by security margins. A typical digit size of 8, 16, 32, or 64 bits

seems to provide a reasonable security margin for the RSA

modulus of 512 bits or more. On the other side, with the increase

of digit size, the number of cycles decreases for the whole design

and the overall speedup is increasing. It is also obvious that the

larger digit size implies the larger circuit, and thus, the

performance trade-off concerning throughput and area would

be interesting to explore.

6 CONCLUSION

In this work, we proposed two interleaved modular multiplication

algorithms based on Barrett and Montgomery modular reductions.

We introduced two sets of moduli for the algorithm based on

Barrett reduction and two sets of moduli for the algorithm based on

Montgomery reduction. These sets contain moduli with a

prescribed number (typically, the digit size) of zero/one bits,

either in the most significant or least significant part. Due to this

choice, our algorithms have no precomputational phase and have a

simplified quotient evaluation, which makes them more flexible

and efficient than existing solutions.
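
The simplified quotient evaluation can be illustrated for the Montgomery case with a minimal Python sketch. It assumes a modulus M whose least significant digit consists of all ones, i.e., M ≡ -1 (mod 2^w), for which M' = -M^(-1) mod 2^w equals 1, so the quotient digit is simply the lowest digit of the intermediate sum. The function name, the parameter names, and the example modulus are illustrative and not taken from the paper; the sketch models the arithmetic only, not the proposed hardware.

```python
def mont_mul_interleaved(a: int, b: int, m: int, w: int, n_digits: int) -> int:
    """Return a * b * 2^(-w * n_digits) mod m for 0 <= a, b < m < 2^(w * n_digits).

    Assumption: the lowest w bits of m are all ones (m = -1 mod 2^w),
    so M' = -m^(-1) mod 2^w = 1 and no precomputed constant is needed.
    """
    mask = (1 << w) - 1
    if m & mask != mask:
        raise ValueError("modulus must have its lowest digit set to all ones")
    t = 0
    for i in range(n_digits):
        a_i = (a >> (w * i)) & mask   # i-th digit of a, least significant first
        t += a_i * b                  # interleaved partial product
        q = t & mask                  # quotient digit: lowest digit of t (M' = 1)
        t = (t + q * m) >> w          # t + q*m is divisible by 2^w
    # The loop keeps t < 2m, so one conditional subtraction suffices.
    return t - m if t >= m else t


if __name__ == "__main__":
    w, n_digits = 8, 2
    m = 0xF1FF                        # illustrative modulus, lowest byte 0xFF
    a, b = 12345, 54321
    r = mont_mul_interleaved(a, b, m, w, n_digits)
    R = 1 << (w * n_digits)
    # Multiplying the result by R must give back a*b modulo m.
    print((r * R) % m == (a * b) % m)
```

With a general odd modulus, the line computing `q` would require an extra multiplication by a precomputed M'; under the stated assumption on M, it reduces to a bit selection.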

Following the same principles as described in the paper, this

approach can be easily extended to finite fields of characteristic two.

ACKNOWLEDGMENTS

This work is supported in part by the IAP Programme P6/26

BCRYPT of the Belgian State, by FWO project G.0300.07, by the

European Commission under contract number ICT-2007-216676

ECRYPT NoE phase II, and by K.U. Leuven-BOF (OT/06/40).

REFERENCES

[1] ANSI, “ANSI X9.62 The Elliptic Curve Digital Signature Algorithm

(ECDSA),” http://www.ansi.org, 2010.

[2] R.M. Avanzi, H. Cohen, C. Doche, G. Frey, T. Lange, K. Nguyen, and F.

Vercauteren, Handbook of Elliptic and Hyperelliptic Curve Cryptography.

CRC Press, 2005.

[3] P. Barrett, “Communications Authentication and Security Using Public Key

Encryption—A Design for Implementation,” master’s thesis, Oxford Univ.,

1984.

[4] P. Barrett, “Implementing the Rivest Shamir and Adleman Public Key

Encryption Algorithm on a Standard Digital Signal Processor,” Proc. Ann.

Int’l Cryptology Conf. Advances in Cryptology (CRYPTO ’86), pp. 311-323,

1986.

[5] D. Boneh, G. Durfee, and N. Howgrave-Graham, “Factoring N = p^r q for

Large r,” Proc. 19th Ann. Int’l Cryptology Conf. Advances in Cryptology

(CRYPTO ’99), pp. 326-337, 1999.

[6] D. Coppersmith, “Factoring with a Hint,” IBM Research Report RC 19905,

1995.

[7] D. Coppersmith, “Finding a Small Root of a Bivariate Integer Equation;

Factoring with High Bits Known,” Proc. Int’l Conf. Theory and Application of

Cryptographic Techniques (Eurocrypt ’96), 1996.

[8] D. Coppersmith, “Small Solutions to Polynomial Equations, and Low

Exponent Vulnerabilities,” J. Cryptology, vol. 10, no. 4, pp. 233-260, 1996.

[9] J.-F. Dhem, “Modified Version of the Barrett Algorithm,” technical report,

1994.

[10] W. Diffie and M.E. Hellman, “New Directions in Cryptography,” IEEE

Trans. Information Theory, vol. IT-22, no. 6, pp. 644-654, Nov. 1976.

[11] GEZEL, http://www.ee.ucla.edu/~schaum/gezel, 2010.

[12] L. Hars, “Long Modular Multiplication for Cryptographic Applications,”

Proc. Int’l Workshop Cryptographic Hardware and Embedded Systems

(CHES ’04), pp. 218-254, 2004.

[13] M. Joye, “RSA Moduli with a Predetermined Portion: Techniques and

Applications,” Proc. Information Security Practice and Experience Conf.,

pp. 116-130, 2008.

[14] N. Koblitz, “A Family of Jacobians Suitable for Discrete Log Cryptosystems,”

Proc. Ann. Int’l Cryptology Conf. Advances in Cryptology (CRYPTO ’88),

pp. 94-99, 1988.

[15] N. Koblitz, “Elliptic Curve Cryptosystem,” Math. of Computation, vol. 48,

pp. 203-209, 1987.

[16] A. Lenstra, “Generating RSA Moduli with a Predetermined Portion,” Proc.

Advances in Cryptology (ASIACRYPT ’98), pp. 1-10, 1998.

[17] A. May, “New RSA Vulnerabilities Using Lattice Reduction Methods,”

PhD thesis, Univ. of Paderborn, 2003.

[18] A. May, “Using LLL-Reduction for Solving RSA and Factorization

Problems: A Survey,” http://www.informatik.tu-darmstadt.de/KP/

publications/07/lll.pdf, 2007.

[19] A. Menezes, P. van Oorschot, and S. Vanstone, Handbook of Applied

Cryptography. CRC Press, 1997.

[20] V. Miller, “Uses of Elliptic Curves in Cryptography,” Proc. Ann. Int’l

Cryptology Conf. Advances in Cryptology (CRYPTO ’85), pp. 417-426, 1985.

[21] P. Montgomery, “Modular Multiplication without Trial Division,” Math. of

Computation, vol. 44, no. 170, pp. 519-521, 1985.

[22] National Institute of Standards and Technology. FIPS 186-2: Digital

Signature Standard, Jan. 2000.

[23] J.-J. Quisquater, “Encoding System According to the So-Called RSA

Method, by Means of a Microcontroller and Arrangement Implementing

This System,” US Patent #5,166,978, 1992.

[24] R.L. Rivest and A. Shamir, “Efficient Factoring Based on Partial Information,”

Proc. Workshop Theory and Application of Cryptographic Techniques on

Advances in Cryptology—EUROCRYPT ’85, pp. 31-34, 1986.

[25] R.L. Rivest, A. Shamir, and L. Adleman, “A Method for Obtaining

Digital Signatures and Public-Key Cryptosystems,” Comm. ACM, vol. 21,

no. 2, pp. 120-126, 1978.

[26] Standards for Efficient Cryptography, “Elliptic Curve Cryptography,

Version 1.5, Draft,” http://www.secg.org, 2005.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 59, NO. 12, DECEMBER 2010 1721