# Faster Interleaved Modular Multiplication Based on Barrett and Montgomery Reduction Methods

Miroslav Knežević, Member, IEEE, Frederik Vercauteren, and Ingrid Verbauwhede, Senior Member, IEEE
Abstract—This paper proposes two improved interleaved modular multiplication
algorithms based on Barrett and Montgomery modular reduction. The algorithms
are simple and especially suitable for hardware implementations. Four large sets
of moduli for which the proposed methods apply are given and analyzed from a
security point of view. By considering state-of-the-art attacks on public-key
cryptosystems, we show that the proposed sets are safe to use, in practice, for
both elliptic curve cryptography and RSA cryptosystems. We propose a hardware
architecture for the modular multiplier that is based on our methods. The results
show that, in terms of speed, our proposed architecture outperforms a
modular multiplier based on standard modular multiplication by more than
50 percent. Additionally, our design consumes less area than the standard
solutions.
Index Terms—Modular multiplication, Barrett reduction, Montgomery reduction,
public-key cryptography.
1 INTRODUCTION
PUBLIC-KEY cryptography (PKC), a concept introduced by Diffie
and Hellman [10] in the mid-1970s, has gained popularity
together with the rapid evolution of today’s digital communication
systems. The best-known public-key cryptosystems are based on
factoring, i.e., RSA [25], and on the discrete logarithm problem in a
large prime field (Diffie-Hellman, ElGamal, Schnorr, DSA) [19] or
on an elliptic curve (ECC/HECC) [15], [20], [14]. Based on the
hardness of the underlying mathematical problem, PKC usually
deals with large numbers ranging from a few hundred to a few
thousand bits in size. Consequently, efficient implementation
of PKC primitives has always been a challenge.
Modular multiplication forms the basis of modular exponentia-
tion which is the core operation of the RSA cryptosystem. It is also
present in many other cryptographic algorithms including those
based on ECC and HECC. In particular, if one uses projective
coordinates for ECC/HECC, modular multiplication remains the
most time-consuming operation. For efficient implementation
of modular multiplication, the crucial operation is modular
reduction. Algorithms that are most commonly used for this
purpose are Barrett reduction [4] and Montgomery reduction [21].
In this study, we propose two interleaved modular multi-
plication algorithms based on Barrett and Montgomery modular
reduction. The methods are simple and especially suitable for
hardware implementations. Four large sets of moduli for which the
proposed methods apply are given and analyzed from a security
point of view. We propose a hardware architecture for the modular
multiplier that is based on our methods. The results show that
that, in terms of speed, our proposed architecture outperforms the
modular multiplier based on standard modular multiplication by
more than 50 percent. Additionally, our design consumes less area
compared to the standard solutions.
The remainder of this paper is structured as follows: Section 2
describes the algorithms of Barrett and Montgomery as the two
most commonly used reduction methods and presents a short
overview of related work. In Section 3, we show how precomputa-
tion can be omitted and the quotient evaluation simplified in
Barrett and Montgomery algorithms. Section 4 analyzes the
security implications, and in Section 5, we describe a hardware
implementation. Section 6 concludes the paper.
2 PRELIMINARIES
In this paper, we use the following notation. A multiple-precision $n$-bit integer $A$ is represented in radix-$r$ representation as $A = (A_{n_w-1} \ldots A_0)_r$, where $r = 2^w$; $n_w$ denotes the number of digits and equals $\lceil n/w \rceil$, where $w$ is the digit size; and $A_i$ is called a digit, with $A_i \in [0, r-1]$. A special case is $r = 2$ ($w = 1$), in which the representation $A = (A_{n-1} \ldots A_0)_2$ is called a bit representation.
To make the following discussion easier, we define the floor function for integers in the following manner. Let $U, M \in \mathbb{Z}$ and $M > 0$; then there exist integers $q$ and $Z$ such that $U = qM + Z$ and $0 \le Z < M$. The integer $q$ is called the quotient and is denoted by the floor function as

$$q = \lfloor U/M \rfloor. \quad (1)$$

The integer $Z$ is called the remainder and can also be represented as $Z = U \bmod M$. Note here that the floor function always rounds toward negative infinity. This is very useful for hardware implementations, where numbers are given in two's complement representation. If the divisor is of the form $2^s$, the floor function is just a simple right shift by $s$ positions.
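The floor/shift equivalence can be checked directly in Python, whose `//` operator also rounds toward negative infinity (a quick illustration of the remark above, not from the paper):

```python
# Floor division rounds toward negative infinity, matching the floor
# function defined above. For a divisor of the form 2**s it coincides
# with an arithmetic right shift, even for negative (two's complement)
# values, where truncating division would round the wrong way.
U, s = -115, 3
print(U // (1 << s))   # floor(-115 / 8) = -15
print(U >> s)          # arithmetic right shift gives the same: -15
```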
2.1 Classical and Montgomery Modular Multiplication Methods

Given a modulus $M$ and two elements $X, Y \in \mathbb{Z}_M$, where $\mathbb{Z}_M$ is the ring of integers modulo $M$, the ordinary modular multiplication is defined as

$$X \odot Y \triangleq XY \bmod M.$$

Let the modulus $M$ be an $n_w$-digit integer, where the radix of each digit is $r = 2^w$. The classical modular multiplication algorithm computes $XY \bmod M$ by interleaving the multiplication and modular reduction phases, as shown in Algorithm 1. The value $q$ is called an intermediate quotient, while $Z$ represents an intermediate remainder. The calculation of $q$ at step 4 of the algorithm is done by means of integer division, which is considered an expensive operation, especially in hardware. The idea of using a precomputed reciprocal of the modulus $M$ together with simple shift and multiplication operations instead of division was first introduced by Barrett [3], [4] in 1984. The original algorithm considers only reduction, assuming that the multiplication is performed beforehand. To explain the basic idea, we rewrite the intermediate quotient $q$ as

$$q = \left\lfloor \frac{Z}{M} \right\rfloor = \left\lfloor \frac{\frac{Z}{2^{n+\beta}} \cdot \frac{2^{n+\alpha}}{M}}{2^{\alpha-\beta}} \right\rfloor \ge \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor = \hat{q}. \quad (2)$$

The value $\hat{q}$ represents an estimate of the intermediate quotient $q$. In most cryptographic applications, the modulus $M$ is fixed during many modular multiplications, and hence, the value $\mu = \lfloor 2^{n+\alpha}/M \rfloor$ can be precomputed and reused multiple times. Since $\hat{q}$ is only an estimate, some correction steps have to be performed at the end of the modular multiplication algorithm.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 59, NO. 12, DECEMBER 2010

The authors are with the Department of Electrical Engineering, Katholieke Universiteit Leuven, ESAT/SCD-COSIC, Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium. E-mail: {mknezevi, fvercaut, iverbauw}@esat.kuleuven.be. Manuscript received 24 Apr. 2009; revised 18 Sept. 2009; accepted 23 Jan. 2010; published online 14 Apr. 2010. Recommended for acceptance by P. Montuschi. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-2009-04-0173. Digital Object Identifier no. 10.1109/TC.2010.93.
Algorithm 1. Classical modular multiplication algorithm
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M = (M_{n_w-1} \ldots M_0)_r$, where $0 \le X, Y < M$, $2^{n-1} \le M < 2^n$, $r = 2^w$, and $n_w = \lceil n/w \rceil$.
Output: $Z = XY \bmod M$.
1: $Z \leftarrow 0$
2: for $i = n_w - 1$ downto $0$ do
3: $Z \leftarrow Zr + XY_i$
4: $q \leftarrow \lfloor Z/M \rfloor$
5: $Z \leftarrow Z - qM$
6: end for
7: Return $Z$.
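As a sanity check, Algorithm 1 can be modeled in a few lines of Python; arbitrary-precision integers stand in for the multiple-precision datapath, and the function name and default digit size are our choices, not from the paper:

```python
def classical_modmul(X, Y, M, w=32):
    """Interleaved classical modular multiplication: returns X*Y mod M.

    Digits of Y (radix r = 2**w) are consumed most-significant first,
    mirroring steps 1-7 of Algorithm 1. The exact division in step 4
    is what Barrett's estimate later replaces with shifts."""
    n = M.bit_length()
    nw = -(-n // w)                 # ceil(n / w)
    r = 1 << w
    Z = 0
    for i in range(nw - 1, -1, -1):
        Yi = (Y >> (i * w)) & (r - 1)
        Z = Z * r + X * Yi          # step 3
        q = Z // M                  # step 4: exact integer division
        Z = Z - q * M               # step 5
    return Z
```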
To reduce the number of correction steps, Dhem [9] determines the values $\alpha = w+3$ and $\beta = -2$ for which the classical modular multiplication based on Barrett reduction needs at most one subtraction at the end of the algorithm. To make the following explanations easier, we outline a similar analysis to the one given in [9]. We assume that step 4 of Algorithm 1 is performed according to (2), with $\hat{q}$ used in place of $q$.

Analysis of Algorithm 1. Let us first consider the first iteration of Algorithm 1 ($i = 0$). We can find an integer $\gamma$ such that $Z_0 = XY_{n_w-1} < 2^{n+\gamma}$. This represents an upper bound of $Z$ ($Z_0$ for $i = 0$). The quotient $q = \lfloor Z_0/M \rfloor$ can now be written as

$$q = \left\lfloor \frac{Z_0}{M} \right\rfloor = \left\lfloor \frac{\frac{Z_0}{2^{n+\beta}} \cdot \frac{2^{n+\alpha}}{M}}{2^{\alpha-\beta}} \right\rfloor,$$

where $\alpha$ and $\beta$ are two variables. The estimate of the given quotient is then

$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z_0}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \frac{\left\lfloor \frac{Z_0}{2^{n+\beta}} \right\rfloor \mu}{2^{\alpha-\beta}} \right\rfloor,$$

where $\mu = \lfloor 2^{n+\alpha}/M \rfloor$ is a constant and may be precomputed. Let us now define the quotient error as a function of the variables $\alpha$, $\beta$, and $\gamma$:

$$e = e(\alpha, \beta, \gamma) = q - \hat{q}.$$

Since $\frac{A}{B} \ge \lfloor \frac{A}{B} \rfloor > \frac{A}{B} - 1$ for any $A, B \in \mathbb{Z}$ with $B > 0$, we can write the following inequality:

$$q = \left\lfloor \frac{Z_0}{M} \right\rfloor \ge \hat{q} > \frac{\left\lfloor \frac{Z_0}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} - 1 > \frac{\left( \frac{Z_0}{2^{n+\beta}} - 1 \right)\left( \frac{2^{n+\alpha}}{M} - 1 \right)}{2^{\alpha-\beta}} - 1 = \frac{Z_0}{M} - \frac{Z_0}{2^{n+\alpha}} - \frac{2^{n+\beta}}{M} + \frac{1}{2^{\alpha-\beta}} - 1 \ge \left\lfloor \frac{Z_0}{M} \right\rfloor - \frac{Z_0}{2^{n+\alpha}} - \frac{2^{n+\beta}}{M} + \frac{1}{2^{\alpha-\beta}} - 1 = q - \frac{Z_0}{2^{n+\alpha}} - \frac{2^{n+\beta}}{M} + \frac{1}{2^{\alpha-\beta}} - 1.$$

Now, since $e \in \mathbb{Z}$, the quotient error can be estimated as

$$e = e(\alpha, \beta, \gamma) \le \left\lfloor 1 + \frac{Z_0}{2^{n+\alpha}} + \frac{2^{n+\beta}}{M} - \frac{1}{2^{\alpha-\beta}} \right\rfloor.$$

According to Algorithm 1, we have $Z_0 < 2^{n+\gamma}$ and $M \ge 2^{n-1}$. Hence, we can evaluate the quotient error as

$$e = e(\alpha, \beta, \gamma) \le \left\lfloor 1 + 2^{\gamma-\alpha} + 2^{\beta+1} - \frac{1}{2^{\alpha-\beta}} \right\rfloor.$$

Following the previous inequality, it is obvious that for $\alpha \ge \gamma + 1$ and $\beta \le -2$, it holds that $e = 1$.

Next, we need to ensure that the intermediate remainder $Z_i$ does not grow uncontrollably as $i$ increases. Since $X < M$, $Y_i < 2^w$, $Z_i < M + eM$, and $M < 2^n$, after $i$ iterations, we have

$$Z_i = Z_{i-1}2^w + XY_i < (M + eM)2^w + M2^w < (2 + e)2^{n+w}.$$

Since we want to use the same value of $e$ during the whole algorithm, the following condition must hold:

$$Z_i < (2 + e)2^{n+w} < 2^{n+\gamma}.$$

To minimize the quotient error ($e = 1$), we must choose $\gamma$ such that $3 \cdot 2^w < 2^\gamma$; in other words, we choose $\gamma \ge w + 2$. Now, according to the previous analysis, we can conclude that for $\alpha \ge \gamma + 1$, $\beta \le -2$, and $\gamma \ge w + 2$, we may realize a modular multiplication with only one correction step at the end of the whole process.

The only drawback of the proposed method is the size of the intermediate quotient $\hat{q}$ and of the precomputed value $\mu$. With the parameters $\alpha$ and $\beta$ chosen in the given way, the size of $\hat{q}$ is $w+2$ bits, and $\mu$ is at most $w+4$ bits long. This introduces an additional overhead for software implementations, while it can easily be overcome in hardware implementations.

Montgomery's algorithm [21] is the most commonly utilized modular multiplication algorithm today. In contrast to the classical modular multiplication, it utilizes right-to-left divisions. Given an $n$-bit odd modulus $M$ and an integer $U \in \mathbb{Z}_M$, the image or Montgomery residue of $U$ is defined as $X = UR \bmod M$, where $R$, the Montgomery radix, is a constant relatively prime to $M$. If $X$ and $Y$ are the images of $U$ and $V$, respectively, the Montgomery multiplication of these two images is defined as

$$X \otimes Y \triangleq XYR^{-1} \bmod M.$$

The result is the image of $UV \bmod M$ and needs to be converted back at the end of the process. For the sake of efficient implementation, one usually uses $R = r^{n_w}$, where $r = 2^w$ is the radix of each digit. Similar to the classical modular multiplication based on Barrett reduction, this algorithm uses a precomputed value $M' = -M^{-1} \bmod r = -M_0^{-1} \bmod r$. The algorithm is shown as follows:

Algorithm 2. Montgomery modular multiplication algorithm
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M = (M_{n_w-1} \ldots M_0)_r$, $M' = -M_0^{-1} \bmod r$, where $0 \le X, Y < M$, $2^{n-1} \le M < 2^n$, $r = 2^w$, $\gcd(M, r) = 1$, and $n_w = \lceil n/w \rceil$.
Output: $Z = XY r^{-n_w} \bmod M$.
1: $Z \leftarrow 0$
2: for $i = 0$ to $n_w - 1$ do
3: $Z \leftarrow Z + XY_i$
4: $q_M \leftarrow (Z \bmod r)M' \bmod r$
5: $Z \leftarrow (Z + q_M M)/r$
6: end for
7: if $Z \ge M$ then
8: $Z \leftarrow Z - M$
9: end if
10: Return $Z$.

2.2 Related Work
Before introducing related work, we note that for the moduli used in all common ECC cryptosystems, the modular reduction can be done much faster than with the methods of Barrett or Montgomery, even without any multiplication.
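Algorithm 2 also admits a short Python reference model; `pow(M, -1, r)` (Python 3.8+) supplies the precomputed $M'$, and the function name is our choice:

```python
def montgomery_modmul(X, Y, M, w=32):
    """Interleaved Montgomery multiplication (Algorithm 2):
    returns X * Y * r**(-nw) mod M, with r = 2**w and nw = ceil(n/w).
    M must be odd so that M' = -M**-1 mod r exists."""
    n = M.bit_length()
    nw = -(-n // w)                 # ceil(n / w)
    r = 1 << w
    Mp = (-pow(M, -1, r)) % r       # precomputed M' = -M^{-1} mod r
    Z = 0
    for i in range(nw):
        Yi = (Y >> (i * w)) & (r - 1)
        Z = Z + X * Yi              # step 3
        qM = ((Z % r) * Mp) % r     # step 4
        Z = (Z + qM * M) // r       # step 5: exact division by r
    if Z >= M:                      # steps 7-9: conditional subtraction
        Z -= M
    return Z
```

Note the extra factor $r^{-n_w}$ in the result: converting in and out of the Montgomery domain is only worthwhile when many multiplications share the same modulus.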
This is the reason behind standardizing generalized Mersenne prime moduli (sums/differences of a few powers of 2) [22], [1], [26].

The idea of simplifying the intermediate quotient evaluation was first presented by Quisquater [23] at the rump session of Eurocrypt '90. The method is similar to Barrett's, except that the modulus $M$ is preprocessed before the modular multiplication in such a way that the evaluation of the intermediate quotient $q$ basically comes for free. Preprocessing requires some extra memory and computational time, but the latter is negligible when many modular multiplications are performed with the same modulus.

Lenstra [16] points out that choosing moduli with a predetermined portion is beneficial both for storage and for computational requirements. He proposes a way to generate RSA moduli with any number of predetermined leading (or trailing) bits, with the fraction of specified bits limited only by security considerations. Furthermore, Lenstra discusses security issues and concludes that the resulting moduli do not seem to offer less security than regular RSA moduli. In [13], Joye enhances the method of [16] for generating RSA moduli with a predetermined portion.

In [12], Hars proposes a long modular multiplication method that also simplifies the intermediate quotient evaluation. The method is based on Quisquater's algorithm and requires preprocessing of the modulus by increasing its length. The algorithm contains conditional branches that depend on the sign of the intermediate remainder. This increases the complexity of the algorithm, especially for hardware implementations, where additional control logic needs to be added.

In this paper, we propose four sets of moduli that specifically target efficient modular multiplication by means of classical modular multiplication based on general Barrett reduction [9] and Montgomery modular multiplication [21].
In addition to the simplified quotient evaluation, our algorithms do not require any additional preprocessing. The algorithms are simple and especially suitable for hardware implementations. They contain no conditional branches inside the loop, and hence require very simple control logic. Note that the same algorithms are applicable to general moduli if the preprocessing described in [12] is performed beforehand. Methods for generating such moduli in the case of RSA are discussed in [16], [13]. Furthermore, from the sets proposed in this paper, one can also choose the primes that generate the RSA modulus in order to speed up RSA decryption by means of the Chinese Remainder Theorem (CRT). In Section 4, we discuss security issues concerning this matter.

3 THE PROPOSED MODULAR MULTIPLICATION METHODS FOR INTEGERS

In both Barrett and Montgomery modular multiplication, the precomputed values of either the modulus reciprocal ($\mu$) or the modulus inverse ($M'$) are used in order to avoid multiple-precision divisions. However, single-precision multiplications still need to be performed (step 4 of Algorithms 1 and 2). This especially concerns hardware implementations, as the multiplication with the precomputed values often occurs within the critical path of the whole design. Section 5 discusses this issue in more detail.

Let us, for now, assume that the precomputed values $\mu$ and $M'$ are both of the form $\pm 2^u - v$, where $u \in \mathbb{Z}$ and $v \in \{0, 1\}$. By tuning $\mu$ and $M'$ to be of this special form, we transform a single-precision multiplication with these values into a simple shift operation in hardware. We therefore find sets of moduli for which the precomputed values are both of this form.

3.1 Speeding Up Classical Modular Multiplication

Before describing the actual algorithm, we provide two lemmas to make the following explanation easier.

Lemma 1. Let $M = 2^n - \Delta$ be an $n$-bit positive integer in radix-2 representation and let $\mu = \lfloor 2^{n+\alpha}/M \rfloor$, where $\alpha \in \mathbb{N}$. If $0 < \Delta \le \lfloor \frac{2^n}{1+2^\alpha} \rfloor$, then

$$\mu = 2^\alpha. \quad (3)$$

Proof of Lemma 1. Rewrite $2^{n+\alpha}$ as $2^{n+\alpha} = M2^\alpha + 2^\alpha\Delta$. Since it is given that $0 < \Delta \le \lfloor \frac{2^n}{1+2^\alpha} \rfloor$, we conclude that $0 < 2^\alpha\Delta < M$. By the definition of Euclidean division, this shows that $\mu = 2^\alpha$. □

Lemma 2. Let $M = 2^{n-1} + \Delta$ be an $n$-bit positive integer in radix-2 representation and let $\mu = \lfloor 2^{n+\alpha}/M \rfloor$, where $\alpha \in \mathbb{N}$. If $0 < \Delta \le \lfloor \frac{2^{n-1}}{2^{\alpha+1}-1} \rfloor$, then

$$\mu = 2^{\alpha+1} - 1. \quad (4)$$

Proof of Lemma 2. Rewrite $2^{n+\alpha}$ as $2^{n+\alpha} = M(2^{\alpha+1} - 1) + 2^{n-1} - \Delta(2^{\alpha+1} - 1)$. Since $0 < \Delta \le \lfloor \frac{2^{n-1}}{2^{\alpha+1}-1} \rfloor$, we conclude that $0 \le 2^{n-1} - \Delta(2^{\alpha+1} - 1) < M$. By the definition of Euclidean division, this shows that $\mu = 2^{\alpha+1} - 1$. □

The interleaved modular multiplication algorithm based on general Barrett reduction is given in Section 2. Now, according to Lemmas 1 and 2, we can define two sets of moduli for which modular multiplication based on Barrett reduction can be improved. These sets are of type

$$S_1: M = 2^n - \Delta, \text{ where } 0 < \Delta \le \left\lfloor \frac{2^n}{1+2^\alpha} \right\rfloor,$$
$$S_2: M = 2^{n-1} + \Delta, \text{ where } 0 < \Delta \le \left\lfloor \frac{2^{n-1}}{2^{\alpha+1}-1} \right\rfloor. \quad (5)$$

Fig. 1 further illustrates the properties of the two proposed sets $S_1$ and $S_2$. As can be seen in the figure, approximately $\alpha$ bits of the modulus are fixed to be all 0s or all 1s, while the other $n - \alpha$ bits are arbitrarily chosen.¹

Fig. 1. Binary representation of the proposed sets S1 and S2.

¹ If $M_{n-\alpha-2} = 1$ for $M \in S_1$ ($M_{n-\alpha-1} = 0$ for $M \in S_2$), then the remaining $n-\alpha-2$ ($n-\alpha-1$) least significant bits can be arbitrarily chosen. Otherwise, if $M_{n-\alpha-2} = 0$ ($M_{n-\alpha-1} = 1$), then the remaining $n-\alpha-2$ ($n-\alpha-1$) least significant bits are chosen such that (5) is satisfied.

The proposed modular multiplication algorithm is shown in Algorithm 3. The parameters $\alpha$ and $\beta$ are important for the quotient evaluation. As we show later, to minimize the error in the quotient evaluation, $\alpha$ and $\beta$ are chosen such that $\alpha = w+3$ and $\beta = -2$. The same parameter values are obtained in [9] for the classical modular multiplication based on Barrett reduction.

Algorithm 3. Proposed interleaved modular multiplication based on generalized Barrett reduction ($\alpha = w+3$ and $\beta = -2$)
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M \in S_1 \cup S_2$, where $0 \le X, Y < M$, $r = 2^w$, and $n_w = \lceil n/w \rceil$.
Output: $Z = XY \bmod M$.
$Z \leftarrow 0$
for $i = n_w - 1$ downto $0$ do
  $Z \leftarrow Z2^w + XY_i$
  $\hat{q} \leftarrow \lfloor Z/2^n \rfloor$ if $M \in S_1$; $\lfloor Z/2^{n-1} \rfloor$ if $M \in S_2$
  $Z \leftarrow Z - \hat{q}M$
end for
if $Z \ge M$ then
  $Z \leftarrow Z - M$ // At most 1 subtraction is needed.
end if
while $Z < 0$ do
  $Z \leftarrow Z + M$ // At most 2 additions are needed.
end while
return $Z$.

In contrast to the classical modular multiplication based on Barrett reduction, where the quotient is evaluated as

$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor,$$

the proposed algorithm evaluates the quotient as

$$\hat{q} = \begin{cases} \lfloor Z/2^n \rfloor, & \text{if } M \in S_1, \\ \lfloor Z/2^{n-1} \rfloor, & \text{if } M \in S_2. \end{cases}$$

This saves one single-precision multiplication and additionally increases the speed of the proposed modular multiplication algorithm.
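To make the shortcut concrete, here is a Python sketch of Algorithm 3, together with the construction of an admissible $S_1$ modulus (the function and variable names are ours; the right shift replaces the entire Barrett quotient computation):

```python
def barrett_s_modmul(X, Y, M, n, in_S1, w=32):
    """Sketch of Algorithm 3: interleaved modular multiplication where
    the Barrett quotient estimate collapses to a plain right shift.
    in_S1=True:  M in S1 (M = 2**n - Delta),     q_hat = Z >> n
    in_S1=False: M in S2 (M = 2**(n-1) + Delta), q_hat = Z >> (n-1)
    Python's >> is an arithmetic shift, so a temporarily negative Z is
    floored exactly as the paper's floor function requires."""
    nw = -(-n // w)                          # ceil(n / w)
    Z = 0
    for i in range(nw - 1, -1, -1):
        Yi = (Y >> (i * w)) & ((1 << w) - 1)
        Z = (Z << w) + X * Yi
        q_hat = Z >> (n if in_S1 else n - 1)
        Z -= q_hat * M
    if Z >= M:                               # at most one subtraction
        Z -= M
    while Z < 0:                             # at most two additions
        Z += M
    return Z

# Building a modulus in S1 for n = 192, w = 32 (so alpha = w + 3 = 35):
n, w = 192, 32
alpha = w + 3
Delta = (1 << n) // (1 + (1 << alpha))       # largest admissible Delta
M = (1 << n) - Delta
assert (1 << (n + alpha)) // M == 1 << alpha # Lemma 1: mu = 2**alpha
```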
Proof of Algorithm 3. To prove the correctness of the algorithm, we need to show that there exist $\alpha, \beta \in \mathbb{Z}$ such that $\hat{q}$ can indeed be represented as

$$\hat{q} = \begin{cases} \lfloor Z/2^n \rfloor, & \text{if } M \in S_1, \\ \lfloor Z/2^{n-1} \rfloor, & \text{if } M \in S_2. \end{cases}$$

As shown in the analysis of Algorithm 1, to minimize the quotient error, the parameters $\alpha$ and $\beta$ need to be chosen such that $\alpha \ge w+3$ and $\beta \le -2$. Let us first assume that $M \in S_1$. According to Lemma 1, it follows that $\mu = 2^\alpha$. Now, $\hat{q}$ becomes

$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \mu}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor 2^\alpha}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor 2^\beta \right\rfloor.$$

For $\beta \le 0$, the previous equation becomes equivalent to

$$\hat{q} = \left\lfloor \frac{Z}{2^n} \right\rfloor.$$

For the case $M \in S_2$, we have, according to Lemma 2, that $\mu = 2^{\alpha+1} - 1$. Now, $\hat{q}$ becomes

$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor (2^{\alpha+1} - 1)}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor 2^{\beta+1}\left(1 - \frac{1}{2^{\alpha+1}}\right) \right\rfloor.$$

To further simplify the proof, we choose $\beta = -2$, and the previous equation becomes equivalent to

$$\hat{q} = \left\lfloor \left\lfloor \frac{Z}{2^{n-2}} \right\rfloor \frac{1}{2}\left(1 - \frac{1}{2^{\alpha+1}}\right) \right\rfloor.$$

If we choose $\alpha$ such that

$$2^{\alpha+1} > \max\left\{\left\lfloor \frac{Z}{2^{n-2}} \right\rfloor\right\}, \quad (6)$$

the expression for $\hat{q}$ simplifies to

$$\hat{q} = \begin{cases} \lfloor Z/2^{n-1} \rfloor - 1, & \text{if } 2 \mid \lfloor Z/2^{n-2} \rfloor, \\ \lfloor Z/2^{n-1} \rfloor, & \text{if } 2 \nmid \lfloor Z/2^{n-2} \rfloor. \end{cases} \quad (7)$$

The inequality (6) can be written as

$$2^{\alpha+1} > \left\lfloor \frac{\max\{Z\}}{2^{n-2}} \right\rfloor,$$

where $\max\{Z\}$ is evaluated in the analysis of Algorithm 1 and given as $\max\{Z\} = (2+e)2^{n+w}$. To have the minimal error, we choose $e = 1$ and get the following relation:

$$2^{\alpha+1} > \left\lfloor \frac{3 \cdot 2^{n+w}}{2^{n-2}} \right\rfloor = 3 \cdot 2^{w+2}.$$

The latter inequality is satisfied for $\alpha \ge w+3$.

If, instead of (7), we use only $\hat{q} = \lfloor Z/2^{n-1} \rfloor$, the evaluation of the intermediate quotient $\hat{q}$ will, for $2 \mid \lfloor Z/2^{n-2} \rfloor$, become greater than or equal to the real intermediate quotient $q$. Due to this fact, $Z$ can become negative at the end of the current iteration. Hence, we need to consider the case $Z < 0$. Let us prevent $Z$ from an uncontrollable decrease by putting a lower bound $Z > -2^{n+\gamma}$, where $\gamma \in \mathbb{Z}$. Since $\frac{A}{B} \ge \lfloor \frac{A}{B} \rfloor > \frac{A}{B} - 1$ for any $A, B \in \mathbb{Z}$ with $B > 0$, we can write the following inequality (note that $Z < 0$ and $M > 0$):

$$\hat{q} = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor \le \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} < \frac{\frac{Z}{2^{n+\beta}}\left(\frac{2^{n+\alpha}}{M} - 1\right)}{2^{\alpha-\beta}} = \frac{Z}{M} - \frac{Z}{2^{n+\alpha}} < \left\lfloor \frac{Z}{M} \right\rfloor + 1 - \frac{Z}{2^{n+\alpha}} = q + 1 - \frac{Z}{2^{n+\alpha}} < q + 1 + 2^{\gamma-\alpha}.$$

Now, since $q, \hat{q}, e \in \mathbb{Z}$, we choose $\alpha \ge \gamma + 1$, and the quotient error is estimated as $-1 \le e \le 0$. If, in the next iteration, it again happens that $2 \mid \lfloor Z/2^{n-2} \rfloor$, the quotient error can become $-2 \le e \le 0$. Finally, to ensure that $Z$ remains within the bounds during the $i$th iteration, we write

$$Z_i = Z_{i-1}2^w + XY_i = (Z_{i-2} - qM + eM)2^w + XY_i > (0 + eM)2^w + 0 > e2^{n+w} > -2^{n+\gamma}.$$

The worst case is $e = -2$, and then it must hold that $\gamma > w + 1$. By choosing $\alpha = w+3$ and $\beta = -2$, all conditions are satisfied, and hence, $\hat{q}$ is indeed a good estimate of $q$. At most one subtraction or two additions in the correction step are required to obtain $Z = XY \bmod M$. □

3.2 Speeding Up Montgomery Modular Multiplication

Similar to Lemmas 1 and 2, we also have Lemmas 3 and 4, which are at the heart of the proposed modular multiplication algorithm based on Montgomery reduction.

Lemma 3. Let $M = \Delta 2^w + 1$ be an $n$-bit positive integer in radix-2 representation, i.e., $2^{n-w-1} \le \Delta < 2^{n-w}$, and let $M' = -M^{-1} \bmod 2^w$, where $w \in \mathbb{N}$. Then

$$M' = -1. \quad (8)$$

Proof of Lemma 3. Since $M \equiv 1 \pmod{2^w}$, we clearly have $-M^{-1} \equiv -1 \pmod{2^w}$. □

Lemma 4. Let $M = \Delta 2^w - 1$ be an $n$-bit positive integer in radix-2 representation, i.e., $2^{n-w-1} < \Delta \le 2^{n-w}$, and let $M' = -M^{-1} \bmod 2^w$, where $w \in \mathbb{N}$. Then

$$M' = 1. \quad (9)$$

Proof of Lemma 4. Since $M \equiv -1 \pmod{2^w}$, we clearly have $-M^{-1} \equiv 1 \pmod{2^w}$. □

According to the previous two lemmas, we can easily find two sets of moduli for which the precomputation step in Montgomery multiplication can be excluded. The resulting algorithm is shown in Algorithm 4. The proposed sets are of type

$$S_3: M = \Delta 2^w + 1, \text{ where } 2^{n-w-1} \le \Delta < 2^{n-w},$$
$$S_4: M = \Delta 2^w - 1, \text{ where } 2^{n-w-1} < \Delta \le 2^{n-w}. \quad (10)$$

Fig. 2 further illustrates the properties of the two proposed sets $S_3$ and $S_4$. As can be seen in the figure, $w-1$ bits of the modulus are fixed to be all 0s or all 1s, while the other $n-w+1$ bits are arbitrarily chosen. To fulfill the condition $\gcd(M, r) = 1$ (see Algorithm 2), the least significant bit of $M$ is set to 1.

Algorithm 4. Proposed interleaved modular multiplication based on Montgomery modular reduction
Input: $X = (X_{n_w-1} \ldots X_0)_r$, $Y = (Y_{n_w-1} \ldots Y_0)_r$, $M \in S_3 \cup S_4$, where $0 \le X, Y < M$, $r = 2^w$, and $n_w = \lceil n/w \rceil$.
Output: $Z = XY r^{-n_w} \bmod M$.
$Z \leftarrow 0$
for $i = 0$ to $n_w - 1$ do
  $Z \leftarrow Z + XY_i$
  $q_M \leftarrow -Z \bmod r$ if $M \in S_3$; $Z \bmod r$ if $M \in S_4$
  $Z \leftarrow (Z + q_M M)/r$
end for
if $Z \ge M$ then
  $Z \leftarrow Z - M$
end if
return $Z$.
Due to the use of a special type of moduli, the evaluation of the intermediate Montgomery quotient is simplified compared to the original Algorithm 2. As, in our case, the value of $M'$ is simply equal to $1$ or $-1$, the Montgomery quotient $q_M = (Z \bmod r)M' \bmod r$ now becomes

$$q_M = \begin{cases} -Z \bmod r, & \text{if } M \in S_3, \\ Z \bmod r, & \text{if } M \in S_4. \end{cases}$$

Since $r = 2^w$, the evaluation of $q_M$ basically comes for free.

Proof of Algorithm 4. Follows immediately from Lemmas 3 and 4. □

4 SECURITY CONSIDERATIONS

In this section, we analyze the security implications of choosing primes in one of the sets $S_1, S_2, S_3, S_4$ for use in ECC/HECC and in RSA.

In the current state of the art, the security of ECC/HECC over finite fields GF($q$) depends only on the extension degree of the field [2]. Therefore, the security does not depend on the precise structure of the prime $p$. This is illustrated by the particular choices for $p$ that have been made in several standards such as SEC [26], NIST [22], and ANSI [1]. In particular, the following primes have been proposed: $p_{192} = 2^{192} - 2^{64} - 1$, $p_{224} = 2^{224} - 2^{96} + 1$, $p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$, $p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$, and $p_{521} = 2^{521} - 1$. It is easy to verify that for $w \le 28$, all these primes lie in one of the proposed sets, so at least one of our methods applies to every prime included in the standards. In conclusion, choosing a prime of prescribed structure has no influence on the security of ECC/HECC.

The case of RSA requires a more detailed analysis than ECC/HECC. First, we assume that the modulus $N$ is chosen from one of the proposed sets. This is a special case of the security analysis given in [16], which concludes that the resulting moduli do not seem to offer less security than regular RSA moduli. Next, we assume that the primes $p$ and $q$ that constitute the modulus $N = pq$ are both chosen in one of the sets $S_i$. To analyze the security implications of this restricted choice of $p$ and $q$, we first make a trivial observation: the number of $n$-bit primes in the sets $S_i$ for $n > 259 + w$ is large enough that exhaustive listing of these sets is impossible, since at most $w+3$ bits are fixed.

The security analysis then corresponds to attacks on RSA with partially known factorization. This problem has been analyzed extensively in the literature, and the first results come from Rivest and Shamir [24] in 1985. They describe an algorithm that factors $N$ in polynomial time if $2/3$ of the bits of $p$ or $q$ are known. In 1995, Coppersmith [6] improved this bound to $3/5$. Today's best attacks all rely on variants of Coppersmith's algorithm published in 1996 [8], [7]. A good overview of these algorithms is given in [17], [18]. The best results in this area are as follows: Let $N$ be an $n$-bit number which is a product of two $n/2$-bit primes. If half of the bits of either $p$ or $q$ (or both) are known, then $N$ can be factored in polynomial time. If fewer than half of the bits are known, say $n/4 - \epsilon$ bits, then the best algorithm simply guesses $\epsilon$ bits and then applies the polynomial-time algorithm, leading to a running time exponential in $\epsilon$. In practice, the values of $w$ (typically, $w \le 64$) and $n$ ($n \ge 1{,}024$) are always such that our proposed moduli remain secure against Coppersmith's factorization algorithm, since at most $w+3$ bits of $p$ and $q$ are known.

Finally, we consider a similar approach extended to moduli of the form $N = p^r q$, where $p$ and $q$ have the same bit size. This extension was proposed by Boneh et al. [5]. Assuming that $p$ and $q$ are of the same bit size, one needs a $1/(r+1)$-fraction of the most significant bits of $p$ in order to factor $N$ in polynomial time. In other words, for the case $r = 1$, we need half of the bits, whereas for, e.g., $r = 2$, we need only a third of the most significant bits of $p$. These results show that primes $p, q \in S_i$ assembling an RSA modulus of the form $N = p^r q$ should be used with care. This is especially true when $r$ is large. Note that if $r$ is on the order of $\log p$, the latter factoring method factors $N$ in polynomial time for any primes $p, q$.

Fig. 2. Binary representation of the proposed sets S3 and S4.

5 HARDWARE IMPLEMENTATION OF THE PROPOSED ALGORITHMS

A typical architecture for an interleaved modular multiplier is shown in Fig. 3. Both the Barrett and the Montgomery algorithm can be implemented on this architecture. It consists of two multiple-precision multipliers ($\otimes_1$ and $\otimes_2$) and one single-precision multiplier ($\otimes_3$). Apart from the multipliers, the architecture contains an additional adder, denoted by $\oplus$. Having two multiple-precision multipliers may seem redundant at first glance, but the multiplier $\otimes_1$ uses data from $x$ and $y$ that are fixed during a single modular multiplication. By running $\otimes_1$ and $\otimes_2$ in parallel, we speed up the whole multiplication process. If the target is a more compact design, one can also use a single multiple-precision multiplier, which does not reduce the generality of our discussion.
Multipliers $\otimes_1$ and $\otimes_2$ perform the multiplications at lines 3 and 5 of Algorithms 1 and 2, respectively. The multiplication performed in step 4 of both algorithms is done by multiplier $\otimes_3$. An eventual shift of the register $Z$ is handled by the controller. The exact schedule of the functional parts of the multiplier is as follows:

$$\otimes_1 \to \oplus \to \otimes_1\otimes_2\otimes_3 \to \oplus \to \oplus \to \otimes_1\otimes_2\otimes_3 \to \oplus \to \oplus \to \cdots$$

In the case of generalized Barrett reduction [9], the precomputed value $\mu$ is $\delta = w+4$ bits long, while in the case of Montgomery, the precomputed value $M'$ is $\delta = w$ bits long. In the generalized Barrett algorithm, the multiplier $\otimes_2$ uses the most significant $\delta$ bits of the product calculated by $\otimes_3$, while in the Montgomery case, it uses the least significant $\delta$ bits of the same product. This is indeed one reason why the Montgomery multiplier is superior to the Barrett one. The critical path of the whole design runs from the output of the register $Z$ to the input of the temporary register in $\otimes_2$, passing through two single-precision multipliers and one adder (bold line in Fig. 3).

To show this in practice, we have synthesized 192-, 256-, and 512-bit multipliers, each with a digit size of 32 bits. The code was first written in GEZEL [11] and tested for its functionality, and then translated to VHDL and synthesized using the Synopsys Design Compiler version Y-2006.06. The library we used was the UMC 0.13 μm CMOS High-Speed standard cell library. The results can be found in Table 1. The size of the designs is given as the number of NAND-gate equivalences (GEs).

TABLE 1. Synthesis Results for the Hardware Architectures of 192-, 256-, and 512-Bit Modular Multipliers.

Fig. 3. Architecture for an interleaved modular multiplier based on Barrett or Montgomery reduction.

Fig. 4. Architecture for an interleaved modular multiplier based on modified Barrett or modified Montgomery reduction.

A major improvement of the new algorithms is the simplified quotient evaluation. This results in the newly proposed architecture for the efficient modular multiplier shown in Fig. 4. It consists of two multiple-precision multipliers ($\otimes_1$ and $\otimes_2$) only. The most important difference is that there are no multiplications with precomputed values, and hence, the critical path contains only one single-precision multiplier and one adder (bold line in Fig. 4). To compare the performance with the architecture of Fig. 3, we have again synthesized a number of multipliers using the same standard cell library. The results are given in Table 1 and show that, frequency-wise, our proposed architecture outperforms the modular multiplier based on standard Barrett reduction by up to 52 percent. The architecture based on Montgomery reduction yields a relative speedup of up to 31 percent. Additionally, designs based on our algorithms demonstrate area savings in the range of 3.5 to 14 percent. Note that the obtained results are based on synthesis only. After place and route, we expect some decrease in performance for both implemented multipliers, and hence believe that the relative speedup will remain approximately the same.

Finally, it is interesting to consider the choice of digit size. As discussed in the previous section, the upper bound is determined by security margins. A typical digit size of 8, 16, 32, or 64 bits provides a reasonable security margin for an RSA modulus of 512 bits or more. On the other hand, as the digit size increases, the cycle count of the whole design decreases and the overall speed increases. A larger digit size also implies a larger circuit, and thus the throughput/area trade-off would be interesting to explore.

6 CONCLUSION

In this work, we proposed two interleaved modular multiplication algorithms based on Barrett and Montgomery modular reduction. We introduced two sets of moduli for the Barrett-based algorithm and two sets of moduli for the Montgomery-based algorithm.
These sets contain moduli with a prescribed number (typically, the digit size) of zero/one bits in either the most significant or the least significant part. Owing to this choice, our algorithms have no precomputation phase and have a simplified quotient evaluation, which makes them more flexible and efficient than existing solutions. Following the same principles as described in the paper, this approach can easily be extended to finite fields of characteristic two.

ACKNOWLEDGMENTS

This work is supported in part by the IAP Programme P6/26 BCRYPT of the Belgian State, by FWO project G.0300.07, by the European Commission under contract number ICT-2007-216676 ECRYPT NoE phase II, and by K.U. Leuven-BOF (OT/06/40).

IEEE TRANSACTIONS ON COMPUTERS, VOL. 59, NO. 12, DECEMBER 2010 1721