Fast Point Multiplication Algorithms for Binary Elliptic Curves with and without Precomputation

Abstract

In this paper we introduce new methods for computing constant-time variable-base point multiplications over the Galbraith-Lin-Scott (GLS) and the Koblitz families of elliptic curves. Using a left-to-right double-and-add and a right-to-left halve-and-add Montgomery ladder over a GLS curve, we present some of the fastest timings yet reported in the literature for point multiplication. In addition, we combine these two procedures to compute a multi-core protected scalar multiplication. Furthermore, we designed a novel regular τ-adic scalar expansion for Koblitz curves. As a result, using the regular recoding approach, we set the speed record for a single-core constant-time point multiplication on standardized binary elliptic curves at the 128-bit security level.
Thomaz Oliveira¹, Diego F. Aranha², Julio López²*, and Francisco Rodríguez-Henríquez¹
¹ Computer Science Department, CINVESTAV-IPN
² Institute of Computing, University of Campinas
Keywords: binary elliptic curves, scalar multiplication, software implementation
1 Introduction
From a cryptographic perspective, one of the most interesting consequences of
the Snowden revelations is the increased awareness of the importance of
implementing security protocols that offer the Perfect Forward Secrecy (PFS)
property. The PFS property guarantees that none of the past short-term session
keys of a protocol can be derived from the long-term private key of the server.
One tangible example is the recent announcement by the Internet Engineering
Task Force that version 1.3 of the Transport Layer Security (TLS) protocol will
no longer include cipher suites based on RSA key transport primitives [34].
Instead, the client-server secret key establishment will be performed via either
the Ephemeral Diffie-Hellman or the Elliptic Curve Ephemeral Diffie-Hellman
(ECDHE) methods. Because of the significant performance advantage of the latter
over the former, it is anticipated that in the years to come, ECDHE will be the
favorite choice for establishing a TLS shared secret.
The specifications of all the TLS protocol versions [8–10] include support
for prime and binary field elliptic curve cryptographic primitives. In the case
of binary elliptic curves, the TLS protocol supports a selection of several
standardized random curves as well as Koblitz curves [23] at the 80-, 128-, 192-
and 256-bit security levels. Koblitz curves allow performance improvements due
to the availability of the Frobenius automorphism τ. Also, their generation is
inherently rigid (in the SafeCurves sense [2]): the only degree of freedom in
the curve generation process consists in choosing a suitable prime extension
degree $m$ that produces a curve with almost-prime order. This severely limits
the possibility of “1-in-a-million attacks” [35] aiming to reach a weak curve
after testing many random seeds.
* The author was supported in part by the Intel Labs University Research Office.
Point multiplication is the single most important operation of (hyper) elliptic
curve cryptography. For that reason, considerable effort has been directed
towards achieving fast and compact software/hardware implementations of it. A
major result that has influenced the latest implementations was found in 2009,
when Galbraith, Lin and Scott (GLS), building on a previous technique introduced
by Gallant, Lambert and Vanstone (GLV) [14], constructed efficient endomorphisms
for a class of elliptic curves defined over the quadratic field
$\mathbb{F}_{q^2}$, where $q$ is a prime number [13]. Taking advantage of this
result, the authors of [13] performed a 128-bit security level point
multiplication that took 326,000 clock cycles on a 64-bit processor. Since then,
a steady stream of algorithmic and technological advances has translated into a
significant reduction in the number of clock cycles required to compute a
(hyper) elliptic curve constant-time variable-base-point multiplication at the
128-bit security level [1, 11, 24, 5, 4, 16, 38].
The authors of [24, 11] targeted a twisted Edwards GLV-GLS curve defined
over $\mathbb{F}_{p^2}$, with $p = 2^{127} - 5997$. That curve is equipped with
a degree-4 endomorphism allowing a fast point multiplication computation that
required just 92,000 clock cycles on an Ivy Bridge processor [11]. Bos et al. [5]
and Bernstein et al. [1] presented an efficient point multiplication on the
Kummer surface associated with the Jacobian of a genus 2 curve defined over a
field generated by the prime $p = 2^{127} - 1$. Each iteration of the Montgomery
ladder presented in [1] costs roughly 25 field multiplications, which, when
implemented on a Haswell processor, allows a point multiplication to be computed
in 72,000 clock cycles.
In 2014, Oliveira et al. introduced the λ-projective coordinate system, which
leads to faster binary field elliptic curve arithmetic [31, 32]. The authors
applied that coordinate system to a binary GLS curve that admits a degree-2
endomorphism and a fast field arithmetic associated with the quadratic
extension of the binary field $\mathbb{F}_{2^{127}}$. When implemented on a
Haswell processor, this approach performs one constant-time point
multiplication in just 60,000 clock cycles.
Contributions of this paper. This work presents new methods for fast
constant-time variable-base-point multiplication on both random and Koblitz
binary elliptic curves of the form $y^2 + xy = x^3 + ax^2 + b$.
In the case of random binary elliptic curves, we introduce a novel right-to-left
variant of the classical Montgomery-López-Dahab ladder algorithm presented
in [25], which efficiently adapted the original ladder idea introduced by Peter
Montgomery in his 1987 landmark paper [26]. The new variant presented in this
work does not require point doublings; instead, it uses the efficient point
halving operation available on binary elliptic curves. In contrast with the
algorithm presented in [25], which cannot benefit from precomputed tables,
our proposed variant can take advantage of this technique, a feature that could
prove valuable in the fixed-base-point multiplication scenario. Moreover, we
show that our new right-to-left Montgomery ladder formulation can be combined
with the classical ladder to attain a high parallel acceleration factor
for a constant-time multi-core implementation of the point multiplication
operation. As a second contribution, we present a procedure that adapts the
regular scalar recoding of [21] to the task of producing a regular τ-NAF scalar
recoding for Koblitz curves. This approach has faster precomputation than
related recodings [30] and allows us to achieve a speed record for single-core
constant-time point multiplication on standardized binary elliptic curves at the
128-bit security level.
The remainder of this paper is organized as follows. In Section 2 we give
a short description of the GLS and Koblitz curves, their arithmetic and their
security. In Section 3 we present new variants of the Montgomery ladder for
binary elliptic curves. Then, in Section 4, we introduce a regular τ-NAF
recoding suitable for producing protected point multiplication implementations
on Koblitz curves. In Section 5, we present our experimental implementation
results and finally, we draw our conclusions in Section 6.
2 Mathematical background
2.1 Quadratic field arithmetic
A binary extension field $\mathbb{F}_q$, $q = 2^m$, can be constructed by taking
a degree-$m$ polynomial $f(x) \in \mathbb{F}_2[x]$ irreducible over
$\mathbb{F}_2$, where the field elements in $\mathbb{F}_q$ are the set of binary
polynomials of degree less than $m$. Quadratic extensions of a binary extension
field can be built using a degree two monic polynomial $g(u) \in \mathbb{F}_q[u]$
irreducible over $\mathbb{F}_q$. In this case, the field $\mathbb{F}_{q^2}$ is
isomorphic to $\mathbb{F}_q[u]/(g(u))$ and its elements can be represented as
$a_0 + a_1 u$, with $a_0, a_1 \in \mathbb{F}_q$.
Operations in the quadratic extension are performed coefficient-wise. For
instance, the multiplication of two elements $a, b \in \mathbb{F}_{q^2}$ is
computed at the cost of three multiplications in the base field using the
customary Karatsuba formulation, which for $g(u) = u^2 + u + 1$ (so that
$u^2 = u + 1$) reads
\[ a \cdot b = (a_0 + a_1 u) \cdot (b_0 + b_1 u) = (a_0 b_0 + a_1 b_1) + (a_0 b_0 + (a_0 + a_1) \cdot (b_0 + b_1))\,u, \tag{1} \]
with $a_0, a_1, b_0, b_1 \in \mathbb{F}_q$.
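Equation (1) can be checked numerically with a small sketch. The following Python code is only an illustration and not the optimized library of [31, 32]: field elements are modelled as Python integers used as bit-vectors, with $f(x) = x^{127} + x^{63} + 1$ and $g(u) = u^2 + u + 1$ as stated in this section. The function names (`clmul`, `reduce_f`, `fq_mul`, `fq2_mul_karatsuba`, `fq2_mul_schoolbook`) are hypothetical.

```python
M = 127
F_POLY = (1 << 127) | (1 << 63) | 1      # f(x) = x^127 + x^63 + 1

def clmul(a, b):
    """Carry-less (polynomial) product of two bit-vectors over F_2."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_f(c):
    """Reduce a binary polynomial modulo f(x), clearing bits from the top down."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= F_POLY << (i - M)
    return c

def fq_mul(a, b):
    """Multiplication in F_q, q = 2^127."""
    return reduce_f(clmul(a, b))

def fq2_mul_karatsuba(a, b):
    """Eq. (1): three base-field multiplications, with u^2 = u + 1."""
    (a0, a1), (b0, b1) = a, b
    p00 = fq_mul(a0, b0)
    p11 = fq_mul(a1, b1)
    pss = fq_mul(a0 ^ a1, b0 ^ b1)
    return (p00 ^ p11, p00 ^ pss)

def fq2_mul_schoolbook(a, b):
    """Reference version with four base-field multiplications."""
    (a0, a1), (b0, b1) = a, b
    c0 = fq_mul(a0, b0)
    c1 = fq_mul(a0, b1) ^ fq_mul(a1, b0)
    c2 = fq_mul(a1, b1)                  # coefficient of u^2 = u + 1
    return (c0 ^ c2, c1 ^ c2)
```

Both functions return the pair $(c_0, c_1)$ representing $c_0 + c_1 u$; they agree because additions in characteristic two are XORs, so the cross term $a_0 b_1 + a_1 b_0$ equals $(a_0 + a_1)(b_0 + b_1) + a_0 b_0 + a_1 b_1$.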
In [31, 32], the authors developed an efficient software library for the field
$\mathbb{F}_{2^m}$ and its quadratic extension $\mathbb{F}_{2^{2m}}$, with
$m = 127$, generated by means of the irreducible trinomials
$f(x) = x^{127} + x^{63} + 1$ and $g(u) = u^2 + u + 1$, respectively.
The computational cost of the field arithmetic in the quadratic extension field
is significantly reduced by using that towering approach. To be more concrete,
let $M$ and $m$ denote the cost of one field multiplication over
$\mathbb{F}_{q^2}$ and $\mathbb{F}_q$, respectively. The execution of the
arithmetic library of [32] on the Sandy Bridge and Haswell microprocessors
yields a ratio $M/m$ of just 2.23 and 1.51, respectively.
These experimental results are considerably better than the theoretical ratio
$M/m = 3$ that one could expect from the Karatsuba formulation of Eq. (1). This
speedup can be explained by the fact that the towering field approach permits a
much better usage of the processor's pipelined execution unit, which can
potentially improve the speed of one 64-bit carry-less multiplication³ from
7 clock cycles to the maximum achievable throughput of just 2 clock cycles [12].
2.2 GLS binary elliptic curves
Let $E_{a,b}(\mathbb{F}_{q^2})$ denote the additive abelian group formed by the
point at infinity $\mathcal{O}$ and the set of affine points $P = (x, y)$ with
$x, y \in \mathbb{F}_{q^2}$ that satisfy the ordinary binary elliptic curve
equation
\[ E : y^2 + xy = x^3 + ax^2 + b, \tag{2} \]
defined over $\mathbb{F}_{q^2} = \mathbb{F}_{2^{2m}}$, with
$a \in \mathbb{F}_{q^2}$ and $b \in \mathbb{F}_{q^2}^{*}$. Let
$\#E_{a,b}(\mathbb{F}_{q^2})$ denote the size of the group
$E_{a,b}(\mathbb{F}_{q^2})$, and let us assume that $E_{a,b}(\mathbb{F}_{q^2})$
includes a subgroup $\langle P \rangle$ of prime order $r$.
The point multiplication operation, denoted by $Q = kP$, corresponds to
adding $P$ to itself $k - 1$ times, with $k \in [0, r - 1]$. The average cost of
computing $kP$ for a random $n$-bit scalar $k$ using the traditional
double-and-add method is about $nD + \frac{n}{2}A$, where $D$ and $A$ are the
cost of doubling and adding a point, respectively. If the elliptic curve $E$ of
Eq. (2) is equipped with a non-trivial efficiently computable endomorphism
$\psi$ such that $\psi(P) = \delta P \in \langle P \rangle$ for some
$\delta \in [2, r - 2]$, then the point multiplication can be computed à la GLV
as
\[ Q = kP = k_1 P + k_2 \psi(P) = k_1 P + k_2 \cdot \delta P, \]
where the subscalars $k_1, k_2$, of bit length $|k_1|, |k_2| \approx n/2$, can
be found by solving a closest vector problem in a lattice [13]. Having split the
scalar $k$ into two parts, the computation of $kP = k_1 P + k_2 \psi(P)$ can be
performed by applying simultaneous multiple point multiplication
techniques [18], which saves half of the doublings required by the execution of
a single point multiplication $kP$.
Inspired by the GLS technique of [13], Hankerson, Karabina and Menezes
presented in [17] a family of binary GLS curves defined over the field
$\mathbb{F}_{q^2}$, with $q = 2^m$, which admits a two-dimensional endomorphism.
This endomorphism can be computed at the inexpensive cost of just three
additions in $\mathbb{F}_q$. Furthermore, by carefully choosing the elliptic
curve parameters $a, b$ of Eq. (2), the authors of [17] showed that it is
possible to find members of that family of GLS curves with an almost-prime group
order of the form $\#E_{a,b}(\mathbb{F}_{q^2}) = hr$, with $h = 2$ and where $r$
is a $(2m - 1)$-bit prime number.
³ Corresponding to Intel's PCLMULQDQ instruction.
Security of GLS curves. Given a point $Q \in \langle P \rangle$, the Elliptic
Curve Discrete Logarithm Problem (ECDLP) consists of finding the unique integer
$k \in [0, r - 1]$ such that $Q = kP$. To the best of our knowledge, the most
powerful attack for solving the ECDLP on binary elliptic curves was presented
in [33] (see also [20, 36]), with an associated computational complexity of
$O(2^{c \cdot m^{2/3} \log m})$, where $c < 2$ and where $m$ is a prime number.
This is worse than generic algorithms with time complexity $O(2^{m/2})$ for all
prime field extensions $m$ less than $N = 2000$, a bound that is well above the
range used for performing elliptic curve cryptography [33]. On the other hand,
since the elliptic curve of Eq. (2) is defined over a quadratic extension of the
field $\mathbb{F}_q$, the generalized Gaudry-Hess-Smart (gGHS) attack [15, 19]
to solve the ECDLP on the curve $E$ applies. To prevent this attack, it suffices
to verify that the constant $b$ of $E_{a,b}(\mathbb{F}_{q^2})$ is not weak.
Nevertheless, the probability that a randomly selected
$b \in \mathbb{F}_q^{*}$ is a weak parameter is negligibly small [17].
2.3 Koblitz curves
A Koblitz curve, also known as an anomalous binary curve or subfield curve, is
defined as the set of affine points
$P = (x, y) \in \mathbb{F}_q \times \mathbb{F}_q$, $q = 2^m$, that satisfy
the Weierstraß equation $E_a : y^2 + xy = x^3 + ax^2 + 1$, $a \in \{0, 1\}$,
together with a point at infinity denoted by $\mathcal{O}$. In λ-affine
coordinates, where the points are represented as
$P = (x, \lambda = x + \frac{y}{x})$, $x \neq 0$, the λ-affine form of the above
equation becomes [32] $(\lambda^2 + \lambda + a)x^2 = x^4 + 1$. A Koblitz curve
forms an abelian group denoted as $E_a(\mathbb{F}_{2^m})$ of order
$2^{2-a} r$, for an odd prime $r$, where its group law is defined by the point
addition operation.
Frobenius map. Since their introduction in [23], Koblitz curves have been
extensively studied for their additional structure that allows, in principle, a
performance speedup in the point multiplication computation. The Frobenius map
$\tau : E_a(\mathbb{F}_q) \to E_a(\mathbb{F}_q)$ defined by
$\tau(\mathcal{O}) = \mathcal{O}$, $\tau(x, y) = (x^2, y^2)$, is a curve
automorphism satisfying $(\tau^2 + 2)P = \mu\tau(P)$ for $\mu = (-1)^{1-a}$ and
all $P \in E_a(\mathbb{F}_q)$. By solving the equation $\tau^2 + 2 = \mu\tau$,
the Frobenius map can be seen as the complex number
$\tau = \frac{\mu + \sqrt{-7}}{2}$. Notice that in λ-coordinates the Frobenius
map action remains the same, because
$\tau(x, \lambda) = (x^2, \lambda^2) = (x^2, x^2 + \frac{y^2}{x^2})$, which
corresponds to the λ-representation of $\tau(x, y)$. Let $\mathbb{Z}[\tau]$ be
the ring of polynomials in $\tau$ with coefficients in $\mathbb{Z}$. Since the
Frobenius map is highly efficient, as long as it is possible to convert an
integer scalar $k$ to its τ-representation $k = \sum_{i=0}^{l-1} u_i \tau^i$,
its action can be exploited in a point multiplication computation by adding
multiples $u_i \tau^i(P)$, with $u_i \tau^i \in \mathbb{Z}[\tau]$. Solinas [37]
proposed exactly that, namely, a τ-adic scalar recoding analogous to the signed
digit scalar Non-Adjacent Form representation.
Security of Koblitz curves. From the security point of view, it has been
argued that the availability of additional structure in the form of
endomorphisms can be a potential threat to the hardness of elliptic curve
discrete logarithms [3], but the limitations observed in approaches based on
isogeny walks provide evidence to the contrary [22]. Furthermore, the generation
of Koblitz curves satisfies the rigidity property by definition. Constant-time
compact implementations for Koblitz curves are also easily obtained by
specializing the Montgomery-López-Dahab ladder algorithm [25] for $b = 1$,
although we show below that this is not the most efficient constant-time
implementation strategy possible. Another practical advantage is the adoption of
Koblitz curves by several standards bodies [27], which guarantees
interoperability and availability of implementations on many hardware and
software platforms.
3 New Montgomery ladder variants
This Section presents algorithms for computing the scalar multiplication through
the Montgomery ladder method. Here, we let $P$ be a point in a binary elliptic
curve of prime order $r > 2$ and $k$ a scalar of bit length $n$. Our objective
is to compute $Q = kP$.
Algorithm 1 Left-to-right Montgomery ladder [26]
Input: $P = (x, y)$, $k = (1, k_{n-2}, \ldots, k_1, k_0)$
Output: $Q = kP$
1: $R_0 \leftarrow P$; $R_1 \leftarrow 2P$;
2: for $i = n - 2$ downto $0$ do
3:   if $k_i = 1$ then
4:     $R_0 \leftarrow R_0 + R_1$; $R_1 \leftarrow 2R_1$
5:   else
6:     $R_1 \leftarrow R_0 + R_1$; $R_0 \leftarrow 2R_0$
7:   end if
8: end for
9: return $Q = R_0$
Algorithm 1 describes the classical left-to-right Montgomery ladder approach
for point multiplication [26], whose key algorithmic idea is based on the
following observation. Given a base point $P$ and two input points $R_0$ and
$R_1$ such that their difference, $R_1 - R_0 = P$, is known, the x-coordinates
of the points $2R_0$, $2R_1$ and $R_0 + R_1$ are fully determined by the
x-coordinates of $P$, $R_0$ and $R_1$.
More than one decade after its original proposal in [26], López and Dahab
presented in [25] an optimized version of the Montgomery ladder, which was
specifically crafted for the efficient computation of point multiplication on
ordinary binary elliptic curves. In this scenario, compact formulae for the
point addition and point doubling operations of Algorithm 1 can be derived from
the following result.
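Lemma 1 ([25]). Let $P = (x, y)$, $R_1 = (x_1, y_1)$, and $R_0 = (x_0, y_0)$ be
elliptic curve points, and assume that $R_1 - R_0 = P$ and $x_0 \neq 0$. Then,
the x-coordinate of the point $(R_0 + R_1)$, $x_3$, can be computed in terms of
$x_0$, $x_1$, and $x$ as follows,
\[ x_3 = \begin{cases} x + \dfrac{x_0 \cdot x_1}{(x_0 + x_1)^2} & R_0 \neq \pm R_1 \\[2mm] x_0^2 + \dfrac{b}{x_0^2} & R_0 = R_1 \end{cases} \tag{3} \]
Moreover, the y-coordinate of $R_0$ can be expressed in terms of $P$ and the
x-coordinates of $R_0$, $R_1$ as,
\[ y_0 = x^{-1}(x_0 + x)\left[(x_0 + x)(x_1 + x) + x^2 + y\right] + y \tag{4} \]
Let us denote the projective representation of the points $R_0$, $R_1$ and
$R_0 + R_1$, without considering their y-coordinates, as $R_0 = (X_0, -, Z_0)$,
$R_1 = (X_1, -, Z_1)$ and $R_0 + R_1 = (X_3, -, Z_3)$. Then, for the case
$R_0 = R_1$, Lemma 1 implies,
\[ \begin{cases} X_3 = X_0^4 + b \cdot Z_0^4 \\ Z_3 = X_0^2 \cdot Z_0^2 \end{cases} \tag{5} \]
Furthermore, for the case $R_0 \neq \pm R_1$, one has that,
\[ \begin{cases} Z_3 = (X_0 \cdot Z_1 + X_1 \cdot Z_0)^2 \\ X_3 = x \cdot Z_3 + (X_0 \cdot Z_1) \cdot (X_1 \cdot Z_0) \end{cases} \tag{6} \]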
From Equations (5) and (6) it follows that the computational cost of each
ladder step in Algorithm 1 is 5 multiplications, 1 multiplication by the curve
b-constant, 4 or 5 squarings⁴ and 3 additions over the binary extension field
where the elliptic curve has been defined.
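The following Python sketch exercises Algorithm 1 together with Equations (4), (5) and (6) on a deliberately tiny example, comparing the x-only ladder against a plain affine reference implementation. The parameters are assumptions chosen only for illustration: the toy field $\mathbb{F}_{2^7}$ with $f(x) = x^7 + x + 1$ and the toy curve $y^2 + xy = x^3 + x^2 + 1$. All function names are hypothetical, and the code branches on scalar bits, so it is not constant-time.

```python
# Toy parameters (assumptions, far too small for real use).
M, F_POLY = 7, 0b10000011          # F_{2^7}, f(x) = x^7 + x + 1
A, B = 1, 1                        # curve y^2 + xy = x^3 + A x^2 + B

def fmul(a, b):
    """Multiply two reduced field elements (bit-vectors) modulo f(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= F_POLY
    return r

def finv(a):
    """Inverse of a != 0, computed as a^(2^M - 2)."""
    r = 1
    for _ in range(M - 1):
        a = fmul(a, a)
        r = fmul(r, a)
    return r

def affine_add(P, Q):
    """Reference affine group law; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 ^ y2 == x1:                     # Q = -P
            return None
        lam = x1 ^ fmul(y1, finv(x1))         # tangent slope
        x3 = fmul(lam, lam) ^ lam ^ A
        return x3, fmul(x1, x1) ^ fmul(lam ^ 1, x3)
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))        # chord slope
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1

def xdbl(X, Z):
    """Eq. (5): x-only doubling."""
    X2, Z2 = fmul(X, X), fmul(Z, Z)
    return fmul(X2, X2) ^ fmul(B, fmul(Z2, Z2)), fmul(X2, Z2)

def xadd(X0, Z0, X1, Z1, x):
    """Eq. (6): x-only addition, where x = x(P) and R1 - R0 = P."""
    u, v = fmul(X0, Z1), fmul(X1, Z0)
    Z3 = fmul(u ^ v, u ^ v)
    return fmul(x, Z3) ^ fmul(u, v), Z3

def ladder(k, x):
    """Algorithm 1 for k >= 1; returns x-only (kP, (k+1)P)."""
    R0, R1 = (x, 1), xdbl(x, 1)
    for i in range(k.bit_length() - 2, -1, -1):
        s = xadd(*R0, *R1, x)
        if (k >> i) & 1:
            R0, R1 = s, xdbl(*R1)
        else:
            R0, R1 = xdbl(*R0), s
    return R0, R1

def recover_y(x, y, x0, x1):
    """Eq. (4): y-coordinate of R0 from P = (x, y), x(R0) and x(R1)."""
    t = fmul(x0 ^ x, x1 ^ x) ^ fmul(x, x) ^ y
    return fmul(fmul(x0 ^ x, t), finv(x)) ^ y

def find_point():
    """Brute-force search for an affine point with x != 0."""
    for x in range(1, 1 << M):
        for y in range(1 << M):
            if fmul(y, y) ^ fmul(x, y) == fmul(fmul(x, x), x ^ A) ^ B:
                return x, y
```

A typical use is `(X0, Z0), (X1, Z1) = ladder(k, x)` followed by one inversion per coordinate to return to affine form and a call to `recover_y`; this mirrors the post-computation step that every ladder variant in this Section needs.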
In the rest of this Section, we will present a novel right-to-left formulation of
the classical Montgomery ladder.
3.1 Right-to-left double-and-add Montgomery-LD ladder
Algorithm 2 presents a right-to-left version of the classical Montgomery ladder
procedure. At the end of the $i$-th iteration, the points in the variables
$R_0$, $R_1$ are $R_0 = 2^{i+1}P$ and $R_1 = \ell P + \frac{P}{2}$, where $\ell$
is the integer represented by the $i$ rightmost bits of the scalar $k$. The
variable $R_2$ maintains the relationship $R_2 = R_0 - R_1$ from the
initialization (step 1) until the execution of the last iteration of the main
loop (steps 2-9). This comes from the fact that at each iteration, if
$k_i = 1$, then the difference $R_0 - R_1$ remains unchanged. Otherwise, if
$k_i = 0$, then both $R_2$ and $R_0$ are updated with their respective original
values plus $R_0$, which ensures that $R_2 = R_0 - R_1$ still holds. Notice,
however, that although the difference $R_2 = R_0 - R_1$ is known, it may vary
throughout the iterations.
⁴ Either $b = 1$ or $\sqrt{b}$ is precomputed. Formula (5) can also be computed
as $Z_3 = (X_0 \cdot Z_0)^2$ and $X_3 = (X_0^2 + \sqrt{b} \cdot Z_0^2)^2$.
Algorithm 2 Montgomery-LD double-and-add scalar multiplication (right-to-left)
Input: $P = (x, y)$, $k = (k_{n-1}, k_{n-2}, \ldots, k_1, k_0)$
Output: $Q = kP$
1: $R_0 \leftarrow P$; $R_1 \leftarrow \frac{P}{2}$; $R_2 \leftarrow \frac{P}{2}$ $(= R_0 - R_1)$;
2: for $i = 0$ to $n - 1$ do
3:   if $k_i = 1$ then
4:     $R_1 \leftarrow R_1 + R_0$;
5:   else
6:     $R_2 \leftarrow R_2 + R_0$;
7:   end if
8:   $R_0 \leftarrow 2R_0$;
9: end for
10: return $Q = R_1 - \frac{P}{2}$
As stated in Lemma 1, the point additions of steps 4 and 6 in Algorithm 2
can be computed using the x-coordinates of the points $R_0$, $R_1$ and $R_2$,
according to the following analysis. If $k_i = 1$, then the x-coordinate of
$R_0 + R_1$ is a function of the x-coordinates of $R_0$, $R_1$ and $R_2$,
because $R_2 = R_0 - R_1$. If $k_i = 0$, the x-coordinate of $R_2 + R_0$ is a
function of the x-coordinates of the points $R_0$, $R_1$ and $R_2$, because
$R_0 - R_2 = R_0 - (R_0 - R_1) = R_1$. Hence, considering the projective
representation of the points $R_0 = (X_0, -, Z_0)$, $R_1 = (X_1, -, Z_1)$,
$R_2 = (X_2, -, Z_2)$ and $R_0 + R_1 = (X_3, -, Z_3)$, where all the
y-coordinates are ignored, and assuming $R_0 \neq \pm R_1$, we have,
\[ \begin{aligned} T &= (X_0 \cdot Z_1 + X_1 \cdot Z_0)^2 \\ Z_3 &= Z_2 \cdot T \\ X_3 &= X_2 \cdot T + Z_2 \cdot (X_0 \cdot Z_1) \cdot (X_1 \cdot Z_0) \end{aligned} \tag{7} \]
From Equations (5) and (7), it follows that the computational cost of each
ladder step in Algorithm 2 is 7 multiplications, 1 multiplication by the curve
b-constant, 4 or 5 squarings and 3 additions over the binary field where the
elliptic curve lies.
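The control flow of Algorithm 2 and the relationship between $R_0$, $R_1$ and $R_2$ can be illustrated with a toy model. The sketch below does not use curve arithmetic at all: as an assumption made purely for illustration, the prime-order group $\langle P \rangle$ is replaced by the integers modulo a hypothetical odd modulus `R_ORDER`, so that a "point" is an integer, point addition is modular addition, and halving is multiplication by $2^{-1}$. The names `half` and `ladder_rtl` are hypothetical.

```python
R_ORDER = 1000003                      # hypothetical odd group order (toy model)
INV2 = (R_ORDER + 1) // 2              # 2^{-1} mod R_ORDER

def half(p):
    """Toy 'point halving': the unique q with 2q = p in the group."""
    return p * INV2 % R_ORDER

def ladder_rtl(k, n, P):
    """Algorithm 2 on the toy group; processes the n low bits of k."""
    R0, R1, R2 = P % R_ORDER, half(P), half(P)
    for i in range(n):
        # The difference needed by the differential addition is available:
        # R2 = R0 - R1 (equivalently R1 = R0 - R2).
        assert (R0 - R1 - R2) % R_ORDER == 0
        if (k >> i) & 1:
            R1 = (R1 + R0) % R_ORDER   # step 4
        else:
            R2 = (R2 + R0) % R_ORDER   # step 6
        R0 = 2 * R0 % R_ORDER          # step 8
    return (R1 - half(P)) % R_ORDER    # step 10
```

In this model `ladder_rtl(k, n, P)` returns `k * P % R_ORDER` for any `k` below `2**n`, and the assertion inside the loop confirms that both branches preserve $R_2 = R_0 - R_1$.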
Although conceptually simple, the above method has several algorithmic and
practical shortcomings. The most important one is the difficulty of recovering,
at the end of the algorithm, the y-coordinate of $R_1$, since the y-coordinate
is not known for any of the available points ($R_0$, $R_1$ and $R_2$). This may
force the decision to use complete projective formulae for the point addition
and doubling operations of steps 4, 6 and 8, which would be costly. Finally, we
stress that to guarantee that the case $R_0 = R_2$ will never occur, it is
sufficient to initialize $R_1$ with $\frac{P}{2}$ and perform an affine
subtraction at the end of the main loop (step 10).
In the following subsection we present a halve-and-add right-to-left
Montgomery ladder algorithm that alleviates the above shortcomings and still
achieves a competitive performance.
3.2 Right-to-left halve-and-add Montgomery-LD ladder
Algorithm 3 Montgomery-López-Dahab halve-and-add (right-to-left)
Input: $P = (x, y)$, $k' = (k'_{n-1}, k'_{n-2}, \ldots, k'_1, k'_0)$
Output: $Q = kP$
1: Precomputation: $x(P_i)$, where $P_i = \frac{P}{2^i}$, for $i = 0, \ldots, n$
2: $R_1 \leftarrow P_n$; $R_2 \leftarrow P_n$;
3: for $i = 0$ to $n - 1$ do
4:   $R_0 \leftarrow P_{n-1-i}$;
5:   if $k'_i = 1$ then
6:     $R_1 \leftarrow R_0 + R_1$;
7:   else
8:     $R_2 \leftarrow R_0 + R_2$;
9:   end if
10: end for
11: $R_1 \leftarrow R_1 - P_n$
12: return $R_1$
Algorithm 3 presents a right-to-left Montgomery ladder procedure similar to
Algorithm 2, but in this case, all the point doubling operations are replaced
by point halvings. A left-to-right approach using halve-and-add with the
Montgomery ladder was published in [29]. However, that method requires one
inversion per iteration, which degrades its efficiency due to the cost of this
operation.
As in any halve-and-add procedure, an initial step before performing the
actual computation consists of processing the scalar $k$ so that it can be
equivalently represented with negative powers of two. To this end, one first
computes $k' \equiv 2^{n-1} k \bmod r$, with $n = |r|$. This implies that
$k \equiv \sum_{i=1}^{n} k'_{n-i} / 2^{i-1} \bmod r$ and therefore
$kP = \sum_{i=1}^{n} k'_{n-i} \left( \frac{1}{2^{i-1}} P \right)$. Then, in the
first step of Algorithm 3, $n$ halvings of the base point $P$ are computed. We
stress that all the precomputed points $P_i = \frac{P}{2^i}$, for
$i = 0, \ldots, n$, can be stored in affine coordinates. In fact, just the
x-coordinate of each of these points must be stored (with the sole exception of
the point $P_n$, whose y-coordinate is also computed and stored).
As in the preceding algorithm, notice that at the end of the $i$-th iteration,
the points in the variables $R_0$, $R_1$ are $R_0 = \frac{P}{2^{n-i-1}}$ and
$R_1 = \ell P + P_n$, where in this case $\ell$ is the integer represented as
$\ell = \sum_{j=0}^{i} \frac{k'_j}{2^{n-1-j}} \bmod r$, since step 6 adds
$P_{n-1-j}$ whenever $k'_j = 1$. Notice also that the variable $R_2$ maintains
the relationship $R_2 = R_0 - R_1$, evaluated each time $R_0$ is loaded in
step 4, until the execution of the last iteration of the main loop (steps 3-10).
This comes from the fact that at each iteration, if $k'_i = 1$, then the
difference $R_0 - R_1$ remains unchanged. Otherwise, if $k'_i = 0$, then both
$R_2$ and $R_0$ are updated with their respective original values plus $R_0$
(the next precomputed point satisfies $P_{n-2-i} = 2P_{n-1-i}$), which ensures
that $R_2 = R_0 - R_1$ still holds.
Since at every iteration the values of the points $R_0$, $R_1$ and $R_0 - R_1$
are all known, the compact point addition formula (7) can be used. In practice,
this is also possible because the y-coordinate of the output point $kP$ can be
readily recovered using Equation (4), along with the point $2P$. Moreover, since
the points in the precomputed table were generated using affine coordinates, it
turns out that the z-coordinate of the point $R_0$ is always 1 for all the
iterations of the main loop. This simplifies (7) as,
\[ \begin{aligned} T &= (X_0 \cdot Z_1 + X_1)^2 \\ Z_3 &= Z_2 \cdot T \\ X_3 &= X_2 \cdot T + Z_2 \cdot (X_0 \cdot Z_1) \cdot (X_1) \end{aligned} \tag{8} \]
Hence, the computational cost per iteration of Algorithm 3 is 5 multiplications,
1 squaring, 2 additions and one point halving over the binary field where the
elliptic curve lies.
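The scalar conversion $k' \equiv 2^{n-1} k \bmod r$ and the structure of Algorithm 3 can be sketched in the same spirit. Again, this is a toy model and not the authors' implementation: the group $\langle P \rangle$ is replaced, as an assumption, by the integers modulo a hypothetical odd order `R_ORDER`, so that halving is multiplication by $2^{-1}$ and the table $P_i = P/2^i$ is a list of integers. The name `halve_and_add` is hypothetical.

```python
R_ORDER = 1000003                      # hypothetical odd group order (toy model)
N = R_ORDER.bit_length()               # n = |r|
INV2 = (R_ORDER + 1) // 2              # 2^{-1} mod R_ORDER

def halve_and_add(k, P):
    """Algorithm 3 on the toy group: returns kP using halvings only."""
    kp = (k << (N - 1)) % R_ORDER      # k' = 2^{n-1} k mod r
    table = [P % R_ORDER]              # step 1: P_i = P / 2^i, i = 0..n
    for _ in range(N):
        table.append(table[-1] * INV2 % R_ORDER)
    R1 = R2 = table[N]                 # step 2
    for i in range(N):
        R0 = table[N - 1 - i]          # step 4
        # Each freshly loaded R0 equals R1 + R2, so the difference needed
        # by the differential addition (R0 - R1 or R0 - R2) is known.
        assert R0 == (R1 + R2) % R_ORDER
        if (kp >> i) & 1:
            R1 = (R0 + R1) % R_ORDER   # step 6
        else:
            R2 = (R0 + R2) % R_ORDER   # step 8
    return (R1 - table[N]) % R_ORDER   # step 11
```

The bit $k'_i$ is paired with $P_{n-1-i} = P/2^{n-1-i}$, so the accumulated sum equals $k' P / 2^{n-1} = kP$; the final subtraction removes the initial offset $P_n$.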
GLS Endomorphism. The efficiently computable endomorphism provided by the
GLS curves can be used to implement the 2-GLV method on Algorithm 3. As
a result, only $n/2$ point halving operations must be computed. Besides the
speed improvement, the 2-GLV method halves the number of precomputed
points that must be stored.
3.3 Multi-core Montgomery ladder
As proposed in [38], by properly recoding the scalar, one can efficiently compute
the scalar multiplication in a multi-core environment. Specifically, given a
scalar $k$ of size $n$, we fix a constant $t$ which establishes how many scalar
bits will be processed by the double-and-add and by the halve-and-add
procedures. This is accomplished by computing $k' = 2^t k \bmod r$, which
yields,
\[ k = \underbrace{\frac{k'_0}{2^t} + \frac{k'_1}{2^{t-1}} + \cdots + \frac{k'_{t-1}}{2^1}}_{\text{halve-and-add}} + \underbrace{k'_t 2^0 + 2^1 k'_{t+1} + 2^2 k'_{t+2} + \cdots + 2^{(n-1)-t} k'_{n-1}}_{\text{double-and-add}} \]
In a two-core setting, it is straightforward to combine the left-to-right and
right-to-left Montgomery ladder procedures of Algorithms 1 and 3, and distribute
them to both cores. In this scenario, the number of necessary pre-computed
halved points reduces to $\frac{n}{4}$. In a four-core platform, we can apply
the GLS endomorphism to the left-to-right Montgomery ladder (Algorithm 1). Even
though the GLV technique is ineffective for the classical Montgomery algorithm
(due to the fact that we cannot share the point doublings between the base point
and its endomorphism), the method permits an efficient splitting of the
algorithm workload into two cores. In this way, one can use the first two cores
for computing $t$ digits of the GLV subscalars $k_1$ and $k_2$ by means of
Algorithm 3, while we allocate the other two cores to compute the rest of the
scalar's bits using Algorithm 1, as shown in Algorithm 6 (see Appendix A).
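The arithmetic behind the split can be checked with a short sketch. It only evaluates the two partial sums of the decomposition above and does not run the ladders themselves; as an assumption for illustration, the group is modelled by the integers modulo a hypothetical odd order `R_ORDER`, and the function name `split_contributions` is hypothetical.

```python
R_ORDER = 1000003                      # hypothetical odd group order (toy model)
N = R_ORDER.bit_length()
INV2 = (R_ORDER + 1) // 2              # 2^{-1} mod R_ORDER

def split_contributions(k, P, t):
    """Return the halve-and-add and double-and-add parts of kP."""
    kp = (k << t) % R_ORDER            # k' = 2^t k mod r
    # Bits k'_0 .. k'_{t-1} carry negative powers of two (halvings).
    low = sum(((kp >> j) & 1) * P * pow(INV2, t - j, R_ORDER)
              for j in range(t)) % R_ORDER
    # Bits k'_t .. k'_{n-1} carry non-negative powers of two (doublings).
    high = sum(((kp >> j) & 1) * P * pow(2, j - t, R_ORDER)
               for j in range(t, N)) % R_ORDER
    return low, high
```

Each part could be assigned to a different core, and their sum reproduces $kP$ in the toy group. The choice of `t` balances the two workloads; it does not affect correctness.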
Table 1. Montgomery-LD algorithms cost comparison. In this table, M, Ma, Mb, S, I
denote the following field operations: multiplication, multiplication by the
curve a-constant, multiplication by the curve b-constant, squaring and
inversion. The point halving operation is denoted by H.

Setting  Method                                             Phase      Cost
-------  -------------------------------------------------  ---------  ------------------------------
1-core   Alg. 1: Montgomery-LD                              pre/post   10M + 1S + 1I
         (double-and-add, left-to-right)                    sc. mult.  n(5M + 1Mb + 4S)
         Alg. 3: Montgomery-LD-2-GLV                        pre/post   48M + 1Ma + 13S + 3I
         (halve-and-add, right-to-left)                     sc. mult.  (n/2 + 1)H + n(5M + 1S)
2-core   Montgomery-LD-2-GLV                                pre/post   25M + 1Ma + 5S + 2I
         (double-and-add, left-to-right), core I            sc. mult.  (n - t2)(5M + 1Mb + 4S)
         Montgomery-LD-2-GLV                                pre/post   46M + 2Ma + 12S + 2I
         (halve-and-add, right-to-left), core II            sc. mult.  (t2/2 + 1)H + t2(5M + 1S)
         Overhead                                                      15M + 5S + 1I
4-core   Montgomery-LD-2-GLV                                pre/post   10M + 1S + 1I
         (double-and-add, left-to-right), cores I & II      sc. mult.  (n/2 - t4)(5M + 1Mb + 4S)
         Montgomery-LD-2-GLV                                pre/post   16M + 1Ma + 4S + 1I
         (halve-and-add, right-to-left), cores III & IV     sc. mult.  (t4/2 + 1)H + t4(5M + 1S)
         Overhead                                                      34M + 1Ma + 12S + 1I
3.4 Cost comparison of Montgomery ladder variants
Table 1 shows the computational costs associated with the Montgomery ladder
variants described in this Section. The constants $t_2$ and $t_4$ represent the
values of the parameter $t$ chosen for the two- and four-core implementations,
respectively.⁵ All Montgomery ladder algorithms require a basic
post-computation cost to retrieve the y-coordinate, which demands ten
multiplications, one squaring and one inversion. Due to the application of the
GLV technique, the Montgomery-LD-2-GLV halve-and-add version (corresponding to
Algorithm 3) requires a few extra operations, namely, the subtraction of a point
and the addition of two accumulators, which is performed using the López-Dahab
(LD) projective coordinate formulae. In the end, one extra inversion is needed
to convert the point representation from LD-projective coordinates to affine
coordinates.
In the case of the parallel versions, the overhead is given by the
post-computation done in one single core. The exact costs are mainly determined
by the accumulator additions that are performed via full and mixed LD-projective
formulae. All of the timings reported in Section 5 include the cost of the
LD-projective to affine coordinate transformation.
⁵ In our implementations (see subsection 5.3 below), the values used for the
parameters $t_2$ and $t_4$ ranged from 53 to 55.
4 A novel regular τ-adic approach
4.1 Recoding in τ-adic form
The recoding approach proposed by Solinas finds an element
$\rho \in \mathbb{Z}[\tau]$, of as small norm as possible, such that
$\rho \equiv k \pmod{\frac{\tau^m - 1}{\tau - 1}}$. A τ-adic expansion with
average non-zero density $\frac{1}{3}$ can be obtained by repeatedly dividing
$\rho$ by $\tau$ and assigning the remainders to the digits $u_i$ to obtain
$k = \sum_{i=0}^{l-1} u_i \tau^i$. An alternative approach that does not involve
multi-precision divisions is to compute an element
$\rho' = k \ \mathrm{partmod}\ \frac{\tau^m - 1}{\tau - 1}$ by performing a
partial reduction procedure [37]. A width-$w$ τ-NAF expansion with non-zero
density $\frac{1}{w+1}$, where at most one of any $w$ consecutive coefficients
is non-zero, can also be obtained by repeatedly dividing $\rho'$ by $\tau^w$ and
assigning the remainders to the digit set
$\{0, \pm\alpha_1, \pm\alpha_3, \ldots, \pm\alpha_{2^{w-1}-1}\}$, for
$\alpha_i = i \bmod \tau^w$. Under reasonable assumptions, this window-based
recoding has length $l \le m + 1$ [37].
In this section, a regular recoding version of the (width-w)τ-NAF expan-
sion is derived. The security advantages of such recoding are the predictable
length and locations of non-zero digits in the expansion. This eliminates any
side-channel information that an attacker could possibly collect regarding the
operation executed at any iteration of the scalar multiplication algorithm (point
doubling/Frobenius map or point addition). As long as querying a precomputed
table of points to select the second operand of a point addition takes constant
time, the resulting algorithm should be resistant against any timing-based side-
channel attacks.
Let us first consider the integer recoding proposed by Joye and Tunstall [21].
They observed that any odd integer $i$ in the interval $[0, 2^w)$ can be written
as $i = 2^{w-1} + (-(2^{w-1} - i))$. Repeatedly dividing an odd $n$-bit integer
$k - ((k \bmod 2^w) - 2^{w-1})$ by $2^{w-1}$ maintains the parity and assigns
the remainders to the digit set $\{\pm 1, \ldots, \pm(2^{w-1} - 1)\}$, producing
an expansion of length $\lceil 1 + \frac{n}{w-1} \rceil$ with non-zero density
$\frac{1}{w-1}$. Our solution for the problem of finding a regular
τ-adic expansion employs the same intuition, as explained next.
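The integer recoding just described can be sketched in a few lines of Python. This is a minimal illustration of the Joye-Tunstall idea as summarized above, not code taken from [21]; the function name `regular_recode` and the convention of passing the bit length `n` explicitly are assumptions.

```python
def regular_recode(k, w, n):
    """Regular signed-digit recoding of an odd n-bit integer k.

    Returns digits u_0, u_1, ... (least significant first), all odd with
    |u_i| < 2^(w-1), such that k = sum(u_i * 2^(i*(w-1))).
    """
    assert w >= 2 and k % 2 == 1 and 0 < k < (1 << n)
    length = 1 + -(-n // (w - 1))              # 1 + ceil(n / (w - 1))
    digits = []
    for _ in range(length - 1):
        u = (k % (1 << w)) - (1 << (w - 1))    # odd remainder, centred at 0
        digits.append(u)
        k = (k - u) >> (w - 1)                 # exact division; k stays odd
    digits.append(k)                           # most significant digit
    return digits
```

For example, `regular_recode(13, 3, 4)` yields the digits `[1, -1, 1]`, since $13 = 1 - 1 \cdot 4 + 1 \cdot 16$. The length depends only on `n` and `w`, and every digit is non-zero, which is what makes the pattern of operations in the subsequent scalar multiplication independent of the scalar.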
Let $\phi_w : \mathbb{Z}[\tau] \to \mathbb{Z}_{2^w}$ be a surjective ring
homomorphism induced by $\tau \mapsto t_w$, for
$t_w^2 + 2 \equiv \mu t_w \pmod{2^w}$, with kernel
$\{\alpha \in \mathbb{Z}[\tau] : \tau^w \text{ divides } \alpha\}$. An
element $i = i_0 + i_1\tau$ from $\mathbb{Z}[\tau]$ with odd integers
$i_0, i_1 \in [0, 2^w)$ satisfies the analogous property
$\phi_w(i) = 2^{w-1} + (-(2^{w-1} - \phi_w(i)))$. Repeated division of
$(r_0 + r_1\tau) - (((r_0 + r_1\tau) \bmod \tau^w) - \tau^{w-1})$ by
$\tau^{w-1}$, correspondingly of
$\phi_w(\rho') = (r_0 + r_1 t_w) - ((r_0 + r_1 t_w \bmod 2^w) - 2^{w-1})$ by
$2^{w-1}$, obtains remainders that belong to the set
$\{0, \pm\alpha_1, \pm\alpha_3, \ldots, \pm\alpha_{2^{w-1}-1}\}$. The resulting
expansion always has length $\lceil 1 + \frac{m+2}{w-1} \rceil$ and non-zero
density $\frac{1}{w-1}$. Algorithm 4 presents the recoding process for any
$w \ge 2$. The resulting recoding can also be seen as an adaptation of the
SPA-resistant recoding of [30], mapping to the digit set
$\{0, \pm\alpha_1, \pm\alpha_3, \ldots, \pm\alpha_{2^{w-1}-1}\}$ instead of
integers. While the non-zero densities are very similar, our scheme provides a
performance benefit in the precomputation step, since the Frobenius map is
usually faster than point doubling and preserves affine coordinates, which in
turn allows faster point additions.
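Algorithm 4 Regular width-$w$ τ-recoding for $m$-bit scalar
Input: $w$, $t_w$, $\alpha_u = \beta_u + \gamma_u\tau$ for $u \in \{\pm 1, \pm 3, \pm 5, \ldots, \pm 2^{w-1} - 1\}$, $\rho = r_0 + r_1\tau \in \mathbb{Z}[\tau]$ with odd $r_0, r_1$
Output: $\rho = \sum_{i=0}^{\lceil \frac{m+2}{w-1} \rceil} u_i \tau^{i(w-1)}$
1: for $i \leftarrow 0$ to $\lceil \frac{m+2}{w-1} \rceil - 1$ do
2:   if $w = 2$ then
3:     $u_i \leftarrow ((r_0 - 2r_1) \bmod 4) - 2$
4:     $r_0 \leftarrow r_0 - u_i$
5:   else
6:     $u \leftarrow (r_0 + r_1 t_w \bmod 2^w) - 2^{w-1}$
7:     if $u > 0$ then $s \leftarrow 1$ else $s \leftarrow -1$
8:     $r_0 \leftarrow r_0 - s\beta_u$, $r_1 \leftarrow r_1 - s\gamma_u$, $u_i \leftarrow s\alpha_u$
9:   end if
10:  for $j \leftarrow 0$ to $(w - 2)$ do
11:    $t \leftarrow r_0$, $r_0 \leftarrow r_1 + \mu r_0/2$, $r_1 \leftarrow -t/2$
12:  end for
13: end for
14: if $r_0 \neq 0$ and $r_1 \neq 1$ then
15:   $u_i \leftarrow r_0 + r_1\tau$
16: else
17:   if $r_1 \neq 0$ then
18:     $u_i \leftarrow r_1$
19:   else
20:     $u_i \leftarrow r_0$
21:   end if
22: end if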
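The $w = 2$ branch of Algorithm 4 (steps 3, 4 and 11) is simple enough to sketch and verify directly. The Python code below is a hedged illustration rather than the authors' implementation: it assumes $\mu = 1$ (that is, $a = 1$), represents $r_0 + r_1\tau$ as a pair of integers, returns the leftover $r_0 + r_1\tau$ instead of applying the final steps 14-22, and uses hypothetical names (`MU`, `regular_tau_recode_w2`, `evaluate`). The division by $\tau$ follows from $\tau^2 = \mu\tau - 2$.

```python
MU = 1                                    # mu = (-1)^(1-a); a = 1 is assumed

def regular_tau_recode_w2(r0, r1, length):
    """Emit `length` digits u_i in {-1, +1} of r0 + r1*tau (w = 2 branch)."""
    assert r0 % 2 == 1 and r1 % 2 == 1    # Algorithm 4 expects odd r0, r1
    digits = []
    for _ in range(length):
        u = ((r0 - 2 * r1) % 4) - 2       # step 3: u is -1 or +1
        digits.append(u)
        r0 -= u                           # step 4: r0 becomes even
        # step 11: exact division of r0 + r1*tau by tau
        r0, r1 = r1 + MU * (r0 // 2), -(r0 // 2)
    return digits, (r0, r1)

def evaluate(digits, rem):
    """Horner evaluation of sum(u_i tau^i) + rem * tau^len(digits) in Z[tau]."""
    a, b = rem
    for u in reversed(digits):
        # (a + b*tau) * tau + u, using tau^2 = MU*tau - 2
        a, b = -2 * b + u, a + MU * b
    return a, b
```

Calling `evaluate` on the output of `regular_tau_recode_w2` returns the original pair `(r0, r1)`, which confirms that the digits and the leftover element form a valid τ-adic expansion. Every emitted digit is non-zero, so the pattern of Frobenius applications and point additions does not depend on the scalar.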
4.2 Left-to-right regular approach
Algorithm 5 presents a complete description of a regular scalar multiplication
approach that uses as a building block the regular width-w τ-recoding procedure
just described.
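Algorithm 5 Protected scalar multiplication
Input: $P = (x, \lambda)$, $k \in \mathbb{Z}$, width $w$
Output: $Q = kP$
1: Compute $\rho' = r_0 + r_1\tau = k \text{ partmod } \frac{\tau^m - 1}{\tau - 1}$
2: if $2 \mid r_0$ then $r_0' = r_0 + 1$
3: if $2 \mid r_1$ then $r_1' = r_1 + 1$
4: Compute width-$w$ length-$l$ regular τ-adic expansion of $r_0' + r_1'\tau$ as $\sum_{i=0}^{\lceil 1 + \frac{m+2}{w-1} \rceil} u_i \tau^{i(w-1)}$ (Alg. 4)
5: for $i \in \{1, \ldots, 2^{w-1} - 1\}$ do
6:   Compute $P_u = \alpha_u P$
7: end for
8: $Q \leftarrow \mathcal{O}$
9: for $i = l - 1$ downto $0$ do
10:  $Q \leftarrow \tau^{w-1}(Q)$
11:  Perform a linear pass to recover $P_{u_i}$
12:  $Q \leftarrow Q + P_{u_i}$
13: end for
14: return $Q = Q - (r_0' - r_0)P - (r_1' - r_1)\tau(P)$.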
For benchmarking purposes, we also included a baseline implementation of the customary Montgomery López-Dahab ladder. This allows easier comparisons with related work and makes it possible to evaluate the impact of incomplete reduction on the field arithmetic performance (cf. Subsection 5.2).
5 Implementation issues and results
In this section, we discuss several implementation issues. We also present our
experimental results and compare them against state-of-the-art protected
point multiplication implementations at the 128-bit security level.
5.1 Mechanisms to achieve a constant-time GLS-Montgomery ladder implementation
To protect the previously described algorithms against timing attacks, we observed the following precautions:
Branchless code. The main loop and the pre- and post-computation phases are implemented with completely branch-free code.
Data veiling. To guarantee a constant memory access pattern in the main loop of the Montgomery ladder algorithms, we proposed an efficient data veiling method, described in Algorithm 7 of Appendix B. Algorithm 7 evaluates the current and the previous scalar bits to decide whether the variables containing the Montgomery-LD accumulator values should be masked. This strategy saves a considerable portion of the computational effort associated with Algorithm 1 of [4].
Field arithmetic. Two of the base field arithmetic operations over $\mathbb{F}_q$ were implemented through look-up tables, namely, the half-trace and the multiplicative inverse operations. The half-trace is used to perform the point halving primitive, which is required in the pre-computation phase of the Montgomery-LD halve-and-add algorithm. The multiplicative inverse is one of the operations in the y-coordinate retrieval procedure, at the end of the Montgomery ladder algorithms. Also, whenever post-computational additions are necessary, inverses must be performed to convert a point from LD-projective to affine coordinates.
We are aware of protocols that treat the base point as secret information [6]; in that case, our software could not be considered protected against timing attacks. In the vast majority of protocols, however, the base point is public. Consequently, any attacks aimed at the two field operations mentioned above would be pointless.
5.2 Mechanisms to achieve a constant-time Koblitz implementation
Implementing Algorithm 5 in constant time needs some care, since all of its
building blocks must be implemented in constant time.
Finite field arithmetic. Modern implementations of finite field arithmetic can
make extensive use of vector registers, removing timing variances due to the cache
hierarchy. For our illustrative implementation of curve NIST-K283, we closely
follow the arithmetic described in Bluhm-Gueron [4], adopting the incomplete
reduction improvement proposed by Negre-Robert [28].
Integer recoding. All the branches in Algorithm 4 need to be eliminated by conditional execution statements to prevent leakage of the scalar $k$. Moreover, to remove the remaining sign-related branches, multiple-precision integer arithmetic must be implemented in two's complement. If two constants, say $\beta_u, \gamma_u$, are stored in a precomputed table, then they need to be recovered by a linear pass across the table in constant time. Finally, the partial reduction step producing $\rho'$ must also be implemented in constant time by removing all of its branches. Notice that the requirement for $r_0, r_1$ to be odd is not a problem, since partial reduction can be modified to always result in odd integers, with a possible correction at the end of the scalar multiplication by performing a (protected) conditional subtraction of points (line 14 of Algorithm 5).
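The linear pass and the sign handling mentioned above can be sketched as follows. This is only an illustration of the access pattern: the helper names `ct_select` and `ct_cond_negate` and the table contents are hypothetical, indices are assumed to be non-negative and below $2^{63}$, and Python's arbitrary-precision integers do not actually run in constant time, so a real implementation would use fixed-width words in C or assembly.

```python
def ct_select(table, index):
    """Return table[index] while reading every entry of the table.

    Each entry is a pair (beta, gamma) of integers. The mask is all ones
    (-1) for the wanted entry and 0 for every other entry.
    """
    beta, gamma = 0, 0
    for j, (b, g) in enumerate(table):
        diff = j ^ index
        mask = (diff - 1) >> 63        # -1 if diff == 0, else 0
        beta |= b & mask
        gamma |= g & mask
    return beta, gamma


def ct_cond_negate(x, negate):
    """Return -x if negate == 1 and x if negate == 0, without branching."""
    m = -negate                        # 0 or -1 (all ones)
    return (x ^ m) - m                 # two's complement negation when m == -1


if __name__ == "__main__":
    # Hypothetical (beta_u, gamma_u) pairs, for illustration only.
    table = [(1, 0), (3, -1), (-5, 2), (7, 7)]
    print(ct_select(table, 2), ct_cond_negate(5, 1))
```

The same masking idea extends to selecting precomputed points, where each coordinate word is combined with the mask instead of a single integer.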
5.3 Results
Our implementation was mainly designed for the Intel Haswell processor family, which supports vector instruction sets such as SSE and AVX, a carry-less multiplication instruction and some bit manipulation instructions. The programming was done in C with the support of inline assembly code. The compilation was performed with GCC version 4.7.3 with the flags -m64 -march=core-avx2 -mtune=core-avx2 -O3 -fomit-frame-pointer -funroll-loops. Finally, the timings were collected on an Intel Core i7-4700MQ, with the Turbo Boost and Hyper-Threading features disabled.$^6$
Table 2 presents the experimental timings obtained for the most prominent
building blocks required for computing the point multiplication operation on the
GLS and Koblitz binary elliptic curves.
We present in Table 3 a comparison of our timings against a selection of state-of-the-art implementations of the point multiplication operation on binary and prime elliptic curves. Due to the efficiency of the Montgomery-LD point doubling, which costs 49% less than a point halving, the GLS-Montgomery-LD-double-and-add achieved the fastest timing in the one-core setting, with 70,800 clock cycles. This is 13% faster than the performance obtained by the GLS-Montgomery-LD-halve-and-add algorithm. In the known-base point setting, we can ignore the GLS-Montgomery-LD-halve-and-add pre-computation expenses associated with its table of halved points. In that case, we can compute the scalar multiplication in an estimated time of 44,600 clock cycles using a table of just 4128 bytes.
$^6$ We intend to submit our software to the ECRYPT Benchmarking of Cryptographic Systems (eBACS) SUPERCOP toolkit in the near future.
$^7$ The flexibility for finding a curve b-constant, provided by the GLS curves, allows us to have a small $\sqrt{b}$ (see Appendix C). As a consequence, we used the alternative formula of Eq. (5).
Table 2. Timings (in clock cycles) for the elliptic curve operations on the Intel Haswell platform.

Elliptic curve operation, GLS $E/\mathbb{F}_{2^{254}}$            cycles    op/M$^1$
Halving                                                    184      4.181
Montgomery-LD D&A (left-to-right) Addition (Eq. (6))       161      3.659
Montgomery-LD H&A (right-to-left) Addition (Eq. (8))       199      4.522
Montgomery-LD Doubling$^7$ (Eq. (5))                          95      2.159

Elliptic curve operation, Koblitz $E/\mathbb{F}_{2^{283}}$        cycles    op/M$^1$
Frobenius                                                   70      1.235
Integer τ-adic recoding (Alg. 4) (w = 5)                 8,900    156.863
Point addition                                             602     10.588

$^1$ Ratio to multiplication.
Furthermore, the GLS-Montgomery-LD-halve-and-add is crucial for implementing the multi-core versions of the Montgomery ladder. When compared with our one-core double-and-add implementation, Table 3 reports a speedup of 1.36 and 2.03 in our two- and four-core Montgomery ladder versions, respectively. Here, besides the overhead costs commented in Section 3, we can clearly perceive the usual multi-core management penalty. Finally, we observe that our GLS-Montgomery-LD-double-and-add surpasses by 48%, 40% and 2% the Montgomery ladder implementations of [4] (Random), [4] (Koblitz) and [1], respectively.
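The ratios quoted in this paragraph follow directly from the cycle counts of Table 3. The short script below recomputes them; the dictionary keys and helper names are labels chosen here for illustration, and only the cycle counts come from the table.

```python
# Cycle counts copied from Table 3 (Haswell rows).
cycles = {
    "ours_1core": 70_800,       # GLS-Montgomery-LD double-and-add
    "ours_2core": 52_000,
    "ours_4core": 34_800,
    "ref4_random": 135_000,     # Random-Montgomery-LD ladder [4]
    "ref4_koblitz": 118_000,    # Koblitz-Montgomery-LD ladder [4]
    "ref1_kummer": 72_200,      # Genus-2-Kummer Montgomery ladder [1]
}


def speedup(baseline, improved):
    """How many times faster `improved` is than `baseline`."""
    return baseline / improved


def percent_faster(ours, theirs):
    """Relative saving in cycles of `ours` with respect to `theirs`."""
    return 100.0 * (1.0 - ours / theirs)


if __name__ == "__main__":
    one = cycles["ours_1core"]
    print(round(speedup(one, cycles["ours_2core"]), 2))
    print(round(speedup(one, cycles["ours_4core"]), 2))
    for ref in ("ref4_random", "ref4_koblitz", "ref1_kummer"):
        print(ref, round(percent_faster(one, cycles[ref])))
```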
As for our Koblitz implementations, the fast τ endomorphism allows us to have a regular-recoding implementation that outperforms a standard Montgomery ladder for Koblitz curves by 18%. In addition, our fastest Koblitz code surpasses by 16% the recent implementation reported in [4].$^8$ Finally, note that, although the τ endomorphism is 26% faster than the Montgomery-LD point doubling, the superior efficiency of the GLS quadratic field arithmetic produces faster results for the GLS Montgomery ladder algorithms.
$^8$ We could not reproduce the timing of 118,000 cycles with the code available from [4], which suggests that Turbo Boost may have been enabled in their benchmarks. Considering this, our implementation of Koblitz-Montgomery-LD becomes 9% faster than [4], reflecting the savings from partial reduction, and the speedup achieved by the Koblitz-regular implementation increases to 26%.

Table 3. Timings (in clock cycles) for 128-bit level scalar multiplication with timing-attack resistance on the Intel Ivy Bridge (I) and Haswell (H) architectures.

Method                                                                        Cycles   Arch
State-of-the-art implementations
Montgomery-DJB-chain (prime) [7]                                             148,000   I
Random-Montgomery-LD ladder (binary) [4]                                     135,000   H
Genus-2-Kummer (prime) [5]                                                   122,000   I
Koblitz-Montgomery-LD ladder (binary) [4]                                    118,000   H
Twisted-Edwards-4-GLV (prime) [11]                                            92,000   I
Genus-2-Kummer Montgomery ladder (prime) [1]                                  72,200   H
GLS-2-GLV double-and-add (binary, λ) [32]                                     60,000   H
Our Work
Koblitz-Montgomery-LD double-and-add (left-to-right)                         122,000   H
Koblitz-regular τ-and-add (left-to-right, w = 5)                              99,000   H
GLS-Montgomery-LD-2-GLV halve-and-add (Algorithm 3)                           80,800   H
GLS-Montgomery-LD double-and-add (Algorithm 1)                                70,800   H
2-core GLS-Montgomery-LD-2-GLV halve-and-add/double-and-add                   52,000   H
4-core GLS-Montgomery-LD-2-GLV halve-and-add/double-and-add (Algorithm 6)     34,800   H

6 Conclusion
We presented several algorithms that make it possible to compute a constant-time, high-security point multiplication operation over two families of binary elliptic curves, namely, the GLS and the Koblitz curves. Although this work was completely focused on high-end desktop computation of the variable-base point multiplication, the possibility of applying Algorithm 3 to the fixed-base point multiplication setting is highly appealing, since that procedure requires a comparatively small pre-computed table of roughly $2n \cdot (n+1)$ bits for computing a point multiplication at the $n$-bit security level. The above, combined with the Montgomery ladder's unique feature of performing all the computations using only two point coordinates, should be attractive for deployments of public key cryptography in constrained computing environments.
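As a quick consistency check, and assuming $n = 128$ for the 128-bit security level, the table-size estimate above agrees with the 4128-byte table quoted in Section 5.3:

$$2n \cdot (n+1)\,\Big|_{n=128} = 2 \cdot 128 \cdot 129 = 33{,}024 \text{ bits} = \frac{33{,}024}{8} \text{ bytes} = 4128 \text{ bytes}.$$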
References
1. D. J. Bernstein, C. Chuengsatiansup, T. Lange, and P. Schwabe. Kummer strikes back: new DH speed records. Cryptology ePrint Archive, Report 2014/134, 2014. http://eprint.iacr.org/.
2. D. J. Bernstein and T. Lange. SafeCurves: choosing safe curves for elliptic-curve cryptography. http://safecurves.cr.yp.to.
3. D. J. Bernstein and T. Lange. Security dangers of the NIST curves. Invited talk, International State of the Art Cryptography Workshop, Athens, Greece, 2013.
4. M. Bluhm and S. Gueron. Fast Software Implementation of Binary Elliptic Curve Cryptography. Cryptology ePrint Archive, Report 2013/741, 2013. http://eprint.iacr.org/.
5. J. W. Bos, C. Costello, H. Hisil, and K. Lauter. Fast Cryptography in Genus 2. In T. Johansson and P. Q. Nguyen, editors, Advances in Cryptology - EUROCRYPT 2013, volume 7881 of LNCS, pages 194–210. Springer, 2013.
6. S. Chatterjee, K. Karabina, and A. Menezes. A New Protocol for the Nearby Friend Problem. In M. G. Parker, editor, Cryptography and Coding, 12th IMA International Conference, Cryptography and Coding 2009, volume 5921 of LNCS, pages 236–251. Springer, 2009.
7. C. Costello, H. Hisil, and B. Smith. Faster Compact Diffie-Hellman: Endomorphisms on the x-line. In P. Nguyen and E. Oswald, editors, Advances in Cryptology - EUROCRYPT 2014, volume 8441 of LNCS, pages 183–200. Springer Berlin Heidelberg, 2014.
8. T. Dierks and C. Allen. The TLS Protocol Version 1.0. RFC 2246 (Proposed Standard), Jan. 1999. Obsoleted by RFC 4346, updated by RFCs 3546, 5746, 6176.
9. T. Dierks and E. Rescorla. The Transport Layer Security (TLS) Protocol Version 1.1. RFC 4346 (Proposed Standard), Apr. 2006. Obsoleted by RFC 5246, updated by RFCs 4366, 4680, 4681, 5746, 6176.
10. T. Dierks and E. Rescorla. The Transport Layer Security (TLS) Protocol Version 1.2. RFC 5246 (Proposed Standard), August 2008.
11. A. Faz-Hernández, P. Longa, and A. H. Sanchez. Efficient and Secure Algorithms for GLV-Based Scalar Multiplication and Their Implementation on GLV-GLS Curves. In J. Benaloh, editor, Topics in Cryptology - CT-RSA 2014, volume 8366 of LNCS, pages 1–27. Springer, 2014.
12. A. Fog. Instruction Tables: List of Instruction Latencies, Throughputs and Micro-operation Breakdowns for Intel, AMD and VIA CPUs. Accessed: May 14 2014. Available at: http://www.agner.org/optimize/instruction_tables.pdf.
13. S. D. Galbraith, X. Lin, and M. Scott. Endomorphisms for Faster Elliptic Curve Cryptography on a Large Class of Curves. In A. Joux, editor, Advances in Cryptology - EUROCRYPT 2009, volume 5479 of LNCS, pages 518–535. Springer, 2009.
14. R. P. Gallant, R. J. Lambert, and S. A. Vanstone. Faster Point Multiplication on Elliptic Curves with Efficient Endomorphisms. In J. Kilian, editor, Advances in Cryptology - CRYPTO 2001, volume 2139 of LNCS, pages 190–200. Springer Berlin Heidelberg, August 2001.
15. P. Gaudry, F. Hess, and N. P. Smart. Constructive and destructive facets of Weil descent on elliptic curves. Journal of Cryptology, 15:19–46, March 2002.
16. S. Gueron and V. Krasnov. Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes. Cryptology ePrint Archive, Report 2013/816, 2013. http://eprint.iacr.org/.
17. D. Hankerson, K. Karabina, and A. Menezes. Analyzing the Galbraith-Lin-Scott Point Multiplication Method for Elliptic Curves over Binary Fields. Computers, IEEE Transactions on, 58(10):1411–1420, October 2009.
18. D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.
19. F. Hess. Generalising the GHS Attack on the Elliptic Curve Discrete Logarithm Problem. LMS Journal of Computation and Mathematics, 7:167–192, June 2004.
20. Y.-J. Huang, C. Petit, N. Shinohara, and T. Takagi. Improvement of Faugère et al.'s Method to Solve ECDLP. In K. Sakiyama and M. Terada, editors, Advances in Information and Computer Security - IWSEC 2013, volume 8231 of LNCS, pages 115–132. Springer, 2013.
21. M. Joye and M. Tunstall. Exponent Recoding and Regular Exponentiation Algorithms. In B. Preneel, editor, Progress in Cryptology - AFRICACRYPT 2009, volume 5580 of LNCS, pages 334–349. Springer Berlin Heidelberg, 2009.
22. A. H. Koblitz, N. Koblitz, and A. Menezes. Elliptic curve cryptography: The serpentine course of a paradigm shift. Journal of Number Theory, 131(5):781–814, 2011. Elliptic Curve Cryptography.
23. N. Koblitz. CM-curves with good cryptographic properties. In J. Feigenbaum, editor, Advances in Cryptology - CRYPTO '91, volume 576 of LNCS, pages 279–287. Springer, 1991.
24. P. Longa and F. Sica. Four-Dimensional Gallant-Lambert-Vanstone Scalar Multiplication. Journal of Cryptology, 27(2):248–283, 2014.
25. J. López and R. Dahab. Fast Multiplication on Elliptic Curves over GF(2^m) without Precomputation. In Çetin Kaya Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems, First International Workshop, CHES'99, volume 1717 of LNCS, pages 316–327. Springer, 1999.
26. P. L. Montgomery. Speeding the Pollard and elliptic curve methods of factorization. Mathematics of Computation, 48:243–264, 1987.
27. National Institute of Standards and Technology. Recommended Elliptic Curves for Federal Government Use. NIST Special Publication, 1999. http://csrc.nist.gov/csrc/fedstandards.html.
28. C. Nègre and J.-M. Robert. Impact of Optimized Field Operations AB, AC and AB+CD in Scalar Multiplication over Binary Elliptic Curve. In Progress in Cryptology - AFRICACRYPT 2013, volume 7918 of LNCS, pages 279–296. Springer Berlin Heidelberg, 2013.
29. C. Nègre and J.-M. Robert. New Parallel Approaches for Scalar Multiplication in Elliptic Curve over Fields of Small Characteristic. 2013. http://hal.archives-ouvertes.fr/docs/00/90/84/63/PDF/parallelization-ecsm8.pdf.
30. K. Okeya, T. Takagi, and C. Vuillaume. Efficient representations on Koblitz curves with resistance to side channel attacks. In ACISP, volume 3574 of LNCS, pages 218–229. Springer, 2005.
31. T. Oliveira, J. López, D. F. Aranha, and F. Rodríguez-Henríquez. Lambda Coordinates for Binary Elliptic Curves. In G. Bertoni and J.-S. Coron, editors, Cryptographic Hardware and Embedded Systems - CHES 2013, volume 8086 of LNCS, pages 311–330. Springer, 2013.
32. T. Oliveira, J. López, D. F. Aranha, and F. Rodríguez-Henríquez. Two is the fastest prime: lambda coordinates for binary elliptic curves. J. Cryptographic Engineering, 4(1):3–17, 2014.
33. C. Petit and J.-J. Quisquater. On Polynomial Systems Arising from a Weil Descent. In X. Wang and K. Sako, editors, Advances in Cryptology - ASIACRYPT 2012, volume 7658 of LNCS, pages 451–466. Springer, 2012.
34. J. Salowey. Confirming Consensus on removing RSA key Transport from TLS 1.3. Transport Layer Security working group of the IETF Mailing List, May 3 2014.
35. M. Scott. Re: NIST announces set of Elliptic Curves. https://groups.google.com/forum/message/raw?msg=sci.crypt/mFMukSsORmI/FpbHDQ6hM_MJ, 1999.
36. M. Shantz and E. Teske. Solving the Elliptic Curve Discrete Logarithm Problem Using Semaev Polynomials, Weil Descent and Gröbner Basis Methods – an Experimental Study. Cryptology ePrint Archive, Report 2013/596, 2013. http://eprint.iacr.org/.
37. J. A. Solinas. Efficient Arithmetic on Koblitz Curves. Designs, Codes and Cryptography, 19(2-3):195–249, 2000.
38. J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, and J. López. Speeding scalar multiplication over binary elliptic curves using the new carry-less multiplication instruction. Journal of Cryptographic Engineering, 1:187–199, November 2011.
A Multi-core Montgomery ladder
Here we present the four-core GLS-Montgomery-LD ladder algorithm. Given $t_4$, the integer constant that establishes the workload of each algorithm, $P \in E(\mathbb{F}_{q^2})$, and the scalar $k$ represented as $k_1 + k_2 \cdot \delta$ using the GLS-GLV method, cores I and II are both responsible for computing $\lfloor \frac{n}{2} \rfloor - t_4$ bits of the subscalars $k_1$ and $k_2$ using the Montgomery-LD double-and-add method. In turn, cores III and IV both compute $t_4$ bits of $k_1$ and $k_2$ with the Montgomery-LD halve-and-add algorithm. In the end, on a single core, it is necessary to add all the accumulators $Q_i$, for $i = 0, \ldots, 3$.
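Algorithm 6 Parallel Montgomery ladder scalar multiplication (four-core)
Input: $P \in E(\mathbb{F}_{q^2})$ of order $r$, scalar $k$ of bit length $n$, integer constant $t_4$
Output: $Q = kP$
$k' \leftarrow 2^{t_4} k \bmod r$
Represent $k' = k'_1 + k'_2\lambda$, where $\psi(P) = \lambda P$

Core I
  {Initialization}
  $R_0 \leftarrow \mathcal{O}$, $R_1 \leftarrow P$
  for $i = \lceil \frac{n}{2} \rceil$ downto $t_4$ do
    $b \leftarrow k'_{1,i} \in \{0, 1\}$
    $R_{1-b} \leftarrow R_{1-b} + R_b$
    $R_b \leftarrow 2R_b$
  end for
  $Q_0 \leftarrow R_0$
  {Barrier}

Core II
  {Initialization}
  $R_0 \leftarrow \mathcal{O}$, $R_1 \leftarrow P$
  for $i = \lceil \frac{n}{2} \rceil$ downto $t_4$ do
    $b \leftarrow k'_{2,i} \in \{0, 1\}$
    $R_{1-b} \leftarrow R_{1-b} + R_b$
    $R_b \leftarrow 2R_b$
  end for
  $Q_1 \leftarrow R_0$
  {Barrier}

Core III
  {Precomputation}
  for $i = 1$ to $t_4 + 1$ do
    $P_i \leftarrow \frac{P}{2^i}$
  end for
  {Initialization}
  $R_1 \leftarrow P_{t_4+1}$, $R_2 \leftarrow P_{t_4+1}$
  for $i = 0$ to $t_4 - 1$ do
    $R_0 \leftarrow P_{t_4-i}$
    $b \leftarrow k'_{1,i} \in \{0, 1\}$
    $R_{2-b} \leftarrow R_{2-b} + R_0$
  end for
  $Q_2 \leftarrow R_1 - P_{t_4+1}$
  {Barrier}

Core IV
  {Precomputation}
  for $i = 1$ to $t_4 + 1$ do
    $P_i \leftarrow \frac{P}{2^i}$
  end for
  {Initialization}
  $R_1 \leftarrow P_{t_4+1}$, $R_2 \leftarrow P_{t_4+1}$
  for $i = 0$ to $t_4 - 1$ do
    $R_0 \leftarrow P_{t_4-i}$
    $b \leftarrow k'_{2,i} \in \{0, 1\}$
    $R_{2-b} \leftarrow R_{2-b} + R_0$
  end for
  $Q_3 \leftarrow R_1 - P_{t_4+1}$
  {Barrier}

return $Q = Q_0 + Q_2 + \psi(Q_1 + Q_3)$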
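The way the double-and-add and halve-and-add halves recombine can be checked in a toy model. The sketch below is not the authors' implementation: it replaces the curve group by the integers modulo a small odd prime $r$ (so the "point" $P$ is the integer 1 and halving is multiplication by $2^{-1} \bmod r$), it omits the GLS-GLV split and the endomorphism $\psi$, and it runs the two parts sequentially. The function name `split_ladder` and the threshold name `t` are chosen here for illustration. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
def split_ladder(k, r, t):
    """Compute k*P in the toy group Z_r (P = 1) by splitting the scalar.

    Bits t and above of k' = 2**t * k mod r go through a left-to-right
    double-and-add ladder; bits below t go through a right-to-left
    halve-and-add pass over the precomputed halves P/2**i.
    """
    P = 1
    n = r.bit_length()
    k2 = (pow(2, t, r) * k) % r

    # Double-and-add part (the role of cores I and II).
    R = [0, P]
    for i in range(n - 1, t - 1, -1):
        b = (k2 >> i) & 1
        R[1 - b] = (R[1 - b] + R[b]) % r
        R[b] = (2 * R[b]) % r
    q_high = R[0]

    # Halve-and-add part (the role of cores III and IV).
    inv2 = pow(2, -1, r)
    halves = [P]                           # halves[i] = P / 2**i
    for _ in range(t + 1):
        halves.append(halves[-1] * inv2 % r)
    acc = [None, halves[t + 1], halves[t + 1]]
    for i in range(t):
        b = (k2 >> i) & 1
        acc[2 - b] = (acc[2 - b] + halves[t - i]) % r   # one addition per bit
    q_low = (acc[1] - halves[t + 1]) % r

    return (q_high + q_low) % r


if __name__ == "__main__":
    print(split_ladder(777, 1009, 4))
```

Multiplying the scalar by $2^t$ beforehand is what lets the low bits be served by halved multiples of $P$: the high part contributes $\lfloor k'/2^t \rfloor P$ and the low part contributes $(k' \bmod 2^t)/2^t \cdot P$, which add up to $kP$.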
B Memory access pattern
The following data veiling algorithm ensures a fixed memory access pattern for all Montgomery-LD ladder algorithms. Given the two Montgomery-LD ladder accumulators $A$ and $B$, and the scalar $k = (k_{n-1}, k_{n-2}, \ldots, k_0)$, this method allows us, at the beginning of the $i$-th main loop iteration, to use the bits $k_{i-1}$ and $k_i$ to decide whether $A$ and $B$ will be swapped. As a result, it is not necessary to reapply the procedure at the end of the $i$-th iteration.
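Algorithm 7 Data veiling algorithm
Input: Scalar digits $k_i$ and $k_{i-1}$, Montgomery-LD ladder accumulators $A$ and $B$
Output: Montgomery-LD ladder accumulators $A$ and $B$
$mask \leftarrow 0 - (k_{i-1} \oplus k_i)$
$tmp \leftarrow A \oplus B$
$tmp \leftarrow tmp \wedge mask$
$A \leftarrow A \oplus tmp$
$B \leftarrow B \oplus tmp$
return $A, B$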
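The masked swap of Algorithm 7 can be sketched as below, together with a toy ladder showing why a single swap per iteration suffices. The accumulators are plain integers standing in for multiples of a point, and the names `veil` and `toy_ladder` are hypothetical; the sketch shows the logic only, since Python integer operations are not constant time.

```python
def veil(a, b, k_prev, k_cur):
    """Swap a and b exactly when the two scalar bits differ, without branching."""
    mask = -(k_prev ^ k_cur)      # 0 if the bits agree, -1 (all ones) otherwise
    tmp = (a ^ b) & mask
    return a ^ tmp, b ^ tmp


def toy_ladder(k, n):
    """Montgomery ladder on integers (P = 1) with one masked swap per step."""
    A, B = 0, 1                   # invariant of the unswapped pair: B - A = P
    prev = 0
    for i in range(n - 1, -1, -1):
        cur = (k >> i) & 1
        A, B = veil(A, B, prev, cur)
        A, B = 2 * A, A + B       # always double A and add A to B
        prev = cur
    A, B = veil(A, B, prev, 0)    # undo the pending swap after the last bit
    return A


if __name__ == "__main__":
    print(veil(0x12, 0x34, 0, 1), toy_ladder(45, 6))
```

Because the swap at the start of iteration $i$ already accounts for the state left by iteration $i-1$, no second swap is needed at the end of each iteration, which is the saving described above.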
C GLS elliptic curve parameters
To obtain a greater benefit from the multiplication by the b-constant in the Montgomery-LD doubling formula $X_3 = X_0^4 + bZ_0^4 = (X_0^2 + \sqrt{b}Z_0^2)^2$, we carefully selected a GLS curve with a 64-bit b-parameter square-root. As a result, we saved two carry-less multiplications and a dozen SSE instructions per field multiplication. Next, we describe the parameters, as polynomials represented in hexadecimal, for our GLS curve $E_{a,b}/\mathbb{F}_{q^2} : y^2 + xy = x^3 + ax^2 + b$.
$a$ = u
$b$ = 0x54045144410401544101540540515101
$\sqrt{b}$ = 0xE2DA921E91E38DD1
The 253-bit prime order $r$ of the main subgroup of $E_{a,b}/\mathbb{F}_{q^2}$ is
$r$ = 0x1FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFA6B89E49D3FECD828CA8D66BF4B88ED5.
Also, the integer $\delta$ such that $\psi(P) = \delta P$ for all $P \in E_{a,b}$ is
$\delta$ = 0x74AEFB81EE8A42E9E9D0085E156A8EFBA3D302F9C74D737FA00360F9395C788.
The base point $P = (x, y)$ of order $r$ used in this work is
$x$ = 0x4A21A3666CF9CAEBD812FA19DF9A3380 + 0x358D7917D6E9B5A7550B1B083BC299F3 · u
$y$ = 0x6690CB7B914B7C4018E7475D9C2B1C13 + 0x2AD4E15A695FD54011BA179D5F4B44FC · u.
Finally, the towering of our field $\mathbb{F}_q \cong \mathbb{F}_2[x]/(f(x))$ and its quadratic extension $\mathbb{F}_{q^2} \cong \mathbb{F}_q[u]/(g(u))$ is constructed by means of the irreducible trinomials $f(x) = x^{127} + x^{63} + 1$ and $g(u) = u^2 + u + 1$.
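The relation between the 128-bit constant $b$ and the 64-bit constant listed right after it can be checked with a few lines of polynomial arithmetic over $\mathbb{F}_2$. Squaring a binary polynomial only spreads its bits to the even positions, and since the 64-bit value has degree 63, its square has degree 126 and needs no reduction modulo $f(x)$. The helper name `gf2_square` is chosen here for illustration.

```python
def gf2_square(a):
    """Square a polynomial over F_2 given as an integer bit vector.

    Bit i of the input moves to bit 2*i of the output; no reduction is done.
    """
    out, i = 0, 0
    while a:
        out |= (a & 1) << (2 * i)
        a >>= 1
        i += 1
    return out


B = 0x54045144410401544101540540515101
SQRT_B = 0xE2DA921E91E38DD1

if __name__ == "__main__":
    print(hex(gf2_square(SQRT_B)))
    print(gf2_square(SQRT_B) == B)
```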
... In order to eliminate the inverses in the equation of x 16 , we multiply both sides by U 2 16 , using the value of U 16 in Equation 22, ...
... Multiplying both sides by U 3 16 , to eliminate all inverses as we have done previously with considering the value of U 16 in Equation 22, ...
... The introduced optimization speeds up the elliptic curve computation and is beneficial for SIDH's quantum security margin. The new equations can be applied to the right-to-left fast exponentiation algorithm for binary elliptic curves [14,22], also used accordingly for Montgomery curves [23,12]. In addition, it can also be applied to the left-to-right fast Double-and-Add or Double-Add-&-Subtract algorithms. ...
Preprint
Elliptic curve multiplications can be improved by replacing the standard ladder algorithm's base 2 representation of the scalar multiplicand, with mixed-base representations with power-of-2 bases, processing the n bits of the current digit in one optimized step. For this purpose, we also present a new methodology to compute short Weierstrass form elliptic curve operations of the type mP+nQ, where m and n are small integers, aiming for faster implementation with the lowest cost among previous algorithms, using only one inversion. In particular, the proposed techniques enable more opportunities for optimizing computations, leading to an important speed-up for applications based on elliptic curves, including the post-quantum cryptosystem Super Singular Isogeny Diffie Hellman (SIDH).
... Algorithm 5 Regular window TNAF (wTNAF) method [46] Require: n-bit scalar k = (k n−1 , . . . , 1, 0) 2 , point P ∈ E(F 2 m ) Ensure: R=k · P . ...
... Typical wTNAF is vulnerable to side channel attacks [49], [50], such as TA (Timing Attack) and SPA (Simple Power Analysis), because point addition is omitted whenever the value of the tested window is zero, which leaks sensitive information. Oliveira et al. [46] proposed a regular wTNAFbased scalar multiplication method on Koblitz curves. Alg. 5 computes scalar multiplication with a regular pattern, which always conducts (w − 1) Frobenius maps and single ECADD regardless of the scalar value. ...
... Alg. 5 computes scalar multiplication with a regular pattern, which always conducts (w − 1) Frobenius maps and single ECADD regardless of the scalar value. With the regular wTNAF method, the length of the scalar is 1 + m+2 w−1 and the density of nonzero is 1 w−1 [46]. For finding the optimal window size on the target device, we tested the performance of regular wTNAF and found that 5 and 6 are the optimal window widths for a variable-base scalar multiplication and a fixed-base scalar multiplication, respectively. ...
Article
Full-text available
This paper presents an efficient implementation of elliptic curve cryptography (ECC) over the National Institute of Standards and Technology (NIST) K-233 curve for 8-bit AVR microcontrollers commonly used for sensor nodes in wireless sensor networks. Until now, several ECC implementations over NIST-compliant curves have been presented on 8-bit sensor nodes. However, most of them do not provide 112-bit security level currently recommended by NIST. Although some works provide more than 112-bit security level, their performance needs to be improved in order to be executed properly on resource-constrained sensor nodes. For optimizing the performance of ECC, we focus on the efficiency of field arithmetics and propose several optimization techniques. First, we present a novel polynomial multiplication technique based on multiplier encoding. The proposed method significantly reduces the required number of registers for a multiplier, which allows the larger block size for the Karatsuba Block-Comb method. The proposed method provides around 17.05% of improvement compared with the best result previously presented. Second, we optimize modular squaring and reduction algorithms considering the features of 8-bit AVR, and each of them provides around 21.86% and 3.7% improvements compared with the related works. With proposed methods, we present two versions of ECC implementation: (highly fast) HF and (highly secure) HS over NIST K-233 curve on an 8-bit ATmega128. Especially, HF version outperforms the best result previously implemented on the same curve by 18.6% and 34.5% for a variable and a fixed-based scalar multiplication, respectively. Furthermore, on the 8-bit AVR platform, our ECC implementation shows the best performance compared with other existing implementations over both NIST-standardized prime or binary curves.
... Similarly, these curves support the most efficient Montgomery ladder formulas that we know of (consisting of only five field multiplications per bit, which have the extra bonus of being amenable for parallelization). Moreover, Koblitz curves are suitable for right-to-left Montgomery ladders as discussed in [69]. This feature is especially valuable for the vast majority of protocols (such as the Diffie-Hellman protocol, digital signatures, key generation, etc.), which usually require the computation of one or more fixed-point scalar multiplications. ...
... We found interesting to explore the cryptographic usage of Koblitz curves defined over F 4 due to their inherent usage of quadratic field arithmetic. Indeed, it has been recently shown [56,69,71] that quadratic field arithmetic is extraordinarily efficient when implemented in software. This is because one can take full advantage of the Single Instruction Multiple Data (SIMD) paradigm, where a vector instruction performs simultaneously the same operation on a set of input data items. ...
... Assuming that the scalar k is specified in the Z[τ ] domain, one can represent the scalar in the regular width-w τNAF form [69] by slightly adopting the method for the F 4 case. The length of the representation width-w τNAF of an element k ∈ Z[τ ] is discussed in [81]. ...
Article
In this work, we retake an old idea that Koblitz presented in his landmark paper (Koblitz, in: Proceedings of CRYPTO 1991. LNCS, vol 576, Springer, Berlin, pp 279–287, 1991), where he suggested the possibility of defining anomalous elliptic curves over the base field F4{\mathbb {F}}_4. We present a careful implementation of the base and quadratic field arithmetic required for computing the scalar multiplication operation in such curves. We also introduce two ordinary Koblitz-like elliptic curves defined over F4{\mathbb {F}}_4 that are equipped with efficient endomorphisms. To the best of our knowledge, these endomorphisms have not been reported before. In order to achieve a fast reduction procedure, we adopted a redundant trinomial strategy that embeds elements of the field F4m,{\mathbb {F}}_{4^{m}}, with m a prime number, into a ring of higher order defined by an almost irreducible trinomial. We also suggest a number of techniques that allow us to take full advantage of the native vector instructions of high-end microprocessors. Our software library achieves the fastest timings reported for the computation of the timing-protected scalar multiplication on Koblitz curves, and competitive timings with respect to the speed records established recently in the computation of the scalar multiplication over binary and prime fields.
... We also re-discover the y-coordinate retrieval formula as it was originally reported in [43]. In Sect. 3 we describe a Montgomery ladder procedures that admit off-line pre-computation, which was a long-standing open problem that was solved in [51] by Oliveira et al. In Sect. ...
... Given a point P ∈ E(F 2 m ), the point halving operation consists of computing a point R such that P = 2R. The interested reader is referred to [3,19,51,52] for a discussion on how to implement point halvings efficiently. ...
... The Montgomery halve-and-add method presented in [51] was the first to apply the point halving operation on Montgomery ladders without the necessity of performing the expensive inversion function over finite fields. This was achieved by combining the pre-computation approach of Algorithm 3 with the halve-and-add procedure shown in Algorithm 5. ...
Article
Full-text available
In this survey paper, we present a careful analysis of the Montgomery ladder procedure applied to the computation of the constant-time point multiplication operation on elliptic curves defined over binary extension fields. We give a general view of the main improvements and formula derivations that several researchers have contributed across the years, since the publication of Peter Lawrence Montgomery seminal work in 1987. We also report a fast software implementation of the Montgomery ladder applied on a Galbraith–Lin–Scott (GLS) binary elliptic curve that offers a security level close to 128 bits. Using our software, we can execute the ephemeral Diffie–Hellman protocol in just 95,702 clock cycles when implemented on an Intel Skylake machine running at 4 GHz.
... Oliveira, Aranha, López and Henríguez [8] designed a regular window-ω τ-adic non-adjacent expansion, inspired by Joye and Tunstall's key idea [3] that any odd integer i in [0, 2 ω+1 ) can be rewritten as 2 ω + (−(2 ω − i)). Algorithm 2 achieves a constant-time regular τ -adic expansion for an integer, that is compatible with Frobenius endomorphism and provides efficient scalar multiplication algorithm against simple side-channel attacks. ...
Chapter
Full-text available
Koblitz curves are a special family of binary elliptic curves satisfying equation y2+xy=x3+ax2+1y^2+xy=x^3+ax^2+1, a{0,1}a\in \{0,1\}. Scalar multiplication on Koblitz curves can be achieved with point addition and fast Frobenius endomorphism. We show a new point representation system μ4\mu _4 coordinates for Koblitz curves. When a=0, μ4\mu _4 coordinates derive basic group operations—point addition and mixed-addition with complexities 7M+2S7\mathbf{M}+2\mathbf{S} and 6M+2S6\mathbf{M}+2\mathbf{S}, respectively. Moreover, Frobenius endomorphism on μ4\mu _4 coordinates requires 4S4\mathbf{S}. Compared with the state-of-the-art λ\lambda representation system, the timings obtained using μ4\mu _4 coordinates show speed-ups of 28.6%28.6\% to 32.2%32.2\% for NAF algorithms, of 13.7%13.7\% to 20.1%20.1\% for τ\tau NAF and of 18.4%18.4\% to 23.1%23.1\% for regular τ\tau NAF on four NIST-recommended Koblitz curves K-233, K-283, K-409 and K-571.
... Moreover, Joye's algorithm uses a regular execution pattern of elliptic curve operations and without using dummy operations, these features aid on the prevention of timings attacks [20] and fault-based attacks [42,3]. Joye's algorithm has been applied on the implementation of both Weierstrass curves [13] and Koblitz binary curves [28,38]. More recently, Oliveira et al. [29] adapted the right-to-left Joye's algorithm to use precomputed look-up tables with the purpose of accelerating fixed-point multiplications (see Algorithm 6). ...
Conference Paper
Full-text available
Digital signatures provide a means to publicly authenticate messages sent over an insecure channel. Recently, the Quotient Digital Signature Algorithm (qDSA) was introduced aiming key-compatibility with the Diffie-Hellman X25519 function. Due to the novelty of qDSA, there remains a need for an optimized implementation that allows identifying the real impact of this new algorithm. In this work, we focus on the secure and efficient implementation of qDSA. By leveraging the use of precomputation on the right-to-left Joye’s algorithm, we reduced the running time of signature generation by 30–35%, and the running time of the verification procedure by 19%. In addition, for increased security, we show a verification method that validates qDSA signatures unequivocally. All of these improvements were included into an optimized software library targeting 32–bit ARM and 64–bit Intel architectures. The improved performance achieved in these platforms, it positions qDSA as a competitive alternative for deploying digital signatures efficiently and securely.
Chapter
Let $E_a/\mathbb{F}_2: y^2+xy=x^3+ax^2+1$ be a Koblitz curve. The window τ-adic non-adjacent form (window τNAF) is currently the standard representation system to perform scalar multiplications on $E_a/\mathbb{F}_{2^m}$ utilizing the Frobenius map $\tau$. This work focuses on the pre-computation part of scalar multiplication. We first introduce $\mu\bar{\tau}$-operations where $\mu=(-1)^{1-a}$ and $\bar{\tau}$ is the complex conjugate of $\tau$. Efficient formulas of $\mu\bar{\tau}$-operations are then derived and used in a novel pre-computation scheme. Our pre-computation scheme requires 6M+6S, 18M+17S, 44M+32S, and 88M+62S (a=0) and 6M+6S, 19M+17S, 46M+32S, and 90M+62S (a=1) for window τNAF with widths from 4 to 7 respectively. It is about two times faster, compared to the state-of-the-art technique of pre-computation in the literature. The impact of our new efficient pre-computation is also reflected by the significant improvement of scalar multiplication. Traditionally, window τNAF with width at most 6 is used to achieve the best scalar multiplication. Because of the dramatic cost reduction of the proposed pre-computation, we are able to increase the width for window τNAF to 7 for a better scalar multiplication. This indicates that the pre-computation part becomes more important in performing scalar multiplication. With our efficient pre-computation and the new window width, our scalar multiplication runs in at least 85.2% the time of Kohel’s work (Eurocrypt’2017) combining the best previous pre-computation. Our results push the scalar multiplication of Koblitz curves, a very well-studied and long-standing research area, to a significant new stage.
Chapter
In the RFC 7748 memorandum, the Internet Research Task Force specified a Montgomery-ladder scalar multiplication function based on two recently adopted elliptic curves, "curve25519" and "curve448". The purpose of this function is to support the Diffie-Hellman key exchange algorithm that will be included in the forthcoming version of the Transport Layer Security cryptographic protocol. In this paper, we describe a ladder variant that makes it possible to accelerate the fixed-point multiplication function inherent to the Diffie-Hellman key pair generation phase. Our proposal combines a right-to-left version of the Montgomery ladder with the pre-computation of constant values directly derived from the base point and its multiples. To our knowledge, this is the first proposal of a Montgomery ladder procedure for prime elliptic curves that admits the extensive use of pre-computation. In exchange for very modest memory resources and a small extra programming effort, the proposed ladder obtains significant speedups for software implementations. Moreover, our proposal fully complies with the RFC 7748 specification. A software implementation of the X25519 and X448 functions using our pre-computable ladder yields an acceleration factor of roughly 1.20 and 1.25 when implemented on the Haswell and the Skylake micro-architectures, respectively.
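For orientation, the baseline that this abstract sets out to accelerate can be sketched as follows. This is the standard left-to-right Montgomery ladder for X25519 as described in RFC 7748, not the right-to-left pre-computable variant proposed in the paper. The names `clamp`, `cswap` and `x25519_ladder` are hypothetical, scalars and coordinates are handled as Python integers rather than byte strings, and Python big-integer arithmetic is not constant-time, so the sketch is for illustration only.

```python
P = 2**255 - 19      # field prime of curve25519
A24 = 121665         # (486662 - 2) / 4, the ladder constant from RFC 7748


def clamp(k):
    """Clamp a 256-bit integer scalar as RFC 7748 prescribes for X25519."""
    k &= ~7                    # clear the three low bits (cofactor 8)
    k &= (1 << 255) - 1        # clear the top bit
    k |= 1 << 254              # set bit 254
    return k


def cswap(swap, a, b):
    """Swap a and b when swap == 1, using masking instead of a branch."""
    t = (-swap) & (a ^ b)
    return a ^ t, b ^ t


def x25519_ladder(k, u):
    """Return the u-coordinate of [k]P for the point P with u-coordinate u."""
    x1 = u % P
    x2, z2 = 1, 0              # point at infinity
    x3, z3 = x1, 1             # P
    swap = 0
    for t in reversed(range(255)):
        kt = (k >> t) & 1
        swap ^= kt
        x2, x3 = cswap(swap, x2, x3)
        z2, z3 = cswap(swap, z2, z3)
        swap = kt
        a = (x2 + z2) % P
        aa = a * a % P
        b = (x2 - z2) % P
        bb = b * b % P
        e = (aa - bb) % P
        c = (x3 + z3) % P
        d = (x3 - z3) % P
        da = d * a % P
        cb = c * b % P
        x3 = (da + cb) ** 2 % P             # differential addition
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P                    # doubling
        z2 = e * (aa + A24 * e) % P
    x2, x3 = cswap(swap, x2, x3)
    z2, z3 = cswap(swap, z2, z3)
    return x2 * pow(z2, P - 2, P) % P       # affine u = X / Z


if __name__ == "__main__":
    alice = clamp(int.from_bytes(bytes(range(32)), "little"))
    bob = clamp(int.from_bytes(bytes(range(32, 64)), "little"))
    shared_a = x25519_ladder(alice, x25519_ladder(bob, 9))
    shared_b = x25519_ladder(bob, x25519_ladder(alice, 9))
    print(shared_a == shared_b)
```

Every iteration performs one doubling and one differential addition regardless of the scalar bit, which gives the ladder its regular execution pattern. In key generation the base point is fixed at u = 9, which is the situation where pre-computed values can be exploited.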
Conference Paper
Designing efficient and secure implementations of Elliptic Curve Cryptography (ECC) has attracted enormous interest from both theoreticians and practitioners. The main contenders in terms of performance are curves defined over binary extension fields or large prime characteristic fields. In addition to the efficiency requirements, security advantages such as implementation simplicity and resistance to side-channel attacks are receiving increasing attention in research and commercial applications. In this paper, we keep pushing in this direction and study efficient implementation of regular scalar multiplication algorithms for binary curves equipped with efficient endomorphisms. Our focus is on implementing the Galbraith-Lin-Scott (GLS) family of binary curves by exploring the space of different models and laddering algorithms, for their high performance, reasonable implementation simplicity, lower memory consumption and side-channel resistance. Our results demonstrate that laddering implementations can be competitive with window-based methods by obtaining a new speed record for laddering implementations of elliptic curves on high-end Intel processors.
Article
Since its introduction by Jao and De Feo in 2011, the supersingular isogeny Diffie-Hellman (SIDH) key exchange protocol has positioned itself as a promising candidate for post-quantum cryptography. One salient feature of the SIDH protocol is that it requires exceptionally short key sizes. However, the latency associated with SIDH is higher than that reported for other post-quantum cryptosystem proposals. Aiming to accelerate the SIDH runtime performance, we introduce a more efficient approach for calculating the elliptic curve operation P + [k]Q. Our strategy achieves a factor 1.4 speedup compared with the popular variable-three-point ladder algorithm regularly used in the SIDH shared secret phase. Moreover, profiting from pre-computation techniques, our algorithm yields a factor 1.7 acceleration for the computation of this operation in the SIDH key generation phase. We also present an optimized evaluation of the point tripling formula, and discuss several algorithmic and implementation techniques that lead to faster field arithmetic computations. A software implementation of the above improvements on an Intel Skylake Core i7-6700 processor gives a factor 1.33 speedup against the state-of-the-art software implementation of the SIDH protocol reported by Costello-Longa-Naehrig in CRYPTO 2016.
Article
We present two new strategies for parallel implementation of scalar multiplication over elliptic curves. We first introduce a Montgomery-halving algorithm which is a variation of the original Montgomery-ladder for point multiplication. This Montgomery-halving can be run in parallel with the original Montgomery-ladder in order to concurrently compute part of the scalar multiplication. We also present two point thirding formulas in some subfamilies of curves $E(\mathbb{F}_{3^m})$. We use these thirding formulas to implement scalar multiplication through (Third, Double)-and-add and (Third, Triple)-and-add parallel approaches. We also provide some implementation results of the presented parallel strategies which show a speed-up of 5%-14% on an Intel Core i7 processor and a speed-up of 8%-19% on a Qualcomm Snapdragon processor compared to non-parallelized approaches.
Conference Paper
A scalar multiplication over a binary elliptic curve consists of a sequence of hundreds of multiplications, squarings and additions. This sequence of field operations often involves a large number of operations of the form AB, AC and AB+CD. In this paper, we modify classical polynomial multiplication algorithms to obtain optimized algorithms which perform these particular operations AB, AC and AB+CD. We then present software implementation results of scalar multiplication over binary elliptic curves on two platforms: Intel Core 2 and Intel Core i5. These experimental results show some significant improvements in the timing of scalar multiplication due to the proposed optimizations.
Conference Paper
This paper sets new speed records for high-security constant-time variable-base-point Diffie–Hellman software: 305395 Cortex-A8-slow cycles; 273349 Cortex-A8-fast cycles; 88916 Sandy Bridge cycles; 88448 Ivy Bridge cycles; 54389 Haswell cycles. There are no higher speeds in the literature for any of these platforms. The new speeds rely on a synergy between (1) state-of-the-art formulas for genus-2 hyperelliptic curves and (2) a modern trend towards vectorization in CPUs. The paper introduces several new techniques for efficient vectorization of Kummer-surface computations.
Article
This paper studies software optimization of elliptic-curve cryptography with 256-bit prime fields. We propose a constant-time implementation of the NIST and SECG standardized curve P-256, that can be seamlessly integrated into OpenSSL. This accelerates Perfect Forward Secrecy TLS handshakes that use ECDSA and/or ECDHE, and can help in improving the efficiency of TLS servers. We report significant performance improvements for ECDSA and ECDH, on several architectures. For example, on the latest Intel Haswell microarchitecture, our ECDSA sign is $2.33\times$ faster than OpenSSL’s implementation.
Article
This paper presents an efficient and side-channel-protected software implementation of scalar multiplication for the standard National Institute of Standards and Technology (NIST) and Standards for Efficient Cryptography Group binary elliptic curves. The enhanced performance is achieved by leveraging Intel’s AVX architecture and utilizing the pclmulqdq processor instruction. The fast carry-less multiplication is further used to speed up the reduction on the Haswell platform. For the five NIST curves over $GF(2^m)$ with $m \in \{163,233,283,409,571\}$, the resulting scalar multiplication implementation is about 5–12 times faster than that of OpenSSL-1.0.1e, enhancing the ECDHE and ECDSA algorithms significantly.
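The role of carry-less multiplication in binary-field arithmetic can be illustrated with a small bit-level model. The sketch below assumes the field $GF(2^{233})$ with the NIST reduction polynomial $x^{233}+x^{74}+1$ and represents field elements as Python integers whose bits are polynomial coefficients. The helper names `clmul`, `reduce_mod_f` and `gf_mul` are hypothetical, and nothing here reflects the paper's AVX code; the hardware pclmulqdq instruction performs the same XOR-based product on 64-bit operands in a single step.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1


def clmul(a, b):
    """Carry-less product of two binary polynomials (XOR replaces addition)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_mod_f(c):
    """Reduce a polynomial of degree up to 2M-2 modulo F, top bit first."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            # Cancel x^i by adding (XOR-ing) x^(i-M) * F.
            c ^= F << (i - M)
    return c


def gf_mul(a, b):
    """Multiply two elements of GF(2^233)."""
    return reduce_mod_f(clmul(a, b))


if __name__ == "__main__":
    x = 2                                  # the polynomial "x"
    top = 1 << 232                         # x^232
    print(gf_mul(top, x) == ((1 << 74) | 1))   # x^233 reduces to x^74 + 1
```

Because the reduction polynomial is sparse, reduction amounts to a few shifts and XORs, which is consistent with the abstract's remark that fast carry-less multiplication can also be used to speed up this step.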
Conference Paper
The GLV method of Gallant, Lambert and Vanstone (CRYPTO 2001) computes any multiple $kP$ of a point $P$ of prime order $n$ lying on an elliptic curve with a low-degree endomorphism $\Phi$ (called a GLV curve) over $\mathbb{F}_p$ as $kP = k_1P + k_2\Phi(P)$ with $\max\{|k_1|,|k_2|\}\leq C_1\sqrt{n}$, for some explicit constant $C_1>0$. Recently, Galbraith, Lin and Scott (EUROCRYPT 2009) extended this method to all curves over $\mathbb{F}_{p^2}$ which are twists of curves defined over $\mathbb{F}_p$. We show in this work how to merge the two approaches in order to get, for twists of any GLV curve over $\mathbb{F}_{p^2}$, a four-dimensional decomposition together with fast endomorphisms $\Phi, \Psi$ over $\mathbb{F}_{p^2}$ acting on the group generated by a point $P$ of prime order $n$, resulting in a proven decomposition for any scalar $k\in[1,n]$ given by $kP=k_1P+ k_2\Phi(P)+ k_3\Psi(P) + k_4\Psi\Phi(P)$ with $\max_i (|k_i|)< C_2\, n^{1/4}$, for some explicit $C_2>0$. Remarkably, taking the best $C_1, C_2$, we obtain $C_2/C_1<412$, independently of the curve, ensuring in theory an almost constant relative speedup. In practice, our experiments reveal that the use of the merged GLV-GLS approach supports a scalar multiplication that runs up to 50% faster than the original GLV method. We then improve this performance even further by exploiting the Twisted Edwards model and show that curves originally slower may become extremely efficient on this model. In addition, we analyze the performance of the method on a multicore setting and describe how to efficiently protect GLV-based scalar multiplication against several side-channel attacks. Our implementations improve the state-of-the-art performance of point multiplication for a variety of scenarios including side-channel protected and unprotected cases with sequential and multicore execution.
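The two-dimensional decomposition $k \equiv k_1 + k_2\lambda \pmod{n}$ that underlies the GLV method can be sketched on integers alone, where $\lambda$ is the eigenvalue with $\Phi(P)=\lambda P$. The example below is a toy under stated assumptions: it uses the Mersenne prime $n = 2^{61}-1$ as a stand-in group order and a cube root of unity as $\lambda$, neither of which comes from the paper, and the function names are hypothetical. It follows the classical extended-Euclid construction of a short lattice basis, not the four-dimensional GLV-GLS method of the abstract.

```python
def cube_root_of_unity(n):
    """Find lam != 1 with lam^3 = 1 mod n (needs n prime, n = 1 mod 3)."""
    for g in range(2, n):
        lam = pow(g, (n - 1) // 3, n)
        if lam != 1:
            return lam
    raise ValueError("no nontrivial cube root of unity found")


def short_basis(n, lam):
    """Two short vectors (a, b) with a + b*lam = 0 mod n, via extended Euclid."""
    rows = [(n, 0), (lam, 1)]            # invariant: r = t*lam mod n
    while rows[-1][0] != 0:
        (r0, t0), (r1, t1) = rows[-2], rows[-1]
        q = r0 // r1
        rows.append((r0 - q * r1, t0 - q * t1))
    # Index of the last remainder that is still >= sqrt(n).
    l = max(i for i, (r, _) in enumerate(rows) if r * r >= n)
    v1 = (rows[l + 1][0], -rows[l + 1][1])
    candidates = [(rows[l][0], -rows[l][1]), (rows[l + 2][0], -rows[l + 2][1])]
    v2 = min(candidates, key=lambda v: v[0] ** 2 + v[1] ** 2)
    return v1, v2


def round_div(x, d):
    """Nearest integer to x / d (d may be negative)."""
    return (2 * x + d) // (2 * d)


def decompose(k, v1, v2):
    """Split k into (k1, k2) with k = k1 + k2*lam mod n and both parts short."""
    (a1, b1), (a2, b2) = v1, v2
    det = a1 * b2 - a2 * b1              # equals +n or -n
    c1 = round_div(b2 * k, det)
    c2 = round_div(-b1 * k, det)
    return k - c1 * a1 - c2 * a2, -c1 * b1 - c2 * b2


if __name__ == "__main__":
    n = 2**61 - 1
    lam = cube_root_of_unity(n)
    v1, v2 = short_basis(n, lam)
    k = 0x123456789ABCDEF
    k1, k2 = decompose(k, v1, v2)
    print((k1 + k2 * lam - k) % n == 0, k1.bit_length(), k2.bit_length())
```

With such a split, $kP$ can be evaluated as the double scalar multiplication $k_1P + k_2\Phi(P)$ with half-length scalars. The four-dimensional variant described above halves the scalar length once more.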
Conference Paper
We propose efficient algorithms and formulas that improve the performance of side-channel protected scalar multiplication exploiting the Gallant-Lambert-Vanstone (CRYPTO 2001) and Galbraith-Lin-Scott (EUROCRYPT 2009) methods. Firstly, by adapting Feng et al.'s recoding to the GLV setting, we derive new regular algorithms for variable-base scalar multiplication that offer protection against simple side-channel and timing attacks. Secondly, we propose an efficient technique that interleaves ARM-based and NEON-based multiprecision operations over an extension field, as typically found on GLS curves and pairing computations, to improve performance on modern ARM processors. Finally, we showcase the efficiency of the proposed techniques by implementing a state-of-the-art GLV-GLS curve in twisted Edwards form defined over $GF(p^2)$, which supports a four-dimensional decomposition of the scalar and runs in constant time, i.e., it is fully protected against timing attacks. For instance, using a precomputed table of only 512 bytes, we compute a variable-base scalar multiplication in 92,000 cycles on an Intel Ivy Bridge processor and in 244,000 cycles on an ARM Cortex-A15 processor. Our benchmark results and the proposed techniques contribute to the improvement of the state-of-the-art performance of elliptic curve computations. Most notably, our techniques allow us to reduce the cost of adding protection against timing attacks in the GLV-based variable-base scalar multiplication computation to below 10%.
Conference Paper
Solving the elliptic curve discrete logarithm problem (ECDLP) by using Gröbner basis has recently appeared as a new threat to the security of elliptic curve cryptography and pairing-based cryptosystems. At Eurocrypt 2012, Faugère, Perret, Petit and Renault proposed a new method using a multivariable polynomial system to solve ECDLP over finite fields of characteristic 2. At Asiacrypt 2012, Petit and Quisquater showed that this method may beat generic algorithms for extension degrees larger than about 2000. In this paper, we propose a variant of Faugère et al.’s attack that practically reduces the computation time and memory required. Our variant is based on the idea of symmetrization. This idea already provided practical improvements in several previous works for composite-degree extension fields, but its application to prime-degree extension fields has been more challenging. To exploit symmetries in an efficient way in that case, we specialize the definition of factor basis used in Faugère et al.’s attack to replace the original polynomial system by a new and simpler one. We provide theoretical and experimental evidence that our method is faster and requires less memory than Faugère et al.’s method when the extension degree is large enough.