Faster Implementation of Scalar Multiplication on Koblitz Curves

Diego F. Aranha¹, Armando Faz-Hernández²,
Julio López³, and Francisco Rodríguez-Henríquez²
¹ Department of Computer Science, University of Brasília
dfaranha@unb.br
² Computer Science Department, CINVESTAV-IPN
armfaz@computacion.cs.cinvestav.mx, francisco@cs.cinvestav.mx
³ Institute of Computing, University of Campinas
jlopez@ic.unicamp.br
Abstract. We design a state-of-the-art software implementation of field
and elliptic curve arithmetic in standard Koblitz curves at the 128-bit
security level. Field arithmetic is carefully crafted by using the best
formulae and implementation strategies available, and the increasingly
common native support for binary field arithmetic in modern desktop
computing platforms. The i-th power of the Frobenius automorphism on
Koblitz curves is exploited to obtain new and faster interleaved versions
of the well-known τNAF scalar multiplication algorithm. The
$\tau^{\lfloor m/3 \rfloor}$ and $\tau^{\lfloor m/4 \rfloor}$ maps are
employed to create analogues of the 3- and 4-dimensional GLV
decompositions and, in general, the $\lfloor m/s \rfloor$-th power of the
Frobenius automorphism is applied as an analogue of an $s$-dimensional
GLV decomposition. The effectiveness of these techniques is illustrated
by timing the scalar multiplication operation for fixed, random and
multiple points. To our knowledge, our library was the first to compute a
random point scalar multiplication in less than $10^5$ clock cycles among
all curves with or without endomorphisms defined over binary or prime
fields. The results of our optimized implementation suggest a trade-off
between speed, compliance with the published standards and side-channel
protection. Finally, we estimate the performance of curve-based
cryptographic protocols instantiated using the proposed techniques and
compare our results to related work.
Key words: Efficient software implementation, Koblitz elliptic curves,
scalar multiplication.
1 Introduction
Since its introduction in 1985, Elliptic Curve Cryptography (ECC) has become
one of the most important and efficient public key cryptosystems in use. Its
security is based on the computational intractability of solving discrete logarithm
problems over the group formed by the rational points on an elliptic curve.
Anomalous binary curves, also known as Koblitz elliptic curves, were
introduced in [1]. Since then, these curves have been the subject of
extensive analysis and study. Given a finite field $\mathbb{F}_q$ for
$q = 2^m$, a Koblitz curve $E_a(\mathbb{F}_q)$ is defined as the set of
points $(x, y) \in \mathbb{F}_q \times \mathbb{F}_q$ that satisfy the equation
$$E_a : y^2 + xy = x^3 + ax^2 + 1, \quad a \in \{0, 1\}, \qquad (1)$$
together with a point at infinity denoted by $\mathcal{O}$. It is known that
$E_a(\mathbb{F}_q)$ forms an additive Abelian group with respect to the
elliptic point addition operation. In this paper, $E_a$ is a Koblitz curve
with order $\#E_a(\mathbb{F}_{2^m}) = 2^{2-a} r$, where $r$ is an odd prime.
Let $\langle P \rangle$ be an additively written subgroup in $E_a$ of prime
order $r$, and let $k$ be a positive integer such that $k \in [0, r-1]$.
Then, the elliptic curve scalar multiplication operation computes the
multiple $Q = kP$, which corresponds to the point resulting from adding $P$
to itself $k-1$ times. Given $r$, $P$ and $Q \in \langle P \rangle$, the
Elliptic Curve Discrete Logarithm Problem (ECDLP) consists of finding the
unique integer $k$ such that $Q = kP$ holds.
Since Koblitz curves are defined over the binary field $\mathbb{F}_2$, the
Frobenius map and its inverse naturally extend to an automorphism of the
curve denoted by $\tau$. The $\tau$ map takes $(x, y)$ to $(x^2, y^2)$ and
$\mathcal{O}$ to $\mathcal{O}$. It can be shown that
$(x^4, y^4) + 2(x, y) = \mu(x^2, y^2)$ for every $(x, y)$ on $E_a$, where
$\mu = (-1)^{1-a}$. In other words, $\tau$ satisfies $\tau^2 + 2 = \mu\tau$.
By solving the quadratic equation, we can associate $\tau$ with the complex
number $\tau = \frac{\mu + \sqrt{-7}}{2}$, that is,
$\tau = \frac{-1 + \sqrt{-7}}{2}$ when $a = 0$.
Elliptic curve scalar multiplication is the most expensive operation in
cryptographic protocols whose security guarantees are based on the ECDLP.
Improving the computational efficiency of this operation is a widely
studied problem. Over the years, a number of algorithms and techniques
providing efficient implementations with higher performance have been
proposed [2]. Many research works have focused their efforts on the
unknown point scenario, where the base point $P$ is not known in advance
and only a single scalar multiplication is required, as in the case of the
Diffie-Hellman key exchange protocol [3,4,5]. However, there are
situations where a single scalar multiplication must be performed on fixed
base points, such as in the key and signature generation procedures of the
Elliptic Curve Digital Signature Algorithm (ECDSA) standard. In other
scenarios, such as ECDSA signature verification, the simultaneous
computation of two scalar multiplications (one with an unknown point and
the other with a fixed point) of the form $R = kG + lQ$ is required.
Comparatively fewer research works have studied the latter cases [6,7,8].
In [9,3], the authors evaluated the achievable performance of binary
elliptic curve arithmetic in the latest 64-bit micro-architectures,
presenting a comprehensive analysis of unknown-point scalar multiplication
computations on random and Koblitz NIST elliptic curves at the 112-bit and
192-bit security levels. However, for the 128-bit security level they only
considered a random curve with side-channel resistant scalar
multiplication.⁴ This was mainly due to the unavailability of benchmarking
data for curves equipped with endomorphisms and the performance penalty of
halving-based approaches when applied to standardized curves.
⁴ Scalar multiplication on curve CURVE2251 was implemented in [3] using the
Montgomery laddering approach that is naturally protected against
first-order side-channel attacks.
In this work we revisit the software serial computation of scalar multiplication
on Koblitz curves defined over binary fields. This study includes the computa-
tion of the scalar multiplication using unknown and fixed points; and single and
simultaneous scalar multiplication computations as required in the generation
and verification of discrete-log based digital signatures. We extend the analysis
given in [3,9] and further investigate an alternate curve choice to provide a com-
plete picture of the performance scenario, while also showing through operation
counting and experimental results that Koblitz curves are still the fastest choice
for deploying curve-based cryptography if sufficient native support for binary
field arithmetic is available in the target platform and if resistance to software
side-channel attacks can be disregarded.
To this end, we adopted several techniques previously proposed by
different authors: (i) formulation of binary field arithmetic using vector
instructions [10]; (ii) time-memory trade-offs for the evaluation of fixed
$2^k$-powers in binary fields [11]; (iii) new formulas for polynomial
multiplication over $\mathbb{F}_2$ and its extensions [12]; (iv) efficient
support for the recently introduced carry-less multiplier [3].
Besides building on these advancements on finite field arithmetic, this pa-
per presents several novel techniques including: (i) improved implementation of
width-w τNAF integer recoding; (ii) a new precomputation scheme for small
multiples of a random point in a Koblitz curve; (iii) lazy-reduction formulae for
mixed addition in binary elliptic curves; (iv) novel interleaving strategies of the
τNAF algorithm for scalar multiplication in Koblitz curves via powers of the
Frobenius automorphism. We remark that the interleaved techniques proposed
in this work can be seen as the effective application for the first time in Koblitz
curves of an s-dimensional GLV decomposition. Moreover, in this work only the
“tried and tested” Koblitz curve NIST-K283 is considered, providing immediate
compatibility and interoperability with standards and existing implementations.
Note, however, that several of our techniques are not restricted in any sense to
this curve choice, and can therefore be used to accelerate scalar multiplication
in other Koblitz curves at different security levels.
Our main implementation result is a speed record for the unknown-point
single-core scalar multiplication computation over the NIST-K283 curve in
a little less than $10^5$ clock cycles. Running on an Intel Core i7-2600K
processor clocked at 3.4 GHz, we were able to compute a random point scalar
multiplication in just 29.18 µs.
This document is structured as follows: Section 2 discusses the low-level tech-
niques used for the implementation of field arithmetic and integer recoding.
Section 3 presents high-level techniques for arithmetic in the elliptic curve, com-
prising improved formulas for mixed addition by means of lazy reduction and
strategies for speeding up the scalar multiplication computation by using powers
of the Frobenius automorphism. Section 4 illustrates the efficiency of the pro-
posed techniques reporting operation counts and timings for scalar multiplication
in the fixed, unknown and multiple point scenarios; and extensively compares
the results with related work. Additionally in this section we estimate the per-
formance of signature and key agreement protocols when they are instantiated
with Koblitz curves. The final section concludes the paper with perspectives for
further performance improvement based on upcoming instruction sets.
2 Low-level techniques
Let $f(z)$ be a monic irreducible polynomial of degree $m$ over
$\mathbb{F}_2$. Then, the binary extension field $\mathbb{F}_{2^m}$ is
isomorphic to $\mathbb{F}_2[z]/(f(z))$, i.e., $\mathbb{F}_{2^m}$ is a finite
field of characteristic 2, whose elements are the finite set of all the
binary polynomials of degree less than $m$. In order to achieve a security
level equivalent to 128-bit AES when working with binary elliptic curves,
NIST recommends choosing the field extension $\mathbb{F}_{2^{283}}$, along
with the irreducible pentanomial $f(z) = z^{283} + z^{12} + z^7 + z^5 + 1$.
In a modern 64-bit computing platform, an element from the field
$\mathbb{F}_{2^m}$ represented in canonical basis requires
$n_{64} = \lceil m/64 \rceil$ processor words, or $n_{64} = 5$ when
$m = 283$. In the rest of this section, descriptions of algorithms and
formulas will refer to either generic or fixed versions of the binary
field, depending on whether or not the optimization is restricted to the
choice of $m = 283$.
As mentioned before, in this work we made an extensive use of vector in-
struction sets present in contemporary desktop processors. The platform model
given in Table 1 extends the notation reported in [10]. There is limited sup-
port for flexible bitwise shifting in vector registers, because propagation of bits
between the two contiguous 64-bit words requires additional operations. Notice
that vectorized multiple-precision or intra-digit shifts can always be made faster
when the shift amount is a multiple of 8 by means of the memory alignment in-
struction or the bytewise shift instruction, respectively, and that a simultaneous
table lookup mapping 4-bit indexes to bytes can be implemented through the
byte shuffling instruction called PSHUFB in the SSE instruction set.
Table 1: Relevant vector instructions for the implementation of binary field arithmetic.
Mnemonic                           Description                               SSE
$\otimes$                          Carry-less multiplication                 PCLMULQDQ
$\ll_{\nmid 8}$, $\gg_{\nmid 8}$   64-bit bitwise shifts                     PSLLQ, PSRLQ
$\ll_8$, $\gg_8$                   128-bit bytewise shift                    PSLLDQ, PSRLDQ
$\oplus$, $\wedge$, $\vee$         Bitwise XOR, AND, OR                      PXOR, PAND, POR
$\triangleleft$, $\triangleright$  Memory alignment/Multi-precision shifts   PALIGNR
In the following, we provide brief implementation notes on how relevant field
arithmetic operations such as, addition, multiplication, squaring, multi-squaring,
modular reduction and inversion; and integer width-w τNAF recoding, were
implemented.
Addition. It is the simplest operation in a binary field and can employ the
exclusive-or instruction with the largest operand size in the target platform.
This is particularly beneficial for vector instructions, but according to our
experiments, the 128-bit SSE [13] integer instruction proved to be faster
than the 256-bit AVX [14] floating-point instruction due to a higher recip-
rocal throughput [15] when operands are stored into registers.
Multiplication. Field multiplication is the performance-critical arithmetic
operation for elliptic curve arithmetic. Given two field elements
$a(z), b(z) \in \mathbb{F}_{2^{283}}$, we want to compute a third field
element $c(z) = a(z) \cdot b(z) \bmod f(z)$. This can be accomplished in
two separate steps: first the polynomial multiplication of the two
operands $a(z), b(z)$ is evaluated, and then the resulting double-length
polynomial is reduced modulo $f(z)$. From our field element
representation, the polynomial multiplication step can be seen as the
computation of the product of two $(n_{64} - 1)$-degree polynomials, each
with $n_{64}$ 64-bit coefficients. Alternatively, the two operands may
also be seen as $(\lceil n_{64}/2 \rceil - 1)$-degree polynomials, each
with $\lceil n_{64}/2 \rceil$ 128-bit coefficients. In the latter case,
each term-by-term multiplication can be solved via the standard Karatsuba
formula by performing 3 carry-less multiplications. When $n_{64} = 5$, the
above approaches require 13 (see [12,16]) and 14 invocations of the
carry-less multiplier instruction, respectively. Algorithm 1 below
presents our implementation of field multiplication over the field
$\mathbb{F}_{2^{283}}$ with 64-bit granularity using the formula given in
[12]. The computational complexity of Algorithm 1 is 13 carry-less
multiplications and 32 vector additions, plus one modular reduction
(Alg. 1, step 22) that will be discussed later. The most salient feature
of Algorithm 1 is that all the 13 carry-less multiplications have been
grouped into one single loop on steps 6-8. This is an attractive feature
from a throughput point of view, since it can potentially reduce the cost
of the carry-less multiplication instruction from 14 to 8 clock cycles in
the Intel Sandy Bridge micro-architecture, and from 12 to 7 clock cycles
in an AMD Bulldozer [15]. The rationale behind this cost reduction is that
the batch execution of independent multiplications directly benefits the
micro-architecture pipeline occupancy level. It is worth mentioning that
in [3], the authors concluded that the 64-bit granular approach tends to
consume more resources and complicate register allocation, limiting the
natural throughput exhibited by the carry-less multiplication instruction.
However, if the digits are stored in an interleaved form (see [17]), these
side effects are mitigated and higher throughput can again be achieved.
Squaring and multi-squaring. Squaring is a cheap operation in a binary
field due to the action of the Frobenius map, consisting of a linear
expansion of coefficients. Vectorized approaches using simultaneous table
lookups through byte shuffling instructions allow a particularly efficient
formulation of the coefficient expansion step [10]. Modular reduction is
usually the most expensive step when computing a squaring, especially when
$f(z)$ is an ordinary pentanomial (see [18]) for the word size. Dealing
efficiently with ordinary pentanomials requires flexible shifting
instructions that are often not directly supported in the target platform.
Multi-squaring is a time-memory trade-off in which a table of
$16\lceil m/4 \rceil$ field elements allows computing any fixed $2^k$
power with the cost equivalent of just a few squarings [11]. It is usually
the case that the multi-squaring approach becomes faster than repeated
squaring whenever $k \geq 6$ [3]. Contrary to addition, the availability of
256-bit instructions here contributes significantly to a performance
increase. This happens because this operation basically consists of a
sequence of additions with field elements obtained through a precomputed
table stored in main memory.
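The two ideas in this paragraph can be illustrated with a minimal Python sketch, assuming field elements are held as Python integers used as bit vectors. The names `EXPAND`, `square`, `multi_square_table` and `multi_square` are hypothetical, and the bit-by-bit `reduce_naive` is only a stand-in for an optimized reduction. The 16-entry `EXPAND` table plays the role of the 4-bit-to-byte lookup that a byte shuffle performs in parallel, and the multi-squaring table has $16\lceil m/4 \rceil$ entries as described above.

```python
M = 283
F = (1 << 283) | (1 << 12) | (1 << 7) | (1 << 5) | 1   # f(z) for NIST-K283

# 4-bit index -> byte with a zero inserted after every bit (coefficient expansion).
EXPAND = [sum(((v >> b) & 1) << (2 * b) for b in range(4)) for v in range(16)]

def reduce_naive(c):
    """Bit-by-bit reduction modulo f(z); a slow reference, not the vectorized one."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= F << (i - M)
    return c

def square(a):
    """Square a field element by expanding each nibble to a byte, then reducing."""
    r, shift = 0, 0
    while a:
        r |= EXPAND[a & 0xF] << shift
        a >>= 4
        shift += 8
    return reduce_naive(r)

def multi_square_table(k):
    """Rows T[j][v] = (v * z^(4j))^(2^k) mod f, for 4-bit v: 16 * ceil(M/4) entries."""
    basis = []
    for i in range(M):                      # (z^i)^(2^k) for every basis element
        e = 1 << i
        for _ in range(k):
            e = square(e)
        basis.append(e)
    table = []
    for j in range((M + 3) // 4):
        row = [0] * 16
        for v in range(1, 16):
            low = v & -v                    # lowest set bit of v
            idx = 4 * j + low.bit_length() - 1
            row[v] = row[v ^ low] ^ (basis[idx] if idx < M else 0)
        table.append(row)
    return table

def multi_square(a, table):
    """Compute a^(2^k) as a sum of table entries, one lookup per nibble of a."""
    r = 0
    for row in table:
        r ^= row[a & 0xF]
        a >>= 4
    return r
```

The sketch relies only on the linearity of squaring over $\mathbb{F}_2$: once the images of the basis elements are tabulated, a fixed $2^k$ power reduces to table lookups and additions, with no reduction at all.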
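Algorithm 1 Proposed implementation of multiplication in $\mathbb{F}_{2^{283}}$.
Input: $a(z) = a[0..4]$, $b(z) = b[0..4]$.
Output: $c(z) = c[0..4] = a(z) \cdot b(z)$.
Note: Pairs $a_i, b_i, c_i, m_i$ of 64-bit words represent vector registers.
1: for $i \leftarrow 0$ to $4$ do
2:   $c_i \leftarrow (a[i], b[i])$
3: end for
4: $c_5 \leftarrow c_0 \oplus c_1$, $c_6 \leftarrow c_0 \oplus c_2$, $c_7 \leftarrow c_2 \oplus c_4$, $c_8 \leftarrow c_3 \oplus c_4$
5: $c_9 \leftarrow c_3 \oplus c_6$, $c_{10} \leftarrow c_1 \oplus c_7$, $c_{11} \leftarrow c_5 \oplus c_8$, $c_{12} \leftarrow c_2 \oplus c_{11}$
6: for $i \leftarrow 0$ to $12$ do
7:   $m_i \leftarrow c_i[0] \otimes c_i[1]$
8: end for
9: $c_0 \leftarrow m_0$, $c_8 \leftarrow m_4$
10: $c_1 \leftarrow c_0 \oplus m_1$, $c_2 \leftarrow c_1 \oplus m_6$
11: $c_1 \leftarrow c_1 \oplus m_5$, $c_2 \leftarrow c_2 \oplus m_2$
12: $c_7 \leftarrow c_8 \oplus m_3$, $c_6 \leftarrow c_7 \oplus m_7$
13: $c_7 \leftarrow c_7 \oplus m_8$, $c_6 \leftarrow c_6 \oplus m_2$
14: $c_5 \leftarrow m_{11} \oplus m_{12}$, $c_3 \leftarrow c_5 \oplus m_9$
15: $c_3 \leftarrow c_3 \oplus c_0 \oplus c_6$
16: $c_4 \leftarrow c_1 \oplus c_7 \oplus m_9 \oplus m_{10} \oplus m_{12}$
17: $c_5 \leftarrow c_5 \oplus c_2 \oplus c_8 \oplus m_{10}$
18: $c_9 \leftarrow c_7 \gg_8 64$
19: $(c_7, c_5, c_3, c_1) \leftarrow (c_7, c_5, c_3, c_1) \triangleleft 8$
20: $c_0 \leftarrow c_0 \oplus c_1$, $c_1 \leftarrow c_2 \oplus c_3$, $c_2 \leftarrow c_4 \oplus c_5$
21: $c_3 \leftarrow c_6 \oplus c_7$, $c_4 \leftarrow c_8 \oplus c_9$
22: return $c = (c_4, c_3, c_2, c_1, c_0) \bmod f(z)$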
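As a sanity check of the 13-multiplication structure, the following Python sketch evaluates the same operand sums and recombination on 64-bit words and compares the result with a schoolbook carry-less product. It is only an illustration under stated assumptions: `clmul` is a software stand-in for the carry-less multiplier, the names `mul5_13` and `words_to_int` are hypothetical, and no vector registers or final reduction are modelled. In this sketch the coefficient of degree 3 is completed by adding the coefficients of degrees 0 and 6, which is what the comparison with the schoolbook product requires.

```python
def clmul(x, y):
    """Carry-less (polynomial) product of two integers over F_2."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r

def words_to_int(words):
    """Pack 64-bit words, least significant first, into one integer."""
    return sum(w << (64 * i) for i, w in enumerate(words))

def operand_sums(v):
    """The 13 operands built in steps 1-5 (indices 0..12)."""
    s = list(v)
    s.append(v[0] ^ v[1])     # 5
    s.append(v[0] ^ v[2])     # 6
    s.append(v[2] ^ v[4])     # 7
    s.append(v[3] ^ v[4])     # 8
    s.append(v[3] ^ s[6])     # 9
    s.append(v[1] ^ s[7])     # 10
    s.append(s[5] ^ s[8])     # 11
    s.append(v[2] ^ s[11])    # 12
    return s

def mul5_13(a, b):
    """Unreduced product of two 5-word polynomials using 13 word products."""
    m = [clmul(x, y) for x, y in zip(operand_sums(a), operand_sums(b))]
    d = [0] * 9                                   # coefficients of (z^64)^i
    d[0], d[8] = m[0], m[4]
    d[1] = m[0] ^ m[1] ^ m[5]
    d[2] = m[0] ^ m[1] ^ m[6] ^ m[2]
    d[7] = m[4] ^ m[3] ^ m[8]
    d[6] = m[4] ^ m[3] ^ m[7] ^ m[2]
    t = m[11] ^ m[12]
    d[3] = t ^ m[9] ^ d[0] ^ d[6]
    d[4] = d[1] ^ d[7] ^ m[9] ^ m[10] ^ m[12]
    d[5] = t ^ d[2] ^ d[8] ^ m[10]
    return words_to_int(d)                        # XOR and + agree only if no overlap, so:

def mul5_13_exact(a, b):
    """Same as mul5_13 but recombining the overlapping 128-bit coefficients by XOR."""
    m = [clmul(x, y) for x, y in zip(operand_sums(a), operand_sums(b))]
    d0, d8 = m[0], m[4]
    d1 = m[0] ^ m[1] ^ m[5]
    d2 = m[0] ^ m[1] ^ m[6] ^ m[2]
    d7 = m[4] ^ m[3] ^ m[8]
    d6 = m[4] ^ m[3] ^ m[7] ^ m[2]
    t = m[11] ^ m[12]
    d3 = t ^ m[9] ^ d0 ^ d6
    d4 = d1 ^ d7 ^ m[9] ^ m[10] ^ m[12]
    d5 = t ^ d2 ^ d8 ^ m[10]
    r = 0
    for i, di in enumerate((d0, d1, d2, d3, d4, d5, d6, d7, d8)):
        r ^= di << (64 * i)
    return r
```

Only `mul5_13_exact` should be used for checking: the nine coefficients are up to 127 bits wide and overlap when placed 64 bits apart, so they must be combined with XOR, as steps 18-21 do with shifts. `mul5_13` is kept to make that pitfall visible, since integer addition in `words_to_int` would introduce carries.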
Modular reduction. Efficient modular reduction of a double-length value
resulting from a squaring or multiplication operation to a proper field
element involves expressing the required shifted additions in terms of the
best shifting instructions possible. For the instruction sets available in
our target platform, this amounts to converting the highest possible
number of shifts to memory alignment instructions or byte-wise shifts.
Curve NIST-K283 is defined over an ordinary pentanomial, a particularly
inefficient choice for our vector register size. However, by observing
that $f(z) = z^{283} + z^{12} + z^7 + z^5 + 1 = z^{283} + (z^7 + 1)(z^5 + 1)$,
one can take advantage of this factorization to formulate faster shifted
additions. Algorithm 2 presents our explicit scheduling of shift
instructions to perform modular reduction in $\mathbb{F}_{2^{283}}$.
Suppose that the polynomial $c$ is written as $c = p_1 \| p_0$, where the
polynomial $p_0$ represents the lower 283 bits of $c$. The computation of
$c \bmod f(z)$ in Algorithm 2 is performed as follows: in lines 1 to 3,
the polynomial $p_1$ is computed by shifting the vector $(c_4, c_3, c_2)$
to the right by exactly 27 bits. Then, in lines 4 to 10, the operation
$c + p_1(z^7 + 1)(z^5 + 1)$ is performed, thus getting the vector
$(c_2, c_1, c_0)$. Finally, in lines 11 to 14, the remaining 101 most
significant bits of $c_2$ are reduced, a process that again involves a
multiplication by the polynomial $(z^7 + 1)(z^5 + 1)$.
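Algorithm 2 Implementation of reduction by $f(z) = z^{283} + (z^7 + 1)(z^5 + 1)$.
Input: Double-precision polynomial stored into 128-bit registers $c = (c_4, c_3, c_2, c_1, c_0)$.
Output: Field element $c \bmod f(z)$ stored into 128-bit registers $(c_2, c_1, c_0)$.
1: $t_2 \leftarrow c_2$, $t_0 \leftarrow (c_3, c_2) \triangleright 64$, $t_1 \leftarrow (c_4, c_3) \triangleright 64$
2: $c_4 \leftarrow c_4 \gg_{\nmid 8} 27$, $c_3 \leftarrow c_3 \gg_{\nmid 8} 27$, $c_3 \leftarrow c_3 \oplus (t_1 \ll_{\nmid 8} 37)$
3: $c_2 \leftarrow c_2 \gg_{\nmid 8} 27$, $c_2 \leftarrow c_2 \oplus (t_0 \ll_{\nmid 8} 37)$
4: $t_0 \leftarrow (c_4, c_3) \triangleright 120$, $c_4 \leftarrow c_4 \oplus (t_0 \gg_{\nmid 8} 1)$
5: $t_1 \leftarrow (c_3, c_2) \triangleright 64$, $c_3 \leftarrow c_3 \oplus (c_3 \ll_{\nmid 8} 7) \oplus (t_1 \gg_{\nmid 8} 57)$
6: $t_0 \leftarrow c_2 \ll_8 64$, $c_2 \leftarrow c_2 \oplus (c_2 \ll_{\nmid 8} 7) \oplus (t_0 \gg_{\nmid 8} 57)$
7: $t_0 \leftarrow (c_4, c_3) \triangleright 120$, $c_4 \leftarrow c_4 \oplus (t_0 \gg_{\nmid 8} 3)$
8: $t_1 \leftarrow (c_3, c_2) \triangleright 64$, $c_3 \leftarrow c_3 \oplus (c_3 \ll_{\nmid 8} 5) \oplus (t_1 \gg_{\nmid 8} 59)$
9: $t_0 \leftarrow c_2 \ll_8 64$, $c_2 \leftarrow c_2 \oplus (c_2 \ll_{\nmid 8} 5) \oplus (t_0 \gg_{\nmid 8} 59)$
10: $c_0 \leftarrow c_0 \oplus c_2$, $c_1 \leftarrow c_1 \oplus c_3$, $c_2 \leftarrow t_2 \oplus c_4$
11: $t_0 \leftarrow c_4 \gg_{\nmid 8} 27$
12: $t_1 \leftarrow t_0 \oplus (t_0 \ll_{\nmid 8} 5)$
13: $t_0 \leftarrow t_1 \oplus (t_1 \ll_{\nmid 8} 7)$
14: $c_0 \leftarrow c_0 \oplus t_0$, $c_2 \leftarrow c_2 \wedge$ (0x0000000000000000, 0x0000000007FFFFFF)
15: return $c = (c_2, c_1, c_0)$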
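The arithmetic behind this schedule can be checked independently of the register layout. The Python sketch below, assuming polynomials are stored as integers, folds the part above degree 282 back twice using $z^{283} \equiv (z^7 + 1)(z^5 + 1)$ and compares the result with a bit-by-bit reference. The function names are hypothetical and the shifts are whole-integer shifts rather than the per-lane vector shifts of Algorithm 2.

```python
M = 283
F = (1 << 283) | (1 << 12) | (1 << 7) | (1 << 5) | 1
MASK = (1 << M) - 1

def reduce_naive(c):
    """Reference reduction: clear one high bit at a time with a shifted copy of f."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= F << (i - M)
    return c

def times_low_part(p):
    """Multiply p by (z^7 + 1)(z^5 + 1) using two shifted additions."""
    p ^= p << 7          # p * (z^7 + 1)
    p ^= p << 5          # previous value * (z^5 + 1)
    return p

def reduce_283(c):
    """Reduce a polynomial of degree at most 564 modulo f(z) in two folds."""
    p1, p0 = c >> M, c & MASK
    c = p0 ^ times_low_part(p1)      # degree now at most 293
    p1, p0 = c >> M, c & MASK        # at most 11 bits are left above degree 282
    return p0 ^ times_low_part(p1)   # degree at most 282, no further fold needed
```

Two shifted additions replace the three that the expanded form $z^{12} + z^7 + z^5 + 1$ would need, which is the saving the factorization provides.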
Inversion. The field inversion approach that is probably the friendliest
to vector instruction sets is the Itoh-Tsujii inversion [19], which
computes the field inverse of $a$ using the identity
$a^{-1} = \left(a^{2^{m-1}-1}\right)^2$. The term $a^{2^{m-1}-1}$ is
obtained by sequentially computing intermediate terms of the form
$$\left(a^{2^i-1}\right)^{2^j} \cdot a^{2^j-1}, \qquad (2)$$
where the exponents $0 \leq i, j \leq m-1$ are elements of the addition
chain associated to the exponent $e = m-1$ [20,21]. The shortest addition
chain for $e = 282$ has length 11 and is
$1 \to 2 \to 4 \to 8 \to 16 \to 17 \to 34 \to 35 \to 70 \to 140 \to 141 \to 282$.
The computation of the above outlined procedure introduces an important
memory cost of storing 4 multi-squaring tables (for computing powers
$2^{17}, 2^{35}, 2^{70}, 2^{141}$), with each table containing
$16\lceil m/4 \rceil$ field elements. However, several of those tables can
be reused in the interleaving approach for scalar multiplication by
exploiting powers of the Frobenius automorphism, as will be explained in
the next section. We note that other approaches for computing
multiplicative field inverses, such as a polynomial version of the
extended Euclidean algorithm, tend to be not so efficient when vectorized,
mostly because they require intensive shifting of the intermediate values
generated by the algorithm.
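The addition chain can be turned into a short Python sketch, assuming field elements are integers and using a plain shift-and-add multiplication as a stand-in for the vectorized multiplier. The names `CHAIN`, `mul` and `inverse` are hypothetical. Each chain step $(i, j)$ applies expression (2) to obtain $a^{2^{i+j}-1}$ from $a^{2^i-1}$ and $a^{2^j-1}$; repeated squaring is used where the text would use a multi-squaring table.

```python
M = 283
F = (1 << 283) | (1 << 12) | (1 << 7) | (1 << 5) | 1

def mul(a, b):
    """Shift-and-add multiplication in F_2[z]/(f(z)), reducing as it goes."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:            # degree reached 283: subtract f(z)
            a ^= F
    return r

# Steps (i, j) with i + j running through 2, 4, 8, 16, 17, 34, 35, 70, 140, 141, 282.
CHAIN = [(1, 1), (2, 2), (4, 4), (8, 8), (16, 1), (17, 17),
         (34, 1), (35, 35), (70, 70), (140, 1), (141, 141)]

def inverse(a):
    """Itoh-Tsujii inversion: a^(-1) = (a^(2^(M-1) - 1))^2."""
    beta = {1: a}                       # beta[i] = a^(2^i - 1)
    for i, j in CHAIN:
        t = beta[i]
        for _ in range(j):              # t = beta[i]^(2^j)
            t = mul(t, t)
        beta[i + j] = mul(t, beta[j])
    return mul(beta[M - 1], beta[M - 1])
```

The chain costs 11 multiplications, and the four steps with $j \in \{17, 35, 70, 141\}$ are the ones where a multi-squaring table replaces a long run of squarings.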
Integer τNAF recoding. Solinas [22] presented a $\tau$-adic analogue of the
customary Non-Adjacent Form (NAF) recoding. An element
$\rho \in \mathbb{Z}[\tau]$ is found with
$\rho \equiv k \pmod{\frac{\tau^m - 1}{\tau - 1}}$, of as small norm as
possible, where for the subgroup of interest, $kP = \rho P$ and a width-w
τNAF representation for $\rho$ can be obtained in a way that mimics the
usual width-w NAF recoding. As in [22], let us define
$\alpha_i = i \bmod \tau^w$ for $i \in \{1, 3, 5, \ldots, 2^{w-1} - 1\}$.
A width-w τNAF of a nonzero element $k$ is an expression
$k = \sum_{i=0}^{l-1} u_i \tau^i$ where each
$u_i \in \{0, \pm\alpha_1, \pm\alpha_3, \ldots, \pm\alpha_{2^{w-1}-1}\}$
and $u_{l-1} \neq 0$, and at most one of any consecutive $w$ coefficients
is nonzero. Under reasonable assumptions, this procedure outputs an
expansion with length $l \leq m + 1$. Although the cost of width-w NAF
recoding is usually negligible when compared with the overall cost of
scalar multiplication, this is not generally the case with Koblitz curves,
where integer to width-w τNAF recoding can reach more than 10% of the
computational time for computing a scalar multiplication [3]. In this
work, the recoding was implemented by employing branchless techniques as
much as possible: the branches inside the recoding operation essentially
depend on random values, presenting a worst-case scenario for branch
prediction and causing severe performance penalties. In addition to that,
the code was also completely unrolled to handle only the precision
required in the current iteration. Since the magnitude of the involved
scalars gets reduced with each iteration, it is suboptimal to perform
operations considering the initial full precision. The deterministic
nature of the algorithm allows one to know in which precise iteration of
the main recoding loop the most significant word of the intermediate
values becomes zero, which permits representing these values with one
less processor word.
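For orientation, the basic recoding (width $w = 2$, digits in $\{0, \pm 1\}$) can be sketched in a few lines of Python. This is a textbook-style illustration under stated assumptions, not the branchless unrolled routine described above: it uses branches, arbitrary-precision integers, no partial reduction modulo $(\tau^m - 1)/(\tau - 1)$, and $\mu = -1$ (the case $a = 0$). The names `tnaf` and `evaluate` are hypothetical; `evaluate` checks the expansion exactly in $\mathbb{Z}[\tau]$ using $\tau^2 = \mu\tau - 2$.

```python
MU = -1   # mu = (-1)^(1 - a) with a = 0

def tnaf(r0, r1=0):
    """tau-adic NAF digits (least significant first) of r0 + r1*tau."""
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:
            u = 2 - ((r0 - 2 * r1) % 4)     # u is +1 or -1
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # Exact division of r0 + r1*tau by tau (r0 is even at this point).
        r0, r1 = r1 + MU * (r0 // 2), -(r0 // 2)
    return digits

def evaluate(digits):
    """Return (x, y) with sum(u_i * tau^i) = x + y*tau, by Horner's rule."""
    x, y = 0, 0
    for u in reversed(digits):
        x, y = -2 * y, x + MU * y           # multiply by tau
        x += u
    return x, y
```

For example, `tnaf(2)` yields the digits `[0, 1, 0, 1]`, i.e. $2 = \tau + \tau^3$. Because no partial reduction is applied, the expansion of an integer $k$ is roughly twice as long as its binary representation, which is why the reduction step precedes recoding in practice.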
3 High-level techniques
In the last section, several notes gave a general description of our algorithmic and
implementation choices for field arithmetic. This section describes the higher-
level strategies used in the elliptic curve arithmetic layer for increasing the per-
formance of scalar multiplication.
3.1 Exploiting powers of the Frobenius automorphism
Scalar multiplication algorithms on Koblitz curves are always tailored to
exploit the Frobenius automorphism $\tau$ on $E(\mathbb{F}_{2^m})$ given by
$\tau(x, y) = (x^2, y^2)$. One such example is the classic τNAF scalar
multiplication algorithm [22] and its width-w window variants. Given
$k \in \mathbb{Z}$ and $P \in E(\mathbb{F}_{2^m})$, these methods work by
first writing $k = \sum k_i \tau^i$ for
$k_i \in \{0, \pm\alpha_1, \pm\alpha_3, \ldots, \pm\alpha_{2^{w-1}-1}\}$,
with $\alpha_i = i \bmod \tau^w$ for
$i \in \{1, 3, 5, \ldots, 2^{w-1} - 1\}$. Then the scalar multiplication is
computed as $kP = \sum k_i \tau^i P$.
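The residues $\alpha_u$ can be checked with exact arithmetic in $\mathbb{Z}[\tau]$. The Python sketch below, assuming $a = 0$ (so $\mu = -1$) and $w = 5$, stores $x + y\tau$ as the pair `(x, y)` and verifies that each expression used for $\alpha_u$ in Section 3.4 differs from $u$ by a multiple of $\tau^5$. All function names are hypothetical, and the signs of $\alpha_{11}$ and $\alpha_{13}$ are the ones that make the congruence hold.

```python
MU = -1                      # a = 0, hence tau^2 = MU*tau - 2
ONE, TAU = (1, 0), (0, 1)

def add(p, q):
    return (p[0] + q[0], p[1] + q[1])

def sub(p, q):
    return (p[0] - q[0], p[1] - q[1])

def mul(p, q):
    """(a + b*tau)(c + d*tau) with tau^2 replaced by MU*tau - 2."""
    (a, b), (c, d) = p, q
    return (a * c - 2 * b * d, a * d + b * c + MU * b * d)

def divides(w, z):
    """True if w divides z in Z[tau], using z * conj(w) / norm(w)."""
    conj = (w[0] + MU * w[1], -w[1])
    norm = w[0] ** 2 + MU * w[0] * w[1] + 2 * w[1] ** 2
    x, y = mul(z, conj)
    return x % norm == 0 and y % norm == 0

T2 = mul(TAU, TAU)
T3 = mul(T2, TAU)
T5 = mul(mul(T3, TAU), TAU)
A5 = add(T2, ONE)
T2A5 = mul(T2, A5)
NEG_T2A5 = (-T2A5[0], -T2A5[1])

ALPHA = {
    1: ONE,
    3: sub(T2, ONE),
    5: A5,
    7: sub(T3, ONE),
    9: add(mul(T3, A5), ONE),
    11: sub(NEG_T2A5, ONE),
    13: add(NEG_T2A5, ONE),
    15: sub(T2A5, A5),
}

def is_residue(u):
    """Check alpha_u = u (mod tau^5)."""
    return divides(T5, sub(ALPHA[u], (u, 0)))
```

In this representation $\tau^2\alpha_5$ collapses to $2\tau$, which makes the dependency of the last four table entries on a single intermediate value easy to see.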
While powers $\tau^i$ of the automorphism can be automatically considered
endomorphisms in the context of the GLV method [23], this does not bring
any performance improvement, since applying these powers to a point has
exactly the same cost as iterating the automorphism during a standard
execution of the τNAF algorithm. Nevertheless, by employing time-memory
trade-offs for computing fixed $2^i$-th powers with cost significantly
smaller than $i$ consecutive squarings, a map of the form
$\tau^{\lfloor m/i \rfloor}$ can now be seen as an endomorphism useful for
accelerating scalar multiplication through interleaving strategies. For
example, the map $\psi \equiv \tau^{\lfloor m/2 \rfloor}$ allows an
interleaved scalar multiplication of two points from the expression
$kP = k_1 P + \tau^{\lfloor m/2 \rfloor} k_2 P = \sum k_{1,i} \tau^i P + \sum k_{2,i} \tau^i \psi(P)$,
saving the computational cost of $\lfloor m/2 \rfloor$ applications of the
Frobenius, or $3\lfloor m/2 \rfloor$ squarings. This might be seen as a
modest saving, since squaring in a binary field is often considered a
free operation. However, this is not entirely true when working with
cumbersome irreducible polynomials that lead to relatively expensive
modular reductions. This is exactly the case studied in this work and, to
be more precise, it can be said instead that interleaving via the $\psi$
endomorphism saves the computational cost associated with
$3\lfloor m/2 \rfloor$ modular reductions.
As explained above, the map $\psi$ achieves an analogue of a bidimensional
GLV decomposition for a Koblitz curve. Similarly, the usage of the
$\tau^{\lfloor m/3 \rfloor}$ and $\tau^{\lfloor m/4 \rfloor}$ maps can be
seen as analogues to 3- and 4-dimensional GLV decompositions or, more
generally, the $\lfloor m/s \rfloor$-th power of the Frobenius
automorphism as an analogue of an $s$-dimensional GLV decomposition. In
our working case where $m = 283$, note that the addition chain for
Itoh-Tsujii inversion was already chosen to include $\lfloor m/2 \rfloor$
and $\lfloor m/4 \rfloor$. Thus, exploiting these powers of the
automorphism does not imply additional storage costs. Observe that [24,9]
already explored this concept to obtain parallel formulations of scalar
multiplication in Koblitz curves.
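The folding idea can be checked on a toy model in which a point $cP$ is represented by its scalar $c \in \mathbb{Z}[\tau]$ and the Frobenius map is multiplication by $\tau$. The Python sketch below is an illustration under those assumptions only: it uses width-2 τNAF digits, $\mu = -1$, and hypothetical names (`tnaf`, `interleaved`). It folds the main loop $s$ times, with the leading digits beyond $s \cdot d$ added from the table shifted by $\tau^{(s-1)d}$, and counts how many Frobenius applications are performed.

```python
MU = -1                                   # a = 0, so tau^2 = MU*tau - 2
TAU = (0, 1)                              # x + y*tau stored as the pair (x, y)

def mul(p, q):
    (a, b), (c, d) = p, q
    return (a * c - 2 * b * d, a * d + b * c + MU * b * d)

def add_scaled(q, t, u):
    """Return q + u*t for an integer digit u."""
    return (q[0] + u * t[0], q[1] + u * t[1])

def tnaf(k):
    """Width-2 tau-adic NAF of the integer k, least significant digit first."""
    r0, r1, digits = k, 0, []
    while r0 != 0 or r1 != 0:
        u = 2 - ((r0 - 2 * r1) % 4) if r0 & 1 else 0
        r0 -= u
        digits.append(u)
        r0, r1 = r1 + MU * (r0 // 2), -(r0 // 2)
    return digits

def interleaved(digits, s):
    """Evaluate sum(u_i * tau^i) with the loop folded s times; P is modelled by 1."""
    d = len(digits) // s
    step = (1, 0)
    for _ in range(d):                    # step = tau^d
        step = mul(step, TAU)
    tables = [(1, 0)]                     # tables[j] models tau^(j*d) * P
    for _ in range(1, s):
        tables.append(mul(tables[-1], step))
    q, frobenius = (0, 0), 0
    for i in range(len(digits) - 1, s * d - 1, -1):    # leading digits
        q = mul(q, TAU)
        frobenius += 1
        q = add_scaled(q, tables[s - 1], digits[i])
    for i in range(d - 1, -1, -1):                     # folded main loop
        q = mul(q, TAU)
        frobenius += 1
        for j in range(s):
            q = add_scaled(q, tables[j], digits[i + j * d])
    return q, frobenius
```

With $l$ digits and fold length $d$, the loop applies the Frobenius $l - (s-1)d$ times instead of $l$, which is the source of the saving discussed above. The model says nothing about the cost of building the shifted tables.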
3.2 Lazy-reduced mixed point addition
The fastest formulas for the mixed addition $R = (X_3, Y_3, Z_3)$ of points
$P = (X_1, Y_1, Z_1)$ and $Q = (X_2, Y_2)$ in binary curves use
López-Dahab coordinates [25] and were proposed in [26]. When the
$a$-coefficient of the curve is 0, the formula is given below:
$$A = Y_1 + Y_2 \cdot Z_1^2, \quad B = X_1 + X_2 \cdot Z_1, \quad C = B \cdot Z_1,$$
$$Z_3 = C^2, \quad D = X_2 \cdot Z_3, \quad E = A \cdot C,$$
$$X_3 = E + (A^2 + C \cdot B^2), \quad Y_3 = (D + X_3) \cdot (E + Z_3) + (Y_2 + X_2) \cdot Z_3^2.$$
Evaluating this formula has a cost of 8 field multiplications, 5 field
squarings and 8 additions. It is possible to further save 2 modular
reductions when computing sums of products in the expressions for the
coordinates $X_3$ and $Y_3$ given above. This technique is called lazy
reduction [27] and trades off a modular reduction for a double-length
addition. Our working case presents the best conditions for lazy reduction
due to the poor choice of the irreducible pentanomial associated with the
NIST K-283 elliptic curve, and the high computational efficiency of the
field addition operation. It is then possible to evaluate the formula with
a cost equivalent to 8 unreduced multiplications, 5 unreduced squarings,
11 modular reductions, and 10 field additions. This is very similar to the
formula proposed in [28], but without introducing any new coordinates to
chain unreduced values across sequential additions.
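The formula and the reduction count can be exercised on a toy field. The Python sketch below assumes the small field $\mathbb{F}_2[z]/(z^5 + z^2 + 1)$ and the curve $y^2 + xy = x^3 + 1$ purely for testing; the names `mixed_add_lazy`, `affine_add`, `agrees` and the `STATS` counter are hypothetical. The two sums of products in $X_3$ and $Y_3$ are accumulated unreduced and reduced once each, and the result is compared with the textbook affine addition.

```python
M, F = 5, 0b100101                 # toy field F_2[z]/(z^5 + z^2 + 1), not K-283
STATS = {"reductions": 0}

def clmul(a, b):
    """Unreduced (double-length) polynomial product."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def red(c):
    """Modular reduction, counted so the lazy variant can be audited."""
    STATS["reductions"] += 1
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= F << (i - M)
    return c

def mul(a, b):
    return red(clmul(a, b))

def inv(a):
    r = 1
    for _ in range(2 ** M - 2):    # a^(2^M - 2) = a^(-1) for a != 0
        r = mul(r, a)
    return r

def mixed_add_lazy(P, Q):
    """Lopez-Dahab mixed addition for a = 0 with two lazily reduced sums."""
    (X1, Y1, Z1), (X2, Y2) = P, Q
    A = Y1 ^ mul(Y2, mul(Z1, Z1))
    B = X1 ^ mul(X2, Z1)
    C = mul(B, Z1)
    Z3 = mul(C, C)
    D = mul(X2, Z3)
    E = mul(A, C)
    X3 = E ^ red(clmul(A, A) ^ clmul(C, mul(B, B)))           # one reduction, two products
    Y3 = red(clmul(D ^ X3, E ^ Z3) ^ clmul(Y2 ^ X2, mul(Z3, Z3)))
    return X3, Y3, Z3

def affine_add(P, Q):
    """Affine addition on y^2 + xy = x^3 + 1 for points with distinct x."""
    (x1, y1), (x2, y2) = P, Q
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2
    return x3, mul(lam, x1 ^ x3) ^ x3 ^ y1

def curve_points():
    return [(x, y) for x in range(2 ** M) for y in range(2 ** M)
            if mul(y, y) ^ mul(x, y) == mul(x, mul(x, x)) ^ 1]

def agrees(P, Q, z):
    """Lift P to (x*z, y*z^2, z), add Q, and compare with the affine result."""
    x1, y1 = P
    X3, Y3, Z3 = mixed_add_lazy((mul(x1, z), mul(y1, mul(z, z)), z), Q)
    zi = inv(Z3)
    return (mul(X3, zi), mul(Y3, mul(zi, zi))) == affine_add(P, Q)
```

One call to `mixed_add_lazy` performs 13 unreduced products and 11 reductions, matching the counts in the text. The intermediate $E$ still has to be reduced on its own because it is reused as a multiplicand in $Y_3$.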
3.3 Scalar multiplication algorithm
Algorithm 3 provides a generic interleaved version of the width-w τNAF
point multiplication method when the main loop is folded $s$ times by
exploiting the $\lfloor m/s \rfloor$-th power of the Frobenius
automorphism. In comparison with the original algorithm, approximately
$3(s-1)\lfloor m/s \rfloor$ field squarings are saved. Notice, however,
that incrementing the value $s$ also increases the computational and
storage costs of constructing the table of base-point multiples performed
in Steps 2-5. In the following, the construction of this table of points
is referred to as the precomputation phase.
Algorithm 3 Interleaved width-w τNAF scalar multiplication using $\tau^{\lfloor m/s \rfloor}$.
Input: $k \in \mathbb{Z}$, $P \in E(\mathbb{F}_{2^m})$, integer $s$ denoting the interleaving factor.
Output: $kP \in E(\mathbb{F}_{2^m})$.
1: Compute width-w τNAF$(k) = \sum_{i=0}^{l-1} u_i \tau^i$
2: Compute $P_{0,u} = \alpha_u P$, for $u \in \{1, 3, 5, \ldots, 2^{w-1} - 1\}$
3: for $i \leftarrow 1$ to $(s-1)$ do
4:   Compute $P_{i,u} = \tau^{\lfloor m/s \rfloor} P_{i-1,u}$
5: end for
6: $Q \leftarrow \infty$
7: for $i \leftarrow l-1$ to $s\lfloor m/s \rfloor$ do
8:   $Q \leftarrow \tau Q$
9:   if $u_i \neq 0$ then
10:     Let $u$ be such that $\alpha_u = u_i$ or $\alpha_{-u} = -u_i$
11:     if $u_i > 0$ then $Q \leftarrow Q + P_{s-1,u}$; else $Q \leftarrow Q - P_{s-1,u}$
12:   end if
13: end for
14: for $i \leftarrow (\lfloor m/s \rfloor - 1)$ to $0$ do
15:   $Q \leftarrow \tau Q$
16:   for $j \leftarrow 0$ to $(s-1)$ do
17:     if $u_{i+j\lfloor m/s \rfloor} \neq 0$ then
18:       Let $u$ be such that $\alpha_u = u_{i+j\lfloor m/s \rfloor}$ or $\alpha_{-u} = -u_{i+j\lfloor m/s \rfloor}$
19:       if $u_{i+j\lfloor m/s \rfloor} > 0$ then $Q \leftarrow Q + P_{j,u}$; else $Q \leftarrow Q - P_{j,u}$
20:     end if
21:   end for
22: end for
23: return $Q = (x, y)$
3.4 Precomputation scheme
The scalar multiplication algorithm presented in Algorithm 3 requires the
computation of the set of affine points $P_{0,u} = \alpha_u P$, for
$u \in \{1, 3, 5, \ldots, 2^{w-1} - 1\}$. Basically, there are two simple
approaches to compute this set: use inversion-free addition in projective
coordinates and convert all the points at the end to affine coordinates
using Montgomery's simultaneous inversion method; or perform the
additions directly in affine coordinates. High inversion-to-multiplication
ratios clearly favor the former approach. The latter can be made more
viable when the ratio is moderate and simultaneous inversion is employed
for computing the denominators in affine addition.
For an illustration of both approaches, assume the choice $w = 5$, and let
$M, S, A, I$ be the cost of multiplication, squaring, addition and
inversion in $\mathbb{F}_{2^m}$, respectively. Let us consider first the
strategy of performing most of the operations in projective coordinates.
For the selected value of $w$, the first four point multiples of the
precomputation table, given as
$$\alpha_1 P = P; \quad \alpha_3 P = (\tau^2 - 1)P; \quad \alpha_5 P = (\tau^2 + 1)P; \quad \alpha_7 P = (\tau^3 - 1)P;$$
can be computed in projective coordinates at a cost of three point
additions plus three Frobenius operations. However, the last 4 point
multiples in the table, namely
$$\alpha_9 P = (\tau^3\alpha_5 + 1)P; \quad \alpha_{11} P = (-\tau^2\alpha_5 - 1)P;$$
$$\alpha_{13} P = (-\tau^2\alpha_5 + 1)P; \quad \alpha_{15} P = (\tau^2\alpha_5 - \alpha_5)P;$$
can only be computed once the point $\tau^2\alpha_5 P$ has been calculated
[2]. This situation requires either an expensive conversion to affine
coordinates of the point $\tau^2\alpha_5 P$ or the lower penalty of
performing one general instead of a mixed point addition, with an
associated cost of $(13M + 4S + 9A)$. Hence, it is possible to compute all
the required points with just 6 point additions or subtractions, a single
general point addition, 6 Frobenius in affine or projective coordinates
and a simultaneous conversion of 7 points to affine coordinates. Half of
the 6 point additions and subtractions mentioned above are between points
in affine coordinates and, considering the associated cost of simultaneous
Montgomery inversion, each of them has a computational cost of just
$(5M + 3S + 8A)$ and one single inversion. Hence, the total precomputation
cost for $w = 5$ is given as
$$\text{Proj. precomputation cost} = 3 \cdot (5M + 3S + 8A) + 3 \cdot (8M + 5S + 8A) + 3 \cdot 2S + 3 \cdot 3S + (13M + 4S + 9A) + 3 \cdot (7 - 1)M + I + 7 \cdot (2M + S) = 84M + 50S + 57A + I.$$
On the other hand, let us consider the second approach where all the
additions are directly performed in affine coordinates. Let us recall that
one affine addition costs $2M + S + I + 8A$. Due to the dependency
previously mentioned, we have to split all the affine addition
computations into two groups $\{\alpha_3 P, \alpha_5 P, \alpha_7 P\}$ and
$\{\alpha_9 P, \alpha_{11} P, \alpha_{13} P, \alpha_{15} P\}$, without
dependencies. Computing the first group requires 3 affine additions and a
simultaneous inversion to obtain 3 line slopes, whereas the second group
requires 4 affine additions and a simultaneous inversion to obtain the 4
slopes, for a total of
$7 \cdot (2M + S + 8A) + 3(3 - 1)M + 3(4 - 1)M + 2I = 29M + 7S + 56A + 2I$.
Considering only the dominant multiplications and inversions, the affine
precomputation scheme will be faster than the projective precomputation
scheme whenever the inversion-to-multiplication ratio is lower than 55,
an assumption entirely compatible with the target platform [3].
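The two totals and the break-even ratio follow from simple bookkeeping, which the Python sketch below reproduces. Costs are stored as hypothetical `(M, S, A, I)` tuples, and the comments naming each term are a reading of the expressions in the text rather than labels given by the authors.

```python
def cost(M=0, S=0, A=0, I=0):
    """Operation counts as a tuple (multiplications, squarings, additions, inversions)."""
    return (M, S, A, I)

def total(*terms):
    """Component-wise sum of (repetitions, cost) pairs."""
    return tuple(sum(n * c[i] for n, c in terms) for i in range(4))

projective = total(
    (3, cost(M=5, S=3, A=8)),        # additions between affine points (inversion shared)
    (3, cost(M=8, S=5, A=8)),        # mixed additions
    (3, cost(S=2)),                  # Frobenius on affine points
    (3, cost(S=3)),                  # Frobenius on projective points
    (1, cost(M=13, S=4, A=9)),       # one general point addition
    (1, cost(M=3 * (7 - 1), I=1)),   # simultaneous inversion of 7 values
    (7, cost(M=2, S=1)),             # conversion of 7 points to affine coordinates
)

affine = total(
    (7, cost(M=2, S=1, A=8)),        # 7 affine additions without their inversions
    (1, cost(M=3 * (3 - 1))),        # simultaneous inversion, first group of 3
    (1, cost(M=3 * (4 - 1))),        # simultaneous inversion, second group of 4
    (2, cost(I=1)),                  # the two shared inversions
)

# Ratio I/M at which both schemes cost the same, counting only M and I.
break_even = (projective[0] - affine[0]) / (affine[3] - projective[3])
```

Evaluating the sketch gives `projective == (84, 50, 57, 1)`, `affine == (29, 7, 56, 2)` and a break-even ratio of 55, in agreement with the totals above.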
4 Estimates, results and discussion
4.1 Performance estimates
Now we are in a position to estimate the performance of Algorithm 3 for
the values of $m = 283$, $s = 1$, $w = 5$. The algorithm executes the
precomputation scheme described in the last section, an average of $m$
applications of the Frobenius automorphism, an expected number of
$\frac{m}{w+1}$ additions and a final conversion to affine coordinates.
This amounts to a cost of about
$$\text{Estimated cost of Algorithm 3} = 29M + 7S + 56A + 2I + 283 \cdot 3S + 47 \cdot (8M + 5S + 8A) + (I + 2M + S) = 407M + 1092S + 3I.$$
For comparison, the current state-of-the-art serial implementation of a ran-
dom point multiplication, using a 4-dimensional GLV method over a prime curve
and the same choice of w, takes 1 inversion, 742 multiplications, 225 squarings
and 767 additions in $\mathbb{F}_{p^2}$, where $p$ has approximately 128 bits [29]. By using the
latest formula for 5-term polynomial multiplication described in the last section,
the scalar multiplication in Koblitz curves is expected to execute 407 ·13 = 5291
word multiplications, while the GLV-capable prime curve is expected to execute
(742 ·3 + 225 ·2) ·4 = 10704 word multiplications. This rough comparison means
that a scalar multiplication in a Koblitz curve should be considerably faster than
one in a prime curve equipped with endomorphisms if sufficient support for binary field
multiplication is present, or even twice as fast if this support is equivalent to
integer multiplication. Although the latency of the fastest carry-less multiplier
available (7 cycles at best [15]) is substantially higher than that of its integer multiplier
counterpart (3 cycles [15]), our analysis above suggests that it is still entirely possible
for a careful implementation of a Koblitz curve to achieve a comparable computational cost.
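The estimate for Algorithm 3 and the word-level comparison can be reproduced with a few lines of arithmetic. The sketch below uses only figures quoted in this section; field additions are left out, and the factors 13, 3, 2 and 4 are taken from the text as given, without assuming how they were derived. Variable names are illustrative.

```python
m, w = 283, 5

precomputation = {"M": 29, "S": 7, "I": 2}    # affine scheme for w = 5
frobenius_maps = {"S": 3 * m}                 # m Frobenius maps at 3S each
num_additions = round(m / (w + 1))            # expected number of point additions
additions = {"M": 8 * num_additions, "S": 5 * num_additions}
final_conversion = {"M": 2, "S": 1, "I": 1}   # back to affine coordinates

total = {"M": 0, "S": 0, "I": 0}
for part in (precomputation, frobenius_maps, additions, final_conversion):
    for op, n in part.items():
        total[op] += n

# Word-level multiplication counts, using the per-operation factors from the text.
koblitz_words = total["M"] * 13
glv_words = (742 * 3 + 225 * 2) * 4
ratio = glv_words / koblitz_words
```

The resulting `ratio` is slightly above 2, which is the basis for the "twice as fast" remark under the hypothesis of equally efficient binary and integer multipliers.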
4.2 Experimental results
In order to illustrate the performance obtained by the proposed techniques,
we implemented a library targeted to the Intel Westmere and Sandy Bridge
micro-architectures, focusing our efforts on benefitting from the SSE and AVX
instruction sets with the corresponding availability of 128-bit and 256-bit regis-
ters. The library was implemented in the C programming language, with vector
instructions accessed through their intrinsics interface. Both version 4.7.1 of the
GNU C Compiler Suite (GCC) and version 12.1 of the Intel C Compiler (ICC)
were used to build the library in a GNU/Linux environment.
Benchmarking was conducted on Intel Core i5-540M and Core i7-2600K processors
clocked at 2.5 GHz and 3.4 GHz, respectively, following the guidelines
provided on the eBACS website [30]. Namely, automatic overclocking, frequency
scaling and HyperThreading technologies were disabled to reduce randomness
in the results.
Table 2 presents timings and ratios related to the cost of multiplication for the
low-level field arithmetic layer of the library, which computes basic operations
in the field $\mathbb{F}_{2^{283}}$. Note how modular reduction dominates the cost of squaring
and how the moderate inversion-to-multiplication ratios justify the algorithmic
choices. Our best timing on Sandy Bridge for unreduced multiplication is 5%
faster than the 135 cycles reported in [31]; this saving is obtained by a careful
implementation of the same polynomial multiplication formula used in [31].
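Table 2: Timings given in clock cycles for basic operations in $\mathbb{F}_{2^{283}}$.

| Base field operation | Westmere GCC | Westmere ICC | Westmere op/M | Sandy Bridge GCC | Sandy Bridge ICC | Sandy Bridge op/M |
|---|---|---|---|---|---|---|
| Modular reduction | 28 | 28 | 0.11 | 20 | 22 | 0.15 |
| Unreduced multiplication | 159 | 163 | 0.89 | 128 | 132 | 0.89 |
| Multiplication | 182 | 182 | 1.00 | 142 | 149 | 1.00 |
| Squaring | 42 | 39 | 0.21 | 28 | 29 | 0.18 |
| Multi-Squaring | 287 | 295 | 1.62 | 235 | 243 | 1.63 |
| Inversion | 4,372 | 4,268 | 23.45 | 3,286 | 3,308 | 22.20 |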
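As a quick check of the claim that the measured inversion-to-multiplication ratios justify the affine precomputation, the sketch below recomputes the ratio from the ICC columns of Table 2 and compares it with the break-even value of 55 derived in the previous section. Using the ICC timings is an assumption based on the observation that they reproduce the op/M entries for inversion; the variable names are illustrative.

```python
# ICC timings (clock cycles) copied from Table 2.
icc_cycles = {
    "Westmere":     {"multiplication": 182, "inversion": 4268},
    "Sandy Bridge": {"multiplication": 149, "inversion": 3308},
}
BREAK_EVEN_RATIO = 55   # threshold from the precomputation analysis

im_ratio = {
    cpu: t["inversion"] / t["multiplication"] for cpu, t in icc_cycles.items()
}
affine_precomputation_preferred = {
    cpu: r < BREAK_EVEN_RATIO for cpu, r in im_ratio.items()
}
```

On both micro-architectures the ratio stays well below the threshold, so the affine scheme is the cheaper choice under this cost model.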
Table 3 shows the number of clock cycles for elliptic curve operations, such
as point addition, Frobenius endomorphism, and point doubling. The latter is
shown only to reflect the improvement of using point doubling-free scalar mul-
tiplication as is the case in Koblitz curves. Integer recoding is almost 3 times
faster than [3,9], even with longer scalars.
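Doubling can be avoided on Koblitz curves because the Frobenius map $\tau(x, y) = (x^2, y^2)$ satisfies $\tau^2 P + 2P = \mu \tau P$ with $\mu = (-1)^{1-a}$, a standard property of these curves. The following self-contained sketch checks this identity exhaustively on a toy curve $y^2 + xy = x^3 + ax^2 + 1$ over $\mathbb{F}_{2^5}$. The field size, the reduction polynomial and all function names are assumptions made for illustration; the code is unrelated to the optimized arithmetic of the library and is not constant-time.

```python
M = 5             # toy extension degree; the paper works with m = 283
POLY = 0b100101   # x^5 + x^2 + 1, irreducible over F_2

def gf_mul(a, b):
    """Multiply two elements of F_{2^M} (bit i holds the coefficient of x^i)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_inv(a):
    """Invert a nonzero element as a^(2^M - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = gf_mul(r, a)
    return r

def on_curve(P, a):
    """Check y^2 + xy = x^3 + a*x^2 + 1; None is the point at infinity."""
    if P is None:
        return True
    x, y = P
    x2 = gf_mul(x, x)
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(x2, x) ^ (x2 if a else 0) ^ 1
    return lhs == rhs

def neg(P):
    """Negation on a binary curve: -(x, y) = (x, x + y)."""
    return None if P is None else (P[0], P[0] ^ P[1])

def add(P, Q, a):
    """Affine addition (including doubling) on the curve with parameter a."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:          # Q = -P, or P has order 2
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))
        x3 = gf_mul(lam, lam) ^ lam ^ a
        y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
        return (x3, y3)
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)

def frobenius(P):
    """Frobenius map: two squarings in affine coordinates."""
    return None if P is None else (gf_mul(P[0], P[0]), gf_mul(P[1], P[1]))

def check_characteristic_equation(a):
    """Return True if tau^2(P) + 2P == mu * tau(P) for every affine point P."""
    mu = 1 if a == 1 else -1
    points = [(x, y) for x in range(2**M) for y in range(2**M)
              if on_curve((x, y), a)]
    for P in points:
        tau_p = frobenius(P)
        lhs = add(frobenius(tau_p), add(P, P, a), a)
        rhs = tau_p if mu == 1 else neg(tau_p)
        if lhs != rhs:
            return False
    return len(points) > 0
```

Calling `check_characteristic_equation(0)` and `check_characteristic_equation(1)` exercises both curve families; a doubling can thus always be traded for Frobenius maps, each costing only two squarings in affine coordinates.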
Table 3: Elliptic curve operations on NIST-K283 when points are represented in affine
or López-Dahab coordinates [25].

| Elliptic curve operation | Westmere GCC | Westmere ICC | Westmere op/M | Sandy Bridge GCC | Sandy Bridge ICC | Sandy Bridge op/M |
|---|---|---|---|---|---|---|
| Frobenius (Affine) | 84 | 70 | 0.38 | 55 | 55 | 0.37 |
| Frobenius (LD) | 118 | 115 | 0.63 | 85 | 83 | 0.55 |
| Doubling (LD) | 965 | 939 | 5.15 | 741 | 764 | 5.12 |
| Addition (LD Mixed) | 1,684 | 1,650 | 9.06 | 1,300 | 1,336 | 8.96 |
| Addition (LD General) | 2,683 | 2,643 | 14.52 | 2,086 | 2,145 | 14.39 |
| Width-$w$ $\tau$NAF recoding | 4,841 | 6,652 | 36.55 | 3,954 | 4,693 | 31.50 |

Timings reported for scalar multiplication are divided into three scenarios:
(i) known point, where the point to be multiplied is already known before the
execution of scalar multiplication; (ii) unknown point, the general case, where
the input point is not known until scalar multiplication is processed; (iii) double
multiplication of a fixed and a random point, a case usually needed for verifying
curve-based digital signatures. For the three scenarios, we used interleaved
versions of the left-to-right width-$w$ window $\tau$NAF scalar multiplication
algorithm with different choices of $w$. We present timings in Table 4. It was verified
experimentally that $s = 2$ is the best choice for random and double point
multiplication, providing a speedup of 3-5% over the conventional case $s = 1$, and that
$s = 4$ provides a significant performance increase for fixed point multiplication.
Table 4: Scalar multiplication in three different scenarios: fixed, random and multiple
points. Timings are given in $10^3$ processing cycles.

| Scalar multiplication | Westmere GCC | Westmere ICC | Sandy Bridge GCC | Sandy Bridge ICC |
|---|---|---|---|---|
| Random point ($kP$), $w = 5$, $s = 1$ | 139.6 | 135.1 | 105.3 | 105.3 |
| Random point ($kP$), $w = 5$, $s = 2$ | 130.9 | 127.8 | 99.2 | 99.7 |
| Fixed point ($kG$), $w = 8$, $s = 2$ | 80.8 | 79.0 | 61.5 | 62.3 |
| Fixed point ($kG$), $w = 8$, $s = 4$ | 72.6 | 71.7 | 55.1 | 55.9 |
| Fixed/random point ($kG + lQ$), $w_G = 6$, $w_Q = 5$, $s = 2$ | 207.8 | 206.8 | 157.7 | 160.8 |
| Fixed/random point ($kG + lQ$), $w_G = 8$, $w_Q = 5$, $s = 2$ | 192.3 | 190.6 | 146.3 | 148.7 |
4.3 Comparison to related work
The current state of the art is an implementation by Longa and Sica at the
128-bit security level on a Sandy Bridge platform, which achieves an unprotected
scalar multiplication of a random point on a prime curve in 91,000 clock cycles
with 16 precomputed points, and a side-channel resistant scalar multiplication
in 137,000 cycles with 36 precomputed points [29]. A protected implementation
by Bernstein et al. [8] reports 226,872 cycles for computing this operation on
Westmere and 194,208 cycles on Sandy Bridge [30]. Another implementation by
Hamburg [32] reports 153,000 cycles on Sandy Bridge. Our implementation is
only 9% slower than the current speed record when computing instances of the
ECDH key agreement protocol, even with considerably lower platform support
for the underlying field arithmetic.
Computing curve-based digital signatures usually amounts to scalar multi-
plication of fixed points. The authors of [8] report a latency of 87,548 cycles
to compute this operation on the Westmere and 70,292 cycles on the Sandy
Bridge [30] micro-architectures, while using a precomputed table of 256 points.
Hamburg [32] implemented this operation on Sandy Bridge in just 52,000 cycles
with 160 precomputed points. Compared to the first implementation and using
the same number of points, our timings are faster by 22%. Comparing to the
second implementation while reducing the number of precomputed points to 128,
our timings are slower by 15%.
The last scenario to analyze is signature verification, for which [8] reports
single signature verification timings of 273,364 cycles on Westmere and 226,516
cycles on Sandy Bridge [30], while reporting significantly improved timings for
batch verification. A faster implementation [32] verifies a signature using 32
precomputed points on Sandy Bridge in 165,000 cycles. We obtain speedups between
5% and 35% in this scenario, considering implementations with the same number
of points, and leave the possibility of batch verification as a future direction
of this work. It is important to stress that our implementation provides a trade-
off between side-channel protection and standards compliance. Consequently, it
allows faster and interoperable curve-based cryptography when resistance to side
channels is not required.
5 Conclusion
In this work, we presented a software implementation of elliptic curve arithmetic
in Koblitz curves defined over binary fields. By reusing several low-level
techniques recently introduced by other authors and proposing a number of useful
high-level techniques, we obtained state-of-the-art timings for computing scalar
multiplication of a random point in a binary curve, modelling a curve-based
key agreement protocol. Our implementation also provides a trade-off between
execution time and storage overhead for computing digital signatures and
significantly improves the time to verify a single signature. We expect our timings to
be accelerated further as support for binary field arithmetic improves on modern
64-bit platforms, either through a faster carry-less multiplier or via the 256-bit
integer vector instructions from the upcoming AVX2 instruction set. Our com-
putational cost analysis suggests that if the target platform had a binary field
multiplication instruction as efficient as integer multiplication, our implementa-
tion could still receive a further factor-2 speedup.
References
1. Koblitz, N.: CM-Curves with Good Cryptographic Properties. In: Feigenbaum, J.
(ed.) CRYPTO 1991. LNCS, vol. 576, pp. 279–287. Springer (1991)
2. Hankerson, D., Menezes, A.J., Vanstone, S.: Guide to Elliptic Curve Cryptography.
Springer-Verlag, Secaucus, USA (2003)
3. Taverne, J., Faz-Hernández, A., Aranha, D.F., Rodríguez-Henríquez, F., Hanker-
son, D., López, J.: Speeding scalar multiplication over binary elliptic curves using
the new carry-less multiplication instruction. Journal of Cryptographic Engineer-
ing 1(3) 187–199 (2011)
4. Longa, P., Gebotys, C.H.: Efficient techniques for high-speed elliptic curve cryp-
tography. In Mangard, S., Standaert, F.X. (eds.) CHES 2010. LNCS, vol. 6225,
pp. 80–94. Springer (2010)
5. Gaudry, P., Thomé, E.: The mpFq library and implementing curve-based key ex-
changes. In: Software Performance Enhancement of Encryption and Decryption
(SPEED 2007), pp. 49–64. http://www.hyperelliptic.org/SPEED/record.pdf
(2009)
6. Brown, M., Hankerson, D., López, J., Menezes, A.: Software Implementation of
the NIST Elliptic Curves Over Prime Fields. In Naccache, D. (ed.) CT-RSA 2001.
LNCS, vol. 2020, pp. 250–265. Springer (2001)
7. Galbraith, S.D., Lin, X., Scott, M.: Endomorphisms for faster elliptic curve cryp-
tography on a large class of curves. In Joux, A. (ed.) EUROCRYPT 2009. LNCS,
vol. 5479, pp. 518–535. Springer (2009)
8. Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.Y.: High-speed high-
security signatures. In Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917,
pp. 124–142. Springer (2011)
9. Taverne, J., Faz-Hernández, A., Aranha, D.F., Rodríguez-Henríquez, F., Hanker-
son, D., López, J.: Software implementation of binary elliptic curves: Impact of
the carry-less multiplier on scalar multiplication. In Preneel, B., Takagi, T. (eds.)
CHES 2011. LNCS, vol. 6917, pp. 108–123. Springer (2011)
10. Aranha, D.F., López, J., Hankerson, D.: Efficient Software Implementation of
Binary Field Arithmetic Using Vector Instruction Sets. In Abdalla, M., Barreto,
P.S.L.M. (eds.) LATINCRYPT 2010. LNCS, vol. 6212, pp. 144–161. Springer
(2010)
11. Bos, J.W., Kleinjung, T., Niederhagen, R., Schwabe, P.: ECC2K-130 on Cell CPUs.
In Bernstein, D.J., Lange, T. (eds.) AFRICACRYPT 2010. LNCS, vol. 6055, pp. 225–242.
Springer (2010)
12. Cenk, M., Özbudak, F.: Improved Polynomial Multiplication Formulas over $\mathbb{F}_2$
Using Chinese Remainder Theorem. IEEE Trans. Computers 58(4) 572–576 (2009)
13. Intel: Intel Architecture Software Developer’s Manual Volume 2: Instruction Set
Reference. http://www.intel.com (2002)
14. Firasta, N., Buxton, M., Jinbo, P., Nasri, K., Kuo, S.: Intel AVX: New frontiers in
performance improvement and energy efficiency. White paper available at http:
//software.intel.com/ (2008)
15. Fog, A.: Instruction tables: List of instruction latencies, throughputs and micro-
operation breakdowns for Intel, AMD and VIA CPUs. http://www.agner.org/
optimize/instruction_tables.pdf (2012)
16. Montgomery, P.: Five, six, and seven-term Karatsuba-like formulae. IEEE Trans-
actions on Computers 54(3) 362–369 (2005)
17. Gaudry, P., Brent, R., Zimmermann, P., Thomé, E.: The gf2x binary field multi-
plication library. https://gforge.inria.fr/projects/gf2x/
18. Scott, M.: Optimal Irreducible Polynomials for GF($2^m$) Arithmetic. Cryptology
ePrint Archive, Report 2007/192. http://eprint.iacr.org/ (2007)
19. Itoh, T., Tsujii, S.: A fast algorithm for computing multiplicative inverses in
GF($2^m$) using normal bases. Inf. Comput. 78(3) 171–177 (1988)
20. Guajardo, J., Paar, C.: Itoh-Tsujii inversion in standard basis and its application in
cryptography and codes. Designs, Codes and Cryptography 25(2) 207–216 (2002)
21. Rodríguez-Henríquez, F., Morales-Luna, G., Saqib, N.A., Cruz-Cortés, N.: Parallel
Itoh-Tsujii multiplicative inversion algorithm for a special class of trinomials. Des.
Codes Cryptography 45(1) 19–37 (2007)
22. Solinas, J.A.: Efficient Arithmetic on Koblitz Curves. Designs, Codes and Cryp-
tography 19(2-3) 195–249 (2000)
23. Gallant, R., Lambert, R., Vanstone, S.: Faster Point Multiplication on Elliptic
Curves with Efficient Endomorphisms. In Kilian, J., (ed.) CRYPTO 2001. LNCS,
vol. 2139, pp. 190–200. Springer (2001)
24. Ahmadi, O., Hankerson, D., Rodríguez-Henríquez, F.: Parallel formulations of
scalar multiplication on Koblitz curves. Journal of Universal Computer Science
14(3) 481–504 (2008)
25. López, J., Dahab, R.: Improved Algorithms for Elliptic Curve Arithmetic in
GF($2^n$). In Tavares, S.E., Meijer, H. (eds.) SAC 98. LNCS, vol. 1556, pp. 201–212.
Springer (1998)
26. Al-Daoud, E., Mahmod, R., Rushdan, M., Kiliçman, A.: A New Addition Formula
for Elliptic Curves over GF($2^n$). IEEE Trans. Computers 51(8) 972–975 (2002)
27. Weber, D., Denny, T.F.: The Solution of McCurley’s Discrete Log Challenge. In
Krawczyk, H. (ed.) CRYPTO 1998. LNCS, vol. 1462, pp. 458–471. Springer (1998)
28. Kim, K.H., Kim, S.I.: A new method for speeding up arithmetic on elliptic curves
over binary fields. Cryptology ePrint Archive, Report 2007/181. http://eprint.
iacr.org/ (2007)
29. Longa, P., Sica, F.: Four-Dimensional Gallant-Lambert-Vanstone Scalar Multipli-
cation. In ASIACRYPT 2012. To appear. (2012)
30. Bernstein, D.J., Lange, T. (eds.): eBACS: ECRYPT Benchmarking of Cryptographic
Systems. http://bench.cr.yp.to (May 18, 2012)
31. Su, C., Fan, H.: Impact of Intel’s new instruction sets on software implementation
of GF(2)[x] multiplication. Inf. Process. Lett. 112(12) 497–502 (2012)
32. Hamburg, M.: Fast and compact elliptic-curve cryptography. Cryptology ePrint
Archive, Report 2012/309. http://eprint.iacr.org/ (2012)
A Appendixes
We extend Tables 3.9 and 3.10 from [2] to include the corresponding values for
$w = 7, 8$.
Table 5: Expressions for $\alpha_u = u \bmod \tau^w$ for $w = 7$.
$u$    $u \bmod \tau^w$    TNAF($u \bmod \tau^w$)    $\alpha_u$
11(1) 1
33(-1, 0, 0, 1, 0, -1) τ2α39 1
55(-1, 0, 0, 1, 0, 1) τ2α39 + 1
77(-1, 0, 1, 0, 0, -1) τ3α35 1
93τ5(1, 0, 0, -1, 0, 1, 0, 0, 1) τ3α3+ 1
11 3τ3(-1, 0, -1, 0, -1, 0, -1) τ2α53 1
13 3τ1(-1, 0, -1, 0, -1, 0, 1) τ2α53 + 1
15 3τ+ 1 (1, 0, 0, 0, -1) τ2α37 α37
17 3τ+ 3 (1, 0, 0, 0, 1) τ2α35 +α37
19 3τ+ 5 (1, 0, 0, -1, 0, 1, 0, -1) τ2α31
21 4τ3(-1, 0, 1, 0, 1) τ2α35 + 1
23 4τ1(1, 0, 0, -1, 0, 0, -1) τ3α39 1
25 4τ+ 1 (1, 0, 0, -1, 0, 0, 1) τ3α39 + 1
27 4τ+ 3 (1, 0, 0, 0, -1, 0, -1) τ2α15 1
29 6τ+ 1 (-1, 0, 0, 0, -1, 0, 1) τ2α17 + 1
31 τ7(1, 0, 0, 0, 0, -1) τ2α39 +α35
33 τ5(1, 0, 0, 0, 0, 1) τ2α39 +α37
35 τ3(1, 0, -1) τ21
37 τ1(1, 0, 1) τ2+ 1
39 τ+ 1 (1, 0, 0, -1) τ31
41 τ+ 3 (1, 0, 0, 1) τ3+ 1
43 τ+ 5 (1, 0, 1, 0, -1, 0, -1) τ2α51 1
45 τ+ 7 (1, 0, 1, 0, -1, 0, 1) τ2α51 + 1
47 2τ5(-1, 0, -1, 0, 0, 0, -1) τ2α53 +α35
49 2τ3(-1, 0, -1, 0, 0, 0, 1) τ2α53 +α37
51 2τ1(1, 0, 1, 0, -1) τ2α37 1
53 2τ+ 1 (1, 0, 1, 0, 1) τ2α37 + 1
55 2τ+ 3 (-1, 0, -1, 0, 0, -1) τ3α37 1
57 2τ+ 5 (-1, 0, -1, 0, 0, 1) τ3α37 + 1
59 2τ+ 7 (-1, 0, 0, -1, 0, -1) τ2α41 1
61 5τ1(-1, 0, -1, 0, 0, -1, 0, 1) τ2α55 + 1
63 5τ+ 1 (1, 0, 0, 0, 0, 0, -1) τ2α15 +α35
a= 0.
$u$    $u \bmod \tau^w$    TNAF($u \bmod \tau^w$)    $\alpha_u$
11(1) 1
33(1, 0, 0, 1, 0, -1) τ2α39 1
55(1, 0, 0, 1, 0, 1) τ2α39 + 1
77(1, 0, -1, 0, 0, -1) τ3α35 1
93τ5(1, 0, 0, 1, 0, -1, 0, 0, 1) τ3α3+ 1
11 3τ3(-1, 0, -1, 0, -1, 0, -1) τ2α53 1
13 3τ1(-1, 0, -1, 0, -1, 0, 1) τ2α53 + 1
15 3τ+ 1 (1, 0, 0, 0, -1) τ2α37 α37
17 3τ+ 3 (1, 0, 0, 0, 1) τ2α35 +α37
19 3τ+ 5 (-1, 0, 0, -1, 0, 1, 0, -1) τ2α31
21 4τ3(-1, 0, 1, 0, 1) τ2α35 + 1
23 4τ1(1, 0, 0, 1, 0, 0, -1) τ3α39 1
25 4τ+ 1 (1, 0, 0, 1, 0, 0, 1) τ3α39 + 1
27 4τ+ 3 (1, 0, 0, 0, -1, 0, -1) τ2α15 1
29 6τ+ 1 (-1, 0, 0, 0, -1, 0, 1) τ2α17 + 1
31 τ7(-1, 0, 0, 0, 0, -1) τ2α39 +α35
33 τ5(-1, 0, 0, 0, 0, 1) τ2α39 +α37
35 τ3(1, 0, -1) τ21
37 τ1(1, 0, 1) τ2+ 1
39 τ+ 1 (-1, 0, 0, -1) τ31
41 τ+ 3 (-1, 0, 0, 1) τ3+ 1
43 τ+ 5 (1, 0, 1, 0, -1, 0, -1) τ2α51 1
45 τ+ 7 (1, 0, 1, 0, -1, 0, 1) τ2α51 + 1
47 2τ5(-1, 0, -1, 0, 0, 0, -1) τ2α53 +α35
49 2τ3(-1, 0, -1, 0, 0, 0, 1) τ2α53 +α37
51 2τ1(1, 0, 1, 0, -1) τ2α37 1
53 2τ+ 1 (1, 0, 1, 0, 1) τ2α37 + 1
55 2τ+ 3 (1, 0, 1, 0, 0, -1) τ3α37 1
57 2τ+ 5 (1, 0, 1, 0, 0, 1) τ3α37 + 1
59 2τ+ 7 (1, 0, 0, -1, 0, -1) τ2α41 1
61 5τ1(1, 0, 1, 0, 0, -1, 0, 1) τ2α55 + 1
63 5τ+ 1 (1, 0, 0, 0, 0, 0, -1) τ2α15 +α35
a= 1.
Table 7: Expressions for $\alpha_u = u \bmod \tau^w$ for $w = 8$.
$u$    $u \bmod \tau^w$    TNAF($u \bmod \tau^w$)    $\alpha_u$
11(1) 1
33(-1, 0, 0, 1, 0, -1) τ2α89 1
55(-1, 0, 0, 1, 0, 1) τ2α89 + 1
77(-1, 0, 1, 0, 0, -1) τ3α93 1
93τ5(1, 0, 0, -1, 0, 1, 0, 0, 1) τ3α3+ 1
11 3τ3(-1, 0, -1, 0, -1, 0, -1) τ2α75 1
13 3τ1(-1, 0, -1, 0, -1, 0, 1) τ2α75 + 1
15 3τ+ 1 (1, 0, 0, 0, -1) τ2α93 α93
17 3τ+ 3 (1, 0, 0, 0, 1) τ2α93 α91
19 3τ+ 5 (1, 0, 0, -1, 0, 1, 0, -1) τ2α31
21 3τ+ 7 (1, 0, 0, -1, 0, 1, 0, 1) τ2α3+ 1
23 3τ+ 9 (-1, 0, -1, 0, 0, -1, 0, 0, -1) τ3α73 1
25 6τ3(-1, 0, 0, -1, 0, 0, 1) τ3α87 + 1
27 6τ1(-1, 0, 0, 0, -1, 0, -1) τ2α17 1
29 6τ+ 1 (-1, 0, 0, 0, -1, 0, 1) τ2α17 + 1
31 6τ+ 3 (1, 0, 1, 0, 0, 0, 0, -1) τ3α75 +α87
33 6τ+ 5 (1, 0, 1, 0, 0, 0, 0, 1) τ3α75 +α89
35 6τ+ 7 (1, 0, 0, 0, 0, 1, 0, -1) τ2α95 1
37 6τ+ 9 (1, 0, 0, 0, 0, 1, 0, 1) τ2α95 + 1
39 6τ+ 11 (1, 0, 0, 0, 1, 0, 0, -1) τ3α17 1
41 8τ7(-1, 0, 0, 0, 1, 0, 0, 1) τ3α15 + 1
43 8τ5(1, 0, 0, -1, 0, 1, 0, -1, 0, -1) τ2α19 1
45 8τ3(1, 0, 0, -1, 0, 1, 0, -1, 0, 1) τ2α19 + 1
47 8τ1(1, 0, -1, 0, 0, 0, -1) τ2α109 +α91
49 8τ+ 1 (1, 0, -1, 0, 0, 0, 1) τ2α109 +α93
51 5τ11 (-1, 0, 0, 1, 0, 1, 0, -1) τ2α51
53 5τ9(-1, 0, 0, 1, 0, 1, 0, 1) τ2α5+ 1
55 5τ7(-1, 0, -1, 0, -1, 0, 0, -1) τ3α75 1
57 5τ5(-1, 0, -1, 0, -1, 0, 0, 1) τ3α75 + 1
59 5τ3(-1, 0, -1, 0, 0, -1, 0, -1) τ2α73 1
61 5τ1(-1, 0, -1, 0, 0, -1, 0, 1) τ2α73 + 1
63 5τ+ 1 (1, 0, 0, 0, 0, 0, -1) τ2α17 +α91
65 5τ+ 3 (1, 0, 0, 0, 0, 0, 1) τ2α17 +α93
67 2τ9(1, 0, 0, 1, 0, -1) τ2α87 1
69 2τ7(1, 0, 0, 1, 0, 1) τ2α87 + 1
71 2τ5(1, 0, 1, 0, 0, -1) τ3α91 1
73 2τ3(1, 0, 1, 0, 0, 1) τ3α91 + 1
75 2τ1(-1, 0, -1, 0, -1) τ2α91 1
77 2τ+ 1 (-1, 0, -1, 0, 1) τ2α91 + 1
79 2τ+ 3 (1, 0, 1, 0, 0, 0, -1) τ2α77 α93
81 2τ+ 5 (1, 0, 1, 0, 0, 0, 1) τ2α77 α91
83 τ7(-1, 0, -1, 0, 1, 0, -1) τ2α77 1
85 τ5(-1, 0, -1, 0, 1, 0, 1) τ2α77 + 1
87 τ3(-1, 0, 0, -1) τ31
89 τ1(-1, 0, 0, 1) τ3+ 1
91 τ+ 1 (-1, 0, -1) τ21
93 τ+ 3 (-1, 0, 1) τ2+ 1
95 τ+ 5 (-1, 0, 0, 0, 0, -1) τ2α89 +α91
97 τ+ 7 (-1, 0, 0, 0, 0, 1) τ2α89 +α93
99 τ+ 9 (-1, 0, -1, 0, 0, 0, 1, 0, -1) τ2α79 1
101 4τ3(-1, 0, 0, 0, 1, 0, 1) τ2α15 + 1
103 4τ1(-1, 0, 0, 1, 0, 0, -1) τ3α89 1
105 4τ+ 1 (-1, 0, 0, 1, 0, 0, 1) τ3α89 + 1
107 4τ+ 3 (1, 0, -1, 0, -1) τ2α93 1
109 4τ+ 5 (1, 0, -1, 0, 1) τ2α93 + 1
111 4τ+ 7 (1, 0, 0, -1, 0, 0, 0, -1) τ2α3+α91
113 4τ+ 9 (1, 0, 0, -1, 0, 0, 0, 1) τ2α3+α93
115 4τ+ 11 (-1, 0, -1, 0, 1, 0, 1, 0, -1) τ2α85 1
117 7τ1(-1, 0, 1, 0, 1, 0, 1) τ2α107 + 1
119 7τ+ 1 (1, 0, 1, 0, -1, 0, 0, -1) τ3α77 1
121 7τ+ 3 (1, 0, 1, 0, -1, 0, 0, 1) τ3α77 + 1
123 7τ+ 5 (1, 0, 1, 0, 0, -1, 0, -1) τ2α71 1
125 7τ+ 7 (1, 0, 1, 0, 0, -1, 0, 1) τ2α71 + 1
127 7τ+ 9 (1, 0, 0, 0, 0, 0, 0, -1) τ2α95 +α91
a= 0.
$u$    $u \bmod \tau^w$    TNAF($u \bmod \tau^w$)    $\alpha_u$
11(1) 1
33(1, 0, 0, 1, 0, -1) τ2α89 1
55(1, 0, 0, 1, 0, 1) τ2α89 + 1
77(1, 0, -1, 0, 0, -1) τ3α93 1
93τ5(1, 0, 0, 1, 0, -1, 0, 0, 1) τ3α3+ 1
11 3τ3(-1, 0, -1, 0, -1, 0, -1) τ2α75 1
13 3τ1(-1, 0, -1, 0, -1, 0, 1) τ2α75 + 1
15 3τ+ 1 (1, 0, 0, 0, -1) τ2α93 α93
17 3τ+ 3 (1, 0, 0, 0, 1) τ2α93 α91
19 3τ+ 5 (-1, 0, 0, -1, 0, 1, 0, -1) τ2α31
21 3τ+ 7 (-1, 0, 0, -1, 0, 1, 0, 1) τ2α3+ 1
23 3τ+ 9 (-1, 0, -1, 0, 0, 1, 0, 0, -1) τ3α73 1
25 6τ3(1, 0, 0, 1, 0, 0, 1) τ3α87 + 1
27 6τ1(-1, 0, 0, 0, -1, 0, -1) τ2α17 1
29 6τ+ 1 (-1, 0, 0, 0, -1, 0, 1) τ2α17 + 1
31 6τ+ 3 (-1, 0, -1, 0, 0, 0, 0, -1) τ3α75 +α87
33 6τ+ 5 (-1, 0, -1, 0, 0, 0, 0, 1) τ3α75 +α89
35 6τ+ 7 (-1, 0, 0, 0, 0, 1, 0, -1) τ2α95 1
37 6τ+ 9 (-1, 0, 0, 0, 0, 1, 0, 1) τ2α95 + 1
39 6τ+ 11 (-1, 0, 0, 0, -1, 0, 0, -1) τ3α17 1
41 8τ7(1, 0, 0, 0, -1, 0, 0, 1) τ3α15 + 1
43 8τ5(-1, 0, 0, -1, 0, 1, 0, -1, 0, -1) τ2α19 1
45 8τ3(-1, 0, 0, -1, 0, 1, 0, -1, 0, 1) τ2α19 + 1
47 8τ1(1, 0, -1, 0, 0, 0, -1) τ2α109 +α91
49 8τ+ 1 (1, 0, -1, 0, 0, 0, 1) τ2α109 +α93
51 5τ11 (1, 0, 0, 1, 0, 1, 0, -1) τ2α51
53 5τ9(1, 0, 0, 1, 0, 1, 0, 1) τ2α5+ 1
55 5τ7(1, 0, 1, 0, 1, 0, 0, -1) τ3α75 1
57 5τ5(1, 0, 1, 0, 1, 0, 0, 1) τ3α75 + 1
59 5τ3(1, 0, 1, 0, 0, -1, 0, -1) τ2α73 1
61 5τ1(1, 0, 1, 0, 0, -1, 0, 1) τ2α73 + 1
63 5τ+ 1 (1, 0, 0, 0, 0, 0, -1) τ2α17 +α91
65 5τ+ 3 (1, 0, 0, 0, 0, 0, 1) τ2α17 +α93
67 2τ9(-1, 0, 0, 1, 0, -1) τ2α87 1
69 2τ7(-1, 0, 0, 1, 0, 1) τ2α87 + 1
71 2τ5(-1, 0, -1, 0, 0, -1) τ3α91 1
73 2τ3(-1, 0, -1, 0, 0, 1) τ3α91 + 1
75 2τ1(-1, 0, -1, 0, -1) τ2α91 1
77 2τ+ 1 (-1, 0, -1, 0, 1) τ2α91 + 1
79 2τ+ 3 (1, 0, 1, 0, 0, 0, -1) τ2α77 α93
81 2τ+ 5 (1, 0, 1, 0, 0, 0, 1) τ2α77 α91
83 τ7(-1, 0, -1, 0, 1, 0, -1) τ2α77 1
85 τ5(-1, 0, -1, 0, 1, 0, 1) τ2α77 + 1
87 τ3(1, 0, 0, -1) τ31
89 τ1(1, 0, 0, 1) τ3+ 1
91 τ+ 1 (-1, 0, -1) τ21
93 τ+ 3 (-1, 0, 1) τ2+ 1
95 τ+ 5 (1, 0, 0, 0, 0, -1) τ2α89 +α91
97 τ+ 7 (1, 0, 0, 0, 0, 1) τ2α89 +α93
99 τ+ 9 (-1, 0, -1, 0, 0, 0, 1, 0, -1) τ2α79 1
101 4τ3(-1, 0, 0, 0, 1, 0, 1) τ2α15 + 1
103 4τ1(-1, 0, 0, -1, 0, 0, -1) τ3α89 1
105 4τ+ 1 (-1, 0, 0, -1, 0, 0, 1) τ3α89 + 1
107 4τ+ 3 (1, 0, -1, 0, -1) τ2α93 1
109 4τ+ 5 (1, 0, -1, 0, 1) τ2α93 + 1
111 4τ+ 7 (-1, 0, 0, -1, 0, 0, 0, -1) τ2α3+α91
113 4τ+ 9 (-1, 0, 0, -1, 0, 0, 0, 1) τ2α3+α93
115 4τ+ 11 (-1, 0, -1, 0, 1, 0, 1, 0, -1) τ2α85 1
117 7τ1(-1, 0, 1, 0, 1, 0, 1) τ2α107 + 1
119 7τ+ 1 (-1, 0, -1, 0, 1, 0, 0, -1) τ3α77 1
121 7τ+ 3 (-1, 0, -1, 0, 1, 0, 0, 1) τ3α77 + 1
123 7τ+ 5 (-1, 0, -1, 0, 0, -1, 0, -1) τ2α71 1
125 7τ+ 7 (-1, 0, -1, 0, 0, -1, 0, 1) τ2α71 + 1
127 7τ+ 9 (-1, 0, 0, 0, 0, 0, 0, -1) τ2α95 +α91
a= 1.
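Individual rows of the tables above can be sanity-checked by expanding the TNAF digit string with the standard relation $\tau^2 = \mu\tau - 2$, where $\mu = (-1)^{1-a}$. The sketch below is an illustrative helper (the name `tau_eval` is hypothetical) that evaluates a digit string, most significant digit first, to a pair $(c_0, c_1)$ representing $c_0 + c_1\tau$.

```python
def tau_eval(digits, a):
    """Evaluate a tau-adic digit string (most significant digit first).

    Returns (c0, c1) such that the string represents c0 + c1 * tau on the
    Koblitz curve with parameter a, using tau^2 = mu * tau - 2.
    """
    mu = 1 if a == 1 else -1
    c0, c1 = 0, 0
    for d in digits:
        # (c0 + c1*tau) * tau = -2*c1 + (c0 + mu*c1) * tau
        c0, c1 = -2 * c1, c0 + mu * c1
        c0 += d
    return c0, c1


# Rows taken from the w = 7 tables: u = 3 and u = 15 for a = 0, u = 3 for a = 1.
row_u3_a0 = tau_eval((-1, 0, 0, 1, 0, -1), a=0)
row_u15_a0 = tau_eval((1, 0, 0, 0, -1), a=0)
row_u3_a1 = tau_eval((1, 0, 0, 1, 0, -1), a=1)
```

For example, the digit string listed for $u = 15$ with $a = 0$ evaluates to $3\tau + 1$, in agreement with the second column of that row.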
... In the last decade, however, inspired by the new lines of research presented by Semaev [79], several researchers [29,31,35,49,80] (see also [30] for a comprehensive survey) have attempted to attack the DLP of all binary elliptic curves using summation-polynomial methods. 3 However, the current status of these attempts is crucially based on ill-understood Gröbner basis assumptions, which in some cases have led to contradictory behaviors [42], even for tiny experiments [30]. 4 On the other side, it is now standard knowledge that the Pollard rho algorithm is able to solve the DLP over generic curves with an exponential computational complexity of (1 + o(1))O( √ π ·q 2 ). ...
... Given a target ordinary binary curve defined over the field F 2 m , the Gaudry-Hess-Smart (GHS) attack [33,36,41,60] exploits the idea of finding an algebraic curve C of a relatively small genus g such that the Jacobian of C contains the target elliptic 3 Sometimes also called Semaev's polynomials. 4 It is interesting to note that Semaev's original work in [79] described an attack that in principle can be applied not only to binary but to all elliptic curves [75]. ...
... In software, the libraries reported in [3,70,71,84] rank among the fastest Diffie-Hellman software benchmarked in the eBACS site [11] in various platforms. In particular, the software library announced in [70] holds the current speed record for constant-time variable-base-point Diffie-Hellman software at the 128 bit security level. ...
Article
In this work, we retake an old idea that Koblitz presented in his landmark paper (Koblitz, in: Proceedings of CRYPTO 1991. LNCS, vol 576, Springer, Berlin, pp 279–287, 1991), where he suggested the possibility of defining anomalous elliptic curves over the base field F4{\mathbb {F}}_4. We present a careful implementation of the base and quadratic field arithmetic required for computing the scalar multiplication operation in such curves. We also introduce two ordinary Koblitz-like elliptic curves defined over F4{\mathbb {F}}_4 that are equipped with efficient endomorphisms. To the best of our knowledge, these endomorphisms have not been reported before. In order to achieve a fast reduction procedure, we adopted a redundant trinomial strategy that embeds elements of the field F4m,{\mathbb {F}}_{4^{m}}, with m a prime number, into a ring of higher order defined by an almost irreducible trinomial. We also suggest a number of techniques that allow us to take full advantage of the native vector instructions of high-end microprocessors. Our software library achieves the fastest timings reported for the computation of the timing-protected scalar multiplication on Koblitz curves, and competitive timings with respect to the speed records established recently in the computation of the scalar multiplication over binary and prime fields.
Chapter
The idea of the Kummer line was introduced by Gaudry and Lubicz [22]. Karati and Sarkar [31] proposed three efficient Kummer lines over prime fields, and [31, 40] show that they are faster than Curve25519\textsf{Curve25519} [4]. In this work, we explore the problem of secure and efficient scalar multiplications using the Kummer lines over binary fields compared to Koblitz curves, binary Edwards curves, and Weierstrass curves. In this article, we provide the first concrete proposal for binary Kummer line: BKL251\textsf{BKL}251 over the field F2251\mathbb {F}_{2^{251}}, and it offers 124.5-bit security that is the same as that of BEd251\textsf{BEd251} [8] and CURVE2251\textsf{CURVE2251} [51]. BKL251\textsf{BKL}251 has small curve parameters and a small base point. We implement BKL251\textsf{BKL}251 using the instruction PCLMULQDQ\texttt{PCLMULQDQ} of modern Intel processors and a software BBK251\textsf{BBK251} for batch computation of scalar multiplications using the bitslicing technique. We also provide the first implementation of Edwards curve BEd251\textsf{BEd}251 [8] using the PCLMULQDQ\texttt{PCLMULQDQ}, best to our knowledge. Thus this work complements the works of [5, 8]. All the implemented software compute scalar multiplications in constant time using Montgomery ladders. For the right-to-left Montgomery ladder scalar multiplication, each ladder step of a binary Kummer line needs fewer field operations than an Edwards curve. In the case of the left-to-right Montgomery ladder, a Kummer line and an Edwards curve have almost the same number of field operations. Our experimental results show that left-to-right Montgomery scalar multiplications of BKL251\textsf{BKL}251 are 9.63%9.63\% and 0.52%0.52\% faster than those of BEd251\textsf{BEd}251 for fixed-base and variable-base, respectively. 
Left-to-right Montgomery scalar multiplication for the variable-base of BKL251\textsf{BKL}251 is 39.74%39.74\%, 23.25%23.25\%, and 32.92%32.92\% faster than those of the curves CURVE2251\textsf{CURVE2251}, K283\mathsf {K-283}, and B283\mathsf {B-283}, respectively. Using the right-to-left Montgomery ladder with precomputation, BKL251\textsf{BKL}251 achieves a 17.84%17.84\% speedup over BEd251\textsf{BEd}251 for fixed-base scalar multiplication. For a batch computation, BBK251\textsf{BBK251} performs comparatively the same (slightly faster) as the BBE251\textsf{BBE251} and sect283r1\textsf{sect283r1}. Our experiments reveal that scalar multiplications on BKL251\textsf{BKL}251 and BEd251\textsf{BEd251} are (approximately) 65% faster than one scalar multiplication (after scaling down) of batch software BBK251\textsf{BBK251} and BBE251\textsf{BBE251}.KeywordsElliptic Curve CryptographyKummer lineEdwards CurveMontgomery LadderScalar MultiplicationBinary Field Arithmetic
Article
Nowadays, pairing-based cryptography researchers are looking for new parameters for standard security levels against the new number field sieve tower number field sieve algorithm. Recently, they have suggested new parameters for well-studied pairing-friendly curves with odd embedding degrees five and seven resistant to this attack. In this paper, we define optimal ate pairing on curves using sparse families with embedding degrees five and seven. We also provide details to perform the miller loop and the final exponentiation using addition chain process. Our theoretical results costs indicate that these families of curves offer the best performance in the computation of the optimal ate pairing at the 128-bit security level compared to Cocks–Pinch curves of embedding degrees five and seven. The improvement is about [Formula: see text] and [Formula: see text] faster than the optimal ate pairing previously computed on Cocks–Pinch curves of embedding degrees five and seven, respectively.
Chapter
Scalar multiplication is the basic operation in elliptic curve cryptography. The double-base number system (DBNS) is an effective tool for speeding up scalar multiplication on elliptic curves. This paper proposes a novel decomposition algorithm for scalar n based on the specific double bases (τ¯,τ) instead of the ordinary window τ-NAF. On μ4-Koblitz curves, we evaluate the cost of our scalar multiplication method and compare it to related work. We also consider scalar multiplication using LD coordinates. Experiment results show that μ4-Koblitz curves perform well.
Article
Full-text available
Elliptic curve cryptosystems are considered an efficient alternative to conventional systems such as DSA and RSA. Recently, Montgomery and Edwards elliptic curves have been used to implement cryptosystems. In particular, the elliptic curves Curve25519 and Curve448 were used for instantiating Diffie-Hellman protocols named X25519 and X448. Mapping these curves to twisted Edwards curves allowed deriving two new signature instances, called Ed25519 and Ed448, of the Edwards Digital Signature Algorithm. In this work, we focus on the secure and efficient software implementation of these algorithms using SIMD parallel processing. We present software techniques that target the Intel AVX2 vector instruction set for accelerating prime field arithmetic and elliptic curve operations. Our contributions result in a high-performance software library for AVX2-ready processors. For example, our library computes digital signatures 19% (for Ed25519) and 29% (for Ed448) faster than previous optimized implementations. Also, our library improves by 10% and 20% the execution time of X25519 and X448, respectively.
Chapter
This paper discusses the choices of elliptic curve models available to the would-be implementer, and assists the decision as to which model to use by examining the links between security and efficiency. In early public key cryptography schemes, such as ElGamal and RSA, the use of finite fields over large prime numbers was prevalent, thus preventing the need for difficult and expensive computations over extension fields. Thus, with the introduction of elliptic curve models, the same computational infrastructure using prime fields was inevitably used. As it became clear that elliptic curve models were more efficient than their public key competitors, they acquired a great deal of attention. In more recent times, and with the onset of the Internet of Things, the cryptography community is faced with the challenge of improving the efficiency of cryptography even further, resulting in many papers dealing with improvements of computational efficiencies. This search, along with improvements in both software and hardware dealing with characteristic two fields has instigated the analysis of elliptic curve constructions over binary extension fields. In particular, the ability to identify an object in the field with a bit string aids computation for binary elliptic curves. These circumstances account for our focus on binary elliptic curve fields in this paper in which we present an in-depth discussion on their efficiency and security properties along with other relevant features of various binary elliptic curve models.
Article
Koblitz curves allow very efficient elliptic curve cryptography. The reason is that one can trade expensive point doublings for cheap Frobenius endomorphisms by representing the scalar as a τ-adic expansion. Typically, elliptic curve cryptosystems such as ECDSA also require the scalar as an integer. This results in a need for conversions between integers and the τ-adic domain, which are costly and hinder the use of Koblitz curves on very constrained devices, such as RFID tags, wireless sensors, or certain applications of the Internet of Things. We provide solutions to this problem by showing how complete cryptographic processes, such as ECDSA signing, can be completed in the τ-adic domain with very few resources. This allows outsourcing conversions to a more powerful party. We provide several algorithms for performing arithmetic operations in the τ-adic domain. In particular, we introduce a new representation allowing more efficient and secure computations compared to the algorithms available in the preliminary version of this work from CARDIS 2014. We also provide datapath extensions with different speed and side-channel resistance properties that require areas from less than one hundred to a few hundred gate equivalents on 0.13-μm CMOS. These extensions are applicable to all Koblitz curves.
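The integer-to-τ-adic conversion mentioned in this abstract can be illustrated with a minimal Python sketch of the textbook τNAF recoding (Solinas' method), not the algorithms proposed in the paper. It assumes the usual convention τ² = μτ − 2 with μ = ±1 depending on the curve; the function names `tau_naf` and `evaluate` are hypothetical, and no partial reduction of the scalar is performed, so the expansions are roughly twice as long as those a practical implementation would use.

```python
def tau_naf(r0, r1=0, mu=1):
    """Return the tau-NAF digits (least significant first) of r0 + r1*tau.

    Assumed convention: tau^2 = mu*tau - 2, with mu = +1 or -1.
    """
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:
            # Choose u in {+1, -1} so that the next digit is forced to be 0.
            u = 2 - ((r0 - 2 * r1) % 4)
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # Exact division of (r0 + r1*tau) by tau; r0 is even at this point.
        half = r0 // 2
        r0, r1 = r1 + mu * half, -half
    return digits


def evaluate(digits, mu=1):
    """Recombine digits into (a, b) such that sum(u_i * tau^i) = a + b*tau."""
    a, b = 0, 0
    for u in reversed(digits):
        # Horner step: (a + b*tau) * tau + u, using tau^2 = mu*tau - 2.
        a, b = -2 * b + u, a + mu * b
    return a, b


if __name__ == "__main__":
    k = 2**20 + 12345  # arbitrary illustrative scalar
    d = tau_naf(k)
    print(len(d), evaluate(d) == (k, 0))
```

In a scalar multiplication, each digit position costs one Frobenius map (two field squarings in affine coordinates) instead of a point doubling, and only the nonzero digits trigger a point addition or subtraction.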
Thesis
The invention of public-key ciphers such as RSA and elliptic curve cryptosystems eased many problems, including key distribution and digital signatures. However, with the advent of quantum computers, which can solve the integer factorization and discrete logarithm problems, there are concerns about the future use of these cryptosystems. One of the new approaches to resisting quantum attacks is the isogeny problem on elliptic curves, which has attracted considerable attention. In this thesis, after introducing elliptic curves and the isogeny problem, we present their applications in cryptography and compare the security of the resulting systems with other proposals. We then show that the isogeny problem can be used to obtain cryptographically suitable elliptic curves, resistance against side-channel attacks, a quantum-resistant key exchange protocol, ordered digital signatures and a quantum-resistant public-key cryptosystem. Finally, after a few examples of software implementations in MAGMA, a time-memory attack against the isogeny-star-based public-key cryptosystem is presented. The attack converts the isogeny problem into the knapsack problem.
Article
The conversion from an integer scalar to a short and sparse τ-adic nonadjacent form (τNAF) is crucial for efficient elliptic curve scalar multiplication over Koblitz curves. Currently, the conversion is costly in both time and area, limiting the application of Koblitz curves. In this paper, we propose improved algorithms and implementations for both the single-digit and double-digit scalar conversions. Area reduction is achieved by removing the τ-and-add calculation of the remainder upon division by τm for lazy reduction or the τ²-and-add one for the double lazy reduction. The τNAF and the double τNAF algorithms are modified accordingly to support a mixed-form-reduced scalar from the new reduction algorithms. Furthermore, fair pipelining is explored to speed up conversion with only a slight increase in area. Implementation results on Altera Stratix II FPGA show that the proposed single-digit converters are both smaller and faster than existing works, and the 4-stage pipelined one achieves at least 42.3% area reduction and 78.9% better area-time product (ATP) performance. On Xilinx Virtex IV, our non-pipelined double-digit converters are at least 44.5% smaller but slightly slower, while the 4-stage pipelined one runs faster, with on average 46.6% better ATP than previous equivalent works.
Article
In this contribution, we derive a novel parallel formulation of the standard Itoh–Tsujii algorithm for multiplicative inverse computation over the field GF(2^m). The main building blocks used by our algorithm are: field multiplication, field squaring and field square root operators. It achieves its best performance when using a special class of irreducible trinomials, namely, P(x) = x^m + x^k + 1, with m and k odd numbers, and when implemented in hardware platforms. Under these conditions, our experimental results show that our parallel version of the Itoh–Tsujii algorithm yields a speedup of about 30% when compared with the standard version of it. Implemented in a Virtex 3200E FPGA device, our design is able to compute multiplicative inversion over GF(2^193) after 20 clock cycles in about 0.94 μs.
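For orientation, the standard (sequential) Itoh–Tsujii inversion that this abstract takes as its baseline can be sketched in Python as follows. This is not the parallel formulation of the paper and uses no square-root operator. The toy field GF(2^7) with the irreducible trinomial x^7 + x + 1 is an assumption chosen so that the example can be checked exhaustively, and the helper names are hypothetical.

```python
M = 7                              # toy extension degree (assumption)
POLY = (1 << 7) | (1 << 1) | 1     # x^7 + x + 1, irreducible over GF(2)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M), given as bit vectors, modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:           # reduce as soon as the degree reaches M
            a ^= POLY
    return r


def gf_sqr_n(a, n):
    """Square a repeatedly n times, i.e. compute a^(2^n)."""
    for _ in range(n):
        a = gf_mul(a, a)
    return a


def itoh_tsujii_inverse(a):
    """Return a^(-1) = (a^(2^(M-1) - 1))^2 using an addition chain on M-1."""
    if a == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    beta, k = a, 1                 # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:     # bits of M-1 below the leading one
        beta = gf_mul(gf_sqr_n(beta, k), beta)   # k -> 2k
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_sqr_n(beta, 1), a)  # k -> k + 1
            k += 1
    return gf_mul(beta, beta)      # final squaring gives a^(2^M - 2)


if __name__ == "__main__":
    ok = all(gf_mul(a, itoh_tsujii_inverse(a)) == 1 for a in range(1, 1 << M))
    print(ok)
```

The addition chain keeps the number of field multiplications logarithmic in m, while the bulk of the work is repeated squaring, which is cheap in binary fields.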
Article
Efficiently computable homomorphisms allow elliptic curve point multiplication to be accelerated using the Gallant–Lambert–Vanstone (GLV) method. Iijima, Matsuo, Chao and Tsujii gave such homomorphisms for a large class of elliptic curves by working over $\mathbb{F}_{p^2}$. We extend their results and demonstrate that they can be applied to the GLV method. In general we expect our method to require about 0.75 the time of previous best methods (except for subfield curves, for which Frobenius expansions can be used). We give detailed implementation results which show that the method runs in between 0.70 and 0.83 the time of the previous best methods for elliptic curve point multiplication on general curves. Keywords: Elliptic curves, Point multiplication, GLV method, Multiexponentiation, Isogenies
Conference Paper
Efficiently computable homomorphisms allow elliptic curve point multiplication to be accelerated using the Gallant-Lambert-Vanstone (GLV) method. We extend results of Iijima, Matsuo, Chao and Tsujii which give such homomorphisms for a large class of elliptic curves by working over $\mathbb{F}_{p^2}$ and demonstrate that these results can be applied to the GLV method. In general we expect our method to require about 0.75 the time of previous best methods (except for subfield curves, for which Frobenius expansions can be used). We give detailed implementation results which show that the method runs in between 0.70 and 0.84 the time of the previous best methods for elliptic curve point multiplication on general curves.
Conference Paper
This paper shows that a $390 mass-market quad-core 2.4GHz Intel Westmere (Xeon E5620) CPU can create 108000 signatures per second and verify 71000 signatures per second on an elliptic curve at a 2^128 security level. Public keys are 32 bytes, and signatures are 64 bytes. These performance figures include strong defenses against software side-channel attacks: there is no data flow from secret keys to array indices, and there is no data flow from secret keys to branch conditions. Keywords: Elliptic curves, Edwards curves, signatures, speed, software side channels, foolproof session keys
Conference Paper
The GLV method of Gallant, Lambert and Vanstone (CRYPTO 2001) computes any multiple $kP$ of a point $P$ of prime order $n$ lying on an elliptic curve with a low-degree endomorphism $\Phi$ (called GLV curve) over $\mathbb{F}_p$ as $kP = k_1P + k_2\Phi(P)$, with $\max\{|k_1|,|k_2|\}\leq C_1\sqrt{n}$, for some explicit constant $C_1>0$. Recently, Galbraith, Lin and Scott (EUROCRYPT 2009) extended this method to all curves over $\mathbb{F}_{p^2}$ which are twists of curves defined over $\mathbb{F}_p$. We show in this work how to merge the two approaches in order to get, for twists of any GLV curve over $\mathbb{F}_{p^2}$, a four-dimensional decomposition together with fast endomorphisms $\Phi, \Psi$ over $\mathbb{F}_{p^2}$ acting on the group generated by a point $P$ of prime order $n$, resulting in a proven decomposition for any scalar $k\in[1,n]$ given by $kP = k_1P + k_2\Phi(P) + k_3\Psi(P) + k_4\Psi\Phi(P)$, with $\max_i(|k_i|) < C_2\, n^{1/4}$, for some explicit $C_2>0$. Remarkably, taking the best $C_1, C_2$, we obtain $C_2/C_1<412$, independently of the curve, ensuring in theory an almost constant relative speedup. In practice, our experiments reveal that the use of the merged GLV-GLS approach supports a scalar multiplication that runs up to 50% faster than the original GLV method. We then improve this performance even further by exploiting the Twisted Edwards model and show that curves originally slower may become extremely efficient on this model. In addition, we analyze the performance of the method on a multicore setting and describe how to efficiently protect GLV-based scalar multiplication against several side-channel attacks. Our implementations improve the state-of-the-art performance of point multiplication for a variety of scenarios including side-channel protected and unprotected cases with sequential and multicore execution.
Article
Elliptic curve cryptosystems have improved greatly in speed over the past few years. Here we outline a new elliptic curve signature and key agreement implementation which achieves record speeds while remaining relatively compact. For example, on Intel Sandy Bridge, a curve with about 2^250 points produces a signature in just under 52k clock cycles, verifies in under 170k clock cycles, and computes a Diffie-Hellman shared secret in under 153k clock cycles. Our implementation has a small footprint: the library is under 60kB. Our implementation is also fast on ARM processors, verifying a signature in under 625k Tegra-2 cycles. We introduce faster field arithmetic, a new point compression algorithm, an improved fixed-base scalar multiplication algorithm and a new way to verify signatures without inversions or coordinate recovery. Some of these improvements should be applicable to other systems.
Conference Paper
Our purpose is to describe elliptic curves with complex multiplication which in characteristic 2 have the following useful properties for constructing Diffie-Hellman type cryptosystems: (1) they are nonsupersingular (so that one cannot use the Menezes-Okamoto-Vanstone reduction of discrete log from elliptic curves to finite fields); (2) the order of the group has a large prime factor (so that discrete logs cannot be computed by giant-step/baby-step or the Pollard rho method); (3) doubling of points can be carried out almost as efficiently as in the case of the supersingular curves used by Vanstone; (4) the curves are easy to find.