Software implementation of binary elliptic curves:
impact of the carry-less multiplier on scalar
multiplication

Jonathan Taverne¹*, Armando Faz-Hernández², Diego F. Aranha³**, Francisco
Rodríguez-Henríquez², Darrel Hankerson⁴, and Julio López³

¹ Université de Lyon, Université Lyon1, ISFA, France
jonathan.taverne@etu.univ-lyon1.fr
² Computer Science Department, CINVESTAV-IPN, México
armfaz@computacion.cs.cinvestav.mx, francisco@cs.cinvestav.mx
³ Institute of Computing, University of Campinas, Brazil
dfaranha@ic.unicamp.br, jlopez@ic.unicamp.br
⁴ Auburn University, USA
hankedr@auburn.edu
Abstract. The availability of a new carry-less multiplication instruction in the
latest Intel desktop processors significantly accelerates multiplication in binary
fields and hence presents the opportunity for reevaluating algorithms for binary
field arithmetic and scalar multiplication over elliptic curves. We describe how
to best employ this instruction in field multiplication and the effect on perfor-
mance of doubling and halving operations. Alternate strategies for implementing
inversion and half-trace are examined to restore most of their competitiveness
relative to the new multiplier. These improvements in field arithmetic are com-
plemented by a study on serial and parallel approaches for Koblitz and random
curves, where parallelization strategies are implemented and compared. The con-
tributions are illustrated with experimental results improving the state-of-the-art
performance of halving and doubling-based scalar multiplication on NIST curves
at the 112- and 192-bit security levels, and a new speed record for side-channel
resistant scalar multiplication in a random curve at the 128-bit security level.
Key words: Elliptic curve cryptography, finite field arithmetic, parallel algo-
rithm, efficient software implementation.
1 Introduction
Improvements in the fabrication process of microprocessors allow the resulting higher
transistor density to be converted into architectural features such as inclusion of new
instructions or faster execution of the current instruction set. Limits on the conven-
tional ways of increasing a processor’s performance such as incrementing the clock
rate, scaling the memory hierarchy [38] or improving support for instruction-level
parallelism [37] have pushed manufacturers to embrace parallel processing as the
mainstream computing paradigm and consequently amplify support for resources such as
multiprocessing and vectorization. Examples of the latter are the recent inclusions of
the SSE4 [22], AES [19] and AVX [14] instruction sets in the latest Intel
microarchitectures.
* This work was performed while the author was visiting CINVESTAV-IPN.
** A portion of this work was performed while the author was visiting University of Waterloo.
Since the dawn of elliptic curve cryptography in 1985, several field arithmetic as-
sumptions have been made by researchers and designers regarding its efficient imple-
mentation in software platforms. Some analysis (supported by experiments) assumed
that inversion to multiplication ratios (I/M) were sufficiently small (e.g., I/M ≈3)
that point operations would be done in affine coordinates, favoring certain techniques.
However, the small ratios were a mix of old hardware designs, slower multiplication
algorithms compared with [32], and composite extension degree. It seems clear that
sufficient progress was made in multiplication so there is incentive to use projective co-
ordinates. Our interest in the face of much faster multiplication is at the other end—is
I/M large enough to affect methods that commonly assumed this ratio is modest?
On the other hand, authors in [16] considered that the cost of a point halving
computation was roughly equivalent to 2 field multiplications. The expensive computations
in halving are a field multiplication, solving a quadratic $z^2 + z = c$, and finding a
square root over $\mathbb{F}_{2^m}$. However, quadratic solvers presented in [21] are
multiplication-free and hence, provided that a fast binary field multiplier is available,
there would be concern that the ratio of point halving to multiplication may be much
larger than 2. Having a particularly fast multiplier would also push for computing square
roots in $\mathbb{F}_{2^m}$ as efficiently as possible. Similarly, the common software
design assumption that field squaring is essentially free (relative to multiplication)
may no longer be valid.
A prevalent assumption is that large-characteristic fields are faster than binary field
counterparts for software implementations of elliptic curve cryptography.⁵ In spite of
simpler arithmetic, binary field realizations could not be faster than large-characteristic
analogues mostly due to the absence of a native carry-less multiplier in contemporary
high-performance processors. However, using a bit-slicing technique, Bernstein [6] was
able to compute a batch of scalar multiplications on a 251-bit binary curve, employing
314,323 clock cycles per scalar multiplication, which, before the results presented in
this work and to the best of our knowledge, was the fastest reported time for a software
implementation of binary elliptic point multiplication.
In this work, we evaluate the impact of the recently introduced carry-less multipli-
cation instruction [20] in the performance of binary field arithmetic and scalar multipli-
cation over elliptic curves. We also consider parallel strategies in order to speed scalar
multiplication when working on multi-core architectures. In contrast to parallelization
applied to a batch of operations, the approach considered here applies to a single point
multiplication. These approaches target different environments: batching makes sense
when throughput is the measurement of interest, while the lower level parallelization is
of interest when latency matters and the device is perhaps weak but has multiple pro-
cessing units. Furthermore, throughout this paper we will assume that we are working
in the unknown point scenario, i.e., where the elliptic curve point to be processed is not
known in advance, thus precluding off-line precomputation. We will assume that there
is sufficient memory space for storing a few multiples of the point to be processed and
look-up tables for accelerating the computation of the underlying field arithmetic.
⁵ In hardware realizations, the opposite thesis is widely accepted: elliptic curve scalar point
multiplication can be computed (much) faster using binary extension fields.
As the experimental results will show, our implementation of multiplication via this
native support was significantly faster than previous timings reported in the literature.
This motivated a study on alternative implementations of binary field arithmetic in hope
of restoring the performance ratios among different operations in which the literature is
traditionally based [21]. A direct consequence of this study is that performance analysis
based on these conventional ratios [5] will remain valid in the new platform. Our main
contributions are:
–A strategy to efficiently employ the native carry-less multiplier in binary field mul-
tiplication.
–Branchless and/or vectorized approaches for implementing half-trace computation,
integer recoding and inversion. These approaches allow the halving operation to
become again competitive with doubling in the face of a significantly faster multi-
plier, and help to reduce the impact of integer recoding and inversion in the overall
speed of scalar multiplication, even when projective coordinates are used.
–Parallelization strategies for dual core execution of scalar multiplication algorithms
in random and Koblitz binary elliptic curves.
We obtain a new state-of-the-art implementation of arithmetic in binary elliptic curves,
including improved performance for NIST-standardized Koblitz curves and random
curves suitable for halving and a new speed record for side-channel resistant point mul-
tiplication in a random curve at the 128-bit security level.
The remainder of the paper progresses as follows. Section 2 elaborates on exploit-
ing carry-less multiplication for high-performance field multiplication along with im-
plementation strategies for half-trace and inversion. Sections 3 and 4 discuss serial and
parallel approaches for scalar multiplication. Section 5 presents extensive experimen-
tal results and comparison with related work. Section 6 concludes the paper with per-
spectives on the interplay between the proposed implementation strategies and future
enhancements in the architecture under consideration.
2 Binary field arithmetic
A binary extension field $\mathbb{F}_{2^m}$ can be constructed by means of a degree-$m$
polynomial $f$ irreducible over $\mathbb{F}_2$ as
$\mathbb{F}_{2^m} \cong \mathbb{F}_2[z]/(f(z))$. In the case of software implementations in
modern desktop platforms, field elements $a \in \mathbb{F}_{2^m}$ can be represented as
polynomials of degree at most $m-1$ with binary coefficients $a_i$ packed in
$n_{64} = \lceil m/64 \rceil$ 64-bit processor words. In this context, the recently
introduced carry-less multiplication instruction can play a significant role in order to
efficiently implement a multiplier in $\mathbb{F}_{2^m}$. Along with field multiplication,
other relevant field arithmetic operations such as squaring, square root, and half-trace,
will be discussed in the rest of this section.
2.1 Multiplication
Field multiplication is the performance-critical operation for implementing several cryp-
tographic primitives relying on binary fields, including arithmetic over elliptic curves
and the Galois Counter Mode of operation (GCM). For accelerating the latter when used
in combination with the AES block cipher [19], Intel introduced the carry-less multi-
plier in the Westmere microarchitecture as an instruction operating on 64-bit words
stored in 128-bit vector registers with opcode pclmulqdq [20]. The instruction latency
currently peaks at 15 cycles while reciprocal throughput ranks at 10 cycles. In other
words, when operands are not in a dependency chain, effective latency is 10 cycles [15].
The instruction certainly looks expensive when compared to the 3-cycle 64-bit in-
teger multiplier present in the same platform, which raises speculation whether Intel
aimed for an area/performance trade-off or simply balanced the latency to the point
where the carry-less multiplier did not interfere with the throughput of the hardware
AES implementation. Either way, the instruction features suggest the following empir-
ical guidelines for organizing the field multiplication code: (i) as memory access by
vector instructions continues to be expensive [6], the maximum amount of work should
be done in registers, for example through a Comba organization [12]; (ii) as the number
of registers employed in multiplication should be minimized for avoiding false depen-
dencies and maximize throughput, the multiplier should have 128-bit granularity; (iii) as
the instruction latency allows, each 128-bit multiplication should be implemented with
three carry-less multiplications in a Karatsuba fashion [25].
In fact, the overhead of Karatsuba multiplication is minimal in binary fields and the
Karatsuba formula with the smallest number of multiplications for multiplying
$\lceil n_{64}/2 \rceil$ 128-bit digits proved to be optimal in all the considered field
sizes. This observation comes in direct contrast to previous vectorized implementations
of the comb method for binary field multiplication due to López and Dahab
[32, Algorithm 5], where the memory-bound precomputation step severely limits the number
of Karatsuba steps which can be employed, fixing the cutoff point to large fields [2]
such as $\mathbb{F}_{2^{1223}}$. To summarize, multiplication was implemented as a 128-bit
granular Karatsuba multiplier with each 128-bit digit multiplication solved by another
Karatsuba instance requiring three carry-less multiplications, cheap additions and
efficient shifts by multiples of 8 bits. A single level of Karatsuba over 128-bit digits
was used for fields $\mathbb{F}_{2^{233}}$ and $\mathbb{F}_{2^{251}}$ where
$\lceil n_{64}/2 \rceil = 2$, while two levels were used for field $\mathbb{F}_{2^{409}}$
where $\lceil n_{64}/2 \rceil = 4$. Particular approaches which led to lower performance
in our experiments were organizations based on optimal Toom-Cook [10] due to the higher
overhead brought by minor operations; and on a lower 64-bit granularity combined with
alternative multiple-term Karatsuba formulas [33] due to register exhaustion to store all
the intermediate values, causing a reduction in overall throughput.
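The two-level organization described above can be sketched in a few lines. This is only an illustration under stated assumptions: `clmul64` is a hypothetical software stand-in for the `pclmulqdq` instruction, Python integers stand in for vector registers, and modular reduction is left out. The names `karatsuba128`, `karatsuba256` and `CALLS` are not from the paper.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1
CALLS = {"clmul": 0}  # counts emulated carry-less multiplications


def clmul64(a, b):
    """Carry-less product of two 64-bit words (software stand-in for pclmulqdq)."""
    CALLS["clmul"] += 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba128(a, b):
    """Multiply two 128-bit binary polynomials using three 64-bit carry-less products."""
    a0, a1 = a & MASK64, a >> 64
    b0, b1 = b & MASK64, b >> 64
    lo = clmul64(a0, b0)
    hi = clmul64(a1, b1)
    mid = clmul64(a0 ^ a1, b0 ^ b1) ^ lo ^ hi  # a0*b1 + a1*b0 over F_2
    return lo ^ (mid << 64) ^ (hi << 128)


def karatsuba256(a, b):
    """One further Karatsuba level over two 128-bit digits (unreduced product)."""
    a0, a1 = a & MASK128, a >> 128
    b0, b1 = b & MASK128, b >> 128
    lo = karatsuba128(a0, b0)
    hi = karatsuba128(a1, b1)
    mid = karatsuba128(a0 ^ a1, b0 ^ b1) ^ lo ^ hi
    return lo ^ (mid << 128) ^ (hi << 256)


def schoolbook(a, b):
    """Reference shift-and-xor product over F_2, used only for checking."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r
```

With two 128-bit digits, as for the 233-bit and 251-bit fields, the sketch issues 3 x 3 = 9 emulated carry-less multiplications per unreduced product, which is the count the Karatsuba organization implies.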
2.2 Squaring, square-root and multi-squaring
Squaring and square-root are considered cheap operations in a binary field, especially
when $\mathbb{F}_{2^m}$ is defined by a square-root friendly polynomial [3,1], because
they require only linear manipulation of individual coefficients [21]. These operations
are traditionally implemented with the help of large precomputed tables, but vectorized
implementations are possible with simultaneous table lookups through byte shuffling
instructions [2]. This approach is enough to keep square and square-root efficient
relative to multiplication even with a dramatic acceleration of field multiplication. For
illustration, [2] reports multiplication-to-squaring ratios as high as 34 without a
native multiplier, far from the conventional ratios of 5 [5] or 7 [21] and with a large
room for future improvement.
Multi-squaring, or exponentiation to $2^k$, can be efficiently implemented with a
time-memory trade-off proposed as m-squaring in [1,11] and here referred to as
multi-squaring. For a fixed $k$, a table $T$ of $16\lceil m/4 \rceil$ field elements can
be precomputed such that
$T[j, i_0 + 2i_1 + 4i_2 + 8i_3] = (i_0 z^{4j} + i_1 z^{4j+1} + i_2 z^{4j+2} + i_3 z^{4j+3})^{2^k}$
and $a^{2^k} = \sum_{j=0}^{\lceil m/4 \rceil} T[j, \lfloor a/2^{4j} \rfloor \bmod 2^4]$.
The threshold where multi-squaring became faster than simple consecutive squaring
observed in our implementation was around $k \ge 6$ for $\mathbb{F}_{2^{233}}$ and
$k \ge 10$ for $\mathbb{F}_{2^{409}}$.
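The table-based multi-squaring above can be illustrated as follows. This is a sketch under assumptions, not the paper's implementation: it uses the NIST trinomial $z^{233} + z^{74} + 1$ for $\mathbb{F}_{2^{233}}$ (the paper may use a different, square-root friendly polynomial), a slow bit-serial `fmul`, and hypothetical names `build_table` and `multi_square`.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1  # assumed reduction polynomial z^233 + z^74 + 1


def fmul(a, b):
    """Bit-serial multiplication in F_2[z]/(f); operands must have degree < M."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r


def fsqr(a):
    return fmul(a, a)


def build_table(k):
    """T[j][n] = (n * z^(4j))^(2^k) for every 4-bit value n, built by linearity."""
    table = []
    for j in range((M + 3) // 4):
        basis = []
        for bit in range(4):
            e = 4 * j + bit
            v = (1 << e) if e < M else 0  # powers >= m never occur in a reduced element
            for _ in range(k):
                v = fsqr(v)
            basis.append(v)
        row = [0] * 16
        for n in range(1, 16):
            low = n & -n  # lowest set bit of n
            row[n] = row[n ^ low] ^ basis[low.bit_length() - 1]
        table.append(row)
    return table


def multi_square(a, table):
    """Compute a^(2^k) with one table lookup and one xor per 4-bit nibble of a."""
    r, j = 0, 0
    while a:
        r ^= table[j][a & 15]
        a >>= 4
        j += 1
    return r
```

The table depends only on $k$ and the field, so it is built once per required power; each evaluation then costs $\lceil m/4 \rceil$ lookups regardless of $k$.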
2.3 Inversion
Inversion modulo $f(z)$ can be implemented via the polynomial version of the Extended
Euclidean Algorithm (EEA), but the frequent branching and recurrent shifts by arbitrary
amounts present a performance obstacle for vectorized implementations, which makes it
difficult to write consistently fast EEA codes across different platforms. A branchless
approach can be implemented through Itoh-Tsujii inversion [23] by computing
$a^{-1} = (a^{2^{m-1}-1})^2$, as proposed in [18]. In contrast to the EEA method, the
Itoh-Tsujii approach has the additional merit of being similarly fast (relative to
multiplication) across common processors.
The overall cost of the method is $m-1$ squarings and a number of multiplications
dictated by the length of an addition chain for $m-1$. The cost of squarings can be
reduced by computing each required $2^i$-power as a multi-squaring [11]. The choice of
an addition chain allows the implementer to control the amount of required
multiplications and the precomputed storage for multi-squaring, since the number of
$2^i$-powers involved can be balanced.
Previous work obtained inversion-to-multiplication ratios between 22 and 41 by
implementing EEA in 64-bit mode [2], while the conventional ratios are between 5 and
10 [21,5]. While we cannot reach the small ratios with Itoh-Tsujii for the parameters
considered here, we can hope to do better than applying the method from [2] which
will give significantly larger ratios with the carry-less multiplier. Hence the cost of
squarings and multi-squarings should be minimized to the lowest possible allowed by
storage capacity.
To summarize, we use addition chains of 10, 10 and 11 steps for computing field
inversion over the fields $\mathbb{F}_{2^{233}}$, $\mathbb{F}_{2^{251}}$ and
$\mathbb{F}_{2^{409}}$, respectively.⁶ We extensively used the multi-squaring approach
described in the preceding section. For example, in the case of $\mathbb{F}_{2^{233}}$,
we selected the addition chain 1→2→3→6→7→14→28→29→58→116→232, and used 3 pre-computed
tables for computing the iterated squarings $a^{2^{29}}$, $a^{2^{58}}$ and $a^{2^{116}}$.
The rest of the field squaring operations were computed by executing consecutive
squarings. We recall that each table stores a total of $16\lceil m/4 \rceil$ field
elements.
⁶ In the case of inversion over $\mathbb{F}_{2^{409}}$, the minimal length addition chain
to reach $m-1 = 408$ has 10 steps. However, we preferred to use an 11-step chain to save
one look-up table.
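The addition chain quoted above for $\mathbb{F}_{2^{233}}$ can be exercised with a short sketch. It assumes the NIST trinomial $z^{233} + z^{74} + 1$ and a bit-serial `fmul`, and it performs every $2^j$-power by consecutive squarings instead of the multi-squaring tables; `CHAIN` and `finv` are illustrative names. Writing $\beta_k = a^{2^k - 1}$, each chain step uses $\beta_{i+j} = \beta_i^{2^j} \cdot \beta_j$.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1  # assumed reduction polynomial z^233 + z^74 + 1


def fmul(a, b):
    """Bit-serial multiplication in F_2[z]/(f); operands must have degree < M."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r


def fsqr(a):
    return fmul(a, a)


# Each pair (i, j) produces beta_{i+j} from beta_i and beta_j, following
# 1 -> 2 -> 3 -> 6 -> 7 -> 14 -> 28 -> 29 -> 58 -> 116 -> 232.
CHAIN = [(1, 1), (2, 1), (3, 3), (6, 1), (7, 7), (14, 14),
         (28, 1), (29, 29), (58, 58), (116, 116)]


def finv(a):
    """Itoh-Tsujii inversion: a^(-1) = (a^(2^(m-1) - 1))^2, for nonzero a."""
    beta = {1: a}
    for i, j in CHAIN:
        t = beta[i]
        for _ in range(j):  # t = beta_i^(2^j); a multi-squaring table would replace this loop
            t = fsqr(t)
        beta[i + j] = fmul(t, beta[j])
    return fsqr(beta[M - 1])
```

The chain costs 10 multiplications and 232 squarings in total; the three longest squaring runs (29, 58 and 116) are the ones the text replaces with table lookups.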
2.4 Half-trace
Half-trace plays a central role in point halving and its performance is essential if
halving is to be competitive against doubling. For an odd integer $m$, the half-trace
function $H: \mathbb{F}_{2^m} \to \mathbb{F}_{2^m}$ is defined by
$H(c) = \sum_{i=0}^{(m-1)/2} c^{2^{2i}}$ and satisfies the equation
$\lambda^2 + \lambda = c + \mathrm{Tr}(c)$ required for point halving. One efficient
desktop-targeted implementation of the half-trace is described in [3] and presented as
Algorithm 1, making extensive use of precomputations. This implementation is based on two
main steps: the elimination of even power coefficients and the accumulation of half-trace
precomputed values.
Step 5 in Algorithm 1, as shown in [21], consists in reducing the number of non-zero
coefficients of $c$ by removing the coefficients of even powers $i$ via
$H(z^i) = H(z^{i/2}) + z^{i/2} + \mathrm{Tr}(z^i)$. That will lead to memory and time
savings during the last step of the half-trace computation, the accumulation (step 6).
This is done by extraction of the odd and even bits and can benefit from vectorization in
the same way as square-root in [2].
However, in the case of half-trace there is a bottleneck caused by data dependencies.
For efficiency, the bank of 128-bit registers is used as much as possible, but at one point
in the algorithm execution the number of available bits to process decreases. For 64-bit
and 32-bit digits, the use of 128-bit registers is still beneficial, but for a smaller size, the
conventional approach (not vectorized) becomes again competitive.
Once step 5 is completed, the direction taken in [21] continues to focus on reducing
memory needs. However, another approach is followed in [3], which does not attempt to
minimize memory requirements but rather greedily strives to speed up the accumulation
part (step 6). Precomputation is extended so as to reduce the number of accesses to the
lookup table. The following values of the half-trace are stored:
$H(l_0 z^{8i+1} + l_1 z^{8i+3} + l_2 z^{8i+5} + l_3 z^{8i+7})$ for all $i \ge 0$ such that
$8i < m-3$ and $l_j \in \mathbb{F}_2$. The memory size in bytes taken by the
precomputations follows the formula $16 \times n_{64} \times 8 \times \lceil m/8 \rceil$.
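Algorithm 1 Solve $x^2 + x = c$
Input: $c = \sum_{i=0}^{m-1} c_i z^i \in \mathbb{F}_{2^m}$ where $m$ is odd and $\mathrm{Tr}(c) = 0$
Output: a solution $s$ of $x^2 + x = c$
1: compute $H(l_0 z^{8i+1} + l_1 z^{8i+3} + l_2 z^{8i+5} + l_3 z^{8i+7})$
   for $i \in I = \{0, \ldots, \lfloor (m-3)/8 \rfloor\}$ and $l_j \in \mathbb{F}_2$
2: $s \leftarrow 0$
3: for $i = (m-1)/2$ downto 1 do
4:   if $c_{2i} = 1$ then
5:     $c \leftarrow c + z^i$, $s \leftarrow s + z^i$
6: return $s \leftarrow s + \sum_{i \in I} \left( c_{8i+1} H(z^{8i+1}) + c_{8i+3} H(z^{8i+3}) + c_{8i+5} H(z^{8i+5}) + c_{8i+7} H(z^{8i+7}) \right)$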
While considering different organizations of the half-trace code, we made the fol-
lowing serendipitous observation: inserting as many xor operations as the data depen-
dencies permitted from the accumulation stage (step 6) into step 5 gave a substantial
speed-up of 20% to 25% compared with code written in the order as described in Al-
gorithm 1. Plausible explanations are compiler optimization and processor pipelining
characteristics. The result is a half-trace-to-multiplication ratio near 1, and this ratio can
be reduced if memory can be consumed more aggressively.
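The structure of Algorithm 1 can be checked with a small sketch. It is a simplification under stated assumptions: the field is $\mathbb{F}_{2^{233}}$ with the NIST trinomial $z^{233} + z^{74} + 1$, the precomputed table of step 6 is replaced by one direct evaluation of the half-trace on the remaining odd-power coefficients (valid by linearity of $H$), and the sketch clears each even coefficient explicitly as it is processed. Function names are illustrative.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1  # assumed reduction polynomial z^233 + z^74 + 1


def fmul(a, b):
    """Bit-serial multiplication in F_2[z]/(f); operands must have degree < M."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r


def fsqr(a):
    return fmul(a, a)


def trace(c):
    """Tr(c) = c + c^2 + ... + c^(2^(m-1)); the result is 0 or 1."""
    t, acc = c, c
    for _ in range(M - 1):
        t = fsqr(t)
        acc ^= t
    return acc


def half_trace(c):
    """H(c) = sum of c^(2^(2i)) for i = 0 .. (m-1)/2, straight from the definition."""
    h, t = c, c
    for _ in range((M - 1) // 2):
        t = fsqr(fsqr(t))
        h ^= t
    return h


def solve_quadratic(c):
    """Return s with s^2 + s = c, assuming Tr(c) = 0 (steps 2 to 6 of Algorithm 1)."""
    s = 0
    for i in range((M - 1) // 2, 0, -1):
        if (c >> (2 * i)) & 1:
            # H(z^(2i)) = H(z^i) + z^i + constant: move the coefficient from z^(2i) to z^i
            c ^= (1 << (2 * i)) | (1 << i)
            s ^= 1 << i
    # Only odd powers (and possibly z^0) remain; the paper accumulates table entries here.
    return s ^ half_trace(c)
```

The dropped trace constants only shift the result by an element of $\mathbb{F}_2$, and both $s$ and $s + 1$ solve the quadratic, so the returned value is still a valid solution.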
3 Random binary elliptic curves
Given a finite field $\mathbb{F}_q$ for $q = 2^m$, a non-supersingular elliptic curve
$E(\mathbb{F}_q)$ is defined to be the set of points
$(x, y) \in \mathbb{F}_q \times \mathbb{F}_q$ that satisfy the affine equation

$y^2 + xy = x^3 + ax^2 + b$, (1)

where $a$ and $0 \ne b \in \mathbb{F}_q$, together with the point at infinity denoted by
$\mathcal{O}$. It is known that $E(\mathbb{F}_q)$ forms an additive Abelian group with
respect to the elliptic point addition operation.
Let $k$ be a positive integer and $P$ a point on an elliptic curve. Then elliptic curve
scalar multiplication is the operation that computes the multiple $Q = kP$, defined as
the point resulting from adding $P$ to itself $k-1$ times. One of the most basic methods
for computing a scalar multiplication is based on a double-and-add variant of Horner's
rule. As the name suggests, the two most prominent building blocks of this method
are the point doubling and point addition primitives. By using the non-adjacent form
(NAF) representation of the scalar $k$, the addition-subtraction method computes a scalar
multiplication in about $m$ doubles and $m/3$ additions [21]. The method can be extended
to a width-$\omega$ NAF $k = \sum_{i=0}^{t-1} k_i 2^i$ where
$k_i \in \{0, \pm 1, \ldots, \pm(2^{\omega-1} - 1)\}$, $k_{t-1} \ne 0$, and at most one of
any $\omega$ consecutive digits is nonzero. The length $t$ is at most one larger than the
bitsize of $k$, and the density is approximately $1/(\omega + 1)$; for $\omega = 2$, this
is the same as NAF.
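The width-$\omega$ NAF recoding can be sketched with the textbook right-to-left procedure, in which each odd residue modulo $2^\omega$ is mapped to a signed odd digit. The function name `wnaf` and the little-endian digit order are choices made for this illustration, not taken from the paper.

```python
def wnaf(k, w):
    """Return the width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)           # residue modulo 2^w
            if d >= 1 << (w - 1):      # map to the signed range (-2^(w-1), 2^(w-1))
                d -= 1 << w
            k -= d                     # k is now divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def from_digits(digits):
    """Recombine little-endian signed digits into the integer they represent."""
    return sum(d * (1 << i) for i, d in enumerate(digits))
```

After a nonzero digit is emitted, the remaining value is divisible by $2^\omega$, which forces the next $\omega - 1$ digits to be zero and gives the density of roughly $1/(\omega + 1)$.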
3.1 Sequential algorithms for random binary curves
The traditional left-to-right double-and-add method is illustrated in Algorithm 2 where
$n = 0$ (that is, the computation corresponds to the left column) and the width-$\omega$
NAF $k = \sum_{i=0}^{t-1} k_i 2^i$ expression is computed from left to right, i.e., it
starts processing $k_{t-1}$ first, then $k_{t-2}$ until it ends with the coefficient
$k_0$. Step 1 computes $2^{\omega-2} - 1$ multiples of the point $P$. Based on the
Montgomery trick, authors in [13] suggested a method to precompute the affine points in
large-characteristic fields $\mathbb{F}_p$, employing only one inversion. Exporting that
approach to $\mathbb{F}_{2^m}$, we obtained formulae that offer a saving of 4
multiplications and 15 squarings for $\omega = 4$ when compared with a naive method that
would make use of the Montgomery trick in a trivial way (see Table 1 for a summary
of the computational effort associated to this phase).
For a given $\omega$, the evaluation stage of the algorithm has approximately
$m/(\omega + 1)$ point additions, and hence increasing $\omega$ has diminishing returns.
For the curves given by NIST [34] and with on-line precomputation, $\omega \le 6$ is
optimal in the sense that total point additions are minimized. In many cases, the recoding
in $\omega\mathrm{NAF}(k)$ is performed on-line and can be considered as part of the
precomputation step.
The most popular way to represent points in binary curves is López-Dahab projective
coordinates that yield an effective cost for a mixed point addition and point doubling
operation of about $8M + 5S \approx 9M$ and $4M + 5S \approx 5M$, respectively (see
Tables 2 and 3). Kim and Kim [26] report alternate formulas for point doubling requiring
four multiplications and five squarings, but two of the four multiplications are by the
constant $b$, and these have the same cost as general multiplication with the native
carry-less multiplier. For mixed addition, Kim and Kim require eight multiplications
but save two field reductions when compared with López-Dahab, giving their method
the edge. Hence, in this work we use López-Dahab for point doubling and Kim and
Kim for point addition.

Algorithm 2 Double-and-add, halve-and-add scalar multiplication: parallel
Input: $\omega$, scalar $k$, $P \in E(\mathbb{F}_{2^m})$ of odd order $r$, constant $n$ (e.g., from Table 1(b))
Output: $kP$
 1: Compute $P_i = iP$ for $i \in I = \{1, 3, \ldots, 2^{\omega-1} - 1\}$
 2: $Q_0 \leftarrow \mathcal{O}$
 3: Recode: $k' = 2^n k \bmod r$ and obtain rep $\omega\mathrm{NAF}(k')/2^n = \sum_{i=0}^{t} k'_i 2^{i-n}$
 4: Initialize $Q_i \leftarrow \mathcal{O}$ for $i \in I$
    {Barrier}
    (left column: double-and-add)
 5: for $i = t$ downto $n$ do
 6:   $Q_0 \leftarrow 2Q_0$
 7:   if $k'_i > 0$ then
 8:     $Q_0 \leftarrow Q_0 + P_{k'_i}$
 9:   else if $k'_i < 0$ then
10:     $Q_0 \leftarrow Q_0 - P_{-k'_i}$
    (right column: halve-and-add)
11: for $i = n-1$ downto 0 do
12:   $P \leftarrow P/2$
13:   if $k'_i > 0$ then
14:     $Q_{k'_i} \leftarrow Q_{k'_i} + P$
15:   else if $k'_i < 0$ then
16:     $Q_{-k'_i} \leftarrow Q_{-k'_i} - P$
    {Barrier}
17: return $Q \leftarrow Q_0 + \sum_{i \in I} i Q_i$
Right-to-left halve-and-add. Scalar multiplication based on point halving replaces
point doubling by a potentially faster halving operation that produces $Q$ from $P$ with
$P = 2Q$. The method was proposed independently by Knudsen [28] and Schroeppel
[35] for curves $y^2 + xy = x^3 + ax^2 + b$ over $\mathbb{F}_{2^m}$. The method is simpler
if the trace of $a$ is 1, and this is the only case we consider. The expensive
computations in halving are a field multiplication, solving a quadratic $z^2 + z = c$, and
finding a square root. On the NIST random curves studied in this work, we found that the
cost of halving is approximately $3M$, where $M$ denotes the cost of a field
multiplication.
Let the base point $P$ have odd order $r$, and let $t$ be the number of bits to represent
$r$. For $0 < n \le t$, let $\sum_{i=0}^{t} k'_i 2^i$ be given by the width-$\omega$ NAF of
$2^n k \bmod r$. Then
$k \equiv k'/2^n \equiv \sum_{i=0}^{t} k'_i 2^{i-n} \pmod{r}$ and the scalar
multiplication can be split as

$kP = (k'_t 2^{t-n} + \cdots + k'_n)P + (k'_{n-1} 2^{-1} + \cdots + k'_0 2^{-n})P$. (2)

When $n = t$, this gives the usual representation for point multiplication via halving,
illustrated in Algorithm 2 (that is, the computation is essentially the right column). The
cost for postcomputation appears in Table 1.
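The recoding behind equation (2) can be verified on integers alone, without any curve arithmetic. In the sketch below, the odd modulus `r` is a toy stand-in for the point order, `wnaf` is the textbook width-$\omega$ NAF recoding, and `split_scalar` and `recombine` are hypothetical helper names.

```python
def wnaf(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)
            if d >= 1 << (w - 1):
                d -= 1 << w
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def split_scalar(k, r, n, w):
    """Recode k' = 2^n k mod r and split its digits at position n, as in equation (2)."""
    k_prime = (k << n) % r
    digits = wnaf(k_prime, w)
    high = digits[n:]   # k'_n .. k'_t, consumed by the double-and-add side
    low = digits[:n]    # k'_0 .. k'_{n-1}, consumed by the halve-and-add side
    return high, low


def recombine(high, low, r, n):
    """Evaluate sum of k'_i 2^(i-n) modulo r; should give back k mod r."""
    half = (r + 1) // 2  # inverse of 2 modulo the odd number r
    hi_val = sum(d * (1 << i) for i, d in enumerate(high))
    lo_val = sum(d * pow(half, n - i, r) for i, d in enumerate(low))
    return (hi_val + lo_val) % r
```

The digits in `low` carry negative powers of two, which is what lets the second processor work with repeated halvings of $P$ while the first one doubles.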
3.2 Parallel scalar multiplication on random binary curves
For parallelization, choose $n < t$ in (2) and process the first portion by a
double-and-add method and the second portion by a method based on halve-and-add.
Algorithm 2 illustrates a parallel approach suitable for two processors. Recommended
values for $n$ to balance cost between processors appear in Table 1.
Table 1. Costs and parameter recommendations for $\omega \in \{3, 4, 5\}$.

(a) Pre- and post-computation costs.

      |        Algorithm 2          | [21, Alg 3.70]  | [21, Alg 3.70]′
  ω   | Precomp        Postcomp     | Precomp         | Postcomp
  3   | 14M, 11S, I    43M, 26S     | 2M, 3S, I       | 26M, 13S
  4   | 38M, 15S, I    116M, 79S    | 9M, 9S, I       | 79M, 45S
  5   | N/A            N/A          | 23M, 19S, 2I    | 200M, 117S

(b) Recommended value for $n$.

      |   Algorithm 2    |   Algorithm 3
  ω   | B-233    B-409   | K-233    K-409
  3   | 128      242     | 131      207
  4   | 132      240     | 135      210
  5   | N/A      N/A     | 136      213
3.3 Side-channel resistant multiplication on random binary curves
Another approach for scalar multiplication offering some resistance to side-channel
attacks was proposed by López and Dahab [31] based on the Montgomery laddering
technique. This approach requires $6M + 5S$ in $\mathbb{F}_{2^m}$ per iteration
independently of the bit pattern in the scalar, and one of these multiplications is by the
curve coefficient $b$. This work targets the Weierstraß curve CURVE2251:
$y^2 + xy = x^3 + (z^{13} + z^9 + z^8 + z^7 + z^2 + z + 1)$, at a security level similar to
curves used lately for benchmarking purposes [7]. It is clear that this curve is
especially tailored for this method due to the short length of $b$, reducing the cost of
the algorithm to approximately $5.25M + 5S$ per iteration. At the same time,
halving-based approaches are non-optimal for this curve due to the penalties introduced by
the 4-cofactor [27]. Considering this and to partially satisfy the side-channel
resistance offered by a bitsliced implementation such as [6], we restricted the choices
of scalar multiplication at this security level to the Montgomery laddering approach.
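The $6M + 5S$ ladder step can be illustrated with the standard x-only López-Dahab formulas. The sketch makes several assumptions that are not the paper's setting: it works in $\mathbb{F}_{2^{233}}$ with the NIST trinomial rather than the 251-bit field, takes $a = 0$, derives $b$ from an arbitrary point so that the point lies on the curve, uses data-dependent Python branches (so it is not constant time), and ignores the point at infinity. The affine routines exist only to cross-check the ladder.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1  # assumed reduction polynomial z^233 + z^74 + 1


def fmul(a, b):
    """Bit-serial multiplication in F_2[z]/(f); operands must have degree < M."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r


def fsqr(a):
    return fmul(a, a)


def finv(a):
    """a^(2^m - 2) by plain square-and-multiply (slow, for checking only)."""
    r = a
    for _ in range(M - 2):
        r = fmul(fsqr(r), a)
    return fsqr(r)


def madd(X1, Z1, X2, Z2, x):
    """x-only differential addition, 4M + 1S; x is the affine x of the difference P."""
    A, B = fmul(X1, Z2), fmul(X2, Z1)
    Z3 = fsqr(A ^ B)
    return fmul(x, Z3) ^ fmul(A, B), Z3


def mdouble(X, Z, b):
    """x-only doubling, 2M + 4S; one multiplication is by the curve coefficient b."""
    X2, Z2 = fsqr(X), fsqr(Z)
    return fsqr(X2) ^ fmul(b, fsqr(Z2)), fmul(X2, Z2)


def ladder_x(k, x, b):
    """Affine x-coordinate of kP for k >= 1, keeping the pair (jP, (j+1)P)."""
    X1, Z1 = x, 1
    X2, Z2 = fsqr(fsqr(x)) ^ b, fsqr(x)
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            X1, Z1 = madd(X1, Z1, X2, Z2, x)
            X2, Z2 = mdouble(X2, Z2, b)
        else:
            X2, Z2 = madd(X1, Z1, X2, Z2, x)
            X1, Z1 = mdouble(X1, Z1, b)
    return fmul(X1, finv(Z1))


def affine_double(P):
    x, y = P
    lam = x ^ fmul(y, finv(x))
    x3 = fsqr(lam) ^ lam                      # a = 0
    return x3, fsqr(x) ^ fmul(lam, x3) ^ x3


def affine_add(P, Q):
    (x1, y1), (x2, y2) = P, Q
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))
    x3 = fsqr(lam) ^ lam ^ x1 ^ x2            # a = 0
    return x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1


def affine_mul(k, P):
    """Left-to-right double-and-add in affine coordinates, for small k only."""
    Q = P
    for i in range(k.bit_length() - 2, -1, -1):
        Q = affine_double(Q)
        if (k >> i) & 1:
            Q = affine_add(Q, P)
    return Q
```

Every iteration performs one `madd` and one `mdouble` whatever the scalar bit is, which is the regularity the text refers to; a hardened implementation would replace the branch with a constant-time conditional swap.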
4 Koblitz elliptic curves
A Koblitz curve $E_a(\mathbb{F}_q)$, also known as an Anomalous Binary Curve [29], is a
special case of (1) where $b = 1$ and $a \in \{0, 1\}$. In a binary field, the map taking
$x$ to $x^2$ is an automorphism known as the Frobenius map. Since Koblitz curves are
defined over the binary field $\mathbb{F}_2$, the Frobenius map and its inverse naturally
extend to automorphisms of the curve denoted $\tau$ and $\tau^{-1}$, respectively, where
$\tau(x, y) = (x^2, y^2)$. Moreover, $(x^4, y^4) + 2(x, y) = \mu(x^2, y^2)$ for every
$(x, y)$ on $E_a$, where $\mu = (-1)^{1-a}$; that is, $\tau$ satisfies
$\tau^2 + 2 = \mu\tau$ and we can associate $\tau$ with the complex number
$\tau = (\mu + \sqrt{-7})/2$.
Solinas [36] presents a $\tau$-adic analogue of the usual NAF as follows. Since short
representations are desirable, an element $\rho \in \mathbb{Z}[\tau]$ is found with
$\rho \equiv k \pmod{\delta}$ of as small norm as possible, where
$\delta = (\tau^m - 1)/(\tau - 1)$. Then for the subgroup of interest, $kP = \rho P$ and a
width-$\omega$ $\tau$-adic NAF ($\omega\tau$NAF) for $\rho$ is obtained in a fashion that
parallels the usual $\omega$NAF. As in [36], define $\alpha_i = i \bmod \tau^\omega$ for
$i \in \{1, 3, \ldots, 2^{\omega-1} - 1\}$. A $\omega\tau$NAF of a nonzero element $\rho$
is an expression $\rho = \sum_{i=0}^{l-1} u_i \tau^i$ where each
$u_i \in \{0, \pm\alpha_1, \pm\alpha_3, \ldots, \pm\alpha_{2^{\omega-1}-1}\}$,
$u_{l-1} \ne 0$, and at most one of any consecutive $\omega$ coefficients is nonzero.
Scalar multiplication $kP$ can be performed with the $\omega\tau$NAF expansion of $\rho$ as

$u_{l-1}\tau^{l-1}P + \cdots + u_2\tau^2 P + u_1\tau P + u_0 P$ (3)

with $l-1$ applications of $\tau$ and approximately $l/(\omega + 1)$ additions.
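The basic $\tau$-adic NAF (the width-2 case, with digits in $\{0, \pm 1\}$) can be sketched with the textbook recoding for an element $r_0 + r_1\tau$ of $\mathbb{Z}[\tau]$. This is not the width-$\omega$ variant used in the paper, and it skips the partial reduction modulo $\delta$; `tnaf` and `evaluate` are illustrative names. Both routines rely only on $\tau^2 = \mu\tau - 2$.

```python
def tnaf(r0, r1, mu):
    """tau-adic NAF digits of r0 + r1*tau (least significant first), mu in {1, -1}."""
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:
            u = 2 - ((r0 - 2 * r1) % 4)   # u in {1, -1}, chosen so the next digit is 0
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # Divide by tau using 1/tau = (mu - tau)/2; r0 is even at this point.
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits


def evaluate(digits, mu):
    """Horner evaluation of sum u_i tau^i, returned as (a, b) meaning a + b*tau."""
    a, b = 0, 0
    for u in reversed(digits):
        a, b = -2 * b, a + mu * b         # multiply by tau: tau^2 = mu*tau - 2
        a += u
    return a, b
```

In a scalar multiplication, each digit position costs one application of $\tau$ (two squarings per affine point) instead of a point doubling, which is where Koblitz curves gain their speed.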
The length of the representation is at most $m + a$, and Solinas presents an efficient
technique to find an estimate for $\rho$, denoted $\rho' = k \text{ partmod } \delta$ with
$\rho' \equiv \rho \pmod{\delta}$, having expansion of length at most $m + a + 3$ [36,9].
Under reasonable assumptions, the algorithm will usually produce an estimate giving length
at most $m + 1$. For simplicity, we will assume that the recodings obtained have this as
an upper bound on length; small adjustments are necessary to process longer
representations. Under these assumptions and properties of $\tau$, scalars may be written
$k = \sum_{i=0}^{m} u_i \tau^i = \sum_{i=0}^{m} u_i \tau^{-(m-i)}$ since
$\tau^{-i} = \tau^{m-i}$ for all $i$.
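The identity $\tau^{-i} = \tau^{m-i}$ reflects the fact that squaring $m$ times is the identity on $\mathbb{F}_{2^m}$. A coordinate-level check, assuming $\mathbb{F}_{2^{233}}$ with the NIST trinomial and computing the inverse map naively as $m - 1$ squarings (a real implementation would use a dedicated square-root routine), might look like this; the function names are illustrative.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1  # assumed reduction polynomial z^233 + z^74 + 1


def fmul(a, b):
    """Bit-serial multiplication in F_2[z]/(f); operands must have degree < M."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r


def fsqr(a):
    return fmul(a, a)


def frobenius(x, times=1):
    """Apply x -> x^2 the given number of times (tau on one coordinate)."""
    for _ in range(times):
        x = fsqr(x)
    return x


def frobenius_inverse(x, times=1):
    """Apply the square-root map; each square root is computed as m-1 squarings."""
    return frobenius(x, times * (M - 1))
```

Because the same map is applied to both coordinates of a point, the identity carries over from field elements to curve points, which is what allows part of the expansion to be rewritten with $\tau^{-1}$.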
4.1 Sequential algorithms for Koblitz curves
A traditional left-to-right $\tau$-and-add method for (3) appears as [21, Alg 3.70], and
is essentially the left-hand portion of Algorithm 3. Precomputation consists of
$2^{\omega-2} - 1$ multiples of the point $P$, each at a cost of approximately one point
addition (see Table 1 for a summary of the computational effort associated to this phase).
Alternatively, we can process bits right-to-left and obtain a variant we shall denote
as [21, Alg 3.70]′ (an analogue of [21, Alg 3.91]). The multiple points of precomputation
$P_u$ are exchanged for the same number of accumulators $Q_u$ along with postcomputation
of form $\sum \alpha_u Q_u$. The cost of postcomputation is likely more than the
precomputation of the left-to-right variant; see Table 1 for a summary in the case where
postcomputation uses projective additions. However, if the accumulator in Algorithm 3
is in projective coordinates, then the right-to-left variant has a less expensive
evaluation phase since $\tau$ is applied to points in affine coordinates.
4.2 Parallel algorithm for Koblitz curves
The basic strategy in our parallel algorithm is to reformulate the scalar multiplication
in terms of both the $\tau$ and the $\tau^{-1}$ operators as
$k = \sum_{i=0}^{m} u_i\tau^i = u_0 + u_1\tau^1 + \cdots + u_n\tau^n + u_{n+1}\tau^{-(m-n-1)} + \cdots + u_m = \sum_{i=0}^{n} u_i\tau^i + \sum_{i=n+1}^{m} u_i\tau^{-(m-i)}$
where $0 < n < m$. Algorithm 3 illustrates a parallel approach suitable for two
processors. Although similar in structure to Algorithm 2, a significant difference is the
shared precomputation rather than the pre and postcomputation required in Algorithm 2.
The scalar representation is given by Solinas [36] and hence has an expected
$m/(\omega + 1)$ point additions in the evaluation-stage, and an extra point addition at
the end. There are also approximately $m$ applications of $\tau$ or its inverse. If the
field representation is such that these operators have similar cost or are sufficiently
inexpensive relative to field multiplication, then the evaluation stage can be a factor 2
faster than a corresponding non-parallel algorithm.
As discussed before, unlike the ordinary width-ωNAF, the τ-adic version requires
a relatively expensive calculation to find a short ρwith ρ≡k(mod δ). Hence, (a por-
tion of) the precomputation is “free” in the sense that it occurs during scalar recoding.
This can encourage the use of a larger window size ω. The essential features exploited
by Algorithm 3 are that the scalar can be efficiently represented in terms of the Frobe-
nius map and that the map and its inverse can be efficiently computed, and hence the
algorithm adapts to curves defined over small fields.
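Algorithm 3 $\omega\tau$NAF scalar multiplication: parallel
Input: $\omega$, $k \in [1, r-1]$, $P \in E_a(\mathbb{F}_{2^m})$ of order $r$, constant $n$ (e.g., from Table 1(b))
Output: $kP$
 1: $\rho \leftarrow k \text{ partmod } \delta$
 2: $\sum_{i=0}^{l-1} u_i\tau^i \leftarrow \omega\tau\mathrm{NAF}(\rho)$
 3: $P_u = \alpha_u P$, for $u \in \{1, 3, 5, \ldots, 2^{\omega-1} - 1\}$
    {Barrier}
    (left column)
 4: $Q_0 \leftarrow \mathcal{O}$
 5: for $i = n$ downto 0 do
 6:   $Q_0 \leftarrow \tau Q_0$
 7:   if $u_i = \alpha_j$ then
 8:     $Q_0 \leftarrow Q_0 + P_j$
 9:   else if $u_i = -\alpha_j$ then
10:     $Q_0 \leftarrow Q_0 - P_j$
    (right column)
11: $Q_1 \leftarrow \mathcal{O}$
12: for $i = n+1$ to $m$ do
13:   $Q_1 \leftarrow \tau^{-1} Q_1$
14:   if $u_i = \alpha_j$ then
15:     $Q_1 \leftarrow Q_1 + P_j$
16:   else if $u_i = -\alpha_j$ then
17:     $Q_1 \leftarrow Q_1 - P_j$
    {Barrier}
18: return $Q \leftarrow Q_0 + Q_1$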
Algorithm 3 is attractive in the sense that two processors are directly supported
without "extra" computations. However, if multiple applications of the "doubling step"
are sufficiently inexpensive, then more processors and additional curves can be
accommodated in a straightforward fashion without sacrificing the high-level parallelism
of Algorithm 3. As an example for Koblitz curves, a variant on Algorithm 3 discards the
applications of $\tau^{-1}$ (which may be more expensive than $\tau$) and finds
$kP = k_1(\tau^j P) + k_0 P = \tau^j(k_1 P) + k_0 P$ for suitable $k_i$ and
$j \approx m/2$ with traditional methods to calculate $k_i P$. The application of $\tau^j$
is low cost if there is storage for a per-field matrix as it was first discussed in [1].
5 Experimental results
We consider example fields $\mathbb{F}_{2^m}$ for $m \in \{233, 251, 409\}$. These were
chosen to address 112-bit and 192-bit security levels, according to the NIST
recommendation, and the 251-bit binary Edwards elliptic curve presented in [6]. The field
$\mathbb{F}_{2^{233}}$ was also chosen as more likely to expose any overhead penalty in
the parallelization compared with larger fields from NIST. Our library implements all the
algorithms in C and was compiled with the GNU C 4.6 (GCC) and Intel 12 (ICC) compilers,
and the timings were obtained on a 3.326 GHz 32nm Intel Westmere processor i5 660.
Obtaining times useful for comparison across similar systems can be problematic.
Intel, for example, introduced “Pentium 4” processors that were fundamentally different
from earlier designs with the same name. The common method via time stamp counter
(TSC) requires care on recent processors having “turbo” modes that increase the clock
(on perhaps 1 of 2 cores) over the nominal clock implicit in TSC, giving an underestimate
of actual cycles consumed. Benchmarking guidelines on eBACS [7], for example,
recommend disabling such modes, and this is the method followed in this paper.
Table 2. Timings in clock cycles for field arithmetic operations. “op/M” denotes ratio to multiplication obtained from ICC.

Base field            F_{2^233}                  F_{2^251}                  F_{2^409}
operation             GCC      ICC      op/M     GCC      ICC      op/M     GCC      ICC      op/M
Multiplication        128      128      1.00     161      159      1.00     345      348      1.00
López-Dahab Mult.     256      367      2.87     338      429      2.70     637      761      2.19
Square root           67       60       0.47     155      144      0.91     59       56       0.16
Squaring              30       35       0.27     56       59       0.37     44       49       0.14
Half trace            167      150      1.17     219      212      1.33     322      320      0.92
Multi-Squaring        191      184      1.44     195      209      1.31     460      475      1.36
Inversion             2,951    2,914    22.77    3,710    3,878    24.39    9,241    9,350    26.87
4-τNAF                9,074    11,249   87.88    -        -        -        23,783   26,633   76.53
3-NAF                 5,088    5,059    39.52    -        -        -        13,329   14,373   41.30
4-NAF                 4,280    4,198    32.80    -        -        -        11,406   12,128   34.85
Recoding (halving)    1,543    1,509    11.79    -        -        -        3,382    3,087    8.87
Recoding (parallel)   999      1,043    8.15     -        -        -        2,272    2,188    6.29

Table 3. Timings in clock cycles for curve arithmetic operations. “op/M” denotes ratio to multiplication obtained from ICC.

Elliptic curve          B-233                      B-409
operations              GCC      ICC      op/M     GCC      ICC      op/M
Doubling (LD)           690      710      5.55     1,641    1,655    4.76
Addition (KIM Mixed)    1,194    1,171    9.15     2,987    3,000    8.62
Addition (LD Mixed)     1,243    1,233    9.63     3,072    3,079    8.85
Addition (LD General)   1,954    1,961    15.32    4,893    4,922    14.14
Halving                 439      417      3.26     894      878      2.52

Timings for field arithmetic appear in Table 2. The López-Dahab multiplier described
in [2] was implemented as a baseline to quantify the speedup due to the native
multiplier. For the most part, timings for GCC and ICC are similar, although López-Dahab
multiplication is an exception. The difference in multiplication times between
F_{2^233} = F_2[z]/(z^233 + z^74 + 1) and F_{2^251} = F_2[z]/(z^251 + z^7 + z^4 + z^2 + 1) is in reduction.
The relatively expensive square root in F_{2^251} is due to the representation chosen;
if square roots are of interest, then there are reduction polynomials giving faster square
root and similar numbers for other operations. Inversion via exponentiation (§2) gives
I/M similar to that in [2] where a Euclidean algorithm variant was used with similar
hardware but without the carry-less multiplier.
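To make the role of reduction concrete, the following minimal sketch multiplies two elements of F_{2^233} = F_2[z]/(z^233 + z^74 + 1), with Python integers used as bit vectors of polynomial coefficients. It is an illustration only, not the library code timed in Table 2: clmul is a bit-serial stand-in for what the carry-less multiplication instruction computes on machine words, and the names clmul, reduce233 and fmul are hypothetical.

```python
M = 233  # extension degree of the field F_{2^233}


def clmul(a, b):
    """Carry-less product of two polynomials over GF(2) encoded as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a       # addition in GF(2)[z] is XOR, so no carries occur
        a <<= 1
        b >>= 1
    return r


def reduce233(c):
    """Reduce modulo z^233 + z^74 + 1 using z^233 = z^74 + 1."""
    mask = (1 << M) - 1
    while c >> M:
        hi = c >> M                       # coefficients of degree >= 233
        c = (c & mask) ^ (hi << 74) ^ hi  # fold them back with the trinomial
    return c


def fmul(a, b):
    """Field multiplication: carry-less product followed by reduction."""
    return reduce233(clmul(a, b))
```

For example, fmul(1 << 232, 2) multiplies z^232 by z and returns the encoding of z^74 + 1. With a trinomial, each folding step costs two shifted XORs, whereas the pentanomial used for F_{2^251} needs four, which is consistent with the remark that the difference between the two fields lies in reduction.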
Table 4 shows timings obtained for different variants of sequential and parallel
scalar multiplication. We observe that for ωNAF recoding with ω = 3, 4, the halve-and-add
algorithm is always faster than its double-and-add counterpart. This performance
is a direct consequence of the timings reported in Table 3, where the cost of one point
doubling is roughly 5.5 and 4.8 multiplications whereas the cost of a point halving is
only 3.3 and 2.5 multiplications in the fields F_{2^233} and F_{2^409}, respectively. The parallel
version that concurrently executes these algorithms in two threads computes one scalar
multiplication with a latency that is roughly 37.7% and 37.0% smaller than that of the
halve-and-add algorithm for the curves B-233 and B-409, respectively.
Table 4. Timings in 10^3 clock cycles for scalar multiplication in the unknown-point scenario.

    Scalar mult               B-233           B-409
ω   random curves             GCC     ICC     GCC     ICC
3   Double-and-add            240     238     984     989
3   Halve-and-add             196     192     755     756
3   (Dbl,Halve)-and-add       122     118     465     466
4   Double-and-add            231     229     941     944
4   Halve-and-add             188     182     706     705
4   (Dbl,Halve)-and-add       122     116     444     445

    Side-channel resistant    CURVE2251
    scalar multiplication     GCC     ICC
    Montgomery laddering      296     282

    Scalar mult               K-233           K-409
ω   Koblitz curves            GCC     ICC     GCC     ICC
3   [21, Alg 3.70]            111     110     413     416
3   [21, Alg 3.70]′           98      98      381     389
3   (τ,τ)-and-add             73      74      248     248
3   Alg. 3                    80      78      253     248
4   [21, Alg 3.70]            97      95      353     355
4   [21, Alg 3.70]′           90      89      332     339
4   (τ,τ)-and-add             68      65      216     214
4   Alg. 3                    73      69      218     214
5   [21, Alg 3.70]            92      90      326     328
5   [21, Alg 3.70]′           95      93      321     332
5   (τ,τ)-and-add             63      58      197     191
5   Alg. 3                    68      63      197     194

The bold entries for Koblitz curves identify fastest timings per category (i.e., considering
the compiler, curve, and the specific value of ω used in the ωNAF recoding).
For smaller ω, [21, Alg 3.70]′ has an edge over [21, Alg 3.70] because τ is applied to
points in affine coordinates; this advantage diminishes with increasing ω due to post-computation
cost. “(τ,τ)-and-add” denotes the parallel variant described in §4.2. There
is a storage penalty for a linear map, but applications of τ^{−1} are eliminated (of interest
when τ is significantly less expensive). Given the modest cost of the multi-squaring operation
(with an equivalent cost of less than 1.44 field multiplications, see Table 2), the
(τ,τ)-and-add parallel variant is usually faster than Algorithm 3. When using ω = 5,
the parallel (τ,τ)-and-add algorithm computes one scalar multiplication with a latency
that is roughly 35.5% and 40.5% smaller than that of the best sequential algorithm for
the curves K-233 and K-409, respectively.
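The latency reductions quoted for ω = 5 can be recomputed from the Table 4 entries. The short sketch below assumes that the comparison is between the fastest (τ,τ)-and-add entry and the fastest sequential entry for each curve, regardless of compiler; that reading is an assumption, and the function and variable names are hypothetical.

```python
def latency_reduction(sequential, parallel):
    """Relative latency reduction of the parallel timing, in percent."""
    return 100.0 * (sequential - parallel) / sequential


# Table 4, omega = 5, timings in 10^3 clock cycles (GCC and ICC entries).
k233_sequential = [92, 90, 95, 93]     # [21, Alg 3.70] and [21, Alg 3.70]'
k233_parallel = [63, 58]               # (tau,tau)-and-add
k409_sequential = [326, 328, 321, 332]
k409_parallel = [197, 191]

k233_gain = latency_reduction(min(k233_sequential), min(k233_parallel))
k409_gain = latency_reduction(min(k409_sequential), min(k409_parallel))
print(f"K-233: {k233_gain:.1f}%  K-409: {k409_gain:.1f}%")
```

Under this reading the results agree with the quoted figures to within rounding.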
Per-field storage and coding techniques compute half-trace at cost comparable to
field multiplication, and methods based on halving continue to be fastest for suitable
random curves. However, the hardware multiplier and squaring (via shuffle) give a
factor 2 advantage to Koblitz curves in the examples from NIST. This is larger than
in [16,21], where a 32-bit processor in the same general family as the i5 has half-trace
at approximately half the cost of a field multiplication for B-233 and a factor 1.7 advantage
to K-163 over B-163 (and the factor would have been smaller for K-233 and
B-233). It is worth remarking that the parallel scalar multiplication versions shown in
Table 4 look best for bigger curves and larger ω.
6 Conclusion and future work
In this work we achieve the fastest timings reported in the open literature for software
computation of scalar multiplication in NIST and binary elliptic curves defined at the
112-bit, 128-bit and 192-bit security levels. The fastest curve implemented, namely
NIST K-233, can compute one scalar multiplication in less than 17.5 µs, a result that
is not only much faster than previous software implementations of that curve, but is
also quite competitive with the computation time achieved by state-of-the-art hardware
accelerators working on similar or smaller curves [24,1].
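The wall-clock figure follows from the cycle counts and the clock rate given in Section 5. The conversion below assumes that the quoted time corresponds to the fastest K-233 entry of Table 4 (58 × 10^3 cycles, ICC, ω = 5) at the nominal 3.326 GHz clock; the helper name is hypothetical.

```python
CLOCK_HZ = 3.326e9  # clock rate of the Intel i5 660 reported in Section 5


def cycles_to_microseconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6


# Assumed source of the quoted figure: Table 4, K-233, omega = 5, ICC.
fastest_k233_cycles = 58e3
fastest_k233_us = cycles_to_microseconds(fastest_k233_cycles)
print(f"{fastest_k233_us:.2f} microseconds")
```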
These fast timings were obtained by using the native carry-less multiplier
available in the newest Intel processors. At the same time, we strove to use the best
algorithmic techniques, and the most efficient elliptic curve and finite field arithmetic
formulae. Further, we proposed effective parallel formulations of scalar multiplication
algorithms suitable for deployment in multi-core platforms.
The curves over binary fields permit relatively elegant parallelization with low synchronization
cost, mainly due to the efficient halving or τ^{−1} operations. Parallelizing at
lower levels in the arithmetic would be desirable, especially for curves over prime fields.
Grabher et al. [17] apply parallelization for extension field multiplication, but times for
a base field multiplication in a 256-bit prime field are relatively slow compared with
Beuchat et al. [8]. On the other hand, a strategy that applies to all curves performs
point doubles in one thread and point additions in another. The doubling thread stores
intermediate values corresponding to nonzero digits of the NAF; the addition thread
processes these points as they become available. Experimentally, synchronization cost
is low, but so is the expected acceleration. Against the fastest times in Longa and Gebo-
tys [30] for a curve over a 256-bit prime field, the technique would offer roughly 17%
improvement, a disappointing return on processor investment.
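The doubling-thread and addition-thread strategy can be sketched as a producer and a consumer connected by a queue. The code below is a structural illustration under stated assumptions, not the implementation behind the 17% estimate: the group operations are passed in as functions, the toy group used for checking is the integers modulo a small prime, and all names (naf, scalar_mul_two_threads, TOY_ORDER and the toy_* helpers) are hypothetical.

```python
import queue
import threading


def naf(k):
    """Non-adjacent form of k > 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mul_two_threads(k, P, double, add, neg, identity):
    """Compute kP with doublings in one thread and additions in another."""
    digits = naf(k)
    ready = queue.Queue()     # points handed from the doubling thread
    result = []

    def doubling_thread():
        R = P
        for i, d in enumerate(digits):
            if d != 0:
                ready.put((d, R))          # R = 2^i P for a nonzero digit
            if i + 1 < len(digits):
                R = double(R)
        ready.put(None)                    # sentinel: no more points

    def addition_thread():
        Q = identity
        while True:
            item = ready.get()             # blocks until a point is available
            if item is None:
                break
            d, R = item
            Q = add(Q, R if d > 0 else neg(R))
        result.append(Q)

    threads = [threading.Thread(target=doubling_thread),
               threading.Thread(target=addition_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


# Toy group for checking only (assumption): integers modulo a small prime.
TOY_ORDER = 1009
def toy_double(x): return (2 * x) % TOY_ORDER
def toy_add(x, y): return (x + y) % TOY_ORDER
def toy_neg(x): return (-x) % TOY_ORDER
```

The queue is the only synchronization point. Since roughly one NAF digit in three is nonzero, the addition thread is idle much of the time, which is consistent with the modest expected acceleration noted above.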
The new native support for binary field multiplication allowed our implementation
to improve by 10% the previous speed record for side-channel resistant scalar multipli-
cation in random elliptic curves. It is hard to predict what will be the superior strategy
between a conventional non-bitsliced or a bitsliced implementation on future revisions
of the target platform: the latency of the carry-less multiplier instruction has clear room
for improvement, while the new AVX instruction set has 256-bit registers. An issue with
the current Sandy Bridge version of AVX is that xor throughput for operations with
register operands was decreased significantly from 3 operations per cycle in SSE to 1
operation per cycle in AVX. The resulting performance of a bitsliced implementation
will ultimately rely on the amount of work which can be scheduled to be done mostly
in registers.
Acknowledgments We wish to thank the University of Waterloo and especially Professor
Alfred Menezes for useful discussions related to this work during a visit by three
of the authors, where the idea of this project was discussed, planned and a portion of the
development phase was done. Diego F. Aranha and Julio López thank CNPq, CAPES
and FAPESP for financial support.
References
1. O. Ahmadi, D. Hankerson, and F. Rodríguez-Henríquez. Parallel formulations of scalar
multiplication on Koblitz curves. J. UCS, 14(3):481–504, 2008.
2. D. F. Aranha, J. López, and D. Hankerson. Efficient software implementation of binary field
arithmetic using vector instruction sets. In M. Abdalla and P. S. L. M. Barreto, editors,
The First International Conference on Cryptology and Information Security (LATINCRYPT
2010), volume 6212 of Lecture Notes in Computer Science, pages 144–161, 2010.
3. R. M. Avanzi. Another look at square roots (and other less common operations) in fields of
even characteristic. In C. M. Adams, A. Miri, and M. J. Wiener, editors, 14th International
Workshop on Selected Areas in Cryptography (SAC 2007), volume 4876 of Lecture Notes in
Computer Science, pages 138–154. Springer, 2007.
4. M. Bellare, editor. Advances in Cryptology—CRYPTO 2000, volume 1880 of Lecture Notes
in Computer Science. 20th Annual International Cryptology Conference, Santa Barbara, Cal-
ifornia, August 2000, Springer-Verlag, 2000.
5. D. Bernstein and T. Lange. Analysis and optimization of elliptic-curve single-scalar mul-
tiplication. In Proceedings 8th International Conference on Finite Fields and Applications
(Fq8), volume 461, pages 1–20. AMS, 2008.
6. D. J. Bernstein. Batch binary Edwards. In S. Halevi, editor, Advances in Cryptology—
CRYPTO 2009, volume 5677 of Lecture Notes in Computer Science, pages 317–336. 29th
Annual International Cryptology Conference, Santa Barbara, CA, USA, August 16–20,
2009, Springer, 2009.
7. D. J. Bernstein and T. Lange, editors. eBACS: ECRYPT Benchmarking of Cryptographic
Systems. http://bench.cr.yp.to, accessed 30 Mar 2011.
8. J.-L. Beuchat, J. Díaz, S. Mitsunari, E. Okamoto, F. Rodríguez-Henríquez, and T. Teruya.
High-speed software implementation of the optimal ate pairing over Barreto-Naehrig curves.
In M. Joye, A. Miyaji, and A. Otsuka, editors, Pairing-Based Cryptography – Pairing 2010,
volume 6487 of Lecture Notes in Computer Science, pages 21–39, 2010.
9. I. F. Blake, V. K. Murty, and G. Xu. A note on window τ-NAF algorithm. Inf. Process. Lett.,
95(5):496–502, 2005.
10. M. Bodrato. Towards optimal Toom-Cook multiplication for univariate and multivariate
polynomials in characteristic 2 and 0. In C. Carlet and B. Sunar, editors, Arithmetic of Finite
Fields (WAIFI 2007), volume 4547 of Lecture Notes in Computer Science, pages 116–133.
Springer, 2007.
11. J. W. Bos, T. Kleinjung, R. Niederhagen, and P. Schwabe. ECC2K-130 on Cell CPUs. In
D. J. Bernstein and T. Lange, editors, 3rd International Conference on Cryptology in Africa
(AFRICACRYPT 2010), volume 6055 of Lecture Notes in Computer Science, pages 225–242.
Springer, 2010.
12. P. G. Comba. Exponentiation Cryptosystems on the IBM PC. IBM Systems Journal,
29(4):526–538, 1990.
13. E. Dahmen, K. Okeya, and D. Schepers. Affine precomputation with sole inversion in el-
liptic curve cryptography. In J. Pieprzyk, H. Ghodosi, and E. Dawson, editors, Information
Security and Privacy (ACISP 2007), volume 4586 of Lecture Notes in Computer Science,
pages 245–258. Springer-Verlag, 2007.
14. N. Firasta, M. Buxton, P. Jinbo, K. Nasri, and S. Kuo. Intel AVX: New frontiers in perfor-
mance improvement and energy efficiency. White paper. http://software.intel.com/.
15. A. Fog. Instruction tables: List of instruction latencies, throughputs and micro-operation
breakdowns for Intel, AMD and VIA CPUs. http://www.agner.org/optimize/instruction_tables.pdf,
accessed 01 Mar 2011.
16. K. Fong, D. Hankerson, J. López, and A. Menezes. Field inversion and point halving revisited.
IEEE Transactions on Computers, 53(8):1047–1059, 2004.
17. P. Grabher, J. Großschädl, and D. Page. On software parallel implementation of cryptographic
pairings. Cryptology ePrint Archive, Report 2008/205, 2008. http://eprint.iacr.org/.
18. J. Guajardo and C. Paar. Itoh-Tsujii inversion in standard basis and its application in cryp-
tography and codes. Designs, Codes and Cryptography, 25(2):207–216, 2002.
19. S. Gueron. Intel Advanced Encryption Standard (AES) Instructions Set. White paper.
http://software.intel.com/.
20. S. Gueron and M. E. Kounavis. Carry-less multiplication and its usage for computing the
GCM mode. White paper. http://software.intel.com/.
21. D. Hankerson, A. J. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography.
Springer-Verlag, Secaucus, NJ, USA, 2004.
22. Intel. Intel SSE4 Programming Reference. Technical Report. http://software.intel.com/.
23. T. Itoh and S. Tsujii. A fast algorithm for computing multiplicative inverses in GF(2^m) using
normal bases. Inf. Comput., 78(3):171–177, 1988.
24. K. Järvinen. Optimized FPGA-based elliptic curve cryptography processor for high-speed
applications. Integration, the VLSI Journal, to appear.
25. A. Karatsuba and Y. Ofman. Multiplication of many-digital numbers by automatic com-
puters. Doklady Akad. Nauk SSSR, 145:293–294, 1962. Translation in Physics-Doklady 7,
595-596, 1963.
26. K. H. Kim and S. I. Kim. A new method for speeding up arithmetic on elliptic curves over
binary fields. Cryptology ePrint Archive, Report 2007/181, 2007. http://eprint.iacr.org/.
27. B. King and B. Rubin. Improvements to the point halving algorithm. In H. Wang, J. Pieprzyk,
and V. Varadharajan, editors, 9th Australasian Conference on Information Security and Pri-
vacy (ACISP 2004), volume 3108 of Lecture Notes in Computer Science, pages 262–276.
Springer, 2004.
28. E. Knudsen. Elliptic scalar multiplication using point halving. In K. Lam and E. Okamoto,
editors, Advances in Cryptology—ASIACRYPT ’99, volume 1716 of Lecture Notes in Com-
puter Science, pages 135–149. International Conference on the Theory and Application of
Cryptology and Information Security, Singapore, November 1999, Springer-Verlag, 1999.
29. N. Koblitz. CM-curves with good cryptographic properties. In J. Feigenbaum, editor, Ad-
vances in Cryptology—CRYPTO ’91, volume 576 of Lecture Notes in Computer Science,
pages 279–287. Springer-Verlag, 1992.
30. P. Longa and C. H. Gebotys. Efficient techniques for high-speed elliptic curve cryptogra-
phy. In S. Mangard and F.-X. Standaert, editors, Cryptographic Hardware and Embedded
Systems, (CHES 2010), volume 6225 of Lecture Notes in Computer Science, pages 80–94.
Springer, 2010.
31. J. López and R. Dahab. Fast multiplication on elliptic curves over GF(2^m) without precomputation.
In Ç. K. Koç and C. Paar, editors, First International Workshop on Cryptographic
Hardware and Embedded Systems (CHES 99), volume 1717 of Lecture Notes in Computer
Science, pages 316–327. Springer, 1999.
32. J. López and R. Dahab. High-speed software multiplication in GF(2^m). In B. K. Roy and
E. Okamoto, editors, 1st International Conference in Cryptology in India (INDOCRYPT
2000), volume 1977 of Lecture Notes in Computer Science, pages 203–212. Springer, 2000.
33. P. L. Montgomery. Five, six, and seven-term Karatsuba-like formulae. IEEE Transactions
on Computers, 54(3):362–369, 2005.
34. National Institute of Standards and Technology (NIST). Recommended Elliptic Curves for
Federal Government Use. NIST Special Publication, July 1999.
http://csrc.nist.gov/csrc/fedstandards.html.
35. R. Schroeppel. Elliptic curves: Twice as fast! Presentation at the CRYPTO 2000 [4] Rump
Session, 2000.
36. J. A. Solinas. Efficient arithmetic on Koblitz curves. Designs, Codes and Cryptography,
19(2-3):195–249, 2000.
37. D. W. Wall. Limits of instruction-level parallelism. In 4th International Conference on
Architectural Support for Programming Languages and Operating System (ASPLOS 91),
pages 176–188, New York, NY, 1991. ACM.
38. W. A. Wulf and S. A. McKee. Hitting the Memory Wall: Implications of the Obvious.
SIGARCH Computer Architecture News, 23(1):20–24, 1995.