
Curve25519 for the Cortex-M4 and Beyond


Abstract

We present techniques for the implementation of a key exchange protocol and digital signature scheme based on the Curve25519 elliptic curve and its Edwards form, respectively, in resource-constrained ARM devices. A possible application of this work consists of TLS deployments in the ARM Cortex-M family of processors and beyond. These devices are located towards the lower to mid-end spectrum of ARM cores, and are typically used on embedded devices. Our implementations improve the state-of-the-art substantially by making use of novel implementation techniques and features specific to the target platforms.
Hayato Fujii and Diego F. Aranha
Institute of Computing, University of Campinas
Keywords: ECC, Curve25519, X25519, Ed25519, ARM Cortex-M.
1 Introduction
The growing number of devices connected to the Internet collecting and storing
sensitive information raises concerns about the security of their communications
and of the devices themselves. Many of them are equipped with microcontrollers
constrained in terms of computing or storage capabilities, and lack tamper
resistance mechanisms or any form of physical protection. Their attack surface
is wide open, ranging from physical exposure to attackers to ease of access
through remote availability. While designing and developing efficient and secure
implementations of cryptography is not a new problem and has been an active
area of research since at least the birth of public-key cryptography, the
application scenarios for these new devices impose new challenges on cryptographic
engineering.
A possible way to deploy security in new devices is to reuse well-known and
well-analyzed building blocks, such as the Transport Layer Security (TLS) proto-
col. In comparison with reinventing the wheel using a new and possibly proprietary
solution, this has a major advantage of avoiding risky security decisions that may
repeat issues already solved in TLS. In RFC 7748 and RFC 8032, published by
the Internet Engineering Task Force (IETF), two cryptographic protocols based
on the Curve25519 elliptic curve and its Edwards form are recommended and
slated for future use in the TLS suite: the Diffie-Hellman key exchange using
Curve25519 [2] called X25519 [3] and the Ed25519 digital signature scheme [5].
These schemes rely on a careful choice of parameters, favoring secure and efficient
implementations of finite field and elliptic curve arithmetic with smaller room
for mistakes due to their overall implementation simplicity.
Special attention must be given to side-channel attacks, in which operational
aspects of the implementation of a cryptographic algorithm may leak internal state
information that allows an attacker to retrieve secret information. Secrets may leak
through the communication channel itself, power consumption, execution time or
radiation measurements. Information leaked through cache latency or execution
time already allows powerful timing attacks against naive implementations of
symmetric and public-key cryptography [20]. More intrusive attacks also attempt
to inject faults at precise execution times, in hope of corrupting execution state
to reveal secret information [8]. Optimizing such implementations to achieve
an ideal balance between resource efficiency and side-channel resistance further
complicates matters, beckoning algorithmic advances and novel implementation
strategies.
This work presents techniques for the efficient, compact implementation of both
X25519 and Ed25519, secure against timing and cache attacks, with
an eye towards possible application for TLS deployments on constrained ARM
devices. Our main target platform is the Cortex-M family of microcontrollers
starting from the Cortex-M4, but the same techniques can be used in higher-end
CPUs such as the ARM Cortex-A series.
We first present an ARM-optimized implementation of the
finite field arithmetic modulo the prime $p = 2^{255} - 19$. The main contribution in
terms of novelty is an efficient multiplier largely employing the powerful
multiply-and-accumulate DSP instructions in the target platform. The multiplier uses
the full 32-bit width and removes the need for explicit addition instructions to
accumulate results or propagate carries, as all of these operations are combined
with the DSP instructions. These instructions are not present in the Cortex-M0
microcontrollers, and have variable execution time in the Cortex-M3 [13],
hence the choice of focusing our efforts on the Cortex-M4 and beyond. The same
strategy used for multiplication is adapted to squarings, with similar success
in performance. Following related work [11], intermediate results are reduced
modulo $2p$ and ultimately reduced modulo $p$ at the end of computation.
Finite field arithmetic is then used to implement the higher levels, including
group arithmetic and cryptographic protocols. The key agreement implemen-
tation uses homogeneous projective coordinates in the Montgomery curve in
order to take advantage of the constant-time Montgomery ladder as the scalar
multiplication algorithm, protecting against timing attacks. The digital signature
scheme implementation represents points on a Twisted Edwards curve using
projective extended coordinates, benefiting from efficient and unified point addition
and doubling. The most time-consuming operation in the scheme is the fixed-
point scalar multiplication, which uses the signed comb method as introduced by
Hamburg [15], using approximately 7.5 KiB of precomputed data and running
approximately two times faster in comparison to the Montgomery ladder. Side-
channel security is achieved using isochronous (constant time) code execution and
linear table scan countermeasures. We also evaluate a different way to implement
the conditional selection operation in terms of its potential resistance against
profiled power attacks [25]. Experiments conducted on a Cortex-M4 development
board indicate that our work provides the fastest implementations of these specific
algorithms for our main target architecture.
Section 2 briefly describes features of our target platform. Section 3
documents related works in this area, by summarizing previous implementations
of elliptic curve algorithms in ARM microcontrollers. Section 4 discusses
our techniques for finite field arithmetic in detail, focusing on the squaring and
multiplication operations. Section 5 describes the algorithms using elliptic curve
arithmetic in the key exchange and digital signatures scenarios. Finally, Section 6
presents experimental results and implementation details.
Code Availability.
For reproducibility, the prime field multiplication code is pub-
licly available at
2 ARMv7 architecture
The ARMv7 architecture is a reduced instruction set computer (RISC) using a
load-store architecture. Processors with this technology are equipped with 16
registers: 13 general purpose, one as the program counter (PC), one as the stack
pointer (SP), and the last one as the link register (LR). The latter can be freed up
by saving it in slower memory and retrieving it after the register has been used.
The processor core has a three-stage pipeline which can be used to optimize
batch memory operations. Memory access involving $n$ registers in these processors
takes $n + 1$ cycles if there are no dependencies (for example, when a loaded
register is the address for a consecutive store). This can happen either in a
sequence of loads and stores or during the execution of instructions involving
multiple registers simultaneously.
The ARMv7E-M instruction set also comprises standard instructions for
basic arithmetic (such as addition and addition with carry) and logic operations,
but unlike lower-end processor classes, the Cortex-M4 supports
the so-called DSP instructions, which include multiply-and-accumulate (MAC)
instructions:
- Unsigned MULtiply Long: UMULL rLO, rHI, a, b takes two unsigned integer words a and b and multiplies them; the upper half of the result is written back to rHI and the lower half is written into rLO.
- Unsigned MULtiply Accumulate Long: UMLAL rLO, rHI, a, b takes unsigned integer words a and b and multiplies them; the product is added to the double-word integer stored as (rHI, rLO) and the sum is written back to it.
- Unsigned Multiply Accumulate Accumulate Long: UMAAL rLO, rHI, a, b takes unsigned integer words a and b and multiplies them; the product is added to the word-sized integer stored in rLO and then added again to the word-sized integer stored in rHI. This double-word integer is then written back into rLO and rHI, respectively the lower half and the upper half of the result.
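The semantics of the three MAC instructions can be modelled in a few lines of Python. This is only an illustrative sketch: the function names mirror the mnemonics, registers are plain integers, and each function returns the pair (rLO, rHI) rather than writing registers in place.

```python
MASK32 = 0xFFFFFFFF  # one 32-bit word

def umull(a, b):
    """Model of UMULL rLO, rHI, a, b: (rHI, rLO) = a * b."""
    p = a * b
    return p & MASK32, p >> 32

def umlal(r_lo, r_hi, a, b):
    """Model of UMLAL: (rHI, rLO) = a * b + (rHI, rLO), wrapping at 64 bits."""
    p = (a * b + ((r_hi << 32) | r_lo)) & 0xFFFFFFFFFFFFFFFF
    return p & MASK32, p >> 32

def umaal(r_lo, r_hi, a, b):
    """Model of UMAAL: (rHI, rLO) = a * b + rLO + rHI."""
    p = a * b + r_lo + r_hi
    # The largest possible value is (2^32-1)^2 + 2(2^32-1) = 2^64 - 1,
    # so the result always fits in two words.
    assert p < 1 << 64
    return p & MASK32, p >> 32
```

Feeding the largest possible words to `umaal` yields two all-ones words, which is why no carry flag is ever needed after this instruction.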
ARM’s Technical Reference Manual of the Cortex-M4 core [1] states that
all the mentioned MAC instructions take one CPU cycle for execution in the
Cortex-M4 and above. However, those instructions deterministically take an
extra three cycles to write the lower half of the double-word result, and a final
extra cycle to write the upper half. Therefore, proper instruction scheduling is
necessary in order to avoid pipeline stalls and to make best use of the delay slots.
The ARM Cortex-A cores are computationally more powerful than their
Cortex-M counterparts. Cortex-A based processors can run robust operating
systems due to extra auxiliary hardware; additionally, they may have a NEON
engine, which is a Single Instruction-Multiple Data (SIMD) unit. Aside from
that, those processors may have sophisticated out-of-order execution and extra
pipeline stages.
3 Related work
Research in curve-based cryptography proceeds in several directions: looking for
efficient elliptic curve parameters, instantiating and implementing the respective
cryptographic protocols, and finding new applications. More recently, isogeny-based
cryptography [18], which uses elliptic curves, was proposed as a candidate
for post-quantum cryptography.
3.1 Scalar multiplication
Düll et al. [11] implemented X25519 and its underlying field arithmetic on
a Cortex-M0 processor, equipped with a $32 \times 32 \to 32$-bit multiplier. Since
this instruction only returns the lower part of the product, this multiplier is
abstracted as a smaller one ($16 \times 16 \to 32$) to facilitate a 3-level Refined Karatsuba
implementation, taking 1294 cycles to complete on the same processor. Their
256-bit squaring uses the same multiplier strategy with standard tricks to avoid
repeated operations, taking 857 cycles. Putting it all together, an entire X25519
operation takes about 3.6M cycles with approximately 8 KiB of code size.
On the Cortex-A family of processor cores, implementers may use NEON, a
SIMD instruction set executed in its own unit inside the processor. Bernstein
and Schwabe [7] reported 527,102 Cortex-A8 cycles for the X25519 function. In
the elliptic curve formulae used in their work, most of the multiplications can
be handled in a parallel way, taking advantage of NEON’s vector instructions
and Curve25519’s parallelization opportunities.
The authors are not aware of an Ed25519 implementation specifically targeting
the Cortex-M4 core. However, Bernstein’s work using Cortex-A8’s NEON unit
reports 368,212 cycles to sign a short message and 650,102 cycles to verify its
validity. The authors point out that 50 and 25 thousand cycles of signing and
verification, respectively, are spent by the SUPERCOP-chosen SHA-512 implementation,
with room for further improvements.
FourQ is an elliptic curve providing about 128 bits of security equipped
with the endomorphisms $\psi$ and $\phi$, providing efficient scalar multiplication [9].
Implementations of key exchange over this elliptic curve in different software and
hardware platforms show a factor-2 speedup in comparison to Curve25519 and
factor-5 speedup in comparison to NIST’s P-256 curve [22]. Liu et al. reported [22]
a 559,200 cycle count on an ARM Cortex-M4 based processor of their 32-bit
implementation of the Diffie-Hellman Key Exchange in this curve.
Generating keys and Schnorr-like signatures over FourQ takes about 796,000
cycles on a Cortex-M4 based processor, while verification takes about 733,000
cycles on the same CPU [22]. Key generation and signing are aided by an
80-point table taking 7.5 KiB of ROM, and verification is assisted by a 256-point
table, using 24 KiB of memory. Quotient DSA (qDSA) [27] is a novel signature
scheme relying on Kummer arithmetic in order to use the same key for DH and
digital signature schemes. It relies only on the $x$-coordinate with the goal of
reducing stack usage and the use of optimized formulae for group operations.
When instantiated with Curve25519, it takes about 3 million cycles to sign a
message and 5.7 million cycles to verify it in a Cortex-M0. This scheme does not
rely on an additional table for speedups since low code size is an objective given
the target architecture, although this can be done using the ideas from [26] with
increased ROM usage.
3.2 Modular Multiplication
Field multiplication is usually the most performance-critical operation, because
other non-trivial field operations, such as inversion, are avoided by algorithmic
techniques. Multiprecision multiplication algorithms can be ranked by how many
single-word multiplications are performed. For example, operand scanning takes
$O(n^2)$ multiplications, where $n$ is the number of words. Product scanning takes the
same number of word multiplications, but reduces the number of memory accesses by
accumulating intermediate results in registers. One of the most popular algorithms
that asymptotically reduces such complexity is the Karatsuba multiplication,
which takes the computational cost down to $O(n^{\log_2 3})$. This algorithm performs
three multiplications of, usually, half-sized operands, thus giving it a
divide-and-conquer structure and lowering its asymptotic complexity. As an example
of such an application, De Santis and Sigl's [28] X25519 implementation on the
Cortex-M4 features a two-level Karatsuba multiplier implementation, splitting a
256-bit multiplier down to 64-bit multiplications, each one taking four
hardware-supported $32 \times 32 \to 64$ multiplication instructions.
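To make the word-multiplication counts concrete, the following Python sketch compares product scanning with a recursive Karatsuba multiplier on eight 32-bit limbs. It is an illustration only: the helper names are hypothetical, Python integers stand in for registers, and the Karatsuba routine uses the subtractive variant so that every recursive operand still fits in half the words.

```python
W = 32
MASK = (1 << W) - 1

def to_limbs(x, n):
    """Split x into n little-endian W-bit limbs."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    return sum(l << (W * i) for i, l in enumerate(limbs))

def product_scanning(a, b):
    """Column-wise schoolbook product; returns (limbs, word multiplications)."""
    n = len(a)
    out, acc, count = [], 0, 0
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += a[i] * b[k - i]      # one word multiplication per dot
            count += 1
        out.append(acc & MASK)          # column k is finished
        acc >>= W                       # the rest carries into column k + 1
    out.append(acc)
    return out, count

def karatsuba(x, y, n):
    """Multiply two n-word integers (n a power of two); returns (product, count)."""
    if n == 1:
        return x * y, 1
    h = n // 2
    half = (1 << (W * h)) - 1
    x0, x1, y0, y1 = x & half, x >> (W * h), y & half, y >> (W * h)
    lo, c0 = karatsuba(x0, y0, h)
    hi, c1 = karatsuba(x1, y1, h)
    dx, dy = x0 - x1, y0 - y1
    m, c2 = karatsuba(abs(dx), abs(dy), h)
    if (dx < 0) != (dy < 0):
        m = -m
    mid = lo + hi - m                   # equals x0*y1 + x1*y0
    return lo + (mid << (W * h)) + (hi << (2 * W * h)), c0 + c1 + c2
```

For eight limbs, product scanning performs 64 word multiplications while three Karatsuba levels perform 27, at the cost of extra additions and subtractions that this count does not capture.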
Memory accesses account for part of the time consumed by the
multiplication routine. Thus, algorithms and instruction scheduling methods
which minimize those memory operations are highly desirable, especially on
not-so-powerful processors with slow memory access. This problem can be addressed
by scheduling the basic multiplications following the product scanning strategy,
which can be seen as a rhombus-like structure. However, following this scheme
in its traditional way requires multiple stores and loads from memory, since the
number of registers available may not be sufficient to hold the full operands.
Improvements to reduce the amount of memory operations are present in the
literature: namely, Operand Caching due to Hutter and Wenger [17], further
improved by the Consecutive Operand Caching [30] and the Full Operand Caching,
both due to Seo et al. [29].
The Operand Caching technique reduces the number of memory accesses in
comparison to the standard product-scanning by caching data in registers and
storing part of the operands in memory. This method resembles the product
scanning approach, but instead of calculating a word in its entirety, rows are
introduced to compute partial sub-products from each column. This method is
illustrated in Figure 1.
Fig. 1. Operand Caching. Each dot in the rhombus represents a one-word multiplication;
each column, at the end of its evaluation, is a (partial) word of the product. Vertical
lines represent additions.
This method divides product scanning in two steps:
- Initial Block: the first step loads part of the operands and proceeds to calculate the upper part of the rhombus using classical product scanning.
- Rows: in the rightmost part, most of the necessary operands are already loaded from previous calculations, requiring only a few extra operand loads, depending on row width. Product scanning is done until the row ends. Note that, at the end of each column, parts of the operands are already loaded, hence only a small number of loads is necessary to evaluate the next column.
At every row change, new operands need to be reloaded, since the current
operands in the registers are not useful at the start of the new row. Consecutive
Operand Caching avoids those memory accesses by rearranging the rows and
further increasing the number of operands already in registers. This algorithm
is depicted in Figure 2.
Note that during the transition between the bottommost row and the one
above, part of the operands are already available in registers, solving the reload
problem between row changes. Let $n$ be the number of “limbs”, $e$ the number of
available working registers and $r = \lfloor n/e \rfloor$ the number of rows. Full Operand Caching
further reduces the number of memory accesses in two cases: if $n - re < e/2$, the Full
Operand Caching structure looks like the original Operand Caching, but with a
different multiplication order. Otherwise, the length of the bottom row of Consecutive
Operand Caching is adjusted in order to make full use of all available registers when
processing the next row.
Fig. 2. Consecutive Operand Caching
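The case distinction can be written down as a small Python helper. This is a sketch under two assumptions that are not spelled out in the surviving text: that the number of rows is $r = \lfloor n/e \rfloor$, and that the threshold for the first case is $n - re < e/2$. The function name and the returned labels are hypothetical.

```python
def full_operand_caching_shape(n, e):
    """Return (rows, leftover limbs, structure label) for n limbs and e registers.

    Assumes r = floor(n / e) and the threshold n - r*e < e/2 (see lead-in).
    """
    r = n // e
    leftover = n - r * e
    if 2 * leftover < e:                 # same test as leftover < e / 2
        structure = "operand-caching-like"
    else:
        structure = "consecutive-like"
    return r, leftover, structure
```

With $n = 8$ limbs and $e = 3$ working registers this gives two rows and a leftover of two limbs, which selects the Consecutive-like layout.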
3.3 Modular Squaring
The squaring routine can be built by repeating the operands and using the
multiplication procedure, saving ROM space. Alternatively, writing a specialized
procedure can save cycles by duplicating partial products [23]. The squaring
implementation in [28] follows this strategy, specializing the 64-bit multiplication
routine to 8 cycles, down from 10. Partial products are calculated then added
twice to the accumulator and the resulting carry is rippled away.
Following the product scanning method, Seo's Sliding Block Doubling [30]
halves the rhombus structure, allowing more registers to be used to store part of the
operands and doubling partial products. The algorithm is illustrated in Figure 3
and can be divided in three parts:
Fig. 3. Sliding Block Doubling. Black dots represent multiplications and white dots
represent squarings.
- Partial Products of the Upper Part Triangle: an adaptation of product scanning calculates partial products (represented by the black dots at the upper part of the rhombus in Figure 3) and saves them to memory.
- Sliding Block Doubling of Partial Products: each result of the column is doubled by left shifting it by one bit, effectively duplicating the partial products. This process must be done in parts because the number of available registers is limited, since they hold parts of the operand.
- Remaining Partial Products of the Bottom Line: the bottom line multiplications are squares of parts of the operand. These products must be added to the respective partial result of the column above.
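The three parts can be mirrored at a high level in Python. The sketch below is not the register-level algorithm: it uses arbitrary-precision integers instead of a sliding window of registers, and the function name is hypothetical. It only shows why the doubling trick computes the correct square with fewer word multiplications.

```python
W = 32
MASK = (1 << W) - 1

def to_limbs(x, n):
    return [(x >> (W * i)) & MASK for i in range(n)]

def square_by_doubling(a):
    """Square a little-endian limb list; returns (value, word multiplications)."""
    n = len(a)
    mults = 0
    # Part 1: partial products of the upper triangle (i < j only).
    upper = 0
    for i in range(n):
        for j in range(i + 1, n):
            upper += (a[i] * a[j]) << (W * (i + j))
            mults += 1
    # Part 2: doubling by a one-bit left shift accounts for the mirrored j < i terms.
    result = upper << 1
    # Part 3: bottom line, the squares a[i]^2 on the diagonal.
    for i in range(n):
        result += (a[i] * a[i]) << (2 * W * i)
        mults += 1
    return result, mults
```

For eight limbs this takes 28 + 8 = 36 word multiplications instead of the 64 needed by a general multiplication.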
4 Implementation of $\mathbb{F}_{2^{255}-19}$ Arithmetic
Our implementation aims for efficiency, so specific ARM Assembly is thoroughly
used and code size is moderately sacrificed for speed. Code portability is a
non-goal, so each 255-bit integer field element is densely represented in radix $2^{32}$,
resulting in eight “limbs” of 32 bits each, stored in little-endian order. This
contrasts with the reference implementation [2], which uses 25 or 26 bits in 32-bit
words, allowing carry values requiring proper handling at the expense of more
Modular Reduction.
We call “weak” reduction a reduction modulo $2^{256} - 38$
performed at the end of every field operation in order to avoid extra carry
computations between operations, as in [11]; this reduction finds an integer less
than $2^{256}$ that is congruent modulo $2^{255} - 19$. When necessary, a “strong” reduction
modulo $2^{255} - 19$ is performed, such as when data must be sent over the wire.
This strategy is justified by the extra 10% difference in cost between the “strong” and
the “weak” reduction.
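The two reductions can be sketched in Python as follows. This is an illustrative model, not the assembly routine: it relies on the identity $2^{256} \equiv 38 \pmod{2^{255} - 19}$ to fold high bits back into the low 256 bits, uses Python integers, and is not constant-time. The function names are hypothetical.

```python
P = 2**255 - 19
MASK256 = 2**256 - 1

def weak_reduce(x):
    """Return a value below 2^256 congruent to x modulo P."""
    while x >> 256:
        # 2^256 = 2P + 38, so every multiple of 2^256 can be replaced by 38.
        x = (x & MASK256) + 38 * (x >> 256)
    return x

def strong_reduce(x):
    """Return the canonical representative of x in [0, P)."""
    x = weak_reduce(x)
    # A weakly reduced value is below 2^256 < 3P: two subtractions suffice.
    for _ in range(2):
        if x >= P:
            x -= P
    return x
```

A 512-bit product only needs the weak form between operations; the strong form is reserved for values that leave the implementation, such as an encoded public key.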
Addition and Subtraction.
256-bit addition is implemented by respectively
adding each limb in a lower to higher element fashion. The carry flag, present
in the ARM status registers, is used to ripple the carry across the limbs. In
order to avoid extra bits generated by the final sum, the result is weakly reduced.
Subtraction follows a similar strategy.
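A limb-level model of the addition is given below as a Python sketch. The `carry` variable plays the role of the ARM carry flag, and the final steps apply the weak reduction by adding 38 for a carry out of the top limb. Names are hypothetical and the code makes no attempt to match the instruction sequence of the real routine.

```python
W = 32
MASK = (1 << W) - 1

def to_limbs(x, n=8):
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    return sum(l << (W * i) for i, l in enumerate(limbs))

def add_weak(a, b):
    """Add two 8-limb values and weakly reduce the result below 2^256."""
    out, carry = [], 0
    for x, y in zip(a, b):              # lowest limb first, rippling the carry
        s = x + y + carry
        out.append(s & MASK)
        carry = s >> W
    carry *= 38                         # a carry out of bit 256 is worth 38
    for i in range(len(out)):
        s = out[i] + carry
        out[i] = s & MASK
        carry = s >> W
    # If the fold itself overflowed, the low limb is now below 38,
    # so this last addition cannot overflow the limb.
    out[0] += 38 * carry
    return out
```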
Multiplication by a 32-bit Word.
Multiplication by a single word follows the
algorithm described in [28], used to multiply a long integer by 121666, an operation
required to double a point on Curve25519.
Inversion.
This operation follows the standard Itoh-Tsujii addition-chain approach
to compute $a^{-1} \equiv a^{p-2} \pmod{p}$, using 11 multiplications and 254 field
squarings as proposed in [2]. Adding up the costs, inversion turns out to be the most
expensive field operation in our implementation.
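The operation counts can be checked with a short Python sketch of the widely used addition chain for the exponent $p - 2 = 2^{255} - 21$. The chain below is the one commonly found in Curve25519 implementations and is assumed, not confirmed, to match the one used here; the counters exist only to verify the 11 multiplications and 254 squarings.

```python
P = 2**255 - 19

def invert(z):
    """Return (z^(P-2) mod P, operation counts) using a fixed addition chain."""
    counts = {"mul": 0, "sqr": 0}

    def mul(a, b):
        counts["mul"] += 1
        return a * b % P

    def sqr(a, times=1):
        for _ in range(times):
            a = a * a % P
        counts["sqr"] += times
        return a

    z2 = sqr(z)                          # z^2
    z9 = mul(z, sqr(z2, 2))              # z^9
    z11 = mul(z2, z9)                    # z^11
    t5 = mul(z9, sqr(z11))               # z^(2^5 - 1)
    t10 = mul(sqr(t5, 5), t5)            # z^(2^10 - 1)
    t20 = mul(sqr(t10, 10), t10)         # z^(2^20 - 1)
    t40 = mul(sqr(t20, 20), t20)         # z^(2^40 - 1)
    t50 = mul(sqr(t40, 10), t10)         # z^(2^50 - 1)
    t100 = mul(sqr(t50, 50), t50)        # z^(2^100 - 1)
    t200 = mul(sqr(t100, 100), t100)     # z^(2^200 - 1)
    t250 = mul(sqr(t200, 50), t50)       # z^(2^250 - 1)
    return mul(sqr(t250, 5), z11), counts  # z^(2^255 - 21)
```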
4.1 Multiplication
The $256 \times 256 \to 512$-bit multiplication follows a product-scanning-like approach;
more specifically, Full Operand Caching. As mentioned in Section 3, the parameters
for this implementation are $n = 8$, $e = 3$, $r = 2$; since $n - re \geq e/2$,
Full Operand Caching with a Consecutive-like structure is the best option.
Catching the Carry Bit.
Using product scanning to calculate partial products
with a double-word multiplier implies adding partial products of the next column,
which in turn might generate carries. A partial column, divided in rows in the
manner described in Operand Caching, can be calculated using Algorithm 1; an
example of implementation in ARM Assembly is shown in Listing 1.1. In the notation,
$(\varepsilon, z) \leftarrow w$ means $z \leftarrow w \bmod 2^W$ and $\varepsilon \leftarrow 0$ if $w \in [0, 2^W)$,
otherwise $\varepsilon \leftarrow 1$, where $W$ is the bit-size of a word; $(A\,B)$ denotes a $2W$-bit word
obtained by concatenating the $W$-bit words $A$ and $B$.
Algorithm 1 Column computation in product scanning.
Input: Operands $A, B$; column index $k$; partial product $R_k$ (calculated during column
$k - 1$); accumulated carry $R_{k+1}$ (generated from sum of partial products).
Output: (Partial) product $AB[k]$; sum $R_{k+1}$ (higher half part of the partial product
for column $k + 1$); accumulated carry $R_{k+2}$ (generated from sum of partial products).
$R_{k+2} \leftarrow 0$
for all $(i, j) \mid i + j = k,\ 0 \leq i, j \leq n - 1$ do
    $(T\,R_k) \leftarrow A[i] \times B[j] + R_k$
    $(\varepsilon, R_{k+1}) \leftarrow T + R_{k+1}$
    $R_{k+2} \leftarrow R_{k+2} + \varepsilon$
end for
return $AB[k] \equiv R_k$, $R_{k+1}$, $R_{k+2}$
Listing 1.1. ARM code for calculating a column in product scanning.
@ r5 and r4 hold R_6, R_7 respectively
@ r6, r7, r8 hold A[3], A[4] and A[5] respectively
@ r9, r10, r11 hold B[3], B[1], B[2] respectively
MOV r12, #0
MOV r3, #0
UMLAL r5, r12, r8, r10 @ A5 B1
ADDS r4, r4, r12
ADC r3, r3, #0
MOV r14, #0
UMLAL r5, r14, r7, r11 @ A4 B2
ADDS r4, r4, r14
ADC r3, r3, #0
MOV r12, #0
UMLAL r5, r12, r6, r9 @ A3 B3
ADDS r4, r4, r12
ADC r3, r3, #0
@ r5 holds AB[6], r4 holds R_7, r3 holds R_8
One possible optimization is delaying the carry bit: by eliminating the last
addition of Algorithm 1, this addition can be deferred to the next column with
the use of a single instruction to add the partial products and the carry bit. This
is easier on ARM processors, where there is fine-grained control of whether or not
instructions may update the processor flags. Other optimizations involve proper
register allocation in order to avoid reloads, saving a few cycles.
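The carry-catching column of Algorithm 1 can be modelled in Python to check that chaining the columns yields the full product. This is a sketch with hypothetical names: each loop iteration corresponds to the MOV/UMLAL/ADDS/ADC group of Listing 1.1, with the high word of the multiply-accumulate cleared before every product, and the deferred-carry optimization is not modelled.

```python
W = 32
BASE = 1 << W

def to_limbs(x, n):
    return [(x >> (W * i)) & (BASE - 1) for i in range(n)]

def from_limbs(limbs):
    return sum(l << (W * i) for i, l in enumerate(limbs))

def column(A, B, k, r_k, r_k1):
    """One column: returns (AB[k], R_{k+1}, R_{k+2})."""
    n = len(A)
    r_k2 = 0
    for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
        t, r_k = divmod(A[i] * B[k - i] + r_k, BASE)   # UMLAL with zeroed high word
        eps, r_k1 = divmod(t + r_k1, BASE)             # ADDS: may set the carry
        r_k2 += eps                                    # ADC: catch the carry bit
    return r_k, r_k1, r_k2

def multiply(A, B):
    """Chain the columns; returns the 2n limbs of A * B."""
    n = len(A)
    out, r_k, r_k1 = [], 0, 0
    for k in range(2 * n - 1):
        ab_k, r_k, r_k1 = column(A, B, k, r_k, r_k1)
        out.append(ab_k)
    out.append(r_k)                                    # top limb
    return out
```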
Carry Elimination.
Storing partial products in extra registers without adding
them avoids potential carry values. In a trivial implementation, a register
accumulator may be used to add the partial values, potentially generating carries. The
UMAAL instruction can be employed to perform such addition, while also taking
advantage of the multiplication part to further calculate more partial products.
This instruction never generates a carry bit, since $(2^{32} - 1)^2 + 2(2^{32} - 1) = 2^{64} - 1$,
eliminating the need for carry handling. Partial products generated by this
instruction can be forwarded to the next multiply-accumulate(-accumulate) operation;
this goes on until all rows are processed. Algorithm 2 and Listing 1.2 illustrate
how a column from product scanning can be evaluated following this strategy.
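Algorithm 2 Column computation in product scanning, eliminating carries.
Input: Operands $A, B$; column index $k$; $m$ partial products $R_k[0, \ldots, m-1]$ (calculated
during column $k - 1$ and stored in registers).
Output: Partial product $AB[k]$; $m$ partial products $R_{k+1}[0, \ldots, m-1]$ (higher half
part of the calculated partial product for column $k + 1$ stored in registers).
$t \leftarrow 1$
for all $(i, j) \mid i + j = k,\ 0 \leq i, j \leq n - 1$ do
    $(R_k[t]\,R_k[0]) \leftarrow A[i] \times B[j] + R_k[0] + R_k[t]$
    $R_{k+1}[t-1] \leftarrow R_k[t]$
    $t \leftarrow t + 1$
end for
return $AB[k] \equiv R_k[0]$, $R_{k+1}[0, \ldots, m-1]$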
Listing 1.2. ARM code for calculating a column in product scanning without carries.
@ r3, r4, r12 and r5 hold R_6[0,1,2,3]
@ r6, r7, r8 hold A[3], A[4] and A[5] respectively
@ r9, r10, r11 hold B[3], B[1], B[2] respectively
UMAAL r3, r4, r8, r10 @ A5 B1
UMAAL r3, r12, r7, r11 @ A4 B2
UMAAL r3, r5, r6, r9 @ A3 B3
@ r3 holds (partially) AB[6]
@ r4, r5 and r12 hold partial products for k = 7
Note that this strategy is limited by the number of working registers available.
These registers hold partial products without adding them up, avoiding the need
for carry handling, so strategies dividing columns into rows like in Operand Caching
are desirable.
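The carry-free strategy of Algorithm 2 can be checked with a Python model that uses only a UMAAL-like primitive. This sketch makes a simplifying assumption that does not hold on the real hardware: it keeps $n + 1$ partial-product "registers" in a list so that full-height columns can be processed without splitting them into rows. All names are hypothetical.

```python
W = 32
MASK = (1 << W) - 1

def to_limbs(x, n):
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    return sum(l << (W * i) for i, l in enumerate(limbs))

def umaal(lo, hi, a, b):
    """(hi, lo) = a * b + lo + hi; the sum never exceeds 2^64 - 1."""
    p = a * b + lo + hi
    assert p < 1 << 64
    return p & MASK, p >> W

def multiply_umaal(A, B):
    """Product scanning where every addition is absorbed by UMAAL."""
    n = len(A)
    R = [0] * (n + 1)          # R[0]: column accumulator, R[1:]: pending high words
    out = []
    for k in range(2 * n - 1):
        t = 1
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            R[0], R[t] = umaal(R[0], R[t], A[i], B[k - i])
            t += 1
        out.append(R[0])       # AB[k] is complete
        R = R[1:] + [0]        # R_{k+1}[t - 1] <- R_k[t]
    out.append(R[0])           # top limb
    return out
```

No carry flag appears anywhere in this model: each high word produced in one column is consumed as the second accumulator of exactly one multiplication in the next column.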
4.2 Squaring
The literature suggests the use of a multiplication algorithm similar to Schoolbook
[30], but saving registers and repeated multiplications. Due to its similarity
with product scanning (and the possibility to apply the above optimization
techniques), we choose the Sliding Block Doubling algorithm as the squaring routine.
Note that, using the carry flag present in some machine architectures,
both the Sliding Block Doubling and the Bottom Line steps (as described in Section
3) can be efficiently computed. In order to avoid extra memory accesses, one can
implement those two routines without reloading operands; because both
operations need the carry bit, high register pressure may arise from
keeping them in registers. We propose a technique to alleviate the register pressure:
calculating a few multiplications akin to the Initial Block step as presented in
Operand Caching reduces register usage, allowing proper carry catching and
handling in exchange for a few memory accesses (Figure 4).
Fig. 4. Sliding Block Doubling with Initial Block
In this example, each product-scanning column is limited to height 2, meaning
that only two consecutive multiplications can be handled without losing partial
products. Incrementing the size of the “initial block” (or, more accurately, the
initial triangle) frees up registers during the bottom row evaluation.
5 Elliptic curves
An elliptic curve $E$ over a field $\mathbb{F}_q$ is the set of solutions $(x, y) \in \mathbb{F}_q \times \mathbb{F}_q$ which
satisfy the Weierstrass equation
$E/\mathbb{F}_q : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$, (1)
where $a_1, a_2, a_3, a_4, a_6 \in \mathbb{F}_q$ and the curve discriminant is $\Delta \neq 0$. We restrict our
attention to curves defined over prime fields which can be represented in the
Montgomery [24] (or Twisted Edwards [4]) model, allowing faster formulas and
unified arithmetic [6].
The set of points $E(\mathbb{F}_q) = \{(x, y) \in E(\mathbb{F}_q)\} \cup \{\mathcal{O}\} = \{P \in E(\mathbb{F}_q)\} \cup \{\mathcal{O}\}$
under the addition operation $+$ (chord and tangent) forms an additive group,
with $\mathcal{O}$ as the identity element. Given an elliptic curve point $P \in E(\mathbb{F}_q)$ and
an integer $k$, the operation $kP$, called scalar point multiplication, is defined by
the addition of the point $P$ to itself $k - 1$ times. This operation encodes the
security assumption for Elliptic Curve Cryptography (ECC) protocols, basing
their security on the hardness of solving the elliptic curve analogue of the discrete
logarithm problem (ECDLP). Given a public key represented as a point $Q$ in the
curve, the problem amounts to finding the secret $k \in \mathbb{Z}$ such that $Q = kP$ for
some given point $P$ in the curve.
ECC is an efficient yet conservative option for deploying public-key cryp-
tography in embedded systems, since the ECDLP still enjoys conjectured full-
exponential security against classical computers and, consequently, reduced key
sizes and storage requirements. In practice, a conservative instance of this problem
can be obtained by selecting prime curves of near-prime order without supporting
any non-trivial endomorphisms. Curve25519 is a popular curve at the 128-bit
security level represented through the Montgomery model
Curve25519: $y^2 = x^3 + Ax^2 + x$, (2)
compactly described by the small value of the coefficient $A = 486662$. This
curve model is ideal for curve-based key exchanges, because it allows the scalar
multiplication to be computed using $x$-coordinates only. Using a birational
equivalence, Curve25519 can be also represented in the twisted Edwards model
using full coordinates to allow instantiations of secure signature schemes:
edwards25519: $-x^2 + y^2 = 1 - \frac{121665}{121666} x^2 y^2$. (3)
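The birational equivalence can be checked numerically with a short Python sketch. The map used below, $(x, y) = (\sqrt{-486664}\,u/v,\ (u-1)/(u+1))$, is the one given in RFC 7748 and is taken here as an assumption about the equivalence meant in the text; the Montgomery coordinates are named $(u, v)$ to keep them apart from the Edwards $(x, y)$. Helper names are hypothetical.

```python
P = 2**255 - 19
A = 486662
D = -121665 * pow(121666, P - 2, P) % P   # Edwards coefficient -121665/121666

def sqrt_mod(a):
    """A square root of a modulo P (P = 5 mod 8), or None if none exists."""
    a %= P
    r = pow(a, (P + 3) // 8, P)
    if r * r % P != a:
        r = r * pow(2, (P - 1) // 4, P) % P   # multiply by a square root of -1
    return r if r * r % P == a else None

def on_montgomery(u, v):
    return (v * v - (u**3 + A * u * u + u)) % P == 0

def on_edwards(x, y):
    return (-x * x + y * y - 1 - D * x * x * y * y) % P == 0

def montgomery_to_edwards(u, v):
    s = sqrt_mod(-486664)
    x = s * u * pow(v, P - 2, P) % P
    y = (u - 1) * pow(u + 1, P - 2, P) % P
    return x, y
```

For the Curve25519 base point $u = 9$, the mapped point has $y = 4/5$, the $y$-coordinate of the Ed25519 base point.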
Key exchange protocols and digital signature schemes are building blocks
for applications like key distribution schemes and secure software updates based
on code signing. These protocols are fundamental for preserving the integrity of
software running in embedded devices and establishing symmetric cryptographic
keys for data encryption and secure communication.
5.1 Elliptic Curve Diffie Hellman
The Elliptic Curve Diffie Hellman protocol is an instantiation of the
Diffie-Hellman key agreement protocol over elliptic curves. Modern implementations
of this protocol employ $x$-coordinate-only formulas over a Montgomery model
of the curve, for computational savings, side-channel security and ease of
implementation. Following this idea, the protocol may be implemented using
the X25519 function, which is in essence a scalar multiplication of a point on
Curve25519 [2]. In this scheme, a pair of entities generate their respective
private keys, each of them 32 bytes long. A public generator point $P$ is multiplied
by the private key, generating a public key. Then, those entities exchange their
public keys over an insecure channel; computing the X25519 function with their
private keys and the received point generates a shared secret which may be used
to generate a symmetric session key for both parties.
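The exchange can be illustrated with a compact Python model of the X25519 function, following the Montgomery ladder as specified in RFC 7748. This is a readability-oriented sketch rather than the optimized routine: Python integers are not constant-time, and the doubling step is written with the constant 121666 instead of the equivalent 121665 form of the RFC.

```python
P = 2**255 - 19

def cswap(swap, a, b):
    """Swap a and b when swap == 1, using masking instead of a branch."""
    dummy = (-swap) & (a ^ b)
    return a ^ dummy, b ^ dummy

def x25519(k_bytes, u_bytes):
    """Scalar multiplication on Curve25519 using x-coordinates only."""
    k = int.from_bytes(k_bytes, "little")
    k &= (1 << 254) - 8                 # clamp: clear bits 0-2 and bit 255
    k |= 1 << 254                       # clamp: set bit 254
    x1 = int.from_bytes(u_bytes, "little") & ((1 << 255) - 1)
    x2, z2, x3, z3, swap = 1, 0, x1, 1, 0
    for t in reversed(range(255)):
        kt = (k >> t) & 1
        swap ^= kt
        x2, x3 = cswap(swap, x2, x3)
        z2, z3 = cswap(swap, z2, z3)
        swap = kt
        a, b = (x2 + z2) % P, (x2 - z2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (bb + 121666 * e) % P  # same as e * (aa + 121665 * e)
    x2, x3 = cswap(swap, x2, x3)
    z2, z3 = cswap(swap, z2, z3)
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")
```

Two parties with private keys `a` and `b` publish `x25519(a, base)` and `x25519(b, base)`, where `base` encodes $u = 9$, and each applies its own private key to the other's public key to obtain the same shared secret.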
Since the ECDH protocol does not authenticate keys, public key authentication
must be performed out-of-band, or an authenticated key agreement scheme such as
the Elliptic Curve Menezes-Qu-Vanstone (ECMQV) [21] must be adopted.
For data confidentiality, authenticated encryption can be constructed by
combining X25519 as an interactive key exchange mechanism, together with
a block or stream cipher and a proper mode of operation, as proposed in the
future Transport Layer Security protocol versions. Alternatively, authenticated
encryption with additional data (AEAD) schemes may be combined with X25519,
replacing block ciphers and a mode of operation.
5.2 Ed25519 digital signatures
The Edwards-curve Digital Signature Algorithm [5] (EdDSA) is a signature
scheme variant of Schnorr signatures based on elliptic curves represented in the
Edwards model. Like other discrete-log based signature schemes, EdDSA requires
a secret value, or nonce, unique to each signature. For reducing the risk of a
random number generator failure, EdDSA calculates this nonce deterministically,
as the hash of the message and the private key. Thus, the nonce is very unlikely
to be repeated for different signed messages. While this reduces the attack surface
in terms of random number generation and improves nonce misuse resistance
during the signing process, high quality random numbers are still needed for key
generation. When instantiated using edwards25519 (Equation 3), the EdDSA
scheme is called Ed25519. Concretely, let $H$ be the SHA512 hash function mapping
arbitrary-length strings to 512-bit hash values. The signature of a message $M$
under this scheme and private key $k$ is the 512-bit string $(R, S)$, where
$R = rB$, for $B$ a generator of the subgroup of points of order $\ell$ and $r$
computed as $H(H(k), M)$; and $S = r + H(R, A = aB, M)\,a \bmod \ell$, for an integer $a$
derived from $H(k)$. Verification works by parsing the signature components and checking if
the equation $SB = R + H(R, A, M)A$ holds [5].
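A short derivation shows why a correctly generated signature satisfies the verification equation. It assumes $R = rB$, $A = aB$ and $S = r + h\,a \bmod \ell$, where $h = H(R, A, M)$ is a shorthand introduced here and $\mathcal{O}$ denotes the identity point:

```latex
\begin{align*}
SB &= \bigl(r + h\,a \bmod \ell\bigr)\,B && \text{definition of } S \\
   &= rB + h\,(aB)                       && \text{since } \ell B = \mathcal{O} \\
   &= R + hA                             && \text{since } R = rB \text{ and } A = aB .
\end{align*}
```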
6 Implementation details and results
The focus of this work is microcontrollers suitable for integration within
embedded projects. Therefore, we choose some representative ARM architecture
processors. Specifically, the implementations were benchmarked on the following
platforms:
- Teensy 3.2 board equipped with a MK20DX256VLH7 Cortex-M4-based microcontroller, clocked at 48 and 72 MHz.
- STM32F401 Discovery board powered by a STM32F401C microcontroller, also based on the Cortex-M4 design, clocked at 84 MHz.
- ODROID-XU4 board with a Samsung Exynos5422 CPU clocked at 2 GHz, containing four Cortex-A7 and four Cortex-A15 cores in a heterogeneous configuration.
Code for the Teensy board was generated using GCC version 5.4.1 with the
-O3 -mthumb flags; the same settings apply to code compiled for
the STM32F401C board, but using an updated compiler version (7.2.0). For
the Cortex-A family, code was generated with GCC version 6.3.1 with an
optimization flag enabled. Cycle counts were obtained using the corresponding
cycle counter in each architecture. Randomness, where required, was sampled
on the Cortex-A7/A15 device. On the Cortex-M4 boards, the generator
was implemented with SHA256 and is seeded
by analog sampling of disconnected pins on the board.
Albeit not the most efficient for every possible target, the codebase is the same
for every ARMv7 processor equipped with DSP instructions, making it well suited to large
heterogeneous deployments, such as a network of small sensors connected to a
larger central server with a more powerful processor.
This eases code maintenance and helps avoid possible security problems.
6.1 Field arithmetic
Table 1 presents timings and Table 3 presents code size for field operations with
implementations described in Section 4. In comparison to the current state of the
art [28], our addition/subtraction takes 18% fewer cycles; the 256-bit multiplier
with a weak reduction is almost 50% faster and the squaring operation takes 30%
fewer cycles. The multiplication routine may be used in place of the squaring
if code size is a restriction, since 1S is approximately 0.9M (one squaring
costs about 0.9 multiplications). The implementation
of all arithmetic operations takes less code space in comparison to [28], with savings ranging
from 20% in the addition to 50% for the multiplier.
As noted by Haase [14], cycle counts on the same Cortex-M4-based controller
can differ depending on the clock frequency set on the chip. Different
clock frequencies set for the controller and the memory may cause stalls on the
former if the latter is slower. For example, the multiplication and the squaring
implementations, which rely on memory operations, use about 10% more cycles when
the controller clock is raised from 48 MHz to 72 MHz (Table 1). This behavior is also present in
cryptographic schemes, as shown in Table 2.
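The effect can be made concrete with a small calculation on the Teensy multiplication timings reported in Table 1. The sketch below is only arithmetic on those published numbers; the helper name `cycles_to_us` is introduced here for illustration.

```python
def cycles_to_us(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count into microseconds at the given core clock."""
    return cycles / clock_mhz


# Field multiplication on the Teensy board (Table 1), keyed by clock in MHz.
MULT_CYCLES = {48: 276, 72: 310}

# Relative increase in cycle count when moving from 48 MHz to 72 MHz.
extra_cycles = MULT_CYCLES[72] / MULT_CYCLES[48] - 1.0

# Wall-clock time per multiplication at each frequency.
time_48 = cycles_to_us(MULT_CYCLES[48], 48)
time_72 = cycles_to_us(MULT_CYCLES[72], 72)
```

Although the faster clock costs more cycles per multiplication, the operation still completes sooner in wall-clock time, so the higher frequency remains beneficial when latency matters more than cycle efficiency.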
Table 1. Timings in cycles for arithmetic in $\mathbb{F}_{2^{255}-19}$ on multiple ARM processors. Numbers for this work were taken as the average of 256 executions.

Implementation   Cortex                     Add/Sub  Mult  Mult by word  Square  Inversion
De Groot [12]    M4                         73/77    631   129           563     151997
De Santis [28]   M4                         106      546   72            362     96337
This work        M4 @ 48 MHz (Teensy)       86       276   76            252     66634
This work        M4 @ 72 MHz (Teensy)       86       310   76            280     75099
This work        M4 @ 84 MHz (STM32F401C)   86       273   76            243     64425
This work        A7                         52       290   61            233     62648
This work        A15                        36       225   37            139     41978

Implementation   Cortex                     $\mathbb{F}_{p^2}$ Add/Sub  $\mathbb{F}_{p^2}$ Mult  Mult by word  $\mathbb{F}_{p^2}$ Square  $\mathbb{F}_{p^2}$ Inversion
FourQ [22]       M4 (STM32F407)             84/86                       358                      -             215                        21056
Table 2. Timings in cycles for computing the Montgomery ladder in the X25519 key exchange; and key generation, signature and verification of a 5-byte message in the Ed25519 scheme. Key generation encompasses taking a secret key and computing its public key; signature takes both keys and a message to generate its respective signature. Numbers were taken as the average of 256 executions in multiple ARM processors. Protocols are inherently protected against timing attacks (constant-time, CT) on the Cortex-M4 due to the lack of cache memory, while side-channel protection is explicitly needed in the Cortex-A. Performance penalties for side-channel protection can be obtained by comparing the implementations with CT = Y over N in the same platform.

Implementation           CT  Cortex                    X25519   Ed25519 Key Gen.  Ed25519 Sign  Ed25519 Verify
De Groot [12]            Y   M4                        1816351  -                 -             -
De Santis [28]           Y   M4                        1563852  -                 -             -
This work                Y   M4 @ 48 MHz (Teensy)      907240   347225            496039        1265078
This work                Y   M4 @ 72 MHz (Teensy)      1003707  379734            531471        1427923
This work                Y   M4 @ 84 MHz (STM32F401)   894391   389480            543724        1331449
Schwabe, Bernstein [7]   Y   A8                        527102   -                 368212        650102
This work                N   A7                        -        -                 423058        1118806
This work                Y   A7                        825914   397261            524804        -
This work                N   A15                       -        -                 264252        776806
This work                Y   A15                       572910   245377            305797        -
eBACS ref. code [10]     Y   A15                       342477   241641            245712        730047

Implementation           CT  Cortex                    DH       SchnorrQ Key Gen.  SchnorrQ Sign  SchnorrQ Verify
FourQ [22]               Y   M4 (STM32F407)            542900   265100             345400         648600
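As the caption of Table 2 suggests, the cost of side-channel protection on the Cortex-A cores can be read off by dividing the CT = Y figure by the CT = N figure. The following sketch does this for Ed25519 signing using only the numbers reported for this work; the dictionary layout and the function name `ct_penalty` are illustrative choices.

```python
# Ed25519 signing cycle counts from Table 2 (this work), keyed by core.
SIGN_CYCLES = {
    "A7": {"unprotected": 423058, "constant_time": 524804},
    "A15": {"unprotected": 264252, "constant_time": 305797},
}


def ct_penalty(core: str) -> float:
    """Relative slowdown of the protected signing routine on the given core."""
    row = SIGN_CYCLES[core]
    return row["constant_time"] / row["unprotected"] - 1.0
```

With these inputs the penalty comes out at roughly a quarter on the Cortex-A7 and roughly a sixth on the Cortex-A15.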
Table 3. Code size in bytes for implementing arithmetic in $\mathbb{F}_{2^{255}-19}$, X25519 and Ed25519 protocols on the Cortex-M4. Code size for protocols considers the entire software stack needed to perform the specific action, including but not limited to field operations, hashing, tables for scalar multiplication and other algorithms.

Implementation   Add  Sub  Mult  Mult by word  Square
De Groot [12]    44   64   1284  300           1168
De Santis [28]   138  148  1264  116           882
This work        110  108  622   92            562

Implementation   Inversion  X25519  Ed25519 Key Gen.  Ed25519 Sign  Ed25519 Verify
De Groot [12]    388        4140    -                 -             -
De Santis [28]   484        3786    -                 -             -
This work        328        4152    21265             22162         28240
6.2 X25519 implementation
X25519 was implemented using the standard Montgomery ladder over the $x$-coordinate.
Standard tricks like randomized projective coordinates (amounting
to a 1% performance penalty) and constant-time conditional swaps were implemented
for side-channel protection. Cycle counts of the X25519 function executed
on the evaluated processors are shown in Table 2 and code size in Table 3.
Our implementation is 42% faster than De Santis and Sigl [28] while staying
competitive in terms of code size.
Note on conditional swaps. The classical conditional swap using logic instructions
is used by default as the compiler optimizes it using function inlining,
saving about 30 cycles. However, this approach opens a breach for a power
analysis attack, as shown in [25], since all bits of a 32-bit register (in
ARM architectures) must be set or cleared depending on a secret bit.
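The classical logic-based conditional swap can be sketched as follows. This is a functional model in Python of the usual XOR-and-mask idiom, operating on 32-bit limbs; it is an illustration of the technique and not the code used in this work, and the name `cswap_mask` is chosen here.

```python
MASK32 = 0xFFFFFFFF


def cswap_mask(a: list, b: list, bit: int) -> None:
    """Swap the limb arrays a and b in place iff bit == 1, without branching on bit."""
    # All 32 bits of the mask depend on the secret bit: 0x00000000 or 0xFFFFFFFF.
    # This full-width dependence is what the power analysis of [25] exploits.
    mask = (-bit) & MASK32
    for i in range(len(a)):
        t = mask & (a[i] ^ b[i])
        a[i] ^= t
        b[i] ^= t
```

The same sequence of operations is executed for both values of `bit`, which gives timing independence, but the mask register toggles between all zeros and all ones depending on the secret.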
Alternatively, the conditional swap operation can be implemented by setting
the 4-bit GE flag in the Application Program Status Register (APSR) and then
issuing the SEL instruction, which picks parts from the operand registers in byte-sized
blocks and writes them to the destination [1]. Note that clearing all four bits of the GE
flag and issuing SEL copies one of the operands; setting all four bits and using SEL
copies the other one. The APSR cannot be set directly through
an immediate operand, so a Move to Special Register (MSR) instruction must be
issued. Only registers may be used as arguments of this operation, so another
one must be used to set the GE flag. Therefore, at least 8 bits must be used
to implement the conditional move. This theoretically reduces the attack surface
of a potential side-channel analysis, down from 32 bits.
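The byte-wise selection performed by the ARM SEL instruction can be modelled in a few lines. The sketch below is a behavioural model in Python, not ARM code: `sel` mimics how each result byte is taken from the first operand when the corresponding GE bit is set and from the second operand otherwise, and `cswap_sel` (a name introduced here) shows how two such selections yield a conditional swap. The Python branch inside `sel` is only part of the model; on the processor the selection is a single instruction.

```python
def sel(rn: int, rm: int, ge: int) -> int:
    """Model of SEL: result byte i comes from rn if GE bit i is set, else from rm."""
    result = 0
    for i in range(4):
        byte_mask = 0xFF << (8 * i)
        source = rn if (ge >> i) & 1 else rm
        result |= source & byte_mask
    return result


def cswap_sel(a: list, b: list, bit: int) -> None:
    """Swap 32-bit limb arrays a and b in place iff bit == 1, using the SEL model."""
    # On hardware this 4-bit value would be moved into the GE flags with MSR.
    ge = 0xF * bit
    for i in range(len(a)):
        a[i], b[i] = sel(b[i], a[i], ge), sel(a[i], b[i], ge)
```

Only the four GE bits (and the register used to load them) depend on the secret bit, in contrast with the full 32-bit mask of the logic-based swap.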
6.3 Ed25519 implementation
Key generation and message signing require a fixed-point scalar multiplication,
here implemented through a comb-like algorithm proposed by Hamburg in [15].
The signed-comb approach recodes the scalar into its signed binary form using
a single addition and a right-shift. This representation is divided into blocks and
each block is divided into combs, much like in the multi-comb approach
described in [16]. Like in the original work, we use five teeth for each of the five
blocks and 10 combs for each block (11 for the last one) due to the performance
balance between the direct and the linear table scan to access precomputed data
if protection against cache attacks is required. To effectively calculate the scalar
multiplication, our implementation requires 50 point additions and 254 point
doublings. Five lookup tables of 16 points in Extended Projective coordinate
format with $z = 1$ are used, adding up to approximately 7.5 KiB of data.
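The signed binary recoding mentioned above can be illustrated with a short sketch. It assumes an odd scalar that fits in `n` bits and uses plain Python integers; an actual implementation would work modulo the group order to handle arbitrary scalars, and the function name `signed_recode` is chosen here for illustration. The idea is that writing every digit as $s_i = 2b_i - 1$ gives $k = 2b - (2^n - 1)$, so $b = (k + 2^n - 1)/2$ is obtained with one addition and one right shift.

```python
def signed_recode(k: int, n: int) -> list:
    """Return n digits s_i in {-1, +1} such that sum(s_i * 2**i) == k.

    Assumes k is odd and |k| < 2**n (an illustrative restriction).
    """
    t = k + (1 << n) - 1          # the single addition
    if t & 1:
        raise ValueError("k must be odd for this simplified recoding")
    b = t >> 1                    # the right shift
    if not 0 <= b < (1 << n):
        raise ValueError("k does not fit in n signed digits")
    # Bit i of b set means digit +1, cleared means digit -1.
    return [1 if (b >> i) & 1 else -1 for i in range(n)]
```

Because every digit is nonzero, each comb position always contributes a table lookup followed by a point addition or subtraction, which keeps the sequence of operations independent of the scalar.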
Verification requires a double-point multiplication involving the generator $B$
and point $A$ using a $w$-NAF interleaving technique [16], with a window of
width 5 for the $A$ point, generated on-the-fly, taking approximately 3 KiB of
volatile memory. The group generator $B$ is interleaved using a window of width 7,
implying a lookup table of 32 points stored in Extended Projective coordinate
format with $z = 1$, taking 3 KiB of ROM. Note that verification has no need
to be executed in constant time, since all input data is (expected to be) public.
Decoding uses a standard field exponentiation for both inversion and square root
to calculate the $x$-coordinate as suggested by [19] and [5]; this exponentiation is
carried out by the Itoh-Tsujii algorithm, providing an efficient way to calculate
the missing coordinate. Timings for computing a signature (both protected and
unprotected against cache attacks) and verification functionality on the evaluated
processors can be found in Table 2. Arithmetic modulo the group order in
Ed25519-related operations relates closely to the previously shown arithmetic
modulo $2^{255} - 19$, but Barrett reduction is used instead.
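The width-$w$ NAF recoding that underlies the interleaving technique can be sketched as below. This follows the textbook formulation for non-negative integers (as found in [16]) rather than the code of this work; the names `wnaf` and `wnaf_table_size` are introduced here. Since verification handles public data only, the data-dependent branches are acceptable in this setting.

```python
def wnaf(k: int, w: int) -> list:
    """Width-w NAF of k >= 0, least significant digit first.

    Nonzero digits are odd and lie strictly between -2**(w-1) and 2**(w-1).
    """
    digits = []
    while k > 0:
        if k & 1:
            d = k & ((1 << w) - 1)        # k mod 2**w
            if d >= 1 << (w - 1):
                d -= 1 << w               # pick the signed representative
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def wnaf_table_size(w: int) -> int:
    """Number of precomputed odd multiples P, 3P, ..., (2**(w-1) - 1)P."""
    return 1 << (w - 2)
```

For a window of width 7 the table holds 32 points, which matches the size quoted above for the generator; a window of width 5 needs 8 odd multiples of the second point.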
Final Remarks. We consider that our implementation is competitive in comparison
to the works mentioned in Section 3, given the performance numbers shown
in Tables 2 and 3. Using Curve25519 and its corresponding Twisted Edwards
form in well-known protocols is beneficial in terms of security, mostly due to its
maturity and its widespread usage to the point of becoming a de facto standard.
The authors gratefully acknowledge financial support from
LG Electronics Inc. during the development of this work, under the project
“Efficient and Secure Cryptography for IoT”, and Armando Faz-Hernández for his
helpful contributions and discussions during its development. We also thank the
anonymous reviewers for their helpful comments.
1. ARM: Cortex-M4 Devices Generic User Guide. Available on
2. Bernstein, D.J.: Curve25519: New Diffie-Hellman speed records. In: Public Key Cryptography. Lecture Notes in Computer Science, vol. 3958, pp. 207–228. Springer
3. Bernstein, D.J.: 25519 naming. Available on archive/web/cfrg/current/msg04996.html (Aug 2014)
4. Bernstein, D.J., Birkner, P., Joye, M., Lange, T., Peters, C.: Twisted Edwards Curves. In: AFRICACRYPT. Lecture Notes in Computer Science, vol. 5023, pp. 389–405. Springer (2008)
5. Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.: High-speed high-security signatures. J. Cryptographic Engineering 2(2), 77–89 (2012)
6. Bernstein, D.J., Lange, T.: Analysis and optimization of elliptic-curve single-scalar multiplication. Contemporary Mathematics Finite Fields and Applications 461
7. Bernstein, D.J., Schwabe, P.: NEON crypto. In: CHES. Lecture Notes in Computer Science, vol. 7428, pp. 320–339. Springer (2012)
8. Boneh, D., DeMillo, R.A., Lipton, R.J.: On the importance of checking cryptographic protocols for faults (extended abstract). In: EUROCRYPT. Lecture Notes in Computer Science, vol. 1233, pp. 37–51. Springer (1997)
9. Costello, C., Longa, P.: FourQ: Four-dimensional decompositions on a Q-curve over the Mersenne Prime. In: ASIACRYPT. Lecture Notes in Computer Science, vol. 9452, pp. 214–235. Springer (2015)
10. Bernstein, D.J., Lange, T. (eds.): eBACS: ECRYPT Benchmarking of Cryptographic Systems. Available on
11. Düll, M., Haase, B., Hinterwälder, G., Hutter, M., Paar, C., Sánchez, A.H., Schwabe, P.: High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptography 77(2-3), 493–514 (2015)
12. de Groot, W.: A Performance Study of X25519 on Cortex-M3 and M4. Ph.D. thesis, Eindhoven University of Technology (Sep 2015)
13. Großschädl, J., Oswald, E., Page, D., Tunstall, M.: Side-channel analysis of cryptographic software via early-terminating multiplications. In: ICISC. Lecture Notes in Computer Science, vol. 5984, pp. 176–192. Springer (2009)
14. Haase, B.: Memory bandwidth influence makes Cortex M4 benchmarking difficult (Sep 2017),
15. Hamburg, M.: Fast and compact elliptic-curve cryptography. IACR Cryptology ePrint Archive 2012, 309 (2012)
16. Hankerson, D., Menezes, A.J., Vanstone, S.: Guide to Elliptic Curve Cryptography. Springer-Verlag New York, Inc., Secaucus, NJ, USA (2003)
17. Hutter, M., Wenger, E.: Fast multi-precision multiplication for public-key cryptography on embedded microprocessors. In: CHES. Lecture Notes in Computer Science, vol. 6917, pp. 459–474. Springer (2011)
18. Jao, D., Feo, L.D.: Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. In: PQCrypto. Lecture Notes in Computer Science, vol. 7071, pp. 19–34. Springer (2011)
19. Josefsson, S., Liusvaara, I.: Edwards-Curve Digital Signature Algorithm (EdDSA). RFC 8032 (Jan 2017),
20. Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In: CRYPTO. Lecture Notes in Computer Science, vol. 1109, pp. 104–113. Springer (1996)
21. Law, L., Menezes, A., Qu, M., Solinas, J.A., Vanstone, S.A.: An efficient protocol for authenticated key agreement. Des. Codes Cryptography 28(2), 119–134 (2003)
22. Liu, Z., Longa, P., Pereira, G., Reparaz, O., Seo, H.: FourQ on embedded devices with strong countermeasures against side-channel attacks. In: CHES (to appear). Springer Berlin Heidelberg, Berlin, Heidelberg (2017)
23. Liu, Z., Seo, H., Kim, H.: A synthesis of multi-precision multiplication and squaring techniques for 8-bit sensor nodes: State-of-the-art research and future challenges. J. Comput. Sci. Technol. 31(2), 284–299 (2016)
24. Montgomery, P.L.: Speeding the Pollard and Elliptic Curve Methods of Factorization. Mathematics of Computation 48(177), 243–264 (1987),
25. Nascimento, E., Chmielewski, L., Oswald, D., Schwabe, P.: Attacking embedded ECC implementations through cmov side channels. IACR Cryptology ePrint Archive 2016, 923 (2016)
26. Oliveira, T., López, J., Hışıl, H., Faz-Hernández, A., Rodríguez-Henríquez, F.: How to (pre-)compute a ladder. In: SAC (to appear). Springer International Publishing
27. Renes, J., Smith, B.: qDSA: Small and secure digital signatures with curve-based Diffie-Hellman key pairs. IACR Cryptology ePrint Archive 2017, 518 (2017)
28. Santis, F.D., Sigl, G.: Towards Side-Channel Protected X25519 on ARM Cortex-M4 Processors. In: SPEED-B. Utrecht, The Netherlands (Oct 2016),
29. Seo, H., Kim, H.: Consecutive operand-caching method for multiprecision multiplication, revisited. J. Inform. and Commun. Convergence Engineering 13(1), 27–35
30. Seo, H., Liu, Z., Choi, J., Kim, H.: Multi-precision squaring for public-key cryptography on embedded microprocessors. In: INDOCRYPT. Lecture Notes in Computer Science, vol. 8250, pp. 227–243. Springer (2013)
... Without the cycle count/ code size trade-off discussed in Sect. 7, the code size of the proposed implementation is comparable to [40]. ...
Full-text available
Hybrid key encapsulation is in the process of becoming the de-facto standard for integration of post-quantum cryptography (PQC). Supporting two cryptographic primitives is a challenging task for constrained embedded systems. Both contemporary cryptography based on elliptic curves or RSA and PQC based on lattices require costly multiplications. Recent works have shown how to implement lattice-based cryptography on big-integer coprocessors. We propose a novel hardware design that natively supports the multiplication of polynomials and big integers, integrate it into a RISC-V core, and extend the RISC-V ISA accordingly. We provide an implementation of Saber and X25519 to demonstrate that both lattice- and elliptic-curve-based cryptography benefits from our extension. Our implementation requires only intermediate logic overhead, while significantly outperforming optimized ARM Cortex M4 implementations, other hardware/software codesigns, and designs that rely on contemporary accelerators.
... [37]. There are various implementation methods for the basic operation of Curve25519, which may vary in speed, source code size, and memory usage depending on the implementation method [34,[38][39][40]. Therefore, to achieve an unbiased experiment, the Curve 25519 source code in OpenSSL, which is used worldwide, was used as the basic source code. ...
Full-text available
When 5G telecommunication becomes a standardized and widely used communication medium, it must be implemented in coherence with certain 5G network standards and requirements. One such requirement is a Subscription Concealed Identifier called SUCI. SUCI prevents the exposure of international mobile subscriber identity (IMSI), which was a vulnerability in previous generation mobile telecommunication networks. Unlike IMSI, SUCI is encrypted and transmitted using a symmetric key cryptographic algorithm, to prevent the aforementioned vulnerabilities. However, for the first terminal to be encrypted, it is necessary to exchange a key with the home network, and this key exchange for SUCI encryption is performed through the Elliptic Curve Integrated Encryption Scheme (ECIES) key exchange algorithm, which is a public-key encryption scheme. However, ECIES uses more computing resources compared to a symmetric key cryptographic algorithm. Additionally, for 5G Subscriber Identity Deconcealing Function (SIDF) to satisfy the massive machine-type communication (mMTC) requirements of 5G, it is necessary to decrypt at least a million SUCIs within a short time. This puts a great burden on the 5G home network to provide the mMTC service for IoT. Therefore, in this paper, we propose a method of constructing 5G SIDF in an mMTC IoT environment. A key method of the proposed 5G SIDF configuration is the use of GPUs. This proposal was aimed at reducing the load in the mMTC environment by performing parallel processing of all cryptographic operations performed in the SIDF using a GPU. In particular, we focused on parallelization of public-key encryption algorithms. In addition, we also compared the method proposed in this paper through a survey of various 5G security products.
... Several works [12], [13], [14], [15], [16] show performance, power consumption, and memory use software-based optimization strategies, targeting the IoT 32-bit ARM Cortex-M4 processors, and hardware-based improvements targeting Xilinx Virtex 7 platform [17], [18]. The literature also shows multiple research lines focusing on the optimal, side-channel, and fault attack secure lower security level Curve25519 and Ed25519 protocols on the Cortex-M4 embedded device [19], [20], [21] or other target processors [22], [23], [24]. However, to the best of our knowledge, there has not been any work on Ed448 for ARMv7-M architecture. ...
Conference Paper
The demand for classical cryptography schemes continues to increase due to the exhaustive studies on their security. Thus, constant improvement of timing, power consumption, and memory requirements are needed for the most widely used classical Elliptic Curve Cryptography (ECC) primitives, suiting high-as well as low-end devices. In this work, we present the first implementation of the Edwards Curve Digital Signature Algorithm (EdDSA) based on the Ed448 targeting the ARM Cortex-M4-based STM32F407VG microcontroller, which forms a large part of the Internet of Things (IoT) world. We report timing and memory consumption results based on portable C and target-specific hand-crafted assembly code implementations of the low-level finite filed arithmetics. We optimize the high-level group operations by implementing the efficient scalar multiplication over the Ed448 isogenous map to reduce the computation complexity. Furthermore, we provide a side-channel analysis (SCA) and fault attack protected design by developing point randomization, scalar blinding techniques, and repeated signature, and evaluate the performance. Our optimized architecture performs a signature and verification in 39.88ms and 51.54ms, respectively, where SCA protection can be achieved at less than 6.4% cost of performance overhead.
... For example a 256-bit ECDSA produces a 512-bit or 64 byte signature, which is not quantum secure. Moreover, even using advanced microprocessors (rather than cheap microcontrollers), such as ARM Cortex-M4, the signature computation time is measured in hundreds of milliseconds compared to hundreds of microseconds [2] for ∼30 hashes that Bob computes in each round of RWS (w = 4096). The volume of data transmitted by Bob in one round of RWS is the same as it is under ECDSA, namely 32 bytes for the signature s and 32 bytes for the acknowledgment q. ...
Full-text available
This paper proposes and evaluates a new bipartite post-quantum digital signature protocol based on Winternitz chains and the HORS few-times signature scheme\cite{HORS}. Mutually mistrustful Alice and Bob are able to agree and sign a series of documents in a way that makes it impossible (within the assumed security model) to repudiate their signatures. Some ramifications are discussed, practical parameters evaluated and an application area delineated for the proposed concept.
The elliptic curve family of schemes has the lowest computational latency, memory use, energy consumption, and bandwidth requirements, making it the most preferred public key method for adoption into network protocols. Being suitable for embedded devices and applicable for key exchange and authentication, ECC is assuming a prominent position in the field of IoT cryptography. The attractive properties of the relatively new curve Curve448 contribute to its inclusion in the TLS1.3 protocol and pique the interest of academics and engineers aiming at studying and optimizing the schemes. When addressing low-end IoT devices, however, the literature indicates little work on these curves. In this paper, we present an efficient design for both protocols based on Montgomery curve Curve448 and its birationally equivalent Edwards curve Ed448 used for key agreement and digital signature algorithm, specifically the X448 function and the Ed448 DSA, relying on efficient low-level arithmetic operations targeting the ARM-based Cortex-M4 platform. Our design performs point multiplication, the base of the Elliptic Curve Diffie-Hellman (ECDH), in 3,2KCCs, resulting in more than 48% improvement compared to the best previous work based on Curve448, and performs sign and verify, the main operations of the Edwards-curves Digital Signature Algorithm (EdDSA), in 6,038KCCs and 7,404KCCs, showing a speedup of around \(11\%\) compared to the counterparts. We present novel modular multiplication and squaring architectures reaching \(\sim \)25% and \(\sim \)35% faster runtime than the previous best-reported results, respectively, based on Curve448 key exchange counterparts, and \(\sim \)13% and \(\sim \)25% better latency results than the Ed448-based digital signature counterparts targeting Cortex-M4 platform.KeywordsElliptic Curve CryptographyCurve448Elliptic Curve Diffie-Hellman (ECDH)Edwards-Curve Digital Signature Algorithm (EdDSA)Cortex-M4
The advances in quantum technologies and the fast move toward quantum computing are threatening classical cryptography and urge the deployment of post-quantum (PQ) schemes. The only isogeny-based candidate forming part of the third round of the standardization, the Supersingular Isogeny Key Encapsulation (SIKE) mechanism, is a subject of constant latency optimizations given its attractive compact key size lengths and, thus, its limited bandwidth and memory requirements. In this work, we present a new speed record of the SIKE protocol by implementing novel low-level finite field arithmetics targeting ARMv7-M architecture. We develop a handcrafted assembly code for the modular multiplication and squaring functions where we obtain 8.71% and 5.38% of speedup, respectively, compared to the last best-reported assembly implementations for p434. After deploying the finite field optimized architecture to the SIKE protocol, we observe 5.63%, 3.93%, 3.48%, and 1.61% of latency reduction for SIKE p434, p503, p610, and p751, respectively, targeting the NIST recommended STM32F407VG discovery board for our experiments.
Ciphertext-policy attribute-based encryption (CP-ABE) has attracted much interest from the practical community to enforce access control in distributed settings such as the Internet of Things (IoT). In such settings, encryption devices are often constrained, having small memories and little computational power, and the associated networks are lossy. To optimize both the ciphertext sizes and the encryption speed is therefore paramount. In addition, the master public key needs to be small enough to fit in the encryption device’s memory. At the same time, the scheme needs to be expressive enough to support common access control models. Currently, however, the state of the art incurs undesirable efficiency trade-offs. Existing schemes often have linear ciphertexts, and consequently, the ciphertexts may be too large and encryption may be too slow. In contrast, schemes with small ciphertexts have extremely large master public keys, and are generally computationally inefficient.In this work, we propose TinyABE: a novel CP-ABE scheme that is expressive and can be configured to be efficient enough for settings with embedded devices and low-quality networks. In particular, we demonstrate that our scheme can be configured such that the ciphertexts are small, encryption is fast and the master public key is small enough to fit in memory. From a theoretical standpoint, the new scheme and its security proof are non-trivial generalizations of the expressive scheme with constant-size ciphertexts by Agrawal and Chase (TCC’16, Eurocrypt’17) and its proof to the unbounded setting. By using techniques of Rouselakis and Waters (CCS’13), we remove the restrictions that the Agrawal-Chase scheme imposes on the keys and ciphertexts, making it thus more flexible. In this way, TinyABE is especially suitable for IoT devices and networks.KeywordsAttribute-based encryptionCiphertext-policy attribute-based encryptionShort ciphertextsEfficient encryption
In 2016, the National Institute of Standards and Technology (NIST) initiated a standardization process among the post-quantum secure algorithms. Forming part of the alternate group of candidates after Round 2 of the process is the Supersingular Isogeny Key Encapsulation (SIKE) mechanism which attracts with the smallest key sizes offering post-quantum security in scenarios of limited bandwidth and memory resources. Even further reduction of the exchanged information is offered by the compression mechanism, proposed by Azarderakhsh et al., which, however, introduces a significant time overhead and increases the memory requirements of the protocol, making it challenging to integrate it into an embedded system. In this paper, we propose the first compressed SIKE implementation for a resource-constrained device, where we targeted the NIST recommended platform STM32F407VG featuring ARM Cortex-M4 processor. We integrate the isogeny-based implementation strategies described previously in the literature into the compressed version of SIKE. Additionally, we propose a new assembly design for the finite field operations particular for the compressed SIKE, and observe a speedup of up to 16% and up to 25% compared to the last best-reported assembly implementations for p434, p503, and p610.
Full-text available
We present new candidates for quantum-resistant public-key cryptosystems based on the conjectured difficulty of finding isogenies between supersingular elliptic curves. The main technical idea in our scheme is that we transmit the images of torsion bases under the isogeny in order to allow the parties to construct a shared commutative square despite the non-commutativity of the endomorphism ring. We give a precise formulation of the necessary computational assumptions along with a discussion of their validity, and prove the security of our protocols under these assumptions. In addition, we present implementation results showing that our protocols are multiple orders of magnitude faster than previous isogeny-based cryptosystems over ordinary curves. This paper is an extended version of [Lecture Notes in Comput. Sci. 7071, Springer (2011), 19–34]. We add a new zero-knowledge identification scheme and detailed security proofs for the protocols. We also present a new, asymptotically faster, algorithm for key generation, a thorough study of its optimization, and new experimental data.
Full-text available
This work deals with the energy-efficient, high-speed and high-security implementation of elliptic curve scalar multiplication, elliptic curve Diffie-Hellman (ECDH) key exchange and elliptic curve digital signatures on embedded devices using FourQ and incorporating strong countermeasures to thwart a wide variety of side-channel attacks. First, we set new speed records for constant-time curve-based scalar multiplication, DH key exchange and digital signatures at the 128-bit security level with implementations targeting 8, 16 and 32-bit microcontrollers. For example, our software computes a static ECDH shared secret in 6.9 million cycles (or 0.86 seconds @8MHz) on a low-power 8-bit AVR microcontroller which, compared to the fastest Curve25519 and genus-2 Kummer implementations on the same platform, offers 2x and 1.4x speedups, respectively. Similarly, it computes the same operation in 495 thousand cycles on a 32-bit ARM Cortex-M4 microcontroller, achieving a factor-1.9 speedup when compared to the fastest Curve25519 implementation targeting another Cortex-M4 platform. A similar speed performance is observed in the case of digital signatures. Second, we engineer a set of side-channel countermeasures taking advantage of FourQ's rich arithmetic and propose a secure implementation that offers protection against a wide range of sophisticated side-channel attacks, including differential power analysis (DPA). Despite the use of strong countermeasures, the experimental results show that our FourQ software is still efficient enough to outperform implementations of Curve25519 that only protect against timing attacks. Finally, we perform a differential power analysis evaluation of our software running on an ARM Cortex-M4, and report that no leakage was detected with up to 10 million traces. These results demonstrate the potential of deploying FourQ on low-power applications such as protocols for the Internet of Things.
Full-text available
In the RFC 7748 memorandum, the Internet Research Task Force specified a Montgomery-ladder scalar multiplication function based on two recently adopted elliptic curves, "curve25519" and "curve448". The purpose of this function is to support the Diffie-Hellman key exchange algorithm that will be included in the forthcoming version of the Transport Layer Security cryptographic protocol. In this paper, we describe a ladder variant that permits to accelerate the fixed-point multiplication function inherent to the Diffie-Hellman key pair generation phase. Our proposal combines a right-to-left version of the Montgomery ladder along with the pre-computation of constant values directly derived from the base-point and its multiples. To our knowledge, this is the first proposal of a Montgomery ladder procedure for prime elliptic curves that admits the extensive use of pre-computation. In exchange of very modest memory resources and a small extra programming effort, the proposed ladder obtains significant speedups for software implementations. Moreover, our proposal fully complies with the RFC 7748 specification. A software implementation of the X25519 and X448 functions using our pre-computable ladder yields an acceleration factor of roughly 1.20, and 1.25 when implemented on the Haswell and the Skylake micro-architectures, respectively.
Side-channel attacks against implementations of elliptic-curve cryptography have been extensively studied in the literature and a large tool-set of countermeasures is available to thwart different attacks in different contexts. The current state of the art in attacks and countermeasures is nicely summarized in multiple survey papers, the most recent one by Danger et al. [21]. However, any combination of those countermeasures is ineffective against attacks that require only a single trace and directly target a conditional move (cmov) – an operation that is at the very foundation of all scalar-multiplication algorithms. This operation can either be implemented through arithmetic operations on registers or through various different approaches that all boil down to loading from or storing to a secret address. In this paper we demonstrate that such an attack is indeed possible for ECC software running on AVR ATmega microcontrollers, using a protected version of the popular μNaCl library as an example. For the targeted implementations, we are able to recover 99.6% of the key bits for the arithmetic approach and 95.3% of the key bits for the approach based on secret addresses, with confidence levels 76.1% and 78.8%, respectively. All publicly available ECC software for the AVR that we are aware of uses one of the two approaches and is thus in principle vulnerable to our attack.
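The "arithmetic operations on registers" form of a conditional move mentioned above can be sketched as follows. This is a minimal illustration only, not code from μNaCl or from the cited paper: the names `cmov_arith` and `cswap_arith` and the 32-bit word size are assumptions, and Python integers are not constant-time, so the sketch shows the masking arithmetic rather than a side-channel-safe implementation.

```python
MASK32 = 0xFFFFFFFF  # assumed word size: one 32-bit register

def cmov_arith(a: int, b: int, bit: int) -> int:
    """Return b if bit == 1, else a, with no branch on the secret bit."""
    mask = (-bit) & MASK32          # 0x00000000 when bit == 0, 0xFFFFFFFF when bit == 1
    return a ^ (mask & (a ^ b))     # a stays unchanged, or is XORed into b

def cswap_arith(a: int, b: int, bit: int) -> tuple[int, int]:
    """Swap a and b when bit == 1; the ladder step uses this form."""
    t = ((-bit) & MASK32) & (a ^ b)
    return a ^ t, b ^ t
```

The mask takes only two values, all zeros or all ones, which is consistent with the abstract's point that the operation can be targeted in a single trace even when no branch or secret-dependent address is involved.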
This work deals with the energy-efficient, high-speed and high-security implementation of elliptic curve scalar multiplication and elliptic curve Diffie-Hellman (ECDH) key exchange on embedded devices using Fourℚ and incorporating strong countermeasures to thwart a wide variety of side-channel attacks. First, we set new speed records for constant-time curve-based scalar multiplication and DH key exchange at the 128-bit security level with implementations targeting 8, 16 and 32-bit microcontrollers. For example, our software computes a static ECDH shared secret in ~6.9 million cycles (or 0.86 s @8 MHz) on a low-power 8-bit AVR microcontroller which, compared to the fastest Curve25519 and genus-2 Kummer implementations on the same platform, offers 2× and 1.4× speedups, respectively. Similarly, it computes the same operation in ~496 thousand cycles on a 32-bit ARM Cortex-M4 microcontroller, achieving a factor-2.9 speedup when compared to the fastest Curve25519 implementation targeting the same platform. Second, we engineer a set of side-channel countermeasures taking advantage of Fourℚ's rich arithmetic and propose a secure implementation that offers protection against a wide range of sophisticated side-channel attacks. Finally, we perform a differential power analysis evaluation of our software running on an ARM Cortex-M4, and report that no leakage was detected with up to 10 million traces. These results demonstrate the potential of deploying Fourℚ on low-power applications such as protocols for IoT.
Multi-precision multiplication is one of the most fundamental operations on microprocessors to allow public-key cryptography such as RSA and elliptic curve cryptography (ECC). In this paper, we present a novel multiplication technique that increases the performance of multiplication by sophisticated caching of operands. Our method significantly reduces the number of needed load instructions, which are usually among the most expensive operations on modern processors. We evaluate our new technique on an 8-bit ATmega128 and a 32-bit ARM7TDMI microcontroller and compare the results with existing solutions. For the ATmega128, our implementation needs only 2395 clock cycles for a 160-bit multiplication. The number of required load instructions is reduced from 167 (needed for the best known hybrid multiplication) to only 80. On the ARM7TDMI, our implementation needs only 281 clock cycles as opposed to 357. For both platforms, the proposed technique outperforms related work by about 10–23%. We also show that the method scales very well even for larger integer sizes (required for RSA) and limited register sets. It fully complies with existing multiply–accumulate instructions that are integrated in most of the available processors.
qDSA is a high-speed, high-security signature scheme that facilitates implementations with a very small memory footprint, a crucial requirement for embedded systems and IoT devices, and that uses the same public keys as modern Diffie–Hellman schemes based on Montgomery curves (such as Curve25519) or Kummer surfaces. qDSA resembles an adaptation of EdDSA to the world of Kummer varieties, which are quotients of algebraic groups by \(\pm 1\). Interestingly, qDSA does not require any full group operations or point recovery: all computations, including signature verification, occur on the quotient where there is no group law. We include details on four implementations of qDSA, using Montgomery and fast Kummer surface arithmetic on the 8-bit AVR ATmega and 32-bit ARM Cortex M0 platforms. We find that qDSA significantly outperforms state-of-the-art signature implementations in terms of stack usage and code size. We also include an efficient compression algorithm for points on fast Kummer surfaces, reducing them to the same size as compressed elliptic curve points for the same security level.
We introduce Fourℚ, a high-security, high-performance elliptic curve that targets the 128-bit security level. At the highest arithmetic level, cryptographic scalar multiplications on Fourℚ can use a four-dimensional Gallant-Lambert-Vanstone decomposition to minimize the total number of elliptic curve group operations. At the group arithmetic level, Fourℚ admits the use of extended twisted Edwards coordinates and can therefore exploit the fastest known elliptic curve addition formulas over large prime characteristic fields. Finally, at the finite field level, arithmetic is performed modulo the extremely fast Mersenne prime p = 2^127 − 1. We show that this powerful combination facilitates scalar multiplications that are significantly faster than all prior works. On Intel's Haswell, Ivy Bridge and Sandy Bridge architectures, our software computes a variable-base scalar multiplication in 59,000, 71,000 and 74,000 cycles, respectively; and, on the same platforms, our software computes a Diffie-Hellman shared secret in 92,000, 110,000 and 116,000 cycles, respectively.
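Why reduction modulo a Mersenne prime is cheap can be seen in a short sketch: since 2^127 ≡ 1 (mod p), the high half of a product is simply added to the low half. The function names `reduce_p127` and `mul_p127` are hypothetical and this is not the Fourℚ authors' code; it covers arithmetic modulo p only, and Python's big integers stand in for the multi-word arithmetic a real implementation would use.

```python
P127 = (1 << 127) - 1  # the Mersenne prime p = 2^127 - 1

def reduce_p127(x: int) -> int:
    """Reduce 0 <= x < 2^254 modulo p using only shifts, masks and additions."""
    x = (x & P127) + (x >> 127)    # first fold: result is below 2^128
    x = (x & P127) + (x >> 127)    # second fold: result is at most p
    return 0 if x == P127 else x   # map the non-canonical value p to 0

def mul_p127(a: int, b: int) -> int:
    """Multiply two residues in [0, p) and reduce the 254-bit product."""
    return reduce_p127(a * b)
```

No division or trial subtraction loop is needed, in contrast to a generic modulus.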