Content uploaded by Diego F. Aranha
Author content
All content in this area was uploaded by Diego F. Aranha on Jul 22, 2019
Content may be subject to copyright.
Curve25519 for the Cortex-M4 and beyond
Hayato Fujii and Diego F. Aranha
Institute of Computing – University of Campinas
hayato@lasca.ic.unicamp.br, dfaranha@ic.unicamp.br
Abstract.
We present techniques for the implementation of a key ex-
change protocol and digital signature scheme based on the Curve25519
elliptic curve and its Edwards form, respectively, in resource-constrained
ARM devices. A possible application of this work consists of TLS deploy-
ments in the ARM Cortex-M family of processors and beyond. These
devices are located towards the lower to mid-end spectrum of ARM
cores, and are typically used on embedded devices. Our implementa-
tions improve the state-of-the-art substantially by making use of novel
implementation techniques and features specific to the target platforms.
Keywords: ECC, Curve25519, X25519, Ed25519, ARM Cortex-M.
1 Introduction
The growing number of devices connected to the Internet collecting and storing
sensitive information raises concerns about the security of their communications
and of the devices themselves. Many of them are equipped with microcontrollers
constrained in terms of computing or storage capabilities, and lack tamper re-
sistance mechanisms or any form of physical protection. Their attack surface
is widely open, ranging from physical exposure to attackers and ease of access
through remote availability. While designing and developing efficient and secure
implementations of cryptography is not a new problem and has been an active
area of research since at least the birth of public-key cryptography, the appli-
cation scenarios for these new devices imposes new challenges to cryptographic
engineering.
A possible way to deploy security in new devices is to reuse well-known and
well-analyzed building blocks, such as the Transport Layer Security (TLS) proto-
col. In comparison with reinventing the wheel using a new and possibly proprietary
solution, this has a major advantage of avoiding risky security decisions that may
repeat issues already solved in TLS. In RFC 7748 and RFC 8032, published by
the Internet Engineering Task Force (IETF), two cryptographic protocols based
on the Curve25519 elliptic curve and its Edwards form are recommended and
slated for future use in the TLS suite: the Diffie-Hellman key exchange using
Curve25519 [2] called X25519 [3] and the Ed25519 digital signature scheme [5].
These schemes rely on a careful choice of parameters, favoring secure and efficient
implementations of finite field and elliptic curve arithmetic with smaller room
for mistakes due to their overall implementation simplicity.
Special attention must be given to side-channel attacks, in which operational
aspects of the implementation of a cryptographic algorithm may leak internal state
2 Hayato Fujii and Diego F. Aranha
information that allows an attacker to retrieve secret information. Secrets may leak
through the communication channel itself, power consumption, execution time or
radiation measurements. Information leaked through cache latency or execution
time already allows powerful timing attacks against naive implementations of
symmetric and public-key cryptography [20]. More intrusive attacks also attempt
to inject faults at precise execution times, in hope of corrupting execution state
to reveal secret information [8]. Optimizing such implementations to achieve
an ideal balance between resource efficiency and side-channel resistance further
complicates matters, beckoning algorithmic advances and novel implementation
strategies.
This work presents techniques for efficient, compact and secure implementation
against timing and caching attacks of both X25519 and Ed25519 algorithms, with
an eye towards possible application for TLS deployments on constrained ARM
processors
1
. Our main target platform is the Cortex-M family of microcontrollers
starting from the Cortex-M4, but the same techniques can be used in higher-end
CPUs such as the ARM Cortex-A series.
Contributions.
We first present an ARM-optimized implementation of the
finite field arithmetic modulo the prime
𝑝
= 2
255 −
19. The main contribution in
terms of novelty is an efficient multiplier largely employing the powerful multiply-
and-accumulate DSP instructions in the target platform. The multiplier uses
the full 32-bit width and allows to remove any explicit addition instructions to
accumulate results or propagate carries, as all of these operations are combined
with the DSP instructions. These instructions are not present in the Cortex-M0
microcontrollers, and present a variable-time execution in the Cortex-M3 [13],
hence the choice of focusing out efforts on the Cortex-M4 and beyond. The same
strategy used for multiplication is adapted to squarings, with similar success
in performance. Following related work [11], intermediate results are reduced
modulo 2𝑝and ultimately reduced modulo 𝑝at the end of computation.
Finite field arithmetic is then used to implement the higher levels, including
group arithmetic and cryptographic protocols. The key agreement implemen-
tation uses homogeneous projective coordinates in the Montgomery curve in
order to take advantage of the constant-time Montgomery ladder as the scalar
multiplication algorithm, protecting against timing attacks. The digital signature
scheme implementation represents points on a Twisted Edwards curve using
projective extended coordinates, benefiting of efficient and unified point addition
and doubling. The most time-consuming operation in the scheme is the fixed-
point scalar multiplication, which uses the signed comb method as introduced by
Hamburg [15], using approximately 7.5 KiB of precomputed data and running
approximately two times faster in comparison to the Montgomery ladder. Side-
channel security is achieved using isochronous (constant time) code execution and
linear table scan countermeasures. We also evaluate a different way to implement
the conditional selection operation in terms of their potential resistance against
profiled power attacks [25]. Experiments conducted on a Cortex-M4 development
1https://www.keil.com/pack/doc/mw/Network/html/use_mbed_tls.html
Curve25519 for the Cortex-M4 and beyond 3
board indicate that our work provides the fastest implementations of these specific
algorithms for our main target architecture.
Organization.
Section 2 briefly describes features of our target platform. Sec-
tion 3 documents related works in this area, by summarizing previous implementa-
tions of elliptic curve algorithms in ARM microcontrollers. In Section 4 discusses
our techniques for finite field arithmetic in detail, focusing in the squaring and
multiplication operations. Section 5 describes the algorithms using elliptic curve
arithmetic in the key exchange and digital signatures scenarios. Finally, Section 6
presents experimental results and implementation details.
Code Avaliability.
For reproducibility, the prime field multiplication code is pub-
licly available at
https://github.com/hayatofujii/curve25519-cortex-m4
.
2 ARMv7 architecture
The ARMv7 architecture is a reduced instruction set computer (RISC) using a
load-store architecture. Processors with this technology are equipped with 16
registers: 13 general purpose, one as the program counter (
pc
), one as the stack
pointer (
sp
), and the last one as the link register (
lr
). The latter can be freed up
by saving it in slower memory and retrieving it after the register has been used.
The processor core has a three-stage pipeline which can be used to optimize
batch memory operations. Memory access involving
𝑛
registers in these processors
takes
𝑛
+ 1 cycles if there are no dependencies (for example, when a loaded
register is the address for a consecutive store). This can happen either in a
sequence of loads and stores or during the execution of instructions involving
multiple registers simultaneously.
The ARMv7E-M instruction set is also comprised of standard instructions for
basic arithmetic (such as addition and addition with carry) and logic operations,
but differently from other lower processors classes, the Cortex-M4 has support
for the so-called DSP instructions, which include multiply-and-accumulate (MAC)
instructions:
–
Unsigned MULtiply Long:
UMULL rLO, rHI, a, b
takes two unsigned integer
words
𝑎
and
𝑏
and multiplies them; the upper half result is written back to
rHI and the lower half is written into rLO.
–
Unsigned MULtiply Accumulate Long:
UMLAL rLO, rHI, a, b
takes un-
signed integer words
𝑎
and
𝑏
and multiplies them; the product is added
and written back to the double word integer stored as (rHI, rLO).
–
Unsigned Multiply Accumulate Accumulate Long:
UMAAL rLO, rHI, a, b
takes unsigned integer words
𝑎
and
𝑏
and multiplies them; the product is
added with the word-sized integer stored in
rLO
then added again with the
word-sized integer
rHI
. This double-word integer is then written back into
rLO and rHI, respectively the lower half and the upper half of the result.
ARM’s Technical Reference Manual of the Cortex-M4 core [1] states that
all the mentioned MAC instructions take one CPU cycle for execution in the
4 Hayato Fujii and Diego F. Aranha
Cortex-M4 and above. However, those instructions deterministically take an
extra three cycles to write the lower half of the double-word result, and a final
extra cycle to write the upper half. Therefore, proper instruction scheduling is
necessary in order to avoid pipeline stalls and to make best use of the delay slots.
The ARM Cortex-A cores are computationally more powerful than their
Cortex-M counterparts. Cortex-A based processors can run robust operating
systems due to extra auxiliary hardware; additionally, they may have a NEON
engine, which is a Single Instruction-Multiple Data (SIMD) unit. Aside from
that, those processors may have sophisticated out-of-order execution and extra
pipeline stages.
3 Related work
Research in curve-based cryptography proceeds in several directions: looking for
efficient elliptic curve parameters, instantiating and implementing the respective
cryptographic protocols, and finding new applications. More recently, isogeny-
based cryptography [18], which uses elliptic curves, was proposed as candidates
for post-quantum cryptography.
3.1 Scalar multiplication
Düll et al. [11] implemented X25519 and its underlying field arithmetic on
a Cortex-M0 processor, equipped with a 32
×
32
→
32-bit multiplier. Since
this instruction only returns the lower part of the product, this multiplier is
abstracted as a smaller one (16
×
16
→
32) to facilitate a 3-level Refined Karatsuba
implementation, taking 1294 cycles to complete on the same processor. Their
256-bit squaring uses the same multiplier strategy with standard tricks to save
up repeated operations, taking 857 cycles. Putting all together, an entire X25519
operation takes about 3.6M cycles with approximately 8 KiB of code size.
On the Cortex-A family of processor cores, implementers may use NEON, a
SIMD instruction set executed in its own unit inside the processor. Bernstein
and Schwabe [7] reported 527,102 Cortex-A8 cycles for the X25519 function. In
the elliptic curves formulae used in their work, most of the multiplications can
be handled in a parallel way, taking advantage of NEON’s vector instructions
and Curve25519’s parallelization opportunities.
The authors are not aware of an Ed25519 implementation specifically targeting
the Cortex-M4 core. However, Bernstein’s work using Cortex-A8’s NEON unit
reports 368,212 cycles to sign a short message and 650,102 cycles to verify its
validity. The authors point out that 50 and 25 thousand cycles of signing and
verification are spent by SUPERCOP-choosen SHA-512 implementation, with
room for further improvements.
Four
Q
is an elliptic curve providing about 128 bits of security equipped
with the endomorphisms
𝜓
and
𝜑
, providing efficient scalar multiplication [9].
Implementations of key exchange over this elliptic curve in different software and
hardware platforms show a factor-2 speedup in comparison to Curve25519 and
Curve25519 for the Cortex-M4 and beyond 5
factor-5 speedup in comparison to NIST’s P-256 curve [22]. Liu et al. reported [22]
a 559,200 cycle count on an ARM Cortex-M4 based processor of their 32-bit
implementation of the Diffie-Hellman Key Exchange in this curve.
Generating keys and Schnorr-like signatures over Four
Q
takes about 796,000
cycles on a Cortex-M4 based processor, while verification takes about 733,000
cycles on the same CPU [22]. Key generation and signing are aided by a 80-
point table taking 7.5KiB of ROM, and verification is assisted by a 256-point
table, using 24 KiB of memory. Quotient DSA (qDSA) [27] is a novel signature
scheme relying on Kummer arithmetic in order use the same key for DH and
digital signature schemes. It relies only on the
𝑥
-coordinate with the goal of
reducing stack usage and the use of optimized formulae for group operations.
When instantiated with Curve25519, it takes about 3 million cycles to sign a
message and 5.7 million cycles to verify it in a Cortex-M0. This scheme does not
rely on an additional table for speedups since low code size is an objective given
the target architecture, although this can be done using the ideas from [26] with
increased ROM usage.
3.2 Modular Multiplication
Field multiplication is usually the most performance-critical operation, because
other non-trivial field operations, such as inversion, are avoided by algorithmic
techniques. Multiprecision multiplication algorithms can be ranked on how many
single word multiplications are performed. For example, operand scanning takes
𝑂
(
𝑛2
)multiplications, where
𝑛
is the number of words. Product scanning takes the
same number of word multiplications, but reduces the number of memory access by
accumulating intermediate results in registers. One of the most popular algorithms
that asymptotically reduces such complexity is the Karatsuba multiplication,
which takes the computational cost down to
𝑂
(
𝑛log23
). This algorithm performs
three multiplications of, usually, half-sized operands, thus giving it a divide-
and-conquer structure and lowering its asymptotic complexity. As an example
of such an application, De Santis and Sigl [28] X25519 implementation on the
Cortex-M4 features a two-level Karatsuba multiplier implementation, splitting a
256-bit multiplier down to 64-bit multiplications, each one taking four hardware-
supported 32 ×32 →64 multiplication instructions.
Memory accesses can be accounted for part of the time consumed by the
multiplication routine. Thus, algorithms and instruction scheduling methods
which minimize those memory operations are highly desirable, specially on not-
so-powerful processors with slow memory access. This problem can be addressed
by scheduling the basic multiplications following the product scanning strategy,
which can be seen as a rhombus-like structure. However, following this scheme
in its traditional way requires multiple stores and loads from memory, since the
number of registers available may be not sufficient to hold the full operands.
Improvements to reduce the amount of memory operations are present in the
literature: namely, Operand Caching due to Hutter and Wegner [17], further
improved by the Consecutive Operand Caching [30] and the Full Operand Caching,
both due to Seo et al. [29].
6 Hayato Fujii and Diego F. Aranha
The Operand Caching technique reduces the number of memory accesses in
comparison to the standard product-scanning by caching data in registers and
storing part of the operands in memory. This method resembles the product
scanning approach, but instead of calculating a word in its entirety, rows are
introduced to compute partial sub-products from each column. This method is
illustrated in Figure 1.
A[7]B[7]
A[0]B[7]
A[7]B[0]
A[0]B[0]
C[0]
C[7]
C[14]
r0
r1
binit
...
...
Fig. 1.
Operand Caching. Each dot in the rhombus represents one-word multiplication;
each column, at the end of its evaluation, is a (partial) word of the product. Vertical
lines represents additions.
This method divides product scanning in two steps:
– Initial Block
The first step loads part of the operands, and proceeds to
calculate the upper part of the rhombus using classical product-scanning.
– Rows
In the rightmost part, most of the necessary operands are already
loaded from previous calculations, requiring only some extra, low-count
operand loads, depending on row width. Product scanning is done until the
row ends. Note that, at the end of each column, parts of the operands are
previously loaded, hence a small quantity of loads is necessary to evaluate
the next column.
At every row change, new operands needs to be reloaded, since the current
operands in the registers are not useful at the start of the new row. Consecutive
Operand Caching avoids those memory access by rearranging the rows and
further improving the quantity of operands already in registers. This algorithm
is depicted in Figure 2.
Note that during the transition between the bottommost row and the one
above, part of the operands are already available in registers, solving the reload
problem between row changes. Let
𝑛
be the number of “limbs”,
𝑟
the number of
available working registers and number of rows
𝑒
. Full Operand Caching further
improves the quantity of memory access in two cases: if
𝑛−𝑟𝑒 < 𝑒
, the Full
Operand Caching structure looks like the original Operand Caching, but with a
different multiplication order. Otherwise, Consecutive Operand Caching bottom
Curve25519 for the Cortex-M4 and beyond 7
A[7]B[7]
A[0]B[7]
A[7]B[0]
A[0]B[0]
C[0]
C[7]
C[14]
r0
r1
binit
...
...
Fig. 2. Consecutive Operand Caching
row’s length is adjusted in order to make full use of all available registers at the
next row’s processing.
3.3 Modular Squaring
The squaring routine can be built by repeating the operands and using the
multiplication procedure, saving ROM space. Alternatively, writing a specialized
procedure can save cycles by duplicating partial products [23]. The squaring
implementation in [28] follows this strategy, specializing the 64-bit multiplication
routine to 8 cycles, down from 10. Partial products are calculated then added
twice to the accumulator and the resulting carry is rippled away.
Following the product scanning method, Seo’s Sliding Block Doubling [30]
halves the rhombus structure, allowing to use more registers to store part of the
operands and doubling partial products. The algorithm is illustrated in Figure 3
and can be divided in three parts:
C[0]
C[7]
C[14] ...
...
A[7]A[7]
A[7]A[0]
A[0]A[0]
Fig. 3.
Sliding Block Doubling. Black dots represent multiplications and white dots
represent squarings.
8 Hayato Fujii and Diego F. Aranha
– Partial Products of the Upper Part Triangle
: an adaption of product
scanning calculates partial products (represented by the black dots at the
superior part of the rhombus in Figure 3) and saves them to memory.
– Sliding Block Doubling of Partial Products
: Each result of the column
is doubled by left shifting each result by one, effectively duplicating the
partial products. This process must be done in parts because the number of
available registers is limited, since they hold parts of the operand.
– Remaining Partial Products of the Bottom Line
: The bottom line
multiplications are squares of part of the operand. These products must be
added to their respective partial result of its above column.
4 Implementation of F2255−19 Arithmetic
Our implementation aims for efficiency, so specific ARM Assembly is thoroughly
used and code size is moderately sacrificed for speed. Code portability is a non-
goal, so each 255-bit integer field element is densely represented, using 2
32
-radix,
implying in eight “limbs” of 32 bits, each one are in a little-endian format. This
contrasts with the reference implementation [2], which use 25 or 26 bits in 32-bits
words, allowing carry values requiring proper handling at the expense of more
cycles.
Modular Reduction.
We call as “weak” reduction a reduction modulo 2
256 −
38,
performed at the end of every field operation in order to avoid extra carry
computations between operations, as in [11]; this reduction finds a integer lesser
then 2
256
that is congruent modulo 2
255−
19. When necessary, a “strong” reduction
modulo 2
255 −
19 is performed, much like when data must be sent over the wire.
This strategy is justified over the extra 10% difference between the “strong” and
the “weak” reduction.
Addition and Subtraction.
256-bit addition is implemented by respectively
adding each limb in a lower to higher element fashion. The carry flag, present
in the ARM status registers, is used to ripple the carry across the limbs. In
order to avoid extra bits generated by the final sum, the result is weakly reduced.
Subtraction follows a similar strategy.
Multiplication by a 32-bit Word.
Multiplication by a single word follows the
algorithm described in [28], used to multiply a long integer by 121666, operation
required to double a point on Curve25519.
Inversion.
This operation follows the standard Itoh-Tsujii addition-chain ap-
proach to compute
𝑎𝑝−2≡𝑎−1
(mod
𝑝
), using 11 multiplications and 254 field
squarings as proposed in [2]. Adding up the costs, inversion turns to be the most
expensive field operation in our implementation.
4.1 Multiplication
The 256
×
256
→
512-bit multiplication follows a product-scanning like approach;
more specifically, Full Operand-Caching. As mentioned in Section 3, parameters
Curve25519 for the Cortex-M4 and beyond 9
for this implementation are
𝑛
= 8,
𝑒
= 3,
𝑟
=
⌊𝑛/𝑒⌋
= 2; since
⌊3/2⌋<
8
−
2
·
3
≤
3,
so Full Operand Caching with a Consecutive-like structure yields the best option.
Catching the Carry Bit.
Using product scanning to calculate partial products
with a double-word multiplier implies adding partial products of the next column,
which in turn might generate carries. A partial column, divided in rows in a
manner as described in Operand Caching, can be calculated using Algorithm 1; an
example of implementation in ARM Assembly is shown in Listing 1.1. Notation
follows as
(𝜀, 𝑧)←𝑤
meaning
𝑧←𝑤mod
2
𝑊
and
𝜀←
0if
𝑤∈0,2𝑊
,
otherwise
𝜀←
1, where
𝑊
is the bit-size of a word;
(𝐴𝐵)
denotes a 2
𝑊
-bit word
obtained by concatenating the 𝑊-bit words 𝐴and 𝐵.
Algorithm 1 Column computation in product scanning.
Input:
Operands
𝐴, 𝐵
; column index
𝑘
; partial product
𝑅𝑘
(calculated during column
𝑘−1); accumulated carry 𝑅𝑘+1 (generated from sum of partial products).
Output:
(Partial) product
𝐴𝐵
[
𝑘
]; sum
𝑅𝑘+1
(higher half part of the partial product
for column
𝑘
+ 1); accumulated carry
𝑅𝑘+2
(generated from sum of partial products).
𝑅𝑘+2 ←0
for all (𝑖, 𝑗 )|𝑖+𝑗=𝑘, 0≤𝑖 < 𝑗 ≤𝑛−1do
𝑇←0
(𝑇 𝑅𝑘)←𝐴[𝑖]×𝐵[𝑗] + (𝑇, 𝑅𝑘)
(𝜀, 𝑅𝑘+1)←𝑇+𝑅𝑘+1
𝑅𝑘+2 ←𝑅𝑘+2 +𝜀
end for
𝐴𝐵[𝑘]←𝑅𝑘
return 𝐴𝐵[𝑘],𝑅𝑘+1 ,𝑅𝑘+2
Listing 1.1. ARM code for calculating a column in product scanning.
@k=6
@ r5 and r4 hold R_6, R_7 respectively
@ r6, r7, r8 hold A[3], A[4] and A[5] respectively
@ r9, r10, r11 hold B[3], B[1], B[2] respectively
MOV r12, #0
MOV r3, #0
UMLAL r5, r12, r8, r10 @ A5 B1
ADDS r4, r4, r12
ADC r3, r3, #0
MOV r14, #0
UMLAL r5, r14, r7, r11 @ A4 B2
ADDS r4, r4, r14
ADC r3, r3, #0
MOV r12, #0
UMLAL r5, r12, r6, r9 @ A3 B3
ADDS r4, r4, r12
ADC r3, r3, #0
@ r5 holds AB[6], r4 holds R_7, @ r3 holds R_8
10 Hayato Fujii and Diego F. Aranha
One possible optimization is
delaying the carry bit
: eliminating the last
addition of Algorithm 1, this addition can be deferred to the next column with
the use of a single instruction to add the partial products and the carry bit. This
is easier on ARM processors, where there is fine-grained control of whether or not
instructions may update the processor flags. Other optimizations involve proper
register allocation in order to avoid reloads, saving up a few cycles.
Carry Elimination.
Storing partial products in extra registers without adding
them avoids potential carry values. In a trivial implementation, a register accu-
mulator may be used to add the partial values, potentially generating carries. The
UMAAL
instruction can be employed to perform such addition, while also taking
advantage of the multiplication part to further calculate more partial products.
This instruction never generates a carry bit, since (2
𝑛−
1)
2
+2(2
𝑛−
1) = (2
2𝑛−
1),
eliminating the need for carry handling. Partial products generated by this instruc-
tion can be forwarded to the next multiply-accumulate(-accumulate) operation;
this goes on until all rows are processed. Algorithm 2 and Listing 1.2 illustrate
how a column from product-scanning can be evaluated following this strategy.
Algorithm 2 Column computation in product scanning, eliminating carries.
Input:
Operands
𝐴, 𝐵
; column index
𝑘
;
𝑚
partial products
𝑅𝑘
[0
,...,𝑚−
1] (calculated
during column 𝑘−1and stored in registers).
Output:
Partial product
𝐴𝐵
[
𝑘
];
𝑚
partial products
𝑅𝑘+1
[0
,...,𝑚−
1] (higher half
part of the calculated partial product for column 𝑘+ 1 stored in registers).
𝑡←1
for all (𝑖, 𝑗 )|𝑖+𝑗=𝑘, 0≤𝑖 < 𝑗 ≤𝑛−1do
(𝑅𝑘[𝑡]𝑅𝑘[0]) ←𝐴[𝑖]×𝐵[𝑗] + 𝑅𝑘[0] + 𝑅𝑘[𝑡]
𝑅𝑘+1[𝑡−1] ←𝑅𝑘[𝑡]
𝑡←𝑡+ 1
end for
𝐴𝐵[𝑘]←𝑅𝑘[0]
return 𝐴𝐵[𝑘],𝑅𝑘+1 [0,...,𝑚−1]
Listing 1.2.
ARM code for calculating a column in product scanning without carries.
@k=6
@ r3, r4, r12 and r5 hold R_6[0,1,2,3]
@ r6, r7, r8 hold A[3], A[4] and A[5] respectively
@ r9, r10, r11 hold B[3], B[1], B[2] respectively
UMAAL r3, r4, r8, r10 @ A5 B1
UMAAL r3, r12, r7, r11 @ A4 B2
UMAAL r3, r5, r6, r9 @ A3 B3
@ r3 holds (partially) AB[6]
@ r4, r5 and r12 hold partial products for k = 7
Note that this strategy is limited by the number of working registers available.
These registers hold partial products without adding them up, avoiding the need
of carry handling, so strategies diving columns into rows like in Operand Caching
are desirable.
Curve25519 for the Cortex-M4 and beyond 11
4.2 Squaring
The literature suggests the use of a multiplication algorithm similar to Schoolbook
[30], but saving up registers and repeated multiplications. Due to its similarity
with product scanning (and the possibility to apply the above optimization
techniques), we choose the Sliding Block Doubling algorithm as squaring routine.
Note that, with the usage of carry flag present in some machine architectures,
both Sliding Block Doubling and the Bottom Line steps (as described in Section
3) can be efficiently computed. In order to avoid extra memory access, one can
implement those two routines without reloading operands; because of the need of
the carry bit in both those operations, high register pressure may arise in order to
save them into registers. We propose a technique to alleviate the register pressure:
calculating a few multiplications akin to the Initial Block step as presented in
the Operand Caching reduces register usage, allowing proper carry catching and
handling in exchange for a few memory accesses (Figure 4).
C[0]
C[7]
C[14] ...
...
A[7]A[7]
A[7]A[0]
A[0]A[0]
binit
Fig. 4. Sliding Block Doubling with Initial Block
In this example, each product-scanning column is limited to height 2, meaning
that only two consecutive multiplications can be handled without losing partial
products. Incrementing the size of the “initial block” (or, more accurately, the
initial triangle) frees up registers during the bottom row evaluation.
5 Elliptic curves
An elliptic curve
𝐸
over a field
F𝑞
is the set of solutions (
𝑥, 𝑦
)
∈F𝑞×F𝑞
which
satisfy the Weierstrass equation
𝐸/F𝑞:𝑦2+𝑎1𝑥𝑦 +𝑎3𝑦=𝑥3+𝑎2𝑥2+𝑎4𝑥+𝑎6,(1)
where
𝑎1, 𝑎2, 𝑎3, 𝑎4, 𝑎6∈F𝑞
and the curve discriminant is
𝛥
= 0. We restrict our
attention to curves defined over prime fields which can be represented in the
Montgomery [24] (or Twisted Edwards [4]) model, allowing faster formulas and
unified arithmetic [6].
The set of points
𝐸
(
F𝑞
) =
{
(
𝑥, 𝑦
)
∈𝐸
(
F𝑞
)
} ∪ {𝒪}
=
{𝑃∈𝐸
(
F𝑞
)
} ∪ {𝒪}
under the addition operation +(chord and tangent) forms an additive group,
12 Hayato Fujii and Diego F. Aranha
with
𝒪
as the identity element. Given an elliptic curve point
𝑃∈𝐸
(
F𝑞
)and
an integer
𝑘
, the operation
𝑘𝑃
, called scalar point multiplication, is defined by
the addition of the point
𝑃
to itself
𝑘−
1times. This operation encodes the
security assumption for Elliptic Curve Cryptography (ECC) protocols, basing
their security on the hardness of solving the elliptic curve analogue of the discrete
logarithm problem (ECDLP). Given a public key represented as a point
𝑄
in the
curve, the problem amounts to finding the secret
𝑘∈Z
such that
𝑄
=
𝑘𝑃
for
some given point 𝑃in the curve.
ECC is an efficient yet conservative option for deploying public-key cryp-
tography in embedded systems, since the ECDLP still enjoys conjectured full-
exponential security against classical computers and, consequently, reduced key
sizes and storage requirements. In practice, a conservative instance of this problem
can be obtained by selecting prime curves of near-prime order without supporting
any non-trivial endomorphisms. Curve25519 is a popular curve at the 128-bit
security level represented through the Montgomery model
Curve25519: 𝑦2=𝑥3+𝐴𝑥2+𝑥, (2)
compactly described by the small value of the coefficient
𝐴
= 486662. This
curve model is ideal for curve-based key exchanges, because it allows the scalar
multiplication to be computed using
𝑥
-coordinates only. Using a birational
equivalence, Curve25519 can be also represented in the twisted Edwards model
using full coordinates to allow instantiations of secure signature schemes:
edwards25519: −𝑥2+𝑦2= 1 −121655
121666𝑥2𝑦2.(3)
Key exchange protocols and digital signature schemes are building blocks
for applications like key distribution schemes and secure software updates based
on code signing. These protocols are fundamental for preserving the integrity of
software running in embedded devices and establishing symmetric cryptographic
keys for data encryption and secure communication.
5.1 Elliptic Curve Diffie Hellman
The Elliptic Curve Diffie Hellman protocol is an instantiation of the Diffie-
Hellman key agreement protocol over elliptic curves. Modern implementations
of this protocol employ
𝑥
-coordinate-only formulas over a Montgomery model
of the curve, for both computational savings, side-channel security and ease of
implementation. Following this idea, the protocol may be implemented using
the X25519 function, which is in essence a scalar multiplication of a point on
the Curve25519 [2]. In this scheme, a pair of entities generate their respective
private keys, each of them 32-byte long. A public, generator point
𝑃
is multiplied
by the private key, generating a public key. Then, those entities exchange their
public keys over an insecure channel; computing the X25519 function with their
private keys and the received point generates a shared secret which may be used
to generate a symmetric session key for both parties.
Curve25519 for the Cortex-M4 and beyond 13
Since the ECDH protocol does not authenticate keys, public key authentication
must be performed off-band, or an authenticated key agreement scheme such as
the Elliptic Curve Menezes-Qu-Vanstone (ECMQV) [21] must be adopted.
For data confidentiality, authenticated encryption can be constructed by
combining X25519 as an interactive key exchange mechanism, together with
a block or stream cipher and a proper mode of operation, as proposed in the
future Transport Layer Security protocol versions. Alternatively, authenticated
encryption with additional data (AEAD) schemes may be combined with X25519,
replacing block ciphers and a mode of operation.
5.2 Ed25519 digital signatures
The Edwards-curve Digital Signature Algorithm [5] (EdDSA) is a signature
scheme variant of Schnorr signatures based on elliptic curves represented in the
Edwards model. Like other discrete-log based signature schemes, EdDSA requires
a secret value, or nonce, unique to each signature. For reducing the risk of a
random number generator failure, EdDSA calculates this nonce deterministically,
as the hash of the message and the private key. Thus, the nonce is very unlikely
to be repeated for different signed messages. While this reduces the attack surface
in terms of random number generation and improves nonce misuse resistance
during the signing process, high quality random numbers are still needed for key
generation. When instantiated using edwards25519 (Equation 3), the EdDSA
scheme is called Ed25519. Concretely, let
𝐻
be the SHA512 hash function mapping
arbitrary-length strings to 512-bit hash values. The signature of a message
𝑀
under this scheme and private key
𝑘
is the 512-bit string (
𝑅, 𝑆
), where
𝑅
=
𝑟𝐵
,
for
𝐵
a generator of the subgroup of points or order
ℓ
and
𝑟
computed as
𝐻
(
𝐻
(
𝑘
)
, 𝑀
);
𝑆
=
𝑟
+
𝐻
(
𝑅, 𝐴
=
𝑎𝐵, 𝑀
)
mod ℓ
, for an integer
𝑎
derived from
𝐻
(
𝑘
). Verification works by parsing the signature components and checking if
the equation 𝑆𝐵 =𝑅+𝐻(𝑅, 𝐴, 𝑀 )𝐴holds [5].
6 Implementation details and results
The focus given in this work is microcontrollers suitable for integration within
embedded projects. Therefore, we choose some representative ARM architecture
processors. Specifically, the implementations were benchmarked on the following
platforms:
– Teensy
: Teensy 3.2 board equipped with a MK20DX256VLH7 Cortex-M4-
based microcontroller, clocked at 48 and 72 MHz.
– STM32F401C
: STM32F401 Discovery board powered by a STM32F401C
microcontroller, also based on the Cortex-M4 design, clocked at 84MHz.
– Cortex-A7/A15
: ODROID-XU4 board with a Samsung Exynos5422 CPU
clocked at 2 GHz, containing four Cortex-A7 and four Cortex-A15 cores in a
heterogeneous configuration.
14 Hayato Fujii and Diego F. Aranha
Code for the Teensy board was generated using GCC version 5.4.1 com-
piled with the
-O3 -mthumb
flags; same settings apply for code compiled to
the STM32F401C board, but using an updated compiler version (7.2.0). For
the Cortex-A family, code was generated with GCC version 6.3.1 using the
-O3
optimization flag. Cycle counts were obtained using the corresponding cy-
cle counter in each architecture. Randomness, where required, was sampled
through
/dev/urandom
on the Cortex-A7/A15 device. In the Cortex-M4 boards,
NIST’s
Hash_DRBG
was implemented with SHA256 and the generator is seeded
by analogically sampling disconnected pins on the board.
Albeit not the most efficient for every possible target, the codebase is the same
for every ARMv7 processor equipped with DSP instructions, being ideal to large
heterogeneous deployments, such as a network of smaller sensors connected to a
larger central server with a more powerful processor than its smaller counterparts.
This helps to improve code maintenance, avoiding possible security problems.
6.1 Field arithmetic
Table 1 presents timings and Table 3 presents code size for field operations with
implementations described in Section 4. In comparison to the current state-of-
art [28], our addition/subtraction takes 18% less cycles; the 256-bit multiplier
with a weak reduction is almost 50% faster and the squaring operation takes 30%
less cycles. The multiplication routine may be used in replacement of the squaring
if code size is a restriction, since 1
S
is approximately 0.9
M
. The implementation
of all arithmetic operations takes less code space in comparison to [28], ranging
from 20% savings in the addition to 50% for the multiplier.
As noted by Hasse [14], cycle counts on the same Cortex-M4-based controller
can be different depending on the clock frequency set on the chip. Different
clock frequencies set for the controller and the memory may cause stalls on the
former if the latter is slower. For example, the multiplication and the squaring
implementations, which rely on memory operations, use 10% more cycles when
the controller is set to a 33% higher frequency. This behavior is also present on
cryptographic schemes, as shown in Table 2.
Table 1.
Timings in cycles for arithmetic in
F2255−19
on multiple ARM processors.
Numbers for this work were taken as the average of 256 executions.
Cortex Add/Sub Mult Mult by word Square Inversion
De Groot [12] M4 73/77 631 129 563 151997
De Santis [28] M4 106 546 72 362 96337
This work
M4 @ 48 MHz (Teensy) 86 276 76 252 66634
M4 @ 72 MHz (Teensy) 86 310 76 280 75099
M4 @ 84 MHz (STM32F401C) 86 273 76 243 64425
A7 52 290 61 233 62648
A15 36 225 37 139 41978
Cortex F𝑝2Add/Sub F𝑝2Mult Mult by word F𝑝2Square F𝑝2Inversion
FourQ[22] M4 (STM32F407) 84/86 358 - 215 21056
Curve25519 for the Cortex-M4 and beyond 15
Table 2.
Timings in cycles for computing the Montgomery ladder in the X25519 key
exchange; and key generation, signature and verification of a 5-byte message in the
Ed25519 scheme. Key generation encompasses taking a secret key and computing its
public key; signature takes both keys and a message to generate its respective signature.
Numbers were taken as the average of 256 executions in multiple ARM processors.
Protocols are inherently protected against timing attacks (constant-time – CT) on the
Cortex-M4 due to the lack of cache memory, while side-channel protection is explicitly
needed in the Cortex-A. Performance penalties for side-channel protection can be
obtained by comparing the implementations with CT = Y over N in the same platform.
CT Cortex X25519 Ed25519 Key Gen. Ed25519 Sign Ed25519 Verify
De Groot [12] Y M4 1816351 - - -
De Santis [28] Y M4 1563852 - - -
This work
Y M4 @ 48 MHz (Teensy) 907240 347225 496039 1265078
Y M4 @ 72 MHz (Teensy) 1003707 379734 531471 1427923
Y M4 @ 84 MHz (STM32F401) 894391 389480 543724 1331449
Schwabe, Bernstein [7] Y A8 527102 -368212 650102
This work
N A7 - - 423058 1118806
Y A7 825914 397261 524804 -
N A15 - - 264252 776806
Y A15 572910 245377 305797 -
eBACS ref. code [10] Y A15 342477 241641 245712 730047
CT Cortex DH SchnoorQKey Gen. SchnoorQSign SchnoorQVerify
FourQ[22] Y M4 (STM32F407) 542900 265100 345400 648600
Table 3.
Code size in bytes for implementing arithmetic in
F2255−19
, X25519 and
Ed25519 protocols on the Cortex-M4. Code size for protocols considers the entire
software stack needed to perform the specific action, including but not limited to field
operations, hashing, tables for scalar multiplication and other algorithms.
Add Sub Mult Mult by word Square
De Groot [12] 44 64 1284 300 1168
De Santis [28] 138 148 1264 116 882
This work 110 108 622 92 562
Inversion X25519 Ed25519 Key Gen. Ed25519 Sign Ed25519 Verify
De Groot [12] 388 4140 - - -
De Santis [28] 484 3786 - - -
This work 328 4152 21265 22162 28240
6.2 X25519 implementation
X25519 was implemented using the standard Montgomery ladder over the
𝑥
-
coordinate. Standard tricks like randomized projective coordinates (amounting
to a 1% performance penalty) and constant-time conditional swaps were imple-
mented for side-channel protection. Cycle counts of the X25519 function executed
on the evaluated processors are shown in Table 2 and code size in Table 3.
Our implementation is 42% faster than De Santis and Sigl [28] while staying
competitive in terms of code size.
Note on conditional swaps.
The classical conditional swap using logic in-
structions is used by default as the compiler optimizes it using function inlining,
saving about 30 cycles. However, this approach opens a breach for a power
16 Hayato Fujii and Diego F. Aranha
analysis attack, as shown in [25], since all bits from a 32-bit long register (in
ARM architectures) must be set or not depending on a secret bit.
Alternatively, the conditional swap operation can be implemented by setting
the 4-bit
ge
-flag in the Application Program Status Register (
ASPR
) and then
issuing the
SEL
instruction, which pick parts from the operand registers in byte-
sized blocks and writes them to the destination [1]. Note that setting
0x0
to the
ASPR.ge
flag and issuing
SEL
copies one of the operands; setting
0xF
and using
SEL
copies the other one. The
ASPR
cannot be set directly through a
MOV
with
an immediate operand, so a Move to Special Register (
MSR
) instruction must be
issued. Only registers may be used as arguments of this operation, so another
one must be used to set the
ASPR.ge
flag. Therefore, at least 8 bits must be used
to implement the conditional move. This theoretically reduces the attack surface
of a potential side-channel analysis, down from 32 bits.
6.3 Ed25519 implementation
Key generation and message signing requires a fixed-point scalar multiplication,
here implemented through a comb-like algorithm proposed by Hamburg in [15].
The signed-comb approach recodes the scalar into its signed binary form using
a single addition and a right-shift. This representation is divided in blocks and
each one of those are divided in combs, much like in the multi-comb approach
described in [16]. Like in the original work, we use five teeth for each of the five
blocks and 10 combs for each block (11 for the last one) due to the performance
balance between the direct and the linear table scan to access precomputed data
if protection against cache attacks is required. To effectively calculate the scalar
multiplication, our implementation requires 50 point additions and 254 point
doublings. Five lookup tables of 16 points in Extended Projective coordinate
format with 𝑧= 1 are used, adding up to approximately 7.5 KiB of data.
Verification requires a double-point multiplication involving the generator
𝐵
and point
𝐴
using a
𝑤
-NAF interleaving technique [16], with a window of
width 5 for the
𝐴
point, generated on-the-fly, taking approximately 3 KiB of
volatile memory. The group generator
𝐵
is interleaved using a window of width 7,
implying in a lookup table of 32 points stored in Extended Projective coordinate
format with
𝑧
= 1 taking 3 KiB of ROM. Note that verification has no need
to be executed in constant time, since all input data is (expected to be) public.
Decoding uses a standard field exponentiation for both inversion and square root
to calculate the
𝑦
-coordinate as suggested by [19] and [5]; this exponentiation is
carried out by the Itoh-Tsujii algorithm, providing an efficient way to calculate
the missing coordinate. Timings for computing a signature (both protected and
unprotected against cache attacks) and verification functionality in the evaluated
processors can be found in Table 2. Arithmetic modulo the group order in
Ed25519-related operations relates closely to the previously shown arithmetic
modulo 2255 −19, but Barrett reduction is used instead.
Final Remarks.
We consider that our implementation is competitive in compar-
ison to the mentioned works in Section 3, given the performance numbers shown
Curve25519 for the Cortex-M4 and beyond 17
in Tables 2 and 3. Using Curve25519 and its corresponding Twisted Edwards
form in well-known protocols is beneficial in terms of security, mostly due to its
maturity and its widespread usage to the point of becoming a de facto standard.
Acknowledgments.
The authors gratefully acknowledge financial support from
LG Electronics Inc. during the development of this work, under project “Effi-
cient and Secure Cryptography for IoT”, and Armando Faz-Hernández for his
helpful contributions and discussions during its development. We also thank the
anonymous reviewers for their helpful comments.
References
1.
ARM: Cortex-M4 Devices Generic User Guide. Avaliable on
http://infocenter.
arm.com/help/index.jsp?topic=%2Fcom.arm.doc.dui0553a%2FCHDBFFDB.html
(2010)
2.
Bernstein, D.J.: Curve25519: New Diffie-Hellman speed records. In: Public Key
Cryptography. Lecture Notes in Computer Science, vol. 3958, pp. 207–228. Springer
(2006)
3.
Bernstein, D.J.: 25519 naming. Available on
https://www.ietf.org/mail-
archive/web/cfrg/current/msg04996.html (Aug 2014)
4.
Bernstein, D.J., Birkner, P., Joye, M., Lange, T., Peters, C.: Twisted Edwards
Curves. In: AFRICACRYPT. Lecture Notes in Computer Science, vol. 5023, pp.
389–405. Springer (2008)
5.
Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.: High-speed high-security
signatures. J. Cryptographic Engineering 2(2), 77–89 (2012)
6.
Bernstein, D.J., Lange, T.: Analysis and optimization of elliptic-curve single-scalar
multiplication. Contemporary Mathematics – Finite Fields and Applications 461
(2008)
7.
Bernstein, D.J., Schwabe, P.: NEON crypto. In: CHES. Lecture Notes in Computer
Science, vol. 7428, pp. 320–339. Springer (2012)
8.
Boneh, D., DeMillo, R.A., Lipton, R.J.: On the importance of checking cryptographic
protocols for faults (extended abstract). In: EUROCRYPT. Lecture Notes in
Computer Science, vol. 1233, pp. 37–51. Springer (1997)
9.
Costello, C., Longa, P.: Four
Q
: Four-dimensional decompositions on a
Q
-curve over
the Mersenne Prime. In: ASIACRYPT. Lecture Notes in Computer Science, vol.
9452, pp. 214–235. Springer (2015)
10.
Daniel J. Bernstein and Tanja Lange (editors): eBACS: ECRYPT Benchmarking of
Cryptographic Systems. Avaliable on https://bench.cr.yp.to
11.
Düll, M., Haase, B., Hinterwälder, G., Hutter, M., Paar, C., Sánchez, A.H., Schwabe,
P.: High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes
Cryptography 77(2-3), 493–514 (2015)
12.
de Groot, W.: A Performance Study of X25519 on Cortex-M3 and M4. Ph.D. thesis,
Eindhoven University of Technology (Sep 2015)
13.
Großschädl, J., Oswald, E., Page, D., Tunstall, M.: Side-channel analysis of crypto-
graphic software via early-terminating multiplications. In: ICISC. Lecture Notes in
Computer Science, vol. 5984, pp. 176–192. Springer (2009)
14.
Haase, B.: Memory bandwidth influence makes cortex m4 bench-
marking difficult (sep 2017),
https://ches.2017.rump.cr.yp.to/
fe534b32e52fcacee026786ff44235f0.pdf
18 Hayato Fujii and Diego F. Aranha
15.
Hamburg, M.: Fast and compact elliptic-curve cryptography. IACR Cryptology
ePrint Archive 2012, 309 (2012)
16.
Hankerson, D., Menezes, A.J., Vanstone, S.: Guide to Elliptic Curve Cryptography.
Springer-Verlag New York, Inc., Secaucus, NJ, USA (2003)
17.
Hutter, M., Wenger, E.: Fast multi-precision multiplication for public-key cryp-
tography on embedded microprocessors. In: CHES. Lecture Notes in Computer
Science, vol. 6917, pp. 459–474. Springer (2011)
18.
Jao, D., Feo, L.D.: Towards quantum-resistant cryptosystems from supersingular
elliptic curve isogenies. In: PQCrypto. Lecture Notes in Computer Science, vol.
7071, pp. 19–34. Springer (2011)
19.
Josefsson, S., Liusvaara, I.: Edwards-Curve Digital Signature Algorithm (EdDSA).
RFC 8032 (Jan 2017), https://rfc-editor.org/rfc/rfc8032.txt
20.
Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS,
and other systems. In: CRYPTO. Lecture Notes in Computer Science, vol. 1109,
pp. 104–113. Springer (1996)
21. Law, L., Menezes, A., Qu, M., Solinas, J.A., Vanstone, S.A.: An efficient protocol
for authenticated key agreement. Des. Codes Cryptography 28(2), 119–134 (2003)
22.
Liu, Z., Longa, P., Pereira, G., Reparaz, O., Seo, H.: Four
Q
: on embedded devices
with strong countermeasures against side-channel attacks. In: CHES (to appear).
Springer Berlin Heidelberg, Berlin, Heidelberg (2017)
23.
Liu, Z., Seo, H., Kim, H.: A synthesis of multi-precision multiplication and squaring
techniques for 8-bit sensor nodes: State-of-the-art research and future challenges. J.
Comput. Sci. Technol. 31(2), 284–299 (2016)
24.
Montgomery, P.L.: Speeding the Pollard and Elliptic Curve Methods of Factorization.
Mathematics of Computation 48(177), 243–264 (1987),
http://dx.doi.org/10.
2307/2007888
25.
Nascimento, E., Chmielewski, L., Oswald, D., Schwabe, P.: Attacking embedded
ECC implementations through cmov side channels. IACR Cryptology ePrint Archive
2016, 923 (2016)
26.
Oliveira, T., López, J., Hışıl, H., Faz-Hernández, A., Rodríguez-Henríquez, F.: How
to (pre-)compute a ladder. In: SAC (to appear). Springer International Publishing
(2017)
27.
Renes, J., Smith, B.: qdsa: Small and secure digital signatures with curve-based
diffie-hellman key pairs. IACR Cryptology ePrint Archive 2017, 518 (2017)
28.
Santis, F.D., Sigl, G.: Towards Side-Channel Protected X25519 on ARM Cortex-M4
Processors. In: SPEED-B. Utrecht, The Netherlands (Oct 2016),
http://ccccspeed.
win.tue.nl/
29.
Seo, H., Kim, H.: Consecutive operand-caching method for multiprecision multipli-
cation, revisited. J. Inform. and Commun. Convergence Engineering 13(1), 27–35
(2015)
30.
Seo, H., Liu, Z., Choi, J., Kim, H.: Multi-precision squaring for public-key cryptog-
raphy on embedded microprocessors. In: INDOCRYPT. Lecture Notes in Computer
Science, vol. 8250, pp. 227–243. Springer (2013)