Curve25519 for the Cortex-M4 and beyond
Hayato Fujii and Diego F. Aranha
Institute of Computing – University of Campinas
hayato@lasca.ic.unicamp.br, dfaranha@ic.unicamp.br
Abstract. We present techniques for the implementation of a key exchange protocol and digital signature scheme based on the Curve25519 elliptic curve and its Edwards form, respectively, in resource-constrained ARM devices. A possible application of this work consists of TLS deployments in the ARM Cortex-M family of processors and beyond. These devices are located towards the lower- to mid-end spectrum of ARM cores, and are typically used on embedded devices. Our implementations improve the state of the art substantially by making use of novel implementation techniques and features specific to the target platforms.

Keywords: ECC, Curve25519, X25519, Ed25519, ARM Cortex-M.
1 Introduction
The growing number of devices connected to the Internet collecting and storing sensitive information raises concerns about the security of their communications and of the devices themselves. Many of them are equipped with microcontrollers constrained in terms of computing or storage capabilities, and lack tamper-resistance mechanisms or any form of physical protection. Their attack surface is wide open, ranging from physical exposure to attackers to ease of access through remote availability. While designing and developing efficient and secure implementations of cryptography is not a new problem and has been an active area of research since at least the birth of public-key cryptography, the application scenarios for these new devices impose new challenges on cryptographic engineering.
A possible way to deploy security in new devices is to reuse well-known and well-analyzed building blocks, such as the Transport Layer Security (TLS) protocol. In comparison with reinventing the wheel using a new and possibly proprietary solution, this has the major advantage of avoiding risky security decisions that may repeat issues already solved in TLS. In RFC 7748 and RFC 8032, published by the Internet Engineering Task Force (IETF), two cryptographic protocols based on the Curve25519 elliptic curve and its Edwards form are recommended and slated for future use in the TLS suite: the Diffie-Hellman key exchange using Curve25519 [2], called X25519 [3], and the Ed25519 digital signature scheme [5]. These schemes rely on a careful choice of parameters, favoring secure and efficient implementations of finite field and elliptic curve arithmetic, with smaller room for mistakes due to their overall implementation simplicity.
Special attention must be given to side-channel attacks, in which operational aspects of the implementation of a cryptographic algorithm may leak internal state information that allows an attacker to retrieve secret information. Secrets may leak through the communication channel itself, power consumption, execution time or radiation measurements. Information leaked through cache latency or execution time already allows powerful timing attacks against naive implementations of symmetric and public-key cryptography [20]. More intrusive attacks also attempt to inject faults at precise execution times, in the hope of corrupting the execution state to reveal secret information [8]. Optimizing such implementations to achieve an ideal balance between resource efficiency and side-channel resistance further complicates matters, beckoning algorithmic advances and novel implementation strategies.
This work presents techniques for an efficient and compact implementation of both the X25519 and Ed25519 algorithms, secure against timing and caching attacks, with an eye towards possible application in TLS deployments on constrained ARM processors¹. Our main target platform is the Cortex-M family of microcontrollers, starting from the Cortex-M4, but the same techniques can be used in higher-end CPUs such as the ARM Cortex-A series.
Contributions. We first present an ARM-optimized implementation of the finite field arithmetic modulo the prime p = 2^255 − 19. The main contribution in terms of novelty is an efficient multiplier largely employing the powerful multiply-and-accumulate DSP instructions in the target platform. The multiplier uses the full 32-bit width and allows removing any explicit addition instructions to accumulate results or propagate carries, as all of these operations are combined with the DSP instructions. These instructions are not present in the Cortex-M0 microcontrollers, and exhibit variable-time execution in the Cortex-M3 [13], hence the choice of focusing our efforts on the Cortex-M4 and beyond. The same strategy used for multiplication is adapted to squarings, with similar success in performance. Following related work [11], intermediate results are reduced modulo 2p and ultimately reduced modulo p at the end of the computation.
Finite field arithmetic is then used to implement the higher levels, including group arithmetic and cryptographic protocols. The key agreement implementation uses homogeneous projective coordinates on the Montgomery curve in order to take advantage of the constant-time Montgomery ladder as the scalar multiplication algorithm, protecting against timing attacks. The digital signature scheme implementation represents points on a twisted Edwards curve using extended projective coordinates, benefiting from efficient and unified point addition and doubling. The most time-consuming operation in the scheme is the fixed-point scalar multiplication, which uses the signed comb method as introduced by Hamburg [15], using approximately 7.5 KiB of precomputed data and running approximately two times faster than the Montgomery ladder. Side-channel security is achieved using isochronous (constant-time) code execution and linear table-scan countermeasures. We also evaluate a different way to implement the conditional selection operation in terms of its potential resistance against profiled power attacks [25]. Experiments conducted on a Cortex-M4 development board indicate that our work provides the fastest implementations of these specific algorithms for our main target architecture.

¹ https://www.keil.com/pack/doc/mw/Network/html/use_mbed_tls.html
Organization. Section 2 briefly describes features of our target platform. Section 3 documents related work in this area, summarizing previous implementations of elliptic curve algorithms on ARM microcontrollers. Section 4 discusses our techniques for finite field arithmetic in detail, focusing on the squaring and multiplication operations. Section 5 describes the algorithms using elliptic curve arithmetic in the key exchange and digital signature scenarios. Finally, Section 6 presents experimental results and implementation details.
Code Availability. For reproducibility, the prime field multiplication code is publicly available at https://github.com/hayatofujii/curve25519cortexm4.
2 ARMv7 architecture

The ARMv7 architecture is a reduced instruction set computer (RISC) using a load-store architecture. Processors with this technology are equipped with 16 registers: 13 general purpose, one program counter (pc), one stack pointer (sp), and one link register (lr). The latter can be freed up by saving it in slower memory and retrieving it after the register has been used. The processor core has a three-stage pipeline which can be used to optimize batch memory operations. Memory access involving n registers in these processors takes n + 1 cycles if there are no dependencies (for example, when a loaded register is the address for a consecutive store). This can happen either in a sequence of loads and stores or during the execution of instructions involving multiple registers simultaneously.
The ARMv7E-M instruction set also comprises standard instructions for basic arithmetic (such as addition and addition with carry) and logic operations but, differently from other lower processor classes, the Cortex-M4 has support for the so-called DSP instructions, which include multiply-and-accumulate (MAC) instructions:
– Unsigned MULtiply Long: UMULL rLO, rHI, a, b takes two unsigned integer words a and b and multiplies them; the upper half of the result is written back to rHI and the lower half into rLO.
– Unsigned MULtiply Accumulate Long: UMLAL rLO, rHI, a, b takes unsigned integer words a and b and multiplies them; the product is added and written back to the double-word integer stored as (rHI, rLO).
– Unsigned Multiply Accumulate Accumulate Long: UMAAL rLO, rHI, a, b takes unsigned integer words a and b and multiplies them; the product is added to the word-sized integer stored in rLO, then added again to the word-sized integer in rHI. This double-word integer is then written back into rLO and rHI, respectively the lower and the upper half of the result.
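To make the accumulation semantics concrete, the three instructions can be modeled in a few lines of Python; this is an illustrative sketch, not part of the implementation, with the destination register pair returned as a (rLO, rHI) tuple:

```python
M32 = (1 << 32) - 1  # 32-bit word mask

def umull(a, b):
    """UMULL: plain 32x32 -> 64 multiply; returns (rLO, rHI)."""
    p = a * b
    return p & M32, p >> 32

def umlal(rlo, rhi, a, b):
    """UMLAL: product accumulated into the 64-bit pair (rHI:rLO)."""
    p = a * b + ((rhi << 32) | rlo)
    return p & M32, p >> 32

def umaal(rlo, rhi, a, b):
    """UMAAL: product plus BOTH 32-bit accumulators, added separately.
    The maximum value is (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so the
    result always fits in 64 bits and no carry can ever be produced."""
    p = a * b + rlo + rhi
    return p & M32, p >> 32
```

The bound noted in the UMAAL comment is the reason the instruction can never overflow its 64-bit destination, a property exploited in Section 4.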
ARM’s Technical Reference Manual for the Cortex-M4 core [1] states that all the mentioned MAC instructions take one CPU cycle for execution in the Cortex-M4 and above. However, those instructions deterministically take an extra three cycles to write the lower half of the double-word result, and a final extra cycle to write the upper half. Therefore, proper instruction scheduling is necessary in order to avoid pipeline stalls and to make the best use of the delay slots.
The ARM Cortex-A cores are computationally more powerful than their Cortex-M counterparts. Cortex-A based processors can run robust operating systems due to extra auxiliary hardware; additionally, they may have a NEON engine, which is a Single Instruction-Multiple Data (SIMD) unit. Aside from that, those processors may have sophisticated out-of-order execution and extra pipeline stages.
3 Related work
Research in curve-based cryptography proceeds in several directions: looking for efficient elliptic curve parameters, instantiating and implementing the respective cryptographic protocols, and finding new applications. More recently, isogeny-based cryptography [18], which uses elliptic curves, was proposed as a candidate for post-quantum cryptography.
3.1 Scalar multiplication

Düll et al. [11] implemented X25519 and its underlying field arithmetic on a Cortex-M0 processor, equipped with a 32 × 32 → 32-bit multiplier. Since this instruction only returns the lower part of the product, this multiplier is abstracted as a smaller one (16 × 16 → 32) to facilitate a 3-level refined Karatsuba implementation, taking 1294 cycles to complete on the same processor. Their 256-bit squaring uses the same multiplier strategy with standard tricks to save repeated operations, taking 857 cycles. Putting it all together, an entire X25519 operation takes about 3.6M cycles with approximately 8 KiB of code size.
On the Cortex-A family of processor cores, implementers may use NEON, a SIMD instruction set executed in its own unit inside the processor. Bernstein and Schwabe [7] reported 527,102 Cortex-A8 cycles for the X25519 function. In the elliptic curve formulas used in their work, most of the multiplications can be handled in a parallel way, taking advantage of NEON’s vector instructions and Curve25519’s parallelization opportunities.
The authors are not aware of an Ed25519 implementation specifically targeting the Cortex-M4 core. However, Bernstein’s work using the Cortex-A8’s NEON unit reports 368,212 cycles to sign a short message and 650,102 cycles to verify its validity. The authors point out that 50 and 25 thousand cycles of signing and verification, respectively, are spent in the SUPERCOP-chosen SHA-512 implementation, with room for further improvements.
FourQ is an elliptic curve providing about 128 bits of security, equipped with the endomorphisms ψ and φ, providing efficient scalar multiplication [9]. Implementations of key exchange over this elliptic curve on different software and hardware platforms show a factor-2 speedup in comparison to Curve25519 and a factor-5 speedup in comparison to NIST’s P-256 curve [22]. Liu et al. [22] reported a 559,200 cycle count on an ARM Cortex-M4 based processor for their 32-bit implementation of the Diffie-Hellman key exchange on this curve.
Generating keys and Schnorr-like signatures over FourQ takes about 796,000 cycles on a Cortex-M4 based processor, while verification takes about 733,000 cycles on the same CPU [22]. Key generation and signing are aided by an 80-point table taking 7.5 KiB of ROM, and verification is assisted by a 256-point table, using 24 KiB of memory. Quotient DSA (qDSA) [27] is a novel signature scheme relying on Kummer arithmetic in order to use the same key for DH and digital signature schemes. It relies only on the x coordinate, with the goal of reducing stack usage, and on optimized formulas for group operations. When instantiated with Curve25519, it takes about 3 million cycles to sign a message and 5.7 million cycles to verify it on a Cortex-M0. This scheme does not rely on an additional table for speedups, since low code size is an objective given the target architecture, although this can be done using the ideas from [26] with increased ROM usage.
3.2 Modular Multiplication

Field multiplication is usually the most performance-critical operation, because other non-trivial field operations, such as inversion, are avoided by algorithmic techniques. Multiprecision multiplication algorithms can be ranked by how many single-word multiplications are performed. For example, operand scanning takes O(n²) multiplications, where n is the number of words. Product scanning takes the same number of word multiplications, but reduces the number of memory accesses by accumulating intermediate results in registers. One of the most popular algorithms that asymptotically reduces this complexity is Karatsuba multiplication, which takes the computational cost down to O(n^(log₂ 3)). This algorithm performs three multiplications of, usually, half-sized operands, giving it a divide-and-conquer structure and lowering its asymptotic complexity. As an example of such an application, the X25519 implementation of De Santis and Sigl [28] on the Cortex-M4 features a two-level Karatsuba multiplier, splitting a 256-bit multiplication down to 64-bit multiplications, each one taking four hardware-supported 32 × 32 → 64 multiplication instructions.
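The recursive structure can be sketched in Python; this is an illustrative model of the technique, not the cited implementation, with a 64-bit base case standing in for the hardware 32 × 32 → 64 multiplies:

```python
def karatsuba(x, y, bits=256):
    """Multiply x*y using 3 half-size products per level instead of 4."""
    if bits <= 64:
        return x * y  # base case: a hardware-sized multiplication
    half = bits // 2
    mask = (1 << half) - 1
    x0, x1 = x & mask, x >> half
    y0, y1 = y & mask, y >> half
    lo = karatsuba(x0, y0, half)                       # x0*y0
    hi = karatsuba(x1, y1, half)                       # x1*y1
    mid = karatsuba(x0 + x1, y0 + y1, half) - lo - hi  # cross terms
    return (hi << bits) + (mid << half) + lo
```

One level thus trades one multiplication for a handful of additions, which pays off as soon as a multiplication costs more than a few additions at the given operand size.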
Memory accesses account for part of the time consumed by the multiplication routine. Thus, algorithms and instruction scheduling methods which minimize these memory operations are highly desirable, especially on less powerful processors with slow memory access. This problem can be addressed by scheduling the basic multiplications following the product scanning strategy, which can be seen as a rhombus-like structure. However, following this scheme in its traditional form requires multiple stores and loads from memory, since the number of registers available may not be sufficient to hold the full operands. Improvements to reduce the number of memory operations are present in the literature: namely, Operand Caching due to Hutter and Wegner [17], further improved by Consecutive Operand Caching [30] and Full Operand Caching, both due to Seo et al. [29].
The Operand Caching technique reduces the number of memory accesses in comparison to standard product scanning by caching data in registers and storing parts of the operands in memory. This method resembles the product scanning approach but, instead of calculating a word in its entirety, rows are introduced to compute partial sub-products for each column. This method is illustrated in Figure 1.
Fig. 1. Operand Caching. Each dot in the rhombus represents a one-word multiplication; each column, at the end of its evaluation, is a (partial) word of the product. Vertical lines represent additions. (Figure: rhombus of partial products from A[0]B[0] to A[7]B[7], result columns C[0] to C[14], rows r0 and r1, and initial block b_init.)
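Setting aside the register-allocation concern that Operand Caching addresses, the column order depicted in Figure 1 corresponds to the following Python sketch of product scanning (illustrative only; to_limbs and from_limbs are conversion helpers introduced here for testing):

```python
W = 32
MASK = (1 << W) - 1

def product_scanning(a, b):
    """Multiply two n-limb little-endian integers column by column.
    Column k sums every a[i]*b[j] with i + j == k, emits one result
    word, and carries the excess into column k+1."""
    n = len(a)
    out = [0] * (2 * n)
    acc = 0
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k + 1, n)):
            acc += a[i] * b[k - i]
        out[k] = acc & MASK
        acc >>= W
    out[2 * n - 1] = acc
    return out

def to_limbs(x, n=8):
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    return sum(v << (W * i) for i, v in enumerate(limbs))
```

Operand Caching and its variants reorder the inner iterations of this double loop into rows so that loaded operand words are reused across consecutive columns.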
This method divides product scanning in two steps:
– Initial Block: the first step loads part of the operands, and proceeds to calculate the upper part of the rhombus using classical product scanning.
– Rows: in the rightmost part, most of the necessary operands are already loaded from previous calculations, requiring only a few extra operand loads, depending on the row width. Product scanning is done until the row ends. Note that, at the end of each column, parts of the operands are already loaded, hence only a small number of loads is necessary to evaluate the next column.
At every row change, new operands need to be reloaded, since the current operands in the registers are not useful at the start of the new row. Consecutive Operand Caching avoids these memory accesses by rearranging the rows, further increasing the number of operands already held in registers. This algorithm is depicted in Figure 2.
Note that during the transition between the bottommost row and the one above, part of the operands is already available in registers, solving the reload problem between row changes. Let n be the number of “limbs”, r the number of available working registers, and e the number of rows. Full Operand Caching further improves the number of memory accesses in two cases: if n − re < e, the Full Operand Caching structure looks like the original Operand Caching, but with a different multiplication order. Otherwise, the length of the Consecutive Operand Caching bottom row is adjusted in order to make full use of all available registers when processing the next row.

Fig. 2. Consecutive Operand Caching. (Figure: the rhombus of Figure 1 with rows rearranged so that operands held in registers at the end of one row are reused at the start of the next.)
3.3 Modular Squaring

The squaring routine can be built by repeating the operands and reusing the multiplication procedure, saving ROM space. Alternatively, writing a specialized procedure can save cycles by doubling partial products [23]. The squaring implementation in [28] follows this strategy, specializing the 64-bit multiplication routine down to 8 cycles, from 10. Partial products are calculated, then added twice to the accumulator, and the resulting carry is rippled away.
Following the product scanning method, Seo’s Sliding Block Doubling [30] halves the rhombus structure, allowing more registers to be used to store parts of the operands and to double partial products. The algorithm is illustrated in Figure 3 and can be divided in three parts:
Fig. 3. Sliding Block Doubling. Black dots represent multiplications and white dots represent squarings. (Figure: half-rhombus of partial products from A[0]A[0] to A[7]A[7], result columns C[0] to C[14].)
– Partial Products of the Upper Triangle: an adaptation of product scanning calculates partial products (represented by the black dots in the upper part of the rhombus in Figure 3) and saves them to memory.
– Sliding Block Doubling of Partial Products: each column result is doubled by left-shifting it by one, effectively duplicating the partial products. This process must be done in parts because the number of available registers is limited, since they hold parts of the operand.
– Remaining Partial Products of the Bottom Line: the bottom line multiplications are squares of parts of the operand. These products must be added to the respective partial result of the column above.
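The three steps above can be sketched in Python; this is an illustrative model operating on little-endian limb arrays, not the assembly routine:

```python
W = 32
MASK = (1 << W) - 1

def sbd_square(a):
    """Square an n-limb little-endian integer via Sliding Block Doubling."""
    n = len(a)
    # 1) partial products of the upper triangle: a[i]*a[j] with i < j
    cols = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(i + 1, n):
            cols[i + j] += a[i] * a[j]
    # 2) sliding block doubling: a single left shift doubles them all
    cols = [c << 1 for c in cols]
    # 3) bottom line: the squares a[i]^2 land on the even columns
    for i in range(n):
        cols[2 * i] += a[i] * a[i]
    # ripple the columns into result limbs
    out = [0] * (2 * n)
    acc = 0
    for k in range(2 * n - 1):
        acc += cols[k]
        out[k] = acc & MASK
        acc >>= W
    out[2 * n - 1] = acc
    return out

def to_limbs(x, n=8):
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    return sum(v << (W * i) for i, v in enumerate(limbs))
```

Roughly half of the one-word multiplications of a generic product scanning multiplication are saved, since each cross product a[i]a[j] is computed once and doubled.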
4 Implementation of F_(2^255 − 19) Arithmetic

Our implementation aims for efficiency, so ARM assembly is used thoroughly and code size is moderately sacrificed for speed. Code portability is a non-goal, so each 255-bit field element is densely represented in radix 2^32, implying eight 32-bit “limbs”, stored in little-endian format. This contrasts with the reference implementation [2], which uses 25 or 26 bits of each 32-bit word, allowing carry values that require proper handling at the expense of more cycles.
Modular Reduction. We call “weak” reduction a reduction modulo 2^256 − 38, performed at the end of every field operation in order to avoid extra carry computations between operations, as in [11]; this reduction finds an integer less than 2^256 that is congruent to the input modulo 2^255 − 19. When necessary, such as when data must be sent over the wire, a “strong” reduction modulo 2^255 − 19 is performed. This strategy is justified by the extra 10% cost difference between the “strong” and the “weak” reduction.
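Since 2^255 ≡ 19 implies 2^256 ≡ 38 (mod p), the high half of a wide value can be folded back with a multiplication by 38. A Python sketch of the two reductions (illustrative only, not the assembly code):

```python
P = 2**255 - 19
MASK256 = (1 << 256) - 1

def weak_reduce(x):
    """Fold x (e.g. a 512-bit product) below 2^256 using 2^256 = 38 mod p.
    The result is congruent to x mod p, but not necessarily less than p."""
    while x >> 256:
        x = (x & MASK256) + 38 * (x >> 256)
    return x

def strong_reduce(x):
    """Full reduction modulo p, e.g. before data is sent over the wire."""
    x = weak_reduce(x)
    while x >= P:  # at most a few conditional subtractions
        x -= P
    return x
```

In the assembly implementation the loops disappear: the operand sizes bound how many folds and conditional subtractions are ever needed, and the conditional subtraction is done in constant time.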
Addition and Subtraction. 256-bit addition is implemented by adding each pair of limbs, from the lowest to the highest. The carry flag, present in the ARM status register, is used to ripple the carry across the limbs. In order to avoid extra bits generated by the final sum, the result is weakly reduced. Subtraction follows a similar strategy.
Multiplication by a 32-bit Word. Multiplication by a single word follows the algorithm described in [28], used to multiply a long integer by 121666, an operation required to double a point on Curve25519.
Inversion. This operation follows the standard Itoh-Tsujii addition-chain approach to compute a^(p−2) ≡ a^(−1) (mod p), using 11 multiplications and 254 field squarings, as proposed in [2]. Adding up the costs, inversion turns out to be the most expensive field operation in our implementation.
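The addition chain from [2] can be written out as follows (an illustrative Python sketch; sq(x, n) performs n consecutive squarings, and the operation counts total 254 squarings and 11 multiplications):

```python
P = 2**255 - 19

def sq(x, n):
    """n consecutive squarings modulo p."""
    for _ in range(n):
        x = x * x % P
    return x

def invert(z):
    """Compute z^(p-2) = z^(2^255 - 21) mod p via an addition chain."""
    z2       = sq(z, 1)                        # z^2
    z9       = sq(z2, 2) * z % P               # z^9
    z11      = z9 * z2 % P                     # z^11
    z2_5_0   = sq(z11, 1) * z9 % P             # z^(2^5 - 1)
    z2_10_0  = sq(z2_5_0, 5) * z2_5_0 % P      # z^(2^10 - 1)
    z2_20_0  = sq(z2_10_0, 10) * z2_10_0 % P   # z^(2^20 - 1)
    z2_40_0  = sq(z2_20_0, 20) * z2_20_0 % P   # z^(2^40 - 1)
    z2_50_0  = sq(z2_40_0, 10) * z2_10_0 % P   # z^(2^50 - 1)
    z2_100_0 = sq(z2_50_0, 50) * z2_50_0 % P   # z^(2^100 - 1)
    z2_200_0 = sq(z2_100_0, 100) * z2_100_0 % P  # z^(2^200 - 1)
    z2_250_0 = sq(z2_200_0, 50) * z2_50_0 % P  # z^(2^250 - 1)
    return sq(z2_250_0, 5) * z11 % P           # z^(2^255 - 21)
```

Since squarings dominate the chain, the specialized squaring routine of Section 4.2 directly lowers the cost of inversion.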
4.1 Multiplication

The 256 × 256 → 512-bit multiplication follows a product-scanning-like approach; more specifically, Full Operand Caching. As mentioned in Section 3, the parameters for this implementation are n = 8, e = 3 and r = ⌊n/e⌋ = 2; since n − re = 8 − 2 · 3 = 2 < 3 = e, Full Operand Caching with a Consecutive-like structure yields the best option.
Catching the Carry Bit. Using product scanning to calculate partial products with a double-word multiplier implies adding partial products of the next column, which in turn might generate carries. A partial column, divided in rows in the manner described in Operand Caching, can be calculated using Algorithm 1; an example implementation in ARM assembly is shown in Listing 1.1. Notation follows as (ε, z) ← w, meaning z ← w mod 2^W and ε ← 0 if w ∈ [0, 2^W), otherwise ε ← 1, where W is the bit-size of a word; (A B) denotes a 2W-bit word obtained by concatenating the W-bit words A and B.
Algorithm 1 Column computation in product scanning.
Input: operands A, B; column index k; partial product R_k (calculated during column k − 1); accumulated carry R_(k+1) (generated from the sum of partial products).
Output: (partial) product AB[k]; sum R_(k+1) (higher half of the partial product for column k + 1); accumulated carry R_(k+2) (generated from the sum of partial products).

R_(k+2) ← 0
for all (i, j) : i + j = k, 0 ≤ i < j ≤ n − 1 do
    T ← 0
    (T R_k) ← A[i] × B[j] + (T, R_k)
    (ε, R_(k+1)) ← T + R_(k+1)
    R_(k+2) ← R_(k+2) + ε
end for
AB[k] ← R_k
return AB[k], R_(k+1), R_(k+2)
Listing 1.1. ARM code for calculating a column in product scanning.
@ k = 6
@ r5 and r4 hold R_6, R_7 respectively
@ r6, r7, r8 hold A[3], A[4] and A[5] respectively
@ r9, r10, r11 hold B[3], B[1], B[2] respectively
MOV   r12, #0
MOV   r3, #0
UMLAL r5, r12, r8, r10   @ A5 * B1
ADDS  r4, r4, r12
ADC   r3, r3, #0
MOV   r14, #0
UMLAL r5, r14, r7, r11   @ A4 * B2
ADDS  r4, r4, r14
ADC   r3, r3, #0
MOV   r12, #0
UMLAL r5, r12, r6, r9    @ A3 * B3
ADDS  r4, r4, r12
ADC   r3, r3, #0
@ r5 holds AB[6], r4 holds R_7, r3 holds R_8
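The behavior of Algorithm 1 and Listing 1.1 can be cross-checked with a small Python model (illustrative; the (i, j) index pairs of the partial column are passed explicitly, mirroring the row decomposition of Operand Caching):

```python
M32 = (1 << 32) - 1

def column(pairs, A, B, Rk, Rk1):
    """Evaluate one partial column with UMLAL + ADDS/ADC steps, as in
    Algorithm 1. Returns (AB[k], R_{k+1}, R_{k+2})."""
    Rk2 = 0
    for i, j in pairs:
        p = A[i] * B[j] + Rk      # UMLAL with rLO = Rk, rHI = T = 0
        Rk, t = p & M32, p >> 32
        s = t + Rk1               # ADDS: add the high half into R_{k+1}
        Rk1 = s & M32
        Rk2 += s >> 32            # ADC: catch the carry bit in R_{k+2}
    return Rk, Rk1, Rk2
```

The loop maintains the invariant that Rk + 2^32 · Rk1 + 2^64 · Rk2 grows by exactly A[i] · B[j] per step, which is what the test below checks.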
One possible optimization is delaying the carry bit: eliminating the last addition of Algorithm 1, this addition can be deferred to the next column, using a single instruction to add the partial products and the carry bit. This is easy on ARM processors, where there is fine-grained control over whether or not instructions update the processor flags. Other optimizations involve proper register allocation in order to avoid reloads, saving a few more cycles.
Carry Elimination. Storing partial products in extra registers without adding them up avoids potential carry values. In a trivial implementation, a register accumulator may be used to add the partial values, potentially generating carries. The UMAAL instruction can be employed to perform such an addition, while also using its multiplication part to calculate further partial products. This instruction never generates a carry bit, since (2^n − 1)^2 + 2(2^n − 1) = 2^(2n) − 1, eliminating the need for carry handling. Partial products generated by this instruction can be forwarded to the next multiply-accumulate(-accumulate) operation; this goes on until all rows are processed. Algorithm 2 and Listing 1.2 illustrate how a column from product scanning can be evaluated following this strategy.
Algorithm 2 Column computation in product scanning, eliminating carries.
Input: operands A, B; column index k; m partial products R_k[0, ..., m − 1] (calculated during column k − 1 and stored in registers).
Output: partial product AB[k]; m partial products R_(k+1)[0, ..., m − 1] (higher halves of the partial products calculated for column k + 1, stored in registers).

t ← 1
for all (i, j) : i + j = k, 0 ≤ i < j ≤ n − 1 do
    (R_k[t] R_k[0]) ← A[i] × B[j] + R_k[0] + R_k[t]
    R_(k+1)[t − 1] ← R_k[t]
    t ← t + 1
end for
AB[k] ← R_k[0]
return AB[k], R_(k+1)[0, ..., m − 1]
Listing 1.2. ARM code for calculating a column in product scanning without carries.
@ k = 6
@ r3, r4, r12 and r5 hold R_6[0,1,2,3]
@ r6, r7, r8 hold A[3], A[4] and A[5] respectively
@ r9, r10, r11 hold B[3], B[1], B[2] respectively
UMAAL r3, r4,  r8, r10   @ A5 * B1
UMAAL r3, r12, r7, r11   @ A4 * B2
UMAAL r3, r5,  r6, r9    @ A3 * B3
@ r3 holds (partially) AB[6]
@ r4, r5 and r12 hold partial products for k = 7
Note that this strategy is limited by the number of available working registers. These registers hold partial products without adding them up, avoiding the need for carry handling, so strategies dividing columns into rows, as in Operand Caching, are desirable.
4.2 Squaring

The literature suggests the use of a multiplication algorithm similar to the schoolbook method [30], but saving registers and repeated multiplications. Due to its similarity with product scanning (and the possibility of applying the optimization techniques above), we choose the Sliding Block Doubling algorithm as our squaring routine. Note that, using the carry flag present in some machine architectures, both the Sliding Block Doubling and the Bottom Line steps (as described in Section 3) can be computed efficiently. In order to avoid extra memory accesses, one can implement those two routines without reloading operands; however, because the carry bit is needed in both operations, high register pressure may arise from keeping carries in registers. We propose a technique to alleviate the register pressure: calculating a few multiplications akin to the Initial Block step of Operand Caching reduces register usage, allowing proper carry catching and handling in exchange for a few memory accesses (Figure 4).
Fig. 4. Sliding Block Doubling with Initial Block. (Figure: the half-rhombus of Figure 3 with an initial block b_init computed separately before the bottom row.)
In this example, each product-scanning column is limited to height 2, meaning that only two consecutive multiplications can be handled without losing partial products. Increasing the size of the “initial block” (or, more accurately, the initial triangle) frees up registers during the bottom row evaluation.
5 Elliptic curves

An elliptic curve E over a field F_q is the set of solutions (x, y) ∈ F_q × F_q which satisfy the Weierstrass equation

E/F_q : y² + a₁xy + a₃y = x³ + a₂x² + a₄x + a₆,   (1)

where a₁, a₂, a₃, a₄, a₆ ∈ F_q and the curve discriminant is Δ ≠ 0. We restrict our attention to curves defined over prime fields which can be represented in the Montgomery [24] (or twisted Edwards [4]) model, allowing faster formulas and unified arithmetic [6].
The set of points E(F_q) = {(x, y) ∈ E(F_q)} ∪ {𝒪} = {P ∈ E(F_q)} ∪ {𝒪} under the addition operation + (chord and tangent) forms an additive group, with 𝒪 as the identity element. Given an elliptic curve point P ∈ E(F_q) and an integer k, the operation kP, called scalar point multiplication, is defined as the addition of the point P to itself k − 1 times. This operation encodes the security assumption for Elliptic Curve Cryptography (ECC) protocols, basing their security on the hardness of solving the elliptic curve analogue of the discrete logarithm problem (ECDLP). Given a public key represented as a point Q on the curve, the problem amounts to finding the secret k ∈ ℤ such that Q = kP for some given point P on the curve.
ECC is an efficient yet conservative option for deploying public-key cryptography in embedded systems, since the ECDLP still enjoys conjectured fully exponential security against classical computers and, consequently, reduced key sizes and storage requirements. In practice, a conservative instance of this problem can be obtained by selecting prime-field curves of near-prime order without any non-trivial endomorphisms. Curve25519 is a popular curve at the 128-bit security level, represented through the Montgomery model

Curve25519 : y² = x³ + Ax² + x,   (2)

compactly described by the small value of the coefficient A = 486662. This curve model is ideal for curve-based key exchanges, because it allows the scalar multiplication to be computed using x coordinates only. Using a birational equivalence, Curve25519 can also be represented in the twisted Edwards model using full coordinates, to allow instantiations of secure signature schemes:

edwards25519 : −x² + y² = 1 − (121665/121666) x² y².   (3)

Key exchange protocols and digital signature schemes are building blocks for applications like key distribution schemes and secure software updates based on code signing. These protocols are fundamental for preserving the integrity of software running in embedded devices and establishing symmetric cryptographic keys for data encryption and secure communication.
5.1 Elliptic Curve Diffie-Hellman

The Elliptic Curve Diffie-Hellman (ECDH) protocol is an instantiation of the Diffie-Hellman key agreement protocol over elliptic curves. Modern implementations of this protocol employ x-coordinate-only formulas over a Montgomery model of the curve, for computational savings, side-channel security and ease of implementation. Following this idea, the protocol may be implemented using the X25519 function, which is in essence a scalar multiplication of a point on Curve25519 [2]. In this scheme, a pair of entities generate their respective private keys, each 32 bytes long. A public generator point P is multiplied by the private key, generating a public key. The two entities then exchange their public keys over an insecure channel; computing the X25519 function with their private keys and the received point generates a shared secret, which may be used to derive a symmetric session key for both parties.
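As a functional reference, the whole X25519 operation fits in a short Python sketch following the RFC 7748 Montgomery ladder; this illustrates the mathematics only and has none of the constant-time properties required of a real implementation:

```python
P = 2**255 - 19
A24 = 121665  # (486662 - 2) / 4

def clamp(k):
    """Clear the 3 low bits and bit 255, set bit 254 (RFC 7748)."""
    return (k & ((1 << 254) - 8)) | (1 << 254)

def x25519(k, u):
    """x-coordinate scalar multiplication on Curve25519."""
    x1 = u % P
    x2, z2, x3, z3, swap = 1, 0, x1, 1, 0
    for t in reversed(range(255)):
        kt = (k >> t) & 1
        swap ^= kt
        if swap:  # conditional swap (constant-time in real code)
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = kt
        a, b = (x2 + z2) % P, (x2 - z2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) * (da + cb) % P
        z3 = x1 * (da - cb) % P * (da - cb) % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, z2 = x3, z3
    return x2 * pow(z2, P - 2, P) % P
```

With the base point u = 9, two parties obtain the same shared secret from x25519(a, x25519(b, 9)) and x25519(b, x25519(a, 9)); the final pow call is the field inversion that the assembly implementation performs with the addition chain of Section 4.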
Since the ECDH protocol does not authenticate keys, public key authentication must be performed out-of-band, or an authenticated key agreement scheme such as Elliptic Curve Menezes-Qu-Vanstone (ECMQV) [21] must be adopted.
For data confidentiality, authenticated encryption can be constructed by combining X25519 as an interactive key exchange mechanism with a block or stream cipher and a proper mode of operation, as proposed in future Transport Layer Security protocol versions. Alternatively, authenticated encryption with associated data (AEAD) schemes may be combined with X25519, replacing block ciphers and a mode of operation.
5.2 Ed25519 digital signatures
The Edwards-curve Digital Signature Algorithm (EdDSA) [5] is a signature scheme variant of Schnorr signatures based on elliptic curves represented in the Edwards model. Like other discrete-log-based signature schemes, EdDSA requires a secret value, or nonce, unique to each signature. To reduce the risk of a random number generator failure, EdDSA calculates this nonce deterministically, as the hash of the message and the private key. Thus, the nonce is very unlikely to be repeated for different signed messages. While this reduces the attack surface in terms of random number generation and improves nonce misuse resistance during the signing process, high-quality random numbers are still needed for key generation. When instantiated with edwards25519 (Equation 3), the EdDSA scheme is called Ed25519. Concretely, let 𝐻 be the SHA-512 hash function mapping arbitrary-length strings to 512-bit hash values. The signature of a message 𝑀 under this scheme and private key 𝑘 is the 512-bit string (𝑅, 𝑆), where 𝑅 = 𝑟𝐵, for 𝐵 a generator of the subgroup of points of order ℓ and 𝑟 computed as 𝐻(𝐻(𝑘), 𝑀); and 𝑆 = 𝑟 + 𝐻(𝑅, 𝐴 = 𝑎𝐵, 𝑀) mod ℓ, for an integer 𝑎 derived from 𝐻(𝑘). Verification works by parsing the signature components and checking whether the equation 𝑆𝐵 = 𝑅 + 𝐻(𝑅, 𝐴, 𝑀)𝐴 holds [5].
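At the scalar level, correctness of the verification equation follows from substituting 𝑆. A sketch where each point 𝑛𝐵 is modelled by its discrete logarithm 𝑛 mod ℓ, with hash outputs replaced by arbitrary stand-in integers (so this checks the algebra, not the hashing):

```python
# Scalar-level model of the Ed25519 equations: each point nB is
# represented by its discrete log n modulo the group order ell,
# so point addition becomes modular addition. The hash values
# below are arbitrary stand-ins, not real SHA-512 outputs.
ell = 2**252 + 27742317777372353535851937790883648493  # order of B

a = 0x123456789ABCDEF   # secret scalar derived from H(k) (stand-in)
r = 0xFEDCBA987654321   # nonce H(H(k), M) (stand-in)
h = 0x13579BDF2468ACE   # challenge H(R, A, M) (stand-in)

A = a % ell             # public key A = aB
R = r % ell             # commitment R = rB
S = (r + h * a) % ell   # signature scalar

# Verification: SB == R + H(R, A, M)A, i.e. S == r + h*a (mod ell).
assert S == (R + h * A) % ell
```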
6 Implementation details and results
This work focuses on microcontrollers suitable for integration in embedded projects. Therefore, we chose representative ARM architecture processors. Specifically, the implementations were benchmarked on the following platforms:
– Teensy: Teensy 3.2 board equipped with a MK20DX256VLH7 Cortex-M4-based microcontroller, clocked at 48 and 72 MHz.
– STM32F401C: STM32F401 Discovery board powered by an STM32F401C microcontroller, also based on the Cortex-M4 design, clocked at 84 MHz.
– Cortex-A7/A15: ODROID-XU4 board with a Samsung Exynos 5422 CPU clocked at 2 GHz, containing four Cortex-A7 and four Cortex-A15 cores in a heterogeneous configuration.
Code for the Teensy board was generated using GCC version 5.4.1 with the -O3 -mthumb flags; the same settings apply to code compiled for the STM32F401C board, but using an updated compiler version (7.2.0). For the Cortex-A family, code was generated with GCC version 6.3.1 using the -O3 optimization flag. Cycle counts were obtained using the corresponding cycle counter in each architecture. Randomness, where required, was sampled through /dev/urandom on the Cortex-A7/A15 device. On the Cortex-M4 boards, NIST's Hash_DRBG was implemented with SHA-256, and the generator is seeded by sampling analog noise from disconnected pins on the board.
Albeit not the most efficient for every possible target, the codebase is the same for every ARMv7 processor equipped with DSP instructions, which makes it well suited to large heterogeneous deployments, such as a network of small sensors connected to a central server with a more powerful processor than its smaller counterparts. This simplifies code maintenance, avoiding possible security problems.
6.1 Field arithmetic
Table 1 presents timings and Table 3 presents code size for field operations with the implementations described in Section 4. In comparison to the current state-of-the-art [28], our addition/subtraction takes 18% fewer cycles; the 256-bit multiplier with a weak reduction is almost 50% faster, and the squaring operation takes 30% fewer cycles. The multiplication routine may be used in place of the squaring if code size is a restriction, since 1S is approximately 0.9M. The implementation of all arithmetic operations takes less code space in comparison to [28], ranging from 20% savings in the addition to 50% in the multiplier.
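The weak reduction used by the multiplier exploits 2^256 ≡ 38 (mod 2^255 − 19): the high half of a 512-bit product folds into the low half via one multiplication by 38. A sketch of the idea, leaving aside the exact limb-level bounds of the assembly routine:

```python
# Model of the weak reduction used by the 256-bit multiplier:
# since 2^256 = 38 (mod 2^255 - 19), the high 256 bits of a
# 512-bit product fold into the low 256 bits via a multiplication
# by 38. The result fits in 256 bits but is only weakly reduced;
# the assembly routine does this on 32-bit limbs, this model on ints.
P = 2**255 - 19
MASK256 = (1 << 256) - 1

def weak_reduce(x):
    # Each fold preserves the value mod P; at most a few folds are
    # needed for x < 2^512 before the result fits in 256 bits.
    while x >> 256:
        x = (x & MASK256) + 38 * (x >> 256)
    return x

prod = (P - 2) * (P - 3)  # a worst-case-sized 510-bit product
assert weak_reduce(prod) % P == prod % P
assert weak_reduce(prod) < 1 << 256
```

Keeping results weakly reduced (below 2^256 rather than below P) postpones the final subtraction until a canonical value is actually needed, e.g. before output encoding.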
As noted by Haase [14], cycle counts on the same Cortex-M4-based controller can differ depending on the clock frequency set on the chip. Different clock frequencies for the controller and the memory may cause stalls in the former if the latter is slower. For example, the multiplication and squaring implementations, which rely on memory operations, use 10% more cycles when the controller is set to a 33% higher frequency. This behavior also shows up in full cryptographic schemes, as shown in Table 2.
Table 1. Timings in cycles for arithmetic in F(2^255 − 19) on multiple ARM processors. Numbers for this work were taken as the average of 256 executions.

                 Cortex                    Add/Sub  Mult  Mult by word  Square  Inversion
De Groot [12]    M4                        73/77    631   129           563     151997
De Santis [28]   M4                        106      546   72            362     96337
This work        M4 @ 48 MHz (Teensy)      86       276   76            252     66634
This work        M4 @ 72 MHz (Teensy)      86       310   76            280     75099
This work        M4 @ 84 MHz (STM32F401C)  86       273   76            243     64425
This work        A7                        52       290   61            233     62648
This work        A15                       36       225   37            139     41978

                 Cortex                    Fp2 Add/Sub  Fp2 Mult  Mult by word  Fp2 Square  Fp2 Inversion
FourQ [22]       M4 (STM32F407)            84/86        358       -             215         21056
Table 2. Timings in cycles for computing the Montgomery ladder in the X25519 key exchange, and for key generation, signing, and verification of a 5-byte message in the Ed25519 scheme. Key generation encompasses taking a secret key and computing its public key; signing takes both keys and a message to generate its respective signature. Numbers were taken as the average of 256 executions on multiple ARM processors. Protocols are inherently protected against timing attacks (constant-time, CT) on the Cortex-M4 due to the lack of cache memory, while side-channel protection must be explicitly added on the Cortex-A. Performance penalties for side-channel protection can be obtained by comparing the implementations with CT = Y against N on the same platform.

                        CT  Cortex                   X25519   Ed25519 Key Gen.  Ed25519 Sign  Ed25519 Verify
De Groot [12]           Y   M4                       1816351  -                 -             -
De Santis [28]          Y   M4                       1563852  -                 -             -
This work               Y   M4 @ 48 MHz (Teensy)     907240   347225            496039        1265078
This work               Y   M4 @ 72 MHz (Teensy)     1003707  379734            531471        1427923
This work               Y   M4 @ 84 MHz (STM32F401)  894391   389480            543724        1331449
Bernstein, Schwabe [7]  Y   A8                       527102   -                 368212        650102
This work               N   A7                       -        -                 423058        1118806
This work               Y   A7                       825914   397261            524804        -
This work               N   A15                      -        -                 264252        776806
This work               Y   A15                      572910   245377            305797        -
eBACS ref. code [10]    Y   A15                      342477   241641            245712        730047

                        CT  Cortex                   DH       SchnorrQ Key Gen. SchnorrQ Sign SchnorrQ Verify
FourQ [22]              Y   M4 (STM32F407)           542900   265100            345400        648600
Table 3. Code size in bytes for implementing arithmetic in F(2^255 − 19) and the X25519 and Ed25519 protocols on the Cortex-M4. Code size for the protocols considers the entire software stack needed to perform the specific action, including but not limited to field operations, hashing, tables for scalar multiplication, and other algorithms.

                 Add  Sub  Mult  Mult by word  Square
De Groot [12]    44   64   1284  300           1168
De Santis [28]   138  148  1264  116           882
This work        110  108  622   92            562

                 Inversion  X25519  Ed25519 Key Gen.  Ed25519 Sign  Ed25519 Verify
De Groot [12]    388        4140    -                 -             -
De Santis [28]   484        3786    -                 -             -
This work        328        4152    21265             22162         28240
6.2 X25519 implementation
X25519 was implemented using the standard Montgomery ladder over the 𝑥-coordinate. Standard tricks such as randomized projective coordinates (amounting to a 1% performance penalty) and constant-time conditional swaps were implemented for side-channel protection. Cycle counts of the X25519 function executed on the evaluated processors are shown in Table 2, and code size in Table 3. Our implementation is 42% faster than that of De Santis and Sigl [28] while staying competitive in terms of code size.
Note on conditional swaps. The classical conditional swap using logic instructions is used by default, as the compiler optimizes it using function inlining, saving about 30 cycles. However, this approach opens a breach for a power analysis attack, as shown in [25], since all bits of a 32-bit register (in ARM architectures) must be set or cleared depending on a secret bit.
Alternatively, the conditional swap operation can be implemented by setting the 4-bit ge flags in the Application Program Status Register (APSR) and then issuing the SEL instruction, which picks parts of the operand registers in byte-sized blocks and writes them to the destination [1]. Note that setting APSR.ge to 0x0 and issuing SEL copies one of the operands; setting it to 0xF and issuing SEL copies the other. The APSR cannot be set directly through a MOV with an immediate operand, so a Move to Special Register (MSR) instruction must be issued. Only registers may be used as arguments of this operation, so another register must be used to set the APSR.ge flags. Therefore, at least 8 bits must be used to implement the conditional move. This theoretically reduces the attack surface of a potential side-channel analysis, down from 32 bits.
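The masking pattern of the logic-instruction variant can be sketched as follows (a Python model of the 32-bit word operations; the actual routine is assembly over the field-element limbs):

```python
M32 = 0xFFFFFFFF  # 32-bit register width, as on ARM

def cswap(bit, a, b):
    # Branch-free conditional swap of two 32-bit words: bit = 1
    # swaps, bit = 0 leaves the operands untouched. The mask is
    # all-ones or all-zeros depending on the secret bit, which is
    # exactly what makes this variant's power signature observable.
    mask = (-bit) & M32        # 0x00000000 or 0xFFFFFFFF
    t = mask & (a ^ b)
    return a ^ t, b ^ t

assert cswap(0, 0x12345678, 0x9ABCDEF0) == (0x12345678, 0x9ABCDEF0)
assert cswap(1, 0x12345678, 0x9ABCDEF0) == (0x9ABCDEF0, 0x12345678)
```

The SEL-based alternative drives only the four APSR.ge flag bits instead of a full 32-bit mask, which is the attack-surface reduction discussed above.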
6.3 Ed25519 implementation
Key generation and message signing require a fixed-point scalar multiplication, implemented here through a comb-like algorithm proposed by Hamburg [15]. The signed-comb approach recodes the scalar into its signed binary form using a single addition and a right-shift. This representation is divided into blocks, and each of those is divided into combs, much like in the multi-comb approach described in [16]. Like in the original work, we use five teeth for each of the five blocks and 10 combs for each block (11 for the last one), due to the performance balance between direct access and a linear table scan over the precomputed data when protection against cache attacks is required. To calculate the scalar multiplication, our implementation requires 50 point additions and 254 point doublings. Five lookup tables of 16 points in extended projective coordinate format with 𝑧 = 1 are used, adding up to approximately 7.5 KiB of data.
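The single-addition-and-right-shift recoding can be sketched as follows. The scalar is assumed odd (an even scalar can first be made odd by adding the group order, which is odd and leaves the resulting point unchanged); the digit count 𝑛 is a parameter here:

```python
def signed_recode(k, n):
    # Recode an odd scalar 0 < k < 2^n into n digits in {-1, +1}
    # using a single addition and a right-shift: with
    # m = (k + 2^n - 1) >> 1, digit i is +1 iff bit i of m is set,
    # because sum((2*bit_i(m) - 1) * 2^i) == 2*m - (2^n - 1) == k.
    assert k & 1, "scalar must be odd (add the group order if even)"
    m = (k + (1 << n) - 1) >> 1
    return [1 if (m >> i) & 1 else -1 for i in range(n)]

digits = signed_recode(0xDEADBEEF, 255)
assert sum(d * (1 << i) for i, d in enumerate(digits)) == 0xDEADBEEF
```

Because every digit is nonzero, each comb column always selects some table entry, which is what allows a regular, constant-time scalar multiplication pattern.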
Verification requires a double-point multiplication involving the generator 𝐵 and the point 𝐴, using a 𝑤-NAF interleaving technique [16]. A window of width 5 is used for the point 𝐴, whose table is generated on-the-fly, taking approximately 3 KiB of volatile memory. The group generator 𝐵 is interleaved using a window of width 7, implying a lookup table of 32 points stored in extended projective coordinate format with 𝑧 = 1, taking 3 KiB of ROM. Note that verification need not be executed in constant time, since all input data is (expected to be) public. Decoding uses a standard field exponentiation for both inversion and square root to recover the missing coordinate from the 𝑦-coordinate, as suggested by [19] and [5]; this exponentiation is carried out by the Itoh-Tsujii algorithm, providing an efficient way to calculate that coordinate. Timings for computing a signature (both protected and unprotected against cache attacks) and for verification on the evaluated processors can be found in Table 2. Arithmetic modulo the group order in Ed25519-related operations closely resembles the previously shown arithmetic modulo 2^255 − 19, but Barrett reduction is used instead.
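The width-𝑤 NAF recoding behind the interleaving is the standard digit-extraction algorithm from [16]; a sketch of the recoding step only, with the point arithmetic omitted:

```python
def wnaf(k, w):
    # Width-w NAF: nonzero digits are odd with |d| < 2^(w-1), and
    # any w consecutive digit positions hold at most one nonzero.
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)           # low w bits of k
            if d >= 1 << (w - 1):      # map into the signed range
                d -= 1 << w
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

# Reconstructing the scalar from its digits recovers k exactly.
k = 0x1234567DEADBEEF
assert sum(d * (1 << i) for i, d in enumerate(wnaf(k, 5))) == k
```

Width 5 is used for the on-the-fly table of 𝐴 and width 7 for the precomputed table of 𝐵, trading table size for fewer point additions.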
Final Remarks. We consider our implementation competitive with the works mentioned in Section 3, given the performance numbers shown in Tables 2 and 3. Using Curve25519 and its corresponding twisted Edwards form in well-known protocols is beneficial in terms of security, mostly due to their maturity and widespread usage, to the point of becoming a de facto standard.
Acknowledgments. The authors gratefully acknowledge financial support from LG Electronics Inc. during the development of this work, under project "Efficient and Secure Cryptography for IoT", and Armando Faz-Hernández for his helpful contributions and discussions during its development. We also thank the anonymous reviewers for their helpful comments.
References

1. ARM: Cortex-M4 Devices Generic User Guide. Available at http://infocenter.arm.com/help/index.jsp?topic=%2Fcom.arm.doc.dui0553a%2FCHDBFFDB.html (2010)
2. Bernstein, D.J.: Curve25519: New Diffie-Hellman speed records. In: Public Key Cryptography. Lecture Notes in Computer Science, vol. 3958, pp. 207–228. Springer (2006)
3. Bernstein, D.J.: 25519 naming. Available at https://www.ietf.org/mail-archive/web/cfrg/current/msg04996.html (Aug 2014)
4. Bernstein, D.J., Birkner, P., Joye, M., Lange, T., Peters, C.: Twisted Edwards curves. In: AFRICACRYPT. Lecture Notes in Computer Science, vol. 5023, pp. 389–405. Springer (2008)
5. Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.: High-speed high-security signatures. J. Cryptographic Engineering 2(2), 77–89 (2012)
6. Bernstein, D.J., Lange, T.: Analysis and optimization of elliptic-curve single-scalar multiplication. Contemporary Mathematics – Finite Fields and Applications 461 (2008)
7. Bernstein, D.J., Schwabe, P.: NEON crypto. In: CHES. Lecture Notes in Computer Science, vol. 7428, pp. 320–339. Springer (2012)
8. Boneh, D., DeMillo, R.A., Lipton, R.J.: On the importance of checking cryptographic protocols for faults (extended abstract). In: EUROCRYPT. Lecture Notes in Computer Science, vol. 1233, pp. 37–51. Springer (1997)
9. Costello, C., Longa, P.: FourQ: Four-dimensional decompositions on a Q-curve over the Mersenne prime. In: ASIACRYPT. Lecture Notes in Computer Science, vol. 9452, pp. 214–235. Springer (2015)
10. Bernstein, D.J., Lange, T. (eds.): eBACS: ECRYPT Benchmarking of Cryptographic Systems. Available at https://bench.cr.yp.to
11. Düll, M., Haase, B., Hinterwälder, G., Hutter, M., Paar, C., Sánchez, A.H., Schwabe, P.: High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptography 77(2-3), 493–514 (2015)
12. de Groot, W.: A Performance Study of X25519 on Cortex-M3 and M4. Ph.D. thesis, Eindhoven University of Technology (Sep 2015)
13. Großschädl, J., Oswald, E., Page, D., Tunstall, M.: Side-channel analysis of cryptographic software via early-terminating multiplications. In: ICISC. Lecture Notes in Computer Science, vol. 5984, pp. 176–192. Springer (2009)
14. Haase, B.: Memory bandwidth influence makes Cortex-M4 benchmarking difficult (Sep 2017), https://ches.2017.rump.cr.yp.to/fe534b32e52fcacee026786ff44235f0.pdf
15. Hamburg, M.: Fast and compact elliptic-curve cryptography. IACR Cryptology ePrint Archive 2012, 309 (2012)
16. Hankerson, D., Menezes, A.J., Vanstone, S.: Guide to Elliptic Curve Cryptography. Springer-Verlag New York, Inc., Secaucus, NJ, USA (2003)
17. Hutter, M., Wenger, E.: Fast multi-precision multiplication for public-key cryptography on embedded microprocessors. In: CHES. Lecture Notes in Computer Science, vol. 6917, pp. 459–474. Springer (2011)
18. Jao, D., De Feo, L.: Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. In: PQCrypto. Lecture Notes in Computer Science, vol. 7071, pp. 19–34. Springer (2011)
19. Josefsson, S., Liusvaara, I.: Edwards-Curve Digital Signature Algorithm (EdDSA). RFC 8032 (Jan 2017), https://rfc-editor.org/rfc/rfc8032.txt
20. Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In: CRYPTO. Lecture Notes in Computer Science, vol. 1109, pp. 104–113. Springer (1996)
21. Law, L., Menezes, A., Qu, M., Solinas, J.A., Vanstone, S.A.: An efficient protocol for authenticated key agreement. Des. Codes Cryptography 28(2), 119–134 (2003)
22. Liu, Z., Longa, P., Pereira, G., Reparaz, O., Seo, H.: FourQ on embedded devices with strong countermeasures against side-channel attacks. In: CHES (to appear). Springer, Berlin, Heidelberg (2017)
23. Liu, Z., Seo, H., Kim, H.: A synthesis of multi-precision multiplication and squaring techniques for 8-bit sensor nodes: State-of-the-art research and future challenges. J. Comput. Sci. Technol. 31(2), 284–299 (2016)
24. Montgomery, P.L.: Speeding the Pollard and elliptic curve methods of factorization. Mathematics of Computation 48(177), 243–264 (1987), http://dx.doi.org/10.2307/2007888
25. Nascimento, E., Chmielewski, L., Oswald, D., Schwabe, P.: Attacking embedded ECC implementations through cmov side channels. IACR Cryptology ePrint Archive 2016, 923 (2016)
26. Oliveira, T., López, J., Hışıl, H., Faz-Hernández, A., Rodríguez-Henríquez, F.: How to (pre-)compute a ladder. In: SAC (to appear). Springer International Publishing (2017)
27. Renes, J., Smith, B.: qDSA: Small and secure digital signatures with curve-based Diffie-Hellman key pairs. IACR Cryptology ePrint Archive 2017, 518 (2017)
28. De Santis, F., Sigl, G.: Towards side-channel protected X25519 on ARM Cortex-M4 processors. In: SPEED-B. Utrecht, The Netherlands (Oct 2016), http://ccccspeed.win.tue.nl/
29. Seo, H., Kim, H.: Consecutive operand-caching method for multiprecision multiplication, revisited. J. Inform. and Commun. Convergence Engineering 13(1), 27–35 (2015)
30. Seo, H., Liu, Z., Choi, J., Kim, H.: Multi-precision squaring for public-key cryptography on embedded microprocessors. In: INDOCRYPT. Lecture Notes in Computer Science, vol. 8250, pp. 227–243. Springer (2013)