Advances in Mathematics of Communications doi:10.3934/amc.2010.4.xxx
Volume 4, No. 2, 2010, xxx–xxx
EFFICIENT IMPLEMENTATION OF ELLIPTIC CURVE
CRYPTOGRAPHY IN WIRELESS SENSORS
Diego F. Aranha, Ricardo Dahab,
Julio López and Leonardo B. Oliveira
University of Campinas (UNICAMP)
Campinas - SP, CEP 13083-970, Brazil
(Communicated by Joan-Josep Climent)
Abstract. The deployment of cryptography in sensor networks is a challeng-
ing task, given the limited computational power and the resource-constrained
nature of the sensoring devices. This paper presents the implementation of
elliptic curve cryptography in the MICAz Mote, a popular sensor platform.
We present optimization techniques for arithmetic in binary fields, including
squaring, multiplication and modular reduction at two different security levels.
Our implementation of field multiplication and modular reduction algorithms
focuses on the reduction of memory accesses and appears as the fastest result
for this platform. Finite field arithmetic was implemented in C and Assembly
and elliptic curve arithmetic was implemented in Koblitz and generic binary
curves. We illustrate the performance of our implementation with timings for
key agreement and digital signature protocols. In particular, a key agreement
can be computed in 0.40 seconds and a digital signature can be computed and
verified in 1 second at the 163-bit security level. Our results strongly indicate
that binary curves are the most efficient alternative for the implementation of
elliptic curve cryptography in this platform.
1. Introduction
A Wireless Sensor Network (WSN) [5] is a wireless ad hoc network consisting of
resource-constrained sensoring devices (limited energy source, low communication
bandwidth, small computational power) and one or more base stations. The base
stations are more powerful and collect the data gathered by the sensor nodes so
that it can be analyzed. As in any ad hoc network, routing is accomplished by the nodes
themselves through hop-by-hop forwarding of data. Common WSN applications
range from battlefield reconnaissance and emergency rescue operations to surveillance
and environmental protection.
WSNs may be organized in different ways. In flat WSNs, all nodes play similar
roles in sensing, data processing, and routing. In hierarchical WSNs, on the other
hand, the network is typically organized into clusters, with ordinary cluster members
and the cluster heads playing different roles. While ordinary cluster members
are responsible for sensing, the cluster heads are responsible for additional tasks
such as collecting and processing the sensing data from their cluster members, and
forwarding the results towards the base stations.
2000 Mathematics Subject Classification: Primary: 11-04; Secondary: 94A60.
Key words and phrases: Efficient software implementation, cryptographic engineering, elliptic
curve cryptography, finite field arithmetic.
© 2010 AIMS-SDU
Besides the vulnerabilities already present in ad-hoc networks, WSNs pose addi-
tional challenges: the sensor nodes are commonly distributed on locations physically
accessible to adversaries; and the resources available in a sensor node are more lim-
ited than those in a conventional ad hoc network node, thus traditional solutions
are not adequate. For example, the fact that sensor nodes should be discardable
and consequently have low cost makes the integration of anti-tampering measures
on these devices difficult.
Conventional public key cryptography systems such as RSA and DSA are impractical
in this scenario due to the low processing power of sensor nodes. Until
recently, security services such as confidentiality, authentication and integrity were
achieved exclusively by symmetric techniques [26, 13]. Nowadays, however, elliptic
curve cryptography (ECC) [22, 14] has emerged as a promising alternative to
traditional public key methods on WSNs [8], because of its lower processing and
storage requirements. These features motivate the search for increasingly efficient
algorithms and implementations of ECC for such devices. The usual target platform
is the MICAz Mote [10], a node commonly used on real WSN deployments, whose
main characteristics are the low availability of RAM memory and the high cost of
memory instructions, memory addressing and bitwise shifts by arbitrary amounts.
This work proposes optimizations for implementing ECC over binary fields, im-
proving its limits of performance and viability. Experimental results show that
binary elliptic curves offer significant computational advantages over prime curves
when implemented in WSNs. Note that this observation contradicts a common
misconception that sensor nodes are not sufficiently equipped to compute elliptic
curve arithmetic over binary fields in an efficient way [8, 4].
Our main contributions in this work are:
• Efficient implementations of multiplication, squaring, modular reduction and
inversion in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$: optimized versions of known algorithms are
presented, reducing the number of memory accesses to obtain performance
gains. The new optimizations produce the fastest implementation of binary
field arithmetic published for this platform;
• Efficient implementation of elliptic curve cryptography: point multiplication
algorithms are implemented on Koblitz curves and generic binary curves. The
time for a scalar multiplication of a random point in a binary curve is 61%
faster than the best implementation so far [12] and 57% faster than the best
implementation over a prime curve [7] at the 160-bit security level. We also
present the first point multiplication timings at the 233-bit security level in
this platform. Performance is illustrated by executions of key agreement and
digital signature protocols.
The remaining sections of this paper are organized as follows. Related work
is presented in Section 2 and elementary elliptic curve concepts are introduced
in Section 3. The platform characteristics are presented in Section 4. Section 5
investigates efficient implementations of finite field arithmetic in the target platform
while Section 6 investigates efficient elliptic curve arithmetic. Section 7 presents
implementation results and Section 8 concludes the paper.
2. Related work
Cryptographic protocols are used to establish security services in WSNs. Key
agreement is a fundamental protocol in this context because it can be used to nego-
tiate cryptographic keys suitable for fast and energy-efficient symmetric algorithms.
One possible solution for key agreement in WSNs is the deployment of pairing-based
protocols, such as TinyTate [23] and TinyPBC [25], with the added advantage of
not requiring communication. Here instead we focus on the performance side and
assume that a simple one-pass Elliptic Curve Diffie-Hellman [3] protocol is employed
for key agreement. With this assumption, different implementations of ECC can be
compared by the cost of multiplying a random elliptic point by a random integer.
Gura et al. [8] presented the first implementation results of ECC and RSA on
ATmega128 microcontrollers and demonstrated the superiority of the former over
the latter. In Gura’s work, prime field arithmetic was implemented in C and Assembly
and a point multiplication took 0.81 seconds on an 8MHz device. Uhsadel
et al. [32] later presented an expected time of 0.76 seconds for computing a point
multiplication in a 7.3728MHz device. The fastest implementation of prime curves
so far [7] explores the potential of elliptic curves with efficiently computable endomorphisms
defined over optimal prime fields and computes a point multiplication
in 5.5 million cycles, or 0.745 seconds.
For binary curves, Malan et al. [20] implemented ECC using polynomial basis
and presented results for the Diffie-Hellman key agreement protocol. A public key
generation, which consists of a point multiplication, was computed in 34 seconds.
Yan and Shi [34] implemented ECC over $\mathbb{F}_{2^{163}}$ and obtained a point multiplication in
13.9 seconds, suggesting that binary curves had too high a cost for sensors’ current
technology. Eberle et al. [4] implemented ECC in Assembly over $\mathbb{F}_{2^{163}}$ and obtained
a point multiplication in 4.14 seconds, making use of architectural extensions for
additional acceleration. NanoECC [31] specialized portions of the MIRACL arithmetic
library [28] in the C programming language for efficient execution in sensor
nodes, resulting in a point multiplication in 1.27 seconds over prime fields and 2.16
seconds over binary fields, as listed in Table 1. Later, TinyECCK [29] presented an implementation of
ECC over binary curves which takes into account the platform characteristics to
optimize finite field arithmetic and obtained a point multiplication in 1.14 seconds.
Recently, Kargl et al. [12] investigated algorithms resistant to simple power analysis
and obtained a point multiplication in 0.7633 seconds on an 8MHz device. Table 1
presents the increasing efficiency of ECC in WSNs.
Finite field   Work                 Execution time (seconds)
Binary         Malan et al. [20]    34
               Yan and Shi [34]     13.9
               Eberle et al. [4]    4.14
               NanoECC [31]         2.16
               TinyECCK [29]        1.14
               Kargl et al. [12]    0.83
Prime          Wang and Li [33]     1.35
               NanoECC [31]         1.27
               Gura et al. [8]      0.87
               Uhsadel et al. [32]  0.76
               TinySA [7]           0.745
Table 1. Timings for scalar multiplication of a random point on
a MICAz Mote at the 160-bit security level. The timings are normalized
for a clock frequency of 7.3728MHz.
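The normalization used for Table 1 can be sketched as follows. This is a minimal illustration that assumes the cycle count of an implementation does not change with the clock, so a timing scales linearly with frequency; the function names are hypothetical and not part of the paper.

```python
REFERENCE_MHZ = 7.3728  # MICAz clock frequency used as reference in Table 1


def normalize_time(seconds: float, clock_mhz: float) -> float:
    """Rescale a timing measured at clock_mhz to the reference clock.

    Assumption: the cycle count stays constant across clock frequencies.
    """
    return seconds * clock_mhz / REFERENCE_MHZ


def cycles_to_seconds(cycles: float) -> float:
    """Convert a cycle count into seconds at the reference clock."""
    return cycles / (REFERENCE_MHZ * 1e6)


# Timings quoted in the text for 8MHz devices, and the cycle count of [7].
print(normalize_time(0.81, 8.0))    # Gura et al. [8]
print(normalize_time(0.7633, 8.0))  # Kargl et al. [12]
print(cycles_to_seconds(5.5e6))     # TinySA [7]
```

The printed values agree with the corresponding rows of Table 1 up to the rounding used there.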
3. Elliptic curve cryptography
An elliptic curve $E$ over a field $K$ is the set of solutions $(x, y) \in K \times K$ which
satisfy the Weierstrass equation
$$y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6,$$
where $a_1, a_2, a_3, a_4, a_6 \in K$ and the curve discriminant is $\Delta \neq 0$; together with a
point at infinity denoted by $\mathcal{O}$. If $K$ is a field of characteristic 2, then the curve is
called a binary elliptic curve and there are two cases to consider. If $a_1 \neq 0$, then an
admissible change of variables transforms $E$ to the non-supersingular binary elliptic
curve of equation
$$y^2 + xy = x^3 + ax^2 + b,$$
where $a, b \in \mathbb{F}_{2^m}$ and $\Delta = b$. A non-supersingular curve with $a \in \{0, 1\}$ and $b = 1$
is also a Koblitz curve. If $a_1 = 0$, then an admissible change of variables transforms
$E$ to the supersingular binary elliptic curve
$$y^2 + cy = x^3 + ax + b,$$
where $a, b, c \in \mathbb{F}_{2^m}$ and $\Delta = c^4$.
The number of points on the curve $E(\mathbb{F}_{2^m})$, denoted by $\#E(\mathbb{F}_{2^m})$, is called
the curve order over the field $\mathbb{F}_{2^m}$. The Hasse bound states in this case that
$n = 2^m + 1 - t$ and $|t| \leq 2\sqrt{2^m}$, where $n = \#E(\mathbb{F}_{2^m})$ and $t$ is the trace of Frobenius. A curve can
be generated with a prescribed order using the complex multiplication method [15]
or the curve order can be explicitly computed in binary curves using the approach
due to Satoh, Skjernaa and Taguchi [27]. Non-supersingularity comes from the fact
that $t$ is not a multiple of the characteristic 2 of the underlying finite field [9].
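As a small numeric illustration of the Hasse bound, the range of admissible curve orders can be computed with integer arithmetic only. This is a sketch; the helper name `hasse_interval` is hypothetical.

```python
from math import isqrt


def hasse_interval(m: int) -> tuple:
    """Return (lowest, highest) curve order allowed by the Hasse bound over F_{2^m}."""
    q = 1 << m
    t_max = isqrt(4 * q)  # floor(2 * sqrt(2^m)), computed without floating point
    return q + 1 - t_max, q + 1 + t_max


low, high = hasse_interval(163)
print(low.bit_length(), high.bit_length())
```

For the field sizes used in this paper the interval is very narrow relative to $2^m$, so the curve order always has about $m$ bits.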
The set of points $\{(x, y) \in E(\mathbb{F}_{2^m})\} \cup \{\mathcal{O}\}$ under the addition operation + (chord
and tangent) forms an additive group, with $\mathcal{O}$ as the identity element. Given
an elliptic point $P \in E(\mathbb{F}_{2^m})$ and an integer $k$, the operation $kP$, called point
multiplication, is defined by the addition of the point $P$ to itself $k - 1$ times:
$$kP = \underbrace{P + P + \cdots + P}_{k-1 \text{ additions}}.$$
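The definition above uses $k - 1$ additions, whereas practical implementations use a number of group operations proportional to the bit length of $k$. The sketch below contrasts the two on a toy group: the integers modulo 101 under addition stand in for the curve group purely for illustration, and the function names are hypothetical. It is not one of the point multiplication algorithms selected later in the paper.

```python
def naive_mul(k, point, add):
    """kP by the definition: add P to itself k - 1 times (k >= 1)."""
    result = point
    for _ in range(k - 1):
        result = add(result, point)
    return result


def double_and_add(k, point, add, identity):
    """kP with about log2(k) doublings, scanning k from its most significant bit."""
    result = identity
    for bit in bin(k)[2:]:
        result = add(result, result)      # doubling
        if bit == "1":
            result = add(result, point)   # addition
    return result


# Toy stand-in group (assumption for illustration only): Z_101 under addition.
def toy_add(x, y):
    return (x + y) % 101


print(naive_mul(37, 5, toy_add) == double_and_add(37, 5, toy_add, 0))
```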
Public key cryptography protocols, such as the Elliptic Curve Diffie-Hellman
key agreement [3] and the Elliptic Curve Digital Signature Algorithm [3], employ
point multiplication as a fundamental operation; and their security is based on the
difficulty of solving the Elliptic Curve Discrete Logarithm Problem (ECDLP). This
problem consists in finding the discrete logarithm $k$ given a point $kP$. Criteria
for selecting suitable secure curves are a complex subject and a matter of much
discussion. We adopt the well-known standard NIST curves as a conservative choice,
but we refer the reader to [3] for further details on how to generate efficient curves
where instances of the ECDLP are computationally hard.
We restrict the discussion to non-supersingular curves because supersingular
curves are not suitable for elliptic curve cryptosystems based on the ECDLP problem [21].
However, supersingular curves are particularly of interest in applications
of pairing-based protocols on WSNs [25].
4. The platform
The MICAz Mote sensor node is equipped with an ATmega128 8-bit processor
clocked at 7.3728MHz. The program code is loaded from a 128KB EEPROM chip
and runtime memory is stored in a 4KB RAM chip [10]. The ATmega128 processor
is a typical RISC architecture with 32 registers, but six of them are special
pointer registers. Since at least one register is needed to store temporary results or
data loaded from memory, 25 registers are generally available for arithmetic. The
instruction set is also reduced, as only 1-bit shift/rotate instructions are natively
supported. Bitwise shifts by arbitrary amounts can then be implemented with combinations
of shift/rotate instructions and other instructions. The processor pipeline
has two stages and memory instructions always cause pipeline stalls. Arithmetic
instructions with register operands cost 1 cycle and memory instructions or memory
addressing cost 2 processing cycles [1]. Table 2 presents the instructions provided
by the platform which can be used for the implementation of binary field arithmetic.
Instruction   Description                  Use                           Cost
lsr, lsl      Right/left 1-bit shift       Multi-precision 1-bit shift   1 cycle
ror, rol      Right/left 1-bit rotate      Multi-precision 1-bit shift   1 cycle
swap          Swap high and low nibbles    Shift by 4 bits               1 cycle
bld, bst      Bit load/store from/to flag  Shift by 7 bits               1 cycle
eor           Bitwise exclusive OR         Binary field addition         1 cycle
ld, st        Memory load/store            Read operands/write results   2 cycles
adiw, sbiw    Pointer arithmetic           Memory addressing             2 cycles
Table 2. Relevant instructions for the implementation of binary
field arithmetic.
5. Algorithms for finite field arithmetic
In this section we will represent the elements of $\mathbb{F}_{2^m}$ using a polynomial basis. Let
$f(z)$ be an irreducible binary trinomial or pentanomial of degree $m$. The elements
of $\mathbb{F}_{2^m}$ are the binary polynomials of degree at most $m - 1$. A field element
$a(z) = \sum_{i=0}^{m-1} a_i z^i$ is associated with the binary vector $a = (a_{m-1}, \ldots, a_1, a_0)$ of length $m$.
In a software implementation in an 8-bit processor, the element $a$ is stored as a
vector of $n = \lceil m/8 \rceil$ bytes. The field operations in $\mathbb{F}_{2^m}$ can be implemented by
common processor instructions, such as logical shifts (≫, ≪) and addition modulo
2 (XOR, ⊕).
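The representation can be sketched in a few lines. The sketch assumes a little-endian byte order (byte 0 holds the coefficients of $z^0$ to $z^7$), which is consistent with the masks used by the reduction algorithms below but is an assumption here; the helper names are hypothetical.

```python
def num_bytes(m: int) -> int:
    """n = ceil(m / 8): bytes needed to store an element of F_{2^m}."""
    return (m + 7) // 8


def to_vector(poly: int, m: int) -> bytes:
    """Store a polynomial (bit i = coefficient of z^i) as n little-endian bytes."""
    return poly.to_bytes(num_bytes(m), "little")


def field_add(a: bytes, b: bytes) -> bytes:
    """Addition in F_{2^m} is a byte-wise XOR; no carries and no reduction are needed."""
    return bytes(x ^ y for x, y in zip(a, b))


a = to_vector((1 << 162) | 0b101, 163)   # z^162 + z^2 + 1
b = to_vector((1 << 162) | 0b011, 163)   # z^162 + z + 1
print(num_bytes(163), num_bytes(233), field_add(a, b).hex())
```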
5.1. Multiplication. The computation of $kP$ is the most time-consuming operation
on ECC and this operation depends directly on the finite field arithmetic. In
particular, a fast field multiplication is critical for the performance of ECC.
Two different strategies are commonly considered for the implementation of multiplication
in $\mathbb{F}_{2^m}$. The first one consists in applying Karatsuba’s algorithm [11]
to divide the multiplication in sub-problems and solve each problem independently
by the following formula [9] (with $a(z) = A_1 z^{\lceil m/2 \rceil} + A_0$ and $b(z) = B_1 z^{\lceil m/2 \rceil} + B_0$):
$$c(z) = a(z) \cdot b(z) = A_1 B_1 z^m + [(A_1 + A_0)(B_1 + B_0) + A_1 B_1 + A_0 B_0] z^{\lceil m/2 \rceil} + A_0 B_0.$$
Naturally, Karatsuba multiplication imposes some overhead for the divide and conquer
steps. The second one consists in applying a direct algorithm like the López-Dahab
(LD) binary field multiplication (Algorithm 1) [19]. In this algorithm, the
precomputation window is usually chosen as $t = 4$ and the precomputation table
$T$ has size $|T| = 16(n + 1)$, since each element $T[i]$ requires at most $n + 1$ bytes
to store the result of $u(z)b(z)$. Operand $a$ is scanned from left to right and processed
in groups of 4 bits. In an 8-bit processor, the algorithm comprises two
phases, where the higher halves of the bytes of $a$ are processed in the first phase and the
lower halves are processed in the second phase. These phases are separated by an
intermediate shift which implements multiplication by $z^t$.
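The Karatsuba split can be sketched on polynomials stored as Python integers (bit $i$ holds the coefficient of $z^i$). This is an illustration, not the paper's implementation: the sub-products use a plain shift-and-XOR multiplication, the function names are hypothetical, and the high term is shifted by $2\lceil m/2 \rceil$, which coincides with the exponent $m$ of the formula above when $m$ is even.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba(a: int, b: int, m: int) -> int:
    """One Karatsuba level for operands of degree below m."""
    h = (m + 1) // 2                      # ceil(m / 2)
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    low = clmul(a0, b0)                   # A0 * B0
    high = clmul(a1, b1)                  # A1 * B1
    # Addition in GF(2)[z] is XOR, so the middle term needs one extra product.
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ low ^ high
    return (high << (2 * h)) ^ (mid << h) ^ low


a = (1 << 162) | 0x1234567890ABCDEF
b = (1 << 160) | 0xFEDCBA9876543211
print(karatsuba(a, b, 163) == clmul(a, b))
```

Three half-size products replace the four of the schoolbook split, at the price of the extra XORs that the cost analysis below accounts for.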
Algorithm 1 López-Dahab multiplication in $\mathbb{F}_{2^m}$ [19].
Input: a(z) = a[0..n − 1], b(z) = b[0..n − 1].
Output: c(z) = c[0..2n − 1].
1: Compute T(u) = u(z)b(z) for all polynomials u(z) of degree lower than t.
2: c[0..2n − 1] ← 0
3: for k ← 0 to n − 1 do
4:   u ← a[k] ≫ t
5:   for j ← 0 to n do
6:     c[j + k] ← c[j + k] ⊕ T(u)[j]
7:   end for
8: end for
9: c(z) ← c(z)z^t
10: for k ← 0 to n − 1 do
11:   u ← a[k] mod 2^t
12:   for j ← 0 to n do
13:     c[j + k] ← c[j + k] ⊕ T(u)[j]
14:   end for
15: end for
16: return c
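Algorithm 1 can be transcribed almost line by line. The sketch below assumes $t = 4$ and little-endian byte vectors, builds the table with a generic carry-less product instead of the incremental construction an optimized implementation would use, and checks the result against that same generic product; all names are hypothetical.

```python
W = 4  # window width t


def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def ld_mul(a: bytes, b: bytes) -> bytes:
    n = len(a)
    b_int = int.from_bytes(b, "little")
    # Step 1: T[u] = u(z) * b(z), each entry stored in n + 1 bytes.
    table = [clmul(u, b_int).to_bytes(n + 1, "little") for u in range(1 << W)]
    c = [0] * (2 * n)                       # step 2

    def accumulate(select):
        for k in range(n):
            row = table[select(a[k])]
            for j in range(n + 1):
                c[j + k] ^= row[j]

    accumulate(lambda byte: byte >> W)      # steps 3-8: high nibbles
    carry = 0                               # step 9: c(z) <- c(z) * z^t
    for i in range(2 * n):
        shifted = (c[i] << W) | carry
        c[i] = shifted & 0xFF
        carry = shifted >> 8
    accumulate(lambda byte: byte & 0x0F)    # steps 10-15: low nibbles
    return bytes(c)


a = bytes((13 * i + 7) % 256 for i in range(21))
b = bytes((31 * i + 3) % 256 for i in range(21))
expected = clmul(int.from_bytes(a, "little"), int.from_bytes(b, "little"))
print(int.from_bytes(ld_mul(a, b), "little") == expected)
```

Every inner-loop iteration reads and writes `c[j + k]` in memory, which is the access pattern the rotating register window is designed to avoid.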
Conventionally, the series of additions involved in the LD multiplication are implemented
through additions over subparts of a double-precision vector. In order
to reduce the number of memory accesses employed during these additions, we employ
a rotating register window. This window simulates the series of additions by
accumulating consecutive writes into registers. After a final result is obtained in
the lowest precision register, this value is written into memory and this register
is free to participate as the highest precision register. Figure 1 shows a rotating
register window with n + 1 registers. We modify the LD multiplication algorithm
by integrating a rotating register window. The result of this integration is referred to
as LD multiplication with registers and shown as Algorithm 2. Figure 2 presents
this modification graphically. These descriptions of the algorithm assume that n + 1
general-purpose registers are available for arithmetic. If this is not the case (e.g.
multiplication in $\mathbb{F}_{2^{233}}$ on this platform), the accumulation in the register window
must be divided in different blocks in a multistep fashion and each block processed
with a different rotating register window. A slight overhead is introduced between
the processing of consecutive blocks because some registers must be written into
memory and freed before they can be used in a new rotating register window.
An additional suggested optimization is the separation of the precomputation
table T in different blocks of 256 bytes, where each block is stored on a 256-byte
aligned memory address. This optimization accelerates memory addressing because
offsets lower than 256 can be computed by a simple 1-cycle addition instruction,
avoiding expensive pointer arithmetic. Another optimization is to store the results
of the first phase of the algorithm already shifted, eliminating some redundant
memory reads to reload the intermediate result into registers for multi-precision
shifting. A last optimization is the embedding of modular reduction at the end of
Figure 1. Rotating register window with n + 1 registers.
Figure 2. López-Dahab multiplication with registers of two field
elements represented as n-byte vectors in an 8-bit processor.
Algorithm 2 Proposed optimization for multiplication in $\mathbb{F}_{2^m}$ using n + 1 registers.
Input: a(z) = a[0..n − 1], b(z) = b[0..n − 1].
Output: c(z) = c[0..2n − 1].
Note: v_i denotes the vector of n + 1 registers (r_{i−1}, ..., r_0, r_n, ..., r_i).
1: Compute T(u) = u(z)b(z) for all polynomials u(z) of degree lower than 4.
2: Let u_i be the 4 most significant bits of a[i].
3: v_0 ← T(u_0), c[0] ← r_0
4: v_1 ← v_1 ⊕ T(u_1), c[1] ← r_1
5: ···
6: v_{n−1} ← v_{n−1} ⊕ T(u_{n−1}), c[n − 1] ← r_{n−1}
7: c ← ((r_{n−2}, ..., r_0, r_n) || (c[n − 1], ..., c[0])) ≪ 4
8: Let u_i be the 4 least significant bits of a[i].
9: v_0 ← T(u_0), c[0] ← c[0] ⊕ r_0
10: ···
11: v_{n−1} ← v_{n−1} ⊕ T(u_{n−1}), c[n − 1] ← c[n − 1] ⊕ r_{n−1}
12: c[n..2n − 1] ← c[n..2n − 1] ⊕ (r_{n−2}, ..., r_0, r_n)
13: return c
the multiplication algorithm. This trick allows the reuse of values already loaded
into registers to speed up modular reduction. The following analysis does not take
these suggested optimizations into account.
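The rotating register window of Algorithm 2 can be simulated with a Python list standing in for the registers $r_0, \ldots, r_n$. This is a sketch under stated assumptions: window width 4, little-endian byte vectors, a generic carry-less product to build the table, and an explicit clearing of each register after its byte is written out (the pseudocode leaves this implicit). The names are hypothetical and none of the three additional optimizations described above is modelled.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def window_pass(nibbles, table, n):
    """Accumulate sum_i T(u_i) * z^(8i) with n + 1 rotating registers."""
    r = [0] * (n + 1)                       # registers r_0 .. r_n
    out = [0] * (2 * n)
    for i in range(n):
        row = table[nibbles[i]]
        for j in range(n + 1):              # v_i <- v_i XOR T(u_i)
            r[(i + j) % (n + 1)] ^= row[j]
        out[i] = r[i]                       # finished byte: one memory write
        r[i] = 0                            # register becomes the new top slot
    out[n:] = [r[n]] + r[:n - 1]            # upper half: r_n, r_0, ..., r_{n-2}
    return out


def ld_mul_registers(a: bytes, b: bytes) -> bytes:
    n = len(a)
    b_int = int.from_bytes(b, "little")
    table = [clmul(u, b_int).to_bytes(n + 1, "little") for u in range(16)]
    c = window_pass([x >> 4 for x in a], table, n)      # lines 2-6
    carry = 0                                           # line 7: shift by 4 bits
    for i in range(2 * n):
        shifted = (c[i] << 4) | carry
        c[i] = shifted & 0xFF
        carry = shifted >> 8
    low = window_pass([x & 0x0F for x in a], table, n)  # lines 8-11
    return bytes(x ^ y for x, y in zip(c, low))         # XOR into c, line 12


a = bytes((13 * i + 7) % 256 for i in range(21))
b = bytes((31 * i + 3) % 256 for i in range(21))
expected = clmul(int.from_bytes(a, "little"), int.from_bytes(b, "little"))
print(int.from_bytes(ld_mul_registers(a, b), "little") == expected)
```

In `window_pass` each output byte is written once per pass, while all intermediate XORs stay inside the list `r`, which mirrors how the real implementation keeps them in registers.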
Analysis of multiplication algorithms. Observing the fact that the more
expensive instructions in the target platform are related to memory accesses, the
behavior of different algorithms was analyzed to estimate their performance. This
analysis traces the cost of different algorithms in terms of memory accesses (reads
and writes) and arithmetic instructions (XOR).
Without considering partial multiplications, the Karatsuba algorithm in a binary
field executes approximately 11n memory reads, 7n memory writes and 4n XOR
instructions.
For LD multiplication, analysis shows that building the precomputation table
requires n memory reads to obtain the values b[i] and |T| writes and 11n XOR
instructions for filling the table. Inside each inner loop, the algorithm executes
2(n + 1) memory reads, n + 1 writes and n + 1 XOR instructions. In each outer
loop, the algorithm executes n memory accesses to read the values a[k] and n
iterations of the inner loop, totaling n + 2n(n + 1) reads, n(n + 1) writes and
n(n + 1) XOR instructions. The logical shift of c(z) computed at the intermediate
stage requires 2n memory reads and writes. Considering the initialization of c,
we have 3n + 2(n + 2n(n + 1)) memory reads, |T| + 2(2n) + 2n(n + 1) writes and
11n + 2n(n + 1) XOR instructions.
For the proposed optimization (Algorithm 2), building the precomputation table
requires n memory reads to obtain the values b[i] and |T| writes and 11n XOR
instructions for filling the table. Line 3 of the algorithm executes n + 1 memory reads
and 1 write on c[0]. Lines 4-6 execute n + 1 memory reads, 1 write on c[i] and n + 1
XOR instructions, all this n − 1 times. The intermediate shift executes n reads and
2n writes. Lines 9-11 execute n + 1 memory reads, 1 read and write on c[i] and n + 2
XOR instructions, all this n times. The final operation costs n memory reads, writes
and XOR instructions. The algorithm thus requires a total of 3n + n(n + 1) + n(n + 2)
reads, |T| + n + 2n + 2n writes and 11n + (n − 1)(n + 1) + n(n + 2) + n XOR instructions.
Table 3 presents the costs associated with memory operations for LD multiplication,
LD with registers multiplication and Karatsuba multiplication. Table 4
presents approximate costs of the algorithms in terms of executed memory instructions
for the fields $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$.
Number of instructions in terms of vectors of n bytes
Method              Reads              Writes              XOR
López-Dahab         4n^2 + 9n          |T| + 2n^2 + 6n     2n^2 + 13n
LD with registers   2n^2 + 6n          |T| + 5n            2n^2 + 14n − 1
Karatsuba           11n + 3M(⌈n/2⌉)    7n + 3M(⌈n/2⌉)      4n + 3M(⌈n/2⌉)
Table 3. Costs in number of executed instructions for the multiplication
algorithms in $\mathbb{F}_{2^m}$. M(x) denotes the cost of a multiplication
algorithm which multiplies two x-byte vectors.
We can see from Table 3 that the number of memory accesses for LD with
registers is drastically reduced in comparison with the original algorithm, reducing
the number of reads by half and the number of writes by a quadratic factor. The
comparison between LD with registers and Karatsuba+LD with registers favors
the first (lower number of writes) on both finite fields. One problem with this
                              n = 21                  n = 30
Method                        Reads  Writes  XOR      Reads  Writes  XOR
López-Dahab                   1953   1452    1155     3870   2476    2190
LD with registers             1071   457     1175     1980   646     2219
Karatsuba+LD                  1980   1647    1239     3310   2518    1984
Karatsuba+LD with registers   1155   888     1269     1898   1134    2025
Table 4. Costs in number of executed instructions for the multiplication
algorithms in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$. The Karatsuba algorithm
in $\mathbb{F}_{2^{233}}$ executes two instances of cost M(15) and one instance of
cost M(14) to better approximate the results.
analysis is that it assumes that the processor has at least n + 1 general-purpose registers
available for arithmetic. This is not true in $\mathbb{F}_{2^{233}}$, because the algorithm requires
31 registers for a full rotating register window. The decision between a multistep
implementation of LD with registers and Karatsuba+LD with registers will depend
on the actual implementation of the algorithms.
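The closed-form counts of Table 3 are easy to evaluate mechanically. The sketch below encodes them as functions returning (reads, writes, XORs), with $|T| = 16(n + 1)$ as stated earlier, and evaluates them for $n = 30$, including the Karatsuba split into two 15-byte and one 14-byte sub-multiplications described in the caption of Table 4. The function names are hypothetical.

```python
def table_size(n: int) -> int:
    """|T| = 16(n + 1) bytes for a 4-bit precomputation window."""
    return 16 * (n + 1)


def ld_cost(n: int) -> tuple:
    """(reads, writes, XORs) of plain LD multiplication."""
    return (4 * n * n + 9 * n,
            table_size(n) + 2 * n * n + 6 * n,
            2 * n * n + 13 * n)


def ld_registers_cost(n: int) -> tuple:
    """(reads, writes, XORs) of LD multiplication with registers."""
    return (2 * n * n + 6 * n,
            table_size(n) + 5 * n,
            2 * n * n + 14 * n - 1)


def karatsuba_cost(n: int, parts, inner) -> tuple:
    """One Karatsuba level: 11n/7n/4n overhead plus the given sub-multiplications."""
    overhead = (11 * n, 7 * n, 4 * n)
    subs = [inner(p) for p in parts]
    return tuple(o + sum(s[i] for s in subs) for i, o in enumerate(overhead))


print(ld_cost(30))
print(ld_registers_cost(30))
print(karatsuba_cost(30, (15, 15, 14), ld_cost))
```

A rough cycle estimate follows from the platform costs of Section 4 by weighting memory accesses with 2 cycles and XORs with 1 cycle, although such an estimate ignores loop control and addressing overhead.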
5.2. Modular reduction. The NIST irreducible polynomial for the finite field
$\mathbb{F}_{2^{163}}$, $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$, allows a fast modular reduction algorithm.
Algorithm 3 [29] presents an adaptation of this algorithm for 8-bit processors. In this
algorithm, reducing a digit c[i] of the upper half of the vector c requires six memory
accesses to read and write c[i] on lines 3-5. Four of them are redundant because ideally
we only need to read and write c[i] once. We eliminate these redundant accesses
by employing a rotating register window of three registers which accumulate writes
into registers before a final result can be written into memory. This optimization
is given in Algorithm 4 along with the substitution of some bitwise shifts which
are expensive in this platform for cheaper ones. Since the processor only supports
1-bit and 4-bit shifts natively, we further replace the various expensive shifts in the
accumulate function R by table lookups on 256-byte tables. These tables are stored
on 256-byte aligned memory addresses to speed up memory addressing. The new
version of the accumulate function is depicted in Algorithm 5.
Algorithm 3 Fast modular reduction by $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$.
Input: c(z) = c[0..40].
Output: c(z) mod f(z) = c[0..20].
1: for i ← 40 downto 21 do
2:   t ← c[i]
3:   c[i − 19] ← c[i − 19] ⊕ (t ≫ 4) ⊕ (t ≫ 5)
4:   c[i − 20] ← c[i − 20] ⊕ (t ≪ 4) ⊕ (t ≪ 3) ⊕ t ⊕ (t ≫ 3)
5:   c[i − 21] ← c[i − 21] ⊕ (t ≪ 5)
6: end for
7: t ← c[20] ≫ 3
8: c[0] ← c[0] ⊕ (t ≪ 7) ⊕ (t ≪ 6) ⊕ (t ≪ 3) ⊕ t
9: c[1] ← c[1] ⊕ (t ≫ 1) ⊕ (t ≫ 2)
10: c[20] ← c[20] ∧ 0x07
11: return c
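Algorithm 3 can be checked against a bit-by-bit polynomial reduction. The sketch below assumes little-endian byte vectors and masks every left shift to 8 bits, as an 8-bit register would; `reduce_generic` is a hypothetical reference routine used only for comparison.

```python
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def reduce_163(c: bytes) -> bytes:
    """Algorithm 3 on a 41-byte little-endian vector; returns the 21 result bytes."""
    c = bytearray(c)
    for i in range(40, 20, -1):                       # i = 40 downto 21
        t = c[i]
        c[i - 19] ^= (t >> 4) ^ (t >> 5)
        c[i - 20] ^= ((t << 4) ^ (t << 3) ^ t ^ (t >> 3)) & 0xFF
        c[i - 21] ^= (t << 5) & 0xFF
    t = c[20] >> 3                                    # bits 163..167
    c[0] ^= ((t << 7) ^ (t << 6) ^ (t << 3) ^ t) & 0xFF
    c[1] ^= (t >> 1) ^ (t >> 2)
    c[20] &= 0x07
    return bytes(c[:21])


def reduce_generic(value: int, f: int) -> int:
    """Reference reduction: cancel the leading term until deg(value) < deg(f)."""
    deg = f.bit_length() - 1
    while value.bit_length() > deg:
        value ^= f << (value.bit_length() - 1 - deg)
    return value


c = bytes((37 * i + 11) % 256 for i in range(41))
print(int.from_bytes(reduce_163(c), "little")
      == reduce_generic(int.from_bytes(c, "little"), F163))
```

Each loop iteration touches c[i − 19], c[i − 20] and c[i − 21] in memory, and consecutive iterations touch overlapping bytes, which is the redundancy the register window of Algorithm 4 removes.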
For the NIST irreducible polynomial in $\mathbb{F}_{2^{233}}$ on 8-bit processors, we present
Algorithm 6, a direct adaptation of the standard algorithm. This algorithm only
Algorithm 4 Fast modular reduction in $\mathbb{F}_{2^{163}}$ with rotating register window.
Input: c(z) = c[0..40].
Output: c(z) mod f(z) = c[0..20].
Note: The accumulate function R(r_0, r_1, r_2, t) executes:
    s_0 ← t ≪ 4
    r_0 ← r_0 ⊕ ((t ⊕ (t ≫ 1)) ≫ 4)
    r_1 ← r_1 ⊕ s_0 ⊕ (t ≪ 3) ⊕ t ⊕ (t ≫ 3)
    r_2 ← s_0 ≪ 1
1: r_b ← 0, r_c ← 0
2: for i ← 40 downto 25 by 3 do
3:   R(r_b, r_c, r_a, c[i]), c[i − 19] ← c[i − 19] ⊕ r_b
4:   R(r_c, r_a, r_b, c[i − 1]), c[i − 20] ← c[i − 20] ⊕ r_c
5:   R(r_a, r_b, r_c, c[i − 2]), c[i − 21] ← c[i − 21] ⊕ r_a
6: end for
7: R(r_b, r_c, r_a, c[22]), c[3] ← c[3] ⊕ r_b
8: R(r_c, r_a, r_b, c[21]), c[2] ← c[2] ⊕ r_c
9: r_a ← c[1] ⊕ r_a
10: r_b ← c[0] ⊕ r_b
11: t ← c[20]
12: c[20] ← t ∧ 0x07
13: t ← t ≫ 3
14: c[0] ← r_b ⊕ (t ≪ 7) ⊕ (t ≪ 6) ⊕ (t ≪ 3) ⊕ t
15: c[1] ← r_a ⊕ (t ≫ 1) ⊕ (t ≫ 2)
16: return c
Algorithm 5 Optimized version of the accumulate function R.
Input: r_0, r_1, r_2, t.
Output: r_0, r_1, r_2.
1: r_0 ← r_0 ⊕ T_0[t]
2: r_1 ← r_1 ⊕ T_1[t]
3: r_2 ← t ≪ 5
executes 1-bit or 7-bit shifts. These two shifts can be translated efficiently to the
processor instruction set, because 1-bit shifts are supported natively and 7-bit shifts
can be emulated efficiently. Hence lookup tables are not needed and the only optimization
made during implementation of Algorithm 6 was complete unrolling of
the main loop and straightforward elimination of consecutive redundant memory
accesses.
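The register-window reduction of Algorithms 4 and 5 can be sketched with three Python variables playing the roles of $r_a$, $r_b$ and $r_c$. The contents of the lookup tables $T_0$ and $T_1$ are not listed in the text; the sketch assumes they hold the two shift combinations that Algorithm 3 applies to bytes i − 19 and i − 20, truncated to 8 bits. The result is compared with a bit-by-bit reference reduction (`reduce_generic`, hypothetical).

```python
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

# Assumed table contents: the shift patterns of Algorithm 3, one entry per byte value.
T0 = [(t >> 4) ^ (t >> 5) for t in range(256)]
T1 = [((t << 4) ^ (t << 3) ^ t ^ (t >> 3)) & 0xFF for t in range(256)]


def accumulate(r0, r1, r2, t):
    """Algorithm 5: absorb byte t into the window; r2 is overwritten."""
    return r0 ^ T0[t], r1 ^ T1[t], (t << 5) & 0xFF


def reduce_163_window(c: bytes) -> bytes:
    c = bytearray(c)
    ra = rb = rc = 0
    for i in range(40, 24, -3):                        # i = 40, 37, ..., 25
        rb, rc, ra = accumulate(rb, rc, ra, c[i])
        c[i - 19] ^= rb                                # one write per byte
        rc, ra, rb = accumulate(rc, ra, rb, c[i - 1])
        c[i - 20] ^= rc
        ra, rb, rc = accumulate(ra, rb, rc, c[i - 2])
        c[i - 21] ^= ra
    rb, rc, ra = accumulate(rb, rc, ra, c[22])
    c[3] ^= rb
    rc, ra, rb = accumulate(rc, ra, rb, c[21])
    c[2] ^= rc
    ra ^= c[1]
    rb ^= c[0]
    t = c[20]
    c[20] = t & 0x07
    t >>= 3
    c[0] = rb ^ (((t << 7) ^ (t << 6) ^ (t << 3) ^ t) & 0xFF)
    c[1] = ra ^ (t >> 1) ^ (t >> 2)
    return bytes(c[:21])


def reduce_generic(value: int, f: int) -> int:
    """Reference reduction: cancel the leading term until deg(value) < deg(f)."""
    deg = f.bit_length() - 1
    while value.bit_length() > deg:
        value ^= f << (value.bit_length() - 1 - deg)
    return value


c = bytes((37 * i + 11) % 256 for i in range(41))
print(int.from_bytes(reduce_163_window(c), "little")
      == reduce_generic(int.from_bytes(c, "little"), F163))
```

Each byte of the lower part is now read and written once, because the three partial contributions that target it are combined in a register first.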
Analysis of modular reduction algorithms. As pointed out by Seo et al. [29],
Algorithm 3 executes many redundant memory accesses: 4 memory reads and 3
writes during each loop iteration and additional 4 reads and 3 writes on the final
step, which sum up to 88 reads and 66 writes. The proposed optimization reduces
the number of memory operations to 43 reads and 23 writes. Despite Algorithm 5
being specialized for the chosen polynomial, the register window technique can be
applied to any irreducible polynomial with the non-null coefficients located in the
first word. The implementation of Algorithm 6 also reduces the number of memory
Algorithm 6 Fast modular reduction by $f(z) = z^{233} + z^{74} + 1$.
Input: c(z) = c[0..58].
Output: c(z) mod f(z) = c[0..29].
1: for i ← 58 downto 32 by 2 do
2:   t_0 ← c[i]
3:   t_1 ← c[i − 1]
4:   c[i − 19] ← c[i − 19] ⊕ (t_0 ≫ 7)
5:   c[i − 20] ← c[i − 20] ⊕ (t_0 ≪ 1) ⊕ (t_1 ≫ 7)
6:   c[i − 21] ← c[i − 21] ⊕ (t_1 ≪ 1)
7:   c[i − 29] ← c[i − 29] ⊕ (t_0 ≫ 1)
8:   c[i − 30] ← c[i − 30] ⊕ (t_0 ≪ 7) ⊕ (t_1 ≫ 1)
9:   c[i − 31] ← c[i − 31] ⊕ (t_1 ≪ 7)
10: end for
11: t_0 ← c[30]
12: c[0] ← c[0] ⊕ (t_0 ≪ 7)
13: c[1] ← c[1] ⊕ (t_0 ≫ 1)
14: c[10] ← c[10] ⊕ (t_0 ≪ 1)
15: c[11] ← c[11] ⊕ (t_0 ≫ 7)
16: t_0 ← c[29] ≫ 1
17: c[0] ← c[0] ⊕ t_0
18: c[9] ← c[9] ⊕ (t_0 ≪ 2)
19: c[10] ← c[10] ⊕ (t_0 ≫ 6)
20: c[29] ← c[29] ∧ 0x01
21: return c
accesses, since a standard implementation executes 122 reads and 92 writes while
our implementation executes 92 memory reads and 62 writes.
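Algorithm 6 can be transcribed and checked in the same way as the reduction for the smaller field. The sketch assumes little-endian byte vectors and 8-bit truncation of left shifts; it follows the pseudocode directly and does not model the loop unrolling or the elimination of redundant accesses. `reduce_generic` is a hypothetical reference routine.

```python
F233 = (1 << 233) | (1 << 74) | 1


def reduce_233(c: bytes) -> bytes:
    """Algorithm 6 on a 59-byte little-endian vector; returns the 30 result bytes."""
    c = bytearray(c)
    for i in range(58, 30, -2):                       # i = 58, 56, ..., 32
        t0, t1 = c[i], c[i - 1]
        c[i - 19] ^= t0 >> 7
        c[i - 20] ^= ((t0 << 1) & 0xFF) ^ (t1 >> 7)
        c[i - 21] ^= (t1 << 1) & 0xFF
        c[i - 29] ^= t0 >> 1
        c[i - 30] ^= ((t0 << 7) & 0xFF) ^ (t1 >> 1)
        c[i - 31] ^= (t1 << 7) & 0xFF
    t0 = c[30]
    c[0] ^= (t0 << 7) & 0xFF
    c[1] ^= t0 >> 1
    c[10] ^= (t0 << 1) & 0xFF
    c[11] ^= t0 >> 7
    t0 = c[29] >> 1                                   # bits 233..239
    c[0] ^= t0
    c[9] ^= (t0 << 2) & 0xFF
    c[10] ^= t0 >> 6
    c[29] &= 0x01
    return bytes(c[:30])


def reduce_generic(value: int, f: int) -> int:
    """Reference reduction: cancel the leading term until deg(value) < deg(f)."""
    deg = f.bit_length() - 1
    while value.bit_length() > deg:
        value ^= f << (value.bit_length() - 1 - deg)
    return value


c = bytes((29 * i + 5) % 256 for i in range(59))
print(int.from_bytes(reduce_233(c), "little")
      == reduce_generic(int.from_bytes(c, "little"), F233))
```

Only shifts by 1 and 7 bits appear, which is why no lookup tables are needed for this field.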
5.3. Squaring. The square of a finite field element $a(z) \in \mathbb{F}_{2^m}$ is given by
$a(z)^2 = \sum_{i=0}^{m-1} a_i z^{2i} = a_{m-1} z^{2m-2} + \cdots + a_2 z^4 + a_1 z^2 + a_0$. The binary representation of $a(z)^2$
can be computed by inserting a “0” bit between each pair of successive bits on the
binary representation of $a(z)$ and accelerated by introducing a 16-byte lookup table.
If modular reduction is computed in a separate step, redundant memory operations
are required to store the squaring result and reload this result for reduction. This
can be improved by embedding the modular reduction step directly into the squaring
algorithm. This way, the lower half of the digit vector a is expanded in the usual
fashion and the upper half digits are expanded and immediately reduced. If modular
reduction of a single byte requires expensive shifts, additional lookup tables can be
used to store the expanded bytes already reduced. This is illustrated in Algorithm 7
which computes squaring in $\mathbb{F}_{2^{163}}$ using the same small rotating register window
as Algorithm 5 and three additional 16-byte lookup tables $T_0$, $T_1$ and $T_2$. For
squaring in $\mathbb{F}_{2^{233}}$, we also combine byte expansion of the digit vector’s lower half
with Algorithm 6 for fast reduction.
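The expansion step on its own, without the embedded reduction, can be sketched with the 16-entry table described above. The sketch assumes little-endian byte vectors; the names are hypothetical and the check uses a generic carry-less product.

```python
# 16-byte table: nibble (u3, u2, u1, u0) -> byte (0, u3, 0, u2, 0, u1, 0, u0).
SQ_TABLE = [sum(((u >> k) & 1) << (2 * k) for k in range(4)) for u in range(16)]


def square_expand(a: bytes) -> bytes:
    """Unreduced square: each input byte expands into two output bytes."""
    out = bytearray(2 * len(a))
    for i, byte in enumerate(a):
        out[2 * i] = SQ_TABLE[byte & 0x0F]      # low nibble
        out[2 * i + 1] = SQ_TABLE[byte >> 4]    # high nibble
    return bytes(out)


def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less product, used here only as a reference."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


a = bytes((17 * i + 9) % 256 for i in range(21))
a_int = int.from_bytes(a, "little")
print(int.from_bytes(square_expand(a), "little") == clmul(a_int, a_int))
```

Squaring is linear over $\mathbb{F}_2$, so no cross terms appear and the cost is a table lookup per nibble, far below that of a general multiplication.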
5.4. Inversion. For inversion in $\mathbb{F}_{2^m}$ we implemented the Extended Euclidean
Algorithm for polynomials [9]. Since this algorithm requires flexible left shifts by
arbitrary amounts, we implemented six dedicated shifting functions to shift a binary
field element by every amount possible for an 8-bit processor. The core of a multi-precision
left shift algorithm is the sequence of instructions which receives as input
Algorithm 7 Squaring in $\mathbb{F}_{2^{163}}$.
Input: a(z) = a[0..20].
Output: c(z) = a(z)^2 mod f(z).
Note: The accumulate function R(r_0, r_1, r_2, t) executes:
    r_0 ← r_0 ⊕ T_0[t], r_1 ← r_1 ⊕ T_1[t], r_2 ← r_2 ⊕ T_2[t]
1: For each 4-bit combination u, T(u) = (0, u_3, 0, u_2, 0, u_1, 0, u_0).
2: for i ← 0 to 9 do
3:   c[2i] ← T(a[i] ∧ 0x0F)
4:   c[2i + 1] ← T(a[i] ≫ 4)
5: end for
6: c[20] ← T(a[10] ∧ 0x0F)
7: r_b ← 0, r_c ← 0, j ← 20
8: t_0 ← a[20] ∧ 0x0F
9: R(r_b, r_c, r_a, t_0), c[21] ← r_b
10: for i ← 19 downto 13 by 3 do
11:   a_0 ← a[i], t_0 ← a_0 ≫ 4, t_1 ← a_0 ∧ 0x0F
12:   R(r_c, r_a, r_b, t_0), c[j] ← c[j] ⊕ r_c
13:   R(r_a, r_b, r_c, t_1), c[j − 1] ← c[j − 1] ⊕ r_a
14:   a_0 ← a[i − 1], t_0 ← a_0 ≫ 4, t_1 ← a_0 ∧ 0x0F
15:   R(r_b, r_c, r_a, t_0), c[j − 2] ← c[j − 2] ⊕ r_b
16:   R(r_c, r_a, r_b, t_1), c[j − 3] ← c[j − 3] ⊕ r_c
17:   a_0 ← a[i − 2], t_0 ← a_0 ≫ 4, t_1 ← a_0 ∧ 0x0F
18:   R(r_a, r_b, r_c, t_0), c[j − 4] ← c[j − 4] ⊕ r_a
19:   R(r_b, r_c, r_a, t_1), c[j − 5] ← c[j − 5] ⊕ r_b
20:   j ← j − 6
21: end for
22: t_0 ← a[10] ≫ 4
23: R(r_c, r_a, r_b, t_0), c[2] ← c[2] ⊕ r_c
24: r_a ← c[1] ⊕ r_a, r_b ← c[0] ⊕ r_b
25: t ← c[21]
26: r_a ← r_a ⊕ t ⊕ (t ≪ 3) ⊕ (t ≪ 4) ⊕ (t ≫ 3)
27: r_b ← r_b ⊕ (t ≪ 5)
28: t ← c[20]
29: c[20] ← t ∧ 0x07
30: t ← t ≫ 3
31: c[0] ← r_b ⊕ (t ≪ 7) ⊕ (t ≪ 6) ⊕ (t ≪ 3) ⊕ t
32: c[1] ← r_a ⊕ (t ≫ 1) ⊕ (t ≫ 2)
33: return c
the amount to shift i, a register r and a carry register rc storing the bits shifted
out in the last iteration, and produces (r ≪ i) ⊕ rc as output and r ≫ (8 − i) as the new
carry. Table 5 lists the required instructions and costs in cycles for shifting a single
byte in each of the implemented multi-precision shifts by i bits. Each instruction
in the table costs 1 cycle, thus the cost to compute the core of a multi-precision left
shift by i bits is just the number of instructions listed for i in the table.
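The input/output behaviour of this core can be modelled as follows. The sketch is a Python model of the data flow only, with a hypothetical function name; it does not reproduce the instruction sequences of Table 5. The dictionary `CORE_CYCLES` is obtained by counting the instructions listed for each i in Table 5, at one cycle each.

```python
# Instruction counts per shift amount, read off Table 5 (1 cycle each).
CORE_CYCLES = {1: 1, 2: 7, 3: 9, 4: 6, 5: 8, 6: 8, 7: 5}

def shift_left(a, i):
    """Shift a little-endian byte vector left by i bits, 1 <= i <= 7.

    Returns the shifted vector and the carry leaving the top byte.
    """
    if not 1 <= i <= 7:
        raise ValueError("shift amount must be between 1 and 7")
    rc = 0  # carry register: bits shifted out of the previous byte
    out = []
    for r in a:
        out.append(((r << i) & 0xFF) ^ rc)  # (r << i) XOR carry-in
        rc = r >> (8 - i)                   # new carry for the next byte
    return out, rc
```

Under this model, the core of a shift of an n-byte field element by i bits costs about `n * CORE_CYCLES[i]` cycles, excluding the loads and stores around it.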
6. Algorithms for elliptic curve arithmetic
We have selected fast algorithms for elliptic curve arithmetic in three situations:
multiplying a random point P by a scalar k, multiplying the generator G by a scalar
k and simultaneously multiplying two points P and Q by scalars k and l to obtain
i   Instructions
1   rol r
2   clr rt; lsl r; rol rt; lsl r; rol rt; eor r, rc; mov rc, rt
3   clr rt; lsl r; rol rt; lsl r; rol rt; lsl r; rol rt; eor r, rc; mov rc, rt
4   swap r; mov rt, r; andi r, 0xF0; andi rt, 0x0F; eor r, rc; mov rc, rt
5   swap r; mov rt, r; andi r, 0xF0; andi rt, 0x0F; lsl r; rol rt; eor r, rc; mov rc, rt
6   bst rt, 0; bld r, 6; bst rt, 1; bld r, 7; lsr rt; lsr rt; eor r, rc; mov rc, rt
7   bst rt, 0; bld r, 7; lsr rt; eor r, rc; mov rc, rt
Table 5. Processor instructions used to efficiently implement
multi-precision left shifts by i bits. The input register is r, the
carry register is rc and a temporary register is rt. When i = 1, rc
is represented by the carry processor flag.
kP + lQ. Our implementation uses mixed addition with projective coordinates [18],
given that the ratio of inversion to multiplication is 16.
For multiplying a random point by a scalar, we choose Solinas’ τ-adic non-adjacent
form (TNAF) representation [30] with w = 4 for Koblitz curves (4-TNAF
method with 4 precomputation points) and the method due to López and Dahab [17]
for random binary curves. Solinas’ algorithm exploits the optimizations provided
by Koblitz curves and accelerates the computation of kP by replacing point
doublings with applications of the efficiently computable endomorphism based on the
Frobenius map $\tau(x, y) = (x^2, y^2)$. The method due to López and Dahab does not use
precomputation; its execution time is constant and each iteration of the algorithm
executes the same number of operations, independently of the bit pattern in k [9].
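To make the τ-adic representation concrete, the sketch below computes the plain TNAF (width 2, digits in {0, ±1}) of an integer k, using the relation $\tau^2 = \mu\tau - 2$ with $\mu = \pm 1$ depending on the Koblitz curve. It is a simplified illustration and not the implemented method: the width-4 variant and the reduction of k that keeps the expansion at roughly m digits are both omitted, and the function names are hypothetical.

```python
def tnaf(k, mu):
    """Digits u_0, u_1, ... in {-1, 0, 1} with k = sum(u_i * tau^i)."""
    r0, r1 = k, 0          # current element r0 + r1*tau of Z[tau]
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:
            u = 2 - ((r0 - 2 * r1) % 4)   # +1 or -1, forces next digit to 0
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # Exact division by tau, using tau^2 = mu*tau - 2.
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits

def evaluate(digits, mu):
    """Horner evaluation in Z[tau]; returns (a, b) meaning a + b*tau."""
    a, b = 0, 0
    for u in reversed(digits):
        a, b = -2 * b + u, a + mu * b     # multiply by tau, then add u
    return a, b
```

For example, `tnaf(3, 1)` gives `[-1, 0, 1, 0, 0, 1]`, that is $3 = \tau^5 + \tau^2 - 1$ when $\mu = 1$. Each zero or non-zero digit costs one Frobenius application in place of a doubling, and only non-zero digits cost a point addition.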
For multiplying the generator, we employ the same 4-TNAF method for Koblitz
curves; for generic curves, we employ the Comb method [16] with 16 precomputed
points. Precomputed tables for the generator are stored in ROM memory to
reduce RAM consumption. Larger precomputed tables can be used if program size
is not an issue.
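The structure of the Comb method with a 16-entry table can be sketched as below. To keep the example short, the elliptic curve group is replaced by a stand-in group, the multiplicative group modulo the prime $2^{127} - 1$, so that "point addition" becomes modular multiplication and "doubling" becomes squaring. The modulus, base and function names are assumptions made for illustration only.

```python
P = 2**127 - 1   # stand-in group modulus (assumption, not a curve)
G = 3            # stand-in for the fixed generator
W = 4            # comb width: 2^4 = 16 table entries
D = 41           # ceil(163 / 4) columns for a 163-bit scalar

def build_table(g, d):
    """table[a] combines g^(2^(j*d)) for every set bit j of a."""
    table = [1] * (1 << W)
    for a in range(1, 1 << W):
        acc = 1
        for j in range(W):
            if (a >> j) & 1:
                acc = acc * pow(g, 1 << (j * d), P) % P
        table[a] = acc
    return table

def comb_mul(k, table, d):
    """Compute g^k with d squarings and at most d table multiplications."""
    q = 1
    for i in range(d - 1, -1, -1):
        q = q * q % P                      # one "doubling" per column
        col = 0
        for j in range(W):                 # gather bit i of each of the W rows
            col |= ((k >> (j * d + i)) & 1) << j
        q = q * table[col] % P             # one "addition" per column
    return q

TABLE = build_table(G, D)
```

The table depends only on the generator, which is why it can be computed offline and placed in ROM; a larger width W shortens the loop at the price of a table that doubles in size with each extra bit.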
For simultaneous multiplication, we implement the interleaving method with
4-TNAFs for Koblitz curves and the interleaving of 4-NAFs with integers represented
in non-adjacent form (NAF) for generic curves [6]. The same table built for
multiplying the generator is used during simultaneous multiplication in Koblitz curves
when point P or Q is the generator G. An additional small table of 4 points is
precomputed for the generator and stored in ROM to provide the same benefit
with generic curves.
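The interleaving of a 4-NAF with a NAF can be illustrated as follows. In this sketch the group is simply the integers under addition, so a "point" is an integer and the result can be checked directly against k·P + l·Q; the function names are hypothetical and no curve arithmetic is involved.

```python
def wnaf(k, w):
    """Width-w NAF of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            u = k % (1 << w)
            if u >= 1 << (w - 1):
                u -= 1 << w        # signed residue, odd, |u| < 2^(w-1)
            k -= u
        else:
            u = 0
        digits.append(u)
        k >>= 1
    return digits

def interleave(k, p, wk, l, q, wl):
    """Compute k*p + l*q with one shared chain of doublings."""
    nk, nl = wnaf(k, wk), wnaf(l, wl)
    # Odd multiples 1, 3, ..., 2^(w-1) - 1 of each base.
    tab_p = {u: u * p for u in range(1, 1 << (wk - 1), 2)}
    tab_q = {u: u * q for u in range(1, 1 << (wl - 1), 2)}
    acc = 0
    for i in range(max(len(nk), len(nl)) - 1, -1, -1):
        acc = 2 * acc                              # shared doubling
        for digits, tab in ((nk, tab_p), (nl, tab_q)):
            u = digits[i] if i < len(digits) else 0
            if u > 0:
                acc += tab[u]
            elif u < 0:
                acc -= tab[-u]                     # negation is cheap on curves
    return acc
```

With `wk = 4` the table for the first base holds the four odd multiples 1, 3, 5 and 7, matching the small table of 4 points mentioned above; with `wl = 2` the second scalar is in plain NAF and needs no table beyond the point itself.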
7. Implementation results
The compiler and assembler used are those of the GCC 4.1.2 suite for ATmega128, with
optimization level -O2. The timings were measured with the software AVR Studio
4.14 [2]. This tool is a cycle-accurate simulator frequently used to prototype
software for execution on the target platform. We have written a specialized library
containing the software implementations.
Finite field arithmetic. The algorithms for squaring, multiplication, modular
reduction and inversion in the finite field were implemented in the C language
and Assembly. Table 6 presents the costs measured in cycles of each implemented
operation in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$. Since the platform does not
have cache memory or out-of-order execution, the finite field operations always cost
the same number of cycles and the timings were taken exactly once, except for inversion.
The timing for inversion was taken as the average of 50 timings measured on consecutive
executions of the algorithm.
                                     m = 163               m = 233
Algorithm                      C language  Assembly  C language  Assembly
Squaring                              629       430         908       463
Modular Squaring                     1154       570        1340       956
LD Mult. with registers             13838      4508           –      8314
LD Mult. (new variant)               9738         –       18028         –
Karatsuba+LD with registers         12246      6968       25850      9261
Modular reduction                     606       430         911       620
Inversion                          243790     81365      473618    142986
Table 6. Timings in cycles for arithmetic algorithms in $\mathbb{F}_{2^m}$.
From Table 6, m = 163, we can observe that in the C language implementation,
Karatsuba+LD with registers multiplication is more efficient than the direct
application of LD with registers multiplication. This contradicts the preliminary
analysis based on the number of memory accesses executed by each algorithm. It
can be explained by the fact that LD with registers multiplication uses 21 of
the 32 general-purpose registers to store intermediate results during multiplication.
Several additional registers are also needed to store memory addresses and temporary
variables for arithmetic operations. The inefficiency thus stems from the difficulty
the C compiler has in keeping all intermediate values in registers.
To confirm this limitation, a new variant of LD with registers multiplication which
reduces the number of temporary variables needed was also implemented. This variant
processes 32 bits of the operand in each iteration, compared to the original
version of LD multiplication which processes 4 bits in each iteration. The new
variant reduces the number of memory accesses while keeping a smaller number of
temporary variables and thus exhibits the expected performance. For the squaring
algorithm, we can see that embedding the modular reduction step reduces the cost
of modular squaring significantly compared with the sequential execution of squaring
plus modular reduction. Table 6, m = 233, shows that the Karatsuba algorithm
in $\mathbb{F}_{2^{233}}$ indeed does not improve performance over the multistep
implementation of LD with registers multiplication, even if the processor does not have
enough registers to store the full rotating register window. The Assembly implementations,
with their considerably faster timings, demonstrate the compiler's inefficiency in
generating optimized code and allocating resources for the target platform.
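For readers who want to see the two multiplication strategies compared above side by side, the following sketch shows a 4-bit windowed multiplication in the spirit of the López-Dahab method and one level of Karatsuba over binary polynomials. It is a functional model only: polynomials are held in Python integers, so neither the digit-vector comb ordering nor the rotating register window of the actual implementation appears, and all names are hypothetical.

```python
def clmul(x, y):
    """Reference carry-less multiplication of polynomials over F_2."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r

def window_mul(a, b, w=4):
    """Multiply using a table of u(z)*b(z) for every w-bit u."""
    table = [clmul(u, b) for u in range(1 << w)]
    mask = (1 << w) - 1
    c = 0
    windows = (a.bit_length() + w - 1) // w
    for i in range(windows - 1, -1, -1):   # most significant window first
        c = (c << w) ^ table[(a >> (w * i)) & mask]
    return c

def karatsuba(a, b, n, mul=window_mul):
    """One Karatsuba level for operands of degree below 2n."""
    mask = (1 << n) - 1
    a0, a1 = a & mask, a >> n
    b0, b1 = b & mask, b >> n
    lo = mul(a0, b0)
    hi = mul(a1, b1)
    # In characteristic 2, addition and subtraction are both XOR.
    mid = mul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi
    return (hi << (2 * n)) ^ (mid << n) ^ lo
```

Karatsuba trades one of four half-size products for a few extra XORs, so whether it pays off depends on how cheap the half-size products are on the platform, which is the effect visible in the C and Assembly columns of Table 6.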
Elliptic curve arithmetic. Point multiplication was implemented on elliptic
curves standardized by NIST. Table 7 presents the execution time of the
multiplication of a random point P by a random integer k of 163 or 233 bits, with the
underlying finite field arithmetic implemented in C or Assembly. In each of the
programming languages, the fastest field multiplication algorithm is used. The results
were computed as the arithmetic mean of the timings measured on 50 consecutive
executions of the algorithm.
                             C language               Assembly
Curve                    kG     kP   kP + lQ     kG     kP   kP + lQ
NIST-K163 (Koblitz)    0.56   0.67      1.24   0.29   0.32      0.60
NIST-B163 (Generic)    0.77   1.55      2.21   0.37   0.74      1.04
NIST-K233 (Koblitz)    1.26   1.48      2.81   0.66   0.73      1.35
NIST-B233 (Generic)    1.94   3.90      5.35   0.94   1.89      2.52
Table 7. Timings in seconds for point multiplication.
Table 8 compares the performance of the proposed implementation with
TinyECCK [29] and the work of Kargl et al. [12], the previously fastest binary
curve implementations in C and Assembly, respectively, published for this platform.
For the C implementation, we achieve faster timings on all finite field arithmetic
operations, with improvements over 50%. For the Assembly implementation, we obtain speed
improvements on field squaring and multiplication and nearly the same timing for
modular reduction (430 versus 433 cycles), but the polynomial used by Kargl et al. [12]
is a trinomial carefully selected to support a faster modular reduction algorithm.
The computation of kP on Koblitz curves implemented in the C language was 41% faster
than TinyECCK. By choosing the López-Dahab point multiplication algorithm with generic
curves implemented in Assembly, we achieve a timing 11% faster than [12] while
satisfying the timing-resistant property. If we relax this condition, we obtain a point
multiplication 61% faster in Assembly by using Solinas’ method. Comparing our
Assembly implementation with TinyECCK and [12] with the same curve parameters,
we achieve a 72% speedup and an 11% speedup for point multiplication,
respectively.
                     Proposed    TinyECCK    Proposed  Kargl et al. [12]
Algorithm          C language  C language    Assembly           Assembly
Modular Squaring       1154 c      2729 c       570 c              663 c
Multiplication         9738 c     19670 c      4508 c             5057 c
Modular reduction       606 c      1904 c       430 c              433 c
Inversion            243790 c    539132 c     81365 c                  –
kP on Koblitz          0.67 s      1.14 s      0.32 s                  –
kP on Generic          1.55 s           –      0.74 s             0.83 s
Table 8. Comparison between different implementations. The
timings are presented in cycles (c) or seconds (s) on a 7.3728 MHz
device.
The fastest time for point multiplication previously published for this platform
at the 160-bit security level was 0.745 second [7]. Compared to this implementation,
which uses prime fields, the proposed optimizations result in a point multiplication
57% faster.
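The percentages quoted in this comparison follow directly from the timings in Table 8 and from the 0.745 second figure of [7], taking "x% faster" to mean the relative reduction in time. A short check, with a hypothetical helper name:

```python
def faster_pct(ours, theirs):
    """Relative time reduction of `ours` with respect to `theirs`, in percent."""
    return round(100 * (theirs - ours) / theirs)

# Point multiplication timings in seconds (Table 8 and reference [7]).
c_koblitz_vs_tinyecck = faster_pct(0.67, 1.14)     # C, Koblitz
asm_generic_vs_kargl = faster_pct(0.74, 0.83)      # Assembly, López-Dahab
asm_koblitz_vs_kargl = faster_pct(0.32, 0.83)      # Assembly, Solinas
asm_koblitz_vs_tinyecck = faster_pct(0.32, 1.14)   # Assembly vs. C TinyECCK
asm_koblitz_vs_prime = faster_pct(0.32, 0.745)     # vs. prime-field result

# Field arithmetic timings in cycles, proposed C vs. TinyECCK (Table 8).
field_gains = [faster_pct(1154, 2729), faster_pct(9738, 19670),
               faster_pct(606, 1904), faster_pct(243790, 539132)]
```

The five point multiplication values evaluate to 41, 11, 61, 72 and 57, matching the text, and every entry of `field_gains` is at least 50.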
The implemented optimizations allow performance gains but have a side
effect on memory consumption. Table 9 presents memory requirements for code size
and RAM memory for the different implementations at the 160-bit security level.
We can also observe that Assembly implementations are responsible for a significant
expansion in program code size.
                                   ROM memory  Static RAM  Stack RAM
Proposed (Koblitz) – C                  22092        1028       1207
Proposed (Koblitz) – C+Assembly         25802        1732       1207
Proposed (Generic) – C                  12848         881        682
Proposed (Generic) – C+Assembly         16218        1585        682
TinyECCK (C-only)                        5592           –        618
Kargl et al. (C+Assembly) [12]          11264           –          –
Table 9. Cost in bytes of memory for implementations of scalar
multiplication of a random point at the 160-bit security level.
Cryptographic protocols. We now illustrate the performance obtained by our
efficient implementation with some executions of cryptographic protocols for key
agreement and digital signatures. Key agreement is employed in sensor networks
for establishing symmetric keys which can be used for encryption or authentication.
Digital signatures are employed for communication between the sensor nodes and
the base stations where data must be made available to multiple applications and
users [24]. For key agreement between nodes, we implemented the Elliptic Curve
Diffie-Hellman (ECDH) protocol [3], and for digital signatures, we implemented
the Elliptic Curve Digital Signature Algorithm (ECDSA) [3]. We assume that public
and private keys are generated and loaded into the nodes before the deployment of
the sensor network. Hence timings for key generation and public key authentication
are not presented or considered. Table 10 presents the timings for the ECDH
protocol and Table 11 presents the timings for the ECDSA protocol, using the choice
of algorithms discussed in Section 6. The results in these tables pose an interesting
decision between deploying generic binary curves at the lower security level or
deploying special curves at the higher security level.
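The structure of an ECDH exchange with preloaded keys can be sketched on a toy curve. Note the substitution: for brevity the example uses a small prime-field curve, $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with base point G = (5, 1) of order 19, rather than the NIST binary curves used in this paper, and it offers no security. All names and the private keys are illustrative assumptions.

```python
P_MOD, A = 17, 2          # toy curve y^2 = x^3 + 2x + 2 over F_17
G, N = (5, 1), 19         # base point and its order

def add(p, q):
    """Affine point addition; None represents the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                           # p + (-p)
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, P_MOD - 2, P_MOD)
    else:
        lam = (y2 - y1) * pow(x2 - x1, P_MOD - 2, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return x3, (lam * (x1 - x3) - y1) % P_MOD

def mul(k, p):
    """Binary double-and-add scalar multiplication."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, p)
        p = add(p, p)
        k >>= 1
    return acc

# Keys assumed to be preloaded before deployment, as in the text.
d_a, d_b = 3, 7
q_a, q_b = mul(d_a, G), mul(d_b, G)

# Each node performs a single random-point multiplication kP.
shared_a = mul(d_a, q_b)
shared_b = mul(d_b, q_a)
```

Since key generation happens before deployment, the online work of each node in this sketch is a single multiplication of the peer's public point by the node's private key.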
                  C language           Assembly
Curve         Time   ROM   RAM    Time   ROM   RAM
NIST-K163     0.74  28.3   2.2    0.39  32.0   2.8
NIST-B163     1.62  24.0   1.1    0.81  27.8   1.9
NIST-K233     1.55  31.0   2.9    0.80  38.6   3.7
NIST-B233     3.97  26.9   1.5    1.96  34.6   2.2
Table 10. Timings for the ECDH protocol execution. Timings
are given in seconds and ROM memory or Static+Stack RAM
consumption are given in KB.
                       C language                    Assembly
Curve         Time (S + V)   ROM   RAM    Time (S + V)   ROM   RAM
NIST-K163      0.67 + 1.23  31.8   2.9     0.36 + 0.63  35.3   3.7
NIST-B163      0.87 + 2.17  29.6   2.1     0.45 + 1.05  33.2   2.8
NIST-K233      1.46 + 2.76  34.6   3.1     0.78 + 1.39  42.2   3.8
NIST-B233      2.09 + 5.25  32.8   2.3     1.04 + 2.55  40.4   3.1
Table 11. Timings for the ECDSA protocol execution. Timings
for signature (S) and verification (V) are given in seconds and ROM
memory or Static+Stack RAM consumption are given in KB.
8. Conclusions
Despite several years of intense research, security and cryptography on WSNs
still face several open problems. In this work, we presented efficient
implementations of binary field algorithms such as squaring, multiplication, modular
reduction and inversion. These implementations take into account the characteristics
of the target platform (the MICAz Mote) to develop optimizations, specifically: (i) the
cost of memory addressing; (ii) the cost of memory instructions; (iii) the limited
flexibility of bitwise shift instructions. We obtain the fastest binary field arithmetic
implementations in C and Assembly published for the target platform. Significant
performance benefits were achieved by the Assembly implementation, resulting
from fine-grained resource allocation and instruction selection. These optimizations
produced a point multiplication at the 160-bit security level in under 1/3 of a second, an
improvement of 72% compared to the best implementation of a Koblitz curve previously
published and an improvement of 61% compared to the best implementation
of binary curves. When compared to the best implementation of prime curves, we
obtain a performance gain of 57%. We also presented the first timings of elliptic
curves at the higher 233-bit security level. For both security levels, we illustrate
the performance obtained with executions of key agreement and digital signature
protocols. In particular, a key agreement can be computed in under 0.40 seconds
at the 163-bit security level and in 0.80 seconds at the 233-bit security level. A
digital signature can be computed and verified in 1 second at the 163-bit security
level and in 2.17 seconds at the 233-bit security level. We hope that our results can
increase the efficiency and viability of elliptic curve cryptography on wireless sensor
networks.
Acknowledgements
We would like to thank the referees for their valuable comments and suggestions.
Diego F. Aranha is supported by FAPESP, grant no. 2007/06950-0. Julio López
and Ricardo Dahab are partially supported by CNPq and FAPESP research grants.
References
[1] Atmel Corporation, 8 bit AVR Microcontroller ATmega128(L) manual, Atmel, (2004), edition
2467m-avr-11/04.
[2] Atmel Corporation, AVR Studio 4.14, Atmel, (2005), available online at
http://www.atmel.com/.
[3] Certicom Research, SEC 1: Elliptic Curve Cryptography, (2000), available online at
http://www.secg.org.
[4] H. Eberle, A. Wander, N. Gura, S. Chang-Shantz and V. Gupta, Architectural extensions
for elliptic curve cryptography over GF($2^m$) on 8-bit microprocessors, in “Proceedings of
IEEE International Conference on Application-specific Systems, Architectures and Processors
(ASAP’05)”, IEEE, (2005), 343–349.
[5] D. Estrin, R. Govindan, J. S. Heidemann and S. Kumar, Next century challenges: Scalable
coordination in sensor networks, in “Proceedings of Mobile Computing and Networking
(MobiCom’99)”, (1999), 263–270.
[6] R. Gallant, R. Lambert and S. Vanstone, Faster point multiplication on elliptic curves with
efficient endomorphisms, in “Proceedings of the 21st Annual International Cryptology
Conference on Advances in Cryptology (CRYPTO’01)”, Springer, (2001), 190–200.
[7] J. Großschädl, TinySA: a security architecture for wireless sensor networks, in “Proceedings
of ACM International Conference on emerging Networking EXperiments and Technologies
(CoNEXT’06)”, ACM, (2006).
[8] N. Gura, A. Patel, A. Wander, H. Eberle and S. C. Shantz, Comparing elliptic curve
cryptography and RSA on 8-bit CPUs, in “Proceedings of Workshop on Cryptographic Hardware
and Embedded Systems (CHES’04)”, Springer, (2004), 119–132.
[9] D. Hankerson, A. J. Menezes and S. Vanstone, “Guide to Elliptic Curve Cryptography”,
Springer, New York, 2004.
[10] J. L. Hill and D. E. Culler, MICA: a wireless platform for deeply embedded networks, IEEE
Micro, 22 (2002), 12–24.
[11] A. Karatsuba and Y. Ofman, Multiplication of many-digital numbers by automatic computers,
Transl. Physics-Doklady, 7 (1963), 595–596.
[12] A. Kargl, S. Pyka and H. Seuschek, Fast arithmetic on ATmega128 for elliptic curve
cryptography, preprint, available online at http://eprint.iacr.org/2008/442.
[13] C. Karlof, N. Sastry and D. Wagner, TinySec: a link layer security architecture for wireless
sensor networks, in “Proceedings of 2nd ACM Conference on Embedded Networked Sensor
Systems (SenSys’04)”, ACM, (2004), 162–175.
[14] N. Koblitz, Elliptic curve cryptosystems, Math. Comput., 48 (1987), 203–209.
[15] G.-J. Lay and H. G. Zimmer, Constructing elliptic curves with given group order over large
finite fields, in “Algorithmic Number Theory”, (1994), 250–263.
[16] C. H. Lim and P. J. Lee, More flexible exponentiation with precomputation, in “Proceedings
of the 14th Annual International Cryptology Conference on Advances in Cryptology
(CRYPTO’94)”, Springer, (1994), 95–107.
[17] J. López and R. Dahab, Fast multiplication on elliptic curves over GF($2^m$) without
precomputation, in “Proceedings of Workshop on Cryptographic Hardware and Embedded Systems
(CHES’99)”, Springer, (1999), 316–327.
[18] J. López and R. Dahab, Improved algorithms for elliptic curve arithmetic in GF($2^n$), in
“Proceedings of Workshop on Selected Areas in Cryptography (SAC’98)”, Springer, (1999),
201–212.
[19] J. López and R. Dahab, High-speed software multiplication in GF($2^m$), in “Proceedings of
International Conference on Cryptology in India (INDOCRYPT’00)”, Springer, (2000),
203–212.
[20] D. J. Malan, M. Welsh and M. D. Smith, A public-key infrastructure for key distribution
in TinyOS based on elliptic curve cryptography, in “Proceedings of IEEE Communications
Society Conference on Sensor and Ad Hoc Communications and Networks (SECON’04)”,
(2004).
[21] A. Menezes, T. Okamoto and S. Vanstone, Reducing elliptic curve logarithms to logarithms
in a finite field, IEEE Trans. Inform. Theory, 39 (1993), 1639–1646.
[22] V. Miller, Uses of elliptic curves in cryptography, in “Advances in Cryptology (CRYPTO’85)”,
Springer, (1986), 417–426.
[23] L. B. Oliveira, D. F. Aranha, E. Morais, F. Daguano, J. López and R. Dahab, TinyTate:
computing the Tate pairing in resource-constrained sensor nodes, in “Proceedings of IEEE
International Symposium on Network Computing and Applications (NCA’07)”, IEEE, (2007),
318–323.
[24] L. B. Oliveira, A. Kansal, B. Priyantha, M. Goraczko and F. Zhao, Secure-TWS:
authenticating node to multi-user communication in shared sensor networks, in “Proceedings of
International Conference on Information Processing in Sensor Networks (IPSN’09)”, IEEE,
(2009), 289–300.
[25] L. B. Oliveira, M. Scott, J. López and R. Dahab, TinyPBC: pairings for authenticated
identity-based non-interactive key distribution in sensor networks, in “Proceedings of
International Conference on Networked Sensing Systems (INSS’08)”, IEEE, (2008), 173–180.
[26] A. Perrig, R. Szewczyk, V. Wen, D. Culler and J. D. Tygar, SPINS: security protocols for
sensor networks, Wireless Networks, 8 (2002), 521–534.
[27] T. Satoh, B. Skjernaa and Y. Taguchi, Fast computation of canonical lifts of elliptic curves
and its application to point counting, Finite Fields Appl., 9 (2003), 89–101.
[28] M. Scott, MIRACL – multiprecision integer and rational arithmetic C/C++ library, available
online at http://www.shamus.ie/.
[29] S. C. Seo, D. Han and S. Hong, TinyECCK: efficient elliptic curve cryptography
implementation over GF($2^m$) on 8-bit MICAz mote, preprint, available online at
http://eprint.iacr.org/2008/122.
[30] J. A. Solinas, Efficient arithmetic on Koblitz curves, Des. Codes Cryptogr., 19 (2000),
195–249.
[31] P. Szczechowiak, L. B. Oliveira, M. Scott, M. Collier and R. Dahab, NanoECC: testing
the limits of elliptic curve cryptography in sensor networks, in “Proceedings of European
conference on Wireless Sensor Networks (EWSN’08)”, Springer, (2008), 305–320.
[32] L. Uhsadel, A. Poschmann and C. Paar, Enabling full-size public-key algorithms on 8-bit
sensor nodes, in “Proceedings of European Workshop on Security in Ad-hoc and Sensor
Networks (ESAS’07)”, Springer, (2007), 73–86.
[33] H. Wang and Q. Li, Efficient implementation of public key cryptosystems on mote sensors,
in “Proceedings of International Conference on Information and Communication Systems
(ICICS’06)”, Springer, (2006), 519–528.
[34] H. Yan and Z. J. Shi, Studying software implementations of elliptic curve cryptography,
in “Proceedings of International Conference on Information Technology: New Generations
(ITNG’06)”, IEEE, (2006), 78–83.
Received June 2009; revised December 2009.
E-mail address: dfaranha@ic.unicamp.br
E-mail address: rdahab@ic.unicamp.br
E-mail address: jlopez@ic.unicamp.br
E-mail address: leob@ft.unicamp.br