Advances in Mathematics of Communications doi:10.3934/amc.2010.4.xxx
Volume 4, No. 2, 2010, xxx–xxx
EFFICIENT IMPLEMENTATION OF ELLIPTIC CURVE
CRYPTOGRAPHY IN WIRELESS SENSORS
Diego F. Aranha, Ricardo Dahab,
Julio López and Leonardo B. Oliveira
University of Campinas (UNICAMP)
Campinas - SP, CEP 13083-970, Brazil
(Communicated by Joan-Josep Climent)
Abstract. The deployment of cryptography in sensor networks is a challeng-
ing task, given the limited computational power and the resource-constrained
nature of the sensoring devices. This paper presents the implementation of
elliptic curve cryptography in the MICAz Mote, a popular sensor platform.
We present optimization techniques for arithmetic in binary fields, including
squaring, multiplication and modular reduction at two different security levels.
Our implementation of field multiplication and modular reduction algorithms
focuses on the reduction of memory accesses and appears as the fastest result
for this platform. Finite field arithmetic was implemented in C and Assembly
and elliptic curve arithmetic was implemented in Koblitz and generic binary
curves. We illustrate the performance of our implementation with timings for
key agreement and digital signature protocols. In particular, a key agreement
can be computed in 0.40 seconds and a digital signature can be computed and
verified in 1 second at the 163-bit security level. Our results strongly indicate
that binary curves are the most efficient alternative for the implementation of
elliptic curve cryptography in this platform.
1. Introduction
A Wireless Sensor Network (WSN) [5] is a wireless ad-hoc network consisting of
resource-constrained sensoring devices (limited energy source, low communication
bandwidth, small computational power) and one or more base stations. The base
stations are more powerful and collect the data gathered by the sensor nodes so
it can be analyzed. As in any ad hoc network, routing is accomplished by the nodes
themselves through hop-by-hop forwarding of data. Common WSN applications
range from battlefield reconnaissance and emergency rescue operations to surveillance
and environmental protection.
WSNs may be organized in different ways. In flat WSNs, all nodes play similar
roles in sensing, data processing, and routing. In hierarchical WSNs, on the other
hand, the network is typically organized into clusters, with ordinary cluster members
and the cluster heads playing different roles. While ordinary cluster members
are responsible for sensing, the cluster heads are responsible for additional tasks
such as collecting and processing the sensing data from their cluster members, and
forwarding the results towards the base stations.
2000 Mathematics Subject Classification: Primary: 11-04; Secondary: 94A60.
Key words and phrases: Efficient software implementation, cryptographic engineering, elliptic
curve cryptography, finite field arithmetic.
© 2010 AIMS-SDU
Besides the vulnerabilities already present in ad-hoc networks, WSNs pose addi-
tional challenges: the sensor nodes are commonly distributed on locations physically
accessible to adversaries; and the resources available in a sensor node are more lim-
ited than those in a conventional ad hoc network node, thus traditional solutions
are not adequate. For example, the fact that sensor nodes should be discardable
and consequently have low cost makes the integration of anti-tampering measures
on these devices difficult.
Conventional public key cryptography systems such as RSA and DSA are impractical
in this scenario due to the low processing power of sensor nodes. Until
recently, security services such as confidentiality, authentication and integrity were
achieved exclusively by symmetric techniques [26, 13]. Nowadays, however, elliptic
curve cryptography (ECC) [22, 14] has emerged as a promising alternative to
traditional public key methods on WSNs [8], because of its lower processing and
storage requirements. These features motivate the search for increasingly efficient
algorithms and implementations of ECC for such devices. The usual target platform
is the MICAz Mote [10], a node commonly used on real WSN deployments, whose
main characteristics are the low availability of RAM and the high cost of
memory instructions, memory addressing and bitwise shifts by arbitrary amounts.
This work proposes optimizations for implementing ECC over binary fields, improving
its limits of performance and viability. Experimental results show that
binary elliptic curves offer significant computational advantages over prime curves
when implemented in WSNs. Note that this observation contradicts a common
misconception that sensor nodes are not sufficiently equipped to compute elliptic
curve arithmetic over binary fields in an efficient way [8, 4].
Our main contributions in this work are:
• Efficient implementations of multiplication, squaring, modular reduction and
inversion in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$: optimized versions of known algorithms are
presented, reducing the number of memory accesses to obtain performance
gains. The new optimizations produce the fastest implementation of binary
field arithmetic published for this platform;
• Efficient implementation of elliptic curve cryptography: point multiplication
algorithms are implemented on Koblitz curves and generic binary curves. The
time for a scalar multiplication of a random point in a binary curve is 61%
faster than the best implementation so far [12] and 57% faster than the best
implementation over a prime curve [7] at the 160-bit security level. We also
present the first point multiplication timings at the 233-bit security level in
this platform. Performance is illustrated by executions of key agreement and
digital signature protocols.
The remaining sections of this paper are organized as follows. Related work
is presented in Section 2 and elementary elliptic curve concepts are introduced
in Section 3. The platform characteristics are presented in Section 4. Section 5
investigates efficient implementations of finite field arithmetic in the target platform
while Section 6 investigates efficient elliptic curve arithmetic. Section 7 presents
implementation results and Section 8 concludes the paper.
2. Related work
Cryptographic protocols are used to establish security services in WSNs. Key
agreement is a fundamental protocol in this context because it can be used to nego-
tiate cryptographic keys suitable for fast and energy-efficient symmetric algorithms.
One possible solution for key agreement in WSNs is the deployment of pairing-based
protocols, such as TinyTate [23] and TinyPBC [25], with the added advantage of
not requiring communication. Here instead we focus on the performance side and
assume that a simple one-pass Elliptic Curve Diffie-Hellman [3] protocol is employed
for key agreement. With this assumption, different implementations of ECC can be
compared by the cost of multiplying a random elliptic point by a random integer.
Gura et al. [8] presented the first implementation results of ECC and RSA on
ATmega128 microcontrollers and demonstrated the superiority of the former over
the latter. In Gura's work, prime field arithmetic was implemented in C and Assembly
and a point multiplication took 0.81 seconds on an 8MHz device. Uhsadel
et al. [32] later presented an expected time of 0.76 seconds for computing a point
multiplication on a 7.3728MHz device. The fastest implementation of prime curves
so far [7] explores the potential of elliptic curves with efficiently computable endomorphisms
defined over optimal prime fields and computes a point multiplication
in 5.5 million cycles, or 0.745 seconds.
For binary curves, Malan et al. [20] implemented ECC using polynomial basis
and presented results for the Diffie-Hellman key agreement protocol. A public key
generation, which consists of a point multiplication, was computed in 34 seconds.
Yan and Shi [34] implemented ECC over $\mathbb{F}_{2^{163}}$ and obtained a point multiplication in
13.9 seconds, suggesting that binary curves had too high a cost for the sensor
technology of the time. Eberle et al. [4] implemented ECC in Assembly over $\mathbb{F}_{2^{163}}$ and obtained
a point multiplication in 4.14 seconds, making use of architectural extensions for
additional acceleration. NanoECC [31] specialized portions of the MIRACL arithmetic
library [28] in the C programming language for efficient execution in sensor
nodes, resulting in a point multiplication in 1.27 seconds over prime fields and 2.16
seconds over binary fields. Later, TinyECCK [29] presented an implementation of
ECC over binary curves which takes into account the platform characteristics to
optimize finite field arithmetic and obtained a point multiplication in 1.14 seconds.
Recently, Kargl et al. [12] investigated algorithms resistant to simple power analysis
and obtained a point multiplication in 0.7633 seconds on an 8MHz device. Table 1
presents the increasing efficiency of ECC in WSNs.
Finite field   Work                  Execution time (seconds)
Binary         Malan et al. [20]     34
               Yan and Shi [34]      13.9
               Eberle et al. [4]     4.14
               NanoECC [31]          2.16
               TinyECCK [29]         1.14
               Kargl et al. [12]     0.83
Prime          Wang and Li [33]      1.35
               NanoECC [31]          1.27
               Gura et al. [8]       0.87
               Uhsadel et al. [32]   0.76
               TinySA [7]            0.745
Table 1. Timings for scalar multiplication of a random point on
a MICAz Mote at the 160-bit security level. The timings are normalized
for a clock frequency of 7.3728MHz.
3. Elliptic curve cryptography
An elliptic curve $E$ over a field $K$ is the set of solutions $(x, y) \in K \times K$ which
satisfy the Weierstrass equation
$$y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$$
where $a_1, a_2, a_3, a_4, a_6 \in K$ and the curve discriminant is $\Delta \neq 0$; together with a
point at infinity denoted by $\mathcal{O}$. If $K$ is a field of characteristic 2, then the curve is
called a binary elliptic curve and there are two cases to consider. If $a_1 \neq 0$, then an
admissible change of variables transforms $E$ to the non-supersingular binary elliptic
curve of equation
$$y^2 + xy = x^3 + ax^2 + b$$
where $a, b \in \mathbb{F}_{2^m}$ and $\Delta = b$. A non-supersingular curve with $a \in \{0, 1\}$ and $b = 1$
is also a Koblitz curve. If $a_1 = 0$, then an admissible change of variables transforms
$E$ to the supersingular binary elliptic curve
$$y^2 + cy = x^3 + ax + b$$
where $a, b, c \in \mathbb{F}_{2^m}$ and $\Delta = c^4$.
The number of points on the curve $E(\mathbb{F}_{2^m})$, denoted by $\#E(\mathbb{F}_{2^m})$, is called
the curve order over the field $\mathbb{F}_{2^m}$. The Hasse bound states in this case that
$n = 2^m + 1 - t$ and $|t| \leq 2\sqrt{2^m}$, where $n$ is the curve order and $t$ is the trace of Frobenius. A curve can
be generated with a prescribed order using the complex multiplication method [15]
or the curve order can be explicitly computed in binary curves using the approach
due to Satoh, Skjernaa and Taguchi [27]. Non-supersingularity comes from the fact
that $t$ is not a multiple of the characteristic 2 of the underlying finite field [9].
The set of points $\{(x, y) \in E(\mathbb{F}_{2^m})\} \cup \{\mathcal{O}\}$ under the addition operation $+$ (chord
and tangent) forms an additive group, with $\mathcal{O}$ as the identity element. Given
an elliptic point $P \in E(\mathbb{F}_{2^m})$ and an integer $k$, the operation $kP$, called point
multiplication, is defined by the addition of the point $P$ to itself $k - 1$ times:
$$kP = \underbrace{P + P + \cdots + P}_{k-1 \text{ additions}}.$$
Public key cryptography protocols, such as the Elliptic Curve Diffie-Hellman
key agreement [3] and the Elliptic Curve Digital Signature Algorithm [3], employ
point multiplication as a fundamental operation; and their security is based on the
difficulty of solving the Elliptic Curve Discrete Logarithm Problem (ECDLP). This
problem consists in finding the discrete logarithm $k$ given the points $P$ and $kP$. Criteria
for selecting suitable secure curves are a complex subject and a matter of much
discussion. We adopt the well-known standard NIST curves as a conservative choice,
but we refer the reader to [3] for further details on how to generate efficient curves
where instances of the ECDLP are computationally hard.
We restrict the discussion to non-supersingular curves because supersingular
curves are not suitable for elliptic curve cryptosystems based on the ECDLP [21].
However, supersingular curves are of particular interest in applications
of pairing-based protocols on WSNs [25].
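The group law and the point multiplication $kP$ defined in this section can be illustrated on a toy example. The sketch below is not taken from the paper: it assumes a tiny field $\mathbb{F}_{2^4}$ with reduction polynomial $z^4 + z + 1$, the Koblitz curve with $a = b = 1$, the usual affine chord-and-tangent formulas for non-supersingular binary curves, and a plain left-to-right double-and-add. All function names are hypothetical.

```python
M = 4
F_POLY = 0b10011   # z^4 + z + 1 (toy field chosen for illustration only)
A, B = 1, 1        # toy Koblitz curve y^2 + xy = x^3 + A x^2 + B
O = None           # point at infinity

def fmul(x, y):
    """Multiply two elements of F_{2^M}, reducing after every shift."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x >> M:
            x ^= F_POLY
    return r

def finv(x):
    """Invert x != 0 as x^(2^M - 2); acceptable only for a toy field."""
    r = 1
    for _ in range(2**M - 2):
        r = fmul(r, x)
    return r

def on_curve(P):
    if P is O:
        return True
    x, y = P
    x2 = fmul(x, x)
    return fmul(y, y) ^ fmul(x, y) == fmul(x2, x) ^ fmul(A, x2) ^ B

def add(P, Q):
    """Affine chord-and-tangent addition on a non-supersingular binary curve."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2:
        if y1 ^ y2 == x1:              # Q = -P = (x1, x1 + y1)
            return O
        lam = x1 ^ fmul(y1, finv(x1))  # tangent slope (P == Q, x1 != 0)
        x3 = fmul(lam, lam) ^ lam ^ A
        return (x3, fmul(x1, x1) ^ fmul(lam, x3) ^ x3)
    lam = fmul(y1 ^ y2, finv(x1 ^ x2)) # chord slope
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return (x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1)

def scalar_mul(k, P):
    """Left-to-right double-and-add computation of kP."""
    R = O
    for bit in bin(k)[2:]:
        R = add(R, R)
        if bit == "1":
            R = add(R, P)
    return R
```

Enumerating the affine points of this toy curve and adding one for the point at infinity gives the curve order $n$, which can then be checked against the Hasse bound stated above.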
4. The platform
The MICAz Mote sensor node is equipped with an ATmega128 8-bit processor
clocked at 7.3728MHz. The program code is loaded from a 128KB EEPROM chip
and runtime memory is stored in a 4KB RAM chip [10]. The ATmega128 processor
is a typical RISC architecture with 32 registers, but six of them are special
pointer registers. Since at least one register is needed to store temporary results or
data loaded from memory, 25 registers are generally available for arithmetic. The
instruction set is also reduced, as only 1-bit shift/rotate instructions are natively
supported. Bitwise shifts by arbitrary amounts must therefore be implemented with
combinations of shift/rotate instructions and other instructions. The processor pipeline
has two stages and memory instructions always cause pipeline stalls. Arithmetic
instructions with register operands cost 1 cycle and memory instructions or memory
addressing cost 2 processing cycles [1]. Table 2 presents the instructions provided
by the platform which can be used for the implementation of binary field arithmetic.
Instruction   Description                   Use                           Cost
lsr, lsl      Right/left 1-bit shift        Multi-precision 1-bit shift   1 cycle
rol, ror      Left/right 1-bit rotate       Multi-precision 1-bit shift   1 cycle
swap          Swap high and low nibbles     Shift by 4 bits               1 cycle
bld, bst      Bit load/store from/to flag   Shift by 7 bits               1 cycle
eor           Bitwise exclusive OR          Binary field addition         1 cycle
ld, st        Memory load/store             Read operands/write results   2 cycles
adiw, sbiw    Pointer arithmetic            Memory addressing             2 cycles
Table 2. Relevant instructions for the implementation of binary
field arithmetic.
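As an illustration of how a nibble swap can stand in for a 4-bit shift, the following sketch models the `swap` instruction on single bytes and uses it to shift a little-endian byte vector left by 4 bits. It is a behavioral model in Python, not the authors' Assembly code; the function names and the choice to return one extra output byte are assumptions.

```python
def swap_nibbles(b):
    """Behavioral model of a nibble swap on one byte."""
    return ((b << 4) | (b >> 4)) & 0xFF

def shift_left_4(v):
    """Shift a little-endian byte vector left by 4 bits (one extra byte is produced)."""
    out, carry = [], 0
    for b in v:
        s = swap_nibbles(b)
        out.append((s & 0xF0) | carry)  # low nibble moves up, previous high nibble enters
        carry = s & 0x0F                # high nibble is carried into the next byte
    out.append(carry)
    return out
```

Each byte costs one swap, two masks and one combination, instead of four passes of 1-bit shifts over the whole vector.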
5. Algorithms for finite field arithmetic
In this section we will represent the elements of $\mathbb{F}_{2^m}$ using a polynomial basis. Let
$f(z)$ be an irreducible binary trinomial or pentanomial of degree $m$. The elements
of $\mathbb{F}_{2^m}$ are the binary polynomials of degree at most $m - 1$. A field element
$a(z) = \sum_{i=0}^{m-1} a_i z^i$ is associated with the binary vector $a = (a_{m-1}, \ldots, a_1, a_0)$ of length $m$.
In a software implementation in an 8-bit processor, the element $a$ is stored as a
vector of $n = \lceil m/8 \rceil$ bytes. The field operations in $\mathbb{F}_{2^m}$ can be implemented by
common processor instructions, such as logical shifts ($\gg$, $\ll$) and addition modulo
2 (XOR, $\oplus$).
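A minimal sketch of this representation, assuming a little-endian byte order (byte 0 holds the coefficients of $z^0, \ldots, z^7$) and hypothetical helper names:

```python
def num_bytes(m):
    """n = ceil(m / 8), the number of bytes of a field element."""
    return (m + 7) // 8

def to_bytes(a, m):
    """Byte vector a[0..n-1] of a polynomial given as an integer bitmask."""
    return list(a.to_bytes(num_bytes(m), "little"))

def field_add(a, b):
    """Addition in a binary field is a byte-wise XOR; no reduction is needed."""
    return [x ^ y for x, y in zip(a, b)]
```

With these definitions the two fields used in this work occupy 21 and 30 bytes per element, respectively.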
5.1. Multiplication. The computation of $kP$ is the most time-consuming operation
in ECC and this operation depends directly on the finite field arithmetic. In
particular, a fast field multiplication is critical for the performance of ECC.
Two different strategies are commonly considered for the implementation of multiplication
in $\mathbb{F}_{2^m}$. The first one consists in applying Karatsuba's algorithm [11]
to divide the multiplication into sub-problems and solve each sub-problem independently
by the following formula [9] (with $a(z) = A_1 z^{m/2} + A_0$ and $b(z) = B_1 z^{m/2} + B_0$):
$$c(z) = a(z) \cdot b(z) = A_1 B_1 z^m + [(A_1 + A_0)(B_1 + B_0) + A_1 B_1 + A_0 B_0] z^{m/2} + A_0 B_0.$$
Naturally, Karatsuba multiplication imposes some overhead for the divide and conquer
steps. The second one consists in applying a direct algorithm like the López-Dahab
(LD) binary field multiplication (Algorithm 1) [19]. In this algorithm, the
precomputation window is usually chosen as $t = 4$ and the precomputation table
$T$ has size $|T| = 16(n + 1)$, since each element $T[i]$ requires at most $n + 1$ bytes
to store the result of $u(z)b(z)$. Operand $a$ is scanned from left to right and processed
in groups of 4 bits. In an 8-bit processor, the algorithm comprises two
phases, where the higher halves of the bytes of $a$ are processed in the first phase and the
lower halves are processed in the second phase. These phases are separated by an
intermediate shift which implements multiplication by $z^t$.
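The Karatsuba formula can be checked with a short sketch that uses Python integers as bitmasks for binary polynomials. It is only a functional model: the split point is taken as $h = \lceil m/2 \rceil$ so that odd $m$ such as 163 is handled (an assumption, the formula above writes $m/2$), and the three half-size products are computed by a schoolbook routine rather than by the LD algorithm.

```python
def clmul(a, b):
    """Schoolbook carry-less product of two binary polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_once(a, b, m):
    """One Karatsuba level for operands of at most m bits."""
    h = (m + 1) // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = clmul(a0, b0)                       # A0 * B0
    hi = clmul(a1, b1)                       # A1 * B1
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi  # middle term of the formula
    return (hi << (2 * h)) ^ (mid << h) ^ lo
```

Three half-size products replace four, at the price of the extra additions that make up the divide and conquer overhead mentioned above.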
Algorithm 1 López-Dahab multiplication in $\mathbb{F}_{2^m}$ [19].
Input: $a(z) = a[0..n-1]$, $b(z) = b[0..n-1]$.
Output: $c(z) = c[0..2n-1]$.
1: Compute $T(u) = u(z)b(z)$ for all polynomials $u(z)$ of degree lower than $t$.
2: $c[0 \ldots 2n-1] \leftarrow 0$
3: for $k \leftarrow 0$ to $n-1$ do
4:   $u \leftarrow a[k] \gg t$
5:   for $j \leftarrow 0$ to $n$ do
6:     $c[j+k] \leftarrow c[j+k] \oplus T(u)[j]$
7:   end for
8: end for
9: $c(z) \leftarrow c(z) z^t$
10: for $k \leftarrow 0$ to $n-1$ do
11:   $u \leftarrow a[k] \bmod 2^t$
12:   for $j \leftarrow 0$ to $n$ do
13:     $c[j+k] \leftarrow c[j+k] \oplus T(u)[j]$
14:   end for
15: end for
16: return $c$
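A direct transcription of Algorithm 1 for $t = 4$ may help to follow the two phases. This is a sketch on little-endian byte lists, not the optimized implementation: the table is built with integer arithmetic for brevity, the intermediate shift by $z^t$ is done on an integer, and `clmul` is only a reference product used for checking.

```python
def clmul(a, b):
    """Reference carry-less product (integers as bitmasks)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def ld_mul(a, b):
    """Algorithm 1 with t = 4 on byte lists a, b of equal length n."""
    n = len(a)
    b_int = int.from_bytes(bytes(b), "little")
    # Step 1: T[u] = u(z) * b(z) for every u of degree < 4, n + 1 bytes each.
    T = [0] * 16
    for u in range(1, 16):
        T[u] = (T[u >> 1] << 1) ^ (b_int if u & 1 else 0)
    T = [list(x.to_bytes(n + 1, "little")) for x in T]
    c = [0] * (2 * n)
    for phase in (0, 1):
        for k in range(n):
            u = a[k] >> 4 if phase == 0 else a[k] & 0x0F
            for j in range(n + 1):
                c[j + k] ^= T[u][j]
        if phase == 0:
            # Step 9: c(z) <- c(z) * z^4 between the two phases.
            shifted = int.from_bytes(bytes(c), "little") << 4
            c = list(shifted.to_bytes(2 * n, "little"))
    return c
```

Every inner-loop iteration reads and writes `c[j + k]` in memory, which is the access pattern that the rotating register window is designed to avoid.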
Conventionally, the series of additions involved in the LD multiplication are implemented
through additions over subparts of a double-precision vector. In order
to reduce the number of memory accesses employed during these additions, we employ
a rotating register window. This window simulates the series of additions by
accumulating consecutive writes into registers. After a final result is obtained in
the lowest precision register, this value is written into memory and this register
is free to participate as the highest precision register. Figure 1 shows a rotating
register window with $n + 1$ registers. We modify the LD multiplication algorithm
by integrating a rotating register window. The result of this integration is referred to
as LD multiplication with registers and shown as Algorithm 2. Figure 2 presents
this modification graphically. These descriptions of the algorithm assume that $n$
general-purpose registers are available for arithmetic. If this is not the case (e.g.
multiplication in $\mathbb{F}_{2^{233}}$ on this platform), the accumulation in the register window
must be divided into different blocks in a multistep fashion and each block processed
with a different rotating register window. A slight overhead is introduced between
the processing of consecutive blocks because some registers must be written into
memory and freed before they can be used in a new rotating register window.
An additional suggested optimization is the separation of the precomputation
table $T$ in different blocks of 256 bytes, where each block is stored on a 256-byte
aligned memory address. This optimization accelerates memory addressing because
offsets lower than 256 can be computed by a simple 1-cycle addition instruction,
avoiding expensive pointer arithmetic. Another optimization is to store the results
of the first phase of the algorithm already shifted, eliminating some redundant
memory reads to reload the intermediate result into registers for multi-precision
shifting. A last optimization is the embedding of modular reduction at the end of
the multiplication algorithm. This trick allows the reuse of values already loaded
into registers to speed up modular reduction. The following analysis does not take
these suggested optimizations into account.

Figure 1. Rotating register window with $n + 1$ registers.

Figure 2. López-Dahab multiplication with registers of two field
elements represented as $n$-byte vectors in an 8-bit processor.

Algorithm 2 Proposed optimization for multiplication in $\mathbb{F}_{2^m}$ using $n+1$ registers.
Input: $a(z) = a[0..n-1]$, $b(z) = b[0..n-1]$.
Output: $c(z) = c[0..2n-1]$.
Note: $v_i$ denotes the vector of $n + 1$ registers $(r_{i-1}, \ldots, r_0, r_n, \ldots, r_i)$.
1: Compute $T(u) = u(z)b(z)$ for all polynomials $u(z)$ of degree lower than 4.
2: Let $u_i$ be the 4 most significant bits of $a[i]$.
3: $v_0 \leftarrow T(u_0)$, $c[0] \leftarrow r_0$
4: $v_1 \leftarrow v_1 \oplus T(u_1)$, $c[1] \leftarrow r_1$
5: $\cdots$
6: $v_{n-1} \leftarrow v_{n-1} \oplus T(u_{n-1})$, $c[n-1] \leftarrow r_{n-1}$
7: $c \leftarrow ((r_{n-2}, \ldots, r_0, r_n) \,\|\, (c[n-1], \ldots, c[0])) \ll 4$
8: Let $u_i$ be the 4 least significant bits of $a[i]$.
9: $v_0 \leftarrow T(u_0)$, $c[0] \leftarrow c[0] \oplus r_0$
10: $\cdots$
11: $v_{n-1} \leftarrow v_{n-1} \oplus T(u_{n-1})$, $c[n-1] \leftarrow c[n-1] \oplus r_{n-1}$
12: $c[n \ldots 2n-1] \leftarrow c[n \ldots 2n-1] \oplus (r_{n-2}, \ldots, r_0, r_n)$
13: return $c$
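The rotating register window of Algorithm 2 can be modeled functionally as follows. In this sketch a Python list of $n + 1$ bytes plays the role of the registers, rotation is modeled by dropping the lowest entry and appending a zero, and `clmul` is a reference product for checking. The helper names are hypothetical and no claim is made about the actual register allocation of the Assembly code.

```python
def clmul(a, b):
    """Reference carry-less product (integers as bitmasks)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def ld_mul_window(a, b):
    """Functional model of LD multiplication with a rotating register window."""
    n = len(a)
    b_int = int.from_bytes(bytes(b), "little")
    T = [0] * 16
    for u in range(1, 16):
        T[u] = (T[u >> 1] << 1) ^ (b_int if u & 1 else 0)
    T = [list(x.to_bytes(n + 1, "little")) for x in T]

    def phase(nibbles, c_low):
        regs = [0] * (n + 1)                 # the register window
        low = []
        for k, u in enumerate(nibbles):
            regs = [r ^ x for r, x in zip(regs, T[u])]
            low.append(c_low[k] ^ regs[0])   # byte k is final: one write to c[k]
            regs = regs[1:] + [0]            # rotate: freed register becomes the top one
        return low, regs[:n]                 # regs[:n] holds bytes n..2n-1

    low, high = phase([x >> 4 for x in a], [0] * n)
    shifted = int.from_bytes(bytes(low + high), "little") << 4
    c = list(shifted.to_bytes(2 * n, "little"))
    low, high = phase([x & 0x0F for x in a], c[:n])
    return low + [x ^ y for x, y in zip(c[n:], high)]
```

In this model each of the low bytes `c[k]` is written once per phase, whereas the plain algorithm rewrites a window of $n + 1$ bytes of `c` for every processed nibble.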
Analysis of multiplication algorithms. Since the most expensive instructions
in the target platform are related to memory accesses, the
behavior of different algorithms was analyzed to estimate their performance. This
analysis traces the cost of different algorithms in terms of memory accesses (reads
and writes) and arithmetic instructions (XOR).
Without considering partial multiplications, the Karatsuba algorithm in a binary
field executes approximately $11n$ memory reads, $7n$ memory writes and $4n$ XOR
instructions.
For LD multiplication, analysis shows that building the precomputation table
requires $n$ memory reads to obtain the values $b[i]$ and $|T|$ writes and $11n$ XOR
instructions for filling the table. Inside each inner loop, the algorithm executes
$2(n + 1)$ memory reads, $n + 1$ writes and $n + 1$ XOR instructions. In each outer
loop, the algorithm executes $n$ memory accesses to read the values $a[k]$ and $n$
iterations of the inner loop, totaling $n + 2n(n + 1)$ reads, $n(n + 1)$ writes and
$n(n + 1)$ XOR instructions. The logical shift of $c(z)$ computed at the intermediate
stage requires $2n$ memory reads and writes. Considering the initialization of $c$,
we have $3n + 2(n + 2n(n + 1))$ memory reads, $|T| + 2(2n) + 2n(n + 1)$ writes and
$11n + 2n(n + 1)$ XOR instructions.
For the proposed optimization (Algorithm 2), building the precomputation table
requires $n$ memory reads to obtain the values $b[i]$ and $|T|$ writes and $11n$ XOR
instructions for filling the table. Line 3 of the algorithm executes $n + 1$ memory reads
and 1 write on $c[0]$. Lines 4-6 execute $n + 1$ memory reads, 1 write on $c[i]$ and $n + 1$
XOR instructions, all this $n - 1$ times. The intermediate shift executes $n$ reads and
$2n$ writes. Lines 9-11 execute $n + 1$ memory reads, 1 read and write on $c[i]$ and $n + 2$
XOR instructions, all this $n$ times. The final operation costs $n$ memory reads, writes
and XOR instructions. The algorithm thus requires a total of $3n + n(n + 1) + n(n + 2)$
reads, $|T| + n + 2n + 2n$ writes and $11n + (n - 1)(n + 1) + n(n + 2) + n$ XOR instructions.
Table 3 presents the costs associated with memory operations for LD multiplication,
LD with registers multiplication and Karatsuba multiplication. Table 4
presents approximate costs of the algorithms in terms of executed memory instructions
for the fields $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$.
Number of instructions in terms of vectors of $n$ bytes
Method              Reads               Writes               XOR
López-Dahab         $4n^2 + 9n$         $|T| + 2n^2 + 6n$    $2n^2 + 13n$
LD with registers   $2n^2 + 6n$         $|T| + 5n$           $2n^2 + 14n - 1$
Karatsuba           $11n + 3M(n/2)$     $7n + 3M(n/2)$       $4n + 3M(n/2)$
Table 3. Costs in number of executed instructions for the multiplication
algorithms in $\mathbb{F}_{2^m}$. $M(x)$ denotes the cost of a multiplication
algorithm which multiplies two $x$-byte vectors.
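The closed forms of Table 3 are easy to evaluate mechanically. The sketch below encodes them as Python functions returning (reads, writes, XORs), with $|T| = 16(n + 1)$ as stated earlier. The function names and the way the Karatsuba sub-multiplication sizes are passed in are assumptions made for illustration.

```python
def table_size(n):
    """|T| = 16 (n + 1) bytes for a window of t = 4."""
    return 16 * (n + 1)

def ld_cost(n):
    """(reads, writes, XORs) of plain LD multiplication."""
    return (4 * n * n + 9 * n,
            table_size(n) + 2 * n * n + 6 * n,
            2 * n * n + 13 * n)

def ld_regs_cost(n):
    """(reads, writes, XORs) of LD multiplication with registers."""
    return (2 * n * n + 6 * n,
            table_size(n) + 5 * n,
            2 * n * n + 14 * n - 1)

def karatsuba_cost(n, parts, inner):
    """One Karatsuba level whose sub-multiplications have the given byte sizes."""
    base = (11 * n, 7 * n, 4 * n)
    subs = [inner(p) for p in parts]
    return tuple(b + sum(s[i] for s in subs) for i, b in enumerate(base))
```

For example, `ld_cost(30)` evaluates to (3870, 2476, 2190) and `ld_regs_cost(30)` to (1980, 646, 2219), so for 30-byte operands the register window halves the reads and cuts the writes to roughly a quarter.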
                               n = 21                   n = 30
Method                         Reads   Writes   XOR     Reads   Writes   XOR
López-Dahab                    1953    1452     1155    3870    2476     2190
LD with registers              1071    457      1175    1980    646      2219
Karatsuba+LD                   1980    1647     1239    3310    2518     1984
Karatsuba+LD with registers    1155    888      1269    1898    1134     2025
Table 4. Costs in number of executed instructions for the multiplication
algorithms in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$. The Karatsuba algorithm
in $\mathbb{F}_{2^{233}}$ executes two instances of cost $M(15)$ and one instance of
cost $M(14)$ to better approximate the results.

We can see from Table 3 that the number of memory accesses for LD with
registers is drastically reduced in comparison with the original algorithm, reducing
the number of reads by half and the number of writes by a quadratic factor. The
comparison between LD with registers and Karatsuba+LD with registers favors
the first (lower number of writes) on both finite fields. One problem with this
analysis is that it assumes that the processor has at least $n$ general-purpose registers
available for arithmetic. This is not true in $\mathbb{F}_{2^{233}}$, because the algorithm requires
31 registers for a full rotating register window. The decision between a multistep
implementation of LD with registers and Karatsuba+LD with registers will depend
on the actual implementation of the algorithms.
5.2. Modular reduction. The NIST irreducible polynomial for the finite field
$\mathbb{F}_{2^{163}}$, $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$, allows a fast modular reduction algorithm.
Algorithm 3 [29] presents an adaptation of this algorithm for 8-bit processors. In this
algorithm, reducing a digit $c[i]$ of the upper half of the vector $c$ requires six memory
accesses to read and write $c[i]$ on lines 3-5. Four of them are redundant because ideally
we only need to read and write $c[i]$ once. We eliminate these redundant accesses
by employing a rotating register window of three registers which accumulate writes
into registers before a final result can be written into memory. This optimization
is given in Algorithm 4 along with the substitution of some bitwise shifts which
are expensive in this platform for cheaper ones. Since the processor only supports
1-bit and 4-bit shifts natively, we further replace the various expensive shifts in the
accumulate function $R$ by table lookups on 256-byte tables. These tables are stored
on 256-byte aligned memory addresses to speed up memory addressing. The new
version of the accumulate function is depicted in Algorithm 5.
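Algorithm 3 Fast modular reduction by $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$.
Input: $c(z) = c[0..40]$.
Output: $c(z) \bmod f(z) = c[0..20]$.
1: for $i \leftarrow 40$ downto 21 do
2:   $t \leftarrow c[i]$
3:   $c[i-19] \leftarrow c[i-19] \oplus (t \gg 4) \oplus (t \gg 5)$
4:   $c[i-20] \leftarrow c[i-20] \oplus (t \ll 4) \oplus (t \ll 3) \oplus t \oplus (t \gg 3)$
5:   $c[i-21] \leftarrow c[i-21] \oplus (t \ll 5)$
6: end for
7: $t \leftarrow c[20] \gg 3$
8: $c[0] \leftarrow c[0] \oplus (t \ll 7) \oplus (t \ll 6) \oplus (t \ll 3) \oplus t$
9: $c[1] \leftarrow c[1] \oplus (t \gg 1) \oplus (t \gg 2)$
10: $c[20] \leftarrow c[20] \wedge \text{0x07}$
11: return $c$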
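Algorithm 3 translates almost line by line into the following sketch on a 41-byte little-endian list. Left shifts are masked to 8 bits to mimic byte registers, and `reduce_bitwise` is a slow reference (not from the paper) that clears one bit at a time using $z^{163} = z^7 + z^6 + z^3 + 1$.

```python
def reduce163(c):
    """Algorithm 3: reduce a 41-byte value modulo z^163 + z^7 + z^6 + z^3 + 1."""
    c = list(c)
    for i in range(40, 20, -1):
        t = c[i]
        c[i - 19] ^= (t >> 4) ^ (t >> 5)
        c[i - 20] ^= ((t << 4) ^ (t << 3) ^ t ^ (t >> 3)) & 0xFF
        c[i - 21] ^= (t << 5) & 0xFF
    t = c[20] >> 3                      # the five bits above z^162
    c[0] ^= ((t << 7) ^ (t << 6) ^ (t << 3) ^ t) & 0xFF
    c[1] ^= (t >> 1) ^ (t >> 2)
    c[20] &= 0x07
    return c[:21]

def reduce_bitwise(x, m=163, taps=(7, 6, 3, 0)):
    """Reference reduction of an integer bitmask, one bit at a time."""
    for bit in range(x.bit_length() - 1, m - 1, -1):
        if (x >> bit) & 1:
            x ^= 1 << bit
            for tap in taps:
                x ^= 1 << (bit - m + tap)
    return x
```

The three read-modify-write updates per iteration touch `c[i - 19]`, `c[i - 20]` and `c[i - 21]`, and consecutive iterations revisit two of these three bytes; these are the redundant accesses removed by the register window.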
For the NIST irreducible polynomial in $\mathbb{F}_{2^{233}}$ on 8-bit processors, we present
Algorithm 6, a direct adaptation of the standard algorithm. This algorithm only
executes 1-bit or 7-bit shifts. These two shifts can be translated efficiently to the
processor instruction set, because 1-bit shifts are supported natively and 7-bit shifts
can be emulated efficiently. Hence lookup tables are not needed and the only optimization
made during implementation of Algorithm 6 was complete unrolling of
the main loop and straightforward elimination of consecutive redundant memory
accesses.

Algorithm 4 Fast modular reduction in $\mathbb{F}_{2^{163}}$ with rotating register window.
Input: $c(z) = c[0..40]$.
Output: $c(z) \bmod f(z) = c[0..20]$.
Note: The accumulate function $R(r_0, r_1, r_2, t)$ executes:
  $s_0 \leftarrow t \ll 4$
  $r_0 \leftarrow r_0 \oplus ((t \oplus (t \gg 1)) \gg 4)$
  $r_1 \leftarrow r_1 \oplus s_0 \oplus (t \ll 3) \oplus t \oplus (t \gg 3)$
  $r_2 \leftarrow s_0 \ll 1$
1: $r_b \leftarrow 0$, $r_c \leftarrow 0$
2: for $i \leftarrow 40$ downto 25 by 3 do
3:   $R(r_b, r_c, r_a, c[i])$, $c[i-19] \leftarrow c[i-19] \oplus r_b$
4:   $R(r_c, r_a, r_b, c[i-1])$, $c[i-20] \leftarrow c[i-20] \oplus r_c$
5:   $R(r_a, r_b, r_c, c[i-2])$, $c[i-21] \leftarrow c[i-21] \oplus r_a$
6: end for
7: $R(r_b, r_c, r_a, c[22])$, $c[3] \leftarrow c[3] \oplus r_b$
8: $R(r_c, r_a, r_b, c[21])$, $c[2] \leftarrow c[2] \oplus r_c$
9: $r_a \leftarrow c[1] \oplus r_a$
10: $r_b \leftarrow c[0] \oplus r_b$
11: $t \leftarrow c[20]$
12: $c[20] \leftarrow t \wedge \text{0x07}$
13: $t \leftarrow t \gg 3$
14: $c[0] \leftarrow r_b \oplus (t \ll 7) \oplus (t \ll 6) \oplus (t \ll 3) \oplus t$
15: $c[1] \leftarrow r_a \oplus (t \gg 1) \oplus (t \gg 2)$
16: return $c$

Algorithm 5 Optimized version of the accumulate function $R$.
Input: $r_0, r_1, r_2, t$.
Output: $r_0, r_1, r_2$.
1: $r_0 \leftarrow r_0 \oplus T_0[t]$
2: $r_1 \leftarrow r_1 \oplus T_1[t]$
3: $r_2 \leftarrow t \ll 5$
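The register-window reduction of Algorithms 4 and 5 can be modeled as below. The contents of the two 256-byte tables are not listed in the text, so the definitions of `T0` and `T1` here are an assumption derived from the shifts of Algorithm 3. The accumulate function returns its three registers as a tuple, and `reduce_bitwise` is a slow reference used only for checking.

```python
# Assumed table contents: the per-byte shift combinations of Algorithm 3.
T0 = [(t >> 4) ^ (t >> 5) for t in range(256)]
T1 = [((t << 4) ^ (t << 3) ^ t ^ (t >> 3)) & 0xFF for t in range(256)]

def R(r0, r1, t):
    """Accumulate function: returns the updated (r0, r1) and a fresh r2."""
    return r0 ^ T0[t], r1 ^ T1[t], (t << 5) & 0xFF

def reduce163_window(c):
    """Model of Algorithm 4 with the table-based accumulate function."""
    c = list(c)
    rb = rc = 0
    for i in range(40, 24, -3):
        rb, rc, ra = R(rb, rc, c[i])
        c[i - 19] ^= rb                  # rb is complete: single write
        rc, ra, rb = R(rc, ra, c[i - 1])
        c[i - 20] ^= rc
        ra, rb, rc = R(ra, rb, c[i - 2])
        c[i - 21] ^= ra
    rb, rc, ra = R(rb, rc, c[22])
    c[3] ^= rb
    rc, ra, rb = R(rc, ra, c[21])
    c[2] ^= rc
    ra ^= c[1]
    rb ^= c[0]
    t = c[20]
    c[20] = t & 0x07
    t >>= 3
    c[0] = rb ^ (((t << 7) ^ (t << 6) ^ (t << 3) ^ t) & 0xFF)
    c[1] = ra ^ (t >> 1) ^ (t >> 2)
    return c[:21]

def reduce_bitwise(x, m=163, taps=(7, 6, 3, 0)):
    """Reference reduction of an integer bitmask, one bit at a time."""
    for bit in range(x.bit_length() - 1, m - 1, -1):
        if (x >> bit) & 1:
            x ^= 1 << bit
            for tap in taps:
                x ^= 1 << (bit - m + tap)
    return x
```

Each reduced byte of the upper half now triggers one read of `c[i]` and one read-modify-write of a single lower byte, with the other two partial results held in the window.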
Analysis of modular reduction algorithms. As pointed out by Seo et al. [29],
Algorithm 3 executes many redundant memory accesses: 4 memory reads and 3
writes during each loop iteration and additional 4 reads and 3 writes on the final
step, which sum up to 88 reads and 66 writes. The proposed optimization reduces
the number of memory operations to 43 reads and 23 writes. Despite Algorithm 5
being specialized for the chosen polynomial, the register window technique can be
applied to any irreducible polynomial with the non-null coefficients located in the
first word. The implementation of Algorithm 6 also reduces the number of memory
accesses, since a standard implementation executes 122 reads and 92 writes while
our implementation executes 92 memory reads and 62 writes.

Algorithm 6 Fast modular reduction by $f(z) = z^{233} + z^{74} + 1$.
Input: $c(z) = c[0..58]$.
Output: $c(z) \bmod f(z) = c[0..29]$.
1: for $i \leftarrow 58$ downto 32 by 2 do
2:   $t_0 \leftarrow c[i]$
3:   $t_1 \leftarrow c[i-1]$
4:   $c[i-19] \leftarrow c[i-19] \oplus (t_0 \gg 7)$
5:   $c[i-20] \leftarrow c[i-20] \oplus (t_0 \ll 1) \oplus (t_1 \gg 7)$
6:   $c[i-21] \leftarrow c[i-21] \oplus (t_1 \ll 1)$
7:   $c[i-29] \leftarrow c[i-29] \oplus (t_0 \gg 1)$
8:   $c[i-30] \leftarrow c[i-30] \oplus (t_0 \ll 7) \oplus (t_1 \gg 1)$
9:   $c[i-31] \leftarrow c[i-31] \oplus (t_1 \ll 7)$
10: end for
11: $t_0 \leftarrow c[30]$
12: $c[0] \leftarrow c[0] \oplus (t_0 \ll 7)$
13: $c[1] \leftarrow c[1] \oplus (t_0 \gg 1)$
14: $c[10] \leftarrow c[10] \oplus (t_0 \ll 1)$
15: $c[11] \leftarrow c[11] \oplus (t_0 \gg 7)$
16: $t_0 \leftarrow c[29] \gg 1$
17: $c[0] \leftarrow c[0] \oplus t_0$
18: $c[9] \leftarrow c[9] \oplus (t_0 \ll 2)$
19: $c[10] \leftarrow c[10] \oplus (t_0 \gg 6)$
20: $c[29] \leftarrow c[29] \wedge \text{0x01}$
21: return $c$
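Algorithm 6 can be transcribed in the same way as the reduction for the smaller field. The sketch below works on a 59-byte little-endian list and masks left shifts to 8 bits. The reference function `reduce_bitwise` is not part of the paper and is included only to check the result against $z^{233} = z^{74} + 1$.

```python
def reduce233(c):
    """Algorithm 6: reduce a 59-byte value modulo z^233 + z^74 + 1."""
    c = list(c)
    for i in range(58, 30, -2):          # two bytes per iteration
        t0, t1 = c[i], c[i - 1]
        c[i - 19] ^= t0 >> 7
        c[i - 20] ^= ((t0 << 1) & 0xFF) ^ (t1 >> 7)
        c[i - 21] ^= (t1 << 1) & 0xFF
        c[i - 29] ^= t0 >> 1
        c[i - 30] ^= ((t0 << 7) & 0xFF) ^ (t1 >> 1)
        c[i - 31] ^= (t1 << 7) & 0xFF
    t0 = c[30]
    c[0] ^= (t0 << 7) & 0xFF
    c[1] ^= t0 >> 1
    c[10] ^= (t0 << 1) & 0xFF
    c[11] ^= t0 >> 7
    t0 = c[29] >> 1                      # the seven bits above z^232
    c[0] ^= t0
    c[9] ^= (t0 << 2) & 0xFF
    c[10] ^= t0 >> 6
    c[29] &= 0x01
    return c[:30]

def reduce_bitwise(x, m=233, taps=(74, 0)):
    """Reference reduction of an integer bitmask, one bit at a time."""
    for bit in range(x.bit_length() - 1, m - 1, -1):
        if (x >> bit) & 1:
            x ^= 1 << bit
            for tap in taps:
                x ^= 1 << (bit - m + tap)
    return x
```

Only shifts by 1 and by 7 appear, which is why no lookup tables are needed for this field.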
5.3. Squaring. The square of a finite field element $a(z) \in \mathbb{F}_{2^m}$ is given by
$a(z)^2 = \sum_{i=0}^{m-1} a_i z^{2i} = a_{m-1} z^{2m-2} + \cdots + a_2 z^4 + a_1 z^2 + a_0$. The binary representation of $a(z)^2$
can be computed by inserting a "0" bit between each pair of successive bits on the
binary representation of $a(z)$ and accelerated by introducing a 16-byte lookup table.
If modular reduction is computed in a separate step, redundant memory operations
are required to store the squaring result and reload this result for reduction. This
can be improved by embedding the modular reduction step directly into the squaring
algorithm. This way, the lower half of the digit vector $a$ is expanded in the usual
fashion and the upper half digits are expanded and immediately reduced. If modular
reduction of a single byte requires expensive shifts, additional lookup tables can be
used to store the expanded bytes already reduced. This is illustrated in Algorithm 7
which computes squaring in $\mathbb{F}_{2^{163}}$ using the same small rotating register window
as Algorithm 5 and three additional 16-byte lookup tables $T_0$, $T_1$ and $T_2$. For
squaring in $\mathbb{F}_{2^{233}}$, we also combine byte expansion of the digit vector's lower half
with Algorithm 6 for fast reduction.
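The byte expansion with a 16-byte table can be sketched as follows. The sketch covers only the unreduced squaring step, with hypothetical names, and uses a schoolbook product as a reference. The embedded reduction of Algorithm 7 is not modeled.

```python
# T[u] spreads the 4 bits of u into a byte: (0, u3, 0, u2, 0, u1, 0, u0).
T = [sum(((u >> k) & 1) << (2 * k) for k in range(4)) for u in range(16)]

def expand(a):
    """Square a binary polynomial given as a little-endian byte list (no reduction)."""
    c = []
    for byte in a:
        c.append(T[byte & 0x0F])   # low nibble gives the even output byte
        c.append(T[byte >> 4])     # high nibble gives the odd output byte
    return c

def clmul(a, b):
    """Reference carry-less product (integers as bitmasks)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r
```

Each input byte costs two table lookups and no shifts beyond the nibble extraction, which is why squaring is much cheaper than a general multiplication.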
Algorithm 7 Squaring in $\mathbb{F}_{2^{163}}$.
Input: $a(z) = a[0..20]$.
Output: $c(z) = a(z)^2 \bmod f(z)$.
Note: The accumulate function $R(r_0, r_1, r_2, t)$ executes: $r_0 \leftarrow r_0 \oplus T_0[t]$, $r_1 \leftarrow r_1 \oplus T_1[t]$, $r_2 \leftarrow r_2 \oplus T_2[t]$.
1: For each 4-bit combination $u$, $T(u) = (0, u_3, 0, u_2, 0, u_1, 0, u_0)$.
2: for $i \leftarrow 0$ to $9$ do
3: $c[2i] \leftarrow T(a[i] \wedge \mathtt{0x0F})$
4: $c[2i+1] \leftarrow T(a[i] \gg 4)$
5: end for
6: $c[20] \leftarrow T(a[10] \wedge \mathtt{0x0F})$
7: $r_b \leftarrow 0$, $r_c \leftarrow 0$, $j \leftarrow 20$
8: $t_0 \leftarrow a[20] \wedge \mathtt{0x0F}$
9: $R(r_b, r_c, r_a, t_0)$, $c[21] \leftarrow r_b$
10: for $i \leftarrow 19$ downto $13$ by $3$ do
11: $a_0 \leftarrow a[i]$, $t_0 \leftarrow a_0 \gg 4$, $t_1 \leftarrow a_0 \wedge \mathtt{0x0F}$
12: $R(r_c, r_a, r_b, t_0)$, $c[j] \leftarrow c[j] \oplus r_c$
13: $R(r_a, r_b, r_c, t_1)$, $c[j-1] \leftarrow c[j-1] \oplus r_a$
14: $a_0 \leftarrow a[i-1]$, $t_0 \leftarrow a_0 \gg 4$, $t_1 \leftarrow a_0 \wedge \mathtt{0x0F}$
15: $R(r_b, r_c, r_a, t_0)$, $c[j-2] \leftarrow c[j-2] \oplus r_b$
16: $R(r_c, r_a, r_b, t_1)$, $c[j-3] \leftarrow c[j-3] \oplus r_c$
17: $a_0 \leftarrow a[i-2]$, $t_0 \leftarrow a_0 \gg 4$, $t_1 \leftarrow a_0 \wedge \mathtt{0x0F}$
18: $R(r_a, r_b, r_c, t_0)$, $c[j-4] \leftarrow c[j-4] \oplus r_a$
19: $R(r_b, r_c, r_a, t_1)$, $c[j-5] \leftarrow c[j-5] \oplus r_b$
20: $j \leftarrow j - 6$
21: end for
22: $t_0 \leftarrow a[10] \gg 4$
23: $R(r_c, r_a, r_b, t_0)$, $c[2] \leftarrow c[2] \oplus r_c$
24: $r_a \leftarrow c[1] \oplus r_a$, $r_b \leftarrow c[0] \oplus r_b$
25: $t \leftarrow c[21]$
26: $r_a \leftarrow r_a \oplus t \oplus (t \ll 3) \oplus (t \ll 4) \oplus (t \gg 3)$
27: $r_b \leftarrow r_b \oplus (t \ll 5)$
28: $t \leftarrow c[20]$
29: $c[20] \leftarrow t \wedge \mathtt{0x07}$
30: $t \leftarrow t \gg 3$
31: $c[0] \leftarrow r_b \oplus (t \ll 7) \oplus (t \ll 6) \oplus (t \ll 3) \oplus t$
32: $c[1] \leftarrow r_a \oplus (t \gg 1) \oplus (t \gg 2)$
33: return $c$

5.4. Inversion. For inversion in $\mathbb{F}_{2^m}$ we implemented the Extended Euclidean Algorithm for polynomials [9]. Since this algorithm requires flexible left shifts by arbitrary amounts, we implemented six dedicated shifting functions to shift a binary field element by every amount possible for an 8-bit processor. The core of a multi-precision left shift algorithm is the sequence of instructions which receives as input the amount to shift $i$, a register $r$ and a carry register $rc$ storing the bits shifted out in the last iteration, and produces $(r \ll i) \oplus rc$ as output and $r \gg (8 - i)$ as new carry. Table 5 lists the required instructions and costs in cycles for shifting a single byte in each of the implemented multi-precision shifts by $i$ bits. Each instruction in the table costs 1 cycle, thus the cost to compute the core of a multi-precision left shift by $i$ bits is just the number of instructions listed in the $i$-th row of the table.
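The per-byte core of the multi-precision shift described above can be modelled in a few lines of Python. This is a behavioural sketch only: `shift_left` is a hypothetical name, the field element is assumed to be a little-endian list of bytes, and the variables `r` and `rc` stand in for the AVR registers.

```python
def shift_left(a, i):
    """Shift a little-endian byte vector left by i bits (1 <= i <= 7)."""
    assert 1 <= i <= 7
    out, rc = [], 0
    for r in a:
        out.append(((r << i) & 0xFF) ^ rc)  # (r << i) combined with the carry-in
        rc = r >> (8 - i)                   # bits shifted out become the next carry
    out.append(rc)                          # the last carry spills into one extra byte
    return out
```

Because the bits of `(r << i) & 0xFF` and of the incoming carry never overlap, combining them with XOR gives the same result as OR, which matches the `eor r, rc` instruction used in the table.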
i  Instructions
1  rol r
2  clr rt; lsl r; rol rt; lsl r; rol rt; eor r, rc; mov rc, rt
3  clr rt; lsl r; rol rt; lsl r; rol rt; lsl r; rol rt; eor r, rc; mov rc, rt
4  swap r; mov rt, r; andi r, 0xF0; andi rt, 0x0F; eor r, rc; mov rc, rt
5  swap r; mov rt, r; andi r, 0xF0; andi rt, 0x0F; lsl r; rol rt; eor r, rc; mov rc, rt
6  bst rt, 0; bld r, 6; bst rt, 1; bld r, 7; lsr rt; lsr rt; eor r, rc; mov rc, rt
7  bst rt, 0; bld r, 7; lsr rt; eor r, rc; mov rc, rt
Table 5. Processor instructions used to efficiently implement multi-precision left shifts by i bits. The input register is r, the carry register is rc and a temporary register is rt. When i = 1, rc is represented by the carry processor flag.

6. Algorithms for elliptic curve arithmetic
We have selected fast algorithms for elliptic curve arithmetic in three situations: multiplying a random point P by a scalar k, multiplying the generator G by a scalar k and simultaneously multiplying two points P and Q by scalars k and l to obtain kP + lQ. Our implementation uses mixed addition with projective coordinates [18], given that the ratio of inversion to multiplication is 16.
For multiplying a random point by a scalar, we choose Solinas' τ-adic non-adjacent form (TNAF) representation [30] with w = 4 for Koblitz curves (4-TNAF method with 4 precomputation points) and the method due to López and Dahab [17] for random binary curves. Solinas' algorithm exploits the optimizations provided by Koblitz curves and accelerates the computation of kP by replacing point doublings with applications of the efficiently computable endomorphism based on the Frobenius map $\tau(x, y) = (x^2, y^2)$. The method due to López and Dahab does not use precomputation, its execution time is constant and each iteration of the algorithm executes the same number of operations, independently of the bit pattern in k [9].
For multiplying the generator, we employ the same 4-TNAF method for Koblitz curves; and for generic curves, we employ the Comb method [16] with 16 precomputed points. Precomputed tables for the generator are stored in ROM memory to reduce RAM consumption. Larger precomputed tables can be used if program size is not an issue.
For simultaneous multiplication, we implement the interleaving method with 4-TNAFs for Koblitz curves and the interleaving of 4-NAFs with integers represented in non-adjacent form (NAF) for generic curves [6]. The same table built for multiplying the generator is used during simultaneous multiplication in Koblitz curves when point P or Q is the generator G. An additional small table of 4 points is precomputed for the generator and stored in ROM to provide the same benefit with generic curves.
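As an illustration of the scalar recodings mentioned above, the following Python sketch computes the ordinary integer width-w NAF of a scalar. It is the textbook integer recoding, not Solinas' τ-adic variant used for Koblitz curves, and the name `wnaf` is hypothetical. For w = 4 the nonzero digits are odd and lie in {±1, ±3, ±5, ±7}, so a table of 4 points (P, 3P, 5P, 7P) suffices when negation is cheap.

```python
def wnaf(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)            # residue of k modulo 2^w ...
            if d >= 1 << (w - 1):
                d -= 1 << w             # ... mapped into (-2^(w-1), 2^(w-1))
            k -= d                      # now k is divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits
```

A left-to-right point multiplication then performs one doubling per digit and one addition or subtraction of a precomputed point per nonzero digit; since any w consecutive digits contain at most one nonzero entry, the number of additions drops as w grows, at the cost of a larger table.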
7. Implementation results
The compiler and assembler used are from the GCC 4.1.2 suite for ATmega128 with optimization level -O2. The timings were measured with the software AVR Studio 4.14 [2]. This tool is a cycle-accurate simulator frequently used to prototype software for execution on the target platform. We have written a specialized library containing the software implementations.
Finite field arithmetic. The algorithms for squaring, multiplication, modular reduction and inversion in the finite field were implemented in the C language and Assembly. Table 6 presents the costs measured in cycles of each implemented operation in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$. Since the platform does not have cache memory or out-of-order execution, the finite field operations always cost the same number of cycles and the timings were taken exactly once, except for inversion. The timing for inversion was taken as the average of 50 timings measured on consecutive executions of the algorithm.
                              m = 163                 m = 233
Algorithm                     C language  Assembly    C language  Assembly
Squaring                      629         430         908         463
Modular squaring              1154        570         1340        956
LD mult. with registers       13838       4508        -           8314
LD mult. (new variant)        9738        -           18028       -
Karatsuba+LD with registers   12246       6968        25850       9261
Modular reduction             606         430         911         620
Inversion                     243790      81365       473618      142986
Table 6. Timings in cycles for arithmetic algorithms in $\mathbb{F}_{2^m}$.
From Table 6, m = 163, we can observe that in the C language implementation, Karatsuba+LD with registers multiplication is more efficient than the direct application of LD with registers multiplication. This contradicts the preliminary analysis based on the number of memory accesses executed by each algorithm. This can be explained by the fact that the LD with registers multiplication uses 21 of the 32 general-purpose registers to store intermediate results during multiplication. Several additional registers are also needed to store memory addresses and temporary variables for arithmetic operations. The inefficiency thus originates from the difficulty of the C compiler in keeping all intermediate values in registers. To confirm this limitation, a new variant of LD with registers multiplication which reduces the number of temporary variables needed was also implemented. This variant processes 32 bits of the operand in each iteration, compared to the original version of LD multiplication, which processes 4 bits in each iteration. The new variant reduces the number of memory accesses while keeping a smaller number of temporary variables and thus exhibits the expected performance. For the squaring algorithm, we can see that embedding the modular reduction step reduces the cost of modular squaring significantly compared with the sequential execution of squaring plus modular reduction. Table 6, m = 233, shows that the Karatsuba algorithm in $\mathbb{F}_{2^{233}}$ indeed does not improve performance over the multistep implementation of LD with registers multiplication, even if the processor does not have enough registers to store the full rotating register window. The Assembly implementations demonstrate the compiler inefficiency in generating optimized code and allocating resources for the target platform, showing considerably faster timings.
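For readers unfamiliar with the Karatsuba step being compared here, a minimal Python sketch of one Karatsuba level over binary polynomials follows. It is a generic illustration, not the Karatsuba+LD code measured in Table 6: the names `clmul` and `karatsuba` are hypothetical, polynomials are encoded as integers, and a simple shift-and-add carry-less multiplication stands in for the López-Dahab multiplier on the half-size operands.

```python
def clmul(x, y):
    """Shift-and-add carry-less multiplication of integer-encoded polynomials."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r

def karatsuba(x, y, n, base=clmul):
    """One Karatsuba level on n-bit polynomials; half-size products use `base`."""
    h = n // 2
    mask = (1 << h) - 1
    x0, x1 = x & mask, x >> h
    y0, y1 = y & mask, y >> h
    lo = base(x0, y0)
    hi = base(x1, y1)
    mid = base(x0 ^ x1, y0 ^ y1) ^ lo ^ hi   # x0*y1 + x1*y0 in characteristic 2
    return lo ^ (mid << h) ^ (hi << (2 * h))
```

The split trades one of four half-size multiplications for a few extra additions (XORs). Whether that trade pays off depends on how well the half-size operands and accumulators fit in registers, which is consistent with the different outcomes reported for m = 163 and m = 233.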
Elliptic curve arithmetic. Point multiplication was implemented on elliptic curves standardized by NIST. Table 7 presents the execution time of the multiplication of a random point P by a random integer k of 163 or 233 bits, with the underlying finite field arithmetic implemented in C or Assembly. In each of the programming languages, the fastest field multiplication algorithm is used. The results were computed as the arithmetic mean of the timings measured on 50 consecutive executions of the algorithm.
C language Assembly
Curve kG kP kP + lQ kG kP kP + lQ
NIST-K163 (Koblitz) 0.56 0.67 1.24 0.29 0.32 0.60
NIST-B163 (Generic) 0.77 1.55 2.21 0.37 0.74 1.04
NIST-K233 (Koblitz) 1.26 1.48 2.81 0.66 0.73 1.35
NIST-B233 (Generic) 1.94 3.90 5.35 0.94 1.89 2.52
Table 7. Timings in seconds for point multiplication.
Table 8 compares the performance of the proposed implementation with TinyECCK [29] and the work of Kargl et al. [12], the previously fastest binary curve implementations in C and Assembly published for this platform. For the C implementation, we achieve faster timings on all finite field arithmetic operations, with improvements over 50%. For the Assembly implementation, we obtain speed improvements on field squaring and multiplication and nearly the same timing for modular reduction (430 versus 433 cycles), but the polynomial used by Kargl et al. [12] is a trinomial carefully selected to support a faster modular reduction algorithm. The computation of kP on Koblitz curves implemented in C language was 41% faster than TinyECCK. By choosing the López-Dahab point multiplication algorithm with generic curves implemented in Assembly, we achieve a timing 11% faster than [12] while satisfying the timing-resistant property. If we relax this condition, we obtain a point multiplication 61% faster in Assembly by using Solinas' method. Comparing our Assembly implementation with TinyECCK and [12] with the same curve parameters, we achieve a 72% speedup and an 11% speedup for point multiplication, respectively.
                    Proposed     TinyECCK     Proposed    Kargl et al. [12]
Algorithm           C language   C language   Assembly    Assembly
Modular squaring    1154 c       2729 c       570 c       663 c
Multiplication      9738 c       19670 c      4508 c      5057 c
Modular reduction   606 c        1904 c       430 c       433 c
Inversion           243790 c     539132 c     81365 c     -
kP on Koblitz       0.67 s       1.14 s       0.32 s      -
kP on Generic       1.55 s       -            0.74 s      0.83 s
Table 8. Comparison between different implementations. The timings are presented in cycles (c) or seconds (s) on a 7.2838MHz device.
The fastest time for point multiplication previously published for this platform
at the 160-bit security level was 0.745 second [7]. Compared to this implementation,
which uses prime fields, the proposed optimizations result in a point multiplication
57% faster.
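The percentages quoted in this comparison can be reproduced from the timings in Table 8 and the 0.745 second figure, assuming that "X% faster" means the relative reduction in execution time. The helper name `speedup_percent` below is hypothetical.

```python
def speedup_percent(previous, proposed):
    """Relative reduction in time, (previous - proposed) / previous, in percent."""
    return round(100 * (previous - proposed) / previous)

print(speedup_percent(1.14, 0.67))   # C, Koblitz kP vs. TinyECCK         -> 41
print(speedup_percent(0.83, 0.74))   # Assembly, generic kP vs. [12]      -> 11
print(speedup_percent(0.83, 0.32))   # Assembly, Koblitz kP vs. [12]      -> 61
print(speedup_percent(1.14, 0.32))   # Assembly, Koblitz kP vs. TinyECCK  -> 72
print(speedup_percent(0.745, 0.32))  # Assembly, Koblitz kP vs. [7]       -> 57
```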
The implemented optimizations allow performance gains but come at the cost of increased memory consumption. Table 9 presents memory requirements for code size and RAM memory for the different implementations at the 160-bit security level. We can also observe that Assembly implementations are responsible for a significant expansion in program code size.
ROM memory Static RAM Stack RAM
Proposed (Koblitz) C 22092 1028 1207
Proposed (Koblitz) C+Assembly 25802 1732 1207
Proposed (Generic) C 12848 881 682
Proposed (Generic) C+Assembly 16218 1585 682
TinyECCK (C-only) 5592 618
Kargl et al. (C+Assembly) [12] 11264
Table 9. Cost in bytes of memory for implementations of scalar
multiplication of a random point at the 160-bit security level.
Cryptographic protocols. We now illustrate the performance obtained by our efficient implementation with some executions of cryptographic protocols for key agreement and digital signatures. Key agreement is employed in sensor networks for establishing symmetric keys which can be used for encryption or authentication. Digital signatures are employed for communication between the sensor nodes and the base stations where data must be made available to multiple applications and users [24]. For key agreement between nodes, we implemented the Elliptic Curve Diffie-Hellman (ECDH) protocol [3], and for digital signatures, we implemented the Elliptic Curve Digital Signature Algorithm (ECDSA) [3]. We assume that public and private keys are generated and loaded into the nodes before the deployment of the sensor network. Hence timings for key generation and public key authentication are not presented or considered. Table 10 presents the timings for the ECDH protocol and Table 11 presents the timings for the ECDSA protocol, using the choice of algorithms discussed in Section 6. The results in these tables pose an interesting decision between deploying generic binary curves at the lower security level or deploying special curves at the higher security level.
C language Assembly
Curve Time ROM RAM Time ROM RAM
NIST-K163 0.74 28.3 2.2 0.39 32.0 2.8
NIST-B163 1.62 24.0 1.1 0.81 27.8 1.9
NIST-K233 1.55 31.0 2.9 0.80 38.6 3.7
NIST-B233 3.97 26.9 1.5 1.96 34.6 2.2
Table 10. Timings for the ECDH protocol execution. Timings
are given in seconds and ROM memory or Static+Stack RAM con-
sumption are given in KB.
C language Assembly
Curve Time (S + V) ROM RAM Time (S + V) ROM RAM
NIST-K163 0.67 + 1.23 31.8 2.9 0.36 + 0.63 35.3 3.7
NIST-B163 0.87 + 2.17 29.6 2.1 0.45 + 1.05 33.2 2.8
NIST-K233 1.46 + 2.76 34.6 3.1 0.78 + 1.39 42.2 3.8
NIST-B233 2.09 + 5.25 32.8 2.3 1.04 + 2.55 40.4 3.1
Table 11. Timings for the ECDSA protocol execution. Timings
for signature (S) and verification (V) are given in seconds and ROM
memory or Static+Stack RAM consumption are given in KB.
8. Conclusions
Despite several years of intense research, security and cryptography on WSNs still face several open problems. In this work, we presented efficient implementations of binary field algorithms such as squaring, multiplication, modular reduction and inversion. These implementations take into account the characteristics of the target platform (the MICAz Mote) to develop optimizations, specifically: (i) the cost of memory addressing; (ii) the cost of memory instructions; (iii) the limited flexibility of bitwise shift instructions. We obtain the fastest binary field arithmetic implementations in C and Assembly published for the target platform. Significant performance benefits were achieved by the Assembly implementation, resulting from fine-grained resource allocation and instruction selection. These optimizations produced a point multiplication at the 160-bit security level in under $\frac{1}{3}$ of a second, an improvement of 72% compared to the best implementation of a Koblitz curve previously published and an improvement of 61% compared to the best implementation of binary curves. When compared to the best implementation of prime curves, we obtain a performance gain of 57%. We also presented the first timings of elliptic curves at the higher 233-bit security level. For both security levels, we illustrate the performance obtained with executions of key agreement and digital signature protocols. In particular, a key agreement can be computed in under 0.40 second at the 163-bit security level and under 0.80 second at the 233-bit security level. A digital signature can be computed and verified in 1 second at the 163-bit security level and in 2.17 seconds at the 233-bit security level. We hope that our results can increase the efficiency and viability of elliptic curve cryptography on wireless sensor networks.
Acknowledgements
We would like to thank the referees for their valuable comments and suggestions. Diego F. Aranha is supported by FAPESP, grant no. 2007/06950-0. Julio López and Ricardo Dahab are partially supported by CNPq and FAPESP research grants.
References
[1] Atmel Corporation, 8 bit AVR Microcontroller ATmega128(L) manual, Atmel, (2004), edition 2467m-avr-11/04.
[2] Atmel Corporation, AVR Studio 4.14, Atmel, (2005), available online at http://www.atmel.com/.
[3] Certicom Research, SEC 1: Elliptic Curve Cryptography, (2000), available online at http://www.secg.org.
[4] H. Eberle, A. Wander, N. Gura, S. Chang-Shantz and V. Gupta, Architectural extensions for elliptic curve cryptography over $GF(2^m)$ on 8-bit microprocessors, in “Proceedings of IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP’05)”, IEEE, (2005), 343–349.
[5] D. Estrin, R. Govindan, J. S. Heidemann and S. Kumar, Next century challenges: Scalable coordination in sensor networks, in “Proceedings of Mobile Computing and Networking (MobiCom’99)”, (1999), 263–270.
[6] R. Gallant, R. Lambert and S. Vanstone, Faster point multiplication on elliptic curves with efficient endomorphisms, in “Proceedings of the 21st Annual International Cryptology Conference on Advances in Cryptology (CRYPTO’01)”, Springer, (2001), 190–200.
[7] J. Großschädl, TinySA: a security architecture for wireless sensor networks, in “Proceedings of ACM International Conference on emerging Networking EXperiments and Technologies (CoNEXT’06)”, ACM, (2006).
[8] N. Gura, A. Patel, A. Wander, H. Eberle and S. C. Shantz, Comparing elliptic curve cryptography and RSA on 8-bit CPUs, in “Proceedings of Workshop on Cryptographic Hardware and Embedded Systems (CHES’04)”, Springer, (2004), 119–132.
[9] D. Hankerson, A. J. Menezes and S. Vanstone, “Guide to Elliptic Curve Cryptography,” Springer, New York, 2004.
[10] J. L. Hill and D. E. Culler, MICA: a wireless platform for deeply embedded networks, IEEE Micro, 22 (2002), 12–24.
[11] A. Karatsuba and Y. Ofman, Multiplication of many-digital numbers by automatic computers, Transl. Physics-Doklady, 7 (1963), 595–596.
[12] A. Kargl, S. Pyka and H. Seuschek, Fast arithmetic on ATmega128 for elliptic curve cryptography, preprint, available online at http://eprint.iacr.org/2008/442.
[13] C. Karlof, N. Sastry and D. Wagner, TinySec: a link layer security architecture for wireless sensor networks, in “Proceedings of 2nd ACM Conference on Embedded Networked Sensor Systems (SenSys’04)”, ACM, (2004), 162–175.
[14] N. Koblitz, Elliptic curve cryptosystems, Math. Comput., 48 (1987), 203–209.
[15] G.-J. Lay and H. G. Zimmer, Constructing elliptic curves with given group order over large finite fields, in “Algorithmic Number Theory,” (1994), 250–263.
[16] C. H. Lim and P. J. Lee, More flexible exponentiation with precomputation, in “Proceedings of the 14th Annual International Cryptology Conference on Advances in Cryptology (CRYPTO’94)”, Springer, (1994), 95–107.
[17] J. López and R. Dahab, Fast multiplication on elliptic curves over $GF(2^m)$ without precomputation, in “Proceedings of Workshop on Cryptographic Hardware and Embedded Systems (CHES’99)”, Springer, (1999), 316–327.
[18] J. López and R. Dahab, Improved algorithms for elliptic curve arithmetic in $GF(2^n)$, in “Proceedings of Workshop on Selected Areas in Cryptography (SAC’98)”, Springer, (1999), 201–212.
[19] J. López and R. Dahab, High-speed software multiplication in $GF(2^m)$, in “Proceedings of International Conference on Cryptology in India (INDOCRYPT’00)”, Springer, (2000), 203–212.
[20] D. J. Malan, M. Welsh and M. D. Smith, A public-key infrastructure for key distribution in TinyOS based on elliptic curve cryptography, in “Proceedings of IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON’04)”, (2004).
[21] A. Menezes, T. Okamoto and S. Vanstone, Reducing elliptic curve logarithms to logarithms in a finite field, IEEE Trans. Inform. Theory, 39 (1993), 1639–1646.
[22] V. Miller, Uses of elliptic curves in cryptography, in “Advances in Cryptology (CRYPTO’85)”, Springer, (1986), 417–426.
[23] L. B. Oliveira, D. F. Aranha, E. Morais, F. Daguano, J. López and R. Dahab, TinyTate: computing the Tate pairing in resource-constrained sensor nodes, in “Proceedings of IEEE International Symposium on Network Computing and Applications (NCA’07)”, IEEE, (2007), 318–323.
[24] L. B. Oliveira, A. Kansal, B. Priyantha, M. Goraczko and F. Zhao, Secure-TWS: authenticating node to multi-user communication in shared sensor networks, in “Proceedings of International Conference on Information Processing in Sensor Networks (IPSN’09)”, IEEE, (2009), 289–300.
[25] L. B. Oliveira, M. Scott, J. López and R. Dahab, TinyPBC: pairings for authenticated identity-based non-interactive key distribution in sensor networks, in “Proceedings of International Conference on Networked Sensing Systems (INSS’08)”, IEEE, (2008), 173–180.
[26] A. Perrig, R. Szewczyk, V. Wen, D. Culler and J. D. Tygar, SPINS: security protocols for sensor networks, Wireless Networks, 8 (2002), 521–534.
[27] T. Satoh, B. Skjernaa and Y. Taguchi, Fast computation of canonical lifts of elliptic curves and its application to point counting, Finite Fields Appl., 9 (2003), 89–101.
[28] M. Scott, MIRACL multiprecision integer and rational arithmetic C/C++ library, available online at http://www.shamus.ie/.
[29] S. C. Seo, D. Han and S. Hong, TinyECCK: efficient elliptic curve cryptography implementation over $GF(2^m)$ on 8-bit MICAz mote, preprint, available online at http://eprint.iacr.org/2008/122.
[30] J. A. Solinas, Efficient arithmetic on Koblitz curves, Des. Codes Cryptogr., 19 (2000), 195–249.
[31] P. Szczechowiak, L. B. Oliveira, M. Scott, M. Collier and R. Dahab, NanoECC: testing the limits of elliptic curve cryptography in sensor networks, in “Proceedings of European conference on Wireless Sensor Networks (EWSN’08)”, Springer, (2008), 305–320.
[32] L. Uhsadel, A. Poschmann and C. Paar, Enabling full-size public-key algorithms on 8-bit sensor nodes, in “Proceedings of European Workshop on Security in Ad-hoc and Sensor Networks (ESAS’07)”, Springer, (2007), 73–86.
[33] H. Wang and Q. Li, Efficient implementation of public key cryptosystems on mote sensors, in “Proceedings of International Conference on Information and Communication Systems (ICICS’06)”, Springer, (2006), 519–528.
[34] H. Yan and Z. J. Shi, Studying software implementations of elliptic curve cryptography, in “Proceedings of International Conference on Information Technology: New Generations (ITNG’06)”, IEEE, (2006), 78–83.
Received June 2009; revised December 2009.
E-mail address: dfaranha@ic.unicamp.br
E-mail address: rdahab@ic.unicamp.br
E-mail address: jlopez@ic.unicamp.br
E-mail address: leob@ft.unicamp.br
Advances in Mathematics of Communications Volume 4, No. 2 (2010), xxx–xxx
... The first option to address this is to optimize software for cryptographic calculations on microcontrollers. For example, [6,20] propose efficient assembly implementations by manipulating the data flow to maximize the register use, achieving around two to three orders of magnitude of speedup. ...
... The overall performance improvement and energy reduction of each algorithm will be discussed in the next subsection. Recryptor achieves > 11× speedup and > 6.7× energy savings over the baseline software [6]. The performance improvements and energy reductions increase as the word length increases, showing that Recryptor scales well to large bit-width operations. ...
... Then each ω bits are used as an index for the precompute table lookup.However, the number in the finite field needs m bits, and the inputs/output/intermediate values are stored in the memory. Due to register spilling on the M0, using this algorithm tends to create a large memory accesses.[6,20] optimizes the overflow to solve this spilling problem by maximizing the register reuse.We propose a new optimization to combine the (LD) field multiplication and reduction algorithm with the goal of reducing the number of operations on Recryptor, as shown in Algorithm 5. ...
Thesis
The Internet of Things (IoT) is a rapidly growing field that holds potential to transform our everyday lives by placing tiny devices and sensors everywhere. The ubiquity and scale of IoT devices require them to be extremely energy efficient. Given the physical exposure to malicious agents, security is a critical challenge within the constrained resources. This dissertation presents energy-efficient hardware designs for IoT security. First, this dissertation presents a lightweight Advanced Encryption Standard (AES) accelerator design. By analyzing the algorithm, a novel method to manipulate two internal steps to eliminate storage registers and replace flip-flops with latches to save area is discovered. The proposed AES accelerator achieves state-of-art area and energy efficiency. Second, the inflexibility and high Non-Recurring Engineering (NRE) costs of Application-Specific-Integrated-Circuits (ASICs) motivate a more flexible solution. This dissertation presents a reconfigurable cryptographic processor, called Recryptor, which achieves performance and energy improvements for a wide range of security algorithms across public key/secret key cryptography and hash functions. The proposed design employs circuit techniques in-memory and near-memory computing and is more resilient to power analysis attack. In addition, a simulator for in-memory computation is proposed. It is of high cost to design and evaluate new-architecture like in-memory computing in Register-transfer level (RTL). A C-based simulator is designed to enable fast design space exploration and large workload simulations. Elliptic curve arithmetic and Galois counter mode are evaluated in this work. Lastly, an error resilient register circuit, called iRazor, is designed to tolerate unpredictable variations in manufacturing process operating temperature and voltage of VLSI systems. 
When integrated into an ARM processor, this adaptive approach outperforms competing industrial techniques such as frequency binning and canary circuits in performance and energy.
... Thus, designing an optimal BF multiplication method on such MCUs is a challenging task. Nonetheless, until now, several efficient BF multiplication methods have been proposed on 8-bit AVR MCUs and they are classified into two main categories: Lookup Table (LUT)-based approaches [2][3][4][5] and Block-Comb (BC)-based approaches [6-10]. ...
... Until now, many studies have been conducted for optimizing BF multiplication's performance on 8-bit AVR platforms [2][3][4][5][6][7][8]. They can be categorized into two main approaches: LookUp Table-based (LUT-based) approaches [2][3][4][5] and Block-Comb-based (BC-based) approaches [6][7][8][9]. ...
... Until now, many studies have been conducted for optimizing BF multiplication's performance on 8-bit AVR platforms [2][3][4][5][6][7][8]. They can be categorized into two main approaches: LookUp Table-based (LUT-based) approaches [2][3][4][5] and Block-Comb-based (BC-based) approaches [6][7][8][9]. Table 1 summarizes the existing result results, and the details will be explained in the following Sections 3.1 and 3.2. ...
Article
Full-text available
Binary field ( B F ) multiplication is a basic and important operation for widely used crypto algorithms such as the GHASH function of GCM (Galois/Counter Mode) mode and NIST-compliant binary Elliptic Curve Cryptosystems (ECCs). Recently, Seo et al. proposed a novel SCA-resistant binary field multiplication method in the context of GHASH optimization in AES GCM mode on 8-bit AVR microcontrollers (MCUs). They proposed a concept of Dummy XOR operation with a kind of garbage registers and a concept of instruction level atomicity ( I L A ) for resistance against Timing Analysis (TA) and Simple Power Analysis (SPA) and used a Karatsuba Block-Comb multiplication approach for efficiency. Even though their method achieved a large performance improvement compared with previous works, it still has room for improvement on the 8-bit AVR platform. In this paper, we propose a more improved binary field multiplication method on 8-bit AVR MCUs. Our method basically adopts a Dummy XOR technique using a set of garbage registers for TA and SPA security; however, we save the number of used garbage registers from eight to one by using the fact that the number of used garbage registers does not affect TA and SPA security. In addition, we apply a multiplier encoding approach so as to decrease the number of required registers when accessing the multiplier, which enables the use of extended block size in the Karatsuba Block-Comb multiplication technique. Actually, the proposed technique extends the block size from four to eight and the proposed binary field multiplication method can compute a 128-bit B F multiplication with only 3816 clock cycles ( c c ) (resp. 3490 c c ) with (resp. without) the multiplier encoding process, which is almost a 32.8% (resp. 38.5%) improvement compared with 5675 c c of the best previous work. We apply the proposed technique to the GHASH function of the GCM mode with several additional optimization techniques. 
The proposed GHASH implementation provides improved performance by over 42% compared with the previous best result. The concept of the proposed B F method can be extended to other MCUs, including 16-bit MSP430 MCUs and 32-bit ARM MCUs.
... Before the data are transmitted, the sender and the receiver must agree on a secret key, and both parties must keep the key. If the key of one party is leaked, the encrypted information is insecure, and the security cannot be guaranteed [24]. ...
Article
Full-text available
University and college laboratories are important places to train professional and technical personnel. Various regulatory departments in colleges and universities still rely on traditional laboratory management in research projects, which are prone to problems such as untimely information and data transmission. The present study aimed to propose a new method to solve the problem of data islands, explicit ownership, conditional sharing, data safety, and efficiency during laboratory data management. Hence, this study aimed to develop a data-centered lab management system that enhances the safety of lab data management and allows the data owners of the labs to control data sharing with other users. The architecture ensures data privacy by binding data ownership with a person using a key management method. To achieve data flow safely, data ownership conversion through the process of authorization and confirmation was introduced. The designed lab management system enables laboratory regulatory departments to receive data in a secure form by using this platform, which could solve data sharing barriers. Finally, the proposed system was applied and run in different server environments by implementing data security registration, authorization, confirmation, and conditional sharing using SM2, SM4, RSA, and AES algorithms. The system was evaluated in terms of the execution time for several lab data with different sizes. The findings of this study indicate that the proposed strategy is safe and efficient for lab data sharing across domains.
... SSAS is an addressing method, which uses an elliptic curve cryptography (ECC) algorithm (Aranha et al., 2010;McGrew et al., 2011;Khalique et al., 2010) instead of RSA that is utilised by SeND for address arrangement. SSAS is less complicated in comparison with the SeND method. ...
Article
Internet Protocol version 6 (IPv6) is the latest version of the Internet Protocol (IP). The protocol provides an identification and location system for computers on a network. One of the key protocols in IPv6 is the neighbour discovery protocol (NDP). NDP covers many functions, including the discovery of nodes on the same link, the detection of duplicate addresses, and the detection of routers. Its central role makes NDP susceptible to several attacks, including denial of service (DoS). Many mechanisms have been proposed to secure NDP; however, these mechanisms are still vulnerable. This paper reviews the significance of NDP and presents the advantages and disadvantages of each proposed mechanism. Moreover, the SeND mechanism was implemented and the results were compared with the original NDP. This paper discusses the features required of a proposed mechanism for securing the IPv6 NDP processes.
... The variance in bandwidth upgraded the robustness of curve including reduction in cost of clock cycles by 18%. Previous methods applied by Liu et al. (2014), Hutter and Schwabe (2013), Hinterwälder et al. (2014), De Clercq et al. (2014), Wenger et al. (2013), Gouvêa et al. (2012), Aranha et al. (2010) and Gura et al. (2004), utilized identical hardware configured IoT devices with different types of elliptic curves where each type resulted in different computational cost and memory consumption. However, the results indicated that these resource constrained computing capable IoT devices handled encryption computational tasks in reasonable time and inexpensive memory. ...
Article
Robust encryption techniques require heavy computational capability and consume large amounts of memory, which are unaffordable for resource-constrained IoT devices and Cyber-Physical Systems that must also perform general-purpose data manipulation tasks. Many encryption techniques have been introduced to address the inability of such devices to provide robust security at low cost. This article presents an encryption technique, implemented on a resource-constrained IoT device (AVR ATmega2560), that exploits the fast execution and low memory consumption of curve25519 in a novel and efficient lightweight hash function. The hash function uses the GMP library for multi-precision arithmetic and pre-calculated curve points to devise a good cipher block using ECDH-based key exchange protocols and a large random prime number generator function.
Article
Low-power wireless sensor networks (WSNs) and the Internet of Things (IoT) have a great impact on real-time applications in future 5th generation (5G) mobile networks, owing to wireless powered communication technologies. The age of information (AoI) is a crucial performance metric in an IoT-enabled real-time smart warehouse application, where the freshness of the aggregated data is very important. However, wireless communication among the beacon nodes and the user equipment (tracking nodes) gives an adversary the opportunity not only to eavesdrop on the data, but also to corrupt it by deleting, modifying or inserting malicious information during communication among the entities in the smart warehouse environment. To mitigate these issues, we design a security scheme for an AoI-enabled 5G smart warehouse through an access control mechanism, in which secure communication among the beacon nodes and the tracking nodes takes place through mutual device authentication and a key agreement process. The fresh data collected at the enterprise cloud is then used in Big data analytics for better predictions and analysis, such as optimal device scheduling that keeps the data fresh. The rigorous security analysis and comparative study show that the proposed mechanism has significantly better security, and comparable communication and computational costs, relative to the relevant schemes. In addition, real-time testbed experiments show that the proposed scheme is practical in the 5G smart warehouse context.
Article
Machine-type communication devices have become a vital part of the autonomous industrial internet of things and industry 4.0. These autonomous resource-constrained devices share sensitive data, and are primarily acquired for automation and to operate consistently in remote environments under severe conditions. The requirements to secure the sensitive data shared between these devices consist of a resilient encryption technique with affordable operational costs. Consequently, devices, data, and networks are made secure by adopting a lightweight cryptosystem that should achieve robust security with sufficient computational and communication costs and counter modern security threats. This paper offers in-depth studies on different types and techniques of hardware and software-based lightweight cryptographies for machine-type communication devices in machine-to-machine communication networks.
Article
Privacy, identity preservation and integrity have become key problems for telecommunication standards. Significant privacy threats are expected in 5G networks, considering the large number of devices that will be deployed. As the Internet of Things (IoT) and long-term evolution for machine type (LTE-m) grow very fast with massive data traffic, the risk of privacy attacks will greatly increase. Standards bodies should therefore ensure users' identity and privacy in order to gain the trust of service providers and industries. Against such threats, 5G specifications require a rigid and robust privacy procedure. Many research studies have addressed user privacy in 5G networks. This paper proposes a method to enhance user identity privacy in 5G systems through a scheme that protects the international mobile subscriber identity (IMSI) using a mutable mobile subscriber identity (MMSI) that changes randomly and avoids the exchange of IMSIs. It maintains authentication and key agreement (AKA) structure compatibility with previous mobile generations and improves user equipment (UE) synchronization with home networks. The proposed algorithm adds no computation overhead to the UE or the network, except a small amount in the home subscriber server (HSS). The proposed mutable pseudonym uses the XOR function to send the MMSI from the HSS to the UE, which significantly reduces the encryption overhead. The proposed solution was verified with ProVerif.
Article
Since their introduction to cryptography in 1985, elliptic curves have sparked a lot of research and interest in public key cryptography. In this essay, we present an overview of public key cryptography based on the discrete logarithm problem of both finite fields and elliptic curves. We discuss one of the basic and important properties of elliptic curves, the group law, and show that the set of points on the curve forms an additive abelian group. We show how the order of this abelian group affects the discrete logarithm problem and hence the security of a public key cryptosystem. We present the Diffie-Hellman key exchange and ElGamal cryptosystem based on the discrete logarithm problem of finite fields and also give their analogues in the elliptic curve case. We finally show why elliptic curves are dictating the future of public key cryptography and what makes them more efficient in constrained and wireless communications.
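The finite-field Diffie-Hellman exchange mentioned in this abstract can be illustrated with a minimal sketch. The function names `dh_keypair` and `dh_shared` and the toy parameters p = 23, g = 5 are assumptions chosen for readability, not values taken from the essay; a real deployment would use a standardized group of cryptographic size.

```python
import secrets

def dh_keypair(p, g):
    """Return (private, public) for Diffie-Hellman in the multiplicative group mod p."""
    priv = secrets.randbelow(p - 3) + 2   # private exponent drawn from [2, p - 2]
    pub = pow(g, priv, p)                 # public value g^priv mod p
    return priv, pub

def dh_shared(priv, peer_pub, p):
    """Derive the shared secret (peer_pub)^priv mod p."""
    return pow(peer_pub, priv, p)

# Toy parameters (hypothetical, far too small to be secure).
P_TOY, G_TOY = 23, 5

alice_priv, alice_pub = dh_keypair(P_TOY, G_TOY)
bob_priv, bob_pub = dh_keypair(P_TOY, G_TOY)

# Both sides arrive at g^(a*b) mod p without ever sending a or b.
alice_secret = dh_shared(alice_priv, bob_pub, P_TOY)
bob_secret = dh_shared(bob_priv, alice_pub, P_TOY)
```

An eavesdropper sees only `alice_pub` and `bob_pub`; recovering the secret from them is the discrete logarithm problem the abstract refers to. The elliptic curve analogue replaces modular exponentiation with scalar multiplication of a curve point.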
Article
We discuss analogs based on elliptic curves over finite fields of public key cryptosystems which use the multiplicative group of a finite field. These elliptic curve cryptosystems may be more secure, because the analog of the discrete logarithm problem on elliptic curves is likely to be harder than the classical discrete logarithm problem, especially over $\mathrm{GF}(2^n)$. We discuss the question of primitive points on an elliptic curve modulo $p$, and give a theorem on nonsmoothness of the order of the cyclic subgroup generated by a global point.
Article
We discuss analogs based on elliptic curves over finite fields of public key cryptosystems which use the multiplicative group of a finite field. These elliptic curve cryptosystems may be more secure, because the analog of the discrete logarithm problem on elliptic curves is likely to be harder than the classical discrete logarithm problem, especially over $\mathrm{GF}(2^n)$. We discuss the question of primitive points on an elliptic curve modulo $p$, and give a theorem on nonsmoothness of the order of the cyclic subgroup generated by a global point.
Article
Let $p$ be a fixed small prime. We give an algorithm with preprocessing to compute the $j$-invariant of the canonical lift of a given ordinary elliptic curve $E/\mathbb{F}_q$ ($q = p^N$, $j(E) \notin \mathbb{F}_{p^2}$) modulo $p^{N/2+O(1)}$ in $O(N^{2\mu+1/(\mu+1)})$ bit operations (assuming the time complexity of multiplying two $n$-bit objects is $O(n^{\mu})$) using $O(N^2)$ memory, not including preprocessing. This is faster than the algorithm of Vercauteren et al. [14] by a factor of $N^{\mu/(\mu+1)}$. Let $K$ be the unramified extension field of degree $N$ over $\mathbb{Q}_p$. We also develop an algorithm to compute $N_{K/\mathbb{Q}_p}(x) \bmod p^{N/2+O(1)}$ with $O(N^{2\mu+0.5})$ bit operations and $O(N^2)$ memory when $x \in K$ satisfies certain conditions, which are always satisfied when applied to our point counting algorithm. As a result, we get an $O(N^{2\mu+0.5})$ time, $O(N^2)$ memory algorithm for counting the $\mathbb{F}_q$-rational points on $E$, which turns out to be very fast in practice for cryptographic size elliptic curves.
Article
We describe data path extensions for general-purpose microprocessors to accelerate the emerging public-key cryptosystem Elliptic Curve Cryptography (ECC). ECC is computationally more efficient than the popular RSA cryptosystem and, thus, is an enabling security technology for light-weight devices that are limited in compute power, memory capacity, and battery power. Elliptic curves have been standardized by NIST and SECG for fields $\mathrm{GF}(p)$ and $\mathrm{GF}(2^m)$. Though both types of fields offer similar security strengths, the standards offer a choice to accommodate different implementation platforms. While arithmetic operations over fields $\mathrm{GF}(p)$ directly map to integer operations found in standard processors, operations over fields $\mathrm{GF}(2^m)$ are supported rather inefficiently. We show that simple extensions of the data path suffice to efficiently support ECC over $\mathrm{GF}(2^m)$ and to outperform ECC over $\mathrm{GF}(p)$. These extensions include an extended integer multiplier that also generates multiplication results for fields $\mathrm{GF}(2^m)$ and a multiply-accumulate instruction for efficient multiple-precision multiplications. On the 8-bit ATmega128 microprocessor running at 8 MHz we measured an execution time for a 163-bit ECC point multiplication over $\mathrm{GF}(2^m)$ of 0.4 s with the extended multiplier and 0.29 s if, in addition, a multiply-accumulate instruction is provided. In comparison, a 1024-bit RSA private-key operation providing equivalent security strength takes 11 s.
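To make concrete why $\mathrm{GF}(2^m)$ arithmetic maps poorly onto an ordinary integer multiplier, the sketch below multiplies two field elements using only shifts and XORs (carry-less multiplication with interleaved reduction). It is a bit-serial illustration, not the data path extension this abstract describes; the name `gf2m_mul` is hypothetical. Integers encode polynomials, with bit $i$ holding the coefficient of $x^i$. The constant `F163` is the NIST reduction polynomial $x^{163} + x^7 + x^6 + x^3 + 1$ for the 163-bit binary field.

```python
def gf2m_mul(a, b, modulus, m):
    """Multiply a and b in GF(2^m), assuming both are already reduced (< 2^m).

    `modulus` is the irreducible polynomial of degree m, including its x^m term.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^m) is XOR, with no carries
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> m) & 1:
            a ^= modulus         # degree reached m: subtract (XOR) the modulus
    return result

# NIST reduction polynomial for the 163-bit binary field.
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

# Small check in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1.
AES_POLY = 0x11B
product = gf2m_mul(0x57, 0x83, AES_POLY, 8)
```

A standard integer multiplier propagates carries between bit positions, which is exactly what this operation must avoid; that is why software on unmodified processors falls back on shift-and-XOR loops or table lookups.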
Conference Paper
This paper describes an algorithm for computing elliptic scalar multiplications on non-supersingular elliptic curves defined over $\mathrm{GF}(2^m)$. The algorithm is an optimized version of a method described in [1], which is based on Montgomery’s method [8]. Our algorithm is easy to implement in both hardware and software, works for any elliptic curve over $\mathrm{GF}(2^m)$, requires no precomputed multiples of a point, and is faster on average than the addition-subtraction method described in draft standard IEEE P1363. In addition, the method requires less memory than projective schemes and the amount of computation needed for a scalar multiplication is fixed for all multipliers of the same binary length. Therefore, the improved method possesses many desirable features for implementing elliptic curves in restricted environments.
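The fixed amount of computation per multiplier length comes from the ladder structure of Montgomery's method: each scalar bit triggers exactly one addition and one doubling. The sketch below shows only that control structure over an abstract group. The function `montgomery_ladder` and its callback parameters are hypothetical, and the additive group of integers modulo 1009 stands in for curve points, so none of the paper's $\mathrm{GF}(2^m)$ point formulas are reproduced here.

```python
def montgomery_ladder(k, P, add, dbl, identity):
    """Compute k*P with one `add` and one `dbl` per bit of k.

    Invariant maintained throughout the loop: R1 - R0 == P.
    """
    R0, R1 = identity, P
    for i in reversed(range(k.bit_length())):
        if (k >> i) & 1:
            R0, R1 = add(R0, R1), dbl(R1)
        else:
            R0, R1 = dbl(R0), add(R0, R1)
    return R0

# Stand-in group (an assumption for illustration): integers mod 1009 under addition.
N_TOY = 1009

def toy_add(a, b):
    return (a + b) % N_TOY

def toy_dbl(a):
    return (2 * a) % N_TOY

ladder_result = montgomery_ladder(77, 5, toy_add, toy_dbl, 0)
```

Because the two running values always differ by `P`, a curve implementation can carry out the addition from x-coordinates alone, which is where the memory saving mentioned in the abstract comes from. Note that this Python version still branches on the scalar bit, so it is not a constant-time implementation.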