All content in this area was uploaded by Diego F. Aranha on Jan 02, 2017
Lightweight Cryptography on ARM
Software implementation of block ciphers and ECC
Rafael Cruz, Tiago Reis, Diego F. Aranha, Julio López, Harsh Kupwade Patil
University of Campinas, LG Electronics Inc.
Context
Cryptography can mitigate critical security issues in embedded devices.
Security property            Technique                 Primitive
Protecting data at rest      FS-level encryption       Block cipher
Protecting data in transit   Secure channel            Auth block/stream cipher
Secure software updates      Code signing              Digital signatures
Secure booting               Integrity/Authentication  Hash functions, MACs
Secure debugging             Entity authentication     Challenge-response
Device id/auth               Auth protocol             PKC
Key distribution             Key exchange              PKC
Several algorithms required to implement primitives:
• Block and stream ciphers
• Hash functions
• AEAD and Message Authentication Codes (MACs)
• Elliptic Curve Cryptography
Problem: Why “lightweight cryptography”? Shouldn’t all cryptography
be ideally lightweight?
From Mouha in [Mou15]
“Although the question seems simple, this appears to be a quite
controversial subject. (...) It is important to note that lightweight
cryptography should not be equated with weak cryptography”.
Solution: Alternative name for application-specific cryptography or
application-driven cryptographic design?
Summary
We discuss techniques for efficient and secure implementations of
lightweight encryption in software:
1. Fantomas, an LS-Design proposed in [GLSV14].
2. PRESENT, a Substitution-Permutation Network (SPN) [BKL+07].
3. Curve25519 for Elliptic Curve Cryptography.
We target low-end and NEON-capable ARM processors, typical of
embedded systems. Results are part of a project sponsored by LG
involving 7 students and more than 30 symmetric (C) and asymmetric
(ASM) algorithms.
Construction
LS-Designs
Paradigm to construct block ciphers providing:
• Lightweight designs from simple substitution and linear layers.
• Friendliness to side-channel countermeasures (bitslicing and
masking).
• Tweakable variant for authenticated encryption (SCREAMv3).
[Figure: state matrix with dimensions of l bits and s bits.]
Algorithm 1: LS-Design encrypting block B into ciphertext C with key K.
C ← B ⊕ K                       ▷ C represents an s × l-bit matrix
for 0 ≤ r < N_r do
    for 0 ≤ i < l do            ▷ S-box layer
        C[i, ⋆] = S[C[i, ⋆]]
    end for
    for 0 ≤ j < s do            ▷ L-box layer
        C[⋆, j] = L[C[⋆, j]]
    end for
    C ← C ⊕ K ⊕ C(r)            ▷ Key and round constant addition
end for
return C
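The round structure of Algorithm 1 can be sketched in C as follows. This is only an illustration under stated assumptions: the state is held as 8 rows of 16 bits, and the names `LS_ROWS`, `ls_layer_fn`, `ls_encrypt`, `toy_sbox_layer` and `toy_lbox_layer` are hypothetical. The two toy layers are placeholders and are not the Robin or Fantomas layers.

```c
#include <stdint.h>

#define LS_ROWS 8 /* assumed layout: 8 rows of 16 bits (128-bit state) */

/* Hypothetical callback type: a layer transforms the state in place. */
typedef void (*ls_layer_fn)(uint16_t st[LS_ROWS]);

/* Rotate a 16-bit word left by n bits, 0 < n < 16. */
static uint16_t rotl16(uint16_t x, unsigned n) {
    return (uint16_t)((x << n) | (x >> (16 - n)));
}

/* Toy bitsliced nonlinear layer (placeholder, NOT a real S-box). */
void toy_sbox_layer(uint16_t st[LS_ROWS]) {
    st[0] ^= st[1] & st[2];
}

/* Toy linear layer applied to each 16-bit row (placeholder, NOT a real L-box). */
void toy_lbox_layer(uint16_t st[LS_ROWS]) {
    for (int i = 0; i < LS_ROWS; i++)
        st[i] ^= rotl16(st[i], 1) ^ rotl16(st[i], 2);
}

/* Algorithm 1: key whitening, then nr rounds of S-box layer, L-box layer
 * and key plus round-constant addition. On entry c holds the block B,
 * on return it holds the ciphertext. rc holds nr * LS_ROWS constant words. */
void ls_encrypt(uint16_t c[LS_ROWS], const uint16_t k[LS_ROWS],
                const uint16_t *rc, int nr,
                ls_layer_fn sbox_layer, ls_layer_fn lbox_layer) {
    for (int i = 0; i < LS_ROWS; i++)
        c[i] ^= k[i];                          /* C <- B xor K */
    for (int r = 0; r < nr; r++) {
        sbox_layer(c);                         /* S-box layer */
        lbox_layer(c);                         /* L-box layer */
        for (int i = 0; i < LS_ROWS; i++)      /* C <- C xor K xor C(r) */
            c[i] ^= k[i] ^ rc[r * LS_ROWS + i];
    }
}
```

Passing the layers as function pointers keeps the skeleton independent of any concrete cipher. A real implementation would more likely inline both layers.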
Algorithm
The LS-Design paper introduced an involutive instance (Robin), and a
non-involutive cipher (Fantomas).
Fantomas
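• 128-bit key length and block size.
• No key scheduling.
• 8-bit (3/5-bit 3-round) S-boxes from MISTY.
• L-box from vector-matrix product in $\mathbb{F}_2$.
[Figure: Fantomas state of 16 × 8 bits with the S-box and L-box layers; the L-box is depicted as a product (X) followed by a parity count.]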
Implementation in 32/64 bits
The internal state can be represented with a union to respect strict
aliasing rules for 16/32/64-bit operations:
    typedef union {
        uint32_t u32;     // uint64_t u64;
        uint16_t u16[2];  // uint16_t u16[4];
    } U32_t;
Bitsliced S-boxes operate over 16-bit chunks in the u16 portion.
Key addition works using the u32/u64 internal state:
    for (j = 0; j < 4; j++)        // for (j = 0; j < 2; j++)
        st[j].u32 ^= key_32[j];    // st[j].u64 ^= key_64[j];
The L-box can be evaluated using two precomputed tables:
    /* Unprotected L-box version */
    st[j].u16[0] = LBoxH[st[j].u16[0] >> 8] ^
                   LBoxL[st[j].u16[0] & 0xff];
    st[j].u16[1] = LBoxH[st[j].u16[1] >> 8] ^
                   LBoxL[st[j].u16[1] & 0xff];
Problem: Beware of cache-timing attacks!
An attacker who monitors which L-box table positions are in cache can
recover the internal state. The internal state trivially reveals keys and
plaintext if recovered right after the first or right before the last key
addition.
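The two-table lookup works because the L-box is linear over F2: L(x) is the XOR of L(high byte shifted left by 8) and L(low byte). The sketch below shows one way such tables could be precomputed. The matrix rows returned by `demo_row` are arbitrary placeholder values and not the Fantomas matrix, and `lbox_direct`, `lbox_init_tables` and `lbox_lookup` are hypothetical names.

```c
#include <stdint.h>

static uint16_t LBoxH[256], LBoxL[256];

/* Row i of a hypothetical 16x16 binary matrix (placeholder values). */
static uint16_t demo_row(int i) {
    return (uint16_t)(0x9E37u * (unsigned)(i + 1) + 1u);
}

/* Vector-matrix product over F2: XOR of the rows selected by the bits of x. */
uint16_t lbox_direct(uint16_t x) {
    uint16_t y = 0;
    for (int i = 0; i < 16; i++)
        if ((x >> i) & 1)
            y ^= demo_row(i);
    return y;
}

/* Fill one table per input byte, using the linearity of the product. */
void lbox_init_tables(void) {
    for (unsigned b = 0; b < 256; b++) {
        LBoxL[b] = lbox_direct((uint16_t)b);
        LBoxH[b] = lbox_direct((uint16_t)(b << 8));
    }
}

/* Unprotected lookup: the table indices depend on the (secret) state. */
uint16_t lbox_lookup(uint16_t x) {
    return LBoxH[x >> 8] ^ LBoxL[x & 0xff];
}
```

Both the lookup and the branch in `lbox_direct` depend on secret data, so this version is only suitable for generating tables or for testing, not as a protected implementation.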
Implementation in 32/64 bits
Solution: We can replace memory accesses with online computation:
    /* Parity of (x & y) within each 16-bit lane, placed at bit s of the lane. */
    static inline type_t LBox(type_t x, type_t y, uint8_t s) {
        x &= y;          /* select the input bits given by mask y */
        x ^= x >> 8;     /* fold each lane down to its lowest bit */
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;
        return (x & 0x00010001) << s;
        // return (x & 0x0001000100010001) << s;   (64-bit variant)
    }
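A full L-box can then be assembled from 16 calls to the helper above, one per output bit, without any table indexed by secret data. The sketch below assumes a 32-bit word holding two 16-bit lanes. The `demo_mask` values are arbitrary placeholders and not the Fantomas matrix, and `lbox_bit`, `lbox_ct` and `lbox_ref` are hypothetical names.

```c
#include <stdint.h>

/* Parity of (x & y) in each 16-bit lane, moved to bit s of that lane. */
static inline uint32_t lbox_bit(uint32_t x, uint32_t y, uint8_t s) {
    x &= y;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (x & 0x00010001u) << s;
}

/* demo_mask[s] selects the input bits feeding output bit s (placeholders). */
static const uint16_t demo_mask[16] = {
    0x8001, 0x4003, 0x2007, 0x100F, 0x081F, 0x043F, 0x027F, 0x01FF,
    0xA5A5, 0x5A5A, 0x3C3C, 0xC3C3, 0x0FF0, 0xF00F, 0x1248, 0x8421
};

/* L-box over both lanes of x; masks are indexed by the public counter s. */
uint32_t lbox_ct(uint32_t x) {
    uint32_t r = 0;
    for (uint8_t s = 0; s < 16; s++) {
        uint32_t y = demo_mask[s] * 0x00010001u; /* same mask in both lanes */
        r ^= lbox_bit(x, y, s);
    }
    return r;
}

/* Bit-by-bit reference for a single 16-bit lane, for comparison only. */
uint16_t lbox_ref(uint16_t x) {
    uint16_t r = 0;
    for (int s = 0; s < 16; s++) {
        uint16_t t = x & demo_mask[s];
        unsigned p = 0;
        for (int i = 0; i < 16; i++)
            p ^= (t >> i) & 1u;
        r |= (uint16_t)(p << s);
    }
    return r;
}
```

The price of removing the tables is visible here: 16 mask-and-fold steps replace two lookups, which matches the slowdown of constant-time variants reported later in the slides.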
NEON implementation
L-boxes can be evaluated using shuffling instructions to compute 8
table lookups in parallel.
[Figure: L-box held in registers.]
Important: 32-bit implementations can process 2 blocks and vector
implementations can process 16 blocks simultaneously in CTR mode.
Counter transformation for the vectorized CTR implementation:
[Figure: (a) initial state of the counter, with blocks b0 to b15 as rows and positions 0 to 15 as columns; (b) final state of the counter, transposed so that positions 0 to 15 are rows and blocks b0 to b15 are columns.]
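One reading of the counter transformation is a 16 × 16 transposition, so that position i of all 16 counter blocks ends up contiguous and can be loaded into one vector register. The portable sketch below assumes byte granularity, which the slides do not state explicitly, and the name `transpose16x16` is hypothetical. A NEON version would use vector transpose and zip instructions instead of scalar loops.

```c
#include <stdint.h>

/* in[16 * b + i] is byte i of counter block b.
 * out[16 * i + b] gathers byte i of every block into one 16-byte row. */
void transpose16x16(uint8_t out[256], const uint8_t in[256]) {
    for (int b = 0; b < 16; b++)
        for (int i = 0; i < 16; i++)
            out[16 * i + b] = in[16 * b + i];
}
```

Applying the function twice restores the original layout, so the same routine can map results back to the usual block order.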
The key must be transformed to follow the same representation.
[Figure: arrangement of the 16 key bytes (0 to 15) in the transformed representation.]
Experiments I
Benchmark: Encrypt+decrypt 128 bytes in CBC or encrypt 128 bits in
CTR mode.
• Related work: FELICS (triathlon of block ciphers) [DCK+15].
• Platforms:
1. Cortex-M3 (Arduino Due, 32 bits):
   • GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns
     -mcpu=cortex-m3 -mthumb.
   • Cycle counts obtained by converting the output of the micros() function.
2. Cortex-M4 (Teensy 3, 32 bits):
   • GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns
     -mcpu=cortex-m3 -mthumb.
   • Cycle counts obtained through the CCNT register.
3. Cortex-A53 (ODROID OC2, 64 bits):
   • GCC 6.1.1 with flags -O3 -fno-schedule-insns -mcpu=cortex-a53
     -mthumb -march=native.
   • Cycle counts obtained through the CCNT register.
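Converting a micros() reading into cycles is a multiplication by the clock frequency in MHz. The helper below is a minimal sketch with hypothetical names. The 84 MHz figure used in the usage note is the nominal clock of the Arduino Due and is an assumption here, since the slides do not state the frequency used.

```c
#include <stdint.h>

/* Elapsed microseconds times the core clock in MHz gives clock cycles. */
uint64_t micros_to_cycles(uint32_t elapsed_us, uint32_t f_mhz) {
    return (uint64_t)elapsed_us * f_mhz;
}

/* Cycles per byte (CPB) for a message of nbytes bytes. */
double cycles_per_byte(uint64_t cycles, uint32_t nbytes) {
    return (double)cycles / (double)nbytes;
}
```

For example, 100 microseconds at 84 MHz correspond to 8400 cycles, or 65.625 cycles per byte for a 128-byte message. Timer resolution limits the precision, so short operations are usually repeated many times and averaged.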
Results
[Figure: Fantomas in CBC mode on the Arduino Due (Cortex-M3). Two bar charts, cycle count and code size (ROM), for the 32-bit and 32-bit CT implementations, comparing Ours, FELICS Fast and FELICS Compact.]
[Figure: Fantomas in CTR mode on the Arduino Due (Cortex-M3). Two bar charts, cycle count and code size (ROM), for the 32-bit and 32-bit CT implementations, comparing Ours, FELICS Fast and FELICS Compact.]
[Figure: Fantomas in CBC mode on Cortex-M3/M4/A53. Two bar charts, cycles per byte (CPB) and code size (ROM), for the 32-bit CBC, 32-bit CT CBC, 64-bit CBC and 64-bit CT CBC implementations, with series Cortex-M3 (Ours), Cortex-M4 (Ours) and Cortex-A53 (Ours).]
Experiments II
Benchmark: Encrypt 128 bits in CTR mode.
• Related work: Adjusted timings from the SCREAMv3 presentation in
  the CAESAR competition [GLS+15].
• Platforms:
1. Cortex-A15 (ODROID XU4, 32 bits + NEON):
   • GCC 6.1.1 with flags -O3 -fno-schedule-insns -mcpu=cortex-a15
     -mthumb -march=native.
   • Cycle counts obtained through the CCNT register.
2. Cortex-A53 (ODROID OC2, 64 bits + NEON):
   • GCC 6.1.1 with flags -O3 -fno-schedule-insns -mcpu=cortex-a53
     -mthumb -march=native.
   • Cycle counts obtained through the CCNT register.
Results
[Figure: Fantomas in CTR mode, NEON implementation. Two bar charts, cycles per byte (CPB) and code size (ROM), on Cortex-A15 and Cortex-A53, with series Fantomas (Ours), 16-block version (Ours) and 16-block version (RW).]
Side-channel resistance
1. Constant-time implementation against cache-timing attacks:
   • Performance penalty of 3 times on low-end ARMs.
   • Constant-time behavior is inherent in vector implementations.
   • Not sufficient against other side-channel attacks.
2. Masked implementation against power attacks:
   • Significant quadratic performance penalty (almost twice as slow with
     a single mask).
   • Not sufficient against cache-timing attacks.
   • Key masking to force the attacker to recover all shares (additional
     10-20% overhead).
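The difference in masking cost between linear and nonlinear operations can be seen in a first-order Boolean masking sketch with two shares. This is a generic textbook construction (an ISW-style masked AND), not the authors' implementation, and all names are hypothetical. The mask `r` must be fresh and uniformly random in real use. Fixed values appear only in testing.

```c
#include <stdint.h>

/* A value x is split into two shares with x = s[0] ^ s[1]. */
typedef struct { uint16_t s[2]; } masked16;

masked16 mask_value(uint16_t x, uint16_t r) {
    masked16 m = {{ (uint16_t)(x ^ r), r }};
    return m;
}

uint16_t unmask(masked16 m) {
    return (uint16_t)(m.s[0] ^ m.s[1]);
}

/* Toy linear map (placeholder, NOT the Fantomas L-box). */
uint16_t toy_linear(uint16_t x) {
    return (uint16_t)(x ^ (uint16_t)((x << 1) | (x >> 15)));
}

/* Linear operations are applied share by share, with no extra randomness. */
masked16 masked_linear(masked16 a) {
    masked16 z = {{ toy_linear(a.s[0]), toy_linear(a.s[1]) }};
    return z;
}

/* Masked AND: four partial products plus one fresh random mask r. */
masked16 masked_and(masked16 a, masked16 b, uint16_t r) {
    masked16 z;
    z.s[0] = (uint16_t)((a.s[0] & b.s[0]) ^ r);
    z.s[1] = (uint16_t)((a.s[1] & b.s[1]) ^
                        ((r ^ (a.s[0] & b.s[1])) ^ (a.s[1] & b.s[0])));
    return z;
}
```

The linear map costs one evaluation per share, while each AND needs every pair of shares, which is where the quadratic growth with the number of shares comes from. A compiler may reorder these operations, so a real masked implementation needs care beyond what this sketch shows.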
Conclusions
Fantomas has some limitations regarding side-channel resistance:
• S-boxes do not require tables, but are expensive to mask.
• L-boxes are free to mask, but expensive to compute in constant
time.
New state-of-the-art implementations of Fantomas:
• Portable implementation in C is 35% and 52% faster
than [DCK+15] on Cortex-M, and similar in code size.
• New countermeasures against cache timing attacks.
• NEON implementation is 40% faster on ARM than [GLS+15].
Algorithm
Proposed in 2007 and standardized by ISO/IEC, one of the first
lightweight block cipher designs.
PRESENT
• Substitution-permutation network.
• 80-bit or 128-bit key and 64-bit block.
• Key schedule for 31 rounds with 64-bit subkeys subkey_i.
• 4-bit S-boxes with Boolean representation friendly to bitslicing.
• Bit permutation P such that $P^2 = P^{-1}$.
Figure 2: 4-bit S-Boxes in PRESENT.
$$P(i) = \begin{cases} 16\,i \bmod 63 & \text{if } i \neq 63 \\ 63 & \text{if } i = 63 \end{cases}$$
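The permutation formula translates directly into code. The sketch below moves bit i of the state to position P(i), one bit at a time, which is slow but easy to check. The names `present_p` and `present_player` are hypothetical. Since 16^3 = 4096 leaves remainder 1 modulo 63, applying P three times gives the identity, which is the property that P applied twice equals the inverse of P.

```c
#include <stdint.h>

/* Destination position of bit i under the PRESENT bit permutation. */
unsigned present_p(unsigned i) {
    return i == 63 ? 63 : (16 * i) % 63;
}

/* Apply P to a 64-bit state bit by bit (reference version, not optimized). */
uint64_t present_player(uint64_t x) {
    uint64_t y = 0;
    for (unsigned i = 0; i < 64; i++)
        y |= ((x >> i) & 1) << present_p(i);
    return y;
}
```

For instance, bit 1 moves to position 16 and bit 4 moves to position 1.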
Algorithm 3: PRESENT encrypting block B to ciphertext block C.
C ← B
for i = 1 to 31 do
    C ← C ⊕ subkey_i
    C ← S(C)
    C ← P(C)
end for
C ← C ⊕ subkey_32
return C
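Algorithm 3 can be sketched in C with the 32 subkeys supplied by the caller, since the key schedule is not detailed in these slides. This is a table-based reference sketch, so it is not constant time and is not the optimized implementation discussed here. The S-box values are those of the published PRESENT specification, `subkey[0]` to `subkey[31]` stand for subkey_1 to subkey_32, and the function names are hypothetical.

```c
#include <stdint.h>

static const uint8_t PRESENT_SBOX[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};

/* Apply a 4-bit S-box table to each of the 16 nibbles of the state. */
uint64_t sbox_layer(uint64_t x, const uint8_t sb[16]) {
    uint64_t y = 0;
    for (int n = 0; n < 16; n++)
        y |= (uint64_t)sb[(x >> (4 * n)) & 0xF] << (4 * n);
    return y;
}

/* Bit permutation: bit i moves to 16*i mod 63, and bit 63 stays in place. */
uint64_t p_layer(uint64_t x) {
    uint64_t y = 0;
    for (unsigned i = 0; i < 64; i++)
        y |= ((x >> i) & 1) << (i == 63 ? 63 : (16 * i) % 63);
    return y;
}

uint64_t present_encrypt(uint64_t b, const uint64_t subkey[32]) {
    uint64_t c = b;
    for (int i = 0; i < 31; i++) {
        c ^= subkey[i];
        c = sbox_layer(c, PRESENT_SBOX);
        c = p_layer(c);
    }
    return c ^ subkey[31];
}

/* Inverse of present_encrypt, using P applied twice as the inverse of P. */
uint64_t present_decrypt(uint64_t c, const uint64_t subkey[32]) {
    uint8_t inv[16];
    for (int v = 0; v < 16; v++)
        inv[PRESENT_SBOX[v]] = (uint8_t)v;
    c ^= subkey[31];
    for (int i = 30; i >= 0; i--) {
        c = p_layer(p_layer(c));
        c = sbox_layer(c, inv);
        c ^= subkey[i];
    }
    return c;
}
```

Decryption is included only to allow a round-trip check. The table lookups in `sbox_layer` are exactly what a bitsliced Boolean representation of the S-box avoids.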
Implementation
Figure 3: Permutation P in PRESENT.
Figure 4: Permutations P0 and P1 for optimized PRESENT.
Algorithm 4: PRESENT encrypting block B to ciphertext block C.
C ← B
for i = 1 to 15 do
    C ← C ⊕ subkey_{2i−1}
    C ← P0(C)
    C ← S(C)
    C ← P1(C)
    C ← C ⊕ P(subkey_{2i})
    C ← S(C)
end for
C ← C ⊕ subkey_31
C ← P(C)
C ← S(C)
C ← C ⊕ subkey_32
return C
Experiments I
Benchmark: Encrypt+decrypt+key schedule 128 bytes in CBC or encrypt
128 bits in CTR mode.
• Related work: ASM implementation in FELICS [DCK+15],
  2nd-order constant-time masked ASM implementation of
  PRESENT [dGPdLP+16].
• Platforms:
1. Cortex-M3 (Arduino Due, 32 bits):
   • GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns
     -mcpu=cortex-m3 -mthumb.
   • Cycle counts obtained by converting the output of the micros() function.
2. Cortex-M4 (Teensy 3.2, 32 bits):
   • GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns
     -mcpu=cortex-m3 -mthumb.
   • Cycle counts obtained through the CCNT register.
Results
[Figure: PRESENT, E+D+KS of 128 bytes (CBC) or encryption of 128 bits (CTR) on ARM Cortex-M3. Two bar charts, cycle count and code size (ROM), for 32-bit CBC and 32-bit CTR, comparing Ours and FELICS.]
[Figure: PRESENT, E+D+KS of 128 bytes (CBC) or encryption of 128 bits (CTR) on ARM Cortex-M4. Two bar charts, cycle count and code size (ROM), for 32-bit CBC and 32-bit CTR, comparing Constant time (Ours) and Masked (RW).]
Conclusions
Side-channel resistance:
• PRESENT can be efficiently implemented in constant time.
• Performance penalty from masking is lower than for Fantomas, mainly
  due to the choice of S-boxes.
New state-of-the-art implementations of PRESENT:
• S-boxes can be bitsliced (no tables) and permutations can be made
  much faster.
• Performance improvement by a factor of 8.
• Our constant-time CTR implementation is now among the fastest
  block ciphers in the FELICS benchmark (competitive with SPARX).
Detailed timings
Table 1: Comparison of block ciphers implemented in C by this work with AES
in Assembly for encrypting 128 bits in CTR mode across long messages.

                  Cortex-M3            Cortex-M4
Block cipher      Unprotected  CT      Unprotected  CT      ROM
Fantomas          2291         9063    2191         7866    1272
PRESENT-80        -            2052    -            1597    1124
AES-128 [SS16]    546          1617    554          1618    12120
Field arithmetic in $\mathbb{F}_{2^{255}-19}$
Difficult choice of multiplication instructions on Cortex-M3 [dG15]:
• MUL: effectively 16 × 16 → 32, 1 cycle.
• MLA (acc): effectively 16 × 16 → 32, 2 cycles.
• UMULL: 32 × 32 → 64, 3-5 cycles.
• UMLAL: 32 × 32 → 64, 4-7 cycles.
A side-channel attack using early-terminating multiplications is known for
ECDH [GOPT09], although it is not clear whether it applies to laddering.
Countermeasures replace UMULL with instructions costing 12-19
cycles [Ham11].
Important: At this penalty, the Cortex-M0 implementation [DHH+15] should
still be competitive.
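To see why a fixed-latency replacement for UMULL costs several instructions, a 32 × 32 → 64 product can be built from four 16 × 16 → 32 products, the widest full product that a single MUL returns. The sketch below illustrates the idea in portable C with hypothetical names. It is not the countermeasure from [Ham11], and a compiler is free to turn it back into a long multiply, so real code would be written in assembly.

```c
#include <stdint.h>

/* 16 x 16 -> 32 product, the widest full product a 32-bit MUL returns. */
static inline uint32_t mul16(uint32_t a, uint32_t b) {
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* 32 x 32 -> 64 schoolbook multiplication from four 16 x 16 products. */
uint64_t mul32x32_64(uint32_t a, uint32_t b) {
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;
    uint64_t ll = mul16(al, bl);
    uint64_t lh = mul16(al, bh);
    uint64_t hl = mul16(ah, bl);
    uint64_t hh = mul16(ah, bh);
    return ll + ((lh + hl) << 16) + (hh << 32);
}
```

Four multiplications plus the shifts and carry-propagating additions give a rough sense of where a double-digit cycle count comes from.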
Previous work in constant time uses Karatsuba over reduced radix [dG15].
Alternative implementation on Cortex-M4:
• Full radix to enjoy arithmetic density and single-cycle multiplications.
• Comba with register allocation inspired by operand caching [HW11].
• Arithmetic closely follows ideas from the full-radix Cortex-M0
  implementation.
• Check next presentation. :)
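Comba (product-scanning) multiplication in full radix can be sketched in portable C as follows. It computes the 512-bit product of two 256-bit operands stored as eight 32-bit limbs, column by column, with a 96-bit accumulator. The name `mul256_comba` is hypothetical, and the sketch includes neither the operand-caching register allocation nor the reduction modulo 2^255 − 19 of the actual assembly implementation.

```c
#include <stdint.h>

#define NLIMBS 8 /* 8 x 32 bits = 256 bits, full radix */

/* r = a * b with little-endian limbs, computed one output column at a time. */
void mul256_comba(uint32_t r[2 * NLIMBS],
                  const uint32_t a[NLIMBS], const uint32_t b[NLIMBS]) {
    uint64_t acc = 0; /* low 64 bits of the column accumulator */
    uint32_t top = 0; /* number of times acc overflowed (bits 64 and up) */
    for (int k = 0; k < 2 * NLIMBS - 1; k++) {
        int lo = k < NLIMBS ? 0 : k - NLIMBS + 1;
        int hi = k < NLIMBS ? k : NLIMBS - 1;
        for (int i = lo; i <= hi; i++) {
            uint64_t p = (uint64_t)a[i] * b[k - i];
            acc += p;
            top += (acc < p); /* carry out of the 64-bit addition */
        }
        r[k] = (uint32_t)acc;                       /* emit one limb */
        acc = (acc >> 32) | ((uint64_t)top << 32);  /* shift accumulator */
        top = 0;
    }
    r[2 * NLIMBS - 1] = (uint32_t)acc;
}
```

Each output limb is written exactly once, which is what makes the method attractive when multiply-accumulate instructions are cheap and registers are scarce.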
Result
Table 2: Experimental results for different implementations of randomized
X25519 and Ed25519 on ARM processors. The figures include timings for the
field arithmetic and protocol operations. Measurements for latency in clock
cycles (cc) were taken as the average of 1000 executions by benchmarking code
directly on the M4 board.

Operation               Ours              Next presentation :)
Addition                85 cc             106 cc
Subtraction             85 cc             108 cc
Multiplication          532 cc            546 cc
Squaring                532 cc            362 cc
Inversion               140,306 cc        96,337 cc
X25519                  1,607,860 cc      1,658,083 cc
Code size of X25519     3,102 B of ROM    2,952 B of ROM
Signature               1,122,709 cc      -
Verification            2,747,329 cc      -
Code size for Ed25519   32,210 B of ROM   -
Questions?
References
[BKL+07] A. Bogdanov, L. R. Knudsen, G. Leander, C. Paar, A. Poschmann, M. J. B. Robshaw, Y. Seurin, and C. Vikkelsoe. PRESENT: an ultra-lightweight block cipher. In CHES, volume 4727 of Lecture Notes in Computer Science, pages 450–466. Springer, 2007.
[CHM11] N. Courtois, D. Hulme, and T. Mourouzis. Solving circuit optimisation problems in cryptography and cryptanalysis. IACR Cryptology ePrint Archive, 2011:475, 2011.
[DCK+15] D. Dinu, Y. L. Corre, D. Khovratovich, L. Perrin, J. Großschädl, and A. Biryukov. Triathlon of lightweight block ciphers for the internet of things. IACR Cryptology ePrint Archive, 2015:209, 2015.
[dG15] W. de Groot. A performance study of X25519 on Cortex M3 and M4, 2015.
[dGPdLP+16] W. de Groot, K. Papagiannopoulos, A. de La Piedra, E. Schneider, and L. Batina. Bitsliced masking and ARM: Friends or foes? Cryptology ePrint Archive, Report 2016/946, 2016. http://eprint.iacr.org/2016/946.
[DHH+15] M. Düll, B. Haase, G. Hinterwälder, M. Hutter, C. Paar, A. H. Sánchez, and P. Schwabe. High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptography, 77(2-3):493–514, 2015.
[GLS+15] V. Grosso, G. Leurent, F. Standaert, K. Varici, F. Durvaux, L. Gaspar, and S. Kerckhof. CAESAR candidate SCREAM: Side-Channel Resistant Authenticated Encryption with Masking. http://2014.diac.cr.yp.to/slides/leurent-scream.pdf, 2015.
[GLSV14] V. Grosso, G. Leurent, F. Standaert, and K. Varici. LS-Designs: Bitslice Encryption for Efficient Masked Software Implementations. In FSE, volume 8540 of Lecture Notes in Computer Science, pages 18–37. Springer, 2014.
[GOPT09] J. Großschädl, E. Oswald, D. Page, and M. Tunstall. Side-channel analysis of cryptographic software via early-terminating multiplications. In ICISC, volume 5984 of Lecture Notes in Computer Science, pages 176–192. Springer, 2009.
[Ham11] F. B. Hamouda. Exploration of efficiency and side-channel security of different implementations of RSA. 2011.
[HW11] M. Hutter and E. Wenger. Fast multi-precision multiplication for public-key cryptography on embedded microprocessors. In CHES, volume 6917 of Lecture Notes in Computer Science, pages 459–474. Springer, 2011.