All content in this area was uploaded by Diego F. Aranha on Jan 02, 2017

Lightweight Cryptography on ARM

Software implementation of block ciphers and ECC

Rafael Cruz, Tiago Reis, Diego F. Aranha, Julio López, Harsh Kupwade Patil

University of Campinas, LG Electronics Inc.

Context

Cryptography can mitigate critical security issues in embedded devices.

Security property            Technique                 Primitive
Protecting data at rest      FS-level encryption       Block cipher
Protecting data in transit   Secure channel            Auth block/stream cipher
Secure software updates      Code signing              Digital signatures
Secure booting               Integrity/Authentication  Hash functions, MACs
Secure debugging             Entity authentication     Challenge-response
Device id/auth               Auth protocol             PKC
Key distribution             Key exchange              PKC

Several algorithms are required to implement these primitives:

• Block and stream ciphers

• Hash functions

• AEAD and Message Authentication Codes (MACs)

• Elliptic Curve Cryptography

Problem: Why “lightweight cryptography”? Shouldn’t all cryptography

be ideally lightweight?

From Mouha in [Mou15]

“Although the question seems simple, this appears to be a quite

controversial subject. (...) It is important to note that lightweight

cryptography should not be equated with weak cryptography”.

Solution: Alternative name for application-specific cryptography or application-driven cryptographic design?

Summary

We discuss techniques for efficient and secure implementations of lightweight encryption in software:

1. Fantomas, an LS-Design proposed in [GLSV14].

2. PRESENT, a Substitution-Permutation Network (SPN) [BKL+07].

3. Curve25519 for Elliptic Curve Cryptography.

We target low-end and NEON-capable ARM processors, typical of

embedded systems. Results are part of a project sponsored by LG

involving 7 students and more than 30 symmetric (C) and asymmetric

(ASM) algorithms.

Construction

LS-Designs

Paradigm to construct block ciphers providing:

• Lightweight designs from simple substitution and linear layers.

• Friendliness to side-channel countermeasures (bitslicing and

masking).

• Tweakable variant for authenticated encryption (SCREAMv3).

[Figure: state matrix, with its two dimensions labeled l bits and s bits.]

Construction

Algorithm 1: LS-Design encrypting block B into ciphertext C with key K.

    C ← B ⊕ K                        ▷ C represents an s × l-bit matrix
    for 0 ≤ r < Nr do
        for 0 ≤ i < l do             ▷ S-box layer
            C[i, ⋆] = S[C[i, ⋆]]
        end for
        for 0 ≤ j < s do             ▷ L-box layer
            C[⋆, j] = L[C[⋆, j]]
        end for
        C ← C ⊕ K ⊕ C(r)             ▷ Key and round constant addition
    end for
    return C
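As a rough illustration of how Algorithm 1 maps to bitsliced C, the sketch below stores the state as s = 8 rows of l = 16 bits (the Fantomas dimensions). The layers `toy_sbox_layer` and `toy_lbox`, the function `ls_encrypt`, and the per-round constant array `rc` are placeholders invented for this example; they are not the Fantomas S-box, L-box, or round constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LS_ROWS 8   /* s: one 16-bit row per S-box input bit */

/* Rotate a 16-bit row left by n positions (0 < n < 16). */
static uint16_t rotl16(uint16_t x, unsigned n) {
    return (uint16_t)((x << n) | (x >> (16 - n)));
}

/* Toy bitsliced S-box layer (NOT Fantomas): each AND/XOR acts on all
 * 16 columns at once, one bit per column. */
static void toy_sbox_layer(uint16_t c[LS_ROWS]) {
    c[0] ^= c[1] & c[2];
    c[3] ^= c[4] & c[5];
    c[6] ^= c[7] & c[0];
}

/* Toy linear L-box (NOT Fantomas): XOR of three rotations of the row. */
static uint16_t toy_lbox(uint16_t x) {
    return (uint16_t)(x ^ rotl16(x, 1) ^ rotl16(x, 2));
}

/* Algorithm 1 with the toy layers. rc[r] is a hypothetical round
 * constant that is XORed into row 0 only. Encrypts c in place. */
void ls_encrypt(uint16_t c[LS_ROWS], const uint16_t k[LS_ROWS],
                const uint16_t *rc, size_t rounds) {
    assert(rounds == 0 || rc != NULL);
    for (size_t j = 0; j < LS_ROWS; j++)
        c[j] ^= k[j];                          /* C <- B xor K */
    for (size_t r = 0; r < rounds; r++) {
        toy_sbox_layer(c);                     /* S-box layer */
        for (size_t j = 0; j < LS_ROWS; j++)
            c[j] = toy_lbox(c[j]);             /* L-box layer, row by row */
        for (size_t j = 0; j < LS_ROWS; j++)
            c[j] ^= k[j];                      /* key addition */
        c[0] ^= rc[r];                         /* round constant addition */
    }
}
```

Since there is no key schedule, the same `k` is added in every round, and the only table-free cost is in the two layers themselves.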

Algorithm

The LS-Design paper introduced an involutive instance (Robin), and a

non-involutive cipher (Fantomas).

Fantomas

• 128-bit key length and block size.

• No key scheduling.

• 8-bit (3/5-bit 3-round) S-boxes from MISTY.

• L-box from vector-matrix product in F_2.

[Figure: Fantomas state of 16 bits by 8 bits with the S-box and L-box layers; the L-box is drawn with the labels X and Count Parity.]

Implementation in 32/64 bits

The internal state can be represented with a union to respect strict aliasing rules for 16/32/64-bit operations (64-bit variant shown in the trailing comments):

    typedef union {
        uint32_t u32;        // uint64_t u64;
        uint16_t u16[2];     // uint16_t u16[4];
    } U32_t;

Bitsliced S-boxes operate over 16-bit chunks in the u16 portion.

Key addition works using the u32/u64 internal state:

    for (j = 0; j < 4; j++)          // for (j = 0; j < 2; j++)
        st[j].u32 ^= key_32[j];      //     st[j].u64 ^= key_64[j];

Implementation in 32/64 bits

The L-box can be evaluated using two precomputed tables:

    /* Unprotected L-box version */
    st[j].u16[0] = LBoxH[st[j].u16[0] >> 8] ^
                   LBoxL[st[j].u16[0] & 0xff];
    st[j].u16[1] = LBoxH[st[j].u16[1] >> 8] ^
                   LBoxL[st[j].u16[1] & 0xff];

Problem: Beware of cache-timing attacks!

An attacker who monitors which L-box table positions are in the cache can recover the internal state. The internal state trivially reveals keys and plaintext if it is recovered right after the first key addition or right before the last one.
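Two 256-entry tables are enough because the L-box is linear over F_2, so L(x) = L(x_hi << 8) ⊕ L(x_lo). The following sketch shows how such tables could be generated from a 16×16 binary matrix and checked against the direct product. The helper names `lbox_direct`, `lbox_build_tables` and `lbox_tables`, and the row-per-input-bit matrix layout `m`, are assumptions made for this example; the actual Fantomas matrix is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t LBoxH[256], LBoxL[256];

/* Reference vector-matrix product over F_2: XOR the rows m[i] selected
 * by the set bits of x. It branches on x, so use it only to build tables. */
uint16_t lbox_direct(uint16_t x, const uint16_t m[16]) {
    uint16_t y = 0;
    for (unsigned i = 0; i < 16; i++)
        if ((x >> i) & 1u)
            y ^= m[i];
    return y;
}

/* Precompute L(b << 8) and L(b) for every byte value b. */
void lbox_build_tables(const uint16_t m[16]) {
    assert(m != 0);
    for (unsigned b = 0; b < 256; b++) {
        LBoxH[b] = lbox_direct((uint16_t)(b << 8), m);
        LBoxL[b] = lbox_direct((uint16_t)b, m);
    }
}

/* Two-table evaluation as on the slide. The table indices depend on
 * secret data, which is exactly what a cache-timing attacker observes. */
uint16_t lbox_tables(uint16_t x) {
    return (uint16_t)(LBoxH[x >> 8] ^ LBoxL[x & 0xff]);
}
```

The two tables take 1 KiB in total, which is small but still spans several cache lines on typical cores.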

Implementation in 32/64 bits

Solution: We can replace memory access with online computation:

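    /* Computes output bit s of the L-box for each 16-bit word packed in x;
       y is the mask selecting the input bits that contribute to that bit. */
    static inline type_t LBox(type_t x, type_t y, uint8_t s) {
        x &= y;               /* keep only the selected input bits */
        x ^= x >> 8;          /* fold each 16-bit word ... */
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;          /* ... down to its parity in the lowest bit */
        return (x & 0x00010001) << s;
        // return (x & 0x0001000100010001) << s;    (64-bit variant)
    }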
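The LBox routine above yields one output bit per call, so a full L-box evaluation takes one call per output bit. A minimal sketch of that loop for the 32-bit case follows, assuming the matrix is supplied as 16 column masks `col[s]`. The names `lbox_bit`, `lbox_ct` and `lbox_columns` and the mask layout are assumptions for illustration, not the authors' code.

```c
#include <assert.h>
#include <stdint.h>

/* One output bit for two packed 16-bit words, following the slide's LBox:
 * mask with y, fold each half down to its parity, move it to bit s. */
static uint32_t lbox_bit(uint32_t x, uint32_t y, uint8_t s) {
    assert(s < 16);
    x &= y;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (x & 0x00010001u) << s;
}

/* Full L-box on both halves of x without any memory access indexed by
 * secret data. col[s] has bit i set when input bit i feeds output bit s. */
uint32_t lbox_ct(uint32_t x, const uint16_t col[16]) {
    uint32_t r = 0;
    for (uint8_t s = 0; s < 16; s++) {
        uint32_t y = (uint32_t)col[s] * 0x00010001u;  /* same mask in both halves */
        r |= lbox_bit(x, y, s);
    }
    return r;
}

/* Convert row form (row[i] = image of input bit i) into column masks. */
void lbox_columns(uint16_t col[16], const uint16_t row[16]) {
    for (unsigned s = 0; s < 16; s++) {
        col[s] = 0;
        for (unsigned i = 0; i < 16; i++)
            col[s] |= (uint16_t)(((row[i] >> s) & 1u) << i);
    }
}
```

The loop always executes the same 16 iterations of AND, shift and XOR regardless of the data, at the price of many more instructions than two table lookups.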

NEON implementation

L-boxes can be evaluated using shuffling instructions to compute 8 table lookups in parallel.

[Figure: L-box held in registers.]

Important: 32-bit implementations can process 2 blocks and vector implementations can process 16 blocks simultaneously in CTR mode.

Counter transformation for the vectorized CTR implementation:

[Figure: (a) Initial state of the counter: blocks b0 to b15, one per row, with byte positions 0 to 15 as columns. (b) Final state of the counter: byte positions 0 to 15 as rows, with blocks b0 to b15 as columns.]
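The counter transformation amounts to a 16×16 byte transpose: starting from one counter block per row, the final layout groups byte i of all 16 blocks together. A portable sketch is given below (the NEON version would use vector instructions instead of loops). The counter format in `ctr_fill`, a 12-byte nonce followed by a 32-bit big-endian counter, is an assumption for this example, as are the function names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBLK 16   /* blocks processed together */
#define BLEN 16   /* bytes per block */

/* Fill 16 consecutive counter blocks, one block per row (layout (a)).
 * Assumed format: 12-byte nonce, then a 32-bit big-endian counter. */
void ctr_fill(uint8_t blocks[NBLK * BLEN], const uint8_t nonce[12],
              uint32_t base) {
    for (unsigned b = 0; b < NBLK; b++) {
        uint8_t *blk = blocks + b * BLEN;
        uint32_t ctr = base + b;
        memcpy(blk, nonce, 12);
        blk[12] = (uint8_t)(ctr >> 24);
        blk[13] = (uint8_t)(ctr >> 16);
        blk[14] = (uint8_t)(ctr >> 8);
        blk[15] = (uint8_t)ctr;
    }
}

/* Transpose to layout (b): row i holds byte i of all 16 blocks. */
void ctr_transpose(uint8_t out[BLEN * NBLK], const uint8_t in[NBLK * BLEN]) {
    assert(out != in);   /* this simple version is not in-place */
    for (unsigned b = 0; b < NBLK; b++)
        for (unsigned i = 0; i < BLEN; i++)
            out[i * NBLK + b] = in[b * BLEN + i];
}
```

Under the assumed counter format, the rows holding nonce bytes are identical across the 16 blocks, so only the counter rows differ from one batch to the next.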

The key must be transformed to follow the same representation.

[Figure: transformation of key bytes 0 to 15 into the vectorized representation.]

Experiments I

Benchmark: Encrypt+decrypt 128 bytes in CBC mode or encrypt 128 bits in CTR mode.

• Related work: FELICS (triathlon of block ciphers) [DCK+15].

• Platforms:

1. Cortex-M3 (Arduino Due, 32 bits):

    • GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns -mcpu=cortex-m3 -mthumb.

    • Cycle counts obtained by converting the output of the micros() function.

2. Cortex-M4 (Teensy 3, 32 bits):

    • GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns -mcpu=cortex-m3 -mthumb.

    • Cycle counts read from the CCNT register.

3. Cortex-A53 (ODROID C2, 64 bits):

    • GCC 6.1.1 with flags -O3 -fno-schedule-insns -mcpu=cortex-a53 -mthumb -march=native.

    • Cycle counts read from the CCNT register.
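On the Cortex-M3, cycle counts are derived from micros() readings rather than from a cycle counter. A minimal sketch of that conversion is shown here, assuming the Arduino Due's 84 MHz core clock. The helper names `micros_to_cycles` and `cycles_per_byte` are made up for this example, and the authors' measurement code may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an elapsed time in microseconds into clock cycles. */
uint64_t micros_to_cycles(uint32_t elapsed_us, uint32_t clock_hz) {
    assert(clock_hz % 1000000u == 0);   /* sketch assumes a whole number of MHz */
    return (uint64_t)elapsed_us * (clock_hz / 1000000u);
}

/* Cycles per byte for a measurement that processed nbytes of data. */
double cycles_per_byte(uint64_t cycles, uint32_t nbytes) {
    assert(nbytes > 0);
    return (double)cycles / (double)nbytes;
}
```

For example, 100 µs at 84 MHz corresponds to 8,400 cycles. Because micros() has microsecond resolution, a short operation would normally be repeated many times between two readings and the total divided by the repetition count.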

Results

[Figure: Fantomas in CBC mode, Arduino Due Cortex-M3. Panels: Cycle Count and Code Size (ROM) per implementation (32-bit, 32-bit CT). Series: Ours, FELICS Fast, FELICS Compact.]

[Figure: Fantomas in CTR mode, Arduino Due Cortex-M3. Panels: Cycle Count and Code Size (ROM) per implementation (32-bit, 32-bit CT). Series: Ours, FELICS Fast, FELICS Compact.]

[Figure: Fantomas in CBC mode, Cortex-M3/M4/A53. Panels: Cycles Per Byte (CPB) and Code Size (ROM) per implementation (32-bit CBC, 32-bit CT CBC, 64-bit CBC, 64-bit CT CBC). Series: Cortex-M3 (Ours), Cortex-M4 (Ours), Cortex-A53 (Ours).]

Experiments II

Benchmark: Encrypt 128 bits in CTR mode.

• Related work: Adjusted timings from the SCREAMv3 presentation in the CAESAR competition [GLS+15].

• Platforms:

1. Cortex-A15 (ODROID XU4, 32 bits + NEON):

    • GCC 6.1.1 with flags -O3 -fno-schedule-insns -mcpu=cortex-a15 -mthumb -march=native.

    • Cycle counts read from the CCNT register.

2. Cortex-A53 (ODROID C2, 64 bits + NEON):

    • GCC 6.1.1 with flags -O3 -fno-schedule-insns -mcpu=cortex-a53 -mthumb -march=native.

    • Cycle counts read from the CCNT register.

Results

[Figure: Fantomas in CTR mode, NEON implementation. Panels: Cycles Per Byte (CPB) and Code Size (ROM) per platform (Cortex-A15, Cortex-A53). Series: Fantomas (Ours), 16-block version (Ours), 16-block version (RW).]

Side-channel resistance

1. Constant-time implementation against cache-timing attacks:

    • Performance penalty of 3 times on low-end ARM processors.

    • Inherent in vector implementations.

    • Not sufficient against other side-channel attacks.

2. Masked implementation against power attacks:

    • Significant quadratic performance penalty (almost twice as slow with a single mask).

    • Not sufficient against cache-timing attacks.

    • Key masking to force the attacker to recover all shares (additional 10-20% overhead).
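To make the masking cost concrete, here is a first-order Boolean masking sketch on 16-bit rows: linear operations are applied to each share separately, while an AND needs all cross products plus fresh randomness, which is where the quadratic growth in the number of shares comes from. The type `masked16`, the function names, and the use of rand() are assumptions for illustration only; rand() is not a suitable randomness source for real masking, and production masked code is usually written with much tighter control over what the compiler emits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint16_t s0, s1; } masked16;   /* represents s0 ^ s1 */

/* Placeholder randomness: rand() is NOT suitable for real masking. */
static uint16_t fresh_mask(void) {
    return (uint16_t)(((unsigned)rand() << 8) ^ (unsigned)rand());
}

masked16 mask16(uint16_t x) {
    masked16 m;
    m.s0 = fresh_mask();
    m.s1 = (uint16_t)(x ^ m.s0);
    return m;
}

uint16_t unmask16(masked16 m) {
    return (uint16_t)(m.s0 ^ m.s1);
}

/* Linear operations act on each share independently, with no randomness. */
masked16 masked_xor(masked16 a, masked16 b) {
    masked16 c = { (uint16_t)(a.s0 ^ b.s0), (uint16_t)(a.s1 ^ b.s1) };
    return c;
}

masked16 masked_rotl(masked16 a, unsigned n) {
    assert(n >= 1 && n <= 15);
    masked16 c = { (uint16_t)((a.s0 << n) | (a.s0 >> (16 - n))),
                   (uint16_t)((a.s1 << n) | (a.s1 >> (16 - n))) };
    return c;
}

/* Non-linear AND (ISW-style, two shares): four partial products and one
 * fresh random word r that hides the cross terms. */
masked16 masked_and(masked16 a, masked16 b) {
    uint16_t r = fresh_mask();
    masked16 c;
    c.s0 = (uint16_t)((a.s0 & b.s0) ^ r);
    c.s1 = (uint16_t)((a.s1 & b.s1) ^ ((r ^ (a.s0 & b.s1)) ^ (a.s1 & b.s0)));
    return c;
}
```

In this picture a linear layer such as the L-box only needs the share-wise pattern of `masked_xor` and `masked_rotl`, whereas every AND gate in a bitsliced S-box pays the `masked_and` cost.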

Conclusions

Fantomas has some limitations regarding side-channel resistance:

• S-boxes do not require tables, but are expensive to mask.

• L-boxes are free to mask, but expensive to compute in constant

time.

New state-of-the-art implementations of Fantomas:

• Portable implementation in C is 35% and 52% faster than [DCK+15] on Cortex-M, and similar in code size.

• New countermeasures against cache-timing attacks.

• NEON implementation is 40% faster on ARM than [GLS+15].

Algorithm

Proposed in 2007 and standardized by ISO/IEC, one of the first lightweight block cipher designs.

PRESENT

• Substitution-permutation network.

• 80-bit or 128-bit key and 64-bit block.

• Key schedule for 31 rounds with 64-bit subkeys subkey_i.

• 4-bit S-boxes with Boolean representation friendly to bitslicing.

• Bit permutation P such that P^2 = P^{-1}.

Algorithm

Figure 2: 4-bit S-Boxes in PRESENT.

$$P(i) = \begin{cases} 16\,i \bmod 63 & \text{if } i \neq 63 \\ 63 & \text{if } i = 63 \end{cases}$$
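The property P^2 = P^{-1} follows from 16^3 = 4096 ≡ 1 (mod 63), so applying P three times returns every bit to its starting position. A small bit-by-bit sketch that makes this easy to check is given below; it is written for clarity rather than speed, and the function names are chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Destination of state bit i: P(i) = 16*i mod 63 for i != 63, P(63) = 63. */
static unsigned present_p(unsigned i) {
    assert(i < 64);
    return i == 63 ? 63u : (16u * i) % 63u;
}

/* Apply P to a 64-bit state one bit at a time (clear, not fast). */
uint64_t present_player(uint64_t x) {
    uint64_t y = 0;
    for (unsigned i = 0; i < 64; i++)
        y |= ((x >> i) & 1u) << present_p(i);
    return y;
}
```

For instance, bit 1 moves to position 16, then to position 4, then back to position 1, so `present_player` applied twice acts as the inverse permutation.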

Algorithm

Algorithm 3: PRESENT encrypting block B to ciphertext block C.

    C ← B
    for i = 1 to 31 do
        C ← C ⊕ subkey_i
        C ← S(C)
        C ← P(C)
    end for
    C ← C ⊕ subkey_32
    return C
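For reference, the straightforward form of the algorithm above can be written as a compact, unoptimized PRESENT-80 routine with the key schedule computed on the fly. This is a plain table-based sketch, not the bitsliced constant-time implementation discussed in these slides: the S-box lookups are indexed by secret data. The function names and the split of the 80-bit key into `key_hi` and `key_lo` are choices made for this example.

```c
#include <assert.h>
#include <stdint.h>

static const uint8_t SBOX[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};

/* S: apply the 4-bit S-box to each of the 16 nibbles (table lookup). */
static uint64_t sbox_layer(uint64_t x) {
    uint64_t y = 0;
    for (unsigned n = 0; n < 16; n++)
        y |= (uint64_t)SBOX[(x >> (4 * n)) & 0xF] << (4 * n);
    return y;
}

/* P: move bit i to position 16*i mod 63 (bit 63 stays in place). */
static uint64_t p_layer(uint64_t x) {
    uint64_t y = x & (1ull << 63);
    for (unsigned i = 0; i < 63; i++)
        y |= ((x >> i) & 1u) << ((16u * i) % 63u);
    return y;
}

/* PRESENT-80: the key is key_hi (bits 79..16) and key_lo (bits 15..0). */
uint64_t present80_encrypt(uint64_t block, uint64_t key_hi, uint16_t key_lo) {
    for (unsigned i = 1; i <= 31; i++) {
        block ^= key_hi;                        /* subkey_i = top 64 key bits */
        block = p_layer(sbox_layer(block));
        /* Key update: rotate left by 61, S-box on the top nibble,
         * XOR the round counter i into key bits 19..15. */
        uint64_t t = key_hi;
        key_hi = (t << 61) | ((uint64_t)key_lo << 45) | (t >> 19);
        key_lo = (uint16_t)(t >> 3);
        key_hi = (key_hi & 0x0FFFFFFFFFFFFFFFull)
               | ((uint64_t)SBOX[key_hi >> 60] << 60);
        key_hi ^= i >> 1;
        key_lo ^= (uint16_t)((i & 1u) << 15);
    }
    return block ^ key_hi;                      /* final addition of subkey_32 */
}

/* Self-test against the all-zero test vector published with PRESENT. */
void present80_selftest(void) {
    assert(present80_encrypt(0, 0, 0) == 0x5579C1387B228445ull);
}
```

Calling `present80_selftest()` once at start-up is a cheap way to catch a miscompiled or mis-ported build before any data is encrypted.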

Implementation

Figure 3: Permutation P in PRESENT.

Figure 4: Permutations P0 and P1 for optimized PRESENT.

Algorithm 4: PRESENT encrypting block B to ciphertext block C.

    C ← B
    for i = 1 to 15 do
        C ← C ⊕ subkey_{2i−1}
        C ← P0(C)
        C ← S(C)
        C ← P1(C)
        C ← C ⊕ P(subkey_{2i})
        C ← S(C)
    end for
    C ← C ⊕ subkey_31
    C ← P(C)
    C ← S(C)
    C ← C ⊕ subkey_32
    return C

Experiments I

Benchmark: Encrypt+decrypt+key schedule 128 bytes in CBC mode or encrypt 128 bits in CTR mode.

• Related work: ASM implementation in FELICS [DCK+15], 2nd-order constant-time masked ASM implementation of PRESENT [dGPdLP+16].

• Platforms:

1. Cortex-M3 (Arduino Due, 32 bits):

    • GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns -mcpu=cortex-m3 -mthumb.

    • Cycle counts obtained by converting the output of the micros() function.

2. Cortex-M4 (Teensy 3.2, 32 bits):

    • GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns -mcpu=cortex-m3 -mthumb.

    • Cycle counts read from the CCNT register.

Results

[Figure: PRESENT, E+D+KS of 128 bytes (CBC) or encryption of 128 bits (CTR) on ARM Cortex-M3. Panels: Cycle Count and Code Size (ROM) per implementation (32-bit CBC, 32-bit CTR). Series: Ours, FELICS.]

[Figure: PRESENT, E+D+KS of 128 bytes (CBC) or encryption of 128 bits (CTR) on ARM Cortex-M4. Panels: Cycle Count and Code Size (ROM) per implementation (32-bit CBC, 32-bit CTR). Series: Constant time (Ours), Masked (RW).]

Conclusions

Side-channel resistance:

• PRESENT can be efficiently implemented in constant time.

• Performance penalty from masking is lower than for Fantomas, mainly due to the choice of S-boxes.

New state-of-the-art implementations of PRESENT:

• S-boxes can be bitsliced (no tables) and permutations can be made much faster.

• Performance improvement by a factor of 8.

• Our constant-time CTR implementation is now among the fastest block ciphers in the FELICS benchmark (competitive with SPARX).

Detailed timings

Table 1: Comparison of block ciphers implemented in C by this work with AES in Assembly for encrypting 128 bits in CTR mode across long messages.

                    Cortex-M3             Cortex-M4
Block cipher        Unprotected   CT      Unprotected   CT      ROM
Fantomas            2291          9063    2191          7866    1272
PRESENT-80          -             2052    -             1597    1124
AES-128 [SS16]      546           1617    554           1618    12120

Field arithmetic in F_{2^255 − 19}

Difficult choice of multiplication instructions in Cortex-M3 [dG15]:

• MUL: effectively 16 × 16 → 32, 1 cycle.

• MLA (acc): effectively 16 × 16 → 32, 2 cycles.

• UMULL: 32 × 32 → 64, 3-5 cycles.

• UMLAL: 32 × 32 → 64, 4-7 cycles.

A side-channel attack using early-terminating multiplications is known for ECDH [GOPT09], although it is not clear whether it applies to laddering. Countermeasures replace UMULL with instructions costing 12-19 cycles [Ham11].

Important: At this penalty, the Cortex-M0 implementation [DHH+15] should still be competitive.
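The trade-off between these instructions can be pictured in C. A multiply that keeps only the low 32 bits of the result gives a full product only when the operands fit in 16 bits, so a 32 × 32 → 64 product built from it needs four partial products, whereas a widening multiply delivers it in one step. The sketch below contrasts the two; compilers for ARM typically map the widening form to UMULL and the accumulate form to UMLAL, but that mapping is an expectation rather than a guarantee, and the function names are chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* 32x32 -> 64 in one widening multiplication (the UMULL pattern). */
uint64_t mul32_wide(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Multiply-accumulate into a 64-bit accumulator (the UMLAL pattern). */
uint64_t mac32(uint64_t acc, uint32_t a, uint32_t b) {
    return acc + (uint64_t)a * b;
}

/* Same product from four 16x16 -> 32 partial products, which is all a
 * multiplier that returns only the low 32 bits can provide. */
uint64_t mul32_from16(uint32_t a, uint32_t b) {
    uint32_t a0 = a & 0xffffu, a1 = a >> 16;
    uint32_t b0 = b & 0xffffu, b1 = b >> 16;
    uint32_t p00 = a0 * b0, p01 = a0 * b1;   /* each fits in 32 bits */
    uint32_t p10 = a1 * b0, p11 = a1 * b1;
    return (uint64_t)p00 + ((uint64_t)p01 << 16)
         + ((uint64_t)p10 << 16) + ((uint64_t)p11 << 32);
}

/* Quick consistency check of the two product routines on edge values. */
void mul32_selfcheck(void) {
    assert(mul32_from16(0xFFFFFFFFu, 0xFFFFFFFFu) ==
           mul32_wide(0xFFFFFFFFu, 0xFFFFFFFFu));
    assert(mul32_from16(0x00010000u, 0x0000FFFFu) == 0xFFFF0000ull);
}
```

The 16-bit route trades one variable-latency instruction for four fixed-latency multiplications plus the shifts and additions that recombine them.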

Previous work in constant time with Karatsuba over reduced radix [dG15].

Alternative implementation on Cortex-M4:

• Full-radix to enjoy arithmetic density and single-cycle multiplications.

• Comba with register allocation inspired by operand caching [HW11].

• Arithmetic closely follows ideas from the full-radix Cortex-M0 implementation.

• Check next presentation. :)
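Comba (product-scanning) multiplication in full radix can be sketched in portable C as follows: the 256-bit operands are held as eight 32-bit limbs and each output limb is produced by summing one column of partial products. This shows the general technique only; the register allocation and operand caching of the actual assembly are not modeled, and the name `comba_mul` and the little-endian limb order are choices made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define LIMBS 8   /* 8 x 32-bit limbs = 256 bits, least significant first */

/* Product scanning: r = a * b, computed one output column at a time.
 * r has 2*LIMBS limbs and must not overlap a or b. */
void comba_mul(uint32_t r[2 * LIMBS],
               const uint32_t a[LIMBS], const uint32_t b[LIMBS]) {
    uint64_t acc = 0;      /* low 64 bits of the running column sum */
    uint32_t acc_hi = 0;   /* overflow above those 64 bits */
    assert(r != a && r != b);
    for (unsigned k = 0; k < 2 * LIMBS - 1; k++) {
        unsigned lo = k < LIMBS ? 0 : k - LIMBS + 1;
        unsigned hi = k < LIMBS ? k : LIMBS - 1;
        for (unsigned i = lo; i <= hi; i++) {
            uint64_t p = (uint64_t)a[i] * b[k - i];
            acc += p;
            acc_hi += (acc < p);            /* carry out of the 64-bit add */
        }
        r[k] = (uint32_t)acc;               /* emit one finished limb */
        acc = (acc >> 32) | ((uint64_t)acc_hi << 32);
        acc_hi = 0;
    }
    r[2 * LIMBS - 1] = (uint32_t)acc;
}
```

For arithmetic modulo 2^255 − 19, the 512-bit result would then be reduced; since 2^256 ≡ 38 modulo this prime, the upper half can be folded into the lower half after multiplying it by 38.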

Result

Table 2: Experimental results for different implementations of randomized X25519 and Ed25519 on ARM processors. The figures include timings for the field arithmetic and protocol operations. Latency measurements in clock cycles (cc) were taken as the average of 1000 executions, benchmarking the code directly on the M4 board.

Operation                Ours               Next presentation :)
Addition                 85 cc              106 cc
Subtraction              85 cc              108 cc
Multiplication           532 cc             546 cc
Squaring                 532 cc             362 cc
Inversion                140,306 cc         96,337 cc
X25519                   1,607,860 cc       1,658,083 cc
Code size of X25519      3,102 B of ROM     2,952 B of ROM
Signature                1,122,709 cc       -
Verification             2,747,329 cc       -
Code size for Ed25519    32,210 B of ROM    -

Questions?

References

[BKL+07] A. Bogdanov, L. R. Knudsen, G. Leander, C. Paar, A. Poschmann, M. J. B. Robshaw, Y. Seurin, and C. Vikkelsoe. PRESENT: An ultra-lightweight block cipher. In CHES, volume 4727 of Lecture Notes in Computer Science, pages 450–466. Springer, 2007.

[CHM11] N. Courtois, D. Hulme, and T. Mourouzis. Solving circuit optimisation problems in cryptography and cryptanalysis. IACR Cryptology ePrint Archive, 2011:475, 2011.

[DCK+15] D. Dinu, Y. Le Corre, D. Khovratovich, L. Perrin, J. Großschädl, and A. Biryukov. Triathlon of lightweight block ciphers for the internet of things. IACR Cryptology ePrint Archive, 2015:209, 2015.

[dG15] W. de Groot. A performance study of X25519 on Cortex M3 and M4, 2015.

[dGPdLP+16] W. de Groot, K. Papagiannopoulos, A. de La Piedra, E. Schneider, and L. Batina. Bitsliced masking and ARM: Friends or foes? Cryptology ePrint Archive, Report 2016/946, 2016. http://eprint.iacr.org/2016/946.

[DHH+15] M. Düll, B. Haase, G. Hinterwälder, M. Hutter, C. Paar, A. H. Sánchez, and P. Schwabe. High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptography, 77(2-3):493–514, 2015.

[GLS+15] V. Grosso, G. Leurent, F. Standaert, K. Varici, F. Durvaux, L. Gaspar, and S. Kerckhof. CAESAR candidate SCREAM: Side-Channel Resistant Authenticated Encryption with Masking. http://2014.diac.cr.yp.to/slides/leurent-scream.pdf, 2015.

[GLSV14] V. Grosso, G. Leurent, F. Standaert, and K. Varici. LS-Designs: Bitslice Encryption for Efficient Masked Software Implementations. In FSE, volume 8540 of Lecture Notes in Computer Science, pages 18–37. Springer, 2014.

[GOPT09] J. Großschädl, E. Oswald, D. Page, and M. Tunstall. Side-channel analysis of cryptographic software via early-terminating multiplications. In ICISC, volume 5984 of Lecture Notes in Computer Science, pages 176–192. Springer, 2009.

[Ham11] F. B. Hamouda. Exploration of efficiency and side-channel security of different implementations of RSA. 2011.

[HW11] M. Hutter and E. Wenger. Fast multi-precision multiplication for public-key cryptography on embedded microprocessors. In CHES, volume 6917 of Lecture Notes in Computer Science, pages 459–474. Springer, 2011.