Lightweight Cryptography on ARM (slides for SPEED-B 2016)

Lightweight Cryptography on ARM
Software implementation of block ciphers and ECC
Rafael Cruz, Tiago Reis, Diego F. Aranha, Julio López, Harsh Kupwade Patil
University of Campinas, LG Electronics Inc.
Introduction
Context
Cryptography can mitigate critical security issues in embedded devices.
Security property            Technique                  Primitive
Protecting data at rest      FS-level encryption        Block cipher
Protecting data in transit   Secure channel             Auth block/stream cipher
Secure software updates      Code signing               Digital signatures
Secure booting               Integrity/Authentication   Hash functions, MACs
Secure debugging             Entity authentication      Challenge-response
Device id/auth               Auth protocol              PKC
Key distribution             Key exchange               PKC
Several algorithms required to implement primitives:
Block and stream ciphers
Hash functions
AEAD and Message Authentication Codes (MACs)
Elliptic Curve Cryptography
Context
Problem: Why “lightweight cryptography”? Shouldn’t all cryptography
be ideally lightweight?
From Mouha [Mou15]:
“Although the question seems simple, this appears to be a quite
controversial subject. (...) It is important to note that lightweight
cryptography should not be equated with weak cryptography”.
Solution: Alternative name for application-specific cryptography or
application-driven cryptographic design?
Summary
We discuss techniques for ecient and secure implementations of
lightweight encryption in software:
1. Fantomas, an LS-Design proposed in [GLSV14].
2. PRESENT, a Substitution-Permutation Network (SPN) [BKL+07].
3. Curve25519 for Elliptic Curve Cryptography.
.
We target low-end and NEON-capable ARM processors, typical of
embedded systems. Results are part of a project sponsored by LG
involving 7 students and more than 30 symmetric (C) and asymmetric
(ASM) algorithms.
4
Fantomas
Construction
LS-Designs
Paradigm to construct block ciphers providing:
Lightweight designs from simple substitution and linear layers.
Friendliness to side-channel countermeasures (bitslicing and
masking).
Tweakable variant for authenticated encryption (SCREAMv3).
[Figure: the state matrix, with dimensions labeled l bits and s bits.]
Construction
Algorithm 1: LS-Design encrypting block B into ciphertext C with key K.
C ← B ⊕ K                        ▷ C represents an s × l-bit matrix
for 0 ≤ r < N_r do
    for 0 ≤ i < l do             ▷ S-box layer
        C[i, ⋆] = S[C[i, ⋆]]
    end for
    for 0 ≤ j < s do             ▷ L-box layer
        C[⋆, j] = L[C[⋆, j]]
    end for
    C ← C ⊕ K ⊕ C(r)             ▷ Key and round constant addition
end for
return C
Algorithm
The LS-Design paper introduced an involutive instance (Robin), and a
non-involutive cipher (Fantomas).
Fantomas
128-bit key length and block size.
No key scheduling.
8-bit (3/5-bit 3-round) S-boxes from MISTY.
L-box from vector-matrix product in F_2.
[Figure: 8-bit S-box and 16-bit L-box; labels: X, Count Parity.]
Implementation in 32/64 bits
Internal state can be represented with a union to respect strict aliasing
rules for 16/32/64-bit operations:
    typedef union {
        uint32_t u32;       // uint64_t u64;
        uint16_t u16[2];    // uint16_t u16[4];
    } U32_t;
Bitsliced S-boxes operate over 16-bit chunks in the u16 portion.
Key addition works using the u32/u64 internal state:
    for (j = 0; j < 4; j++)         // for (j = 0; j < 2; j++)
        st[j].u32 ^= key_32[j];     // st[j].u64 ^= key_64[j];
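The union-based state and the key addition can be sketched as a small self-contained unit. The helper names add_key and load_block are hypothetical, chosen for illustration only. Which 16-bit half of each word appears as u16[0] depends on the endianness of the target.

```c
#include <stdint.h>
#include <string.h>

/* 128-bit state as four 32-bit words, each also viewable as two 16-bit rows. */
typedef union {
    uint32_t u32;
    uint16_t u16[2];
} U32_t;

/* XOR a 128-bit key, given as four 32-bit words, into the state. */
static void add_key(U32_t st[4], const uint32_t key_32[4]) {
    for (int j = 0; j < 4; j++)
        st[j].u32 ^= key_32[j];
}

/* Copy a 16-byte block into the state; memcpy avoids aliasing problems. */
static void load_block(U32_t st[4], const uint8_t block[16]) {
    memcpy(st, block, 16);
}
```

Because key addition is an XOR, calling add_key twice with the same key restores the original state, which gives a cheap sanity check for the word-level view.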
Implementation in 32/64 bits
L-box can be evaluated using two precomputed tables:
    /* Unprotected L-box version */
    st[j].u16[0] = LBoxH[st[j].u16[0] >> 8] ^
                   LBoxL[st[j].u16[0] & 0xff];
    st[j].u16[1] = LBoxH[st[j].u16[1] >> 8] ^
                   LBoxL[st[j].u16[1] & 0xff];
Problem: Beware of cache-timing attacks!
An attacker who monitors which L-box table positions are in the cache can
recover the internal state. The internal state trivially reveals keys and
plaintext if recovered right after the first key addition or right before
the last one.
Implementation in 32/64 bits
Solution: We can replace memory access with online computation:
    static inline type_t LBox(type_t x, type_t y, uint8_t s) {
        x &= y;
        x ^= x >> 8;
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;
        return (x & 0x00010001) << s;
        // return (x & 0x0001000100010001) << s;
    }
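To see why the parity-based routine can stand in for the two tables, the sketch below computes a 16-bit linear map both ways and compares them on every input. The matrix columns in COL are placeholder values produced by a small LFSR and are NOT the Fantomas L-box matrix. The names lbox_ct, lbox_table and lbox_selfcheck are hypothetical.

```c
#include <stdint.h>

/* HYPOTHETICAL 16x16 binary matrix: COL[s] is column s as a 16-bit mask.
 * Placeholder values only, not the real Fantomas matrix. */
static uint16_t COL[16];
static uint16_t LBoxL[256], LBoxH[256];

/* Parity of the 16 bits of x, with no branches and no table lookups. */
static uint16_t parity16(uint16_t x) {
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}

/* Lookup-free L-box: output bit s is the parity of (x AND column s). */
static uint16_t lbox_ct(uint16_t x) {
    uint16_t out = 0;
    for (int s = 0; s < 16; s++)
        out |= (uint16_t)(parity16(x & COL[s]) << s);
    return out;
}

/* Fill the placeholder matrix, then derive both tables from it. */
static void lbox_init(void) {
    uint16_t v = 0xACE1u;
    for (int s = 0; s < 16; s++) {
        v = (uint16_t)((v >> 1) ^ ((v & 1) ? 0xB400u : 0)); /* LFSR step */
        COL[s] = v;
    }
    for (int b = 0; b < 256; b++) {
        LBoxL[b] = lbox_ct((uint16_t)b);
        LBoxH[b] = lbox_ct((uint16_t)(b << 8));
    }
}

/* Table-based evaluation; correct because the map is linear over F_2. */
static uint16_t lbox_table(uint16_t x) {
    return LBoxH[x >> 8] ^ LBoxL[x & 0xff];
}

/* Returns 1 when both versions agree on all 2^16 inputs. */
static int lbox_selfcheck(void) {
    lbox_init();
    for (uint32_t x = 0; x < 65536; x++)
        if (lbox_ct((uint16_t)x) != lbox_table((uint16_t)x))
            return 0;
    return 1;
}
```

The table version indexes memory with secret-dependent values, while lbox_ct only uses AND, XOR and shifts on registers. The price is 16 parity computations per 16-bit word, which is consistent with the slowdown reported for the constant-time variants.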
NEON implementation
L-boxes can be evaluated using shuffling instructions to compute 8
table lookups in parallel.
[Figure: L-box in registers.]
Important: 32-bit implementations can process 2 blocks and vector
implementations can process 16 blocks simultaneously in CTR mode.
NEON implementation
Counter transformation for the vectorized CTR implementation:
[Figure: (a) Initial state of the counter: rows are blocks b0 to b15, columns are positions 0 to 15. (b) Final state of the counter: rows are positions 0 to 15, columns are blocks b0 to b15.]
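A plain C sketch of this transformation follows, assuming the figure depicts a 16 × 16 byte transposition of 16 consecutive counter blocks. The counter layout in make_counters (12 nonce bytes followed by a 32-bit big-endian counter) is a hypothetical choice for illustration, and a real NEON version would presumably use vector transpose or zip instructions rather than scalar loops.

```c
#include <stdint.h>

/* Build 16 consecutive counter blocks. HYPOTHETICAL layout: 12 nonce
 * bytes followed by a 32-bit big-endian block counter. */
static void make_counters(uint8_t blocks[16][16], const uint8_t nonce[12],
                          uint32_t ctr) {
    for (int b = 0; b < 16; b++) {
        uint32_t c = ctr + (uint32_t)b;
        for (int i = 0; i < 12; i++)
            blocks[b][i] = nonce[i];
        blocks[b][12] = (uint8_t)(c >> 24);
        blocks[b][13] = (uint8_t)(c >> 16);
        blocks[b][14] = (uint8_t)(c >> 8);
        blocks[b][15] = (uint8_t)c;
    }
}

/* Transpose: row i of out collects byte i of every block. */
static void transpose16x16(uint8_t out[16][16], uint8_t in[16][16]) {
    for (int b = 0; b < 16; b++)
        for (int i = 0; i < 16; i++)
            out[i][b] = in[b][i];
}
```

After the transposition, each 16-byte row holds the same byte position of all 16 blocks, so one 128-bit register operation acts on 16 blocks at once.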
NEON implementation
Key must be transformed to follow the same representation.
[Figure: key positions 0 to 15 before and after the transformation.]
Experiments I
Benchmark: Encrypt+decrypt 128 bytes in CBC or encrypt 128 bits in
CTR mode.
Related work: FELICS (triathlon of block ciphers) [DCK+15].
Platforms:
1. Cortex-M3 (Arduino Due, 32 bits):
GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns
-mcpu=cortex-m3 -mthumb.
Cycle counts obtained by converting the output of the micros() function.
2. Cortex-M4 (Teensy 3, 32 bits):
GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns
-mcpu=cortex-m3 -mthumb.
Cycle counts through the CCNT register.
3. Cortex-A53 (ODROID OC2, 64 bits):
GCC 6.1.1 with flags -O3 -fno-schedule-insns -mcpu=cortex-a53
-mthumb -march=native.
Cycle counts through the CCNT register.
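The conversion from a micros() reading to a cycle count is simple arithmetic, sketched below. The helper names are hypothetical, and the core clock is a parameter: 84 MHz is assumed for the Arduino Due in the usage note, and should be replaced by the actual clock of the board under test.

```c
#include <stdint.h>

/* Convert an elapsed time in microseconds into clock cycles.
 * cpu_hz is the core clock in Hz, e.g. 84000000 for an 84 MHz part. */
static uint64_t micros_to_cycles(uint32_t elapsed_us, uint32_t cpu_hz) {
    return (uint64_t)elapsed_us * cpu_hz / 1000000u;
}

/* Cycles per byte, scaled by 100 to keep two decimals in integers. */
static uint64_t cpb_x100(uint64_t cycles, uint32_t nbytes) {
    return cycles * 100u / nbytes;
}
```

For example, 10 µs at 84 MHz corresponds to 840 cycles. Since micros() only resolves whole microseconds, short operations are best timed over many iterations and averaged.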
Results
[Figure: Fantomas in CBC mode on Arduino Due (Cortex-M3). Two panels, cycle count and code size (ROM), for the 32-bit and 32-bit CT implementations; series: Ours, FELICS Fast, FELICS Compact.]
Results
[Figure: Fantomas in CTR mode on Arduino Due (Cortex-M3). Two panels, cycle count and code size (ROM), for the 32-bit and 32-bit CT implementations; series: Ours, FELICS Fast, FELICS Compact.]
Results
[Figure: Fantomas in CBC mode on Cortex-M3/M4/A53. Two panels, cycles per byte (CPB) and code size (ROM), for the 32-bit CBC, 32-bit CT CBC, 64-bit CBC and 64-bit CT CBC implementations; series: Cortex-M3 (Ours), Cortex-M4 (Ours), Cortex-A53 (Ours).]
Experiments II
Benchmark: Encrypt 128 bits in CTR mode.
Related work: Adjusted timings from the SCREAMv3 presentation in
the CAESAR competition [GLS+15].
Platforms:
1. Cortex-A15 (ODROID XU4, 32 bits + NEON):
GCC 6.1.1 with flags -O3 -fno-schedule-insns -mcpu=cortex-a15
-mthumb -march=native.
Cycle counts through the CCNT register.
2. Cortex-A53 (ODROID OC2, 64 bits + NEON):
GCC 6.1.1 with flags -O3 -fno-schedule-insns -mcpu=cortex-a53
-mthumb -march=native.
Cycle counts through the CCNT register.
Results
[Figure: Fantomas in CTR mode, NEON implementation. Two panels, cycles per byte (CPB) and code size (ROM), on Cortex-A15 and Cortex-A53; series: Fantomas (Ours), 16-block version (Ours), 16-block version (RW).]
Side-channel resistance
1. Constant-time implementation against cache-timing attacks:
Performance penalty of a factor of 3 on low-end ARMs.
Inherent in vector implementations.
Not sufficient against other side-channel attacks.
2. Masked implementation against power attacks:
Significant quadratic performance penalty (almost two times slower with
a single mask).
Not sufficient against cache-timing attacks.
Key masking to force the attacker to recover all shares (additional
10-20% overhead).
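The cost asymmetry of masking can be illustrated with first-order Boolean masking on 16-bit words. This is a generic textbook-style sketch (an ISW-style AND gadget with two shares), not the masked implementation benchmarked in the talk. rand16 is a hypothetical stand-in built on rand(), which is not suitable as a randomness source for real masking.

```c
#include <stdint.h>
#include <stdlib.h>

/* A value x split into two Boolean shares: x = s[0] ^ s[1]. */
typedef struct { uint16_t s[2]; } masked16;

/* HYPOTHETICAL randomness source; rand() is NOT fit for real masking. */
static uint16_t rand16(void) { return (uint16_t)rand(); }

static masked16 mask(uint16_t x) {
    masked16 m;
    m.s[0] = rand16();
    m.s[1] = (uint16_t)(x ^ m.s[0]);
    return m;
}

static uint16_t unmask(masked16 m) {
    return (uint16_t)(m.s[0] ^ m.s[1]);
}

/* Linear operations act share by share and need no fresh randomness. */
static masked16 masked_xor(masked16 a, masked16 b) {
    masked16 c;
    c.s[0] = (uint16_t)(a.s[0] ^ b.s[0]);
    c.s[1] = (uint16_t)(a.s[1] ^ b.s[1]);
    return c;
}

/* Nonlinear AND needs fresh randomness and all cross terms between
 * shares; the number of cross terms grows quadratically with shares. */
static masked16 masked_and(masked16 a, masked16 b) {
    uint16_t r = rand16();
    masked16 c;
    c.s[0] = (uint16_t)((a.s[0] & b.s[0]) ^ r);
    c.s[1] = (uint16_t)((a.s[1] & b.s[1]) ^
                        ((r ^ (a.s[0] & b.s[1])) ^ (a.s[1] & b.s[0])));
    return c;
}
```

A linear layer only needs masked_xor-style operations, while every AND in a bitsliced S-box goes through masked_and, which is where the quadratic penalty comes from.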
Conclusions
Fantomas has some limitations regarding side-channel resistance:
S-boxes do not require tables, but are expensive to mask.
L-boxes are free to mask, but expensive to compute in constant
time.
New state-of-the-art implementations of Fantomas:
Portable implementation in C is 35% and 52% faster
than [DCK+15] on Cortex-M, and similar in code size.
New countermeasures against cache timing attacks.
NEON implementation is 40% faster on ARM than [GLS+15].
PRESENT
Algorithm
Proposed in 2007 and standardized by ISO/IEC, one of the first
lightweight block cipher designs.
PRESENT
Substitution-permutation network.
80-bit or 128-bit key and 64-bit block.
Key schedule for 31 rounds with 64-bit subkeys subkey_i.
4-bit S-boxes with Boolean representation friendly to bitslicing.
Bit permutation P such that P^2 = P^-1.
Algorithm
Figure 2: 4-bit S-Boxes in PRESENT.
P(i) = 16·i mod 63   if i ≠ 63
P(i) = 63            if i = 63
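The property P^2 = P^-1 is equivalent to P^3 being the identity, which follows from 16^3 = 4096 = 65·63 + 1. A short sketch to check this numerically, with hypothetical function names:

```c
#include <stdint.h>

/* Destination of bit i under the PRESENT bit permutation. */
static unsigned present_p(unsigned i) {
    return (i == 63) ? 63 : (16 * i) % 63;
}

/* Apply P to a 64-bit state: bit i of x moves to bit P(i). */
static uint64_t p_layer(uint64_t x) {
    uint64_t y = 0;
    for (unsigned i = 0; i < 64; i++)
        y |= ((x >> i) & 1) << present_p(i);
    return y;
}

/* Returns 1 if three applications of P map every position to itself. */
static int p_cubed_is_identity(void) {
    for (unsigned i = 0; i < 64; i++)
        if (present_p(present_p(present_p(i))) != i)
            return 0;
    return 1;
}
```

The bit-by-bit loop in p_layer is only meant as a reference; it is the kind of code that the optimized permutations described later are designed to avoid.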
Algorithm
Algorithm 3: PRESENT encrypting block B to ciphertext block C.
C ← B
for i = 1 to 31 do
    C ← C ⊕ subkey_i
    C ← S(C)
    C ← P(C)
end for
C ← C ⊕ subkey_32
return C
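A straightforward, unoptimized reference for this round structure with an 80-bit key might look as follows. It is a sketch written from the public PRESENT specification [BKL+07], not the optimized code from the talk: the S-box is a lookup table here, so this version is not constant time. The function names and the split of the key into k_hi and k_lo are choices made for illustration.

```c
#include <stdint.h>

static const uint8_t SBOX[16] = {0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                                 0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2};

/* S-box layer: substitute each of the 16 nibbles. */
static uint64_t s_layer(uint64_t x) {
    uint64_t y = 0;
    for (int n = 0; n < 16; n++)
        y |= (uint64_t)SBOX[(x >> (4 * n)) & 0xF] << (4 * n);
    return y;
}

/* Bit permutation: bit i moves to 16*i mod 63, bit 63 stays in place. */
static uint64_t p_layer(uint64_t x) {
    uint64_t y = 0;
    for (unsigned i = 0; i < 64; i++)
        y |= ((x >> i) & 1) << ((i == 63) ? 63 : (16 * i) % 63);
    return y;
}

/* Encrypt one block. The 80-bit key register is k_hi (bits 79..16)
 * and k_lo (bits 15..0); subkeys are derived on the fly. */
static uint64_t present80_encrypt(uint64_t block, uint64_t k_hi,
                                  uint16_t k_lo) {
    uint64_t c = block;
    for (unsigned i = 1; i <= 31; i++) {
        c ^= k_hi;                      /* C <- C xor subkey_i */
        c = s_layer(c);
        c = p_layer(c);
        /* Key update: rotate the 80-bit register left by 61 bits, */
        uint64_t hi = (k_hi << 61) | ((uint64_t)k_lo << 45) | (k_hi >> 19);
        uint16_t lo = (uint16_t)(k_hi >> 3);
        /* pass the top nibble through the S-box, */
        hi = (hi & 0x0FFFFFFFFFFFFFFFULL) | ((uint64_t)SBOX[hi >> 60] << 60);
        /* and XOR the round counter into register bits 19..15. */
        hi ^= i >> 1;
        lo ^= (uint16_t)((i & 1u) << 15);
        k_hi = hi;
        k_lo = lo;
    }
    return c ^ k_hi;                    /* C <- C xor subkey_32 */
}
```

The test vectors published with the cipher give a quick check: an all-zero key and all-zero plaintext should encrypt to 5579C1387B228445.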
Implementation
PRESENT optimizations
1. Decompose permutation P^2 into software-friendly involutive
permutations P0 and P1.
2. Rearrange rounds to accommodate the new permutations.
3. Efficient bitsliced S-boxes from [CHM11].
4. For CTR mode in 32 bits, process two blocks simultaneously.
Implementation
Figure 3: Permutation P in PRESENT.
Figure 4: Permutations P0 and P1 for optimized PRESENT.
Implementation
Algorithm 4: PRESENT encrypting block B to ciphertext block C.
C ← B
for i = 1 to 15 do
    C ← C ⊕ subkey_{2i-1}
    C ← P0(C)
    C ← S(C)
    C ← P1(C)
    C ← C ⊕ P(subkey_{2i})
    C ← S(C)
end for
C ← C ⊕ subkey_31
C ← P(C)
C ← S(C)
C ← C ⊕ subkey_32
return C
Experiments I
Benchmark: Encrypt+decrypt+key schedule 128 bytes in CBC or encrypt
128 bits in CTR mode.
Related work: ASM implementation in FELICS [DCK+15],
2nd-order constant-time masked ASM implementation of
PRESENT [dGPdLP+16].
Platforms:
1. Cortex-M3 (Arduino Due, 32 bits):
GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns
-mcpu=cortex-m3 -mthumb.
Cycle counts obtained by converting the output of the micros() function.
2. Cortex-M4 (Teensy 3.2, 32 bits):
GCC 4.8.4 from Arduino with flags -O3 -fno-schedule-insns
-mcpu=cortex-m3 -mthumb.
Cycle counts through the CCNT register.
Results
[Figure: PRESENT, E+D+KS 128 bytes (CBC) or encrypt 128 bits (CTR) on ARM Cortex-M3. Two panels, cycle count and code size (ROM), for the 32-bit CBC and 32-bit CTR implementations; series: Ours, FELICS.]
Conclusions
Side-channel resistance:
PRESENT can be eciently implemented in constant time.
Performance penalty from masking is lower than Fantomas, mainly
due to choice of S-boxes.
New state-of-the-art implementations of PRESENT:
S-boxes can be bitsliced (no tables) and permutations can be made
much faster.
Performance improvement of 8x factor.
Our constant-time CTR implementation is now among the fastest
block ciphers in the FELICS benchmark (competitive with SPARX).
33
Detailed timings
Table 1: Comparison of block ciphers implemented in C by this work with AES
in Assembly for encrypting 128 bits in CTR mode across long messages.
                  Cortex-M3             Cortex-M4
Block cipher      Unprotected   CT      Unprotected   CT      ROM
Fantomas          2291          9063    2191          7866    1272
PRESENT-80        -             2052    -             1597    1124
AES-128 [SS16]    546           1617    554           1618    12120
Curve25519
Field arithmetic in F_{2^255 - 19}
Difficult choice of multiplication instructions in Cortex-M3 [dG15]:
MUL: effectively 16 × 16 → 32, 1 cycle.
MLA (acc): effectively 16 × 16 → 32, 2 cycles.
UMULL: 32 × 32 → 64, 3-5 cycles.
UMLAL: 32 × 32 → 64, 4-7 cycles.
A side-channel attack using early-terminating multiplications is known for
ECDH [GOPT09], although it is not clear whether it applies to laddering.
Countermeasures replace UMULL with instructions costing 12-19
cycles [Ham11].
Important: At this penalty, the Cortex-M0 implementation [DHH+15] should
still be competitive.
Field arithmetic in F_{2^255 - 19}
Previous work in constant time with Karatsuba over reduced radix [dG15].
Alternative implementation on Cortex-M4:
Full radix to enjoy arithmetic density and single-cycle multiplications.
Comba with register allocation inspired by operand caching [HW11].
Arithmetic closely follows ideas from the full-radix Cortex-M0
implementation.
Check next presentation. :)
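For readers unfamiliar with Comba (product-scanning) multiplication in full radix, a portable C sketch for 256-bit operands stored as eight 32-bit limbs is given below. It only illustrates the column-wise accumulation and omits the reduction modulo 2^255 - 19. The name mul256_comba is hypothetical, and the talk's version is assembly with hand-tuned register allocation, which this sketch does not attempt to reproduce.

```c
#include <stdint.h>

/* r = a * b, with a and b as 8 little-endian 32-bit limbs and r as 16.
 * Each output column k sums all products a[i] * b[k - i]. */
static void mul256_comba(uint32_t r[16], const uint32_t a[8],
                         const uint32_t b[8]) {
    uint64_t acc = 0;    /* low 64 bits of the column accumulator */
    uint32_t carry = 0;  /* overflow of the accumulator above 64 bits */
    for (int k = 0; k < 15; k++) {
        int lo = (k < 8) ? 0 : k - 7;
        int hi = (k < 8) ? k : 7;
        for (int i = lo; i <= hi; i++) {
            uint64_t p = (uint64_t)a[i] * b[k - i];
            acc += p;
            carry += (acc < p);  /* detect wraparound of the 64-bit sum */
        }
        r[k] = (uint32_t)acc;    /* emit one finished limb */
        acc = (acc >> 32) | ((uint64_t)carry << 32);
        carry = 0;
    }
    r[15] = (uint32_t)acc;
}
```

Each result limb is written exactly once, which is what makes the method attractive when registers are scarce. Whether the compiled code runs in constant time depends on the compiler and on the multiplier of the target core, so this sketch makes no such claim.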
Result
Table 2: Experimental results for different implementations of randomized
X25519 and Ed25519 on ARM processors. The figures include timings for the
field arithmetic and protocol operations. Measurements for latency in clock
cycles (cc) were taken as the average of 1000 executions by benchmarking code
directly on the M4 board.
Operation               Ours               Next presentation :)
Addition                85 cc              106 cc
Subtraction             85 cc              108 cc
Multiplication          532 cc             546 cc
Squaring                532 cc             362 cc
Inversion               140,306 cc         96,337 cc
X25519                  1,607,860 cc       1,658,083 cc
Code size of X25519     3,102 B of ROM     2,952 B of ROM
Signature               1,122,709 cc       -
Verification            2,747,329 cc       -
Code size for Ed25519   32,210 B of ROM    -
Final notes
Important: All timings cross-checked with the MPS2 ARM development
board provided by LG.
Fantomas for x86/SSE can be found at
https://github.com/rafajunio/fantomas-x86.
Questions?
References
[BKL+07] A. Bogdanov, L. R. Knudsen, G. Leander, C. Paar, A. Poschmann, M. J. B. Robshaw, Y. Seurin, and C. Vikkelsoe. PRESENT: an ultra-lightweight block cipher. In CHES, volume 4727 of Lecture Notes in Computer Science, pages 450–466. Springer, 2007.
[CHM11] N. Courtois, D. Hulme, and T. Mourouzis. Solving circuit optimisation problems in cryptography and cryptanalysis. IACR Cryptology ePrint Archive, 2011:475, 2011.
[DCK+15] D. Dinu, Y. L. Corre, D. Khovratovich, L. Perrin, J. Großschädl, and A. Biryukov. Triathlon of lightweight block ciphers for the internet of things. IACR Cryptology ePrint Archive, 2015:209, 2015.
[dG15] W. de Groot. A performance study of X25519 on Cortex M3 and M4, 2015.
[dGPdLP+16] W. de Groot, K. Papagiannopoulos, A. de La Piedra, E. Schneider, and L. Batina. Bitsliced masking and ARM: Friends or foes? Cryptology ePrint Archive, Report 2016/946, 2016. http://eprint.iacr.org/2016/946.
[DHH+15] M. Düll, B. Haase, G. Hinterwälder, M. Hutter, C. Paar, A. H. Sánchez, and P. Schwabe. High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptography, 77(2-3):493–514, 2015.
[GLS+15] V. Grosso, G. Leurent, F. Standaert, K. Varici, F. Durvaux, L. Gaspar, and S. Kerckhof. CAESAR candidate SCREAM: Side-Channel Resistant Authenticated Encryption with Masking. http://2014.diac.cr.yp.to/slides/leurent-scream.pdf, 2015.
[GLSV14] V. Grosso, G. Leurent, F. Standaert, and K. Varici. LS-Designs: Bitslice Encryption for Efficient Masked Software Implementations. In FSE, volume 8540 of Lecture Notes in Computer Science, pages 18–37. Springer, 2014.
[GOPT09] J. Großschädl, E. Oswald, D. Page, and M. Tunstall. Side-channel analysis of cryptographic software via early-terminating multiplications. In ICISC, volume 5984 of Lecture Notes in Computer Science, pages 176–192. Springer, 2009.
[Ham11] F. B. Hamouda. Exploration of efficiency and side-channel security of different implementations of RSA. 2011.
[HW11] M. Hutter and E. Wenger. Fast multi-precision multiplication for public-key cryptography on embedded microprocessors. In CHES, volume 6917 of Lecture Notes in Computer Science, pages 459–474. Springer, 2011.
[Mou15] N. Mouha. The design space of lightweight cryptography. Cryptology ePrint Archive, Report 2015/303, 2015. http://eprint.iacr.org/2015/303.
[SS16] P. Schwabe and K. Stoffelen. All the AES You Need on Cortex-M3 and M4. Cryptology ePrint Archive, Report 2016/714, 2016. http://eprint.iacr.org/2016/714.