
Efficient and secure software implementations of Fantomas


Rafael J. Cruz · Antonio Guimarães · Diego F. Aranha
Abstract In this paper, the efficient software implementation and side-channel resistance of the LS-Design construction are studied through a series of software implementations of the Fantomas block cipher, one of its most prominent instantiations. Target platforms include resource-constrained ARM devices like the Cortex-M3 and M4, and more powerful processors such as the ARM Cortex-A15 and modern Intel platforms. The implementations span a broad range of characteristics: 32-bit and 64-bit versions, unprotected and side-channel resistant, and vectorized code for NEON and SSE instruction sets. Our results improve the state of the art substantially, both in terms of efficiency and compactness, by making use of novel algorithmic techniques and features specific to the target platform. We finish by proposing and prototyping instruction set extensions to reduce by half the performance penalty of the introduced side-channel countermeasures.
Keywords LS-Design · Fantomas · side-channel resistance · vectorization · instruction set extension
Rafael J. Cruz and Antonio Guimarães
Institute of Computing, University of Campinas, Campinas,
E-mail: {raju@lasca, antonio.junior@students}
Diego F. Aranha
Department of Engineering, Aarhus University, Aarhus, Den-

1 Introduction
Efficient cryptography for embedded systems has been a very active field of research for a few decades, and it recently gained renewed interest with the emergence of the Internet of Things, under the moniker lightweight cryptography. Applications of cryptography can indeed solve problems faced by connected devices collecting and exchanging sensitive information through an open network. Typical solutions involve authenticated encryption for data protection in transit or at rest, and code signing for secure firmware updates.
Significant interest has been dedicated to the de-
sign and implementation of block ciphers, since they
represent a fundamental primitive from which many se-
curity properties in the symmetric setting can be pro-
vided. In that direction, many innovative block ciphers
were proposed to maximize performance in resource-
constrained devices and to provide lighter but secure
alternatives to AES [DR02]. Remarkable examples are
PRESENT [BKL+07], PRINCE [BCG+12], and more
recently SPARX [DPU+16]. These lightweight designs
follow and combine multiple constructions, such as Feis-
tel, Substitution-Permutation and ARX networks, pos-
ing distinct trade-offs in terms of efficiency, compact-
ness and resistance against different types of attacks.
Even when cryptographic algorithms can be consid-
ered secure according to the latest theoretical cryptana-
lytic results, their corresponding implementations may
be susceptible to attacks based on information leak-
age. Side-channel analysis is a growing and important
issue for cryptographic security, especially in embed-
ded systems where devices are physically accessible to
an attacker. These attacks are based on information
leaked during computation through side channels such
as execution time [Koc96], power consumption [KJJ99],
acoustic and electromagnetic emanations. When suc-
cessful, they help the adversary to identify and recover
secret data from observations captured during execu-
tion, overcoming the much higher computational cost of
cryptanalysis or exhaustive search in the key space. Se-
cret data may be a long-term private key, an ephemeral
session key or partial information about the internal
state of a primitive, including bits of the plaintext or
round keys. The attacks may be based on a small num-
ber of observations, such as Branch Prediction [AKS07]
or Simple Power Attacks (SPA); or require traces from
many consecutive observations, as in the case of Dif-
ferential Power Attacks (DPA) [KJJ99]. Resistance to
side-channel attacks has been considered as a strong se-
curity requirement for modern ciphers, and algorithms
which facilitate addition of side-channel countermea-
sures have been thus favored in the scientific literature,
bringing attention to ciphers like PICARO [PRC12] and
Fantomas [GLSV14].
The LS-Design paradigm [GLSV14] was created with
side-channel resistance in mind, because it allows the
designer to construct lightweight algorithms friendly to
implementation of side-channel countermeasures. LS-
Design ciphers typically combine a bitsliced substitu-
tion layer with a linear diffusion layer implemented with
precomputed tables, both amenable to masking schemes.
Masking was initially proposed in 2003 [ISW03] in the
context of protecting circuits against probing, but the
technique has been later extended to much more com-
plex operations, achieving provable security guarantees
[RP10]. Masked implementations have the interesting
property that the entire computation is performed over
shared secrets, decorrelating any potential side-channel
leakages from the actual data being encrypted or the
real cryptographic keys. From this point of view, mask-
ing can be seen as a collection of data perturbation
techniques to introduce external noise in the encryp-
tion or decryption processes, acting as countermeasure
against several types of side-channel attacks.
Our contributions. This work extends our previous work [CA16] and presents several efficient, compact, portable and secure (in the sense of side-channel resistant) implementations of the Fantomas block cipher:
- In terms of performance, a number of implementation techniques are described to save execution time or code size, several of them easily adaptable to other LS-Designs, such as the CAESAR second-round candidate SCREAMv3 [GLS+15b]. The techniques include a simple and efficient representation of the internal state, an efficient way to organize state in vectorized implementations, and strategies for exploiting parallelism in the CTR mode of operation. Our unprotected 32-bit implementation achieves performance improvements ranging from 3.9% to 62.6% in the ARM Cortex-M architecture, while consuming considerably less code. The vector implementations naturally provide much higher throughput, especially if 16 blocks can be processed simultaneously.
- In terms of security, timing attacks on LS-Designs are discussed together with cache protection heuristics, constant time execution (isochronicity), and masking as countermeasures. Cache protection heuristics attempt to reduce the effectiveness of cache-timing attacks [Ber04,BM06]; the isochronous implementation avoids vulnerable precomputed tables to protect execution against timing attacks; and masking protects against generic side-channel attacks but with a significant performance penalty, illustrating several challenges to future research. The constant time property of the isochronous implementation was validated by hand at the source code level and through static and dynamic analysis tools.
- Instruction set extensions for flexible parity computation are proposed to reduce the performance penalty of the isochronous implementation, while preserving resistance against timing attacks. The extensions were prototyped in an Altera NIOS II platform synthesized on an FPGA and reduced the performance penalty by half.
This paper is organized as follows. Section 2 intro-
duces LS-Designs and the Fantomas block cipher. Sec-
tion 3 discusses cache-timing attacks and countermea-
sures for LS-Designs. Section 4 summarizes some char-
acteristics of the target platforms and discusses multi-
ple implementations of Fantomas, targeting ARM and
Intel instruction sets. Section 5 presents experimental
results and Section 6 details instruction set extensions
to reduce the performance impact of side-channel coun-
termeasures. Section 7 concludes the paper.
2 LS-Designs and Fantomas
The LS-Design construction is a framework for designing lightweight block ciphers while addressing threats posed by side-channel attacks. Instances of an LS-Design cipher are characterized by the choice of bitsliced S-boxes $S$, an L-box matrix $L$ acting as the diffusion layer, a number of rounds $N_r$ and corresponding round constants Const. A particular feature of LS-Designs is the lack of a complex key schedule, or of any key schedule at all, which saves storage for temporary variables. In the original LS-Design paper, two algorithms were instantiated and analyzed: Robin, a faster involutive instance that later succumbed to invariant subspace attacks [LMR15]; and the non-involutive cipher Fantomas. Algorithm 1 shows a generic specification for an LS-Design, illustrating its simplicity and regularity. For Fantomas, parameters are $s = 8$ and $l = 16$, resulting in a cipher with a 128-bit key length and 128-bit block size.
Algorithm 1 LS-Design construction encrypting plaintext block $P$ with key $K$ to generate a ciphertext block $C$.
 1: $X \leftarrow P \oplus K$    ▷ State $X$ represents an $s \times l$-bit matrix
 2: for $0 \le r < N_r$ do
 3:     for $0 \le i < l$ do    ▷ S-box layer with bitslicing
 4:         $X[i, \star] = S[X[i, \star]]$
 5:     end for
 6:     for $0 \le j < s$ do    ▷ L-box layer with table lookups
 7:         $X[\star, j] = L[X[\star, j]]$
 8:     end for
 9:     $X \leftarrow X \oplus K$    ▷ Key addition
10:     $X \leftarrow X \oplus \mathrm{Const}(r)$    ▷ Addition of round constants
11: end for
12: $C \leftarrow X$
13: return $C$
Fantomas employs 3-round 3/5-bit S-boxes with similar structure to the MISTY cipher [CDL15], as presented in detail in Algorithm 2. An important consideration taken by the designers of the cipher is the number of nonlinear operations in the choice of S-boxes. Because Fantomas employs 8-bit S-boxes, they must contain at least 8 nonlinear operations to not be weak from a cryptanalytic point of view. There is some security margin in this design decision because Fantomas employs 11 AND operations between elements of the cipher state. However, the additional ANDs penalize the masking countermeasure, as later discussed in Section 3. The L-box presented in Figure 1 provides diffusion and its computation can be seen as a sequence of vector-matrix products in $\mathbb{F}_2$, as illustrated in the figure.
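Algorithm 2 MISTY-like 3/5-bit S-boxes operating over state $x = \{X_0, X_1, \dots, X_7\}$.
 1: ▷ S5
 2: $X_2 \leftarrow X_2 \oplus (X_0 \wedge X_1)$
 3: $X_1 \leftarrow X_1 \oplus X_2$
 4: $X_3 \leftarrow X_3 \oplus (X_0 \wedge X_4)$
 5: $X_2 \leftarrow X_2 \oplus X_3$
 6: $X_0 \leftarrow X_0 \oplus (X_1 \wedge X_3)$
 7: $X_4 \leftarrow X_4 \oplus X_1$
 8: $X_1 \leftarrow X_1 \oplus (X_2 \wedge X_4)$
 9: $X_1 \leftarrow X_1 \oplus X_0$
10: ▷ Extend-Xor
11: $X_0 \leftarrow X_0 \oplus X_5$
12: $X_1 \leftarrow X_1 \oplus X_6$
13: $X_2 \leftarrow X_2 \oplus X_7$
14: ▷ Key
15: $X_3 \leftarrow \neg X_3$
16: $X_4 \leftarrow \neg X_4$
17: ▷ S3: 3-bit Keccak S-box
18: $t_0 \leftarrow X_5$
19: $t_1 \leftarrow X_6$
20: $t_2 \leftarrow X_7$
21: $X_5 \leftarrow X_5 \oplus ((\neg t_1) \wedge t_2)$
22: $X_6 \leftarrow X_6 \oplus ((\neg t_2) \wedge t_0)$
23: $X_7 \leftarrow X_7 \oplus ((\neg t_0) \wedge t_1)$
24: ▷ Truncate-Xor
25: $X_5 \leftarrow X_5 \oplus X_0$
26: $X_6 \leftarrow X_6 \oplus X_1$
27: $X_7 \leftarrow X_7 \oplus X_2$
28: ▷ S5
29: $X_2 \leftarrow X_2 \oplus (X_0 \wedge X_1)$
30: $X_1 \leftarrow X_1 \oplus X_2$
31: $X_3 \leftarrow X_3 \oplus (X_0 \wedge X_4)$
32: $X_2 \leftarrow X_2 \oplus X_3$
33: $X_0 \leftarrow X_0 \oplus (X_1 \wedge X_3)$
34: $X_4 \leftarrow X_4 \oplus X_1$
35: $X_1 \leftarrow X_1 \oplus (X_2 \wedge X_4)$
36: $X_1 \leftarrow X_1 \oplus X_0$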
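The bitsliced evaluation of Algorithm 2 can be sketched in C as follows. This is a minimal sketch, not the authors' code: it assumes that the parenthesized terms of the algorithm are ANDs combined with XOR (which yields the 11 AND operations mentioned in the text), and the names `s5_sketch`, `sbox_layer_sketch` and `sbox_byte_sketch` are hypothetical. Each `x[i]` holds bit `i` of 16 independent S-box inputs, one per column of the state.

```c
#include <stdint.h>

/* 5-bit S5 component, applied to slices x[0..4].
 * ASSUMPTION: operator placement inferred from the damaged listing. */
static void s5_sketch(uint16_t x[8]) {
    x[2] ^= x[0] & x[1];
    x[1] ^= x[2];
    x[3] ^= x[0] & x[4];
    x[2] ^= x[3];
    x[0] ^= x[1] & x[3];
    x[4] ^= x[1];
    x[1] ^= x[2] & x[4];
    x[1] ^= x[0];
}

/* Full 8-bit S-box on 16 columns at once (bitslicing). */
static void sbox_layer_sketch(uint16_t x[8]) {
    uint16_t t0, t1, t2;

    s5_sketch(x);
    /* Extend-Xor */
    x[0] ^= x[5];
    x[1] ^= x[6];
    x[2] ^= x[7];
    /* Key (two NOT operations) */
    x[3] = (uint16_t)~x[3];
    x[4] = (uint16_t)~x[4];
    /* S3: 3-bit Keccak S-box, computed from copies of the inputs */
    t0 = x[5];
    t1 = x[6];
    t2 = x[7];
    x[5] ^= (uint16_t)(~t1 & t2);
    x[6] ^= (uint16_t)(~t2 & t0);
    x[7] ^= (uint16_t)(~t0 & t1);
    /* Truncate-Xor */
    x[5] ^= x[0];
    x[6] ^= x[1];
    x[7] ^= x[2];
    s5_sketch(x);
}

/* Helper for experiments: evaluate the S-box on one byte by placing
 * its bits in column 0 of the eight slices. */
static uint8_t sbox_byte_sketch(uint8_t in) {
    uint16_t x[8];
    uint8_t out = 0;
    for (int i = 0; i < 8; i++)
        x[i] = (uint16_t)((in >> i) & 1u);
    sbox_layer_sketch(x);
    for (int i = 0; i < 8; i++)
        out |= (uint8_t)((x[i] & 1u) << i);
    return out;
}
```

Every step either XORs a function of the other slices into one slice or complements a slice, so the sketch is a permutation of the 256 byte values regardless of the exact operator reconstruction; it contains no data-dependent branches or table lookups.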
3 Side-channel security
In this section, we discuss the concept of masking for
protecting implementations against side-channel attacks
and how to implement this countermeasure for differ-
ent types of operations recurrent in block ciphers. We
also discuss a cache-timing attack against LS-Designs
to motivate the countermeasures we later propose.
3.1 Masking scheme
Masking is one of the most investigated countermea-
sures against side-channel cryptanalysis, in particular
against different variants of power analysis. In the con-
text of block ciphers, masking aims to protect sensi-
tive data, such as plaintext during encryption or inter-
mediate values during decryption. Because information
computed in these processes will be later transformed
into the algorithm outputs, intermediary states must
be protected at all times. The masked state of $m$ with $d+1$ shared secrets is given by $m_0 \oplus m_1 \oplus \dots \oplus m_d = \bigoplus_{i=0}^{d} m_i = m$, where each $m_i$ is a shared secret and all shared secrets together form a masked secret. From this definition, we can collect some observations that allow any cryptographic algorithm to be implemented in a masked way.
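1. Applying a linear operation over a masked secret $m$ is equivalent to applying the same operation over the shared secrets of $m$:
   $L(m) \equiv L(m_0 \oplus m_1 \oplus \dots \oplus m_d) \equiv L(m_0) \oplus L(m_1) \oplus \dots \oplus L(m_d)$.
2. A NOT operation over a masked secret $m$ can be computed as:
   $\neg m \equiv \neg m_0 \oplus m_1 \oplus \dots \oplus m_d$.
3. A XOR operation between masked secrets $a = \bigoplus_{i=0}^{d} a_i$ and $b = \bigoplus_{i=0}^{d} b_i$ can be seen as:
   $a \oplus b \equiv \bigoplus_{i=0}^{d} (a_i \oplus b_i)$.
4. An AND operation between two masked secrets $a = \bigoplus_{i=0}^{d} a_i$ and $b = \bigoplus_{i=0}^{d} b_i$ is more complicated and can be computed with the help of Algorithm 3.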
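Observations 1 to 3 translate almost directly into code. The following is a minimal sketch under stated assumptions: three shares ($d = 2$) are chosen only for illustration, a one-bit rotation stands in for an arbitrary linear map, and all function names are hypothetical.

```c
#include <stdint.h>

#define NSHARES 3 /* d + 1 shares; d = 2 is an arbitrary illustrative choice */

/* Recombine the shares: m = m_0 ^ m_1 ^ ... ^ m_d. */
static uint16_t unmask(const uint16_t m[NSHARES]) {
    uint16_t v = 0;
    for (int i = 0; i < NSHARES; i++)
        v ^= m[i];
    return v;
}

/* Stand-in linear map over F_2 (rotation by one bit); any L-box row
 * operation would behave the same way with respect to sharing. */
static uint16_t rotl1(uint16_t v) {
    return (uint16_t)((v << 1) | (v >> 15));
}

/* Observation 1: a linear map is applied to every share independently. */
static void masked_linear(uint16_t m[NSHARES]) {
    for (int i = 0; i < NSHARES; i++)
        m[i] = rotl1(m[i]);
}

/* Observation 2: NOT is applied to a single share only. */
static void masked_not(uint16_t m[NSHARES]) {
    m[0] = (uint16_t)~m[0];
}

/* Observation 3: XOR is computed share by share. */
static void masked_xor(uint16_t c[NSHARES], const uint16_t a[NSHARES],
                       const uint16_t b[NSHARES]) {
    for (int i = 0; i < NSHARES; i++)
        c[i] = a[i] ^ b[i];
}
```

None of these operations needs fresh randomness or interaction between shares, which is why only the AND operation dominates the cost of masking.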
Fig. 1: Linear layer of Fantomas, with gray cells representing 1 and white cells 0 values in the L-box matrix. The L-box computation can be seen as a matrix multiplication over $\mathbb{F}_2$, where each row of the cipher state is multiplied by a column of the matrix and the resulting bit is the parity of their bitwise product.
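Algorithm 3 Nonlinear AND operation performed on two masked secrets $a$ and $b$ [ISW03].
Require: Shares $(a_i)$ and $(b_i)$ satisfying $\bigoplus_{i=0}^{d} a_i = a$ and $\bigoplus_{i=0}^{d} b_i = b$.
Ensure: Shares $(c_i)$ satisfying $\bigoplus_{i=0}^{d} c_i = a \wedge b$.
 1: for $i$ from 0 to $d$ do
 2:     $r_{i,i} \leftarrow 0$;
 3:     for $j$ from $i+1$ to $d$ do
 4:         $r_{i,j} \leftarrow \mathrm{random}()$;
 5:         $r_{j,i} \leftarrow (r_{i,j} \oplus (a_i \wedge b_j)) \oplus (a_j \wedge b_i)$;
 6:     end for
 7: end for
 8: for $i$ from 0 to $d$ do
 9:     $c_i \leftarrow a_i \wedge b_i$;
10:     for $j$ from 0 to $d$ do
11:         $c_i \leftarrow c_i \oplus r_{i,j}$;
12:     end for
13: end for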
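Algorithm 3 can be sketched in C as below. This is an illustrative sketch only: the masking order `D`, the function names and in particular the xorshift generator are assumptions made for the example. The generator is a placeholder that is not cryptographically secure; a real implementation would draw from a proper random source.

```c
#include <stdint.h>

#define D 2 /* masking order; D + 1 shares (illustrative choice) */

/* PLACEHOLDER randomness (xorshift32): NOT cryptographically secure,
 * used here only to keep the sketch self-contained. */
static uint32_t rng_state = 0x9e3779b9u;
static uint16_t random16(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return (uint16_t)rng_state;
}

/* XOR of all shares, i.e. the unmasked value. */
static uint16_t xor_shares(const uint16_t v[D + 1]) {
    uint16_t acc = 0;
    for (int i = 0; i <= D; i++)
        acc ^= v[i];
    return acc;
}

/* Masked AND following the structure of Algorithm 3. */
static void masked_and(uint16_t c[D + 1], const uint16_t a[D + 1],
                       const uint16_t b[D + 1]) {
    uint16_t r[D + 1][D + 1];

    for (int i = 0; i <= D; i++) {
        r[i][i] = 0;
        for (int j = i + 1; j <= D; j++) {
            r[i][j] = random16(); /* one fresh value per pair i < j */
            r[j][i] = (uint16_t)((r[i][j] ^ (a[i] & b[j])) ^ (a[j] & b[i]));
        }
    }
    for (int i = 0; i <= D; i++) {
        c[i] = a[i] & b[i];
        for (int j = 0; j <= D; j++)
            c[i] ^= r[i][j];
    }
}
```

The random values cancel pairwise when the output shares are recombined, leaving all cross products $a_i \wedge b_j$, so the XOR of the output shares equals $a \wedge b$ whatever the generator returns.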
Line 4 of Algorithm 3 presents an important challenge in terms of performance, since fresh random numbers must be generated for each distinct computation to achieve provable security guarantees. Taking every share $a_i$ of $a$ as one unit, every masked AND computation requires $\frac{(d+1)^2 - (d+1)}{2}$ units of random data and additional space of $(d+1)^2$ units to store a matrix containing all possible combinations of shares. In practice, randomness recycling and other heuristics are often used to reduce the performance penalty incurred by masking strategies, with potential impact on security [BGG+14]. A recent work reduced the time complexity, allowing masking to scale efficiently to higher orders [JS17].
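As a quick sanity check, reading the randomness count above as $((d+1)^2 - (d+1))/2$, the expression simplifies to the number of unordered pairs of shares, which matches the single call to random() per pair $i < j$ in Algorithm 3:

```latex
\[
\frac{(d+1)^2 - (d+1)}{2} = \frac{d(d+1)}{2},
\qquad d = 1 \mapsto 1, \quad d = 2 \mapsto 3, \quad d = 3 \mapsto 6 .
\]
```

With the 11 AND operations of the Fantomas S-box, one bitsliced S-box evaluation would therefore draw $11 \cdot d(d+1)/2$ units of randomness, that is 11 units for $d = 1$ and 33 units for $d = 2$.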
3.2 Attacks and countermeasures on LS-Designs
The L-boxes in the LS-Design construction present an obstacle for their secure implementation. Because matrix multiplication in $\mathbb{F}_2$ involves many individual bit operations which turn out to be computationally expensive in software, computation of the diffusion layer may become critical to performance, and the L-box computation is thus commonly implemented through cheaper table lookups. However, table lookups using secret indexes are vulnerable to cache-timing attacks [Ber04,BM06], more recently through Flush+Reload attacks and variants [YF14].
Cache-timing attacks allow an attacker to recover critical data in the form of plaintext/key bits or portions of the internal state. While the S-boxes in LS-Designs can be secured against cache-timing attacks when implemented with bitslicing, this attack methodology can be easily extended to the L-boxes implemented with table lookups. In Line 7 of Algorithm 1, values of the internal state are used to index the L-box. If an adversary is able to monitor the latency of each individual memory access, for instance by means of a Flush+Reload attack, the complete internal state $X$ of the cipher at a given round can be recovered. A successful strategy consists in recovering the internal state just before the last key addition in the last round of encryption, or after the first key addition in the first round of decryption. If this happens for a ciphertext block $C$ later transmitted and captured over the network, the key $K$ can be directly computed from the recovered state $x$ with $K = x \oplus C \oplus \mathrm{Const}(N_r - 1)$.
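The key recovery step amounts to a handful of XORs. A minimal sketch, assuming 128-bit values stored as four 32-bit words and a hypothetical function name `recover_key`:

```c
#include <stdint.h>

#define WORDS 4 /* 128-bit values held as four 32-bit words (assumption) */

/* Given the state x leaked just before the last key addition, the
 * ciphertext c and the last round constant rc = Const(Nr - 1), the key
 * follows from c = x ^ k ^ rc, that is k = x ^ c ^ rc. */
static void recover_key(uint32_t k[WORDS], const uint32_t x[WORDS],
                        const uint32_t c[WORDS], const uint32_t rc[WORDS]) {
    for (int i = 0; i < WORDS; i++)
        k[i] = x[i] ^ c[i] ^ rc[i];
}
```

Because LS-Designs have no key schedule, the value recovered this way is the master key itself rather than a round key.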
There are several possibilities for countermeasures to protect the L-box computation: isochronicity, cache protection heuristics and masking. Implementing the L-box computation as an explicit matrix product by removing secret indexes and precomputed tables results in an isochronous implementation with uniform memory access pattern and response latency. However, the performance impact is substantial, since all the bitwise computations in the L-box must now be performed online. Another possibility involves heuristics to guarantee that the entire L-box table is always in cache, by either employing smaller tables or visiting all cache lines at every L-box access. Conventional masking schemes can also be applied to introduce noise to make cache-timing attacks more difficult, but if the L-box is computed over all shares of $x$ and key addition is only performed in the first share, the attacker only needs to recover the first share of the internal state to compute the key $K$. Thus the key $K$ must also be decomposed in a set of additive shares as an improved countermeasure, forcing the attacker to recover all shares of the internal state to derive the key $K$. While this is not sufficient to fully overcome the threat of cache-timing attacks, at least the effort dedicated by the attacker is closer to what is expected in the threat model for masking. Algorithm 4 formalizes the masking scheme, by replacing the S-boxes with a masked version and computing the remaining steps over shares of the internal state and key. The next section explores the implementation of these countermeasures in detail.
Algorithm 4 LS-Design implemented with masking to encrypt $d+1$ shares of plaintext $P := \{p_0, \dots, p_d\}$ and key $K := \{k_0, \dots, k_d\}$ resulting in the ciphertext $c$.
 1: for $0 \le \ell \le d$ do    ▷ Initialize the internal state
 2:     $x_\ell = p_\ell \oplus k_\ell$
 3: end for
 4: for $0 \le r < N_r$ do
 5:     $x = \mathrm{MaskedS}(x)$    ▷ Masked S-box layer
 6:     for $0 \le \ell \le d$ do    ▷ L-box layer
 7:         for $0 \le j < s$ do
 8:             $x_\ell[\star, j] = L[x_\ell[\star, j]]$
 9:         end for
10:     end for
11:     for $0 \le \ell \le d$ do    ▷ Add masked key to shares
12:         $x_\ell = x_\ell \oplus k_\ell$
13:     end for
14:     $x_0 = x_0 \oplus \mathrm{Const}(r)$    ▷ Add round constants
15: end for
16: $c \leftarrow \bigoplus_{\ell=0}^{d} x_\ell$
17: return $c$
4 Implementation
In this section, we present the multiple implementations
of Fantomas, aiming at performance and side-channel
security targets. We discuss portable implementations
for 32-bit and 64-bit processors implemented in the C
programming language, mostly targeting ARM plat-
forms, and additional code vectorized for SSE/NEON
instructions. Strategies for masked implementation are
discussed last, before experimental results are presented
in the next section.
4.1 The target platforms
The ARM Cortex-M is a set of 32-bit ARM processor cores intended for microcontroller use, composed of the Cortex-M0, M0+, M1, M3, M4, and M7. These microcontrollers implement load-store architectures optimized for embedded systems in low-power applications. The Cortex-M processors implement slightly different subsets of the more restricted Thumb and Thumb-2 instruction sets, tailored to small code size. With the exception of the M7, Cortex-M microcontrollers do not have internal cache memory, but it is possible to integrate a system-level cache. The Cortex-A family of processors is tailored for more computationally demanding applications and provides sophisticated out-of-order execution and NEON vector instruction sets. Cortex-A processors typically have large amounts of cache memory, which can be disabled only in privileged mode. The register file has 16 general purpose registers (r0-r15), although pointer arithmetic is restricted to the lower half. A distinctive feature of ARM processors is the possibility to apply a bitwise operation to a second operand of an arithmetic instruction by means of a built-in barrel shifter.
The Intel platform is well-known for its aggressive out-of-order execution and rich vector instruction set. The Streaming SIMD Extensions (SSE) support many vector operations over integers or floating point values, many of them useful for fast cryptography, such as the byte shuffle instruction PSHUFB, also available in ARM NEON under the VTBL mnemonic. Byte shuffling instructions take 128-bit registers filled with bytes $r_a = a_0, a_1, \dots, a_{15}$ and $r_b = b_0, b_1, \dots, b_{15}$ and replace $r_a$ with the permutation $a_{b_0}, a_{b_1}, \dots, a_{b_{15}}$. A powerful use of this instruction is to perform 16 simultaneous lookups in a 16-byte lookup table, computing a mapping from 4-bit indexes to 8-bit values. This can be easily done by storing the lookup table in $r_a$ and the lookup indexes in $r_b$.
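The lookup semantics of the byte shuffle can be modelled in portable scalar C. This is a sketch of the behavior only, not vectorized code, and the function name `shuffle16_model` is hypothetical. It reduces every index to its low 4 bits; the special handling the real instructions apply to out-of-range indexes (PSHUFB zeroes a lane whose index has its top bit set, VTBL returns zero for indexes beyond the table) is deliberately not modelled.

```c
#include <stdint.h>

/* Scalar model of a 16-byte shuffle used as 16 parallel table lookups:
 * out[i] = table[idx[i]], with idx[i] restricted to 4 bits. */
static void shuffle16_model(uint8_t out[16], const uint8_t table[16],
                            const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = table[idx[i] & 0x0f];
}
```

On hardware, the whole loop corresponds to a single instruction, and since the table lives in a register, the lookups do not touch memory at secret-dependent addresses.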
4.2 Unprotected 32/64-bit implementations
The description starts from the unprotected 32-bit implementation, realized exclusively in the C programming language. Fantomas requires S/L-boxes which operate over 16-bit chunks and other operations over 32-bit data, such as key addition. Therefore, a portable and efficient implementation must simultaneously support the two data types in one concise structure to represent the internal state. Following the C99 standard, violations of the strict aliasing rules can be prevented by representing the internal state as a union combining the two data types, as in Listing 1. This gives the compiler sufficient information to optimize arithmetic and memory accesses for both 16- and 32-bit chunks without introducing explicit type conversions and the risk of interference with neighboring data chunks. In 32-bit mode, the internal state is represented as a vector of 4 objects of this type. The interface of the encryption and decryption functions does not have to be modified: the functions still take word-aligned byte vectors as input and conveniently convert them to 32-bit pointers when needed.
Listing 1: Internal state is represented using a vector of unions to respect strict aliasing, so that pointers of two different types never reference the same memory address.
    #include <stdint.h>   /* uint16_t, uint32_t */

    typedef union {
        uint32_t u32;     /* whole word, used for key addition */
        uint16_t u16[2];  /* 16-bit chunks, used by the S/L-boxes */
    } U32_t;
The substitution layer is computed using the union structure. Some operations over 16-bit chunks in the bitsliced S-boxes could be combined in 32-bit operations to increase arithmetic density, but this was avoided to prevent unaligned loads and stores which cause performance degradation. For the linear diffusion layer, the unprotected variable-time version employs two 256-position half-word precomputed tables. A small code portion illustrating the unprotected L-box can be found in Listing 2, where st stores the 128-bit state, LBoxH transforms the 8 most significant bits and LBoxL transforms the 8 least significant bits, for all $j \in \{0, 1, 2, 3\}$.
Listing 2: Unprotected L-Box using the 16-bit values of the internal state. LBoxH maps the higher 8 bits of the linear transformation and LBoxL the lower 8 bits.
    st[j].u16[0] = LBoxH[st[j].u16[0] >> 8] ^
                   LBoxL[st[j].u16[0] & 0xff];
    st[j].u16[1] = LBoxH[st[j].u16[1] >> 8] ^
                   LBoxL[st[j].u16[1] & 0xff];
To improve performance slightly, the key addition
works by accumulating the key in the internal state
using 32-bit XOR operations, as in Listing 3.
Listing 3: Key addition of Fantomas using the 32-bit state of the union.
    for (j = 0; j < 4; j++) {
        st[j].u32 ^= key_32[j];
    }
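To illustrate why the union pays off, the sketch below contrasts the 32-bit key addition with the equivalent chunk-wise addition on 16-bit values. It is an illustration under assumptions: the type `state_word_t` mirrors the union of Listing 1 under a different, hypothetical name, and the key is stored in the same union type rather than in the `key_32` array of Listing 3.

```c
#include <stdint.h>

/* Same layout as the union in Listing 1 (name is hypothetical). */
typedef union {
    uint32_t u32;
    uint16_t u16[2];
} state_word_t;

/* Key addition with one 32-bit XOR per state word. */
static void add_key32(state_word_t st[4], const state_word_t key[4]) {
    for (int j = 0; j < 4; j++)
        st[j].u32 ^= key[j].u32;
}

/* Equivalent key addition on 16-bit chunks: twice as many XORs. */
static void add_key16(state_word_t st[4], const state_word_t key[4]) {
    for (int j = 0; j < 4; j++) {
        st[j].u16[0] ^= key[j].u16[0];
        st[j].u16[1] ^= key[j].u16[1];
    }
}
```

Both functions leave the state in the same condition on any byte order, since XOR acts on each bit independently; the union simply lets each layer of the cipher pick the access width that suits it without pointer casts.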
The portable 64-bit implementations are a general-
ization of the 32-bit implementations and mostly follow
the same structure. The internal state is represented us-
ing a different union, to allow simultaneous operations
over 16-bit and 64-bit data chunks without violating
strict aliasing rules, as specified in Listing 4. The S-
boxes must again be implemented over the union with-
out breaking alignment and causing performance penal-
ties. The unprotected L-box follows the same structure
as the corresponding 32-bit implementation.
Listing 4: Union type representing part of the internal state for 64-bit platforms.
    typedef union {
        uint64_t u64;
        uint16_t u16[4];
    } U64_t;
4.3 Cache protection heuristics
Two versions were implemented with mitigations against cache-timing attacks: a compact one storing the entire L-box in a single cache line and a cache-filling implementation that visits all L-box cache lines at every access.
Compact implementation
In this version the S-boxes still need to follow the implementation of the union without breaking alignment. The L-box is represented in four small tables, so that each table is indexed by only 4 of the 16 input bits. This way, each table will contain 32 bytes, fitting a single cache line in modern processors (ARM and Intel), given that the compiler is instructed to align the base address properly. Because every table is used only once in the L-box computation, memory access patterns do not reveal critical information.
A piece of code illustrating the idea can be seen in Listing 5, where st stores the 128-bit state, LBHH maps the highest 4 bits of the linear transformation, LBHL maps the 4 least significant bits of the most significant byte, LBLH maps the 4 most significant bits of the least significant byte and LBLL the remaining least significant 4 bits, for the 16-bit chunks starting at all byte offsets $j \in \{0, 2, 4, 6, 8, 10, 12, 14\}$.
Cache-filling implementation
This version is similar to the cache-protected versions,
but it always brings to cache memory the two entire
256-position half-word precomputed tables by visiting
multiple cache lines.
Listing 5: Compact implementation using four small tables to store the entire L-box in a single cache line.
    uint8_t *b = (uint8_t *)st;
    ...
    /* Each iteration handles two state words, i.e. four 16-bit chunks
       read from bytes b[k] .. b[k+7] (low byte first). */
    for (j = 0, k = 0; j < 4; j += 2, k += 8) {
        st[j].u16[0]   = LBLH[b[k]   >> 4] ^ LBLL[b[k]   & 0xf] ^ LBHH[b[k+1] >> 4] ^ LBHL[b[k+1] & 0xf];
        st[j].u16[1]   = LBLH[b[k+2] >> 4] ^ LBLL[b[k+2] & 0xf] ^ LBHH[b[k+3] >> 4] ^ LBHL[b[k+3] & 0xf];
        st[j+1].u16[0] = LBLH[b[k+4] >> 4] ^ LBLL[b[k+4] & 0xf] ^ LBHH[b[k+5] >> 4] ^ LBHL[b[k+5] & 0xf];
        st[j+1].u16[1] = LBLH[b[k+6] >> 4] ^ LBLL[b[k+6] & 0xf] ^ LBHH[b[k+7] >> 4] ^ LBHL[b[k+7] & 0xf];
    }
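The decomposition into four 16-entry tables relies only on the linearity of the L-box: the image of a 16-bit value is the XOR of the images of its four nibbles. The sketch below demonstrates this with a toy binary matrix. The matrix generated by `toy_row` is hypothetical and is not the Fantomas L-box; the table names follow Listing 5, and the `aligned` attribute is the GCC/Clang extension already used in Listing 6.

```c
#include <stdint.h>

/* HYPOTHETICAL 16x16 binary matrix: row s is a rotation of a fixed
 * pattern. It stands in for the real L-box, which is not reproduced. */
static uint16_t toy_row(int s) {
    uint32_t base = 0x9a3du;
    return (uint16_t)((base << s) | (base >> (16 - s)));
}

/* Reference linear map: output bit s is the parity of x AND row s. */
static uint16_t toy_lbox(uint16_t x) {
    uint16_t y = 0;
    for (int s = 0; s < 16; s++) {
        uint16_t v = (uint16_t)(x & toy_row(s));
        v ^= v >> 8;
        v ^= v >> 4;
        v ^= v >> 2;
        v ^= v >> 1;
        y |= (uint16_t)((v & 1u) << s);
    }
    return y;
}

/* Four 32-byte tables, each aligned so that it cannot straddle two
 * cache lines of 32 bytes or more. */
static uint16_t LBLL[16] __attribute__((aligned(32)));
static uint16_t LBLH[16] __attribute__((aligned(32)));
static uint16_t LBHL[16] __attribute__((aligned(32)));
static uint16_t LBHH[16] __attribute__((aligned(32)));

/* Precompute the image of every nibble value at each nibble position. */
static void build_tables(void) {
    for (int n = 0; n < 16; n++) {
        LBLL[n] = toy_lbox((uint16_t)n);
        LBLH[n] = toy_lbox((uint16_t)(n << 4));
        LBHL[n] = toy_lbox((uint16_t)(n << 8));
        LBHH[n] = toy_lbox((uint16_t)(n << 12));
    }
}

/* Compact evaluation: four lookups, one per table, combined by XOR. */
static uint16_t lbox_compact(uint16_t x) {
    return (uint16_t)(LBLL[x & 0xf] ^ LBLH[(x >> 4) & 0xf] ^
                      LBHL[(x >> 8) & 0xf] ^ LBHH[x >> 12]);
}
```

After calling `build_tables()` once, `lbox_compact` agrees with `toy_lbox` on every 16-bit input. The price of the smaller tables is four lookups per chunk instead of two.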
Listing 6: LBoxH contains the upper part of the linear transformation and LBoxL contains the lower part; CPU_CACHELINE contains the cache line size in bytes (default is 64). The loop traverses all the cache lines where the tables are stored.
    uint16_t LBoxH[256] __attribute__((aligned(CPU_CACHELINE))) = { ... };
    uint16_t LBoxL[256] __attribute__((aligned(CPU_CACHELINE))) = { ... };
    uint16_t tmp = 0;
    ...
    /* The index advances in table entries, so one cache line spans
       CPU_CACHELINE / sizeof(uint16_t) consecutive positions. */
    for (j = 0; j < 256; j += CPU_CACHELINE / sizeof(uint16_t)) {
        tmp ^= LBoxL[j];
        tmp ^= LBoxH[j];
    }
    ...
    st[0].u16[0] ^= tmp;
    ...
    st[0].u16[0] ^= tmp;
It is necessary to explicitly force the precomputed tables to start at a cache-line boundary, which may not be compatible with compiler defaults. The code to access the whole table can be seen in Listing 6. In the code, CPU_CACHELINE contains the size of the cache line and the tables are aligned according to the size of the cache line, by default CPU_CACHELINE = 64 as commonly found in ARM and Intel processors. The loop scans the table one cache line at a time to bring the entire table to cache memory before performing the linear transformation.
4.4 Isochronous 32/64-bit implementation
The isochronous implementations are a little more involved. The S-box layer implemented through bitslicing fortunately provides isochronicity already, so no additional countermeasures are needed. The diffusion layer is performance-critical and presents more obstacles to side-channel resistance, since it is usually implemented through table lookups on the L-box. The protected version implements the operation online by performing vector-matrix binary multiplications, where two 16-bit chunks are processed at the same time. The code portion in Listing 7 illustrates part of it, where $x$ stores the 32 bits to be transformed by the L-box as a pair of 16-bit chunks and $y$ contains the $s$-th row of the binary matrix representing the linear transformation, duplicated in both 16-bit halves. This function computes the dot product of the two 32-bit vectors in $\mathbb{F}_2$, and calculates the parity of each 16-bit result, processing two transformations at the same time. Function ProdLBox was transformed to operate over 64 bits with simple modifications to the input and output types and a repeated bit mask 0x0001000100010001 in the last operation, allowing computation of 4 simultaneous evaluations of the L-box (Listing 8).
Listing 7: Multiply the $s$-th row of the matrix $L$ containing the value $y = (y_s, y_s)$ by the value $x = (x_a, x_b)$, where the result is the $s$-th value of $(x \cdot L)_s = (x_a \cdot y_s, x_b \cdot y_s)$.
    static inline uint32_t ProdLBox(uint32_t x,
            uint32_t y, uint8_t s) {
        x &= y;
        x ^= x >> 8;
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;
        return (x & 0x00010001) << s;
    }
Listing 8: Explicit L-box computation, adapted to 64-bit platforms.
    static inline uint64_t ProdLBox(uint64_t x,
            uint64_t y, uint8_t s) {
        x &= y;
        x ^= x >> 8;
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;
        return (x & 0x0001000100010001) << s;
    }
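A complete table-free L-box evaluation calls the row product once per matrix row and accumulates the resulting bits. The sketch below shows this for the 32-bit variant. It copies `ProdLBox` from Listing 7, while the matrix produced by `toy_row` is a hypothetical stand-in for the Fantomas L-box, and `lbox_isochronous` and `lbox_reference` are names introduced only for this example.

```c
#include <stdint.h>

/* Row product as in Listing 7: parity of each 16-bit half of x & y,
 * placed at bit position s of the corresponding half. */
static inline uint32_t ProdLBox(uint32_t x, uint32_t y, uint8_t s) {
    x &= y;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (x & 0x00010001u) << s;
}

/* HYPOTHETICAL matrix rows (not the Fantomas L-box). */
static uint16_t toy_row(int s) {
    uint32_t base = 0x9a3du;
    return (uint16_t)((base << s) | (base >> (16 - s)));
}

/* Apply the linear layer to two 16-bit chunks packed in x. The loop
 * index s is public, so no memory access depends on secret data. */
static uint32_t lbox_isochronous(uint32_t x) {
    uint32_t acc = 0;
    for (int s = 0; s < 16; s++) {
        uint32_t row = toy_row(s);
        acc |= ProdLBox(x, row | (row << 16), (uint8_t)s);
    }
    return acc;
}

/* Slow bit-by-bit reference for a single 16-bit chunk. */
static uint16_t lbox_reference(uint16_t x) {
    uint16_t y = 0;
    for (int s = 0; s < 16; s++) {
        unsigned bit = 0;
        for (int i = 0; i < 16; i++)
            bit ^= (unsigned)((x & toy_row(s)) >> i) & 1u;
        y |= (uint16_t)(bit << s);
    }
    return y;
}
```

Each half of the result of `lbox_isochronous` matches `lbox_reference` applied to the corresponding half of the input, at a cost of 16 row products per pair of chunks instead of two table lookups per chunk.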
Platform-specific 64-bit implementation
As a possibly faster alternative for Intel platforms, we also implemented a 64-bit version using the POPCNT instruction for population counting. This instruction is part of the SSE4 extension and counts the number of 1 bits in a register of up to 64 bits, so the parity is available in the least significant bit of the result. Because the instruction takes 3 cycles to complete [Fog16] and only a single bit of the result is useful for our computation, this version turned out to be much less efficient. For reference, it can be found in Listing 9.
Listing 9: L-box evaluation with parity computed through the least significant bit of the result from the Intel POPCNT instruction.
    #include <nmmintrin.h>  /* _mm_popcnt_u32 */

    static inline uint64_t ProdLBox(U64_t x,
            uint64_t y, uint8_t s) {
        x.u64 &= y;
        /* Widen each parity bit to 64 bits before shifting it into place. */
        return (((uint64_t)(_mm_popcnt_u32(x.u16[0]) & 0x1)) ^
                ((uint64_t)(_mm_popcnt_u32(x.u16[1]) & 0x1) << 16) ^
                ((uint64_t)(_mm_popcnt_u32(x.u16[2]) & 0x1) << 32) ^
                ((uint64_t)(_mm_popcnt_u32(x.u16[3]) & 0x1) << 48))
               << s;
    }
4.5 Masked implementation
The masked implementation needs large modifications in the S-boxes, because every operation computed in Algorithm 2 must now be replaced by the operations specified in Section 3.1. Other modifications in the algorithm are described in Algorithm 4.
Countermeasures in the linear layer are still needed
during encryption and decryption to protect against the
cache-timing attack discussed in Section 3.2, because
the linear layer comes immediately before the last key
addition in the encryption and immediately after the
first key addition in the decryption. If successful, the
cache-timing attack would disclose the internal state in
these positions and, with knowledge of the ciphertext,
an attacker could mount a critical key recovery attack.
Two functions are essential for preprocessing the
blocks before masked encryption and decryption can be
performed. The first converts a plaintext block into a
masked block and the second performs the converse. The first
function must generate d randomized blocks and combine
these blocks with the original by means of XOR operations
to generate the last block. The second function
must combine all masked blocks with XOR operations after
encryption or decryption has been processed.
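The two conversions can be illustrated with a minimal C sketch. It is not the code used in the experiments: the names mask_block, unmask_block and toy_random_word are hypothetical, the block is modeled as eight 16-bit words, and the C library rand() stands in for the random generator only to keep the example self-contained (it is not suitable for real masking). The sketch assumes a sharing made of n - 1 random blocks plus one last block chosen so that the XOR of all shares equals the original.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define WORDS 8 /* one 128-bit block as eight 16-bit words */

/* Placeholder randomness: rand() is NOT cryptographically secure. */
uint16_t toy_random_word(void) {
    return (uint16_t)(((unsigned)rand() << 8) ^ (unsigned)rand());
}

/* Split `block` into n shares stored back to back in shares[n * WORDS].
 * The first n-1 shares are random; the last one is the original block
 * XORed with all of them. */
void mask_block(const uint16_t block[WORDS], uint16_t *shares, int n) {
    assert(n >= 1);
    for (int w = 0; w < WORDS; w++) {
        uint16_t last = block[w];
        for (int s = 0; s < n - 1; s++) {
            uint16_t r = toy_random_word();
            shares[s * WORDS + w] = r;
            last ^= r;
        }
        shares[(n - 1) * WORDS + w] = last;
    }
}

/* Recombine the n shares with XOR to recover the unmasked block. */
void unmask_block(const uint16_t *shares, int n, uint16_t block[WORDS]) {
    assert(n >= 1);
    for (int w = 0; w < WORDS; w++) {
        uint16_t acc = 0;
        for (int s = 0; s < n; s++)
            acc ^= shares[s * WORDS + w];
        block[w] = acc;
    }
}
```

Calling mask_block followed by unmask_block with the same n returns the original block, whatever random values were drawn.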
A substantial amount of random bits is required to
generate the masked blocks and to compute the masked
AND described in Algorithm 3. Random number generation
was implemented through the standardized
Hash_DRBG [BK12] algorithm instantiated with the
SHA-256 hash function. This choice proved to be faster
than reading bytes from /dev/urandom by a factor of 10
on the Linux-enabled platforms. Even with this faster
option, generating random bits still imposes a massive
performance penalty and dominates the execution time
in our masked implementation. Because this is highly
platform-specific, we take two approaches: follow re-
lated work and exclude the random generation time
from the experimental results, and also measure the
time for random number generation for comparison.
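To make the demand for randomness concrete, the following C sketch implements a masked AND in the classical ISW style on bitsliced 16-bit words. That Algorithm 3 follows this exact structure is an assumption of the sketch, and the names masked_and, unmask16 and toy_random16 are hypothetical. The function returns how many fresh random words it consumed, which is n(n-1)/2 for n shares.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_SHARES 20

/* Placeholder randomness; a real implementation would draw from the DRBG. */
uint16_t toy_random16(void) {
    return (uint16_t)(((unsigned)rand() << 8) ^ (unsigned)rand());
}

/* XOR all n shares together to recover the shared value. */
uint16_t unmask16(const uint16_t *x, int n) {
    uint16_t v = 0;
    for (int i = 0; i < n; i++)
        v ^= x[i];
    return v;
}

/* ISW-style masked AND: c = a AND b, each operand split in n shares.
 * c must not alias a or b. Returns the number of random words consumed. */
int masked_and(const uint16_t *a, const uint16_t *b, uint16_t *c, int n) {
    assert(n >= 1 && n <= MAX_SHARES);
    int fresh = 0;
    for (int i = 0; i < n; i++)
        c[i] = a[i] & b[i];
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            uint16_t r = toy_random16();
            fresh++;
            /* The fresh value is mixed in before the second cross term. */
            uint16_t r_ji = (uint16_t)((r ^ (a[i] & b[j])) ^ (a[j] & b[i]));
            c[i] ^= r;
            c[j] ^= r_ji;
        }
    }
    return fresh;
}
```

With 4 shares, one masked AND on a 16-bit word consumes 6 random words; this quadratic growth in the number of shares is one reason random generation weighs so heavily on the masked timings.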
4.6 Further exploiting parallelism
Although the S-box implementation already extracts
some internal parallelism inside Fantomas, we further
note that there is much more room for exploiting par-
allelism. Under a mode of operation amenable to par-
allelization of both encryption and decryption such as
CTR, Fantomas can be implemented quite efficiently at
the cost of flexibility and code size by processing multi-
ple blocks simultaneously. In particular, the 32-bit im-
plementation can be adapted to process 2 blocks at the
same time, by sharing each 32-bit word in the inter-
nal state among two 16-bit chunks of two consecutive
blocks; and the 64-bit implementation can be adapted
in a similar way to process 4 consecutive blocks. This
optimization also has the effect of accelerating the S-
box computation, because internal horizontal depen-
dencies between 16-bit chunks from the same block are
now eliminated and the S-boxes can be computed with
32-bit operations alone, without introducing unaligned
memory accesses or other performance penalties. Be-
cause the CBC mode imposes a serialization of encryp-
tion, vectorized implementations should stick to the
CTR mode processing multiple blocks simultaneously,
where only the counters are encrypted/decrypted and
later added to the plaintext/ciphertext, respectively.
For comparison, we also implemented a single-block
vectorized Fantomas in CBC mode.
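The two-block packing for the 32-bit implementation can be sketched as follows. The names pack2, unpack2, toy_step16 and toy_step32 are hypothetical, and the toy step is a generic sequence of bitwise operations, not the Fantomas S-box. The sketch only shows that a bitsliced step applied to packed 32-bit words acts on both blocks at once.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8 /* Fantomas state: eight 16-bit chunks per block */

/* Pack lane i of two blocks into one 32-bit word:
 * low half from blk0, high half from blk1. */
void pack2(const uint16_t blk0[LANES], const uint16_t blk1[LANES],
           uint32_t st[LANES]) {
    for (int i = 0; i < LANES; i++)
        st[i] = (uint32_t)blk0[i] | ((uint32_t)blk1[i] << 16);
}

/* Split the packed state back into the two blocks. */
void unpack2(const uint32_t st[LANES], uint16_t blk0[LANES],
             uint16_t blk1[LANES]) {
    for (int i = 0; i < LANES; i++) {
        blk0[i] = (uint16_t)(st[i] & 0xFFFFu);
        blk1[i] = (uint16_t)(st[i] >> 16);
        /* Round-trip sanity check. */
        assert(((uint32_t)blk0[i] | ((uint32_t)blk1[i] << 16)) == st[i]);
    }
}

/* Toy bitsliced step on a single block (NOT the Fantomas S-box). */
void toy_step16(uint16_t blk[LANES]) {
    blk[0] ^= (uint16_t)(blk[1] & blk[2]);
    blk[3] |= blk[4];
    blk[5] = (uint16_t)~blk[5];
    blk[6] ^= blk[7];
}

/* The same step on the packed state: every operation is purely bitwise,
 * so both 16-bit halves are processed independently and simultaneously. */
void toy_step32(uint32_t st[LANES]) {
    st[0] ^= st[1] & st[2];
    st[3] |= st[4];
    st[5] = ~st[5];
    st[6] ^= st[7];
}
```

Packing two blocks, applying toy_step32 once and unpacking gives the same result as applying toy_step16 to each block separately.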
4.7 Vectorized implementation
Our implementations target the ARM and Intel plat-
forms equipped with modern vector instruction sets ca-
pable of computing high-throughput bitwise operations
over full vector registers, and performing fast lookups
over small precomputed tables. LS-Designs become very
friendly to vectorization under these conditions.
Recall that the internal state of Fantomas is repre-
sented as 8 pieces of 16 bits each, hereby called lanes to
follow vectorization terminology. The S-boxes are com-
puted in a bitsliced way, facilitating vectorization as
long as the S-layer can compute over at least 8 blocks
simultaneously, applying the same operation over each
16-bit lane from the same block. The L-box presents
a higher obstacle, because memory accesses should be
reduced to increase arithmetic density. There are two
clear ways of implementing the L-box with higher arithmetic
density: the first is to perform an explicit
vector-matrix multiplication over $\mathbb{F}_2$, as in the
constant-time 32/64-bit implementation; the second is to employ
byte shuffling instructions for table lookups inside vector
registers. In the latter, registers are sliced in byte-sized
chunks, processing 16 blocks simultaneously, where the
individual bytes can be stored and transposed in a matrix
to guarantee that every vector register holds the
i-th byte of each block. Both approaches were
implemented and the latter was clearly faster due to
higher occupancy of the vector registers. For portability
across Intel and ARM, the table lookups were implemented
using the GCC built-in function __builtin_shuffle()
for byte shuffling, which translates to the instructions
PSHUFB and VTBL discussed in Section 4.1.
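A minimal sketch of a 16-entry table lookup held in one vector register is given below. The helper names lookup16 and lane are hypothetical. The GCC path uses __builtin_shuffle() with one table operand and one index operand; the fallback loop is an assumption added so that the example also builds on compilers without this built-in, and it follows the same semantics (indices taken modulo 16).

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang vector extension: 16 unsigned bytes in one 128-bit value. */
typedef uint8_t v16qu __attribute__((vector_size(16)));

/* Look up 16 byte indices at once in a 16-entry byte table. */
v16qu lookup16(v16qu table, v16qu idx) {
#if defined(__GNUC__) && !defined(__clang__)
    /* GCC reduces each index modulo 16 and returns table[idx[i]] in lane i. */
    return __builtin_shuffle(table, idx);
#else
    /* Fallback with the same semantics, assuming vector subscripting. */
    v16qu out;
    for (int i = 0; i < 16; i++)
        out[i] = table[idx[i] & 15];
    return out;
#endif
}

/* Bounds-checked read of one byte lane, convenient for inspection. */
uint8_t lane(v16qu v, int i) {
    assert(i >= 0 && i < 16);
    return v[i];
}
```

A single call replaces 16 independent memory lookups, and since the table lives in a register, no data-dependent memory address is ever formed.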
Since the L-box is a linear transformation, the 16
bits can be broken in smaller pieces. The L-box in Fig-
ure 2 can be split in 4-bit lanes and the table reduced to
4 tables of 16 positions storing 16-bit values. To make
use of the table lookup instructions mapping 4-bit val-
ues to 8 bits, the splitting must divide the most sig-
nificant bytes from the least significant bytes and the
entire table is stored in 8 vector registers of 128 bits.
Listing 10 presents part of the vectorized linear layer.
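The splitting argument relies only on linearity: a linear map over 16 bits is the XOR of its values on the four 4-bit slices of the input. The scalar C sketch below checks this on a hypothetical 16x16 binary matrix (the constants in TOY_COLS are illustrative and are NOT the Fantomas L-box). The function names are also hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16x16 binary matrix, one 16-bit column per input bit.
 * Illustrative constants only, NOT the Fantomas L-box. */
static const uint16_t TOY_COLS[16] = {
    0x8C3D, 0x1E57, 0x42A9, 0xF016, 0x0B6E, 0x93D4, 0x2C81, 0x75FA,
    0xE209, 0x3B47, 0x5D90, 0xA6C3, 0x07F8, 0xC12E, 0x6895, 0xDE4B
};

/* Reference evaluation: XOR the columns selected by the bits of x.
 * (Branches on input bits; used here only to build and check tables.) */
uint16_t linear_direct(uint16_t x) {
    uint16_t y = 0;
    for (int b = 0; b < 16; b++)
        if ((x >> b) & 1)
            y ^= TOY_COLS[b];
    return y;
}

/* Four tables of 16 positions, one per 4-bit slice of the input. */
static uint16_t T[4][16];

void build_tables(void) {
    for (int k = 0; k < 4; k++) {
        for (int v = 0; v < 16; v++)
            T[k][v] = linear_direct((uint16_t)(v << (4 * k)));
        assert(T[k][0] == 0); /* a linear map sends 0 to 0 */
    }
}

/* Evaluation through the four small tables. */
uint16_t linear_split(uint16_t x) {
    return (uint16_t)(T[0][x & 0xF] ^ T[1][(x >> 4) & 0xF] ^
                      T[2][(x >> 8) & 0xF] ^ T[3][x >> 12]);
}
```

Separating the high and low bytes of each 16-bit entry turns the four tables into eight tables of 16 bytes, which matches the eight 128-bit vector registers mentioned above.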
The single-block CBC version operates separately
in the most significant bytes and least significant bytes,
and combines them together at the end. The 16-block
CTR version is a little more complex and follows the
organization adopted by the vectorized implementation
of SCREAMv3 [GLS+15b]. First, it is necessary to ex-
pand the CTR counter for the 16 simultaneous blocks.
After expansion, the counters must be transposed and
stored in a different order. Counter updates can be done
by propagating carries using vector comparisons. The
expanded counter is computed from the original counter
as in Figure 2a, and the state must be partially trans-
posed and stored as in Figure 2b. Partial transposition
is not too computationally expensive, because the orga-
nization required is an intermediate step of the trans-
position algorithm.
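The data layout can be illustrated with a scalar C sketch that expands one counter into 16 consecutive counters and stores them byte-sliced. Several simplifications are assumptions of the sketch: the counter is treated as a big-endian 128-bit integer, carries are propagated with scalar code instead of vector comparisons, and a full 16x16 byte transposition is performed where the implementation described here stops at a partial one. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NBLK 16 /* blocks processed simultaneously */
#define BLEN 16 /* bytes per block */

/* out = ctr + k, with the 16 bytes read as a big-endian integer
 * (the byte order is an assumption of this sketch). */
void counter_add(const uint8_t ctr[BLEN], unsigned k, uint8_t out[BLEN]) {
    assert(k < NBLK);
    unsigned carry = k;
    for (int i = BLEN - 1; i >= 0; i--) {
        unsigned sum = ctr[i] + carry;
        out[i] = (uint8_t)sum;
        carry = sum >> 8;
    }
}

/* rows[j][b] holds byte j of counter (ctr + b), so each row gathers the
 * bytes that sit at the same position in all 16 counters. */
void expand_and_transpose(const uint8_t ctr[BLEN],
                          uint8_t rows[BLEN][NBLK]) {
    for (unsigned b = 0; b < NBLK; b++) {
        uint8_t tmp[BLEN];
        counter_add(ctr, b, tmp);
        for (int j = 0; j < BLEN; j++)
            rows[j][b] = tmp[j];
    }
}
```

After the call, each row can be loaded into one 128-bit vector register, so that a single vector instruction touches the same byte position of all 16 blocks.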
The organization in Figure 2 must be kept through
the whole process, because then the substitution layer
can be performed in the first 8 blocks and then on the
final 8 blocks. The linear layer is similar to the single-
block version and the splitting is not required, since the
least significant bytes are stored in the first 8 blocks in
Figure 2b and the most significant bytes in the remain-
ing 8 blocks in the same Figure. The key must also be
transformed in a similar way as in Figure 2c to facilitate
the key addition step. A total of 16 copies of the key
are stored in a set of registers and partially transposed
to match the organization used for the counter, such
that the correct operands are used for all the additions.
Our SSE-vectorized implementations of Fantomas are
publicly available for independent benchmarking and
5 Experimental results
Our implementations were benchmarked in seven dif-
ferent ARM and Intel platforms. The compiler used for
the Cortex-M platforms was GCC 4.8.4 provided by the
Arduino Development Kit with flags -O3 -nostdlib
-fno-schedule-insns -mcpu=cpu -mthumb, for values
of cpu matching the processor (cortex-m0plus/m3/m4).
For the higher-end platforms, GCC 6.3.1 provided by
the operating system was used instead with common
flags -O3 -fno-schedule-insns -march=native.
Listing 10: Linear layer vectorized to simultaneously process 16 blocks.
static inline void LLayer(v16qu *X) {
  static const v16qu tables[8] = // Store L-box in register tables
  {{0x00,0xFF,0x90,0x6F,0x37,0xC8,0xA7,0x58,0x48,0xB7,0xD8,0x27,0x7F,0x80,0xEF,0x10},
   {0x00,0xBF,0x6E,0xD1,0x41,0xFE,0x2F,0x90,0xD5,0x6A,0xBB,0x04,0x94,0x2B,0xFA,0x45},
   ...
  for (int i = 0; i < 8; i += 2) { // Process 2 blocks in parallel
    // X[i] contains the less significant byte of the i-th block
    // X[i + 8] contains the most significant byte of the i-th block
    v16qu t[4] = {X[i], X[i + 8], X[i + 1], X[i + 9]};
    // Replace the 4 less significant bits of t[0]/t[2]
    X[i] = __builtin_shuffle(tables[0], t[0]);
    X[i + 1] = __builtin_shuffle(tables[0], t[2]);
    X[i + 8] = __builtin_shuffle(tables[1], t[0]);
    X[i + 9] = __builtin_shuffle(tables[1], t[2]);
    // Replace the 4 most significant bits of t[0]/t[2]
    t[0] >>= 4; t[2] >>= 4;
    X[i] ^= __builtin_shuffle(tables[2], t[0]);
    X[i + 1] ^= __builtin_shuffle(tables[2], t[2]);
    X[i + 8] ^= __builtin_shuffle(tables[3], t[0]);
    X[i + 9] ^= __builtin_shuffle(tables[3], t[2]);
    ...
Fig. 2: Counter and key transformation for the vectorized CTR implementation. In the initial state in (a), each row
represents a different counter (block b_i) and each column represents bytes in the same position in all counters. The
final state is computed through a partial transposition of the initial state, reaching the intermediate state in (b). The
key is transformed analogously in (c). An alternative figure depicting the state organization can be found in [CA16].
Panels: (a) Initial state of the counter (blocks b0 to b15); (b) Final state of the counter; (c) Key transformation.
More details about the platforms can be found below:
– Cortex-M0+: Arduino Zero powered by an Atmel
  SAMD21 ARM Cortex-M0+ 48MHz CPU. Execution
  time was measured through the native SysTick
  cycle counter.
– Cortex-M3: Arduino Due powered by an Atmel
  SAM3X8E ARM Cortex-M3 84MHz CPU. Execution
  time was measured by converting the output of
  the Arduino micros() function from microseconds
  to cycles through simple multiplication
  by the nominal frequency.
– Cortex-M4: Teensy 3.2 board containing a Cortex-M4
  MK20DX256VLH7 72MHz processor. Execution
  time was measured through the native cycle counting
  register and some Assembly code.
– Cortex-A15: ODROID-XU4 board containing a
  Samsung Exynos5422 8-core CPU combining Cortex-A15 2GHz
  and Cortex-A7 cores. We installed the official distribution
  of Arch Linux for the board, which comes
  equipped with GCC 6.3.1 for ARM, using the additional
  flags -mfpu=neon -mcpu=cortex-a15. Execution
  time was measured by enabling reading from
  the Cycle CouNT register (CCNT) from the Performance
  Monitor Unit (PMU) at user level.
– Cortex-A53: ODROID-C2 board containing an Amlogic
  ARM Cortex-A53 (ARMv8) 2GHz 4-core CPU.
  We installed Arch Linux with GCC 6.3.1 for ARM
  using flags -mfpu=neon -mcpu=cortex-a53.
  Execution time was also measured through the PMU,
  enabled by loading a special kernel module.
– Core i7 Ivy Bridge: Intel Core i7-3632Q 2.20GHz
  CPU. GCC 6.3.1 was again used with flags -mssse3
  -msse. The RDTSC instruction was used for cycle
  counting and Turbo Boost was disabled.
– Core i7 Haswell: Intel Core i7-4770 3.40GHz CPU.
  GCC 6.3.1 was again used with flags -mssse3 -msse.
  The RDTSC instruction was used for cycle counting
  and Turbo Boost was disabled.
Tables 1 and 2 in the next pages present results
for the 32- and 64-bit portable implementations; and
the NEON and SSE vectorized implementations of Fan-
tomas. All measurements take into account the time to
encrypt and decrypt using the operating modes CBC
and CTR. The isochronous/constant-time implemen-
tations receive the CT abbreviation suffix. The vector
implementations are intrinsically isochronous because they
operate over registers only. Cycle counts were computed
by encrypting or decrypting the same 1024-byte message
100 times. The final result represents the
average time to encrypt or decrypt a single byte using
a specific implementation. Because Fantomas does not
have a key schedule and encryption/decryption algo-
rithms are very similar performance-wise, results can
be easily converted to a cycles per byte (CPB) metric.
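The two conversions used in the measurements are simple arithmetic and can be written as the following C sketch. The function names are hypothetical. The first helper mirrors the conversion of the micros() output on the Cortex-M3 (microseconds times the nominal frequency in MHz), and the second normalizes a total cycle count by the number of runs and the message length.

```c
#include <assert.h>
#include <stdint.h>

/* Convert elapsed microseconds to cycles using the nominal clock in MHz,
 * since one microsecond at f MHz corresponds to f cycles. */
uint64_t micros_to_cycles(uint64_t micros, uint32_t mhz) {
    return micros * (uint64_t)mhz;
}

/* Average cycles per byte over `runs` repetitions of a `len`-byte message. */
double cycles_per_byte(uint64_t total_cycles, uint32_t runs, uint32_t len) {
    assert(runs > 0 && len > 0);
    return (double)total_cycles / ((double)runs * (double)len);
}
```

As a check against the figures quoted from related work, 44,047 cycles for a single encryption of 128 bytes give 44047 / 128 = 344.12 CPB, the value listed in Table 1.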
The isochronicity property of the constant time im-
plementations was validated using the FlowTracker tool
for static analysis [RPA16] and dudect for dynamic
analysis [RBV17] in the Intel platforms. FlowTracker
performs information flow analysis from function inputs
marked as secret to branch instructions and memory
addresses at the LLVM IR level, effectively detecting
and thwarting timing attacks in compiled code. The
tool dudect performs statistical testing of execution
times. All timings for Cortex-M processors were repro-
duced to a reasonable degree in the ARM Cortex-M
Prototyping System (MPS2), an FPGA-based board
with support for microcontrollers ranging from Cortex-M0
to M7. However, we only report timings collected in
the widely available platforms to simplify comparisons
with future competing implementation efforts. Our im-
plementations were tailored for ARM processors and
enjoy the benefits of the second-operand barrel shifter.
5.1 Discussion
The tables contain several interesting results to be dis-
cussed. We omit cases when there is a mismatch be-
tween the word size in the implementation and the pro-
cessor word, because the compiler may degrade perfor-
mance substantially. This effect was very clear when
compiling the 64-bit constant-time implementations on
the 32-bit processors, for example.
Cache protection heuristics have a low impact on
performance, with the cache-filling implementation being
faster than the compact small-table one. These implementations
do not have formal side-channel resistance
guarantees and rely on compiler alignment and
other runtime characteristics, hence they should be used
only when some side-channel resistance is desired and
the performance overhead of isochronicity is prohibitive.
Constant-time implementations with uniform access
to memory incur a massive performance penalty. On
the Cortex-M, the 32-bit CBC/CTR constant-time implementation
of Fantomas proved to be almost twice
as compact due to the lack of precomputed tables, although
more than 3 times slower than the unprotected
version. Similar ratios can be found in the 32-bit Cortex-A15,
but performance degradation is lower in the 64-bit
Cortex-A53. If the main objective is to obtain a smaller
code footprint and/or resistance against timing-based
side-channel attacks, this implementation can nonetheless
be a good choice. Observe that Cortex-A processors
and even some Cortex-M microcontrollers may have
cache memory, so it is important to measure the performance
impact of protecting the implementations against
cache-timing leakage.
Table 1: Execution time and code size (ROM bytes) for encryption using Fantomas benchmarked in various Intel and ARM platforms. Figures present
average cycles for encrypting a single byte (CPB) in CBC/CTR mode. There are multiple types of implementations: constant-time (CT), cache protection
heuristics (small table and cache-filling) and vectorized using NEON/SSE instruction sets. For related works, we list the most efficient and compact
implementations, and also implementations with a good trade-off between speed and size, as represented by the FELICS Figure of Merit (FOM).
Implementation                 Cortex-M0+      Cortex-M3      Cortex-M4     Cortex-A15     Cortex-A53  i7 Ivy Bridge     i7 Haswell
                                CPB   ROM      CPB   ROM      CPB   ROM      CPB   ROM      CPB   ROM      CPB   ROM      CPB   ROM
32/64 (CBC)                  265.20  1658   170.46  1618   129.26  1618    49.91  1944    75.91  1720    50.96  1680    42.68  1680
32/64 (CTR)                  265.67  1808   171.31  1834   129.78  1822    51.22  2256    76.07  2008    50.63  1776    42.04  1776
32/64 (2/4-block CTR)        205.48  3208   143.22  3364   113.50  3238    40.61  4108    73.65  5284    23.72  4571    20.42  4571
32/64 Compact (CBC)          401.94   982   288.28   954   211.02   954    78.31  1328   138.83  1076    69.83  1130    61.53  1130
32/64 Compact (CTR)          402.41  1132   289.07  1170   211.53  1158    78.92  1640   139.38  1364    67.06  1226    61.10  1226
32/64 Cache-filling (CBC)    354.36  1830   244.11  1798   182.21  1798    59.27  2080    87.58  1812    53.01  1764    45.51  1764
32/64 Cache-filling (CTR)    354.82  1980   244.95  2014   182.73  2002    59.79  2392    93.36  1364    53.19  1226    45.35  1226
32/64 CT (CBC)              1132.72  1018   615.83   974   491.30   970   435.54  1544   221.85  1504   119.23  1853   103.39  1853
32/64 CT (CTR)              1133.19  1168   616.68  1190   491.82  1174   441.35  1856   222.51  1792   112.82  1949   102.71  1949
32/64 CT (2/4-block CTR)    1074.45  2348   566.46  2524   455.58  2398   418.92  3408   223.85  6012    79.60  6821    70.67  6821
POPCNT (CTR)                      –     –        –     –        –     –        –     –        –     –   224.61  2231   193.10  2231
NEON/SSE (CBC)                    –     –        –     –        –     –    63.77   924    59.98   680    55.80   669    47.97   669
NEON/SSE (CTR)                    –     –        –     –        –     –    63.72  1236    59.68   968    49.31   765    48.14   765
NEON/SSE (16-block CTR)           –     –        –     –        –     –    16.05  6822    17.93  3868     5.87  6099     5.62  5940
Related work
32 (CBC) Fast¹                    –     –   228.96  3124        –     –        –     –        –     –        –     –        –     –
32 (CTR) Fast¹                    –     –   237.31  3092        –     –        –     –        –     –        –     –        –     –
32 (CBC) FOM¹                     –     –   344.12  1484        –     –        –     –        –     –        –     –        –     –
32 (CTR) FOM¹                     –     –   344.06  1524        –     –        –     –        –     –        –     –        –     –
32 (CBC) Compact¹                 –     –   432.06  1564        –     –        –     –        –     –        –     –        –     –
32 (CTR) Compact¹                 –     –   437.31  1428        –     –        –     –        –     –        –     –        –     –
NEON/SSE (16-block CTR)²  – – – – – – 14.2 – 4.2 – –
¹ [DCK+15]  ² [GLS+15a] Timings provided only for reference, due to incompatible metrics or benchmarking strategies.
Table 2: Execution time and code size (ROM bytes) for decryption using Fantomas benchmarked in various Intel and ARM platforms. Figures present
average cycles for decrypting a single byte (CPB) in CBC/CTR mode. There are multiple types of implementations: constant-time (CT), cache protection
heuristics (small table and cache-filling) and vectorized using NEON/SSE instruction sets. For related works, we list the most efficient and compact
implementations, and also implementations with a good trade-off between speed and size, as represented by the FELICS Figure of Merit (FOM).
Implementation                 Cortex-M0+      Cortex-M3      Cortex-M4     Cortex-A15     Cortex-A53  i7 Ivy Bridge     i7 Haswell
                                CPB   ROM      CPB   ROM      CPB   ROM      CPB   ROM      CPB   ROM      CPB   ROM      CPB   ROM
32/64 (CBC)                  255.64  1634   182.18  1598   144.08  1598    51.73  1944    85.96  1752    51.02  1662    46.51  1662
32/64 (CTR)                  265.67  1524   171.31  1492   129.78  1492    51.22  1724    76.07  2008    50.63  1776    42.04  1776
32/64 (2/4-block CTR)        205.48  3208   143.22  3364   113.50  3238    40.61  4108    73.65  5284    23.72  4571    20.42  4571
32/64 Compact (CBC)          373.41   970   286.24   938   216.69   938    83.55  1296   122.49  1092    68.23  1110    64.42  1110
32/64 Compact (CTR)          402.41  1132   289.07  1170   211.53  1158    78.92  1640   139.38  1364    67.06  1226    61.10  1226
32/64 Cache-filling (CBC)    303.41  1714   247.25  1726   181.99  1726    61.43  2076    95.50  1832    49.07  1754    45.68  1754
32/64 Cache-filling (CTR)    354.82  1980   244.95  2014   182.73  2002    59.79  2392    93.36  2100    53.19  1860    45.35  1860
32/64 CT (CBC)              1127.86   942   610.59   898   486.39   898   442.38  1528   230.92  1500   136.66  1849   121.26  1849
32/64 CT (CTR)              1133.19  1168   616.68  1190   491.82  1174   441.35  1856   222.51  1792   112.82  1949   102.71  1949
32/64 CT (2/4-block CTR)    1074.45  2348   566.46  2524   455.58  2398   418.92  3408   223.85  6012    79.60  6821    70.67  6821
POPCNT (CTR)                      –     –        –     –        –     –        –     –        –     –   224.61  2231   193.10  2231
NEON/SSE (CBC)                    –     –        –     –        –     –    73.19   908    63.20   684    53.34   663    53.32   663
NEON/SSE (CTR)                    –     –        –     –        –     –    63.72  1236    59.68   968    49.31   765    48.14   765
NEON/SSE (16-block CTR)           –     –        –     –        –     –    16.05  6822    17.93  3868     5.87  6099     5.62  5940
Related work
32 (CBC) Fast¹                    –     –   189.63  4316        –     –        –     –        –     –        –     –        –     –
32 (CTR) Fast¹                    –     –   237.31  3092        –     –        –     –        –     –        –     –        –     –
32 (CBC) FOM¹                     –     –   346.24  2148        –     –        –     –        –     –        –     –        –     –
32 (CTR) FOM¹                     –     –   344.06  1524        –     –        –     –        –     –        –     –        –     –
32 (CBC) Compact¹                 –     –   487.02  2138        –     –        –     –        –     –        –     –        –     –
32 (CTR) Compact¹                 –     –   437.31  1428        –     –        –     –        –     –        –     –        –     –
NEON/SSE (16-block CTR)²  – – – – – – 14.2 – 4.2 – –
¹ [DCK+15]  ² [GLS+15a] Timings provided only for reference, due to incompatible metrics or benchmarking strategies.
It was surprising that the 64-bit implementation in
the Cortex-A53 was slower than the 32-bit implemen-
tation in the Cortex-A15, but the Cortex-A53 is at the
low-end of the 64-bit ARM processors, so better perfor-
mance for the 64-bit implementation might be expected
from higher-end processors.
Code sizes generally grow from the Cortex-M0+ to
the Ivy Bridge. The code size for the 16-block NEON
(CTR) implementation in the Cortex-A53 was also surprising,
producing binaries almost twice as compact as
the same NEON version in the Cortex-A15 and the
SSE version in the Core i7. There is a clear space-time
trade-off in the multi-block CTR implementations.
They are the largest implementations in terms of code
size, but also almost always the fastest on a given platform,
especially on those supporting vector instructions.
For the masked implementations of Fantomas, cy-
cle counts for encrypting one block with different num-
bers of shares are presented in Figure 3, where a clear
quadratic trend for the performance degradation can
be observed, as expected [GLSV14]. The figure also il-
lustrates the impact of random number generation, as
generating random bytes for the masked AND operations
in the S-boxes can consume 97% of the execution time.
Three versions were actually implemented: table-based
L-box, isochronous and NEON. All versions can be seen
in Figure 4 and Table 3. Table 3 presents the timings
required to encrypt a single block with d shares using
all three versions without random number generation.
Values for d = 1 represent the time to encrypt a single
block in Fantomas without masking.
Fig. 3: Cycle counts for encrypting one 128-bit block
with the masked implementation of Fantomas in the
Cortex-A15 platform as a function of the number of
shares. The black dots take into account the time to
generate random numbers with the Hash_DRBG using
SHA256, while the red dots disregard random number
generation. Both red and black points use the table-based
L-box.
[Plot: cycles (c) versus shares (d), for d from 2 to 20. Series: data points and best curve fit without random bit generator; data points and best curve fit with Hash_DRBG using SHA256.]
Fig. 4: Cycle counts for encrypting one 128-bit block
with the masked implementation of Fantomas in the
Cortex-A15 platform as a function of the number of
shares. The black dots refer to the table-based L-box,
red dots to the isochronous and the blue dots to the
NEON version.
[Plot: cycles (c) versus shares (d), for d from 2 to 20. Series: data points and best curve fit with table-based L-box; with isochronous L-box; with NEON implementation.]
Table 3: Cycle counts for encrypting one 128-bit block
with the masked implementation of Fantomas using
masked 32 bits, 32-bit isochronous (CT) and NEON
in the Cortex-A15 platform.
Shares (d)   32 bits   32 bits CT     NEON
         1       789         6854      990
         2     17285        31382    11411
         3     19534        36743    18727
         4     29543        53191    31844
         5     38074        68424    41754
         6     46664        82970    53925
         7     59015       101782    66189
         8     73739       125908    81523
         9     91238       148323    97799
        10    108276       169223   126548
5.2 Comparison with related work
There are two main related works that established the
previous state of the art in the context of this paper.
The most recent is the massive implementation effort
from the FELICS framework [DCK+15] to compare
lightweight block ciphers performance-wise on representative
8/16/32-bit platforms. The project website² also
contains results for some stream ciphers and block ciphers
underlying MAC constructions. The target 32-bit
platform considered in their work is the same Cortex-
M3 present in the Arduino Due and two scenarios are
taken into consideration. Scenario 1 considers consec-
utive encryption and decryption of 128 bytes in CBC
mode, simulating a communication protocol. On the website,
the best implementation according to their Figure
of Merit (FOM) takes 44,047 cycles for encryption using
1484 bytes of ROM (or 344.12 CPB in Table 1)
and 44,319 cycles for decryption using 2148 bytes of
ROM (or 346.24 CPB in Table 2). Even in comparison
with the fastest and the most compact implementations
from the website, our implementation is still competitive,
being 25.6% to 60.5% more efficient in encryption
and 3.9% to 62.6% in decryption, and competitive
in terms of code size with
their most compact implementation. These speedups
decreased in comparison to our previous work [CA16]
due to performance improvements from FELICS. In
Scenario 2, FELICS reports a range of figures for un-
protected Fantomas when encrypting 128 bits in CTR
mode, ranging from most compact implementation to
best execution time. The most compact takes 6,997 cy-
cles and 1428 bytes of ROM (437.31 CPB), the most
efficient takes 3,797 cycles and 3,092 bytes of ROM
(237.31 CPB) and a good trade-off is found at 5,505
cycles and 1524 bytes of code size (344.06 CPB). After
the proper conversions, our implementation improves
these figures by 60.8%, 27.8% and 50.2%, respectively,
by spending only 1834 bytes of ROM.
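The percentages above are consistent with a relative improvement computed as one minus the ratio between our CPB figure and the FELICS figure. A small C helper (hypothetical name) makes the conversion explicit, under the assumption that this is the formula behind the reported numbers.

```c
#include <assert.h>

/* Relative improvement, in percent, of `ours` over `theirs` (both in CPB):
 * 100 * (1 - ours / theirs). */
double improvement_pct(double ours, double theirs) {
    assert(theirs > 0.0);
    return 100.0 * (1.0 - ours / theirs);
}
```

For the Cortex-M3 in CTR mode, improvement_pct(171.31, 437.31) is about 60.8, improvement_pct(171.31, 237.31) about 27.8 and improvement_pct(171.31, 344.06) about 50.2, matching the three figures quoted for Scenario 2.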
Our compact small-table implementation takes 288.3
CPB for encryption with only 954 bytes of ROM and
286.2 CPB for decryption with only 938 bytes of ROM,
both in Scenario 1; and 289.07 CPB with 1170 bytes
of ROM in Scenario 2. The compact version is 33.3%
and 41.2% faster in Scenario 1 and 33.9% in Scenario 2
than the compact version of FELICS, respectively, us-
ing only 1170 bytes in CTR mode. An even more com-
pact version with only 870 bytes of ROM in CTR mode
was also implemented but the speed degrades to 426.46
CPB, although still faster than FELICS. The same ver-
sion is competitive in CBC mode with only 1262 bytes
of ROM adding encryption and decryption. As a refer-
ence point, FELICS reports much higher latencies for
standardized block ciphers such as AES under different
operating modes (30,613 cycles or 239.16 CPB for encrypting
and 42,114 cycles or 329.02 CPB for decrypting
in CBC mode in the Cortex-M3, for example).
The second related work is the presentation for the
SCREAMv3 candidate in the CAESAR competition
[GLS+15a]. In the slides, numbers for a 16-block vec-
tor implementation of Fantomas are also reported. We
could not reproduce the numbers presented in the ta-
ble due to unavailability of the Fantomas code, and
benchmarking the publicly-available SCREAMv3 code
gave rather different results. In private contact with the
authors, we discovered that their benchmarking code
takes the outputs of the gettimeofday() function for
time measurement, a less precise approach than using
cycle counts measured directly. Additionally, it is not
clear if their numbers were taken in a machine with
Turbo Boost enabled, as it is well known to distort
benchmarking data [BL16]. Due to the different bench-
marking strategies, we only report the numbers for ref-
erence, without attempting an explicit comparison.
5.3 Comparison with Fantomas*
After the LS-Design block cipher Robin was successfully
attacked through invariant-subspace attacks [LMR15],
the algorithm was tweaked to include larger tables of
round constants spanning the entire internal state. This
modification was called Robin*. Although Fantomas
was not affected by the same attack, the same coun-
termeasure was applied to the algorithm in a follow-up
work [JSV17], leading to the version called Fantomas*.
We have implemented and benchmarked Fantomas*
in some representative target architectures, more specifically
the Cortex-M4 and Cortex-A15, and observed
small performance penalties due to the extra additions
of round constants. The largest performance difference
was 10.3% in the execution of insecure table-based implementations,
since the addition of round constants accounts
for a larger share of the overall execution time.
Constant-time versions suffered penalties of at most 2%. The least
affected versions were the vectorized NEON implemen-
tations, in which performance became only 0.1% worse.
We could not compare our masked implementation
with timings from [JS17] in the Cortex-M4, because
their implementation employs very different techniques:
the masking technique is novel and more efficient, and
implemented in Assembly at a much higher order. There
are also differences in the target platform configuration
and how random number generation is implemented.
From the point of view of security, it is worth not-
ing that our portable code written in C may suffer
from further order reduction exacerbated by the com-
piler [BGG+14].
6 Instruction Set Extension
The previous sections studied software-based countermeasures
for side-channel resistance. This section adopts
a different approach and presents a hardware extension
built to enable more secure and efficient implementations
of Fantomas. Although our software solutions
could already be sufficiently effective at mitigating
the side-channel vulnerabilities, extended hardware
support allows the implementations to achieve the
same security property while reducing the performance
penalty.
For this experiment, we built a system using Intel
Platform Designer [Int17] (formerly Altera Qsys) and
instantiated an Altera NIOS II processor [Alt16], along
with other Altera IP components and our hardware ex-
tension. Table 4 lists all the instantiated components
and their configurations.
Our hardware modifications were guided by two goals:
to improve the performance of the online L-box evalu-
ation and to mitigate the side-channel vulnerability in
the L-box table lookups. This way, two different secure
versions were produced to explore hardware features
and trade-offs. We achieve the first goal by introducing
a new specialized instruction for parity check, which
enabled a great performance improvement in the on-
line parity calculation. The second goal, in turn, was
achieved by simply activating NIOS’ cache bypass mech-
anism, which allowed the L-box table accesses to be
performed in constant time.
Table 4: System components and configurations.
Component                    Configuration
Clock Source                 Known frequency - 50 MHz
NIOS II Processor - Gen 2    Cache configured according to the experiment.
                             All other configurations to default.
On-Chip Memory               RAM - Size: 159,744 bytes
Performance Counter          4 simultaneously-measured sections
System ID                    Default
Interval Timer               Default
Custom Instruction           Type Extended (Combinational)
6.1 Cache bypass mechanism
The NIOS processor has a built-in mechanism for cache
bypass that can be activated through its configuration
interface. The mechanism works by checking the value
of the most significant bit of a memory address before
looking it up in the cache: if the bit is set, the cache
is ignored and a direct access is performed to the main
memory. Otherwise, the ordinary cache look-up process
takes place.
It should be noted that, despite being a 32-bit processor,
NIOS is capable of addressing only 31-bit addresses,
since the most significant bit is reserved for the
cache bypass feature. Listing 11 shows the macro used
for setting the most significant bit.
6.2 Parity check instruction
NIOS also has an interface for extending its instruction
set. The extension can communicate with the processor using up to
15 signals defined by standardized templates. For the
parity instruction, we used the simplest one, designed
for combinational instructions. It offers two 32-bit inputs
as instruction operands and one 32-bit output as
instruction result. We also added the n input, which receives
a 3-bit value representing an opcode offset. The
implementation logic is similar to the one presented in
Listing 7. We synthesized our system to a Cyclone V
SoC 5CSEMA4U23C6N FPGA device, where the instruction
took 25 Adaptive Look-Up Tables (ALUTs).
For comparison, the entire system we built took 2734
ALUTs.
The instruction behaves as follows:
1. It receives two operands, dataa and datab and cal-
culates a logical AND between them.
2. If the opcode offset $n \in \{0, 1, 2, 3\}$, the
instruction performs $2^{4-n}$ parity check
computations on chunks of $2^{n+1}$ bits; otherwise, the
instruction performs one parity check calculation on
the entire register.
3. The result is presented in the least-significant bit of
each chunk.
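The three steps above can be modeled in software as follows. This is a behavioral sketch only, with a hypothetical function name; it says nothing about how the combinational logic is laid out in hardware. Values of n from 4 to 7 are taken to be the "otherwise" case, which is an assumption based on the 3-bit width of the opcode offset.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the parity check custom instruction.
 * For n in {0,1,2,3} the register is cut into chunks of 2^(n+1) bits and
 * the parity of each chunk is placed in the least significant bit of that
 * chunk; for other values of n the parity of the whole register is placed
 * in bit 0. All other result bits are zero. */
uint32_t parity_ci_model(uint32_t dataa, uint32_t datab, unsigned n) {
    assert(n < 8); /* the opcode offset is a 3-bit value */
    uint32_t x = dataa & datab;                    /* step 1: logical AND */
    unsigned chunk = (n <= 3) ? (1u << (n + 1)) : 32u;
    uint32_t result = 0;
    for (unsigned base = 0; base < 32; base += chunk) {
        uint32_t p = 0;
        for (unsigned b = 0; b < chunk; b++)       /* step 2: chunk parity */
            p ^= (x >> (base + b)) & 1u;
        result |= p << base;                       /* step 3: LSB of chunk */
    }
    return result;
}
```

With n = 3 the model yields two parities, in bits 0 and 16, one for each 16-bit half of the register. For example, the input 0xFFFF0001 ANDed with an all-ones mask gives 0x00000001, because the low half holds one set bit and the high half holds sixteen.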
The first step was conveniently chosen to help the
implementation of Fantomas, but it does not affect the
instruction generality. The parity chunk size, imple-
mented using the opcode offset, is also not necessary
for Fantomas, since we only use n = 3. Nonetheless, we
have decided to implement this feature since it brings
more flexibility to the instruction at almost zero hardware
cost.
Listing 12 defines the macro ALT_CI_PARITY_0 used
in the C code for inserting the parity check custom instruction.
The macro receives the operands A and B and the
opcode offset n, and returns the result. A built-in compiler
function is used to perform the insertion of the
instruction.
6.3 Experiments using the NIOS II processor
Using the system described in Table 4, we executed the
following experiments:
– Baseline: Non-modified versions for performance
comparison.
– Parity Instruction: Implementations using our
specialized custom instruction to perform the online
parity check calculation.
– Cache Bypass: Implementations using the cache
bypass mechanism for the L-box access.
– No-Cache: Implementations executed with the
processor's cache disabled.
Table 5 shows the results of each experiment. The
first column indicates whether the L-box is evaluated
Online or queried from a Table. For the Baseline
experiment, the Table versions are not constant time.
The results show that the Online versions took 62% fewer
cycles on average when using the specialized instruction
rather than the C implementation for the parity
calculation. For the Table versions, using the cache
bypass mechanism to mitigate the side-channel
vulnerability caused a 25% slowdown. From these results,
the best approach still seems to be using precomputed
L-boxes: the Table versions are 47% faster than the
Online ones when using the cache bypass mechanism
and 18% faster with the entire cache disabled.
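The first two percentages can be checked against the figures in Table 5. The sketch below is only an arithmetic check: it assumes the averages are taken over the four non-compact modes (CBC Encryption, CBC Decryption, CTR and 2-block CTR), and the array and function names are hypothetical.

```c
#include <stddef.h>

/* Cycles per byte copied from Table 5, in the order
   CBC Encryption, CBC Decryption, CTR, 2-block CTR. */
static const double online_baseline[4] = {875.69, 880.23, 874.53, 846.51};
static const double online_parity[4]   = {335.60, 331.99, 340.58, 303.67};
static const double table_baseline[4]  = {146.80, 147.96, 141.68, 114.36};
static const double table_bypass[4]    = {183.57, 185.52, 184.33, 138.56};

/* Average of the per-mode ratios num[i] / den[i]. */
double mean_ratio(const double *num, const double *den, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += num[i] / den[i];
    return sum / (double)n;
}

/* Fraction of cycles saved by the parity instruction (Online versions). */
double parity_gain(void)
{
    return 1.0 - mean_ratio(online_parity, online_baseline, 4);
}

/* Relative slowdown caused by the cache bypass (Table versions). */
double bypass_slowdown(void)
{
    return mean_ratio(table_bypass, table_baseline, 4) - 1.0;
}
```

Under that assumption, parity_gain() evaluates to roughly 0.62 and bypass_slowdown() to roughly 0.25, in line with the 62% and 25% quoted above.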
It should be mentioned, though, that NIOS has a
relatively small cost for memory accesses due to its low
frequency. Thus, the conclusions obtained in this
section could be different for other architectures.
Listing 11: Cache bypass macro.
/* Redeclares A as a pointer to the same data with the cache-bypass bit set. */
#define DONT_CACHE(A) \
    uint16_t *A##_t = A; \
    uint16_t *A = (uint16_t *)((uint32_t)A##_t \
        | (uint32_t)ALT_CPU_DCACHE_BYPASS_MASK);
Listing 12: Custom instruction macro.
/* Only the three low bits of n are used as the opcode offset. */
#define ALT_CI_PARITY_0_N 0x0
#define ALT_CI_PARITY_0_N_MASK ((1 << 3) - 1)
#define ALT_CI_PARITY_0(n, A, B) \
    __builtin_custom_inii(ALT_CI_PARITY_0_N + \
        (n & ALT_CI_PARITY_0_N_MASK), (A), (B))
7 Conclusion
We presented several serial and vectorized software
implementations of the Fantomas block cipher, producing
more efficient and compact implementations on the
ARM and Intel target platforms. Four approaches for
side-channel resistance were implemented: constant time,
cache protection heuristics, masking and hardware
extensions. The constant time approach for implementing
the L-box is of independent interest, as it can be
easily extended to other LS-Design ciphers. A simple
instruction set extension for parity computation
reduced the performance penalty of the constant time
countermeasures by more than half. The cache protection
heuristics have a lower performance impact, but
offer weaker security guarantees than the isochronous
implementation. Masking illustrates the computational
cost of powerful side-channel countermeasures. Even
though Fantomas was conceived to be easily masked in a
protected implementation, the performance penalty can be
as high as a factor of 27x with 10 shares when compared
to a constant time implementation. We have also
observed that efficient random number generation is an
important research target to enable masked software
implementations to perform well in a realistic setting.
As future work, we leave the task of evaluating which
countermeasures are more effective against different kinds
of side-channel attacks.
Acknowledgements
The first and last authors acknowledge financial support
and access to the ARM MPS2 board by LG Electronics
Inc. during the development of this research,
under the project titled "Efficient and Secure
Cryptography for IoT". The second and third authors
acknowledge financial support from Intel and FAPESP under
the project "Secure Execution of Cryptographic
Algorithms", grant 14/50704-7. We thank the anonymous
reviewers for their comments.
Table 5: Execution time for Fantomas benchmarked in NIOS II (cycles per byte)
L-box   Name                      Baseline   Parity   Bypass   No-Cache
Online  CBC Encryption              875.69   335.60        -          -
Online  CBC Decryption              880.23   331.99        -          -
Online  CTR                         874.53   340.58        -          -
Online  2-block CTR                 846.51   303.67        -          -
Table   CBC Encryption              146.80        -   183.57     263.41
Table   CBC Decryption              147.96        -   185.52     285.46
Table   CTR                         141.68        -   184.33     264.61
Table   2-block CTR                 114.36        -   138.56     259.85
Table   CBC Compact Encryption      216.12        -   316.68     437.91
Table   CBC Compact Decryption      213.27        -   309.35     418.06
Table   CTR Compact                 221.60        -   322.22     443.69
[Ber04] Daniel J. Bernstein. Cache-timing
attacks on AES, 2004. URL:
[BGG+14] Josep Balasch, Benedikt Gierlichs, Vincent
Grosso, Oscar Reparaz, and Fran¸cois-Xavier
Standaert. On the cost of lazy engineering for
masked software implementations. In CARDIS,
volume 8968 of Lecture Notes in Computer Science,
pages 64–81. Springer, 2014.
[BK12] Elaine Barker and John Kelsey. NIST SP 800-90A
– Recommendation for Random Number Genera-
tion Using Deterministic Random Bit Generators,
[BKL+07] Andrey Bogdanov, Lars R. Knudsen, Gregor Le-
ander, Christof Paar, Axel Poschmann, Matthew
J. B. Robshaw, Yannick Seurin, and C. Vikkelsoe.
PRESENT: an ultra-lightweight block cipher. In
Pascal Paillier and Ingrid Verbauwhede, editors,
Cryptographic Hardware and Embedded Systems -
CHES 2007, 9th International Workshop, Vienna,
Austria, September 10-13, 2007, Proceedings, vol-
ume 4727 of Lecture Notes in Computer Science,
pages 450–466. Springer, 2007.
[BL16] Daniel J. Bernstein and Tanja Lange. eBACS:
ECRYPT Benchmarking of Cryptographic Sys-
tems., 2016.
[BM06] Joseph Bonneau and Ilya Mironov. Cache-
collision timing attacks against AES. In
Louis Goubin and Mitsuru Matsui, editors,
Cryptographic Hardware and Embedded Systems -
CHES 2006, 8th International Workshop, Yoko-
hama, Japan, October 10-13, 2006, Proceedings, vol-
ume 4249 of Lecture Notes in Computer Science,
pages 201–215. Springer, 2006.
[CA16] Rafael J. Cruz and Diego F. Aranha. Efficient
Software Implementations of Fantomas. In 16th
Brazilian Symposium on Information and Computer
Systems Security (SBSeg 2016), pages 212–225,
[CDL15] Anne Canteaut, S´ebastien Duval, and Ga¨etan
Leurent. Construction of lightweight s-boxes us-
ing feistel and MISTY structures. In Orr Dunkel-
man and Liam Keliher, editors, Selected Areas
in Cryptography - SAC 2015 - 22nd International
Conference, Sackville, NB, Canada, August 12-14,
2015, Revised Selected Papers, volume 9566 of Lec-
ture Notes in Computer Science, pages 373–393.
Springer, 2015.
[DCK+15] Daniel Dinu, Yann Le Corre, Dmitry Khovra-
tovich, L´eo Perrin, Johann Großscadl, and Alex
Biryukov. Triathlon of lightweight block ciphers
for the internet of things. Cryptology ePrint
Archive, Report 2015/209, 2015. http://eprint.
[DPU+16] Daniel Dinu, L´eo Perrin, Aleksei Udovenko, Ves-
selin Velichkov, Johann Großsch¨adl, and Alex
Biryukov. Design strategies for ARX with prov-
able bounds: Sparx and LAX. In Jung Hee Cheon
and Tsuyoshi Takagi, editors, Advances in Cryptol-
ogy - ASIACRYPT 2016 - 22nd International Con-
ference on the Theory and Application of Cryptology
and Information Security, Hanoi, Vietnam, Decem-
ber 4-8, 2016, Proceedings, Part I, volume 10031 of
Lecture Notes in Computer Science, pages 484–513,
[DR02] Joan Daemen and Vincent Rijmen. The Design
of Rijndael: AES - The Advanced Encryption Stan-
dard. Information Security and Cryptography.
Springer, 2002.
[Fog16] A. Fog. Instruction tables: List of in-
struction latencies, throughputs and micro-
operation breakdowns for Intel, AMD and
instruction_tables.pdf, version published on 08
Oct 2018., 2016.
[GLS+15a] Vincent Grosso, Ga¨etan Laurent, Fran¸cois-
Xavier Standaert, Kerem Varici, Fran¸cois Dur-
vaux, Lubos Gaspar, and St´ephanie Kerck-
hof. CAESAR candidate SCREAM Side-
Channel Resistant Authenticated Encryption
with Masking.
slides/leurent-scream.pdf, 2015.
Efficient and secure software implementations of Fantomas 19
[GLS+15b] Vincent Grosso, Ga¨etan Laurent, Fran¸cois-
Xavier Standaert, Kerem Varici, Fran¸cois Dur-
vaux, Lubos Gaspar, and St´ephanie Kerckhof.
SCREAM Side-Channel Resistant Authenticated
Encryption with Masking. https://competitions., 2015.
[GLSV14] Vincent Grosso, Ga¨etan Leurent, Fran¸cois-Xavier
Standaert, and Kerem Varici. Ls-designs: Bitslice
encryption for efficient masked software imple-
mentations. In Carlos Cid and Christian Rech-
berger, editors, Fast Software Encryption - 21st
International Workshop, FSE 2014, London, UK,
March 3-5, 2014. Revised Selected Papers, volume
8540 of Lecture Notes in Computer Science, pages
18–37. Springer, 2014.
[Int17] Intel. Quartus prime standard edition handbook
volume 1 - design and synthesis, 2017.
[ISW03] Yuval Ishai, Amit Sahai, and David Wagner. Pri-
vate circuits: Securing hardware against prob-
ing attacks. In Dan Boneh, editor, Advances in
Cryptology - CRYPTO 2003, 23rd Annual Interna-
tional Cryptology Conference, Santa Barbara, Cali-
fornia, USA, August 17-21, 2003, Proceedings, vol-
ume 2729 of Lecture Notes in Computer Science,
pages 463–481. Springer, 2003.
[JS17] Anthony Journault and Fran¸cois-Xavier Stan-
daert. Very high order masking: Efficient imple-
mentation and security evaluation. In CHES, vol-
ume 10529 of Lecture Notes in Computer Science,
pages 623–643. Springer, 2017.
[JSV17] Anthony Journault, Fran¸cois-Xavier Standaert,
and Kerem Varici. Improving the security and
efficiency of block ciphers based on LS-designs.
Des. Codes Cryptography, 82(1-2):495–509, 2017.
[KJJ99] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun.
Differential power analysis. In Michael J. Wiener,
editor, Advances in Cryptology - CRYPTO ’99, 19th
Annual International Cryptology Conference, Santa
Barbara, California, USA, August 15-19, 1999, Pro-
ceedings, volume 1666 of Lecture Notes in Computer
Science, pages 388–397. Springer, 1999.
[Koc96] Paul C. Kocher. Timing attacks on implementa-
tions of diffie-hellman, rsa, dss, and other systems.
In Neal Koblitz, editor, Advances in Cryptology -
CRYPTO ’96, 16th Annual International Cryptol-
ogy Conference, Santa Barbara, California, USA,
August 18-22, 1996, Proceedings, volume 1109 of
Lecture Notes in Computer Science, pages 104–113.
Springer, 1996.
[LMR15] Gregor Leander, Brice Minaud, and Sondre
Rønjom. A generic approach to invariant sub-
space attacks: Cryptanalysis of robin, iscream and
zorro. In Elisabeth Oswald and Marc Fischlin,
editors, Advances in Cryptology - EUROCRYPT
2015 - 34th Annual International Conference on
the Theory and Applications of Cryptographic Tech-
niques, Sofia, Bulgaria, April 26-30, 2015, Proceed-
ings, Part I, volume 9056 of Lecture Notes in Com-
puter Science, pages 254–283. Springer, 2015.
[PRC12] Gilles Piret, Thomas Roche, and Claude Car-
let. PICARO - A block cipher allowing efficient
higher-order side-channel resistance. In Feng Bao,
Pierangela Samarati, and Jianying Zhou, editors,
Applied Cryptography and Network Security - 10th
International Conference, ACNS 2012, Singapore,
June 26-29, 2012. Proceedings, volume 7341 of Lec-
ture Notes in Computer Science, pages 311–328.
Springer, 2012.
[RBV17] Oscar Reparaz, Josep Balasch, and Ingrid Ver-
bauwhede. Dude, is my code constant time? In
DATE, pages 1697–1702. IEEE, 2017.
[RP10] Matthieu Rivain and Emmanuel Prouff. Prov-
ably secure higher-order masking of AES. In Ste-
fan Mangard and Fran¸cois-Xavier Standaert, ed-
itors, Cryptographic Hardware and Embedded Sys-
tems, CHES 2010, 12th International Workshop,
Santa Barbara, CA, USA, August 17-20, 2010. Pro-
ceedings, volume 6225 of Lecture Notes in Computer
Science, pages 413–427. Springer, 2010.
[RPA16] Bruno Rodrigues, Fernando Magno Quint˜ao
Pereira, and Diego F. Aranha. Sparse represen-
tation of implicit flows with applications to side-
channel detection. In Ayal Zaks and Manuel V.
Hermenegildo, editors, Proceedings of the 25th In-
ternational Conference on Compiler Construction,
CC 2016, Barcelona, Spain, March 12-18, 2016,
pages 110–120. ACM, 2016.
[YF14] Yuval Yarom and Katrina Falkner.
FLUSH+RELOAD: A high resolution, low
noise, L3 cache side-channel attack. In Kevin
Fu and Jaeyeon Jung, editors, Proceedings of the
23rd USENIX Security Symposium, San Diego,
CA, USA, August 20-22, 2014., pages 719–732.
USENIX Association, 2014.