Low Voltage Fault Attacks to AES and RSA on General Purpose Processors.

Alessandro Barenghi∗, Guido Bertoni‡, Luca Breveglieri∗, Mauro Pellicioli§, Gerardo Pelosi†
∗DEI, Politecnico di Milano, Milan, Italy
Email: {barenghi,brevegli}@elet.polimi.it
‡STMicroelectronics, Agrate Brianza, Italy
Email: guido.bertoni@st.com
§Politecnico di Milano, Milan, Italy
Email: mauro.pellicioli@mail.polimi.it
†DIIMM, Università degli Studi di Bergamo, Dalmine (BG), Italy
Email: {gerardo.pelosi@unibg.it}
Abstract—Fault injection attacks have proven in recent times
a powerful tool to exploit implementative weaknesses of robust
cryptographic algorithms. A number of different techniques
aimed at disturbing the computation of a cryptographic primitive
have been devised, and have been successfully employed to leak
secret information inferring it from the erroneous results. In
particular, many of these techniques involve directly tampering
with the computing device to alter the content of the embedded
memory, e.g. through irradiating it with laser beams.
In this contribution we present a low-cost, non-invasive and
effective technique to inject faults in an ARM9 general purpose
CPU through lowering its feeding voltage. This is the first
result available in fault attack literature to attack a software
implementation of a cryptosystem running on a full-fledged
CPU with a complete operating system. The platform under
consideration (an ARM9 CPU running a full Linux 2.6 kernel)
is widely used in mobile computing devices such as smartphones,
gaming platforms and network appliances.
We fully characterise both the fault model and the errors
induced in the computation, both in terms of ensuing frequency
and corruption patterns on the computed results.
At first, we validate the effectiveness of the proposed fault
model in leading practical attacks against implementations of the
RSA and AES cryptosystems, using techniques known in open literature.
Then we devise two new attack techniques, one for each
cryptosystem. The attack on AES is able to retrieve all the round
keys regardless of both their derivation strategy and the number of
rounds. A known ciphertext attack on RSA encryption has been
devised: the plaintext is retrieved knowing the result of a correct
and a faulty encryption of the same plaintext, and assuming the
fault corrupts the public key exponent. Through experimental
validation, we show that we can break any AES with roughly
4 kb of ciphertext, RSA encryption with 3 to 5 faults and RSA
signature with 1 to 2 faults.
I. INTRODUCTION
Key requirements during the design of a complex system are
its fault resistance and fault tolerance: the former refers to the
ability of a system to work properly in an environment
where hazards are present; the latter represents the
property of exhibiting a controlled behaviour when
faults occur. A key point in the design of fault tolerant systems
is to obtain a graceful degradation of the performance, that is
to design devices able to gradually lose functionality instead
of exhibiting catastrophic failures when faults occur.
Taking into account the fact that complex computational
systems are also designed to fail gracefully, it is sensible
to assume that there exists a category of faults which will
disrupt only a very small amount of the ongoing computation,
leaving the rest of the system untouched, regardless of its
complexity. These errors are usually corrected through the
use of redundancy, either in the form of replicated units or
correction codes. However, these are usually employed only
in industry grade chip design, where it is expected that the
component will work in a hazardous environment. Consumer
grade devices usually are not designed with these concerns due
to the milder expected deployment setting and the additional
costs involved, which would drive up the price of the unit.
Regardless of the presence of error correcting countermeasures,
it can be noticed that the graceful degradation property
may sometimes be an undesired one, i.e. when the computing
device is performing security related tasks, in particular when
calculating cryptographic primitives.
During the computation of cryptographic primitives, small,
traceable changes in the behaviour, strongly correlated to the
internal state of the device, represent a serious threat to the
security they are supposed to provide. In particular, if it is
possible to bind with precision the induced faults to a specific,
non catastrophic, change in the output, the leaked information
may be used in order to break the cryptoscheme.
This attack methodology fully embodies the side channel
attack paradigm, since it exploits informative content leaked
from an implementation of the cipher, which is strongly related
to the encryption or decryption process.
A number of hazards which can be introduced into the
working environment of a device will cause faults. In particular,
irradiation with light or lasers and glitches on the clock signal or
in the power supply have been used until now in order to successfully
induce controlled faults in small computing components
such as microcontrollers or memory storage devices [26], [27].
All these techniques rely on the fact that the disturbed device is
reasonably small, and suffer from the evolution in lithographic
etching technologies, which enable higher clock rates. This
makes it more difficult to correctly irradiate the sensitive zone
of the chip or to disturb the execution during a specific clock
cycle.
To provide experimental evidence of the correctness
of this perspective, this work presents the first characterization
of a controlled fault model in a complex computational
platform, specifically a general purpose ARM9 CPU running
a full-fledged Linux operating system, and its applications
to cryptanalysis. The choice of this platform was driven by
its widespread adoption as a computational platform in almost
all current smartphones, network appliances, portable gaming
platforms and low power computers.
Our choice of fault induction technique aimed
at picking a fault injection methodology independent of
technological progress, able to act on complex systems and as
cheap as possible, in order to obtain the widest applicability.
Our chosen methodology is to constantly underfeed a circuit
during the whole computation time: an approach which until
now has never been tried for systems including a general
purpose CPU. Underfeeding a circuit is a known cause of
faults, and lends itself well to the purpose of this research
since it represents a simple and effective alteration of the
environment, which can be achieved without leaving any
evidence of having tampered with the device. This is an
advantage with respect to irradiation techniques, which require
delicate procedures in order to remove the packaging from the
silicon chip to be effective. Moreover, the equipment required
in order to explore the effect of a gradual decay of the quality
of the working environment is reasonably cheap and does not
compromise the correct working of the device afterwards.
In the next Section we are going to provide a summary
of the most pertinent open literature contributions concerning
practical fault attacks on microcontrollers. A comprehensive
literature archive on the subject is provided by [9]–[11].
The target of the first part of this work (Section II) is to
provide a full characterisation of the model of faults occurring
when gradually lowering the power feed input to the circuit,
and the errors they induce on the outputs. Since the system is
able to operate at more than one working frequency for power
saving issues, the fault characterisation takes into account this
factor, too.
The second part of this work consists of Section IV and
Section V. In Section IV we analyze the AES cryptosystem
and recall a known attack which correctly fits our fault model.
Moreover, we propose a new attack technique which is able to
recover the full cipher key for any AES cipher and is not necessarily
bound to our fault model. In Section V we recall a well known
attack which is employable with our fault model and propose
a new one able to successfully decipher an RSA ciphertext
assuming a fault injection in the public exponent and the
possession of a correct ciphertext.
Part of the RSA attack techniques and of the fault model
characterization is described in [7] of which this work is an
extension.
The third part of this work (Section VI) reports the results
of the experimental campaign conducted in order to ascertain
the applicability of the algorithmic methods presented in the
previous sections, thus validating the practical cipher breaking
abilities proposed.
Finally, in Section VII we draw the conclusions, summarizing
our original contributions.
II. FAULT CHARACTERISATION AND INDUCED ERRORS
This section provides a complete characterization of the
faults happening when a general purpose CPU is constantly
underfed, in terms of position, shape and timing, and subsequently
delineates the induced errors on the computed outputs
together with the methodology followed in order to build it. A
complete description of the working environment is provided
in order to properly outline the workflow we followed
to derive the new fault and error model.
A. CPU Architecture and Experimental Settings
The processor architecture taken into account in this study
is the ARMv5TE, in particular the version implemented by
the ARM9E microprocessor. This choice was driven by the
vast diffusion of this CPU, which is nowadays the dominant
choice for smartphones, network appliances, portable gaming
platforms and low power computers, and is thus quite likely to be
used also to compute cryptographic primitives while in the
possession of a possible attacker.
Our target chip is an ARM926EJ-S [5]: a 32-bit RISC
Harvard architecture CPU with 16 general purpose registers
and a 5 stage pipeline. The ARM processor has a full MMU
and separate data and instruction caches, each 16 kB wide,
coupled with a 16 entry write buffer which avoids stalls in the
CPU when memory writebacks are performed. In particular the
ARM926EJ-S is also endowed with a hardware Java bytecode
interpreter able to run Java bytecode directly. The richness of
the available features justifies the vast popularity achieved by
this model in consumer mobile devices.
The CPU is embedded in a system on chip mounted on a
development board, specifically a SPEAr Head200 [30] built
by STMicroelectronics, which is used as a reference board to
design ARM based devices and is equipped with 64 MB of DDR RAM
clocked at 133 MHz, 32 MB of onboard Flash storage, 2 USB
Host ports, an RS232 serial interface and a 100 Mbps Ethernet
network card. The system is endowed with a U-Boot [16]
embedded bootloader, which is able to load the binary to be
run via the TFTP [18] protocol. This allows the board to either run
a specific binary, compiled to be independently executed on the
ARM9 CPU, or to boot a full-fledged operating system. In the
following experiments, raw binaries were employed in order to
characterise with precision the fault model of this system. On
the other hand, for the sake of practical applicability, all the
attacks on cryptosystems were carried out with a full vanilla Linux 2.6
kernel (DENX distribution) employing an NFS [6] partition
as root filesystem. All the binaries were compiled with the
GCC 3.4 based development toolchain for ARM9 provided
by CodeSourcery [12]. All the fault characterisation tests were
performed on more than one instance of the board, reporting
analogous results: for the sake of clarity we present the results
on a single board.
Two experimental workbenches were employed during this
work: the first, aimed at producing a precise characterisation
of the effect of power supply lapses was endowed with a high
precision power supply. The second one, aimed at carrying
the attacks with a lower budget, employed less expensive
equipment, without loss in the efficacy of the attacks. The
two workbenches are identical except for the change in the
Power Supply Unit (PSU).
Figure 1. Workbench used in order to accurately characterise the voltage range
Figure 2. Workbench used in order to perform the error characterisation and the attacks
The first workbench is depicted in Figure 1: the board was
fed through an Agilent E3631A PSU [3] having a
precision of 1 mV. On the second workbench, depicted
in Figure 2, the board was fed through an Agilent 6633B [2] power
supply with a 0.01 V precision. In order to achieve rough
sub-centivolt precision we used a resistive divider built with a
common commercial grade resistor: whilst this solution does
not provide the same accuracy in controlling the voltage as
a high precision PSU, it proved effective enough to
successfully carry out all the attacks. All the voltage measurements
were taken with an Agilent 34420A [1] voltmeter with a nV
precision probe, which was already available to us; nonetheless
the needed precision was only up to 0.5 mV. The board was
connected to a PC with both a null modem cable and an
Ethernet cable: the first provided an interface with the Linux
shell running on the ARM chip, while the second was used to
provide the network connection needed for both booting the
board via TFTP and providing the storage via NFS.
B. Graceful Degradation of Outputs
The first experiment run on the target chip was aimed
at investigating whether the appearance of the errors in the
system followed the gradual behaviour we expected. The
ARM9 processor has three separate supply lines, one for the
core, one for the I/O buses and one for the memory interface;
we chose to interact with the one feeding the computational
part, due to its critical importance for the correct execution of
the binaries run on the device.
Willing to detect the frequency of the appearance of faults in
order to determine how fast they appear during the execution
of a program, we tested the correct functioning of the CPU
using a simple probe program whose core loop is reported
hereafter.
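for (a = i = 0; i < 1000000; i++) {
    a = a + 1;
    if (a != i + 1) {             /* first check: a fault may have happened */
        printc('?');
        if (a != i + 1) {         /* redundant check: the fault is confirmed */
            printc('#');
            a = i + 1;            // fix the fault
            if (a != i + 1) a = i + 1;
        } /* if */
    } /* if */
} /* for */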
The aforementioned code increments a variable a million
times, and checks if a fault has happened right after each
increment. A redundant check has been added in order to lower
the likelihood of a false positive occurring in the detection:
we consider an actual fault to have happened only if both
checks confirm it. This program was run multiple times while
decreasing the voltage of the power supply by 1 mV at a time:
500 thousand runs were performed for each voltage level
probed and the results output by the code were stored. Figure 3
represents the percentage of correct computations over 500
thousand runs, for each voltage level probed. The errors in
the output grow linearly with the lapse in the voltage supply,
thus confirming our hypothesis of a gradual degradation in
the quality of the results. The dashed line in the figure points
out the voltage point where the faulty computations are as
numerous as the correct ones.
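The 50% point marked by the dashed line can be estimated directly from the sampled success rates. The following is a minimal sketch in C, assuming the measurements are available as two arrays sorted by increasing voltage; the function name crossover_mv and the data layout are illustrative assumptions and not part of the original tooling. It locates the voltage at which the share of correct runs crosses 50% by linear interpolation between adjacent samples.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): returns the voltage in mV at
 * which the percentage of correct computations crosses 50%, or -1.0 if the
 * sampled range contains no crossing.
 * mv[] must be sorted in increasing order; pct_correct[k] is the percentage
 * of fault-free runs measured at mv[k]. */
double crossover_mv(const double *mv, const double *pct_correct, size_t n)
{
    assert(mv != NULL && pct_correct != NULL);
    for (size_t k = 0; k + 1 < n; k++) {
        double lo = pct_correct[k];
        double hi = pct_correct[k + 1];
        if (lo <= 50.0 && hi >= 50.0 && hi > lo) {
            /* Linear interpolation between the two neighbouring samples. */
            double t = (50.0 - lo) / (hi - lo);
            return mv[k] + t * (mv[k + 1] - mv[k]);
        }
    }
    return -1.0;
}
```

Interpolating linearly is only reasonable here because the text reports a linear growth of the errors with the voltage lapse; with a different degradation profile another fit would be needed.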
Figure 3. Percentage of correct and wrong computations averaged over 500
thousand runs (x axis: measured CPU input voltage [mV], from 905.5 to 912.0;
y axis: relative frequencies [%]; series: correct computations, faulty
computations).
After ascertaining that the faulty computations of a program
appear gradually, we moved on to consider the number
of faults occurring during each single computation. Since our
target will be to inject single faults, we are interested in seeing
whether the errors in the outputs of the former computations
were caused by one or more faults during their executions.
Reclassifying the faulted runs from the former experiments
according to the number of faults, it is possible to observe, as
shown in Figure 4, that there is a 1 mV wide voltage range
where only a single fault happens with dominant probability.
Moreover, in the adjacent 1 mV range, the probability of having
a faulty computation triggered by a single fault ranges from
2% to 40%. This probability dwarfs the one of having multiple
faults contributing to the erroneous result, while still remaining
in the voltage range where less than half of the computations
are faulted. When working below the voltage at which 50% of the
computations are faulty, the number of possible faults starts growing, and
multiple fault scenarios start dominating the fault profile, as
can be expected in a degrading environment. After lowering
the power supply voltage even further, the board stops outputting
data from the RS232 interface used to communicate, thus
preventing the results from being collected. It is therefore clear
that, in the voltage region where the correct computations are
the majority and the faulty ones begin to appear, the faults
occurring in the computations are single and not represented
by small bursts.
Figure 4. Distribution of the quantity of the injected faults as a function of
the voltage (x axis: measured CPU input voltage [mV], from 905.5 to 912.0;
y axis: normalized average of faulty computations [%]; series: one fault,
two faults, three or more faults).
The results so far have been obtained
while keeping the number of machine instructions per binary
fixed for each measurement. Since there is no specific
timing in the insertion of the hazard causing faults, it is
reasonable to assume that a growth in the executable code
size will be met by an analogous rise in the probability of a
fault appearing during its computation. This is not an obstacle
to injecting single faults, since the fault incidence per single
execution may be tuned for different length binaries by lowering
or raising the voltage accordingly.
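The reclassification of the runs by number of faults can be illustrated with a short C sketch. It assumes that the characters emitted by the probe program during one run have been captured into a NUL-terminated string; this log format and the names confirmed_faults and fault_class are assumptions made for illustration. Since the probe prints '#' only when both checks agree, counting that character gives the number of confirmed faults in the run.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the confirmed faults in the captured output of one probe run.
 * The probe prints '?' when the first check fails and '#' only when the
 * redundant check fails as well, so '#' marks a confirmed fault. */
size_t confirmed_faults(const char *run_output)
{
    assert(run_output != NULL);
    size_t count = 0;
    for (const char *p = run_output; *p != '\0'; p++) {
        if (*p == '#')
            count++;
    }
    return count;
}

/* Maps a fault count onto the classes used for the histogram:
 * 0 = correct run, 1 = one fault, 2 = two faults, 3 = three or more. */
unsigned fault_class(size_t faults)
{
    return faults >= 3 ? 3u : (unsigned)faults;
}
```

Accumulating fault_class over all the runs taken at one voltage level, and dividing by the number of runs, would yield one column of a per-voltage distribution like the one described above.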
C. Fault Type Characterisation
Having ascertained the possibility of injecting a single fault
per computation, we proceed to investigate which kind of fault
has actually hit the computation, through characterising its
effect on the executed code.
Analysing the binary at assembly level, all the possible
instructions executed by a CPU may be split into three
categories according to the architectural units composing the
CPU which are used to complete them. The three categories
are arithmetical-logical operations, memory operations and
branch instructions. Since memory instructions represent the
most expensive operation class in terms of power consumption,
they are presumably the most vulnerable to underfeeding issues.
In order to ascertain this, we recompiled the same probe
program instructing the compiler to keep the variables in the
CPU registers during the whole computation in order to avoid
any memory operations while computing the result. Only the
instruction cache of the CPU was enabled, thus leaving only
the data loading operations uncached. The execution of the
tuned program showed no faults, thus indicating that the wrong
values detected by the checks were to be ascribed solely to
memory operations, while both arithmetical-logical and branch
instructions ran correctly regardless of the voltage drop. All
the experiments were conducted by collecting erroneous
outputs after the binary had been running for a couple of
seconds: this allowed the caching of the whole probe program
due to its tiny size.
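The contrast between the memory-bound and the register-bound probe can be sketched in C as follows. The text does not state how the compiler was instructed to keep the variables in registers, so the use of the volatile and register qualifiers below is an assumption chosen for illustration, as are the function names. Note that register is only a hint, and that an optimising compiler may fold the second loop away entirely, so the generated assembly should be inspected before trusting such a probe.

```c
#include <assert.h>

/* Memory-bound variant: 'volatile' forces every access to the counter to be
 * an actual load or store, so each iteration exercises the data path. */
unsigned long probe_in_memory(unsigned long iterations)
{
    assert(iterations > 0);
    volatile unsigned long a = 0;
    unsigned long faults = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        a = a + 1;
        if (a != i + 1) {
            faults++;
            a = i + 1;   /* repair the counter, as the original probe does */
        }
    }
    return faults;
}

/* Register-bound variant: 'register' asks the compiler (as a hint only) to
 * keep the counter out of memory, so no data loads should be issued. */
unsigned long probe_in_registers(unsigned long iterations)
{
    assert(iterations > 0);
    register unsigned long a = 0;
    unsigned long faults = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        a = a + 1;
        if (a != i + 1) {
            faults++;
            a = i + 1;
        }
    }
    return faults;
}
```

On correctly fed hardware both functions return 0; according to the results above, only a variant that performs data loads would be expected to report faults when the core is underfed.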
The low voltage fault immunity shown by the CPU registers
is to be ascribed to the low capacitance design of their
implementation, which yields faster switching times than the
average logic, thus compensating partially for the slowdown
induced by the lapse in the supply. This feature is mandated
by the architectural need of providing fast accesses to the
component, which is critical in order to design efficient units
and is thus to be expected in all the common CPUs.
Since only the memory operations are affected by faults,
the next natural step was to check if all the instructions (i.e.
both loads and stores) were equally affected. In order to
distinguish which kind of memory instructions are affected by
faults, it is possible to use a register-held value as a fault free
reference for computing checks. We set up a probe program
which stored to and loaded from the memory zero-filled and
one-filled words, and ran it multiple times while sweeping the whole
voltage range of the previous campaign.
The only instructions to report faulty results were the
load instructions, while all the store instructions were safely
performed. This behaviour may be sensibly ascribed to the
fact that only the memory operations which place values
on the underfed part (i.e. the load operations which place
information in the registers) suffer from the lapse in the power
supply. On the other hand the store operations place the
data on a properly fed part of the architecture. Moreover, the
ARM926EJ-S (similarly to all modern CPUs) is endowed with
a 16 entry write buffer between the CPU and the memory,
in order to perform aggressive instruction reordering. The
presence of the buffer helps to cut down the capacitive load
of the path to the memory and thus helps to perform correct
writes.
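One possible way to tell faulty loads from faulty stores with a register-held reference is sketched below in C. The text does not describe how the two cases were separated, so the re-read heuristic used here is an assumption: a mismatch that disappears when the word is loaded again is attributed to the load, while a mismatch that persists is attributed to the stored value. The type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t load_faults;   /* mismatch gone on re-read: transient, load side */
    size_t store_faults;  /* mismatch persists on re-read: memory holds a wrong value */
} mem_probe_result;

/* Stores 'pattern' into each word of buf and loads it back, comparing against
 * a reference copy that is intended to stay in a register. 'volatile' makes
 * every buf[k] access an actual memory operation. */
mem_probe_result probe_loads_and_stores(volatile uint32_t *buf, size_t words,
                                        uint32_t pattern)
{
    assert(buf != NULL);
    mem_probe_result r = {0, 0};
    const uint32_t reference = pattern;
    for (size_t k = 0; k < words; k++) {
        buf[k] = reference;              /* store under test */
        uint32_t first = buf[k];         /* load under test */
        if (first != reference) {
            uint32_t second = buf[k];    /* second load to classify the fault */
            if (second == reference)
                r.load_faults++;
            else
                r.store_faults++;
        }
    }
    return r;
}
```

Calling the function once with 0xFFFFFFFF and once with 0 mirrors the one-filled and zero-filled words mentioned above. A persistent mismatch could in principle also come from two consecutive faulty loads, so the classification is only indicative.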
D. Fault Spread
The experiments run up to now characterise the faults as
affecting only load instructions and, as far as their number
in a single execution goes, as depending on the supplied voltage.
We now want to investigate whether the faulty behaviour
of the load instructions depends on the memory
address from which the load is performed.
In order to understand this key point, a probe program was
designed to overwrite an array of one million 32-bit words with
1s, and subsequently to check the values which were loaded
back into the registers, while keeping the voltage in the single
fault functioning range. To avoid any possible disturbances,
during this test the data cache of the ARM9 processor was
disabled, thus forcing the CPU to load each value from the
main memory.
Figure 5 shows the number of faults that occurred while performing
$10^6$ load operations of a one-filled 32-bit integer
from the aforementioned array. In order to analyse the data,
the probed memory was clustered into 40 kB wide zones. We
encountered 1864 faulty loads while running the program,
thus we expect an average of 18.64 faults per zone
in case the faults fall uniformly over the address space. The
dashed line in Fig. 5 indicates the expected number of faults
occurring for each zone, assuming a uniform distribution of
the faults over the memory.
Figure 5. Distribution of the quantity of the injected faults as a function
of the position in the address space. The dashed line indicates the expected
average value in the hypothesis they are uniformly spread (x axis: memory
address space; y axis: number of faults).
To confirm the hypothesis of a
uniform distribution of the faults over the whole address space,
we modelled the position hit by the fault as a random variable
and we conducted a Pearson $\chi^2$ test to assess the goodness
of fit. The results confirmed our hypothesis with a confidence
higher than 99.99%.
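The statistic behind such a goodness of fit test is straightforward to compute. The C sketch below evaluates the Pearson $\chi^2$ statistic of a set of per-zone fault counts against a uniform expectation; the function name and the array layout are illustrative assumptions. With the figures reported above (1864 faults and an expected 18.64 faults per zone) the array would hold 100 counts.

```c
#include <assert.h>
#include <stddef.h>

/* Pearson chi-square statistic for the null hypothesis that the faults are
 * uniformly spread over the zones:
 *   chi2 = sum over zones of (observed - expected)^2 / expected,
 * with expected = total faults / number of zones. */
double chi_square_uniform(const unsigned *observed, size_t zones)
{
    assert(observed != NULL && zones > 0);
    unsigned long total = 0;
    for (size_t z = 0; z < zones; z++)
        total += observed[z];
    assert(total > 0);

    double expected = (double)total / (double)zones;
    double chi2 = 0.0;
    for (size_t z = 0; z < zones; z++) {
        double d = (double)observed[z] - expected;
        chi2 += d * d / expected;
    }
    return chi2;
}
```

The sketch stops at the statistic: turning it into a confidence level requires the $\chi^2$ distribution with (zones - 1) degrees of freedom, for instance from a table or a statistics library, which is not shown here.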
E. Error Characterisation and Effects of Frequency Scaling
After fully characterising the frequency and the conditions
of occurrence of the faults, the natural target of the investigation
becomes characterising the kind of errors induced
in the computations by the faults. Through analysing the
data collected during the last experiment, we were able to
notice that all the faulty loads were affected by flip-downs in
the bit value loaded. To ascertain whether only flip-downs
were possible, we ran the same memory exploration program
changing the loaded value to both a zero-filled 32-bit word and
to some random values. In all the cases only bit flip-downs
occurred, and there was never a single instance of a flip-up.
When analysing the position of the bits which are flipped down
during the faulty loads, we detected that only a very small
number of flip-down patterns were present (namely 4) and
one of them accounted for more than 50% of the fault
occurrences. When repeating the tests on different boards, the
recurring patterns changed, but their number and frequency
did not, allowing us to deduce that some bits within a word
are more sensitive to flip-downs when the CPU performs a
load operation while undervolted. This may be ascribed to the
different capacitive load of the signal lines of the CPU, due
to routing issues which may force the I/O lines for a register
to have different lengths.
To complete the analysis of the error patterns, we
decided to run the same error pattern detection experiment
while varying the CPU frequency within the allowed
working range. By piloting the clock generator on the
board we were able to scale the frequency of the CPU,
mimicking a real world scenario where the ARM processor
is often run at frequencies lower than the maximum allowed
in order to save power. It is possible to choose among a
number of frequency settings which alter globally the working
frequency by writing to the PLL (Phase Locked Loop)
generator register interface. This causes both the board and the
CPU to switch their working frequency: the board is run at
the frequency written in the register while the CPU is run at
twice the set value.
The clock setting is retained until the board is rebooted,
but it is possible to customise the deployment model in
order to either lock it permanently or to leave the frequency
scaling to the operating system. We wanted to investigate
whether the faulty behaviour had any changes while working
in different frequency environments, therefore we locked the
running frequency in order to collect homogeneous samples
of the behaviours.
In a real world scenario this may happen to be the actual
working environment, since it is quite common to lock the
CPU frequency at a lower value than the maximum allowed in
order to save power. The possibility that the frequency choice
is left to the operating system does not impair our analysis
since the CPU will be running at a constant frequency in
discrete timeslices, thus reporting the same faulty behaviour
per timeslice.
Table I reports the result of the experiments performed
and shows how, regardless of the frequency at which we are
running, the error patterns are few and characterised by one,
which is dominant as far as the occurrence frequency goes.
Table I
NORMALIZED FREQUENCIES OF THE DIFFERENT ERROR PATTERNS PER
CPU CLOCK SETTING
CPU Clock [MHz]   Loaded Pattern            Frequency [%]
140               {3}                       58.76
140               {21}                      10.01
140               {3,21}                    31.22
224               {21}                      100.0
266 (Full)        {10,16}                   38.72
266 (Full)        {10}                      53.23
266 (Full)        {9,10,11,12,15,16,21}     2.95
266 (Full)        {9,10,16}                 3.34
F. Effects of the Errors on the Computation
After fully characterising the kind of errors induced by our
fault injection technique, the last part of this enquiry sums
up the possible effects on the computation caused by such
errors. Albeit originating from the same cause, i.e. faulty
load operations, we may distinguish two different effects of
the faults depending on whether the load was related to an
instruction fetch or to a data load. In the latter case a data load
error occurs, while in the former case instruction swapping
may occur. For the sake of clarity, we will deal separately with
the two outcomes in order to distinguish their possible effects
on the computation of cryptographic primitives.
1) Data Load Errors: Data related errors are representable
as a transient change in the value of a $t$-bit wide variable $c$
during an execution. In particular they are single bit flip-downs
placed in a fixed position within the microprocessor word. The
faulty value $\tilde{c}$ equals the correct one $c$ minus a power of two,
$\tilde{c} = c - 2^{\varepsilon}$, where $\varepsilon$ is the position of the fault.
Possible values of $\varepsilon$ are expressed in the form
$\varepsilon = kw + i$ with $w$ equal to the word length,
$i \in [0, w-1]$ and $k \in [0, t/w]$. These changes in
the loaded value are very precise in the way they cause the
alterations and therefore may easily leak sensitive information,
as it will be shown by successfully conducting attacks in the
next sections.
2) Instruction Swap Errors: Bit flipping during an instruction
fetch may alter either the opcode or the arguments
of the instructions, depending on which bit is affected by
the flip-down. In particular, the affected instructions will be
transformed into the ones having a binary encoding differing
only by a flip-down of the faulty bit. In the case of the ARM
architecture, this may result in either a swap of one kind of
instruction with another one or in a reversal of the triggering
condition of a conditional instruction.
An example of a possible instruction swap through a single
bit flip-down is the following one:
AND R1,R1,#0x42 // Fault Free
EOR R1,R1,#0x42 // Faulty
Since the "and" and the "exclusive-or" instructions have a radically different behavior, it is possible to alter the inner working of the algorithm through swapping them, thus possibly leading to the computation of a weaker version.
Given that the ARM architecture allows the conditional execution of all the arithmetic-logical instructions, and stores the kind of condition in a suffix of the opcode, as Figure 6 depicts, it is possible for the error to actually invert the condition of the predicated instruction. For instance, in the following code sample, the two instructions share the same opcode except for the zero-condition bit setting:
ADDNE R1,R1,#0x42 // Fault Free
ADDEQ R1,R1,#0x42 // Faulty
Figure 6. Excerpt from the ARM ISA description [5] depicting branch and data processing instructions. Grayed areas point out the interesting fields during fault injection.
This behaviour could lead to misexecutions of the algorithm leaking significant content, especially if the conditional instructions are directly related to the key value (e.g. in the common square-and-multiply algorithm used to perform fast exponentiation).
Moreover, since the branch instructions also rely on the same condition bits as the other conditional instructions, the control flow of the program may be equally altered if the condition bit of a branch is flipped, as in the following sample:
BNE LOOP // Fault Free
BEQ LOOP // Faulty
This kind of alteration may lead to substantial control flow
alterations, which can turn into lowering the number of times
a loop is executed or skipping it altogether, thus providing
substantial reductions in the complexity of a cryptographic
primitive computed on the device.
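The three instruction pairs above can be checked at the encoding level. The following is a minimal sketch, assuming the classic 32-bit ARM (A32) layout with the condition field in bits 31-28 and the data-processing opcode in bits 24-21; the helper names (encode_dp_imm, encode_branch, differing_bits) are illustrative and do not come from the paper. It only reports which bit separates each pair, not the probing procedure used on the chip.

```python
# Condition codes and data-processing opcodes of the A32 encoding
# (only the ones needed for the examples in the text).
COND = {"EQ": 0b0000, "NE": 0b0001, "AL": 0b1110}
OPCODE = {"AND": 0b0000, "EOR": 0b0001, "ADD": 0b0100}


def encode_dp_imm(op, cond, rd, rn, imm8):
    """Encode a data-processing instruction with an unrotated 8-bit immediate."""
    return ((COND[cond] << 28) | (1 << 25) | (OPCODE[op] << 21)
            | (rn << 16) | (rd << 12) | (imm8 & 0xFF))


def encode_branch(cond, offset24=0):
    """Encode a branch (B<cond>) with a 24-bit word offset."""
    return (COND[cond] << 28) | (0b101 << 25) | (offset24 & 0xFFFFFF)


def differing_bits(a, b):
    """Return the positions of the bits in which two encodings differ."""
    x = a ^ b
    return [i for i in range(32) if (x >> i) & 1]


and_insn = encode_dp_imm("AND", "AL", 1, 1, 0x42)
eor_insn = encode_dp_imm("EOR", "AL", 1, 1, 0x42)
addne = encode_dp_imm("ADD", "NE", 1, 1, 0x42)
addeq = encode_dp_imm("ADD", "EQ", 1, 1, 0x42)
bne, beq = encode_branch("NE"), encode_branch("EQ")

for name, a, b in [("AND/EOR", and_insn, eor_insn),
                   ("ADDNE/ADDEQ", addne, addeq),
                   ("BNE/BEQ", bne, beq)]:
    print(f"{name}: {a:08X} vs {b:08X}, differing bits {differing_bits(a, b)}")
```

Each pair differs in exactly one bit, which is why a single faulty bit in the fetched word is enough to turn one instruction into the other.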
We have been able to reproduce all the aforementioned alterations on our chip samples by running the probing programs without enabling the instruction cache, thus forcing the code loading operations to be performed directly from the memory. Since the alterations are chip dependent, the exploitation of this kind of fault requires knowing precisely which bit is affected by the fault, thus determining which instruction swaps are performed. Nonetheless, since our probing methodology does not compromise the computing architecture, it is possible to scan a sample chip in order to understand which of these code mutations are performed and devise specific attacks.
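The data load error model described earlier in this section (a single-bit flip-down at position ε = kw + i) can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the behaviour when the targeted bit is already 0 (the value is left unchanged) is an assumption consistent with a flip-down, not a statement taken from the measurements.

```python
def fault_position(k, i, w=32):
    """Bit position eps = k*w + i of the fault inside a multi-word variable."""
    return k * w + i


def flip_down(c, eps):
    """Force bit eps of c to 0.

    When the bit was 1 the result equals c - 2**eps, as in the model;
    when it was already 0 the value is (by assumption) unaffected.
    """
    return c & ~(1 << eps)


c = 0xDEADBEEF
eps = fault_position(k=0, i=5)        # bit 5 of the first word
faulty = flip_down(c, eps)
print(hex(c), "->", hex(faulty), "difference:", c - faulty)
```

Because the difference between correct and faulty value is either 0 or a known power of two, an observer who knows ε learns the value of one bit of the loaded word from each faulty run.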
III. RELATED WORK
The open literature does not provide any examples of attacks to a general purpose CPU; instead, all the known contributions are focused on smaller computing devices such as microcontrollers and smart cards. We may distinguish the practical attacks to real world systems according to the techniques proposed to inject the faults, since these directly affect the fault model attainable and therefore the applicable attacks.
A first technique relies on altering the state of a Microchip PIC16F84 by directly irradiating the silicon die with a concentrated light beam, either polarized (laser beams) or unpolarized (common flashes) [29]. The beam is usually timed in order to achieve changes in the values stored in SRAM cells, thus allowing either modification of, or inferences on, the values previously contained. The alterations may be as precise as a single bit, assuming it is possible to focus the beam on a spot as wide as a single gate. This constraint is becoming very difficult to comply with, since the new etching technologies are able to print gates narrower than the visible wavelengths. Moreover, this fault induction technique relies on depackaging the chip, thus leaving evident traces of the tampering involved.
A reasonable way to avoid depackaging the chip is the recent fault injection technique based on EM disturbances proposed by Schmidt and Hutter [25], using an 8-bit microcontroller with 256 bytes of RAM as testbed device. The injection of faults is achieved through small electrical discharges generated near the sensitive device with the help of a pair of small electrodes. The technique can be timed, although not with clock cycle accuracy, but there is no way to direct the fault in a precise manner. Moreover, packages providing built-in EM shielding (e.g. grounded metal heat spreader ones) are able to thwart the attack.
As far as the techniques which do not damage the package go, it might be possible to insert phase shifts on the clock line through manipulating the position of the rising and falling edges. This tampering may induce instruction skipping in smart cards, therefore altering the control flow of the algorithm and possibly leaking sensitive information. A description of such a technique is presented in [4].
Another transient fault induction technique relies on the capability of altering the yield of the power supply line. A first method consists of inserting tiny, well-timed glitches, realized with either spikes or temporary brownouts, aimed at disrupting the value held on the input lines of flip-flops during their setup time. This causes incorrect values to be stored in latches, thus possibly resulting in either instruction skips or data corruptions. A practical example of an attack brought to a plain square-and-multiply RSA software implementation on an AVR microcontroller through this technique is given in [24].
Another method of injecting faults relies on constantly underfeeding a device to alter the values stored by its bistables due to the slowdown in the logical gate setup time. While this has never been tried for a full CPU, in [27] the authors report a faulty behavior of the lines at the end of the longest combinatorial cones of an ASIC AES coprocessor embedded in a smart card, and exploit it in order to carry out a successful attack using the method proposed by Piret et al. in [23].
For a full-fledged collection of works on the subject, we refer the interested reader to [9]–[11].
IV. SYMMETRIC KEY APPLICATIONS - ATTACKS TO AES
A. Overview of the AES Block Cipher
The Advanced Encryption Standard (AES) [22] is a symmetric cryptographic algorithm originally requested and adopted by the National Institute of Standards and Technology (NIST) to replace the ageing Data Encryption Standard (DES) [21]. AES is an iterated block cipher which corresponds to a block-size-restricted version of Rijndael [15], and can encrypt and decrypt 128-bit wide plaintext blocks using a key whose size may be 128, 192 or 256 bits. The Rijndael cipher was chosen among the other final candidates due to its ease of implementation on a wide range of 8-bit to 32-bit computing platforms as well as to its being amenable to high-performance ad hoc hardware implementations. Moreover, the clarity and compactness of its design allowed a wide cryptanalytic scrutiny that helped to strengthen the confidence in its security level. In software, AES can be implemented with a fully symmetric structure using only bitwise XOR operations, table lookups and 1-byte shifts [15].
The cipher is designed to execute a number of round transformations on the input plaintext, where the output of each round is the input to the next one. The number of rounds r is determined by the key length: a 128-bit key uses 10 rounds, a 192-bit key 12 and a 256-bit key 14. Each round is composed of the same steps, except for the first, where an extra addition of a round key is inserted, and for the last, where the last step (MIXCOLUMN) is skipped. Each step operates on 16 bytes of data (referred to as the internal state of the cipher) generally viewed as a 4×4 matrix of bytes or an array of four 32-bit words, where each word corresponds to a column of the state table. The four round stages are: ADDROUNDKEY (XOR addition of a scheduled round key for blending together the key and the state), SUBBYTE (byte substitution by an S-box, i.e. a full lookup table for a non-linear function), SHIFTROW (cyclical shifting of bytes in each row to realise an inter-word byte diffusion), and MIXCOLUMN (linear transformation which mixes column state data for intra-word inter-byte diffusion).
The specification of the AES algorithm includes the description of a KEYSCHEDULE procedure which is responsible for the computation of each 16-byte round key k_j given the global input key k. The AES key scheduling process expands the cipher key k into a total of 4(r+1) 32-bit words with r ∈ {10,12,14} according to whether the cipher key length s is equal to 4, 6 or 8 words, respectively. The resulting key schedule consists of a linear array of 32-bit words, denoted W[0,...,4(r + 1) − 1]. The first s words of W are loaded with the user supplied key. The remaining words of W are updated according to the following rule:
for i = s,...,4(r + 1) − 1 do
    if i ≡ 0 mod s then
        W[i] = W[i − s] ⊕ S[W[i − 1] <<< 8] ⊕ RCON[i/s]
    else if s = 8 and i ≡ 4 mod s then
        W[i] = W[i − s] ⊕ S[W[i − 1]]
    else
        W[i] = W[i − s] ⊕ W[i − 1]
where RCON[...] is an array of predetermined constants, S[...] is the array of precomputed constants corresponding to the substitution map of the cipher, and <<< denotes the rotation of the word to the left by one byte.
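The key schedule rule above can be sketched in a few lines. This is a minimal sketch under stated assumptions: words are handled as big-endian 32-bit integers, the S-box is generated from its algebraic definition instead of being stored as a constant table, and the names gf_mul, make_sbox and expand_key are illustrative rather than taken from any specific implementation.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    """Build the AES S-box: field inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()


def sub_word(w):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w):
    """Rotate a 32-bit word to the left by one byte (the <<< 8 of the text)."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def expand_key(key):
    """Return the 4(r+1) words W of the key schedule for a 16/24/32-byte key."""
    s = len(key) // 4                 # key length in words: 4, 6 or 8
    r = s + 6                         # number of rounds: 10, 12 or 14
    W = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(s)]
    rcon = 1                          # RCON[1]; doubled in GF(2^8) at each use
    for i in range(s, 4 * (r + 1)):
        t = W[i - 1]
        if i % s == 0:
            t = sub_word(rot_word(t)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)
        elif s == 8 and i % s == 4:
            t = sub_word(t)
        W.append(W[i - s] ^ t)
    return W


W = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(len(W), f"{W[4]:08x}", f"{W[43]:08x}")
```

The key used in the usage lines is the AES-128 example key of the standard's key expansion appendix, so the printed words can be compared against the published expansion.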
The enciphering procedure is amenable to several software
implementations which tradeoff memory and computational
resources in order to obtain the best performance for the
specific architecture.
Specifically, the different steps of the round transformation can be combined in a single set of table lookups, allowing for very fast implementations on processors having word length of 32 bits or greater [15]. Let us denote with a_{i,j} the generic element of the state table, with a the generic value of a byte variable, with S[0,...,255] the 256 bytes of the S-box table and with • a GF(2^8) finite field multiplication [15]. Let T0, T1, T2 and T3 be four lookup tables, each viewed as a sequence of 256 32-bit words, containing results from the combination of the round operations as follows:

\[
T_0[a] = \begin{bmatrix} S[a] \bullet 02 \\ S[a] \\ S[a] \\ S[a] \bullet 03 \end{bmatrix}, \quad
T_1[a] = \begin{bmatrix} S[a] \bullet 03 \\ S[a] \bullet 02 \\ S[a] \\ S[a] \end{bmatrix}, \quad
T_2[a] = \begin{bmatrix} S[a] \\ S[a] \bullet 03 \\ S[a] \bullet 02 \\ S[a] \end{bmatrix}, \quad
T_3[a] = \begin{bmatrix} S[a] \\ S[a] \\ S[a] \bullet 03 \\ S[a] \bullet 02 \end{bmatrix}
\]

These tables are used to compute the round stage operations as a whole, as described by the following equation, where k_j is the jth word of the expanded key and A_j = ⟨a_{0,j}, a_{1,j}, a_{2,j}, a_{3,j}⟩ is the jth column of the state table considered as a single 32-bit word (with abuse of notation: A_j = A_{j mod 4}, a_{i,j} = a_{i,j mod 4}):

\[
A_j = T_0[a_{0,j}] \oplus T_1[a_{1,j-1}] \oplus T_2[a_{2,j-2}] \oplus T_3[a_{3,j-3}] \oplus k_j
\]
The four tables T0, T1, T2 and T3 (called T-boxes from now on) take up 4 KB of storage space and their main goal is to avoid performing the MIXCOLUMN and INVMIXCOLUMN transformations, as these operations, in the original definition of the Rijndael algorithm, perform Galois Field multiplications by fixed constants which map poorly to general purpose processors in terms of performance.
Notably, in the final round of the cipher there is no MIXCOLUMN operation, and the KEYSCHEDULE algorithm also requires pure substitution operations. Whilst these facts could represent an impairment in the use of T tables, it is possible to extract the S table efficiently through proper masking of the T tables.
Since the T-boxes may also be derived through rotating each word of T0 by i bytes, T_i[a] = ROTBYTE(T_0[a], i), i ∈ {0,...,3}, to reduce the active memory footprint used within each round, each column of the state table may also be computed as:

\[
A_j = T_0[a_{0,j}] \oplus \mathrm{ROTBYTE}(T_0[a_{1,j-1}], 1) \oplus \mathrm{ROTBYTE}(T_0[a_{2,j-2}], 2) \oplus \mathrm{ROTBYTE}(T_0[a_{3,j-3}], 3) \oplus k_j
\]

This variation reduces the lookup tables to a single 1 KB one, thus lowering the burden on the caches, while incurring a penalty of only three extra rotates per column per round with respect to the 4 T-box implementation.
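The relation between the four T-boxes and the single-table variant can be checked numerically. The sketch below assumes that each table entry packs the column top-to-bottom into a big-endian 32-bit word, so that ROTBYTE becomes a right rotation by whole bytes; this packing convention and the helper names are assumptions made for illustration.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    """Build the AES S-box: field inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def pack(b0, b1, b2, b3):
    """Pack a column (top byte first) into one 32-bit word."""
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3


def rot_byte(word, i):
    """Rotate a 32-bit word by i bytes towards the least significant end."""
    n = 8 * (i % 4)
    return ((word >> n) | (word << (32 - n))) & 0xFFFFFFFF


SBOX = make_sbox()
T0 = [pack(gf_mul(s, 2), s, s, gf_mul(s, 3)) for s in SBOX]
T1 = [pack(gf_mul(s, 3), gf_mul(s, 2), s, s) for s in SBOX]
T2 = [pack(s, gf_mul(s, 3), gf_mul(s, 2), s) for s in SBOX]
T3 = [pack(s, s, gf_mul(s, 3), gf_mul(s, 2)) for s in SBOX]
T = [T0, T1, T2, T3]

# Every T_i is a byte rotation of T_0, so a single 1 KB table is sufficient.
single_table_ok = all(T[i][a] == rot_byte(T0[a], i)
                      for i in range(4) for a in range(256))
storage_bytes = sum(len(t) for t in T) * 4
print(single_table_ok, storage_bytes, f"{T0[0]:08x}")
```

The S-box itself can be read back from any T-box by masking one of the bytes that carries S[a] unmultiplied, e.g. (T0[a] >> 8) & 0xFF under this packing.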
Decryption requires different tables from encryption; therefore an AES implementation able to perform both encryption and decryption may require up to 8 KB of additional memory, which may extend to 16 KB if the last round operations are realised with ad hoc tables.
When employing general purpose CPUs, endowed with wide D-caches, the T-box implementation proves more effective since the memory access latency is lower than the computation time that would be required in place of each T-box lookup. On the other hand, in cache-constrained environments a valid alternative to the use of T-boxes is the computation of the entire AES rounds on the processor, memorising only the S-box and the inverse S-box tables needed to perform the substitution operations.
B. Effects of the Low Voltage Induced Errors on AES
Given the error model on the loaded data presented in Section II-F, we may expect that the errors induced during the computation of the AES cipher affect the results through alterations in the values loaded during each memory lookup. In particular, since the corruption is characterised by a single-bit flip-down, we may safely assume that only a single byte of the state of the cipher is affected by a lone fault. Since the attack strategy proposed by Piret and Quisquater in [23] for the AES-128 cipher works under the hypothesis of a single byte error, it correctly fits the error characterisation we provided in Section II-F, and thus provides a proper framework to lead a successful attack. The attack works under a known ciphertext assumption, only requiring pairs of faulty and fault-free ciphertexts generated from the same plaintext for each pair. The goal is to derive the cipher key using the informative content of the last round key, which is feasible as far as AES-128 goes. We have extended the attack technique proposed in [23] to recover any round key of the AES cipher, regardless of the key scheduling algorithm (i.e. regardless of whether the round keys are computed through the standardised KEYSCHEDULE algorithm or filled completely with a much longer cipher key), the key length or the number of rounds (even if exceeding the number of rounds set by the standard).
C. Piret and Quisquater's Attack to AES-128
The error hypothesis assumed by Piret and Quisquater in [23] considers the corruption of a single byte value between the last and the last-but-one MIXCOLUMN computation. The standard sets the number of rounds for AES-128 to r = 10, expands the cipher key into r + 1 round keys and removes the MIXCOLUMN operation from the last round. Therefore a single corrupted byte value must be computed either during the execution of the 8th round ADDROUNDKEY operation or during the execution of the 9th round SUBBYTE and SHIFTROW operations.
Given a faulty ciphertext, c̃ = {c̃_u, u ∈ {0,...,15}}, and a fault-free one, c = {c_u, u ∈ {0,...,15}}, the possible differences evaluated just before the last MIXCOLUMN add up to 255×16 different values. Such values can be listed through enumerating all the state tables resulting from changing a single fixed-position byte value, and then repeating the change for each one of the 16 bytes composing the state table, i.e.: ∆ = { ⟨δ_0,0,...,0⟩, ⟨0,δ_1,0,...,0⟩, ⟨0,...,δ_u,...,0⟩, ..., ⟨0,...,δ_15⟩ } with 0 ≤ u ≤ 15 and 1 ≤ δ_u ≤ 255. The inter-byte diffusion operated by the MIXCOLUMN maps bijectively each difference value into another, thus obtaining another set of differential state tables with the same cardinality as ∆: ∆′ = { ⟨δ′_0,...,δ′_u,...,δ′_15⟩ } with 0 ≤ u ≤ 15 and 1 ≤ δ′_u ≤ 255.
A base algorithm that summarises the main steps of the attack is described by Algorithm IV.1. The algorithm takes as input the list of all the differences that may occur just after the last MIXCOLUMN operation: ∆′. As a first step, the algorithm records a fault-free ciphertext c and a faulty one c̃ of the same, unknown, plaintext. Then, for each possible value of the last round key, k, it computes the difference δ′ between the state tables corresponding to c and c̃ just after the last MIXCOLUMN operation:
δ′ = INVSUBBYTE(c ⊕ k) ⊕ INVSUBBYTE(c̃ ⊕ k)
If δ′ is included in the set ∆′ then the value of the corresponding subkey k is inserted in a list L of candidate keys. Subsequently, until L contains only a single key, another pair of faulty and fault-free ciphertexts generated from an unknown plaintext is collected. Then, for each candidate key in L the differential value corresponding to the faulty and fault-free ciphertexts is computed. If the differential value is not included in ∆′ the key is removed from the list of candidates, L. At the end of this sieving phase the list L will contain a single value for the last round key. Since the KEYSCHEDULE algorithm uses only invertible operations, the knowledge of the last round key k̄ is sufficient to retrieve the global input key.

Algorithm IV.1: BASE ALGORITHM
Input: ∆ = { ⟨δ_0,0,...,0⟩, ⟨0,δ_1,0,...,0⟩, ⟨0,...,δ_u,...,0⟩, ..., ⟨0,...,δ_15⟩ } with 0 ≤ u ≤ 15, 1 ≤ δ_u ≤ 255, and |∆| = 255 × 16.
Output: k̄: last round subkey
Data: All the states are represented through a 4 × 4 matrix, the cells are enumerated from top-left to bottom-right
begin
 1    ∆′ = {δ′ | δ′ ← MIXCOLUMN(δ), ∀δ ∈ ∆}
 2    Record a faulty ciphertext c̃ and a fault-free one c
      /* Set up of Candidate-Keys List */
 3    L ← ∅
 4    foreach k ∈ {0,...,2^128 − 1} do
 5        δ′ ← INVSUBBYTE(c ⊕ k) ⊕ INVSUBBYTE(c̃ ⊕ k)
 6        if δ′ ∈ ∆′ then
 7            L ← L ∪ {k}
      /* Key Selection Phase */
 8    while |L| > 1 do
 9        Record a faulty ciphertext c̃ and a fault-free one c
10        foreach k ∈ L do
11            δ′ ← INVSUBBYTE(c ⊕ k) ⊕ INVSUBBYTE(c̃ ⊕ k)
12            if δ′ ∉ ∆′ then
13                L ← L \ {k}
14    return k̄    /* L = {k̄} */
15 end

Obviously, the computational complexity of Algorithm IV.1 is not practical since a scan over the whole key space is required (see lines 4–7). However, to initially fill the list of candidate keys L the authors of [23] proposed an experimental heuristic which considerably reduces the overall complexity of the differential fault attack in practice. Algorithm IV.2 reports the heuristic used to set up the candidate-keys list and replaces the impractical procedure reported in lines 4–7 of Algorithm IV.1.
The key intuition behind the candidate-key sieving procedure is the following: given the precomputed list ∆′ of all the possible differentials just after the last MIXCOLUMN and given a faulty and a fault-free ciphertext, if a candidate key allows to match a differential in ∆′ then such matching will hold (with high probability) also when considering the ciphertexts and the candidate key having non-zero values only in x ≥ 2 byte positions. In such a way, it can be experimentally shown that the exploration space for a full-length candidate key shrinks very quickly. Actually, in order to set up the list L of candidate keys, Algorithm IV.2 considers two pairs of faulty and fault-free ciphertexts, i.e., ⟨c, c̃⟩ and ⟨d, d̃⟩ (lines 2–3). Then, considering a copy of the ciphertexts, ⟨c′, c̃′⟩ and ⟨d′, d̃′⟩, where only the two leftmost bytes have a non-zero value (lines 5–6), the algorithm fills a temporary list L′ with candidate keys, k, having only the two leftmost bytes with a non-zero value and such that the two leftmost bytes of the differentials
β ← INVSUBBYTE(c′ ⊕ k) ⊕ INVSUBBYTE(c̃′ ⊕ k)
γ ← INVSUBBYTE(d′ ⊕ k) ⊕ INVSUBBYTE(d̃′ ⊕ k)
both match the two leftmost bytes of any differential in ∆′ (lines 7–14).
For each key k in L′ (line 17), a copy of the original ciphertexts having only the 2nd and the 3rd bytes with non-zero value is considered. Moreover, a temporary key k′ having the 2nd byte copied from k and the 3rd byte assuming all values in {0,...,255} is considered. If the 2nd and 3rd bytes of the computed differentials β and γ (lines 27–28) match the corresponding bytes of an element in ∆′ (lines 29–32) then
Algorithm IV.2: CANDIDATE-KEYS SKIMMING [23]
Input: ∆ = { ⟨δ_0,0,...,0⟩, ⟨0,δ_1,0,...,0⟩, ⟨0,...,δ_u,...,0⟩, ..., ⟨0,...,δ_15⟩ } with 0 ≤ u ≤ 15, 1 ≤ δ_u ≤ 255, and |∆| = 255 × 16.
Output: L: list of candidate keys
Data: All the states are represented through a 4 × 4 matrix, the cells are enumerated from top-left to bottom-right
begin
 1    ∆′ = {δ′ | δ′ ← MIXCOLUMN(δ), ∀δ ∈ ∆}
 2    Record a faulty ciphertext c̃ and a fault-free one c
 3    Record a faulty ciphertext d̃ and a fault-free one d
 4    L′ ← ∅
 5    c′ ← ⟨c_0,c_1,0,...,0⟩, c̃′ ← ⟨c̃_0,c̃_1,0,...,0⟩
 6    d′ ← ⟨d_0,d_1,0,...,0⟩, d̃′ ← ⟨d̃_0,d̃_1,0,...,0⟩
 7    foreach (a,b) ∈ {0,...,2^8 − 1}^2 do
 8        k ← ⟨a,b,0,...,0⟩
 9        β ← INVSUBBYTE(c′ ⊕ k) ⊕ INVSUBBYTE(c̃′ ⊕ k)
10        γ ← INVSUBBYTE(d′ ⊕ k) ⊕ INVSUBBYTE(d̃′ ⊕ k)
11        foreach δ, δ′ ∈ ∆′, δ ≠ δ′ do
12            if δ_0 = β_0 AND δ_1 = β_1 AND δ′_0 = γ_0 AND δ′_1 = γ_1 then
13                L′ ← L′ ∪ {k}
14                break
15    L ← ∅
16    while |L′| ≥ 1 do
17        k ← GETITEM(L′)    /* L′ ← L′ \ {k} */
18        for u ← 1 to 15 do
19            c′ ← ⟨0,...,c_u,c_{u+1},0,...⟩
20            c̃′ ← ⟨0,...,c̃_u,c̃_{u+1},0,...⟩
21            d′ ← ⟨0,...,d_u,d_{u+1},0,...⟩
22            d̃′ ← ⟨0,...,d̃_u,d̃_{u+1},0,...⟩
23            match ← false
24            k′ ← ⟨0,...,k_u,0,...⟩
25            foreach b ∈ {0,...,2^8 − 1} do
26                k′_{u+1} ← b
27                β ← INVSUBBYTE(c′ ⊕ k′) ⊕ INVSUBBYTE(c̃′ ⊕ k′)
28                γ ← INVSUBBYTE(d′ ⊕ k′) ⊕ INVSUBBYTE(d̃′ ⊕ k′)
29                foreach δ, δ′ ∈ ∆′, δ ≠ δ′ do
30                    if δ_u = β_u AND δ_{u+1} = β_{u+1} AND δ′_u = γ_u AND δ′_{u+1} = γ_{u+1} then
31                        match ← true
32                        break
33                if match = true then
34                    k_{u+1} ← k′_{u+1}
35                    break
36            if match = false then
37                break    /* Discard k */
38        if match = true then
39            L ← L ∪ {k}
40    return L
41 end
the value of the 3rd byte of k has been found (lines 33–35). The same operations are repeated for all the remaining bytes, until the whole candidate key k has been checked or the candidate key is discarded (lines 18–37). If a full-length candidate key is computed, it is added to the list of candidates L, before analysing another item from list L′.
After building the candidate list L, the selection of the last round key proceeds following lines 8–13 of Algorithm IV.1.
In a real attack scenario the hypothesis of a fault localised between the 8th and the 9th round will be verified less than one time out of r = 10; however, even assuming a useful fault rate of 1 out of 100, experimental evidence demonstrates that the attack is easily mounted against the AES encryption primitive in a few minutes using off-the-shelf equipment.
The method proposed in [23] successfully attacks any SPN-based cipher with a diffusion layer linear with respect to the bitwise xor operation, regardless of whether the diffusion layer achieves perfect diffusion in a single pass or not.
In the case of the AES cipher the diffusion layer is not perfect and only spreads a single-bit difference on a quarter of the inner state (i.e. it diffuses a single byte change over a single column (word) of the inner state). The exploitation of this peculiarity of the AES diffusion layer allows us to conceive a 32-bit word based implementation of the attack, which retrieves the whole last round key in four passes (one for each word of the last round key).
The key idea of the word based algorithm is rooted in the observation that a single byte fault happening before the last MIXCOLUMN operation will affect only four bytes of the ciphertext. It is thus possible to focus on the recovery of a single word of the last round key at a time, thus reducing the candidate space to 2^32 at most.
Algorithm IV.3 details the tailored version of Algorithm IV.1 while retaining the same notation. In Algorithm IV.3, ∆ now contains all the possible one-byte inner state differences for a single word evaluated before the last MIXCOLUMN. Since the differences contained in ∆ are computed on a single word, the ciphertexts, both faulty and fault-free, must be carved taking into account both the position of the target word within the round key under retrieval and the effect of the SHIFTROW operation.
Through iterating Algorithm IV.3 for each of the four words of the last round key it is possible to retrieve it regardless of the original key length used.
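The membership test at the core of the word-oriented attack can be sketched on a single column. The following is a toy model, not the full attack: the SHIFTROW carving is ignored (the four ciphertext bytes are assumed to be already aligned with one key word), the fault is simulated in software, and all function names and the sample values are illustrative assumptions.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    """Build the AES S-box: field inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()
INV_SBOX = [0] * 256
for i, s in enumerate(SBOX):
    INV_SBOX[s] = i


def mix_column(col):
    """MIXCOLUMN applied to one 4-byte column."""
    a0, a1, a2, a3 = col
    return (gf_mul(a0, 2) ^ gf_mul(a1, 3) ^ a2 ^ a3,
            a0 ^ gf_mul(a1, 2) ^ gf_mul(a2, 3) ^ a3,
            a0 ^ a1 ^ gf_mul(a2, 2) ^ gf_mul(a3, 3),
            gf_mul(a0, 3) ^ a1 ^ a2 ^ gf_mul(a3, 2))


# Delta': MIXCOLUMN images of all single-byte differences within one word.
DELTA_PRIME = set()
for pos in range(4):
    for d in range(1, 256):
        diff = [0, 0, 0, 0]
        diff[pos] = d
        DELTA_PRIME.add(mix_column(diff))


def passes(w, w_faulty, v):
    """Candidate test for a key word v on a fault-free/faulty word pair."""
    diff = tuple(INV_SBOX[a ^ k] ^ INV_SBOX[b ^ k]
                 for a, b, k in zip(w, w_faulty, v))
    return diff in DELTA_PRIME


def simulate(x, pos, d, k_prev, k_last):
    """Toy last rounds on one column: fault d injected in byte pos before MIXCOLUMN."""
    x_faulty = list(x)
    x_faulty[pos] ^= d

    def encrypt(col):
        mixed = mix_column(col)
        return tuple(SBOX[m ^ a] ^ b for m, a, b in zip(mixed, k_prev, k_last))

    return encrypt(x), encrypt(x_faulty)


k_prev, k_last = (0x10, 0x32, 0x54, 0x76), (0xB6, 0x63, 0x0C, 0xA6)
w, w_faulty = simulate((0x01, 0x23, 0x45, 0x67), pos=2, d=0x08,
                       k_prev=k_prev, k_last=k_last)
print(len(DELTA_PRIME), passes(w, w_faulty, k_last))
```

The correct key word always passes because MIXCOLUMN is linear with respect to xor, so the round key preceding the last one cancels out of the difference; wrong words survive only when their difference happens to fall in the 255 × 4 elements of ∆′.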
However the knowledge of the last round key is not enough
in order to derive the full cipher key when its length exceeds
128 bits.
Algorithm IV.3: AES WORD ORIENTED KEY RETRIEVAL
Input: ∆ = { ⟨δ_0,0,0,0⟩, ⟨0,δ_1,0,0⟩, ⟨0,0,δ_2,0⟩, ⟨0,0,0,δ_3⟩ }, ∀ u ∈ {0,1,2,3}, δ_u ∈ {1,...,255}, |∆| = 255 × 4; j ∈ {0,1,2,3} round key word index
Output: v̄, jth word of the last round subkey
begin
 1    ∆′ = {δ′ | δ′ ← MIXCOLUMN(δ), ∀δ ∈ ∆}
 2    Record a fault-free ciphertext, and carve a word w, and a faulty ciphertext, and carve a word w̃, both according to j and taking into account the last SHIFTROW operation
      /* Set up of Candidate-Words List */
 3    L ← ∅
 4    foreach v ∈ {0,...,2^32 − 1} do
 5        δ′ ← INVSUBBYTE(w ⊕ v) ⊕ INVSUBBYTE(w̃ ⊕ v)
 6        if δ′ ∈ ∆′ then
 7            L ← L ∪ {v}
      /* Word Selection Phase */
 8    while |L| > 1 do
 9        Record a fault-free ciphertext, and carve a word w, and a faulty ciphertext, and carve a word w̃, both according to j and taking into account the last SHIFTROW operation
10        foreach v ∈ L do
11            δ′ ← INVSUBBYTE(w ⊕ v) ⊕ INVSUBBYTE(w̃ ⊕ v)
12            if δ′ ∉ ∆′ then
13                L ← L \ {v}
14    return v̄    /* L = {v̄} */
15 end

D. Generalized AES Attack
Figure 7. Impact of a single bit fault between the last-but-one and last-but-two MIXCOLUMN operations
The attack described in the previous section is able to recover only the last round key of the Square [14] based ciphers to which it is applied, thanks to the peculiar structure of the last round. In the case of AES-128 recovering the last round key is also enough to reconstruct the whole key schedule.
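For AES-128, rolling the key schedule back from the last round key can be sketched as follows. This is a minimal illustration under the same conventions as the standard key expansion (big-endian words, ten rounds); the function name invert_schedule_128 is hypothetical, and the S-box is regenerated from its algebraic definition to keep the block self-contained.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    """Build the AES S-box: field inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def sub_word(w):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w):
    """Rotate a 32-bit word to the left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def invert_schedule_128(last_round_key):
    """Recover the AES-128 cipher key from the 16-byte last round key."""
    cur = [int.from_bytes(last_round_key[4 * i:4 * i + 4], "big") for i in range(4)]
    for rnd in range(10, 0, -1):
        prev = [0, 0, 0, 0]
        # W[i] = W[i-4] xor W[i-1]  =>  W[i-4] = W[i] xor W[i-1]
        prev[3] = cur[3] ^ cur[2]
        prev[2] = cur[2] ^ cur[1]
        prev[1] = cur[1] ^ cur[0]
        # First word of the round: undo the S-box/rotation/RCON term,
        # which depends on the last word of the previous round key.
        prev[0] = cur[0] ^ sub_word(rot_word(prev[3])) ^ (RCON[rnd - 1] << 24)
        cur = prev
    return b"".join(w.to_bytes(4, "big") for w in cur)


last_key = bytes.fromhex("d014f9a8c9ee2589e13f0cc8b6630ca6")
print(invert_schedule_128(last_key).hex())
```

The last round key in the usage lines is the one listed in the standard's key expansion example for the key 2b7e1516 28aed2a6 abf71588 09cf4f3c, so the sketch can be checked against that published expansion.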
For all the other key lengths employed in the AES this reconstruction cannot be performed with the last round key only, due to lack of key material. In fact, the key scheduling strategy of AES uniformly spreads the informative content of the cipher key over the whole key schedule in a word-wise fashion (see Section IV-A). It is therefore mandatory to retrieve at least as many words of the key schedule content as the ones composing the cipher key. Moreover, the position of the recovered words needs to be such that they do not contain redundant information. In particular, the knowledge of a consecutive block of words (at least as wide as the cipher key) from the key schedule enables a successful cipher key reconstruction.
Since the aforementioned attack is bound to the lack of the MIXCOLUMN operation in the last round of the cipher, it is not able to actually invert the cipher any further. Thus, if either a different key scheduling strategy is employed (e.g. deriving a round key from a single word of the original key, cycling through the words, instead of using the standard key schedule procedure), or if the key length is extended up to filling the whole key schedule, resulting in an AES employing a 128r-bit wide key, where r is the number of rounds, the previous attack strategy fails to break the cipher.
We devised a new attack technique which is able to successfully pierce a regular round of the AES cipher (i.e. one including the MIXCOLUMN), thus obtaining a method able to roll back the whole cipher and retrieve all the round keys regardless of their mutual relations, derivation strategy or the number of rounds.
We are therefore able to break the AES cipher even when used with the key lengths recommended for Secret and Top Secret documents by NSA (192 or 256 bits). No results of a successful key extraction from either AES-192 or AES-256 are known.
Algorithm IV.1 and Algorithm IV.2 work under a known ciphertext assumption with no particular requirements on the enciphered plaintexts, other than having pairs of faulty and fault-free ciphertexts obtained from the same plaintext.
Our extension will require the enciphered plaintext to be the same for all the faulty ciphertexts needed, while retaining the assumption of not knowing the actual plaintext. This is not particularly hindering in practice since the number of required faulty ciphertexts is very small (16 at most).
Algorithm IV.5 is able to invert both the last (rth) and
the lastbutone (r − 1)th rounds of the AES cipher, thus
retrieving the last two round keys (k(r)and k(r−1)). In order
to recover the last round key we employ Algorithm IV.3 (line
3). Subsequently, using the retrieved round key we invert the
effect of the last round for all the ciphertexts available.
To perform the retrieval of the key k(r−1)we assume that
an erroneous ciphertext is the result of a single byte fault
occurred between the lastbutone MIXCOLUMN (round r−2)
and the lastbuttwo MIXCOLUMN operation (round r − 3).
As depicted in Figure 7, this fault will result in a complete
corruption of the state c(r−1)by the end of the last but one
round, therefore, in order to distinguish the induced errors
respecting our hypothesis from the non useful ones, we need
to cope with the diffusing effect of the last MIXCOLUMN
operation and to eliminate the obfuscation provided by the
(r − 1)th ADDROUNDKEY.
In order to remove the effect of the ADDROUNDKEY the
GETDIFFERENTIAL function (Function IV.4) at first inverts
the last round for a faulty ciphertext ? c(r)(line 3) using the
then computes the difference between the correct and faulty
outputs of the lastbutone round. This differential information
can be safely transformed through an INVMIXCOLUMN since
the diffusion layer is linear w.r.t the xor operation, and
subsequently passed through a INVSHIFTROW primitive to
realign the bytes (line 4).
We are now able to distinguish a the effects of a useful fault
for our purposes through examining the computed differential
value (denoted as δ in Function IV.4) and checking whether it
is nonzero for only a single word as depicted in Figure 7. In
the case the fault is not useful, the function discards the faulty
ciphertext and starts examining a new one. Once a useful fault
has been found, the function GETDIFFERENTIAL returns both
the non zero word differential and its relative position within
the state.
Assuming the fault skimming issues are solved as described,
the attack described in Algorithm IV.5 can be successfully
mounted trying to recover the value of the four words of c(r−2)
after the application of the SUBBYTE primitive (denoted by
s (line 21) and depicted in Figure 7). We will therefore use
four sets of candidates, one for each word of the state matrix
to be recovered (s) (line 2).
After obtaining a fault free ciphertext and applying the
attack proposed by [23] in order to recover the last round
key k(r)(lines 3–4), the effect of the last round is inverted on
the correct ciphertext obtaining c(r−1)(line 5).
In lines 7–13 the four candidates lists are filled one at a time
last round key k(r)which has already been retrieved, and
until they all contain at least a value. The word differential
value $\epsilon$ returned by the GETDIFFERENTIAL function is used
in order to fill the list indexed by the value $m$, also returned
by the same function.
In order to exploit the information provided by knowing
that a single byte fault occurred, we now guess a word $w$ of
the $s$ matrix, combine it with the correct differential $\epsilon$ and
obtain an alleged faulty-correct $(\tilde{w}, w)$ pair of state $s$ words
(line 9). The two words $w$, $\tilde{w}$ may be separately processed
through an INVSUBBYTE operation, since they represent pure
state information, and are used in order to obtain the differential
state value $\zeta$, which represents the alleged difference between
$c^{(r-2)}$ and $\tilde{c}^{(r-2)}$ (lines 10–11).
If the guess on the state word was correct, we are now in
possession of a state differential $\zeta$ which, once processed through
an INVMIXCOLUMN operation, will retain only a single non-zero
byte, in accordance with the verified fault assumption (as
depicted in the first state matrix of Figure 7). In this case,
the guessed state word $w$ is added to the candidate list under
processing, $L_m$ (line 12).
Once all the lists are filled with at least a single candidate
word, a pruning phase takes place (lines 15–22). This second
phase aims at reducing the number of candidates contained in
each list to one, through further validation. In order to perform
this pruning, a new differential $\epsilon$ is obtained from a fresh
faulty ciphertext, and all the candidates for that differential
word are checked for validity with the same criterion used to
include the guesses in the candidate lists (lines 17–20). If a
candidate word does not pass the check, it is removed
from the list (lines 21–22).
After obtaining a single candidate for each of the four
state words of $s$, it is possible to apply a SHIFTROW and a
MIXCOLUMN operation to find the correct value of the $\sigma$ state
(see Figure 7). In order to retrieve the $(r-1)$-th round key
$k^{(r-1)}$, it suffices to compute $\sigma \oplus c^{(r-1)}$.
If needed, this procedure may be repeated at will,
since it is possible to fully invert the effect of any round of
the AES algorithm by removing the rounds one by one.
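The candidate check described above can be sketched in Python for a single state word. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`gmul`, `build_sbox`, `mix_column`, `inv_mix_column`, `is_candidate`) are hypothetical, words are modelled as lists of four bytes, and the SHIFTROW word arrangement is ignored so that only one column is considered.

```python
def gmul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox() -> list:
    """Derive the AES S-box: field inversion followed by the affine map."""
    def rotl8(x: int, n: int) -> int:
        return ((x << n) | (x >> (8 - n))) & 0xFF

    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else next(b for b in range(1, 256) if gmul(a, b) == 1)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for value, image in enumerate(SBOX):
    INV_SBOX[image] = value


def _mix(column, first_row):
    """Multiply a 4-byte column by the circulant matrix with the given first row."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            acc ^= gmul(first_row[(j - i) % 4], column[j])
        out.append(acc)
    return out


def mix_column(column):
    return _mix(column, (2, 3, 1, 1))


def inv_mix_column(column):
    return _mix(column, (14, 11, 13, 9))


def is_candidate(w, eps) -> bool:
    """Return True if the guessed word w is consistent with a single byte fault.

    w   : guessed state word (after SUBBYTE), as four bytes
    eps : word differential between the faulty and the correct state
    """
    w_faulty = [a ^ b for a, b in zip(w, eps)]
    # Undo SUBBYTE on both words and take the difference (the word zeta_m).
    zeta = [INV_SBOX[a] ^ INV_SBOX[b] for a, b in zip(w_faulty, w)]
    # A single byte fault leaves exactly one non-zero byte before MIXCOLUMN.
    return sum(1 for byte in inv_mix_column(zeta) if byte != 0) == 1
```

In a full attack this predicate would be evaluated for every guess of $w$, and the guesses that pass would be collected in the list $L_m$; the exhaustive loop over all 32-bit words is omitted here.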
V. ASYMMETRIC KEY APPLICATIONS: ATTACKS TO RSA
In order to test the efficacy of the new fault model proposed
in Section II against a public key cryptosystem, we chose to
attack the RSA cryptosystem since, due to its vast adoption,
it has undergone an extremely careful cryptanalytic scrutiny
and thus represents an appealing target.
In this section we present two attack techniques, one of
which is well known and will serve as a testbench for our fault
model, while the other one has been designed from scratch.
The first one is the so-called Bellcore attack on the RSA
signing primitive, when implemented using the Chinese
Remainder Theorem. Its aim is to recover the private key while
in possession of a faulty signature.
The new one, henceforth named $e$-th root extraction attack,
aims at decrypting an RSA message under a known-ciphertext-only
assumption. The only requirement is to have a faulty and
a correct encryption of the same unknown message. The attack
technique is not specifically tailored for our fault model and
fits reasonably well even multi-bit fault events.

Function 4.4: GetDifferential($c^{(r-1)}$, $k^{(r)}$)
Input : $c^{(r-1)}$, fault-free last-but-one round output; $k^{(r)}$,
        last round key
Output: $(w, j)$, $w$: one word difference between faulty
        and fault-free state after the SUBBYTE of the
        last-but-one round; $j \in \{0,1,2,3\}$: position of
        the only non-zero word in the aforementioned
        difference
repeat
    Record a new faulty ciphertext $\tilde{c}^{(r)}$
    $\tilde{c}^{(r-1)} \leftarrow \mathrm{INVSUBBYTE}(\mathrm{INVSHIFTROW}(\tilde{c}^{(r)} \oplus k^{(r)}))$
    /* last-but-one MIXCOLUMN */
    $\delta \leftarrow \mathrm{INVSHIFTROW}(\mathrm{INVMIXCOLUMN}(\tilde{c}^{(r-1)} \oplus c^{(r-1)}))$
until $\delta \in \{ \langle w_0, w_1, w_2, w_3 \rangle \mid \exists!\, j \in \{0,1,2,3\},\ w_j \neq 0 \}$
return $(w_j, j)$
Throughout the description of the attacks, we will use the
following notation: let $p$ and $q$ be two large primes and let
$n = pq$ be the RSA modulus. Let $e$, $d$ be two unitary elements
in $(\mathbb{Z}^*_{\phi(n)}, \cdot)$ representing the public and private exponent,
bound together by the congruence $d = e^{-1} \bmod \phi(n)$. Let
$t = \lceil \log_2 \phi(n) \rceil$ denote the length of their binary encodings.
Having $m, c \in \mathbb{Z}^*_n$, we denote a generic RSA plaintext-ciphertext
pair as $c = m^e \bmod n$. Having $m, s \in \mathbb{Z}^*_n$, we denote
a generic RSA message-signature pair as $s = m^d \bmod n$.
A. Bellcore Attack
The Bellcore attack [8] makes it possible to factor the modulus $n$
by inducing an error during the
exponentiation phase of any RSA primitive implemented using
the Chinese Remainder Theorem.
Let $s = \mathrm{CRT}(s_p, s_q)$ denote the CRT recombination of
the value $s = m^d \bmod n$ from the two values $s_p = m^d \bmod p$
and $s_q = m^d \bmod q$:
$$s = \left(s_p + p\,\big((s_q - s_p)(p^{-1} \bmod q) \bmod q\big)\right) \bmod n$$
If a fault occurs during the computation of $s_q$ while the
computation of $s_p$ remains error free, we may denote the
faulty value of $s_q$ as $\tilde{s}_q = s_q + \Delta$. Therefore, the faulty CRT
recombination will yield $\tilde{s} = \mathrm{CRT}(s_p, \tilde{s}_q)$, given by:
$$\tilde{s} = \left(s + p\,\big(\Delta\,(p^{-1} \bmod q) \bmod q\big)\right) \bmod n$$
Since the value $\tilde{s} - s$ shares a nontrivial factor with the modulus
$n$, it is possible to extract $p = \gcd(\tilde{s} - s, n)$ efficiently through
Euclid's Algorithm.
Moreover, as shown in [20], the modulus factorisation is
also computable using only the message $m$ and one faulty
computation of the signature $\tilde{s}$, through calculating
$$p = \gcd(\tilde{s}^e - m, n)$$
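Algorithm IV.5: FULL AES DIFFERENTIAL ATTACK
Input: $\Delta = \{ \langle\delta_0,0,0,0\rangle, \langle 0,\delta_1,0,0\rangle, \langle 0,0,\delta_2,0\rangle, \langle 0,0,0,\delta_3\rangle \}$,
       $\forall u \in \{0,1,2,3\},\ \delta_u \in \{1,\dots,255\}$,
       $|\Delta| = 255 \times 4$,
       $\Delta' = \{ dd \leftarrow \mathrm{MIXCOLUMN}(\delta),\ \forall \delta \in \Delta \}$
Output: $(k^{(r-1)}, k^{(r)})$, last two round keys
 1  begin
 2      $L_0 \leftarrow \emptyset$, $L_1 \leftarrow \emptyset$, $L_2 \leftarrow \emptyset$, $L_3 \leftarrow \emptyset$
 3      Apply Algorithm IV.3 and retrieve the last round key $k^{(r)}$
 4      Record a fault-free ciphertext $c^{(r)}$
 5      $c^{(r-1)} \leftarrow \mathrm{INVSUBBYTE}(\mathrm{INVSHIFTROW}(c^{(r)} \oplus k^{(r)}))$
 6      repeat
            /* $m \in \{0,1,2,3\}$, round key word index */
 7          $(\epsilon, m) \leftarrow \mathrm{GETDIFFERENTIAL}(c^{(r-1)}, k^{(r)})$
 8          foreach $w \in \{0,\dots,2^{32}-1\}$ do
 9              $\tilde{w} \leftarrow w \oplus \epsilon$
10              $\zeta \leftarrow \langle\zeta_0,\zeta_1,\zeta_2,\zeta_3\rangle \leftarrow \langle 0,0,0,0\rangle$
11              $\zeta_m \leftarrow \mathrm{INVSUBBYTE}(\tilde{w}) \oplus \mathrm{INVSUBBYTE}(w)$
                /* last-but-two MIXCOLUMN */
12              if $\mathrm{INVMIXCOLUMN}(\zeta) \in \Delta$ then
13                  $L_m \leftarrow L_m \cup \{w\}$
14      until $\forall m,\ L_m \neq \emptyset$
15      while $\exists m,\ |L_m| > 1$ do
16          $(\epsilon, n) \leftarrow \mathrm{GETDIFFERENTIAL}(c^{(r-1)}, k^{(r)})$
17          foreach $w \in L_n$ do
18              $\tilde{w} \leftarrow w \oplus \epsilon$
19              $\zeta \leftarrow \langle\zeta_0,\zeta_1,\zeta_2,\zeta_3\rangle \leftarrow \langle 0,0,0,0\rangle$
20              $\zeta_n \leftarrow \mathrm{INVSUBBYTE}(\tilde{w}) \oplus \mathrm{INVSUBBYTE}(w)$
                /* last-but-two MIXCOLUMN */
21              if $\mathrm{INVMIXCOLUMN}(\zeta) \notin \Delta$ then
22                  $L_n \leftarrow L_n \setminus \{w\}$
        /* $L_0 = \{\bar{w}_0\}$, $L_1 = \{\bar{w}_1\}$, $L_2 = \{\bar{w}_2\}$, $L_3 = \{\bar{w}_3\}$ */
23      $s \leftarrow \langle \bar{w}_0, \bar{w}_1, \bar{w}_2, \bar{w}_3 \rangle$
        /* last-but-two MIXCOLUMN */
24      $\sigma \leftarrow \mathrm{MIXCOLUMN}(\mathrm{SHIFTROW}(s))$
25      $k^{(r-1)} \leftarrow c^{(r-1)} \oplus \sigma$
26      return $(k^{(r-1)}, k^{(r)})$
27  end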
The main advantage of the Bellcore technique is that any kind of
fault induced in the computation of one of the two values
to be recombined with the CRT will yield a useful faulty
computation regardless of precise timing and placement, which
nicely fits our fault model.
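The two factorisation formulas of the Bellcore attack can be checked on a toy modulus. The following Python sketch is only illustrative: the function names and the `fault` parameter (an additive error $\Delta$ on $s_q$) are assumptions made for this example, and the primes are far too small for real use. It relies on `pow(p, -1, q)` for the modular inverse, which requires Python 3.8 or later.

```python
from math import gcd


def crt_sign(m: int, d: int, p: int, q: int, fault: int = 0) -> int:
    """RSA-CRT signature; a non-zero `fault` corrupts s_q before recombination."""
    n = p * q
    s_p = pow(m, d, p)
    s_q = (pow(m, d, q) + fault) % q          # faulty value: s_q + Delta
    p_inv = pow(p, -1, q)                     # p^{-1} mod q
    return (s_p + p * (((s_q - s_p) * p_inv) % q)) % n


def factor_from_pair(s: int, s_faulty: int, n: int) -> int:
    """Recover p from a correct and a faulty signature of the same message."""
    return gcd(s_faulty - s, n)


def factor_from_message(m: int, s_faulty: int, e: int, n: int) -> int:
    """Recover p from the message and a single faulty signature."""
    return gcd(pow(s_faulty, e, n) - m, n)


# Toy parameters (hypothetical, for illustration only).
p, q, e, d = 61, 53, 17, 2753
n = p * q
m = 65
s = crt_sign(m, d, p, q)
s_faulty = crt_sign(m, d, p, q, fault=1)
recovered = factor_from_pair(s, s_faulty, n)
```

Both `factor_from_pair` and `factor_from_message` return the prime $p$ here, because the fault leaves the residue modulo $p$ untouched while changing the one modulo $q$.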
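B. $e$-th Root Extraction Attack
In order to attack the RSA cryptosystem we propose a new
algorithm to extract the $e$-th root of a number modulo $n$ in
polynomial time, exploiting the knowledge of another power
of the same number. The target of this attack is to retrieve the
input message encrypted through RSA using a correct and a
faulty encryption of the same message.

Algorithm V.1: $e$-TH ROOT EXTRACTION
Input: $e_1, e_2 \in \{1,\dots,\phi(n)-1\}$, $e_1 \geq e_2$,
       $c_1 = m^{e_1} \bmod n$, $c_2 = m^{e_2} \bmod n$
Output: $(m, n)$: either $(m, \perp)$ if the $e$-th root may be
        extracted, $(p, q)$ if the modulus can be factored
        or $(\perp, \perp)$ otherwise
 1  begin
 2      $\tau \leftarrow \gcd(c_1, n)$
 3      if $\tau \neq 1$ then
 4          return $(\tau, n/\tau)$
 5      $\tau \leftarrow \gcd(c_2, n)$
 6      if $\tau \neq 1$ then
 7          return $(\tau, n/\tau)$
 8      if $\gcd(e_1, e_2) \neq 1$ then
 9          return $(\perp, \perp)$
10      $\gamma_1, \gamma_2 \leftarrow c_1, c_2$
11      $\varepsilon_1, \varepsilon_2 \leftarrow e_1, e_2$
        /* Integer division */
12      $\theta \leftarrow \lfloor \varepsilon_1 / \varepsilon_2 \rfloor$, $\rho \leftarrow \varepsilon_1 \bmod \varepsilon_2$
        /* Cost: 1 modular multiplication, 1 modular
           inversion and 1 modular exponentiation */
13      $\gamma_3 \leftarrow \gamma_1 \gamma_2^{-\theta} \bmod n$
14      while $\rho \neq 0$ do
15          $\gamma_1, \gamma_2 \leftarrow \gamma_2, \gamma_3$
16          $\varepsilon_1, \varepsilon_2 \leftarrow \varepsilon_2, \varepsilon_1 - \theta\varepsilon_2$
            /* Integer division */
17          $\theta \leftarrow \lfloor \varepsilon_1 / \varepsilon_2 \rfloor$, $\rho \leftarrow \varepsilon_1 \bmod \varepsilon_2$
            /* Cost: 1 modular multiplication, 1 modular
               inversion and 1 modular exponentiation */
18          $\gamma_3 \leftarrow \gamma_1 \gamma_2^{-\theta} \bmod n$
19      return $(\gamma_2, \perp)$
20  end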
This hypothesis is analogous to being able to decipher a
message assuming the knowledge of two encryptions done
with two public keys sharing the same modulus $n$. Whilst this
situation does not arise from an incorrect generation of two
public/private key pairs (otherwise the two key holders would be able
to mutually read each other's messages), the encryption of the
same message through exponentiation by two different public
exponents $e_1$, $e_2$ may happen if a message is re-encrypted and
a fault hits the exponent during the second encryption.
A practical application scenario could be the retrieval of
the session key during an RSA-KEM [28] handshake. This
assumes that the party in charge of choosing the session key
re-encrypts the same value in the case a faulty encapsulation
occurred. To the best of the authors' knowledge this technique
has not yet been used in order to mount an attack.
Algorithm V.1 describes a method to retrieve the plaintext of
an RSA encryption using Euclid's Greatest Common Divisor
Algorithm as a pivot to perform operations on the two known
ciphertexts.
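A minimal Python sketch of this procedure is given below, under a few assumptions: the function name `eth_root_extraction` is hypothetical, `None` stands in for the symbol $\perp$, and the toy modulus is chosen only so that the example runs instantly. Negative exponents in `pow` require Python 3.8 or later.

```python
from math import gcd


def eth_root_extraction(e1: int, e2: int, c1: int, c2: int, n: int):
    """Return (m, None), a factorisation (p, q), or (None, None).

    c1 = m^e1 mod n and c2 = m^e2 mod n are two encryptions of the same m.
    """
    if e1 < e2:                      # the procedure assumes e1 >= e2
        e1, e2, c1, c2 = e2, e1, c2, c1
    for c in (c1, c2):
        tau = gcd(c, n)
        if tau != 1:                 # a ciphertext shares a factor with n
            return (tau, n // tau)
    if gcd(e1, e2) != 1:
        return (None, None)
    g1, g2 = c1, c2                  # invariant: g1 = m^eps1, g2 = m^eps2
    eps1, eps2 = e1, e2
    theta, rho = divmod(eps1, eps2)
    g3 = g1 * pow(g2, -theta, n) % n  # g3 = m^(eps1 mod eps2)
    while rho != 0:
        g1, g2 = g2, g3
        eps1, eps2 = eps2, rho       # rho equals eps1 - theta * eps2
        theta, rho = divmod(eps1, eps2)
        g3 = g1 * pow(g2, -theta, n) % n
    return (g2, None)                # eps2 is now gcd(e1, e2) = 1, so g2 = m


# Toy example (hypothetical parameters): n = 61 * 53, correct exponent 17,
# faulty exponent 16 (bit 0 of the exponent flipped).
n, m = 3233, 65
result = eth_root_extraction(17, 16, pow(m, 17, n), pow(m, 16, n), n)
```

The loop mirrors Euclid's Algorithm on the exponents while applying the same steps, in multiplicative form, to the two ciphertexts.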
In the case either of the ciphertexts shares a nontrivial
factor with the modulus $n$, which would in turn imply that the
ciphertext value is a zero divisor in $(\mathbb{Z}_n, \cdot)$, it is possible
to employ it to factor $n$ by simply computing their greatest
common divisor. However, the chances of this happening in
a real world scenario are extremely slim: in fact the ratio of
unitary elements in $(\mathbb{Z}_n, \cdot)$ is exactly $\phi(n)/n$, which is very
close to one when $n$ is the product of two large primes.
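To make the size of this ratio concrete, the following worked bound assumes, purely as an illustration, a modulus built from two primes larger than $2^{511}$ (roughly a 1024-bit $n$):

```latex
\[
\frac{\phi(n)}{n} = \frac{(p-1)(q-1)}{pq}
                  = 1 - \frac{1}{p} - \frac{1}{q} + \frac{1}{pq},
\qquad
1 - \frac{\phi(n)}{n} < \frac{1}{p} + \frac{1}{q} < 2^{-510}
\quad \text{for } p, q > 2^{511}.
\]
```

Under this assumption, a ciphertext chosen at random hits a zero divisor with probability below $2^{-510}$.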
The algorithm properly extracts the $e$-th root only when the
two values $e_1$, $e_2$ are coprime. A well known result in number
theory [17] states that, provided that the two numbers are
randomly chosen from a large enough range, the probability
of them being coprime approaches $6/\pi^2$, that is roughly 61%.
The algorithm computes the value of $\gcd(e_1, e_2)$ following
the classic Euclid's Algorithm and computing for each step
the value of $m^{e_1 \bmod e_2} \bmod n$ employing the values
$c_1 = m^{e_1} \bmod n$ and $c_2 = m^{e_2} \bmod n$ (line 13 and line 18).
Assuming $e_1 \geq e_2$, the number of steps that Euclid's
Algorithm needs to perform is in $O(\log(e_1))$, therefore at most
in $O(\log \phi(n))$ (Lamé's Theorem [19]).
For each step, the integer division between the exponents
has complexity in $O(\log^2 \phi(n))$, which is dominated by the
complexity of the additional modular operations required to
compute the intermediate value $\gamma_3$ (line 18) using the two
ciphertexts. In fact, the complexity of performing a modular
multiplication, a modular exponentiation and a modular
inversion is in $O(\log^3 n)$. Thus, the complexity of the whole
algorithm is in $O(\log^4 n)$, that is in P and therefore tractable
even for large values of $n$. In particular, given the common
sizes of $n$ in RSA moduli, the computation is largely feasible
even with limited computational resources.
In order to employ Algorithm V.1 in a fault attack
scenario, the values of $e_1$ and $e_2$ must be known: this is
equivalent to a very precise fault hypothesis where both the
number of erroneous bits of the exponent and their positions
are known.
We assume, coherently with the error model presented in
Section II, the hypothesis of a single faulty bit of the exponent,
whose position is known up to a small number of possible
ones. Express a single-bit faulty exponent $e_2$ as $e_1 - 2^{\xi}$
for some values of $\xi \in \Xi = \{0, \dots, \lceil \log_2 \phi(n) \rceil - 1\}$; in
order to retrieve the correct plaintext $m$ we need to run
Algorithm V.1 for each possible value of $e_2$, and check through
re-encryption whether the computed value is the one sought.
In the worst case, for a single bit fault, the number of
hypotheses amounts exactly to the bit size of $\phi(n)$. On the
other hand, if the position of the faulty bit is fixed w.r.t. the
width $w$ of the computing device word (as in Section II-E),
the size of the hypothesis set $\Xi$ is reduced to
$$\frac{\lceil \log_2 \phi(n) \rceil}{w},$$
which typically is between one and two orders of magnitude
smaller than $\lceil \log_2 \phi(n) \rceil$.
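The reduction of the hypothesis set can be illustrated with a short Python sketch. The function names, the 1024-bit exponent length and the 32-bit word width below are assumptions chosen for the example, not values taken from the experiments.

```python
def hypothesis_set(t: int, word_bits: int = None, bit_in_word: int = 0) -> list:
    """Candidate positions xi of the faulty exponent bit.

    t           : bit length of the exponent
    word_bits   : machine word width w, if the faulty bit position is fixed
                  with respect to the word; None means any position is possible
    bit_in_word : known position of the faulty bit inside each word
    """
    if word_bits is None:
        return list(range(t))
    return list(range(bit_in_word, t, word_bits))


def faulty_exponents(e1: int, positions: list) -> list:
    """Candidate faulty exponents e2 = e1 - 2^xi, keeping positive values only."""
    return [e1 - (1 << xi) for xi in positions if e1 - (1 << xi) > 0]


# Assumed sizes for illustration: 1024-bit exponent, 32-bit machine word.
all_positions = hypothesis_set(1024)
word_aligned = hypothesis_set(1024, word_bits=32, bit_in_word=5)
```

With these assumed sizes the set shrinks from 1024 to 32 positions. Each resulting candidate $e_2$ would then be passed to the root extraction procedure and validated by re-encrypting the recovered value.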
VI. EXPERIMENTAL RESULTS
We now provide experimental evidence of the practicality
of the algorithmic techniques presented in Section IV and in
Section V, and report the results of conducting them on an
ARM9 CPU. We report figures of merit for both the attack
strategy proposed in [23] and our original contribution, which
allows us to attack any number of rounds of any AES cipher.
Subsequently, we discuss the results of the experimental
campaign conducted in order to assess the practical feasibility
of the Bellcore and $e$-th root extraction attacks addressed to
the RSA cryptosystem.
The attacked platform was running a vanilla Linux 2.6.15
kernel (DENX distribution) during all the fault collection
campaigns, and the programs performing encryption were
compiled into regular ELF binaries which were run from the
shell. Both the instruction and the data caches of the CPU
were enabled during the experiments and the frequency was set to
the maximum one supported, thus providing an unsimplified
real-world working condition.
A. Experimental Evaluation of the Attacks to AES
Since all the attacks on AES are based on the successful
injection of one byte faults in a specific word of a specific
round of the algorithm, the first step to ascertain the practical
feasibility of the attack is understanding the distribution of the
faults over the states of the cipher.
We considered three different implementations of AES
according to the strategies described in Section IV-A: the
first implementation uses 4 T-boxes and is the one used in
OpenSSL [13], while the other two respectively use a single
T-box and the reference S-box in order to achieve a smaller
memory footprint. The choice of evaluating implementations
of AES differing by the computation-memory trade-off was
made in light of the fact that the data caching policies of
the ARM9 could have a significant impact on the performance
of the attacks, since the CPU caches have been shown in
Section II to have a mitigating effect on faults. The first
explorative campaign was directed at understanding the fault
distribution w.r.t. the rounds of the cipher. Figure 8 depicts the
spread of the faults over the first 10 rounds of the AES-128 algorithm,
obtained through collecting 100k faults and classifying them
by the round they hit. This was done through inverting the
faulty ciphertexts with the known key and calculating the
differences between each state of the correct and the erroneous
runs until the single byte difference was found. The depicted
data were collected using the 4 T-box implementation of AES,
but all the other fault distributions differ by less than 0.1% for
each value from the reported ones. As the figure shows, the
faults are almost equally distributed over the first $r-1$ rounds of
the cipher, whereas the last round has a noticeably lower
probability of being hit. The fault distributions over the rounds for
the AES-192 and AES-256 algorithms are analogous to the
reported one except for the larger number of rounds.
Figure 8. Distribution of the faults over the rounds of the AES algorithm (x-axis: round, 1 to 10; y-axis: faults hitting the round [%], 0 to 12)
Table II
PERCENTAGES OF FAULTS HITTING EACH COLUMN OVER 50K INJECTED
FAULTS – AES FOUR T-BOXES
              Faults hitting a column [%]
State Word    O0       O1       O2       O3
First         40.30    25.08    24.48    24.63
Second        19.15    24.73    25.05    24.14
Third         19.36    24.62    25.13    25.93
Fourth        21.17    25.56    25.32    25.28
Table III
PERCENTAGES OF FAULTS HITTING EACH COLUMN OVER 50K INJECTED
FAULTS – AES ONE T-BOX
              Faults hitting a column [%]
State Word    O0       O1       O2       O3
First         25.16    25.00    25.06    25.52
Second        23.85    25.67    25.03    25.12
Third         25.81    23.99    24.35    24.75
Fourth        25.26    25.33    25.54    24.60
Table IV
PERCENTAGES OF FAULTS HITTING EACH COLUMN OVER 50K INJECTED
FAULTS – AES REFERENCE IMPLEMENTATION
              Faults hitting a column [%]
State Word    O0       O1       O2       O3
First         24.45    24.83    24.38    20.76
Second        22.94    25.12    25.81    19.23
Third         25.10    24.88    26.19    20.59
Fourth        27.48    25.17    23.60    39.40
In order to ascertain the fault distribution over the state of a
single round, it is necessary to take into account the effect of
the optimisation strategies employed by the compiler. This is
mandated by the fact that aggressive optimisation may employ
the coalesced instructions of the ARMv5TE architecture, which
may alter the fault spread over the words of the state.
Table II, Table III and Table IV report the fault spread
over the words of a state, averaged over 50k faults for each
implementation, and sorted by increasing optimisation level to
which the GCC compiler was set.