
REIL: A platform-independent intermediate representation of disassembled code for static code analysis



In this paper we introduce REIL (Reverse Engineering Intermediate Language), a platform-independent intermediate language to represent disassembled assembly code. We created the REIL language specifically to simplify and automate static code analysis of assembly code in the context of software reverse engineering for the purpose of security auditing and vulnerability detection. This paper introduces the complete REIL language with all of its instructions as well as the virtual REIL architecture that defines the effects of REIL instructions on registers and memory. Furthermore we discuss the reasons why we designed the REIL language the way we did, what limitations the user of the language should be aware of, and what we have planned for REIL in the future.
Thomas Dullien
zynamics GmbH
Bochum, Germany
Sebastian Porst
zynamics GmbH
Bochum, Germany
REIL, Reverse Engineering Intermediate Language, Static
Code Analysis, Disassembly, Intermediate Representation,
Intermediate Language Recovery
Only a few years ago the main exposure people had to
security-critical computer programs like credit card stealing
malware came through their home computers. Due to the
dominance of Microsoft’s operating system Windows these
computers were nearly always computers of the x86 family.
This situation has changed. People today come into contact with
more and more computer architectures that are directly or
indirectly relevant to the safety of their private data. Ex-
amples include appliances like modern cell phones, network
printers with integrated web servers and hard drives, and
more complex routers or wireless devices that are now part
of many home networks.
On the side of the security researchers this led to a diver-
sification of target architectures that need to be analyzed.
Of course there is still the x86 platform which is the bread
and butter of many security researchers, but other devices
use different CPUs. In the mobile world ARM is the most
popular architecture, while network appliances like routers
often use PowerPC CPUs. MIPS is also making a comeback
in wireless devices and some high-end routers.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CanSecWest 2009 Vancouver, Canada
Copyright 2009 zynamics GmbH.
For a typical security research and consulting company
that wants to diversify its product palette this creates a need
to have tools that can work on assembly code of several dif-
ferent platforms. Here is where our contribution comes in.
We developed an analysis language called REIL (Reverse
Engineering Intermediate Language) that abstracts from native
assembly code and therefore makes it possible to develop
analysis tools and algorithms that work on many different
platforms.
One thing that sets REIL apart from other proposed anal-
ysis languages is that REIL is not just a prototype language.
A REIL implementation was already shipped in a commer-
cial reverse engineering tool (BinNavi) where it has proven
to be very valuable in developing new static analysis algo-
rithms in real-world software analysis scenarios.
One of the most important advantages of the REIL lan-
guage is its very reduced instruction set. REIL knows only
17 different instructions. This distinguishes REIL signifi-
cantly from all popular instruction sets supported by real
CPUs today. For example, the x86 instruction set including
all of its modern extensions contains more than 600 instructions.
The PowerPC instruction set including all simplified mnemonics
contains more than 1000 instructions (see footnote 1). The reduction
to a core minimum of REIL instructions was deliberately chosen
to make it as easy as possible to write static
analysis algorithms for REIL code. The idea behind this is
that fewer instructions in an assembly language mean fewer
different transformations of program state that must be con-
sidered in a static analysis algorithm.
Another advantage of REIL instructions over instructions
of common architectures is the single-responsibility aspect
of all REIL instructions. Typical instructions of real archi-
tectures often have many different responsibilities. A sin-
gle x86 instruction can load a value from memory, perform
some kind of computation on it, and set flags according to
the result of the computation. In REIL this is not possible.
Every instruction has exactly one effect on the program
state. Either it loads a value from memory, or it performs a
computation on a value, or it sets a single flag. This makes
it very easy for the user of REIL code to understand what
exactly is going on in a REIL instruction. Non-obvious side
effects that require a deep understanding of the underlying
instruction set can never happen.
Footnote 1: Simplified mnemonics are relevant here because in nearly
all cases where binary programs are disassembled for security
analysis, industry standard disassemblers like IDA Pro
generate simplified mnemonics.
The operands of REIL instructions are also very regu-
lar. Each REIL instruction has exactly three operands. The
first two operands of a REIL instruction are in all cases in-
put operands which are never modified by the instruction.
The third operand is generally the output operand of the instruction.
This operand stores the result of the computation
performed by the instruction (see footnote 2). Not all instructions need to
have three different operands. The most obvious example is
the no-operation instruction NOP which does not need to
have any operand. Nevertheless, the REIL instruction NOP
still has three different operands of type Empty.
Except for Empty, there are only three more types of REIL
instruction operands. Operands can be integer literals, reg-
isters, or subaddresses. Of those three operand types, inte-
ger literals are the simplest type. Operands of this type are
typically used when a constant integer value like 5 or 4379
is required as part of the computation of a more complex
result. Since integer literals are read-only they can never be
used as output operands of REIL instructions.
Registers are the second type of REIL operands. REIL
registers work exactly like native general-purpose CPU reg-
isters. They can hold integer values and they are mutable,
meaning the value they hold can be changed.
In fact, while most registers used in REIL instructions are
pure REIL registers of the form t<number> (e.g. t0, t1, ...
, t123, ...), native registers from the source architecture of
an analyzed program can also show up as operands of REIL
instructions. This does not limit the platform-independence
of REIL code as REIL registers and source architecture registers
are treated completely uniformly in REIL analysis algorithms.
Native registers are simply used in REIL code to
make it easier to map the results of an analysis algorithm
back to the original input code.
The last REIL instruction operand type is the subaddress.
Operands of this type are comparable to integer literals but
instead of integral values these operands always hold ad-
dresses of REIL instructions. Furthermore, this operand
type can only appear as the third operand of JCC (condi-
tional jump) instructions. Operands of this type are only
generated when an original native assembly instruction is
translated into a series of REIL instructions that contains
branches from decisions or loops. Examples of such instructions
are the prefixed string operations (rep stos, ...) of the
x86 instruction set, which are translated to REIL instructions
that form a loop.
In addition to their type and their value, REIL operands are
characterized by their size. This size is equal to the maximum
size of the operand's value. REIL operand sizes have
names like b1, b2, b4 and so on, meaning that the size of the
operand is 1 byte, 2 bytes, and 4 bytes respectively. For
example, the integer literal operand 0x17/b2 is really two
bytes large and could also be represented as 0x0017/b2, while
the size of the register t0 in the operand t0/b4 is 32 bits.
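The operand model described so far (a type, a value, and a size) can be sketched in a few lines of Python. This is only an illustration: the class name Operand, the kind strings, and the helper parse_operand are assumptions introduced here and are not part of the REIL definition. Subaddress operands are left out of the parser because the text does not show a notation for them.

```python
from dataclasses import dataclass

# Hypothetical model of a REIL operand; all names are illustrative only.
@dataclass(frozen=True)
class Operand:
    kind: str    # "empty", "literal", "register" (or "subaddress")
    value: str   # literal text or register name ("" for empty operands)
    size: int    # operand size in bytes (b1 -> 1, b2 -> 2, b4 -> 4, ...)

def parse_operand(text: str) -> Operand:
    """Parse the 'value/bN' notation used in the figures, e.g. 't0/b4'."""
    text = text.strip()
    if not text:
        return Operand("empty", "", 0)
    value, size = text.split("/b")
    # Assumption: anything that starts with a digit is an integer literal.
    kind = "literal" if value[0].isdigit() else "register"
    return Operand(kind, value, int(size))
```

With this sketch, parse_operand("t0/b4") yields a four-byte register operand, while an empty string yields an operand of type Empty.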
In addition to their operands, all REIL instructions can come
with so-called meta-data. This meta-data is simply a map
of key-value pairs that give additional information about an
instruction that is probably important during static code
analysis. In general, the number of pieces of meta-data
associated with an instruction is not limited, but in practice
most REIL instructions do not have any meta-data at all
associated with them.
Footnote 2: The one exception is the jump instruction JCC where the
third operand is the jump target.
In the current version of REIL there is only one kind of
meta-data. Jump instructions that were generated during
the translation of a subfunction call (like call on the x86
CPU or bl on PowerPC) are specifically marked with the
key isCall and the value true. This is necessary because
subfunction calls need to be treated very differently from
conditional jumps during many static code analysis algorithms.
The 17 different REIL instructions can be grouped into a
few different instruction groups. The biggest group is the
arithmetic instructions like addition and subtraction. Then
there are the bitwise instructions that perform operations
like bitwise OR and AND, the conditional instructions that
are used to compare values and jump according to the re-
sult of the comparison, the data transfer instructions that
access REIL memory and transfer the content of registers,
and the remaining instructions which do not really fall into
any group.
2.1 The arithmetic instructions
With six members, the group of arithmetic instructions
covers more than one third of the total instructions of the
REIL instruction set.
ADD: Addition of two values
SUB: Subtraction of two values
MUL: Unsigned multiplication of two values
DIV: Unsigned division of two values
MOD: Unsigned modulo of two values
BSH: Logical shift operation
ADD and SUB work exactly like standard addition and
subtraction on most platforms.
The multiplicative instructions MUL, DIV, and MOD in-
terpret all of their input operands in an unsigned way. The
REIL instruction set does not contain signed counterparts of
these instructions because signed multiplication and division
can easily be simulated in terms of unsigned multiplication
and division.
The logical shift operation can either be used as a left
shift or a right shift, depending on the sign of its second
operand. If the second operand is positive, the shift oper-
ation is a left-shift. If it is negative, the shift operation is
a right-shift. Arithmetic shifts do not exist in the REIL
instruction set because arithmetic shifts can easily be simulated
with the help of logical shifts. As in the case of the
multiplicative instructions, keeping the REIL instruction set
small was more important than adding the convenience of
having more expressive REIL translations.
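The following Python sketch illustrates the BSH semantics described above and one possible way to build an arithmetic right shift from logical shifts. The function names are hypothetical, the shift amount is passed as a signed Python integer, and the particular emulation sequence is an assumption made for illustration. It is not necessarily the sequence a REIL translator generates.

```python
def bsh(value: int, shift: int, out_bits: int) -> int:
    """Logical shift: positive shift = left shift, negative shift = right shift."""
    mask = (1 << out_bits) - 1
    if shift >= 0:
        return (value << shift) & mask
    return (value >> -shift) & mask

def arithmetic_shift_right(value: int, amount: int, bits: int = 32) -> int:
    """Arithmetic right shift expressed with BSH, XOR, MUL and OR style steps."""
    mask = (1 << bits) - 1
    sign = bsh(value, -(bits - 1), bits)       # 1 if the sign bit is set, else 0
    shifted = bsh(value, -amount, bits)        # plain logical right shift
    fill = mask ^ bsh(mask, -amount, bits)     # the top `amount` bits set
    return shifted | (fill * sign)             # fill the vacated bits with the sign
```

For a 32-bit value with the sign bit set, the vacated high bits are filled with ones, which is exactly what distinguishes the arithmetic from the logical shift.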
Figure 1 shows examples of all arithmetic instructions.
The structure of all arithmetic instructions is the same. The
first two operands are the input operands of the operation
while the third operand is the output operand where the
result of the operation is stored. The order of the input
operands is the natural order that is generally used when
writing down the operations in infix notation on paper or
in the source code of computer programs. For example, the
first operand of the SUB operation is the minuend while the
second operand is the subtrahend. In the DIV operation the
first operand is the dividend and the second operand is the divisor.
ADD t0/b4, t1/b4, t2/b8
SUB t7/b4, t9/b4, t12/b8
MUL t8/b4, 4/b4, t9/b8
DIV 4000/b4, t2/b4, t3/b4
MOD t8/b4, 8/b4, t4/b4
BSH t1/b4, 2/b4, t2/b8
Figure 1: Examples of the arithmetic REIL instructions
Another important aspect of REIL is also shown for the first
time in Figure 1. Potential overflows in the results of operations
are handled explicitly. If an operation can overflow, the
output operand must be large enough to store the whole
result including the overflow. This is the reason why the
output operands of the example instructions are twice as
large as their input operands (see footnote 3). The two exceptions are the
output operands of the DIV and MOD instructions. Since
the results of these operations can never be larger than the
first input operand, an extension of the size of the output
operand is not necessary. The output operand has the size
of the input operand instead.
The explicit handling of overflow is an important difference
from real architectures, where overflows produced by operations
are nearly always cut off because of the fixed size
of native CPU registers. This explicit overflow handling is
what enables REIL algorithms to analyze the results of operations
in greater detail when the exact overflowing value
of a register might be important, instead of simply having a
flag that signals that an operation produced an overflow.
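As a small worked example of explicit overflow handling, the sketch below adds two 32-bit values into a 64-bit result and then recovers both the truncated 32-bit value and the overflow bit. The function name reil_add and its parameters are assumptions made for illustration.

```python
def reil_add(a: int, b: int, in_size: int = 4, out_size: int = 8) -> int:
    """ADD with explicit overflow: the result is kept in a wider output operand."""
    in_mask = (1 << (8 * in_size)) - 1
    out_mask = (1 << (8 * out_size)) - 1
    return ((a & in_mask) + (b & in_mask)) & out_mask

wide = reil_add(0xFFFFFFFF, 0x00000001)   # full 33-bit result held in a b8 operand
truncated = wide & 0xFFFFFFFF             # AND writes the value back to 32 bits
carry = (wide >> 32) & 1                  # the overflow itself is still available
```

Here wide is 0x100000000, truncated is 0, and carry is 1, so an analysis can inspect the overflowing value directly instead of relying on a flag.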
2.2 The bitwise instructions
The next biggest instruction group is the group formed by
the three bitwise instructions.
AND: Bitwise AND of two values
OR: Bitwise OR of two values
XOR: Bitwise XOR of two values
The three bitwise instructions work exactly like one ex-
pects bitwise instructions to work. Bit for bit they connect
the bits of two input operands according to the truth ta-
ble defined for their operation. The calculated value is then
written to the output operand of the instruction.
A bitwise NOT instruction is not part of the REIL in-
struction set because NOT is equivalent to XOR-ing a value
with a value of equal size and all bits set. That means to
calculate the one’s complement of the 16-bit value 0x1234
one would XOR it with the 16-bit value 0xFFFF.
Footnote 3: The result operand of addition and subtraction is technically
too large because these operations performed on two
32-bit values can only overflow into the 33rd bit; however,
there is no 33-bit REIL operand size, so the next biggest
operand size (64-bit) was chosen.
AND t0/b4, t1/b4, t2/b4
OR t7/b4, t9/b4, t12/b4
XOR t8/b4, 4/b4, t9/b4
Figure 2: Examples of the bitwise REIL instructions
Figure 2 shows examples of all bitwise instructions. Their
general structure equals the structure of the arithmetic in-
structions. Like them, bitwise instructions take two input
operands and store the result of the operation in the out-
put operand. One important difference is that none of the
bitwise instructions produce an overflow. An explicit mod-
eling of overflowing values and an extension of the size of
the output operand are therefore not necessary.
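The NOT-via-XOR equivalence mentioned in this section can be checked with a short Python sketch. The helper name reil_not is hypothetical, and the size is given in bytes to match the bN operand sizes.

```python
def reil_not(value: int, size: int) -> int:
    """One's complement via XOR with an all-ones value of equal size (in bytes)."""
    all_ones = (1 << (8 * size)) - 1
    return (value ^ all_ones) & all_ones

print(hex(reil_not(0x1234, 2)))  # prints 0xedcb
```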
2.3 Data transfer instructions
To access the REIL memory, two different REIL instruc-
tions are needed. One is used for loading a value of arbitrary
size from the REIL memory while the other one is used to
store a value of arbitrary size to the REIL memory. Fur-
thermore, this group of instructions contains an instruction
that is used to transfer values into registers.
LDM: Load a value from memory
STM: Store a value to memory
STR: Store a value in a register
The first operand of the LDM instruction contains the
address of the REIL memory where the value is loaded from.
This operand can either be an integer literal or a register.
When the instruction is executed, it loads the value from
the given memory address and stores it in the third operand
of the instruction. The size of the value that is loaded from
memory equals the size of the third operand. If the
third operand is a 32-bit register, a 32-bit value is loaded
from memory. As the loaded value is written to the third
operand, the third operand must be a register.
The store operation STM is the inverse operation to the
load operation LDM. It can be used to store a value of ar-
bitrary size to memory. The first operand of the STM in-
struction is the value to be stored in memory. Its size de-
termines how many bytes are written to memory when the
STM instruction is executed. The third operand specifies
the address where the value of the first operand is written
to. Both operands can be either integer literals or registers.
The second operand is unused.
The STR instruction is one of the simplest instructions of
the REIL instruction set. It copies a value to the output
register specified in the instruction. The input operand can
be either a literal (to load a register with a constant) or
another register (to transfer the content of one register to
another register).
LDM 413800/b4, , t1/b2
STR t1/b2, , t2/b2
STM t2/b2, , 415280/b4
Figure 3: Examples of the data transfer REIL instructions
Figure 3 shows a sequence of data transfer instructions
that load a value from memory, copy it to another register,
and store it back to another address in memory. Since the
size of the output register of LDM instructions specifies how
many bytes are loaded from memory, it is clear that two
bytes are loaded from memory. The size of the two used
operands of STR instructions is typically the same, as STR
only copies a value. In the end the two-byte register t2 is
stored back to memory.
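The sequence in Figure 3 can be replayed with a minimal Python model of the data transfer instructions. The class ReilMemory and its method names are hypothetical. The little-endian byte order is an assumption of this sketch (byte order is left to the translators), and the addresses from the figure are used as plain integers.

```python
class ReilMemory:
    """Sparse, flat, byte-addressed memory (hypothetical helper class)."""

    def __init__(self):
        self.bytes = {}

    def ldm(self, address: int, size: int) -> int:
        # Assumption: little-endian layout; bytes never written read as 0.
        return sum(self.bytes.get(address + i, 0) << (8 * i) for i in range(size))

    def stm(self, value: int, size: int, address: int) -> None:
        # The size of the stored operand decides how many bytes are written.
        for i in range(size):
            self.bytes[address + i] = (value >> (8 * i)) & 0xFF

memory = ReilMemory()
registers = {}
memory.stm(0xBEEF, 2, 413800)             # give the source address some content
registers["t1"] = memory.ldm(413800, 2)   # LDM 413800/b4, , t1/b2
registers["t2"] = registers["t1"]         # STR t1/b2, , t2/b2
memory.stm(registers["t2"], 2, 415280)    # STM t2/b2, , 415280/b4
```

After these steps the two bytes at the destination address hold the same value that was loaded from the source address.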
2.4 Conditional instructions
The group of conditional instructions is used to compare
values and depending on the results of the comparison to
jump to one REIL instruction or another.
BISZ: Compare a value to zero
JCC: Conditional jump
The BISZ instruction is the only instruction of the REIL
instruction set that can be used to compare two values. In
fact, it can only be used to compare a single value to zero
but this is sufficient to emulate any kind of more complex
comparison. The BISZ instruction takes a single operand,
compares it to zero, and depending on the value of the input
operand, the output operand is set to 0 (if the value of the
input operand was not 0) or 1 (if the value of the input
operand was 0).
The conditional jump instruction JCC is typically used to
process the results of a BISZ instruction. If the first operand
of the JCC instruction evaluates to 0, the jump is not taken.
If the first operand evaluates to any other value than zero,
the jump is taken and control is transferred to the address
(or sub-address) specified in the third operand.
An unconditional jump is not part of the REIL instruction
set because it is possible to emulate an unconditional jump
using a conditional jump by setting the first operand of the
conditional jump to the integer literal 1 (or any other non-
zero integer literal).
BISZ t0/b4, , t1/b1
JCC t1/b1, , 401000/b4
Figure 4: Examples of the conditional REIL instructions
Figure 4 shows a typical sequence of a single BISZ instruc-
tion followed by a JCC instruction that uses the output of
the BISZ instruction to determine whether to take a jump to
the address specified in its third operand. Since the output
of BISZ instructions is always either 0 or 1, the size of the
output operand of BISZ instructions is always b1.
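To illustrate how BISZ suffices for more complex comparisons, the following Python sketch emulates "jump if two values are equal" with a SUB, a BISZ, and a JCC step. The function names and the wrap-around treatment of negative differences are assumptions of this sketch, not a statement about what REIL translators emit.

```python
def bisz(value: int) -> int:
    """BISZ: 1 if the input is zero, 0 otherwise."""
    return 1 if value == 0 else 0

def sub(a: int, b: int, out_size: int = 8) -> int:
    """SUB with the result wrapped to the output operand size (in bytes)."""
    return (a - b) & ((1 << (8 * out_size)) - 1)

def jump_if_equal(a: int, b: int, target: int, fallthrough: int) -> int:
    """Return the next address for 'jump to target if a == b'."""
    t0 = sub(a, b)     # SUB a, b, t0    -> zero only if a == b
    t1 = bisz(t0)      # BISZ t0, , t1   -> 1 if a == b, else 0
    return target if t1 != 0 else fallthrough   # JCC t1, , target
```

Other comparisons can be built in the same spirit, for example by inspecting the sign bit of the difference before applying BISZ.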
2.5 Other instructions
There are a few other instructions which do not really
belong to any group at all.
UNDEF: Undefines a value
UNKN: Unknown source instruction
NOP: No operation
The UNDEF instruction undefines the value of a register.
This means that once the UNDEF instruction is executed,
the value inside the undefined register is unknown. This
is important because there are native assembly instructions
which leave registers or flags in an undefined state. The x86
instruction DIV, for example, leaves a number of flags like
the zero flag and the carry flag in an undefined state.
The UNKN instruction is kind of a placeholder instruc-
tion. It indicates that during the REIL code generation an
original assembly instruction was encountered that could not
be translated.
The NOP instruction does nothing. Nevertheless it is not
useless. REIL translators can generate this instruction to
pad control flow in certain edge cases. In a few situations
this is very useful because it keeps REIL translators very
simple. Without the existence of the NOP instruction, the
REIL translator would have to look ahead to the next native
instruction to generate correct REIL code (see footnote 4).
UNDEF , , t1/b4
UNKN , ,
NOP , ,
Figure 5: Examples of other REIL instructions
Figure 5 shows examples of the remaining REIL instruc-
tions. The only instruction that takes operands is the UN-
DEF instruction which undefines a register.
The definition of the REIL language includes the descrip-
tion of the REIL architecture and the definition of a virtual
machine that can be used to execute the generated REIL code.
The REIL architecture is a simple register-based architec-
ture without an explicit stack. The number of registers avail-
able in REIL code is unlimited. As previously explained,
the names of REIL registers have the form t<number>. The
index number of register names is unbounded. There is furthermore
no requirement that all REIL registers between t0
and t(n-1) are used by a given program that uses n different
registers. A program that uses exactly three REIL registers
can use t7, t799, and t3199 if desired.
REIL registers themselves do not have a fixed width or
a limited width. The size of REIL registers is always equal
to the size of the operands where they are used. The size
of REIL registers can even change between instructions. In
one instruction register tn can have size bs while in another
instruction it can have size bt. Since operands can grow
arbitrarily large, REIL registers can also grow arbitrarily
large. In practice we have not yet encountered registers with
more than 128 bits (equivalent to b16) though.
We already mentioned that registers of the original input
code can appear in REIL code. In fact, the registers of
the original architecture will always appear in REIL code to
make it possible to port results of REIL code analysis back
to the original code. This does not violate the platform-
independent nature of REIL code. REIL registers and native
registers can be mixed at will and be treated completely
uniformly. While analyzing REIL code there is no difference
between the registers t0, t1, and t2 and the registers eax, ebx,
and ecx. At the end of an analysis algorithm one can then
easily distinguish between the REIL registers (which have
the tn form) and the native registers (which do not have the
tn form) to port the values of relevant registers back to the
original assembly code.
Footnote 4: Technically, the NOP instruction could of course be
replaced by an instruction like add 0, 0, tn that also has no
discernible effect on the program state.
The memory of the virtual REIL machine follows a flat
memory model. Unlike the memory of some real CPUs like the
x86, which has memory segments (in real mode) or at least
memory selectors (in protected mode), REIL memory starts at
address 0 and can grow arbitrarily large. While there is technically
an infinite amount of storage available in REIL memory,
practical concerns of the source architecture limit the used
memory in practice. If the source assembly language (like
32-bit x86 assembly) can only address 4 GB of memory, only
4 GB of REIL memory will ever be accessed in REIL code
created from x86 programs. REIL memory higher than the
addressable memory range of the source architecture
is never used.
Due to the flat memory model of the REIL memory, seg-
mented memory access of native architectures must be sim-
ulated in REIL programs if necessary. This can be done by
creating virtual segments which represent the memory seg-
ments of the native architecture. Since REIL memory is not
limited in size, there is enough space available to make these
virtual segments non-overlapping, meaning that memory ac-
cess through one segment of the native architecture never
interferes with memory access through another segment of
the native architecture.
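One possible way to lay out such non-overlapping virtual segments is sketched below in Python. The constant SEGMENT_SPAN, the function name flat_address, and the idea of numbering segments with an index are all assumptions made for illustration. The text above does not prescribe a particular layout.

```python
SEGMENT_SPAN = 1 << 32  # assumed amount of flat memory reserved per virtual segment

def flat_address(segment_index: int, offset: int) -> int:
    """Map (virtual segment, offset) to a flat address; segments never overlap."""
    if not 0 <= offset < SEGMENT_SPAN:
        raise ValueError("offset does not fit into one virtual segment")
    return segment_index * SEGMENT_SPAN + offset
```

Because every segment gets its own range of the unbounded flat memory, an access through one segment can never hit an address that belongs to another segment.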
The endianness of the source architecture must be con-
sidered too when accessing REIL memory. On native ar-
chitectures endianness falls into two different categories. In
some cases (like x86) native architectures have a fixed endi-
anness that can not be changed during runtime while other
architectures can switch the endianness of their memory ac-
cess at runtime by executing a special instruction (Pow-
erPC, ARM). In general, REIL does not have any mech-
anisms to deal with endianness. All endianness issues must
be handled by the REIL translators when generating the
REIL instructions that access memory. This poses a prob-
lem when endianness is switched at runtime because REIL
code is generated in advance and cannot be updated anymore
when endianness switching happens. However, the rarity
of endianness switching makes this a special situation
that is seldom relevant for security audits.
After REIL memory and the REIL registers are given an
initial state, REIL code can be analyzed or even executed.
Execution of REIL code happens just like program execution
on a real CPU. Starting with the value in the program
counter register, REIL code is executed (see footnote 5). The REIL
instruction at the position of the current program counter is fetched
and interpreted with regard to the current state of the REIL
register bank and the REIL memory. Once interpretation is
complete, the REIL register bank and the REIL memory are
updated to reflect the effects of the instruction on the global
state.
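The fetch-and-interpret cycle described above can be sketched as a tiny Python interpreter for a handful of instructions. This is a simplified illustration under several assumptions: instructions are plain tuples, operand sizes are ignored, the program counter is a local variable instead of the native program counter register, and the function name run is hypothetical.

```python
def run(program: dict, registers: dict, pc: int, max_steps: int = 1000) -> dict:
    """Fetch and interpret REIL-like tuples until pc leaves the program."""
    def read(op):
        # Strings are register names, integers are literals.
        return registers[op] if isinstance(op, str) else op

    for _ in range(max_steps):
        if pc not in program:
            break
        mnemonic, a, b, out = program[pc]
        next_pc = pc + 1
        if mnemonic == "ADD":
            registers[out] = read(a) + read(b)
        elif mnemonic == "STR":
            registers[out] = read(a)
        elif mnemonic == "BISZ":
            registers[out] = 1 if read(a) == 0 else 0
        elif mnemonic == "JCC":
            if read(a) != 0:
                next_pc = read(out)
        elif mnemonic != "NOP":
            raise ValueError(f"unsupported instruction: {mnemonic}")
        pc = next_pc
    return registers

# Count t0 down from 3 to 0 using STR, ADD, BISZ and JCC.
program = {
    0x100: ("STR", 3, None, "t0"),
    0x101: ("ADD", "t0", -1, "t0"),
    0x102: ("BISZ", "t0", None, "t1"),   # t1 = 1 once t0 reaches zero
    0x103: ("BISZ", "t1", None, "t2"),   # t2 = 1 while t0 is still non-zero
    0x104: ("JCC", "t2", None, 0x101),   # loop back while t2 is non-zero
}
state = run(program, {}, 0x100)
```

When the loop ends, t0 holds 0 and the interpreter stops because the program counter points past the last instruction.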
The translation of native assembly code to REIL code
is straightforward. For each supported native assembly language
there is a so-called REIL translator. This REIL translator
takes a piece of native assembly code and translates it
to REIL code. Linearly iterating over all instructions in a
piece of input code, the translator translates each instruction
to REIL code independently. The REIL translator does
not look ahead to see what instruction follows the current
instruction and it does not require information generated
during the translation of previous instructions. This statelessness
of the translation makes REIL translators very simple.
In fact, REIL translators are nothing but glorified maps
that repeatedly map a single native instruction to a list of
REIL instructions.
Footnote 5: There is no special REIL program counter register. Rather,
the program counter register of the input architecture is
used. This is important to make sure that at each step
of the REIL code analysis, the value of the program counter
register has the same value as it would have during a real
execution of the program on the source platform.
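The "glorified map" idea can be made concrete with a toy Python sketch. The table TRANSLATIONS, its two entries, and the function translate are illustrative assumptions only. They do not reproduce the translations of any real REIL translator, which also have to model flags and operand sizes.

```python
# Toy per-mnemonic translation table; both entries are illustrative assumptions.
TRANSLATIONS = {
    "nop": lambda ops: ["NOP , , "],
    "mov": lambda ops: [f"STR {ops[1]}/b4, , {ops[0]}/b4"],
}

def translate(native_code):
    """Translate (mnemonic, operands) pairs one at a time, without shared state."""
    reil = []
    for mnemonic, operands in native_code:
        handler = TRANSLATIONS.get(mnemonic)
        # Instructions without a handler become UNKN placeholders.
        reil.append(handler(operands) if handler else ["UNKN , , "])
    return reil

listing = translate([("mov", ["eax", "5"]), ("nop", []), ("fadd", ["st0", "st1"])])
```

Each native instruction is mapped to its own list of REIL instructions, and no handler needs to know what came before or what follows.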
Due to the simplicity of REIL instructions and what they
can do in one step, a single native assembly instruction is
nearly always translated to many REIL instructions. Exper-
imental results have shown that on average, an original in-
struction is translated into approximately 20 REIL instruc-
tions while the most complex native instruction we found in
practice was translated to more than 50 REIL instructions.
This one-to-many relation between native instructions and
REIL instructions unfortunately destroys a direct correspon-
dence between the address of a native assembly instruction
and the addresses of the REIL instructions created for the
native assembly instruction. Having such a correspondence
would be most desirable because it would make it signif-
icantly simpler to port the results of a REIL analysis al-
gorithm back to the original assembly code. To solve this
problem, the addresses of REIL instructions are shifted to
the left by 8 bits (or multiplied by 0x100). This means that
the first REIL instruction that corresponds to the native assembly
instruction at offset n has the offset 0x100 * n while
the second REIL instruction has the offset 0x100 * n + 1 and
so on. This address translation limits the translation of a
single native instruction to at most 256 different REIL instructions.
Should it ever happen that more than 256 REIL
instructions are generated for a single native instruction, the
addresses of the REIL instructions would overflow into the
addresses of the REIL instructions of the following native
instruction.
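The address scheme can be written down directly. In the Python sketch below the names REIL_SHIFT, reil_address, and to_native_address are hypothetical, and rejecting indices of 256 or more is a choice made for this sketch.

```python
REIL_SHIFT = 8  # native addresses are shifted left by 8 bits (multiplied by 0x100)

def reil_address(native: int, index: int) -> int:
    """Address of the index-th REIL instruction of the native instruction at `native`."""
    if not 0 <= index < (1 << REIL_SHIFT):
        raise ValueError("at most 256 REIL instructions fit per native instruction")
    return (native << REIL_SHIFT) + index

def to_native_address(reil_addr: int) -> int:
    """Recover the address of the native instruction a REIL instruction belongs to."""
    return reil_addr >> REIL_SHIFT
```

Mapping analysis results back to the original code then only requires shifting a REIL address right by 8 bits.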
There are a number of more or less significant issues that
might limit the use of REIL in practice. Some of these lim-
itations are built into the REIL language itself while others
exist simply because we have not yet had time to implement
certain aspects of native architectures.
The first limitation is that the REIL translators we have
so far (32-bit x86, 32-bit PowerPC, and 32-bit ARM) are
unable to translate certain classes of instructions. For ex-
ample, none of the translators can translate FPU instruc-
tions. CPU extensions like the MMX and SSE extensions
of x86 CPUs are also not translated yet. We have chosen
to skip the translation of these instructions because REIL is
supposed to be a language for analyzing assembly code for
security-critical bugs and vulnerabilities. FPU, MMX, and
SSE extensions are only very rarely involved in these kinds
of flaws. Should FPU bugs or other CPU extension bugs
become popular targets of software exploits in the future,
we can easily extend our existing translators to be able to
handle these instructions.
Like FPU instructions, privileged instructions like system
calls, interrupts, and other kernel-level instructions are not
translated by our current REIL translators. The justifica-
tion for the lack of support for these kinds of instructions
follows along the lines of the lack of FPU support. In our
initial implementation of REIL we wanted to focus on the
instructions that are most often involved in some kind of
security-relevant software flaws. Depending on the exact
effects of the missing privileged instructions, it might be
trivial to impossible to add them to the REIL language. An
instruction that has significant low-level effects on the under-
lying hardware, for example one that flushes the CPU cache,
will never be part of REIL for this would mean a complete
loss of platform-independence and/or a big increase in the
number of different instruction mnemonics. Other privileged
instructions like interrupt execution can often be simulated
using the features REIL already has.
REIL can also not deal with exceptions in a platform-
independent way. This means that at this point exceptions
and the corresponding stack unwinding can not be handled
by REIL. Due to the lack of exception handling common
situations that throw exceptions (dividing by zero, hitting
a breakpoint, ...) are simply ignored in the default REIL
The next limitation is that REIL cannot handle self-modifying
code of any kind. This is simply because the native code of a
function is pre-translated instruction by instruction
and the resulting REIL code is fixed after the
initial translation. The reason for this is that REIL instructions
themselves do not reside in the REIL memory. They
can therefore not be overwritten and modified during the
interpretation of REIL code.
The first and foremost goal of the next few months is to
write more REIL translators (for example to translate MIPS
code) and to implement more REIL-based code analysis algorithms.
Additionally, we have a few minor ideas about
improving the quality of generated REIL code and its use-
fulness in static code analysis.
The first idea is the introduction of a bit-sized operand
type b0. Right now the smallest operand type is the byte-sized
operand b1. During bit-width analysis it might be useful
to know that an operand that has size b1 in current code
does not use any bits but its least significant bit. Extending
on this idea, maybe it would be smarter to give the size of
operands in bits instead of bytes.
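As a rough illustration of the gap between declared and used width, the following sketch compares the size of a byte-sized operand with the number of bits needed for the values it can actually hold. The helper names `declared_bits` and `required_bits` are hypothetical, and the upper bound on the operand's value is assumed to come from some earlier analysis that is not shown here.

```python
def declared_bits(size_in_bytes: int) -> int:
    """Width implied by a byte-based operand size such as b1, b2, or b4."""
    return 8 * size_in_bytes


def required_bits(max_value: int) -> int:
    """Smallest width in bits that can hold every value in 0..max_value."""
    if max_value < 0:
        raise ValueError("only unsigned upper bounds are supported")
    # int.bit_length() returns 0 for the value 0, so use at least one bit.
    return max(1, max_value.bit_length())


# A flag-like operand: declared as b1, but it only ever holds 0 or 1.
flag_declared = declared_bits(1)
flag_required = required_bits(1)
unused_bits = flag_declared - flag_required
```

With byte-based sizes the seven unused bits of such a flag are invisible in the operand type. A bit-sized operand type, or sizes given in bits, would let the translator state this fact directly.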
An idea that can be used to improve the correctness of
REIL translation and certain analysis algorithms is the
introduction of two additional instructions, extend and reduce.
The motivation for these two instructions is simple. Right
now there are no limitations on how operand sizes can be
combined in one instruction. When generating an ADD
instruction, one input operand can have size b1 while the
second input operand can have size b4. A rule that specifies
that the input operands of all instructions must be of equal
size would make REIL code more regular for analysis, and
certain classes of bugs in REIL translators could be checked
for automatically. The role of the extend instruction would be
to extend a value of a smaller size like b1 to a larger size
like b2 or b4 while keeping the value of the extended register
the same. The reduce instruction would be the opposite of
the extend instruction. Reduce would reduce the size of an
operand to a smaller operand size. In this case it cannot be
guaranteed that the value of the reduced register equals the
value of the original register. In many situations overflowing
high bits will be truncated and lost. This is perfectly
acceptable though, because such truncation already occurs in
many situations, for example when writing the 33-bit wide
result of an addition of two 32-bit values back to a 32-bit
register while truncating the overflow. Right now this
truncation is done using an AND instruction. In the future the
reduce instruction might make things semantically clearer.
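The proposed instructions can be sketched as follows. This is an illustration under stated assumptions and not a definition of the instructions: the names `Value`, `extend_size`, `reduce_size`, and `add` are hypothetical, values are treated as unsigned so that extension is zero extension, and the result of an addition is assumed to be twice as wide as its inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    bits: int  # unsigned bit pattern
    size: int  # operand size in bytes (b1 -> 1, b2 -> 2, b4 -> 4, ...)

    def __post_init__(self) -> None:
        if not 0 <= self.bits < (1 << (8 * self.size)):
            raise ValueError("bit pattern does not fit the declared size")


def extend_size(value: Value, new_size: int) -> Value:
    """Widen an operand; the numeric value stays the same (zero extension)."""
    if new_size < value.size:
        raise ValueError("extend cannot shrink an operand")
    return Value(value.bits, new_size)


def reduce_size(value: Value, new_size: int) -> Value:
    """Narrow an operand; high bits are dropped, like AND with a mask."""
    if new_size > value.size:
        raise ValueError("reduce cannot grow an operand")
    mask = (1 << (8 * new_size)) - 1
    return Value(value.bits & mask, new_size)


def add(lhs: Value, rhs: Value) -> Value:
    """ADD under the equal-size rule; the result is assumed double-width."""
    if lhs.size != rhs.size:
        raise ValueError("input operands must have equal size")
    return Value(lhs.bits + rhs.bits, 2 * lhs.size)


small = Value(0xFF, 1)       # a b1 operand
wide = Value(0xFFFFFFFF, 4)  # a b4 operand

# Mixing b1 and b4 directly is rejected once the rule is enforced.
try:
    add(small, wide)
    mixed_sizes_rejected = False
except ValueError:
    mixed_sizes_rejected = True

# With explicit size changes the same computation is well-formed.
total = add(extend_size(small, 4), wide)  # 33-bit result held in 8 bytes
result = reduce_size(total, 4)            # overflow bit is truncated
```

Because every change of operand size is an explicit instruction in this scheme, a checker only has to compare the sizes of the input operands of each instruction to find translator bugs of this kind.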
The number of operand types might also be increased in
the future. As soon as FPU instructions are supported by
the REIL translators, it will be necessary to add single-precision
FPU operands and double-precision FPU operands. Another
example is certain architectures like PowerPC where
registers can be addressed not by name but by an index into
the register bank. These instructions cannot be translated
to REIL yet because REIL does not have an operand type
like register index.
The use of intermediate languages for code analysis is
nothing new. In fact all serious compilers use some kind
of intermediate language during the optimization phase of
their generated code (see GCC for example). Creating inter-
mediate representations for disassembled assembly code in
the context of security analysis is not nearly as widespread.
Nevertheless there are a few approaches which are noteworthy.
At a conference in 2008, Mihai Chiriac of the anti-virus
software company BitDefender presented an intermediate
language that he used to speed up the emulation of
obfuscated malware programs [1]. The intermediate lan-
guage he presented is structurally close to REIL. Like REIL,
his language has a very reduced instruction set where every
instruction has exactly one effect on the global state. Fur-
thermore his virtual architecture has an infinite number of
virtual registers and a fully emulated memory.
An open-source implementation of an intermediate lan-
guage specifically made for reverse engineering and stat-
ically analyzing binary code is the ELIR language of the
ERESI project⁶. Like REIL, the goal of the ELIR intermediate
language is simplified platform-independent reasoning
about assembly code by providing an intermediate language
that makes the effects of all native assembly operations ex-
plicit. An overview of the ELIR language was given in Julien
Vanegue’s EKOPARTY 2008 talk Static binary analysis with
a domain specific language [2].
A commercial use of intermediate language recovery from
disassembled code in the context of security analysis is IDA
Pro and Hex-Rays. IDA Pro is the industry standard disas-
sembler for many platforms and Hex-Rays is a decompiler
plugin for IDA Pro. The Hex-Rays decompiler uses an in-
termediate language representation (IR) of the underlying
disassembled code to analyze and optimize the disassembled
code and to decompile it into a C-style high-level language.
As shown in Ilfak Guilfanov’s Black Hat 2008 presentation
Decompilers and Beyond [3] [4], the intermediate representa-
tion used by Hex-Rays is significantly different from REIL.
There are more instructions in the Hex-Rays IR and they
do not obey the single-responsibility rule for avoiding side
effects. Other differences include the distinction between in-
teger literals and pointers to code which is present in the
Hex-Rays IR but not in REIL and features like the option
to address basic blocks instead of addresses in jump
instructions. Another striking difference that can be seen directly
when looking at code snippets of REIL and the Hex-Rays IR
is that REIL uses far more temporary registers to translate
a typical piece of code.
Another implementation of an intermediate language was
created by GrammaTech in their CodeSurfer/X86 product.
While not publicly available at this point, several whitepa-
pers have been released about CodeSurfer/X86 (for example
see [5] or [6]). Unfortunately these whitepapers focus on the
results of certain static analysis algorithms with CodeSurfer/X86
instead of their intermediate language so it is unclear at this
point how similar this language is to REIL.
As part of AbsInt, an analysis framework specifically suited
for statically analyzing embedded system code, Saarland
University developed the intermediate language CRL2. Like
REIL, CRL2 is generated by transforming the assembly code
of a disassembled input program. Nevertheless the similar-
ities to REIL end at this point. CRL2 was specifically de-
veloped for detailed control flow analysis and as a result of
that, CRL2 code is very complex due to a large number of
annotations that are relevant for control flow. Examples of
generated CRL2 code can be found at [7].
Using the information presented in this paper it is possi-
ble to write a complete implementation of the Reverse En-
gineering Intermediate Language REIL that can be used for
static code analysis of disassembled assembly code. We have
already created a commercial implementation of REIL in
our product BinNavi and we have successfully written sev-
eral simple static code analysis algorithms. Thanks to REIL
these algorithms work platform-independently on x86 code,
on PowerPC, and on ARM code.
[1] Mihai G. Chiriac. Anti Virus 2.0 - Compilers in
disguise. October 2008.
[2] Julien Vanegue. Static binary analysis with a domain
specific language. EKOPARTY 2008, October 2008.
[3] Ilfak Guilfanov. Decompilers and beyond. BlackHat
USA 2008, August 2008.
[4] Ilfak Guilfanov. Decompilers and beyond - Whitepaper.
BlackHat USA 2008, August 2008.
[5] Gogul Balakrishnan, Radu Gruian, Thomas Reps, and
Tim Teitelbaum. CodeSurfer/x86 - a platform for
analyzing x86 executables. In Lecture Notes in
Computer Science, pages 250–254. Springer, 2005.
[6] T. Reps, G. Balakrishnan, J. Lim, and T. Teitelbaum.
A next-generation platform for analyzing executables.
In APLAS, pages 212–229, 2005.
[7] AbsInt Angewandte Informatik GmbH. CRL Version 2
Manual.