A Survey on Tools for Binary Code Analysis
ABSTRACT Different strategies for binary analysis are widely used in systems dealing with software maintenance and system security. Binary code is self-contained; though it is easy to execute, it is not easy to read and understand. Binary analysis tools are useful in software maintenance because the binary of a program holds all the information necessary to recover the source code. Binary analysis is also extremely important and sensitive in the domain of security. Malicious binary code can infect other applications, hide in their binary code, contaminate the whole system, or travel through the Internet and attack other systems. This makes it imperative for security personnel to scan and analyze binary code with the aid of binary code analysis tools. On the other hand, crackers can reverse engineer binary code to assembly code in order to break the secrets embedded in it, such as registration numbers, passwords, or secret algorithms. This motivates research into preventing malicious monitoring by binary code analysis tools. Evidently, binary analysis tools play an important double-sided role in security. This paper surveys binary code analysis from the most fundamental perspectives: the binary code formats and several of the most basic analysis tools, such as disassemblers, debuggers, and the instrumentation tools based on them. Previous research on binary analysis is investigated and summarized, and a new approach to analysis, a disassembler-based binary interpreter, is proposed and discussed.
Stony Brook University
August 24, 2004
1. Introduction on Binary and Binary Analysis Tools
The software industry is one of today’s most promising sectors. Each year, software
companies release a vast number of software products. All of them are ultimately
translated to binary before execution, regardless of the high-level language used in the
source code. In other words, binary code is one of the lowest-level representations of
software.
State-of-the-art program analysis tools work best at the source code level, because
they can use much more high-level information than is present at the binary code level.
Why, then, is binary code still worth studying? The most important reason is that all the
secrets of a piece of software exist in its binary. With the necessary skill and patience,
it is possible to reveal all of them from the binary alone. Moreover, most commercial
software products, especially on the Windows platform, and most malicious software, such
as viruses, Trojans, and spyware, are distributed in the form of binary code, so it is
extremely important to explore methods for analyzing binary code.
Compared to source code, binary code has two notable properties: it is convenient to
execute but difficult to understand. Once generated on one machine, the binary code is
self-contained, with all static library routines bound into the binary file. It is also
portable in the sense that it can be copied to any other machine with the same hardware
instruction set and executed simply and immediately. On the other hand, a binary is much
harder to understand than source code written in a high-level programming language, since
everything inside it is represented as 0s and 1s. This opacity helps protect the
confidentiality of the software. Both properties are so useful in practice that software
companies prefer to distribute their products in binary form.
The source code of some old software may have been lost, leaving only the binary. It is
hard, if not impossible, to recover the original source code from the binary
representation. Considerable research in recent years has aimed at reverse engineering
binary code back to high-level languages. Cifuentes discusses several possible solutions
to the problem in her thesis and papers [CIG96], including how to construct
inter-procedural data structures and how to build control flow graphs. Furthermore, binary
analysis is very important for protecting a system from attacks. Since most applications,
such as Windows applications, are available in binary form, security protection based on
binary analysis does not require source code and so avoids many legal difficulties.
On the flip side, binary analysis can also be used for malicious purposes. Malicious
users can seriously threaten the confidentiality of software products with binary
analysis, which is one of the critical issues in software safety. Since binaries contain
all the secrets of the software, malicious users may apply tools to crack them and reveal
the secrets hidden inside by reverse engineering. Here, reverse engineering means taking
the binary code, or the low-level instructions executed by the processor, and extracting
high-level information from it. Most of the publicly available information about reverse
engineering is found on sites dedicated to software piracy and to cracking software
protection schemes. For years the most famous resource for crackers was Fravia’s website
[FRAVIA]. While its collection of tutorials and well-documented cracking techniques makes
it an invaluable resource for the aspiring “reverse engineer”, software cracking costs
software companies thousands or even millions of dollars every year.
The development of cracking techniques has also spurred an active research direction:
software protection, which prevents software from being reverse engineered. Cracking and
protection seem locked in an endless game. Both need binary analysis to understand the
binaries, but protection demands much more than cracking. Cracking only needs to
understand the logic of the code, find the security-sensitive parts, and disable or modify
them. A protection system must understand the binaries, build defense mechanisms, insert
them into the security-sensitive parts, and, furthermore, prevent both the original
binaries and the defense mechanisms from being understood through reverse engineering.
Second, malicious software may threaten the safety of a user’s system or machine.
Malicious software, such as viruses and Trojans, is distributed as binary code and hides
its own code within victim binaries. When the original binary executes, the malicious
code executes as well and infects more binaries. It may take control of the system with
the highest privilege and disrupt the entire system. To make things worse, with the
increasing popularity of the Internet, network-based attacks such as network-aware worms,
DDoS agents, IRC-controlled bots, and spyware have grown. The infection vectors have
changed, and malicious agents now use techniques such as email harvesting, browser
exploits, operating system vulnerabilities, and P2P networks to spread. A network
transfers data in the form of packets, but it does not inspect the payload the packets
carry. Whatever the packets contain, the network will assemble or disassemble them and
transfer them faithfully as long as their headers comply with the network protocol. This
can be very dangerous in practice: viruses and worms can travel across the network as
binary code to systems all around the world, become active in the contaminated systems,
and cause damage.
This paper focuses on the most fundamental aspects of binary analysis: binary code
formats and the most basic analysis tools, such as disassemblers and debuggers. These are
the basis of all other, more advanced binary tools, but unfortunately no paper so far has
covered them in sufficient detail. Computer simulators and emulators are also
investigated in this paper, since they are convenient for binary analysis. For each tool,
I will give an overview and describe the fundamental techniques and related problems.
Specific implementation challenges will be discussed, advantages and disadvantages will be
compared, and the existing state-of-the-art tools will be introduced. I will concentrate
on tools that reverse engineer binary code into assembly code, techniques for analyzing
the information in binary code and its corresponding assembly code, and related
applications in the security domain.
Tools for analyzing and instrumenting binaries, which have a broad range of
applications, are also discussed. For instance, anti-virus software can use disassemblers
and debuggers to scan packet data and monitor traffic in a network system. Once the tool
finds suspicious code, it can stop it and prevent damage in advance. Additionally, with
the help of binary analysis tools, even ordinary users and administrators can determine
whether binaries are harmful by examining them manually [REM04]. Furthermore, based on
the information obtained by analyzing binaries, security researchers can embed security
instrumentation code into the original programs to protect them [SL02].
The rest of this paper is organized as follows. Section 2 presents an overview of
different binary formats. Section 3 discusses the disassembler as a static analysis tool,
followed by Section 4, which introduces the debugger as a dynamic analysis tool and the
more complicated emulator, which is harder to implement. Section 5 describes
state-of-the-art reverse engineering and tamper-resistant software techniques. Section 6
proposes a new approach to implementing a binary analysis tool for security that combines
the advantages of disassemblers and debuggers and achieves better performance in both time
and accuracy. Finally, conclusions and future work are presented at the end of the paper.
2. Binary Object File Formats and Contents
Let’s look at how executable object code is generated. An object file is generated
from source code in a high-level language by a compiler, or from low-level assembly code
by an assembler. A linker then combines multiple related object files into one executable
file that provides their combined functionality. At run time, a loader loads the
executable file into memory and starts execution. This section introduces the formats,
structures, and contents of several popular object file types, and how these object files
start to run on their corresponding operating systems.
An object file may contain five fundamental types of information, as shown in
Figure 1.1: (1) header information, which is overall information about the file, such as
the size of the code and the creation date; (2) relocation information, which is a list
of addresses in the code that need to be fixed when the object file is loaded at an
address different from the expected loading address; (3) symbols, which are global
symbols used mainly by linkers, such as symbols imported from or exported to other
modules; (4) debugging information, which is used by debuggers, such as source file and
line number information, local symbols, and descriptions of data structures (e.g.,
structure definitions in C); and (5) code and data, which are the binary instructions and
data generated from the source file and stored in sections. John R. Levine [JL00]
describes object formats and their handling by linkers and loaders in detail.
Figure 1.1 Binary File Format General Abstraction
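The role of relocation information can be illustrated with a minimal sketch. The
ObjectFile class, its field names, the "expected_base" header key, and the assumption
that every relocated address is a 32-bit little-endian word are all hypothetical choices
made for illustration; they do not correspond to any real object format.

```python
from dataclasses import dataclass, field
import struct


@dataclass
class ObjectFile:
    """Toy model of the five kinds of information in an object file."""
    header: dict                                       # (1) overall file information
    relocations: list = field(default_factory=list)    # (2) offsets of addresses to fix
    symbols: dict = field(default_factory=dict)        # (3) global symbol -> offset
    debug_info: dict = field(default_factory=dict)     # (4) e.g. offset -> source line
    code_and_data: bytes = b""                         # (5) section contents


def load(obj: ObjectFile, actual_base: int) -> bytes:
    """Return the image as it would look when loaded at actual_base.

    Every offset in obj.relocations is assumed to hold a 32-bit
    little-endian absolute address computed for header["expected_base"].
    """
    delta = actual_base - obj.header["expected_base"]
    image = bytearray(obj.code_and_data)
    for offset in obj.relocations:
        (address,) = struct.unpack_from("<I", image, offset)
        struct.pack_into("<I", image, offset, (address + delta) & 0xFFFFFFFF)
    return bytes(image)


# A one-byte opcode followed by an absolute address that refers to offset 8
# of an image expected to be loaded at 0x1000.
example = ObjectFile(
    header={"expected_base": 0x1000},
    relocations=[1],
    symbols={"message": 8},
    code_and_data=b"\x90" + struct.pack("<I", 0x1008) + b"\x90\x90\x90" + b"DATA",
)
relocated = load(example, 0x5000)
```

Loading at the expected base leaves the image unchanged; loading anywhere else shifts
each listed address by the difference between the actual and expected bases.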
According to its usage, an object file falls into one or more of the following
categories:
• Linkable object file: used as input by a linker or linking loader. It contains a lot of
symbols and relocation information. Its object code is usually divided into many
small logical segments that the linker treats separately and differently.
• Executable object file: loaded into memory and run as a program. It contains
object code, usually with page alignment to allow the whole file to be mapped
into the address space. Usually it doesn’t need symbols or relocation information.
• Loadable object file: can be loaded into memory as a library along with
other programs. It may consist of pure object code, or it may contain complete
symbol and relocation information to permit runtime symbolic linking, depending
on the system’s runtime environment.
Object file formats differ considerably across systems. The most popular ones include
MS-DOS .COM files, Unix a.out files, ELF files, and the Windows PE format.
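As a rough sketch of how a tool might tell some of these formats apart, the function
below inspects the leading magic bytes of a file. The function name identify and the
mapping of ELF file types onto the three categories above are my own assumptions; the
mapping is only approximate, since on modern systems ET_DYN is also used for
position-independent executables. The a.out format is not handled here.

```python
import struct

# Approximate mapping of the ELF e_type field onto the categories above.
ELF_TYPES = {
    1: "linkable (ET_REL)",
    2: "executable (ET_EXEC)",
    3: "loadable/shared (ET_DYN)",
}


def identify(data: bytes) -> str:
    """Guess the object file format from the first bytes of a file."""
    if data[:4] == b"\x7fELF" and len(data) >= 18:
        # e_ident[5] gives the byte order: 1 = little-endian, 2 = big-endian.
        order = "<" if data[5] == 1 else ">"
        (e_type,) = struct.unpack_from(order + "H", data, 16)
        return "ELF, " + ELF_TYPES.get(e_type, "other type")
    if data[:2] == b"MZ":
        # A PE file starts with an MS-DOS stub; the 32-bit word at offset
        # 0x3C holds the file offset of the "PE\0\0" signature.
        if len(data) >= 0x40:
            (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
            if data[pe_offset:pe_offset + 4] == b"PE\0\0":
                return "Windows PE"
        return "MS-DOS .EXE"
    # A .COM file is raw code with no header, so it cannot be recognized.
    return "unknown"
```

Called on the first few hundred bytes of a file, identify returns a short description
string; a real tool would go on to validate the rest of the header.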
.COM file and .EXE file for MS-DOS
The simplest object format is the MS-DOS .COM file, a null object format: it contains
only pure executable binary code and no other information. In MS-DOS, offsets 0x00
through 0xFF of the program’s segment hold the Program Segment Prefix (PSP), which
contains the command-line arguments and other parameters. At run time, the object file is
loaded into a free chunk of memory, starting at the fixed offset 0x100 within that
segment. All segment registers are set to point to the PSP, and SP (the stack pointer)
points to the end of the segment. When the program needs more than one segment, it is the
programmer’s responsibility to use explicit segment numbers to address the program and
data.
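The loading procedure just described can be sketched as follows. The function load_com
and the dictionary it returns are hypothetical. The placement of the command tail at
offset 0x80 of the PSP and the initial SP value of 0xFFFE are details of the usual DOS
convention that the text above does not state, so they should be read as assumptions.

```python
SEGMENT_SIZE = 0x10000   # one 64 KiB real-mode segment
PSP_SIZE = 0x100         # offsets 0x00-0xFF hold the PSP


def load_com(image: bytes, segment: int, command_tail: bytes = b"") -> dict:
    """Simulate loading a .COM image into a single segment."""
    if len(image) > SEGMENT_SIZE - PSP_SIZE:
        raise ValueError(".COM image does not fit in one segment")

    memory = bytearray(SEGMENT_SIZE)

    # Assumed PSP layout: length byte at 0x80, command tail text from 0x81,
    # terminated by a carriage return.
    tail = command_tail[:126]
    memory[0x80] = len(tail)
    memory[0x81:0x81 + len(tail)] = tail
    memory[0x81 + len(tail)] = 0x0D

    # The raw code is copied to the fixed offset 0x100, right after the PSP.
    memory[PSP_SIZE:PSP_SIZE + len(image)] = image

    registers = {
        # All segment registers point to the PSP.
        "CS": segment, "DS": segment, "ES": segment, "SS": segment,
        "IP": PSP_SIZE,   # execution starts at offset 0x100
        "SP": 0xFFFE,     # assumed: last word of the segment
    }
    return {"memory": memory, "registers": registers}


# A one-byte program consisting of a single RET instruction (0xC3).
state = load_com(b"\xc3", segment=0x1234, command_tail=b" /a")
```

Because no header or relocation data exists, the loader does nothing beyond copying
bytes and setting registers, which is what makes the format "null".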
An MS-DOS .EXE file is an object file that carries relocation information in addition
to code and data. Its relocation entries indicate the places in the program where
addresses need to be modified when the program is loaded. The reason is that MS-DOS loads
a program into whatever memory happens to be free, so the segment addresses stored in the
code must be adjusted to the actual load address. Relocation remains useful even on
systems with hardware address translation: 32-bit Windows gives each program its own
address space, and each program can request a preferred load address, but the system does
not always load the program at that address. Figure 1.2 shows the header format of a .EXE
file, in which lastsize and nblocks give the size of the code, and relocs and nreloc
describe the relocation information.
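A minimal sketch of reading these header fields and applying the relocation entries is
given below. The names lastsize, nblocks, nreloc, and relocs come from the text; hdrsize
(header size in 16-byte paragraphs) and relocpos (file offset of the relocation table)
are names I assume for the corresponding fields of the standard MZ header layout, and the
functions parse_exe_header and relocate are hypothetical.

```python
import struct


def parse_exe_header(data: bytes) -> dict:
    """Extract size and relocation information from an MS-DOS .EXE header."""
    if data[:2] != b"MZ":
        raise ValueError("not an MS-DOS .EXE file")

    # Four 16-bit little-endian fields follow the two-byte signature.
    lastsize, nblocks, nreloc, hdrsize = struct.unpack_from("<4H", data, 2)
    (relocpos,) = struct.unpack_from("<H", data, 0x18)

    # The file occupies nblocks 512-byte blocks; lastsize is the number of
    # bytes used in the last block (0 means the last block is full).
    file_size = nblocks * 512
    if lastsize:
        file_size -= 512 - lastsize
    code_size = file_size - hdrsize * 16

    # Each relocation entry is an (offset, segment) pair of 16-bit words.
    relocs = [struct.unpack_from("<HH", data, relocpos + 4 * i)
              for i in range(nreloc)]

    return {"lastsize": lastsize, "nblocks": nblocks, "nreloc": nreloc,
            "hdrsize": hdrsize, "file_size": file_size,
            "code_size": code_size, "relocs": relocs}


def relocate(image: bytearray, relocs: list, load_segment: int) -> None:
    """Add the actual load segment to every word named by a relocation entry."""
    for offset, segment in relocs:
        position = segment * 16 + offset
        (value,) = struct.unpack_from("<H", image, position)
        struct.pack_into("<H", image, position, (value + load_segment) & 0xFFFF)
```

A loader would call parse_exe_header on the file contents, copy the code_size bytes
that follow the header into memory, and then call relocate with the segment at which the
program was actually placed.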