Structural Comparison of Executable Objects

Authors:
  • optimyze.cloud AG

Ulrich Flegel, Michael Meier (Eds.)
Detection of Intrusions and Malware
& Vulnerability Assessment
GI Special Interest Group SIDAR Workshop, DIMVA 2004
Dortmund, Germany, July 6-7, 2004
Proceedings
DIMVA 2004
Gesellschaft für Informatik 2004
Lecture Notes in Informatics (LNI) - Proceedings
Series of the Gesellschaft für Informatik (GI)
Volume P-46
ISBN 3-88579-375-X
ISSN 1617-5468
Volume Editors
Ulrich Flegel
University of Dortmund,
Computer Science Department, Chair VI, ISSI
D-44221 Dortmund, Germany
ulrich.flegel@udo.edu
Michael Meier
Brandenburg University of Technology Cottbus,
Computer Science Department, Chair Computer Networks
P.O. Box 10 13 44, D-03013 Cottbus, Germany
mm@informatik.tu-cottbus.de
Series Editorial Board
Heinrich C. Mayr, Universität Klagenfurt, Austria (Chairman, mayr@ifit.uni-klu.ac.at)
Jörg Becker, Universität Münster, Germany
Ulrich Furbach, Universität Koblenz, Germany
Axel Lehmann, Universität der Bundeswehr München, Germany
Peter Liggesmeyer, Universität Potsdam, Germany
Ernst W. Mayr, Technische Universität München, Germany
Heinrich Müller, Universität Dortmund, Germany
Heinrich Reinermann, Hochschule für Verwaltungswissenschaften Speyer, Germany
Karl-Heinz Rödiger, Universität Bremen, Germany
Sigrid Schubert, Universität Siegen, Germany
Dissertations
Dorothea Wagner, Universität Karlsruhe, Germany
Seminars
Reinhard Wilhelm, Universität des Saarlandes, Germany
Gesellschaft für Informatik, Bonn 2004
printed by Köllen Druck+Verlag GmbH, Bonn
Structural Comparison of Executable Objects
Halvar Flake
halvar@blackhat.com
Abstract: A method to heuristically construct an isomorphism between the sets of
functions in two similar but differing versions of the same executable file is presented.
Such an isomorphism has multiple practical applications, specifically the ability to
detect programmatic changes between the two executable versions. Moreover, information
(function names) which is available for one of the two versions can also be made
available for the other.
A framework implementing the described methods is presented, along with empirical
data about its performance when used to analyze patches to recent security
vulnerabilities. As a more practical example, a security update which fixes a critical
vulnerability in an H.323 parsing component is analyzed, the relevant vulnerability
extracted and the implications of the vulnerability and the fix discussed.
1 Introduction
While programs that compare different versions of the same source code file have been in
widespread use for many years, very little focus has so far been placed on the importance
of detecting and analyzing changes between two versions of the same executable.
Without an automated way of detecting source code changes in the object code resulting
from compilation, the party tasked with reverse engineering the changes from the object
code is at a disadvantage: It takes relatively little work to change source code and
recompile, while the analysis of the object code has to be completely redone to detect
the changes. Both authors of high-level-language virus families (such as SoBig) and
closed-source software vendors try to exploit this asymmetry: The authors of SoBig intend
to create large quantities of work for antivirus researchers in order to have more time
to use the infrastructure built by their worm, whereas closed-source vendors hope their
customers will have time to install patches because possible attackers presumably need
a lot of time to reverse-engineer the relevant changes from object-code-only security
updates.
This paper presents a novel approach which corrects the abovementioned asymmetry:
Given two variants of the same executable $A$, called $A'$ and $A''$, a one-to-one mapping
between the functions in $A'$ and the functions in $A''$ is created. The mapping does not
depend on the specific assembly-level instructions generated by the compiler but is more
general in nature: It maps control flow graphs of functions, thus ignoring less aggressive
optimizations such as instruction reordering and changes in register allocation.
This allows porting of information (such as function names from symbolic debug information
or prior analysis) from one executable to another. Furthermore, due to the approach
taken, functions that have changed their functionality significantly will not be mapped,
allowing the easy detection of functional changes to the program.
Detecting programmatic changes between two versions of the same executable is relevant
to security research as it allows for quick analysis of security updates ("patches") to extract
detailed information about the underlying security vulnerabilities. This allows for quick
assessment of the risk posed by a particular problem and can be used to prevent vendors
from fixing security issues "silently", i.e. without notifying their customers about the
security problem.
2 Previous Work
Automatically analyzing and classifying changes to source code has been studied extensively
in the literature, and listing all relevant papers is out of scope for this paper.
Most of this research focuses on treating the source code as a sequence of lines and
applying a sequence-comparison algorithm [Hir77][HS77].
The problem of matching functions in two executables to form pairs has been studied in
[ZW00, ZW99], although with a focus on the reuse of profiling information, which allowed
the assumption that symbols are available for both executables. Other work has been done
with a focus on efficient distribution of binary patches [Poc] [BM99]. Both approaches,
while finding differences between two binaries, are incapable of dealing with aggressive
link-time optimizations based on profiling information and will generate a lot of superfluous
information if register allocation or instruction ordering has changed. A bytecode-centric
approach to finding sections of similar Java code is studied in [BM98].
Recently another approach to binary comparison that also deals with graph isomorphisms
was discussed in [Tod]: Starting from the entry points of an executable, basic blocks are
matched one-to-one based on the instructions present in them. If no matching is possible,
a change must have occurred. Due to the reliance on comparing actual instructions, a
significant number of locations are falsely identified as changed: the paper mentions that
about 3-5 % of all instructions change between two versions of the same executable.
3 Graph-Centric analysis
Instead of focusing on the concrete assembly level instructions generated by a compiler,
we focus on a graph-centric analysis, neglecting as much of the assembly as possible and
instead analyzing only structural properties of the executable.
3.1 An executable as Graph of Graphs
We analyze the executable by regarding it as a graph of graphs. This means that our
executable consists of a set of functions $F := \{f_1, \dots, f_n\}$. They correspond to the
disassembly of the functions as defined in the original C source code. The callgraph of the
program is the directed graph with $\{f_1, \dots, f_n\}$ as nodes. The edges of this graph
represent function calls: An edge from $f_i$ to $f_k$ implies that $f_i$ contains a subfunction
call to $f_k$.
Every function $f_i \in F$ itself can be represented as a control flow graph (cfg for short)
consisting of individual basic blocks and their branch relations. Thus one can represent an
executable as a graph of graphs, i.e. a directed graph (the callgraph) in which each node
itself corresponds to the cfg of the corresponding function.
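The graph-of-graphs view can be written down as a small data model. The following is only a minimal sketch in Python: the class names `ControlFlowGraph` and `Executable`, the method `callees` and the sample functions `f1` and `f2` are hypothetical and do not come from the implementation described in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class ControlFlowGraph:
    """Basic blocks of one function and the branch edges between them."""
    blocks: set = field(default_factory=set)   # basic-block identifiers
    edges: set = field(default_factory=set)    # (source block, target block)


@dataclass
class Executable:
    """Callgraph whose nodes each carry the cfg of the corresponding function."""
    functions: dict = field(default_factory=dict)  # function name -> ControlFlowGraph
    calls: set = field(default_factory=set)        # (caller, callee) callgraph edges

    def callees(self, name):
        """Return the functions that `name` calls directly."""
        return {callee for caller, callee in self.calls if caller == name}


# Hypothetical two-function executable: f1 contains a subfunction call to f2.
exe = Executable()
exe.functions["f1"] = ControlFlowGraph({"entry", "exit"}, {("entry", "exit")})
exe.functions["f2"] = ControlFlowGraph({"entry"}, set())
exe.calls.add(("f1", "f2"))
```

Keeping the callgraph edges separate from the per-function cfg's mirrors the two levels at which the matching later operates.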
3.2 Control Flow Graphs
The concept discussed here is well-known in the literature on compilers and code analysis
[AVA99]. Every function in an executable can be treated as a directed graph of special
shape. Every node of the graph (a basic block) consists of assembly instructions that imply
the execution of the following instruction in memory if and only if the previous instruction
in memory was executed. To clarify this: Let $i_k, i_{k+1}$ be the addresses of two
assembly-level instructions which are adjacent in memory. These instructions belong to the
same basic block if the execution of $i_k$ at step $n$ of execution implies execution of
$i_{k+1}$ at step $n+1$, and the execution of $i_{k+1}$ at step $n+1$ implies execution of
$i_k$ at step $n$.
Control flow graphs have a few special properties:
1. Every cfg has a unique entry point, meaning a unique node that is not linked to by
any other node.
2. Every cfg has one or more exit points, meaning nodes that do not link to any other
node.
Figures 1 through 4 show a simple C function (figure 1), its assembly-level counterpart
(figure 2), a full cfg containing the assembly-level instructions (figure 3) and finally just
the cfg of the function (figure 4).
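As an illustration of the two properties, the cfg of the function from figure 1 can be written as an adjacency list. This is a sketch only: the block boundaries were derived by hand from the listing in figure 2, the names `entry`, `cmp_64h` and `ret_1` are made up for blocks that carry no label in the listing, and a disassembler may split blocks differently.

```python
# Basic blocks of foo(), keyed by label, with their successor blocks.
cfg_edges = {
    "entry":      ["loc_40126A"],             # prologue, jmp short loc_40126A
    "loc_401267": ["loc_40126A"],             # inc, then fall through
    "loc_40126A": ["loc_401267", "cmp_64h"],  # jnz taken / not taken
    "cmp_64h":    ["ret_1", "loc_401287"],    # jle not taken / taken
    "ret_1":      ["loc_40128C"],             # mov eax, 1 and jmp
    "loc_401287": ["loc_40128C"],             # mov eax, 0, then fall through
    "loc_40128C": [],                         # epilogue and retn
}


def entry_points(edges):
    """Nodes that no other node links to."""
    targets = {t for succs in edges.values() for t in succs}
    return [n for n in edges if n not in targets]


def exit_points(edges):
    """Nodes that do not link to any other node."""
    return [n for n, succs in edges.items() if not succs]


node_count = len(cfg_edges)
edge_count = sum(len(succs) for succs in cfg_edges.values())
```

Under this hand-made block split the graph has 7 nodes and 8 edges, one entry point and one exit point.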
3.3 Retrieving the information
In order to retrieve these graphs from an executable, a good disassembly of the binary
is needed. The industry standard for disassembly is IDA Pro [Dat], mainly due to its
excellent cross-platform capabilities coupled with a programming interface that allows
retrieval of the needed information without knowledge of the underlying CPU or its
assembly. This makes it possible to implement the described algorithms only once but test
them on multiple architectures.

int foo( int a, int b ) {
    while( a-- ) {
        b++;
    }
    if( b > 100 )
        return 1;
    else
        return 0;
}
Figure 1: The C function
3.4 Indirect calls and disassembly problems
In many cases creating a complete callgraph (which represents all possible relations
between the different functions) from a binary is not trivial. Specifically, indirect
subfunction calls through tables (very common, for example, in C++ code that uses virtual
methods) are hard to resolve statically.
In the presented approach, indirect calls whose targets cannot be resolved statically
are simply ignored and treated as regular assembly-level instructions. In practice, this
does not cause many problems: Whether a certain call is made directly or not is determined
not by the optimizer but by the code that is being compiled, and thus does not
change between different builds of the same program without a source code change.
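A minimal sketch of this policy, assuming a hypothetical input format in which every call site is a `(caller, target)` pair and `target` is `None` when the disassembler could not resolve it (the function name `build_callgraph` is likewise made up):

```python
def build_callgraph(call_sites):
    """Collect callgraph edges, skipping calls with unresolved targets."""
    edges = set()
    for caller, target in call_sites:
        if target is None:
            # Unresolved indirect call: contributes no callgraph edge and is
            # treated like any other instruction of the caller.
            continue
        edges.add((caller, target))
    return edges


sites = [("main", "parse"), ("main", None), ("parse", "alloc"), ("parse", None)]
edges = build_callgraph(sites)
```

Because the same call sites are indirect in both builds, dropping them affects both callgraphs in the same way.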
4 Structural matching
The general idea explored in this paper is matching the functions in two executables by
utilizing information derived from both the callgraph and the respective cfg's instead of
relying on instructions or instruction patterns. In this section two versions of the same
executable will be considered: $A$ and $B$, as well as their callgraphs
$A := \{\{a_1, \dots, a_n\}, \{a^e_1, \dots, a^e_m\}\}$ and
$B := \{\{b_1, \dots, b_l\}, \{b^e_1, \dots, b^e_k\}\}$, which consist of their respective
nodes (functions) and edges (an edge $c^e_i$ is a 2-tuple containing two nodes, and thus
describes an edge in the graph).
Ideally, we want to create a bijective mapping $p: \{a_1, \dots, a_n\} \to \{b_1, \dots, b_l\}$.
In the general case, this mapping does not exist due to different cardinalities of the two
sets (if functions have been added or removed). Furthermore, properly embedding $B$ into $A$
seems to be an excessively expensive operation, specifically considering the possibility of
$l < n$.

        push ebp
        mov ebp, esp
        push esi
        push edi
        jmp short loc_40126A
loc_401267:
        inc [ebp+arg_4]
loc_40126A:
        mov edi, [ebp+arg_0]
        mov esi, edi
        sub esi, 1
        mov [ebp+arg_0], esi
        cmp edi, 0
        jnz short loc_401267
        cmp [ebp+arg_4], 64h
        jle short loc_401287
        mov eax, 1
        jmp short loc_40128C
loc_401287:
        mov eax, 0
loc_40128C:
        pop edi
        pop esi
        pop ebp
        retn
Figure 2: The assembly code
A different, iterative approach to creating an approximation of $p$ is taken: An initial
mapping $p_1$ is created which maps elements of $A_1 \subset \{a_1, \dots, a_n\}$ to elements of
$B_1 \subset \{b_1, \dots, b_l\}$. The mapping is then used to iteratively create a sequence of
mappings $p_2, \dots, p_h$ with $A_1 \subset A_2 \subset \dots \subset A_h$ and
$B_1 \subset B_2 \subset \dots \subset B_h$.
4.1 A simple matching heuristic
Comparing undirected graphs is well-known to be excessively expensive, and even the
restricted directed graphs can be quite expensive to compare. A relatively simple (and
very imprecise) heuristic for telling whether two graphs are isomorphic is to compare the
number of nodes and edges. If they do not match, it can be said with certainty that no
isomorphism exists. The initial partial mapping $p_1$ is constructed by associating every
$b_i \in \{b_1, \dots, b_l\}$ (and likewise every $a_i$) with a 3-tuple
$(\alpha_i, \beta_i, \gamma_i)$ where $\alpha_i$ is the number of basic blocks in $b_i$,
$\beta_i$ is the number of edges in $b_i$ and $\gamma_i$ is the number of edges in the
callgraph originating at $b_i$. We denote the mapping that maps a function in a callgraph
$C$ to its 3-tuple with $s: C \to \mathbb{N}^3$ and define the inverse function
$s^{-1}: \mathbb{N}^3 \to \mathcal{P}(\{c_1, \dots, c_o\})$ that retrieves the set of
functions which map to a certain tuple.
Figure 3: cfg with assembly
The mapping $p_1$ is constructed by examining all 3-tuples generated from $A$ and $B$ as
follows: The functions $a_i \in A$ and $b_j \in B$ are mapped to each other if and only if
they map to the same tuple and no other element exists in $\{a_1, \dots, a_n\}$ or
$\{b_1, \dots, b_l\}$ which maps to the same tuple. More formally:
$$p_1(a_i) = b_j \Leftrightarrow |s^{-1}(s(a_i))| = 1 = |s^{-1}(s(b_j))| \wedge s(a_i) = s(b_j) \qquad (1)$$
If the cardinalities of the sets $s^{-1}(s(a_i))$ and $s^{-1}(s(b_j))$ are both equal to one
and both $a_i$ and $b_j$ map to the same tuple, $p_1$ maps $a_i$ to $b_j$.
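The construction of the initial mapping can be sketched in a few lines of Python. This is an illustration under assumptions, not the implementation used for the results below: the names `signature` and `initial_mapping`, the dictionary-based input format and the sample tuples are all made up.

```python
from collections import defaultdict


def signature(cfg_edges, call_targets):
    """3-tuple (basic blocks, cfg edges, callgraph edges originating here)."""
    nodes = len(cfg_edges)
    edges = sum(len(succs) for succs in cfg_edges.values())
    return (nodes, edges, len(call_targets))


def initial_mapping(sigs_a, sigs_b):
    """Build p1 from two dicts that map function names to their 3-tuples."""
    def invert(sigs):
        inverse = defaultdict(list)      # plays the role of s^-1
        for name, sig in sigs.items():
            inverse[sig].append(name)
        return inverse

    inv_a, inv_b = invert(sigs_a), invert(sigs_b)
    p1 = {}
    for sig, names_a in inv_a.items():
        names_b = inv_b.get(sig, [])
        # Equation (1): the tuple must be unique in A and unique in B.
        if len(names_a) == 1 and len(names_b) == 1:
            p1[names_a[0]] = names_b[0]
    return p1


# Hypothetical tuples: a2 and a3 collide in A, a4 has no counterpart in B.
sigs_a = {"a1": (7, 8, 0), "a2": (3, 3, 1), "a3": (3, 3, 1), "a4": (12, 17, 4)}
sigs_b = {"b1": (7, 8, 0), "b2": (3, 3, 1), "b3": (12, 18, 4)}
p1 = initial_mapping(sigs_a, sigs_b)
```

In this example only `a1` is mapped: `b2` is unique in B, but its tuple occurs twice in A, so neither `a2` nor `a3` may be paired with it.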
Figure 4: cfg without assembly
4.2 Improving $p_1$
The above heuristics yield only relatively small subsets of $A := \{a_1, \dots, a_n\}$ and
$B := \{b_1, \dots, b_l\}$ that can be successfully matched initially. In general, smaller
functions are a lot less likely to be successfully matched. This is mainly due to the special
form that cfg's take: Most basic blocks have exactly two "children", which means that the
odds that two randomly chosen cfg's with the same node count have the same number of edges
decrease as cfg's grow. Smaller functions also tend to have fewer subfunction calls, further
increasing the likelihood that $|s^{-1}(s(a_i))| \neq 1$ occurs. It is intuitively clear that
smaller sets $A$ and $B$ would reduce the odds of such collisions.
The improved mappings $p_i$ are constructed by taking advantage of the information gained
from $p_{i-1}$ and using it to create small subsets $A'_j \subset A$ and $B'_j \subset B$ which
are used for improving the mapping as explained above. The algorithm for constructing $p_i$
from $p_{i-1}$ works as follows:
1. Take the $i$th element $a_i$ from $A_{i-1}$ and retrieve $p_{i-1}(a_i)$.
2. Let $A'_i$ be the set of all functions $a_k$ that have edges originating from $a_i$ leading
to $a_k$ in $A$, and $B'_i$ the set of all functions $b_o$ that have edges originating from
$p_{i-1}(a_i)$ leading to $b_o$ in $B$.
3. Construct $p'_i: A'_i \to B'_i$ in the same way as $p_1$ was constructed.
4. $p_i(a_j) := p_{i-1}(a_j)$ if $a_j \in A_{i-1}$. If $a_j \notin A_{i-1}$ and the construction
of $p'_i$ yielded a match, $p_i(a_j) := p'_i(a_j)$. If the construction of $p'_i$ did not yield
a match and $a_j \notin A_{i-1}$, then $p_i(a_j)$ is undefined.
5. $A_i$ and $B_i$ are the domain and image of $p_i$.
Once $p_k$ has been constructed where $|A_k| = k$, the iteration is finished and cannot yield
improved results.
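The steps above can be sketched as a worklist loop. Note the assumptions: the sketch replaces the indexed iteration over $A_{i-1}$ with a worklist, which visits the same matched pairs but in a different order, and the names `match_unique`, `refine` and all sample data are hypothetical.

```python
from collections import defaultdict


def match_unique(names_a, names_b, sig_a, sig_b):
    """Pair functions whose 3-tuple is unique within both candidate sets."""
    def invert(names, sig):
        inverse = defaultdict(list)
        for name in names:
            inverse[sig[name]].append(name)
        return inverse

    inv_a, inv_b = invert(names_a, sig_a), invert(names_b, sig_b)
    return {inv_a[s][0]: inv_b[s][0]
            for s in inv_a
            if len(inv_a[s]) == 1 and len(inv_b.get(s, [])) == 1}


def refine(mapping, calls_a, calls_b, sig_a, sig_b):
    """Extend `mapping` by matching only among callees of matched pairs."""
    mapping = dict(mapping)
    worklist = list(mapping)
    while worklist:
        a = worklist.pop()
        b = mapping[a]
        matched_b = set(mapping.values())
        # Steps 1 and 2: still unmatched callees of a and of its counterpart b.
        cand_a = {x for x in calls_a.get(a, ()) if x not in mapping}
        cand_b = {y for y in calls_b.get(b, ()) if y not in matched_b}
        # Steps 3 and 4: unique-tuple matching restricted to these small sets.
        for x, y in match_unique(cand_a, cand_b, sig_a, sig_b).items():
            mapping[x] = y
            worklist.append(x)   # expand the newly matched pair later
    return mapping


# Hypothetical data: read/dup and write/dup2 collide globally, but read and
# write are unique among the callees of main.
sig_a = {"main": (9, 12, 2), "read": (3, 3, 0), "write": (4, 4, 0),
         "dup": (3, 3, 0), "dup2": (4, 4, 0)}
sig_b = {"b_main": (9, 12, 2), "b_read": (3, 3, 0), "b_write": (4, 4, 0),
         "b_dup": (3, 3, 0), "b_dup2": (4, 4, 0)}
calls_a = {"main": ["read", "write"]}
calls_b = {"b_main": ["b_read", "b_write"]}

p1 = match_unique(sig_a, sig_b, sig_a, sig_b)
refined = refine(p1, calls_a, calls_b, sig_a, sig_b)
```

The loop terminates because every function enters the worklist at most once, which corresponds to the stopping condition stated above.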
4.3 Graph restructuring
Compilers (and optimizing linkers) tend to change cfg's in ways that do not truly change
the logical structure of the function. Oftentimes, a single basic block in a cfg is split up
into several smaller ones that are linked with an unconditional branch instruction (see
footnote 1). Since these operations change the node and link count, a way to easily and
quickly undo the changes is needed. A simple graph-restructuring algorithm, which removes
the superfluous nodes generated by these optimizations, is applied before generating
the 3-tuples:
for x ∈ {c_1, ..., c_o}:
    if (number of edges to x) = 1:
        y^e_x := edge to x
        y := source node of y^e_x
        {x^e_i, ..., x^e_j} := set of edges originating in x
        remove edge from y to x from graph
        remove x from graph
        for x^e ∈ {x^e_i, ..., x^e_j}:
            add edge from y to target of x^e to graph
            remove x^e from graph
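A runnable sketch of this restructuring step is given below. It deviates from the pseudocode in one respect, which is an assumption of the sketch and not part of the paper: a node is merged into its predecessor only if that predecessor has no other outgoing edge, so that a branch target of a conditional jump is never folded away. The function name `merge_linear_blocks` and the sample graph are made up.

```python
def merge_linear_blocks(succ):
    """Fold blocks that were split by unconditional branches.

    succ maps every node to the list of its successor nodes; each successor
    must itself be a key of succ. Returns a simplified copy.
    """
    succ = {n: list(t) for n, t in succ.items()}
    changed = True
    while changed:
        changed = False
        # Recompute predecessor lists for the current graph.
        preds = {n: [] for n in succ}
        for n, targets in succ.items():
            for t in targets:
                preds[t].append(n)
        for x in list(succ):
            if len(preds[x]) != 1:
                continue                 # pseudocode: exactly one edge to x
            y = preds[x][0]
            # Assumption beyond the pseudocode: y reaches x through its only
            # outgoing edge (unconditional branch or fall-through).
            if y == x or succ[y] != [x]:
                continue
            succ[y] = succ.pop(x)        # y inherits the outgoing edges of x
            changed = True
            break                        # predecessor lists are now stale
    return succ


# Hypothetical cfg in which one block was split into A -> B -> C.
split = {"A": ["B"], "B": ["C"], "C": ["D", "E"],
         "D": ["F"], "E": ["F"], "F": []}
merged = merge_linear_blocks(split)
```

For the sample graph the chain A, B, C collapses into a single node A with the two successors D and E, while the diamond formed by D, E and F is left untouched.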
5 Practical results
An implementation of the described methods has been created as an extension to the
commercial disassembler IDA Pro [Dat].
5.1 Name porting between databases
Two libraries that come with every standard install of Windows were examined in different
versions: wininet.dll and msgsvc.dll. The versions of wininet.dll were those of Windows
XP SP1/SP2 respectively, the versions of msgsvc.dll those pre/post MS03-48. Both have
been heavily fragmented by aggressive link-time optimizations and pose significant
problems to signature-based function matching.
Footnote 1: This seems to be a specialty of Microsoft's optimizing linker.
File                      File size (bytes)   # functions   # mapped   Runtime in seconds
msgsvc.dll pre MS03-48    35,600              134           100        <5
msgsvc.dll post MS03-48   34,064              129           100        <5
wininet.dll SP1           599,040             2310          1522       183
wininet.dll SP2           588,288             2321          1522       183
5.2 Analysis of security patches
5.2.1 H.323 Parser
After the NISCC published information about vulnerabilities in multiple H.323 parsers,
the question arose where the relevant mistake in Microsoft's ISA Server product was.
Microsoft refuses to publish detailed information about the vulnerabilities it fixes.
According to the NISCC report, the problem was located in ASN.1 decoding.
Both the pre- and post-patch versions of H323ASN1.DLL were analyzed, with the result
that 11 functions in the unpatched version could not be mapped to the patched version, and
8 functions in the patched version could not be mapped to any function in the unpatched
version.
Patched version                             Unpatched version
Address   # Nodes   # Links   # Children    Address   # Nodes   # Links   # Children
40f627    26        46        21            40f4bb    22        40        21
40f837    19        32        12            40f697    14        24        12
41d012    10        16        7             41cd73    9         15        8
41ed06    8         12        2             41ce7d    8         13        7
428d36    8         12        2             425595    4         5         4
42b9e2    8         12        2             425728    4         5         4
42bc90    8         12        2             428b72    7         10        2
42bd85    8         12        2             42b98e    7         10        2
                                            42bbd2    7         10        2
                                            42bcbf    7         10        2
A manual inspection of these functions yielded the result that the first three functions in
both tables are in fact the same with the only change being an added range check. In
all three cases, the old version retrieves an unsigned 32-bit integer from an ASN.1 PER
encoded stream by means of a function called ASN1PERDecU32Val().
This 32-bit integer is passed on to ASN1PERDecZeroTableCharStringNoAlloc() as
second argument. The patched variant on the other hand introduces a range check to
make sure this second argument is smaller than 129.
A closer inspection of ASN1PERDecZeroTableCharStringNoAlloc() reveals that the
function calculates the size of a memory allocation based on the formerly untrusted value:
an attacker was able to set this value in such a manner that the calculation would overflow
beyond MAXUINT and the resulting allocation would thus be of very small size. The subsequent
copy operation would then corrupt the heap, allowing an attacker to gain control in the
next round of heap consolidation. Instead of fixing the issue at the core (i.e. in the
MSASN1.DLL library), a range check was added into the calling application (H323ASN1.DLL).
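The wraparound can be illustrated with plain 32-bit arithmetic. The sketch below is hypothetical throughout: the actual size formula used inside MSASN1.DLL is not given in this paper, so a generic `count * element_size + header` computation and an element size of 2 are assumed purely for illustration. Only the limit of 129 is taken from the text.

```python
MAXUINT = 0xFFFFFFFF  # largest unsigned 32-bit value


def alloc_size_32bit(count, element_size, header=0):
    """Hypothetical size computation, truncated to 32 bits the way unsigned
    C arithmetic would be. The real formula is not stated in the paper."""
    return (count * element_size + header) & MAXUINT


def patched_check(count, limit=129):
    """Range check described in the text: the value must be smaller than 129."""
    return count < limit


benign = alloc_size_32bit(100, 2)          # no wraparound
wrapped = alloc_size_32bit(0x80000001, 2)  # product exceeds MAXUINT and wraps
```

With these assumed numbers a requested count of 0x80000001 yields an allocation of only 2 bytes, while the range check rejects the value before it reaches the callee.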
The update thus disclosed to an examining party that every call to
ASN1PERDecZeroTableCharStringNoAlloc() needs to have argument checking done before the
call is issued. A short system-wide scan was conducted to see if other applications besides
ISA Server use the relevant function in a dangerous way. Two other instances were found: the
Windows-internal H.323 Multimedia Provider Library (which allows arbitrary applications
to easily process H.323 data) and Microsoft's video conferencing software NetMeeting.
Neither performs proper range checking on the argument passed to the function in question.
The result was that the update to H323ASN1.DLL fixed one bug but alerted anyone with
the capability to analyze patches to two further remotely exploitable vulnerabilities which
were not fixed at the time.
Microsoft was contacted and the issues were fixed a few months later, in MS04-11.
The total analysis took less than 3 hours.
5.2.2 SSL/PCT Parser
In April, Microsoft issued an update to SCHANNEL.DLL, the library responsible for handling
SSL communication. According to their security bulletin, they removed a security
problem that allowed attackers to take full control of any computer running an SSL-based
server. No technical details were provided, except that the problem itself lay in a part of
the library responsible for parsing PCT packets (see footnote 2).
More than 20 changed functions were detected in total, but only one with a name that
implied it was involved with PCT parsing. An examination of the function
Pct1SrvHandleUniHello() revealed that the old version had taken a string, NOT'ed every
character and appended the result to the original string. The new version was changed in
such a manner that it ensured the combined string would not exceed 32 characters.
Detecting and understanding the vulnerability (a vanilla stack-smash with EIP overwrite)
took less than 30 minutes. Subsequently, code was constructed to reach the appropriate
location in the binary. Within 5 hours, EIP could be overwritten with an arbitrary value,
and within 10 hours of the start of the analysis, a program that reliably exploited the
vulnerability was created.
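The length arithmetic behind the missing check can be modelled in a few lines. This is a simplified sketch and not the code of Pct1SrvHandleUniHello(): the function names `combined_string` and `fits_after_patch` are made up, and the model assumes that the 32-character limit from the text applies to the combined string measured in bytes.

```python
BUFFER_SIZE = 32  # limit named in the text for the combined string


def combined_string(data: bytes) -> bytes:
    """Append the bitwise NOT of every byte to the original bytes."""
    return data + bytes(b ^ 0xFF for b in data)


def fits_after_patch(data: bytes) -> bool:
    """Model of the added check: the combined string must not exceed 32 bytes."""
    return 2 * len(data) <= BUFFER_SIZE


sample = combined_string(b"\x00\xff")
longest_ok = combined_string(bytes(16))
```

Since the combined string is always twice as long as the input, any input longer than 16 bytes produces more than 32 bytes of output, which is the case the new version rejects.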
6 Comparison to other methods
In comparison to other methods for reverse engineering changes to a binary, the presented
method has a few distinctive advantages as well as a few significant disadvantages.
Footnote 2: PCT is a legacy protocol that was obsoleted by TLS and is supported for legacy browsers.
6.1 Few False positives
The presented method performs significantly better than [Tod] in terms of false positives:
The instruction-based approach suffers from 3-5 % of all instructions being marked as
changed. Unless heavy, structure-changing optimizations are performed (such as the
inlining of complex functions), the presented method is free of false positives: A function
whose flowgraph has changed has undergone a change. While testing the method on a
multitude of different programs, no function pair was found that had not changed but was
marked as changed. This drastically reduces the human work involved when trying to
detect the significant changes in a security update.
6.2 CPU-independence
The presented method is almost completely independent of the underlying CPU architecture
as long as a good disassembly with cross-references is available. The only CPU-dependent
function that has to be available in addition to flow information is the capability
to distinguish between a subfunction call and a non-subfunction call. Successful tests were
run examining differences between MIPS-based ROM images and SPARC-based Solaris
ELF executables in addition to the x86-based PE files discussed above.
Instruction-based approaches contain large amounts of CPU-dependent code, which makes
creating a multi-platform analysis tool significantly more complex.
6.3 Possible False Negatives
The obvious downside is the possibility of false negatives if the program logic
itself is not changed but constants or buffer sizes are. It is easy to imagine that a software
vendor will fix a security vulnerability not by adding a range check but by enlarging the
size of a buffer, which will go unnoticed by the current method. This is where [Tod]
is clearly superior, as any change in buffer sizes or constants will be detected. This
comes at the cost of having to examine a significantly larger number of detected changes.
Empirical evidence suggests that security updates which change constants but not program
flow are very rare. Nonetheless, this is an area in which improvements on the proposed
method are desirable.
7 Summary
It has been shown that nondisclosure of vulnerability information is not a promising
deterrent to would-be attackers and that security updates can be reverse engineered in
relatively little time (given the right tools). It has furthermore been shown that special
care has to be taken when releasing security updates, as the information in the patch has
to be assumed to be public. An incomplete bugfix can do more harm than good by disclosing
the existence of other (unfixed) bugs along with the fix.
The presented work furthermore implies that the common practice of leaving one or two
weeks between the publication of a security update and installing the patch is highly
dangerous.
Leaving the politics of vulnerability disclosure out, it has been shown that analysis of
binaries based only on structural properties of the code is a promising field of research,
as it allows analysis of executable code without the need to abstract to an intermediate
language or CPU-specific analysis engines.
8 Future work
Many things have to be improved and worked on to make the proposed method truly
useful. Fast heuristics for telling that two graphs are not isomorphic, better than the
one currently used, are needed and would greatly improve the matching statistics.
Furthermore, intraprocedural difference analysis would be useful: Given two cfg's $a_1, b_1$
which are different, but expected to belong to the same function due to their position in
the callgraph or due to other heuristics, an algorithm that constructs a partial isomorphism
between subgraphs of $a_1$ and $b_1$ would allow quicker analysis of changes. Given two
versions of the same large function, finding a relatively small change still has to be done
manually.
A separation of function matching for name porting and function matching for binary
difference analysis will be needed sooner or later: Matching for name porting benefits
from relaxed heuristics, while binary difference analysis must not miss changes.
More in-depth study of the effects of heavily optimizing/inlining compilers would be
desirable, as well as more studies on the applicability of the presented methods to other
CPU architectures.
Detecting changes in buffer sizes and changes in (certain) constants would be desirable
goals in the immediate future.
9 Acknowledgements
The author would like to thank the anonymous reviewers for many constructive comments.
Valuable comments were also provided by Josh Anderson, Brandon Baker, John Pincus,
Felix Lindner and Jan Muenther.
References
[AVA99] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilerbau. Oldenbourg Verlag,
München Wien, 2nd edition, 1999.
[BM98] Brenda S. Baker and Udi Manber. Deducing Similarities in Java Sources from Bytecodes.
pages 179–190, 1998.
[BM99] Brenda S. Baker, Udi Manber, and Robert Muth. Compressing Differences of Executable
Code. In ACM SIGPLAN Workshop on Compiler Support for System Software (WCSS),
pages 1–10, 1999.
[Dat] DataRescue. IDA Pro Disassembler.
http://www.datarescue.com/idabase.
[Hir77] Daniel S. Hirschberg. Algorithms for the Longest Common Subsequence Problem. J. ACM,
24(4):664–675, 1977.
[HS77] James W. Hunt and Thomas G. Szymanski. A fast algorithm for computing longest common
subsequences. Commun. ACM, 20(5):350–353, 1977.
[Poc] Pocket Soft Inc. RTPatch – Software Update Tool.
http://www.pocketsoft.com/whitepapers/whitepaper.html.
[Tod] Todd Sabin. Comparing binaries with graph isomorphisms.
http://razor.bindview.com/publish/papers/comparing-binaries.html.
[ZW99] Zheng Wang, Ken Pierce, and Scott McFarling. BMAT - A Binary Matching Tool. 2nd ACM
Workshop on Feedback-Directed Optimization, November 1999.
[ZW00] Zheng Wang, Ken Pierce, and Scott McFarling. BMAT - A Binary Matching Tool for Stale
Profile Propagation. The Journal of Instruction-Level Parallelism (JILP), 2, May 2000.
... However, these approaches that only considering the syntax of instructions are expensive and cannot handle all the syntax changes due to different compiler settings. Moreover, utilizing graph-isomorphism (GI) theory [19,20] to compare control flow graphs (CFGs) of function is time consuming and lacks polynomial-time solutions. ...
Article
Full-text available
Binary code similarity comparison is the technique that determines if two functions are similar by only considering their compiled form, which has many applications, including clone detection, malware classification, and vulnerability discovery. However, it is challenging to design a robust code similarity comparison engine since different compilation settings that make logically similar assembly functions appear to be very different. Moreover, existing approaches suffer from high-performance overheads, lower robustness, or poor scalability. In this paper, a novel solution HBinSim is proposed by employing the multiview features of the function to address these challenges. It first extracts the syntactic and semantic features of each basic block by static analysis. HBinSim further analyzes the function and constructs a syntactic attribute control flow graph and a semantic attribute control flow graph for each function. Then, a hierarchical attention graph embedding network is designed for graph-structured data processing. The network model has a hierarchical structure that mirrors the hierarchical structure of the function. It has three levels of attention mechanisms applied at the instruction, basic block, and function level, enabling it to attend differentially to more and less critical content when constructing the function representation. We conduct extensive experiments to evaluate its effectiveness and efficiency. The results show that our tool outperforms the state-of-the-art binary code similarity comparison tools by a large margin against compilation diversity clone searching. A real-world vulnerabilities search case further demonstrates the usefulness of our system.
Article
Binary code similarity analysis (BCSA¹) is meaningful in various software security applications, including vulnerability discovery, clone detection, and malware detection. Although many BCSA studies have been based on neural networks in recent years, some significant problems are challenging to solve. First, most existing methods focus more on the function pair similarity detection task (FPSDT²) while ignoring the function search task (FST³), which is more major in vulnerability discovery. Moreover, they care more about the final result, which is to improve the success rate of FPSDT by using unexplainable neural networks. Finally, in practice, most methods are difficult to resist cross-optimization and cross-obfuscation in BCSA. We first proposed an adaptive BCSA architecture combining interpretable feature engineering and learnable attention mechanism to solve these problems. We design an adaptive model with rich interpretable features, and the experimental results on FPSDT and FST are better than the state-of-the-art methods. In addition, we also found that the attention mechanism has outstanding advantages in functional semantic expression. Finally, the evaluation shows that our approach can significantly improve FST performance between cross-architecture, cross-optimization, cross-obfuscation and cross-compiler binaries.
Article
Fingerprinting individual functions in binary code is useful in many security applications ranging from digital forensic analysis of malware corpora to the detection of critical security vulnerabilities. However, existing approaches for fingerprinting functions are typically not resilient to code transformation methods or the use of different compilers. Moreover, another common weakness with these approaches is that when they report a similarity, they do not provide reverse engineers with any insight into the underlying evidence. In order to bridge this gap, our paper presents Plumeria, an obfuscation-resilient and scalable approach based on a stratified architecture comprised of three layers. The first layer retrieves as many candidates as possible by capturing statistical characteristics, function behavior, and function neighborhood relationships. The second layer then trains a linear conditional random field to learn the correlations between the features of the function and its semantics. This layer is designed to reduce the number of false positives. Finally, the third layer is designed to provide insights into the underlying evidence by collecting the side effects exhibited from the candidates selected by the previous layer. Our study evaluates Plumeria in the context of several scenarios: fingerprinting functions in obfuscated/de-obfuscated binaries; fingerprinting functions across different compilers; fingerprinting various vulnerabilities across compilers and versions; and fingerprinting standard library functions. We then benchmark Plumeria on real-world projects and malware binaries, comparing it with existing state-of-the-art solutions. Our results show that Plumeria outperforms existing solutions, with an average precision of over 89%
Article
Code reuse brings vulnerabilities in third-party libraries to many Internet of Things (IoT) devices, opening them to attacks such as distributed denial of service. Program-wide binary diffing technology can help detect these vulnerabilities in IoT devices whose source code is not public. Considering that the architectures of IoT devices may vary, we propose a data-aware program-wide diffing method across architectures and optimization levels. We rely on defined anchor functions and call relationships to expand the comparison scope within the target file, reducing the impact of different architectures on the diffing result. To make the diffing result more accurate, we extract semantic features that can represent the code by data flow dependence analysis. Earth mover distance is used to calculate the similarity of functions in two files based on semantic features. We implemented a proof-of-concept, DAPDiff, and compared it with the baselines BinDiff, TurboDiff and Asm2vec. Experiments showed the availability and effectiveness of our method across optimization levels and architectures. DAPDiff outperformed BinDiff in recall and precision by 41.4% and 9.2% on average when diffing standard third-party libraries against real-world firmware files. This shows that DAPDiff is applicable to vulnerability detection in IoT devices.
Article
During software development, numerous third-party library functions are often reused. Accurately recognizing library functions reused in software is of great significance for some security scenarios, such as the detection of known vulnerabilities and reverse analyses of malware. One option for recognizing library functions is to match the functions in the library to those in the target software. However, due to the diversity of function library versions, compilers, build options, etc., there are differences between the two corresponding functions. Recognizing library functions used in target software precisely is still a challenging task. In this paper, we propose a novel method named SELF (SEarch for Library Functions) to recognize library functions used in target software. In SELF, the function is represented with a co-occurrence matrix and encoded by a convolutional auto-encoder (CAE). Then, the similarity between two functions is detected using the generated bottleneck features. This scheme focuses on the discriminative semantic features; thus, this method can not only distinguish different functions but also tolerate the subtle differences between two paired functions, which is specifically required for library function recognition. We collected 451 software projects, including approximately 3 million functions, to train and evaluate SELF. The experimental results show that SELF performs well in both [email protected] and [email protected]. Especially when the library version gap is large, SELF significantly outperforms classic BINDIFF. In addition, SELF shows good computational efficiency.
Article
Two algorithms are presented that solve the longest common subsequence problem. The first algorithm is applicable in the general case and requires O(pn + n log n) time, where p is the length of the longest common subsequence. The second algorithm requires time bounded by O(p(m + 1 - p) log n). In the common special case where p is close to m, this algorithm takes much less time than n^2.
Article
Previously published algorithms for finding the longest common subsequence of two sequences of length n have had a best-case running time of O(n^2). An algorithm for this problem is presented which has a running time of O((r + n) log n), where r is the total number of ordered pairs of positions at which the two sequences match. Thus in the worst case the algorithm has a running time of O(n^2 log n). However, for those applications where most positions of one sequence match relatively few positions in the other sequence, a running time of O(n log n) can be expected.
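The match-driven idea in this abstract can be illustrated with a minimal Python sketch. It is not the published algorithm verbatim: it uses the well-known reduction of longest common subsequence to a longest strictly increasing subsequence over matching positions, which does O(log n) work per matching pair. The function name `lcs_length` and the variable names are assumptions made for this illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b."""
    # For every symbol of b, record the positions where it occurs.
    positions = defaultdict(list)
    for j, symbol in enumerate(b):
        positions[symbol].append(j)

    # tails[k] holds the smallest end position in b of any common
    # subsequence of length k + 1 found so far; it stays sorted.
    tails = []
    for symbol in a:
        # Visit matching positions in decreasing order so that a single
        # element of a can extend a given subsequence at most once.
        for j in reversed(positions.get(symbol, ())):
            k = bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)
            else:
                tails[k] = j
    return len(tails)


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Only the matching pairs are ever visited, so the work grows with r rather than with the full n-by-n table, which is why inputs with few matches per position are handled quickly.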
Article
A major challenge of applying profile-based optimization on large real-world applications is how to capture adequate profile information. A large program, especially a GUI-based application, may be used in a large variety of ways by different users on different machines. Extensive collection of profile data is necessary to fully characterize this type of program behavior. Unfortunately, in a realistic software production environment, many developers and testers need fast access to the latest build, leaving little time for collecting profiles. To address this dilemma, we would like to re-use stale profile information from a prior program build. In this paper we present BMAT, a fast and effective tool that matches two versions of a binary program without knowledge of source code changes. BMAT enables the propagation of profile information from an older, extensively profiled build to a newer build, thus greatly reducing or even eliminating the need for re-profiling. We use two m...
Article
A major challenge of applying profile-based optimization on large real-world applications is how to capture adequate profile information. A large program, with millions of lines of code, may be used in a large variety of ways by different users on different machines. In addition, GUI-based applications can behave differently with only the slightest changes of running conditions. To fully characterize this type of program behavior, extensive collection of profile data is required. Unfortunately, in a realistic software production environment, little time is available for extensive profiling. In a large project, many developers and testers need fast access to the latest build without having to wait for extensive profile runs to be completed. To address this dilemma, we would like to be able to reuse profile information from a prior build. In this paper we present BMAT, a tool that matches two versions of a binary program with high success rate. BMAT allows us to propagate profile informa...
Article
Several techniques for detecting similarities of Java programs from bytecode files, without access to the source, are introduced in this paper. These techniques can be used to compare two files, to find similarities among thousands of files, or to compare one new file to an index of many old ones. Experimental results indicate that these techniques can be very effective. Even changes of 30% to the source file will usually result in bytecode that can be associated with the original. Several applications are discussed.
Article
Programs change often, and it is important to bring those changes to users as conveniently as possible. The two most common ways to deliver changes are to send a whole new program or to send "patches" that encode the differences between the two versions, requiring much less space. In this paper, we address computation of patches for executables of programs. Our techniques take into account the platform-dependent structure of executables. We identify changes in the executables that are likely to be artifacts of the compilation process, and arrange to reconstruct these when the patch is applied rather than including them in the patch; the remaining changes that must be placed in the patch are likely to be derived from source lines that changed. Our techniques should be useful for updating programs over slow data lines and should be particularly important for small devices whose programs will need to be updated through wireless communication. We have implemented our techniques for Digital UNIX Alpha executables; our experiments show our techniques to improve significantly over previous approaches to updating executables.
Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. Compilerbau. Oldenbourg Verlag, München Wien, 2nd edition, 1999.
Zheng Wang, Ken Pierce, Scott McFarling. BMAT - A Binary Matching Tool. 2nd ACM Workshop on Feedback-Directed Optimization, November 1999.
Todd Sabin. Comparing binaries with graph isomorphisms. http://razor.bindview.com/publish/papers/comparing-binaries.html