Automated Attacker Correlation for Malicious Code
Thomas Dullien, Ero Carrera, Sören Meyer-Eppler, Sebastian Porst
Ruhr-University Bochum
zynamics GmbH
Grosse Beck Str 3.
44787 Bochum
Germany
March 22, 2010
Abstract
Correlating attacks can be specifically problematic in the digital domain. It is a common
scenario that the only real “trace” of an attack that can be obtained is executable code. As such,
executable code of malicious software forms one of the primary pieces of evidence that need to
be examined in order to establish correlation between seemingly independent events/attacks.
Due to the high technical sophistication required for building advanced and stealthy persistent
backdoors (“rootkits”), it is quite common for code fragments to be re-used. A big obstacle to
performing proper correlation between different executables is the high degree of variability which
the compiler introduces when generating the final byte sequences.
This paper presents the results of research on executable code comparison for attacker corre-
lation. Instead of pursuing a byte-based approach, a structural approach is chosen. The result
is a system that can identify code similarities in executables with accuracy that often exceeds
that of a human analyst and at much higher speed.
1 Introduction
One of the core problems facing actors in the cyber domain is the difficulty of attribution. While
this problem appears as hard as ever, a related problem is “correlation” – even if it is unclear what
specific actor is behind an attack, it is desirable to group attacks by suspected actor. This allows
much improved intelligence on and understanding of coordinated multi-pronged attacks.
Often, the principal artifact of a successful attack that the defenders can perform analysis of is
malicious code that was obtained from compromised systems. Direct comparison of such artifacts is
difficult due to the heavy changes that are introduced by the compilation and optimization process.
The problem of comparing executables derived from a common source code origin but compiled
using different compiler settings was first studied in the context of security patch analysis. In
this scenario, two versions of the same executable that were compiled in similar but nonidentical
build environments are to be compared, and the security implications of a software update are then
examined. The executables are usually relatively similar, but variability on the byte level exists
nonetheless. Motivated by this, structural approaches to compare executables are discussed in
[6, 7, 9, 3]. In [1, 2], applications of such methods to malware classification were given. Further
work along similar lines was recently performed by [8].
RTO-MP-IST-091 26 - 1
Presented at the Information Systems and Technology Panel (IST) Symposium held in Tallinn, Estonia,
22-23 November 2010. Approved for public release; distribution unlimited.
2 Contribution
The primary research contributions of this paper are the following:
1. A comparison algorithm that allows comparison of executable code independent of compiler
optimizations. This algorithm is an extension of the algorithms discussed in [7, 3] and has been
successfully used to compare code compiled for different operating systems (using different
compilers) [5] and different architectures [4].
2. A full system that, based on the above algorithm, allows the fast comparison and correlation
between different pieces of malicious software.
3. An algorithm that allows for specific efficient queries for particularly interesting functions into
a large database of functions.
Two practical case studies are also presented – both originating in a real-world computer net-
work attack. Further results are presented in which correlation is performed amongst large
quantities of malicious software.
3 Structural Executable Comparison
Compilers enjoy wide-ranging freedom when assigning instruction sequences and registers to perform
particular tasks. This leads to a wide variability of the byte sequences that are generated from a
particular piece of source code. A compounding problem arises from the fact that many compilers use
non-deterministic and randomized algorithms to perform common tasks such as register allocation:
Even identical build environments and compilers can generate different byte sequences due to this.
On the structural level, the freedom is much more restricted: The control flow logic of the
underlying source code dictates, to a certain extent, the control flow logic of the emitted assembly,
and thus the “shape” or topology of the control flow graph. While the compiler has a certain level of
freedom in creating control flow graphs, the need to generate efficient code leads to a ’convergence’
to a few possible shapes.
Techniques for using the structure of the control flows for executable comparison were pioneered
independently by Sabin [9] and Flake [6]. The latter of the two already includes measures that also take
the callgraph structure of the executable into account. Callgraph similarity measures were used by
[1] to automatically recognize similarities between variants of the same malicious software.
4 The goal
The goal for correlation of malicious software is two-fold:
1. Allow specific querying for a particular structural feature (control flow graph) in a large body
of malicious code
2. Automatically recognize similarities between superficially different pieces of malicious code
In the presented work, the first goal is achieved by identifying a way to perform quick database
lookup for a particular graph structure. The second goal is achieved by a more involved algorithm
that computes similarities between two executables based on their structural properties.
5 Fast database lookup for flowgraphs
In order to allow fast querying into large sets of data, it is desirable to implement something like
a hash function for control flow graphs. Given an easily-calculated encoding of control flow graphs
into a sequence of bits, fast lookup is possible using either hashtables or binary search trees (as the
encoding automatically imposes a total ordering on the space of control flow graphs).
The question then becomes: what is a good and fast way to encode a control flow graph into a
(possibly small) sequence of bits, ideally with a low probability of random collisions (i.e. identical
hash values for structurally different flowgraphs)?
We begin by converting the set of edges in a control flow graph into a set of n-tuples of integers.
In the next step, a suitable encoding for this set of n-tuples is constructed which attempts to
minimize the odds of multiple sets of tuples mapping to the same value.
The construction of this set of tuples works as follows: Let $\mathcal{G}$ be the set of all control flow
graphs and $E_g$ the set of edges of a particular $g \in \mathcal{G}$. We then define:
$$\mathrm{tup}: \mathcal{G} \to \mathcal{P}(\mathbb{Z}^5)$$
$$\mathrm{tup}(g) = \left\{ \bigl(\mathrm{topological\_order}(\mathrm{src}(e)),\ \mathrm{indegree}(\mathrm{src}(e)),\ \mathrm{outdegree}(\mathrm{src}(e)),\ \mathrm{indegree}(\mathrm{dest}(e)),\ \mathrm{outdegree}(\mathrm{dest}(e))\bigr) \;\middle|\; e \in E_g \right\}$$
This provides a set of integer tuples from a given graph $g$ in a quite straightforward manner. The
next question is the choice of a suitable encoding that minimizes unwanted collisions. In order
to achieve this, we notice that these tuples form a subset of vectors from $\mathbb{Q}^5$, and that we
can embed the vector space $\mathbb{Q}^5$ (a 5-dimensional vector space over $\mathbb{Q}$) into the
subspace of $\mathbb{R}$ spanned over $\mathbb{Q}$ by $1, \sqrt{2}, \sqrt{3}, \sqrt{5}, \sqrt{7}$. This is also a 5-dimensional
vector space over $\mathbb{Q}$, with the added property that each of its elements is a real number.
We thus have an easy way of converting a 5-tuple into a real number:
$$\mathrm{emb}(z) = z_0 + z_1\sqrt{2} + z_2\sqrt{3} + z_3\sqrt{5} + z_4\sqrt{7}$$
The only thing that we still need to construct our hash function is a method to combine
the values obtained from the individual tuples in a way that is agnostic of the ordering of the edges
while at the same time avoiding unwanted side effects. The final construction is the following:
$$\mathrm{Hash}(g) := \sum_{t \in \mathrm{tup}(g)} \frac{1}{\sqrt{\mathrm{emb}(t)}}$$
In reality, all calculations happen over floating point numbers and not over $\mathbb{R}$ – i.e. the math-
ematical intuition above, which works decently over $\mathbb{R}$, is reduced to handwaving in real implemen-
tations. Nonetheless, empirical evidence from several hundred thousand control flow graphs has
shown that the odds of structurally different control flow graphs randomly hashing to the same
value are exceedingly low and do not matter in practice.
In the following, we call this hash function the MD-Index for a given graph.
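The construction above can be sketched in a few lines of Python. The graph representation, the tie-breaking used for the topological order, and all function names are illustrative assumptions, not the paper's implementation:

```python
import math
from collections import defaultdict

def md_index(edges):
    """Sketch of the MD-Index hash for a control flow graph.

    `edges` is a list of (src, dst) pairs over hashable node ids.
    The topological order is approximated by order of first
    appearance (assumption: a real implementation would use a
    proper topological sort of the flowgraph).
    """
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for s, d in edges:
        outdeg[s] += 1
        indeg[d] += 1

    # Rank nodes by first appearance as a stand-in for topological order.
    order = {}
    for s, d in edges:
        for n in (s, d):
            if n not in order:
                order[n] = len(order)

    total = 0.0
    for s, d in edges:
        # Embed the 5-tuple into the reals via the basis 1, sqrt(2), sqrt(3), sqrt(5), sqrt(7).
        emb = (order[s]
               + indeg[s] * math.sqrt(2)
               + outdeg[s] * math.sqrt(3)
               + indeg[d] * math.sqrt(5)
               + outdeg[d] * math.sqrt(7))
        # emb > 0 always holds: the source of an edge has outdegree >= 1.
        total += 1.0 / math.sqrt(emb)
    return total
```

Because the coefficients $1, \sqrt{2}, \sqrt{3}, \sqrt{5}, \sqrt{7}$ are linearly independent over $\mathbb{Q}$, distinct tuples map to distinct reals in exact arithmetic; floating point makes this only approximate, as the paper notes.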
6 Comparing full executables
A fast lookup of individual control flow graphs is just one part of a proper correlation system.
Compiler optimizations often have a significant impact on the structure of control flow graphs,
and more thorough comparison algorithms that can deal with small modifications of control flow
graphs are desirable. These algorithms should exploit properties of the callgraph structure along
with control flow graph structures, and also take miscellaneous other properties (use of common
API functions etc.) into account.
6.1 The Goal
In order to determine the extent of code sharing between two executables, a mapping between the
functions, basic blocks, and instructions in the two executables is constructed. This means that the
algorithm constructs a mapping that for each function/basic block/instruction in the first executable
contains a reference to the corresponding function/basic block/instruction in the second executable
(if such a corresponding element exists).
In order to construct these mappings, the algorithm approximates a solution for the maximum
subgraph isomorphism problem on both the level of the executable callgraphs and the executable
flowgraphs.
Specifically, this means that an approximation of the maximum subgraph isomorphism is con-
structed between:
1. The two callgraphs of the two given executables
2. Each pair of flowgraphs that arise from the matching of two nodes in the two respective
callgraphs
6.1.1 The algorithm
The algorithm that is used to construct the approximation to the maximum subgraph isomorphism
problem is identical on both the callgraphs and flowgraphs - some details change, but the algorithm
stays identical. Let $G = (N, E)$ and $G' = (N', E')$ be the graphs in question.
Definition 1 (Characteristic) A pair of mappings $\sigma: N \to D$, $\sigma': N' \to D$ from the set of nodes
in a graph to some domain $D$ is called a node characteristic. A pair of mappings $\sigma: E \to D$,
$\sigma': E' \to D$ is called an edge characteristic. To simplify notation, $\sigma$ will be written for both $\sigma$ and $\sigma'$.
We maintain a separate list of characteristics for both the callgraph and the control flow graph –
i.e. the only difference between the algorithm on the callgraphs vs. on the flowgraphs is in the list
of characteristics.
The algorithm itself is relatively simple:
Given an arbitrary node, each characteristic provides a set of candidates that might be a good
match for this node. The characteristics are applied successively to refine the set of candidates
down to a unique match; if a unique match is found, the nodes are associated. If no unique match
is found by refining from the first characteristic (usually because no match is found at all) the first
characteristic is stripped off the list, and the process is iterated.
Data: $N, N'$ (sets of either edges or nodes), list of (node or edge) characteristics $\sigma_1, \ldots, \sigma_m$
Result: dictionary $D$ that contains mappings from $N$ to $N'$
for $i \leftarrow 1$ to $m$ do
    foreach $n \in N$ do
        temp $\leftarrow \{n' \in N' \mid \sigma_i(n') = \sigma_i(n)\}$;
        for $j \leftarrow i$ to $m$ do
            if $|\mathrm{temp}| = 1$ then
                $D[n] \leftarrow$ temp;
                break;
            end
            temp $\leftarrow \{n' \in \mathrm{temp} \mid \sigma_j(n') = \sigma_j(n)\}$;
        end
    end
end
Algorithm 1: Calculating a mapping given node or edge characteristics
The values for $\sigma_i(n')$ can be precomputed. If $D$ is "cheaply" ordered (e.g. allows constant-time
comparison of elements) and the values for $\sigma_i(n')$ were precomputed, the lookup of elements for the line
temp $\leftarrow \{n' \in N' \mid \sigma_i(n') = \sigma_i(n)\}$ can be done using a regular binary tree – the computational cost
is hence logarithmic in $|N'|$. A quick analysis of the algorithm suggests that the runtime complexity,
if all values are precomputed, should be upper-bounded by
$$O(m^2 \cdot |N| \cdot \log(|N'|)^2)$$
There is no backtracking involved in this algorithm, and no complicated decisions are made.
Speed is given precedence over accuracy, and the accuracy is obtained through the choice of char-
acteristics. The exact characteristics will be discussed in section 6.2.
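A minimal Python sketch of Algorithm 1 follows. It assumes characteristics are plain functions over nodes; the candidate refinement is slightly simplified and additionally prevents a node on the right-hand side from being matched twice, which the pseudocode above leaves implicit:

```python
def match_nodes(nodes_a, nodes_b, characteristics):
    """Greedy matching via successive refinement with a list of
    characteristic functions (illustrative sketch, not the paper's
    exact implementation).

    Each characteristic maps a node to a comparable value; nodes are
    associated when refinement narrows the candidates to exactly one.
    """
    mapping = {}
    matched_b = set()
    for i in range(len(characteristics)):
        for n in nodes_a:
            if n in mapping:
                continue
            # Initial candidate set from characteristic i.
            cand = [m for m in nodes_b
                    if m not in matched_b
                    and characteristics[i](m) == characteristics[i](n)]
            # Refine with the remaining characteristics until unique.
            for j in range(i + 1, len(characteristics)):
                if len(cand) <= 1:
                    break
                cand = [m for m in cand
                        if characteristics[j](m) == characteristics[j](n)]
            if len(cand) == 1:
                mapping[n] = cand[0]
                matched_b.add(cand[0])
    return mapping
```

As in the paper's algorithm, there is no backtracking: a node that never refines down to a unique candidate under the current leading characteristic is simply retried in a later iteration with that characteristic stripped off.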
6.2 List of characteristics
From the previous section it becomes clear that the proper choice of characteristics is crucial for
proper functioning of the discussed algorithm. In the following, we list some criteria that are used
for the graph matching step, both on the control flow graphs and on the callgraphs.
Byte Hash A traditional hash over the bytes of the function or the basic block is calculated.
MD-Index of a particular function For a given node in the callgraph, the MD-Index of the
control flow graph of the underlying function is constructed. This can only be applied to
callgraph nodes.
MD-Index of source and destination of callgraph edges For a given edge in the callgraph,
the MD-Index of the control flow graphs of the underlying functions for both the source
and the destination of the edge is calculated. The tuple consisting of two values forms the
characteristic for the edge.
MD-Index of graph neighborhood of a node/edge For a given node (or edge), a subgraph of
the larger graph consisting of a neighborhood of the original graph of a given size is extracted
(for example, all nodes that are less than 2 “hops” away). The MD-Index of this “local” graph
is calculated. This characteristic has the interesting property that nodes are matched not by
their intrinsic properties (e.g. the code contained therein), but by the local structure of the
graph of which they are part. This means that even in situations where both the callgraph
and the individual control flow graphs changed significantly, proper matching is still possible.
Small Prime Product The small prime product [3] is a simple way of calculating a hash of a se-
quence of mnemonics that ignores compiler-induced reordering of instructions. Each mnemonic
is mapped to a small prime number, and the product of all elements in the sequence is calculated
(modulo $2^{64}$).
More characteristics (hashes of string references etc.) can be added easily. It is notable though
that characteristics that generate few "collisions" (i.e., that on average map few elements to each
image value) produce better matchings in practice than those that produce many "collisions".
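The small prime product characteristic can be sketched as follows; the mnemonic-to-prime table and the fallback prime for unknown mnemonics are hypothetical, since the paper does not specify the table:

```python
# Hypothetical mnemonic-to-prime table; a real implementation would
# cover the full instruction set of the target architecture.
PRIMES = {"mov": 2, "add": 3, "sub": 5, "cmp": 7, "jmp": 11, "call": 13}

def small_prime_product(mnemonics):
    """Order-insensitive hash of a mnemonic sequence, modulo 2**64.

    Because multiplication is commutative, compiler-induced
    reordering of instructions does not change the hash.
    """
    product = 1
    for m in mnemonics:
        # 17 as the fallback for unknown mnemonics is an assumption.
        product = (product * PRIMES.get(m, 17)) % (2 ** 64)
    return product
```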
7 Full system implementation
The described algorithms have been implemented in an automated system to process large volumes of
malicious software. The system is designed around a central database from which multiple different
workers fetch data to perform computations on. This allows easy distribution of the computationally
intensive parts.
Figure 1: Architecture: A central database server that many workers poll from/write to
The system consists of a non-distributed scheduler component, and distributed components
called Unpacker, Disassembly and Comparison. These components are described below:
Unpacker: An unpacking component attempts to remove executable encryption by emulating the
executable while monitoring the statistical properties of the emulated system RAM. Once the
entropy of the memory pages drops, the component assumes that encryption/compression was
removed, and writes memory dumps back into the central database.
This step in the processing of malware can be skipped if memory dumps were obtained in a
different manner.
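The entropy heuristic the unpacker relies on can be sketched as below; the threshold of 6.5 bits per byte is an illustrative assumption, not a value given in the paper:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (range 0..8)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_unpacked(page: bytes, threshold: float = 6.5) -> bool:
    """Heuristic: packed/encrypted pages show near-maximal entropy;
    a drop below the (assumed) threshold suggests that decryption or
    decompression has finished and the page can be dumped."""
    return shannon_entropy(page) < threshold
```

A dump is triggered once the monitored pages fall below the threshold, at which point the memory image is written back to the central database as described above.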
Disassembly: A disassembly component disassembles a memory dump and extracts the control-
flow-graph structures from the disassembly. Functions are identified, the callgraph is ex-
tracted, and the results are written back to the central database.
Scheduler: Performing a full comparison of all files amongst each other would by its very nature
be quadratic, and hence prohibitively expensive.
In order to reduce the overall cost, the scheduler performs a rough comparison based on
MD-Indices of functions in the disassembly. This cheap comparison is used to schedule more
expensive comparisons: For a new executable, accurate comparisons against those existing
executables are scheduled that have a non-negligible similarity score in the approximate com-
parison.
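The scheduler's rough comparison could look like the following sketch. The Jaccard overlap of MD-Index sets and the 0.05 threshold are assumptions for illustration; the paper does not specify the exact scoring formula:

```python
def rough_similarity(md_indices_a, md_indices_b):
    """Cheap pre-filter score: Jaccard overlap of the MD-Index sets
    of two executables (illustrative assumption)."""
    a, b = set(md_indices_a), set(md_indices_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def schedule_comparisons(new_indices, corpus, threshold=0.05):
    """Return the names of existing executables whose rough score
    against the new executable warrants an accurate comparison.

    `corpus` maps executable names to their MD-Index sets."""
    return [name for name, indices in corpus.items()
            if rough_similarity(new_indices, indices) >= threshold]
```

Only the pairs returned here are handed to the comparison workers, which keeps the overall cost well below the quadratic worst case of comparing everything against everything.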
Comparison: A comparison-worker queries the database for pairs of executables to be compared
using the “more accurate” algorithms described below. It then fetches the relevant data from
the database, performs the comparisons, and writes the results of the comparison back to the
database.
8 Case studies
The system was used for attacker correlation in several real-world scenarios. In the following, we
will describe two real-world cases, followed by some internal evaluation.
8.1 Attack on financial services provider
Around the same time the much-publicized “Aurora” attacks appeared in the media, we were
contacted by one of our clients in the financial services sector. Several machines on their network
had been compromised, and malicious software had been found on these systems. Due to the timing
and widespread media attention given to "Aurora", the customer suspected that the attacks they
had fallen victim to were the work of the same attacker.
In order to confirm or disprove this suspicion, the system was fed with a number of executables
that had been obtained from the compromised machines on the client's network, along with
executables that were part of the "Aurora" attacks. Some manual intervention was required to remove
the obfuscation layers from the executables – since they were loaded in a nonstandard manner, the
generic emulation could not handle them straight away.
The result of the comparison/classification process showed that the executables obtained from
the compromised machines were all very similar amongst each other - sharing between 98% and
99% of all the code.
On the other hand, no discernible similarity was found between these files and the various files
involved in the "Aurora" attacks.
Figure 2: Case study: Classification results showing no similarity
8.2 Sophisticated Rootkit
We were contacted by the victim of a large-scale, multi-year attack on their network infrastructure
to perform an analysis of the rootkit code that had been obtained on several of their hosts. During
the investigation, many versions of the rootkit code were uncovered. Most of them were relatively
recent. On some legacy systems, a suspicious device driver that dated back several years was found.
This driver did not exhibit a superficial similarity to the “current” version of the rootkit.
Nonetheless, a structural comparison as described above determined that about 48% of the code
from the new version of the driver was similar to (or derived from) the old version of the driver.
This helped tremendously in establishing important points on the timeline of the attack.
8.3 Large-scale clustering
To perform further tests, 5000 randomly chosen malicious executables were automatically grouped
into families using the similarity metric above – executables that shared more than 60% of their
overall code were considered one cluster.
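Grouping samples by a pairwise similarity threshold amounts to computing connected components, which a minimal union-find sketch can illustrate (all names are illustrative; the paper does not describe its clustering implementation):

```python
class DisjointSet:
    """Minimal union-find for grouping executables into families."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps lookups near-constant amortized.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster(pairs_with_similarity, threshold=0.60):
    """Group samples whose pairwise code-sharing score exceeds the
    threshold (0.60 as in the experiment above). Input is an
    iterable of (sample_a, sample_b, score) triples."""
    ds = DisjointSet()
    for a, b, score in pairs_with_similarity:
        ds.find(a)
        ds.find(b)  # register both samples even if they stay apart
        if score > threshold:
            ds.union(a, b)
    clusters = {}
    for x in ds.parent:
        clusters.setdefault(ds.find(x), set()).add(x)
    return list(clusters.values())
```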
From these 5000 executables, a few of the larger clusters were extracted. For each cluster,
names were determined by querying the VirusTotal database – if a few executables in a cluster
were detected with consistent names by any anti-virus product, the clusters were named accordingly.
Name of the cluster # of executables
Win32.KillAV.Variants 183
Win32.Bacuy.Variants 599
Win32.SkinTrim 173
Win32.SwizzorA 15
Win32.WinTrim 114
FakeAlert 54
Win32.Allaple 39
Win32.Prosti 27
Win32.Sality 21
Win32.Zhelatin? 22
Win32.Swizzor? 14
Win32.Chifrax 12
Table 1: The number of executables for the different clusters
9 Summary
Automated correlation between attackers is possible based on the sharing of executable code, which, in
turn, is made possible by the structural comparison algorithms described in this paper. Shifting the
focus from byte sequences to the graph structure of the executables (keeping both the control-
flow-graph structure and the callgraph structure in mind) allows the correlation of software
in the presence of heavy compiler-induced changes to the generated code.
References
[1] Ero Carrera and Gergely Erdelyi. Digital genome mapping - advanced binary malware analysis.
In Proceedings of the Virus Bulletin Conference 2004, pages 187–197, 2004.
[2] Ero Carrera and Halvar Flake. Automated structural classification of malware. In Proceedings
of the RSA Conference 2008, 7-11 April, 2008, 2008.
[3] Thomas Dullien and Rolf Rolles. Graph-based comparison of executable objects (english version).
In SSTIC '05, Symposium sur la Sécurité des Technologies de l'Information et des
Communications, June 1-3, 2005, Rennes, France, 06 2005.
[4] Halvar Flake. Diffing x86 vs arm code.
[5] Halvar Flake. Improving binary comparison.
[6] Halvar Flake. More fun with graphs. In Blackhat Federal 2003, 2003.
[7] Halvar Flake. Structural comparison of executable objects. In Ulrich Flegel and Michael Meier,
editors, DIMVA, volume 46 of LNI, pages 161–173. GI, 2004.
[8] Xin Hu, Tzi-cker Chiueh, and Kang G. Shin. Large-scale malware indexing using function-call
graphs. In CCS ’09: Proceedings of the 16th ACM conference on Computer and communications
security, pages 611–620, New York, NY, USA, 2009. ACM.
[9] Todd Sabin. Comparing binaries with graph isomorphism. 2003.
... We propose a set of function features extracted from a binary's CG and CFGs; these can be used by variants of the BinDiff algorithm [8,9,11] to (i) build a set of initial exact matches with minimal false positives, by scanning for unique perfect matches, and (ii) to propagate approximate matching information, for example, by using a nearest-neighbor scheme. One of the proposed features is obtained by applying Markov lumping to function CFGs (to our knowledge, this technique has not been previously studied). ...
... For example, Dullien et al. [8,9,11] form feature vectors (composed of the number of vertices and edges as well as the out-degree of each CFG) which they use to locate perfect matches in the compared executables. Later publications [2,10] extend Dullien's work by adding or modifying features in the aforementioned feature vector. ...
... In previous publications [2,[8][9][10][11] feature vectors extracted from function CFGs include two features which reflect a digraph's overall structure; the number of vertices (basic blocks) and the number of edges in the corresponding function. These two important structural characteristics can significantly speed up the pre-filtering phase, by discarding candidates with incompatible vertex and edge counts. ...
Article
Full-text available
Binary diffing consists in comparing syntactic and semantic differences of two programs in binary form, when source code is unavailable. It can be reduced to a graph isomorphism problem between the Control Flow Graphs, Call Graphs or other forms of graphs of the compared programs. Here we present REveal, a prototype tool which implements a binary diffing algorithm and an associated set of features, extracted from a binary’s CG and CFGs. Additionally, we explore the potential of applying Markov lumping techniques on function CFGs. The proposed algorithm and features are evaluated in a series of experiments on executables compiled for i386, amd64, arm and aarch64. Furthermore, the effectiveness of our prototype tool, code-named REveal, is assessed in a second series of experiments involving clustering of a corpus of 18 malware samples into 5 malware families. REveal’s results are compared against those produced by Diaphora, the most widely used binary diffing software of the public domain. We conclude that REveal improves the state-of-the-art in binary diffing by achieving higher matching scores, obtained at the cost of a slight running time increase, in most of the experiments conducted. Furthermore, REveal successfully partitions the malware corpus into clusters consisting of samples of the same malware family.
... For binaries like firmware, finding the clones at the binary level is challenging [18], [19]. Dullien et al. [20] formulate a correlation technique by defining five function features that can map the code clones. Saebjornsen et al. [21] model the assembly instruction by extending the tree similarity framework based on clustering of characteristic vectors. ...
Article
Full-text available
Binary-binary function matching problem serves as a plinth in many reverse engineering techniques such as binary diffing, malware analysis, and code plagiarism detection. In literature, function matching is performed by first extracting function features (syntactic and semantic), and later these features are used as selection criteria to formulate an approximate 1:1 correspondence between binary functions. The accuracy of the approximation is dependent on the selection of efficient features. Although substantial research has been conducted on this topic, we have explored two major drawbacks in previous research. (i) The features are optimized only for a single architecture and their matching efficiency drops for other architectures. (ii) function matching algorithms mainly focus on the structural properties of a function, which are not inherently resilient against compiler optimizations. To resolve the architecture dependency and compiler optimizations, we benefit from the intermediate representation (IR) of function assembly and propose a set of syntactic and semantic (embedding-based) features which are efficient for multi-architectures, and sensitive to compiler-based optimizations. The proposed function matching algorithm employs one-shot encoding that is flexible to small changes and uses a KNN based approach to effectively map similar functions. We have evaluated proposed features and algorithms using various binaries, which were compiled for x86 and ARM architectures; and the prototype implementation is compared with Diaphora (an industry-standard tool), and other baseline research. Our proposed prototype has achieved a matching accuracy of approx. 96%, which is higher than the compared tools and consistent against optimizations and multi-architecture binaries.
... The token-based approach relies on feature vectors consisted of opcodes and operands and employs metrics to provide function-level clone detection [38]. The structural-based approach maps the code back to execution schema and compares their structural features [39]. Our recent work combines several existing concepts from classic program analysis, including control flow graph, register flow graph, and function call graph, to capture semantic similarity between two binaries [40]. ...
Article
Diversity has long been regarded as a security mechanism for improving the resilience of software and networks against various attacks. More recently, diversity has found new applications in cloud computing security, moving target defense, and improving the robustness of network routing. However, most existing efforts rely on intuitive and imprecise notions of diversity, and the few existing models of diversity are mostly designed for a single system running diverse software replicas or variants. At a higher abstraction level, as a global property of the entire network, diversity and its effect on security have received limited attention. In this paper, we take the first step toward formally modeling network diversity as a security metric by designing and evaluating a series of diversity metrics. In particular, we first devise a biodiversity-inspired metric based on the effective number of distinct resources. We then propose two complementary diversity metrics, based on the least and the average attacking efforts, respectively. We provide guidelines for instantiating the proposed metrics and present a case study on estimating software diversity. Finally, we evaluate the proposed metrics through simulation.
Chapter
Reverse engineering is labor-intensive work to understand the inner implementation of a program, and is necessary for malware analysis, vulnerability hunting, etc. Cross-version function identification and subroutine matching would greatly release manpower by indicating the known parts coming from different binary programs. Existing approaches mainly focus on function recognition ignoring the recovery of the relationships between functions, which makes the researchers hard to locate the calling routine they are interested in. In this paper, we propose a method using graphlet edge embedding to abstract high-level topology features of function call graphs and recover the relationships between functions. With the recovery of function relationships, we reconstruct the calling routine of the program and then infer the specific functions in it. We implement a prototype model called RouAlign, which can automatically align the trunk routine of assembly codes. We evaluated RouAlign on 65 groups of real-world programs, with over two million functions. RouAlign outperforms state-of-the-art binary comparing solutions by over 35% with a high precision of 92% on average in pairwise function recognition.
Chapter
Diversity has long been regarded as a security mechanism and it has found new applications in security, e.g., in cloud, Moving Target Defense (MTD), and network routing. However, most existing efforts rely on intuitive and imprecise notions of diversity, and the few existing models of diversity are mostly designed for a single system running diverse software replicas or variants. At a higher abstraction level, as a global property of the entire network, diversity and its effect on security have received limited attention. In this chapter, we present a formal model of network diversity as a security metric. Specifically, we first devise a biodiversity-inspired metric based on the effective number of distinct resources. We then propose two complementary diversity metrics, based on the least and the average attacking efforts, respectively. Finally, we evaluate the proposed metrics through simulation.
Article
During software development, developers often copy and paste fragments of code to achieve a desired result. Copied code can introduce a variety of errors and also increases the size of the source and binary code. Because the source code of many programs is unavailable, the problem of finding semantically similar pieces of code (clones) directly in binary code has become relevant. The first part of the article analyzes the existing methods for finding code clones in binary code. In the second part, we present a newly developed tool for finding code clones in binary code. The tool works in three main stages. The first stage is based on the BinNavi [1] framework, which is responsible for generating program dependence graphs (PDGs). The graphs are generated from REIL (Reverse Engineering Intermediate Language); using REIL allows graphs to be generated for multiple architectures (x86, x86-64, ARM, MIPS, PPC), making the tool independent of the target architecture. In the second stage, code clones are found based on the previously created graphs: a maximum common subgraph is built for each pair of graphs, and code clones are detected from it. In the third stage, the detected clones are visualized for convenient analysis of the results.
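To make the pairwise graph comparison concrete, here is a hedged sketch of a similarity score between two tiny PDG-like graphs. Note this is only a cheap labeled-edge overlap, a crude lower-bound proxy for a common subgraph, not the maximum-common-subgraph construction the tool actually performs; the graph encoding is invented for illustration:

```python
from collections import Counter

def labeled_edges(graph):
    """graph: dict node -> (label, successor_list). Return the multiset of
    (label, label) pairs induced by its edges - a cheap fingerprint of
    structure plus operations."""
    edges = Counter()
    for node, (label, succs) in graph.items():
        for s in succs:
            edges[(label, graph[s][0])] += 1
    return edges

def clone_similarity(g1, g2):
    """Overlap coefficient of labeled-edge multisets: 1.0 for identically
    structured code, near 0 for unrelated code."""
    e1, e2 = labeled_edges(g1), labeled_edges(g2)
    common = sum((e1 & e2).values())
    return common / max(1, min(sum(e1.values()), sum(e2.values())))

pdg   = {"n1": ("load", ["n2"]), "n2": ("add", ["n3"]), "n3": ("store", [])}
clone = {"m1": ("load", ["m2"]), "m2": ("add", ["m3"]), "m3": ("store", [])}
other = {"k1": ("mul", ["k2"]), "k2": ("store", [])}
print(clone_similarity(pdg, clone))  # 1.0
print(clone_similarity(pdg, other))  # 0.0
```

Exact maximum common subgraph is NP-hard, which is precisely why practical clone detectors combine such cheap filters with more expensive structural matching.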
Article
A method to construct an optimal isomorphism between the sets of instructions, sets of basic blocks, and sets of functions in two differing but similar executables is presented. This isomorphism can be used for porting recovered information between different disassemblies, recovering changes made by security updates, and detecting code theft. The most interesting applications in the realm of security are in malware analysis, where the analysis of a family of trojans or viruses can be reduced to analyzing the differences between the variants, and in recovering the details of fixed vulnerabilities when the vendor of the security patch refuses to disclose them. A framework implementing the described methods is presented, along with empirical data about its performance when analyzing multiple variants of the same malware and recovering vulnerability details from security updates.
Article
A method to heuristically construct an isomorphism between the sets of functions in two similar but differing versions of the same executable file is presented. Such an isomorphism has multiple practical applications, most notably the ability to detect programmatic changes between the two executable versions. Moreover, information (such as function names) that is available for one of the two versions can be made available for the other. A framework implementing the described methods is presented, along with empirical data about its performance when used to analyze patches for recent security vulnerabilities. As a practical example, a security update that fixes a critical vulnerability in an H.323 parsing component is analyzed, the relevant vulnerability is extracted, and the implications of the vulnerability and the fix are discussed.
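The heuristic construction can be illustrated with the classic initial matching step: fingerprint every function by a (basic blocks, CFG edges, calls) triple and pair up functions whose fingerprint is unique in both executables. A simplified sketch, with function names and statistics invented for illustration:

```python
def signature(func):
    """Structural fingerprint of a function: its counts of basic blocks,
    control-flow edges, and outgoing calls."""
    return (func["blocks"], func["edges"], func["calls"])

def match_functions(old, new):
    """Map functions of the old binary to functions of the new one whenever
    a signature occurs exactly once on both sides. Unmatched functions are
    candidates for changed code. old/new: dict name -> per-function stats."""
    def unique(binary):
        sigs = {}
        for name, func in binary.items():
            sigs.setdefault(signature(func), []).append(name)
        return {s: names[0] for s, names in sigs.items() if len(names) == 1}
    u_old, u_new = unique(old), unique(new)
    return {u_old[s]: u_new[s] for s in u_old.keys() & u_new.keys()}

old = {"sub_401000": {"blocks": 5, "edges": 7, "calls": 2},
       "sub_401200": {"blocks": 1, "edges": 0, "calls": 0}}
new = {"sub_8000": {"blocks": 5, "edges": 7, "calls": 2},
       "sub_9000": {"blocks": 2, "edges": 1, "calls": 0}}
print(match_functions(old, new))  # {'sub_401000': 'sub_8000'}
```

In practice the initial matches are then propagated along the call graph to resolve functions whose signatures collide, which is where the heuristic part of the construction lies.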
Conference Paper
A major challenge for the anti-virus (AV) industry is how to effectively process the huge influx of malware samples received every day. One possible solution to this problem is to quickly determine whether a new malware sample is similar to any previously seen malware program. In this paper, we design, implement, and evaluate a malware database management system called SMIT (Symantec Malware Indexing Tree) that can efficiently make such a determination based on malware's function-call graphs, a structural representation known to be less susceptible to the instruction-level obfuscations commonly employed by malware writers to evade detection by AV software. Because each malware program is represented as a graph, the problem of searching a database for the malware program most similar to a given sample is cast as a nearest-neighbor search in a graph database. To speed up this search, we have developed an efficient method to compute graph similarity that exploits structural and instruction-level information in the underlying malware programs, and a multi-resolution indexing scheme that uses a computationally economical feature vector for early pruning and resorts to a more accurate but computationally more expensive graph similarity function only when it needs to pinpoint the most similar neighbors. Results of a comprehensive performance study of the SMIT prototype using a database of more than 100,000 malware samples demonstrate the effective pruning power and scalability of its nearest-neighbor search mechanisms.
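The multi-resolution scheme boils down to a two-stage search: a cheap feature vector rules out most candidates, and the expensive similarity function ranks only the survivors. A toy sketch of that pattern, where both the feature choices and the placeholder similarity are assumptions rather than SMIT's actual functions:

```python
def features(call_graph):
    """Cheap summary vector: function count, edge count, leaf count.
    Used only to rule out obviously dissimilar samples."""
    edges = sum(len(callees) for callees in call_graph.values())
    leaves = sum(1 for callees in call_graph.values() if not callees)
    return (len(call_graph), edges, leaves)

def feature_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def expensive_similarity(g1, g2):
    """Placeholder for a real graph-similarity computation (e.g. graph
    edit distance): here, Jaccard overlap of out-degree values."""
    d1 = set(len(v) for v in g1.values())
    d2 = set(len(v) for v in g2.values())
    return len(d1 & d2) / max(1, len(d1 | d2))

def nearest(query, database, prune_at=3):
    """Multi-resolution search: prune by feature distance first, then rank
    the survivors with the expensive similarity function."""
    fq = features(query)
    survivors = [(name, g) for name, g in database.items()
                 if feature_distance(fq, features(g)) <= prune_at]
    if not survivors:
        return None
    return max(survivors, key=lambda ng: expensive_similarity(query, ng[1]))[0]

query = {"a": ["b"], "b": []}
database = {"small": {"x": ["y"], "y": []},
            "big": {"n%d" % i: [] for i in range(50)}}
print(nearest(query, database))  # small
```

The design point is that the pruning vector must be much cheaper than the similarity function yet rarely discard the true nearest neighbor; SMIT's indexing tree serves exactly that role at database scale.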
Halvar Flake. More fun with graphs. In Blackhat Federal 2003, 2003.
Todd Sabin. Comparing binaries with graph isomorphism. 2003.
Ero Carrera and Halvar Flake. Automated structural classification of malware. In Proceedings of the RSA Conference 2008, 7-11 April 2008, 2008.
Ero Carrera and Gergely Erdelyi. Digital genome mapping - advanced binary malware analysis. In Proceedings of the Virus Bulletin Conference 2004, pages 187-197, 2004.
Halvar Flake. Diffing x86 vs ARM code.