Dynamic Syntax Tree: Optimized Binary Sandboxing
Prof. Tim Moses, Department of Software Engineering, BitBrainery University, London UK
David Syman, CTO, Security Reviewer - UK
Marco Barzanti, Security Auditor, Poste Italiane - IT
[10th of December 2014]
Abstract - Dynamic Syntax Tree (DST) implementations [1] use Binary Sandboxing to enhance the Static Analysis process. In
this paper we present a new Dynamic Binary analysis method for collecting information on ELF, PE and Mach-O executables
and dynamic libraries. This information enriches DST contents during application scanning.
Keywords - dynamic syntax tree, binary analysis, sandbox, dynamic analysis, static code analysis, abstract syntax tree, parser
I. INTRODUCTION
In our earlier research [1], we presented the Dynamic Syntax
Tree and a Dynamic Syntax Tree-based implementation [2].
Program analyses can be categorized into three groups,
according to the type of code being analyzed:
Source analysis involves analyzing programs at the level of
source code. Many tools perform source analysis. This
category includes analyses performed on program
representations that are derived directly from source code.
Source analyses are generally done in terms of programming
language constructs, such as functions, statements,
expressions, and variables, and they fall short when analyzing
dynamic languages.
Dynamic analysis involves analyzing a client program as it
executes. Many tools perform dynamic analysis, for example,
profilers, checkers and execution visualizers. Tools
performing dynamic analysis must instrument the client
program with analysis code. The analysis code may be
inserted entirely inline; it may also include external routines
called from the inline analysis code. The analysis code runs as
part of the program's normal execution, not disturbing the
execution (other than probably slowing it down), but doing
extra work "on the side", such as measuring performance or
identifying bugs. The analysis code must maintain some kind
of analysis state, called metadata (individual pieces of
metadata are called meta-values).
Binary analysis involves analyzing programs at the level of
machine code, stored either as object code (pre-linking) or
executable code (post-linking). This category includes
analyses performed at the level of executable intermediate
representations, such as bytecodes, which run on a virtual
machine.
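As an illustration of the dynamic analysis category described above, the following minimal sketch shows analysis code that runs alongside normal execution and maintains metadata. It works at the level of Python functions rather than machine code, and the names `instrument`, `metadata` and `checksum` are hypothetical; it is not the instrumentation used by the tool described in this paper.

```python
import functools
import time
from collections import defaultdict

# Analysis state ("metadata"): one record per instrumented function.
metadata = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def instrument(func):
    """Wrap func with analysis code that does extra work on the side."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = metadata[func.__name__]
        start = time.perf_counter()
        try:
            # Normal execution of the client code is not altered.
            return func(*args, **kwargs)
        finally:
            # Update the meta-values after every call, even on error.
            record["calls"] += 1
            record["seconds"] += time.perf_counter() - start
    return wrapper


@instrument
def checksum(data: bytes) -> int:
    # Hypothetical client function used only for the demonstration.
    return sum(data) % 256


result = checksum(b"\x01\x02\x03")
```

After the call, `metadata["checksum"]` holds the call count and the accumulated time, while `result` is the same value the uninstrumented function would return.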
Pre-processing the source code written in a dynamic
language, such as Java, PHP, C#, Python, etc., created a
specialized Dynamic Syntax Tree and an Object Dictionary for
each Class found. The construction of Dynamic Syntax Trees
also provided an analysis of binaries. Binaries are sandboxed,
and dynamic information is collected at runtime using a very
fast process. Mixing source code and binary analysis overcame
the limitations of Static Analysis by updating the Dynamic
Syntax Tree with additional information. There are multiple
Object Dictionaries and Dynamic Syntax Trees, optimized for
low resource consumption and higher performance.
This article focuses on binary analysis of machine code, and
does not consider source analysis or byte-code binary analysis
any further.
II. BINARY ANALYSIS
Binary analysis is particularly useful when:
• you get binaries but no source code, and want to
know what could be in there;
• you get binaries and source code, but don't know whether
the binaries and sources match;
• you want to know how binaries interact, which a
source code scanner cannot tell you.
The two main approaches in binary analysis are static binary
analysis and dynamic binary analysis.
In the first case, the analysis is done without executing the
program under analysis. The first step of the analysis involves
reading and interpreting the format of the binary program,
such as ELF, PE or Mach-O. This provides information about the
file, including the location of its different sections, e.g.,
data and text sections. Once the text section is located, the
disassembly process may begin. This task is not
straightforward. Some architectures, for instance x86, have
variable-length instructions, which makes the process tricky.
Moreover, assembly code lacks structure, meaning code and data
can be mixed together, which complicates the process even further.
The last step consists of translation to the intermediate
representation language over which analysis algorithms are
applied.
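The first step described above, recognizing the container format, can be sketched as follows. This is a minimal illustration that assumes only the well-known magic numbers of each format; the function name `detect_format` is hypothetical, and a real loader would go on to parse the section tables.

```python
import struct

# Mach-O magic numbers in both byte orders, plus the fat/universal header.
# Note: 0xCAFEBABE is also the magic of Java class files, so a real tool
# would need further checks to tell the two apart.
MACHO_MAGICS = {
    b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",  # 32-bit
    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe",  # 64-bit
    b"\xca\xfe\xba\xbe",                       # fat/universal
}


def detect_format(data: bytes) -> str:
    """Classify a binary image as 'ELF', 'PE', 'Mach-O' or 'unknown'."""
    if data[:4] == b"\x7fELF":
        return "ELF"
    if data[:4] in MACHO_MAGICS:
        return "Mach-O"
    # A PE file starts with a DOS header ("MZ"); the 32-bit little-endian
    # field at offset 0x3C points to the "PE\0\0" signature.
    if data[:2] == b"MZ" and len(data) >= 0x40:
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
            return "PE"
    return "unknown"


# Build a tiny synthetic PE header purely for demonstration.
fake_pe = bytearray(0x80)
fake_pe[0:2] = b"MZ"
struct.pack_into("<I", fake_pe, 0x3C, 0x40)
fake_pe[0x40:0x44] = b"PE\x00\x00"
detected = detect_format(bytes(fake_pe))
```

In practice the bytes would come from reading the head of the file on disk; the synthetic header here only keeps the sketch self-contained.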
In contrast, Dynamic Binary analysis requires running the
program in order to obtain an execution trace. This is
accomplished through dynamic binary instrumentation.
A lot of information can be obtained through this process:
for instance, the value of each register at each step
of the program execution, the outcome of conditional
branches, the list of system calls invoked by the
program, the class and object hierarchy, etc. The trace is then
analyzed with online methods. Limited path coverage is one
of the main disadvantages of this technique, as only a small
subset of all possible paths is exercised during normal
execution. We addressed this problem by sandboxing the
application and automatically creating a set of stimulators (i.e.
dedicated tests that properly exercise all methods),
obtaining very high path coverage.
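The paper does not describe how the stimulators are generated. The following is only a sketch of the general idea, assuming the target is a Python object whose public methods can be enumerated by reflection and called with placeholder arguments. `SAMPLE_VALUES`, `make_stimulators`, `run_stimulators` and the `Demo` class are all hypothetical names.

```python
import inspect

# Hypothetical placeholder values, chosen by parameter annotation.
SAMPLE_VALUES = {int: 0, float: 0.0, str: "", bytes: b"", bool: False}


def make_stimulators(obj):
    """Return {method name: zero-argument callable} for public methods."""
    stimulators = {}
    for name, method in inspect.getmembers(obj, predicate=inspect.ismethod):
        if name.startswith("_"):
            continue  # private and special methods are not stimulated
        args = []
        for param in inspect.signature(method).parameters.values():
            if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
                continue  # *args / **kwargs are left empty
            if param.default is not param.empty:
                args.append(param.default)
            else:
                # Unannotated or unknown types fall back to None.
                args.append(SAMPLE_VALUES.get(param.annotation))
        stimulators[name] = lambda m=method, a=tuple(args): m(*a)
    return stimulators


def run_stimulators(stimulators):
    """Call every stimulator and record whether the method ran or raised."""
    report = {}
    for name, stimulate in stimulators.items():
        try:
            stimulate()
            report[name] = "ok"
        except Exception as exc:
            report[name] = f"raised {type(exc).__name__}"
    return report


class Demo:
    def parse(self, data: bytes) -> int:
        return len(data)

    def ratio(self, n: int) -> float:
        return 1 / n

    def _internal(self):
        return None


report = run_stimulators(make_stimulators(Demo()))
```

Every public method is reached at least once, and methods that fail on the placeholder input are recorded rather than aborting the run. A real stimulator set would need several inputs per method to cover more than one path.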
III. HYBRID ANALYSIS VIA SANDBOXING
We named this tool Binary Reviewer. Because it is execution-
driven, almost all code is handled naturally without difficulty;
this includes normal executable code, dynamically linked
libraries, and dynamically generated code. The only code that
can cause problems is self-modifying code, that is, code that
alters its own instructions while it is executing, usually to
reduce the instruction path length and improve performance, or
simply to reduce otherwise repetitive code and thus simplify
maintenance. On some platforms, handling self-modifying code is
easy because an explicit "flush" instruction
must be used when code is modified, but the x86 does not
have this feature [3]. One could write-protect pages
containing code that has already been translated, and then
flush the translations if that page is written; however, this
approach is not practical
if a program puts code on the stack (as GCC does for C code
that uses nested functions, for example), and can be slow if
code and data segments appear on the same page (as the
x86/Linux ELF executable format allows). By default, Binary
Reviewer isolates the code in a smart virtual machine, using
the VirtualBox API. Using a virtual machine as the basis for
Dynamic Binary analysis guarantees a secure and controlled
environment in which malware can be executed and the
original system state can be restored afterwards. Inside the
virtual machine we can observe the system activity of a given
target program at runtime. VirtualBox employs a client-
server design, meaning that whenever any part of VirtualBox
is running, a dedicated server process named VBoxSVC runs
in the background. This allows multiple processes working
with VirtualBox to cooperate without conflicts. Each
VirtualBox instance used in our implementation runs on the
client side and needs RAM equal to the binary code size plus 1 MB.
The performance overhead compared with Static Analysis alone is
calculated as 15%. A medium-size DLL, JAR, .so or executable
takes about 15 seconds to analyze on a mid-range,
dual-core notebook, and up to 5 analyses can run at a
time with only 4 GB of RAM. Binary Reviewer can also be used in
combination with the faster StaticStream® engine.
StaticStream® uses state-of-the-art technologies that can
very quickly detect interesting code sequences (such as
shellcode or bot handler code) in memory data, dumps or
on-disk binary files. This engine is centralized and
optimized in a Linux server virtual machine that takes care
of all sandboxing activities. Using the StaticStream® engine
requires no additional RAM and adds no performance overhead
on the machines used to process the analysis, because the
work is redirected to the StaticStream® server, which can be
on premises or in SaaS mode. Binary Reviewer can combine
dynamic and static data and code in order to enrich its
analysis results using, e.g., data flow tracking. Both the
VirtualBox and StaticStream® engines are supported by
Binary Reviewer, which combines them with Static Analysis
using a hybrid approach. Hybrid analysis attempts to
erase the boundaries between static and dynamic analysis
and create unified analyses that can operate in either
mode or in a mode that utilizes the strengths of both
approaches. Static and dynamic analyses can enhance one
another by providing information that would otherwise be
unavailable. Hybrid analyses sacrifice a small
amount of the soundness of static analysis and a small
amount of the accuracy of dynamic analysis to obtain new
techniques whose properties are better suited to
particular cases than purely static or dynamic analysis [4].
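How static and dynamic results can complement one another may be illustrated with a small sketch. It assumes, purely for illustration, that each analysis yields a set of symbol names; the function `merge_findings` and the sample symbols are hypothetical and do not reflect Binary Reviewer's internal data model.

```python
def merge_findings(static_symbols: set, dynamic_symbols: set) -> dict:
    """Combine symbols seen by static analysis with symbols seen at runtime."""
    return {
        # Seen by both analyses: the static result is confirmed at runtime.
        "confirmed": sorted(static_symbols & dynamic_symbols),
        # Seen only statically: possibly dead code or an unexercised path.
        "static_only": sorted(static_symbols - dynamic_symbols),
        # Seen only at runtime: e.g. dynamically generated or loaded code.
        "dynamic_only": sorted(dynamic_symbols - static_symbols),
    }


# Hypothetical inputs for demonstration.
static_view = {"main", "parse", "legacy_export"}
dynamic_view = {"main", "parse", "decrypt_stub"}
merged = merge_findings(static_view, dynamic_view)
```

The `dynamic_only` bucket is the information a purely static scan would miss, while `static_only` marks what a purely dynamic run did not reach.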
IV. VirtualBox-based vs StaticStream®-based
The Malware Running Process is optional, and is obtained by
integrating the StaticStream® engine technology.
While the VirtualBox-based solution can analyze ELF, PE and
Mach-O executables, StaticStream® focuses on PE
executables only and is better suited to Malware
Detection, or to cases where higher performance is needed
on an x86 architecture. The VirtualBox-based solution
implements the Dynamic Binary Analysis itself, while
StaticStream® is a Virtual Machine (VxStream Sandbox®)
treated as a black box. The VirtualBox-based solution runs in the
following steps, similar to the Valgrind analysis process [5], but
simplified and optimized for speed:
1. Hybrid Decompilation. Binary Reviewer represents code
with a RISC-like two-address intermediate representation which uses
virtual registers. The automatically generated RISC-like code features
high-level control structures (if, do-while loops, switch statements),
function parameters and local variables, high-level types (including
high-level types for APIs), and function call arguments.
Compared to standard decompilation, Hybrid Decompilation (a
simplified version of the one used by Joe Sandbox [6], and similar to GFI
Sandbox™ [7]) operates directly on memory and benefits from
dynamic data such as strings, function arguments and execution
marks.
2. Instrumentation. The tool adds its desired analysis code. It can
make as many passes over the code as it likes, including
optimisation passes.
3. Memory snapshots. Snapshots are stored in basic blocks,
together with memory dumps, in the translation table, a
linear-probe hash table. The translation table is large (300,000
entries), so it rarely gets full. If the table gets more than 80% full,
translations are evicted in chunks, 1/8th of the table at a time,
using a FIFO policy. This is better than the simplistic strategy of
clearing the entire translation table used by many
instrumentation frameworks.
4. Code Execution. The code is executed in basic blocks, which are
translated and executed one by one, with every external resource
called by the executable virtualized.
5. Security Analysis. All tainted data, suspicious behaviors and
dangerous API calls are detected and collected as Security Tags in
the Symbols Collection.
6. Symbols Collection. This phase stores all Classes,
Methods, Objects, Dynamic Code, String data and API calls,
including the Security Tags. All these data are used to enrich
the Dynamic Syntax Trees used for processing the analysis.
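The translation table described under "Memory snapshots" can be sketched as follows. This is a minimal illustration that assumes entries are keyed by block address and that eviction rebuilds the table for simplicity. The class `TranslationTable` and its method names are hypothetical; only the capacity, the 80% threshold, the 1/8 chunk size and the FIFO policy come from the text.

```python
from collections import deque


class TranslationTable:
    """Linear-probe hash table that evicts the oldest entries in chunks."""

    def __init__(self, capacity=300_000, max_load=0.8, evict_fraction=0.125):
        self.capacity = capacity
        self.max_load = max_load
        self.evict_count = max(1, int(capacity * evict_fraction))
        self.slots = [None] * capacity  # each slot: (address, translation)
        self.order = deque()            # addresses, oldest first (FIFO)
        self.size = 0

    def _probe(self, address):
        """Index of the slot holding address, or of the first empty slot."""
        i = hash(address) % self.capacity
        while self.slots[i] is not None and self.slots[i][0] != address:
            i = (i + 1) % self.capacity  # linear probing
        return i

    def get(self, address):
        slot = self.slots[self._probe(address)]
        return slot[1] if slot is not None else None

    def put(self, address, translation):
        i = self._probe(address)
        if self.slots[i] is None:  # new entry, not an update
            if self.size + 1 > self.max_load * self.capacity:
                self._evict_oldest_chunk()
                i = self._probe(address)
            self.order.append(address)
            self.size += 1
        self.slots[i] = (address, translation)

    def _evict_oldest_chunk(self):
        """Drop the oldest 1/8 of the table, then rebuild the probe chains."""
        count = min(self.evict_count, len(self.order))
        evicted = {self.order.popleft() for _ in range(count)}
        survivors = [s for s in self.slots
                     if s is not None and s[0] not in evicted]
        # Rebuilding avoids tombstones; a real table would evict in place.
        self.slots = [None] * self.capacity
        for address, translation in survivors:
            self.slots[self._probe(address)] = (address, translation)
        self.size = len(survivors)


# Small capacity so that eviction is easy to observe.
table = TranslationTable(capacity=16)
for addr in range(13):
    table.put(addr, f"t{addr}")
```

With a capacity of 16, the thirteenth insertion would push the table past 80%, so the two oldest entries (one eighth of 16) are evicted first and the remaining translations stay available.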
V. CONCLUSION AND FUTURE WORK
This paper described the latest four years (2011-2014) of
research on optimizing an implementation of automatic
analysis of dynamic-language source code using Dynamic
Syntax Trees. To enhance DSTs with dynamic
information, we used an optimized Binary Sandboxing
method presented in our previous article [1], which will be
enhanced year by year and will be the subject of future papers.
VI. ACKNOWLEDGMENTS
This work was kindly supported by:
Ruth Goldberg, Software Engineer, Security Reviewer ltd,
London UK
Figures and some phrases are taken from:
https://www.payload-security.com/technology/hybrid-analysis
https://www.joesecurity.org/joe-sandbox-technology#hca
REFERENCES
[1] Moses T., Syman D., Barzanti M. Static Analysis: A Dynamic Syntax
Tree Implementation. London, December 2001.
[2] Moses T., Syman D. Static Analysis of Applications written in
modern languages. Moldova, 1999. Translated from Russian and
published by ResearchGate, 2008.
[3] Jonas Maebe and Koen De Bosschere. Instrumenting self-modifying
code. In Proceedings of the Fifth International Workshop on
Automated and Algorithmic Debugging (AADEBUG2003), Ghent,
Belgium, September 2003.
[4] M. D. Ernst. Static and dynamic analysis: synergy and duality. In
WODA 2003: ICSE Workshop on Dynamic Analysis, pages 24-27, 2003.
[5] Nicholas Nethercote. Dynamic Binary Analysis and Instrumentation.
University of Cambridge. Computer Laboratory. Technical Report
N.606. November 2004.
[6] How does Joe Sandbox work?, 2011.
http://www.joesecurity.org/products.php?index=3. Retrieved May 2011.
[7] GFI Software. Malware Analysis with GFI SandBox (formerly
CWSandbox). http://www.gfi.com/malware-analysis-tool. Retrieved
November 2011.