Static Analysis: a Dynamic Syntax Tree implementation
Prof. Tim Moses
Department of Software Engineering, BitBrainery University, London UK
David Syman, CTO, Security Review Консультант, Chișinău - MD
Marco Barzanti, Security Auditor, LCI Technology Group, Den Haag - NL
[2001 - 1st of December]
Abstract - In our earlier research [1] on the Static Analysis of applications written in modern languages, we discussed
the lack of accuracy of analysis algorithms based on Abstract Syntax Trees and Concrete Syntax Trees (CST, also known as
Parse Trees). This paper describes an implementation of the Dynamic Syntax Tree method for enhancing the Static Analysis process.
Keywords - dynamic analysis, static code analysis, abstract syntax tree, concrete syntax tree, dynamic syntax tree, semantic
I INTRODUCTION
In dynamic languages, there are no real ‘static declarations’.
In general, different Variables, Classes or Class Members may
exist every time the same source code runs. No declaration
has a definite meaning until runtime, when the declaration is
actually used.
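As a minimal illustration of this point, the following Python
sketch shows two objects built from the same class text ending up
with different members at runtime. The Account class, the
load_plugin function and the enable_audit switch are hypothetical
names chosen only for this example.

```python
class Account:
    """A class whose members are only partly fixed in the source text."""

    def __init__(self, owner):
        self.owner = owner  # the only member visible as a plain declaration


def load_plugin(account, enable_audit):
    # Members are attached at runtime, depending on data that a purely
    # static scan of the class declaration cannot see.
    if enable_audit:
        account.audit_log = []        # implicit member declaration
        setattr(account, "flags", 0)  # member whose name is a string value
    return account


a = load_plugin(Account("alice"), enable_audit=True)
b = load_plugin(Account("bob"), enable_audit=False)

members_a = sorted(vars(a))  # members that exist on the first object
members_b = sorted(vars(b))  # members that exist on the second object
```

The two member lists differ although both objects come from the
same declaration, which is the situation a declaration-only scan
cannot describe.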
The techniques used for parsing static languages cannot
work properly for dynamic languages, because the complete
set of information needed for analysis is not available. The concept
of Domain-Specific Languages (DSLs) has been around for
years, but has recently experienced an uptick in popularity.
This is due in large part to the success of Java, which exploits
the dynamic programming language and its
meta-programming capabilities to encourage the
development of programs that closely mirror the domains
they model. This style of DSL is sometimes called an Internal
DSL because it uses available constructs from the
programming language itself to create code that reads in a
manner similar to the underlying domain.
Programming languages obviously differ in their ability to
express domain concepts and relations using only built-in
programming constructs. When a given application requires
more power and flexibility than a programming language
alone can offer, developers can create External DSLs using
tools such as ANTLR (which stands for Another Tool for
Language Recognition) or the more modern approaches
described below. Such DSLs are intended to be read sensibly by
both subject matter experts and programmers through the use of
variable names chosen from the domain's vocabulary, by
minimizing constructs that read too "tech-y," and by organizing
syntactic elements in a more "natural" order. The proposed
implementation also uses pluralization rules to further
enhance meaning and readability.
The main contribution of this paper is to present a new
implementation called Dynamic Syntax Tree. Unlike the
analysis of static languages, the analysis of dynamic languages
has to estimate the types and possible declarations from the
code fragment semantics [2], as well as from all configuration
files and binaries. This implementation also takes into account
non-static and non-typed implicit and explicit declarations, so
that the resulting information covers most of the readable
dynamic code fragments.
II DYNAMIC SYNTAX TREE
Mapping the constructs of a dynamic language into a
Dynamic Syntax Tree (DST) is a kind of semantic analysis.
Information implied by the syntax is analyzed and the
results are inserted back into the same tree, in the form
of complete static information. Pre-processing the source
code creates a specialized Syntax Tree for each Class
found. The same approach can be applied to traditional
programming languages such as COBOL, where Classes can be
represented by Programs, Methods by CALL/PERFORM statements,
Parameters by USING clauses, etc.
In the proposed implementation, this pre-processing phase
will generate:
- A separate Object Dictionary for each Class. All Class
  objects are mapped to a 2-byte Dict-Id, allowing a
  maximum of 65535 objects per Dictionary. Instead of
  storing the object name, a 4-byte and a 1-byte pointer
  into the source are used for retrieving the object’s name
  (source code line and the name’s starting column). Each
  entry also holds the Parent Dict-Id (for child Classes) or
  the Type plus local/global attribute (for the other
  objects), and a bitmask Attribute field (abstract,
  serializable, public, private, protected, static and final).
- A Dynamic Syntax Tree for each NameSpace/Package,
  storing the NameSpace/Package Name and the File Name
  (including web pages and configuration files) in
  compressed format. In addition, the Dict-Ids of Classes,
  Inner Classes, External Classes, Methods, Parameters,
  Branches and Variables are stored in the tree. For
  Methods and Branches, a hash code is stored alongside
  the Dict-Id for Code Duplication detection purposes. For
  Branches, the conditional statement as a single line and
  the nesting level (for calculating Quality metrics) are
  also stored. Fields are compressed using Huffman
  Coding [3].
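The Object Dictionary entry and the duplication hash described in
the list can be sketched in Python as follows. This is only an
illustration under stated assumptions: the field order, the
little-endian layout, the 2-byte width of the Parent Dict-Id, the
1-byte width of the Attribute field, the bit positions and the use
of SHA-1 over whitespace-normalized text are all choices made for
this example and are not taken from the implementation described
in this paper.

```python
import hashlib
import struct
from enum import IntFlag


class Attr(IntFlag):
    """Attribute bitmask; the bit positions are an assumption."""
    ABSTRACT = 1
    SERIALIZABLE = 2
    PUBLIC = 4
    PRIVATE = 8
    PROTECTED = 16
    STATIC = 32
    FINAL = 64


# Assumed layout (little-endian, no padding): 2-byte Dict-Id,
# 4-byte source line, 1-byte starting column, 2-byte Parent Dict-Id,
# 1-byte attribute bitmask.
ENTRY = struct.Struct("<HIBHB")


def pack_entry(dict_id, line, column, parent_id, attrs):
    # struct raises struct.error if a value does not fit its field,
    # e.g. a Dict-Id above 65535.
    return ENTRY.pack(dict_id, line, column, parent_id, int(attrs))


def unpack_entry(raw):
    dict_id, line, column, parent_id, attrs = ENTRY.unpack(raw)
    return dict_id, line, column, parent_id, Attr(attrs)


def resolve_name(source_lines, line, column):
    """Recover the object name from the source pointer (1-based line,
    0-based column) instead of storing the name in the entry."""
    text = source_lines[line - 1]
    end = column
    while end < len(text) and (text[end].isalnum() or text[end] == "_"):
        end += 1
    return text[column:end]


def body_hash(text):
    """Whitespace-insensitive hash of a Method or Branch body
    (hypothetical choice of normalization and hash function)."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


source = [
    "public final class Invoice {",
    "    private static int count;",
]
raw = pack_entry(7, 2, 23, 1, Attr.PRIVATE | Attr.STATIC)
```

With this layout each entry occupies 10 bytes, and the name is
recovered from the source lines only when a report needs it.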
This pre-processing then enables us to work with the
syntax tree of the dynamic source code as if it were static
code, with some limitations that cannot be resolved until
runtime in dynamic languages. For that reason we also provide
binary analysis. Binaries are sandboxed to collect dynamic
information at runtime, using a very fast algorithm that will
be discussed in a future paper. Combining source code and
binary analysis addresses the above-mentioned limitations by
updating the Dynamic Syntax Tree with additional
information. There will be multiple Object Dictionaries and
Dynamic Syntax Trees, optimized for low resource consumption
and higher performance.
II.A Enhancing the Tree
In dynamic languages the resulting tree node information is
gathered from various source elements in the code. The
whole source file, in its parsed DST form, must be processed
and every expression or construct has to be analyzed. As a
result, the analysis extends existing node information or adds
new declaration nodes (implicit declarations). In static
languages it is typically sufficient to scan declaration
constructs only (class declarations, variable declarations,
etc.); single expressions are not needed for collecting the
available symbols and their descriptions.
In addition to statically defined symbols, other tree nodes
are added using the semantic analysis of the DST. Some
existing information can also be modified. Particular
expressions are scanned to find, e.g., implicitly declared
variables and dynamically added object members. Possible
variable values, estimated types and behavior influenced by
configuration are also discovered. The DST built so far is
used for that. The process of mapping an Assignment Expression
containing a Variable Usage (including variables declared in
web pages or JavaScript) into the static declaration of the
used variable, together with the Configuration file
influencing the Behavior, is depicted in the following Figure:
The mapping collects known information from the
expression and stores it into the static representation of the
source code (DST).
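The mapping of assignment expressions to declaration records can
be sketched as follows. The sketch uses Python's standard ast
module as a stand-in for the DST, so it illustrates the idea
rather than the implementation described here; the function names
collect_implicit_declarations and estimate_type, the record layout
and the sample source are assumptions made for this example.

```python
import ast


def estimate_type(expr):
    """Estimate the type of an assigned value from its syntax alone."""
    if isinstance(expr, ast.Constant):
        return type(expr.value).__name__
    if isinstance(expr, (ast.List, ast.ListComp)):
        return "list"
    if isinstance(expr, (ast.Dict, ast.DictComp)):
        return "dict"
    if isinstance(expr, ast.Call) and isinstance(expr.func, ast.Name):
        return expr.func.id  # assume a constructor-style call
    return "unknown"


def collect_implicit_declarations(source):
    """Scan every assignment and record variables and added members.

    Returns {variable: {"type": type_name, "members": {member: type_name}}}.
    """
    declarations = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value_type = estimate_type(node.value)
        for target in node.targets:
            if isinstance(target, ast.Name):
                # "A = C": implicit declaration of variable A
                decl = declarations.setdefault(
                    target.id, {"type": None, "members": {}})
                decl["type"] = value_type
            elif (isinstance(target, ast.Attribute)
                  and isinstance(target.value, ast.Name)):
                # "A.B = C": B becomes a member of A with the type of C
                decl = declarations.setdefault(
                    target.value.id, {"type": None, "members": {}})
                decl["members"][target.attr] = value_type
    return declarations


sample = "a = Widget()\na.b = 42\nc = 'x'\n"
result = collect_implicit_declarations(sample)
```

Every expression is visited, not only declaration constructs,
which mirrors the difference from static languages noted above.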
III OPTIMIZING
The mapping of a single Dynamic Syntax Tree is
performed every time the corresponding source code
changes [4][5][7], hence it is worthwhile to implement some
optimizations. In addition, some trees may contain symbols
from large static language libraries, e.g. .NET
assemblies, which typically hold thousands of symbols.
Initializing all of them may be too slow.
Unlike AST and CST (Parse) trees, in the case of huge
Classes having more than 65535 objects, the DST Object
Dictionary structure (68,083 nodes) is paged into several
small XML files of about 575KB each. The same is done with
the Dynamic Syntax Tree itself: only 4096 Classes at a time
are processed, at most 135KB each. RAM usage never exceeds
700MB, which means that a Static Analysis can be performed
on a low-end Windows XP notebook with 1GB of RAM and a
single-core processor, and up to 5 simultaneous static
analyses of different applications can be run on a machine
with only 4GB of RAM and a dual-core processor. Performance
therefore scales with the hardware architecture, with the
guarantee that at least one complete static analysis can be
performed on a notebook, scaling up to multi-core machines.
An important optimization with respect to AST and CST is the
number of tree nodes. GCC 2.52 from the SPEC 95 benchmark
suite comprises 604,000 AST nodes versus only 127,067 nodes
paged in 2 DSTs, and Word 1.1A (1990) comprises 607,700 AST
nodes versus only 204,249 nodes paged in 3 DSTs. For
reference, GCC 2.52 is about 140,000 lines of code and Word
1.1A is about 322,000. Static Analysis processing times using
DST were half an hour for GCC and one and a half hours for
Word 1.1A. The Word 1.1A source was downloaded from the
Computer History Museum site.
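The paging of an oversized Object Dictionary can be sketched as
follows. Only the 65535-object limit and the use of XML pages come
from the text; the element and attribute names, the choice to
restart Dict-Ids at 1 on each page and the in-memory output are
assumptions made for this example.

```python
import xml.etree.ElementTree as ET

MAX_OBJECTS_PER_DICTIONARY = 65535  # limit of a 2-byte Dict-Id


def page_dictionary(entries, page_size=MAX_OBJECTS_PER_DICTIONARY):
    """Yield one XML document (as bytes) per page of entries.

    Each entry is a (line, column) source pointer; the XML schema
    used here is hypothetical.
    """
    for page_no, start in enumerate(range(0, len(entries), page_size)):
        root = ET.Element("dictionary", page=str(page_no))
        page = entries[start:start + page_size]
        for dict_id, (line, column) in enumerate(page, start=1):
            ET.SubElement(root, "object", id=str(dict_id),
                          line=str(line), col=str(column))
        yield ET.tostring(root, encoding="utf-8")


# A dictionary of 68,083 objects does not fit in one page.
entries = [(i + 1, 0) for i in range(68083)]
pages = list(page_dictionary(entries))
page_sizes = [len(ET.fromstring(p)) for p in pages]
```

Because the function is a generator, only one page needs to be
held in memory at a time when the pages are written out one by one.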
The initialization of some additional information can be
postponed until it is requested, because not all of the source
elements have to be analyzed and mapped immediately.
III.A Lazy mapping
The process of mapping does not need to be done
immediately. In fact, a single DST or even a single tree node
can remember its state and its origin source data element;
the mapping of child elements can then be done on demand,
not before the content of the tree node is used, which
facilitates multi-threading techniques.
Only the top-level tree nodes are initialized immediately and
the rest of the tree can be created later. In dynamic
languages, however, the whole tree must be created before
semantic analysis can be performed, because the results of
semantic analysis can affect other DSTs.
In the case of .NET assemblies, only types are added into the
tree; their members need not be scanned immediately.
The declaration node describing the type references the
source System.Type object instance. When the type
members are requested, the initialization of the type
declaration node is finished first.
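A minimal sketch of such a lazily mapped node is given below. The
LazyNode class, the map_members callback and the dictionary that
stands in for a type description are hypothetical names introduced
only for this illustration.

```python
class LazyNode:
    """Tree node that remembers its state and its origin source element,
    and maps its children only when they are first requested."""

    def __init__(self, name, source_element, map_children):
        self.name = name
        self._source = source_element      # origin source data element
        self._map_children = map_children  # callable: source -> child nodes
        self._children = None              # None means "not mapped yet"

    @property
    def mapped(self):
        return self._children is not None

    @property
    def children(self):
        if self._children is None:
            # First access: finish the node initialization now.
            self._children = list(self._map_children(self._source))
        return self._children


calls = []  # records how many times the mapping actually runs


def map_members(type_info):
    calls.append(type_info["name"])
    return [LazyNode(m, None, lambda _: []) for m in type_info["members"]]


string_type = {"name": "System.String", "members": ["Length", "Substring"]}
node = LazyNode("System.String", string_type, map_members)

mapped_before = node.mapped
member_names = [child.name for child in node.children]
member_names_again = [child.name for child in node.children]
```

The members are mapped once, on first use. If nodes are shared
between threads, the first access would additionally need a lock.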
Dynamic languages are typically able to modify local or
global declarations everywhere. Dynamic members of global
variables can be added, e.g., within a Class member function
call. Global variables can also be assigned there. Therefore
the Class content, its members and member function bodies
have to be scanned and processed without postponing. The
symbol information will not be complete until the
content of all the source elements has been analyzed.
[Figure: mapping of an Assignment Expression into a static declaration. Node labels: A, B, C; "Is Member Of"; "Assignment"; [declaration] A; [member] B; [type] type of C; [value] value of C; Configuration / Web page; Behavior.]
III.B Multi-threading
Some tasks within the Dynamic Syntax Tree creation
process may be performed independently. They can run on
separate threads, so that the process is parallelized. In
particular, different tree nodes can be created separately, as
products of mapping static source elements.
The following step, the semantic analysis of the dynamic
language code, needs the complete Dynamic Syntax Tree.
Various tree elements are modified and additional
information is added, which is then used for the rest of the
analysis.
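The two phases can be sketched with Python's standard
concurrent.futures module. The map_class and build_trees
functions, the tuple format of a class source and the one-level
inheritance lookup used as a stand-in for semantic analysis are
assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def map_class(class_source):
    """Phase 1: map the static elements of one class into a small tree."""
    name, parent, members = class_source
    return {"class": name, "parent": parent, "members": set(members)}


def build_trees(class_sources, workers=4):
    # Phase 1: the classes are independent, so each one is mapped in
    # its own task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        trees = list(pool.map(map_class, class_sources))

    # Phase 2: this step needs all the trees, so it starts only after
    # the pool has finished. A one-level inheritance lookup stands in
    # for the semantic analysis.
    index = {tree["class"]: tree for tree in trees}
    for tree in trees:
        parent = index.get(tree["parent"])
        tree["inherited"] = set(parent["members"]) if parent else set()
    return index


sources = [("Base", None, ["id"]), ("Child", "Base", ["name"])]
index = build_trees(sources)
```

In CPython the global interpreter lock limits the speedup of
CPU-bound work on threads; ProcessPoolExecutor offers the same
interface when separate processes are preferable.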
III.C Analysis Rules
The DST is navigated by applying a large number of rules
specialized per programming language. The DST is designed
for processing a number of rules simultaneously, and the
rules can be customized by the user. During the reporting
phase, only vulnerable source code is pointed out, and all
object names are retrieved to complete the technical and
compliance reports.
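A minimal sketch of such a rule pass is shown below: the tree is
walked once and every rule active for the language is applied to
each node. The Node and Rule classes, the rule identifiers and the
sample rule are hypothetical and serve only to illustrate the
idea.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    kind: str
    text: str
    line: int
    children: List["Node"] = field(default_factory=list)


@dataclass
class Rule:
    rule_id: str
    language: str
    matches: Callable[[Node], bool]


def walk(node):
    """Yield the node and all its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def apply_rules(root, rules, language):
    """Apply all rules of one language during a single tree traversal."""
    active = [rule for rule in rules if rule.language == language]
    findings = []
    for node in walk(root):
        for rule in active:
            if rule.matches(node):
                findings.append((rule.rule_id, node.line, node.text))
    return findings


def is_eval_call(node):
    return node.kind == "call" and node.text.startswith("eval(")


rules = [
    Rule("SEC-EVAL", "php", is_eval_call),
    Rule("SEC-EVAL-JS", "javascript", is_eval_call),
]
tree = Node("method", "run", 1, [
    Node("call", "eval($input)", 2),
    Node("call", "strlen($s)", 3),
])
findings = apply_rules(tree, rules, "php")
```

User customization then amounts to adding or removing Rule
objects, and only the matching nodes reach the report.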
IV CONCLUSION AND FUTURE WORK
This paper described an implementation of the
automatic analysis of dynamic language source code
using Dynamic Syntax Trees. It is designed to work
independently and to allow a combination of classic
static language syntax techniques and dynamic
sandboxing analysis. For testing the implementation, we
reused the Security Review Консультант traditional Static
Analysis technology designed for McCabe® Security and
Quality Add-ons (since the year 2000). Dynamic code is
processed by mapping dynamic constructs, and then the usual
static techniques for vulnerability detection are
used in combination with dynamic sandboxing. The semantic
analysis works for static languages too. It is able to
gather useful information about the source code, such as
possible values of variables or possible relations between
objects. In the case of dynamic languages we are able to
estimate the possible types of objects. McCabe® Security and
Quality Add-ons, implemented with the new Dynamic Syntax
Tree, offer the following advantages:
- Support for a large number of programming languages,
  including C, C++, Java, C#, Objective-C, VB, PHP,
  JavaScript, ActionScript, VBScript, ColdFusion, Perl,
  COBOL, ABAP, SQL, HTML, JSP, ASP, ASP.NET.
- More accurate results, due to the combination of static
  and sandboxing analysis, with very rare False Positives.
- Security [12], Quality [14], Dead Code [13], Code
  Duplication [6], Best Practices [15] and Bug Detection
  issues are all detected at the same time. Compliance
  Standards, such as SEI-CERT [9] and PCI-DSS [11], are used
  for classifying the security threats and the related risk
  analysis. Quality Metrics such as McCabe Cyclomatic and
  Essential Complexity [8], SEI Maintainability Index,
  Halstead Scientific metrics, OOP Metrics, CK Metrics, MOOD,
  Computed metrics [9] and Anti-Patterns [10] discovery
  are available.
- Faster performance thanks to the simultaneous use of
  different DSTs in a multi-threaded environment, allowing a
  number of different analyses at the same time. At present,
  we reach 280,000+ Lines of Code per hour.
- Lower CPU and resource consumption: handling a
  number of small DSTs is better than handling a single,
  huge AST or SG. Our solution needs at most 700MB
  of RAM per simultaneous analysis and does not exceed
  40% of CPU usage. It can run on a single low-end
  notebook.
IV.A FUTURE WORK
The implemented Dynamic Syntax Tree will be used for
re-engineering some products and, after some years of stable
Static Analysis experience, will be compared to other AST-
and CST-based solutions. A separate paper will be made
available at that time.
V ACKNOWLEDGMENTS
This work was kindly supported by:
Prof. Elina Petrovna, Department of Software Engineering,
BitBrainery University, London - UK
REFERENCES
[1] Moses, T.; Syman, D.: Static Analysis of Applications written in
modern languages. Moldova, 1999. Translated from Russian and
published by ResearchGate, 2008
[2] Denotational Semantics for Asynchronous Concurrent Languages.
Sven-Olof Nyström. Computing Science Department, Uppsala
University. Sweden, 1996
[3] D.A. Huffman, "A method for the construction of
minimum-redundancy codes", Proceedings of the I.R.E., September 1952
[4] Plump, D. (1999). Ehrig, Hartmut; Engels, G.; Rozenberg, Grzegorz,
eds. Handbook of Graph Grammars and Computing by Graph
Transformation: applications, languages and tools 2. World Scientific.
pp. 9-13. ISBN 9789810228842
[5] Barendregt, H. P.; van Eekelen, M. C. J. D.; Glauert, J. R. W.;
Kennaway, J. R.; Plasmeijer, M. J.; Sleep, M. R. (1987). "Term graph
rewriting". PARLE Parallel Architectures and Languages Europe
(Lecture Notes in Computer Science)
[6] Baxter, Ira D.; Yahin, Andrew; Moura, Leonardo; Sant'Anna, Marcelo;
Bier, Lorraine (November 16-19, 1998). Clone Detection Using
Abstract Syntax Trees (PDF). Proceedings of ICSM'98 (Bethesda,
Maryland: IEEE)
[7] Würsch, Michael. Improving Abstract Syntax Tree based Source Code
Change Detection (Diploma thesis)
[8] Arthur H. Watson, Thomas J. McCabe. NIST Special Publication
500-235. Structured Testing: A Testing Methodology Using the Cyclomatic
Complexity Metric. September 1996.
[9] Everald E. Mills. Software Metrics. SEI Curriculum Module
SEI-CM-12-1.1. December 1988.
[10] Koenig, Andrew (March-April 1995). "Patterns and
Antipatterns". Journal of Object-Oriented Programming 8 (1): 46-48;
was later re-printed in the: Rising, Linda (1998)
[11] Payment Card Industry (PCI) Data Security Standard Glossary,
Abbreviations and Acronyms
[12] Robert Seacord. Top 10 Secure Coding Practices. SEI CERT, 2001.
[13] Douglas W. Jones. Dead Code Maintenance. Risks 8.19 (Feb. 1, 1989)
[14] B. Kitchenham and S. Pfleeger, "Software quality: the elusive target",
Software, IEEE, vol. 13, no. 1, 1996.
[15] Charles Petzold. Code: The Hidden Language of Computer Hardware
and Software. Oct 21, 2000.