
Retaining Semantic Information in the Static Analysis of Real-World Software *

Abstract

Static analysis is the analysis of a program through inspection of the source code, usually carried out by an automated tool. One of the greatest challenges posed by real-world applications is that the whole program is rarely available at any point of the analysis process. One reason for this is separate compilation, where the source code of some libraries might not be accessible in order to protect intellectual property. But even if we have a complete view of the source code including the underlying operating system, we might still have trouble fitting the representation of the entire software into memory. Thus, industrial tools need to deal with uncertainty due to the lack of information. In my dissertation I discuss state-of-the-art methods to deal with this uncertainty and attempt to improve upon each method to retain information that would otherwise be unavailable. I also propose guidelines on which methods to choose to solve certain problems.
Gábor Horváth
Department of Programming Languages and Compilers
Eötvös Loránd University
Hungary
xazax@caesar.elte.hu
CCS Concepts: Software and its engineering → Automated static analysis.
Keywords: static analysis, defect detection, C++
ACM Reference Format:
Gábor Horváth. 2019. Retaining Semantic Information in the Static Analysis of Real-World Software. In Proceedings of the 2019 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity (SPLASH Companion ’19), October 20–25, 2019, Athens, Greece. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3359061.3361075
The publication of this paper is supported by the European Union, co-financed by the European Social Fund (EFOP 3.6.2-16-2017-00013).
SPLASH Companion ’19, October 20–25, 2019, Athens, Greece
©2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6992-3/19/10. . . $15.00
hps://doi.org/10.1145/3359061.3361075
1 Introduction
There is a wide variety of static analysis methods, but all of them are affected by the uncertainty of unavailable information in the analyzed source code. This uncertainty can often result in both false positive reports and missed bugs. In my dissertation, I survey four methods to gather information about the program that is otherwise not present in the source text. The method of the dissertation is to implement new checks using state-of-the-art methods, identify the main pain points by evaluating the checks on real-world code, attempt to fix the problems, and reevaluate the proposed solution on the projects by inspecting the results manually.
2 Hardcoded Information
One of the most prevalent methods of writing static analysis software is to hardcode information about the source code into the analysis tool: for example, information about frequently used APIs such as malloc and free to detect memory errors. While this is easy to implement, it is almost never a complete solution. A user might have custom allocation functions, or calls might be wrapped in other APIs that are not visible to the analysis tool.
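The limitation can be illustrated with a minimal sketch. The wrapper names `pool_alloc` and `pool_release` are hypothetical and chosen only for this example. If their bodies are not visible to the tool, a check that knows only the names `malloc` and `free` has no way to tell that they allocate and release memory.

```cpp
#include <cstdlib>

// Hypothetical project-specific wrappers around malloc/free.
// If their bodies live in another translation unit or a binary library,
// a tool that only knows the names `malloc` and `free` cannot tell
// that these functions allocate and release memory.
void *pool_alloc(std::size_t n) { return std::malloc(n); }
void pool_release(void *p) { std::free(p); }

// Returns true if the buffer could be allocated and was released again.
bool use_buffer(std::size_t n) {
    void *buf = pool_alloc(n);   // allocation hidden behind a wrapper
    if (buf == nullptr)
        return false;
    pool_release(buf);           // matching release, also hidden
    return true;
}
```

Dropping the call to `pool_release` would introduce a leak that a purely name-based check would likely miss.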
Professional C++ programs cannot avoid using the C++ Standard Template Library (STL) because it increases the quality, maintainability, and efficiency of the code. However, the use of the STL does not guarantee error-free code. On the contrary, incorrect application of the library may introduce new types of problems.
As the semantics of the STL are well-defined in the C++ standard, we can encode this knowledge in the implementation of a static analysis tool. I implemented a static analysis tool that matched the AST and was able to detect more than a dozen code smells [1].
One way to relax the concept of hardcoding semantics is to do duck typing. For example, I implemented a check that rewrites container emptiness checks of the form cont.size() == 0 and similar to use the empty method. This check does not have a hardcoded list of containers; instead, it looks at the API of a type and classifies it accordingly. This enables the check to work with non-standard container types.
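The idea of classifying a type by its API can be sketched at the language level with a C++17 type trait. This is only an analogy: the check described above works on the AST inside the analysis tool, and the names `looks_like_container`, `RingBuffer`, and `SizedOnly` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Detects, by shape alone, whether T offers both size() and empty().
// No list of known container names is consulted.
template <typename T, typename = void>
struct looks_like_container : std::false_type {};

template <typename T>
struct looks_like_container<
    T, std::void_t<decltype(std::declval<const T &>().size()),
                   decltype(std::declval<const T &>().empty())>>
    : std::true_type {};

// Hypothetical non-standard container that the trait still accepts.
struct RingBuffer {
    std::size_t count = 0;
    std::size_t size() const { return count; }
    bool empty() const { return count == 0; }
};

// A type with size() but no empty(): the rewrite would not compile here.
struct SizedOnly {
    std::size_t size() const { return 0; }
};

static_assert(looks_like_container<std::vector<int>>::value);
static_assert(looks_like_container<std::string>::value);
static_assert(looks_like_container<RingBuffer>::value);
static_assert(!looks_like_container<SizedOnly>::value);
static_assert(!looks_like_container<int>::value);
```

A shape-based test of this kind accepts `RingBuffer` without any prior knowledge of it, while rejecting types for which the rewritten expression would be invalid.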
More sophisticated static analysis methods such as symbolic execution can also profit greatly from such encoded knowledge. The Clang Static Analyzer has modeling checks that encode the behavior of some standard APIs and update analysis state accordingly.
3 Annotations
Annotations are great for users to provide the analyzer with additional semantic information that is otherwise not available in the source code. Annotations can have different forms such as dedicated attributes or functions and types that have a special meaning in the analyzer.
This method also imposes an additional maintenance burden on the user, as annotations need to be maintained just like regular code. Moreover, erroneous annotations can deceive the tool and result in spurious diagnostics or missed bugs.
About 70% of the security patches issued by Microsoft fix memory errors [4]. Herb Sutter, the chairman of the C++ Standards Committee, suggested a flow-based points-to analysis to catch lifetime-related errors [5]. The novelty of the analysis is to distinguish Owner, Pointer, Value, and Aggregate categories. Moreover, instead of relying on the user to annotate all types, it has well-defined rules on how to categorize a user-defined type automatically.
I participated in the design and implementation of the analysis. While automatic classification of types works well in most cases, users occasionally end up using annotations to fix misclassifications. We observed that it is advantageous for analysis tools to decouple the inference of annotations from the analysis itself. Having a separate module with the sole purpose of adding annotations is a better design: the rest of the analysis can assume that the annotations are always present.
The analysis also needs to know how function calls alter the abstract domain. While it has some default heuristics, the user can override them with pre- and postconditions. I designed an embedded domain-specific language (EDSL) to describe these analysis-related pre- and postconditions. The novelty of this solution is that these EDSL expressions can be used as contracts, so the user gets a certain amount of runtime verification of the attributes for free. Unfortunately, many of the properties (e.g., two iterators pointing into the same container) are impossible to check at run time using standard C++. However, some other conditions such as pointer equality and nullness are quite easy to check.
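The kind of condition that is cheap to verify at run time can be sketched as follows. The helpers `expects` and `ensures` and the function `max_ptr` are hypothetical and are not the EDSL described above; they only illustrate how nullness and pointer-equality conditions can double as executable contracts.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical contract helpers, not the EDSL from the dissertation.
inline void expects(bool condition) {
    if (!condition) throw std::logic_error("precondition violated");
}
inline void ensures(bool condition) {
    if (!condition) throw std::logic_error("postcondition violated");
}

// Returns a pointer to the larger of the two pointed-to ints.
// Precondition: neither pointer is null.
// Postcondition: the result equals one of the arguments, so it points
// to the same object as one of the inputs.
const int *max_ptr(const int *a, const int *b) {
    expects(a != nullptr && b != nullptr);
    const int *result = (*a < *b) ? b : a;
    ensures(result == a || result == b);
    return result;
}
```

A static analysis could read the same two conditions to learn that the returned pointer aliases one of the arguments, while a test run would catch an annotation that does not match the implementation.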
As one of the most important drawbacks of annotations
is the increased maintenance cost, it is important to have
tools that automate related tasks: check the correctness of
annotations at compile time, generate instructions to do
the same at run time, and help bootstrap the analysis by
automatically synthesizing annotations for the code.
4 Summaries
Some of the errors that can be detected by static analysis tools span across multiple subprograms. For example, the same object might be deleted in two distinct procedures. To find such errors, some tools implement context-sensitive interprocedural analysis. One of the challenges with this approach is that the implementation of a function might be separately compiled; for example, it might reside in a binary distribution of a proprietary library.
A common approach to mitigate this problem is to assign a summary to some of the functions. Each time the implementation is not available, the summary is used to analyze the effect of the function call. Such a summary is in fact an approximation of the function implementation, suitable to be used to model some behavior of the called functions in a given context. Summaries can be synthesized from the source code during a multipass analysis or written manually by a developer.
Most summaries are specific to an analysis and reflect the semantics of the described function in terms of the analysis state. The problem with this approach is that it does not scale, and each new analysis author needs to implement their own summary representation.
An alternative way is to represent the summary of a function in terms of the semantics of the language rather than the analysis state. One natural representation of the semantics of the language is the source text. Textual summaries, however, have some challenges in C-family languages.
    S (x);
    T * y;
    func((R) * z);
Listing 1. Ambiguous C grammar.
The grammar of C is ambiguous without contextual information. There are some examples of ambiguity in Listing 1. If S is a type, then the first line is a definition of a variable; otherwise, if S is a function, then it is a function call. The second line can be interpreted either as a definition of a pointer or as a multiplication, depending on the type context. The third line can be either a multiplication or a dereference followed by a C type cast. Moreover, it is possible to have the same name as both a variable name and a type name in the same scope. To get type information, headers included in a file need to be parsed before the file can be parsed.
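The dependence on the type context can be made concrete with a small sketch in which the same token sequence `T * y` appears in two scopes. The function names are hypothetical; only the meaning of `T` differs between the two bodies.

```cpp
// Scope 1: T names a type, so `T * y` starts a declaration of a pointer.
int declaration_reading() {
    using T = int;
    T * y = nullptr;      // declares y with type int*
    return y == nullptr ? 1 : 0;
}

// Scope 2: T and y name variables, so `T * y` is a multiplication.
int expression_reading() {
    int T = 6, y = 7;
    return T * y;         // arithmetic expression
}
```

A parser that sees only the text `T * y` cannot choose between the two readings without first knowing what `T` denotes at that point.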
I developed a prototype [2] that can parse summaries quickly using the type context available at the call site. This circumvents the problem of reparsing all headers for the sake of loading a function summary. Unfortunately, the type context at the call site is not guaranteed to contain every symbol that is available in the implementation. Thus, the original implementation cannot be used as a summary; a separate one needs to be written. Such a summary can only refer to built-in types, the types of the parameters and the return value, and types required to be in the type context by these types. The implementation is accepted into mainline Clang. This is work in progress and I need help to come up with the best way to evaluate this method.
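A hand-written source-level summary might look like the sketch below. The library function `checked_div` and its assumed behavior are hypothetical; the point is only that the summary is ordinary source code that refers to nothing beyond built-in types and the types in the signature.

```cpp
// Hypothetical library function whose real body is unavailable:
//   int checked_div(int numerator, int denominator);
// The summary below models only the behavior relevant to callers.
// It is assumed, for illustration, that the library returns 0 when
// the denominator is 0 instead of dividing.
int checked_div_summary(int numerator, int denominator) {
    if (denominator == 0)
        return 0;
    return numerator / denominator;
}
```

Because the body mentions only `int`, it can be parsed with whatever type context is available at a call site of `checked_div`.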
5 Cross Translation Unit Analysis
The scope of the analysis has a big impact on the precision. The example in Listing 2 shows that it does not matter whether we know the initial value of x, because we have no knowledge about the body of f. The implementation of f is free to change the value of x (as it is passed by address) and we no longer have information about its value. Moreover, we cannot find any errors in the body of the function when it is called in this context.
    // A.cpp
    int f(int *);
    void g() {
      int x = 41;
      f(&x);
      // No knowledge about the value of x.
      if (x != 42)
        int *p = new int; // Leak, false positive.
    }

    // B.cpp
    int f(int *y) {
      ++*y;
      return 5 / (*y - 42);
      // Division by zero, false negative.
    }
Listing 2. Example of a false positive and a false negative
If we cannot reason about functions in separate translation units (abbreviated TUs), we are unable to find the errors that span across TUs. In Listing 2, there is a division by zero error that we cannot find unless we can get information from both A.cpp and B.cpp together. Without context, the division by zero error in B.cpp could be reported by single-TU analysis as well. However, that would lead to a false positive if f is not actually called with a parameter that satisfies *y == 41.
We implemented an extension to the Clang Static Analyzer [3] to carry out cross translation unit analysis. The implementation supports separately compiled projects, as their indexes can be combined. We were also able to update the analysis engine in a way that existing checks could continue to operate without any changes to their source code. Despite the analyzer being more than 10 years old, it did not support cross translation unit (CTU) analysis until our recent addition. One reason for this is the implementation challenge of supporting the C compilation model; the other is the fact that the analyzer uses symbolic execution. Symbolic execution tries to simulate all the execution paths of the program, which is an intractable problem. The analyzer therefore employs cut heuristics about where to stop the analysis, to keep execution time reasonable while still being able to discover interesting bugs. The fear was that with additional source code the analysis time would explode. It turned out that, after some small tuning of the cut heuristics, we were able to maintain an analysis time that is proportional to the number of bugs found in the project. The analysis time did not explode.
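Why a cut heuristic is needed can be seen in a toy model. The sketch below is not the Clang Static Analyzer's actual heuristic; it assumes a hypothetical program consisting of a chain of two-way branches and a single node budget, and all names are illustrative.

```cpp
#include <cstddef>

// Result of a bounded exploration of a toy program with `branches`
// consecutive two-way branches (2^branches complete paths).
struct ExplorationResult {
    std::size_t visited_nodes = 0;   // states simulated so far
    std::size_t complete_paths = 0;  // paths followed to the end
    bool budget_exhausted = false;   // true if the cut heuristic fired
};

// Depth-first walk over the branch tree. `max_nodes` plays the role of a
// cut heuristic: once that many states were visited, exploration stops.
void explore(unsigned remaining_branches, std::size_t max_nodes,
             ExplorationResult &result) {
    if (result.visited_nodes >= max_nodes) {
        result.budget_exhausted = true;
        return;
    }
    ++result.visited_nodes;
    if (remaining_branches == 0) {
        ++result.complete_paths;
        return;
    }
    explore(remaining_branches - 1, max_nodes, result);  // condition true
    explore(remaining_branches - 1, max_nodes, result);  // condition false
}

ExplorationResult explore_program(unsigned branches, std::size_t max_nodes) {
    ExplorationResult result;
    explore(branches, max_nodes, result);
    return result;
}
```

With three branches and a generous budget, all 8 paths are covered after visiting 15 states. With 20 branches and a budget of 100 states, the exploration stops early, trading completeness for bounded running time.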
We also explored the false positive ratio of the cross translation unit analysis and found a slight improvement. Since inline substitution retains the most information possible, this work is a great baseline for any more efficient solution to measure how much useful information was preserved.
6 Conclusions
Most static analysis tools we checked (Clang Static Analyzer, Clang Tidy, Infer, CppCheck) employ a combination of the techniques I described. I was able to devise incremental improvements for most of these methods by extending existing, widely used C++ tools. One of the main tasks left is to take a step back and try to derive a global conclusion and a good way of presentation from all this work. The rest of this section summarizes my observations so far.
For frequently used APIs it is useful to hardcode semantics in the tools, but one needs to be aware of the additional maintenance burden for APIs in motion. Sometimes it can help to use a method similar to duck typing to encode semantics.
A move towards annotations gives users additional expressiveness, but it comes with maintenance costs on the user’s side. To mitigate those costs, one can generate runtime checks to find erroneous annotations and tools to (partially) synthesize annotations.
Summary-based analysis can be a big investment, which is why it is worth studying how frameworks can help reduce the costs for tool developers. I proposed summaries that are represented in terms of the language rather than the analysis state, and implemented a prototype for C.
CTU analysis using inline substitution has great power
and works without modifying existing checks, but it has
serious runtime costs, thus carefully chosen cut heuristics
need to be employed.
References
[1] Gábor Horváth and Norbert Pataki. 2015. Clang matchers for verified usage of the C++ Standard Template Library. Annales Mathematicae et Informaticae 44 (2015), 99–109.
[2] Gábor Horváth and Norbert Pataki. 2016. Source Language Representation of Function Summaries in Static Analysis. In Proceedings of the 11th Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems (ICOOOLPS ’16). ACM, New York, NY, USA, Article 6, 9 pages. https://doi.org/10.1145/3012408.3012414
[3] Gábor Horváth, Péter Szécsi, Zoltán Gera, Dániel Krupp, and Norbert Pataki. 2018. Challenges of Implementing Cross Translation Unit Analysis in the Clang Static Analyzer. In 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 171–176.
[4] Matt Miller. 2018. Trends, Challenges, and Strategic Shifts in the Software Vulnerability Mitigation Landscape. https://www.zdnet.com/article/microsoft-70-percent-of-all-security-bugs-are-memory-safety-issues/ (last accessed: 28-02-2019).
[5] Herb Sutter. 2018. Lifetime safety: Preventing common dangling. Technical Report. Microsoft Corporation.