Book

# Principles of Program Analysis

Authors:

## Abstract

In this book we shall introduce four of the main approaches to program analysis: Data Flow Analysis, Control Flow Analysis, Abstract Interpretation, and Type and Effect Systems. Each of Chapters 2 to 5 deals with one of these approaches at some length and generally treats the more advanced material in later sections. Throughout the book we aim at stressing the many similarities between what may at first glance appear to be very unrelated approaches. To help get this idea across, and to serve as a gentle introduction, this chapter treats all of the approaches at the level of examples. The technical details are worked out, but it may be difficult to apply the techniques to related examples until some of the material of later chapters has been studied.

## Chapters (3)

In this chapter we present the technique of Constraint Based Analysis using a simple functional language, FUN. We begin by presenting an abstract specification of a Control Flow Analysis and then study its theoretical properties: it is correct with respect to a Structural Operational Semantics and it can be used to analyse all programs. This specification of the analysis does not immediately lend itself to an efficient algorithm for computing a solution so we proceed by developing first a syntax directed specification and then a constraint based formulation and finally we show how the constraints can be solved. We conclude by illustrating how the precision of the analysis can be improved by combining it with Data Flow Analysis and by incorporating context information thereby linking up with the development of the previous chapter.
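As a rough illustration of what a 0-CFA style Control Flow Analysis computes, the following Python sketch analyses a tiny labelled lambda term. The tuple encoding of terms, the names `cfa0`, `cache` and `env`, and the naive round-robin iteration are assumptions made for this example; the chapter itself develops a constraint based formulation and a dedicated solving algorithm, which this sketch does not reproduce.

```python
from collections import defaultdict

# Hypothetical encoding of labelled terms:
#   ("var", name, label), ("fn", param, body, label), ("app", fun, arg, label)

def label(term):
    return term[-1]

def subterms(term):
    """Yield the term and all of its subterms."""
    yield term
    if term[0] == "fn":
        yield from subterms(term[2])
    elif term[0] == "app":
        yield from subterms(term[1])
        yield from subterms(term[2])

def cfa0(program):
    """Return (cache, env): the functions each label / variable may evaluate to."""
    cache = defaultdict(set)   # label    -> set of "fn" terms
    env = defaultdict(set)     # variable -> set of "fn" terms
    terms = list(subterms(program))
    changed = True

    def flow(src, dst):
        # Enforce src ⊆ dst, remembering whether anything grew.
        nonlocal changed
        if not src <= dst:
            dst |= src
            changed = True

    while changed:             # iterate until no inclusion is violated
        changed = False
        for t in terms:
            if t[0] == "var":
                flow(env[t[1]], cache[label(t)])
            elif t[0] == "fn":
                flow({t}, cache[label(t)])
            else:
                _, fun, arg, lab = t
                for callee in list(cache[label(fun)]):
                    _, param, body, _ = callee
                    flow(cache[label(arg)], env[param])   # argument flows to parameter
                    flow(cache[label(body)], cache[lab])  # body result flows to call site
    return cache, env

# ((fn x => x^1)^2 (fn y => y^3)^4)^5
FN_X = ("fn", "x", ("var", "x", 1), 2)
FN_Y = ("fn", "y", ("var", "y", 3), 4)
PROGRAM = ("app", FN_X, FN_Y, 5)
CACHE, ENV = cfa0(PROGRAM)
```

For this program the analysis records that only `fn y` can flow to `x` and to the result of the application, and that nothing flows to `y`.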
The purpose of this chapter is to convey some of the essential ideas of Abstract Interpretation. We shall mainly do so in a programming language independent way and thus focus on the design of the property spaces, the functions and computations upon them, and the relationships between them.
In previous chapters we have studied several algorithms for obtaining solutions of program analyses. In this chapter we shall explore further the similarities between the different approaches to program analysis by studying general algorithms for solving equation or inequation systems.
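A minimal sketch of one such general algorithm, a worklist solver for inequation systems of the form `target ⊒ f(dependencies)`, might look as follows in Python. The function name `solve`, the triple encoding of constraints and the example system are assumptions for illustration; termination relies on the functions being monotone over a lattice of finite height.

```python
from collections import deque

def solve(constraints, bottom, join):
    """Least solution of constraints (target, deps, f) meaning target ⊒ f(*deps)."""
    value = {}
    for target, deps, _ in constraints:
        value.setdefault(target, bottom)
        for d in deps:
            value.setdefault(d, bottom)

    # users[v] lists the constraints that must be re-examined when v grows.
    users = {v: [] for v in value}
    for c in constraints:
        for d in c[1]:
            users[d].append(c)

    work = deque(constraints)
    while work:
        target, deps, f = work.popleft()
        new = join(value[target], f(*(value[d] for d in deps)))
        if new != value[target]:
            value[target] = new
            work.extend(users[target])
    return value

# Example system over sets of integers (with a cycle A -> B -> C -> A):
#   A ⊒ {1},  B ⊒ A,  C ⊒ B ∪ {2},  A ⊒ C
CONSTRAINTS = [
    ("A", (), lambda: frozenset({1})),
    ("B", ("A",), lambda a: a),
    ("C", ("B",), lambda b: b | {2}),
    ("A", ("C",), lambda c: c),
]
SOLUTION = solve(CONSTRAINTS, frozenset(), lambda x, y: x | y)
```

Because of the cycle, all three unknowns end up with the same value `{1, 2}`. A constraint is revisited only when one of its inputs changed, which is what distinguishes the worklist strategy from re-evaluating every constraint in each round.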
... It is common that adding semantic descriptions (e.g., via code comments, visualizing code control flow graphs, etc.) may enhance human understanding of programs and ease machine learning. With the help of static code dependency analysis techniques (Nielson, Nielson, and Hankin 1999), for example, Gated Graph Neural Networks (GGNN; Fernandes, Allamanis, and Brockschmidt 2019; Allamanis, Brockschmidt, and Khademi 2018) learn code semantics via graphs where edges are added between the code syntax tree nodes to indicate various kinds of dependencies between the nodes. However, adding such edges requires extra processing of ASTs and may introduce noise for different learning tasks since there is no consensus on which types of edges are needed for which tasks. ...
... Capsule Graph Neural Networks (Zhang and Chen 2019), proposed to classify biological and social network graphs, do not handle tree- or graph-based code syntax. To the best of our knowledge, we are the first to adapt capsule networks for program source code processing to learn code models on syntax trees directly, without the need for extra static program semantic analysis techniques that may be expensive or introduce inaccuracies (Nielson, Nielson, and Hankin 1999). ...
Article
Recently, program learning techniques have been proposed to process source code based on syntactical structures (e.g., abstract syntax trees) and/or semantic information (e.g., dependency graphs). While graphs may be better than trees at capturing code semantics, constructing the graphs from code inputs through the semantic analysis of multiple viewpoints can introduce noise for a specific software engineering task. Compared to graphs, syntax trees are more precisely defined on the grammar and easier to parse; unfortunately, previous tree-based learning techniques have not been able to learn semantic information from trees to achieve better accuracy than graph-based techniques. We have proposed a new learning technique, named TreeCaps, by fusing together capsule networks with tree-based convolutional neural networks to achieve a learning accuracy higher than some existing graph-based techniques while it is based only on trees. TreeCaps introduces novel variable-to-static routing algorithms into the capsule networks to compensate for the loss of previous routing algorithms. Aside from accuracy, we also find that TreeCaps is the most robust against semantic-preserving program transformations that change code syntax without modifying the semantics. Evaluated on a large number of Java and C/C++ programs, TreeCaps models outperform prior deep learning models of program source code, in terms of both accuracy and robustness for program comprehension tasks such as code functionality classification and function name prediction. Our implementation is publicly available at: https://github.com/bdqnghi/treecaps.
... We have seen in Section 2.1 that in some cases, it is sufficient to extend the bi-directional first match policy of RFun and CoreFun to include information that should be provided by the caller. This method will require a program analysis such as the available expressions analysis specified in Nielson, Nielson, and Hankin [9]. The judgements for evaluating programs will, in this case, be extended with a statically available environment that contains bindings provided by the caller, and the side condition in Figures 3 and 5 will be an extended notion of orthogonality that is allowed to look at the pattern in each case as well. ...
Preprint
An algorithm describes a sequence of steps that transform a problem into its solution. Furthermore, when the inverted sequence is well-defined, we say that the algorithm is invertible. While invertible algorithms can be described in general-purpose languages, no guarantees are generally made by such languages as regards invertibility, so ensuring invertibility requires additional (and often non-trivial) proof. On the other hand, while reversible programming languages guarantee that their programs are invertible by restricting the permissible operations to those which are locally invertible, writing programs in the reversible style can be cumbersome, and may differ significantly from conventional implementations even when the implemented algorithm is, in fact, invertible. In this paper we introduce Jeopardy, a functional programming language that guarantees program invertibility without imposing local reversibility. In particular, Jeopardy allows the limited use of uninvertible -- and even nondeterministic! -- operations, provided that they are used in a way that can be statically determined to be invertible. However, guaranteeing invertibility is not obvious. Thus, we furthermore outline three approaches that can give a partial static guarantee.
... 4.3.1 0-CFA analysis. The backbone of the dependency analysis is 0-CFA analysis [Nielson et al. 1999]. We extend standard 0-CFA analysis, which tracks data-flow of functions, to additionally track data-flow of holes. ...
Preprint
Developing efficient and maintainable software systems is both hard and time consuming. In particular, non-functional performance requirements involve many design and implementation decisions that can be difficult to take early during system development. Choices -- such as selection of data structures or where and how to parallelize code -- typically require extensive manual tuning that is both time consuming and error-prone. Although various auto-tuning approaches exist, they are either specialized for certain domains or require extensive code rewriting to work for different contexts in the code. In this paper, we introduce a new methodology for writing programs with holes, that is, decision variables explicitly stated in the program code that enable developers to postpone decisions during development. We introduce and evaluate two novel ideas: (i) context-sensitive holes that are expanded by the compiler into sets of decision variables for automatic tuning, and (ii) dependency-aware tuning, where static analysis reduces the search space by finding the set of decision variables that can be tuned independently of each other. We evaluate the two new concepts in a system called Miking, where we show how the general methodology can be used for automatic algorithm selection, data structure decisions, and parallelization choices.
Chapter
Abstract Interpretation approximates the semantics of a program by mimicking its concrete fixpoint computation on an abstract domain A. The abstract (post-) fixpoint computation is classically divided into two phases: the ascending phase, using widenings as extrapolation operators to enforce termination, is followed by a descending phase, using narrowings as interpolation operators, so as to mitigate the effect of the precision losses introduced by widenings. In this paper we propose a simple variation of this classical approach where, to more effectively recover precision, we decouple the two phases: in particular, before starting the descending phase, we replace the domain A with a more precise abstract domain D. The correctness of the approach is justified by casting it as an instance of the A2I framework. After demonstrating the new technique on a simple example, we summarize the results of a preliminary experimental evaluation, showing that it is able to obtain significant precision improvements for several choices of the domains A and D.
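To make the classical two-phase scheme concrete, here is a small Python sketch on the interval domain for the loop `x = 0; while x < 100: x = x + 1`. It shows only the standard ascending phase with widening followed by a descending phase with narrowing on a single domain, not the domain switch proposed in the chapter. The tuple representation of intervals and all function names are assumptions for this example.

```python
import math

INF = math.inf
BOTTOM = None  # the empty interval

def join(a, b):
    if a is BOTTOM:
        return b
    if b is BOTTOM:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))

def meet(a, b):
    if a is BOTTOM or b is BOTTOM:
        return BOTTOM
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else BOTTOM

def widen(a, b):
    """Standard interval widening: unstable bounds jump to infinity."""
    if a is BOTTOM:
        return b
    if b is BOTTOM:
        return a
    lo = a[0] if a[0] <= b[0] else -INF
    hi = a[1] if a[1] >= b[1] else INF
    return (lo, hi)

def narrow(a, b):
    """Standard interval narrowing: only infinite bounds are refined."""
    if a is BOTTOM or b is BOTTOM:
        return BOTTOM
    lo = b[0] if a[0] == -INF else a[0]
    hi = b[1] if a[1] == INF else a[1]
    return (lo, hi)

def step(x):
    """Abstract effect at the loop head: [0,0] joined with (x restricted to x <= 99) + 1."""
    guarded = meet(x, (-INF, 99))
    body = BOTTOM if guarded is BOTTOM else (guarded[0] + 1, guarded[1] + 1)
    return join((0, 0), body)

def analyse():
    x = BOTTOM
    while True:                      # ascending phase with widening
        nxt = widen(x, step(x))
        if nxt == x:
            break
        x = nxt
    after_widening = x
    while True:                      # descending phase with narrowing
        nxt = narrow(x, step(x))
        if nxt == x:
            break
        x = nxt
    return after_widening, x

AFTER_WIDENING, AFTER_NARROWING = analyse()
```

The ascending phase stabilises at `(0, inf)`, losing the upper bound, and the descending phase recovers `(0, 100)` as the invariant at the loop head.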
Chapter
Serverless computing leverages the design of complex applications as the composition of small, individual functions to simplify development and operations. However, this flexibility complicates reasoning about the trade-off between performance and costs, requiring accurate models to support prediction and configuration decisions. Established performance model inference from execution traces is typically more expensive for serverless applications due to the significantly larger topologies and numbers of parameters resulting from the higher fragmentation into small functions. On the other hand, individual functions tend to embed simpler logic than larger services, which enables inferring some structural information by reasoning directly from their source code. In this paper, we use static control and data flow analysis to extract topological and parametric dependencies among interacting functions from their source code. To enhance the accuracy of model parameterization, we devise an instrumentation strategy to infer performance profiles driven by code analysis. We then build a compact layered queueing network (LQN) model of the serverless workflow based on the static analysis and code profiling data. We evaluated our method on serverless workflows with several common composition patterns deployed on Azure Functions, showing it can accurately predict the performance of the application under different resource provisioning strategies and workloads with a mean error under 7.3%.
Article
We propose a family of logical theories for capturing an abstract notion of consistency and show how to build a generic and efficient theory solver that works for all members in the family. The theories can be used to model the influence of memory consistency models on the semantics of concurrent programs. They are general enough to precisely capture important examples like TSO, POWER, ARMv8, RISC-V, RC11, IMM, and the Linux kernel memory model. To evaluate the expressiveness of our theories and the performance of our solver, we integrate them into a lazy SMT scheme that we use as a backend for a bounded model checking tool. An evaluation against related verification tools shows, besides flexibility, promising performance on challenging programs under complex memory models.
Chapter
Loop abstraction is a central technique for program analysis, because loops can cause large state-space representations if they are unfolded. In many cases, simple tricks can accelerate the program analysis significantly. There are several successful techniques for loop abstraction, but they are hard-wired into different tools and therefore difficult to compare and experiment with. We present a framework that allows us to implement different loop abstractions in one common environment, where each technique can be freely switched on and off on-the-fly during the analysis. We treat loops as part of the abstract model of the program, and use counterexample-guided abstraction refinement to increase the precision of the analysis by dynamically activating particular techniques for loop abstraction. The framework is independent from the underlying abstract domain of the program analysis, and can therefore be used for several different program analyses. Furthermore, our framework offers a sound transformation of the input program to a modified, more abstract output program, which is unsafe if the input program is unsafe. This allows loop abstraction to be used by other verifiers and our improvements are not ‘locked in’ to our verifier. We implemented several existing approaches and evaluate their effects on the program analysis.
Article
Local fixpoint iteration describes a technique that restricts fixpoint iteration in function spaces to needed arguments only. It has been studied well for first-order functions in abstract interpretation and also in model checking. Here we consider the problem for least and greatest fixpoints of arbitrary type order. We define an abstract algebra of simply-typed higher-order functions with fixpoints in order to express fixpoint evaluation problems as they occur routinely in various applications, including program verification. We present an algorithm that realises local fixpoint iteration for such higher-order fixpoints, prove its correctness and study its optimisation potential in the context of several applications. We also examine a particular fragment of this higher-order fixpoint algebra which allows us to pre-compute needed arguments, as this may help to speed up the fixpoint iteration process.
Article
We describe our experience of using property-based testing---an approach for automatically generating random inputs to check executable program specifications---in a development of a higher-order smart contract language that powers a state-of-the-art blockchain with thousands of active daily users. We outline the process of integrating QuickChick---a framework for property-based testing built on top of the Coq proof assistant---into a real-world language implementation in OCaml. We discuss the challenges we have encountered when generating well-typed programs for a realistic higher-order smart contract language, which mixes purely functional and imperative computations and features runtime resource accounting. We describe the set of the language implementation properties that we tested, as well as the semantic harness required to enable their validation. The properties range from the standard type safety to the soundness of a control- and type-flow analysis used by the optimizing compiler. Finally, we present the list of bugs discovered and rediscovered with the help of QuickChick and discuss their severity and possible ramifications.
Article
Flow logic is a “fast prototyping” approach to program analysis that shows great promise of being able to deal with a wide variety of languages and calculi for computation. However, seemingly innocent choices in the flow logic as well as in the operational semantics may inhibit proving the analysis correct. Our main conclusion is that environment based semantics is more flexible than either substitution based semantics or semantics making use of structural congruences (like alpha-renaming).
Conference Paper
Generation of efficient code for object-oriented programs requires knowledge of object lifetimes and method bindings. For object-oriented languages that have automatic storage management and dynamic look-up of methods, the compiler must obtain such knowledge by performing static analysis of the source code. We present an analysis algorithm which discovers the potential classes of each object in an object-oriented program as well as a safe approximation of their lifetimes. These results are obtained using abstract domains that approximate memory configurations and interprocedural call patterns of the program. We present several alternatives for these abstract domains that permit a trade-off between accuracy and complexity of the overall analysis.
Conference Paper
This paper concerns the static analysis of programs that perform destructive updating on heap-allocated storage. We give an algorithm that conservatively solves this problem by using a finite shape-graph to approximate the possible "shapes" that heap-allocated structures in a program can take on. In contrast with previous work, our method is even accurate for certain programs that update cyclic data structures. For example, our method can determine that when the input to a program that searches ...
Article
This paper concerns interprocedural dataflow-analysis problems in which the dataflow information at a program point is represented by an environment (i.e., a mapping from symbols to values), and the effect of a program operation is represented by a distributive environment transformer. We present two efficient algorithms that produce precise solutions: an exhaustive algorithm that finds values for all symbols at all program points, and a demand algorithm that finds the value for an individual symbol at a particular program point. Two interesting problems that can be handled by our algorithms are (decidable) variants of the interprocedural constant-propagation problem: copy-constant propagation and linear-constant propagation. The former interprets program statements of the form x ≔ 7 and x ≔ y. The latter also interprets statements of the form x ≔ 5 * y + 17. Experimental results on C programs have shown that:
- Although solving constant-propagation problems precisely (i.e., finding the meet-over-all-valid-paths solution, rather than the meet-over-all-paths solution) resulted in a slowdown by a factor ranging from 2.2 to 4.5, the precise algorithm found additional constants in 7 of 38 test programs.
- In contrast to previous results for numeric Fortran programs, linear-constant propagation found more constants than copy-constant propagation in 6 of 38 test programs.
- The demand algorithm, when used to demand values for all uses of scalar integer variables, was faster than the exhaustive algorithm by a factor ranging from 1.14 to about 6.
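The difference between the two statement forms can be sketched on straight-line code. The Python fragment below is only an illustration of which statements each variant interprets; it is not the interprocedural algorithm of the paper. The statement encoding `(target, coeff, source, offset)`, the marker `NON_CONST` and the function `run` are assumptions for this example.

```python
NON_CONST = "non-constant"  # hypothetical marker for "not known to be constant"

def run(program, linear):
    """Propagate constants through straight-line code.

    Each statement is (target, coeff, source, offset), read as
    target := coeff * source + offset, or target := offset when source is None.
    With linear=False only literal assignments and plain copies are interpreted.
    """
    env = {}
    for target, coeff, source, offset in program:
        if source is None:
            env[target] = offset                  # x := 7
            continue
        value = env.get(source, NON_CONST)
        is_copy = coeff == 1 and offset == 0      # x := y
        if value == NON_CONST or not (is_copy or linear):
            env[target] = NON_CONST
        else:
            env[target] = coeff * value + offset  # x := 5 * y + 17
    return env

PROGRAM = [
    ("x", 0, None, 7),   # x := 7
    ("y", 1, "x", 0),    # y := x
    ("z", 5, "y", 17),   # z := 5 * y + 17
]
COPY_RESULT = run(PROGRAM, linear=False)
LINEAR_RESULT = run(PROGRAM, linear=True)
```

Copy-constant propagation finds `x` and `y` to be 7 but gives up on `z`, whereas linear-constant propagation also determines `z` to be 52.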
Article
We present a new approach to proving type soundness for Hindley/Milner-style polymorphic type systems. The keys to our approach are (1) an adaptation of subject reduction theorems from combinatory logic to programming languages, and (2) the use of rewriting techniques for the specification of the language semantics. The approach easily extends from polymorphic functional languages to imperative languages that provide references, exceptions, continuations, and similar features. We illustrate the technique with a type soundness theorem for the core of Standard ML, which includes the first type soundness proof for polymorphic exceptions and continuations.
Article
This paper describes a memory management discipline for programs that perform dynamic memory allocation and de-allocation. At runtime, all values are put into regions. The store consists of a stack of regions. All points of region allocation and de-allocation are inferred automatically, using a type and effect based program analysis. The scheme does not assume the presence of a garbage collector. The scheme was first presented in 1994 (M. Tofte and J.-P. Talpin, in “Proceedings of the 21st ACM SIGPLAN–SIGACT Symposium on Principles of Programming Languages,” pp. 188–201); subsequently, it has been tested in The ML Kit with Regions, a region-based, garbage-collection free implementation of the Standard ML Core language, which includes recursive datatypes, higher-order functions and updatable references (L. Birkedal, M. Tofte, and M. Vejlstrup (1996), in “Proceedings of the 23rd ACM SIGPLAN–SIGACT Symposium on Principles of Programming Languages,” pp. 171–183). This paper defines a region-based dynamic semantics for a skeletal programming language extracted from Standard ML. We present the inference system which specifies where regions can be allocated and de-allocated and a detailed proof that the system is sound with respect to a standard semantics. We conclude by giving some advice on how to write programs that run well on a stack of regions, based on practical experience with the ML Kit.
Article
We show how the Hindley/Milner polymorphic type system can be extended to incorporate overloading and subtyping. Our approach is to attach constraints to quantified types in order to restrict the allowed instantiations of type variables. We present an algorithm for inferring principal types and prove its soundness and completeness. We find that it is necessary in practice to simplify the inferred types, and we describe techniques for type simplification that involve shape unification, strongly connected components, transitive reduction, and the monotonicities of type formulas.
Article
In this paper, we present a lattice of graphs particularly suitable for semantic analysis of dynamic data structures. We consider LISP-like structures only, as generalization to any dynamic structure is easy. Viewing those structures as data graphs, we introduce a special kind of graphs (called heap-graphs or h-graphs), each of these being able to figure out a set of data graphs. Then we build a finite subset of those h-graphs by means of a notion of normalized h-graphs, and define an algebraic structure on this subset, thus building a (finite) lattice. Finally, we define abstract operations on this subset, corresponding to (some) LISP primitives. We show how this analysis can be used, and give some results produced by an experimental analyzer.
Conference Paper
The earliest data flow analysis research dealt with concrete problems (such as detection of available expressions) and with low level representations of control flow (with one large graph, each of whose nodes represents a basic block). Several recent papers have introduced an abstract approach, dealing with any problem expressible in terms of a semilattice L and a monoid M of isotone maps from L to L, under various algebraic constraints. Examples include [CC77; GW76; KU76; Ki73; Ta75; Ta76; We75]. Several other recent papers have introduced a high level representation with many small graphs, each of which represents a small portion of the control flow information in a program. The hierarchy of small graphs is explicit in [Ro77a; Ro77b] and implicit in papers that deal with syntax directed analysis of programs written within the confines of classical structured programming [DDH72, Sec. 1.7]. Examples include [TK76; ZB74]. The abstract papers have retained the low level representations while the high level papers have retained the concrete problems of the earliest work. This paper studies abstract conditions on L and M that lead to rapid data flow analysis, with emphasis on high level representations. Unlike some analysis methods oriented toward structured programming [TK76; Wu75; ZB74], our method retains the ability to cope with arbitrary escape and jump statements while it exploits the control flow information implicit in the parse tree.
The general algebraic framework for data flow analysis with semilattices is presented in Section 2, along with some preliminary lemmas. Our "rapid" monoids properly include the "fast" monoids of [GW76]. Section 3 relates data flow problems to the hierarchies of small graphs introduced in [Ro77a; Ro77b]. High level analysis begins with local information expressed by mapping the arcs of a large graph into the monoid M, much as in low level analysis. But each arc in our small graphs represents a set (often an infinite set) of paths in the underlying large graph. Appropriate members of M are associated with these arcs. This "globalized" local information is used to solve global flow problems in Section 4. The fundamental theorem of Section 4 is applied to programs with the control structures of classical structured programming in Section 5. For a given rapid monoid M, the time required to solve any global data flow problem is linear in the number of statements in the program. (For varying M, the time is linear in the product of this number by t@, where t@ is a parameter of M introduced in the definition of rapidity.) For reasons sketched at the beginning of Section 6, we feel obliged to cope with source level escape and jump statements as well as with classical structured programming. Section 6 shows how to apply the fundamental theorem of Section 4 to programs with arbitrary escapes and jumps. The explicit time bound is only derived for programs without jumps. A comparison between the results obtained by our method and those obtained by [GW76] is in Section 7, which also contains examples of rapid monoids in the full paper. Finally, Section 8 lists conclusions and open problems. Proofs of lemmas are omitted to save space. The full paper will be submitted to a journal.
We proceed from the general to the particular, except in some places where bending the rule a little makes a significant improvement in the expository flow. Common mathematical notation is used. To avoid excessive parentheses, the value of a function f at an argument x is fx rather than f(x). If fx is itself a function then (fx)y is the result of applying fx to y. The usual ≤ and ≥ symbols are used for arbitrary partial orders as well as for the usual order among integers. A function from a partially ordered set (poset) to a poset is isotone iff x ≤ y implies fx ≤ fy. (Isotone maps are sometimes called "monotonic" in the literature.) A meet semilattice is a poset with a binary operation ∧ such that x ∧ y is the greatest lower bound of the set {x, y}. A meet semilattice wherein every subset has a greatest lower bound is complete. In particular, the empty subset has a greatest lower bound ⊤, so a complete meet semilattice has a maximum element. A monoid is a set together with an associative binary operation ∘ that has a unit element 1: 1 ∘ m = m ∘ 1 = m for all m. In all our examples the monoid M will be a monoid of functions: every member of M is a function (from a set into itself), the operation ∘ is the usual composition (f ∘ g)x = f(gx), and the unit 1 is the identity function with 1x = x for all x.
Two considerations governed the notational choices. First, we speak in ways that are common in mathematics and are convenient here. Second, we try to facilitate comparisons with [GW76; KU76; Ro77b], to the extent that the disparities among these works permit. One disparity is between the meet semilattices of [GW76; KU76; Ki73] and the join semilattices of [Ro77b; Ta75; We75], where least upper bounds are considered instead of greatest lower bounds. To speak of meets is more natural in applications that are intuitively stated in terms of "what must happen on all paths" in some class of paths in a program, while to speak of joins is more natural in applications that are intuitively stated in terms of "what can happen on some paths." By checking whether there are any paths in the relevant class and by using the rule that ∃ is equivalent to ¬∀¬, join oriented applications can be reduced to meet oriented ones (and vice versa). A general theory should speak in one way or the other, and we have chosen meets. For us, strong assertions about a program's data flow are high in the semilattice.
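The definitions of a meet semilattice and a monoid of isotone functions can be illustrated with the familiar gen/kill transformers over sets of available expressions, ordered by inclusion with intersection as meet. This Python sketch is an assumed example rather than a construction from the paper: the class `GenKill`, the three-expression `UNIVERSE` and the helper `subsets` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations

# The semilattice L: subsets of a small universe, x ≤ y iff x ⊆ y, meet = intersection.
UNIVERSE = frozenset({"a+b", "a*b", "a-b"})

def meet(x, y):
    return x & y

@dataclass(frozen=True)
class GenKill:
    """The function x ↦ (x − kill) ∪ gen; such functions form a monoid under composition."""
    kill: frozenset = frozenset()
    gen: frozenset = frozenset()

    def __call__(self, x):
        return (x - self.kill) | self.gen

    def compose(self, other):
        # (self ∘ other)x = self(other x), again expressible in gen/kill form.
        return GenKill(self.kill | other.kill, (other.gen - self.kill) | self.gen)

IDENTITY = GenKill()  # the unit 1, with 1x = x

def subsets(s):
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

F = GenKill(kill=frozenset({"a+b"}), gen=frozenset({"a*b"}))
G = GenKill(kill=frozenset({"a*b"}), gen=frozenset({"a-b"}))
ALL = subsets(UNIVERSE)

# Isotone: x ≤ y implies Fx ≤ Fy, checked exhaustively on the small lattice.
IS_ISOTONE = all(F(x) <= F(y) for x in ALL for y in ALL if x <= y)
# Closure under composition agrees with function composition pointwise.
COMPOSES = all(F.compose(G)(x) == F(G(x)) for x in ALL)
```

Here the maximum element is `UNIVERSE` itself, the greatest lower bound of the empty subset, and `IDENTITY` plays the role of the unit of the monoid.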