
An SMT-LIB Format for Sequences and Regular Expressions

Authors:
Nikolaj Bjørner (Microsoft Research)
Vijay Ganesh (MIT)
Raphaël Michel (University of Namur)
Margus Veanes (Microsoft Research)

June 3, 2012
Abstract
Strings are ubiquitous in software. Tools for verification and testing of software rely in various
degrees on reasoning about strings. Web applications are particularly important in this context since
they tend to be string-heavy and have a large number of security errors attributable to improper string
sanitization and manipulation. In recent years, many string solvers have been implemented to address
the analysis needs of verification, testing and security tools aimed at string-heavy applications.
These solvers support a basic representation of strings, functions such as concatenation and extraction,
and predicates such as equality and membership in regular expressions. However, the syntax and
semantics supported by the current crop of string solvers are mutually incompatible. Hence, there is
an acute need for a standardized theory of strings (i.e., an SMT-LIBization of a theory of strings) that
supports a core set of functions, predicates and string representations.
This paper presents a proposal for exactly such a standardization effort, i.e., an SMT-LIBization
of strings and regular expressions. It introduces a theory of sequences generalizing strings, and
builds a theory of regular expressions on top of sequences. The proposed logic QF_BVRE is designed
to capture a common substrate among existing tools for string constraint solving.
1 Introduction
This paper is a design proposal for an SMT-LIB format for a theory of strings and regular expressions.
The aim is to develop a set of core operations capturing the needs of verification, analysis, security and
testing applications that use string constraints. The standardized theory should be rich enough to support
a variety of existing and as-yet-unknown new applications. More complex functions/predicates should
be easily definable in it. On the other hand, the theory should be as minimal as possible in order for the
corresponding solvers to be relatively easy to write and maintain.
Strings can be viewed as monoids (sequences) where the main operations are creating the empty
string, creating a singleton string, and concatenation of strings. Unification algorithms for strings have been
the subject of extensive theoretical advances over several decades. Modern programming environments
support libraries that contain a large set of string operations. Applications arising from program
analysis tools use the additional vocabulary available in libraries. A realistic interchange format should
therefore support operations that are encountered in applications.
The current crop of string solvers [9, 12, 3] has incompatible syntax and semantics. Hence, the
objective of creating an SMT-LIB format for string and regular expression constraints is to identify a
uniform format that can be targeted by applications and consumed by solvers.
The paper is organized as follows. Section 2 introduces the theory Seq of sequences. The theory
RegEx of regular expressions in Section 3 is based on Seq. The theories admit sequences and regular
expressions over any type of finite alphabet. The characters in the alphabet are defined over the theory of
bit-vectors (Section 4). Section 5 surveys the state of string-solving tools. Section 6 describes benchmark
sets made available for QF_BVRE and a prototype. We provide a summary in Section 7.
2 Seq: A Theory of Sequences
In the following, we develop Seq as a theory of sequences. It has a sort constructor Seq that takes the
sort of the alphabet as argument.
2.1 The Signature of Seq
(par (A) (seq-unit (A) (Seq A)))                   ; string consisting of a single character
(par (A) (seq-empty (Seq A)))                      ; the empty string
(par (A) (seq-concat ((Seq A) (Seq A)) (Seq A)))   ; string concatenation
(par (A) (seq-cons (A (Seq A)) (Seq A)))           ; prepend a character to a sequence
(par (A) (seq-rev-cons ((Seq A) A) (Seq A)))       ; append a character to a sequence
(par (A) (seq-head ((Seq A)) A))                   ; retrieve the first character
(par (A) (seq-tail ((Seq A)) (Seq A)))             ; retrieve the tail of a sequence
(par (A) (seq-last ((Seq A)) A))                   ; retrieve the last character
(par (A) (seq-first ((Seq A)) (Seq A)))            ; retrieve all but the last character
(par (A) (seq-prefix-of ((Seq A) (Seq A)) Bool))   ; test for sequence prefix
(par (A) (seq-suffix-of ((Seq A) (Seq A)) Bool))   ; test for sequence suffix
(par (A) (seq-subseq-of ((Seq A) (Seq A)) Bool))   ; sub-sequence test
(par (A) (seq-extract ((Seq A) Num Num) (Seq A)))  ; extract a sub-sequence, parametric in Num
(par (A) (seq-nth ((Seq A) Num) A))                ; extract the n'th character, parametric in Num
(par (A) (seq-length ((Seq A)) Int))               ; retrieve the length of a sequence
The sort Num can be either an integer or a bit-vector. The logic QF_BVRE instantiates the sort Num to
bit-vectors, not to integers.
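For illustration, the following small script sketches how these operations might appear in a benchmark
written in the proposed syntax (the logic name QF_BVRE is defined in Section 4; the constants x, y and
the character literal #x61 are illustrative only):

(set-logic QF_BVRE)
(declare-const x (Seq (_ BitVec 8)))
(declare-const y (Seq (_ BitVec 8)))
; y is x with the character #x61 appended
(assert (= y (seq-concat x (seq-unit #x61))))
; by the axioms of Section 2.2, the last character of y must be #x61
(assert (not (= (seq-last y) #x61)))
(check-sat) ; expected: unsat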
2.2 Semantics of Seq
The constant seq-empty and function seq-concat satisfy the axioms for monoids. That is, seq-empty
is an identity of seq-concat and seq-concat is associative.
(seq-concat seq-empty x) = (seq-concat x seq-empty) = x
(seq-concat x (seq-concat y z)) = (seq-concat (seq-concat x y) z)
Furthermore, Seq is the theory all of whose models are expansions of the free monoid generated
by seq-unit and seq-empty.
2.2.1 Derived operations
All other functions (except extraction and lengths) are derived. They satisfy the axioms:
(seq-cons x y) = (seq-concat (seq-unit x) y)
(seq-rev-cons y x) = (seq-concat y (seq-unit x))
(seq-head (seq-cons x y)) = x
(seq-tail (seq-cons x y)) = y
(seq-last (seq-rev-cons x y)) = y
(seq-first (seq-rev-cons x y)) = x
(seq-prefix-of x y) ⇔ ∃z. (seq-concat x z) = y
(seq-suffix-of x y) ⇔ ∃z. (seq-concat z x) = y
(seq-subseq-of x y) ⇔ ∃z, u. (seq-concat u (seq-concat x z)) = y
Observe that the value of (seq-head seq-empty) is undetermined, and similarly for seq-tail,
seq-first and seq-last; their meaning is under-specified. Thus, the theory Seq admits all
interpretations that satisfy the free monoid properties and the axioms above.
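As a concrete illustration of the prefix axiom, the following hypothetical fragment in the proposed
syntax asserts that the two-character sequence "ab" is not a prefix of "abc"; under the axiom for
seq-prefix-of this is unsatisfiable (the constant names and ASCII literals are illustrative only):

(declare-const ab (Seq (_ BitVec 8)))
(declare-const abc (Seq (_ BitVec 8)))
(assert (= ab (seq-concat (seq-unit #x61) (seq-unit #x62))))   ; "ab"
(assert (= abc (seq-concat ab (seq-unit #x63))))               ; "abc"
; the prefix axiom holds with witness z = (seq-unit #x63)
(assert (not (seq-prefix-of ab abc)))
(check-sat) ; expected: unsat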
2.2.2 Extraction and lengths
It remains to provide semantics for sequence extraction and length functions. We will here describe these
informally.
(seq-length s) The length of sequence s. Seq satisfies the monoid axioms and is freely generated
by unit and concatenation, so every sequence is a finite concatenation of units (i.e., characters in
the alphabet). The length of a sequence is the number of units in the concatenation.
(seq-extract seq lo hi) produces the sub-sequence of characters between positions lo and hi-1. If the
length of seq is less than lo, then the produced sub-sequence is empty. If the bit-vector hi is
smaller than lo, the result is, once again, the empty sequence. If the length of seq is larger than
lo but less than hi, then the result is truncated to the length of seq. In other words, seq-extract
satisfies the following equations (the length function is abbreviated as l(s)):

(seq-extract s lo hi) =
    seq-empty                                      if l(s) < lo
    seq-empty                                      if hi < lo
    seq-empty                                      if hi < 0
    (seq-extract (seq-tail s) (lo − 1) (hi − 1))   if 0 < lo
    (seq-extract (seq-first s) 0 hi)               if 0 < m, where m = l(s) − hi
    s                                              otherwise
(seq-nth s n) Extracts the n'th character of sequence s. Indexing starts at 0, so for example
(seq-nth (seq-cons c s) 0) is c (where Num ranges over Int).
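The following hypothetical fragment, written with bit-vector indices as in QF_BVRE, illustrates the
truncation behavior of seq-extract and the zero-based indexing of seq-nth (constant names and
literals are illustrative only):

(declare-const s (Seq (_ BitVec 8)))
(assert (= s (seq-concat (seq-unit #x61) (seq-unit #x62))))   ; s has length 2
; extracting past the end of s truncates the result to s itself ...
(assert (= (seq-extract s #x00 #x05) s))
; ... and the character at index 1 is the second character of s
(assert (= (seq-nth s #x01) #x62))
(check-sat) ; expected: sat, both equations hold under the semantics above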
3 RegEx: A Theory of Regular Expressions
We summarize a theory of regular expressions over sequences. It includes the usual operations over
regular expressions, as well as a few operations that we have found useful in applications when modeling
recognizers of regular expressions. It has a sort constructor RegEx that takes the sort of the alphabet as
argument.
3.1 The Signature of RegEx
(par (A) (re-empty-set () (RegEx A))) ; Empty set
(par (A) (re-full-set () (RegEx A))) ; Universal set
(par (A) (re-concat ((RegEx A) (RegEx A)) (RegEx A))) ; Concatenation
(par (A) (re-of-seq ((Seq A)) (RegEx A))) ; Regular expression of sequence
(par (A) (re-empty-seq () (RegEx A))) ; same as (re-of-seq seq-empty)
(par (A) (re-star ((RegEx A)) (RegEx A))) ; Kleene star
(par (A) ((_ re-loop i j) ((RegEx A)) (RegEx A))) ; Bounded star, i,j >= 0
(par (A) (re-plus ((RegEx A)) (RegEx A))) ; Kleene plus
(par (A) (re-option ((RegEx A)) (RegEx A))) ; Option regular expression
(par (A) (re-range (A A) (RegEx A))) ; Character range
(par (A) (re-union ((RegEx A) (RegEx A)) (RegEx A))) ; Union
(par (A) (re-difference ((RegEx A) (RegEx A)) (RegEx A))) ; Difference
(par (A) (re-intersect ((RegEx A) (RegEx A)) (RegEx A))) ; Intersection
(par (A) (re-complement ((RegEx A)) (RegEx A))) ; Complement language
(par (A) (re-of-pred ((Array A Bool)) (RegEx A))) ; Range of predicate
(par (A) (re-member ((Seq A) (RegEx A)) Bool)) ; Membership test
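As an illustration of how these constructors compose, the following hypothetical term over 8-bit
characters denotes the set of C-style identifiers, i.e., the language of the regex [A-Za-z_][A-Za-z0-9_]*
(the hex literals are the usual ASCII codes):

(re-concat
  (re-union (re-union (re-range #x41 #x5A) (re-range #x61 #x7A))
            (re-of-seq (seq-unit #x5F)))
  (re-star
    (re-union (re-union (re-union (re-range #x41 #x5A) (re-range #x61 #x7A))
                        (re-range #x30 #x39))
              (re-of-seq (seq-unit #x5F)))))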
Note the following. The function re-range is defined modulo an ordering over the character sort.
The ordering is fixed by the logic. For example, in the QF_BVRE logic, the corresponding ordering
is the unsigned bit-vector comparison bvule. While re-range could be defined using re-of-pred, we
include it because it is pervasively used in regular expressions. The function re-of-pred takes an array
as argument. The array encodes a predicate. No other features of arrays are used, and the intent is that
benchmarks that use re-of-pred also include axioms that define the values of the arrays on all indices.
For example, we can constrain p using an axiom of the form

(assert (forall ((i (_ BitVec 8))) (iff (select p i) (bvule #x0A i))))
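Putting the pieces together, a hypothetical fragment that combines such an array-encoded predicate
with a membership constraint might read as follows (the names p and s are illustrative only):

(declare-const p (Array (_ BitVec 8) Bool))
(declare-const s (Seq (_ BitVec 8)))
; p holds exactly for the characters that are at least #x0A
(assert (forall ((i (_ BitVec 8))) (iff (select p i) (bvule #x0A i))))
; every character of s satisfies p
(assert (re-member s (re-star (re-of-pred p))))
(check-sat)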
3.2 Semantics of RegEx
Regular expressions denote sets of sequences. Assuming a denotation [[s]] for sequence expressions, we
can define a denotation function of regular expressions:
[[re-empty-set]] = ∅
[[re-full-set]] = {s | s is a sequence}
[[(re-concat x y)]] = {s·t | s ∈ [[x]], t ∈ [[y]]}
[[(re-of-seq s)]] = {[[s]]}
[[re-empty-seq]] = {[[seq-empty]]}
[[(re-star x)]] = [[x]]* = ⋃_{i=0}^{ω} [[x]]^i
[[(re-plus x)]] = [[x]]+ = ⋃_{i=1}^{ω} [[x]]^i
[[(re-option x)]] = [[x]] ∪ {[[seq-empty]]}
[[((_ re-loop l u) x)]] = ⋃_{i=l}^{u} [[x]]^i
[[(re-union x y)]] = [[x]] ∪ [[y]]
[[(re-difference x y)]] = [[x]] \ [[y]]
[[(re-intersect x y)]] = [[x]] ∩ [[y]]
[[(re-complement x)]] = {s | s is a sequence, s ∉ [[x]]}
[[(re-range a z)]] = {[[(seq-unit x)]] | a ≤ x ≤ z}
[[(re-of-pred p)]] = {[[(seq-unit x)]] | p[x]}
[[(re-member s x)]] = true iff [[s]] ∈ [[x]]
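A direct consequence of these definitions is that no sequence can be a member of both a regular
expression and its complement. The following hypothetical fragment in the proposed syntax is
therefore unsatisfiable (names and literals are illustrative only):

(declare-const s (Seq (_ BitVec 8)))
; s matches [a-z]+ and also the complement of [a-z]+
(assert (re-member s (re-plus (re-range #x61 #x7A))))
(assert (re-member s (re-complement (re-plus (re-range #x61 #x7A)))))
(check-sat) ; expected: unsat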
3.3 Anchors
Most regular expression libraries include anchors. They are usually identified using the regular expression
constants ^ (match the beginning of the string) and $ (match the end of a string). We were originally
inclined to include operators corresponding to these constants. In the end, we opted not to include anchors
as part of the core. The reason is that it is relatively straightforward to convert regular expressions
with anchor semantics into regular expressions without anchor semantics. The conversion increases the
size of the regular expression at most linearly, but in practice much less. If we were to include anchors,
the semantics of regular expression containment would also have to take anchors into account. The
denotation of regular expressions would then be context dependent and not as straightforward.
We embed regular expressions with anchor semantics into regular expressions with "regular" semantics
using the function complete. It takes three regular expressions as arguments, and it is used to convert
a regular expression e with anchors by calling it with the arguments complete(e, ⊤, ⊤). Below, the
symbol ⊤ corresponds to re-full-set, and ε corresponds to re-empty-seq.
complete(string, e1, e2) = e1 · string · e2
complete(x·y, ⊤, ⊤)      = complete(x, ⊤, ε) · complete(y, ε, ⊤)
complete(x·y, ⊤, ε)      = complete(x, ⊤, ε) · y
complete(x·y, ε, ⊤)      = x · complete(y, ε, ⊤)
complete($, e1, e2)      = ε
complete(^, e1, e2)      = ε
complete(x+y, e1, e2)    = complete(x, e1, e2) + complete(y, e1, e2)
We do not define complete for regular expressions in which anchors occur under Kleene star, complement
or difference. Such regular expressions are normally considered malformed and are rejected by regular
expression tools.
4 The logic QF_BVRE
The logic QF_BVRE uses the theory of sequences and regular expressions. It includes the SMT-LIB theory
of bit-vectors as well. Formulas are subject to the following constraints:
Sequences and regular expressions are instantiated to bit-vectors.
The sort Num used for extraction and indexing is a bit-vector.
re-range assumes the comparison predicate bvule.
Length functions can only occur in comparisons with other lengths or numerals obtained from
bit-vectors. So while the range of seq-length is Int, it is only used in relative comparisons
or in comparisons with a number over a bounded range. In other words, we admit the following
comparisons (where n is an integer constant):

({<, <=, =, >=, >} (seq-length x) (seq-length y))
({<, <=, =, >=, >} (seq-length x) n)

To maintain decidability, we also require that whenever a benchmark contains (seq-length x)
it also contains an assertion of the form (assert (<= (seq-length x) n)).
The sequence operations seq-prefix-of, seq-suffix-of and seq-subseq-of are excluded.
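To make these restrictions concrete, a complete QF_BVRE benchmark might look roughly as follows
(a hypothetical sketch in the proposed syntax; names and literals are illustrative only):

(set-logic QF_BVRE)
(declare-const s (Seq (_ BitVec 8)))
; s is a non-empty sequence of decimal digits ...
(assert (re-member s (re-plus (re-range #x30 #x39))))
; ... that does not start with the digit 0 ...
(assert (not (re-member s (re-concat (re-of-seq (seq-unit #x30))
                                     (re-star (re-range #x30 #x39))))))
; ... and, as required for decidability, its length is bounded
(assert (<= (seq-length s) 10))
(check-sat)
(get-model)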
5 String solvers
String analysis has recently received increased attention, with several automata-based analysis tools.
Besides differences in notation, which the current proposal addresses, the tools also differ in expressiveness
and succinctness of representation for various fragments of (extended) regular expressions. The tools
also use different representations and algorithms for dealing with the underlying automata-theoretic
operations. A comparison of the basic tradeoffs between automata representations and the algorithms for
product and difference is studied in [11], where the benchmarks originate from a case study in [19].
The Java String Analyzer (JSA) [7] uses finite automata internally to represent strings with the
dk.brics.automaton library, where automata are directed graphs whose edges represent contiguous
character ranges. Epsilon moves are not preserved in the automata but are eliminated upon insertion.
This representation is optimized for matching strings rather than finding strings.
The Hampi tool [16] uses an eager bitvector encoding from regular expressions to bitvector logic.
The Kudzu/Kaluza framework extends this approach to systems of constraints with multiple variables
and supports concatenation [22]. The original Hampi format does not directly support regular expression
quantifiers "at least m times" and "at most n times"; e.g., the regex a{1,3} would need to be expanded
to a|aa|aaa. The same limitation applies to the core constraint language of Kudzu [22], which extends
Hampi.
The tool presented in [14] uses lazy search algorithms for solving regular subset constraints,
intersection and determinization. The automaton representation is based on the Boost Graph Library [23]
and uses a range representation of character intervals that is similar to JSA. The lazy algorithms
produce significant performance benefits relative to DPRLE [13] and the original Rex [27] implementation.
DPRLE [13] has a fully verified core specification written in Gallina [8], and an OCaml implementation
that is used for experiments.
Rex [27] uses a symbolic representation of automata where labels are represented by predicates.
Such automata were initially studied in the context of natural language processing [21]. Rex uses
symbolic language acceptors, which are first-order encodings of symbolic automata into the theory of
algebraic datatypes. The initial Rex work [27] explores various optimizations of symbolic automata, such
as minimization, that make use of the underlying SMT solver to eliminate inconsistent conditions.
Subsequent work [26] explores trade-offs between the language-acceptor-based encoding and the use of
automata-specific algorithms for language intersection and language difference. The Symbolic Automata
library [25] implements the algebra of symbolic automata and transducers [24]. Symbolic Automata is
the backbone of Rex and Bek.¹
                        Kleene  Boole  re-range  re-of-pred  re-loop  seq-concat  seq-length  Σ
JSA                     X       X      X                                                      BV16
Hampi                   X              X                                                      BV8
Kudzu/Kaluza            X              X                              X           X           BV8
Symbolic Automata/Rex   X       X      X         X           X                                ALL

Table 1: Expressivity of string tools.
Table 1 compares the expressivity of the tools with an emphasis on regular expression constraints.
Columns represent supported features. Kleene stands for the operations re-empty-set, re-empty-seq,
re-concat, re-union, and re-star. Boole stands for re-intersect and re-complement. Σ refers
to supported alphabet theories. In Hampi and Kudzu the Boolean operations over languages can be
encoded through membership constraints and Boolean operations over formulas. In the Symbolic Automata
Toolkit, automata are generic and support all SMT theories as alphabets.
A typical use of re-range is to succinctly describe a contiguous range of characters, such as all
upper case letters, [A-Z]. Similarly, re-of-pred can be used to define a character class such as \W
(all non-word-letter characters) through a predicate (represented as an array). For example, provided that
W is defined as follows

∀x. (W[x] ⇔ ¬((‘A’ ≤ x ≤ ‘Z’) ∨ (‘a’ ≤ x ≤ ‘z’) ∨ (‘0’ ≤ x ≤ ‘9’) ∨ x = ‘_’))

then (re-of-pred W) is the regex that matches all non-word-letter characters. Finally, re-loop is
a succinct shorthand for bounded loops that is used very frequently in regular expressions.
¹ http://research.microsoft.com/bek/
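For concreteness, the definition of \W above could be rendered in the proposed syntax roughly as
follows (a hypothetical sketch over 8-bit ASCII characters; the array name W is illustrative):

(declare-const W (Array (_ BitVec 8) Bool))
(assert (forall ((x (_ BitVec 8)))
  (iff (select W x)
       (not (or (and (bvule #x41 x) (bvule x #x5A))    ; A-Z
                (and (bvule #x61 x) (bvule x #x7A))    ; a-z
                (and (bvule #x30 x) (bvule x #x39))    ; 0-9
                (= x #x5F))))))                        ; _
; (re-of-pred W) now denotes the character class \W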
MONA [10, 17] provides decision procedures for several varieties of monadic second-order logic
(M2L) that can be used to express regular expressions over words as well as trees. MONA relies on a
highly optimized multi-terminal BDD-based representation for deterministic automata. MONA is used in
the PHP string analysis tool Stranger [29] through a string manipulation library.
Other tools include custom domain-specific string solvers [20, 28]. There is also a wide range of
application domains that rely on automata-based methods: string constraints with length bounds [30];
automata for arithmetic constraints [6]; automata in explicit-state model checking [5]; word equations [1,
18]; construction of automata from regular expressions [15]. Moreover, certain string constraints based
on common string library functions [4] (not using regular expressions) can be directly encoded using a
combination of existing theories provided by an SMT solver.
6 A prototype for QF_BVRE based on the Symbolic Automata Toolkit
This section describes a prototype implementation for QF_BVRE. It is based on the Symbolic Automata
Toolkit [25] powered by Z3. The description sidesteps the current limitation that all terms s of sort
(Seq σ) are converted to terms of sort (List σ). While lists in Z3 satisfy all the algebraic properties
of sequences, only the operations equivalent to seq-empty, seq-cons, seq-head, and seq-tail are
(directly) supported in the theory of lists. This also explains why seq-concat and seq-length (as is
also noted in Table 1) are currently not supported in this prototype.
To start with, the benchmark file is parsed using Z3's API method ParseSmtlib2File, providing
a Z3 Term object φ that represents the AST of the assertion contained in the file. The assertion φ is
converted into a formula Conv(φ) where each occurrence of a membership constraint (re-member s r)
has been replaced by an atom (Acc_r s), where Acc_r is a new uninterpreted function symbol called the
symbolic language acceptor for r. The symbol Acc_r is associated with a set of axioms Th(r) such that
(Acc_r s) holds modulo Th(r) iff s is a sequence that matches the regular expression r. The converted
formula Conv(φ) as well as all the axioms Th(r) are asserted to Z3 and checked for satisfiability.
The core of the translation is in converting r into a Symbolic Finite Automaton SFA(r) and then
defining Th(r) as the theory of SFA(r) [26]. The translation uses closure properties of symbolic automata
under the following (effective) Kleene and Boolean operations:

If A and B are SFAs then there is an SFA A·B such that L(A·B) = L(A)·L(B).
If A and B are SFAs then there is an SFA A∪B such that L(A∪B) = L(A)∪L(B).
If A and B are SFAs then there is an SFA A∩B such that L(A∩B) = L(A)∩L(B).
If A is an SFA then there is an SFA A* such that L(A*) = L(A)*.
If A is an SFA then there is an SFA Ā such that L(Ā) is the complement of L(A).
The effectiveness of the above operations does not depend on the theory of the alphabet. In SFAs all
transitions are labeled by predicates. In particular, a bit-vector range (re-range m n) is mapped into an
anonymous predicate λx. (m ≤ x ≤ n) over bit-vectors, and a predicate (re-of-pred p) is just mapped
to p. The overall translation SFA(r) now follows more-or-less directly by induction on the structure
of r. The loop construct ((_ re-loop m n) r) is unfolded by using re-concat and re-union. Several
optimizations are possible that have been omitted here.
As a simple example of the above translation, consider the regex
utf16 =^([\0-\uD7FF\uE000-\uFFFF]|([\uD800-\uDBFF][\uDC00-\uDFFF]))*$
that describes valid UTF-16 encoded strings. Using the SMT2 format, and assuming the character sort
is defined as (_ BitVec 16), the regex is
(re-star (re-union (re-union (re-range #x0000 #xD7FF) (re-range #xE000 #xFFFF))
(re-concat (re-range #xD800 #xDBFF) (re-range #xDC00 #xDFFF))))
The resulting SFA(utf16) can be depicted as a two-state automaton with initial and accepting state q0
and a second state q1: q0 has a self-loop labeled λx. (x ≤ #xD7FF ∨ #xE000 ≤ x), a transition to q1
labeled λx. (#xD800 ≤ x ≤ #xDBFF), and q1 has a transition back to q0 labeled λx. (#xDC00 ≤ x ≤ #xDFFF),
and the theory Th(utf16) contains the following axioms:

∀y. (Acc_utf16(y) ⇔ (y = ε
                     ∨ (y ≠ ε ∧ (head(y) ≤ #xD7FF ∨ #xE000 ≤ head(y)) ∧ Acc_utf16(tail(y)))
                     ∨ (y ≠ ε ∧ #xD800 ≤ head(y) ≤ #xDBFF ∧ Acc_1(tail(y)))))
∀y. (Acc_1(y) ⇔ (y ≠ ε ∧ #xDC00 ≤ head(y) ≤ #xDFFF ∧ Acc_utf16(tail(y))))
Benchmarks in the proposed SMT-LIB format that are handled by the tool are available.²
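For instance, a benchmark exercising the utf16 regex might look roughly as follows (a hypothetical
sketch in the proposed syntax; the constant name s and the length bound are illustrative):

(set-logic QF_BVRE)
(declare-const s (Seq (_ BitVec 16)))
(assert (<= (seq-length s) 20))
; s is a valid UTF-16 encoded string ...
(assert (re-member s
  (re-star (re-union (re-union (re-range #x0000 #xD7FF) (re-range #xE000 #xFFFF))
                     (re-concat (re-range #xD800 #xDBFF) (re-range #xDC00 #xDFFF))))))
; ... that contains at least one high surrogate, forcing a surrogate pair
(assert (re-member s
  (re-concat re-full-set (re-concat (re-range #xD800 #xDBFF) re-full-set))))
(check-sat)
(get-model)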
7 Summary
We proposed an interchange format for sequences and regular expressions. It is based on the features
of strings and regular expressions used in current mainstream solvers for regular expressions. There are many
possible improvements and extensions to this proposed format. For example, it is tempting to leverage
the fact that SMT-LIB already allows string literals. The first objective is to identify a logic that makes it
possible to exchange meaningful benchmarks between solvers and to compare techniques that are currently
being developed for solving sequence and regular expression constraints.
7.1 Contributors
Several people contributed to discussions about SMTization of strings, including Nikolaj Bjørner, Vijay
Ganesh, Tim Hinrichs, Pieter Hooimeijer, Raphaël Michel, Ruzica Piskac, Cesare Tinelli, Margus
Veanes, Andrei Voronkov and Ting Zhang. This effort grew out of discussions at a Dagstuhl seminar
[2] and was followed up at strings-smtization@googlegroups.com.
References
[1] Sebastian Bala. Regular language matching and other decidable cases of the satisfiability problem for con-
straints between regular open terms. In STACS, pages 596–607, 2004.
[2] Nikolaj Bjørner, Robert Nieuwenhuis, Helmut Veith, and Andrei Voronkov. Decision Procedures in Soft,
Hard and Bio-ware - Follow Up (Dagstuhl Seminar 11272). Dagstuhl Reports, 1(7):23–35, 2011.
[3] Nikolaj Bjørner, Nikolai Tillmann, and Andrei Voronkov. Path feasibility analysis for string-manipulating
programs. In TACAS, 2009.
[4] Nikolaj Bjørner, Nikolai Tillmann, and Andrei Voronkov. Path feasibility analysis for string-manipulating
programs. In TACAS, 2009.
[5] Stefan Blom and Simona Orzan. Distributed state space minimization. J. Software Tools for Technology
Transfer, 7(3):280–291, 2005.
² http://research.microsoft.com/~nbjorner/microsoft.automata.smtbenchmarks.zip
[6] Bernard Boigelot and Pierre Wolper. Representing arithmetic constraints with finite automata: An overview.
In ICLP 2002: Proceedings of The 18th International Conference on Logic Programming, pages 1–19, 2002.
[7] Aske Simon Christensen, Anders Møller, and Michael I. Schwartzbach. Precise Analysis of String Expres-
sions. In SAS, 2003.
[8] Thierry Coquand and Gérard P. Huet. The calculus of constructions. Information and Computation,
76(2/3):95–120, 1988.
[9] Vijay Ganesh, Adam Kiezun, Shay Artzi, Philip J. Guo, Pieter Hooimeijer, and Michael D. Ernst. Hampi: A
string solver for testing, analysis and vulnerability detection. In Ganesh Gopalakrishnan and Shaz Qadeer,
editors, CAV, volume 6806 of Lecture Notes in Computer Science, pages 1–19. Springer, 2011.
[10] J.G. Henriksen, J. Jensen, M. Jørgensen, N. Klarlund, B. Paige, T. Rauhe, and A. Sandholm. Mona: Monadic
second-order logic in practice. In TACAS’95, volume 1019 of LNCS, 1995.
[11] Pieter Hooimeijer and Margus Veanes. An evaluation of automata algorithms for string analysis. In VM-
CAI’11, volume 6538 of LNCS, pages 248–262. Springer, 2011.
[12] Pieter Hooimeijer and Westley Weimer. A decision procedure for subset constraints over regular languages.
In PLDI, 2009.
[13] Pieter Hooimeijer and Westley Weimer. A decision procedure for subset constraints over regular languages.
In PLDI, 2009.
[14] Pieter Hooimeijer and Westley Weimer. Solving string constraints lazily. In ASE, 2010.
[15] Lucian Ilie and Sheng Yu. Follow automata. Information and Computation, 186(1):140–162, 2003.
[16] Adam Kiezun, Vijay Ganesh, Philip J. Guo, Pieter Hooimeijer, and Michael D. Ernst. HAMPI: a solver for
string constraints. In ISSTA, 2009.
[17] Nils Klarlund, Anders Møller, and Michael I. Schwartzbach. MONA implementation secrets. International
Journal of Foundations of Computer Science, 13(4):571–586, 2002.
[18] Michal Kunc. What do we know about language equations? In Developments in Language Theory, pages
23–27, 2007.
[19] Nuo Li, Tao Xie, Nikolai Tillmann, Peli de Halleux, and Wolfram Schulte. Reggae: Automated test genera-
tion for programs using complex regular expressions. In ASE’09, 2009.
[20] Yasuhiko Minamide. Static approximation of dynamically generated web pages. In WWW ’05, pages 432–
441, 2005.
[21] Gertjan Van Noord and Dale Gerdemann. Finite state transducers with predicates and identities. Grammars,
4:263–286, 2001.
[22] Prateek Saxena, Devdatta Akhawe, Steve Hanna, Feng Mao, Stephen McCamant, and Dawn Song. A Sym-
bolic Execution Framework for JavaScript, Mar 2010.
[23] Jeremy G. Siek, Lie-Quan Lee, and Andrew Lumsdaine. The Boost Graph Library: User Guide and Refer-
ence Manual (C++ In-Depth Series). Addison-Wesley Professional, December 2001.
[24] M. Veanes, P. Hooimeijer, B. Livshits, D. Molnar, and N. Bjørner. Symbolic finite state transducers: Algo-
rithms and applications. In POPL’12, 2012.
[25] Margus Veanes and Nikolaj Bjørner. Symbolic automata: The toolkit. In C. Flanagan and B. König, editors,
TACAS'12, volume 7214 of LNCS, pages 472–477. Springer, 2012.
[26] Margus Veanes, Nikolaj Bjørner, and Leonardo de Moura. Symbolic automata constraint solving. In
C. Fermüller and A. Voronkov, editors, LPAR-17, volume 6397 of LNCS/ARCoSS, pages 640–654. Springer,
2010.
[27] Margus Veanes, Peli de Halleux, and Nikolai Tillmann. Rex: Symbolic Regular Expression Explorer. In
ICST’10. IEEE, 2010.
[28] Gary Wassermann and Zhendong Su. Sound and precise analysis of web applications for injection vulnera-
bilities. In PLDI, 2007.
[29] Fang Yu, Muath Alkhalaf, and Tevfik Bultan. Stranger: An automata-based string analysis tool for PHP. In
TACAS’10, LNCS. Springer, 2010.
[30] Fang Yu, Tevfik Bultan, and Oscar H. Ibarra. Symbolic String Verification: Combining String Analysis and
Size Analysis. In TACAS, pages 322–336, 2009.