Conference Paper

Cuts in Regular Expressions

Authors:

Abstract

Most software packages with regular expression matching engines offer operators that extend the classical regular expressions, such as counting, intersection, complementation, and interleaving. Some of the most popular engines, for example those of Java and Perl, also provide operators that are intended to control the nondeterminism inherent in regular expressions. We formalize this notion in the form of the cut and iterated cut operators. They do not extend the class of languages that can be defined beyond the regular, but they allow for exponentially more succinct representation of some languages. Membership testing remains polynomial, but emptiness testing becomes PSPACE-hard.
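
The cut operator mentioned in the abstract can be illustrated with a brute-force membership test. The sketch below assumes the longest-prefix reading of the cut: a word belongs to the cut of K and L only if, after splitting off its longest prefix that lies in K, the remainder lies in L. The function name `in_cut` and the use of Python's `re.fullmatch` as a membership oracle are illustrative choices, not part of the paper.

```python
import re

def in_cut(k_pattern: str, l_pattern: str, word: str) -> bool:
    """Test membership in the cut of K and L (longest-prefix reading).

    k_pattern and l_pattern are ordinary Python regular expressions
    describing K and L; this helper is a hypothetical illustration.
    """
    k = re.compile(k_pattern)
    l = re.compile(l_pattern)
    # Try prefixes from longest to shortest. The first prefix found in K
    # is the forced split point; shorter prefixes are never reconsidered.
    for i in range(len(word), -1, -1):
        if k.fullmatch(word[:i]):
            return l.fullmatch(word[i:]) is not None
    # No prefix of the word (not even the empty one) is in K.
    return False

# "aab" is in the concatenation of a* and ab, but not in their cut:
# the forced prefix is "aa", and the remainder "b" is not in {ab}.
print(re.fullmatch("a*ab", "aab") is not None)  # True
print(in_cut("a*", "ab", "aab"))                # False
print(in_cut("a*", "b", "aab"))                 # True
```

In this example the cut of a* and ab is empty even though neither operand is, since the forced prefix always consumes every leading a. This gives some intuition for why emptiness testing is harder for expressions with cuts than for classical ones.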


... A common strategy, when it comes to implementing regex matching, is a greedy leftmost maximal substring match with the searched pattern. Such behaviour has been well defined theoretically in the context of formal languages as the cut operation [1,9]. The cut of two languages K and L is a subset of their concatenation defined as the language K ! ...
... If languages K and L are accepted by the deterministic finite automata (DFAs) A and B, then K ! L is accepted by the cut automaton A ! B, described by Berglund et al. [1], which has a grid-like structure similar to the product automaton. The DFA A ! B simulates A and starts also simulating B when a final state of A is reached, but it restarts the computation in B whenever a final state of A is reached again. ...
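
The behaviour described in this excerpt can be sketched as a direct simulation. The code below is not the construction from Berglund et al. It only follows the description above: run A, start B when A reaches a final state, and restart B whenever A reaches a final state again. The `DFA` class, the `cut_accepts` function, and the example automata are hypothetical names introduced for illustration, and transition functions are assumed to be total.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    start: int
    finals: frozenset
    delta: dict  # maps (state, symbol) -> state; assumed total

def cut_accepts(a: DFA, b: DFA, word: str) -> bool:
    """Simulate A and B together in the manner of the cut automaton."""
    p = a.start
    # q stays None until A has been in a final state at least once.
    q = b.start if p in a.finals else None
    for ch in word:
        p = a.delta[(p, ch)]
        if p in a.finals:
            # The prefix read so far is in K: restart B after this symbol.
            q = b.start
        elif q is not None:
            # B is running and A is not final: B consumes the symbol.
            q = b.delta[(q, ch)]
    return q is not None and q in b.finals

# Example automata over {a, b}: A accepts a*, B accepts {b}, B2 accepts {ab}.
A = DFA(0, frozenset({0}), {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1})
B = DFA(0, frozenset({1}), {(0, "a"): 2, (0, "b"): 1, (1, "a"): 2, (1, "b"): 2,
                            (2, "a"): 2, (2, "b"): 2})
B2 = DFA(0, frozenset({2}), {(0, "a"): 1, (0, "b"): 3, (1, "a"): 3, (1, "b"): 2,
                             (2, "a"): 3, (2, "b"): 3, (3, "a"): 3, (3, "b"): 3})

print(cut_accepts(A, B, "aab"))   # True
print(cut_accepts(A, B2, "aab"))  # False
```

The simulation keeps only a pair (state of A, state of B or "not started"), which matches the grid-like structure mentioned in the excerpt.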
Chapter
Full-text available
The cut of two languages is a subset of their concatenation given by the leftmost maximal substring match. We study the state complexity of the cut operation assuming that both operands belong to some, possibly different, subclasses of convex languages, namely, right, left, two-sided, and all-sided ideal, prefix-, suffix-, factor-, and subword-closed, and -free languages. For all considered pairs of classes, we get the exact state complexity of cut. We show that it is m whenever the first language is a right ideal, and it is m+n-1 or m+n-2 if the first language is prefix-closed or prefix-free. In the other cases, the state complexity of cut is between mn-2n-m+4 and mn-n+m, the latter being the known state complexity of cut on regular languages. All our witnesses are described over a fixed alphabet of size at most three, except for three cases when the witness languages are described over an alphabet of size m or m-1.
... Take \(S' = S \cdot a^{n-\min S+1}\). Then \(0 \notin S'\) and \(1 \in S'\), so \((0, S')\) is reachable as shown in case (2) and it is sent to \((0, S)\) by \((ab)^{\min S-1}\). ...
Chapter
Full-text available
We investigate the class of languages recognized by permutation deterministic finite automata. Using automata constructions and some properties of permutation automata, we show that this class is closed under Boolean operations, reversal, and quotients, and it is not closed under concatenation, power, Kleene closure, positive closure, cut, shuffle, cyclic shift, and permutation. We prove that the state complexity of Boolean operations, Kleene closure, positive closure, and right quotient on permutation languages is the same as in the general case of regular languages. Next, we get the tight upper bounds on the state complexity of concatenation (\(m2^n-2^{n-1}-m+1\)), square (\(n2^{n-1}-2^{n-2}\)), reversal (\(\binom{n}{\lceil n/2\rceil}\)), and left quotient (\(\binom{m}{\lceil m/2\rceil}\); tight if \(m\le n\)). All our witnesses are unary or binary, and the binary alphabet is always optimal, except for Boolean operations in the case of \(\gcd(m,n)=1\). In the unary case, the state complexity of all considered operations is the same as for regular languages, except for quotients and cut. In case of quotients, it is \(\min\{m,n\}\), and in case of cut, it is either \(2m-1\) or \(2m-2\), depending on whether there exists an integer \(\ell\) with \(2\le \ell \le n\) such that \(m\bmod \ell \ne 0\).
... For example, there are a lot of additional operators in regex libraries that should be analyzed. A special example is pruning operators, such as atomic subgroups, and the cut operator of [1], which interact deeply with the matching procedure. From a theoretical perspective, the next step should be to determine the precise complexity class for equivalence in Corollary 19. ...
Conference Paper
We introduce prioritized transducers to formalize capturing groups in regular expression matching in a way that permits straightforward modelling of and comparison with real-world regular expression matching library behaviors. The broader questions of parsing semantics and performance are discussed, and also the complexity of deciding equivalence of regular expressions with capturing groups.
Article
Full-text available
Research and calculations of non-uniform motion are very important from a practical point of view and have certain features for different states of flow, analysis of the shape of free surface curves, as well as the design of many hydraulic structures. When considering these issues, the concepts of specific cross-sectional energy and critical depth are used. The current trend of technology development in the educational process is based on Internet communications, instant online calculations and mobile microprocessor gadgets with appropriate software. The presented experimental project of educational and methodical material with web forms for online calculation of individual tasks is a variant of the modern competitive online environment on the Internet. A further direction of development is the addition of methodical video, audio and graphic elements. Analysis of web analytics will make it possible to gradually simplify the interface and choose the most effective set of modern formats for teaching materials. Computer online calculations make it possible to change the initial data and to introduce elements of modeling and in-depth study of theoretical positions using one typical example in the form of a web form.
Article
We introduce prioritized transducers to formalize capturing groups in regular expression matching in a way that permits straightforward modeling of capturing in Java's
Conference Paper
We investigate the state complexity of the cut and iterated cut operation for deterministic finite automata (DFAs), answering an open question stated in [M. Berglund, et al.: Cuts in regular expressions. In Proc. DLT, LNCS 7907, 2011]. These operations can be seen as an alternative to ordinary concatenation and Kleene star modelling leftmost maximal string matching. We show that the cut operation has a matching upper and lower bound of \((n-1)\cdot m+n\) states on DFAs accepting the cut of two individual languages that are accepted by n- and m-state DFAs, respectively. In the unary case we obtain \(\max (2n-1,m+n-2)\) states as a tight bound. For accepting the iterated cut of a language accepted by an n-state DFA we find a matching bound of \(1+(n+1)\cdot \mathsf {F}(\,1,n+2,-n+2;n+1\mid -1\,)\) states on DFAs, where \(\mathsf {F}\) refers to the generalized hypergeometric function. This bound is in the order of magnitude \(\varTheta ((n-1)!)\). Finally, the bound drops to \(2n-1\) for unary DFAs accepting the iterated cut of an n-state DFA and thus is similar to the bound for the cut operation on unary DFAs.
Conference Paper
Full-text available
We improve on some recent results on lower bounds for conversion problems for regular expressions. In particular we consider the conversion of planar deterministic finite automata to regular expressions, study the effect of the complementation operation on the descriptional complexity of regular expressions, and the conversion of regular expressions extended by adding intersection or interleaving to ordinary regular expressions. Almost all obtained lower bounds are optimal, and the presented examples are over a binary alphabet, which is best possible.
Conference Paper
Full-text available
Regular expressions with numerical occurrence indicators (#REs) are used in established text manipulation tools like Perl and Unix egrep, and in the recent W3C XML Schema Definition Language. Numerical occurrence indicators do not increase the expressive power of regular expressions, but they do increase the succinctness of expressions by an exponential factor. Therefore methods based on straightforward translation of #REs into corresponding standard regular expressions are computationally infeasible in the general case. We report some preliminary results about computational problems related to efficient matching and comparison of #REs. Matching, or membership testing of languages described by #REs, is shown to be tractable. Simple comparison problems (inclusion and overlap) of #REs are shown to be NP-hard. We also consider simple #REs consisting of a single symbol and nested numerical occurrence indicators only, and derive a simple numerical test for the membership of a word in the language described by a simple #RE.
Article
Full-text available
We consider regular expressions extended with the interleaving operator, and investigate the complexity of membership and inequivalence problems for these expressions. For expressions using the operators union, concatenation, Kleene star, and interleaving, we show that the inequivalence problem (deciding whether two given expressions do not describe the same set of words) is complete for exponential space. Without Kleene star, we show that the inequivalence problem is complete for the class \(\Sigma_2^p\) at the second level of the polynomial-time hierarchy. Certain cases of the membership problem (deciding whether a given word is in the language described by a given expression) are shown to be NP-complete. It is also shown that certain languages can be described exponentially more succinctly by using interleaving. 1 The research of this author was partly supported by ONR grant N00014-91-J-1613. Part of this work was performed while the author was at the IBM T.J. Watson Research Cent...
Book
Regular expressions are a central element of UNIX utilities like egrep and programming languages such as Perl. But whether you're a UNIX user or not, you can benefit from a better understanding of regular expressions since they work with applications ranging from validating data-entry fields to manipulating information in multimegabyte text files. Mastering Regular Expressions quickly covers the basics of regular-expression syntax, then delves into the mechanics of expression-processing, common pitfalls, performance issues, and implementation-specific differences. Written in an engaging style and sprinkled with solutions to complex real-world problems, Mastering Regular Expressions offers a wealth of information that you can put to immediate use. Regular expressions are an extremely powerful tool for manipulating text and data. They are now standard features in a wide range of languages and popular tools, including Perl, Python, Ruby, Java, VB.NET and C# (and any language using the .NET Framework), PHP, and MySQL. If you don't use regular expressions yet, you will discover in this book a whole new world of mastery over your data. If you already use them, you'll appreciate this book's unprecedented detail and breadth of coverage. If you think you know all you need to know about regular expressions, this book is a stunning eye-opener. As this book shows, a command of regular expressions is an invaluable skill. Regular expressions allow you to code complex and subtle text processing that you never imagined could be automated. Regular expressions can save you time and aggravation. They can be used to craft elegant solutions to a wide range of problems. Once you've mastered regular expressions, they'll become an invaluable part of your toolkit. You will wonder how you ever got by without them. Yet despite their wide availability, flexibility, and unparalleled power, regular expressions are frequently underutilized.
Yet what is power in the hands of an expert can be fraught with peril for the unwary. Mastering Regular Expressions will help you navigate the minefield to becoming an expert and help you optimize your use of regular expressions. Mastering Regular Expressions, Third Edition, now includes a full chapter devoted to PHP and its powerful and expressive suite of regular expression functions, in addition to enhanced PHP coverage in the central "core" chapters. Furthermore, this edition has been updated throughout to reflect advances in other languages, including expanded in-depth coverage of Sun's java.util.regex package, which has emerged as the standard Java regex implementation. Topics include: a comparison of features among different versions of many languages and tools; how the regular expression engine works; optimization (major savings available here!); matching just what you want, but not what you don't want; and sections and chapters on individual languages. Written in the lucid, entertaining tone that makes a complex, dry topic become crystal-clear to programmers, and sprinkled with solutions to complex real-world problems, Mastering Regular Expressions, Third Edition offers a wealth of information that you can put to immediate use. Reviews of this new edition and the second edition: "There isn't a better (or more useful) book available on regular expressions." --Zak Greant, Managing Director, eZ Systems "A real tour-de-force of a book which not only covers the mechanics of regexes in extraordinary detail but also talks about efficiency and the use of regexes in Perl, Java, and .NET...If you use regular expressions as part of your professional work (even if you already have a good book on whatever language you're programming in) I would strongly recommend this book to you." --Dr. Chris Brown, Linux Format "The author does an outstanding job leading the reader from regex novice to master.
The book is extremely easy to read and chock full of useful and relevant examples...Regular expressions are valuable tools that every developer should have in their toolbox. Mastering Regular Expressions is the definitive guide to the subject, and an outstanding resource that belongs on every programmer's bookshelf. Ten out of Ten Horseshoes." --Jason Menard, Java Ranch
Article
CONTENTS: Introduction; § 1. Homomorphism and equivalence of automata; § 2. Introduction of mappings in automata; § 3. Introduction of events in finite automata, operations on events; § 4. Automata and semi-groups; § 5. The composition of automata; § 6. Experiments with automata; Conclusion; References
Conference Paper
The presence of a schema offers many advantages in processing, translating, querying, and storage of XML data. Basic decision problems such as equivalence, inclusion, and nonemptiness of intersection of schemas form the basic building blocks for schema optimization and integration, and algorithms for static analysis of transformations. It is thereby paramount to establish the exact complexity of these problems. Most common schema languages for XML can be adequately modeled by some kind of grammar with regular expressions at right-hand sides. In this paper, we observe that, apart from the usual regular operators of union, concatenation, and Kleene-star, schema languages also allow numerical occurrence constraints and interleaving operators. Although the expressiveness of these operators remains within the regular languages, the presence or absence of these operators has a significant impact on the complexity of the basic decision problems. We present a complete overview of the complexity of the basic decision problems for DTDs, XSDs, and Relax NG with regular expressions incorporating numerical occurrence constraints and interleaving. We also discuss chain regular expressions and the complexity of the schema simplification problem incorporating the new operators.
Conference Paper
We show that the recognition problem of context-free languages can be reduced to membership in the language defined by a regular expression with intersection by a log space reduction with linear output length. We also show a matching upper bound improving the known fact that the membership problem for these regular expressions is in NC2. Together these results establish that the membership problem is complete in LOGCFL. For unary expressions we show hardness for the class NL and some related results.
Conference Paper
The nondeterministic lower space bound \(\sqrt{n}\) of Hunt, for the problem of deciding whether a regular expression with intersection describes a non-empty language, is improved to the upper bound n. For the general inequivalence problem for regular expressions with intersection, the lower bound \(c^n\) matches the upper bound except for the constant c. The proof of this tight lower bound is also simpler than the proofs of previous bounds. Methods developed for a result about one-letter alphabets are extended to get a complete characterization for the problem of deciding whether an input expression describes a given language. The complexity depends only on whether the given language is finite, infinite but bounded, or unbounded.
Article
We study the succinctness of the complement and intersection of regular expressions. In particular, we show that when constructing a regular expression defining the complement of a given regular expression, a double exponential size increase cannot be avoided. Similarly, when constructing a regular expression defining the intersection of a fixed and an arbitrary number of regular expressions, an exponential and double exponential size increase, respectively, can in worst-case not be avoided. All mentioned lower bounds improve the existing ones by one exponential and are tight in the sense that the target expression can be constructed in the corresponding time class, i.e., exponential or double exponential time. As a by-product, we generalize a theorem by Ehrenfeucht and Zeiger stating that there is a class of DFAs which are exponentially more succinct than regular expressions, to a fixed four-letter alphabet. When the given regular expressions are one-unambiguous, as for instance required by the XML Schema specification, the complement can be computed in polynomial time whereas the bounds concerning intersection continue to hold. For the subclass of single-occurrence regular expressions, we prove a tight exponential lower bound for intersection.
Conference Paper
Extended regular expressions (ERE) define regular languages using union, concatenation, repetition, intersection, and complementation operators. The fact that ERE allow intersection and complementation makes them exponentially more succinct than regular expressions. The membership problem for extended regular expressions is to decide, given an expression r and a word w, whether w belongs to the language defined by r. Since regular expressions are useful for describing patterns in strings, the membership problem has numerous applications. In many such applications, the words w are very long and patterns are conveniently described using ERE, making efficient solutions to the membership problem of great practical interest. In this paper we introduce alternating automata with synchronized universality and negation and use them in order to obtain a simple and efficient algorithm for solving the membership problem for ERE. Our algorithm runs in time \(O(m \cdot n^2)\) and space \(O(m \cdot n + k \cdot n^2)\), where m is the length of r, n is the length of w, and k is the number of intersection and complementation operators in r. This improves the best known algorithms for the problem.
A. K. Chandra, D. C. Kozen, and L. J. Stockmeyer. Alternation. J. ACM, 28(1):114-133, January 1981.