HKUST Theoretical Computer Science Center Research Report HKUST-TCSC-2005-05
Simple-Regular Expressions and Languages∗
Yo-Sub Han, Gerhard Trippen and Derick Wood
Department of Computer Science
The Hong Kong University of Science and Technology
{emmous, trippen, dwood}@cs.ust.hk
Abstract
We define simple-regular expressions and languages. Simple-regular languages provide
a necessary condition for a language to be outfix-free. We design algorithms that
compute simple-regular languages from finite-state automata. Furthermore, we investigate
the complexity blowup from a given finite-state automaton to its simple-regular
language automaton and show that there is an exponential blowup. In addition, we
present a finite-state automata construction for simple-regular expressions based on
state expansion.
1 Introduction
It is well known that the family of languages specified by finite-state automata (FAs) is
the same as the family of languages described by regular expressions [13]. This can be proved
by showing that we can construct FAs from regular expressions and that we can construct
regular expressions from FAs. Additionally, some automata constructions preserve the
structural properties of given regular expressions; for example, the Thompson construction [17]
or the position construction [8, 14]. Recently, Giammarresi et al. [7] introduced
Thompson languages and simple Thompson languages based on the structural properties of
Thompson automata. One interesting property of simple Thompson languages is that, for
a given Thompson automaton A, a string in a simple Thompson language of A corresponds
to a simple path from the start state to the final state of A.
FAs are often used to represent codes since codes are sets of strings. A code can be
classified by properties such as prefix-freeness, suffix-freeness, infix-freeness and
outfix-freeness [12]. The conditions that classify code types define proper subfamilies of given
language families. For regular languages, for example, outfix-freeness defines the family of
outfix-free regular languages, which is a proper subfamily of regular languages. Then, we
can classify FAs based on these conditions according to the languages that FAs define. We
observe that an FA must have some structural properties to satisfy a certain condition. For
example, a deterministic finite-state automaton (DFA) should not have any out-transitions
from a final state if the DFA defines a prefix-free language [9]. Given a nondeterministic
finite-state automaton (NFA) A, if L(A) is outfix-free, then there are no cycles in A since
an outfix-free regular language is always finite [12]. In other words, all accepting paths of
an outfix-free regular language in an FA must be simple. Furthermore, since an outfix-free
regular language L is finite, there is an acyclic deterministic finite-state automaton
for L. Han and Wood [10] designed an algorithm for determining the outfix-freeness of
a given acyclic deterministic finite-state automaton based on the structural properties of
outfix-free languages.

∗ Han and Wood were supported under the Research Grants Council of Hong Kong Competitive
Earmarked Research Grant HKUST6197/01E and Trippen was supported under RGC/HKUST Direct
Allocation Grant DAG03/04.EG05.
FAs are directed graphs with labels on edges, and simple paths of FAs are already
used in the literature [7, 12]. We examine FAs with respect to simple paths. In Section 2,
we define some basic notions. In Section 3, we introduce simple-regular expressions
and languages and design algorithms for computing simple-regular expressions and
languages. The algorithm for computing the simple-regular language of a given FA is based on
the algorithm [16] for enumerating all simple paths. Then, we investigate the complexity
blowup from an FA A to its simple-regular language FA A′. We also consider the cases when A
is a Thompson automaton or a position automaton since these two constructions are popular.
2 Preliminaries
Let Σ denote a finite alphabet of characters and $\Sigma^*$ denote the set of all strings over Σ.
A language over Σ is any subset of $\Sigma^*$. The character ∅ denotes the empty language and
the character λ denotes the null string. Given two strings x and y in $\Sigma^*$, x is said to be
an outfix of y if there is a string w such that $y = x_1 w x_2$, where $x = x_1 x_2$. For example,
abe is an outfix of abcde. Given a set X of strings over Σ, X is outfix-free if no string in
X is an outfix of any other string in X.
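
The definition of an outfix can be checked directly by trying every split $x = x_1 x_2$. The following is a minimal sketch in Python; the function names `is_outfix` and `is_outfix_free` are illustrative assumptions and do not come from the paper, and the quadratic pairwise test is only meant to mirror the definition, not the polynomial-time algorithm of [10].

```python
def is_outfix(x: str, y: str) -> bool:
    """Return True if x is an outfix of y, i.e. y = x1 + w + x2 with x = x1 + x2."""
    if len(x) > len(y):
        return False
    # Try every split x = x1 + x2: x1 must be a prefix of y and x2 a suffix of y.
    # Because len(x) <= len(y), the prefix and the suffix cannot overlap in y.
    for cut in range(len(x) + 1):
        if y.startswith(x[:cut]) and y.endswith(x[cut:]):
            return True
    return False


def is_outfix_free(strings) -> bool:
    """Return True if no string in the set is an outfix of another string in the set."""
    items = set(strings)
    return not any(is_outfix(x, y) for x in items for y in items if x != y)
```

With these helpers, `is_outfix("abe", "abcde")` holds by the split $x_1$ = ab, $x_2$ = e, which matches the example in the text.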
An FA A is specified by a tuple (Q, Σ, δ, s, F), where Q is a finite set of states, Σ is an
input alphabet, δ ⊆ Q × Σ × Q is a (finite) set of transitions, s ∈ Q is the start state and
F ⊆ Q is a set of final states. Let $|Q|$ be the number of states in Q and $|\delta|$ be the number
of transitions in δ. Then, the size $|A|$ of A is $|Q| + |\delta|$. Given a transition (p, a, q) in δ,
where p, q ∈ Q and a ∈ Σ, we say that p has an out-transition and q has an in-transition.
Furthermore, p is a source state of q and q is a target state of p. A string x over Σ is
accepted by A if there is a labeled path from s to a final state in F that spells out x.
Thus, the language L(A) of an FA A is the set of all strings spelled out by paths from s
to a final state in F. We assume that A has only useful states; that is, each state of A
appears on some path from the start state to some final state.
3 Simple-regular languages
3.1 Simple-regular languages from regular expressions and FAs
Definition 1. Given a regular expression E, we define the simple-regular expression S(E)
of E as follows:
1. S(∅) = ∅.
2. S(λ) = λ.
3. S(a) = a, for a ∈ Σ.
4. S(E + F) = S(E) + S(F), where E and F are regular expressions.
5. S(E · F) = S(E) · S(F).
6. S($E^*$) = λ + E.
Given a regular expression E, L(S(E)) is the corresponding simple-regular language of
L(E). A language is often given by an FA. We define a simple-regular language from an
FA as follows.
Definition 2. Given an FA A, we define the simple-regular language S(L(A)) of L(A)
to be the subset of L(A) such that each string is accepted by a simple path in A.
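
Definition 1 is a structural recursion over the expression, so it can be sketched as a short recursive function. The tuple encoding of regular expressions used here (`("sym", "a")`, `("union", E, F)`, and so on) and the name `simple` are assumptions made for illustration only. The sketch follows rule 6 exactly as stated above, so the body of a star is copied unchanged.

```python
# Hypothetical encoding of regular expressions as nested tuples:
#   ("empty",)  ("lambda",)  ("sym", "a")
#   ("union", E, F)  ("concat", E, F)  ("star", E)

def simple(expr):
    """Return S(expr) following the six rules of Definition 1."""
    kind = expr[0]
    if kind in ("empty", "lambda", "sym"):
        # Rules 1-3: the base cases are left unchanged.
        return expr
    if kind in ("union", "concat"):
        # Rules 4-5: apply S to both operands.
        return (kind, simple(expr[1]), simple(expr[2]))
    if kind == "star":
        # Rule 6 as stated in the text: S(E*) = lambda + E.
        return ("union", ("lambda",), expr[1])
    raise ValueError(f"unknown expression node: {kind!r}")
```

For example, applying `simple` to the encoding of a · b* yields the encoding of a · (λ + b).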
3.2 Simple-regular languages from finite-state automata
We define a path to be simple if it does not have a cycle in A. Brüggemann-Klein and
Wood [2] defined the orbit of a state q and its gate states to characterize one-unambiguous
regular languages. The orbit O(q) of q in A is the strongly connected component of q;
that is, it is the set of states of A that can be reached from q and from which q can be
reached. We say that O(q) is trivial if it consists of only q and there are no transitions
from q to itself in A. A state q of A is a gate of its orbit O(q) if q is a final state or q
has an out-transition to a state outside O(q). Note that an orbit is a cycle. If we have
(p, a, q) in δ, where p ∉ O(q), q ∈ O(q) and a ∈ Σ, we call p an entry state of an orbit.
Thus, we compute simple paths for each pair of an entry state and a gate state in an orbit
in A.
Given an FA A = (Q, Σ, δ, s, F), we transform A into a new FA A′ such that L(A′) =
L(A) and A′ is non-exiting and non-returning as follows:
$A' = (Q \cup \{s', f'\}, \Sigma, \delta \cup \{(s', \lambda, s)\} \cup \{(f_i, \lambda, f') \mid f_i \in F\}, s', f')$.
Furthermore, if there is more than one transition between two states p and q in A, we
combine these transitions into a single transition. For example, if (p, a, q) and (p, b, q)
are in δ, we replace these two transitions with (p, a + b, q) in δ.
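
The transformation above can be sketched as follows. This is a minimal illustration under several assumptions that are not in the paper: transition labels are plain strings, the merged label is the string "a + b", the new states are named `s'` and `f'` and do not already occur in Q, and the function name `to_expression_automaton` is hypothetical.

```python
from collections import defaultdict


def to_expression_automaton(states, delta, start, finals,
                            new_start="s'", new_final="f'"):
    """Add a fresh start and final state and merge parallel transitions.

    delta is a set of (p, label, q) triples. The result maps each state
    pair (p, q) to one combined label such as "a + b".
    """
    merged = defaultdict(list)
    # Collect all labels between the same pair of states (sorted for a stable order).
    for p, label, q in sorted(delta):
        merged[(p, q)].append(label)
    # The new start state enters the old start state on the null string.
    merged[(new_start, start)].append("λ")
    # Every old final state leaves to the single new final state on the null string.
    for f in sorted(finals):
        merged[(f, new_final)].append("λ")
    transitions = {pair: " + ".join(labels) for pair, labels in merged.items()}
    return set(states) | {new_start, new_final}, transitions, new_start, new_final
```

Storing at most one label per state pair is what later allows the automaton to be treated as a directed graph without multiple edges.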
We now have regular expressions instead of single characters on the transitions of A.
We call an FA with regular expressions on its transitions an expression automaton (EA).
Thus, we have an EA A = (Q, Σ, δ, s, f) such that there is at most one transition between
two states in Q and A is non-exiting and non-returning. If a regular expression E in δ is
not simple, then we replace E with its simple-regular expression E′. If there are self-loops
in A, we remove them since a simple path cannot pass through a self-loop. For a formal
definition and more details on EAs, refer to Han and Wood [9]. EAs are a generalization of
FAs; they allow regular languages on transitions instead of single characters. If we only
allow characters on transitions, then A is an FA and if we only allow strings, then A is a
generalized automaton [4, 5].
Since the strongly connected components of a directed graph can be computed in linear
time [1], we can identify all orbits of A in time linear in the size of A. Once we have
identified the orbits, we compute all simple paths for each pair of an entry state and a
gate state.
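
As an illustration of this step, the sketch below computes orbits as strongly connected components and then reads gates and entry states off the definitions in this section. It uses Kosaraju's two-pass depth-first search, which is one standard linear-time method and not necessarily the algorithm of [1]. The function names and the representation of transitions as unlabeled (p, q) pairs are assumptions made for illustration.

```python
def orbits(states, edges):
    """Map each state to its orbit (strongly connected component)."""
    succ = {q: [] for q in states}
    pred = {q: [] for q in states}
    for p, q in edges:
        succ[p].append(q)
        pred[q].append(p)

    # Pass 1: iterative DFS that records states in order of completion.
    order, seen = [], set()
    for root in states:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, successors = stack[-1]
            nxt = next((t for t in successors if t not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(succ[nxt])))

    # Pass 2: sweep the reversed graph in reverse completion order.
    orbit_of = {}
    for root in reversed(order):
        if root in orbit_of:
            continue
        component, todo = set(), [root]
        orbit_of[root] = component
        while todo:
            node = todo.pop()
            component.add(node)
            for p in pred[node]:
                if p not in orbit_of:
                    orbit_of[p] = component
                    todo.append(p)
    return orbit_of


def gates_and_entries(states, edges, finals):
    """Return (orbit_of, gates, entries) following the definitions in the text."""
    orbit_of = orbits(states, edges)
    gates = {q for q in states if q in finals}
    entries = set()
    for p, q in edges:
        if orbit_of[p] is not orbit_of[q]:
            gates.add(p)    # p has an out-transition to a state outside O(p)
            entries.add(p)  # p lies outside O(q) and has a transition into O(q)
    return orbit_of, gates, entries
```

Read literally, the two definitions make the source of every transition that crosses between orbits both a gate of its own orbit and an entry state of the target orbit, which is what the sketch reports.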
3.3 Computing all simple paths
We design an algorithm that computes all simple paths between two vertices in a graph
based on the algorithm of Rubin [16]. Let G = (V, E) be a directed graph without multiple
edges or self-loops, where V is a set of vertices and E is a set of edges. Let $|V| = n$ be the
number of vertices of V and $|E| = m$ be the number of edges of E. A path of length k in
G is a nonempty sequence of vertices $p = (v_0, v_1, \ldots, v_k)$ such that $(v_i, v_{i+1})$ is an edge of
G. We say a path p is simple if all of its vertices are distinct.
Let $I_i$ be the n-bit Boolean vector that has 0 in position i and 1 in every other position. Let
∧ and ∨ be the Boolean and and or operations. Given a path $p = (v_0, v_1, \ldots, v_k)$, we
define the vertex vector of p to be an n-bit Boolean vector such that the bits in
positions $v_0, v_1, \ldots, v_k$ are 1 and the others are 0. We define the edge vector of p to be an
m-bit Boolean vector such that bit j is 1 if the edge $e_j$ is in p and 0 otherwise. The
path descriptor of p is an ordered pair d(p) = (v, e), where v is the vertex vector and e is
the edge vector of p.
Figure 1: An example of an FA.
For example, we can represent the FA in Fig. 1 as a graph G = (V, E) such that
V = {1, 2, 3, 4, 5, 6, 7} and
$E = \{e_1 = (1,2), e_2 = (1,3), e_3 = (1,4), e_4 = (2,5), e_5 = (3,7), e_6 = (4,6), e_7 = (5,7), e_8 = (6,7)\}$.
Note that we assign a unique index number to each state and a unique edge index number
to each out-transition. We make all out-transition indices of a state consecutive in an
edge vector of G; for example, the first three bits of an edge vector are the out-transition
indices of state 1 in Fig. 1. For a path p = 1 → 3 → 7, the path descriptor d(p) is
(1010001, 01001000).
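
The path descriptor of the example can be reproduced with a few lines of Python. This is only a sketch: the names `EDGES`, `N` and `path_descriptor` are illustrative, and the vectors are represented as bit strings read from left to right so that they can be compared with the text directly.

```python
# Edges e1..e8 of the graph for Fig. 1, in index order, and the number of vertices.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 7), (4, 6), (5, 7), (6, 7)]
N = 7


def path_descriptor(path, n=N, edges=EDGES):
    """Return (vertex vector, edge vector) of a path given as a list of vertices."""
    # Bit v of the vertex vector is 1 exactly when vertex v lies on the path.
    vertex_bits = ["0"] * n
    for v in path:
        vertex_bits[v - 1] = "1"
    # Bit j of the edge vector is 1 exactly when edge e_j is used by the path.
    used = set(zip(path, path[1:]))
    edge_bits = ["1" if edge in used else "0" for edge in edges]
    return "".join(vertex_bits), "".join(edge_bits)
```

Calling `path_descriptor([1, 3, 7])` gives the pair of bit strings 1010001 and 01001000 stated above.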
Lemma 1 (Rubin [16]). Given a path descriptor (v, e) in G = (V, E), we can compute
the sequence of vertices of the path in O(m) time, where $m = |E|$.
Since G represents an FA A, $m = O(n^2)$ if A is nondeterministic, and we then need $O(n^2)$
time to construct a path from a path descriptor. We propose a more efficient method for
computing the sequence of vertices from a given path descriptor in an NFA. Given a path
descriptor d(p) = (v, e) of a simple path p in G, where v is a vertex vector and e is an edge
vector, let us assume that we have computed the vertex sequence from its start vertex s
to an intermediate vertex i and that $i_o = i_1 i_2 \cdots i_k$ are the out-transition indices of i
in e. Note that, since p is a simple path, exactly one bit in $i_o$ is 1 and the other bits are 0.
We search for the bit with 1.
We notice that bit operations (for example, shift, and, or and not) take constant
time [15]. We use binary search to find the bit with 1 in $i_o = i_1 i_2 \cdots i_k$ using
these bit operations. Assume that k is even. We shift $i_1 i_2 \cdots i_k$ to the left by k/2 and
append k/2 zeros so that we have $i' = i_{k/2+1} \cdots i_k 0 0 \cdots 0$. If $i' = 0$, then the
bit with 1 is in $i_1 i_2 \cdots i_{k/2}$. Otherwise, the bit with 1 is in $i_{k/2+1} \cdots i_k$. For example,
if $i_o$ is 00000100, then $i' = 01000000$ and we know that the bit with 1 must be in
$i_5 i_6 i_7 i_8 = 0100$.
We repeat this procedure recursively until we find the bit with 1. It takes at most $\lceil \log k \rceil$
iterations to find the bit with 1 since the procedure is a binary search, where k is the
number of out-transitions from i in G. The length of a simple
path can be at most n and the number of out-transitions from a vertex can be at most n.
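
The halving procedure can be sketched as follows. The function name `find_set_bit` is an assumption, the word is held in a Python integer whose most significant of k bits plays the role of $i_1$, and odd k is handled by letting the left half take $\lfloor k/2 \rfloor$ bits, which the text does not specify. Python integers have arbitrary precision, so the constant-time cost of each bit operation is an assumption about the machine model rather than something the sketch demonstrates.

```python
def find_set_bit(word: int, k: int) -> int:
    """Return the 1-based position, counted from the left, of the single 1 bit
    in a k-bit word, using shift-and-test halving as described in the text."""
    if k <= 0 or word <= 0 or word >= (1 << k) or word & (word - 1):
        raise ValueError("word must be a k-bit value with exactly one bit set")
    position = 1  # position of the leftmost bit of the current window
    while k > 1:
        half = k // 2
        # Shift left by `half` and keep k bits: this discards the left half.
        shifted = (word << half) & ((1 << k) - 1)
        if shifted == 0:
            # The 1 bit was in the discarded left half; keep only that half.
            word >>= k - half
            k = half
        else:
            # The 1 bit is in the right half; drop the left half and advance.
            word &= (1 << (k - half)) - 1
            position += half
            k -= half
    return position
```

For the word 00000100 with k = 8, the first shift gives 01000000, which is nonzero, so the search continues in $i_5 i_6 i_7 i_8$ and ends at position 6.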
Lemma 2. Given a path descriptor (v, e) in G = (V, E), we can compute the sequence of
vertices of the path in $O(n \log n)$ worst-case time, where $n = |V|$.
Since $m = O(n^2)$ in NFAs, Lemma 2 is an improvement from $O(n^2)$ time to $O(n \log n)$
time compared with Lemma 1.
The algorithm of Rubin [16] is based on matrix multiplication. He showed that we
can enumerate all simple paths in a given directed graph G in $O(n^3)$ matrix operations¹,
where n is the number of vertices in G.

EnumerateAllSimplePath (G = (V, E))
    Initialize an n × n matrix D, where n = |V|
    for (i, j) ∈ E
        D(i, j) = {d((i, j))}
    for i = 1 to n
        for j = 1 to n
            for k = 1 to n
                for each (v, e) ∈ D(j, i) and (w, f) ∈ D(i, k)
                    if v ∧ w ∧ I_i = 0 then
                        add (v ∨ w, e ∨ f) into D(j, k)

Figure 2: The algorithm of Rubin [16] that enumerates all simple paths of a given graph.
Each entry D(i, j) contains all simple path descriptors from i to j in G.
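
The pseudocode in Fig. 2 translates almost line by line into Python if the Boolean vectors are stored as integer bit masks. The sketch below is an illustration under assumptions that are not in the paper: vertices are numbered 1 to n, bit v − 1 of a vertex mask stands for vertex v, bit j of an edge mask stands for the j-th edge in the input list, and the function name is hypothetical.

```python
def enumerate_all_simple_paths(n, edges):
    """Return D, where D[j, k] is the set of (vertex mask, edge mask)
    descriptors of all simple paths from j to k."""
    D = {(a, b): set() for a in range(1, n + 1) for b in range(1, n + 1)}
    # Initialize with the descriptor of every single edge.
    for j, (a, b) in enumerate(edges):
        D[a, b].add(((1 << (a - 1)) | (1 << (b - 1)), 1 << j))
    for i in range(1, n + 1):
        not_i = ~(1 << (i - 1))  # the vector I_i: 0 in position i, 1 elsewhere
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                for v, e in list(D[j, i]):
                    for w, f in list(D[i, k]):
                        # Join j -> i and i -> k only if they share no vertex but i.
                        if (v & w & not_i) == 0:
                            D[j, k].add((v | w, e | f))
    return D
```

On the graph of Fig. 1, the entry for the pair (1, 7) ends up with three descriptors, one for each of the paths through states 2 and 5, through state 3, and through states 4 and 6.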
¹ Note that this is not the total running time for computing all simple paths. There can be
an exponential number of simple paths in G.

3.4 Computing simple-regular languages
Given an EA A = (Q, Σ, δ, s, f), we compute the simple path matrix D for A using
EnumerateAllSimplePath (EASP) in Fig. 2. Then, for each pair of an entry state i and
a gate state j of an orbit, we compute all simple paths from D(i, j). For each path
descriptor in D(i, j), we construct the corresponding simple path and compute the regular
expression E for the simple path. Then, we add a new transition (i, E, j) into δ. After we
have computed all simple paths for all pairs of an entry state and a gate state of an
orbit O(j), we remove all states and transitions in O(j) except the gate states.
In Fig. 3, we construct the simple path matrix using EASP and compute all simple
paths from 1 to 2 and from 1 to 3. Note that there is only one nontrivial orbit in the EA A