
HKUST Theoretical Computer Science Center Research Report HKUST-TCSC-2005-05

Simple-Regular Expressions and Languages∗

Yo-Sub Han, Gerhard Trippen and Derick Wood

Department of Computer Science

The Hong Kong University of Science and Technology

{emmous, trippen, dwood}@cs.ust.hk

Abstract

We define simple-regular expressions and languages. Simple-regular languages provide a necessary condition for a language to be outfix-free. We design algorithms that compute simple-regular languages from finite-state automata. Furthermore, we investigate the complexity blowup from a given finite-state automaton to its simple-regular language automaton and show that there is an exponential blowup. In addition, we present a finite-state automata construction for simple-regular expressions based on state expansion.

1 Introduction

It is well known that the family of languages specified by finite-state automata (FAs) is

the same as the family of languages described by regular expressions [13]. It can be proved

by showing that we can construct FAs from regular expressions and that we can construct

regular expressions from FAs. Additionally, some automata constructions preserve the

structural properties of given regular expressions; for example, the Thompson construction [17] or the position construction [8, 14]. Recently, Giammarresi et al. [7] introduced

Thompson languages and simple Thompson languages based on the structural properties of

Thompson automata. One interesting property of simple Thompson languages is that for

a given Thompson automaton A, a string in a simple Thompson language of A corresponds

to a simple path from the start state to the final state of A.

FAs are often used to represent codes since codes are sets of strings. A code can be

classified by properties such as prefix-freeness, suffix-freeness, infix-freeness and outfix-freeness [12]. The conditions that classify code types define proper subfamilies of given

language families. For regular languages, for example, outfix-freeness defines the family of

outfix-free regular languages, which is a proper subfamily of regular languages. Then, we

can classify FAs based on these conditions according to the languages that FAs define. We

observe that an FA must have some structural properties to satisfy a certain condition. For

example, a deterministic finite-state automaton (DFA) should not have any out-transitions

from a final state if the DFA defines a prefix-free language [9]. Given a nondeterministic

finite-state automaton (NFA) A, if L(A) is outfix-free, then there are no cycles in A since an outfix-free regular language is always finite [12]. In other words, all accepting paths of an outfix-free regular language in an FA must be simple. Furthermore, since an outfix-free regular language L is finite, there is an acyclic deterministic finite-state automaton for L. Han and Wood [10] designed an algorithm for determining the outfix-freeness of a given acyclic deterministic finite-state automaton based on the structural properties of outfix-free languages.

∗ Han and Wood were supported under the Research Grants Council of Hong Kong Competitive Earmarked Research Grant HKUST6197/01E and Trippen was supported under RGC/HKUST Direct Allocation Grant DAG03/04.EG05.

FAs are directed graphs with labels on edges and simple paths of FAs are already

used in the literature [7, 12]. We examine FAs with respect to simple paths. In Section 2, we define some basic notions. In Section 3, we introduce simple-regular expressions and languages and design algorithms for computing simple-regular expressions and languages. The algorithm for computing the simple-regular language of a given FA is based on the algorithm [16] for enumerating all simple paths. Then, we investigate the complexity

blowup from an FA A to its simple-regular language FA A′. We also consider when A is a

Thompson automaton or a position automaton since these two constructions are popular.

2 Preliminaries

Let Σ denote a finite alphabet of characters and Σ∗ denote the set of all strings over Σ.

A language over Σ is any subset of Σ∗. The character ∅ denotes the empty language and

the character λ denotes the null string. Given two strings x and y in Σ∗, x is said to be

an outfix of y if there is a string w such that y = x_1 w x_2, where x = x_1 x_2. For example,

abe is an outfix of abcde. Given a set X of strings over Σ, X is outfix-free if no string in

X is an outfix of any other string in X.
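The outfix definition can be checked directly by trying every split of x into x_1 x_2. The following is a minimal sketch; the function names is_outfix and is_outfix_free are illustrative and do not come from the paper.

```python
def is_outfix(x: str, y: str) -> bool:
    """Return True if y = x1 + w + x2 for some split x = x1 + x2 and some string w."""
    if len(y) < len(x):
        return False
    # Try every split point of x. Because len(y) >= len(x), the matched
    # prefix x1 and suffix x2 of y cannot overlap; w is what lies between.
    return any(y.startswith(x[:i]) and y.endswith(x[i:])
               for i in range(len(x) + 1))


def is_outfix_free(strings: set[str]) -> bool:
    """Return True if no string is an outfix of a different string in the set."""
    return not any(is_outfix(x, y)
                   for x in strings for y in strings if x != y)
```

For instance, is_outfix("abe", "abcde") holds with x_1 = ab, w = cd and x_2 = e, so the set {abe, abcde} is not outfix-free.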

An FA A is specified by a tuple (Q,Σ,δ,s,F), where Q is a finite set of states, Σ is an

input alphabet, δ ⊆ Q × Σ × Q is a (finite) set of transitions, s ∈ Q is the start state and

F ⊆ Q is a set of final states. Let |Q| be the number of states in Q and |δ| be the number

of transitions in δ. Then, the size |A| of A is |Q| + |δ|. Given a transition (p,a,q) in δ,

where p,q ∈ Q and a ∈ Σ, we say that p has an out-transition and q has an in-transition.

Furthermore, p is a source state of q and q is a target state of p. A string x over Σ is

accepted by A if there is a labeled path from s to a final state in F that spells out x.

Thus, the language L(A) of an FA A is the set of all strings spelled out by paths from s

to a final state in F. We assume that A has only useful states; that is, each state of A

appears on some path from the start state to some final state.
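As an illustration of the tuple (Q,Σ,δ,s,F), the sketch below stores δ as a set of triples (p,a,q) and decides acceptance by tracking the set of states reachable on each prefix. The class name FA and its field names are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FA:
    states: frozenset    # Q
    alphabet: frozenset  # Σ
    delta: frozenset     # δ, a set of (p, a, q) triples
    start: object        # s
    finals: frozenset    # F

    def size(self) -> int:
        """|A| = |Q| + |δ|."""
        return len(self.states) + len(self.delta)

    def accepts(self, x: str) -> bool:
        """True if some labeled path from s to a final state spells out x."""
        current = {self.start}
        for a in x:
            current = {q for (p, b, q) in self.delta
                       if p in current and b == a}
        return bool(current & self.finals)
```

An automaton for a∗b, for example, has two states and the transitions (0,a,0) and (0,b,1), so its size is 4.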

3 Simple-regular languages

3.1 Simple-regular languages from regular expressions and FAs

Definition 1. Given a regular expression E, we define the simple-regular expression S(E)

of E as follows:

1. S(∅) = ∅.

2. S(λ) = λ.

3. S(a) = a, for a ∈ Σ.


4. S(E + F) = S(E) + S(F), where E and F are regular expressions.

5. S(E · F) = S(E) · S(F).

6. S(E∗) = λ + E.
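The six rules translate into a short recursion over an expression tree. The sketch below assumes a hypothetical encoding of regular expressions as nested tuples (the constructors sym, union, concat and star are not from the paper) and implements rule 6 exactly as it is stated above.

```python
# Hypothetical tuple encoding of regular expressions (illustration only).
EMPTY = ("empty",)    # ∅
LAMBDA = ("lambda",)  # λ


def sym(a):
    return ("sym", a)


def union(e, f):
    return ("union", e, f)


def concat(e, f):
    return ("concat", e, f)


def star(e):
    return ("star", e)


def simple(e):
    """Apply rules 1-6 of Definition 1 to the expression tree e."""
    tag = e[0]
    if tag in ("empty", "lambda", "sym"):  # rules 1-3
        return e
    if tag == "union":                     # rule 4
        return union(simple(e[1]), simple(e[2]))
    if tag == "concat":                    # rule 5
        return concat(simple(e[1]), simple(e[2]))
    if tag == "star":                      # rule 6 as stated: λ + E
        return union(LAMBDA, e[1])
    raise ValueError(f"unknown node: {tag!r}")
```

With this encoding, simple(concat(star(sym("a")), sym("b"))) yields the tree for (λ + a) · b.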

Given a regular expression E, L(S(E)) is the corresponding simple-regular language of

L(E). A language is often given by an FA. We define a simple-regular language from an FA as follows.

Definition 2. Given an FA A, we define the simple-regular language S(L(A)) of L(A)

to be a subset of L(A) such that each string is accepted by a simple path in A.
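For small automata, Definition 2 can be evaluated by brute force. The sketch below is a plain depth-first search over simple paths, not the matrix-based algorithm developed later in this section; the function name simple_language and the triple encoding of δ are assumptions. Its running time can be exponential in the number of states.

```python
def simple_language(delta, start, finals):
    """Collect every string spelled out by a simple path from start to a final state."""
    result = set()

    def dfs(state, visited, word):
        if state in finals:
            result.add(word)
        for (p, a, q) in delta:
            # A simple path never revisits a state, so skip visited targets.
            if p == state and q not in visited:
                dfs(q, visited | {q}, word + a)

    dfs(start, {start}, "")
    return result
```

For the two-state automaton of a∗b with transitions (0,a,0) and (0,b,1), the only simple accepting path spells b, so the result is {b}.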

3.2 Simple-regular languages from finite-state automata

We define a path to be simple if it does not have a cycle in A. Brüggemann-Klein and

Wood [2] defined the orbit of a state q and its gate states to characterize one-unambiguous

regular languages. The orbit O(q) of q in A is the strongly connected component of q;

that is, it is the set of states of A that can be reached from q and from which q can be

reached. We say that O(q) is trivial if it consists of only q and there are no transitions

from q to itself in A. A state q of A is a gate of its orbit O(q) if q is a final state or q

has an out-transition to a state outside O(q). Note that every nontrivial orbit contains a cycle. If we have

(p,a,q) in δ, where p ∉ O(q), q ∈ O(q) and a ∈ Σ, we call p an entry state of an orbit.

Thus, we compute simple paths for each pair of an entry state and a gate state in an orbit

in A.
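The notions of orbit, trivial orbit, gate and entry state can be made concrete as follows. This sketch computes orbits by plain reachability for brevity, which is slower than a linear-time strongly-connected-components algorithm; all function names are illustrative.

```python
def reachable(delta, q):
    """States reachable from q by a path of length >= 0."""
    seen, stack = {q}, [q]
    while stack:
        p = stack.pop()
        for (src, _a, dst) in delta:
            if src == p and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def orbit(delta, q):
    """O(q): states reachable from q and from which q can be reached."""
    return {p for p in reachable(delta, q) if q in reachable(delta, p)}


def is_trivial(delta, q, orb):
    """An orbit is trivial if it is {q} and q has no self-loop."""
    return orb == {q} and not any(p == q and r == q for (p, _a, r) in delta)


def gates(delta, finals, orb):
    """States of the orbit that are final or have a transition leaving the orbit."""
    return {q for q in orb
            if q in finals
            or any(p == q and r not in orb for (p, _a, r) in delta)}


def entry_states(delta, orb):
    """States outside the orbit with a transition into it."""
    return {p for (p, _a, q) in delta if q in orb and p not in orb}
```

With transitions (0,a,1), (1,b,2), (2,c,1), (2,d,3) and final state 3, the orbit of state 1 is {1,2}, its only gate is 2 and its only entry state is 0.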

Given an FA A = (Q,Σ,δ,s,F), we transform A into a new FA A′ such that L(A′) = L(A) and A′ is non-exiting (the final state has no out-transitions) and non-returning (the start state has no in-transitions) as follows: A′ = (Q ∪ {s′,f′}, Σ, δ ∪ {(s′,λ,s)} ∪ {(f_i,λ,f′) | f_i ∈ F}, s′, f′). Furthermore, if there is more than one transition between two states p and q in A, we combine these transitions into a single transition. For example, if (p,a,q) and (p,b,q) are in δ, we replace these two transitions with (p,a + b,q) in δ.
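The transformation can be sketched in a few lines. The sketch assumes that transition labels are strings, that "s'" and "f'" are not already state names, and that merged labels are written with " + "; the function name normalize is hypothetical.

```python
LAMBDA_LABEL = "λ"  # label used here for the null string (an assumption)


def normalize(states, delta, start, finals, new_start="s'", new_final="f'"):
    """Add a fresh start and final state, then merge parallel transitions."""
    extended = set(delta)
    extended.add((new_start, LAMBDA_LABEL, start))
    extended |= {(f, LAMBDA_LABEL, new_final) for f in finals}

    # Group labels by (source, target) so that each pair keeps one transition.
    labels = {}
    for (p, a, q) in extended:
        labels.setdefault((p, q), []).append(a)
    merged = {(p, " + ".join(sorted(ls)), q) for (p, q), ls in labels.items()}

    return set(states) | {new_start, new_final}, merged, new_start, new_final
```

The result has a single final state, and the labels on its transitions are regular expressions rather than single characters.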

We now have regular expressions instead of single characters in a transition of A.

We call an FA with regular expressions on its transitions an expression automaton (EA). Thus, we have an

EA A = (Q,Σ,δ,s,f) such that there is at most one transition between two states in Q

and A is non-exiting and non-returning. If a regular expression E in δ is not simple, then

we replace E with its simple-regular expression E′. If there are self-loops in A, we remove

self-loops since a simple path cannot pass through a self-loop. For a formal definition

and more details on EAs, refer to Han and Wood [9]. EAs are a generalization of FAs;

they allow regular languages on transitions instead of single characters. If we only allow

characters on transitions, then A is an FA and if we only allow strings, then A is a

generalized automaton [4, 5].

Since the strongly connected components of a directed graph can be computed in linear

time [1], we can identify all orbits of A in linear time in |A|. Once we identify orbits, then

we compute all simple paths for each pair of an entry state and a gate state.

3.3 Computing all simple paths

We design an algorithm that computes all simple paths between two vertices in a graph

based on the algorithm of Rubin [16]. Let G = (V,E) be a directed graph without multiple


edges or self-loops, where V is a set of vertices and E is a set of edges. Let |V| = n be the number of vertices of V and |E| = m be the number of edges of E. A path of length k in G is a nonempty sequence of vertices p = (v_0,v_1,...,v_k) such that (v_i,v_{i+1}) is an edge of G for 0 ≤ i < k. We say a path p is simple if all of its vertices are distinct.

Let I_i be the n-bit boolean vector with 0 in position i and 1 elsewhere. Let ∧ and ∨ be the bitwise boolean and and or operations. Given a path p = (v_0,v_1,...,v_k), we define the vertex vector of p to be an n-bit boolean vector such that the bits in positions v_0,v_1,...,v_k are 1 and the others are 0. We define the edge vector of p to be an m-bit boolean vector such that bit j is 1 if the edge e_j is in p and 0 otherwise. The path descriptor of p is an ordered pair d(p) = (v,e), where v is the vertex vector and e is the edge vector of p.

Figure 1: An example of an FA.

For example, we can represent the FA in Fig. 1 as a graph G = (V,E) such that

V = {1,2,3,4,5,6,7} and E = {e_1 = (1,2), e_2 = (1,3), e_3 = (1,4), e_4 = (2,5), e_5 = (3,7), e_6 = (4,6), e_7 = (5,7), e_8 = (6,7)}. Note that we assign a unique index number to each state and a unique edge index number to each out-transition. We make all out-transition indices of a state consecutive in an edge vector of G; for example,

the first three bits of an edge vector are out-transition indices of state 1 in Fig. 1. For a

path p = 1 → 3 → 7, the path descriptor d(p) is (1010001,01001000).
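The descriptor in this example can be reproduced mechanically. The sketch below writes both vectors as bit strings whose leftmost character corresponds to index 1, matching the notation used here; the function name path_descriptor is an assumption.

```python
def path_descriptor(path, vertices, edges):
    """Return (vertex vector, edge vector) of a path as bit strings.

    vertices and edges are given in index order, so position t of each
    string corresponds to the t-th vertex or edge.
    """
    on_path = set(path)
    used = set(zip(path, path[1:]))  # consecutive vertex pairs of the path
    v = "".join("1" if x in on_path else "0" for x in vertices)
    e = "".join("1" if edge in used else "0" for edge in edges)
    return v, e


# The graph of Fig. 1, with edges listed as e_1, ..., e_8.
VERTICES = [1, 2, 3, 4, 5, 6, 7]
EDGES = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 7), (4, 6), (5, 7), (6, 7)]
```

Calling path_descriptor([1, 3, 7], VERTICES, EDGES) gives the pair ("1010001", "01001000").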

Lemma 1 (Rubin [16]). Given a path descriptor (v,e) in G = (V,E), we can compute

the sequence of vertices of the path in O(m) time, where m = |E|.

Since G represents an FA A, m = O(n²) if A is nondeterministic and, by Lemma 1, we need O(n²) time to construct a path from a path descriptor. We propose a more efficient method for computing the sequence of vertices from a given path descriptor in an NFA. Given a path descriptor d(p) = (v,e) of a simple path p in G, where v is a vertex vector and e is an edge vector, let us assume that we have computed the vertex sequence from its start vertex s to an intermediate vertex i and that i_o = i_1 i_2 ··· i_k are the bits of e at the out-transition indices of i. Note that since p is a simple path, exactly one bit in i_o is 1 and the other bits are 0. We search for the bit with 1.

We notice that bit operations (for example, shift, and, or and not) take constant time [15]. We use binary search to find the bit with 1 in i_o = i_1 i_2 ··· i_k using these bit operations. Assume that k is even. We shift i_1 i_2 ··· i_k to the left by k/2 and append k/2 0s so that we have i′ = i_{k/2+1} ··· i_k 000···0. If i′ = 0, then the bit with 1 is in i_1 i_2 ··· i_{k/2}. Otherwise, the bit with 1 is in i_{k/2+1} ··· i_k. For example, if i_o is 00000100, then i′ = 01000000 and we know that the bit with 1 must be in i_5 i_6 i_7 i_8 = 0100.


We repeat this procedure recursively until we find the bit with 1. It takes at most ⌈log k⌉ iterations to find the bit with 1 since the procedure is a binary search, where k is the number of out-transitions from i in G. The length of a simple path can be at most n and the number of out-transitions from a vertex can be at most n, so recovering the whole path takes at most n · ⌈log n⌉ iterations.

Lemma 2. Given a path descriptor (v,e) in G = (V,E), we can compute the sequence of vertices of the path in O(n log n) worst-case time, where n = |V|.

Since m = O(n²) in NFAs, Lemma 2 is an improvement from O(n²) time to O(n log n) time compared with Lemma 1.
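The binary search behind Lemma 2 can be sketched on a single k-bit word. The sketch tests the right half of the current window with a mask instead of shifting the word left, which gives the same answer to "is the 1 in the right half?"; positions are counted from the left starting at 1, and the function name find_single_one is hypothetical.

```python
def find_single_one(bits: int, k: int) -> int:
    """Return the position (1-based, from the left) of the only 1 in a k-bit word."""
    # Exactly one bit must be set, and it must lie within the k-bit word.
    assert bits != 0 and bits & (bits - 1) == 0 and bits < (1 << k)
    lo, hi = 1, k  # the 1 is known to lie in positions lo..hi
    while lo < hi:
        mid = (lo + hi) // 2
        # Mask covering positions mid+1..k; the 1 is never to the right of hi.
        right_mask = (1 << (k - mid)) - 1
        if bits & right_mask:
            lo = mid + 1  # the 1 is in the right half of the window
        else:
            hi = mid      # the 1 is in the left half of the window
    return lo
```

For the word 00000100 with k = 8, the loop runs three times and returns position 6.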

The algorithm of Rubin [16] is based on matrix multiplication. He showed that we

can enumerate all simple paths in a given directed graph G in O(n³) matrix operations¹,

where n is the number of vertices in G.

EnumerateAllSimplePath (G = (V,E))
    Initialize an n × n matrix D, where n = |V|
    for (i,j) ∈ E
        D(i,j) = {d((i,j))}
    for i = 1 to n
        for j = 1 to n
            for k = 1 to n
                for each (v,e) ∈ D(j,i) and (w,f) ∈ D(i,k)
                    if v ∧ w ∧ I_i = 0 then
                        add (v ∨ w, e ∨ f) into D(j,k)

Figure 2: The algorithm of Rubin [16] that enumerates all simple paths of a given graph.

Each entry D(i,j) contains all simple path descriptors from i to j in G.
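A minimal Python sketch of the procedure in Fig. 2 follows. It assumes 0-based vertices and uses Python integers as bit vectors: bit u of a vertex vector stands for vertex u and bit t of an edge vector stands for the t-th edge of the input list. These encodings and the function name are illustrative choices, not part of Rubin's presentation.

```python
def enumerate_all_simple_paths(n, edges):
    """Return a matrix D where D[j][k] lists the descriptors (v, e) of all
    simple paths from vertex j to vertex k.

    The graph has vertices 0..n-1 and no self-loops or multiple edges.
    """
    full = (1 << n) - 1
    D = [[[] for _ in range(n)] for _ in range(n)]
    for t, (u, w) in enumerate(edges):
        D[u][w].append(((1 << u) | (1 << w), 1 << t))

    for i in range(n):               # i is the vertex where two paths are joined
        mask_i = full & ~(1 << i)    # I_i: 0 in position i, 1 elsewhere
        for j in range(n):
            for k in range(n):
                # Iterate over copies so that appending to D[j][k] is safe.
                for (v, e) in list(D[j][i]):
                    for (w, f) in list(D[i][k]):
                        # The two paths may share only the vertex i.
                        if v & w & mask_i == 0:
                            D[j][k].append((v | w, e | f))
    return D
```

For the graph of Fig. 1 renumbered from 0, the entry for the pair (0,6) holds three descriptors, one for each of the paths 1 → 2 → 5 → 7, 1 → 3 → 7 and 1 → 4 → 6 → 7.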

3.4 Computing simple-regular languages

Given an EA A = (Q,Σ,δ,s,f), we compute the simple path matrix D for A using

EnumerateAllSimplePath (EASP) in Fig. 2. Then, for each pair of an entry state i and

a gate state j of an orbit, we compute all simple paths from D(i,j). For each path descriptor in D(i,j), we construct the corresponding simple path and compute the regular expression E for the simple path. Then, we add a new transition (i,E,j) into δ. After we have computed all simple paths for all pairs of an entry state and a gate state of an orbit O(j), we remove all states and transitions in O(j) except gate states.

In Fig. 3, we construct the simple path matrix using EASP and compute all simple paths from 1 to 2 and from 1 to 3. Note that there is only one nontrivial orbit in the EA A

¹ Note that this is not the total running time to compute all simple paths. There can be an exponential number of simple paths in G.
