Citation: Al Serhali, A.; Niehren, J. Complete Subhedge Projection for Stepwise Hedge Automata. Algorithms 2024, 17, 339. https://doi.org/10.3390/a17080339

Academic Editors: Henning Fernau and Klaus Jansen

Received: 29 January 2024
Revised: 23 July 2024
Accepted: 23 July 2024
Published: 2 August 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Complete Subhedge Projection for Stepwise Hedge Automata
Antonio Al Serhali * and Joachim Niehren *
Inria Center, University of Lille, 59000 Lille, France
*Correspondence: antonio.al-serhali@inria.fr (A.A.S.); joachim.niehren@inria.fr (J.N.)
This paper is an extended version of our paper published at the International Symposium on Fundamentals of Computation Theory, Trier, Germany, 18–21 September 2023.
Abstract: We demonstrate how to evaluate stepwise hedge automata (SHAs) with subhedge projection while completely projecting irrelevant subhedges. Since this requires passing finite state information top-down, we introduce the notion of downward stepwise hedge automata. We use them to define in-memory and streaming evaluators with complete subhedge projection for SHAs. We then tune the evaluators so that they can decide on membership at the earliest time point. We apply our algorithms to the problem of answering regular XPath queries on XML streams. Our experiments show that complete subhedge projection of SHAs can indeed speed up earliest query answering on XML streams so that it becomes competitive with the best existing streaming tools for XPath queries.
Keywords: automata; projection algorithm; streaming algorithm; XML
1. Introduction
Hedges are sequences of letters from the alphabet and trees ⟨h⟩, where h is again a hedge. Hedges abstract from the syntactical details of XML or JSON documents while still being able to represent them. The linearization of a hedge is a nested word that can be written in a file. In this article, we study the problem of hedge pattern matching, i.e., whether a given hedge h belongs to a given regular hedge language L. Regular hedge languages can be defined in XPath or JSONPath in particular.
Projection is necessary for the efficiency of many algorithms on words, trees, hedges, or nested words. Intuitively, an algorithm is projecting if it visits only a fragment of the input structure: in the best case, the part that is relevant to the problem under consideration. The relevance of projection for XML processing was already noticed by [1–4]. Saxon's in-memory evaluator, for instance, projects input XML documents relative to an XSLT program, which contains a collection of XPath queries to be answered simultaneously [5]. The QuiXPath tool [4] evaluates XPath queries in streaming mode with subtree and descendant projection. Projection during the evaluation of JSONPath queries on JSON documents in streaming mode is called fast-forwarding [6].
Projecting in-memory evaluation assumes that the full graph of the input hedge is
constructed beforehand. Nevertheless, projection may still save time if one has to match
several patterns on the same input hedge or if the graph was constructed for different
reasons anyway. In streaming mode, the situation is similar: the whole input hedge on
the stream needs to be parsed, but the evaluators need to inspect only nodes that are not
projected away. Given that pure parsing is two to three orders of magnitude faster than pattern evaluation, the time gained by projection may be considerable.
The starting point of this article is the notion of subhedge irrelevance of positions of a hedge h with respect to a hedge language L that we introduce. These are positions of h where, for all possible continuations, inserting any subhedge does not affect membership to L. We contribute an algorithm for hedge pattern matching with complete subhedge projection. Our algorithm decides hedge pattern matching while completely projecting away the subhedges located at irrelevant subhedge positions. In other words, it decides whether a hedge h matches a pattern L without visiting any subhedge of h that is located at a position that is irrelevant with respect to L.
Regular hedge languages can be represented by nested regular expressions, which can be derived from regular XPath queries [7], or by some kind of hedge automata, which can be compiled from nested regular expressions. We use stepwise hedge automata (SHAs) [8], a recent variant of standard hedge automata, which, in turn, date back to the 1960s [9,10]. SHAs mix the bottom-up processing of standard tree automata with the left-to-right processing of finite word automata (DFAs). They neither support top-down processing nor have an explicit stack, unlike nested word automata (NWAs) [11–15]. SHAs have a good notion of bottom-up and left-to-right determinism, avoiding the problematic notion of determinism for standard hedge automata and the too-costly notion of determinism for NWAs. Furthermore, compilers from nested regular expressions to SHAs are available [8], and determinization algorithms [16] produce small SHAs for all regular XPath queries in the existing practical benchmarks [7,17].
The motivation for the present work is to add subhedge projection to a recent streaming algorithm for deterministic SHAs that can evaluate regular XPath queries in the earliest manner. Most alternative streaming approaches avoid earliest query answering by considering sublanguages of regular queries without delays, so that node selection depends only on the past of the node but not on its future [18,19], or by admitting late selection [20]. In particular, it was recently demonstrated that earliest query answering for regular monadic queries defined by deterministic SHAs [21] has a lower worst-case complexity than for deterministic NWAs [22]. Thereby, earliest query answering for regular queries became feasible in practice for the first time. On the other hand, it is still experimentally slower than the best existing non-earliest approaches for general regular monadic queries on XML streams (see [23] for a comparison), since the latter support projection. How projection could be done for SHAs had not been studied before the conference version of the present article [24].
Consider the example of the XPath filter [self::list/child::item] that is satisfied by an XML document if its root is a list element that has an item child. With the encoding of XML documents as hedges chosen in the present article, satisfying hedges have the form ⟨list · h1 · ⟨item · h2⟩ · h3⟩ · h4 for some hedges h1, h2, h3, and h4. When evaluating this filter on a hedge, it is sufficient to inspect the first subtree for having the root label list and then all its children until one with the root label item is found. The subhedges of these children, i.e., the hedge h2 and the subhedges of the top-level trees in h1 and h3, can be projected away, as well as the hedge h4: they do not have to be inspected for evaluating this filter. However, one must memorize whether the level of the current node is 0, 1, or greater. This level information can be naturally updated in a top-down manner. The evaluators of SHAs, however, operate bottom-up and left-to-right exclusively. Therefore, projecting evaluators for SHAs need to be based on more general machines. We propose downward stepwise hedge automata (SHA↓s), a variant of SHAs that supports top-down processing. They are basically Neumann and Seidl's pushdown forest automata [25], except that they are applied to unlabeled hedges instead of labeled forests. Furthermore, NWAs are known to operate in basically the same manner on nested words [26] while allowing for slightly more general visible pushdowns. We then distinguish subhedge projection states for SHA↓s and show how to use them to evaluate SHAs with subhedge projection, both in memory and in streaming mode.
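To illustrate how little information this example filter really needs, the following is a hand-written streaming check for [self::list/child::item] over a nested word given as a token list, with '<' and '>' standing for the parentheses ⟨ and ⟩. It only sketches the projection idea by counting brackets to skip subhedges below level 1; it is not the paper's SHA↓ construction, and the function name is ours.

```python
def matches_filter(tokens):
    """Check [self::list/child::item] on a nested word (list of tokens
    '<', '>', and letters), skipping projected subhedges by bracket counting."""
    it = iter(tokens)
    if next(it, None) != '<' or next(it, None) != 'list':
        return False                      # root tree must carry the label list
    while True:
        t = next(it, None)
        if t is None or t == '>':
            return False                  # root closed: no item child found
        if t == '<':                      # a child tree: inspect its label only
            label = next(it, None)
            if label == 'item':
                return True
            # project away the remainder of this child subtree
            depth = 0 if label == '>' else (2 if label == '<' else 1)
            while depth > 0:
                u = next(it)
                depth += 1 if u == '<' else (-1 if u == '>' else 0)
        # letters at level 1 are simply skipped
```

For instance, `matches_filter(['<', 'list', '<', 'item', 'F', 'C', 'T', '>', '>'])` succeeds without ever looking inside the item child.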
As a first contribution, we present the safe-no-change compiler from SHAs to SHA↓s that can distinguish appropriate subhedge projection states. The idea is that the safe-no-change SHA↓ can distinguish contexts in which the states of the SHA will safely not change. For instance, the XPath filter [self::list/child::item] can be defined by a deterministic SHA operating bottom-up, which our compiler maps to an SHA↓ operating top-down. The context information made explicit by top-down processing is about the levels of the states. This permits us to distinguish projection states starting from level 2, in which subhedges can be ignored. We prove the soundness of our compiler based on a nontrivial invariant that we establish (we note that the proof required an adaptation of the original compiler from FCT [24]). We also present a counter-example showing that the safe-no-change projection is not complete for subhedge projection. It shows that a subhedge may be irrelevant even though its state may still be changing.
As a second contribution, we propose the congruence projection algorithm. It again compiles SHAs to SHA↓s but relies on congruence relations of automata states. We then prove that congruence projection yields not only a sound algorithm but that this algorithm is also complete for subhedge projection, i.e., all strongly irrelevant subhedges (see Definition 9) are evaluated into a looping state. Congruence projection starts on the top level of hedges with the Myhill–Nerode congruence that is well known from automata minimization.
For word languages L, this congruence identifies prefixes v that have the same residual language v⁻¹(L) = {w | v·w ∈ L}. A prefix v is then irrelevant if its residual language v⁻¹(L) is either universal or empty. In the case of hedges, which extend words by adding hierarchical nesting via a tree constructor, the congruence must be adapted when moving down into subtrees.

We show that both compilers may increase the size of the automata exponentially since they must be able to convert bottom-up deterministic automata on monadic trees into top-down deterministic automata. This is the same as inverting deterministic automata on words, which is well known to raise an exponential blow-up in the worst case (e.g., for the family of languages (a+b)*·a·(a+b)ⁿ, where n ∈ ℕ, the minimal deterministic left-to-right automaton has 2ⁿ⁺¹ states, while the minimal deterministic right-to-left automaton has n + 1 states).
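The blow-up for this family can be observed concretely by determinizing a small NFA for (a+b)*·a·(a+b)ⁿ with the textbook subset construction. This is only a sketch to illustrate the exponential growth; the helper names are ours.

```python
def nfa_for(n):
    # NFA for (a+b)* · a · (a+b)^n over {a, b}: state 0 loops on both letters
    # and guesses the distinguished a; states 1..n+1 count the suffix length.
    delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, n + 1):
        for c in 'ab':
            delta[(i, c)] = {i + 1}
    return delta

def reachable_subsets(delta, start):
    # textbook subset construction, collecting the reachable subset states
    init = frozenset(start)
    seen, todo = {init}, [init]
    while todo:
        s = todo.pop()
        for c in 'ab':
            t = frozenset(q2 for q in s for q2 in delta.get((q, c), ()))
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

n = 3
print(len(reachable_subsets(nfa_for(n), {0})))  # 2**(n+1) = 16
```

Each reachable subset records which of the last n + 1 letters were an a, so exactly 2ⁿ⁺¹ subsets arise.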
The exponential explosion can be avoided, when one is interested in the evaluation of hedges by automata with subhedge projection, by not constructing the complete SHA↓s statically. Instead, only the required part of the SHA↓s may be constructed on the fly when evaluating a hedge.
Our third contribution is the refinement of the two previous compilers so that they can also distinguish safe states for selection at the earliest position. For this, we combine these compilers with a variant of the earliest membership tester of [21] that operates in memory by compiling SHA↓s instead of NWAs. Furthermore, membership failure is detected at the earliest position, too. In this way, we obtain the earliest in-memory membership tester for deterministic SHAs.

Our fourth contribution is to show how to run the previous in-memory evaluators in streaming mode while reading the linearization of the hedge into a nested word as an input. Thereby, we can improve on the recent earliest streaming membership tester behind the AStream tool [21] by adding subhedge projection to it.
Our fifth and last contribution is an implementation and experimental evaluation of the streaming earliest query answering algorithm for dSHAs, obtained by introducing subhedge projection into the AStream tool [21]. We implemented both safe-no-change and congruence projection. For this, we lifted our earliest membership tester with subhedge projection to an earliest query answering algorithm for monadic queries. For the experimentation, we started with deterministic SHAs constructed with the compiler from [8] for the forward regular XPath queries of the XPathMark benchmark [27] and real-world XPath queries [7]. It turns out that congruence projection projects much more strongly than safe-no-change projection for at least half of the benchmark queries. It reduces the running time for all regular XPath queries considerably since large parts of the input hedges can be projected away. In our benchmark, the projected percentage ranges from 75.7% to 100% of the input stream. For XPath queries that contain only child axes, the earliest query answering algorithm of AStream with congruence projection becomes competitive in efficiency with the best existing streaming tool, QuiXPath [23], which is not always earliest for some queries. Our current implementation of the earliest congruence projection in AStream is by a factor of 1.3 to 2.9 slower than QuiXPath on these benchmark queries. The improvement is smaller for XPath queries with the descendant axis, where less subhedge projection is possible. Instead, some kind of descendant projection would be needed. Still, even in the worst case of our benchmark, the earliest congruence projection with AStream is only by a factor of 13.8 slower than QuiXPath.
Outline. We start with preliminaries on hedges and nested words in Section 2. In Section 3, we recall the problem of hedge pattern matching and formally define irrelevant subhedges. In Section 4, we recall the definition of SHAs, introduce SHA↓s, and discuss their in-memory evaluators. In Section 5, we introduce the notion of subhedge projection states, define when an SHA↓ is complete for subhedge projection, and present an in-memory evaluator for SHA↓s with subhedge projection. In Section 6, we introduce the safe-no-change projection, state its soundness, and show its incompleteness. In Section 7, we introduce congruence projection and prove it to be sound and complete for subhedge projection. The earliest in-memory evaluators for SHA↓s with subhedge projection follow in Section 8. Streaming variants of our in-memory evaluators are derived systematically in Section 9. Section 10 discusses how to lift our algorithms from membership testing to monadic queries. Section 11 discusses our practical experiments. Section 12 discusses further related work on automata notions related to SHAs and SHA↓s. Appendix A presents the reformulation of the earliest query-answering algorithm from [21] based on SHA↓s and with schema restrictions. Appendix B provides the soundness proof for the safe-no-change projection.
Publication Comments. The original version of this journal article was published at the FCT conference [24]. Compared to the previous publication, the following contributions are new:
- The definition of subhedge irrelevant prefixes of nested words and the definition of completeness for subhedge projection (Section 3).
- The addition of schema restrictions throughout the paper and the notion of schema completeness (Section 4.3).
- The notion of completeness for subhedge projection (Section 5.2).
- The soundness proof of the safe-no-change projection algorithm in Section 6.2, moved to Appendix B.
- The congruence projection algorithm and the proof of its soundness and completeness for subhedge projection (Section 7).
- A systematic method to add subhedge projection to an earliest SHA (Section 8).
- A systematic method to reduce the correctness of streaming automata evaluators to that of the corresponding in-memory evaluators (Section 9).
- A discussion of how to deal with monadic queries while exploiting the availability of schema restrictions (Section 10).
- An implementation of congruence projection with experimental results for XPath evaluation on XML streams (Section 11).
- A longer discussion of related work (Section 12).

In Appendix A, we also show how to obtain earliest dSHA↓s from dSHAs by adapting the compiler from [21] mapping dSHAs to earliest dNWAs.
2. Preliminaries
Let A and B be sets. If A is a subset of B, then the complement of A in B is denoted by Ā = B \ A, while keeping the set relative to which the complement is taken implicit. The domain of a binary relation r ⊆ A × B is dom(r) = {a ∈ A | ∃b ∈ B. (a, b) ∈ r}. A partial function f : A ⇀ B is a binary relation f ⊆ A × B that is functional. A total function f : A → B is a partial function f : A ⇀ B with dom(f) = A.
Words. Let ℕ be the set of natural numbers, including 0. Let the alphabet Σ be a set. The set of words over Σ is Σ* = ∪_{n∈ℕ} Σⁿ. A word (a1, ..., an) ∈ Σⁿ, where n ∈ ℕ, is written as a1...an. We denote the empty word of length 0 by ε ∈ Σ⁰, and by v1·v2 ∈ Σ* the concatenation of two words v1, v2 ∈ Σ*.
If v = u·v′·w is a word, then we call u a prefix of v, v′ a factor of v, and w a suffix of v. Given any subset L ⊆ Σ*, we denote the set of prefixes of words in L by prefs(L) and the set of suffixes of words in L by suffs(L).
For any subalphabet Σ′ ⊆ Σ, the projection of a word v ∈ Σ* to Σ′ is the word proj_{Σ′}(v) in (Σ′)* that is obtained from v by removing all letters from Σ \ Σ′.
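The word-level operations just defined can be written down directly; a minimal sketch with our own function names, representing words as Python strings:

```python
def prefs(w):
    # all prefixes of the word w, including ε and w itself
    return [w[:i] for i in range(len(w) + 1)]

def suffs(w):
    # all suffixes of the word w, including w itself and ε
    return [w[i:] for i in range(len(w) + 1)]

def proj(v, sub):
    # projection of the word v to the subalphabet sub
    return ''.join(a for a in v if a in sub)
```

For instance, `proj('abcab', {'a', 'b'})` removes every letter outside the subalphabet, yielding `'abab'`.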
Hedges. Hedges are sequences of letters and trees ⟨h⟩ with a hedge h. More formally, a hedge h ∈ H_Σ has the following abstract syntax:

    h, h′ ∈ H_Σ ::= ε | a | ⟨h⟩ | h·h′    where a ∈ Σ

We assume ε·h = h·ε = h and (h·h1)·h2 = h·(h1·h2). Therefore, we consider any word in Σ* as a hedge in H_Σ, i.e., Σ* ∋ aab = a·a·b ∈ H_Σ. Any hedge can be converted to or stored as a graph. For the signature Σ = {list, item, A, ..., Z}, the graph of an example hedge in H_Σ is shown in Figure 1. It encodes the sequence of XML documents in Figure 2:
Figure 1. The graph of the hedge ⟨list · ⟨item · F · C · T⟩ · ⟨item · C · I · A · A⟩⟩ · ⟨list · ⟨item · C · M · S · B⟩⟩ is a sequence of trees.
<list>
<item>FCT</item>
<item>CIAA</item>
</list>
<list>
<item>CMSB</item>
</list>
Figure 2. The sequence of XML documents encoded by the hedge in Figure 1.
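The encoding relating Figures 1 and 2 can be sketched as follows, representing a tree ⟨h⟩ as a nested Python list and reading a tree's first letter as the XML element name. This is a simplification for illustration; the article's actual encoding is richer, as discussed later for attributes, comments, and texts.

```python
def to_xml(hedge):
    # hedge: a list whose items are letters (str) or trees (nested lists);
    # a tree's first letter is read as the XML element name
    parts = []
    for item in hedge:
        if isinstance(item, list):
            name, content = item[0], item[1:]
            parts.append(f"<{name}>{to_xml(content)}</{name}>")
        else:
            parts.append(item)
    return "".join(parts)

# the hedge of Figure 1
h = [['list', ['item', 'F', 'C', 'T'], ['item', 'C', 'I', 'A', 'A']],
     ['list', ['item', 'C', 'M', 'S', 'B']]]
print(to_xml(h))  # prints the two XML documents of Figure 2, concatenated
```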
Nested Words. A nested word over Σ is a word over the alphabet Σ̂ = Σ ∪ {⟨, ⟩} in which all parentheses are well-nested. Intuitively, this means that for any opening parenthesis, there is a matching closing parenthesis and vice versa.

The set of nested words over Σ can be defined as the subset of words over Σ̂ that are linearizations of hedges, where the linearization function nw : H_Σ → (Σ ∪ {⟨, ⟩})* is defined as follows:

    nw(ε) = ε,   nw(⟨h⟩) = ⟨ · nw(h) · ⟩,   nw(a) = a,   nw(h·h′) = nw(h) · nw(h′).

So, the set of all nested words over Σ is nw(H_Σ). Let hdg be the inverse of the injective function nw restricted to its image, i.e., the mapping from nested words to hedges with hdg(nw(h)) = h for all h ∈ H_Σ.
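The linearization nw and its inverse hdg can be sketched directly from the equations above, again representing trees as nested Python lists and the parentheses as tokens '<' and '>' (our own encoding):

```python
def nw(h):
    # linearize a hedge (list of letters and trees-as-lists) into a nested word
    out = []
    for item in h:
        if isinstance(item, list):               # a tree ⟨h'⟩
            out += ['<'] + nw(item) + ['>']
        else:                                    # a letter
            out.append(item)
    return out

def hdg(w):
    # inverse of nw on its image: parse a well-nested word back into a hedge
    stack = [[]]
    for t in w:
        if t == '<':
            stack.append([])
        elif t == '>':
            sub = stack.pop()
            stack[-1].append(sub)
        else:
            stack[-1].append(t)
    assert len(stack) == 1, "word is not well-nested"
    return stack[0]
```

The round trip `hdg(nw(h)) == h` holds for every hedge h in this representation, matching the equation above.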
Nested Word Prefixes. Prefixes, suffixes, and factors of nested words may not be nested words themselves. For instance, the hedge h = a·⟨b⟩·c has the linearization nw(h) = a⟨b⟩c. Its prefix a⟨b is not well-nested since it has a dangling opening parenthesis. Its suffix b⟩c is also not well-nested since it has a dangling closing parenthesis.

Any algorithm that traverses a hedge h in a top-down and left-to-right manner inspects all the prefixes of nw(h). Any prefix v of nw(h) not only distinguishes a position of the hedge h but also specifies the part of the hedge that is located before this position in the linearization nw(h). An example is illustrated graphically in Figure 3. This particularly holds for streaming algorithms that receive the nested word nw(h) as an input on a stream that may be inspected only once from left to right. But it holds equally for in-memory algorithms that receive the input hedge h as a hierarchical structure whose graph is stored in memory and then traverse the graph top-down and left-to-right. Any ⟨⟩ node will then be visited twice: once when reading the opening parenthesis ⟨ and going down into a subhedge, and another time when going up to the closing parenthesis ⟩ after having processed the subhedge.

Figure 3. The part of the hedge distinguished by the nested word prefix ⟨list · ⟨item · F · C · T⟩ · ⟨item · C · I · A · A⟩⟩ · ⟨list · ⟨.
3. Problem
We formally define the pattern-matching problem for hedges and the notion of subhedge irrelevance, which allows us to characterize the parts of the input hedge that do not need to be visited during pattern matching.
3.1. Hedge Pattern Matching
We are interested in algorithms that match a hedge pattern against a hedge. We start with algorithms that test whether such a matching exists. This is sometimes called filtering or, alternatively, Boolean query answering. In Section 10, we consider the more general problem of returning the set of matchings. This problem is sometimes called monadic query answering. Our experiments in Section 11 apply monadic query answering for regular XPath queries.
We start from hedge patterns that are nested regular expressions with letters in the signature Σ. For these, we fix a set V of recursion variables and assume that it is disjoint from Σ.

    e, e′ ∈ nRegExp_Σ ::= ε | a | z | e·e′ | e+e′ | e* | ⟨e⟩ | µz.e    where a ∈ Σ, z ∈ V

Each nested regular expression e ∈ nRegExp_Σ is satisfied by a subset of hedges JeK ⊆ H_{Σ∪V}. We recall the definition of this set only informally. The expression ε is satisfied only by the empty word, an expression a ∈ Σ only by the hedge a, and an expression z ∈ V only by the one-letter hedge z. An expression e·e′ with the concatenation operator and e, e′ ∈ nRegExp_Σ is satisfied by the concatenations of hedges satisfying e and e′, respectively. An expression e* with Kleene's star operator and e ∈ nRegExp_Σ is satisfied by repetitions of hedges satisfying e. The Kleene star provides for horizontal recursion on hedges. An expression ⟨e⟩ with the tree constructor is satisfied by the trees ⟨h⟩ whose subhedge h satisfies e. The µ-expression µz.e, where z ∈ V and e ∈ nRegExp_Σ, is satisfied by all hedges without the recursion variable z in which z is repeatedly replaced by a hedge satisfying e until it disappears. We adopt the usual restriction [28] that in any nested regular expression µz.e, all occurrences of z in e are guarded by a tree constructor ⟨·⟩. In this way, the µ-operator provides proper vertical recursion. Also note that, for defining sublanguages of H_Σ, we only need expressions in nRegExp_Σ in which all occurrences of recursion variables z are bound in the scope of a binder µz.

It is well known that nested regular expressions in nRegExp_Σ can define all regular sublanguages of hedges in H_Σ, i.e., they have the same expressiveness as hedge automata with alphabet Σ. Analogous results for regular tree expressions and tree automata are folklore [29].
Example 1 (Nested regular expressions of regular XPath filters). For instance, we can define subsets of hedge encodings of XML documents with two kinds of elements in Σ = {list, item} by expressions in nRegExp_Σ. For this article, we will use an encoding of XML documents which produces hedges that satisfy the following regular expression doc ∈ nRegExp_Σ, where z ∈ V:

    doc =def tree*    where    tree =def µz.⟨(list + item)·z*⟩

In our application, the encodings are a little richer in order to distinguish different types of XML nodes such as elements, attributes, comments, and texts. An XPath filter such as [self::list/child::item] can then be expressed by the nested regular expression:

    filter =def ⟨list · doc · ⟨item · doc⟩ · doc⟩ · doc

General compilers from regular downward XPath filter queries to nested regular expressions were proposed in [8]. An implementation was developed in the context of the AStream tool [21].
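The informal semantics above can be turned into a brute-force membership test by unfolding µ-expressions one level at a time, which terminates thanks to the guardedness restriction. This is only a sketch for small hedges, not one of the automata evaluators studied in this article; hedges are lists with trees as nested lists, and expressions are tagged tuples (our own encoding).

```python
def matches(e, h, env={}):
    # brute-force test of h ∈ JeK for expressions given as tagged tuples:
    # ('eps',), ('letter',a), ('var',z), ('cat',e1,e2), ('or',e1,e2),
    # ('star',e1), ('tree',e1), ('mu',z,e1)
    kind = e[0]
    if kind == 'eps':
        return h == []
    if kind == 'letter':
        return h == [e[1]]
    if kind == 'var':
        return matches(env[e[1]], h, env)
    if kind == 'or':
        return matches(e[1], h, env) or matches(e[2], h, env)
    if kind == 'cat':
        return any(matches(e[1], h[:i], env) and matches(e[2], h[i:], env)
                   for i in range(len(h) + 1))
    if kind == 'star':                 # horizontal recursion
        if h == []:
            return True
        return any(matches(e[1], h[:i], env) and matches(e, h[i:], env)
                   for i in range(1, len(h) + 1))
    if kind == 'tree':
        return len(h) == 1 and isinstance(h[0], list) and matches(e[1], h[0], env)
    if kind == 'mu':                   # vertical recursion: unfold one level
        return matches(e[2], h, {**env, e[1]: e})
    raise ValueError(kind)

# Example 1: tree = µz.⟨(list+item)·z*⟩ and doc = tree*
tree_e = ('mu', 'z',
          ('tree', ('cat', ('or', ('letter', 'list'), ('letter', 'item')),
                    ('star', ('var', 'z')))))
doc = ('star', tree_e)
```

For instance, `matches(doc, [['list', ['item']]])` holds, while a bare top-level letter such as `['list']` is rejected since doc only accepts sequences of trees.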
We will use schemas S ⊆ H_Σ to restrict the inputs of our pattern-matching problem. While we are mostly interested in regular schemas, we do not assume regularity globally, since safe-no-change projection does not rely on it. The pattern-matching problem for nested regular expressions with respect to a schema S ⊆ H_Σ receives the following inputs:
- a nested regular expression e ∈ nRegExp_Σ, and
- a hedge h ∈ S.
It then returns the truth value of the judgment h ∈ JeK. The input hedge h ∈ S may either be given by a graph that resides in memory or by a stream containing the linearization nw(h) of the input hedge, which can be read only once from left to right.
3.2. Irrelevant Subhedges
We define the concept of irrelevant occurrences of subhedges with respect to a given
hedge pattern. What this means depends on the kind of algorithm that we will use for
pattern matching. We will use algorithms that operate on the input hedge top-down,
left-to-right, and bottom-up. This holds for streaming algorithms in particular.
Intuitively, when the pattern-matching algorithm reaches a node top-down or left-to-
right whose subsequent subhedge is irrelevant, then it can jump over it without inspecting
its structure. What jumping means should be clear if the hedge is given by a graph that is
stored in memory. Notice that the full graph is to be constructed beforehand, even though
some parts of it may turn out irrelevant. Still, one may save a lot of time by jumping over irrelevant parts if either the graph was already constructed for other reasons or if many pattern-matching problems are to be solved on the same hedge.
In the case of streams, the irrelevant subhedge still needs to be parsed, but it will not
be analyzed otherwise. Most typically, the possible analysis is carried out by automata,
which may take two orders of magnitude more time than needed for parsing. Therefore,
not having to do any analysis may considerably speed up the streaming algorithm.
Definition 1. Let S ⊆ H_Σ be a schema and L ⊆ S a language satisfying this schema. We define diff_S^L as the least symmetric relation on prefixes of nested words u, u′ ∈ prefs(N_Σ) such that:

    diff_S^L(u, u′)  if  ∃w ∈ suffs(N_Σ). u·w ∈ nw(L) ∧ u′·w ∈ nw(S \ L).

A nested word prefix u is called subhedge relevant for L with schema S if there exist nested words v, v′ ∈ N_Σ such that diff_S^L(u·v, u·v′). Otherwise, the prefix u is called subhedge irrelevant for L with schema S.
So, diff_S^L(u, u′) states that u and u′ have continuations that behave differently with respect to L and S. Furthermore, a prefix u is subhedge irrelevant if language membership does not depend on hedge insertion at u, under the condition that schema membership is guaranteed. The case without schema restrictions is subsumed by the above definition by choosing S = H_Σ. In this case, the complement of diff_{H_Σ}^L is the Myhill–Nerode congruence of L, which is well-known in formal language theory from the minimization of DFAs (see any standard textbook on formal language theory or Wikipedia (Myhill–Nerode theorem: https://en.wikipedia.org/wiki/Myhill-Nerode_theorem, accessed on 23 July 2024)).

This congruence serves as the basis for Gold-style learning of regular languages from positive and negative examples (see [30] or Wikipedia (Gold-style learning in the limit: https://en.wikipedia.org/wiki/Language_identification_in_the_limit, accessed on 23 July 2024)).
Schema restrictions are needed for our application to regular XPath patterns.
Example 2 (Regular XPath filters). Consider the schema S = JdocK for hedges representing XML documents with signature Σ = {list, item}, and the regular hedge language L = JfilterK, where

    filter = ⟨list · tree* · ⟨item · tree*⟩ · tree*⟩ · tree*

is the nested regular expression from Example 1 for the XPath filter [self::list/child::item]. Recall that this nested regular expression can be applied to hedges representing XML documents only. The nested word prefix u = ⟨list · ⟨item is subhedge irrelevant for the language L with schema S. Note that its continuation ⟨list · ⟨item⟩⟩ with the suffix w = ⟩⟩ belongs to L. Hence, for any h ∈ H_Σ, if the continuation ⟨list · ⟨item · h⟩⟩ belongs to S = JdocK, then it also belongs to L. Nevertheless, for the hedge h1 = ⟨⟩, the continuation ⟨list · ⟨item · h1⟩⟩ does not belong to L since it does not satisfy the schema S = JdocK.

The prefix ⟨item is also irrelevant for the language L, even independently of the schema. This is because L does not contain any continuation of this prefix to some hedge. The prefix ⟨list is not irrelevant for the language L wrt S. This can be seen with the suffix w = ⟩, since ⟨list⟩ does not belong to L while the continuation ⟨list · h⟩ with h = ⟨item⟩ does belong to L, and both continuations satisfy the schema S = JdocK.
However, schema restrictions may also have surprising consequences that make the problem more tedious.

Example 3 (Surprising consequences of schema restrictions). Consider the signature Σ = {a}, the pattern L = {⟨a⟩}, and the schema S = L ∪ {⟨⟩·a}. The prefix ⟨ is indeed subhedge irrelevant for L and S. In order to see this, we consider all possible closing suffixes one by one. The smallest closing suffix is ⟩. Note that a hedge ⟨h⟩ ∈ L if and only if h = a. So, membership to L seems to depend on the subhedge h. However, also note that ⟨h⟩ ∈ S if and only if h = a. Therefore, when assuming that ⟨h⟩ belongs to the schema S, it must also belong to L. So, the pattern must match when assuming schema membership. The next closing suffix is ⟩·a. Note that a hedge ⟨h⟩·a ∈ S if and only if h = ε. However, ⟨h⟩·a ∉ L, so the pattern will not match for any subhedge h with this prefix when assuming that the input hedge satisfies the schema. The situation is similar for larger suffixes, so in no case does language membership depend on the subhedge at prefix ⟨ when assuming that the full hedge satisfies the schema S.
3.3. Basic Properties

We first show that subhedge irrelevant prefixes remain irrelevant when extended by the nested word of any hedge. This property should be expected from any reasonable notion of subhedge irrelevance.

Lemma 1. For any nested word prefix u, languages L ⊆ S ⊆ H_Σ, and nested word v ∈ N_Σ, if the prefix u is subhedge irrelevant for L with schema S, then the prefix u·v is so too.

Proof. Let u be subhedge irrelevant for L with schema S and v ∈ N_Σ a nested word. We fix arbitrary nested words v′, v′′ ∈ N_Σ and a nested word suffix w ∈ suffs(N_Σ) such that u·v·v′·w ∈ nw(S) and u·v·v′′·w ∈ nw(S). Since v·v′ and v·v′′ are also nested words, the subhedge irrelevance of u yields the following:

    u·v·v′·w ∈ nw(L) ⟺ u·v·v′′·w ∈ nw(L).

Hence, the nested word prefix u·v is subhedge irrelevant for L with schema S.
Definition 2. We call a binary relation D on prefixes of nested words a difference relation on prefixes if for all prefixes of nested words u, u′ ∈ prefs(N_Σ) and nested words v ∈ N_Σ:

    (u·v, u′·v) ∈ D ⟹ (u, u′) ∈ D.

Lemma 2. diff_S^L is a difference relation.
Proof. We must show for all prefixes u, u′ ∈ prefs(N_Σ) and nested words v ∈ N_Σ that (u·v, u′·v) ∈ diff_S^L implies (u, u′) ∈ diff_S^L. So, let u, u′ be prefixes of nested words and v be a nested word such that (u·v, u′·v) ∈ diff_S^L. Then, there exists a nested word suffix w such that

    u·v·w ∈ nw(L) ∧ u′·v·w ∈ nw(S \ L).

The suffix w̃ = v·w then satisfies u·w̃ ∈ nw(L) ∧ u′·w̃ ∈ nw(S \ L). Hence, (u, u′) ∈ diff_S^L as required.
In the schema-free case of words, the complement of diff_{H_Σ}^L is an equivalence relation that is known as the Myhill–Nerode congruence. In the case of schemas, however, the complement of diff_S^L may not be transitive; thus, it may not even be an equivalence relation. This might be surprising at first sight.
Example 4. Let Σ = {a, b}, L = {ε, b, aa}, and S = L ∪ {aab}. Then, (ε, a) ∉ diff^L_S since there is no continuation that extends both ε and a into the schema. For the same reason, we have (a, aa) ∉ diff^L_S. However, diff^L_S(ε, aa) holds since ε·b ∈ L and aa·b ∈ S\L. Thus, (ε, aa) ∈ diff^L_S, so the complement of diff^L_S is not transitive.
It should be noted for all languages L that the complement of diff^L_{H_Σ} is reflexive and transitive, so it is an equivalence relation and thus a congruence. For general schemas S, the complement of diff^L_S may not be transitive, as shown by Example 4, and may thus fail to be a congruence.
This indicates that schemas may make it more difficult to decide subhedge irrelevance. Indeed, the natural manner in which one might expect to eliminate schemas based on complements does not work, as confirmed by the following lemma (in particular, by the counterexample in the proof that property 1. does not imply property 2.):

Lemma 3. For any nested word prefix u ∈ prefs(N_Σ) and L, S ⊆ H_Σ, consider the following two properties:
1. u is subhedge irrelevant for L with schema S.
2. u is subhedge irrelevant for L ∪ S̄ with schema H_Σ, where S̄ = H_Σ \ S.
Then, property 2. implies property 1., but not always vice versa.
Proof. 1. ⇏ 2.: Here, a counterexample is presented. Let L = {ε} and S = {ε, a}. Then, prefix a is irrelevant for L with schema S. However, prefix a is relevant for L ∪ S̄ = H_Σ \ {a} with schema H_Σ, since for the nested words v = ε and v′ = a we have a·v ∉ L ∪ S̄ and a·v′ ∈ L ∪ S̄.

2. ⇒ 1.: Next, we assume property 2. Consider arbitrary nested words v, v′ ∈ N_Σ and a nested word suffix w ∈ suffs(N_Σ). It is sufficient to show that u·v·w ∈ nw(S) and u·v′·w ∈ nw(S) implies u·v·w ∈ nw(L) ⇔ u·v′·w ∈ nw(L). This implication holds trivially if u·v·w ∉ nw(S) or u·v′·w ∉ nw(S). Otherwise, we have u·v·w ∈ nw(S) and u·v′·w ∈ nw(S). Property 2., as assumed, then implies u·v·w ∈ nw(L ∪ S̄) ⇔ u·v′·w ∈ nw(L ∪ S̄), and for words in nw(S), membership in L ∪ S̄ coincides with membership in L. So, property 1. holds.
4. Hedge Automata

We recall the notion of stepwise hedge automata (SHAs), introduce their downward extension (SHA↓s), discuss schema-completeness for both kinds of automata, define subhedge projection states for SHA↓s, and show how to use them for evaluating SHA↓s with subhedge projection. We limit ourselves to in-memory evaluation in this section but will discuss streaming evaluation in Section 9. The relation to several closely related notions of automata on hedges, forests, or nested words is discussed in some more detail in Section 12.
4.1. Stepwise Hedge Automata

SHAs are a recent notion of automata for hedges [8] that mix bottom-up tree automata and left-to-right word automata in a natural manner. They extend stepwise tree automata [31] in that they operate on hedges rather than unranked trees, i.e., on sequences of letters and trees containing hedges. They differ from nested word automata (NWAs) [11,15] in that they process hedges directly rather than their linearizations to nested words.
Definition 3. A stepwise hedge automaton (SHA) is a tuple A = (Σ, Q, ∆, I, F) where Σ and Q are finite sets, I, F ⊆ Q, and ∆ = ((→a)_{a∈Σ}, ⟨⟩, @), where →a ⊆ Q × Q, ⟨⟩ ⊆ Q, and @ ⊆ (Q × Q) × Q. A SHA is deterministic, or equivalently a dSHA, if I and ⟨⟩ contain at most one element, and all relations (→a)_{a∈Σ} and @ are partial functions.
The set of states Q subsumes a subset I of initial states, a subset F of final states, and a subset ⟨⟩ of tree initial states. The transition rules in ∆ have three forms: if (q, q′) ∈ →a, then we have a letter rule that we write as q →a q′ in ∆. If (q, p, q′) ∈ @, then we have an apply rule that we write as q@p → q′ in ∆. Lastly, if q ∈ ⟨⟩ ⊆ Q, then we have a tree initial rule that we denote as ⟨⟩ → q in ∆.
In Figure 4, we illustrate the graph of a dSHA for the XPath filter [self::list/child::item]. The graph of a SHA is obtained and drawn as usual for state automata on words. The states are the nodes of the graph; in the example, Q = {0, 1, 2, 3, 4, 5}. The transition rules define the edges. The edge of a letter rule q →a q′ in ∆ is annotated by the letter a ∈ Σ. In the example, Σ = {list, item}. Any apply rule q@p → q′ in ∆ is drawn as an edge from q to q′ annotated by the state p ∈ Q. The initial state 0 ∈ I is indicated by an ingoing gray arrow, and the tree initial state 0 ∈ ⟨⟩ by an ingoing gray arrow that is annotated by the symbol ⟨⟩. Final states, such as 4 ∈ F, are indicated by a double circle.
Figure 4. A dSHA for the XPath filter [self::list/child::item] that is schema-complete for schema ⟦doc⟧.
4.1.1. Semantics

We next define the semantics of SHAs, i.e., the transitions on hedges that they define and the hedge language that they accept. For any hedge h ∈ H_Σ, we define the transition steps q →h q′ wrt ∆ such that for all q, q′, p, p′ ∈ Q, a ∈ Σ, and h, h′ ∈ H_Σ:

- if q →a q′ in ∆, then q →a q′ wrt ∆;
- if q →h q′ wrt ∆ and q′ →h′ q′′ wrt ∆, then q →h·h′ q′′ wrt ∆;
- q →ε q wrt ∆ for all q ∈ Q;
- if ⟨⟩ → p in ∆, p →h p′ wrt ∆, and q@p′ → q′ in ∆, then q →⟨h⟩ q′ wrt ∆.
The transitions can be used to evaluate hedges nondeterministically bottom-up and left-to-right using SHAs. The first three rules state how to evaluate sequences of trees and letters, as with a usual finite state automaton, while assuming that the trees were already evaluated to states. The last rule states how to evaluate a tree ⟨h⟩ from some state q to some state q′. This can be visualized as follows: one moves from q to q′ along an edge justified by ⟨⟩ → p, p →h p′, and q@p′ → q′. For any tree initial state p ∈ ⟨⟩, one has to evaluate the subhedge h nondeterministically to some state p′. For each p′ obtained, one then returns nondeterministically to a state q′ such that q@p′ → q′ in ∆.
A hedge is accepted by A if it permits a transition from some initial state to some final state. The language L(A) is the set of all accepted hedges:

L(A) = {h ∈ H_Σ | q →h q′ wrt ∆, q ∈ I, q′ ∈ F}.
4.1.2. Runs

Runs represent the whole history of a single choice of the nondeterministic evaluator on a hedge. For instance, the unique successful run of the dSHA in Figure 4 on the hedge h = ⟨h′⟩ with subhedge h′ = list·⟨item⟩ is the hedge 0·⟨0·list·1·⟨0·item·2⟩·3⟩·4, whose computation is illustrated in Figure 5. On the top level, the run justifies the transition 0 →h 4 wrt ∆ from the initial state 0 to the final state 4. When started in state 0, the subhedge h′ needs to be evaluated first, so the run on the upper level needs to be suspended until this is done. The subrun on h′ eventually justifies the transition 0 →h′ 3 wrt ∆. At this time point, the run of the upper level can be resumed and continue in state 4 since 0@3 → 4 in ∆.
Figure 5. The unique successful run of the dSHA in Figure 4 on the hedge ⟨list·⟨item⟩⟩ is the hedge 0·⟨0·list·1·⟨0·item·2⟩·3⟩·4, computed as illustrated above.
When drawing runs, we illustrate any suspension/resumption mechanism with a box, for instance, the box on the edge from 0 to 4. The box stands for a future value computed by the lower level, causing the upper level to suspend until its arrival. Eventually, the box is filled by state 3, so that the computation can be resumed on the upper level.
We next define runs of SHAs on hedges formally. Whether a hedge with letters and states R ∈ H_{Σ∪Q} is a ∆-run, or simply a run, on a hedge h ∈ H_Σ, written R ∈ run_∆(h), is defined by the following rules:

- if q →a q′ in ∆, then q·a·q′ ∈ run_∆(a);
- if q·R·q′ ∈ run_∆(h) and q′·R′·q′′ ∈ run_∆(h′), then q·R·q′·R′·q′′ ∈ run_∆(h·h′);
- q ∈ run_∆(ε) for all q ∈ Q;
- if ⟨⟩ → p in ∆, p·R·p′ ∈ run_∆(h), and q@p′ → q′ in ∆, then q·⟨p·R·p′⟩·q′ ∈ run_∆(⟨h⟩).
Note that if R ∈ run_∆(h), then h can be obtained by removing all states from R, i.e., proj_Σ(R) = h. Any run R ∈ run_∆(h) can be identified with a mapping from the prefixes of the nested word nw(h) to states in Q. Note that nw(h) = proj_Σ(nw(R)). For any prefix v of nw(h), there exists a unique prefix r of the nested word nw(R) that ends with some state q ∈ Q and satisfies proj_Σ(r) = v. We then call q the state of R at prefix v. The run 0·⟨0·list·1·⟨0·item·2⟩·3⟩·4 on the hedge ⟨list·⟨item⟩⟩ from Figure 5, for instance, assigns state 0 to the nested word prefixes ε and ⟨, and state 1 to the nested word prefix ⟨·list, etc.
A ∆-run is called a run of A = (Σ, Q, ∆, I, F) if it starts with some state in I. A run of A is successful if it ends with some state in F.
Lemma 4. h ∈ L(A) iff there exists a successful run R ∈ run_∆(h) of A.

Proof. Straightforward by induction on h.
4.1.3. Determinization

SHAs can always be determinized using the subset construction known from the determinization of finite state automata on words or of tree automata. Given a SHA A = (Σ, Q, ∆, I, F), the state set of its determinization is Q^det = 2^Q. The transition rules ∆^det of the determinization are presented in Figure 6.

- For all Q′ ⊆ Q and a ∈ Σ: Q′ →a Q′′ in ∆^det, where Q′′ = {q′ ∈ Q | q →a q′ in ∆, q ∈ Q′}.
- The tree initial state of ∆^det is the set ⟨⟩, i.e., ⟨⟩^det = {⟨⟩}.
- For all Q1, Q2 ⊆ Q: Q1@Q2 → Q′ in ∆^det, where Q′ = {q ∈ Q | q1@q2 → q in ∆, q1 ∈ Q1, q2 ∈ Q2}.

Figure 6. The determinization ∆^det of the transition rules ∆ of a SHA.
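The rules of Figure 6 can be realized by saturating the set of discovered subsets. This sketch is our own, not the article's construction: it closes the apply rules over all pairs of discovered subsets, which may build more pairs than are hedge-accessible, but it is sufficient for illustration:

```python
def determinize(sha):
    """Subset construction for SHAs along the rules of Figure 6 (sketch)."""
    letter, tree_init, apply_, init, final = sha
    alphabet = {a for (_, a) in letter}
    i0, t0 = frozenset(init), frozenset(tree_init)
    seen = {i0, t0}
    d_letter, d_apply = {}, {}
    changed = True
    while changed:                      # saturate until no new subset appears
        changed = False
        for q1 in list(seen):
            for a in alphabet:          # Q' -a-> Q'' in Delta^det
                q2 = frozenset(p for q in q1 for p in letter.get((q, a), set()))
                d_letter[(q1, a)] = q2
                if q2 not in seen:
                    seen.add(q2); changed = True
            for q2 in list(seen):       # Q1 @ Q2 -> Q' in Delta^det
                p = frozenset(r for x in q1 for y in q2
                              for r in apply_.get((x, y), set()))
                d_apply[(q1, q2)] = p
                if p not in seen:
                    seen.add(p); changed = True
    d_final = {q for q in seen if q & set(final)}
    return d_letter, {t0}, d_apply, {i0}, d_final
```

Applied to the (already deterministic) dSHA of Figure 4, the construction rediscovers singleton subsets such as {1}@{2} → {3}.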
4.1.4. In-Memory Membership Testing

Testing membership h ∈ L(A) for a hedge h, whose graph is stored in memory, can be carried out by computing the unique run of the determinization of A in a bottom-up and left-to-right manner, and testing whether it is successful. Alternatively, one can compute the unique set P ⊆ Q such that I →h P with respect to the determinization of A and test whether P ∩ F ≠ ∅. The computations are essentially the same, but the full history of the computation is lost when returning only the set P of states reached and not the run.

For determinizing SHAs, we can rely on a variant of the usual subset construction that is well known from finite state automata on words or standard tree automata (see, e.g., [8,16]). When it comes to membership testing, on-the-fly determinization is sufficient, which creates only the part of the deterministic automaton that is used by the run on h. This part is of linear size O(|h|), while the full determinization of A may be of exponential size O(2^{|A|}). In this way, membership h ∈ L(A) can be tested in time O(|h|·|A|).
4.1.5. Hedge Accessibility

For any set P ⊆ Q, we define the set of states accessible from P via a hedge as follows:

acc_∆(P) = {q′ ∈ Q | ∃q ∈ P. ∃h ∈ H_Σ. q →h q′ wrt ∆}.

We note that acc_∆(P) can be computed from ∆ and P in linear time in the size of ∆. A state is hedge accessible from P if and only if it is accessible from P in the graph of ∆. Alternatively, acc_∆(P) is the least fixed point of the following ground Datalog program, which depends on P and is of linear size in A:

- acc(q). for all q ∈ P
- acc(q′) :- acc(q). for all rules q →a q′ in ∆
- acc(q′) :- acc(q), acc(p). for all rules q@p → q′ in ∆

Clearly, L(A) = ∅ if and only if F ∩ acc_∆(I) = ∅. Therefore, emptiness can be decided in linear time for SHAs. This is in contrast to the more general notion of SHA↓s that we introduce next.
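The least fixed point of the Datalog program above can be computed by naive iteration. This sketch assumes the same dictionary encoding of ∆ as before (our assumption, not the article's); note that an apply rule fires only once both of its operands are accessible:

```python
def accessible(delta, start):
    """States hedge-accessible from `start`: least fixed point of the rules
    acc(q) for q in P, acc(q') :- acc(q), and acc(q') :- acc(q), acc(p)."""
    letter, _tree_init, apply_ = delta
    acc = set(start)
    changed = True
    while changed:
        changed = False
        for (q, _a), succs in letter.items():
            if q in acc and not succs <= acc:
                acc |= succs; changed = True
        for (q, p), succs in apply_.items():
            if q in acc and p in acc and not succs <= acc:
                acc |= succs; changed = True
    return acc

# for the dSHA of Figure 4, the final state 4 is accessible from the
# initial state 0, so the language is nonempty
delta = ({(0, "list"): {1}, (0, "item"): {2}}, {0}, {(1, 2): {3}, (0, 3): {4}})
print(4 in accessible(delta, {0}))   # True
```

Each rule is scanned at most once per newly added state, which matches the linear-time claim for SHAs.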
4.2. Downward Stepwise Hedge Automata

The evaluation of SHAs operates in a bottom-up and left-to-right manner but cannot pass any information top-down. We next propose an extension of SHAs to SHA↓s, adding the ability to pass finite state information top-down. SHA↓s are basically equal to Neumann and Seidl's pushdown forest automata [25] but lifted from (labeled) forests to (unlabeled) hedges. See Section 12 for a more detailed discussion of the relationship of SHA↓s to other automata notions and to NWAs in particular.
Definition 4. A downward stepwise hedge automaton (SHA↓) is a tuple A = (Σ, Q, ∆, I, F), where Σ and Q are finite sets, I, F ⊆ Q, and ∆ = ((→a)_{a∈Σ}, ⟨⟩, @). Furthermore, →a ⊆ Q × Q, ⟨⟩ ⊆ Q × Q, and @ ⊆ (Q × Q) × Q. A SHA↓ is deterministic, or equivalently a dSHA↓, if I contains at most one element, and all relations ⟨⟩, →a, and @ are partial functions.
The only difference compared to SHAs is the form of the tree opening rules. If (q, q′) ∈ ⟨⟩ ⊆ Q × Q, then we have a tree initial rule that we denote as q⟨⟩ → q′ in ∆. So, the state q′, where the evaluation of a subhedge starts, depends on the state q of the parent.
4.2.1. Semantics

The definition of transition steps for SHA↓s differs from that of SHAs only in the usage of opening rules in the following inference rule: if q⟨⟩ → p in ∆, p →h p′ wrt ∆, and q@p′ → q′ in ∆, then q →⟨h⟩ q′ wrt ∆. This means that the evaluation of the subhedge h starts in a state p such that q⟨⟩ → p in ∆. So, the restart state p that is chosen may now depend on the state q above. This is how finite state information is passed top-down by SHA↓s. SHAs, in contrast, operate purely bottom-up and left-to-right.
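Compared to the SHA evaluator, only the opening step changes: the restart state is looked up under the parent state. A minimal sketch under the same assumed encoding, where the opening relation open_ maps each parent state to its set of restart states (names and encoding are ours):

```python
def steps_down(delta, states, hedge):
    """Evaluate a SHA-down: the tree initial state depends on the parent."""
    letter, open_, apply_ = delta
    for item in hedge:
        if isinstance(item, str):        # letter rule: q -a-> q'
            states = {q2 for q in states
                      for q2 in letter.get((q, item), set())}
        else:                            # q<> -> p, then evaluate item from p
            states = {q2 for q in states
                      for p0 in open_.get(q, set())
                      for p in steps_down(delta, {p0}, item)
                      for q2 in apply_.get((q, p), set())}
    return states

# A^down of the Figure 4 automaton: every state restarts subhedges in state 0
letter = {(0, "list"): {1}, (0, "item"): {2}}
open_ = {q: {0} for q in range(6)}
apply_ = {(1, 2): {3}, (0, 3): {4}}
print(steps_down((letter, open_, apply_), {0}, [["list", ["item"]]]))  # {4}
```

With a nontrivial open_ relation, as in Figure 7, different parents can restart the same subhedge in different states, which is exactly the top-down information flow.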
4.2.2. Runs

The notion of runs can be adapted straightforwardly from SHAs to SHA↓s. When in state q, the computation on a subhedge is restarted in a state p such that q⟨⟩ → p in ∆. In this way, finite state information is passed down (while for SHAs, some tree initial state is chosen that is independent of q). The only rule of runs to be changed is the following:

- if q⟨⟩ → p in ∆, p·R·p′ ∈ run_∆(h), and q@p′ → q′ in ∆, then q·⟨p·R·p′⟩·q′ ∈ run_∆(⟨h⟩).

An example of a run of the dSHA↓ in Figure 7 is shown in Figure 8. Here, the information passed down is the level of the nodes, represented by the number of primes. For instance, when opening the first subtree labeled with item, we use the transition rule 1′⟨⟩ → 0′′, stating that one moves from state 1′ on the first level to state 0′′ on the second level.
Figure 7. The deterministic SHA↓ with subhedge projection state Π obtained by safe-no-change projection.
Figure 8. A successful run of the SHA↓ in Figure 7 on the hedge ⟨list·⟨list·h1⟩·⟨item·h2⟩⟩.
One can see that all nodes of the subtrees h1 and h2 are evaluated to the projection state Π. So, one can jump over these subtrees without reading them. This is justified by the fact that they are on level 2, the information that the evaluator of this SHA↓ was passing down.
4.2.3. In-Memory Membership Testing

Testing membership h ∈ L(A) for a SHA↓ in memory can again be carried out by computing the unique run of the determinization of A. As before, the membership tester can be based on on-the-fly determinization. However, the determinization procedure for SHA↓s is more complicated than for SHAs, since SHA↓s can pass information top-down and not only bottom-up and left-to-right. Determinization not only relies on the pure subset construction but also uses pairs of states in the subsets, basically in the same way as for nested word automata (NWAs) [14,15]. This is needed to deal with the stack of states that were seen above. Therefore, each determinization step may cost time O(|A|^5), as stated, for instance, in [23]. The upper time bound for membership testing for SHA↓s is thus O(|h|·|A|^5) and no longer combined linear as it is for SHAs.
4.2.4. Relationship to SHAs

Any SHA A = (Σ, Q, ∆, I, F) can be mapped to a SHA↓ A^down = (Σ, Q, ∆^down, I, F) while preserving runs and determinism. The only change of ∆^down compared to ∆ is described by the following rule: if ⟨⟩ → p in ∆ and q ∈ Q, then q⟨⟩ → p in ∆^down. Independently of the current state q ∈ Q, the SHA↓ A^down can start the evaluation of the subhedge of a subtree in any tree initial state p ∈ ⟨⟩. For instance, the dSHA↓ A^down for the dSHA A from Figure 4 from the introduction is provided in Figure 9. Conversely, one can convert any dSHA↓ A into an equivalent SHA by introducing nondeterminism to eliminate top-down processing. See Section 12.2 for the construction.
Figure 9. The SHA↓ A^down for the dSHA A for the filter [self::list/child::item] in Figure 4.
4.2.5. Hedge Accessibility

Hedge accessibility can be defined for SHA↓s identically as for SHAs. However, the set acc_∆(P) can no longer be computed in linear time in the size of ∆, in contrast to SHAs. It can still be done in cubic time, similarly as for NWAs [22]. The problem is that hedge accessibility for SHA↓s no longer coincides with the accessibility notion of the automaton's graph.

For instance, in the SHA↓ from Figure 7, we have acc_∆({0}) = {0, 4}. Note that 0′ does not belong to this set, even though there is a transition rule 0⟨⟩ → 0′ in ∆, making node 0′ accessible from node 0 in the graph of the automaton. This is because 0′ is not accessible over any hedge from 0; that is, there is no hedge h such that 0 →h 0′ wrt ∆. Note that 0·⟨·0′ is nevertheless a factor of the nested word of some run of ∆. This shows that 0′ is accessible from 0 over a nested word prefix. For any SHA↓ A, hedge accessibility can be computed by the following ground Datalog program of cubic size depending on the number of states of A:

- acc(p, q) :- acc(p′, q′). for all rules p@q′ → q in ∆ and p⟨⟩ → p′ in ∆
- acc(p, p). for all p ∈ Q
- acc(p, q). for all rules p →a q in ∆
- acc(p, q) :- acc(p, r), acc(r, q). for all p, q, r ∈ Q

Due to the cubic time of hedge accessibility for SHA↓s, we will use hedge accessibility only for SHAs, where it is in linear time.
4.3. Schema Completeness

Schemas are inputs of the membership problem that we consider, while completeness is a fundamental property of automata. To combine both aspects appropriately, we propose the notion of schema-completeness, which we define uniformly for both SHAs and SHA↓s.

We call a set ∆ of transition rules of a SHA↓ complete if all its relations (→a)_{a∈Σ}, ⟨⟩, and @ are total. For SHAs, we require instead that ⟨⟩ is a nonempty set (rather than a total relation). A SHA or SHA↓ A is called complete if ∆ is complete and I ≠ ∅. We can complete any SHA or SHA↓ by adding a fresh sink state, together with transition rules into the sink state for all left-hand sides for which no other transition rule exists. The sink state is made initial if and only if there was no initial state before. It is not final, though. We denote the completion of A by compl(A).
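The completion compl(A) can be sketched as follows for a deterministic automaton encoded with partial-function dictionaries (our own encoding, not the article's); every missing left-hand side is redirected to a fresh sink:

```python
def complete(sha, sigma):
    """compl(A): add a fresh sink state and fill in all missing transitions.
    The sink becomes initial (or tree initial) only if none existed before."""
    letter, tree_init, apply_, init, final = sha
    sink = "sink"
    states = {sink} | set(init) | set(tree_init)
    for (q, _a), q2 in letter.items():
        states |= {q, q2}
    for (q, p), q2 in apply_.items():
        states |= {q, p, q2}
    letter, apply_ = dict(letter), dict(apply_)
    for q in states:
        for a in sigma:
            letter.setdefault((q, a), sink)
        for p in states:
            apply_.setdefault((q, p), sink)
    return (letter, set(tree_init) or {sink}, apply_,
            set(init) or {sink}, set(final))   # the sink is never final
```

Note the quadratic blowup of the apply rules, which is why Example 5 below lists so many transitions into the sink-like state 5.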
Definition 5. A partial run of A on a hedge h is a prefix r of a run of compl(A) on h such that r ends with some state of Q, and r does not contain the sink state of compl(A). A partial run r of A on h is called blocking if there does not exist any partial run r′ of A on h such that r is a strict prefix of r′.
For instance, consider the hedge h = ⟨list·⟨item⟩⟩. The dSHA in Figure 4 has the unique run R = 0·⟨0·list·1·⟨0·item·2⟩·3⟩·4 on h, which is illustrated in Figure 5. The partial runs on h are thus all prefixes of nw(R) that end with some state, i.e., 0, 0·⟨·0, 0·⟨·0·list·1, etc. None of these partial runs is blocking.
Definition 6. A schema for an automaton A is a hedge language S ⊆ H_Σ such that L(A) ⊆ S. We call the automaton A schema-complete for schema S if no hedge h ∈ S has a blocking partial run of A.
Schemas S are usually used to restrict the type of hedges that a problem may accept as input. The dSHA pattern-matching problem for a schema S takes two inputs: a hedge h ∈ S and an automaton A, rather than a nested regular expression e. The automaton A then selects those input hedges h ∈ S that match the pattern, i.e., such that h ∈ L(A). For this reason, we assume that S is a schema for A, i.e., L(A) ⊆ S. We will often assume that A is schema-complete for S, so that no partial run of A on any input hedge h ∈ S may ever block. This can always be achieved based on Lemma 5 below.
Example 5. The dSHA in Figure 4 for the XPath filter [self::list/child::item] has signature Σ = {item, list}. It is schema-complete for schema ⟦doc⟧. To make this happen, we added state 5 to this automaton, which is not used by any successful run. Still, this SHA is not complete. For completing it, we would have to turn 5 into a sink state by adding many transitions into it, such as 0@4 → 5 and 0@5 → 5, and also add, for all other states q ∈ {1, 2, 3, 4, 5}, the transition rules q →list 5, q →item 5, q@4 → 5, and q@5 → 5.
As we will see, automaton completion would make the safe-no-change algorithm incomplete for subhedge projection in Example 5. Without schema-completeness, however, safe-no-change projection may be unsound: it might overlook the rejection of hedges inside the schema. This is why it is much better to assume only schema-completeness rather than full completeness for the input automata of safe-no-change projection. For congruence projection, schema-completeness will be assumed implicitly.

We note that schema-completeness is well-defined even for nonregular schemas. For safe-no-change projection, we indeed do not have to restrict schemas to be regular, while for congruence projection, only regular schemas are considered.

Furthermore, if A is schema-complete for the universal schema S = H_Σ and does not have inaccessible states, then A is complete.
Schema-completeness of deterministic automata for regular schemas can always be made to hold, as we show next. For this, we define that A is compatible with schema S if for any hedge h ∈ S there exists a run of A in run_∆(h). Schema-completeness implies compatibility, as shown by the next lemma. The converse is true only when assuming determinism.
Lemma 5. Let A be a deterministic SHA with schema S. Then, A is compatible with S iff A is schema-complete for S.
Proof. Suppose that A is schema-complete for S. W.l.o.g., we can assume S ≠ ∅. Note that if I = ∅, then ε would be a blocking partial run for any hedge in S \ {ε}. Since S ≠ ∅, it follows that there exists q0 ∈ I. So, q0 is a partial run for any hedge h ∈ S \ {ε}. Since there exist no blocking partial runs by schema-completeness, this run can be extended step by step to a run on any h ∈ S. Therefore, for any hedge h ∈ S, there exists some run of A, showing compatibility.

For the converse, let A be compatible with S and deterministic. We have to show that A is schema-complete for S. Let v be a partial run of A on h ∈ S. By compatibility, there exists a run R ∈ run_∆(h) that starts in some initial state of A. By determinism, v must be a prefix of nw(R). Thus, v is not blocking.
Proposition 1. Any SHA A with regular schema S can be made schema-complete for S.

Proof. By Lemma 5, it is sufficient to make A compatible with S. Let B be a dSHA with S = L(B). Automaton B is compatible with S. Since B is deterministic, Lemma 5 implies that B is schema-complete for S. Note that S is a schema for A, so L(A) ⊆ L(B). We then compute the completion compl(A). The product C = B × compl(A) with final states F_C = F_B × F_A is schema-complete for schema S. Furthermore, since L(A) ⊆ S = L(B), it follows that L(C) = L(B) ∩ L(A) = L(A).

Finally, note that the schema S can be recognized by the same product C = B × compl(A), except for replacing the set of final states F_C by the set of schema final states F_S = F_B × (Q_A ∪ {sink}). We have S = L(C[F_C/F_S]), where C[F_C/F_S] is the dSHA obtained from C by replacing the set of final states F_C with F_S.
5. Subhedge Projection

We next introduce the concept of subhedge projection states for SHA↓s in order to distinguish prefixes of hedges that are subhedge irrelevant, and to evaluate SHA↓s with subhedge projection.

5.1. Subhedge Projection States

Intuitively, a subhedge projection state can only loop in itself until the current subhedge is closed (and then an apply rule is used). When going down into a subtree, the subhedge projection state may still be changed into some other looping state. When leaving the subtree, however, one must come back to the original subhedge projection state. Clearly, any sink is a subhedge projection state. The more interesting cases, however, are subhedge projection states that are not sinks.

We start with SHA↓s without any restrictions but will impose schema-completeness and determinism later on.
Definition 7. We call a state q ∈ Q a subhedge projection state of ∆ if there exists a state q′ ∈ Q, called the witness of q, such that the set of transition rules of ∆ containing q or q′ in the leftmost position is included in the following set:

{q⟨⟩ → q′, q@q′ → q, q′⟨⟩ → q′, q′@q′ → q′} ∪ {q →a q, q′ →a q′ | a ∈ Σ}.

The set of all subhedge projection states of ∆ is denoted by Q^∆_shp.
For any complete SHA↓, the above set of transition rules must be equal to the set of all transition rules of ∆ with q or q′ in the leftmost position. But if the SHA↓ is only schema-complete for some schema S, then not all of these transitions must be present. Note also that a subhedge projection state q may be equal to its witness q′. Furthermore, any witness q′ of some subhedge projection state is itself a subhedge projection state with witness q′.
In the example dSHA↓ A = (Σ, Q, ∆, I, F) in Figure 7, we have Q^∆_shp = {Π, 4, 2′, 3′, 1′′, 2′′}. The witness of all these subhedge projection states is Π. Note that not all possible transitions are present for the states in Q^∆_shp \ {Π}, given that this automaton is not complete (but still schema-complete for schema ⟦doc⟧).
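Definition 7 can be checked syntactically per candidate pair of state and witness. The following sketch (our encoding, with set-valued rule dictionaries) collects every rule whose leftmost state is q or q′ and tests inclusion in the allowed set:

```python
def is_projection_state(delta, sigma, q, w):
    """Check Definition 7: q is a subhedge projection state with witness w."""
    letter, open_, apply_ = delta
    allowed = {("open", q, w), ("apply", q, w, q),
               ("open", w, w), ("apply", w, w, w)}
    allowed |= {("letter", s, a, s) for s in (q, w) for a in sigma}
    for (p, a), succs in letter.items():
        if p in (q, w) and any(("letter", p, a, r) not in allowed for r in succs):
            return False
    for p, succs in open_.items():
        if p in (q, w) and any(("open", p, r) not in allowed for r in succs):
            return False
    for (p, c), succs in apply_.items():
        if p in (q, w) and any(("apply", p, c, r) not in allowed for r in succs):
            return False
    return True

# a Pi-like state looping on itself is a projection state with witness itself,
# while q0 is not: its letter rule escapes to q1 (toy automaton, not Figure 7)
delta = ({("P", "a"): {"P"}, ("q0", "a"): {"q1"}},
         {"P": {"P"}, "q0": {"P"}},
         {("P", "P"): {"P"}, ("q0", "P"): {"q2"}})
print(is_projection_state(delta, {"a"}, "P", "P"))    # True
print(is_projection_state(delta, {"a"}, "q0", "P"))   # False
```

Since schema-complete automata may omit some of the allowed rules, only inclusion in the allowed set is tested, not equality.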
Lemma 6. If q ∈ Q^∆_shp is a subhedge projection state and q →h p wrt ∆, then q = p.
Proof. Suppose that q →h p wrt ∆ and that q is a subhedge projection state of ∆ with witness q′. So, the only possible transitions with q or q′ on the left-hand side are q⟨⟩ → q′, q@q′ → q, q′⟨⟩ → q′, q′@q′ → q′, and the letter loops q →a q and q′ →a q′ for a ∈ Σ.

Suppose that h = t1·…·tn is a sequence of trees or letters ti ∈ ⟨H_Σ⟩ ∪ Σ, where 1 ≤ i ≤ n. Then, there exist runs Ri ∈ run_∆(ti) and a run R ∈ run_∆(h) of the form R = q0·R1·q1·…·Rn·qn, where q0 = q and qn = p. We prove qi = q for all 0 ≤ i ≤ n by induction on i. For i = 0, this is trivial. For i > 0, the induction hypothesis shows that q_{i−1} = q.

Case ti = a ∈ Σ. Then, q_{i−1} →a qi in ∆. Since q_{i−1} = q is a subhedge projection state of ∆, it follows that qi = q, too.

Case ti = ⟨h′⟩ for some h′ ∈ H_Σ. Since q_{i−1} = q is a subhedge projection state of ∆ with witness q′, the subrun of R recognizing ti must open with q⟨⟩ → q′, stay in the witness q′, and close with q@q′ → q. So, the subrun of R on h′ may only contain the witness q′, and, furthermore, qi = q.
5.2. Completeness for Subhedge Projection

Before defining when a SHA↓ is complete for subhedge projection, we show that subhedge projection states in runs of deterministic SHA↓s may only occur at irrelevant prefixes.
Proposition 2. Let A = (Σ, Q, ∆, I, F) be a dSHA↓ with schema S, let R ∈ run_∆(h) be the run of compl(A) on some hedge h ∈ S, and let q be a subhedge projection state for ∆. Then, for any prefix r·q of nw(R), the prefix proj_Σ(r) of nw(h) is subhedge irrelevant for L(A) with schema S.
Proof. Let R ∈ run_∆(h) be the unique run of compl(A) on a hedge h ∈ S. Suppose that r·q·s = nw(R) for some subhedge projection state q of ∆, and let v = proj_Σ(r) and w = proj_Σ(s). Since v·w = nw(h) and h ∈ S, it follows that v·w ∈ nw(S). We fix some hedge h′ ∈ H_Σ such that v·nw(h′)·w ∈ nw(S) arbitrarily. Let h̃ ∈ H_Σ be such that nw(h̃) = v·nw(h′)·w. For the subhedge irrelevance of v, we must show that

h ∈ L(A) ⇔ h̃ ∈ L(A).

First, we suppose that h ∈ L(A) and have to show that h̃ ∈ L(A). For the partial run r·q of A on h̃, there must exist R′ and p such that q·R′·p ∈ run_∆(h′). Since q is a subhedge projection state of ∆, it follows by Lemma 6 that q = p. Hence, r·q·nw(R′)·q·s is the nested word of a run of A on h̃. Since R was successful for A, s must end in F; thus, the above run on h̃ is successful for A too. Hence, h̃ ∈ L(A), as was to be shown.

Conversely, we suppose that h̃ ∈ L(A) and have to show that h ∈ L(A). Let R̃ be the unique run of compl(A) on h̃. By determinism, nw(R̃) must have the prefix r·q. So, there exist R′, p, and s′ such that q·R′·p ∈ run_∆(h′) and r·q·nw(R′)·p·s′ = nw(R̃). Lemma 6 shows that p = q. Hence, r·q·s′ is the nested word of the unique run of compl(A) on h, and this run is accepting. So, h ∈ L(A).
We note that Proposition 2 would not hold without assuming determinism. To see this, we can add some sink to any dSHA↓ as a second initial state. One can then always go to this sink, which is a subhedge projection state. Nevertheless, no prefix going to the sink needs to be subhedge irrelevant.

Proposition 2 shows that subhedge projection states permit distinguishing prefixes that are subhedge irrelevant for deterministic SHA↓s. An interesting question is whether all subhedge irrelevant prefixes can be found in this way. A closely related question is whether, for all regular patterns, there exist dSHA↓s that project all irrelevant subhedges.
Example 6. Reconsider Example 3 with Σ = {a}, the regular pattern L = {⟨a⟩}, and the regular schema S = L ∪ {⟨⟩·a}. The prefix ⟨⟩ is irrelevant for L and S, since all its continuations ending in the schema are rejected: there is only one such continuation, which is ⟨⟩·a. However, the dSHA for L and S with maximal subhedge projection, given in Figure 17, will not go into a subhedge projection state at this prefix. The reason is that it goes into the same state {3, 4}D0 for any prefix in ⟨N_Σ⟩, ignoring the nested word of the irrelevant subhedge. Therefore, this state must accept ⟨a⟩. But, when reading an a in this state, the dSHA goes into the rejection state {5}D0, showing that the hedge ⟨⟩·a ∈ S is rejected.
The example shows that we eventually have to reason about sets of prefixes. We first lift the notion of subhedge irrelevance to prefix sets.
Definition 8. Let S ⊆ H_Σ be a schema and L ⊆ S a language satisfying this schema. A set of nested word prefixes U ⊆ prefs(N_Σ) is called subhedge relevant for L with schema S if there exist nested words v′, v′′ ∈ N_Σ and prefixes u′, u′′ ∈ U such that diff^L_S(u′·v′, u′′·v′′). Otherwise, the set U is called subhedge irrelevant for L wrt S.
In order to explain the problem of Example 6, we define for each nested word prefix w ∈ prefs(N_Σ) a class class^L_S(w) of similar prefixes, up to replacing the nested word of any irrelevant subhedge by N_Σ, from the left to the right. For all prefixes w ∈ prefs(N_Σ), nested words v, v′ ∈ N_Σ, letters a ∈ Σ, subsets of prefixes U ⊆ prefs(N_Σ), and letters or parentheses l ∈ Σ̂ = Σ ∪ {⟨, ⟩}, the following equations apply:

class^L_S(w) = class_pr_{ε}(w)
class_pr_U(ε) = U·N_Σ if U is irrelevant for L wrt S, and U otherwise
class_pr_U(w·⟨·v) = class_ins_{U′}(⟨, v) where U′ = class_pr_U(w)
class_pr_U(a·v) = class_ins_U(a, v)
class_pr_U(⟨v⟩·v′) = class_ins_{U′}(⟩, v′) where U′ = class_pr_U(⟨·v)
class_ins_U(l, v) = U·l·N_Σ if U·l is irrelevant for L wrt S, and class_pr_{U·l}(v) otherwise
Definition 9. A prefix u ∈ prefs(N_Σ) is strongly subhedge irrelevant for L wrt S if the set class^L_S(u) is subhedge irrelevant for L wrt S.
Example 7. In our running Example 6, where L = {⟨a⟩}, S = L ∪ {⟨⟩·a}, and Σ = {a}, we have class^L_S(⟨) = ⟨·N_Σ, so the prefix ⟨ is strongly subhedge irrelevant. The set class^L_S(⟨a⟩) = ⟨N_Σ⟩ is subhedge relevant, so the prefix ⟨a⟩ is not strongly subhedge irrelevant. The set class^L_S(⟨⟩·a) = ⟨N_Σ⟩·a·N_Σ is subhedge irrelevant, so the prefix ⟨⟩·a is strongly subhedge irrelevant.
Definition 10. A dSHA↓ A = (Σ, Q, ∆, I, F) is complete for subhedge projection for schema S if A is schema-complete for S and, for all hedges h ∈ S and all prefixes v of nw(h) that are strongly subhedge irrelevant for L(A) wrt S, the unique run of A on h goes to some subhedge projection state of ∆ at v.
Example 8. The dSHA in Figure 17 is complete for subhedge projection for our running Example 6, where L = {⟨a⟩} and S = L ∪ {⟨⟩·a}. It was obtained by congruence projection. On all prefixes of the subhedge irrelevant class class^L_S(⟨), it goes to the subhedge projection state {0}D1. On all prefixes of the class class^L_S(⟨a⟩), which is not subhedge irrelevant, it goes to the state {3, 4}D0, which is not a subhedge projection state. On all prefixes of the subhedge irrelevant class class^L_S(⟨⟩·a), it goes to the subhedge projection state {5}D0.
Example 9. The dSHA↓ from Figure 7 is complete for subhedge projection with schema ⟦doc⟧. It can be obtained by safe-no-change projection. In general, however, dSHA↓s obtained by safe-no-change projection may not be complete for subhedge projection, as we will discuss in Section 6.3.
5.3. In-Memory Evaluator with Subhedge Projection

We next show how to refine the transitions of SHA↓s by subhedge projection. We define the transition relation with subhedge projection →h_shp ⊆ Q × Q with respect to ∆ such that, for all hedges h, h′ ∈ H_Σ and letters a ∈ Σ:

- if q ∈ Q^∆_shp and h ∈ H_Σ, then q →h_shp q wrt ∆;
- if q ∉ Q^∆_shp and q →a q′ in ∆, then q →a_shp q′ wrt ∆;
- if q ∉ Q^∆_shp, then q →ε_shp q wrt ∆;
- if q ∉ Q^∆_shp, q →h_shp q′ wrt ∆, and q′ →h′_shp q′′ wrt ∆, then q →h·h′_shp q′′ wrt ∆;
- if q ∉ Q^∆_shp, q⟨⟩ → p in ∆, p →h_shp p′ wrt ∆, and q@p′ → q′ in ∆, then q →⟨h⟩_shp q′ wrt ∆.

The first rule says that subhedge projecting transitions stay in subhedge projection states until the end of the current subhedge is reached. This is correct by Lemma 6, under the condition that there are no blocking runs. The other rules state that the evaluator behaves as without subhedge projection in all other states.
Proposition 3. Let A = (Σ, Q, ∆, I, F) be a SHA↓ that is schema-complete for S. Then, for all hedges h ∈ S and states q, p ∈ Q:

q →h p wrt ∆ iff q →h_shp p wrt ∆.
Proof. We distinguish two cases as follows:

Case q ∉ Q^∆_shp. Then, for any p ∈ Q, q →h p wrt ∆ iff q →h_shp p wrt ∆ by definition of the projecting transitions.

Case q ∈ Q^∆_shp. Then, q →h_shp q wrt ∆. Since h ∈ S and A is schema-complete for S, there exists some run of A on h; thus, there is a state q′ such that q →h q′ wrt ∆. Lemma 6 shows that q = q′; thus, q →h q wrt ∆.
For evaluating a nondeterministic SHA↓ A = (Σ, Q, ∆, I, F) with subhedge projection on some hedge h whose graph is stored in memory, we can compute the unique subset P ⊆ Q such that I →h_shp P with respect to the transition relation ∆^det of the determinization of A, and test whether P ∩ F ≠ ∅. When reaching any subhedge projection state P while doing so, the evaluator can directly jump to the end of the current subhedge by applying the "stay" rule of the determinization of A: if P ∈ Q^{∆^det}_shp and h ∈ H_Σ, then P →h_shp P wrt ∆^det. Thereby, it will avoid visiting any subhedge in a subhedge projection state of the determinization of A. The running time will thus depend linearly on the size of the non-projected part of h, under the assumption that the determinization of A is available.
As before, only the needed part of the determinization of
A
for evaluating
h
is to be
produced on-the-fly. Overall, this costs polynomial time in the size of the needed part of
the determinization (but not necessarily linear time since the determinization of
SHA
s
is more costly than for SH As). Also note that whenever discovering a new state
P
of the
determinization of
A
, we have to test whether it is a subhedge projection state. For this, we
must compute the state
P′′
, such that
P⟨⟩
P′′
wrt
det
and test whether only the transition
rules allowed for subhedge projection states for Pwith witness P′′ are available wrt det.
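To make the jump over projected subhedges concrete, here is a minimal in-memory evaluator sketch in Python under our own encoding (it is not the paper's implementation): a hedge is a nested list whose elements are letters (strings) or trees (sublists), and a deterministic automaton is given by dictionaries `letter`, `open_`, and `apply_` for its three kinds of transition rules, together with a set `proj` of subhedge projection states.

```python
def eval_shp(h, q, letter, open_, apply_, proj):
    """Evaluate hedge h from state q, skipping subhedges in projection states."""
    for item in h:
        if q in proj:                    # "stay" rule: remaining items are irrelevant
            return q
        if isinstance(item, list):       # a tree <h'>
            p0 = open_[q]                # go down with the open rule
            p = eval_shp(item, p0, letter, open_, apply_, proj)
            q = apply_[(q, p)]           # come back up with the apply rule
        else:                            # a letter
            q = letter[(q, item)]
    return q
```

With `proj = set()`, this is the ordinary left-to-right evaluator; with the projection states of the determinization, the subhedges below projection states are never visited, which is where the linear dependence on the non-projected part of h comes from.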
6. Safe-No-Change Projection
We want to solve the regular pattern-matching problem for hedges with subhedge projection. We assume that the regular pattern is given by a dSHA, while the schema may be arbitrary, except that the dSHA must be schema-complete (see Section 4.3). We then present the safe-no-change projection algorithm, which compiles the dSHA into some dSHA while introducing subhedge projection states.

The idea of the safe-no-change projection is to push information top-down by which to distinguish looping states that will safely no longer be changed. Thereby, subhedge projection states are produced, as we will illustrate by examples. The soundness proof of safe-no-change projection is nontrivial and instructive for the soundness proof of congruence projection in Section 7.3. Completeness for subhedge projection cannot be expected, since the safe-no-change projection only takes simple loops into account. More general loops cannot be treated, and it is not clear how that could be done. The congruence projection algorithm in Section 7.2 will eventually give an answer to this.

We keep the safe-no-change compiler independent of the schema and, therefore, do not have to assume its regularity. But, we have to assume the schema-completeness of the input dSHA in order to show that the output dSHA recognizes the same language within the schema. The regular pattern-matching problem can then be solved in memory and with subhedge projection by evaluating the output dSHA on the input hedge, as discussed in Section 5.3.
6.1. Algorithm
We now describe the safe-no-change projection algorithm. Let A = (Σ, Q, Δ, I, F) be a dSHA with schema S. The algorithm takes the dSHA A as input and compiles it to some dSHA A^snc. We will state why A^snc is sound for subhedge projection under the condition that A is schema-complete for schema S. We do not even have to assume that the schema S is regular. The proof is eight pages long and, therefore, delegated to Appendix B.
For any set P ⊆ Q, we define the set of states that safely lead to P:

    safe(P) = {q ∈ Q | acc({q}) ⊆ P}.

So, state q is safe for P if any hedge read from q on which the run with Δ does not block must reach some state in P. We note that safe(P) can be computed in linear time in the size of Δ using inverse hedge accessibility from P. We define the set of states that will no longer change:

    no-change^Δ = {q | q ∈ safe({q})}.

So, q ∈ no-change^Δ if and only if acc({q}) ⊆ {q}. This means that all transition rules starting in q must loop in q. In the example automaton from Figure 4, the self-looping states are those in no-change^Δ = {2, 3, 4, 5}.
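A naive fixpoint sketch of these definitions in Python (our own encoding, not the linear-time algorithm via inverse hedge accessibility mentioned above): `hedge_reach` computes acc({q}) for every state q of an automaton given by rule dictionaries, from which safe(P) and no-change^Δ follow directly.

```python
def hedge_reach(states, letter, open_, apply_):
    """reach[q] = acc({q}): states reachable from q by reading some hedge."""
    reach = {q: {q} for q in states}     # the empty hedge reaches q itself
    changed = True
    while changed:
        changed = False
        for q in states:
            for q1 in list(reach[q]):
                # letter steps q1 -a-> q2
                for (r, _a), q2 in letter.items():
                    if r == q1 and q2 not in reach[q]:
                        reach[q].add(q2); changed = True
                # tree steps: q1 -<>-> p0, p in acc({p0}), q1@p -> q2
                if q1 in open_:
                    for p in list(reach[open_[q1]]):
                        q2 = apply_.get((q1, p))
                        if q2 is not None and q2 not in reach[q]:
                            reach[q].add(q2); changed = True
    return reach

def safe(P, reach):
    """States from which every non-blocking hedge run ends in P."""
    return {q for q in reach if reach[q] <= set(P)}

def no_change(reach):
    """States q with acc({q}) included in {q}: all rules loop in q."""
    return {q for q in reach if reach[q] <= {q}}
```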
Lemma 7. Let A = (Σ, Q, Δ, I, F) be an SHA that is schema-complete for schema S. For any hedge h ∈ S and state q ∈ I ∩ no-change^Δ, it then holds that q –h→ q wrt Δ.

Proof. The schema-completeness of A for S applied to h ∈ S and q ∈ I yields the existence of some state q′ ∈ Q such that q –h→ q′ wrt Δ. Let q′ be any such state. Note that q′ ∈ acc({q}). Since q ∈ no-change^Δ, we have q ∈ safe({q}), so that q′ ∈ acc({q}) implies q′ = q. This proves q –h→ q wrt Δ.
For any state q ∈ Q and subset of states Q′ ⊆ Q, we define the following:

    s-down(q, Q′) = safe({p ∈ Q | q@p ∈ Q′})
    s-no-change(q) = s-down(q, {q})

A state p′ belongs to s-down(q, Q′) if all states p ∈ acc({p′}) satisfy q@p ∈ Q′. So, p′ is a state on a lower level that will safely go up to some state in Q′. A state p belongs to s-no-change(q) if it safely does not change q, i.e., if {q}@acc({p}) ⊆ {q}.
We next compile the SHA A to an SHA A^snc = (Σ, Q^snc, Δ^snc, I^snc, F^snc). For this, let Π be a fresh symbol and consider the state set as follows:

    Q^snc = {Π} ∪ (Q × 2^Q).

In practice, we restrict the state space to those states that are accessible, or we clean the dSHAs, keeping only those states that are used in some successful run. But, in the worst case, the construction may indeed be exponential. Example 15 can be adapted to show this.

A pair (q, P) means that the state q will safely no longer change in the current subhedge if q ∈ P ∪ no-change^Δ. In this case, the subhedge can be projected. The sets of initial and final states are defined as follows:

    I^snc = I × {∅}        F^snc = F × {∅}.
How to generate the transition rules of Δ^snc from those of Δ is described in Figure 10. On states (q, P) assigned on the top level, the set P is empty, so that only the states in no-change^Δ are safe for no change. This is why the definitions of I^snc and F^snc use P = ∅. When going down from a state (q, P) for which q is safe not to change, i.e., q ∈ P ∪ no-change^Δ, the following rule is applied:

    q ∈ P ∪ no-change^Δ
    ------------------------------
    (q, P) –⟨⟩→ Π in Δ^snc

The evaluation on the lower level goes to the extra state Π, where it then loops until going back to q on the upper level. When going down from some state (q, P) such that q ∉ P ∪ no-change^Δ, the following rule is applied:

    q –⟨⟩→ q′ in Δ    q ∉ P ∪ no-change^Δ
    ------------------------------------------
    (q, P) –⟨⟩→ (q′, s-no-change(q)) in Δ^snc

The states in the set s-no-change(q) on the lower level safely will not make q change on the upper level for any subhedge to come. (We could detect more irrelevant subhedges by allowing q to change, but only in a unique manner. This can be obtained with the alternative definition s-no-change(q) = ∪_{r ∈ Q} s-down(q, {r}). Computing this set would require quadratic time O(n|A|), while computing s-no-change(q) can be carried out in linear time O(|A|).)

    q –a→ q′ in Δ    q ∉ P ∪ no-change^Δ
    ------------------------------------------
    (q, P) –a→ (q′, P) in Δ^snc

    a ∈ Σ    q ∈ P ∪ no-change^Δ
    ------------------------------------------
    (q, P) –a→ (q, P) in Δ^snc

    q –⟨⟩→ q′ in Δ    q ∉ P ∪ no-change^Δ
    ------------------------------------------
    (q, P) –⟨⟩→ (q′, s-no-change(q)) in Δ^snc

    q ∈ P ∪ no-change^Δ
    ------------------------------------------
    (q, P) –⟨⟩→ Π in Δ^snc

    q@p → q′ in Δ    q ∉ P ∪ no-change^Δ
    ------------------------------------------
    (q, P)@(p, s-no-change(q)) → (q′, P) in Δ^snc

    q ∈ P ∪ no-change^Δ
    ------------------------------------------
    (q, P)@Π → (q, P) in Δ^snc

    a ∈ Σ
    ------------------------------------------
    Π –a→ Π in Δ^snc

    true
    ------------------------------------------
    Π@Π → Π in Δ^snc

    true
    ------------------------------------------
    Π –⟨⟩→ Π in Δ^snc

Figure 10. The transition rules of the SHA A^snc inferred from those of the SHA A.
When applied to the SHA in Figure 4 for [self::list/child::item], the construction yields the SHA in Figure 11, which is indeed equal to the SHA from Figure 7 up to state renaming. When run on the hedge ⟨list · ⟨list · h1 · ⟨item · h2⟩⟩⟩, as shown in Figure 8, it does not have to visit the subhedges h1 or h2, since all of them will be reached starting from the projection state Π.
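The rule generation of Figure 10 can be sketched mechanically. Below is a Python sketch under our own encoding and naming (the sets P are frozensets so the pairs (q, P) can be dictionary keys); it assumes the sets no-change^Δ and s-no-change(q) have been precomputed, and produces the letter, open, and apply rules of Δ^snc for a given collection of pairs.

```python
PI = "PI"   # the fresh projection state, standing for the paper's symbol Pi

def snc_rules(letters, states, letter, open_, apply_, no_chg, s_no_chg, pairs):
    """Generate the transition rules of Figure 10 for the given (q, P) pairs."""
    lrules, orules, arules = {}, {}, {}
    for (q, P) in pairs:
        if q in P or q in no_chg:                 # q safely no longer changes
            for a in letters:
                lrules[((q, P), a)] = (q, P)      # loop on letters
            orules[(q, P)] = PI                   # go down to the state PI
            arules[((q, P), PI)] = (q, P)         # come back up unchanged
        else:                                     # non-projecting pair
            for a in letters:
                if (q, a) in letter:
                    lrules[((q, P), a)] = (letter[(q, a)], P)
            if q in open_:
                orules[(q, P)] = (open_[q], s_no_chg[q])
            for p in states:
                if (q, p) in apply_:
                    arules[((q, P), (p, s_no_chg[q]))] = (apply_[(q, p)], P)
    for a in letters:                             # PI loops on everything
        lrules[(PI, a)] = PI
    orules[PI] = PI
    arules[(PI, PI)] = PI
    return lrules, orules, arules
```

The hypothetical input automaton and its precomputed sets are ours; the sketch only illustrates how each rule of Figure 10 contributes one dictionary entry.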
Figure 11. The safe-no-change projection dSHA A^snc constructed from the dSHA A for the query [self::list/child::item] shown in Figure 4. Subhedge projection states are colored in orange. Useless states and transitions leading out of schema ⟦doc⟧ are omitted. State Π has a transition labeled by the wildcard letter "_", which stands for either letter list or item. This dSHA is equal to the dSHA in Figure 7 up to the state renaming 0 = (0, {}), 0′ = (0, {2, 4}), 0″ = (0, {1, 3, 4}), 1′ = (1, {2, 4}), 1″ = (1, {1, 3, 4}), 2′ = (2, {2, 4}), 2″ = (2, {1, 3, 4}), 3′ = (3, {2, 4}), and 4 = (4, {}). Recall that no-change^Δ = {2, 3, 4, 5}.
6.2. Soundness
We next state and prove a soundness result for safe-no-change projection.
Theorem 1 (Soundness of safe-no-change projection). If an SHA A is schema-complete for some schema S, then safe-no-change projection for A preserves the language within this schema:

    L(A^snc) ∩ S = L(A).

The projecting in-memory evaluator of A^snc will be more efficient than the non-projecting evaluator of A. Note that the size of A^snc may be exponentially bigger than that of A. Therefore, for evaluating a dSHA A with subhedge projection on a given hedge h, we may prefer to compute only the needed part of A^snc on-the-fly. This part has a size of O(|h|) and can be computed in time O(|A| |h|), so the exponential preprocessing time is avoided.
6.3. Incompleteness
Safe-no-change projection may be incomplete for subhedge projection, so that not all prefixes that are strongly subhedge irrelevant are mapped to subhedge projection states. This is shown by the counterexample dSHA A for the XPath filter [child::item] in Figure 12. Its safe-no-change projection A^snc is given in Figure 13. Note that the prefix u = item · item has the class class_S^{L(A)}(u) = item · item · N_Σ, which is subhedge irrelevant, so u is strongly subhedge irrelevant for the XPath filter [child::item]. Nevertheless, it leads to the state (2, {1, 3, 5, 6}), which is not a subhedge projection state since 2 ∉ {1, 3, 5, 6}. The problem is that this state can still be changed to (4, {1, 3, 5, 6}). This state is somehow equivalent with respect to the filter but not equal to (2, {1, 3, 5, 6}).
Another incompleteness problem should be mentioned: safe-no-change projection is sensitive to automata completion, as noticed earlier in Example 5. This is because a state may belong to no-change^Δ before completion but no longer after adding a sink. Nevertheless, such states never change on any tree satisfying the schema. This problem applies, for instance, to the example dSHA in Figure 4 for the XPath query [self::list/child::item]. Therefore, it was important to assume only schema-completeness for safe-no-change projection and not to impose full completeness.
Figure 12. A schema-complete dSHA for the XPath filter [child::item]. It is a counterexample for the completeness of safe-no-change projection; see Figure 13.

Figure 13. The safe-no-change projection A^snc of the dSHA A in Figure 12. It is incomplete for subhedge projection with schema ⟦doc⟧ at the state (2, {1, 3, 5, 6}): this state is not a subhedge projection state, even though the prefixes list · item and item · item leading to it are subhedge irrelevant.
7. Congruence Projection
We present the congruence projection algorithm for detecting irrelevant subhedges for regular hedge patterns with regular schema restrictions. We prove that congruence projection is complete for subhedge projection, resolving the incompleteness of safe-no-change projection. For this, congruence projection can no longer ignore the schema, so we have to assume that the input schema is regular, too.

7.1. Ideas

The starting idea of congruence projection is to resolve the counterexample for the completeness of safe-no-change projection in Figure 13. There, a state is changed when reading an irrelevant subhedge. But, the state change moves to a somehow equivalent state, so it is not really relevant. Therefore, the idea is to detect when a state always remains equivalent, rather than unchanged, when reading an arbitrary subhedge. This is eventually done by the congruence projection in Figure 14 that we are going to construct.

The obvious question is as follows: which equivalence relation do we choose? Suppose that we want to test whether a hedge satisfying a regular schema S matches a regular pattern L. In the restricted case of words without schema restrictions, the idea could be to use Myhill–Nerode's congruence diff_Σ^L but mapped to states. In the general case, however, the situation becomes more complex, given that in the general case diff_S^L may fail to be an equivalence relation. So, it may not be a congruence, as already illustrated in Example 4. Furthermore, the treatment of the nesting of hedges will require updating the considered relation when moving down a hedge.
Figure 14. The congruence projection dSHA A^cgr(⟦doc⟧) for the counterexample of the completeness of safe-no-change projection, i.e., the dSHA A for the filter [child::item] in Figure 12 with the schema final states F_S = {0, 5, 6}. Note that the state {1}D3 that is reached over the prefix list · item and the state {2}D3 that is reached over the prefix item · item are subhedge irrelevant, and thus, subhedge projection states.
7.1.1. Difference Relations on States
The congruence projection algorithm will maintain a difference relation on the states of the dSHA, which at the beginning will be induced by the difference relation on hedges diff_S^L and which is updated whenever moving down a hedge.

Definition 11. Let (Σ, Q, Δ, _, _) be a dSHA. A difference relation for Δ is a symmetric relation on states D ⊆ Q × Q such that for all h ∈ H_Σ:

    (q –h→ q′ wrt Δ ∧ p –h→ p′ wrt Δ ∧ (q′, p′) ∈ D) ⟹ (q, p) ∈ D.

The set of all difference relations for Δ is denoted by D_Δ.

We call a subset Q′ ⊆ Q compatible with a difference relation D ∈ D_Δ if Q′² ∩ D = ∅. This means that no two states of Q′ may be different with respect to D.
Definition 12. Let (Σ, Q, Δ, _, _) be an SHA and D a difference relation for Δ. A subset of states Q′ ⊆ Q is subhedge irrelevant for D wrt Δ if acc(Q′) is compatible with D. The set of all subhedge irrelevant subsets of states for D wrt Δ thus is as follows:

    dPrj_Δ(D) = {Q′ ⊆ Q | acc(Q′)² ∩ D = ∅}.

A state q ∈ Q is subhedge irrelevant for D if the singleton {q} is subhedge irrelevant for D. The set of all subhedge irrelevant states of Q is denoted by Prj_Δ(D).
We consider subhedge irrelevance for subsets of states since the congruence projection algorithm has to eliminate the nondeterminism that it introduces by a kind of SHA determinization. Determinization is necessary in order to recognize subhedge irrelevant prefixes properly: all single states in a subset may be subhedge irrelevant while the whole subset is not.
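Under our own encoding (a precomputed map `reach` with reach[q] = acc({q})), the subhedge irrelevance test of Definition 12 is a direct check; the sketch below also illustrates the point above, that every singleton may pass the test while their union fails it.

```python
def accessible(Qs, reach):
    """acc(Q'): union of the states accessible from each state of Q'."""
    out = set()
    for q in Qs:
        out |= reach[q]
    return out

def is_subhedge_irrelevant(Qs, D, reach):
    """Q' is in dPrj(D) iff no two states in acc(Q') are different wrt D."""
    A = accessible(Qs, reach)
    return not any((p, q) in D for p in A for q in A)
```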
7.1.2. Least Difference Relations
For any binary relation R ⊆ Q × Q, let ldr(R) be the least difference relation on states that contains R.

Lemma 8. (p1, p2) ∈ ldr(R) ⟺ ∃(q1, q2) ∈ R. ∃h ∈ H_Σ. p1 –h→ q1 wrt Δ ∧ p2 –h→ q2 wrt Δ.
Proof. The set {(p1, p2) ∈ Q² | ∃(q1, q2) ∈ R. ∃h ∈ H_Σ. p1 –h→ q1 wrt Δ ∧ p2 –h→ q2 wrt Δ} clearly is a difference relation that contains R, and thus contains the least such difference relation ldr(R). Conversely, each pair in the above set must be contained in any difference relation containing R, and thus in ldr(R).

Lemma 9. For any R ⊆ Q × Q, the difference relation ldr(R) is the value of the predicate D in the least fixed point of the ground datalog program generated by the following inference rules:

    p, q ∈ Q
    ------------------------
    D(p, q) :− R(p, q).

    p1 –a→ p2 wrt Δ    q1 –a→ q2 wrt Δ
    ------------------------------------
    D(p1, q1) :− D(p2, q2).

    p1@p → p2 wrt Δ    q1@q → q2 wrt Δ
    ------------------------------------
    D(p1, q1) :− D(p2, q2).

    p, q ∈ Q
    ------------------------
    D(q, p) :− D(p, q).
Proof. The first inference rule guarantees that R ⊆ D. The three later inference rules characterize difference relations D ∈ D_Δ. The second and third rules state that differences in D are propagated backwards over hedges h. This is done recursively by treating letter hedges h = a with the second rule and tree hedges h = ⟨h′⟩ with the third rule. The fourth rule expresses the symmetry of difference relations D. So, the least fixed point of the datalog program generated by the above rules contains R and is the least relation that satisfies all axioms of difference relations, so it is ldr(R).

Proposition 4. For any R ⊆ Q × Q, the least difference relation ldr(R) can be computed in time O(|A|²).

Proof. The size of the ground datalog program computing ldr(R) from Lemma 9 is at most quadratic in the size of A, so its least fixed point can be computed in time O(|A|²).
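A direct (unoptimized) fixpoint sketch of Lemma 9 in our encoding: saturate the pair set under symmetry and under backward propagation over the letter and apply rules until nothing new is derived.

```python
def ldr(R, letter, apply_):
    """Least difference relation containing R, by naive fixpoint saturation."""
    D = set(R)
    while True:
        new = {(q, p) for (p, q) in D} - D            # symmetry rule
        # backward propagation over letter rules: p1 -a-> p2, q1 -a-> q2
        for (p1, a), p2 in letter.items():
            for (q1, b), q2 in letter.items():
                if a == b and (p2, q2) in D:
                    new.add((p1, q1))
        # backward propagation over apply rules: p1@p -> p2, q1@q -> q2
        for (p1, _p), p2 in apply_.items():
            for (q1, _q), q2 in apply_.items():
                if (p2, q2) in D:
                    new.add((p1, q1))
        new -= D
        if not new:
            return D
        D |= new
```

This naive loop can be quadratic in the number of pairs per iteration; the ground datalog evaluation of Lemma 9 achieves the O(|A|²) bound, which the sketch does not attempt to match.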
7.2. Algorithm
We now define the congruence projection algorithm by a compiler from dSHAs with a set of schema final states to SHAs, which, when run on a hedge, can detect all subhedge irrelevant prefixes.
7.2.1. Inputs and Outputs
As inputs, the congruence projection algorithm is given a dSHA A = (Σ, Q, Δ, I, F) for the regular pattern and a set F_S with F ⊆ F_S ⊆ Q. The dSHA defines the regular language L = L(A), while the regular schema S is recognized by the dSHA A[F/F_S] = (Σ, Q, Δ, I, F_S). So, the same base automaton defines the regular language and the regular schema by choosing different sets of final states.
Example 10. In the example dSHA in Figure 4 for the XPath filter [self::list/child::item] with schema ⟦doc⟧, we have F = {4} and F_S = {4, 5}. In the automaton for the counterexample [child::item] for safe-no-change projection in Figure 12, we have F = {6} and F_S = {5, 6}.

We note that, if L and S were given by independent SHAs, then we could obtain a common dSHA A and a set F_S as above by computing the product of the dSHAs for L and S and completing it. As noticed in [16], it may be more efficient to determinize the SHA for S in a first step, build the product of the SHA for L with the dSHA for S in a second, and determinize this product in a third step.
The congruence projection of A wrt F_S will maintain in its current configuration a subset of states Q′ ⊆ Q and a difference relation on states D ∈ D_Δ. Thereby, the congruence projection dSHA can always detect whether the current prefix is subhedge irrelevant for L wrt schema S by testing whether the current set of states Q′ is subhedge irrelevant for the current difference relation D.
7.2.2. Initial Difference Relation
The initial difference relation on states D_init^{A,F_S} ∈ D_Δ is induced from the difference relation on prefixes diff_S^L as follows:

    D_init^{A,F_S} = {(q′, q″) | ∃(v′, v″) ∈ diff_S^L. ∃q0 ∈ I. q0 –hdg(v′)→ q′ wrt Δ ∧ q0 –hdg(v″)→ q″ wrt Δ}.

The next lemma indicates how D_init^{A,F_S} can be defined from A and F_S without reference to the languages L = L(A) and S = L(A[F/F_S]).

Lemma 10. D_init^{A,F_S} = ldr(F × (F_S \ F)) ∩ acc(I)².
Proof. For one direction, let (p′, p″) ∈ D_init^{A,F_S}. Then, there exist nested words (v′, v″) ∈ diff_S^L and an initial state q0 ∈ I such that q0 –hdg(v′)→ p′ wrt Δ and q0 –hdg(v″)→ p″ wrt Δ. Since v′, v″ are nested words, hedge accessibility (p′, p″) ∈ acc(I)² follows. Furthermore, (v′, v″) ∈ diff_S^L requires the existence of a hedge h ∈ H_Σ such that hdg(v′)·h ∈ L and hdg(v″)·h ∈ S\L. Hence, there are states q′ ∈ F and q″ ∈ F_S\F such that p′ –h→ q′ and p″ –h→ q″. Using Lemma 8, this implies that (p′, p″) ∈ ldr(F × (F_S\F)).

For the other direction, let (p′, p″) ∈ ldr(F × (F_S\F)) ∩ acc(I)². Using Lemma 8, property (p′, p″) ∈ ldr(F × (F_S\F)) shows that there exist states q′ ∈ F, q″ ∈ F_S\F, and h ∈ H_Σ such that p′ –h→ q′ wrt Δ and p″ –h→ q″ wrt Δ. From (p′, p″) ∈ acc(I)², it follows that there are nested words v′, v″ ∈ N_Σ and an initial state q0 ∈ I such that q0 –hdg(v′)→ p′ wrt Δ and q0 –hdg(v″)→ p″ wrt Δ. Hence, (v′, v″) ∈ diff_S^L, so that (p′, p″) ∈ D_init^{A,F_S}.
As a consequence of Lemma 10 and Proposition 4, the initial difference relation D_init^{A,F_S} can be computed in time O(|A|²) from A and F_S.
Proposition 5 (Soundness of the initial difference relation). Let A = (Σ, Q, Δ, I, F) be a complete dSHA, F ⊆ F_S ⊆ Q, L = L(A), and S = L(A[F/F_S]). For any hedge h ∈ H_Σ and state q ∈ Q such that q0 –h→ q wrt Δ for some q0 ∈ I, the nested word nw(h) is subhedge irrelevant for L and S if and only if q is subhedge irrelevant for D_init^{A,F_S} wrt Δ.
Proof. We show that nw(h) is subhedge relevant for L and S if and only if q is subhedge relevant for D_init^{A,F_S} wrt Δ.

Let nw(h) be subhedge relevant for L and S. Then, there exist hedges h′, h″ ∈ H_Σ and a nested word w ∈ N_Σ such that

    nw(h)·nw(h′)·w ∈ L    ∧    nw(h)·nw(h″)·w ∈ S\L.

Since A is deterministic and q0 –h→ q wrt Δ for the unique q0 ∈ I, it follows that there exist states p′, p″ ∈ Q, q′ ∈ F, and q″ ∈ F_S\F such that

    q –h′→ p′ –hdg(w)→ q′ wrt Δ    ∧    q –h″→ p″ –hdg(w)→ q″ wrt Δ.

From q′ ∈ F and q″ ∈ F_S\F, it follows that (q′, q″) ∈ D_init^{A,F_S}. Since D_init^{A,F_S} is a difference relation, and p′ –hdg(w)→ q′ wrt Δ and p″ –hdg(w)→ q″ wrt Δ, this implies that (p′, p″) ∈ D_init^{A,F_S}, too. Hence, (p′, p″) ∈ acc({q})² ∩ D_init^{A,F_S}, i.e., q is relevant for D_init^{A,F_S} wrt Δ.
Let q be subhedge relevant for D_init^{A,F_S} wrt Δ. Then, there exists a pair (p′, p″) ∈ acc({q})² ∩ D_init^{A,F_S}. So, there are hedges h′ and h″ such that

    q –h′→ p′ wrt Δ    ∧    q –h″→ p″ wrt Δ.

Since D_init^{A,F_S} = ldr(F × (F_S\F)) ∩ acc(I)² by Lemma 10, there exist a nested word w and a pair (q′, q″) ∈ Q² so that either (q′, q″) or (q″, q′) belongs to F × (F_S\F) and

    p′ –hdg(w)→ q′ wrt Δ    ∧    p″ –hdg(w)→ q″ wrt Δ.

Hence, nw(h)·nw(h′)·w ∈ L and nw(h)·nw(h″)·w ∈ S\L, or vice versa. This shows that nw(h) is relevant for L and S.
7.2.3. Updating the Difference Relation
For any difference relation D ∈ D_Δ and subset of states Q′ ⊆ Q, we define a binary relation down_{Q′}(D) ⊆ Q × Q such that for all states p1, p2 ∈ Q:

    (p1, p2) ∈ down_{Q′}(D)    iff    ∃q1, q2 ∈ Q′. (q1@p1, q2@p2) ∈ D.

For any subset of states Q′ ⊆ Q and difference relation D ∈ D_Δ, let D^{Q′} ∈ D_Δ be the least difference relation that contains down_{Q′}(D):

    D^{Q′} = ldr(down_{Q′}(D)).

Lemma 11. The difference relation D^{Q′} can be computed in time O(|A|²) from Q′, Δ, and D.

Proof. The binary relation down_{Q′}(D) can be computed in time O(|Δ|²) from Q′, D, and Δ. The least difference relation ldr(down_{Q′}(D)) can then be computed by ground datalog in time O(|A|²) from down_{Q′}(D) using Proposition 4.
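The update of the difference relation when moving down from a set of states Q′ can be sketched as follows in our encoding, where `apply_` maps (q, p) to q@p. The result down_{Q′}(D) is then closed under ldr as in Lemma 9 to obtain D^{Q′}; only the down step is shown here.

```python
def down(Qs, D, apply_):
    """down_Q'(D): pairs (p1, p2) with (q1@p1, q2@p2) in D for some q1, q2 in Q'."""
    out = set()
    for (q1, p1), s1 in apply_.items():
        if q1 not in Qs:
            continue
        for (q2, p2), s2 in apply_.items():
            if q2 in Qs and (s1, s2) in D:
                out.add((p1, p2))
    return out
```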
Example 11. Reconsider the dSHA for the XPath filter [self::list/child::item] in Figure 4 with the set of schema-final states F_S = {4, 5}. Since F = {4}, we have F_S\F = {5}. The initial difference relation is D0 = D_init^{A,F_S}, which is the symmetric closure of ({0} × {1, 4, 5}) ∪ {(1, 4), (4, 5)}. The subhedge irrelevant states are Prj_Δ(D0) = {1, 2, 3, 4, 5}, since only state 0 can access two states in the difference relation D0, namely the final state in F = {4} and a nonfinal state that is schema-final in F_S\F = {5}, on some hedges. The difference relation down_{{0}}(D_init^{A,F_S}) is the symmetric closure of {1, 2} × {3}, which is equal to the difference relation D1 = D0^{{0}} = ldr(down_{{0}}(D0)). Hence, the projection states are Prj_Δ(D1) = {2, 3, 4, 5}, since from states 0 and 1 one can still reach both 1 and 3, while (1, 3) ∈ D1. The difference relation D2 = down_{{1}}(D1) is the symmetric closure of {(1, 2), (2, 3)}. Hence, the projection states are Prj_Δ(D2) = {1, 2, 3, 4, 5}. From state 1, one can only reach states 1 and 3, which are not different for D2.
Projection states for the initial difference relation contain all states that are safe for selection or safe for rejection with respect to the schema:

Lemma 12. All states in safe(F) and safe(F_S\F) are subhedge irrelevant for D_init^{A,F_S}.

We omit the proof since the result is not used later on.
Given a dSHA A = (Σ, Q, Δ, I, F) and a set F ⊆ F_S ⊆ Q, we now construct the congruence projection A^cgr(S) as the following dSHA:

    A^cgr(S) = (Σ, Q^cgr, Δ^cgr, I^cgr(S), F^cgr(S)),

where the set of states, initial states, and final states are as follows:

    Q^cgr ⊆ 2^Q × D_Δ
    I^cgr(S) = {(I, D_init^{A,F_S})}
    F^cgr(S) = {(Q′, D_init^{A,F_S}) | (Q′ ∉ dPrj_Δ(D_init^{A,F_S}) ∧ Q′ ∩ F ≠ ∅) ∨ (Q′ ∈ dPrj_Δ(D_init^{A,F_S}) ∧ acc(Q′) ∩ F ≠ ∅)}
The transition rules in Δ^cgr are provided by the inference rules in Figure 15. To illustrate the construction and why determinization is needed, we reconsider Example 3, where schema restrictions have consequences that at first sight may be counter-intuitive.

    a ∈ Σ    Q ∈ dPrj_Δ(D)
    ------------------------------------
    (Q, D) –a→ (Q, D) in Δ^cgr

    Q ∈ dPrj_Δ(D)
    ------------------------------------
    (Q, D) –⟨⟩→ (Q, D) in Δ^cgr

    Q ∈ dPrj_Δ(D)
    ------------------------------------
    (Q, D)@(Q, D) → (Q, D) in Δ^cgr

    Q –a→ Q′ in Δ^det    Q ∉ dPrj_Δ(D)
    ------------------------------------
    (Q, D) –a→ (Q′, D) in Δ^cgr

    Q ∉ dPrj_Δ(D)
    ------------------------------------
    (Q, D) –⟨⟩→ (⟨⟩, D^Q) in Δ^cgr

    Q@acc(P) → Q′ in Δ^det    Q ∉ dPrj_Δ(D)    P ∈ dPrj_Δ(D^Q)
    --------------------------------------------------------------
    (Q, D)@(P, D^Q) → (Q′, D) in Δ^cgr

    Q@P → Q′ in Δ^det    Q ∉ dPrj_Δ(D)    P ∉ dPrj_Δ(D^Q)
    --------------------------------------------------------------
    (Q, D)@(P, D^Q) → (Q′, D) in Δ^cgr

Figure 15. The transition rules Δ^cgr of the congruence projection A^cgr(S).
Example 12 (Determinization during congruence projection is needed). Reconsider Example 3 with signature Σ = {a}, pattern L = {⟨a⟩}, and schema S = L ∪ {⟨⟩·a}. This language can be defined by the dSHA A in Figure 16. The schema S can be defined by the same automaton but with the schema-final states F_S = {3, 5}, while F = {3}. The dSHA A^cgr(S) produced by congruence projection is shown in Figure 17. The unique hedge ⟨a⟩ of the language L = L(A) is accepted in state ({3, 4}, D0), where D0 = {(3, 5), (5, 3)}. The unique hedge ⟨⟩·a in S\L is rejected: after reading the tree ⟨⟩, the automaton goes to the state ({3, 4}, D0), where it goes to a projecting sink ({5}, D0) when reading the subsequent letter a. We note that any hedge in ⟨Σ*⟩ is accepted in state ({3, 4}, D0) by A^cgr(S), even though only the single hedge ⟨a⟩ belongs to L. This is sound given that the hedges in ⟨ε + a·a·a*⟩ do not belong to schema S anyway. We notice that ({3, 4}, D0) cannot be a projection state since the run on hedge ⟨⟩·a must continue to the sink ({5}, D0); therefore, state ({3, 4}, D0) is subhedge relevant.
Figure 16. A dSHA A for L(A) = {⟨a⟩} and schema-final states F_S = {3, 5} for the schema S = L(A) ∪ {⟨⟩·a}.

Figure 17. The congruence projection A^cgr(S) for the dSHA A and the schema-final states F_S in Figure 16.

Note that both individual states 3 and 4 are subhedge irrelevant for D0, while the subset with both states {3, 4} is subhedge relevant for D0, since (3, 5) ∈ acc({3, 4})² ∩ D0. This shows that the deterministic construction is indeed needed to decide subhedge irrelevance for cases such as in this example. Also, notice that state 0 is subhedge irrelevant for D1 = ∅. This is fine since the acceptance of hedges of the form ⟨h⟩·h′ by A^cgr(S) does indeed not depend on h. In contrast, it depends on h′, so that the subset of states {3, 4} cannot be subhedge irrelevant for D0.
Finally, let us discuss how the transition rules of the dSHA A^cgr(S) are inferred. The most interesting transition rule is ({0}, D0)@({0}, D1) → ({3, 4}, D0) in Δ^cgr. It is produced by the following inference rule, where Q = P = {0}, acc(P) = {0, 1, 3, 4, 5}, Q′ = {3, 4}, D = D0, and D^Q = D1:

    Q@acc(P) → Q′ in Δ^det    Q ∉ dPrj_Δ(D)    P ∈ dPrj_Δ(D^Q)
    --------------------------------------------------------------
    (Q, D)@(P, D^Q) → (Q′, D) in Δ^cgr

This rule states that if P is subhedge irrelevant for D^Q, then all states accessible from P must be tried out, since all of them could be reached by some different subhedge that got projected away. As a result, one cannot know, due to subhedge projection, into which state of Q one should go. In order to stay deterministic, we thus go into all possible states, i.e., into the subset Q′. In the example, these are all the states that can be reached when reading some hedge in the pattern {⟨h⟩ | h ∈ H_Σ}, given that the subhedges h ∈ Σ* are not inspected with subhedge projection.
Example 13 (XPath filter [self::list/child::item] with schema ⟦doc⟧). The congruence projection of the example dSHA in Figure 4 is given in Figure 18. This dSHA is similar to the safe-no-change projection in Figure 7, except that the useless state 2 is removed and the four projection states are now looping in themselves rather than going to a shared looping state Π. It should also be noticed that only singleton state sets are used there, so no determinization is needed. As we will see, this is typical for experiments on regular XPath queries, where larger subsets rarely occur.
Figure 18. The congruence projection A^cgr(⟦doc⟧) of the dSHA A in Figure 4 for the XPath filter [self::list/child::item] with schema ⟦doc⟧. Subhedge irrelevant states are colored in orange. The underscore stands for any label, either list or item.
Example 14 (Counterexample for completeness of the safe-no-change projection). Reconsider the counterexample for the completeness of the safe-no-change projection, i.e., the dSHA in Figure 12 for the XPath query [child::item] with schema final states F_S = {0, 5, 6}. Its congruence projection is shown in Figure 14. We note that the prefix item · item leads to the state {2}D3, which is a subhedge projection state since 2 is subhedge irrelevant for D3. In particular, note that acc({2}) = {2, 4}, so no two states accessible from 2 are different in D3. This means that state 2 may still be changed to 4, but then it does not become different with respect to D3. This resolves the incompleteness issue with the safe-no-change projection on this example.
The first property of the states (Q′, D) assigned by the congruence projection is that Q′ is compatible with D. This means that no two states in Q′ are different with respect to D. Intuitively, each state of Q′ is as good as any other, except for leading out of the schema. So, if (Q′, D) is assigned to a prefix, and a suffix leads from some state in Q′ to F, then it cannot lead to F_S\F from some other state in Q′.

Lemma 13. If the partial run of A^cgr(S) assigns a state (Q′, D) to a prefix, then Q′ is compatible with D.

We omit the proof since this instructive result will not be used directly later on. Still, compatibility will play an important role in the soundness proof.
7.3. Soundness
We next adapt the soundness result and proof from the safe-no-change projection to
congruence projection.
Theorem 2 (Soundness of congruence projection). For any dSHA A = (Σ, Q, Δ, I, F) and set F ⊆ F_S ⊆ Q, the congruence projection of A with respect to F_S preserves the language of A within the schema S = L(A[F/F_S]), i.e.,

    L(A^cgr(S)) ∩ S = L(A).

We note that S is a schema for A since F ⊆ F_S. Furthermore, A is schema-complete for S since A is deterministic and S = L(A[F/F_S]).

Proof. The proof of the inclusion L(A) ⊆ L(A^cgr(S)) ∩ S will be based on the following two claims:

Claim 2.1a. For all h ∈ H_Σ, D ∈ D_Δ, and Q ∈ dPrj_Δ(D): (Q, D) –h→ (Q, D) wrt Δ^cgr.
Proof. We prove Claim 2.1a by induction on the structure of h. Suppose that Q ∈ dPrj_Δ(D).

Case h = ⟨h′⟩. The induction hypothesis applied to h′ yields (Q, D) –h′→ (Q, D) wrt Δ^cgr. We can thus apply the following two inference rules:

    Q ∈ dPrj_Δ(D)                         Q ∈ dPrj_Δ(D)
    ------------------------------        -----------------------------------
    (Q, D) –⟨⟩→ (Q, D) in Δ^cgr           (Q, D)@(Q, D) → (Q, D) in Δ^cgr

in order to close the following diagram with respect to Δ^cgr:

    (Q, D) –⟨⟩→ (Q, D) –h′→ (Q, D),    (Q, D)@(Q, D) → (Q, D)

This proves (Q, D) –⟨h′⟩→ (Q, D) wrt Δ^cgr, as required by the claim.

Case h = a. Since Q ∈ dPrj_Δ(D), we can apply the inference rule:

    a ∈ Σ    Q ∈ dPrj_Δ(D)
    ------------------------------
    (Q, D) –a→ (Q, D) in Δ^cgr

This proves this case of the claim.

Case h = ε. We trivially have (Q, D) –ε→ (Q, D) wrt Δ^cgr.

Case h = h′·h″. By the induction hypothesis applied to h′ and h″, we obtain (Q, D) –h′→ (Q, D) and (Q, D) –h″→ (Q, D) wrt Δ^cgr. Hence, (Q, D) –h′·h″→ (Q, D) wrt Δ^cgr.
This ends the proof of Claim 2.1a. The next claim is the key of the soundness proof. For any difference relation D, we define a binary relation depa_Δ(D) ⊆ 2^Q × 2^Q such that, for any two subsets of states Q′, Q″ ⊆ Q:

    (Q′, Q″) ∈ depa_Δ(D) ⟺ (Q′ ∉ dPrj_Δ(D) ⟹ Q′ ⊆ Q″)    (1a)
                           ∧ (Q′ ∈ dPrj_Δ(D) ⟹ (Q′ ⊆ acc(Q″) ∧ Q″ ∈ dPrj_Δ(D)))    (2a)

Furthermore, note that Q″ ∈ dPrj_Δ(D) ∧ Q′ ⊆ acc(Q″) implies Q′ ∈ dPrj_Δ(D), and thus (Q′, Q″) ∈ depa_Δ(D).

Claim 2.2a. Let h ∈ H_Σ be a hedge, Q, Q′ ⊆ Q subsets of states, and D ∈ D_Δ a difference relation. If Q –h→ Q′ wrt Δ^det, then there exists Q″ ⊆ Q such that (Q, D) –h→ (Q″, D) wrt Δ^cgr and (Q′, Q″) ∈ depa_Δ(D).
Proof. If Q ∈ dPrj_Δ(D), then (Q, D) –h→ (Q, D) wrt Δ^cgr by Claim 2.1a. Let Q″ = Q. We then have Q′ ⊆ acc(Q″) and Q″ ∈ dPrj_Δ(D), and thus Q′ ∈ dPrj_Δ(D), so that (1a) and (2a) of (Q′, Q″) ∈ depa_Δ(D) hold. So, it is sufficient to prove the claim under the assumption that Q ∉ dPrj_Δ(D). The proof is by induction on the structure of h.
Case h = ⟨h′⟩. The assumption Q –h→ Q′ wrt Δ^det shows that there exists a subset of states P ⊆ Q closing the following diagram:

    Q –⟨⟩→ ⟨⟩ –h′→ P,    Q@P → Q′

In particular, we have Q@P → Q′ in Δ^det. Let D′ = D^Q. Since Q ∉ dPrj_Δ(D), we can infer the following:

    Q ∉ dPrj_Δ(D)
    ------------------------------
    (Q, D) –⟨⟩→ (⟨⟩, D′) in Δ^cgr

We have to distinguish two cases depending on whether ⟨⟩ belongs to dPrj_Δ(D′) or not.

Subcase ⟨⟩ ∉ dPrj_Δ(D′). The induction hypothesis applied to h′ yields the existence of P′ ⊆ Q such that (P, P′) ∈ depa_Δ(D′) and (⟨⟩, D′) –h′→ (P′, D′) wrt Δ^cgr.
Subsubcase P∈ dPrj(D).
Since
(P
,
P)depa(D)
, it follows that
PP
.
Hence,
P∈ dPrj(D)
. Let
Q′′ Q
be such that
Q
@
PQ′′ in det
. We
can then apply the following inference rule:
Q@PQ′′ in det Q∈ dPrj(D)P dPrj(D)
(Q,D)@(P,D)(Q′′,D)in cgr
Hence, we can close the diagram as follows:
(Q,D)
(⟨⟩,D)h(P,D)
(Q′′,D)
⟨⟩
This shows that
(Q
,
D)h
(Q′′
,
D)wrt cgr
. Since
PP
,
Q
@
PQ
, and
Q
@
PQ′′ in det
, it follows that
QQ′′
, and thus,
(Q
,
Q′′)depa(D)
.
This shows the claim in this case.
Subsubcase P′ ∈ dPrj(D′). Since (P,P′) ∈ depa(D′), it follows that P ⊆ acc(P′) and P ∈ dPrj(D′). Hence, we can apply the following inference rule for some Q′′ ⊆ Q:

  Q @ acc(P′) → Q′′ in Δdet    Q ∉ dPrj(D)    P′ ∈ dPrj(D′)
  -----------------------------------------------------------
  (Q,D) @ (P′,D′) → (Q′′,D) in Δcgr

Hence, we can close the diagram as follows:

    (⟨⟩,D′) -h′→ (P′,D′)
    (Q,D) -⟨h′⟩→ (Q′′,D)

This shows that (Q,D) -⟨h′⟩→ (Q′′,D) wrt Δcgr. Since P ⊆ acc(P′), it follows from Q @ P → Q′ and Q @ acc(P′) → Q′′ in Δdet that Q′ ⊆ Q′′. Thus, (Q′,Q′′) ∈ depa(D), so the claim holds in this case too.
Subcase ⟨⟩ ∈ dPrj(D′). Claim 2.1a shows that (⟨⟩,D′) -h′→ (⟨⟩,D′) wrt Δcgr. Let Q′′ be such that Q @ acc(⟨⟩) → Q′′ in Δdet. We can apply the following inference rule:

  Q @ acc(⟨⟩) → Q′′ in Δdet    Q ∉ dPrj(D)    ⟨⟩ ∈ dPrj(D′)
  ----------------------------------------------------------
  (Q,D) @ (⟨⟩,D′) → (Q′′,D) in Δcgr

We can thus close the diagram below as follows with respect to Δcgr:

    (⟨⟩,D′) -h′→ (⟨⟩,D′)
    (Q,D) -⟨h′⟩→ (Q′′,D)

This shows that (Q,D) -⟨h′⟩→ (Q′′,D) wrt Δcgr. Since P ⊆ acc(⟨⟩), it follows from Q @ P → Q′ and Q @ acc(⟨⟩) → Q′′ in Δdet that Q′ ⊆ Q′′, and thus, (Q′,Q′′) ∈ depa(D).
Case h=a.Since Q∈ dPrj(D), we can apply the inference rule:
Qa
Qin det Q∈ dPrj(D)
(Q,D)a
(Q,D)in cgr
Hence,
(Q
,
D)h
(Q
,
D)wrt cgr
. With
Q′′ =Q
, the claim follows since
(Q
,
Q′′)
depa(D).
Case h = ε. In this case, we have Q′′ = Q and (Q,D) -ε→ (Q,D) wrt Δcgr, so the claim holds.
Case h = h1·h2. Since Q -h→ Q′ wrt Δdet, there exists Q1 ⊆ Q such that Q -h1→ Q1 and Q1 -h2→ Q′ wrt Δdet. By the induction hypothesis applied to h1, there exists Q′1 ⊆ Q such that (Q,D) -h1→ (Q′1,D) wrt Δcgr and (Q1,Q′1) ∈ depa(D).
Subcase Q1 ∈ dPrj(D). Since (Q1,Q′1) ∈ depa(D), it follows that Q1 ⊆ acc(Q′1) and Q′1 ∈ dPrj(D). Furthermore, Claim 2.1a shows that (Q′1,D) -h2→ (Q′1,D), and hence, (Q,D) -h→ (Q′1,D) wrt Δcgr. Since Q′ ⊆ acc(Q1) and Q1 ⊆ acc(Q′1), it follows that Q′ ⊆ acc(Q′1). Hence, (Q′,Q′1) ∈ depa(D) and (Q,D) -h→ (Q′1,D).
Subcase Q1∈ dPrj(D)
. In this case, we can apply the induction hypothesis to
Q1
h2
Qwrt det
, showing that there exists
Q′′ Q
, such that
(Q
1
,
D)h2
(Q′′
,
D)wrt cgr
and
(Q
,
Q′′)depa(D)
. Hence,
(Q
,
D)h
(Q′′
,
D)wrt cgr
and (Q,Q′′)depa(D).
This ends the proof of Claim 2.2a.
Proof of Inclusion L(A) ⊆ L(Acgr(S)) ∩ S. Since S is a schema for A, we have L(A) ⊆ S, so that it is sufficient to show L(A) ⊆ L(Acgr(S)). Let h ∈ L(A). Then, there exist q0 ∈ I and q ∈ F such that q0 -h→ q wrt Δ. Let Q′ = {q}. Since A is deterministic, it follows that I -h→ Q′ wrt Δdet. Using Claim 2.2a, this implies the existence of a subset of states Q ⊆ Q such that (Q′,Q) ∈ depa(D_init^{A,F_S}) and (I,D_init^{A,F_S}) -h→ (Q,D_init^{A,F_S}) wrt Δcgr. Furthermore, (I,D_init^{A,F_S}) ∈ Icgr(S). In order to prove h ∈ L(Acgr(S)), it is thus sufficient to show that (Q,D_init^{A,F_S}) ∈ Fcgr(S). We distinguish two cases:
Case Q∈ dPrj(DA,FS
init )
. Condition (1
a
) of
(Q
,
Q)depa(D)
, shows that
QQ
so that
qQ
. Since
qF
, it follows that
QF=
. Thus,
(Q
,
DA,FS
init )Fcgr(S)
, so that
h L(Acgr(S)).
Case QdPrj(DA,FS
init )
. Condition (2
a
) of
(Q
,
Q)depa(D)
yields
Qacc(Q)
. Hence,
qacc(Q)Fso that (Q,DA,FS
init )Fcgr(S), and thus, h L(Acgr(S)).
This ends the proof of the first inclusion.
We next want to show the inverse inclusion L(Acgr(S)) ∩ S ⊆ L(A). It will eventually follow from the next two claims.

Claim 2.1b. For any hedge h ∈ HΣ, difference relation D ∈ D, projection state Q ∈ dPrj(D), and state µ ∈ Qcgr: if (Q,D) -h→ µ wrt Δcgr, then µ = (Q,D).
Proof. By induction on the structure of h ∈ HΣ, suppose that (Q,D) -h→ µ wrt Δcgr.

Case h = ⟨h′⟩. There must exist states µ1, µ′1 ∈ Qcgr closing the following diagram:

    µ1 -h′→ µ′1
    (Q,D) -⟨h′⟩→ µ

Since Q ∈ dPrj(D), the following rule must have been applied to infer (Q,D) -⟨⟩→ µ1 wrt Δcgr:

  Q ∈ dPrj(D)
  ------------------------------
  (Q,D) -⟨⟩→ (Q,D) in Δcgr

Therefore, µ1 = (Q,D). The induction hypothesis applied to (Q,D) -h′→ µ′1 wrt Δcgr shows that µ′1 = (Q,D), too. So, µ must have been inferred by applying the rule:

  Q ∈ dPrj(D)
  ------------------------------------
  (Q,D) @ (Q,D) → (Q,D) in Δcgr

Hence, µ = (Q,D), as required.

Case h = a. The following rule must be applied:

  a ∈ Σ    Q ∈ dPrj(D)
  ---------------------------
  (Q,D) -a→ (Q,D) in Δcgr

Hence, µ = (Q,D).
Case h = ε. Obvious.

Case h = h1·h2. There must exist some µ1 such that (Q,D) -h1→ µ1 -h2→ µ wrt Δcgr. Using the induction hypothesis applied to h1, we have µ1 = (Q,D). We can thus apply the induction hypothesis to h2 to obtain µ = (Q,D).

This ends the proof of Claim 2.1b.
We next need an inverse of Claim 2.2a. For relating runs of Acgr(S) back to runs of A, we define for any difference relation D another binary relation depb(D) ⊆ 2^Q × 2^Q such that, for any two subsets of states Q′, Q′′ ⊆ Q:

  (Q′,Q′′) ∈ depb(D)  ⟺  (Q′ × Q′′) ∩ D = ∅  (0b)
                      ∧  ( Q′ ∉ dPrj(D) ⟹ Q′′ ⊆ Q′ )  (1b)
                      ∧  ( Q′ ∈ dPrj(D) ⟹ Q′′ ⊆ acc(Q′) )  (2b)
Claim 2.2b. Let Q ⊆ Q be a subset of states that is compatible with a difference relation D ∈ D. For any hedge h ∈ HΣ and state µ ∈ Qcgr with (Q,D) -h→ µ wrt Δcgr, there exists a pair of subsets of states (Q′,Q′′) ∈ depb(D) such that µ = (Q′,D) and Q -h→ Q′′ wrt Δdet.
Proof. If Q ∈ dPrj(D), then Claim 2.1b shows that µ = (Q,D). Let Q′ = Q and let Q′′ be the unique subset of states such that Q -h→ Q′′ wrt Δdet. We have to show that (Q′,Q′′) ∈ depb(D). Since Q -h→ Q′′, we have Q′′ ⊆ acc(Q), so condition (2b) holds. Condition (1b) holds trivially since Q ∈ dPrj(D). For condition (0b), note that Q ∈ dPrj(D) implies acc(Q)² ∩ D = ∅. Furthermore, Q × Q′′ ⊆ acc(Q)², so that (Q′ × Q′′) ∩ D = ∅, as required.

We can thus assume that Q ∉ dPrj(D). The proof is by induction on the structure of h ∈ HΣ. We distinguish all the possible forms of hedges h ∈ HΣ:
Case h = ⟨h′⟩. By definition of (Q,D) -h→ µ wrt Δcgr, there must exist µ1, µ′1 ∈ Qcgr such that the following diagram can be closed:

    µ1 -h′→ µ′1
    (Q,D) -⟨h′⟩→ µ

Since Q ∉ dPrj(D), the following inference rule was applied to infer (Q,D) -⟨⟩→ µ1 wrt Δcgr, where D′ = D_Q:

  Q ∉ dPrj(D)
  ------------------------------
  (Q,D) -⟨⟩→ (⟨⟩,D′) in Δcgr
Hence, µ1 = (⟨⟩,D′). If ⟨⟩ ∉ dPrj(D′), then the induction hypothesis applied to (⟨⟩,D′) -h′→ µ′1 wrt Δcgr shows that there exists (P′,P′′) ∈ depb(D′) such that µ′1 = (P′,D′) and ⟨⟩ -h′→ P′′ wrt Δdet. Otherwise, the same can be concluded as we argued at the beginning.
Subcase P′ ∈ dPrj(D′). The transition rule (Q,D) @ µ′1 → µ must be inferred as follows for some Q′ ⊆ Q:

  Q @ acc(P′) → Q′ in Δdet    Q ∉ dPrj(D)    P′ ∈ dPrj(D′)
  ----------------------------------------------------------
  (Q,D) @ (P′,D′) → (Q′,D) in Δcgr

This shows that µ = (Q′,D). So, we have the following diagram:

    (⟨⟩,D′) -h′→ (P′,D′)
    (Q,D) -⟨h′⟩→ (Q′,D)
Let Q′′ be the unique subset of states such that Q @ P′′ → Q′′ wrt Δdet. We can then close the following diagram:

    ⟨⟩ -h′→ P′′
    Q -⟨h′⟩→ Q′′

From (P′,P′′) ∈ depb(D′), it follows that (P′ × P′′) ∩ D′ = ∅, and thus, (Q′ × Q′′) ∩ D = ∅, i.e., condition (0b) of (Q′,Q′′) ∈ depb(D). Since P′ ∈ dPrj(D′), condition (2b) of (P′,P′′) ∈ depb(D′) yields P′′ ⊆ acc(P′). Furthermore, Q @ P′′ → Q′′ and Q @ acc(P′) → Q′ in Δdet, which yields Q′′ ⊆ Q′, so that conditions (1b) and (2b) of (Q′,Q′′) ∈ depb(D) follow. Hence, (Q′,Q′′) ∈ depb(D). In summary, we have shown that µ = (Q′,D), Q -h→ Q′′ wrt Δdet, and (Q′,Q′′) ∈ depb(D), as required by the claim.
Subcase P∈ dPrj(D).
The transition rule
(Q
,
D)
@
µ1µ
must thus be inferred
as follows for Q Q:
Q@PQin det Q∈ dPrj(D)P∈ dPrj(D)
(Q,D)@(P,D)(Q,D)in cgr
This shows that µ= (Q,D), and we can close the following diagram:
(Q,D)
(⟨⟩,D)h(P,D)
(Q,D)
⟨⟩
Condition (0
b
) of
(P
,
P′′)depb(D)
yields
(P×P′′)D=
. Let
Q′′
be the
unique subset of states, such that
Q
@
P′′ Q′′
wrt.
det
. Since
Ddown
Q(D)
and
(P×P′′)D=
, it follows from
Q
@
PQ
and
Q
@
P′′ Q′′
wrt.
det
that
(QQ′′)D=
. That is, condition (0
b
) of
(Q
,
Q′′)depb(D)
.
Since
P∈ dPrj(D)
, condition (1
b
) of
(P
,
P′′)depb(D)
is
P′′ P
. From
Q
@
P′′ Q′′
and
Q
@
PQ
, it thus follows that
Q′′ Q
. Hence, conditions
(1
b
) and (2
b
) of
(Q
,
Q′′)depb(D)
are valid, so that
(Q
,
Q′′)depb(D)
holds.
Furthermore,
Q
⟨⟩hP′′
Q′′
⟨⟩
This shows that
Qh
Q′′
wrt
. The other two requirements of the claim,
µ= (Q,D)and (Q,Q′′ )depb(D), were shown earlier.
Case h=a.Since Q∈ dPrj(D), the following inference rule must be used:
Qa
Qin Q∈ dPrj(D)
(Q,D)a
(Q,D)in cgr
So
µ= (Q
,
D)
,
Qa
Qwrt det
. Furthermore, since
Q
is compatible with
D
, and
D
is a difference relation,
Q
is compatible with
D
too. Hence,
(Q×Q)D=
,
showing condition (0
b
) of
(Q
,
Q)depb(D)
. Trivially,
QQacc(Q)
so
conditions (1b) and (2b) of (Q,Q)depb(D)hold, too. Hence, (Q,Q)depb(D).
Case h = ε. Obvious.

Case h = h1·h2. Since (Q,D) -h→ µ wrt Δcgr, there exists some µ1 ∈ Qcgr such that (Q,D) -h1→ µ1 and µ1 -h2→ µ wrt Δcgr. We can apply the induction hypothesis to h1. Hence, there exist subsets of states Q1, Q′1 ⊆ Q such that µ1 = (Q1,D), Q -h1→ Q′1 wrt Δdet, and (Q1,Q′1) ∈ depb(D). We distinguish two cases:
Subcase Q1 ∈ dPrj(D). Since µ1 = (Q1,D) -h2→ µ wrt Δcgr, Claim 2.1b shows that µ = (Q1,D), so that (Q1,D) -h2→ (Q1,D) wrt Δcgr. Condition (2b) of (Q1,Q′1) ∈ depb(D) and Q1 ∈ dPrj(D) imply that Q′1 ⊆ acc(Q1). Let Q2 be the unique subset of states such that Q′1 -h2→ Q2 wrt Δdet. Then, Q2 ⊆ acc(Q′1), so that Q2 ⊆ acc(Q1), and thus, condition (2b) of (Q1,Q2) ∈ depb(D) holds. Condition (1b) of (Q1,Q2) ∈ depb(D) is trivial since Q1 ∈ dPrj(D). Since Q1 ∈ dPrj(D) and Q1 × Q2 ⊆ acc(Q1)², it follows that (Q1 × Q2) ∩ D = ∅, so condition (0b) of (Q1,Q2) ∈ depb(D) holds, too. Hence, Q -h→ Q2 wrt Δdet and (Q1,Q2) ∈ depb(D).
Subcase Q1∈ dPrj(D)
. Since
Q1
h1
Q
1
and
Q1
is compatible with difference
relation
D
, it follows that
Q
1
is compatible with
D
, too. We can thus apply the
induction hypothesis to
(Q1
,
D)h2
µ
, showing the existence of subsets of states
(Q2
,
Q
2)depb(D)
, such that
µ= (Q2
,
D)
and
Q1
h2
Q
2
wrt
. So, we have
Qh
Q
2wrt. and (Q2,Q
2)depb(D), as required.
This ends the proof of Claim 2.2b.
Proof of Inclusion L(Acgr(S)) ∩ S ⊆ L(A). Let h ∈ L(Acgr(S)) ∩ S. Then, there exists a subset of states Q with (Q,D_init^{A,F_S}) ∈ Fcgr(S) such that (I,D_init^{A,F_S}) -h→ (Q,D_init^{A,F_S}) wrt Δcgr. Since I is a singleton or empty, it is compatible with D_init^{A,F_S}. Claim 2.2b shows that there exists a subset of states Q′ ⊆ Q such that I -h→ Q′ wrt Δdet and (Q,Q′) ∈ depb(D_init^{A,F_S}). Condition (0b) of (Q,Q′) ∈ depb(D_init^{A,F_S}) shows that (Q × Q′) ∩ D_init^{A,F_S} = ∅. Since h ∈ S, it follows that Q′ ∩ F_S ≠ ∅. The determinism of A shows that Q′ is a singleton. So, there exists a state q ∈ F_S such that Q′ = {q}.
Case Q∈ dPrj(DA,FS
init )
. Condition (1
b
) shows that
QQ
so that
qQ
. Since
QFcgr
,
we have
QF=
. Since
Q
is compatible with
DA,FS
init
,
qQ
, and
QF=
, it
follows that q∈ FS\F. Since qFS, this implies qF, and thus, h L(A).
Case Q ∈ dPrj(D_init^{A,F_S}). Condition (2b) shows that Q′ ⊆ acc(Q), so that q ∈ acc(Q). Since (Q,D_init^{A,F_S}) ∈ Fcgr(S), we have acc(Q) ∩ F ≠ ∅. Let q′ ∈ acc(Q) ∩ F be arbitrary. Since Q ∈ dPrj(D_init^{A,F_S}) and (q,q′) ∈ acc(Q)², it follows that (q,q′) ∉ D_init^{A,F_S}. Furthermore, q′ ∈ F, so that q ∉ F_S \ F. In combination with q ∈ F_S, this implies q ∈ F, so h ∈ L(A).

This ends the proof of the inverse inclusion, and thus, of L(A) = L(Acgr(S)) ∩ S.
7.4. Completeness

We next show the completeness of congruence projection for subhedge projection according to Definition 10. Let A = (Σ,Q,Δ,I,F) be a complete dSHA and F ⊆ F_S ⊆ Q. Automaton A defines the regular pattern L = L(A) and, with the schema-final states in F_S, the regular schema S = L(A[F/F_S]). We first show that all states whose subset component projects in their difference relation are subhedge projection states of Acgr(S) according to Definition 7.
Lemma 14. If Q ∈ dPrj(D), then (Q,D) is a subhedge projection state of Δcgr with witness (Q,D).
Proof. We assume that Q ∈ dPrj(D) and have to show that (Q,D) is a subhedge projection state of Δcgr. We have to show that all transition rules starting in (Q,D) are permitted for a subhedge projection state (Q,D) with witness (Q,D). They are generated by the following rules, all of which are looping:

  a ∈ Σ    Q ∈ dPrj(D)
  ---------------------------
  (Q,D) -a→ (Q,D) in Δcgr

  Q ∈ dPrj(D)
  ------------------------------------
  (Q,D) @ (Q,D) → (Q,D) in Δcgr

  Q ∈ dPrj(D)
  ------------------------------
  (Q,D) -⟨⟩→ (Q,D) in Δcgr
(Q,D)in cgr
So, if a partial run of
Acgr(S)
on a prefix
u
assigns some state
(P
,
D)
with
PdPrj(D)
,
then (P,D)is a subhedge projection state by Lemma 14, and thus, subhedge irrelevant by
Proposition 2.
Lemma 15. Let A′ = A[I/⟨⟩], L′ = L(A′), and S′ = L(A′[F/F_S]). If (⟨⟩,D) -hdg(u)→ (Q,D) wrt Acgr(S) for some nested word u ∈ NΣ and either
- Q ∉ dPrj(D) and q ∈ Q, or
- Q ∈ dPrj(D) and q ∈ acc(Q),
then there exist a nested word u′ ∈ class^{L′}_{S′}(u) and q0 ∈ ⟨⟩ such that q0 -hdg(u′)→ q wrt Δ.
Proof. By induction on the length of u, suppose that (⟨⟩,D) -hdg(u)→ (Q,D) wrt Acgr(S). In the base case, we have u = ε and hence Q = ⟨⟩.

Case Q ∉ dPrj(D) and q ∈ Q. Then, ⟨⟩ = {q}. The unique run of A′ on u′ = ε starts and ends in the tree initial state q ∈ ⟨⟩.

Case Q ∈ dPrj(D) and q ∈ acc(Q). Since Q ∈ dPrj(D), Lemma 14 shows that (Q,D) is a subhedge projection state, so that ε is subhedge irrelevant for L′ and S′ by Proposition 2. Hence, class^{L′}_{S′}(ε) = NΣ. Furthermore, since q ∈ acc(Q), there exist a hedge h and q0 ∈ ⟨⟩ such that q0 -h→ q wrt Δ. Let u′ = nw(h). The run of A′ on u′ ends in q, and u′ ∈ class^{L′}_{S′}(u).
For the induction step, we distinguish the possible forms of the nested word u: there exist ũ, v ∈ NΣ and a ∈ Σ such that either of the following cases holds:

Case u = ũ·⟨·v·⟩. Let (Q̃,D) be the state in which the run of Acgr(S) on ũ ends.

Subcase Q̃ ∉ dPrj(D). Let D′ = D_Q̃. Then, there exists P ⊆ Q such that (⟨⟩,D′) -v→ (P,D′) wrt Acgr(S). So, the run of Acgr(S) on v goes to state (P,D′).

Subsubcase P ∈ dPrj(D′). By construction of Acgr(S), we then have Q = Q̃ @ acc(P). Since q ∈ Q, there exist q̃ ∈ Q̃ and p ∈ acc(P) such that q̃ @ p → q in Δ. By the induction hypothesis, there exists v′ ∈ class^{L′}_{S′}(v) such that the run of A′ on v′ ends in p. Hence, the run of A′ on u′ = ũ·⟨·v′·⟩ ends in q, and furthermore, u′ ∈ class^{L′}_{S′}(u).
Subsubcase P∈ dPrj(D)
. By construction of
Acgr(S)
, we then have
Q=˜
Q
@
P
. Since
qQ
, there exist
˜
q˜
Q
and
pP
, such that
˜
q
@
pq
in
. By induction hypothesis, there exists
vclassL
S(v)
, such that the run
of
A
on
v
ends in
p
. Hence, the run of
A
on
u=˜
u··v·
ends in
q
, and
furthermore, uclassL
S(u).
Subcase Q̃ ∈ dPrj(D). Then, Q = Q̃ and q ∈ acc(Q̃). Using the induction hypothesis applied to ũ, there exists ũ′ ∈ class^{L′}_{S′}(ũ) such that ⟨⟩ -hdg(ũ′)→ q. Since ũ is irrelevant for L′ and S′, we have ũ′ ∈ class^{L′}_{S′}(u).
Case u = ũ·a. This is easier to achieve than the previous cases and is thus omitted.
Lemma 16. Suppose that a partial run of Acgr(S) assigns a state (Q,D) to a prefix u ∈ prefs(NΣ). If Q ∉ dPrj(D) and q ∈ Q, then there exists a prefix u′ ∈ class^L_S(u) such that a partial run of A assigns state q to u′.
Proof. By induction on the length of u. In the base case, we have u = ε. The partial run of Acgr(S) on u assigns state (Q,D), where Q = I and D = D_init^{A,F_S}. Suppose that Q ∉ dPrj(D) and q ∈ Q. Then, I = {q}. The unique partial run of A on u′ = ε ends in the initial state q ∈ I.

For the induction step, let A′ = A[I/⟨⟩], L′ = L(A′), and S′ = L(A′[F/F_S]). We distinguish the possible forms of u: there exist ũ ∈ prefs(NΣ), v ∈ NΣ, and a ∈ Σ such that either of the following holds:
Case u = ũ·⟨·v. Let (Q̃,D̃) be the state in which the partial run of Acgr(S) on ũ ends. From Q ∉ dPrj(D), it follows that Q̃ ∉ dPrj(D̃). Furthermore, (⟨⟩,D) -v→ (Q,D) wrt Acgr(S). So, the partial run of (A′)cgr(S) on v goes to state (Q,D). Using the induction hypothesis, there exists v′ ∈ class^{L′}_{S′}(v) such that the partial run of A′ on v′ ends in q. Hence, the partial run of A on u′ = ũ·⟨·v′ ends in q, and furthermore, u′ ∈ class^L_S(u).
Case u = ũ·⟨·v·⟩. Let (Q̃,D̃) be the state in which the partial run of Acgr(S) on ũ ends. From Q ∉ dPrj(D), it follows that Q̃ ∉ dPrj(D̃). Furthermore, (⟨⟩,D′) -v→ (P,D′) wrt Acgr(S) for some P ⊆ Q.
Subcase P ∈ dPrj(D′). Then, Q = Q̃ @ acc(P). So, there exist q̃ ∈ Q̃ and p ∈ acc(P) such that q̃ @ p → q in Δ. Using Lemma 15, there exists v′ ∈ class^{L′}_{S′}(v) such that ⟨⟩ -hdg(v′)→ p wrt Δ. Using the induction hypothesis applied to ũ, there exist an initial state q0 ∈ I and a prefix ũ′ ∈ class^L_S(ũ) such that the partial run of A on ũ′ goes to state q̃. Hence, the partial run of A on u′ = ũ′·⟨·v′·⟩ goes to q, and u′ ∈ class^L_S(u).
Subcase P∈ dPrj(D)
. Then,
Q=˜
Q
@
P
. So, there exists
˜
qQ
and
pP
, such
that
˜
q
@
pq
in
. Using Lemma 15, there exists
vclassL
S(v)
, such that
⟨⟩hdg(v)
pwrt
. Using the induction hypothesis applied to
˜
u
, there exists
an initial state
q0I
and a prefix
˜
uclassL
S(˜
u)
, such that the partial run of
A
on
˜
u
goes on to state
˜
q
. Hence, the partial run of
A
on
u=˜
u··v·
goes on
to qand uclassL
S(u).
Case u = ũ·a. This is easier to achieve than the previous case and is thus omitted.
The next lemma states the key invariant of congruence projection, which eventually proves its completeness for subhedge projection. For this, we define for any F ⊆ F_S the binary relation ≈^F_{F_S} as the symmetric closure of F × (F_S \ F), i.e., for all (q′,q′′) ∈ Q²:

  q′ ≈^F_{F_S} q′′  ⟺  (q′ ∈ F ∧ q′′ ∈ F_S \ F) ∨ (q′ ∈ F_S \ F ∧ q′′ ∈ F)
Lemma 17 (Key Invariant). Let A = (Σ,Q,Δ,I,F) be a dSHA, F ⊆ F_S ⊆ Q, L = L(A), and S = L(A[F/F_S]). If Acgr(S) has a partial run on a prefix u ∈ prefs(NΣ) to a state (P,D) ∈ 2^Q × D, then for all (p′,p′′) ∈ P², (r′,r′′) ∈ D, and (v′,v′′) ∈ NΣ² such that

  p′ -hdg(v′)→ r′ wrt Δ    and    p′′ -hdg(v′′)→ r′′ wrt Δ,

there exist (u′,u′′) ∈ (class^L_S(u))², w ∈ suffs(NΣ), q′ ≈^F_{F_S} q′′, and q0 ∈ I such that

  q0 -hdg(u′·v′·w)→ q′ wrt Δ    and    q0 -hdg(u′′·v′′·w)→ q′′ wrt Δ.
Proof. The proof is by induction on the number of dangling opening parentheses of u.

In the base case, u does not have any dangling opening parenthesis, so u ∈ NΣ is a nested word. In this case, D = D_init^{A,F_S}. So, Acgr(S) has a partial run on the nested word u ∈ NΣ to (P,D_init^{A,F_S}), where P ⊆ Q. Let (p′,p′′) ∈ P², (r′,r′′) ∈ D_init^{A,F_S}, and (v′,v′′) ∈ NΣ² such that

  p′ -hdg(v′)→ r′ wrt Δ    and    p′′ -hdg(v′′)→ r′′ wrt Δ.

It then follows that P ∉ dPrj(D_init^{A,F_S}). Therefore, we can apply Lemma 16. It shows that there exist nested words (u′,u′′) ∈ (class^L_S(u))² and q0 ∈ I such that

  q0 -hdg(u′)→ p′ wrt Δ    and    q0 -hdg(u′′)→ p′′ wrt Δ.

Since D_init^{A,F_S} ⊆ ldr(F × (F_S \ F)) by Lemma 10, there exist a nested word w ∈ NΣ and states q′ ≈^F_{F_S} q′′ such that

  r′ -hdg(w)→ q′ wrt Δ    and    r′′ -hdg(w)→ q′′ wrt Δ.

Then, we have the following:

  q0 -hdg(u′·v′·w)→ q′ wrt Δ    and    q0 -hdg(u′′·v′′·w)→ q′′ wrt Δ.
For the induction step, let u = u1·⟨·u2 for some prefix u1 ∈ prefs(NΣ) and nested word u2 ∈ NΣ, so that u1 has one dangling opening bracket less than u. Let Acgr(S) have a partial run on u to (P,D), and let (P1,D1) be the state that this run assigns to the prefix u1. Then, D = (D1)_{P1}. Let (p′,p′′) ∈ P², (r′,r′′) ∈ D, and (v′,v′′) ∈ NΣ² such that

  p′ -hdg(v′)→ r′ wrt Δ    and    p′′ -hdg(v′′)→ r′′ wrt Δ.

Since D = (D1)_{P1}, there exist (p′1,p′′1) ∈ (P1)² and states (r′1,r′′1) ∈ D1 such that p′1 @ r′ → r′1 and p′′1 @ r′′ → r′′1. In particular, P1 ∉ dPrj(D1), so that we can apply Lemma 16. It shows that there exist (u′2,u′′2) ∈ (class^{L′}_{S′}(u2))² such that ⟨⟩ -u′2→ p′ and ⟨⟩ -u′′2→ p′′. Let v′1 = ⟨·u′2·v′·⟩ and v′′1 = ⟨·u′′2·v′′·⟩. Then, p′1 -hdg(v′1)→ r′1 and p′′1 -hdg(v′′1)→ r′′1 wrt Δ. We apply the induction hypothesis to u1, on which Acgr(S) has a partial run to state (P1,D1), with (p′1,p′′1) ∈ (P1)², p′1 -hdg(v′1)→ r′1, p′′1 -hdg(v′′1)→ r′′1, and (r′1,r′′1) ∈ D1. It yields (u′1,u′′1) ∈ (class^L_S(u1))², w1 ∈ suffs(NΣ), q′ ≈^F_{F_S} q′′, and q0 ∈ I such that

  q0 -hdg(u′1·v′1·w1)→ q′ wrt Δ    and    q0 -hdg(u′′1·v′′1·w1)→ q′′ wrt Δ.

Let u′ = u′1·⟨·u′2, u′′ = u′′1·⟨·u′′2, and w = ⟩·w1. The above then yields the following:

  q0 -hdg(u′·v′·w)→ q′ wrt Δ    and    q0 -hdg(u′′·v′′·w)→ q′′ wrt Δ.

Furthermore, (u′,u′′) ∈ (class^L_S(u))², which was to be shown.
Proposition 6. Let A = (Σ,Q,Δ,I,F) be a dSHA, F ⊆ F_S ⊆ Q, L = L(A), and S = L(A[F/F_S]). If Acgr(S) has a partial run on a prefix u ∈ prefs(NΣ) to some state (P,D) ∈ 2^Q × D such that P ∉ dPrj(D), then class^L_S(u) is subhedge relevant for L wrt S.
Proof. Suppose that Acgr(S) has a partial run on a prefix u ∈ prefs(NΣ) to some state (P,D) ∈ 2^Q × D such that P ∉ dPrj(D). Since P ∉ dPrj(D), we have acc(P)² ∩ D ≠ ∅. So, there exist (p′,p′′) ∈ P², (r′,r′′) ∈ D, and (v′,v′′) ∈ NΣ² such that

  p′ -hdg(v′)→ r′ wrt Δ    and    p′′ -hdg(v′′)→ r′′ wrt Δ.

Using Lemma 17, there exist (u′,u′′) ∈ (class^L_S(u))², w ∈ suffs(NΣ), q′ ≈^F_{F_S} q′′, and q0 ∈ I such that

  q0 -hdg(u′·v′·w)→ q′ wrt Δ    and    q0 -hdg(u′′·v′′·w)→ q′′ wrt Δ.

Hence, class^L_S(u) is relevant for L wrt S.
Theorem 3 (Completeness of congruence projection). For any complete dSHA A = (Σ,Q,Δ,I,F) and set F ⊆ F_S ⊆ Q, the congruence projection Acgr(S) is sound and complete for subhedge projection for the regular pattern L(A) wrt the regular schema S = L(A[F/F_S]).
Proof. The soundness of congruence projection was shown in Theorem 2. For proving completeness, let u ∈ prefs(NΣ) be a nested word prefix that is strongly subhedge irrelevant for L wrt S. By the definition of strong irrelevance, the class class^L_S(u) is irrelevant for L and S. Let (P,D) be the unique state assigned by the partial run of Acgr(S) to the prefix u. Since class^L_S(u) is irrelevant for L and S, Proposition 6 shows that P ∈ dPrj(D). By Lemma 14, (P,D) is then a subhedge projection state of Acgr(S).
7.5. In-Memory Complexity
We next discuss the complexity of membership testing with complete subhedge projection based on the in-memory evaluation of the input hedge by the congruence projection of the input automaton.
Lemma 18. The number of states |Qcgr| is in O(2^{n²+n}), where n is the number of states of A.
Proof. With the deterministic construction, the states of Acgr(S) are pairs in 2^Q × D. So, the maximal number of states of the congruence projection is |Qcgr(S)| = 2^{|Q|}·|D| ≤ 2^n·2^{n²} = 2^{n²+n}.
Let diff-rel(Qcgr(S)) be the set of difference relations used in the states of Acgr(S).
Corollary 1. Let the signature Σ be fixed. For any complete dSHA A with schema S ⊆ HΣ, the membership of any hedge h ∈ S can be tested in memory in time O(1) per nonprojected node of h after a preprocessing time of O(|Acgr(S)| + n³·d), where d is the cardinality of diff-rel(Qcgr(S)) and n is the number of states of A.

Since |Qcgr(S)| is in O(2^{n²+n}) by Lemma 18, the preprocessing time O(|Acgr(S)| + n³·d) is in O(2^{2n²+2n}), too.
Proof. Since A is complete, Theorem 3 shows that Acgr(S) is complete for subhedge projection for schema S. The in-memory evaluator of Acgr(S) can be run on any input hedge h ∈ S in time O(1) per nonprojected node of h after a preprocessing time of O(|Acgr(S)| + n³·d), where d is the cardinality of diff-rel(Acgr(S)). Since the signature Σ is fixed and Acgr(S) is deterministic, the dominating part |Acgr(S)| is in O(|Qcgr(S)|²).
When applied to regular XPath query answering, as in our experiments in Section 11, we have n ≤ 80 and d ≤ 25, so the cubic part O(n³·d) is not limiting. The sizes of the SHAs are below 2600, counting both states and rules. The upper bound O(|Acgr(S)| + n³·d) thus suggests that the preprocessing time may be feasible in practice, even though it is exponential in the worst case.
We must be careful here, since the sizes of the SHAs reported in Section 11 do not refer to the pure congruence projection Acgr(S) but to the earliest congruence projection Acgr(S)e(S) that we will introduce only in Section 8. The noncreation of transition rules to states that are safe for rejection may lead to a radical size reduction. Alternatively, an important size reduction of Acgr(S) could be obtained by removing all useless states and transition rules. But this would not reduce the precomputation time, just the opposite: it would increase the preprocessing time, since cleaning automata a posteriori also needs time, which for SHAs may be nonlinear.
Instead of precomputing Acgr(S) from the inputs A and F_S statically, we can compute only the part of Acgr(S) needed for evaluating the input hedge h on the fly. We note that the number of transition rules of the needed part is bounded by O(|h|) but may be way smaller due to projection. The preprocessing time then goes down to O(|A|). Since the full automaton Acgr(S) may be exponential in n in the worst case, this may reduce the overall processing time.
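To give an idea of what on-the-fly evaluation looks like, here is a minimal Python sketch of our own (the function names and the hedge encoding are assumptions, not the paper's implementation): the three kinds of transitions are supplied as functions and memoized, so that exactly the transition entries needed for the input hedge are computed, and projection states cut off irrelevant subhedges.

```python
from functools import lru_cache

def make_evaluator(letter, open_, apply_, is_projecting):
    # Memoization builds the needed part of the automaton on the fly:
    # each transition entry is computed at most once.
    letter = lru_cache(maxsize=None)(letter)
    open_ = lru_cache(maxsize=None)(open_)
    apply_ = lru_cache(maxsize=None)(apply_)

    def eval_hedge(state, hedge):
        # A hedge is a tuple whose items are letters (str) or subhedges (tuple).
        for item in hedge:
            if is_projecting(state):
                return state      # projection states loop on any remaining hedge
            if isinstance(item, str):
                state = letter(state, item)
            else:
                state = apply_(state, eval_hedge(open_(state), item))
        return state

    return eval_hedge

# Hypothetical toy automaton counting 'a'-leaves up to 2; state 2 projects.
ev = make_evaluator(
    letter=lambda q, a: min(q + (a == 'a'), 2),
    open_=lambda q: 0,
    apply_=lambda q, p: min(q + p, 2),
    is_projecting=lambda q: q == 2,
)
assert ev(0, ('a', ('a', 'b'), 'a')) == 2
```

The early return is sound because, as in Lemma 14, a projection state loops on all letters, opens, and applies.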
On the positive side, on-the-fly evaluation reduces the overall computation time to O(|h| + n³·d(h)), where d(h) is the number of difference relations in the part of Acgr(S) needed for the evaluation of h, so d(h) ≤ min(|h|, d). On the downside, d(h) steps will no longer be in constant but in cubic time O(n³). It should also be noted that the overall time of the safe-no-change projection with on-the-fly evaluation is in O(|A| + |h| + n·s(h)), where s(h) is the number of subsets of safe states of A in the states needed for evaluating h, i.e., s(h) ≤ |h|. This is since each subset of safe states can be computed in linear time O(n). As argued above, the overall time of the congruence projection obtained by on-the-fly evaluation is in O(|A| + |h| + n³·d(h)). This is because the computation of a difference relation is in O(n³) by using ground Datalog programs of cubic size. The additional cubic factor n³·d(h) instead of the linear factor n·s(h) is the price to be paid for achieving complete subhedge projection with on-the-fly evaluation. Note that this price is largely acceptable in cases where n ≤ 80, d(h) ≤ 15, and s(h) ≤ 15.
Example 15 (Exponential blow-up). In order to see how the exponential worst case may happen, we consider a family of regular languages for which the minimal left-to-right DFA is exponentially bigger than the minimal right-to-left DFA. The classical example languages with this property are Ln = Σ*·a·Σ^n, where n ∈ ℕ and Σ = {a,b}. Intuitively, a word in Σ* belongs to Ln if and only if its (n+1)-th letter from the end is equal to a. The minimal left-to-right DFA for Ln has 2^{n+1} many states, since it needs to memorize a window of n+1 letters. In contrast, its minimal right-to-left DFA has only n+1 states; in this direction, it is sufficient to memorize the distance from the end modulo n+1.
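The claimed left-to-right blow-up can be checked mechanically. The following Python sketch (our own illustration) counts Myhill-Nerode classes of Ln by distinguishing prefixes with the suffixes b^k for k ≤ n; it finds 2^{n+1} pairwise distinguishable prefixes, a lower bound matching the 2^{n+1} states claimed above.

```python
from itertools import product

def in_L(w, n):
    # w ∈ L_n = Σ*·a·Σ^n iff its (n+1)-th letter from the end is 'a'
    return len(w) >= n + 1 and w[-(n + 1)] == 'a'

def nerode_lower_bound(n):
    # Signature of a prefix p: membership of p·b^k for k = 0..n.
    # Prefixes with distinct signatures need distinct DFA states.
    sigs = set()
    for l in range(n + 2):
        for p in product('ab', repeat=l):
            w = ''.join(p)
            sigs.add(tuple(in_L(w + 'b' * k, n) for k in range(n + 1)))
    return len(sigs)

assert nerode_lower_bound(3) == 2 ** 4   # 16 classes for n = 3
```

Each suffix b^k probes one position of the b-padded window of the last n+1 letters, so the signatures enumerate exactly the 2^{n+1} windows.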
We next consider the schema S defined by µz.⟨(a+b)·z⟩*. It contains all hedges having in all their subtrees a label from Σ = {a,b} as the first letter, and no further occurrences of letters of Σ elsewhere. We then consider the family of hedge languages Hn ⊆ HΣ with schema S such that the sequence of labels of some root-to-leaf path of h belongs to Ln. Note that Hn can be recognized in a bottom-up manner by a dSHA An with O(n+1) states, which simulates the minimal deterministic DFA of Ln on all paths of the input hedge. For an evaluator with subhedge projection, the situation is different. When moving top-down, it needs to memorize the sequence of labels of the n+1 last ancestors, possibly filled up with b's, and there are 2^{n+1} such sequences. If, for some leaf, its sequence starts with an "a", then the subsequent subhedges with the following leaves can be projected away. As a consequence, there cannot be any dSHA recognizing Hn that projects away all irrelevant subhedges with less than 2^{n+1} states. In particular, the size of (An)cgr(S) must be exponential in the size of An.
8. Earliest Membership with Subhedge Projection

We next enhance our compilers for introducing subhedge projection states such that they can also detect certain membership or nonmembership in an earliest manner. For this, we show how to combine the previous earliest membership tester for dSHAs from [21] with any subtree projection algorithm, so that it applies both to safe-no-change projection and to congruence projection. By combining both aspects orthogonally, we obtain an earliest dSHA with subhedge projection, which can be evaluated either in in-memory or in streaming mode.

8.1. Earliest Membership

We start with a semantic characterization of earliest membership, i.e., of prefixes of nested words of hedges whose suffixes are irrelevant for membership, when assuming top-down and left-to-right processing. Note that the same characterization, up to various presentation choices, was given earlier in the context of stream processing [21,22].
Definition 13. Let S ⊆ HΣ be a hedge schema and L ⊆ S a hedge language satisfying this schema. A nested word prefix v is certain for membership in L with schema S if, for all suffixes w of nested words such that v·w ∈ nw(S), it holds that v·w ∈ nw(L). It is called certain for nonmembership in L with schema S if, for all suffixes w of nested words such that v·w ∈ nw(S), it holds that v·w ∉ nw(L).
We are now interested in SHAs that are able to detect certain membership and nonmembership syntactically at the earliest possible prefix or node of the run.
Definition 14. A dSHA A = (Σ,Q,Δ,I,F) with schema S ⊆ HΣ is called earliest for schema S if there exists a state sel ∈ Q such that, for all hedges h ∈ S, the unique run R of compl(A) on h satisfies the following:
- it goes to the state sel for exactly all those prefixes of nw(h) that are certain for membership in L(A) wrt S;
- it goes to the sink for all prefixes of nw(h) that are certain for nonmembership in L(A) with S.
For a dSHA A that is schema-complete for schema S = L(A[F/F_S]), we can construct a dSHA Ae(S) that is earliest for S, as shown in Appendix A. The idea is to adapt the earliest membership tester for deterministic SHAs from [21] so that it takes schema restrictions into account. Furthermore, we need to compile the input dSHAs to dSHAs rather than to dNWAs, as used there. This change is straightforward once the relationship between both automata models becomes clear.
Given that the construction of Ae(S) is not part of the core of the present paper, we only illustrate its outcome on our running example: for the dSHA A with schema S = ⟦doc⟧ in Figure 4, we obtain the SHA Ae(S) in Figure 19. Note that, on the prefix ⟨list·⟨item, it goes to the state sel, showing that this prefix is certain for membership. On the prefix ⟨item, it blocks, showing that it is certain for nonmembership. However, on the prefix ⟨list·⟨list, it does not go into a subhedge projection state even though this prefix is subhedge irrelevant.
Figure 19. The earliest dSHA Ae(S) for the dSHA A in Figure 4 with schema S = ⟦doc⟧ for the XPath filter [self::list/child::item].
8.2. Adding Subhedge Projection

We now enhance any earliest membership tester for dSHAs with subhedge projection. The idea is to combine subhedge projection with earliest membership. So, suppose that we are given a schema S and two automata, Aπ and Ae, with schema S that recognize the same language up to the schema, so L(Aπ) ∩ L(S) = L(Ae) ∩ L(S). Furthermore, assume that Ae has a selection state sel indicating certain membership at the earliest possible prefix.
In order to obtain earliest membership with subhedge projection, we combine the two SHAs into a single SHA Aπ_e = (Σ, Qπ_e, Δπ_e, Iπ_e, Fπ_e), basically running both automata in parallel but under shared control. The state sets of Aπ_e are as follows:

  Qπ_e = (Qπ × (Qe \ {sel})) ∪ {sel}
  Iπ_e = (Iπ × (Ie \ {sel})) ∪ {sel | sel ∈ Ie}
  Fπ_e = (Fπ × (Fe \ {sel})) ∪ {sel}
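The state set Qπ_e can be written down directly. Here is a small Python sketch of our own (names are assumptions), together with a count check: the combination has |Qπ|·(|Qe|-1) + 1 states.

```python
def product_states(Q_pi, Q_e, sel):
    # Pairs of a projection-automaton state with a non-sel earliest state,
    # plus the single absorbing selection state sel.
    return {(q, r) for q in Q_pi for r in Q_e if r != sel} | {sel}

Q_pi = {0, 1, 2}
Q_e = {'r0', 'r1', 'sel'}
states = product_states(Q_pi, Q_e, 'sel')
assert len(states) == len(Q_pi) * (len(Q_e) - 1) + 1   # 3·2 + 1 = 7
```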
The transition rules of Δπ_e are given in Figure 20. Any run of Aπ_e synchronizes parallel runs of Aπ and Ae as follows: whenever Ae goes to sel, then Aπ_e does so too, and whenever Aπ goes into a subhedge projection state, then it makes Ae jump over the subsequent subhedge.
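Our reading of the letter rules of Figure 20 can be sketched as follows in Python (illustrative only; the transition functions, the set P of projection states, and sel are assumed inputs): sel absorbs, certain membership detected by Ae overrides everything else, and a projecting first component freezes the pair.

```python
def letter_step(state, a, delta_pi, delta_e, P, sel):
    if state == sel:
        return sel                      # sel is absorbing
    q, r = state
    if delta_e(r, a) == sel:
        return sel                      # Ae detects certain membership
    if q in P:
        return (q, r)                   # q projects: the pair loops
    return (delta_pi(q, a), delta_e(r, a))

# Hypothetical toy transition functions over a single letter 'a'.
delta_pi = lambda q, a: q + 1
delta_e = lambda r, a: 'sel' if r == 1 else r + 1
assert letter_step((0, 0), 'a', delta_pi, delta_e, {9}, 'sel') == (1, 1)
assert letter_step((1, 1), 'a', delta_pi, delta_e, {9}, 'sel') == 'sel'
assert letter_step((5, 0), 'a', delta_pi, delta_e, {5}, 'sel') == (5, 0)
```

The apply and open rules of Figure 20 follow the same case split.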
For illustration, we reconsider the dSHA from Figure 4. The earliest dSHA with subhedge projection Asnc_e(S) is given in Figure 21. Here, we combine the safe-no-change projection Asnc from Figure 11 with the earliest membership tester Ae(S) from Figure 19. Their combination Asnc_e(S) represents the pattern [self::list/child::item] with schema ⟦doc⟧ in a concise manner. After the prefix ⟨list·⟨item, it detects certain membership; for the prefix ⟨list·⟨list, it detects that the subhedge is irrelevant; and after the prefix ⟨item, it detects certain nonmembership.
  q -a→ q′ in Δπ    r -a→ r′ in Δe    q ∉ P    r′ ≠ sel
  -------------------------------------------------------
  (q,r) -a→ (q′,r′) in Δπ_e

  q -a→ q′ in Δπ    r -a→ sel in Δe
  -----------------------------------
  (q,r) -a→ sel in Δπ_e

  q -a→ q′ in Δπ    r -a→ r′ in Δe    q ∈ P    r′ ≠ sel
  -------------------------------------------------------
  (q,r) -a→ (q,r) in Δπ_e

  q @ p → q′ in Δπ    r @ s → r′ in Δe    r′ ≠ sel    q ∉ P
  -----------------------------------------------------------
  (q,r) @ (p,s) → (q′,r′) in Δπ_e

  q @ p → q′ in Δπ    r @ s → sel in Δe
  ---------------------------------------
  (q,r) @ (p,s) → sel in Δπ_e

  q @ p → q′ in Δπ    r @ s → r′ in Δe    q ∈ P    r′ ≠ sel
  -----------------------------------------------------------
  (q,r) @ (p,s) → (q,r) in Δπ_e

  q -⟨⟩→ q′ in Δπ    r -⟨⟩→ r′ in Δe    q ∉ P    r′ ≠ sel
  ---------------------------------------------------------
  (q,r) -⟨⟩→ (q′,r′) in Δπ_e

  q -⟨⟩→ q′ in Δπ    r -⟨⟩→ sel in Δe
  -------------------------------------
  (q,r) -⟨⟩→ sel in Δπ_e

  q -⟨⟩→ q′ in Δπ    r -⟨⟩→ r′ in Δe    q ∈ P    r′ ≠ sel
  ---------------------------------------------------------
  (q,r) -⟨⟩→ (q,r) in Δπ_e

  a ∈ Σ
  ---------------------
  sel -a→ sel in Δπ_e

  µ ∈ Qπ_e
  ----------------------
  sel @ µ → sel in Δπ_e

  µ ∈ Qπ_e
  ----------------------
  µ @ sel → sel in Δπ_e

  true
  ----------------------
  sel -⟨⟩→ sel in Δπ_e

Figure 20. The transition rules Δπ_e inferred from those of the SHAs Aπ with projection states P and Ae with selection state sel.
Figure 21. The earliest dSHA Asnc_e(S) with safe-no-change subhedge projection for the dSHA A in Figure 4 with schema S = ⟦doc⟧ for the XPath filter [self::list/child::item].
8.3. Soundness and Completeness
We next show that the soundness and completeness of a projection algorithm are preserved when it is combined with an earliest membership tester, assuming determinism.
Proposition 7. Let Aπ and Ae be dSHAs with the same schema S and the same language up to the schema, i.e., L(Aπ) ∩ L(S) = L(Ae) ∩ L(S). The combined dSHA Aπe then has the same language up to schema S, too. Furthermore, if Ae is earliest for S, then Aπe is earliest for S and goes into some subhedge projection state whenever Aπ does.
Proof. If q is a subhedge projection state of Aπ and r a state of Ae, then (q, r) is a subhedge projection state of Aπe. The evaluator for the automaton Aπe runs Aπ and Ae in parallel on the input hedge, while skipping subhedges starting in subhedge projection states of Aπ. These include the subhedges starting in a projection state when evaluating Aπe on h. Applying Proposition 2 to Aπ, such subhedges are irrelevant for L(Aπ) with respect to S, since Aπ is assumed to be deterministic. Since L(Aπ) ∩ S = L(Ae) ∩ S, skipping such a subhedge does not affect acceptance by Ae for hedges inside schema S. Therefore, the evaluator of Aπe is earliest with respect to S, too.
The above proposition shows that if Aπ is complete for subhedge projection, then Aπe is also complete for subhedge projection while also being earliest.
Theorem 4. For any complete dSHA A with schema S, the dSHA Acgr(S)e(S) is earliest and complete for subhedge projection for schema S.
Proof. The congruence projection Acgr(S) is sound by Proposition 2 and complete for subhedge projection wrt S by Theorem 3. The automaton Ae(S) discussed in Section 8.1 is earliest for S. So, their combination Acgr(S)e(S) yields the earliest automaton with complete subhedge projection wrt S by Proposition 7.
8.4. In-Memory Complexity
We next discuss the complexity of earliest membership testing with complete subhedge projection by an in-memory evaluator for Acgr(S)e(S).
Lemma 19. The number of states in Qcgr(S)e(S) is in O(2^(n^2+2n+log(n))), where n = |Q|.
Proof. Using Lemma 18, the number of states of Acgr(S) is in O(2^(n^2+n)), where n = |Q|. The number of states of Ae(S) is in O(n·2^n), and thus in O(2^(n+log(n))). The number of states of Qcgr(S)e(S) is thus in O(2^(n^2+n) · 2^(n+log(n))), which is in O(2^(n^2+2n+log(n))).
Corollary 2. Let the signature Σ be fixed. For any complete dSHA A with schema S ⊆ HΣ, earliest membership of an input hedge h ∈ S can be tested in memory in time O(1) per nonprojected node of h after a preprocessing time of O(|Acgr(S)e(S)| + n³·d + n·s), where d is the number of difference relations and s is the number of subsets of safe states in Qcgr(S)e(S). The preprocessing time is also in O(2^(2n^2+4n+2log(n))), where n is the number of states of A.
Proof. Since A is complete, Theorem 4 shows that Acgr(S)e(S) is earliest, sound, and complete for subhedge projection for schema S. The in-memory evaluator of Acgr(S)e(S) can be run on any input hedge h ∈ S in time O(1) per nonprojected node of h after a preprocessing time of O(|Acgr(S)e(S)| + n³·d + n·s). Since Σ is fixed, the size |Acgr(S)e(S)| is bounded by O(|Qcgr(S)e(S)|²). Lemma 19 shows that |Qcgr(S)e(S)| is in O(2^(n^2+2n+log(n))), so |Acgr(S)e(S)| is in O(2^(2n^2+4n+2log(n))).
As before, instead of precomputing Acgr(S)e(S) from the input dSHA A and FS, we can compute only the part needed for evaluating the input hedge h on the fly. If the needed part of Acgr(S)e(S) for evaluating h is small, the overall time and space go down considerably.
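The lazy construction described above can be sketched as follows; a minimal Python illustration with hypothetical transition maps, not the actual construction: product states of a combined automaton are computed and cached only when the evaluation first reaches them.

```python
from functools import lru_cache

# Hypothetical letter-transition maps of two component dSHAs; in the actual
# construction, these would come from the input dSHA and its projection.
pi_letter = {(0, 'a'): 1, (1, 'a'): 1, (0, 'b'): 0, (1, 'b'): 1}
e_letter = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 1, (1, 'b'): 1}

@lru_cache(maxsize=None)  # each product transition is built at most once
def product_letter(q, r, a):
    """Letter transition of the product automaton, computed on demand."""
    return (pi_letter[(q, a)], e_letter[(r, a)])

# Only the transitions reached while evaluating the input are constructed.
state = (0, 0)
for letter in "ab":
    state = product_letter(state[0], state[1], letter)
```

Only two cached entries exist after this run, although the full product would have more; this is the effect that keeps the needed part small in practice.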
9. Streaming Algorithms
We show that dSHAs can be evaluated on the nested word of a hedge in a streaming manner, so that they produce the same results and projection behavior as when evaluating them on hedges in-memory in a top-down and left-to-right manner.

All aspects related to subhedge projection remain unchanged. Earliest query answering in streaming mode means that nested word suffixes are ignored if irrelevant, and not only nested words as with subhedge projection. What changes between both evaluation modes is the memory consumption. For in-memory evaluation, the whole graph of the input hedge must be stored, while in streaming mode, only a stack and a state are maintained at any event. For testing regular language membership, the state space is finite, while the stack space becomes infinite, since the depth of the stack depends on the depth of the input hedge.

The property of having equivalent streaming and in-memory evaluators was already noticed in [26] for Neumann and Seidl's pushdown forest automata [25] and for nested word automata [11,15]. We here show how to transpose this result to dSHAs. This permits us to show the correctness of our evaluation algorithms based on properties of the constructed automata. It also permits obtaining correctness properties of streaming algorithms by reasoning about the in-memory evaluator of SHAs.
9.1. Streaming Evaluators for Downward Hedge Automata
We define the streaming evaluator of a SHA A = (Σ, Q, Δ, I, F) by a visibly pushdown machine. The set of configurations of this machine is K = Q × Q*, containing pairs of states and stacks of states. The visibly pushdown machine provides for any word v ∈ Σ̂* a streaming transition relation →v_str ⊆ K × K, such that for all states q, q′, q″ ∈ Q, stacks σ, σ′, σ″ ∈ Q*, and words v, v′ ∈ Σ̂*:

- (q, σ) →ε_str (q, σ) wrt Δ;
- if (q, σ) →v_str (q′, σ′) and (q′, σ′) →v′_str (q″, σ″), then (q, σ) →v·v′_str (q″, σ″) wrt Δ;
- if q →a q′ in Δ, then (q, σ) →a_str (q′, σ) wrt Δ;
- if q →⟨⟩ q′ in Δ, then (q, σ) →⟨_str (q′, σ·q) wrt Δ;
- if q@p → q′ in Δ, then (p, σ·q) →⟩_str (q′, σ) wrt Δ.
We note that the same visibly pushdown machine can be obtained by compiling the SHA to an NWA, whose streaming evaluator is defined by a visibly pushdown machine too (see Section 12.3 on related work).
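To make the visibly pushdown machine concrete, here is a minimal streaming evaluator for a dSHA in Python. This is a sketch under the assumption that the transition maps are given as dictionaries; the class and event names are illustrative, not those of any existing tool.

```python
class DSHA:
    """A deterministic SHA with transition maps given as dictionaries."""

    def __init__(self, letter, tree_open, apply, init, final):
        self.letter = letter        # (q, a) -> q'  for  q -a-> q'  in Delta
        self.tree_open = tree_open  # q -> q'       for  q -<>-> q' in Delta
        self.apply = apply          # (q, p) -> q'  for  q@p -> q'  in Delta
        self.init = init
        self.final = final

    def run_stream(self, events):
        """Run the visibly pushdown machine on a nested word given as a
        stream of events ('open',), ('close',), or ('letter', a)."""
        state, stack = self.init, []
        for e in events:
            if e[0] == 'open':      # push the current state, enter the subhedge
                stack.append(state)
                state = self.tree_open[state]
            elif e[0] == 'close':   # pop q and apply q@p -> q'
                state = self.apply[(stack.pop(), state)]
            else:                   # letter event ('letter', a)
                state = self.letter[(state, e[1])]
        return state, state in self.final

# A tiny dSHA accepting the single hedge <a>, with states 0, 1, 2.
A = DSHA(letter={(0, 'a'): 1}, tree_open={0: 0}, apply={(0, 1): 2},
         init=0, final={2})
```

Running A.run_stream([('open',), ('letter', 'a'), ('close',)]) ends in state 2 and accepts. The stack depth never exceeds the nesting depth of the input, in line with the space bound stated for configurations.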
The notion of partial runs of the in-memory evaluator is closely related to streaming runs. To see this, we first note that each partial run v wrt Δ naturally defines a stack as follows. Since v is a nested word prefix, there exist unique nested words v0, ..., vn ∈ nw(HΣ∪Q) such that v = v0·⟨·v1·⟨·...·⟨·vn. These nested words must be partial runs of Δ too, so there exist states p0, ..., pn, q0, ..., qn ∈ Q such that vi goes from pi to qi for all 0 ≤ i ≤ n. We define the stack σ of v by σ = q0·...·qn−1 and say that v goes from p0 to qn and has stack σ. Note that the stack is empty if v is a nested word, since in this case n = 0, so that v = v0 and σ = ε. For instance, the partial run v0 = 0·⟨·0·list·1·⟨·0·item·2 has the stack σ0 = 0·1, while the partial run v1 = v0·⟩·3·⟩·4 has stack σ1 = ε.
Lemma 20. Consider a SHA A = (Σ, Q, Δ, I, F), states q, q′ ∈ Q, and a stack σ ∈ Q*. Let v be a prefix of some nested word in nw(HΣ). Then, there exists a partial run r wrt Δ from q to q′ with stack σ such that projΣ(r) = v iff (q, ε) →v_str (q′, σ) wrt Δ.

Proof. This is carried out by induction on the length of the prefix v.
This lemma implies that any in-memory transition of A on a hedge h from q to q′ can be identified with some streaming transition of A on nw(h) from (q, ε) to (q′, ε).

Proposition 8. q →h q′ wrt Δ iff (q, ε) →nw(h)_str (q′, ε) wrt Δ.
Proof. If q →h q′ wrt Δ, there exists a run R of h with Δ that starts with q and ends with q′. Consider the partial run v = nw(R) on h. Since v is a nested word, its stack is empty. By Lemma 20, we have (q, ε) →projΣ(v)_str (q′, ε) wrt Δ and projΣ(v) = nw(h). For the inverse implication, assume that (q, ε) →nw(h)_str (q′, ε) wrt Δ. By Lemma 20, there exists a partial run v of Δ from q to q′ with stack ε. Then, there exists a run R of Δ on h such that v = nw(R). Hence, q →h q′ wrt Δ.
For any deterministic dSHA A, hedge h, and state q, the streaming transition on nw(h) with respect to Δ starting with (q, ε) can be computed in time O(1) per letter after a precomputation in time O(|A|). So, the overall computation time is in O(|A| + |h|). The streaming memory needed to store a configuration is of size O(depth(h) + |A|), since the size of the visible stack is bounded by the depth of the input hedge.
9.2. Adding Subhedge Projection
We next extend the streaming transition relation with subhedge projection, in analogy to what we did for the in-memory transition relation. We define a transition relation with subhedge projection →v_str,shp ⊆ K × K with respect to Δ, such that for all words v, v′ ∈ Σ̂*, letters a ∈ Σ, states p, q, q′, q″ ∈ Q, and stacks σ, σ′, σ″ ∈ Q*:

- if q ∈ Q_shp and v ∈ nw(HΣ), then (q, σ) →v_str,shp (q, σ) wrt Δ;
- if q ∉ Q_shp, then (q, σ) →ε_str,shp (q, σ) wrt Δ;
- if q ∉ Q_shp and q →a q′ in Δ, then (q, σ) →a_str,shp (q′, σ) wrt Δ;
- if q ∉ Q_shp, (q, σ) →v_str,shp (q′, σ′), and (q′, σ′) →v′_str,shp (q″, σ″), then (q, σ) →v·v′_str,shp (q″, σ″) wrt Δ;
- if q ∉ Q_shp and q →⟨⟩ q′ in Δ, then (q, σ) →⟨_str,shp (q′, σ·q) wrt Δ;
- if p ∉ Q_shp and q@p → q′ in Δ, then (p, σ·q) →⟩_str,shp (q′, σ) wrt Δ.

The projecting transition relation stays in a configuration with a subhedge projection state until the end of the current subhedge is reached. This is correct since a subhedge projection state cannot be changed anyway.
Proposition 9. For any SHA A = (Σ, Q, Δ, I, F), states q, q′ ∈ Q, and nested word v on which Δ has no blocking partial run starting from q:

(q, ε) →v_str (q′, ε) wrt Δ iff (q, ε) →v_str,shp (q′, ε) wrt Δ.

Proof. Let h ∈ HΣ be the hedge such that v = nw(h). The proof is by induction on the size of v. For the implication from the left to the right, we assume that (q, ε) →v_str (q′, ε) wrt Δ. Since v = nw(h), Proposition 8 proves q →h q′ wrt Δ.
Case q ∈ Q_shp. By Lemma 6, it follows from q →h q′ wrt Δ that q = q′. Since v is a nested word, we can then apply the first projection rule and obtain (q, ε) →v_str,shp (q, ε) wrt Δ, as required.

Case q ∉ Q_shp. We distinguish all possible forms of the hedge h.

Subcase h = h′·h″. Let v′ = nw(h′) and v″ = nw(h″), so that v = v′·v″ = nw(h). The concatenation rule of the streaming transition relation must have been applied, so (q, ε) →v′_str (p, σ) wrt Δ and (p, σ) →v″_str (q′, ε) wrt Δ for some state p and stack σ. Since v′ is a nested word, it follows that σ = ε. By the induction hypothesis applied to the nested words v′ and v″, we get (q, ε) →v′_str,shp (p, ε) wrt Δ and (p, ε) →v″_str,shp (q′, ε) wrt Δ. Hence, we can apply the concatenation rule for q ∉ Q_shp, which shows (q, ε) →v′·v″_str,shp (q′, ε) wrt Δ, as required.

Subcases h = ⟨h′⟩ or h = a or h = ε. Straightforward.
For the implication from the right to the left, we assume that (q, ε) →v_str,shp (q′, ε) wrt Δ.

Case q ∈ Q_shp. Then, the first projection rule must have been applied with q = q′. Since we assume that v does not have any blocking partial runs with Δ, there exists a partial run on v with Δ that starts with q. Since v is a nested word, any partial run on v is the nested word of some run, so there exists a run on the hedge h that starts with q. This shows the existence of some state p such that q →h p wrt Δ. Since q ∈ Q_shp, Lemma 6 shows q = p, and thus, q →h q wrt Δ. Proposition 8 then proves that (q, ε) →nw(h)_str (q, ε) wrt Δ, which is equal to (q, ε) →v_str (q′, ε) wrt Δ, as required.

Case q ∉ Q_shp. We distinguish all possible forms of the hedge h.
Subcase h = h′·h″. Let v′ = nw(h′) and v″ = nw(h″), so that v = v′·v″ = nw(h). The concatenation rule for q ∉ Q_shp must have been applied, so (q, ε) →v′_str,shp (p, σ) wrt Δ and (p, σ) →v″_str,shp (q′, ε) wrt Δ for some state p and stack σ. Since v′ is a nested word, it follows that σ = ε too. By the induction hypothesis applied to the nested words v′ and v″, we get (q, ε) →v′_str (p, ε) wrt Δ and (p, ε) →v″_str (q′, ε) wrt Δ. Hence, we can apply the concatenation rule of the streaming transition relation, which shows (q, ε) →v′·v″_str (q′, ε) wrt Δ, as required.

Subcases h = ⟨h′⟩ or h = a or h = ε. Straightforward.

This ends the proofs of both directions.
A streaming evaluator with subhedge projection for deterministic SHAs A on hedges h can thus be obtained by computing the streaming transition relation with subhedge projection for A on nw(h), starting with the initial configuration. This costs time at most O(1) per letter of nw(h), i.e., constant time per event of the stream.
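The projecting streaming evaluator can be sketched in the same style. This Python fragment (with illustrative names, not the actual implementation) skips all events of a subhedge whose opening event leads into a subhedge projection state, tracking only the nesting depth:

```python
def run_stream_shp(letter, tree_open, apply, init, shp, events):
    """Streaming run with subhedge projection: letter, tree_open, and apply
    are the transition maps of a dSHA given as dictionaries, and shp is its
    set of subhedge projection states."""
    state, stack, skip_depth = init, [], 0
    for e in events:
        if skip_depth > 0:           # inside a projected subhedge
            if e[0] == 'open':
                skip_depth += 1
            elif e[0] == 'close':
                skip_depth -= 1
                if skip_depth == 0:  # subhedge ends: the state never changed
                    state = apply[(stack.pop(), state)]
            continue                 # all other events are simply ignored
        if e[0] == 'open':
            stack.append(state)
            state = tree_open[state]
            if state in shp:         # project the whole subhedge away
                skip_depth = 1
        elif e[0] == 'close':
            state = apply[(stack.pop(), state)]
        else:
            state = letter[(state, e[1])]
    return state

# The subhedge opened into projection state 9 is skipped entirely, so no
# letter transitions are needed for its content at all.
result = run_stream_shp(letter={}, tree_open={0: 9}, apply={(0, 9): 7},
                        init=0, shp={9},
                        events=[('open',), ('letter', 'b'), ('open',),
                                ('close',), ('close',)])
```

This relies on the fact stated above that a subhedge projection state cannot be changed, so the state at the matching closing event is the same one that was entered at the opening event.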
9.3. Streaming Complexity of Earliest Membership with Subhedge Projection
Using streaming evaluators for SHAs, we can test earliest membership with subhedge projection in streaming mode by running the SHA Acgr(S)e(S).
Corollary 3. For any complete dSHA A = (Σ, Q, Δ, I, F), F ⊆ FS ⊆ Q, language L = L(A), and schema S = L(A[F/FS]), earliest membership with complete subhedge projection for L wrt S can be tested in streaming mode for any hedge h ∈ S in time O(1) per nonprojected letter of nw(h) after a preprocessing time of O(|Acgr(S)e(S)| + n³·d + n·s), where d is the number of difference relations and s is the number of safe subsets in states of Qcgr(S)e(S). The space required is in O(depth(h) + |Acgr(S)e(S)|).
Proof. By Theorem 4, the dSHA Acgr(S)e(S) is earliest and sound and complete for subhedge projection. By Propositions 8 and 9, we can thus obtain a streaming membership tester with subhedge projection for a hedge h ∈ S by running the streaming evaluator with subhedge projection of Acgr(S)e(S) on the nested word nw(h). Since this SHA is deterministic, the streaming evaluation can be done in time O(1) per letter of nw(h) after a preprocessing time of O(|Acgr(S)e(S)| + n³·d + n·s).
As for safe-no-change projection, the exponential preprocessing time can again be avoided by creating the part of the SHA Acgr(S)e(S) needed for the evaluation of the hedge on the fly. The size of the needed part is in O(|h|). Hence, the space requirements can also be reduced to O(|h|), which may be too much for large streams nw(h). The needed part of Acgr(S)e(S), however, may be much smaller than |h|. Furthermore, a stack of size depth(h) has to be maintained, but this dependence on h may be acceptable even for large streams nw(h).
10. Monadic Regular Queries and XPath
We next consider the problem of how to answer monadic regular queries on hedges in an earliest manner and with subhedge projection. Each monadic query defines a set of positions for each input hedge. The XPath query self::list[child::item], for instance, can be applied to the hedge ⟨list⟩·⟨list·⟨item⟩⟩·⟨list·⟨item⟩⟩, where it selects the position of the first top-most tree labeled by list since it contains a child tree labeled by item. We identify positions of hedges by integers. For instance, the selected list element in the hedge above is identified by the integer 2. Therefore, the answer set of the above query on the above hedge is {2}. Note that monadic queries do not return any other information about the selected positions, in contrast to the subtree semantics of XPath in the official W3C standard.

Answering this query is more complicated than verifying whether the XPath filter [self::list/child::item] matches at the root of the first subtree of the hedge, since the answer set needs to be constructed, too. Concerning memory consumption, alive answer candidates need to be buffered for monadic query answering. Therefore, the memory consumption of streaming evaluators of monadic queries may grow linearly in the size of the stream if the number of alive candidates does.
10.1. Monadic Regular Queries
Monadic queries on hedges in HΣ can be identified with languages of annotated hedges in HΣ∪{x}, where x ∉ Σ is an arbitrarily fixed selection variable. A monadic query on hedges in HΣ can then be identified with the language of all annotated hedges in HΣ∪{x} in which a single position got x-annotated and this position is selected by the query.

Therefore, regular monadic queries on hedges in HΣ can be defined by nested regular expressions in nRegExpΣ∪{x} or by a dSHA with the same signature. The regular XPath query self::list[child::item], for instance, can be defined by the following nested regular expression:

⟨list · x · ⊤ · ⟨item · ⊤⟩ · ⊤⟩ · ⊤,

where ⊤ is like a wildcard accepting any hedge in HΣ without x, i.e., for some recursion variable z ∈ V:

⊤ = μz.(ε + Σ_{a∈Σ} a + ⟨z⟩ + z·z).

Note that the nested regular expression for the XPath filter [self::list/child::item] in nRegExpΣ can be obtained from that of the XPath query self::list[child::item] by removing the x. In addition, for any monadic query on hedges, we have to restrict the annotated hedges recognized by the language of the query such that exactly one position in each hedge is annotated by x. This can be done by intersection with the following nested regular expression one_x:

μz.(⊤·x·⊤ + ⊤·⟨z⟩·⊤).
The previous schema ⟦doc⟧ for pattern matching now has to be changed to S = ⟦doc_x⟧ ∩ ⟦one_x⟧, where doc_x ∈ nRegExpΣ∪{x} is like doc but over the signature Σ ∪ {x}. For the regular XPath query self::list[child::item], we obtain the dSHA in Figure 22. The transition rules with the special letter x indicate where positions may be bound to the selection variable: x can be bound only in state 2, that is, in subtrees starting with a list element. The same automaton can be used to define the schema S = ⟦doc_x⟧ ∩ ⟦one_x⟧ by using the schema-final states FS = {14, 20} instead of the final states F = {20}. Indeed, we constructed the automaton by intersecting the automaton obtained from the XPath expression with the automata computed from the regular expressions doc_x and one_x for the schema. Note that 20 is a final state for the query (and the schema), while 14 is a sink state for the query (but not for the schema).
Figure 22. The dSHA for the XPath query self::list[child::item]. The selection position is indicated by x.
10.2. Earliest Query Answering with Subhedge Projection
A naive query evaluation algorithm for a hedge automaton with signature Σ ∪ {x} receives an input hedge h ∈ HΣ, guesses all possible positions where to insert x into h, and tests for all of them whether the annotated hedge obtained is accepted by A. In a less naive manner, one inserts x only at those positions where the current state has an outgoing x transition. All this can be improved if the hedge automaton under consideration supports selection, rejection, or projection states. In all cases, the algorithm has to buffer all binding candidates that are not yet in a selection or rejection state. In the worst case, it has to buffer the bindings of all positions. Therefore, the state space of the evaluator of monadic queries is no longer finite.
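The naive algorithm can be sketched as follows in Python, where a hedge is encoded as a nested word over brackets and labels, positions number the trees in document order, and accepts stands for any membership tester for the automaton's language (all names here are illustrative):

```python
def naive_answers(nested_word, accepts):
    """Insert x after each tree label and keep the positions whose
    annotated nested word is accepted."""
    answers, pos = [], 0
    for i, sym in enumerate(nested_word):
        if sym == '<':  # a tree position starts; its label follows at i + 1
            pos += 1
            candidate = nested_word[:i + 2] + ['x'] + nested_word[i + 2:]
            if accepts(candidate):
                answers.append(pos)
    return answers

# A stand-in membership tester for self::list[child::item]: x must annotate
# a list element whose next event opens an item child.
def accepts(w):
    j = w.index('x')
    return (w[j - 1] == 'list' and j + 2 < len(w)
            and w[j + 1] == '<' and w[j + 2] == 'item')

hedge = ['<', 'list', '>', '<', 'list', '<', 'item', '>', '>']
```

On this encoding of the hedge ⟨list⟩·⟨list·⟨item⟩⟩, naive_answers(hedge, accepts) returns [2], in line with the example answer set {2} above; the point of the sketch is that every position gives rise to one buffered candidate.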
Selection and rejection states can be identified by adapting the earliest query answering algorithm from [21]. This can be done such that one obtains a SHA Ae(S) with signature Σ ∪ {x} depending on the schema. It should be noticed that the automata used in [21] are dNWAs instead of SHAs, and that schemas are ignored there. The needed adaptations are therefore discussed in Appendix A. How to obtain an efficient query answering algorithm in streaming mode from Ae(S) is the main topic of [21]. This is the base algorithm that we will extend with subhedge projection for our experiments.
In order to add congruence projection to the earliest query answering algorithm, we have to run the automaton Acgr(S)e(S). The earliest congruence projection for the XPath query self::list[child::item] is shown in Figure 23. It binds x in state 2D1 after having read a top-level list-element and then checks for the existence of a child element item, going into selection state 1D3 once it is found. All other children of the top-level list element are projected in state 2D3. For adding earliest query answering to safe-no-change projection, we have to run the automaton Aπe(S) instead of Acgr(S)e(S). We do not give the details but claim for the present query that the projection power of both automata is the same.
Figure 23. The earliest congruence projection Acgr(⟦doc⟧∩⟦one_x⟧)e(⟦doc⟧∩⟦one_x⟧) for the dSHA A in Figure 22 with FS = {14, 20}. It defines the query of the XPath expression self::list[child::item].
Fortunately, the soundness and completeness theorems for earliest membership testing with subhedge projection only affect the constructions of these SHAs, but not how they are to be evaluated. However, there is one additional issue: nested word prefixes cannot be considered irrelevant for the subsequent subhedge if the selection variable x can be bound there, even if the acceptance does not depend on where it will be bound. This is because the position of the binding is to be returned by any query answering algorithm.

Definition 15. We call a prefix v binding irrelevant for L and S if there does not exist any hedge h containing x and a suffix w such that v · w ∈ nw(S) and v · nw(h) · w ∈ nw(L).
Our subhedge projecting evaluator can project only at subhedge irrelevant prefixes that are also binding irrelevant. Given a dSHA A and a set of schema-final states FS, one can decide binding irrelevance from the current state q: the state q is binding relevant if it does not bind x itself but can access some other state that permits x-bindings and is not safe for rejection. In ground Datalog, we can distinguish binding irrelevant states q by imposing the following rules for all q, q′ ∈ Q:

binding_irrelev(q) :- state(q), not binding_relev(q).
binding_relev(q) :- not bind_x(q), acc(q,q’), bind_x(q’), not rej(q’).
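A direct reading of these two rules, sketched in Python rather than Datalog, assuming the accessibility relation acc and the sets bind_x and rej of states that permit x-bindings and are safe for rejection are precomputed:

```python
def binding_irrelevant(states, acc, bind_x, rej):
    """States q that are not binding relevant: following the rules literally,
    q is binding relevant iff it does not bind x itself but can access a
    state that binds x and is not safe for rejection."""
    binding_relev = {q for q in states
                     if q not in bind_x
                     and any(q2 in bind_x and q2 not in rej
                             for (q1, q2) in acc if q1 == q)}
    return set(states) - binding_relev

# Tiny hypothetical example: state 1 can access state 2, which binds x and
# is not rejecting, so 1 is binding relevant; 2 and 3 are binding irrelevant.
irrelev = binding_irrelevant({1, 2, 3}, {(1, 2), (2, 3)}, {2}, set())
```

In a Datalog engine, the negations are handled by stratification; the set comprehension above plays the same role for this non-recursive program.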
Note in our example of the XPath query self::list[child::item] that state 2D3 of the dSHA Acgr(⟦doc⟧∩⟦one_x⟧)e(⟦doc⟧∩⟦one_x⟧) in Figure 23 is binding irrelevant, since x must be bound on the top-level list element in state 2D1, so it cannot be bound on any list element below.
11. Experiments with Streaming XPath Evaluation
We compare experimentally three streaming evaluators for dSHAs for earliest monadic query answering: without subhedge projection, with safe-no-change projection, and with congruence projection.
11.1. Benchmark dSHA for XPath Queries
We start from dSHAs defining monadic queries for the regular XPath queries A1–A8 from the XPathMark benchmark [7], which are given in Table 1. These XPath queries exhibit most of the features of the regular fragment of XPath. In Table 2, we added 14 further XPath queries that we found useful for testing, too.
Table 1. Regular XPath queries A1-A8 from the XPathMark collection.
Query ID   XPath Expression
A1   /site/closed_auctions/closed_auction/annotation/description/text/keyword
A2   //closed_auction//keyword
A3   /site/closed_auctions/closed_auction//keyword
A4   /site/closed_auctions/closed_auction[annotation/description/text/keyword]/date
A5   /site/closed_auctions/closed_auction[descendant::keyword]/date
A6   /site/people/person[profile/gender and profile/age]/name
A7   /site/people/person[phone or homepage]/name
A8   /site/people/person[address and (phone or homepage) and (creditcard or profile)]/name
Table 2. Additional regular XPath queries for the XPathMark documents.
A0 /site
A1_0a /site/*
A1_0b /site/@*
A1_0c /site//@*
A1_1a //bidder/personref[starts-with(@person,’person0’)]
A1_1d //bidder/personref[@person=’person0’]
A1_2 //person
A1_3 /site/regions/africa//@*
A1_4 /site/regions/africa/*
A1_5 /site/regions/*
A1_6 //closed_auction/annotation//keyword
A2_1 //closed_auction[descendant::keyword]
A4_0 /site/closed_auctions/closed_auction[annotation]/date
A4_1 /site[open_auctions]/closed_auctions
Deterministic SHAs for A1–A8 were provided earlier in [21]. The other XPath queries were compiled to dSHAs via nested regular expressions.
In order to produce the input dSHAs for our evaluators, we intersect these automata with a dSHA for the schema ⟦doc_x⟧ ∩ ⟦one_x⟧, where doc_x is the schema we use for hedge encodings of real-world XML documents with x annotations. Thereby, we could identify the schema-final states FS. We minimized and completed the result. Table 3 reports the size of the input dSHAs for our evaluators obtained by the above procedure. For each dSHA A, the size is given by two numbers size(n): the first is the overall size and the second the number of states n.
For the input dSHA A of each query, we statically computed the whole dSHA Acgr(S)e(S) while using the necessary parts of the determinization algorithm from [16]. The sizes of the dSHAs obtained and the number d of difference relations are reported in Table 3, too. The biggest size is 2091(504), for the input dSHA for A8. The largest number of difference relations, d = 24, is also obtained for this query. So, indeed, the size of these automata is much smaller than one might expect from a construction that is highly exponential in the worst case. However, the only query for which we could not obtain the corresponding deterministic earliest congruence projection Acgr(S)e(S) is A5. The time for computing the earliest congruence projection took between 2.3 and 26 s.
Table 3. Overall size and number n of states of the input dSHA A for the query with schema-final states FS, the overall size and number of states of the SHA Acgr(S)e(S) obtained by earliest congruence projection, and the number d of its difference relations.

Query ID | dSHA A Size(n) | Acgr(S)e(S) Size(#States) | #Diff-Rel. d | Query ID | dSHA A Size(n) | Acgr(S)e(S) Size(#States) | #Diff-Rel. d
A1 482(68) 1296(324) 16 A1_0c 230(43) 238(62) 6
A2 224(42) 316(82) 7 A1_1a 305(54) 384(101) 8
A3 320(53) 662(156) 10 A1_1d 305(54) 382(101) 8
A4 629(74) 1651(404) 18 A1_2 194(39) 152(42) 5
A5 438(63) 1226(269) 13 A1_3 318(53) 672(159) 10
Table 3. Cont.

Query ID | dSHA A Size(n) | Acgr(S)e(S) Size(#States) | #Diff-Rel. d | Query ID | dSHA A Size(n) | Acgr(S)e(S) Size(#States) | #Diff-Rel. d
A6 675(76) 2090(500) 22 A1_4 312(52) 479(132) 10
A7 394(59) 728(184) 12 A1_5 266(47) 293(84) 8
A8 648(79) 2091(504) 24 A1_6 265(47) 588(142) 9
A0 203(40) 158(44) 5 A2_1 232(42) 295(78) 7
A1_0a 224(42) 145(44) 6 A4_0 392(59) 719(184) 12
A1_0b 203(39) 68(23) 5 A4_1 292(49) 287(78) 8
We note that we have not yet computed the complete dSHAs statically for safe-no-change projection Asnc, pure congruence projection Acgr(S), and earliest query answering Ae(S). We believe that these automata may become bigger than the Acgr(S)e(S) that we constructed in a single shot.
11.2. Streaming Evaluation Tool: AStream
We are using and developing the AStream tool for answering monadic dSHA queries on XML streams with schema restrictions. Version 1.01 of AStream, presented in [21], supports earliest query answering without projection. Given a dSHA A defining a monadic query, a set of schema-final states FS, and an XML stream w, it constructs on-the-fly the needed part of the SHA Ae(S) while parsing w and evaluating A on the hedge encoding of w.
For the FCT conference version of the present paper [24], we enhanced AStream with safe-no-change projection. This leads to version AStream 2.01. It constructs the earliest safe-no-change projection dSHA Asnce(S) on the fly while evaluating the monadic query defined by A on the hedge encoding the XML stream w. Subhedge projection can be switched on or off in order to compare both versions without further implementation differences.
For the present journal version, we added earliest congruence projection and integrated it into AStream, leading to AStream 3.0. AStream 3.0 differs from the two previous versions in that the dSHA Acgr(S)e(S) is constructed statically and entirely, independently of the input hedge. We then use a generic earliest streaming evaluator for SHAs that rejects in rejection states, selects in selection states, and projects in all subhedge projection states. This evaluator could also be run with Asnce(S) or Ae(S), as long as these do not grow too big.
It should be noticed that Acgr(S)e(S) turned out to be nicely small for our whole benchmark, while the other dSHAs Asnce(S) and Ae(S) risk becoming bigger. As stated earlier, we did not construct these dSHAs so far for our benchmark queries. This is why we continued to run earliest streaming with safe-no-change projection with AStream 2.01, while we used AStream 3.0 for earliest streaming with congruence projection.
All AStream versions rely on Java's AbcDatalog for computing the least fixed points of Datalog programs in a bottom-up manner. Datalog programs are needed for all logical reasoning: for computing the subsets of states that are safe for rejection, selection, or subhedge projection on all levels. The difference relations are also computed based on Datalog. Note that earliest on-the-fly query evaluation requires running Datalog during query evaluation, while with a static approach constructing the dSHAs entirely, Datalog is only needed at preprocessing time during the automaton construction.
11.3. Evaluation Measures
We want to measure the time for running the three streaming query evaluators for all dSHAs obtained from the XPath expressions in the collection in Tables 1 and 2.
One advantage of the XPathMark benchmark is that it comes with a generator of XML documents to which the queries can be applied and which can be scaled in size. We created an XML document of size 1.1 GB, which is sufficiently large to make streaming mandatory. Our experiments show that the time of query evaluation grows linearly with the size of the non-projected part of the XML document. This holds for each of our three evaluators. Therefore, measuring the time of the evaluators on XML documents of other sizes would not show any new insights, except that memory consumption remains moderate, too.
The time gain of projection for a query Q is the percentage of time saved by projection during query evaluation, when ignoring the pure document parsing time of t_parse seconds:

time-gain_Q = 100 · (1 − (t_project_Q − t_parse) / (t_noproject_Q − t_parse)).
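As a quick sanity check of this formula (a sketch only: the per-query no-projection times vary between 600 and 900 s, so the average reported below is an approximation here):

```python
def time_gain(t_project, t_noproject, t_parse):
    """Percentage of evaluation time saved by projection, parsing excluded."""
    return 100 * (1 - (t_project - t_parse) / (t_noproject - t_parse))

# Query A1 from Table 4: 29.3 s with projection, 752 s average without
# projection, 15 s parsing time; this reproduces a time gain of about 98.1%.
gain_a1 = time_gain(29.3, 752, 15)
```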
We can measure the time gain for earliest safe-no-change projection by AStream 2.01 and for earliest congruence projection by AStream 3.0, since projection can be switched off in both of them. The disadvantage of the time gain is that it depends on the details of the implementation.
The event gain is a better measure of the projection power. The parser sends a stream of needed events to the SHA, while ignoring parts of the input document that can be projected away. For this, it must always be informed by the automaton about which kinds of events can be projected in the current state. The event gain of projection event-gain_Q is then the percentage of events that are recognized to be irrelevant and thus no longer created by a projecting evaluator for query Q. One might expect that the time gain is always smaller than the event gain:

time-gain_Q ≤ event-gain_Q.

Indeed, this will be confirmed by our experiments. We believe that the discrepancies between these two measures indicate how much room remains for optimizing the implementation of an evaluator. In this way, we can observe that some room for further optimization of AStream 3.0 still remains.
11.4. Experiments
We used Java’s
XMLStreamReader
Interface v1.0 from the
javax
.
xml
.
stream
package to
parse and stream XM L files in Scala. Parsing a 1.1 GB XML document required
tparse =15
s,
while querying it without projection took us in average
tnoprojectavg =752
s, while varying
from 600 to 900 s. The average processing time for 1 MB was thus 0.72 s. Once we knew
this, we predicted the expected evaluation time from the size of the non-projected part of
the XML document. It is as follows:
event-gainQ1100 0.72 s +tparse s.
Without projection, the parser generates 1,012, 018, 728 events for the 1.1 GB XML document.
This is the baseline for computing event-gainQfor all queries Q.
We ran AStream with Scala v2.13.3 on a machine with the operating system Ubuntu 20.04.06 LTS (64-bit). The machine is a Dell Precision-7750 (Round Rock, TX, USA), equipped with an Intel® Core™ i7-10875H CPU @ 2.30 GHz × 16 and 32 GB of RAM (Santa Clara, CA, USA).
11.4.1. Earliest Congruence Projection
We present the measurements of our evaluator with earliest congruence projection in Table 4. The event gain is high, varying between 75.7% for A1_0c and 100% for A0, A1_0a, A1_0b, A1_4, A1_5, and A4_1. The event gain for queries without the descendant axis is at least 98.2%. For these queries, all subtrees that are not on the main path of the query can be projected away. In particular, only subtrees up to a fixed depth have to be inspected. These are the queries A1, A4, A6–A8, and A4_0, and the queries with event gain 100% listed above. The time gain for all these queries is a little smaller than the event gain but still at least 96.9%.
Table 4. Time gain and event gain by earliest congruence projection with AStream 3.0 on a 1.1 GB XPathMark stream.

Query ID | #Answers | Time in s | Time Gain | Event Gain | Query ID | #Answers | Time in s | Time Gain | Event Gain
A1 38,267 29.3 98.1% 98.9% A1_0c 3,600,816 226.6 72.9% 75.7%
A2 117,389 171.8 77.9% 81.1% A1_1a 2 187.8 78.1% 80.3%
A3 117,389 33.4 97.4% 97.8% A1_1d 2 217.3 74.8% 80.3%
A4 24,959 33.2 97.8% 98.9% A1_2 1,062,965 238.2 68.2% 76.0%
A5 50,186 A1_3 25,074 16.2 99.1% 99.8%
A6 30,307 45.9 96.9% 98.2% A1_4 5170 17.5 99.7% 100.0%
A7 179,889 30.5 98.0% 98.7% A1_5 6 14.3 100.0% 100.0%
A8 67,688 37.8 97.3% 98.7% A1_6 117,389 188.7 75.4% 81.1%
A0 1 21.3 99.1% 100.0% A2_1 50,186 199.7 76.5% 81.1%
A1_0a 6 15.2 99.9% 100.0% A4_0 91,650 21.8 99.0% 99.3%
A1_0b 0 0.0 100.0% 100.0% A4_1 1 12.0 100.0% 100.0%
We next consider the second type of queries, those with the descendant axis, specifically A2, A1_0c, A1_1a, A1_1d, A1_2, A1_6, and A2_1. Intuitively, a lower event gain is expected in these cases because subhedge projection with the descendant axis cannot directly exclude an entire subtree under a given element; all descendant elements must be inspected. For instance, the query A1_0c, equal to /site//@*, must select all attributes of all elements under the root element site, requiring the inspection of every XML element in the document for attribute presence, except for the root element site.
The lower event gains reported for the aforementioned queries align with the above expectation, showing percentages ranging from 75.7% to 81.1%. Also, these queries exhibit a larger gap between time gain and event gain, which ranges from 2.2% to 5.7% and will be subject to future optimization. It is worth noting that some of these queries combine multiple features at once, like queries A1_1a and A1_1d, which have a descendant axis, a filter with attributes, and string comparison.
The only exceptions to this trend for the queries with a descendant axis are A3 and A1_3, with respective event gains of 97.8% and 99.8%. The reason behind this high gain is that these queries start from the root with a child axis path of depth 3 before querying the descendant axis. This initial path allows for the exclusion and projection of most elements under site that do not fit the specified path, together with their entire subtrees.
11.4.2. Earliest Safe-No-Change Projection
We compare earliest safe-no-change projection with earliest congruence projection in Table 5. For this, we restrict ourselves to the queries A1–A8 from the XPathMark in Table 1.

The XPath queries A1, A4, A6, A7, and A8 do not have descendant axes. The time gain for safe-no-change projection on these queries is between 92.3% and 98.9%. For earliest congruence projection, the time gain was better for A1, A4, and A6, with at least 96.9%. For A7 and A8, the time gain of safe-no-change projection is close to the event gain of congruence projection, but the time gain of congruence projection is about 1.5% lower. So, we hope that an optimized version of congruence projection could outperform safe-no-change projection in all cases.
The XPath queries with a descendant axis are A2, A3, and A5. For A2 and A3 in particular, the time gain of safe-no-change projection is very low, at most 10.9%, while congruence projection yields at least 77.9%. We believe that this is a bug in our current implementation of safe-no-change projection, which may be prohibiting the projection of attributes and text nodes for these queries (but not for the others). Congruence projection on A2 gains 77.9% of time and 81.1% of events. This is much better, but still far from the projection power for queries without a descendant axis. For A3, congruence projection gains 97.4% of time. This is much better than for A2 since the descendant axis in A3 is at the end, while for A2 it is at the beginning of the path. But even for A2, congruence projection is very helpful.
Table 5. Time gain for earliest safe-no-change and earliest congruence projection, and the event gain of earliest congruence projection.

Query ID | Time (s) Snc | Time (s) Congr | Time Gain Snc | Time Gain Congr | Event Gain Congr
A1       | 72.8         | 29.3           | 92.3%         | 98.1%           | 98.9%
A2       | 664.6        | 171.8          | 8.5%          | 77.9%           | 81.1%
A3       | 666.1        | 33.4           | 10.9%         | 97.4%           | 97.8%
A4       | 78.5         | 33.2           | 92.4%         | 97.8%           | 98.9%
A5       | 77.7         | –              | 92.3%         | –               | –
A6       | 65.2         | 45.9           | 95.1%         | 96.9%           | 98.2%
A7       | 24.6         | 30.5           | 98.8%         | 98.0%           | 98.7%
A8       | 24.8         | 37.5           | 98.9%         | 97.3%           | 98.7%
11.4.3. Comparison to External Tools
QuiXPath was shown to be the most efficient large-coverage external tool for regular XPath evaluation on XML streams [4]. Therefore, we compare the efficiency of congruence projection to QuiXPath in Table 6, and thereby, by transitivity, to the many alternative tools. We note that QuiXPath performs early query answering by compilation to possibly nondeterministic NWAs with selection states, without always being earliest. Apart from this, it bases its efficiency on both subtree projection and descendant projection.
Table 6. Comparison of parse-free timings in seconds for XPathMark queries with QuiXPath and earliest congruence projection.

Query ID | Parse-Free Time (s) QuiXPath | Parse-Free Time (s) Congr. Prj. | Fract.
A1       | 11.0                         | 14.3                            | 1.3
A4       | 11.6                         | 18.2                            | 1.6
A6       | 10.7                         | 30.9                            | 2.9
A7       | 8.6                          | 15.5                            | 1.8
A8       | 8.8                          | 22.5                            | 2.6
A2       | 11.4                         | 156.8                           | 13.8
A3       | 11.5                         | 18.4                            | 1.6
A5       | 12.0                         | –                               | –
The experiments with QuiXPath in [4] use the parse-free time. Therefore, we do the same for congruence projection by reducing the measured overall time by the 15 s of parsing time.
Compared to the parse-free times for the XPath queries without a descendant axis, the earliest congruence projection demonstrates significant improvements. On average, our projection is only slower by a factor of 1.3–2.9 than QuiXPath. This shows that our implementation is already close to being competitive with the existing XML streaming tools, while being the only one guaranteeing earliest query answering.

For query A2 with a descendant axis, congruence projection is slower than QuiXPath by a factor of 13.8. The reason is that QuiXPath supports descendant projection, while congruence projection concerns only subhedges. For query A3 with a descendant axis at the end, congruence projection is only behind by a factor of 1.6, so descendant projection seems less relevant here.
12. Related Automata Models
A large number of alternative automata notions for trees, forests, and hedges were proposed in the literature. We compare them to SHAs and SHA↓s in the following subsections and discuss the implications.
12.1. SHAs versus Tree, Forest, and Hedge Automata
Standard Hedge Automata. Standard hedge automata [9,10,32,33] operate on labeled hedges in a bottom-up manner, similarly to SHAs. They also have the same expressiveness, but horizontal languages are specified differently, leading to a problematic notion of determinism [34]. For this reason, unique minimization fails for deterministic standard hedge automata. Still, they have the same expressiveness as SHAs when restricted to labeled hedges.
Standard Tree Automata and SHA Minimization. Syntactically, any SHA A is a standard tree automaton over the following ranked signature:
- a unary symbol for any letter a ∈ Σ;
- a binary symbol @;
- a constant ⟨⟩.
Semantically, there is no perfect general correspondence. Only for subclasses can a perfect correspondence be made up via binary encoding. This was shown in [8] for the subclass of dSHAs whose initial states coincide with the tree initial states. In [35], it could also be shown for the subclass of multi-module dSHAs. This leads to unique minimization results for both subclasses of deterministic SHAs.
Bojanczyk's Forest Automata. Standard hedge automata also have the same expressiveness as Bojanczyk's forest automata (see Section 3.3 of [36]). These are based on the idea of transition monoids, rather than on the idea of transition rules as used by SHAs, finite state automata, or Neumann and Seidl's forest automata. As a consequence, there is an exponential difference in succinctness between SHAs and Bojanczyk's forest automata [35].
12.2. SHA↓s versus SHAs
We start by explaining why SHA↓s have the same expressiveness as SHAs. Basically, it is the same reason why NWAs have the same expressiveness as SHAs (see, e.g., [8]). For the easier direction, we already argued that any SHA A = (Σ, Q, Δ, I, F) can be converted to a SHA↓ down(A).
Downward Elimination. Conversely, we can convert any SHA↓ A into an equivalent SHA elim(A), while possibly introducing nondeterminism, as follows:

- I^elim = I × Q and F^elim = F × Q;
- if q →⟨⟩ q′ in Δ, then ⟨⟩ → (q, q′) in Δ^elim;
- if q →a q′ in Δ and r ∈ Q, then (r, q) →a (r, q′) in Δ^elim;
- if q @ p → q′ in Δ and r ∈ Q, then (r, q) @ (q, p) → (r, q′) in Δ^elim.

Proposition 10. L(A) = L(elim(A)).
The construction is analogous to the conversion of NWAs to SHAs [8] or to hedge automata [26]. The correctness proofs for these compilers are standard.
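The downward elimination rules can be transcribed directly into code. The following Python sketch is our own encoding, not from the article; transitions of the SHA↓ are given as sets of tuples, and a pair (r, q) combines the state r of the enclosing context with the current state q:

```python
def downward_elimination(Q, tree_rules, letter_rules, apply_rules, I, F):
    """Generate the rule sets of elim(A) from a downward SHA A.
    tree_rules:   set of (q, q2)     for  q -<>-> q2   in Delta
    letter_rules: set of (q, a, q2)  for  q -a->  q2   in Delta
    apply_rules:  set of (q, p, q2)  for  q @ p -> q2  in Delta"""
    elim_tree = {(q, q2) for (q, q2) in tree_rules}            # <> -> (q, q2)
    elim_letter = {((r, q), a, (r, q2))
                   for (q, a, q2) in letter_rules for r in Q}  # (r,q) -a-> (r,q2)
    elim_apply = {((r, q), (q, p), (r, q2))
                  for (q, p, q2) in apply_rules for r in Q}    # (r,q) @ (q,p) -> (r,q2)
    I_elim = {(q, r) for q in I for r in Q}                    # I x Q
    F_elim = {(q, r) for q in F for r in Q}                    # F x Q
    return elim_tree, elim_letter, elim_apply, I_elim, F_elim
```

The pairing of initial and final states with arbitrary second components follows the displayed definitions I^elim = I × Q and F^elim = F × Q literally.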
Unique Minimization. Unique minimization fails for dSHA↓s, as usual for deterministic multiway automata. This even happens for deterministic two-way finite state automata on words. In contrast, the subclass of dSHAs with the restriction that the initial state and the tree initial state are the same enjoys unique minimization [8].

Neumann and Seidl's Pushdown Forest Automata. Standard hedge automata are also known to have the same expressiveness as Neumann and Seidl's pushdown forest automata [25]. The latter can be identified with SHA↓s for labeled hedges. What is needed to map pushdown forest automata to standard hedge automata is downward elimination, as for mapping SHA↓s to SHAs.
12.3. SHA↓s versus NWAs
The streaming evaluator for SHA↓s via a visibly pushdown machine can also be obtained by compiling a SHA↓ to a nested word automaton (NWA) [11], whose streaming evaluator is given by a visibly pushdown machine [12], previously known as an input-driven automaton [13–15].
Definition 16. A nested word automaton (NWA) is a tuple (Σ, Q, Γ, Δ, I, F), where Σ, Γ, and Q are sets, I, F ⊆ Q, and Δ = ((→a)_{a∈Σ}, ↓, ↑) contains the following relations: →a ⊆ Q × Q, ↓ ⊆ Q × (Γ × Q), and ↑ ⊆ Q × Γ × Q. An NWA is deterministic, or equivalently a dNWA, if I contains at most one element and all the above relations are partial functions.

The elements of Γ are called stack symbols. The transition rules in Δ again have three forms: internal rules q →a q′ in Δ as for SHAs, opening rules q ↓γ→ q′ in Δ if ↓(q) = (q′, γ), and closing rules q ↑γ→ q′ in Δ if ↑(q, γ) = q′.
Streaming evaluators for NWAs. The streaming evaluator for NWAs can be seen as a pushdown machine for evaluating the nested words of hedges in a streaming manner. A configuration of the pushdown machine is a pair in K = Q × Γ* containing a state and a stack. For any word v ∈ Σ̂*, we define a streaming transition relation →v_str ⊆ K × K, such that, for all q, q′ ∈ Q and σ ∈ Γ*, the following rules are applicable:

- (q, σ) →ε_str (q, σ) wrt Δ;
- if (q, σ) →v_str (q′, σ′) and (q′, σ′) →v′_str (q″, σ″) wrt Δ, then (q, σ) →v·v′_str (q″, σ″) wrt Δ;
- if q →a q′ in Δ, then (q, σ) →a_str (q′, σ) wrt Δ;
- if q ↓γ→ q′ in Δ, then (q, σ) →⟨_str (q′, σ·γ) wrt Δ;
- if q ↑γ→ q′ in Δ, then (q, σ·γ) →⟩_str (q′, σ) wrt Δ.
From SHA↓s to NWAs. For any SHA↓ A = (Σ, Q, Δ, I, F), we can define the NWA A^nwa = (Σ, Q, Γ, Δ^nwa, I^nwa, F^nwa) while preserving determinism, such that Γ = Q, while Δ^nwa contains, for all a ∈ Σ and q, p ∈ Q, the following transition rules:

- if q →a q′ in Δ, then q →a q′ in Δ^nwa;
- if q →⟨⟩ q′ in Δ, then q ↓q→ q′ in Δ^nwa;
- if q @ p → q′ in Δ, then p ↑q→ q′ in Δ^nwa.
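The compilation is a plain re-indexing of the rules: the stack alphabet is Q, and opening a subtree pushes the current state. A Python sketch of this rule translation (our own encoding, with transitions given as sets of tuples):

```python
def sha_down_to_nwa(letter_rules, tree_rules, apply_rules):
    """Compile a deterministic SHA-down into a dNWA with Gamma = Q.
    letter_rules: set of (q, a, q2)  for  q -a->  q2
    tree_rules:   set of (q, q2)     for  q -<>-> q2
    apply_rules:  set of (q, p, q2)  for  q @ p -> q2"""
    internal = {(q, a): q2 for (q, a, q2) in letter_rules}
    opening = {q: (q2, q) for (q, q2) in tree_rules}      # q -<>-> q2 pushes q
    closing = {(p, q): q2 for (q, p, q2) in apply_rules}  # q @ p -> q2 pops q
    return internal, opening, closing
```

The resulting dictionaries are exactly the rule maps a streaming evaluator for dNWAs consumes.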
For instance, the dNWA A^nwa of the dSHA↓ A in Figure 9, obtained from the dSHA for the XPath filter [self::list/child::item] from the introduction, is given in Figure 24. The dNWA A^nwa of the dSHA↓ A in Figure 21 is given in Figure 25.
Figure 24. The dNWA A^nwa for the dSHA↓ A in Figure 9, obtained by applying the down operator to the dSHA in Figure 4 for the filter [self::list/child::item].

Figure 25. The dNWA (A^snc_e(⟦doc⟧))^nwa for the earliest dSHA↓ with safe-no-change projection in Figure 21.
Lemma 21. L(A^nwa) = L(A).

The runs of the SHA↓ A and the NWA A^nwa can be identified.
Projection for NWAs. Projecting evaluators for NWAs were proposed in the context of projecting NWAs [4]. They are based on the following notion of states of irrelevant subtrees for NWAs.

Definition 17 (Variant of Definition 3 of [4]). We call a state q of an NWA a state of irrelevant subtrees if there exist two different stack symbols γ, γ′ and a state q′, such that the transitions q ↓γ→ q′, q′ →a q′ for all a ∈ Σ, q′ ↓γ′→ q′, q′ ↑γ′→ q′, and q′ ↑γ→ q exist, but there are no further opening transitions with γ, no further transitions with γ′, and no further closing transitions in q′ popping γ. In this case, we write q ∈ i-tree.
Lemma 22. Any subhedge projection state of a complete SHA↓ A is a state of irrelevant subtrees of A^nwa.

It was then shown in [4] how to map an NWA with a subset Q_shp of states of irrelevant subtrees to a projecting NWA, whose streaming semantics yield a streaming evaluator with subhedge projection. The presentation of subhedge projection for SHA↓s without passing via NWAs yields the same result in a more direct manner.

It should also be noticed that projecting NWAs support descendant projection besides subhedge projection. For SHA↓s, this is left to future research.
13. Conclusions and Future Work
We introduced the notion of irrelevant subhedges for membership to hedge languages under schema restrictions. We developed algorithms with subhedge projection that jump over irrelevant subhedges when testing membership to the regular hedge languages defined by dSHAs. All our algorithms can be run in in-memory mode and in streaming mode. The difficulty was how to push the needed finite state information for subtree projection top-down, given that SHAs operate bottom-up. We solved it based on compilers from SHAs to downward SHA↓s. The first compiler propagates safety information regarding non-changing states. The second compiler, which we proved to be complete for subhedge projection, propagates difference relations. We then combined subhedge projection with earliest membership testing and with earliest monadic query answering. We integrated both safe-no-change and complete congruence projection in our AStream tool and confirmed their usefulness experimentally: we showed that they can indeed speed up the AStream tool for answering regular XPath queries on XML streams. Moreover, congruence projection showed much higher event projection gain rates than safe-no-change projection, ranging between 75.7% and 100% for the class of queries without a descendant axis, becoming competitive in time with the best existing streaming tools for regular XPath queries without a recursive axis.

A remaining open theoretical and practical question is how to obtain descendant projection for SHA↓s, as needed to speed up the evaluation of XPath queries with a descendant axis. On the implementation and experimental side, it would be nice to investigate the in-memory versions of our subhedge projection algorithms too. Finally, it would be nice to optimize streaming earliest-query answering tools further until they outperform the best streaming tools in all cases, while being better for queries where the other tools are not earliest.
Author Contributions: Writing—original draft, A.A.S. and J.N. All authors have read and agreed to
the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Dataset available on request from the authors.
Conflicts of Interest: The authors declare no conflicts of interest.
Appendix A. Earliest Membership
We adapt the earliest membership tester for dSHAs from [21] so that it compiles to SHA↓s instead of NWAs and so that it accounts for schemas.
Appendix A.1. Schema Safety Is an Inclusion Problem
The idea for testing certain membership and certain nonmembership is to consider states that are safe to reach final states, except if going out of the schema. Let A = (Σ, Q, Δ, I, F) be an SHA with schema S, such that there exists F ⊆ F_S ⊆ Q with S = L(A[F/F_S]). Note that L(A) ⊆ S ⊆ H_Σ in particular. The set of states that are safe to reach a subset P ⊆ Q for all hedges that do not go out of the schema S is as follows:

safe_S(P) = safe(P ∪ (Q \ F_S)).
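Schema-based safety can be computed by a simple reachability analysis. The following Python sketch is our own encoding (all names are ours; `successors` abstracts the one-step accessibility of states under the transition relation Δ):

```python
def acc(q, successors):
    """All states accessible from q, including q itself."""
    seen, todo = {q}, [q]
    while todo:
        p = todo.pop()
        for r in successors.get(p, ()):
            if r not in seen:
                seen.add(r)
                todo.append(r)
    return seen

def safe(P, Q, successors):
    """States all of whose accessible states lie in P."""
    return {q for q in Q if acc(q, successors) <= P}

def safe_wrt_schema(P, F_S, Q, successors):
    """safe_S(P) = safe(P | (Q - F_S)): leaving the schema is harmless."""
    return safe(P | (Q - F_S), Q, successors)

Q = {0, 1, 2}
successors = {0: {1}, 1: {2}, 2: set()}
# With F_S = Q, only state 2 has all its accessible states inside P = {2}:
result = safe_wrt_schema({2}, Q, Q, successors)  # result == {2}
```

Shrinking F_S enlarges the complement Q \ F_S, so states can only become safer under a stricter schema.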
If A is deterministic and schema-complete for S, then the safety of a state of A with respect to schema S is an inclusion problem.

Lemma A1. Let S = L(A[F/F_S]). If A is deterministic and schema-complete for S, then for all states q ∈ Q:

L(A[I/{q}, F/F_S]) ⊆ L(A[I/{q}]) ⇔ q ∈ safe_S(F).
Proof. "⇒". Suppose that q ∉ safe_S(F). The definition of schema-based safety yields acc({q}) ∩ (F_S \ F) ≠ ∅. So, there exists some hedge h ∈ H_Σ, such that q →h q′ ∈ F_S \ F wrt Δ. In particular, h ∈ L(A[I/{q}, F/F_S]). By determinism, there exists no other transition on h from q, and thus, h ∉ L(A[I/{q}]). Hence, L(A[I/{q}, F/F_S]) ⊈ L(A[I/{q}]).

"⇐". Suppose that q ∈ safe_S(F). Let h ∈ L(A[I/{q}, F/F_S]). Then, there exists q′ ∈ F_S, such that q →h q′ wrt Δ. Hence, q′ ∈ acc({q}), so that acc({q}) ⊆ F ∪ (Q \ F_S) implies q′ ∈ F. Thus, h ∈ L(A[I/{q}]).
Lemma A1 with F_S = Q shows Lemma 5 of [21], which states for any complete dSHA A that safety is a universality problem:

L(A[I/{q}]) = H_Σ ⇔ q ∈ safe(F).

This is since the schema-completeness of A for S = H_Σ implies the completeness of A, and thus, L(A[I/{q}, F/Q]) = H_Σ.
Appendix A.2. Algorithm
We now present our earliest membership tester for dSHAs with schema restrictions. Given a SHA A, we compute a SHA↓ A^e(S) = (Σ, Q^e, Δ^e, I^e(S), F^e(S)) as follows. Let sel be a fresh symbol. We start with the set of states that are safe for selection, S_0 = safe_S(F), and, respectively, safe for rejection, R_0 = safe_S(Q \ F). The state sets of A^e(S) are then as follows:

Q^e = (Q × 2^Q × 2^Q) ∪ {sel}
I^e(S) = {(q, R_0, S_0) | q ∈ I, q ∉ R_0 ∪ S_0} ∪ {sel | I ∩ S_0 ≠ ∅}
F^e(S) = {(q, R_0, S_0) | q ∈ F} ∪ {sel}
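The construction of the initial and final state sets can be transcribed directly. A Python sketch (our own encoding; sel is represented by the string 'sel', and the sets R_0, S_0 are frozen for hashability):

```python
def earliest_state_sets(I, F, R0, S0):
    """Initial and final states of the earliest automaton A^e(S):
    normal states are triples (q, R0, S0); 'sel' stands for the fresh
    selection state sel."""
    R0f, S0f = frozenset(R0), frozenset(S0)
    I_e = {(q, R0f, S0f) for q in I if q not in R0 | S0}
    if I & S0:          # some initial state is already safe for selection
        I_e.add('sel')
    F_e = {(q, R0f, S0f) for q in F} | {'sel'}
    return I_e, F_e

# Hypothetical values for illustration (not the actual sets of Figure 4):
I_e, F_e = earliest_state_sets({0}, {4}, {2}, {4})
```

When an initial state is already safe for selection, the evaluator can select at the very first event, before reading any input.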
For illustration, reconsider the dSHA from Figure 4. This dSHA is not complete but schema-complete for the schema ⟦doc⟧. We get s-down(0, {4}) = {3, 4}. It may seem counterintuitive that not only state 3 but also state 4 remains safe for selection when moving down. This reflects the fact that no proper subhedge of any hedge satisfying the intended schema may ever get into state 4. Applying the earliest construction with safe-no-change projection to the dSHA in Figure 4 yields the dSHA↓ in Figure A1. The transition rules in Δ^e are given in Figure A2.
Figure A1. The earliest dSHA↓ A^e for the dSHA A in Figure 19 with schema ⟦doc⟧ for the XPath filter [self::list/child::item]. The states of A^e correspond to the following tuples: 0 = (0, {1, 2, 3, 5}, {4}), 0′ = (0, {2, 4, 5}, {3, 4}), 1′ = (1, {2, 4, 5}, {3, 4}), 0″ = (0, ∅, {2, 4, 5}), 1″ = (1, ∅, {2, 4, 5}), 0‴ = (0, ∅, ∅), 1‴ = (1, ∅, ∅), 2‴ = (2, ∅, ∅), and 3‴ = (3, ∅, ∅).
- if q →a q′ in Δ and q′ ∉ S ∪ R, then (q, S, R) →a (q′, S, R) in Δ^e;
- if ⟨⟩ → q′ in Δ, (q, S, R) ∈ Q^e, and q′ ∉ s-down(q, S) ∪ s-down(q, R), then (q, S, R) →⟨⟩ (q′, s-down(q, S), s-down(q, R)) in Δ^e;
- if q @ p → q′ in Δ and q′ ∉ S ∪ R, then (q, S, R) @ (p, s-down(q, S), s-down(q, R)) → (q′, S, R) in Δ^e;
- if q →a q′ in Δ and q′ ∈ S, then (q, S, R) →a sel in Δ^e;
- if ⟨⟩ → q′ in Δ, (q, S, R) ∈ Q^e, and q′ ∈ s-down(q, S), then (q, S, R) →⟨⟩ sel in Δ^e;
- if q @ p → q′ in Δ and q′ ∈ S, then (q, S, R) @ (p, s-down(q, S), s-down(q, R)) → sel in Δ^e;
- if a ∈ Σ, then sel →a sel in Δ^e;
- if μ ∈ Q^e, then sel @ μ → sel in Δ^e and μ @ sel → sel in Δ^e;
- sel →⟨⟩ sel in Δ^e.

Figure A2. The earliest transition rules Δ^e inferred from the transition rules of some SHA.
Appendix A.3. Soundness and Completeness
We next provide soundness and completeness results for our earliest in-memory membership tester with projection.

Proposition A1 (Soundness). For any SHA A with schema S = L(A[F/F_S]) and hedge h ∈ S, if some partial run v of A^e(S) on h ends in state sel, then h ∈ L(A).

So, if the partial run reaches state sel, then proj_Σ(v) is certain for membership to L(A) with respect to S. Evaluating the SHA↓ A^e(S) yields an earliest in-memory membership tester for the dSHA A. A single adaptation is needed: whenever sel is reached, the evaluation can stop altogether and accept the input hedge.
Theorem A1 (Completeness). Let A = (Σ, Q, Δ, I, F) be a dSHA that is schema-complete for the schema S = L(A[F/F_S]). For any hedge h ∈ S and prefix v of nw(h):
- If v is certain for membership in L(A) with S, then there exists a partial run w of A^e(S) on h, such that v = proj_Σ(w) and w ends with state sel.
- If v is certain for nonmembership in L(A) with S, then there exists a blocking partial run w of A^e(S) on h, such that proj_Σ(w) is a prefix of v.

This means that certainty for membership and nonmembership is detected at the earliest prefix when running the evaluator for A^e(S).
Proof. We notice that this algorithm is basically the same as the earliest membership tester from Proposition 6 of [21], except that it also checks for safe rejection and that it accounts for schemas. The fact that NWAs are used there instead of SHA↓s here is not essential. So, we can lift the soundness and completeness proof from there without problems.
Appendix B. Proofs for Section 6 (Safe-No-Change Projection)
Proof. We have to prove that it is sound to no longer change states q ∈ P ∪ no-change. If q ∈ no-change, this follows from the schema-completeness of Δ, so that one can neither block on any hedge from the schema nor change the state. In the case q ∈ P, the intuition is that the state one level above, say r, can neither change, since P = s-no-change(r), nor can the automaton block on any hedge from the schema due to schema-completeness.

We first prove the inclusion L(A) ⊆ L(A^snc). Since L(A) ⊆ S by definition of schemas, this implies L(A) ⊆ L(A^snc) ∩ S. The proof will be based on the following three Claims A1.1a, A1.2a, and A1.3a. Note that schema-completeness is not needed for this direction.
Claim A1.1a. Π →h Π wrt Δ^snc for all hedges h ∈ H_Σ.

The proof is straightforward by induction on the structure of h. It uses the last three transition rules of Δ^snc in Figure 10, permitting to always stay in Π for whatever hedge follows.

Claim A1.2a. For all h ∈ H_Σ, q ∈ Q, and P ⊆ Q, such that q ∈ P ∪ no-change:

(q, P) →h (q, P) wrt Δ^snc.

We prove Claim A1.2a by induction on the structure of h.

Case h = ⟨h′⟩. In this case, we can use Claim A1.1a to show Π →h′ Π wrt Δ^snc and the inference rules: if q ∈ P ∪ no-change, then (q, P) →⟨⟩ Π in Δ^snc and (q, P) @ Π → (q, P) in Δ^snc. These close the diagram (q, P) →⟨⟩ Π →h′ Π with (q, P) @ Π → (q, P) with respect to Δ^snc. This proves (q, P) →⟨h′⟩ (q, P) wrt Δ^snc, as required by the claim.

Case h = a. Since q ∈ P ∪ no-change, we can apply the inference rule: if a ∈ Σ and q ∈ P ∪ no-change, then (q, P) →a (q, P) in Δ^snc. This proves this case of the claim.

Case h = ε. We trivially have (q, P) →ε (q, P) wrt Δ^snc.

Case h = h′·h″. By the induction hypothesis applied to h′ and h″, we have (q, P) →h′ (q, P) and (q, P) →h″ (q, P) wrt Δ^snc. Hence, (q, P) →h′·h″ (q, P) wrt Δ^snc.
This ends the proof of Claim A1.2a. The next claim, in which the induction step is a little more tedious to prove, is the key of the soundness proof. We define the following predicate for all q′, q″ ∈ Q and P ⊆ Q:

q′ ∼P q″ iff (q′ = q″ ∨ q′, q″ ∈ P).

Claim A1.3a. Let h ∈ H_Σ be a hedge, q, q′ ∈ Q states, and P ⊆ Q a subset of states, such that acc(P) ⊆ P and q ∉ P ∪ no-change. If q →h q′ wrt Δ, then there exists q″, such that (q, P) →h (q″, P) wrt Δ^snc and q′ ∼P q″.
Proof. Induction on the structure of h.

Case h = ⟨h′⟩. The assumption q →⟨h′⟩ q′ wrt Δ shows that there exist states p0 ∈ ⟨⟩Δ and p ∈ Q closing the diagram q →⟨⟩ p0 →h′ p with q @ p → q′ wrt Δ. Let P′ = s-no-change(q) and note that acc(P′) ⊆ P′. Since q ∉ P ∪ no-change, we can infer the following rules:

- if ⟨⟩ → p0 in Δ and q ∉ P ∪ no-change, then (q, P) →⟨⟩ (p0, P′) in Δ^snc;
- if q @ p → q′ in Δ and q ∉ P ∪ no-change, then (q, P) @ (p, P′) → (q′, P) in Δ^snc.

Subcase p0 ∉ P′ ∪ no-change. The induction hypothesis applied to h′ shows that there exists p′, such that (p0, P′) →h′ (p′, P′) wrt Δ^snc and p ∼P′ p′. We distinguish the two cases justifying the latter predicate:

Subsubcase p = p′. Hence, (p0, P′) →h′ (p, P′) wrt Δ^snc, so we can close the diagram (q, P) →⟨⟩ (p0, P′) →h′ (p, P′) with (q, P) @ (p, P′) → (q′, P). This proves (q, P) →⟨h′⟩ (q′, P) wrt Δ^snc, and thus, the claim, since q′ ∼P q′.

Subsubcase p, p′ ∈ P′. Since p′ ∈ P′ and P′ = s-no-change(q), we have q @ p′ → q in Δ. Hence, we can close the diagram (q, P) →⟨⟩ (p0, P′) →h′ (p′, P′) with (q, P) @ (p′, P′) → (q, P). Since p ∈ P′ and q @ p → q′ in Δ, we have q′ = q by definition of P′ = s-no-change(q). This shows that (q, P) →⟨h′⟩ (q′, P) wrt Δ^snc. Since q′ ∼P q′, the claim follows.

Subcase p0 ∈ P′ ∪ no-change. Claim A1.2a then shows that (p0, P′) →h′ (p0, P′) wrt Δ^snc.

Subsubcase p0 ∈ P′. Since p ∈ acc(p0) and acc(P′) ⊆ P′, it follows that p ∈ P′, too. By definition of P′ = s-no-change(q) and the completeness of Δ, the memberships p0 ∈ P′ and p ∈ P′ imply that q @ p0 = {q} = q @ p. We can now close the diagram (q, P) →⟨⟩ (p0, P′) →h′ (p0, P′) with (q, P) @ (p0, P′) → (q, P).

Subsubcase p0 ∈ no-change. In this case, p0 = p, so that q′ ∈ q @ p0. Hence, we can close the diagram (q, P) →⟨⟩ (p0, P′) →h′ (p, P′) with (q, P) @ (p, P′) → (q′, P).
Case h = a. Since q ∉ P ∪ no-change, we can apply the inference rule: if q →a q′ in Δ and q ∉ P ∪ no-change, then (q, P) →a (q′, P) in Δ^snc. This shows that (q, P) →a (q′, P) validates the claim, since q′ ∼P q′.

Case h = ε. In this case, we have q = q′ and (q, P) →ε (q, P), so the claim holds.

Case h = h1·h2. Since q →h q′ wrt Δ, there exists q1 ∈ Q, such that q →h1 q1 wrt Δ and q1 →h2 q′ wrt Δ. Since q ∉ P ∪ no-change, we apply the induction hypothesis to h1. This implies that there exists q1′, such that (q, P) →h1 (q1′, P) wrt Δ^snc and q1 ∼P q1′. We distinguish the two cases of q1 ∼P q1′:

Subcase q1 = q1′. We also distinguish two subcases here:

Subsubcase q1 ∉ P. The induction hypothesis applied to h2 yields the existence of q″, such that (q1, P) →h2 (q″, P) wrt Δ^snc and q′ ∼P q″. Hence, there exists q″, such that (q, P) →h (q″, P) wrt Δ^snc and q′ ∼P q″.

Subsubcase q1 ∈ P. By Claim A1.2a, we have (q1, P) →h2 (q1, P). We also have q′ ∈ acc({q1}), and since we assume acc(P) ⊆ P, this implies q′ ∈ P. Hence, (q, P) →h (q1, P) and q′, q1 ∈ P, implying the claim, since q′ ∼P q1.

Subcase q1, q1′ ∈ P. Since q1′ ∈ P, Claim A1.2a implies (q1′, P) →h2 (q1′, P) wrt Δ^snc. Thus, (q, P) →h (q1′, P) wrt Δ^snc. Since q′ ∈ acc({q1}) and q1 ∈ P, it follows that q′ ∈ acc(P) ⊆ P. Here, we used, as in the previous subsubcase, that acc(P) ⊆ P is assumed by the claim. Let q″ = q1′. Then, we have (q, P) →h (q″, P) wrt Δ^snc and q′, q″ ∈ P, showing the claim.

This ends the proof of Claim A1.3a.
Proof of Inclusion L(A) ⊆ L(A^snc). Let h ∈ L(A). Then, there exist q0 ∈ I and q ∈ F, such that q0 →h q. We distinguish two cases:

Case q0 ∈ no-change. By definition of no-change and since q ∈ acc(q0), we have q0 = q. Claim A1.2a shows that (q0, ∅) →h (q0, ∅) wrt Δ^snc, and thus, (q0, ∅) →h (q, ∅), so that h ∈ L(A^snc).

Case q0 ∉ no-change. Claim A1.3a with P = ∅ shows that (q0, ∅) →h (q, ∅) wrt Δ^snc, and hence, h ∈ L(A^snc).

This ends the proof of the first inclusion.
We next want to show the inverse inclusion L(A^snc) ∩ S ⊆ L(A). It will eventually follow from the following three Claims A1.1b, A1.2b, and A1.3b.

Claim A1.1b. For any hedge h and state μ ∈ Q^snc, if Π →h μ wrt Δ^snc, then μ = Π.

The proof is straightforward by induction on the structure of h: the only transition rules of Δ^snc with Π on the left-hand side are inferred by the last three rules in Figure 10. These require staying in Π whatever hedge follows.

Claim A1.2b. For any hedge h, set P ⊆ Q, state q ∈ P ∪ no-change, and state μ ∈ Q^snc: if (q, P) →h μ wrt Δ^snc, then μ = (q, P).

Proof. By induction on the structure of h. Suppose that (q, P) →h μ wrt Δ^snc:

Case h = ⟨h′⟩. There must exist states μ1, μ1′ ∈ Q^snc closing the diagram (q, P) →⟨⟩ μ1 →h′ μ1′ with (q, P) @ μ1′ → μ. Since q ∈ P ∪ no-change, the following rule must have been applied to infer (q, P) →⟨⟩ μ1 wrt Δ^snc: if q ∈ P ∪ no-change, then (q, P) →⟨⟩ Π in Δ^snc. Therefore, μ1 = Π. Claim A1.1b shows that μ1′ = Π, too. So, μ must have been inferred by applying the rule: if q ∈ P ∪ no-change, then (q, P) @ Π → (q, P) in Δ^snc. So, μ = (q, P), as required.

Case h = a. The following rule must have been applied: if q ∈ P ∪ no-change, then (q, P) →a (q, P) in Δ^snc. Hence, μ = (q, P).

Case h = ε. Obvious.

Case h = h1·h2. There must exist μ1, such that (q, P) →h1 μ1 →h2 μ wrt Δ^snc. By the induction hypothesis applied to h1, we have μ1 = (q, P). We can thus apply the induction hypothesis to h2 to obtain μ = (q, P).
This ends the proof of Claim A1.2b. We next need an inverse of Claim A1.3a.

Claim A1.3b. Let q ∈ Q and P ⊆ Q, such that q ∉ P ∪ no-change and acc(P) ⊆ P. For any h ∈ H_Σ, such that (q, P) →h μ wrt Δ^snc for some μ ∈ Q^snc, and such that Δ does not have any blocking partial run on h starting from q, there exist q′ and q″, such that

μ = (q′, P), q →h q″ wrt Δ, and q′ ∼P q″.
Proof. By induction on the structure of h ∈ H_Σ, we distinguish cases for all possible forms of h.

Case h = ⟨h′⟩. By definition of (q, P) →h μ wrt Δ^snc, there must exist μ1, μ1′ ∈ Q^snc closing the diagram (q, P) →⟨⟩ μ1 →h′ μ1′ with (q, P) @ μ1′ → μ. Since q ∉ P ∪ no-change, the following rule got applied to infer (q, P) →⟨⟩ μ1, where P′ = s-no-change(q): if ⟨⟩ → p in Δ and q ∉ P ∪ no-change, then (q, P) →⟨⟩ (p, P′) in Δ^snc. Hence, there exists p ∈ ⟨⟩Δ, such that μ1 = (p, P′). We fix such a state p arbitrarily. Since Δ does not have any blocking runs on h from q, there exist p′, q″ ∈ Q, such that p →h′ p′ and q @ p′ → q″ wrt Δ. Furthermore, Δ does not have any blocking partial run on h′ starting from p.

Subcase p ∈ P′. In this case, we can apply Claim A1.2b to (p, P′) →h′ μ1′ wrt Δ^snc in order to show that μ1′ = (p, P′). Since p′ ∈ acc({p}) and p ∈ P′ = s-no-change(q), it follows that q @ p′ = {q}. Hence, the following rule got applied to infer (q, P) →h μ: if q @ p → q′ in Δ and q ∉ P ∪ no-change, then (q, P) @ (p, P′) → (q′, P) in Δ^snc. This shows that μ = (q, P). Let q′ = q, so that μ = (q′, P). Since p →h′ p′ wrt Δ, we have p′ ∈ acc({p}), so that q @ p′ → q wrt Δ by definition of s-no-change(q). Hence, we can close the diagram q →⟨⟩ p →h′ p′ with q @ p′ → q. Let q″ = q, so that q′ = q″. It then holds that q′ ∼P q″ and q →h q″ wrt Δ, as required by the claim.

Subcase p ∈ no-change. In this case, p →h′ p′ wrt Δ implies that p = p′. Hence, we can close the diagram q →⟨⟩ p →h′ p′ with q @ p′ → q″. This shows that q →h q″ wrt Δ.

Subcase p ∉ P′ ∪ no-change. By the induction hypothesis applied to (p, P′) →h′ μ1′ wrt Δ^snc, there exists p″, such that

μ1′ = (p′, P′), p →h′ p″ wrt Δ, and p′ ∼P′ p″.

Since Δ does not have blocking runs on h starting with q, there exists q″, such that q @ p″ → q″ wrt Δ. There are two ways to satisfy p′ ∼P′ p″:

Subsubcase p′ = p″. We then have the diagram q →⟨⟩ p →h′ p′ with q @ p′ → q″. This shows q →h q″ wrt Δ.

Subsubcase p′, p″ ∈ P′. By definition of P′ = s-no-change(q), it follows that q @ p′ and q @ p″ both yield q. Hence, the diagram q →⟨⟩ p →h′ p″ with q @ p″ → q closes. This shows q →h q wrt Δ.

Case h = a. Since q ∉ P ∪ no-change, the following inference rule must be used: if q →a q′ in Δ and q ∉ P ∪ no-change, then (q, P) →a (q′, P) in Δ^snc. So, μ = (q′, P) and q →a q′ wrt Δ.

Case h = ε. Obvious.

Case h = h1·h2. The judgment (q, P) →h μ wrt Δ^snc shows that there exists μ1, such that (q, P) →h1 μ1 →h2 μ wrt Δ^snc. Since q ∉ P ∪ no-change, we can apply the induction hypothesis to h1. It shows that there exist q1′ and q1″, such that μ1 = (q1′, P), q →h1 q1″ wrt Δ, and q1′ ∼P q1″.

Subcase q1′ = q1″. Hence, (q1′, P) →h2 μ wrt Δ^snc.

Subsubcase q1′ ∉ P ∪ no-change. In this case, we can apply the induction hypothesis to (q1′, P) →h2 μ wrt Δ^snc, showing the existence of q′, such that μ = (q′, P), and of a state q″, such that q1′ →h2 q″ and q′ ∼P q″. Hence, q →h q″ wrt Δ and q′ ∼P q″, so the claim holds.

Subsubcase q1′ ∈ P ∪ no-change. Claim A1.2b applied to (q1′, P) →h2 μ wrt Δ^snc shows that μ = (q1′, P).

Subcase q1′, q1″ ∈ P. Recall that (q1′, P) →h2 μ wrt Δ^snc and q1′ ∈ P. Claim A1.2b shows that μ = (q1′, P). We also have q →h1 q1″ wrt Δ. Since there are no blocking partial runs on h starting from q, there exists a state q″, such that q1″ →h2 q″ wrt Δ. Since q1″ ∈ P and P is closed under accessibility, we have q″ ∈ acc({q1″}) ⊆ acc(P) ⊆ P. From q →h1 q1″ wrt Δ, we get q →h q″ wrt Δ. Since q1′, q″ ∈ P, it follows that q1′ ∼P q″, and thus, the claim holds.

This ends the proof of Claim A1.3b.
Proof of Inclusion L(A^snc) ∩ S ⊆ L(A). Let h ∈ L(A^snc) ∩ S. Since h ∈ L(A^snc), there exist q0 ∈ I and q ∈ F, such that (q0, ∅) →h (q, ∅) wrt Δ^snc. We distinguish two cases:

Case q0 ∈ no-change. Claim A1.2b shows that q = q0. Since A is schema-complete for S, h ∈ S, and q0 ∈ I ∩ no-change, Lemma 7 shows that q0 →h q0 wrt Δ. Since q = q0, this yields h ∈ L(A).

Case q0 ∉ no-change. Since A is schema-complete for S and h ∈ S, there exist no blocking runs on h that start in q0. Therefore, we can apply Claim A1.3b with P = ∅ to (q0, ∅) →h (q, ∅) wrt Δ^snc. This shows that q0 →h q wrt Δ, and hence, h ∈ L(A).

This ends the proof of the inverse inclusion, and thus, of L(A) = L(A^snc).
References
1. Marian, A.; Siméon, J. Projecting XML Documents. In Proceedings of the VLDB; Morgan Kaufmann: San Francisco, CA, USA, 2003; pp. 213–224.
2. Frisch, A. Regular Tree Language Recognition with Static Information. In Proceedings of the Exploring New Frontiers of Theoretical Informatics, IFIP 18th World Computer Congress, TCS 3rd International Conference on Theoretical Computer Science, Toulouse, France, 22–27 August 2004; pp. 661–674.
3. Maneth, S.; Nguyen, K. XPath Whole Query Optimization. Proc. VLDB Endow. 2010, 3, 882–893. [CrossRef]
4. Sebastian, T.; Niehren, J. Projection for Nested Word Automata Speeds up XPath Evaluation on XML Streams. In Proceedings of the International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), Harrachov, Czech Republic, 23–28 January 2016.
5. Kay, M. The Saxon XSLT and XQuery Processor. 2004. Available online: https://www.saxonica.com (accessed on 22 July 2024).
6. Gienieczko, M.; Murlak, F.; Paperman, C. Supporting Descendants in SIMD-Accelerated JSONPath. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, La Jolla, CA, USA, 27 April–1 May 2024; Volume 1, pp. 15–29.
7. Al Serhali, A.; Niehren, J. A Benchmark Collection of Deterministic Automata for XPath Queries. In Proceedings of XML Prague 2022, Prague, Czech Republic, 9–11 June 2022.
8. Niehren, J.; Sakho, M. Determinization and Minimization of Automata for Nested Words Revisited. Algorithms 2021, 14, 68. [CrossRef]
9. Comon, H.; Dauchet, M.; Gilleron, R.; Löding, C.; Jacquemard, F.; Lugiez, D.; Tison, S.; Tommasi, M. Tree Automata Techniques and Applications. Available online: http://tata.gforge.inria.fr (accessed on 22 July 2024).
10. Thatcher, J.W. Characterizing derivation trees of context-free grammars through a generalization of automata theory. J. Comput. Syst. Sci. 1967, 1, 317–322. [CrossRef]
11. Alur, R. Marrying Words and Trees. In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Beijing, China, 11–14 June 2007; ACM Press: New York, NY, USA, 2007; pp. 233–242.
12. Alur, R.; Madhusudan, P. Visibly pushdown languages. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–16 June 2004; Babai, L., Ed.; ACM: New York, NY, USA, 2004; pp. 202–211. [CrossRef]
13. Mehlhorn, K. Pebbling Mountain Ranges and its Application to DCFL-Recognition. In Proceedings of Automata, Languages and Programming, 7th Colloquium, Noordweijkerhout, The Netherlands, 14–18 July 1980; de Bakker, J.W., van Leeuwen, J., Eds.; Springer: Berlin/Heidelberg, Germany, 1980; Volume 85, pp. 422–435. [CrossRef]
14. von Braunmühl, B.; Verbeek, R. Input Driven Languages are Recognized in log n Space. In Topics in the Theory of Computation; North-Holland Mathematics Studies; Karpinski, M., van Leeuwen, J., Eds.; North-Holland: Amsterdam, The Netherlands, 1985; Volume 102, pp. 1–19. [CrossRef]
15. Okhotin, A.; Salomaa, K. Complexity of input-driven pushdown automata. SIGACT News 2014, 45, 47–67. [CrossRef]
16. Niehren, J.; Sakho, M.; Al Serhali, A. Schema-Based Automata Determinization. In Proceedings of the 13th International Symposium on Games, Automata, Logics and Formal Verification, GandALF 2022, Madrid, Spain, 21–23 September 2022; Ganty, P., Monica, D.D., Eds.; EPTCS: The Hague, The Netherlands, 2022; Volume 370, pp. 49–65. [CrossRef]
17. Lick, A.; Schmitz, S. XPath Benchmark. Available online: https://archive.softwareheritage.org/browse/directory/1ea68cf5bb3f9f3f2fe8c7995f1802ebadf17fb5 (accessed on 22 July 2024).
18. Mozafari, B.; Zeng, K.; Zaniolo, C. High-performance complex event processing over XML streams. In Proceedings of the SIGMOD Conference; Candan, K.S., Chen, Y., Snodgrass, R.T., Gravano, L., Fuxman, A., Eds.; ACM: New York, NY, USA, 2012; pp. 253–264. [CrossRef]
19. Grez, A.; Riveros, C.; Ugarte, M. A Formal Framework for Complex Event Processing. In Proceedings of the 22nd International Conference on Database Theory (ICDT 2019), Lisbon, Portugal, 26–28 March 2019; pp. 5:1–5:18. [CrossRef]
20. Muñoz, M.; Riveros, C. Streaming Enumeration on Nested Documents. In Proceedings of the 25th International Conference on Database Theory, ICDT 2022, Edinburgh, UK, 29 March–1 April 2022; Olteanu, D., Vortmeier, N., Eds.; Schloss Dagstuhl—Leibniz-Zentrum für Informatik: Wadern, Germany, 2022; Volume 220, pp. 19:1–19:18. [CrossRef]
21. Al Serhali, A.; Niehren, J. Earliest Query Answering for Deterministic Stepwise Hedge Automata. In Proceedings of Implementation and Application of Automata—27th International Conference, CIAA 2023, Famagusta, North Cyprus, 19–22 September 2023; Nagy, B., Ed.; Springer: Berlin/Heidelberg, Germany, 2023; Volume 14151, pp. 53–65. [CrossRef]
22. Gauwin, O.; Niehren, J.; Tison, S. Earliest Query Answering for Deterministic Nested Word Automata. In Proceedings of the 17th International Symposium on Fundamentals of Computer Theory, Wroclaw, Poland, 2–4 September 2009; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5699, pp. 121–132.
23. Debarbieux, D.; Gauwin, O.; Niehren, J.; Sebastian, T.; Zergaoui, M. Early nested word automata for XPath query answering on XML streams. Theor. Comput. Sci. 2015, 578, 100–125. [CrossRef]
24. Al Serhali, A.; Niehren, J. Subhedge Projection for Stepwise Hedge Automata. In Proceedings of Fundamentals of Computation Theory—24th International Symposium, FCT 2023, Trier, Germany, 18–21 September 2023; Fernau, H., Jansen, K., Eds.; Springer: Berlin/Heidelberg, Germany, 2023; Volume 14292, pp. 16–31. [CrossRef]
25. Neumann, A.; Seidl, H. Locating Matches of Tree Patterns in Forests. In Proceedings of Foundations of Software Technology and Theoretical Computer Science, Chennai, India, 17–19 December 1998; Springer: Berlin/Heidelberg, Germany, 1998; Volume 1530, pp. 134–145.
26. Gauwin, O.; Niehren, J.; Roos, Y. Streaming Tree Automata. Inf. Process. Lett. 2008, 109, 13–17. [CrossRef]
27. Franceschet, M. XPathMark Performance Test. Available online: https://users.dimi.uniud.it/~massimo.franceschet/xpathmark/PTbench.html (accessed on 25 October 2020).
28. Hosoya, H.; Pierce, B.C. XDuce: A Statically Typed XML Processing Language. ACM Trans. Internet Technol. 2003, 3, 117–148. [CrossRef]
29. Gécseg, F.; Steinby, M. Tree Automata; Akadémiai Kiadó: Budapest, Hungary, 1984.
30. Gold, E.M. Language Identification in the Limit. Inf. Control 1967, 10, 447–474. [CrossRef]
31. Carme, J.; Niehren, J.; Tommasi, M. Querying Unranked Trees with Stepwise Tree Automata. In Proceedings of Rewriting Techniques and Applications, 15th International Conference, RTA 2004, Aachen, Germany, 3–5 June 2004; van Oostrom, V., Ed.; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3091, pp. 105–118. [CrossRef]
32. Pair, C.; Quéré, A. Définition et étude des bilangages réguliers. Inf. Control 1968, 13, 565–593. [CrossRef]
33. Murata, M. Hedge Automata: A Formal Model for XML Schemata. 2000. Available online: http://www.xml.gr.jp/relax/hedge_nice.html (accessed on 22 July 2024).
34. Martens, W.; Niehren, J. On the Minimization of XML Schemas and Tree Automata for Unranked Trees. J. Comput. Syst. Sci. 2007, 73, 550–583. [CrossRef]
35. Niehren, J. Stepwise Hedge Automata are exponentially more succinct than Forest Automata. 2024, in preparation.
36. Bojanczyk, M.; Walukiewicz, I. Forest algebras. In Proceedings of Logic and Automata: History and Perspectives [in Honor of Wolfgang Thomas]; Flum, J., Grädel, E., Wilke, T., Eds.; Amsterdam University Press: Amsterdam, The Netherlands, 2008; Volume 2, pp. 107–132.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.