# Learning n-ary Node Selecting Tree Transducers from Completely Annotated Examples

**ABSTRACT** We present the first algorithm for learning n-ary node selection queries in trees from completely annotated examples by methods of grammatical inference. We propose to represent n-ary queries by deterministic n-ary node selecting tree transducers (NSTTs), which are known to capture the class of MSO-definable n-ary queries. Despite this high expressiveness, we show that n-ary queries that select a polynomially bounded number of tuples per tree and are represented by deterministic NSTTs can be learned from polynomial time and data, while allowing for efficient enumeration of query answers. An application to wrapper induction in Web information extraction yields encouraging results.



Learning n-ary Node Selecting Tree Transducers from Completely Annotated Examples

A. Lemay¹, J. Niehren², and R. Gilleron¹

Mostrare project of INRIA Futurs, LIFL, Lille, France
¹ University of Lille 3
² INRIA Futurs

Abstract. We present the first algorithm for learning n-ary node selection queries in trees from completely annotated examples by methods of grammatical inference. We propose to represent n-ary queries by deterministic n-ary node selecting tree transducers (n-NSTTs). These are tree automata that capture the class of monadic second-order definable n-ary queries. We show that polynomially bounded n-ary queries defined by n-NSTTs can be learned from polynomial time and data. An application in Web information extraction yields encouraging results.

1 Introduction

The problem of selecting nodes in trees is the most basic and fundamental querying problem in the context of XML [8,14,12]. In this paper, we propose a new machine learning algorithm based on grammatical inference for learning n-ary node selection queries. We will illustrate its interest in an application to wrapper induction for Web information extraction [10,4,13,18].

We consider finite rooted directed sibling-ordered unranked trees t ∈ TΣ with nodes labeled in a fixed signature Σ. An n-ary query in such trees [14,9,15] is a function q that maps trees t ∈ TΣ to sets of n-tuples of nodes q(t) ⊆ nodes(t)ⁿ. Boolean queries are 0-ary queries and can be identified with tree languages¹. Monadic queries where n = 1 select nodes in trees. Binary queries where n = 2 select pairs of nodes in trees, and so on. The most natural way to represent n-ary queries is monadic second-order logic (MSO), i.e. by MSO formulas with n free variables. MSO-defined queries are regular, i.e. definable by tree automata over Σ × Boolⁿ, and vice versa. This follows from Thatcher and Wright's theorem in the case of ranked trees [19] and carries over to unranked trees.

We investigate learning algorithms for MSO-definable n-ary queries. The input is a set of completely annotated examples for the target query q. These are pairs (t, q(t)) for some tree t ∈ TΣ. Completely annotated examples contain positive information on all tuples in q(t), and negative information on all others. In the Boolean case, they coincide with the positive and negative examples for tree languages, i.e. whether a tree belongs to the language or not.

¹ This is well-known in database theory. A tree t belongs to the language defined by a Boolean query q if and only if the empty 0-tuple () belongs to q(t).


All learnability results depend on how n-ary queries are represented. The following properties are desirable in general, and in particular for applications to Web information extraction.

Learnability. For all n-ary queries q, a representative can be learned from polynomial time and data in the form of completely annotated examples.
Expressiveness. All n-ary MSO-definable queries can be represented.
Efficiency. Given a representation of an n-ary query q and a tree t, the set q(t) can be enumerated efficiently.

For n = 0, all three conditions can be satisfied when representing tree languages by bottom-up deterministic tree automata. Completely annotated examples then coincide with positive and negative examples. Learning algorithms for deterministic tree automata from positive and negative examples (RPNI) have been studied in [5].

For n = 1, these properties have been shown recently [1,2] when representing monadic queries by deterministic node selecting tree transducers (NSTTs). These are functional tree automata over Σ × Bool, which define relabeling functions from trees over Σ to trees over Bool. Selected nodes are relabeled to true, all others to false. A learning algorithm from polynomial time and data can be obtained by adapting RPNI to deterministic NSTTs while taking functionality into account, for the treatment of negative information. MSO completeness for deterministic NSTTs can still be inferred from Thatcher and Wright's theorem, despite the restriction to functionality. Efficient query answering is possible in linear time by a two-phase algorithm.

For n > 1, the question is still open whether there exists a representation formalism for n-ary queries that satisfies the above three properties. A number of principal problems arise. The most disturbing fact is that functional tree automata over Σ × Boolⁿ are not sufficiently expressive for n > 1. They can only define finite unions of Cartesian closed n-ary queries, as shown in [15]. These are clearly insufficient in theory and practice.

Furthermore, the number of n-tuples in q(t) ⊆ nodes(t)ⁿ may become exponential for unbounded n, so that efficient enumeration becomes an issue for n > 1. Completely annotated examples for q may thus become huge. This should not happen in the practice of information extraction. In theory, we will restrict ourselves to queries where the number of answers is polynomially bounded in the size of the tree. Our learning algorithms will have to use compact representations for huge sets of negative examples, i.e., complements nodes(t)ⁿ − q(t).

In this article, we propose to represent n-ary queries in Σ-trees by deterministic tree automata over Σ × Boolⁿ that recognize canonical languages, where every accepted tree corresponds to precisely one n-tuple. We call tree automata with canonical languages n-ary node selecting tree transducers (n-NSTTs).

long as all free variables are first-order. However, most NSTTs are not 1-NSTTs

and vice versa. Despite of this, both classes of automata have the same ex-

pressiveness – they can both represent all monadic MSO definable queries, but

Page 3

T

Himmel ¨ uber Berlin

F

P

Wenders

L

T

Vertigo

F

P

Hitchcock

Fig.1. The binary tree two-films with some data content.

differently. We show how to learn deterministic n-NSTTs from completely an-

notated examples. Our algorithm satisfies the learning model from polynomial

time and data, under the assumption that the number of answers to queries is

polynomially bounded in the size of the tree. The main problem is to represent

the possibly exponential amount of negative information contained in a set of

completely annotated examples in a compact manner. In the monadic case, this

could be solved by the functionality requirement on NSTTs which is no more

available for n-NSTTs. We also show that answers of n-ary queries represented

by deterministic n-NSTTs can be enumerated efficiently.

We have implemented our algorithm and started applying it to Web information extraction. We assume highly structured Web pages generated by some database. First experiments yield encouraging results for the n-ary case, which let us hope for competitive systems in future work.

2 N-ary Node Selecting Tree Transducers

We introduce n-NSTTs for binary trees. Unranked trees will be considered in Section 6. The principal difference between 1-NSTTs as presented here and NSTTs from [1] is essential to generalize smoothly from monadic to n-ary queries.

Let ℕ = {1, 2, ...} be the natural numbers without 0 and Bool = {0, 1} the Booleans. We denote the cardinality of a set A by |A|. Given a finite set Σ of node labels, a finite directed sibling-ordered binary tree t ∈ TΣ is either a label a ∈ Σ or a triple a(t1, t2) consisting of a label a ∈ Σ and two binary trees t1, t2 ∈ TΣ.

Fig. 1, for instance, contains the binary tree two-films = L(F(T,P), F(T,P)) where Σ = {L,F,T,P}. This tree represents a list (L) of two films (F), each having a title (T) and a producer (P). Rather than putting data content into tree labels, we assume an external mapping from nodes to data values. Note that nodes may carry the same label while containing different data. For instance, both films have different producers and titles. This works as long as we carefully distinguish different nodes with the same label.

We identify nodes of trees with their relative address from the root. The node 2·1, for instance, is the first child of the second child of the root. In the example in Fig. 1, this is the T node containing Vertigo. We write t(v) for the label of some v ∈ nodes(t), for instance: two-films(2·1) = T. We denote by nodes(t) ⊆ ℕ* the set of nodes of a tree t. We say that two trees have the same shape if they have the same sets of nodes. We write size(t) for |nodes(t)|.


Definition 1. An n-ary query in trees over Σ is a function q from trees t ∈ TΣ to sets of n-tuples of nodes q(t) ⊆ nodes(t)ⁿ.

Let the binary query title-producer-pairs ask for all pairs of titles and producers in trees that encode lists of films. From the tree two-films, this query selects the following node pairs: title-producer-pairs(two-films) = {(1·1, 1·2), (2·1, 2·2)}.
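To make this concrete, the tree two-films and the query above can be sketched in Python. This is an illustration, not the paper's implementation: trees are encoded as nested tuples ((a,) for a leaf, (a, t1, t2) for an inner node), node addresses as tuples of child indices (so 2·1 becomes (2, 1)), and the selection rule used by title_producer_pairs (children of F nodes) is our own assumption.

```python
def leaf(a):
    """A leaf is just its label."""
    return (a,)

def node(a, t1, t2):
    """An inner node is a label with two subtrees."""
    return (a, t1, t2)

def nodes(t, prefix=()):
    """Enumerate all node addresses of t, root first."""
    yield prefix
    if len(t) == 3:  # inner node a(t1, t2)
        yield from nodes(t[1], prefix + (1,))
        yield from nodes(t[2], prefix + (2,))

def label(t, v):
    """The label t(v) at address v."""
    for i in v:
        t = t[i]
    return t[0]

# The tree two-films = L(F(T,P), F(T,P)) from Fig. 1.
two_films = node("L", node("F", leaf("T"), leaf("P")),
                      node("F", leaf("T"), leaf("P")))

def title_producer_pairs(t):
    """The binary query selecting all (title, producer) address pairs;
    here implemented as 'the two children of every F node'."""
    result = set()
    for v in nodes(t):
        if label(t, v) == "F":
            result.add((v + (1,), v + (2,)))
    return result
```

On two_films this returns the two pairs {((1,1), (1,2)), ((2,1), (2,2))}, matching the answer set given in the text.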

The usual idea of how to represent n-ary queries by tree automata stems from early work on MSO [19]. It consists in identifying n-ary queries over Σ with tree languages over Σ × Boolⁿ. These can then be recognized by a tree automaton. There are several possibilities in doing so, which differ in how many n-tuples may be encoded by Boolean annotations of the same tree. For n = 2, for instance, consider L₀₀(F₀₀(T₁₀,P₀₁), F₀₀(T₁₀,P₀₁)). This tree is annotated by pairs of Booleans that represent 4 pairs of nodes: {(1·1,1·2), (2·1,2·2), (1·1,2·2), (2·1,1·2)}. The third and fourth pair may be unwanted since they mix up the titles and producers. We cannot, however, annotate only the first two pairs in the same copy of the tree two-films; we need two independent copies: L₀₀(F₀₀(T₁₀,P₀₁), F₀₀(T₀₀,P₀₀)) and L₀₀(F₀₀(T₀₀,P₀₀), F₀₀(T₁₀,P₀₁)). This contrasts strongly with the monadic case, where one can always annotate all tuples in q(t) in a unique copy of t. Such compact annotations lead to functional tree languages, as recognized by the NSTTs in [1].

In the n-ary case, however, several copies of t need to be annotated, one for each of its n-tuples. We call trees over Σ × Boolⁿ tuple trees if they are annotated by a single n-tuple. Every tree over Σ × Boolⁿ can be decomposed in a unique manner into two trees of the same shape, a tree t ∈ TΣ and its Boolean annotation β ∈ TBoolⁿ. We write t × β for the unique tree over Σ × Boolⁿ that can be decomposed into t and β. Given an n-tuple α and 1 ≤ i ≤ n, let Πᵢ(α) be the i-th component of α. If t × β is a tuple tree then β corresponds to a unique n-tuple β ∈ nodes(t)ⁿ such that:

∀v ∈ nodes(t) ∀1 ≤ i ≤ n : Πᵢ(β(v)) = 1 iff Πᵢ(β) = v

We call tree languages over Σ × Boolⁿ canonical if all trees contained are tuple trees. Clearly, every n-ary query q over Σ is represented by exactly one canonical language.
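The decomposition t × β and the bijection between tuple trees and n-tuples can be sketched as follows, assuming trees are represented as nested tuples ((a,) for a leaf, (a, t1, t2) for an inner node) and node addresses as tuples of child indices; annotate and decode are hypothetical helper names, not from the paper.

```python
def annotate(t, alpha, prefix=()):
    """Build t × β for the n-tuple alpha of addresses: each label a becomes
    a pair (a, b) where bit b_i = 1 iff this node is component i of alpha."""
    bits = tuple(1 if v == prefix else 0 for v in alpha)
    if len(t) == 1:  # leaf
        return ((t[0], bits),)
    return ((t[0], bits),
            annotate(t[1], alpha, prefix + (1,)),
            annotate(t[2], alpha, prefix + (2,)))

def decode(tb, n, prefix=(), alpha=None):
    """Recover the unique n-tuple encoded by a tuple tree t × β."""
    if alpha is None:
        alpha = [None] * n
    (a, bits) = tb[0]
    for i, bit in enumerate(bits):
        if bit == 1:
            alpha[i] = prefix
    if len(tb) == 3:
        decode(tb[1], n, prefix + (1,), alpha)
        decode(tb[2], n, prefix + (2,), alpha)
    return tuple(alpha)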

We will use tree automata to represent canonical languages, so we recall their definition. A tree automaton A over Σ is a triple that consists of three finite sets: states(A), final(A) ⊆ states(A), and rules(A), so that all rules are of the form a → q or a(q1,q2) → q where a ∈ Σ and q, q1, q2 ∈ states(A). A run of a tree automaton A on a tree t is a function r : nodes(t) → states(A) so that all states that r assigns to nodes of t are justified by some rule of A. A run r of A on t is successful if it maps the root of t to a final state of A, i.e. r(ε) ∈ final(A). We write succ runs_A(t) for the set of successful runs by A on t. The language L(A) is the set of all trees t ∈ TΣ that permit a successful run by A. The size measure for automata in this paper counts states and rules: size(A) = |rules(A)| + |states(A)|. We call a tree automaton trimmed if all of its states are used in some successful run.
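The definition above can be sketched as a small bottom-up evaluator, again assuming trees as nested tuples. The rules are kept as two maps, eval_states computes all states that some run assigns to the root, and t ∈ L(A) iff that set meets final(A). The example automaton (states and rules) is hypothetical, chosen only for illustration.

```python
class TreeAutomaton:
    def __init__(self, leaf_rules, node_rules, final):
        self.leaf_rules = leaf_rules  # label a -> set of states (rules a -> q)
        self.node_rules = node_rules  # (a, q1, q2) -> set of states
        self.final = final            # final(A)

    def eval_states(self, t):
        """All states q such that some run of A on t assigns q to the root."""
        if len(t) == 1:  # leaf
            return set(self.leaf_rules.get(t[0], ()))
        states = set()
        for q1 in self.eval_states(t[1]):
            for q2 in self.eval_states(t[2]):
                states |= set(self.node_rules.get((t[0], q1, q2), ()))
        return states

    def accepts(self, t):
        """t ∈ L(A) iff some successful run exists."""
        return bool(self.eval_states(t) & self.final)

# A hypothetical automaton accepting exactly the trees F(T, P).
A = TreeAutomaton(leaf_rules={"T": {"qT"}, "P": {"qP"}},
                  node_rules={("F", "qT", "qP"): {"qF"}},
                  final={"qF"})
```

Note that eval_states follows the bottom-up reading of the rules: a state is reached at an inner node only if its two children already reached matching states.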


Definition 2. An n-ary node selecting tree transducer (n-NSTT) over Σ is a tree automaton over Σ × Boolⁿ that recognizes a canonical language.

An n-NSTT A over Σ represents the n-ary query q_A in trees t ∈ TΣ such that:

q_A(t) = {β ∈ nodes(t)ⁿ | t × β ∈ L(A)}

In other words, a query q is represented by all n-NSTTs that recognize the language of all tuple trees for q. All such n-NSTTs are equivalent in that they recognize the same language. Thatcher and Wright's theorem [19] states that n-NSTTs capture the class of MSO-definable n-ary queries. Thus 1-NSTT ≠ NSTT, even though both of them capture the monadic MSO-definable queries.

3 Membership Testing to the Class of n-NSTTs

We present an efficient algorithm for testing whether a tree automaton is an n-NSTT. The results on types obtained on the way will help avoid such tests during query induction in Section 5.

An n-type b is an n-tuple of non-negative integers, that is, b ∈ (ℕ ∪ {0})ⁿ. All bit vectors in Boolⁿ are n-types. The type of a tree β ∈ TBoolⁿ is the n-type obtained by summing up all labels of nodes in β:

t(β) = Σ_{v ∈ nodes(β)} β(v)

Note that t × β is a tuple tree if and only if t(β) = (1,...,1) = 1ⁿ. Let A be a tree automaton over Σ × Boolⁿ. To every q ∈ states(A), we assign a set t(q) of n-types by the following inference rules:

(a,b) → q ∈ rules(A)  ⟹  b ∈ t(q)
(a,b)(q1,q2) → q ∈ rules(A), b1 ∈ t(q1), b2 ∈ t(q2)  ⟹  b + b1 + b2 ∈ t(q)

Lemma 1. If r is a run of A on t × β then t(β) ∈ t(r(ε)).

Lemma 2. For all q ∈ states(A) and b ∈ t(q), there exists a tree t × β over Σ × Boolⁿ and a run r of A on this tree such that q = r(ε) and t(β) = b.

Lemma 3. If A is a trimmed n-NSTT then t(q) ⊆ Boolⁿ is a singleton.

Proof. To see that t(q) ≠ ∅, note that we assume A to be trimmed. Thus there exists a tree t × β and a run r on that tree such that r(ε) = q. By Lemma 1 it follows that t(β) ∈ t(q). To see that t(q) ⊆ Boolⁿ, let b ∈ t(q). By Lemma 2 there exists a tree t × β over Σ × Boolⁿ and a run r of A on this tree such that q = r(ε) and t(β) = b. Since A is trimmed there exists a tree t̃ × β̃ ∈ L(A) that contains t × β as a subtree. Hence b = t(β) ≤ t(β̃) = 1ⁿ. It remains to show that t(q) is a singleton, so let us assume that b′ ∈ t(q) too. By Lemma 2 there exists a second tree t′ × β′ over Σ × Boolⁿ and a run r′ of A on this tree such that q = r′(ε) and t(β′) = b′. Let t̃′ × β̃′ be the tree obtained by replacing one occurrence of t × β in t̃ × β̃ by t′ × β′. Note that t̃′ × β̃′ ∈ L(A), hence t(β̃′) = 1ⁿ. Let V be the set of nodes of t̃ × β̃ that have not been affected by the substitution. Then:

1ⁿ = t(β̃) = t(β) + Σ_{v ∈ V} β̃(v)
1ⁿ = t(β̃′) = t(β′) + Σ_{v ∈ V} β̃′(v)

Since β̃(v) = β̃′(v) for all v ∈ V, t(β) = t(β′), so that b = b′.

Lemma 4. A trimmed automaton A over Σ × Boolⁿ is an n-NSTT iff t(q) = {1ⁿ} for all q ∈ final(A).

Proof. Let A be a trimmed n-NSTT and let q ∈ final(A). Since A is trimmed there exists a tree t × β and a run r on that tree such that r(ε) = q. Thus t × β ∈ L(A), so that t(β) = 1ⁿ. By Lemma 1, it follows that 1ⁿ ∈ t(q). This set is a singleton by Lemma 3, so that t(q) = {1ⁿ}. For the converse, it follows from Lemma 1 that all t × β ∈ L(A) satisfy t(β) = 1ⁿ, so that they are tuple trees.

Proposition 1. Whether a tree automaton A over Σ × Boolⁿ is an n-NSTT can be decided in polynomial time O(size(A) × n). If so, all types in {t(q) | q ∈ states(A)} can be computed in the same time.

Proof. Lemma 4 gives us a way to check whether a tree automaton is an n-NSTT. In the first step, we trim the automaton without changing its language. This requires linear time O(size(A) × n). We then compute all values t(q) by saturation with respect to the defining rules. We exit saturation immediately if it tries to add a second element to some type set, or if it tries to add a non-Boolean n-type. If this happens then we return false, which is justified by Lemma 3. Note that all positions in rules will be touched at most once and all type sets at most twice. Hence, saturation can be implemented in time O(size(A) × n). If saturation succeeds then we apply the third step; all types have been computed successfully now. We check for all q ∈ final(A) whether t(q) = {1ⁿ}. If so, we return true, otherwise false, which is licensed by Lemma 4. This can be done in time O(size(A) × n) too.
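The saturation step of Proposition 1 can be sketched as follows. We assume the automaton over Σ × Boolⁿ is already trimmed and is given by explicit rule lists — leaf rules ((a, b), q) and inner rules ((a, b), q1, q2, q), with b an n-tuple of bits — a representation of our own choosing. Types are propagated as in the inference rules of Section 3, aborting as soon as a state would receive a second type or a non-Boolean one (Lemma 3), and finally t(q) = {1ⁿ} is checked on final states (Lemma 4).

```python
def is_nstt(n, leaf_rules, node_rules, final):
    """Decide whether a trimmed automaton over Σ × Boolⁿ is an n-NSTT."""
    one = (1,) * n
    types = {}  # state -> its unique n-type, if already inferred

    def assign(q, b):
        """Record b ∈ t(q); False signals a violation of Lemma 3."""
        if any(x not in (0, 1) for x in b):
            return False              # non-Boolean n-type
        if q in types:
            return types[q] == b      # a second, different type for q
        types[q] = b
        return True

    for (a, b), q in leaf_rules:      # base case of the inference rules
        if not assign(q, tuple(b)):
            return False
    changed = True
    while changed:                    # saturate over the inner rules
        changed = False
        for (a, b), q1, q2, q in node_rules:
            if q1 in types and q2 in types:
                new = q not in types
                t = tuple(x + y + z
                          for x, y, z in zip(b, types[q1], types[q2]))
                if not assign(q, t):
                    return False
                changed = changed or new
    return all(types.get(q) == one for q in final)
```

This naive fixpoint revisits rules several times, so it does not match the O(size(A) × n) bound of the proof, but it decides the same property.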

4 Efficient Answer Enumeration

We develop an efficient algorithm for enumerating the answers of an n-NSTT-defined query on a given input tree. The insights gained will again be used in our learning algorithm.

Given an n-NSTT A and a tree t, the problem is to compute all β such that t × β ∈ L(A). The first step is to project A to a tree automaton over Σ that we denote by π(A). This automaton satisfies states(π(A)) = states(A) and final(π(A)) = final(A). Its rules are inferred by the following two schemata, where a ∈ Σ and b ∈ Boolⁿ:

(a,b)(q1,q2) → q ∈ rules(A)  ⟹  a(q1,q2) → q ∈ rules(π(A))
(a,b) → q ∈ rules(A)  ⟹  a → q ∈ rules(π(A))


Given a trimmed n-NSTT A, let t_A : states(A) → Boolⁿ be the function that maps states of A to their unique n-type according to Lemma 3. The following lemma permits typing the rules of projections of n-NSTTs.

Lemma 5. For all trimmed n-NSTTs A, labels a ∈ Σ, and q, q1, q2 ∈ states(A):

a → q ∈ rules(π(A)) iff (a, t_A(q)) → q ∈ rules(A)
a(q1,q2) → q ∈ rules(π(A)) iff (a,b)(q1,q2) → q ∈ rules(A), where b = t_A(q) − t_A(q1) − t_A(q2)

Proof. The implications from right to left are obvious from the definition of the rules of π(A). For the converse there are two cases. First, assume a → q ∈ rules(π(A)). By definition of π(A) there exists b ∈ Boolⁿ such that (a,b) → q ∈ rules(A). Lemma 1 shows that b = t_A(q). Second, assume a(q1,q2) → q ∈ rules(π(A)). By definition of π(A) there exists b ∈ Boolⁿ such that (a,b)(q1,q2) → q ∈ rules(A). Since A is trimmed, there exist trees t1 × β1 and t2 × β2 over Σ × Boolⁿ that can be evaluated by A into states q1 and q2, respectively. Thus, the tree (a,b)(t1 × β1, t2 × β2) can be evaluated to q by A. Lemma 1 shows that b + t_A(q1) + t_A(q2) = t_A(q). Hence, b = t_A(q) − t_A(q1) − t_A(q2).

For every run r of a trimmed tree automaton A over Σ × Boolⁿ on some tree, we define a function t^r_A mapping nodes v to n-types:

t^r_A(v) = t_A(r(v)) if v is a leaf, and
t^r_A(v) = t_A(r(v)) − t_A(r(v·1)) − t_A(r(v·2)) otherwise.

Note that t^r_A can be identified with the unique β ∈ TBoolⁿ such that β(v) = t^r_A(v) for all nodes v.

Lemma 6. For all trimmed n-NSTTs A, t ∈ TΣ, and r : nodes(t) → states(A):

r ∈ succ runs_{π(A)}(t) iff r ∈ succ runs_A(t × t^r_A)

Proof. Straightforward from Lemma 5 by induction on trees t.

Recall that a tree automaton is unambiguous if no tree permits more than one successful run. All deterministic tree automata are unambiguous.

Proposition 2. Let A be a trimmed unambiguous n-NSTT. For all trees t ∈ TΣ, the function mapping r ∈ succ runs_{π(A)}(t) to the Boolean annotation t^r_A ∈ TBoolⁿ is a bijection with range {β | t × β ∈ L(A)}.

Proof. First note that the function always maps into {β | t × β ∈ L(A)}. This follows from Lemma 6: if r ∈ succ runs_{π(A)}(t) then r ∈ succ runs_A(t × t^r_A), so that t × t^r_A ∈ L(A). Second, we show that the function is onto. To see this, we show by induction on t that if r is a run of A on t × β then β = t^r_A. Let β be such that t × β ∈ L(A). Then there exists r ∈ succ runs_A(t × β) so that β = t^r_A. By Lemma 6, it also holds that r ∈ succ runs_{π(A)}(t), so that t^r_A is a value taken by the function. Third, we have to show that the function is one-to-one. Let r1, r2 ∈ succ runs_{π(A)}(t) such that t^{r1}_A = t^{r2}_A. By Lemma 6 it holds that ri ∈ succ runs_A(t × t^{ri}_A) for both i = 1, 2. Hence r1, r2 are successful runs of A on the same tree, so they are equal by unambiguity of A.


Theorem 1. For every unambiguous n-NSTT A, we can compute an algorithm in time O(size(A) × n) that enumerates q_A(t) with delay O(size(A) × size(t) × n) per n-tuple.

Hence, one can compute the answer set q_A(t) for unambiguous n-NSTTs A on trees t in time O((|q_A(t)| + 1) × size(A) × size(t) × n).

Proof. In order to enumerate the answer set q_A(t) of an n-NSTT A, it is sufficient to enumerate the set {β | t × β ∈ L(A)}, since every β can be transformed in linear time into a unique n-tuple β by definition of n-NSTTs. This set is in bijection with the set of successful runs of π(A) on t by Proposition 2. Given an unambiguous n-NSTT A, we trim A and compute its projection π(A) and types t_A in time O(size(A) × n). Given a tree t ∈ TΣ, the algorithm then proceeds as follows: it enumerates r ∈ succ runs_{π(A)}(t) with delay O(size(A) × size(t) × n) per run and returns, for each run, the n-tuple of nodes corresponding to the Boolean annotation t^r_A.
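A naive sketch of this enumeration, which computes the correct answer set but does not achieve the delay bound of Theorem 1: assuming an n-NSTT given by explicit rule lists — leaf rules ((a, b), q), inner rules ((a, b), q1, q2, q) — and trees as nested tuples ((a,) for a leaf, (a, t1, t2) otherwise), we enumerate runs bottom-up and read each answer directly off the bit vectors of the rules used, instead of projecting first and re-typing the runs.

```python
def runs(leaf_rules, node_rules, t, prefix=()):
    """Yield (state, selection) pairs: for every run on the subtree t at
    address prefix, the state at its root and the list of (component i,
    node address) pairs selected by the bit vectors below it."""
    if len(t) == 1:  # leaf
        for (a, b), q in leaf_rules:
            if a == t[0]:
                yield q, [(i, prefix) for i, bit in enumerate(b) if bit]
    else:
        for (a, b), q1, q2, q in node_rules:
            if a != t[0]:
                continue
            for s1, sel1 in runs(leaf_rules, node_rules, t[1], prefix + (1,)):
                if s1 != q1:
                    continue
                for s2, sel2 in runs(leaf_rules, node_rules,
                                     t[2], prefix + (2,)):
                    if s2 == q2:
                        here = [(i, prefix) for i, bit in enumerate(b) if bit]
                        yield q, here + sel1 + sel2

def answers(n, leaf_rules, node_rules, final, t):
    """All n-tuples beta with t × beta ∈ L(A)."""
    out = set()
    for q, sel in runs(leaf_rules, node_rules, t, ()):
        if q in final:
            alpha = [None] * n
            for i, v in sel:
                alpha[i] = v
            out.add(tuple(alpha))
    return out
```

For an unambiguous A, each successful run produces exactly one answer, matching the bijection of Proposition 2.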

5 Learning Model and Algorithm

The learning model for word languages from polynomial time and data with positive and negative examples [7,6] can be adapted to tree languages.

Definition 3. Tree languages over a fixed set Σ represented by tree automata in some class C are called identifiable from polynomial time and data if there exist two polynomials p1 and p2 and an algorithm learner such that:

– for all input samples S ⊆ TΣ × Bool, learner(S) returns a tree automaton A ∈ C in time O(p1(|S|)) that is consistent with S, in that for all t × b ∈ S: t ∈ L(A) iff b = 1;
– for all tree automata A ∈ C there exists a so-called characteristic sample char(A) of cardinality less than p2(size(A)) such that, for all input samples S ⊇ char(A), learner(S) returns a tree automaton A′ ∈ C equivalent to A.

In contrast to the case of words, the learning model for trees bounds the cardinality of the characteristic sample, not its size. This relaxation may be acceptable as long as one is only interested in the existence of a polynomial time learner. If C is the class of deterministic tree automata, the learner can be defined by the RPNI algorithm in [17].

The model for learning tree languages is only partially adapted to queries. The question is which examples to use for n-ary queries. Let q be an n-ary query. A completely annotated example for q is a pair (t, q(t)) where t ∈ TΣ. We call t the carrier of (t, q(t)). For a set S of completely annotated examples for q, we denote by carrier(S) the set of carriers. A completely annotated example (t, q(t)) defines |q(t)| positive examples, i.e. a positive example (t × β, 1) for each tuple tree t × β with β in q(t). It also defines implicit negative examples, i.e. trees (t × β, 0) with β not in q(t), for all t in carrier(S).

The cardinality of a completely annotated example (t, q(t)) is |q(t)| + 1. The size of a completely annotated example (t, q(t)) is size(t) + |q(t)| × n. A sample is a set S of completely annotated examples for a target query q; its cardinality is the sum of the cardinalities of all completely annotated examples in S, and its size is the sum of their sizes. A tree automaton A over Σ × Boolⁿ is consistent with a sample S if every tree t × β with (t, q(t)) in S and β ∈ q(t) is in L(A), and if there is no tree t × β in L(A) such that (t, q(t)) is in S and β is not in q(t).

The model for learning queries is defined w.r.t. a query representation formalism. Two query representations are said to be equivalent if they represent the same query. This leads us to the following definition:

Definition 4. n-ary queries represented by a query representation formalism R are said to be identifiable from polynomial time and data from completely annotated examples if there exist two polynomials p1 and p2 and an algorithm learner such that:

– for all input samples S of completely annotated n-ary examples, learner(S) returns a representation A ∈ R in time O(p1(|S|)) that is consistent with S;
– for all query representations A ∈ R there exists a so-called characteristic sample char(A) for A of cardinality less than p2(|A|) such that, for all input samples S ⊇ char(A), learner(S) returns a query representation A′ ∈ R equivalent to A.

Let us recall that, for a tree t and an n-ary query q, the number of selected n-tuples in q(t) is at most size(t)ⁿ. Therefore, if we consider a target query q_{n,t} that extracts all n-tuples of a tree t and no tuple for every other tree, the characteristic sample should contain the completely annotated example (t, q_{n,t}(t)), whose size is size(t) + size(t)ⁿ × n. This holds for arbitrary query representation formalisms. In order to avoid this blow-up, we restrict ourselves to queries that select a polynomially bounded number of n-tuples per tree: an n-ary query q over Σ-trees is said to be polynomially bounded if there is a polynomial p such that for each tree t ∈ TΣ, |q(t)| < p(size(t)).

Theorem 2. Polynomially bounded n-ary queries represented by deterministic n-NSTTs are identifiable from polynomial time and data from completely annotated examples.

Before defining the learning algorithm, we recall the basics of the RPNI algorithm for trees. RPNI inputs a sample of positive and negative examples. It first computes an initial deterministic tree automaton which recognizes exactly the set of positive examples in the sample. It then merges states as long as possible, while verifying consistency (no negative example in the sample is recognized) and preserving determinism. The order of state merges matters.

Merging works as follows. Let A be the initial automaton. We consider a partition π of states(A). The equivalence class of q in that partition is denoted by π(q). The quotient of A with respect to π is denoted by A/π. It is the automaton which satisfies states(A/π) = π and final(A/π) = {p ∈ π | p ∩ final(A) ≠ ∅}. The rules of A/π are defined such that (a,b) → q ∈ rules(A) ⇒ (a,b) → π(q) ∈ rules(A/π) and (f,b)(q1,q2) → q ∈ rules(A) ⇒ (f,b)(π(q1),π(q2)) → π(q) ∈ rules(A/π). State merging is performed by a function merge(A,π,qi,qj) that outputs a partition π′ such that π′(qi) = π′(qj) and other elements of π are preserved. The function det-merge(A,π,qi,qj) first merges qi and qj and then performs further merges such that the resulting automaton becomes deterministic.
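The quotient construction can be sketched in a few lines, assuming an explicit rule-list representation (leaf rules ((a, b), q), inner rules ((a, b), q1, q2, q)) and a partition given as a hypothetical map rep from each state to a canonical representative of its class; merge(A, π, qi, qj) then amounts to redirecting rep[qj] to rep[qi].

```python
def quotient(leaf_rules, node_rules, final, rep):
    """Build A/π: rename every state to its class representative;
    rules that become equal after renaming collapse into one."""
    ql = {((a, b), rep[q]) for (a, b), q in leaf_rules}
    qn = {((a, b), rep[q1], rep[q2], rep[q])
          for (a, b), q1, q2, q in node_rules}
    qf = {rep[q] for q in final}
    return ql, qn, qf
```

A class is final as soon as it contains some final state of A, mirroring the definition of final(A/π) above.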

The learning algorithm learner for n-ary queries represented by deterministic n-NSTTs is set to be RPNI_n-NSTT. It is given in Figure 2. It uses the same schema as the RPNI algorithm for tree languages, but with the following features:

– the positive examples are the tuple trees t × β for every t ∈ carrier(S) such that q(t) ≠ ∅ and β ∈ q(t);
– not all deterministic tree automata over Σ × Boolⁿ are deterministic n-NSTTs; therefore, after every merge we have to check whether the resulting automaton is an n-NSTT. This is done using the t function (see Proposition 1). Note that, as one never merges states of different types, we denote, for a partition π of states(A) considered by the algorithm and for a set of states p ∈ π, by t(p) the type of its states;
– we do not have explicit negative examples, but the hypothesis of completely annotated examples as input allows us to define implicit negative examples: trees t × β such that (t, q(t)) ∈ S and β ∉ q(t). As there is a bijection between runs on Σ-trees and answers of a query (see Lemma 6), verifying whether an implicit negative example is recognized amounts to verifying that the number of runs on the carriers of the input sample does not grow. This replaces the usual consistency check of RPNI-like algorithms;
– also, note that RPNI requires an order on states. In the initial automaton, each state can be associated with the single tree that it recognizes; states are then ordered following a fixed order on those trees.

The initial n-NSTT A is consistent with the input sample S because it

recognizes exactly the set S+of tuple trees constructed from S. Let us suppose

that, at every call to det-merge, the n-NSTT A/π is consistent with S. The

automaton A/π′satisfies L(A) ⊆ L(A/π′). To check whether A/π is consistent

with S, it is sufficient to test whether there is no new tree t × β in L(A′) with

t ∈ carrier(S). From lemma 6, this is equivalent to check whether, for every

tree t in carrier(S), the number of successful runs of the projected automaton

π(A) is equal to |q(t)|. Counting the number of successful runs on an input tree

can be done in O(size(S)). Note that we do not consider the size of A′because

it is lower than the size of A, and the size of A is linear in the size of S.

Also, we compute the t function described in section 3 on A. As A is an n-NSTT, the conditions of lemma 4 are satisfied for A. It is easy to verify that these conditions are also satisfied for A/π if and only if there do not exist two states of different types in the same element of π. This is guaranteed by the fact that we never merge states of different types.

Thus RPNIn-NSTT computes in polynomial time, for every input sample S, an n-NSTT consistent with S. To end the proof of Theorem 2, it remains to prove the second item of Definition 4, i.e. we must define characteristic samples


RPNIn-NSTT
Input: a sample S of completely annotated examples
  compute S+ = {t × β | t ∈ carrier(S), β ∈ q(t)}
  let A be the minimal deterministic n-NSTT such that L(A) = S+
  compute t and order the states of A from q1 to qn
  let m = Σ_{t ∈ carrier(S)} |q(t)|
  let π be the trivial partition of states(A)
  For i = 0 to |states(A)| do
    let q be the state with the smallest index in π(qi)
    If qi = q then                     % qi has not been merged
      For j = 0 to i − 1 do
        If t(qi) = t(qj) then
          π′ ← det-merge(A, π, qi, qj)
          let m′ be the number of runs of A/π′ on carrier(S)
          % test consistency with negative information
          If m = m′ then π ← π′ and Exit Inner Loop
Output: A/π

Fig. 2. The learning algorithm learner for n-ary queries represented by deterministic n-NSTTs.
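To make the control flow of the algorithm of Fig. 2 concrete, here is a Python sketch of its merge loop. All helper names (`det_merge`, `count_runs_on`, `type_of`), the shape of `sample`, and the attribute `A.states` are assumptions for illustration, not the paper's implementation:

```python
def rpni_n_nstt(sample, A, det_merge, count_runs_on, type_of):
    """Control structure of RPNI for n-NSTTs (sketch only).

    `A` is the minimal deterministic n-NSTT for S+; `det_merge`,
    `count_runs_on` and `type_of` are assumed helpers, left abstract."""
    states = sorted(A.states)                        # fixed order on states
    m = sum(len(answers) for _, answers in sample)   # m = Σ_t |q(t)|
    partition = {q: {q} for q in states}             # trivial partition

    for i, qi in enumerate(states):
        if min(partition[qi]) != qi:                 # qi was merged away earlier
            continue
        for qj in states[:i]:
            if type_of(qi) != type_of(qj):           # never merge different types
                continue
            new_partition = det_merge(A, partition, qi, qj)
            # consistency with the implicit negative examples:
            # the number of runs on carrier(S) must not grow
            if count_runs_on(A, new_partition, sample) == m:
                partition = new_partition
                break                                # exit inner loop
    return partition
```

The run-count comparison plays the role of RPNI's usual consistency check against explicit negative examples.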

for n-ary queries represented by deterministic n-NSTTs, and we must prove the convergence property of RPNIn-NSTT w.r.t. characteristic samples.

Tree languages represented by deterministic automata are identifiable from polynomial time and data [17]. Thus n-ary queries, considered as tree languages over Σ × Bool^n and represented by deterministic n-NSTTs, are identifiable from polynomial time and data. But recall that this result holds in the learning model from positive and negative examples. Let learner′ = RPNI be the learning algorithm for tree languages represented by deterministic tree automata, and let char′ be the function computing the characteristic sample associated with a deterministic tree automaton. For a deterministic n-NSTT A, char′(A) is the characteristic sample for A regarded as the representation of a tree language of Σ × Bool^n-trees. We define the characteristic sample char(A) for A regarded as the representation of an n-ary query by:

char(A) = {(t, q(t)) | (t × β, b) ∈ char′(A)}

We show that the cardinality of char(A) is polynomial. As tree languages represented by deterministic tree automata are learnable from polynomial time and data, there is a polynomial p′2 such that the cardinality of char′(A) is less than p′2(s). Consequently, the number of trees t such that there exists an example (t × β, b) ∈ char′(A) is less than p′2(s). Therefore, carrier(S) has cardinality less than p′2(s). As we consider polynomially bounded queries, the cardinality of every completely annotated example is polynomial. Thus there is a polynomial p2 such that the cardinality of char(A) is less than p2(s).


Let learner be set to RPNIn-NSTT. We have shown that, for every sample S, RPNIn-NSTT outputs in polynomial time an n-NSTT consistent with S. It remains to show that if char(A) ⊆ S then RPNIn-NSTT with input S outputs an n-NSTT, denoted by RPNIn-NSTT(S), equivalent to A.

Let A be the target n-NSTT and let S be a sample that contains char(A). We define the sample S′ of positive and negative examples by:

S′ = {(t × β, 1) | t ∈ carrier(S), β ∈ q(t)} ∪ {(t × β, 0) | t ∈ carrier(S), β ∉ q(t)}

By definition of char(A) and of S′, we have char′(A) ⊆ S′. Then RPNI with input S′ outputs a deterministic automaton RPNI(S′) = A′ such that L(A′) = L(A). It remains to show that RPNIn-NSTT(S) = RPNI(S′). First, verifying that the number of runs on carrier(S) does not grow is equivalent to the consistency test done by RPNI w.r.t. S′ (as said above). Second, if char(A) ⊆ S, and consequently char′(A) ⊆ S′, then RPNI(S′) = A′ is an n-NSTT because L(A′) = L(A) is canonical. Therefore, under the hypothesis that char(A) ⊆ S, the current deterministic automaton is an n-NSTT at every step of RPNIn-NSTT: otherwise a tree which is not a tuple tree would be accepted (the sequence of languages is increasing with respect to inclusion because states are merged). Thus, under the hypothesis that char(A) ⊆ S, merged states are always of the same type, and RPNIn-NSTT(S) = RPNI(S′).
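The sample S′ used in this proof can be generated mechanically from the completely annotated examples. A minimal Python sketch follows; reducing each tree to a flat list of node identifiers and leaving the actual t × β encoding abstract are simplifications of ours:

```python
from itertools import product

def implicit_sample(sample, n):
    """Build the sample S' of positive and negative examples from a
    sample of completely annotated examples (t, q(t))."""
    s_prime = []
    for nodes, answers in sample:                # nodes of t, answers q(t)
        answers = set(answers)
        for beta in product(nodes, repeat=n):    # all n-tuples of nodes of t
            label = 1 if beta in answers else 0  # positive iff beta in q(t)
            s_prime.append(((nodes, beta), label))
    return s_prime
```

Note that S′ is exponentially larger than S in general, which is why the algorithm never materializes it and relies on run counting instead.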

6 n-NSTTs for Unranked Trees

HTML and XML documents parse into unranked trees, in which every node may have a list of children of unbounded length, rather than only two. The notion of n-ary queries carries over literally.

As an example, consider the unranked tree film-list in Fig. 3. This tree represents a list (L) of three films (F), two of which are directed by Hitchcock (H) and one by Wenders (W). The letter T represents the title of a film. The binary query hitchcock asks for the pairs of director and title nodes in films by Hitchcock. From film-list, this query selects the following pairs of nodes: hitchcock(film-list) = {(1·2, 1·1), (3·2, 3·1)}. The tree in Fig. 3 is annotated by the first pair (1·2, 1·1).
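The annotation t × β of film-list by the pair (1·2, 1·1) can be computed as follows in Python; the nested-tuple tree representation and the function name are choices of ours for illustration:

```python
def annotate(tree, beta, addr=()):
    """Encode a Σ-tree as a tree over Σ × Bool^n for a tuple beta of node
    addresses (Dewey paths as tuples, root = ()). Component i of a node's
    bit vector is 1 iff the node is the i-th component of beta."""
    sym, children = tree[0], tree[1:]
    bits = tuple(1 if addr == b else 0 for b in beta)
    return ((sym, bits),) + tuple(
        annotate(child, beta, addr + (k,))
        for k, child in enumerate(children, start=1))

# film-list: a list L of three films F, each with a title T and a director
film_list = ("L",
             ("F", ("T",), ("H",)),
             ("F", ("T",), ("W",)),
             ("F", ("T",), ("H",)))
annotated = annotate(film_list, ((1, 2), (1, 1)))   # the pair (1·2, 1·1)
```

The director node 1·2 receives bit vector (1, 0), the title node 1·1 receives (0, 1), and all other nodes receive (0, 0).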

For extending n-NSTTs to unranked trees, we only need a notion of tree

automata for unranked trees. It must come with a good notion of bottom-up

determinism, for which the Myhill-Nerode theorem holds. This needs some care

[11]. We solve this problem as in [1] by using stepwise tree automata [3]. These

have the further advantage that they can be identified with standard tree au-

tomata operating on binary encodings of unranked trees, so that all our learning

algorithms carry over.
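The identification of stepwise tree automata with standard automata rests on a curried binary encoding of unranked trees, which can be sketched as follows (the nested-tuple representation is ours; the encoding follows the idea of [3]):

```python
def curry(tree):
    """Binary (curried) encoding of an unranked tree:
    a(t1, ..., tn) becomes @(...@(@(a, t1'), t2')..., tn'),
    where @ is a binary 'extend by next child' symbol and ti' = curry(ti)."""
    sym, children = tree[0], tree[1:]
    encoded = (sym,)                        # constant: the bare label
    for child in children:
        encoded = ("@", encoded, curry(child))
    return encoded
```

A standard bottom-up tree automaton on curry(t) then corresponds to a stepwise tree automaton on t, so the learning algorithm developed for binary trees applies unchanged.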

An example of a stepwise tree automaton inferred by our learning algorithm is given in Fig. 3. This automaton has been inferred from completely annotated examples for the query hitchcock, and recognizes that query, at least for documents of the correct type.


[Automaton diagram and annotated tree omitted.]

Fig. 3. A stepwise tree automaton inferred by our algorithm RPNI2-NSTT (left); the tree film-list annotated by a successful run (right); state 8 is obtained by evaluating the word L·7·6·6. Bit vectors 00 are ignored, so we write L instead of L00.

                     Okra                                     Bigbook
        RPNI-NSTT           RPNI-1-NSTT          RPNI-NSTT           RPNI-1-NSTT
 # Ex.  F-meas. Init. Inf.  F-meas. Init. Inf.  F-meas. Init. Inf.  F-meas. Init. Inf.
 1      100 %    72    24   97.1 %   624   30   68.4 %   162   37   89.4 %   485   29
 2      100 %    82    24   98.3 %   547   28   91.3 %   172   42   98.6 %   877   29
 3      100 %    85    24   94.3 %  1045   31   100 %    179   48   100 %   1226   30

Fig. 4. Learning monadic queries by RPNI for either NSTTs [2] or 1-NSTTs as proposed here: F-measure, sizes of initial and inferred automata.

7 Application to Web Information Extraction

We have implemented our learning algorithm and started applying it to Web

information extraction tasks. We have added a single heuristic proposed in [6],

which consists in typing states, so that only trees compatible with HTML syntax

are recognized. Textual values and attributes are ignored.

In the case of monadic queries, we compare our algorithm RPNI1-NSTT with RPNINSTT from [2]. We use the RISE benchmark: www.isi.edu/info-agents/RISE. Results are averaged over 30 experiments and presented in Fig. 4. Our algorithm performs slightly worse on the Okra benchmark, because this benchmark contains pages with a single element to be extracted. On Bigbook, however, RPNI1-NSTT performs better than RPNINSTT. It is interesting to observe that our technique produces bigger initial automata (because of canonicity, we have one input tree per tuple), but the output automata are roughly of the same size for the two systems. These experiments show that induction of NSTTs and of 1-NSTTs yields similarly good performance while using different representation schemas.

For n-ary queries, we run RPNIn-NSTT on the benchmarks Bigbook and Okra. The results are promising. We also use the Datafoot benchmark available at www.grappa.univ-lille3.fr/~marty/corpus.html. It contains documents with various structures: lists, tables, rotated tables, and cross-tables, among others. We learn from only one completely annotated Web document. “Success”


                   Okra                     Bigbook
 # Examples  F-meas. Init. Infer.   F-meas. Init. Infer.
 1           90.4 %   469    31     89.9 %   505    33
 2           97.6 %   781    31     95.2 %   891    33
 3           99.4 %  1171    32     100 %   1342    34

Fig. 5. Results of RPNI2-NSTT on the Okra and Bigbook benchmarks on a binary task: extraction of (name, mail) on Okra and (name, address) on Bigbook.

 Dataset  Succ.?  Description
 L0       YES     table with tuples in rows
 L1       NO      table with tuples in columns
 L2       YES     2 column table w/ separator
 L3       YES     nested lists
 L4       YES     lists without separator
 L5       YES     fake list (sequence of EM)
 L6       NO      fake list 2 (sequence of SPAN)
 L7       YES     list of descriptions (DD/DT tag)
 L8       YES     description and list of SPAN
 L9       YES     list of tables, one element factorized

Fig. 6. RPNI2-NSTT on Web pages with various structures from the Datafoot benchmark.

means that we achieve 100 % F-measure on other Web pages. Experimental results are given in Fig. 6. They are generally very positive. Limitations arise only in the case of non-regular queries (L1), or when the tree structure alone is not sufficiently informative (L6). These limitations are of course to be expected.

Future Work

Completely annotated examples are not realistic in the practice of information extraction. As in the monadic case, we will have to introduce intelligent tree pruning techniques in order to cut off irrelevant parts of documents. This is needed to deal with partially annotated documents, in order to reduce the annotation effort and to improve the quality of inferred queries. It is fundamental to interactive learning of n-ary queries.

References

1. J. Carme, R. Gilleron, A. Lemay, and J. Niehren. Interactive learning of node

selecting tree transducer. Machine Learning, 2006.

2. J. Carme, A. Lemay, and J. Niehren. Learning node selecting tree transducer from

completely annotated examples. In ICGI, vol. 3264 of LNAI, p. 91–102. 2004.

3. J. Carme, J. Niehren, and M. Tommasi. Querying unranked trees with stepwise

tree automata. In RTA, vol. 3091 of LNCS, p. 105 – 118. 2004.

4. B. Chidlovskii. Wrapping web information providers by transducer induction. In

ECML, vol. 2167 of LNAI, p. 61 – 73, 2001.

5. A. Corbí, J. Oncina, and P. García. Learning regular languages from a complete sample by error correcting techniques. IEE, p. 4/1–4/7, 1993.


6. C. de la Higuera. Characteristic sets for polynomial grammatical inference. Ma-

chine Learning, 27:125–137, 1997.

7. E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37:302–320, 1978.

8. G. Gottlob and C. Koch. Monadic queries over tree-structured data. In 17th

Annual IEEE Symposium on Logic in Computer Science, p. 189–202, 2002.

9. H. Hosoya and B. Pierce. Regular expression pattern matching for XML. Journal of Functional Programming, 13(6):961–1004, 2003.

10. N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intel-

ligence, 118(1-2):15–68, 2000.

11. W. Martens and J. Niehren. On the minimization of XML schemas and tree

automata for unranked trees. Journal of Computer and System Science, 2006.

12. G. Miklau and D. Suciu. Containment and equivalence for a fragment of XPath. Journal of the ACM, 51(1):2–45, 2004.

13. I. Muslea, S. Minton, and C. Knoblock. Active learning with strong and weak

views: a case study on wrapper induction. In IJCAI 2003, p. 415–420, 2003.

14. F. Neven and J. Van Den Bussche. Expressiveness of structured document query

languages based on attribute grammars. Journal of the ACM, 49(1):56–100, 2002.

15. J. Niehren, L. Planque, J.-M. Talbot, and S. Tison. N-ary queries by tree automata. In DBPL, vol. 3774 of LNCS, p. 217–231, 2005.

16. J. Oncina and P. Garcia. Inferring regular languages in polynomial update time.

In Pattern Recognition and Image Analysis, p. 49–61, 1992.

17. J. Oncina and P. García. Inference of recognizable tree sets. Tech. report, Universidad de Alicante, 1993. DSIC-II/47/93.

18. S. Raeymaekers, M. Bruynooghe, and J. Van den Bussche. Learning (k,l)-contextual tree languages for information extraction. In ECML, vol. 3720 of LNAI, p. 305–316, 2005.

19. J. W. Thatcher and J. B. Wright. Generalized finite automata with an application

to a decision problem of second-order logic. Math. System Theory, 2:57–82, 1968.
