Learning n-ary Node Selecting Tree Transducers from Completely Annotated Examples

11/2006; DOI: 10.1007/11872436_21
Source: OAI

ABSTRACT We present the first algorithm for learning n-ary node selection queries in trees from completely annotated examples by methods of grammatical inference. We propose to represent n-ary queries by deterministic n-ary node selecting tree transducers (NSTTs), that are known to capture the class of MSO-definable n-ary queries. Despite of this highly expressive, we show that n-aryy queries, selecting a polynomially bounded number of tuples per tree, represented by deterministic NSTTs can be learned from polynomial time and data while allowing for efficient enumeration of query answers. An application to wrapper induction in Web information extraction yields encouraging results.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The induction of monadic node selecting queries from partially annotated XML-trees is a key task in Web information extraction. We show how to integrate schema guidance into an RPNI-based learning algorithm, in which monadic queries are represented by pruning node selecting tree transducers. We present experimental results on schema guidance by the DTD of HTML.
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: We investigate in this paper the spatial logic TQL for querying semi-structured data, represented as unranked ordered trees over an infinite alphabet. This logic consists of usual Boolean connectives, spatial connectives (derived from the constructors of a tree algebra), tree variables and a fixpoint operator for recursion. Motivated by XML-oriented tasks, we investigate the guarded TQL fragment. We prove that for closed formulas this fragment is MSO-complete. In presence of tree variables, this fragment is strictly more expressive than MSO as it allows for tree (dis)equality tests, i.e. testing whether two subtrees are isomorphic or not. We devise a new class of tree automata, called TAGED, which extends tree automata with global equality and disequality constraints. We show that the satisfiability problem for guarded TQL formulas reduces to emptiness of TAGED. Then, we focus on bounded TQL formulas: intuitively, a formula is bounded if for any tree, the number of its positions where a subtree is captured by a variable is bounded. We prove this fragment to correspond with a subclass of TAGED, called bounded TAGED, for which we prove emptiness to be decidable. This implies the decidability of the bounded guarded TQL fragment. Finally, we compare bounded TAGED to a fragment of MSO extended with subtree isomorphism tests.
  • [Show abstract] [Hide abstract]
    ABSTRACT: Inference algorithms for tree automata that define node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We will distinguish the class of regular queries that are stable under a given schemaguided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice of XML information extraction.
    Journal of Machine Learning Research 01/2013; 14(1):927-964. · 3.42 Impact Factor

Full-text (3 Sources)

Available from
Jun 11, 2014