Learning n-ary Node Selecting Tree Transducers from Completely Annotated Examples

11/2006; DOI: 10.1007/11872436_21
Source: OAI


We present the first algorithm for learning n-ary node selection queries in trees from completely annotated examples by methods of grammatical inference. We propose to represent n-ary queries by deterministic n-ary node selecting tree transducers (NSTTs), which are known to capture the class of MSO-definable n-ary queries. Despite this high expressiveness, we show that n-ary queries that select a polynomially bounded number of tuples per tree and are represented by deterministic NSTTs can be learned with polynomial time and data, while allowing for efficient enumeration of query answers. An application to wrapper induction in Web information extraction yields encouraging results.


Available from: Rémi Gilleron, May 26, 2014
  • Source
    • "Learning XML queries and transformations have been studied for the task of data extraction [21] [22], which is closely related. The classical framework of language inference in the limit [24] has been adapted to learn n-ary XML queries captured with tree automata [9] [28] and tree transformations captured with tree transducers [27]. While tree automata and transducers are valued for their ability to model large classes of queries, they have little support from the existing infrastructure which favors more common standards like XPath and XQuery. "
    ABSTRACT: Web applications store their data within various database models, such as relational, semi-structured, and graph data models to name a few. We study learning algorithms for queries for the above mentioned models. As a further goal, we aim to apply the results to learning cross-model database mappings, which can also be seen as queries across different schemas.
  • Source
    • "For static tests involving schemas, such as typechecking for XML transformations (see, e.g., [17] [33]), a schema minimizer can be used as a preprocessor to improve the running time of the typechecker. Minimal deterministic automata for unranked tree languages play a prominent role in recent approaches to query induction for Web information extraction [4]. The objective is to identify a tree automaton for a previously unknown target language from given examples. "
    ABSTRACT: Automata for unranked trees form a foundation for XML schemas, querying, and pattern languages. We study the problem of efficiently minimizing such automata. First, we study unranked tree automata that are standard in database theory, assuming bottom-up determinism and that horizontal recursion is represented by deterministic finite automata. We show that minimal automata in that class are not unique and that minimization is NP-complete. Second, we study more recent automata classes that do allow for polynomial time minimization. Among those, we show that bottom-up deterministic stepwise tree automata yield the most succinct representations. Third, we investigate abstractions of XML schema languages. In particular, we show that the class of one-pass preorder typeable schemas allows for polynomial time minimization and unique minimal models.
    Journal of Computer and System Sciences 06/2007; 73(4):550-583. DOI:10.1016/j.jcss.2006.10.021
  • Source
    ABSTRACT: XML query induction is a key task in Web information extraction. Recent approaches based on grammatical inference represent node selection queries in XML trees by deterministic tree automata. In this paper, we show how to guide RPNI-based learning algorithms by XML schemas which we can infer in a preprocessing step. We hope that schema guidance will help to improve heuristics that are essential for query learning algorithms.