Conference Paper

Predicate-based Filtering of XPath Expressions

University of Toronto, Canada
DOI: 10.1109/ICDE.2006.115
Conference: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3-8 April 2006, Atlanta, GA, USA
Source: DBLP


The XML/XPath filtering problem has attracted widespread interest. In this paper, we propose a novel algorithm for solving it. Our approach encodes XPath expressions (XPEs) as ordered sets of predicates and translates XML documents into sets of tuples, which are evaluated over these predicates. Predicates representing overlapping portions of XPEs are stored and processed only once, thus fully exploiting potential overlap among XPEs. We experimentally evaluate the performance of our algorithm, demonstrating its scalability to millions of XPEs, with matching times in the millisecond range. We also show interesting trade-offs relative to alternative approaches.
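The idea of shared predicates can be illustrated with a minimal sketch. It is not the algorithm of the paper: it assumes, for illustration only, that XPEs are absolute, use only the child axis and the `*` wildcard, and that each step becomes a hypothetical `(tag, position)` predicate. The names `PredicateIndex`, `add_xpe`, `match`, and `root_to_leaf_paths` are invented for this example. Each root-to-leaf path of a document is turned into `(tag, position)` tuples, every distinct predicate is stored once, and an XPE matches when all of its predicates are satisfied by the same path.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(elem, prefix=()):
    """Yield every root-to-leaf path of the document as a tuple of tag names."""
    path = prefix + (elem.tag,)
    children = list(elem)
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)


class PredicateIndex:
    """Illustrative index: each distinct (tag, position) predicate is stored once."""

    def __init__(self):
        self.pred_ids = {}  # (tag, position) -> shared predicate id
        self.xpes = {}      # XPE string -> ordered list of predicate ids

    def add_xpe(self, xpe):
        # Assumption: absolute XPE with child axis only, e.g. "/a/*/d".
        steps = xpe.strip("/").split("/")
        ids = []
        for pos, tag in enumerate(steps, start=1):
            # Reuse the id if another XPE already introduced this predicate.
            ids.append(self.pred_ids.setdefault((tag, pos), len(self.pred_ids)))
        self.xpes[xpe] = ids

    def match(self, xml_text):
        root = ET.fromstring(xml_text)
        matched = set()
        for path in root_to_leaf_paths(root):
            # Evaluate each stored predicate at most once per path.
            satisfied = set()
            for pos, tag in enumerate(path, start=1):
                for candidate in ((tag, pos), ("*", pos)):
                    pid = self.pred_ids.get(candidate)
                    if pid is not None:
                        satisfied.add(pid)
            # An XPE matches if all of its predicates hold on this path.
            for xpe, ids in self.xpes.items():
                if all(pid in satisfied for pid in ids):
                    matched.add(xpe)
        return matched


index = PredicateIndex()
for xpe in ("/a/b/c", "/a/b/d", "/a/*/d", "/x/b"):
    index.add_xpe(xpe)
result = index.match("<a><b><d/></b><e/></a>")
```

In this sketch the four XPEs contain eleven steps in total but give rise to only six stored predicates, since common steps such as `a` at position 1 are shared. The final loop over all XPEs is deliberately naive; a scalable implementation would need an index from predicates to the expressions that use them.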

  • Source
    • "In our evaluations, we show that GPX-Matcher outperforms both BPA (called basic-pc-ap in [13]) and the automaton-based Yfilter [6]."

    Full-text · Conference Paper · Jan 2011
  • Source
    • "In content-based publish/subscribe, subscribers can specify more complex constraints in addition to just specifying the topic of interest. Depending on the implementation, subscription filters in content-based publish/subscribe can be based on attributes (which this paper focuses on), on path expressions through the use of XML [Altinel and Franklin 2000; Pereira et al. 2001; Hou and Jacobsen 2006; Li et al. 2008], or on graph expressions [Petrovic et al. 2005]. Attribute-based predicates consist of an operator-value pair to specify the filtering conditions on each attribute."
    ABSTRACT: Distributed content-based publish/subscribe systems suffer from performance degradation and poor scalability caused by uneven load distributions typical in real-world applications. The reason for this shortcoming is the lack of a load balancing scheme. This article proposes a load balancing solution specifically tailored to the needs of content-based publish/subscribe systems that is distributed, dynamic, adaptive, transparent, and accommodates heterogeneity. The solution consists of three key contributions: a load balancing framework, a novel load estimation algorithm, and three offload strategies. A working prototype of our solution is built on an open-sourced content-based publish/subscribe system and evaluated on PlanetLab, a cluster testbed, and in simulations. Real-life experiment results show that the proposed load balancing solution is efficient with less than 0.2% overhead; effective in distributing and balancing load originating from a single server to all available servers in the network; and capable of preventing overloads to preserve system stability, availability, and quality of service.
    Full-text · Article · Dec 2010 · ACM Transactions on Computer Systems
  • Source
    • "In recent years, many approaches have been presented for providing efficient filtering of XML data against large sets of queries. Centralized approaches include works like XTrie [8], YFilter [14], FiST [23] and others [28] [7] [32] [21]. However, in order to offer XML filtering functionality on Internet-scale and avoid the typical problems of centralized solutions, such a service should be deployed in a distributed environment. "
    ABSTRACT: Many XML filtering systems have emerged in recent years identifying XML data that structurally match XPath queries in an efficient way. However, apart from structural matching, it is considered equally important to deal with value-based predicates. In this paper, we propose methods to combine both structural and value XML filtering in a distributed environment based on distributed hash tables. Structural matching is performed using automata, while we study different methods for evaluating value-based predicates. As a result, our algorithms scale in both the size of the query set and the number of the predicates per query. We perform an experimental evaluation and demonstrate the strengths and weaknesses of the proposed methods in both a controlled environment of a cluster and on a real testbed provided by the PlanetLab network.
    Full-text · Conference Paper · Jan 2010