Article

Locating valid SLCAs for XML keyword search with NOT semantics


Abstract

Keyword search provides an easy way for users to pose queries against XML documents, and it is important to support queries with arbitrary combinations of AND, OR, and NOT operators. The previous RELMN algorithm processed such queries by extending the original SLCA definition in a straightforward way, but it did not work correctly in some cases. In this paper, we propose the concept of valid SLCAs as query results. Nodes in an XML document are classified according to their usage, and this classification is then used to define the scope affected by a negative keyword. Only valid nodes, which are not affected by any negative keyword, qualify to identify valid SLCAs. The experimental results show that the proposed algorithm achieves higher precision and recall, and is more efficient, than the previous work.

... SLCA is an important concept for XML keyword queries. It is used mainly to define the significant results returned for a given keyword query, which is the core problem in keyword query research [1,2,6,7,17,18,19,20]. Concretely, an SLCA is the root node of a subtree that meets the two conditions below: ...
Article
Extensible Markup Language (XML) is commonly employed to represent and transmit information over the Internet. Therefore, how to search massive XML data for keywords effectively has become a new issue. In this paper, we first present four properties to improve the classical ILE algorithm. Then, a parallel XML keyword search algorithm, based on intelligent grouping to calculate SLCAs, is proposed and implemented under the MapReduce programming model. Finally, a series of experiments is conducted on 7 datasets of different sizes. The results indicate that the proposed algorithm has high execution efficiency and is applicable to keyword search of massive XML data.
... There are some achievements (Kudo and Hada, 2000; Li et al., 2007; Liu and Chen, 2007; Sun et al., 2007; Lin et al., 2014) on keyword search under secure access control in XML documents. Li et al. (2010) select security attributes in the keyword search results and combine an XML keyword search scheme with the security view of access control. ...
Article
As users store and share information in the cloud at an increasing rate, data storage brings new challenges to Extensible Markup Language (XML) databases in big data environments. Efficient retrieval of data, with protection and privacy when accessing mass data in the cloud, is increasingly important. Most existing research on XML data query and retrieval focuses on efficiency, index construction, and similar concerns. However, these methods and algorithms do not take into account the security of the data and of the data structure itself. Furthermore, traditional access control rules read XML document nodes in a dynamic environment, yet few studies address the data security and privacy protection requirements of dynamic, query-based keyword search. To improve search efficiency under security conditions, this paper examines how to generate the subtree of matching keywords that the user can access according to the access control rules for the user's role. A corresponding algorithm is proposed to achieve safe and efficient keyword search.
Conference Paper
In this paper, we study the problem of effective keyword search over XML documents. We begin by introducing the notion of Valuable Lowest Common Ancestor (VLCA) to accurately and effectively answer keyword queries over XML documents. We then propose the concept of Compact VLCA (CVLCA) and compute the meaningful compact connected trees rooted as CVLCAs as the answers of keyword queries. To efficiently compute CVLCAs, we devise an effective optimization strategy for speeding up the computation, and exploit the key properties of CVLCA in the design of the stack-based algorithm for answering keyword queries. We have conducted an extensive experimental study and the experimental results show that our proposed approach achieves both high efficiency and effectiveness when compared with existing proposals.
Article
Keyword search is a friendly mechanism for users to identify desired information in XML databases, and LCA is a popular concept for locating the meaningful subtrees corresponding to query keywords. Among all the LCA-based approaches, MaxMatch [9] is the only one that achieves the properties of monotonicity and consistency, by outputting only contributors instead of the whole subtree. Although the MaxMatch algorithm performs efficiently in some cases, there is still room for improvement. In this paper, we first propose to improve its performance by avoiding unnecessary index accesses. We then speed up the process of subset detection, which is a core procedure for determining contributors. The resulting algorithms are called MinMap and MinMap+, respectively. Finally, we analytically and empirically demonstrate the efficiency of our methods. According to our experiments, our two algorithms work better than the existing one, and MinMap+ is particularly helpful when the breadth of the tree is large and the number of keywords grows.
Article
Efficient query processing has been a critical issue for XML repositories. In this paper, we consider XML queries that can be represented as a query tree with twig patterns and that also contain full-text constraints. Previously, the structure-first approach and the keyword-first approach have been proposed to process such queries. The main focus of this paper is constructing an integrated system to support these two approaches and find the best execution plan. To achieve this goal, we first analyze the components of these two approaches and design a set of operators. We then derive the corresponding cost model and rewriting rules to perform cost-based optimization. We also propose several heuristic rules by observing the behaviors of the two approaches. Via an extensive experimental study, we demonstrate that our cost-based system and heuristic system are both effective.
Conference Paper
In this paper, we focus on efficient keyword query processing for XML data based on SLCA and ELCA semantics. We propose for each keyword a novel form of inverted list, which includes IDs of nodes that directly or indirectly contain the keyword. We propose a family of efficient algorithms that are based on the set intersection operation for both semantics. We show that the problem of SLCA/ELCA computation becomes finding a set of nodes that appear in all involved inverted lists and satisfy certain conditions. We also propose several optimization techniques to further improve the query processing performance. We have conducted extensive experiments with many alternative methods. The results demonstrate that our proposed methods outperform existing ones by up to two orders of magnitude in many cases.
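The set-intersection view of SLCA computation described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes nodes are identified by Dewey labels stored as Python tuples (the abstract only says "IDs"), it builds the ancestor-inclusive inverted lists naively in memory, and the function names `ancestor_closed` and `slca_by_intersection` are hypothetical.

```python
def ancestor_closed(direct_matches):
    """Return every node that directly or indirectly contains a keyword.

    Assumption: each node is a Dewey label (tuple of ints), so the
    ancestors of a node are exactly the non-empty prefixes of its label.
    """
    nodes = set()
    for label in direct_matches:
        for depth in range(1, len(label) + 1):
            nodes.add(label[:depth])
    return nodes


def slca_by_intersection(match_lists):
    """Compute SLCAs for an AND-only query by intersecting inverted lists.

    match_lists holds, per keyword, the labels of nodes that directly
    contain that keyword.
    """
    if not match_lists:
        return []
    # Nodes present in every list contain all keywords in their subtree.
    common = set.intersection(*(ancestor_closed(m) for m in match_lists))
    # A common node is an SLCA only if none of its children is also common.
    parents = {label[:-1] for label in common if len(label) > 1}
    return sorted(common - parents)


# Hypothetical document: root (0,), with keyword matches listed below.
xml_matches = [(0, 0, 0), (0, 1, 0)]
search_matches = [(0, 0, 1), (0, 2)]
print(slca_by_intersection([xml_matches, search_matches]))  # [(0, 0)]
```

Because each list is closed under ancestors, the intersection is as well, so checking for a child in the intersection is enough to rule out any descendant. Efficient list encodings and the paper's optimization techniques are outside the scope of this sketch.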
Conference Paper
Keyword search is a proven, user-friendly way to query HTML documents in the World Wide Web. We propose keyword search in XML documents, modeled as labeled trees, and describe corresponding efficient algorithms. The proposed keyword search returns the set of smallest trees containing all keywords, where a tree is designated as "smallest" if it contains no tree that also contains all keywords. Our core contribution, the Indexed Lookup Eager algorithm, exploits key properties of smallest trees in order to outperform prior algorithms by orders of magnitude when the query contains keywords with significantly different frequencies. The Scan Eager variant is tuned for the case where the keywords have similar frequencies. We analytically and experimentally evaluate two variants of the Eager algorithm, along with the Stack algorithm [13]. We also present the XKSearch system, which utilizes the Indexed Lookup Eager, Scan Eager and Stack algorithms and a demo of which on DBLP data is available at http://www.db.ucsd.edu/projects/xksearch. Finally, we extend the Indexed Lookup Eager algorithm to answer Lowest Common Ancestor (LCA) queries.
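The "smallest tree" semantics in this abstract can be made concrete with a brute-force sketch. It is not the Indexed Lookup Eager, Scan Eager, or Stack algorithm, and it is far slower than any of them. It assumes nodes carry Dewey labels stored as tuples, so that the lowest common ancestor of several nodes is the longest common prefix of their labels; the names `lca`, `is_proper_ancestor`, and `slca_bruteforce` are hypothetical.

```python
from itertools import product


def lca(labels):
    """Longest common prefix of Dewey labels (tuples of ints)."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def is_proper_ancestor(a, b):
    """True if label a is a strict prefix of label b."""
    return len(a) < len(b) and b[:len(a)] == a


def slca_bruteforce(match_lists):
    """Roots of the smallest subtrees that contain every keyword.

    match_lists holds, per keyword, the labels of nodes that directly
    contain that keyword. Every combination of one match per keyword
    is enumerated, which is exponential in the number of keywords.
    """
    candidates = {lca(combo) for combo in product(*match_lists)}
    # Keep a candidate only if no other candidate lies strictly below it.
    return sorted(
        c for c in candidates
        if not any(is_proper_ancestor(c, d) for d in candidates)
    )


# Hypothetical document: root (0,), with keyword matches listed below.
xml_matches = [(0, 0, 0), (0, 1, 0)]
search_matches = [(0, 0, 1), (0, 2)]
print(slca_bruteforce([xml_matches, search_matches]))  # [(0, 0)]
```

In this example the root (0,) also contains both keywords, but it is discarded because its descendant (0, 0) already does, which is the sense in which a tree is "smallest".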
Conference Paper
Keyword search over XML documents has been widely studied in recent years. It allows users to retrieve relevant data from XML documents without learning complicated query languages. SLCA (smallest lowest common ancestor)-based keyword search is a common mechanism to locate the desirable LCAs for the given query keywords, but the conventional SLCA-based keyword search is for AND-only semantics. In this paper, we extend the SLCA keyword search to a more general case, where the keyword query could be an arbitrary combination of AND, OR, and NOT operators. We further define the query result based on the monotonicity and consistency properties, and propose an efficient algorithm to figure out the SLCAs and the relevant matches. Since the keyword query becomes more complex, we also discuss the variations of the monotonicity and consistency properties in our framework. Finally, the experimental results show that the proposed algorithm runs efficiently and gives reasonable query results by measuring the processing time, scalability, precision, and recall.
Article
Keyword search is a user-friendly mechanism for retrieving XML data in web and scientific applications. An intuitively compelling but vaguely defined goal is to identify matches to query keywords that are relevant to the user. However, it is hard to directly evaluate the relevance of query results due to the inherent ambiguity of search semantics. In this work, we investigate an axiomatic framework that includes two intuitive and non-trivial properties that an XML keyword search technique should ideally satisfy: monotonicity and consistency, with respect to data and query. This is the first work that reasons about keyword search strategies from a formal perspective. Then we propose a novel semantics for identifying relevant matches, which, to the best of our knowledge, is the only existing algorithm that satisfies both properties. An efficient algorithm is designed for realizing this semantics. Extensive experimental studies have verified the intuition of the properties and shown the effectiveness of the proposed algorithm.