Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database

Information Retrieval (Impact Factor: 0.63). 11/2005; 8(4):571-600. DOI: 10.1007/s10791-005-0748-1
Source: arXiv

ABSTRACT: This paper investigates the impact of three approaches to XML retrieval: using Zettair, a full-text information retrieval system; using eXist, a native XML database; and using a hybrid system that takes full article answers from Zettair and uses eXist to extract elements from those articles. For the content-only topics, we undertake a preliminary analysis of the INEX 2003 relevance assessments in order to identify the types of highly relevant document components. Further analysis identifies two complementary sub-cases of relevance assessments (General and Specific) and two categories of topics (Broad and Narrow). We develop a novel retrieval module that, for a content-only topic, utilises the information from the resulting answer list of a native XML database and dynamically determines the preferable units of retrieval, which we call Coherent Retrieval Elements. The results of our experiments show that, when each of the three systems is evaluated against different retrieval scenarios (such as different cases of relevance assessments, different topic categories and different choices of evaluation metrics), the XML retrieval systems exhibit varying behaviour and the best performance is reached at different values of the retrieval parameters. In the case of INEX 2003 relevance assessments for the content-only topics, our newly developed hybrid XML retrieval system is substantially more effective than either Zettair or eXist, and yields robust and very effective XML retrieval.
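The two-stage idea in the abstract (rank whole articles first, then extract elements from the top-ranked articles) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the term-count score is only a stand-in for a full-text engine's ranking and is not Zettair's, the element matcher is only a stand-in for a query against a native XML database and is not eXist's, and the names `tokenize`, `article_score`, `matching_elements`, `hybrid_search` and `SAMPLE_ARTICLES` are hypothetical. It does not implement the paper's Coherent Retrieval Elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter

PUNCT = ".,;:()"

def tokenize(text):
    """Lower-case whitespace tokens with surrounding punctuation stripped."""
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]

def article_score(root, terms):
    """Stage 1 stand-in: score a whole article by summed query-term counts."""
    counts = Counter(tokenize(" ".join(root.itertext())))
    return sum(counts[t] for t in terms)

def matching_elements(root, terms):
    """Stage 2 stand-in: tags of elements whose text contains every query term."""
    hits = []
    for elem in root.iter():  # document order, root included
        words = set(tokenize(" ".join(elem.itertext())))
        if all(t in words for t in terms):
            hits.append(elem.tag)
    return hits

def hybrid_search(articles, query, top_k=1):
    """Rank full articles, then extract matching elements from the top_k only."""
    terms = tokenize(query)
    parsed = {name: ET.fromstring(xml) for name, xml in articles.items()}
    ranked = sorted(parsed, key=lambda n: article_score(parsed[n], terms),
                    reverse=True)
    return [(name, matching_elements(parsed[name], terms))
            for name in ranked[:top_k]]

# Hypothetical toy collection, not taken from the INEX test collection.
SAMPLE_ARTICLES = {
    "a1": "<article><sec><p>XML retrieval with a native XML database.</p></sec>"
          "<sec><p>Unrelated text.</p></sec></article>",
    "a2": "<article><sec><p>Cooking pasta.</p></sec></article>",
}

if __name__ == "__main__":
    print(hybrid_search(SAMPLE_ARTICLES, "xml retrieval"))
```

Restricting element extraction to the top-ranked articles keeps the second stage cheap. Choosing which of the matching elements to return is the part the abstract attributes to its retrieval module, and the sketch makes no attempt at it.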

  • ABSTRACT: In today's world of the Internet and the Web, enormous quantities of information in various forms and on diverse subjects have become available to users. The information that is available can be categorized into three classes: structured data that follows a regular structure (e.g. database records), unstructured data (e.g. flat-files containing textual information), and semi-structured data, which is positioned in between and combines the benefits of structured and unstructured data. Semi-structured data is typically marked up using XML. XML supports user-defined markup. This markup usually imposes a hierarchical, semantic structure on documents and permits systems to retrieve the most relevant elements (or portions) of documents, rather than complete documents or web pages. Thus, XML is a paradigm that holds the promise of meeting users' demands for precise information at the right granularity with as little irrelevant material as possible. The problem of retrieval from XML repositories has, therefore, attracted a great deal of attention in recent times, from both the Database and the Information Retrieval (IR) research communities. This survey provides an overview of research done in the area of XML retrieval from the IR angle. Much of this research has been carried out under the aegis of the INitiative for the Evaluation of XML retrieval (INEX). INEX provides a framework where participating researchers can evaluate their retrieval techniques using standardised test collections and uniform scoring procedures, and discuss the results. Various XML retrieval tasks have been studied at INEX, with the ad-hoc retrieval task being arguably the most important. The ad-hoc task is intended to model a situation in which a user submits a query (representing a one-time or casual information need) to a system, which then tries to retrieve document elements or passages from within the text collection that are most relevant to the user's information need. Other tasks at INEX include document mining, entity ranking, book search, automatic hyperlinking (link-the-wiki), etc. Our primary focus in this survey is on the ad-hoc task. We first describe the query languages that have been used at INEX. We also discuss the evolution of evaluation metrics over the years at INEX. Next, we describe retrieval approaches based on various IR models (such as the Vector Space Model, the Probabilistic Model, the Language Modeling approach, the Logistic Regression approach, etc.), as well as techniques for improving retrieval effectiveness such as relevance feedback. We also briefly review implementation techniques that are intended to enable efficient element retrieval. We conclude by indicating issues of current interest, and potential directions for further research.
  • ABSTRACT: With the rapid emergence of XML as a data exchange standard over the Web, storing and querying XML data have become critical issues. The two main approaches to storing XML data are (1) to employ traditional storage such as relational or object-oriented databases, and (2) to create an XML-specific native storage. The storage representation affects the efficiency of query processing. In this paper, we first review the two approaches for storing XML data. Second, we review various query optimization techniques, such as indexing, labeling and join algorithms, that enhance query processing in both approaches. Finally, we suggest an indexing classification scheme and discuss some of the current trends in indexing methods, which indicate a clear shift towards hybrid indexing.
    Knowl.-Based Syst. 01/2011; 24:1317-1340.
  • ABSTRACT: This paper specifically addresses the effectiveness of our theoretically based two-dimensional retrieval model for searching semantically synchronized media streams. Conventional IR systems, which support partial retrieval of synchronized media streams, retrieve “atomic units”, for example slides, pages and shots of underlying media streams such as presentations, electronic books, and lecture videos respectively. In contrast, our model is based upon the concept of an extended retrieval unit and thus retrieves dynamically integrated media streams comprising several atomic units both along and across media streams. In this paper, in addition to reviewing our model, we describe the system implementation that we developed for conducting experiments on several real-world datasets built from scratch. We then present extensive empirical results which demonstrate that our system outperforms (a) conventional systems based on atomic retrieval units, (b) single-dimensional retrieval systems, which extend retrieval units over several atomic units of the same media, and (c) cross-media retrieval systems, which extend atomic retrieval units across several media stream units. The results thus verify our claims regarding the effectiveness of our two-dimensional retrieval model for retrieving meaningful units of synchronized media streams.
    Data & Knowledge Engineering 01/2013; 83:70-92. · 1.52 Impact Factor
