Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database

Information Retrieval (Impact Factor: 0.92). 11/2005; 8(4):571-600. DOI: 10.1007/s10791-005-0748-1
Source: arXiv


This paper investigates the impact of three approaches to XML retrieval: using Zettair, a full-text information retrieval system; using eXist, a native XML database; and using a hybrid system that takes full article answers from Zettair and uses eXist to extract elements from those articles. For the content-only topics, we undertake a preliminary analysis of the INEX 2003 relevance assessments in order to identify the types of highly relevant document components. Further analysis identifies two complementary sub-cases of relevance assessments (General and Specific) and two categories of topics (Broad and Narrow). We develop a novel retrieval module that, for a content-only topic, utilises the information from the resulting answer list of a native XML database and dynamically determines the preferable units of retrieval, which we call Coherent Retrieval Elements. The results of our experiments show that, when each of the three systems is evaluated against different retrieval scenarios (such as different cases of relevance assessments, different topic categories and different choices of evaluation metrics), the XML retrieval systems exhibit varying behaviour and the best performance can be reached for different values of the retrieval parameters. In the case of INEX 2003 relevance assessments for the content-only topics, our newly developed hybrid XML retrieval system is substantially more effective than either Zettair or eXist, and yields robust and very effective XML retrieval.
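The two-stage idea behind the hybrid system (rank whole articles first, then extract elements only from the top-ranked articles) can be illustrated with a minimal sketch. It is not the paper's implementation and does not call Zettair or eXist: the toy collection, the term-frequency scoring, and the function names `rank_articles`, `extract_elements` and `hybrid_search` are all assumptions made for illustration, and the Coherent Retrieval Elements module is not modelled.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy collection: article id -> XML source.
ARTICLES = {
    "a1": "<article><title>xml retrieval</title><sec>"
          "<p>hybrid xml retrieval systems</p><p>evaluation metrics</p>"
          "</sec></article>",
    "a2": "<article><title>relational databases</title><sec>"
          "<p>sql joins</p></sec></article>",
    "a3": "<article><title>xml storage</title><sec>"
          "<p>native xml database</p></sec></article>",
}


def text_of(elem):
    """Concatenate all text inside an element, lower-cased."""
    return " ".join(elem.itertext()).lower()


def score(text, terms):
    """Assumed scoring: total number of query-term occurrences."""
    counts = Counter(text.split())
    return sum(counts[t] for t in terms)


def rank_articles(articles, terms, k):
    """Stage 1 (stand-in for a full-text engine): top-k whole articles."""
    scored = []
    for aid, source in articles.items():
        s = score(text_of(ET.fromstring(source)), terms)
        if s > 0:
            scored.append((s, aid))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [aid for _, aid in scored[:k]]


def extract_elements(articles, article_ids, terms):
    """Stage 2 (stand-in for an XML database): matching elements,
    taken only from the articles selected in stage 1."""
    answers = []
    for aid in article_ids:
        root = ET.fromstring(articles[aid])
        for elem in root.iter():  # document order, root included
            s = score(text_of(elem), terms)
            if s > 0:
                answers.append((aid, elem.tag, s))
    return answers


def hybrid_search(articles, terms, k):
    """Chain the two stages: article ranking, then element extraction."""
    return extract_elements(articles, rank_articles(articles, terms, k), terms)


if __name__ == "__main__":
    for answer in hybrid_search(ARTICLES, ["xml", "retrieval"], k=1):
        print(answer)
```

In this sketch the first stage restricts the search space, so the element-level stage never returns answers from articles that the article-level ranking did not select. How the resulting elements are grouped into preferable units of retrieval is a separate step that is left out here.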



Available from: James A. Thom
  • Source
    • "The hybrid system and the CRE retrieval module we use in this paper extend the system and the module we previously proposed and evaluated for the INEX 2003 CO retrieval topics [6] "
    ABSTRACT: Three approaches to content-and-structure XML retrieval are analysed in this paper: first, using Zettair, a full-text information retrieval system; second, using eXist, a native XML database; and third, using a hybrid XML retrieval system that uses eXist to produce the final answers from likely relevant articles retrieved by Zettair. INEX 2003 content-and-structure topics can be classified in two categories: the first retrieving full articles as final answers, and the second retrieving more specific elements within articles as final answers. We show that for both topic categories our initial hybrid system improves the retrieval effectiveness of a native XML database. For ranking the final answer elements, we propose and evaluate a novel retrieval model that utilises the structural relationships between the answer elements of a native XML database and retrieves Coherent Retrieval Elements. The final results of our experiments show that, when the XML retrieval task focusses on highly relevant elements, our hybrid XML retrieval system with the Coherent Retrieval Elements module is 1.8 times more effective than Zettair and 3 times more effective than eXist, and yields effective content-and-structure XML retrieval.
  • Source
    • "In contrast, no relevance judgements are available for topic B2, while data from around 18 users was collected for each of the topics B2 and C1. Previous work has also shown that XML retrieval systems exhibit varying behaviour when their performance is evaluated against different CO topic categories [7] [15]. It is then reasonable to expect that the level of agreement between the assessor and the users, which concerns the choice of the best units of retrieval, may depend on the topic category. "
    ABSTRACT: The main aspects of XML retrieval are identified by analysing and comparing the following two behaviours: the behaviour of the assessor when judging the relevance of returned document components; and the behaviour of users when interacting with components of XML documents. We argue that the two INEX relevance dimensions, Exhaustivity and Specificity, are not orthogonal dimensions; indeed, an empirical analysis of each dimension reveals that the grades of the two dimensions are correlated with each other. By analysing the level of agreement between the assessor and the users, we aim to identify the best units of retrieval. The results of our analysis show that the highest level of agreement is on highly relevant and on non-relevant document components, suggesting that only the end points of the INEX 10-point relevance scale are perceived in the same way by both the assessor and the users. We propose a new definition of relevance for XML retrieval and argue that its corresponding relevance scale would be a better choice for INEX.
  • Source
    • "In this paper will refer to XED and NXD of this calibre and hence will use XEDBMS and NXDBMS respectively. There are quite a number of XEDBMS like Berkeley DB XML [Burd and Staken 2005], DB2 9 Express-C No Charge PureXML Hybrid Data Server [IBM 2006] etc. and numerous implementations of NXDBMS like eXist [Pehcevski et al. 2005], 4suite [Olson 2000], Sedna [Aznauryan et al. 2006], Xindice [Gabillon 2004; Sattler et al. 2005], TIMBER [Jagadish et al. 2002], Natix [Fiebig et al. 2002] etc. An attempt to include quality metrics in an NXD requires a thorough understanding of its architecture and interfaces. "
    ABSTRACT: As XML is being widely adopted as a data and object exchange format for both structured and semi-structured data, the need for quality control and measurement is only to be expected. This can be attributed to the increase in the need for data quality metrics in traditional databases over the past decade. The traditional model provides constraint mechanisms and features to control quality defects, but unfortunately these methods are not foolproof. This report reviews work on data quality in both database and management research areas. The review includes (i) an exploration of the notion of data quality, its definitions, metrics, control and improvement in data and information sets, and (ii) an investigation of the techniques used in traditional databases, such as relational and object databases, where most focus and resources have been directed. In spite of the wide adoption of XML data since its inception, the exploration not only shows a huge gap between research on data quality in relational databases and in XML databases, but also shows how little support database systems provide in giving a measure of the quality of the data they hold. This motivates the need to formalise mechanisms and techniques for embedding data quality control and metrics into XML data sets. It also presents the viability of a process-based approach to data quality measurement with suitable techniques, applicable in dynamic decision environments with multidimensional data and heterogeneous sources. This will involve modelling the interdependencies and categories of the attributes of data quality, generally referred to as data quality dimensions, and the adoption of formal means such as process algebra, fuzzy logic and any other appropriate approaches. The attempt is contextualised using the healthcare domain, as it bears all the required characteristics.