Book


Chapters (49)

This paper gives an overview of the INEX 2008 Ad Hoc Track. The main goals of the Ad Hoc Track were two-fold. The first goal was to investigate the value of the internal document structure (as provided by the XML mark-up) for retrieving relevant information. This is a continuation of INEX 2007 and, for this reason, the retrieval results are liberalized to arbitrary passages and measures were chosen to fairly compare systems retrieving elements, ranges of elements, and arbitrary passages. The second goal was to compare focused retrieval to article retrieval more directly than in earlier years. For this reason, standard document retrieval rankings have been derived from all runs, and evaluated with standard measures. In addition, a set of queries targeting Wikipedia has been derived from a proxy log, and the runs are also evaluated against the clicked Wikipedia pages. The INEX 2008 Ad Hoc Track featured three tasks: for the Focused Task, a ranked list of non-overlapping results (elements or passages) was required; for the Relevant in Context Task, non-overlapping results (elements or passages) were returned grouped by the article from which they came; for the Best in Context Task, a single starting point (element start tag or passage start) for each article was required. We discuss the results for the three tasks, and examine the relative effectiveness of element and passage retrieval. This is examined in the context of content only (CO, or Keyword) search as well as content and structure (CAS, or structured) search. Finally, we look at the ability of focused retrieval techniques to rank articles, using standard document retrieval techniques, both against the judged topics as well as against queries and clicks from a proxy log.
Proximity enhanced scoring models significantly improve retrieval quality in text retrieval. For XML IR, we can sometimes enhance the retrieval efficacy by exploiting knowledge about the document structure combined with established text IR methods. This paper elaborates on our approach used for INEX 2008 which modifies a proximity scoring model from text retrieval for usage in XML IR and extends it by taking the document structure information into account.
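As a rough illustration of the kind of pairwise proximity accumulation such scoring models build on (the chapter's actual model and parameters are not reproduced here), a minimal Python sketch:

    # Generic pairwise proximity accumulation: each pair of occurrences of
    # two different query terms contributes weight inversely proportional
    # to the square of their distance (an assumed kernel, for illustration).
    def proximity_score(positions, k=1.0):
        terms = list(positions)
        score = 0.0
        for i in range(len(terms)):
            for j in range(i + 1, len(terms)):
                for p in positions[terms[i]]:
                    for q in positions[terms[j]]:
                        d = abs(p - q)
                        if d > 0:
                            score += k / (d * d)  # near pairs dominate
        return score

    # "xml" occurs at token offsets 3 and 40, "retrieval" at offset 5
    print(proximity_score({"xml": [3, 40], "retrieval": [5]}))
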
This paper describes the integration of our methodology for the dynamic retrieval of XML elements [2] with traditional article retrieval to facilitate the Focused and the Relevant-in-Context Tasks of the INEX 2008 Ad Hoc Track. The particular problems that arise for dynamic element retrieval in working with text containing both tagged and untagged elements have been solved [3]. The current challenge involves utilizing its ability to produce a rank-ordered list of elements in the context of focused retrieval. Our system is based on the Vector Space Model [8]; basic functions are performed using the Smart experimental retrieval system [7]. Experimental results are reported for the Focused, Relevant-in-Context, and Best-in-Context Tasks of both the 2007 and 2008 INEX Ad Hoc Tracks. These results indicate that the goal of our 2008 investigations, namely finding good focused elements in the context of the Wikipedia collection, has been achieved.
In this work we propose new utility models for the structured information retrieval system Garnata, and present the results of our participation in the INEX'08 Ad Hoc track using this system.
This paper addresses the integration of tags into the term-weighting function for focused XML retrieval. Our model allows us to consider a certain kind of structural information: tags that represent logical structure (title, section, etc.) as well as tags related to formatting (bold font, centered text, etc.). We first take the influence of tags into account by estimating the probability that a tag distinguishes the most relevant terms. These weights are then incorporated into the term-weighting function using several combining schemes. Experiments on a large collection during the INEX 2008 XML IR evaluation campaign (INitiative for the Evaluation of XML Retrieval) showed that using tags leads to improvements in focused retrieval.
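A minimal sketch of the general idea, with assumed Laplace smoothing and an illustrative multiplicative combining scheme rather than the paper's exact formulas:

    # Tag weight = smoothed estimate of the probability that terms occurring
    # under this tag are relevant; the weight then scales the raw term count.
    def tag_weight(relevant_occurrences, total_occurrences, smoothing=1.0):
        return (relevant_occurrences + smoothing) / (total_occurrences + 2 * smoothing)

    def weighted_tf(tf, enclosing_tags, tag_weights):
        w = 1.0
        for tag in enclosing_tags:
            w *= 1.0 + tag_weights.get(tag, 0.0)  # one possible combining scheme
        return tf * w

    tag_weights = {"title": tag_weight(80, 100), "b": tag_weight(30, 100)}
    print(weighted_tf(3, ["title", "b"], tag_weights))
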
This paper reports our participation in the INEX 2008 Ad-Hoc Retrieval track. We investigated the effect of multiword terms on retrieval effectiveness in an interactive query expansion (IQE) framework. The IQE approach is compared to a state-of-the-art IR engine (in this case Indri) implementing a bag-of-words query and document representation, coupled with pseudo-relevance feedback (automatic query expansion, AQE). The performance of multiword query and document representation was enhanced when the term structure was relaxed to accept the insertion of additional words while preserving the original structure and word order. The search strategies built with multiword terms coupled with QE obtained very competitive scores in the three Ad-Hoc tasks: Focused retrieval, Relevant-in-Context and Best-in-Context.
Combining evidence of relevance coming from two sources — a keyword index and a keyphrase index — has been a fundamental part of our INEX-related experiments on XML Retrieval over the past years. In 2008, we focused on improving the quality of the keyphrase index and finding better ways to use it together with the keyword index even when processing non-phrase queries. We also updated our implementation of the word index which now uses a state-of-the-art scoring function for estimating the relevance of XML elements. Compared to the results from previous years, the improvements turned out to be successful in the INEX 2008 ad hoc track evaluation of the focused retrieval task.
Semi-structured document retrieval is becoming more popular with the increasing quantity of data available in XML format. In this paper, we describe a search engine model that exploits the structure of the document and uses language modelling and smoothing at the document and collection levels for calculating the relevance of each element from all the documents in the collection to a user query. Element priors, CAS query constraint filtering, and the +/- operators are also used in the ranking procedure. We also present the results of our participation in the INEX 2008 Ad Hoc Track.
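A minimal sketch of element scoring with linear smoothing across element, document, and collection language models; the mixing weights and estimation details here are assumptions, not the paper's settings:

    import math

    def p_mle(term, tf, length):
        return tf.get(term, 0) / length if length else 0.0

    # Score an element by the smoothed log-likelihood of the query, mixing
    # element, document, and collection language models.
    def score_element(query, elem, doc, coll, lam_e=0.5, lam_d=0.3):
        lam_c = 1.0 - lam_e - lam_d
        score = 0.0
        for t in query:
            p = (lam_e * p_mle(t, elem["tf"], elem["len"]) +
                 lam_d * p_mle(t, doc["tf"], doc["len"]) +
                 lam_c * p_mle(t, coll["tf"], coll["len"]))
            score += math.log(p) if p > 0 else -1e9
        return score

    coll = {"tf": {"xml": 50, "search": 80}, "len": 10000}
    doc = {"tf": {"xml": 5, "search": 2}, "len": 300}
    elem = {"tf": {"xml": 3}, "len": 40}
    print(score_element(["xml", "search"], elem, doc, coll))
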
This paper describes the work that we did at Indian Statistical Institute towards XML retrieval for INEX 2008. Besides the Vector Space Model (VSM) that we have been using since INEX 2006, this year we implemented the Language Modeling (LM) approach in our text retrieval system (SMART) to retrieve XML elements against the INEX Adhoc queries. Like last year, we considered Content-Only (CO) queries and submitted three runs for the FOCUSED sub-task. Two runs are based on the Vector Space Model and one uses the Language Model. One of the VSM-based runs (VSMfbElts0.4) retrieves sub-document-level elements. Both the other runs (VSMfb and LM-nofb-0.20) retrieve elements only at the whole-document level. We applied blind feedback for both the VSM-based runs; no query expansion was used in the LM-based run. In general, the relative performance of our document-level runs is respectable (ranked 15/61 and 22/61 according to the official metric). Though our element retrieval run does reasonably (ranked 16/61 by iP[0.01]) according to the early-precision metrics, we think there is plenty of scope to improve our element retrieval strategy. Our immediate next task is therefore to focus on how to improve true element-level retrieval.
We present in this paper the work of the Information Retrieval Modeling Group (MRIM) of the Computer Science Laboratory of Grenoble (LIG) at the INEX 2008 Ad Hoc Track. We study here the use of non-structural relations between document elements (doxels) in conjunction with document/doxel structural relationships. The non-structural links between doxels of the collection come from the collectionlink doxels. We characterize the non-structural relations with relative exhaustivity and specificity scores. Results of experiments on the test collection are presented. Our best run is in the top 5 for iP[0.01] values for the Focused Task.
At INEX 2008 we presented SPIRIX, a Peer-to-Peer search engine developed to investigate distributed XML-Retrieval. Such investigations have been neglected by INEX so far: while there is a variety of successful and effective XML-Retrieval approaches, all current solutions are centralized search engines. They do not consider distributed scenarios, where it is undesired or impossible to hold the whole collection on one single machine. Such scenarios include search in large-scale collections, where the load of computation and storage is too high for one server. Other scenarios involve different owners of heterogeneous collections who are willing to share their documents without giving up full control over them by uploading them to a central server. Currently, there are research solutions for distributed text retrieval and multimedia retrieval. With INEX and innovative techniques for exploiting XML structure, it is now time to extend research to distributed XML-Retrieval. This paper reports on SPIRIX's performance at INEX'08.
This paper provides an overview of the INEX 2008 Book Track. Now in its second year, the track aimed at broadening its scope by investigating topics of interest in the fields of information retrieval, human computer interaction, digital libraries, and eBooks. The main topics of investigation were defined around challenges for supporting users in reading, searching, and navigating the full texts of digitized books. Based on these themes, four tasks were defined: 1) The Book Retrieval task aimed at comparing traditional and book-specific retrieval approaches, 2) the Page in Context task aimed at evaluating the value of focused retrieval approaches for searching books, 3) the Structure Extraction task aimed to test automatic techniques for deriving structure from OCR and layout information, and 4) the Active Reading task aimed to explore suitable user interfaces for eBooks enabling reading, annotation, review, and summary across multiple books. We report on the setup and results of each of these tasks.
We present here the XRCE participation in the Structure Extraction task of the INEX Book track. After briefly explaining the method used for detecting tables of contents and their corresponding entries in the book body, we mainly discuss the evaluation and the main issues we faced, and finally we propose improvements for our method as well as for the evaluation framework and method.
In this paper, we describe University of Waterloo's approaches to the Adhoc, Book, and Link-the-Wiki tracks. For the Adhoc track, we submitted runs for all the tasks: the Focused, the Relevant-in-Context, and the Best-in-Context tasks. The results show that we ranked first among all participants for each task, by the simple scoring of elements using Okapi BM25. In the Book track, we participated in the Book Retrieval and the Page-in-Context tasks, using the approaches we used in the Adhoc track. We attribute our poor performance to lack of training. In the Link-the-Wiki track, we submitted runs for both the File-to-File and Anchor-to-BEP tasks, using PageRank [1] algorithms on top of our previous year's algorithms, which had yielded high performance. The results indicate that our baseline approaches work best, although the other approaches have room for improvement.
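A minimal sketch of Okapi BM25 applied to elements rather than whole documents, in the spirit of the simple element scoring described above; k1 and b are common defaults, not necessarily the submitted runs' values:

    import math

    # elem_tf: term frequencies in the element; df: element-level document
    # frequencies; n_elems: number of indexed elements.
    def bm25(query, elem_tf, elem_len, avg_len, df, n_elems, k1=1.2, b=0.75):
        score = 0.0
        for t in query:
            tf = elem_tf.get(t, 0)
            if tf == 0 or t not in df:
                continue
            idf = math.log((n_elems - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * elem_len / avg_len))
        return score

    print(bm25(["focused", "retrieval"], {"focused": 2, "retrieval": 1},
               elem_len=120, avg_len=150,
               df={"focused": 900, "retrieval": 4000}, n_elems=100000))
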
Document retrieval techniques have proven to be competitive methods in the evaluation of focused retrieval. Although focused approaches such as XML element retrieval and passage retrieval allow for locating the relevant text within a document, using the larger context of the whole document often leads to superior document level ranking. In this paper we investigate the impact of using the document retrieval ranking in two collections used in the INEX 2008 Ad hoc and Book Tracks; the relatively short documents of the Wikipedia collection and the much longer books in the Book Track collection. We experiment with several methods of combining document and element retrieval approaches. Our findings are that 1) we can get the best of both worlds and improve upon both individual retrieval strategies by retaining the document ranking of the document retrieval approach and replacing the documents by the retrieved elements of the element retrieval approach, and 2) using document level ranking has a positive impact on focused retrieval in Wikipedia, but has more impact on the much longer books in the Book Track collection.
For this year's INEX, UC Berkeley focused on the Book track and also submitted two runs for the Adhoc Focused Element search task and one for the Best in Context task. For all of these runs we used the TREC2 logistic regression probabilistic model. For the Adhoc Element runs and Best in Context runs we used the "pivot" score merging method to combine paragraph-level searches with scores for document-level searches.
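One plausible reading of such a pivot merge, as a hedged sketch (the 0.7 pivot value is an assumption, not the submitted runs' setting):

    # Final element score interpolates the paragraph score with the score of
    # the containing document; the pivot value of 0.7 is an assumption.
    def pivot_merge(para_score, doc_score, pivot=0.7):
        return pivot * doc_score + (1.0 - pivot) * para_score

    print(pivot_merge(para_score=2.4, doc_score=5.1))
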
Scanned and then OCRed documents usually lack detailed layout and structural information. We present a book-specific layout analysis system used to extract TOC structure information from scanned and OCRed books. This system was used for navigation purposes by the Live Book Search project. We provide a labeling scheme for the TOC sections of the books, a high-level overview of the book layout analysis system, as well as the TOC Structure Extraction Engine. Finally, we present accuracy measurements of this system on a representative test set.
This paper describes the RMIT group's participation in the book retrieval task of the INEX Book track in 2008. Our results suggest that for the book retrieval task, using a page-based index and ranking books based on the number of pages retrieved may be more effective than directly indexing and ranking whole books. Book search is an important track at INEX: books are generally much larger documents than the scientific articles or the Wikipedia pages that have been used in the main ad hoc track at INEX. The book corpus as provided by Microsoft Live Book Search and the Internet Archive contains 50,239 digitized out-of-copyright books. The contents of books are marked up in an XML format called BookML (1). The size of this XML marked-up corpus is about 420 GB. With book retrieval, structure is likely to play a much more important role than in retrieval from collections of shorter documents. This is the first year of RMIT's participation in the book search track at INEX, and we explore the effectiveness of book retrieval by experimenting with different parameters, namely the length of queries and the length of documents being indexed and retrieved. We begin by describing our approach in the next section, which is followed by our results, and then the conclusion.
This paper presents an overview of the Efficiency Track that was newly introduced to INEX in 2008. The new INEX Efficiency Track is intended to provide a common forum for the evaluation of both the effectiveness and efficiency of XML ranked retrieval approaches on real data and real queries. As opposed to the purely synthetic XMark or XBench benchmark settings that are still prevalent in efficiency-oriented XML retrieval tasks, the Efficiency Track continues the INEX tradition using a rich pool of manually assessed relevance judgments for measuring retrieval effectiveness. Thus, one of the main goals is to attract more groups from the DB community to INEX, being able to study effectiveness/efficiency trade-offs in XML ranked retrieval for a broad audience from both the DB and IR communities. The Efficiency Track significantly extends the Ad-Hoc Track by systematically investigating different types of queries and retrieval scenarios, such as classic ad-hoc search, high-dimensional query expansion settings, and queries with a deeply nested structure (with all topics being available in both the NEXI-style CO and CAS formulations, as well as in their XPath 2.0 Full-Text counterparts).
A common approach for developing XML element retrieval systems is to adapt text retrieval systems to retrieve elements from documents. Two key challenges in this approach are to effectively score structural queries and to control overlap in the output across different search tasks. In this paper, we continue our research into the use of navigation models for element scoring as a way to represent the user’s preferences for the structure of retrieved elements. Our goal is to improve search systems using structural scoring by boosting the score of desirable elements and to post-process results to control XML overlap. This year we participated in the Ad-hoc Focused, Efficiency, and Entity Ranking Tracks, where we focused our attention primarily on the effectiveness of small navigation models. Our experiments involved three modifications to our previous work; (i) using separate summaries for boosting and post-processing, (ii) introducing summaries that are generated from user study data, and (iii) confining our results to using small models. Our results suggest that smaller models can be effective but more work needs to be done to understand the cases where different navigation models may be appropriate.
The paper describes the submissions of CWI and the University of Twente to the Efficiency and Entity Ranking tracks of INEX 2008. With the INEX participation, we demonstrate and evaluate the functionality of our open source XML retrieval system PF/Tijah.
This paper reports the results of experiments with our approach to using the vector space model for retrieving large-scale XML data. The purposes of the experiments are to improve retrieval precision on the INitiative for the Evaluation of XML Retrieval (INEX) 2008 Adhoc Track, and to compare the retrieval time of our system with that of other systems on the INEX 2008 Efficiency Track. For the INEX 2007 Adhoc Track, we developed a system using a relative inverted-path (RIP) list and a bottom-up approach. The system achieved reasonable retrieval time for XML data. However, the system has room for improvement in terms of retrieval precision. Therefore, for INEX 2008, the system uses CAS titles and Pseudo Relevance Feedback (PRF) to improve retrieval precision.
For the INEX Efficiency Track 2008, we were just on time to finish and evaluate our brand-new TopX 2.0 prototype. Complementing our long-running effort on efficient top-k query processing on top of a relational back-end, we now switched to a compressed object-oriented storage for text-centric XML data with direct access to customized inverted files, along with a complete reimplementation of the engine in C++. Our INEX 2008 experiments demonstrate efficiency gains of up to a factor of 30 compared to the previous Java/JDBC-based TopX 1.0 implementation over a relational back-end. TopX 2.0 achieves overall runtimes of less than 51 seconds for the entire batch of 568 Efficiency Track topics in their content-and-structure (CAS) version and less than 29 seconds for the content-only (CO) version, respectively, using a top-15, focused (i.e., non-overlapping) retrieval mode—an average of merely 89 ms per CAS query and 49 ms per CO query.
When applying XML-Retrieval in a distributed setting, efficiency issues have to be considered, e.g. reducing the network traffic involved in answering a given query. The new Efficiency Track of INEX gave us the opportunity to explore the possibility of improving both effectiveness and efficiency by exploiting structural similarity. We ran some of the track's highly structured queries on our top-k search engine to analyze the impact of various structural similarity functions. We applied those functions first to the ranking and, based on that, to the query routing process. Our results indicate that detection of structural similarity can be used to reduce the number of messages sent between distributed nodes and thus make the search more efficient.
In many contexts a search engine user would prefer to retrieve entities instead of just documents. Example queries include “Italian nobel prize winners”, “Formula 1 drivers that won the Monaco Grand Prix”, or “German spoken Swiss cantons”. The XML Entity Ranking (XER) track at INEX creates a discussion forum aimed at standardizing evaluation procedures for entity retrieval. This paper describes the XER tasks and the evaluation procedure used at the XER track in 2008, focusing specifically on the sampled pooling strategy applied first this year. We conclude with a brief discussion of the predominant participant approaches and their effectiveness.
Entity Ranking is a recently emerging search task in Information Retrieval. In Entity Ranking the goal is not finding documents matching the query words, but instead finding entities which match those requested in the query. In this paper we focus on the Wikipedia corpus, interpreting it as a set of entities, and propose algorithms for finding entities based on their structured representation for three different search tasks: entity ranking, list completion, and entity relation search. The main contribution is a methodology for indexing entities using a structured representation. Our approach focuses on creating an index of facts about entities for the different search tasks. Moreover, we use the category structure information to improve the effectiveness of the List Completion task.
In this paper, we propose two methods to adapt language modeling methods for expert search to the INEX entity ranking task. In our experiments, we notice that language modeling methods for expert search, if directly applied to the INEX entity ranking task, cannot effectively distinguish entity types. Thus, our proposed methods aim at resolving this problem. First, we propose a method to take into account the INEX category query field. Second, we use an interpolation of two language models to rank entities, which can solely work on the text query. Our experiments indicate that both methods can effectively adapt language modeling methods for expert search to the INEX entity ranking task.
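A minimal sketch of interpolating two language models to score an entity, assuming one model estimated from the entity's own text and one from documents mentioning it; alpha and the probabilities are toy assumptions:

    import math

    # p_desc: term probabilities from the entity's description; p_mention:
    # from documents mentioning the entity. Floors avoid log(0).
    def entity_score(query, p_desc, p_mention, alpha=0.6, floor=1e-6):
        return sum(math.log(alpha * p_desc.get(t, floor) +
                            (1 - alpha) * p_mention.get(t, floor))
                   for t in query)

    print(entity_score(["nobel", "laureate"],
                       p_desc={"nobel": 0.02, "laureate": 0.01},
                       p_mention={"nobel": 0.005, "laureate": 0.02}))
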
In this paper we describe our participation in the INEX Entity Ranking track. We explored the relations between Wikipedia pages, categories and links. Our approach is to exploit both category and link information. Category information is used by calculating distances between document categories and target categories. Link information is used for relevance propagation and in the form of a document link prior. Both sources of information have value, but using category information leads to the biggest improvements.
Entity ranking has recently emerged as a research field that aims at retrieving entities as answers to a query. Unlike entity extraction where the goal is to tag the names of the entities in documents, entity ranking is primarily focused on returning a ranked list of relevant entity names for the query. Many approaches to entity ranking have been proposed, and most of them were evaluated on the INEX Wikipedia test collection. In this paper, we show that the knowledge of predicted classes of topic difficulty can be used to further improve the entity ranking performance. To predict the topic difficulty, we generate a classifier that uses features extracted from an INEX topic definition to classify the topic into an experimentally pre-determined class. This knowledge is then utilised to dynamically set the optimal values for the retrieval parameters of our entity ranking system. Our experiments suggest that topic difficulty prediction is a promising approach that could be exploited to improve the effectiveness of entity ranking.
We describe our participation in the INEX 2008 Entity Ranking track. We develop a generative language modeling approach for the entity ranking and list completion tasks. Our framework comprises the following components: (i) entity and (ii) query language models, (iii) entity prior, (iv) the probability of an entity for a given category, and (v) the probability of an entity given another entity. We explore various ways of estimating these components, and report on our results. We find that improving the estimation of these components has very positive effects on performance, yet, there is room for further improvements.
This paper presents the organization of the INEX 2008 Interactive Track. In this year's iTrack we aimed at exploring the value of element retrieval for two different task types: fact-finding and research tasks. Two research groups collected data from 29 test persons, each performing two tasks. We describe the methods used for data collection and the tasks performed by the participants. A general result indicates that test persons were more satisfied when completing the research tasks than the fact-finding tasks. In our experiment, test persons regarded the research tasks as easier, were more satisfied with the search results, and found more relevant information for the research tasks.
The Link the Wiki track at INEX 2008 offered two tasks: file-to-file link discovery and anchor-to-BEP link discovery. In the former, 6600 topics were used; in the latter, 50 were used. Manual assessment of the anchor-to-BEP runs was performed using a tool developed for the purpose. Runs were evaluated using standard precision and recall measures such as MAP and precision/recall graphs. 10 groups participated and the approaches they took are discussed. Final evaluation results for all runs are presented.
In this paper, we discuss our participation in the INEX 2008 Link-the-Wiki track. We utilized a sliding-window-based algorithm to extract frequent terms and phrases. Using the extracted phrases and terms as descriptive vectors, the anchors and relevant links (both incoming and outgoing) are recognized efficiently.
This paper describes the runs that I submitted to the INEX 2008 Link-the-Wiki track. I participated in the incoming File-to-File and the outgoing Anchor-to-BEP tasks. For the File-to-File task I used a generic IR engine and constructed queries based on the title, keywords, and keyphrases of the Wikipedia article. My runs performed well for this task achieving the highest precision for low recall levels. Further post-hoc experiments showed that constructing queries using titles only produced even better results than the official submissions. For the Anchor-to-BEP task, I used a keyphrase extraction engine developed in-house and I filtered the keyphrases using existing Wikipedia titles. Unfortunately, my runs performed poorly compared to those of other groups. I suspect that this was the result of using many phrases that were not central to articles as anchors.
This paper describes the Link-the-Wiki submission of Lycos Europe. We try to learn suitable anchor texts by looking at the anchor texts the Wikipedia authors used. Disambiguation is done by using textual similarity and also by checking whether a set of link targets "makes sense" together.
Automatically linking Wikipedia pages can be done either content-based, by exploiting word similarities, or structure-based, by exploiting characteristics of the link graph. Our approach focuses on a content-based strategy that detects Wikipedia titles as link candidates and selects the most relevant ones as links. The relevance calculation is based on the context, i.e. the surrounding text of a link candidate. Our goal was to evaluate the influence of the link context on selecting relevant links and determining a link's best entry point. Results show that a whole Wikipedia page provides the best context for resolving links, and that straightforward inverse-document-frequency-based scoring of anchor texts achieves around 4% lower Mean Average Precision on the provided data set.
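A hedged sketch of straightforward inverse-document-frequency scoring of anchor-text candidates (toy counts, not the paper's data):

    import math

    # Rank candidate anchors by the summed idf of their terms: rarer terms
    # make a title match more likely to be a worthwhile link.
    def idf_anchor_score(anchor_terms, df, n_docs):
        return sum(math.log(n_docs / df[t]) for t in anchor_terms if t in df)

    df = {"monaco": 1200, "grand": 30000, "prix": 5000}
    print(idf_anchor_score(["monaco", "grand", "prix"], df, n_docs=660000))
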
This paper describes our participation in the INEX 2008 Link the Wiki track. We focused on the file-to-file task and submitted three runs, which were designed to compare the impact of different features on link generation. For outgoing links, we introduce the anchor likelihood ratio as an indicator for anchor detection, and explore two types of evidence for target identification, namely the title field evidence and the topic article content evidence. We find that the anchor likelihood ratio is a useful indicator for anchor detection, and that in addition to the title field evidence, re-ranking with the topic article content evidence is effective for improving target identification. For incoming links, we use an exact match approach and a retrieval method based on the language modeling approach, and find that the exact match approach works best. On top of that, our experiments show that the semantic relatedness between Wikipedia articles also has a certain ability to indicate links.
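One plausible form of an anchor likelihood ratio, sketched under the assumption that it compares how often a phrase is used as link anchor text with how often it occurs at all:

    # Ratio of a phrase's occurrences as link anchor text to all of its
    # occurrences in the collection; high values suggest good anchors.
    def anchor_likelihood(phrase, anchor_counts, text_counts):
        total = text_counts.get(phrase, 0)
        return anchor_counts.get(phrase, 0) / total if total else 0.0

    print(anchor_likelihood("information retrieval",
                            anchor_counts={"information retrieval": 950},
                            text_counts={"information retrieval": 4200}))
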
The University of Otago submitted three element runs and three passage runs to the Relevant-in-Context task of the ad hoc track. The best Otago run was a whole-document run placing 7th. The best Otago passage run placed 13th while the best Otago element run placed 31st. There were a total of 40 runs submitted to the task. The ad hoc result reinforced our prior belief that passages are better answers than elements and that the most important aspect of focused retrieval is the identification of relevant documents. Six runs were submitted to the Link-the-Wiki track. At the time of writing the results had not been published.
In this paper, we describe the methods taken by CSIR in the INEX 2008 Link-the-Wiki track. For incoming link detection, we use p(d|t), the probability of generating a document given the topic file, to judge which documents are proper link sources for the given topic. For the file-to-file task of outgoing link detection, we take a two-step approach: first, we identify a group of candidate target documents by literally matching the topic file title against document content; then, candidate documents are ranked by their number of incoming links. For the anchor-to-BEP task, we use p(d|a,t), the probability of generating a document given the topic file and an anchor name, to select anchors and link targets for a given topic.
Link detection can be seen as a special application of Focused Retrieval. This paper presents a content-based link detection approach using the Vector Space Model. We present our results, and conclude by discussing the merits and deficiencies of our approach.
We describe here the XML Mining Track at INEX 2008. This track was launched for exploring two main ideas: first, identifying key problems for mining semi-structured documents and new challenges of this emerging field; and second, studying and assessing the potential of machine learning techniques for dealing with generic Machine Learning (ML) tasks in the structured domain, i.e. classification and clustering of semi-structured documents. This year, the track focused on the supervised classification and the unsupervised clustering of XML documents using link information. We consider a corpus of about 100,000 Wikipedia pages with the associated hyperlinks. The participants have developed models using the content information, the internal structure information of the XML documents and also the link information between documents.
We address the problem of categorizing a large set of linked documents with important content and structure aspects, for example, the Wikipedia collection proposed at the INEX XML Mining track. We cope with the case where there is a small number of labeled pages and a very large number of unlabeled ones. Due to the sparsity of the link-based structure of Wikipedia, we apply the spectral and graph-based techniques developed in semi-supervised machine learning. We use the content and structure views of the Wikipedia collection to build a transductive categorizer for the unlabeled pages. We report evaluation results obtained with the label propagation function, which ensures good scalability on sparse graphs.
This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Many large scale problems exist in document clustering. K-tree scales well with large inputs due to its low complexity. It offers promising results both in terms of efficiency and quality. Document classification was completed using Support Vector Machines.
This paper contains a description of experiments for the 2008 INEX XML-mining track. Our goal for the XML-mining track is to explore whether we can use link information to improve classification accuracy. Our approach is to propagate category probabilities over linked pages. We find that using link information leads to marginal improvements over a baseline that uses a Naive Bayes model. For the initially misclassified pages, link information is either not available or contains too much noise.
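A minimal sketch of propagating category probabilities over linked pages; the mixing weight and iteration count are assumptions, not the paper's settings:

    # Each page's category distribution is mixed with the average
    # distribution of the pages it links to, for a few iterations.
    def propagate(probs, links, alpha=0.5, iters=3):
        for _ in range(iters):
            updated = {}
            for page, dist in probs.items():
                nbrs = links.get(page, [])
                if not nbrs:
                    updated[page] = dist
                    continue
                updated[page] = {
                    c: (1 - alpha) * p +
                       alpha * sum(probs[n].get(c, 0.0) for n in nbrs) / len(nbrs)
                    for c, p in dist.items()}
            probs = updated
        return probs

    probs = {"A": {"sport": 0.9, "art": 0.1}, "B": {"sport": 0.4, "art": 0.6}}
    links = {"A": ["B"], "B": ["A"]}
    print(propagate(probs, links))
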
This paper presents an experimental study conducted over the INEX 2008 Document Mining Challenge corpus using both the structure and the content of XML documents for clustering them. The concise common substructures known as the closed frequent subtrees are generated using the structural information of the XML documents. The closed frequent subtrees are then used to extract the constrained content from the documents. A matrix containing the term distribution of the documents in the dataset is developed using the extracted constrained content. The k-way clustering algorithm is applied to the matrix to obtain the required clusters. In spite of the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based clustering approach was successful in clustering the documents. This approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy.
This paper reports our experiments carried out for the INEX XML Mining track, which consists in developing categorization (or classification) and clustering methods for XML documents. We represent XML documents as vectors of index terms. For our first participation, the purpose of our experiments is twofold. Firstly, our overall aim is to set up a text-only categorization approach that can be used as a baseline for further work that will take into account the structure of the XML documents. Secondly, our goal is to define two criteria based on term distribution for reducing the size of the index. The results of our baseline are good, and using our two criteria we improve these results while slightly reducing the term index. The results are slightly worse when we sharply reduce the term index.
In this paper we propose a new method for link-based classification using Bayesian networks. It can be used in combination with any content-only probabilistic classifier, so it can be useful in combination with several different classifiers. We also report the results obtained by applying it to the XML Document Mining Track of INEX'08.
This paper reports on the experiments and results of a clustering approach used in the INEX 2008 document mining challenge. The clustering approach utilizes both the structure and content information of the Wikipedia XML document collection. A latent semantic kernel (LSK) is used to measure the semantic similarity between XML documents based on their content features. The construction of a latent semantic kernel involves computing the singular value decomposition (SVD). On a large feature-space matrix, the computation of SVD is very expensive in terms of time and memory requirements. Thus, in this clustering approach, the dimension of the document space of a term-document matrix is reduced before performing SVD. The document space reduction is based on the common structural information of the Wikipedia XML document collection. The proposed clustering approach has been shown to be effective on the Wikipedia collection in the INEX 2008 document mining challenge.
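A minimal sketch of a latent semantic kernel: documents are compared by cosine similarity after projection through a truncated SVD of the term-document matrix (the rank and data are illustrative, not the paper's setup):

    import numpy as np

    A = np.array([[2., 0., 1.],   # term-document matrix (rows: terms,
                  [1., 1., 0.],   #  columns: documents)
                  [0., 3., 1.],
                  [0., 1., 2.]])
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    P = U[:, :2]                  # top-2 latent dimensions

    def kernel(d1, d2):
        x, y = P.T @ d1, P.T @ d2  # project documents into latent space
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

    print(kernel(A[:, 0], A[:, 1]))  # semantic similarity of documents 1 and 2
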
Data mining on Web documents is one of the most challenging tasks in machine learning due to the large number of documents on the Web, the underlying structures (as one document may refer to another document), and the fact that the data is commonly not labeled (the class to which a document belongs is not known a priori). This paper considers the latest developments in Self-Organizing Maps (SOM), a machine learning approach, as one way of classifying documents on the Web. The most recent development is called a Probability Mapping Graph Self-Organizing Map (PMGraphSOM), and is an extension of an earlier GraphSOM approach; this encodes undirected and cyclic graphs in a scalable fashion. This paper illustrates empirically the advantages of the PMGraphSOM versus the original GraphSOM model in a data mining application involving graph-structured information. It will be shown that the performance achieved can exceed the current state-of-the-art techniques on a given benchmark problem.
Citations

... III. BENCHMARK COLLECTION We use benchmark datasets provided by the Initiative for the Evaluation of XML Retrieval (INEX) [8], [9]. These datasets consist of a collection of documents (in which we search for information), sample queries, and relevance judgments (information about which documents are relevant for which queries). ...
Article
Consider a user searching for information on the World Wide Web. If the information need of the user is somewhat specific, and if the user is permitted to provide a detailed description of his precise need, then it is quite likely that this description will include negative constraints, i.e., specifications of what the user is 'not' looking for. A search engine that makes use of such constraints is likely to return more accurate results. In this paper, we consider the problem of identifying such negative constraints from verbose queries. A maximum-entropy classifier is trained to identify negative sentences in verbose queries with about 90% accuracy. We next study how retrieval effectiveness is affected when these negative sentences are eliminated from the queries. We find that this step results in modest improvements in retrieval accuracy, but our analysis suggests that significant improvements can be obtained if negative sentences are properly handled during query processing.
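A hedged sketch of a maximum-entropy (logistic regression) classifier for spotting negative sentences; the features and training data here are toy stand-ins, not the paper's feature set:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    sentences = ["I am not interested in helicopter pilots.",
                 "Relevant documents discuss fixed-wing aircraft.",
                 "Do not return pages about military aviation.",
                 "I want information on early commercial flights."]
    labels = [1, 0, 1, 0]  # 1 = sentence states a negative constraint

    vec = CountVectorizer(ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(sentences), labels)
    print(clf.predict(vec.transform(["Pages about jet engines are not relevant."])))
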
... Although presented as a ranking problem, they use binary classification to rank the related concepts. INEX (the INitiative for the Evaluation of XML retrieval) has launched the Link-the-Wiki task, which defines target detection as a ranking problem, as at most 5 target concepts can be returned for an anchor text; various heuristics as well as retrieval-based methods have been proposed [7]. ...
Conference Paper
We focus on the task of target detection in automatic link generation with Wikipedia, i.e., given an N-gram in a snippet of text, find the relevant Wikipedia concepts that explain or provide background knowledge for it. We formulate the task as a ranking problem and investigate the effectiveness of learning to rank approaches and of the features that we use to rank the target concepts for a given N-gram. Our experiments show that learning to rank approaches outperform traditional binary classification approaches. Also, our proposed features are effective both in binary classification and learning to rank settings.
... Motivated by the need to foster research in areas relating to large digital book repositories, see e.g., [8], the Book Track was launched in 2007 as part of the Initiative for the Evaluation of XML retrieval (INEX). INEX was chosen as a suitable forum as searching for information in a collection of books can be seen as one of the natural application areas of focused retrieval approaches [7], which have been investigated at INEX since 2002 [4,5]. In particular, focused retrieval over books presents a clear benefit to users, enabling them to gain direct access to parts of books (of potentially hundreds of pages in length) that are relevant to their information need. ...
Article
This paper describes the setup of the Book Structure Extraction competition run at ICDAR 2009. The goal of the competition was to evaluate and compare automatic techniques for deriving structure information from digitized books, which could then be used to aid navigation inside the books. More specifically, the task that participants faced was to construct hyperlinked tables of contents for a collection of 1,000 digitized books. This paper describes the setup of the competition and its challenges. It introduces and discusses the book collection used in the task, the collaborative construction of the ground truth, the evaluation measures and the evaluation results. The paper also introduces a data set to be used freely for research evaluation purposes.
Article
Image processing techniques have been used over the years to convert printed material into electronic form. In our work we exploit the fact that some applications may find such conversions redundant and yet satisfactorily meet the demands of the end user. Using the horizontal and vertical white-spaces present in any document, independent regions of text, pictures, tables etc. could be identified. Inherent characteristic disparities were then used to distinguish pictures from text, and section-headings from the explanations that follow them. A table of contents, showing the heading and the associated page number, was generated and displayed on the browser. Each heading was hyperlinked to the corresponding page of the original document. HTML code was written dynamically, using file handling techniques in MATLAB to accommodate for variable number of headings obtained for different documents and also from different pages of a single document. The platform thus developed was tested on various languages and it was verified that the method implemented was language independent.
Article
Purpose The purpose of this paper is to propose methods for fast incremental indexing with effective and efficient query processing in XML element retrieval. The effectiveness of a search system becomes lower if document updates are not handled when these occur frequently on the Web. The search accuracy is also reduced if drastic changes in document statistics are not managed. However, existing studies of XML element retrieval do not consider document updates, although these studies have attained both effectiveness and efficiency in query processing. Thus, the authors add a function for handling document updates to the existing techniques for XML element retrieval. Design/methodology/approach Though it will be important to enable fast updates of indices, preliminary experiments have shown that a simple incremental update approach has two problems: some kinds of statistics are inaccurate, and it takes a long time to update indices. Therefore, two methods are proposed: one to approximate term weights accurately with a small number of documents, even for dynamically changing statistics; and the other to eliminate unnecessary update targets. Findings Experimental results show that this proposed system can update indices up to 32 per cent faster than the simple incremental updates while the search accuracy improved by 4 per cent compared with the simple approach. The proposed methods can also be fast and accurate in query processing, even if document statistics change drastically. Originality/value The paper shows that there could be a more practical XML element search engine, which can access the latest XML documents accurately and efficiently.
Conference Paper
In this paper we propose a path-expression-based smoothing method for the query likelihood model for XML element retrieval (QLMER). Though the query likelihood model, one of the statistical language models, is regarded as an accurate term-weighting scheme in document retrieval, it has not been surveyed enough in XML element retrieval. Some term-weighting schemes for XML element retrieval utilize the idea of a path expression, which is effective for accurate retrieval. Therefore, we propose the path-expression-based smoothing method. We are also interested in the potential power of QLMER compared with a commonly used term-weighting scheme, BM25E, which is a classic probabilistic model. Our experimental evaluations showed that the proposed smoothing method is more effective than the existing one. In addition, BM25E is more effective than QLMER even though the effectiveness improved with the proposed method.
Conference Paper
This paper summarizes the 3rd Book Structure Extraction competition that was run at ICDAR 2013. Its goal is to evaluate and compare automatic techniques for deriving structure information from digitized books, which could then be used to aid navigation inside the books. More specifically, the task that participants are faced with is to construct hyperlinked tables of contents for a collection of 1,000 digitized books. This paper reviews the setup of the competition, the book collection used in the task, and the measures used for the evaluation. The main novelty of the 2013 competition is that we were able to rely on an external provider for the ground-truthing phase, thus ensuring the consistency of the evaluation. In addition, this permitted us to nearly double the number of annotated books, from the 1,040 books annotated in 2009 and 2011 to over 2,000 books. The paper further presents the performance results of the 6 participating research teams, and briefly summarizes their approaches.
Article
This paper addresses the integration of XML tags into a term-weighting function for focused XML information retrieval (IR). Our model allows us to consider a certain kind of structural information: tags that represent a logical structure (e.g., title, section, paragraph, etc.) as well as other tags (e.g., bold, italic, center, etc.). We take into account the influence of a tag by estimating the probability for this tag to distinguish relevant terms from the others. Then, these weights are integrated in a term-weighting function. Experiments on a large collection from the INEX 2008 XML IR evaluation campaign showed improvements on focused XML retrieval. Keywords: Probabilistic information retrieval model; Structured information retrieval; XML; Tags; Weighting scheme; BM25
Conference Paper
Related entity finding is the task of returning a ranked list of homepages of relevant entities of a specified type that need to engage in a given relationship with a given source entity. We propose a framework for addressing this task and perform a detailed analysis of four core components; co-occurrence models, type filtering, context modeling and homepage finding. Our initial focus is on recall. We analyze the performance of a model that only uses co-occurrence statistics. While this method identifies the potential set of related entities, it fails to rank them effectively. Two types of error emerge: (1) entities of the wrong type pollute the ranking and (2) while somehow associated to the source entity, some retrieved entities do not engage in the right relation with it. To address (1), we add type filtering based on category information available in Wikipedia. To correct for (2), we complement our related entity finding method with contextual information, represented as language models derived from documents in which source and target entities co-occur. To complete the pipeline, we find homepages of top ranked entities by combining a language modeling approach with heuristics based on Wikipedia's external links. Our method achieves very high recall scores on the end-to-end task, providing a solid starting point for expanding our focus to improve precision. Our framework can effectively incorporate additional heuristics and these extensions lead to state-of-the-art performance.
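A hedged sketch of the first two pipeline stages described above, i.e. co-occurrence ranking followed by category-based type filtering, with toy data rather than the paper's corpus statistics:

    # Keep candidates whose Wikipedia categories match the target type,
    # then order them by co-occurrence count with the source entity.
    def rank_related(source, candidates, cooc, categories, target_type):
        scored = [(cooc.get((source, c), 0), c) for c in candidates
                  if target_type in categories.get(c, set())]
        return [c for count, c in sorted(scored, reverse=True) if count > 0]

    cooc = {("Ferrari", "Felipe Massa"): 120, ("Ferrari", "Maranello"): 90}
    categories = {"Felipe Massa": {"driver"}, "Maranello": {"city"}}
    print(rank_related("Ferrari", ["Felipe Massa", "Maranello"],
                       cooc, categories, target_type="driver"))
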
Conference Paper
In this paper, we summarize the 2nd Book Structure Extraction competition run at ICDAR 2011. Its goal is to evaluate and compare automatic techniques for deriving structure information from digitized books, which could then be used to aid navigation inside the books. More specifically, the task that participants are faced with is to construct hyperlinked tables of contents for a collection of 1,000 digitized books. This paper reviews the setup of the competition, the book collection used in the task, and the measures used for the evaluation. It further presents the outcome of the competition: an additional ground truth of 513 book tables of contents, contributed by 6 institutions, and the performance results of the 4 participating research teams.
Article
INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2009 evaluation campaign, which consisted of a wide range of tracks: Ad hoc, Book, Efficiency, Entity Ranking, Interactive, QA, Link the Wiki, and XML Mining. INEX is run entirely on volunteer effort by the IR research community: anyone with an idea and some time to spend can have a major impact.