Article

Overview of the second Text Retrieval Conference (TREC-2)


Abstract

In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and it represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection, and it required a major effort by all groups to scale up their retrieval techniques.


... A large number of shared tasks rely on such collections. Some of the well-known text collections are the Cranfield project [Cleverdon, 1959], the Text REtrieval Conference (TREC), more specifically TREC adhoc [Harman, 1995], the Cross-Language Evaluation Forum (CLEF) datasets [Ferro, 2014], and the NACSIS Test Collection for Information Retrieval (NTCIR) [Kando et al., 1999]. Indeed, international IR conferences such as TREC, CLEF, NTCIR, INEX, and FIRE are each held around their own test collections. ...
... The Cranfield test collection is the first IR test collection, and it grounded the evaluation framework still used in IR today. It was created in the late 1960s and contains the abstracts of 1,400 documents, 225 queries and the corresponding relevance judgments [Cleverdon, 1967; Harman, 1995]. The Cranfield test collection is the basis for the success of conferences such as the Text REtrieval Conference (TREC). ...
... TREC was established in 1992 to support IR research, to provide larger and more realistic collections, and to promote a standard for IR evaluation [Harman, 1995]. Since then, the TREC conference has created a series of evaluation resources specifically for adhoc retrieval. ...
Chapter
Full-text available
Evaluation is highly important for designing, developing, and maintaining information retrieval (IR) systems. The IR community has developed shared tasks in which evaluation frameworks, evaluation measures and test collections have been developed for different languages. Although Amharic is the official language of Ethiopia, a country with an estimated population of over 110 million, it is one of the under-resourced languages, and there is no Amharic adhoc IR test collection to date. In this paper, we present the monolingual Amharic IR test collection that we built for the IR community. Following the framework of the Cranfield project and TREC, the collection, which we named 2AIRTC, consists of 12,583 documents, 240 topics and the corresponding relevance judgments.
... We tested the data structures described by applying them to a collection of around 1 GB of World Wide Web text data derived from the TREC [21] Very Large Collection. The Text REtrieval Conference (TREC) is an ongoing international collaborative experiment in information retrieval sponsored by NIST and ARPA (US Defense Advanced Research Projects Agency). ...
... Overall results comparing self-adjusting structures to a BST, red-black tree, and hashing are shown in Table I. In these experiments, we present average results over five different text collections of around 1 GB in size derived from the TREC [21] Very Large Collection web data. The hash table is moderate in size, containing 220,000 slots. ...
... The news documents in the corpus are finally converted to TREC's defined SGML format (Harman, 1993a; 1993b; 1995; 1996) so that the IR community can use the document collection to evaluate the effectiveness of existing, as well as future, IR techniques. The documents in SGML format with a DTD are a standard representation of TREC text documents. ...
... In TREC, two strategies have been used to construct queries. Under the first strategy, used in TREC-1 and TREC-2, queries were written by real users of the system (Harman, 1993b; 1995). In contrast, in TREC-3 and TREC-4, the queries were written by the assessors (Harman, 1993a; 1996). ...
Article
Full-text available
Information Retrieval (IR) systems are developed to fulfill several needs of users. As a plethora of IR techniques has been developed, the evaluation of these techniques has been of paramount importance. Evaluating these techniques requires test collections, which are composed of a collection of documents, queries and relevance judgments between query-document pairs. Recognizing the importance of the evaluation of IR techniques, the Text REtrieval Conference (TREC) has been regularly organized for the last three decades with the aim of developing and continuously improving these techniques. However, most of the resource development effort has been directed to English and other Western languages, whereas resource development for Urdu has received little attention, which has thwarted the development of IR techniques for the language. Furthermore, the available benchmarks have several limitations, which include smaller size, unavailability of the benchmark, inadequate candidate documents for evaluation, and merely binary relevance judgments. To that end, this study has focused on constructing the largest-ever semantic IR benchmark for the Urdu language that strictly complies with the procedures proposed by TREC. First, a large collection of 2,887,169 Urdu documents is scraped and converted into the standard format proposed by TREC. Second, 105 queries are generated, which include detailed descriptions of 35 queries and three variants of each query in the TREC format. Third, a pooling-based approach is employed, using queries as well as their variants, to identify a candidate pool of 13,392 documents for human judgment. Finally, two experts performed 13,392 query-document comparisons and ranked them on a scale of four types of relevance: highly relevant, fairly relevant, marginally relevant and irrelevant, which is a significant enhancement over the existing benchmarks. The benchmark can be used for the evaluation of existing IR techniques, as well as of future techniques.
... (2) testing data (X^(te), y^(te)); Learning: Note that unlike the standard news corpora in NLP or SEC-mandated financial reports, transcripts of earnings calls are a very special genre of text. For example, the length of WSJ documents is typically one to three hundred words (Harman, 1995), but the average document length of our three earnings-call datasets is 7,677. Depending on the amount of interaction in the question-answering session, the complexity of the calls varies. ...
Conference Paper
An earnings call summarizes the financial performance of a company, and it is an important indicator of the company's future financial risks. We quantitatively study how earnings calls are correlated with financial risks, with a special focus on the financial crisis of 2009. In particular, we perform a text regression task: given the transcript of an earnings call, we predict the volatility of stock prices in the week after the call is made. We propose the use of copulas: a powerful statistical framework that separately models the uniform marginals and their complex multivariate stochastic dependencies, while not requiring any prior assumptions on the distributions of the covariate and the dependent variable. By performing the probability integral transform, our approach moves beyond the standard count-based bag-of-words models in NLP, and improves on previous work on text regression by incorporating the correlation among local features in the form of a semiparametric Gaussian copula. In experiments, we show that our model significantly outperforms strong linear and non-linear discriminative baselines on three datasets under various settings.
... We evaluate our proposed NPRF framework on two standard test collections, namely, TREC1-3 (Harman, 1993) and Robust04 (Voorhees, 2004). TREC1-3 consists of 741,856 documents with 150 queries used in the TREC 1-3 ad-hoc search tasks (Harman, 1993, 1994, 1995). Robust04 contains 528,155 documents and 249 queries used in the TREC 2004 Robust track (Voorhees, 2004). ...
Preprint
Full-text available
Pseudo-relevance feedback (PRF) is commonly used to boost the performance of traditional information retrieval (IR) models by using top-ranked documents to identify and weight new query terms, thereby reducing the effect of query-document vocabulary mismatches. While neural retrieval models have recently demonstrated strong results for ad-hoc retrieval, combining them with PRF is not straightforward due to incompatibilities between existing PRF approaches and neural architectures. To bridge this gap, we propose an end-to-end neural PRF framework that can be used with existing neural IR models by embedding different neural models as building blocks. Extensive experiments on two standard test collections confirm the effectiveness of the proposed NPRF framework in improving the performance of two state-of-the-art neural IR models.
... Many metrics have been introduced since. Popular metrics include mean average precisions (MAP) [68], normalized discounted cumulative gain (nDCG) [89] and expected reciprocal rank (ERR) [33]. Sanderson [155] gives a thorough overview of TREC and its metrics. ...
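For readers unfamiliar with these metrics, the two most common ones can be sketched in a few lines of Python; the function names and the toy rankings below are illustrative, not taken from the cited works.

```python
import math

def average_precision(ranking, relevant):
    """AP for one query: mean of precision@k over the ranks of relevant
    documents; MAP is this value averaged over a set of queries."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def ndcg(gains, k):
    """nDCG@k for graded gains listed in ranked order: DCG divided by the
    DCG of the ideal (descending-gain) ordering."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(gains, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

# Relevant docs at ranks 1 and 3 of four: AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

ERR differs in that the discount at each rank depends on the graded relevance of the documents ranked above, modelling a user who stops at the first satisfying result.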
Thesis
Full-text available
More than half the world’s population uses web search engines, resulting in over half a billion queries every single day. For many people, web search engines such as Baidu, Bing, Google, and Yandex are among the first resources they turn to when a question arises. Moreover, for many, search engines have become the most trusted route to information, more so even than traditional media such as newspapers, news websites or news channels on television. What web search engines present people with greatly influences what they believe to be true, and consequently it influences their thoughts, opinions, decisions, and the actions they take. With this in mind, two things are important from an information retrieval research perspective: first, to understand how well search engines (rankers) perform, and second, to use this knowledge to improve them. This thesis is about these two topics: evaluation of search engines and learning search engines. In the first part of this thesis we investigate how user interactions with search engines can be used to evaluate search engines. In particular, we introduce a new online evaluation paradigm called multileaving that extends interleaving. With multileaving, many rankers can be compared at once by combining document lists from these rankers into a single result list and attributing user interactions with this list to the rankers. We then investigate the relation between A/B testing and interleaved comparison methods. Both studies lead to much higher sensitivity of the evaluation methods, meaning that fewer user interactions are required to arrive at reliable conclusions. This has the important implication that fewer users need to be exposed to the results of possibly inferior search engines. In the second part of this thesis we turn to online learning to rank, building on the evaluation methods introduced and extended in the first part. We learn the parameters of base rankers based on user interactions.
We then use the multileaving methods as feedback in our learning method, leading to much faster convergence than existing methods. Again, the important implication is that fewer users need to be exposed to possibly inferior search engines, as the rankers adapt more quickly to changes in user preferences. The last part of this thesis is of a different nature than the earlier two parts: we no longer study algorithms. Progress in information retrieval research has always been driven by a combination of algorithms, shared resources, and evaluation. In the last part we focus on the latter two, introducing a new shared resource and a new evaluation paradigm. First, we propose Lerot, an online evaluation framework that allows us to simulate users interacting with a search engine. Our implementation has been released as open source software and is currently being used by researchers around the world. Second, we introduce OpenSearch, a new evaluation paradigm involving real users of real search engines, and describe an implementation of this paradigm that has already been widely adopted by the research community through challenges at CLEF and TREC.
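The multileaving idea described above, merging several rankers' lists and crediting clicks back to the contributing ranker, can be roughly sketched as follows. This is a simplified, deterministic round-robin variant for illustration only; the thesis's actual team-draft-style methods randomize the turn order and differ in details.

```python
def multileave(rankings):
    """Combine several rankings into one result list, remembering which
    ranker contributed each document so that user clicks can be credited
    back to it. Rankers take turns in a fixed order here; team-draft
    multileaving randomizes the turn order per round."""
    target = len({doc for ranking in rankings for doc in ranking})
    combined, credit, seen = [], {}, set()
    pointers = [0] * len(rankings)
    while len(combined) < target:
        for i, ranking in enumerate(rankings):
            # Advance past documents already placed in the combined list.
            while pointers[i] < len(ranking) and ranking[pointers[i]] in seen:
                pointers[i] += 1
            if pointers[i] < len(ranking):
                doc = ranking[pointers[i]]
                combined.append(doc)
                seen.add(doc)
                credit[doc] = i  # a click on doc counts for ranker i
    return combined, credit

combined, credit = multileave([["a", "b", "c"], ["b", "d", "a"]])
# combined → ["a", "b", "c", "d"]; credit → {"a": 0, "b": 1, "c": 0, "d": 1}
```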
... To evaluate the performance of the proposed query refinement approach, we performed several experiments. The test data was drawn from the TREC conferences [6]. We used part of these data sets, crawled in 1997 [7], for TREC 9 and 10, which have sets of ten topics and accompanying relevance judgments. ...
Article
Effective information gathering and retrieval of the most relevant web documents on a topic of interest is difficult due to the large amount of information that exists in various formats. Current information gathering and retrieval techniques are unable to exploit semantic knowledge within documents in the "big data" environment; therefore, they cannot provide precise answers to specific questions. Existing commercial big data analytic platforms are restricted to a single data type; moreover, different big data analytic platforms are effective at processing different data types. Therefore, the development of a common big data platform that is suitable for efficiently processing various data types is needed. Furthermore, users often possess more than one intelligent device. It is therefore important to find an efficient preference-profile construction approach to record the user context and personalized applications. In this way, user needs can be tailored to the user's dynamic interests by tracking all devices owned by the user.
... We restrict our attention to the five best and five worst scoring topics amongst the remaining set of 28; the average precision scores for these ten topics are shown in Table IX. Average precision scores can differ radically across topics (Harman, 1994), but the scores for the same topics across multiple languages tend to be more robust. ...
Article
Full-text available
Recent years have witnessed considerable advances in information retrieval for European languages other than English. We give an overview of commonly used techniques and we analyze them with respect to their impact on retrieval effectiveness. The techniques considered range from linguistically motivated techniques, such as morphological normalization and compound splitting, to knowledge-free approaches, such as n-gram indexing. Evaluations are carried out against data from the CLEF campaign, covering eight European languages. Our results show that for many of these languages a modicum of linguistic techniques may lead to improvements in retrieval effectiveness, as can the use of language independent techniques.
... Second, the QL method downloads an average of 2.43 unseen (not previously sampled) documents per query, while the corresponding average for QBS is 2.80. Having access to the term document frequency information of any collection, it is possible to calculate the expected [Harman, 1995a]. Collections labelled WEB are subsets of the TREC WT10G collection [Bailey et al., 2003]. ...
... The Okapi BM25 (Sari and Adriani, 2014) measure is used to rank the retrieved documents according to their relevance. The name Okapi BM25 (Billerbeck et al., 2003; Harman, 1994) derives from "BM", an abbreviation of "Best Match"; the 25 in this case marks a combination of BM11 and BM15. The Okapi method is described as: ...
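The formula elided in the excerpt above is the standard BM25 weighting; a minimal sketch with the common defaults k1 = 1.2 and b = 0.75 follows (the function name and toy corpus are illustrative, not from the cited paper).

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a bag-of-words query.

    `corpus` is a list of token lists, used for document frequencies and
    the average document length."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc.count(term)                      # term frequency in doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Documents are then ranked by this score in descending order for each query.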
Article
Full-text available
Retrieving accurate information from the collections available on the web in a cross-lingual communication environment is a very difficult task. To retrieve information, the user specifies the needed information in the form of a query. Sometimes a query cannot express the needed information specifically enough, due to ambiguity or untranslated query words. This problem can be minimized by expanding the query with other suitable words that make it more specific. The purpose of query expansion is to improve the performance and quality of retrieved information in CLIR. In this paper, query expansion (Q.E.) is explored for Hindi-English CLIR, in which Hindi queries are used to search English documents. We used Okapi BM25 for document ranking, and translated queries were then expanded using the Term Selection Value (TSV). All experiments were performed on the FIRE 2012 dataset by analysing the impact of term occurrence in the top @3 ranked documents. Our results show that the relevance of retrieved results for Hindi-English CLIR with Q.E., performed by adding the lowest-frequency term from the corpus of the top @3 ranked documents, is 51.33%, which is higher than both before and after plain Q.E. (i.e., Case 1 and Case 2).
... The American TREC [2] is the most famous QA effectiveness competition in the world. Its Japanese equivalent is called NTCIR [5], and our teams decided to participate in its QAC [1] task for the second time, though their first attempt brought no significant success. ...
Conference Paper
In this paper we describe our second collective challenge at the NTCIR-6 Question Answering Challenge (QAC4). This time, too, we decided to investigate the limits of the “as automatic as possible” approach to QA. Three teams, from Otaru University of Commerce, Mie University and Hokkaido University, concentrated on three new question types, and the last team also re-modeled its WWW Verifier to cope with these types. We introduce our ideas and methods, and then conclude with results and a proposal of further innovations.
... In ad hoc querying, the user formulates any number of arbitrary queries but applies them to a fixed collection [7]. We considered the Gujarati language because no such tasks had been performed for Gujarati, although some work has been carried out for the Bengali, Hindi and Marathi languages [10], [13]. Apart from this, Gujarati is spoken by nearly 50 million people around the world and is an official language of the state of Gujarat. ...
Preprint
In this paper, we present experimental work on Query Expansion (QE) for retrieval tasks over Gujarati text documents. In information retrieval it is very difficult to estimate the exact user need; query expansion adds terms to the original query, which provide more information about that need. There are various approaches to query expansion. In our work, manual thesaurus-based query expansion was performed to evaluate the performance of widely used information retrieval models on Gujarati text documents. Results show that query expansion improves the recall of text documents.
... Events can then be ranked by their score, which is especially important if some prioritization is needed. Ranking metrics such as precision for a fixed number of retrieved documents, or a fixed fraction of all available documents, are often used in IR [67][68][69][70]. The most commonly used ranking metric is however the Area Under the ROC Curve (AUC), which is popular in MD [71][72][73][74][75] because it represents "the probability that a randomly chosen diseased subject is correctly ranked with greater suspicion than a randomly chosen non-diseased subject". ...
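The AUC interpretation quoted above, the probability that a randomly chosen positive outranks a randomly chosen negative, translates directly into a pairwise count. This small sketch is illustrative (ties counted as one half):

```python
def auc(scores_pos, scores_neg):
    """AUC via its probabilistic definition: the fraction of
    (positive, negative) score pairs in which the positive scores
    higher, counting ties as one half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Perfect separation gives 1.0; identical score distributions give 0.5.
```

Production code computes the same quantity from rank sums in O(n log n) rather than over all pairs.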
Preprint
Full-text available
HEP event selection is traditionally considered a binary classification problem, involving the dichotomous categories of signal and background. In distribution fits for particle masses or couplings, however, signal events are not all equivalent, as the signal differential cross section has different sensitivities to the measured parameter in different regions of phase space. In this paper, I describe a mathematical framework for the evaluation and optimization of HEP parameter fits, where this sensitivity is defined on an event-by-event basis, and for MC events it is modeled in terms of their MC weight derivatives with respect to the measured parameter. Minimising the statistical error on a measurement implies the need to resolve (i.e. separate) events with different sensitivities, which ultimately represents a non-dichotomous classification problem. Since MC weight derivatives are not available for real data, the practical strategy I suggest consists in training a regressor of weight derivatives against MC events, and then using it as an optimal partitioning variable for 1-dimensional fits of data events. This CHEP2019 paper is an extension of the study presented at CHEP2018: in particular, event-by-event sensitivities allow the exact computation of the "FIP" ratio between the Fisher information obtained from an analysis and the maximum information that could possibly be obtained with an ideal detector. Using this expression, I discuss the relationship between FIP and two metrics commonly used in Meteorology (Brier score and MSE), and the importance of "sharpness" both in HEP and in that domain. I finally point out that HEP distribution fits should be optimized and evaluated using probabilistic metrics (like FIP or MSE), whereas ranking metrics (like AUC) or threshold metrics (like accuracy) are of limited relevance for these specific problems.
... The overall relevance assessment made available through the data set is built according to the TREC pooling procedure (Harman, 1993). ...
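The TREC pooling procedure referred to here judges only the union of the top-k documents from each submitted run; unjudged documents are assumed non-relevant. A minimal sketch (the run names and depth are made up):

```python
def pool(runs, depth):
    """TREC-style pooling: the union of the top-`depth` documents from
    each participating run is sent to assessors for judgment."""
    pooled = set()
    for ranking in runs.values():
        pooled.update(ranking[:depth])
    return pooled

runs = {
    "runA": ["d1", "d2", "d3", "d4"],
    "runB": ["d2", "d5", "d1", "d6"],
}
# pool(runs, 2) → {"d1", "d2", "d5"}
```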
Article
A large body of research work has examined, from both the query side and the user behavior side, the characteristics of medical- and health-related searches. One of the core issues in medical information retrieval (IR) is the diversity of tasks, which leads to a diversity of categories of information needs and queries. From the evaluation perspective, another related and challenging issue is the limited availability of appropriate test collections that allow the experimental validation of medically task-oriented IR techniques and systems. In this paper, we explore the peculiarities of TREC and CLEF medically oriented tasks and queries through an analysis of the differences and similarities between queries across tasks, with respect to length, specificity, and clarity features, and then study their effect on retrieval performance. We show that, even for expert-oriented queries, the language specificity level varies significantly across tasks, as does search difficulty. Additional findings highlight that query clarity factors are task dependent and that query term specificity based on domain-specific terminology resources is not significantly linked to term rareness in the document collection. The lessons learned from our study could serve as starting points for the design of future task-based medical information retrieval frameworks.
... Events can then be ranked by their score, which is especially important if some prioritization is needed. Ranking metrics such as precision for a fixed number of retrieved documents, or a fixed fraction of all available documents, are often used in IR [68][69][70][71]. The most commonly used ranking metric is however the Area Under the ROC Curve (AUC), which is popular in MD [72][73][74][75][76] because it represents "the probability that a randomly chosen diseased subject is correctly ranked with greater suspicion than a randomly chosen non-diseased subject". ...
Article
Full-text available
HEP event selection is traditionally considered a binary classification problem, involving the dichotomous categories of signal and background. In distribution fits for particle masses or couplings, however, signal events are not all equivalent, as the signal differential cross section has different sensitivities to the measured parameter in different regions of phase space. In this paper, I describe a mathematical framework for the evaluation and optimization of HEP parameter fits, where this sensitivity is defined on an event-by-event basis, and for MC events it is modeled in terms of their MC weight derivatives with respect to the measured parameter. Minimising the statistical error on a measurement implies the need to resolve (i.e. separate) events with different sensitivities, which ultimately represents a non-dichotomous classification problem. Since MC weight derivatives are not available for real data, the practical strategy I suggest consists in training a regressor of weight derivatives against MC events, and then using it as an optimal partitioning variable for 1-dimensional fits of data events. This CHEP2019 paper is an extension of the study presented at CHEP2018: in particular, event-by-event sensitivities allow the exact computation of the “FIP” ratio between the Fisher information obtained from an analysis and the maximum information that could possibly be obtained with an ideal detector. Using this expression, I discuss the relationship between FIP and two metrics commonly used in Meteorology (Brier score and MSE), and the importance of “sharpness” both in HEP and in that domain. I finally point out that HEP distribution fits should be optimized and evaluated using probabilistic metrics (like FIP or MSE), whereas ranking metrics (like AUC) or threshold metrics (like accuracy) are of limited relevance for these specific problems.
... This is something that is rather impossible when dealing with large collections in operational databases. An alternative put forward by Baeza-Yates & Ribeiro-Neto is the measure of relative recall, described as the ratio between the number of retrieved relevant documents, |R_a|, and the number of relevant documents the searcher expects to find, |R_f| (Harman (1995) p. 271; Rowley (1994) p. 110; Lancaster (1998)). Certain objections have been raised against this measure as well. There is often the problem, Chowdhury argues, that a searcher/user can rarely specify how many records he or she wishes to retrieve. ...
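Relative recall as defined in the excerpt above is a simple ratio; a one-line sketch (names illustrative):

```python
def relative_recall(n_retrieved_relevant, n_expected_relevant):
    """Relative recall: retrieved relevant documents |R_a| divided by the
    number of relevant documents the searcher expected to find |R_f|."""
    if n_expected_relevant == 0:
        return 0.0
    return n_retrieved_relevant / n_expected_relevant
```

The objection quoted above is precisely that the denominator, the searcher's expectation, is hard to elicit reliably.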
... The development and evaluation of such algorithms need textual corpora and references [8] [9] [10] [11]. For well-studied languages such as English, evaluation forums like TREC [12], CLEF [13], and NTCIR [14] are used to develop and evaluate these algorithms on different tasks. For digitally under-resourced languages, tools and reference corpora are usually not available, which is the case for Amharic. ...
... A large number of shared tasks rely on such collections. Some of the well-known text collections and evaluation programs are Cranfield project [5], Text REtrieval Conference (TREC) and more specifically TREC adhoc [6], Cross-Language Evaluation Forum (CLEF) [3], ...
Article
Full-text available
Information retrieval (IR) is one of the most important research and development areas due to the explosion of digital data and the need to access relevant information from huge corpora. Although IR systems function well for technologically advanced languages such as English, this is not the case for morphologically complex, under-resourced and less-studied languages such as Amharic. Amharic is a Semitic language characterized by a complex morphology in which thousands of words are generated from a single root form through inflection and derivation. This has made the development of Amharic natural language processing (NLP) tools a challenging task. Amharic adhoc retrieval also faces challenges due to the scarcity of linguistic resources, tools and standard evaluation corpora. In this research work, we investigate the impact of morphological features on the representation of Amharic documents and queries for adhoc retrieval. We also analyze the effects of stem-based and root-based text representation, and propose a new Amharic IR system architecture. Moreover, we present the resources and corpora we constructed for the evaluation of Amharic IR systems and other NLP tools. We conduct various experiments with a TREC-like approach on an Amharic IR test collection using a standard evaluation framework and measures. Our findings show that root-based text representation outperforms the conventional stem-based representation for Amharic IR.
Conference Paper
This session focused on experimental or planned approaches to human language technology evaluation and included an overview and five papers: two papers on experimental evaluation approaches [1, 2], and three about ongoing work in new annotation and evaluation approaches for human language technology [3, 4, 5]. This was followed by fifteen minutes of general discussion.
Conference Paper
Full-text available
In this chapter we present the main data structures and algorithms for searching large text collections. We emphasize inverted files, the most used index, but also review suffix arrays, which are useful in a number of specialized applications. We also cover parallel and distributed implementations of these two structures. As an example, we show how mechanisms based upon inverted files can be used to index and search the Web.
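A minimal inverted file along the lines described in the chapter can be sketched as follows; the helper names are illustrative, and real systems compress postings lists and store term positions and frequencies:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_and(index, terms):
    """Conjunctive (AND) query: intersect the postings lists of all terms."""
    lists = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

index = build_inverted_index([
    "inverted files index text",
    "suffix arrays index strings",
    "text search",
])
# search_and(index, ["index", "text"]) → [0]
```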
Conference Paper
Words that frequently occur in a document but carry less significant meaning are called stopwords. Identification and removal of stopwords can result in effective indexing of documents. Mean average precision (MAP) is the metric used to measure the efficiency of information retrieval (IR) tasks. In this paper, we have experimented with elimination of Gujarati stopwords to measure the improvements in Adhoc monolingual information retrieval of Gujarati text documents. Results show that elimination of stopwords improve the MAP values of Gujarati IR.
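A simple frequency-based way to identify and remove stopword candidates, as described above, can be sketched as follows (names and cutoff illustrative; published stopword lists are usually curated by hand):

```python
from collections import Counter

def stopword_candidates(docs, top_n):
    """The most frequent terms in a collection are natural stopword
    candidates: frequent but typically low in content."""
    counts = Counter(term for doc in docs for term in doc.lower().split())
    return [term for term, _ in counts.most_common(top_n)]

def remove_stopwords(tokens, stopwords):
    """Drop stopwords from a token stream before indexing."""
    return [t for t in tokens if t not in stopwords]
```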
Article
Expertise retrieval has attracted significant interest in the field of information retrieval. Expert finding has been studied extensively, with less attention going to the complementary task of expert profiling, that is, automatically identifying topics about which a person is knowledgeable. We describe a test collection for expert profiling in which expert users have self‐selected their knowledge areas. Motivated by the sparseness of this set of knowledge areas, we report on an assessment experiment in which academic experts judge a profile that has been automatically generated by state‐of‐the‐art expert‐profiling algorithms; optionally, experts can indicate a level of expertise for relevant areas. Experts may also give feedback on the quality of the system‐generated knowledge areas. We report on a content analysis of these comments and gain insights into what aspects of profiles matter to experts. We provide an error analysis of the system‐generated profiles, identifying factors that help explain why certain experts may be harder to profile than others. We also analyze the impact on evaluating expert‐profiling systems of using self‐selected versus judged system‐generated knowledge areas as ground truth; they rank systems somewhat differently but detect about the same amount of pairwise significant differences despite the fact that the judged system‐generated assessments are more sparse.
Article
Widespread digitization of information in today’s internet age has intensified the need for effective textual document classification algorithms. Most real life classification problems, including text classification, genetic classification, medical classification, and others, are complex in nature and are characterized by high dimensionality. Current solution strategies include Naïve Bayes (NB), Neural Network (NN), Linear Least Squares Fit (LLSF), k-Nearest-Neighbor (kNN), and Support Vector Machines (SVM); with SVMs showing better results in most cases. In this paper we introduce a new approach called dynamic architecture for artificial neural networks (DAN2) as an alternative for solving textual document classification problems. DAN2 is a scalable algorithm that does not require parameter settings or network architecture configuration. To show DAN2 as an effective and scalable alternative for text classification, we present comparative results for the Reuters-21578 benchmark dataset. Our results show DAN2 to perform very well against the current leading solutions (kNN and SVM) using established classification metrics.
Article
An important problem for information access systems is that of organizing large sets of documents that have been retrieved in response to a query. Text categorization and text clustering are two natural language processing tasks whose results can be applied to document organization. This chapter describes user interfaces that use categories and clusters to organize retrieval results, and examines the relationship between the two.
Article
Text retrieval systems store a great variety of documents, from abstracts, newspaper articles, and web pages to journal articles, books, court transcripts, and legislation. Collections of diverse types of documents expose shortcomings in current approaches to ranking. Use of short fragments of documents, called passages, instead of whole documents can overcome these shortcomings: passage ranking provides convenient units of text to return to the user, can avoid the difficulties of comparing documents of different length, and enables identification of short blocks of relevant material amongst otherwise irrelevant text. In this paper, we compare several kinds of passage in an extensive series of experiments. We introduce a new type of passage: overlapping fragments of either fixed or variable length. We show that ranking with these arbitrary passages gives substantial improvements in retrieval effectiveness over traditional document ranking schemes, particularly for queries on collections of long documents. Ranking with arbitrary passages shows consistent improvements compared to ranking with whole documents, and to ranking with previous passage types that depend on document structure or topic shifts in documents. Keywords: passage retrieval, document retrieval, effective ranking, similarity measures, pivoted
Article
Full-text available
The paper describes the ideas and assumptions underlying the development of a new method for the evaluation and testing of interactive information retrieval (IR) systems, and reports on the initial tests of the proposed method. The method is designed to collect different types of empirical data, i.e. cognitive data as well as traditional systems performance data. The method is based on the novel concept of a ‘simulated work task situation’ or scenario and the involvement of real end users. The method is also based on a mixture of simulated and real information needs, and involves a group of test persons as well as assessments made by individual panel members. The relevance assessments are made with reference to the concepts of topical as well as situational relevance. The method takes into account the dynamic nature of information needs which are assumed to develop over time for the same user, a variability which is presumed to be strongly connected to the processes of relevance assessment.
Article
Three Web search engines, namely, Alta Vista, Excite, and Lycos, were compared and evaluated in terms of their search capabilities (e.g., Boolean logic, truncation, field search, word and phrase search) and retrieval performance (i.e., precision and response time) using sample queries drawn from real reference questions. Recall, the other evaluation criterion of information retrieval, is deliberately omitted from this study because it is impossible to determine how many relevant items there are for a particular query in the huge and ever-changing Web. The authors of this study found that Alta Vista outperformed Excite and Lycos in both search facilities and retrieval performance, although Lycos had the largest coverage of Web resources among the three Web search engines examined. As a result of this research, we also proposed a methodology for evaluating other Web search engines not included in the current study.
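The precision measure used in comparisons like the one above can be sketched in a few lines. This is an illustrative example, not code from the study; the URLs and relevance judgments are invented:

```python
def precision_at_k(results, relevant, k=10):
    """Precision at cutoff k: the fraction of the top-k results judged
    relevant -- the practical measure when full recall cannot be computed."""
    top = results[:k]
    return sum(1 for doc in top if doc in relevant) / k

# Hypothetical run: 10 retrieved URLs, 5 of which were judged relevant.
results = [f"url{i}" for i in range(1, 11)]
relevant = {"url1", "url2", "url3", "url5", "url8"}
print(precision_at_k(results, relevant))  # 5 relevant in the top 10 -> 0.5
```

Response time would be measured separately with a stopwatch around the query submission; only the relevance judgments enter the precision figure.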
Article
Recall, the proportion of the relevant documents that is retrieved, is a key indicator of the performance of an information retrieval system. With large information systems, like the World Wide Web on the Internet, recall is almost impossible to measure or estimate by any standard technique. A technique called 'needle hiding' is proposed to estimate recall. It is also shown that ranking by relative recall need not be isomorphic to ranking by recall; hence the use of relative recall for comparative evaluation might not be entirely sound.
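The distinction between recall and relative recall can be made concrete with a small sketch. This is a hypothetical illustration, not code from the paper; the document identifiers are invented:

```python
# 'relevant' is the full (usually unknowable) set of relevant documents;
# 'pooled_relevant' is the subset of relevant documents actually known
# to the evaluators, e.g. those found by any of the compared systems.

def recall(retrieved, relevant):
    """True recall: fraction of ALL relevant documents retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def relative_recall(retrieved, pooled_relevant):
    """Relative recall: recall computed against the known pool only."""
    return len(set(retrieved) & set(pooled_relevant)) / len(pooled_relevant)

relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}  # unknowable on the Web
pooled = {"d1", "d2", "d3"}                      # what assessors have seen
run = ["d1", "d2", "d9"]

print(recall(run, relevant))         # 2 of 6 relevant found
print(relative_recall(run, pooled))  # 2 of 3 pooled-relevant found
```

Because the two denominators differ, two systems can swap order when ranked by relative recall rather than true recall, which is the soundness concern the paper raises.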
Article
Music Information Retrieval (MIR) evaluation has traditionally focused on system-centered approaches where components of MIR systems are evaluated against predefined data sets and golden answers (i.e., ground truth). There are two major limitations of such system-centered evaluation approaches: (a) The evaluation focuses on subtasks in music information retrieval, but not on entire systems and (b) users and their interactions with MIR systems are largely excluded. This article describes the first implementation of a holistic user-experience evaluation in MIR, the MIREX Grand Challenge, where complete MIR systems are evaluated, with user experience being the single overarching goal. It is the first time that complete MIR systems have been evaluated with end users in a realistic scenario. We present the design of the evaluation task, the evaluation criteria and a novel evaluation interface, and the data-collection platform. This is followed by an analysis of the results, reflection on the experience and lessons learned, and plans for future directions.
Article
We compared the information retrieval performance of some popular search engines (namely, Google, Yahoo, AlltheWeb, Gigablast, Zworks, AltaVista and Bing/MSN) in response to a list of ten queries, varying in complexity. These queries were run on each search engine, and the precision and response time of the retrieved results were recorded. The first ten documents of each retrieval output were evaluated as 'relevant' or 'non-relevant' to assess each search engine's precision. To evaluate response time, normalised recall ratios were calculated at various cut-off points for each query and search engine. This study shows that Google appears to be the best search engine in terms of both average precision (70%) and average response time (2 s). Gigablast and AlltheWeb performed the worst overall in this study.
Thesis
Full-text available
Information retrieval methods, especially for multimedia data, have evolved towards the integration of multiple sources of evidence when analyzing the relevance of items for a given user search task. In this context, to attenuate the semantic gap between low-level features extracted from the content of digital objects and high-level semantic concepts (objects, categories, etc.), and to make systems adaptive to different user needs, interactive models have brought the user closer to the retrieval loop, allowing user-system interaction mainly through implicit or explicit relevance feedback. Analogously, diversity promotion has emerged as an alternative for tackling ambiguous or underspecified queries. Additionally, several works have addressed the issue of minimizing the user effort required to provide relevance assessments while keeping an acceptable overall effectiveness. This thesis discusses, proposes, and experimentally analyzes multimodal and interactive diversity-oriented information retrieval methods. The work comprehensively covers the interactive information retrieval literature and discusses recent advances, major research challenges, and promising research opportunities. We have proposed and evaluated two relevance-diversity trade-off enhancement workflows, which integrate multiple kinds of information from images, such as visual features, textual metadata, geographic information, and user-credibility descriptors. In turn, as an integration of interactive retrieval and diversity-promotion techniques, to maximize the coverage of multiple query interpretations/aspects and to speed up the information transfer between the user and the system, we have proposed and evaluated a multimodal learning-to-rank method trained with relevance feedback over diversified results. Our experimental analysis shows that the joint usage of multiple information sources positively impacted the relevance-diversity balancing algorithms.
Our results also suggest that integrating multimodal-relevance-based filtering and reranking is effective in improving result relevance and also boosts diversity-promotion methods. Furthermore, through a thorough experimental analysis, we have investigated several research questions related to the possibility of improving result diversity while keeping or even improving relevance in interactive search sessions. Moreover, we analyze how much the diversification effort affects overall search-session results and how different diversification approaches behave for the different data modalities. By analyzing overall and per-feedback-iteration effectiveness, we show that introducing diversity may harm initial results, whereas it significantly enhances overall session effectiveness, considering not only relevance and diversity but also how early the user is exposed to the same number of relevant items and the same degree of diversity.
Chapter
Ranking a set of documents based on their relevance to a given query is a central problem of information retrieval (IR). Traditionally, people have used unsupervised scoring methods like tf-idf, BM25, and language models, but recently the supervised machine learning framework has been used successfully to learn a ranking function; this is called the learning-to-rank (LtR) problem. There are a few surveys on LtR in the literature, but these reviews provide very little assistance to someone who, before delving into the technical details of different algorithms, wants a broad understanding of LtR systems and their evolution from, and relation to, traditional IR methods. This chapter tries to address this gap in the literature. Mainly the following aspects are discussed: the fundamental concepts of IR, the motivation behind LtR, the evolution of LtR from and its relation to traditional methods, the relationship between LtR and other supervised machine learning tasks, the general issues pertaining to an LtR algorithm, and the theory of LtR.
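As a reminder of the unsupervised baselines the chapter contrasts with LtR, a minimal tf-idf scorer might look like this. It is a toy sketch only; real systems precompute document frequencies in an inverted index rather than scanning the corpus per term:

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, corpus):
    """Score one document for a query as a sum of tf * idf over
    the query terms (a basic unsupervised ranking function)."""
    N = len(corpus)
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)  # document frequency
        if df == 0 or t not in tf:
            continue
        score += tf[t] * math.log(N / df)
    return score

# Toy corpus of tokenized documents.
corpus = [["cat", "sat"], ["dog", "sat"], ["cat", "dog", "ran"]]
print(tf_idf_score(["cat"], corpus[0], corpus))  # 1 * log(3/2)
```

An LtR system would instead treat scores like this one as input features and learn how to combine them from labeled query-document pairs.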
Article
This paper focuses on the performance gain obtained on Kepler graphics processing units (GPUs) for multi-key quicksort. Because multi-key quicksort is a recursive algorithm, many researchers have found it tedious to parallelize on multi- and many-core architectures. A survey of state-of-the-art string sorting algorithms and a careful study of the Kepler GPU architecture gave rise to the intriguing research idea of matching the template of multi-key quicksort with the dynamic parallelism feature offered by Kepler-based GPUs. The CPU parallel implementation shows improvements of 33-50% and 62-75% when compared with 8-bit and 16-bit parallel most-significant-digit radix sort, respectively. The GPU implementation of multi-key quicksort gives a 6× to 18× speed-up compared with the CPU parallel implementation, and a 1.5× to 3× speed-up when compared with the GPU implementation of a string sorting algorithm using singleton elements in the literature.
Article
With the amount and variety of information available on digital repositories, answering complex user needs and personalizing information access became a hard task. Putting the user in the retrieval loop has emerged as a reasonable alternative to enhance search effectiveness and consequently the user experience. Due to the great advances on machine learning techniques, optimizing search engines according to user preferences has attracted great attention from the research and industry communities. Interactively learning-to-rank has greatly evolved over the last decade but it still faces great theoretical and practical obstacles. This paper describes basic concepts and reviews state-of-the-art methods on the several research fields that complementarily support the creation of interactive information retrieval (IIR) systems. By revisiting ground concepts and gathering recent advances, this article also intends to foster new research activities on IIR by highlighting great challenges and promising directions. The aggregated knowledge provided here is intended to work as a comprehensive introduction to those interested in IIR development, while also providing important insights on the vast opportunities of novel research.
Article
The music information retrieval (MIR) community has long understood the role of evaluation as a critical component for successful information retrieval systems. Over the past several years, it has also become evident that user-centered evaluation based on realistic tasks is essential for creating systems that are commercially marketable. Although user-oriented research has been increasing, the MIR field is still lacking in holistic, user-centered approaches to evaluating music services beyond measuring the performance of search or classification algorithms. In light of this need, we conducted a user study exploring how users evaluate their overall experience with existing popular commercial music services, asking about their interactions with the system as well as situational and personal characteristics. In this paper, we present a qualitative heuristic evaluation of commercial music services based on Jakob Nielsen's 10 usability heuristics for user interface design, and also discuss 8 additional criteria that may be used for the holistic evaluation of user experience in MIR systems. Finally, we recommend areas of future user research raised by trends and patterns that surfaced from this user study.
Article
This article presents a study from the field of interactive information retrieval in which the expectations of search engine users were investigated as a possible determinant of user satisfaction. The experimental design is based on a business model that explains the creation of customer satisfaction by the confirmation or disconfirmation of expectations. A central result of this study is that the point in time of the measurement is important with respect to the assessment of user satisfaction. Aside from that, user relevance criteria seem to depend on system quality.
Article
A key decision when developing in-memory computing applications is the choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with move-to-front chains and the burst trie, both of which use linked lists as a substructure, and variants of the binary search tree. These data structures are computationally efficient, but typical implementations use large numbers of nodes and pointers to manage strings, which is not efficient in use of cache. In this article, we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. For hashing, in the best case the total space overhead is reduced to less than 1 bit per string. For the burst trie, over 300MB of strings can be stored in a total of under 200MB of memory with significantly improved search time. These results, on a variety of data sets, show that cache-friendly variants of fundamental data structures can yield remarkable gains in performance.
Article
The Huffman algorithm allows for constructing optimal prefix codes with O(n·log n) complexity. As the number of symbols n grows, so does the complexity of building the code words. In this paper, a new algorithm and implementation are proposed that achieve nearly optimal coding without sorting the probabilities or building a tree of codes. The complexity is proportional to the maximum code length, making the algorithm especially attractive for large alphabets. The focus is put on achieving almost optimal coding with a fast implementation, suitable for real-time compression of large volumes of data. A practical case example about checkpoint file compression is presented, providing encouraging results.
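For reference, the classic O(n·log n) heap-based Huffman construction that such a method is measured against can be sketched as follows. This is an illustrative baseline, not the paper's algorithm, and the symbol frequencies are invented:

```python
import heapq

def huffman_code_lengths(freqs):
    """Classic Huffman construction via a min-heap of (frequency, tiebreak,
    {symbol: depth}) entries; returns each symbol's optimal code length."""
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, l1 = heapq.heappop(heap)  # two least-frequent subtrees
        f2, _, l2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol one level deeper.
        merged = {s: d + 1 for s, d in {**l1, **l2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

lengths = huffman_code_lengths({"a": 45, "b": 13, "c": 12,
                                "d": 16, "e": 9, "f": 5})
print(lengths["a"])  # the most frequent symbol gets the shortest code
```

The two heap pops per merge are where the O(n·log n) cost comes from; a length-limited scheme that avoids sorting trades a little optimality for avoiding exactly this step.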
Article
The increase in the amount of data is evident in recent times: the amount of data stored and retrieved is growing at a fast rate, and processing text data consumes a large amount of memory for storage and extraction. Sorting the stored data is one of the most effective methods for increasing the efficiency of extracting it. Graphics Processing Units (GPUs) have evolved from dedicated graphics-rendering modules into processors that exploit fast parallelism for large computational tasks. Using GPUs to sort large sets of strings has produced effective and fast results compared with using CPUs. This paper presents a comparative study of the most popular and efficient string sorting algorithms that have been implemented on CPU and GPU machines. It also proposes an efficient parallel multi-key quicksort implementation that uses a ternary search tree in order to increase the speed-up and efficiency of sorting large sets of string data.
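The multi-key (three-way radix) quicksort at the heart of this work can be sketched sequentially as follows. This is a plain illustration of the algorithm, not the proposed parallel GPU implementation; the word list is invented:

```python
def multikey_quicksort(strings, depth=0):
    """Bentley-Sedgewick multi-key quicksort: partition by the character
    at position `depth` into less/equal/greater buckets, recursing on the
    equal bucket with depth + 1 and on the other buckets at the same depth."""
    if len(strings) <= 1:
        return strings
    def ch(s):  # character at `depth`, or "" if the string is exhausted
        return s[depth] if depth < len(s) else ""
    pivot = ch(strings[len(strings) // 2])
    less = [s for s in strings if ch(s) < pivot]
    equal = [s for s in strings if ch(s) == pivot]
    greater = [s for s in strings if ch(s) > pivot]
    # When the pivot character is "", every string in `equal` is fully
    # consumed (they are identical), so no deeper recursion is needed.
    middle = equal if pivot == "" else multikey_quicksort(equal, depth + 1)
    return (multikey_quicksort(less, depth) + middle
            + multikey_quicksort(greater, depth))

words = ["banana", "apple", "band", "bandana", "ape", "apple"]
print(multikey_quicksort(words))
```

The independent less/equal/greater sub-sorts are what make the template attractive for dynamic parallelism: each bucket can be launched as its own kernel.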
Article
Full-text available
In many applications, independence of event occurrences is assumed, even if there is evidence for dependence. Capturing dependence leads to complex models, and even if the complex models were superior, they fail to beat the simplicity and scalability of the independence assumption. Therefore, many models assume independence and apply heuristics to improve results. Theoretical explanations of the heuristics are seldom given or generalizable. This paper reports that some of these heuristics can be explained as encoding dependence in an exponent based on the generalized harmonic sum. Unlike independence, where the probability of subsequent occurrences of an event is the product of the single event probability, harmony is based on a product with decaying exponent. For independence, the sequence probability is $p^{1+1+ \cdots +1}=p^n$, whereas for harmony, it is $p^{1+1/2+ \cdots +1/n}$. The generalized harmonic sum leads to a spectrum of harmony assumptions. This paper shows that harmony assumptions naturally extend probability theory. An experimental evaluation for information retrieval (IR; term occurrences) and social networks (SNs; user interactions) shows that assuming harmony is more suitable than assuming independence. The potential impact of harmony assumptions lies beyond IR and SNs, since many applications rely on probability theory and apply heuristics to compensate for the independence assumption. Given the concept of harmony assumptions, the dependence between multiple occurrences of an event can be reflected in an intuitive and effective way.
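The contrast between the two assumptions is easy to compute directly. This is a minimal numeric illustration of the formulas quoted above, with an invented single-event probability:

```python
def independence_prob(p, n):
    """Probability of n occurrences under independence: p ** n."""
    return p ** n

def harmony_prob(p, n):
    """Probability of n occurrences under the harmony assumption:
    p raised to the n-th harmonic number, 1 + 1/2 + ... + 1/n."""
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    return p ** harmonic

p = 0.1  # invented single-event probability
print(independence_prob(p, 3))  # p ** 3
print(harmony_prob(p, 3))       # p ** (1 + 1/2 + 1/3), much larger
```

Because the harmonic exponent grows far more slowly than n, repeated occurrences are penalized less harshly than under independence, which is exactly the effect the dependence-compensating heuristics produce.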
Article
Full-text available
Describes a project at Carnegie Mellon University libraries to convert the congressional papers of the late Senator John Heinz to digital format and to create an online system to search and retrieve these papers. Highlights include scanning, optical character recognition, and a search engine utilizing natural language processing.
Article
In this paper we describe an information retrieval system in which advanced natural language processing techniques are used to enhance the effectiveness of term-based document retrieval. The backbone of our system is a traditional statistical engine that builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (a) preprocess the documents in order to extract content-carrying terms, (b) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (c) process the user's natural language requests into effective search queries. During the course of the Text REtrieval Conferences, TREC-1 and TREC-2, our system has evolved from a scaled-up prototype, originally tested on such collections as CACM-3204 and Cranfield, to its present form, which can be effectively used to process hundreds of millions of words of unrestricted text.
Article
LSI is an extension of the vector retrieval method (e.g., Salton & McGill, 1983) in which the dependencies between terms are explicitly taken into account in the representation and exploited in retrieval. We assume that there is some underlying or "latent" structure in the pattern of word usage across documents, and use statistical techniques to estimate this latent structure. A description of terms, documents and user queries based on the underlying "latent semantic" structure (rather than surface-level word choice) is used for retrieving information. Latent Semantic Indexing uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985), to model the associations among terms and documents. As is the case with the standard vector method, LSI begins with a large term-document matrix in which the cell entries represent the frequency of a term in a document. This frequency matrix is then transformed using appropriate term weighting and document-length normalization. If there were no correlation between the occurrence of one term and another, then there would be no way to use the data in the term-document matrix to improve retrieval. On the other hand, if there is a great deal of structure in this matrix, i.e., the occurrence of some words gives us a strong clue as to the likely occurrence of others, then this structure can be modeled, and we use the SVD to do so. Any rectangular matrix X, for example a t × d matrix of terms and documents, can be decomposed into the product of three other matrices.
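The decomposition described above is straightforward to demonstrate with NumPy. The toy term-document matrix and the choice k = 2 below are invented for the example; they are not data from the paper:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# X = T S D^T: T holds term vectors, D^T document vectors,
# s the singular values in decreasing order.
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values: the rank-k "latent" space.
k = 2
X_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

print(X_k.shape)  # same shape as X, but rank at most k
```

Retrieval then compares queries and documents in the k-dimensional space of `T[:, :k]` and `Dt[:k, :]` rather than in the raw term space, so documents can match a query even when they share no literal terms.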
Article
The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in the TREC 2 environment, performing both routing and ad-hoc experiments. The ad-hoc work extends our investigations into combining global similarities, giving an overall indication of how a document matches a query, with local similarities identifying a smaller part of the document that matches the query. The performance of the ad-hoc runs is good, but it is clear we are not yet taking full advantage of the available local information.Our routing experiments use conventional relevance feedback approaches to routing, but with a much greater degree of query expansion than was previously done. The length of a query vector is increased by a factor of 5 to 10 by adding terms found in previously seen relevant documents. This approach improves effectiveness by 30–40% over the original query.
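The kind of feedback-driven query expansion used in these routing experiments can be sketched in the Rocchio style. This is a generic illustration with invented terms and weights, not the SMART system's exact formulation or parameter values:

```python
def rocchio_expand(query, relevant_docs, alpha=1.0, beta=0.75):
    """Rocchio-style feedback: boost the original query vector and add
    the average weights of terms seen in known relevant documents.
    Queries and documents are dicts mapping term -> weight."""
    expanded = {t: alpha * w for t, w in query.items()}
    for doc in relevant_docs:
        for t, w in doc.items():
            expanded[t] = expanded.get(t, 0.0) + beta * w / len(relevant_docs)
    return expanded

query = {"retrieval": 1.0}
rel = [{"retrieval": 2.0, "trec": 1.0},
       {"ranking": 1.0, "trec": 1.0}]
q2 = rocchio_expand(query, rel)
print(sorted(q2))  # the query now also carries 'ranking' and 'trec'
```

Expanding the vector by a factor of 5 to 10, as described above, simply means admitting many such feedback terms rather than only the strongest few.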
Article
The Okapi system has been used in a series of experiments on the TREC collections, investigating probabilistic models, relevance feedback, and query expansion, and interaction issues. Some new probabilistic models have been developed, resulting in simple weighting functions that take account of document length and within-document and within-query term frequency. All have been shown to be beneficial. Relevance feedback and query expansion are highly beneficial when based on large quantities of relevance data (as in the routing task). Interaction issues are much more difficult to evaluate in the TREC framework, and no benefits have yet been demonstrated from feedback based on small numbers of “relevant” items identified by intermediary searchers.
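The simple weighting functions described above, accounting for document length and within-document term frequency, grew into the now-standard BM25 formula. The sketch below uses the common "+1" idf smoothing variant with invented collection statistics; it illustrates the family of functions rather than any specific TREC run:

```python
import math

def bm25_weight(tf, dl, avgdl, df, N, k1=1.2, b=0.75):
    """BM25-style weight for one query term in one document:
    tf     within-document term frequency
    dl     document length; avgdl the collection's average length
    df     number of documents containing the term; N collection size."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    # Term-frequency saturation with document-length normalization.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return idf * norm

# A term appearing 3 times in an average-length document,
# found in 5 of 1000 documents:
print(bm25_weight(tf=3, dl=100, avgdl=100, df=5, N=1000))
```

The k1 term makes the weight saturate as tf grows, and b controls how strongly long documents are penalized, the two effects the Okapi experiments showed to be beneficial.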
Article
We report on two studies in the TREC-2 program which investigated the effect on retrieval performance of combination of multiple representations of TREC topics. In one of the projects, five separate Boolean queries for each of the 50 TREC routing topics and 25 of the TREC ad hoc topics were generated by 75 experienced online searchers. Using the INQUERY retrieval system, these queries were both combined into single queries, and used to produce five separate retrieval results, for each topic. In the former case, progressive combination of queries led to progressively improving retrieval performance, significantly better than that of single queries, and at least as good as the best individual single query formulations. In the latter case, data fusion of the ranked lists also led to performance better than that of any single list. In the second project, two automatically-produced vector queries and three versions of a manually produced P-norm extended Boolean query for each routing and ad hoc topic were compared and combined. This project investigated six different methods of combination of queries, and the combination of the same queries on different databases. As in the first project, progressive combination led to progressively improving results, with the best results, on average, being achieved by combination through summing of retrieval status values. Both projects found that the best method of combination often led to results that were better than the best performing single query. The combined results from the two projects have also been combined, by data fusion. The results of this procedure show that combining evidence from completely different systems also leads to performance improvement.
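The best-performing combination method reported above, summing retrieval status values across runs, is commonly known as CombSUM and can be sketched as follows. The runs and scores are invented for illustration:

```python
def combsum(runs):
    """CombSUM data fusion: sum each document's retrieval status values
    across several ranked lists (e.g. from different query formulations
    or different systems), then rank by the fused score."""
    fused = {}
    for run in runs:  # each run maps document id -> score
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused, key=fused.get, reverse=True)

run_a = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
run_b = {"d2": 0.8, "d3": 0.5, "d4": 0.3}
print(combsum([run_a, run_b]))  # d2 rises to the top: 0.4 + 0.8 = 1.2
```

A document retrieved by several independent formulations accumulates evidence from each of them, which is why fused lists can beat the best single query.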
Voorhees, E. (1994). On expanding query vectors with lexically related words. In D. Harman (Ed.), The Second Text REtrieval Conference (TREC-2), National Institute of Standards and Technology Special Publication 500-215, Gaithersburg, MD 20899.
Harman, D. (Ed.). (1993). The First Text REtrieval Conference (TREC-1). National Institute of Standards and Technology Special Publication 500-207, Gaithersburg, MD 20899.
Harman, D. Data preparation. In The Proceedings of the TIPSTER Text Program - Phase I.