Article

Access-Ordered Indexes

Abstract

Search engines are an essential tool for modern life. We use them to discover new information on diverse topics and to locate a wide range of resources. The search process in all practical search engines is supported by an inverted index structure that stores all search terms and their locations within the searchable document collection. Inverted indexes are highly optimised, and significant work has been undertaken over the past fifteen years to store, retrieve, compress, and understand heuristics for these structures. In this paper, we propose a new self-organising inverted index based on past queries. We show that this access-ordered index improves query evaluation speed by 25% to 40% over a conventional, optimised approach with almost indistinguishable accuracy. We conclude that access-ordered indexes are a valuable new tool to support fast and accurate web search.

... Another approach is to limit the number of documents available for the extraction of terms, which should result in higher efficiency, due to reduced cache misses when retrieving the remaining documents and otherwise smaller seek times, if the limited number of documents are clustered on disk. Documents could be chosen by, for example, discarding those that are the least often accessed over a large number of queries (Garcia et al., 2004). ...
... Other strategies could also lead to reduced costs. Only some documents, perhaps chosen by frequency of access (Garcia et al., 2004) or sampling, might be included in the set of surrogates. A second tier of surrogates could be stored on disk, for retrieval in cases where the highly-ranked documents are not amongst those selected by sampling. ...
... Furthermore, we could use a hybrid approach of our in-memory summaries and using a sub-collection of documents by keeping only summaries of selected documents (for instance chosen by frequency of access (Garcia et al., 2004)) in memory. This would reduce the overall memory requirements and should increase the likelihood that needed inverted lists are cached. ...
Article
Full-text available
Thesis
Full-text available
Hundreds of millions of users each day search the web and other repositories to meet their information needs. However, queries can fail to find documents due to a mismatch in terminology. Query expansion seeks to address this problem by automatically adding terms from highly ranked documents to the query. While query expansion has been shown to be effective at improving query performance, the gain in effectiveness comes at a cost: expansion is slow and resource-intensive. Current techniques for query expansion use fixed values for key parameters, determined by tuning on test collections. We show that these parameters may not be generally applicable, and, more significantly, that the assumption that the same parameter settings can be used for all queries is invalid. Using detailed experiments, we demonstrate that new methods for choosing parameters must be found. In conventional approaches to query expansion, the additional terms are selected from highly ranked documents returned from an initial retrieval run. We demonstrate a new method of obtaining expansion terms, based on past user queries that are associated with documents in the collection. The most effective query expansion methods rely on costly retrieval and processing of feedback documents. We explore alternative methods for reducing query-evaluation costs, and propose a new method based on keeping a brief summary of each document in memory. This method allows query expansion to proceed three times faster than previously, while approximating the effectiveness of standard expansion. We investigate the use of document expansion, in which documents are augmented with related terms extracted from the corpus during indexing, as an alternative to query expansion. The overheads at query time are small. We propose and explore a range of corpus-based document expansion techniques and compare them to corpus-based query expansion on TREC data. 
These experiments show that document expansion delivers at best limited benefits, while query expansion - including standard techniques and efficient approaches described in recent work - usually delivers good gains. We conclude that document expansion is unpromising, but it is likely that the efficiency of query expansion can be further improved.
... Several techniques have been proposed to optimise the organisation of inverted lists to allow efficient query processing. One such technique is the access-ordered index [7]. In this approach, inverted lists are reorganised based on past user queries to allow faster processing at query time. ...
... Garcia et al. demonstrated that such a skew in query terms leads to a non-uniform distribution of documents returned to the user at query time [7]. That is, given a query log, the search system will in general return a subset of documents from the collection more frequently than all other documents. Figure 1 shows the skew in distribution of document access on a collection of 7.5 million documents when 20 million queries from Lycos.de are run against the collection. ...
... Based on such trends, Garcia et al. proposed an index organisation technique where the most frequently accessed documents are placed towards the head of the inverted lists [7]. They label this technique access-ordering. ...
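The access-ordering idea described in the snippets above can be illustrated with a minimal sketch. It assumes a toy index that maps each term to a list of document ids, and it assumes that an "access" means a document appearing in the results of a past query. The function names and the data are hypothetical and are not taken from the cited paper.

```python
from collections import Counter


def count_accesses(result_lists):
    """Count how often each document id appears in the results of past queries."""
    counts = Counter()
    for results in result_lists:
        counts.update(results)
    return counts


def access_order(index, counts):
    """Return a copy of the index with every postings list sorted by
    descending access count (ties broken by ascending document id)."""
    return {
        term: sorted(postings, key=lambda doc_id: (-counts[doc_id], doc_id))
        for term, postings in index.items()
    }


# Hypothetical toy data: term -> document-ordered postings list.
index = {"web": [1, 2, 3, 4], "search": [2, 4, 5]}
# Hypothetical result lists returned for three past queries.
past_results = [[4, 2], [4, 5], [4, 2, 1]]

counts = count_accesses(past_results)
reordered = access_order(index, counts)
print(reordered)
```

Documents that were never returned for a past query have a count of zero and therefore sink to the tail of each list.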
Conference Paper
Full-text available
Reorganising the index of a search engine based on access frequencies can significantly reduce query evaluation time while maintaining search effectiveness. In this paper we extend access-ordering and introduce a variant index organisation technique that we label access-reordering. We show that by access-reordering an inverted index, query evaluation time can be reduced by as much as 62% over the standard approach, while yielding highly similar effectiveness results to those obtained when using a conventional index. Keywords: search engines, index organisation, efficiency, access-ordering
... They suggest a term-impact, where the impact is calculated based on the influence of a term within a document, similar to [5], as well as a document-centric impact, which calculates the impact at a document level rather than a global level. Garcia et al. [10] alternatively suggest sorting the inverted index based on access counts. Founded on the idea that even with a large number of different queries the same documents are quite often ranked highly, while other documents are rarely if ever returned to the user, the postings are ordered so that the documents relevant for most queries are towards the top of the lists and the less retrieved documents are stored towards the bottom. ...
... Similar to our previous involvements in the Terabyte track in 2004 and 2005, we again utilised the top subset approach [7, 4, 9] for selecting the postings to be processed from each query term. We have found that this is also the same approach that is utilised by Garcia et al [10], where it is referred to as maxpost. Using this strategy, a maximum number of postings is chosen to be processed from each posting list, e.g. ...
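A minimal sketch of the maxpost strategy mentioned above: only a fixed number of postings is read from the head of each query term's list. The index layout (term mapped to a list of document id and integer weight pairs), the additive scoring, and the name evaluate_maxpost are assumptions made for illustration only.

```python
from collections import defaultdict


def evaluate_maxpost(index, query_terms, maxpost):
    """Score documents using at most `maxpost` postings from the head of
    each query term's list; later postings are never read."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, weight in index.get(term, [])[:maxpost]:
            scores[doc_id] += weight
    # Highest score first; ties broken by ascending document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical index whose lists are already sorted so that the
# preferred documents come first (for example, by access count).
index = {
    "web": [(4, 5), (2, 9), (1, 3), (3, 8)],
    "search": [(4, 7), (2, 2), (5, 6)],
}

ranking = evaluate_maxpost(index, ["web", "search"], maxpost=2)
print(ranking)
```

With maxpost=2, document 3 is never scored even though its weight for "web" is high, which shows why the ordering of each list determines how much effectiveness is lost by truncation.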
Article
For the 2006 Terabyte track in TREC, Dublin City University’s participation was focussed on the ad hoc search task. As per the previous two years [7, 4], our experiments on the Terabyte track have concentrated on the evaluation of a sorted inverted index, the aim of which is to sort the postings within each posting list in such a way that only a limited number of postings need to be processed from each list, while at the same time minimising the loss of effectiveness in terms of query precision. This is done using the Físréal search system, developed at Dublin City University [4, 8].
... Methods: It has been shown that the likelihood of document access from a collection is non-uniform [2]. As such, for each document in a collection, a probability can be obtained that indicates the likelihood of seeing that document in any given result set. ...
Article
Full-text available
e the quality of a set of predictions is the area under the curve metric that measures the quality of the prediction by measuring the difference in average precision between the worst 25 predicted topics of a run and the worst 25 performing topics. We use this metric on the TREC 2005 Robust topics and the Aquaint collection. Results: The figure illustrates the performance of our five techniques using the area under the curve approach. Each line shows the mean average precision for the remaining TREC queries after removing the worst x queries. The best possible prediction for this run is shown with the optimal curve. The area between the optimal curve and the other curves shows the gap between that approach and the optimal prediction. [Figure: MAP of remaining documents (y-axis) against the remaining 50 - x queries (x-axis); curves: optimal, as-a-1000, as-k-1000, as-k-50, as-p-1000, dl-k-1000.] Conclusions: We explore a novel approach to query difficulty prediction and propose five metrics to determine
... Another approach is limiting the number of documents available for extraction of terms, which should result in higher efficiency, due to reduced cache misses when retrieving the remaining documents and otherwise smaller seek times as it can be expected that the limited number of documents are clustered on disk. Documents could be chosen by, for example, discarding those that are the least often accessed over a large number of queries (Garcia et al., 2004). A more radical measure is to use in-memory document surrogates that provide a sufficiently large pool of expansion terms, as described in the following section. ...
... Other strategies could also lead to reduced costs. Only some documents, perhaps chosen by frequency of access (Garcia et al., 2004) or sampling, might be included in the set of surrogates. A second tier of surrogates could be stored on disk, for retrieval in cases where the highly-ranked documents are not amongst those selected by sampling. ...
Article
Query expansion is a well-known method for improving average effectiveness in information retrieval. The most effective query expansion methods rely on retrieving documents which are used as a source of expansion terms. Retrieving those documents is costly. We examine the bottlenecks of a conventional approach and investigate alternative methods aimed at reducing query evaluation time. We propose a new method that draws candidate terms from brief document summaries that are held in memory for each document. While approximately maintaining the effectiveness of the conventional approach, this method significantly reduces the time required for query expansion by a factor of five to ten.
... Past submissions to the robust track have typically either considered query term related statistics, or have relied on the similarity measure between the query and results (Voorhees, 2004). It has been shown that documents within a collection have a non-uniform likelihood of retrieval (Singhal et al., 1996, Garcia et al., 2004). Such information can be used to construct prior probabilities of document access. ...
... Two aspects of importance to our approach are the generation of the retrieval-likelihood values, and the method used to compare the ordering of documents by retrieval-likelihood to that of the ranked result set. Given a collection and a query log for that collection, access-counts can be used to measure the skew of document access by a search engine (Garcia et al., 2004). To generate our retrieval-likelihood probabilities we used an approach similar to that of access counting, but instead of processing a complete query log over a document collection, we processed a single query that was the amalgamation of every distinct term in the collection. ...
Conference Paper
Full-text available
The terabyte track consists of the three tasks: adhoc retrieval, efficient retrieval, and named page finding. For the adhoc retrieval task we used a language modelling approach based on query likelihood, as well as a new technique aimed at reducing the amount of memory used for ranking documents. For the efficiency task, we submitted results from both a single-machine system and one that was distributed among a number of machines, with promising results. The named page task was organised by RMIT University and as a result we repeated last year's experiments, slightly modified, with this year's data. The robust track has two subtasks: adhoc retrieval, and query difficulty prediction. For adhoc retrieval, we employed a standard local analysis query expansion method, sourcing expansion terms for different runs from the collection supplied, from a side corpus, or a combination of both. In one run, we also tested removing duplicate documents from the list of results. In order to predict topic difficulty, we evaluated different document priors of the documents in the result set, in the hope of supplying a more robust set of answers at the cost of returning a potentially smaller number of relevant documents. The second task was to predict query difficulty. To this end, we compared the order of the documents in the result sets to the ordering as determined by document priors. A high similarity in orderings indicated that the query was poor. Two different priors were used. The first was based on document access counts, where each document is given a score that is derived from how likely it is to be ranked by a randomly generated query. The second was directly related to the document size. In this paper we outline our approaches and experiments in both tracks, and discuss our results.
Conference Paper
Full-text available
Query expansion is a well-known method for improving average effectiveness in information retrieval. However, the most effective query expansion methods rely on costly retrieval and processing of feedback documents. We explore alternative methods for reducing query-evaluation costs, and propose a new method based on keeping a brief summary of each document in memory. This method allows query expansion to proceed three times faster than previously, while approximating the effectiveness of standard expansion.
... This is known as document ordering and is commonly used in text retrieval systems because it is straightforward to maintain, and additionally yields compression benefits as discussed below. Ordering postings list entries by metrics other than document number can achieve significant efficiency gains during ranked query evaluation (Anh & Moffat, 2002; Garcia, Williams, & Cannane, 2004; Persin, Zobel, & Sacks-Davis, 1996), but the easiest way to construct such variant indexes is by post-processing a document-ordered index. Another key to efficient evaluation of text queries is index compression. ...
... Doing so involves separately storing state information that describes the end of the existing list: the last number encoded, the number of bits consumed in the last byte, and so on. For addition of new documents in document-ordered lists, such appending is straightforward; under other organizations of postings lists, such as frequency-ordered (Persin et al., 1996), impact-ordered (Anh & Moffat, 2002), or access-ordered (Garcia et al., 2004), the entire existing list must be decoded. Our previous experiments have shown that implementations that must decode postings lists prior to updating them are significantly less efficient than implementations that store information describing the state of the end of each list (Lester et al., 2004). ...
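The point about appending to a document-ordered list without decoding it can be sketched as follows. The sketch assumes a byte-aligned variable-byte code over document-number gaps, so the only state needed is the last document number encoded; a bit-aligned code would also need the bit offset mentioned in the snippet. The class and function names are hypothetical and do not come from the cited work.

```python
def vbyte_encode(n):
    """Encode a non-negative integer with 7 data bits per byte,
    low-order bits first; the high bit marks the final byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode a sequence of variable-byte integers into a list."""
    values, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            values.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return values


class DocOrderedList:
    """Compressed postings list of document ids stored as gaps."""

    def __init__(self):
        self.data = bytearray()
        self.last_doc = 0  # stored state: last document number encoded

    def append(self, doc_id):
        """Append a new, larger document id without decoding the list."""
        if doc_id <= self.last_doc:
            raise ValueError("document ids must be appended in increasing order")
        self.data += vbyte_encode(doc_id - self.last_doc)
        self.last_doc = doc_id

    def doc_ids(self):
        """Decode the whole list (needed for reading, not for appending)."""
        ids, current = [], 0
        for gap in vbyte_decode(self.data):
            current += gap
            ids.append(current)
        return ids


postings = DocOrderedList()
for doc_id in (3, 10, 300, 301):
    postings.append(doc_id)
print(postings.doc_ids(), len(postings.data))
```

Under an access-ordered or impact-ordered layout a new document may belong anywhere in the list, so this constant-state append is not available and the list has to be decoded and rewritten.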
Article
Search engines and other text retrieval systems use high-performance inverted indexes to provide efficient text query evaluation. Algorithms for fast query evaluation and index construction are well-known, but relatively little has been published concerning update. In this paper, we experimentally evaluate the two main alternative strategies for index maintenance in the presence of insertions, with the constraint that inverted lists remain contiguous on disk for fast query evaluation. The in-place and re-merge strategies are benchmarked against the baseline of a complete re-build. Our experiments with large volumes of web data show that re-merge is the fastest approach if large buffers are available, but that even a simple implementation of in-place update is suitable when the rate of insertion is low or memory buffer size is limited. We also show that with careful design of aspects of implementation such as free-space management, in-place update can be improved by around an order of magnitude over a naïve implementation.
... Static index pruning plays an important role in large-scale web information retrieval systems, which crawl and index hundreds of billions of pages [4], [6], [2]. The key point in designing a good static index pruning technique is the function used to assess the importance of a document within a postings list. ...
Conference Paper
Full-text available
Static index pruning techniques aim at removing from the posting lists of an inverted file the references to documents which are unlikely to be relevant for answering user queries. The reduction in the size of the index results in a better exploitation of memory hierarchies and faster query processing. On the other hand, pruning may affect the precision of the information retrieval system, since pruned entries are unavailable at query processing time. Static pruning techniques proposed so far exploit query-independent measures to evaluate the importance of a document within a posting list. This paper proposes a general framework that aims at enhancing the precision of any static pruning method by exploiting usage information extracted from query logs. Experiments conducted on the TREC WT10g Web collection and a large Altavista query log show that integrating usage knowledge into the pruning process is profitable, and remarkably increases the performance figures obtained with the state-of-the-art Carmel's static pruning method.
... In our preliminary experiments, the initial estimated values for D(i, j) suggested that the degree of overlap among collections is usually overestimated. This observation can be easily explained; document retrieval models are biased towards returning some popular documents for many queries [Garcia et al., 2004]. In addition, we found evidence that samples produced by query-based sampling are not random [Shokouhi et al., 2006b]. ...
Conference Paper
In federated text retrieval systems, the query is sent to multiple collections at the same time. The results returned by collections are gathered and ranked by a central broker that presents them to the user. It is usually assumed that the collections have little overlap. However, in practice collections may share many common documents as either exact or near duplicates, potentially leading to high numbers of duplicates in the final results. Considering the natural bandwidth restrictions and efficiency issues of federated search, sending queries to redundant collections leads to unnecessary costs. We propose a novel method for estimating the rate of overlap among collections based on sampling. Then, using the estimated overlap statistics, we propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results. We show experimentally that, although our estimates of overlap are not inexact, our suggested techniques can significantly improve the search effectiveness when collections overlap. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Informa-
... In practice, however, generating random samples by random queries is subject to biases. For example, some documents are more likely to be retrieved for a wide range of queries, and some might never appear in the results [Garcia et al., 2004]. Moreover, long documents are more likely to be retrieved, and there could be other biases in the collection ranking functions. ...
Conference Paper
Full-text available
Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in use of resources required to perform the estimation.
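The abstract above does not say which ecological techniques the authors adapt. As a hedged illustration of the general idea only, the sketch below uses the classic two-sample Lincoln-Petersen capture-recapture estimator, treating two independently drawn samples of document ids as the "capture" and "recapture" rounds. The function name and the sample data are hypothetical, and this should not be read as the method proposed in that paper.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate population size as n1 * n2 / m, where n1 and n2 are the
    sizes of the two samples and m is the number of items seen in both."""
    first, second = set(first_sample), set(second_sample)
    recaptured = len(first & second)
    if recaptured == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return len(first) * len(second) / recaptured


# Hypothetical samples of document ids obtained from a collection.
sample_a = range(0, 100)    # 100 documents
sample_b = range(80, 180)   # 100 documents, 20 of them already seen

estimate = lincoln_petersen(sample_a, sample_b)
print(estimate)
```

The estimator assumes that every document is equally likely to be sampled. The surrounding snippets note that query-based samples are biased towards frequently retrieved documents, which is one reason a plain estimate of this kind can be inaccurate in practice.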
... We also replicate documents frequently retrieved as top results in previous queries. The frequency with which documents have been retrieved in the past has been used for efficient query processing to sort posting lists [20]. ...
Conference Paper
Full-text available
Web search engines are often implemented as centralized systems. Designing and implementing a Web search engine in a distributed environment is a challenging engineering task that encompasses many interesting research questions. However, distributing a search engine across multiple sites has several advantages, such as utilizing less compute resources and exploiting data locality. In this paper we investigate the cost-effectiveness of building a distributed Web search engine. We propose a model for assessing the total cost of a distributed Web search engine that includes the computational costs and the communication cost among all distributed sites. We then present a query-processing algorithm that maximizes the amount of queries answered locally, without sacrificing the quality of the results compared to a centralized search engine. We simulate the algorithm on real document collections and query workloads to measure the actual parameters needed for our cost model, and we show that a distributed search engine can be competitive compared to a centralized architecture with respect to real cost.
... There are several reasons why this subset of documents is worth investigating. Document access patterns are extremely skewed, such that some documents are accessed extremely frequently and others rarely or not at all (Garcia et al. 2004). For example, the documents that appeared in the largest sets discussed in Section 6 would be relevant to few queries and as such accessed rarely. ...
Conference Paper
The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous attempts to address this issue, most notably the TREC novelty track, were characterized by difficulties with accuracy and evaluation. In this paper we explore syntactic techniques --- particularly document fingerprinting --- for detecting content equivalence. Using these techniques on the TREC GOV1 and GOV2 corpora revealed a high degree of redundancy; a user study confirmed that our metrics were accurately identifying content-equivalence. We show, moreover, that content-equivalent documents have a significant effect on the search experience: we found that 16.6% of all relevant documents in runs submitted to the TREC 2004 terabyte track were redundant.
... Not all documents are created equal; some documents contain much more information than others, or information of a higher quality, or information that is pertinent to a wider variety of queries. Document access patterns are extremely skewed, such that some documents are accessed extremely frequently and others rarely or not at all (Garcia et al., 2004). For example, the documents that appeared in the largest sets discussed in Section 5.6.2 would be relevant to few queries and as such accessed rarely. ...
... Another ordering variant is described where, within each inverted list, documents are ordered according to the number of times the document is highly ranked by a training query. The savings yielded are not as high as for impact-ordering and large numbers of training queries are required [7]. Many different types of index have been described but the most efficient index structure for text query evaluation is the inverted file [8]. ...
Article
Full-text available
The number of documents is increasing rapidly. These documents exist not only on paper but also in electronic form. This can be seen from sample data taken from the publisher SpringerLink in 2010, which showed an increase in the number of digital document collections from 2003 to mid-2010. Managing them well has therefore become an important need. This paper describes a new method for managing documents called the inverted files system. For electronic documents, the inverted files system is closely tied to how a document is made searchable over the Internet using a search engine. It can improve both the document search mechanism and the document storage mechanism.
... That is, the sampled documents are assumed to be uniformly selected from the collection. Although previous studies suggested that the documents downloaded by query-based sampling are not uniformly sampled [22, 34, 101, 235, 256], Shokouhi and Zobel [232] empirically suggested that the performance of SAFE is not significantly affected by that assumption. In the final step, SAFE uses the regression techniques [118] to fit a curve to the adjusted scores, and to predict the scores of the top-ranked (unseen) documents returned by each collection. ...
Article
Federated search (federated information retrieval or distributed information retrieval) is a technique for searching multiple text collections simultaneously. Queries are submitted to a subset of collections that are most likely to return relevant answers. The results returned by selected collections are integrated and merged into a single list. Federated search is preferred over centralized search alternatives in many environments. For example, commercial search engines such as Google cannot easily index uncrawlable hidden web collections while federated search systems can search the contents of hidden web collections without crawling. In enterprise environments, where each organization maintains an independent search engine, federated search techniques can provide parallel search over multiple collections. There are three major challenges in federated search. For each query, a subset of collections that are most likely to return relevant documents are selected. This creates the collection selection problem. To be able to select suitable collections, federated search systems need to acquire some knowledge about the contents of each collection, creating the collection representation problem. The results returned from the selected collections are merged before the final presentation to the user. This final step is the result merging problem. The goal of this work is to provide a comprehensive summary of the previous research on the federated search challenges described above.
... Despite the strong need for indexing and the popularity of multi-cores, little work has addressed the mapping of document inversion onto multi-cores. Authors focused on in-main-memory index merging [3], on index re-orderings [7] that reduce querying latencies, or on compressed text databases [18] that reduce footprint, all at the expense of indexing speed. No work addresses indexing with reference to multi-cores and data-level parallelism. ...
Article
Text indexing is computationally expensive. Commercial search engines attack the task with massive, scalable, cluster-based solutions. But different domains (e.g., desktop, embedded, network appliances) are not compatible with a cluster solution. These domains would greatly benefit from a small form-factor, high-performance text indexing solution. Such a solution is the key enabler for new applications like wire-speed traffic indexing for network security forensics. The Cell/B.E. Processor is a popular multi-core platform that promises enough compute power to perform live indexing, but its cores are notorious for architectural peculiarities (scratchpad memories, weak branch prediction) that require radical algorithm redesign to achieve acceptable performance. No previous work has investigated the potential of the Cell processor for indexing tasks. In this work we consider document inversion, a core component of text indexing, and propose Blocked Hash-Based Inversion (BHBI), a data-parallel, single-pass, hash-based, in-core algorithm that maps well to the peculiarities we mentioned above. We show the viability of our approach with a proof-of-concept implementation optimized for the Cell processor. Our tests show that our parallel document inversion is 1,200× faster than a single-core vanilla implementation of traditional Single-Pass In-Memory Inversion (SPIMI).
... Following this, we use Gini to compute the bias of the system on the overall collection using the r(d) scores. We also report the total retrievability (RSUM), which is $\sum_d r(d)$, to provide a measure of how much retrievability is afforded to the collection (a similar access measure is used by Garcia et al. [15]). Figure 1 provides plots of MAP, NDCG@10, Gini coefficient, RSUM, and query time across the p-ratios for each of the pruning algorithms. ...
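The two summary measures named in the snippet above can be sketched as follows, assuming the retrievability scores r(d) are already available as a list of non-negative numbers. The Gini coefficient is computed with the standard sorted-rank formula; the function names and the example scores are hypothetical.

```python
def rsum(scores):
    """Total retrievability: the sum of r(d) over all documents d."""
    return sum(scores)


def gini(scores):
    """Gini coefficient of non-negative scores.

    0 means every document is equally retrievable; values near 1 mean
    retrievability is concentrated in very few documents.
    """
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i), with x sorted ascending
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical r(d) scores for two four-document collections.
even = [5, 5, 5, 5]
skewed = [0, 0, 0, 10]

print(rsum(even), gini(even))
print(rsum(skewed), gini(skewed))
```

The two collections have different totals but, more importantly, very different Gini values, which is the kind of access skew that the snippets attribute to Garcia et al.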
Conference Paper
Prior work on using retrievability measures in the evaluation of information retrieval (IR) systems has laid out the foundations for investigating the relation between retrieval performance and retrieval bias. While various factors influencing retrievability have been examined, showing how the retrieval model may influence bias, no prior work has examined the impact of the index (and how it is optimized) on retrieval bias. Intuitively, how the documents are represented, and what terms they contain, will influence whether they are retrievable or not. In this paper, we investigate how the retrieval bias of a system changes as the inverted index is optimized for efficiency through static index pruning. In our analysis, we consider four pruning methods and examine how they affect performance and bias on the TREC GOV2 Collection. Our results show that the relationship between these factors is varied and complex - and very much dependent on the pruning algorithm. We find that more pruning results in relatively little change or a slight decrease in bias up to a point, and then a dramatic increase. The increase in bias corresponds to a sharp decrease in early precision such as NDCG@10 and is also indicative of a large decrease in MAP. The findings suggest that the impact of pruning algorithms can be quite varied - but retrieval bias could be used to guide the pruning process. Further work is required to determine precisely which documents are most affected and how this impacts upon performance.
Article
This article analyzes the current state of domestic and foreign research on the contribution of informationization to the advanced orbital transportation equipment manufacturing industry and on the fusion of the two, with emphasis on the policy development environment in China. First, a model of the contribution of informationization to the advanced orbital transportation equipment manufacturing industry is built. A fusion stage model is then proposed, based on strategy, operation, and the degree of fusion between the technology and the industry. An evaluation index system for informationization and the advanced orbital transportation equipment manufacturing industry is built, and a hierarchical model is chosen for it. On this basis, the paper proposes measures to promote the integration of information technology with advanced orbital transportation equipment manufacturing. For example, coordination mechanisms should be established between the competent authorities for informationization and for advanced orbital transportation equipment manufacturing at all levels.
Article
This thesis proposes the design of a parallel search engine to which documents are continuously delivered and processed on-line and in parallel together with user queries. These operations respect the causal order given by the instants at which documents are added to or updated in the system. That is, if a query arrives at the search engine before a document has been completely ingested, that document is not considered when constructing the answer to the query. The goal is a system capable of concurrently indexing the documents it receives and including them in answers to users as soon as possible. Applications for this kind of search engine include the Web as well as electronic stock-exchange systems, news servers, and in general any system where it is critical to respect the order in which the document base is updated and queried. Documents may be submitted by external software entities or users, or collected by the search engine itself by visiting Web sites or database servers. The proposed design uses parallel and distributed computing techniques to schedule and perform the automatic collection of documents and to process them concurrently with queries. Algorithms designed on the BSP model of parallel computing are presented, and implementations and experiments on high-performance computer clusters are described. In particular, two concurrency control strategies are proposed and evaluated, both of which are more efficient than the classical algorithms developed for database systems. A strategy for retrieving documents from the Web based on partitioning by site is also proposed.
Article
With the arrival of multi-core CPUs (chip-level multiprocessors, CMPs), it has become essential to develop techniques that exploit the advantages of CMPs to increase application performance through parallel computing. This thesis proposes the design of a search engine capable of exploiting the level of parallelism available in CMPs to process thousands of queries per unit of time. In particular, for this application, given the enormous amount of computational resources it demands, it is important to develop parallel strategies that can make efficient use of the available hardware. The proposed design uses parallel and distributed computing techniques to organise and process queries. A hybrid parallelisation scheme is proposed, based on the BSP and OpenMP programming paradigms and designed to take full advantage of the multi-threading features of CMPs for search engines. Implementations and experiments are described on two types of processor: the Sun Microsystems UltraSPARC T1 and two Intel Quad-Xeon nodes.
Chapter
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in a text, we use additional indexes to store information about nearby words that are at distances from the given word of less than or equal to the MaxDistance parameter. We showed that additional indexes with three-component keys can be used to improve the average query execution time by up to 94.7 times if the queries consist of high-frequency occurring words. In this paper, we present a new search algorithm with even more performance gains. We consider several strategies for selecting multi-component key indexes for a specific query and compare these strategies with the optimal strategy. We also present the results of search experiments, which show that three-component key indexes enable much faster searches in comparison with two-component key indexes.
Article
In federated information retrieval, a query is routed to multiple collections and a single answer list is constructed by combining the results. Such metasearch provides a mechanism for locating documents on the hidden Web and, by use of sampling, can proceed even when the collections are uncooperative. However, the similarity scores for documents returned from different collections are not comparable, and, in uncooperative environments, document scores are unlikely to be reported. We introduce a new merging method for uncooperative environments, in which similarity scores for the sampled documents held for each collection are used to estimate global scores for the documents returned per query. This method requires no assumptions about properties such as the retrieval models used. Using experiments on a wide range of collections, we show that in many cases our merging methods are significantly more effective than previous techniques.
Chapter
We investigate potential benefits of exploiting a global impact ordering in a selective search architecture. We propose a generalized, ordering-aware version of the learning-to-rank-resources framework [9] along with a modified selection strategy. By allowing partial shard processing we are able to achieve a better initial trade-off between query cost and precision than the current state of the art. Thus, our solution is suitable for increasing query throughput during periods of peak load or in low-resource systems.
Conference Paper
The presentation of query biased document snippets as part of results pages presented by search engines has become an expectation of search engine users. In this paper we explore the algorithms and data structures required as part of a search engine to allow efficient generation of query biased snippets. We begin by proposing and analysing a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library. These experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and so caching documents in RAM is essential for a fast snippet generation process. Using simulation, we examine snippet generation performance for different size RAM caches. Finally we propose and analyse document reordering and compaction, revealing a scheme that increases the number of document cache hits with only a marginal effect on snippet quality. This scheme effectively doubles the number of documents that can fit in a fixed size cache.
Article
This paper examines the space-time performance of in-memory conjunctive list intersection algorithms, as used in search engines, where integers represent document identifiers. We demonstrate that the combination of bitvectors, large skips, delta compressed lists and URL ordering produces superior results to using skips or bitvectors alone. We define semi-bitvectors, a new partial bitvector data structure that stores the front of the list using a bitvector and the remainder using skips and delta compression. To make it particularly effective, we propose that documents be ordered so as to skew the postings lists to have dense regions at the front. This can be accomplished by grouping documents by their size in a descending manner and then reordering within each group using URL ordering. In each list, the division point between bitvector and delta compression can occur at any group boundary. We explore the performance of semi-bitvectors using the GOV2 dataset for various numbers of groups, resulting in significant space-time improvements over existing approaches. Semi-bitvectors do not directly support ranking. Indeed, bitvectors are not believed to be useful for ranking based search systems, because frequencies and offsets cannot be included in their structure. To refute this belief, we propose several approaches to improve the performance of ranking-based search systems using bitvectors, and leave their verification for future work. These proposals suggest that bitvectors, and more particularly semi-bitvectors, warrant closer examination by the research community.
Article
Static index pruning techniques permanently remove a presumably redundant part of an inverted file, to reduce the file size and query processing time. These techniques differ in deciding which parts of an index can be removed safely; that is, without changing the top-ranked query results. As defined in the literature, the query view of a document is the set of query terms that access this particular document, that is, that retrieve this document among their top results. In this paper, we first propose using query views to improve the quality of the top results compared against the original results. We incorporate query views in a number of static pruning strategies, namely term-centric, document-centric, term popularity based and document access popularity based approaches, and show that the new strategies considerably outperform their counterparts, especially for the higher levels of pruning and for both disjunctive and conjunctive query processing. Additionally, we combine the notions of term and document access popularity to form new pruning strategies, and further extend these strategies with the query views. The new strategies improve the result quality especially for conjunctive query processing, which is the default and most common search mode of a search engine.
Conference Paper
This paper introduces the concept of accessibility from the field of transportation planning and adopts it within the context of Information Retrieval (IR). An analogy is drawn between the fields, which motivates the development of document accessibility measures for IR systems. Considering the accessibility of documents within a collection given an IR System provides a different perspective on the analysis and evaluation of such systems which could be used to inform the design, tuning and management of current and future IR systems.
Article
Even after 20 years of research on real-world image retrieval, there is still a big gap between what search engines can provide and what users expect to see. To bridge this gap, we present an image knowledge base, ImageKB, a graph representation of structured entities, categories, and representative images, as a new basis for practical image indexing and search. ImageKB is automatically constructed via a scalable approach, both bottom-up and top-down, that efficiently matches 2 billion web images onto an ontology with millions of nodes. Our approach consists of identifying duplicate image clusters from billions of images, obtaining a candidate set of entities and their images, discovering definitive texts to represent an image and identifying representative images for an entity. To date, ImageKB contains 235.3M representative images corresponding to 0.52M entities, much larger than the state-of-the-art alternative ImageNet that contains 14.2M images for 0.02M synsets. Compared to existing image databases, ImageKB reflects the distributions of both images on the web and users' interests, contains rich semantic descriptions for images and entities, and can be widely used for both text to image search and image to text understanding.
Conference Paper
We investigated how shape features in natural images influence emotions aroused in human beings. Shapes and their characteristics such as roundness, angularity, simplicity, and complexity have been postulated to affect the emotional responses of human beings in the field of visual arts and psychology. However, no prior research has modeled the dimensionality of emotions aroused by roundness and angularity. Our contributions include an in depth statistical analysis to understand the relationship between shapes and emotions. Through experimental results on the International Affective Picture System (IAPS) dataset we provide evidence for the significance of roundness-angularity and simplicity-complexity on predicting emotional content in images. We combine our shape features with other state-of-the-art features to show a gain in prediction and classification accuracy. We model emotions from a dimensional perspective in order to predict valence and arousal ratings which have advantages over modeling the traditional discrete emotional categories. Finally, we distinguish images with strong emotional content from emotionally neutral images with high accuracy.
Article
Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time consuming. A solution to this problem is to search only a limited amount of the collection at query time, in order to speed up the retrieval process. In doing this we can also limit the loss in retrieval efficacy (in terms of accuracy of results). The way we achieve this is to first identify the most “important” documents within the collection, and sort documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient, but also limits loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of number of postings examined, without significant loss of effectiveness when based on several measures of importance used in isolation, and in combination. Our results point to several ways in which the computation cost of searching large collections of documents can be significantly reduced.
Conference Paper
In this paper we examine text indexing on the Cell Broadband Engine™ (Cell/B.E.), an emerging workload on an emerging multicore architecture. The Cell Broadband Engine is a microprocessor jointly developed by Sony Computer Entertainment, Toshiba, and IBM (herein, we refer to it simply as the "Cell"). The importance of text indexing is growing not only because it is the core task of commercial and enterprise-level search engines, but also because it appears more and more frequently in desktop and mobile applications, and on network appliances. Text indexing is a computationally intensive task. Multi-core processors promise a multiplicative increase in compute power, but this power is fully available only if workloads exhibit the right amount and kind of parallelism. We present the challenges and the results of mapping text indexing tasks to the Cell processor. The Cell has become known as a platform capable of impressive performance, but only when algorithms have been parallelized with attention paid to its hardware peculiarities (expensive branching, wide SIMD units, small local memories). We propose a parallel software design that provides essential text indexing features at a high throughput (161 Mbyte/s per chip on Wikipedia inputs) and we present a performance analysis that details the resources absorbed by each subtask. Not only does this result affect traditional applications, but it also enables new ones such as live network traffic indexing for security forensics, until now believed to be too computationally demanding to be performed in real time. We conclude that, at the cost of a radical algorithmic redesign, our Cell-based solution delivers a 4x performance advantage over a recent commodity machine such as the Intel Q6600. In a per-chip comparison, ours is the fastest text indexer that we are aware of.
Chapter
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. A search algorithm for the case when the query consists of high-frequently occurring words is discussed. In addition, we present results of experiments with different values of MaxDistance to evaluate the search speed dependence on the value of MaxDistance. These results show that the average time of the query execution with our indexes is 94.7–45.9 times (depending on the value of MaxDistance) less than that with standard inverted files when queries that contain high-frequently occurring words are evaluated.
Chapter
A search query consists of several words. In a proximity full-text search, we want to find documents that contain these words near each other. This task requires much time when the query consists of high-frequently occurring words. If we cannot avoid this task by excluding high-frequently occurring words from consideration by declaring them as stop words, then we can optimize our solution by introducing additional indexes for faster execution. In a previous work, we discussed how to decrease the search time with multi-component key indexes. We had shown that additional indexes can be used to improve the average query execution time up to 130 times if queries consisted of high-frequently occurring words. In this paper, we present another search algorithm that overcomes some limitations of our previous algorithm and provides even more performance gain.
Article
The paper combines a comprehensive account of the probabilistic model of retrieval with new systematic experiments on TREC Programme material. It presents the model from its foundations through its logical development to cover more aspects of retrieval data and a wider range of system functions. Each step in the argument is matched by comparative retrieval tests, to provide a single coherent account of a major line of research. The experiments demonstrate, for a large test collection, that the probabilistic model is effective and robust, and that it responds appropriately, with major improvements in performance, to key features of retrieval situations. Part 1 covers the foundations and the model development for document collection and relevance data, along with the test apparatus. Part 2 covers the further development and elaboration of the model, with extensive testing, and briefly considers other environment conditions and tasks, model training, concluding with comparisons with other approaches and an overall assessment. Data and results tables for both parts are given in Part 1. Key results are summarised in Part 2.
Conference Paper
Compression reduces both the size of indexes and the time needed to evaluate queries. In this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. First, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. Second, we explore the impact of choice of compression scheme on retrieval efficiency. In experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact Golomb-Rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the CPU cache is less for an appropriately compressed index than for an uncompressed index. Moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. We conclude that fast byte-aligned codes should be used to store integers in inverted lists.
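As a rough illustration of what a byte-aligned code looks like, the sketch below implements one common variable-byte convention: seven data bits per byte, low-order bits first, with the high bit set on every byte except the last. The bit convention, the function names, and the choice to store only document-identifier gaps (without the positions and frequencies mentioned in the abstract) are all assumptions made for illustration, not the scheme evaluated in the paper.

```python
def vbyte_encode(n):
    """Encode one non-negative integer as a variable-length byte string."""
    if n < 0:
        raise ValueError("vbyte encodes non-negative integers only")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)   # 7 data bits, continuation flag set
        n >>= 7
    out.append(n)                        # final byte, continuation flag clear
    return bytes(out)


def vbyte_decode_all(data):
    """Decode a concatenation of vbyte-coded integers."""
    values, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:                  # more bytes follow for this integer
            shift += 7
        else:                            # last byte of this integer
            values.append(value)
            value, shift = 0, 0
    return values


def encode_postings(doc_ids):
    """Encode an increasing list of document ids as vbyte-coded gaps."""
    out, prev = bytearray(), 0
    for d in doc_ids:
        out += vbyte_encode(d - prev)    # small gaps need fewer bytes
        prev = d
    return bytes(out)


def decode_postings(data):
    """Recover document ids by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode_all(data):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

Because decoding only needs byte-level masks and shifts rather than bit-by-bit parsing, codes of this kind trade a little space for simpler decoding, which is the trade-off the abstract reports on.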
Conference Paper
This paper addresses two unresolved issues about Web caching. The first issue is whether Web requests from a fixed user community are distributed according to Zipf's (1929) law. The second issue relates to a number of studies on the characteristics of Web proxy traces, which have shown that the hit-ratios and temporal locality of the traces exhibit certain asymptotic properties that are uniform across the different sets of the traces. In particular, the question is whether these properties are inherent to Web accesses or whether they are simply an artifact of the traces. An answer to these unresolved issues will facilitate both Web cache resource planning and cache hierarchy design. We show that the answers to the two questions are related. We first investigate the page request distribution seen by Web proxy caches using traces from a variety of sources. We find that the distribution does not follow Zipf's law precisely, but instead follows a Zipf-like distribution with the exponent varying from trace to trace. Furthermore, we find that there is only (i) a weak correlation between the access frequency of a Web page and its size and (ii) a weak correlation between access frequency and its rate of change. We then consider a simple model where the Web accesses are independent and the reference probability of the documents follows a Zipf-like distribution. We find that the model yields asymptotic behaviour that is consistent with the experimental observations, suggesting that the various observed properties of hit-ratios and temporal locality are indeed inherent to Web accesses observed by proxies. Finally, we revisit Web cache replacement algorithms and show that the algorithm that is suggested by this simple model performs best on real trace data.
The results indicate that while page requests do indeed reveal short-term correlations and other structures, a simple model for an independent request stream following a Zipf-like distribution is sufficient to capture certain asymptotic properties observed at Web proxies.
Article
The TREC-8 Web Track defined ad hoc retrieval tasks over the 100 gigabyte VLC2 collection (Large Web Task) and a selected 2 gigabyte subset known as WT2g (Small Web Task). Here, the guidelines and resources for both tasks are described and results presented and analysed. Performance on the Small Web was strongly correlated with performance on the regular TREC Ad Hoc task. Little benefit was derived from the use of link-based methods, for standard TREC measures on the WT2g collection. The...
Article
In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides an insight into the public practices and choices in Web searching.
Conference Paper
We extend the applicability of impact transformation, which is a technique for adjusting the term weights assigned to documents so as to boost the effectiveness of retrieval when short queries are applied to large document collections. In conjunction with techniques called quantization and thresholding, impact transformation allows improved query execution rates compared to traditional vector-space similarity computations, as the number of arithmetic operations can be reduced. The transformation also facilitates a new dynamic query pruning heuristic. We give results based upon the TREC Web data that show the combination of these various techniques to yield highly competitive retrieval, in terms of both effectiveness and efficiency, for both short and long queries.
Article
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
Article
Ranked queries are used to locate relevant documents in text databases. In a ranked query a list of terms is specified, then the documents that most closely match the query are returned, in decreasing order of similarity, as answers. Crucial to the efficacy of ranked querying is the use of a similarity heuristic, a mechanism that assigns a numeric score indicating how closely a document and the query match. In this note we explore and categorise a range of similarity heuristics described in the literature. We have implemented all of these measures in a structured way, and have carried out retrieval experiments with a substantial subset of these measures. Our purpose with this work is threefold: first, in enumerating the various measures in an orthogonal framework we make it straightforward for other researchers to describe and discuss similarity measures; second, by experimenting with a wide range of the measures, we hope to observe which features yield good retrieval behaviour in a variety of retrieval environments; and third, by describing our results so far, to gather feedback on the issues we have uncovered. We demonstrate that it is surprisingly difficult to identify which techniques work best, and comment on the experimental methodology required to support any claims as to the superiority of one method over another.
Article
In this paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most efficient. We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably effective: as we show, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequently accessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10% to 40% slower than a table with around one string per slot (while a table without move-to-front is perhaps 40% slower again) and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more efficient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average.
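The move-to-front refinement described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `MoveToFrontHashTable` is hypothetical, Python lists stand in for the linked chains, Python's built-in `hash` stands in for whatever string hash the authors used, and inserting new words at the front of the chain is a choice made here rather than something the abstract specifies.

```python
class MoveToFrontHashTable:
    """Fixed-size chained hash table that counts word occurrences."""

    def __init__(self, num_slots=1024):
        # One chain per slot; each entry is a mutable [word, count] pair.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, word):
        return self.slots[hash(word) % len(self.slots)]

    def add(self, word):
        """Record one occurrence of word and return its updated count."""
        chain = self._chain(word)
        for i, entry in enumerate(chain):
            if entry[0] == word:
                entry[1] += 1
                if i > 0:
                    # Move the accessed entry to the head of its chain so that
                    # frequently seen words are found after very few comparisons.
                    chain.insert(0, chain.pop(i))
                return entry[1]
        chain.insert(0, [word, 1])   # assumption: new words start at the front
        return 1

    def count(self, word):
        """Return the number of recorded occurrences (no reordering)."""
        for w, c in self._chain(word):
            if w == word:
                return c
        return 0
```

With a single slot, adding "a", "b", "c", "a" leaves the chain ordered a, c, b, which shows how a repeated word migrates to the head even when the chain is long.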
Article
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of world-wide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
Article
Fast access to files of integers is crucial for the efficient resolution of queries to databases. Integers are the basis of indexes used to resolve queries, for example, in large internet search systems, and numeric data forms a large part of most databases. Disk access costs can be reduced by compression, if the cost of retrieving a compressed representation from disk and the CPU cost of decoding such a representation is less than that of retrieving uncompressed data. In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
Article
Ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. We propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. CPU time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. The principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces CPU time and disk traffic to around one third of the original requirement. We also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed.
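The index layout described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace tokenisation, in-memory Python lists, and a fixed frequency cut-off (`min_freq`) standing in for the paper's early-recognition heuristics, which the abstract does not specify. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, int]  # (document number, within-document frequency)


def build_frequency_sorted_index(docs: List[str]) -> Dict[str, List[Posting]]:
    """Build inverted lists ordered by decreasing within-document frequency.

    Ties are broken by document number so the order is deterministic.
    """
    index: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term, freq in Counter(text.lower().split()).items():
            index[term].append((doc_id, freq))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[1], p[0]))
    return dict(index)


def postings_above(index: Dict[str, List[Posting]],
                   term: str, min_freq: int) -> List[Posting]:
    """Read a list from the front and stop at the first low-frequency entry.

    Because the list is frequency-sorted, everything after the stopping
    point has a frequency below min_freq and never needs to be read.
    """
    result = []
    for doc_id, freq in index.get(term, []):
        if freq < min_freq:
            break
        result.append((doc_id, freq))
    return result
```

The point of the ordering is that the postings most likely to matter for ranking sit at the head of each list, so query evaluation can stop reading early instead of scanning every document number.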
Article
Two stages in measurement of techniques for information retrieval are gathering of documents for relevance assessment and use of the assessments to numerically evaluate effectiveness. We consider both of these stages in the context of the TREC experiments, to determine whether they lead to measurements that are trustworthy and fair. Our detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found. We propose a new pooling strategy that can significantly increase the number of relevant documents found for given effort, without compromising fairness.
Okapi/Keenbow at TREC-8
  • S Robertson
  • S Walker
Robertson, S. & Walker, S. (2000), Okapi/Keenbow at TREC-8, in E. Voorhees & D. Harman, eds, 'The Eighth Text REtrieval Conference (TREC-8)', NIST Special Publication 500-246, Gaithersburg, MD, pp. 151-161.
Improved retrieval effectiveness through impact transformation
  • V Anh
  • A Moffat
Anh, V. & Moffat, A. (2001), Improved retrieval effectiveness through impact transformation, in X. Zhou, ed., 'Proc. Australasian Database Conference', Vol. 24(2), Australian Computer Society, Melbourne, Australia, pp. 41-48.