Conference Paper

Efficient phrase querying with an auxiliary index


Abstract

Search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this paper we consider how phrase queries can be efficiently supported with low disk overheads. Previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. We propose a combination of nextword indexes with inverted files as a solution to this problem. Our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allows evaluation of phrase queries in half the time required with an inverted file alone, with a space overhead of only 10% of the size of the inverted file. Further time savings are available with only slight increases in disk requirements.


... An auxiliary nextword index proposed by Bahle et al. [4] further reduces the space overhead to only 10% of the size of the inverted index file. ...
... An inverted index is not efficient for evaluating queries with common terms, since the three most common words account for about 4% of the size of the whole index file [4], and retrieving such long postings lists incurs long operation times. Hence, the nextword index [15] was proposed, which records additional index information to support fast evaluation of phrase queries. ...
... Bahle et al. [4] observe the weakness of resolving phrase queries with an inverted index and the enormous size overhead of a nextword index, and hence propose the auxiliary nextword index. The main idea of the auxiliary nextword index is that only the top-frequency words are indexed with nextwords. ...
Conference Paper
In this paper, we propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency with modest extra storage cost. Further improvement in efficiency can be attained by implementing our index according to our observation of the dynamic nature of common word set. In experimental evaluation, a common phrase index using 255 common words has an improvement of about 11% and 62% in query time for the overall and large queries (queries of long phrases) respectively over an auxiliary nextword index. Moreover, it has only about 19% extra storage cost. Compared with an inverted index, our improvement is about 72% and 87% for the overall and large queries respectively. We also propose to implement a common phrase index with dynamic update feature. Our experiments show that more improvement in time efficiency can be achieved.
... Users can submit explicit phrase queries to search engines, typically by enclosing them in quotation marks. [3] reported that 8.3% of the queries in the Excite log during 1997-1999 were explicit phrase queries. In addition, users can also submit ...
... With the traditional word-level index, search engines first intersect DocID sets to get a list of candidate documents that possibly contain the phrase, and then check whether the query terms are adjacent in the candidate documents. However, as suggested in [3], the traditional word-level index is not efficient, since the cost of processing a common term's postings list is very high. To solve this problem, one crude method is to remove stop words from queries, which may result in incorrect query evaluation. ...
... To solve this problem, one crude method is to remove stop words from queries, which may result in incorrect query evaluation. Other methods [3,10] add auxiliary structures (e.g., n-gram, partial nextword index, phrase index, etc.) to speed up phrase query evaluation. ...
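The two-step evaluation described in these excerpts (intersect the DocID sets, then verify adjacency using word positions) can be sketched as follows. The index layout here (term → {DocID: sorted positions}) is a simplifying assumption for illustration, not the structure used in the cited papers.

```python
# Sketch of two-step phrase evaluation over a word-level positional index.
# Assumed layout (illustrative only): term -> {doc_id: sorted positions}.

def phrase_match(index, terms):
    """Return IDs of documents containing the terms adjacently, in order."""
    # Step 1: intersect DocID sets to get candidate documents.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    # Step 2: check adjacency inside each candidate document.
    result = []
    for doc in sorted(candidates):
        starts = set(index[terms[0]][doc])
        for offset, t in enumerate(terms[1:], start=1):
            # keep only start positions whose offset-th following slot holds t
            starts &= {p - offset for p in index[t][doc]}
        if starts:
            result.append(doc)
    return result

toy_index = {
    "new":  {1: [0, 7], 2: [3]},
    "york": {1: [1], 2: [9]},
}
print(phrase_match(toy_index, ["new", "york"]))  # [1]
```

Note that step 2 is the expensive part when a common term's postings list is long, which is exactly the cost the auxiliary structures above are designed to avoid.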
Conference Paper
A large proportion of search engine queries contain phrases,namely a sequence of adjacent words. In this paper, we propose to use flat position index (a.k.a schema-independent index) for phrase query evaluation. In the flat position index, the entire document collection is viewed as a huge sequence of tokens. Each token is represented by one flat position, which is a unique position offset from the beginning of the collection. Each indexed term is associated with a list of the flat positions about that term in the sequence. To recover DocID from flat positions efficiently, we propose a novel cache sensitive look-up table (CSLT), which is much faster than existing search algorithms. Experiments on TREC GOV2 data collection show that flat position index can reduce the index size and speed up phrase querying substantially, compared with traditional word-level index.
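The DocID-recovery step that a flat position index needs can be done, as a baseline, with a binary search over the document start offsets; the paper's cache-sensitive look-up table (CSLT) is a faster replacement for exactly this search. A minimal sketch of the baseline, with hypothetical offsets:

```python
import bisect

# Baseline DocID recovery for a flat-position index: binary-search the
# sorted document start offsets. The CSLT proposed in the paper replaces
# this search; the offsets below are hypothetical.

doc_starts = [0, 120, 450, 900]  # token offset where each document begins

def doc_of(flat_pos):
    """Map a flat token position to the 0-based DocID containing it."""
    return bisect.bisect_right(doc_starts, flat_pos) - 1

print(doc_of(119), doc_of(120), doc_of(899))  # 0 1 2
```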
... Finally, question answering systems often rely on NLP components that may directly or indirectly use wild card queries. Examples are taxonomy construction, fact extraction, named entity recognition and query expansion. Rafiei and Li [28] present wild card querying over web text and discuss several techniques such as query expansion and relevance ranking to increase the precision and recall of extractions. ...

Table 1: Summary of the literature on query types and their result sets
Query Type            | Query Elements                  | Result Set       | References
Keyword queries       | keywords, Boolean operators     | documents        | AltaVista [1], Google [4], Yahoo [8]
Multi-keyword queries | keywords, phrases               | documents        | [17], [11], [30]
Proximity queries     | keywords, proximity radius      | documents        | Lucene [2], INDRI [5], [16]
Wild card queries     | queries, wild cards (%)         | list of keywords | Dewild [28], BE [14], KnowItAll [18]
Structured queries    | SQL, text predicates            | relations        | Oracle InterMedia Text [9], DB2 Text Extender [24], [15]
Full-text search      | characters, regular expressions | strings          |
... This indicates that inverted indexes are not appropriate for evaluating wild card queries.
6706993152 → <3,1,[10]>
is → <1,4,[12,154,184,190]>, <2,4,[379,401,427,503]>, <3,1,[9]>
population → <1,7,[8,30,38,57,153,170,194]>, <2,2,[125,155]>, <3,1,[8]>
world → <1,3,[11,37,56]>, <2,2,[29,124]>, <3,1,[7]> ...
... Solutions on multi-keyword queries such as phrase and nextword indexes [11] [30] can help reduce the time it takes to intersect the posting lists, but won't help in the keyword matching step, which is in most cases the dominant process. Therefore, development of solutions for efficient retrieval of keyword matches from text seems essential. ...
Conference Paper
Full-text available
Many existing indexes on text work at the document granularity and are not effective in answering the class of queries where the desired answer is only a term or a phrase. In this paper, we study some of the index structures that are capable of answering the class of queries referred to here as wild card queries and perform an analysis of their performance. Our experimental results on a large class of queries from different sources (including query logs and parse trees) and with various datasets reveal some of the performance barriers of these indexes. We then present Word Permuterm Index (WPI) which is an adaptation of the permuterm index for natural language text applications and show that this index supports a wide range of wild card queries, is quick to construct and is highly scalable. Our experimental results comparing WPI to alternative methods on a wide range of wild card queries show a few orders of magnitude performance improvements for WPI while the memory usage is kept the same for all compared systems.
... Recall is an IR performance measure which represents the fraction of relevant documents in a set of retrieved documents. Let R represent the set of relevant documents, and |R| the number of documents in R. Now assume an answer set A is retrieved in response to some query, with |A| the number of documents in A, and let |Ra| represent the number of documents in the intersection of the two sets R and A. Then we have the formal calculation of recall illustrated in Equation 5. ...
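The recall computation described in this excerpt is straightforward to state in code; the names below mirror the excerpt's R, A, and Ra:

```python
# Recall as defined in the excerpt: |Ra| / |R|, where Ra = R ∩ A.

def recall(relevant, answer):
    """Fraction of the relevant documents that were actually retrieved."""
    r, a = set(relevant), set(answer)
    return len(r & a) / len(r)

print(recall({1, 2, 3, 4}, {2, 4, 5}))  # 0.5
```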
... Disks have not gone through the same development as CPUs, and therefore suffer from slow access rates compared to operations within the CPU. This skipping technique is better described in [20,14], and somewhat in [5,12]. ...
... Techniques for such a list reordering to promote this capability could be frequency-ordering and impact-ordering. These techniques are somewhat described in [5], and in much more detail in [17]. When considering ranked phrase query processing, however, it is not so simple to predict which occurrence of a term will appear in a phrase query. ...
... Indexing is the main process in an information retrieval system (Mao et al., 2006). Indexing can be done by several methods, including the inverted index, the auxiliary nextword index, and the common phrase index (Manning et al., 2009; Bahle et al., 2002; Chang and Poon, 2007). The inverted index or inverted file is a basic concept in information retrieval (Manning et al., 2009). ...
... The advantage of this method is that it provides fast searching through an enormous number of documents (Chang and Poon, 2007). In contrast, for phrase querying it is not simple to predict which occurrences of a term will appear in a query phrase, and thus such reordering is unlikely to be effective (Bahle et al., 2002). The auxiliary nextword index is an indexing method that can reduce the cost and frequency of disk accesses, which increases efficiency, but it only works very well for phrase queries of size two (Chang and Poon, 2007). ...
... The fourth step is to calculate the term weighting (tf-idf) value of each term array in each document and query. The tf-idf calculation can be performed using Equation 1. Example tf-idf calculation results for the documents are shown in Table 2. ...
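As a hedged illustration of the tf-idf step (the excerpt's Equation 1 is not reproduced here, so this uses the common tf × log10(N/df) variant, which may differ from the paper's exact formula):

```python
import math

# Hypothetical tf-idf variant: w(t, d) = tf(t, d) * log10(N / df(t)).
# The excerpt's Equation 1 may use a different weighting scheme.

def tf_idf(tf, df, n_docs):
    """Weight of a term occurring tf times in a document, in df of n_docs."""
    return tf * math.log10(n_docs / df)

# term occurring 3 times in a document, appearing in 10 of 100 documents
print(tf_idf(tf=3, df=10, n_docs=100))  # 3.0
```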
Conference Paper
Full-text available
With the development of technology, finding information in news text has become easy, because news is distributed not only in print media, such as newspapers, but also in electronic media that can be accessed using search engines. When searching for relevant documents in a search engine, a phrase is often used as a query. The number of words that make up the phrase query, and their positions, clearly affect the relevance of the documents produced, and hence the accuracy of the information obtained. Given this problem, the purpose of this research was to analyze the implementation of the common phrase index method in information retrieval. The research was conducted on English news text and implemented in a prototype to determine the relevance level of the documents produced. The system is built with stages of pre-processing, indexing, term weighting calculation, and cosine similarity calculation; it then displays the document search results in order of cosine similarity. System testing was conducted using 100 documents and 20 queries, and the results were used in the evaluation stage: first, determining the relevant documents using a kappa statistic calculation; second, determining the system success rate using precision, recall, and F-measure. The kappa statistic was 0.71, so the relevant-document judgments were suitable for system evaluation. The evaluation then produced a precision of 0.37, a recall of 0.50, and an F-measure of 0.43. From these results it can be said that the success rate of the system in producing relevant documents is low.
... There is a body of literature [3, 22, 4, 2] that discusses modifications to the inverted-index structure to support fast evaluation of specific query classes. In prior work, nextword indexes [3, 22] were proposed as a way of supporting phrase queries and phrase browsing. ...
... In a nextword index, for each index term or firstword, there is a list of the words or nextwords that follow that term, together with the documents and word positions at which the firstword and nextword occur as a pair. Bahle et al. [4] try to overcome two disadvantages of the nextword index: its size (typically around half that of the indexed collection) and its inefficiency (nextwords must be processed linearly, which is slow compared to a standard inverted index for rare firstwords). They propose evaluating phrase queries through a combination of an inverted index on rare words and a form of nextword index on common words. ...
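The nextword structure described here (firstword → nextword → (document, position) pairs) can be sketched minimally; this is an illustrative toy, not the compressed on-disk layout used by Bahle et al.:

```python
from collections import defaultdict

# Toy nextword index: firstword -> nextword -> [(doc_id, position), ...].
# Illustrative only; real nextword indexes store compressed postings.

def build_nextword_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos in range(len(tokens) - 1):
            # record each adjacent word pair with its document and position
            index[tokens[pos]][tokens[pos + 1]].append((doc_id, pos))
    return index

docs = {1: ["the", "old", "man", "the", "boats"],
        2: ["the", "old", "house"]}
nwi = build_nextword_index(docs)
print(nwi["the"]["old"])  # [(1, 0), (2, 0)]
```

A two-word phrase query then needs only one lookup in this structure, instead of intersecting two full positional postings lists.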
Article
Entity annotation is emerging as a key enabling requirement for search based on deeper semantics: for example, a search on 'John's address' that returns matches to all entities annotated as an address that co-occur with 'John'. A dominant paradigm adopted by rule-based named entity annotators is to annotate a document at a time. The complexity of this approach varies linearly with the number of documents and the cost of annotating each document, which can be prohibitive for large document corpora. A recently proposed alternative paradigm for rule-based entity annotation [16] operates on the inverted index of a document collection and achieves an order of magnitude speed-up over its document-based counterpart. In addition, the index-based approach permits collection-level optimization of the order of index operations required for the annotation process. It is this aspect that is explored in this paper. We develop a polynomial-time algorithm that, based on estimated cost, can optimally select between different logically equivalent evaluation plans for a given rule. Additionally, we prove that this problem becomes NP-hard when the optimization has to be performed over multiple rules, and provide effective heuristics for handling this case. Our empirical evaluations show a speed-up factor of up to 2 over the baseline system without optimizations.
... One important application of our technique is efficient phrase searching using an auxiliary index of word n-grams, as discussed in Section 6.4. Bahle et al. [4] discuss a similar technique for phrase searching using an auxiliary "nextword index" that is pruned by only including phrases where common words appear first, similar to the approach of Mah and D'Amore but at the word n-gram level. For the same reason, it includes some strictly unnecessary terms; for example, if a two-word term occurs very frequently, it may be added to the nextword index even if the two terms never occur separately. ...
... A similar problem occurs at the character level in documents featuring large alphabets, such as Asian-language texts. The hybrid indexes of Bahle et al. [4] reduce storage requirements by pruning the set of bigrams stored, and accelerate queries by using a query planning technique based on the document frequencies of the constituent words. Their pruning criterion retains only pairs where the first word of the bigram is among the k most common words. ...
Conference Paper
Full-text available
Inverted indexes using sequences of characters (n-grams) as terms provide an error-resilient and language-independent way to query for arbitrary substrings and perform approximate matching in a text, but present a number of practical problems: they have a very large number of terms, they exhibit pathologically expensive worst-case query times on certain natural inputs, and they cannot cope with very short query strings. In word-based indexes, static index pruning has been successful in reducing index size while maintaining precision, at the expense of recall. Taking advantage of the unique inclusion structure of n-gram terms of different lengths, we show that the lexicon size of an n-gram index can be reduced by 7 to 15 times without any loss of recall, and without any increase in either index size or query time. Because the lexicon is typically stored in main memory, this substantially reduces the memory required for queries. Simultaneously, our construction is also the first overlapping n-gram index to place tunable worst-case bounds on false positives and to permit efficient queries on strings of any length. Using this construction, we also demonstrate the first feasible n-gram index using words rather than characters as units, and its applications to phrase searching.
... This can be reduced to about 40% to 50% using the compression techniques of [7], at little cost in querying speed. In order to obtain further reductions in memory consumption, partial nextword indexes were introduced in [8]. A partial nextword index contains just the commonest words as firstwords. ...
... The two-term phrase index can be used to speed up queries consisting of an arbitrary number of terms. Similar to [13] and [8], phrase queries can be evaluated efficiently as follows: let t_1 · t_2 · ... · t_n be a phrase query consisting of n terms. Starting at t_1, we replace each term t_i with the phrase t_i · t_{i+1} where possible, because t_i · t_{i+1} ≤ t_i. ...
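The term-replacement rule in this excerpt can be sketched as follows; the set used as the phrase index here is a stand-in for the real structure, and the actual scheme in the cited work may handle overlaps differently:

```python
# Sketch of the rewriting step: replace each query term t_i with the
# two-term phrase (t_i, t_{i+1}) whenever that phrase is indexed, since
# the phrase's postings list can only be shorter than t_i's own list.
# The set below stands in for a real two-term phrase index.

def rewrite_query(terms, phrase_index):
    units = []
    for i, t in enumerate(terms):
        if i + 1 < len(terms) and (t, terms[i + 1]) in phrase_index:
            units.append((t, terms[i + 1]))  # cheaper phrase postings
        else:
            units.append((t,))               # fall back to the single term
    return units

phrases = {("new", "york"), ("york", "city")}
print(rewrite_query(["new", "york", "city"], phrases))
# [('new', 'york'), ('york', 'city'), ('city',)]
```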
Conference Paper
We present a method for optimizing phrase search based on inverted indexes. Our approach adds selected (two-term) phrases to an existing index. Whereas competing approaches are often based on the analysis of query logs, our approach works out of the box and uses only the information contained in the index. Also, our method is competitive in terms of query performance and can even improve on other approaches for difficult queries. Moreover, our approach gives performance guarantees for arbitrary queries. Further, we propose using a phrase index as a substitute for the positional index of an in-memory search engine working with short documents. We support our conclusions with experiments using a high-performance main-memory search engine. We also give evidence that classical disk based systems can profit from our approach.
... We suggest a "milder" pruning strategy that produces an index capable of restoring relative positional information for all (non-stop) words within a short distance L (in our experiments, L = 10). Our indexing approach is similar to that of Bahle et al. [2002], who proposed indexing only those adjacent words where one of the words is frequent. To evaluate the effectiveness and efficiency of our approach, we participated in the ad hoc web track 2010. ...
... This simple folklore approach should have been described in the literature, but we are unaware of exact references. It is similar to the method by Bahle et al. [2002] and is inspired by the following assumptions and observations: ...
Conference Paper
Full-text available
We describe experiments with proximity-aware ranking functions that use indexing of word pairs. Our goal is to evaluate a method of “mild” pruning of proximity information, which would be appropriate for a moderately loaded retrieval system, e.g., an enterprise search engine. We create an index that includes occurrences of close word pairs, where one of the words is frequent. This allows one to efficiently restore relative positional information for all non-stop words within a certain distance. It is also possible to answer phrase queries promptly. We use two functions to evaluate relevance: a modification of a classic proximity-aware function and a logistic function that includes a linear combination of relevance features.
... Similar to our problem scenario, full materialization of indexes of all common phrases entails prohibitive storage costs. The approach adopted in [5, 12] is to use different types of indices – inverted indices for rare words, a variant of nextword indices for the commonest words, and a phrase index for the commonest phrases; similarly, we use different combinations of access paths depending on keyword frequencies. The underlying indexing problem can also be phrased as an instance of the partial-match problem: lower bounds on the performance of partial-match queries have been studied theoretically in [8] using a cell-probe framework. ...
... In addition, each entry in the match list also stores the number of postings in the corresponding inverted index; we also maintain the number of postings in each single-keyword inverted index together with the vocabulary. The resulting structure is in many ways similar to the nextword indexes of [5, 12] and can be implemented in a similar manner. The physical layout of this structure is as follows: since (as we will describe later) we only materialize combinations of frequent keywords, and only a small fraction of them, it is possible to maintain an index with the first two keywords of each combination in main memory. ...
Conference Paper
Intersecting inverted indexes is a fundamental operation for many applications in information retrieval and databases. Efficient indexing for this operation is known to be a hard problem for arbitrary data distributions. However, text corpora used in Information Retrieval applications often have convenient power-law constraints (also known as Zipf's Law and long tails) that allow us to materialize carefully chosen combinations of multi-keyword indexes, which significantly improve worst-case performance without requiring excessive storage. These multi-keyword indexes limit the number of postings accessed when computing arbitrary index intersections. Our evaluation on an e-commerce collection of 20 million products shows that the indexes of up to four arbitrary keywords can be intersected while accessing less than 20% of the postings in the largest single-keyword index.
... The position lists obtained using the word pairs need to be combined in order to find locations of the original, longer phrase. Bahle et al. [2001b, 2002] showed how the word pairs should be selected in order to optimise querying time, and then implemented a retrieval system which combined a nextword index with a word-level IR system. It was shown that the addition of the nextword index provided more efficient retrieval when the first word is a common word. ...
... For example, pre-computed values for components of some similarity measures described in Section 2.5.3 can be stored, which only require computation once per document. The inverted lists can also include term offset information for the documents, which is useful in identifying adjacent terms in phrase queries [Bahle et al., 2002]. For example, the inverted list for the term "for" with offsets is "2 : [1,1,5], [2,2,4,10]", with the first term in each document being assigned a zero offset. ...
... A cursory survey of recent conference proceedings and journal issues reveals that the resources created by the VLC/Web Track are being used quite routinely in studies reported outside TREC. For example, in the years 2000-2002, eight SIGIR papers [34, 2, 44, 15, 32, 3, 4, 39] and four TOIS articles [12, 11, 10, 9] made use of VLC/Web data and several others referred to the track or its methodology. A glance at the same forums for 2003 suggests that usage of VLC/Web Track resources and results is increasing still further. ...
... Caching of intersections is related to the problem of building optimized index structures for phrase queries [6] (i.e., "New York"). In particular, intersections can be used to evaluate phrase queries, while on the other hand some of the most profitable pairs of lists in intersection caching turn out to be common phrases. ...
Article
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
... (Of course, GuruQA is not designed to find all relevant documents, as BE does.) A series of articles describes the nextword index [5, 23, 4], a structure designed to speed up phrase queries and to enable some amount of "phrase browsing." It is an inverted index where each term's list contains a list of the successor words found in the corpus. ...
Conference Paper
Many modern natural language-processing applications utilize search engines to locate large numbers of Web documents or to compute statistics over the Web corpus. Yet Web search engines are designed and optimized for simple human queries---they are not well suited to support such applications. As a result, these applications are forced to issue millions of successive queries resulting in unnecessary search engine load and in slow applications with limited scalability.In response, this paper introduces the Bindings Engine (BE), which supports queries containing typed variables and string-processing functions. For example, in response to the query "powerful ‹noun›" BE will return all the nouns in its index that immediately follow the word "powerful", sorted by frequency. In response to the query "Cities such as ProperNoun(Head(‹NounPhrase›))", BE will return a list of proper nouns likely to be city names.BE's novel neighborhood index enables it to do so with O(k) random disk seeks and O(k) serial disk reads, where k is the number of non-variable terms in its query. As a result, BE can yield several orders of magnitude speedup for large-scale language-processing applications. The main cost is a modest increase in space to store the index. We report on experiments validating these claims, and analyze how BE's space-time tradeoff scales with the size of its index and the number of variable types. Finally, we describe how a BE-based application extracts thousands of facts from the Web at interactive speeds in response to simple user queries.
... Unless such queries are very frequent, the overall savings achieved by our pruning method can still be substantial. Bahle et al. [3], for instance, report that only 8.3% of the queries they found in an Excite query log were phrase queries. ...
Conference Paper
We present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. The decision is made based on the term's contribution to the document's Kullback-Leibler divergence from the text collection's global language model. Our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. It thus allows us to make the index small enough to fit entirely into the main memory of a single PC, even for large text collections containing millions of documents. This results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the GOV2 document collection.
... 3) Early termination approaches [1,4]. 4) Next-word and partial phrase auxiliary indexes for an exact phrase search [17,2]. ...
Preprint
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains the query terms near each other, especially if the query terms are frequently occurring words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. A search algorithm for the case when the query consists of frequently occurring words is discussed. In addition, we present results of experiments with different values of MaxDistance to evaluate the dependence of search speed on the value of MaxDistance. These results show that the average query execution time with our indexes is 94.7-45.9 times (depending on the value of MaxDistance) less than that with standard inverted files when queries containing frequently occurring words are evaluated. This is a pre-print of a contribution published in Pinelas S., Kim A., Vlasov V. (eds) Mathematical Analysis With Applications. CONCORD-90 2018. Springer Proceedings in Mathematics & Statistics, vol 318, published by Springer, Cham. The final authenticated version is available online at: https://doi.org/10.1007/978-3-030-42176-2_37
... In [6,14,15], nextword indexes and partial phrase indexes are introduced. These additional indexes can be used to improve performance. ...
Chapter
Full-text available
Full-text search engines are important tools for information retrieval. Term proximity is an important factor in relevance score measurement. In a proximity full-text search, we assume that a relevant document contains query terms near each other, especially if the query terms are frequently occurring words. A methodology for high-performance full-text query execution is discussed. We build additional indexes to achieve better efficiency. For a word that occurs in the text, we include in the indexes some information about nearby words. What types of additional indexes do we use? How do we use them? These questions are discussed in this work. We present the results of experiments showing that the average time of search query execution is 44–45 times less than that required when using ordinary inverted indexes.
... A phrase query is like a Boolean query except that the operator AND is used implicitly between the query words, and the order of the keywords in the documents should follow the order of the keywords in the query [Bahle et al., 2002]. When a user enters the phrase Melbourne weather report as a query, an IR system that implements phrase querying should return documents containing the words contiguously and in this order. ...
... They show that static index pruning can be written as a convex integer program. They recast static index pruning as a model induction problem under the framework of Kullback's principle of minimum cross-entropy [23]. Based on experimental results on standard ad-hoc retrieval benchmarks, they confirmed that uniform pruning is robust to high prune ratios and that its performance is currently state of the art. ...
Article
Full-text available
This paper proposes a static index pruning method for phrase queries based on term distance. It models the distance between terms within a document as a measure of a term's co-occurrence with another term. A standard score is then used to prune non-relevant postings for phrase queries while assuring no change in the top-k results. The proposed method creates an effective pruned inverted index. Analysis of the results shows that this method is correlated with term proximity based on term frequency values as well as term informativeness. With experiments on a number of different FIRE collections, it is shown that the model is comparable with the existing static pruning method, which only works well for single-term queries. An advantage of the proposed approach is that the pruning model is applicable to a standard inverted index for phrase queries.
... Such methods achieve high retrieval efficiency by sacrificing search quality for some queries. The method in [3] creates auxiliary indexes for firstword-nextword pairs to speed up phrase queries. However, it is not directly suitable for non-phrase queries. ...
Conference Paper
Full-text available
There has been a large amount of research on early termination techniques in web search and information retrieval. Such techniques return the top-k documents without scanning and evaluating the full inverted lists of the query terms. Thus, they can greatly improve query processing efficiency. However, only a limited amount of efficient top-k processing work considers the impact of term proximity, i.e., the distance between term occurrences in a document, which has recently been integrated into a number of retrieval models to improve effectiveness. In this paper, we propose new early termination techniques for efficient query processing for the case where term proximity is integrated into the retrieval model. We propose new index structures based on a term-pair index, and study new document retrieval strategies on the resulting indexes. We perform a detailed experimental evaluation on our new techniques and compare them with the existing approaches. Experimental results on large-scale data sets show that our techniques can significantly improve the efficiency of query processing.
... A cursory survey of recent conference proceedings and journal issues reveals that the resources created by the VLC/Web Track are being used quite routinely in studies reported outside TREC. For example, in the years 2000-2002, eight SIGIR papers [34, 2, 44, 15, 32, 3, 4, 39] and four TOIS articles [12, 11, 10, 9] made use of VLC/Web data and several others referred to the track or its methodology. A glance at the same forums for 2003 suggests that usage of VLC/Web Track resources and results is increasing still further. ...
... We proceed from the premise that if an expert (a teacher) has selected books by a particular author, then other books by the same author may also be of some value to that expert. Another important fact is that authors usually publish their works within a specific subject area and very rarely move to another one [6]. For example, an author who has worked on architecture (construction) is most likely not interested in the design of computer processors. ...
Article
Full-text available
This paper describes the implementation of an algorithm for ranking users' search results in the information retrieval system of a library. Today, almost no university library in Ukraine remains uncomputerized, at least partially. Libraries offer a range of remote services to their users, such as an e-catalogue and specialized e-services. If these services are provided over the web, access to them is in most cases anonymous. Libraries today have large databases describing millions of books, magazines and newspapers. If users do not clearly know what they are searching for, navigating such large collections is very difficult. Therefore, in addition to the specific features of an information system used to refine a search query, an important element is the function governing the presentation of search results: ranking. Implementing ranking algorithms in library information systems allows users to minimize the time required to find information and organizes search results according to qualitative criteria. At the Scientific and Technical Library of Lviv Polytechnic National University, a specialized information system was developed that allows the creation of book lists for students, and the proposed ranking algorithm was incorporated into it. In the opinion of users, it greatly improves the process of selecting literature for the study of academic disciplines when the literature recommended by the teacher does not fit for some reason.
... A phrase query (Bahle et al., 2002) is a special type of a Boolean AND query, where terms must not only occur within the same document, but consecutively in the same order as stated in the query -that is, they must occur as a phrase. Phrase queries are commonly used in conjunction with ranked queries. ...
Article
Full-text available
... More compact alternatives are to omit the locations, or even to omit the number of occurrences, recording only the document identifiers. However, term locations can be used for accurate ranking heuristics and for resolution of advanced query types such as phrase queries (Bahle, Williams & Zobel 2002). ...
Conference Paper
Full-text available
Indexes are the key technology underpinning efficient text search. A range of algorithms have been developed for fast query evaluation and for index creation, but update algorithms for high-performance indexes have not been evaluated or even fully described. In this paper, we explore the three main alternative strategies for index update: in-place update, index merging, and complete re-build. Our experiments with large volumes of web data show that, for large numbers of updates, re-merge is the fastest approach, but in-place update is suitable when the rate of update is low or buffer size is limited.
... Therefore, many researchers concentrated on compression techniques to reduce storage space [1,2,23]. To reduce storage space further, partial nextword indexes were proposed [3]. Later, a combination of an inverted index, a partial phrase index and a partial nextword index was also introduced to moderately reduce query time [26]. ...
Article
Full-text available
Text documents are significant arrangements of various words, while images are significant arrangements of various pixels/features. In addition, text and image data share a similar semantic structural pattern. With reference to this research, the feature pair is defined as a pair of adjacent image features. The innovative feature pair index graph (FPIG) is constructed from the unique feature pair selected, which is constructed using an inverted index structure. The constructed FPIG is helpful in clustering, classifying and retrieving the image data. The proposed FPIG method is validated against the traditional KMeans++, KMeans and Farthest First cluster methods which have the serious drawback of initial centroid selection and local optima. The FPIG method is analyzed using Iris flower image data, and the analysis yields 88% better results than Farthest First and 28.97% better results than conventional KMeans in terms of sum of squared errors. The paper also discusses the scope for further research in the proposed methodology.
... : Caching of intersections is related to the problem of building optimized index structures for phrase queries [6] (e.g., "New York"). In particular, intersections can be used to evaluate phrase queries, while on the other hand some of the most profitable pairs of lists in intersection caching turn out to be common phrases. ...
Conference Paper
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level.We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
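The intermediate caching level this abstract describes, keeping intersections of inverted lists for frequently co-occurring term pairs, can be illustrated with a minimal LRU sketch. This is a hedged toy, not the paper's weighted caching algorithms; the class name, the dict-of-postings index, and the eviction policy are assumptions.

```python
# Illustrative LRU cache for intersections of two inverted lists.
from collections import OrderedDict

class IntersectionCache:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.cache = OrderedDict()  # (term_a, term_b) -> sorted doc-id list

    def intersect(self, index, a, b):
        key = tuple(sorted((a, b)))  # order-independent key
        if key in self.cache:
            self.cache.move_to_end(key)  # cache hit: mark as recently used
            return self.cache[key]
        result = sorted(set(index.get(a, [])) & set(index.get(b, [])))
        self.cache[key] = result
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return result
```

The paper's point is that admission and eviction for this level should be weighted (e.g., by list sizes and pair frequency) rather than plain LRU; this sketch only shows where such a cache sits.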
Conference Paper
This paper addresses the problem of identifying collection dependent stop-words in order to reduce the size of inverted files. We present four methods to automatically recognise stop-words, analyse the tradeoff between efficiency and effectiveness, and compare them with a previous pruning approach. The experiments allow us to conclude that in some situations stop-words pruning is competitive with respect to other inverted file reduction techniques.
Conference Paper
Modern web search engines, while indexing billions of web pages, are expected to process queries and return results in a very short time. Many approaches have been proposed for efficiently computing top-k query results, but most of them ignore one key factor in the ranking functions of commercial search engines - term proximity, which is the metric of the distance between query terms in a document. When term proximity is included in ranking functions, most of the existing top-k algorithms become inefficient. To address this problem, in this paper we propose to build a compact phrase index to speed up the search process when incorporating the term-proximity factor. The compact phrase index can help more accurately estimate the score upper bounds of unknown documents. The size of the phrase index is controlled by including a small portion of phrases which are possibly helpful for improving search performance. Phrase indexes have been used to process phrase queries in existing work. It is, however, to the best of our knowledge, the first time that a phrase index is used to improve the performance of generic queries. Experimental results show that, compared with the state-of-the-art top-k computation approaches, our approach can reduce average query processing time to 1/5 for typical settings.
Conference Paper
Full-text available
In this paper, we propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency with negligible extra storage cost. In our experimental evaluation, a common phrase index has 5% and 20% improvement in query time for the overall and large queries (queries of long phrases) respectively over an auxiliary nextword index. Moreover, it uses only 1% extra storage cost. Compared with an inverted index, our improvement is 40% and 72% for the overall and large queries respectively.
Conference Paper
Along with single-word queries, phrase queries are frequently used in digital libraries. This paper proposes a new partition-based hierarchical index structure for efficient phrase queries and a parallel algorithm based on this index structure. In this scheme, a document is divided into several elements, which are distributed over several processors. In each processor, a hierarchical inverted index is built, with which single-word and phrase queries can be answered efficiently. This index structure and the partitioning make the postings lists shorter; at the same time, integer compression techniques are used more efficiently. Experiments and analysis show that query evaluation time is significantly reduced.
Article
Term proximity scoring is an established means in information retrieval for improving result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents where tuples of terms, usually pairs, occur together, usually incurring a huge index size compared to term-only indexes. This article introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance under controlled result quality, given an upper bound for the index size. The framework allows to selectively materialize lists for pairs based on a query log to further reduce index size. Extensive experiments with two large text collections demonstrate runtime improvements of more than one order of magnitude over existing text-based processing techniques with reasonable index sizes.
Article
This paper proposes a static index pruning method for phrase queries based on the cohesive similarity between terms. The co-occurrence between terms is modelled by a term's cohesiveness within a document; the less relevant terms are pruned away while assuring that there is no change in the top-k results, so the proposed method creates an effective pruned index. The method also considers term proximity based on term frequency and term informativeness. Experiments were conducted on a number of different standard text collections, and analysis shows promising results comparable with the existing static pruning method. It is an advantage of the proposed approach that it can also be applied to a standard inverted index for phrase queries.
Chapter
Radiofrequency Identification (RFID) is an automated technology for communication between two objects: a reader and a tag. The fundamental challenge in chipless RFID tags is encoding the data without a chip. This problem is solved by using a resonator that exploits electromagnetic properties to encode the data bits. The structure conducts deep absorption of the impinging signal at multiple frequencies associated with the resonator loops. This paper presents enhancements of several chipless tag designs with good performance. Two tag designs are proposed. The first is a square tag encoding 8 data bits on a Rogers RO4003C substrate spanning 12 x 12 mm2, showing that good data capacity can be obtained in a small area. The second consists of 7 concentric circular rings overlapped with different metals, plus a solid circle inside the rings; it was designed to suit mass-production techniques using low-cost substrate materials and is called the Overlapped Metals Tag (OMT).
Chapter
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. A search algorithm for the case when the query consists of high-frequency words is discussed. In addition, we present results of experiments with different values of MaxDistance to evaluate the dependence of search speed on the value of MaxDistance. These results show that the average query execution time with our indexes is 94.7-45.9 times (depending on the value of MaxDistance) less than that with standard inverted files when queries containing high-frequency words are evaluated.
Chapter
Full-text available
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in a text, we use additional indexes to store information about nearby words that are at distances from the given word of less than or equal to the MaxDistance parameter. We showed that additional indexes with three-component keys can be used to improve the average query execution time by up to 94.7 times if the queries consist of high-frequency words. In this paper, we present a new search algorithm with even more performance gains. We consider several strategies for selecting multi-component key indexes for a specific query and compare these strategies with the optimal strategy. We also present the results of search experiments, which show that three-component key indexes enable much faster searches in comparison with two-component key indexes.
Article
Retrieval efficiency is crucial for large-scale information retrieval systems. By analyzing the documents and users' query logs of a real search engine, a blocking inverted file structure is proposed. Simulation results show that the retrieval algorithm under the new organization of the inverted file decreases execution time significantly; the optimal parameter selection for this blocking organization is also discussed.
Article
The establishment of the inverted index is the main component of an indexing system. In this paper, we propose a novel fusion method using dynamic updating. First, the process of establishing the inverted index is introduced. Second, the regimes of dynamic updating for a hash-list-based index are discussed in detail. Experimental results show that this method can save index merging time as well as spare some memory space.
Article
We study the caching of query result pages in Web search engines. Popular search engines receive millions of queries per day, and efficient policies for caching query results may enable them to lower their response time. In this paper, we propose an architecture that uses a combination of result caching and an admission policy to improve the efficiency of search engines. In our system, we divide the cache into two layers to ensure a high cache hit ratio. We propose an admission policy to prevent infrequent queries from taking the space of more frequent queries in the cache. We also introduce a new eviction policy to update the result cache, which is more general than traditional heuristics such as LRU. We experiment with real query logs and a large document collection, and show that the hybrid cache enables efficient reduction of query processing costs and thus is practical to use in Web search engines.
Conference Paper
Prior research into search system scalability has primarily addressed query processing efficiency [1, 2, 3] or indexing efficiency [3], or has presented some arbitrary system architecture [4]. Little work has introduced any formal theoretical framework for evaluating architectures with regard to specific operational requirements, or for comparing architectures beyond simple timings [5] or basic simulations [6, 7]. In this paper, we present a framework based upon queuing network theory for analyzing search systems in terms of operational requirements. We use response time, throughput, and utilization as the key operational characteristics for evaluating performance. Within this framework, we present a scalability strategy that combines index partitioning and index replication to satisfy a given set of requirements
Conference Paper
To augment the information retrieval process, a model is proposed to facilitate simple contextual indexing for a large collection of standard text corpora. An Edge Index Graph model is presented, which clusters documents based on a root index and an edge index. Intelligent information retrieval is possible with the proposed system, where the querying process provides proactive help to users through a knowledge base. The query is provided with automatic phrase completion and word suggestions. A thesaurus is used to provide meaningful search of the query. This model can be utilized for document retrieval, clustering, and phrase browsing.
Conference Paper
The world wide web of today publishes a great number of real-time content, causing the increasing need for a differentiated way of searching. In this paper, three issues related to retrieving real-time content are presented, and their applications are proposed. First, the characteristics of real-time content, as well as the concept of real-time search are introduced. Second, the real-time technologies that enable real-time search are described. Finally, a platform for application services utilizing real-time search is proposed.
Article
Scalability is a major disadvantage of Web question-answering (QA) systems, producing slow responses and tedious search times. The bottleneck lies in the commercial search engine used in simple QA systems. In the normal QA architecture, after the related documents in the text corpus have been found, analyzing these documents and retrieving the answers costs much time. The reason is that the traditional inverted index used in commercial search engines is not optimal for QA systems. One solution is to index the positions of answers to QA queries directly in the corpus, rather than indexing only single, meaningless words. The response time of the improved search engine for QA queries can then be expected to be almost at the same level as that of commercial search engines for searches issued by users. In this paper, we propose a new inverted index structure for QA systems. By indexing possible meaningful phrases relative to the positions of items (words), our approach can improve response time without losing the advantages of the inverted index. Given the ever larger amounts of cheap storage available in today's computers, the extra space used by the approach may be regarded as negligible. Thus, our approach can be used in any large-scale QA system, which always produces an enormous quantity of possible answers.
Conference Paper
Full-text available
Both phrases and Boolean queries have a long history in information retrieval, particularly in commercial systems. In previous work, Boolean queries have been used as a source of phrases for a statistical retrieval model. This work, like the majority of research on phrases, resulted in little improvement in retrieval effectiveness. In this paper, we describe an approach where phrases identified in natural language queries are used to build structured queries for a probabilistic retrieval model. Our results show that using phrases in this way can improve performance, and that phrases that are automatically extracted from a natural language query perform nearly as well as manually selected phrases.
Conference Paper
Full-text available
Most search systems for querying large document collections, e.g., Web search engines, are based on well-understood information retrieval principles. These systems are both efficient and effective in finding answers to many user information needs, expressed through informal ranked or structured Boolean queries. Phrase querying and browsing are additional techniques that can augment or replace conventional querying tools. The authors propose optimisations for phrase querying with a nextword index, an efficient structure for phrase-based searching. We show that careful consideration of which search terms are evaluated in a query plan and optimisation of the order of evaluation of the plan can reduce query evaluation costs by more than a factor of five. We conclude that, for phrase querying and browsing with nextword indexes, an ordered query plan should be used for all browsing and querying. Moreover, we show that optimised phrase querying is practical on large text collections
Article
Full-text available
This article compares search effectiveness when using query-based Internet search (via the Google search engine), directory-based search (via Yahoo) and phrase-based query reformulation assisted search (via the Hyperindex browser) by means of a controlled, user-based experimental study. The focus was to evaluate aspects of the search process. Cognitive load was measured using a secondary digit-monitoring task to quantify the effort of the user in various search states; independent relevance judgements were employed to gauge the quality of the documents accessed during the search process. Time was monitored in various search states. Results indicated that directory-based search does not offer increased relevance over query-based search (with or without query formulation assistance), and also takes longer. Query reformulation does significantly improve the relevance of the documents through which the user must trawl versus standard query-based internet search. However,...
Article
Full-text available
Term clustering and syntactic phrase formation are methods for transforming natural language text. Both have had only mixed success as strategies for improving the quality of text representations for document retrieval. Since the strengths of these methods are complementary, we have explored combining them to produce superior representations.
Article
Full-text available
Phrase browsing techniques use phrases extracted automatically from a large information collection as a basis for browsing and accessing it. This paper describes a case study that uses an automatically constructed phrase hierarchy to facilitate browsing of an ordinary large Web site. Phrases are extracted from the full text using a novel combination of rudimentary syntactic processing and sequential grammar induction techniques. The interface is simple, robust and easy to use. To convey a feeling for the quality of the phrases that are generated automatically, a thesaurus used by the organization responsible for the Web site is studied and its degree of overlap with the phrases in the hierarchy is analyzed. Our ultimate goal is to amalgamate hierarchical phrase browsing and hierarchical thesaurus browsing: the latter provides an authoritative domain vocabulary and the former augments coverage in areas the thesaurus does not reach.
Article
In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides an insight into the public practices and choices in Web searching.
Article
We investigate the application of a novel relevance ranking technique, cover density ranking, to the requirements of Web-based information retrieval, where a typical query consists of a few search terms and a typical result consists of a page indicating several potentially relevant documents. Traditional ranking methods for information retrieval, based on term and inverse document frequencies, have been found to work poorly in this context. Under the cover density measure, ranking is based on term proximity and cooccurrence. Experimental comparisons show performance that compares favorably with previous work.
Article
A frozen 18.5 million page snapshot of part of the Web has been created to enable and encourage meaningful and reproducible evaluation of Web search systems and techniques. This collection is being used in an evaluation framework within the Text Retrieval Conference (TREC) and will hopefully provide convincing answers to questions such as, “Can link information result in better rankings?”, “Do longer queries result in better answers?”, and, “Do TREC systems work well on Web data?” The snapshot and associated evaluation methods are described and an invitation is extended to participate. Preliminary results are presented for an effectivess comparison of six TREC systems working on the snapshot collection against five well-known Web search systems working over the current Web. These suggest that the standard of document rankings produced by public Web search engines is by no means state-of-the-art.
Conference Paper
Considerable research effort has been invested in improving the effectiveness of information retrieval systems. Techniques such as relevance feedback, thesaural expansion, and pivoting all provide better quality responses to queries when tested in standard evaluation frameworks. But such enhancements can add to the cost of evaluating queries. In this paper we consider the pragmatic issue of how to improve the cost-effectiveness of searching. We describe a new inverted file structure using quantized weights that provides superior retrieval effectiveness compared to conventional inverted file structures when early termination heuristics are employed. That is, we are able to reach similar effectiveness levels with less computational cost, and so provide a better cost/performance compromise than previous inverted file organisations.
Article
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
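The internal-index ("self-indexing") idea above, embedding sparse synchronization points in each inverted list so a scan can jump ahead, can be approximated, ignoring compression, with skip pointers over plain sorted lists. This is a hedged sketch under those simplifying assumptions, not the paper's structure for compressed lists; all names are illustrative.

```python
# Skip-pointer sketch: sqrt(n)-spaced entries let intersection jump
# over runs of the longer list instead of scanning every posting.
import math

def add_skips(postings):
    """Return sparse skip entries as (index, doc_id) pairs."""
    step = max(1, int(math.sqrt(len(postings))))
    return [(i, postings[i]) for i in range(0, len(postings), step)]

def intersect_with_skips(short, long_list):
    """Intersect two sorted doc-id lists, skipping within the longer one."""
    skips = add_skips(long_list)
    result, lo = [], 0
    for doc in short:
        # Advance lo to the last skip entry not exceeding doc.
        for idx, skip_doc in skips:
            if skip_doc <= doc and idx >= lo:
                lo = idx
        while lo < len(long_list) and long_list[lo] < doc:
            lo += 1
        if lo < len(long_list) and long_list[lo] == doc:
            result.append(doc)
    return result
```

In the paper's setting the synchronization points additionally record decoding state, so portions of the compressed list between skips never need to be decompressed.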
Article
Ranked queries are used to locate relevant documents in text databases. In a ranked query a list of terms is specified, then the documents that most closely match the query are returned---in decreasing order of similarity---as answers. Crucial to the efficacy of ranked querying is the use of a similarity heuristic, a mechanism that assigns a numeric score indicating how closely a document and the query match. In this note we explore and categorise a range of similarity heuristics described in the literature. We have implemented all of these measures in a structured way, and have carried out retrieval experiments with a substantial subset of these measures.Our purpose with this work is threefold: first, in enumerating the various measures in an orthogonal framework we make it straightforward for other researchers to describe and discuss similarity measures; second, by experimenting with a wide range of the measures, we hope to observe which features yield good retrieval behaviour in a variety of retrieval environments; and third, by describing our results so far, to gather feedback on the issues we have uncovered. We demonstrate that it is surprisingly difficult to identify which techniques work best, and comment on the experimental methodology required to support any claims as to the superiority of one method over another.
Article
Browsing accounts for much of people's interaction with digital libraries, but it is poorly supported by standard search engines. Conventional systems often operate at the wrong level, indexing words when people think in terms of topics, and returning documents when people want a broader view. As a result, users cannot easily determine what is in a collection, how well a particular topic is covered, or what kinds of queries will provide useful results. We have built a new kind of search engine, Keyphind, that is explicitly designed to support browsing. Automatically extracted keyphrases form the basic unit of both indexing and presentation, allowing users to interact with the collection at the level of topics and subjects rather than words and documents. The keyphrase index also provides a simple mechanism for clustering documents, refining queries, and previewing results. We compared Keyphind to a traditional query engine in a small usability study. Users reported that certain kinds of browsing tasks were much easier with the new interface, indicating that a keyphrase index would be a useful supplement to existing search tools. This is an author’s version of an article published in the journal: Decision Support Systems. © 1999 Elsevier Science B.V.
Article
In this paper we examine the question of query parsing for World Wide Web queries and present a novel method for phrase recognition and expansion. Given a training corpus of approximately 16 million Web queries and a handwritten context-free grammar, the EM algorithm is used to estimate the parameters of a probabilistic context-free grammar (PCFG) with a system developed by Carroll [5]. We use the PCFG to compute the most probable parse for a user query, reflecting linguistic structure and word usage of the domain being parsed. The optimal syntactic parse for a user query thus obtained is employed for phrase recognition and expansion. Phrase recognition is used to increase retrieval precision; phrase expansion is applied to make the best use possible of very short Web queries.
Article
Most search systems for querying large document collections---for example, web search engines---are based on well-understood information retrieval principles. These systems are both efficient and effective in finding answers to many user information needs, expressed through informal ranked or structured Boolean queries. Phrase querying and browsing are additional techniques that can augment or replace conventional querying tools. In this paper, we propose optimisations for phrase querying with a nextword index, an efficient structure for phrase-based searching. We show that careful consideration of which search terms are evaluated in a query plan and optimisation of the order of evaluation of the plan can reduce query evaluation costs by more than a factor of five. We conclude that, for phrase querying and browsing with nextword indexes, an ordered query plan should be used for all browsing and querying. Moreover, we show that optimised phrase querying is practical on large text collections.
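The nextword structure described above can be sketched as follows: for each indexed word, the index maps successor words to the positions where the two-word pair occurs, so a phrase lookup needs no list intersection. Restricting the indexed firstwords to common words gives the auxiliary variant discussed in the surrounding paper. This is a simplified in-memory illustration, not the authors' on-disk implementation; the function names are assumptions.

```python
from collections import defaultdict

def build_nextword_index(docs, common_words):
    """Index only pairs that start with a common word, as in an
    auxiliary nextword index; rarer firstwords would be handled by
    the ordinary inverted file.

    docs: {doc_id: [token, ...]}
    returns: {word: {nextword: [(doc_id, position), ...]}}
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for i in range(len(tokens) - 1):
            if tokens[i] in common_words:
                index[tokens[i]][tokens[i + 1]].append((doc_id, i))
    return index

def phrase_hits(index, w1, w2):
    """All positions where the two-word phrase 'w1 w2' occurs --
    a direct lookup, with no merging of postings lists."""
    return index.get(w1, {}).get(w2, [])
```

Because common words have the longest postings lists in a conventional inverted file, handling exactly those words with nextword lookups removes the most expensive intersections while keeping the extra index small.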
Article
Ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. We propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. CPU time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. The principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces CPU time and disk traffic to around one third of the original requirement. We also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed.
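The frequency-sorted index design above can be sketched briefly: each term's postings are stored in decreasing order of within-document frequency, so a query evaluator can read only a prefix of the list and still see the strongest candidates first. This is a toy in-memory sketch under that assumption, not the paper's compressed on-disk layout; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_freq_sorted_index(docs):
    """Inverted index whose postings are sorted by decreasing
    within-document frequency rather than by document number.

    docs: {doc_id: [token, ...]}
    returns: {term: [(freq, doc_id), ...] in decreasing-freq order}
    """
    by_term = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            by_term[term].append((freq, doc_id))
    return {t: sorted(p, reverse=True) for t, p in by_term.items()}

def top_candidates(index, term, k):
    """Early termination: read only the first k postings, which are
    guaranteed to hold the highest within-document frequencies."""
    return index.get(term, [])[:k]
```

Reading a fixed-length prefix instead of the whole list is what cuts CPU time and disk traffic, at the cost of losing document-number order (and therefore d-gap compression) within each list.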
Article
Text retrieval systems are used to fetch documents from large text collections, using queries consisting of words and word sequences.
Relevance ranking for one- to three-term queries
  • C L Clarke
  • G V Cormack
  • E A Tudhope
C. L. Clarke, G. V. Cormack, and E. A. Tudhope. Relevance ranking for one- to three-term queries. In Proc. of RIAO-97, 5th International Conference "Recherche d'Information Assistee par Ordinateur", pages 388–400, Montreal, CA, 1997.