Article

What's Next? - Index Structures for Efficient Phrase Querying

Abstract

Text retrieval systems are used to fetch documents from large text collections, using queries consisting of words and word sequences.

... These phrase queries have the benefit of better discrimination of documents than regular term queries but, as will be shown, come with an additional cost [13]. ...
... In the research work [13] the issue of phrase queries in inverted indexes is discussed and approached further. This is one of the earliest research works that addresses this problem. ...
... More thorough work was accomplished in [5] with the introduction of nextword index structures, followed by the introduction of phrase indexes in [12] and the substantial gains of combining structures into hybrid IR systems. The latter approach is discussed further in [13]. ...
... Also, these early termination heuristics do not retain the complete set of result documents. The nextword index provides a fast alternative for resolving phrase queries, phrase browsing, and phrase completion [15]. Unlike an inverted index, it has a list of nextwords and positions following each distinct word. ...
... An inverted index is not efficient for evaluating queries with common terms, since the three most common words account for about 4% of the size of the whole index file [4], and retrieving such long postings lists can incur long operation times. Hence, the nextword index [15] was proposed, which records additional index information to support fast evaluation of phrase queries. A nextword index is a three-level structure. ...
... This speeds up the evaluation of a phrase query. The applications of the nextword index can be found in [15]. However, the size of the index is large. ...
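The three-level layout sketched in these excerpts (firstword, then its nextwords, then the postings for each pair) can be pictured with a small in-memory toy. Everything below, from the whitespace tokenisation to the dictionary layout, is an illustrative assumption rather than the on-disk structure used in the cited papers.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """Build a toy in-memory nextword index.

    Structure: firstword -> nextword -> list of (doc_id, position of firstword).
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        words = text.lower().split()
        for pos in range(len(words) - 1):
            index[words[pos]][words[pos + 1]].append((doc_id, pos))
    return index

def phrase_query(index, phrase):
    """Resolve a phrase by intersecting candidate (doc, position) pairs
    for each consecutive word pair in the phrase."""
    words = phrase.lower().split()
    if len(words) < 2:
        raise ValueError("need at least two words for a nextword lookup")
    # Candidate start positions of the phrase, seeded by the first pair.
    candidates = set(index[words[0]].get(words[1], []))
    for offset in range(1, len(words) - 1):
        pairs = index[words[offset]].get(words[offset + 1], [])
        # Shift each pair back to the phrase start position before intersecting.
        candidates &= {(d, p - offset) for d, p in pairs}
    return sorted({d for d, _ in candidates})

docs = {1: "the quick brown fox", 2: "a quick brown dog", 3: "brown fox jumps"}
print(phrase_query(build_nextword_index(docs), "quick brown fox"))  # -> [1]
```

Because every consecutive pair is pre-joined, a phrase lookup only intersects already position-aware pair postings, which is where the reported speed-ups come from; the price, as the excerpts note, is a considerably larger index.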
Conference Paper
In this paper, we propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency with modest extra storage cost. Further improvement in efficiency can be attained by implementing our index according to our observation of the dynamic nature of the common word set. In experimental evaluation, a common phrase index using 255 common words has an improvement of about 11% and 62% in query time for the overall and large queries (queries of long phrases) respectively over an auxiliary nextword index. Moreover, it has only about 19% extra storage cost. Compared with an inverted index, our improvement is about 72% and 87% for the overall and large queries respectively. We also propose to implement a common phrase index with a dynamic update feature. Our experiments show that more improvement in time efficiency can be achieved.
... Another solution is to index phrases directly, but the set of word pairs in a text collection is large and an index on such phrases difficult to manage. In recent work, nextword indexes were proposed as a way of supporting phrase queries and phrase browsing [2, 3, 15]. In a nextword index, for each index term or firstword there is a list of the words or nextwords that follow that term, together with the documents and word positions at which the firstword and nextword occur as a pair. ...
... Another way to support phrase based query modes is to index and store phrases directly [8] or simply by using an inverted index and approximating phrases through a ranked query technique [5, 10]. Greater efficiency, with no additional in-memory space overheads, is possible with a special-purpose structure, the nextword index [15] , where search structures are used to accelerate processing of word pairs. The nextword index takes the middle ground by indexing pairs of words and, therefore, is particularly good at resolving phrase queries containing two or more words. ...
... It follows that phrase query evaluation can be extremely fast. Nextword indexes also have the benefit of allowing phrase browsing or phrase querying [4, 15]; given a sequence of words, the index can be used to identify which words follow the sequence, thus providing an alternative mechanism for searching text collections. We do not consider phrase browsing further in this paper, however. ...
Conference Paper
Search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this paper we consider how phrase queries can be efficiently supported with low disk overheads. Previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. We propose a combination of nextword indexes with inverted files as a solution to this problem. Our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. Further time savings are available with only slight increases in disk requirements.
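The hybrid scheme in this abstract resolves some word pairs from the auxiliary nextword index and the remaining words from the ordinary inverted file. The planner below is only a rough sketch of that split, assuming the auxiliary index covers pairs whose first word belongs to a small common-word list; the stop-list and function names are invented for illustration, not taken from the paper.

```python
COMMON_WORDS = {"the", "of", "in", "to", "and"}  # illustrative stop-list only

def plan_phrase_pairs(phrase_words):
    """Split a phrase into word pairs answerable by the auxiliary nextword
    index (first word is common) and single words that fall back to the
    conventional positional inverted file."""
    nextword_pairs, inverted_terms = [], []
    i = 0
    while i < len(phrase_words) - 1:
        first, second = phrase_words[i], phrase_words[i + 1]
        if first in COMMON_WORDS:
            nextword_pairs.append((i, first, second))
            i += 2          # the pair consumes both words
        else:
            inverted_terms.append((i, first))
            i += 1
    if i == len(phrase_words) - 1:
        inverted_terms.append((i, phrase_words[i]))
    return nextword_pairs, inverted_terms

print(plan_phrase_pairs("the house of the rising sun".split()))
# ([(0, 'the', 'house'), (2, 'of', 'the')], [(4, 'rising'), (5, 'sun')])
```

In a full system, the pair lookups and the single-term lookups would then be merged by document and word position to confirm the complete phrase.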
... Also, these early termination heuristics do not retain the complete set of result documents. The nextword index provides a fast alternative for resolving phrase queries, phrase browsing and phrase completion (Williams, Zobel, & Anderson, 1999). Unlike an inverted index, it has a list of nextwords and positions following each distinct word. ...
... An inverted index is not efficient for evaluating queries containing common words, since the three most common words account for about 4% of the size of the whole index file (Bahle et al., 2002), and retrieving such long postings lists can suffer from long operation times. Hence, the nextword index (Williams et al., 1999) was proposed, which records additional index information to support fast evaluation of phrase queries. ...
... This speeds up the evaluation of a phrase query. For more details and the various applications of the nextword index, the reader is referred to Williams et al. (1999). However, the size of the index is large, as mentioned before. ...
Article
Full-text available
In this paper, we propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency with modest extra storage cost. Further improvement in efficiency can be attained by implementing our index according to our observation of the dynamic nature of the common word set. In experimental evaluation, a common phrase index using 255 common words has an improvement of about 11% and 62% in query time for the overall and large queries (queries of long phrases) respectively over an auxiliary nextword index. Moreover, it has only about 19% extra storage cost. Compared with an inverted index, our improvement is about 72% and 87% for the overall and large queries respectively. We also propose to implement a common phrase index with a dynamic update feature. Our experiments show that more improvement in time efficiency can be achieved.
... Such a technique is particularly powerful when a user cannot easily express a concept in a few words, or does not have a concrete list of words to express the concept. Phrase browsing has been shown to be practical for small collections [6, 7, 10], but it is unclear how these techniques may work for large collections. In a query log from the Excite search engine [9] of around 1.8 million queries, almost 7% were phrase queries or contained a phrase query component. ...
... Efficient evaluation of phrase queries is therefore crucial to overall retrieval performance. An index structure specifically designed for fast phrase querying has previously been proposed and compared to conventional structures [10]. This structure—the nextword index—has been shown to be around five times faster than a conventional structure for resolving typical two- or three-word phrase queries. ...
... The structure of the nextword index and the algorithms used for phrase query and phrase browsing are described in detail elsewhere [1, 10]. In the following section, we briefly discuss the structure and storage of inverted lists. ...
... After refining a phrase, a user could return to conventional querying to formulate a better informal query, or the user could retrieve documents containing the browsed phrase. We have previously proposed efficient data structures for special-purpose phrase querying and browsing [10]. We have shown that these structures can permit phrase searching that is two to four times faster than with an efficient conventional system. ...
... For example, a ranked phrase query: "Richmond Football Club" premiership 1980 contains three terms, one of which is a phrase. Another technique is phrase browsing, where terms in the vocabulary are explored in the context in which they occur in the database [3] [7] [10]. To support phrase querying, word positions must be stored in the index. ...
... We have previously described the nextword index structure for phrase querying and browsing [10]. A nextword index stores the words that occur in a collection and, for each such word, the words that immediately follow that word anywhere in the collection. ...
Article
Most search systems for querying large document collections---for example, web search engines---are based on well-understood information retrieval principles. These systems are both efficient and effective in finding answers to many user information needs, expressed through informal ranked or structured Boolean queries. Phrase querying and browsing are additional techniques that can augment or replace conventional querying tools. In this paper, we propose optimisations for phrase querying with a nextword index, an efficient structure for phrase-based searching. We show that careful consideration of which search terms are evaluated in a query plan and optimisation of the order of evaluation of the plan can reduce query evaluation costs by more than a factor of five. We conclude that, for phrase querying and browsing with nextword indexes, an ordered query plan should be used for all browsing and querying. Moreover, we show that optimised phrase querying is practical on large text collections.
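The "ordered query plan" idea above can be approximated by sorting lookups by estimated selectivity so the running candidate set shrinks as early as possible. The snippet below shows only that principle on plain document sets; it is not the paper's plan optimiser, and the data are invented.

```python
def ordered_intersection(postings_by_term, query_terms):
    """Intersect document sets for query terms, rarest list first,
    so the running candidate set shrinks as quickly as possible."""
    ordered = sorted(query_terms, key=lambda t: len(postings_by_term.get(t, ())))
    result = None
    for term in ordered:
        docs = set(postings_by_term.get(term, ()))
        result = docs if result is None else result & docs
        if not result:          # early exit: no document can match
            break
    return result or set()

postings = {"nextword": {1, 2, 3, 4}, "index": {2, 3, 4, 5, 6, 7}, "zobel": {3}}
print(ordered_intersection(postings, ["index", "nextword", "zobel"]))  # -> {3}
```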
... Specifically, we must add a significant one-time preparation phase in order to construct a suitable DB. A key piece of our preparation phase is inspired by next-word indexing [23]. ...
... The next-word index is a very efficient data structure that is often used in phrase searching in the information retrieval space. As mentioned in [23], the next-word index allows for faster phrase queries as the next-word index is smaller than an index that catalogs the positions of every word in the document collection. In addition, there is only a limited reduction in query expressiveness. ...
Article
Full-text available
In recent years searchable symmetric encryption has seen a rapid increase in query expressiveness including keyword, phrase, Boolean, and fuzzy queries. With this expressiveness came increasingly complex constructions. Having these facts in mind, we present an efficient and generic searchable symmetric encryption construction for phrase queries. Our construction is straightforward to implement, and is proven secure under adaptively chosen query attacks (CQA2) in the random oracle model with an honest-but-curious adversary. To our knowledge, this is the first encrypted phrase search system that achieves CQA2 security. Moreover, we demonstrate that our document collection preprocessing algorithm allows us to extend a dynamic SSE construction so that it supports phrase queries. We also provide a compiler theorem which transforms any CQA2-secure SSE construction for keyword queries into a CQA2-secure SSE construction that supports phrase queries.
... There is a body of literature [3, 22, 4, 2] that discusses modifications to the inverted-index structure to support fast evaluation of specific query classes. In prior work, nextword indexes [3, 22] were proposed as a way of supporting phrase queries and phrase browsing. In a nextword index, for each index term or firstword, there is a list of the words or nextwords that follow that term, together with the documents and word positions at which the firstword and nextword occur as a pair. ...
Article
Entity annotation is emerging as a key enabling requirement for search based on deeper semantics: for example, a search on 'John's address', that returns matches to all entities annotated as an address that co-occur with 'John'. A dominant paradigm adopted by rule-based named entity annotators is to annotate a document at a time. The complexity of this approach varies linearly with the number of documents and the cost for annotating each document, which could be prohibitive for large document corpora. A recently proposed alternative paradigm for rule-based entity annotation [16] operates on the inverted index of a document collection and achieves an order of magnitude speed-up over the document-based counterpart. In addition, the index-based approach permits collection-level optimization of the order of index operations required for the annotation process. It is this aspect that is explored in this paper. We develop a polynomial time algorithm that, based on estimated cost, can optimally select between different logically equivalent evaluation plans for a given rule. Additionally, we prove that this problem becomes NP-hard when the optimization has to be performed over multiple rules and provide effective heuristics for handling this case. Our empirical evaluations show a speed-up factor of up to 2 over the baseline system without optimizations.
... Index-based search tools look up subsequences and their corresponding posting lists in some well-defined data structures. For example, FLASH [1], RAMDB [4], MAP [27] and CAFE [7][8][9][10] have adopted indexing techniques in their search tools. The advantage of index-based search tools over the exhaustive ones is that the pre-built indices can help to speed up the search process. ...
... There are three major considerations of using indexing techniques in genomic search tools: space requirement, sensitivity and effectiveness of retrieved sequences as well as the efficiency of indexing and retrieval. For example, CAFE addressed the first two problems by its compression techniques and two-component search processes while some researchers proposed some indexing techniques to enhance the efficiency [6], [10], [24], [25]. Our goal is to address the third consideration above, i.e., making index-based search tool more practical and scalable to increasing database size and query rates. ...
Conference Paper
Full-text available
Indexing and retrieval techniques for homology searching of genomic databases are increasingly important as the search tools are facing great challenges of rapid growth in sequence collection size. Consequently, the indexing and retrieval of possibly gigabytes of sequences becomes expensive. In this paper, we present two new approaches for indexing genomic databases that can enhance the speed of indexing and retrieval. We show experimentally that the proposed methods can be more computationally efficient than the existing ones.
... Because all types of phrase queries can be decomposed into pairs of words, inverted indexes cannot address dictionary terms of variable length. Nextword lists are a data structure used to represent documents and queries with dictionary terms of variable lengths (Williams et al. 1999). ...
Article
Full-text available
Community question answering (cQA) has emerged as a popular service on the web; users can use it to ask and answer questions and access historical question-answer (QA) pairs. cQA retrieval, as an alternative to general web searches, has several advantages. First, users can register a query in the form of natural language sentences instead of a set of keywords; thus, they can present the required information more clearly and comprehensively. Second, the system returns several possible answers instead of a long list of ranked documents, thereby enhancing the efficient location of the desired answers. Question retrieval from a cQA archive, an essential function of cQA retrieval services, aims to retrieve historical QA pairs relevant to the query question. In this study, combined queries (combined inverted and nextword indexes) are proposed for question retrieval in cQA. The method performance is investigated for two different scenarios: (a) when only questions from QA pairs are used as documents, and (b) when QA pairs are used as documents. In the proposed method, combined indexes are first created for both queries and documents; then, different information retrieval (IR) models are used to retrieve relevant questions from the cQA archive. Evaluation is performed on a public Yahoo! Answers dataset; the results thereby obtained show that using combined queries for all three IR models (vector space model, Okapi model, and language model) improves performance in terms of the retrieval precision and ranking effectiveness. Notably, by using combined indexes when both QA pairs are used as documents, the retrieval and ranking effectiveness of these cQA retrieval models increases significantly.
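A minimal way to picture the "combined queries" of this abstract is to emit both single words and adjacent word pairs as index and query terms; the underscore-joined bigram tokens below are an assumed representation for illustration, not the paper's exact scheme.

```python
def combined_terms(text):
    """Produce combined index terms: single words plus adjacent word pairs."""
    words = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams

print(combined_terms("how do inverted indexes work"))
# ['how', 'do', 'inverted', 'indexes', 'work',
#  'how_do', 'do_inverted', 'inverted_indexes', 'indexes_work']
```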
... Users browse using bags of words rather than single words. Thus, nextword indexes, in which consecutive terms are stored with position information [4,20,25], as well as two-term indexes [22] and word-pair indexes [9,11], have been proposed. Simply put, each pair of terms in a corpus is treated as a single term in the index. ...
Article
Full-text available
Text documents are significant arrangements of various words, while images are significant arrangements of various pixels/features. In addition, text and image data share a similar semantic structural pattern. With reference to this research, the feature pair is defined as a pair of adjacent image features. The innovative feature pair index graph (FPIG) is constructed from the unique feature pair selected, which is constructed using an inverted index structure. The constructed FPIG is helpful in clustering, classifying and retrieving the image data. The proposed FPIG method is validated against the traditional KMeans++, KMeans and Farthest First cluster methods which have the serious drawback of initial centroid selection and local optima. The FPIG method is analyzed using Iris flower image data, and the analysis yields 88% better results than Farthest First and 28.97% better results than conventional KMeans in terms of sum of squared errors. The paper also discusses the scope for further research in the proposed methodology.
... In the 'Database' section, we will introduce the database we use to store the data and describe some of the problems we encountered when storing MBPs. In the 'Data retrieval' section, we will describe a number of indexing mechanisms, including the default Mongo DB queries using regular expressions, our own implementation of nextword indexing [1], and an indexing system we built based on Lucene [2]. In particular, we will describe the structures of the systems for indexing, searching, and carrying out statistical analysis. ...
Article
Full-text available
Dealing with big data in computational social networks may require powerful machines, big storage, and high bandwidth, which may seem beyond the capacity of small labs. We demonstrate that researchers with limited resources may still be able to conduct big-data research by focusing on a specific type of data. In particular, we present a system called MPT (Microblog Processing Toolkit) for handling big volume of microblog posts with commodity computers, which can handle tens of millions of micro posts a day. MPT supports fast search on multiple keywords and returns statistical results. We describe in this paper the architecture of MPT for data collection and phrase search for returning search results with statistical analysis. We then present different indexing mechanisms and compare them on the microblog posts we collected from popular online social network sites in mainland China.
... For example, searching with the phrase "operating system" requires not only that each keyword "operating" and "system" must exist in each returned document, but also that the order in which "operating" is followed by "system" must be satisfied. In [21], the authors introduced a solution based on the Nextword Index [34]. It allows the index to record the keyword position for each document and enables the user to query the consecutive keywords based on binary search over all positions. ...
Article
Full-text available
Searchable encryption technique enables the users to securely store and search their documents over the remote semitrusted server, which is especially suitable for protecting sensitive data in the cloud. However, various settings (based on symmetric or asymmetric encryption) and functionalities (ranked keyword query, range query, phrase query, etc.) are often realized by different methods with different searchable structures that are generally not compatible with each other, which limits the scope of application and hinders the functional extensions. We prove that asymmetric searchable structure could be converted to symmetric structure, and functions could be modeled separately apart from the core searchable structure. Based on this observation, we propose a layered searchable encryption (LSE) scheme, which provides compatibility, flexibility, and security for various settings and functionalities. In this scheme, the outputs of the core searchable component based on either symmetric or asymmetric setting are converted to some uniform mappings, which are then transmitted to loosely coupled functional components to further filter the results. In such a way, all functional components could directly support both symmetric and asymmetric settings. Based on LSE, we propose two representative and novel constructions for ranked keyword query (previously only available in symmetric scheme) and range query (previously only available in asymmetric scheme).
... Storing offsets to perform phrase searches is expensive in storage cost and query processing time. The nextword indexes of Williams et al. [23] take a different approach, storing the set of all word bigrams and using standard compression techniques to reduce the index size; but these indexes are still very large, about 60% of the collection size, due to a large number of terms. For example, in the TREC-10MB collection with about 38,000 unique words, there are over 500,000 unique bigrams. ...
Conference Paper
Full-text available
Inverted indexes using sequences of characters (n-grams) as terms provide an error-resilient and language-independent way to query for arbitrary substrings and perform approximate matching in a text, but present a number of practical problems: they have a very large number of terms, they exhibit pathologically expensive worst-case query times on certain natural inputs, and they cannot cope with very short query strings. In word-based indexes, static index pruning has been successful in reducing index size while maintaining precision, at the expense of recall. Taking advantage of the unique inclusion structure of n-gram terms of different lengths, we show that the lexicon size of an n-gram index can be reduced by 7 to 15 times without any loss of recall, and without any increase in either index size or query time. Because the lexicon is typically stored in main memory, this substantially reduces the memory required for queries. Simultaneously, our construction is also the first overlapping n-gram index to place tunable worst-case bounds on false positives and to permit efficient queries on strings of any length. Using this construction, we also demonstrate the first feasible n-gram index using words rather than characters as units, and its applications to phrase searching.
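For orientation, the snippet below shows the basic overlapping character n-gram indexing that this abstract starts from, before any lexicon pruning; the trigram size and the candidate-then-verify flow are generic assumptions, not the authors' construction.

```python
from collections import defaultdict

def char_ngram_index(docs, n=3):
    """Toy overlapping character n-gram index: n-gram -> set of doc ids.
    Substring queries longer than n are answered by intersecting the
    posting sets of their n-grams, then verifying in the candidate docs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add(doc_id)
    return index

def substring_candidates(index, query, n=3):
    """Candidate documents for a query of length >= n (still need verification)."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    sets = [index.get(g, set()) for g in grams]
    return set.intersection(*sets) if sets else set()

docs = {1: "phrase querying", 2: "phrase browsing"}
idx = char_ngram_index(docs)
print(substring_candidates(idx, "query"))  # {1}
```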
... Later on, various authors have contributed to speeding up the phrase querying times of (full-text) inverted indexes. So-called nextword indexes were proposed by Williams et al. [6]. For each term or firstword, they store a list of all successors together with the positions at which they occur as a consecutive pair. ...
Conference Paper
We present a method for optimizing phrase search based on inverted indexes. Our approach adds selected (two-term) phrases to an existing index. Whereas competing approaches are often based on the analysis of query logs, our approach works out of the box and uses only the information contained in the index. Also, our method is competitive in terms of query performance and can even improve on other approaches for difficult queries. Moreover, our approach gives performance guarantees for arbitrary queries. Further, we propose using a phrase index as a substitute for the positional index of an in-memory search engine working with short documents. We support our conclusions with experiments using a high-performance main-memory search engine. We also give evidence that classical disk based systems can profit from our approach.
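Selecting two-term phrases "out of the box" has to rely on information already in the index. One plausible rule of that flavour, sketched below, materialises pairs whose individual posting lists are both long, so that the positional intersection they would otherwise require is expensive; the threshold and the criterion itself are assumptions made for illustration, not the authors' method.

```python
def select_phrase_candidates(postings, pair_frequencies, min_list_len=1000):
    """Pick two-term phrases worth adding to the index: pairs in which both
    terms have long posting lists and which actually occur as adjacent pairs."""
    selected = []
    for (a, b), freq in pair_frequencies.items():
        if (len(postings.get(a, ())) >= min_list_len
                and len(postings.get(b, ())) >= min_list_len
                and freq > 0):
            selected.append((a, b))
    return selected

postings = {"new": range(5000), "york": range(3000), "zygote": range(8)}
pairs = {("new", "york"): 2500, ("new", "zygote"): 1}
print(select_phrase_candidates(postings, pairs))  # [('new', 'york')]
```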
... Different heuristics are proposed in this respect, such as maintaining the inverted lists only for popular phrases, or maintaining inverted lists of all phrases up to some fixed number (say h) of words. Another approach is called the "next-word index" [36, 3, 4, 37], in which, corresponding to each term w, a list of all the terms which occur immediately after w is maintained. This approach will double the space, but it can support searching of any phrase with two words efficiently. ...
Conference Paper
Full-text available
Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index has a shortcoming, in that only predefined pattern queries can be supported efficiently. In terms of string documents where word boundaries are undefined, if we have to index all the substrings of a given document, then the storage quickly becomes quadratic in the data size. Also, if we want to apply the same type of indexes for querying phrases or sequence of words, then the inverted index will end up storing redundant information. In this paper, we show the first set of inverted indexes which work naturally for strings as well as phrase searching. The central idea is to exclude document d in the inverted list of a string P if every occurrence of P in d is subsumed by another string of which P is a prefix. With this we show that our space utilization is close to the optimal. Techniques from succinct data structures are deployed to achieve compression while allowing fast access in terms of frequency and document id based retrieval. Compression and speed trade-offs are evaluated for different variants of the proposed index. For phrase searching, we show that our indexes compare favorably against a typical inverted index deploying position-wise intersections. We also show efficient top-k based retrieval under relevance metrics like frequency and tf-idf.
... (Of course, GuruQA is not designed to find all relevant documents, like BE does.) A series of articles describes the nextword index [5, 23, 4], a structure designed to speed up phrase queries and to enable some amount of "phrase browsing." It is an inverted index where each term list contains a list of the successor words found in the corpus. ...
Conference Paper
Many modern natural language-processing applications utilize search engines to locate large numbers of Web documents or to compute statistics over the Web corpus. Yet Web search engines are designed and optimized for simple human queries---they are not well suited to support such applications. As a result, these applications are forced to issue millions of successive queries resulting in unnecessary search engine load and in slow applications with limited scalability. In response, this paper introduces the Bindings Engine (BE), which supports queries containing typed variables and string-processing functions. For example, in response to the query "powerful ‹noun›" BE will return all the nouns in its index that immediately follow the word "powerful", sorted by frequency. In response to the query "Cities such as ProperNoun(Head(‹NounPhrase›))", BE will return a list of proper nouns likely to be city names. BE's novel neighborhood index enables it to do so with O(k) random disk seeks and O(k) serial disk reads, where k is the number of non-variable terms in its query. As a result, BE can yield several orders of magnitude speedup for large-scale language-processing applications. The main cost is a modest increase in space to store the index. We report on experiments validating these claims, and analyze how BE's space-time tradeoff scales with the size of its index and the number of variable types. Finally, we describe how a BE-based application extracts thousands of facts from the Web at interactive speeds in response to simple user queries.
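A crude stand-in for the "powerful ‹noun›" example is a nextword-style counting pass over the corpus. BE's actual neighborhood index stores neighbouring terms inline so queries need only O(k) disk seeks; the in-memory toy below does not attempt to model that, and its corpus and names are invented.

```python
from collections import Counter

def words_following(corpus_sentences, anchor):
    """Count the words that immediately follow `anchor`, most frequent first
    (a rough stand-in for a 'powerful <noun>' style variable query)."""
    counts = Counter()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            if a == anchor:
                counts[b] += 1
    return counts.most_common()

corpus = ["a powerful engine", "this powerful engine", "one powerful idea"]
print(words_following(corpus, "powerful"))  # [('engine', 2), ('idea', 1)]
```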
... Prior research has examined how to efficiently index text documents and resolve text queries: for example, with inverted indices [3], signature files [8], or sparse matrices [9]. Further improvements to these index structures have been made for handling special query types [10] [11] [12] and reducing I/O overhead [13] [14] [15]. While much work addresses this index-level view of search performance, little work addresses performance at the architectural level of a complete search service. ...
Conference Paper
Prior research into search system scalability has primarily addressed query processing efficiency [1, 2, 3] or indexing efficiency [3], or has presented some arbitrary system architecture [4]. Little work has introduced any formal theoretical framework for evaluating architectures with regard to specific operational requirements, or for comparing architectures beyond simple timings [5] or basic simulations [6, 7]. In this paper, we present a framework based upon queuing network theory for analyzing search systems in terms of operational requirements. We use response time, throughput, and utilization as the key operational characteristics for evaluating performance. Within this framework, we present a scalability strategy that combines index partitioning and index replication to satisfy a given set of requirements.
... A Next-Word Index helps to store phrase information. It reduces the time to retrieve the text by 50% [6]. ...
Article
Text retrieval, analysis, mining and knowledge management have gained a lot of importance in a time when we drown in information but are starved for knowledge. In this paper, we propose a novel index that uses a Text Cube model to store the text information, similar to a data cube in data mining. This model creates a direct index, next-word index and inverted index in a single Cube Index, which is three-dimensional in nature. The dimensions considered are first word, next word and document. The measure of the cube is the frequency of occurrence of the word/next-word pair. The cube index has been tested by modifying the open source of Terrier 2.1.
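The cube described here, with dimensions (first word, next word, document) and pair frequency as the measure, maps naturally onto a counter keyed by those three coordinates. The sketch below assumes plain whitespace tokenisation and ignores the direct and inverted index components of the full model.

```python
from collections import Counter

def build_cube_index(docs):
    """Cube index: (firstword, nextword, doc_id) -> frequency of the pair."""
    cube = Counter()
    for doc_id, text in docs.items():
        words = text.lower().split()
        for a, b in zip(words, words[1:]):
            cube[(a, b, doc_id)] += 1
    return cube

docs = {1: "to be or not to be", 2: "to be is to do"}
cube = build_cube_index(docs)
print(cube[("to", "be", 1)])  # 2
print(cube[("to", "be", 2)])  # 1
```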
Conference Paper
Dealing with big data in computational social networks may require powerful machines, big storage, and high bandwidth, which may seem beyond the capacity of small labs. We demonstrate that researchers with limited resources may still be able to conduct big-data research by focusing on a specific type of data. In particular, we present a system called MPT (microblog processing toolkit) for handling big volume of microblog posts with commodity computers, which can handle tens of millions of micro posts a day. MPT supports fast search on multiple keywords and returns statistical results. We describe in this paper the architecture of MPT for data collection and stat search for returning search results with statistical analysis. We then present different indexing mechanisms and compare them on the micro posts we collected from popular social network sites in China.
Article
Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing word-sequence statistics using inverted indexes requires unreasonable processing time or substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance. In this article, we present and analyze a new index structure designed to improve query efficiency in dependency retrieval models. By adapting a class of (ε, δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate statistics important in term-dependency models with low, probabilistically bounded error rates. The space requirements for the vocabulary of the index are only logarithmically linked to the size of the vocabulary. Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of n-grams consisting of between 1 and 4 words extracted from the GOV2 collection to less than 0.01% of the space requirements of the vocabulary of a full index. We also show that larger n-gram queries can be processed considerably more efficiently than in current alternatives, such as positional and next-word indexes.
Conference Paper
To augment the information retrieval process, a model is proposed to facilitate simple contextual indexing for a large scale of standard text corpora. An Edge Index Graph model is presented, which clusters documents based on a root index and an edge index created. Intelligent information retrieval is possible with the projected system where the process of querying provides proactive help to users through a knowledge base. The query is provided with automatic phrase completion and word suggestions. A thesaurus is used to provide meaningful search of the query. This model can be utilized for document retrieval, clustering, and phrase browsing.
Conference Paper
Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing these types of statistics using standard inverted indexes requires unreasonable processing time or incurs a substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance. In this paper, we present and analyze a new index structure designed to improve query efficiency in term dependency retrieval models, with bounded space requirements. By adapting a class of (ε,δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate various statistics important in term dependency models with low, probabilistically bounded error rates. The space requirements of the sketch index structure are largely independent of the vocabulary size and the number of phrase term dependencies. Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of all n-grams consisting of between 1 and 5 words extracted from the Clueweb-Part-B collection to less than 0.2% of the requirements of an equivalent full index. We show that n-gram queries of 5 words can be processed more efficiently than in current alternatives, such as next-word indexes. We show retrieval using the sketch index to be up to 400 times faster than with positional indexes, and 15 times faster than next-word indexes.
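The (ε, δ)-approximation machinery referred to in these abstracts is in the same family as a count-min sketch: each n-gram is hashed into several small counter rows, and the minimum counter is returned as a probabilistically bounded overestimate. The generic implementation below illustrates that principle only; it is not the authors' index.

```python
import hashlib
import math

class CountMinSketch:
    """Generic count-min sketch: frequency estimates have additive error of
    at most eps * total_count, with probability at least 1 - delta."""

    def __init__(self, eps=0.001, delta=0.01):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.rows = [[0] * self.width for _ in range(self.depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.rows[row][col] += count

    def estimate(self, item):
        return min(self.rows[row][col] for row, col in self._buckets(item))

cms = CountMinSketch(eps=0.01, delta=0.05)
cms.add("nextword index", 5)
cms.add("inverted index", 3)
print(cms.estimate("nextword index"))  # >= 5, and usually exactly 5
```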
Conference Paper
Full-text available
In this paper, we propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency with negligible extra storage cost. In our experimental evaluation, a common phrase index has 5% and 20% improvement in query time for the overall and large queries (queries of long phrases) respectively over an auxiliary nextword index. Moreover, it uses only 1% extra storage cost. Compared with an inverted index, our improvement is 40% and 72% for the overall and large queries respectively.
Conference Paper
In addition to purely occurrence-based relevance models, term proximity has been frequently used to enhance retrieval quality of keyword-oriented retrieval systems. While there have been approaches on effective scoring functions that incorporate proximity, there has not been much work on algorithms or access methods for their efficient evaluation. This paper presents an efficient evaluation framework including a proximity scoring function integrated within a top-k query engine for text retrieval. We propose precomputed and materialized index structures that boost performance. The increased retrieval effectiveness and efficiency of our framework are demonstrated through extensive experiments on a very large text benchmark collection. In combination with static index pruning for the proximity lists, our algorithm achieves an improvement of two orders of magnitude compared to a term-based top-k evaluation, with a significantly improved result quality.
Conference Paper
Along with single-word queries, phrase queries are frequently used in digital libraries. This paper proposes a new partition-based hierarchical index structure for efficient phrase query and a parallel algorithm based on the index structure. In this scheme, a document is divided into several elements. The elements are distributed on several processors. In each processor, a hierarchical inverted index is built, by which single word and phrase queries can be answered efficiently. This index structure and the partition make the postings lists shorter. At the same time, integer compression technique is used more efficiently. Experiments and analysis show that query evaluation time is significantly reduced.
Conference Paper
Full-text available
We describe experiments with proximity-aware ranking functions that use indexing of word pairs. Our goal is to evaluate a method of “mild” pruning of proximity information, which would be appropriate for a moderately loaded retrieval system, e.g., an enterprise search engine. We create an index that includes occurrences of close word pairs, where one of the words is frequent. This allows one to efficiently restore relative positional information for all non-stop words within a certain distance. It is also possible to answer phrase queries promptly. We use two functions to evaluate relevance: a modification of a classic proximity-aware function and a logistic function that includes a linear combination of relevance features.
Conference Paper
This chapter is based on a series of five lectures presented at the ELSNET TesTia Summer School held in Chios, Greece in July, 2000. The material has been updated in August 2001 and, at the suggestion of the students, some explanatory diagrams which were at the time drawn on the whiteboard have been included in more polished form. The scale of electronic document collections has grown dramatically in recent decades. Test collections of the 1960s and 70s (such as Cranfield (9)) contained thousands of documents; the initial TREC collection of 1991 (21) reached almost a million; and the collections indexed by current Web search engines contain approximately a billion. Information Retrieval (IR) has been associated from its beginning with the analogy of "looking for a needle in a haystack." Extending this metaphor to very large scale, we see that the haystack is now big enough to cover Australia! Furthermore, enthusiastic farmers have filled it with every possible type of item, including many which are very similar to needles but which are not what the searcher wanted. Most items include instructions on how to go directly to other items, but often the instructions are misleading or out of date. Now there are not only needles but sewing machines, business cards for tailors, needle exchange services, needle-sharpening services, a sewing technology futures exchange, catalogues of needles available for sale or hire and directories of where to find needles within the haystack. Unfortunately, cunning businesspeople have inserted items which look identical to needles but which turn out to be pictures of naked women or advertisements for get-rich-quick schemes. Millions of searchers arrive each day to search and they do so with the expectation that they will find what they want within less than two seconds. Some of them are very demanding; they are looking for a particular individual needle and they will not be satisfied unless they find it first. Others want to find as many different needles as they can. Some just want to get an overview of the types of needles which are "out there". A few start looking for needles but when they find one, realise that what they really wanted was a can-opener or an air-ticket to Hawaii! This chapter attempts to cover the changes which occur when document collections and searcher populations become very large. It addresses the major engineering challenges imposed by very large scale search (particularly on the World Wide Web), outlines parallel and distributed models and canvasses the problem of how to evaluate the effectiveness of very large scale retrieval.
Conference Paper
A Cube Index Model on a multidimensional text database and an effective study of Online Analytical Processing (OLAP) over such data have been experimented with and found to provide good results. We have proposed a cube index model for unstructured text databases derived from text index structures. There are three kinds of hierarchies on it: the term hierarchy, the dimensional hierarchy, and the document hierarchy, which this paper proposes. Two new operations, scroll up and scroll down, are discussed exclusively for the cube index. The implementation, OLAP execution and query processing on the index are studied. The performance study gives a good guarantee that the model can be used on unstructured text databases.
Article
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
Article
Search engines need to evaluate queries extremely fast, a challenging task given the quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this article we consider how phrase queries can be efficiently supported with low disk overheads. Our previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. Alternatively, special-purpose phrase indexes can be used, but it is not feasible to index all phrases. We propose combinations of nextword indexes and phrase indexes with inverted files as a solution to this problem. Our experiments show that combined use of a partial nextword, partial phrase, and conventional inverted index allows evaluation of phrase queries in a quarter the time required to evaluate such queries with an inverted file alone; the additional space overhead is only 26% of the size of the inverted file.
Article
Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression obtains significant reduction of its original size at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions and pays for the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% allowing one error in the matches.
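Block addressing, as described here, stores block numbers instead of exact positions and pays for the saving with a limited sequential scan of the referenced blocks. A minimal uncompressed sketch, assuming fixed-size blocks of whole words:

```python
from collections import defaultdict

def build_block_index(words, block_size=64):
    """Inverted index whose postings are block numbers, not word positions."""
    index = defaultdict(set)
    for pos, word in enumerate(words):
        index[word].add(pos // block_size)
    return index

def search(words, index, term, block_size=64):
    """Find exact positions of `term` by scanning only the candidate blocks."""
    hits = []
    for block in sorted(index.get(term, ())):
        start = block * block_size
        for offset, word in enumerate(words[start:start + block_size]):
            if word == term:
                hits.append(start + offset)
    return hits

text = ("spam " * 100 + "needle " + "spam " * 100).split()
idx = build_block_index(text, block_size=32)
print(search(text, idx, "needle", block_size=32))  # -> [100]
```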
Article
To augment the information retrieval process, a model is proposed to facilitate simple contextual indexing for a large scale of standard text corpora. An edge index graph model is presented, which clusters documents based on a root index and an edge index created. Intelligent information retrieval is possible with the projected system where the process of querying provides proactive help to users through a knowledge base. The query is provided with automatic phrase completion and word suggestions. A thesaurus is used to provide meaningful search of the query. This model can be utilized for document retrieval, clustering, and phrase browsing.
Conference Paper
Speed and accuracy are the two key factors for a public-oriented integrated information service platform. This paper discusses a mixed segmentation approach based on an inverted index and a compression algorithm. By combining database and file storage, we achieve an index storage scheme that can effectively support simultaneous queries over multiple tables and multiple fields. The design of a search engine interface based on this index storage mode is then introduced in detail. With this design, the search engine answers queries several times faster than commonly used SQL statements, which should make it of practical value to those who are interested in it.
Article
Full-text available
Recent advances in compression and indexing techniques have yielded a qualitative change in the feasibility of large-scale full-text retrieval.
Article
Full-text available
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in sequences of discrete symbols and uses that information for compression. On many practical sequences it performs well at both compression and structural inference, producing comprehensible descriptions of sequence structure in the form of grammar rules. The algorithm can be stated concisely in the form of two constraints on a context-free grammar. Inference is performed incrementally, the structure faithfully representing the input at all times. It can be implemented efficiently and operates in time that is approximately linear in sequence length. Despite its simplicity and efficiency, SEQUITUR succeeds in inferring a range of interesting hierarchical structures from naturally occurring sequences.
Article
Full-text available
this article tends to be answered by making a selection of queries more or less haphazardly to gain a feeling for what the collection contains.
Book
Preface. Part I: Basics. 1. An Overview of Information Retrieval A.F. Smeaton. 2. An Overview of Hypertext M. Agosti. Part II: Text to Hypertext Conversion. 3. Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts G. Salton, J. Allan, C. Buckley, A. Singhal. 4. The Representation and Comparison of Hypertext Structures Using Graphs J. Furner, D. Ellis, P. Willett. Part III: Information Retrieval from Hypertext. 5. Citation Schemes in Hypertext Information Retrieval J. Savoy. 6. Information Modelling and Retrieval in Hypermedia Systems D. Lucarella, A. Zanzi. 7. An Integrated Model for Hypermedia and Information Retrieval Y. Chiaramella, A. Kheirbek. Part IV: Using Visualisation and Structure in Hypertext. 8. 'Why was This Item Retrieved?': New Ways to Explore Retrieval Results U. Thiel, A. Muller. 9. Interactive Dynamic Maps for Visualisation and Retrieval from Hypertext Systems M. Zizi. 10. Knowledge-Based Information Access for Hypermedia Reference Works: Exploring the Spread of the Bauhaus Movement T. Kamps, C. Huser, W. Mohr, I. Schmidt. 10. Integration of Information Retrieval and Hypertext via Structure R. Wilkinson, M. Fuller. Index.
Article
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
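The internal index added to each inverted list is closely related to skip pointers: keep a small table of every k-th entry so a conjunctive query can jump to the right segment instead of decoding the whole list. The toy below works on uncompressed docid lists, so it only illustrates the access pattern, not the compression interplay measured in the article; the list sizes are invented.

```python
import bisect

class SkippablePostings:
    """Postings list with an internal index of every k-th docid, so
    membership tests can jump to the right segment instead of scanning."""

    def __init__(self, docids, skip=128):
        self.docids = sorted(docids)
        self.skip = skip
        # Internal index: the docid that starts each skip-length segment.
        self.sync_points = self.docids[::skip]

    def contains(self, docid):
        seg = bisect.bisect_right(self.sync_points, docid) - 1
        if seg < 0:
            return False
        start = seg * self.skip
        segment = self.docids[start:start + self.skip]
        i = bisect.bisect_left(segment, docid)
        return i < len(segment) and segment[i] == docid

def intersect(short_list, long_postings):
    """Conjunctive query: probe the longer list via its internal index."""
    return [d for d in short_list if long_postings.contains(d)]

long_list = SkippablePostings(range(0, 100_000, 7), skip=256)
print(intersect([14, 15, 700, 9999], long_list))  # -> [14, 700]
```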
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
An interesting feature of compressed inverted lists is that the best compression is achieved for the longest lists, that is, the most frequent terms. In the limit---which, in the case of text indexing, is a term such as "the" that occurs in almost every record---at most one bit per record is required. There is thus no particular need to eliminate common terms from the index: the decision as to whether or not to use the inverted lists for these terms to evaluate a query can be made, as it should be, at query evaluation time.
Article
To provide keyword-based access to a large text file it is usually necessary to invert the file and create an inverted index that stores, for each word in the file, the paragraph or sentence numbers in which that word occurs. Inverting a large file using traditional techniques may take as much temporary disk space as is occupied by the file itself, and consume a great deal of cpu time. Here we describe an alternative technique for inverting large text files that requires only a nominal amount of temporary disk storage, instead building the inverted index in compressed form in main memory. A program implementing this approach has created a paragraph level index of a 132 Mbyte collection of legal documents using 13 Mbyte of main memory; 500 Kbyte of temporary disk storage; and approximately 45 cpu-minutes on a Sun SPARCstation 2.
Article
Countable prefix codeword sets are constructed with the universal property that assigning messages in order of decreasing probability to codewords in order of increasing length gives an average code-word length, for any message set with positive entropy, less than a constant times the optimal average codeword length for that source. Some of the sets also have the asymptotically optimal property that the ratio of average codeword length to entropy approaches one uniformly as entropy increases. An application is the construction of a uniformly universal sequence of codes for countable memoryless sources, in which the n th code has a ratio of average codeword length to source rate bounded by a function of n for all sources with positive rate; the bound is less than two for n = 0 and approaches one as n increases.
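These universal codes (the Elias family) are what text indexes typically use to compress the document gaps in inverted lists; a minimal Elias gamma coder makes the idea concrete. The gap values in the example are invented.

```python
def elias_gamma_encode(n):
    """Elias gamma code for a positive integer: unary length prefix, then
    the number's binary representation without its leading 1 bit."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]                        # e.g. 9 -> '1001'
    return "0" * (len(binary) - 1) + binary    # e.g. 9 -> '0001001'

def elias_gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

gaps = [1, 2, 9, 5]                            # d-gaps of a postings list
encoded = "".join(elias_gamma_encode(g) for g in gaps)
print(encoded)                                 # '1' + '010' + '0001001' + '00101'
print(elias_gamma_decode(encoded))             # [1, 2, 9, 5]
```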
Article
Ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. We propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. The principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. We also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed.
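Frequency-sorted lists put the highest within-document frequencies first, so evaluation can stop after a fixed postings budget with little loss in effectiveness. The reduced sketch below uses a plain summed-tf score and an arbitrary budget purely for illustration; it is not the paper's evaluation algorithm.

```python
import heapq
from collections import defaultdict

def rank_top_k(freq_sorted_lists, query_terms, k=10, postings_budget=1000):
    """Approximate ranking over frequency-sorted inverted lists.

    Each list holds (within_doc_freq, doc_id) pairs sorted by decreasing
    frequency; processing stops after a fixed postings budget, so the
    highest-impact postings are seen first.
    """
    accumulators = defaultdict(float)
    consumed = 0
    for term in query_terms:
        for freq, doc_id in freq_sorted_lists.get(term, []):
            accumulators[doc_id] += freq        # toy score: summed tf
            consumed += 1
            if consumed >= postings_budget:
                break
        if consumed >= postings_budget:
            break
    return heapq.nlargest(k, accumulators.items(), key=lambda kv: kv[1])

lists = {
    "phrase": [(9, 3), (4, 1), (1, 7)],
    "index":  [(6, 3), (5, 2), (2, 1)],
}
print(rank_top_k(lists, ["phrase", "index"], k=2, postings_budget=4))
# -> [(3, 15.0), (1, 4.0)]
```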
Article
Indexing and retrieval techniques for large text databases are well developed, but most of the techniques developed to date assume that the text to be indexed has little or no structure. With the growth in the use of sophisticated markup languages for text, a database system for structured documents should use, not just document content, but structural information and attributes, and should support queries on content, structure and attributes. In this paper we review and compare two recent approaches for accessing document collections. For one of the approaches, position-based indexing, queries are resolved by manipulating ranges of word offsets while for the other, based on a path model, the position of a word is represented in terms of the structural components that enclose it. The former allows slightly smaller indexes; the latter allows more efficient query evaluation.
Article
INQUERY is a probabilistic information retrieval system based upon a Bayesian inference network model. This paper describes recent improvements to the system as a result of participation in the TIPSTER project and the TREC-2 conference. Improvements include transforming forms-based specifications of information needs into complex structured queries, automatic query expansion, automatic recognition of features in documents, relevance feedback, and simulated document routing. Experiments with one and two gigabyte document collections are also described. To appear in Information Processing and Management. 1 Introduction The effectiveness of an information retrieval (IR) system depends upon representation and matching. The system must represent the information need, it must represent the documents, and it must determine how well the information need matches each document. Our approach has been to use improved representations of document text and queries in the framework of the infe...
Article
Automatic query expansion has long been suggested as a technique for dealing with the fundamental issue of word mismatch in information retrieval. A number of approaches to expansion have been studied and, more recently, attention has focused on techniques that analyze the corpus to discover word relationships (global techniques) and those that analyze documents retrieved by the initial query ( local feedback). In this paper, we compare the effectiveness of these approaches and show that, although global analysis has some advantages, local analysis is generally more effective. We also show that using global analysis techniques, such as word context and phrase structure, on the local set of documents produces results that are both more effective and more predictable than simple local feedback. 1 Introduction The problem of word mismatch is fundamental to information retrieval. Simply stated, it means that people often use different words to describe concepts in their queries than auth...
[Salton, 1989] Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[Witten et al., 1994] Witten, I.H., Moffat, A., and Bell, T.C. (1994). Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York.
[Dennis et al., 1998] Dennis, S., McArthur, R., and Bruza, P. (1998). Searching the world wide web made easy? The cognitive load imposed by query refinement mechanisms. In Kay, J. and Milosavljevic, M., editors, Proc. Australian Document Computing Conference, Sydney, Australia. University of Sydney. To appear.
[Nevill-Manning et al., 1998] Nevill-Manning, C.G., Witten, I.H., and Paynter, G.W. (1998). Browsing in digital libraries: a phrase-based approach. In Allen, R.B. and Rasmussen, E., editors, Proc. ACM Digital Libraries, pages 230-236, Philadelphia, Pennsylvania.
[Persin et al., 1996] Persin, M., Zobel, J., and Sacks-Davis, R. (1996). Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10):749-764.
[Xu and Croft, 1996] Xu, J. and Croft, W.B. (1996). Query expansion using local and global document analysis. In Frei, H.-P., Harman, D., Schäuble, P., and Wilkinson, R., editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 4-11, Zurich, Switzerland.
[Zobel et al.] Zobel, J., Moffat, A., and Ramamohanarao, K. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems. To appear.