Conference Paper

Compaction techniques for nextword indexes


Abstract

Not Available


... However, this technique requires an additional 50% to 60% of the size of an inverted index. This can be reduced to about 40% to 50% using the compression techniques of [7], at little cost in querying speed. To obtain further reductions in memory consumption, partial nextword indexes were introduced in [8]. ...
Conference Paper
We present a method for optimizing phrase search based on inverted indexes. Our approach adds selected (two-term) phrases to an existing index. Whereas competing approaches are often based on the analysis of query logs, our approach works out of the box and uses only the information contained in the index. Also, our method is competitive in terms of query performance and can even improve on other approaches for difficult queries. Moreover, our approach gives performance guarantees for arbitrary queries. Further, we propose using a phrase index as a substitute for the positional index of an in-memory search engine working with short documents. We support our conclusions with experiments using a high-performance main-memory search engine. We also give evidence that classical disk based systems can profit from our approach.
... Prior research has examined how to efficiently index text documents and resolve text queries: for example, with inverted indices [3], signature files [8], or sparse matrices [9]. Further improvements to these index structures have been made for handling special query types [10] [11] [12] and reducing I/O overhead [13] [14] [15]. While much work addresses this index-level view of search performance, little work addresses performance at the architectural level of a complete search service. ...
Conference Paper
Prior research into search system scalability has primarily addressed query processing efficiency [1, 2, 3] or indexing efficiency [3], or has presented some arbitrary system architecture [4]. Little work has introduced any formal theoretical framework for evaluating architectures with regard to specific operational requirements, or for comparing architectures beyond simple timings [5] or basic simulations [6, 7]. In this paper, we present a framework based upon queuing network theory for analyzing search systems in terms of operational requirements. We use response time, throughput, and utilization as the key operational characteristics for evaluating performance. Within this framework, we present a scalability strategy that combines index partitioning and index replication to satisfy a given set of requirements.
... Though these techniques improve the speed, the space complexity increases. Therefore, many researchers concentrated on compression techniques to reduce storage space [1,2,23]. To reduce storage space further, partial next word indexes were proposed [3]. ...
Article
Full-text available
Text documents are significant arrangements of various words, while images are significant arrangements of various pixels/features. In addition, text and image data share a similar semantic structural pattern. With reference to this research, the feature pair is defined as a pair of adjacent image features. The innovative feature pair index graph (FPIG) is constructed from the unique feature pair selected, which is constructed using an inverted index structure. The constructed FPIG is helpful in clustering, classifying and retrieving the image data. The proposed FPIG method is validated against the traditional KMeans++, KMeans and Farthest First cluster methods which have the serious drawback of initial centroid selection and local optima. The FPIG method is analyzed using Iris flower image data, and the analysis yields 88% better results than Farthest First and 28.97% better results than conventional KMeans in terms of sum of squared errors. The paper also discusses the scope for further research in the proposed methodology.
... The major drawback of using a nextword index is its large space consumption which is around 60% of the size of the indexed data. With careful optimization techniques (Bahle, Williams, & Zobel, 2001), the size of a nextword index file can be reduced to 49% of the indexed data. An auxiliary nextword index proposed by Bahle, Williams, and Zobel (2002) further reduces the space overhead to only 10% of the size of the inverted index file. ...
Article
Full-text available
In this paper, we propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency with modest extra storage cost. Further improvement in efficiency can be attained by implementing our index according to our observation of the dynamic nature of common word set. In experimental evaluation, a common phrase index using 255 common words has an improvement of about 11% and 62% in query time for the overall and large queries (queries of long phrases) respectively over an auxiliary nextword index. Moreover, it has only about 19% extra storage cost. Compared with an inverted index, our improvement is about 72% and 87% for the overall and large queries respectively. We also propose to implement a common phrase index with dynamic update feature. Our experiments show that more improvement in time efficiency can be achieved.
... This search method, however, is inefficient and does not meet our needs. To speed up the search process, we developed an indexing system based on the nextword indexing scheme [5]. ...
Article
Full-text available
Dealing with big data in computational social networks may require powerful machines, big storage, and high bandwidth, which may seem beyond the capacity of small labs. We demonstrate that researchers with limited resources may still be able to conduct big-data research by focusing on a specific type of data. In particular, we present a system called MPT (Microblog Processing Toolkit) for handling big volume of microblog posts with commodity computers, which can handle tens of millions of micro posts a day. MPT supports fast search on multiple keywords and returns statistical results. We describe in this paper the architecture of MPT for data collection and phrase search for returning search results with statistical analysis. We then present different indexing mechanisms and compare them on the microblog posts we collected from popular online social network sites in mainland China.
... However, complete phrase indexes that cover all phrases in a given text are likely to be prohibitive in size and construction time, while storing only selected phrases creates the problem of how to effectively identify useful phrases in the absence of queries. In recent work, we proposed nextword indexes as a way of supporting phrase queries and phrase browsing [Bahle et al. 2001a; 2001b]. In a nextword index, for each index term or firstword there is a list of the words or nextwords that follow that term, together with the documents and word positions at which the firstword and nextword occur as a pair. ...
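The nextword structure described in this excerpt can be sketched as a nested mapping from firstword to nextword to a postings list of (document, position) pairs. This is an illustrative toy, not the authors' on-disk implementation; all names here are invented for the example.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """Toy nextword index: firstword -> nextword -> [(doc_id, pos)].

    `docs` maps a document id to its list of tokens. For every adjacent
    word pair, we record the document and the position of the firstword
    under that firstword's entry for the nextword.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos in range(len(tokens) - 1):
            index[tokens[pos]][tokens[pos + 1]].append((doc_id, pos))
    return index

docs = {1: "the quick brown fox".split(),
        2: "the brown dog saw the quick cat".split()}
idx = build_nextword_index(docs)
# Postings for the word pair ("the", "quick"):
print(idx["the"]["quick"])   # [(1, 0), (2, 4)]
```

A real nextword index stores the nextword lists and postings in compressed, sorted form on disk, which is where the space overheads discussed above come from.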
Article
Search engines need to evaluate queries extremely fast, a challenging task given the quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this article we consider how phrase queries can be efficiently supported with low disk overheads. Our previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. Alternatively, special-purpose phrase indexes can be used, but it is not feasible to index all phrases. We propose combinations of nextword indexes and phrase indexes with inverted files as a solution to this problem. Our experiments show that combined use of a partial nextword, partial phrase, and conventional inverted index allows evaluation of phrase queries in a quarter the time required to evaluate such queries with an inverted file alone; the additional space overhead is only 26% of the size of the inverted file.
... Another solution is to index phrases directly, but the set of word pairs in a text collection is large and an index on such phrases difficult to manage. In recent work, nextword indexes were proposed as a way of supporting phrase queries and phrase browsing [2, 3, 15]. In a nextword index, for each index term or firstword there is a list of the words or nextwords that follow that term, together with the documents and word positions at which the firstword and nextword occur as a pair. ...
Conference Paper
Search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this paper we consider how phrase queries can be efficiently supported with low disk overheads. Previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. We propose a combination of nextword indexes with inverted files as a solution to this problem. Our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. Further time savings are available with only slight increases in disk requirements.
... A next word index is a mapping from pairs of terms to lists. A next word index can be used in a straightforward way to evaluate phrase queries [Bahle 2001a]. Figure 3 shows an example of a NextWord index. ...
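The straightforward phrase evaluation mentioned in this excerpt can be sketched as follows: look up the postings for the first word pair, then repeatedly filter against the postings of each subsequent pair, requiring positions to advance by one. This is a minimal in-memory sketch under invented names, not the cited system.

```python
from collections import defaultdict

def build_nextword_index(docs):
    # firstword -> nextword -> set of (doc_id, pos) pairs
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, tokens in docs.items():
        for pos in range(len(tokens) - 1):
            index[tokens[pos]][tokens[pos + 1]].add((doc_id, pos))
    return index

def phrase_query(index, phrase):
    """Find documents containing `phrase` by chaining pair postings:
    an occurrence of words i, i+1 at position p must be followed by
    an occurrence of words i+1, i+2 at position p + 1."""
    words = phrase.split()
    if len(words) < 2:
        raise ValueError("nextword indexes answer queries of two or more words")
    matches = index[words[0]][words[1]]          # (doc, pos) of the first pair
    for i in range(1, len(words) - 1):
        nxt = index[words[i]][words[i + 1]]
        matches = {(d, p) for (d, p) in matches if (d, p + i) in nxt}
    return sorted({d for (d, _) in matches})

docs = {1: "big data in social networks".split(),
        2: "social data in big networks".split()}
idx = build_nextword_index(docs)
print(phrase_query(idx, "data in big"))   # [2]
```

Because each lookup goes directly to the postings of a word pair, no positional intersection over long single-word lists is needed, which is the source of the speedups reported above.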
... Preliminary versions of some results or discussions in this thesis have been previously published. Chapters 5, 6, and 7 contain material that appeared in [Bahle et al., 2001b], [Bahle et al., 2001a], and [Bahle et al., 2002] ...
... Different heuristics are proposed in this respect, such as maintaining the inverted lists only for popular phrases, or maintaining inverted lists of all phrases up to some fixed number (say h) of words. Another approach is called "next-word index" [36, 3, 4, 37], such that corresponding to each term w, a list of all the terms which occur immediately after w is maintained. This approach will double the space, but it can support searching of any phrase with two words efficiently. ...
Conference Paper
Full-text available
Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index has a shortcoming, in that only predefined pattern queries can be supported efficiently. In terms of string documents where word boundaries are undefined, if we have to index all the substrings of a given document, then the storage quickly becomes quadratic in the data size. Also, if we want to apply the same type of indexes for querying phrases or sequence of words, then the inverted index will end up storing redundant information. In this paper, we show the first set of inverted indexes which work naturally for strings as well as phrase searching. The central idea is to exclude document d in the inverted list of a string P if every occurrence of P in d is subsumed by another string of which P is a prefix. With this we show that our space utilization is close to the optimal. Techniques from succinct data structures are deployed to achieve compression while allowing fast access in terms of frequency and document id based retrieval. Compression and speed trade-offs are evaluated for different variants of the proposed index. For phrase searching, we show that our indexes compare favorably against a typical inverted index deploying position-wise intersections. We also show efficient top-k based retrieval under relevance metrics like frequency and tf-idf.
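The baseline structure this abstract builds on — one list of documents per unique word, intersected for multi-word queries — can be sketched in a few lines. This shows only the basic document-level index, not the succinct string/phrase variant the paper proposes; all names are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Basic inverted index: word -> sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            postings[word].add(doc_id)
    return {w: sorted(ids) for w, ids in postings.items()}

def conjunctive_query(index, words):
    """Documents containing all of `words` (posting-list intersection)."""
    result = None
    for w in words:
        ids = set(index.get(w, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

docs = {1: "compression of inverted lists",
        2: "inverted index compression",
        3: "suffix trees for strings"}
idx = build_inverted_index(docs)
print(conjunctive_query(idx, ["inverted", "compression"]))  # [1, 2]
```

The shortcoming the abstract identifies follows directly from this shape: the index is keyed on predefined tokens, so it cannot answer arbitrary substring queries, and extending the keys to all substrings or phrases blows up the storage.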
... This was recently used for document clustering, entity finding or document classification in small collections of documents, but it has not been used in large scale webpage indexing. Bahle et al. (2001) showed that phrasal queries can be rapidly evaluated using next-word indexes, but the indexes are twice as large as conventional inverted files. They combined the use of an auxiliary next-word index and a conventional inverted file, and their space overhead was only 10% of the size of the inverted file. ...
Thesis
Full-text available
The Web is comprised of a vast quantity of text. Modern search engines struggle to index it independent of the structure of queries and type of Web data, and commonly use indexing based on the Web's graph structure to identify high-quality relevant pages. However, despite the apparent widespread use of these algorithms, Web indexing based on human feedback and document content is controversial. There are many fundamental questions that need to be addressed, including: How many types of domains/websites are there in the Web? What type of data is in each type of domain? For each type, which segments/HTML fields in the documents are most useful? What are the relationships between the segments? How can web content be indexed efficiently in all forms of document configurations? Our investigation of these questions has led to a novel way to use Wikipedia to find the relationships between the query structures and document configurations throughout the document indexing process and to use them to build an efficient index that allows fast indexing and searching, and optimizes the retrieval of highly relevant results. We consider the top page on the ranked list to be highly important in determining the types of queries. Our aim is to design a powerful search engine with a strong focus on how to make the first page highly relevant to the user, and on how to retrieve other pages based on that first page. Through processing the user query using the Wikipedia index and determining the type of the query, our approach could trace the path of a query in our index, and retrieve specific results for each type. We use two kinds of data to increase the relevancy and efficiency of the ranked results: offline and real-time. Traditional search engines find it difficult to use these two kinds of data together, because building a real-time index from social data and integrating it with the index for the offline data is difficult in a traditional distributed index.
As a source of offline data, we use data from the Text Retrieval Conference (TREC) evaluation campaign. The web track at TREC offers researchers the chance to investigate different retrieval approaches for web indexing and searching. The crawled offline dataset makes it possible to design powerful search engines that extend current methods, and to evaluate and compare them. We propose a new indexing method, based on the structures of the queries and the content of documents. Our search engine uses a core index for offline data and a hash index for real-time data, which leads to improved performance. The TREC Web track evaluation of our experiments showed that our approach can be successfully employed for different types of queries. We evaluated our search engine on different sets of queries from the TREC 2010, 2011 and 2012 Web tracks. Our approach achieved very good results on the TREC 2010 training queries. On the TREC 2011 testing queries, our approach was one of the six best compared to all other approaches (including those that used a very large corpus of 500 million documents), and it was second best when compared to approaches that used only part of the corpus (50 million documents), as ours did. On the TREC 2012 testing queries, our approach was second best compared to all the approaches, and first compared only to systems that used the subset of 50 million documents.
... The major drawback of using a nextword index is its large space consumption, which is around 60% of the size of the indexed data. With careful optimization techniques [3], the size of a nextword index file can be reduced to 49% of the indexed data. An auxiliary nextword index proposed by Bahle et al. ...
Conference Paper
In this paper, we propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency with modest extra storage cost. Further improvement in efficiency can be attained by implementing our index according to our observation of the dynamic nature of common word set. In experimental evaluation, a common phrase index using 255 common words has an improvement of about 11% and 62% in query time for the overall and large queries (queries of long phrases) respectively over an auxiliary nextword index. Moreover, it has only about 19% extra storage cost. Compared with an inverted index, our improvement is about 72% and 87% for the overall and large queries respectively. We also propose to implement a common phrase index with dynamic update feature. Our experiments show that more improvement in time efficiency can be achieved.
Conference Paper
Along with single-word queries, phrase queries are frequently used in digital libraries. This paper proposes a new partition-based hierarchical index structure for efficient phrase queries, and a parallel algorithm based on the index structure. In this scheme, a document is divided into several elements. The elements are distributed across several processors. In each processor, a hierarchical inverted index is built, by which single-word and phrase queries can be answered efficiently. This index structure and the partitioning make the postings lists shorter. At the same time, integer compression techniques are used more efficiently. Experiments and analysis show that query evaluation time is significantly reduced.
Article
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once—usually as a block size, but sometimes as a direct megabyte limit. In this work we consider the Re-Pair mechanism of Larsson and Moffat (2000), which processes large messages as disjoint blocks to limit memory consumption. We show that the blocks emitted by Re-Pair can be postprocessed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.
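The core Re-Pair step this abstract refers to — repeatedly replacing the most frequent adjacent symbol pair with a fresh grammar symbol — can be sketched for a single block. This is a minimal in-memory illustration under invented names; the paper's block-wise processing and block merging are not shown.

```python
from collections import Counter

def repair(seq):
    """Minimal single-block Re-Pair sketch: repeatedly replace the most
    frequent adjacent symbol pair with a fresh nonterminal until no pair
    occurs more than once. Returns the compressed sequence and the rules."""
    seq = list(seq)
    rules = {}
    next_symbol = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        (pair, count) = pairs.most_common(1)[0] if pairs else (None, 0)
        if count < 2:
            return seq, rules
        nt = f"R{next_symbol}"        # fresh nonterminal for this pair
        next_symbol += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):           # left-to-right, non-overlapping replace
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out

compressed, rules = repair("abababab")
print(compressed, rules)   # ['R1', 'R1'] {'R0': ('a', 'b'), 'R1': ('R0', 'R0')}
```

Because each block is compressed independently, two blocks can end up with different rules for the same pair — which is exactly the redundancy the block-merging postprocessing described above recovers.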
Conference Paper
Full-text available
In this paper, we propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency with negligible extra storage cost. In our experimental evaluation, a common phrase index has 5% and 20% improvement in query time for the overall and large queries (queries of long phrases) respectively over an auxiliary nextword index. Moreover, it uses only 1% extra storage cost. Compared with an inverted index, our improvement is 40% and 72% for the overall and large queries respectively.
Conference Paper
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once. In this work we consider the Re-Pair mechanism of Larsson and Moffat [2000], which processes large messages as disjoint blocks. We show that the blocks emitted by Re-Pair can be post-processed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.
Conference Paper
Dealing with big data in computational social networks may require powerful machines, big storage, and high bandwidth, which may seem beyond the capacity of small labs. We demonstrate that researchers with limited resources may still be able to conduct big-data research by focusing on a specific type of data. In particular, we present a system called MPT (microblog processing toolkit) for handling big volume of microblog posts with commodity computers, which can handle tens of millions of micro posts a day. MPT supports fast search on multiple keywords and returns statistical results. We describe in this paper the architecture of MPT for data collection and stat search for returning search results with statistical analysis. We then present different indexing mechanisms and compare them on the micro posts we collected from popular social network sites in China.
Article
Full-text available
Indexing is a crucial technique for dealing with the massive amount of data present on the web. In our third participation in the web track at TREC 2012, we explore the idea of building an efficient query-based indexing system over Web page collection. Our prototype explores the trends in user queries and consequently indexes texts using particular attributes available in the documents. This paper provides an in-depth description of our approach for indexing web documents efficiently; that is, topics available in the web documents are discovered with the assistance of knowledge available in Wikipedia. The well-defined articles in Wikipedia are shown to be valuable as a training set when indexing Webpages. Our complex index structure also records information from titles and urls, and pays attention to web domains. Our approach is designed to close the gaps in our approaches from the previous two years, for some queries. Our framework is able to efficiently index the 50 million pages available in the subset B of the ClueWeb09 collection. Our preliminary experiments on the TREC 2012 testing queries showed that our indexing scheme is robust and efficient for both indexing and retrieving relevant web pages, for both the ad-hoc and diversity task.
Conference Paper
Full-text available
Most search systems for querying large document collections, e.g., Web search engines, are based on well-understood information retrieval principles. These systems are both efficient and effective in finding answers to many user information needs, expressed through informal ranked or structured Boolean queries. Phrase querying and browsing are additional techniques that can augment or replace conventional querying tools. The authors propose optimisations for phrase querying with a nextword index, an efficient structure for phrase-based searching. We show that careful consideration of which search terms are evaluated in a query plan, and optimisation of the order of evaluation of the plan, can reduce query evaluation costs by more than a factor of five. We conclude that, for phrase querying and browsing with nextword indexes, an ordered query plan should be used for all browsing and querying. Moreover, we show that optimised phrase querying is practical on large text collections.
Article
Full-text available
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in sequences of discrete symbols and uses that information for compression. On many practical sequences it performs well at both compression and structural inference, producing comprehensible descriptions of sequence structure in the form of grammar rules. The algorithm can be stated concisely in the form of two constraints on a context-free grammar. Inference is performed incrementally, the structure faithfully representing the input at all times. It can be implemented efficiently and operates in time that is approximately linear in sequence length. Despite its simplicity and efficiency, SEQUITUR succeeds in inferring a range of interesting hierarchical structures from naturally occurring sequences.
Article
Full-text available
Often queries to internet search engines consist of one or two terms. As a consequence, the effectiveness of the retrieval suffers. This paper describes an internet search engine that helps the user formulate their query by a process of navigation through a structured, automatically constructed, information space called a hyperindex. In the first part of this paper, the logs of an internet search engine were analyzed to determine the proportions with which different types of query transformation occur. It was found that the primary transformation type was repetition of the previous query. Users also substitute, add and delete terms from a previous query and with lower frequency split compound terms, make changes to spelling, punctuation, and case and use derivative forms of words and abbreviations. The second part of the paper details the hyperindex - which aids the user in query term addition and deletion. The architecture of a hyperindex-based internet search engine is presented. Som...
Article
Full-text available
this article tends to be answered by making a selection of queries more or less haphazardly to gain a feeling for what the collection contains.
Article
In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides an insight into the public practices and choices in Web searching.
Conference Paper
Abstract: In line with the wishes of last year's participants, this year's VLC track was essentially a re-run of last year's with a five-fold increase in data size. The data used was a completely new 100-gigabyte collection of Web documents (the VLC2) whose characteristics are presented here. This time, two orders of magnitude of scale-up were investigated using 1% and 10% samples as well as the full collection. Six groups managed to complete the full VLC task, of which five completed last year's track. An overview is given of the track participants, the methods used and the results obtained. One group of participants, using hardware costing less than $US10,000, has shown that a hundred-gigabyte collection can be indexed in less than ten hours and that quite good rankings (better than several well-known search engines) can be produced from queries processed in less than one second.
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
Browsing accounts for much of people's interaction with digital libraries, but it is poorly supported by standard search engines. Conventional systems often operate at the wrong level, indexing words when people think in terms of topics, and returning documents when people want a broader view. As a result, users cannot easily determine what is in a collection, how well a particular topic is covered, or what kinds of queries will provide useful results. We have built a new kind of search engine, Keyphind, that is explicitly designed to support browsing. Automatically extracted keyphrases form the basic unit of both indexing and presentation, allowing users to interact with the collection at the level of topics and subjects rather than words and documents. The keyphrase index also provides a simple mechanism for clustering documents, refining queries, and previewing results. We compared Keyphind to a traditional query engine in a small usability study. Users reported that certain kinds of browsing tasks were much easier with the new interface, indicating that a keyphrase index would be a useful supplement to existing search tools. This is an author’s version of an article published in the journal: Decision Support Systems. © 1999 Elsevier Science B.V.
Article
Countable prefix codeword sets are constructed with the universal property that assigning messages in order of decreasing probability to codewords in order of increasing length gives an average codeword length, for any message set with positive entropy, less than a constant times the optimal average codeword length for that source. Some of the sets also have the asymptotically optimal property that the ratio of average codeword length to entropy approaches one uniformly as entropy increases. An application is the construction of a uniformly universal sequence of codes for countable memoryless sources, in which the n-th code has a ratio of average codeword length to source rate bounded by a function of n for all sources with positive rate; the bound is less than two for n = 0 and approaches one as n increases.
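The Elias gamma code is one concrete instance of the universal prefix codes this abstract describes: smaller integers (standing in for more probable messages) receive shorter codewords. A minimal sketch, with invented function names:

```python
def elias_gamma_encode(n):
    """Elias gamma code for n >= 1: emit (bit-length of n minus one)
    zeros, then the binary representation of n itself."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers")
    b = bin(n)[2:]                    # binary representation, no '0b' prefix
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    """Decode one gamma codeword from the front of the bit string;
    returns (value, remaining bits)."""
    zeros = 0
    while bits[zeros] == "0":         # count leading zeros = extra bit length
        zeros += 1
    value = int(bits[zeros:2 * zeros + 1], 2)
    return value, bits[2 * zeros + 1:]

print(elias_gamma_encode(9))          # "0001001"
print(elias_gamma_decode("0001001"))  # (9, '')
```

Codes of this family are widely used to compress the document gaps in inverted and nextword postings lists, which connects this entry to the compaction techniques discussed above.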
Article
Text retrieval systems are used to fetch documents from large text collections, using queries consisting of words and word sequences.
In E. Voorhees and D.K. Harman, editors, Proc. Text Retrieval Conference (TREC), pages 91–104, Washington, 1999. National Institute of Standards and Technology Special Publication 500-242.