Article

Fast phrase querying with combined indexes

Authors: Hugh E. Williams, Justin Zobel, Dirk Bahle

Abstract

Search engines need to evaluate queries extremely fast, a challenging task given the quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this article we consider how phrase queries can be efficiently supported with low disk overheads. Our previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. Alternatively, special-purpose phrase indexes can be used, but it is not feasible to index all phrases. We propose combinations of nextword indexes and phrase indexes with inverted files as a solution to this problem. Our experiments show that combined use of a partial nextword, partial phrase, and conventional inverted index allows evaluation of phrase queries in a quarter the time required to evaluate such queries with an inverted file alone; the additional space overhead is only 26% of the size of the inverted file.
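To make the combination concrete, here is a minimal Python sketch of three-way phrase-query routing. The data structures and the routing rule are simplified assumptions based on the abstract (toy in-memory dictionaries rather than the authors' on-disk indexes): the phrase index answers fully indexed phrases, the nextword index answers indexed word pairs, and positional inverted-list intersection handles everything else.

```python
# Illustrative sketch of three-way combined phrase-query evaluation.
# All structures are toy stand-ins, not the authors' implementation.

# phrase_index: whole indexed phrases -> sorted list of doc ids
phrase_index = {("new", "york", "times"): [2, 7, 9]}

# nextword_index: (firstword, nextword) pairs -> {doc id: positions of firstword}
nextword_index = {("new", "york"): {2: [14], 7: [3, 40], 9: [8]}}

# inverted_index: word -> {doc id: sorted positions}
inverted_index = {
    "new":   {2: [14], 5: [1], 7: [3, 40], 9: [8]},
    "york":  {2: [15], 7: [4, 41], 9: [9]},
    "times": {2: [16], 7: [5], 9: [10]},
}

def intersect_positional(words):
    """Intersect positional postings: doc ids where words occur consecutively."""
    result = []
    docs = set(inverted_index[words[0]])
    for w in words[1:]:
        docs &= set(inverted_index.get(w, {}))
    for d in sorted(docs):
        starts = set(inverted_index[words[0]][d])
        for i, w in enumerate(words[1:], start=1):
            starts &= {p - i for p in inverted_index[w][d]}
        if starts:
            result.append(d)
    return result

def phrase_query(words):
    words = tuple(words)
    if words in phrase_index:                        # cheapest: whole phrase indexed
        return phrase_index[words]
    if len(words) == 2 and words in nextword_index:  # indexed firstword-nextword pair
        return sorted(nextword_index[words])
    return intersect_positional(list(words))         # fall back to inverted lists

print(phrase_query(["new", "york", "times"]))  # -> [2, 7, 9]
```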


... Although both types of inverted index support phrase queries, they require extra space in memory. There has been a lot of work on the efficient processing of phrase queries [11], but very little work has been done on reducing the index size. Our work primarily focuses on this gap. ...
... We found that the majority of the solutions address pruning for single-term inverted indexes. Less attention has been given to the pruning of bi-word and positional indexes [11] [5]. ...
... Static index pruning for phrase queries has been addressed by Moura et al. [11], who proposed a new pruning method that they call the locality-based pruning method. ...
Article
Full-text available
This paper proposes a static index pruning method for phrase queries based on term distance. It models the distance between terms within a document as a measure of term co-occurrence. The standard score is then used to prune non-relevant postings related to phrase queries while assuring no change in the top-k results. The proposed method creates an effective pruned inverted index. Analysis of the results shows that this method is correlated with term proximity based on term frequency values as well as term informativeness. With experiments on a number of different FIRE collections, it is shown that the model is comparable with the existing static pruning method, which only works well for single-term queries. It is an advantage of the proposed approach that the pruning model is applicable to a standard inverted index for phrase queries.
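As a rough illustration of pruning by term distance (a sketch under our own assumptions: the minimum pairwise distance as the distance measure and a fixed z-score threshold; the paper's actual scoring is not reproduced here):

```python
import statistics

# Toy postings: doc_id -> positions, for two terms whose co-occurrence we test.
postings = {
    "data":   {1: [3, 20], 2: [5], 3: [50]},
    "mining": {1: [4, 31], 2: [9], 3: [90]},
}

def min_pair_distance(pos_a, pos_b):
    """Smallest distance between any occurrence of the two terms."""
    return min(abs(a - b) for a in pos_a for b in pos_b)

# Distance of "data".."mining" in each document containing both terms.
dists = {d: min_pair_distance(postings["data"][d], postings["mining"][d])
         for d in postings["data"].keys() & postings["mining"].keys()}

mu = statistics.mean(dists.values())
sigma = statistics.stdev(dists.values())

Z_MAX = 1.0  # assumed pruning threshold
kept = {d for d, dist in dists.items() if (dist - mu) / sigma <= Z_MAX}
print(kept)  # documents whose postings survive pruning for this term pair
```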
... In our approach, we include information about all words in the indexes. We cannot exclude a word from the search because a high-frequency word can have a specific meaning in the context of a specific query [10,17]; therefore, excluding some words from consideration can cause search quality degradation or unpredictable effects [17]. Let us consider the query example "who are you who". ...
... 3) Early termination approaches [1,4]. 4) Next-word and partial phrase auxiliary indexes for an exact phrase search [17,2]. ...
Preprint
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. A search algorithm for the case when the query consists of high-frequency words is discussed. In addition, we present results of experiments with different values of MaxDistance to evaluate how the search speed depends on the value of MaxDistance. These results show that the average query execution time with our indexes is 94.7-45.9 times (depending on the value of MaxDistance) less than that with standard inverted files when queries that contain high-frequency words are evaluated. This is a pre-print of a contribution published in Pinelas S., Kim A., Vlasov V. (eds) Mathematical Analysis With Applications. CONCORD-90 2018. Springer Proceedings in Mathematics & Statistics, vol 318, published by Springer, Cham. The final authenticated version is available online at: https://doi.org/10.1007/978-3-030-42176-2_37
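A minimal sketch of building such an additional index, assuming a toy (word, neighbor) key layout of our own rather than the authors' index format:

```python
from collections import defaultdict

MAX_DISTANCE = 5  # the MaxDistance parameter

def build_neighbor_index(doc_id, words):
    """For every word occurrence, record nearby words within MAX_DISTANCE.
    Key layout (word, neighbor) -> [(doc_id, pos, neighbor_pos), ...] is an
    assumption for illustration, not the authors' on-disk format."""
    index = defaultdict(list)
    for i, w in enumerate(words):
        lo = max(0, i - MAX_DISTANCE)
        hi = min(len(words), i + MAX_DISTANCE + 1)
        for j in range(lo, hi):
            if j != i:
                index[(w, words[j])].append((doc_id, i, j))
    return index

idx = build_neighbor_index(1, "to be or not to be that is the question".split())
print(idx[("to", "be")])  # all (doc, pos of 'to', pos of 'be') pairs within distance 5
```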
... Another approach is to create additional indexes. In [3,18], the authors introduced some additional indexes to improve the search performance, but they only improved phrase searches. ...
... We cannot exclude a word from the search because a high-frequency word can have a specific meaning in the context of a specific query [11,18]; therefore, excluding some words from consideration can cause search quality degradation or unpredictable effects [18]. ...
Preprint
Full-text available
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. We have shown that additional indexes with three-component keys can be used to improve the average query execution time by up to 94.7 times when queries consist of high-frequency words. In this paper, we present a new search algorithm with even more performance gains. We also present results of search experiments, which show that three-component key indexes enable much faster searches in comparison with two-component key indexes.
... An analysis of the query logs of the Excite search engine by Williams et al. [22] indicated that 5-10% of the web queries were phrase queries and that 41% of the rest also matched a phrase. In our terms, 5-10% of the queries are explicit keyphrases, and 41% of the rest may be implicit keyphrases. ...
... We immediately notice that the proportion of topics containing keyphrases is much higher than that reported by Williams et al. [22] for web queries (37% of explicit keyphrases versus 5 to 10%, and 84% of implicit keyphrases amongst the rest instead of 41%). This is natural, as the INEX topics are much longer than web queries, and, unlike them, they were carefully thought up, reviewed, and selected by the organizers of the forum. ...
... Various link-based techniques based on the correlation between link density and content have been developed for a diverse set of research problems, including link discovery and relevance ranking [12]. Moreover, communities can be identified by analyzing the link graph [22]. Besides co-citation, used by Kumar et al. [23] to measure similarity, bibliographic coupling and SimRank (based on citation patterns and the similarity of structural context, respectively) have also been used to identify the similarity of web objects [24]. ...
... Another approach is to create additional indexes. In [3,18], the authors introduced some additional indexes to improve the search performance, but they only improved phrase searches. ...
... We cannot exclude a word from the search because a high-frequency word can have a specific meaning in the context of a specific query [11,18]; therefore, excluding some words from consideration can cause search quality degradation or unpredictable effects [18]. ...
Conference Paper
Full-text available
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. We have shown that additional indexes with three-component keys can be used to improve the average query execution time by up to 94.7 times when queries consist of high-frequency words. In this paper, we present a new search algorithm with even more performance gains. We also present results of search experiments, which show that three-component key indexes enable much faster searches in comparison with two-component key indexes.
... 2) Additional index methods. In [15], some phrase indexes are presented, but such methods cannot be applied to proximity full-text searches. For this reason, the author proposed the method of [1], which solves the proximity full-text search task. ...
... In some search engines, stop lemmas can be excluded from the search and the index and, therefore, can be ignored in the search. However, it is stated in [1,15] that a stop lemma can have a specific meaning in the context of a specific search query in some cases. Therefore, stop lemmas cannot be excluded from consideration, and examples are provided. ...
... We have four index files. In the first index file, the range for the first component of keys is [0,4], in the second [5,15], in the third [16,52], and in the fourth [53,149]. For every index file, we enumerate its groups. ...
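The routing implied by these ranges can be sketched in a few lines; the function below (our own illustration, with assumed names and error handling) maps a key's first component to one of the four index files:

```python
import bisect

# Upper bounds of the first-component ranges quoted above:
# file 0: [0, 4], file 1: [5, 15], file 2: [16, 52], file 3: [53, 149]
UPPER_BOUNDS = [4, 15, 52, 149]

def index_file_for(first_component):
    """Return the number of the index file whose range covers the component."""
    if not 0 <= first_component <= UPPER_BOUNDS[-1]:
        raise ValueError("first key component out of range")
    return bisect.bisect_left(UPPER_BOUNDS, first_component)

print([index_file_for(c) for c in (0, 4, 5, 16, 149)])  # -> [0, 0, 1, 2, 3]
```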
Preprint
Full-text available
In this paper, proximity full-text searches in large text arrays are considered. A search query consists of several words. The search result is a list of documents containing these words. In a modern search system, documents that contain search query words that are near each other are more relevant than documents that do not share this trait. To solve this task, for each word in each indexed document, we need to store a record in the index. In this case, the query search time is proportional to the number of occurrences of the queried words in the indexed documents. Consequently, it is common for search systems to evaluate queries that contain frequently occurring words much more slowly than queries that contain less frequently occurring, ordinary words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. This parameter can take a value of 5, 7, or even more. Three-component key indexes can be created for faster query execution. Previously, we presented the results of experiments showing that when queries contain very frequently occurring words, the average query execution time with three-component key indexes is 94.7 times less than that required when using ordinary inverted indexes. In the current work, we describe a new three-component key index building algorithm and demonstrate the correctness of the algorithm. We present the results of experiments on creating such an index for different values of MaxDistance.
... One way is to skip the most frequently occurring words. However, there are some concerns about this approach [18]. A high-frequency word may have a unique meaning in the context of a specific query. ...
... A high-frequently occurring word may have a unique meaning in the context of the specific query. The authors [18] stated literally that "stopping or ignoring common words will have an unpredictable effect". Examples are provided in [18,14]. ...
... The authors [18] stated literally that "stopping or ignoring common words will have an unpredictable effect". Examples are provided in [18,14]. We can consider as an example the query "Who are you who". ...
Chapter
Full-text available
A search query consists of several words. In a proximity full-text search, we want to find documents that contain these words near each other. This task requires much time when the query consists of high-frequency words. If we cannot avoid this task by excluding high-frequency words from consideration by declaring them stop words, then we can optimize our solution by introducing additional indexes for faster execution. In a previous work, we discussed how to decrease the search time with multi-component key indexes. We have shown that additional indexes can be used to improve the average query execution time by up to 130 times when queries consist of high-frequency words. In this paper, we present another search algorithm that overcomes some limitations of our previous algorithm and provides even more performance gain.
... One way is to skip the most frequently occurring words. However, there are some concerns about this approach [18]. A high-frequency word may have a unique meaning in the context of a specific query. ...
... A high-frequently occurring word may have a unique meaning in the context of the specific query. The authors [18] stated literally that "stopping or ignoring common words will have an unpredictable effect". Examples are provided in [18,14]. ...
... The authors [18] stated literally that "stopping or ignoring common words will have an unpredictable effect". Examples are provided in [18,14]. We can consider as an example the query "Who are you who". ...
Preprint
A search query consists of several words. In a proximity full-text search, we want to find documents that contain these words near each other. This task requires much time when the query consists of high-frequency words. If we cannot avoid this task by excluding high-frequency words from consideration by declaring them stop words, then we can optimize our solution by introducing additional indexes for faster execution. In a previous work, we discussed how to decrease the search time with multi-component key indexes. We have shown that additional indexes can be used to improve the average query execution time by up to 130 times when queries consist of high-frequency words. In this paper, we present another search algorithm that overcomes some limitations of our previous algorithm and provides even more performance gain. This is a pre-print of a contribution published in Arai K., Kapoor S., Bhatia R. (eds) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1251, published by Springer, Cham. The final authenticated version is available online at: https://doi.org/10.1007/978-3-030-55187-2_37
... The idea of this figure is borrowed from [17]. The document collection used was TREC WT10g, which contains 1.67 million documents and is 10.27 GB in size. ...
... The queries used to extract response times were from the Excite query log, reflecting real-life queries. A table in [12] gives the average query times (in seconds) of ≈ 66 000 queries for the three-way combined approach described in Section 5.3. ...
... In Chapter 8 we discuss our bigram index versus the "state of the art" material presented in Chapter 5. Finally, in Chapter 9, we present our conclusion for the problem statement with which this report is concerned, along with the results from the experiment carried out. This master's thesis is concerned with challenges within the search technology subject, with an extensive focus on phrase searching in large text indexes. We shall explore our initial problem statement along with a descriptive background for exploring this problem. ...
... Additional indexes can improve the search performance. In [12,16], additional indexes were used to improve phrase searches. However, the approaches reported in [12,16] cannot be used for proximity full-text searches. ...
... In [12,16], additional indexes were used to improve phrase searches. However, the approaches reported in [12,16] cannot be used for proximity full-text searches. Their area of application is limited by phrase searches. ...
... We do not agree with this approach. A word cannot be excluded from the search because even a word occurring with high frequency can have a specific meaning in the context of a specific query [20,16]. Therefore, excluding some words from the search can lead to search quality degradation or unpredictable effects [16]. ...
Preprint
Full-text available
Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document as relevant if the queried words are near each other in the document. The proximity factor is even more significant for the case where the query consists of frequently occurring words. Proximity full-text search requires the storage of information for every occurrence in documents of every word that the user can search. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at distances from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of words occurring with high frequency. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. The well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema after the analysis of the results of the experiments. This is a pre-print of a contribution published in Supplementary Proceedings of the XXII International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2020), Voronezh, Russia, October 13-16, 2020, P. 336-350, published by CEUR Workshop Proceedings. The final authenticated version is available online at: http://ceur-ws.org/Vol-2790/
... When nextword indexes [16] are used, a key consists of two words. The value of the key is the list of postings, where each posting corresponds to an occurrence of these two words consecutively in the text, that is, one immediately after the other. ...
... The value of the key is the list of postings, where each posting corresponds to an occurrence of these two words consecutively in the text, that is, one immediately after the other. A phrase can also be considered a key [16]. Such indexes can be used to optimize phrase search, that is, when the user searches for a document that contains the specified phrase. ...
... Sometimes, high-frequency words can have a specific meaning in the context of a specific search query. In this case, exclusion of such a word from the search can lead to unpredictable effects [4,16]. For example, consider the following search query: "who are you who". ...
Article
Full-text available
The problem of proximity full-text search is considered. If a search query contains high-frequency words, then multi-component key indexes deliver an improvement in the search speed in comparison with ordinary inverted indexes. It was shown that we can increase the search speed by up to 130 times in cases when queries consist of high-frequency words. In this paper, we investigate how the multi-component key index architecture affects the quality of the search. We consider several well-known methods of relevance ranking by different authors. Using these methods, we perform the search in the ordinary inverted index and then in the index that is enhanced with multi-component key indexes. The results show that with multi-component key indexes we obtain search results that are very close, in terms of relevance ranking, to the search results that are obtained by means of ordinary inverted indexes.
... When nextword indexes [16] are used, a key consists of two words. The value of the key is the list of postings, where each posting corresponds to an occurrence of these two words consecutively in the text, that is, one immediately after the other. ...
... The value of the key is the list of postings, where each posting corresponds to an occurrence of these two words consecutively in the text, that is, one immediately after the other. A phrase can also be considered a key [16]. Such indexes can be used to optimize phrase search, that is, when the user searches for a document that contains the specified phrase. ...
... Sometimes, high-frequency words can have a specific meaning in the context of a specific search query. In this case, exclusion of such a word from the search can lead to unpredictable effects [4,16]. For example, consider the following search query: "who are you who". ...
Preprint
Full-text available
The problem of proximity full-text search is considered. If a search query contains high-frequency words, then multi-component key indexes deliver an improvement in the search speed compared with ordinary inverted indexes. It was shown that we can increase the search speed by up to 130 times in cases when queries consist of high-frequency words. In this paper, we investigate how the multi-component key index architecture affects the quality of the search. We consider several well-known methods of relevance ranking by different authors. Using these methods, we perform the search in the ordinary inverted index and then in an index enhanced with multi-component key indexes. The results show that with multi-component key indexes we obtain search results that are very close, in terms of relevance ranking, to the search results that are obtained by means of ordinary inverted indexes.
... 2) Additional index methods. In [15], some phrase indexes are presented, but such methods cannot be applied to proximity full-text searches. For this reason, the author proposed the method of [1], which solves the proximity full-text search task. ...
... In some search engines, stop lemmas can be excluded from the search and the index and, therefore, can be ignored in the search. However, it is stated in [1,15] that a stop lemma can have a specific meaning in the context of a specific search query in some cases. Therefore, stop lemmas cannot be excluded from consideration, and examples are provided. ...
... We have four index files. In the first index file, the range for the first component of keys is [0,4], in the second [5,15], in the third [16,52], and in the fourth [53,149]. For every index file, we enumerate its groups. ...
Article
Full-text available
Proximity full-text searches in large text arrays are considered. A search query consists of several words. The search result is a list of documents containing these words. In a modern search system, documents that contain search query words that are near each other are more relevant than other documents. To solve this task, for each word in each indexed document, we need to store a record in the index. In this case, the query search time is proportional to the number of occurrences of the queried words in the indexed documents. Consequently, it is common for search systems to evaluate queries that contain frequently occurring words much more slowly than queries that contain less frequently occurring, ordinary words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. This parameter can take a value of 5, 7, or even more. Three-component key indexes can be created for faster query execution. Previously, we presented the results of experiments showing that, when queries contain very frequently occurring words, the average query execution time with three-component key indexes is 94.7 times less than that required when using ordinary inverted indexes. In the current work, we describe a new three-component key index building algorithm and prove its correctness. We present the results of experiments on index creation for different values of MaxDistance.
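A toy sketch of the flavor of three-component key construction, under our own simplifying assumptions (keys are ordered triples of designated frequent words that co-occur within a MaxDistance window; the authors' actual building algorithm and on-disk layout are more involved):

```python
from collections import defaultdict
from itertools import combinations

MAX_DISTANCE = 5
FREQUENT = {"who", "are", "you", "the", "is"}  # assumed set of high-frequency words

def build_three_component_index(doc_id, words):
    """Toy three-component key index: store every ordered triple of frequent
    words whose occurrences all fall within one MAX_DISTANCE window.
    The key layout is an assumption for illustration only."""
    index = defaultdict(set)
    occ = [(i, w) for i, w in enumerate(words) if w in FREQUENT]
    for (i, a), (j, b), (k, c) in combinations(occ, 3):
        if k - i <= MAX_DISTANCE:          # all three fit inside one window
            index[(a, b, c)].add((doc_id, i))
    return index

idx = build_three_component_index(7, "who are you who is the one".split())
print(idx[("who", "are", "you")])  # -> {(7, 0)}
```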
... To reduce storage space further, partial nextword indexes were proposed [3]. Later, the combination of an inverted index, a partial phrase index and a partial nextword index was also introduced to moderately reduce the query time [26]. ...
... In comparison with the inverted index structure, the inverted word pair usually has a small posting list because most pairs occur rarely. In addition, inverted word pair index structure approaches have been fruitfully implemented in text document retrieval [9,11,26] and document clustering [15]. ...
Article
Full-text available
Text documents are significant arrangements of various words, while images are significant arrangements of various pixels/features. In addition, text and image data share a similar semantic structural pattern. With reference to this research, the feature pair is defined as a pair of adjacent image features. The innovative feature pair index graph (FPIG) is constructed from the unique feature pair selected, which is constructed using an inverted index structure. The constructed FPIG is helpful in clustering, classifying and retrieving the image data. The proposed FPIG method is validated against the traditional KMeans++, KMeans and Farthest First cluster methods which have the serious drawback of initial centroid selection and local optima. The FPIG method is analyzed using Iris flower image data, and the analysis yields 88% better results than Farthest First and 28.97% better results than conventional KMeans in terms of sum of squared errors. The paper also discusses the scope for further research in the proposed methodology.
... Therefore, it is interesting to combine these approaches to achieve maximum benefits. In combined indexes, common words are used as firstwords in the nextword index, while rare words are handled with ordinary inverted lists (Williams et al. 2004). The basic structure of the combined index used in cQA archives is shown in Fig. 6. ...
... In this study, we propose the use of combined queries (combined inverted and nextword indexes) for question retrieval in cQA (Williams et al. 2004). In combined indexes, both the nextword index and the inverted index have posting lists. ...
Article
Full-text available
Community question answering (cQA) has emerged as a popular service on the web; users can use it to ask and answer questions and access historical question-answer (QA) pairs. cQA retrieval, as an alternative to general web searches, has several advantages. First, users can register a query in the form of natural language sentences instead of a set of keywords; thus, they can present the required information more clearly and comprehensively. Second, the system returns several possible answers instead of a long list of ranked documents, thereby enhancing the efficient location of the desired answers. Question retrieval from a cQA archive, an essential function of cQA retrieval services, aims to retrieve historical QA pairs relevant to the query question. In this study, combined queries (combined inverted and nextword indexes) are proposed for question retrieval in cQA. The method's performance is investigated for two different scenarios: (a) when only questions from QA pairs are used as documents, and (b) when QA pairs are used as documents. In the proposed method, combined indexes are first created for both queries and documents; then, different information retrieval (IR) models are used to retrieve relevant questions from the cQA archive. Evaluation is performed on a public Yahoo! Answers dataset; the results thereby obtained show that using combined queries for all three IR models (vector space model, Okapi model, and language model) improves performance in terms of retrieval precision and ranking effectiveness. Notably, by using combined indexes when both QA pairs are used as documents, the retrieval and ranking effectiveness of these cQA retrieval models increases significantly.
... Some search systems exclude the most frequently used words from the index and, consequently, from any search; this is called the stop-word approach. However, this approach is not correct [6]. Some of the most frequently occurring words can have unique meanings in specific contexts. ...
... In [6,14,15], nextword indexes and partial phrase indexes are introduced. These additional indexes can be used to improve performance. ...
Chapter
Full-text available
Full-text search engines are important tools for information retrieval. Term proximity is an important factor in relevance score measurement. In a proximity full-text search, we assume that a relevant document contains query terms near each other, especially if the query terms are frequently occurring words. A methodology for high-performance full-text query execution is discussed. We build additional indexes to achieve better efficiency. For a word that occurs in the text, we include in the indexes some information about nearby words. What types of additional indexes do we use? How do we use them? These questions are discussed in this work. We present the results of experiments showing that the average time of search query execution is 44–45 times less than that required when using ordinary inverted indexes.
... While considering the implementation for gyani we considered three key aspects: scalability, reliability, and compatibility. Prior work [170,202] highlights the utility that combinations of inverted indexes and augmented indexes (e.g., next word, phrase, or direct indexes) can provide in answering phrase queries. In particular, we base our index design on a combination of inverted and direct indexes. ...
... Agrawal et al. [27] described an algorithm that identifies relevant sets of documents for named entities by finding a "token-set-cover" for various surface forms of the named entity and computing a join of the retrieved documents. Williams et al. [202] and Panev and Berberich [170] described approaches to query phrases using combinations of inverted, phrase, nextword, and direct indexes. Our work in contrast explores ways to compute an optimal plan of hyper-phrase query execution using dictionaries and indexes over n-grams and skip-grams. ...
... An analysis of the query logs of the Excite search engine by Williams et al. [22] indicated that 5-10% of the web queries were phrase queries and that 41% of the rest also matched a phrase. In our terms, 5-10% of the queries are explicit keyphrases, and 41% of the rest may be implicit keyphrases. ...
... We immediately notice that the proportion of topics containing keyphrases is much higher than that reported by Williams et al. [22] for web queries (37% of explicit keyphrases versus 5 to 10%, and 84% of implicit keyphrases amongst the rest instead of 41%). This is natural, as the INEX topics are much longer than web queries, and, unlike them, they were carefully thought up, reviewed, and selected by the organizers of the forum. ...
Article
Full-text available
In this paper, we study and discuss the usage of phrases in the INEX evaluation of XML retrieval as well as in related research. We find that the INEX framework could easily become a unique testbed for researchers interested in the exploitation of complex terms in IR, while triggering interest from others. Unfortunately, our analysis of the use of keyphrases in INEX topics shows a downward trend over the years that impacts the attention of participants. While NEXI, the official query format of INEX, does indeed support keyphrases, its full potential does not materialize, as topic contents show a lack of consistency in their markup. In 2007, 87% of the INEX queries contained keyphrases, but only 11% of those were marked up. We present simple and low-cost solutions to let the INEX collections deliver their full potential in keyphrase retrieval.
... While considering the implementation for gyani we considered three key aspects: scalability, reliability, and compatibility. Prior work [26,30] highlights the utility that combinations of inverted indexes and augmented indexes (e.g., next word, phrase, or direct indexes) can provide in answering phrase queries. In particular, we base our index design on a combination of inverted and direct indexes. ...
Conference Paper
In this work, we describe GYANI (gyan stands for knowledge in Hindi), an indexing infrastructure for search and analysis of large semantically annotated document collections. To facilitate the search for sentences or text regions for many knowledge-centric tasks such as information extraction, question answering, and relationship extraction, it is required that one can query large annotated document collections interactively. However, currently such an indexing infrastructure that scales to millions of documents and provides fast query execution times does not exist. To alleviate this problem, we describe how we can effectively index layers of annotations (e.g., part-of-speech, named entities, temporal expressions, and numerical values) that can be attached to sequences of words. Furthermore, we describe a query language that provides the ability to express regular expressions between word sequences and semantic annotations to ease search for sentences and text regions for enabling knowledge acquisition at scale. We build our infrastructure on a state-of-the-art distributed extensible record store. We extensively evaluate GYANI over two large news archives and the entire Wikipedia amounting to more than fifteen million documents. We observe that using GYANI we can achieve significant speed ups of more than 95x in information extraction, 53x on extracting answer candidates for questions, and 12x on relationship extraction task.
... Efficient data structures for prefix-text indexing have been studied. For example, trie variants such as burst tries [62,153] have been shown to be effective for indexing word sequences in large corpora. However, these techniques still require memory-resident structures and furthermore do not consider phrase boundaries or phrase frequencies, and hence cannot be used for our application. ...
Article
Humans are increasingly becoming the primary consumer of structured data. As the volume and heterogeneity of data produced in the world increases, the existing paradigm of using an application layer to query and search for information in data is becoming infeasible. The human end-user is overwhelmed with a barrage of diverse query and data models. Due to the lack of familiarity with the data sources, search queries issued by the user are typically found to be imprecise. To solve this problem, this dissertation introduces the notion of a "queried unit", or qunit, which is the semantic unit of information returned in response to a user's search query. In a qunits-based system, the user comes in with an information need, and is guided to the qunit that is an appropriate response for that need. The qunits-based paradigm aids the user by systematically shrinking both the query and result spaces. On one end, the query space is reduced by enriching the user's imprecise information need. This is done by extracting information from the user during query input by providing schema and data suggestions. On the other end, the result space is reduced by modeling the structured data into a collection of qunits. This is done using qunit derivation methods that use various sources of information such as query logs. This dissertation describes the design and implementation of an autocompletion-style system that performs both query and result space reduction by interacting with the user in real time, providing suggestions and pruning candidate qunit results. It enables the user to search through databases without any knowledge of the data, schema or the query language.
... This layout decreases overhead when positions are not required, but otherwise leads to more costly, co-sequential access patterns. Finally, special auxiliary index structures can be an option, for instance if only phrase queries are of interest [Williams et al., 2004]. ...
... The fourth approach is a Combined Inverted Common-NextWord index. Each word has an inverted list and the common first words [15] alone have a NextWord index [Williams 2004]. This approach reduces the index size required for a complete NextWord index and provides a better query time compared to inverted indexes alone. ...
... In the area of string similarity search, the positional q-gram inverted index is utilized to identify the candidates from a long string for the similarity query [27], [36], [37]. Besides, the inverted index with position information is widely used for phrase querying in information retrieval [38], [39]. ...
Article
Full-text available
We study the efficient regular expression (regex) matching problem. Existing algorithms are scanning-based algorithms, which typically use an equivalent automaton compiled from the regex query to verify a document. Although some works propose various strategies to quickly jump to candidate locations in a document where a query result may appear, they still need to use the scanning-based method to verify candidate locations. These methods become inefficient when there are still many candidate locations to be verified. In this paper, we propose a novel approach to efficiently compute all matching positions for a regex query purely based on a positional q-gram inverted index. We propose a gram-driven NFA (GNFA) to represent the language of a regex and show that all regex matching locations can be obtained by finding positions of q-grams of the GNFA that satisfy certain positional constraints. Then we propose several GNFA-based query plans to answer the query using the positional inverted index. To improve query efficiency, we design an algorithm to build a tree-based query plan by carefully choosing a checking order for positional constraints. Experimental results on real-world datasets show that our method outperforms state-of-the-art methods by up to an order of magnitude in query efficiency.
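The core idea, matching by positional constraints over q-grams instead of scanning, can be illustrated with a tiny sketch. The gram length, index layout, and the fixed-gap pattern below are our own illustrative assumptions, not the paper's GNFA machinery:

```python
from collections import defaultdict

Q = 2  # gram length

def build_qgram_index(text):
    """Positional q-gram inverted index: gram -> sorted start positions."""
    index = defaultdict(list)
    for i in range(len(text) - Q + 1):
        index[text[i:i + Q]].append(i)
    return index

# Match the pattern 'ab.cd' (a fixed-gap positional constraint between
# grams 'ab' and 'cd') without scanning the text itself.
text = "xxabycdxxabzcd"
idx = build_qgram_index(text)
cd_positions = set(idx["cd"])
matches = [p for p in idx["ab"] if p + 3 in cd_positions]
print(matches)  # start positions of every 'ab.cd' match -> [2, 9]
```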
... Most research has dealt with the compression of inverted lists for index terms that represent single words within a document. Williams et al. present an approach to make phrase queries more efficient by combining next-word indexes with indexing phrases [21]. Commonly available features in search engines also include the use of ranges and site names. ...
... 4. The IRS acquires the content from the documents [38]. 5. ...
Article
An information retrieval system (IRS) is used to retrieve documents based on an information need. The IRS makes relevance judgements by attempting to match a query to a document. As IRS capabilities are indexing design dependent, the hybrid indexing method (IRS-H) is introduced. The objectives of this article are to examine IRS-H (as an alternative indexing method that performs exact phrase matching) and IRS-I, regarding retrieval usefulness, identification of relevant documents, and the quality of rejecting irrelevant documents by conducting three experiments and by analysing the related data. Three experiments took place where a collection of 100 research documents and 75 queries were presented to: (1) five participants answering a questionnaire, (2) IRS-I to generate data and (3) IRS-H to generate data. The data generated during the experiments were statistically analysed using the performance measurements of Precision, Recall and Specificity, and one-tailed Student’s t-tests. The results reveal that IRS-H (1) increased the retrieval of relevant documents, (2) reduced incorrect identification of relevant documents and (3) increased the quality of rejecting irrelevant documents. The research found that the hybrid indexing method, using a small closed document collection of a hundred documents, produced the required outputs and that it may be used as an alternative IRS indexing method.
... To make this work the stoplist has to be abandoned (it can still be used for sets of words). Williams et al. (2004) compare a few different representations of sequences of words. ...
... An effective technique for indexing word sequences in large corpora has been proposed (Williams et al., 2004), and Sequitur algorithms have been used in phrase creation (Moffat and Wan, 2001). These techniques keep large datasets in memory. ...
Conference Paper
Full-text available
This paper reviews some of the common search techniques that have been applied to the Quránic text as well as their limitations and advantages. In addition, the paper investigates auto-completion techniques and their challenges to the Arabic language. Finally, it proposes a new auto-completion technique for the Quránic text, which improves the accuracy of the retrieved results when searching the text of the Qurán.
Article
Score-safe index processing has received a great deal of attention over the last two decades. By pre-calculating maximum term impacts during indexing, the number of scoring operations can be minimized, and the top-k documents for a query can be located efficiently. However, these methods often ignore the importance of the effectiveness gains possible when using sequential dependency models. We present a hybrid approach which leverages score-safe processing and suffix-based self-indexing structures in order to provide efficient and effective top-k document retrieval.
Article
Full-text available
Real-time search requires incrementally ingesting content updates and making them searchable almost immediately while serving search queries at low latency. This is currently feasible for datasets of moderate size by fully maintaining the index in the main memory of multiple machines. Instead, disk-based methods for incremental index maintenance substantially increase search latency with the index fragmented across multiple disk locations. To support fast search over disk-based storage, we take a fresh look at incremental text indexing in the context of current architectural features. We introduce a greedy method called Selective Range Flush (SRF) to contiguously organize the index over disk blocks and dynamically update it at low cost. We show that SRF requires substantial experimental effort to tune specific parameters for performance efficiency. Subsequently, we propose the Unified Range Flush (URF) method, which is conceptually simpler than SRF, achieves similar or better performance with fewer parameters and less tuning, and is amenable to I/O complexity analysis. We implement interesting variations of the two methods in the Proteus prototype search engine that we developed and run extensive experiments with three different Web datasets of size up to 1TB. Across different systems, we show that our methods offer search latency that matches or is as little as half of the lowest achieved by existing disk-based methods. In comparison to an existing method of comparable search latency on the same system, our methods reduce the I/O part of the build time by a factor of 2.0-2.4 and the total build time by 21-24%.
Conference Paper
To augment the information retrieval process, a model is proposed to facilitate simple contextual indexing for large-scale standard text corpora. An Edge Index Graph model is presented, which clusters documents based on a root index and an edge index that are created. Intelligent information retrieval is possible with the proposed system, where the process of querying provides proactive help to users through a knowledge base. The query is provided with automatic phrase completion and word suggestions. A thesaurus is used to provide meaningful search of the query. This model can be utilized for document retrieval, clustering, and phrase browsing.
Article
We consider a multi-stage retrieval architecture consisting of a fast, “cheap” candidate generation stage, a feature extraction stage, and a more “expensive” reranking stage using machine-learned models. In this context, feature extraction can be accomplished using a document vector index, a mapping from document ids to document representations. We consider alternative organizations of such a data structure for efficient feature extraction: design choices include how document terms are organized, how complex term proximity features are computed, and how these structures are compressed. In particular, we propose a novel document-adaptive hashing scheme for compactly encoding term ids. The impact of alternative designs on both feature extraction speed and memory footprint is experimentally evaluated. Overall, results show that our architecture is comparable in speed to using a traditional positional inverted index but requires less memory overall, and offers additional advantages in terms of flexibility.
Article
Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing word-sequence statistics using inverted indexes requires unreasonable processing time or substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance. In this article, we present and analyze a new index structure designed to improve query efficiency in dependency retrieval models. By adapting a class of (ε, δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate statistics important in term-dependency models with low, probabilistically bounded error rates. The space requirements for the vocabulary of the index are only logarithmically linked to the size of the vocabulary. Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of n-grams consisting of between 1 and 4 words extracted from the GOV2 collection to less than 0.01% of the space requirements of the vocabulary of a full index. We also show that larger n-gram queries can be processed considerably more efficiently than in current alternatives, such as positional and next-word indexes.
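A count-min sketch is the textbook example of this (ε, δ)-approximation family and conveys the idea of estimating n-gram statistics in sub-linear space; the generic version below is our illustration, not the authors' sketch index:

```python
import hashlib
import math

class CountMinSketch:
    """Classic count-min sketch: overestimates frequencies by at most eps*N
    with probability 1 - delta, using O(log(1/delta) / eps) counters."""
    def __init__(self, eps=0.01, delta=0.01):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cells(self, key):
        # one salted hash per row
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "big"))
            yield row, int.from_bytes(h.digest()[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._cells(key))

cms = CountMinSketch()
for ngram in ["new york", "new york", "york times"]:
    cms.add(ngram)
print(cms.estimate("new york"))  # >= true count 2, usually exactly 2
```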
Chapter
Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.
Article
Web archives include both archives of contents originally published on the Web (e.g., the Internet Archive) and archives of contents published long ago that are now accessible on the Web (e.g., the archive of The Times). Thanks to the increased awareness that web-born contents are worth preserving and to improved digitization techniques, web archives have grown in number and size. To unfold their full potential, search techniques are needed that consider their inherent special characteristics. This work addresses three important problems toward this objective and makes the following contributions:
- We present the Time-Travel Inverted indeX (TTIX) as an efficient solution to time-travel text search in web archives, allowing users to search only the parts of the web archive that existed at a user's time of interest.
- To counter negative effects that terminology evolution has on the quality of search results in web archives, we propose a novel query-reformulation technique, so that old but highly relevant documents are retrieved in response to today's queries.
- For temporal information needs, for which the user is best satisfied by documents that refer to particular times, we describe a retrieval model that integrates temporal expressions (e.g., "in the 1990s") seamlessly into a language modelling approach.
Experiments for each of the proposed methods show their efficiency and effectiveness, respectively, and demonstrate the viability of our approach to search in web archives.
Article
This paper proposes a static index pruning method for phrase queries based on the cohesive similarity between terms. The co-occurrence between terms is modelled by the terms' cohesiveness within a document; the less relevant terms are pruned away while assuring that there is no change in the top-k results, so the proposed method creates an effective pruned index. This method also considers term proximity based on term frequency and term informativeness. The experiments were conducted on a number of different standard text collections, and analysis of the results shows promising results that are comparable with the existing static pruning method. It is an advantage of the proposed approach that it can also be applied to a standard inverted index for phrase queries.
Conference Paper
Phrase queries are a key functionality of modern search engines. Beyond that, they increasingly serve as an important building block for applications such as entity-oriented search, text analytics, and plagiarism detection. Processing phrase queries is costly, though, since positional information has to be kept in the index and all words, including stopwords, need to be considered. We consider an augmented inverted index that indexes selected variable-length multi-word sequences in addition to single words. We study how arbitrary phrase queries can be processed efficiently on such an augmented inverted index. We show that the underlying optimization problem is NP-hard in the general case and describe an exact exponential algorithm and an approximation algorithm to its solution. Experiments on ClueWeb09 and The New York Times with different real-world query workloads examine the practical performance of our methods.
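To see what the underlying optimization problem looks like, here is a deliberately naive greedy segmentation that covers a phrase query with the longest indexed multi-word sequences. It is an illustration under our own assumptions, not the exact or approximation algorithms studied in the paper:

```python
# Greedy cover of a phrase query by the longest indexed multi-word sequences.
# A simple heuristic for illustration; the paper's exact and approximation
# algorithms for this NP-hard problem are more involved.

indexed_sequences = {("new", "york"), ("new", "york", "times"), ("times", "square")}

def segment(query):
    words, i, segments = tuple(query), 0, []
    while i < len(words):
        # take the longest indexed sequence starting at i, else a single word
        for j in range(len(words), i, -1):
            if j - i == 1 or words[i:j] in indexed_sequences:
                segments.append(words[i:j])
                i = j
                break
    return segments

print(segment("new york times square".split()))
# -> [('new', 'york', 'times'), ('square',)]
```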
Conference Paper
Phrase queries play an important role in web search and other applications. Traditionally, phrase queries have been processed using a positional inverted index, potentially augmented by selected multi-word sequences (e.g., n-grams or frequent noun phrases). In this work, instead of augmenting the inverted index, we take a radically different approach and leverage the direct index, which provides efficient access to compact representations of documents. Modern retrieval systems maintain such a direct index, for instance, to generate snippets or compute proximity features. We present extensions of the established term-at-a-time and document-at-a-time query-processing methods that make effective combined use of the inverted index and the direct index. Our experiments on two real-world document collections using diverse query workloads demonstrate that our methods improve response time substantially without requiring additional index space.
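The division of labor can be sketched simply: a (non-positional) inverted index generates candidate documents, and the direct index verifies the phrase. The toy structures below are our own illustration, not the paper's compact document representations or its term-at-a-time/document-at-a-time extensions:

```python
# Candidate generation from an inverted index, verification via a direct index.

inverted = {"big": {1, 2}, "apple": {1, 2, 3}}                    # word -> doc ids
direct = {1: ["the", "big", "apple"],                             # doc id -> tokens
          2: ["big", "red", "apple"],
          3: ["apple"]}

def phrase_search(phrase):
    # 1) candidates: documents containing all query words (order unknown)
    candidates = set.intersection(*(inverted[w] for w in phrase))
    # 2) verification: check word adjacency in the document representation
    hits = []
    for d in sorted(candidates):
        doc = direct[d]
        if any(doc[i:i + len(phrase)] == phrase
               for i in range(len(doc) - len(phrase) + 1)):
            hits.append(d)
    return hits

print(phrase_search(["big", "apple"]))  # -> [1]
```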
Conference Paper
Effective postings list compression techniques, and the efficiency of postings list processing schemes such as WAND, have significantly improved the practical performance of ranked document retrieval using inverted indexes. Recently, suffix array-based index structures have been proposed as a complementary tool, to support phrase searching. The relative merits of these alternative approaches to ranked querying using phrase components are, however, unclear. Here we provide: (1) an overview of existing phrase indexing techniques; (2) a description of how to incorporate recent advances in list compression and processing; and (3) an empirical evaluation of state-of-the-art suffix-array and inverted file-based phrase retrieval indexes using a standard IR test collection.
Conference Paper
The use of phrases as part of similarity computations can enhance search effectiveness. But the gain comes at a cost, either in terms of index size, if all word-tuples are treated as queryable objects; or in terms of processing time, if postings lists for phrases are constructed at query time. There is also a lack of clarity as to which phrases are “interesting”, in the sense of capturing useful information. Here we explore several techniques for recognizing phrases using statistics of large-scale collections, and evaluate their quality.
Article
During a search, phrase-terms expressed in queries are presented to an information retrieval system (IRS) to find documents relevant to a topic. The IRS makes relevance judgements by attempting to match vocabulary in queries to documents. If there is a mismatch, the problem of vocabulary mismatch occurs. The aim is to examine ways of searching for documents more effectively, in order to minimise mismatches. A further aim is to understand the mechanisms of, and the differences between, human and machine-assisted retrieval. The objective of this study was to determine whether IRS-H (an IRS using the hybrid indexing method) and human participants agree or disagree on relevancy judgments, and whether the problem of mismatching vocabulary can be solved. A collection of eighty research documents and sixty-five phrase-terms were presented to (i) IRS-H and four participants in Test 1, and (ii) IRS-H and one participant (aided by search software) in Test 2. Statistical analysis was performed using the Kappa coefficient. The judgements of IRS-H and the four participants disagreed; the judgements of IRS-H and the participant aided by search software agreed. IRS-H solves the problem of mismatching vocabulary between a query and a document.
Chapter
Full-text available
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in a text, we use additional indexes to store information about nearby words that are at distances from the given word of less than or equal to the MaxDistance parameter. We showed that additional indexes with three-component keys can be used to improve the average query execution time by up to 94.7 times if the queries consist of high-frequency words. In this paper, we present a new search algorithm with even more performance gains. We consider several strategies for selecting multi-component key indexes for a specific query and compare these strategies with the optimal strategy. We also present the results of search experiments, which show that three-component key indexes enable much faster searches in comparison with two-component key indexes.
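One plausible selection strategy, sketched under our own assumptions (a greedy cover of query words by available keys; not necessarily any of the strategies compared in the chapter):

```python
# Toy greedy strategy for selecting multi-component keys to cover a query.
# Illustrative only; the chapter compares several selection strategies
# against the optimal one, which this sketch does not reproduce.

available_keys = [("who", "are", "you"), ("you", "who"), ("are",)]

def choose_keys(query):
    remaining, chosen = set(query), []
    while remaining:
        # pick the key covering the most still-uncovered query words
        best = max(available_keys, key=lambda k: len(remaining & set(k)))
        covered = remaining & set(best)
        if not covered:      # no available key helps any further
            break
        chosen.append(best)
        remaining -= covered
    return chosen

print(choose_keys(["who", "are", "you"]))  # -> [('who', 'are', 'you')]
```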
Chapter
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. A search algorithm for the case when the query consists of high-frequency words is discussed. In addition, we present results of experiments with different values of MaxDistance to evaluate how the search speed depends on the value of MaxDistance. These results show that the average query execution time with our indexes is 94.7–45.9 times (depending on the value of MaxDistance) less than that with standard inverted files when queries that contain high-frequency words are evaluated.
Article
In relational database management systems (RDBMSs), efficient join methods for text retrieval using an inverted index have been developed and implemented. However, the existing intersection of inverted posting lists increases keyword search time for large texts because of unnecessary comparisons. The relation-based search produces results by intersecting posting lists. To reduce query search time, a multi-way skip-merge join algorithm is proposed in this study. The proposed algorithm improves execution speed by using a sorted inverted-index posting list to minimize unnecessary comparison operations in the posting-list intersection. The skip-merge join method, which minimizes unnecessary comparisons using an aggregate function, is integrated with a multi-way join as a replacement for the existing two-way join method. The combined skip-merge and multi-way join algorithm performs increasingly well as the number of search keywords and the number of documents grow. The performance improvement of the keyword search is verified by implementing the multi-way skip-merge join algorithm in PostgreSQL, an RDBMS.
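The core idea of skipping over non-matching postings during a multi-way intersection can be sketched as follows; binary-search-based skipping and shortest-list-first ordering are stand-ins for the paper's skip-merge machinery, not its exact algorithm.

```python
from bisect import bisect_left

def intersect_many(postings):
    """Multi-way intersection of sorted posting lists of document ids.

    The shortest list drives the merge; bisect skips over runs of
    non-matching ids in the longer list instead of comparing one by one.
    """
    if not postings:
        return []
    postings = sorted(postings, key=len)   # rarest list first
    result = postings[0]
    for plist in postings[1:]:
        merged, pos = [], 0
        for doc in result:
            pos = bisect_left(plist, doc, pos)   # skip ahead
            if pos == len(plist):
                break
            if plist[pos] == doc:
                merged.append(doc)
        result = merged
        if not result:
            break
    return result
```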
Chapter
Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.
Article
Type-ahead search finds answers on the fly as a user types in a keyword query. A main challenge in this search paradigm is the high-efficiency requirement that queries must be answered within milliseconds. In this paper we study how to answer top-k queries in this paradigm, i.e., as a user types in a query letter by letter, we want to efficiently find the k best answers. Instead of inventing completely new algorithms from scratch, we study the challenges of adopting existing top-k algorithms in the literature that rely heavily on two basic list-access methods: random access and sorted access. We present two algorithms to support random access efficiently. We develop novel techniques to support efficient sorted access using list pruning and materialization. We extend our techniques to support fuzzy type-ahead search, which allows minor errors between query keywords and answers. We report our experimental results on several real large data sets to show that the proposed techniques can answer top-k queries efficiently in type-ahead search.
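The basic lookup underlying type-ahead search can be sketched with a sorted vocabulary, in which all completions of a prefix form a contiguous range; the paper's contributions (top-k scoring, list pruning, materialization, fuzzy matching) would be layered on top of a lookup like this.

```python
from bisect import bisect_left

def prefix_completions(sorted_vocab, prefix, k=10):
    """Return up to k completions of a prefix from a sorted vocabulary.

    All strings sharing a prefix are contiguous in sorted order, so two
    binary searches delimit the candidate range. '\uffff' is used as a
    sentinel upper bound, assuming it never occurs in vocabulary terms.
    """
    lo = bisect_left(sorted_vocab, prefix)
    hi = bisect_left(sorted_vocab, prefix + "\uffff")
    return sorted_vocab[lo:hi][:k]

# prefix_completions(["index", "infer", "invert", "inverted"], "inv")
# -> ["invert", "inverted"]
```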
Conference Paper
Full-text available
Both phrases and Boolean queries have a long history in information retrieval, particularly in commercial systems. In previous work, Boolean queries have been used as a source of phrases for a statistical retrieval model. This work, like the majority of research on phrases, resulted in little improvement in retrieval effectiveness. In this paper, we describe an approach where phrases identified in natural language queries are used to build structured queries for a probabilistic retrieval model. Our results show that using phrases in this way can improve performance, and that phrases that are automatically extracted from a natural language query perform nearly as well as manually selected phrases.
Conference Paper
Full-text available
We present an effective caching scheme that reduces the computing and I/O requirements of a Web search engine without altering its ranking characteristics. The novelty is a two-level caching scheme that simultaneously combines cached query results and cached inverted lists in a real search engine. A set of logged queries is used to measure and compare the performance and scalability of the search engine with no cache, with a cache for query results, with a cache for inverted lists, and with the two-level cache. Experimental results show that the two-level cache is superior, and that it allows the maximum number of queries processed per second to be increased by a factor of three while preserving response time. These results are new, have not been reported before, and demonstrate the importance of advanced caching schemes for real search engines.
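A minimal sketch of the two-level idea, assuming LRU replacement at both levels; fetch_list and rank are hypothetical stand-ins for the engine's disk and ranking components, and the real scheme's replacement policies may differ.

```python
from collections import OrderedDict

class LRUCache:
    """Bounded LRU cache used for both levels of the hierarchy."""
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)      # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

def answer(query, results_cache, lists_cache, fetch_list, rank):
    """Two-level lookup: cached query results first, then cached inverted
    lists, then simulated disk. Assumes whitespace-tokenized queries."""
    hit = results_cache.get(query)
    if hit is not None:
        return hit
    lists = []
    for term in query.split():
        plist = lists_cache.get(term)
        if plist is None:
            plist = fetch_list(term)    # hypothetical disk access
            lists_cache.put(term, plist)
        lists.append(plist)
    result = rank(lists)                # hypothetical ranking routine
    results_cache.put(query, result)
    return result
```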
Conference Paper
Full-text available
This paper serves as an introduction to the research described in detail in the remainder of the volume. The next section provides a summary of the retrieval background knowledge that is assumed in the other papers. Section 3 presents a short description of each track; a more complete description of a track can be found in that track's overview paper in the proceedings. The final section looks forward to future TREC conferences.
Article
Full-text available
Term clustering and syntactic phrase formation are methods for transforming natural language text. Both have had only mixed success as strategies for improving the quality of text representations for document retrieval. Since the strengths of these methods are complementary, we have explored combining them to produce superior representations.
Article
Full-text available
Phrase browsing techniques use phrases extracted automatically from a large information collection as a basis for browsing and accessing it. This paper describes a case study that uses an automatically constructed phrase hierarchy to facilitate browsing of an ordinary large Web site. Phrases are extracted from the full text using a novel combination of rudimentary syntactic processing and sequential grammar induction techniques. The interface is simple, robust and easy to use. To convey a feeling for the quality of the phrases that are generated automatically, a thesaurus used by the organization responsible for the Web site is studied and its degree of overlap with the phrases in the hierarchy is analyzed. Our ultimate goal is to amalgamate hierarchical phrase browsing and hierarchical thesaurus browsing: the latter provides an authoritative domain vocabulary and the former augments coverage in areas the thesaurus does not reach.
Article
In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides an insight into the public practices and choices in Web searching.
Article
We investigate the application of a novel relevance ranking technique, cover density ranking, to the requirements of Web-based information retrieval, where a typical query consists of a few search terms and a typical result consists of a page indicating several potentially relevant documents. Traditional ranking methods for information retrieval, based on term and inverse document frequencies, have been found to work poorly in this context. Under the cover density measure, ranking is based on term proximity and cooccurrence. Experimental comparisons show performance that compares favorably with previous work.
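A simplified sketch of the proximity machinery behind cover density ranking: enumerate minimal covers (shortest spans containing every query term) and discount long covers. The scoring rule and the constant k below are illustrative assumptions, not the article's exact formulation.

```python
def covers(positions_by_term):
    """Enumerate minimal covers: shortest spans containing every query term.

    positions_by_term maps each query term to its sorted positions in one
    document. A sliding window over the merged position stream yields each
    minimal cover exactly once.
    """
    stream = sorted((p, t) for t, ps in positions_by_term.items() for p in ps)
    need = len(positions_by_term)
    counts, result, lo = {}, [], 0
    for hi, (p_hi, t_hi) in enumerate(stream):
        counts[t_hi] = counts.get(t_hi, 0) + 1
        while counts[stream[lo][1]] > 1:    # left endpoint is redundant
            counts[stream[lo][1]] -= 1
            lo += 1
        # record only when both endpoints are necessary (minimality)
        if len(counts) == need and counts[t_hi] == 1:
            result.append((stream[lo][0], p_hi))
    return result

def cover_score(cover_list, k=16):
    """Short covers count fully; longer covers are discounted in
    proportion to their extent. k is an illustrative parameter."""
    return sum(min(1.0, k / (b - a + 1)) for a, b in cover_list)
```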
Article
Past research into text retrieval methods for the Web has been restricted by the lack of a test collection capable of supporting experiments which are both realistic and reproducible. The 1.69 million document WT10g collection is proposed as a multi-purpose testbed for experiments with these attributes, in distributed IR, hyperlink algorithms and conventional ad hoc retrieval.WT10g was constructed by selecting from a superset of documents in such a way that desirable corpus properties were preserved or optimised. These properties include: a high degree of inter-server connectivity, integrity of server holdings, inclusion of documents related to a very wide spread of likely queries, and a realistic distribution of server holding sizes. We confirm that WT10g contains exploitable link information using a site (homepage) finding experiment. Our results show that, on this task, Okapi BM25 works better on propagated link anchor text than on full text.WT10g was used in TREC-9 and TREC-2000 and both topic relevance and homepage finding queries and judgments are available.
Conference Paper
Considerable research effort has been invested in improving the effectiveness of information retrieval systems. Techniques such as relevance feedback, thesaural expansion, and pivoting all provide better quality responses to queries when tested in standard evaluation frameworks. But such enhancements can add to the cost of evaluating queries. In this paper we consider the pragmatic issue of how to improve the cost-effectiveness of searching. We describe a new inverted file structure using quantized weights that provides superior retrieval effectiveness compared to conventional inverted file structures when early termination heuristics are employed. That is, we are able to reach similar effectiveness levels with less computational cost, and so provide a better cost/performance compromise than previous inverted file organisations.
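As a sketch of the quantized-weights idea, the snippet below maps floating-point term weights onto a small number of integer levels so they can be stored compactly in the inverted file; uniform quantization and the bit width are assumptions for illustration, not the paper's scheme.

```python
def quantize(weights, bits=5):
    """Uniformly quantize floating-point weights into 2**bits - 1 levels.

    Storing small integers instead of floats shrinks the index and lets
    early-termination heuristics compare weights cheaply.
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = levels / (hi - lo) if hi > lo else 0.0
    return [int(round((w - lo) * scale)) for w in weights]
```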
Conference Paper
We measure the WT10g test collection, used in the TREC-9 and TREC 2001 Web Tracks, with common measures used in the web topology community, in order to see if WT10g "looks like" the web. This is not an idle question; characteristics of the web, such as power law relationships, diameter, and connected components have all been observed within the scope of general web crawls, constructed by blindly following links. In contrast, WT10g was carved out from a larger crawl specifically to be a web search test collection within the reach of university researchers. Does such a collection retain the properties of the larger web? In the case of WT10g, yes.
Article
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
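The self-indexing idea can be sketched as a small directory of synchronisation points built over one compressed inverted list: every k-th posting's byte offset is recorded together with a document-id base, so decoding can start mid-list instead of from the front. The layout below is an assumption for illustration; encode can be any integer codec, such as the variable-byte coder sketched further down.

```python
def build_skips(doc_ids, encode, k=128):
    """Build the internal index for one compressed inverted list.

    doc_ids must be sorted. Gaps between consecutive ids, not raw ids,
    are compressed. Each skip entry (base, offset) records the id that
    precedes a block and the block's byte offset, so a conjunctive query
    can decode only the blocks that may contain a candidate.
    """
    skips, payload, prev = [], bytearray(), 0
    for i, doc in enumerate(doc_ids):
        if i % k == 0:
            skips.append((prev, len(payload)))
        payload += encode(doc - prev)
        prev = doc
    return skips, bytes(payload)
```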
Article
Ranked queries are used to locate relevant documents in text databases. In a ranked query a list of terms is specified, then the documents that most closely match the query are returned, in decreasing order of similarity, as answers. Crucial to the efficacy of ranked querying is the use of a similarity heuristic, a mechanism that assigns a numeric score indicating how closely a document and the query match. In this note we explore and categorise a range of similarity heuristics described in the literature. We have implemented all of these measures in a structured way, and have carried out retrieval experiments with a substantial subset of these measures. Our purpose with this work is threefold: first, in enumerating the various measures in an orthogonal framework we make it straightforward for other researchers to describe and discuss similarity measures; second, by experimenting with a wide range of the measures, we hope to observe which features yield good retrieval behaviour in a variety of retrieval environments; and third, by describing our results so far, to gather feedback on the issues we have uncovered. We demonstrate that it is surprisingly difficult to identify which techniques work best, and comment on the experimental methodology required to support any claims as to the superiority of one method over another.
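One concrete point in the space of similarity heuristics being categorised: TF and IDF components combined under a cosine measure. The particular TF and IDF formulations below are common textbook choices, not the note's canonical versions; variants swap in different components.

```python
import math

def cosine_similarity(query_tf, doc_tf, df, n_docs):
    """Cosine similarity with logarithmic TF and smoothed IDF weighting.

    query_tf and doc_tf map terms to frequencies; df maps terms to
    document frequencies. Only the document vector is length-normalised
    here, one of several normalisation choices a heuristic can make.
    """
    def weight(tf, term):
        return (1 + math.log(tf)) * math.log(1 + n_docs / df[term])

    dot = sum(
        weight(tf, t) * weight(doc_tf[t], t)
        for t, tf in query_tf.items() if t in doc_tf
    )
    norm = math.sqrt(sum(weight(tf, t) ** 2 for t, tf in doc_tf.items()))
    return dot / norm if norm else 0.0
```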
Article
Research on Web searching is at an incipient stage. This aspect provides a unique opportunity to review the current state of research in the field, identify common trends, develop a methodological framework, and define terminology for future Web searching studies. In this article, the results from published studies of Web searching are reviewed to present the current state of research. The analysis of the limited Web searching studies available indicates that research methods and terminology are already diverging. A framework is proposed for future studies that will facilitate comparison of results. The advantages of such a framework are presented, and the implications for the design of Web information retrieval systems studies are discussed. Additionally, the searching characteristics of Web users are compared and contrasted with users of traditional information retrieval and online public access systems to discover if there is a need for more studies that focus predominantly or exclusively on Web searching. The comparison indicates that Web searching differs from searching in other environments.
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
Browsing accounts for much of people's interaction with digital libraries, but it is poorly supported by standard search engines. Conventional systems often operate at the wrong level, indexing words when people think in terms of topics, and returning documents when people want a broader view. As a result, users cannot easily determine what is in a collection, how well a particular topic is covered, or what kinds of queries will provide useful results. We have built a new kind of search engine, Keyphind, that is explicitly designed to support browsing. Automatically extracted keyphrases form the basic unit of both indexing and presentation, allowing users to interact with the collection at the level of topics and subjects rather than words and documents. The keyphrase index also provides a simple mechanism for clustering documents, refining queries, and previewing results. We compared Keyphind to a traditional query engine in a small usability study. Users reported that certain kinds of browsing tasks were much easier with the new interface, indicating that a keyphrase index would be a useful supplement to existing search tools.
Article
This paper reports selected findings from an ongoing series of studies analyzing large-scale data sets containing queries posed by Excite users, a major Internet search service. The findings presented report on: (1) queries length and frequency, (2) Boolean queries, (3) query reformulation, (4) phrase searching, (5) search term distribution, (6) relevance feedback, (7) viewing pages of results, (8) successive searching, (9) sexually-related searching, (10) image queries and (11) multi-lingual aspects. Further research is discussed.
Article
In this paper we examine the question of query parsing for World Wide Web queries and present a novel method for phrase recognition and expansion. Given a training corpus of approximately 16 million Web queries and a handwritten context-free grammar, the EM algorithm is used to estimate the parameters of a probabilistic context-free grammar (PCFG) with a system developed by Carroll [5]. We use the PCFG to compute the most probable parse for a user query, reflecting linguistic structure and word usage of the domain being parsed. The optimal syntactic parse for a user query thus obtained is employed for phrase recognition and expansion. Phrase recognition is used to increase retrieval precision; phrase expansion is applied to make the best use possible of very short Web queries.
Article
In this paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most efficient. We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably effective: as we show, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequently accessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10%–40% slower than a table with around one string per slot (while a table without move-to-front is perhaps 40% slower again) and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more efficient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average.
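A minimal sketch of a chained hash table with move-to-front lists; the chain layout and the (word, count) payload are illustrative choices rather than the paper's implementation.

```python
class MTFHashTable:
    """Chained hash table with move-to-front lists.

    A found node is moved to the head of its chain, so frequently
    accessed words sit first or second even in heavily loaded tables.
    """
    def __init__(self, slots=1024):
        self.slots = [[] for _ in range(slots)]

    def add(self, word):
        chain = self.slots[hash(word) % len(self.slots)]
        for i, (w, count) in enumerate(chain):
            if w == word:
                chain[i] = (w, count + 1)
                chain.insert(0, chain.pop(i))   # move to front
                return
        chain.insert(0, (word, 1))              # new word, insert at head
```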
Article
Most search systems for querying large document collections---for example, web search engines---are based on well-understood information retrieval principles. These systems are both efficient and effective in finding answers to many user information needs, expressed through informal ranked or structured Boolean queries. Phrase querying and browsing are additional techniques that can augment or replace conventional querying tools. In this paper, we propose optimisations for phrase querying with a nextword index, an efficient structure for phrase-based searching. We show that careful consideration of which search terms are evaluated in a query plan and optimisation of the order of evaluation of the plan can reduce query evaluation costs by more than a factor of five. We conclude that, for phrase querying and browsing with nextword indexes, an ordered query plan should be used for all browsing and querying. Moreover, we show that optimised phrase querying is practical on large text collections.
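A minimal nextword index can be sketched as a two-level map from a word to its successors to the positions of each pair; phrase evaluation then intersects position lists, which is where the paper's query-plan ordering decisions arise. The names and layout below are illustrative, not the paper's on-disk structure.

```python
from collections import defaultdict

def build_nextword_index(tokens):
    """For each word, record the words that follow it and the positions
    of each (word, nextword) pair."""
    index = defaultdict(lambda: defaultdict(list))
    for i, (w, nxt) in enumerate(zip(tokens, tokens[1:])):
        index[w][nxt].append(i)
    return index

def phrase_positions(index, phrase):
    """Return starting positions of a phrase by intersecting the shifted
    position lists of consecutive word pairs."""
    words = phrase.split()
    hits = set(index[words[0]][words[1]])
    for offset in range(1, len(words) - 1):
        shifted = {p - offset for p in index[words[offset]][words[offset + 1]]}
        hits &= shifted
    return sorted(hits)
```

In this naive version the pairs are intersected left to right; choosing to evaluate rare pairs first is exactly the kind of query-plan optimisation the paper reports as reducing costs by more than a factor of five.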
Article
Fast access to files of integers is crucial for the efficient resolution of queries to databases. Integers are the basis of indexes used to resolve queries, for example, in large internet search systems, and numeric data forms a large part of most databases. Disk access costs can be reduced by compression, if the cost of retrieving a compressed representation from disk plus the CPU cost of decoding it is less than that of retrieving uncompressed data. In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
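Of the codecs compared, the variable-byte scheme is the simplest to sketch: seven payload bits per byte, with a flag bit marking the final byte of each integer. Byte order and flag conventions vary between implementations; the one below is one common choice, not necessarily the paper's.

```python
def vbyte_encode(n):
    """Encode one non-negative integer, low-order bits first; the high
    bit is set on the terminating byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    """Decode a stream of variable-byte coded integers."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:                     # terminating byte
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums

# vbyte_decode(vbyte_encode(300) + vbyte_encode(5)) -> [300, 5]
```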
Article
Ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. We propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. CPU time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. The principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces CPU time and disk traffic to around one third of the original requirement. We also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed.
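A simplified sketch of ranking over frequency-sorted inverted lists, where the most promising postings come first and processing can stop after a fixed posting budget; the quota-based stopping rule is an illustrative stand-in for the paper's early-recognition heuristic, not its actual criterion.

```python
def top_k_frequency_sorted(query_terms, freq_index, k=10, budget=1000):
    """Rank documents from frequency-sorted inverted lists.

    freq_index[term] is a list of (within-document frequency, doc id)
    pairs in decreasing frequency order, so high-impact postings are
    processed first and the tail can be skipped.
    """
    scores, processed = {}, 0
    for term in query_terms:
        for f, doc in freq_index[term]:
            scores[doc] = scores.get(doc, 0) + f
            processed += 1
            if processed >= budget:      # early termination
                break
        else:
            continue
        break
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```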
Article
Text retrieval systems are used to fetch documents from large text collections, using queries consisting of words and word sequences.
SPINK, A. AND XU, J. 2000. Selected results from a large study of Web searching: The Excite study. Information Research 6, 1. Available online at: http://InformationR.net/ir/6-1/paper90.html.
VOORHEES, E. M. AND HARMAN, D. K. 2001. Overview of TREC 2001. In The Tenth Text REtrieval Conference (TREC 2001), E. M. Voorhees and D. K. Harman, Eds. NIST Special Publication 500-250.