Article

Abstract

In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of world-wide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
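The experiment described above can be pictured with a minimal sketch (not the authors' code): stream tokens, keep a set of words seen so far, and track how often a token is new. The simple lowercase-alphanumeric word definition and the stdin input are assumptions for illustration; the paper's stricter word classes would change the counts.

    import re
    import sys

    # Hypothetical word definition: lowercase alphanumeric runs only. This is a
    # loose stand-in for the stricter word classes investigated in the paper.
    WORD = re.compile(r"[a-z0-9]+")

    def vocabulary_growth(lines, report_every=1_000_000):
        seen = set()      # distinct words observed so far
        occurrences = 0   # total word occurrences inspected
        new_words = 0     # occurrences that introduced a previously unseen word
        for line in lines:
            for token in WORD.findall(line.lower()):
                occurrences += 1
                if token not in seen:
                    seen.add(token)
                    new_words += 1
                if occurrences % report_every == 0:
                    print(f"{occurrences} occurrences, {len(seen)} distinct, "
                          f"1 new word in {occurrences // new_words}")
        return occurrences, len(seen)

    if __name__ == "__main__":
        # Reads plain text from stdin; web data would need its markup stripped first.
        total, distinct = vocabulary_growth(sys.stdin)
        print(f"total={total} distinct={distinct}")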


... The high variance (11,885) of item length is mainly due to the large diversity of the description fields, which can be either missing (or a simple URL) or, conversely, a long text (even an entire HTML document). Figure 4 plots the number of items vs their length. A long-tail curve is observed in the item length distribution, as also reported in the literature for the size of Web documents [28]. 51.39% of the items have a length between 21 and 50 terms, and 14% between 8 and 20 terms. ...
... However, as can be seen in Figure 5, the corresponding curve for V W has a significant deviation from Zipf's law, i.e., from a straight line in log-log scale. This deviation is smaller for the V W curve. Similar deviations have already been reported for web-related text collections [2, 17, 1, 28, 10], and few attempts have been made to devise more adequate distributions. [18] tried to generalize Zipf's law by proposing a Zipf-Mandelbrot distribution, while [13] suggested using a Modified Power Law distribution. ...
... where n is the number of collected items, while K and β (taking values in [0, 1]) are constants that depend strongly on the characteristics of the analyzed text corpora and on the model used to extract the terms [28]. β determines how fast the vocabulary evolves over time, with typical values ranging between 0.4 and 0.6 for medium-sized, homogeneous text collections [2]. ...
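For reference, the Heaps'-law form these excerpts rely on is V(n) = K · n^β, where V is the vocabulary size after n term occurrences. With an illustrative (not measured) choice of constants, say K = 40 and β = 0.5, a collection of n = 10^9 term occurrences would be expected to contain roughly

    V(10^9) = 40 × (10^9)^0.5 ≈ 40 × 31,623 ≈ 1.3 × 10^6 distinct terms,

whereas the larger exponents reported for noisy Web text push this figure up considerably.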
Conference Paper
Full-text available
We are witnessing a widespread adoption of web syndication technologies such as RSS or Atom for the timely delivery of frequently updated Web content. Almost every personal weblog, news portal, or discussion forum nowadays employs RSS/Atom feeds to enhance the pull-oriented searching and browsing of web pages with push-oriented delivery of web content. Social media applications such as Twitter or Facebook also employ RSS to notify users about newly available posts from their preferred friends. Unfortunately, previous works on RSS/Atom statistical characteristics do not provide a precise and up-to-date characterization of feeds' behavior and content, a characterization that can be used to successfully benchmark the effectiveness and efficiency of various RSS processing/analysis techniques. In this paper, we present the first thorough analysis of three complementary features of real-scale RSS feeds, namely, publication activity, item structure and length, as well as the vocabulary of their content, which we believe are crucial for Web 2.0 applications.
... The high variance (11,885.90) of item length is mainly due to the large diversity of the description fields, which can be either missing (or a simple URL) or, conversely, a long text (even an entire HTML document). Figure 5 plots the number of items vs their length. A long-tail curve is observed in the item length distribution, as also reported in the literature for the size of Web documents [30]. 51.39% of the items have a length between 21 and 50 terms, and 14% between 8 and 20 terms. ...
... where n is the number of collected items, while K and β (taking values in [0, 1]) are constants that depend strongly on the characteristics of the analyzed text corpora and on the model used to extract the terms [30]. β determines how fast the vocabulary evolves over time, with typical values ranging between 0.4 and 0.6 for medium-sized, homogeneous text collections [2]. [30] reports a Heaps' law exponent lying outside this range (β = 0.16) for a 500 MB collection of documents from the Wall Street Journal. Table 12 specifies the constants chosen for the Heaps' laws approximating the global vocabulary growth as well as V W and V W. Clearly, the Heaps' law exponent (β) of the global vocabulary is affected by the evolution of V W rather than by V W, whose size is significantly smaller (attributed to the slow acquisition of less commonly used terms). The exponent for V W (0.675), slightly higher than those reported in the literature [2, 17], indicates a faster increase of the vocabulary size due to the aforementioned language imperfections of the items in our testbed. ...
... Figure 5 plots the number of items vs their length. A long-tail curve is observed in the item length distribution, as also reported in the literature for the size of Web documents [30]. 51.39% of the items have a length between 21 and 50 terms, and 14% between 8 and 20 terms. ...
... Figure 5 plots the number of items vs their size. A long-tail curve is observed for the distribution of item sizes as also reported in the literature for the size of Web documents (Williams and Zobel, 2005). Further, 51.39 per cent of the items are sized between 21 and 50 terms and 14 per cent between 8 and 20 terms. ...
... A clear deviation from a straight line is also reported for Web-related texts (Manning et al., 2008;Baeza-Yates and Ribeiro-Neto, 1999;Ahmad and Kondrak, 2005;Dhillon et al., 2001). Several other studies over statistics of Web documents (French, 2002), Web queries (Spink et al., 2001;Williams and Zobel, 2005;Jansen et al., 1998;König et al., 2009), blogs (Lambiotte et al., 2007) and sponsored data on the Web (König et al., 2009) also observe this deviation for their results, and few attempts have been made to devise more adequate distributions. ...
... where n is the number of collected items, whereas K and β (taking values in [0, 1]) are constants, depending strongly on the characteristics of the analyzed text corpora and on the model used to extract the terms (Williams and Zobel, 2005). β determines how fast the vocabulary evolves over time with typical values ranging between 0.4 and 0.6 for medium-sized and homogeneous text collections (Baeza-Yates and Ribeiro-Neto, 1999). ...
Article
Full-text available
Purpose – The purpose of this paper is to present a thorough analysis of three complementary features of real-scale really simple syndication (RSS)/Atom feeds, namely, publication activity, item characteristics and their textual vocabulary, which the authors believe are crucial for emerging Web 2.0 applications. Previous works on RSS/Atom statistical characteristics do not provide a precise and up-to-date characterization of feeds’ behavior and content, a characterization that can be used to successfully benchmark the effectiveness and efficiency of various Web syndication processing/analysis techniques. Design/methodology/approach – The authors’ empirical study relies on a large-scale testbed acquired over an eight-month campaign in 2010. They collected a total of 10,794,285 items originating from 8,155 productive feeds. The authors analyze in depth feed productivity (types and bandwidth), content (XML, text and duplicates) and textual content (vocabulary and buzz-words). Findings – The findings of the study are as follows: 17 per cent of feeds produce 97 per cent of the items; feed publication rates are formally characterized using a modified power law; the most popular textual elements are the title and description, with an average size of 52 terms; cumulative item size follows a lognormal distribution, varying greatly with feed type; 47 per cent of the feed-published items share the same description; the vocabulary does not belong to WordNet terms (4 per cent); vocabulary growth is characterized using Heaps’ law and the number of term occurrences by a stretched exponential distribution; and the ranking of terms does not vary significantly for frequent terms. Research limitations/implications – Modeling the capacities of dedicated Web applications, defining benchmarks and optimizing Publish/Subscribe index structures. Practical implications – It especially opens many possibilities for tuning Web applications, such as an RSS crawler designed with a resource allocator and a refreshing strategy based on Gini values and their evolution to predict bursts for each feed, according to its category and class; an indexing structure that matches textual items’ content and takes into account item size according to targeted feeds, the size of the vocabulary and term occurrences, updates of the vocabulary and evolution of term ranks, and typo and misspelling correction; and filtering by pruning items for content duplicates across different feeds and correlating terms to easily detect replicates. Originality/value – A content-oriented analysis of dynamic Web information.
... [Oita and Senellart, 2010], based on an analysis of 400 feeds, concludes that only a small number of items are large. Finally, concerning the characterization of the vocabulary of data published on the Web, [Williams and Zobel, 2005] studied 5.5 million Web documents from which they extracted 10 million distinct terms. They found that a Heaps' law characterizes the growth of the vocabulary size, and that its parameters depend on the data set used as well as on the term-extraction process. ...
... The same behavior has been observed for Web queries [Zien et al., 2001]. [Williams and Zobel, 2005; Sia et al., 2007; Hu and Chou, 2009] also studied the distribution of term frequencies and characterized it with a Zipf law. Other work has characterized term frequencies with other distributions. ...
... We can observe a long tail. It corresponds to the one described in the literature for the distribution of Web document sizes [Williams and Zobel, 2005]. We can see that 51.39% of the items have a size between 21 and 50 terms, and 14% between 8 and 20 terms. ...
Article
Based on a Publish/Subscribe paradigm, Web Syndication formats such as RSS have emerged as a popular means for timely delivery of frequently updated Web content. According to these formats, information publishers provide brief summaries of the content they deliver on the Web, while information consumers subscribe to a number of RSS feeds and get informed about newly published items. The goal of this thesis is to propose a notification system which scales on the Web. To deal with this issue, we should take into account the large number of users on the Web and the high publication rate of items. We propose a keyword-based index for user subscriptions to match it on the fly with incoming items. We study three indexing techniques for user subscriptions. We present analytical models to estimate memory requirements and matching time. We also conduct a thorough experimental evaluation to exhibit the impact of critical workload parameters on these structures. For subscriptions which are never notified, we adapt the indexes to support a partial matching between subscriptions and items. We integrate a diversity and novelty filtering step in our system in order to decrease the number of notified items for short subscriptions. This filtering is based on the set of items already received by the user.
... The process of computing the semi-static model is complicated by the fact that the number of words and non-words appearing in large web collections is high. If we stored all words and non-words appearing in the collection, and their associated frequency, many gigabytes of RAM or a B-tree or similar on-disk structure would be required [23]. Moffat et al. [14] have examined schemes for pruning models during compression using large alphabets, and conclude that rarely occurring terms need not reside in the model. ...
... Each machine, therefore, requires RAM to hold the following: the CTS model, which should be 1/1000 of the size of the uncompressed collection (using results in Figure 5 and Williams et al. [23]). Assuming an average uncompressed document size of 8 KB [11], this would require N/M × 8.192 bytes of memory. ...
Conference Paper
Full-text available
The presentation of query biased document snippets as part of results pages presented by search engines has become an expectation of search engine users. In this paper we explore the algorithms and data structures required as part of a search engine to allow efficient generation of query biased snippets. We begin by proposing and analysing a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library. These experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and so caching documents in RAM is essential for a fast snippet generation process. Using simulation, we examine snippet generation performance for different size RAM caches. Finally we propose and analyse document reordering and compaction, revealing a scheme that increases the number of document cache hits with only a marginal affect on snippet quality. This scheme effectively doubles the number of documents that can fit in a fixed size cache.
... The work of Bharat and Broder [10] went further and estimated that World Wide Web pages are growing at a rate of 7.5 pages every second. This revolution, which the Web is witnessing, has led to two points: the first point is the entry of new words into the Web, which is estimated, according to [11], at about one new word in every two hundred words. Studies by [11], [12] and [13] have shown that this invasion is mainly due to neologisms, first occurrences of rare personal names and place names, abbreviations, acronyms, emoticons, URLs and typographical errors. ...
... This revolution, which the Web is witnessing, has led to two points: the first point is the entry of new words into the Web, which is estimated, according to [11], at about one new word in every two hundred words. Studies by [11], [12] and [13] have shown that this invasion is mainly due to neologisms, first occurrences of rare personal names and place names, abbreviations, acronyms, emoticons, URLs and typographical errors. The second point is that users employ these new words during search. ...
Conference Paper
Full-text available
The massive growth of information and the exponential increase in the number of documents published and uploaded online each day have led to the appearance of new words on the Internet. Owing to the difficulty of reaching the meanings of these new terms, which play a central role in retrieving the desired information, it becomes necessary to give more importance to the sites and topics where these new words appear, or rather, to give value to the words that occur frequently with them. For this purpose, in this paper, we propose a new term-term similarity measure based on the co-occurrence and closeness of words. It relies on searching, for each query feature, the locations where it appears, then selecting from these locations the words which often neighbor and co-occur with the query features, and finally using the selected words in the retrieval process. Our experiments were performed using the OHSUMED test collection and show significant performance enhancement over the state-of-the-art.
... In order to clarify whether it is meaningful to estimate a vocabulary size in the limit of infinitely large databases, it is essential to understand not only the birth and death of words [4][5][6], but also the process governing the usage of new words and its dependence on database size. The interest in this problem is motivated by fundamental linguistic studies [7,8] as well as by applications in search engines, which require an estimation of the number of different words in a given database [9][10][11]. ...
... Simply knowing the database size (in number of words, M, or potentially in bits), and using the language-dependent parameters (C, N_c^max = b*, γ = γ* − 1) reported above, from Eq. (2) one can immediately estimate the expected number of different words, N, appearing more than n times. This is crucial for search engines and data-mining programs because it allows for an estimation of the memory to be allocated prior to the scanning of an unknown database, e.g., in the construction of the inverted index [9][10][11]. Even the fluctuations around this expectation can be easily computed through our generative model or through the Poisson assumption of word usage. ...
Article
Full-text available
We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core-words which have higher frequency and do not affect the probability of a new word to be used; and (ii) the remaining virtually infinite number of noncore-words which have lower frequency and once used reduce the probability of a new word to be used in the future. Our model relies on a careful analysis of the google-ngram database of books published in the last centuries and its main consequence is the generalization of Zipf's and Heaps' law to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language but not on the database. From the point of view of our model the main change on historical time scales is the composition of the specific words included in the finite list of core-words, which we observe to decay exponentially in time with a rate of approximately 30 words per year for English.
... Intuitively, larger collections with diverse topics need more samples while smaller, topic-specific ones might need less. Williams and Zobel [2005] have shown that even after processing about 45 GB of web data, vocabulary growth does not converge to zero; the rate of discovery of new unique terms stabilized at about one in every 400 term occurrences. ...
... According to Williams and Zobel [2005], continued sampling will always continue to find new words but the rate will decrease. The rate for completeness drops more rapidly than that for the unique terms. ...
... For example, vast quantities of information are held as text documents, ranging from news archives, law collections, and business reports to repositories of documents gathered from the web. These collections contain many millions of distinct words, the number growing more or less linearly with collection size [52]. Other applications also involve large numbers of distinct strings, such as bibliographic databases and indexes for genomic data [51]. ...
... The most frequent words in the collection are "the", "of", and "and" respectively; the word "the" occurs as about one word in seventeen, almost twice as often as the second most frequent term. (The behaviour described, rather inaccurately, by Zipf's distribution [52,54].) On the other hand, more than 200,000 words (40% of the total) occur only once. ...
Article
Full-text available
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
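A minimal burst-trie sketch may help make the idea above concrete (an illustration of the principle, not the authors' implementation): small groups of string suffixes live in leaf containers, and a container "bursts" into a trie node once it holds more than a threshold number of distinct suffixes. The threshold of 4 and the dictionary-based container are arbitrary choices here.

    BURST_LIMIT = 4  # hypothetical burst threshold; the paper tunes this parameter

    class Container:
        """Leaf container holding string suffixes and their counts."""
        def __init__(self):
            self.items = {}     # suffix -> count

    class Node:
        """Internal trie node with one child (Node or Container) per character."""
        def __init__(self):
            self.children = {}  # char -> Node or Container
            self.end_count = 0  # count for strings exhausted at this node

    def insert(node, word):
        if not word:
            node.end_count += 1
            return
        head, tail = word[0], word[1:]
        child = node.children.get(head)
        if child is None:
            child = node.children[head] = Container()
        if isinstance(child, Container):
            child.items[tail] = child.items.get(tail, 0) + 1
            if len(child.items) > BURST_LIMIT:     # burst: replace container by a node
                new_node = Node()
                for suffix, count in child.items.items():
                    for _ in range(count):
                        insert(new_node, suffix)
                node.children[head] = new_node
        else:
            insert(child, tail)

    def count(node, word):
        """Return how many times `word` has been inserted."""
        if not word:
            return node.end_count
        child = node.children.get(word[0])
        if child is None:
            return 0
        if isinstance(child, Container):
            return child.items.get(word[1:], 0)
        return count(child, word[1:])

    # Usage: accumulate word frequencies, as during index construction.
    root = Node()
    for w in ["the", "then", "they", "them", "theory", "the", "trie"]:
        insert(root, w)
    assert count(root, "the") == 2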
... We explore a novel way of expanding queries, where -instead of using the top ranked documents as sources for expansion terms -we select terms from past user queries that are associated with documents in the collection. Since user queries are typically carefully constructed to retrieve information, the vocabulary employed is therefore more controlled than that of web pages, which often consist of unrelated terms that for instance co-occur in tables (see also Williams and Zobel, 2005). ...
... However, for the experiments detailed in this chapter and in Chapter 7, we made several changes to the search engine: in contrast to the experiments run in Chapters 4 and 5, any indexes used do not contain term offsets; the vocabulary is initially stored on disk, but term information is cached permanently, once it has been retrieved for the first time. Finally, we indexed only terms that contain no more than four non-alphabetical characters (Williams and Zobel, 2005). This reduces the total number of unique terms by between 30% to 40% (depending on which collection is being indexed), and leads to a similar decrease in the combined size of the inverted lists. ...
Article
Full-text available
... -New words are continuously being introduced in the World Wide Web. According to Williams and Zobel [41], there is one new word in every two hundred words. Studies by [16,39,41] showed that this invasion is mainly owing to: neologisms, first occurrences of rare personal names/place names, abbreviations, acronyms, emoticons, URLs and typographical errors. ...
... According to Williams and Zobel [41], there is one new word in every two hundred words. Studies by [16,39,41] showed that this invasion is mainly owing to: neologisms, first occurrences of rare personal names/place names, abbreviations, acronyms, emoticons, URLs and typographical errors. -The Web users are constantly exploiting these new words in their search queries. ...
Article
Full-text available
The difficulty of disambiguating the sense of the incomplete and imprecise keywords that are extensively used in search queries has caused the failure of search systems to retrieve the desired information. One of the most powerful and promising methods to overcome this shortcoming and improve the performance of search engines is Query Expansion, whereby the user's original query is augmented by new keywords that best characterize the user's information needs and produce a more useful query. In this paper, a new Firefly Algorithm-based approach is proposed to enhance the retrieval effectiveness of query expansion while maintaining low computational complexity. In contrast to the existing literature, the proposed approach uses a Firefly Algorithm to find the best expanded query among a set of expanded query candidates. Moreover, this new approach allows the determination of the length of the expanded query empirically. Experimental results on MEDLINE, the on-line medical information database, show that our proposed approach is more effective and efficient compared to the state-of-the-art.
... -New terms are introduced in the Web. According to Williams and Zobel (2005), there is one new term in every two hundred terms. -Users employ these new terms in their search request. ...
Article
One of the most successful techniques to improve retrieval effectiveness and overcome the shortcomings of search engines is Query Expansion (QE). Despite its effectiveness, QE still suffers from drawbacks that have limited its deployment as a standard component in search systems. Its major weakness is the computational cost, especially for large-scale data sources. To cope with this issue, we first propose in this paper a judicious modeling of query expansion with a new and original metaheuristic, namely the Bat-Inspired Approach, to enhance retrieval efficiency. Next, this approach is used to find both the best expansion keywords and the best relevant documents simultaneously, unlike previous works where these two tasks are performed sequentially. Our computational experiments undertaken on MEDLINE, the on-line medical database, show that our approach significantly enhances retrieval efficiency over state-of-the-art methods.
... -New words are introduced in the World Wide Web. According to Williams and Zobel [8], there is one new word in every two hundred words. -Users employ these new words in their search queries. ...
Chapter
Query expansion (QE) has long been suggested as an effective way to improve retrieval effectiveness and overcome the shortcomings of search engines. Notwithstanding its performance, QE still suffers from limitations that have restricted its deployment as a standard component in search systems. Its major drawback is retrieval efficiency, especially for large-scale data sources. To overcome this issue, we first put forward a new modeling of query expansion with a new and original metaheuristic, namely the Bat-Inspired Approach, to improve the computational cost. Then, this approach is used to retrieve both the best expansion keywords and the best relevant documents simultaneously, unlike previous works where these two tasks are performed sequentially.
... As sampling continues, the slope becomes flatter. Based on previous work [Williams and Zobel, 2005], continued sampling will always continue to find new words, but the rate will decrease. Note that the rate for significant terms drops more rapidly than for terms. ...
Conference Paper
Full-text available
The goal of distributed information retrieval is to support effective searching over multiple document collections. For efficiency, queries should be routed to only those collections that are likely to contain relevant documents, so it is necessary to first obtain information about the content of the target collections. In an uncooperative environment, query probing — where randomly-chosen queries are used to retrieve a sample of the documents and thus of the lexicon — has been proposed as a technique for estimating statistical term distributions. In this paper we rebut the claim that a sample of 300 documents is sufficient to provide good coverage of collection terms. We propose a novel sampling strategy and experimentally demonstrate that sample size needs to vary from collection to collection, that our methods achieve good coverage based on variable-sized samples, and that we can use the results of a probe to determine when to stop sampling.
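The stopping criterion argued for above can be sketched roughly as follows (illustrative only: `sample_documents`, the batch size, and the 0.5% threshold are assumptions, not the authors' settings): keep issuing probe queries and stop once newly sampled documents contribute almost no unseen terms.

    import random
    import re

    TOKEN = re.compile(r"[a-z]+")

    def probe_until_stable(sample_documents, queries, min_new_rate=0.005, batch=50):
        """Query-probe a collection until the rate of new terms falls below a threshold.

        `sample_documents(query)` is assumed to return the text of a few documents
        retrieved for `query` from the (uncooperative) target collection.
        """
        vocab, sampled = set(), 0
        while queries:
            new_terms, seen_terms = 0, 0
            for _ in range(min(batch, len(queries))):
                q = queries.pop(random.randrange(len(queries)))
                for doc in sample_documents(q):
                    sampled += 1
                    for t in TOKEN.findall(doc.lower()):
                        seen_terms += 1
                        if t not in vocab:
                            vocab.add(t)
                            new_terms += 1
            if seen_terms and new_terms / seen_terms < min_new_rate:
                break    # vocabulary coverage has (approximately) stabilised
        return vocab, sampled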
... The number of documents that have at least one association grows rapidly when processing of the query log starts. However, as processing continues, the rate at which new documents are retrieved diminishes, and appears to approach a limit; this is a similar trend to that of the occurrence of new words in web documents [23]. When processing has completed, the number of documents with no associations is 1.22 million, and the number with at least one is nearly 470,000. ...
Conference Paper
Full-text available
We introduce a novel technique for document summarisation which we call query association. Query association is based on the notion that a query that is highly similar to a document is a good descriptor of that document. For example, the user query "richmond football club" is likely to be a good summary of the content of a document that is ranked highly in response to the query. We describe this process of defining, maintaining, and presenting the relationship between a user query and the documents that are retrieved in response to that query. We show that associated queries are an excellent technique for describing a document: for relevance judgement, associated queries are as effective as a simple online query-biased summarisation technique. As future work, we suggest additional uses for query association including relevance feedback and query expansion.
... The vocabulary grows in size as bufferloads are incorporated into the index, and is processed sequentially in its entirety at every merge, even when the merge is in partition 1 to carry out b + b → 2b. Indeed, the vocabulary is approximately proportional in size to the number of pointers in the index, reflecting the observation that new words appear at a steady rate no matter how large the collection has already grown (Williams and Zobel, 2005). In our experiments, detailed below, new words were encountered at a rate of one per three hundred pointers, leading to the final value v = 3 × 10 7 used in the previous calculations. ...
Conference Paper
Full-text available
Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line merge-based methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line index construction -- that is, how to build an inverted index when the underlying data must be continuously queryable, and the documents must be indexed and available for search as soon as they are inserted. When straightforward approaches are used, document insertions become increasingly expensive as the size of the database grows. This paper describes a mechanism based on controlled partitioning that can be adapted to suit different balances of insertion and querying operations, and is faster and scales better than previous methods. Using experiments on 100GB of web data we demonstrate the efficiency of our methods in practice, showing that they dramatically reduce the cost of on-line index construction.
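The b + b → 2b merging discipline mentioned in the excerpt above can be sketched as a toy schedule (a simplification with radix 2, tracking partition sizes only and ignoring the actual merging of postings):

    def add_bufferload(partitions, b):
        """Add one flushed bufferload of size b, merging equal-sized partitions.

        `partitions` maps a level to the size of the partition held at that level,
        with at most one partition per level, as in a binary/geometric scheme.
        """
        level, size = 1, b
        while level in partitions:
            size += partitions.pop(level)   # b + b -> 2b, then 2b + 2b -> 4b, ...
            level += 1
        partitions[level] = size
        return partitions

    # Usage: simulate indexing 8 bufferloads of 1,000,000 postings each.
    parts = {}
    for _ in range(8):
        add_bufferload(parts, 1_000_000)
    print(parts)   # {4: 8000000} -- every bufferload ends up in one merged partition

The controlled and geometric variants described in these papers bound how often such merges touch the large partitions, trading some query-time cost for much cheaper insertion.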
... For each category, we downloaded up to 150 websites that were members of that category; each web site was crawled to a depth of two levels, and a maximum of 10 HTML pages per website were collected. We preprocessed all documents by removing HTML tags and other markup, following the approach of Williams and Zobel [18]; we also removed documents that were unsuccessfully retrieved, for example those that resulted from an HTTP 404 or "Not Found" response. The final collection comprised 5,296 documents. ...
Conference Paper
On the Web, browsing and searching categories is a popular method of finding documents. Two well-known category-based search systems are the Yahoo! and DMOZ hierarchies, which are maintained by experts who assign documents to categories. However, manual categorisation by experts is costly, subjective, and not scalable with the increasing volumes of data that must be processed. Several methods have been investigated for effective automatic text categorisation. These include selection of categorisation methods, selection of pre-categorised training samples, use of hierarchies, and selection of document fragments or features. In this paper, we further investigate categorisation into Web hierarchies and the role of hierarchical information in improving categorisation effectiveness. We introduce new strategies to reduce errors in hierarchical categorisation. In particular, we propose novel techniques that shift the assignment into higher level categories when lower level assignment is uncertain. Our results show that absolute error rates can be reduced by over 2%.
... We use single words as the features of a document, so that each document is reduced to an unordered set of words or terms. Our particular definition of a term is the class ALNUM [29], where each term is delimited by whitespace and may contain at most one hyphen, one apostrophe, and no more than one consecutive digit. ...
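A hedged sketch of a tokenizer in the spirit of the ALNUM class described above (an approximation for illustration, not the exact definition of [29]): words are whitespace-delimited and may contain at most one hyphen, at most one apostrophe, and no more than one consecutive digit.

    import re

    def alnum_words(text):
        """Extract word-like tokens, approximating an ALNUM-style word definition."""
        words = []
        for raw in text.split():
            token = raw.strip(".,;:!?()[]\"'").lower()
            if not token:
                continue
            if token.count("-") > 1 or token.count("'") > 1:
                continue                      # at most one hyphen and one apostrophe
            if re.search(r"\d\d", token):
                continue                      # no more than one consecutive digit
            if not re.fullmatch(r"[a-z0-9'-]+", token):
                continue                      # letters, digits, hyphen, apostrophe only
            words.append(token)
        return words

    print(alnum_words("The B-52's 3rd flight in 1952 wasn't low-cost."))
    # ['the', '3rd', 'flight', 'in', "wasn't", 'low-cost']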
Conference Paper
Categorisation is a useful method for organising documents into subcollections that can be browsed or searched to more accurately and quickly meet information needs. On the Web, category-based portals such as Yahoo! and DMOZ are extremely popular: DMOZ is maintained by over 56,000 volunteers, is used as the basis of the popular Google directory, and is perhaps used by millions of users each day. Support Vector Machines (SVM) is a machine-learning algorithm which has been shown to be highly effective for automatic text categorisation. However, a problem with iterative training techniques such as SVM is that during their learning or training phase, they require the entire training collection to be held in main-memory; this is infeasible for large training collections such as DMOZ or large news wire feeds. In this paper, we show how inverted indexes can be used for scalable training in categorisation, and propose novel heuristics for a fast, accurate, and memory efficient approach. Our results show that an index can be constructed on a desktop workstation with little effect on categorisation accuracy compared to a memory-based approach. We conclude that our techniques permit automatic categorisation using very large training collections, vocabularies, and numbers of categories.
... Some authors believe the vocabulary size should stabilize for huge enough texts because the number of different words in English is finite (Baeza-Yates & Ribeiro-Neto, 1999). After inspecting 45 gigabytes of WWW documents, Williams and Zobel (2005) found that new words continue to occur due to spelling errors or emerging terms. The value of b is 0.59 for the first ten million word occurrences in the WWW documents. ...
Article
The power-law regularities have been discovered behind many complex natural and social phenomena. We discover that power-law regularities, especially Zipf's and Heaps' laws, also exist in large-scale software systems. We find that the distribution of lexical tokens in modern Java, C++ and C programs follows the Zipf–Mandelbrot law, and the growth of program vocabulary follows Heaps' law. The results are obtained through empirical analysis of real-world software systems. We believe our discovery reveals the statistical regularities behind computer programming.
... Each web site was crawled to a depth of two or to a maximum of 10 web pages, whichever limit is reached first. We extracted words [23] from each document, and removed documents that had no content; for example, we removed documents that resulted from the HTTP response 404 or "Not Found". ...
Article
Full-text available
Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using small numbers of features from pre-categorised training documents. Overall, we show that by using a few terms, categorisation accuracy can be improved substantially: unstructured leaf level categorisation can be improved by up to 8.6%, while top-down hierarchical categorisation accuracy can be improved by up to 12%. In addition, unlike other feature selection models --- which typically require different feature selection parameters for categories at different hierarchical levels --- our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.
... Term partitioning also offers advantages through its smaller per-node vocabulary. The vocabulary size advantage over document partitioning is muted when only the on-disk or latent vocabulary is considered, since the vocabularies of different document subsets differ substantially, due to the ongoing occurrence of new (predominantly non-dictionary) terms [Williams and Zobel, 2005]. ...
Article
Web-scale search engines deal with a volume of data and queries that forces them to make use of an index partitioned across many machines. Two main methods of partitioning an index for distributed processing have been described in the literature. In document partitioning, each processor node holds the information for a subset of documents, while in term partitioning, each node holds the information for a subset of terms. Additionally, a novel architecture, pipelining, has been proposed, offering to combine the best features of both architectures. This thesis develops a careful methodology for the experimental comparison of distributed information retrieval architectures, addressing questions such as experiment scalability and query set generation. Novel methods are proposed for accumulator pruning, and for compression of accumulators for shipping between nodes in the pipelined architecture. A meticulous experimental assessment of the three distributed architectures is then undertaken. The results demonstrate that term distribution suffers a severe processing bottleneck. Pipelining resolves term distribution's processing bottleneck, while maintaining its superior I/O characteristics. However, pipelining suffers from serious load imbalance between the nodes, fails to fully utilise the cluster's processing capacity, and scales poorly. Document distribution, in contrast, distributes workload evenly and scales well. Load balancing through the intelligent assignment of terms to partitions is explored, but fails to fully resolve the imbalance of the pipelined architecture. Instead, the partial replication of high-workload terms is proposed, coupled with the intelligent routing of queries. These techniques resolve pipelining's load imbalance, allowing it to marginally outperform document distribution. The partially-replicated pipelined architecture is shown to benefit from sys-
... Figure 4 shows that if a peer already contains 0.4% of the TREC collection, it would have had to add approximately 3000 more documents, totaling 800,000 more terms, to have found an additional 1000 unique terms. (The trend we found in Figure 4 is consistent with that found by a much larger study of word distribution [25].) Figure 5(a) plots the simulated propagation times for six scenarios: ...
Article
Full-text available
We introduce PlanetP, a content addressable publish/subscribe service for unstructured peer-to-peer (P2P) communities. PlanetP supports content addressing by providing: (1) a gossiping layer used to globally replicate a membership directory and an extremely compact content index, and (2) a completely distributed content search and ranking algorithm that helps users find the most relevant information. PlanetP is a simple, yet powerful system for sharing information. PlanetP is simple because each peer must only perform a periodic, randomized, point-to-point message exchange with other peers. PlanetP is powerful because it maintains a globally content-ranked view of the shared data. Using simulation and a prototype implementation, we show that PlanetP achieves ranking accuracy that is comparable to a centralized solution and scales easily to several thousand peers while remaining resilient to rapid membership changes.
... Other term distribution models, including the K-mixture and two-Poisson model, are discussed by Manning and Schütze [15] . Williams and Zobel present a detailed study of vocabulary growth in large web collections [25]. van Leijenhorst and van der Weide formally derive Heaps' law from a generalized version of Zipf's law [23]; for a more heuristic derivation, see [2]. ...
Conference Paper
Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
... We chose to extract non-unique words to reflect the real-world stemming problem encountered in text search, document summarisation, and translation. The frequency of word occurrence in normal usage is highly skewed [Williams and Zobel, 2005]; there are a small number of words that are very common, and a large number of words that are used infrequently. In English, for example, "the" appears about twice as often as the next most common word; a similar phenomenon exists in Indonesian, where "yang" (a relative pronoun that is similar to "who", "which", or "that", or "the" if used with an adjective, as mentioned in Section 2.1.5)
... Each web site was crawled to a depth of two or to a maximum of 10 web pages, whichever limit is reached first. We extracted words [23] from each document, and removed documents that had no content; for example, we removed documents that resulted from the HTTP response 404 or "Not Found". The final collection contained 5,296 documents. ...
Conference Paper
Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using small numbers of features from pre-categorised training documents. Overall, we show that by using a few terms, categorisation accuracy can be improved substantially: unstructured leaf level categorisation can be improved by up to 8.6%, while top-down hierarchical categorisation accuracy can be improved by up to 12%. In addition, unlike other feature selection models --- which typically require different feature selection parameters for categories at different hierarchical levels --- our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.
... in contrast to the experiments run in Chapters 4 and 5, any indexes used do not contain term offsets; the vocabulary is initially stored on disk, but term information is cached permanently, once it has been retrieved for the first time. Finally, we indexed only terms that contain no more than four non-alphabetical characters (Williams and Zobel, 2005). This reduces the total number of unique terms by between 30% to 40% (depending on which collection is being indexed), and leads to a similar decrease in the combined size of the inverted lists. ...
Thesis
Full-text available
Hundreds of millions of users each day search the web and other repositories to meet their information needs. However, queries can fail to find documents due to a mismatch in terminology. Query expansion seeks to address this problem by automatically adding terms from highly ranked documents to the query. While query expansion has been shown to be effective at improving query performance, the gain in effectiveness comes at a cost: expansion is slow and resource-intensive. Current techniques for query expansion use fixed values for key parameters, determined by tuning on test collections. We show that these parameters may not be generally applicable, and, more significantly, that the assumption that the same parameter settings can be used for all queries is invalid. Using detailed experiments, we demonstrate that new methods for choosing parameters must be found. In conventional approaches to query expansion, the additional terms are selected from highly ranked documents returned from an initial retrieval run. We demonstrate a new method of obtaining expansion terms, based on past user queries that are associated with documents in the collection. The most effective query expansion methods rely on costly retrieval and processing of feedback documents. We explore alternative methods for reducing query-evaluation costs, and propose a new method based on keeping a brief summary of each document in memory. This method allows query expansion to proceed three times faster than previously, while approximating the effectiveness of standard expansion. We investigate the use of document expansion, in which documents are augmented with related terms extracted from the corpus during indexing, as an alternative to query expansion. The overheads at query time are small. We propose and explore a range of corpus-based document expansion techniques and compare them to corpus-based query expansion on TREC data. These experiments show that document expansion delivers at best limited benefits, while query expansion - including standard techniques and efficient approaches described in recent work - usually delivers good gains. We conclude that document expansion is unpromising, but it is likely that the efficiency of query expansion can be further improved.
Article
Full-text available
Inverted index structures are a core element of current text retrieval systems. They can be constructed quickly using offline approaches, in which one or more passes are made over a static set of input data, and, at the completion of the process, an index is available for querying. However, there are search environments in which even a small delay in timeliness cannot be tolerated, and the index must always be queryable and up to date. Here we describe and analyze a geometric partitioning mechanism for online index construction that provides a range of tradeoffs between costs, and can be adapted to different balances of insertion and querying operations. Detailed experimental results are provided that show the extent of these tradeoffs, and that these new methods can yield substantial savings in online indexing costs.
For the success of lexical text correction, high coverage of the underlying background dictionary is crucial. Still, most correction tools are built on top of static dictionaries that represent fixed collections of expressions of a given language. When treating texts from specific domains and areas, often a significant part of the vocabulary is missed. In this situation, both automated and interactive correction systems produce suboptimal results. In this article, we describe strategies for crawling Web pages that fit the thematic domain of the given input text. Special filtering techniques are introduced to avoid pages with many orthographic errors. Collecting the vocabulary of filtered pages that meet the vocabulary of the input text, dynamic dictionaries of modest size are obtained that reach excellent coverage values. A tool has been developed that automatically crawls dictionaries in the indicated way. Our correction experiments with crawled dictionaries, which address English and German document collections from a variety of thematic fields, show that with these dictionaries even the error rate of highly accurate texts can be reduced, using completely automated correction methods. For interactive text correction, more sensible candidate sets for correcting erroneous words are obtained and the manual effort is reduced in a significant way. To complete this picture, we study the effect when using word trigram models for correction. Again, trigram models from crawled corpora outperform those obtained from static corpora.
Article
Search engines are an essential tool for modern life. We use them to discover new information on diverse topics and to locate a wide range of resources. The search process in all practical search engines is supported by an inverted index structure that stores all search terms and their locations within the searchable document collection. Inverted indexes are highly optimised, and significant work has been undertaken over the past fifteen years to store, retrieve, compress, and understand heuristics for these structures. In this paper, we propose a new self-organising inverted index based on past queries. We show that this access-ordered index improves query evaluation speed by 25-40% over a conventional, optimised approach with almost indistinguishable accuracy. We conclude that access-ordered indexes are a valuable new tool to support fast and accurate web search.
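The access-ordering idea can be sketched in a few lines (illustrative, with invented toy structures rather than the paper's on-disk layout): postings for each term are reordered so that documents frequently returned for past queries come first, which is what lets evaluation stop early with little loss of accuracy.

    from collections import Counter

    def access_order(postings, past_results):
        """Reorder each postings list by how often its documents answered past queries.

        postings:     dict mapping term -> list of (doc_id, term_frequency)
        past_results: iterable of doc_ids returned for logged queries
        """
        accesses = Counter(past_results)
        return {term: sorted(plist, key=lambda p: accesses[p[0]], reverse=True)
                for term, plist in postings.items()}

    # Usage with toy data: document 7 was returned often, so it moves to the front.
    postings = {"web": [(3, 1), (7, 2), (9, 1)]}
    print(access_order(postings, [7, 7, 7, 3]))   # {'web': [(7, 2), (3, 1), (9, 1)]}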
Conference Paper
Full-text available
Stemming words to (usually) remove suffixes has applications in text search, machine translation, document summarization, and text classification. For example, English stemming reduces the words "computer," "computing," "computation," and "computability" to their common morphological root, "comput-." In text search, this permits a search for "computers" to find documents containing all words with the stem "comput-." In the Indonesian language, stemming is of crucial importance: words have prefixes, suffixes, infixes, and confixes that make matching related words difficult. This work surveys existing techniques for stemming Indonesian words to their morphological roots, presents our novel and highly accurate CS algorithm, and explores the effectiveness of stemming in the context of general-purpose text information retrieval through ad hoc queries.
Article
Full-text available
Stemming words to (usually) remove suffixes has applications in text search, machine translation, document summarization, and text classification. For example, English stemming reduces the words "computer," "computing," "computation," and "computability" to their common morphological root, "comput-." In text search, this permits a search for "computers" to find documents containing all words with the stem "comput-." In the Indonesian language, stemming is of crucial importance: words have prefixes, suffixes, infixes, and confixes that make matching related words difficult. This work surveys existing techniques for stemming Indonesian words to their morphological roots, presents our novel and highly accurate CS algorithm, and explores the effectiveness of stemming in the context of general-purpose text information retrieval through ad hoc queries.
Article
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this paper, we review the principal approaches to inversion, analyse their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
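A much-simplified sketch of single-pass, segment-based inversion (illustrative only; real implementations compress postings and manage the vocabulary far more carefully than the pickle-based runs used here): postings accumulate in memory until a budget is reached, each full run is flushed to disk sorted by term, and the runs are merged at the end.

    import heapq
    import os
    import pickle
    import tempfile

    def invert_single_pass(docs, memory_limit=100_000):
        """docs: iterable of (doc_id, terms). Returns a dict term -> list of doc_ids."""
        runs, inmem, held = [], {}, 0

        def flush():
            nonlocal inmem, held
            path = tempfile.mktemp(suffix=".run")
            with open(path, "wb") as f:
                pickle.dump(sorted(inmem.items()), f)   # one on-disk run, sorted by term
            runs.append(path)
            inmem, held = {}, 0

        for doc_id, terms in docs:
            for t in terms:
                inmem.setdefault(t, []).append(doc_id)
                held += 1
            if held >= memory_limit:
                flush()
        if inmem:
            flush()

        # Merge the sorted runs into a single index (here just an in-memory dict).
        streams = []
        for path in runs:
            with open(path, "rb") as f:
                streams.append(pickle.load(f))
            os.remove(path)
        index = {}
        for term, postings in heapq.merge(*streams):
            index.setdefault(term, []).extend(postings)
        return index

    # Usage:
    docs = [(1, ["new", "words", "web"]), (2, ["web", "index"])]
    print(invert_single_pass(docs))
    # {'index': [2], 'new': [1], 'web': [1, 2], 'words': [1]}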
Article
Query expansion is a well-known method for improving average effectiveness in information retrieval. The most effective query expansion methods rely on retrieving documents which are used as a source of expansion terms. Retrieving those documents is costly. We examine the bottlenecks of a conventional approach and investigate alternative methods aimed at reducing query evaluation time. We propose a new method that draws candidate terms from brief document summaries that are held in memory for each document. While approximately maintaining the effectiveness of the conventional approach, this method significantly reduces the time required for query expansion by a factor of five to ten.
Conference Paper
This report outlines TREC-2008 Relevance Feedback Track experiments done at RMIT University. Relevance feedback in text retrieval systems is a process where a user gives explicit feedback on an initial set of retrieval results returned by a search system. For example, the user might mark some of the items as being relevant, or not relevant, to their current information need. This feedback can be used in different ways;
Preprint
Full-text available
Johnson–Lindenstrauss Transforms are powerful tools for reducing the dimensionality of data while preserving key characteristics of that data, and they have found use in many fields from machine learning to differential privacy and more. This note explains what they are; it gives an overview of their use and their development since they were introduced in the 1980s; and it provides many references should the reader wish to explore these topics more deeply.
Conference Paper
In this paper a new method based on utility and decision theory is presented to deal with structured documents. The aim of the application of these methodologies is to refine a first ranking of structural units, generated by means of an information retrieval model based on Bayesian networks. Units are newly arranged in the new ranking by combining their posterior probabilities, obtained in the first stage, with the expected utility of retrieving them. The experimental work has been developed using the Shakespeare structured collection and the results show an improvement of the effectiveness of this new approach.
Conference Paper
To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes — such as in images, advertisements, and headers — are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy.
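A toy version of the word-change measure described above (the twenty-word threshold comes from the abstract; the tokenisation and bag-of-words comparison are illustrative assumptions):

    import re
    from collections import Counter

    WORD = re.compile(r"[a-z0-9]+")

    def words_changed(old_text, new_text):
        """Count how many word occurrences were added or removed between versions."""
        old = Counter(WORD.findall(old_text.lower()))
        new = Counter(WORD.findall(new_text.lower()))
        return sum((new - old).values()) + sum((old - new).values())

    def needs_refresh(old_text, new_text, threshold=20):
        return words_changed(old_text, new_text) > threshold

    # A small header or advertisement tweak stays below the threshold; real content changes exceed it.
    print(needs_refresh("headline " * 100, "headline " * 100 + "banner ad"))          # False
    print(needs_refresh("headline " * 100, "headline " * 70 + "fresh story " * 20))   # True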
Article
During the last few years, it has become abundantly clear that the technological advances in information technology have led to the dramatic proliferation of information on the web and this, in turn, has led to the appearance of new words in the Internet. Due to the difficulty of reaching the meanings of these new terms, which play an essential role in retrieving the desired information, it becomes necessary to give more importance to the sites and topics where these new words appear, or rather, to give value to the words that occur frequently with them. For this purpose, in this paper, the authors propose a new robust correlation measure that assesses the relatedness of words for pseudo-relevance feedback. It is based on the co-occurrence and closeness of terms, and aims to select the appropriate words that best capture the user information need. Extensive experiments have been conducted on the OHSUMED test collection and the results show that the proposed approach achieves a considerable performance improvement over the baseline.
Conference Paper
Nowadays, more and more people outsource their data to cloud servers for greater flexibility and economic savings. For security reasons, private data is usually encrypted before being sent to the cloud. How to use the data efficiently while preserving users' privacy is therefore a new challenge. In this paper, we focus on an efficient multi-keyword search scheme that meets a strict privacy requirement. First, we briefly review two existing schemes supporting multi-keyword search: the kNN-based MRSE scheme and a scheme based on Bloom filters. Building on the kNN-based scheme, we propose an improved scheme that encrypts the index with a product of three sparse matrix pairs instead of the original dense matrix pair, and thus achieves a significant improvement in efficiency. We then combine our improved scheme with a Bloom filter, gaining the ability to update the index. Simulation experiments show that the proposed scheme introduces low computation and storage overhead.
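The core idea behind kNN-style searchable index encryption can be sketched with a single invertible matrix: the index vector is encrypted with the matrix transpose and the query with the matrix inverse, so inner products, and hence the ranking, are preserved. This is a deliberately simplified illustration, not the authors' construction, which splits vectors and uses a product of three sparse matrix pairs for efficiency.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                               # size of the keyword dictionary

# Secret key: a random invertible matrix (a sparse matrix product in the paper).
M = rng.standard_normal((d, d))
M_inv = np.linalg.inv(M)

p = rng.integers(0, 2, size=d).astype(float)   # binary index vector of a document
q = rng.integers(0, 2, size=d).astype(float)   # binary query vector

enc_index = M.T @ p        # stored on the cloud server
enc_query = M_inv @ q      # submitted as the trapdoor

# The server can compute similarity scores without seeing p or q:
assert np.isclose(enc_index @ enc_query, p @ q)
```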
Conference Paper
To protect the privacy of users, sensitive data needs to be encrypted before being outsourced to the cloud, which makes effective data retrieval a very difficult task. In this paper, we propose a novel order-preserving encryption (OPE) based ranked search scheme over encrypted cloud data, which uses encrypted keyword frequencies to rank the results and provides accurate results via a two-step ranking strategy. The first step coarsely ranks the documents using the measure of coordinate matching, i.e., classifying the documents according to the number of query terms included in each document. In the second step, for each category obtained in the first step, a fine ranking is performed by adding up the encrypted scores. Extensive experiments show that this new method is indeed an advanced solution for secure multi-keyword retrieval.
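A minimal sketch of the two-step ranking strategy described above: documents are first grouped by how many query terms they contain (coordinate matching), and ties within a group are broken by the sum of per-term scores. The plaintext scores below stand in for the encrypted, order-preserving scores used in the paper, and the data is made up for illustration.

```python
def rank(documents, query_terms):
    """documents: dict mapping doc_id -> {term: score}.
    Returns doc_ids ordered by (number of query terms matched, total score)."""
    def key(doc_id):
        scores = documents[doc_id]
        matched = [t for t in query_terms if t in scores]
        return (len(matched), sum(scores[t] for t in matched))
    return sorted(documents, key=key, reverse=True)

docs = {"d1": {"cloud": 3, "search": 1},
        "d2": {"cloud": 5},
        "d3": {"cloud": 2, "search": 4, "privacy": 1}}
print(rank(docs, ["cloud", "search"]))   # ['d3', 'd1', 'd2']
```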
Article
Full-text available
In this paper we combine statistical analysis of large text databases and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps' law), here we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words by simple stochastic processes in which the overall distribution of word frequencies is fat tailed (Zipf's law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and ii) correlations in the co-occurrence of words lead to an increase in the variance, and the vocabulary size becomes a non-self-averaging quantity. We address the implications of these observations for the measurement of lexical richness. We test our results in three large text databases (Google-ngram, English Wikipedia, and a collection of scientific articles).
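For reference, the two scaling laws named in this abstract are conventionally written as follows; the symbols are standard notation rather than values taken from the paper.

```latex
% Heaps' law: sublinear growth of vocabulary size V with text size N
V(N) \approx K\,N^{\beta}, \qquad 0 < \beta < 1
% Zipf's law: fat-tailed distribution of word frequency f by rank r
f(r) \propto r^{-\alpha}, \qquad \alpha \approx 1
```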
Chapter
Full-text available
Owing to the increasing use of ambiguous and imprecise words in expressing the user’s information need, it has become necessary to expand the original query with additional terms that best capture the actual user intent. Selecting the appropriate words to be used as additional terms is mainly dependent on the degree of relatedness between a candidate expansion term and the query terms. In this paper, we propose two criteria to assess the degree of relatedness: (1) attribute more importance to terms occurring in the largest possible number of documents where the query keywords appear, (2) assign more importance to terms having a short distance with the query terms within documents. We employ the strength Pareto fitness assignment in order to satisfy both criteria simultaneously. Our computational experiments on OHSUMED test collection show that our approach significantly improves the retrieval performance compared to the baseline.
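The idea of satisfying both relatedness criteria at once can be illustrated with a simple non-dominated (Pareto) filter over candidate expansion terms scored by co-occurrence and by proximity. This is a simplified filter rather than the full strength Pareto fitness assignment used in the paper, and the term names and scores are made up for illustration.

```python
def pareto_front(candidates):
    """candidates: dict mapping term -> (co_occurrence_score, proximity_score).
    Returns the terms not dominated on both criteria (higher is better)."""
    front = []
    for term, (c, p) in candidates.items():
        dominated = any(c2 >= c and p2 >= p and (c2, p2) != (c, p)
                        for t2, (c2, p2) in candidates.items() if t2 != term)
        if not dominated:
            front.append(term)
    return front

scores = {"therapy": (0.8, 0.3), "dose": (0.5, 0.7),
          "patient": (0.4, 0.2), "trial": (0.8, 0.3)}
print(pareto_front(scores))   # ['therapy', 'dose', 'trial']
```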
Chapter
The dramatic proliferation of information on the web and the tremendous growth in the number of files published and uploaded online each day have led to the appearance of new words in the Internet. Due to the difficulty of reaching the meanings of these new terms, which play a central role in retrieving the desired information, it becomes necessary to give more importance to the sites and topics where these new words appear, or rather, to give value to the words that occur frequently with them. For this aim, in this paper, we propose a novel term-term similarity score based on the co-occurrence and closeness of words for retrieval performance improvement. A novel efficiency/effectiveness measure based on the principle of optimal information forager is also proposed in order to assess the quality of the obtained results. Our experiments were performed using the OHSUMED test collection and show significant effectiveness enhancement over the state-of-the-art.
Chapter
Query expansion (QE) has long been suggested as an effective way to improve retrieval effectiveness and overcome the shortcomings of search engines. Notwithstanding its performance, QE still suffers from limitations that have restricted its deployment as a standard component in search systems. Its major drawback is retrieval efficiency, especially for large-scale data sources. To overcome this issue, we first put forward a new modeling of query expansion based on a new metaheuristic, namely the Bat-Inspired Approach, to improve the computational cost. This approach is then used to retrieve both the best expansion keywords and the most relevant documents simultaneously, unlike previous works where these two tasks are performed sequentially.
Article
Because of users’ growing utilization of unclear and imprecise keywords when characterizing their information need, it has become necessary to expand their original search queries with additional words that best capture their actual intent. The selection of the terms that are suitable for use as additional words is in general dependent on the degree of relatedness between each candidate expansion term and the query keywords. In this paper, we propose two criteria for evaluating the degree of relatedness between a candidate expansion word and the query keywords: (1) co-occurrence frequency, where more importance is attributed to terms occurring in the largest possible number of documents where the query keywords appear; (2) proximity, where more importance is assigned to terms having a short distance from the query terms within documents. We also employ the strength Pareto fitness assignment in order to satisfy both criteria simultaneously. The results of our numerical experiments on MEDLINE, the online medical information database, show that the proposed approach significantly enhances the retrieval performance as compared to the baseline.
Chapter
During the last few years, it has become abundantly clear that the technological advances in information technology have led to the dramatic proliferation of information on the web and this, in turn, has led to the appearance of new words in the Internet. Due to the difficulty of reaching the meanings of these new terms, which play an essential role in retrieving the desired information, it becomes necessary to give more importance to the sites and topics where these new words appear, or rather, to give value to the words that occur frequently with them. For this purpose, in this paper, the authors propose a new robust correlation measure that assesses the relatedness of words for pseudo-relevance feedback. It is based on the co-occurrence and closeness of terms, and aims to select the appropriate words that best capture the user information need. Extensive experiments have been conducted on the OHSUMED test collection and the results show that the proposed approach achieves a considerable performance improvement over the baseline.
Chapter
Query expansion (QE) is one of the most effective techniques to enhance the retrieval performance and to retrieve more relevant information. It attempts to build more useful queries by enriching the original queries with additional expansion terms that best characterize the users' information needs. In this chapter, the authors propose a new correlation measure for query expansion to evaluate the degree of similarity between the expansion term candidates and the original query terms. The proposed correlation measure is a hybrid of two correlation measures. The first one is considered as an external correlation and it is based on the term co-occurrence, and the second one is considered as an internal correlation and it is based on the term proximity. Extensive experiments have been performed on MEDLINE, a real dataset from a large online medical database. The results show the effectiveness of the proposed approach compared to prior state-of-the-art approaches.
Article
Full-text available
The quality of text correction systems can be improved when using complex language models and by taking peculiarities of the garbled input text into account. We report on a series of experiments where we crawl domain-dependent web corpora for a given garbled input text. From the crawled corpora we derive dictionaries and language models, which are used to correct the input text. We show that correction accuracy is improved when integrating word bigram frequency values from the crawls as a new score into a baseline correction strategy based on word similarity and word (unigram) frequencies. In a second series of experiments we compare the quality of distinct language models, measuring how closely these models reflect the frequencies observed in a given input text. It is shown that crawled language models are superior to language models obtained from standard corpora.
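A minimal sketch of the kind of scoring these experiments describe: each correction candidate for a garbled token is scored by combining string similarity, its unigram frequency, and the frequency of the bigram it forms with the preceding word in the crawled language model. The weights, smoothing, and example counts are illustrative assumptions, not the paper's exact formula.

```python
import difflib
import math

def score_candidate(candidate, garbled, prev_word, unigrams, bigrams,
                    w_sim=1.0, w_uni=0.5, w_bi=0.5):
    """Higher is better; unigrams/bigrams map words / word pairs to counts."""
    similarity = difflib.SequenceMatcher(None, garbled, candidate).ratio()
    uni = math.log(1 + unigrams.get(candidate, 0))
    bi = math.log(1 + bigrams.get((prev_word, candidate), 0))
    return w_sim * similarity + w_uni * uni + w_bi * bi

unigrams = {"language": 120, "luggage": 15}
bigrams = {("natural", "language"): 40}
candidates = ["language", "luggage"]
best = max(candidates,
           key=lambda c: score_candidate(c, "lamguage", "natural",
                                         unigrams, bigrams))
print(best)   # language
```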
Article
Full-text available
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
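A heavily simplified, insert-only sketch of the burst-trie idea: trie nodes hold small containers of string suffixes, and a container that grows past a fixed limit "bursts" into a new trie node. The dict-based containers, the burst limit, and the omission of lookup and repeated-burst handling are simplifications for illustration, not the published design, which uses tuned container structures such as lists or binary search trees.

```python
BURST_LIMIT = 4   # container capacity before bursting (tuned empirically in the paper)

class TrieNode:
    def __init__(self):
        self.children = {}     # char -> TrieNode or Container
        self.count = 0         # occurrences of the string ending exactly here

class Container:
    """A small collection of (suffix, count) records (a dict in this sketch)."""
    def __init__(self):
        self.records = {}

class BurstTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node, i = self.root, 0
        while True:
            if i == len(word):            # string exhausted inside the access trie
                node.count += 1
                return
            child = node.children.get(word[i])
            if child is None:
                child = Container()
                node.children[word[i]] = child
            if isinstance(child, TrieNode):
                node, i = child, i + 1
                continue
            # child is a container: store the remaining suffix there
            suffix = word[i + 1:]
            child.records[suffix] = child.records.get(suffix, 0) + 1
            if len(child.records) > BURST_LIMIT:
                node.children[word[i]] = self._burst(child)
            return

    def _burst(self, container):
        """Replace a full container with a trie node whose children are new containers."""
        new_node = TrieNode()
        for suffix, count in container.records.items():
            if suffix == "":
                new_node.count += count
            else:
                sub = new_node.children.setdefault(suffix[0], Container())
                sub.records[suffix[1:]] = sub.records.get(suffix[1:], 0) + count
        return new_node
```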
Article
Full-text available
In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching. Key words: pattern matching; string indexing; approximate matching; compressed inverted files; Soundex.
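A minimal sketch of an n-gram lexicon index of the kind tested above: candidate words are gathered by shared character bigrams and then verified with a string-similarity measure. The bigram size, padding characters, and the generic similarity ratio used for verification are assumptions standing in for the paper's specific string distance and phonetic measures.

```python
from collections import defaultdict
import difflib

def ngrams(word, n=2):
    padded = f"${word}$"          # pad so prefixes and suffixes get their own grams
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(lexicon, n=2):
    index = defaultdict(set)
    for word in lexicon:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index

def approximate_matches(query, index, n=2, top_k=3):
    # Candidate generation: words sharing at least one n-gram with the query.
    candidates = set().union(*(index.get(g, set()) for g in ngrams(query, n)))
    # Verification: rank candidates by a string-similarity measure.
    return sorted(candidates,
                  key=lambda w: difflib.SequenceMatcher(None, query, w).ratio(),
                  reverse=True)[:top_k]

index = build_index(["retrieval", "receive", "believe", "relieve"])
print(approximate_matches("retreival", index))
```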
Article
In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides an insight into the public practices and choices in Web searching.
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
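The algorithm described here is the widely used Porter stemmer. Assuming an implementation such as the one shipped with NLTK is available, it can be applied as follows; the example words are illustrative.

```python
# Requires: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connection", "connected", "connecting"]:
    # All three variants reduce to the common stem "connect".
    print(word, "->", stemmer.stem(word))
```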
Article
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and is one of the most effective in terms of user acceptance and consistency, though the retrieval improvements it yields are small. Current stemming techniques do not, however, reflect the language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant cooccurrence statistics to modify or create a stemmer. The experimental results generated using English newspaper and legal text and Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that only employ morphological rules.
Article
A model of a natural language text is a collection of information that approximates the statistics and structure of the text being modeled. The purpose of the model may be to give insight into rules which govern how language is generated, or to predict properties of future samples of it. This paper studies models of natural language from three different, but related, viewpoints. First, we examine the statistical regularities that are found empirically, based on the natural units of words and letters. Second, we study theoretical models of language, including simple random generative models of letters and words whose output, like genuine natural language, obeys Zipf's law. Innovation in text is also considered by modeling the appearance of previously unseen words as a Poisson process. Finally, we review experiments that estimate the information content inherent in natural text.
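The Poisson treatment of innovation mentioned above is conventionally written as follows; this is the standard form of such a model, not necessarily the exact parameterisation used in the paper.

```latex
% Treating first occurrences of previously unseen words as a Poisson process with
% rate \lambda per word occurrence of text, the number k of new words observed in
% the next n word occurrences is distributed as
P(k \mid n) = \frac{(\lambda n)^{k}}{k!}\, e^{-\lambda n}.
```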
Article
Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. In response to the first problem, efficient pattern-matching and n-gram analysis techniques have been developed for detecting strings that do not appear in a given word list. In response to the second problem, a variety of general and application-specific spelling correction techniques have been developed. Some of them were based on detailed studies of spelling error patterns. In response to the third problem, a few experiments using natural-language-processing tools or statistical-language models have been carried out. This article surveys documented findings on spelling error patterns, provides descriptions of various nonword detection and isolated-word error correction techniques, reviews the state of the art of context-dependent word correction techniques, and discusses research issues related to all three areas of automatic error correction in text.
Article
An inverted index stores, for each term that appears in a collection of documents, a list of document numbers containing that term. Such an index is indispensable when Boolean or informal ranked queries are to be answered. Construction of the index is, however, a nontrivial task. Simple methods using in-memory data structures cannot be used for large collections because they require too much random access storage, and traditional disk-based methods require large amounts of temporary file space. This paper describes a new indexing algorithm designed to create large compressed inverted indexes in situ. It makes use of simple compression codes for the positive integers and an in-place external multi-way mergesort. The new technique has been used to invert a two-gigabyte text collection in under 4 hours, using less than 40 megabytes of temporary disk space, and less than 20 megabytes of main memory. © 1995 John Wiley & Sons, Inc.
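The sketch below illustrates only one step of the kind of process described above, the multi-way merge of sorted runs of postings; the compression codes and in-place aspects of the published algorithm are omitted, and the example runs are made up.

```python
import heapq

def merge_runs(runs):
    """Merge sorted runs of (term, doc_id) postings into one sorted stream.
    Each run may be any iterable yielding postings in sorted order."""
    # heapq.merge performs a k-way merge without loading whole runs into memory.
    yield from heapq.merge(*runs)

run1 = [("cloud", 2), ("search", 1), ("search", 7)]
run2 = [("cloud", 5), ("index", 3), ("search", 4)]
print(list(merge_runs([run1, run2])))
```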
Article
Splay and randomised search trees are self-balancing binary tree structures with little or no space overhead compared to a standard binary search tree. Both trees are intended for use in applications where node accesses are skewed, for example in gathering the distinct words in a large text collection for index construction. We investigate the efficiency of these trees for such vocabulary accumulation. Surprisingly, unmodified splaying and randomised search trees are on average around 25% slower than using a standard binary tree. We investigate heuristics to limit splay tree reorganisation costs and show their effectiveness in practice. In particular, a periodic rotation scheme improves the speed of splaying by 27%, while other proposed heuristics are less effective. We also report the performance of efficient bit-wise hashing and red-black trees for comparison.
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
In this paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most efficient. We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably effective: as we show, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequently accessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10%-40% slower than a table with around one string per slot (a table without move-to-front is perhaps 40% slower again), and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more efficient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average.
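A minimal sketch of the move-to-front chained hash table evaluated above, used here for vocabulary accumulation: the node for an accessed string is moved to the head of its chain so frequent words are found after inspecting only the first node or two. Python lists stand in for linked-list chains, and the hash function, table size, and example text are illustrative choices, not those of the paper.

```python
class MTFHashTable:
    """Chained hash table with move-to-front chains for counting distinct words."""

    def __init__(self, slots=1024):
        self.slots = [[] for _ in range(slots)]

    def add(self, word):
        chain = self.slots[hash(word) % len(self.slots)]
        for i, (w, count) in enumerate(chain):
            if w == word:
                # Move the accessed record to the front so that frequent words
                # are found after inspecting only one or two records.
                chain.insert(0, chain.pop(i))
                chain[0] = (w, count + 1)
                return
        chain.insert(0, (word, 1))   # a new word starts at the front of its chain

    def items(self):
        for chain in self.slots:
            yield from chain

table = MTFHashTable(slots=8)        # deliberately small: long chains, as in the paper
for w in "to be or not to be that is the question".split():
    table.add(w)
print(sorted(table.items()))
```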
J. Hasan. Automatic dictionary construction from large collections of text. Master's thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Australia, 2001. RT-35.
W. Li. Comments on Zipf's law and the structures and evolution of natural language. Complexity, 3(5):9-10, 1998.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, May 1999.
A. Moffat and T. A. H. Bell. In-situ generation of compressed inverted files. Journal of the American Society for Information Science, 1995.