Article

Abstract

In this paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most efficient. We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably effective: as we show, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequently accessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10%-40% slower than a table with around one string per slot (while a table without move-to-front is perhaps 40% slower again) and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more efficient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average.
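The move-to-front refinement described in the abstract is small enough to show in full. Below is a minimal sketch in C (not the authors' code; the hash function, table size, and field names are illustrative assumptions): a hit is relinked at the head of its chain, so frequent words cluster at the front, and a miss inserts the new word at the head.

```c
/* Minimal sketch of a chained hash table with move-to-front.
   Table size and hash function are placeholders, not the paper's choices. */
#include <stdlib.h>
#include <string.h>

#define NSLOTS 1021   /* deliberately small: MTF keeps long chains cheap */

typedef struct node {
    struct node *next;
    unsigned count;   /* e.g. term frequency for vocabulary accumulation */
    char str[];       /* string stored inline in the node (C99) */
} node;

static node *table[NSLOTS];

static unsigned hash(const char *s) {
    unsigned h = 5381;                     /* djb2 as a stand-in hash */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NSLOTS;
}

/* Search for s; on a hit, move its node to the front of the chain;
   on a miss, insert a new node at the front. */
node *search_insert(const char *s) {
    unsigned h = hash(s);
    node *p = table[h], *prev = NULL;
    for (; p != NULL; prev = p, p = p->next) {
        if (strcmp(p->str, s) == 0) {
            if (prev) {                    /* unlink, relink at the head */
                prev->next = p->next;
                p->next = table[h];
                table[h] = p;
            }
            p->count++;
            return p;
        }
    }
    p = malloc(sizeof *p + strlen(s) + 1); /* new word: insert at front */
    strcpy(p->str, s);
    p->count = 1;
    p->next = table[h];
    table[h] = p;
    return p;
}
```

With a skewed distribution, most calls terminate after inspecting only the first node or two, which is why the table can stay small without a large time penalty.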


... For strings with a skew distribution, such as occurrences of words in text, it was found in earlier work [10] that a standard-chain hash table is faster and more compact than sorted data structures such as tries and binary trees. Using move-to-front in the individual chains [27], the load average can reach dozens of strings per slot without significant impact on access speed, as the likelihood of having to inspect more than the first string in each slot is low. Thus a standard-chain hash table has clear advantages over open-addressing alternatives, whose performance rapidly degrades as the load average approaches 1 and which cannot be easily re-sized. ...
... Moreover, in principle there is no reason why a chained hash table could not be managed with methods designed for disk, such as linear hashing [13] and extensible hashing [19], which allow an on-disk hash table to grow and shrink gracefully. Zobel et al. [27] compared the performance of several data structures for in-memory accumulation of the vocabulary of a large text collection, and found that the standard-chain hash table, coupled with the bitwise hash function and a self-organizing list structure [12], move-to-front on access, is the fastest previous data structure available for maintaining fast access to variable-length strings. However, a standard-chain hash table is not particularly cache-efficient. ...
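The bitwise hash function mentioned in this excerpt is usually presented in a form like the following sketch; the seed and shift constants here are assumptions taken from commonly circulated versions, not necessarily those used in [27].

```c
/* One common statement of the bitwise string hash associated with
   Ramakrishna and Zobel: each character is folded in with shifts and
   XOR. Seed and shift amounts are assumed, not verified against [27]. */
unsigned bitwise_hash(const char *s, unsigned tsize) {
    unsigned h = 1159241;                  /* arbitrary large seed */
    for (; *s; s++)
        h ^= (h << 5) + (h >> 2) + (unsigned char)*s;
    return h % tsize;
}
```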
... To measure the impact of load factor, we vary the number of slots made available, using the sequence 10,000, 31,622, 100,000, 316,227 and so forth, up until a minimal execution time is observed. Both the compact-chain and standard-chain hash tables are most efficient when coupled with move-to-front on access, as suggested by Zobel et al. [27]. We therefore enabled move-to-front for the chaining methods but disabled it for the array, a decision that is justified in the next section. ...
Conference Paper
Full-text available
In-memory hash tables provide fast access to large numbers of strings, with less space overhead than sorted structures such as tries and binary trees. If chains are used for collision resolution, hash tables scale well, particularly if the pattern of access to the stored strings is skew. However, typical implementations of string hash tables, with lists of nodes, are not cache-efficient. In this paper we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. In all cases, the new structures give substantial savings in space at no cost in time. In the best case, the overhead space required for pointers is reduced by a factor of around 50, to less than two bits per string (with total space required, including 5.68 megabytes of strings, falling from 20.42 megabytes to 5.81 megabytes), while access times are also reduced.
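The two representational changes this abstract describes can be pictured as struct layouts. The sketch below is illustrative, not the paper's code: the first layout pays an extra pointer and a second allocation per string, while the second stores the string inline in its node. The more drastic step, replacing the whole chain with a contiguous character array, is sketched after the next abstract.

```c
/* Two node layouts contrasted by the abstract; a sketch under assumed
   field names, not the published implementation. */

/* Standard chain: the node points to a separately allocated string,
   costing an extra pointer, an extra allocation, and an extra cache
   miss per string inspected. */
struct std_node {
    struct std_node *next;
    char *str;
};

/* Compact chain: the string is stored inline, so each node is a single
   allocation and the comparison data arrives with the node itself. */
struct compact_node {
    struct compact_node *next;
    char str[];        /* C99 flexible array member */
};
```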
... In the standard binary search tree (BST), each node has a string pointer and two child pointers. These string data structures are illustrated in Figure 1 and are currently among the fastest and most compact tools available for managing large sets of strings in memory [Askitis and Sinha 2007; Heinz et al. 2002; Williams et al. 2001; Zobel et al. 2001; Bell and Gupta 1993; Knuth 1998; Crescenzi et al. 2003]. ...
... These results are an illustration of the importance of considering cache in algorithm design. The standard chain hash table, burst trie, and the BST have previously been shown to be among the most efficient structures for managing strings [Heinz et al. 2002; Williams et al. 2001; Zobel et al. 2001], but we have greatly reduced their total space consumption while simultaneously reducing access time. ...
... For sets of strings with a skew distribution, such as occurrences of words in text, a standard-chain hash table, coupled with an effective, efficient hash function, is faster and more compact than sorted data structures such as tries and variants of BST. Using move-to-front in the individual chains [Knuth 1998; Zobel et al. 2001], the load average can reach dozens of strings per slot without significant impact on access speed, as the likelihood of having to inspect more than the first string in each slot is low. Thus, a standard-chain hash table has clear advantages over open-addressing alternatives, whose performance rapidly degrades as the load average approaches 1 and which cannot easily be resized. ...
Article
A key decision when developing in-memory computing applications is choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with move-to-front chains and the burst trie, both of which use linked lists as a substructure, and variants of binary search tree. These data structures are computationally efficient, but typical implementations use large numbers of nodes and pointers to manage strings, which is not efficient in use of cache. In this article, we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. For hashing, in the best case the total space overhead is reduced to less than 1 bit per string. For the burst trie, over 300MB of strings can be stored in a total of under 200MB of memory with significantly improved search time. These results, on a variety of data sets, show that cache-friendly variants of fundamental data structures can yield remarkable gains in performance.
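Searching the array representation that this abstract evaluates amounts to scanning a contiguous buffer of length-prefixed strings. The following sketch assumes one-byte lengths and a zero byte as terminator; these layout details are illustrative simplifications, and the published structure also accommodates longer strings and per-string data.

```c
#include <string.h>

/* Sketch of searching one array-hash bucket: the slot is a contiguous
   byte array of length-prefixed strings, ended by a zero length byte.
   One-byte lengths (strings up to 255 bytes) are an assumption made
   for brevity. */
int bucket_contains(const unsigned char *bucket, const char *s) {
    size_t n = strlen(s);
    while (*bucket != 0) {             /* zero length byte ends the bucket */
        size_t len = *bucket++;
        if (len == n && memcmp(bucket, s, n) == 0)
            return 1;
        bucket += len;                 /* hop to the next string */
    }
    return 0;
}
```

Because the whole bucket sits in a few consecutive cache lines, the scan trades a few extra instructions for far fewer cache misses than pointer chasing through a chain.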
... The most efficient form of hash table for vocabulary accumulation is based on chaining [56]. In such hash tables, a large array is used to index a set of linked lists of nodes, each of which is said to be at a slot. ...
... For standard chaining, the number of slots needs to be a significant fraction of the number of distinct strings. However, in other work we have found that if, on each search, the accessed node is moved to the front of the list, the same efficiency can be obtained with much smaller tables [56]; and, most importantly for this application, with move-to-front chains efficiency declines much more slowly with increasing vocabulary size. ...
... In earlier work comparing the tree data structures discussed above [56], we observed that, compared to hashing, these structures had three sources of inefficiency. First, the average search lengths were surprisingly high, typically exceeding ten pointer traversals and string comparisons even on moderate-sized data sets with highly skew distributions. ...
Article
Full-text available
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
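A burst trie pairs shallow trie nodes with small containers of suffixes. The sketch below shows only the search path, under simplifying assumptions (ASCII alphabet, a fixed-capacity unsorted container, no bursting logic, no union over the two node kinds); the paper's containers and burst heuristics are more refined.

```c
#include <string.h>

/* Minimal burst-trie search, sketched from the abstract's description:
   internal trie nodes route on one character, and small containers hold
   the remaining suffixes. Container layout and the bursting policy are
   assumptions for illustration. A real node would use a union. */
enum kind { TRIE_NODE, CONTAINER };

typedef struct bt_node {
    enum kind kind;
    struct bt_node *child[128];   /* TRIE_NODE: one slot per ASCII char */
    int eos;                      /* a string ends exactly at this node */
    char *suffix[32];             /* CONTAINER: small set of suffixes */
    int nsuffix;
} bt_node;

int bt_search(const bt_node *t, const char *s) {
    while (t && t->kind == TRIE_NODE) {
        if (*s == '\0')
            return t->eos;
        t = t->child[(unsigned char)*s++];   /* consume one character */
    }
    if (!t)
        return 0;
    for (int i = 0; i < t->nsuffix; i++)     /* scan the container */
        if (strcmp(t->suffix[i], s) == 0)
            return 1;
    return 0;
}
```

When a container fills, the real structure "bursts" it into a new trie node whose children are smaller containers, which is how it keeps container scans short while preserving near-sorted order.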
... In-memory data structures are fundamental tools used in virtually any computing application that requires efficient management of data. A well-known example is a hash table (Knuth 1998), which distributes keys amongst a set of slots by using a hash function (Ramakrishna & Zobel 1997; Zobel, Heinz & Williams 2001). A hash table can offer rapid insertion, deletion, and search of both strings and integers but requires a form of collision resolution to resolve cases where two or more keys are hashed to the same slot. ...
... The simplest and most effective collision resolution scheme for when the number of keys is not known in advance is the use of linked lists. This forms a chaining hash table (Zobel et al. 2001), also known as a standard-chain hash table (Askitis & Zobel 2005). Linked lists (or chains) are simple and flexible structures to implement and maintain but are not particularly cache-friendly. ...
... Drawing together the themes we have sketched, this paper provides three important contributions. First, we will show how to develop an array hash table for integers which we experimentally compare against a standard-chain hash table (Zobel et al. 2001), a more cache-efficient variant known as a clustered-chain hash table (Chilimbi 1999, Askitis 2007), and a linear probing open-address hash table (Peterson 1957, Heileman & Luo 2005). Our experiments measure the time, space, and actual cache performance of these hash tables using large volumes of 32-bit integers. ...
Conference Paper
A hash table is a fundamental data structure in computer science that can offer rapid storage and retrieval of data. A leading implementation for string keys is the cache-conscious array hash table. Although fast with strings, there is currently no information in the research literature on its performance with integer keys. More importantly, we do not know how efficient an integer-based array hash table is compared to other hash tables that are designed for integers, such as bucketized cuckoo hashing. In this paper, we explain how to efficiently implement an array hash table for integers. We then demonstrate, through careful experimental evaluations, which hash table, whether it be a bucketized cuckoo hash table, an array hash table, or alternative hash table schemes such as linear probing, offers the best performance, with respect to time and space, for maintaining a large dictionary of integers in-memory, on a current cache-oriented processor.
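Bucketized cuckoo hashing, one of the contenders in this abstract, gives each key two candidate buckets of a few slots each, so a lookup inspects a bounded number of cache lines. The sketch below shows only the lookup side, with made-up mixing constants and table sizes; insertion, which may displace resident keys between their alternate buckets, is the involved part and is omitted.

```c
#include <stdint.h>

/* Sketch of a bucketized cuckoo hash lookup for 32-bit keys: two
   candidate buckets of four slots each. Hash mixers, seeds, and the
   table size are illustrative assumptions. */
#define NBUCKETS 1024               /* power of two */
#define SLOTS    4

typedef struct {
    uint32_t key[SLOTS];
    uint8_t  used[SLOTS];
} bucket;

static bucket tab[NBUCKETS];

static uint32_t mix(uint32_t x, uint32_t seed) {
    x ^= seed;
    x *= 0x9e3779b1u;               /* multiplicative scramble */
    x ^= x >> 16;
    return x & (NBUCKETS - 1);
}

int cuckoo_lookup(uint32_t k) {
    const bucket *b1 = &tab[mix(k, 0x85ebca6bu)];
    const bucket *b2 = &tab[mix(k, 0xc2b2ae35u)];
    for (int i = 0; i < SLOTS; i++) {
        if (b1->used[i] && b1->key[i] == k) return 1;
        if (b2->used[i] && b2->key[i] == k) return 1;
    }
    return 0;                       /* at most two buckets ever touched */
}
```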
... The conventional approach only focuses on organizing the hash chain according to the first packet of each flow, and fails to exploit the locality within each burst. To address this shortcoming and inspired by Zobel's idea in accumulating text vocabularies [7], we propose three different algorithms focusing on exploiting the locality within a burst. ...
... The key goal of our experiments is to investigate how our algorithm improves the flow state lookup efficiency. The hash function we use is the bit-wise hash function from Zobel et al. [7]. ...
Conference Paper
Flow state tables are an essential component for improving the performance of packet classification in network security and traffic management. Generally, a hash table is used to store the state of each flow due to its fast lookup speed. However, hash table collisions can severely reduce the effectiveness of packet classification using a flow state table. In this paper, we propose three schemes to reduce hash collisions by exploiting the locality in traffic. Our experiments show that all our proposed schemes perform better than the standard practice of hashing with overflow chains. More importantly, our move and insert to front scheme is insensitive to the hash table size.
... Efficient index construction for static collections has been studied in detail over the last two decades. After contributions by Moffat and Bell [8, 9] and Heinz and Zobel [13, 5], the indexing problem for static collections seems solved. Following an inverted-file approach and combining the techniques described in the literature, it is possible to index text collections at a rate well above 50 GB per hour on a standard desktop PC, allowing the indexing of text collections in the terabyte range on a single PC (see section 4 for details). ...
... Inverted files have proved to be the most efficient data structure for high performance indexing of large text collections [14]. The process of creating this inverted file can be roughly described as follows: input documents are read, one at a time, and postings are accumulated in an in-memory index, using a hash table with move-to-front heuristics [13] to look up vocabulary terms. Postings for the same term are stored in memory in compressed form, either in an augmentable bitvector [5] or in a linked list [2]. ...
Conference Paper
In-place and merge-based index maintenance are the two main competing strategies for on-line index construction in dynamic information retrieval systems based on inverted lists. Motivated by recent results for both strategies, we investigate possible combinations of in-place and merge-based index maintenance. We present a hybrid approach in which long posting lists are updated in-place, while short lists are updated using a merge strategy. Our experimental results show that this hybrid approach achieves better indexing performance than either method (in-place, merge-based) alone.
... Based on the success of copying strings to array-based buckets in string sorting (Sinha et al. 2006), Askitis and Zobel (2005) replaced the linked lists of the chaining hash table (previously the best method for string hashing (Zobel, Williams & Heinz 2001)) with re-sizable buckets (arrays), forming the cache-conscious array hash. Arrays require more instructions for search, but compensate by eliminating the pointer-chasing problem through contiguous memory allocation. ...
... The fastest data structures currently available for managing (storing and retrieving) variable-length strings in-memory are the burst-trie (Heinz et al. 2002) and the chaining hash table with move-to-front on access (Zobel et al. 2001). These data structures are fast because they minimize the number of instructions required for search and update, but are not particularly cache-conscious. ...
Conference Paper
Full-text available
Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces space by collapsing trie-chains into buckets. This is not, however, a cache-conscious approach and can lead to poor performance on current processors. In this paper, we introduce the HAT-trie, a cache-conscious trie-based data structure that is formed by carefully combining existing components. We evaluate performance using several real-world datasets and against other high-performance data structures. We show strong improvements in both time and space; in most cases approaching that of the cache-conscious hash table. Our HAT-trie is shown to be the most efficient trie-based data structure for managing variable-length strings in-memory while maintaining sort order.
... A hashed index imposes a load factor constraint that limits the performance of hashing beyond a certain peak value. In addition, the large amount of re-hashing required on collision, to find the right bucket for the key being retrieved or placed, is also a limiting factor of hashing techniques [7-20]. ...
... Though a hash table can offer rapid insertion, deletion, and search of both strings and integers, it requires a form of collision resolution to resolve cases where two or more keys are hashed to the same bucket. To resolve this, various mechanisms have been proposed: linked lists [7], used when the number of keys is not known in advance; the array hash [8], a cache-conscious variant of the previous method; and open addressing, which stores homogeneous keys directly within buckets and makes better use of the CPU and cache [9,10]. Open addressing schemes include: linear probing, where the interval between probes is fixed [18]; quadratic probing [12], where the probe interval is increased by adding successive outputs of a polynomial to the starting value; and double hashing [12], where the probe interval is computed by a second hash function. ...
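The three open-addressing probe sequences named in this excerpt differ only in how the probe index advances. A compact sketch, where h1 and h2 stand for any two independent hash values and m is the table size:

```c
/* The three probe sequences written out; i is the probe number
   (0, 1, 2, ...) and each function returns the next slot to inspect. */
unsigned linear_probe(unsigned h1, unsigned i, unsigned m) {
    return (h1 + i) % m;             /* fixed interval between probes */
}

unsigned quadratic_probe(unsigned h1, unsigned i, unsigned m) {
    return (h1 + i * i) % m;         /* polynomial added to start value */
}

unsigned double_hash(unsigned h1, unsigned h2, unsigned i, unsigned m) {
    /* h2 must be nonzero and ideally coprime with m, so that the
       sequence can visit every slot */
    return (h1 + i * h2) % m;
}
```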
... Hashing techniques provide a simple and efficient implementation of an unordered dictionary. This paper evaluates some of the popular in-memory hash table data structures (Pugh, 1990) which are frequently referred to in earlier works and commonly used in dictionary-like applications. Simple in-memory data structures (Zobel et al., 2001) are basic building blocks of programming, and are used to manage temporary data in scales ranging from a few items to gigabytes. For the storage and retrieval of strings, the main data structures are the varieties of hash tables, trie, and binary search tree. ...
Article
The efficiency of in-memory computing applications depends on the choice of mechanism to store and retrieve strings. The tree and trie are the abstract data types (ADTs) that offer better efficiency for an ordered dictionary. The hash table is one among the several other ADTs that provides an efficient implementation for an unordered dictionary. The performance of a data structure will depend on hardware capabilities of computing devices such as RAM size, cache memory size and even the speed of the physical storage media. Hence, an application running on a real or virtualised hardware environment will certainly have restricted access to memory, and hashing is heavily used for such applications for speedy processing. In this work, an analysis of the performance of six hash table based dictionary ADT implementations with different data usage models is carried out. The six different popular hash table based dictionary ADT implementations are Khash, Uthash, GoogleDenseHash, TommyHashtable, TommyHashdyn and TommyHashlin, tested under different hardware and software configurations.
... After an inverted file has been created for each such subcollection, all these sub-indices are combined in a multiway merge process, yielding the final index representing the entire collection [5, 9, 17]. In the context of this paper, all inverted lists contain full positional information, i.e. the exact locations of all occurrences of a given term (as opposed to mere document numbers). Inverted lists are split up into chunks containing ≈ 30,000 postings, and each chunk is compressed using a byte-aligned gap compression method [10], resulting in an average storage requirement of ≈ 12 bits per posting. ...
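The byte-aligned gap compression mentioned in this excerpt is typically a variable-byte code over d-gaps. The sketch below uses one common convention (seven payload bits per byte, with the top bit marking the final byte, over strictly ascending document numbers); the exact coding in the cited work [10] may differ.

```c
#include <stdint.h>

/* Sketch of byte-aligned gap compression: postings are stored as
   differences (d-gaps), each written in vbyte form. The stop-bit
   convention here (top bit set on the final byte) is an assumption. */
unsigned char *vbyte_encode(unsigned char *out, uint32_t gap) {
    while (gap >= 128) {
        *out++ = gap & 0x7f;        /* continuation byte: top bit clear */
        gap >>= 7;
    }
    *out++ = gap | 0x80;            /* final byte: top bit set */
    return out;
}

/* Encode a strictly ascending posting list as vbyte-coded gaps. */
unsigned char *encode_postings(unsigned char *out,
                               const uint32_t *docids, int n) {
    uint32_t prev = 0;
    for (int i = 0; i < n; i++) {
        out = vbyte_encode(out, docids[i] - prev);
        prev = docids[i];
    }
    return out;
}
```

Small gaps dominate in long lists, so most postings fit in a single byte, which is how averages near 12 bits per posting (with positions) become possible.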
Conference Paper
Full-text available
We present a new family of hybrid index maintenance strategies to be used in on-line index construction for monotonically growing text collections. These new strategies improve upon recent results for hybrid index maintenance in dynamic text retrieval systems. Like previous techniques, our new method distinguishes between short and long posting lists: while short lists are maintained using a merge strategy, long lists are kept separate and are updated in-place. This way, costly relocations of long posting lists are avoided. We discuss the shortcomings of previous hybrid methods and give an experimental evaluation of the new technique, showing that its index maintenance performance is superior to that of the earlier methods, especially when the amount of main memory available to the indexing system is small. We also present a complexity analysis which proves that, under a Zipfian term distribution, the asymptotical number of disk accesses performed by the best hybrid maintenance strategy is linear in the size of the text collection, implying the asymptotical optimality of the proposed strategy.
... 228-230]). Dictionary implementations other than hash tables, such as burst tries [5], have been examined as well, but hash tables seem to offer the greatest performance, especially when used with move-to-front heuristics, which very accurately model the term distribution in most English texts (as discussed by Zobel et al. [14]). The power of the hash-based in-memory inversion with linked lists is limited by the amount of main memory available. ...
Article
Many text retrieval systems construct their index by accumulating postings in main memory until there is no more memory available and then creating an on-disk index from the in-memory data. When the entire text collection has been read, all on-disk indices are combined into one big index through a multiway merge process. This paper discusses several ways to arrange postings in memory and studies the effects on memory requirements and indexing performance. Starting from the traditional approach that holds all postings for one term in a linked list, we examine strategies for combining consecutive postings into posting groups and arranging these groups in a linked list in order to reduce the number of pointers in the linked list. We then extend these techniques to compressed posting lists and finally evaluate the effects they have on overall indexing performance for both static and dynamic text collections. Substantial improvements are achieved over the initial approach.
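The posting-group idea in this abstract is essentially an unrolled linked list: each node carries a small array of postings, so the per-posting pointer overhead drops by the group size. A minimal sketch, with an arbitrary group size and no compression:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of posting groups: one list node per GROUP postings instead of
   one node per posting. Group size is an illustrative choice; the paper
   also considers growing and compressed groups. */
#define GROUP 16

typedef struct pgroup {
    struct pgroup *next;
    int n;                       /* postings used in this group */
    uint32_t posting[GROUP];
} pgroup;

/* Append one posting to the list whose last node is tail; the caller
   keeps the head pointer separately. Returns the (possibly new) tail. */
pgroup *append_posting(pgroup *tail, uint32_t p) {
    if (tail == NULL || tail->n == GROUP) {
        pgroup *g = calloc(1, sizeof *g);   /* new empty group */
        if (tail)
            tail->next = g;
        tail = g;
    }
    tail->posting[tail->n++] = p;
    return tail;
}
```

At GROUP postings per node, the per-posting pointer cost falls from one pointer per posting to one per sixteen, which is the effect the abstract measures.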
... The first step is to build a contextual matrix by scanning the input corpus and counting the frequency of seeing words with different features in their contexts (lines 2 to 7). The two search steps in Line 4 and Line 5 can be done rapidly by the hashing mechanism proposed by Zobel, Heinz, and Williams (2001). Hence, the time complexity of this step is mostly influenced by the corpus size T. ...
Preprint
We generalize principal component analysis for embedding words into a vector space. The generalization is made in two major levels. The first is to generalize the concept of the corpus as a counting process which is defined by three key elements: vocabulary set, feature (annotation) set, and context. This generalization enables the principal word embedding method to generate word vectors with regard to different types of contexts and different types of annotations provided for a corpus. The second is to generalize the transformation step used in most of the word embedding methods. To this end, we define two levels of transformations. The first is a quadratic transformation, which accounts for different types of weighting over the vocabulary units and contextual features. The second is an adaptive non-linear transformation, which reshapes the data distribution to be meaningful to principal component analysis. The effect of these generalizations on the word vectors is intrinsically studied with regard to the spread and the discriminability of the word vectors. We also provide an extrinsic evaluation of the contribution of the principal word vectors on a word similarity benchmark and the task of dependency parsing. Our experiments are finalized by a comparison between the principal word vectors and other sets of word vectors generated with popular word embedding methods. The results obtained from our intrinsic evaluation metrics show that the spread and the discriminability of the principal word vectors are higher than those of other word embedding methods. The results obtained from the extrinsic evaluation metrics show that the principal word vectors are better than some of the word embedding methods and on par with popular methods of word embedding.
... The most expensive phase is the parsing and looking-up of word-ids, which takes roughly one third of the total cost. Our intuition is that there is not much room for improvement here, as a hash table with move-to-front chains and a bit-wise hash function has long been the fastest data structure for in-memory vocabulary accumulation of words, assuming that words do not need to be maintained in sort order [7, 17]. While Zettair spends one third of its time merging runs, our index builder spends only one fifth of its time on block writing. ...
Article
Full-text available
We show that the powerful HYB index from [9] can be constructed twice as fast as an ordinary inverted index. As shown in a series of recent works, apart from the basic functionality of the inverted index, the HYB index enables very fast prefix search, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, faceted search, synonym search, error-tolerant search etc. Unlike the inverted index, our algorithm naturally requires an initial pass over the collection in order to compute the so called block boundaries, essential to the HYB index (analogous to index terms). We show (i) how to reliably estimate the block boundaries by random sampling from the documents; and (ii) how to avoid an expensive merge of the index, achieving a truly single-pass construction in that every posting (word occurrence) is touched (read and written) only once. The algorithm has been carefully engineered, with special attention paid to cache-efficiency and disk cost. We compared our algorithm against the state-of-the-art index construction from Zettair. Finally, we show that our HYB index construction naturally supports very fast dynamic index updates.
... The dictionary is implemented as a hash table with a bitwise hash function [28] and the move-to-front technique [34], mapping terms (strings) to integer term ids (see [37] for a study that compares this to other approaches). There is nothing noteworthy about our dictionary implementation, and we claim no novelty in this design. ...
Article
For text retrieval systems, the assumption that all data structures reside in main memory is increasingly common. In this context, we present a novel incremental inverted indexing algorithm for web-scale collections that directly constructs compressed postings lists in memory. Designing efficient in-memory algorithms requires understanding modern processor architectures and memory hierarchies: in this paper, we explore the issue of postings lists contiguity. Naturally, postings lists that occupy contiguous memory regions are preferred for retrieval, but maintaining contiguity increases complexity and slows indexing. On the other hand, allowing discontiguous index segments simplifies index construction but decreases retrieval performance. Understanding this tradeoff is our main contribution: We find that co-locating small groups of inverted list segments yields query evaluation performance that is statistically indistinguishable from fully-contiguous postings lists. In other words, it is not necessary to lay out in-memory data structures such that all postings for a term are contiguous; we can achieve ideal performance with a relatively small amount of effort.
... There are several techniques to store the dictionary in memory. A nice comparison between several data structures is given in [31]. The comparison showed that the fastest data structure is the hash table. ...
Article
Full-text available
Acknowledgments The value of the master thesis goes beyond its scientific content since it summarizes the work of five years of education. In my particular case this thesis symbolically represents all the years spent between Venice, Vienna and Amsterdam, with all the gained amount of experience and personal development. My master project took more or less six months. This document reports only the final outcome, leaving out everything that is behind it. My thesis is also made by many emails, meetings, hours spent in reading articles, testing and debugging. During this time I had the honor to work with many great persons and I want to thank all of them, for their support and patience. More in particular my first thanks go to Eyal, my supervisor, that helped me through all the development of this thesis, whatever the problem was. He tried to teach me two of the nicest virtues in the scientific world: conciseness and clarity. I hope I have learned them well. This thesis received also a crucial help from Spyros Kotoulas. Without his observations and critics this work could not be done. I also would like to thank Frank van Harmelen who offered me a job and financial support for my work. Without his support, my life would have been much more complicated. A special thanks goes to all the users of the DAS3 cluster, who patiently accepted my constant abuses of the cluster when I was running the tests and, more in general, to all the researchers who wrote all the literature or programs that are related to this thesis. It is because of their previous work that I could
... The efficiency of hashing greatly shortens the working process. For example, many encryption algorithms use a shorter hash key to find items, instead of using the original value, in order to save time and energy [1][2][3]. Hashing scatters the records throughout the hash table in a completely random manner. This is the reason why hashing is appropriate for implementing a relationship among elements, but it does not lend itself to operations that attempt to make use of any other relationships among the data [4]. ...
Article
Full-text available
This study compares different hashing encryption algorithms and provides several suggestions about how to choose different hashing encryption algorithms for particular cases. This paper presents three different kinds of hashing encryption algorithms. The author introduces the fundamental conceptions of these algorithms and their application. The author also computes some significant factors in computer science and compares these algorithms by mathematical quantity computing data. This article discusses the advantages and disadvantages of hashing encryption algorithms and proposes its potential value in the future.
... Another disadvantage of splay trees compared to the BST is that they require three comparisons at each level, and need additional space for an extra pointer reference. Hash tables [17] are faster than any tree structure, but their performance comes with a price. The search can become sequential if the data set is large and the hash table is comparatively small. ...
Conference Paper
In the current information age, the dominant method for information search is by providing a few keywords to a search engine. Keyword search is currently one of the most important operations in search engines and numerous other applications. In this paper we propose a new text indexing technique for improving the performance of keyword search. Our proposed technique not only speeds up searching operations but also the operations for inserting and deleting keywords, which are particularly important for the ever-increasing and dynamically changing databases such as those for search engines. We propose to partition all keywords into search trees based on the first character and the length of the keywords. Our partitioning scheme creates a much more even distribution of keywords and results in a 32% speedup in the worst cases and a 1% speedup in the average cases compared to one of the leading text indexing techniques called burst tries. In addition, our proposed technique stores document indexes only at the leaf nodes of the search trees and results in efficient algorithms for searching, insertion, and deletion of keywords. We successfully integrated the technique into our Information Classification and Search Engine system and showed its potential and feasibility.
... With other string hashing functions, all versions of trees were faster. We have recently investigated variations of hashing for accumulating vocabularies and have shown that even faster hashing is possible using a move-to-front chain reorganization scheme [26]. However, a disadvantage of hashing is that sorting of the vocabulary is required to support, for example, range or prefix queries; in applications where sorted access is required, our PERIODIC-ROTATION scheme is preferable. ...
... Because we want to store information about each bin, this string of numbers makes an ideal identifier for a bin. Having a unique identifier for every bin in the visualization suggests that an efficient method of storing information about each bin is to utilize a hashing function [28]. ...
Article
Abstract: Dimensional Stacking is a technique for displaying multivariate data in two dimensional screen space. This technique involves the discretization and recursive embedding of dimensions, each resulting N-dimensional bin occupying a unique position on the screen. This thesis describes the extension of this technique to a three dimensional projection. In addition to the visual enhancements, hashing was used to improve the scalability of records and dimensions. The resulting visualization was evaluated by a usability study. Acknowledgements: I would like to thank my adviser, Matt Ward, for his guidance and patience.
... This procedure consists of a loop with two search steps on Line 12 and Line 13, and an assignment on Line 14. The two search steps can be done rapidly by hashing mechanisms such as the hash chain proposed by Zobel et al. (2001). The efficiency of the assignment step is highly dependent on the algorithm and data structure used to implement the contextual matrix. ...
Thesis
Full-text available
Word embedding is a technique for associating the words of a language with real-valued vectors, enabling us to use algebraic methods to reason about their semantic and grammatical properties. This thesis introduces a word embedding method called principal word embedding, which makes use of principal component analysis (PCA) to train a set of word embeddings for words of a language. The principal word embedding method involves performing a PCA on a data matrix whose elements are the frequency of seeing words in different contexts. We address two challenges that arise in the application of PCA to create word embeddings. The first challenge is related to the size of the data matrix on which PCA is performed and affects the efficiency of the word embedding method. The data matrix is usually a large matrix that requires a very large amount of memory and CPU time to be processed. The second challenge is related to the distribution of word frequencies in the data matrix and affects the quality of the word embeddings. We provide an extensive study of the distribution of the elements of the data matrix and show that it is unsuitable for PCA in its unmodified form. We overcome the two challenges in principal word embedding by using a generalized PCA method. The problem with the size of the data matrix is mitigated by a randomized singular value decomposition (SVD) procedure, which improves the performance of PCA on the data matrix. The data distribution is reshaped by an adaptive transformation function, which makes it more suitable for PCA. These techniques, together with a weighting mechanism that generalizes many different weighting and transformation approaches used in literature, enable the principal word embedding to train high quality word embeddings in an efficient way. We also provide a study on how principal word embedding is connected to other word embedding methods. We compare it to a number of word embedding methods and study how the two challenges in principal word embedding are addressed in those methods. We show that the other word embedding methods are closely related to principal word embedding and, in many instances, they can be seen as special cases of it. The principal word embeddings are evaluated in both intrinsic and extrinsic ways. The intrinsic evaluations are directed towards the study of the distribution of word vectors. The extrinsic evaluations measure the contribution of principal word embeddings to some standard NLP tasks. The experimental results confirm that the newly proposed features of principal word embedding (i.e., the randomized SVD algorithm, the adaptive transformation function, and the weighting mechanism) are beneficial to the method and lead to significant improvements in the results. A comparison between principal word embedding and other popular word embedding methods shows that, in many instances, the proposed method is able to generate word embeddings that are better than or as good as other word embeddings while being faster than several popular word embedding methods.
... In this approach, the vocabulary of searchable terms consists of both words and phrases, and each search term has an inverted list. In our implementation, we store the phrase vocabulary in a separate, compact search structure [Zobel et al. 2001]. With this combined index, processing involves postings lists from both the inverted index and the partial phrase index. ...
Article
Search engines need to evaluate queries extremely fast, a challenging task given the quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this article we consider how phrase queries can be efficiently supported with low disk overheads. Our previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. Alternatively, special-purpose phrase indexes can be used, but it is not feasible to index all phrases. We propose combinations of nextword indexes and phrase indexes with inverted files as a solution to this problem. Our experiments show that combined use of a partial nextword, partial phrase, and conventional inverted index allows evaluation of phrase queries in a quarter the time required to evaluate such queries with an inverted file alone; the additional space overhead is only 26% of the size of the inverted file.
... Splay trees were used as cumulative frequency tables in order to maintain dynamic arithmetic data compression [8]. Splay trees were also used in in-memory hash tables for accumulating text vocabularies [9]. In a new approach proposed by us, we use splay trees in a branch of mathematics for computing binomial coefficients. ...
... Aside from needing to have a PostingSerializer implementation that only writes positional information, the document consumer code base needed to be modified to detect stop-words. Possible implementations for checking whether index terms belong to the set of stop-words include a hash table or a burst trie [18, 24]. ...
... The tower levels dictate the order of assignments: vertices with degree = 1 ... {v19, v22} has the following set of keys: "tussle", "oculomotor nerve", "deposited", "Han Cities", "Chungking", "meridiem", "Lagomorpha", "antenae", "sodium lamp", "Euclidean", "quibbles", "ethyl ether", and "x-rays", which can be assigned to their final addresses according to h1 or h2 without collision. Here, we favor placing keys according to their h1(k) value since we would compute h1(k) first when searching for k. ...
... The most significant change for the results described in this paper is the use of a B + -tree as a vocabulary structure. Our previous implementation used a large in-memory hash table for vocabulary management (Zobel, Williams, & Heinz, 2001), motivated by the expectation that vocabulary size would remain much smaller than memory. However, this expectation was false. ...
Article
Search engines and other text retrieval systems use high-performance inverted indexes to provide efficient text query evaluation. Algorithms for fast query evaluation and index construction are well-known, but relatively little has been published concerning update. In this paper, we experimentally evaluate the two main alternative strategies for index maintenance in the presence of insertions, with the constraint that inverted lists remain contiguous on disk for fast query evaluation. The in-place and re-merge strategies are benchmarked against the baseline of a complete re-build. Our experiments with large volumes of web data show that re-merge is the fastest approach if large buffers are available, but that even a simple implementation of in-place update is suitable when the rate of insertion is low or memory buffer size is limited. We also show that with careful design of aspects of implementation such as free-space management, in-place update can be improved by around an order of magnitude over a naïve implementation.
... The most expensive phase is the parsing and hashing of words which takes roughly one third of the total cost. We believe that there is not much room for improvement here since a hash table with move-to-front chains and a bit-wise hash function has been the fastest practical data structure for in-memory vocabulary accumulation for some time, conditioned on the assumption that the words do not need to be maintained in sort order [Zobel et al., 2001]. An alternative fast and practical data structure that maintains the sorted order of the words is the burst trie [Heinz and Zobel, 2002]. ...
Thesis
In this dissertation, we consider the problem of fuzzy full-text search and query suggestion in large text collections, that is, full-text search that is robust against errors on the side of the query as well as errors on the side of the documents. We consider two variants of the problem. The first variant is keyword-based search tolerant to errors. The second variant is autocompletion or prefix search tolerant to errors. In this variant of the problem, each keyword can be specified partially and the results appear instantly as the user types the query letter by letter. One of the main challenges in building an information retrieval system for fuzzy search is efficiency. Providing interactive query times (below 100 ms) for either fuzzy search variant is surprisingly challenging due to the one order of magnitude larger volume of data to be handled by the system. While efficient index data structures exist that allow fast search for the exact variants of the problem, there has been limited work on indexes that tackle fuzzy search. Commercial search engines such as Yahoo!, Google and Bing provide error-tolerance to a certain extent thanks to the large amount of available query log data. In our version of the problem, we assume a complete absence of query logs or any other precomputed information. This assumption is often realistic for information retrieval systems for vertical or domain-specific search that typically have a much smaller user base. In the first part of this dissertation, we propose efficient data structures and algorithms that are the core of our fuzzy search. In the second part, we address important algorithm-engineering aspects of an error-tolerant search system. All of our algorithms and data structures have been implemented and integrated into the CompleteSearch engine.
... Constructing dictionaries is however not trivial, as during construction a list of all seen terms and their assigned keys must be kept [20]. For building the dictionary, we employ move-to-front hashtables [30], which are a particular type of chained hashtables using linked lists for overflow entries. When accessed, elements are moved to the front of their list. ...
Conference Paper
Full-text available
RDF is increasingly being used to represent large amounts of data on the Web. Current query evaluation strategies for RDF are inspired by databases, assuming perfect answers on finite repositories. In this paper, we focus on a query method based on evolutionary computing, which allows us to handle uncertainty, incompleteness and unsatisfiability, and deal with large datasets, all within a single conceptual framework. Our technique supports approximate answers with "anytime" behaviour. We present scalability results and next steps for improvement.
... In some domains, the dictionary is small enough to be kept in main memory. A good comparison between the performance of different in-memory data structures is given in [16]. From the comparison, it is clear that the hash table is the fastest data structure. ...
Conference Paper
Full-text available
The Semantic Web consists of many billions of statements made of terms that are either URIs or literals. Since these terms usually consist of long sequences of characters, an effective compression technique must be used to reduce the data size and increase the application performance. One of the best known techniques for data compression is dictionary encoding. In this paper we propose a MapReduce algorithm that efficiently compresses and decompresses a large amount of Semantic Web data. We have implemented a prototype using the Hadoop framework and we report an evaluation of the performance. The evaluation shows that our approach is able to efficiently compress a large amount of data and that it scales linearly regarding the input size and number of nodes.
... Merge-based indexing algorithms may differ in the way in which they build the index partitions I k . Published work on this topic, however, suggests that hash-based implementations with move-to-front heuristic lead to the best overall performance (Zobel et al., 2001). In such implementations, an extensible in-memory posting list is maintained for each index term. ...
Article
Index maintenance strategies employed by dynamic text retrieval systems based on inverted files can be divided into two categories: merge-based and in-place update strategies. Within each category, individual update policies can be distinguished based on whether they store their on-disk posting lists in a contiguous or in a discontiguous fashion. Contiguous inverted lists, in general, lead to higher query performance, by minimizing the disk seek overhead at query time, while discontiguous inverted lists lead to higher update performance, requiring less effort during index maintenance operations. In this paper, we focus on retrieval systems with high query load, where the on-disk posting lists have to be stored in a contiguous fashion at all times. We discuss a combination of re-merge and in-place index update, called Hybrid Immediate Merge. The method performs strictly better than the re-merge baseline policy used in our experiments, as it leads to the same query performance, but substantially better update performance. The actual time savings achievable depend on the size of the text collection being indexed; a larger collection results in greater savings. In our experiments, variations of Hybrid Immediate Merge were able to reduce the total index update overhead by up to 73% compared to the re-merge baseline.
Conference Paper
We describe indexing and retrieval techniques that are suited to perform terabyte-scale information retrieval tasks on a standard desktop PC. Starting from an Okapi-BM25-based default baseline retrieval function, we explore both sides of the effectiveness spectrum. On one side, we show how term proximity can be integrated into the scoring function in order to improve the search results. On the other side, we show how index pruning can be employed to increase retrieval efficiency, at the cost of reduced retrieval effectiveness. We show that, although index pruning can harm the quality of the search results considerably, according to standard evaluation measures, the actual loss of precision, according to other measures that are more realistic for the given task, is rather small and is in most cases outweighed by the immense efficiency gains that come along with it.
Article
Full-text available
As shown in a series of recent works, the HYB index is an alternative to the inverted index (INV) that enables very fast prefix searches, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, faceted search, error-tolerant search, database-style select and join, and semantic search. In this work we show that HYB can be constructed at least as fast as INV, and often up to twice as fast. This is because HYB, by its nature, requires only a half-inversion of the data and allows an efficient in-place instead of the traditional merge-based index construction. We also pay particular attention to the cache efficiency of the in-memory posting accumulation, an issue that has not been addressed in previous work, and show that our simple multilevel posting accumulation scheme yields much fewer cache misses compared to related approaches. Finally, we show that HYB supports fast dynamic index updates more easily than INV.
Article
Compression of large collections can lead to improvements in retrieval times by offsetting the CPU decompression costs with the cost of seeking and retrieving data from disk. We propose a semistatic phrase-based approach called xray that builds a model offline using sample training data extracted from a collection, and then compresses the entire collection online in a single pass. The particular benefits of xray are that it can be used in applications where individual records or documents must be decompressed, and that decompression is fast. The xray scheme also allows new data to be added to a collection without modifying the semistatic model. Moreover, xray can be used to compress general-purpose data such as genomic, scientific, image, and geographic collections without prior knowledge of the structure of the data. We show that xray is effective on both text and general-purpose collections. In general, xray is more effective than the popular gzip and compress schemes, while being marginally less effective than bzip2. We also show that xray is efficient: of the popular schemes we tested, it is typically only slower than gzip in decompression. Moreover, the query evaluation costs of retrieval of documents from a large collection with our search engine is improved by more than 30% when xray is incorporated compared to an uncompressed approach. We use simple techniques for obtaining the training data from the collection to be compressed and show that with just over 4% of data the entire collection can be effectively compressed. We also propose four schemes for phrase-match selection during the single pass compression of the collection. We conclude that with these novel approaches xray is a fast and effective scheme for compression and decompression of large general-purpose collections.
Article
Full-text available
We describe a new practical algorithm for finding perfect hash functions with no specification space at all, suitable for key sets ranging in size from small to very large. The method is able to find perfect hash functions for various sizes of key sets in linear time. The perfect hash functions produced are optimal in terms of time (perfect) and require at most the computation of h1(k) and h2(k), two simple auxiliary pseudorandom functions.
Conference Paper
Full-text available
We describe a new practical algorithm for finding perfect hash functions with no specification space at all, suitable for key sets ranging in size from small to very large. The method is able to find perfect hash functions for various sizes of key sets in linear time. The perfect hash functions produced are optimal in terms of time (perfect) and require at most the computation of h1(k) and h2(k), two simple auxiliary pseudorandom functions.
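Perfect hash functions of this h1/h2 family are typically evaluated by combining the two auxiliary hash values through a precomputed table. The sketch below follows the well-known graph-based (CHM-style) evaluation as an assumed stand-in for this paper's scheme; constructing the table g, which is where the linear-time algorithm does its real work, is omitted.

```c
#include <stdint.h>

/* Evaluation side of a graph-based perfect hash function: the
   construction phase fills g so that, for every key, the values at its
   two vertices sum to a unique address. Names and the CHM-style
   combination rule are assumptions for illustration. */
extern uint32_t g[];            /* filled during construction */
extern uint32_t nvertices;      /* graph size, larger than the key count */

uint32_t h1(const char *k);     /* two independent string hashes */
uint32_t h2(const char *k);

uint32_t perfect_hash(const char *k, uint32_t nkeys) {
    uint32_t u = h1(k) % nvertices;
    uint32_t v = h2(k) % nvertices;
    return (g[u] + g[v]) % nkeys;   /* collision-free by construction */
}
```

A lookup therefore costs exactly the two hash computations the abstract promises, plus two array reads and one modulo.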
Conference Paper
The paper gives guidelines for choosing the most suitable hashing method and hash function for a particular problem. After studying various problems, we identify criteria for predicting the best hash method and hash function for each problem. We present six suitable classes of hash functions in which most problems can find their solution. The paper discusses hashing and the various components involved in it, and states the need for hashing for faster data retrieval. Hashing methods are used in many different applications of computer science. These applications range from spell checkers and database management applications to symbol tables generated by loaders, assemblers, and compilers. There are various forms of hashing that are used in different hashing problems, such as dynamic hashing, cryptographic hashing, geometric hashing, robust hashing, Bloom hashing, and string hashing. At the end we conclude which type of hash function is suitable for which kind of problem.
Conference Paper
This paper presents a morphological analyzer for the Spanish language (MAHT). This system is mainly based on the storage of words and their morphological information, leading to a lexical knowledge base that has almost five million words. The lexical knowledge base practically covers the whole morphological casuistry of the Spanish language. However, the analyzer handles the processing of prefixes and of enclitic pronouns with simple rules, since there are many words that can include these elements and some of them are neologisms. MAHT reaches an average processing speed of over 275,000 words per second. This is possible because it uses hash tables in main memory. MAHT has been designed to isolate the data from the algorithms that analyze words, even with their irregular forms. This design is very important for an irregular and highly inflectional language like Spanish, to simplify the insertion of new words and the maintenance of program code.
Article
The Semantic Web contains many billions of statements, which are released using the resource description framework (RDF) data model. To better handle these large amounts of data, high performance RDF applications must apply a compression technique. Unfortunately, because of the large input size, even this compression is challenging. In this paper, we propose a set of distributed MapReduce algorithms to efficiently compress and decompress a large amount of RDF data. Our approach uses a dictionary encoding technique that maintains the structure of the data. We highlight the problems of distributed data compression and describe the solutions that we propose. We have implemented a prototype using the Hadoop framework, and evaluate its performance. We show that our approach is able to efficiently compress a large amount of data and scales linearly on both input size and number of nodes. Copyright © 2012 John Wiley & Sons, Ltd.
Article
Recently many different programming languages have emerged for the development of bioinformatics applications. In addition to the traditional languages, languages from open source projects such as BioPerl, BioPython, and BioJava have become popular because they provide special tools for biological data processing and are easy to use. However, it is not well-studied which of these programming languages will be most suitable for a given bioinformatics task and which factors should be considered in choosing a language for a project. Like many other application projects, bioinformatics projects also require various types of tasks. Accordingly, it will be a challenge to characterize all the aspects of a project in order to choose a language. However, most projects require some common and primitive tasks such as file I/O, text processing, and basic computation for counting, translation, statistics, etc. This paper presents the benchmarking results of six popular languages, Perl, BioPerl, Python, BioPython, Java, and BioJava, for several common and simple bioinformatics tasks. The experimental results of each language are compared through quantitative evaluation metrics such as execution time, memory usage, and size of the source code. Other qualitative factors, including writeability, readability, portability, scalability, and maintainability, that affect the success of a project are also discussed. The results of this research can be useful for developers in choosing an appropriate language for the development of bioinformatics applications.
Article
To avoid frequent hard-disk reads and writes and to speed up search, the index of a small-scale web search engine should be kept in memory. Expressing the original information in less memory, however, requires index compression, which either increases computation costs or loses some of the original information. In this research on small-scale Uyghur web search, to speed up retrieval and querying, the inverted index is built on a hash table data structure and kept entirely resident in memory. For index compression, rather than applying a conventional compression technique, we propose a word-grouping approach based on a simplified N-gram statistical model, extracting semantic words that are structurally stable, semantically complete, and independent, which greatly reduces the size of the index term list. This not only serves the purpose of index compression, but also resolves ambiguity to some extent and clearly improves search precision. The experimental results indicate that our method is feasible and effective.
Thesis
Full-text available
The demand for information has multiplied in recent years, driven mainly by the globalization of access to the WWW. This has led to a substantial increase in the size of the text collections available in electronic form, whose compression not only saves space but also improves the efficiency of their input/output and network transmission. Text compression deals with information expressed in natural language. Identifying the redundancy underlying such texts therefore requires a word-oriented perspective, taking the word as the minimal unit of information used in communication between people. This thesis studies this setting from three complementary perspectives, whose results translate into a set of special-purpose text compressors. Natural language has particular properties, both in the size of the word vocabulary identified in the text and in the frequency distribution of each word. Universal compression techniques, however, cannot specifically exploit these properties, since they do not restrict the kind of messages they take as input. The first proposal of this thesis centers on a preprocessing scheme (called Word-Codeword Improved Mapping: WCIM) that transforms the original text into a more redundant representation of itself that favours compression with classical techniques. Despite its simplicity and effectiveness, this proposal does not handle a relevant aspect of natural language: the relationships between words. The Edge-Guided (E-G) family of techniques uses the adjacency relation between symbols as the basis for representing the text. The E-G1 compressor builds a word-oriented order-1 model, represented in the edges of a directed graph. E-Gk, in turn, extends the original vocabulary with a set of significant word sequences (phrases) represented through a context-free grammar. The original graph model evolves into a phrase-oriented order-1 model in which the hierarchical relationships among the constituent words can be exploited through the information stored in the grammar. Both E-G1 and E-Gk use the information stored in the edges of the graph to build their coding schemes, based on a Huffman code. Bilingual parallel corpora (bitexts) consist of two natural-language texts that express the same information in two different languages. This property adds a kind of redundancy not addressed in the previous cases: semantic redundancy. Our proposals in this context focus on the representation of aligned bitexts, whose use is essential in many translation-related applications. To this end, we introduce the concept of the biword as the symbolic unit of representation, and propose two techniques based on its structural (Translation Relationship-based Compressor: TRC) and semantic (Two-Level Compressor for Aligned Bitexts: 2LCAB) properties. Both proposals analyse the effect on compression of different bitext alignment strategies.
In addition, 2LCAB provides a search mechanism, based on pattern matching, that supports several kinds of operations over the compressed text. Experiments carried out on reference corpora in each of these contexts demonstrate the competitiveness of each of the proposed compressors. The results obtained with 2LCAB are especially significant, as they support the first known proposal that enables monolingual and cross-lingual querying over a compressed bitext. This property decouples the language in which results are retrieved from the language used in the query, positioning 2LCAB as a competitive alternative for use as a search engine in translation tools.
Article
A key decision when developing in-memory computing applications is the choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with move-to-front chains and the burst trie, both of which use linked lists as a substructure, and variants of the binary search tree. These data structures are computationally efficient, but typical implementations use large numbers of nodes and pointers to manage strings, which is not efficient in use of cache. In this article, we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. For hashing, in the best case the total space overhead is reduced to less than 1 bit per string. For the burst trie, over 300MB of strings can be stored in a total of under 200MB of memory with significantly improved search time. These results, on a variety of data sets, show that cache-friendly variants of fundamental data structures can yield remarkable gains in performance.
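The first of the two alternatives can be summarized in a few lines of C. This is a sketch of the general idea, not the paper's code: the compact node uses a C99 flexible array member so the string shares the node's allocation, saving a pointer and a likely cache miss per access.

    #include <stdlib.h>
    #include <string.h>

    struct standard_node {
        struct standard_node *next;
        char *str;                        /* separate allocation for the string */
    };

    struct compact_node {
        struct compact_node *next;
        char str[];                       /* characters stored inline */
    };

    /* One allocation instead of two; node and string share locality. */
    struct compact_node *make_compact(const char *s, struct compact_node *next) {
        struct compact_node *n = malloc(sizeof *n + strlen(s) + 1);
        if (n) {
            n->next = next;
            strcpy(n->str, s);
        }
        return n;
    }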
Article
Indexing n-gram phrases from text has many practical applications: plagiarism detection, comparison of DNA sequences, and spam detection. In this paper we describe several data structures, such as the hash table and the B+ tree, that can store n-grams for searching. We perform tests that show their advantages and disadvantages. One data structure neglected for this purpose, the ternary search tree, is described in depth and two performance improvements are proposed.
Conference Paper
During indexing, the vocabulary of a collection needs to be built. The structure used for this must account for the skew distribution of terms. Parallel indexing allows a large reduction in the number of times the global vocabulary needs to be examined; however, it also raises a new set of challenges. In this paper we examine the structures used to resolve collisions in a hash table during parallel indexing, and find that the best structure is different from those suggested previously.
Article
A model of a natural language text is a collection of information that approximates the statistics and structure of the text being modeled. The purpose of the model may be to give insight into rules which govern how language is generated, or to predict properties of future samples of it. This paper studies models of natural language from three different, but related, viewpoints. First, we examine the statistical regularities that are found empirically, based on the natural units of words and letters. Second, we study theoretical models of language, including simple random generative models of letters and words whose output, like genuine natural language, obeys Zipf's law. Innovation in text is also considered by modeling the appearance of previously unseen words as a Poisson process. Finally, we review experiments that estimate the information content inherent in natural text.
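For reference, the empirical regularity this abstract builds on can be stated in its standard form (a textbook formulation, not taken from the paper): the r-th most frequent of V vocabulary words occurs with probability roughly proportional to 1/r.

    P(r) \approx \frac{C}{r^{\alpha}}, \qquad \alpha \approx 1, \qquad
    C = \left( \sum_{r=1}^{V} r^{-\alpha} \right)^{-1}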
Conference Paper
Skip lists are data structures that use probabilistic balancing rather than strictly enforced balancing. The structure of a skip list is determined only by the number of elements in the skip list and the results of consulting the random number generator. Skip lists can be used to perform the same kinds of operations that a balanced tree can perform, including the use of search fingers and ranking operations. The algorithms for insertion and deletion in skip lists are much simpler and faster than equivalent algorithms for balanced trees. Included in the article is an analysis of the probabilistic performance of skip lists.
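A minimal C sketch of the search procedure: start at the highest level and move forward while the next key is still smaller, dropping a level whenever the next step would overshoot. Integer keys and a fixed level cap are assumptions of this sketch, not requirements of the structure.

    #include <stddef.h>

    #define MAX_LEVEL 16

    struct sl_node {
        int key;
        struct sl_node *forward[MAX_LEVEL];   /* forward[i]: next node at level i */
    };

    struct skiplist {
        struct sl_node head;                  /* sentinel node */
        int level;                            /* highest level in use */
    };

    /* Return the node holding key, or NULL if it is absent. */
    struct sl_node *sl_search(struct skiplist *sl, int key) {
        struct sl_node *x = &sl->head;
        int i;
        for (i = sl->level - 1; i >= 0; i--)
            while (x->forward[i] != NULL && x->forward[i]->key < key)
                x = x->forward[i];
        x = x->forward[0];                    /* candidate at the bottom level */
        return (x != NULL && x->key == key) ? x : NULL;
    }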
Conference Paper
We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algorithms date back at least to the 1960s, but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching. The first of the two resulting programs is a sorting implementation that is competitive with the most efficient string sorting programs known. The second program is a symbol table implementation that is faster than hashing, which is commonly regarded as the fastest symbol table implementation. The symbol table implementation is much more space-efficient than multiway trees, and supports more advanced searches. In many application programs, sorts use a Quicksort implementation based on an abstract compare operation, and searches use hashing or binary search trees. These do not take advantage of the properties of string keys, which are widely used in practice. Our algorithms provide a natural and elegant way to adapt classical algorithms to this important class of applications.
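The searching structure this abstract describes is the ternary search tree. Below is a minimal C sketch of its lookup: each node splits on one character, with less-than and greater-than branches as in a binary search tree and an equal branch that advances to the next character. The sketch assumes strings are stored with their terminating '\0' as the final node on the equal path.

    struct tst_node {
        char splitchar;
        struct tst_node *lo, *eq, *hi;
    };

    /* Returns 1 if s is in the tree, 0 otherwise. */
    int tst_search(const struct tst_node *p, const char *s) {
        while (p != NULL) {
            if (*s < p->splitchar)
                p = p->lo;
            else if (*s > p->splitchar)
                p = p->hi;
            else {
                if (*s == '\0')           /* matched the terminator: found */
                    return 1;
                s++;                      /* consume one character */
                p = p->eq;
            }
        }
        return 0;
    }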
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
String hashing is a fundamental operation, used in countless applications where fast access to distinct strings is required. In this paper we describe a class of string hashing functions and explore its performance. In particular, using experiments with both small sets of keys and a large key set from a text database, we show that it is possible to achieve performance close to that theoretically predicted for hashing functions. We also consider criteria for choosing a hashing function and use them to compare our class of functions to other methods for string hashing. These results show that our class of hashing functions is reliable and efficient, and is therefore an appropriate choice for general-purpose hashing.
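One widely used function of the general shift-add-xor kind studied in this line of work is sketched below in C. The shift amounts (5 and 2) and the zero seed are common choices here, not necessarily the paper's parameters.

    #include <stdint.h>

    /* Mix each input character into the hash with a shift, an add,
     * and an exclusive-or, then reduce to the number of slots. */
    uint32_t sax_hash(const char *key, uint32_t slots) {
        uint32_t h = 0;
        for (; *key; key++)
            h ^= (h << 5) + (h >> 2) + (uint8_t)*key;
        return h % slots;
    }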
R. Sedgewick, Algorithms in C: Parts 1–4: Fundamentals, Data Structures, Sorting, Searching, Addison-Wesley, Reading, MA, 1998.