Article

Self-adjusting trees in practice for large text collections

Authors: Williams HE, Zobel J, Heinz S

Abstract

Splay and randomised search trees are self-balancing binary tree structures with little or no space overhead compared to a standard binary search tree. Both trees are intended for use in applications where node accesses are skewed, for example in gathering the distinct words in a large text collection for index construction. We investigate the efficiency of these trees for such vocabulary accumulation. Surprisingly, unmodified splaying and randomised search trees are on average around 25% slower than using a standard binary tree. We investigate heuristics to limit splay tree reorganisation costs and show their effectiveness in practice. In particular, a periodic rotation scheme improves the speed of splaying by 27%, while other proposed heuristics are less effective. We also report the performance of efficient bit-wise hashing and red-black trees for comparison.
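
The periodic rotation scheme the abstract mentions amounts to gating how often an access may restructure the tree. As a rough illustration, here is a minimal C sketch of bottom-up splaying in which only every k-th access splays; the gate, names, and the parameter k are our assumptions, not necessarily the exact scheme the paper evaluates.

```c
#include <string.h>

typedef struct node {
    const char *key;
    struct node *left, *right, *parent;
} node;

/* Rotate x above its parent, fixing grandparent (or root) links. */
static void rotate(node **root, node *x) {
    node *p = x->parent, *g = p->parent;
    if (p->left == x) {
        p->left = x->right;
        if (x->right) x->right->parent = p;
        x->right = p;
    } else {
        p->right = x->left;
        if (x->left) x->left->parent = p;
        x->left = p;
    }
    p->parent = x;
    x->parent = g;
    if (!g)                *root = x;
    else if (g->left == p) g->left = x;
    else                   g->right = x;
}

/* Standard bottom-up splay: zig, zig-zig, zig-zag. */
static void splay(node **root, node *x) {
    while (x->parent) {
        node *p = x->parent, *g = p->parent;
        if (!g) {
            rotate(root, x);                      /* zig */
        } else if ((g->left == p) == (p->left == x)) {
            rotate(root, p); rotate(root, x);     /* zig-zig */
        } else {
            rotate(root, x); rotate(root, x);     /* zig-zag */
        }
    }
}

/* Search, but splay (the node found, or the last node on the path) only
 * on every k-th access, k >= 1; frequent words still migrate toward the
 * root while the rotation cost is paid only occasionally. */
node *access(node **root, const char *key, unsigned k) {
    static unsigned count = 0;
    node *cur = *root, *last = NULL;
    while (cur) {
        last = cur;
        int c = strcmp(key, cur->key);
        if (c == 0) break;
        cur = (c < 0) ? cur->left : cur->right;
    }
    if (last && ++count % k == 0)
        splay(root, last);
    return cur;   /* NULL if key absent */
}
```

For skewed vocabulary accumulation the appeal of such a gate is that hot words still move toward the root over time, but the per-access rotation cost is paid only on a fraction of accesses.
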


... Hence, the disk-resident skip list is built from the bottom up using keys that are known and sorted in advance. Although the skip list can be updated from the top down, in practice this would be inefficient—particularly with strings—due to the cost of updating its multi-layer index [79,92]. The disk-resident skip list supports unbounded-length strings in a manner similar to the SB-tree. ...
... The relative performance of the B + -trees and B-trie, however, remained the same regardless of file format. Research on splay trees [92] reported the inefficiency of using the string-compare system call provided by the Linux operating system. String comparisons are a vital component of most string-based data structures. ...
... String comparisons are a vital component of most string-based data structures. Williams et al. [92] used their own implementation of string-compare and achieved speed gains of up to 20%. We do the same for our implementations. ...
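
Since the excerpt does not show the routine Williams et al. actually used, the following is only a plausible sketch of an inlined string comparison of that kind; the reported gains come from avoiding library-call overhead on the short strings typical of vocabularies.

```c
/* Hypothetical inlined comparison; semantics match strcmp for
 * NUL-terminated strings, without the overhead of a library call. */
static inline int fast_strcmp(const char *a, const char *b) {
    while (*a && *a == *b) { ++a; ++b; }
    return (unsigned char)*a - (unsigned char)*b;
}
```
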
Article
A wide range of applications require that large quantities of data be maintained in sort order on disk. The B-tree, and its variants, are an efficient general-purpose disk-based data structure that is almost universally used for this task. The B-trie has the potential to be a competitive alternative for the storage of data where strings are used as keys, but has not previously been thoroughly described or tested. We propose new algorithms for the insertion, deletion, and equality search of variable-length strings in a disk-resident B-trie, as well as novel splitting strategies which are a critical element of a practical implementation. We experimentally compare the B-trie against variants of B-tree on several large sets of strings with a range of characteristics. Our results demonstrate that, although the B-trie uses more memory, it is faster, more scalable, and requires less disk space.
... This is true even if the working set changes over time. Experimental results by Bell and Gupta [2], and more recently by Williams et al. [3], have shown that even when the access pattern is heavily skewed, the instruction overhead involved with the splay operation outweighs the instruction savings of having shorter access paths. Bell and Gupta [2] showed that a randomly built BST answered queries faster than a splay tree, even though a splay tree performs best in a skewed setting. ...
... Most focus on decreasing the overhead involved in splaying by reducing rotations. Williams et al. [3] discuss (and evaluate) several heuristics in the context of large text collections [3] (see also experimental results from Bolster [5]). One particular heuristic, randomized splaying, has been shown to perform well in practice. ...
Article
In this paper we present new empirical results for splay trees. These results provide a better understanding of how cache performance affects query execution time. Our results show that splay trees can have faster lookup times compared with randomly built binary search trees (BST) under certain settings. In contrast, previous experiments have shown that because of the instruction overhead involved in splaying, splay trees are less efficient in answering queries than randomly built BSTs—even when the data sets are heavily skewed (a favorable setting for splay trees). We show that at large tree sizes the difference in cache performance between the two types of trees is significant. This difference means that splay trees are faster than BSTs for this setting—despite still having a higher instruction count. Based on these results we offer guidelines in terms of tree size, access pattern, and cache size as to when splay trees will likely be more efficient. We also present a new splaying heuristic aimed at reducing instruction count and show that it can improve on standard splaying by 10–27%. Copyright © 2007 John Wiley & Sons, Ltd.
... Splay trees lose the provably logarithmic worst-case bounds of individual operations, but still behave well under amortized analysis. The need for (expensive) splaying can be reduced by randomizing the decision of whether to splay in connection with an operation [3,4], as well as by heuristic limit-splaying algorithms [2,5,6]. Several theoretical results indicate that splay trees should work particularly well when there is locality of reference in the request sequence [2]. ...
... Several theoretical results indicate that splay trees should work particularly well when there is locality of reference in the request sequence [2]. However, some empirical studies [6,7,8] have indicated that they could actually be at their best in highly dynamic environments, where the focus of locality drifts over time. Moreover, despite careful implementation, basic splay tree variations have empirically been observed to be less efficient than red-black trees (RBTs), standard binary search trees (BSTs), and hashing, at least in some situations [6,7]. ...
... However, some empirical studies [6,7,8] have indicated that they could actually be at their best in highly dynamic environments, where the focus of locality drifts over time. Moreover, despite careful implementation, basic splay tree variations have empirically been observed to be less efficient than red-black trees (RBTs), standard binary search trees (BSTs), and hashing, at least in some situations [6,7]. Randomized adaptive data structures can do better [4,6], but only heuristic limit-splaying has been competitive in practice [6]. ...
Conference Paper
Full-text available
Access requests to keys stored in a data structure often exhibit locality of reference in practice. Such a regularity can be modeled, e.g., by working sets. In this paper we study to what extent the existence of working sets can be taken advantage of in splay trees. In order to reduce the number of costly splay operations we monitor for information on the current working set and its change. We introduce a simple algorithm which attempts to splay only when necessary. Under worst-case analysis the algorithm guarantees an amortized logarithmic bound. In empirical experiments it is 5% more efficient than randomized splay trees and at most 10% more efficient than the original splay tree. We also briefly analyze the usefulness of the commonly used Zipf's distribution as a general model of locality of reference.
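
As a concrete (assumed) example of "splaying only when necessary", one simple gate is to restructure only when the accessed node was found unusually deep. The sketch below is our illustrative condition, not the working-set monitoring algorithm of the paper.

```c
#include <math.h>
#include <stddef.h>

/* Splay only if the search path was longer than c * log2(n); in a roughly
 * balanced tree of n nodes most accesses then trigger no rotations at all.
 * The constant c is a hypothetical tuning parameter (values around 2-3
 * are typical for depth gates of this kind). */
int should_splay(size_t depth, size_t n, double c) {
    return n > 1 && (double)depth > c * log2((double)n);
}
```
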
... In the standard binary search tree (BST), each node has a string pointer and two child pointers. These string data structures are illustrated in Figure 1 and are currently among the fastest and most compact tools available for managing large sets of strings in memory [Askitis and Sinha 2007; Heinz et al. 2002; Williams et al. 2001; Zobel et al. 2001; Bell and Gupta 1993; Knuth 1998; Crescenzi et al. 2003]. ...
... These results are an illustration of the importance of considering cache in algorithm design. The standard chain hash table, burst trie, and the BST have previously been shown to be among the most efficient structures for managing strings [Heinz et al. 2002; Williams et al. 2001; Zobel et al. 2001], but we have greatly reduced their total space consumption while simultaneously reducing access time. ...
... The Judy data structure required the least amount of space of all standard-chained data structures but was almost the slowest to access under skew, being only slightly faster than the red-black tree. As reported by Williams et al. [2001], the standard BST was the fastest tree to construct and self-search under skew. The red-black and splay trees were inefficient due to the maintenance of a balanced and self-adjusting tree structure, respectively. ...
Article
A key decision when developing in-memory computing applications is choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with move-to-front chains and the burst trie, both of which use linked lists as a substructure, and variants of binary search tree. These data structures are computationally efficient, but typical implementations use large numbers of nodes and pointers to manage strings, which is not efficient in use of cache. In this article, we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. For hashing, in the best case the total space overhead is reduced to less than 1 bit per string. For the burst trie, over 300MB of strings can be stored in a total of under 200MB of memory with significantly improved search time. These results, on a variety of data sets, show that cache-friendly variants of fundamental data structures can yield remarkable gains in performance.
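
The "string in its node" expedient described above can be sketched in C with a flexible array member, so the key occupies the same allocation as the node and a lookup avoids chasing a separate string pointer; struct and field names here are illustrative.

```c
#include <stdlib.h>
#include <string.h>

typedef struct snode {
    struct snode *next;  /* next node in a chain or list */
    char key[];          /* string stored inline, NUL-terminated */
} snode;

/* Allocate a node with its key embedded, rather than pointing to a
 * separately allocated string. */
snode *snode_new(const char *key, snode *next) {
    size_t len = strlen(key) + 1;
    snode *n = malloc(sizeof *n + len);
    if (!n) return NULL;
    n->next = next;
    memcpy(n->key, key, len);
    return n;
}
```
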
... A little work has been done on the self-adjusting of tries based on the underlying access distribution. To the best of our knowledge, the only work directly pertaining to this is the burst trie [21,22]. The burst trie starts with a single container, implemented in [21,22] as a BST with a move-to-front heuristic. ...
... To the best of our knowledge, the only work directly pertaining to this is the burst trie [21,22]. The burst trie starts with a single container, implemented in [21,22] as a BST with a move-to-front heuristic. When the number of nodes in the container starts to be large (as per a pre-defined criterion), it "bursts" to form a node of the trie that points to smaller containers, and so on. ...
Article
Full-text available
A Ternary Search Trie (TST) is a highly efficient dynamic dictionary structure applicable to strings and textual data. The strings are accessed based on a set of access probabilities and are to be arranged using a TST. We consider the scenario where the probabilities are not known a priori but are time-invariant. Our aim is to adaptively restructure the TST so as to yield the best access or retrieval time. Unlike the case of lists and binary search trees, where numerous methods have been proposed, in the case of the TST only a few adaptive schemes have currently been reported. In this paper, we consider various self-organizing schemes that were applied to binary search trees, and apply them to TSTs. Three new schemes, namely the splaying, the conditional rotation and the randomization heuristics, have been proposed, tested and comparatively presented. The results demonstrate that the conditional rotation heuristic is the best among the heuristics considered in the paper.
... Although this approach is not very efficient, it has the advantage that it does not use extra space. Splaying is another technique due to Sleator and Tarjan [20,34,38]. It uses its own tree structure called the splay tree. ...
... The reader will observe that we have defined these operators in terms of various cases. This is, conceptually, similar to the zig-zig and zig-zag cases of the tree-based operations already introduced in the literature [20,25,38]. It is, of course, conceivable that we can include all the possible cases under a single umbrella, and then "pick and choose" those which have to be used in each scenario, i.e., for the STL and the STR operators. ...
Article
Full-text available
In this paper, we demonstrate that we can effectively use results from the field of adaptive self-organizing data structures in enhancing compression schemes. Unlike adaptive lists, which have already been used in compression, to the best of our knowledge, adaptive self-organizing trees have not been used in this regard. To achieve this, we introduce a new data structure, the Partitioning Binary Search Tree (PBST) which, although based on the well-known Binary Search Tree (BST), also appropriately partitions the data elements into mutually exclusive sets. When used in conjunction with Fano encoding, the PBST leads to the so-called Fano Binary Search Tree (FBST), which, indeed, incorporates the required Fano coding (nearly-equal-probability) property into the BST. We demonstrate how both the PBST and FBST can be maintained adaptively and in a self-organizing manner. The updating procedure that converts a PBST into an FBST, and the corresponding new tree-based operators, namely the Shift-To-Left (STL) and the Shift-To-Right (STR) operators, are explicitly presented. The encoding and decoding procedures that also update the FBST have been implemented and rigorously tested. Our empirical results on files of the well-known benchmark, the Canterbury corpus, show that the adaptive Fano coding using FBSTs, the Huffman, and the greedy adaptive Fano coding achieve similar compression ratios. However, in terms of encoding/decoding speed, the new scheme is much faster than the latter two in the encoding phase, and they achieve approximately the same speed in the decoding phase. We believe that the same philosophy, namely that of using an adaptive self-organizing BST to maintain the frequencies, can also be utilized for other data encoding mechanisms, even as the Fenwick scheme has been used in arithmetic coding.
... 2. In practice, the self-adjusting data structures do not perform as well as balanced trees except in cases where a few of the leaves are accessed significantly more frequently than others [8,39]. Our layout scheme allows all the suffixes that share the same first k characters to be grouped together, and is balanced in its initial state, thus combining the advantages of both types of data structures. ...
... This observation provides an explanation for the results in [8,39], where the authors were surprised that self-adjusting data structures do not perform nearly as well as balanced trees, except for very skewed data sets. However, if the suffix tree is built with our layout scheme, then it will be a balanced tree, potentially avoiding the initial inefficiencies. ...
... The dictionaries are also known as associative arrays or maps (Williams et al., 2001). In programming, the abstract data structure dictionary is represented by many aggregated pairs (key, value) along with predefined methods for accessing the values by a given key. ...
Article
The efficiency of in-memory computing applications depends on the choice of mechanism to store and retrieve strings. The tree and trie are abstract data types (ADTs) that offer better efficiency for an ordered dictionary. The hash table is one among several other ADTs that provides an efficient implementation for an unordered dictionary. The performance of a data structure will depend on the hardware capabilities of computing devices, such as RAM size, cache memory size, and even the speed of the physical storage media. Hence, an application running in a real or virtualised hardware environment will have restricted access to memory, and hashing is heavily used in such applications for speed. In this work, an analysis of the performance of six hash table based dictionary ADT implementations with different data usage models is carried out. The six popular hash table based dictionary ADT implementations are Khash, Uthash, GoogleDenseHash, TommyHashtable, TommyHashdyn and TommyHashlin, tested under different hardware and software configurations.
... For storage of strings, a standard representation for such a hash table is a standard chain, consisting of a fixed-size array of pointers (or slots), each the start of a linked list, where each node in the list contains a pointer to a string and a pointer to the next node. For strings with a skew distribution, such as occurrences of words in text, it was found in earlier work [10] that a standard-chain hash table is faster and more compact than sorted data structures such as tries and binary trees. Using move-to-front in the individual chains [27], the load average can reach dozens of strings per slot without significant impact on access speed, as the likelihood of having to inspect more than the first string in each slot is low. ...
Conference Paper
Full-text available
In-memory hash tables provide fast access to large numbers of strings, with less space overhead than sorted structures such as tries and binary trees. If chains are used for collision resolution, hash tables scale well, particularly if the pattern of access to the stored strings is skew. However, typical implementations of string hash tables, with lists of nodes, are not cache-efficient. In this paper we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. In all cases, the new structures give substantial savings in space at no cost in time. In the best case, the overhead space required for pointers is reduced by a factor of around 50, to less than two bits per string (with total space required, including 5.68 megabytes of strings, falling from 20.42 megabytes to 5.81 megabytes), while access times are also reduced.
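
A move-to-front chain lookup of the kind described can be sketched as follows (a plausible rendering, with names of our choosing): on a hit, the node is unlinked and reinserted at the head of its slot's chain, so frequently accessed strings are found after one or two probes even at high load.

```c
#include <string.h>

typedef struct hnode {
    struct hnode *next;
    char *key;
} hnode;

/* Search one slot's chain; on a hit that is not already at the front,
 * move the node to the head of the chain. */
hnode *mtf_find(hnode **slot, const char *key) {
    hnode *prev = NULL, *cur = *slot;
    while (cur && strcmp(cur->key, key) != 0) {
        prev = cur;
        cur = cur->next;
    }
    if (cur && prev) {
        prev->next = cur->next;  /* unlink */
        cur->next = *slot;       /* reinsert at front */
        *slot = cur;
    }
    return cur;  /* NULL if absent */
}
```
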
... In this way, our heuristic often enables more efficient implementations involving less restructuring than splaying. Indeed, the amount of restructuring performed by splay trees is a limitation in the centralized setting as well, and has been addressed previously; e.g., variants like semi-splaying [26], randomized splaying [3,10] and periodic splaying [29] all attempt to reduce restructuring. We compare the restructuring costs of flattening versus splaying and its variants in our companion document [24]. ...
Conference Paper
Full-text available
We present a novel protocol for restructuring a tree- based overlay network in response to the workload of the application running over it. Through low-cost restructuring operations, our protocol incrementally adapts the tree so as to bring nodes that tend to communicate with one another closer together in the tree. It achieves this while respecting degree bounds on nodes so that, e.g., no node degenerates into a "hub" for the overlay. Moreover, it limits restructuring to those parts of the tree over which communication takes place, avoiding restructuring other parts of the tree unnecessarily. We show via experiments on PlanetLab that our protocol can significantly reduce communication latencies in workloads dominated by clusters of communicating nodes.
... In our case, we assume that the (adaptive) BST is dynamically changed while the records are searched for. Heuristics to maintain an adaptive BST include the move-to-root heuristic [4], the simple exchange rule [4], splaying [18], the monotonic tree [7], biasing [6], dynamic binary search [13], weighted randomization [5], deep-splaying [16], and the technique that uses conditional rotations [8]. On the other hand, adaptive coding is important in many applications that require online data compression and transmission. ...
Conference Paper
Full-text available
In this paper, we show an effective way of using adaptive self-organizing data structures in enhancing compression schemes. We introduce a new data structure, the Partitioning Binary Search Tree (PBST), which is based on the well-known Binary Search Tree (BST), and when used in conjunction with Fano encoding, the PBST leads to the so-called Fano Binary Search Tree (FBST). The PBST and FBST can be maintained adaptively and in a self-organizing manner by using new tree-based operators, namely the ...
... The dictionary is implemented as a hash table with a bitwise hash function [28] and the move-to-front technique [34], mapping terms (strings) to integer term ids (see [37] for a study that compares this to other approaches). There is nothing noteworthy about our dictionary implementation, and we claim no novelty in this design. ...
Article
For text retrieval systems, the assumption that all data structures reside in main memory is increasingly common. In this context, we present a novel incremental inverted indexing algorithm for web-scale collections that directly constructs compressed postings lists in memory. Designing efficient in-memory algorithms requires understanding modern processor architectures and memory hierarchies: in this paper, we explore the issue of postings lists contiguity. Naturally, postings lists that occupy contiguous memory regions are preferred for retrieval, but maintaining contiguity increases complexity and slows indexing. On the other hand, allowing discontiguous index segments simplifies index construction but decreases retrieval performance. Understanding this tradeoff is our main contribution: We find that co-locating small groups of inverted list segments yields query evaluation performance that is statistically indistinguishable from fully-contiguous postings lists. In other words, it is not necessary to lay out in-memory data structures such that all postings for a term are contiguous; we can achieve ideal performance with a relatively small amount of effort.
... The quintessential distribution-sensitive data structure is the splay tree [4]. Splay trees seem to perform very efficiently over several natural sequences of operations, both theoretically [4] (asymptotically faster than Θ(log n) search time on a set of n elements) and practically [5]. There still exists no single comprehensive distribution-sensitive analysis for splay trees. ...
Article
We present a priority queue that supports insert in worst-case constant time, and delete-min, access-min, delete, and decrease of an element x in worst-case O(log(min{w_x, q_x})) time, where w_x (respectively, q_x) is the number of elements that were accessed after (respectively, before) the last access to x and are still in the priority queue at the time when the corresponding operation is performed. (An access to an element is accounted for by any priority-queue operation that involves this element.) Our priority queue then has both the working-set and the queueish properties; and, more strongly, it satisfies these properties in the worst-case sense. From the results in Iacono (2001) [11] and Elmasry et al. (2011) [7], our priority queue also satisfies the static-finger, static-optimality, and unified bounds. Moreover, we modify our priority queue to realize a new unifying property — the time-finger property — which encapsulates both the working-set and the queueish properties.
... Splay trees are a self-adjusting variant of the binary search tree [22]. ...
... In English, a simple definition of what constitutes a "word" is any sequence of alphanumeric characters bounded by non-alphanumerics. If this is extended to include single occurrences of the quote or hyphen characters within a word (thus including "don't" and "right-handed" but not "students' "), but to exclude strings with more than two digits, it covers almost every string that in English might reasonably be regarded as a word and can be used in practice for vocabulary accumulation tasks [4,10,16]. We refer to this as the alnum class. ...
Article
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of world-wide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
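
The alnum word class quoted earlier admits a compact checker. The sketch below is our reading of that definition (alphanumeric runs, at most one interior quote and one interior hyphen, at most two digits); the cited papers may draw the boundaries slightly differently.

```c
#include <ctype.h>

int is_alnum_word(const char *w) {
    int digits = 0, quotes = 0, hyphens = 0;
    if (!*w) return 0;
    for (const char *p = w; *p; ++p) {
        unsigned char c = (unsigned char)*p;
        if (isalnum(c)) {
            if (isdigit(c) && ++digits > 2)
                return 0;                 /* more than two digits */
        } else if (c == '\'' || c == '-') {
            /* quote/hyphen must be interior, followed by an alphanumeric */
            if (p == w || !isalnum((unsigned char)p[1]))
                return 0;
            if (c == '\'' ? ++quotes > 1 : ++hyphens > 1)
                return 0;                 /* single occurrences only */
        } else {
            return 0;
        }
    }
    return 1;    /* "don't" and "right-handed" pass; "students'" does not */
}
```
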
... Another such result is the scanning theorem, which shows that iterating over the items is an O(n) time operation [66]. Despite these results, we note that the splay tree is not necessarily the best solution, because its performance can be slightly worse than that of other solutions [74], and in experiments the splay tree performs better only when there is structure in the accesses [60]. Thus, static optimality does not imply good performance for random or non-dynamic data, because the O-notation hides a constant factor, and balanced BSTs are therefore faster in practice for such data [60]. ...
Article
M.S. thesis in Computer Science, University of California, Davis, 2005.
Article
Simulation is indispensable in computer architecture research. Researchers increasingly resort to detailed architecture simulators to identify performance bottlenecks, analyze interactions among different hardware and software components, and measure the impact of new design ideas on the system performance. However, the slow speed of conventional execution-driven architecture simulators is a serious impediment to obtaining desirable research productivity. This paper describes a novel fast multicore processor architecture simulation framework called Two-Phase Trace-driven Simulation (TPTS), which splits detailed timing simulation into a trace generation phase and a trace simulation phase. Much of the simulation overhead caused by uninteresting architectural events is only incurred once during the cycle-accurate simulation-based trace generation phase and can be omitted in the repeated trace-driven simulations. We report our experiences with tsim, an event-driven multicore processor architecture simulator that models detailed memory hierarchy, interconnect, and coherence protocol based on the TPTS framework. By applying aggressive event filtering, tsim achieves an impressive simulation speed of 146 million simulated instructions per second when running 16-thread parallel applications. Copyright © 2010 John Wiley & Sons, Ltd.
Article
Splay trees are widely considered as the classic examples of self-adjusting binary search trees and are part of most courses on data structures and algorithms. Already in the first seminal paper on splay trees (J. Assoc. Comput. Mach. 1985; 32(3):652–686) alternative operations were introduced, among which is semi-splaying. On the one hand, the analysis of semi-splaying gives a smaller constant for the amortized complexity, but on the other hand the authors write: Whether any version of semi-splaying is an improvement over splaying depends on the access sequence. Semi-splaying may be better when the access pattern is stable, but splaying adapts much faster to changes in usage. Maybe this sentence was the reason that nobody seriously ran tests to compare the performance of semi-splaying and splaying. Semi-splaying is conceptually simpler than splaying, has the same asymptotic amortized complexity and, as will be clear from empirical data presented in this paper, the practical performance is better for a very broad variety of access patterns. Therefore, its efficiency is a good reason to use semi-splaying for applications instead of its more prominent brother. Moreover, its simplicity also makes it very attractive for teaching purposes. Copyright © 2008 John Wiley & Sons, Ltd.
Article
Digital repositories must periodically check the integrity of stored objects to assure users of their correctness. Prior solutions calculate integrity metadata and require the repository to store it alongside the actual data objects. To safeguard and detect damage to this metadata, prior solutions rely on widely visible media (unaffiliated third parties) to store and provide back digests of the metadata to verify it is intact. However, they do not address recovery of the integrity metadata in case of damage or adversarial attack. We introduce IntegrityCatalog, a novel software system that can be integrated into any digital repository. It collects all integrity-related metadata in a single component and treats them as first class objects, managing both their integrity and their preservation. We introduce a treap-based persistent authenticated dictionary managing arbitrary length key/value pairs, which we use to store all integrity metadata, accessible simply by object name. Additionally, IntegrityCatalog is a distributed system that includes a network protocol that manages both corruption detection and preservation of this metadata, using administrator-selected network peers with 2 possible roles. Verifiers store and offer attestations on digests and have minimal storage requirements, while preservers efficiently synchronize a complete copy of the catalog to assist in recovery in case of a detected catalog compromise on the local system. We present our approach in developing the prototype implementation, measure its performance experimentally, and demonstrate its effectiveness in real-world situations. We believe the implementation techniques of our open-source IntegrityCatalog will be useful in the construction of next-generation digital repositories.
Article
Storing and retrieving strings in main memory is a fundamental problem in computer science. The efficiency of string data structures used for this task is of paramount importance for applications such as in-memory databases, text-based search engines and dictionaries. The burst trie is a leading choice for such tasks, as it can provide fast sorted access to strings. The burst trie, however, uses linked lists as substructures, which can result in poor use of CPU cache and main memory. Previous research addressed this issue by replacing linked lists with dynamic arrays, forming a cache-conscious array burst trie. Though faster, this variant can incur high instruction costs, which can hinder its efficiency. Thus, engineering a fast, compact, and scalable trie for strings remains an open problem. In this paper, we introduce a novel and practical solution that carefully combines a trie with a hash table, creating a variant of the burst trie called the HAT-trie. We provide a thorough experimental analysis which demonstrates that, for large sets of strings and on alternative computing architectures, the HAT-trie—and two novel variants engineered to achieve further space-efficiency—is currently the leading in-memory trie-based data structure offering rapid, compact, and scalable storage and retrieval of variable-length strings.
Article
Full-text available
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
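
The burst step described by the abstract and the earlier citing text can be rendered as the following illustrative sketch: a leaf container that grows past a threshold is replaced by a trie node whose children are smaller containers, keyed by the leading byte of each stored suffix. The threshold, node layout, and names are our assumptions, not the tuned values of the paper.

```c
#include <stdlib.h>
#include <string.h>

enum { BURST_LIMIT = 35 };   /* hypothetical container-size threshold */

typedef struct trie_node trie_node;
struct trie_node {
    int is_container;
    union {
        trie_node *child[256];                  /* internal trie node */
        struct { char **keys; int n; } bucket;  /* container of suffixes */
    } u;
};

static trie_node *container_new(void) {
    trie_node *t = calloc(1, sizeof *t);
    t->is_container = 1;
    return t;
}

static void bucket_add(trie_node *t, char *suffix) {
    t->u.bucket.keys = realloc(t->u.bucket.keys,
                               (t->u.bucket.n + 1) * sizeof(char *));
    t->u.bucket.keys[t->u.bucket.n++] = suffix;
}

/* Burst a full container: redistribute each suffix by its first byte,
 * shortening it by one character. Empty suffixes (exhausted strings) need
 * an end-of-string record in the node; that bookkeeping is omitted here. */
static void burst(trie_node *t) {
    char **keys = t->u.bucket.keys;
    int n = t->u.bucket.n;
    t->is_container = 0;
    memset(t->u.child, 0, sizeof t->u.child);
    for (int i = 0; i < n; ++i) {
        unsigned char c = (unsigned char)keys[i][0];
        if (!t->u.child[c])
            t->u.child[c] = container_new();
        bucket_add(t->u.child[c], keys[i] + 1);  /* drop the lead byte */
    }
    free(keys);
}
```

In the cited designs the container is itself a small structure (a BST or list with move-to-front), and the burst threshold is chosen empirically; BURST_LIMIT merely stands in for that choice, used as `if (t->u.bucket.n > BURST_LIMIT) burst(t);`.
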
Conference Paper
In this paper, we propose novel randomized versions of splay trees. We have evaluated the practical performance of these structures in comparison with the original version of splay trees and with their log log n-competitive variations, in the application field of compression. In order to evaluate performance, we utilize plain splay trees, their log log n-competitive variations, and our proposed randomized version with the Chain Splay technique to compress data. It is observed in practice that the compression achieved in the case of the log log n-competitive technique is, as intuitively expected, more efficient than that of the plain splay trees.
Conference Paper
We present a self-adjusting layout scheme for suffix trees in secondary storage that provides an optimal number of disk accesses for a sequence of string or substring queries. This has been an open problem since Sleator and Tarjan presented their splaying technique to create self-adjusting binary search trees in 1985. In addition to resolving this open problem, our scheme provides two additional advantages: 1) the partitions are slowly readjusted, requiring fewer disk accesses than splaying methods, and 2) the initial state of the layout is balanced, making it useful even when the sequence of queries is not highly skewed. Our method is also applicable to PATRICIA trees, and potentially to other data structures.
Conference Paper
During indexing, the vocabulary of a collection needs to be built. The structure used for this needs to account for the skew distribution of terms. Parallel indexing allows for a large reduction in the number of times the global vocabulary needs to be examined; however, this also raises a new set of challenges. In this paper we examine the structures used to resolve collisions in a hash table during parallel indexing, and find that the best structure is different from those suggested previously.
Conference Paper
Dynamic binary search trees are a fundamental class of dictionary data structure. Amongst these, the splay tree is space efficient and has amortized running-time bounds. In practice, splay trees perform best when the access sequence has regions of atypical items. Continuing a tradition started by Sleator and Tarjan themselves, we introduce a relaxed version, the α-Frequent Tree, that performs fewer rotations than the standard splay tree. We prove that the α-frequent trees inherit many of the distribution-sensitive properties of splay trees. Meanwhile, Conditional Rotation trees [Cheetham et al.] maintain access counters - one at each node - and have an excellent experimental reputation. By adding access counters to α-frequent trees, we create Splay Conditional Rotation (SCR) trees. These have the experimental performance of other counter-based trees, and the amortized bounds of splay trees.
Article
Fundamental structures such as trees and hash tables are used for managing data in a huge variety of circumstances. Making the right choice of structure is essential to efficiency. In previous work we have explored the performance of a range of data structures -- different forms of trees, tries, and hash tables -- for the task of managing sets of millions of strings, and have developed new variants of each that are more efficient for this task than previous alternatives. In this paper we test the performance of the same data structures on small sets of strings, in the context of document processing for index construction. Our results show that the new structures, in particular our burst trie, are the most efficient choice for this task, thus demonstrating that they are suitable for managing sets of hundreds to millions of distinct strings, and for input of hundreds to billions of occurrences.
Article
Full-text available
We present a randomized strategy for maintaining balance in dynamically changing search trees that has optimal expected behavior. In particular, in the expected case a search or an update takes logarithmic time, with the update requiring fewer than two rotations. Moreover, the update time remains logarithmic, even if the cost of a rotation is taken to be proportional to the size of the rotated subtree. Finger searches and splits and joins can be performed in optimal expected time also. We show that these results continue to hold even if very little true randomness is available, i.e., if only a logarithmic number of truly random bits are available. Our approach generalizes naturally to weighted trees, where the expected time bounds for accesses and updates again match the worst-case time bounds of the best deterministic methods. We also discuss ways of implementing our randomized strategy so that no explicit balance information is maintained. Our balancing strategy and our algorithms are exceedingly simple and should be fast in practice.
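
The randomized strategy this abstract describes is commonly realized as a treap: each node carries a random priority, and rotations restore a heap order on priorities, so the tree shape is random regardless of insertion order. A minimal sketch, with rand() standing in for the careful randomness discussion in the paper:

```c
#include <stdlib.h>
#include <string.h>

typedef struct tnode {
    const char *key;
    int prio;                       /* random heap priority */
    struct tnode *left, *right;
} tnode;

static tnode *rot_right(tnode *y) {
    tnode *x = y->left;
    y->left = x->right;
    x->right = y;
    return x;
}

static tnode *rot_left(tnode *x) {
    tnode *y = x->right;
    x->right = y->left;
    y->left = x;
    return y;
}

/* BST insert by key, then rotate up while the child's priority is smaller
 * than its parent's (min-heap order on priorities). Expected O(log n). */
tnode *treap_insert(tnode *t, const char *key) {
    if (!t) {
        t = calloc(1, sizeof *t);
        t->key = key;
        t->prio = rand();
        return t;
    }
    int c = strcmp(key, t->key);
    if (c < 0) {
        t->left = treap_insert(t->left, key);
        if (t->left->prio < t->prio) t = rot_right(t);
    } else if (c > 0) {
        t->right = treap_insert(t->right, key);
        if (t->right->prio < t->prio) t = rot_left(t);
    }
    return t;
}
```
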
Article
Full-text available
The splay tree, a self-adjusting form of binary search tree, is developed and analyzed. The binary search tree is a data structure for representing tables and lists so that accessing, inserting, and deleting items is easy. On an n-node splay tree, all the standard search tree operations have an amortized time bound of O(log n) per operation, where by “amortized time” is meant the time per operation averaged over a worst-case sequence of operations. Thus splay trees are as efficient as balanced trees when total running time is the measure of interest. In addition, for sufficiently long access sequences, splay trees are as efficient, to within a constant factor, as static optimum search trees. The efficiency of splay trees comes not from an explicit structural constraint, as with balanced trees, but from applying a simple restructuring heuristic, called splaying, whenever the tree is accessed. Extensions of splaying give simplified forms of two other data structures: lexicographic or multidimensional search trees and link/cut trees.
Article
Full-text available
In this paper we present randomized algorithms over binary search trees such that: a) the insertion of a set of keys, in any fixed order, into an initially empty tree always produces a random binary search tree; b) the deletion of any key from a random binary search tree results in a random binary search tree; c) the random choices made by the algorithms are based upon the sizes of the subtrees of the tree; this implies that we can support accesses by rank without additional storage requirements or modification of the data structures; and d) the cost of any elementary operation, measured as the number of visited nodes, is the same as the expected cost of its standard deterministic counterpart; hence, all search and update operations have guaranteed expected cost O(log n), but now irrespective of any assumption on the input distribution.
Article
Full-text available
An investigation is made of the expected value of the maximum number of accesses needed to locate any element in a hashing file under various collision resolution schemes. This differs from usual worst-case considerations which, for hashing, would be the largest sequence of accesses for the worst possible file. Asymptotic expressions of these expected values are found for full and partly full tables. For the open addressing scheme with a clustering-free model these values are found to be 0.6315... × n for a full table and ≈ log_{1/α} n for a partly full table, where n is the number of records, m is the size of the table, and α = n/m. For the open addressing scheme which reorders the insertions to minimize the worst case, the lower bounds ln n + 1.077... and −α⁻¹ ln(1 − α) are found for full and partly full tables, respectively. Finally, for the separate chaining (or direct chaining) method both expected values are found to be ≈ Γ⁻¹(n). These results show that, for these schemes, the actual behavior of the worst case in hash tables is quite good on the average.
Article
The splay-prefix algorithm is one of the simplest and fastest adaptive data compression algorithms based on the use of a prefix code. The data structures used in the splay-prefix algorithm can also be applied to arithmetic data compression. Applications of these algorithms to encryption and image processing are suggested.
Conference Paper
In this paper we introduce a self-adjusting k-ary search tree scheme to implement the abstract data type DICTIONARY. Sleator and Tarjan introduced splay trees and the splay heuristic in 1983 [ST83]. They proved that the amortized time efficiency of splay trees is within a constant factor of the efficiency of both balanced binary trees (such as AVL trees) and static optimal binary trees. Sleator and Tarjan's splay heuristic is defined only for binary search trees. In this paper, we consider a self-adjustment heuristic for k-ary search trees. We present a heuristic called k-splaying and prove that the amortized number of node READs per operation in k-ary trees maintained using this heuristic is O(log_2 n). (Note: all constants in our time bounds are independent of both k and n.) This is within a factor of O(log_2 k) of the amortized number of node READs required for a B-tree operation. A k-ary tree maintained using the k-splay heuristic can be thought of as a self-adjusting B-tree. It differs from a B-tree in that leaves may be at different depths and the use of space is optimal. We also prove that the time efficiency of k-splay trees is comparable to that of static optimal k-ary trees. If sequence s in a static optimal tree takes time t, then sequence s in any k-splay tree will take time O(t log_2 k + n^2). These two results are k-ary analogues of two of Sleator and Tarjan's results for splay trees. As part of our static optimality proof, we prove that for every static tree (including any static optimal tree) there is a balanced static tree which takes at most twice as much time on any sequence of search operations. This lemma allows us to improve our static optimality bound to O(t log_2 k + n log_k n), and similarly improve Sleator and Tarjan's static optimality result.
Article
A model of a natural language text is a collection of information that approximates the statistics and structure of the text being modeled. The purpose of the model may be to give insight into rules which govern how language is generated, or to predict properties of future samples of it. This paper studies models of natural language from three different, but related, viewpoints. First, we examine the statistical regularities that are found empirically, based on the natural units of words and letters. Second, we study theoretical models of language, including simple random generative models of letters and words whose output, like genuine natural language, obeys Zipf's law. Innovation in text is also considered by modeling the appearance of previously unseen words as a Poisson process. Finally, we review experiments that estimate the information content inherent in natural text.
Conference Paper
Skip lists are data structures that use probabilistic balancing rather than strictly enforced balancing. The structure of a skip list is determined only by the number of elements in the skip list and the results of consulting the random number generator. Skip lists can be used to perform the same kinds of operations that a balanced tree can perform, including the use of search fingers and ranking operations. The algorithms for insertion and deletion in skip lists are much simpler and faster than equivalent algorithms for balanced trees. Included in the article is an analysis of the probabilistic performance of skip lists.
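
A minimal C sketch of the two probabilistic ingredients described above, with names of our choosing: the coin-flipped level assigned to a new element, and the top-down search that drops a level whenever the next key would overshoot.

```c
#include <stdlib.h>

enum { MAX_LEVEL = 32 };

typedef struct slnode {
    int key;
    struct slnode *fwd[];   /* fwd[i] = successor at level i */
} slnode;

/* Promote with probability 1/2 per level, capped at MAX_LEVEL. */
int random_level(void) {
    int lvl = 1;
    while (lvl < MAX_LEVEL && (rand() & 1))
        ++lvl;
    return lvl;
}

slnode *slnode_new(int key, int levels) {
    slnode *n = calloc(1, sizeof *n + levels * sizeof n->fwd[0]);
    if (n) n->key = key;
    return n;
}

/* Expected O(log n) search: scan right while the next key is smaller,
 * then drop down one level; finish on level 0. The head node is a
 * sentinel whose key is never examined. */
slnode *sl_search(slnode *head, int levels, int key) {
    slnode *x = head;
    for (int i = levels - 1; i >= 0; --i)
        while (x->fwd[i] && x->fwd[i]->key < key)
            x = x->fwd[i];
    x = x->fwd[0];
    return (x && x->key == key) ? x : NULL;
}
```
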
Conference Paper
Organization and maintenance of an index for a dynamic random access file is considered. It is assumed that the index must be kept on some pseudo random access backup store like a disc or a drum. The index organization described allows retrieval, insertion, and deletion of keys in time proportional to log_k I, where I is the size of the index and k is a device-dependent natural number such that the performance of the scheme becomes near optimal. Storage utilization is at least 50% but generally much higher. The pages of the index are organized in a special data structure, so-called B-trees. The scheme is analyzed, performance bounds are obtained, and a near optimal k is computed. Experiments have been performed with indexes up to 100,000 keys. An index of size 15,000 (100,000) can be maintained with an average of 9 (at least 4) transactions per second on an IBM 360/44 with a 2311 disc.
Conference Paper
We present applications of splay trees to two topics in data compression. First is a variant of the move-to-front (mtf) data compression algorithm (of Bentley, Sleator, Tarjan and Wei), where we introduce secondary list(s). This seems to capture higher-order correlations. An implementation of this algorithm with Sleator-Tarjan splay trees runs in time (provably) proportional to the entropy of the input sequence. When tested on some telephony data, compression ratio and run time showed significant improvements over the original mtf-algorithm, making it competitive with or better than popular programs. For stationary ergodic sources, we analyse the compression and output distribution of the original mtf-algorithm, which suggests why the secondary list is appropriate to introduce. We also derive analytical upper bounds on the average codeword length in terms of stochastic parameters of the source. Secondly, we consider the compression (or coding) of source sequences where the codewords are required to preserve the alphabetic order of the source symbols. We describe the use of the semi-splay tree, the regular splay tree, and splay trees with depth three and four for this application, and derive upper bounds on their compression efficiency. For example, the average codeword length of the semi-splay tree is no more than twice the source entropy plus some minor terms.
Conference Paper
The performance of the original version of the splay tree algorithm has been unchallenged for over a decade. We propose three randomized versions with better upper bounds on the expected running times (by constant factors). The improvements are particularly strong if the number of insertions is relatively small. All expectations are taken over the coin tosses of the randomized algorithms for worst case inputs. Hence slow running times are very unlikely for any request sequence. Algorithm A improves the expected running time, but could be very slow (with tiny probability). Algorithm B shows that without any loss in the original amortized running time, the expected running time can still be improved by a constant percentage. Algorithm C has the same efficient expected running time as Algorithm A, while its (worst case) amortized running time deteriorates only by a constant factor compared to standard deterministic splaying.
Book
This book describes data structures, methods of organizing large amounts of data, and algorithm analysis, the estimation of the running time of algorithms. As computers become faster and faster, the need for programs that can handle large amounts of input becomes more acute. Paradoxically, this requires more careful attention to efficiency, since inefficiencies in programs become most obvious when input sizes are large. By analyzing an algorithm before it is actually coded, students can decide if a particular solution will be feasible. For example, in this text students look at specific problems and see how careful implementations can reduce the time constraint for large amounts of data from 16 years to less than a second. Therefore, no algorithm or data structure is presented without an explanation of its running time. In some cases, minute details that affect the running time of the implementation are explored. Once a solution method is determined, a program must still be written. As computers have become more powerful, the problems they solve have become larger and more complex, thus requiring development of more intricate programs to solve the problems. The goal of this text is to teach students good programming and algorithm analysis skills simultaneously so that they can develop such programs with the maximum amount of efficiency. This book is suitable for either an advanced data structures (CS7) course or a first-year graduate course in algorithm analysis. Students should have some knowledge of intermediate programming, including such topics as pointers and recursion, and some background in discrete math.
Article
Adaptivity in sorting algorithms is sometimes gained at the expense of practicality. We give experimental results showing that Splaysort — sorting by repeated insertion into a Splay tree — is a surprisingly efficient method for in-memory sorting. Splaysort appears to be adaptive with respect to all accepted measures of presortedness, and it outperforms Quicksort for sequences with modest amounts of existing order. Although Splaysort has a linear space overhead, there are many applications for which this is reasonable. In these situations Splaysort is an attractive alternative to traditional comparison-based sorting algorithms such as Heapsort, Mergesort, and Quicksort.
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
A class of binary trees is described for maintaining ordered sets of data. Random insertions, deletions, and retrievals of keys can be done in time proportional to log N where N is the cardinality of the data-set. Symmetric B-Trees are a modification of B-trees described previously by Bayer and McCreight. This class of trees properly contains the balanced trees.
Article
We introduce a new data structure, the k-forest, which is a self-adjusting multi-way search tree. A k-forest provides an efficient implementation of a weighted dictionary in a virtual memory environment where the time to access a tree node is much greater than the time to examine data items in a node. We achieve results that are natural generalizations of the Working Set and Static Optimality theorems of Sleator and Tarjan for splay trees. Thus we are able to show that even when the lookup frequencies of the elements of the dictionary are not known in advance, a k-forest can equal or exceed the performance of an optimal static multi-way tree which is built knowing the lookup frequencies in advance.
Article
We survey results on self-organizing data structures for the search problem and concentrate on two very popular structures: the unsorted linear list, and the binary search tree. For the problem of maintaining unsorted lists, also known as the list update problem, we present results on the competitiveness achieved by deterministic and randomized on-line algorithms. For binary search trees, we present results for both on-line and off-line algorithms. Self-organizing data structures can be used to build very effective data compression schemes. We summarize theoretical and experimental results. This paper surveys results in the design and analysis of self-organizing data structures for the search problem. The general search problem in pointer data structures can be phrased as follows. The elements of a set are stored in a collection of nodes. Each node also contains O(1) pointers to other nodes and additional state data which can be used for navigation and self-o...
Conference Paper
In this paper we present a uniform framework for the implementation and study of balanced tree algorithms. We show how to imbed in this framework the best known balanced tree techniques and then use the framework to develop new algorithms which perform the update and rebalancing in one pass, on the way down towards a leaf. We conclude with a study of performance issues and concurrent updating.
Article
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments and that index-based searching is as accurate as existing exhaustive search schemes.
Article
String hashing is a fundamental operation, used in countless applications where fast access to distinct strings is required. In this paper we describe a class of string hashing functions and explore its performance. In particular, using experiments with both small sets of keys and a large key set from a text database, we show that it is possible to achieve performance close to that theoretically predicted for hashing functions. We also consider criteria for choosing a hashing function and use them to compare our class of functions to other methods for string hashing. These results show that our class of hashing functions is reliable and efficient, and is therefore an appropriate choice for general-purpose hashing.
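
A shift-add-XOR string hash of the general kind such studies consider looks as follows; the shift amounts and seed here are illustrative stand-ins, not the specific constants recommended by the paper.

```c
/* Mix each byte into the running hash with shifts, an add, and an XOR;
 * reduce onto the table size at the end. */
unsigned int bitwise_hash(const char *s, unsigned int table_size) {
    unsigned int h = 220373u;   /* arbitrary nonzero seed (assumption) */
    for (; *s; ++s)
        h ^= (h << 5) + (h >> 2) + (unsigned char)*s;
    return h % table_size;
}
```
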
Article
In this paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most efficient. We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably effective: as we show, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequently accessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10%--40% slower than a table with around one string per slot---while a table without move-to-front is perhaps 40% slower again---and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more efficient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average.
Article
Two stages in measurement of techniques for information retrieval are gathering of documents for relevance assessment and use of the assessments to numerically evaluate effectiveness. We consider both of these stages in the context of the TREC experiments, to determine whether they lead to measurements that are trustworthy and fair. Our detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found. We propose a new pooling strategy that can significantly increase the number of relevant documents found for given effort, without compromising fairness.
Article
Much has been said in praise of... this paper, we compare the performance of three different techniques for self-adjusting trees with that of AVL and random binary search trees. Comparisons are made for various tree sizes, levels of key-access-frequency skewness and ratios of insertions and deletions to searches. The results show that, because of the high cost of maintaining self-adjusting trees, in almost all cases the AVL tree outperforms all the self-adjusting trees and in many cases even a random binary search tree has better performance, in terms of CPU time, than any of the self-adjusting trees. Self-adjusting trees seem to perform best in a highly dynamic environment, contrary to intuition.
Sedgewick R. Algorithms in C, Parts 1-4: Fundamentals, Data Structures, Sorting, Searching. Addison-Wesley: Reading, MA, 1998.
Albers S, Westbrook J. Self-organizing data structures. Online Algorithms: The State of the Art, Fiat A, Woeginger G (eds.). Springer, 1998.
Ramakrishna MV, Zobel J. Performance in practice of string hashing functions. Proceedings of the International Conference on Database Systems for Advanced Applications, 1997.
Marsaglia G. Remarks on choosing and implementing random number generators. Communications of the ACM, 36(7):105-110, July 1993.
Guibas LJ, Sedgewick R. A dichromatic framework for balanced trees. 19th Annual Symposium on Foundations of Computer Science, 1978.
Sherk M. Self-adjusting k-ary search trees.
Moffat A. Splaysort: fast, versatile, practical. Software: Practice and Experience, 1996.