Article

# Compressing Integers for Fast File Access

## Abstract

Fast access to files of integers is crucial for the efficient resolution of queries to databases. Integers are the basis of indexes used to resolve queries, for example, in large internet search systems, and numeric data forms a large part of most databases. Disk access costs can be reduced by compression, if the cost of retrieving a compressed representation from disk and the CPU cost of decoding such a representation is less than that of retrieving uncompressed data. In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.

... • Vbyte. We included a simple posting list representation based on Vbyte [31] which uses no sampling and consequently performs intersections in a merge-wise fashion. We also included two alternatives using Vbyte coupled with sampling [9] (called Vbyte-CM), with k = {4, 32}, or domain sampling [28] (called Vbyte-ST), with B = {16, 128}. ...
... Vbyte Vbyte [31] Simple Vbyte encoding with no sampling. Intersections are performed in a merge-wise fashion. ...
... Intersections are performed in a merge-wise fashion. Vbyte-LZMA [6] No variants Encodes gaps with Vbyte [31] and, if the size of the resulting Vbyte sequence is ≥ 10 bytes, then it is further compressed with LZMA. ...
Preprint
This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work [5]. In that parent paper, we proposed many and varied techniques for compressing indexes which exploit that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as reproducibility package.
... Compressing inverted lists (IL) has been a challenge since the beginning of their usage. It is usually based on integer compression methods, most typically Golomb code [6], Elias codes [7], Rice code [8] or Variable byte codes [9,10]. The positional (word-level) index was explored by Choueka et al. in [11]. ...
... Gap encoding significantly changes the scale of the used integers and causes an increase of small numbers in the integer distribution. Thus, the resulting distribution becomes more compressible for most of the integer compression methods [6][7][8][9]. See the gap encoded sequence of positions in the second row below. ...
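The gap transformation mentioned in this snippet can be illustrated with a minimal Python sketch. The function names `to_gaps` and `from_gaps` and the sample positions are hypothetical and are not taken from the cited work; the sketch only assumes a strictly increasing list of positive integers.

```python
from itertools import accumulate


def to_gaps(positions):
    """Replace each position by its difference from the previous one.

    The first value is kept as is (its predecessor is taken to be 0).
    Assumes `positions` is strictly increasing.
    """
    gaps = []
    previous = 0
    for position in positions:
        gaps.append(position - previous)
        previous = position
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums of the gaps."""
    return list(accumulate(gaps))


# Hypothetical positions, chosen only for illustration.
positions = [3, 7, 8, 15, 40]
gaps = to_gaps(positions)
restored = from_gaps(gaps)
```

After the transformation most values are small, which is the property that codes assigning short codewords to small integers rely on.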
... We can observe in Figure 3 that the stem "hors" occurs i.a. in the document $D_7$. The document $D_7$ contains $f_{7,9} = 4$ occurrences of the stem in three different variants (see the Huffman tree $HT_{7,9}$). The first occurrence is at the position $FI_{7,9} = 1202$. ...
Article
We address the problem of positional indexing in the natural language domain. The positional inverted index contains the information of the word positions. Thus, it is able to recover the original text file, which implies that it is not necessary to store the original file. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded by variable byte code. Inverted lists of single terms are combined into one inverted list that represents the backbone of the text file since it stores the sequence of the indexed words of the original file. The inverted list is synchronized with a presentation layer that stores separators, stop words, as well as variants of the indexed words. The Huffman coding is used to encode the presentation layer. The space complexity of the PISI inverted list is where N is a number of stems, n is a number of unique stems, α is a step/period of the back pointers in the inverted list and b is the size of the word of computer memory given in bits. The space complexity of the presentation layer is $O\left(-\sum_{i=1}^{N} \lceil \log_2 p_i^{n(i)} \rceil - \sum_{j=1}^{N'} \lfloor \log_2 p'_j \rfloor + N\right)$, with $p_i^{n(i)}$ as the probability of a stem variant at position $i$, $p'_j$ as the probability of a separator or stop word at position $j$, and $N'$ as the number of separators and stop words.
... An inverted index consists of a set of postings lists, each describing the locations and occurrences in the collection of a single term. Proposals for compactly storing postings lists include byte-aligned codes [22,25]; word-aligned codes [2,3,23,28]; and binary-packed approaches [12,30]. ...
... Traditional techniques such as Golomb, Rice, and Elias codes (see Witten et al. [26] for details) operate on a bit-by-bit basis, and are relatively slow during the decoding process. Compromises that allow faster decoding, but with reduced compression effectiveness, include the byte-based VByte mechanism [22,25] and variants thereof [4,5,7]; and the word-based Simple approaches [2,3,23,28]. At the other end of the scale, the Interp mechanism of Moffat and Stuiver [15] provides very good compression effectiveness, but with even slower decoding than the bit-based Golomb and Elias codes. ...
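To make the "bit-by-bit" remark concrete, here is a minimal sketch of the standard Elias gamma code, written in Python over strings of `'0'` and `'1'` for readability rather than speed. The function names are hypothetical, and the decoder assumes a well-formed input made only of complete codewords.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer as a string of bits.

    The code is (len - 1) zeros followed by the binary form of n,
    where len is the number of bits in that binary form.
    """
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codewords, one bit at a time."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # unary prefix gives the payload length
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


stream = "".join(gamma_encode(n) for n in [1, 5, 2, 9])
decoded = gamma_decode_all(stream)
```

The inner loop that inspects individual bits is the kind of work that byte- and word-aligned codes avoid, at some cost in compression effectiveness.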
... One of the implementation drawbacks of the Packed+ANS arrangement described in Section 3.1 is the cost of maintaining the frames in the three-table form employed for ANS decoding (see Section 2.2). With m as large as $2^{25}$ in some frames, with M ≈ 8m, and with the two-dimensional approach meaning that the number of contexts might be close to $|S|(|S| + 1)/2$, execution-time memory space is a key factor that cannot be ignored. Caching effects mean that memory space also affects decoding speed. ...
Conference Paper
We examine approaches used for block-based inverted index compression, such as the OptPFOR mechanism, in which fixed-length blocks of postings data are compressed independently of each other. Building on previous work in which asymmetric numeral systems (ANS) entropy coding is used to represent each block, we explore a number of enhancements: (i) the use of two-dimensional conditioning contexts, with two aggregate parameters used in each block to categorize the distribution of symbol values that underlies the ANS approach, rather than just one; (ii) the use of a byte-friendly strategic mapping from symbols to ANS codeword buckets; and (iii) the use of a context merging process to combine similar probability distributions. Collectively, these improvements yield superior compression for index data, outperforming the reference point set by the Interp mechanism, and hence representing a significant step forward. We describe experiments using the 426 GiB gov2 collection and a new large collection of publicly-available news articles to demonstrate that claim, and provide query evaluation throughput rates compared to other block-based mechanisms.
... When the frequency values used in document ranking and the dictionary that must be built for term retrieval are considered alongside the inverted indexes, data compression becomes important because it reduces the rate of disk seeks [2]. In this context, compressing inverted indexes in particular has an important place in the optimization of information retrieval systems, since it can significantly speed up query processing through efficient decoding algorithms [3], [4]. ...
... There are many universal or adaptive coding techniques that can be used to compress gap values whose probabilities decrease monotonically, generally close to a geometric distribution. Examples of universal coding schemes are variable-byte coding [3] and Elias coding [5]; examples of adaptive coding schemes are Golomb coding [6], Simple coding [7], and PForDelta [8]. At this point, the aim is to choose the scheme that optimizes the efficiency of the information retrieval system in terms of memory usage and query processing speed, taking into account the trade-off that the technique offers between index compression ratio and decoding speed. For example, although Golomb codes show superior compression performance compared with the other techniques, they can fall behind in decoding speed [7], [9]. ...
Conference Paper
In this paper, an entropy coding technique, namely, combinatorial encoding, is investigated in the context of document indexing as an alternative compression scheme to the conventional gap encoding methods. To this purpose, a finite state transducer is designed to reduce the complexity of calculating the lexicographic rank of the encoded bitstream, and a component that will efficiently calculate the binomial coefficient for large numbers is developed. The encoding speed of the implemented solid and multi-block indexing schemes is empirically evaluated through varying term frequencies on randomly created bit strings. The method's ability to compress a memoryless source to its entropy limit, to yield an on-the-fly indexing scheme, and to conform to document reordering by means of the transducer have been designated as its most prominent aspects.
... Some of them achieve a high compression ratio, but at the expense of a lower decompression speed [15], for example Elias γ and δ encodings [21], Golomb/Rice parametric encodings [26], or interpolative coding [38]. Other approaches may achieve a (usually slightly) lower compression ratio, though with faster decompression speeds; examples are VByte [57], S9 [4] and its variants, PForDelta [62], Quasi-Succinct Coding [55], or SIMD-BP128 [32]. The best compression method depends on the scenario at hand. ...
... This method [57] encodes an integer number using a variable number of bytes. To do this, VByte uses the most significant bit in a byte as a flag. ...
Preprint
Positional ranking functions, widely used in web search engines and related search systems, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic nonpositional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time-space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index based alternatives for positional data. We aim to answer the question of whether positional data should be indexed, and how. We show that there is a wide range of practical time-space trade-offs. Moreover, we show that using about 1.30 times the space of positional data, we can store everything needed for efficient query processing, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact in query processing time.
... Byte-aligned encoder codes such as Varint and Group Varint represent an integer in bytes [1], [18]. The basic idea of Varint is to use the low 7 bits of a byte as the data area and the most significant bit as the status bit to indicate whether the byte is the last that stores the data of the integer. ...
... With the successive padding mode j 1 and j 2 , the status bits of SSimple-9 can be determined for the above Simple-9 coded integers (line 16). Then, the specific padding mode of SSimple-9 can be used to compress the above Simple-9 coded integers (lines 17 to 22). ...
Article
The growth in the amount of information available on the Internet and thousands of user queries per second bring huge challenges to the index update and query processing of search engines. Index compression is partially responsible for the current performance achievements of existing search engines. The selection of the index compression algorithms must weigh three factors, i.e., compression ratio, compression speed and decompression speed. In this paper, we study the well-known Simple-9 compression, which involves many branch, table lookup and data transfer operations when processing each 32-bit machine word. To enhance the compression and decompression performance of the Simple-9 algorithm, we propose a successive storage structure and processing metric to compress two successive Simple-9 encoded sequences of integers in a single data processing procedure, thus the name Successive Simple-9 (SSimple-9). In essence, the algorithm shortens the process of branch operations, table lookup and data transfer operations when compressing the integer sequence. More precisely, we initially present the data storage format and mask table of the SSimple-9 algorithm. Then, for each mode in the mask table, we design and hard-code the main steps of the compression and decompression processes. Finally, analysis and comparison of the experimental results on the simulation and TREC datasets show the compression and decompression speedup of the proposed SSimple-9 algorithm.
... One standard approach to storing postings lists, and other sequences of non-decreasing integers, is to transform them to gaps, and apply any integer compression regime that assigns short codewords to small values. A wide range of techniques have been used, including byte-aligned codes [19,23]; word-aligned mechanisms [1,2,20,26]; and binary-packed approaches [13,28]. The same methods can also be applied directly to the $f_{t,i}$ values without any transformation being required, as they are also usually small integers. ...
... Byte-Aligned Codes. In VByte compression [19,23] each input integer s is partitioned into 7-bit fragments, and each fragment is placed into a byte with a flag bit that indicates whether there are further bytes yet to come. That is, if s ≤ $2^7 = 128$ then a single byte suffices, with a flag bit of (say) 0. Values in the range $2^7 + 1$ to $2^{14}$ are coded in two bytes, the first with a flag bit of (say) 1, and the second with a flag of 0. Decoding is performed by shift-or reassembly, continuing until a byte with a flag bit of 0 is reached, marking the end of this code. ...
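The byte-aligned scheme described in this snippet can be sketched in Python as follows. Several details are assumptions made for illustration and are not confirmed by the snippet: the flag is placed in the high bit of each byte, fragments are emitted low-order first, and the value is stored as is, so one byte covers 0 to 127 (the snippet's ranges suggest its values start at 1). The function names are hypothetical.

```python
def vbyte_encode(value):
    """Encode one non-negative integer as 7-bit fragments.

    Flag convention follows the snippet: high bit 1 means more bytes
    follow, high bit 0 marks the last byte of the codeword.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while value >= 128:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation flag set
        value >>= 7
    out.append(value)  # final fragment, flag bit 0
    return bytes(out)


def vbyte_decode_all(data):
    """Decode a concatenation of codewords by shift-or reassembly."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:       # more fragments to come
            shift += 7
        else:                 # last fragment of this integer
            values.append(current)
            current = 0
            shift = 0
    return values


encoded = b"".join(vbyte_encode(v) for v in [5, 128, 300, 16384])
decoded = vbyte_decode_all(encoded)
```

Because every codeword ends on a byte boundary, the decoder needs only one mask, one shift, and one test per byte, which is the source of the speed advantage over bit-oriented codes noted in these snippets.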
Conference Paper
Techniques for effectively representing the postings lists associated with inverted indexes have been studied for many years. Here we combine the recently developed "asymmetric numeral systems" (ANS) approach to entropy coding and a range of previous index compression methods, including VByte, Simple, and Packed. The ANS mechanism allows each of them to provide markedly improved compression effectiveness, at the cost of slower decoding rates. Using the 426 GB GOV2 collection, we show that the combination of blocking and ANS-based entropy-coding against a set of 16 magnitude-based probability models yields compression effectiveness superior to most previous mechanisms, while still providing reasonable decoding speed.
... sequences in compressed form if the primary goal is fast decompression. Competitors in this space include Trotman's QMX codec [36]; the VByte and Simple16 byte- and word-aligned mechanisms [2,39]; and the PFOR approach of Zukowski et al. [45]. ...
... Byte- and Word-Aligned Codes. In byte-aligned compression methods [31,35,39] input integers are partitioned into 7-bit fragments, and the fragments are placed in bytes, leaving one bit spare per byte. That bit then serves as a flag to mark the last fragment of each integer, allowing the coded values to be reconstituted via byte-at-a-time shift and mask operations. ...
Conference Paper
Dictionary-based compression schemes provide fast decoding operation, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off balance between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
... To further improve the running time of the algorithm, we employ a few bit manipulation techniques that take advantage of our particular encoding scheme. With standard variable-byte encoding [38], we need to read multiple bytes to determine the size and decode the value. But by storing the size of the variable-byte value in a 2-bit code, we can determine the size by looking up the code c in a small array: sizeFromCode[c]. ...
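The size-lookup idea in this snippet can be sketched in Python as below. The table contents `(1, 2, 3, 4)`, the little-endian payload, and the function names are assumptions made for illustration; the snippet does not state which sizes the cited work maps its four codes to, nor where the 2-bit code is stored.

```python
# Hypothetical table: the payload size in bytes for each 2-bit code.
SIZE_FROM_CODE = (1, 2, 3, 4)


def encode_sized(value):
    """Return (code, payload) for an integer that fits in 32 bits."""
    if not 0 <= value < (1 << 32):
        raise ValueError("value must fit in 32 bits")
    size = max(1, (value.bit_length() + 7) // 8)  # bytes needed, at least 1
    code = SIZE_FROM_CODE.index(size)
    return code, value.to_bytes(size, "little")


def decode_sized(code, data, offset=0):
    """Read one value whose length is known up front from its code.

    Returns the value and the offset of the next payload.
    """
    size = SIZE_FROM_CODE[code]  # a single lookup, no per-byte flag test
    value = int.from_bytes(data[offset:offset + size], "little")
    return value, offset + size


code, payload = encode_sized(300)
value, next_offset = decode_sized(code, payload)
```

The decoder knows the payload length before touching the payload, whereas a continuation-bit decoder has to inspect each byte to find where the value ends.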
... With delta encoding, storing the scores, including the 2 bit header, takes only 4.0 and 9.6 bits per node and string, respectively. In comparison, standard variable-byte encoding with a single continuation bit [38] requires at least 8 bits per node. Similarly, we utilize an average of only 16.4 bits per string in the dataset to encode the tree structure. ...
Article
Today, every search application, whether on desktop, web, or mobile devices, provides some kind of query auto-completion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e. strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is an emerging case for web search engines and social networks, where it is necessary to index hundreds of millions of distinct queries to guarantee a reasonable time coverage; and for mobile devices, where the amount of memory is limited. Mobile devices are very common nowadays, and typing on a small screen is difficult, so users need help to speed up input. Compressing the scored string set is therefore beneficial. In this paper we present three different tree-based data structures to address this problem, each one with different space/time/complexity trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion.
... However, it only decreases the minimum number of bits necessary to encode the partitions, but not their final representation. Consequently, deltas are Vbyte encoded (Williams and Zobel, 1999): each byte used to encode a delta has one bit indicating whether the byte starts a new delta or not, allowing to remove unnecessary bytes from each delta. Thus, partitions use a variable number of bytes proportional to the minimum number of bits necessary to encode their deltas. ...
Article
The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.
... • We used vbyte (byte-aligned) codes [48] rather than bit-oriented Huffman codes to differentially encode include a sequence of byte-oriented codewords (either 1 or 2-byte codewords in our example) that are used to represent the gaps from the original Ψ structure. It can also contain a pair of codewords for the pair 1, L to encode a 1-run of length L. Of course, using byte-aligned rather than bit-oriented codes will imply a loss in compression effectiveness. ...
Article
Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal intervals when they are active. This work explores the use of the Compressed Suffix Array (CSA), a well-known compact and self-indexed data structure in the area of text indexing, to represent large temporal graphs. The new structure, called Temporal Graph CSA (TGCSA), is experimentally compared with the most competitive compact data structures in the state-of-the-art, namely, EdgeLog and CET. The experimental results show that TGCSA obtains a good space-time trade-off. It uses a reasonable space and is efficient for solving complex temporal queries. Furthermore, TGCSA has wider expressive capabilities than EdgeLog and CET, because it is able to represent temporal graphs where contacts on an edge can temporally overlap.
... We observe around a 27% and 30% reduction in size for Pri and Opt, respectively. We also try testing a traditional Bitmap compression method EWAH [4] (run-length), and the mathematical encoding methods Pfor [9] and Vbyte [8]. We see a decrease in size of 4.35% (EWAH), -1.0% (Pfor) and 11.3% (VByte), which is unsurprisingly poor. ...
Conference Paper
Large-scale search engines utilize inverted indexes which store ordered lists of document identifiers (docIDs) relevant to query terms, which can be queried thousands of times per second. In order to reduce storage requirements, we propose a dictionary-based compression approach for the recently proposed bitwise data-structure BitFunnel, which makes use of a Bloom filter. Compression is achieved through storing frequently occurring blocks in a dictionary. Infrequently occurring blocks (those which are not represented in the dictionary) are instead referenced using similar blocks that are in the dictionary, introducing additional false positive errors. We further introduce a docID reordering strategy to improve compression. Experimental results indicate an improvement in compression by 27% to 30%, at the expense of increasing the query processing time by 16% to 48% and increasing the false positive rate by around 7.6 to 10.7 percentage points.
... This allows us skipping blocks at search time, decompressing only the blocks that are relevant for a query. Among the existing compression schemes for inverted lists, we have classical encodings like Elias δ and γ (Elias, 1975) and Golomb/Rice (Golomb, 1966), as well as the more recent ones VByte (Williams & Zobel, 1999), Simple 9 (Anh & Moffat, 2005), and PForDelta (Zukowski, Héman, Nes, & Boncz, 2006) encodings. All these methods benefit from sequences of small integers. ...
Article
Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: inverted indexes. They store an inverted list per term of the vocabulary. The inverted list of a given term stores, among other things, the document identifiers (docIDs) of the documents that contain the term. Currently, inverted indexes can be stored efficiently using integer compression schemes. Previous research also studied how an optimized document ordering can be used to assign docIDs to the document database. This yields important improvements in index compression and query processing time. In this paper we show that using a hybrid compression approach on the inverted lists is more effective in this scenario, with two main contributions:
• First, we introduce a document reordering approach that aims at generating runs of consecutive docIDs in a properly-selected subset of inverted lists of the index.
• Second, we introduce hybrid compression approaches that combine gap and run-length encodings within inverted lists, in order to take advantage not only of small gaps, but also of long runs of consecutive docIDs generated by our document reordering approach.
Our experimental results indicate a reduction of about 10%–30% in the space usage of the whole index (just regarding docIDs), compared with the most efficient state-of-the-art results. Also, decompression speed is up to 1.22 times faster if the runs of consecutive docIDs must be explicitly decompressed, and up to 4.58 times faster if implicit decompression of these runs is allowed (e.g., representing the runs as intervals in the output). Finally, we also improve the query processing time of AND queries (by up to 12%), WAND queries (by up to 23%), and full (non-ranked) OR queries (by up to 86%), outperforming the best existing approaches.
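The general idea of combining gaps with run lengths can be illustrated with a deliberately simplified Python sketch. This is not the scheme of the cited paper: the `(gap, extra)` pair layout, the function names, and the sample list are all assumptions chosen to show how a run of consecutive docIDs collapses into a single entry.

```python
def encode_gaps_with_runs(doc_ids):
    """Encode strictly increasing docIDs (>= 1) as (gap, extra) pairs.

    `gap` is the distance from the previous docID to the start of a run,
    `extra` is how many further consecutive docIDs follow that start.
    """
    pairs = []
    previous = 0
    i = 0
    while i < len(doc_ids):
        j = i
        while j + 1 < len(doc_ids) and doc_ids[j + 1] == doc_ids[j] + 1:
            j += 1  # extend the run of consecutive docIDs
        pairs.append((doc_ids[i] - previous, j - i))
        previous = doc_ids[j]
        i = j + 1
    return pairs


def decode_gaps_with_runs(pairs):
    """Expand (gap, extra) pairs back into the explicit docID list."""
    doc_ids = []
    previous = 0
    for gap, extra in pairs:
        start = previous + gap
        doc_ids.extend(range(start, start + extra + 1))
        previous = start + extra
    return doc_ids


# Hypothetical inverted list containing two runs.
pairs = encode_gaps_with_runs([3, 4, 5, 6, 20, 21, 40])
restored = decode_gaps_with_runs(pairs)
```

A decoder that is allowed to report runs as intervals could return `(start, start + extra)` directly instead of expanding them, which corresponds to the "implicit decompression" the abstract mentions.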
... If the route names contain no common prefixes then the size of the stored suffixes in F C(R) reduces to that of R with no compression. A variable length non-negative integer encoding scheme such as VByte [94] is employed to compress the integer representing the length of the common prefixes π i , i = 1, · · · , n. The following proposition demonstrates the space efficiency of F C(R). ...
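The prefix-sharing storage this snippet refers to resembles what is commonly called front coding, which can be sketched in Python as below. The function names and the sample route names are hypothetical, and the sketch is not the FCTree structure itself: it only shows how each name reduces to a common-prefix length, which is a small integer that a scheme such as VByte could then compress, plus the remaining suffix.

```python
def front_code(names):
    """Encode a sorted list of names as (prefix_length, suffix) pairs.

    prefix_length counts the leading characters shared with the
    previous name in the list.
    """
    coded = []
    previous = ""
    for name in names:
        k = 0
        limit = min(len(previous), len(name))
        while k < limit and previous[k] == name[k]:
            k += 1
        coded.append((k, name[k:]))
        previous = name
    return coded


def front_decode(coded):
    """Rebuild the names by reusing the prefix of the previous name."""
    names = []
    previous = ""
    for k, suffix in coded:
        name = previous[:k] + suffix
        names.append(name)
        previous = name
    return names


# Hypothetical route names, for illustration only.
routes = ["/a/b", "/a/c", "/b"]
coded = front_code(routes)
restored = front_decode(coded)
```

When the names share no prefixes every `prefix_length` is 0 and the suffixes equal the original names, which matches the no-compression case described in the snippet.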
Thesis
Named data networking (NDN) is a content-centric future Internet architecture that uses routable content names instead of IP addresses to achieve location-independent forwarding. Nevertheless, NDN's design is limited to offering hosted applications a simple content pull mechanism. As a result, increased complexity is needed in developing applications that require more sophisticated content delivery functionalities (e.g., push, publish/subscribe, streaming, generalized forwarding, and dynamic content naming). This thesis introduces a novel Enhanced NDN (ENDN) architecture that offers an extensible catalog of content delivery services (e.g., adaptive forwarding, customized monitoring, and in-network caching control). More precisely, the proposed architecture allows hosted applications to associate their content namespaces with a set of services offered by ENDN. The design of ENDN starts from the current NDN architecture that is gradually modified to meet the evolving needs of novel applications. NDN switches use several forwarding tables in the packet processing pipeline, the most important one being the Forwarding Information Base (FIB). The NDN FIBs face stringent performance requirements, especially in Internet-scale deployments. Hence, to increase the NDN data plane scalability and flexibility, we first propose FCTree, a novel FIB data structure. FCTree is a compressed FIB data structure that significantly reduces the required storage space within the NDN routers while providing fast lookup and modification operations. FCTree also offers additional lookup types that can be used as building blocks to novel network services (e.g., in-network search engine). Second, we design a novel programmable data plane for ENDN using P4, a prominent data plane programming language. Our proposed data plane allows content namespaces to be processed by P4 functions implementing complex stateful forwarding behaviors. 
We thus extend existing P4 models to overcome their limitations with respect to processing string-based content names. Our proposed data plane also allows running independent P4 functions in isolation, thus enabling P4 code run-time pluggability. We further enhance our proposed data plane by making it protocol-independent using programmable parsers to allow interfacing with IP networks. Finally, we introduce a new control plane architecture that allows the applications to express their network requirements using intents. We employ Event-B machine (EBM) language modeling and tools to represent these intents and their semantics on an abstract model of the network. The resulting EBMs are then gradually refined to represent configurations at the programmable data plane. The Event-B method formally ensures the consistency of the different application requirements using proof obligations and verifies that the requirements of different intents do not contradict each other. Thus, the desired properties of the network or its services, as defined by the intent, are guaranteed to be satisfied by the refined EBM representing the final data-plane configurations. Experimental evaluation results demonstrate the feasibility and efficiency of our proposed architecture.
... Among these, Variable-Byte [40,44] (henceforth, VByte) is the most popular and used byte-aligned code. In particular, VByte owes its popularity to its sequential decoding speed and, indeed, it is the fastest representation up to date for integer sequences. ...
Preprint
The ubiquitous Variable-Byte encoding is considered one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by $2\times$ by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free, given that: we introduce an optimal partitioning algorithm that, by running in linear time and with low constant factors, does not affect indexing time; we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.
... In general, compression of sequences (or arrays) of numbers is one of the oldest problems in the compression field [46,47,48,49,50]. In particular, one could also see the problem of representing raster datasets as representing an array of numbers. ...
Article
Compact data structures are storage structures that combine a compressed representation of the data and the access mechanisms for retrieving individual data without the need of decompressing from the beginning. The target is to be able to keep the data always compressed, even in main memory, given that the data can be processed directly in that form. With this approach, we obtain several benefits: we can load larger datasets in main memory, we can make a better usage of the memory hierarchy, and we can obtain bandwidth savings in a distributed computational scenario, without wasting time in compressing and decompressing data during data exchanges. In this work, we follow a compact data structure approach to design a storage structure for raster data, which is commonly used to represent attributes of the space (temperatures, pressure, elevation measures, etc.) in geographical information systems. As it is common in compact data structures, our new technique is not only able to store and directly access compressed data, but also indexes its content, thereby accelerating the execution of queries. Previous compact data structures designed to store raster data work well when the raster dataset has few different values. Nevertheless, when the number of different values in the raster increases, their space consumption and search performance degrade. Our experiments show that our storage structure improves previous approaches in all aspects, especially when the number of different values is large, which is critical when applying over real datasets. Compared with classical methods for storing rasters, namely netCDF, our method competes in space and excels in access and query times.
... The first example is Vbyte [28]. The $\lfloor \log s_i \rfloor + 1$ bits required to represent $s_i$ in binary are split into blocks of $b - 1$ bits. The chunk holding the most significant bits of $s_i$ is completed with a bit set to 0 in the highest position, whereas the rest are completed with a bit set to 1 in the highest position. ...
Article
This paper presents 2 main contributions. The first is a compact representation of huge sets of functional data or trajectories of continuous-time stochastic processes, which allows keeping the data always compressed even during the processing in main memory. It is oriented to facilitate the efficient computation of the sample autocovariance function without a previous decompression of the data set, by using only partial local decoding. The second contribution is a new memory-efficient algorithm to compute the sample autocovariance function. In our experiments, the combination of the compact representation and the new memory-efficient algorithm yielded the following benefits. The compressed data occupy on disk 75% of the space needed by the original data. The computation of the autocovariance function used up to 13 times less main memory, and ran 65% faster than the classical method implemented, for example, in the R package.
... However, it only decreases the minimum number of bits necessary to encode the partitions but not their final representation. Consequently, deltas are Vbyte encoded [32]: each byte used to encode a delta has one bit indicating whether the byte starts a new delta or not, allowing to remove unnecessary bytes from each delta. Thus, partitions use a variable number of bytes proportional to the minimum number of bits necessary to encode their deltas. ...
Conference Paper
The advent of High Throughput Sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pan-genomes. The ideal way to represent and transfer pan-genomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pan-genome. In this paper, we present DARRC, a new alignment-free and reference-free compression method. It addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large P. aeruginosa dataset, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared to the best performing state-of-the-art HTS-specific compression method in our experiments.
... Many compression techniques have been developed and evaluated to deal with long lists of integers that represent the document identifiers and complementary information associated with each term. For example, classic techniques such as Variable-Byte Encoding (Williams and Zobel 1999) and Simple 9 (Anh and Moffat 2005) or PForDelta (Zukowski et al. 2006) and its optimized versions are commonly used. ...
Article
Full-text available
Modern information retrieval systems use several levels of caching to speed up computation by exploiting frequent, recent or costly data used in the past. Previous studies show that the use of caching techniques is crucial in search engines, as it helps reduce query response times and processing workloads on search servers. In this work we propose and evaluate a static cache that acts simultaneously as list and intersection cache, offering a more efficient way of handling cache space. We also use a query resolution strategy that takes advantage of the existence of this cache to reorder the query execution sequence. In addition, we propose effective strategies to select the term pairs that should populate the cache. We also represent the data in cache in both raw and compressed forms and evaluate the differences between them using different configurations of cache sizes. The results show that the proposed Integrated Cache outperforms the standard posting lists cache in most of the cases, taking advantage not only of the intersection cache but also the query resolution strategy.
... This finding is of interest to information retrieval researchers for a number of reasons. Most previous index compression studies considered only the case where postings reside on disk (Williams and Zobel 1999; Scholer et al. 2002; Anh and Moffat 2005a; Brewer 2005). Main memory has become sufficiently plentiful that in-memory indexes are considered the common (if not the default) setting for search engines today. ...
Article
Full-text available
This paper explores the performance of top k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings with variable byte encoding, Simple-8b, variants of the QMX compression scheme, as well as a condition that is less often considered—no compression. Across four web test collections, we find that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger. We explain this finding in terms of the design of modern processor architectures: Index segments with high impact scores are usually short and inherently benefit from cache locality. Index segments with lower impact scores may be quite long, but modern architectures have sufficient memory bandwidth (coupled with prefetching) to “keep up” with the processor. Our results highlight the importance of “architecture affinity” when designing high-performance search engines.
... Variable-byte coding [42] uses a sequence of bytes to provide a compressed representation of integers. In particular, when compressing an integer n, the seven least significant bits of each byte are used to code n, whereas the most significant bit of each byte is set to 0 in the last byte of the sequence and to 1 if further bytes follow. ...
Article
Full-text available
A multitude of contemporary applications heavily involve graph data whose size appears to be ever-increasing. This trend shows no signs of subsiding and has caused the emergence of a number of distributed graph processing systems including Pregel, Apache Giraph and GraphX. However, the unprecedented scale now reached by real-world graphs complicates the task of graph processing due to excessive memory demands even for distributed environments. By and large, such contemporary graph processing systems employ ineffective in-memory representations of adjacency lists. Therefore, memory usage patterns emerge as a primary concern in distributed graph processing. We seek to address this challenge by exploiting empirically-observed properties demonstrated by graphs generated by human activity. In this paper, we propose 1) three compressed adjacency list representations that can be applied to any distributed graph processing system, 2) a variable-byte encoded representation of out-edge weights for space-efficient support of weighted graphs, and 3) a tree-based compact out-edge representation that allows for efficient mutations on the graph elements. We experiment with publicly-available graphs whose size reaches two billion edges and report our findings in terms of both space-efficiency and execution time. Our suggested compact representations do reduce respective memory requirements for accommodating the graph elements up to 5 times if compared with state-of-the-art methods. At the same time, our memory-optimized methods retain the efficiency of uncompressed structures and enable the execution of algorithms for large scale graphs in settings where contemporary alternative structures fail due to memory errors.
... If the route names contain no common prefixes then the size of FC(S) reduces to that of S with no compression. A variable length non-negative integer encoding scheme such as VByte [25] is usually employed in order to compress the integer representing the length of the common prefix. This compression can also be applied to the pointers to the list $l_i$, $i = 1, \dots, n$ of next-hop ports. ...
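The variable-byte scheme described in the excerpt above (seven payload bits per byte, with the high bit set to 1 when further bytes follow and to 0 on the last byte) can be illustrated with a minimal Python sketch. The function names `vbyte_encode` and `vbyte_decode_all` are hypothetical, and the choice to emit the least significant 7-bit chunk first is an assumption; the excerpt does not fix the chunk order, and this is not the code of any of the cited systems.

```python
from typing import List


def vbyte_encode(n: int) -> bytes:
    """Encode one non-negative integer, least significant 7-bit chunk first (assumed order)."""
    if n < 0:
        raise ValueError("variable-byte coding handles non-negative integers only")
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)  # high bit 1: more bytes follow
        n >>= 7
    out.append(n)  # high bit 0: last byte of this integer
    return bytes(out)


def vbyte_decode_all(data: bytes) -> List[int]:
    """Decode a concatenation of variable-byte codes back into integers."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating
        else:
            values.append(current)
            current = 0
            shift = 0
    if shift:
        raise ValueError("input ends in the middle of an integer")
    return values


numbers = [0, 127, 128, 300, 2**32]
stream = b"".join(vbyte_encode(n) for n in numbers)
assert vbyte_decode_all(stream) == numbers
```

Values below 128 occupy a single byte, which is why the scheme is attractive when most integers are small, and decoding only needs byte-aligned reads, shifts and masks.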
Conference Paper
Named Data Networking (NDN) is a future Internet architecture that replaces IP addresses with namespaces of contents that are searchable at the network layer. A challenging task for NDN routers is to manage forwarding-information bases (FIBs) that store next-hop routes to contents using their stored usually long names or name prefixes. In this paper, we propose FCTree, a compressed FIB data structure that significantly reduces the required storage space at the router and can efficiently meet the demands of having routes that are orders of magnitude larger than IP-based ones in conventional routing tables. FCTree employs a localized front-coding compression to buckets containing partitions of the routes. The top routes in these buckets are then organized in B-ary self-balancing trees. By adjusting the size of the buckets, the router can reach an optimal tradeoff between the latency of the longest prefix matching (LPM) operation and the FIB storage space. In addition, in contrast to existing hash and bloom-filter based solutions, the proposed FCTree structure can significantly reduce the latency required for range and wildcard searches (e.g., for latency sensitive streaming applications or network-layer search engines) where up to k routes are returned if they are prefixed by a requested name. Performance evaluation results demonstrate the significant space savings achieved by FCTree compared to traditional hash-based FIBs.
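As a rough illustration of the front coding mentioned in this entry, the Python sketch below stores each name in a sorted bucket as the length of the prefix it shares with the previous name plus the remaining suffix. The helper names and the example route names are hypothetical, and the shared-prefix lengths are kept as plain integers; in a real structure they could be written with a variable-length code such as VByte, as the excerpt suggests. This is not the FCTree implementation.

```python
from typing import List, Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_names: List[str]) -> List[Tuple[int, str]]:
    """Replace each name by (shared prefix length with the previous name, remaining suffix)."""
    encoded = []
    prev = ""
    for name in sorted_names:
        k = common_prefix_len(prev, name)
        encoded.append((k, name[k:]))
        prev = name
    return encoded


def front_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the original names by reusing the prefix of the previously decoded name."""
    names = []
    prev = ""
    for k, suffix in encoded:
        name = prev[:k] + suffix
        names.append(name)
        prev = name
    return names


routes = ["/ndn/edu/a", "/ndn/edu/b", "/ndn/org"]  # hypothetical route names
bucket = front_encode(routes)
assert front_decode(bucket) == routes
```

When consecutive names share no prefix, every stored length is 0 and every suffix is the full name, which matches the remark that the front-coded size then reduces to that of the uncompressed set.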
... Many representations for inverted lists are known, each exposing a different space/time trade-off [10]. Among these, Variable-Byte [11], [12] (henceforth, VByte) is the most popular and used byte-aligned code. In particular, VByte owes its popularity to its sequential decoding speed and, indeed, it is one of the fastest representations for integer sequences. ...
Article
Full-text available
The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by $2\times$ by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free, given that: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.
... For even larger numbers, the so-called Variable Byte [53] (VByte) representation is interesting, as it offers fast decoding by accessing byte-aligned data. The idea is to split each integer into 7-bit chunks and encode each chunk in a byte. ...
Preprint
Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. Several recent applications need to represent highly repetitive sequences, and classical statistical compression proves ineffective. We introduce, instead, grammar-based representations for repetitive sequences, which use up to 6% of the space needed by statistically compressed representations, and support direct access and rank/select operations within tens of microseconds. We demonstrate the impact of our structures in text indexing applications.
... The survey by Zobel and Moffat [113] covers more than 40 years of academic research in Information Retrieval and gives an introduction to the field, with Section 8 dealing with efficient index representations. Moffat and Turpin [68], Moffat [58], Pibiri and Venturini [82] describe several of the techniques illustrated in this article; Williams and Zobel [108], Scholer et al. [92] and Trotman [102] experimentally evaluate many of them. ...
Article
Full-text available
The data structure at the core of large-scale search engines is the inverted index , which is essentially a collection of sorted integer sequences called inverted lists . Because of the many documents indexed by such engines and stringent performance requirements imposed by the heavy load of queries, the inverted index stores billions of integers that must be searched efficiently. In this scenario, index compression is essential because it leads to a better exploitation of the computer memory hierarchy for faster query processing and, at the same time, allows reducing the number of storage machines. The aim of this article is twofold: first, surveying the encoding algorithms suitable for inverted index compression and, second, characterizing the performance of the inverted index through experimentation.
... The particular use of GC in encoding inverted indexes is well motivated by the fact that gaps between document identifiers approximately follow a geometric distribution, provided that documents are randomly ordered [5]. Empirical studies conducted on various datasets have also shown that GC yields fine results in terms of compression efficiency [7]-[10]. However, one drawback of GC relates to the bit-level decoding of prefix codes making the technique unsuitable for large-scale search engines where high-performance retrieval is crucial [11]. ...
Conference Paper
Full-text available
In this paper, we present a finite-state approach to efficiently decode a sequence of Rice codes. The proposed method is capable of decoding a byte stream for any Golomb parameter for unboundedly large alphabets with constant space complexity. Performance of the approach is evaluated in comparison to the conventional bit-level decoding algorithm on compressed inverted indexes with respect to various parameter values. It is observed that decoding performance of the method increases with the mean value of encoded integers. Speed gains up to about a factor of 2 are empirically obtained in comparison with the conventional decoding from the point where the optimal value of the divisor is computed as 128. Results show that it is particularly effective for tasks in which a stream of large integers are encoded such as compression of document identifier gaps in inverted indexes.
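For orientation, the conventional bit-level Rice coding that this abstract uses as its baseline can be sketched in Python as follows. This is not the finite-state decoder the paper proposes. The function names, the use of text strings of '0' and '1' in place of a packed bit stream, the unary convention (a run of 1s closed by a 0) and the sample document identifiers are all assumptions made for illustration.

```python
from itertools import accumulate
from typing import List


def rice_encode(n: int, k: int) -> str:
    """Rice code with divisor 2**k: quotient in unary, then the remainder in k binary digits."""
    q, r = n >> k, n & ((1 << k) - 1)
    remainder = format(r, "b").zfill(k) if k else ""
    return "1" * q + "0" + remainder


def rice_decode_all(bits: str, k: int) -> List[int]:
    """Bit-by-bit decoding of a well-formed concatenation of Rice codes."""
    values = []
    pos = 0
    while pos < len(bits):
        q = 0
        while bits[pos] == "1":  # count the unary run
            q += 1
            pos += 1
        pos += 1  # skip the 0 that terminates the unary part
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append((q << k) | r)
    return values


doc_ids = [3, 7, 8, 15]  # hypothetical sorted document identifiers
gaps = [b - a for a, b in zip([0] + doc_ids, doc_ids)]  # [3, 4, 1, 7]
stream = "".join(rice_encode(g, 2) for g in gaps)
assert list(accumulate(rice_decode_all(stream, 2))) == doc_ids
```

The inner loop that inspects one bit at a time is the cost that byte-oriented or table-driven decoders aim to avoid.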
... Classical methods include Elias encodings [24] and Golomb/Rice encoding [25]. Newer methods are VByte [26], Simple [27], Interpolative [28], PForDelta [29]. Other techniques are proposed in [30] [31]. ...
Article
Full-text available
Modern information systems are very large and maintain huge amounts of data, processing millions of documents and millions of queries at any time. To select the most important responses from this volume of data, it is useful to apply so-called early termination algorithms. These attempt to extract the top-K documents according to a specified increasing monotone function. The principal idea is to reach and score only the smallest possible number of the most significant documents, thereby avoiding full processing of all documents. The WAND algorithm is the state of the art in this area. Although it is efficient, it lacks effectiveness and precision. In this paper, we propose two contributions. The principal one is a new early termination algorithm based on the WAND approach, which we call MWAND (Modified WAND). It is faster and more precise than WAND and is able to avoid unnecessary WAND steps. In this work, we integrate a tree structure as an index into WAND and add new levels to query processing. As the second contribution, we define new fine-grained metrics to improve the evaluation of the retrieved information. Experimental results on real datasets show that MWAND is more efficient than the WAND approach.
... If the route names contain no common prefixes then the size of the stored suffixes in FC(R) reduces to that of R with no compression. A variable length non-negative integer encoding scheme such as VByte [36] is employed to compress the integer representing the length of the common prefixes $\pi_i$, $i = 1, \dots, n$. ...
Article
Named data networking (NDN) is a nascent vision for the future Internet that replaces IP addresses with content names searchable at the network layer. One challenging task for NDN routers is to manage huge forwarding information bases (FIBs) that store next-hop routes to contents. In this article, we propose a family of compressed FIB data structures that significantly reduce the required storage space within the NDN routers. Our first compressed FIB data structure is FCTree. FCTree employs a localized front-coding compression, that eliminates repeated prefixes, to buckets containing partitions of routes. These buckets are then organized in self-balancing trees to speed up the longest prefix match (LPM) operations. We propose two enhancements to FCTree, a statistically compressed FCTree (StFCTree) and a dictionary compressed FCTree (DiFCTree). Both StFCTree and DiFCTree achieve higher compression ratios for NDN FIBs and can be used for FIB updates or exchanges between the forwarding and control planes. Finally, we provide the control plane with several knobs that can be employed to achieve different target trade-offs between the lookup speed and the FIB size in each of these structures. Theoretical analysis along with experimental results demonstrate the significant space savings and performance achieved by the proposed schemes.
... We decided to use byte-wise compression rather than bit-wise compression because the latter does not appear to be worthwhile [63]. Note that more complex compression schemes could also be used (e.g., VByte [94]) but this should be seen as future work. ...
Preprint
The increasing availability and usage of Knowledge Graphs (KGs) on the Web calls for scalable and general-purpose solutions to store this type of data structures. We propose Trident, a novel storage architecture for very large KGs on centralized systems. Trident uses several interlinked data structures to provide fast access to nodes and edges, with the physical storage changing depending on the topology of the graph to reduce the memory footprint. In contrast to single architectures designed for single tasks, our approach offers an interface with few low-level and general-purpose primitives that can be used to implement tasks like SPARQL query answering, reasoning, or graph analytics. Our experiments show that Trident can handle graphs with 10^11 edges using inexpensive hardware, delivering competitive performance on multiple workloads.
... In this paper we use Variable-byte (Vbyte) encoding [21], a simple integer compression technique that essentially splits an integer into 7-bit chunks, and stores them in consecutive bytes, using the most significant bit of each byte to mark whether the number has more chunks or not. It is simple to implement and fast to decode. ...
Preprint
We introduce a new family of compressed data structures to efficiently store and query large string dictionaries in main memory. Our main technique is a combination of hierarchical Front-coding with ideas from longest-common-prefix computation in suffix arrays. Our data structures yield relevant space-time tradeoffs in real-world dictionaries. We focus on two domains where string dictionaries are extensively used and efficient compression is required: URL collections, a key element in Web graphs and applications such as Web mining; and collections of URIs and literals, the basic components of RDF datasets. Our experiments show that our data structures achieve better compression than the state-of-the-art alternatives while providing very competitive query times.
... 2. We can shorten the scan time with the skipping technique used in array hashing [8]. This technique puts its length in front of each node label via some prefix encoding such as VByte [58]. Note that we can omit the terminators of each node label. ...
Preprint
A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly-compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are only efficient in the static case, it is still difficult to implement a keyword dictionary that is space efficient and dynamic. In this article, we propose such a keyword dictionary. Our main idea is to embrace the path decomposition technique, which was proposed for constructing cache-friendly tries. To store the path-decomposed trie in small memory, we design data structures based on recent compact hash trie representations. Exhaustive experiments on real-world datasets reveal that our dynamic keyword dictionary needs up to 68% less space than the existing smallest ones.
... Since each position uses a fixed number of bits, they can be easily positionally accessed for decompression. It is possible to use other techniques to encode the positions that may use less space [Variable Byte (Williams and Zobel, 1999), Golomb/Rice (Golomb, 1966), etc.], but in our tests the gain in space was negligible and the negative effect on decompression times was noticeable. On the other hand, factor lengths are significantly compressed using Golomb codes (Golomb, 1966). ...
Article
Motivation: Genome repositories are growing faster than our storage capacities, challenging our ability to store, transmit, process and analyze them. While genomes are not very compressible individually, those repositories usually contain myriads of genomes or genome reads of the same species, thereby creating opportunities for orders-of-magnitude compression by exploiting inter-genome similarities. A useful compression system, however, cannot be only usable for archival, but it must allow direct access to the sequences, ideally in transparent form so that applications do not need to be rewritten. Results: We present a highly compressed filesystem that specializes in storing large collections of genomes and reads. The system obtains orders-of-magnitude compression by using Relative Lempel-Ziv, which exploits the high similarities between genomes of the same species. The filesystem transparently stores the files in compressed form, intervening the system calls of the applications without the need to modify them. A client/server variant of the system stores the compressed files in a server, while the client's filesystem transparently retrieves and updates the data from the server. The data between client and server are also transferred in compressed form, which saves an order of magnitude network time. Availability and implementation: The C++ source code of our implementation is available for download in https://github.com/vsepulve/relz_fs.
Article
String dictionaries store a collection $\left(s_i\right)_{0\le i < m}$ of m variable-length keys (strings) over an alphabet $\varSigma$ and support the operations lookup (given a string $s\in \varSigma^*$, decide if $s_i=s$ for some i, and return this i) and access (given an integer $0\le i < m$, return the string $s_i$). We show how to modify the Lempel–Ziv-78 data compression algorithm to store the strings space-efficiently and support the operations lookup and access in optimal time. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. We achieve compression ratios often outperforming the existing alternatives, especially on dictionaries containing many repeated substrings. Our query times remain competitive.
Article
The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck, especially when the data is too big for internal memory. We describe two new algorithms for computing the LCP array from the suffix array in external memory. Experiments demonstrate that the new algorithms are about a factor of two faster than the fastest previous algorithm. We then further engineer the two new algorithms and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism. Eight threads is sufficient for making the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms making them in-place: the input (text and suffix array) is treated as read-only, and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets. All previous implementations assume the byte alphabet.
Article
A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are only efficient in the static case, it is still difficult to implement a keyword dictionary that is space efficient and dynamic . In this article, we propose such a keyword dictionary. Our main idea is to embrace the path decomposition technique, which was proposed for constructing cache-friendly tries. To store the path-decomposed trie in small memory, we design data structures based on recent compact hash trie representations. Experiments on real-world datasets reveal that our dynamic keyword dictionary needs up to 68% less space than the existing smallest ones, while achieving a relevant space-time tradeoff.
Article
An entropy coder takes as input a sequence of symbol identifiers over some specified alphabet and represents that sequence as a bitstring using as few bits as possible, typically assuming that the elements of the sequence are independent of each other. Previous entropy coding methods include the well-known Huffman and arithmetic approaches. Here we examine the newer asymmetric numeral systems (ANS) technique for entropy coding and develop mechanisms that allow it to be efficiently used when the size of the source alphabet is large—thousands or millions of symbols. In particular, we examine different ways in which probability distributions over large alphabets can be approximated and in doing so infer techniques that allow the ANS mechanism to be extended to support large-alphabet entropy coding. As well as providing a full description of ANS, we also present detailed experiments using several different types of input, including data streams arising as typical output from the modeling stages of text compression software, and compare the proposed ANS variants with Huffman and arithmetic coding baselines, measuring both compression effectiveness and also encoding and decoding throughput. We demonstrate that in applications in which semi-static compression is appropriate, ANS-based coders can provide an excellent balance between compression effectiveness and speed, even when the alphabet is large.
Article
Search engines are exceptionally important tools for accessing information in today's world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also providing the latest trends in the literature in efficient query processing, including the coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query tradeoffs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. 
Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time, energy-efficient and modern hardware and software architectures.
Conference Paper
A keyword dictionary is an associative array with string keys. Although it is a classical data structure, recent applications require the management of massive string data using the keyword dictionary in main memory. Therefore, its space-efficient implementation is very important. If limited to static applications, there are a number of very compact dictionary implementations; however, existing dynamic implementations consume much larger space than static ones. In this paper, we propose a new practical implementation of space-efficient dynamic keyword dictionaries. Our implementation uses path decomposition, which is proposed for constructing cache-friendly trie structures, for dynamic construction in compact space with a different approach. Using experiments on real-world datasets, we show that our implementation can construct keyword dictionaries in spaces up to 2.8x smaller than the most compact existing dynamic implementation.
Article
Positional ranking functions, widely used in web search engines and related search systems, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic nonpositional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time–space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index based alternatives for positional data. We aim to answer the question of whether positional data should be indexed, and how. We show that there is a wide range of practical time–space trade-offs. Moreover, we show that using about 1.30 times the space of positional data, we can store everything needed for efficient query processing, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact in query processing time.
Article
We examine index representation techniques for document-based inverted files, and present a mechanism for compressing them using word-aligned binary codes. The new approach allows extremely fast decoding of inverted lists during query processing, while providing compression rates better than other high-throughput representations. Results are given for several large text collections in support of these claims, both for compression effectiveness and query efficiency.
Article
Efficient storage of large inverted indexes is one of the key technologies that support current web search services. Here we re-examine mechanisms for representing document-level inverted indexes and within-document term frequencies, including comparing specialized methods developed for this task against recent fast implementations of general-purpose adaptive compression techniques. Experiments with the Gov2-URL collection and a large collection of crawled news stories show that standard compression libraries can provide compression effectiveness as good as or better than previous methods, with decoding rates only moderately slower than reference implementations of those tailored approaches. This surprising outcome means that high-performance index compression can be achieved without requiring the use of specialized implementations.
Article
This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work [5]. In that parent paper, we proposed many and varied techniques for compressing indexes which exploit that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as reproducibility package.
Conference Paper
The inverted index is the core data structure in large-scale information retrieval systems such as Web search engines. Index compression techniques are usually used to reduce storage space and the transmission time from disk to memory. Many index compression schemes have been proposed, and among them Binary Interpolative Coding (IPC) is one of the most widely used due to its superior compression ratio (CR). However, the decompression speed of IPC is relatively slow, so fully decompressing (FD) IPC slows down the whole process of online query processing. In this paper, we first point out that it is unnecessary to fully decompress all the IPC nodes in query processing and then propose a partial decompression (PD) algorithm for IPC. Experimental results on two publicly available standard corpora show that, compared with normal IPC, our algorithm performs 40% faster for Boolean conjunctive queries and 20% faster for Rank queries without additional storage consumption.
Article
The problems arising in the modeling and coding of strings for compression purposes are discussed. The notion of an information source that simplifies and sharpens the traditional one is axiomatized, and adaptive and nonadaptive models are defined. With a measure of complexity assigned to the models, a fundamental theorem is proved which states that models that use any kind of alphabet extension are inferior to the best models using no alphabet extensions at all. A general class of so-called first-in first-out (FIFO) arithmetic codes is described which require no alphabet extension devices and which therefore can be used in conjunction with the best models. Because the coding parameters are the probabilities that define the model, their design is easy, and the application of the code is straightforward even with adaptively changing source models.
Article
A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.
Article
This article reports on a variety of compression algorithms developed in the context of a project to put all the data files for a full-text retrieval system on CD-ROM. In the context of inexpensive pre-processing, a text-compression algorithm is presented that is based on Markov-modeled Huffman coding on an extended alphabet. Data structures are examined for facilitating random access into the compressed text. In addition, new algorithms are presented for compression of word indices, both the dictionaries (word lists) and the text pointers (concordances). The ARTFL database is used as a test case throughout the article.
Conference Paper
A query to a nucleotide database is a DNA sequence. Answers are similar sequences, that is, sequences with a high-quality local alignment. Existing techniques for finding answers use exhaustive search, but it is likely that, with increasing database size, these algorithms will become prohibitively expensive. We have developed a partitioned search approach, in which local alignment string matching techniques are used in tandem with an index. We show that fixed-length substrings, or intervals, are a suitable basis for indexing in conjunction with local alignment on likely answers. By use of suitable compression techniques the index size is held to an acceptable level, and queries can be evaluated several times more quickly than with exhaustive search techniques.
Article
When data compression is applied to full-text retrieval systems, intricate relationships emerge between the amount of compression, access speed, and computing resources required. We propose compression methods, and explore corresponding tradeoffs, for all components of static full-text systems such as text databases on CD-ROM. These components include lexical indexes, inverted files, bitmaps, signature files, and the main text itself. Results are reported on the application of the methods to several substantial full-text databases, and show that a large, unindexed text can be stored, along with indexes that facilitate fast searching, in less than half its original size—at some appreciable cost in primary memory requirements. © 1993 John Wiley & Sons, Inc.
Article
Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available. This document is available online at ACM Transactions on Information Systems.
Article
Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching. Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools. Availability: The direct coding scheme (cino) is available free of charge by anonymous ftp from goanna.cs.rmit.edu.au in the directory pub/rmit/cino. Contact: E-mail: [email protected]
Conference Paper
During its long gestation in the 1970s and early 1980s, arithmetic coding was widely regarded more as an academic curiosity than a practical coding technique. One factor that helped it gain the popularity it enjoys today was the publication in 1987 of source code for a multi-symbol arithmetic coder in Communications of the ACM. Now (1995), our understanding of arithmetic coding has further matured, and it is timely to review the components of that implementation and summarise the improvements that we and other authors have developed since then. We also describe a novel method for performing the underlying calculation needed for arithmetic coding. Accompanying the paper is a “Mark II” implementation that incorporates the improvements we suggest. The areas examined include: changes to the coding procedure that reduce the number of multiplications and divisions and permit them to be done to low precision; the increased range of probability approximations and alphabet sizes that can be supported using limited precision calculation; data structures for support of arithmetic coding on large alphabets; the interface between the modelling and coding subsystems; the use of enhanced models to allow high performance compression. For each of these areas, we consider how the new implementation differs from the CACM package.
Conference Paper
Document databases contain large volumes of text, and currently have typical sizes into the gigabyte range. In order to efficiently query these text collections some form of index is required, since without an index even the fastest of pattern matching techniques results in unacceptable response times. One pervasive indexing method is the use of inverted files, also sometimes known as concordances or postings files. There have been a number of efforts made to capture the “clustering” effect, and to design index compression methods that condition their probability predictions according to context. In these methods information as to whether or not the most recent (or second most recent, and so on) document contained term t is used to bias the prediction that the next document will contain term t. We further extend this notion of context-based index compression, and describe a surprisingly simple index representation that gives excellent performance on all of our test databases; allows fast decoding; and is, even in the worst case, only slightly inferior to Golomb (1966) coding.
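For orientation, the Golomb baseline named in this abstract can be sketched as follows. This is the standard textbook construction (unary quotient plus truncated-binary remainder) applied to positive integers such as document gaps, not the context-based representation the paper proposes. The function names, the bit-string output, and the choice of "1...10" for the unary part are illustrative assumptions.

```python
def _to_bits(value, width):
    """Binary string of `value` padded to `width` bits ('' when width is 0)."""
    return format(value, "b").zfill(width) if width > 0 else ""


def golomb_encode(x, b):
    """Golomb codeword for integer x >= 1 with parameter b >= 1, as a '0'/'1' string."""
    if x < 1 or b < 1:
        raise ValueError("x and b must both be >= 1")
    q, r = divmod(x - 1, b)
    k = (b - 1).bit_length()        # ceil(log2(b))
    cutoff = (1 << k) - b           # remainders below this use k - 1 bits
    if r < cutoff:
        tail = _to_bits(r, k - 1)
    else:
        tail = _to_bits(r + cutoff, k)
    return "1" * q + "0" + tail     # unary quotient, then truncated-binary remainder


def golomb_decode(bits, b, pos=0):
    """Decode one codeword starting at bits[pos]; return (value, next position)."""
    q = 0
    while bits[pos] == "1":
        q += 1
        pos += 1
    pos += 1                        # skip the 0 that terminates the unary part
    k = (b - 1).bit_length()
    cutoff = (1 << k) - b
    r = 0
    if k > 0:
        if k > 1:
            r = int(bits[pos:pos + k - 1], 2)
            pos += k - 1
        if r >= cutoff:             # long remainder: one more bit follows
            r = ((r << 1) | int(bits[pos])) - cutoff
            pos += 1
    return q * b + r + 1, pos
```

With b = 3, for instance, `golomb_encode(7, 3)` gives `"1100"` and `golomb_encode(8, 3)` gives `"11010"`. A sorted posting list would typically be turned into gaps between successive document numbers before encoding, so that most values are small.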
Conference Paper
Witten, Bell and Nevill (see ibid., p.23, 1991) have described compression models for use in full-text retrieval systems. The authors discuss other coding methods for use with the same models, and give results that show their scheme yielding virtually identical compression, and decoding more than forty times faster. One of the main features of their implementation is the complete absence of arithmetic coding; this, in part, is the reason for the high speed. The implementation is also particularly suited to slow devices such as CD-ROM, in that the answering of a query requires one disk access for each term in the query and one disk access for each answer. All words and numbers are indexed, and there are no stop words. They have built two compressed databases.
Article
Countable prefix codeword sets are constructed with the universal property that assigning messages in order of decreasing probability to codewords in order of increasing length gives an average codeword length, for any message set with positive entropy, less than a constant times the optimal average codeword length for that source. Some of the sets also have the asymptotically optimal property that the ratio of average codeword length to entropy approaches one uniformly as entropy increases. An application is the construction of a uniformly universal sequence of codes for countable memoryless sources, in which the nth code has a ratio of average codeword length to source rate bounded by a function of n for all sources with positive rate; the bound is less than two for n = 0 and approaches one as n increases.
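Two of the universal codes from this line of work, the Elias gamma and delta codes named in the main abstract, can be illustrated with a minimal sketch. It assumes one common bit convention (a run of zeros giving the length, followed by the binary value) and uses strings of '0' and '1' for readability rather than packed bits. The function names are hypothetical.

```python
def gamma_encode(x):
    """Elias gamma code of an integer x >= 1 as a '0'/'1' string."""
    if x < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = format(x, "b")                  # leading bit is always 1
    return "0" * (len(binary) - 1) + binary  # zeros give the length, then the value


def gamma_decode(bits, pos=0):
    """Decode one gamma codeword at bits[pos]; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end


def delta_encode(x):
    """Elias delta code: gamma-code the bit length, then the bits after the leading 1."""
    if x < 1:
        raise ValueError("delta codes are defined for integers >= 1")
    binary = format(x, "b")
    return gamma_encode(len(binary)) + binary[1:]


def delta_decode(bits, pos=0):
    """Decode one delta codeword at bits[pos]; return (value, next position)."""
    length, pos = gamma_decode(bits, pos)
    tail = bits[pos:pos + length - 1]
    return int("1" + tail, 2), pos + length - 1
```

Here `gamma_encode(5)` yields `"00101"` and `delta_encode(9)` yields `"00100001"`. Gamma codewords are shorter for very small values, while delta codewords become shorter as values grow, which is why the choice between them depends on the distribution of the integers being stored.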
Article
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text retrieval system. Using a semi-static word-based compression model, the space needed to store the text is under 30 per cent of the original requirement. The model is used in conjunction with canonical Huffman coding and together these two paradigms provide fast decompression. Experiments with 500 Mb of newspaper articles show that in full-text retrieval environments compression not only saves space, it can also yield faster query processing - a win-win situation.
Improved inverted file processing for large text databases
• A Moffat
• J Zobel
• S T Klein
A. Moffat, J. Zobel, and S. T. Klein. Improved inverted file processing for large text databases. In Proc. Australasian Database Conference, pages 162–171, Adelaide, Australia, January 1995.
Adding compression to a full-text retrieval system. Software-Practice and Experience
• J Zobel
• A Moffat
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software-Practice and Experience, 25(8):891-903, 1995.