Fast access to files of integers is crucial for the efficient resolution of queries to databases. Integers are the basis of indexes used to resolve queries, for example, in large internet search systems, and numeric data forms a large part of most databases. Disk access costs can be reduced by compression, if the cost of retrieving a compressed representation from disk and the CPU cost of decoding it are together less than the cost of retrieving uncompressed data. In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. In conclusion, we recommend that, for fast access to integers, files be stored compressed.
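As a concrete illustration of one of the bit-aligned codes compared in this paper, a minimal Elias gamma coder over bit-strings can be sketched as follows (a didactic version, not the implementation evaluated by the authors):

```python
def gamma_encode(n: int) -> str:
    # Elias gamma: a unary run of zeros giving the code length,
    # followed by the binary representation of n (which starts with 1).
    assert n >= 1, "gamma codes are defined for positive integers"
    b = bin(n)[2:]                      # e.g. 9 -> "1001"
    return "0" * (len(b) - 1) + b       # 9 -> "0001001"

def gamma_decode(bits: str) -> list[int]:
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":           # unary prefix gives the length
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

The delta and Golomb codes follow the same bit-by-bit pattern, which is exactly what makes them slower to decode than byte-aligned alternatives.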

... • Vbyte. We included a simple posting list representation based on Vbyte [31] which uses no sampling and consequently performs intersections in a merge-wise fashion. We also included two alternatives using Vbyte coupled with sampling [9] (called Vbyte-CM), with k = {4, 32}, or domain sampling [28] (called Vbyte-ST), with B = {16, 128}. ...

... Vbyte Vbyte [31] Simple Vbyte encoding with no sampling. Intersections are performed in a merge-wise fashion. ...

... Intersections are performed in a merge-wise fashion. Vbyte-LZMA [6] No variants. Encodes gaps with Vbyte [31] and, if the size of the resulting Vbyte sequence is ≥ 10 bytes, then it is further compressed with LZMA. ...

This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work [5]. In that parent paper, we proposed a wide variety of techniques for compressing indexes, exploiting the fact that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as a reproducibility package.

... An inverted index consists of a set of postings lists, each describing the locations and occurrences in the collection of a single term. Proposals for compactly storing postings lists include byte-aligned codes [22,25]; word-aligned codes [2,3,23,28]; and binary-packed approaches [12,30]. ...

... Traditional techniques such as Golomb, Rice, and Elias codes (see Witten et al. [26] for details) operate on a bit-by-bit basis, and are relatively slow during the decoding process. Compromises that allow faster decoding, but with reduced compression effectiveness, include the byte-based VByte mechanism [22,25] and variants thereof [4,5,7]; and the word-based Simple approaches [2,3,23,28]. At the other end of the scale, the Interp mechanism of Moffat and Stuiver [15] provides very good compression effectiveness, but with even slower decoding than the bit-based Golomb and Elias codes. ...

... One of the implementation drawbacks of the Packed+ANS arrangement described in Section 3.1 is the cost of maintaining the frames in the three-table form employed for ANS decoding (see Section 2.2). With m as large as 2^25 in some frames, with M ≈ 8m, and with the two-dimensional approach meaning that the number of contexts might be close to |S|(|S| + 1)/2, execution-time memory space is a key factor that cannot be ignored. Caching effects mean that memory space also affects decoding speed. ...

We examine approaches used for block-based inverted index compression, such as the OptPFOR mechanism, in which fixed-length blocks of postings data are compressed independently of each other. Building on previous work in which asymmetric numeral systems (ANS) entropy coding is used to represent each block, we explore a number of enhancements: (i) the use of two-dimensional conditioning contexts, with two aggregate parameters used in each block to categorize the distribution of symbol values that underlies the ANS approach, rather than just one; (ii) the use of a byte-friendly strategic mapping from symbols to ANS codeword buckets; and (iii) the use of a context merging process to combine similar probability distributions. Collectively, these improvements yield superior compression for index data, outperforming the reference point set by the Interp mechanism, and hence representing a significant step forward. We describe experiments using the 426 GiB gov2 collection and a new large collection of publicly-available news articles to demonstrate that claim, and provide query evaluation throughput rates compared to other block-based mechanisms.

... In addition to the inverted indexes themselves, when the frequency values used in document ranking and the dictionary that must be built for term lookup are also taken into account, data compression becomes important because it reduces the amount of disk seeking [2]. In this context, the compression of inverted indexes in particular plays an important role in the optimization of information retrieval systems, since efficient decoding algorithms can significantly speed up query processing [3], [4]. ...

... There are many different universal or adaptive coding techniques that can be used to compress gap values whose probabilities decrease monotonically, generally in a manner close to a geometric distribution. Variable-byte coding [3] and Elias coding [5] are examples of universal coding schemes, while Golomb coding [6], Simple coding [7], and PForDelta [8] are examples of adaptive schemes. The aim is then to select the scheme that optimizes the efficiency of the information retrieval system in terms of memory usage and query processing speed, taking into account the trade-off each technique offers between index compression ratio and decoding speed. For example, although Golomb codes achieve compression performance superior to other techniques, they can lag behind in decoding speed [7], [9]. ...

In this paper, an entropy coding technique, namely combinatorial encoding, is investigated in the context of document indexing as an alternative compression scheme to the conventional gap encoding methods. To this purpose, a finite state transducer is designed to reduce the complexity of calculating the lexicographic rank of the encoded bitstream, and a component that efficiently calculates the binomial coefficient for large numbers is developed. The encoding speed of the implemented solid and multi-block indexing schemes is empirically evaluated through varying term frequencies on randomly created bit strings. The method's ability to compress a memoryless source to its entropy limit, to yield an on-the-fly indexing scheme, and to conform to document reordering by means of the transducer are designated as its most prominent aspects.
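The core operation in combinatorial (enumerative) encoding, computing the lexicographic rank of a bit string among all strings of the same length and weight from binomial coefficients, can be sketched as follows (a plain illustration; the paper's contribution is performing this computation efficiently with a transducer):

```python
from math import comb

def enum_rank(bits: str) -> int:
    # Lexicographic rank of `bits` among all strings with the same
    # length and the same number of 1s (enumerative coding).
    rank, ones, n = 0, bits.count("1"), len(bits)
    for i, b in enumerate(bits):
        if b == "1":
            # count the strings that place a 0 here instead, keeping
            # all `ones` remaining 1s in the n-i-1 later positions
            rank += comb(n - i - 1, ones)
            ones -= 1
    return rank
```

The rank, together with the length and weight, identifies the string uniquely, so it can serve as its compressed representation.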

... Some of them achieve a high compression ratio, but at the expense of a lower decompression speed [15], for example Elias γ and δ encodings [21], Golomb/Rice parametric encodings [26], or interpolative coding [38]. Other approaches may achieve a (usually slightly) lower compression ratio, though with faster decompression speeds; examples are VByte [57], S9 [4] and its variants, PForDelta [62], Quasi-Succinct Coding [55], or SIMD-BP128 [32]. The best compression method depends on the scenario at hand. ...

... This method [57] encodes an integer number using a variable number of bytes. To do this, VByte uses the most significant bit in a byte as a flag. ...

Positional ranking functions, widely used in web search engines and related search systems, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic nonpositional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time-space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index-based alternatives for positional data. We aim to answer the question of whether positional data should be indexed, and how. We show that there is a wide range of practical time-space trade-offs. Moreover, we show that using about 1.30 times the space of positional data, we can store everything needed for efficient query processing, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from the literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact on query processing time.

... WSDM '19, sequences in compressed form if the primary goal is fast decompression. Competitors in this space include Trotman's QMX codec [36]; the VByte and Simple16 byte- and word-aligned mechanisms [2,39]; and the PFOR approach of Zukowski et al. [45]. ...

... Byte- and Word-Aligned Codes. In byte-aligned compression methods [31,35,39] input integers are partitioned into 7-bit fragments, and the fragments are placed in bytes, leaving one bit spare per byte. That bit then serves as a flag to mark the last fragment of each integer, allowing the coded values to be reconstituted via byte-at-a-time shift and mask operations. ...
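The 7-bit-fragment scheme described above can be sketched as follows. Here the flag bit is set on the last byte of each integer; some implementations invert that convention:

```python
def vbyte_encode(n: int) -> bytes:
    # Split n into 7-bit fragments, least significant first; the spare
    # (most significant) bit of the final byte flags the end of the value.
    out = bytearray()
    while n >= 0x80:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data: bytes) -> list[int]:
    # Reconstitute values with byte-at-a-time shift and mask operations.
    out, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                  # flag set: last fragment
            out.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return out
```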

Dictionary-based compression schemes provide fast decoding operation, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off balance between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.

... To further improve the running time of the algorithm, we employ a few bit manipulation techniques that take advantage of our particular encoding scheme. With standard variable-byte encoding [38], we need to read multiple bytes to determine the size and decode the value. But by storing the size of the variable-byte value in a 2-bit code, we can determine the size ℓ by looking up the code c in a small array: ℓ ← sizeFromCode[c]. ...
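A sketch of this idea, with `SIZE_FROM_CODE` standing in for the paper's `sizeFromCode` array (the exact byte layout in the cited work may differ):

```python
SIZE_FROM_CODE = [1, 2, 3, 4]        # byte length for each 2-bit code

def encode_value(value: int) -> tuple[int, bytes]:
    # Return the 2-bit size code and the value's little-endian bytes.
    length = max(1, (value.bit_length() + 7) // 8)
    return SIZE_FROM_CODE.index(length), value.to_bytes(length, "little")

def decode_value(code: int, data: bytes, offset: int) -> tuple[int, int]:
    # The 2-bit code replaces continuation-bit scanning: one table lookup
    # yields the size, so the bytes can be consumed in a single step.
    size = SIZE_FROM_CODE[code]
    value = int.from_bytes(data[offset:offset + size], "little")
    return value, offset + size
```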

... With delta encoding, storing the scores, including the 2-bit header, takes only 4.0 and 9.6 bits per node and string, respectively. In comparison, standard variable-byte encoding with a single continuation bit [38] requires at least 8 bits per node. Similarly, we utilize an average of only 16.4 bits per string in the dataset to encode the tree structure. ...

Today, virtually every search application, on desktop, web, and mobile devices, provides some kind of query auto-completion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e., strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is an emerging requirement for web search engines and social networks, where hundreds of millions of distinct queries must be indexed to guarantee reasonable coverage, and for mobile devices, where memory is limited and typing on a small display is slow, so users benefit from fast, compressed completion. We present three different tree-based data structures to address this problem, each with different space/time trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion.

... Data partitioning plays an essential role in achieving a good compression ratio for various algorithms. Several prior works [87,90] targeting inverted indexes proposed partitioning algorithms for specific compression schemes like Elias-Fano [101] and VByte [100,103]. The partitioning algorithms introduced in Section 3.2 are general under the LeCo framework (although not universal). ...

Lightweight data compression is a key technique that allows column stores to exhibit superior performance for analytical queries. Despite a comprehensive study on dictionary-based encodings to approach Shannon's entropy, few prior works have systematically exploited the serial correlation in a column for compression. In this paper, we propose LeCo (i.e., Learned Compression), a framework that uses machine learning to remove the serial redundancy in a value sequence automatically to achieve an outstanding compression ratio and decompression performance simultaneously. LeCo presents a general approach to this end, making existing (ad-hoc) algorithms such as Frame-of-Reference (FOR), Delta Encoding, and Run-Length Encoding (RLE) special cases under our framework. Our microbenchmark with three synthetic and six real-world data sets shows that a prototype of LeCo achieves a Pareto improvement on both compression ratio and random access speed over the existing solutions. When integrating LeCo into widely-used applications, we observe up to a 3.9x speedup in filter-scanning a Parquet file and a 16% increase in RocksDB's throughput.
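Two of the special cases named above, Delta encoding followed by Frame-of-Reference over the residuals, can be sketched as follows (an illustration of the classical building blocks, not of the learned models that LeCo fits):

```python
def for_delta_encode(values: list[int]) -> tuple[int, list[int]]:
    # Delta encoding removes serial correlation; frame-of-reference (FOR)
    # then rebases the (small) deltas against their minimum, so the
    # residuals need very few bits each.
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    base = min(deltas)
    return base, [d - base for d in deltas]

def for_delta_decode(base: int, residuals: list[int]) -> list[int]:
    out, acc = [], 0
    for r in residuals:
        acc += r + base       # undo FOR, then undo delta via prefix sum
        out.append(acc)
    return out
```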

... Given a list of n values, we wish to construct a small data structure that can return an individual value in constant time and hence the entire list in O(n) time. Standard file compression methods that employ run-length encoding, delta coding, specialized codecs, and other techniques [47, 54-56] are insufficient for this task because they do not allow queries directly on the compressed representation. Moreover, existing succinct data structures do not fully leverage the underlying structure of the data to achieve lower space. ...

Lookup tables are a fundamental structure in many data processing and systems applications. Examples include tokenized text in NLP, quantized embedding collections in recommendation systems, integer sketches for streaming data, and hash-based string representations in genomics. With the increasing size of web-scale data, such applications often require compression techniques that support fast random $O(1)$ lookup of individual parameters directly on the compressed data (i.e. without blockwise decompression in RAM). While the community has proposed a number of succinct data structures that support queries over compressed representations, these approaches do not fully leverage the low-entropy structure prevalent in real-world workloads to reduce space. Inspired by recent advances in static function construction techniques, we propose a space-efficient representation of immutable key-value data, called CARAMEL, specifically designed for the case where the values are multi-sets. By carefully combining multiple compressed static functions, CARAMEL occupies space proportional to the data entropy with low memory overheads and minimal lookup costs. We demonstrate 1.25-16x compression on practical lookup tasks drawn from real-world systems, improving upon established techniques, including a production-grade read-only database widely used for development within Amazon.com.

... Finally, we mention the Directly Addressable Codes (DACs) [2,3], a scheme that makes use of the generalized Vbyte coding [26]. Given a fixed integer parameter b > 1, the symbols are first encoded into a sequence of (b + 1)-bit chunks, which are then arranged in ⌈log(σ)/b⌉ bitstreams, where the i-th bitstream contains the i-th least significant chunk of every code-word. ...
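The chunk-per-level arrangement behind DACs can be sketched as follows. This is a didactic version: it keeps the b-bit chunks and the continuation flags of each level in plain Python lists and uses a linear scan where a real DAC would answer the rank query on the flag bitvector in constant time:

```python
def dac_build(values: list[int], b: int = 3):
    # Level i holds the i-th least significant b-bit chunk of every value
    # that is long enough to reach that level, plus a flag bitvector that
    # says whether the value continues into level i+1.
    streams, flags, current = [], [], list(values)
    while current:
        mask = (1 << b) - 1
        streams.append([v & mask for v in current])
        rest = [v >> b for v in current]
        flags.append([1 if r > 0 else 0 for r in rest])
        current = [r for r in rest if r > 0]
    return streams, flags

def dac_access(streams, flags, i: int, b: int = 3) -> int:
    # Follow value i across levels; its position at the next level is the
    # number of set flags before it at this level (a rank query).
    value, shift, level = 0, 0, 0
    while True:
        value |= streams[level][i] << shift
        if not flags[level][i]:
            return value
        i = sum(flags[level][:i])      # rank1(flags[level], i), done naively
        shift += b
        level += 1
```

Direct access to any symbol thus costs one step per level its code occupies, without decoding the neighbors.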

We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcessibility), that supports direct and fast accessibility to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratios, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{\sigma - \lambda + 3} - 3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th number of the Fibonacci sequence. Overall, it uses $N+\mathcal{O}\left(n \left(\lambda - (F_{\sigma+3}-3)/F_{\sigma+1}\right)\right) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with the performance of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme is configured as a \emph{computation-friendly compression} scheme, as it offers several features that make it very effective in text processing tasks. In the string matching problem, which we take as a case study, we experimentally prove that the new scheme enables results that are up to 29 times faster than standard string-matching techniques on plain texts.

... Besides, decoding a single integer encoded with Elias or Fibonacci is slow in practice (15-30 nsecs per decoded integer is usual [11]). Hence, alternative approaches are used in practice, such as VByte [54], Simple9 [2] (and optimized variants like Simple16 [57] and Simple18 [4]), and PForDelta [59] (and optimized variants like OptPFD [56]). These have efficient decoding times (e.g., less than 1 nsec/int on average is typical [11]), yet their space usage is not guaranteed to achieve any compression measure, although they yield efficient space usage in practice. ...

We introduce space- and time-efficient algorithms and data structures for the offline set intersection problem. We show that a sorted integer set $S \subseteq [0{..}u)$ of $n$ elements can be represented using compressed space while supporting $k$-way intersections in adaptive $O(k\delta\lg{\!(u/\delta)})$ time, $\delta$ being the alternation measure introduced by Barbay and Kenyon. Our experimental results suggest that our approaches are competitive in practice, outperforming the most efficient alternatives (Partitioned Elias-Fano indexes, Roaring Bitmaps, and Recursive Universe Partitioning (RUP)) in several scenarios, offering in general relevant space-time trade-offs.
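An adaptive intersection in this spirit, using galloping (doubling) probes so that easy instances finish quickly, can be sketched for two sorted lists as follows (an illustration of adaptivity only, not of the compressed-space structures the paper proposes):

```python
from bisect import bisect_left

def gallop_intersect(a: list[int], b: list[int]) -> list[int]:
    # Search each element of the shorter list in the longer one, using
    # exponentially growing probes from the last match position; the work
    # adapts to how interleaved the two lists actually are.
    if len(a) > len(b):
        a, b = b, a
    out, lo = [], 0
    for x in a:
        step = 1
        while lo + step < len(b) and b[lo + step] < x:
            step *= 2                       # gallop until we overshoot x
        hi = min(lo + step + 1, len(b))
        pos = bisect_left(b, x, lo, hi)     # binary search inside the bracket
        if pos < len(b) and b[pos] == x:
            out.append(x)
        lo = pos                            # never revisit earlier positions
    return out
```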

... For example, Heaps [31] describes a general approach to integer compression that includes arrangements in which the code lengths are 8, 16, 24 (and so on) bits long; and Cutting and Pedersen [19] also describe a VByte mechanism. Williams and Zobel [62] include VByte in their experimental study, and further comparison was undertaken by Scholer et al. [51] and by Trotman [57,58]. Subsequent developments are then reported by de Moura et al. [22], by Brisaboa et al. [6], by Culpepper and Moffat [17], by Dean [23], by Stepanov et al. [54], and most recently, by Lemire et al. [35]. ...

In a dynamic retrieval system, documents must be ingested as they arrive, and be immediately findable by queries. Our purpose in this paper is to describe an index structure and processing regime that accommodates that requirement for immediate access, seeking to make the ingestion process as streamlined as possible, while at the same time seeking to make the growing index as small as possible, and seeking to make term-based querying via the index as efficient as possible. We describe a new compression operation and a novel approach to extensible lists which together facilitate that triple goal. In particular, the structure we describe provides incremental document-level indexing using as little as two bytes per posting and only a small amount more for word-level indexing; provides fast document insertion; supports immediate and continuous queryability; provides support for fast conjunctive queries and similarity score-based ranked queries; and facilitates fast conversion of the dynamic index to a "normal" static compressed inverted index structure. Measurement of our new mechanism confirms that in-memory dynamic document-level indexes for collections into the gigabyte range can be constructed at a rate of two gigabytes/minute using a typical server architecture, that multi-term conjunctive Boolean queries can be resolved in just a few milliseconds each on average even while new documents are being concurrently ingested, and that the net memory space required for all of the required data structures amounts to an average of as little as two bytes per stored posting.

... If the route names contain no common prefixes then the size of the stored suffixes in FC(R) reduces to that of R with no compression. A variable length non-negative integer encoding scheme such as VByte [94] is employed to compress the integer representing the length of the common prefixes π_i, i = 1, ..., n. The following proposition demonstrates the space efficiency of FC(R). ...
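Front coding as applied to the sorted route names above can be sketched as follows; a scheme such as VByte would then be used to compress the per-entry prefix lengths:

```python
def front_code(strings: list[str]) -> list[tuple[int, str]]:
    # Front coding of a sorted list: each entry stores the length of the
    # common prefix with its predecessor, plus the remaining suffix.
    out, prev = [], ""
    for s in strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(coded: list[tuple[int, str]]) -> list[str]:
    out, prev = [], ""
    for lcp, suffix in coded:
        prev = prev[:lcp] + suffix    # rebuild from the previous entry
        out.append(prev)
    return out
```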

Named data networking (NDN) is a content-centric future Internet architecture that uses routable content names instead of IP addresses to achieve location-independent forwarding. Nevertheless, NDN's design is limited to offering hosted applications a simple content pull mechanism. As a result, increased complexity is needed in developing applications that require more sophisticated content delivery functionalities (e.g., push, publish/subscribe, streaming, generalized forwarding, and dynamic content naming). This thesis introduces a novel Enhanced NDN (ENDN) architecture that offers an extensible catalog of content delivery services (e.g., adaptive forwarding, customized monitoring, and in-network caching control). More precisely, the proposed architecture allows hosted applications to associate their content namespaces with a set of services offered by ENDN.
The design of ENDN starts from the current NDN architecture that is gradually modified to meet the evolving needs of novel applications. NDN switches use several forwarding tables in the packet processing pipeline, the most important one being the Forwarding Information Base (FIB). The NDN FIBs face stringent performance requirements, especially in Internet-scale deployments. Hence, to increase the NDN data plane scalability and flexibility, we first propose FCTree, a novel FIB data structure. FCTree is a compressed FIB data structure that significantly reduces the required storage space within the NDN routers while providing fast lookup and modification operations. FCTree also offers additional lookup types that can be used as building blocks to novel network services (e.g., in-network search engine).
Second, we design a novel programmable data plane for ENDN using P4, a prominent data plane programming language. Our proposed data plane allows content namespaces to be processed by P4 functions implementing complex stateful forwarding behaviors. We thus extend existing P4 models to overcome their limitations with respect to processing string-based content names. Our proposed data plane also allows running independent P4 functions in isolation, thus enabling P4 code run-time pluggability. We further enhance our proposed data plane by making it protocol-independent using programmable parsers to allow interfacing with IP networks.
Finally, we introduce a new control plane architecture that allows the applications to express their network requirements using intents. We employ Event-B machine (EBM) language modeling and tools to represent these intents and their semantics on an abstract model of the network. The resulting EBMs are then gradually refined to represent configurations at the programmable data plane. The Event-B method formally ensures the consistency of the different application requirements using proof obligations and verifies that the requirements of different intents do not contradict each other. Thus, the desired properties of the network or its services, as defined by the intent, are guaranteed to be satisfied by the refined EBM representing the final data-plane configurations. Experimental evaluation results demonstrate the feasibility and efficiency of our proposed architecture.

... The survey by Zobel and Moffat [113] covers more than 40 years of academic research in Information Retrieval and gives an introduction to the field, with Section 8 dealing with efficient index representations. Moffat and Turpin [68], Moffat [58], Pibiri and Venturini [82] describe several of the techniques illustrated in this article; Williams and Zobel [108], Scholer et al. [92] and Trotman [102] experimentally evaluate many of them. ...

The data structure at the core of large-scale search engines is the inverted index, which is essentially a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by such engines and stringent performance requirements imposed by the heavy load of queries, the inverted index stores billions of integers that must be searched efficiently. In this scenario, index compression is essential because it leads to a better exploitation of the computer memory hierarchy for faster query processing and, at the same time, allows reducing the number of storage machines.
The aim of this article is twofold: first, surveying the encoding algorithms suitable for inverted index compression and, second, characterizing the performance of the inverted index through experimentation.
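One of the encoders such a survey covers, Elias-Fano, illustrates how a sorted sequence of doc-ids can be compressed while remaining directly accessible. A minimal sketch (with a linear-scan select where real implementations use constant-time machinery, and assuming a nonempty input):

```python
def elias_fano_encode(values: list[int], u: int):
    # Elias-Fano: with n increasing values in [0, u), keep the low
    # l ~ floor(log2(u/n)) bits of each value verbatim, and write the
    # gaps between successive high parts in unary.
    n = len(values)
    l = max(0, (u // n).bit_length() - 1)
    low = [v & ((1 << l) - 1) for v in values]
    high_bits, prev = [], 0
    for v in values:
        h = v >> l
        high_bits += [0] * (h - prev) + [1]   # unary gap, then a terminator
        prev = h
    return l, low, high_bits

def elias_fano_access(l: int, low: list[int], high_bits: list[int], i: int) -> int:
    # Select the (i+1)-th 1 in the high stream; its position minus i
    # recovers the high part (naive scan here for clarity).
    ones = -1
    for pos, bit in enumerate(high_bits):
        ones += bit
        if ones == i:
            return ((pos - i) << l) | low[i]
```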

... The particular use of GC in encoding inverted indexes is well motivated by the fact that gaps between document identifiers approximately follow a geometric distribution, provided that documents are randomly ordered [5]. Empirical studies conducted on various datasets have also shown that GC yields good results in terms of compression efficiency [7]-[10]. However, one drawback of GC is the bit-level decoding of prefix codes, making the technique unsuitable for large-scale search engines where high-performance retrieval is crucial [11]. ...

In this paper, we present a finite-state approach to efficiently decode a sequence of Rice codes. The proposed method is capable of decoding a byte stream for any Golomb parameter over unboundedly large alphabets with constant space complexity. The performance of the approach is evaluated against the conventional bit-level decoding algorithm on compressed inverted indexes with respect to various parameter values. It is observed that the decoding performance of the method increases with the mean value of the encoded integers. Speed gains of up to about a factor of 2 over conventional decoding are obtained empirically from the point where the optimal value of the divisor reaches 128. The results show that the method is particularly effective for tasks in which streams of large integers are encoded, such as compression of document-identifier gaps in inverted indexes.
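For reference, the conventional bit-level Rice coder that such a finite-state decoder accelerates can be sketched as follows, for divisor M = 2^k with k ≥ 1:

```python
def rice_encode(n: int, k: int) -> str:
    # Rice code with divisor M = 2**k: the quotient n // M in unary
    # (1-bits terminated by a 0), then the remainder in k binary bits.
    assert k >= 1
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

def rice_decode(bits: str, k: int) -> list[int]:
    out, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":            # unary quotient, bit by bit
            q += 1
            i += 1
        i += 1                           # skip the terminating 0
        out.append((q << k) | int(bits[i:i + k], 2))
        i += k
    return out
```

The bit-by-bit loop over the unary part is exactly the per-bit work that grows with the encoded values and that a byte-wise table-driven decoder avoids.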

... We decided to use byte-wise compression rather than bit-wise compression because the latter does not appear to be worthwhile [63]. Note that more complex compression schemes could also be used (e.g., VByte [94]) but this should be seen as future work. ...

The increasing availability and usage of Knowledge Graphs (KGs) on the Web calls for scalable and general-purpose solutions to store this type of data structures. We propose Trident, a novel storage architecture for very large KGs on centralized systems. Trident uses several interlinked data structures to provide fast access to nodes and edges, with the physical storage changing depending on the topology of the graph to reduce the memory footprint. In contrast to single architectures designed for single tasks, our approach offers an interface with few low-level and general-purpose primitives that can be used to implement tasks like SPARQL query answering, reasoning, or graph analytics. Our experiments show that Trident can handle graphs with 10^11 edges using inexpensive hardware, delivering competitive performance on multiple workloads.

... If the route names contain no common prefixes then the size of the stored suffixes in FC(R) reduces to that of R with no compression. A variable length non-negative integer encoding scheme such as VByte [36] is employed to compress the integer representing the length of the common prefixes π_i, i = 1, ..., n. ...

Named data networking (NDN) is a nascent vision for the future Internet that replaces IP addresses with content names searchable at the network layer. One challenging task for NDN routers is to manage huge forwarding information bases (FIBs) that store next-hop routes to contents. In this article, we propose a family of compressed FIB data structures that significantly reduce the required storage space within the NDN routers. Our first compressed FIB data structure is FCTree. FCTree employs a localized front-coding compression, that eliminates repeated prefixes, to buckets containing partitions of routes. These buckets are then organized in self-balancing trees to speed up the longest prefix match (LPM) operations. We propose two enhancements to FCTree, a statistically compressed FCTree (StFCTree) and a dictionary compressed FCTree (DiFCTree). Both StFCTree and DiFCTree achieve higher compression ratios for NDN FIBs and can be used for FIB updates or exchanges between the forwarding and control planes. Finally, we provide the control plane with several knobs that can be employed to achieve different target trade-offs between the lookup speed and the FIB size in each of these structures. Theoretical analysis along with experimental results demonstrate the significant space savings and performance achieved by the proposed schemes.

... For even larger numbers, the so-called Variable Byte [53] (VByte) representation is interesting, as it offers fast decoding by accessing byte-aligned data. The idea is to split each integer into 7-bit chunks and encode each chunk in a byte. ...
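The 7-bit chunking described above can be sketched as follows. This is a minimal illustration of the general VByte idea, not the code of any cited implementation; note that conventions vary (some variants use the flag bit 0, rather than 1, to mark the final byte, or emit most-significant chunks first):

```python
def vbyte_encode(n):
    # Split n into 7-bit chunks, least-significant first; the most-significant
    # bit of each byte is a flag (here: 1 marks the final byte of a number).
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # 7 data bits, more chunks follow
        n >>= 7
    out.append(n | 0x80)       # last chunk: set the stop bit
    return bytes(out)

def vbyte_decode(data):
    # Decode a concatenated stream of VByte-encoded integers.
    nums, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:           # stop bit seen: number is complete
            nums.append(n)
            n, shift = 0, 0
        else:
            shift += 7
    return nums
```

Because the code is byte-aligned, decoding needs no bit manipulation across byte boundaries, which is the source of its speed.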

Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. Several recent applications need to represent highly repetitive sequences, and classical statistical compression proves ineffective. We introduce, instead, grammar-based representations for repetitive sequences, which use up to 6% of the space needed by statistically compressed representations, and support direct access and rank/select operations within tens of microseconds. We demonstrate the impact of our structures in text indexing applications.

... In this paper we use Variable-byte (Vbyte) encoding [21], a simple integer compression technique that essentially splits an integer into 7-bit chunks and stores them in consecutive bytes, using the most significant bit of each byte to mark whether the number has more chunks or not. It is simple to implement and fast to decode. ...

We introduce a new family of compressed data structures to efficiently store and query large string dictionaries in main memory. Our main technique is a combination of hierarchical Front-coding with ideas from longest-common-prefix computation in suffix arrays. Our data structures yield relevant space-time tradeoffs in real-world dictionaries. We focus on two domains where string dictionaries are extensively used and efficient compression is required: URL collections, a key element in Web graphs and applications such as Web mining; and collections of URIs and literals, the basic components of RDF datasets. Our experiments show that our data structures achieve better compression than the state-of-the-art alternatives while providing very competitive query times.

... Since each position uses a fixed number of bits, they can be easily accessed positionally for decompression. It is possible to use other techniques to encode the positions that may use less space [Variable Byte (Williams and Zobel, 1999), Golomb/Rice (Golomb, 1966), etc.], but in our tests the gain in space was negligible and the negative effect on decompression times was noticeable. On the other hand, factor lengths are significantly compressed using Golomb codes (Golomb, 1966). ...
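For reference, a Golomb code with a power-of-two parameter M = 2^k (a Rice code) splits each value into a unary-coded quotient and a k-bit binary remainder. The sketch below is purely illustrative (bit strings rather than packed bits, and it assumes n ≥ 0 and k ≥ 1); it is not the encoder used in the cited work:

```python
def rice_encode(n, k):
    # Rice code = Golomb code with M = 2^k: quotient q in unary
    # ("1" * q followed by "0"), then remainder r in exactly k bits.
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, "0{}b".format(k))

def rice_decode(bits, k):
    # Read the unary quotient, skip the "0" terminator, read k remainder bits.
    q = 0
    while bits[q] == "1":
        q += 1
    r = int(bits[q + 1:q + 1 + k], 2)
    return (q << k) | r
```

The choice of k determines the trade-off: small values of n get short codes when k is close to log2 of the mean of the distribution.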

Motivation:
Genome repositories are growing faster than our storage capacities, challenging our ability to store, transmit, process and analyze them. While genomes are not very compressible individually, these repositories usually contain myriads of genomes or genome reads of the same species, thereby creating opportunities for orders-of-magnitude compression by exploiting inter-genome similarities. A useful compression system, however, cannot be usable only for archival: it must allow direct access to the sequences, ideally in transparent form, so that applications do not need to be rewritten.
Results:
We present a highly compressed filesystem that specializes in storing large collections of genomes and reads. The system obtains orders-of-magnitude compression by using Relative Lempel-Ziv, which exploits the high similarities between genomes of the same species. The filesystem transparently stores the files in compressed form, intercepting the system calls of the applications without the need to modify them. A client/server variant of the system stores the compressed files in a server, while the client's filesystem transparently retrieves and updates the data from the server. The data between client and server are also transferred in compressed form, which saves an order of magnitude in network transfer time.
Availability and implementation:
The C++ source code of our implementation is available for download in https://github.com/vsepulve/relz_fs.

... 2. We can shorten the scan time with the skipping technique used in array hashing [8]. This technique stores the length of each node label in front of the label via some prefix encoding such as VByte [58]. Note that we can omit the terminators of each node label. ...

A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly-compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are only efficient in the static case, it is still difficult to implement a keyword dictionary that is space efficient and dynamic. In this article, we propose such a keyword dictionary. Our main idea is to embrace the path decomposition technique, which was proposed for constructing cache-friendly tries. To store the path-decomposed trie in small memory, we design data structures based on recent compact hash trie representations. Exhaustive experiments on real-world datasets reveal that our dynamic keyword dictionary needs up to 68% less space than the existing smallest ones.

... Classical methods include Elias encodings [24] and Golomb/Rice encoding [25]. Newer methods are VByte [26], Simple [27], Interpolative [28], and PForDelta [29]. Other techniques are proposed in [30] [31]. ...

Modern information systems are very large and maintain huge amounts of data, processing millions of documents and millions of queries at any given time. To select the most important responses from this amount of data, it is useful to apply so-called early-termination algorithms. These attempt to extract the top-k documents according to a specified monotone increasing function; the principal idea is to reach and score only the small number of most significant documents, thereby avoiding fully processing the whole collection. The WAND algorithm is the state of the art in this area. Although it is efficient, it lacks effectiveness and precision. In this paper, we propose two contributions. The principal one is a new early-termination algorithm based on the WAND approach, which we call MWAND (Modified WAND). It is faster and more precise than the original, as it is able to avoid unnecessary WAND steps. In this work, we integrate a tree structure as an index into WAND and add new levels to query processing. As a second contribution, we define new fine-grained metrics to improve the evaluation of the retrieved information. Experimental results on real datasets show that MWAND is more efficient than the WAND approach.

... Many representations for inverted lists are known, each exposing a different space/time trade-off [10]. Among these, Variable-Byte [11], [12] (henceforth, VByte) is the most popular and widely used byte-aligned code. In particular, VByte owes its popularity to its sequential decoding speed and, indeed, it is one of the fastest representations for integer sequences. ...

The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2× by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization comes almost for free, given that: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; and we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.

... If the route names contain no common prefixes then the size of FC(S) reduces to that of S with no compression. A variable-length non-negative integer encoding scheme such as VByte [25] is usually employed in order to compress the integer representing the length of the common prefix. This compression can also be applied to the pointers to the lists l_i, i = 1, …, n of next-hop ports. ...

Named Data Networking (NDN) is a future Internet architecture that replaces IP addresses with namespaces of contents that are searchable at the network layer. A challenging task for NDN routers is to manage forwarding information bases (FIBs) that store next-hop routes to contents using their usually long names or name prefixes. In this paper, we propose FCTree, a compressed FIB data structure that significantly reduces the required storage space at the router and can efficiently cope with route sets that are orders of magnitude larger than the IP-based ones in conventional routing tables. FCTree employs a localized front-coding compression to buckets containing partitions of the routes. The top routes in these buckets are then organized in B-ary self-balancing trees. By adjusting the size of the buckets, the router can reach an optimal tradeoff between the latency of the longest prefix matching (LPM) operation and the FIB storage space. In addition, in contrast to existing hash and bloom-filter based solutions, the proposed FCTree structure can significantly reduce the latency required for range and wildcard searches (e.g., for latency-sensitive streaming applications or network-layer search engines) where up to k routes are returned if they are prefixed by a requested name. Performance evaluation results demonstrate the significant space savings achieved by FCTree compared to traditional hash-based FIBs.

... However, it only decreases the minimum number of bits necessary to encode the partitions, but not their final representation. Consequently, deltas are Vbyte encoded (Williams and Zobel, 1999): each byte used to encode a delta has one bit indicating whether the byte starts a new delta or not, allowing to remove unnecessary bytes from each delta. Thus, partitions use a variable number of bytes proportional to the minimum number of bits necessary to encode their deltas. ...

The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.

... • We used vbyte (byte-aligned) codes [48] rather than bit-oriented Huffman codes for differential encoding: the representation includes a sequence of byte-oriented codewords (either 1- or 2-byte codewords in our example) that represent the gaps from the original Ψ structure. It can also contain a pair of codewords for the pair ⟨1, L⟩ to encode a 1-run of length L. Of course, using byte-aligned rather than bit-oriented codes implies a loss in compression effectiveness. ...

Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal intervals when they are active. This work explores the use of the Compressed Suffix Array (CSA), a well-known compact and self-indexed data structure in the area of text indexing, to represent large temporal graphs. The new structure, called Temporal Graph CSA (TGCSA), is experimentally compared with the most competitive compact data structures in the state-of-the-art, namely, EdgeLog and CET. The experimental results show that TGCSA obtains a good space-time trade-off. It uses a reasonable space and is efficient for solving complex temporal queries. Furthermore, TGCSA has wider expressive capabilities than EdgeLog and CET, because it is able to represent temporal graphs where contacts on an edge can temporally overlap.

... We observe around a 27% and 30% reduction in size for Pri and Opt, respectively. We also tested a traditional bitmap compression method, EWAH [4] (run-length), and the encoding methods Pfor [9] and Vbyte [8]. We see a decrease in size of 4.35% (EWAH), −1.0% (Pfor) and 11.3% (VByte), which is unsurprisingly poor. ...

Large-scale search engines utilize inverted indexes which store ordered lists of document identifiers (docIDs) relevant to query terms, and which can be queried thousands of times per second. In order to reduce storage requirements, we propose a dictionary-based compression approach for the recently proposed bitwise data structure BitFunnel, which makes use of a Bloom filter. Compression is achieved by storing frequently occurring blocks in a dictionary. Infrequently occurring blocks (those which are not represented in the dictionary) are instead referenced using similar blocks that are in the dictionary, introducing additional false-positive errors. We further introduce a docID reordering strategy to improve compression. Experimental results indicate an improvement in compression by 27% to 30%, at the expense of increasing the query processing time by 16% to 48% and increasing the false-positive rate by around 7.6 to 10.7 percentage points.

... This allows us skipping blocks at search time, decompressing only the blocks that are relevant for a query. Among the existing compression schemes for inverted lists, we have classical encodings like Elias δ and γ (Elias, 1975) and Golomb/Rice (Golomb, 1966), as well as the more recent ones VByte (Williams & Zobel, 1999), Simple 9 (Anh & Moffat, 2005), and PForDelta (Zukowski, Héman, Nes, & Boncz, 2006) encodings. All these methods benefit from sequences of small integers. ...

Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: inverted indexes. They store an inverted list per term of the vocabulary. The inverted list of a given term stores, among other things, the document identifiers (docIDs) of the documents that contain the term. Currently, inverted indexes can be stored efficiently using integer compression schemes. Previous research also studied how an optimized document ordering can be used to assign docIDs to the document database. This yields important improvements in index compression and query processing time. In this paper we show that using a hybrid compression approach on the inverted lists is more effective in this scenario, with two main contributions: • First, we introduce a document reordering approach that aims at generating runs of consecutive docIDs in a properly-selected subset of inverted lists of the index. • Second, we introduce hybrid compression approaches that combine gap and run-length encodings within inverted lists, in order to take advantage not only of small gaps, but also of long runs of consecutive docIDs generated by our document reordering approach. Our experimental results indicate a reduction of about 10%–30% in the space usage of the whole index (just regarding docIDs), compared with the most efficient state-of-the-art results. Also, decompression speed is up to 1.22 times faster if the runs of consecutive docIDs must be explicitly decompressed, and up to 4.58 times faster if implicit decompression of these runs is allowed (e.g., representing the runs as intervals in the output). Finally, we also improve the query processing time of AND queries (by up to 12%), WAND queries (by up to 23%), and full (non-ranked) OR queries (by up to 86%), outperforming the best existing approaches.
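The combination of gap and run-length encoding can be sketched as follows. This is only an illustration of the general idea under an assumed pair layout (gap to the first docID of a run, then the run's length); it is not the paper's actual format:

```python
def encode_hybrid(doc_ids):
    # Turn a strictly increasing docID list into (gap, run_length) pairs:
    # the gap reaches the first docID of a maximal run of consecutive IDs,
    # and run_length counts how many consecutive docIDs the run spans.
    pairs, prev, i = [], 0, 0
    while i < len(doc_ids):
        start, run = doc_ids[i], 1
        while i + run < len(doc_ids) and doc_ids[i + run] == start + run:
            run += 1
        pairs.append((start - prev, run))
        prev = start + run - 1
        i += run
    return pairs

def decode_hybrid(pairs):
    # Expand each (gap, run_length) pair back into consecutive docIDs.
    out, prev = [], 0
    for gap, run in pairs:
        start = prev + gap
        out.extend(range(start, start + run))
        prev = start + run - 1
    return out
```

Long runs of consecutive docIDs, such as those produced by document reordering, collapse into a single pair, which is where the space savings come from; the resulting gaps and lengths would then be compressed with an integer code.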

... Among these, Variable-Byte [40,44] (henceforth, VByte) is the most popular and widely used byte-aligned code. In particular, VByte owes its popularity to its sequential decoding speed and, indeed, it is the fastest representation to date for integer sequences. ...

The ubiquitous Variable-Byte encoding is considered one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2× by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization comes almost for free, given that: we introduce an optimal partitioning algorithm that, by running in linear time and with low constant factors, does not affect indexing time; and we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.

... Variable-byte coding [42] uses a sequence of bytes to provide a compressed representation of integers. In particular, when compressing an integer n, the seven least significant bits of each byte are used to code n, whereas the most significant bit of each byte is set to 0 in the last byte of the sequence and to 1 if further bytes follow. ...

A multitude of contemporary applications heavily involve graph data whose size appears to be ever-increasing. This trend shows no signs of subsiding and has caused the emergence of a number of distributed graph processing systems including Pregel, Apache Giraph and GraphX. However, the unprecedented scale now reached by real-world graphs hardens the task of graph processing due to excessive memory demands even for distributed environments. By and large, such contemporary graph processing systems employ ineffective in-memory representations of adjacency lists. Therefore, memory usage patterns emerge as a primary concern in distributed graph processing. We seek to address this challenge by exploiting empirically-observed properties demonstrated by graphs generated by human activity. In this paper, we propose 1) three compressed adjacency list representations that can be applied to any distributed graph processing system, 2) a variable-byte encoded representation of out-edge weights for space-efficient support of weighted graphs, and 3) a tree-based compact out-edge representation that allows for efficient mutations on the graph elements. We experiment with publicly-available graphs whose size reaches two billion edges and report our findings in terms of both space-efficiency and execution time. Our suggested compact representations reduce the respective memory requirements for accommodating the graph elements up to 5 times compared with state-of-the-art methods. At the same time, our memory-optimized methods retain the efficiency of uncompressed structures and enable the execution of algorithms for large-scale graphs in settings where contemporary alternative structures fail due to memory errors.

Nowadays, sensors and signal catchers in various fields capture time-series data continuously, and the volume of time-series data is exploding. Due to the large storage space requirements and redundancy, many compression techniques for time series have been proposed. However, existing compression algorithms still face the challenge of a contradiction between random access and compression ratio: in a time-series database, large-scale time-series data place high demands on the compression ratio, while large pieces of data need to be decompressed during access, resulting in poor query efficiency. In this paper, a solution is proposed to resolve this contradiction. We propose a data compression method based on reinforcement learning, designed around the idea of data deduplication, so that queries can be processed without decompression. We show theoretically that the proposed approach is effective and ensures random access. To efficiently implement the reinforcement-learning-based solution, we develop a data compression method based on a DQN network. Experiments show that the proposed algorithm performs well on time-series datasets with large amounts of data and strong regularity, achieving good compression ratios and compression times. Besides, since no decompression is required, the query processing time is much less than that of the competitors.

Entropy coding is a widely used technique for lossless data compression. The entropy coding schemes supporting the direct access capability on the encoded stream have been investigated in recent years. However, all prior schemes require auxiliary space to support the direct access ability. This paper proposes a rearranging method for prefix codes to support a certain level of direct access to the encoded stream without requiring additional data space. Then, an efficient decoding algorithm is proposed based on lookup tables. The simulation results show that when the encoded stream does not allow additional space, the number of bits per access read of the proposed method is above two orders of magnitude less than the conventional method. In contrast, the alternative solution consumes at least one more bit per symbol on average than the proposed method to support direct access. This indicates that the proposed scheme can achieve a good trade-off between space usage and access performance. In addition, if a small amount of additional storage space is allowed (it is approximately 0.057% in the simulation), the number of bits per access read in our proposal can be significantly reduced by 90%.

String dictionaries are a core component of a plethora of applications, so it is not surprising that they have been widely and deeply investigated in the literature since the introduction of tries in the ’60s. We introduce a new approach to trie compression, called COmpressed COllapsed Trie (CoCo-trie), that hinges upon a data-aware optimisation scheme that selects the best subtries to collapse based on a pool of succinct encoding schemes, in order to minimise the overall space occupancy. CoCo-trie supports not only the classic lookup query but also the more sophisticated rank operation, formulated over a sorted set of strings. We corroborate our theoretical achievements with a large set of experiments over datasets originating from a variety of sources, e.g., URLs, DNA sequences, and databases. We show that our CoCo-trie provides improved space-time trade-offs on all those datasets when compared against well-established and highly-engineered trie-based string dictionaries.

A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are only efficient in the static case, it is still difficult to implement a keyword dictionary that is space efficient and dynamic . In this article, we propose such a keyword dictionary. Our main idea is to embrace the path decomposition technique, which was proposed for constructing cache-friendly tries. To store the path-decomposed trie in small memory, we design data structures based on recent compact hash trie representations. Experiments on real-world datasets reveal that our dynamic keyword dictionary needs up to 68% less space than the existing smallest ones, while achieving a relevant space-time tradeoff.

An entropy coder takes as input a sequence of symbol identifiers over some specified alphabet and represents that sequence as a bitstring using as few bits as possible, typically assuming that the elements of the sequence are independent of each other. Previous entropy coding methods include the well-known Huffman and arithmetic approaches. Here we examine the newer asymmetric numeral systems (ANS) technique for entropy coding and develop mechanisms that allow it to be efficiently used when the size of the source alphabet is large—thousands or millions of symbols. In particular, we examine different ways in which probability distributions over large alphabets can be approximated and in doing so infer techniques that allow the ANS mechanism to be extended to support large-alphabet entropy coding. As well as providing a full description of ANS, we also present detailed experiments using several different types of input, including data streams arising as typical output from the modeling stages of text compression software, and compare the proposed ANS variants with Huffman and arithmetic coding baselines, measuring both compression effectiveness and also encoding and decoding throughput. We demonstrate that in applications in which semi-static compression is appropriate, ANS-based coders can provide an excellent balance between compression effectiveness and speed, even when the alphabet is large.

We examine index representation techniques for document-based inverted files, and present a mechanism for compressing them using word-aligned binary codes. The new approach allows extremely fast decoding of inverted lists during query processing, while providing compression rates better than other high-throughput representations. Results are given for several large text collections in support of these claims, both for compression effectiveness and query efficiency.

This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work [5]. In that parent paper, we proposed many and varied techniques for compressing indexes which exploit the fact that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as a reproducibility package.

Positional ranking functions, widely used in web search engines and related search systems, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic nonpositional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time–space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index-based alternatives for positional data. We aim to answer the question of whether positional data should be indexed, and how.
We show that there is a wide range of practical time–space trade-offs. Moreover, we show that using about 1.30 times the space of positional data, we can store everything needed for efficient query processing, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact in query processing time.

The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck, especially when the data is too big for internal memory. We describe two new algorithms for computing the LCP array from the suffix array in external memory. Experiments demonstrate that the new algorithms are about a factor of two faster than the fastest previous algorithm. We then further engineer the two new algorithms and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism. Eight threads is sufficient for making the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms making them in-place: the input (text and suffix array) is treated as read-only, and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets. All previous implementations assume the byte alphabet.
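For intuition, the standard internal-memory way to derive the LCP array from the suffix array is Kasai et al.'s linear-time algorithm, sketched below. The cited work's contribution is external-memory variants; this sketch is only the in-memory baseline, with LCP[i] defined as the longest common prefix of the suffixes at sa[i-1] and sa[i]:

```python
def lcp_from_suffix_array(text, sa):
    # Kasai et al.: process suffixes in text order, reusing the previous
    # match length h, which can decrease by at most 1 per step.
    n = len(text)
    rank = [0] * n
    for i, s in enumerate(sa):
        rank[s] = i            # position of suffix s in the suffix array
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1           # dropping the first character loses at most 1
        else:
            h = 0
    return lcp
```

The amortized argument is that h increases at most n times overall and decreases by at most 1 per iteration, giving O(n) total work; the external-memory algorithms in the paper must avoid exactly the random accesses to rank and sa that this version relies on.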

Search engines are exceptionally important tools for accessing information in today's world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also providing the latest trends in the literature in efficient query processing, including the coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query tradeoffs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. 
Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time, energy-efficient and modern hardware and software architectures.
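As a concrete anchor for the DAAT strategies such surveys cover, here is a minimal sketch of exhaustive (non-pruned) document-at-a-time disjunctive top-k retrieval. The postings layout and additive scoring are simplifying assumptions for illustration, not any particular engine's implementation.

```python
import heapq

def daat_or_topk(postings, k=2):
    """Disjunctive document-at-a-time scoring.
    postings: term -> list of (docid, weight), each list sorted by docid.
    Returns the top-k (score, docid) pairs, highest score first."""
    cursors = {t: 0 for t in postings}   # one cursor per posting list
    heap = []                            # min-heap holding the current top-k
    while True:
        # the next document to score is the smallest docid under any cursor
        current = [postings[t][cursors[t]][0]
                   for t in postings if cursors[t] < len(postings[t])]
        if not current:
            break
        d = min(current)
        score = 0.0
        for t in postings:               # accumulate contributions for doc d
            c = cursors[t]
            if c < len(postings[t]) and postings[t][c][0] == d:
                score += postings[t][c][1]
                cursors[t] += 1          # advance every cursor sitting on d
        heapq.heappush(heap, (score, d))
        if len(heap) > k:
            heapq.heappop(heap)          # evict the weakest candidate
    return sorted(heap, reverse=True)
```

Dynamic pruning techniques such as WAND improve on this loop by skipping documents whose score upper bound cannot enter the heap.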

Efficient storage of large inverted indexes is one of the key technologies that support current web search services. Here we re-examine mechanisms for representing document-level inverted indexes and within-document term frequencies, including comparing specialized methods developed for this task against recent fast implementations of general-purpose adaptive compression techniques. Experiments with the Gov2-URL collection and a large collection of crawled news stories show that standard compression libraries can provide compression effectiveness as good as or better than previous methods, with decoding rates only moderately slower than reference implementations of those tailored approaches. This surprising outcome means that high-performance index compression can be achieved without requiring the use of specialized implementations.
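One of the specialized methods in such comparisons is variable-byte coding of docids or gaps. The following is a minimal sketch using one common convention (little-endian 7-bit groups, with the high bit marking the final byte of each value); other conventions exist.

```python
def vbyte_encode(n):
    """Variable-byte code: 7 payload bits per byte, low-order group first,
    high bit set on the terminating byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # continuation byte: high bit clear
        n >>= 7
    out.append(n | 0x80)       # final byte: high bit set
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of variable-byte codes into a list of ints."""
    values, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                      # terminating byte
            values.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:                                # continuation byte
            n |= byte << shift
            shift += 7
    return values
```

Byte-aligned codes like this trade some compression effectiveness for very cheap decoding, which is exactly the trade-off examined against general-purpose libraries above.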

The problems arising in the modeling and coding of strings for compression purposes are discussed. The notion of an information source that simplifies and sharpens the traditional one is axiomatized, and adaptive and nonadaptive models are defined. With a measure of complexity assigned to the models, a fundamental theorem is proved which states that models that use any kind of alphabet extension are inferior to the best models using no alphabet extensions at all. A general class of so-called first-in first-out (FIFO) arithmetic codes is described which require no alphabet extension devices and which therefore can be used in conjunction with the best models. Because the coding parameters are the probabilities that define the model, their design is easy, and the application of the code is straightforward even with adaptively changing source models.

A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.

This article reports on a variety of compression algorithms developed in the context of a project to put all the data files for a full-text retrieval system on CD-ROM. In the context of inexpensive pre-processing, a text-compression algorithm is presented that is based on Markov-modeled Huffman coding on an extended alphabet. Data structures are examined for facilitating random access into the compressed text. In addition, new algorithms are presented for compression of word indices, both the dictionaries (word lists) and the text pointers (concordances). The ARTFL database is used as a test case throughout the article.

A query to a nucleotide database is a DNA sequence. Answers are similar sequences, that is, sequences with a high-quality local alignment. Existing techniques for finding answers use exhaustive search, but it is likely that, with increasing database size, these algorithms will become prohibitively expensive. We have developed a partitioned search approach, in which local alignment string matching techniques are used in tandem with an index. We show that fixed-length substrings, or intervals, are a suitable basis for indexing in conjunction with local alignment on likely answers. By use of suitable compression techniques the index size is held to an acceptable level, and queries can be evaluated several times more quickly than with exhaustive search techniques.
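The core of a fixed-length-substring (interval) index can be sketched in a few lines; the interval length and the in-memory dictionary here are illustrative assumptions, whereas the index described above is compressed and disk-resident.

```python
def build_interval_index(seq, k=3):
    """Map every length-k substring (interval) of seq to the list of
    positions where it occurs, in ascending order."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index
```

A query sequence is then broken into its own intervals, and only database regions sharing enough intervals are passed to the expensive local-alignment step.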

When data compression is applied to full-text retrieval systems, intricate relationships emerge between the amount of compression, access speed, and computing resources required. We propose compression methods, and explore corresponding tradeoffs, for all components of static full-text systems such as text databases on CD-ROM. These components include lexical indexes, inverted files, bitmaps, signature files, and the main text itself. Results are reported on the application of the methods to several substantial full-text databases, and show that a large, unindexed text can be stored, along with indexes that facilitate fast searching, in less than half its original size—at some appreciable cost in primary memory requirements. © 1993 John Wiley & Sons, Inc.

Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available. This document is available online at ACM Transactions on Information Systems.

Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching. Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools. Availability: The direct coding scheme (cino) is available free of charge by anonymous ftp from goanna.cs.rmit.edu.au in the directory pub/rmit/cino. Contact: E-mail: [email protected]
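The essence of direct coding for nucleotide data is two bits per base. The following packing/unpacking sketch illustrates the idea; the byte layout, padding convention, and symbol order are assumptions for illustration, not necessarily those of the cino scheme.

```python
CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}   # 2-bit code per base (illustrative)
BASES = 'ACGT'

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte, left-justified."""
    out, byte = bytearray(), 0
    for i, b in enumerate(seq):
        byte = (byte << 2) | CODE[b]
        if i % 4 == 3:
            out.append(byte)
            byte = 0
    rem = len(seq) % 4
    if rem:                                # pad the final partial byte with zeros
        out.append(byte << (2 * (4 - rem)))
    return bytes(out), len(seq)            # length needed to undo the padding

def unpack(data, n):
    """Recover the original n-base string from packed bytes."""
    seq = []
    for i in range(n):
        byte = data[i // 4]
        shift = 6 - 2 * (i % 4)            # position of base i within its byte
        seq.append(BASES[(byte >> shift) & 3])
    return ''.join(seq)
```

Because decoding is a shift and a mask per base, retrieval can be faster than reading uncompressed text, which is the effect reported above.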

During its long gestation in the 1970s and early 1980s, arithmetic coding was widely regarded more as an academic curiosity than a practical coding technique. One factor that helped it gain the popularity it enjoys today was the publication in 1987 of source code for a multisymbol arithmetic coder in Communications of the ACM. Now (1995), our understanding of arithmetic coding has further matured, and it is timely to review the components of that implementation and summarise the improvements that we and other authors have developed since then. We also describe a novel method for performing the underlying calculation needed for arithmetic coding. Accompanying the paper is a “Mark II” implementation that incorporates the improvements we suggest. The areas examined include: changes to the coding procedure that reduce the number of multiplications and divisions and permit them to be done to low precision; the increased range of probability approximations and alphabet sizes that can be supported using limited precision calculation; data structures for support of arithmetic coding on large alphabets; the interface between the modelling and coding subsystems; and the use of enhanced models to allow high performance compression. For each of these areas, we consider how the new implementation differs from the CACM package.

Document databases contain large volumes of text, and currently have typical sizes into the gigabyte range. In order to efficiently query these text collections some form of index is required, since without an index even the fastest of pattern matching techniques results in unacceptable response times. One pervasive indexing method is the use of inverted files, also sometimes known as concordances or postings files. A number of efforts have been made to capture the “clustering” effect, and to design index compression methods that condition their probability predictions according to context. In these methods, information as to whether or not the most recent (or second most recent, and so on) document contained term t is used to bias the prediction that the next document will contain term t. We further extend this notion of context-based index compression, and describe a surprisingly simple index representation that gives excellent performance on all of our test databases; allows fast decoding; and is, even in the worst case, only slightly inferior to Golomb (1966) coding.
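For reference, the Golomb baseline mentioned above codes a non-negative gap n with parameter b as a unary quotient followed by a truncated-binary remainder. This sketch operates on bit-strings for clarity; a real implementation would use bit-level I/O.

```python
from math import ceil, log2

def golomb_encode(n, b):
    """Golomb code for n >= 0 with parameter b >= 1."""
    q, r = divmod(n, b)
    bits = '1' * q + '0'                  # unary quotient, '0'-terminated
    if b == 1:
        return bits
    c = ceil(log2(b))
    cutoff = (1 << c) - b                 # remainders below cutoff use c-1 bits
    if r < cutoff:
        bits += format(r, '0{}b'.format(c - 1)) if c > 1 else ''
    else:
        bits += format(r + cutoff, '0{}b'.format(c))
    return bits

def golomb_decode(bits, b):
    """Return (value, bits consumed) for one Golomb codeword at bits[0:]."""
    i = q = 0
    while bits[i] == '1':
        q += 1
        i += 1
    i += 1                                # skip the terminating '0'
    if b == 1:
        return q, i
    c = ceil(log2(b))
    cutoff = (1 << c) - b
    r = 0
    if c > 1:
        r = int(bits[i:i + c - 1], 2)
        i += c - 1
    if r >= cutoff:                       # remainder needs one extra bit
        r = r * 2 + int(bits[i]) - cutoff
        i += 1
    return q * b + r, i
```

With b chosen near 0.69 times the average gap, this family is close to optimal for the geometrically distributed gaps that inverted files exhibit.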

Witten, Bell and Nevill (see ibid., p.23, 1991) have described compression models for use in full-text retrieval systems. The authors discuss other coding methods for use with the same models, and give results that show their scheme yielding virtually identical compression, and decoding more than forty times faster. One of the main features of their implementation is the complete absence of arithmetic coding; this, in part, is the reason for the high speed. The implementation is also particularly suited to slow devices such as CD-ROM, in that the answering of a query requires one disk access for each term in the query and one disk access for each answer. All words and numbers are indexed, and there are no stop words. They have built two compressed databases.

Countable prefix codeword sets are constructed with the universal property that assigning messages in order of decreasing probability to codewords in order of increasing length gives an average codeword length, for any message set with positive entropy, less than a constant times the optimal average codeword length for that source. Some of the sets also have the asymptotically optimal property that the ratio of average codeword length to entropy approaches one uniformly as entropy increases. An application is the construction of a uniformly universal sequence of codes for countable memoryless sources, in which the nth code has a ratio of average codeword length to source rate bounded by a function of n for all sources with positive rate; the bound is less than two for n = 0 and approaches one as n increases.
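The simplest member of this family of universal codes is Elias's gamma code: the length of n in unary (written as leading zeros), followed by n's binary representation. A bit-string sketch:

```python
def elias_gamma_encode(n):
    """Elias gamma code for n >= 1: (len-1) zeros, then n in binary.
    The leading '1' of the binary form doubles as the unary terminator."""
    assert n >= 1
    binary = format(n, 'b')
    return '0' * (len(binary) - 1) + binary

def elias_gamma_decode(bits):
    """Return (value, bits consumed) for one gamma codeword at bits[0:]."""
    i = 0
    while bits[i] == '0':          # count leading zeros to learn the length
        i += 1
    length = i + 1
    return int(bits[i:i + length], 2), i + length
```

A codeword for n costs 2*floor(log2 n) + 1 bits, within a constant factor of optimal for any source, which is the universal property described above.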

We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text retrieval system. Using a semi-static word-based compression model, the space needed to store the text is under 30 per cent of the original requirement. The model is used in conjunction with canonical Huffman coding and together these two paradigms provide fast decompression. Experiments with 500 Mb of newspaper articles show that in full-text retrieval environments compression not only saves space, it can also yield faster query processing - a win-win situation.
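Canonical Huffman coding, mentioned above, assigns codewords from the code lengths alone, so the decoder needs to store only one length per symbol. A minimal sketch of the assignment (the symbols and lengths are illustrative):

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codewords given symbol -> code length.
    Shorter codes come first; equal-length codes are consecutive integers."""
    items = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    out, code, prev_len = {}, 0, 0
    for sym, ln in items:
        code <<= (ln - prev_len)          # extend to the next code length
        out[sym] = format(code, '0{}b'.format(ln))
        code += 1
        prev_len = ln
    return out
```

Because codewords of each length form a contiguous numeric range, decoding reduces to a few comparisons and an index lookup, which is a key source of the fast decompression reported above.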
