Article

General-purpose compression for efficient retrieval

Authors: Adam Cannane and Hugh E. Williams

Abstract

Compression of databases not only reduces space requirements but can also reduce overall retrieval times. In text databases, compression of documents based on semistatic modeling with words has been shown to be both practical and fast. Similarly, for specific applications—such as databases of integers or scientific databases—specially designed semistatic compression schemes work well. We propose a scheme for general-purpose compression that can be applied to all types of data stored in large collections. We describe our approach—which we call RAY—in detail, and show experimentally the compression available, compression and decompression costs, and performance as a stream and random-access technique. We show that, in many cases, RAY achieves better compression than an efficient Huffman scheme and popular adaptive compression techniques, and that it can be used as an efficient general-purpose compression scheme.

... He suggests several strategies, and gives experimental results. The basic idea for our scheme, as well as for some other similar approaches to dictionary derivation [6], [7], is clearly related to the incremental encoding schemes suggested by Rubin. However, Rubin paid relatively little attention to the issues of computational complexity and dictionary encoding techniques. ...
... Another independent work based on a character-pair phrase generation scheme is that of Cannane and Williams [7]. Their approach is specialized for processing very large files using limited primary storage. ...
... Algorithm R in Fig. 1 captures this mechanism. Although this simple scheme is not especially well known, similar techniques have, as noted in Section II, also been described by other authors [5]- [7]. ...
Article
Dictionary-based modeling is a mechanism used in many practical compression schemes. In most implementations of dictionary-based compression the encoder operates on-line, incrementally inferring its dictionary of available phrases from previous parts of the message. An alternative approach is to use the full message to infer a complete dictionary in advance, and include an explicit representation of the dictionary as part of the compressed message. In this investigation, we develop a compression scheme that is a combination of a simple but powerful phrase derivation method and a compact dictionary encoding. The scheme is highly efficient, particularly in decompression, and has characteristics that make it a favorable choice when compressed data is to be searched directly. We describe data structures and algorithms that allow our mechanism to operate in linear time and space
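As a concrete illustration of the off-line pair-replacement idea described above, here is a deliberately naive Python sketch (quadratic time, no compact dictionary encoding, and none of the paper's linear-time data structures); the names repair_compress and repair_expand are invented for this example.
```python
from collections import Counter

def repair_compress(seq):
    """Naive off-line pair replacement: repeatedly replace the most
    frequent adjacent pair of symbols with a fresh non-terminal until no
    pair occurs twice. Returns the reduced sequence and the phrase rules."""
    seq = list(seq)
    rules, next_id = {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        new_sym = ("R", next_id)          # fresh dictionary symbol
        next_id += 1
        rules[new_sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def repair_expand(sym, rules):
    """Recursively expand one symbol back into original characters."""
    if sym in rules:
        left, right = rules[sym]
        return repair_expand(left, rules) + repair_expand(right, rules)
    return [sym]
```
Decompression is just the table-driven expansion in repair_expand, which is why off-line dictionary schemes of this kind decode quickly.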
... We show that by inspecting as little as 1% of the data to be compressed to derive a model our approach, which we call XRAY, offers better compression effectiveness than GZIP and COMPRESS on a large text collection. Inspecting only around 6% of data from the same collection gives comparable performance to a word-based Huffman coding scheme where the complete data is inspected [8] and, as we have shown elsewhere [2], our approach is two to three times faster in decompression. A direct comparison with the most effective of the comparable escape model approaches described by Moffat, Zobel, and Sharman [8] shows that our approach is slightly more effective, while likely to be two to three times faster. ...
... We have previously described RAY [2], a dictionary-based scheme that replaces frequently occurring symbol pairs in the data with shorter references into the dictionary. RAY builds a hierarchical set of phrases, where longer phrases may contain references to shorter phrases. ...
... We also show in Table 2 a comparison of the performance of XRAY to an escape model semi-static HUFFWORD using the most effective "Method C" of Moffat, Zobel, and Sharman [8] where just over 30 Mb of data was inspected to build the model; note that the MZS-HUFF results are approximate only, as the WSJ collection used in our experiments does not contain the SGML markup that was used in the experiments of Moffat, Zobel, and Sharman. The results show that XRAY is likely to be more efficient than an efficient escape model HUFFWORD scheme; we expect, as we have found elsewhere [2], that inclusion of SGML markup in the WSJ collection would further improve the compression effectiveness of XRAY. ...
Conference Paper
Full-text available
Compression of databases not only reduces space requirements but can also reduce overall retrieval times. We have described elsewhere our RAY algorithm for compressing databases containing general-purpose data, such as images, sound, and also text. We describe here an extension to the RAY compression algorithm that permits use on very large databases. In this approach, we build a model based on a small training set and use the model to compress large databases. Our preliminary implementation is slow in compression, but only slightly slower in decompression than the popular GZIP scheme. Importantly, we show that the compression effectiveness of our approach is excellent and markedly better than the GZIP and COMPRESS algorithms on our test sets.
... The Ray system [Cannane and Williams, 2001] also compresses a message off-line, but through multiple passes so that the message itself does not need to be stored in memory. ...
... Only a small impact on the overall quality of the phrases is expected, even though phase 3 does not pair symbols recursively. But if recursive pairing is necessary, then multiple iterations of phase 3 over the sequence can be made, similar to the compression system Ray by Cannane and Williams [2001]. ...
... But from a browsing point of view, we wish to continue to regard the source document as being monolithic, and it is this requirement that gives rise to the need for RE-MERGE, which combines the compressed blocks generated by RE-PAIR back into a single structure. The multi-pass merging process is somewhat similar to the iterative process performed by the RAY system of Cannane and Williams [2001], which also does symbol pair replacements. ...
... There are many other proposed variants of the LZ78 family, such as the LZMW [242], the LZAP [322], RAY and XRAY [74,75] methods. (See [294] for more variants of the LZ77 and LZ78 compression family). ...
Article
TR-COSC 07/01. This paper provides a survey of techniques for pattern matching in compressed text and images. Normally compressed data needs to be decompressed before it is processed, but if the compression has been done in the right way, it is often possible to search the data without having to decompress it, or at least only partially decompress it. The problem can be divided into lossless and lossy compression methods, and then in each of these cases the pattern matching can be either exact or inexact. Much work has been reported in the literature on techniques for all of these cases, including algorithms that are suitable for pattern matching for various compression methods, and compression methods designed specifically for pattern matching. This work is surveyed in this paper. The paper also exposes the important relationship between pattern matching and compression, and proposes some performance measures for compressed pattern matching algorithms. Ideas and directions for future work are also described.
... There are a number of off-line compressing mechanisms that share features with RE-PAIR, and these might also be assisted by application of block merging. They include RAY [Cannane and Williams, 2001], XRAY [Cannane and Williams, 2000], and OFF-LINE [Apostolico and Lonardi, 2000]. Bentley and McIlroy [1999] describe a slightly different scheme in which long replacements are identified and used as the basis for substitutions. ...
Conference Paper
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once. In this work we consider the Re-Pair mechanism of Larsson and Moffat (2000), which processes large messages as disjoint blocks. We show that the blocks emitted by Re-Pair can be post-processed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.
... There are a number of off-line compressing mechanisms that share features with RE-PAIR, and these might also be assisted by application of block merging. They include RAY [Cannane and Williams, 2001], XRAY [Cannane and Williams, 2000], and OFF-LINE [Apostolico and Lonardi, 2000]. Bentley and McIlroy [1999] describe a slightly different scheme in which long replacements are identified and used as the basis for substitutions. ...
Article
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once—usually as a block size, but sometimes as a direct megabyte limit. In this work we consider the Re-Pair mechanism of Larsson and Moffat (2000), which processes large messages as disjoint blocks to limit memory consumption. We show that the blocks emitted by Re-Pair can be postprocessed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.
... Extensions to sets of strings have started to appear, where now the dictionary consists of repeated substrings that appear anywhere in the collection. Relevant for this review is COMRAD [38], an adaptation of an existing general-purpose compression algorithm [59] to the compression of DNA sequences. COMRAD exploits Huffman coding to encode both the dictionary and the data, and takes into account information about the alphabet size and string variations due to evolution, also allowing reverse-complement matches to be included in the dictionary. ...
Article
High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature and, for each, we provide an overview of the current techniques. Finally, to make this review useful to researchers and technicians applying the existing software and tools, we include a synopsis of the main characteristics of the described approaches, including details on their implementation and availability. Performance of the various methods is also highlighted, although the state of the art does not lend itself to a consistent and coherent comparison among all the methods presented here.
... All special-purpose encoding methods should meet the following criteria [3]: i) they must permit autonomous use of the encoded sequence and independent decompression of the data. ...
Article
Full-text available
Huge amounts of genomic data are produced by high-throughput sequencing technology. These enormous volumes of sequence data require effective storage, fast transmission, and quick access to any record for alignment and analysis. It has been shown that standard general-purpose lossless compression techniques fail to compress these sequences and may even increase their size. However, some general-purpose compression methods may be useful for genome compression with modification. In this paper, a variation of the statistical Huffman algorithm named KMerHuffman is proposed which, instead of calculating the frequency of individual characters, works on substrings of length four, a length we found experimentally to be optimal given the redundancy of genome sequences. KMerHuffman results on benchmark sequences are then compared with other compression algorithms specific to biological sequences. The results show that KMerHuffman is competitive with other methods. Another important aspect is that no reference sequence is needed, so the method is applicable to newly produced sequences.
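To make the KMerHuffman idea concrete, the sketch below treats each length-4 substring as a single symbol and assigns ordinary Huffman codes to those symbols; the chunking into non-overlapping 4-mers and the function name are assumptions made for this illustration, not details taken from the paper.
```python
import heapq
from collections import Counter

def kmer_huffman_codes(sequence, k=4):
    """Build Huffman codewords over the non-overlapping k-mers of a sequence.
    Treating each k-mer as a single symbol exploits the redundancy of genomes."""
    chunks = [sequence[i:i + k] for i in range(0, len(sequence), k)]
    freq = Counter(chunks)
    if not freq:
        return {}
    # heap entries: (frequency, tie-breaker, node); leaves are k-mer strings,
    # internal nodes are (left, right) tuples
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, (a, b)))
        tie += 1
    codes = {}
    def assign(node, prefix=""):
        if isinstance(node, tuple):        # internal node
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                              # leaf: a k-mer
            codes[node] = prefix or "0"
    assign(heap[0][2])
    return codes
```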
... In particular, they are limited to cases where the characteristics of the data (for example, that it consists of ASCII text that can be parsed into words) are known in advance; and compression performance is relatively poor. Another family is based on dictionary inference [3,4,18,24]. These methods use the data (or a large part of it) to infer a dictionary represented as a simple hierarchical grammar, and then replace the bytes or words with references to tokens in the dictionary. ...
Conference Paper
Full-text available
Compression of collections, such as text databases, can both reduce space consumption and increase retrieval efficiency, through better caching and better exploitation of the memory hierarchy. A promising technique is relative Lempel-Ziv coding, in which a sample of material from the collection serves as a static dictionary; in previous work, this method demonstrated extremely fast decoding and good compression ratios, while allowing random access to individual items. However, there is a trade-off between dictionary size and compression ratio, motivating the search for a compact, yet similarly effective, dictionary. In previous work it was observed that, since the dictionary is generated by sampling, some of it (selected substrings) may be discarded with little loss in compression. Unfortunately, simple dictionary pruning approaches are ineffective. We develop a formal model of our approach, based on generating an optimal dictionary for a given collection within a memory bound. We generate measures for identification of low-value substrings in the dictionary, and show on a variety of sizes of text collection that halving the dictionary size leads to only marginal loss in compression ratio. This is a dramatic improvement on previous approaches.
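The relative Lempel-Ziv coding referred to above parses the collection into factors that point into the fixed sample dictionary. The greedy, quadratic sketch below shows only that mechanism; the names and the literal fallback are illustrative choices, not details from the paper.
```python
def rlz_factorize(text, dictionary):
    """Greedy relative Lempel-Ziv parsing sketch: encode `text` as
    (offset, length) factors pointing into a static `dictionary`,
    falling back to literal characters when no match exists."""
    factors, i = [], 0
    while i < len(text):
        pos, length = -1, 0
        # extend the match one character at a time (simple but quadratic)
        while i + length < len(text):
            candidate = text[i:i + length + 1]
            found = dictionary.find(candidate)
            if found < 0:
                break
            pos, length = found, length + 1
        if length == 0:
            factors.append(("literal", text[i]))
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors

def rlz_decode(factors, dictionary):
    """Rebuild the text by copying factors out of the dictionary."""
    out = []
    for f in factors:
        if f[0] == "literal":
            out.append(f[1])
        else:
            pos, length = f
            out.append(dictionary[pos:pos + length])
    return "".join(out)
```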
... Ray [16] is a general-purpose compression algorithm that can be used to compress DNA sequences. The Ray algorithm is similar to Re-Pair, except that many frequently occurring pairs are replaced in each pass. ...
... The COMRAD algorithm is based on RAY [29], which is a general-purpose compression algorithm. COMRAD reduces the costs of DNA compression compared to RAY. ...
Article
Full-text available
Genomic repositories increasingly contain individual as well as reference sequences, which share long identical and near-identical strings of nucleotides. In this paper a lossless DNA data compression technique called Optimized Base Repeat Length DNA Compression (OBRLDNAComp) is proposed, based upon the redundancy of DNA sequences. Compression is essential for efficient storage, for reducing retrieval time, and for finding similarity within and between sequences. OBRLDNAComp searches for long identical and near-identical strings of nucleotides that are overlooked by other DNA-specific compression algorithms, exploiting the longest possible exact repeats to improve the compression ratio. It scans a sequence from left to right to gather repeat statistics, and then applies a substitution technique to compress those repeats. The algorithm is straightforward and does not need any external reference file; it processes the individual file for both compression and decompression. The achieved compression ratio of 1.673 bits per base outperforms many non-reference-based compression methods.
... It is worth pointing out that for the purpose of implementing this method, it is possible to use parallelization on multiple cores. COMRAD (COMpression using RedundAncy of DNA) [66] is a dictionary construction method, in which a disk-based technique, RAY [67], is exploited to identify exact repeats in large DNA datasets. In this method, the collection is repeatedly scanned to identify repeated substrings, which are then used to construct a corpus-wide dictionary. ...
Article
Full-text available
The ever increasing growth of the production of high-throughput sequencing data poses a serious challenge to the storage, processing and transmission of these data. As frequently stated, it is a data deluge. Compression is essential to address this challenge—it reduces storage space and processing costs, along with speeding up data transmission. In this paper, we provide a comprehensive survey of existing compression approaches, that are specialized for biological data, including protein and DNA sequences. Also, we devote an important part of the paper to the approaches proposed for the compression of different file formats, such as FASTA, as well as FASTQ and SAM/BAM, which contain quality scores and metadata, in addition to the biological sequences. Then, we present a comparison of the performance of several methods, in terms of compression ratio, memory usage and compression/decompression time. Finally, we present some suggestions for future research on biological data compression.
Article
In this paper, we study domain compositions of proteins via compression of whole proteins in an organism for the sake of obtaining the entropy that the individual contains. We suppose that a protein is a multiset of domains. Since gene duplication and fusion have occurred through evolutionary processes, the same domains and the same compositions of domains appear in multiple proteins, which enables us to compress a proteome by using references to proteins for duplicated and fused proteins. Such a network with references to at most two proteins is modeled as a directed hypergraph. We propose a heuristic approach by combining the Edmonds algorithm and an integer linear programming, and apply our procedure to fourteen proteomes of D. discoideum, E. coli, S. cerevisiae, S. pombe, C. elegans, D. melanogaster, A. thaliana, O. sativa, D. rerio, X. laevis, G. gallus, M. musculus, P. troglodytes, and H. sapiens. The compressed size using both of duplication and fusion was smaller than that using only duplication, which suggests the importance of fusion events in evolution of a proteome.
Conference Paper
Nowadays, decreasing cost and better accessibility of sequencing methods have enabled studies of genetic variation between individuals of the same species and also between two related species. This has led to a rapid increase in biological data consisting of sequences that are very similar to each other, these sequences usually being stored together in one database. We propose a compression method based on Wavelet Tree FM-index optimized for compression of a set of similar biological sequences. The compression method is based on tracking single changes (together with their context) between every single sequence and the chosen reference sequence. We call our compression method BIO-FMI. The space complexity of our self-index is O(n + n log σ + N + N log σ + N'(log r + log(N'/r)) + N' + N' log n + r log N' + r log N) bits when applied to a set of r sequences, where n is the length of the reference sequence, N is the total length of distinct segments in all sequences, N' is the count of distinct segments in all sequences, and σ is the size of the alphabet. BIO-FMI distinguishes so-called primary occurrences (occurring in the reference sequence) and secondary occurrences (not occurring in the reference sequence). BIO-FMI can locate each primary occurrence in O(s log σ + r log(N'/r)) time and each secondary occurrence in O(s log σ) time, where s is the length of a sample with a localization pointer. BIO-FMI gives very promising results in compression ratio and in locate time when performed on an extremely repetitive data set (less than 0.5% mutations) and when the searched patterns are of smaller lengths (less than 20 bases). BIO-FMI is competitive in extraction speed and it seems to be superior in time needed to build the index, especially in the case when the alignments of single sequences are given in advance.
Article
Text compression algorithms based on the Burrows-Wheeler transform (BWT) typically achieve a good compression ratio but are slow compared to Lempel-Ziv type compression algorithms. The main culprit is the time needed to compute the BWT during compression and its inverse during decompression. We propose to speed up BWT-based compression by performing a grammar-based precompression before the transform. The idea is to reduce the amount of data that BWT and its inverse have to process. We have developed a very fast grammar precompressor using pair replacement. Experiments show a substantial speed up in practice without a significant effect on compression ratio.
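The proposal is essentially to shrink the transform's input with a cheap pair-replacement grammar pass (see the pair-replacement sketch given earlier) before applying the BWT. For orientation, here is the textbook rotation-sorting form of the BWT and its inverse; real compressors use suffix-array-based constructions rather than this quadratic version.
```python
def bwt(text, sentinel="\0"):
    """Burrows-Wheeler transform via sorted rotations (illustrative only)."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last, sentinel="\0"):
    """Invert the BWT by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```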
Conference Paper
The performance of data compression on a large static text may be improved if certain variable-length strings are included in the character set for which a code is generated. A new method for extending the alphabet is presented, based on a reduction to a graph-theoretic problem. A related optimization problem is shown to be NP-complete, a fast heuristic is suggested, and experimental results are presented.
Article
Compression of large collections can lead to improvements in retrieval times by offsetting the CPU decompression costs with the cost of seeking and retrieving data from disk. We propose a semistatic phrase-based approach called xray that builds a model offline using sample training data extracted from a collection, and then compresses the entire collection online in a single pass. The particular benefits of xray are that it can be used in applications where individual records or documents must be decompressed, and that decompression is fast. The xray scheme also allows new data to be added to a collection without modifying the semistatic model. Moreover, xray can be used to compress general-purpose data such as genomic, scientific, image, and geographic collections without prior knowledge of the structure of the data. We show that xray is effective on both text and general-purpose collections. In general, xray is more effective than the popular gzip and compress schemes, while being marginally less effective than bzip2. We also show that xray is efficient: of the popular schemes we tested, it is typically only slower than gzip in decompression. Moreover, the query evaluation cost of retrieving documents from a large collection with our search engine is improved by more than 30% when xray is incorporated, compared to an uncompressed approach. We use simple techniques for obtaining the training data from the collection to be compressed and show that with just over 4% of data the entire collection can be effectively compressed. We also propose four schemes for phrase-match selection during the single pass compression of the collection. We conclude that with these novel approaches xray is a fast and effective scheme for compression and decompression of large general-purpose collections.
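A skeletal view of the two-stage semistatic workflow described above, training a model on a small sample and then compressing the whole collection in a single pass, is sketched below. The phrase-derivation step here (frequent substrings) is a crude stand-in for xray's hierarchical phrase construction, and all names and thresholds are assumptions made for this illustration.
```python
from collections import Counter

def train_phrases(sample, max_len=8, min_freq=4):
    """Derive a phrase set from a small training sample only: here simply
    the substrings of length 2..max_len that occur at least min_freq times."""
    phrases = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(sample) - n + 1):
            phrases[sample[i:i + n]] += 1
    return {p for p, f in phrases.items() if f >= min_freq}

def encode_with_model(text, phrases, max_len=8):
    """Single-pass greedy encoding of the full collection with the fixed model:
    emit the longest known phrase at each position, else a literal character."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in phrases:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens
```
Because the model is fixed after training, later additions to the collection can be encoded with the same phrase table, which is the property the abstract highlights.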
Article
Full-text available
In mobile computing, issues such as limited resources, network capacities and organisational constraints may cause the complete replication of large databases on a mobile device to be infeasible. At the same time, some on-board storage of data is attractive as communication to the main database can be inconsistent. Thus, as the emphasis on application mobility increases, data summarisation offers a useful solution to improving response times and the availability of data. These summarisation techniques can also be of benefit to distributed databases, particularly those with mobile components or where the profile of the transaction load varies significantly over time. This paper surveys summarisation techniques used for mobile distributed databases. It also surveys the manner in which database functionality is maintained in mobile database systems, including query processing, data replication, concurrency control, transaction support and system recovery.
Article
Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical and near-identical strings of nucleotides. However, the sequential processing used by most compression algorithms, and the volumes of data involved, mean that these long-range repetitions are not detected. An order-insensitive, disk-based dictionary construction method can detect this repeated content and use it to compress collections of sequences. We explore a dictionary construction method that improves repeat identification in large DNA datasets. Our adaptation, COMRAD, of an existing disk-based method identifies exact repeated content in collections of sequences with similarities within and across the set of input sequences. COMRAD compresses the data over multiple passes, which is an expensive process, but allows COMRAD to compress large datasets within reasonable time and space. COMRAD allows for random access to individual sequences and sub-sequences without decompressing the whole dataset. COMRAD has no competitor in terms of the size of datasets that it can compress (extending to many hundreds of gigabytes) and, even for smaller datasets, the results are competitive compared to alternatives; as an example 39 S. cerevisiae genomes compressed to 0.25 bits per base.
Article
Full-text available
Data stored in real-time databases are highly valuable and keep growing as a result of business requirements and user needs. Transmitting such large quantities of data is costly, and managing them in a real-time environment is a major challenge; traditional methods have difficulty dealing with such large databases. In this paper we provide an overview of various data compression techniques and a solution for optimizing the compression of real-time databases, achieving better performance than a conventional database system. The approach compresses real-time data more effectively, reduces storage requirements and cost, and increases backup speed. Compression for the real-time environment is implemented with an Iterative Length Compression algorithm, which also provides parallel storage backup for a real-time database system.
Article
A new electronic watermarking method is proposed in which the computer-generated hologram (CGH) technique is applied. Since the CGH used here is based on binary data, the original CGH data can be recovered by a binary process under some conditions, even if noise is added by media transformation. This advantage of the present method is demonstrated by a computer simulation. Further, when a random-phase CGH is used, the CGH pattern is distributed in a random fashion. Therefore, when embedding a signature image, the method is suitable for encoding secret information, such as the embedding location into the image, and for pattern decomposition of the signature image. As examples of this encoding, a method for interchanging the CGH regions and a method for combining the different CGHs are proposed. The effectiveness of the method is demonstrated by a simulation experiment. © 2000 Scripta Technica, Electron Comm Jpn Pt 3, 84(1): 21–31, 2001
Article
Full-text available
This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of "semi-structured", we focus attention on a particular, and very large, example of such text. Consequently the work is a case study of the application of grammar-based compression to a large-scale problem. We begin by identifying some characteristics of semi-structured text that have special relevance to data compression. We then give a brief account of a particular large textual database, and describe a compression scheme that exploits its structure. In addition to providing compression, the system gives some insight into the structure of the database. Finally we show how the hierarchical grammar can be generalized, first manually and then automatically, to yield further improvements in compression performance.
Article
Full-text available
PIR-International is an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. A major objective of PIR-International is to continue the development of the Protein Sequence Database as an essential public resource for protein sequence information. This paper briefly describes the architecture of the Protein Sequence Database and how it and associated data sets are distributed and can be accessed electronically.
Article
Full-text available
The GenBank® sequence database incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (Web) or Sequin programs to format and send sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE® abstracts from published articles describing the sequences are included as an additional source of biological annotation through the PubMed search system. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, Email, and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the URL: http://www.ncbi.nlm.nih.gov
Conference Paper
Full-text available
Compression of databases not only reduces space requirements but can also reduce overall retrieval times. We have described elsewhere our RAY algorithm for compressing databases containing general-purpose data, such as images, sound, and also text. We describe here an extension to the RAY compression algorithm that permits use on very large databases. In this approach, we build a model based on a small training set and use the model to compress large databases. Our preliminary implementation is slow in compression, but only slightly slower in decompression than the popular GZIP scheme. Importantly, we show that the compression effectiveness of our approach is excellent and markedly better than the GZIP and COMPRESS algorithms on our test sets.
Article
Full-text available
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in sequences of discrete symbols and uses that information for compression. On many practical sequences it performs well at both compression and structural inference, producing comprehensible descriptions of sequence structure in the form of grammar rules. The algorithm can be stated concisely in the form of two constraints on a context-free grammar. Inference is performed incrementally, the structure faithfully representing the input at all times. It can be implemented efficiently and operates in time that is approximately linear in sequence length. Despite its simplicity and efficiency, SEQUITUR succeeds in inferring a range of interesting hierarchical structures from naturally occurring sequences.
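The two constraints mentioned in the abstract are digram uniqueness (no pair of adjacent symbols appears more than once in the grammar) and rule utility (every rule other than the start rule is used at least twice). The checker below, written for this summary rather than taken from the paper, makes those constraints concrete for a grammar given as a dictionary of rules.
```python
from collections import Counter

def violated_constraints(grammar, start="S"):
    """Check SEQUITUR's two grammar constraints on {rule_name: [symbols...]}.
    Returns (repeated_digrams, under_used_rules); both empty means the grammar
    satisfies digram uniqueness and rule utility."""
    digrams, uses = Counter(), Counter()
    for rhs in grammar.values():
        for a, b in zip(rhs, rhs[1:]):
            digrams[(a, b)] += 1           # count adjacent symbol pairs
        for sym in rhs:
            if sym in grammar:
                uses[sym] += 1             # count uses of each non-terminal
    repeated = [d for d, c in digrams.items() if c > 1]
    under_used = [r for r in grammar if r != start and uses[r] < 2]
    return repeated, under_used
```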
Article
From its origin the Protein Sequence Database has been designed to support research and has focused on comprehensive coverage, quality control and organization of the data in accordance with biological principles. Since 1988 the database has been maintained collaboratively within the framework of PIR-International, an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. The database is widely distributed and is available on the World Wide Web, via ftp, email server, on CD-ROM and magnetic media. It is widely redistributed and incorporated into many other protein sequence data compilations, including SWISS-PROT and the Entrez system of the NCBI.
Article
This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested.
Book
"A wonderful treasure chest of information; spanning a wide range of data compression methods, from simple test compression methods to the use of wavelets in image compression. It is unusual for a text on compression to cover the field so completely." - ACM Computing Reviews"Salomon's book is the most complete and up-to-date reference on the subject. The style, rigorous yet easy to read, makes this book the preferred choice. [and] the encyclopedic nature of the text makes it an obligatory acquisition by our library." - Dr Martin Cohn, Brandeis University Data compression is one of the most important tools in modern computing, and there has been tremendous progress in all areas of the field. This fourth edition of Data Compression provides an all-inclusive, thoroughly updated, and user-friendly reference for the many different types and methods of compression (especially audio compression, an area in which many new topics covered in this revised edition appear). Among the important features of the book are a detailed and helpful taxonomy, a detailed description of the most common methods, and discussions on the use and comparative benefits of different methods. The book's logical, clear and lively presentation is organized around the main branches of data compression. Topics and features:*highly inclusive, yet well-balanced coverage for specialists and nonspecialists*thorough coverage of wavelets methods, including SPIHT, EZW, DjVu, WSQ, and JPEG 2000*comprehensive updates on all material from previous editions And these NEW topics: *RAR, a proprietary algorithm*FLAC, a free, lossless audio compression method*WavPack, an open, multiplatform audio-compression algorithm*LZMA, a sophisticated dictionary-based compression method*Differential compression*ALS, the audio lossless coding algorithm used in MPEG-4*H.264, an advanced video codec, part of the huge MPEG-4 project*AC-3, Dolby's third-generation audio codec*Hyperspectral compression of 3D data sets This meticulously enhanced reference is an essential resource and companion for all computer scientists; computer, electrical and signal/image processing engineers; and scientists needing a comprehensive compilation of compression methods. It requires only a minimum of mathematics and is well-suited to nonspecialists and general readers who need to know and use this valuable content. David Salomon is a professor emeritus of computer Science at California State University, Northridge. He has authored numerous articles and books, including Coding for Data and Computer Communications, Guide to Data Compression Methods, Data Privacy and Security, Computer Graphics and Geometric Modeling, Foundations of Computer Security and Transformations and Projections in Computer Graphics.
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
For compression of text databases, semi-static word-based methods provide good performance in terms of both speed and disk space, but two problems arise. First, the memory requirements for the compression model during decoding can be unacceptably high. Second, the need to handle document insertions means that the collection must be periodically recompressed, if compression efficiency is to be maintained on dynamic collections. Here we show that with careful management, the impact of both of these drawbacks can be kept small. Experiments with a word-based model and 500 Mb of text show that excellent compression rates can be retained even in the presence of severe memory limitations on the decoder, and after significant expansion in the amount of stored text. Index Terms: document databases, text compression, dynamic databases, word-based
Article
Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available. This document is available online at ACM Transactions on Information Systems.
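For readers unfamiliar with the mechanism being refined here, a toy floating-point arithmetic coder is sketched below. It conveys only the interval-narrowing idea; the implementation described in the abstract instead uses low-precision integer arithmetic with shift/add operations and a separate modelling module, and this toy breaks down on long messages because of floating-point precision.
```python
def arith_encode(msg, probs):
    """Toy floating-point arithmetic encoder: narrow [low, high) by each
    symbol's probability interval, then return any number inside it."""
    cum, intervals = 0.0, {}
    for sym, p in probs.items():
        intervals[sym] = (cum, cum + p)
        cum += p
    low, high = 0.0, 1.0
    for sym in msg:
        rng = high - low
        s_lo, s_hi = intervals[sym]
        high = low + rng * s_hi
        low = low + rng * s_lo
    return (low + high) / 2

def arith_decode(code, length, probs):
    """Decode `length` symbols by repeatedly locating and rescaling `code`."""
    cum, intervals = 0.0, {}
    for sym, p in probs.items():
        intervals[sym] = (cum, cum + p)
        cum += p
    out = []
    for _ in range(length):
        for sym, (s_lo, s_hi) in intervals.items():
            if s_lo <= code < s_hi:
                out.append(sym)
                code = (code - s_lo) / (s_hi - s_lo)
                break
    return "".join(out)
```
For example, arith_decode(arith_encode("abba", {"a": 0.5, "b": 0.5}), 4, {"a": 0.5, "b": 0.5}) returns "abba".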
Article
Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching. Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools. Availability: The direct coding scheme (cino) is available free of charge by anonymous ftp from goanna.cs.rmit.edu.au in the directory pub/rmit/cino. Contact: E-mail: [email protected]
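A bare-bones version of two-bits-per-base direct coding is shown below for context; the scheme described in the abstract also handles ambiguity codes (wildcards) and is engineered for fast block retrieval, neither of which this sketch attempts.
```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"

def pack(seq):
    """Pack an ACGT string into (length, bytes), four bases per byte."""
    out = bytearray()
    byte, nbits = 0, 0
    for ch in seq:
        byte = (byte << 2) | _CODE[ch]
        nbits += 2
        if nbits == 8:
            out.append(byte)
            byte, nbits = 0, 0
    if nbits:
        out.append(byte << (8 - nbits))   # left-align the final partial byte
    return len(seq), bytes(out)

def unpack(length, packed):
    """Recover the original string from its packed form."""
    seq = []
    for i in range(length):
        byte = packed[i // 4]
        shift = 6 - 2 * (i % 4)           # bases are stored high bits first
        seq.append(_BASE[(byte >> shift) & 3])
    return "".join(seq)
```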
Conference Paper
Computer technology is continually developing, with ongoing rapid improvements in processor speed and disk capacity. At the same time, demands on retrieval systems are increasing, with, in applications such as World-Wide Web search engines, growth in data volumes outstripping gains in hardware performance. We experimentally explore the relationship between hardware and data volumes using a new framework designed for retrieval systems. We show that changes in performance depend entirely on the application: in some cases, even with large increases in data volume, the faster hardware allows improvements in response time; but in other cases, performance degrades far more than either raw hardware statistics or speed on processor-bound tasks would suggest. Overall, it appears that seek times rather than processor limitations are a crucial bottleneck and there is little likelihood of reductions in retrieval system response time without improvements in disk performance
Conference Paper
Dictionary-based modelling is the mechanism used in many practical compression schemes. We use the full message (or a large block of it) to infer a complete dictionary in advance, and include an explicit representation of the dictionary as part of the compressed message. Intuitively, the advantage of this offline approach is that with the benefit of having access to all of the message, it should be possible to optimize the choice of phrases so as to maximize compression performance. Indeed, we demonstrate that very good compression can be attained by an offline method without compromising the fast decoding that is a distinguishing characteristic of dictionary-based techniques. Several nontrivial sources of overhead, in terms of both computation resources required to perform the compression, and bits generated into the compressed message, have to be carefully managed as part of the offline process. To meet this challenge, we have developed a novel phrase derivation method and a compact dictionary encoding. In combination these two techniques produce the compression scheme RE-PAIR, which is highly efficient, particularly in decompression
Conference Paper
Greedy off-line textual substitution refers to the following steepest descent approach to compression or structural inference. Given a long text string x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted text string, until substrings capable of producing contractions can no longer be found. This paper examines the computational issues and performance resulting from implementations of this paradigm in preliminary applications and experiments. Apart from intrinsic interest, these methods may find use in the compression of massively disseminated data, and lend themselves to efficient parallel implementation, perhaps on dedicated architectures
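One step of the greedy paradigm can be approximated by the brute-force search below; the paper is concerned precisely with performing this step efficiently (for example with suffix structures) and with more careful gain accounting than the fixed pointer cost assumed here, so treat this only as an illustration.
```python
def best_substring(text, max_len=30, pointer_cost=3):
    """Pick the substring w whose replacement (all occurrences but one become
    fixed-cost pointers) contracts the text the most under a crude gain model.
    Occurrence counts ignore overlaps, so the estimated gain is an upper bound."""
    best, best_gain = None, 0
    for n in range(2, max_len + 1):
        counts = {}
        for i in range(len(text) - n + 1):
            w = text[i:i + n]
            counts[w] = counts.get(w, 0) + 1
        for w, c in counts.items():
            if c >= 2:
                gain = (c - 1) * (len(w) - pointer_cost)
                if gain > best_gain:
                    best, best_gain = w, gain
    return best, best_gain
```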
Conference Paper
Text compression by inferring a phrase hierarchy from the input is a technique that shows promise as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such eclectic uses as phrase browsing in digital libraries, music analysis, and inferring rules for fractal images. We focus attention on the memory requirements of the method. Since the algorithm operates in linear time, the space it consumes is at most linear with input size. The space consumed does in fact grow linearly with the size of the inferred hierarchy, and this makes operation on very large files infeasible. We describe two elegant ways of curtailing the space complexity of hierarchy inference, one of which yields a bounded space algorithm. We begin with a review of the hierarchy inference procedure that is embodied in the SEQUITUR program. Then we consider its performance on quite large files, and show how the compression performance improves as the file size increases
Article
Compressibility of individual sequences by the class of generalized finite-state information-lossless encoders is investigated. These encoders can operate in a variable-rate mode as well as a fixed-rate one, and they allow for any finite-state scheme of variable-length-to-variable-length coding. For every individual infinite sequence x a quantity rho(x) is defined, called the compressibility of x , which is shown to be the asymptotically attainable lower bound on the compression ratio that can be achieved for x by any finite-state encoder. This is demonstrated by means of a constructive coding theorem and its converse that, apart from their asymptotic significance, also provide useful performance criteria for finite and practical data-compression tasks. The proposed concept of compressibility is also shown to play a role analogous to that of entropy in classical information theory where one deals with probabilistic ensembles of sequences rather than with individual sequences. While the definition of rho(x) allows a different machine for each different sequence to be compressed, the constructive coding theorem leads to a universal algorithm that is asymptotically optimal for all sequences.
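The universal algorithm this analysis leads to is the incremental (LZ78-style) parsing in which each phrase extends a previously seen phrase by one new character. A compact sketch follows; the handling of a trailing matched phrase is one common convention rather than a detail from the paper.
```python
def lz78_parse(text):
    """Incremental parsing: emit (phrase_index, character) pairs, where the
    index refers to a previously seen phrase (0 is the empty phrase)."""
    dictionary = {"": 0}
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                  # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                            # flush a trailing matched phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output
```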
Article
Countable prefix codeword sets are constructed with the universal property that assigning messages in order of decreasing probability to codewords in order of increasing length gives an average codeword length, for any message set with positive entropy, less than a constant times the optimal average codeword length for that source. Some of the sets also have the asymptotically optimal property that the ratio of average codeword length to entropy approaches one uniformly as entropy increases. An application is the construction of a uniformly universal sequence of codes for countable memoryless sources, in which the nth code has a ratio of average codeword length to source rate bounded by a function of n for all sources with positive rate; the bound is less than two for n = 0 and approaches one as n increases.
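One of the simplest codes of this universal family is the Elias gamma code: a positive integer's binary representation preceded by a unary encoding of its length. A short encoder/decoder pair is given below as a concrete instance (illustrative code, not taken from the paper).
```python
def elias_gamma_encode(n):
    """Gamma code: (len-1) zero bits, then the binary digits of n."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":             # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values
```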
Article
Dictionary-based modeling is a mechanism used in many practical compression schemes. In most implementations of dictionary-based compression the encoder operates on-line, incrementally inferring its dictionary of available phrases from previous parts of the message. An alternative approach is to use the full message to infer a complete dictionary in advance, and include an explicit representation of the dictionary as part of the compressed message. In this investigation, we develop a compression scheme that is a combination of a simple but powerful phrase derivation method and a compact dictionary encoding. The scheme is highly efficient, particularly in decompression, and has characteristics that make it a favorable choice when compressed data is to be searched directly. We describe data structures and algorithms that allow our mechanism to operate in linear time and space
Article
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text retrieval system. Using a semi-static word-based compression model, the space needed to store the text is under 30 per cent of the original requirement. The model is used in conjunction with canonical Huffman coding and together these two paradigms provide fast decompression. Experiments with 500 Mb of newspaper articles show that in full-text retrieval environments compression not only saves space, it can also yield faster query processing - a win-win situation.
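Canonical Huffman coding, mentioned above, assigns codewords so that codes of equal length are consecutive integers, which lets a decoder work from just the code lengths and a sorted symbol list. The sketch below shows canonical code assignment from a symbol-to-length map; it is illustrative only and not the retrieval system's actual implementation.
```python
def canonical_codes(lengths):
    """Assign canonical Huffman codewords given {symbol: code_length}.
    Symbols are processed in (length, symbol) order and receive
    consecutive integer codes, shifted left whenever the length grows."""
    code, prev_len, codes = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes
```
For example, canonical_codes({"the": 1, "of": 2, "a": 3, "in": 3}) yields 0, 10, 110 and 111, the shape of table a fast word-based decoder can reconstruct from the length counts alone.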