Article

Algorithms for Adaptive Huffman Codes.

Authors: Gordon V. Cormack, R. Nigel Horspool

Abstract

Huffman's algorithm generates minimum-redundancy codes for a finite set of messages whose transmission frequencies are known. Only binary Huffman codes are considered here. We describe an algorithm that can be generalized, but the binary system certainly remains the best suited to computing applications.
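
As a quick illustration of the minimum-redundancy codes discussed in the abstract, here is a minimal sketch of binary Huffman code construction for symbols with known frequencies; the function name and the example frequencies are illustrative, not taken from the paper.

```python
import heapq

def huffman_codes(freqs):
    """Build binary Huffman codewords for a dict {symbol: frequency}."""
    # Each heap entry: (weight, tie_breaker, {symbol: partial_code}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        # Prefix the two merged subtrees with 0 and 1 respectively.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

print(huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```

Merging the two lightest subtrees at each step is what yields minimum redundancy for the given frequencies.
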


... Conversely, the method should not be applied too rarely, otherwise it loses much of its relevance. A better solution, from Cormack et al. [28], is to age past events relative to current events, in the following way: we choose a value s > 1, close to 1, for instance s = 1.01. ...
... frequency) of recently accessed elements to be increased, moving them up near the root. This method was introduced by Cormack et al. [28] for real time compression of data. It is well adapted when accesses are consecutive. ...
... Here, a step consists of a swap of nodes, subtrees, or both. For an increment or a decrement by any positive value, the algorithm from Cormack et al. [28] updates the tree in O(k) expected time, although it has a particularly bad worst case. If we use this kind of algorithm to update authenticated Huffman trees, we can count the expected number of hash operations in the following way: in each step of the algorithm we need to update ancestor nodes that have been exchanged. ...
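
One simple way to realize the aging idea quoted above is to let the per-event increment grow geometrically with time, so older events weigh less relative to newer ones. The sketch below assumes this particular increment schedule; it illustrates the principle and is not necessarily the exact update rule of Cormack et al. [28].

```python
def aged_frequencies(symbol_stream, s=1.01):
    """Estimate symbol weights while exponentially aging past events.

    The t-th observation is credited with weight s**t (s > 1, close to 1),
    so recent events dominate older ones without ever rescanning history.
    """
    weights = {}
    increment = 1.0
    for sym in symbol_stream:
        weights[sym] = weights.get(sym, 0.0) + increment
        increment *= s           # the next event counts slightly more
    total = sum(weights.values())
    return {sym: w / total for sym, w in weights.items()}

# The recent 'b's outweigh the equally many, but older, 'a's.
print(aged_frequencies("aaaa" * 50 + "bbbb" * 50))
```

In a long-running coder the weights would also be rescaled periodically to keep them within fixed-precision range.
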
Article
We propose models for data authentication which take into account the behavior of the clients who perform queries. Our models reduce the size of the authenticated proof when the frequency of the query corresponding to a given data is higher. Existing models implicitly assume the frequency distribution of queries to be uniform, but in reality, this distribution generally follows Zipf’s law. Our models better reflect reality and the communication cost between clients and the server provider is reduced, allowing the server to save bandwidth. The obtained gain on the average proof size compared to existing schemes depends on the parameter of the Zipf law. The greater the parameter, the greater the gain. When the frequency distribution follows a perfect Zipf’s law, we obtain a gain that can reach 26%. Experiments show the existence of applications for which the Zipf parameter is greater than 1, leading to even higher gains.
... Shortly after Shannon's work, D.A. Huffman introduced what is now called Huffman Coding [32], which proved to be quite a flexible and useful coding scheme. Although Huffman coding is not used in the compression algorithms to be presented later, much of the preliminary work on adaptive and context based coders was done with Huffman coders [16,20]. More recently, a method called Arithmetic Coding has gained popularity due to several practical advantages it has over Huffman coding. ...
... Figure 16 shows the distortion results for the bitplane method and MPEG-2 at a rate of approximately 100:1. ...
... Figure 16: Video compression of the "Venus Cubes" sequence. ...
Article
This thesis presents two novel coding schemes and applications to both two- and three-dimensional image compression. Image compression can be viewed as methods of functional approximation under a constraint on the amount of information allowable in specifying the approximation. Two methods of approximating functions are discussed: Iterated function systems (IFS) and wavelet-based approximations. IFS methods approximate a function by the fixed point of an iterated operator, using consequences of the Banach contraction mapping principle. Natural images under a wavelet basis have characteristic coefficient magnitude decays which may be used to aid approximation. The relationship between quantization, modelling, and encoding in a compression scheme is examined. Context based adaptive arithmetic coding is described. This encoding method is used in the coding schemes developed. A coder with explicit separation of the modelling and encoding roles is presented: an embedded wavelet bitplane coder based on hierarchical context in the wavelet coefficient trees. Fractal (spatial IFSM) and fractal-wavelet (coefficient tree), or IFSW, coders are discussed. A second coder is proposed, merging the IFSW approaches with the embedded bitplane coder. Performance of the coders, and applications to two- and three-dimensional images are discussed. Applications include two-dimensional still images in greyscale and colour, and three-dimensional streams (video). "A thesis presented to the University of Waterloo in fulfilment of the thesis requirement for the degree of Master of Mathematics in Applied Mathematics". Thesis (M.Math.) - University of Waterloo, 2001. Includes bibliographical references. Available in PDF format.
... Splay Tree [9]: rotation, O(log n), yes; Greedy Future [28]: rotation, O(log n), yes; Tango Tree [29]: rotation, Θ(log log n), yes; Adaptive Huffman [30]: subtree swap, Θ(1), no; Push-down Tree [10]: item swap, Θ(1), no; SeedTree ...
... We note that in contrast to binary search trees, our local tree does not require an ordering of the items in the left and right subtrees of a node. Self-adjusting trees have also been explored in the context of coding, where for example adaptive Huffman coding [30], [39]- [42] is used to minimize the depth of most frequent items. The reconfiguration cost, however, is different: in adaptive Huffman algorithms, two subtrees might be swapped at the cost of one. ...
Preprint
Full-text available
We consider the fundamental problem of designing a self-adjusting tree, which efficiently and locally adapts itself towards the demand it serves (namely accesses to the items stored by the tree nodes), striking a balance between the benefits of such adjustments (enabling faster access) and their costs (reconfigurations). This problem finds applications, among others, in the context of emerging demand-aware and reconfigurable datacenter networks and features connections to self-adjusting data structures. Our main contribution is SeedTree, a dynamically optimal self-adjusting tree which supports local (i.e., greedy) routing, which is particularly attractive under highly dynamic demands. SeedTree relies on an innovative approach which defines a set of unique paths based on randomized item addresses, and uses a small constant number of items per node. We complement our analytical results by showing the benefits of SeedTree empirically, evaluating it on various synthetic and real-world communication traces.
... First, whenever the model changes a new set of codes must be computed. Although efficient algorithms exist to do this incrementally [Cormack and Horspool 1984; Faller 1973; Gallager 1978; Knuth 1985; Vitter 1987], they still require storage space for a code tree. If they were to be used for adaptive coding, a different probability distribution, and corresponding set of codes, would be needed for every conditioning class in which a symbol might be predicted. ...
... The probability of each symbol is estimated by its relative frequency. This simple adaptive model is invariably used by adaptive Huffman coders [Cormack and Horspool 1984; Faller 1973; Gallager 1978; Knuth 1985; Vitter 1987; 1989]. A more sophisticated way of computing the probabilities of symbols is to recognize that they will depend on the preceding character. ...
Article
Full-text available
The best schemes for text compression use large models to help them predict which characters will come next. The actual next characters are coded with respect to the prediction, resulting in compression of information. Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling that are suitable for use in practical text compression systems. The strategies fall into three main classes: finite-context modeling, in which the last few characters are used to condition the probability distribution for the next one; finite-state modeling, in which the distribution is conditioned by the current state (and which subsumes finite-context modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.
... Huffman coding uses a varying number of bits to encode a character [4]. ...
... Two drawbacks of Huffman coding that make it a costly solution for genomic compression are its storage complexity, since large tree structures must be recorded for the big alphabet sizes that arise when encoding positions in long sequences, and the need to know the underlying distribution a priori. Adaptive Huffman coding mitigates the second problem, at the cost of increased computational complexity associated with constructing multiple encoding trees [38]. In order to alleviate computational challenges, we implemented so-called canonical Huffman encoding, which bypasses the problem of storing a large code tree by sequentially encoding the lengths of the codes [39]. ...
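
The excerpt above mentions canonical Huffman coding, which stores only the codeword lengths rather than the full tree. The following is a minimal sketch of how canonical codewords can be rebuilt from a length table; the identifiers and the example lengths are illustrative.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codewords given {symbol: code length}.

    Only the lengths need to be stored or transmitted; codewords are
    rebuilt by assigning consecutive values within each length class.
    """
    code = 0
    prev_len = 0
    codes = {}
    # Sort by (length, symbol) so encoder and decoder agree on the order.
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

# Lengths taken from some prefix code, e.g. a:1, b:2, c:3, d:3.
print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))
```
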
Article
Full-text available
Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1–10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression. MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2-13 percent of the original raw metagenomic file sizes. We described the first architecture for reference-based, lossless compression of metagenomic data. The compression scheme proposed offers significantly improved compression ratios as compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline. The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that to run the code one needs a minimum of 16 GB of RAM. In addition, virtual box is set up on a 4GB RAM machine for users to run a simple demonstration.
... Huffman coding has been intensively investigated over the years. Some of the interesting works were provided by Faller [62], Knuth [93], Cormack and Horspool [51], and Vitter [176,177]. These works contain descriptions of methods for storing and maintaining the Huffman tree. ...
Conference Paper
In 2004, Ethiopia launched the Health Extension Program (HEP) to expand the national health program to include community based health interventions as a primary component of the HSDP. HEP is "a package of basic and essential promotive, preventive, and curative health services targeting households in a community, based on the principle of Primary Health Care (PHC) to improve the families' health status with their full participation." HEP became a core component of the broader health system, and it is one of the strategies adopted with a view to achieving universal coverage of primary health care to the rural population by 2009, in a context of limited resources. The overall goal of HEP is to create a healthy society and reduce maternal and child morbidity and mortality rates. To ensure effective function of the HEP program, expansion of primary health care units, strengthening the health system and procurement of drugs and supplies have been emphasized in the design and implementation of HEP. A study shows that access to safe water supply is encouraging and has improved significantly over time. The increased access to safe water would create an enabling environment for the desired change in personal hygiene behavior, such as adopting consistent hand washing at critical times in a day, since this requires access to an adequate quantity of water. Safe water management practice at the source and home remains low, and HEP should focus on creating knowledge and skill on safe water management practice through education and demonstration approaches. Coverage of households with latrine facilities has shown an improvement over time, and the finding that almost all model-family households have access to a latrine facility indicates the effectiveness of HEP and specifically the model-family approach. Source: EDHS 2011; HEP Survey 2010
... Adaptive minimum-redundancy (Huffman) coding is expensive in both time and memory space, and is handsomely outperformed by adaptive AC, in addition to AC's advantage in compression effectiveness [14]. Fenwick's structure requires just n words of memory to manage an n-symbol alphabet, whereas the various implementations of dynamic Huffman coding [15], [16] consume more than 10 times as much memory [17]. ...
Article
Arithmetic coding is increasingly being used in upcoming image and video compression standards such as JPEG2000, and MPEG-4/H.264 AVC and SVC standards. It provides an efficient way of lossless compression and recently, it has been used for joint compression and encryption of video data. In this paper, we present an interpretation of arithmetic coding using chaotic maps. This interpretation greatly reduces the hardware complexity of decoder to use a single multiplier by using an alternative algorithm and enables encryption of video data at negligible computational cost. The encoding still requires two multiplications. Next, we present a hardware implementation using 64 bit fixed point arithmetic on Virtex-6 FPGA (with and without using DSP slices). The encoder resources are slightly higher than a traditional AC encoder, but there are savings in decoder performance. The architectures achieve clock frequency of 400-500 MHz on Virtex-6 xc6vlx75 device. We also advocate multiple symbol AC encoder design to increase throughput/slice of the device, obtaining a value of 4.
... In the case of BWT we obtain a 10x10 matrix of strings, from which we take the first string "aaabbdimoy" and the last string "mbiadayabo". Applying MTF then gives 0,0,0,1,0,3,8,12,14,24 for the first string and 12,2,9,3,5,1,24,1,4,15 for the last. Information content is proportional to log(1/P(x)), i.e. inversely related to probability, so reducing the entropy yields a more compact result, which we have achieved. ...
Article
Full-text available
This paper proposes arithmetic coding for application to data compression. The use of arithmetic codes results in a codeword whose length is close to the optimal value (as predicted by entropy in information theory), thus achieving higher compression. Other techniques, particularly those based on Huffman coding, result in optimal codes for data sets in which the probability model of the symbols satisfies specific requirements. We provide a new, modified concept for entropy-based data compression, modifying the Burrows–Wheeler transform with our own approach. This paper shows empirically and analytically that the proposed modification of the Burrows–Wheeler transform reduces the entropy and improves the compression efficiency. A software implementation is pursued using MTF, the Burrows–Wheeler transform, and arithmetic coding with the proposed modification, suggesting an alternative to the established schemes developed for this purpose.
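
For readers unfamiliar with the building blocks named in this abstract, the sketch below shows a plain (unmodified) Burrows–Wheeler transform followed by move-to-front coding; the paper's own modification of BWT is not reproduced here, and the sentinel character and example string are arbitrary choices.

```python
def bwt(text, end="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    text += end                              # unique end-of-string sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def mtf(text):
    """Move-to-front: emit each symbol's current rank, then move it to the front."""
    table = sorted(set(text))
    out = []
    for ch in text:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

transformed = bwt("banana")
print(transformed, mtf(transformed))
```

BWT groups equal characters together, so the MTF output is dominated by small values, which is exactly what lowers the entropy seen by the final entropy coder.
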
... One of the reported approaches consists of periodically multiplying each counter by a positive real number less than unity [4]. Another approach suggests that the frequency counters should be represented by real numbers rather than integers [2]. These authors proposed exponential incrementing of the counters by choosing a multiplication factor α > 1, suggesting a value of α slightly greater than unity, e.g. ...
Conference Paper
Full-text available
In this paper, we introduce a new approach to adaptive coding which utilizes Stochastic Learning-based Weak Estimation (SLWE) techniques to adaptively update the probabilities of the source symbols. We present the corresponding encoding and decoding algorithms, as well as the details of the probability updating mechanisms. Once these probabilities are estimated, they can be used in a variety of data encoding schemes, and we have demonstrated this, in particular, for the adaptive Fano scheme and an adaptive entropy-based scheme that resembles the well-known arithmetic coding. We include empirical results using the latter adaptive schemes on real-life files that possess a fair degree of non-stationarity. As opposed to higher-order statistical models, our schemes require linear space complexity, and compress with nearly 10% better efficiency than the traditional adaptive coding methods.
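
The abstract describes updating symbol probabilities with a stochastic learning-based weak estimator. Below is a sketch of one commonly cited multinomial weak-estimator update of this kind; the exact rule and the value of the smoothing parameter lam are assumptions for illustration, not necessarily the authors' formulation.

```python
def slwe_update(probs, observed, lam=0.99):
    """One weak-estimator step: shift a little probability mass to the observed symbol.

    With 0 < lam < 1 close to 1, the estimates track non-stationary sources
    because the influence of old observations decays geometrically.
    """
    updated = {s: lam * p for s, p in probs.items()}   # shrink every estimate
    updated[observed] += 1.0 - lam                     # removed mass goes to the observed symbol
    return updated

probs = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
for sym in "aaabaaac" * 10:                            # stream heavily biased towards 'a'
    probs = slwe_update(probs, sym)
print(probs)
```

Note that the estimates always sum to one and need only linear space in the alphabet size, matching the linear space complexity claimed above.
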
Book
String matching is one of the oldest algorithmic techniques, yet still one of the most pervasive in computer science. The past 20 years have seen technological leaps in applications as diverse as information retrieval and compression. This copiously illustrated collection of puzzles and exercises in key areas of text algorithms and combinatorics on words offers graduate students and researchers a pleasant and direct way to learn and practice with advanced concepts. The problems are drawn from a large range of scientific publications, both classic and new. Building up from the basics, the book goes on to showcase problems in combinatorics on words (including Fibonacci or Thue-Morse words), pattern matching (including Knuth-Morris-Pratt and Boyer-Moore like algorithms), efficient text data structures (including suffix trees and suffix arrays), regularities in words (including periods and runs) and text compression (including Huffman, Lempel-Ziv and Burrows-Wheeler based methods).
Article
Huffman’s algorithm for computing minimum-redundancy prefix-free codes has almost legendary status in the computing disciplines. Its elegant blend of simplicity and applicability has made it a favorite example in algorithms courses, and as a result it is perhaps one of the most commonly implemented algorithmic techniques. This article presents a tutorial on Huffman coding and surveys some of the developments that have flowed as a consequence of Huffman’s original discovery, including details of code calculation and of encoding and decoding operations. We also survey related mechanisms, covering both arithmetic coding and the recently developed asymmetric numeral systems approach and briefly discuss other Huffman-coding variants, including length-limited codes.
Article
Full-text available
In the information age, sending data from one end to another needs a lot of space as well as time. Data compression is a technique to compress an information source (e.g. a data file, a speech signal, an image, or a video signal) into the fewest possible number of bits. One of the major factors that influence a data compression technique is the procedure used to encode the source data and the space required for the encoded data. There are many data compression methods in use, of which Huffman coding is the most common. Huffman algorithms come in two variants, static and adaptive. The static Huffman algorithm encodes the data in two passes: in the first pass it calculates the frequency of each symbol, and in the second pass it constructs the Huffman tree. The adaptive Huffman algorithm builds on the Huffman algorithm, constructing the Huffman tree on the fly but taking more space than the static Huffman algorithm. This paper introduces a new data compression algorithm based on Huffman coding. This algorithm not only reduces the number of passes but also reduces the storage space compared to the adaptive Huffman algorithm, and is comparable to the static one.
Conference Paper
Dealing with Digital Terrain Models requires storing and processing of huge amounts of data, obtained from hydrographic measurements. Currently no dedicated methods for DTM data compression exist. In the paper a lossless compression method is proposed, tailored specifically for DTM data. The method involves discarding redundant data, performing differential coding, Variable Length Value coding, and finally compression using LZ77 or PPM algorithm. We present the results of experiments performed on real-world hydrographic data, which prove the validity of the proposed approach.
Chapter
Arithmetic Coding (AC) is widely used for the entropy coding of text and multimedia data. It involves recursive partitioning of the range [0,1) in accordance with the relative probabilities of occurrence of the input symbols. In this work, we present a data (image or video) encryption scheme based on arithmetic coding, which we refer to as Chaotic Arithmetic Coding (CAC). In CAC, a large number of chaotic maps can be used to perform coding, each achieving Shannon-optimal compression performance. The exact choice of map is governed by a key. CAC has the effect of scrambling the intervals without making any changes to the width of the interval in which the codeword must lie, thereby allowing encryption without sacrificing any coding efficiency. We next describe Binary CAC (BCAC) with some simple Security Enhancement (SE) modes which can harden the scheme against known cryptanalysis of AC-based encryption techniques. These modes, namely Plaintext Modulation (PM), Pair-Wise-Independent Keys (PWIK), and Key and ciphertext Mixing (MIX) modes, have insignificant computational overhead, while the BCAC decoder has lower hardware requirements than the BAC coder itself, making BCAC with SE an excellent choice for deployment in secure embedded multimedia systems. A bit sensitivity analysis for key and plaintext is presented along with experimental tests for compression performance.
Article
This paper presents a new data transformation technique that uses algorithms from the fields of data compression, data encryption and imaging to convert data into an image. The purpose of such transformation is twofold: i) reduce data size by means of data compression algorithms, and ii) guarantee that data stays sufficiently cryptic for all unauthorized users who could gain access to the resource containing the data, by means of some encryption algorithms. The result of such transformation is a colourful and "dotty" n × m image. As the image becomes accessible to an authorized user, inverse algorithms are applied to the image to transform it back to the original data. One end result of such transformation is that although the data is both compressed and encrypted, it can still be viewed as an image or manipulated as an editable file - an attractive feature that promises numerous advances in applications related to data storage and transmission, string operations, and security.
Article
Fundamental Data Compression provides all the information students need to be able to use this essential technology in their future careers. A huge, active research field, and a part of many people's everyday lives, compression technology is an essential part of today's Computer Science and Electronic Engineering courses. With the help of this book, students can gain a thorough understanding of the underlying theory and algorithms, as well as specific techniques used in a range of scenarios, including the application of compression techniques to text, still images, video and audio. Practical exercises, projects and exam questions reinforce learning, along with suggestions for further reading. * Dedicated data compression textbook for use on undergraduate courses * Provides essential knowledge for today's web/multimedia applications * Accessible, well structured text backed up by extensive exercises and sample exam questions.
Article
The authors present an accessible implementation of arithmetic coding and detail its performance characteristics. The presentation is motivated by the fact that, although arithmetic coding is superior in most respects to the better-known Huffman method, many authors and practitioners seem unaware of the technique. The authors start by briefly reviewing basic concepts of data compression and introducing the model-based approach that underlies most modern techniques. They then outline the idea of arithmetic coding using a simple example, and present programs for both encoding and decoding. In these programs the model occupies a separate module so that different models can easily be used. Next they discuss the construction of fixed and adaptive models and detail the compression efficiency and execution time of the programs, including the effect of different arithmetic word lengths on compression efficiency. Finally, they outline a few applications where arithmetic coding is appropriate.
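
To make the interval-narrowing idea concrete, here is a deliberately simplified floating-point illustration of arithmetic coding; practical coders, including the implementation described above, instead work incrementally with fixed-precision integer arithmetic and emit bits as the interval narrows.

```python
def arithmetic_encode(message, probs):
    """Toy floating-point arithmetic encoder: narrow [low, high) per symbol."""
    # Cumulative probability ranges, e.g. {'a': (0.0, 0.6), 'b': (0.6, 1.0)}.
    ranges, c = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (c, c + p)
        c += p
    low, high = 0.0, 1.0
    for sym in message:
        lo, hi = ranges[sym]
        width = high - low
        low, high = low + width * lo, low + width * hi
    return (low + high) / 2            # any number inside the final interval

def arithmetic_decode(code, length, probs):
    ranges, c = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (c, c + p)
        c += p
    out = []
    for _ in range(length):
        for sym, (lo, hi) in ranges.items():
            if lo <= code < hi:
                out.append(sym)
                code = (code - lo) / (hi - lo)   # rescale and continue
                break
    return "".join(out)

probs = {"a": 0.6, "b": 0.3, "c": 0.1}
x = arithmetic_encode("abacab", probs)
print(x, arithmetic_decode(x, 6, probs))
```

The float version runs out of precision after a few dozen symbols, which is precisely why real implementations renormalize and output bits on the fly.
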
Article
Optimization Methods for Data Compression. A dissertation presented to the Faculty of the Graduate School of Arts and Sciences of Brandeis University, Waltham, Massachusetts, by Giovanni Motta. Many data compression algorithms use ad-hoc techniques to compress data efficiently. Only in very few cases can data compressors be proved to achieve optimality on a specific information source, and even in these cases, algorithms often use sub-optimal procedures in their execution.
Book
Opening with a detailed review of existing techniques for selective encryption, this text then examines algorithms that combine both encryption and compression. The book also presents a selection of specific examples of the design and implementation of secure embedded multimedia systems. Features: reviews the historical developments and latest techniques in multimedia compression and encryption; discusses an approach to reduce the computational cost of multimedia encryption, while preserving the properties of compressed video; introduces a polymorphic wavelet architecture that can make dynamic resource allocation decisions according to the application requirements; proposes a light-weight multimedia encryption strategy based on a modified discrete wavelet transform; describes a reconfigurable hardware implementation of a chaotic filter bank scheme with enhanced security features; presents an encryption scheme for image and video data based on chaotic arithmetic coding.
Article
This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested.
Article
Data compression is of interest in business data processing, both because of the cost savings it offers and because of the large volume of data manipulated in many business applications. We consider a method and system for transmitting a digital image (i.e., an array of pixels) from a digital data source to a digital data receiver. The smaller the data, the better the transmission speed and the more time is saved; in communication we always want to transmit data efficiently and noise-free. Both the LZW and Huffman data compression methods are lossless, and these methods, or versions of them, are very commonly used for compressing different types of data. Even though Huffman gives better compression results on average, we determine the cases in which LZW performs best and when the compression-efficiency gap between the LZW algorithm and its Huffman counterpart is largest. Hybrid compression gives a better compression ratio than single compression. At first I compressed the original data with the Huffman encoding technique and then with the LZW encoding technique, but this did not give a better compression ratio than single LZW compression. I then found that compressing the data with Huffman first and then with LZW gives a better compression ratio in all cases, hence the name "Data compression using Huffman based LZW Encoding". Its compression ratio is above 2.55 in most cases and above 3.25 or more in some cases. It provides a cheap, reliable and efficient system for data compression in digital communication systems.
Article
We propose an efficient, sub-optimal prefix code construction method for discrete sources with a finite alphabet and known probability mass function (pmf). It is well known that for a source that puts out symbols x_i with probability p_i, the optimal codeword lengths are l_i = log(1/p_i). However, codeword lengths are integers and log(1/p_i) is, in general, not an integer. We propose a method to find binary codewords for x_i whose lengths are initially assumed to be ⌈log(1/p_i)⌉ − 1. Every prefix code must satisfy Kraft's inequality, but our initial codeword lengths may not satisfy it. Using a simplified version of the subset sum problem we find a minimal set of codeword lengths that must be increased from ⌈log(1/p_i)⌉ − 1 to ⌈log(1/p_i)⌉, so that Kraft's inequality is satisfied. Even though this solution is not optimal, it leads to average codeword lengths that are close to optimal and in some cases codeword lengths that are optimal. Unlike the Huffman code, our solution does not require the ordering of probabilities in the pmf. The efficiency of our method can be further improved by reducing the size of the subset sum problem. The example of English text shows that our method leads to a solution that is very close to the optimal solution. The proposed method can also be used for encryption, thereby accomplishing both compression and encryption simultaneously.
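
The construction above starts from shortened lengths and then lengthens a few codewords until Kraft's inequality holds. The sketch below follows that outline but substitutes a simple greedy choice for the subset-sum step described in the abstract, so it approximates the idea rather than reproducing the authors' method; the example pmf is illustrative.

```python
import math

def near_optimal_lengths(pmf):
    """Start from ceil(log2(1/p)) - 1 and lengthen codewords until Kraft holds.

    The paper solves a subset-sum problem to pick a minimal set of codewords
    to lengthen; this sketch substitutes a simple greedy choice instead.
    """
    lengths = {s: max(1, math.ceil(math.log2(1 / p)) - 1) for s, p in pmf.items()}
    kraft = sum(2.0 ** -l for l in lengths.values())
    # Lengthen the codewords contributing most to the Kraft sum (the shortest ones) first.
    for s in sorted(lengths, key=lambda s: lengths[s]):
        if kraft <= 1.0:
            break
        kraft -= 2.0 ** -lengths[s] / 2    # going from l to l+1 halves its Kraft term
        lengths[s] += 1
    return lengths

pmf = {"e": 0.35, "t": 0.25, "a": 0.2, "o": 0.12, "n": 0.08}
print(near_optimal_lengths(pmf))
```

Once a feasible length set is known, codewords can be assigned canonically from the lengths alone, as in the canonical Huffman sketch earlier on this page.
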
Article
A new one-phase technique for compression text files is presented as a modification of the Ziv and Lempel compression scheme. The method replaces parts of words in a text by references to a fixed-size dictionary which contains the subwords of the text already compressed. An essential part of the technique is the concept of reorganization. Its purpose is to drop from the dictionary the parts which are never used. The reorganization principle is based on observations of information theory and structural linguistics. By the reorganization concept the method can adapt to any text file with no a priori knowledge of the nature of the text. © 1988 John Wiley & Sons, Inc.
Article
Full-text available
Molecular dynamics (MD) simulations generate vast amounts of data. A typical 100-million-atom MD simulation produces approximately 5 gigabytes of data per frame, consisting of atom types, coordinates and velocities. The main contribution of the report is the specification of an MD compressor which targets simulations containing large amounts of water. Water is targeted since many MD simulations contain significant amounts of water molecules. The compressor uses an approach based on predictive point cloud compressors, but with predictors tailored towards models of water. A point cloud is a collection of points in 3D space. The method improves on existing point cloud compressors [Gumhold et al., 2005, Devillers and Gandoin, 2000] and the MD compressor of [Omeltchenko et al., 2000] when applied to MD data with significant amounts of water molecules.
Article
An OPM/L data compression scheme suggested by Ziv and Lempel, LZ77, is applied to text compression. A slightly modified version suggested by Storer and Szymanski, LZSS, is found to achieve compression ratios as good as most existing schemes for a wide range of texts. LZSS decoding is very fast, and comparatively little memory is required for encoding and decoding. Although the time complexity of LZ77 and LZSS encoding is O(M) for a text of M characters, straightforward implementations are very slow. The time consuming step of these algorithms is a search for the longest string match. Here a binary search tree is used to find the longest string match, and experiments show that this results in a dramatic increase in encoding speed. The binary tree algorithm can be used to speed up other OPM/L schemes, and other applications where a longest string match is required. Although the LZSS scheme imposes a limit on the length of a match, the binary tree algorithm will work without any limit.
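
A minimal LZSS sketch is shown below to illustrate the literal/reference output format; it uses a brute-force longest-match scan rather than the binary search tree that the paper introduces to speed up this step, and the window and match-length limits are arbitrary.

```python
def lzss_encode(data, window=4096, min_match=3, max_match=18):
    """Naive LZSS: emit literals or (offset, length) references into the window.

    The paper speeds up the longest-match search with a binary search tree;
    this sketch uses a brute-force scan for clarity.
    """
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        start = max(0, i - window)
        for j in range(start, i):
            k = 0
            while k < max_match and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            out.append(("ref", best_off, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lzss_decode(tokens):
    out = []
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):
                out.append(out[-off])   # copy one symbol at a time, allowing overlap
    return "".join(out)

tokens = lzss_encode("abracadabra abracadabra")
print(tokens, lzss_decode(tokens))
```
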
Article
Arithmetic coding, in conjunction with a suitable probabilistic model, can provide nearly optimal data compression. In this article we analyze the effect that the model and the particular implementation of arithmetic coding have on the code length obtained. Periodic scaling is often used in arithmetic coding implementations to reduce time and storage requirements; it also introduces a recency effect which can further affect compression. Our main contribution is introducing the concept of weighted entropy and using it to characterize in an elegant way the effect that periodic scaling has on the code length. We explain why and by how much scaling increases the code length for files with a homogeneous distribution of symbols, and we characterize the reduction in code length due to scaling for files exhibiting locality of reference. We also give a rigorous proof that the coding effects of rounding scaled weights, using integer arithmetic, and encoding end-of-file are negligible.
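
The periodic scaling analysed above is commonly implemented by halving all frequency counts once their total reaches a threshold. The sketch below assumes that halving rule and an arbitrary threshold value.

```python
def update_counts(counts, symbol, max_total=16384):
    """Increment a symbol count; halve all counts when the total gets large.

    Halving keeps fixed-precision coders within range and introduces the
    recency effect analysed in the paper: recent symbols weigh relatively more.
    """
    counts[symbol] = counts.get(symbol, 0) + 1
    if sum(counts.values()) >= max_total:
        # Round up so that no symbol's count drops to zero.
        for s in counts:
            counts[s] = (counts[s] + 1) // 2
    return counts

counts = {}
for ch in "abracadabra" * 2000:
    update_counts(counts, ch)
print(counts)
```
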
Article
Many contemporary data compression schemes distinguish two distinct components: modelling and coding. Modelling involves estimating probabilities for the input symbols, while coding involves generating a sequence of bits in the output based on those probabilities. Several different methods for coding have been proposed. The best known are Huffman's code and the more recent technique of arithmetic coding, but other viable methods include various fixed codes, fast approximations to arithmetic coding, and splay coding. Each offers a different trade-off between compression speed, memory requirements, and the amount of compression obtained. This paper evaluates the performance of these methods in several situations and determines which is the most suitable for particular classes of application.
Article
A technique for image data compression (and reconstruction) has been implemented on a transputer network. This technique develops in three phases: data decorrelation by Discrete Cosine Transform (DCT), quantization and Huffman coding. Two different architectural solutions are presented for the decorrelating phase, which is the most compute-intensive part of the overall processing. The first one relies on the parallelism in the algorithm for the DCT, while the second one is based on the "farm processing" approach. Evaluations in terms of processing speed and efficiency are given for both these solutions.
Article
Semistatic minimum-redundancy prefix (MRP) coding is fast compared with rival coding methods, but requires two passes during encoding. Its adaptive counterpart, dynamic Huffman coding, requires only one pass over the input message for encoding and decoding, and is asymptotically efficient. Dynamic Huffman (1952) coding is, however, notoriously slow in practice. By removing the restriction that the code used for each message symbol must have minimum-redundancy and thereby admitting some compression loss, it is possible to improve the speed of adaptive MRP coding. This paper presents a controlled method for trading compression loss for coding speed by approximating symbol frequencies with a geometric distribution. The result is an adaptive MRP coder that is asymptotically efficient and also fast in practice
Article
With the dramatic increase of electronic Arabic content, text compression techniques will play a major role in several domains and applications such as search engines, data archiving, and searching and retrieval from huge databases. The combination of compression and indexing techniques offers the interesting possibility of working directly on compressed textual files or databases, which saves time and resources. The existing compression techniques and tools are generic and do not consider the specific characteristics of the Arabic language, such as its derivative nature. Compression techniques should be based on the morphological characteristics of the Arabic language, its grammatical characteristics, the subject of the texts, and their statistical characteristics. The paper surveys the state of the art of Arabic text compression techniques and tools and identifies some research tracks that should be explored in the future. It also presents some dedicated Arabic text compression algorithms which save more physical space and speed up retrieval by searching the text files in their compressed form.
Article
A technique for compressing large databases is presented. The method replaces frequent variable-length byte strings (words or word fragments) in the database by minimum-redundancy codes—Huffman codes. An essential part of the technique is the construction of the dictionary to yield high compression ratios. A heuristic is used to count frequencies of word fragments. A detailed analysis is provided of our implementation in support of high compression ratios and efficient encoding and decoding under the constraint of a fixed amount of main memory. In each phase of our implementation, we explain why certain data structures or techniques are employed. Experimental results show that our compression scheme is very effective for compressing large databases of library records.
Article
A general-purpose data-compression routine—implemented on the IMS database system—makes use of context to achieve better compression than Huffman's method applied character by character. It demonstrates that a wide variety of data can be compressed effectively using a single, fixed compression routine with almost no working storage.
Article
In 1994 Peter Fenwick at the University of Auckland devised an elegant mechanism for tracking the cumulative symbol frequency counts that are required for adaptive arithmetic coding. His structure spends O(log n) time per update when processing the sth symbol in an alphabet of n symbols. In this note we propose a small but significant alteration to this mechanism, and reduce the running time to O(log (1+s)) time per update. If a probability-sorted alphabet is maintained, so that symbol s in the alphabet is the sth most frequent, the cost of processing each symbol is then linear in the number of bits produced by the arithmetic coder. Copyright © 1999 John Wiley & Sons, Ltd.
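
A minimal sketch of Fenwick's structure for cumulative frequency counts follows; it shows the O(log n) update and prefix-sum query, not the probability-sorted refinement proposed in the note, and the alphabet size and example input are illustrative.

```python
class FenwickTree:
    """Cumulative frequency counts with O(log n) update and prefix-sum query."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)          # 1-based internal array

    def update(self, i, delta=1):
        i += 1                              # external indices are 0-based
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)                   # move to the next responsible node

    def cumfreq(self, i):
        """Sum of frequencies of symbols 0..i inclusive."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)
        return total

ft = FenwickTree(256)
for b in b"fenwick frequencies":
    ft.update(b)
print(ft.cumfreq(ord("f")), ft.cumfreq(255))   # counts up to 'f', then the grand total
```

An adaptive arithmetic coder uses exactly these cumulative counts to locate each symbol's sub-interval.
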
Article
An industrial case study of data compression within the library automation domain is described. A context dependent approach, where the individual records require file-independent compression and expansion, is evaluated. The discussed approach favorably compares against popular compression algorithms. Comparisons were made against commercially available implementations of the conventional compression schemes. The described approach is now in use by The Library Corporation.
Article
There has been an unparalleled explosion of textual information flow over the internet through electronic mail, web browsing, digital library and information retrieval systems, etc. Since there is a persistent increase in the amount of data that needs to be transmitted or archived, the importance of data compression is likely to increase in the near future. Virtually, all modern compression methods are adaptive models and generate variable-bit-length codes that must be decoded sequentially from beginning to end. If there is any error during transmission, the entire file cannot be retrieved safely. In this article we propose few fault-tolerant methods of text compression that facilitate decoding to begin with any part of compressed file not necessarily from the beginning. If any sequence of one or more bytes is changed during transmission of compressed file due to various reasons, the remaining data can be retrieved safely. These algorithms also support reversible decompression. [Article copies are available for purchase from InfoSci-on-Demand.com].
Article
Full-text available
The state of the art in data compression is arithmetic coding, not the better-known Huffman method. Arithmetic coding gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.
Article
Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available. This document is available online at ACM Transactions on Information Systems.
Article
Rice developed a universal noiseless coding structure that provides efficient performance over an extremely broad range of source entropy. This is accomplished by adaptively selecting the best of several easily implemented variable length coding algorithms. Variations of such noiseless coders have been used in many NASA applications. Custom VLSI coder and decoder modules capable of processing over 50 million samples per second have been fabricated and tested. In this study, the first of the code options used in this module development is shown to be equivalent to a class of Huffman code under the Humblet condition, for source symbol sets having a Laplacian distribution. Except for the default option, other options are shown to be equivalent to the Huffman codes of a modified Laplacian symbol set, at specified symbol entropy values. Simulation results are obtained on actual aerial imagery over a wide entropy range, and they confirm the optimality of the scheme. Comparisons with other known techniques are performed on several widely used images and the results further validate the coder's optimality.
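
The Rice code family referred to above is built from a unary-coded quotient followed by k remainder bits. A minimal sketch with an illustrative parameter follows (k ≥ 1 is assumed).

```python
def rice_encode(n, k):
    """Rice code of a non-negative integer: unary quotient, then k remainder bits (k >= 1)."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, "0{}b".format(k))

def rice_decode(bits, k):
    q = bits.index("0")                       # length of the unary prefix
    r = int(bits[q + 1:q + 1 + k], 2)
    return (q << k) | r

for n in (0, 1, 5, 19):
    code = rice_encode(n, 2)
    print(n, code, rice_decode(code, 2))
```

Selecting k adaptively per block of samples is what lets such a coder follow a wide range of source entropies, which is the essence of the scheme described above.
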
Article
Full-text available
A broad spectrum of techniques for electrocardiogram (ECG) data compression have been proposed during the last three decades. Such techniques have been vital in reducing the digital ECG data volume for storage and transmission. These techniques are essential to a wide variety of applications ranging from diagnostic to ambulatory ECG's. Due to the diverse procedures that have been employed, comparison of ECG compression methods is a major problem. Present evaluation methods preclude any direct comparison among existing ECG compression techniques. The main purpose of this paper is to address this issue and to establish a unified view of ECG compression techniques. ECG data compression schemes are presented in two major groups: direct data compression and transformation methods. The direct data compression techniques are: ECG differential pulse code modulation and entropy coding, AZTEC, Turning-point, CORTES, Fan and SAPA algorithms, peak-picking, and cycle-to-cycle compression methods. The transformation methods briefly presented, include: Fourier, Walsh, and K-L transforms. The theoretical basis behind the direct ECG data compression schemes are presented and classified into three categories: tolerance-comparison compression, differential pulse code modulation (DPCM), and entropy coding methods. The paper concludes with the presentation of a framework for evaluation and comparison of ECG compression schemes.
Article
Remote telemonitoring of physiological parameters is being used increasingly both for medical research and for clinical management. However, the technique entails the collection of large volumes of information for transmission and storage which necessitates efficient means of data compression. This paper presents a simple mathematical model of the lossy compression of physiological signals. The model is developed in a top-down style making design decisions progressively. First a general model of lossy compression is developed and some lemmas presented. Then the model is refined to take advantage of inherent cyclicity in many types of physiological signal. Finally the model is specialized to the compression of electrocardiograms. An algorithm is described for implementing the model. The algorithm is a table-based method capable of achieving very low output bit rates (50 bps) with reasonable fidelity. This compares favourably with the results obtained by other workers and holds promise for further evaluation. The generality of the model should ultimately allow a coherent approach to the compression of the various types of physiological signal currently being recorded.
Conference Paper
Full-text available
Random access text compression is a type of compression technique in which there is direct access to the compressed data. It allows decompression to start from any place in the compressed file, not necessarily from the beginning. If any byte is changed during transmission, the remaining data can be retrieved safely. In this paper, an attempt has been made to develop a few algorithms for random access text compression based on the byte pair encoding scheme (Gage, 1997). The BPE algorithm relies on the fact that the ASCII character set uses only codes from 0 through 127. That frees up codes from 128 through 255 for use as pair codes. A pair code is a byte used to replace the most frequently appearing pair of bytes in the text file. Five algorithms are developed based on this byte pair encoding scheme. These algorithms find the unused bytes at each level and try to use those bytes to replace the most frequently occurring byte pairs.
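
A minimal sketch of the byte pair encoding idea described above follows; it assumes ASCII input so that codes 128-255 are free to stand for pairs, and the cap on the number of pair codes is an arbitrary choice.

```python
def bpe_compress(data: bytes, max_codes=128):
    """Byte-pair encoding: repeatedly replace the most frequent byte pair with a free code.

    ASCII text leaves codes 128-255 unused, so they can stand for pairs.
    Returns the compressed bytes and the pair table needed for expansion.
    """
    data = bytearray(data)
    table = {}
    free = list(range(128, 256))
    for _ in range(max_codes):
        pairs = {}
        for a, b in zip(data, data[1:]):
            pairs[(a, b)] = pairs.get((a, b), 0) + 1
        if not pairs or not free:
            break
        best, count = max(pairs.items(), key=lambda kv: kv[1])
        if count < 2:
            break                      # no pair worth replacing any more
        code = free.pop(0)
        table[code] = best
        out, i = bytearray(), 0
        while i < len(data):
            if i + 1 < len(data) and (data[i], data[i + 1]) == best:
                out.append(code)
                i += 2
            else:
                out.append(data[i])
                i += 1
        data = out
    return bytes(data), table

compressed, table = bpe_compress(b"this is a test, this is only a test")
print(len(compressed), table)
```

Expansion simply replaces each pair code by its two constituent bytes, recursively, using the stored table.
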
Conference Paper
Many modern analog media coders employ some form of entropy coding (EC). Usually, a simple per-letter EC is used to keep the coder's complexity and price low. In some coders, individual symbols are grouped into small fixed-size vectors before EC is applied. We extend this approach to form variable-size vector EC (VSVEC) in which vector sizes may be from 1 to several hundreds. The method is, however, complexity-constrained in the sense that the vector size is always as large as allowed by a pre-set complexity limit. The idea is studied in the framework of a modified discrete cosine transform (MDCT) coder. It is shown experimentally, using diverse audio material, that a rate reduction of about 37% can be achieved. The method is, however, not specific to MDCT coding but can be incorporated in various speech, audio, image and video coders
Conference Paper
Full-text available
A novel predictive lossless coding scheme is proposed. The prediction is based on a new weighted cascaded least mean squared (WCLMS) method. To obtain both a high compression ratio and a very low encoding and decoding delay, the residuals from the prediction are encoded using either a variant of adaptive Huffman coding or a version of adaptive arithmetic coding. WCLMS is especially designed for music/speech signals. It can be used either in combination with psycho-acoustically pre-filtered signals to obtain perceptually lossless coding, or as a stand-alone lossless coder. Experiments on a database of moderate size and a variety of pre-filtered mono-signals show that the proposed lossless coder (which needs about 2 bit/sample for pre-filtered signals) outperforms competing lossless coders, such as ppmz, bzip2, Shorten, and LPAC, in terms of compression ratios. The combination of WCLMS with either of the adaptive coding schemes is also shown to achieve better compression ratios and lower delay than an earlier scheme combining WCLMS with Huffman coding over blocks of 4096 samples
Conference Paper
This paper surveys the theoretical literature on fixed-to-variable-length lossless source code trees, called code trees, and on variable-length-to-fixed lossless source code trees, called parse trees. In particular, the following code tree topics are outlined in this survey: characteristics of the Huffman (1952) code tree; Huffman-type coding for infinite source alphabets and universal coding; the Huffman problem subject to a lexicographic constraint, or the Hu-Tucker (1982) problem; the Huffman problem subject to maximum codeword length constraints; code trees which minimize other functions besides average codeword length; coding for unequal cost code symbols, or the Karp problem, and finite state channels; and variants of Huffman coding in which the assignment of 0s and 1s within codewords is significant such as bidirectionality and synchronization. The literature on parse tree topics is less extensive. Treated here are: variants of Tunstall (1968) parsing; dualities between parsing and coding; dual tree coding in which parsing and coding are combined to yield variable-length-to-variable-length codes; and parsing and random number generation. Finally, questions related to counting and representing code and parse trees are also discussed
Article
This study focuses largely on two issues: (a) improved syntax for iterations and error exits, making it possible to write a larger class of programs clearly and efficiently without "go to" statements; (b) a methodology of program design, beginning with readable and correct, but possibly inefficient, programs that are systematically transformed if necessary into efficient and correct, but possibly less readable, code. The discussion brings out opposing points of view about whether or not "go to" statements should be abolished; some merit is found on both sides of this question. Finally, an attempt is made to define the true nature of structured programming, and to recommend fruitful directions for further study.
Article
A system for the compression of data files, viewed as strings of characters, is presented. The method is general, and applies equally well to English, to PL/I, or to digital data. The system consists of an encoder, an analysis program, and a decoder. Two algorithms for encoding a string differ slightly from earlier proposals. The analysis program attempts to find an optimal set of codes for representing substrings of the file. Four new algorithms for this operation are described and compared. Various parameters in the algorithms are optimized to obtain a high degree of compression for sample texts.
Article
An optimum method of coding an ensemble of messages consisting of a finite number of members is developed. A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
Article
In honor of the twenty-fifth anniversary of Huffman coding, four new results about Huffman codes are presented. The first result shows that a binary prefix condition code is a Huffman code iff the intermediate and terminal nodes in the code tree can be listed by nonincreasing probability so that each node in the list is adjacent to its sibling. The second result upper bounds the redundancy (expected length minus entropy) of a binary Huffman code by P1 + log2[2(log2 e)/e] = P1 + 0.086, where P1 is the probability of the most likely source letter. The third result shows that one can always leave a codeword of length two unused and still have a redundancy of at most one. The fourth result is a simple algorithm for adapting a Huffman code to slowly varying estimates of the source probabilities. In essence, one maintains a running count of uses of each node in the code tree and lists the nodes in order of these counts. Whenever the occurrence of a message increases a node count above the count of the next node in the list, the nodes, with their attached subtrees, are interchanged.
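
The redundancy bound stated above is easy to check numerically: build a Huffman code, compare its expected length with the source entropy, and verify that the difference stays below P1 + 0.086. A small sketch with an illustrative probability distribution follows.

```python
import heapq, math

def huffman_lengths(pmf):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in pmf}
    tie = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1                 # every merged symbol moves one level deeper
        heapq.heappush(heap, (p1 + p2, tie, syms1 + syms2))
        tie += 1
    return lengths

pmf = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
lengths = huffman_lengths(pmf)
avg = sum(pmf[s] * lengths[s] for s in pmf)
entropy = -sum(p * math.log2(p) for p in pmf.values())
p1 = max(pmf.values())
print(avg - entropy, "<=", p1 + 0.086)       # Gallager's redundancy bound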