## No full-text available

In this dissertation the attention is focused on BWT-based compression algorithms. We investigate the properties of this transform and propose an improved compression method based on it.

... When the network bandwidth is low, we can set a smaller chunk size for the chunker, which helps minimize the network transmission time at the cost of higher computational overhead; otherwise, we set a larger chunk size. Specifically, the initial chunk size is set to 1 KB, and under better/worse network conditions, the Chunking Selector chooses a larger/smaller chunk size (chunk sizes of 512 B, 1 KB, and 2 KB for network bandwidths in (0, 10), [10, 40), and [40, +∞) Mbps, respectively), according to our experimental studies (see Figure 16 in Subsection 5.4). Moreover, our later evaluation results in Subsection 5.4 indicate that the chunk sizes selected by the Chunking Selector perform well in NetSync. ...

... In NetSync, unless otherwise explained, the Chunking Selector sets an appropriate average chunk size according to the current network bandwidth (as discussed in Subsection 4.4: 512 B, 1 KB, and 2 KB for network bandwidths in (0, 10), [10, 40), and [40, +∞) Mbps, respectively). The default number of jumping bits L is set to 16 in FastFP(+L). ...
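The bandwidth-to-chunk-size mapping described above is a simple threshold rule; a minimal sketch (the function name is ours, the thresholds are the paper's):

```python
def select_chunk_size(bandwidth_mbps: float) -> int:
    """Pick an average chunk size (bytes) from the measured bandwidth.

    Thresholds follow the Chunking Selector described above: 512 B for
    (0, 10) Mbps, 1 KB for [10, 40) Mbps, and 2 KB for [40, +inf) Mbps.
    Lower bandwidth -> smaller chunks -> finer redundancy detection,
    trading extra computation for less data on the wire.
    """
    if bandwidth_mbps < 10:
        return 512
    elif bandwidth_mbps < 40:
        return 1024
    return 2048
```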

... Benchmark Dataset. Silesia [39] is a widely acknowledged dataset for data compression [40] covering typical data types that are commonly used, including text, executables, pictures, HTML, etc. According to several published studies on real-world and benchmark datasets [41], [24], the file modifications are made at the beginning, middle, and end of a file with a distribution of 70%, 10%, and 20%, respectively [41]. ...

Delta sync (synchronization) is a key bandwidth-saving technique for cloud storage services. The representative delta sync utility, rsync, matches data chunks by sliding a search window byte by byte to maximize redundancy detection for bandwidth efficiency. However, this process can hardly cater to forthcoming high-bandwidth cloud storage services, which require a lightweight delta sync that supports large files well. Moreover, rsync employs invariant chunking and compression methods during the sync process, so it cannot adapt to services in various network environments, which require the sync approach to perform well under different network conditions. Inspired by the Content-Defined Chunking (CDC) technique used in data deduplication, we propose NetSync, a network-adaptive and CDC-based lightweight delta sync approach with lower computing and protocol (metadata) overheads than the state-of-the-art delta sync approaches. Besides, NetSync can choose appropriate compression and chunking strategies for different network conditions. The key idea of NetSync is (1) to simplify the process of chunk matching by proposing a fast weak hash called FastFP that is piggybacked on the rolling hashes from CDC, and by redesigning the delta sync protocol to exploit deduplication locality and weak/strong hash properties; and (2) to minimize the sync time by adaptively choosing chunking parameters and compression methods according to the current network conditions. Our evaluation results driven by both benchmark and real-world datasets suggest NetSync performs 2×–10× faster and supports 30%–80% more clients than the state-of-the-art rsync-based WebR2sync+ and deduplication-based approaches.
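The CDC rolling hash that FastFP piggybacks on can be illustrated with a Gear-style chunker, the common basis of modern CDC schemes. This is our own sketch under stated assumptions (the gear table, mask, and minimum chunk size are illustrative choices, not NetSync's parameters):

```python
import random

random.seed(42)                                  # fixed table for reproducibility
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def cdc_chunks(data: bytes, mask: int = (1 << 13) - 1):
    """Split `data` at content-defined boundaries using a Gear rolling hash.

    A boundary is declared when the low bits of the rolling hash are all
    zero (~8 KB average chunks for a 13-bit mask). The same per-byte hash
    values could be reused ("piggybacked") to build a weak chunk
    fingerprint, which is the idea behind FastFP.
    """
    h = 0
    start = 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & MASK64        # gear update: shift + table lookup
        if (h & mask) == 0 and i + 1 - start >= 64:   # enforce a minimum chunk size
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]                       # trailing chunk
```

Because boundaries depend on content rather than offsets, an insertion early in a file shifts at most a couple of chunk boundaries instead of all of them, which is what makes CDC-based delta sync robust.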

... An advantage of algorithms based on the Burrows–Wheeler transform [1,3,4,16,17,18] is their low time and memory complexity, combined with compression ratios that match the best algorithms known today. Research on algorithms employing the Burrows–Wheeler transform is also conducted at the Institute of Informatics of the Silesian University of Technology, and the results obtained there by S. Deorowicz [12,13,14] are currently the best among algorithms based on the Burrows–Wheeler transform. ...

... • cPPMII – a variant of the prediction-by-partial-matching algorithm cPPMII by D. Shkarin (order 64); results taken from [14], ...

... • BW-SD – a variant of the Burrows–Wheeler algorithm by S. Deorowicz (Weighted Frequency Count method); results taken from [14], ...

This paper presents a brief overview of four modern lossless data compression algorithms: Prediction by Partial Matching, Context Tree Weighting, the Burrows–Wheeler block-sorting compression algorithm, and the switching method of Volf and Willems. In terms of average compression ratio, these algorithms are regarded as the best among universal lossless data compression algorithms.
Studia Informatica, Vol. 24, Nr 1, pp. 159-169, 2003.

... The family of Lempel-Ziv algorithms belongs to the dictionary-based methods of text compression, of which LZ77 is the seminal work (David 2004; Deorowicz 2003; Ziv and Lempel 1977). The LZ77 algorithm uses the previously seen input stream as a dictionary to encode the data, where the dictionary is represented by a window. ...

... For effective compression, the sliding window must be large, but this may be space-inefficient. The variants of LZ77 include LZX, SLH, Lempel-Ziv-Renau (LZR), Statistical Lempel-Ziv, Lempel-Ziv-Bell (LZB), Lempel-Ziv-Pylak-Pawe (LZPP), Lempel-Ziv-Markov chain Algorithm (LZMA), Reduced Offset Lempel-Ziv (ROLZ), Lempel-Ziv and Haruyasu (LZH), LZHuffman, etc. (David 2004; Deorowicz 2003). The Deflate algorithm, used in zip and gzip, is the basis for a large number of compression algorithms that exist today. ...

... LZ77 is mainly used in the Roshal Archive (RAR) format and also in network data compression. Other variations are LZ-Tail (LZT), Lempel-Ziv-Ross-Williams (LZRW1), ZIP, LZ-Prediction (LZP1), Dynamic Markov Compression (DMC), Context Tree Weighting, WinRAR, RAR, Lempel-Ziv-Jakobsson (LZJ), LZH, LZRW1, LZR, LZP2, GZIP, bzip2, LZW, Lempel-Ziv-Fiala-Greene (LZFG), UNIX Compress, V.42bis, Cyclical Redundancy Check (CRC), etc. (Nelson 1989; David 2004; Deorowicz 2003; Brent 1987; Williams 1991; Rodeh et al. 1981; Ramabadran and Gaitonde 1988; RAR 2006). Variations of LZ77 assume that patterns in the input text occur close together, which is not always true; this is a demerit of the approach. ...
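The window-as-dictionary idea behind LZ77 can be sketched in a few lines. This is a deliberately naive illustration (our function names; real codecs use hash chains or suffix structures instead of the brute-force match search, and pack the triples into bits):

```python
def lz77_compress(data: bytes, window: int = 4096, max_len: int = 15):
    """Naive LZ77: emit (offset, length, next_byte) triples.

    The previously seen input inside `window` acts as the dictionary;
    a larger window finds more matches but costs more search time,
    which is the space/time trade-off noted above.
    """
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # keep one byte in reserve so every triple has a literal
            while (length < max_len and i + length + 1 < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    out = bytearray()
    for off, length, nxt in triples:
        for _ in range(length):
            out.append(out[-off])        # copy from window (overlap allowed)
        out.append(nxt)
    return bytes(out)
```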

Data compression as a research area has been explored in depth over the years, resulting in Huffman encoding, LZ77, LZW, GZip, RAR, etc. Much of the research has focused on conventional character/word-based mechanisms without looking at the larger perspective of pattern retrieval from dense and large datasets. We explore the compression perspective of data mining suggested by Naren Ramakrishnan et al., wherein Huffman encoding is enhanced through frequent pattern mining (FPM), a non-trivial phase of the Association Rule Mining (ARM) technique. The paper proposes a novel frequent-pattern-mining-based Huffman encoding algorithm for text data and employs a hash table in the process of frequent pattern counting. The proposed algorithm operates on a pruned set of frequent patterns and is also efficient in terms of database scans and storage space, as it reduces the code table size. The optimal (pruned) set of patterns is employed in the encoding process instead of the character-based approach of conventional Huffman coding. Simulation results over 18 benchmark corpora demonstrate improvements in compression ratio ranging from 18.49% over sparse datasets to 751% over dense datasets. It is also demonstrated that the proposed algorithm achieves a pattern space reduction ranging from 5% over sparse datasets to 502% over dense corpora.
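The core idea, Huffman coding over a symbol set that includes mined frequent patterns rather than single characters only, can be sketched as follows. This is a toy illustration, not the paper's pruning algorithm: the greedy `tokenize` and the hand-picked pattern set stand in for the FPM phase.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Standard Huffman: build a prefix-free code from symbol frequencies."""
    heap = [[f, [s, ""]] for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]      # left subtree gets a leading 0
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]      # right subtree gets a leading 1
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {s: code for s, code in heap[0][1:]}

def tokenize(text, patterns):
    """Greedy longest-match tokenization: frequent patterns become single
    symbols; anything else falls back to single characters."""
    toks, i = [], 0
    pats = sorted(patterns, key=len, reverse=True)
    while i < len(text):
        for p in pats:
            if text.startswith(p, i):
                toks.append(p)
                i += len(p)
                break
        else:
            toks.append(text[i])
            i += 1
    return toks
```

Encoding the token stream instead of the character stream gives one (short) codeword per frequent pattern, which is where the compression gain over character-based Huffman comes from.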

... Telkom Engineering School, Telkom University, Bandung, Indonesia. The development of data communications enables easier exchange of information via mobile devices. Security in the exchange of information on mobile devices is very important. ...

... One way to protect sensitive information is steganography, a technique that hides information by embedding a message inside another message [1], so that no one other than the intended recipient is aware of the message's existence. In previous work, Dr. Ajit Singh and Meenakshi Gahlawat [2] introduced a technique combining steganography and compression on images. In that study, image compression is carried out after the insertion process. ...

The development of data communications enables easier exchange of information via mobile devices. Security in this exchange is very important. One of the weaknesses of steganography is the limited capacity of data that can be inserted. With compression, the size of the data is reduced. In this paper, a system application is designed on the Android platform that implements LSB steganography and cryptography using TEA for the security of a text message. The size of the text message is reduced by lossless compression using the LZW method. The advantage of this method is that it provides double security and allows more message data to be inserted, so it is expected to be a good way to exchange information. The system performs the compression process with an average ratio of 67.42%. The modified TEA algorithm yields an average avalanche effect of 53.8%. The average PSNR of the stego image is 70.44 dB, and the average MOS value is 4.8.
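The LSB embedding step can be sketched on a raw pixel buffer. This is our minimal illustration only (function names and the 4-byte length header are our assumptions); the paper's system additionally applies LZW compression and TEA encryption before embedding:

```python
def lsb_embed(pixels: bytearray, message: bytes) -> bytearray:
    """Hide `message` in the least significant bit of each pixel byte.

    A 4-byte big-endian length header is embedded first so the extractor
    knows where the message ends. Each pixel byte changes by at most 1,
    which is why LSB embedding is visually imperceptible.
    """
    payload = len(message).to_bytes(4, "big") + message
    bits = [(byte >> (7 - k)) & 1 for byte in payload for k in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for message")
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit       # overwrite only the LSB
    return out

def lsb_extract(pixels: bytearray) -> bytes:
    def read_bytes(n, offset):
        val = bytearray()
        for j in range(n):
            b = 0
            for k in range(8):
                b = (b << 1) | (pixels[offset + 8 * j + k] & 1)
            val.append(b)
        return bytes(val)
    length = int.from_bytes(read_bytes(4, 0), "big")
    return read_bytes(length, 32)            # header occupies 32 pixels
```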

... LZMA uses a sliding-window dynamic dictionary compression algorithm and a range coding algorithm, which together offer a high compression rate, small decompression space requirements, and fast speed. Figure 3 shows the LZMA workflow, comprising two compression stages: the sliding-window algorithm based on LZ77 [16] and range encoding [17,18]. LZMA supports dictionary spaces from 4 KB to hundreds of MB, which increases the compression rate but also makes its search cache space very large. ...

... Table 2 shows the test examples of the benchmark test set. The experimental platform is the Sunway TaihuLight supercomputing system, whose parameters are shown in Table 3 [18]. The compression benchmark used in the experiment is the Silesia corpus. ...

With the development of high-performance computing and big data applications, the scale of data transmitted, stored, and processed by high-performance computing cluster systems is increasing explosively. Efficiently compressing large-scale data to reduce the space required for storage and transmission is one of the keys to improving the performance of such systems. In this paper, we present SW-LZMA, a parallel design and optimization of LZMA for the Sunway SW26010 heterogeneous many-core processor. Combined with the characteristics of SW26010 processors, we analyse the storage space requirements, memory access characteristics, and hotspot functions of the LZMA algorithm and implement thread-level parallelism of LZMA based on the Athread interface. Furthermore, we make a fine-grained layout of the LDM address space to achieve a DMA double-buffered cyclic sliding-window algorithm, which optimizes the performance of SW-LZMA. The experimental results show that, compared with the serial baseline implementation of LZMA, the parallel algorithm obtains a maximum speedup of 4.1× on the Silesia corpus benchmark and of 5.3× on a large-scale data set.

... The compression algorithm relies on several statements that hold for the data in the buffer. Considering these statements, we preprocess the data in the buffer for optimal compression and then compress it with one of the standard algorithms, e.g., the DEFLATE method [5]. For better compatibility with standard compression algorithms, the final coding of symbol numbers uses the least significant bit of a number to store its sign and the other bits to store its modulus. ...

... Experiments show that the given algorithm provides more than an elevenfold reduction of the initial buffer size, whereas standard algorithms alone (e.g., DEFLATE [5]) achieve a smaller reduction of about sixfold. At the same time, the preprocessing algorithm requires an insignificant amount of computational resources and barely slows the system. ...

In this paper, we propose an approach for compact storage of big graphs. We propose preprocessing algorithms for graphs of a certain type, which can significantly increase the data density on the disk and increase performance for basic operations with graphs.
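The sign-in-LSB coding mentioned above (sign in the least significant bit, modulus in the remaining bits) is a zigzag-style mapping that keeps small magnitudes small, which helps a back-end such as DEFLATE. A minimal sketch (function names are ours):

```python
def encode_signed(n: int) -> int:
    """Map a signed number to an unsigned one: LSB stores the sign,
    the remaining bits store |n|. Small |n| stays small either way."""
    return (abs(n) << 1) | (1 if n < 0 else 0)

def decode_signed(z: int) -> int:
    """Inverse mapping: low bit is the sign, the rest is the modulus."""
    mag = z >> 1
    return -mag if z & 1 else mag
```

Applied to, say, deltas of adjacent vertex identifiers in a graph, this yields streams dominated by small non-negative integers, exactly the kind of input general-purpose compressors handle well.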

... Huffman coding, developed in 1952 by D. Huffman, is a popular method for data compression, categorized under the entropy coders. An entropy coder is a method that assigns to every symbol of the alphabet a code depending on the probability of the symbol's occurrence [5]. Symbols that are more probable get shorter codes than less probable ones. ...

... Entropy is one of the important concepts in information theory. It is a theoretical measure of the quantity of information [5,6,7,9,12] and of the amount of order or redundancy in a message. The value of entropy is small if there is a lot of order, and large if there is a lot of disorder. ...
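The entropy measure discussed above can be computed directly from symbol frequencies; a small sketch (function name ours):

```python
import math
from collections import Counter

def entropy(data: bytes) -> float:
    """Shannon entropy in bits per symbol: H = -sum(p_i * log2(p_i)).

    Low entropy (much order/redundancy) means the data is highly
    compressible; high entropy bounds lossless compression from below.
    """
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```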

... This modification improves overall compression by about 1-2%, depending on the chosen back-end compressor. We have tried some other variants too, like MTF-1 [14], but they were unsuccessful. ...

... The MTF transform is a simple and well-known representative of the family of transforms coping with the list update problem [7]. Still, in the context of BWT compression, more successful solutions are known (see, e.g., [14]), and it will be interesting to try them out in our application. ...

Web log files, storing user activity on a server, may grow at the pace of hundreds of megabytes a day, or even more, on popular sites. They are usually archived, as this enables further analysis, e.g., for detecting attacks or other server-abuse patterns. In this work we present a specialized lossless Apache web log preprocessor and test it in combination with several popular general-purpose compressors. Our method works on the individual fields of log data (each storing information such as the client's IP, date/time, requested file or query, download size in bytes, etc.) and utilizes compression techniques such as finding and extracting common prefixes and suffixes, dictionary-based phrase sequence substitution, move-to-front coding, and more. The test results show the proposed transform improves the average compression ratio 2.70 times in the case of gzip and 1.86 times in the case of bzip2.
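Move-to-front coding, one of the techniques listed above, is small enough to show in full; a minimal byte-alphabet sketch (function names ours):

```python
def mtf_encode(data: bytes):
    """Move-to-front: each byte is replaced by its index in a
    recency-ordered symbol list, then moved to the front. Runs of
    recently seen symbols become small numbers, which a back-end
    compressor (gzip, bzip2) codes cheaply."""
    alphabet = list(range(256))
    out = []
    for b in data:
        i = alphabet.index(b)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))   # promote to front
    return out

def mtf_decode(indices):
    alphabet = list(range(256))
    out = bytearray()
    for i in indices:
        b = alphabet[i]
        out.append(b)
        alphabet.insert(0, alphabet.pop(i))   # mirror the encoder's update
    return bytes(out)
```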

... In general, lossy techniques offer far greater compression ratios than lossless techniques [5]. There are lossy image compression methods such as vector quantization (VQ), JPEG, subband coding, fractal-based coding, etc. [6]. Lossless coding guarantees that the decompressed image is completely identical to the image before compression. This is an important requirement for some application domains, e.g. ...

... text documents and program executables [5]. There are lossless image compression methods such as run-length encoding, Huffman encoding, entropy encoding, arithmetic coding, and quadtree coding [1]. Data compression reduces the number of bits by identifying and eliminating statistical redundancy; hence, the proposed system is able to encode colour values even if they are irregularly distributed, unlike many other compression algorithms that require the statistical redundancy to be distributed in a regular form. ...

In this research we propose a new compression algorithm that uses a locational compression technique based on the Freeman chain code. The technique consists of two parts. The first part is the compression algorithm, which starts by obtaining the chain code for a particular colour value, then saves the start-point location of the chain code, the colour value, and the chain code itself in the compressed file; the next step is to remove all colour values related to the chain code from the input image and shrink the input image. The algorithm repeats this procedure until no colour values with a significant chain code remain. The second part reconstructs the original image using the start point, colour value, and chain code.
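The chain-code representation underlying this scheme can be sketched as follows. This is a minimal illustration of Freeman chain coding only (direction numbering counter-clockwise from east, function names ours); the paper's shrink-and-repeat loop is omitted:

```python
# 8-connected Freeman directions, numbered counter-clockwise from east
DIRS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
        (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def freeman_chain_code(points):
    """Chain code of a boundary: each step between consecutive
    8-connected points becomes a direction digit 0-7, so a whole region
    can be stored as (start point, colour value, code sequence)."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        codes.append(DIRS[(x1 - x0, y1 - y0)])
    return points[0], codes

def decode_chain(start, codes):
    """Rebuild the boundary points from the start point and the codes."""
    inv = {v: k for k, v in DIRS.items()}
    pts = [start]
    for c in codes:
        dx, dy = inv[c]
        x, y = pts[-1]
        pts.append((x + dx, y + dy))
    return pts
```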

... The files are listed in Table 4.3. Sebastian Deorowicz in his dissertation thesis [33] also points to the file kennedy.xls, which seems controversial to him because its specific structure causes very different results across compression schemes. ...

... The Silesia Corpus was proposed in the dissertation thesis of Sebastian Deorowicz at the Silesian University of Technology (Gliwice, Poland) in 2003 [33]. The author mainly concentrated on the disadvantages of existing corpora. ...

... Hence, it does not need to create the sorted rotation matrix to obtain the BWT output (Figure 3 (a) and (b)). Nevertheless, finding suffix sorting algorithms that run in linear worst-case time is still an open problem. Figure 8 shows a few different methods for BWT computation [17]. We use the BWT of Yuta Mori, which is based on the SA-IS algorithm, to construct the BWT output [18]. ...

... This transform is not very effective since it takes time to favour symbols [8]. The Weighted Frequency Count (WFC) improves the FC transform by defining a function based on symbol frequencies [17]; it also counts the distance of symbols within a sliding window. ...

Common image compression standards are usually based on a frequency transform such as the Discrete Cosine Transform. We present a different approach for lossless image compression, based on a combinatorial transform. The main transform is the Burrows-Wheeler Transform (BWT), which tends to reorder symbols according to their following context. It has become one of the promising compression approaches based on context modeling. BWT was initially applied in text compression software such as BZIP2; nevertheless, it has recently been applied to the image compression field. Compression schemes based on the Burrows-Wheeler Transform have usually been lossless; therefore we implement this algorithm in medical imaging in order to reconstruct every bit. Many variants of the three stages which form the original compression scheme can be found in the literature. We propose an analysis of the latest methods and the impact of their association, and present an alternative compression scheme with a significant improvement over current standards such as JPEG and JPEG2000.
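The rotation-sorting definition of the BWT can be shown with a naive sketch. This is our toy code, assuming a `\x00` sentinel absent from the input; production implementations build a suffix array (e.g. with SA-IS, as the excerpts note) instead of materializing the rotation matrix:

```python
def bwt(s: bytes) -> bytes:
    """Naive BWT: sort all rotations of s + sentinel, output the last
    column. O(n^2 log n) here, for illustration only."""
    s = s + b"\x00"                                # unique end-of-string marker
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(r[-1] for r in rots)

def ibwt(last: bytes) -> bytes:
    """Invert the BWT by repeatedly prepending the last column and
    sorting, rebuilding the sorted rotation matrix column by column."""
    n = len(last)
    table = [b""] * n
    for _ in range(n):
        table = sorted(last[i:i + 1] + table[i] for i in range(n))
    row = next(r for r in table if r.endswith(b"\x00"))
    return row[:-1]                                # strip the sentinel
```

The output is a permutation of the input, so it compresses no better by itself; the gain comes from the long runs of similar symbols it creates for the later MTF/entropy-coding stages.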

... Silesia Corpus [6] [7] is a recently developed set of large files (each greater than 5 MB) that seems to avoid several shortcomings of the Calgary Corpus (e.g., the excess of English textual data). The tests on Silesia Corpus files were carried out with the same specified memory limit (200 MB). ...

This paper presents a PPM variation which combines traditional character based processing with string matching. Such an approach can effectively handle repetitive data and can be used with practically any algorithm from the PPM family. The algorithm, inspired by its predecessors, PPM<sup>*</sup> and PPMZ, searches for matching sequences in arbitrarily long, variable-length, deterministic contexts. The experimental results show that the proposed technique may be very useful, especially in combination with relatively low order (up to 8) models, where the compression gains are often significant and the additional memory requirements are moderate.

... The Sequitur results were obtained with Roberto Maglica's Windows port. GLZA again achieves better compression ratios than the competition, and a better compression ratio than PPMd on three of the files. The following table shows compression ratios in bits per byte for the Silesia Corpus [23]. (PPMd is somewhat memory-limited on webster, samba, and mozilla.) ...

(This is the 2016 full paper; poster was presented at the 2016 Data Compression Conference, and a summary published in the proceedings.)
GLZA is a free, open-source, enhanced grammar-based compressor that constructs a low-entropy grammar amenable to entropy coding, using a greedy hill-climbing search guided by estimates of encoded string lengths; the estimates are efficiently computed incrementally during (parallelized) suffix tree construction in a batched iterative repeat-replacement cycle. The grammar-coded symbol stream is further compressed by order-1 Markov modeling of trailing/leading subsymbols and selective recency modeling, MTF-coding only symbols that tend to recur soon. This combination results in excellent compression ratios (similar to PPMC's for small files, and averaging within about five percent of PPMd's for large text files of 1 MB to 10 MB) with fast decompression on one or two cores. Compression time and memory use are not dramatically higher than for similarly high-performance asymmetrical compressors of other kinds. GLZA is on the Pareto frontier for text compression ratio and decompression speed on a variety of benchmarks (LTCB, Calgary, Canterbury, Large Canterbury, Silesia, MaximumCompression, World Compression Challenge), compressing better and/or decompressing faster than its competitors (PPM, LZ77-Markov, BWT, etc.), with better compression ratios than previous grammar-based compressors such as RePair, Sequitur, Offline 3 (Greedy), Sequential/grzip, and IRR-S.

... Benchmark Datasets. Silesia [10] is a widely acknowledged dataset for data compression [11], covering commonly used data types including text, executables, pictures, HTML, et cetera. According to several published studies on real-world and benchmark datasets [24], [30], file modifications are made at the beginning, middle, and end of a file with a distribution of 70%, 10%, and 20%, respectively [24]. ...

Delta synchronization (sync) is a key bandwidth-saving technique for cloud storage services. The representative delta sync utility, rsync, matches data chunks by sliding a search window byte by byte to maximize redundancy detection for bandwidth efficiency. This process, however, is difficult to adapt to forthcoming high-bandwidth cloud storage services, which require a lightweight delta sync that supports large files well. Inspired by the Content-Defined Chunking (CDC) technique used in data deduplication, we propose Dsync, a CDC-based lightweight delta sync approach with essentially lower computation and protocol (metadata) overheads than the state-of-the-art delta sync approaches. The key idea of Dsync is to simplify the process of chunk matching by (1) proposing a novel and fast weak hash called FastFp that is piggybacked on the rolling hashes from CDC; and (2) redesigning the delta sync protocol to exploit deduplication locality and weak/strong hash properties. Our evaluation results driven by both benchmark and real-world datasets suggest Dsync performs 2×-8× faster and supports 30%-50% more clients than the state-of-the-art rsync-based WebR2sync+ and deduplication-based approach.

... It simply counts the number of times a character occurs repeatedly in the source file; for example, BOOKKEPPER will be encoded as 1B2O2K1E2P1E1R (Sebastian, 2003). Arup et al. (2013) presented a paper with the objective of examining the performance of various lossless data compression algorithms on different test files. ...

Data compression is the process of reducing the size of a file to effectively reduce storage space and communication cost. The evolution of technology in the digital age has led to an unparalleled usage of digital files in the current decade. This has increased the amount of data transmitted via various channels of data communication, prompting the need to examine current lossless data compression algorithms and check their effectiveness, so as to maximally reduce the bandwidth requirements of communication and data transfer. Four lossless data compression algorithms were selected for implementation: the Lempel-Ziv-Welch algorithm, the Shannon-Fano algorithm, the Adaptive Huffman algorithm, and run-length encoding. The choice of these algorithms was based on their similarities, particularly in application areas. Their efficiency and effectiveness were evaluated using a set of predefined performance metrics, namely compression ratio, compression factor, compression time, saving percentage, entropy, and code efficiency. The algorithms were implemented in the NetBeans Integrated Development Environment using Java as the programming language. Through statistical analysis performed using boxplots and ANOVA, and a comparison of the four algorithms, the Lempel-Ziv-Welch algorithm was the most efficient and effective based on the evaluation metrics.
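The run-length encoding mentioned above (BOOKKEPPER encoded as 1B2O2K1E2P1E1R) can be reproduced in a few lines; a minimal sketch (function name ours, single-digit counts assumed sufficient for the example):

```python
def rle_encode(text: str) -> str:
    """Run-length encoding: replace each run of a repeated character
    by its count followed by the character."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                      # extend the current run
        out.append(f"{j - i}{text[i]}")
        i = j
    return "".join(out)
```

As the example shows, RLE only pays off when runs are common; on text with few repeats the "1X" pairs make the output longer than the input.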

... The Huffman coder is excellent within the class of methods that allocate codes of integer length, while the arithmetic coder is free from this limitation and therefore usually yields a shorter expected code length [9]. ...

... Among dictionary-based coding methods there are many examples, such as LZW coding and the LZ77 and LZ78 families [10]. ...

Medical images, like any other digital data, require compression in order to reduce the disk space needed for storage and the time needed for transmission. Lossless compression is bounded by a limit shown by information theory.
In this thesis, a novel method of lossy medical image compression is proposed, and the Structural SIMilarity (SSIM) index is used to develop a tool capable of improving image quality assessment. The proposed encoding system incorporates the wavelet transform and a neural network to achieve a significant improvement in medical image compression performance. To reduce the computational effort, a new neural system called Adaptive Wavelet Back Propagation (AWBP) has been used and modified.
The encoding system is implemented in two phases. In the first phase, the input medical image is coded using the Lifting Discrete Wavelet Transform (LDWT). The primary compression ratio of this stage is 4:1, with the low-low pass (LL) subband coefficient values retained.
In the second phase, AWBP is applied to the LL subband to reduce the bit rate to 0.0625 bit/pixel with a 64-16 input-hidden layer configuration, and the resulting network is used to encode the coefficients, which removes the correlation among them. In traditional WBP, the position and dilation of the wavelets are fixed and only the weights are optimized. In the proposed AWBP, the position and dilation of the wavelets are not fixed but optimized, and an adaptive learning-rate free parameter is used to improve the performance of the coder.
Comparing the numerical results obtained by the proposed AWBP with those obtained by a standard Back Propagation (BP) neural network reveals the better performance and generality of AWBP.
The traditional Peak Signal-to-Noise Ratio (PSNR) measure focuses mainly on the pixel-by-pixel difference between the original and compressed images. Such a metric is improper for subjective quality assessment, since human perception is very sensitive to specific correlations between adjacent pixels. For the medical images we are concerned with, this measure is subject to strong criticism. The Structural SIMilarity (SSIM) index is therefore used to measure the visual quality of compressed medical images. Our proposed system gives better quality than the JPEG2000 standard at high compression rates (above 80%) and low bit rates. The results show the proposed algorithm has a high correlation with human judgment in assessing reconstructed medical images.
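The PSNR measure criticized above has a one-line definition, 10·log10(MAX²/MSE); a minimal sketch over flat pixel sequences (function name ours):

```python
import math

def psnr(orig, recon, max_val=255):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE).

    A purely pixel-wise fidelity measure; it ignores structural
    correlations between neighbouring pixels, which is why SSIM is
    preferred for judging compressed medical images.
    """
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    if mse == 0:
        return float("inf")              # identical images
    return 10 * math.log10(max_val ** 2 / mse)
```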

... Nevertheless, finding suffix sorting algorithms that run in linear time in the worst case is still an open problem. Figure 3.5 shows a few different methods for BWT computation [25]. We use the BWT of Yuta Mori, which is based on the SA-IS algorithm, to construct the BWT output [52]. ...

Common image compression standards are usually based on a frequency transform such as the Discrete Cosine Transform or wavelets. We present a different approach for lossless image compression, based on a combinatorial transform. The main transform is the Burrows-Wheeler Transform (BWT), which tends to reorder symbols according to their following context. It has become a promising compression approach based on context modelling. BWT was initially applied in text compression software such as BZIP2; nevertheless, it has recently been applied to the image compression field. Compression schemes based on the Burrows-Wheeler Transform are usually lossless; therefore we implement this algorithm in medical imaging in order to reconstruct every bit. Many variants of the three stages which form the original BWT-based compression scheme can be found in the literature. We propose an analysis of the more recent methods and the impact of their association. Then we present several compression schemes based on this transform which significantly improve on current standards such as JPEG2000 and JPEG-LS. In the final part, we present some open problems, which are also further research directions.

... The MTF transform is a simple and well-known representative of the family of transforms coping with the list update problem. Still, in the context of BWT compression, there are known more successful solutions (see, e.g., [11]) and it will be interesting to try them out in our application. Merging fields is definitely a promising idea, and more attention is needed to find appropriate heuristics for this problem. ...

Web log files, storing user activity on a server, may grow at the pace of hundreds of megabytes a day, or even more, on popular sites. It makes sense to archive old logs, to analyze them further, e.g., for detecting attacks or other server abuse patterns. In this work we present a specialized lossless Apache web log preprocessor and test it in combination with several popular general-purpose compressors. Our method works on individual fields of log data (each storing information such as the client's IP, date/time, requested file or query, download size in bytes, etc.), and utilizes compression techniques such as finding and extracting common prefixes and suffixes, dictionary-based phrase sequence substitution, move-to-front coding, and more. The test results show the proposed transform improves the average compression ratio 2.64 times in the case of gzip and 1.83 times in the case of bzip2.
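Field-wise prefix extraction of the kind mentioned above can be sketched simply: encode each field relative to the previous value in the same column as a shared-prefix length plus a suffix (a hypothetical helper, not the authors' code):

```python
def common_prefix_split(prev, cur):
    # Encode `cur` relative to `prev` as (shared-prefix length, suffix),
    # so near-identical consecutive log fields (timestamps, paths, IPs)
    # shrink to a count plus a short tail before general compression.
    n = 0
    while n < min(len(prev), len(cur)) and prev[n] == cur[n]:
        n += 1
    return n, cur[n:]

print(common_prefix_split("/img/a.png", "/img/b.png"))
```

The transformed stream is then far more compressible by gzip or bzip2 than the raw log lines.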

... Simulation for the performance of the LZ algorithm for different buffers lengths is performed using the Calgary corpus and the Silesia Corpus (Deorowicz, 2003) [16], as shown in Fig. 2 and Fig. 3, respectively. In these experiments, the codeword is up to 2 bytes long. ...

Hardware implementation of data compression algorithms is receiving increasing attention due to exponentially expanding network traffic and digital data storage usage. Among lossless data compression algorithms for hardware implementation, the Lempel-Ziv algorithm is one of the most widely used. The main objective of this paper is to enhance the efficiency of the systolic-array approach to implementation of the Lempel-Ziv algorithm. The proposed implementation is area and speed efficient: the compression rate is increased by more than 40% and the design area is decreased by more than 30%. The effect of the selected buffer size on the compression ratio is analyzed. An FPGA implementation of the proposed design is carried out, verifying that data can be compressed and decompressed on-the-fly.

... In [18,19,20], to help reduce redundancy, the BWT was followed by a Global Structure Transformation (GST) [Ref] called the Move-To-Front (MTF) transform. We refer the reader to [13,30,31] for a detailed description of a number of possible GSTs. We adopt a new GST called Incremental Frequency Count (IFC) [16], which is paired with a Run Length Encoding (RLE) stage between the BWT and the combination of an Entropy Coding (EC) and a Fountain Coding (FC) as depicted in Figure 2. The rest of the paper is organized as follows. ...

Summary form only given. This paper presents an algorithm in a purely lossless text compression setting based on fountain codes and the Burrows-Wheeler transform (BWT). The scheme consists of five stages, each of which is briefly described in the paper. The algorithm offers encouraging compression rate performance for large files. A summary of the results of the proposed scheme and other compression schemes is provided.

... Other corpora for lossless compression were proposed and are available. Two examples are the Canterbury Corpus and the 'Large Canterbury Corpus' (Arnold & Bell, 1997) and the Silesia Compression Corpus (Deorowicz, 2003), which contains significantly larger files than both the Calgary and Canterbury corpora. 26 Bunton (1997) provides a (Calgary Corpus) comparison between ppm-c, ppm-d and ppm * (with its 'C' and 'D' versions). ...

This paper is concerned with algorithms for prediction of discrete sequences
over a finite alphabet, using variable order Markov models. The class of such
algorithms is large and in principle includes any lossless compression
algorithm. We focus on six prominent prediction algorithms, including Context
Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic
Suffix Trees (PSTs). We discuss the properties of these algorithms and compare
their performance using real life sequences from three domains: proteins,
English text and music pieces. The comparison is made with respect to
prediction quality as measured by the average log-loss. We also compare
classification algorithms based on these predictors with respect to a number of
large protein classification tasks. Our results indicate that a "decomposed"
CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in
sequence prediction tasks. Somewhat surprisingly, a different algorithm, which
is a modification of the Lempel-Ziv compression algorithm, significantly
outperforms all algorithms on the protein classification problems.

... If the recipient of the video is a human, then small changes of colours of some pixels introduced during the compression could be imperceptible. The lossy compression methods typically results in much better compression ratios than lossless algorithms (Deorowicz, 2003). ...

The increase in the amount of DNA sequences requires efficient computational algorithms for performing sequence comparison and analysis. Standard compression algorithms are not able to compress DNA sequences well because they do not consider the special characteristics of DNA sequences (i.e., DNA sequences contain several approximate repeats and complementary palindromes). Recently, new algorithms have been proposed to compress DNA sequences, often using detection of long approximate repeats. The current work proposes a Lossless Compression Algorithm (LCA), providing a new encoding method. LCA achieves a better compression ratio than existing DNA-oriented compression algorithms such as GenCompress, DNACompress, and DNAPack.

This study was performed with the aim of detecting the quality of watermelons from eight features: sound, color, root, belly button, texture, sugar rate, density, and touch, obtained from the Kaggle website. Two ranking feature selection methods, the ReliefF Ranking Filter and the Information Gain Ranking Filter, and six machine learning algorithms, Decision Table (DT), J48Tree (J48), Naïve Bayes (NB), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Random Forest (RF), have been employed in the Feature Selection and Classification Model (FS-CM) to predict the quality of this fruit. The evaluation process was conducted with the five features selected under the Information Gain Ranking filter. Accuracy and ROC area were used as evaluation metrics, and MLP with IG was selected as the best model, with the highest accuracy of 87.0813% for detecting the quality of the watermelon.

The need for text data compression in today's cloud computing era is still quite high. Text data needs to be compressed as small as possible so that it can be transmitted easily. The Burrows Wheeler Compression Algorithm (BWCA) is a non-proprietary, block-sorting text compression algorithm that is quite popular. In its pipeline, BWCA uses a preprocessing method called Global Structure Transformation (GST) to rearrange characters for better compression results. This study compares three Move-to-Front preprocessing methods: MTF, MTF-1, and MTF-2. The compression test material consists of Bible data in English, Indonesian, and Javanese, and several files from the Calgary Corpus. Since text compression is lossless and reversible, in addition to testing data compression, data decompression with the Inverse Burrows Wheeler Transform was also tested. The compression and decompression tests on both the Bible data and the Calgary Corpus were successful and show that MTF-1 gives a better compression ratio, because the total number of bits in the Huffman stage is smaller than with the other two methods.
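The MTF variants compared in this study differ only in how far a matched symbol is promoted in the list. A minimal sketch of plain MTF and MTF-1 (the alphabet handling here is illustrative, not the paper's code):

```python
def mtf(data, alphabet):
    # Classic move-to-front: output the symbol's index, then move it to front.
    table = list(alphabet)
    out = []
    for c in data:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def mtf1(data, alphabet):
    # MTF-1: a symbol found deeper than position 1 is promoted only to
    # position 1; only a symbol already at position 1 moves to the front.
    table = list(alphabet)
    out = []
    for c in data:
        i = table.index(c)
        out.append(i)
        if i == 1:
            table.insert(0, table.pop(1))
        elif i > 1:
            table.insert(1, table.pop(i))
    return out

print(mtf("cc", "abc"), mtf1("cc", "abc"))  # the variants diverge on repeats
```

MTF-1's more cautious promotion tends to keep genuinely frequent symbols near the front, which is why it can lower the bit counts in the subsequent Huffman stage.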

Data compression is commonly used in NAND flash-based Solid State Drives (SSDs) to increase their storage performance and lifetime, as it can reduce the amount of data written to and read from NAND flash memory. Software-based data compression reduces SSD performance significantly and, as such, hardware-based data compression designs are required. This paper studies a state-of-the-art lossless data compression algorithm, the LZ4 algorithm, which is one of the fastest compression algorithms reported to date. A data compression FPGA prototype based on the LZ4 lossless compression algorithm is studied. The original LZ4 compression algorithm is modified for real-time hardware implementation. Two hardware architectures of the modified LZ4 algorithm (MLZ4) are proposed with both compressors and decompressors, which are implemented on an FPGA evaluation kit. The implementation results show that the proposed compressor architecture can achieve a high throughput of up to 1.92Gbps with a compression ratio of up to 2.05, which is higher than all previous LZ algorithm designs implemented on FPGAs. The compression device can be used in high-end SSDs to further increase their storage performance and lifetime.

The paper explores a novel compression perspective of Data Mining. Frequent Pattern Mining, an important phase of Association Rule Mining is employed in the process of Huffman Encoding for Lossless Text Compression. Conventional Apriori algorithm has been refined to employ efficient pruning strategies to optimize the number of pattern(s) employed in encoding. Detailed simulations of the proposed algorithms in relation to Conventional Huffman Encoding has been done over benchmark datasets and results indicate significant gains in compression ratio.

This paper presents a lossless audio coding using Burrows-Wheeler Transform (BWT) and a combination of a Move-To-Front coding (MTF) and Run Length Encoding (RLE). Audio signals used are assumed to be of floating point values. The BWT is applied to this floating point values to get the transformed coefficients; and then these resulting coefficients are converted using the Move-to-Front coding to coefficients can be better compressed and then these resulting coefficients are compressed using a combination of the Run Length Encoding, and entropy coding. Two entropy coding are used which are Arithmetic and Huffman coding. Simulation results show that the proposed lossless audio coding method outperforms other lossless audio coding methods; using only Burrows-Wheeler Transform method, using combined Burrows-Wheeler Transform and Move-to-Front coding method, and using combined Burrows-Wheeler Transform and Run Length Encoding method.
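The RLE stage in such a BWT+MTF pipeline collapses the long runs of small values that MTF produces. A minimal run-length encoder (an illustrative sketch, not the paper's coder):

```python
def rle(seq):
    # Collapse consecutive repeats into (symbol, run-length) pairs.
    out = []
    for x in seq:
        if out and out[-1][0] == x:
            out[-1][1] += 1
        else:
            out.append([x, 1])
    return [tuple(p) for p in out]

print(rle([0, 0, 0, 1, 0, 0]))
```

The resulting pairs are then handed to the entropy coder (arithmetic or Huffman), which is where the actual bit savings are realized.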

In Huffman-encoded data a bit error may propagate arbitrarily long. This paper introduces a method for limiting such error propagation to at most L bits, L being a parameter. It is required that the decoder knows the bit number currently being decoded. The method utilizes the inherent tendency of Huffman codes to resynchronize spontaneously and does not introduce any redundancy if such a resynchronization takes place. The method is applied to parallel decoding of Huffman data and is tested on JPEG compression.

This report presents background material for the Quantized Indexing (QI) form of enumerative coding. Following the introduction to conventional enumerative coding and its reformulation as lattice walks, the relations between arithmetic and enumerative coding are explored. In addition to examining the fundamental origins of the performance differences (in speed and output size), special emphasis was placed on the distinctions in the two approaches to modeling. A general modeling pattern for enumerative coding (including QI) is described, along with the special cases for finite order Markov sources. This approach is compared to the arithmetic coder's adaptive modeling as well as to the grammar and dictionary based modeling methods.

This paper presents some aspects of radar signals acquisition in the electronic intelligence (ELINT) system, the analysis of their parameters, feature extraction using the linear Karhunen-Loeve transformation, and the application of knowledge-based techniques to the recognition of the intercepted signals. The process of final emitter identification is based on "the knowledge-based approach" which was implemented during the process of constructing the database.

We developed a fast text compression method based on multiple static dictionaries and named this algorithm STECA (Static Text Compression Algorithm). This algorithm is language dependent because of its static structure; however, it owes its speed to that structure. To evaluate the encoding and decoding performance of STECA with different languages, we selected English and Turkish, which have different grammatical structures. Compression and decompression times and compression ratio results are compared with the results of the LZW, LZRW1, LZP1, LZOP, WRT, DEFLATE (Gzip), BWCA (Bzip2) and PPMd algorithms. Our evaluation experiments show that if speed is the primary consideration, STECA is an efficient algorithm for compressing natural language texts.

Genomic sequences contain a variety of repeated structures of various lengths and types, interspersed or in tandem. Repetitive structures play an important role in molecular biology; they are related to the genetic backgrounds of inherited diseases, and they can also serve as markers for DNA mapping and DNA fingerprinting. Since biological databases keep growing in size and number there is a need for creating new tools for finding repeats in genomic sequences. This paper presents a new method for searching for tandem repeats in DNA sequences. It is based on the Burrows-Wheeler Transform (BWT), a very fast and effective data compression algorithm.

The explosive growth in biological data in recent years has led to the development of new methods to identify DNA sequences. Many algorithms have recently been developed that search DNA sequences looking for unique DNA sequences. This paper considers the application of the Burrows-Wheeler transform (BWT) to the problem of unique DNA sequence identification. The BWT transforms a block of data into a format that is extremely well suited for compression. This paper presents a time-efficient algorithm to search for unique DNA sequences in a set of genes. This algorithm is applicable to the identification of yeast species and other DNA sequence sets.

In the case of variable-length codes a single bit error may cause loss of synchronization at the decoder and thus may lead to error propagation. Even if the decoder resynchronizes after a number of bits, it may have decoded an incorrect number of symbols and may place the further decoded symbols at wrong positions. This paper describes a method for choosing a string of bits w_s such that the decoder can always recognize any insertion of w_s into the encoded message and reestablish synchronization. w_s is constructed from the shortest word that is not a substring of the encoded message. The method does not require any modification of the code.

The paper presents a new text transform algorithm suitable for embedding in compression algorithms. The strategy the new algorithm employs to increase the performance of text compression is to replace words with predefined codes. Instead of using a huge dictionary containing exhaustive word lists as in previous works, the new algorithm uses a list of stoplists and/or frequent words. The research devised different encoding schemes for such a list. It then made experiments using these schemes with different compression algorithms on standard texts. The results show that each scheme gives increased compression when used with specific compression algorithms.

In a text encoded with a Huffman code a bit error can propagate arbitrarily long. This paper introduces a method for limiting such error propagation to not more than L bits, L being a parameter of the algorithm. The method utilizes the inherent tendency of the codes to synchronize spontaneously and does not introduce any redundancy if such a synchronization takes place.
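The spontaneous resynchronization this method exploits is easy to observe with a toy prefix code (the code table here is illustrative, not from the paper): after a flipped bit the decoder misreads a short prefix, then falls back into step with codeword boundaries on its own.

```python
CODE = {"a": "0", "b": "10", "c": "11"}  # toy Huffman-style prefix code
DECODE = {v: k for k, v in CODE.items()}

def decode(bits):
    # Greedy prefix-code decoding: extend the buffer one bit at a time
    # and emit a symbol as soon as the buffer matches a codeword.
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

print(decode("001011"))   # clean stream: "aabc"
print(decode("101011"))   # first bit flipped: a short garble, then realigned
```

Here "aabc" encodes to 001011; flipping the first bit yields "bbc", and the decoder is back on codeword boundaries well before the end of the stream.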

We have investigated the synthesis and field emission properties of high-quality double-walled carbon nanotubes (DWCNTs) using a catalytic chemical vapor deposition method. For the synthesis of DWCNTs using a catalytic CVD method, we used various carbon containing molecules such as methane, ethylene, THF and propanol over Fe-Mo embedded MgO support material. The produced carbon materials using a catalytic CVD method indicated high-purity DWCNT bundles free of amorphous carbon covering on the surface. DWCNTs showed low turn-on voltage and higher emission stability compared with SWCNTs.

In this paper the performance of the Block Sorting algorithm proposed by Burrows and Wheeler is evaluated theoretically. It is proved that the Block Sorting algorithm is asymptotically optimal for stationary ergodic finite order Markov sources. Our proof is based on the facts that symbols with the same Markov state (or context) in an original data sequence are grouped together in the output sequence obtained by Burrows-Wheeler transform, and the codeword length of each group can be bounded by a function described with the frequencies of symbols included in the group.

We present an estimate of an upper bound of 1.75 bits for the entropy of characters in printed English, obtained by constructing a word trigram model and then computing the cross-entropy between this model and a balanced sample of English text. We suggest the well-known and widely available Brown Corpus of printed English as a standard against which to measure progress in language modeling and offer our bound as the first of what we hope will be a series of steadily decreasing bounds.
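Entropy bounds like this are computed by evaluating a model's probabilities on held-out text. As a toy illustration of the idea (a zeroth-order character model rather than the paper's word trigram model):

```python
import math
from collections import Counter

def entropy_per_char(text):
    # Empirical zeroth-order entropy: -sum p(c) * log2 p(c) over the
    # observed character frequencies. Richer models (trigrams, PPM, CTW)
    # condition on context and give much lower estimates for English.
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(entropy_per_char("aabb"))
```

A zeroth-order estimate for English comes out above 4 bits/char; conditioning on preceding words, as in the trigram model above, is what brings the bound down toward 1.75 bits.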

In 1970, Knuth, Pratt, and Morris [1] showed how to do basic pattern matching in linear time. Related problems, such as those discussed in [4], have previously been solved by efficient but sub-optimal algorithms. In this paper, we introduce an interesting data structure called a bi-tree. A linear time algorithm for obtaining a compacted version of a bi-tree associated with a given string is presented. With this construction as the basic tool, we indicate how to solve several pattern matching problems, including some from [4] in linear time.

Optimum off-line algorithms for the list update problem are investigated. The list update problem involves implementing a dictionary of items as a linear list. Several characterizations of optimum algorithms are given; these lead to an optimum algorithm which runs in time Θ(2^n (n − 1)! m), where n is the length of the list and m is the number of requests. The previous best algorithm, an adaptation of a more general algorithm due to Manasse et al. (1988), runs in time Θ((n!)^2 m).

Encoding data by switching between two universal data compression algorithms achieves higher rates of compression than either algorithm alone. Applied with two list updating algorithms, the technique yields higher compression of piecewise independent identically distributed (p.i.i.d.) data, such as the output of the Burrows-Wheeler transform. Introduced within is a new class of such algorithms, the Best x of 2x-1. When paired with variants of the Move-To-Front algorithm in a switching scheme, the Best x of 2x-1 algorithms achieve higher compression of p.i.i.d. data.

This paper reports the results of an empirical test of the performance of a large set of online list accessing algorithms. The algorithms' access cost performances were tested with respect to request sequences generated from the Calgary Corpus. In addition to testing access costs within the traditional dynamic list accessing model, we tested all algorithms' relative performances as data compressors via the compression scheme of Bentley et al. Some of the results are quite surprising and stand in contrast to some competitive analysis theoretical results. For example, the randomized algorithms that were tested, all attaining competitive ratio less than 2, performed consistently inferior to quite a few deterministic algorithms that obtained the best performance. In many instances the best performance was obtained by deterministic algorithms that are either not competitive or not optimal.

Preface. 1. Data Compression Systems. 2. Fundamental Limits. 3. Static Codes. 4. Minimum-Redundancy Coding. 5. Arithmetic Coding. 6. Adaptive Coding. 7. Additional Constraints. 8. Compression Systems. 9. What Next? References. Index.

Approaches to the zero-frequency problem in adaptive text compression are discussed. This problem relates to the estimation of the likelihood of a novel event occurring. Although several methods have been used, their suitability has rested on empirical evaluation rather than a well-founded model. The authors propose the application of a Poisson process model of novelty. Its ability to predict novel tokens is evaluated, and it consistently outperforms existing methods. It is applied to a practical statistical coding scheme, where a slight modification is required to avoid divergence. The result is a well-founded zero-frequency model that explains observed differences in the performance of existing methods, and offers a small improvement in the coding efficiency of text compression over the best method previously known.

The best schemes for text compression use large models to help them predict which characters will come next. The actual next characters are coded with respect to the prediction, resulting in compression of information. Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling that are suitable for use in practical text compression systems.
The strategies fall into three main classes: finite-context modeling, in which the last few characters are used to condition the probability distribution for the next one; finite-state modeling, in which the distribution is conditioned by the current state (and which subsumes finite-context modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.

We present a Pascal implementation of the one-pass algorithm for constructing dynamic Huffman codes that is described and analyzed in a companion paper. The program runs in real time; that is, the processing time for each letter of the message is proportional to the length of its codeword. The number of bits used to encode a message of t letters is less than t bits more than that used by the well-known two-pass algorithm. This is best possible for any one-pass Huffman scheme. In practice, it uses fewer bits than all other Huffman schemes. The algorithm has applications in file compression and network transmission.

A statistical technique called 'Dynamic Markov Compression' (DMC) that uses a very different approach is presented. First introduced in the late 1980s, its overall performance is not as good as that of PKZIP or other similar archiving packages. However, DMC yields a better compression ratio when applied to large binary files such as speech and image files. Because it handles one bit at a time, DMC might also be appropriate for fax machines that compress black-and-white images. Implementing DMC is simpler than any other compression scheme with a comparable compression ratio. This article presents in particular DMC's working principles.

The authors present an accessible implementation of arithmetic coding and detail its performance characteristics. The presentation is motivated by the fact that although arithmetic coding is superior in most respects to the better-known Huffman method, many authors and practitioners seem unaware of the technique. The authors start by briefly reviewing basic concepts of data compression and introducing the model-based approach that underlies most modern techniques. They then outline the idea of arithmetic coding using a simple example, and present programs for both encoding and decoding. In these programs the model occupies a separate module so that different models can easily be used. Next they discuss the construction of fixed and adaptive models and detail the compression efficiency and execution time of the programs, including the effect of different arithmetic word lengths on compression efficiency. Finally, they outline a few applications where arithmetic coding is appropriate.

A Bitplane Tree Weighting (BTW) method with arithmetic coding is proposed for lossless coding of gray scale images, which are represented with multiple bitplanes. A bitplane tree, in the same way as the context tree in the CTW method, is used to derive a weighted coding probability distribution for arithmetic coding with the first order Markov model. It is shown that the proposed method can attain better compression ratio than known schemes with MDL criterion. Furthermore, the BTW method can be extended to a high order Markov model by combining the BTW with the CTW or with prediction. The performance of these modified methods is also evaluated. It is shown that they attain better compression ratio than the original BTW method without increasing memory size and coding time, and they can beat the lossless JPEG coding.

Arithmetic coding is a data compression technique that encodes data (the data string) by creating a code string which represents a fractional value on the number line between 0 and 1. The coding algorithm is symbolwise recursive; i.e., it operates upon and encodes (decodes) one data symbol per iteration or recursion. On each recursion, the algorithm successively partitions an interval of the number line between 0 and 1, and retains one of the partitions as the new interval. Thus, the algorithm successively deals with smaller intervals, and the code string, viewed as a magnitude, lies in each of the nested intervals. The data string is recovered by using magnitude comparisons on the code string to recreate how the encoder must have successively partitioned and retained each nested subinterval. Arithmetic coding differs considerably from the more familiar compression coding techniques, such as prefix (Huffman) codes. Also, it should not be confused with error control coding, whose object is to detect and correct errors in computer operations. This paper presents the key notions of arithmetic compression coding by means of simple examples.
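The interval-partitioning recursion described above can be reproduced exactly with rational arithmetic (a didactic sketch; practical coders use fixed-precision integers with renormalization instead):

```python
from fractions import Fraction

def encode_interval(symbols, probs):
    # Successively partition [low, high) and keep the sub-interval that
    # corresponds to each encoded symbol; any number inside the final
    # interval identifies the entire data string.
    low, high = Fraction(0), Fraction(1)
    for s in symbols:
        width = high - low
        cum = Fraction(0)
        for sym, p in probs:
            if sym == s:
                high = low + (cum + p) * width
                low = low + cum * width
                break
            cum += p
    return low, high

probs = [("a", Fraction(1, 2)), ("b", Fraction(1, 2))]
print(encode_interval("ab", probs))
```

The decoder mirrors this process: it repeatedly asks which symbol's slice the code value falls into, exactly the magnitude comparison the paragraph above describes.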

At the ISIT'95 Suzuki (1) presented a context weighting algorithm that covered a more general class of sources than the context-tree weighting method, at the cost of some extra complexity. Here his algorithm will be compared to an algorithm that covers the same model class. Most modern universal source coding algorithms are based on statistical techniques. These algorithms consist of two parts. The first part is a modeling algorithm that gives a probability distribution for the next symbol of the source sequence, based on some statistics. It obtains these statistics from the previous symbols of the source sequence. The second part is an arithmetic encoder that uses this probability distribution and the next symbol to form the code sequence. The arithmetic encoder performs nearly optimally, thus the performance of the source coding algorithm depends fully on the modeling algorithm. The modeling algorithm tries to approximate the source which generated the source sequence. It assumes that the source belongs to a specific model class, and it wants to find the model within that model class that matches the source sequence best. If the source sequence has indeed been generated by a model from that model class, then the performance of the source coding algorithm could be optimal. But if the actual source differs from that model class, then this will result in some extra redundancy. Willems, Shtarkov and Tjalkens presented a universal source coding algorithm in (3). This modeling algorithm, the context-tree weighting algorithm, gives excellent performance in theory and in practice. It assumes that the model is a tree source. A tree source is a suffix tree, of which one of its leaves will be identified by the last few symbols of the source sequence seen so far (the context of the new symbol). All symbols following such a context are assumed to be independent. Thus a tree source is a suffix tree with memoryless sources in its leaves.
To estimate the probability of a memoryless (sub-)sequence, the modeling algorithms discussed in this paper use the Krichevsky-Trofimov estimator. It defines the probability of a memoryless sequence with a zeros and b ones as:
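The formula itself is missing from this excerpt; the standard Krichevsky-Trofimov block probability for a binary sequence containing a zeros and b ones is:

\[
P_e(a,b) \;=\; \frac{\prod_{i=0}^{a-1}\left(i+\tfrac{1}{2}\right)\,\prod_{j=0}^{b-1}\left(j+\tfrac{1}{2}\right)}{(a+b)!}
\]

which corresponds to the sequential update rule: after counts (a, b), the estimated probability that the next symbol is a zero is (a + 1/2)/(a + b + 1).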

The context-tree weighting algorithm (4) is a universal source coding algorithm for binary tree sources. In (2) the algorithm is modified for byte-oriented tree sources. This paper describes the context-tree branch-weighting algorithm, which can reduce the number of parameters for such sources, without increasing the complexity significantly. Contemporary universal source coding algorithms process a source sequence symbol by symbol. An adaptive modeling algorithm estimates the probability distribution of the new symbol, based on statistical information which it inferred from the previous symbols of the source sequence. The encoder and decoder use the same modeling algorithm. In the encoder, an arithmetic encoder uses the probability distribution and the actual value of the new symbol to form the code word. The decoder uses the probability distribution to decode the new symbol from the code word. Then they both update their statistics and process the next symbol. The arithmetic encoder and decoder perform nearly optimally, thus the performance of these universal source coding algorithms depends fully on the modeling algorithm. To implement the modeling algorithm efficiently, it assumes that the source belongs to a specific model class, and it wants to find the model within that model class that matches the source sequence best. But if the source sequence has not been generated by a source from that model class, then the actual model has to be approximated by a model in the model class. This often results in some extra redundancy. Section 2 describes an efficient modeling algorithm for the class of tree sources, called the context-tree weighting algorithm (4). Our main goal is to apply this algorithm to texts. But texts are not generated by tree sources. Therefore, the next section discusses an extension of the model class, and an efficient weighting algorithm for the new model class, called the context-tree branch-weighting algorithm.
Next, the new weighting algorithm and the normal context-tree weighting algorithm are combined into a single algorithm. Finally, this combined algorithm is tested on some text files. 2 The context-tree weighting algorithm The context-tree weighting algorithm assumes that the actual model is from the class of tree sources. A tree source consists of a suffix tree (the model) and a parameter vector. The suffix tree determines the state of the source, based on the context of the new symbol. The context of the new symbol is the suffix (of at most D symbols) of the source sequence seen so far. The first symbol of the context, the most recent symbol of the source sequence, will be used in the root of the suffix tree to choose a child (if

The switching method (4) is a scheme which combines two universal source coding algorithms. The two universal source coding algorithms both estimate the probability distribution of the source symbols, and the switching method allows an encoder to choose which of the two probability distributions it uses for every source symbol. The switching algorithm is an efficient weighting algorithm that uses this switching method. This paper focuses on the companion algorithm, the algorithm running in parallel to the main CTW-algorithm. 1 The switching method: A short introduction The switching method (4) defines a way in which two modeling algorithms can be combined. Consider a source sequence x_1, ..., x_N. Suppose that two sequential modeling algorithms, A and B, both run along the entire source sequence, and give for every symbol an estimate of its probability distribution. These modeling algorithms could be memoryless estimators, estimators for fixed tree models, or entire universal source coding algorithms on their own. At each moment the encoder in the switching method uses the estimate from only one of the two modeling algorithms to encode the next source symbol. The estimate of the other modeling algorithm is ignored, but the statistics of both modeling algorithms are updated. The switching method starts using modeling algorithm A. It can switch from one modeling algorithm to the other one between any two source symbols. The switching behaviour of the switching method is specified for the decoder with a transition sequence t_1, ..., t_N. If a transition symbol t_i = 1 then the encoder switched from one modeling algorithm to the other one between source symbols x_{i-1} and x_i. Otherwise t_i = 0. The transition sequence will be intertwined with the source sequence, t_1, x_1, ..., t_N, x_N, to allow sequential encoding and decoding. This combined sequence will then be encoded. In principle any encoding algorithm can be used to encode the transition sequence.

A serial file is considered in which, after a query is processed, the order in the file is changed by moving the record to which the query referred into the first place in the file. The theory of regular Markov chains is used to demonstrate the existence of EX and to show that the law of large numbers holds, where EX is the limiting average position of a queried record. A closed-form expression for EX is determined. A second method of relocation is proposed, in which the queried record exchanges positions with the record immediately before it in the file. It is conjectured that this method of relocation is at least as good as the first method. It is pointed out that, whichever method of relocation is used, if one only relocated after every Mth query, the limiting average position of a queried record is the same as it is if we relocate after every query.
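The two relocation rules can be compared empirically. The simulation below is a hedged illustration (the Zipf-like query distribution, list length, and query count are assumptions, not from the paper): it estimates the long-run average position of a queried record under move-to-front and under the exchange-with-predecessor (transpose) rule.

```python
import random

def simulate(rule, n=20, queries=50000, seed=1):
    """Average 1-based position of a queried record under a relocation rule."""
    rng = random.Random(seed)
    records = list(range(n))
    weights = [1.0 / (i + 1) for i in range(n)]   # skewed, Zipf-like queries
    total = 0
    for _ in range(queries):
        r = rng.choices(range(n), weights=weights)[0]
        pos = records.index(r)
        total += pos + 1
        if rule == "mtf":                          # move queried record to front
            records.insert(0, records.pop(pos))
        elif rule == "transpose" and pos > 0:      # swap with predecessor
            records[pos - 1], records[pos] = records[pos], records[pos - 1]
    return total / queries

m, tr = simulate("mtf"), simulate("transpose")
print(round(m, 2), round(tr, 2))
```

Both rules keep frequently queried records near the front; with enough queries the estimates settle near the limiting averages whose existence the abstract establishes.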

Several methods are presented for adaptive, invertible data compression in the style of Lempel's and Ziv's first textual substitution proposal. For the first two methods, the article describes modifications of McCreight's suffix tree data structure that support cyclic maintenance of a window on the most recent source characters. A percolating update is used to keep node positions within the window, and the updating process is shown to have constant amortized cost. Other methods explore the tradeoffs between compression time, expansion time, data structure size, and amount of compression achieved. The article includes a graph-theoretic analysis of the compression penalty incurred by our codeword selection policy in comparison with an optimal policy, and it includes empirical studies of the performance of various adaptive compressors from the literature.

A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.

An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It always has the suffix tree for the scanned part of the string ready. The method is developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries. Regardless of its quadratic worst case, this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give, in a natural way, the well-known algorithms for constructing suffix automata (DAWGs).
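The "very simple" quadratic-size suffix trie the abstract alludes to can be sketched directly (this is the naive construction, not Ukkonen's linear-time algorithm): insert every suffix into a trie of nested dictionaries, then answer substring queries by walking the trie.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a dict-based trie (O(n^2) space)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def is_substring(trie, w):
    """w is a substring of s iff w is a prefix of some suffix of s."""
    node = trie
    for ch in w:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("cacao")
print(is_substring(trie, "aca"), is_substring(trie, "cc"))
```

Every substring of the text is a prefix of some suffix, which is exactly why the trie walk decides membership.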

A new method (the ‘binary indexed tree’) is presented for maintaining the cumulative frequencies which are needed to support dynamic arithmetic data compression. It is based on a decomposition of the cumulative frequencies into portions which parallel the binary representation of the index of the table element (or symbol). The operations to traverse the data structure are based on the binary coding of the index. In comparison with previous methods, the binary indexed tree is faster, using more compact data and simpler code. The access time for all operations is either constant or proportional to the logarithm of the table size. In conjunction with the compact data structure, this makes the new method particularly suitable for large symbol alphabets.
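The decomposition described above can be rendered in a few lines. This is a minimal, hedged sketch of the binary indexed tree (1-based symbol indices assumed): each table entry covers a span determined by the lowest set bit of its index, so both update and cumulative-frequency queries cost O(log n).

```python
class BinaryIndexedTree:
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)      # 1-based; tree[0] unused

    def add(self, i, delta):
        """Add delta to the frequency of symbol i."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)              # step to the next covering node

    def cumfreq(self, i):
        """Cumulative frequency of symbols 1..i."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)              # strip the lowest set bit
        return s

bit = BinaryIndexedTree(8)
for sym, f in [(1, 3), (3, 2), (8, 5)]:
    bit.add(sym, f)
print(bit.cumfreq(3), bit.cumfreq(8))
```

The `i & (-i)` trick isolates the lowest set bit of the index, which is the "binary coding of the index" the abstract refers to.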

We review the linear-time suffix tree constructions by Weiner, McCreight, and Ukkonen. We use the terminology of the most
recent algorithm, Ukkonen's on-line construction, to explain its historic predecessors. This reveals relationships much closer
than one would expect, since the three algorithms are based on rather different intuitive ideas. Moreover, it completely explains
the differences between these algorithms in terms of simplicity, efficiency, and implementation complexity.

An OPM/L data compression scheme suggested by Ziv and Lempel, LZ77, is applied to text compression. A slightly modified version suggested by Storer and Szymanski, LZSS, is found to achieve compression ratios as good as most existing schemes for a wide range of texts. LZSS decoding is very fast, and comparatively little memory is required for encoding and decoding. Although the time complexity of LZ77 and LZSS encoding is O(M) for a text of M characters, straightforward implementations are very slow. The time consuming step of these algorithms is a search for the longest string match. Here a binary search tree is used to find the longest string match, and experiments show that this results in a dramatic increase in encoding speed. The binary tree algorithm can be used to speed up other OPM/L schemes, and other applications where a longest string match is required. Although the LZSS scheme imposes a limit on the length of a match, the binary tree algorithm will work without any limit.
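The time-consuming step the paper accelerates can be shown in isolation. The sketch below is the brute-force baseline (window and match-length limits are illustrative assumptions); the paper replaces this linear scan with a binary search tree over window strings.

```python
def longest_match(data, pos, window=4096, max_len=18):
    """Find (offset, length) of the longest match for data[pos:] inside
    the preceding window. Overlapping matches (extending past pos) are
    allowed, as in LZ77/LZSS."""
    start = max(0, pos - window)
    best_off, best_len = 0, 0
    for cand in range(start, pos):
        length = 0
        while (length < max_len and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_off, best_len = pos - cand, length
    return best_off, best_len

data = b"abcabcabcd"
print(longest_match(data, 3))
```

At position 3 the best match starts 3 bytes back and runs for 6 bytes, overlapping the current position; replacing this O(window) scan per position is what yields the paper's dramatic speedup.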

The storage complexity of the CTW method is decreased by combining the estimated probability of a node in the context tree and the weighted probabilities of its children in a single ratio.

Algorithms for encoding and decoding finite strings over a finite alphabet are described. The coding operations are arithmetic, involving rational numbers l<sub>i</sub> as parameters such that ∑<sub>i</sub> 2<sup>−l<sub>i</sub></sup> ≤ 2<sup>−ε</sup>. This coding technique requires no blocking, and the per-symbol length of the encoded string approaches the associated entropy within ε. The coding speed is comparable to that of conventional coding methods.

There has been considerable effort to prove lower bounds for the competitiveness of a randomized list update algorithm. Lower bounds of 1.18 and (by a numerical technique) 1.27 were so far the best results. In this paper we construct a randomized request sequence that no deterministic on-line algorithm can service with an expected cost less than a factor approaching 3/2 (as the list length n grows) times the off-line cost. Using a result of Yao, this establishes a new lower bound of 1.5 for the competitiveness of randomized list update algorithms.

This note shows how to maintain a prefix code that remains optimum as the weights change. A Huffman tree with nonnegative integer weights can be represented in such a way that any weight w at level l can be increased or decreased by unity in O(l) steps, preserving minimality of the weighted path length. One-pass algorithms for file compression can be based on such a representation.

Several measures are defined and investigated, which allow the comparison of codes as to their robustness against errors. Then new universal and complete sequences of variable-length codewords are proposed, based on representing the integers in a binary Fibonacci numeration system. Each sequence is fixed in advance and need not be generated anew for every probability distribution. These codes can be used as alternatives to Huffman codes when the optimal compression of the latter is not required, and simplicity, faster processing and robustness are preferred. The codes are compared on several "real-life" examples.
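A short sketch of one such code for positive integers: write n in the Fibonacci numeration system (its Zeckendorf form, least significant digit first) and append a final '1'. Because no Zeckendorf representation contains two adjacent 1s, every codeword ends in the unique pattern '11', which makes the code self-delimiting and robust to bit errors.

```python
def fib_encode(n):
    """Fibonacci code of a positive integer n, as a bit string."""
    fibs = [1, 2]
    while fibs[-1] <= n:                   # grow 1, 2, 3, 5, 8, ... past n
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs[:-1]):          # greedy Zeckendorf decomposition
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    # least significant digit first; appended '1' creates the '11' terminator
    return "".join(reversed(bits)) + "1"

for n in [1, 2, 3, 4, 11]:
    print(n, fib_encode(n))
```

For example, 11 = 8 + 3 encodes as "001011": digits for the Fibonacci numbers 1, 2, 3, 5, 8 in order, then the terminating '1'.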

In this paper we give a randomized on-line algorithm for the list update problem. Sleator and Tarjan show a deterministic algorithm, Move-to-Front, that achieves a competitive ratio of 2 for lists of length L. Karp and Raghavan show that no deterministic algorithm can beat 2 − 2/(L+1). We show that Move-to-Front in fact achieves an optimal competitive ratio of 2 − 2/(L+1). We show a randomized algorithm that achieves a competitive ratio strictly less than 2 against an oblivious adversary. This is the first randomized strategy whose competitive factor is a constant less than 2.

This work concerns the search for text compressors that compress better than existing dictionary coders, but run faster than statistical coders. We describe a new method for text compression using multiple dictionaries, one for each context of preceding characters, where the contexts have varying lengths. The context to be used is determined using an escape mechanism similar to that of prediction by partial matching (PPM) methods. We describe modifications of three popular dictionary coders along these lines and experiments evaluating their effectiveness using the text files in the Calgary corpus. Our results suggest that modifying LZ77, LZFG, and LZW along these lines yields improvements in compression of about 3%, 6%, and 15%, respectively.

A direct relationship between Shellsort and the classical "problem of Frobenius" from additive number theory is used to derive a sequence of O(log N) increments for Shellsort for which the worst case running time is O(N<sup>1+ε/√log N</sup>). The previous best-known upper bound for sequences of O(log N) increments was O(N<sup>3/2</sup>), which was shown by Pratt to be tight for a large family of sequences, including those commonly used in practice. The new upper bound is of theoretical interest because it suggests that increment sequences might exist which admit even better upper bounds, and of practical interest because the increment sequences which arise outperform those commonly used, even for random files.
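For reference, the algorithm itself is a gapped insertion sort driven by an increment sequence. The sketch below uses a Hibbard-style sequence (1, 3, 7, 15, ...) purely for illustration; it is not the Frobenius-derived sequence of the paper.

```python
def shellsort(a):
    """In-place Shellsort with a logarithmic-length increment sequence."""
    n = len(a)
    gaps, g = [], 1
    while g < n:
        gaps.append(g)
        g = 2 * g + 1                  # 1, 3, 7, 15, ... (illustrative)
    for gap in reversed(gaps):         # largest increment first
        for i in range(gap, n):        # gapped insertion sort
            x, j = a[i], i
            while j >= gap and a[j - gap] > x:
                a[j] = a[j - gap]
                j -= gap
            a[j] = x
    return a

print(shellsort([5, 2, 9, 1, 5, 6, 0]))
```

Each pass leaves the array gap-sorted, so later passes with smaller gaps move elements only short distances; the choice of increment sequence is exactly what governs the worst-case bounds discussed above.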

Lossless compression researchers have developed highly sophisticated approaches, such as Huffman encoding, arithmetic encoding, the Lempel-Ziv family, Dynamic Markov Compression (DMC), Prediction by Partial Matching (PPM), and Burrows-Wheeler Transform (BWT) based algorithms. We propose an alternative approach in this paper to develop a reversible transformation that can be applied to a source text that improves existing algorithms' ability to compress. The basic idea behind our approach is to encode every word in the input text file, which is also found in the English text dictionary that we are using, as a word in our transformed static dictionary. These transformed words give shorter length for most of the input words and also retain some context and redundancy. Thus we achieve some compression at the preprocessing stage as well as retain enough context and redundancy for the compression algorithms to give better results. Bzip2 with our proposed text transform, LIPT, gives 5.24% improvement in average BPC over Bzip2 without LIPT, and PPMD (a variant of PPM with order 5) with LIPT gives 4.46% improvement in average BPC over PPMD (with order 5) without LIPT, for a set of text files extracted from the Calgary and Canterbury corpora, and also from Project Gutenberg. Bzip2 with LIPT, although 79.12% slower than the original Bzip2 in compression time, achieves average BPC almost equal to that of original PPMD and is also 1.2% faster than the original PPMD in compression time.

The problem of constructing models of English text is considered.
A number of applications of such models including cryptology, spelling
correction and speech recognition are reviewed. The best current models
for English text have been the result of research into compression. Not
only is this an important application of such models but the amount of
compression provides a measure of how well such models perform. Three
main classes of models are considered: character based models, word
based models, and models which use auxiliary information in the form of
parts of speech. These models are compared in terms of their memory
usage and compression.

The Burrows-Wheeler Transform (also known as Block-Sorting) is at the base of compression algorithms that are the state of the art in lossless data compression. In this paper, we analyze two algorithms that use this technique. The first one is the original algorithm described by Burrows and Wheeler, which, despite its simplicity, outperforms the Gzip compressor. The second one uses an additional run-length encoding step to improve compression. We prove that the compression ratio of both algorithms can be bounded in terms of the kth order empirical entropy of the input string for any k ≥ 0. We make no assumptions on the input and we obtain bounds which hold in the worst case, that is, for every possible input string. All previous results for Block-Sorting algorithms were concerned with the average compression ratio and have been established assuming that the input comes from a finite-order Markov source.

We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algorithms date back at least to the 1960s, but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching. The first program is a sorting routine that is competitive with the most efficient string sorting programs known. The second program is a symbol table implementation that is faster than hashing, which is commonly regarded as the fastest symbol table implementation. The symbol table implementation is much more space-efficient than multiway trees, and supports more advanced searches. In many application programs, sorts use a Quicksort implementation based on an abstract compare operation, and searches use hashing or binary search trees. These do not take advantage of the properties of string keys, which are widely used in practice. Our algorithms provide a natural and elegant way to adapt classical algorithms to this important class of applications.
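The blended sort can be sketched briefly. This is a hedged, list-based rendering of multikey (three-way radix) quicksort: at depth d, strings are partitioned by their d-th character into less-than, equal, and greater-than groups, and the depth advances only within the equal group, as the paper describes. The in-place C version in the paper is of course more efficient.

```python
def multikey_sort(strs, d=0):
    """Sort a list of strings by three-way partitioning on character d."""
    if len(strs) <= 1:
        return strs
    pivot = strs[len(strs) // 2]
    pc = pivot[d] if d < len(pivot) else ""   # "" sorts before any character
    lt, eq, gt = [], [], []
    for s in strs:
        c = s[d] if d < len(s) else ""
        (lt if c < pc else gt if c > pc else eq).append(s)
    out = multikey_sort(lt, d)
    # strings exhausted at depth d (pc == "") are all identical: stop recursing
    out += eq if pc == "" else multikey_sort(eq, d + 1)
    out += multikey_sort(gt, d)
    return out

words = ["cab", "ab", "abc", "b", "ca", "abc"]
print(multikey_sort(words))
```

Only the equal group pays for comparing the next character, which is what makes the method competitive with radix sort on strings while degrading gracefully like Quicksort.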

A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they use three to five times less space. From a complexity standpoint, suffix arrays permit on-line string searches of the type "Is W a substring of A?" to be answered in time O(P + log N), where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, suffix trees can be constructed in O(N) time in the worst case, versus O(N log N) time for suffix arrays. However, we give an augmented algorithm that, regardless of the alphabet size, constructs suffix arrays in O(N) expected time, albeit with lesser space efficiency. We believe that suffix arrays will prove to be better in practice than suffix trees for many applications.
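The sort-and-search paradigm can be shown in miniature. The sketch below builds the suffix array naively (sorting the suffixes directly, which is fine for illustration but not the paper's construction) and answers substring queries by binary search; this simple form costs O(P log N) character comparisons rather than the paper's O(P + log N).

```python
def suffix_array(a):
    """Naive construction: sort suffix start positions by the suffixes."""
    return sorted(range(len(a)), key=lambda i: a[i:])

def is_substring(a, sa, w):
    """Binary search for the first suffix >= w, then test its prefix."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[sa[mid]:sa[mid] + len(w)] < w:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and a[sa[lo]:sa[lo] + len(w)] == w

text = "mississippi"
sa = suffix_array(text)
print(is_substring(text, sa, "ssip"), is_substring(text, sa, "ippi"))
```

Because the suffixes are in lexicographic order, all occurrences of W form a contiguous block of the array, so membership (and counting) reduces to binary search.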

Given that the bit is the unit of stored data, it appears impossible for codewords to occupy fractional bits. And given that
a minimum-redundancy code as described in Chapter 4 is the best that can be done using integral-length codewords, it would
thus appear that a minimum-redundancy code obtains compression as close to the entropy as can be achieved.

A recent development in text compression is a 'block sorting' algorithm which permutes the input text according to a special sort procedure and then processes the permuted text with Move-To-Front (MTF) and a final statistical compressor. The technique combines good speed with excellent compression performance. This paper investigates the fundamental operation of the algorithm and presents some improvements based on that analysis. Although block sorting is clearly related to previous compression techniques, it appears that it is best described by techniques derived from work by Shannon on the prediction and entropy of English text. A simple model is developed which relates the compression to the proportion of zeros after the MTF stage.
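The pipeline analyzed above can be shown end to end on a toy input. This hedged sketch uses a naive BWT (sorting all rotations of the string plus an end marker, not the efficient suffix-sorting used in practice) followed by MTF; the zeros in the output mark repeated symbols in the transformed text, which is the quantity the paper's model ties to compression.

```python
def bwt(s):
    """Naive Burrows-Wheeler transform via sorted rotations (O(n^2 log n))."""
    s = s + "\0"                      # unique end-of-string marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mtf(s):
    """Move-to-front recoding: each symbol becomes its rank in a list
    that is reordered by recency of use."""
    table = sorted(set(s))
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

ranks = mtf(bwt("banana"))
print(ranks)
```

Even on this tiny input, the repeated symbols that the BWT groups together ("nn", "aa") surface as zeros in the MTF output; on realistic text the proportion of zeros is far higher, which is what the final statistical coder exploits.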

Until recently, the best schemes for text compression have used variable-order Markov models, i.e. each symbol is predicted using some finite number of directly preceding symbols as a context. The recent Dynamic Markov Compression (DMC) scheme models data with Finite State Automata, which are capable of representing more complex contexts than simple Markov models. The DMC scheme builds models by ‘cloning’ states which are visited often. Because patterns can occur in English which can be recognised by a Finite State Automaton, but not by a variable-order Markov model, the use of FSA models is potentially more powerful. A class of Finite State Automata, called Finite Context Automata, is defined, which are equivalent to variable-order Markov models. For the initial models proposed, the DMC scheme is shown to have no more power than variable-order Markov models by showing that it generates only Finite Context Automata. This is verified in practice, where experiments show that the compression performance of DMC is comparable to that of existing variable-order Markov model schemes. Consequently, more sophisticated models than Finite Context Automata still need to be explored in order to achieve better text compression.

The bit-oriented finite-state model applied in Dynamic Markov Compression (DMC) is here generalized to a larger alphabet. The finite-state machine is built adaptively during compression, by applying two types of modifications to the machine structure: state cloning and shortcut creation. The machine size is kept tolerable by an escape transition mechanism. Similar to DMC, the new method is combined with arithmetic coding, based on the maintained transition frequencies. The experiments show that the new approach produces notably better compression gains for different sorts of texts in natural and formal languages. In some cases the results are better than for any compression technique found in the literature.