Conference Paper

Data compression using encrypted text

Abstract

Abstract only given. The paper presents an algorithm for text compression that exploits the properties of the words in a dictionary to produce an encryption of a given text. The basic idea is to define a unique encryption or signature for each word in the dictionary by replacing certain characters in the word with a special character "*" and retaining a few characters so that the word is still retrievable. The question is whether we can derive a signature of the text before compression such that the compressed signature uses less storage than the original compressed text. Our experimental results confirm that this is indeed possible. For any cryptic text the most frequently used character is "*", and standard compression algorithms can exploit this redundancy effectively. Our algorithm produces the best lossless compression rate reported to date in the literature. One basic assumption of our algorithm is that the system has access to a dictionary of the words used in all the texts, along with a corresponding "cryptic" dictionary. The cost of this dictionary is amortized over the compression savings for all the text files handled by the organization. If two organizations wish to exchange information using our compression algorithm, they must share a common dictionary. We used ten text files from the English text domain to test our algorithm.
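The word-signature idea in the abstract can be pictured with a short sketch. It is a minimal illustration under assumed details (the shortest-unique-prefix rule and the function names are ours, not the paper's): each dictionary word keeps just enough leading characters to stay uniquely recoverable, and the rest become "*".

```python
# Minimal sketch of the signature idea described in the abstract, under
# assumed details: each dictionary word keeps just enough leading characters
# to remain uniquely recoverable and the remaining characters become '*'.
# Function names and the prefix-retention rule are illustrative, not the paper's.

def build_cryptic_dictionary(words):
    encode, used = {}, set()
    for word in sorted(words):
        for keep in range(len(word) + 1):
            signature = word[:keep] + "*" * (len(word) - keep)
            if signature not in used:          # retain the shortest unique prefix
                encode[word] = signature
                used.add(signature)
                break
    decode = {sig: w for w, sig in encode.items()}
    return encode, decode

def star_encode(text, encode):
    return " ".join(encode.get(w, w) for w in text.split())

def star_decode(cryptic, decode):
    return " ".join(decode.get(t, t) for t in cryptic.split())

if __name__ == "__main__":
    enc, dec = build_cryptic_dictionary(
        ["the", "then", "there", "text", "compression", "compressor"])
    cryptic = star_encode("the text compression", enc)   # mostly '*' characters
    assert star_decode(cryptic, dec) == "the text compression"
```

Because "*" dominates the cryptic text, a standard backend compressor sees far more redundancy than in the original text, which is the effect the abstract describes.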


... All these transformation methods used a static dictionary in both the encoding and decoding processes. An encrypted word-based dictionary [10] was also designed and tested, and it produced better results than the un-encrypted word-based dictionary. V.K. Govindan and B.S. Shajee Mohan [6] proposed that the actual codeword consist of the length of the code concatenated with the code itself, with codewords created using the ASCII characters 33 to 250. ...
Article
In higher educational institutions, the source code of students' projects and their documentation should be submitted in both printed and electronic form. The electronic form of storage facilitates computerized processing of the documents for purposes such as plagiarism detection and future reference. Considering the higher education system, several hundred theses are added to the archive every year. The primary motivation for this paper was to reduce the storage requirements of the electronic archive of students' projects and their documentation in higher education institutions.
... The compression algorithms treat all data the same way; essentially, they all look for repetitive sequences within a smaller or larger window in order to eliminate the redundant data [4]–[7]. These generic compression algorithms can be improved if a transformation is first applied to the input data, one that emphasizes some of the redundancies appearing especially in written-language texts so that they can be detected by the compression algorithm [8], [9], [10], or that simply leaves the generic compression algorithm with less information to process. ...
Conference Paper
Full-text available
The Fixed Constraints Transform (FCT) encodes the text based on a dictionary. This dictionary is used to establish the connections between the words in the text and their corresponding transforms. The dictionary is generated once and saved in binary form for better word-indexing speed. This method is strictly designed for text compression and performs best when the text has normal formatting – in a phrase, only the first word starts with upper case and the rest continues in lower case. Because the algorithm is based on modification of the words in the text, leaving punctuation, spaces and other special characters unaltered, the algorithm's performance is governed by the ratio between the number of letters in the text and the total number of characters. FCT has performance close to other frequently used transforms – Star, Burrows-Wheeler, etc. – in terms of compression, but it has better execution speed. The lossless data compression algorithms applied for testing are: RLE (Run-Length Encoding), arithmetic coding, PPMd (Prediction by Partial Matching), BZip2, Deflate (WinZip), LZMA, and RAR. The following indicators of compression performance were measured: the time required for transform generation, the compression rate, and the time required for compression. The text files used for evaluating the performance are from the Calgary Corpus. FCT leads to compression performance close to that obtained by the usual transforms used as pre-compression methods (BWT, Star Transform and derivatives). FCT is suited for use in a chain of processors whose purpose is lossless data compression. The transform itself does not perform compression, but – most importantly – it helps a compression algorithm applied after it by eliminating some redundant information and features specific to the idioms written in a certain language. FCT is an efficient method of data processing with notable results that can be very easily implemented and used in a lossless compression chain, both for stream sequences and for files in usual applications.
... The dictionary generation operation aims to obtain the transforms and their structures, starting from an unsorted list of words [8], [9], [10]. This dictionary also serves the main function of the application: text encoding and decoding using the Fixed Constraints Transform. ...
Conference Paper
Full-text available
The paper presents an original method of text transformation based on permutations with restrictions and elimination of repetitive letters within a word, with the purpose of improving the performance of lossless compression algorithms based on a newly generated dictionary. The direct transformation algorithm and the reverse transformation algorithm are presented, as well as the impact on several compression techniques compared with other transforms used in text compression. This transformation aims to improve the performance of existing text compression algorithms, including those using neural networks. The main condition is that the transformation be reversible so that the text can be recovered after decompression.
... The average size of the English dictionary is around 0.5 MB. It can be downloaded together with the considered application files [3]. ...
Conference Paper
Full-text available
The PPM (Prediction by Partial Matching) methods are applied before the lossless compression algorithms, as a preprocessing procedure for the text, in order to exploit the redundancy efficiently. This paper presents 4 compression algorithms preceded by 3 different PPM methods. The compression ratio is studied in relation to the file size, its content, and the context order (the prediction degree). Some useful recommendations are given in order to achieve the best compression for any kind of text file, both in English and in Romanian, as well as in natural or artificial languages. The paper also refers to the benefits provided in lossless compression by using a preprocessing method to better exploit the redundancy of the source file. The procedure is known as the Star (*) Transform and is applied to text files. The algorithm is briefly presented before emphasizing its beneficial effects on a set of test files chosen from basic text corpora. Experiments were performed and some interesting conclusions were drawn on their basis.
... The transformation is meant to make file compression easier [1], [2]. The original text is fed to the transformation input, and its output, the transformed text, is then passed to an existing compression algorithm [7], [8]. Decompression uses the same methods in reverse order: decompression of the transformed text first and the inverse transform after that. ...
Article
Full-text available
The present paper refers to the benefits provided in lossless compression by using preprocessing methods to better exploit the redundancy of the source file. The following contributions are a sequel to the work entitled "Transformed Methods Used in Lossless Compression of Text Files" [20], focusing on the Length-Index Preserving Transform (LIPT). The procedures are derived from LIPT and are known as ILPT, NIT and LIT. These transforms are applied to text files. The algorithms are briefly presented before emphasizing their positive effects on a set of test files chosen from the classical text corpora. Experiments were performed and some interesting conclusions were drawn on their basis.
... Kruse and Mukherjee [13,10] propose a special case of word encoding known as star encoding. This encoding method replaces words by a symbol sequence that mostly consists of repetitions of the single symbol '*'. ...
Article
Full-text available
In this paper, we propose an approach to develop a dictionary-based reversible lossless text transformation called Improved Intelligent Dictionary Based Encoding (IIDBE). The approach is used in conjunction with BWT and can be applied to source text to improve the existing or backend algorithm's ability to compress, while also offering a sufficient level of security for the transmitted information. The basic philosophy of our secure compression is to preprocess the text and transform it into some intermediate form which can be compressed with better efficiency and which exploits the natural redundancy of the language in making the transformation. IIDBE achieves better compression at the preprocessing stage, and enough redundancy is retained for the compression algorithms to get better results. The experimental results of this compression method are also analysed. IIDBE gives 18.32% improvement over simple BWT, 8.55% improvement over BWT with *-encode, 2.28% improvement over BWT with IDBE, and about 1% over BWT with EIDBE.
... This yields compression that outperforms most of the classic algorithms. A much simpler recent algorithm proposed by Mukherjee and Franceschini called Star Compression performs a transformation using a dictionary and yields comparable performance with respect to compression [12]. Its implementation is available on-line [13]. ...
... In almost all the previous transformation techniques, an external static dictionary is used to store the frequently used words [7,16]. This paper presents an adaptive dictionary in which the lower-case versions of the words in the source file that occur more than 25 times are stored in a table-like structure ordered by frequency, and the indexes of the entries are used to refer to the dictionary words. ...
... Kruse and Mukherjee [13,10] propose a special case of word encoding known as star encoding. This encoding method replaces words by a symbol sequence that mostly consists of repetitions of the single symbol '*'. ...
Article
In this paper, we propose RIDBE (Reinforced Intelligent Dictionary Based Encoding), a dictionary-based reversible lossless text transformation algorithm. The basic philosophy of our secure compression is to preprocess the text and transform it into some intermediate form which can be compressed with better efficiency and which exploits the natural redundancy of the language in making the transformation. In RIDBE, the length of the input word is denoted by the ASCII characters 232–253 and the offset of the word in the dictionary is denoted with the letters A–Z. The existing or backend algorithm's ability to compress improves considerably when this approach is applied to source text and used in conjunction with BWT. A sufficient level of security of the transmitted information is also maintained. RIDBE achieves better compression at the preprocessing stage, and enough redundancy is retained for the compression algorithms to get better results. The experimental results of this compression method are analysed. RIDBE gives 19.08% improvement over simple BWT, 9.40% improvement over BWT with *-encode, 3.20% improvement over BWT with IDBE, 1.85% over BWT with EIDBE and about 1% over IIDBE.
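The length-plus-offset codeword scheme described above can be sketched roughly as follows; grouping words by length and numbering them in dictionary order is our reading of the description, and the helper name is illustrative rather than the authors' implementation.

```python
# Rough sketch of the length-plus-offset codeword idea described above, not
# the authors' implementation: the word length selects one byte in 232-253
# and the word's position inside its length group is written in base 26
# using the letters A-Z. Words longer than 22 characters are ignored here.

def make_codebook(dictionary_words):
    groups = {}                                    # words grouped by length, dictionary order
    for w in dictionary_words:
        if len(w) <= 22:
            groups.setdefault(len(w), []).append(w)
    codebook = {}
    for length, words in groups.items():
        for offset, w in enumerate(words):
            digits, n = "", offset
            while True:                            # base-26 offset rendered with 'A'-'Z'
                digits = chr(ord("A") + n % 26) + digits
                n //= 26
                if n == 0:
                    break
            codebook[w] = chr(231 + length) + digits
    return codebook

codebook = make_codebook(["a", "an", "at", "the", "text", "this"])
print(repr(codebook["text"]))   # length byte chr(235) followed by 'A'
```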
... Huffman's result is very well known in the literature and is in use even today. In [5], the possibility of data compression by replacing certain characters in the words with a special character * and retaining a few characters so that the word is still unambiguously retrievable was discussed. It is important to note that any compression technique of this kind must be lossless, and towards this end Ian et al. proposed a language model for representing the data efficiently through the identification of new tokens and tokens in the context of the text under consideration. ...
Article
In this paper, we revisit the classical data compression problem for domain-specific texts. It is well known that the classical Huffman algorithm is optimal with respect to prefix encoding, and the compression is done at the character level. Since many data transfers are domain specific, for example downloading of lecture notes, web blogs, etc., it is natural to think of data compression in larger dimensions (i.e. at the word level rather than the character level). Our framework employs a two-level compression scheme in which the first level identifies frequent patterns in the text using classical frequent-pattern algorithms. The identified patterns are replaced with special strings, and to achieve a better compression ratio the length of a special string is ensured to be shorter than the length of the corresponding pattern. After this transformation, we apply the classical Huffman data compression algorithm to the resultant text. In short, in the first level compression is done at the word level and in the second level it is done at the character level. Interestingly, this two-level compression technique for domain-specific text outperforms the classical Huffman technique. To support our claim, we present both theoretical and simulation results for domain-specific texts.
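As a toy illustration of the two-level idea (word-level substitution followed by character-level Huffman coding), here is a hedged sketch; the markers, threshold and function names are our assumptions, and a real implementation would tokenize the text instead of using plain string replacement.

```python
# Toy sketch of the two-level scheme: level 1 substitutes frequent words with
# shorter marker strings, level 2 applies classical character-level Huffman
# coding. Thresholds, marker format and function names are our assumptions.
import heapq
from collections import Counter

def substitute_frequent_words(text, min_count=3):
    counts = Counter(text.split())
    table = {w: "\x01%d\x02" % i                     # short stand-in strings
             for i, (w, c) in enumerate(counts.items())
             if c >= min_count and len(w) > 3}
    for word, marker in table.items():
        text = text.replace(word, marker)
    return text, table

def huffman_code(text):
    """Classical Huffman: returns a prefix code as symbol -> bit string."""
    heap = [[count, i, [sym, ""]] for i, (sym, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], tie] + lo[2:] + hi[2:])
        tie += 1
    return dict(heap[0][2:])

text = "lecture notes and lecture slides and lecture notes again " * 3
level1, table = substitute_frequent_words(text)
codes = huffman_code(level1)
bits = sum(len(codes[ch]) for ch in level1)          # size in bits; code tables not counted
```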
... A popular approach is to parse the source text into syntactic units, such as q-grams, syllables, or words, and run a compressor tailored to handle a sequence of unit identifiers [12, 11, 32, 33, 34, 15, 35, 36, 20]. Alternatively, there are various ad-hoc heuristics [37, 38, 39, 40, 41, 42] that enrich a general-purpose compressor with a preprocessing step that chooses some words or other types of syntactic units and replaces them by different kinds of identifiers, usually assigning shorter identifiers to more frequent units. ...
Article
Full-text available
Semistatic word-based byte-oriented compressors are known to be attractive alternatives to compress natural language texts. With compression ratios around 30–35%, they allow fast direct searching of compressed text. In this article, we reveal that these compressors have even more benefits. We show that most of the state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a dense-coding preprocessing achieves even better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% in typical large English texts, which was obtained only by the slow prediction by partial matching compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step. They achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.
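One concrete family of the word-based byte-oriented codes referred to above is the End-Tagged Dense Code; the sketch below shows how byte codewords can be assigned by frequency rank (our illustrative rendering of the general idea, not code from the article).

```python
# Sketch of an End-Tagged Dense Code: words ranked by frequency receive byte
# codewords whose last byte lies in 128-255 (a "stopper") and whose earlier
# bytes lie in 0-127, so codeword boundaries are self-delimiting.

def etdc_codeword(rank):
    """Byte codeword for the word with the given frequency rank (0 = most frequent)."""
    stopper = 128 + rank % 128
    rank //= 128
    prefix = []
    while rank > 0:
        rank -= 1                       # remaining ranks enumerated in base 128
        prefix.append(rank % 128)
        rank //= 128
    return bytes(reversed(prefix)) + bytes([stopper])

# The 128 most frequent words get 1-byte codes, the next 128*128 get 2 bytes, ...
print(etdc_codeword(0), etdc_codeword(127), etdc_codeword(128))
# b'\x80' b'\xff' b'\x00\x80'
```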
... Capital conversion (CC) is a well-known preprocessing technique [37, 7]. It is based on the observation that words such as compression and Compression are almost equivalent, but general compression models, e.g., plain PPM, cannot immediately recognize the similarity. ...
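A minimal sketch of capital conversion, assuming a flag byte that never occurs in the text (the flag value and function names are illustrative):

```python
# Minimal sketch of capital conversion: a capitalised word is rewritten as a
# flag byte followed by its lower-case form, so the model sees one spelling.
CAP_FLAG = "\x05"            # any character guaranteed not to occur in the text

def capital_encode(text):
    out = []
    for word in text.split(" "):
        if word[:1].isupper() and word[1:].islower():
            out.append(CAP_FLAG + word.lower())
        else:
            out.append(word)             # leave all-caps, mixed case, etc. untouched
    return " ".join(out)

def capital_decode(text):
    return " ".join(w[1:].capitalize() if w.startswith(CAP_FLAG) else w
                    for w in text.split(" "))

assert capital_decode(capital_encode("Compression and compression")) == \
       "Compression and compression"
```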
Article
One of the attractive ways to increase text compression is to replace words with references to a text dictionary given in advance. Although there exist a few works in this area, they do not fully exploit the compression possibilities or consider alternative preprocessing variants for various compressors in the latter phase. In this paper, we discuss several aspects of dictionary-based compression, including compact dictionary representation, and present a PPM/BWCA-oriented scheme, the word replacing transformation, achieving compression ratios higher by 2-6% than the state-of-the-art StarNT (2003) text preprocessor while working at greater speed. We also present an alternative scheme designed for LZ77 compressors, with the advantage over StarNT reaching up to 14% in combination with gzip.
... In the recent past, the M5 Data Compression Group, University of Central Florida (http://vlsi.cs.ucf.edu/) has developed a family of reversible Star-transformations [2, 8] which applied to a source text along with a backend compression algorithm, achieves better compression.Figure 1 illustrates the paradigm. The basic idea of the transform module is to transform the text into some intermediate form, which can be compressed with better efficiency. ...
Conference Paper
In this paper we present StarNT, a dictionary-based fast lossless text transform algorithm. With a static generic dictionary, StarNT achieves a superior compression ratio to almost all other recent efforts based on BWT and PPM. The algorithm uses a ternary search tree to expedite transform encoding. Experimental results show that the average compression time has improved by orders of magnitude compared with our previous algorithm LIPT, and the additional time overhead introduced to the backend compressor is unnoticeable. Based on StarNT, we propose StarZip, a domain-specific lossless text compression utility. Using domain-specific static dictionaries embedded in the system, StarZip achieves an average improvement in compression performance (in terms of BPC) of 13% over bzip2-9, 19% over gzip-9, and 10% over PPMD.
Book
The security of Web applications is one noteworthy component that is often overlooked in the creation of Web apps. Web application security is required for securing websites and online services against various security threats. The vulnerabilities of Web applications are for the most part the outcome of a lack of sanitization of input/output, which is frequently exploited either to misuse source code or to gain unauthorized access. An attacker can misuse vulnerabilities in an application's code. The security of Web applications is a central component of any Web-based commerce. It deals particularly with the security surrounding websites, Web applications, and Web services such as APIs. This paper gives a testing approach for vulnerability evaluation of Web applications to address the extent of security issues. We illustrate the vulnerability assessment tests on Web applications, showing how, with an aggregation of tools, vulnerability testing for Web applications can be enhanced.
Chapter
Full-text available
Nowadays, cloud systems are affected by cyber-attacks that undermine a great deal of today's societal functions and economic development. To grasp the motives behind cyber-attacks, we look at cyber-attacks as societal events related to social, economic, cultural and political factors (Kumar and Carley in 2016 IEEE Conference on Intelligence and Security Informatics (ISI). IEEE, 2016 [1]). To find factors that encourage unsafe cyber activities, we build a network of aggregate country-to-country cyber-attacks and compare it with other country-to-country networks. In this paper, we present a novel approach to detect cyber-attacks by exploiting three attacks (Zero, Hybrid and Fault) which can easily break the cyber system. In the hybrid attack, we use two attacks: a dictionary attack and a brute-force attack. Firstly, we analyze the system and then launch the attacks. We observe that higher corruption and a large Web bandwidth favor attack origination. We also notice that countries with higher per-capita GDP and better information and communication technologies (ICT) infrastructure are targeted more often (Kumar and Carley in 2016 IEEE Conference on Intelligence and Security Informatics (ISI). IEEE, 2016 [1]).
Chapter
Cryptography is a technique of protecting data from unauthorized access using an encryption process. Encryption converts plain text into cipher text, which is in non-readable form. Past studies suggest that the size of the cipher text is the major concern that hinders users from adopting encryption methods to secure their data. This work involves an experimental study exploring the effect of data compression on an amino-acid form of cipher text using dictionary-based methods as well as entropy coding methods, without compromising security. The compression ratio is measured for different file sizes. The results show that 47% storage savings are achieved using the entropy coding method and 60% using the dictionary-based coding method. Owing to this, storage efficiency is also doubled. The advantage of this method is thus demonstrated, and it provides improved data storage.
Keywords: Data compression, Encryption, Huffman coding, LZMA, Security
Chapter
A method of lossless data compression on various data formats using predefined algorithms and machine learning is presented in this chapter. The proposed system makes use of predefined algorithms and a neural network for encoding and decoding. As it uses a machine learning (sequence-to-sequence) model, the system is adaptive and the compression ratio keeps improving over time with more and more data. The use of machine learning ensures that different file formats can be compressed within the same system. Compressed data obtained after applying the predefined algorithms is used as output labels for the ML model, and the original data serves as the input labels. For decompression, the compressed data is used as input and the original data as output labels. The system can be deployed as a cloud-based application in which the user can input data and store it in a cloud database. The proposed system achieves 74% accuracy on a batch comprising 50,000 training samples. Higher accuracy can be achieved by training the model on more data.
Keywords: Machine learning, Lossless data compression, Predefined algorithm, Sequence to sequence, Encoder, Decoder
Article
Full-text available
Security in terms of networks has become more significant to organizations, the military, and personal computer users, since data faces various kinds of threats from the moment it is sent by the sender over the Internet until it reaches the receiver. Here we focus on SSL, a technique used to provide client and server authentication, data confidentiality and data integrity. It transforms our data into an unintelligible form; the data we send can be in text or non-text form. By encrypting our data we can protect it from attacks such as eavesdropping, in which communication is intercepted by an unauthorized person who can either listen in or add malicious information to our data, which can lead to catastrophic results. This technique of secure data transmission is very useful in securing the integrity of data sent by unmanned aerial vehicles in military applications as well as by commercially used electricity meters. Since the above-mentioned devices use microcontrollers to send data through the Internet, this data is always susceptible to the above-mentioned threats, so it is important to ensure that it does not fall into the wrong hands; our objective is that the data our microcontroller sends to a remote location has authenticity, confidentiality and integrity. First we send some meaningful text already stored in the controller of an STM3240G Eval board, then that data is sent to the server. These encrypted packets are sent to a remote server through Ethernet. At the receiver end this data is received and decrypted to recover the original data.
Article
Data compression algorithms are used to reduce redundancy and storage requirements and to efficiently reduce communication costs. Data encryption is used to protect data from unauthorized users. Due to the unprecedented explosion in the amount of digital data transmitted via the Internet, it has become necessary to develop a compression algorithm that can effectively use the available network bandwidth while also taking into account the security aspects of the compressed data transmitted over the Internet. This paper presents an encoding technique that offers high compression ratios. An intelligent and reversible transformation technique is applied to source text to improve the capability of algorithms to compress the transmitted data. The results prove that the proposed method performs better than many other popular techniques with respect to compression ratio and the speed of compression, and requires some additional processing on the server/nodes.
Conference Paper
In this paper we present an approach which is an alternative to compression algorithms in vogue such as Huffman encoding, arithmetic encoding, the Lempel-Ziv family, Dynamic Markov Compression (DMC), Prediction by Partial Matching (PPM), and Burrows-Wheeler Transform (BWT) based algorithms. The aim is to develop generic, reversible transformations that can be applied to a source text to improve an existing, or backend, algorithm's ability to compress. In this connection, we present two lossless, reversible transformations, namely Enhanced Intelligent Dictionary Based Encoding (EIDBE) and Improved Intelligent Dictionary Based Encoding (IIDBE). We then provide experimental results for files chosen from the classical text corpus.
Conference Paper
Full-text available
This paper presents a study of transform methods used in lossless text compression to preprocess the text by exploiting the inner redundancy of the source file. The transform methods are derived from the star (*) transform. LIPT, ILPT, NIT, and LIT, applied to text files, demonstrate their positive effects on a set of test files picked from the classical corpora of both English and Romanian texts. Experiments and comparisons with universal lossless compressors were performed, and a set of interesting conclusions and recommendations are drawn on their basis.
Article
Intelligent Dictionary Based Encoding (IDBE) [18] is an encoding strategy that offers higher compression ratios and rates of compression. Transforming text into some intermediate form by using IDBE is the basic philosophy of this compression technique. It is observed that better compression is achieved by using IDBE as the preprocessing stage for a BWT-based compressor. This paper aims at developing a new and advanced technique called "Enhanced Intelligent Dictionary Based Encoding (EIDBE)". The approach is used in conjunction with BWT and is a dictionary-based lossless reversible transformation that can be applied to a source text to improve the existing or backend algorithm's ability to compress. EIDBE achieves better compression at the preprocessing stage, and enough redundancy is retained for the compression algorithms to get better results. The experimental results of this compression method are also analysed.
Article
The basic objective of a data compression algorithm is to reduce the redundancy in data representation so as to decrease the data storage requirement. Data compression also provides an approach to reduce communication cost by effectively utilizing the available bandwidth. Data compression becomes important as file storage becomes a problem. In general, data compression consists of taking a stream of symbols and transforming them into codes. If the compression is effective, the resulting stream of codes will be smaller than the original symbols.
Article
Data compression algorithms are used to reduce the redundancy and storage requirements of data. Data compression is also an efficient approach to reducing communication costs by using the available bandwidth effectively. Over the last decade we have seen an unprecedented explosion in the amount of digital data transmitted via the Internet in the form of text, images, video, sound, computer programs, etc. If this trend continues, it will be necessary to develop compression algorithms that can most effectively use the available network bandwidth by compressing the data to the maximum level. It will also be important to consider the security aspects of the compressed data transmitted over the Internet, as most of the text data transmitted over the Internet is very vulnerable to attack. So, we present an intelligent, reversible transformation technique that can be applied to source text to improve an algorithm's ability to compress and also offer a sufficient level of security for the transmitted data.
Chapter
In the last 20 years, we have seen a vast explosion of textual information flowing over the Web through electronic mail, Web browsing, information retrieval systems, and so on. The importance of data compression is likely to grow in the future, as there is a continuous increase in the amount of data that needs to be transmitted or archived. In the field of data compression, researchers have developed various approaches such as Huffman encoding, arithmetic encoding, the Ziv–Lempel family, dynamic Markov compression, prediction by partial matching (PPM) [1] and Burrows–Wheeler transform (BWT) [2] based algorithms, among others. BWT permutes the symbols of a data sequence that share the same unbounded context by cyclic rotation followed by lexicographic sort operations. BWT uses move-to-front and an entropy coder as the backend compressor. PPM is slow and also consumes a large amount of memory to store context information, but it achieves better compression than almost all existing compression algorithms.
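To make the BWT plus move-to-front pipeline mentioned above concrete, here is a small, deliberately naive sketch (rotation sorting is quadratic and is for illustration only; practical BWT implementations use suffix arrays, and the end-of-string marker choice is an assumption).

```python
# Naive sketch of the BWT + move-to-front pipeline described above.

def bwt(s, end="\x00"):
    s += end                                           # unique end-of-string marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def move_to_front(s):
    alphabet = sorted(set(s))
    out = []
    for ch in s:
        i = alphabet.index(ch)
        out.append(i)                                  # repeated symbols become runs of zeros
        alphabet.insert(0, alphabet.pop(i))
    return out

print(move_to_front(bwt("banana")))                    # small indices dominate the output
```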
Article
In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. To be able to store, manage, organize and transport the data efficiently, text compression is necessary. We also need efficient search engines to speedily find the information from this huge mass of data, especially when it is compressed. In this chapter, we present a review of text compression algorithms, with particular emphasis on the LZ family algorithms, and present our current research on the family of Star compression algorithms. We discuss ways to search the information from its compressed format and introduce some recent work on compressed domain pattern matching, with a focus on a new two-pass compression algorithm based on LZW algorithm. We present the architecture of a compressed domain search and retrieval system for archival information and indicate its suitability for implementation in a parallel and distributed environment using random access property of the two-pass LZW algorithm.
Conference Paper
In this paper we present the transformation of English text into an intermediate form which is well suited for reducing the compression ratio. The approach for transforming a text is based on a model of the natural distribution of words. The transformation generates a static dictionary which is small in size and helps increase the redundancy in the text. Experimental results show that the compression ratio obtained by compressing the transformed text is much better than the compression ratio obtained by directly compressing the text using some of the well-known existing algorithms.
Conference Paper
Lossless compression researchers have developed highly sophisticated approaches, such as Huffman encoding, arithmetic encoding, the Lempel-Ziv family, Dynamic Markov Compression (DMC), Prediction by Partial Matching (PPM), and Burrows-Wheeler Transform (BWT) based algorithms. We propose an alternative approach in this paper: a reversible transformation that can be applied to a source text and improves an existing algorithm's ability to compress. The basic idea behind our approach is to encode every word in the input text file that is also found in the English text dictionary we are using as a word in our transformed static dictionary. These transformed words give shorter lengths for most of the input words and also retain some context and redundancy. Thus we achieve some compression at the preprocessing stage as well as retaining enough context and redundancy for the compression algorithms to give better results. Bzip2 with our proposed text transform, LIPT, gives 5.24% improvement in average BPC over Bzip2 without LIPT, and PPMD (a variant of PPM with order 5) with LIPT gives 4.46% improvement in average BPC over PPMD (with order 5) without LIPT, for a set of text files extracted from the Calgary and Canterbury corpora, and also from Project Gutenberg. Bzip2 with LIPT, although 79.12% slower than the original Bzip2 in compression time, achieves average BPC almost equal to that of original PPMD and is also 1.2% faster than the original PPMD in compression time.
Article
Lossless data compression researchers have developed highly sophisticated approaches, such as Huffman encoding, arithmetic coding, the Lempel-Ziv family, prediction by partial matching and Burrows-Wheeler transform based algorithms. One approach for attaining better compression is to develop a generic, reversible transformation that can be applied to a source text and improves an existing compression algorithm's ability to compress. A few reversible transformation techniques that give better compression ratios are presented. A method which transforms a text file into an intermediate file with minimum possible byte values is proposed. An attempt has been made to reduce the number of possible bytes that appear after every byte in the source file. This increases the backend algorithm's compression performance.
Article
In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate. We need ways to manage, organize and transport this data from one point to the other on data communications links with limited bandwidth. We must also have means to speedily find the information we need from this huge mass of data. Sometimes, a single site may also contain large collections of data such as a library database, thereby requiring an efficient search mechanism even to search within the local data. To facilitate information retrieval, an emerging ad hoc standard for uncompressed text is XML, which preprocesses the text by adding user-defined metadata such as DTDs or hyperlinks to enable searching with better efficiency and effectiveness. This increases the file size considerably, underscoring the importance of applying text compression. On account of efficiency (in terms of both space and time), there is a need to keep the data in compressed form for as long as possible. Text compression is concerned with techniques for representing digital text data in alternate representations that take less space. Not only does it help conserve storage space for archival and online data, it also helps system performance by requiring fewer secondary storage (disk or CD-ROM) accesses and improves network transmission bandwidth utilization by reducing the transmission time. Unlike static images or video, there is no international standard for text compression, although compressed formats like .zip, .gz and .Z files are increasingly being used. In general, data compression methods are classified as lossless or lossy. Lossless compression allows the original data to be recovered exactly. Although used primarily for text data, lossless compression algorithms are useful in special classes of images such as medical imaging, fingerprint data, astronomical images and databases containing mostly vital numerical data, tables and text information. Many lossy algorithms use lossless methods at the final stage of encoding, underscoring the importance of lossless methods for both lossy and lossless compression applications. In order to be able to effectively utilize the full potential of compression techniques for future retrieval systems, we need efficient information retrieval in the compressed domain. This means that techniques must be developed to search the compressed text without decompression, or only with partial decompression, independent of whether the search is done on the text or on some inversion table corresponding to a set of key words for the text.
In this dissertation, we make the following contributions:
(1) Star family compression algorithms: We have proposed an approach to develop a reversible transformation that can be applied to a source text and improves an existing algorithm's ability to compress. We use a static dictionary to convert the English words into predefined symbol sequences. These transformed sequences create additional context information that is superior to the original text. Thus we achieve some compression at the preprocessing stage. We have a series of transforms which improve the performance. The Star transform requires a static dictionary of a certain size.
To avoid the considerable complexity of conversion, we employ the ternary tree data structure that efficiently converts the words in the text to the words in the star dictionary in linear time.
(2) Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) files: We proposed a method to extract the useful context information in linear time from the BWT transformed text. The auxiliary arrays obtained from the BWT inverse transform bring logarithmic search time. Meanwhile, approximate pattern matching can be performed based on the results of exact pattern matching to extract the possible candidates for approximate pattern matching. A fast verification algorithm can then be applied to those candidates, which may be just small parts of the original text. We present algorithms for both k-mismatch and k-approximate pattern matching in BWT compressed text. A typical compression system based on BWT has Move-to-Front and Huffman coding stages after the transformation. We propose a novel approach to replace the Move-to-Front stage in order to extend compressed-domain search capability all the way to the entropy coding stage. A modification to the Move-to-Front makes it possible to randomly access any part of the compressed text without referring to the part before the access point.
(3) Modified LZW algorithm that allows random access and partial decoding for compressed text retrieval: Although many compression algorithms provide good compression ratio and/or time complexity, LZW is the first one studied for compressed pattern matching because of its simplicity and efficiency. Modifications to the LZW algorithm provide the extra advantage of fast random access and partial decoding, which is especially useful for text retrieval systems. Based on this algorithm, we can provide a dynamic hierarchical semantic structure for the text, so that the text search can be performed at the expected level of granularity. For example, a user can choose to retrieve a single line, a paragraph, or a file, etc., that contains the keywords. More importantly, we show that parallel encoding and decoding are trivial with the modified LZW. Both encoding and decoding can be performed easily with multiple processors, and the encoding and decoding processes are independent of the number of processors.
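As background for the modified-LZW contribution described above, the following is a plain LZW encoder sketch; the random-access and partial-decoding modifications from the dissertation are not shown, and the function name is ours.

```python
# Plain LZW encoder, shown only as background for the modified-LZW discussion.

def lzw_encode(data: bytes):
    table = {bytes([i]): i for i in range(256)}        # initial single-byte phrases
    w, out = b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                                     # extend the current phrase
        else:
            out.append(table[w])
            table[wc] = len(table)                     # register the new phrase
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

print(lzw_encode(b"abababab"))                         # repeated phrases collapse to codes
```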
Conference Paper
Compression algorithms reduce the redundancy in data representation to decrease the storage required for that data. Data compression offers an attractive approach to reducing communication costs by using available bandwidth effectively. There has been an unprecedented increase in the amount of digital data transmitted via networks, especially through the Internet and mobile cellular networks, over the last decade. Digital data represent text, images, video, sound, computer programs, etc. With this trend expected to continue, it makes sense to pursue research on developing algorithms that can most effectively use available network bandwidth by maximally compressing data. A strategy called intelligent dictionary based encoding (IDBE), used in conjunction with the Burrows Wheeler Transform (BWT), is discussed to achieve this. It has been observed that preprocessing the text prior to conventional compression improves the compression efficiency considerably. The intelligent dictionary based encoding also provides some level of confidentiality. The experimental results of this compression method are also analyzed.
Conference Paper
Summary form only given. StarZip, a multi-corpus text compression system, is introduced together with its transform engine StarNT. One of the key features of the StarZip compression system is the development of domain-specific dictionaries and of tools to build such dictionaries. StarNT was chosen because it achieves a superior compression ratio to almost all other recent efforts based on BWT and PPM. StarNT is a dictionary-based fast lossless text transform. The main idea is to recode each English word with a representation of no more than three symbols. This transform maintains most of the original context information at the word level and provides an "artificial" strong context. It ultimately reduces the size of the transformed text that, in turn, is provided to a backend compressor. The data structure provides very fast transform encoding with a low storage overhead. StarNT also treats the transformed codewords as offsets of words in the transform dictionary, which keeps the time for locating a word in the dictionary low in the transform decoder. Experimental results have shown that the average compression time has improved by orders of magnitude compared with the previous dictionary-based transform LIPT. The complexity and compression performance of bzip2, in conjunction with this transform, is better than both gzip and PPMD. Results from five corpora have shown that StarZip achieved an average improvement in compression performance (in terms of BPC) of 13% over bzip2-9, 19% over gzip-9, and 10% over PPMD.
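The at-most-three-symbols-per-word mapping can be pictured with a short sketch; the codeword ordering is our assumption, and a plain dict stands in for the ternary search tree used in StarNT.

```python
# Illustrative sketch of a StarNT-like mapping: dictionary words (ideally
# ordered most-frequent first) receive codewords of one to three letters
# over [a-zA-Z]; a plain dict stands in for the paper's ternary search tree.
import itertools
import string

LETTERS = string.ascii_lowercase + string.ascii_uppercase      # 52 symbols

def starnt_codewords(dictionary_words):
    def codes():
        for length in (1, 2, 3):                               # 52 + 52**2 + 52**3 codewords
            for combo in itertools.product(LETTERS, repeat=length):
                yield "".join(combo)
    return dict(zip(dictionary_words, codes()))

codes = starnt_codewords(["the", "of", "and", "compression", "algorithm"])
print(codes["the"], codes["compression"])                      # 'a' 'd'
```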
Conference Paper
We propose an approach to develop a dictionary-based reversible lossless text transformation, called LIPT (Length Index Preserving Transform), which can be applied to a source text to improve the existing algorithm's ability to compress. In LIPT, the length of the input word and the offset of the word in the dictionary are denoted with letters. Our encoding scheme makes use of the recurrence of same-length words in the English language to create context in the transformed text that the entropy coders can exploit. LIPT also achieves some compression at the preprocessing stage and retains enough context and redundancy for the compression algorithms to give better results. Bzip2 with LIPT gives 5.24% improvement in average BPC over Bzip2 without LIPT, and PPMD with LIPT gives 4.46% improvement in average BPC over PPMD without LIPT, for our test corpus.
Article
Full-text available
Several preprocessing algorithms for text files are presented which complement each other and which are performed prior to the compression scheme. The algorithms need no external dictionary and are language independent. The compression gain is compared, along with the cost in speed, for the BWT, PPM, and LZ compression schemes. The average overall compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 and 9 percent for the text files of the large Canterbury Corpus.
Article
Several methods are presented for adaptive, invertible data compression in the style of Lempel's and Ziv's first textual substitution proposal. For the first two methods, the article describes modifications of McCreight's suffix tree data structure that support cyclic maintenance of a window on the most recent source characters. A percolating update is used to keep node positions within the window, and the updating process is shown to have constant amortized cost. Other methods explore the tradeoffs between compression time, expansion time, data structure size, and amount of compression achieved. The article includes a graph-theoretic analysis of the compression penalty incurred by our codeword selection policy in comparison with an optimal policy, and it includes empirical studies of the performance of various adaptive compressors from the literature.
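For readers unfamiliar with the sliding-window textual substitution being analysed, the sketch below shows the core idea using a brute-force window scan; the suffix-tree window maintenance discussed in the article is deliberately omitted, and the token format is an assumption.

```python
# Brute-force sliding-window match finder in the spirit of LZ77-style
# textual substitution; a linear scan replaces the article's suffix tree.

def lz77_tokens(data: bytes, window: int = 4096):
    i, tokens = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):          # scan the window for the longest match
            length = 0
            while (i + length < len(data) and j + length < i
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= 3:
            tokens.append((best_off, best_len))         # (distance, length) copy token
            i += best_len
        else:
            tokens.append(data[i])                      # literal byte
            i += 1
    return tokens

print(lz77_tokens(b"abcabcabcx"))                       # literals followed by copy tokens
```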
Conference Paper
Text compression algorithms are normally defined in terms of a source alphabet Sigma of 8-bit ASCII codes. The authors consider choosing Sigma to be an alphabet whose symbols are the words of English or, in general, alternate maximal strings of alphanumeric characters and nonalphanumeric characters. The compression algorithm would be able to take advantage of longer-range correlations between words and thus achieve better compression. The large size of Sigma leads to some implementation problems, but these are overcome to construct word-based LZW, word-based adaptive Huffman, and word-based context modelling compression algorithms.
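The word-based alphabet can be obtained with a one-line tokenizer; the regular expression below is our rendering of "alternate maximal strings of alphanumeric and nonalphanumeric characters".

```python
# Sketch of the word-based alphabet: the source is split into alternating
# maximal runs of alphanumeric and non-alphanumeric characters, and each
# distinct run becomes one symbol of the (large) alphabet Sigma.
import re

def word_tokens(text):
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)

tokens = word_tokens("Text compression, at word level!")
# ['Text', ' ', 'compression', ', ', 'at', ' ', 'word', ' ', 'level', '!']
alphabet = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
symbol_stream = [alphabet[t] for t in tokens]   # what a word-based LZW/Huffman/PPM would model
```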
Article
We discuss representations of prefix codes and the corresponding storage space and decoding time requirements. We assume that a dictionary of words to be encoded has been defined and that a prefix code appropriate to the dictionary has been constructed. The encoding operation becomes simple given these assumptions and given an appropriate parsing strategy, therefore we concentrate on decoding. The application which led us to this work constrains the use of internal memory during the decode operation. As a result, we seek a method of decoding which has a small memory requirement.
Data compression is an important and much-studied problem. Compressing data to be stored or transmitted can result in significant improvements in the use of computing resources. The degree of improvement that can be achieved depends not only on the selection of a data compression method, but also on the characteristics of the particular application. That is, no single data compression algorithm wi...
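A minimal prefix-code decoder in the spirit of this discussion follows; the code table is invented purely for illustration.

```python
# Minimal prefix-code decoder: bits are accumulated until they match a
# codeword, which the prefix property makes unambiguous. The code table
# below is invented purely for illustration.
CODE_TO_WORD = {"0": "the", "10": "data", "110": "compression", "111": "of"}

def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in CODE_TO_WORD:          # first match is the symbol (prefix property)
            out.append(CODE_TO_WORD[buf])
            buf = ""
    return out

assert decode("0110111010") == ["the", "compression", "of", "the", "data"]
```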