Conference Paper

General-purpose compression scheme for databases

Abstract

Summary form only given. Current adaptive compression schemes such as GZIP and COMPRESS are impractical for database compression as they do not allow random access to individual records. A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaining compact storage of data. The SEQUITUR algorithm of Nevill-Manning et al. (1994, 1996, 1997) also adaptively compresses data, achieving excellent compression but with significant main-memory requirements. A preliminary version of SEQUITUR used a semi-static modelling approach to achieve slightly worse compression than the adaptive approach. We describe a new variant of the semi-static SEQUITUR algorithm, RAY, that reduces main-memory use and allows random access to databases. RAY models repetition in sequences by progressively constructing a hierarchical grammar with multiple passes through the data. The multiple-pass approach of RAY uses statistics on character-pair repetition, or digram frequency, to create rules in the grammar. While our preliminary implementation is not especially fast, the multi-pass approach permits reductions in compression time, at the cost of some compression performance, by limiting the number of passes. We have found that RAY has practicable main-memory requirements and achieves better compression than an efficient Huffman scheme and popular adaptive compression techniques. Moreover, our scheme allows random access to data and is not restricted to databases of text.
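
As a concrete illustration of the multi-pass, digram-frequency idea, the sketch below replaces the most frequent adjacent pair with a fresh grammar rule on each pass until no pair repeats. This is a simplification supplied for exposition only, not the authors' RAY implementation; names such as one_pass and build_grammar are ours.

from collections import Counter

def one_pass(seq, next_symbol, min_count=2):
    """Replace the most frequent digram in seq with a new rule, if any repeats."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return seq, None
    digram, count = counts.most_common(1)[0]
    if count < min_count:
        return seq, None
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == digram:
            out.append(next_symbol)          # non-terminal standing for the digram
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out, (next_symbol, digram)

def build_grammar(text, max_passes=50):
    """Run digram-replacement passes; capping max_passes trades compression for speed."""
    seq, rules = list(text), {}
    for n in range(max_passes):
        seq, rule = one_pass(seq, "R%d" % n)
        if rule is None:
            break
        rules[rule[0]] = rule[1]
    return seq, rules

if __name__ == "__main__":
    body, grammar = build_grammar("abcabcabcab")
    print(body)      # ['R2', 'R1', 'R0']
    print(grammar)   # {'R0': ('a', 'b'), 'R1': ('R0', 'c'), 'R2': ('R1', 'R1')}

Capping max_passes corresponds to the trade-off noted in the abstract: fewer passes mean faster compression but a flatter, less compact grammar.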

... Compression with grammars is on a par with the other forms of dictionary compression, but it is also semi-static, which is a clear advantage in connection with structure-preserving compression. For example, the RAY compression algorithm [CWZ99], which was developed for databases and would be natural to combine with order-preserving compression as mentioned in Section 2.1.2, is a form of compression with binary grammars. ...
... Many of the techniques surveyed in those papers are compression methods that deal only with text rather than a diversity of data types. One particular method that does consider different data types is the RAY algorithm described by Cannane, Williams and Zobel (Cannane, Williams & Zobel 1999). However, data compression is a large area of research and it is orthogonal to the work outlined here. ...
Article
Full-text available
In mobile computing environments, as a result of the reduced capacity of local storage, it is commonly not feasible to replicate entire datasets on each mobile unit. In addition, reliable, secure and economical access to central servers is not always possible. Moreover, since mobile computers are designed to be portable, they are also physically small and thus often unable to hold or process the large amounts of data held in centralised databases. As many systems are only as useful as the data they can process, the support provided by database and system management middleware for applications in mobile environments is an important driver for the uptake of this technology by application providers and thus also for the wider use of the technology.
Article
Full-text available
Limited storage capacity and slow access times are among the main problems of a Database Management System (DBMS). In this paper, we compare storage and access time between the columnar multi-block vector structure (CMBVS) and the Oracle 9i server. The experimental results show that CMBVS is about 31 times more efficient in storage cost and 21-70 times faster in retrieval time than the Oracle 9i server.
Conference Paper
Reducing both the compressed size of, and the time to retrieve data from, large collections is important in any computation involving a human interface. In this paper, a new general-purpose compression scheme is proposed that can be applied to all types of data stored in large collections. The paper presents a fast compression and decompression technique for natural language texts. The technique used in compressing text allows searching for a phrase in compressed form without decompressing the file. The algorithm suggested here uses a static dictionary in matrix form, which reduces compression and decompression time. The memory requirement of the algorithm is almost negligible.
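
The key property claimed above, searching for a phrase without decompressing, can be illustrated with a toy static-dictionary coder: encode the query with the same dictionary and scan the code stream directly. This is a generic sketch of the idea, not the paper's matrix-based dictionary; all names here are illustrative.

def build_dictionary(words):
    """Assign each distinct word a fixed integer code."""
    return {w: i for i, w in enumerate(sorted(set(words)))}

def compress(words, dictionary):
    return [dictionary[w] for w in words]

def search_compressed(compressed, phrase, dictionary):
    """Return start offsets of the phrase, matching codes instead of text."""
    try:
        target = [dictionary[w] for w in phrase]
    except KeyError:
        return []          # a query word absent from the dictionary cannot match
    n, m = len(compressed), len(target)
    return [i for i in range(n - m + 1) if compressed[i:i + m] == target]

if __name__ == "__main__":
    text = "to be or not to be".split()
    d = build_dictionary(text)
    enc = compress(text, d)
    print(search_compressed(enc, ["to", "be"], d))   # [0, 4]
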
Article
Compression of large collections can lead to improvements in retrieval times by offsetting the CPU decompression costs with the cost of seeking and retrieving data from disk. We propose a semistatic phrase-based approach called xray that builds a model offline using sample training data extracted from a collection, and then compresses the entire collection online in a single pass. The particular benefits of xray are that it can be used in applications where individual records or documents must be decompressed, and that decompression is fast. The xray scheme also allows new data to be added to a collection without modifying the semistatic model. Moreover, xray can be used to compress general-purpose data such as genomic, scientific, image, and geographic collections without prior knowledge of the structure of the data. We show that xray is effective on both text and general-purpose collections. In general, xray is more effective than the popular gzip and compress schemes, while being marginally less effective than bzip2. We also show that xray is efficient: of the popular schemes we tested, it is typically only slower than gzip in decompression. Moreover, the query evaluation cost of retrieving documents from a large collection with our search engine is improved by more than 30% when xray is incorporated compared to an uncompressed approach. We use simple techniques for obtaining the training data from the collection to be compressed and show that with just over 4% of data the entire collection can be effectively compressed. We also propose four schemes for phrase-match selection during the single pass compression of the collection. We conclude that with these novel approaches xray is a fast and effective scheme for compression and decompression of large general-purpose collections.
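
The xray workflow, training a model offline on a small sample and then compressing the whole collection in one online pass, can be sketched roughly as follows. This is a hypothetical simplification of the approach described above (greedy longest-match against a fixed phrase set), not the published algorithm, and the function names are ours.

from collections import Counter

def train_model(sample, max_len=4, min_count=3, max_phrases=1024):
    """Collect frequent substrings of the training sample as the static phrase model."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(sample) - n + 1):
            counts[sample[i:i + n]] += 1
    frequent = [p for p, c in counts.most_common() if c >= min_count][:max_phrases]
    # Codes 0..255 are reserved for literal bytes; phrase codes follow.
    return {p: 256 + i for i, p in enumerate(frequent)}

def compress(data, model, max_len=4):
    """Single pass: emit a phrase code on the longest match, else a literal code."""
    out, i = [], 0
    while i < len(data):
        for n in range(min(max_len, len(data) - i), 1, -1):
            code = model.get(data[i:i + n])
            if code is not None:
                out.append(code)
                i += n
                break
        else:
            out.append(data[i])        # literal byte value 0..255
            i += 1
    return out

if __name__ == "__main__":
    collection = b"abracadabra abracadabra abracadabra"
    sample = collection[:16]           # train on a small sample only
    model = train_model(sample)
    print(len(compress(collection, model)), "codes for", len(collection), "bytes")
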
Article
Full-text available
In mobile computing, issues such as limited resources, network capacities and organisational constraints may cause the complete replication of large databases on a mobile device to be infeasible. At the same time, some on-board storage of data is attractive as communication to the main database can be inconsistent. Thus, as the emphasis on application mobility increases, data summarisation offers a useful solution to improving response times and the availability of data. These summarisation techniques can also be of benefit to distributed databases, particularly those with mobile components or where the profile of the transaction load varies significantly over time. This paper surveys summarisation techniques used for mobile distributed databases. It also surveys the manner in which database functionality is maintained in mobile database systems, including query processing, data replication, concurrency control, transaction support and system recovery.
Article
Full-text available
This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of "semi-structured", we focus attention on a particular, and very large, example of such text. Consequently the work is a case study of the application of grammar-based compression to a large-scale problem. We begin by identifying some characteristics of semi-structured text that have special relevance to data compression. We then give a brief account of a particular large textual database, and describe a compression scheme that exploits its structure. In addition to providing compression, the system gives some insight into the structure of the database. Finally we show how the hierarchical grammar can be generalized, first manually and then automatically, to yield further improvements in compression performance.
Conference Paper
Full-text available
This paper provides a detailed analysis of various implementations of digital tries, including the “ternary search tries” of Bentley and Sedgewick. The methods employed combine symbolic uses of generating functions, Poisson models, and Mellin transforms. Theoretical results are matched against real-life data and justify the claim that ternary search tries are a highly efficient dynamic dictionary structure for strings and textual data.
Article
Full-text available
The paper describes a technique that constructs models of symbol sequences in the form of small, human-readable, hierarchical grammars. The grammars are both semantically plausible and compact. The technique can induce structure from a variety of different kinds of sequence, and examples are given of models derived from English text, C source code and a sequence of terminal control codes. The paper explains the grammatical induction technique, demonstrates its application to three very different sequences, evaluates its compression performance, and concludes by briefly discussing its use as a method for knowledge acquisition.
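
The small, human-readable grammars referred to above are the product of two constraints that SEQUITUR enforces: digram uniqueness (no pair of adjacent symbols occurs more than once in the grammar) and rule utility (every rule is used more than once). The checker below illustrates these two invariants on a toy grammar; it is our own illustration and does not implement the incremental inference algorithm itself.

from collections import Counter

def check_grammar(grammar, start="S"):
    """grammar: dict rule_name -> list of symbols (terminals or rule names)."""
    digrams = Counter()
    uses = Counter()
    for body in grammar.values():
        for a, b in zip(body, body[1:]):
            digrams[(a, b)] += 1        # count every adjacent pair in every rule body
        for sym in body:
            if sym in grammar:
                uses[sym] += 1          # count references to non-terminals
    digram_uniqueness = all(c <= 1 for c in digrams.values())
    rule_utility = all(uses[r] >= 2 for r in grammar if r != start)
    return digram_uniqueness, rule_utility

if __name__ == "__main__":
    # Grammar for "abcabdabcabd": S -> B B, A -> a b, B -> A c A d
    grammar = {"S": ["B", "B"], "A": ["a", "b"], "B": ["A", "c", "A", "d"]}
    print(check_grammar(grammar))       # (True, True)
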
Article
Full-text available
This paper describes two elegant ways of curtailing the space complexity of hierarchy inference, one of which yields a bounded-space algorithm. We begin with a brief review of the hierarchy inference procedure that is embodied in the SEQUITUR program. Then we consider its performance on quite large files, and show how compression performance improves as file size increases. We recognize that hierarchy inference will never define the state of the art in general-purpose text compression, not in practice (compared with PPM variants), chiefly because of its fundamentally non-statistical nature, nor in principle (compared with LZ methods), because it is easy to see that it fails to perform well for random input.
Conference Paper
We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algorithms date back at least to the 1960s, but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching. The first program is a sort that is competitive with the most efficient string sorting programs known. The second program is a symbol table implementation that is faster than hashing, which is commonly regarded as the fastest symbol table implementation. The symbol table implementation is much more space-efficient than multiway trees, and supports more advanced searches. In many application programs, sorts use a Quicksort implementation based on an abstract compare operation, and searches use hashing or binary search trees. These do not take advantage of the properties of string keys, which are widely used in practice. Our algorithms provide a natural and elegant way to adapt classical algorithms to this important class of applications.
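
For readers unfamiliar with the structure, a minimal ternary search tree can be sketched as follows; each node stores one character and three links (less-than, equal, greater-than). This is an illustrative Python version supplied here, not the authors' C implementation, and it omits the space and speed optimisations the paper discusses.

class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "end")
    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.end = ch, None, None, None, False

def insert(node, key, i=0):
    ch = key[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, key, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, key, i)
    elif i + 1 < len(key):
        node.eq = insert(node.eq, key, i + 1)
    else:
        node.end = True                 # marks the end of a stored key
    return node

def contains(node, key, i=0):
    while node is not None:
        ch = key[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(key):
            node, i = node.eq, i + 1
        else:
            return node.end
    return False

if __name__ == "__main__":
    root = None
    for word in ["cat", "cap", "car", "dog"]:
        root = insert(root, word)
    print(contains(root, "car"), contains(root, "ca"))   # True False ("ca" is only a prefix)
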
Article
When data compression is applied to full-text retrieval systems, intricate relationships emerge between the amount of compression, access speed, and computing resources required. We propose compression methods, and explore corresponding tradeoffs, for all components of static full-text systems such as text databases on CD-ROM. These components include lexical indexes, inverted files, bitmaps, signature files, and the main text itself. Results are reported on the application of the methods to several substantial full-text databases, and show that a large, unindexed text can be stored, along with indexes that facilitate fast searching, in less than half its original size—at some appreciable cost in primary memory requirements. © 1993 John Wiley & Sons, Inc.
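
One representative technique for the inverted-file component mentioned above is to store the gaps between successive document numbers with a variable-length code. The sketch below uses Elias gamma codes over d-gaps; the paper evaluates several coding methods, so treat this as a generic illustration rather than its exact scheme.

def gamma_encode(n):
    """Elias gamma code for n >= 1: (len-1) zeros, then the binary representation."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def encode_postings(doc_ids):
    """Encode a sorted postings list as gamma-coded gaps between document numbers."""
    bits, prev = [], 0
    for d in doc_ids:
        bits.append(gamma_encode(d - prev))   # gaps are >= 1 for strictly increasing ids
        prev = d
    return "".join(bits)

def decode_postings(bits):
    ids, prev, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":           # unary prefix gives the length of the binary part
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        prev += gap
        ids.append(prev)
    return ids

if __name__ == "__main__":
    postings = [3, 7, 8, 15, 20]
    code = encode_postings(postings)
    print(code, len(code), "bits")
    print(decode_postings(code))        # [3, 7, 8, 15, 20]
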
Article
For compression of text databases, semi-static word-based methods provide good performance in terms of both speed and disk space, but two problems arise. First, the memory requirements for the compression model during decoding can be unacceptably high. Second, the need to handle document insertions means that the collection must be periodically recompressed, if compression efficiency is to be maintained on dynamic collections. Here we show that with careful management, the impact of both of these drawbacks can be kept small. Experiments with a word-based model and 500 Mb of text show that excellent compression rates can be retained even in the presence of severe memory limitations on the decoder, and after significant expansion in the amount of stored text. Index Terms: Document databases, text compression, dynamic databases, word-based
Conference Paper
Text compression by inferring a phrase hierarchy from the input is a technique that shows promise as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such eclectic uses as phrase browsing in digital libraries, music analysis, and inferring rules for fractal images. We focus attention on the memory requirements of the method. Since the algorithm operates in linear time, the space it consumes is at most linear with input size. The space consumed does in fact grow linearly with the size of the inferred hierarchy, and this makes operation on very large files infeasible. We describe two elegant ways of curtailing the space complexity of hierarchy inference, one of which yields a bounded space algorithm. We begin with a review of the hierarchy inference procedure that is embodied in the SEQUITUR program. Then we consider its performance on quite large files, and show how the compression performance improves as the file size increases.
Article
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text retrieval system. Using a semi-static word-based compression model, the space needed to store the text is under 30 per cent of the original requirement. The model is used in conjunction with canonical Huffman coding and together these two paradigms provide fast decompression. Experiments with 500 Mb of newspaper articles show that in full-text retrieval environments compression not only saves space, it can also yield faster query processing - a win-win situation.
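
Canonical Huffman coding, as mentioned above, assigns codewords so that codes of equal length are consecutive binary integers, which lets the decoder reconstruct the code from a table of code lengths alone. A minimal sketch of that assignment step (illustrative only, not the system's implementation) follows.

def canonical_codes(lengths):
    """lengths: dict symbol -> code length (produced by a Huffman computation)."""
    # Sort by (length, symbol); assign increasing codewords, shifting left when the length grows.
    code, prev_len, codes = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)    # append zeros when moving to longer codes
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

if __name__ == "__main__":
    # Example code lengths for a 5-symbol alphabet (they satisfy Kraft's inequality).
    lengths = {"a": 2, "b": 2, "c": 2, "d": 3, "e": 3}
    print(canonical_codes(lengths))
    # {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}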