A Compression Scheme for Large Databases
Adam Cannane Hugh E. Williams
Department of Computer Science, RMIT University,
GPO Box 2476V, Melbourne 3001, Australia
{cannane,hugh}@cs.rmit.edu.au
Abstract
Compression of databases not only reduces space requirements but can also reduce overall retrieval times. We have described elsewhere our RAY algorithm for compressing databases containing general-purpose data, such as images, sound, and also text. We describe here an extension to the RAY compression algorithm that permits use on very large databases. In this approach, we build a model based on a small training set and use the model to compress large databases. Our preliminary implementation is slow for compression, but only slightly slower in decompression speed than the popular GZIP scheme. Importantly, we show that the compression effectiveness of our approach is excellent and markedly better than the GZIP and COMPRESS algorithms on our test sets.
1. Introduction
A compression algorithm can be applied to databases to
reduce the amount of disk space they occupy. It has also
been shown that, for both text and document databases, compression schemes can lead to faster retrieval than is possible with uncompressed data, because the computational cost of decompression is offset by the reduced cost of seeking and transferring data from disk [8, 12, 15].
A compression algorithm for database systems must ad-
dress the problem of randomly accessing and individually
decompressing records, while maintaining compact storage
of the data. Importantly, as users expect fast performance
in response to queries, decompression must also be fast.
Currently, the most popular technique for compression of
text collections is Huffman coding with a zero-order semi-static word-based model [10]. Huffman coding provides good compression with word-based models; however, it is its fast decompression and support for random access that make it more attractive for databases than schemes with better compression effectiveness, such as arithmetic coding [7].
A problem with word-based Huffman coding for large
databases is that all words to be coded must be included in
the model. When new documents are added to the database
new words may also be added, necessitating a strategy for
their representation in the compressed data. It is gener-
ally impractical to recompute the model and re-code the
compressed data with respect to a new model. Instead,
an escape code is included in the model and, after emit-
ting the escape code, words are stored in an uncompressed
form. Moffat, Zobel, and Sharman [8] have experimented
with this approach to compression for dynamic document
databases using semi-static word-based methods. More recently, Zobel and Williams have extended these semi-static word-based models by proposing more effective techniques for representing words that are not in the model [11].
In this paper we describe an alternative modeling tech-
nique for dynamic databases that permits random-access
and atomic decompression of records in large collections.
Our approach is an extension of our compression algorithm RAY, a multi-pass scheme that identifies repeated phrases in the input. The set of identified phrases is derived by inspecting part of a large database, a hierarchy of phrases is formed (with long phrases containing references to shorter ones), and the phrases are used to effectively compress the complete database.
We show that, by inspecting as little as 1% of the data to be compressed to derive a model, our approach, which we call XRAY, offers better compression effectiveness than
GZIP and COMPRESS on a large text collection. Inspect-
ing only around 6% of data from the same collection gives
comparable performance to a word-based Huffman coding
scheme where the complete data is inspected [8] and, as
we have shown elsewhere [2], our approach is two to three
times faster in decompression. A direct comparison with the
most effective of the comparable escape model approaches
described by Moffat, Zobel, and Sharman [8] shows that our
approach is slightly more effective, while likely to be two
to three times faster. While our preliminary implementation
is not especially fast, compact storage makes it an attractive
alternative to word-based models.
2. Background
The compression process consists of two separate activ-
ities, modeling and coding. Modeling defines how each of the distinct symbols in the input stream will be represented. A model stores information on how often each symbol occurred in the data, that is, symbol probabilities. Coding, the
second part of the compression process, results in a com-
pressed version of the data by constructing a set of codes
for the distinct symbols based on the probabilities provided
by the model. Ideally, symbols that occur more frequently
are replaced with shorter codewords and rare symbols with
longer codewords to reduce the overall length of the coded
data.
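To make the modeling and coding split concrete, the following small Python sketch (ours, not from the paper) performs the modeling step on a toy string: it gathers symbol frequencies and reports each symbol's information content, -log2(p), which the codeword lengths chosen by an ideal coder would approximate.

```python
import math
from collections import Counter

def model_and_ideal_lengths(data: str):
    """Toy modeling step: symbol frequencies plus the ideal (entropy-based)
    code length for each distinct symbol."""
    freq = Counter(data)                      # modeling: gather symbol statistics
    total = sum(freq.values())
    # coding would assign short codewords to frequent symbols; here we just
    # report the information content -log2(p) that such codewords approximate
    return {s: (n, -math.log2(n / total)) for s, n in freq.items()}

print(model_and_ideal_lengths("BABAAABAB"))
# 'A' occurs 5 times -> about 0.85 bits; 'B' occurs 4 times -> about 1.17 bits
```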
Compression schemes can be classified according to the
way the model alters during both compression and decom-
pression. A compression scheme can use a static approach
where a pre-determined model is used. In this approach the
model is independent of the data. Static models are built by
considering the average probability distribution from statis-
tics gathered by inspecting similar data. This approach becomes inefficient when the static model is a poor approximation of the symbol frequencies.
An alternative to static modeling is to allow the model
to evolve as each new symbol in the data is inspected. This
adaptive modeling approach alters the model in a similar
way during decompression. Adaptive models compensate
for new symbols by recalculating the probability distribu-
tion, taking advantage of the local properties of the data.
Altering the model after each symbol means that, when decoding, all preceding symbols must be decoded to reconstruct the same model. This is the major limitation of adaptive modeling when random access to individual records is required.
An approach that combines static and adaptive schemes
alters the model during compression to better approximate
the frequencies of the symbols, yet maintains a static model
during decompression. This approach of semi-adaptive or
semi-static modeling is ideal for compression of databases,
where compression occurs only once—allowing it to be
slow or complex—while decompression occurs frequently
and must be efficient. Importantly, semi-static approaches permit random-access to the compressed data, allowing atomic decompression of records stored in databases.
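To illustrate why a model that is static at decompression time suits databases, the sketch below compresses each record independently against a shared preset dictionary and keeps a byte-offset index, so any single record can be decompressed on its own. It uses zlib's preset-dictionary facility purely as a stand-in for a shared model; the paper's own coder is phrase- and Huffman-based, and the records and dictionary here are hypothetical.

```python
import zlib

# Hypothetical records and a shared "model" (a zlib preset dictionary built
# from sample text); this is an analogy, not the paper's actual coder.
records = [b"the quick brown fox", b"the lazy dog", b"quick brown dogs"]
shared_model = b"the quick brown fox jumps over the lazy dog"

blobs, offsets, pos = [], [], 0
for r in records:                                  # compress each record on its own
    c = zlib.compressobj(zdict=shared_model)
    blob = c.compress(r) + c.flush()
    blobs.append(blob)
    offsets.append((pos, len(blob)))               # index enabling random access
    pos += len(blob)
store = b"".join(blobs)

off, length = offsets[1]                           # decompress record 1 alone
d = zlib.decompressobj(zdict=shared_model)
print(d.decompress(store[off:off + length]))       # b'the lazy dog'
```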
Semi-static approaches require two separate passes over the data: the first gathers statistics on the symbols necessary to build the model, and the second encodes the symbols according to the model. A semi-static model makes good use of specific properties of the data while remaining static during decoding. The disadvantage of the
semi-static approach is that the model parameters need to
be stored with the compressed data.
3. Phrase-based Compression
A database requires a semi-static compression scheme
that will allow efficient retrieval of the compressed data, as well as produce a saving in storage costs. Additionally, the compression scheme must be able to independently decompress a random selection of records [15]. One candidate class of semi-static methods for databases is efficient implementations of canonical Huffman coding [10]. A second class is the phrase-based dictionary methods, which we
discuss here.
Phrase-based compression methods generally construct a
table of phrases that can be referenced in an encoded mes-
sage. Phrases are motifs that are constructed from symbols
that occur in the data and from other shorter phrases. Ide-
ally, each dictionary phrase is referenced at least once in the
encoded message and, because of this repetition, replace-
ment of the phrase with the dictionary reference shortens
the overall length of the encoded message.
We have previously described RAY [2], a dictionary-based scheme that replaces frequently occurring symbol-pairs in the data with shorter references into the dictionary. RAY builds a hierarchical set of phrases, where longer phrases may contain references to shorter phrases. For example, the phrase ABC may be represented hierarchically as A1, where 1 is a reference to the shorter phrase BC. RAY significantly differs from other hierarchical dictionary-based compression schemes by using a multi-pass
approach, where the data stream is repeatedly re-inspected
in a left-to-right manner.
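The hierarchical representation can be illustrated with a small sketch; the phrase table below is hypothetical, not RAY's actual data structure. Entries mix primitive characters with integer references to shorter phrases, and decoding expands references recursively, as in the A1 example above.

```python
# Hypothetical hierarchical phrase table: entries mix primitive characters
# with integer references to shorter phrases.
phrases = {
    1: ["B", "C"],        # phrase 1 = BC
    2: ["A", 1],          # phrase 2 = A followed by phrase 1, i.e. ABC
}

def expand(symbol, table):
    """Recursively expand a phrase reference into its primitive characters."""
    if isinstance(symbol, str):
        return symbol
    return "".join(expand(s, table) for s in table[symbol])

print(expand(2, phrases))   # -> 'ABC'
```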
The multi-pass approach to phrase identification used by RAY consists of three separate steps. In the first step, statistics for the occurrence of symbol-pairs are recorded in an
in-memory table that includes frequencies for overlapping
pairs. Two overlapping pairs, that is, consecutive symbol-pairs, span three symbols—for example, ABC consists of the symbol-pairs AB and BC. Statistics are used in steps two and
three of each pass to determine a selection of subphrases
that will be added to the dictionary.
The second step maintains another set of statistics that record the number of times a symbol-pair was selected as a candidate. A candidate symbol-pair is a symbol-pair that is likely to result in the removal of more redundancy than any other nearby symbol-pair. Selection of a candidate pair is made by comparing the frequencies of overlapping symbol-pairs and taking the greedy approach of selecting the left-most of the overlapping symbol-pairs if it has a frequency greater than or the same as the right-most symbol-pair. If the right-most pair has a higher frequency, comparison of the symbol-pairs moves to the next overlapping pair in the input stream, where the right symbol-pair now becomes the left symbol-pair. For example, the input sequence BABAAABAB has the symbol-pairs and frequencies AA:2, AB:3, and BA:3. The first overlapping symbol-pairs in the sequence are BA and AB. In this case, the frequency of BA is equal to the frequency of the other symbol-pair, AB, so BA would be selected as a candidate pair with a candidate frequency of one and an occurrence frequency of three.
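A minimal sketch of steps one and two follows, applied to the BABAAABAB example. It reflects our own simplifying assumptions about how the scan advances once a candidate has been selected (the paper does not spell out these details), so the candidate counts it reports are only indicative of the idea.

```python
from collections import Counter

def pair_frequencies(s):
    """Step one: frequencies of all consecutive symbol-pairs (overlaps included)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def candidate_frequencies(s, freq):
    """Step two (simplified): scan left to right; the left pair of two
    overlapping symbol-pairs becomes a candidate if its frequency is greater
    than or equal to that of the right pair, otherwise the comparison moves
    one position to the right."""
    cand = Counter()
    i = 0
    while i + 2 < len(s):
        left, right = s[i:i + 2], s[i + 1:i + 3]
        if freq[left] >= freq[right]:
            cand[left] += 1
            i += 2          # assumption: skip past the selected candidate pair
        else:
            i += 1          # the right pair becomes the new left pair
    return cand

s = "BABAAABAB"
freq = pair_frequencies(s)
print(dict(freq))                            # {'BA': 3, 'AB': 3, 'AA': 2}
print(dict(candidate_frequencies(s, freq)))  # with this simplification: {'BA': 2, 'AB': 1}
```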
Step three is a comparison of the candidate frequencies of overlapping symbol-pairs. Selection of a symbol-pair during this step results in the creation of a phrase in the dictionary that contains the selected symbol-pair as its contents. Assuming that the pair BA has a higher candidate frequency than the pair AB, BA would form a phrase 1 (with contents BA) and the input stream would be altered to 11AA1B. Symbol-pair frequencies are also updated each time a pair is replaced in the input stream.
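The replacement itself can be sketched as a left-to-right, non-overlapping substitution of the selected pair by the new phrase symbol; on the running example it reproduces the 11AA1B result above. This is a sketch of the substitution only, not of RAY's bookkeeping.

```python
def replace_pair(symbols, pair, new_symbol):
    """Step three (simplified): replace non-overlapping, left-to-right
    occurrences of the selected symbol-pair with the new phrase symbol."""
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

stream = list("BABAAABAB")
print("".join(str(s) for s in replace_pair(stream, ("B", "A"), 1)))  # 11AA1B
```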
In summary, for a symbol-pair to be substituted in the third step by a phrase symbol, it must adhere to the following constraints:

- the symbol-pair must be a candidate pair;
- the candidate frequency of the symbol-pair must be greater than or equal to the candidate frequency of the overlapping symbol-pair;

and, additionally:

- the candidate frequency of the symbol-pair must exceed a minimum candidate frequency for replacement with phrases (selection of the minimum frequency is a heuristic that requires careful consideration and we do not discuss it in detail here).
After the first pass through the data and replacement of
symbol-pairs with new phrases, steps two and three of this
process are repeated through further passes; for the second
and subsequent passes, the minimum candidate frequency is
determined at step three and step one need not be repeated.
The compression process terminates when no new phrases
are added to the phrase dictionary.
Dictionary-based schemes that remain static during de-
compression require the entire dictionary to be stored with
the compressed text. Therefore, it is necessary to choose
a compact encoding scheme for the phrase dictionary. In
our scheme, the phrase dictionary is carefully compressed
and stored on disk using canonical Huffman coding and the
bitwise Elias gamma and delta codes [3].
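For reference, the Elias gamma code mentioned above represents an integer n >= 1 as floor(log2 n) zero bits followed by the binary representation of n. The sketch below shows the gamma code only; the exact layout of the compressed phrase dictionary is not reproduced here.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code for n >= 1: floor(log2 n) zero bits, then n in binary."""
    assert n >= 1
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

for n in (1, 2, 5, 9):
    print(n, elias_gamma(n))
# 1 -> 1, 2 -> 010, 5 -> 00101, 9 -> 0001001
```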
Several other approaches have also been proposed to
identify and code phrases. Apostolico and Lonardi [1] have
also proposed a multi-pass approach that uses statistics on
the frequency and length of every substring in the input
stream. Larsson and Moffat [6] have also proposed a method that uses symbol-pairs, but inspects the data as a whole and recursively replaces the most frequently occurring pair with
a symbol. A further alternative approach is the SEQUITUR
scheme that incrementally constructs a context-free gram-
mar by replacing repetitive symbol-pairs in a left-to-right
manner [9].
All of the schemes, including RAY, are derivations of the
Lempel-Ziv [13, 14] techniques. The familiar LZ’78 dictionary-based compression scheme replaces repeated phrases with a pointer offset to the previous occurrence and a length
indicating how many symbols the phrase contains. This
scheme has been shown to work well on general-purpose
data. The LZ’78 phrase dictionary does not need to be
stored with the compressed data as it is inferred as the mes-
sage is decoded. The major difference between the approaches described above and the LZ’78 technique is that
they are all semi-static schemes requiring storage of the dic-
tionary, whereas the LZ’78 technique is adaptive.
Each of the approaches described above requires significant resources to compress large files. For example,
RAY requires around 200 Mb of main-memory to maintain
statistics for the compression of just over 60 Mb of text;
while the current implementation of RAY is a prototype, we believe that a production implementation would still require
around 120 Mb to compress such a collection. The scheme
of Larsson and Moffat was tested on files of at most 20 Mb
in size, suggesting similar limitations to RAY. We have ob-
served the same limitations with the SEQUITUR approach of
Nevill-Manning and Witten [9].
For the compression of large collections, RAY requires
new heuristic approaches to maintaining statistics. We dis-
cuss such approaches in the next section.
4. Compressing Large Files with RAY
We propose in this section an extension to RAY, which
we call XRAY, that allows compression of large files with moderate main-memory use and offers good compression effectiveness, that is, space saved through compression.
XRAY requires three extensions of the RAY algorithm for application to large files. The reason for this is that, in order to build a model, RAY requires statistics on every symbol-pair, as well as the entire data in main-memory for efficiency in the multi-pass approach. The most significant limitation is that the model size in RAY continues to grow as more data is inspected, resulting in main-memory requirements that are linear in the data size. Our
extensions, which we describe below, address the main-
memory requirements of RAY, permitting XRAY to be used
on very large data sets.
The first extension in XRAY is to build a hierarchical set
of phrases from a small sample of training data in limited
main-memory. Ideally, the small sample of training data
will be a segment taken from the data to be compressed.
The multi-pass RAY algorithm, as explained in the previous
section, is applied to the training data until phrases are no
longer being added to the phrase table. The result of in-
specting only part of the data is a fixed model that is ideal
for the training set, but is non-optimal for the remainder of
the collection. However, we show later that this model still
works well on the rest of the uninspected data.
The second extension in XRAY involves altering the
model to store primitive symbols that are not present in the
small training set. We ensure that all primitive symbols,
that is, characters, are assigned space in the model and that
codes are created for each primitive symbol that was not
observed in the training data. This heuristic extension per-
mits compression to continue irrespective of the symbols
that are inspected in the data. The assignment of codes is
altered so that all primitive symbols that were not observed
in the training data are added to the model with the same
probability as the rarest symbol. Adding primitive symbols
to the model insignificantly alters the assignment of codes,
especially when the number of phrases exceeds 10,000.
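A minimal sketch of this heuristic, assuming byte-valued primitive symbols (XRAY's model also holds phrases, but the idea carries over): every unseen byte is entered into the model with the frequency of the rarest observed symbol, so compression never fails on characters absent from the training set.

```python
from collections import Counter

def complete_model(training_bytes: bytes):
    """Ensure every primitive symbol (byte value 0-255) has a slot in the
    model; unseen bytes get the frequency of the rarest observed symbol."""
    freq = Counter(training_bytes)
    rarest = min(freq.values()) if freq else 1
    for b in range(256):
        freq.setdefault(b, rarest)
    return freq

model = complete_model(b"BABAAABAB")
print(model[ord("A")], model[ord("Z")])   # 5 (observed) and 4 (assigned rarest)
```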
An alternative approach for handling previously unob-
served symbols in the input is to allow space in the model
for an escape symbol. When a new symbol that is not represented in the model is encountered, the code for the escape symbol is emitted, followed by another sequence of bits that allows the decoder to determine the value of the new symbol. There are many different ways of selecting an encoding of the bits following the escape symbol, the easiest being simply to emit an eight-bit character code. Zobel and Williams [11] have experimented with various other escape methods for handling symbols not represented in the model. They propose an escape
method that encodes the escape symbol, followed by a char-
acter count and then the sequence of characters comprising
the word.
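The simplest escape variant described above can be sketched as follows, with entirely hypothetical codewords; it is included only to contrast with the all-primitives approach that XRAY takes.

```python
ESCAPE = "<esc>"   # stand-in for the escape codeword in the model

def encode_symbol(sym, codes):
    """Simplest escape handling: if a symbol has no code, emit the escape
    code followed by the raw eight-bit character."""
    if sym in codes:
        return codes[sym]
    return codes[ESCAPE] + format(ord(sym), "08b")

codes = {ESCAPE: "000", "A": "1", "B": "01"}      # hypothetical codewords
print(encode_symbol("A", codes))   # '1'
print(encode_symbol("Z", codes))   # '000' followed by '01011010'
```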
We believe our approach of generating codes for all
primitive symbols is a more suitable approach than using an
escape method. The significant difference between XRAY and minimum-redundancy coding schemes is that symbols, that is, phrases, in XRAY vary in length from a single primitive character to long, repeating phrases. For single primitive symbols, using an escape method would be inefficient.
We believe, however, that a combination of both the escape
method for encoding long sequences of characters and our
method of generating codes for all primitive symbols may
work well in XRAY.
The third extension we propose in XRAY is to flatten the hierarchy of the phrase dictionary in main-memory for single-pass compression. Flattening the hierarchy is an efficiency measure in XRAY that permits the longest matching phrase from the dictionary to replace characters in the new input stream, removing the need for multi-pass processing. The cost of flattening the hierarchy is main-memory use but, as the phrase dictionary is constructed from a small training set, this is not a practical problem. A flattened phrase in the dictionary contains a sequence of primitive symbols that can easily be searched for in the input, and searching for matching phrases in the flat hierarchy is fast.
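Flattening can be sketched as a recursive expansion of every phrase into its primitive characters; the phrase table below is hypothetical and not XRAY's internal representation.

```python
def flatten(table):
    """Expand every hierarchical phrase (which may reference shorter phrases
    by integer id) into its full string of primitive characters."""
    def expand(symbol):
        if isinstance(symbol, str):
            return symbol
        return "".join(expand(s) for s in table[symbol])
    return {pid: expand(pid) for pid in table}

hierarchical = {1: ["B", "A"], 2: [1, 1, "C"]}    # hypothetical phrase table
print(flatten(hierarchical))                      # {1: 'BA', 2: 'BABAC'}
```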
We have implemented a search mechanism that inspects data in the input stream and identifies phrases that are stored in our dictionary. The implementation of our search mechanism is a simple exhaustive approach that sequentially processes the set of phrases beginning with the same symbol-pair as the input stream. This approach is not especially efficient and can be improved in a production implementation.
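A sketch of such a search follows: flattened phrases are grouped by their first symbol-pair and longer phrases are tried first. The bucket scan mirrors the exhaustive approach described above; a production implementation would use a faster string-matching structure. The dictionary contents are hypothetical.

```python
from collections import defaultdict

def index_by_first_pair(flat_phrases):
    """Group flattened phrases by their first two characters, longest first."""
    index = defaultdict(list)
    for pid, text in flat_phrases.items():
        index[text[:2]].append((pid, text))
    for bucket in index.values():
        bucket.sort(key=lambda e: len(e[1]), reverse=True)
    return index

def longest_match(data, pos, index):
    """Return (phrase id, length) of the longest dictionary phrase matching
    the input at position pos, or None if no phrase starts with that pair."""
    for pid, text in index.get(data[pos:pos + 2], []):
        if data.startswith(text, pos):
            return pid, len(text)
    return None

flat = {1: "BA", 2: "BABAC", 3: "AB"}             # hypothetical flat dictionary
idx = index_by_first_pair(flat)
print(longest_match("BABACAB", 0, idx))            # (2, 5): matches 'BABAC'
```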
5. Results
To test the performance of XRAY on large databases we
have selected a test collection drawn from the TREC collec-
tions (TREC is an ongoing international collaborative experiment in information retrieval sponsored by NIST and ARPA [5]). The collection, WSJ, is the first 491.5 Mb of the Wall Street Journal from TREC disk 1.
Table 1 shows the compression achieved with XRAY us-
ing different sized training sets from the WSJ collection and
then subsequently compressing the remainder of the collec-
tion. In all of the XRAY experiments, the training data is
taken from the beginning of the WSJ test collection where,
for example, the last column uses the first 0.5 Mb of WSJ as training data to build a model to code the entire 491.5 Mb
WSJ collection. Not surprisingly, the compression effective-
ness degrades with decreasing model size. For the largest
model using 28.6 Mb of training data, the WSJ collection is
compressed to 29% of its original size, while for the small-
est model—where only 0.5 Mb of data is inspected—the
compression achieved is 38%. The most significant result is
that by inspecting only 4.8 Mb of data, or around 1% of the
WSJ collection, compression of around 32% is possible; this
is significant, since the 4.8 Mb training set is around 17%
the size of the 28.6 Mb training set, but only around 10%
worse in compression effectiveness for the WSJ collection.
In Table 2 we compare the compression achieved by
XRAY using a 28.6 Mb training data set to the popular adap-
tive compression schemes GZIP and COMPRESS. Although
the adaptive schemes do not allow random-access decoding
of the data for database applications, which is one of the
important features of XRAY, they are interesting reference
points for compression effectiveness. Our results show that XRAY offers better compression effectiveness than both GZIP and COMPRESS: XRAY compresses the collection to around 29% of its original size, while GZIP compresses it to just over 37% and COMPRESS to 43%. Indeed, even by inspecting as little as 1% of the data—our 4.8 Mb training set—XRAY is still significantly more effective than both GZIP and COMPRESS.
Table 1. Compressed file size and compression effectiveness (bits per character) for XRAY using different sized training sets on the WSJ collection. The size of the compressed file includes the model.

  Training Set Size (Mb)    28.6    19.1     9.5     4.8     0.5
  Compressed File (Mb)     142.7   148.1   153.7   156.9   187.3
  Compression (bpc)         2.32    2.41    2.50    2.55    3.04
Table 2. Compressed file size and compression effectiveness of different compression schemes (in bits per character) for the WSJ collection. XRAY is shown with a 28.6 Mb training set. Results for the MZS-HUFF “Method C” are the best results achieved for an escape method, are approximate only, and are taken from Moffat, Zobel, and Sharman [8].

                      Uncompressed    XRAY    GZIP    COMPRESS    MZS-HUFF
  Size (Mb)                  491.5   142.7   183.0       211.6       145.5
  Compression (bpc)           8.00    2.32    2.98        3.44        2.35
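As a quick consistency check of the tables, bits per character is simply the compressed size in bits divided by the number of original characters (one byte each), so the Mb units cancel.

```python
# bpc = compressed bits / original characters; e.g. the 28.6 Mb XRAY model
compressed_mb, original_mb = 142.7, 491.5
print(round(compressed_mb * 8 / original_mb, 2))   # 2.32, matching Tables 1 and 2
```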
We also show in Table 2 a comparison of the performance of XRAY to an escape-model semi-static HUFFWORD using the most effective “Method C” of Moffat, Zobel, and Sharman [8], where just over 30 Mb of data was inspected to build the model; note that the MZS-HUFF results are approximate only, as the WSJ collection used in our experiments does not contain the SGML markup that was used in the experiments of Moffat, Zobel, and Sharman. The results show that XRAY is likely to be more effective than an efficient escape-model HUFFWORD scheme; we expect, as we have found elsewhere [2], that inclusion of SGML markup in the WSJ collection would further improve the compression effectiveness of XRAY.
To show the overall performance of XRAY in compress-
ing WSJ, we have plotted the actual, average, and ideal com-
pression performance in Figure 1. The measurements reported do not include the size of the model and, for these experiments, we used 28.6 Mb of WSJ as the training data. The
uppermost line, the actual performance, shows the com-
pression achieved for each 28.6 Mb block of data using the
model built from the training set. The middle line, the av-
erage performance, shows the average compression effec-
tiveness for the current and previous 28.6 Mb blocks during
compression; note that the final point is just under 2.32 bits per character, the overall result of compressing the WSJ collection, not including the model size in the calculation. The bottom line is the baseline compression that would be achievable if a separate model were generated for each block. The rapid
increases in the top line indicate points in the data where re-
dundancy exists but is not captured by the model, while the
bottom line indicates the amount of redundancy that could
be effectively removed for the same block of data. The difference between the actual and the ideal is especially evident as we inspect more data: the compression effectiveness using the model continues to deteriorate, whereas the compression achievable for each block improves.

[Figure 1. Actual, average, and ideal compression effectiveness on WSJ using a 28.6 Mb training set. The x-axis shows data inspected (Mb) and the y-axis shows compression effectiveness (bits per character); the three curves are the actual, average, and ideal performance.]
In terms of compression efficiency, XRAY is slower than the other approaches tested in both compression and decompression. While we do not report detailed results here, GZIP takes around 54 seconds to decompress the WSJ collection, while XRAY takes around 106 seconds. In previous experiments [2] we have found that RAY is slightly faster than an efficiently implemented Huffman scheme during decompression; we have not compared XRAY to HUFFWORD for the WSJ collection because of memory limitations of HUFFWORD, however we would expect XRAY to be significantly faster. XRAY is slow in compression, usually taking hours to compress large collections; much of the compression time is consumed by an inefficient pattern-matching approach that, as we describe in the conclusion, can be improved substantially.
6. Conclusion
We have described an extension of our RAY algorithm,
which we call XRAY, that provides good compression of
large databases while maintaining the two important quali-
ties of a database compression scheme: random-access and
independent decompressibility of the data.
We have shown that our method of dealing with symbols not represented in the model achieves slightly better compression effectiveness than an escape-method implementation using an efficient Huffman scheme on similar data. Moreover, our approach is markedly more effective than the popular GZIP and COMPRESS schemes even when only 1% of the data is inspected to build a model.
Our current research implementation of XRAY is inefficient. The task of string matching to find the most effective phrase for replacement in the input stream is not a simple one, and there are many possible candidate algorithms for the task. An obvious compression-efficiency improvement in XRAY is to implement a string-matching mechanism similar to that of GZIP [4]. We believe that an efficiently implemented XRAY should be as fast in compression as fast adaptive schemes such as GZIP in a production implementation, while being significantly more effective in compression.
References
[1] A. Apostolico and S. Lonardi. Some theory and practice
of greedy off-line textual substitution. In J. Storer and
M. Cohn, editors, Proc. IEEE Data Compression Confer-
ence, pages 119–128, Snowbird, Utah, Mar. 1997. IEEE
Computer Society Press, Los Alamitos, California.
[2] A. Cannane and H. Williams. General-purpose compression
for efcient retrieval. Technical Report TR-99-6, Depart-
ment of Computer Science, RMIT, Melbourne, Australia,
1999.
[3] P. Elias. Universal codeword sets and representations of
the integers. IEEE Transactions on Information Theory, IT-
21(2):194–203, Mar. 1975.
[4] J. Gailly. Gzip program and documentation, 1993. Available
by anonymous ftp from prep.ai.mit.edu:/pub/gnu/gzip-*.tar.
[5] D. Harman. Overview of the second text retrieval confer-
ence (TREC-2). Information Processing & Management,
31(3):271–289, 1995.
[6] N. Larsson and A. Moffat. Offline dictionary-based com-
pression. In J. Storer and M. Cohn, editors, Proc. IEEE
Data Compression Conference, pages 296–305, Snowbird,
Utah, 1999. IEEE Computer Society Press, Los Alamitos,
California.
[7] A. Moffat, R. Neal, and I. Witten. Arithmetic coding revis-
ited. ACM Transactions on Information Systems, 16(3):256–
294, July 1998.
[8] A. Moffat, J. Zobel, and N. Sharman. Text compression for
dynamic document databases. IEEE Transactions on Knowl-
edge and Data Engineering, 9(2):302–313, 1997.
[9] C. Nevill-Manning and I. Witten. Compression and ex-
planation using hierarchical grammars. Computer Journal,
40(2/3):103–116, 1997.
[10] A. Turpin. Efficient Prefix Coding. PhD thesis, The Univer-
sity of Melbourne, 1999.
[11] H. Williams and J. Zobel. Combined models for high-
performance compression of large text collections. In String
Processing and Information Retrieval (SPIRE), Cancun,
Mexico, 1999. (to appear).
[12] H. Williams and J. Zobel. Compressing integers for fast file
access. Computer Journal, 42(3):193–201, 1999.
[13] J. Ziv and A. Lempel. A universal algorithm for sequential
data compression. IEEE Transactions on Information The-
ory, IT-23(3):337–343, 1977.
[14] J. Ziv and A. Lempel. Compression of individual sequences
via variable rate coding. IEEE Transactions on Information
Theory, IT-24(5):530–536, 1978.
[15] J. Zobel and A. Moffat. Adding compression to a full-
text retrieval system. Software Practice and Experience,
25(8):891–903, Aug. 1995.