-
ABSTRACT: Hierarchical dictionary-based compression schemes form a grammar
for a text by replacing each repeated string with a production rule.
While such schemes usually operate on-line, making a replacement as soon
as repetition is detected, off-line operation permits greater freedom in
choosing the order of replacement. In this paper, we compare the on-line
method with three off-line heuristics for selecting the next substring
to replace: longest string first, most common string first, and the
string that minimizes the size of the grammar locally. Surprisingly, two
of the off-line techniques, like the on-line method, run in time linear
in the size of the input. We evaluate each technique on artificial and
natural sequences. In general, the locally-most-compressive heuristic
performs best, followed by most frequent, the on-line technique, and,
lagging by some distance, the longest-first technique.
Proceedings of the IEEE 12/2000; · 6.81 Impact Factor
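As an illustration of the "most common string first" off-line heuristic named in the abstract above, here is a minimal sketch that restricts candidate strings to adjacent symbol pairs, a simplification in the style of pair-replacement schemes. It is not the paper's implementation and does not run in linear time: it recounts pairs on every pass. The names `most_frequent_first`, `count_pairs`, `expand`, and the `R0`, `R1`, ... rule labels are assumptions made for this example.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_start = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # In a run such as "aaa", the pair at i overlaps the one at i - 1.
        if last_start.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_start[pair] = i
    return counts


def most_frequent_first(text):
    """Repeatedly replace the most frequent pair with a new production rule.

    Returns (start_sequence, rules), where rules maps a rule name to the
    pair of symbols it produces. Terminals are single characters, so the
    multi-character rule names cannot collide with them.
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # no repetition left to exploit
            break
        name = f"R{len(rules)}"
        rules[name] = pair
        # Rewrite the sequence left to right, replacing each occurrence.
        rewritten = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                rewritten.append(name)
                i += 2
            else:
                rewritten.append(seq[i])
                i += 1
        seq = rewritten
    return seq, rules


def expand(seq, rules):
    """Expand a symbol sequence back into the original text."""
    parts = []
    for sym in seq:
        if sym in rules:
            parts.append(expand(rules[sym], rules))
        else:
            parts.append(sym)
    return "".join(parts)
```

Calling `most_frequent_first("abcabcabcabc")` yields a start sequence and a rule table from which `expand` recovers the input. The longest-first and locally-most-compressive heuristics would differ only in how the next string to replace is chosen.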
-
ABSTRACT: Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown, that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order-zero models.
Data Compression Conference, 1999. Proceedings. DCC '99; 04/1999
-
ABSTRACT: SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar, and produce structure as a by-product. SEQUITUR breaks new ground by operating incrementally. Moreover, the method's simple structure permits a proof that it operates in space and time that are linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real-world sequences. Comment: See http://www.jair.org/ for an online appendix and other files accompanying this article.
08/1997;
-
ABSTRACT: It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accurately so that these experiments can be performed in a reasonable length of time, preferably interactively. This paper suggests a method to achieve this using a very simple algorithm that gives good performance across different supervised learning schemes and when compared to one of the most common methods for feature subset selection.
KEYWORDS: Feature subset selection; supervised learning; 1R; filter model; wrapper model.
INTRODUCTION: There is growing evidence that feature subset selection can substantially improve the task of performing supervised learning. The algorithms that perform feature subset selection have been studied in a variety o...
05/1997;
-
ABSTRACT: Data compression and learning are, in some sense, two sides of the same coin. If we paraphrase Occam's razor by saying that a small theory is better than a larger theory with the same explanatory power, we can characterize data compression as a preoccupation with small, and learning as a preoccupation with better. Nevill-Manning et al. (see Proc. Data Compression Conference, Los Alamitos, CA, pp. 244-253, 1994) presented an algorithm, since dubbed SEQUITUR, that presents both faces of the compression/learning coin. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences as diverse as DNA and music are intuitively compelling. We present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log(n). Second, we show that a sequence can be compressed incrementally, improving on the non-incremental algorithm that was described by Nevill-Manning et al., and making on-line compression feasible. Third, we present an intriguing result that emerged during benchmarking: whereas PPMC outperforms SEQUITUR on most files in the Calgary corpus, SEQUITUR regains the lead when tested on multimegabyte sequences. We draw some tentative conclusions about the underlying reasons for this phenomenon, and about the nature of current compression benchmarking.
Data Compression Conference, 1997. DCC '97. Proceedings; 04/1997
-
ABSTRACT: It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accurately so that these experiments can be performed in a reasonable length of time, preferably interactively. This paper suggests a method to achieve this using a very simple algorithm that gives good performance across different supervised learning schemes and when compared to one of the most common methods for feature subset selection.
08/1996;
-
ABSTRACT: This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of “semi-structured”, we focus attention on a particular, and very large, example of such text. Consequently the work is a case study of the application of grammar-based compression to a large-scale problem. We begin by identifying some characteristics of semi-structured text that have special relevance to data compression. We then give a brief account of a particular large textual database, and describe a compression scheme that exploits its structure. In addition to providing compression, the system gives some insight into the structure of the database. Finally we show how the hierarchical grammar can be generalized, first manually and then automatically, to yield further improvements in compression performance.
Data Compression Conference, 1996. DCC '96. Proceedings; 04/1996
-
ABSTRACT: The 1R machine learning scheme (Holte, 1993) is a very simple one that proves surprisingly effective on the standard datasets commonly used for evaluation. This paper describes the method and discusses two aspects of the algorithm that bear further analysis: the way that intervals are formed when discretizing continuously-valued attributes, and the way missing values are treated. We then show how the algorithm can be extended to avoid a problem endemic to most practical machine learning algorithms: their frequent dismissal of an attribute as irrelevant when in fact it is highly relevant when combined with other attributes.
Artificial Neural Networks and Expert Systems, 1995. Proceedings., Second New Zealand International Two-Stream Conference on; 12/1995
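For readers unfamiliar with 1R, the core idea can be sketched as follows: for each attribute, map every attribute value to its majority class, then keep the single attribute whose rule makes the fewest training errors. This sketch assumes nominal attributes only and omits the two aspects the abstract singles out (discretization of continuous attributes and treatment of missing values). The function name `one_r` and the toy data in the usage note are assumptions made for this example, not taken from the paper.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (attribute_index, rule, training_errors) for the best 1R rule.

    rows   -- list of tuples of nominal attribute values
    labels -- list of class labels, one per row
    rule   -- dict mapping each value of the chosen attribute to a class
    """
    best = None
    for attr in range(len(rows[0])):
        # Tally the class distribution for each value of this attribute.
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[attr]][label] += 1
        # Predict the majority class for each value.
        rule = {value: tally.most_common(1)[0][0]
                for value, tally in by_value.items()}
        # Errors are the instances that do not belong to the majority class.
        errors = sum(sum(tally.values()) - tally[rule[value]]
                     for value, tally in by_value.items())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best
```

With `rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]` and `labels = ["no", "no", "yes", "yes"]`, the first attribute separates the classes perfectly, so `one_r` selects attribute 0 with zero training errors.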
-
ABSTRACT: The paper describes a technique that constructs models of symbol
sequences in the form of small, human-readable, hierarchical grammars.
The grammars are both semantically plausible and compact. The technique
can induce structure from a variety of different kinds of sequence, and
examples are given of models derived from English text, C source code
and a sequence of terminal control codes. It explains the grammatical
induction technique, demonstrates its application to three very
different sequences, evaluates its compression performance, and
concludes by briefly discussing its use as a method for knowledge
acquisition.
Data Compression Conference, 1994. DCC '94. Proceedings; 04/1994
-
ABSTRACT: Text compression by inferring a phrase hierarchy from the input is
a technique that shows promise as a compression scheme and as a machine
learning method that extracts some comprehensible account of the
structure of the input text. Its performance as a data compression
scheme outstrips other dictionary schemes, and the structures that it
learns from sequences have been put to such eclectic uses as phrase
browsing in digital libraries, music analysis, and inferring rules for
fractal images. We focus attention on the memory requirements of the
method. Since the algorithm operates in linear time, the space it
consumes is at most linear with input size. The space consumed does in
fact grow linearly with the size of the inferred hierarchy, and this
makes operation on very large files infeasible. We describe two elegant
ways of curtailing the space complexity of hierarchy inference, one of
which yields a bounded space algorithm. We begin with a review of the
hierarchy inference procedure that is embodied in the SEQUITUR program.
Then we consider its performance on quite large files, and show how the
compression performance improves as the file size increases.
Data Compression Conference, 1998. DCC '98. Proceedings;
-
ABSTRACT: This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of "semi-structured", we focus attention on a particular, and very large, example of such text. Consequently the work is a case study of the application of grammar-based compression to a large-scale problem. We begin by identifying some characteristics of semi-structured text that have special relevance to data compression. We then give a brief account of a particular large textual database, and describe a compression scheme that exploits its structure. In addition to providing compression, the system gives some insight into the structure of the database. Finally we show how the hierarchical grammar can be generalized, first manually and then automatically, to yield further improvements in compression performance.
Proceedings of the Data Compression Conference