C.G. Nevill-Manning

Rutgers, The State University of New Jersey, New Brunswick, NJ, USA

Publications (11) · 6.81 Total impact

  • Article: On-line and off-line heuristics for inferring hierarchies of repetitions in sequences
    C.G. Nevill-Manning, I.H. Witten
    ABSTRACT: Hierarchical dictionary-based compression schemes form a grammar for a text by replacing each repeated string with a production rule. While such schemes usually operate on-line, making a replacement as soon as repetition is detected, off-line operation permits greater freedom in choosing the order of replacement. In this paper, we compare the on-line method with three off-line heuristics for selecting the next substring to replace: longest string first, most common string first, and the string that minimizes the size of the grammar locally. Surprisingly, two of the off-line techniques, like the on-line method, run in time linear in the size of the input. We evaluate each technique on artificial and natural sequences. In general, the locally-most-compressive heuristic performs best, followed by most frequent, the on-line technique, and, lagging by some distance, the longest-first technique
    Proceedings of the IEEE 12/2000; · 6.81 Impact Factor
  • Conference Proceeding: Protein is incompressible
    C.G. Nevill-Manning, I.H. Witten
    ABSTRACT: Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown, that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order zero models
    Data Compression Conference, 1999. Proceedings. DCC '99; 04/1999
  • Article: Identifying Hierarchical Structure in Sequences: A linear-time algorithm
    C.G. Nevill-Manning, I.H. Witten
    ABSTRACT: SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar, and produce structure as a by-product. SEQUITUR breaks new ground by operating incrementally. Moreover, the method's simple structure permits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real world sequences. Comment: See http://www.jair.org/ for an online appendix and other files accompanying this article
    08/1997;
  • Article: Unknown
    G. Holmes, C.G. Nevill-Manning
    ABSTRACT: It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accurately so that these experiments can be performed in a reasonable length of time, preferably interactively. This paper suggests a method to achieve this using a very simple algorithm that gives good performance across different supervised learning schemes and when compared to one of the most common methods for feature subset selection. Keywords: feature subset selection; supervised learning; 1R; filter model; wrapper model. Introduction: There is growing evidence that feature subset selection can substantially improve the task of performing supervised learning. The algorithms that perform feature subset selection have been studied in a variety o...
    05/1997;
  • Conference Proceeding: Linear-time, incremental hierarchy inference for compression
    C.G. Nevill-Manning, I.H. Witten
    ABSTRACT: Data compression and learning are, in some sense, two sides of the same coin. If we paraphrase Occam's razor by saying that a small theory is better than a larger theory with the same explanatory power, we can characterize data compression as a preoccupation with small, and learning as a preoccupation with better. Nevill-Manning et al. (see Proc. Data Compression Conference, Los Alamitos, CA, p.244-253, 1994) presented an algorithm, since dubbed SEQUITUR, that presents both faces of the compression/learning coin. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences as diverse as DNA and music are intuitively compelling. We present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log(n). Second, we show that a sequence can be compressed incrementally, improving on the non-incremental algorithm that was described by Nevill-Manning et al., and making on-line compression feasible. Third, we present an intriguing result that emerged during benchmarking; whereas PPMC outperforms SEQUITUR on most files in the Calgary corpus, SEQUITUR regains the lead when tested on multimegabyte sequences. We make some tentative conclusions about the underlying reasons for this phenomenon, and about the nature of current compression benchmarking
    Data Compression Conference, 1997. DCC '97. Proceedings; 04/1997
  • Article: Feature Selection Via The Discovery
    G. Holmes, C.G. Nevill-Manning
    ABSTRACT: It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accurately so that these experiments can be performed in a reasonable length of time, preferably interactively. This paper suggests a method to achieve this using a very simple algorithm that gives good performance across different supervised learning schemes and when compared to one of the most common methods for feature subset selection.
    08/1996;
  • Conference Proceeding: Compressing semi-structured text using hierarchical phrase identifications
    C.G. Nevill-Manning, I.H. Witten, D.R. Olsen
    ABSTRACT: This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of “semi-structured”, we focus attention on a particular, and very large, example of such text. Consequently the work is a case study of the application of grammar-based compression to a large-scale problem. We begin by identifying some characteristics of semi-structured text that have special relevance to data compression. We then give a brief account of a particular large textual database, and describe a compression scheme that exploits its structure. In addition to providing compression, the system gives some insight into the structure of the database. Finally we show how the hierarchical grammar can be generalized, first manually and then automatically, to yield further improvements in compression performance
    Data Compression Conference, 1996. DCC '96. Proceedings; 04/1996
  • Conference Proceeding: The development of Holte's 1R classifier
    C.G. Nevill-Manning, G. Holmes, I.H. Witten
    ABSTRACT: The 1R machine learning scheme (Holte, 1993) is a very simple one that proves surprisingly effective on the standard datasets commonly used for evaluation. This paper describes the method and discusses two aspects of the algorithm that bear further analysis: the way that intervals are formed when discretizing continuously-valued attributes; and the way missing values are treated. We then show how the algorithm can be extended to avoid a problem endemic to most practical machine learning algorithms: their frequent dismissal of an attribute as irrelevant when in fact it is highly relevant when combined with other attributes
    Artificial Neural Networks and Expert Systems, 1995. Proceedings., Second New Zealand International Two-Stream Conference on; 12/1995
  • Conference Proceeding: Compression by induction of hierarchical grammars
    C.G. Nevill-Manning, I.H. Witten, D.L. Maulsby
    ABSTRACT: The paper describes a technique that constructs models of symbol sequences in the form of small, human-readable, hierarchical grammars. The grammars are both semantically plausible and compact. The technique can induce structure from a variety of different kinds of sequence, and examples are given of models derived from English text, C source code and a sequence of terminal control codes. It explains the grammatical induction technique, demonstrates its application to three very different sequences, evaluates its compression performance, and concludes by briefly discussing its use as a method for knowledge acquisition
    Data Compression Conference, 1994. DCC '94. Proceedings; 04/1994
  • Conference Proceeding: Phrase hierarchy inference and compression in bounded space
    C.G. Nevill-Manning, I.H. Witten
    ABSTRACT: Text compression by inferring a phrase hierarchy from the input is a technique that shows promise as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such eclectic uses as phrase browsing in digital libraries, music analysis, and inferring rules for fractal images. We focus attention on the memory requirements of the method. Since the algorithm operates in linear time, the space it consumes is at most linear with input size. The space consumed does in fact grow linearly with the size of the inferred hierarchy, and this makes operation on very large files infeasible. We describe two elegant ways of curtailing the space complexity of hierarchy inference, one of which yields a bounded space algorithm. We begin with a review of the hierarchy inference procedure that is embodied in the SEQUITUR program. Then we consider its performance on quite large files, and show how the compression performance improves as the file size increases
    Data Compression Conference, 1998. DCC '98. Proceedings;
  • Article: Compressing semi-structured text using hierarchical phrase identifications
    C.G. Nevill-Manning, I.H. Witten, D.R. Olsen Jr., J.A. Storer, M. Cohn
    ABSTRACT: This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of "semi-structured", we focus attention on a particular, and very large, example of such text. Consequently the work is a case study of the application of grammar-based compression to a large-scale problem. We begin by identifying some characteristics of semi-structured text that have special relevance to data compression. We then give a brief account of a particular large textual database, and describe a compression scheme that exploits its structure. In addition to providing compression, the system gives some insight into the structure of the database. Finally we show how the hierarchical grammar can be generalized, first manually and then automatically, to yield further improvements in compression performance.
    Proceedings of the Data Compression Conference

Institutions

  • 1999–2000
    • Rutgers, The State University of New Jersey
      • Department of Computer Science
      New Brunswick, NJ, USA
  • 1997
    • Stanford University
      Palo Alto, CA, USA
  • 1994–1996
    • The University of Waikato
      • Department of Computer Science
      Hamilton, Waikato, New Zealand