Craig Nevill-Manning

Google Inc. | Google · Engineering Department

Ph.D., University of Waikato

80
Publications
17,294
Reads
4,423
Citations

Publications (80)
Conference Paper
Cities have benefited from the three greatest technological innovations of the past 200 years: the steam engine, electrification, and the automobile. But each advance has created its own challenges, including pollution, overcrowding, sprawl. As the digital revolution transforms cities once again, how can we make sure it improves quality of life whi...
Patent
Full-text available
A system and method for providing definitions is described. A phrase to be defined is received. One or more documents, which each contain at least one definition, are determined. The phrase is matched to at least one of the definitions. One or more definitions for the phrase are presented.
Patent
Full-text available
A rewrite component automatically generates rewrite rules that describe how uniform resource locators (URLs) can be rewritten to reduce or eliminate different URLs that redundantly refer to the same or substantially the same content. The rewrite rules can be applied to URLs received when crawling a network to increase the efficiency of the crawl an...
Article
The scientific endeavor of biology is becoming increasingly reliant on data in electronic form, and it is therefore necessary for biologists to manage and understand large quantities of data. Publicly available data including biological sequences, biological structures, and literature in the life sciences have grown to such an extent that computi...
Chapter
Keyphrases provide semantic metadata that summarize and characterize documents. This chapter describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are goo...
Article
We present a novel method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called EMOTIF (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, EMOTIF generates a set of motifs with a wide range of specificities and sensitivities. EMOT...
Article
Structure exists in sequences ranging from human language and music to the genetic information encoded in our DNA. This thesis shows how that structure can be discovered automatically and made explicit. Rather than examining the meaning of the individual symbols in the sequence, structure is detected in the way that certain combinations of symbols...
Article
Full-text available
3MOTIF is a web application that visually maps conserved sequence motifs onto three-dimensional protein structures in the Protein Data Bank (PDB; Berman et al., Nucleic Acids Res., 28, 235-242, 2000). Important properties of motifs such as conservation strength and solvent accessible surface area at each position are visually represented on the str...
Conference Paper
Full-text available
In this paper, we consider the problem of finding the MEDLINE articles that describe functions of particular genes. We describe our experiments using the mg system and the partitioning of a graph of biological sequences, structures and abstracts. We participated in the primary task of the TREC 2003 Genomics Track.
Article
Full-text available
Digital libraries of music have the potential to capture popular imagination in ways that more scholarly libraries cannot. We are working towards a comprehensive digital library of musical material, including popular music. We have developed new ways of collecting musical material, accessing it through searching and browsing, and presenting the res...
Article
Full-text available
networked library technology is especially striking. We aim to produce a library scheme that operates on small, inexpensive servers. Full-text indexes are provided to several substantial collections of information. These collections serve as case studies. They drive our research by providing technical challenges for indexing, and human interface ch...
Article
Motivation: Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched.
Article
Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process.
Article
The article presents information on encoding, storage and retrieval of data stored in a digital library relating to Human Genome Project. A biological digital library, GenBank consists of over eight million DNA sequences of varying quality from thousands of species, containing important genes and repetitive junk. Within the DNA are coded genes that...
Article
Motivation: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired tra...
Article
Full-text available
The growing need to manage and exploit the proliferation of online data sources is opening up new opportunities for bringing people closer to the resources they need. For instance, consider a recommendation service through which researchers can receive daily pointers to journal papers in their fields of interest. We survey some of the known approac...
Conference Paper
With the number and types of documents in digital library systems increasing, tools for automatically organizing and presenting the content have to be found. While many approaches focus on topic-based organization and structuring, hardly any system ...
Article
Full-text available
Hierarchical dictionary-based compression schemes form a grammar for a text by replacing each repeated string with a production rule. While such schemes usually operate on-line, making a replacement as soon as repetition is detected, off-line operation permits greater freedom in choosing the order of replacement. In this paper, we compare the on-li...
Article
Full-text available
The complementary paradigms of text compression and image compression suggest that there may be potential for applying methods developed for one domain to the other. In image coding, lossy techniques yield compression factors that are vastly superior to those of the best lossless schemes and we show that this is also the case for text. This paper i...
Article
Full-text available
Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression sch...
Article
Full-text available
Motivation: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired tr...
Article
Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched. Building on earlier work that allows evaluation of a scoring matrix to be stopped early, the suffix tree-based...
Conference Paper
Full-text available
People identify powerfully with music: someone might say “that's my song!” but they are unlikely to say “that's my book!” or “that's my picture!” A digital library of popular music therefore has the potential to be a compelling application of information retrieval technology. Such a library requires a retrieval method that is appropriate for a non-...
Article
The problem of assigning conference paper submissions to suitable reviewers can be viewed as a variant of the general problem of technical paper recommendation. In both cases one would ideally like to direct only those papers that are of the greatest interest to the appropriate set of people. Current attempts to automate the conference reviewing pr...
Article
Browsing accounts for much of people's interaction with digital libraries, but it is poorly supported by standard search engines. Conventional systems often operate at the wrong level, indexing words when people think in terms of topics, and returning documents when people want a broader view. As a result, users cannot easily determine what is in a...
Article
Full-text available
This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretic results and practical implementations. We study the suffix tree data structure, presenting an efficient representat...
Conference Paper
Full-text available
Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good k...
Article
Full-text available
Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple proced...
Article
Full-text available
We introduce a minimal-risk method for estimating the frequencies of amino acids at conserved positions in a protein family. Our method, called minimal-risk estimation, finds the optimal weighting between a set of observed amino acid counts and a set of pseudofrequencies, which represent prior information about the frequencies. We compute the optim...
Conference Paper
Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression sch...
Article
Full-text available
This paper describes two elegant ways of curtailing the space complexity of hierarchy inference, one of which yields a bounded-space algorithm. We begin with a brief review of the hierarchy inference procedure that is embodied in the SEQUITUR program. Then we consider its performance on quite large files, and show how compression performance improv...
Article
Full-text available
Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good...
Article
Full-text available
Because digital libraries are expensive to create and maintain, Internet analogs of public libraries---reliable, quality, community services---have only recently begun to appear. A serious obstacle to their creation is the provision of appropriate cataloguing information. Without a database of titles, authors and subjects, it is hard to offer the...
Article
This paper shows how these kinds of structure can be inferred automatically from sequences. Let us make clear at the outset what aspects of sequence structure we are not concerned with. We take no account of numerical frequencies other than the `more than once' that defines repetition. We do not consider any similarity metrics between the individua...
Article
Full-text available
SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algori...
Article
Full-text available
Technical reports are available electronically at hundreds of internet sites around the world. A major impediment to their utility in computer science research is the difficulty in locating reports that are relevant to a particular area. We describe the implementation of a digital library for computer science technical reports that indexes every wo...
Article
Full-text available
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in sequences of discrete symbols and uses that information for compression. On many practical sequences it performs well at both compression and structural inference, producing comprehensible descriptions of sequence structure in the form of grammar rules. Th...
Article
Full-text available
This paper reviews our experience with the application of machine learning techniques to agricultural databases. We have designed and implemented a machine learning workbench, WEKA, which permits rapid experimentation on a given dataset using a variety of machine learning schemes, and has several facilities for interactive investigation of the data...
Article
Full-text available
Structure exists in sequences ranging from human language and music to the genetic information encoded in our DNA. This thesis shows how that structure can be discovered automatically and made explicit. Rather than examining the meaning of the individual symbols in the sequence, structure is detected in the way that certain combinations of symbols...
Article
Full-text available
We present a method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called emotif (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, emotif generates a set of motifs with a wide range of specificities and sensitivities. emotif als...
Article
We show how to extract plain text from PostScript files. A textual scan is inadequate because PostScript interpreters can generate characters on the page that do not appear in the source file. Furthermore, word and line breaks are implicit in the graphical rendition, and must be inferred from the positioning of word fragments. We present a robust t...
Article
Full-text available
The New Zealand Digital Library aims to develop an easy-to-use digital library system that can be accessed via a full-text index, runs on inexpensive computers at the information providers' own sites, and offers a service that providers maintain. The library is collaborating with the MeDoc project in Germany to provide local indexes to German lang...
Conference Paper
Text compression by inferring a phrase hierarchy from the input is a technique that shows promise as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns fr...
Article
Full-text available
Discrete motifs that discriminate functional classes of proteins are useful for classifying new sequences, capturing structural constraints, and identifying protein subclasses. Despite the fact that the space of such motifs can grow exponentially with sequence length and number, we show that in practice it usually does not, and we describe a techni...
Article
Full-text available
Statistical decision theory provides a principled way to estimate amino acid frequencies in conserved positions of a protein family. The goal is to minimize the risk function, or the expected squared-error distance between the estimates and the true population frequencies. The minimum-risk estimates are obtained by adding an optimal number of pseud...
Article
this article tends to be answered by making a selection of queries more or less haphazardly to gain a feeling for what the collection holds.
Article
Full-text available
Developing intuition for the content of a digital collection is difficult. Hierarchies of subject terms allow users to explore the space of topics that a collection covers, to form and specialize useful query terms, and to directly identify interesting documents. We describe two interfaces for navigating such hierarchies, and present a technique fo...
Article
This article focuses on the collections: technical details of mechanism [10], protocols [5] and novel prototype interfaces [2, 8] are available elsewhere.
Article
Full-text available
This paper describes an algorithm, called SEQUITUR, that identifies hierarchical structure in
Article
SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algori...
Article
Full-text available
Digital libraries are expensive to create and maintain, and generally restricted to a particular corporation or group of paying subscribers. While many indexes to the World Wide Web are freely available, the quality of what is indexed is extremely uneven. The digital analog of a public library---a reliable, quality, community service---has yet to a...
Article
Full-text available
Programming by demonstration seeks to allow people to communicate algorithms easily to computers. The world in which the communication takes place, and the set of actions that can be performed in that world, both influence the efficiency and correctness of the transmitted algorithms, and the extent to which efficiency and correctness can be guarant...
Article
Full-text available
This paper shows how these kinds of structure can be inferred automatically from sequences. Let us make clear at the outset what aspects of sequence structure we are
Article
Full-text available
this article tends to be answered by making a selection of queries more or less haphazardly to gain a feeling for what the collection contains.
Article
It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accuratel...
Conference Paper
Full-text available
Data compression and learning are, in some sense, two sides of the same coin. If we paraphrase Occam's razor by saying that a small theory is better than a larger theory with the same explanatory power, we can characterize data compression as a preoccupation with small, and learning as a preoccupation with better. Nevill-Manning et al. (see Proc. D...
Article
Discrete motifs that discriminate functional classes of proteins are useful for classifying new sequences, capturing structural constraints, and identifying protein subclasses. Despite the fact that the space of such motifs can grow exponentially with sequence length and number, we show that in practice it usually does not, and we describe a techni...
Article
Full-text available
SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algori...
Article
Full-text available
This paper analyzes the problems and provides suggestions for their solution. The next section reviews the major issues raised by learning agents from the point of view of ML, and section 3 discusses aspects of the interactive situation that can be used to provide additional leverage for learning.
Article
Full-text available
The paper describes a technique that constructs models of symbol sequences in the form of small, human-readable, hierarchical grammars. The grammars are both semantically plausible and compact. The technique can induce structure from a variety of different kinds of sequence, and examples are given of models derived from English text, C source code...
Article
It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accuratel...
Article
Full-text available
This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of "semi-structured", we focus attention on a particular, and very large, example of such text. Consequently the work is a c...
Article
Many techniques have been developed for learning rules and relationships automatically from diverse data sets, to simplify the often tedious and error-prone process of acquiring knowledge from empirical data. While these techniques are plausible, theoretically well-founded, and perform well on more or less artificial test data sets, they depend on...
Conference Paper
Full-text available
The 1R machine learning scheme (Holte, 1993) is a very simple one that proves surprisingly effective on the standard datasets commonly used for evaluation. This paper describes the method and discusses two aspects of the algorithm that bear further analysis: the way that intervals are formed when discretizing continuously-valued attributes; and th...
Article
It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accuratel...
Article
When data compression is applied to full-text retrieval systems, intricate relationships emerge between the amount of compression, access speed, and computing resources required. We propose compression methods, and explore corresponding tradeoffs, for all components of static full-text systems such as text databases on CD-ROM. These components incl...
Article
The world-wide use of digital storage and communications devices is increasing the need to make texts available in multiple languages. To minimise the cost of storing and transmitting multiple translations of a text, one could store the text in just one language, from which other translations can be created. Unfortunately, the quality of machine tr...
Conference Paper
Full-text available
This paper explores the application of arithmetic coding to systems involving the storage of a large body of text, along with a lexicon that lists the words and a concordance that indicates the exact locations at which each word can be found. A typical query might seek all sentences that contain a particular word or combination of words. The random...
Article
Digital libraries until now could hardly be described as popular: they tend to be based on esoteric, scholarly sources close to the interests of digital library researchers themselves. We are developing a digital library containing the quintessence of popular culture: music. The principal mode of searching this library will be by sung...
Article
Full-text available
Numerous techniques have been proposed for learning rules and relationships from diverse data sets, in the hope that machines can help in the often tedious and error-prone process of knowledge acquisition. While these techniques are plausible and theoretically well-founded, they stand or fall on their ability to make sense of real-wo...
Article
Motivation: Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched. Methods: Building on earlier work that allows evaluation of a scoring matrix to be stopped early,...
Article
The New Zealand Digital Library aims to impose structure on anarchic and uncataloged repositories of information, providing information consumers with effective tools to locate and peruse what they need. Our goal is to produce an easy-to-use digital library system that runs on inexpensive computers at the information providers' own sites and offe...
Article
Full-text available
The Waikato Environment for Knowledge Analysis (weka) is a New Zealand government-sponsored initiative to investigate the application of machine learning to economically important problems in the agricultural industries. The overall goals are to create a workbench for machine learning, determine the factors that contribute towards its successful ap...
Article
Full-text available
Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inferenc...
