
Charles Elkan, University of California, San Diego
Publications: 158 | Reads: 97,234 | Citations: 25,512
Publications (158)
One-class classification is a common situation in remote sensing, where researchers aim to extract a single land type from remotely sensed data. Learning a classifier from labeled positive and unlabeled background data, which is the case-control sampling scenario, is efficient for one-class remote sensing classification because labeled negative dat...
We consider real world task-oriented dialog settings, where agents need to generate both fluent natural language responses and correct external actions like database queries and updates. We demonstrate that, when applied to customer support chat transcripts, Sequence to Sequence (Seq2Seq) models often generate short, incoherent and ungrammatical na...
Scheduling surgeries is a challenging task due to the fundamental uncertainty of the clinical environment, as well as the risks and costs associated with under- and over-booking. We investigate neural regression algorithms to estimate the parameters of surgery case durations, focusing on the issue of heteroscedasticity. We seek to simultaneously es...
Besides the overpowering bouquet of raspberries in this guy's beer, this review is remarkable for another reason. It was produced by a computer program instructed to hallucinate a review for a "fruit/vegetable beer." Using a powerful artificial-intelligence tool called a recurrent neural network, the software that produced this passage isn't even p...
Clinical medical data, especially in the intensive care unit (ICU), consists
of multivariate time series of observations. For each patient visit (or
episode), sensor data and lab test results are recorded in the patient's
Electronic Health Record (EHR). While potentially containing a wealth of
insights, the data is difficult to mine effectively, ow...
We extend previous work on efficiently training linear models by applying
stochastic updates to non-zero features only, lazily bringing weights current
as needed. To date, only the closed form updates for the $\ell_1$,
$\ell_{\infty}$, and the rarely used $\ell_2$ norm have been described. We
extend this work by showing the proper closed form updat...
The objective of machine learning is to extract useful information from data,
while privacy is preserved by concealing information. Thus it seems hard to
reconcile these competing interests. However, they frequently must be balanced
when mining sensitive data. For example, medical research represents an
important application where it is necessary b...
This paper provides new insight into maximizing F1 measures in the context of binary classification and also in the context of multilabel classification. The harmonic mean of precision and recall, the F1 measure is widely used to evaluate the success of a binary classifier when one class is rare. Micro average, macro average, and per instance avera...
This paper investigates the properties of the widely-utilized F1 metric as
used to evaluate the performance of multi-label classifiers. We show that given
an uninformative binary classifier, F1-optimal thresholding is to predict all
instances positive. More surprisingly, we prove a relationship between the
optimal threshold and the best achievable...
This paper provides new insight into maximizing F1 scores in the context of binary classification and also in the context of multilabel classification. The harmonic mean of precision and recall, F1 score is widely used to measure the success of a binary classifier when one class is rare. Micro average, macro average, and per instance average F1 sco...
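As a small illustration of the quantity these abstracts discuss, the sketch below computes F1 as the harmonic mean of precision and recall, and evaluates the "predict all instances positive" rule mentioned above. The function names and the example counts are hypothetical and are not taken from the papers.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall,
    which simplifies to 2*tp / (2*tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_all_positive(n_pos: int, n_neg: int) -> float:
    """F1 when every instance is predicted positive:
    tp = n_pos, fp = n_neg, fn = 0."""
    return f1_score(tp=n_pos, fp=n_neg, fn=0)


# Illustrative rare-class setting: 10 positives among 1000 examples.
print(f1_all_positive(10, 990))
```

With a rare positive class the all-positive rule has perfect recall but very low precision, so its F1 is roughly twice the base rate.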
This paper analyzes a novel method for publishing data while still protecting privacy. The method is based on computing weights that make an existing dataset, for which there are no confidentiality issues, analogous to the dataset that must be kept private. The existing dataset may be genuine but public already, or it may be synthetic. The weights...
Multilabel learning is a machine learning task that is important for applications, but challenging. A recent method for multilabel learning called probabilistic classifier chains (PCCs) has several appealing properties. However, PCCs suffer from the computational issue that inference (i.e., predicting the label of an example) requires time exponent...
This paper investigates the profitability of a trading strategy based on training a model to identify stocks with high or low predicted returns. A tail set is defined to be a group of stocks whose volatility-adjusted price change is in the highest or lowest quantile, for example the highest or lowest 5%. Each stock is represented by a set of techni...
In this paper, we show how to use the classical technique of beam search for multilabel learning (MLL). A recent method for multilabel learning called probabilistic classifier chains (PCCs) has several appealing properties. However, PCCs suffer from the computational issue that inference (i.e., predicting the label of an example) requires time expo...
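The general idea of beam search over a label chain can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `cond_prob` is a hypothetical callable standing in for the trained chain of classifiers, and the input x is assumed to be captured inside it.

```python
import math


def beam_search_labels(cond_prob, n_labels, beam_width):
    """Approximate the most probable label vector under a chain model.

    cond_prob(prefix, j) is a hypothetical callable returning
    P(y_j = 1 | x, y_1..y_{j-1}) given the labels already chosen in `prefix`.
    Returns (label_tuple, log_probability) for the best labeling found.
    """
    beam = [((), 0.0)]  # (partial label tuple, log-probability)
    for j in range(n_labels):
        candidates = []
        for prefix, logp in beam:
            p1 = cond_prob(prefix, j)
            for y, p in ((1, p1), (0, 1.0 - p1)):
                if p > 0.0:
                    candidates.append((prefix + (y,), logp + math.log(p)))
        # Keep only the beam_width highest-scoring partial labelings.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam[0]
```

Exhaustive inference would score every one of the 2^n label combinations, whereas this sketch makes at most `beam_width` calls to `cond_prob` per label, at the price of possibly missing the true optimum.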
This paper investigates a reinforcement learning method that combines learning a model of the environment with least-squares policy iteration (LSPI). The LSPI algorithm learns a linear approximation of the optimal state-action value function; the idea studied here is to let this value function depend on a learned estimate of the expected next state...
In many real-world applications of machine learning classifiers, it is
essential to predict the probability of an example belonging to a particular
class. This paper proposes a simple technique for predicting probabilities
based on optimizing a ranking loss, followed by isotonic regression. This
semi-parametric technique offers both good ranking an...
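The second stage named above, isotonic regression, can be sketched with the pool-adjacent-violators procedure. This is an illustrative sketch only: it assumes the scores already come from some ranking model, it ignores tied scores, and the function name is hypothetical.

```python
def pav_calibrate(scores, labels):
    """Fit a nondecreasing map from ranking scores to probabilities.

    Returns (sorted_scores, calibrated_probs), where calibrated_probs
    is nondecreasing in the score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Each block holds [sum_of_labels, count]; adjacent blocks are merged
    # whenever their means would violate monotonicity.
    blocks = []
    for i in order:
        blocks.append([float(labels[i]), 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    probs = []
    for total, count in blocks:
        probs.extend([total / count] * count)
    return [scores[i] for i in order], probs
```

Because the fitted map is monotone, it does not reverse the ordering of any two examples produced by the ranker, although examples pooled into one block receive the same probability.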
The role of inhibition is investigated in a multiclass support vector machine formalism inspired by the brain structure of insects. The so-called mushroom bodies have a set of output neurons, or classification functions, that compete with each other to encode a particular input. Strongly active output neurons depress or inhibit the remaining output...
Suppose that we have n training examples. The training data are a matrix with n rows and p columns, where each example is represented by values for p different features. Assume that each feature value is a real number. Let feature value j for example number i be written xij. The label of example i is yi. For example, yi = 1 if message i is spam and...
We propose an online topic model for sequentially analyzing the time evolution of topics in document collections. Topics naturally evolve with multiple timescales. For example, some words may be used consistently over one hundred years, while other words ...
In ecological studies, it is useful to estimate the probability that a species occurs at given locations. The probability of presence can be modeled by traditional statistical methods, if both presence and absence data are available. However, the challenge is that most species records contain only presence data, without reliable absence data. Previ...
Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects includi...
ACT annotation guidelines. Basic classification criteria for PPI abstracts.
IMT method distribution. Distribution of interaction detection methods across the different IMT data sets.
Evaluation metrics overview. Details on the calculation of the used evaluation scores.
ACT example run. iP/R curve of the best team (73, S. Kim and W. J. Wilbur) in the Article Classification Task. Circle 1: Of the top 2% (130) of all results, approx. 90% (120) are relevant abstracts. Circle 2: To find half (295) of all relevant abstracts (Recall around 50%), a human going over the ranked list only has to look at the first 7% (421) o...
Many reinforcement learning methods are based on a function Q(s,a) whose value is the discounted total reward expected after performing the action a in the state s. This paper explores the implications of representing the Q function as Q(s,a) = s^T W a, where W is a matrix that is learned. In this representation, both s and a are real-valued vectors...
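To make the bilinear form concrete, the sketch below evaluates Q(s,a) as the product of the state vector, the matrix W, and the action vector. The greedy selection over a finite list of candidate actions is an assumption added for illustration; the abstract does not say how actions are chosen, and the function names are hypothetical.

```python
def q_value(s, W, a):
    """Compute Q(s, a) = s^T W a for state vector s, matrix W
    (len(s) rows, len(a) columns), and action vector a."""
    return sum(s[i] * W[i][j] * a[j]
               for i in range(len(s))
               for j in range(len(a)))


def greedy_action(s, W, candidate_actions):
    """Pick the candidate with the largest Q value.
    Assumes a finite candidate set (an illustrative simplification)."""
    return max(candidate_actions, key=lambda a: q_value(s, W, a))
```

For a fixed state s, Q is linear in a with coefficient vector s^T W, so the state enters action selection only through that vector.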
We propose to solve the link prediction problem in graphs using a supervised matrix factorization approach. The model learns
latent features from the topological structure of a (possibly directed) graph, and is shown to make better predictions than
popular unsupervised scores. We show how these latent features may be combined with optional explicit...
Convex optimization has emerged as a useful tool for applications that include data analysis and model fitting, resource allocation, engineering design, network design and optimization, finance, and control and signal processing. After an overview, the ...
With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized bi...
What is called supervised learning is the most fundamental task in machine learning. In supervised learning, we have training examples and test examples. A training example is an ordered pair 〈x, y 〉 where x is an instance and y is a label. A test example is an instance x with unknown label. The goal is to predict labels for test examples. The name...
In remote-sensing classification, there are situations when users are only interested in classifying one specific land-cover type, without considering other classes. These situations are referred to as one-class classification. Traditional supervised learning is inefficient for one-class classification because it requires all classes that occur in...
A low-rank approximation to a matrix A is a matrix with significantly smaller rank than A, and which is close to A according to some norm. Many practical applications involving the use of large matrices focus on low-rank approximations. By reducing the rank or dimensionality of the data, we reduce the complexity of analyzing the data. The singular...
In dyadic prediction, labels must be predicted for pairs (dyads) whose members possess unique identifiers and, sometimes, additional features called side-information. Special cases of this problem include collaborative filtering and link prediction. We present a new log-linear model for dyadic prediction that is the first to satisfy several importa...
This paper presents a fundamentally new approach to allowing learning algorithms to be applied to a dataset, while still keeping
the records in the dataset confidential. Let D be the set of records to be kept private, and let E be a fixed set of records from a similar domain that is already public. The idea is to compute and publish a weight w(x) f...
In dyadic prediction, the input consists of a pair of items (a dyad), and the goal is to predict the value of an observation
related to the dyad. Special cases of dyadic prediction include collaborative filtering, where the goal is to predict ratings
associated with (user, movie) pairs, and link prediction, where the goal is to predict the presence...
An important extension of the idea of likelihood is conditional likelihood. The conditional likelihood of θ given data x and y is L(θ; y|x) = f(y|x; θ). Intuitively, y follows a probability distribution that is different for different x, but x itself is never unknown, so there is no need to have a probabilistic model of it. Technically, for each x...
In dyadic prediction, labels must be predicted for pairs (dyads) whose members possess unique identifiers and, sometimes, additional features called side-information. Special cases of this problem include collaborative filtering and link prediction. We present the first model for dyadic prediction that satisfies several important desiderata: (i) la...
Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This appr...
Identifying a subset of features that preserves classification accuracy is a problem of growing importance, because of the increasing size and dimensionality of real-world data sets. We propose a new feature selection method, named Quadratic Programming Feature Selection (QPFS), that reduces the task to a quadratic optimization problem. In order to...
Communications' Virtual Extension brings more quality articles to ACM members. These articles are now available in the ACM Digital Library.
The aim of latent semantic indexing (LSI) is to uncover the relationships between terms, hidden concepts, and documents. LSI uses the matrix factorization technique known as singular value decomposition (SVD). In this paper, we apply LSI to standard benchmark collections. We find that LSI yields poor retrieval accuracy on the TREC 2, 7, 8, and 2004...
Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TEX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application o...
Many different topic models have been used successfully for a variety of applications. However, even state-of-the-art topic models suffer from the important flaw that they do not capture the tendency of words to appear in bursts; it is a fundamental property of language that if a word is used once in a document, it is more likely to be used again. We...
We apply topic models to financial data to obtain a more accurate view of economic networks than that supplied by traditional economic statistics. The learned topic models can serve as a substitute for or a complement to more complicated network analysis. Initial results on S&P500 stock market data show that topic models are able to obtain meaning...
Classifiers are traditionally learned using sets of positive and negative training examples. However, often a classifier is required, but for training only an incomplete set of positive examples and a set of unlabeled examples are available. This is the situation, for example, with the Transport Classification Database (TCDB, www.tcdb.org), a rep...
The Transporter Classification Database (TCDB), freely accessible at http://www.tcdb.org, is a relational database containing sequence, structural, functional and evolutionary information about transport systems
from a variety of living organisms, based on the International Union of Biochemistry and Molecular Biology-approved transporter
classifica...
The input to an algorithm that learns a binary classifier normally consists of two sets of examples, where one set consists of positive examples of the concept to be learned, and the other set consists of negative examples. However, it is often the case that the available training data are an incomplete set of positive examples, and a set of unlabe...
Learning a sequence classifier means learning to predict a sequence of output tags based on a set of input data items. For example, recognizing that a handwritten word is "cat", based on three images of handwritten letters and on general knowledge of English letter combinations, is a sequence classification task. This paper describes a new two-st...
The KDD Cup is the oldest of the many data mining competitions that are now popular [1]. It is an integral part of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). In 2007, the traditional KDD Cup competition was augmented with a workshop with a focus on the concurrently active Netflix Prize competition [...
The number of specialized databases in molecular biology is growing fast, as is the availability of molecular data. These
trends necessitate the development of automatic methods for finding relevant information to include in specialized databases.
We show how to use a comprehensive database (SwissProt) as a source of new entries for a specialized d...
Automatically improving the performance of inference engines is a central issue in automated deduction research. This paper
describes and evaluates mechanisms for speeding up search in an inference engine used in research on reactive planning. The
inference engine is adaptive in the sense that its performance improves with experience. This improvem...
This paper presents approaches to semi-supervised learning when the labeled training data and test data are differently distributed. Specifically, the samples selected for labeling are a biased subset of some general distribution and the test set consists of samples drawn from either that general distribution or the distribution of the unlabeled sa...
The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Polya distribution, is a model for text documents that takes into account burstiness: the fact that if a word occurs once in a document, it is likely to occur repeatedly. We derive a new family of distributions that are approximations to DCM distributions and cons...
The Dirichlet compound multinomial (DCM) distribution has recently been shown to be a good model for documents because it captures the phenomenon of word burstiness, unlike standard models such as the multinomial distribution. This paper investigates the DCM Fisher kernel, a function for comparing documents derived from the DCM. We show that the...
This paper explores the automatic classification of audio tracks into musical genres. Our goal is to achieve human-level accuracy with fast training and classification. This goal is achieved with radial basis function (RBF) networks by using a combination of unsupervised and supervised initialization methods. These initialization methods yield clas...
Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model...
In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI) by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We investigate systematically how each algorithmic component of BWI,...
When clustering a dataset, the right number $k$ of clusters to use is often not obvious, and choosing $k$ automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning $k$ while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian dist...
Most learning methods assume that the training set is drawn randomly from the population to which the learned model is to be applied. However in many applications this assumption is invalid. For example, lending institutions create models of who is likely to repay a loan from training sets consisting of people in their records to whom loans were gi...
An online topic-specific web search requires an intelligent web crawler. To be effective, a crawler must be able to identify and prioritize hyperlinks that are most likely to lead to relevant documents. We propose and evaluate a heuristic scoring method that predicts the utility of a link based on the presence of topic-specific keywords associated in v...
We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings...
Hard disk drive failures are rare but are often costly. The ability to predict failures is important to consumers, drive manufacturers, and computer system manufacturers alike. In this paper we investigate the abilities of two Bayesian methods to predict disk drive failures based on measurements of drive internal conditions. We first view the probl...
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem.
In this paper we show how to learn rules to improve the performance of a machine translation system. Given a system consisting of two translation functions (one from language A to language B and one from B to A), training text is translated from A to B and back again to A. Using these two translations, differences in knowledge between the two tra...
An important issue in reinforcement learning is how to incorporate expert knowledge in a principled manner, especially as we scale up to real-world tasks. In this paper, we present a method for incorporating arbitrary advice into the reward structure of a reinforcement learning agent without altering the optimal policy. This method extends the pote...
The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by...
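One way the triangle inequality can rule out distance calculations is sketched below. It uses the fact that, for a point x and centers b and c, d(b, c) >= 2 d(x, b) implies d(x, c) >= d(x, b). This is a minimal sketch of a single pruning test, not the full accelerated algorithm; the function name and the example centers are hypothetical.

```python
import math


def nearest_center(x, centers, center_dists):
    """Return (index of the center nearest to x, number of skipped distances).

    A center c_j is skipped when d(best, c_j) >= 2 * d(x, best), because the
    triangle inequality then guarantees d(x, c_j) >= d(x, best).
    """
    best, best_d = 0, math.dist(x, centers[0])
    skipped = 0
    for j in range(1, len(centers)):
        if center_dists[best][j] >= 2 * best_d:
            skipped += 1  # c_j cannot be closer than the current best
            continue
        d = math.dist(x, centers[j])
        if d < best_d:
            best, best_d = j, d
    return best, skipped


# Illustrative centers; center-to-center distances are computed once and reused.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
center_dists = [[math.dist(b, c) for c in centers] for b in centers]
print(nearest_center((1.0, 1.0), centers, center_dists))
```

The assignment is identical to a brute-force search; only the number of point-to-center distance evaluations changes.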
We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusteri...
Improved methods are proposed for disk-drive failure prediction.
The SMART (self monitoring and reporting technology) failure prediction
system is currently implemented in disk-drives. Its purpose is to
predict the near-term failure of an individual hard disk-drive, and
issue a backup warning to prevent data loss. Two experimental tests of
SMART sh...
Class membership probability estimates are important for many applications of data mining in which classification outputs are combined with other sources of information for decision-making, such as example-dependent misclassification costs, the outputs of other classifiers, or domain knowledge. Previous calibration methods apply only to two-class p...
We discuss a reinforcement learning framework where learners observe experts interacting with the environment. Our approach is to construct from these observations exploratory policies which favor selection of actions the expert has taken. This imitation strategy can be applied at any stage of learning, and requires neither that information regardi...
In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI), by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We provide a systematic analysis of how each algorithmic component o...
Introduction. Information retrieval in the worldwide web environment poses unique challenges. The worldwide web is a distributed, always changing, and ever expanding collection of documents. These features of the web make it difficult to find information about a specific topic. The most common approaches involve indexing, but indexes introduce centr...
To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified.
This paper is a reply to the article entitled Elkan's Theoretical Argument, Reconsidered by Prof. Enric Trillas and Prof. Claudi Alsina. I would like to express my thanks to Dr. Piero Bonissone for inviting me to write this paper and for showing me the article by Trillas and Alsina in advance of its publication. Ever since mathematical studies of f...
This paper presents a first attempt at explaining the relationship between the psychological and artificial intelligence points of view of learning with a special focus on social learning. A two dimensional classification methodology is proposed that classifies learning behaviors in intelligent agents on the basis of agent structure and of informat...
Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, because of unstandardized abbreviations, or because of differences in the detailed schemas of records from multiple databases, among ot...
In many data mining domains, misclassification costs are different for different examples, in the same way that class membership probabilities are example-dependent. In these domains, both costs and probabilities are unknown for test examples, so both cost estimators and probability estimators must be learned. After discussing how to make optimal d...
CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next best entry....
This paper revisits the problem of optimal learning and decision-making when different misclassification errors incur different penalties. We characterize precisely but intuitively when a cost matrix is reasonable, and we show how to avoid the mistake of defining a cost matrix that is economically incoherent. For the two-class case, we prove a theo...
Accurate, well-calibrated estimates of class membership probabilities are needed in many supervised learning applications, in particular when a cost-sensitive decision must be made about examples with example-dependent costs. This paper presents simple but successful methods for obtaining calibrated probability estimates from decision tree and naiv...
This paper presents a simple new algorithm that performs k-means clustering in one scan of a dataset, while using a buffer for points from the dataset of fixed size. Experiments show that the new method is several times faster than standard k-means, and that it produces clusterings of equal or almost equal quality. The new method is a simplificatio...
This paper will try to explain the details of Heckman's procedure and its mathematical justification using language familiar in the field of machine learning.
With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant ...
Keywords: data mining, machine learning, model fitting, regression, exploratory data analysis, error rate estimation, data modeling, data cleaning, data preparation, predictability. We prove an inequality bound for the variance of the error of a regression function plus its non-smoothness as quantified by the Uniform Lipschitz condition. The coefficients in t...
Protein families are well characterized by a collection of motifs (Sonnhammer & Kahn 1994), sometimes referred to as the "common core" (Chothia & Lesk 1986). These motifs can have structural and functional significance, and they may frequently be operated upon as units by diverse evolutionary mechanisms. The quality of a multiple alignment de...
Motivation: Modeling families of related biological sequences using Hidden Markov models (HMMs), although increasingly widespread, faces at least one major problem: because of the complexity of these mathematical models, they require a relatively large training set in order to accurately characterize a given family. For families in which there are...
The MEME algorithm extends the expectation maximization (EM) algorithm for identifying motifs in unaligned biopolymer sequences. The aim of MEME is to discover new motifs in a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present. MEME innovations expand the range of problems which can be solv...