About
30 Publications
23,236 Reads
16,118 Citations
Publications (30)
In 2008, Rocket Fuel's founders saw a gap in the digital advertising market. None of the existing players were building autonomous systems based on big data and artificial intelligence, but instead they were offering fairly simple technology and relying on human campaign managers to drive success. Five years later in 2013, Rocket Fuel had the best...
When modeling a probability distribution with a Bayesian network, we are
faced with the problem of how to handle continuous variables. Most previous
work has either solved the problem by discretizing, or assumed that the data
are generated by a single Gaussian. In this paper we abandon the normality
assumption and instead use statistical methods fo...
Data mining is an umbrella term referring to the process of discovering patterns in data, typically with the aid of powerful algorithms to automate part of the search. These methods come from disciplines such as statistics, machine learning (artificial intelligence), pattern recognition, neural networks, and databases. Two data analysts with dif...
Machine learning algorithms for supervised learning are in wide use. An important issue in the use of these algorithms is how to set the parameters of the algorithm. While the default parameter values may be appropriate for a wide variety of tasks, they are not necessarily optimal for a given task. In this paper, we investigate the use of cross-val...
In delayed reinforcement learning, an agent is concerned with the problem of discovering an optimal policy, a function mapping states to actions. The most popular delayed reinforcement learning technique, Q-learning, has been proven to produce an optimal policy under certain conditions. However, often the agent does not follow the optimal policy fa...
Successful technology becomes invisible. Few people think much about internal combustion engines while they drive to work in three-thousand-pound hunks of metal powered by them, or electricity while it enables countless parts of their modern lives. Data mining has a long way to go before it succeeds in this way -- or does it? At KDD-98, the Behind-...
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider...
In the feature subset selection problem, a learning algorithm is faced
with the problem of selecting a relevant subset of features upon which
to focus its attention, while ignoring the rest. To achieve the best
possible performance with a particular learning algorithm on a
particular training set, a feature subset selection method should
consider h...
We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present de...
We approach stock selection for long/short portfolios from the perspective of knowledge discovery in databases and rule induction: given a database of historical information on some universe of stocks, discover rules from the data that will allow one to predict which stocks are likely to have exceptionally high or low returns in the future. Long/sh...
We approach the problem of stock selection from the perspective of knowledge discovery in databases: given a database of several years of quarterly information on over a thousand companies, discover patterns in the data that will allow one to predict which stocks are likely to have exceptional returns in the future. The database includes measures o...
As data warehouses grow to the point where one hundred gigabytes is considered small, the computational efficiency of data-mining algorithms on large databases becomes increasingly important. Using a sample from the database can speed up the data-mining process, but this is only acceptable if it does not reduce the quality of the mined knowledge. To...
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular domain, a feature subset selection method should consider how the...
Finding and removing outliers is an important problem in data mining. Errors in large databases can be extremely common, so an important property of a data mining algorithm is robustness with respect to errors in the database. Most sophisticated methods in machine learning address this problem to some extent, but not fully, and can be improved by a...
We present a new method for the induction of tree-structured recursive partitioning classifiers that use a neural network as the partitioning function at each node in the tree. Our technique is appropriate for pattern recognition tasks with many continuous inputs and a single multivalued nominal output. This paper presents two main contributions: 1...
We address the problem of finding the parameter settings that will result in optimal performance of a given learning algorithm using a particular dataset as training data. We describe a “wrapper” method, considering determination of the best parameters as a discrete function optimization problem. The method uses best-first search and crossvalidatio...
We present CHILS, the Convex Hull Inductive Learning System, a novel supervised learning algorithm based on approximating concepts with sets of convex hulls. We introduce a theoretical methodology for describing the power of a concept representation language and use it to compare convex hulls with other geometrical concept representations.
When mining large databases, the data extraction problem and the interface between the database and data mining algorithm become important issues. Rather than giving a mining algorithm full access to a database (by extracting to a flat file or other directly accessible data structure), we propose the SQL Interface Protocol (SIP), which is a framewor...
We present a new method for the induction of classification trees with linear discriminants as the partitioning function at each internal node. This paper presents two main contributions: first, a novel objective function called soft entropy which is used to identify optimal coefficients for the linear discriminants, and second, a novel method for...
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider h...
We present MLC++, a library of C++ classes and tools for supervised Machine Learning. While MLC++ provides general learning algorithms that can be used by end users, the main objective is to provide researchers and experts with a wide variety of tools that can accelerate algorithm development, increase software reliability, provide comparison too...
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider...
Thesis (Ph. D.)--Stanford University, 1997. Includes bibliographical references (leaves 179-194). Photocopy.
High-quality financial databases have existed for many years, but
human analysts can only scratch the surface of the wealth of knowledge
buried in this data. Using the rule-induction technology in the Recon
data-mining system, an investment strategy based purely on the learned
rules can generate significant profits.
Discusses the weight update rule in the cascade correlation neural
net learning algorithm. The weight update rule implements gradient
descent optimization of the correlation between a new hidden unit's
output and the previous network's error. The author presents a
derivation of the gradient of the correlation function and shows that
his resulting w...
We address the problem of finding the parameter settings that will result in optimal performance of a given learning algorithm using a particular dataset as training data. We describe a "wrapper" method, considering determination of the best parameters as a discrete function optimization problem. The method uses best-first search and cross-validation to wra...
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider...
We present MLC++, a library of C++ classes and tools for
supervised machine learning. While MLC++ provides general learning
algorithms that can be used by end users, the main objective is to
provide researchers and experts with a wide variety of tools that can
accelerate algorithm development, increase software reliability, provide
comparison tools...
We present a new method for top-down induction of decision trees (TDIDT) with multivariate binary splits at the nodes. The primary contribution of this work is a new splitting criterion called soft entropy, which is continuous and differentiable with respect to the parameters of the splitting function. Using simple gradient descent to find mult...