Tom Fawcett

Silicon Valley Data Science

Ph.D.

46
Publications
185,795
Reads
25,510
Citations

Publications (46)
Article
Full-text available
The last several years have seen an explosion of interest in wearable computing, personal tracking devices, and the so-called quantified self (QS) movement. Quantified self involves ordinary people recording and analyzing numerous aspects of their lives to understand and improve themselves. This is now a mainstream phenomenon, attracting a great de...
Article
Full-text available
Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data science programs, and publications are touting data science as a hot -- even "sexy" -- career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept dif...
Article
Full-text available
This paper surveys the intersection of two fascinating and increasingly popular domains: swarm intelligence and data mining. Whereas data mining has been a popular academic topic for decades, swarm intelligence is a relatively new subfield of artificial intelligence which studies the emergent collective intelligence of groups of simple agents. It i...
Article
Full-text available
Rules are commonly used for classification because they are modular, intelligible and easy to learn. Existing work in classification rule learning assumes the goal is to produce categorical classifications to maximize classification accuracy. Recent work in machine learning has pointed out the limitations of classification accuracy: when class dist...
Article
Full-text available
A cellular automaton is a discrete, dynamical system composed of very simple, uniformly interconnected cells. Cellular automata may be seen as an extreme form of simple, localized, distributed machines. Many researchers are familiar with cellular automata through Conway's Game of Life. Researchers have long been interested in the theoreti...
Article
Full-text available
Classifier calibration is the process of converting classifier scores into reliable probability estimates. Recently, a calibration technique based on isotonic regression has gained attention within machine learning as a flexible and effective way to calibrate classifiers. We show that, surprisingly, isotonic regression based calibration using the P...
Article
Full-text available
Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs have been used in cost-sensitive learning because of the ease with which class skew and error cost information can be applied to them to yield cost-sensitive decisions. However, they have been criticized...
Article
Full-text available
Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconception...
Article
Full-text available
In an article in this issue, Webb and Ting criticize ROC analysis for its inability to handle certain changes in class distributions. They imply that the ability of ROC graphs to depict performance in the face of changing class distributions has been overstated. In this editorial response, we describe two general types of domains and argue that Web...
Article
Full-text available
This introductory paper to the special issue on Data Mining Lessons Learned presents lessons from data mining applications, including experience from science, business, and knowledge management in a collaborative data mining setting.
Article
Full-text available
Data mining is concerned with finding interesting patterns in data. Many techniques have emerged for analyzing and visualizing large volumes of data. What one finds in the technical literature are mostly success stories of these techniques. Researchers rarely report on steps leading to success, failed attempts, or critical representation ch...
Article
Full-text available
Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been increasingly adopted in the machine learning and data mining research communities. Although ROC graphs are apparently simple, ther...
Article
Full-text available
Receiver Operating Characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptio...
Article
Full-text available
Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, r...
Article
We investigate algebraic, logical, and geometric properties of concepts recognized by various classes of probabilistic classifiers. For this we introduce a natural hierarchy of probabilistic classifiers, the lowest level of which comprises the naive Bayesian classifiers. We show that the expressivity of classifiers on the different levels in the hi...
Conference Paper
This article describes the development of a prototype data mining system for detecting cellular phone (cloning) fraud. In cellular cloning fraud, the identity of a legitimate cellular phone is programmed into another; from the second phone, calls can be made illicitly that are charged to the customer's account. The system for detecting such fraud i...
Article
Full-text available
We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and standard benchmark data sets. The results raise serious concerns about the use of accuracy for comparing classifiers and draw into question the conclusi...
Conference Paper
Full-text available
Rules are commonly used for classification because they are modular, intelligible and easy to learn. Existing work in classification rule learning assumes the goal is to produce categorical classifications to maximize classification accuracy. Recent work in machine learning has pointed out the limitations of classification accuracy: when class distr...
Article
Full-text available
In real-world environments it usually is difficult to specify target operating conditions precisely, for example, target misclassification costs. This uncertainty makes building robust classification systems problematic. We show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for...
Article
Full-text available
We introduce a problem class which we term activity monitoring. Such problems involve monitoring the behavior of a large population of entities for interesting events requiring action. We present a framework within which each of the individual problems has a natural expression, as well as a methodology for evaluating performance of activity monitor...
Article
Full-text available
Applications of inductive learning algorithms to real-world data mining problems have shown repeatedly that using accuracy to compare classifiers is not adequate because the underlying assumptions rarely hold. We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs....
Article
Full-text available
This paper describes the automatic design of methods for detecting fraudulent behavior. Much of the design is accomplished using a series of machine learning methods. In particular, we combine data mining and constructive induction with more standard machine learning techniques to design methods for detecting fraudulent usage of cellular telephones...
Article
Full-text available
One method for detecting fraud is to check for suspicious changes in user behavior. This paper describes the automatic design of user profiling methods for the purpose of fraud detection, using a series of data mining techniques. Specifically, we use a rule-learning program to uncover indicators of fraudulent behavior from a large database of custo...
Article
Full-text available
Fraud is the deliberate use of deception to conduct illicit activities. Automatic fraud detection involves scanning large volumes of data to uncover patterns of fraudulent usage, and as such it is well suited to data mining techniques. We present three general types of fraud that have been addressed in data mining research, and we summarize the appr...
Article
Since Samuel's work on checkers over thirty years ago, much effort has been devoted to learning evaluation functions. However, all such methods are sensitive to the feature set chosen to represent the examples. If the features do not capture aspects of the examples significant for problem solving, the learned evaluation function may be inaccurate o...
Article
ions are created by relaxing conditions specified in a domain theory, so the approach is a hybrid of the data-driven (bottom-up) and theory-driven (top-down) approaches. The use of a domain theory allows the system to start with complex initial features that are sensitive to the goals and operators of the domain, rather than simply starting with th...
Article
Full-text available
This paper describes Zenith, a discovery system that performs constructive induction. The system is able to generate and extend new features for concept learning using agenda-based heuristic search. The search is guided by feature worth (a composite measure of discriminability and cost). Zenith is distinguished from existing constructive induction...
Article
Existing methods for constructive induction usually isolate feature generation from problem solving, and do not exploit information about the purpose for which features are created. This paper describes a theory of feature generation that creates features using both a domain theory and feedback from a concept learner. An evaluation function can the...
Article
Typescript. Thesis (Ph. D.)--University of Massachusetts at Amherst, 1993. Includes bibliographical references (leaves 217-227).
Conference Paper
Existing approaches to constructive induction have been largely empirical, starting with structural features and combining them successively into higher level features. This paper describes a hybrid analytical/empirical method for feature generation for inductive concept learning. The goal of this method is to be able to generate useful features to...
Conference Paper
Full-text available
This paper describes CABOT, a case-based system that is able to adjust its retrieval and adaptation metrics, in addition to storing cases. It has been applied to the game of OTHELLO. Experiments show that CABOT saves about half as many cases as similar systems that do not adjust their retrieval and adaptation mechanisms. It also consistentl...
Chapter
This chapter explores the incomplete-theory problem in which a learning system has an explicit domain theory that cannot generate an explanation for every example. The general method is to use the existing domain theory to generate a plausible explanation of the example and to extract from it one or more rules that may then be added to the domain t...
Article
Full-text available
One method for detecting fraud is to check for suspicious changes in user behavior over time. This paper describes the automatic design of user profiling methods for the purpose of fraud detection, using a series of data mining and machine learning techniques. It uses a rule-learning program to uncover indicators of fraudulent behavior from a la...
Article
This document describes how to design and build boot/root diskettes for Linux. These disks can be used as rescue disks or to test new system components. You should be reasonably familiar with system administration tasks before attempting to build a bootdisk. If you just want a rescue disk to have for emergencies, see Appendix A.1.
Article
Full-text available
Applications of machine learning have shown repeatedly that the standard assumptions of uniform class distribution and uniform misclassification costs rarely hold. Little is known about how to select classifiers when error costs and class distributions are not known precisely at training time, or when they can change. We present a method for an...
