Rich Caruana

Microsoft · Adaptive Systems and Interaction Group

About

137
Publications
46,133
Reads
19,781
Citations

Publications (137)
Preprint
Most pregnancies and births result in a good outcome, but complications are not uncommon and when they do occur, they can be associated with serious implications for mothers and babies. Predictive modeling has the potential to improve outcomes through better understanding of risk factors, heightened surveillance, and more timely and appropriate int...
Preprint
Machine learning (ML) interpretability techniques can reveal undesirable patterns in data that models exploit to make predictions--potentially causing harms once deployed. However, how to take action to address these patterns is not always clear. In a collaboration between ML and human-computer interaction researchers, physicians, and data scientis...
Preprint
Full-text available
Estimating heterogeneous treatment effects in domains such as healthcare or social science often involves sensitive data where protecting privacy is important. We introduce a general meta-algorithm for estimating conditional average treatment effects (CATE) with differential privacy (DP) guarantees. Our meta-algorithm can work with simple, single-s...
Preprint
Recent strides in interpretable machine learning (ML) research reveal that models exploit undesirable patterns in the data to make predictions, which potentially causes harms in deployment. However, it is unclear how we can fix these models. We present our ongoing work, GAM Changer, an open-source interactive system to help data scientists and doma...
Preprint
Full-text available
Although reinforcement learning (RL) has tremendous success in many fields, applying RL to real-world settings such as healthcare is challenging when the reward is hard to specify and no exploration is allowed. In this work, we focus on recovering clinicians' rewards in treating patients. We incorporate the what-if reasoning to explain clinician's...
Article
Various challenges in real life are multi-objective and conflicting (i.e., alter concurrent optimization). This implies that a single objective is optimized based on another’s cost. The Multi-Objective Optimization (MOO) issues are challenging but potentially realistic, and due to their wide-range application, optimization challenges have widely be...
Preprint
Full-text available
We show that adding differential privacy to Explainable Boosting Machines (EBMs), a recent method for training interpretable ML models, yields state-of-the-art accuracy while protecting privacy. Our experiments on multiple classification and regression datasets show that DP-EBM models suffer surprisingly little accuracy loss even with strong differ...
Preprint
Full-text available
Deployment of machine learning models in real high-risk settings (e.g. healthcare) often depends not only on model's accuracy but also on its fairness, robustness and interpretability. Generalized Additive Models (GAMs) have a long history of use in these high-risk domains, but lack desirable features of deep learning such as differentiability and...
Preprint
We examine Dropout through the perspective of interactions: learned effects that combine multiple input variables. Given $N$ variables, there are $O(N^2)$ possible pairwise interactions, $O(N^3)$ possible 3-way interactions, etc. We show that Dropout implicitly sets a learning rate for interaction effects that decays exponentially with the size of...
Preprint
Generalized additive models (GAMs) have become a leading model class for data bias discovery and model auditing. However, there are a variety of algorithms for training GAMs, and these do not always learn the same things. Statisticians originally used splines to train GAMs, but more recently GAMs are being trained with boosted decision trees. It is...
Preprint
Deep neural networks (DNNs) are powerful black-box predictors that have achieved impressive performance on a wide variety of tasks. However, their accuracy comes at the cost of intelligibility: it is usually unclear how they make their decisions. This hinders their applicability to high stakes decision-making domains such as healthcare. We propose...
Preprint
Recent methods for training generalized additive models (GAMs) with pairwise interactions achieve state-of-the-art accuracy on a variety of datasets. Adding interactions to GAMs, however, introduces an identifiability problem: effects can be freely moved between main effects and interaction effects without changing the model predictions. In some ca...
Preprint
InterpretML is an open-source Python package which exposes machine learning interpretability algorithms to practitioners and researchers. InterpretML exposes two types of interpretability - glassbox models, which are machine learning models designed for interpretability (ex: linear models, rule lists, generalized additive models), and blackbox expl...
Conference Paper
Generalized additive models (GAMs) are favored in many regression and binary classification problems because they are able to fit complex, nonlinear functions while still remaining interpretable. In the first part of this paper, we generalize a state-of-the-art GAM learning algorithm based on boosted trees to the multiclass setting, showing that th...
Preprint
We propose a neural architecture search (NAS) algorithm, Petridish, to iteratively add shortcut connections to existing network layers. The added shortcut connections effectively perform gradient boosting on the augmented layers. The proposed algorithm is motivated by the feature selection algorithm forward stage-wise linear regression, since we co...
Conference Paper
Full-text available
Without good models and the right tools to interpret them, data scientists risk making decisions based on hidden biases, spurious correlations, and false generalizations. This has led to a rallying cry for model interpretability. Yet the concept of interpretability remains nebulous, such that researchers and tool designers lack actionable guideline...
Conference Paper
As predictive models increasingly assist human experts (e.g., doctors) in day-to-day decision making, it is crucial for experts to be able to explore and understand how such models behave in different feature subspaces in order to know if and when to trust them. To this end, we propose Model Understanding through Subspace Explanations (MUSE), a nov...
Conference Paper
Black-box risk scoring models permeate our lives, yet are typically proprietary or opaque. We propose Distill-and-Compare, an approach to audit such models without probing the black-box model API or pre-defining features to audit. To gain insight into black-box models, we treat them as teachers, training transparent student models to mimic the risk...
Preprint
Full-text available
Generalized additive models (GAMs) are favored in many regression and binary classification problems because they are able to fit complex, nonlinear functions while still remaining interpretable. In the first part of this paper, we generalize a state-of-the-art GAM learning algorithm based on boosted trees to the multiclass setting, and show that t...
Article
Model distillation was originally designed to distill knowledge from a large, complex teacher model to a faster, simpler student model without significant loss in prediction accuracy. We investigate model distillation for another goal -- transparency -- investigating if fully-connected neural networks can be distilled into models that are transpare...
Article
Black-box risk scoring models permeate our lives, yet are typically proprietary and opaque. We propose a transparent model distillation approach to understand and detect bias in such models. Model distillation was originally designed to distill knowledge from a large, complex model (the teacher model) to a faster, simpler model (the student model)...
Article
Predictive models deployed in the real world may assign incorrect labels to instances with high confidence. Such errors or unknown unknowns are rooted in model incompleteness, and typically arise because of the mismatch between training data and the cases encountered at test time. As the models are blind to such errors, input from an oracle is need...
Article
Predictive models deployed in the world may assign incorrect labels to instances with high confidence. Such errors or unknown unknowns are rooted in model incompleteness, and typically arise because of the mismatch between training data and the cases seen in the open world. As the models are blind to such errors, input from an oracle is needed to i...
Conference Paper
Bird migration is a critical indicator of environmental health, biodiversity, and climate change. Existing techniques for monitoring bird migration are either expensive (e.g., satellite tracking), labor-intensive (e.g., moon watching), indirect and thus less accurate (e.g., weather radar), or intrusive (e.g., attaching geolocators on captured birds...
Conference Paper
In machine learning often a tradeoff must be made between accuracy and intelligibility. More accurate models such as boosted trees, random forests, and neural nets usually are not intelligible, but more intelligible models such as logistic regression, naive-Bayes, and single decision trees often have significantly worse accuracy. This tradeoff some...
Article
Full-text available
The generalized partially linear additive model (GPLAM) is a flexible and interpretable approach to building predictive models. It combines features in an additive manner, allowing them to have either a linear or nonlinear effect on the response. However, the assignment of features to the linear and nonlinear groups is typically assumed known. Thus...
Conference Paper
Full-text available
In a variety of real world problems from robot navigation to logistics, agents face the challenge of path optimization on a graph with unknown edge costs. These settings can be generally formalized as the Canadian Traveler Problems (CTPs) [13]. Although in many applications the edge costs have dependencies resulting from world dynamics, CTPs with s...
Conference Paper
Full-text available
Labeling data is a seemingly simple task required for training many machine learning systems, but is actually fraught with problems. This paper introduces the notion of concept evolution, the changing nature of a person's underlying concept (the abstract notion of the target class a person is labeling for, e.g., spam email, travel related web pages...
Article
Full-text available
In the mixture models problem it is assumed that there are $K$ distributions $\theta_{1},\ldots,\theta_{K}$ and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work...
Conference Paper
Clustering never seems to live up to the hype. To paraphrase the popular saying, clustering looks good in theory, yet often fails to deliver in practice. Why? You would think that something so simple and elegant as finding groups of similar items in data would be incredibly useful. Yet often it isn't. The problem is that clustering rarely finds the...
Article
Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms, including classification, clustering, motif discovery, ...
Conference Paper
Standard generalized additive models (GAMs) usually model the dependent variable as a sum of univariate models. Although previous studies have shown that standard GAMs can be interpreted by users, their accuracy is significantly less than more complex models that permit interactions. In this paper, we suggest adding selected terms of interacting pa...
Chapter
A challenge facing user generated content systems is vandalism, i.e. edits that damage content quality. The high visibility and easy access to social networks makes them popular targets for vandals. Detecting and removing vandalism is critical for these user generated content systems. Because vandalism can take many forms, there are many different...
Article
Deemed “one of the top ten data mining mistakes”, leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, ...
Article
A key challenge in developing conversational systems is fusing streams of information provided by different sensors to make inferences about the behaviors and goals of people. Such systems can leverage visual and audio information collected through cameras and microphone arrays, including the location of various people, their focus of attention, bo...
Article
Full-text available
Complex models for regression and classification have high accuracy, but are unfortunately no longer interpretable by users. We study the performance of generalized additive models (GAMs), which combine single-feature models called shape functions through a linear function. Since the shape functions can be arbitrarily complex, GAMs are more accurat...
Article
Full-text available
Boosted decision trees typically yield good accuracy, precision, and ROC area. However, because the outputs from boosting are not well calibrated posterior probabilities, boosting yields poor squared error and cross-entropy. We empirically demonstrate why AdaBoost predicts distorted probabilities and examine three calibration methods for correcting...
Conference Paper
Recent studies have shown that boosting provides excellent predictive performance across a wide variety of tasks. In Learning-to-rank, boosted models such as RankBoost and LambdaMART have been shown to be among the best performing learning methods based on evaluations on public data sets. In this paper, we show how the combination of bagging as a v...
Article
Obtaining the best accuracy in machine learning usually requires carefully tuning learning algorithm parameters for each problem. Parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce Clusters are particularly well suited for parallel parameter optimization. We use...
Article
Summary 1. Species monitoring is an essential component of assessing conservation status, predicting effects of habitat change and establishing management and conservation priorities. The pervasive access to the Internet has led to the development of several extensive monitoring projects that engage massive networks of volunteers who provide observa...
Article
Full-text available
In this paper we demonstrate a practical approach to interaction detection on real data describing the abundance of different species of birds in the prairies east of the southern Rocky Mountains. This data is very noisy - predictive models built from this data perform only slightly better than baseline. Previous approaches for interaction detectio...
Conference Paper
Full-text available
We examine the mechanism by which feature selection improves the accuracy of supervised learning. An empirical bias/variance analysis as feature selection progresses indicates that the most accurate feature set corresponds to the best bias-variance trade-off point for the learning algorithm. Often, this is not the point separating relevant from irr...
Article
Full-text available
The increasing availability of massive volumes of scientific data requires new synthetic analysis techniques to explore and identify interesting patterns that are otherwise not apparent. For biodiversity studies, a “data-driven” approach is necessary because of the complexity of ecological systems, particularly when viewed at large spatial and temp...
Conference Paper
Full-text available
In this paper, we address the semi-supervised learning problem when there is a small amount of labeled data augmented with pairwise constraints indicating whether a pair of examples belongs to a same class or different classes. We introduce a discriminative learning approach that incorporates pairwise constraints into the conventional margin-based...
Conference Paper
Full-text available
ABSTRACT In this paper, we address the problem of learning when some cases are fully labeled while other cases are only partially labeled, in the form of partial labels. Partial labels are represented as a set of possible labels for each training example, one of which is the correct label. We introduce a discriminative learning approach that in...
Conference Paper
Full-text available
Efficiently utilizing off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance chip multiprocessors (CMPs). Conventional memory controllers deliver relatively low performance in part because they often employ fixed, rigid access scheduling policies designed for average-case application behavior. As a result, they ca...
Article
Efficiently utilizing off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance chip multiprocessors (CMPs). Conventional memory controllers deliver relatively low performance in part because they often employ fixed, rigid access scheduling policies designed for average-case application behavior. As a result, they cannot...
Article
Full-text available
Discovering additive structure is an important step towards understanding a complex multi-dimensional function because it allows the function to be expressed as the sum of lower-dimensional components. When variables interact, however, their effects are not additive and must be modeled and interpreted simultaneously. We present a new approach for t...
Conference Paper
Full-text available
In this paper we perform an empirical evaluation of supervised learning on high-dimensional data. We evaluate performance on three metrics: accuracy, AUC, and squared loss and study the effect of increasing dimensionality on the performance of the learning algorithms. Our findings are consistent with previous studies for problems of relatively l...
Article
Full-text available
Efficiently exploring exponential-size architectural design spaces with many interacting parameters remains an open problem: the sheer number of experiments required renders detailed simulation intractable. We attack this via an automated approach that builds accurate predictive models. We simulate sampled points, using results to teach our models...
Article
Full-text available
The clustering variations tool is based on a stochastic clustering method based on iterated k-means and spectral clustering. Clustering variations are found through a combination of PCA and random projections. Clustering variations are hierarchically organized by clustering the clusterings at a meta-level using a distance measure defined over enti...
Article
SUMMARY Consistently growing architectural complexity and machine scales make creating accurate performance models for large-scale applications increasingly challenging. Traditional analytic models are difficult and time-consuming to construct, and are often unable to capture full system and application complexity. To address these challenges, we automaticall...
Conference Paper
Full-text available
In this paper we address the problem of combining multiple clusterings without access to the underlying features of the data. This process is known in the literature as clustering ensembles, clustering aggregation, or consensus clustering. Consensus clustering yields a stable and robust final clustering that is in agreement with multiple clustering...
Conference Paper
Full-text available
We present a new regression algorithm called Additive Groves and show empirically that it is superior in performance to a number of other established regression methods. A single Grove is an additive model containing a small number of large trees. Trees added to a Grove are trained on the residual error of other trees already in the model. We begin...
Conference Paper
Full-text available
Classifiers that are deployed in the field can be used and evaluated in ways that were not anticipated when the model was trained. The ultimate evaluation metric may not have been known to the modeler at training time, additional performance criteria may have been added, the evaluation metric may have changed over time, or the real-world evalua...
Article
ABSTRACT Most ecologists use statistical methods as their main analytical tools when analyzing data to identify relationships between a response and a set of predictors; thus, they treat all analyses as hypothesis tests or exercises in parameter estimation. However, little or no prior knowledge about a system can lead to creation of a statistical...
Article
Full-text available
We consider the problem of learning Bayes Net structures for related tasks. We present an algorithm for learning Bayes Net structures that takes advantage of the similarity between tasks by biasing learning toward similar structures for each task. Heuristic search is used to find a high scoring set of structures (one for each task), where the...
Conference Paper
Full-text available
Clustering is ill-defined. Unlike supervised learning where labels lead to crisp performance criteria such as accuracy and squared error, clustering quality depends on how the clusters will be used. Devising clustering criteria that capture what users need is difficult. Most clustering algorithms search for one optimal clustering based on a pre...
Conference Paper
Full-text available
We investigate four previously unexplored aspects of ensemble selection, a procedure for building ensembles of classifiers. First we test whether adjusting model predictions to put them on a canonical scale makes the ensembles more effective. Second, we explore the performance of ensemble selection when different amounts of data are available for e...
Conference Paper
Full-text available
Architects use cycle-by-cycle simulation to evaluate design choices and understand tradeoffs and interactions among design parameters. Efficiently exploring exponential-size design spaces with many interacting parameters remains an open problem: the sheer number of experiments renders detailed simulation intractable. We attack this problem via an a...
Conference Paper
Full-text available
Often the best performing supervised learning models are ensembles of hundreds or thousands of base-level classifiers. Unfortunately, the space required to store this many classifiers, and the time required to execute them at run-time, prohibits their use in applications where test sets are large (e.g. Google), where storage space is at a premium (e....
Article
Full-text available
A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90's. We present a large-scale empirical comparison between ten supervised learning methods: SVMs, neural nets, logistic regression, naive bayes, me...
Conference Paper
Full-text available
The Cornell Laboratory of Ornithology’s mission is to interpret and conserve the earth’s biological diversity through research, education, and citizen science focused on birds. Over the years, the Lab has accumulated one of the largest and longest-running collections of environmental data sets in existence. The data sets are not only large, but a...
Conference Paper
Full-text available
Wrapper-based feature selection is attractive because wrapper methods are able to optimize the features they select to the specific learning algorithm. Unfortunately, wrapper methods are prohibitively expensive to use with neural nets. We present an internal wrapper feature selection method for Cascade Correlation (C2) nets called C2FS that i...
Conference Paper
Full-text available
While there have been many successful applications of machine learning methods to tasks in NLP, learning algorithms are not typically designed to optimize NLP performance metrics. This paper evaluates an ensemble selection framework designed to optimize arbitrary metrics and automate the process of algorithm selection and parameter tuning. We r...
Conference Paper
Full-text available
We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding a characteristic sigmoid shaped distortion in the predicted probabilities. Models such as Naiv...
Article
This paper summarizes and analyzes the results of the 2004 KDD-Cup. The competition consisted of two tasks from the areas of particle physics and protein homology detection. It focused on the problem of optimizing supervised learning to different performance measures (accuracy, cross-entropy, ROC area, SLAC-Q, squared error, average precision, top...
Article
Full-text available
We present a method for constructing ensembles from libraries of thousands of models.