Article · PDF available

Machine Learning Methods for Estimating Heterogeneous Causal Effects


Abstract and Figures

In this paper we study the problems of estimating heterogeneity in causal effects in experimental or observational studies and conducting inference about the magnitude of the differences in treatment effects across subsets of the population. In applications, our method provides a data-driven approach to determine which subpopulations have large or small treatment effects and to test hypotheses about the differences in these effects. For experiments, our method allows researchers to identify heterogeneity in treatment effects that was not specified in a pre-analysis plan, without concern about invalidating inference due to multiple testing. In most of the literature on supervised machine learning (e.g. regression trees, random forests, LASSO, etc.), the goal is to build a model of the relationship between a unit's attributes and an observed outcome. A prominent role in these methods is played by cross-validation which compares predictions to actual outcomes in test samples, in order to select the level of complexity of the model that provides the best predictive power. Our method is closely related, but it differs in that it is tailored for predicting causal effects of a treatment rather than a unit's outcome. The challenge is that the "ground truth" for a causal effect is not observed for any individual unit: we observe the unit with the treatment, or without the treatment, but not both at the same time. Thus, it is not obvious how to use cross-validation to determine whether a causal effect has been accurately predicted. We propose several novel cross-validation criteria for this problem and demonstrate through simulations the conditions under which they perform better than standard methods for the problem of causal effects. We then apply the method to a large-scale field experiment re-ranking results on a search engine.
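The abstract's central difficulty is that the "ground truth" causal effect is never observed for any unit, so a squared-error criterion cannot be computed directly. One device used in this literature for exactly this purpose is the transformed outcome, which is an unbiased proxy for the conditional treatment effect when the treatment probability is known. The sketch below uses it to compare two candidate effect predictors on simulated data; the data-generating process and all variable names are illustrative, and this is a simplified stand-in rather than the paper's exact cross-validation criteria.

```python
import numpy as np

rng = np.random.default_rng(0)
n, e = 100_000, 0.5                     # sample size; known treatment probability

X = rng.normal(size=n)                  # a single covariate
W = rng.binomial(1, e, size=n)          # randomized treatment assignment
tau = 1.0 + X                           # true heterogeneous treatment effect
Y = X + W * tau + rng.normal(size=n)    # only one potential outcome observed per unit

# Transformed outcome: E[Y_star | X] = tau(X), so Y_star acts as a noisy
# stand-in for the unobservable unit-level effect when scoring predictions.
Y_star = W * Y / e - (1 - W) * Y / (1 - e)

# Model selection: the candidate closer to the true CATE attains a lower
# mean squared error against the transformed outcome.
mse_good = np.mean((Y_star - (1.0 + X)) ** 2)   # oracle predictor tau(X)
mse_bad = np.mean((Y_star - 0.0) ** 2)          # naive "no effect" predictor
print(mse_good < mse_bad)
```

Despite the high variance of `Y_star` for any single unit, averages of squared errors over a test sample still rank candidate effect models correctly in expectation, which is what makes cross-validation on causal effects feasible.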
... Uplift modeling, also known as individual treatment effect modeling (Rubin 1974) or heterogeneous treatment effect estimation (Athey and Imbens 2015; Hitsch and Misra 2018; Rudaś and Jaroszewicz 2018; Rößler and Schoder 2022), is a powerful machine learning technique used to predict the incremental effect of a treatment (e.g., a direct marketing campaign) on an outcome (e.g., making a purchase) for each individual. By understanding the treatment effect for each individual, uplift modeling enables companies to make targeting decisions, including up-selling, cross-selling, churn prevention, and retention efforts (Chickering and Heckerman 2000; Guelman et al. 2012; Radcliffe 2007). ...
... Instead, we can only observe either Y_i(1) or Y_i(0), but not their difference (Angrist and Pischke 2008). Therefore, researchers are often interested in estimating the conditional average treatment effects (CATEs), which represent the expected treatment effects for customers with specific features (e.g., Athey and Imbens 2015; Rzepakowski and Jaroszewicz 2012): ...
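The CATE referenced in the excerpt above is, in standard notation, tau(x) = E[Y_i(1) - Y_i(0) | X_i = x]. Under random treatment assignment it is identified by a difference in means within each feature value, which the following toy simulation illustrates (the segment feature, effect sizes, and data-generating process are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
segment = rng.binomial(1, 0.5, size=n)    # a binary customer feature X_i
W = rng.binomial(1, 0.5, size=n)          # randomized treatment assignment
tau = np.where(segment == 1, 2.0, 0.5)    # true CATE differs by segment
Y = tau * W + rng.normal(size=n)          # observed outcome

# Under random assignment, the difference in means within each feature value
# is an unbiased estimate of tau(x) = E[Y_i(1) - Y_i(0) | X_i = x].
cates = {}
for x in (0, 1):
    m = segment == x
    cates[x] = Y[m & (W == 1)].mean() - Y[m & (W == 0)].mean()
print({x: round(c, 1) for x, c in cates.items()})
```

The estimates recover the segment-level effects (approximately 0.5 and 2.0 here) even though no individual's uplift is ever observed.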
... Indirect approaches, such as meta-learners (e.g., S-learner, T-learner, X-learner, R-learner), extend existing supervised machine learning models to the estimation of uplifts, namely by modeling the expected value of outcomes under different treatments (Künzel et al. 2019; Zhang et al. 2021). Conversely, direct approaches (e.g., causal trees and decision trees for uplift modeling) estimate the CATE by partitioning the data based on features that predict the treatment effect (Athey and Imbens 2015; Rzepakowski and Jaroszewicz 2012; Zhang et al. 2021). To establish benchmarks, we compare our proposed models against meta-learners and uplift trees (as shown in Table 1). ...
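Of the meta-learners named above, the T-learner is the simplest to sketch: fit one outcome model per treatment arm and difference their predictions. The version below uses ordinary least squares as a stand-in for any supervised learner; the data-generating process and function names are illustrative, not any particular package's API.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
X = rng.uniform(-1, 1, size=(n, 1))
W = rng.binomial(1, 0.5, size=n)
Y = X[:, 0] + W * (1.0 + X[:, 0]) + rng.normal(size=n)   # true CATE: 1 + x

def fit_ols(Xm, y):
    """Outcome model (a stand-in for any supervised learner): OLS with intercept."""
    A = np.column_stack([np.ones(len(Xm)), Xm])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xq: np.column_stack([np.ones(len(Xq)), Xq]) @ coef

# T-learner: fit separate outcome models on treated and control units and
# take the difference of their predictions as the CATE estimate.
mu1 = fit_ols(X[W == 1], Y[W == 1])
mu0 = fit_ols(X[W == 0], Y[W == 0])
cate_hat = mu1(X) - mu0(X)
print(round(float(cate_hat.mean()), 1))
```

An S-learner would instead fit a single model on (X, W) jointly and difference its predictions at W = 1 and W = 0; the direct approaches in the excerpt differ in that they optimize splits for treatment-effect heterogeneity rather than outcome fit.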
Article
Full-text available
Uplift modeling, also referred to as heterogeneous treatment effect estimation, is a machine learning technique utilized in marketing for estimating the incremental impact of treatment on the response of each customer. Uplift models face a fundamental challenge in causal inference because the variable of interest (i.e., the uplift itself) remains unobservable. As a result, popular uplift models (such as meta-learners and uplift trees) do not incorporate loss functions for uplifts in their algorithms. This article addresses that gap by proposing uplift models with quasi-loss functions (UpliftQL models), which separately use four specially designed quasi-loss functions for uplift estimation in algorithms. Using simulated data, our analysis reveals that, on average, 55% (34%) of the top five models from a set of 14 are UpliftQL models for binary (continuous) outcomes. Further empirical data analysis shows that over 60% of the top-performing models are consistently UpliftQL models.
... Over the past decades, the extensive literature has focused on estimating the average treatment effect (ATE), which provides a measure of the average effectiveness of a treatment for a population of subjects (see, for example, Rubin, 1978; Pearl, 2009). More recently, some literature has shifted the interest to estimating the conditional average treatment effect (CATE), which provides a more refined measure of the average treatment effect for subgroups of the population defined by specific characteristics or covariates (see, for example, Athey and Imbens, 2015; Shalit et al., 2017; Wager and Athey, 2018; Künzel et al., 2019; Farrell et al., 2021). While CATE offers a more detailed and informative understanding of the treatment effect than ATE, both measures overlook the inherent variability in individual responses to treatment, under which the effect of treatment can vary significantly between individuals. ...
... To adjust for the second potential distributional shift between the treated and control groups, we leverage the propensity score adjustment in the classical causal inference literature (see e.g., Athey and Imbens, 2015; Shalit et al., 2017; Wager and Athey, 2018; Künzel et al., 2019; Farrell et al., 2021), and we calculate the treatment-balancing weight as, ...
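The treatment-balancing weight whose formula is elided in the excerpt is, in the classical literature it cites, the inverse-propensity weight: 1/e(x) for treated units and 1/(1 - e(x)) for controls, where e(x) is the propensity score. A sketch with a known propensity score on simulated, confounded data (all names and the data-generating process are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
X = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-X))              # true propensity score P(W=1 | X)
W = rng.binomial(1, e)
Y = X + 2.0 * W + rng.normal(size=n)      # X confounds treatment and outcome

# Inverse-propensity weights reweight treated (1/e) and control (1/(1-e))
# units to the covariate distribution of the full population, so the weighted
# contrast recovers the true effect of 2.0 despite confounding.
ate_naive = Y[W == 1].mean() - Y[W == 0].mean()      # biased upward by X
ate_ipw = np.mean(Y * W / e - Y * (1 - W) / (1 - e))
print(round(float(ate_ipw), 1))
```

In practice e(x) must itself be estimated (e.g., by logistic regression), and extreme weights are often trimmed or stabilized; the sketch sidesteps both issues by using the true propensity score.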
Preprint
Estimating treatment effects from observational data is of central interest across numerous application domains. The individual treatment effect offers the most granular measure of treatment effect, at the individual level, and is the most useful for facilitating personalized care. However, its estimation and inference remain underdeveloped due to several challenges. In this article, we propose a novel conformal diffusion model-based approach that addresses those intricate challenges. We integrate highly flexible diffusion modeling, the model-free statistical inference paradigm of conformal inference, and propensity score and covariate local approximation to tackle distributional shifts. We unbiasedly estimate the distributions of potential outcomes for the individual treatment effect, construct an informative confidence interval, and establish rigorous theoretical guarantees. We demonstrate the competitive performance of the proposed method over existing solutions through extensive numerical studies.
... Currently, there is a growing body of literature that explores the integration of machine learning techniques into causal inference to address the above issues [2,23,24]. Notably, the double machine learning (DML) model proposed by Chernozhukov et al. has garnered widespread attention [2]. Within the framework of a partially linear model, DML allows for the estimation of the average treatment effect. ...
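In the partially linear model Y = theta·W + g(X) + epsilon referenced above, DML estimates theta by cross-fitting the nuisance functions E[Y|X] and E[W|X] on held-out folds and regressing outcome residuals on treatment residuals. The following numpy-only sketch uses polynomial regression as a stand-in for the machine learning nuisance learners; the data-generating process is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
X = rng.normal(size=n)
W = 0.5 * X + rng.normal(size=n)               # treatment depends on X
Y = 2.0 * W + np.sin(X) + rng.normal(size=n)   # partially linear, theta = 2

def predict_poly(x_tr, y_tr, x_te, deg=3):
    """Nuisance model: polynomial regression (stand-in for ML learners)."""
    return np.polyval(np.polyfit(x_tr, y_tr, deg), x_te)

# Double ML: cross-fit the nuisances E[Y|X] and E[W|X] on the other fold,
# then regress outcome residuals on treatment residuals to estimate theta.
idx = rng.permutation(n)
folds = np.array_split(idx, 2)
num = den = 0.0
for k in (0, 1):
    te, tr = folds[k], folds[1 - k]
    ry = Y[te] - predict_poly(X[tr], Y[tr], X[te])
    rw = W[te] - predict_poly(X[tr], W[tr], X[te])
    num += (ry * rw).sum()
    den += (rw ** 2).sum()
theta_hat = num / den
print(round(float(theta_hat), 1))   # close to the true theta of 2.0
```

Cross-fitting (predicting each fold from the other) is what removes the own-observation overfitting bias; the residual-on-residual regression makes the estimate first-order insensitive to errors in either nuisance model.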
Article
Full-text available
Background Recently, there has been a growing interest in combining causal inference with machine learning algorithms. The double machine learning (DML) model, as an implementation of this combination, has received widespread attention for its expertise in estimating causal effects within high-dimensional complex data. However, the DML model is sensitive to the presence of outliers and heavy-tailed noise in the outcome variable. In this paper, we propose the robust double machine learning (RDML) model to achieve a robust estimation of causal effects when the distribution of the outcome is contaminated by outliers or exhibits symmetrically heavy-tailed characteristics. Results In the RDML model, we employed median machine learning algorithms to achieve robust predictions for the treatment and outcome variables. Subsequently, we established a median regression model for the prediction residuals. These two steps ensure robust causal effect estimation. The simulation study shows that the RDML model is comparable to the existing DML model when the data follow a normal distribution, while the RDML model has obvious superiority when the data follow a mixed normal distribution or a t-distribution, as reflected in a smaller RMSE. Meanwhile, we also apply the RDML model to the deoxyribonucleic acid methylation dataset from the Alzheimer’s disease (AD) neuroimaging initiative database with the aim of investigating the impact of Cerebrospinal Fluid Amyloid β42 (CSF Aβ42) on AD severity. Conclusion These findings illustrate that the RDML model is capable of robustly estimating causal effects, even when the outcome distribution is affected by outliers or displays symmetrically heavy-tailed properties.
... CML integrates traditional econometric techniques with modern machine learning methods. This integration enables the estimation of individual-level differences between actual outcomes and hypothetical outcomes that would have occurred without the intervention (Athey and Imbens 2015). ...
Article
Full-text available
In e-commerce, product returns have become a costly and escalating issue for retailers. Beyond the financial implications for businesses, product returns also lead to increased greenhouse gas emissions and the squandering of natural resources. Traditional approaches, such as charging customers for returns, have proven largely ineffective in curbing returns, thus calling for more nuanced strategies to tackle this issue. This paper investigates the effectiveness of informing consumers about the negative environmental consequences of product returns (“green nudging”) to curtail product returns through a large-scale randomized field experiment (n = 117,304) conducted with a leading European fashion retailer’s online store. Our findings indicate that implementing green nudging can decrease product returns by 2.6% without negatively impacting sales. We then develop and assess a causal machine learning model designed to identify treatment heterogeneities and personalize green nudging (i.e., make nudging “smart”). Our off-policy evaluation indicates that this personalization can approximately double the success of green nudging. The study demonstrates the effectiveness of both subtle marketing interventions and personalization using causal machine learning in mitigating environmentally and economically harmful product returns, thus highlighting the feasibility of employing “Better Marketing for a Better World” approaches in a digital setting.
Article
Full-text available
Uplift modeling was first initiated in the industry in the early 2000s as a new methodology to improve marketing efficiency by predicting individual treatment effect (ITE). It estimates the conditional average treatment effect (CATE) as the difference in outcome probabilities with and without treatment. Recently, AI ethics including fairness evaluation has received significant attention from academia, industry, and regulatory agencies. However, standard fairness metrics, originally developed for conventional predictive models, generally require ground truth (ITE) and cannot be applied directly for uplift models. In this paper, we propose a novel and practical approach to compute fairness metrics suitable for uplift models. A formal framework is first established based on probability theory. It is followed by a simulation analysis to demonstrate its effectiveness. Finally, we illustrate how to apply the approach through an example using public data.
Article
A central tenet in the field of industrial organisation is that increasing/decreasing market concentration is associated with increased/reduced markups. But does this variation affect every consumer to the same extent? Previous literature finds price dispersion exists even for homogeneous goods, at least partially as a result of heterogeneity in consumer engagement with the market. We study this question by linking demographic and income heterogeneity across local areas to the impact of changing market concentration on markups. With 15 years of station-level motor fuel price data from Western Australia and information on instances of local market exit and entry, we apply a non-parametric causal forest approach to explore the heterogeneity in the effect of exit/entry. The paper provides evidence of the distributional effect of changing market concentration. Areas with lower income experience a larger increase in petrol stations' price margin as a result of market exit. Entry, on the other hand, does not deliver a correspondingly larger reduction in the margin in these low-income areas relative to high-income areas. Policy implications include a need to further focus on increasing engagement by low-income consumers.
Article
Full-text available
Randomized controlled trials play an important role in how internet companies predict the impact of policy decisions, marketing campaigns, and product changes. Heterogeneity in treatment effects refers to the fact that, in such "digital experiments", different units (people, devices, products) respond differently to the applied treatment. This article presents a fast and scalable Bayesian nonparametric analysis of heterogeneity and its measurement in relation to observable covariates. The analysis leads to a novel estimator of heterogeneity that is based around the distribution of covariates pooled across treatment groups. Results are provided to assess commonly used schemes for variance reduction, and we argue that such schemes will only be useful in estimation of average treatment effects if the sources of heterogeneity are known in advance or can be learned across multiple experiments. We also describe how, without any prior knowledge, one can mine experiment data to discover patterns of heterogeneity and communicate these results in sparse low dimensional summaries. Throughout, the work is illustrated with a detailed example experiment involving 21 million unique users of eBay.com.
Article
Full-text available
The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates. Both large and small sample theory show that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates. Applications include: (i) matched sampling on the univariate propensity score, which is a generalization of discriminant matching, (ii) multivariate adjustment by subclassification on the propensity score where the same subclasses are used to estimate treatment effects for all outcome variables and in all subpopulations, and (iii) visual representation of multivariate covariance adjustment by a two- dimensional plot.
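Application (ii) above, subclassification on the propensity score, can be sketched in a few lines: stratify units by quantiles of the score, take a difference in means within each stratum, and average with stratum weights. The simulated data and the choice of five strata are illustrative (five subclasses is the classic rule of thumb for removing most of the bias).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
X = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-X))            # propensity score P(W=1 | X)
W = rng.binomial(1, e)
Y = X + 1.0 * W + rng.normal(size=n)    # true treatment effect: 1.0

# Subclassification on the scalar propensity score: five quantile strata,
# a difference in means within each stratum, then a size-weighted average.
strata = np.digitize(e, np.quantile(e, [0.2, 0.4, 0.6, 0.8]))
ate_naive = Y[W == 1].mean() - Y[W == 0].mean()      # biased by confounding
ate_sub = 0.0
for s in range(5):
    m = strata == s
    diff = Y[m & (W == 1)].mean() - Y[m & (W == 0)].mean()
    ate_sub += m.mean() * diff
print(abs(ate_sub - 1.0) < abs(ate_naive - 1.0))
```

A small residual bias remains because covariates still vary within each stratum; finer strata, matching on the score, or covariance adjustment within subclasses (the abstract's other applications) reduce it further.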
Book
Did mandatory busing programs in the 1970s increase the school achievement of disadvantaged minority youth? Does obtaining a college degree increase an individual's labor market earnings? Did the use of the butterfly ballot in some Florida counties in the 2000 presidential election cost Al Gore votes? If so, was the number of miscast votes sufficiently large to have altered the election outcome? At their core, these types of questions are simple cause-and-effect questions. Simple cause-and-effect questions are the motivation for much empirical work in the social sciences. This book presents a model and set of methods for causal effect estimation that social scientists can use to address causal questions such as these. The essential features of the counterfactual model of causality for observational data analysis are presented with examples from sociology, political science, and economics.
Article
Written by one of the preeminent researchers in the field, this book provides a comprehensive exposition of modern analysis of causation. It shows how causality has grown from a nebulous concept into a mathematical theory with significant applications in the fields of statistics, artificial intelligence, economics, philosophy, cognitive science, and the health and social sciences. Judea Pearl presents and unifies the probabilistic, manipulative, counterfactual, and structural approaches to causation and devises simple mathematical tools for studying the relationships between causal connections and statistical associations. The book will open the way for including causal analysis in the standard curricula of statistics, artificial intelligence, business, epidemiology, social sciences, and economics. Students in these fields will find natural models, simple inferential procedures, and precise mathematical definitions of causal concepts that traditional texts have evaded or made unduly complicated. The first edition of Causality has led to a paradigmatic change in the way that causality is treated in statistics, philosophy, computer science, social science, and economics. Cited in more than 5,000 scientific publications, it continues to liberate scientists from the traditional molds of statistical thinking. In this revised edition, Judea Pearl elucidates thorny issues, answers readers’ questions, and offers a panoramic view of recent advances in this field of research. Causality will be of interests to students and professionals in a wide variety of fields. Anyone who wishes to elucidate meaningful relationships from data, predict effects of actions and policies, assess explanations of reported events, or form theories of causal understanding and causal speech will find this book stimulating and invaluable.
Article
Recursive partitioning is embedded into the general and well-established class of parametric models that can be fitted using M-type estimators (including maximum likelihood). An algorithm for model-based recursive partitioning is suggested for which the basic steps are: (1) fit a parametric model to a data set, (2) test for parameter instability over a set of partitioning variables, (3) if there is some overall parameter instability, split the model with respect to the variable associated with the highest instability, (4) repeat the procedure in each of the daughter nodes. The algorithm yields a partitioned (or segmented) parametric model that can effectively be visualized and that subject-matter scientists are accustomed to analyzing and interpreting.
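The four steps above can be sketched in miniature. This toy version substitutes the simplest possible choices: the "parametric model" is an intercept (the node mean), and a two-sample z-test at the median split stands in for the paper's parameter instability tests; all thresholds and names are illustrative.

```python
import numpy as np

def fit_node(y):
    """Step 1: fit the parametric model -- here just an intercept (the mean)."""
    return y.mean()

def split_node(Xp, y, z_crit=5.0, min_size=200):
    """Steps 2-4: test each partitioning variable for parameter instability
    (a two-sample z-test at the median split stands in for the real tests),
    split on the most unstable variable, and recurse into the daughters."""
    best = None
    for j in range(Xp.shape[1]):
        cut = np.median(Xp[:, j])
        left, right = y[Xp[:, j] <= cut], y[Xp[:, j] > cut]
        if len(left) < min_size or len(right) < min_size:
            continue
        se = np.sqrt(left.var() / len(left) + right.var() / len(right))
        z = abs(left.mean() - right.mean()) / se
        if best is None or z > best[0]:
            best = (z, j, cut)
    if best is None or best[0] < z_crit:      # no overall instability: stop
        return fit_node(y)                    # leaf holds the fitted parameter
    _, j, cut = best
    m = Xp[:, j] <= cut
    return (j, cut, split_node(Xp[m], y[m]), split_node(Xp[~m], y[~m]))

# Toy data: the mean of y is unstable in the first partitioning variable only.
rng = np.random.default_rng(6)
Xp = rng.uniform(size=(5000, 2))
y = np.where(Xp[:, 0] > 0.5, 3.0, 0.0) + rng.normal(size=5000)
tree = split_node(Xp, y)
```

On this data the algorithm splits once, on the first variable near 0.5, and returns two stable leaves; the published method replaces the z-test with generalized M-fluctuation tests and fits a full parametric model in every node.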
Book
Contents: setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?