
# Machine Learning Methods for Estimating Heterogeneous Causal Effects

## Abstract and Figures

In this paper we study the problems of estimating heterogeneity in causal effects in experimental or observational studies and conducting inference about the magnitude of the differences in treatment effects across subsets of the population. In applications, our method provides a data-driven approach to determine which subpopulations have large or small treatment effects and to test hypotheses about the differences in these effects. For experiments, our method allows researchers to identify heterogeneity in treatment effects that was not specified in a pre-analysis plan, without concern about invalidating inference due to multiple testing. In most of the literature on supervised machine learning (e.g. regression trees, random forests, LASSO, etc.), the goal is to build a model of the relationship between a unit's attributes and an observed outcome. A prominent role in these methods is played by cross-validation which compares predictions to actual outcomes in test samples, in order to select the level of complexity of the model that provides the best predictive power. Our method is closely related, but it differs in that it is tailored for predicting causal effects of a treatment rather than a unit's outcome. The challenge is that the "ground truth" for a causal effect is not observed for any individual unit: we observe the unit with the treatment, or without the treatment, but not both at the same time. Thus, it is not obvious how to use cross-validation to determine whether a causal effect has been accurately predicted. We propose several novel cross-validation criteria for this problem and demonstrate through simulations the conditions under which they perform better than standard methods for the problem of causal effects. We then apply the method to a large-scale field experiment re-ranking results on a search engine.
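The missing ground truth described in the abstract can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the paper. It uses the transformed outcome $Y^* = Y (W - p) / (p(1 - p))$, a standard device whose conditional expectation equals the conditional average treatment effect when treatment is randomized with known probability $p$. The data-generating process, the effect sizes (1.0 and 3.0), and all function names are assumptions chosen for the example.

```python
import random


def simulate(n, p=0.5, seed=0):
    """Simulate a randomized experiment with one binary covariate x.

    Assumed true effects (illustrative): 1.0 when x == 0, 3.0 when x == 1.
    Each unit reveals only one of its two potential outcomes.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randint(0, 1)
        w = 1 if rng.random() < p else 0       # randomized treatment
        y0 = rng.gauss(0.0, 1.0)               # outcome without treatment
        y1 = y0 + (3.0 if x == 1 else 1.0)     # outcome with treatment
        y = y1 if w == 1 else y0               # only one is observed
        rows.append((x, w, y))
    return rows


def transformed_outcome(w, y, p):
    """Y* = Y (W - p) / (p (1 - p)); E[Y* | X] equals the effect given X."""
    return y * (w - p) / (p * (1 - p))


def mean_transformed_outcome(rows, p, x_value):
    """Average Y* over units with covariate value x_value."""
    vals = [transformed_outcome(w, y, p) for x, w, y in rows if x == x_value]
    return sum(vals) / len(vals)


rows = simulate(n=100_000, p=0.5, seed=0)
tau_hat_x0 = mean_transformed_outcome(rows, 0.5, 0)
tau_hat_x1 = mean_transformed_outcome(rows, 0.5, 1)
print(tau_hat_x0, tau_hat_x1)
```

No single value of the transformed outcome is a good estimate of a unit's effect, but its average within a subgroup is an unbiased proxy for that subgroup's effect. This is what makes it usable as a stand-in target when comparing predictions in a held-out sample.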
... The main idea that enables the bounded expected regret result is somewhat related to the idea of a technique called imputation, which has recently been popularized in the causal inference community ([1,18,2,5,20]): Suppose that arm $m$ is not the optimal arm for agent $j$, but it is optimal for a set of other agents, say $A_m$. If the total number of agents is large enough, we can express the feature vector of agent $j$ by a linear combination of feature vectors of agents in $A_m$. ...
... Input: $\{\alpha^{(j)}\}_{j \in A}$, where $\alpha^{(j)}$ denotes the feature vector of agent $j$; for $k = 1, 2, \ldots$ do: observe $s_k$ and $a_k$; for $m = 1, 2, \ldots, |M|$ do: compute ucb ...
Preprint
On typical modern platforms, users are only able to try a small fraction of the available items. This makes it difficult to model the exploration behavior of platform users as typical online learners who explore all the items. Towards addressing this issue, we propose to interpret a recommender system as a bandit exploration coordinator that provides counterfactual information updates. In particular, we introduce a novel algorithm called Counterfactual UCB (CFUCB), which guarantees user exploration coordination with bounded regret in the presence of linear representations. Our results show that sharing information is a Subgame Perfect Nash Equilibrium for agents in terms of regret, leading to each agent achieving bounded regret. This approach has potential applications in personalized recommender systems and adaptive experimentation.
... At present, two major categories of estimation techniques have been proposed in the literature, namely meta-learners and tailored methods (Zhang, Li, and Liu 2022). The first includes the Two-Model approach (Radcliffe 2007), the X-learner (Künzel et al. 2017) and the transformed outcome methods (Athey and Imbens 2015) which extend classical machine learning techniques. The second refers to direct uplift modeling such as uplift trees (Rzepakowski and Jaroszewicz 2010) and various neural network based methods (Louizos et al. 2017;Yoon, Jordon, and van der Schaar 2018), which modify the existing machine learning algorithms to estimate treatment effects. ...
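The Two-Model approach named in this excerpt can be sketched as follows: fit one outcome model on treated units, another on control units, and take the difference of their predictions as the estimated effect. This is a minimal illustration under stated assumptions, not code from any of the cited works. The one-dimensional least-squares fit, the simulated data (true effect $2x$), and the function names `fit_line` and `two_model_uplift` are all hypothetical choices made for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def two_model_uplift(xs, ws, ys):
    """Fit separate outcome models per arm; return their difference."""
    treated = [(x, y) for x, w, y in zip(xs, ws, ys) if w == 1]
    control = [(x, y) for x, w, y in zip(xs, ws, ys) if w == 0]
    a1, b1 = fit_line([x for x, _ in treated], [y for _, y in treated])
    a0, b0 = fit_line([x for x, _ in control], [y for _, y in control])
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)


# Simulated randomized data; the assumed true effect at x is 2 * x.
rng = random.Random(1)
xs = [rng.uniform(0.0, 1.0) for _ in range(20_000)]
ws = [rng.randint(0, 1) for _ in xs]
ys = [1.0 + 0.5 * x + w * 2.0 * x + rng.gauss(0.0, 0.5)
      for x, w in zip(xs, ws)]

uplift = two_model_uplift(xs, ws, ys)
print(uplift(0.5), uplift(1.0))
```

Any regression or classification method could replace `fit_line`. The design keeps the two fits independent, which is simple but means neither model is tuned for the difference itself.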
Preprint
In this manuscript, we propose causal inference based single-branch ensemble trees for uplift modeling, namely CIET. Unlike standard classification methods for predictive probability modeling, CIET aims to estimate the change in the predictive probability of the outcome caused by an action or a treatment. In CIET, two partition criteria are specifically designed to maximize the difference in outcome distribution between the treatment and control groups. Next, a novel single-branch tree is built by taking a top-down node partition approach, and the remaining samples are censored since they are not covered by the upper node partition logic. Repeating the tree-building process on the censored data, single-branch ensemble trees with a set of inference rules are thus formed. Moreover, CIET is experimentally demonstrated to significantly outperform previous approaches for uplift modeling in terms of both area under uplift curve (AUUC) and Qini coefficient. At present, CIET has already been applied to online personal loans in a national financial holdings group in China. CIET will also be of use to analysts applying machine learning techniques to causal inference in broader business domains such as web advertising, medicine and economics.
... Conditional average treatment effect. A vast number of approaches have been proposed for estimating the heterogeneity via the conditional average treatment effect (CATE) that quantifies the effect size of treatment on the outcome of interest given different confounders (see recent advances in Athey and Imbens, 2015;Shalit et al., 2017;Wager and Athey, 2018;Künzel et al., 2019;Nie and Wager, 2021;Farrell et al., 2021). Here, confounders used in CATE usually play the role of the moderators (Kraemer et al., 2002) which modify the impact of the treatment on the outcome as a presence of confounder-treatment interaction in modeling the outcome. ...
Preprint
Heterogeneity and comorbidity are two interwoven challenges associated with various healthcare problems that greatly hampered research on developing effective treatment and understanding of the underlying neurobiological mechanism. Very few studies have been conducted to investigate heterogeneous causal effects (HCEs) in graphical contexts due to the lack of statistical methods. To characterize this heterogeneity, we first conceptualize heterogeneous causal graphs (HCGs) by generalizing the causal graphical model with confounder-based interactions and multiple mediators. Such confounders with an interaction with the treatment are known as moderators. This allows us to flexibly produce HCGs given different moderators and explicitly characterize HCEs from the treatment or potential mediators on the outcome. We establish the theoretical forms of HCEs and derive their properties at the individual level in both linear and nonlinear models. An interactive structural learning is developed to estimate the complex HCGs and HCEs with confidence intervals provided. Our method is empirically justified by extensive simulations and its practical usefulness is illustrated by exploring causality among psychiatric disorders for trauma survivors.
... Yet, due to the heterogeneity of individuals in response to treatment/action options, there may not exist a uniformly optimal treatment across individuals. Thus, one major focus in machine learning is to assess the heterogeneous treatment effect (HTE) (see e.g., Athey & Imbens 2015, Shalit et al. 2017, Wager & Athey 2018, Künzel et al. 2019, Farrell et al. 2021) that measures the treatment lift within a specific group, as a fundamental component in a number of existing high-profile successes including policy optimization (Chakraborty & Moodie 2013, Greenewald et al. 2017) and policy evaluation (Swaminathan et al. 2017, Kallus 2018). Detecting such heterogeneity in panel data hence becomes an inevitable trend in the new era of personalization. ...
Preprint
In the new era of personalization, learning the heterogeneous treatment effect (HTE) becomes an inevitable trend with numerous applications. Yet, most existing HTE estimation methods focus on independently and identically distributed observations and cannot handle the non-stationarity and temporal dependency in the common panel data setting. The treatment evaluators developed for panel data, on the other hand, typically ignore the individualized information. To fill the gap, in this paper, we initiate the study of HTE estimation in panel data. Under different assumptions for HTE identifiability, we propose the corresponding heterogeneous one-side and two-side synthetic learners, namely H1SL and H2SL, by leveraging the state-of-the-art HTE estimator for non-panel data and generalizing the synthetic control method to allow a flexible data generating process. We establish the convergence rates of the proposed estimators. The superior performance of the proposed methods over existing ones is demonstrated by extensive numerical studies.
... First, the descriptive evidence based on the randomized ranking contributes to the empirical literature studying position effects. Studies in this stream of literature either do not consider heterogeneity (e.g., Ghose et al., 2012; Ursu, 2018), do consider it without exogenous variation in rankings (Ghose et al., 2014), or study it in the context of search advertising with contrasting results (Goldman and Rao, 2014; Athey and Imbens, 2015; Jeziorski and Segal, 2015; Jeziorski and Moorthy, 2018). Second, the paper adds to the literature examining the effects of different rankings on consumer welfare and search behavior (Ghose et al., 2012, 2014; De los Santos and Koulayev, 2017; Ursu, 2018; Choi and Mela, 2019; Zhang et al., 2021; Compiani et al., 2021). This paper differs from these studies in that it focuses on differences in the objectives of maximizing total revenues and consumer welfare, and the inherent connection to heterogeneous position effects. ...
Preprint
Most online retailers, search intermediaries, and platforms present products on product lists. By changing the ordering of products on these lists (the "ranking"), these online outlets can aim to improve consumer welfare or increase revenues. This paper studies to what degree these objectives differ from each other. First, I show that rankings can increase both revenues and consumer welfare by increasing the overall purchase probability through heterogeneous position effects. Second, I provide empirical evidence for this heterogeneity and quantify revenue and consumer welfare effects across different rankings. For the latter, I develop an estimation procedure for the search and discovery model of Greminger (2022) that yields a smooth likelihood function by construction. Comparing different counterfactual rankings shows that rankings targeting revenues often also increase consumer welfare. Moreover, these revenue-based rankings reduce consumer welfare only to a limited extent relative to a consumer-welfare-maximizing ranking.
Article
Uplift modeling refers to individual level causal inference. Existing research on the topic ignores one prevalent and important aspect: high class imbalance. For instance in online environments uplift modeling is used to optimally target ads and discounts, but very few users ever end up clicking an ad or buying. One common approach to deal with imbalance in classification is by undersampling the dataset. In this work, we show how undersampling can be extended to uplift modeling. We propose four undersampling methods for uplift modeling. We compare the proposed methods empirically and show when some methods have a tendency to break down. One key observation is that accounting for the imbalance is particularly important for uplift random forests, which explains the poor performance of the model in earlier works. Undersampling is also crucial for class-variable transformation based models.
Article
As a subfield of machine learning, reinforcement learning (RL) aims at optimizing decision making by using interaction samples of an agent with its environment and the potentially delayed feedbacks. In contrast to traditional supervised learning that typically relies on one-shot, exhaustive, and supervised reward signals, RL tackles sequential decision-making problems with sampled, evaluative, and delayed feedbacks simultaneously. Such a distinctive feature makes RL techniques a suitable candidate for developing powerful solutions in various healthcare domains, where diagnosing decisions or treatment regimes are usually characterized by a prolonged period with delayed feedbacks. By first briefly examining theoretical foundations and key methods in RL research, this survey provides an extensive overview of RL applications in a variety of healthcare domains, ranging from dynamic treatment regimes in chronic diseases and critical care, automated medical diagnosis, and many other control or scheduling problems that have infiltrated every aspect of the healthcare system. In addition, we discuss the challenges and open issues in the current research and highlight some potential solutions and directions for future research.
Preprint
A common task in empirical economics is to estimate *interaction effects* that measure how the effect of one variable $X$ on another variable $Y$ depends on a third variable $H$. This paper considers the estimation of interaction effects in linear panel models with a fixed number of time periods. There are at least two ways to estimate interaction effects in this setting, both common in applied work. Our theoretical results show that these two approaches are distinct, and only coincide under strong conditions on unobserved effect heterogeneity. Our empirical results show that the difference between the two approaches is large, leading to conflicting conclusions about the sign of the interaction effect. Taken together, our findings may guide the choice between the two approaches in empirical work.
Article
Selecting a set of features to include in a clinical prediction model is not always a simple task. The goals of creating parsimonious models with low complexity while, at the same time, upholding predictive performance by explaining a large proportion of the variance within the dependent variable must be balanced. With this aim, one must consider the clinical setting and what data are readily available to clinicians at specific timepoints, as well as more obvious aspects such as the availability of computational power and size of the training dataset. This chapter elucidates the importance and pitfalls in feature selection, focusing on applications in clinical prediction modeling. We demonstrate simple methods such as correlation-, significance-, and variable importance-based filtering, as well as intrinsic feature selection methods such as Lasso and tree- or rule-based methods. Finally, we focus on two algorithmic wrapper methods for feature selection that are commonly used in machine learning: Recursive Feature Elimination (RFE), which can be applied regardless of data and model type, as well as Purposeful Variable Selection as described by Hosmer and Lemeshow, specifically for generalized linear models.
Article
Randomized controlled trials play an important role in how internet companies predict the impact of policy decisions, marketing campaigns, and product changes. Heterogeneity in treatment effects refers to the fact that, in such 'digital experiments', different units (people, devices, products) respond differently to the applied treatment. This article presents a fast and scalable Bayesian nonparametric analysis of heterogeneity and its measurement in relation to observable covariates. The analysis leads to a novel estimator of heterogeneity that is based around the distribution of covariates pooled across treatment groups. Results are provided to assess commonly used schemes for variance reduction, and we argue that such schemes will only be useful in estimation of average treatment effects if the sources of heterogeneity are known in advance or can be learned across multiple experiments. We also describe how, without any prior knowledge, one can mine experiment data to discover patterns of heterogeneity and communicate these results in sparse low dimensional summaries. Throughout, the work is illustrated with a detailed example experiment involving 21 million unique users of eBay.com.
Article
The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates. Both large and small sample theory show that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates. Applications include: (i) matched sampling on the univariate propensity score, which is a generalization of discriminant matching, (ii) multivariate adjustment by subclassification on the propensity score where the same subclasses are used to estimate treatment effects for all outcome variables and in all subpopulations, and (iii) visual representation of multivariate covariance adjustment by a two-dimensional plot.
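Subclassification on the propensity score, application (ii) above, can be illustrated with a small simulation. The sketch below is a hedged example rather than the authors' procedure. It assumes two binary covariates, a propensity score that is known exactly (in practice it would be estimated), and a constant treatment effect of 2.0. The function names and all numeric values are hypothetical.

```python
import random
from collections import defaultdict


def propensity(x1, x2):
    """Assumed-known probability of treatment given the covariates."""
    return {0: 0.2, 1: 0.5, 2: 0.8}[x1 + x2]


def simulate(n, seed=0):
    """Observational data: treatment is more likely when covariates are high."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.randint(0, 1), rng.randint(0, 1)
        e = propensity(x1, x2)
        w = 1 if rng.random() < e else 0
        # Outcome depends on both covariates; assumed true effect is 2.0.
        y = 2.0 * x1 + 1.0 * x2 + 2.0 * w + rng.gauss(0.0, 1.0)
        rows.append((e, w, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_difference(rows):
    """Treated mean minus control mean, ignoring the covariates."""
    return (mean([y for _, w, y in rows if w == 1])
            - mean([y for _, w, y in rows if w == 0]))


def subclassified_effect(rows):
    """Average within-stratum differences, weighting by stratum size."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for e, w, y in rows:
        strata[e][w].append(y)          # one stratum per propensity value
    total = len(rows)
    effect = 0.0
    for groups in strata.values():
        size = len(groups[0]) + len(groups[1])
        effect += (size / total) * (mean(groups[1]) - mean(groups[0]))
    return effect


rows = simulate(n=60_000, seed=0)
naive = naive_difference(rows)
adjusted = subclassified_effect(rows)
print(naive, adjusted)
```

In this setup the naive comparison is biased upward because treated units tend to have higher covariate values, while stratifying on the scalar score alone recovers an estimate close to the assumed effect, even though the outcome depends on the two covariates with different weights.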
Article
her help in reviewing the experimental literature in political science, Arjun Shenoy for his help in reviewing the literature on public support for welfare, Lynn Vavreck for sharing data, and the facilities and staff of
Article
Book
Did mandatory busing programs in the 1970s increase the school achievement of disadvantaged minority youth? Does obtaining a college degree increase an individual's labor market earnings? Did the use of the butterfly ballot in some Florida counties in the 2000 presidential election cost Al Gore votes? If so, was the number of miscast votes sufficiently large to have altered the election outcome? At their core, these types of questions are simple cause-and-effect questions. Simple cause-and-effect questions are the motivation for much empirical work in the social sciences. This book presents a model and set of methods for causal effect estimation that social scientists can use to address causal questions such as these. The essential features of the counterfactual model of causality for observational data analysis are presented with examples from sociology, political science, and economics.
Article
Written by one of the preeminent researchers in the field, this book provides a comprehensive exposition of modern analysis of causation. It shows how causality has grown from a nebulous concept into a mathematical theory with significant applications in the fields of statistics, artificial intelligence, economics, philosophy, cognitive science, and the health and social sciences. Judea Pearl presents and unifies the probabilistic, manipulative, counterfactual, and structural approaches to causation and devises simple mathematical tools for studying the relationships between causal connections and statistical associations. The book will open the way for including causal analysis in the standard curricula of statistics, artificial intelligence, business, epidemiology, social sciences, and economics. Students in these fields will find natural models, simple inferential procedures, and precise mathematical definitions of causal concepts that traditional texts have evaded or made unduly complicated. The first edition of Causality has led to a paradigmatic change in the way that causality is treated in statistics, philosophy, computer science, social science, and economics. Cited in more than 5,000 scientific publications, it continues to liberate scientists from the traditional molds of statistical thinking. In this revised edition, Judea Pearl elucidates thorny issues, answers readers' questions, and offers a panoramic view of recent advances in this field of research. Causality will be of interest to students and professionals in a wide variety of fields. Anyone who wishes to elucidate meaningful relationships from data, predict effects of actions and policies, assess explanations of reported events, or form theories of causal understanding and causal speech will find this book stimulating and invaluable.
Article
Recursive partitioning is embedded into the general and well-established class of parametric models that can be fitted using M-type estimators (including maximum likelihood). An algorithm for model-based recursive partitioning is suggested for which the basic steps are: (1) fit a parametric model to a data set, (2) test for parameter instability over a set of partitioning variables, (3) if there is some overall parameter instability, split the model with respect to the variable associated with the highest instability, (4) repeat the procedure in each of the daughter nodes. The algorithm yields a partitioned (or segmented) parametric model that can effectively be visualized and that subject-matter scientists are used to analyzing and interpreting.
Book
Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?