Article

Machine Learning Tests for Effects on Multiple Outcomes

Authors: Jens Ludwig, Sendhil Mullainathan, and Jann Spiess

Abstract

A core challenge in the analysis of experimental data is that the impact of some intervention is often not entirely captured by a single, well-defined outcome. Instead, there may be a large number of outcome variables that are potentially affected and of interest. In this paper, we propose a data-driven approach rooted in machine learning to the problem of testing effects on such groups of outcome variables. It is based on two simple observations. First, the 'false-positive' problem that arises when testing effects on a group of outcomes is similar to the concern of 'over-fitting,' which has been the focus of a large literature in statistics and computer science. We can thus leverage sample-splitting methods from the machine-learning playbook that are designed to control over-fitting to ensure that statistical models express generalizable insights about treatment effects. The second simple observation is that the question whether treatment affects a group of variables is equivalent to the question whether treatment is predictable from these variables better than some trivial benchmark (provided treatment is assigned randomly). This formulation allows us to leverage data-driven predictors from the machine-learning literature to flexibly mine for effects, rather than rely on more rigid approaches like multiple-testing corrections and pre-analysis plans. We formulate a specific methodology and present three kinds of results: first, our test is exactly sized for the null hypothesis of no effect; second, a specific version is asymptotically equivalent to a benchmark joint Wald test in a linear regression; and third, this methodology can guide inference on where an intervention has effects. Finally, we argue that our approach can naturally deal with typical features of real-world experiments, and be adapted to baseline balance checks.
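To make the logic concrete, here is a rough sketch of such a test (not the authors' exact procedure): a flexible classifier predicts treatment from the group of outcomes on a held-out split, and the held-out prediction error is compared against a permutation benchmark. The predictor (a random forest), the 50/50 split, and all names are illustrative assumptions.

```python
# Hedged sketch of a treatment-predictability test (illustrative, not the
# paper's exact procedure). Y: (n, k) numpy array of outcomes; T: length-n
# 0/1 numpy array of randomized treatment assignments.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def ml_multiple_outcome_test(Y, T, n_perm=199, seed=0):
    rng = np.random.default_rng(seed)

    def heldout_mse(t):
        # Keep the split fixed so only the labeling changes across permutations.
        Y_tr, Y_te, t_tr, t_te = train_test_split(Y, t, test_size=0.5, random_state=seed)
        clf = RandomForestClassifier(n_estimators=200, random_state=seed)
        clf.fit(Y_tr, t_tr)
        p_hat = clf.predict_proba(Y_te)[:, 1]
        return np.mean((t_te - p_hat) ** 2)

    observed = heldout_mse(T)
    # Under random assignment and no effect on Y, a relabeled T is as predictable
    # as the true one, so the observed error should look like a permutation draw.
    perm = np.array([heldout_mse(rng.permutation(T)) for _ in range(n_perm)])
    p_value = (1 + np.sum(perm <= observed)) / (1 + n_perm)
    return observed, p_value
```

A small held-out error relative to the permutation draws (a small p-value) indicates that the outcomes jointly carry information about treatment, i.e. evidence of an effect on the group.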

... We improve upon these previous studies by considering more and different interventions (both brochures and videos) and a larger sample. As a methodological advancement, we use machine learning approaches by Belloni et al. (2014), , and Ludwig et al. (2017) for conducting robustness checks, finding effect heterogeneities, and accounting for multiple hypothesis testing, respectively. 4 John et al. (2014) conduct an experiment with task performance among US students and find that the awareness that others realize higher pay rates for comparable tasks increases dishonesty, likely due to fairness concerns about pay-rate differentials. ...
... In the same vein, Corbacho et al. (2016) find for an information experiment in Costa Rica that 2 A large body of empirical literature suggests that women tend to be less corrupt; see Dimant and Tosato (2017), Dollar et al. (2001), Frank et al. (2011), Rivas (2013), and Swamy et al. (2001). 3 See, for instance, Wolf (2005, 2016), Lehrer et al. (2016), and Ludwig et al. (2017) for inference methods that account or correct for multiple hypothesis testing. 4 Alternatively, issues of multiple hypothesis testing could have been addressed by means of a pre-analysis plan outlining the methodology to be used for analyzing the data prior to collecting them, see for instance Nosek et al. (2018) for an in-depth discussion. ...
... We therefore also consider joint hypothesis tests for treatment effects on groups of outcomes defined by specific survey questions. To this end, we employ the procedure by Ludwig et al. (2017), which is based on predicting the treatment from the full set of outcome variables using machine learning. ...
Article
Full-text available
This paper examines how anti-corruption educational campaigns affect the attitudes of Russian university students toward corruption and academic integrity in the short run. About 2000 survey participants were randomly assigned to one of four different information materials (brochures or videos) about the negative consequences of corruption or to a control group. While we do not find important effects in the full sample, applying machine learning methods for detecting effect heterogeneity suggests that some subgroups of students might react to the same information differently, although statistical significance mostly vanishes when accounting for multiple hypothesis testing. Taking the point estimates at face value, students who commonly plagiarize appear to develop stronger negative attitudes toward corruption in the aftermath of our intervention. Unexpectedly, some information materials seem to induce more tolerant views on corruption among those who plagiarize less frequently and in the group of male students, while the effects on female students are generally close to zero. Therefore, policy makers aiming to implement anti-corruption education at a larger scale should scrutinize the possibility of (undesired) heterogeneous effects across student groups.
... Our findings about nonselective attrition are also corroborated when using a machine learning-based test to investigate the joint balance of all covariates together, separately for each treatment-control comparison. To this end, we apply an approach suggested by Ludwig, Mullainathan, and Spiess (2017). It is based on the intuition that the problem of obtaining too many significant differences when testing multiple hypotheses (e.g. ...
... In our case, the question is whether the treatment can be predicted by the covariates, which would point to imbalances. 16 We thus follow Ludwig et al. (2017), who propose applying the machine learning logic to the context of multiple testing, and split our data into training and testing data. In the training data, we run a lasso logit regression of the respective treatment (vs. ...
... We use 5-fold cross-validation, such that the roles of training and test data are swapped, and take the average of the 5 mean squared errors obtained (in order to reduce its variance). In the next step, we randomly relabel (or permute) the treatment variables and re-estimate the MSE using the same procedure; see Ludwig et al. (2017). Repeating the permutation 999 times, we compute the p-value for the joint significance of the covariates as the share of permutation based ...
... Furthermore, we ran machine learning-based tests for assessing balance jointly for all covariates across treatments using the approach of Ludwig, Mullainathan, and Spiess (2017). The authors point out that problems of obtaining too many significant differences by testing several hypotheses are tantamount to overfitting (that is, including too many regressors while predicting a variable) in machine learning. ...
... Furthermore, we randomly relabeled the treatment variables and reestimated the MSE based on the same procedure (cf. Ludwig, Mullainathan, and Spiess 2017). We repeated the permutation 999 times to compute the p-value for the joint significance of the covariates as the share of permutation-based MSEs that are lower than the MSE with the correct coding of the treatment. ...
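Putting the quoted steps together, a minimal sketch of this balance check could look as follows; it assumes numpy arrays X (covariates) and D (a 0/1 indicator for one treatment-control comparison), and the lasso-logit penalty setting is an illustrative choice rather than that of the cited papers.

```python
# Hedged sketch: lasso-logit prediction of a treatment indicator D from
# covariates X, 5-fold cross-validated MSE, and 999 random relabelings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cv_mse(X, D, seed=0):
    """Average held-out MSE over 5 folds of a lasso-logit prediction of D from X."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    mses = []
    for tr, te in kf.split(X):
        model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        model.fit(X[tr], D[tr])
        p_hat = model.predict_proba(X[te])[:, 1]
        mses.append(np.mean((D[te] - p_hat) ** 2))
    return np.mean(mses)

def balance_pvalue(X, D, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    observed = cv_mse(X, D, seed)
    perm_mses = np.array([cv_mse(X, rng.permutation(D), seed) for _ in range(n_perm)])
    # p-value: share of permutation MSEs below the MSE under the true labeling;
    # a small value means the covariates predict treatment "too well" (imbalance).
    return np.mean(perm_mses < observed)
```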
Article
Full-text available
This paper presents the outcomes of an online coin-tossing experiment evaluating cheating behavior among Ukrainian students. Over 1,500 participants were asked to make ten coin tosses and were randomly assigned to one of three treatment groups: (1) tossing coins online, (2) tossing coins manually, or (3) choosing between tossing manually or online. The study outcomes suggest that students are more inclined to cheat when they perceive the coin toss to be more “private.” Moreover, the students’ attitudes toward corruption appear to matter for the extent of their cheating, while socio-demographic characteristics were less important.
... Rizzo (2009a, 2009b) developed the energy test, another nonparametric test for equality of multivariate distributions. Still other methods include Hansen and Bowers (2008), Heller, Heller and Gorfine (2013), Cattaneo, Frandsen and Titiunik (2015), Chen and Small (2016), Ludwig, Mullainathan and Spiess (2017), Gretton et al. (2012), Romano (1989), and Taskinen, Oja and Randles (2005). ...
Article
Text data is ultra-high dimensional, which makes machine learning techniques indispensable for textual analysis. Text is often selected: journalists, speechwriters, and others craft messages to target their audiences’ limited attention. We develop an economically motivated high dimensional selection model that improves learning from text (and from sparse counts data more generally). Our model is especially useful when the choice to include a phrase is more interesting than the choice of how frequently to repeat it. It allows for parallel estimation, making it computationally scalable. A first application revisits the partisanship of US congressional speech. We find that earlier spikes in partisanship manifested in increased repetition of different phrases, whereas the upward trend starting in the 1990s is due to distinct phrase selection. Additional applications show how our model can backcast, nowcast, and forecast macroeconomic indicators using newspaper text, and that it substantially improves out-of-sample fit relative to alternative approaches.
Article
The lack of adequate measures is often an impediment to robust policy evaluation. We discuss three approaches to measurement and data usage that have the potential to improve the way we conduct impact evaluations. First, the creation of new measures, when no adequate ones are available. Second, the use of multiple measures when a single one is not appropriate. And third, the use of machine learning algorithms to evaluate and understand programme impacts. We motivate the relevance of each of the categories by providing examples where they have proved useful in the past. We discuss the challenges and risks involved in each strategy and conclude with an outline of promising directions for future work.
Article
Full-text available
Researchers frequently test identifying assumptions in regression-based research designs (which include instrumental variables or difference-in-differences models) by adding additional control variables on the right-hand side of the regression. If such additions do not affect the coefficient of interest (much), a study is presumed to be reliable. We caution that such invariance may result from the fact that the observed variables used in such robustness checks are often poor measures of the potential underlying confounders. In this case, a more powerful test of the identifying assumption is to put the variable on the left-hand side of the candidate regression. We provide derivations for the estimators and test statistics involved, as well as power calculations, which can help applied researchers interpret their findings. We illustrate these results in the context of various strategies which have been suggested to identify the returns to schooling.
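As a minimal illustration of this left-hand-side check (function and variable names are mine, not the authors'), one can regress the candidate control variable on the treatment or instrument, plus any baseline controls, and test that coefficient directly:

```python
# Hypothetical sketch: regress a candidate control z on the treatment or
# instrument d (and optional baseline controls W) and test the coefficient on d,
# instead of checking coefficient stability after adding z on the right-hand side.
import numpy as np
import statsmodels.api as sm

def lhs_balance_test(z, d, W=None):
    X = d.reshape(-1, 1) if W is None else np.column_stack([d, W])
    fit = sm.OLS(z, sm.add_constant(X)).fit()
    return fit.params[1], fit.pvalues[1]   # coefficient on d and its p-value
```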
Article
Full-text available
The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses – the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
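For concreteness, a short sketch of the step-up procedure described in the abstract, at false discovery rate level q (function and variable names are mine):

```python
# Hedged sketch of the sequential (step-up) FDR-controlling procedure:
# reject all hypotheses up to the largest rank i with p_(i) <= i*q/m.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks from smallest to largest p
    below = p[order] <= q * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest rank satisfying the bound
        rejected[order[: k + 1]] = True
    return rejected
```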
Article
I follow R. A. Fisher's The Design of Experiments (1935), using randomization statistical inference to test the null hypothesis of no treatment effects in a comprehensive sample of 53 experimental papers drawn from the journals of the American Economic Association. In the average paper, randomization tests of the significance of individual treatment effects find 13% to 22% fewer significant results than are found using authors’ methods. In joint tests of multiple treatment effects appearing together in tables, randomization tests yield 33% to 49% fewer statistically significant results than conventional tests. Bootstrap and jackknife methods support and confirm the randomization results.
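A minimal sketch of such a randomization test for a single outcome is below; the difference-in-means statistic, the simple relabeling of assignments (which stands in for re-drawing the experimental design), and the number of draws are illustrative assumptions.

```python
# Hedged sketch of a randomization test of the sharp null of no treatment effect.
# y: outcome vector; d: 0/1 assignment vector (both numpy arrays).
import numpy as np

def randomization_test(y, d, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)

    def diff_in_means(assign):
        return y[assign == 1].mean() - y[assign == 0].mean()

    observed = diff_in_means(d)
    perm = np.array([diff_in_means(rng.permutation(d)) for _ in range(n_perm)])
    # Two-sided p-value: share of relabeled statistics at least as extreme.
    return (1 + np.sum(np.abs(perm) >= abs(observed))) / (1 + n_perm)
```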
Article
Machines are increasingly doing "intelligent" things. Face recognition algorithms use a large dataset of photos labeled as having a face or not to estimate a function that predicts the presence y of a face from pixels x. This similarity to econometrics raises questions: How do these new empirical tools fit with what we know? As empirical economists, how can we use them? We present a way of thinking about machine learning that gives it its own place in the econometric toolbox. Machine learning not only provides new tools, it solves a different problem. Specifically, machine learning revolves around the problem of prediction, while many economic applications revolve around parameter estimation. So applying machine learning to economics requires finding relevant tasks. Machine learning algorithms are now technically easy to use: you can download convenient packages in R or Python. This also raises the risk that the algorithms are applied naively or their output is misinterpreted. We hope to make them conceptually easier to use by providing a crisper understanding of how these algorithms work, where they excel, and where they can stumble—and thus where they can be most usefully applied.
Article
The Moving to Opportunity (MTO) experiment offered randomly selected families housing vouchers to move from high-poverty housing projects to lower-poverty neighborhoods. We analyze MTO's impacts on children's long-term outcomes using tax data. We find that moving to a lower-poverty neighborhood when young (before age 13) increases college attendance and earnings and reduces single parenthood rates. Moving as an adolescent has slightly negative impacts, perhaps because of disruption effects. The decline in the gains from moving with the age when children move suggests that the duration of exposure to better environments during childhood is an important determinant of children's long-term outcomes. (JEL I31, I38, J13, R23, R38).
Article
Most empirical policy work focuses on causal inference. We argue an important class of policy problems does not require causal inference but instead requires predictive inference. Solving these “prediction policy problems” requires more than simple regression techniques, since these are tuned to generating unbiased estimates of coefficients rather than minimizing prediction error. We argue that new developments in the field of “machine learning” are particularly useful for addressing these prediction problems. We use an example from health policy to illustrate the large potential social welfare gains from improved prediction.
Article
Applied researchers are increasingly interested in whether and how treatment effects vary in randomized evaluations, especially variation not explained by observed covariates. We propose a model-free approach for testing for the presence of such unexplained variation. To use this randomization-based approach, we must address the fact that the average treatment effect, generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. We finally apply our method to the National Head Start Impact Study, a large-scale randomized evaluation of a Federal preschool program, finding that there is indeed significant unexplained treatment effect variation.
Article
Methods for constructing simultaneous confidence intervals for all possible linear contrasts among several means of normally distributed variables have been given by Scheffé and Tukey. In this paper the possibility is considered of picking in advance a number (say m) of linear contrasts among k means, and then estimating these m linear contrasts by confidence intervals based on a Student t statistic, in such a way that the overall confidence level for the m intervals is greater than or equal to a preassigned value. It is found that for some values of k, and for m not too large, intervals obtained in this way are shorter than those using the F distribution or the Studentized range. When this is so, the experimenter may be willing to select the linear combinations in advance which he wishes to estimate in order to have m shorter intervals instead of an infinite number of longer intervals.
Article
For rectangular confidence regions for the mean values of multivariate normal distributions the following conjecture of O. J. Dunn [3], [4] is proved: Such a confidence region constructed for the case of independent coordinates is, at the same time, a conservative confidence region for any case of dependent coordinates. This result is based on an inequality for the probabilities of rectangles in normal distributions, which permits one to factor out the probability for any single coordinate.
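In symbols (notation chosen here for illustration, with m coordinates and overall level 1 - alpha), the result states:

```latex
% Sidak's inequality for rectangular regions under a multivariate normal law,
% and the per-coordinate confidence level it implies.
P\!\left(\bigcap_{i=1}^{m}\{|X_i-\mu_i|\le c_i\}\right)
  \;\ge\; \prod_{i=1}^{m} P\big(|X_i-\mu_i|\le c_i\big),
\qquad\text{so per-coordinate level } (1-\alpha)^{1/m}
\text{ guarantees joint coverage of at least } 1-\alpha .
```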
Article
This paper presents a simple and widely applicable multiple test procedure of the sequentially rejective type, i.e. hypotheses are rejected one at a time until no further rejections can be done. It is shown that the test has a prescribed level of significance protection against error of the first kind for any combination of true hypotheses. The power properties of the test and a number of possible applications are also discussed.
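A short sketch of the sequentially rejective (step-down) procedure at familywise level alpha (function and variable names are mine):

```python
# Hedged sketch: test hypotheses from the smallest p-value upward, comparing the
# k-th smallest p-value to alpha/(m - k + 1), and stop at the first failure.
import numpy as np

def holm(pvals, alpha=0.05):
    p = np.asarray(pvals, dtype=float)
    m = p.size
    rejected = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break
    return rejected
```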
Article
Reverse regression has recently been proposed to assess discrimination by gender or race. We consider several stochastic models and find one that justifies reverse regression. Testable implications are deduced, and the analysis is illustrated with empirical material.
Article
We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m^2) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
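A small sketch of the (biased) squared-MMD statistic with a Gaussian kernel is given below; the median-heuristic bandwidth is my illustrative choice, not something specified in the abstract.

```python
# Hedged sketch of the squared maximum mean discrepancy (MMD^2) between samples
# X (m, d) and Y (n, d) in an RKHS induced by a Gaussian kernel.
import numpy as np

def mmd2(X, Y, bandwidth=None):
    Z = np.vstack([X, Y])
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    if bandwidth is None:
        bandwidth = np.sqrt(np.median(sq[sq > 0]) / 2)           # median heuristic
    K = np.exp(-sq / (2 * bandwidth ** 2))                       # Gaussian kernel matrix
    m = len(X)
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    # Biased V-statistic estimate: within-sample similarity minus cross-sample similarity.
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
```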
Article
Tests of the overall null hypothesis in datasets with one outcome variable and many covariates can be based on various methods to combine the p-values for univariate tests of association of each covariate with the outcome. The overall p-value is computed by permuting the outcome variable. We discuss the situations in which this approach is useful and provide several examples. We use simulations to investigate seven omnibus test statistics and find that the Anderson-Darling and Fisher's statistics are superior to the others.
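As an illustration of this combination approach, the sketch below computes Fisher's combined statistic over per-covariate p-values and calibrates it by permuting the outcome; the per-covariate correlation test is my illustrative choice.

```python
# Hedged sketch of an omnibus test that combines per-covariate p-values with
# Fisher's statistic and computes the overall p-value by permuting the outcome y.
# X: (n, p) covariate matrix; y: length-n outcome vector (numpy arrays).
import numpy as np
from scipy import stats

def fisher_omnibus(X, y, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)

    def fisher_stat(outcome):
        logs = []
        for j in range(X.shape[1]):
            _, pv = stats.pearsonr(X[:, j], outcome)   # univariate association test
            logs.append(np.log(pv))
        return -2 * np.sum(logs)                       # Fisher's combination statistic

    observed = fisher_stat(y)
    perm = np.array([fisher_stat(rng.permutation(y)) for _ in range(n_perm)])
    # Overall p-value: share of permuted statistics at least as large as observed.
    return (1 + np.sum(perm >= observed)) / (1 + n_perm)
```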
Article
In econometric applications, often several hypothesis tests are carried out at once. The problem then becomes how to decide which hypotheses to reject, accounting for the multitude of tests. This paper suggests a stepwise multiple testing procedure that asymptotically controls the familywise error rate. Compared to related single-step methods, the procedure is more powerful and often will reject more false hypotheses. In addition, we advocate the use of studentization when feasible. Unlike some stepwise methods, the method implicitly captures the joint dependence structure of the test statistics, which results in increased ability to detect false hypotheses. The methodology is presented in the context of comparing several strategies to a common benchmark. However, our ideas can easily be extended to other contexts where multiple tests occur. Some simulation studies show the improvements of our methods over previous proposals. We also provide an application to a set of real data.