Donald B. Rubin
Harvard University · Department of Statistics

Doctor of Philosophy

About

464 Publications
102,900 Reads
257,164 Citations

Publications (464)
Preprint
Consider a situation with two treatments, the first of which is randomized but the second is not, and the multifactor version of this. Interest is in treatment effects, defined using standard factorial notation. We define estimators for the treatment effects and explore their properties when there is information about the nonrandomized treatment as...
Article
Full-text available
We describe a new method to combine propensity‐score matching with regression adjustment in treatment‐control studies when outcomes are binary by multiply imputing potential outcomes under control for the matched treated subjects. This enables the estimation of clinically meaningful measures of effect such as the risk difference. We used Monte Carl...
Article
Basic propensity score methodology is designed to balance the distributions of multivariate pre-treatment covariates when comparing one active treatment with one control treatment. However, practical settings often involve comparing more than two treatments, where more complicated contrasts than the basic treatment-control one, (1,−1), are relevant...
Article
Full-text available
Formal guidelines for statistical reporting of non-randomized studies are important for journals that publish results of such studies. Although it is gratifying to see some journals providing guidelines for statistical reporting, we feel that the current guidelines that we have seen are not entirely adequate when the study is used to draw causal co...
Article
Full-text available
Blocking is commonly used in randomized experiments to increase efficiency of estimation. A generalization of blocking removes allocations with imbalance in covariate distributions between treated and control units, and then randomizes within the remaining set of allocations with balance. This idea of rerandomization was formalized by Morgan and Ru...
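A minimal sketch of this rerandomization scheme, assuming the Mahalanobis-distance balance criterion of Morgan and Rubin (2012); the function name, threshold a, and retry cap are illustrative choices, not taken from the paper:

```python
import numpy as np

def rerandomize(X, n_treated, a, rng=None, max_tries=10_000):
    """Redraw assignments until the Mahalanobis distance between
    treated and control covariate means falls below the threshold a."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    for _ in range(max_tries):
        w = np.zeros(n, dtype=bool)
        w[rng.choice(n, size=n_treated, replace=False)] = True
        diff = X[w].mean(axis=0) - X[~w].mean(axis=0)
        # Morgan-Rubin balance statistic; chi-squared under complete randomization
        M = (n_treated * (n - n_treated) / n) * diff @ S_inv @ diff
        if M <= a:
            return w
    raise RuntimeError("no acceptable allocation found; raise a or max_tries")

X = np.random.default_rng(0).normal(size=(100, 5))
assignment = rerandomize(X, n_treated=50, a=2.0, rng=1)
```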
Article
Full-text available
The weaponization of digital communications and social media to conduct disinformation campaigns at immense scale, speed, and reach presents new challenges to identify and counter hostile influence operations (IOs). This paper presents an end-to-end framework to automate detection of disinformation narratives, networks, and influential actors. The...
Article
Full-text available
We used a randomized crossover experiment to estimate the effects of ozone (vs. clean air) exposure on genome-wide DNA methylation of target bronchial epithelial cells, using 17 volunteers, each randomly exposed on two separated occasions to clean air or 0.3-ppm ozone for two hours. Twenty-four hours after exposure, participants underwent bronchosc...
Article
On 10 September 2020, C. R. Rao turns 100. Bradley Efron, Shun‐ichi Amari, Donald B. Rubin, Arni S. R. Srinivasa Rao and David R. Cox reflect on his contributions to statistics...
Article
Full-text available
In randomized experiments, Fisher-exact P values are available and should be used to help evaluate results rather than the more commonly reported asymptotic P values. One reason is that using the latter can effectively alter the question being addressed by including irrelevant distributional assumptions. The Fisherian statistical framework, propose...
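As a compact illustration of the Fisherian framework described above, a Monte Carlo approximation to the Fisher-exact P value under the sharp null of no effect for any unit; names and the difference-in-means test statistic are illustrative:

```python
import numpy as np

def fisher_randomization_p(y, w, n_draws=100_000, seed=0):
    """P value for the sharp null Y_i(1) = Y_i(0) for all i,
    using the difference in means as the test statistic."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=bool)
    observed = y[w].mean() - y[~w].mean()
    n, n_t = len(y), int(w.sum())
    count = 0
    for _ in range(n_draws):
        w_star = np.zeros(n, dtype=bool)
        w_star[rng.choice(n, size=n_t, replace=False)] = True
        count += abs(y[w_star].mean() - y[~w_star].mean()) >= abs(observed)
    return (count + 1) / (n_draws + 1)  # add-one correction keeps the P value valid
```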
Preprint
Basic propensity score methodology is designed to balance multivariate pre-treatment covariates when comparing one active treatment with one control treatment. Practical settings often involve comparing more than two treatments, where more complicated contrasts than the basic treatment-control one, (1,-1), are relevant. Here, we propose the use of c...
Preprint
Full-text available
The weaponization of digital communications and social media to conduct disinformation campaigns at immense scale, speed, and reach presents new challenges to identify and counter hostile influence operations (IO). This paper presents an end-to-end framework to automate detection of disinformation narratives, networks, and influential actors. The f...
Article
Full-text available
A catalytic prior distribution is designed to stabilize a high-dimensional “working model” by shrinking it toward a “simplified model.” The shrinkage is achieved by supplementing the observed data with a small amount of “synthetic data” generated from a predictive distribution under the simpler model. We apply this framework to generalized linear m...
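A rough sketch of the catalytic-prior idea for logistic regression, under stated assumptions: the simplified model is intercept-only, synthetic covariates are resampled from the observed rows, and the synthetic sample size M and total prior weight tau are illustrative choices rather than recommendations from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 30, 10                        # small n relative to p: unstable MLE
X = rng.normal(size=(n, p))
y = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

M, tau = 400, 4.0                    # synthetic sample size, total prior weight
X_syn = X[rng.integers(n, size=M)]   # resample covariates from the observed data
y_syn = (rng.random(M) < y.mean()).astype(int)  # draws from intercept-only model

X_aug = np.vstack([X, X_syn])
y_aug = np.concatenate([y, y_syn])
weights = np.concatenate([np.ones(n), np.full(M, tau / M)])

# Large C approximates an unpenalized fit; the stabilization comes from the
# down-weighted synthetic rows, not from ridge shrinkage.
fit = LogisticRegression(C=1e6, max_iter=1000).fit(X_aug, y_aug,
                                                   sample_weight=weights)
```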
Article
Full-text available
The general reliance on blinded placebo-controlled randomized trials, both to approve drugs and to set their recommended dosages, although statistically sound for some purposes, may be statistically naïve in the context of guiding general medical practice. Briefly, the reason is that medical prescriptions are unblinded, and so patients who receive...
Article
Causal inference refers to the process of inferring what would happen in the future if we change what we are doing, or inferring what would have happened in the past, if we had done something different in the distant past. Humans adjust our behaviors by anticipating what will happen if we act in different ways, using past experiences to inform thes...
Preprint
With many pretreatment covariates and treatment factors, the classical factorial experiment often fails to balance covariates across multiple factorial effects simultaneously. Therefore, it is intuitive to restrict the randomization of the treatment factors to satisfy certain covariate balance criteria, possibly conforming to the tiers of factorial...
Article
Full-text available
Matching on an estimated propensity score is frequently used to estimate the effects of treatments from observational data. Since the 1970s, different authors have proposed methods to combine matching at the design stage with regression adjustment at the analysis stage when estimating treatment effects for continuous outcomes. Previous work has con...
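A minimal sketch combining the two stages described above: one-to-one greedy nearest-neighbor matching on an estimated propensity score, followed by regression adjustment within the matched sample. Function and variable names are illustrative, and this assumes more controls than treated units:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def ps_match_then_adjust(X, w, y):
    """Design stage: match on the estimated propensity score.
    Analysis stage: regression-adjust within the matched sample."""
    e = LogisticRegression(max_iter=1000).fit(X, w).predict_proba(X)[:, 1]
    treated, controls = np.where(w == 1)[0], np.where(w == 0)[0]
    available, pairs = set(controls), []
    for t in treated:                       # greedy nearest-neighbor matching
        c = min(available, key=lambda j: abs(e[j] - e[t]))
        available.remove(c)
        pairs.append((t, c))
    idx = np.array([i for pair in pairs for i in pair])
    Z = np.column_stack([w[idx], X[idx]])   # treatment indicator plus covariates
    return LinearRegression().fit(Z, y[idx]).coef_[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w = (rng.random(200) < 0.3).astype(int)
y = X @ np.array([1.0, 0.5, -0.5]) + 2.0 * w + rng.normal(size=200)
print(ps_match_then_adjust(X, w, y))        # estimated effect (truth here: 2.0)
```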
Conference Paper
Full-text available
Estimating influence on social media networks is an important practical and theoretical problem, especially because this new medium is widely exploited as a platform for disinformation and propaganda. This paper introduces a novel approach to influence estimation on social media networks and applies it to the real-world problem of characterizing ac...
Chapter
Although randomized controlled trials (RCTs) are generally considered the gold standard for estimating causal effects, for example of pharmaceutical treatments, the valid analysis of RCTs is more complicated with human units than with plants and other such objects. One potential complication that arises with human subjects is the possible existence...
Article
Estimating influence on social media networks is an important practical and theoretical problem, especially because this new medium is widely exploited as a platform for disinformation and propaganda. This paper introduces a novel approach to influence estimation on social media networks and applies it to the real-world problem of characterizing ac...
Article
Full-text available
Several statistical agencies have started to use multiply-imputed synthetic microdata to create public-use data in major surveys. The purpose of doing this is to protect the confidentiality of respondents’ identities and sensitive attributes, while allowing standard complete-data analyses of microdata. A key challenge, faced by advocates of synthet...
Article
The health effects of environmental exposures have been studied for decades, typically using standard regression models to assess exposure-outcome associations found in observational non-experimental data. We propose and illustrate a different approach to examine causal effects of environmental exposures on health outcomes from observational data....
Article
The seminal work of Morgan and Rubin (2012) considers rerandomization for a sample of size 2N units all randomized at one time. In practice, however, experimenters may have to rerandomize units sequentially. For example, a clinician studying a rare disease may be unable to wait to perform an experiment until all 2N experimental units are recruited...
Article
Blinded randomized controlled trials (RCT) require participants to be uncertain if they are receiving a treatment or placebo. Although uncertainty is ideal for isolating the treatment effect from all other potential effects, it is poorly suited for estimating the treatment effect under actual conditions of intended use-when individuals are certain...
Article
This article considers causal inference for treatment contrasts from a randomized experiment using potential outcomes in a finite population setting. Adopting a Neymanian repeated sampling approach that integrates such causal inference with finite population survey sampling, an inferential framework is developed for general mechanisms of assigning...
Article
A few years ago, the New York Department of Education (NYDE) was planning to conduct an experiment involving five new intervention programs for a selected set of New York City high schools. The goal was to estimate the causal effects of these programs and their interactions on the schools’ performance. For each of the schools, about 50 premeasured...
Article
The wise use of statistical ideas in practice essentially requires some Bayesian thinking, in contrast to the classical rigid frequentist dogma. This dogma too often has seemed to influence the applications of statistics, even at agencies like the FDA. Greg Campbell was one of the most important advocates there for more nuanced modes of thought, es...
Article
Although complete randomization ensures covariate balance on average, the chance for observing significant differences between treatment and control covariate distributions is high especially with many covariates. Rerandomization discards randomizations that do not satisfy a pre-determined covariate balance criterion, generally resulting in better...
Article
Full-text available
Data analyses typically rely upon assumptions about missingness mechanisms that lead to observed versus missing data. When the data are missing not at random, direct assumptions about the missingness mechanism, and indirect assumptions about the distributions of observed and missing data, are typically untestable. We explore an approach, where the...
Article
For likelihood-based inferences from data with missing values, models are generally needed for both the data and the missing-data mechanism. However, modeling the mechanism is challenging, and parameters are often poorly identified. Rubin (1976) showed that for likelihood and Bayesian inference, sufficient conditions for ignoring the missing data m...
Article
Many empirical settings involve the specification of models leading to complicated likelihood functions, for example, finite mixture models that arise in causal inference when using Principal Stratification (PS). Traditional asymptotic results cannot be trusted for the associated likelihood functions, whose logarithms are not close to being quadrat...
Article
The problem of missing data is ubiquitous in the social and behavioral sciences. We begin by distinguishing between the pattern of missing data and the mechanism that creates the missing data. Then we consider common, but limited, approaches: complete-case analysis, nonresponse weighting, and single imputation. We then discuss more principled metho...
Article
Full-text available
Factorial designs are widely used in agriculture, engineering, and the social sciences to study the causal effects of several factors simultaneously on a response. The objective of such a design is to estimate all factorial effects of interest, which typically include main effects and interactions among factors. To estimate factorial effects with h...
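As a small worked illustration of estimating factorial effects (here for a 2^3 design, with placeholder responses): each main effect or interaction is a contrast of mean responses read off the ±1 model matrix:

```python
import numpy as np
from itertools import combinations, product

runs = np.array(list(product([-1, 1], repeat=3)))  # the 8 runs of a 2^3 design
y = np.random.default_rng(0).normal(size=8)        # placeholder responses

effects = {}
for k in range(1, 4):
    for cols in combinations(range(3), k):
        contrast = runs[:, cols].prod(axis=1)        # main effect or interaction
        effects[cols] = contrast @ y / (len(y) / 2)  # mean(high) - mean(low)
```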
Article
In randomized experiments, the random assignment of units to treatment groups justifies many of the widely used traditional analysis methods for evaluating causal effects. Specifying subgroups of units for further examination after observing outcomes, however, may partially nullify any advantages of randomized assignment when data are analyzed naiv...
Article
We clarify the key concept of missingness at random in incomplete data analysis. We first distinguish between data being missing at random and the missingness mechanism being a missing-at-random one, which we call missing always at random and which is more restrictive. We further discuss how, in general, neither of these conditions is a statement a...
Article
When conducting a randomized experiment, if an allocation yields treatment groups that differ meaningfully with respect to relevant covariates, groups should be rerandomized. The process involves specifying an explicit criterion for whether an allocation is acceptable, based on a measure of covariate balance, and rerandomizing units until an accept...
Article
We examine the possible consequences of a change in law school admissions in the United States from an affirmative action system based on race to one based on socioeconomic class. Using data from the 1991-1996 Law School Admission Council Bar Passage Study, students were reassigned by simulation to law school tiers by transferring the af...
Article
Health and medical data are increasingly being generated, collected, and stored in electronic form in healthcare facilities and administrative agencies. Such data hold a wealth of information vital to effective health policy development and evaluation, as well as to enhanced clinical care through evidence-based practice and safety and quality monit...
Article
By 'partially post hoc' subgroup analyses, we mean analyses that compare existing data from a randomized experiment-from which a subgroup specification is derived-to new, subgroup-only experimental data. We describe a motivating example in which partially post hoc subgroup analyses instigated statistical debate about a medical device's efficacy. We...
Article
Estimation of causal effects in non-randomized studies comprises two distinct phases: design, without outcome data, and analysis of the outcome data according to a specified protocol. Recently, Gutman and Rubin (2013) proposed a new analysis-phase method for estimating treatment effects when the outcome is binary and there is only one covariate, wh...
Chapter
Imputation, that is, filling in a value for each missing datum, is a common approach to handling missing data that has many desirable properties. For example, it allows the application of standard analytic methods that have been developed for complete data sets. Moreover, in the context of the production of public-use data sets, it allows the data...
Article
We analyze publicly available data to estimate the causal effects of military interventions on the homicide rates in certain problematic regions in Mexico. We use the Rubin causal model to compare the post-intervention homicide rate in each intervened region to the hypothetical homicide rate for that same year had the military intervention not take...
Article
The estimation of causal effects in nonrandomized studies should comprise two distinct phases: design, with no outcome data available; and analysis of the outcome data according to a specified protocol. Here, we review and compare point and interval estimates of common statistical procedures for estimating causal effects (i.e. matching, subclassifi...
Article
Unreplicated factorial designs have been widely used in scientific and industrial settings, when it is important to distinguish “active” or real factorial effects from “inactive” or noise factorial effects used to estimate residual or “error” terms. We propose a new approach to screen for active factorial effects from such experiments that utilizes...
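The abstract above proposes a new screening approach; as a familiar point of comparison (not the paper's method), Lenth's pseudo standard error screens active effects in an unreplicated two-level design using only the estimated effects:

```python
import numpy as np

def lenth_pse(effects):
    """Lenth (1989) pseudo standard error for unreplicated factorial effects."""
    abs_e = np.abs(np.asarray(effects, dtype=float))
    s0 = 1.5 * np.median(abs_e)
    return 1.5 * np.median(abs_e[abs_e < 2.5 * s0])

effects = np.array([11.2, -0.4, 0.7, 9.8, -0.2, 0.5, 0.3])  # illustrative values
t_like = effects / lenth_pse(effects)  # large |ratios| suggest active effects
```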
Chapter
The problem of missing data is ubiquitous in many fields, including clinical psychology. This entry begins by providing a brief overview of basic concepts describing types of missing data, patterns of missing data, and mechanisms that create missing data. It then goes on to consider common analytic approaches to deal with this problem and describes...
Article
Full-text available
Many outcomes of interest in the social and health sciences, as well as in modern applications in computational social science and experimentation on social media platforms, are ordinal and do not have a meaningful scale. Causal analyses that leverage this type of data, termed ordinal non-numeric, require careful treatment, as much of the classical...
Book
Most questions in social and biomedical sciences are causal in nature: what would happen to individuals, or to groups, if part of their environment were changed? In this groundbreaking text, two world-renowned experts present statistical methods for studying such questions. This book starts with the notion of potential outcomes, each corresponding...
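For readers new to the book's framework, the standard potential-outcomes notation it builds on (a summary in common notation, not a quotation from the text):

```latex
\[
  \tau_i = Y_i(1) - Y_i(0), \qquad
  \bar{\tau} = \frac{1}{N}\sum_{i=1}^{N}\bigl(Y_i(1) - Y_i(0)\bigr),
\]
% with only one potential outcome observed per unit:
\[
  Y_i^{\mathrm{obs}} = W_i\,Y_i(1) + (1 - W_i)\,Y_i(0).
\]
```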
Article
Although recent guidelines for dealing with missing data emphasize the need for sensitivity analyses, and such analyses have a long history in statistics, universal recommendations for conducting and displaying these analyses are scarce. We propose graphical displays that help formalize and visualize the results of sensitivity analyses, building up...
Article
Full-text available
The Neyman-Fisher controversy considered here originated with the 1935 presentation of Jerzy Neyman's Statistical Problems in Agricultural Experimentation to the Royal Statistical Society. Neyman asserted that the standard ANOVA F-test for randomized complete block designs is valid, whereas the analogous test for Latin squares is invalid in the sen...
Article
A framework for causal inference from two-level factorial designs is proposed, which uses potential outcomes to define causal effects. The paper explores the effect of non-additivity of unit level treatment effects on Neyman's repeated sampling approach for estimation of causal effects and on Fisher's randomization tests on sharp null hypotheses in...
Chapter
When sample sizes are small, a useful alternative approach to maximum likelihood (ML) estimation is to add a prior distribution for the parameters and compute the posterior distribution of the parameters of interest. As with ML estimation with a general pattern of missing values, Bayes simulation requires iteration. The iterative simulation methods discussed...
Chapter
Imputations are means or draws from a predictive distribution of the missing values, and require a method of creating a predictive distribution for the imputation based on the observed data. There are two generic approaches to generating this distribution: Explicit modeling: the predictive distribution is based on a formal statistical model, and he...
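A minimal sketch of the explicit-modeling route, assuming a normal linear regression of Y on fully observed X with a flat prior; one stochastic imputation is drawn from the posterior predictive distribution (names are illustrative):

```python
import numpy as np

def impute_draw(X, y, missing, rng):
    """One proper imputation of y[missing]: draw sigma^2, then beta,
    then the missing values from the posterior predictive."""
    obs = ~missing
    Xo = np.column_stack([np.ones(obs.sum()), X[obs]])
    beta_hat, *_ = np.linalg.lstsq(Xo, y[obs], rcond=None)
    resid = y[obs] - Xo @ beta_hat
    df = obs.sum() - Xo.shape[1]
    sigma2 = resid @ resid / rng.chisquare(df)       # scaled inverse-chi^2 draw
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(Xo.T @ Xo))
    Xm = np.column_stack([np.ones(missing.sum()), X[missing]])
    return Xm @ beta + rng.normal(scale=np.sqrt(sigma2), size=missing.sum())
```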
Chapter
The estimate is computed as part of the Newton-Raphson algorithm for maximum likelihood (ML) estimation, and as part of the scoring algorithm. This chapter considers methods for computing standard errors that do not require computation and inversion of an information matrix. Another method for calculating large-sample covariance matrices i...
Chapter
This chapter considers alternative distributions to the t distribution for robust inference, and robust inference for multivariate data sets with missing values. It describes a general mixture model for robust estimation of a univariate sample that includes the t and contaminated normal distributions as special cases. The case of multivariate data...
Chapter
This chapter discusses complete-case analysis, which confines the analysis to the set of cases with no missing values, and modifications and extensions. It considers a modification of complete-case analysis that differentially weights the complete cases to adjust for bias. The basic idea is closely related to weighting in randomization inference fo...
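A minimal sketch of the weighting idea, assuming the response (missingness) probability is modeled correctly with logistic regression; complete cases are weighted by the inverse of their estimated probability of being observed (names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_complete_case_mean(X, y, observed):
    """Weighted mean of y over complete cases, weights = 1 / P(observed | X)."""
    observed = np.asarray(observed, dtype=bool)
    p = LogisticRegression(max_iter=1000).fit(X, observed).predict_proba(X)[:, 1]
    w = 1.0 / p[observed]
    return np.sum(w * y[observed]) / np.sum(w)
```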
Chapter
This chapter concerns the analysis of incomplete data when the variables are categorical. If the data matrix has missing items, some of the cases in the preceding contingency table are partially classified. The chapter discusses the estimation for general patterns via the Expectation Maximization (EM) algorithm and posterior simulation. It consider...
Chapter
This chapter discusses models where the data are not missing at random (MAR), so maximum likelihood (ML) estimation requires a model for the missing-data mechanism and maximization of the full likelihood. The EM algorithm or its extensions, discussed in the chapter for the general case of known or unknown nonignorable mechanisms, can typically achi...
Chapter
This chapter applies the tools to a variety of common problems involving incomplete data on multivariate normally distributed variables: estimation of the mean vector and covariance matrix; estimation of these quantities when there are restrictions on the mean and covariance matrix; multiple linear regression, including analysis of variance (ANOVA)...
Chapter
Controlled experiments are generally carefully designed to allow revealing statistical analyses to be made using straightforward computations. The estimates, standard errors, and analysis of variance (ANOVA) table corresponding to most designed experiments are easily computed because of balance in the design. This chapter assumes that the design ma...
Chapter
Patterns of incomplete data in practice often do not have the particular forms that allow explicit maximum likelihood (ML) estimates to be calculated by exploiting factorizations of the likelihood. This chapter considers the iterative methods of computation for situations without explicit ML estimates. An alternative computing strategy for incomple...
Chapter
This chapter focuses on the question of deriving estimates of uncertainty that incorporate the added variance due to nonresponse. The variance estimates presented here all effectively assume that the method of adjustment for nonresponse has succeeded in eliminating nonresponse bias. The chapter modifies the imputations so that valid standard errors...
Chapter
In this chapter, the authors assume that the missing-data mechanism is ignorable, and for simplicity write ℓ(θ | Y_obs) for the ignorable loglikelihood ℓ_ign(θ | Y_obs) based on the incomplete data Y_obs. For certain models and incomplete-data patterns, however, analyses based on ℓ(θ | Y_obs) can employ standard complete-data techniques. The chapter discusses general...
Chapter
This chapter considers missing-data methods for mixtures of normal and non-normal variables, assuming that the missing-data mechanism is ignorable. An analysis of the missing-data pattern suggests that all the parameters of the general location model are estimable, despite the sparseness of the data matrix. Maximum Likelihood (ML) estimates compute...
Chapter
Many methods of estimation for incomplete data can be based on the likelihood function under specific modeling assumptions. This chapter reviews basic theory of inference based on the likelihood function and describes how it is implemented in the incomplete-data setting. It begins by considering maximum likelihood and Bayes’ estimation for complete...
Article
Full-text available
Multiple imputation (MI) has become a standard statistical technique for dealing with missing values. The CDC Anthrax Vaccine Research Program (AVRP) dataset created new challenges for MI due to the large number of variables of different types and the limited sample size. A common method for imputing missing data in such complex studies is to speci...
Article
A number of mixture modeling approaches assume both normality and independent observations. However, these two assumptions are at odds with the reality of many data sets, which are often characterized by an abundance of zero-valued or highly skewed observations as well as observations from biologically related (i.e., non-independent) subjects. We p...
Book
Praise for the First Edition of Statistical Analysis with Missing Data “An important contribution to the applied statistics literature.... I give the book high marks for unifying and making accessible much of the past and current work in this important area.”—William E. Strawderman, Rutgers University “This book...provide[s] interesting real-life e...
Article
The estimation of causal effects has been the subject of extensive research. In unconfounded studies with a dichotomous outcome, Y, Cangul, Chretien, Gutman and Rubin (2009) demonstrated that logistic regression for a scalar continuous covariate X is generally statistically invalid for testing null treatment effects when the distributions of X in t...
Article
Missing data are a common problem in statistics. Imputation, or filling in the missing values, is an intuitive and flexible way to address the resulting incomplete data sets. We focus on multiple imputation, which, when implemented correctly, can be a statistically valid strategy for handling missing data. The analysis of a multiply-imputed data se...
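The combining step mentioned above follows the standard rules for multiply-imputed analyses (Rubin's rules); a compact implementation for a scalar estimand:

```python
import numpy as np

def pool_mi(estimates, variances):
    """Pool m completed-data estimates and their variances."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar, u_bar = q.mean(), u.mean()     # pooled estimate, within-imp. variance
    b = q.var(ddof=1)                     # between-imputation variance
    t = u_bar + (1 + 1 / m) * b           # total variance
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, t, df                   # basis for t-based intervals
```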
Book
Broadening its scope to nonstatisticians, Bayesian Methods for Data Analysis, Third Edition provides an accessible introduction to the foundations and applications of Bayesian analysis. Along with a complete reorganization of the material, this edition concentrates more on hierarchical Bayesian modeling as implemented via Markov chain Monte Carlo (...
Article
A framework for causal inference from two-level factorial designs is proposed. The framework utilizes the concept of potential outcomes that lies at the center stage of causal inference and extends Neyman's repeated sampling approach for estimation of causal effects and randomization tests based on Fisher's sharp null hypothesis to the case of 2-le...
Article
In studies of public health, outcome measures such as the odds ratio, rate ratio, or efficacy are often estimated across strata to assess the overall effect of active treatment versus control treatment. Patients may be partitioned into such strata or blocks by experimental design, or, in non-randomized studies, patients may be partitioned into subc...
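One common way to pool an odds ratio across such strata is the Mantel-Haenszel estimator; a small sketch with illustrative 2x2 tables (not data from the paper):

```python
import numpy as np

# Each row is one stratum: (a, b, c, d) =
# (treated events, treated non-events, control events, control non-events)
tables = np.array([[10, 90, 20, 80],
                   [ 5, 45, 12, 38]], dtype=float)
a, b, c, d = tables.T
n = tables.sum(axis=1)
or_mh = np.sum(a * d / n) / np.sum(b * c / n)  # Mantel-Haenszel odds ratio
```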
Article
The effects of a job training program, Job Corps, on both employment and wages are evaluated using data from a randomized study. Principal stratification is used to address, simultaneously, the complications of noncompliance, wages that are only partially defined because of nonemployment, and unintended missing outcomes. The first two complications...
Article
Full-text available
Missing data are pervasive in large public-use databases. Multiple imputation (MI) is an effective methodology to handle the problem. Current state-of-the-art procedures of MI often fit fully Bayesian models assuming some joint probability distribution for the underlying complete data. Though theoretically valid, joint modeling may not accurately...
Article
Randomized experiments are the "gold standard" for estimating causal effects, yet often in practice, chance imbalances exist in covariate distributions between treatment groups. If covariate data are available before units are exposed to treatments, these chance imbalances can be mitigated by first checking covariate balance before the physical exp...
Article
We review advances toward credible causal inference that have wide application for empirical legal studies. Our chief point is simple: Research design trumps methods of analysis. We explain matching and regression discontinuity approaches in intuitive (nontechnical) terms. To illustrate, we apply these to existing data on the impact of prison facil...
Chapter
Missing data are a pervasive problem in many data sets and seem especially widespread in social and economic studies, such as customer satisfaction surveys. Imputation is an intuitive and flexible way to handle the incomplete data sets that result. This chapter starts with a basic discussion of missing-data patterns, which describe which values are...
Chapter
Causal inference is used to measure effects from experimental and observational data. This chapter provides an overview of the approach to the estimation of such causal effects based on the concept of potential outcomes, which stems from the work on randomized experiments by Fisher and Neyman in the 1920s and was then extended by Rubin in the 1970s...
Article
Randomization of treatment assignment in experiments generates treatment groups with approximately balanced baseline covariates. However, in observational studies, where treatment assignment is not random, patients in the active treatment and control groups often differ on crucial covariates that are related to outcomes. These covariate imbalances...