# Donald B. Rubin | Harvard University · Department of Statistics

Doctor of Philosophy

## About

- Publications: 467
- Reads: 131,233 (a 'read' is counted each time someone views a publication summary, such as the title, abstract, and list of authors, clicks on a figure, or views or downloads the full text)
- Citations: 309,402

## Publications (467)

Catalytic prior distributions provide general, easy-to-use and interpretable specifications of prior distributions for Bayesian analysis. They are particularly beneficial when observed data are inadequate to well-estimate a complex target model. A catalytic prior distribution is constructed by augmenting the observed data with synthetic data that a...

Consider a situation with two treatments, the first of which is randomized but the second is not, and the multifactor version of this. Interest is in treatment effects, defined using standard factorial notation. We define estimators for the treatment effects and explore their properties when there is information about the nonrandomized treatment as...

We describe a new method to combine propensity‐score matching with regression adjustment in treatment‐control studies when outcomes are binary by multiply imputing potential outcomes under control for the matched treated subjects. This enables the estimation of clinically meaningful measures of effect such as the risk difference. We used Monte Carl...

Basic propensity score methodology is designed to balance the distributions of multivariate pre-treatment covariates when comparing one active treatment with one control treatment. However, practical settings often involve comparing more than two treatments, where more complicated contrasts than the basic treatment-control one, (1,−1), are relevant...

Formal guidelines for statistical reporting of non-randomized studies are important for journals that publish results of such studies. Although it is gratifying to see some journals providing guidelines for statistical reporting, we feel that the current guidelines that we have seen are not entirely adequate when the study is used to draw causal co...

Blocking is commonly used in randomized experiments to increase efficiency of estimation. A generalization of blocking removes allocations with imbalance in covariate distributions between treated and control units, and then randomizes within the remaining set of allocations with balance. This idea of rerandomization was formalized by Morgan and Ru...

**Significance.** Hostile influence operations (IOs) that weaponize digital communications and social media pose a rising threat to open democracies. This paper presents a system framework to automate detection of disinformation narratives, networks, and influential actors. The framework integrates natural language processing, machine learning, graph an...

We used a randomized crossover experiment to estimate the effects of ozone (vs. clean air) exposure on genome-wide DNA methylation of target bronchial epithelial cells, using 17 volunteers, each randomly exposed on two separated occasions to clean air or 0.3-ppm ozone for two hours. Twenty-four hours after exposure, participants underwent bronchosc...

On 10 September 2020, C. R. Rao turns 100. Bradley Efron, Shun‐ichi Amari, Donald B. Rubin, Arni S. R. Srinivasa Rao and David R. Cox reflect on his contributions to statistics.

**Significance.** Statistical analyses of randomized experiments often rely on asymptotic P values instead of using the actual randomization procedure that led to the observed data. Fisher-exact and asymptotic P values can differ dramatically: The former should be preferred because it is calculated using the exact null randomization distribution, which,...
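The Fisher-exact logic described above can be sketched with a short simulation: under the sharp null of no effect for any unit, outcomes are fixed and only the assignment is random, so the null distribution of the test statistic is obtained by recomputing it under re-permuted assignments. The data below are made up for illustration; with small samples one could enumerate all assignments instead of sampling them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up outcomes for 6 treated and 6 control units.
y = np.array([3.1, 2.8, 3.5, 3.9, 2.7, 3.3, 2.1, 2.4, 1.9, 2.6, 2.2, 2.0])
z = np.array([1] * 6 + [0] * 6)   # observed assignment

def diff_in_means(y, z):
    return y[z == 1].mean() - y[z == 0].mean()

t_obs = diff_in_means(y, z)

# Under the sharp null every unit's outcome is unchanged by treatment,
# so the statistic can be recomputed under each alternative assignment.
draws = np.array([diff_in_means(y, rng.permutation(z)) for _ in range(20_000)])

# Two-sided randomization p-value: fraction of assignments giving a
# statistic at least as extreme as the observed one.
p_value = np.mean(np.abs(draws) >= abs(t_obs))
print(f"observed difference {t_obs:.3f}, randomization p {p_value:.4f}")
```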

Basic propensity score methodology is designed to balance multivariate pre-treatment covariates when comparing one active treatment with one control treatment. Practical settings often involve comparing more than two treatments, where more complicated contrasts than the basic treatment-control one, (1,−1), are relevant. Here, we propose the use of c...

The weaponization of digital communications and social media to conduct disinformation campaigns at immense scale, speed, and reach presents new challenges to identify and counter hostile influence operations (IO). This paper presents an end-to-end framework to automate detection of disinformation narratives, networks, and influential actors. The f...

**Significance.** We propose a strategy for building prior distributions that stabilize the estimation of complex “working models” when sample sizes are too small for standard statistical analysis. The stabilization is achieved by supplementing the observed data with a small amount of synthetic data generated from the predictive distribution of a simple...

The general reliance on blinded placebo-controlled randomized trials, both to approve drugs and to set their recommended dosages, although statistically sound for some purposes, may be statistically naïve in the context of guiding general medical practice. Briefly, the reason is that medical prescriptions are unblinded, and so patients who receive...

Causal inference refers to the process of inferring what would happen in the future if we change what we are doing, or inferring what would have happened in the past, if we had done something different in the distant past. Humans adjust our behaviors by anticipating what will happen if we act in different ways, using past experiences to inform thes...

With many pretreatment covariates and treatment factors, the classical factorial experiment often fails to balance covariates across multiple factorial effects simultaneously. Therefore, it is intuitive to restrict the randomization of the treatment factors to satisfy certain covariate balance criteria, possibly conforming to the tiers of factorial...

Matching on an estimated propensity score is frequently used to estimate the effects of treatments from observational data. Since the 1970s, different authors have proposed methods to combine matching at the design stage with regression adjustment at the analysis stage when estimating treatment effects for continuous outcomes. Previous work has con...

Estimating influence on social media networks is an important practical and theoretical problem, especially because this new medium is widely exploited as a platform for disinformation and propaganda. This paper introduces a novel approach to influence estimation on social media networks and applies it to the real-world problem of characterizing ac...

Although randomized controlled trials (RCTs) are generally considered the gold standard for estimating causal effects, for example of pharmaceutical treatments, the valid analysis of RCTs is more complicated with human units than with plants and other such objects. One potential complication that arises with human subjects is the possible existence...

Several statistical agencies have started to use multiply-imputed synthetic microdata to create public-use data in major surveys. The purpose of doing this is to protect the confidentiality of respondents’ identities and sensitive attributes, while allowing standard complete-data analyses of microdata. A key challenge, faced by advocates of synthet...

The health effects of environmental exposures have been studied for decades, typically using standard regression models to assess exposure-outcome associations found in observational non-experimental data. We propose and illustrate a different approach to examine causal effects of environmental exposures on health outcomes from observational data....

The seminal work of Morgan and Rubin (2012) considers rerandomization for a sample of size 2N units all randomized at one time. In practice, however, experimenters may have to rerandomize units sequentially. For example, a clinician studying a rare disease may be unable to wait to perform an experiment until all 2N experimental units are recruited...

Blinded randomized controlled trials (RCTs) require participants to be uncertain whether they are receiving a treatment or placebo. Although uncertainty is ideal for isolating the treatment effect from all other potential effects, it is poorly suited for estimating the treatment effect under actual conditions of intended use, when individuals are certain...

This article considers causal inference for treatment contrasts from a randomized experiment using potential outcomes in a finite population setting. Adopting a Neymanian repeated sampling approach that integrates such causal inference with finite population survey sampling, an inferential framework is developed for general mechanisms of assigning...

A few years ago, the New York Department of Education (NYDE) was planning to conduct an experiment involving five new intervention programs for a selected set of New York City high schools. The goal was to estimate the causal effects of these programs and their interactions on the schools’ performance. For each of the schools, about 50 premeasured...

The wise use of statistical ideas in practice essentially requires some Bayesian thinking, in contrast to the classical rigid frequentist dogma. This dogma too often has seemed to influence the applications of statistics, even at agencies like the FDA. Greg Campbell was one of the most important advocates there for more nuanced modes of thought, es...

**Significance.** Rerandomization refers to experimental designs that enforce covariate balance. This paper studies the asymptotic properties of the difference-in-means estimator under rerandomization, based on the randomness of the treatment assignment without imposing any parametric modeling assumptions on the covariates or outcome. The non-Gaussian a...

**Significance.** We consider data-analysis settings where data are missing not at random. In these cases, the two basic modeling approaches are 1) pattern-mixture models, with separate distributions for missing data and observed data, and 2) selection models, with a distribution for the data preobservation and a missing-data mechanism that selects whic...

For likelihood-based inferences from data with missing values, models are generally needed for both the data and the missing-data mechanism. However, modeling the mechanism is challenging, and parameters are often poorly identified. Rubin (1976) showed that for likelihood and Bayesian inference, sufficient conditions for ignoring the missing data m...

Many empirical settings involve the specification of models leading to complicated likelihood functions, for example, finite mixture models that arise in causal inference when using Principal Stratification (PS). Traditional asymptotic results cannot be trusted for the associated likelihood functions, whose logarithms are not close to being quadrat...

The problem of missing data is ubiquitous in the social and behavioral sciences. We begin by distinguishing between the pattern of missing data and the mechanism that creates the missing data. Then we consider common, but limited, approaches: complete-case analysis, nonresponse weighting, and single imputation. We then discuss more principled metho...

Factorial designs are widely used in agriculture, engineering, and the social sciences to study the causal effects of several factors simultaneously on a response. The objective of such a design is to estimate all factorial effects of interest, which typically include main effects and interactions among factors. To estimate factorial effects with h...

In randomized experiments, the random assignment of units to treatment groups justifies many of the widely used traditional analysis methods for evaluating causal effects. Specifying subgroups of units for further examination after observing outcomes, however, may partially nullify any advantages of randomized assignment when data are analyzed naiv...

We clarify the key concept of missingness at random in incomplete data analysis. We first distinguish between data being missing at random and the missingness mechanism being a missing-at-random one, which we call missing always at random and which is more restrictive. We further discuss how, in general, neither of these conditions is a statement a...

When conducting a randomized experiment, if an allocation yields treatment groups that differ meaningfully with respect to relevant covariates, groups should be rerandomized. The process involves specifying an explicit criterion for whether an allocation is acceptable, based on a measure of covariate balance, and rerandomizing units until an accept...
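A minimal sketch of this rerandomization loop, with made-up covariates and an arbitrary Mahalanobis-distance acceptance threshold (in practice the criterion is specified in advance, e.g. from the desired acceptance probability):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up pre-treatment covariates for 20 units (2 covariates).
X = rng.normal(size=(20, 2))
n_treat = 10

def mahalanobis_balance(X, z):
    """Mahalanobis distance between treated and control covariate means."""
    diff = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    n1, n0 = (z == 1).sum(), (z == 0).sum()
    return (n1 * n0 / (n1 + n0)) * diff @ cov_inv @ diff

threshold = 0.5   # illustrative acceptance criterion, chosen in advance
z = np.zeros(len(X), dtype=int)

# Rerandomize until an allocation satisfies the balance criterion.
while True:
    z[:] = 0
    z[rng.choice(len(X), size=n_treat, replace=False)] = 1
    if mahalanobis_balance(X, z) <= threshold:
        break

print("accepted allocation balance:", mahalanobis_balance(X, z))
```

Inference afterward should account for the restricted randomization, e.g. by using the same acceptance rule when building a randomization distribution.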

We examine the possible consequences of a change in law school admissions in the United States from an affirmative action system based on race to one based on socioeconomic class. Using data from the 1991-1996 Law School Admission Council Bar Passage Study, students were reassigned attendance by simulation to law school tiers by transferring the af...

Health and medical data are increasingly being generated, collected, and stored in electronic form in healthcare facilities and administrative agencies. Such data hold a wealth of information vital to effective health policy development and evaluation, as well as to enhanced clinical care through evidence-based practice and safety and quality monit...

By 'partially post hoc' subgroup analyses, we mean analyses that compare existing data from a randomized experiment (from which a subgroup specification is derived) to new, subgroup-only experimental data. We describe a motivating example in which partially post hoc subgroup analyses instigated statistical debate about a medical device's efficacy. We...

Estimation of causal effects in non-randomized studies comprises two distinct phases: design, without outcome data, and analysis of the outcome data according to a specified protocol. Recently, Gutman and Rubin (2013) proposed a new analysis-phase method for estimating treatment effects when the outcome is binary and there is only one covariate, wh...

Imputation, that is, filling in a value for each missing datum, is a common approach to handling missing data that has many desirable properties. For example, it allows the application of standard analytic methods that have been developed for complete data sets. Moreover, in the context of the production of public-use data sets, it allows the data...

We analyze publicly available data to estimate the causal effects of military interventions on the homicide rates in certain problematic regions in Mexico. We use the Rubin causal model to compare the post-intervention homicide rate in each intervened region to the hypothetical homicide rate for that same year had the military intervention not take...

The estimation of causal effects in nonrandomized studies should comprise two distinct phases: design, with no outcome data available; and analysis of the outcome data according to a specified protocol. Here, we review and compare point and interval estimates of common statistical procedures for estimating causal effects (i.e. matching, subclassifi...

Unreplicated factorial designs have been widely used in scientific and industrial settings, when it is important to distinguish “active” or real factorial effects from “inactive” or noise factorial effects used to estimate residual or “error” terms. We propose a new approach to screen for active factorial effects from such experiments that utilizes...

The problem of missing data is ubiquitous in many fields, including clinical psychology. This entry begins by providing a brief overview of basic concepts describing types of missing data, patterns of missing data, and mechanisms that create missing data. It then goes on to consider common analytic approaches to deal with this problem and describes...

Many outcomes of interest in the social and health sciences, as well as in modern applications in computational social science and experimentation on social media platforms, are ordinal and do not have a meaningful scale. Causal analyses that leverage this type of data, termed ordinal non-numeric, require careful treatment, as much of the classical...

Most questions in social and biomedical sciences are causal in nature: what would happen to individuals, or to groups, if part of their environment were changed? In this groundbreaking text, two world-renowned experts present statistical methods for studying such questions. This book starts with the notion of potential outcomes, each corresponding...

Although recent guidelines for dealing with missing data emphasize the need for sensitivity analyses, and such analyses have a long history in statistics, universal recommendations for conducting and displaying these analyses are scarce. We propose graphical displays that help formalize and visualize the results of sensitivity analyses, building up...

The Neyman-Fisher controversy considered here originated with the 1935 presentation of Jerzy Neyman's Statistical Problems in Agricultural Experimentation to the Royal Statistical Society. Neyman asserted that the standard ANOVA F-test for randomized complete block designs is valid, whereas the analogous test for Latin squares is invalid in the sen...

A framework for causal inference from two-level factorial designs is proposed, which uses potential outcomes to define causal effects. The paper explores the effect of non-additivity of unit level treatment effects on Neyman's repeated sampling approach for estimation of causal effects and on Fisher's randomization tests on sharp null hypotheses in...

When sample sizes are small, a useful alternative approach to maximum likelihood (ML) is to add a prior distribution for the parameters and compute the posterior distribution of the parameters of interest. As with ML estimation with a general pattern of missing values, Bayes simulation requires iteration. The iterative simulation methods discussed...

Praise for the First Edition of Statistical Analysis with Missing Data “An important contribution to the applied statistics literature.... I give the book high marks for unifying and making accessible much of the past and current work in this important area.”—William E. Strawderman, Rutgers University “This book...provide[s] interesting real-life e...

Imputations are means or draws from a predictive distribution of the missing values, and require a method of creating a predictive distribution for the imputation based on the observed data. There are two generic approaches to generating this distribution: Explicit modeling: the predictive distribution is based on a formal statistical model, and he...

The estimate is computed as part of the Newton-Raphson algorithm for maximum likelihood (ML) estimation, and as part of the scoring algorithm. This chapter considers methods for computing standard errors that do not require computation and inversion of an information matrix. Another method for calculating large-sample covariance matrices i...

This chapter considers alternative distributions to the t distribution for robust inference, and robust inference for multivariate data sets with missing values. It describes a general mixture model for robust estimation of a univariate sample that includes the t and contaminated normal distributions as special cases. The case of multivariate data...

This chapter discusses complete-case analysis, which confines the analysis to the set of cases with no missing values, and modifications and extensions. It considers a modification of complete-case analysis that differentially weights the complete cases to adjust for bias. The basic idea is closely related to weighting in randomization inference fo...
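The weighting idea can be illustrated with a small simulation: when response rates differ across strata of an always-observed covariate, the unweighted complete-case mean is biased, while weighting respondents by the inverse of their stratum's estimated response rate removes most of that bias. All numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up survey: x (always observed, binary stratum) and y (outcome,
# missing more often when x = 1, so nonresponse depends on x).
n = 2000
x = rng.integers(0, 2, size=n)
y = 50 + 20 * x + rng.normal(scale=5, size=n)
observed = rng.random(n) < np.where(x == 1, 0.4, 0.9)

# Naive complete-case mean: biased, because x = 1 respondents are
# underrepresented among the complete cases.
cc_mean = y[observed].mean()

# Weighting-class adjustment: weight each respondent by the inverse of
# the estimated response rate within its stratum of x.
y_cc, x_cc = y[observed], x[observed]
weights = np.empty(observed.sum())
for stratum in (0, 1):
    rate = observed[x == stratum].mean()
    weights[x_cc == stratum] = 1.0 / rate

weighted_mean = np.average(y_cc, weights=weights)
true_mean = y.mean()
print(f"true {true_mean:.1f}  complete-case {cc_mean:.1f}  weighted {weighted_mean:.1f}")
```

The correction works here because, within each stratum of x, responding and nonresponding units are exchangeable by construction; it fails when nonresponse also depends on y itself.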

This chapter concerns the analysis of incomplete data when the variables are categorical. If the data matrix has missing items, some of the cases in the preceding contingency table are partially classified. The chapter discusses the estimation for general patterns via the Expectation Maximization (EM) algorithm and posterior simulation. It consider...

This chapter discusses models where the data are not missing at random (MAR), so maximum likelihood (ML) estimation requires a model for the missing-data mechanism and maximization of the full likelihood. The EM algorithm or its extensions, discussed in the chapter for the general case of known or unknown nonignorable mechanisms, can typically achi...

This chapter applies the tools to a variety of common problems involving incomplete data on multivariate normally distributed variables: estimation of the mean vector and covariance matrix; estimation of these quantities when there are restrictions on the mean and covariance matrix; multiple linear regression, including analysis of variance (ANOVA)...

Controlled experiments are generally carefully designed to allow revealing statistical analyses to be made using straightforward computations. The estimates, standard errors, and analysis of variance (ANOVA) table corresponding to most designed experiments are easily computed because of balance in the design. This chapter assumes that the design ma...

Patterns of incomplete data in practice often do not have the particular forms that allow explicit maximum likelihood (ML) estimates to be calculated by exploiting factorizations of the likelihood. This chapter considers the iterative methods of computation for situations without explicit ML estimates. An alternative computing strategy for incomple...
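A classic instance of such an iterative method is the EM algorithm for a bivariate normal sample in which one variable is partially missing: the E-step fills in the expected sufficient statistics for the missing values given the current parameters, and the M-step re-estimates the parameters from the completed statistics. A minimal numpy sketch with simulated data (the sample sizes and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated bivariate normal data; y2 is missing for the last 200 of
# 500 cases (missingness depends only on case index, hence ignorable).
n, r = 500, 300
mean_true = np.array([1.0, 2.0])
cov_true = np.array([[1.0, 0.6], [0.6, 2.0]])
data = rng.multivariate_normal(mean_true, cov_true, size=n)
y1 = data[:, 0]
y2 = data[:r, 1]          # observed part of y2

# Crude starting values from the observed margins.
mu = np.array([y1.mean(), y2.mean()])
sigma = np.array([[y1.var(), 0.0], [0.0, y2.var()]])

for _ in range(100):
    # E-step: regression of y2 on y1 implied by current (mu, sigma).
    beta = sigma[0, 1] / sigma[0, 0]
    resid_var = sigma[1, 1] - beta * sigma[0, 1]
    y2_hat = mu[1] + beta * (y1[r:] - mu[0])
    # Expected complete-data sufficient statistics.
    s1 = y1.sum()
    s2 = y2.sum() + y2_hat.sum()
    s11 = (y1 ** 2).sum()
    s12 = (y1[:r] * y2).sum() + (y1[r:] * y2_hat).sum()
    s22 = (y2 ** 2).sum() + (y2_hat ** 2 + resid_var).sum()
    # M-step: re-estimate the parameters from those statistics.
    mu = np.array([s1, s2]) / n
    sigma = np.array([[s11, s12], [s12, s22]]) / n - np.outer(mu, mu)

print("EM estimates:", mu, sigma)
```

Note the `resid_var` term in `s22`: imputing only the conditional mean and ignoring the residual variance would understate the variance of y2, which is exactly the pitfall EM's expected sufficient statistics avoid.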

This chapter focuses on the question of deriving estimates of uncertainty that incorporate the added variance due to nonresponse. The variance estimates presented here all effectively assume that the method of adjustment for nonresponse has succeeded in eliminating nonresponse bias. The chapter modifies the imputations so that valid standard errors...

In this chapter, the authors assume that the missing-data mechanism is ignorable, and for simplicity write ℓ(θ|Yobs) for the ignorable loglikelihood ℓign(θ|Yobs) based on incomplete data Yobs. For certain models and incomplete-data patterns, however, analyses based on ℓ(θ|Yobs) can employ standard complete-data techniques. The chapter discusses general...

This chapter considers missing-data methods for mixtures of normal and non-normal variables, assuming that the missing-data mechanism is ignorable. An analysis of the missing-data pattern suggests that all the parameters of the general location model are estimable, despite the sparseness of the data matrix. Maximum Likelihood (ML) estimates compute...

Many methods of estimation for incomplete data can be based on the likelihood function under specific modeling assumptions. This chapter reviews basic theory of inference based on the likelihood function and describes how it is implemented in the incomplete-data setting. It begins by considering maximum likelihood and Bayes’ estimation for complete...

Multiple imputation (MI) has become a standard statistical technique for dealing with missing values. The CDC Anthrax Vaccine Research Program (AVRP) dataset created new challenges for MI due to the large number of variables of different types and the limited sample size. A common method for imputing missing data in such complex studies is to speci...

A number of mixture modeling approaches assume both normality and independent observations. However, these two assumptions are at odds with the reality of many data sets, which are often characterized by an abundance of zero-valued or highly skewed observations as well as observations from biologically related (i.e., non-independent) subjects. We p...

The estimation of causal effects has been the subject of extensive research. In unconfounded studies with a dichotomous outcome, Y, Cangul, Chretien, Gutman and Rubin (2009) demonstrated that logistic regression for a scalar continuous covariate X is generally statistically invalid for testing null treatment effects when the distributions of X in t...

Missing data are a common problem in statistics. Imputation, or filling in the missing values, is an intuitive and flexible way to address the resulting incomplete data sets. We focus on multiple imputation, which, when implemented correctly, can be a statistically valid strategy for handling missing data. The analysis of a multiply‐imputed data se...
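The pooling step of a multiple-imputation analysis follows Rubin's combining rules: average the per-imputation point estimates, then combine within- and between-imputation variability into a total variance. A sketch with hypothetical per-imputation results for one regression coefficient:

```python
import numpy as np

# Hypothetical point estimates and variances of one coefficient from
# m = 5 multiply-imputed data sets (values made up for illustration).
estimates = np.array([1.92, 2.10, 2.01, 1.88, 2.05])
variances = np.array([0.110, 0.105, 0.120, 0.108, 0.112])
m = len(estimates)

# Rubin's combining rules.
q_bar = estimates.mean()              # pooled point estimate
W = variances.mean()                  # within-imputation variance
B = estimates.var(ddof=1)             # between-imputation variance
T = W + (1 + 1 / m) * B               # total variance

# Degrees of freedom for the reference t distribution (Rubin 1987).
r = (1 + 1 / m) * B / W
df = (m - 1) * (1 + 1 / r) ** 2

print(f"pooled estimate {q_bar:.3f}, total SE {np.sqrt(T):.3f}, df {df:.1f}")
```

The `(1 + 1/m)` factor inflates the between-imputation component to account for using a finite number of imputations.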

Broadening its scope to nonstatisticians, Bayesian Methods for Data Analysis, Third Edition provides an accessible introduction to the foundations and applications of Bayesian analysis. Along with a complete reorganization of the material, this edition concentrates more on hierarchical Bayesian modeling as implemented via Markov chain Monte Carlo (...

A framework for causal inference from two-level factorial designs is proposed. The framework utilizes the concept of potential outcomes that lies at the center stage of causal inference and extends Neyman's repeated sampling approach for estimation of causal effects and randomization tests based on Fisher's sharp null hypothesis to the case of 2-le...

In studies of public health, outcome measures such as the odds ratio, rate ratio, or efficacy are often estimated across strata to assess the overall effect of active treatment versus control treatment. Patients may be partitioned into such strata or blocks by experimental design, or, in non-randomized studies, patients may be partitioned into subc...

The effects of a job training program, Job Corps, on both employment and wages are evaluated using data from a randomized study. Principal stratification is used to address, simultaneously, the complications of noncompliance, wages that are only partially defined because of nonemployment, and unintended missing outcomes. The first two complications...

Missing data are pervasive in large public-use databases. Multiple imputation (MI) is an effective methodology to handle the problem. Current state-of-the-art MI procedures often fit fully Bayesian models assuming some joint probability distribution for the underlying complete data. Though theoretically valid, joint modeling may not accurately...

Randomized experiments are the "gold standard" for estimating causal effects, yet often in practice, chance imbalances exist in covariate distributions between treatment groups. If covariate data are available before units are exposed to treatments, these chance imbalances can be mitigated by first checking covariate balance before the physical exp...

We review advances toward credible causal inference that have wide application for empirical legal studies. Our chief point is simple: Research design trumps methods of analysis. We explain matching and regression discontinuity approaches in intuitive (nontechnical) terms. To illustrate, we apply these to existing data on the impact of prison facil...

Missing data are a pervasive problem in many data sets and seem especially widespread in social and economic studies, such as customer satisfaction surveys. Imputation is an intuitive and flexible way to handle the incomplete data sets that result. This chapter starts with a basic discussion of missing-data patterns, which describe which values are...

Causal inference is used to measure effects from experimental and observational data. This chapter provides an overview of the approach to the estimation of such causal effects based on the concept of potential outcomes, which stems from the work on randomized experiments by Fisher and Neyman in the 1920s and was then extended by Rubin in the 1970s...

Randomization of treatment assignment in experiments generates treatment groups with approximately balanced baseline covariates. However, in observational studies, where treatment assignment is not random, patients in the active treatment and control groups often differ on crucial covariates that are related to outcomes. These covariate imbalances...
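The standard two-step propensity-score workflow (estimate each unit's probability of receiving treatment from its covariates, then compare treated units with controls matched on that probability) can be sketched as follows. The data, the plain gradient-ascent logistic fit, and the 1:1 nearest-neighbor matching rule are all illustrative simplifications, not a recommended production analysis.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up observational data: covariate x drives both treatment uptake
# and the outcome, so the raw group difference is confounded.
n = 1000
x = rng.normal(size=n)
z = rng.random(n) < 1 / (1 + np.exp(-(x - 0.5)))   # treatment indicator
y = 2.0 * z + 3.0 * x + rng.normal(size=n)         # true effect = 2.0

# Step 1: estimate the propensity score with a logistic regression,
# fit here by plain gradient ascent (numpy only, for illustration).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(1000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 2.0 * X.T @ (z - p) / n
score = 1 / (1 + np.exp(-X @ beta))

# Step 2: match each treated unit to the control with the nearest
# estimated score (1:1, with replacement), then compare matched means.
treated = np.where(z)[0]
control = np.where(~z)[0]
nearest = np.abs(score[treated][:, None] - score[control][None, :]).argmin(axis=1)
att = y[treated].mean() - y[control[nearest]].mean()

naive = y[z].mean() - y[~z].mean()
print(f"naive difference {naive:.2f}, matched estimate {att:.2f} (truth 2.0)")
```

Matching on the score balances x between the matched groups, which is why the matched estimate sits far closer to the truth than the naive contrast; in practice one would also check covariate balance after matching and combine matching with regression adjustment.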

Randomized encouragement designs, terminology established in the seminal article by Holland (1988) although used earlier (e.g., Swinton, 1975), are the norm when dealing with human populations. At least in much of the world today, thankfully, we cannot force anyone to take a randomly assigned treatment; rather, we can only encourage them to do so,...