Journal of Economic Perspectives—Volume 31, Number 2—Spring 2017—Pages 1–32

The State of Applied Econometrics:
Causality and Policy Evaluation

Susan Athey and Guido W. Imbens

■ Susan Athey is Economics of Technology Professor and Guido W. Imbens is Applied Econo-
metrics Professor and Professor of Economics, both at the Graduate School of Business, Stanford
University, Stanford, California. Both authors are also Research Associates, National Bureau
of Economic Research, Cambridge, Massachusetts. Their email addresses are athey@stanford.edu
and email@example.com.
For supplementary materials such as appendices, datasets, and author disclosure statements, see the
article page at

The gold standard for drawing inferences about the effect of a policy is a
randomized controlled experiment. However, in many cases, experiments
remain difficult or impossible to implement, for financial, political, or ethical
reasons, or because the population of interest is too small. For example, it would be
unethical to prevent potential students from attending college in order to study the
causal effect of college attendance on labor market experiences, and politically infea-
sible to study the effect of the minimum wage by randomly assigning minimum wage
policies to states. Thus, a large share of the empirical work in economics about policy
questions relies on observational data, that is, data where policies were determined
in a way other than through random assignment. Drawing inferences about the
causal effect of a policy from observational data is quite challenging. To understand
the challenges, consider the example of the minimum wage. A naive analysis of the
observational data might compare the average employment level of states with a high
minimum wage to that of states with a low minimum wage. This difference is surely
not a credible estimate of the causal effect of a higher minimum wage, defined as the
change in employment that would occur if the low-wage states raised their minimum
wage. For example, it might be the case that states with higher costs of living, as
well as more price-insensitive consumers, choose higher levels of the minimum wage
compared to states with lower costs of living and more price-sensitive consumers.
These factors, which may be unobserved, are said to be “confounders,” meaning that
they induce correlation between minimum wage policies and employment that is not
indicative of what would happen if the minimum wage policy changed.
In economics, researchers use a wide variety of strategies for attempting to draw
causal inference from observational data. These strategies are often referred to as
identification strategies or empirical strategies (Angrist and Krueger 1999), because they
are strategies for identifying the causal effect. We say, somewhat loosely, that a causal
effect is identified if it can be learned when the dataset is sufficiently large. In the first
main section of the paper, we review developments corresponding to several of these
identification strategies: regression discontinuity, synthetic control and differences-
in-differences methods, methods designed for network settings, and methods that
combine experimental and observational data. In the next main section, we discuss
supplementary analyses, by which we mean analyses where the results are intended to
convince the reader of the credibility of the primary analyses. These supplementary
analyses have not always been systematically applied in the empirical literature, but
we believe they will be of growing importance. We then briefly discuss some new
developments in the machine learning literature, which focus on the combination of
predictive methods and causal questions. We argue that machine learning methods
hold great promise for improving the credibility of policy evaluation, and they can
also be used to approach supplementary analyses more systematically.
Overall, this article focuses on recent developments in econometrics that may
be useful for researchers interested in estimating the effect of policies on outcomes.
Our choice of topics and examples does not seek to be an overall review. Instead it
is selective and subjective, based on our reading and assessment of recent research.
New Developments in Program Evaluation
The econometric literature on estimating causal effects has been very active for
over three decades now. Since the early 1990s, the potential outcome approach, some-
times referred to as the Rubin Causal Model, has gained substantial acceptance as
a framework for analyzing causal problems.¹ In the potential outcome approach,
there is for each unit i and each level of the treatment w, a potential outcome Yi(w),
which describes the value of the outcome under treatment level w for that unit.
Researchers observe which treatment a given unit received and the corresponding
outcome for each unit, but because we do not observe the outcomes for other levels
of the treatment that a given unit did not receive, we can never directly observe
the causal effects, which is what Holland (1986) calls the “fundamental problem of
causal inference.” Estimates of causal effects are ultimately based on comparisons of
different units with different levels of the treatment.
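For a binary treatment, the setup just described can be written compactly as follows. This is a notational sketch only: the treatment indicator $W_i$, the observed outcome $Y_i^{\mathrm{obs}}$, and the unit-level effect $\tau_i$ are symbols assumed here for illustration.

```latex
Y_i^{\mathrm{obs}} = Y_i(W_i) = W_i \, Y_i(1) + (1 - W_i) \, Y_i(0),
\qquad
\tau_i = Y_i(1) - Y_i(0).
```

Only one of $Y_i(1)$ and $Y_i(0)$ is observed for any given unit, so $\tau_i$ itself is never observed.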
¹ There is a complementary approach based on graphical models (for example, Pearl 2000) that is widely
used in other disciplines.
In some settings, the goal is to analyze the effect of a binary treatment, and
the unconfoundedness assumption can be justified. This assumption requires that all
“confounding factors” (that is, factors correlated with both potential outcomes
and with the assignment to the treatment) are observed, which in turn implies
that conditional on observed confounders, the treatment is as good as randomly
assigned. Rosenbaum and Rubin (1983a) show that under this assumption, the
average difference between treated and untreated groups with the same values for
the confounders can be given a causal interpretation. The literature on estimating
average treatment effects under unconfoundedness is very mature, with a number
of competing estimators and many applications. Some estimators use matching
methods (where each treated unit is compared to control units with similar covari-
ates), some rely on reweighting observations so that the observable characteristics
of the treatment and control group are similar after weighting, and some involve
the propensity score (that is, the conditional probability of receiving the treat-
ment given the covariates) (for reviews, see Imbens 2004; Abadie and Imbens 2006;
Imbens and Rubin 2015; Heckman and Vytlacil 2007). Because this setting has been
so well studied, we do not cover it in this article; neither do we cover the volumi-
nous (and very influential) literature on instrumental variables.² Instead, we discuss
issues related to a number of other identification strategies and settings.
Regression Discontinuity Designs
A regression discontinuity design enables the estimation of causal effects by
exploiting discontinuities in incentives or ability to receive a discrete treatment.³
For example, school district boundaries may imply that two children whose houses
are on the same street will attend different schools, or birthdate cutoffs may limit
eligibility to start kindergarten between two children born only a few days apart.
Many government programs are means-tested, meaning that eligibility depends on
income falling below a threshold. In these settings, it is possible to estimate the
causal effect of attending a particular school or receiving a government program
by comparing outcomes for children who live on either side of the boundary, or by
comparing individuals on either side of an eligibility threshold.
² There are two recent strands of the instrumental variables literature. One focuses on heterogeneous
treatment effects, with a key development being the notion of the local average treatment effect (Imbens
and Angrist 1994; Angrist, Imbens, and Rubin 1996). This literature has been reviewed in Imbens (2014).
There is also a literature on weak instruments, focusing on settings with a possibly large number of
instruments and weak correlation between the instruments and the endogenous regressor. On this topic,
see Bekker (1994), Staiger and Stock (1997), and Chamberlain and Imbens (2004) for specific contribu-
tions, and Andrews and Stock (2006) for a survey. We also do not discuss in detail bounds and partial
identification analyses. Starting with the work by Manski (for instance, Manski 1990), these topics have
received a lot of interest, with an excellent recent review in Tamer (2010).
³ This approach has a long history, dating back to work in psychology in the 1950s by Thistlethwaite and
Campbell (1960), but did not become part of the mainstream economics literature until the early 2000s
(with an exception being Goldberger 1972, 2008). Fairly recent reviews include Imbens and Lemieux
(2008), Lee and Lemieux (2010), van der Klaauw (2008), and Skovron and Titiunik (2015).
In general, the key feature of the design is the presence of an exogenous
variable, referred to as the forcing variable, like the student’s birthday or address,
where the probability of participating in the program changes discontinuously at a
threshold value of the forcing variable. This design can be used to estimate causal
effects under the assumption that the individuals close to the threshold but on
different sides are otherwise comparable, so any difference in average outcomes
between individuals just to one side or the other can be attributed to the treat-
ment. If the jump in the conditional probability of treatment at the threshold value
is from zero to one, we refer to the design as a “sharp” regression discontinuity
design. In this case, a researcher can focus on the discontinuity of the conditional
expectation of the outcome given the forcing variable at the threshold, interpreted
as the average effect of the treatment for individuals close to the threshold. If the
magnitude of the jump in probability of receiving the treatment at the threshold
value is less than one, it is a “fuzzy” regression discontinuity design. For example,
some means-tested government programs are also rationed, so that not all eligible
people gain access. In this case, the focus is again on the discontinuity in the condi-
tional expectation of the outcome at the threshold, but now it must be scaled by the
discontinuity in the probability of receiving the treatment. The interpretation of the
estimand is the average effect for “compliers” at the threshold, that is, individuals at
the threshold whose treatment status would have been different had they been on
the other side of the threshold (Hahn, Todd, and van der Klaauw 2001).
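In symbols, the sharp and fuzzy estimands described above can be sketched as follows. The notation is assumed for this illustration and is not taken from the text: $X_i$ is the forcing variable, $c$ the threshold, $W_i$ the binary treatment indicator, and $Y_i^{\mathrm{obs}}$ the observed outcome, with treatment becoming more likely to the right of $c$.

```latex
\tau_{\mathrm{SRD}}
  = \lim_{x \downarrow c} E[Y_i^{\mathrm{obs}} \mid X_i = x]
  - \lim_{x \uparrow c} E[Y_i^{\mathrm{obs}} \mid X_i = x],
\qquad
\tau_{\mathrm{FRD}}
  = \frac{\lim_{x \downarrow c} E[Y_i^{\mathrm{obs}} \mid X_i = x]
        - \lim_{x \uparrow c} E[Y_i^{\mathrm{obs}} \mid X_i = x]}
         {\lim_{x \downarrow c} E[W_i \mid X_i = x]
        - \lim_{x \uparrow c} E[W_i \mid X_i = x]}.
```

In the sharp design the denominator equals one, so the two expressions coincide.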
Let us illustrate a regression discontinuity design with data from Jacob and
Lefgren (2004). They study the causal effect of attending summer school using
administrative data from the Chicago Public Schools, which in 1996 instituted an
accountability policy that tied summer school attendance and promotional decisions
to performance on standardized tests. We use the data for 70,831 third-graders in
years 1997–99. The rule was that individuals who scored below a threshold (2.75 in
this case) on either reading or mathematics were required to attend summer school.
Out of the 70,831 third graders, 15,846 scored below the threshold on the math-
ematics test, 26,833 scored below the threshold on the reading test, 12,779 scored
below the threshold on both tests, and 29,900 scored below the threshold on at
least one test. The outcome variable $Y_i^{\mathrm{obs}}$ is the math score after summer school,
normalized to have variance one. Table 1 presents some of the results. The first
row presents an estimate of the effect of summer school attendance on the math-
ematics test, using for the forcing variable the minimum of the initial mathematics
score and the initial reading score. We find that the summer school program has a
substantial effect, raising the math test outcome score by 0.18 standard deviations.
Researchers who are implementing a regression discontinuity approach might
usefully bear four pointers in mind. First, we recommend using local linear methods
for the estimation process, rather than local constant methods that simply attempt to
estimate average outcomes on either side of the boundary using a standard kernel
regression. A kernel regression predicts the average outcome at a point by taking
a weighted average of outcomes for nearby observations, where closer observa-
tions are weighted more highly. The problem is that when applying such a method
near a boundary, all of the observations lie on one side of the boundary, creating
a bias in the estimates. The estimates are biased towards the average outcomes for
observations that are strictly inside the boundary (Porter 2003). As an alternative
Porter suggested local linear regression, which involves estimating linear regres-
sions of outcomes on the forcing variable separately on the left and the right of
the threshold, and then taking the difference between the predicted values at the
threshold. This approach works better if the outcomes change systematically near
the boundary because the model accounts for this and corrects the bias that arises
due to truncating data at the boundary. The local linear estimator has substantially
better finite sample properties than nonparametric methods that do not account
for threshold effects, and it has become the standard in the empirical literature.
For details on implementation, see Hahn, Todd, and van der Klaauw (2001), Porter
(2003), and Calonico, Cattaneo, and Titiunik (2014a).⁴
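The contrast between local constant and local linear estimation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the triangular kernel, the function names, the bandwidth, and the synthetic noise-free data (slope 2 on both sides, jump of 0.5 at zero). It is not the implementation used in any of the cited papers.

```python
def triangular_kernel(u):
    """Triangular kernel: weight 1 at u = 0, falling linearly to 0 at |u| = 1."""
    return max(0.0, 1.0 - abs(u))


def weighted_line_fit(xs, ys, ws):
    """Weighted least squares of y on (1, x); returns (intercept, slope)."""
    total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def rd_jump(x, y, cutoff, bandwidth):
    """Return (local_linear, local_constant) estimates of the jump at the cutoff."""
    fitted = {}
    for side in ("left", "right"):
        xs, ys, ws = [], [], []
        for xi, yi in zip(x, y):
            on_side = xi < cutoff if side == "left" else xi >= cutoff
            weight = triangular_kernel((xi - cutoff) / bandwidth)
            if on_side and weight > 0.0:
                xs.append(xi - cutoff)  # center so the intercept is the value at the cutoff
                ys.append(yi)
                ws.append(weight)
        intercept, _ = weighted_line_fit(xs, ys, ws)
        kernel_mean = sum(w * v for w, v in zip(ws, ys)) / sum(ws)
        fitted[side] = (intercept, kernel_mean)
    return (fitted["right"][0] - fitted["left"][0],
            fitted["right"][1] - fitted["left"][1])


# Synthetic, noise-free data: slope 2 on both sides and a jump of 0.5 at zero.
x = [(i + 0.5) / 100 for i in range(-100, 100)]
y = [2.0 * xi + (0.5 if xi >= 0 else 0.0) for xi in x]
local_linear, local_constant = rd_jump(x, y, cutoff=0.0, bandwidth=0.5)
```

On these data the local linear estimate recovers the jump of 0.5, while the local constant estimate is pulled away from it, because each one-sided kernel average reflects outcomes strictly inside the boundary rather than at it.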
A second key element in carrying out regression discontinuity analysis, given
a local linear estimation method, is the choice of the bandwidth—that is, how to
weight nearby versus more distant observations. Conventional methods for choosing
optimal bandwidths in nonparametric regressions look for bandwidths that are
optimal for estimating an entire regression function, but here the interest is solely
⁴ There are some suggestions that using local quadratic methods may work well given the current
technology for choosing bandwidths. Some empirical studies use global high-order polynomial approxi-
mations to the regression function, but Gelman and Imbens (2014) argue that such methods have poor
Table 1
Regression Discontinuity Designs: The Jacob–Lefgren Data
Outcome  Sample                   Estimator       Estimate  Standard error  IK bandwidth
Math     All                      Local Linear     0.18     (0.02)          0.57
Math     Reading > 3.32           Local Linear     0.15     (0.02)          0.57
Math     Math > 3.32              Local Linear     0.17     (0.03)          0.57
Math     Math and Reading < 3.32  Local Linear     0.19     (0.02)          0.57
Math     All                      Local Constant  −0.15     (0.02)          0.57
Note and Source: This table illustrates a regression discontinuity design with data from Jacob and
Lefgren (2004). They study the causal effect of attending summer school, using administrative
data from the Chicago Public Schools, which in 1996 instituted an accountability policy that tied
summer school attendance and promotional decisions to performance on standardized tests.
We use the data for 70,831 third-graders in years 1997–99. The rule was that individuals who
scored below a threshold (2.75 in this case) on either reading or mathematics were required to
attend summer school. Out of the 70,831 third graders, 15,846 scored below the threshold on the
mathematics test, 26,833 scored below the threshold on the reading test, 12,779 scored below the
threshold on both tests, and 29,900 scored below the threshold on at least one test. The outcome
variable $Y_i^{\mathrm{obs}}$ is the math score after summer school, normalized to have variance one. The
first row presents an estimate of the effect of summer school attendance on the mathematics
test, using for the forcing variable the minimum of the initial mathematics score and the initial
reading score. We find that the summer school program has a substantial effect, raising the math
test outcome score by 0.18 standard deviations. Rows 2–4 in Table 1 present estimates for separate
subsamples. In this case, we find relatively little evidence of heterogeneity in the estimates.
in the value of the regression function at a particular point. The current literature
suggests choosing the bandwidth for the local linear regression using asymptotic
expansions of the estimators around small values for the bandwidth (Imbens and
Kalyanaraman 2012; Calonico, Cattaneo, and Titiunik 2014a).
This example of summer school attendance also illustrates a situation in which
the discontinuity involves multiple exogenous variables: in this case, students who
score below a threshold on either a language or a mathematics test are required to
attend summer school. Although not all the students who are required to attend
summer school do so (a fuzzy regression discontinuity design), the fact that the
forcing variable is a known function of two observed exogenous variables makes it
possible to estimate the effect of summer school at different margins. For example,
one can estimate the effect of summer school for individuals who are required to
attend because of failure to pass the language test, and compare this with the esti-
mate for those who are required because of failure to pass the mathematics test. The
dependence of the threshold on multiple exogenous variables improves the ability
to detect and analyze heterogeneity in the causal effects. Rows 2–4 in Table 1 present
estimates for separate subsamples. In this case, we find relatively little evidence of
heterogeneity in the estimates.
A third concern for regression discontinuity analysis is how to assess the validity
of the assumptions required for interpreting the estimates as causal effects. We
recommend carrying out supplementary analyses to assess the credibility of the
design, and in particular to test for evidence of manipulation of the forcing vari-
able, as well as to test for discontinuities in average covariate values at the threshold.
We will discuss examples later.
Fourth, we recommend that researchers investigate the external validity of the
regression discontinuity estimates by assessing the credibility of extrapolations to
other subpopulations (Bertanha and Imbens 2014; Angrist and Rokkanen 2015;
Angrist and Fernandez-Val 2010; Dong and Lewbel 2015). Again, we return to this
topic later in the paper.
An interesting recent development in the area of regression discontinuity designs
involves the generalization to discontinuities in derivatives, rather than levels, of
conditional expectations. The basic idea is that at a threshold for the forcing variable,
the slope of the outcome function (as a function of the forcing variable) changes,
and the goal is to estimate this change in slope. The first discussions of these regres-
sion kink designs appear in Nielsen, Sorensen, and Taber (2010), Card, Lee, Pei, and
Weber (2015), and Dong (2014). For example, in Card, Lee, Pei, and Weber (2015),
the goal of the analysis is to estimate the causal effect of an increase in the unem-
ployment benefits on the duration of unemployment spells, where earnings are the
forcing variable. The analysis exploits the fact that, at the threshold, the relationship
between benefit levels and the forcing variable changes. If we are willing to assume
that in the absence of the kink in the benefit system, the derivative of the expected
duration of unemployment would be smooth in lagged earnings, then the change in
the derivative of the expected duration with respect to lagged earnings is informative
about the relation between the expected duration and the benefit schedule.
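The regression kink idea can be written as a ratio of slope changes. The notation below is assumed for illustration only: $Y_i$ is the duration of the unemployment spell, $X_i$ is lagged earnings, $b(x)$ is the benefit schedule, and $c$ is the kink point.

```latex
\tau_{\mathrm{RK}}
  = \frac{\lim_{x \downarrow c} \frac{d}{dx} E[Y_i \mid X_i = x]
        - \lim_{x \uparrow c} \frac{d}{dx} E[Y_i \mid X_i = x]}
         {\lim_{x \downarrow c} b'(x) - \lim_{x \uparrow c} b'(x)}.
```

The numerator is the change in slope of the expected duration at the kink, and the denominator is the known change in slope of the benefit schedule, which parallels the scaling used in the fuzzy regression discontinuity design.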
Synthetic Control Methods and Difference-In-Differences
Difference-in-differences methods have been an important tool for empirical
researchers since the early 1990s. These methods are typically used when some groups,
like cities or states, experience a treatment, such as a policy change, while others do
not. In this situation, the selection of which groups experience the treatment is not
necessarily random, and outcomes are not necessarily the same across groups in the
absence of the treatment. The groups are observed before and after the treatment.
The challenge for causal inference is to come up with a credible estimate of what the
outcomes would have been for the treatment group in the absence of the treatment.
This requires estimating a (counterfactual) change over time for the treatment group if
the treatment had not occurred. The assumption underlying difference-in-differences
strategies is that the change in outcomes over time for the control group is informative
about what the change would have been for the treatment group in the absence of the
treatment. In general, this requires functional form assumptions. If researchers make
a linearity assumption, they can estimate the average treatment effect as the difference
between the change in average outcomes over time for the treatment group, minus the
change in average outcomes over time for the control group.
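Under the linearity assumption, the estimator just described reduces to simple arithmetic on four group means. The sketch below uses hypothetical outcome values and a hypothetical function name purely for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def diff_in_diff(treat_pre, treat_post, control_pre, control_post):
    """Change over time for the treatment group minus the change for the control group."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcomes for two groups observed before and after the treatment.
treat_pre = [10.0, 12.0, 11.0]     # mean 11
treat_post = [15.0, 16.0, 14.0]    # mean 15
control_pre = [8.0, 9.0, 10.0]     # mean 9
control_post = [10.0, 11.0, 12.0]  # mean 11

effect = diff_in_diff(treat_pre, treat_post, control_pre, control_post)
# (15 - 11) - (11 - 9) = 2.0
```

The control group's change of 2 stands in for the change the treatment group would have experienced without the treatment, so only the remaining 2 units of the treatment group's change of 4 are attributed to the treatment.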
Here we discuss two recent developments to the difference-in-differences approach:
the synthetic control approach and the nonlinear changes-in-changes method. The
synthetic control approach developed by Abadie, Diamond, and Hainmueller (2010,
2014) and Abadie and Gardeazabal (2003) is arguably the most important innovation
in the policy evaluation literature in the last 15 years. This method builds on differ-
ence-in-differences estimation, but uses systematically more attractive comparisons. To
gain some intuition about these methods, consider the classic difference-in-differences
study by Card (1990; see also Peri and Yasenov 2015). Card is interested in the effect
of the Mariel boatlift, which brought low-skilled Cuban workers to Miami. The ques-
tion is how the boatlift affected the Miami labor market, and specifically the wages of
low-skilled workers. He compares the change in the outcome of interest for the treat-
ment city (Miami) to the corresponding change in a control city. He considers various
possible control cities, including Houston, Petersburg, and Atlanta.
In contrast, the synthetic control approach moves away from using a single
control unit or a simple average of control units, and instead uses a weighted average
of the set of controls. In other words, instead of choosing between Houston, Peters-
burg, or Atlanta, or taking a simple average of outcomes in those cities, the synthetic
control approach chooses weights for each of the three cities so that the weighted
average is more similar to Miami than any single city would be. If pre-boatlift wages
are higher in Houston than in Miami, but lower in Atlanta than Miami, it would
make sense to compare Miami to the average of Houston and Atlanta rather than
to either Houston or Atlanta. The simplicity of the idea, and the obvious improve-
ment over the standard methods, have made this a widely used method in the short
period of time since its inception.
The implementation of the synthetic control method requires a specific choice
for the weights. The original paper, Abadie, Diamond, and Hainmueller (2010),
uses a minimum distance approach, combined with the restriction that the resulting
weights are nonnegative and sum to one. This approach often leads to a unique
set of weights. However, if a certain unit is on the extreme end of the distribution
of units, then allowing for weights that sum up to a number different from one or
allowing for negative weights may improve the fit. Doudchenko and Imbens (2016)
explore alternative methods for calculating appropriate weights for a synthetic
control approach, such as best subset regression, LASSO (the least absolute
shrinkage and selection operator), and elastic net methods, which perform better
in settings with a large number of potential control units.
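The constrained weight choice can be sketched as a small optimization problem: find nonnegative weights that sum to one and minimize the squared gap between the treated unit and the weighted controls in the pre-treatment periods. The sketch below makes several assumptions for illustration: it matches on pre-period outcomes only, it uses projected gradient descent rather than the minimum distance routine of the original paper, and the wage figures and function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the set {w : w >= 0, sum(w) = 1}."""
    ordered = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0:
            theta = candidate  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def synthetic_control_weights(treated_pre, controls_pre, steps=5000):
    """Weights w >= 0 with sum(w) = 1 that minimize the squared pre-period gap."""
    n_controls = len(controls_pre)
    n_periods = len(treated_pre)
    # Because the weights sum to one, sum_j w_j X_j - X_treated equals
    # sum_j w_j (X_j - X_treated), so we can work with differences.
    diffs = [[c[t] - treated_pre[t] for t in range(n_periods)] for c in controls_pre]
    # Conservative step size: twice the squared Frobenius norm bounds the curvature.
    scale = sum(d * d for row in diffs for d in row)
    step = 1.0 / (2.0 * scale) if scale > 0 else 1.0
    w = [1.0 / n_controls] * n_controls
    for _ in range(steps):
        gap = [sum(w[j] * diffs[j][t] for j in range(n_controls))
               for t in range(n_periods)]
        grad = [2.0 * sum(gap[t] * diffs[j][t] for t in range(n_periods))
                for j in range(n_controls)]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(n_controls)])
    return w


# Hypothetical pre-period wages: Houston above Miami, Atlanta below, a third city uneven.
miami = [10.0, 10.5, 11.0]
houston = [12.0, 12.5, 13.0]
atlanta = [8.0, 8.5, 9.0]
petersburg = [9.0, 10.5, 9.5]

weights = synthetic_control_weights(miami, [houston, atlanta, petersburg])
```

With these made-up numbers, an equal mix of Houston and Atlanta reproduces Miami's pre-period path exactly, so the weights converge to roughly one half each for those two cities and zero for the third.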
Functional form assumptions can play an important role in difference-in-differ-
ences methods. For example, in the extreme case with only two groups and two
periods, it is not clear whether we should assume that the percentage change over
time in average outcomes would have been the same in the treatment and control
groups in the absence of the treatment, or whether we should assume that the level
of the change over time would have been the same. In general, a treatment might
affect both the mean and the variance of outcomes, and the impact of the treatment
might vary across individuals.
For the case where the data include repeated cross-sections of individuals (that
is, the data include individual observations about many units within each group
in two different time periods, but the individuals cannot be linked across time
periods or may come from a distinct sample such as a survey), in Athey and Imbens
(2006), we propose a nonlinear version of the difference-in-differences model.
This approach, which we call changes-in-changes, does not rely on functional form
assumptions, while still allowing the effects of time and treatment to vary system-
atically across individuals. For example, one can imagine a situation in which the
returns to skill are increasing over time, or in which a new medical treatment holds
greater benefit for sicker individuals. The distribution of outcomes that emerges
from the nonlinear difference-in-differences model is of direct interest for policy
implications, beyond the average effect of the treatment itself. Further, a number
of authors have used this approach as a robustness check, or what we will call in the
next main section a supplementary analysis, for the results from a linear model.
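The following is a minimal sketch of the changes-in-changes idea, assuming the continuous-outcome form of the estimator: each pre-period outcome in the treatment group is mapped to its rank in the control group's pre-period distribution, and then to the outcome at that same rank in the control group's post-period distribution. The function names and data are hypothetical, and empirical distribution functions stand in for the population ones.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def empirical_cdf(sample, y):
    """Share of observations in the sample that are at most y."""
    return sum(1 for v in sample if v <= y) / len(sample)


def empirical_quantile(sample, q):
    """Smallest sample value whose empirical CDF is at least q."""
    ordered = sorted(sample)
    for v in ordered:
        if empirical_cdf(ordered, v) >= q:
            return v
    return ordered[-1]


def changes_in_changes(control_pre, control_post, treat_pre, treat_post):
    """Average effect on the treated, using a rank-preserving counterfactual."""
    counterfactual = [
        empirical_quantile(control_post, empirical_cdf(control_pre, y))
        for y in treat_pre
    ]
    return mean(treat_post) - mean(counterfactual)


# Hypothetical data in which the time effect doubles outcomes in the control group.
control_pre = [1.0, 2.0, 3.0, 4.0]
control_post = [2.0, 4.0, 6.0, 8.0]
treat_pre = [2.0, 3.0, 4.0]
treat_post = [5.0, 7.0, 9.0]

cic = changes_in_changes(control_pre, control_post, treat_pre, treat_post)
# Counterfactual treated outcomes are [4, 6, 8], so cic = 7 - 6 = 1.0.

linear_did = (mean(treat_post) - mean(treat_pre)) - (mean(control_post) - mean(control_pre))
# (7 - 3) - (5 - 2.5) = 1.5
```

In this example the time effect is larger for units with higher outcomes, so the linear difference-in-differences calculation and the changes-in-changes calculation give different answers.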
Estimating Average Treatment Effects in Settings with Multivalued Treatments
Much of the earlier econometric literature on treatment effects focused on the
case with binary treatments, but a more recent literature discusses the issues posed by
multivalued treatment, which is of great relevance as, in practice, many treatments
have multiple versions. For example, a get-out-the-vote campaign (or any advertising
campaign) might consider a variety of possible messages; or a firm might consider
several different price levels. In the case of a binary treatment, there are a variety of
methods for estimating treatment effects under the unconfoundedness assumption,
which requires that the treatment assignment is as good as random conditional on
covariates. One method that works well when the number of covariates is small is to
model average outcomes as a function of observed covariates, and then use the model
to adjust for the extent to which differences in the treatment and control group are
accounted for by observables.
However, this type of modeling performs less well if there are many covariates,
or if the differences between the treatment and control group in terms of covari-
ates are large, because errors in estimating the impact of covariates lead to large
biases. An alternative set of approaches relies on the concept of a propensity score
(Rosenbaum and Rubin 1983a), which is the probability that an individual gets a
treatment, conditional on the individual’s observable characteristics. In environ-
ments where unconfoundedness holds, it is sufficient to control for the propensity
score (a single-dimensional variable that summarizes how observables affect the
treatment probability), and it is not necessary to model outcomes as a function
of all observables. That is, a comparison of two people with the same propensity
score, one of whom received the treatment and one who did not, should in prin-
ciple adjust for confounding variables. In practice, some of the most effective causal
estimation methods in nonexperimental studies using observable data appear to
be those that combine some modeling of the conditional mean of outcomes (for
example, using regression adjustments) with a covariate balancing method such as
subclassification, matching, or weighting based on the propensity score (Imbens
and Rubin 2015), making them doubly robust (Bang and Robins 2005).
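For concreteness, the propensity score and one standard estimator with the doubly robust property can be written as follows. This is the augmented inverse probability weighting form, shown as an illustration rather than as the specific estimator any of the cited references recommends. The notation is assumed here: $W_i$ is the binary treatment, $X_i$ the covariates, $\hat{\mu}_w(x)$ an estimate of the conditional mean outcome under treatment level $w$, and $\hat{e}(x)$ an estimate of the propensity score.

```latex
e(x) = \Pr(W_i = 1 \mid X_i = x),
\qquad
\hat{\tau}_{\mathrm{DR}}
  = \frac{1}{N} \sum_{i=1}^{N} \left[
      \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)
      + \frac{W_i \, \bigl(Y_i^{\mathrm{obs}} - \hat{\mu}_1(X_i)\bigr)}{\hat{e}(X_i)}
      - \frac{(1 - W_i) \, \bigl(Y_i^{\mathrm{obs}} - \hat{\mu}_0(X_i)\bigr)}{1 - \hat{e}(X_i)}
    \right].
```

The first two terms are the regression adjustment, and the last two reweight the regression residuals by the propensity score. The estimator remains consistent if either the outcome model or the propensity score model is correctly specified.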
Substantially less attention has been paid to extensions of these methods to
the case where the treatment takes on multiple values (exceptions include Imbens
2000; Lechner 2001; Imai and Van Dyk 2004; Cattaneo 2010; Hirano and Imbens
2004; Yang et al. 2016). However, the recent literature shows that the dimension-
reducing properties of a generalized version of the propensity score, and by
extension the doubly robust properties, can be maintained in the multivalued treat-
ment setting, but the role of the propensity score is subtly different, opening up the
area for empirical research in this setting. Imbens (2000) introduced the concept
of a generalized propensity score, which is based on an assumption of weak uncon-
foundedness, requiring only that the indicator for receiving a particular level of
the treatment and the potential outcome for that treatment level are conditionally
independent. Weak unconfoundedness implies similar dimension-reduction prop-
erties as are available in the binary treatment case. This approach can be used to
develop matching or propensity score subclassification strategies (where groups of
individuals whose propensity scores lie in an interval are compared as if treatment
assignment was random within the band) (for example, Yang et al. 2016). The main
insight is that it is not necessary to look for subsets of the covariate space where one
can interpret the difference in average outcomes by all treatment levels as estimates
of causal effects. Instead, subsets of the covariate space are constructed where one
can estimate the marginal average outcome for a particular treatment level as the
conditional average for units with that treatment level, one treatment level at a time.
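In symbols, the argument can be sketched as follows. The notation is assumed for illustration and follows the general form of the generalized propensity score: $D_i(w)$ is the indicator that unit $i$ receives treatment level $w$, and $r(w, x)$ is the generalized propensity score.

```latex
r(w, x) = \Pr(W_i = w \mid X_i = x),
\qquad
D_i(w) \perp Y_i(w) \mid X_i \quad \text{for each } w,
\qquad
E[Y_i(w)] = E\Bigl[ E\bigl[ Y_i^{\mathrm{obs}} \mid W_i = w, \; r(w, X_i) \bigr] \Bigr].
```

The score is evaluated at the treatment level $w$ of interest for every unit, which is why the average outcome is recovered one treatment level at a time rather than through a single comparison across all levels.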
Causal Effects in Networks and Social Interactions
The study of peer effects, and more generally of causal effects of various treatments
in networks, is an important area. For example, individuals in a social network may
receive information, or may gain access to a product or service, and we wish to
understand the impact of that treatment both on the treated individuals and on their
peers. This
area has seen much novel work in recent years, ranging from econometrics (Manski
1993) to economic theory (Jackson 2010). Here, we discuss some of the progress
that has been made in econometrics. In general, this literature focuses on causal
effects in settings where units, often individuals, interact in a way that violates the
no-interference assumptions (more precisely, the SUTVA or Stable Unit Treatment
Value Assumption as in Rosenbaum and Rubin 1983a; Imbens and Rubin 2015)
that are routinely made in the treatment effects literature. In some cases, the way in
which individuals interact is simply a nuisance, and the main interest continues to
be on the direct causal effects of own treatments. In other cases, the magnitude of
the interactions, or peer effects, is itself the subject of interest.
Networks and peer effects can operate through many scenarios, which has led
to the literature becoming somewhat fractured and unwieldy. For example, there
is a distinction between settings where the population can be partitioned into
subpopulations with all units within a subpopulation connected, such as
classrooms (for example, Manski 1993; Carrell, Sacerdote, and West 2013),
workers in a labor market (Crépon et al. 2013), or roommates in college
(Sacerdote 2001), and settings with general networks, in which friends of
friends are not necessarily friends themselves (Christakis and Fowler 2007).
Another important distinction is between settings with many disconnected
networks, where asymptotic arguments for consistency rely on the number
of networks getting large, and settings with a single connected network. It may be
reasonable in some cases to think of the links as symmetric, and in others of links
operating only in one direction. Links can be binary, with links either present or
not, or a network may contain links of different strengths.
An important strand of the econometric literature in this area focuses on Manski’s
linear-in-means model (Manski 1993; Bramoullé, Djebbari, and Fortin 2009;
Goldsmith-Pinkham and Imbens 2013). Manski’s original paper focuses on the setting
where the population is partitioned into groups (like classrooms), and peer effects
are constant within the groups. The basic model specification is
\[
Y_i = \beta_0 + \beta_{\bar{Y}} \bar{Y}_i + \beta_X' X_i + \beta_{\bar{X}}' \bar{X}_i + \beta_Z' Z_i + \varepsilon_i,
\]
where $i$ indexes the individual. Here $Y_i$ is the outcome for individual $i$, say educational
achievement; $\bar{Y}_i$ is the average outcome for individuals in the peer group for
individual $i$; $X_i$ is a set of exogenous characteristics of individual $i$, like prior test scores
in an educational setting; $\bar{X}_i$ is the average value of the characteristics in individual
$i$’s peer group; and $Z_i$ are group characteristics that are constant for all individuals in
the same peer group, like quality of teachers in a classroom setting. Manski considers
three types of peer effects that lead to correlations in outcomes between individuals.
Outcomes for individuals in the same group may be correlated because of a shared
environment. These effects are called correlated peer effects, and captured by the
coefficient on $Z_i$. Next are the exogenous peer effects, captured by the coefficient on
the group average $\bar{X}_i$ of the exogenous variables. The third type is the endogenous
peer effect, captured by the coefficient on the group average outcomes $\bar{Y}_i$.
Manski (1993) concludes that separate identification of these three effects,
even in the linear model setting with constant coefficients, relies on very strong
assumptions and is unrealistic in many settings. In subsequent empirical work,
researchers have often put additional structure on the effects (for example, by
ruling out some of the effects) or brought in additional information (for example,
by using richer network structures) to obtain identification. Graham (2008) focuses
on a setting very similar to that of Manski’s linear-in-means model. He considers
restrictions on the within-group covariance matrix of the εi, assuming
homoskedasticity at the individual level. In that case, a key insight is that variation in group
size implies restrictions on the within and between group variances that can be
used to identify peer effects. Bramoullé, Djebbari, and Fortin (2009) allow for a
more general network configuration than Manski, one in which friends of friends
are not necessarily connected, and demonstrate the benefits of such configurations
for identification of peer effects. Hudgens and Halloran (2008) start closer to the
Rubin Causal Model or potential outcome setup. They focus primarily on the case
with a binary treatment, and consider how the vector of treatments for the peer
group affects the individual. They suggest various structures on these treatment
effects that can aid in identification. Aronow and Samii (2013) allow for general
networks and peer effects, investigating the identifying power from randomization
of the treatments at the individual level.
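The identification problem that Manski describes can be made concrete with a short derivation. The sketch below assumes, for simplicity, that the peer group of individual $i$ is the whole group including $i$ (so that $\bar{Y}_i$ and $\bar{X}_i$ are common to all members of a group and the index can be dropped) and that $\beta_{\bar{Y}} \neq 1$; the group-average error $\bar{\varepsilon}$ is notation introduced here for illustration. Averaging the linear-in-means model within a group, solving for the group mean, and substituting back gives:

```latex
\begin{align*}
\bar{Y} &= \frac{\beta_0 + (\beta_X + \beta_{\bar{X}})' \bar{X} + \beta_Z' Z + \bar{\varepsilon}}
               {1 - \beta_{\bar{Y}}}, \\
Y_i &= \frac{\beta_0}{1 - \beta_{\bar{Y}}}
      + \beta_X' X_i
      + \left( \frac{\beta_{\bar{X}} + \beta_{\bar{Y}} \beta_X}{1 - \beta_{\bar{Y}}} \right)' \bar{X}
      + \left( \frac{\beta_Z}{1 - \beta_{\bar{Y}}} \right)' Z
      + \varepsilon_i + \frac{\beta_{\bar{Y}}}{1 - \beta_{\bar{Y}}} \, \bar{\varepsilon}.
\end{align*}
```

Under these assumptions the reduced form determines $\beta_X$ and the two composite coefficients on $\bar{X}$ and $Z$, but the endogenous effect $\beta_{\bar{Y}}$, the exogenous effect $\beta_{\bar{X}}$, and the correlated effect $\beta_Z$ cannot be recovered separately from those coefficients without further restrictions.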
Two other branches of the literature on estimation of causal effects in a context
of network and peer effects are worth mentioning. One part focuses on developing
models for network formation. Such models require the researcher to
specify in what way the expanding sample would be similar to or different from
the current sample, which in turn is important for deriving asymptotic approxima-
tions based on large samples. Recent examples of such work in economics include
Jackson and Wolinsky (1996), Jackson (2010), Goldsmith-Pinkham and Imbens
(2013), Christakis, Fowler, Imbens, and Kalyanaraman (2010), and Mele (2013).
Chandrasekhar and Jackson (2016) develop a model for network formation and
a corresponding central limit theorem in the presence of correlation induced by
network links. Chandrasekhar (2016) surveys the general econometrics literature
on network formation.
The other branch worth a mention is the use of randomization inference in
the context of causal regressions involving networks, as a way of generating exact
p-values. As an example of randomization inference, consider the null hypothesis
that a treatment has no effect. Because the null of no effects is sharp (that is, if the
null hypothesis is true, we know exactly what the outcomes would be in alternative
treatment regimes after observing the individual in one treatment regime), it allows
for the calculation of exact p-values. The approach works by simulating alternative
(counterfactual) treatment assignment vectors and then calculating what the test
statistic (for example, difference in means between treated and control units) would
have been if that assignment had been the real one. This approach relies heavily on
the fact that the null hypothesis is sharp, but many interesting null hypotheses are
not sharp. In Athey, Eckles, and Imbens (2015), we discuss a large class of alternative
null hypotheses: for example, hypotheses restricting higher order peer effects (peer
effects from friends-of-friends) while allowing for the presence of peer effects from
friends; hypotheses about whether a dense network can be represented by a simpli-
fied or sparsified set of rules; and hypotheses about whether peers are exchangeable,
or whether some peers have larger or different effects. To test such hypotheses, in
Athey, Eckles, and Imbens (2015), we introduce the notion of an artificial exper-
iment, in which some units have their treatment assignments held fixed, and we
randomize over the remaining units. The artificial experiment starts by designating
an arbitrary set of units to be focal. The test statistics considered depend only on
outcomes for these focal units. Given the focal units, one derives the set of assignments
that, under the null hypothesis, does not change the outcomes for the focal units. The exact distribution
of the test statistic can then be inferred despite the original null hypothesis not being
sharp. This approach allows us to test hypotheses about, for example, the effect of
friends-of-friends, without making additional assumptions about the network struc-
ture and without resorting to asymptotics in the size of the network.
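The basic randomization-inference calculation for the sharp null of no effect can be sketched in a few lines. The example below is a minimal illustration, not the artificial-experiment procedure for non-sharp nulls: it assumes a completely randomized design with a fixed number of treated units, ignores any network structure, and enumerates every alternative assignment, which is feasible only for small samples. The function names are hypothetical.

```python
from itertools import combinations


def difference_in_means(outcomes, treated):
    """Mean outcome of the treated units minus mean outcome of the controls."""
    treated_set = set(treated)
    t = [y for i, y in enumerate(outcomes) if i in treated_set]
    c = [y for i, y in enumerate(outcomes) if i not in treated_set]
    return sum(t) / len(t) - sum(c) / len(c)


def exact_randomization_p_value(outcomes, treated):
    """Two-sided exact p-value under the sharp null of no effect.

    Under the sharp null the observed outcomes would be unchanged by any
    alternative assignment, so the test statistic can be recomputed for
    every assignment with the same number of treated units.
    """
    n, k = len(outcomes), len(treated)
    observed = abs(difference_in_means(outcomes, treated))
    as_extreme = 0
    total = 0
    for alternative in combinations(range(n), k):
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(difference_in_means(outcomes, alternative)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total
```

With six units, three of them treated, there are 20 possible assignments, so the smallest attainable two-sided p-value in that design is 2/20.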
Even when a causal study is done carefully, both in analysis and design, there
is often little assurance that the causal effects are valid for populations or settings
other than those studied. This concern has been raised particularly forcefully in
experimental studies (for examples, see the discussions in Deaton 2010; Imbens
2010; Manski 2013). Some have emphasized that without internal validity, little
can be learned from a study (Shadish, Cook, and Campbell 2002). However, Deaton
(2010), Manski (2013), and Banerjee, Chassang, and Snowberg (2016) have argued
that external validity should receive more emphasis.
In some recent work, approaches have been proposed that allow researchers to
directly assess the external validity of estimators for causal effects. A leading example
concerns settings with instrumental variables (for example, Angrist 2004; Angrist
and Fernandez-Val 2010; Dong and Lewbel 2015; Angrist and Rokkanen 2015;
Bertanha and Imbens 2014; Kowalski 2016; Brinch, Mogstad, and Wiswall 2015).
An instrumental variables estimator is often interpreted as an estimator of the local
average treatment effect, that is, the average effect of the treatment for individuals
whose treatment status is affected by the instrument. So under what conditions can
these estimates be considered representative for the entire sample? In this context,
one can partition the sample into several groups, depending on the effect of the
instrumental variable on the receipt of the treatment. There are two groups that
are unaffected by the instrumental variable: always-takers, who always receive the
treatment, and never-takers, who never receive the treatment, no matter the value of
the instrumental variable. Compliers are those whose treatment status is affected by
the instrumental variable. In that context, Angrist (2004) suggests testing whether
the difference in average outcomes for always-takers and never-takers is equal to the
average effect for compliers. Bertanha and Imbens (2014) suggest testing a combi-
nation of two equalities: whether the average outcome for untreated compliers is
equal to the average outcome for never-takers; and whether the average outcome
for treated compliers is equal to the average outcome for always-takers. Angrist and
Fernandez-Val (2010) seek to exploit the presence of other exogenous covariates
using conditional effect ignorability, which is that, conditional on these additional
covariates, the average effect for compliers is identical to the average effect for
never-takers and always-takers.
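The group averages that enter these comparisons can be computed from the four instrument-by-treatment cells. The sketch below is illustrative only: it assumes a binary instrument that is as good as randomly assigned and no defiers (monotonicity), it returns point estimates rather than test statistics, and the function name and record layout are hypothetical.

```python
from statistics import mean


def group_means(records):
    """Average outcomes by compliance type.

    records: list of (z, w, y) tuples with binary instrument z,
    binary treatment w, and outcome y. All four (z, w) cells must be
    non-empty and the complier share must be positive.
    """
    cell = {(z, w): [y for zz, ww, y in records if (zz, ww) == (z, w)]
            for z in (0, 1) for w in (0, 1)}
    n_z0 = len(cell[(0, 0)]) + len(cell[(0, 1)])
    n_z1 = len(cell[(1, 0)]) + len(cell[(1, 1)])
    share_always = len(cell[(0, 1)]) / n_z0  # treated even when z = 0
    share_never = len(cell[(1, 0)]) / n_z1   # untreated even when z = 1
    share_complier = 1.0 - share_always - share_never
    mean_always = mean(cell[(0, 1)])
    mean_never = mean(cell[(1, 0)])
    # The (z = 0, w = 0) cell mixes never-takers and untreated compliers.
    untreated_compliers = ((share_never + share_complier) * mean(cell[(0, 0)])
                           - share_never * mean_never) / share_complier
    # The (z = 1, w = 1) cell mixes always-takers and treated compliers.
    treated_compliers = ((share_always + share_complier) * mean(cell[(1, 1)])
                         - share_always * mean_always) / share_complier
    return {
        "never_takers": mean_never,
        "untreated_compliers": untreated_compliers,
        "always_takers": mean_always,
        "treated_compliers": treated_compliers,
    }
```

The difference between the treated and untreated complier means is the local average treatment effect, while the gaps between compliers and the never-takers or always-takers are the quantities whose equality the tests described above examine.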
In the context of regression discontinuity designs, concerns about external
validity are especially salient. In that setting, the estimates are in principle valid only
for individuals with values of the forcing variable near the threshold. There have
been a number of approaches to assess the plausibility of generalizing those local
estimates to other parts of the population. Some of them apply to both sharp and
fuzzy regression discontinuity designs, and some apply only to fuzzy designs. Some
require the presence of additional exogenous covariates, and others rely only on
the presence of the forcing variable. For example, Dong and Lewbel (2015) observe
that in general, in regression discontinuity designs with a continuous forcing vari-
able, one can estimate the magnitude of the discontinuity as well as the magnitude
of the change in the first derivative of the regression function, or even higher-order
derivatives, which allows one to extrapolate away from values of the forcing variable
close to the threshold. In another approach, Angrist and Rokkanen (2015) suggest
testing whether conditional on additional covariates, the correlation between the
forcing variable and the outcome vanishes. Such a finding would imply that the
treatment assignment can be thought of as unconfounded conditional on the addi-
tional covariates, which again allows for extrapolation away from the threshold.
Finally, Bertanha and Imbens (2014) propose an approach based on a fuzzy regres-
sion discontinuity design. They suggest testing for continuity of the conditional
expectation of the outcome conditional on the treatment and the forcing variable
at the threshold, adjusted for differences in the covariates.
In some cases, we wish to exploit the benefits of the experimental results, in
particular the high degree of internal validity, in combination with the external
validity and precision from large-scale representative observational studies. Here we
discuss three settings in which experimental studies can be leveraged in combination
with observational studies to provide richer answers than either design could provide
on its own. In the first example, the surrogate variables case, the primary outcome
was not observed in the experiment, but an intermediate outcome was observed.
In a second case, both the intermediate outcome and the primary outcome were
observed. In the third case, multiple experiments bear on a common outcome.
These examples do not exhaust the settings in which researchers can leverage exper-
imental data more effectively, and more research in this area is likely to be fruitful.
In the case of surrogate variables, studied in Athey, Chetty, Imbens, and Kang
(2016), the researcher uses an intermediate variable as a surrogate for the primary
outcome. For example, in medical trials there is a long history of attempts to use
intermediate health measures as surrogates (Prentice 1989). The key condition for
an intermediate variable to be a valid surrogate is that, in the experimental sample,
conditional on the surrogate and observed covariates, the (primary) outcomes and
the treatment are independent (Prentice 1989; Begg and Leung 2000; Frangakis and
Rubin 2002). In medical settings, where researchers often used single surrogates, this
condition was often not satisfied in settings where it could be tested. But it may be more
plausible in other settings. For example, suppose an internet company is considering
a change to the user experience on the company’s website. It is interested in the effect
of that change on the user’s purchases over a year-long period. The firm carries out a
randomized experiment over a month, during which it measures details concerning
the customer’s engagement like the number of visits, webpages visited, and the length
of time spent on the various webpages. The firm may also have historical records
on user characteristics, including past engagement. The combination of the pretreat-
ment variables and the surrogates may be sufficiently rich so that, conditional on the
combination, the primary outcome is independent of the treatment.
In administrative and survey research databases used in economics, a large
number of intermediate variables are often recorded that lie on or close to the
causal path between the treatment and the primary outcome. In such cases, it
may be plausible that the full set of surrogate variables satisfies at least approxi-
mately the independence condition. In this setting, in Athey, Chetty, Imbens, and
Kang (2016), we develop multiple methods for estimating the average effect. One
method corresponds to estimating the relation between the outcome and the surro-
gates in the observational data and using that to impute the missing outcomes in
the experimental sample. Another corresponds to estimating the relation between
the treatment and the surrogates in the experimental sample and using that to
impute the treatment indicator in the observational sample. Yet another exploits
both methods, using the efficient influence function. In the same paper, we also
derive the biases from violations of the surrogacy assumption.
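The first of these methods, imputing the missing primary outcome from the surrogates, can be sketched as follows. This is a deliberately simplified illustration rather than the estimator developed in the paper: it assumes a single surrogate, a linear relation between outcome and surrogate, no pretreatment covariates, and that the surrogacy condition holds. The function names are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for a single regressor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def surrogate_effect(obs_surrogate, obs_outcome, exp_surrogate, exp_treatment):
    """Average effect on the imputed primary outcome.

    Step 1: learn the outcome-surrogate relation in the observational data.
    Step 2: impute the unobserved outcome in the experimental sample.
    Step 3: compare imputed outcomes between treated and control units.
    """
    intercept, slope = fit_line(obs_surrogate, obs_outcome)
    imputed = [intercept + slope * s for s in exp_surrogate]
    treated = [y for y, w in zip(imputed, exp_treatment) if w == 1]
    control = [y for y, w in zip(imputed, exp_treatment) if w == 0]
    return mean(treated) - mean(control)
```

With several surrogates and covariates, the single-regressor fit would be replaced by a richer prediction model, but the three steps stay the same.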
In the second setting for leveraging experiments, studied in Athey, Chetty, and
Imbens (2016), the researcher has data from a randomized experiment, in this case
containing information on the treatment and the intermediate variables, as well as
pretreatment variables. In an observational study, the researcher observes the same
variables plus the primary outcome. One can then compare the estimates of the
average effect on the intermediate outcomes based on the observational sample,
after adjusting for pretreatment variables, with those from the experimental sample.
The latter are known to be consistent, and so if one finds substantial and statistically
significant differences, then unconfoundedness need not hold. For that case, in
Athey, Chetty, and Imbens (2016), we develop methods for adjusting for selection
on unobservables, exploiting the observations on the intermediate variables.
The third setting, involving the use of multiple experiments, has not received
as much attention, but provides fertile ground for future work. Consider a setting in
which a number of experiments were conducted that vary in terms of the population
from which the sample is drawn or in the exact nature of the treatments included.
The researcher may be interested in combining these experiments to obtain more
efficient estimates, perhaps for predicting the effect of a treatment in another popu-
lation or estimating the effect of a treatment with different characteristics. These
issues are related to external validity concerns but include more general efforts to
decompose the effects from experiments into components that can inform deci-
sions on related treatments. In the treatment effects literature, aspects of these
problems have been studied in Hotz, Imbens, and Mortimer (2005), Imbens
(2010), and Allcott (2015). They have also received some attention in the literature
on structural modeling, where experimental data are used to anchor aspects of the
structural model (for example, Todd and Wolpin 2006).
Primary analyses focus on point estimates of the primary estimands along with
standard errors. In contrast, supplementary analyses seek to shed light on the cred-
ibility of the primary analyses. These supplementary analyses do not seek a better
estimate of the effect of primary interest, nor do they (necessarily) assist in selecting
among competing statistical models. Instead, the analyses exploit the fact that the
assumptions behind the identification strategy often have implications for the data
beyond those exploited in the primary analyses. Supplementary analyses can take on
a variety of forms, and we are not aware of a comprehensive survey to date. This
literature is very active, in both theoretical and empirical studies, and is likely to grow
in importance in the future. Here, we discuss four examples from the empirical and
theoretical literatures, which we hope provide some guidance for future work.
We will discuss four forms of supplementary analysis: 1) placebo analysis, where
pseudo-causal effects are estimated that are known to be equal to zero based on
a priori knowledge; 2) sensitivity and robustness analyses that assess how much
estimates of the primary estimands can change if we weaken the critical assump-
tions underlying the primary analyses; 3) identification and sensitivity analyses that
highlight what features of the data identify the parameters of interest; and 4) a
supplementary analysis that is specific to regression discontinuity analyses, in which
the focus is on whether the density of the forcing variable is discontinuous at the
threshold, which would suggest that the forcing variable is being manipulated.
In a placebo analysis, the most widely used of the supplementary analyses, the
researcher replicates the primary analysis with the outcome replaced by a pseudo-
outcome that is known not to be affected by the treatment. Thus, the true value
of the estimand for this pseudo-outcome is zero, and the goal of the supplemen-
tary analysis is to assess whether the adjustment methods employed in the primary
analysis, when applied to the pseudo-outcome, lead to estimates that are close to
zero. These are not standard specification tests that suggest alternative specifica-
tions when the null hypothesis is rejected. The implication of rejection here is that
it is possible the original analysis was not credible at all.
One type of placebo test relies on treating lagged outcomes as pseudo-outcomes.
Consider, for example, the dataset assembled by Imbens, Rubin, and Sacerdote
(2001), which studies participants in the Massachusetts state lottery. The treatment
of interest is an indicator for winning a big prize in the lottery (with these prizes
paid out over a 20-year period), with the control group consisting of individuals who
won small, one-time prizes. The estimates of the average treatment effect rely
on an unconfoundedness assumption, namely that the lottery prize is as good as
randomly assigned after taking out associations with some pre-lottery variables: for
example, these variables include six years of lagged earnings, education measures,
gender, and other individual characteristics. Unconfoundedness is certainly a plau-
sible assumption here, given that the winning lottery ticket is randomly drawn.
But there is no guarantee that unconfoundedness holds. The two primary reasons
are: 1) there is only a 50 percent response rate for the survey; and 2) there may be
differences in the rate at which individuals buy lottery tickets. To assess unconfound-
edness, it is useful to estimate the average causal effect with pre-lottery earnings as
the outcome. Using the actual outcome, we estimate that winning the lottery (with
on average a $20,000 yearly prize) reduces average post-lottery earnings by $5,740,
with a standard error of $1,400. Using the pseudo-outcome we obtain an estimate
of minus $530, with a standard error of $780. This finding, along with additional
analyses, strongly suggests that unconfoundedness holds.
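As a rough check on the magnitudes reported above, one can form the ratio of each estimate to its standard error. This is a back-of-the-envelope calculation added for illustration, and it assumes the estimators are approximately normally distributed:

```latex
\[
t_{\text{outcome}} = \frac{-5{,}740}{1{,}400} = -4.1,
\qquad
t_{\text{pseudo}} = \frac{-530}{780} \approx -0.68 .
\]
```

The estimate for the actual outcome lies about four standard errors from zero, whereas the estimate for the pseudo-outcome lies less than one standard error from zero, so a zero effect on pre-lottery earnings would not be rejected at conventional significance levels.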
However, using the same approach with the LaLonde (1986) data that are
widely used in the training evaluation literature (for example, Heckman and Hotz
1989; Dehejia and Wahba 1999; Imbens 2015), the results are quite different. Here
we use 1975 (pretreatment) earnings as the pseudo-outcome, leaving us with only a
single pretreatment year of earnings to adjust for the substantial difference between
the trainees and comparison group from the Current Population Survey. Again, we
first test whether the simple average difference in adjusted 1975 earnings is zero.
Then we test whether both the level of 1975 earnings and the indicator for positive
1975 earnings are different in the trainees and the control groups, using separate
tests for individuals with zero and positive 1974 earnings. The null is clearly rejected,
casting doubt on the unconfoundedness assumption.
Placebo approaches can also be used in other contexts, like regression discon-
tinuity design. Covariates typically play only a minor role in the primary analyses
there, although they can improve precision (Imbens and Lemieux 2008; Calo-
nico, Cattaneo, and Titiunik 2014a, b). However, these exogenous covariates can
play an important role in assessing the plausibility of the regression discontinuity
design. According to the identification strategy, they should be uncorrelated with
the treatment when the forcing variable is close to the threshold. We can test this
assumption, for example by using a covariate as the pseudo-outcome in a regression
discontinuity analysis. If we were to find that the conditional expectation of one
of the covariates is discontinuous at the threshold, such a discontinuity might be
interpreted as evidence for an unobserved confounder whose distribution changes
at the boundary, one which might also be correlated with the outcome of interest.
We can illustrate this application with the election data from Lee (2008), who is
interested in estimating the effect of incumbency on electoral outcomes. The treat-
ment is a Democrat winning a congressional election, and the forcing variable is the
Democratic vote share minus the Republican vote share in the current election,
and so the threshold is zero. We look at an indicator for winning the next election as
the outcome. As a pretreatment variable, we consider an indicator for winning the
previous election to the one that defines the forcing variable. Our estimates for the
actual outcome (winning the next election) are substantially larger than those for
the pseudo-outcome (winning the previous election), where we cannot reject the
null hypothesis that the effect on the pseudo-outcome is zero.
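A placebo check of this kind amounts to running the same discontinuity estimator twice, once on the actual outcome and once on the pretreatment variable. The sketch below is a minimal illustration under simplifying assumptions that are not taken from the original analysis: a threshold at zero, separate straight-line fits on each side using only observations within a fixed bandwidth (a uniform kernel), and no standard errors. The function names are hypothetical.

```python
from statistics import mean


def intercept_at_zero(x, y):
    """Least-squares line through (x, y), evaluated at x = 0."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar


def rd_jump(forcing, values, bandwidth):
    """Estimated jump in `values` at a threshold of zero.

    Fits one line to observations just left of the threshold and another
    to observations just right of it, then differences the two fitted
    values at the threshold. Each side needs at least two distinct
    forcing-variable values inside the bandwidth.
    """
    left = [(x, v) for x, v in zip(forcing, values) if -bandwidth <= x < 0]
    right = [(x, v) for x, v in zip(forcing, values) if 0 <= x <= bandwidth]
    left_x, left_v = zip(*left)
    right_x, right_v = zip(*right)
    return intercept_at_zero(right_x, right_v) - intercept_at_zero(left_x, left_v)
```

Calling `rd_jump` with the outcome gives the primary estimate, and calling it with a pretreatment covariate gives the placebo estimate, which should be close to zero if the design is credible.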
One final example of the use of placebo regressions is Rosenbaum (1987), who
is interested in the causal effect of a binary treatment and focuses on a setting with
multiple comparison groups (see also Heckman and Hotz 1989; Imbens and Rubin
2015). In Rosenbaum’s case, there is no strong reason to believe that one of the
comparison groups is superior to another. Rosenbaum proposes testing equality of
the average outcomes in the two comparison groups after adjusting for pretreatment
variables. If one finds that there are substantial differences left after such adjust-
ments, it shows that at least one of the comparison groups is not valid, which makes
the use of either of them less credible. In applications to evaluations of labor market
programs, one might implement such methods by comparing a control group of
individuals who are eligible but choose not to participate with another control group
of individuals who are not eligible, as in Heckman and Hotz (1989). The biases from
evaluations based on the first control group might correspond to differences in
motivation, whereas evaluations based on the second control group could be biased
because of direct associations between eligibility criteria and outcomes.
Robustness and Sensitivity
The classical frequentist statistical paradigm suggests that a researcher specifies
a single statistical model, estimates this model on the data, and reports estimates
and standard errors. This is of course far from common practice, as pointed out, for
example, in Leamer (1978, 1983). In practice, researchers consider many specifica-
tions and perform various specification tests before settling on a preferred model.
Standard practice in modern empirical work is to present in the final paper esti-
mates of the preferred specification of the model in combination with assessments
of the robustness of the findings from this preferred specification. These alternative
specifications are intended to convey that the substantive results of the preferred
specification are not sensitive to some of the choices in that specification, like using
different functional forms of the regression function or alternative ways of control-
ling for differences in subpopulations.
Some recent work has sought to make these efforts at assessing robustness
more systematic. In Athey and Imbens (2015), we propose one approach to this
problem, which we illustrate here in the context of regression analyses, although it
can also be applied to more complex nonlinear or structural models. In the regres-
sion context, suppose that the object of interest is a particular regression coefficient
that has an interpretation as a causal effect. We suggest considering a set of different
specifications based on splitting the sample into two subsamples, and estimating
them separately. (Specifically, we suggest splitting the original sample once for each
of the elements of the original covariate vector Zi, and splitting at a threshold that
optimizes fit by minimizing the sum of squared residuals.) The original causal effect
is then estimated as a weighted average of the estimates from the two split specifica-
tions. If the original model is correct, the augmented model still leads to a consistent
estimator for the estimand. Notice that the focus is not on finding an alternative
specification that may provide a better fit; rather, it is on assessing whether the esti-
mate in the original specification is robust to a range of alternative specifications.
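The logic of the split-sample procedure can be illustrated with a stripped-down sketch. It departs from the description above in two labeled ways: it splits each covariate at its median rather than at the threshold that minimizes the sum of squared residuals, and it uses a difference in means between treated and control units in each subsample in place of a full regression. The spread of the resulting estimates across covariates then serves as a simple robustness summary. All names are hypothetical.

```python
from statistics import mean, median, pstdev


def diff_in_means(rows):
    """Treated-minus-control mean outcome; rows are dicts with keys 'y' and 'w'."""
    treated = [r["y"] for r in rows if r["w"] == 1]
    control = [r["y"] for r in rows if r["w"] == 0]
    return mean(treated) - mean(control)


def split_estimate(rows, covariate):
    """Estimate from one split specification.

    Splits the sample at the median of `covariate` (an assumption made
    here for simplicity), estimates the effect in each half, and averages
    the two estimates with weights proportional to subsample size. Each
    half must contain both treated and control units.
    """
    cut = median(r[covariate] for r in rows)
    low = [r for r in rows if r[covariate] <= cut]
    high = [r for r in rows if r[covariate] > cut]
    return (len(low) * diff_in_means(low)
            + len(high) * diff_in_means(high)) / len(rows)


def robustness_summary(rows, covariates):
    """Split-sample estimates, one per covariate, and their standard deviation."""
    estimates = [split_estimate(rows, c) for c in covariates]
    return estimates, pstdev(estimates)
```

A small standard deviation indicates that the estimate changes little across the split specifications, while a large one signals sensitivity to how subpopulations are handled.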
This approach has some weaknesses. For example, adding irrelevant covariates
to the procedure might decrease the standard deviation of estimates. If there are
many covariates, some form of dimensionality reduction may be appropriate prior
to estimating the robustness measure. Refining and improving this approach is an
interesting direction for future work. For example, the theoretical literature has
developed many estimators in the setting with unconfoundedness. Some rely on
estimating the conditional mean, others rely on estimating the propensity score,
and still others rely on matching on the covariates or the propensity score (for a
review of this literature, see Imbens and Wooldridge 2009). We recommend that
researchers report estimates based on a variety of methods to assess robustness,
rather than estimates based on a single preferred method.
In combination with reporting estimates based on the preferred specification,
it may be useful to report ranges of estimates based on substantially weaker assumptions.
For example, Rosenbaum and Rubin (1983b; see also Rosenbaum 2002) suggest
starting with a restrictive specification, and then assessing the changes in the estimates
that result from small to modest relaxations of the key identifying assumptions such as
unconfoundedness. In the context Rosenbaum and Rubin consider, that of estimating
average treatment effects under selection on observables, they allow for the presence
of an unobserved covariate that should have been adjusted for in order to estimate
the average effect of interest. They explore how strong the correlation between this
unobserved covariate and the treatment, and the correlation between the unobserved
covariate and the potential outcomes, would have to be in order to substantially
change the estimate for the average effect of interest. Imbens (2003) builds on the
Rosenbaum and Rubin approach by developing a data-driven way to obtain a set of
correlations between the unobserved covariates and treatment and outcome.
In other work along these lines, Arkhangelskiy and Drynkin (2016) study sensi-
tivity of the estimates of the parameters of interest to misspecification of the model
governing the nuisance parameters. Tamer (2010) reviews how to assess robustness
based on the partial identification or bounds literature originating with Manski.
Altonji, Elder, and Taber (2008) and Oster (2015) focus on the correlation
between the unobserved component in the relation between the outcome and the
treatment and observed covariates, and the unobserved component in the relation
between the treatment and the observed covariates. In the absence of functional form
assumptions, this correlation is not identified. These papers therefore explore the
sensitivity to fixed values for this correlation, ranging from the case where the corre-
lation is zero (and the treatment is exogenous), to an upper limit chosen to match
the correlation found between the observed covariates in the two regression func-
tions. Oster takes this further by developing estimators based on this equality. This
useful approach provides the researcher with a systematic way of doing the sensitivity
analyses that are routinely done in empirical work, but often in an unsystematic way.
Identification and Sensitivity
Gentzkow and Shapiro (2015) take a different approach to sensitivity. They
propose a method for highlighting what statistical relationships in a dataset are
most closely related to parameters of interest. Intuitively, the idea is that simple
correlations between particular combinations of variables identify particular param-
eters. To operationalize this, they investigate, in the context of a given model, how
the key parameters of interest relate to a set of summary statistics. These summary
statistics would typically include easily interpretable functions of the data such as
correlations between subsets of variables. Under mild conditions, the estimates
of the model parameters and the summary statistics should be jointly normal
in large samples. If the summary statistics are in fact asymptotically sufficient for
the model parameters, the joint distribution of the parameter estimates and the
summary statistics will be degenerate. More typically, the joint normal distribu-
tion will have a covariance matrix with full rank. For example, when estimating the
average causal effect of a binary treatment under unconfoundedness, one would
expect the parameter of interest to be closely related to the correlation between the
outcome and the treatment, and, in addition, to the correlations between some of
the additional covariates and the outcome, or to the correlations between some of
those covariates and the treatment. Gentzkow and Shapiro discuss how to interpret
the covariance matrix in terms of sensitivity of model parameters to model specifi-
cation. More broadly, their approach is related to proposals in different settings by
Conley, Hansen, and Rossi (2012) and Chetty (2009).
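As a hedged sketch of this idea, in notation that is ours rather than the text's, write \hat\theta for the parameter estimates and \hat\gamma for the vector of summary statistics. If the two are jointly asymptotically normal, one natural summary of how the estimates depend on the statistics is the coefficient matrix from the asymptotic regression of \hat\theta on \hat\gamma. Whether this matches the exact definition used by Gentzkow and Shapiro should be checked against their paper.

```latex
\sqrt{N}
\begin{pmatrix} \hat\theta - \theta_0 \\ \hat\gamma - \gamma_0 \end{pmatrix}
\xrightarrow{d}
\mathcal{N}\!\left( 0,\;
\begin{pmatrix} \Sigma_{\theta\theta} & \Sigma_{\theta\gamma} \\
                \Sigma_{\gamma\theta} & \Sigma_{\gamma\gamma} \end{pmatrix} \right),
\qquad
\Lambda = \Sigma_{\theta\gamma}\,\Sigma_{\gamma\gamma}^{-1}.
```

In this notation, the degenerate case described above corresponds to \Sigma_{\theta\theta} - \Lambda \Sigma_{\gamma\gamma} \Lambda^{\top} = 0, so that the summary statistics asymptotically determine the parameter estimates.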
Supplementary Analyses in Regression Discontinuity Designs
One of the most interesting supplementary analyses is the McCrary (2008) test
in regression discontinuity designs (see also Otsu, Xu, and Matsushita 2013). What
makes this analysis particularly interesting is the conceptual distance between the
primary analysis and the supplementary analysis. The McCrary test assesses whether
there is a discontinuity in the density of the forcing variable at the threshold. In a
conventional analysis, it is unusual that the marginal distribution of a variable that is
assumed to be exogenous is of any interest to the researcher: often, the entire anal-
ysis is conducted conditional on such regressors. However, the identification strategy
underlying regression discontinuity designs relies on the assumption that units just
to the left and just to the right of the threshold are comparable. That argument is
difficult to sustain if, say, there are substantially more units just to the left than
just to the right of the threshold. Again, even though such an imbalance could easily
be taken into account in the estimation, in many cases where one would find such
an imbalance, it would suggest that the forcing variable is not a characteristic exog-
enously assigned to individuals, but rather that it is being manipulated in some way.
The classic example is that of an educational regression discontinuity design
where the forcing variable is a test score. If the individual grading the test is aware
of the importance of exceeding the threshold, and in particular if they know the
student personally, they may assign scores differently than if they were not aware
of this. If there were such manipulation of the score, there would likely be a
discontinuity in the density of the forcing variable at the threshold: scores just
below the threshold would tend to be moved above it, while there would be no
reason to change the grade for an individual scoring just above the threshold.
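The logic of this check can be illustrated with a deliberately crude sketch in Python. It is not McCrary's estimator, which smooths a binned density on each side of the threshold. Instead it simply counts observations in a narrow window on either side and applies an exact binomial test of equal mass. The function names, the window width, and the simulated test scores are all assumptions made for illustration.

```python
import math
import random


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for H0: success probability 0.5, k successes in n trials."""
    if n == 0:
        return 1.0

    def log_pmf(j):
        return (math.lgamma(n + 1) - math.lgamma(j + 1)
                - math.lgamma(n - j + 1) + n * math.log(0.5))

    observed = log_pmf(k)
    # Add up the probabilities of all outcomes no more likely than the observed one.
    total = sum(math.exp(log_pmf(j)) for j in range(n + 1)
                if log_pmf(j) <= observed + 1e-9)
    return min(1.0, total)


def density_jump_check(forcing, threshold, window):
    """Count units within `window` of the threshold on each side and test for equal mass."""
    left = sum(1 for x in forcing if threshold - window <= x < threshold)
    right = sum(1 for x in forcing if threshold <= x < threshold + window)
    return left, right, binomial_two_sided_p(right, left + right)


rng = random.Random(0)
honest_scores = [rng.uniform(0.0, 100.0) for _ in range(5000)]
# Hypothetical manipulation: half of the scores just below 50 are bumped up to 50.
manipulated_scores = [50.0 if 47.0 <= s < 50.0 and rng.random() < 0.5 else s
                      for s in honest_scores]

print(density_jump_check(honest_scores, threshold=50.0, window=3.0))
print(density_jump_check(manipulated_scores, threshold=50.0, window=3.0))
```

With a smooth density the two counts should be similar. Under the simulated manipulation, mass piles up just above the threshold and the p-value becomes very small.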
Machine Learning and Econometrics
Supervised machine learning focuses primarily on prediction problems: given a
dataset with data on an outcome Yi, which can be discrete or continuous, and some
predictors Xi, the goal is to estimate a model on a subset of the data, called the
training sample, and to use it to predict outcomes in the remaining data, called the
test sample, given the values of the predictors Xi. Note that
this approach is fundamentally different from the goal of causal inference in obser-
vational studies, where we observe data on outcomes and a treatment variable, and
we wish to draw inferences about potential outcomes. Kleinberg, Ludwig, Mullain-
athan, and Obermeyer (2015) argue that many important policy problems are
fundamentally prediction problems; see also the article by Mullainathan and Spiess
in this issue. A second class of problems, unsupervised machine learning, focuses on
methods for finding patterns in data, such as groups of similar items, like clustering
images into groups, or putting text documents into groups of similar documents.
These methods can potentially be quite useful in applications involving text, images, or
other very high-dimensional data, even though these approaches have not had too
much use in the economics literature so far. For an exception, see Athey, Mobius,
and Pal (2016) for an example in which unsupervised learning is used to categorize
newspaper articles into topics.
An important difference between many (but not all) econometric approaches
and supervised machine learning is that supervised machine learning methods typi-
cally rely on data-driven model selection, most commonly through cross-validation,
and often the main focus is on prediction performance without regard to the impli-
cations for inference. For supervised learning methods, the sample is split into a
training sample and a test sample, where, for example, the test sample might have
10 percent of observations.
The training sample is itself partitioned into a number of subsamples, or
cross-validation samples, often 10 of them. For each candidate value of the tuning
parameter that governs model complexity, each cross-validation sample m is set
aside in turn. The remainder of the training sample is used for estimation, and
the estimation results are then used to predict outcomes for the left-out subsample
m. The final choice of the tuning parameter is the one that minimizes the sum of
the squared residuals in the cross-validation samples. Ultimate model performance
is assessed by calculating the mean-squared error of model predictions (that is, the
average of the squared residuals) on the held-out test sample, which was not used at all
for model estimation or tuning. Predictions from these machine learning methods
are not typically unbiased, and estimators may not be asymptotically normal and
centered around the estimand. Indeed, the machine learning literature places little
emphasis on asymptotic normality, and when theoretical properties are analyzed,
they often take the form of worst-case bounds on risk criteria. However, the fact
that model performance (in the sense of predictive accuracy on a test set) can be
directly measured makes it possible to compare predictive models even when their
asymptotic properties are not understood. Enormous progress has been made in the
machine learning literature in terms of developing models that do well (according
to the stated criteria) in real-world datasets. Here, we focus primarily on problems
of causal inference, showing how supervised machine learning methods can improve
the performance of causal analysis, particularly in cases with many covariates.
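The train, cross-validate, and test workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the learner is a k-nearest-neighbor average (our choice, not one named in the text), the tuning parameter is k, the data are simulated, and all function names are hypothetical.

```python
import random


def knn_predict(train, x0, k):
    """Average outcome of the k training observations with predictor closest to x0."""
    nearest = sorted(train, key=lambda obs: abs(obs[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def sum_sq_resid(train, evaluate, k):
    """Sum of squared prediction errors on `evaluate` for a model fit on `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in evaluate)


def cross_validate(training, candidate_ks, n_folds=10):
    """Pick the k with the smallest sum of squared residuals over the left-out folds."""
    folds = [training[m::n_folds] for m in range(n_folds)]
    cv_loss = {}
    for k in candidate_ks:
        cv_loss[k] = 0.0
        for m in range(n_folds):
            # Estimate on everything except fold m, then predict fold m.
            rest = [obs for j, fold in enumerate(folds) if j != m for obs in fold]
            cv_loss[k] += sum_sq_resid(rest, folds[m], k)
    return min(cv_loss, key=cv_loss.get), cv_loss


rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(-2.0, 2.0)
    data.append((x, x ** 2 + rng.gauss(0.0, 0.5)))  # simulated outcome

n_test = len(data) // 10  # 10 percent of observations held out
test_sample, training_sample = data[:n_test], data[n_test:]
best_k, cv_loss = cross_validate(training_sample, [1, 5, 15, 50, 150])
# The test sample is touched only once, after tuning is finished.
test_mse = sum_sq_resid(training_sample, test_sample, best_k) / len(test_sample)
print(best_k, round(test_mse, 3))
```

The design point is the separation of roles: the cross-validation folds choose the tuning parameter, and the held-out test sample is used only to report predictive accuracy.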
Machine Learning Methods for Average Causal Effects
In recent years, researchers have used machine learning methods to help them
control in a flexible manner for a large number of covariates. Some of these methods
involve adaptations of methods used for the few-covariate case: for example, the
weighting approach in Hirano, Imbens, Ridder, and Rubin (2001) can be used in
combination with machine learning methods such as LASSO and random forests for
estimating the propensity score, as in McCaffrey, Ridgeway, and Morral (2004) and Wyss et al.
(2014). Such methods have relatively poor properties in many cases because they
do not necessarily emphasize the covariates that are important for the bias, that
is, those that are correlated both with the outcomes and the treatment indicator.
More promising methods would combine estimation of the association between the
potential outcomes and the covariates, and of the association between the treat-
ment indicator and the covariates. Here we discuss three approaches along these
lines (see also Athey, Imbens, Pham, and Wager 2017).
First, Belloni, Chernozhukov, Fernández-Val, and Hansen (2013) propose a double
selection procedure, where they first use a LASSO regression to select covariates
that are correlated with the outcome, and then again to select covariates that are
correlated with the treatment. In a final ordinary least squares regression, they
include the union of the two sets of covariates, improving the properties of the esti-
mators for the average treatment effect compared to simple regularized regression
of the outcome on the covariates and the treatment.
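The double selection steps can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: it uses NumPy, a hand-written coordinate-descent LASSO with a fixed penalty of 0.3 (in practice the penalty would be chosen by theory or cross-validation), a continuous treatment, and simulated data in which the true effect is 1.0. All function names are hypothetical.

```python
import numpy as np


def lasso_support(X, y, penalty, n_sweeps=200):
    """Indices of covariates with nonzero coefficients in a coordinate-descent LASSO fit."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized covariates
    beta = np.zeros(p)
    resid = y - y.mean()
    for _ in range(n_sweeps):
        for j in range(p):
            resid = resid + Xs[:, j] * beta[j]  # partial residual without covariate j
            rho = Xs[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - penalty, 0.0)  # soft thresholding
            resid = resid - Xs[:, j] * beta[j]
    return set(np.flatnonzero(beta).tolist())


def coef_on_treatment(y, d, controls):
    """OLS coefficient on d in a regression of y on a constant, d, and the controls."""
    Z = np.column_stack([np.ones(len(y)), d, controls])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef[1]


rng = np.random.default_rng(0)
n, p = 500, 30
X = rng.normal(size=(n, p))
d = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)            # treatment depends on X0, X1
y = 1.0 * d + 2.0 * X[:, 0] + X[:, 2] + rng.normal(size=n)  # true effect of d is 1.0

# Step 1: covariates that predict the outcome. Step 2: covariates that predict the treatment.
selected = lasso_support(X, y, penalty=0.3) | lasso_support(X, d, penalty=0.3)
# Step 3: OLS of the outcome on the treatment and the union of the two selected sets.
double_selection = coef_on_treatment(y, d, X[:, sorted(selected)])
naive = coef_on_treatment(y, d, X[:, []])  # no controls, for comparison
print(sorted(selected), naive, double_selection)
```

Taking the union matters because a covariate that is strongly related to the treatment but only weakly related to the outcome can be missed by an outcome-only LASSO and still generate bias.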
A second line of research has focused on finding weights that directly balance
covariates or functions of the covariates between treatment and control groups,
so that once the data has been reweighted, it mimics a randomized experiment
more closely. In the literature with few covariates, this approach has been devel-
oped in Hainmueller (2012) and Graham, Pinto, and Egel (2012, 2016); for
discussion of the case with many covariates, some examples include Zubizarreta
(2015) and Imai and Ratkovic (2014). In Athey, Imbens, and Wager (2016), we
develop an estimator that combines the balancing with regression adjustment. The
idea is that, in order to predict the counterfactual outcomes that the treatment
group would have had in the absence of the treatment, it is necessary to extrapolate
from control observations. By rebalancing the data, the amount of extrapolation
required to account for differences between the two groups is reduced. To capture
remaining differences, the regularized regression just mentioned can be used to
model outcomes in the absence of the treatment. In effect, the Athey et al. estimator
trades off the bias coming from imbalance between the covariates in the treated
subsample and the weighted control subsample against the variance from having
excessively variable weights.
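In stylized form, an estimator of this kind for the mean control outcome of the treated group can be written as below. The notation is ours and is an assumption, not taken from the text: X_c is the matrix of control covariates, \bar X_t is the covariate mean in the treated sample, \hat\beta_c is a regularized regression coefficient estimated on the controls, W_i is the treatment indicator, and \zeta \in (0,1) is a tuning parameter. The exact objective and constraints should be taken from Athey, Imbens, and Wager (2016).

```latex
\hat\gamma = \arg\min_{\gamma}\;
  \zeta \,\bigl\lVert \bar X_t - X_c^{\top}\gamma \bigr\rVert_\infty^2
  + (1-\zeta)\,\lVert \gamma \rVert_2^2
\quad \text{s.t. } \sum_{i:W_i=0}\gamma_i = 1,\; \gamma_i \ge 0,
\qquad
\hat\mu_c = \bar X_t^{\top}\hat\beta_c
  + \sum_{i:W_i=0} \hat\gamma_i \bigl( Y_i - X_i^{\top}\hat\beta_c \bigr).
```

The first term of the objective penalizes remaining covariate imbalance (a source of bias), and the second penalizes variable weights (a source of variance).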
A third approach builds on the semiparametric literature on influence func-
tions. In general, van der Vaart (2000) suggests estimating the finite dimensional
component as the average of the influence function, with the infinite dimensional
components estimated nonparametrically. In the context of estimation of average
treatment effects this leads to “doubly robust estimators” in the spirit of Robins and
Rotnitzky (1995), Robins, Rotnitzky, and Zhao (1995), and van der Laan and Rubin
(2006). Chernozhukov et al. (2016) propose using machine learning methods for
the infinite dimensional components and incorporate sample splitting to further
improve the properties.
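For reference, the standard doubly robust (augmented inverse probability weighting) form of such an estimator is shown below. The notation is an assumption on our part: W_i is the treatment indicator, \hat\mu_w(x) is an estimate of the conditional mean outcome under treatment status w, and \hat e(x) is the estimated propensity score.

```latex
\hat\tau_{\mathrm{DR}} = \frac{1}{N}\sum_{i=1}^{N}\left[
  \hat\mu_1(X_i) - \hat\mu_0(X_i)
  + \frac{W_i\,\bigl(Y_i - \hat\mu_1(X_i)\bigr)}{\hat e(X_i)}
  - \frac{(1-W_i)\,\bigl(Y_i - \hat\mu_0(X_i)\bigr)}{1-\hat e(X_i)}
\right]
```

The estimator remains consistent if either the outcome regressions or the propensity score is estimated consistently. With sample splitting, the nuisance functions used for unit i are estimated on a part of the sample that excludes unit i.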
In all three cases, procedures for trimming the data to eliminate extreme values
of the estimated propensity score (as in Crump, Hotz, Imbens, and Mitnik 2009)
remain important in practice.
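A minimal sketch of such trimming is given below. The interval [0.1, 0.9] is a commonly cited rule of thumb from this literature and is used here as an assumption; the optimal cutoffs depend on the data. The dictionary keys and function names are hypothetical.

```python
def trim_on_propensity(units, lower=0.1, upper=0.9):
    """Keep only units whose estimated propensity score lies in [lower, upper]."""
    return [u for u in units if lower <= u["e_hat"] <= upper]


def largest_ipw_weight(units):
    """Largest inverse-probability weight: 1/e for treated units, 1/(1-e) for controls."""
    return max(1.0 / u["e_hat"] if u["w"] == 1 else 1.0 / (1.0 - u["e_hat"])
               for u in units)


# Hypothetical units: treatment indicator w and estimated propensity score e_hat.
sample = [
    {"w": 1, "e_hat": 0.02},
    {"w": 1, "e_hat": 0.35},
    {"w": 0, "e_hat": 0.60},
    {"w": 0, "e_hat": 0.97},
    {"w": 1, "e_hat": 0.80},
]
trimmed = trim_on_propensity(sample)
print(len(sample), largest_ipw_weight(sample))
print(len(trimmed), largest_ipw_weight(trimmed))
```

Dropping the units with extreme scores removes the very large weights that would otherwise dominate a weighting or doubly robust estimator, at the cost of changing the population to which the estimate applies.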
Machine Learning for Heterogeneous Causal Effects
In many cases, a policy or treatment might have different costs and benefits if
applied in different settings. Gaining insight into the nature of such heterogeneous
treatment effects can be useful. Moreover, in evaluating a policy or treatment, it is
useful to know the applications where the benefit/cost ratios are most favorable.
However, when machine learning methods are applied to estimating heterogeneous
treatment effects, they in effect search over many covariates and subsets of the
covariate space for the best fit. As a result, such methods may lead to spurious find-
ings of treatment effect differences. Indeed, in clinical medical trials, pre-analysis
plans must be registered in advance to avoid the problem that researchers will be
tempted to search among groups of the studied population to find one that seems
to be affected by the treatment, and may instead end up with spurious findings. In
the social sciences, the problem of searching across groups becomes more severe
when there are many covariates.
One approach to this problem is to search exhaustively for treatment effect
heterogeneity and then correct for issues of multiple hypothesis testing, by which
we mean the problems that arise when a researcher considers a large number of
statistical hypotheses, but analyzes them as if only one had been considered. This
can lead to false discovery, because across many hypothesis tests, we expect some
to be rejected even if the null hypothesis is true. To address this problem, List,
Shaikh, and Xu (2016) propose to give each covariate a “low” or “high” discrete
value, and then loop through the covariates, testing whether the treatment effect is
different when the covariate is low versus high. Because the number of covariates
may be large, standard approaches to correcting for multiple testing may severely
limit the power of a (corrected) test to find heterogeneity. List et al. propose an
approach based on bootstrapping that accounts for correlation among test statis-
tics; this approach can provide substantial improvements over standard multiple
testing approaches when the covariates are highly correlated, because dividing the
sample according to each of two highly correlated covariates results in substantially
the same division of the data. However, this approach has the drawback that the
researcher must specify in advance all of the hypotheses to be tested, along with
alternative ways to discretize covariates and flexible interactions among covariates.
It may not be possible to explore these combinations fully.
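The scale of the multiple testing problem, and two standard textbook corrections, can be illustrated as follows. Neither correction is the bootstrap procedure of List, Shaikh, and Xu, which additionally exploits correlation among the test statistics. The function names and the p-values are hypothetical.

```python
def familywise_error(n_tests, alpha):
    """Chance of at least one false rejection among n independent true null hypotheses."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the smallest p-value with alpha/m, the next with alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank is 0 for the smallest p-value
        if p_values[i] > alpha / (m - rank):
            break  # stop at the first non-rejection
        reject[i] = True
    return reject


# With 20 covariates tested at the 5 percent level, a false discovery is likely.
print(familywise_error(20, 0.05))

p_values = [0.001, 0.012, 0.02, 0.04]  # hypothetical subgroup tests
print(bonferroni_reject(p_values))
print(holm_reject(p_values))
```

Holm's procedure controls the same error rate as Bonferroni while rejecting at least as many hypotheses, which illustrates why the choice of correction affects the power to detect heterogeneity.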
A different approach is to adapt machine learning methods to discover partic-
ular forms of heterogeneity by seeking to identify subgroups that have different
treatment effects. One example is to examine effects within subgroups in cases where
eligibility for a government program is determined according to criteria that can be
represented in a decision tree, similar to the situation when a doctor uses a decision
tree to determine whether to prescribe a drug to a patient. Another example is to
examine effects within subgroups in cases where an algorithm uses a table to determine
which type of user interface, offer, email solicitation, or ranking of search results to
provide to a user. Subgroup analysis has long been used in medical studies (Foster,
Taylor, and Ruberg 2011), but it is often subject to criticism due to concerns of
multiple hypothesis testing (Assmann, Pocock, Enos, and Kasten 2000).
Among the more common machine learning methods, regression trees are a
natural choice for partitioning into subgroups (the classic reference is Breiman,
Friedman, Stone, and Olshen 1984). Consider a regression with two covariates. The
value of each covariate can be split so that it is above or below a certain level. The
regression tree approach would consider which covariate should be split, and at
which level, so that the sum of squared residuals is minimized. With many covariates,
these steps of choosing which covariate to split, and where to split it, are carried out
sequentially, thus resulting in a tree format. The tree eventually results in a parti-
tion of the data into groups, defined according to values of the covariates, where
each group is referred to as a leaf. In the simplest version of a regression tree, we
would stop this splitting process once the reduction in the sum of squared residuals
is below a certain level.
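The splitting rule just described can be sketched in plain Python. This is a minimal illustration with hypothetical function names and toy data; it omits refinements such as pruning and cross-validated complexity penalties that practical implementations use.

```python
def sum_sq_resid(ys):
    """Sum of squared deviations of the outcomes from their own mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows, outcomes):
    """Search every covariate and level; return (ssr, covariate index, level)."""
    best = (sum_sq_resid(outcomes), None, None)
    for j in range(len(rows[0])):
        for level in sorted({row[j] for row in rows}):
            below = [y for row, y in zip(rows, outcomes) if row[j] <= level]
            above = [y for row, y in zip(rows, outcomes) if row[j] > level]
            if not above:
                continue  # everything on one side is not a split
            ssr = sum_sq_resid(below) + sum_sq_resid(above)
            if ssr < best[0]:
                best = (ssr, j, level)
    return best


def grow_tree(rows, outcomes, min_gain):
    """Split recursively until the best split reduces the SSR by less than min_gain."""
    ssr_after, j, level = best_split(rows, outcomes)
    if j is None or sum_sq_resid(outcomes) - ssr_after < min_gain:
        return {"leaf_mean": sum(outcomes) / len(outcomes), "n": len(outcomes)}
    left = [(r, y) for r, y in zip(rows, outcomes) if r[j] <= level]
    right = [(r, y) for r, y in zip(rows, outcomes) if r[j] > level]
    return {
        "covariate": j,
        "level": level,
        "below": grow_tree([r for r, _ in left], [y for _, y in left], min_gain),
        "above": grow_tree([r for r, _ in right], [y for _, y in right], min_gain),
    }


# Toy data with two covariates; the outcome jumps when the first covariate exceeds 2.
rows = [(1, 5), (2, 1), (3, 4), (4, 2)]
outcomes = [0, 0, 10, 10]
tree = grow_tree(rows, outcomes, min_gain=1.0)
print(tree)
```

Each terminal dictionary is a leaf, and the prediction for a new observation is the mean outcome in the leaf that its covariates fall into.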
In Athey and Imbens (2016), we develop a method that we call causal trees,
which builds on earlier work by Su et al. (2009) and Zeileis, Hothorn, and Hornik
(2008). The method is based on the machine learning method of regression trees,
but it uses a different criterion for building the tree: rather than focusing on
improvements in mean-squared error of the prediction of outcomes, it focuses on
mean-squared error of treatment effects. The method relies on sample splitting,
in which half the sample is used to determine the optimal partition of the covariate
space (the tree structure), while the other half is used to estimate treatment
effects within the leaves. The output of the method is a treatment effect and a confi-
dence interval for each subgroup. In Athey and Imbens (2016), we highlight the
fact that the criteria used for tree construction should differ when the goal is to esti-
mate treatment effect heterogeneity rather than heterogeneity in outcomes. After
all, the factors that affect the level of outcomes might be quite different from those
that affect treatment effects. Although the sample-splitting approach may seem
extreme—ultimately only half the data is used for estimating treatment effects—
it has several advantages. The confidence intervals are valid no matter how many
covariates are used in estimation. In addition, the researcher is free to estimate a
more complex model in the second part of the data, for example, if the researcher
wishes to include fixed effects in the model, or model different types of correlation
in the error structure.
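The sample-splitting idea can be sketched as follows. This is a simplified stand-in and not the causal tree algorithm itself: it considers a single split on one covariate, and the splitting criterion (the leaf-size-weighted sum of squared estimated effects) is our simplification. The data come from a simulated randomized experiment, and all names are hypothetical.

```python
import math
import random
import statistics


def effect_and_se(sample):
    """Treated-minus-control difference in mean outcomes and its standard error."""
    treated = [y for _, w, y in sample if w == 1]
    control = [y for _, w, y in sample if w == 0]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return diff, se


def split_leaves(sample, level):
    """Partition observations (x, w, y) by whether x is at most `level`."""
    return {"below": [obs for obs in sample if obs[0] <= level],
            "above": [obs for obs in sample if obs[0] > level]}


def choose_split(sample, levels):
    """Pick the level that makes estimated effects differ most across leaves."""
    def criterion(level):
        return sum(len(leaf) * effect_and_se(leaf)[0] ** 2
                   for leaf in split_leaves(sample, level).values())
    return max(levels, key=criterion)


def honest_estimates(sample, level):
    """Per-leaf (lower, estimate, upper) using data not used to choose the split."""
    result = {}
    for name, leaf in split_leaves(sample, level).items():
        tau, se = effect_and_se(leaf)
        result[name] = (tau - 1.96 * se, tau, tau + 1.96 * se)
    return result


rng = random.Random(1)
data = []
for _ in range(4000):
    x = rng.random()
    w = rng.randint(0, 1)            # randomized treatment
    tau = 2.0 if x > 0.5 else 0.0    # simulated effect differs across the covariate
    data.append((x, w, tau * w + rng.gauss(0.0, 0.5)))

build_half, estimate_half = data[:2000], data[2000:]
split_level = choose_split(build_half, [i / 10 for i in range(1, 10)])
leaf_effects = honest_estimates(estimate_half, split_level)
print(split_level, leaf_effects)
```

Because the split is chosen on one half and the effects are estimated on the other, the search over split points does not contaminate the within-leaf confidence intervals.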
A disadvantage of the causal tree approach is that the estimates are not person-
alized for each individual; instead, all individuals assigned to a given group have the
same estimate. For example, a leaf might contain all male individuals aged 60 to 70,
with income above $50,000. An individual whose covariates are near the boundary,
for example, a 70-year-old man with income of $51,000, might have a treatment
effect that is different than the average for the whole group. For the problem of
more personalized prediction, Wager and Athey (2015) propose a method for esti-
mating heterogeneous treatment effects based on random forest analysis, where
the method generates many different trees and averages the result, except that the
component trees are now causal trees (and in particular, each individual tree is
estimated using sample splitting, where one randomly selected subsample is used
to build the tree while a distinct subsample is used to estimate treatment effects
in each leaf). Relative to a causal tree, which identifies a partition and estimates
treatment effects within each element of the partition, the causal forest leads to esti-
mates of causal effects that change more smoothly with covariates, and in principle
every individual has a distinct estimate. Random forests are known to perform very
well in practice for prediction problems, but their statistical properties were less well
understood until recently. Wager and Athey show that the predictions from causal
forests are asymptotically normal and centered on the true conditional average
treatment effect for each individual. They also propose an estimator for the vari-
ance, so that confidence intervals can be obtained. Athey, Tibshirani, and Wager
(2016) extend the approach to other models for causal effects, such as instrumental
variables, or other models that can be estimated using the generalized method of
moments (GMM). In each case, the goal is to estimate how a causal parameter of
interest varies with covariates.
An alternative approach, closely related, is based on Bayesian Additive Regres-
sion Trees (BART) (Chipman, George, and McCulloch 2010), which is essentially a
Bayesian version of random forests. Hill (2011) and Green and Kern (2012) apply
these methods to estimate heterogeneous treatment effects. Large-sample properties
of this method are unknown, but it appears to have good empirical performance.
Other machine learning approaches, like the LASSO regression approach, have
also been used in estimating heterogeneous treatment effects. Imai and Ratkovic
(2013) estimate a LASSO regression model with the treatment indicator interacted
with covariates, and use LASSO as a variable selection algorithm for determining
which covariates are most important. In using this approach, it may be prudent
to perform some supplementary analysis to verify that the method is not overfit-
ting; for example, one could use a sample-splitting approach, using half of the data
to estimate the LASSO regression and then comparing the results to an ordinary
least squares regression with the variables selected by LASSO in the other half of
the data. If the results are inconsistent, it could indicate that using half the data is
not good enough, or it might indicate that sample splitting is warranted to protect
against overfitting or other sources of bias that arise when data-driven model selec-
tion is used.
A natural application of personalized treatment effect estimation is to estimate
optimal policy functions in observational data. A literature in machine learning
considers this problem (Beygelzimer and Langford 2009; Beygelzimer et al. 2011);
some open questions include the statistical properties of the estimators, and the
ability to obtain confidence intervals on differences between policies obtained from
these methods. Recently, Athey and Wager (2017) bring in insights from semipa-
rametric efficiency theory in econometrics to propose a new estimator for optimal
policies and to analyze the properties of this estimator. Policies can be compared
in terms of their “risk,” which is defined as the gap between the expected outcomes
using the (unknown) optimal policy and the estimated policy. Athey and Wager
derive an upper bound for the risk of the policy estimated using their method and
show that it is necessary to use a method that is efficient (in the econometric sense)
to achieve that bound.
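In symbols, and in notation that is our assumption rather than the text's, let \pi map covariates to a treatment, let Y_i(w) denote the potential outcome under treatment w, and let \Pi be the class of policies under consideration. The risk of an estimated policy \hat\pi can then be written as:

```latex
V(\pi) = \mathbb{E}\bigl[\,Y_i(\pi(X_i))\,\bigr],
\qquad
R(\hat\pi) = \max_{\pi \in \Pi} V(\pi) \;-\; V(\hat\pi).
```

A risk of zero means that the estimated policy achieves the same expected outcome as the best policy in the class.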
Conclusion
In the last few decades, economists have learned to take very seriously the old
admonition from undergraduate econometrics that “correlation is not causality.”
We have surveyed a number of recent developments in the econometrics toolkit
for addressing causality issues in the context of estimating the impact of policies.
Some of these developments involve a greater sophistication in the use of methods
like regression discontinuity and differences-in-differences estimation. But we have
also tried to emphasize that the project of taking causality seriously often benefits
from combining these tools with other approaches. Supplementary analyses can
help the analyst assess the credibility of estimation and identification strategies.
Machine learning methods provide important new tools to improve estimation of
causal effects in high-dimensional settings, because in many cases it is important to
flexibly control for a large number of covariates as part of an estimation strategy for
drawing causal inferences from observational data. When causal interpretations of
estimates are more plausible, and inference about causality can reduce its reliance
on modeling assumptions (like those about functional form), the credibility of
policy analysis is enhanced.
■ We are grateful for comments by the editor and coeditors.
References
Abadie, Alberto, Alexis Diamond, and Jens
Hainmueller. 2010. “Synthetic Control Methods for
Comparative Case Studies: Estimating the Effect of
California’s Tobacco Control Program.” Journal
of the American Statistical Association 105(490):
Abadie, Alberto, Alexis Diamond, and Jens
Hainmueller. 2014. “Comparative Politics and the
Synthetic Control Method.” American Journal of
Political Science 59(2): 495–510.
Abadie, Alberto, and Javier Gardeazabal. 2003.
“The Economic Costs of Conflict: A Case Study of
the Basque Country.” American Economic Review
Abadie, Alberto, and Guido W. Imbens. 2006.
“Large Sample Properties of Matching Estimators
for Average Treatment Effects.” Econometrica 74(1):
Allcott, Hunt. 2015. “Site Selection Bias in
Program Evaluation.” Quarterly Journal of Economics
Altonji, Joseph G., Todd E. Elder, and Christo-
pher R. Taber. 2008. “Using Selection on Observed
Variables to Assess Bias from Unobservables When
Evaluating Swan–Ganz Catheterization.” American
Economic Review 98(2): 345–50.
Andrews, Donald, and James H. Stock. 2006.
“Inference with Weak Instruments.” Unpublished
Angrist, Joshua D. 2004. “Treatment Effect
Heterogeneity in Theory and Practice.” Economic
Journal 114(494): C52–83.
Angrist, Joshua, and Ivan Fernandez-Val.
2010. “ExtrapoLATE-ing: External Validity and
Overidentification in the LATE Framework.”
NBER Working Paper 16566.
Angrist, Joshua D., Guido W. Imbens, and
Donald B. Rubin. 1996. “Identification of Causal
Effects Using Instrumental Variables.” Journal of the
American Statistical Association 91(434): 444–55.
Angrist, Joshua D., and Alan B. Krueger. 1999.
“Empirical Strategies in Labor Economics.” In
Handbook of Labor Economics, edited by Orley C.
Ashenfelter and David Card, 1277–1366. North
Angrist, Joshua D., and Miikka Rokkanen.
2015. “Wanna Get Away? Regression Discontinuity
Estimation of Exam School Effects Away From the
Cutoff.” Journal of the American Statistical Association
Arkhangelskiy, Dmitry, and Evgeni Drynkin.
2016. “Sensitivity to Model Specification.” Unpub-
Aronow, Peter M., and Cyrus Samii. 2013.
“Estimating Average Causal Effects under Interfer-
ence between Units.” arXiv: 1305.6156v1.
Assmann, Susan F., Stuart J. Pocock, Laura
E. Enos, and Linda E. Kasten. 2000. “Subgroup
Analysis and Other (Mis)uses of Baseline Data in
Clinical Trials.” Lancet 355(9209): 1064–69.
Athey, Susan, Raj Chetty, and Guido Imbens.
2016. “Combining Experimental and Obser-
vational Data: Internal and External Validity.”
Athey, Susan, Raj Chetty, Guido Imbens, and
Hyunseung Kang. 2016. “Estimating Treatment
Effects Using Multiple Surrogates: The Role of the
Surrogate Score and the Surrogate Index.” arXiv:
Athey, Susan, Dean Eckles, and Guido Imbens.
2015. “Exact p-Values for Network Interference.”
NBER Working Paper 21313.
Athey, Susan, and Guido W. Imbens. 2006.
“Identification and Inference in Nonlinear
Difference-in-Differences Models.” Econometrica
Athey, Susan, and Guido Imbens. 2015. “A
Measure of Robustness to Misspecification.” American
Economic Review 105(5): 476–80.
Athey, Susan, and Guido Imbens. 2016. “Recur-
sive Partitioning for Estimating Heterogeneous
Causal Effects.” PNAS 113(27): 7353–60.
Athey, Susan, Guido Imbens, Thai Pham,
and Stefan Wager. 2017. “Estimating Average
Treatment Effects: Supplementary Analyses and
Remaining Challenges.” arXiv: 1702.01250.
Athey, Susan, Guido Imbens, and Stefan Wager.
2016. “Efficient Inference of Average Treatment
Effects in High Dimensions via Approximate
Residual Balancing.” arXiv: 1604.07125.
Athey, Susan, Markus Mobius, and Jeno Pal.
2016. “The Impact of Aggregators on News
Consumption.” Unpublished paper.
Athey, Susan, Julie Tibshirani, and Stefan Wager.
2016. “Solving Heterogeneous Estimating Equa-
tions with Gradient Forests.” arXiv: 1610.01271.
Athey, Susan, and Stefan Wager. 2017. “Efficient
Policy Learning.” arXiv: 1702.02896.
Banerjee, Abhijit, Sylvain Chassang, and Erik
Snowberg. 2016. “Decision Theoretic Approaches
to Experiment Design and External Validity.”
NBER Working Paper 22167.
Bang, Heejung, and James M. Robins. 2005.
“Doubly Robust Estimation in Missing Data and
Causal Inference Models.” Biometrics 61(4):
Begg, Colin B., and Denis H. Y. Leung.
2000. “On the Use of Surrogate End Points in
Randomized Trials.” Journal of the Royal Statistical
Society: Series A (Statistics in Society) 163(1): 15–28.
Bekker, Paul A. 1994. “Alternative Approxima-
tions to the Distributions of Instrumental Variable
Estimators.” Econometrica 62(3): 657–81.
Belloni, Alexandre, Victor Chernozhukov, Ivan
Fernández-Val, and Chris Hansen. 2013. “Program
Evaluation and Causal Inference with High-
Dimensional Data.” arXiv: 1311.2645.
Bertanha, Marinho, and Guido Imbens. 2014.
“External Validity in Fuzzy Regression Disconti-
nuity Designs.” NBER Working Paper 20773.
Beygelzimer, Alina, and John Langford. 2009.
“The Offset Tree for Learning with Partial Labels.”
Proceedings of the 15th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining,
Beygelzimer, Alina, John Langford, Lihong
Li, Lev Reyzin, and Robert E. Schapire. 2011.
“Contextual Bandit Algorithms with Supervised
Learning Guarantees.” Proceedings of the 14th
International Conference on Artificial Intelligence and
Statistics (AISTATS), pp. 19–26.
Bramoullé, Yann, Habiba Djebbari, and
Bernard Fortin. 2009. “Identification of Peer
Effects through Social Networks.” Journal of Econo-
metrics 150(1): 41–55.
Breiman, Leo, Jerome Friedman, Charles J.
Stone, and Richard A. Olshen. 1984. Classification
and Regression Trees. CRC Press.
Brinch, Christian, Magne Mogstad, and Matthew
Wiswall. 2015. “Beyond LATE with a Discrete
Instrument: Heterogeneity in the Quantity-Quality
Interaction in Children.” Unpublished paper.
Calonico, Sebastian, Matias D. Cattaneo, and
Rocío Titiunik. 2014a. “Robust Nonparametric
Confidence Intervals for Regression-Discontinuity
Designs.” Econometrica 82(6): 2295–2326.
Calonico, Sebastian, Matias D. Cattaneo, and
Rocío Titiunik. 2014b. “Robust Data-Driven Infer-
ence in the Regression-Discontinuity Design.” Stata
Journal 14(4): 909–46.
Card, David. 1990. “The Impact of the Mariel
Boatlift on the Miami Labor Market.” Industrial
and Labor Relations Review 43(2): 245–57.
Card, David, David S. Lee, Zhuan Pei, and
Andrea Weber. 2015. “Inference on Causal Effects
in a Generalized Regression Kink Design.” Econometrica
83(6): 2453–83.
Carrell, Scott E., Bruce I. Sacerdote, and James
E. West. 2013. “From Natural Variation to Optimal
Policy? The Importance of Endogenous Peer
Group Formation.” Econometrica 81(3): 855–82.
Cattaneo, Matias D. 2010. “Efficient Semipa-
rametric Estimation of Multi-valued Treatment
Effects under Ignorability.” Journal of Econometrics
Chamberlain, Gary, and Guido Imbens. 2004.
“Random Effects Estimators with Many Instrumental
Variables.” Econometrica 72(1): 295–306.
Chandrasekhar, Arun G. 2016. “The Econo-
metrics of Network Formation.” Chap. 13 in The
Oxford Handbook on the Economics of Networks, edited
by Yann Bramoullé, Andrea Galeotti, Brian W.
Rogers. Oxford University Press.
Chandrasekhar, Arun, and Matthew Jackson.
2016. “A Network Formation Model Based on
Subgraphs.” arXiv: 1611.07658.
Chernozhukov, Victor, Denis Chetverikov, Mert
Demirer, Esther Duflo, Christian Hansen, and
Whitney Newey. 2016. “Double Machine Learning
for Treatment and Causal Parameters.” arXiv:
Chetty, Raj. 2009. “Sufficient Statistics for
Welfare Analysis: A Bridge between Structural
and Reduced-Form Methods.” Annual Review of
Economics 1(1): 451–87.
Chipman, Hugh A., Edward I. George, and
Robert E. McCulloch. 2010. “BART: Bayesian Addi-
tive Regression Trees.” Annals of Applied Statistics
Christakis, Nicholas A., and James H. Fowler.
2007. “The Spread of Obesity in a Large Social
Network over 32 Years.” New England Journal of
Medicine (357): 370–79.
Christakis, Nicholas A., James H. Fowler, Guido
W. Imbens, and Karthik Kalyanaraman. 2010. “An
Empirical Model for Strategic Network Forma-
tion.” NBER Working Paper 16039.
Conley, Timothy G., Christian B. Hansen, and
Peter E. Rossi. 2012. “Plausibly Exogenous.” Review
of Economics and Statistics 94(1): 260–72.
Crépon, Bruno, Esther Duflo, Marc Gurgand,
Roland Rathelot, and Philippe Zamora. 2013.
“Do Labor Market Policies Have Displacement
Effects? Evidence from a Clustered Randomized
Experiment.” Quarterly Journal of Economics 128(2):
Crump, Richard K., V. Joseph Hotz, Guido W.
Imbens, and Oscar A. Mitnik. 2009. “Dealing with
Limited Overlap in Estimation of Average Treat-
ment Effects.” Biometrika 96(1): 187–99.
Deaton, Angus. 2010. “Instruments, Randomization,
and Learning about Development.” Journal of
Economic Literature 48(2): 424–55.
Dehejia, Rajeev H., and Sadek Wahba. 1999.
“Causal Effects in Nonexperimental Studies:
Reevaluating the Evaluation of Training
Programs.” Journal of the American Statistical Association
94(448): 1053–62.
Dong, Yingying. 2014. “Jump or Kink?
Identification of Binary Treatment Regression
Discontinuity Design without the Discontinuity.”
Dong, Yingying, and Arthur Lewbel. 2015. “Iden-
tifying the Effect of Changing the Policy Threshold
in Regression Discontinuity Models.” Review of
Economics and Statistics 97(5): 1081–92.
Doudchenko, Nikolay, and Guido W. Imbens.
2016. “Balancing, Regression, Difference-in-
Differences and Synthetic Control Methods: A
Synthesis.” arXiv: 1610.07748.
Foster, Jared C., Jeremy M. G. Taylor, and
Stephen J. Ruberg. 2011. “Subgroup Identifica-
tion from Randomized Clinical Data.” Statistics in
Medicine 30(24): 2867–80.
Frangakis, Constantine E., and Donald B.
Rubin. 2002. “Principal Stratification in Causal
Inference.” Biometrics 58(1): 21–29.
Gelman, Andrew, and Guido Imbens. 2014.
“Why High-Order Polynomials Should Not Be
Used in Regression Discontinuity Designs.” NBER
Working Paper 20405.
Gentzkow, Matthew, and Jesse Shapiro. 2015.
“Measuring the Sensitivity of Parameter Estimates
to Sample Statistics.” Unpublished paper.
Goldberger, Arthur S. 1972. “Selection Bias
in Evaluating Treatment Effects: Some Formal
Illustrations.” Institute for Research on Poverty
Discussion Paper 129-72.
Goldberger, Arthur S. 2008. “Selection Bias
in Evaluating Treatment Effects: Some Formal
Illustrations.” In Advances in Econometrics, Volume
21, edited by Tom Fomby, R. Carter Hill, Daniel L.
Millimet, Jeffrey A. Smith, and Edward J. Vytlacil,
1–31. Emerald Group Publishing Limited.
Goldsmith-Pinkham, Paul, and Guido W.
Imbens. 2013. “Social Networks and the Identi-
fication of Peer Effects.” Journal of Business and
Economic Statistics 31(3): 253–64.
Graham, Bryan S. 2008. “Identifying Social
Interactions through Conditional Variance Restrictions.”
Econometrica 76(3): 643–60.
Graham, Bryan S., Cristine Campos de Xavier
Pinto, and Daniel Egel. 2012. “Inverse Probability
Tilting for Moment Condition Models with Missing
Data.” Review of Economic Studies 79(3): 1053–79.
Graham, Bryan, Cristine Campos de Xavier
Pinto, and Daniel Egel. 2016. “Efficient Estimation
of Data Combination Models by the Method of
Auxiliary-to-Study Tilting (AST).” Journal of Busi-
ness and Economic Statistics 34(2): 288–301.
Green, Donald P., and Holger L. Kern. 2012.
“Modeling Heterogeneous Treatment Effects
in Survey Experiments with Bayesian Additive
Regression Trees.” Public Opinion Quarterly 76(3):
Hahn, Jinyong, Petra Todd, and Wilbert van der
Klaauw. 2001. “Identification and Estimation of
Treatment Effects with a Regression-Discontinuity
Design.” Econometrica 69(1): 201–09.
Hainmueller, Jens. 2012. “Entropy Balancing for
Causal Effects: A Multivariate Reweighting Method
to Produce Balanced Samples in Observational
Studies.” Political Analysis 20(1): 25–46.
Heckman, James J., and V. Joseph Hotz.
1989. “Choosing among Alternative Nonex-
perimental Methods for Estimating the Impact
of Social Programs: The Case of Manpower
Training.” Journal of the American Statistical Association
84(408): 862–74.
Heckman, James J., and Edward Vytlacil. 2007.
“Econometric Evaluation of Social Programs,
Part I: Causal Models, Structural Models and
Econometric Policy Evaluation.” In Handbook of
Econometrics 6B, edited by James Heckman and
Edward Leamer, 4779–4874. Elsevier.
Hill, Jennifer L. 2011. “Bayesian Nonparametric
Modeling for Causal Inference.” Journal of Compu-
tational and Graphical Statistics 20(1): 217–40.
Hirano, Keisuke. 2001. “Combining Panel
Data Sets with Attrition and Refreshment
Samples.” Econometrica 69(6): 1645–59.
Hirano, Keisuke, and Guido Imbens. 2004. “The
Propensity Score with Continuous Treatments.” In
Applied Bayesian Modeling and Causal Inference from
Incomplete-Data Perspectives: An Essential Journey with
Donald Rubin’s Statistical Family, edited by Andrew
Gelman and Xiao-Li Meng, 73–84. Wiley.
Holland, Paul W. 1986. “Statistics and Causal
Inference.” Journal of the American Statistical Associa-
tion 81(396): 945–60.
Hotz, V. Joseph, Guido W. Imbens, and Julie H.
Mortimer. 2005. “Predicting the Efficacy of Future
Training Programs Using Past Experiences at
Other Locations.” Journal of Econometrics 125(1–2):
Hudgens, Michael G., and M. Elizabeth
Halloran. 2008. “Toward Causal Inference with
Interference.” Journal of the American Statistical
Association 103(482): 832–42.
Imai, Kosuke, and Marc Ratkovic. 2013.
“Estimating Treatment Effect Heterogeneity in
Randomized Program Evaluation.” Annals of
Applied Statistics 7(1): 443–70.
Imai, Kosuke, and Marc Ratkovic. 2014.
“Covariate Balancing Propensity Score.” Journal of
the Royal Statistical Society: Series B (Statistical Methodology)
76(1): 243–63.
Imai, Kosuke, and David A. van Dyk. 2004.
“Causal Inference with General Treatment
Regimes: Generalizing the Propensity Score.”
Journal of the American Statistical Association
Imbens, Guido W. 2000. “The Role of the
Propensity Score in Estimating Dose–Response
Functions.” Biometrika 87(3): 706–10.
Imbens, Guido W. 2003. “Sensitivity to
Exogeneity Assumptions in Program Evaluation.”
American Economic Review 93(2): 126–32.
Imbens, Guido W. 2004. “Nonparametric
Estimation of Average Treatment Effects under
Exogeneity: A Review.” Review of Economics and
Statistics 86(1): 4–29.
Imbens, Guido W. 2010. “Better LATE Than
Nothing: Some Comments on Deaton (2009) and
Heckman and Urzua (2009).” Journal of Economic
Literature 48(2): 399–423.
Imbens, Guido W. 2014. “Instrumental Vari-
ables: An Econometrician’s Perspective.” Statistical
Science 29(3): 323–58.
Imbens, Guido W. 2015. “Matching Methods
in Practice: Three Examples.” Journal of Human
Resources 50(2): 373–419.
Imbens, Guido W., and Joshua D. Angrist. 1994.
“Identification and Estimation of Local Average
Treatment Effects.” Econometrica 62(2): 467–75.
Imbens, Guido W., and Karthik Kalyanaraman.
2012. “Optimal Bandwidth Choice for the Regression
Discontinuity Estimator.” Review of Economic
Studies 79(3): 933–59.
Imbens, Guido W., and Thomas Lemieux.
2008. “Regression Discontinuity Designs: A Guide
to Practice.” Journal of Econometrics 142(2): 615–35.
Imbens, Guido W., and Donald B. Rubin.
2015. Causal Inference for Statistics, Social, and
Biomedical Sciences: An Introduction. Cambridge
Imbens, Guido W., Donald B. Rubin, and
Bruce I. Sacerdote. 2001. “Estimating the Effect
of Unearned Income on Labor Earnings, Savings,
and Consumption: Evidence from a Survey of
Lottery Players.” American Economic Review 91(4):
Imbens, Guido W., and Jeffrey M. Wooldridge.
2009. “Recent Developments in the Econometrics
of Program Evaluation.” Journal of Economic Literature
47(1): 5–86.
Jackson, Matthew O. 2010. Social and Economic
Networks. Princeton University Press.
Jackson, Matthew, and Asher Wolinsky. 1996. “A
Strategic Model of Social and Economic Networks.”
Journal of Economic Theory 71(1): 44–74.
Jacob, Brian A., and Lars Lefgren. 2004.
“Remedial Education and Student Achievement:
A Regression-Discontinuity Analysis.” Review of
Economics and Statistics 86(1): 226–44.
Kleinberg, Jon, Jens Ludwig, Sendhil Mullaina-
than, and Ziad Obermeyer. 2015. “Prediction
Policy Problems.” American Economic Review
Kowalski, Amanda. 2016. “Doing More When
You’re Running LATE: Applying Marginal Treat-
ment Effect Methods to Examine Treatment Effect
Heterogeneity in Experiments.” NBER Paper
LaLonde, Robert J. 1986. “Evaluating the
Econometric Evaluations of Training Programs
with Experimental Data.” American Economic
Review 76(4): 604–20.
Leamer, Edward. 1978. Specification Searches: Ad
Hoc Inference with Nonexperimental Data. Wiley.
Leamer, Edward E. 1983. “Let’s Take the Con
Out of Econometrics.” American Economic Review
Lechner, Michael. 2001. “Identification and
Estimation of Causal Effects of Multiple Treat-
ments under the Conditional Independence
Assumption.” In Econometric Evaluation of Labour
Market Policies, vol. 13, edited by Michael Lechner
and Friedhelm Pfeiffer, 43–58. Physica-Verlag
Lee, David S. 2008. “Randomized Experiments
from Non-random Selection in U.S. House Elections.”
Journal of Econometrics 142(2): 675–97.
Lee, David S., and Thomas Lemieux.
2010. “Regression Discontinuity Designs in
Economics.” Journal of Economic Literature 48(2):
List, John A., Azeem M. Shaikh, and Yang Xu.
2016. “Multiple Hypothesis Testing in Experi-
mental Economics.” NBER Paper 21875.
Manski, Charles F. 1990. “Nonparametric
Bounds on Treatment Effects.” American Economic
Review 80(2): 319–23.
Manski, Charles F. 1993. “Identification of
Endogenous Social Effects: The Reflection
Problem.” Review of Economic Studies 60(3): 531–42.
Manski, Charles F. 2013. Public Policy in an
Uncertain World: Analysis and Decisions. Harvard
McCaffrey, Daniel F., Greg Ridgeway, and
Andrew R. Morral. 2004. “Propensity Score Estima-
tion with Boosted Regression for Evaluating Causal
Effects in Observational Studies.” Psychological
Methods 9(4): 403–25.
McCrary, Justin. 2008. “Manipulation of the
Running Variable in the Regression Discontinuity
Design: A Density Test.” Journal of Econometrics
Mele, Angelo. 2013. “A Structural Model of
Segregation in Social Networks.” Available at
Nielsen, Helena Skyt, Torben Sorensen, and
Christopher Taber. 2010. “Estimating the Effect of
Student Aid on College Enrollment: Evidence from
a Government Grant Policy Reform.” American
Economic Journal: Economic Policy 2(2): 185–215.
Oster, Emily. 2015. “Diabetes and Diet: Behav-
ioral Response and the Value of Health.” NBER
Working Paper 21600.
Otsu, Taisuke, Ke-Li Xu, and Yukitoshi
Matsushita. 2013. “Estimation and Inference of
Discontinuity in Density.” Journal of Business and
Economic Statistics 31(4): 507–24.
Pearl, Judea. 2000. Causality: Models, Reasoning,
and Inference. Cambridge University Press.
Peri, Giovanni, and Vasil Yasenov. 2015. “The
Labor Market Effects of a Refugee Wave: Applying
the Synthetic Control Method to the Mariel Boat-
lift.” NBER Working Paper 21801.
Porter, Jack. 2003. “Estimation in the Regres-
sion Discontinuity Model.” Unpublished paper.
Prentice, Ross L. 1989. “Surrogate Endpoints
in Clinical Trials: Definition and Operational
Criteria.” Statistics in Medicine 8(4): 431–40.
Robins, James M., and Andrea Rotnitzky.
1995. “Semiparametric Efficiency in Multivariate
Regression Models with Missing Data.” Journal of
the American Statistical Association 90(429): 122–29.
Robins, James M., Andrea Rotnitzky, and Lue
Ping Zhao. 1995. “Analysis of Semiparametric
Regression Models for Repeated Outcomes in the
Presence of Missing Data.” Journal of the American
Statistical Association 90(429): 106–21.
Rosenbaum, Paul. 1987. “The Role of a Second
Control Group in an Observational Study.” Statis-
tical Science 2(3): 292–306.
Rosenbaum, Paul R. 2002. “Observational
Studies.” In Observational Studies, 1–17. Springer.
Rosenbaum, Paul R., and Donald B. Rubin.
1983a. “The Central Role of the Propensity Score
in Observational Studies for Causal Effects.”
Biometrika 70(1): 41–55.
Rosenbaum, Paul R., and Donald B. Rubin.
1983b. “Assessing Sensitivity to an Unobserved
Binary Covariate in an Observational Study with
Binary Outcome.” Journal of the Royal Statistical
Society. Series B (Methodological) 45(2): 212–18.
Sacerdote, Bruce. 2001. “Peer Effects with
Random Assignment: Results for Dartmouth
Roommates.” Quarterly Journal of Economics 116(2):
Shadish, William R., Thomas D. Cook, and
Donald T. Campbell. 2002. Experimental and Quasi-experimental
Designs for Generalized Causal Inference.
Skovron, Christopher, and Rocío Titiunik. 2015.
“A Practical Guide to Regression Discontinuity
Designs in Political Science.” Unpublished paper.
Staiger, Douglas, and James H. Stock. 1997.
“Instrumental Variables Regression with Weak
Instruments.”Econometrica 65(3): 557–86.
Su, Xiaogang, Chih-Ling Tsai, Hansheng
Wang, David M. Nickerson, and Bogong Li. 2009.
“Subgroup Analysis via Recursive Partitioning.”
Journal of Machine Learning Research 10: 141–58.
Tamer, Elie. 2010. “Partial Identification in
Econometrics.” Annual Review of Economics 2(1):
Thistlethwaite, D., and Donald Campbell. 1960.
“Regression-Discontinuity Analysis: An Alternative
to the Ex-post Facto Experiment.” Journal of Educa-
tional Psychology 51(6): 309–17.
Todd, Petra, and Kenneth I. Wolpin. 2006.
“Assessing the Impact of a School Subsidy Program
in Mexico: Using a Social Experiment to Validate
a Dynamic Behavioral Model of Child Schooling
and Fertility.” American Economic Review 96(5):
van der Klaauw, Wilbert. 2008. “Regression-
Discontinuity Analysis: A Survey of Recent
Developments in Economics.” Labour 22(2):
van der Laan, Mark J., and Daniel Rubin. 2006.
“Targeted Maximum Likelihood Learning.” Inter-
national Journal of Biostatistics 2(1).
van der Vaart, Aad W. 2000. Asymptotic Statistics.
Cambridge University Press.
Wager, Stefan, and Susan Athey. 2015. “Causal
Random Forests.” Unpublished paper.
Wyss, Richard, Allan Ellis, Alan Brookhart,
Cynthia Girman, Michele Jonsson Funk, Robert
LoCasale, and Til Stürmer. 2014. “The Role of
Prediction Modeling in Propensity Score Estima-
tion: An Evaluation of Logistic Regression, bCART,
and the Covariate-Balancing Propensity Score.”
American Journal of Epidemiology 180(6): 645–55.
Yang, Shu, Guido W. Imbens, Zhanglin Cui,
Douglas E. Faries, and Zbigniew Kadziola. 2016.
“Propensity Score Matching and Subclassification
in Observational Studies with Multi-level Treat-
ments.” Biometrics 72(4): 1055–65.
Zeileis, Achim, Torsten Hothorn, and Kurt
Hornik. 2008. “Model-Based Recursive Partitioning.”
Journal of Computational and Graphical
Statistics 17(2): 492–514.
Zubizarreta, Jose R. 2015. “Stable Weights that
Balance Covariates for Estimation with Incomplete
Outcome Data.” Journal of the American Statistical
Association 110(511): 910–22.