
Assessing lack of common support in causal inference using Bayesian nonparametrics: Implications for evaluating the effect of breastfeeding on children’s cognitive outcomes

Causal inference in observational studies typically requires making comparisons between groups that are dissimilar. For instance, researchers investigating the role of a prolonged duration of breastfeeding on child outcomes may be forced to make comparisons between women with substantially different characteristics on average. In the extreme there may exist neighborhoods of the covariate space where there are not sufficient numbers of both groups of women (those who breastfed for prolonged periods and those who did not) to make inferences about those women. This is referred to as lack of common support. Problems can arise when we try to estimate causal effects for units that lack common support, thus we may want to avoid inference for such units. If ignorability is satisfied with respect to a set of potential confounders, then identifying whether, or for which units, the common support assumption holds is an empirical question. However, in the high-dimensional covariate space often required to satisfy ignorability such identification may not be trivial. Existing methods used to address this problem often require reliance on parametric assumptions and most, if not all, ignore the information embedded in the response variable. We distinguish between the concepts of "common support" and "common causal support." We propose a new approach for identifying common causal support that addresses some of the shortcomings of existing methods. We motivate and illustrate the approach using data from the National Longitudinal Survey of Youth to estimate the effect of breastfeeding at least nine months on reading and math achievement scores at age five or six. We also evaluate the comparative performance of this method in hypothetical examples and simulations where the true treatment effect is known.
The Annals of Applied Statistics
2013, Vol. 7, No. 3, 1386–1420
DOI: 10.1214/13-AOAS630
©Institute of Mathematical Statistics, 2013
New York University and Tsinghua University
1. Introduction. Causal inference strategies in observational studies that as-
sume ignorability of the treatment assignment also typically require an assumption
of common support; that is, for binary treatment assignment, Z, and a vector of
confounding covariates, X, it is commonly assumed that 0 < Pr(Z = 1 | X) < 1.
Failure to satisfy this assumption can lead to unresolvable imbalance for matching
methods, unstable weights in inverse-probability-of-treatment weighting (IPTW)
estimators, and undue reliance on model specification in methods that model the
response surface.
Received August 2011; revised January 2013.
1Supported in part by the Institute of Education Sciences Grant R305D110037 and by the Wang
Xuelian Foundation.
Key words and phrases. Common support, overlap, BART, propensity scores, breastfeeding.
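As a rough illustration of what checking 0 < Pr(Z = 1 | X) < 1 means empirically, the sketch below handles the simplest possible case: a single discrete covariate. It estimates Pr(Z = 1 | X = x) within each stratum and flags strata that contain only treated or only control units. The function name `strata_without_support` and the toy data are hypothetical and are not taken from the paper; with many or continuous covariates this kind of direct tabulation is no longer feasible, which is the difficulty the paper addresses.

```python
from collections import defaultdict

def strata_without_support(x_values, z_values):
    """Return strata whose empirical Pr(Z = 1 | X = x) equals 0 or 1."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_control, n_treated]
    for x, z in zip(x_values, z_values):
        counts[x][z] += 1
    flagged = {}
    for x, (n_control, n_treated) in counts.items():
        p_hat = n_treated / (n_control + n_treated)
        if p_hat == 0.0 or p_hat == 1.0:
            flagged[x] = p_hat  # only controls (0.0) or only treated (1.0)
    return flagged

# Toy data: stratum "b" has no controls and stratum "c" has no treated units.
x = ["a", "a", "b", "b", "c", "c"]
z = [0, 1, 1, 1, 0, 0]
flagged = strata_without_support(x, z)
print(flagged)
```

In this toy data, only stratum "a" contains both groups, so it is the only stratum where a treated-versus-control comparison does not rely on extrapolation.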
To satisfy the common support assumption in practice, researchers have used
various strategies to identify (and excise) observations in neighborhoods of the co-
variate space where there exist only treatment units (no controls) or only control
units (no treated) [see, e.g., Heckman, Ichimura and Todd (1997)]. Unfortunately
many of these methods rely on correct specification of a model for the treatment
assignment. Moreover, all such strategies (that we have identified) fail to take ad-
vantage of the outcome variable, Y, which can provide critical information about
the relative importance of each potential confounder. In the extreme this informa-
tion could help us discriminate between situations where overlap is lacking for a
variable that is a true confounder versus situations when it is lacking for a variable
that is not predictive of the outcome (and thus not a true confounder). Moreover,
there is currently a lack of guidance regarding how the researcher can or should
characterize how the inferential sample has changed after units have been discarded.
In this paper we propose a strategy to address the problem of identifying units
that lack common support, even in fairly high-dimensional space. We start by
defining the causal inference setting and estimands of interest ignoring the com-
mon support issue. We then review a causal inference strategy [discussed previ-
ously in Hill (2011)] that exploits an algorithm called Bayesian Additive Regres-
sion Trees [BART; Chipman, George and McCulloch (2007, 2010)]. We discuss
the issue of common support and then introduce the concept of “common causal support.”
Our method for addressing common support problems exploits a key feature of
the BART approach to causal inference. When BART is used to estimate causal
effects one of the “byproducts” is that it yields individual-specific posterior distri-
butions for each potential outcome; these act as proxies for the amount of infor-
mation we have about these outcomes. Comparisons of posterior distributions of
counterfactual outcomes versus factual (observed) outcomes can be used to create
red flags when the amount of information about the counterfactual outcome for
a given observation is not sufficient to warrant making inferences about that ob-
servation. We illustrate this method in several simple hypothetical examples and
examine the performance of our strategy relative to propensity-based methods in
simulations. Finally, we demonstrate the practical differences in our breastfeeding example.
2. Causal inference and BART. This section describes notation, estimands,
and assumptions followed by a discussion of how BART can be used to estimate
causal effects.2
2Green and Kern (2012) discuss extensions to this BART strategy for causal inference to more
thoroughly explore heterogeneous treatment effects.
1388 J. HILL AND Y.-S. SU
2.1. Notation, estimands and assumptions. We discuss a situation where we
attempt to identify a causal effect using a sample of independent observations of
size n. Data for the ith observation consists of an outcome variable, $Y_i$, a vector
of covariates, $X_i$, and a binary treatment assignment variable, $Z_i$, where
$Z_i = 1$ denotes that the treatment was received. We define potential outcomes
for this observation, $Y_i(Z_i = 0) = Y_i(0)$ and $Y_i(Z_i = 1) = Y_i(1)$, as the outcomes
that would manifest under each of the treatment assignments. It follows
that $Y_i = Y_i(0)(1 - Z_i) + Y_i(1) Z_i$. Given that observational samples are rarely
random samples from the population and we will be limiting our samples in
further nonrandom ways in order to address lack of overlap, it makes sense
to focus on sample estimands such as the conditional average treatment effect
(CATE), $\frac{1}{n} \sum_{i=1}^{n} E[Y_i(1) - Y_i(0) \mid X_i]$, and the conditional average treatment
effect for the treated (CATT), $\frac{1}{n_t} \sum_{i : Z_i = 1} E[Y_i(1) - Y_i(0) \mid X_i]$, where $n_t$ is the
number of treated units. Other common sample estimands we may consider are the
sample average treatment effect (SATE), $\frac{1}{n} \sum_{i=1}^{n} E[Y_i(1) - Y_i(0)]$, and the sample
average effect of the treatment on the treated (SATT), $\frac{1}{n_t} \sum_{i : Z_i = 1} E[Y_i(1) - Y_i(0)]$.
If ignorability holds for our sample, that is, $Y_i(0), Y_i(1) \perp Z_i \mid X_i = x$, then
$E[Y_i(0) \mid X_i = x] = E[Y_i \mid Z_i = 0, X_i = x]$ and $E[Y_i(1) \mid X_i = x] = E[Y_i \mid Z_i = 1, X_i = x]$.
The basic idea behind the BART approach to causal inference is to assume
$E[Y_i(0) \mid X_i = x] = f(0, x)$ and $E[Y_i(1) \mid X_i = x] = f(1, x)$ and then fit a very
flexible model for f.
In principle, any method that flexibly estimates f could be used to model these
conditional expectations. Chipman, George and McCulloch (2007, 2010) describe
BART's advantages as a predictive algorithm compared to similar alternatives in
the data mining literature. Hill (2011) describes the advantages of using BART for
causal inference estimation over several alternatives common in the causal infer-
ence literature.
The BART algorithm consists of two pieces: a sum-of-trees model and a regularization
prior. Dropping the i subscript for notational convenience, we describe
the sum-of-trees model by $Y = f(z, x) + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$ and
$$f(z, x) = g(z, x; T_1, M_1) + g(z, x; T_2, M_2) + \cdots + g(z, x; T_m, M_m).$$
Here each $(T_j, M_j)$ denotes a single subtree model. The number of trees is typically
allowed to be large [Chipman, George and McCulloch (2007, 2010) suggest
200, though, in practice, this number should not exceed the number of observations
in the sample]. As is the case with related sum-of-trees strategies (such as
boosting), the algorithm requires a strategy to avoid overfitting. With BART this is
achieved through a regularization prior that allows each $(T_j, M_j)$ tree to contribute
only a small part to the overall fit.
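To make the additive structure concrete, the toy sketch below evaluates a sum of three hand-written one-split trees, following the standard sum-of-trees form f(z, x) = g(z, x; T_1, M_1) + ... + g(z, x; T_m, M_m). It only illustrates how many small contributions add up to one prediction. It does not fit anything, it involves no prior or MCMC, and the helper `stump`, the split points and the leaf values are all invented for illustration.

```python
def stump(split_on, threshold, left_value, right_value):
    """Build a one-split tree g(z, x; T, M): T is the split rule, M the two leaf values."""
    def g(z, x):
        value = z if split_on == "z" else x
        return left_value if value <= threshold else right_value
    return g

# Three hypothetical trees; each contributes only a small amount to the total.
trees = [
    stump("x", 30.0, -0.5, 0.5),
    stump("z", 0.5, 0.0, 1.0),
    stump("x", 45.0, -0.2, 0.3),
]

def f(z, x):
    """Sum-of-trees prediction: add up the leaf value selected in every tree."""
    return sum(g(z, x) for g in trees)

# Difference between the two treatment conditions at the same covariate value.
effect_at_40 = f(1, 40.0) - f(0, 40.0)
print(effect_at_40)
```

Because only the second toy tree splits on z, the difference f(1, x) - f(0, x) here comes entirely from that tree; in a fitted model, trees that split on both z and x are what allow the effect to vary with x.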
BART fits the sum-of-trees model using an MCMC algorithm that cycles between
draws of $(T_j, M_j)$ conditional on σ and draws of σ conditional on all of the
$(T_j, M_j)$. Convergence can be monitored by plotting the residual standard deviation
parameter σ over time. More details regarding BART can be found in Chipman,
George and McCulloch (2007, 2010).
It is straightforward to use BART to estimate average causal effects such as
$E[Y(1) \mid X = x] - E[Y(0) \mid X = x] = f(1, x) - f(0, x)$. Each iteration of the
BART Markov chain generates a new draw of f from the posterior distribution.
Let $f^r$ denote the rth draw of f. To perform causal inference, we then compute
$d_i^r = f^r(1, x_i) - f^r(0, x_i)$, for $i = 1, \ldots, n$. If we average the $d_i^r$ values over i
with r fixed, the resulting values will be our Monte Carlo approximation to the
posterior distribution of the average treatment effect for the associated population.
For example, we average over the entire sample if we want to estimate the average
treatment effect. We average over i:zi=1 if we want to estimate the effect of the
treatment on the treated.
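The averaging step described above can be sketched as follows. The code assumes the posterior draws are already available as nested lists, with `f1_draws[r][i]` holding the rth draw of f(1, x_i) and `f0_draws[r][i]` the rth draw of f(0, x_i). These names, the function `average_effect_draws` and the numbers are hypothetical; in practice the draws would come from a fitted BART model.

```python
from statistics import mean

def average_effect_draws(f1_draws, f0_draws, include):
    """For each draw r, average d_i^r = f^r(1, x_i) - f^r(0, x_i) over the included units."""
    draws = []
    for f1_r, f0_r in zip(f1_draws, f0_draws):
        d_r = [f1 - f0 for f1, f0, keep in zip(f1_r, f0_r, include) if keep]
        draws.append(mean(d_r))
    return draws

# Two posterior draws for three units (made-up numbers).
f1_draws = [[10.0, 12.0, 14.0], [11.0, 13.0, 15.0]]
f0_draws = [[8.0, 9.0, 11.0], [9.0, 10.0, 9.0]]
z = [1, 0, 1]

# Average over the whole sample for the average treatment effect ...
ate_draws = average_effect_draws(f1_draws, f0_draws, [True] * len(z))
# ... or over i: z_i = 1 for the effect of the treatment on the treated.
att_draws = average_effect_draws(f1_draws, f0_draws, [zi == 1 for zi in z])
print(ate_draws, att_draws)
```

Each returned list has one value per posterior draw, so a point estimate and a posterior interval follow from summarizing that list (for example its mean and its 2.5% and 97.5% quantiles).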
2.2. Past evidence regarding BART performance. Hill (2011) provides evi-
dence of superior performance of BART relative to popular causal inference strate-
gies in the context of nonlinear response surfaces. The focus in those comparisons
is on methods that are reasonably simple to understand and implement: standard
linear regression, propensity score matching (with regression adjustment), and in-
verse probability of treatment weighted linear regression [IPTW; Imbens (2004),
Kurth et al. (2006)].
One vulnerability of BART identified in Hill (2011) is that there is nothing to
prevent it from extrapolating over areas of the covariate space where common
support does not exist. This problem is not unique to BART; it is shared by all
causal modeling strategies that do not first discard (or severely downweight) units
in these areas. Such extrapolations can lead to biased inferences because of the
lack of information available to identify either E[Y(0)|X]or E[Y(1)|X]in these
regions. This paper proposes strategies to address this issue.
2.3. Illustrative example with one predictor. We illustrate use of BART for
causal inference with an example [similar to one used in Hill (2011)]. This exam-
ple also demonstrates both the problems that can occur when common support is
compromised and a potential solution.
Figure 1 displays simulated data from each of two treatment groups from a hypothetical
educational intervention. The 120 observations were generated independently
as follows. We generate the treatment variable as $Z \sim \mathrm{Bernoulli}(0.5)$. We
generate a pretest measure as $X \mid Z = 1 \sim N(40, 10^2)$ and $X \mid Z = 0 \sim N(20, 10^2)$.
Our post-test potential outcomes are drawn as $Y(0) \mid X \sim N(72 + 3\sqrt{X}, 1)$ and
$Y(1) \mid X \sim N(90 + \exp(0.06X), 1)$. Since we conceptualize both our confounder
and our outcome as test scores, a ceiling is imposed on each (60 and 120, resp.).
Even with this constraint this is an extreme example of heterogeneous treatment
effects, designed, along with the lack of overlap, to make it extremely difficult for
any method to successfully estimate the true treatment effect.
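The data generating process above can be sketched with the Python standard library. Several details are assumptions rather than statements from the text: the control response mean is taken to be 72 + 3√X (the square root appears in the similar example of Hill (2011); it keeps the control curve below the ceiling of 120), negative pretest draws are floored at zero inside the square root, and the pretest ceiling is applied before the outcomes are drawn. The function name `simulate_example` and the seed are hypothetical, so the numbers produced will not match those reported for Figure 1.

```python
import math
import random

def simulate_example(n=120, seed=1):
    """Draw one data set from the one-predictor example (a sketch, not the authors' code)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0              # Z ~ Bernoulli(0.5)
        x = rng.gauss(40.0 if z == 1 else 20.0, 10.0)   # X | Z ~ N(40 or 20, 10^2)
        x = min(x, 60.0)                                # pretest ceiling of 60
        # Assumed control mean 72 + 3*sqrt(X); max(x, 0) guards against negative draws.
        y0 = rng.gauss(72.0 + 3.0 * math.sqrt(max(x, 0.0)), 1.0)
        y1 = rng.gauss(90.0 + math.exp(0.06 * x), 1.0)
        y0, y1 = min(y0, 120.0), min(y1, 120.0)         # post-test ceiling of 120
        # Only the potential outcome under the received treatment is observed.
        data.append({"z": z, "x": x, "y0": y0, "y1": y1, "y": y1 if z == 1 else y0})
    return data

data = simulate_example()
treated = [row for row in data if row["z"] == 1]
# Sample average treatment effect for the treated, computable only in simulation.
satt = sum(row["y1"] - row["y0"] for row in treated) / len(treated)
print(round(satt, 1))
```

Because both potential outcomes are stored, the true sample effect for the treated can be compared directly with any estimate computed from the observed `z`, `x` and `y` alone.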
FIG. 1. Left panel: simulated data (points) and true response surfaces. The black upper curve and
points that follow it correspond to the treatment condition; the grey lower curve and points that follow
it correspond to the control condition. BART inference for each treated observation is displayed as
a 95% posterior interval for $f(1, x_i)$ and $f(0, x_i)$. Discarded units (described in Section 4) are
circled. Right panel: solid curve represents the treatment effect as it varies with our pretest, X.
BART inference is displayed as 95% posterior intervals for the treatment effect for each treated unit.
Intervals for discarded units (described in Section 4) are displayed as dotted lines. In this sample the
conditional average treatment effect for the treated (CATT) is 12.2, and the sample average treatment
effect for the treated (SATT) is 11.8.
In the left panel, the upper solid black curve represents E[Y(1) | X] and the
lower grey one E[Y(0) | X]. The black circles close to the upper curve are the
treated and the grey squares close to the lower curve are the untreated (ignore
the circled points for now). Since there is only one confounding covariate, X,
the difference between the two response surfaces at any level of X represents the
treatment effect for observations with that value of the pretest X. In this sample the
conditional average treatment effect for the treated (CATT) is 12.2, and the sample
average treatment effect for the treated (SATT) is 11.8.
A linear regression fit to the data yields a substantial underestimate, 7.1 (s.e.
0.62), of both estimands. Propensity score matching (not restricted to common
support) with subsequent regression adjustment yields a much better estimate,
10.4 (s.e. 0.52), while the IPTW regression estimate is 9.6 (s.e. 0.45). For both
of these methods the propensity scores were estimated using logistic regression.
The left panel of Figure 1 also displays the BART fit to the response surface
(with number of trees equal to 100 since there are only 120 observations).
Each vertical line segment corresponds to individual level inference about either
$E[Y_i(0) \mid X_i]$ or $E[Y_i(1) \mid X_i]$ for each treated observation. Note that the fit is quite
good until we try to predict $E[Y_i(0) \mid X_i]$ beyond the support of the data. The right
panel displays the true treatment effect as it varies with X, $E[Y(1) - Y(0) \mid X]$, as
a solid curve. The BART inference (95% posterior interval) for the treatment effect
for each treated unit is superimposed as a vertical segment (ignore the solid versus
dashed distinction for now). These individual-level inferences can be averaged to
obtain inference for the effect of the treatment on the treated which is 9.5 with
95% posterior interval (7.7, 11.8); this interval best corresponds to inference with
respect to the conditional average treatment effect on the treated [Hill (2011)].
None of these methods yields a 95% interval that captures CATT. BART is
the only method to capture SATT, though at the expense of a wider uncertainty
interval. All the approaches are hampered by the fairly severe lack of common
support. Notice, however, the way that the BART-generated uncertainty bounds
grow much wider in the range where there is no overlap across treatment groups
(X>40). The marginal intervals nicely cover the true conditional treatment effects
until we start to leave this neighborhood. However, inference in this region is based
on extrapolation. Our goal is to devise a rule to determine how much “excess”
uncertainty should warrant removing a unit from the analysis. We will return to
this example in Section 4.
3. Identifying areas of common support. It is typical in causal inference to
assume common support. In particular, many researchers assume “strong ignora-
bility” [Rosenbaum and Rubin (1983)] which combines the standard ignorability
assumption discussed above with an assumption of common support often formalized
as 0 < Pr(Z = 1 | X) < 1. It is somewhat less common for researchers to check
whether common support appears to be empirically satisfied for their particular
data set.
Moreover, the definition of common support is itself left vague in practice. Typically,
X comprises the set of covariates the researcher has chosen to justify the
ignorability assumption. As such, conservative researchers will understandably include
a large number of pretreatment variables in X. However, this will likely mean
that X includes any number of variables that are not required to satisfy ignorability
once we condition on some other subset of the vector of covariates. Importantly,
the requirement of common support need not hold for the variables not in this sub-
set, thus, trying to force common support on these extraneous variables can lead to
unnecessarily discarding observations.
The goal instead should be to ensure common causal support which can be defined
as 0 < Pr(Z = 1 | W) < 1, where W represents any subset of X that will satisfy
$Y(0), Y(1) \perp Z \mid W$. Because BART takes advantage of the information in the outcome
variable, it should be better able to target common causal support as will be
demonstrated in the examples below. Propensity score methods, on the other hand,
ignore this information, rendering them incapable of making these distinctions.
If the common causal support assumption does not hold for the units in our
inferential sample (the units in our sample about whom we’d like to make causal
inference), we do not have direct empirical evidence about the counterfactual state
for them. Therefore, if we retain these units in our sample, we run the risk of
obtaining biased treatment effect estimates.
One approach to this problem is to weight observations by the strength of sup-
port [for an example of this strategy in a propensity score setting, see Crump et al.
(2009)]. This strategy may yield efficiency gains over simply discarding prob-
lematic units. However, this approach has two key disadvantages. First, if there
are a large number of covariates, the weights may become unstable. Second, it
changes the interpretation of the estimand to something that may have little policy
or practical relevance. For instance, suppose the units that have the most support
are those currently receiving the program, however, the policy-relevant question
is what would happen to those currently not receiving the program. In this case
the estimand would give most weight to those participants of least interest from a
policy perspective.
Another option is to identify and remove observations in neighborhoods of the
covariate space that lack sufficient common causal support. Simply discarding ob-
servations deemed problematic is unlikely to lead to an optimal solution. However,
this approach has the advantage of greater simplicity and transparency. More work
will need to be done, however, to provide strategies for adequately profiling the
discarded observations as well as those that we retain for inference; this paper will
provide a simple starting point in this effort. The primary goal of this paper is
simply to describe a strategy to identify these problematic observations.
3.1. Identifying areas of common causal support with BART. The simple idea
is to capitalize on the fact that the posterior standard deviations of the individual-
level conditional expectations estimated using BART increase markedly in areas
that lack common causal support, as illustrated in Figure 1. The challenge is to
determine how extreme these standard deviations should be before we need be
concerned. We present several possible rules for discarding units. In all strategies
when implementing BART we recommend setting the “number of trees” parameter
to 100 to allow BART to better determine the relative importance of the variables.
Recall that the individual-level causal effect for each unit can be expressed as
$d_i = f(1, x_i) - f(0, x_i)$. For each unit, i, we have explicit information about
$f(Z_i, x_i)$. Our concern is whether we have enough information about $f(1 - Z_i, x_i)$.
The amount of information is reflected in the posterior standard deviations.
Therefore, we can create a metric for assessing our uncertainty regarding the sufficiency
of the common support for any given unit by comparing $\sigma_i^{f_0} = sd(f(0, x_i))$
and $\sigma_i^{f_1} = sd(f(1, x_i))$, where sd(·) denotes the posterior standard deviation.
In practice, of course we use Monte Carlo approximations to these quantities,
$s_i^{f_0}$ and $s_i^{f_1}$, respectively, obtained by calculating the standard deviation of the
draws of $f(0, x_i)$ and $f(1, x_i)$ for the ith observation.
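The Monte Carlo approximation just described amounts to taking, for each unit, the standard deviation of its posterior draws under each treatment condition. The sketch below assumes the draws are stored as nested lists indexed as `draws[r][i]` (draw r, unit i); the function name `posterior_sds` and the numbers are hypothetical. It uses the sample standard deviation (divisor R - 1), which is one reasonable choice; the text does not specify the divisor.

```python
from statistics import stdev

def posterior_sds(draws):
    """draws[r][i] is the rth posterior draw for unit i; return one standard deviation per unit."""
    n_units = len(draws[0])
    return [stdev([draw[i] for draw in draws]) for i in range(n_units)]

# Three posterior draws of f(0, x_i) and f(1, x_i) for two units (made-up numbers).
f0_draws = [[1.0, 5.0], [2.0, 9.0], [3.0, 1.0]]
f1_draws = [[4.0, 6.0], [5.0, 7.0], [6.0, 8.0]]

s_f0 = posterior_sds(f0_draws)  # approximates sd(f(0, x_i)) for each unit
s_f1 = posterior_sds(f1_draws)  # approximates sd(f(1, x_i)) for each unit
print(s_f0, s_f1)
```

In this toy input the second unit has a much larger standard deviation under control than under treatment, which is the kind of imbalance in information that the comparison of the two quantities is meant to reveal.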
BART discarding rules. Our goal is to use the information that BART provides
to create a rule for determining which units lack sufficient counterfactual
evidence (i.e., residing in a neighborhood without common support). For example,
when estimating the effect of the treatment on units, i, for which $Z_i = a$, one
might consider discarding any unit, i, with $Z_i = a$, for which
$s_i^{f_{1-a}} > \max_j \{s_j^{f_a}\}$, $\forall j : Z_j = a$. So, for instance, when estimating the effect of the
treatment on the treated we would discard treated units whose counterfactual standard
deviation $s_i^{f_0}$ exceeded the maximum standard deviation under the observed
treatment condition $s_i^{f_1}$ across all the treated units.
This cutoff is likely too sharp, however, as even chance disturbances might put
some units beyond this threshold. Therefore, a more useful rule might use a cutoff
that includes a “buffer” such that we would only discard unit i in the inferential
group defined as those with $Z_i = a$, if
$$s_i^{f_{1-a}} > \max_{j : Z_j = a} \{s_j^{f_a}\} + sd(s_j^{f_a}) \quad (1\ sd\ \text{rule}),$$
where $sd(s_j^{f_a})$ represents the estimated standard deviation of the empirical distribution
of $s_j^{f_a}$ over all units with $Z_j = a$. For this rule to be most useful, we need
$Var(Y \mid X, Z = 0) = Var(Y \mid X, Z = 1)$ to hold at least approximately.
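A minimal sketch of the maximum-based cutoff and its buffered variant follows, assuming the per-unit posterior standard deviations have already been computed for the inferential group (the units with Z_i = a). Here `s_obs` holds the standard deviations under the observed condition and `s_cf` those under the counterfactual condition. The function name, the `buffer_sds` parameter and the numbers are hypothetical; setting `buffer_sds=0` gives the sharper maximum rule and `buffer_sds=1` adds one standard deviation of the observed-condition values as a buffer.

```python
from statistics import stdev

def discard_by_sd_rule(s_obs, s_cf, buffer_sds=1.0):
    """Flag units whose counterfactual sd exceeds max(s_obs) + buffer_sds * sd(s_obs)."""
    cutoff = max(s_obs) + buffer_sds * stdev(s_obs)
    return [s > cutoff for s in s_cf]

# Posterior sds for four units in the inferential group (made-up numbers).
s_obs = [1.0, 1.2, 0.8, 1.0]   # under the observed treatment condition
s_cf = [1.1, 1.3, 1.5, 3.0]    # under the counterfactual condition

flags_max_rule = discard_by_sd_rule(s_obs, s_cf, buffer_sds=0.0)
flags_one_sd_rule = discard_by_sd_rule(s_obs, s_cf, buffer_sds=1.0)
print(flags_max_rule, flags_one_sd_rule)
```

In this toy input the second unit sits just above the maximum observed-condition value, so it is flagged by the sharp cutoff but retained once the buffer is added, which is the behavior the buffer is intended to provide.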
Another option is to consider the squared ratio of posterior standard devia-
tions (or, equivalently, the ratio of posterior variances) for each observation, with
the counterfactual posterior standard deviation in the numerator. An approximate
benchmark distribution for this ratio might be a $\chi^2$ distribution with 1 degree of
freedom. Thus, for an observation with $Z_i = a$ we can choose cutoffs that correspond
to a specified p-value of rejecting the hypothesis that the variances are the
same of 0.10,
$$\left( s_i^{f_0} / s_i^{f_1} \right)^2 > 2.706 \quad \forall i : z_i = 1 \quad (\alpha = 0.10\ \text{rule})$$
or a p-value of 0.05,
$$\left( s_i^{f_0} / s_i^{f_1} \right)^2 > 3.841 \quad \forall i : z_i = 1 \quad (\alpha = 0.05\ \text{rule}).$$
These ratio rules do not require the same type of homogeneity of variance assumption
across units as does the 1 sd rule. However, they rest instead on an implicit
assumption of homogeneity of variance within unit across treatment conditions.
Additionally, they may be less stable and will be prone to rejection for units that
have particularly large amounts of information for the observed state. For instance,
an observation in a neighborhood of the covariate space that has control units may
still reject (i.e., be flagged as a discard) if there are, relatively speaking, many more
treated units in this neighborhood as well.
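The ratio rules can be sketched in the same style. The cutoffs 2.706 and 3.841 are the values quoted above; the function name `discard_by_ratio_rule` and the numbers are hypothetical. As before, `s_obs` and `s_cf` are the per-unit posterior standard deviations under the observed and counterfactual conditions for units in the inferential group.

```python
def discard_by_ratio_rule(s_obs, s_cf, cutoff=3.841):
    """Flag units whose squared ratio of counterfactual to observed sd exceeds the cutoff."""
    return [(cf / obs) ** 2 > cutoff for obs, cf in zip(s_obs, s_cf)]

# Posterior sds for three units (made-up numbers).
s_obs = [1.0, 1.0, 0.5]
s_cf = [1.5, 2.0, 0.9]

flags_alpha_05 = discard_by_ratio_rule(s_obs, s_cf, cutoff=3.841)
flags_alpha_10 = discard_by_ratio_rule(s_obs, s_cf, cutoff=2.706)
print(flags_alpha_05, flags_alpha_10)
```

The third unit illustrates the instability noted above: its counterfactual standard deviation is the smallest of the three, yet it is flagged under the more liberal cutoff because its observed-condition standard deviation is smaller still.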
Exploratory analyses using measures of common causal support uncertainty.
Another way to make use of the information in the posterior standard deviations
is more exploratory. The idea here is to use a classification strategy such as a re-
gression tree to identify neighborhoods of the covariate space with relatively high
levels of common support uncertainty. For instance, when the goal is estimation of
the effect of the treatment on the treated we may want to determine neighborhoods
that have clusters of units with relatively high levels of $s_i^{f_{1-Z_i}}$ or $s_i^{f_{1-Z_i}} / s_i^{f_{Z_i}}$.
These “flags” combined with researcher knowledge of the substantive context of the
research problem can be used to identify observations or neighborhoods to be
excised from the analysis if it is deemed necessary. This approach may have the ad-
vantage of being more closely tied to the science of the question being addressed.
We illustrate possibilities for exploring and characterizing these neighborhoods in
Sections 4.3 and 6.
Reliance on this type of exploratory strategy will likely be eschewed by re-
searchers who favor strict analysis protocols as a means of promoting honesty in
research. In fact, the original BART causal analysis strategy was conceived with
this predilection in mind, which is why (absent the need or desire to address com-
mon support issues) the advice given is to run it only once and at the default set-
tings; this minimizes the amount of researcher “interference” [Hill (2011)]. These
preferences may still be satisfied, however, by specifying one of the discarding
rules above as part of the analysis protocol. For further discussion of this issue see
Section 3.3.
3.2. Competing strategies for identifying common support. The primary com-
petitors to our strategy for identification of units that lack sufficient common causal
support rely on propensity scores. While there is little advice directly given to the
topic of how to use the propensity score to identify observations that lack com-
mon support for the included predictors [for a notable exception see Crump et al.
(2009)], in practice, most researchers using propensity score strategies first esti-
mate the propensity score and then discard any inferential units that extend be-
yond the range of the propensity score [Dehejia and Wahba (1999), Heckman,
Ichimura and Todd (1997), Morgan and Harding (2006)]. This type of exclusion
is performed automatically in at least two popular propensity score matching soft-
ware packages, MatchIt in R [Ho et al. (2013)] and psmatch2 in Stata [Leuven
and Sianesi (2011)] when the “common support” option is chosen. For instance,
if the focus is on the effect of the treatment on the treated, one would typically
discard the treated units with propensity scores greater than the maximum control
propensity score, unless there happened to be some treated with propensity scores
less than the minimum control propensity score (in which case these treated units
would be discarded as well).
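The range-based rule described above, for the effect of the treatment on the treated, can be sketched as follows. This illustrates the rule as stated in the text and is not the implementation or interface of MatchIt or psmatch2. The function name and the propensity scores are hypothetical, and the scores are assumed to have been estimated already.

```python
def discard_treated_outside_control_range(ps, z):
    """Flag treated units whose propensity score lies outside the range of control scores."""
    control_scores = [p for p, zi in zip(ps, z) if zi == 0]
    lowest, highest = min(control_scores), max(control_scores)
    return [zi == 1 and (p > highest or p < lowest) for p, zi in zip(ps, z)]

# Estimated propensity scores for three controls and four treated units (made-up numbers).
ps = [0.10, 0.30, 0.60, 0.05, 0.50, 0.90, 0.70]
z = [0, 0, 0, 1, 1, 1, 1]

discard = discard_treated_outside_control_range(ps, z)
print(discard)
```

Controls are never flagged by this rule because they are the comparison group rather than the inferential group; only the treated unit with a score of 0.50 falls inside the control range of 0.10 to 0.60 and is retained.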
More complicated caliper matching methods might further discard inferential
units that lie within the range of propensity scores of their comparison group if
such units are more than a set distance (in propensity score units) away from
their closest match [see, e.g., Frolich (2004)]. Given the number of different ra-
dius/caliper matching methods and the lack of clarity about the optimal caliper
width, it was beyond the scope of this paper to examine those strategies as well.
Weighting methods are typically not coupled with discarding rules since one of
the advantages touted by weighting advocates is that IPTW allows the researcher
to include their full sample of inferential and comparison units. However, in some
situations failure to discard inferential units that are quite different from the bulk
of the comparison units can lead to more unstable weight estimates.
We have two primary concerns about use of propensity scores to identify units
that fail to satisfy common causal support. First, they require a correct specifi-
cation of the propensity score model. Offsetting this concern is the fact that our
BART strategy requires a reasonably good fit to the response surface. As demon-
strated in Hill (2011), however, BART appears to be flexible enough to perform
well in this respect even with highly nonlinear and nonparallel response surfaces.
A further caveat to this concern is the fact that several flexible estimation strate-
gies have recently been proposed for estimating the propensity score. In particular,
Generalized Boosted Models (GBM) and Generalized Additive Models (GAM)
have both been advocated in this capacity with mostly positive results [McCaffrey,
Ridgeway and Morral (2004), Woo, Reiter and Karr (2008)], although some more
mixed findings exist for GBM in particular settings [Hill, Weiss and Zhai (2013)].
In Section 5 we explore the relative performance of these approaches against our
BART approach.
Our second concern is that the propensity score strategies ignore the information
about common support embedded in the response variable. This can be important
because the researcher typically never knows which of the covariates in her data
set are actually confounders; if a covariate is not associated with both the treatment
assignment and the outcome, we need not worry about forcing overlap with regard
to it. Using propensity scores to determine common support gives greatest weight
to those variables that are most predictive of the treatment variable. However, these
variables may not be most important for predicting the outcome. In fact, there is
no guarantee that they are predictive of the outcome variable at all. Conversely,
the propensity score may give insufficient weight to variables that are highly pre-
dictive of the outcome and thus may underestimate the risk of retaining units with
questionable support with regard to such a variable.
The BART approach, on the other hand, naturally and coherently incorporates
all of this information. For instance, if there is lack of common support with re-
spect to a variable that is not strongly predictive of the outcome, then the posterior
standard deviation for the counterfactual unit should not be systematically higher
to a large degree. However, a variable that similarly lacks common support but is
strongly predictive of the outcome should yield strong differences in the distribu-
tions of the posterior standard deviations across counterfactuals. Simply put, the
standard deviations should pick up “important” departures from complete overlap
and should largely ignore “unimportant” departures. This ability of BART to cap-
italize on information in the outcome variable allows it to more naturally target
common causal support.
3.3. Honesty. Advocates of propensity score strategies sometimes directly ad-
vocate for ignoring the information in the response variable [Rubin (2002)]. The
argument goes that such practice allows the researcher to be more honest because
a propensity score model can in theory be chosen (through balance checks) before
the outcome variable is even included in the analysis. This approach can avoid the
potential problem of repeatedly tweaking a model until the treatment effect meets
one’s prior expectations. However, in reality there is nothing to stop a researcher
from estimating a treatment effect every time he fits a new propensity score model
and, in practice, this surely happens. We argue that a better way to achieve this
type of honesty is to fit just one model and use a prespecified discarding rule, as
can be achieved in the BART approach to causal inference.
4. Illustrative examples. We illustrate some of the key properties of our
method using several simple examples. Each example represents just one draw
from the given data generating mechanism, thus, these examples are not meant
to provide conclusive evidence regarding relative performance of the methods in
each scenario. These examples provide an opportunity to visualize some of the
basic properties of the BART strategy relative to more traditional propensity score
strategies: propensity score matching with regression adjustment and IPTW re-
gression estimates. Since we estimate average treatment effects for the treated in
all the examples, for the IPTW approach the treated units all receive weights of 1
and the control units receive weights of ˆe(x)/(1−ˆe(x)),where ˆe(x) denotes the
estimated propensity score.
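The weighting scheme just described can be written out in a few lines. The following is a minimal sketch in Python rather than the software used for the analyses; the function name `att_weights` and the toy inputs are illustrative assumptions, and the propensity scores are taken as given rather than estimated.

```python
def att_weights(z, e_hat):
    """IPTW weights for the effect of the treatment on the treated.

    z     : list of 0/1 treatment indicators
    e_hat : list of estimated propensity scores, each strictly inside (0, 1)
    """
    weights = []
    for zi, ei in zip(z, e_hat):
        if not 0.0 < ei < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Treated units keep weight 1; controls get the estimated odds of treatment.
        weights.append(1.0 if zi == 1 else ei / (1.0 - ei))
    return weights


# Toy inputs (hypothetical): one treated unit and two controls.
z = [1, 0, 0]
e_hat = [0.9, 0.5, 0.2]
weights = att_weights(z, e_hat)
```

Controls that resemble the treated group (large estimated propensity score) are up-weighted, while controls unlike any treated unit contribute little.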
4.1. Simple example with one predictor. First, we return to the simple example
from Section 2 to see how our common causal support identification strategies
work in that setting. Since there is only one predictor and it is a true confounder,
common support and common causal support are equivalent in this example and
we would not expect to see much difference between the methods.
The circled treated observations in the left-hand panel of Figure 1 indicate the
29 observations that would be dropped using the standard propensity score discard
rule. Similarly, the dotted line segments in the right panel of the figure indicate
individual-specific treatment effects that would no longer be included in our aver-
age treatment effect inference. All three BART discard rules lead to the same set
of discarded observations as the propensity score strategy in this example.
SATT and CATT for the remaining units are 7.9 and 8.0, respectively. Our new
BART estimate is 8.2 with 95% posterior interval (7.7, 9.0). With this reduced
sample, propensity score matching (with subsequent regression adjustment) yields
an estimate of the treatment effect at 8.3 (s.e. 0.26) while IPTW yields an estimate
of 7.6 (s.e. 0.32).
Advantages of BART over the propensity score approach are not evident in this
simple example. They should manifest in examples where the assignment mecha-
nism is more difficult to model or when there are multiple potential confounders
and not all variables that predict treatment also predict the outcome (or they do so
with different emphasis). We explore these issues next.
4.2. Illustrative examples with two predictors. We now describe two slightly
more complicated examples to illustrate the potential advantages of BART over
propensity-score-based competitors. In both examples there are two independent
covariates, each generated as N(0,1), and the goal is to estimate CATT which is
equal to 1 (in fact, the treatment effect is constant across observations in these
examples). The question in each case is whether some of the treated observations
should be dropped due to lack of empirical counterfactuals.
4.2.1. Example 2A: Two predictors, no confounders. In the first example the
assignment mechanism is simple: after generating Z as a random flip of the coin,
all controls with X1 > 0 are removed. The response surface is generated as
$E[Y \mid Z, X_1, X_2] = Z + X_2 + X_2^2$; thus, the true treatment effect is constant at 1. Since
there are no true confounders in this example, the requirement of common support
on both X1 and X2 will be overly conservative; overlap on neither is required to
satisfy common causal support. Figure 2 illustrates how each strategy performs in
this scenario.
In both plots circles represent treated observations and squares represent control
observations. The left panel shows the results based on discarding units that lack
common support with respect to the propensity score. The observations discarded
by the propensity score method are displayed as solid circles. Since treatment
assignment is driven solely by X1, there is a close mapping between X1 and the
propensity score (were it not for the fact that X2 was also in the estimation model
for the propensity score, the correspondence would be one-to-one). 62 of the 112
treatment observations are dropped based on lack of overlap with regard to the
propensity score.
FIG. 2. Plots of simulated data with two predictors; the true treatment effect is 1. X1 predicts
treatment assignment only and X2 predicts outcome only. Control observations are displayed as
squares. Treated observations are displayed as circles. The left panel displays results based on
propensity score common support; solid circles indicate which observations were discarded. In the
right panel the size of the circle is proportional to $s_i^{f_0}$. Observations discarded based on
the BART 1 sd rule are displayed as solid circles. Observations discarded based on the BART
α = 0.10 rule are circled. No observations were discarded based on the BART α = 0.05 rule.
After re-estimating the propensity score matching on the smaller sample, the
matching estimate is 1.29. Since treatment assignment is independent of the po-
tential outcomes by design, this estimate should be unbiased over repeated sam-
ples. However, it now has less than half the observations available for estima-
tion. Inverse-probability-of-treatment weighting (IPTW) yields an estimate of 1.40
(s.e. 0.42) after discarding.3
In the right plot of Figure 2 the size of the circle for each treated unit is
proportional to the corresponding size of the posterior standard deviation of the expected
outcome under the control condition (in this case, the counterfactual condition for
the treated). The size of the square that represents each control observation is
proportional to the cutoff level for discarding units. Observations discarded by the
1 sd rule have been made solid. Observations discarded by the α = 0.10 rule have
been circled. No observations were discarded using the α = 0.05 rule.
In contrast to the propensity score discard rule, the BART 1 sd rule recognizes
that X1 does not play an important role in the response surface, so it only drops
7 observations that are at the boundary of the covariate space. The corresponding
BART estimate is 1.12 with a posterior standard deviation (0.26) that is quite a bit
smaller than the standard errors of both propensity score strategies. The α = 0.10
rule drops 18 observations, on the other hand, and these observations are in a
different neighborhood than those dropped by the 1 sd rule since the individual
level ratios can get large not just when $s_i^{f_0}$ is (relatively) large but also
when $s_i^{f_1}$ is (relatively) small. The corresponding estimate of 1.17 and
associated standard error (0.23) are quite similar to those achieved by the 1 sd rule.
The BART α = 0.05 rule yields an estimate from the full sample since it leads to no
discards (1.13 with a standard error of 0.27). All of the BART strategies benefit from
being able to take advantage of the information in the outcome variable.
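The contrast between the two kinds of rule can be made concrete with a small sketch. It assumes, following our reading of the rule definitions in Section 3, that the 1 sd rule flags a treated unit when its counterfactual posterior standard deviation exceeds the largest factual one by more than one standard deviation of the factual values, and that the ratio rules compare the squared ratio of the two standard deviations with a chi-squared critical value on one degree of freedom. The function names, the use of the sample standard deviation, and the toy inputs are all illustrative assumptions.

```python
import statistics

# Upper-tail chi-squared critical values with 1 degree of freedom.
CHI2_1DF = {0.10: 2.706, 0.05: 3.841}


def one_sd_rule(s_f0, s_f1):
    """Flag treated units whose counterfactual sd exceeds max(s_f1) + sd(s_f1).

    s_f0 : posterior sds of the counterfactual (control) outcome, treated units
    s_f1 : posterior sds of the factual (treated) outcome, same units
    """
    cutoff = max(s_f1) + statistics.stdev(s_f1)  # sample sd is an assumption
    return [s0 > cutoff for s0 in s_f0]


def ratio_rule(s_f0, s_f1, alpha=0.10):
    """Flag treated units whose squared sd ratio exceeds the critical value."""
    critical = CHI2_1DF[alpha]
    return [(s0 / s1) ** 2 > critical for s0, s1 in zip(s_f0, s_f1)]


# Toy posterior standard deviations for four treated units (hypothetical).
s_f1 = [1.0, 1.0, 1.0, 0.5]
s_f0 = [1.1, 1.2, 3.0, 0.9]

flag_1sd = one_sd_rule(s_f0, s_f1)
flag_10 = ratio_rule(s_f0, s_f1, alpha=0.10)
flag_05 = ratio_rule(s_f0, s_f1, alpha=0.05)
```

In this toy input the third unit is flagged by every rule because its counterfactual uncertainty is large in absolute terms, whereas the fourth unit is flagged only by the α = 0.10 rule, and only because its factual uncertainty is small. This mirrors the observation above that the ratio rules can single out a different neighborhood than the 1 sd rule.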
4.2.2. Example 2B: Two predictors, changing information. In the second
example the assignment mechanism is slightly more complicated. We start by
generating Z as a binomial draw with probabilities equal to the inverse logit of
$X_1 + X_2 - 0.5X_1X_2$. Next all control units with X1 > 0 and X2 > 0 are
removed. Two different response surfaces are generated, each as
$E[Y \mid Z, X_1, X_2] = Z + 0.5X_1 + 2X_2 + \phi X_1X_2$, where one version sets φ to 1 and the other sets φ to 3.
Therefore, both covariates are confounders in this example and both the common
support assumption and the common causal support assumption are in question.
Once again the treatment effect is 1.
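For readers who wish to experiment with this design, the data generating mechanism can be sketched as follows. This is an illustration under stated assumptions, not the code used to produce Figure 3: the interaction in the assignment model is assumed to enter with a negative sign, the outcome noise is assumed to be standard normal (the noise distribution is not stated here), and the function name and seed are arbitrary.

```python
import math
import random


def inv_logit(x):
    """Inverse logit, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def simulate_example_2b(n, phi, seed=0, noise_sd=1.0):
    """Draw n units, then drop controls with X1 > 0 and X2 > 0.

    Returns a list of (z, x1, x2, y) tuples. noise_sd is an assumption.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Assignment probability: inverse logit of X1 + X2 - 0.5 X1 X2 (assumed sign).
        z = 1 if rng.random() < inv_logit(x1 + x2 - 0.5 * x1 * x2) else 0
        if z == 0 and x1 > 0 and x2 > 0:
            continue  # remove controls from the upper-right quadrant
        mean_y = z + 0.5 * x1 + 2.0 * x2 + phi * x1 * x2  # treatment effect is 1
        rows.append((z, x1, x2, rng.gauss(mean_y, noise_sd)))
    return rows


rows = simulate_example_2b(1000, phi=3.0, seed=1)
```

By construction the upper-right quadrant contains treated units only, so any estimate of the counterfactual there must extrapolate, and the cost of doing so grows with φ.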
The propensity score discard strategy chooses the same observations to discard
across both response surface scenarios because it only takes into account
information in the assignment mechanism. Thus, the left panel in Figure 3 presents the
3 If we fail to re-estimate the propensity score after the initial discard, the matching estimate is 1.53
(s.e. 0.40) and the IPTW estimate is 1.47 (s.e. 0.44).
FIG. 3. Plots of simulated data with two predictors; the true treatment effect is 1. The display is
analogous to Figure 2, although here the two left plots display propensity score results across the
two scenarios and the two right display BART results across the two scenarios.
same plot twice; the only differences are the estimates of the treatment effect which
vary with response surface. The matching estimates get worse (0.74, then 0.13) as
the response surface becomes more highly nonlinear as do the IPTW estimates
(0.75, then 0.05). The uncertainty associated with the estimates grows between the
first and second response surface (from roughly 0.2 to roughly 0.4), yet standard
95% confidence intervals do not cover the truth in the second setting.4
4 If we fail to re-estimate the propensity score after discarding, the estimates are just as bad or
worse. For the first scenario, the matching estimate would be 0.65 (s.e. 0.28) and the IPTW estimate
would be 0.75 (s.e. 0.20). For the second scenario, the matching estimate would be 0.02 (s.e. 0.44)
and the IPTW estimate would be 0.06 (s.e. 0.36).
The BART discard strategies, on the other hand, respond to information in the
response surface. Since the lack of overlap occurs in an area defined by the
intersection of X1 and X2, uncertainty in the posterior counterfactual predictions
increases sharply when the coefficient on the interaction moves from 1 to 3 (as
displayed in the top and bottom plots in the right panel of Figure 3, resp.) and
more observations are dropped for both the 1 sd rule and the α = 0.10 rule. In this
example the α = 0.10 rule once again focuses more on observations in the quadrant with
lack of overlap with respect to the treatment condition, whereas the 1 sd rule identifies
observations that tend to have greater uncertainty more generally. No observations
are dropped by the α = 0.05 rule even when φ is 3.
The BART treatment effect estimates in both the first scenario (all about 1.1)
and the second scenario (0.83, 0.70 and 0.76) are all closer to the truth than the
propensity-score-based estimates in this example. In the first scenario the uncer-
tainty estimates (posterior standard errors of 0.26 for each) are slightly higher than
the standard errors for the propensity score estimates; in the second scenario the
uncertainty estimates (posterior standard errors all around 0.3) are all smaller than
the standard errors for the propensity score estimates.
4.3. Profiling the discarded units: Finding a needle in a haystack. When
treatment effects are not homogeneous, discarding observations from the inferential
group can change the target estimand. For instance, if focus is on the effect of the
treatment on the treated (e.g., CATT or SATT) and we discard treated observations,
then we can only make inferences about the treated units that remain (or the pop-
ulation they represent). It is important to have a sense of how this new estimand
differs from the original. In this section we illustrate a simple way to “profile” the
units that remain in the inferential sample versus those that were discarded in an
attempt to achieve common support.
In this example there are 600 observations and 40 predictors, all generated as
N(1,1). Treatment was assigned randomly at the outset; control observations were
then eliminated from two neighborhoods in this high-dimensional covariate space.
The first such neighborhood is defined by X3 > 1 and X4 > 1, the second by
X5 > 1 and X6 > 1. The nonlinear nonparallel response surface is generated as
$E[Y(1) \mid X] = 0.5X_1 + 2X_2 + 0.5X_5 + 2X_6 + 0.2X_5X_6$. The treatment effect
thus varies across levels of the included covariates. Importantly, since X3 and X4
do not enter into the response surface, only the second of the two neighborhoods
that lack overlap should be of concern.
The leftmost plot in Figure 4 displays results from the BART and propensity
score methods both before and after discarding. The numbers at the right repre-
sent the percentage of the treated observations that were dropped for each discard
method. Solid squares represent the true estimand (SATT) for the sample corre-
sponding to that estimate (the same for all methods that do not discard but different
for those that do). Circles and line segments represent estimates and corresponding
FIG. 4. Left plot displays estimands (squares) and attempted inference (circles for estimates and
bars for 95% intervals) for the BART and propensity score methods both with and without discarding.
The right plots display regression tree fits using the covariates as predictors. The responses used are
the statistic from the 1 sd rule and then the propensity score, respectively.
95% intervals for each estimate. None of the methods that fail to discard has a 95%
interval that covers the truth for the full sample. After discarding using the BART
rules, all of the intervals cover the true treatment effect for the remaining sample.
The propensity score methods drop far fewer treated observations, leading to esti-
mands that do not change much and estimates that still do not cover the estimands
for the remaining sample.
We make use of simple regression trees [CART; Breiman (2001), Breiman et al.
(1984)] to investigate the differences between the neighborhoods perceived as
problematic for each method. Regression trees use predictors to partition the sam-
ple into subsamples that are relatively homogenous with respect to the response
variable. For our purposes, the predictors are our potential confounders and the
response is the statistic corresponding to a given discard rule.5 A simple tree fit
provides a crude means of describing the neighborhoods of the covariate space
considered most problematic by each rule with respect to common support. Each
tree is restricted to a maximum depth of three for the sake of parsimony.
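The profiling idea can be illustrated with a bare-bones regression tree. The sketch below is not the tree software used for Figure 4; it is a small, self-contained implementation of greedy binary splitting on squared error with a maximum depth of three, and the toy predictors and response are hypothetical stand-ins for the confounders and the discard statistic.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (sse, feature index, threshold) for the best binary split, or None."""
    best = None
    for j in range(len(X[0])):
        levels = sorted(set(row[j] for row in X))
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, j, t)
    return best


def fit_tree(X, y, max_depth=3, depth=0):
    """Grow a regression tree as nested dicts, stopping at max_depth."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None or split[0] >= sse(y):
        return {"mean": sum(y) / len(y)}  # terminal node: mean of the response
    _, j, t = split
    left = [i for i, row in enumerate(X) if row[j] <= t]
    right = [i for i, row in enumerate(X) if row[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left], max_depth, depth + 1),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right], max_depth, depth + 1),
    }


# Toy data: the response depends only on the second predictor.
X = [[0.1 * i, float((7 * i) % 10)] for i in range(20)]
y = [5.0 if row[1] > 4.5 else 0.0 for row in X]
tree = fit_tree(X, y)
```

In this toy case the root split falls on the second predictor, the only one that drives the response. That is the behavior one hopes to see when the response is a discard statistic and only a few covariates define the region lacking support.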
5 Another strategy would be to use the indicator for discard as the response variable. This could
become problematic if the number of discarded observations is small and would yield no information
about the likelihood of being discarded in situations where no units exceeded the threshold.
To profile the units that the BART 1 sd rule considers problematic, we use for
the response variable in the tree the corresponding statistic relative to the cutoff
rule (appropriate for estimating the effect of the treatment on the treated),
$s_i^{f_0} - m_1 - \mathrm{sd}(s_j^{f_1})$, where i and j index treated units. The tree fit is displayed in the top
right plot of Figure 4 with the mean of the response in each terminal node given in
the corresponding oval. Note that the decision rules for the tree are based almost
exclusively on the variables X5 and X6, as we would hope they would be given
how the data were generated.
The tree fit using the propensity score as the response is displayed in the lower
right plot of Figure 4. X5 plays a far less prominent role in this tree and X6 does not
appear at all. X16, X36, and X40 play important roles even though these variables
are not strong predictors in the response surface; in fact, these are all independent
of both the treatment and the response.
This example illustrates two things. First, regression trees may be a useful strat-
egy for profiling which neighborhoods each method has identified as problematic
with regard to common support. Second, the propensity score approach may fail
to appropriately discover areas that lack overlap if the model for the assignment
mechanism and the model for the response surface are not well aligned with re-
spect to the relative importance of each variable. We explore the importance of this
type of alignment in more detail in the next section.
5. Simulation evidence. This section explores simulation evidence regarding
the performance of our proposed method for identifying lack of common support
relative to the performance of two commonly-used and several less-commonly-
used propensity-score-based alternatives. Overall we compare the performance of
12 different estimation strategies across 32 different simulated scenarios.
5.1. Simulation scenarios. These scenarios represent all combinations of five
design factors. The first factor varies whether the logit of the conditional expec-
tation of the treatment assignment is linear or nonlinear in the covariates. The
second factor varies the relative importance of the covariates with regard to the
assignment mechanism versus the response surface. In one setting of this factor
(“aligned”) there is substantial alignment in the predictive strength of the covari-
ates across these two mechanisms—the covariates that best predict the treatment
also predict the outcome well. In the other setting (“not as aligned”) the covariates
that best predict the treatment strongly and those that predict the response strongly
are less well aligned (for details see the description of the treatment assignment
mechanisms and response surfaces and Table 1, below).6 The third factor is the
ratio of treated to control (4:1 or 1:4) units. The fourth factor is the number of
predictors available to the researcher (10 versus 50, although in both cases only
6 For a related discussion of the importance of alignment in causal inference see Kern et al. (2013).
TABLE 1. Nonzero coefficients in $\gamma^L$ and $\gamma^{NL}$ for the treatment assignment mechanism as well as for
$\beta_z^L$ and $\beta_z^{NL}$ for the nonlinear, not parallel response surfaces.
Coefficients for the parallel response surface are the same as those for Y(0) in the nonparallel response surface
2x2x6x5x6x7x8x9x10 x2
Treatment assignment mechanisms
Linear 0.4
Nonlinear 0.4 0.8 0.8 0.5 0.3 0.8 0.2 0.4 0.3 0.8 0.5
Response surfaces, nonlinear and not parallel
Y(0)0.5 2 0.5 2 0.4 0.8 0.5 0.5 0.5 0.7
Y(1)0.5 1 0.5 0.80.3
Not as aligned
Y(0)0.5 2 0.4 0.5 1 0.5 2 0.5 1.5 0.7
Y(1)0.5 0.5 0.5 2 0.3
8 are relevant). The fifth and final factor is whether or not the nonlinear response
surfaces are parallel across treatment and control groups; nonparallel response sur-
faces imply heterogeneous treatment effects.
In all scenarios each covariate is generated independently from $X_j \sim N(0, 1)$.
These column vectors comprise the matrix X. The general form of the linear
treatment assignment mechanism is $Z \sim \mathrm{Binomial}(n, p)$ with $p = \mathrm{logit}^{-1}(\omega + X\gamma^L)$,
where the offset ω is specified to create the appropriate ratio of treated to control
units. The nonlinear form of this assignment mechanism simply includes some
nonlinear transformations of the covariates in X, denoted as Q with
corresponding coefficients $\gamma^{NL}$. The nonzero coefficients for the terms in these models are
displayed in Table 1.
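How the offset ω might be chosen is not spelled out here. One simple possibility, sketched below as an assumption rather than a description of the actual procedure, is to solve numerically for the value of ω at which the average assignment probability equals the target fraction of treated units. The coefficient vector in the sketch is hypothetical and does not reproduce Table 1.

```python
import math
import random


def inv_logit(x):
    """Inverse logit, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def calibrate_offset(eta, target, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection search for omega such that the mean of
    inv_logit(omega + eta_i) equals the target treated fraction."""
    def mean_p(omega):
        return sum(inv_logit(omega + e) for e in eta) / len(eta)

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_p(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


rng = random.Random(1)
n, k = 500, 10
gamma = [0.4] * 3 + [0.0] * (k - 3)  # hypothetical coefficients, not those of Table 1
X = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
eta = [sum(g * x for g, x in zip(gamma, row)) for row in X]  # X times gamma

omega = calibrate_offset(eta, target=0.8)  # 4:1 ratio of treated to control
p = [inv_logit(omega + e) for e in eta]
Z = [1 if rng.random() < pi else 0 for pi in p]  # one Bernoulli draw per unit
```

Because the mean assignment probability is monotone in ω, bisection is sufficient; the realized treated fraction in any one draw of Z will still vary around the target.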
We simulate two distinct sets of response surfaces that differ in both their level
of alignment with the assignment mechanism and whether they are parallel. Both
sets used are nonlinear in the covariates and each set is generated generally as
$E[Y(z) \mid X] = X\beta_z^L + Q\beta_z^{NL} + z\tau$, for $z \in \{0, 1\}$,
where $\beta_z^L$ is a vector of coefficients for the untransformed versions of the
predictors X and $\beta_z^{NL}$ is a vector of coefficients for the transformed versions of the
predictors captured in Q. In the scenarios with parallel response surfaces, τ (the
constant treatment effect) is 4, $\beta_0^L = \beta_1^L$ and $\beta_0^{NL} = \beta_1^{NL}$, and both use the
coefficients from Y(0) in Table 1 (only nonzero coefficients displayed). In the scenarios
where the response surfaces are not parallel, τ = 0, and the nonzero coefficients in
$\beta_z^L$ and $\beta_z^{NL}$ are displayed in Table 1.
Table 1 helps us understand the alignment in predictor strength between the
assignment mechanism and response surfaces for each of the two scenarios. The
“aligned” version of the response surfaces places weight on the covariates most
predictive of the assignment mechanism (both the linear and nonlinear pieces).
There is no reason to believe that this alignment occurs in real examples.
Therefore, we explore a more realistic scenario where coefficient strength is “not as
aligned.”
We replicate each of the 32 scenarios 200 times and in each simulation run we
implement each of 12 different modeling strategies. For each the goal is to estimate
the conditional average effect of the treatment on the subset of treated units that
were not discarded.
5.2. Estimation strategies compared. We compare three basic causal infer-
ence strategies without discarding—BART [implemented as described above and
in Hill (2011) except using 100 trees], propensity score matching, and IPTW—
with nine strategies that involve discarding.
The first three discarding approaches discard using the 1 sd rule, the α = 0.10
rule, and the α = 0.05 rule and each is coupled with a BART analysis of the causal
effect on the remaining sample.7 The remaining 6 approaches are combinations of
3 propensity score discarding strategies and 2 analysis strategies. The 3 propensity
score discard strategies vary by the estimation strategy for the propensity score
model: standard logit, generalized boosted regression model [recommended for
propensity score estimation by McCaffrey, Ridgeway and Morral (2004)], and gen-
eralized additive models [recommended for propensity score estimation by Woo,
Reiter and Karr (2008)]. The 2 analysis strategies (each conditional on a given
propensity score estimation model) are one-to-one matching (followed by regres-
sion adjustment) and inverse-probability of treatment weighting (in the context of
a linear regression model). In all propensity score strategies the propensity score is
re-estimated after the initial units are discarded. The y-axis labels of the results fig-
ures indicate these 12 different combinations of strategies. All strategies estimate
the effect of the treatment on the treated.
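As a bookkeeping check, the design factors of Section 5.1 and the strategies listed above can be enumerated directly. The labels below are shorthand chosen for this sketch, not names used in the results figures.

```python
from itertools import product

# Five two-level design factors (labels are illustrative shorthand).
FACTORS = {
    "assignment": ("linear", "nonlinear"),
    "alignment": ("aligned", "not as aligned"),
    "treated_to_control": ("4:1", "1:4"),
    "n_predictors": (10, 50),
    "response_surfaces": ("parallel", "nonparallel"),
}
scenarios = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]

# Each strategy is a (discard rule, analysis method) pair.
strategies = (
    [("no discard", a) for a in ("BART", "matching", "IPTW")]
    + [(rule, "BART") for rule in ("1 sd", "alpha=0.10", "alpha=0.05")]
    + [(ps + " propensity score", a)
       for ps in ("logit", "GBM", "GAM")
       for a in ("matching", "IPTW")]
)

n_fits = len(scenarios) * len(strategies) * 200  # 200 replications per scenario
```

The enumeration gives 32 scenarios and 12 strategies, so the full study involves 32 × 12 × 200 = 76,800 scenario-by-strategy estimates.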
We implement these models in several packages in R [R Core Team (2012)]. We
use the bart() function in the BayesTree package [Chipman and McCulloch
(2009)] to fit BART models. For each BART fit, we allow the maximum number
of trees in the sum to be 100 as described in Section 3.1 above. To ensure the
convergence of the MCMC in BART without having to check for each simulation
run, we are conservative and let the algorithm run for 3500 iterations with the
first 500 considered burn-in. To implement the GBM routine, we use the gbm()
function of the gbm package [Ridgeway (2007)]. In an attempt to optimize the
settings for estimating propensity scores, we adopt the suggestions of McCaffrey,
Ridgeway and Morral [(2004), page 409] for the tuning parameters of the GBM: 100
trees, a maximum of 4 splits for each tree, a small shrinkage value of 0.0005, and
a random sample of 50% of the data set to be used for each fit in each iteration.8
We use the gam() function of the gam package [Hastie (2009)] to implement the
GAM routine.
5.3. Simulation results. Figure 5 presents results from 8 scenarios that have
the common elements of a linear treatment assignment mechanism and parallel
response surfaces. The linear treatment assignment mechanism should favor the
propensity score approaches. The top panel of 4 plots in this figure corresponds
to the setting where there is alignment in the predictive strength of the covariates;
this setting should favor the propensity score approach as well since it implicitly
uses information about the predictive strength of the covariates with regard to the
treatment assignment mechanism to gauge the importance of each covariate as a
confounder. The bottom panel of Figure 5 reflects scenarios in which the predictive
7 We do not re-estimate BART after discarding but simply limit our inference to MCMC results
from the nondiscarded observations.
8 In response to a suggestion by a reviewer we also implemented this method using the twang
package in R [Ridgeway et al. (2012)] using the settings suggested in the vignette (n.trees = 5000,
interaction.depth = 2, shrinkage = 0.01). This did not improve the GBM results.
FIG. 5. Simulation results for the scenarios in which the treatment assignment is linear and the
response surfaces are parallel. Solid dots represent average differences between estimated treatment
effects and the true ones standardized by the standard deviations of the outcomes. Bars are root mean
square errors (RMSE) of such estimates. The drop rates are the percentage of discarded units. Discard
and analysis strategies are described in the text. Five modeling strategies are highlighted with hollow
bars for comparison: the three BART strategies and the propensity score versions most likely to be
implemented (these are the same strategies illustrated in the examples in Section 4).
strength of the covariates is not as well aligned between the treatment assignment
mechanism and the response surface. This setup provides less of an advantage for
the propensity score methods. The potential for bias across all methods, however,
should be reduced.
Within each plot, each bar represents the root mean square error (RMSE) of
the estimates for that scenario for a particular estimation strategy. The dots rep-
resent the absolute bias (the absolute value of the average difference between the
estimates and the CATT estimand). Drop rates for the discarding methods are in-
dicated on the right-hand side of each plot. We highlight (with unfilled bars) the
BART discard/analysis strategies as well as the two propensity score discard strate-
gies that rely on the logit specification of the propensity score model (the most
commonly used model for estimating propensity scores).
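The two summaries plotted for each strategy can be computed as in the following sketch. It assumes that each replication supplies an estimate together with its own estimand, since the CATT for the retained units changes with the set of units discarded; the function names and toy numbers are illustrative, and any standardization by the outcome standard deviation is omitted.

```python
import math


def absolute_bias(estimates, estimands):
    """Absolute value of the average difference between estimates and estimands."""
    errors = [e - t for e, t in zip(estimates, estimands)]
    return abs(sum(errors) / len(errors))


def rmse(estimates, estimands):
    """Root mean square error of the estimates around their estimands."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, estimands)]
    return math.sqrt(sum(squared) / len(squared))


# Toy replications (hypothetical): errors cancel on average but not in RMSE.
estimates = [1.2, 0.8, 1.1, 0.9]
estimands = [1.0, 1.0, 1.0, 1.0]
bias_value = absolute_bias(estimates, estimands)
rmse_value = rmse(estimates, estimands)
```

The toy input shows why both summaries are reported: the errors average to essentially zero, so the bias is negligible, while the RMSE of about 0.16 still reflects the variability of the estimates.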
The first thing to note about Figure 5 is that there is little bias in any of the
methods across all of these eight scenarios and likewise the RMSEs are all small.
Within this we do see some small differences in the absolute levels of bias across
methods in the aligned scenarios, with slightly less bias evidenced by the propen-
sity score approaches and smaller RMSEs for the BART approaches. In the non-
aligned scenarios the differences in bias nearly disappear (with a slight advantage
overall for BART) and the advantage with regard to RMSE becomes slightly more
pronounced. None of the methods drop a large percentage of treated observations,
but the BART rules discard the least (with one small exception).
The eight plots in Figure 6 represent scenarios in which the nonlinear treatment
assignment mechanism was paired with parallel response surfaces. The nonlinear
treatment assignment presents a challenge to the naively specified propensity score
models. These plots vary between upper and lower panels in similar ways as seen
in Figure 5. Overall, these plots show substantial differences in results between the
BART and propensity score methods. The BART discard methods drop far fewer
observations and yield substantially less bias and smaller RMSE across the board.
The differences between propensity score methods are negligible.
Figure 7 corresponds to scenarios with linear treatment assignment mechanism
and nonparallel response surfaces. The top panel shows little difference in RMSE
or bias for the BART 1 sd rule compared to the best propensity score strategies
(sometimes slightly better and sometimes slightly worse). The BART α = 0.10
rule and α = 0.05 rule perform slightly worse than the 1 sd rule in all four
scenarios. The bottom panel of Figure 7 shows slightly more clear gains with regard
to RMSE for the BART discard methods; the results regarding bias, however, are
slightly more mixed, though the differences are not large. Across all scenarios the
BART 1 sd rule drops a higher percentage of treated observations than the
propensity score rules; this difference is substantial in the scenarios where treated
outnumber controls 4 to 1. The BART 1 sd rule always drops more than the ratio rules
when controls outnumber treated but not when the treated outnumber controls.
The eight plots in Figure 8 all represent scenarios with nonlinear treatment
assignment mechanism and nonparallel response surfaces. In the top panel the
differences between the BART methods and the best propensity score methods are
not large with regard to either bias or RMSE with BART performing worst in the
scenario with 50 potential predictors and more treated than controls. In the bottom
plots corresponding to misaligned strength of coefficients BART displays consistent
gains over the propensity score approaches both in terms of bias and RMSE.
All the methods discard a relatively high percentage of treated observations.
FIG. 6. Simulation results for the scenarios with nonlinear treatment assignment and parallel
response surfaces. Description otherwise the same as in Figure 5.
While it does not dominate at every combination of our design factors, the
BART 1 sd rule appears to perform most reliably across all the methods overall.
In particular, it almost always performs better with regard to RMSE and it often
performs well with respect to bias as well.
6. Discarding and profiling when examining the effect of breastfeeding on
intelligence. The putative effect of breastfeeding on intelligence or cognitive
achievement has been heavily debated over the past few decades. This debate is
complicated by the fact that this question does not lend itself to direct experimen-
tation and, thus, the vast majority of the research that has been performed has relied
FIG. 7. Simulation results for the scenarios with linear treatment assignment and nonparallel
response surfaces. Description otherwise the same as in Figure 5.
on observational data. While many of these studies demonstrate small to medium-
sized positive effects [see, e.g., Anderson, Johnstone and Remley (1999), Lawlor
et al. (2006), Mortensen et al. (2002), among others] some contrary evidence exists
[notably Der, Batty and Deary (2006), Drane and Logemann (2000), Jain, Concato
and Leventhal (2002)]. It has been hypothesized that the effects of breastfeeding
increase with the length of exposure, therefore, to maximize the chance of detect-
ing an effect, it makes sense to examine the effect of breastfeeding for extended
durations versus not at all. This approach is complicated by the fact that moth-
ers who breastfeed for longer periods of time tend to have substantially different
characteristics on average than those who never breastfeed (as an example see the
FIG. 8. Simulation results for the scenarios with nonlinear treatment assignment and nonparallel
response surfaces. Description otherwise the same as in Figure 5.
unmatched differences in means in Figure 9). Thus, identification of areas of com-
mon support should be an important characteristic of any analysis attempting to
identify such effects.
Randomized experiments have been performed that address related questions.
Such studies have been used to establish a causal link, for instance, between two
fatty acids found in breast milk (docosahexaenoic acid and arachidonic acid) and
eyesight and motor development [see, e.g., Lundqvist-Persson et al. (2010)]; this
could represent a piece of the causal pathway between breastfeeding and sub-
sequent cognitive development. Furthermore, a recent large-scale study [Kramer
et al. (2008)] randomized encouragement to breastfeed and found significant, pos-
FIG. 9. Top panel: balance represented as standardized differences in means for each of three
samples: unmatched (open circles), post-discarding matched (solid circles), and post-discarding
re-weighted (plus signs). Discarding combined with matching and weighting substantially improves
the balance. Bottom panel: overlapping histograms of propensity scores (on the linear scale) for both
breastfeeding groups.
itive estimates of the intention-to-treat effect (i.e., the effect of the randomized
encouragement) on verbal and performance IQ measures at six and a half years
old. Even a randomized study such as this, however, cannot directly address the
effects of prolonged breastfeeding on cognitive outcomes. This estimation would
still require comparisons between groups that are not randomly assigned. More-
over, an instrumental variables approach would not necessarily solve the problem
either. Binary instruments cannot be used to identify effects at different dosage
levels of a treatment without further assumptions. However, dichotomization of
breastfeeding duration would almost certainly lead to a violation of the exclusion
restriction.
We examine the effect of breastfeeding for 9 months or more (compared to not
breastfeeding at all) on child math and reading achievement scores at age 5 or 6.
Our “treatment” group consists of 271 mothers who breastfed at least 38 weeks
and our “control” group consists of 1832 mothers who reported 0 weeks of
breastfeeding. To create a cleaner comparison, we remove from our analysis sample
mothers who breastfed more than 0 weeks but less than 38 weeks. Given that the
most salient policy question is whether new mothers should be (more strongly)
encouraged to breastfeed their infants, the estimand of interest is the effect of the
treatment on the controls. That is, we would like to know what would have hap-
pened to the mothers in the sample who were observed to not breastfeed their
children if they had instead breastfed for at least 9 months.
We used data from the National Longitudinal Survey of Youth (NLSY) Child
Supplement [for more information see Chase-Lansdale et al. (1991)]. The NLSY
is a longitudinal survey that began in 1979 with a cohort of approximately 12,600
young men and women aged 14 to 21 and continued annually until 1994 and
biennially thereafter. The NLSY started collecting information on the children of
female respondents in 1986. Our sample comprises 2103 children of the NLSY born
from 1982 to 1993 who had been tested in reading and math at age 5 or 6 by the
year 2000 and whose mothers fell into our two breastfeeding categories (no months
or 9 plus months).
In addition to information on number of weeks each mother breastfed her child,
we also have access to detailed information on potential confounders. The co-
variates included are similar to those used in other studies on breastfeeding using
the NLSY [see, e.g., Der, Batty and Deary (2006)]; however, we excluded several
post-treatment variables that are often used, such as child care and home
environment measures, since these could bias causal estimates [Rosenbaum (1984)].
Measurements regarding the child at birth include birth order, race/ethnicity, sex,
days in hospital, weeks preterm, and birth weight. Measurements on the mother in-
clude her age at the time of birth, race/ethnicity, Armed Forces Qualification Test
(AFQT) score, whether she worked before the child was born, days in hospital
after birth, and educational level at birth. Household measures include income (at
birth), whether a spouse or partner was present at the time of the birth of the child,
and whether grandparents were present one year before birth.
The children in the NLSY subsample were tested on a variety of cognitive mea-
sures at each survey point (every two years starting with age 3 or 4). We make use
of the Peabody Individual Achievement Test (PIAT) math and reading scores from
assessments that took place either at age 5 or 6 (depending on the timing of the
survey relative to the age of the child).
To allow focus on issues of common support and causal inference and to avoid
debate about the best way to deal with the missing data, we simply limit our sample
to complete cases. Due to this restriction, this sample should not be considered to
be representative of all children in the NLSY child sample whose mothers fell into
the categories defined.
Comparing the two groups based on the baseline characteristics reveals
imbalance. Figure 9 displays the balance for the unmatched (open circles),
post-discarding matched (solid circles), and post-discarding re-weighted (plus signs)
samples. The matched and reweighted samples are much more closely balanced
than the unmatched sample, particularly for the household and race variables.
The bottom panel of Figure 9 displays the overlap in propensity scores estimated
by logistic regression (displayed on the linear scale). The histogram for the con-
trol units has been shaded in with grey, while the histogram for the treated units
is simply outlined in black. This plot suggests lack of common support for the
control units with respect to the estimated propensity score. The question remains,
however, whether sufficient common support on relevant covariates exists.
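The balance measure plotted in the top panel of Figure 9 can be illustrated with a short computation. The following is a minimal sketch, not the code used for the analysis: the pooled-standard-deviation denominator, the function names, and the toy covariate values are all assumptions introduced here for illustration.

```python
from math import sqrt


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def sample_variance(values):
    """Unweighted sample variance (n - 1 denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def standardized_difference(treated, control, w_treated=None, w_control=None):
    """Difference in (optionally weighted) means in pooled-SD units.

    Assumption: the denominator pools the unweighted group variances so that
    it stays fixed when matching or IPTW weights are applied.
    """
    w_treated = w_treated if w_treated is not None else [1.0] * len(treated)
    w_control = w_control if w_control is not None else [1.0] * len(control)
    diff = weighted_mean(treated, w_treated) - weighted_mean(control, w_control)
    pooled_sd = sqrt((sample_variance(treated) + sample_variance(control)) / 2.0)
    return diff / pooled_sd


# Hypothetical covariate values for a handful of mothers in each group.
treated = [60.0, 70.0, 80.0]
control = [30.0, 40.0, 50.0, 60.0]

unmatched = standardized_difference(treated, control)
# Hypothetical weights that keep only the two controls closest to the treated.
reweighted = standardized_difference(treated, control,
                                     w_control=[0.0, 0.0, 1.0, 1.0])
```

Holding the denominator fixed across samples is one possible design choice; it makes the unmatched and reweighted values directly comparable on a single axis.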
We use both propensity score and BART approaches to address this question.
The results of our analyses are summarized in Table 2, which displays for each
method and test score (reading or math) combination: treatment effect estimate,
standard error (see footnote 9), and number of units discarded. Without discarding,
the estimates differ substantially across BART, linear regression after one-to-one
nearest neighbor propensity score matching with replacement (Match), IPTW
regression, and standard linear regression (OLS); propensity scores were estimated
in all cases using logistic regression. For reading test scores the treatment effect estimates
are (3.5, 2.5, 1.5, and 3.2) with standard errors ranging between roughly 0.9 and
1.6. For math test scores the estimates are (2.4, 3.4, 2.6, and 2.2) with standard
errors ranging between roughly 0.9 and 1.9.
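One-to-one nearest neighbor matching with replacement, and the "number of times used" weights it produces, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation behind Table 2: we assume each control (inferential) unit is matched to the nearest treated unit on the estimated propensity score, and the scores and function name are hypothetical.

```python
def match_with_replacement(inferential_scores, donor_scores):
    """Match each inferential unit to its nearest donor on the propensity score.

    Returns the index of the matched donor for each inferential unit and the
    number of times each donor was used (the donor's analysis weight).
    """
    matches = []
    donor_weights = [0] * len(donor_scores)
    for score in inferential_scores:
        nearest = min(range(len(donor_scores)),
                      key=lambda j: abs(donor_scores[j] - score))
        matches.append(nearest)
        donor_weights[nearest] += 1
    return matches, donor_weights


# Hypothetical propensity scores on the linear scale.
control_scores = [-2.0, -1.5, 0.1]   # inferential units (assumed: the controls)
treated_scores = [-1.4, 0.0, 1.2]    # donors (assumed: the treated)

matches, donor_weights = match_with_replacement(control_scores, treated_scores)
```

In this toy example the first treated unit serves as the match for two controls and so receives weight 2, while the third treated unit is never used and receives weight 0.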
For the analysis of the effect on reading, the BART α=0.10 rule would discard
93 observations; however, neither the BART 1 sd rule nor the α=0.05 rule would
discard any. Regardless of the discard strategy, however, the BART estimate is
about 3.5 with posterior standard deviation of a little over 1. Levels of discarding
are similar for math test scores, although for this outcome the BART α=0.10 rule
would discard 53. Similarly, the effect estimates (2.4) and associated uncertainty
estimates (a little over 1) are almost identical across strategies.
9 We calculate standard errors for the propensity score analyses by treating the weights (for
matching, the weights are equal to the number of times each observation is used in the analysis) as
survey weights. This was implemented using the survey package in R. Technically speaking, the
uncertainty of each BART estimate is expressed by the standard deviation of the posterior
distribution of the treatment effect.
TABLE 2. Treatment effect estimates, associated standard errors, and number of units discarded
for each method and test score (reading or math) combination
                       Reading                              Math
              Treatment  Standard  Number       Treatment  Standard  Number
Method        effect     error     discarded    effect     error     discarded
BART          3.5        1.07      0            2.4        1.05      0
BART-D1       3.5        1.07      0            2.4        1.05      0
BART-D2       3.5        1.04      93           2.4        1.04      53
BART-D3       3.5        1.07      0            2.4        1.05      0
Match         2.5        1.62      0            3.4        1.74      0
Match-D       3.6        1.50      168          1.5        1.13      168
Match-D-RE    3.8        1.43      168          1.5        1.18      168
IPTW          1.5        1.57      0            2.6        1.92      0
IPTW-D        1.6        1.52      168          2.6        1.85      168
IPTW-D-RE     1.6        1.51      168          2.6        1.80      168
OLS           3.2        0.87      0            2.2        0.89      0
Using propensity scores (estimated using a logistic model linear in the covari-
ates) to identify common support discards 168 of the control units. This strat-
egy does not change depending on the outcome variable. Using propensity scores
estimated on the remaining units, matching (followed by regression adjustment;
Match-D-RE) and IPTW regression (IPTW-D-RE) yield reading treatment effect
estimates for the reduced sample of 3.8 (s.e. 1.43) and 1.6 (s.e. 1.51), respectively.
If we do not re-estimate the propensity score after discarding, these estimates
(Match-D and IPTW-D) are 3.6 (s.e. 1.50) and 1.6 (s.e. 1.52), respectively. The
results for math are quite heterogeneous as well, with matching and IPTW yield-
ing estimates of 1.5 (s.e. 1.18) and 2.6 (s.e. 1.80), respectively. Re-estimating the
propensity scores did not change the results for this outcome (when rounding to
the first decimal place).
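A propensity-score discard step and the weights used for IPTW when the target is the control group can be sketched as below. The exact cutoff used in this analysis is not restated in this section, so the range-based rule here is an assumption (one common rule of this kind), and all function names and score values are hypothetical. The weight (1 - e) / e for treated units is the standard choice when estimating the effect of the treatment on the controls.

```python
def outside_common_support(control_scores, treated_scores):
    """Flag controls whose score falls outside the range of the treated scores.

    Assumption: common support is judged by the min/max of the treated scores.
    """
    lo, hi = min(treated_scores), max(treated_scores)
    return [not (lo <= s <= hi) for s in control_scores]


def atc_weights(treated_probs):
    """IPTW weights for treated units when the controls are the target group.

    A treated unit with estimated propensity e gets weight (1 - e) / e;
    control units keep weight 1.
    """
    return [(1.0 - e) / e for e in treated_probs]


# Hypothetical estimated propensity scores on the probability scale.
control_probs = [0.01, 0.05, 0.20, 0.40]
treated_probs = [0.04, 0.25, 0.50]

discard = outside_common_support(control_probs, treated_probs)
iptw = atc_weights(treated_probs)
```

Because the rule depends only on the estimated scores, the same units are flagged regardless of which outcome is analyzed, which is consistent with the identical discard counts for reading and math.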
It is important to remember that the methods that discard units are estimating
different estimands than those that do not; therefore, direct comparisons between
the BART and propensity score estimates are not particularly informative.
Importantly, however, both propensity score methods are estimating the same effect (they
discarded the exact same units); therefore, the differences between these estimates
are a bit disconcerting. One possible explanation for these discrepancies is that the
two propensity score methods do yield somewhat different results with regard to
balance as displayed in Figure 9; IPTW yields slightly closer balance on average
(though not for every covariate).
What might account for the differences in which units were discarded between
the BART and propensity score approaches? To better understand, we examine more
closely which variables each strategy identifies as important with regard to
common support. We consider the predictive strength of each covariate in both the
propensity score and BART models, and we fit regression trees with the discard
statistics as response variables, just as in Section 4.3.
BART identifies birth order, mother’s AFQT score, household income, mother’s
educational attainment at time of birth, and the number of days the child spent in
the hospital as the most important continuous predictors for both outcomes (al-
though the relative importance of each changes a bit between outcomes). Recall,
however, that the BART discard rules are driven by circumstances in which the
level of information about the outcome changes drastically across observations in
different treatment groups. The overlap across treatment groups for most of these
variables is actually quite good. While some, like AFQT, are quite imbalanced,
overlap still exists for all of the inferential (control) observations. More problem-
atic in terms of common support is the variable that reflects the number of days
the child spent in the hospital; 30 children of mothers who did not breastfeed had
values for this variable higher than the maximum value (30 days) for the children
of mothers who did breastfeed for nine or more months. Not surprisingly, this
variable is the primary driving force behind the BART 1 sd rule, as seen in Figure 10,
particularly for mothers who did not have a spouse living in the household at the
time of birth. Mother’s education plays a more important role for the BART ratio
rules for the reading outcome. This variable also has some issues with incomplete
overlap, and it is slightly more important in predicting reading outcomes than math
outcomes.
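To make the contrast between the 1 sd rule and the ratio rules concrete, the sketch below applies both to per-unit posterior standard deviations. It is an illustration only. We assume the 1 sd rule flags a unit when its counterfactual posterior standard deviation exceeds the maximum factual standard deviation plus one standard deviation of the factual values, and that the ratio rules compare the squared ratio of counterfactual to factual standard deviations against a chi-squared (1 df) critical value. These forms should be checked against the definitions given earlier in the paper; the input values and function names are hypothetical.

```python
from statistics import stdev

# Upper-tail critical values of the chi-squared distribution with 1 df.
CHISQ_1DF = {0.10: 2.706, 0.05: 3.841}


def one_sd_rule(sd_factual, sd_counterfactual):
    """Flag units whose counterfactual uncertainty exceeds a global cutoff."""
    cutoff = max(sd_factual) + stdev(sd_factual)
    return [s_cf > cutoff for s_cf in sd_counterfactual]


def ratio_rule(sd_factual, sd_counterfactual, alpha=0.10):
    """Flag units whose own variance ratio exceeds the critical value."""
    critical = CHISQ_1DF[alpha]
    return [(s_cf / s_f) ** 2 > critical
            for s_f, s_cf in zip(sd_factual, sd_counterfactual)]


# Hypothetical posterior standard deviations for four inferential units.
sd_factual = [0.5, 1.0, 1.2, 1.1]          # under the observed condition
sd_counterfactual = [0.9, 1.1, 1.3, 3.0]   # under the unobserved condition

flag_1sd = one_sd_rule(sd_factual, sd_counterfactual)
flag_10 = ratio_rule(sd_factual, sd_counterfactual, alpha=0.10)
flag_05 = ratio_rule(sd_factual, sd_counterfactual, alpha=0.05)
```

In this toy example the first unit is flagged only by the α=0.10 ratio rule: its counterfactual uncertainty is small in absolute terms but large relative to its own factual uncertainty, which is the kind of case in which the rules can disagree.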
A look at the fitted propensity score model, on the other hand, reveals that
breastfeeding for nine or more months is predicted most strongly by the mother’s
AFQT scores, her educational attainment, and her age at the time of the birth of
her child. Thus, these variables drive the discard rule. In particular, the critical role
of the mother’s AFQT score is evidenced in the regression tree for the discard rule at the
bottom of Figure 10. Children whose mothers were not married at birth and had
AFQT scores less than 50 were most likely to be discarded from the group
of nonbreastfeeding mothers about whom we would like to make inferences.
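The idea of profiling discards with a regression tree can be illustrated by computing only the first split of such a tree. The sketch below is not the tree-fitting software used for Figure 10: it searches every covariate and threshold for the split that minimizes the within-node sum of squared errors of a discard statistic. The covariate names and values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty node)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, response):
    """Return (covariate, threshold) of the SSE-minimizing binary split.

    rows is a list of dicts mapping covariate name to numeric value;
    response holds the discard statistic for each row. Returns None when
    no covariate takes more than one distinct value.
    """
    best = None
    for name in rows[0]:
        levels = sorted(set(r[name] for r in rows))
        for lo, hi in zip(levels, levels[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, response) if r[name] < threshold]
            right = [y for r, y in zip(rows, response) if r[name] >= threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, name, threshold)
    return None if best is None else (best[1], best[2])


# Hypothetical covariates and discard statistics for four control units.
rows = [
    {"days_in_hospital": 2, "afqt": 30},
    {"days_in_hospital": 3, "afqt": 70},
    {"days_in_hospital": 40, "afqt": 50},
    {"days_in_hospital": 45, "afqt": 20},
]
discard_statistic = [0.1, 0.2, 2.0, 2.2]

split = best_split(rows, discard_statistic)
```

A full tree would apply the same search recursively within each resulting node; the first split alone already indicates which covariate most sharply separates units with large discard statistics from the rest.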
What conclusions can we draw from this example? Substantively, if we feel
confident about the ignorability assumption, the BART results suggest a moderate
positive impact of breastfeeding 9 or more months on both reading and math
outcomes at age 5 or 6. The propensity score results for the sample that remains after
discarding for common support are more mixed, with only the matching estimates
on reading outcomes showing up as positive and statistically significant.
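The statement about statistical significance can be checked with simple arithmetic on the post-discarding rows of Table 2. The sketch below divides each estimate by its standard error; the cutoff of 1.96 (a two-sided 5% normal test) is our assumption, since the criterion is not stated explicitly in the text.

```python
# (estimate, standard error) pairs copied from Table 2, post-discarding rows.
post_discard = {
    ("Match-D", "reading"): (3.6, 1.50),
    ("Match-D-RE", "reading"): (3.8, 1.43),
    ("IPTW-D", "reading"): (1.6, 1.52),
    ("IPTW-D-RE", "reading"): (1.6, 1.51),
    ("Match-D", "math"): (1.5, 1.13),
    ("Match-D-RE", "math"): (1.5, 1.18),
    ("IPTW-D", "math"): (2.6, 1.85),
    ("IPTW-D-RE", "math"): (2.6, 1.80),
}

CRITICAL_VALUE = 1.96  # assumption: two-sided 5% normal cutoff

# Ratio of each estimate to its standard error.
ratios = {key: est / se for key, (est, se) in post_discard.items()}
significant = sorted(key for key, z in ratios.items()
                     if abs(z) > CRITICAL_VALUE)
```

Under this cutoff only the two matching estimates for reading (ratios of 2.4 and roughly 2.7) exceed the threshold; every other ratio is below 1.5.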
Methodologically, this is an example in which propensity score rules yield more
discards than BART rules. The most reliable rule based on our simulation results
(the BART 1 sd rule) would not discard any units. A closer look at the overlap for
specific covariates and at regression trees for the discard statistics indicates that
the BART discard rules may represent a better reflection of the actual relationships
between the variables. The lack of stability of the propensity score estimates is also
cause for concern. We emphasize, however, that we have used rather naive
propensity score approaches which are not intended to represent best practice. Given the
current lack of guidance with regard to optimal choices for propensity score
models and specific matching and weighting methods, we chose instead to use
implementations that were as straightforward as the BART approach.
FIG. 10. Regression trees explore the characteristics of units at risk of failing to satisfy common
(causal) support. The top two trees use the two statistics from the BART discard rules for the reading
outcome variable as the response; the next two trees use the two statistics from the BART discard
rules for the math outcome variable. The bottom tree uses the estimated propensity score subtracted
from the cutoff (maximum estimated propensity score for the controls). The predictors of the trees are
all the potential confounding covariates. For all trees the larger the statistic the more likely the unit
will be discarded, so focus is on the rightmost part of each tree.
7. Discussion. Evaluation of empirical evidence for the common support as-
sumption has been given short shrift in the causal inference literature although the
implications can be important. Failure to detect areas that lack common causal sup-
port can lead to biased inference due to imbalance or inappropriate model extrapo-
lation. On the other extreme, overly conservative assessment of neighborhoods or
units that seem to lack common support may be equally problematic.
This paper distinguishes between the concepts of common support and common
causal support. It introduces a new approach for identifying common causal sup-
port that relies on Bayesian Additive Regression Trees (BART). We believe that
this method’s flexible functional form and its ability to take advantage of
information in the response surface allow it to better target areas of common causal
support than traditional propensity-score-based methods. We also propose a sim-
ple approach to profiling discarded units based on regression trees. The potential
usefulness of these strategies has been demonstrated through examples and simu-
lation evidence and the approach has been illustrated in a real example.
While this paper provides some evidence that BART may outperform propensity
score methods in the situations tested, we do not claim that it is uniformly supe-
rior or that it is the only strategy for incorporating information about the outcome
variable. We acknowledge that there are many ways of using propensity scores
that we did not test; however, our focus was on examination of methods that were
straightforward to implement and do not require complicated interplay between
the researcher’s substantive knowledge and the choice of how to implement (what
propensity score model to fit, which matching or weighting method to use, which
variables to privilege in balancing, which balance statistics to use). We hope that
this paper is a starting point for further explorations into better approaches for iden-
tifying common support, investigating the role of the outcome variable in causal
inference methods, and development of more effective ways of profiling units that
we deem to lack common causal support.
There is a connection between this work and that of others [e.g., Brookhart et al.
(2006)] who have pointed out the danger of strategies that implicitly assign greater
importance to variables that most strongly influence the treatment variable but that
may have little or no direct association with the outcome variable. In response,
some authors such as Kelcey (2011) have outlined approaches to choosing con-
founders in ways that make use of the observed association between the possible
confounders and the potential outcomes. Another option that is close in spirit to
the propensity score techniques but makes use of outcome data (at least in the
control group) would be a prognostic score approach [Hansen (2008)]. To date, there
has been no formal discussion of the use of prognostic scores for this purpose, but this
might be a useful avenue for further research (see footnote 10).
Acknowledgments. The authors would like to thank two anonymous referees
and our Associate Editor, Susan Paddock, for their helpful comments and
suggestions.
ANDERSON, J. W., JOHNSTONE, B. M. and REMLEY, D. T. (1999). Breast-feeding and cognitive
development: A meta-analysis. Am. J. Clin. Nutr. 70 525–535.
BREIMAN, L. (2001). Random forests. Machine Learning 45 5–32.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. and STONE, C. J. (1984). Classification and
Regression Trees. Wadsworth, Belmont, CA.
STÜRMER, T. (2006). Variable selection for propensity score models. Am. J. Epidemiol. 163
of the National Longitudinal Survey of Youth: A unique research opportunity. Developmental
Psychology 27 918–931.
CHIPMAN, H., GEORGE, E. and MCCULLOCH, R. (2007). Bayesian ensemble learning. In
Advances in Neural Information Processing Systems 19 (B. Schölkopf, J. Platt and T. Hoffman,
eds.). MIT Press, Cambridge, MA.
CHIPMAN, H. A., GEORGE, E. I. and MCCULLOCH, R. E. (2010). BART: Bayesian additive
regression trees. Ann. Appl. Stat. 4 266–298.
CHIPMAN, H. and MCCULLOCH, R. (2009). BayesTree: Bayesian methods for tree based models.
R package version 0.3-1.
CRUMP, R. K., HOTZ, V. J., IMBENS, G. W. and MITNIK, O. A. (2009). Dealing with limited
overlap in estimation of average treatment effects. Biometrika 96 187–199. MR2482144
DEHEJIA, R. H. and WAHBA, S. (1999). Causal effects in nonexperimental studies: Reevaluating
the evaluation of training programs. J. Amer. Statist. Assoc. 94 1053–1062.
DER, G., BATTY, G. D. and DEARY, I. J. (2006). Effect of breast feeding on intelligence in children:
Prospective study, sibling pairs analysis, and meta-analysis. British Medical Journal 333 945–
DRANE, D. L. and LOGEMANN, J. A. (2000). A critical evaluation of the evidence on the association
between type of infant feeding and cognitive development. Paediatr. Perinat. Epidemiol. 14 349–
FROLICH, M. (2004). Finite-sample properties of propensity-score matching and weighting
estimators. The Review of Economics and Statistics 86 77–90.
GREEN, D. P. and KERN, H. L. (2012). Modeling heterogeneous treatment effects in survey
experiments with Bayesian additive regression trees. Public Opinion Quarterly 76 491–511.
HANSEN, B. B. (2008). The prognostic analogue of the propensity score. Biometrika 95 481–488.
HASTIE, T. (2009). gam: Generalized additive models. R package version 1.01.
10 Thanks to an anonymous referee for pointing out this connection.
HECKMAN, J. J., ICHIMURA, H. and TODD, P. (1997). Matching as an econometric evaluation
estimator: Evidence from a job training programme. Rev. Econom. Stud. 64 605–654.
HILL, J. L. (2011). Bayesian nonparametric modeling for causal inference. J. Comput. Graph.
Statist. 20 217–240. MR2816546
HILL, J. L., WEISS, C. and ZHAI, F. (2011). Challenges with propensity score strategies in a
high-dimensional setting and a potential alternative. Multivariate Behavioral Research 46 477–513.
HO, D. E., IMAI, K., KING, G. and STUART, E. A. (2011). MatchIt: Nonparametric preprocessing
for parametric causal inference. Journal of Statistical Software 42 1–28.
IMBENS, G. (2004). Nonparametric estimation of average treatment effects under exogeneity:
A review. The Review of Economics and Statistics 86 4–29.
JAIN, A., CONCATO, J. and LEVENTHAL, J. M. (2002). How good is the evidence linking
breastfeeding and intelligence? Pediatrics 109 1044–1053.
KELCEY, B. (2011). Covariate selection in propensity scores using outcome proxies. Multivariate
Behavioral Research 46 453–476.
KERN, H. L., STUART, E. A., HILL, J. L. and GREEN, D. P. (2013). Assessing methods for
generalizing experimental impact estimates to target samples. Technical report, Univ. South Carolina,
Columbia, SC.
KOZLOVA, L. and SHAPIRO, S. (2008). Breastfeeding and child cognitive development: New
evidence from a large randomized trial. Archives of General Psychiatry 65 578–584.
ROBINS, J. M. (2006). Results of multivariable logistic regression, propensity matching, propen-
sity adjustment, and propensity-based weighting under conditions of non-uniform effect. Ameri-
can Journal of Epidemiology 163 262–270.
BOR, W. (2006). Early life predictors of childhood intelligence: Findings from the Mater-
University study of pregnancy and its outcomes. Paediatric and Perinatal Epidemiology 20 148–
LEUVEN, E. and SIANESI, B. (2011). PSMATCH2: Stata module to perform full Mahalanobis and
propensity score matching, common support graphing, and covariate imbalance testing. Boston
College Dept. Economics, Boston, MA.
LUNDQVIST-PERSSON, C., LAU, G., NORDIN, P. et al. (2010). Early behaviour and development
in breastfed premature infants are influenced by omega-6 and omega 3-fatty acids. Early Human
Development 86 407–412.
MCCAFFREY, D. F., RIDGEWAY, G. and MORRAL, A. R. (2004). Propensity score estimation with
boosted regression for evaluating causal effects in observational studies. Psychol. Methods 9 403–
MORGAN, S. L. and HARDING, D. J. (2006). Matching estimators of causal effects: Prospects and
pitfalls in theory and practice. Sociol. Methods Res. 35 3–60. MR2247150
association between duration of breastfeeding and adult intelligence. Journal of the American
Medical Association 287 2365–2371.
ORE TEAM (2012). R: A Language and Environment for Statistical Computing. Vienna, Austria.
ISBN 3-900051-07-0.
RIDGEWAY, G. (2007). gbm: Generalized boosted regression models. R package version 1.6-3.
twang: Toolkit for weighting and analysis of nonequivalent groups. R package version 1.2-5.
Available at
ROSENBAUM, P. R. (1984). The consequences of adjustment for a concomitant variable that has
been affected by the treatment. J. Roy. Statist. Soc. Ser. A 147 656–666.
ROSENBAUM, P. R. and RUBIN, D. B. (1983). The central role of the propensity score in
observational studies for causal effects. Biometrika 70 41–55. MR0742974
RUBIN, D. B. (2002). Using propensity scores to help design observational studies: Application to
the tobacco litigation. Health Services & Outcomes Research Methodology 2 169–188.
WOO, M.-J., REITER, J. P. and KARR, A. F. (2008). Estimation of propensity scores using
generalized additive models. Stat. Med. 27 3805–3816. MR2526610
... Furthermore, the methods applied here make explicit use of the potential outcomes framework (Rubin, 1987(Rubin, , 2005 and offer principled ways to assess whether there is enough empirical support to estimate counterfactual states (and to, thus, make causal claims) at different levels of attendance and for each participant. We draw from the common support approaches developed by Hill and Su (2013) to assess the availability and quality of empirical counterfactuals in our sample. In addition, we use the Average Dosage-Response Function framework for causal inference (Galagate, 2016) to account for differences in the causal relationship between attendance and the outcome across participants. ...
... values can be assumed to have comparable distributions in their potential outcomes and, thus, provide the necessary information to estimate reasonable counterfactuals for causal inference. A second assumption, also known as common support (Hill & Su, 2013), essentially says that we have enough empirical counterfactuals between exposure groups (that share similar combinations of pretreatment covariates ), to make reasonable predictions about the potential outcomes for all the students that are used to compute the average treatment effect. This assumption is expressed in terms of a non-zero probability of exposure for all subgroups of students that differ in their pretreatment characteristics: ...
... Additionally, it is equipped to estimate counterfactuals for each participant in the analysis sample, along with the uncertainty of each counterfactual prediction. This feature allows us to better understand the observations in the sample for which we may lack common causal support (common support with respect to the true confounders, see Hill & Su, 2013) and ...
Full-text available
This article estimates, for a sample of 1,777 Syrian refugee children, the impact on basic reading assessments of attending a remedial support program in Lebanon that was infused with social and emotional learning practices. We use flexible methods that capitalize on advantages of both machine learning and Bayesian inferential frameworks to leverage the information available in understudied contexts and help account for the problem of self-selection. Average treatment effects were estimated both using multiply imputed data and data from outcome-respondents only. We do not find conclusive evidence for an effect on one of the reading measures studied (ASER). However, we provide evidence for positive effects for three, more robust, measures of basic reading outcomes from the Arabic EGRA assessment. We discuss potential reasons for the differences in effects that are relevant for educational research and practice. We also consider the implications for future research of choices related to measurement, data collection and processing, and missing data.
... The cut-off rule we use for exclusion under the dichotomous treatment (high vs low attendance) is the one proposed in Hill and Su (2013), so that an observation would be dropped from the ATE estimation if: Common support assessment outside of the dichotomous treatment case was done by extending the idea in Hill and Su (2013) of comparing the uncertainty of the counterfactual outcome predictions against the uncertainty of outcome predictions consistent with the students' factual treatment (or in this case dosage) assignment. ...
... The cut-off rule we use for exclusion under the dichotomous treatment (high vs low attendance) is the one proposed in Hill and Su (2013), so that an observation would be dropped from the ATE estimation if: Common support assessment outside of the dichotomous treatment case was done by extending the idea in Hill and Su (2013) of comparing the uncertainty of the counterfactual outcome predictions against the uncertainty of outcome predictions consistent with the students' factual treatment (or in this case dosage) assignment. ...
... In the case of attendance, the factual dosage lives in a continuum, and it is not possible to consider only the students that have the exact factual attendance as the benchmark for the uncertainty of each estimated counterfactual value (which covers a grid of 9 attendance values) in adherence to what is proposed in Hill and Su (2013). Doing this would result in discarding most of the empirical information as most students' factual attendance lies between the grid values. ...
Full-text available
Online appendices for "The Impact of Attending a Remedial Support Program on Syrian Children’s Reading Skills: Using BART for Causal Inference" manuscript. Main manuscript text can be found in
... Notice that common support does not necessarily need to hold for all the available covariates in the covariate set X i , but just for the confounders, which might constitute a strict subset of X i . For this reason common support is sometimes referred to as common causal support (Hill & Su, 2013). Common support may hold only for a portion of the available sample as we discuss in more details in the last paragraph of this section, so that inference on treatment effects outside the guaranteed overlap region becomes unreliable. ...
... However, inspection of common support regions with more naive methods such as visual inspection might be challenging when X i is high dimensional. Some Bayesian non-parametric implementations of CATE estimation models offer a simple yet effective way of checking for common support regions, as described in Hill and Su (2013). In the simple one-covariate example of Figure 1, the propensity score takes values which are very close to either 0 or 1, but it is guaranteed to lie strictly between the two in the data generating process. ...
... As a further advantage, the direct Bayesian approach returns full predictive posterior distribution on CATE, which conveniently allows the computation of point estimates as well as credible intervals. This feature is shared also by Bayesian implementation of S-Learners (Hill, 2011) and can be usefully employed to check for causal common support, as showed by Hill and Su (2013). Meta-Learners that explicitly model CATE, such as S-and R-Learners, can naturally provide confidence intervals to accompany point estimates (Athey & Imbens, 2016;Athey & Wager, 2019), while T-Learners and their extensions (X-and Multitask-Learners), which indirectly model CATE as the difference between two separately fitted surfaces, must resort to re-sampling techniques such as jackknife or bootstrapping to produce confidence intervals (Künzel et al., 2019). ...
Full-text available
Large observational data are increasingly available in disciplines such as health, economic and social sciences, where researchers are interested in causal questions rather than prediction. In this paper, we examine the problem of estimating heterogeneous treatment effects using non‐parametric regression‐based methods, starting from an empirical study aimed at investigating the effect of participation in school meal programs on health indicators. First, we introduce the setup and the issues related to conducting causal inference with observational or non‐fully randomized data, and how these issues can be tackled with the help of statistical learning tools. Then, we review and develop a unifying taxonomy of the existing state‐of‐the‐art frameworks that allow for individual treatment effects estimation via non‐parametric regression models. After presenting a brief overview on the problem of model selection, we illustrate the performance of some of the methods on three different simulated studies. We conclude by demonstrating the use of some of the methods on an empirical analysis of the school meal program data.
... Causal inference has drawn a lot of attention across various research areas including statistics [25,2], economics and finance [7,3,15] commercial social network applications [10,5] and health care [8,12]. One of the main tasks of causal inference is to estimate the average treatment effect (ATE). ...
Estimating the average treatment effect (ATE) from observational data is challenging due to selection bias. Existing works mainly tackle this challenge in two ways. Some researchers propose constructing a score function that satisfies the orthogonal condition, which guarantees that the established ATE estimator is "orthogonal" to be more robust. The others explore representation learning models to achieve a balanced representation between the treated and the controlled groups. However, existing studies fail to 1) discriminate treated units from controlled ones in the representation space to avoid the over-balanced issue; 2) fully utilize the "orthogonality information". In this paper, we propose a moderately-balanced representation learning (MBRL) framework based on recent covariates balanced representation learning methods and orthogonal machine learning theory. This framework protects the representation from being over-balanced via multi-task learning. Simultaneously, MBRL incorporates the noise orthogonality information in the training and validation stages to achieve a better ATE estimation. The comprehensive experiments on benchmark and simulated datasets show the superiority and robustness of our method on treatment effect estimations compared with existing state-of-the-art methods.
... We were surprised that covariates balance could only be achieved for 13% of treated units. It would be an interesting question for future research to see if alternative methods such as cardinality matching or bayesian additive regression trees lead to similar results [45][46][47] . The relevant structure of the hypothetical experiment to target should also be of interest since our pair matching algorithm failed to increase the precision of estimates compared to a completely randomized assignment of the treatment. ...
Full-text available
A growing literature in economics and epidemiology has exploited changes in wind patterns as a source of exogenous variation to better measure the acute health effects of air pollution. Since the distribution of wind components is not randomly distributed over time and related to other weather parameters, multivariate regression models are used to adjust for these confounding factors. However, this type of analysis relies on its ability to correctly adjust for all confounding factors and extrapolate to units without empirical counterfactuals. As an alternative to current practices and to gauge the extent of these issues, we propose to implement a causal inference pipeline to embed this type of observational study within an hypothetical randomized experiment. We illustrate this approach using daily data from Paris, France, over the 2008–2018 period. Using the Neyman–Rubin potential outcomes framework, we first define the treatment of interest as the effect of North-East winds on particulate matter concentrations compared to the effects of other wind directions. We then implement a matching algorithm to approximate a pairwise randomized experiment. It adjusts nonparametrically for observed confounders while avoiding model extrapolation by discarding treated days without similar control days. We find that the effective sample size for which treated and control units are comparable is surprisingly small. It is however reassuring that results on the matched sample are consistent with a standard regression analysis of the initial data. We finally carry out a quantitative bias analysis to check whether our results could be altered by an unmeasured confounder: estimated effects seem robust to a relatively large hidden bias. Our causal inference pipeline is a principled approach to improve the design of air pollution studies based on wind patterns.
... See, e.g., Chen et al. (2017), Imai and Ratkovic (2013), Knaus, Lechner, and Strittmatter (2020), and Tian et al. (2014), for more detail on the LASSO and empirical applications in different substantive domains. 60 See, e.g., Wager and Athey (2018), Davis and Heller (2017), Foster, Taylor, and Ruberg (2011), Green and Kern (2012), Hill (2011), and Hill and Su (2013). splitting, and its associated variance component, into their variance estimator and inference procedures. ...
We document substantial variation in the effects of a highly effective literacy program in northern Uganda. The program increases test scores by 1.4 SDs on average, but standard statistical bounds show that the impact standard deviation exceeds 1.0 SD. This implies that the variation in effects across our students is wider than the spread of mean effects across all randomized evaluations of developing country education interventions in the literature. This very effective program does indeed leave some students behind. At the same time, we do not learn much from our analyses that attempt to determine which students benefit more or less from the program. We reject rank preservation, and the weaker assumption of stochastic increasingness leaves wide bounds on quantile-specific average treatment effects. Neither conventional nor machine-learning approaches to estimating systematic heterogeneity capture more than a small fraction of the variation in impacts given our available candidate moderators.
Many practical decision-making problems in economics and healthcare seek to estimate the average treatment effect (ATE) from observational data. Double/Debiased Machine Learning (DML) is one of the prevalent methods for estimating the ATE in observational studies. However, DML estimators can suffer from an error-compounding issue and even give extreme estimates when the propensity scores are misspecified or very close to 0 or 1. Previous studies have overcome this issue through empirical tricks such as propensity score trimming, yet none of the existing literature solves this problem from a theoretical standpoint. In this paper, we propose a Robust Causal Learning (RCL) method to offset the deficiencies of the DML estimators. Theoretically, the RCL estimators i) are as consistent and doubly robust as the DML estimators, and ii) avoid the error-compounding issue. Empirically, comprehensive experiments show that i) the RCL estimators give more stable estimates of the causal parameters than the DML estimators, and ii) the RCL estimators outperform the traditional estimators and their variants when applying different machine learning models on both simulation and benchmark datasets.
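Why propensity scores near 0 or 1 produce extreme estimates, and what trimming does about it, can be seen in a small sketch. The code below is neither DML nor the RCL method from the abstract above. It is a plain Horvitz-Thompson inverse-propensity-weighted estimator, and the function name `ipw_ate`, the toy data, and the trimming threshold of 0.1 are assumptions made for illustration.

```python
def ipw_ate(y, t, e, trim=None):
    """Horvitz-Thompson inverse-propensity-weighted estimate of the ATE.

    y: outcomes; t: 0/1 treatment indicators;
    e: estimated propensity scores P(T=1 | X), assumed strictly inside (0, 1).
    trim: if given, keep only units with trim <= e <= 1 - trim.
    """
    rows = list(zip(y, t, e))
    if trim is not None:
        rows = [(yi, ti, ei) for yi, ti, ei in rows if trim <= ei <= 1 - trim]
    if not rows:
        raise ValueError("no units left after trimming")
    # A treated unit contributes y/e, a control unit contributes -y/(1-e).
    terms = [ti * yi / ei - (1 - ti) * yi / (1 - ei) for yi, ti, ei in rows]
    return sum(terms) / len(terms)


# Hypothetical data: treated outcomes are 3, control outcomes are 1.
# The last treated unit has a propensity score of 0.001, so its weight is 1000.
y = [3.0, 1.0, 3.0, 1.0, 3.0]
t = [1, 0, 1, 0, 1]
e = [0.5, 0.5, 0.5, 0.5, 0.001]

untrimmed = ipw_ate(y, t, e)          # dominated by the single extreme weight
trimmed = ipw_ate(y, t, e, trim=0.1)  # drops the unit with e = 0.001
print(untrimmed, trimmed)
```

Without trimming, one unit with weight 1/0.001 dominates the average; with trimming, the estimate is based only on units with moderate propensity scores. Note that trimming changes the population the estimate refers to, since it now describes only the units that were retained.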
When drawing causal inferences about the effects of multiple treatments on clustered survival outcomes using observational data, we need to address implications of the multilevel data structure, multiple treatments, censoring, and unmeasured confounding for causal analyses. Few off‐the‐shelf causal inference tools are available to simultaneously tackle these issues. We develop a flexible random‐intercept accelerated failure time model, in which we use Bayesian additive regression trees to capture arbitrarily complex relationships between censored survival times and pre‐treatment covariates and use the random intercepts to capture cluster‐specific main effects. We develop an efficient Markov chain Monte Carlo algorithm to draw posterior inferences about the population survival effects of multiple treatments and examine the variability in cluster‐level effects. We further propose an interpretable sensitivity analysis approach to evaluate the sensitivity of drawn causal inferences about treatment effect to the potential magnitude of departure from the causal assumption of no unmeasured confounding. Expansive simulations empirically validate and demonstrate good practical operating characteristics of our proposed methods. Applying the proposed methods to a dataset on older high‐risk localized prostate cancer patients drawn from the National Cancer Database, we evaluate the comparative effects of three treatment approaches on patient survival, and assess the ramifications of potential unmeasured confounding. The methods developed in this work are readily available in the R package riAFTBART.
It is crucial in clinical trials to investigate treatment effect consistency across subgroups defined by patient baseline characteristics. However, there may be treatment effect variability across subgroups due to small subgroup sample size. Various Bayesian models have been proposed to incorporate this variability when borrowing information across subgroups. These models rely on the underlying assumption that patients with similar characteristics will have similar outcomes to the same treatment. Patient populations within each subgroup must subjectively be deemed similar enough (Pocock, 1976) to borrow response information across subgroups. We propose utilizing the machine learning method of Bayesian Additive Regression Trees (BART) to provide a method for subgroup borrowing that does not rely on an underlying assumption of homogeneity between subgroups. BART is a data-driven approach that utilizes patient-level observations. The amount of borrowing between subgroups automatically adjusts as BART learns the covariate-response relationships. Modeling patient-level data rather than treating the subgroup as a single unit minimizes assumptions regarding homogeneity across subgroups. We illustrate the use of BART in this context by comparing its performance with existing subgroup borrowing methods in a simulation study and a case study in non-small cell lung cancer. The application of BART in the context of subgroup analyses alleviates the need to subjectively choose how much information to borrow based on subgroup similarity. Having the amount of borrowing be analytically determined and controlled for based on the similarity of individual patient-level characteristics allows for more objective decision making in the drug development process, with many other applications including basket trials.
Survey experimenters routinely test for systematically varying treatment effects by using interaction terms between the treatment indicator and covariates. Parametric models, such as linear or logistic regression, are currently used to search for systematic treatment effect heterogeneity but suffer from several shortcomings; in particular, the potential for bias due to model misspecification and the large amount of discretion they introduce into the analysis of experimental data. Here, we explicate what we believe to be a better approach. Drawing on the statistical learning literature, we discuss Bayesian Additive Regression Trees (BART), a method for analyzing treatment effect heterogeneity. BART automates the detection of nonlinear relationships and interactions, thereby reducing researchers' discretion when analyzing experimental data. These features make BART an appealing "off-the-shelf" tool for survey experimenters who want to model systematic treatment effect heterogeneity in a flexible and robust manner. In order to illustrate how BART can be used to detect and model heterogeneous treatment effects, we reanalyze a well-known survey experiment on welfare attitudes from the General Social Survey.
The data set known as Children of the National Longitudinal Survey of Youth (NLSY) offers unusual opportunities for research on questions not easily pursued by developmental psychologists. This article provides a history of Children of the NLSY, describes the data set with special focus on the child outcome measures and a subset of maternal life history measures, highlights several of the research and policy relevant issues that may be addressed, and shows how the intersection of children's and mothers' lives may be studied in less static, more life-course oriented ways. Exemplars are given in the topics of maternal employment and child care, adolescent pregnancy and child rearing, divorce, poverty, and multigenerational parenting.
Randomized experiments are considered the gold standard for causal inference because they can provide unbiased estimates of treatment effects for the experimental participants. However, researchers and policymakers are often interested in using a specific experiment to inform decisions about other target populations. In education research, increasing attention is being paid to the potential lack of generalizability of randomized experiments because the experimental participants may be unrepresentative of the target population of interest. This article examines whether generalization may be assisted by statistical methods that adjust for observed differences between the experimental participants and members of a target population. The methods examined include approaches that reweight the experimental data so that participants more closely resemble the target population and methods that utilize models of the outcome. Two simulation studies and one empirical analysis investigate and compare the methods’ performance. One simulation uses purely simulated data while the other utilizes data from an evaluation of a school-based dropout prevention program. Our simulations suggest that machine learning methods outperform regression-based methods when the required structural (ignorability) assumptions are satisfied. When these assumptions are violated, all of the methods examined perform poorly. Our empirical analysis uses data from a multisite experiment to assess how well results from a given site predict impacts in other sites. Using a variety of extrapolation methods, predicted effects for each site are compared to actual benchmarks. Flexible modeling approaches perform best, although linear regression is not far behind. Taken together, these results suggest that flexible modeling techniques can aid generalization while underscoring the fact that even state-of-the-art statistical techniques still rely on strong assumptions.
Context: A number of studies suggest a positive association between breastfeeding and cognitive development in early and middle childhood. However, the only previous study that investigated the relationship between breastfeeding and intelligence in adults had several methodological shortcomings. Objective: To determine the association between duration of infant breastfeeding and intelligence in young adulthood. Design, Setting, and Participants: Prospective longitudinal birth cohort study conducted in a sample of 973 men and women and a sample of 2280 men, all of whom were born in Copenhagen, Denmark, between October 1959 and December 1961. The samples were divided into 5 categories based on duration of breastfeeding, as assessed by physician interview with mothers at a 1-year examination. Main Outcome Measures: Intelligence, assessed using the Wechsler Adult Intelligence Scale (WAIS) at a mean age of 27.2 years in the mixed-sex sample and the Børge Priens Prøve (BPP) test at a mean age of 18.7 years in the all-male sample. Thirteen potential confounders were included as covariates: parental social status and education; single mother status; mother's height, age, and weight gain during pregnancy and cigarette consumption during the third trimester; number of pregnancies; estimated gestational age; birth weight; birth length; and indexes of pregnancy and delivery complications. Results: Duration of breastfeeding was associated with significantly higher scores on the Verbal, Performance, and Full Scale WAIS IQs. With regression adjustment for potential confounding factors, the mean Full Scale WAIS IQs were 99.4, 101.7, 102.3, 106.0, and 104.0 for breastfeeding durations of less than 1 month, 2 to 3 months, 4 to 6 months, 7 to 9 months, and more than 9 months, respectively (P = .003 for overall F test). The corresponding mean scores on the BPP were 38.0, 39.2, 39.9, 40.1, and 40.1 (P = .01 for overall F test). Conclusion: Independent of a wide range of possible confounding factors, a significant positive association between duration of breastfeeding and intelligence was observed in 2 independent samples of young adults, assessed with 2 different intelligence tests.
Adjustments for bias in observational studies are not always confined to variables that were measured prior to treatment. Estimators that adjust for a concomitant variable that has been affected by the treatment are generally biased. The bias may be written as the sum of two easily interpreted components: one component is present only in observational studies; the other is common to both observational studies and randomized experiments. The first component of bias will be zero when the affected posttreatment concomitant variable is, in a certain sense, a surrogate for an unobserved pretreatment variable. The second component of bias can often be addressed by an appropriate sensitivity analysis.