Synthetic Negative Controls: Using Simulation to Screen Large-scale Propensity Score Analyses

Richard Wyss,a Sebastian Schneeweiss,a Kueiyu Joshua Lin,a,b David P. Miller,c Linda Kalilani,c and Jessica M. Franklina

ISSN: 1044-3983/22/334-541
DOI: 10.1097/EDE.0000000000001482

Submitted November 4, 2020; accepted March 15, 2022.
From the aDivision of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA; bDivision of General Internal Medicine, Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA; and cUCB Pharma, Raleigh, NC.
This project was funded by an unrestricted grant from UCB Pharma. R.W. received additional funding from the FDA Sentinel Innovation Center and NIH R01LM013204. K.J.L. was supported by NIH R01LM013204. J.M.F. was supported by NIH R01HL141505.
Disclosure: S.S. is participating in investigator-initiated grants to the Brigham and Women's Hospital from Bayer, Vertex, and Boehringer Ingelheim unrelated to the topic of this study. He is a consultant to Aetion Inc., a software manufacturer of which he owns equity. His interests were declared, reviewed, and approved by the Brigham and Women's Hospital and Partners HealthCare System in accordance with their institutional compliance policies. D.P.M. and L.K. were employees of UCB Pharma at the time this analysis was conducted. The other authors have no conflicts to report.
The dataset for the empirical study is not available for public use due to data use agreements. Code for the simulation and empirical analyses is provided in the eAppendix; http://links.lww.com/EDE/B911.
Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com).
Correspondence: Richard Wyss, PhD, Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women's Hospital and Harvard Medical School, 1620 Tremont St, Suite 3030, Boston, MA 02120. E-mail: rwyss@bwh.harvard.edu.
Copyright © 2022 Wolters Kluwer Health, Inc. All rights reserved.
Abstract: The propensity score has become a standard tool to control for large numbers of variables in healthcare database studies. However, little has been written on the challenge of comparing large-scale propensity score analyses that use different methods for confounder selection and adjustment. In these settings, balance diagnostics are useful but do not inform researchers on which variables balance should be assessed or quantify the impact of residual covariate imbalance on bias. Here, we propose a framework to supplement balance diagnostics when comparing large-scale propensity score analyses. Instead of focusing on results from any single analysis, we suggest conducting and reporting results for many analytic choices and using both balance diagnostics and synthetically generated control studies to screen analyses that show signals of bias caused by measured confounding. To generate synthetic datasets, the framework does not require simulating the outcome-generating process. In healthcare database studies, outcome events are often rare, making it difficult to identify and model all predictors of the outcome to simulate a confounding structure closely resembling the given study. Therefore, the framework uses a model for treatment assignment to divide the comparator population into pseudo-treatment groups where covariate differences resemble those in the study cohort. The partially simulated datasets have a confounding structure approximating the study population under the null (synthetic negative control studies). The framework is used to screen analyses that likely violate partial exchangeability due to lack of control for measured confounding. We illustrate the framework using simulations and an empirical example.

Keywords: Causal inference; Confounding; Control studies; Diagnostic; Propensity score

(Epidemiology 2022;33:541–550)
In nonexperimental studies that utilize administrative healthcare databases, it is often necessary to control for large numbers of confounding variables to estimate valid treatment effects.1,2 In these settings, methods that collapse the information of a large set of covariates into a single value or summary score and then use this summary measure for confounding control have become increasingly popular. The propensity score, which summarizes covariate associations with treatment assignment, has been the most widely used summary measure and has become a standard tool for confounding control in healthcare database studies.3–5

In the context of fitting propensity score models, both theory and simulations have shown that the optimal adjustment set for reducing both bias and variability in effect estimates includes variables affecting both treatment and outcome (confounders) as well as variables that affect the outcome but are unrelated to treatment (risk factors).6,7 A large literature has shown that adjusting for risk factors improves the precision of effect estimates without affecting bias, whereas adjusting for variables that are associated with treatment but conditionally independent of the outcome except through treatment (instrumental variables) harms precision and can increase bias in the presence of unmeasured confounding.8–13 For a formal discussion on principles for confounder selection, see VanderWeele.7

To improve large-scale propensity score analyses in healthcare databases, several papers have proposed using data-adaptive algorithms to help identify confounding factors
and exclude variables that harm the properties of effect estimates (e.g., instrumental variables).14–25 In healthcare database studies, however, outcome events are often rare, and it can be difficult to accurately characterize the joint correlation structure between a high-dimensional set of variables and the outcome to distinguish between instruments, confounders, and other variable types. Various approaches rely on tuning parameters and assumptions that hold in different cases, and it can be unclear if a given approach adequately adjusts for all measured confounders. Therefore, to improve robustness of large-scale propensity score analyses, a separate literature has argued that emphasis should be placed on capturing the maximum amount of measured confounder information in the data by balancing all covariates across treatment groups at the expense of adjusting for instrumental variables and other nonconfounders.26–29
An alternative to these two viewpoints is to conduct and
report results for many analyses to determine whether results,
as a whole, are consistent with the underlying hypothesis.30–32
When the optimal analytic approach is unclear, studies have
argued that reporting results from all analyses that are consid-
ered reasonable alternatives can help to improve robustness
and address concerns of transparency and reproducibility in
nonexperimental studies.30–32 When alternative analyses do
not yield consistent results, however, researchers can dis-
agree on which analytic choices are reasonable alternatives.
Diagnostics can help investigators identify and focus on results
from more credible analyses, but it remains unclear how to
objectively evaluate and compare the performance of alter-
native large-scale propensity score analyses that use different
methods for confounder selection and adjustment. Although
balance diagnostics can be useful for evaluating propensity
score analyses, balance diagnostics do not inform researchers
on which variables balance should be assessed or quantify the
impact of residual covariate imbalance on bias.33,34
In this work, we propose a framework to supplement
balance diagnostics when comparing large-scale propensity
score analyses in healthcare database studies. Instead of focus-
ing on results from any single analysis, we suggest conducting
and reporting results for many analytic choices and screen-
ing analyses that show signals of bias caused by measured
confounding. We propose a screening method that builds on
previous work that has used synthetically generated control
studies as a benchmark to compare alternative causal infer-
ence approaches.35–39 Here, we extend a simulation framework
for generating synthetic negative controls to screen propensity
score analyses when the optimal set of variables for adjust-
ment is high-dimensional and not known a priori.
METHODS
Underlying Assumptions
Following Neyman et al40 and Rubin,41 we define the effect of a time-fixed binary treatment, A, in terms of potential outcomes Y^{A=1} and Y^{A=0}. In practice, only one of these potential outcomes is observed for each individual. Let the random variable Y represent the observed outcome corresponding with either Y^{A=1} or Y^{A=0} depending on whether the individual received (A = 1) or did not receive (A = 0) treatment. Furthermore, let X represent the full set of pretreatment covariates, P(Y, A, X) the data generating distribution for the full study population, and P(Y^{A=0}, A, X) the data generating distribution for the counterfactual population under the control, or comparator, treatment. Finally, let ψ(X) = E(A | X) represent the conditional probability of treatment given X (i.e., the propensity score).

We assume that treatment assignment is conditionally exchangeable given X, written as (Y^{A=1}, Y^{A=0}) ⊥ A | X, where ⊥ denotes independence of random variables. Conditional exchangeability implies no unmeasured confounding or selection bias. Conditional exchangeability given X implies conditional exchangeability given ψ(X).5 Conditional exchangeability also implies partial exchangeability, written as Y^{A=0} ⊥ A | X.

Partial exchangeability is a weaker condition than full conditional exchangeability.42,43 If an adjustment set fails to satisfy partial exchangeability, then the same adjustment set would also fail to satisfy full conditional exchangeability.42,43 We further assume consistency and positivity, which are also necessary conditions for identification of average treatment effects.44,45 A formal discussion on necessary assumptions for causal inference is provided elsewhere.46
A Framework for Screening Large-scale
Propensity Score Analyses
The use of synthetically generated datasets—where
treatment–outcome associations are known by design and sim-
ulated patterns of confounding approximate the observed data
structure—has become increasingly popular to help tailor ana-
lytic choices for causal inference.27,28,35,36,39,47–52 Frameworks
for generating synthetic datasets have largely been based on
approaches that combine real data from the given study with
simulated features. The basic concept of these approaches
is to take the observed data structure and use modeled rela-
tionships from the original data to simulate outcome status
while leaving both treatment assignment and baseline covari-
ates unchanged or to simulate both treatment and outcome
while leaving only baseline covariates unchanged.35,36,39,48,50
In healthcare database studies, however, outcome events are
often rare. This can make it difficult to identify and model all
predictors of the outcome to simulate synthetic datasets that
closely approximate patterns of confounding for large num-
bers of variables.
Here, we propose a framework that does not require use
of an outcome model to generate synthetic control datasets.
The framework builds on a technique that was originally pro-
posed by Hansen37,51 and Huber et al.38 The framework uses
a model for treatment assignment to divide the comparator
group into pseudo-treated and pseudo-comparator groups to
resemble differences between the actual treated and comparator populations. The synthetic datasets are used to evaluate analyses in their ability to retrieve unbiased null effect estimates. The purpose of the framework is to supplement balance diagnostics to screen large-scale propensity score analyses that likely violate partial exchangeability due to lack of control for measured confounding (discussed further in Comments on the Screening Framework). We outline the approach in Table 1. Details and example code are provided in the eAppendix; http://links.lww.com/EDE/B911.
Comments on the Screening Framework
In healthcare database studies, it is often easier to iden-
tify and model all predictors of treatment than predictors of
the outcome. The framework outlined in Table 1 is intended for such settings and begins with a propensity score model that includes all predictors of treatment. When building this model, investigators are not faced with the difficult task of identifying the optimal propensity score model for causal inference. Instead, the objective is to capture the maximum amount of measured confounder information in the data at the risk of including instrumental variables that can harm the properties of effect estimates. This is Step 3 outlined in Table 1, and the framework hinges on the assumption that this model is well specified. This model is then used to separate the control group on all predictors of treatment to form pseudo-treated and pseudo-control groups to evaluate analyses in their ability to retrieve unbiased pseudo-effect estimates. Analyses whose confidence limits (Step 7) do not contain the true null value are screened as unlikely to adequately control for measured confounder information that is captured in the propensity score model developed in Step 3.
By only using the control group when generating synthetic datasets, the framework does not attempt to simulate the outcome to approximate the full data distribution. Instead, the framework simulates a pseudo-treatment within the control group to approximate the simpler distribution P(Y^{A=0}, A, X). Because the synthetic datasets approximate P(Y^{A=0}, A, X) rather than the full data structure, the synthetic datasets are only used to screen analyses that show signals of bias rather than trying to select the best analytic approach for the study at hand. If estimators show clear signs of bias when applied within the synthetic datasets, this raises concerns about the ability of those analyses to satisfy partial exchangeability, a necessary condition for identification of average treatment effects. Analyses that are not screened are not necessarily accepted as valid; we simply determine that there is not enough evidence to reject them as being biased due to measured confounding (discussed further in the Discussion). The final set of results can include estimates from propensity score models that balance all variables but should also include estimates from more parsimonious models where there is insufficient evidence for rejection. These estimates are considered reasonable alternatives and are taken into consideration as a whole when interpreting results.
TABLE 1. A Framework for Screening Large-scale Propensity Score Analyses

Step 1: Apply alternative confounder selection approaches in the full study population and fit a PS model for each set of selected covariates. For each model, estimate the treatment effect after PS adjustment.
Step 2: Evaluate each analysis using traditional PS diagnostics (e.g., covariate balance). For analyses that successfully balance the covariates within their respective model, we propose further evaluation by adjusting for the same set of features, or variables, within synthetic negative control datasets (Steps 3–7).
Step 3: Fit a propensity score model in the full study population. The goal when fitting this model is to identify and model all predictors of treatment.
Step 4: Set the treated population aside and keep only the control group, where the potential outcome under the control treatment, Y^{A=0}, is observed.
Step 5: Divide the control group into pseudo-treatment and pseudo-control groups (Steps 5a through 5d):
  5a: Use the model in Step 3 to assign each individual a treatment probability given observed covariates.
  5b: Shift the assigned probabilities in Step 5a by a constant value so that the expected value of the assigned probabilities is equal to the proportion of treated individuals in the full cohort. Shifting is done in a way that maintains a proportionality relationship between the odds of selection for pseudo-treatment and the odds of treatment selection in the full cohort. Details and example code are provided in eAppendix 1; http://links.lww.com/EDE/B911. A minimal illustrative sketch also follows this table.
  5c: Using the assigned probabilities from Step 5b, perform an independent Bernoulli trial for each individual to determine their pseudo-treatment status.
  5d: Repeat Step 5c K times to generate K pseudo-populations.
Step 6: Estimate the PS and pseudo-treatment effect within each of the K pseudo-populations using the same set of covariates and adjustment method that was used for each analysis in the full population. For each analysis, take the mean of the calculated pseudo-effect estimates across the K pseudo-populations as the estimate of the pseudo-bias for that analysis.
Step 7: For each analysis, estimate the standard error for the pseudo-bias calculated in Step 6. This is done by repeating Steps 5 and 6 within bootstrapped samples of the control group. Use the bootstrapped standard error to calculate confidence limits. Analyses whose confidence limits do not contain the true null value are screened.

PS indicates propensity score.
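To make Steps 3 through 5 concrete, the lines below give a minimal R sketch under simplifying assumptions. The toy cohort and all object names (full_cohort, predictors, controls, p_star, pseudo_data) are hypothetical, and the Step 5b shift is implemented here as a multiplicative shift of the odds (an additive shift on the logit scale) found with uniroot; this is one reading of the description above, not the authors' code, which is in eAppendix 1.

  # Toy stand-in cohort (hypothetical); in practice this is the full study cohort
  set.seed(42)
  n  <- 5000
  X1 <- rbinom(n, 1, 0.4); X2 <- rbinom(n, 1, 0.3); X3 <- rbinom(n, 1, 0.5)
  A  <- rbinom(n, 1, plogis(-1.5 + 0.8 * X1 + 0.5 * X2))
  full_cohort <- data.frame(A, X1, X2, X3)
  predictors  <- c("X1", "X2", "X3")

  # Step 3: treatment model in the full cohort, aiming to capture all predictors of treatment
  ps_fit <- glm(reformulate(predictors, "A"), data = full_cohort, family = binomial())

  # Step 4: keep only the comparator (control) group, where Y under control is observed
  controls <- subset(full_cohort, A == 0)

  # Step 5a: assign each control a treatment probability from the Step 3 model
  p_raw <- predict(ps_fit, newdata = controls, type = "response")

  # Step 5b: shift probabilities so their mean equals the treated fraction of the full
  # cohort, keeping the odds of pseudo-treatment proportional to the estimated odds
  shift_probs <- function(ps, target_mean) {
    gap <- function(log_c) {
      odds <- exp(log_c) * ps / (1 - ps)
      mean(odds / (1 + odds)) - target_mean
    }
    log_c <- uniroot(gap, interval = c(-20, 20))$root
    odds  <- exp(log_c) * ps / (1 - ps)
    odds / (1 + odds)
  }
  p_star <- shift_probs(p_raw, target_mean = mean(full_cohort$A))

  # Steps 5c-5d: K synthetic negative control datasets with a simulated pseudo-treatment
  K <- 20
  pseudo_data <- lapply(seq_len(K), function(k) {
    d <- controls
    d$A_pseudo <- rbinom(nrow(d), 1, p_star)
    d
  })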
The proposed framework allows investigators to objectively evaluate and compare analyses in their ability to control for measured confounding without being influenced by estimated treatment effects in the full study population. In other words, the framework maintains objectivity in study design by not allowing information on the treatment–outcome association to contribute to decisions on model selection (discussed further in the Discussion).53
Finally, it is important to note that the synthetic data-
sets have the property that the odds of selection for pseudo-
treatment are proportional to the estimated odds of treatment
in the full population. This proportionality relationship is not
designed to create a pseudo-population where covariate distri-
butions mirror those of the full cohort. Rather, it is designed to
create a pseudo-population where general patterns of covari-
ate imbalance are similar to those within the study population.
Therefore, analysts should check that differences in baseline
characteristics across pseudo-treatment groups resemble dif-
ferences between treatment groups in the full study cohort. If
the propensity score model in Step 3 is well specified, there
should be such a similarity. It will of course be inexact; one is
looking for gross departures here.
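One simple way to carry out this check, continuing the hypothetical objects from the sketch after Table 1, is to plot standardized differences under the actual treatment against those under a pseudo-treatment draw; the formula below is the usual one for binary covariates.

  # Standardized difference of a binary covariate between two groups
  std_diff <- function(x, g) {
    p1 <- mean(x[g == 1]); p0 <- mean(x[g == 0])
    (p1 - p0) / sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
  }
  sd_actual <- sapply(predictors, function(v) std_diff(full_cohort[[v]], full_cohort$A))
  sd_pseudo <- sapply(predictors, function(v)
    std_diff(pseudo_data[[1]][[v]], pseudo_data[[1]]$A_pseudo))
  plot(sd_actual, sd_pseudo,
       xlab = "Standardized difference, actual treatment (full cohort)",
       ylab = "Standardized difference, pseudo-treatment (synthetic dataset)")
  abline(0, 1, lty = 2)   # points far from this line would signal a gross departure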
Simulation Study
We simulated datasets with a sample size of 10,000, a
binary treatment, a binary outcome, and 100 binary baseline
variables that consisted of confounders, instrumental vari-
ables, risk factors, and spurious variables (not associated with
either treatment or outcome). Although healthcare database
studies will often involve much larger sample sizes with thou-
sands of baseline variables, the parameters considered here
were chosen to reduce computation time when running the
framework on thousands of simulated datasets. Details and R
code for the simulations are provided in the eAppendix; http://
links.lww.com/EDE/B911.
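As a rough illustration of this setup, the sketch below generates one dataset of the same general form. The variable roles follow scenario 1 of Table 2 (40% confounders, 10% instruments, 10% risk factors, 40% spurious), but the prevalences, coefficients, and the null treatment effect are arbitrary choices for illustration and are not the values used in the paper; the actual generating code is in the eAppendix.

  # Illustrative data-generating sketch (not the paper's generating code)
  set.seed(1)
  n <- 10000; p <- 100
  X <- matrix(rbinom(n * p, 1, 0.3), nrow = n)      # 100 binary baseline covariates
  conf  <- 1:40                                     # confounders: affect treatment and outcome
  instr <- 41:50                                    # instruments: affect treatment only
  risk  <- 51:60                                    # risk factors: affect outcome only
  # columns 61-100 are spurious (affect neither treatment nor outcome)
  lin_A <- -3 + 0.15 * rowSums(X[, conf]) + 0.40 * rowSums(X[, instr])
  A <- rbinom(n, 1, plogis(lin_A))                  # binary treatment
  lin_Y <- -5 + 0.15 * rowSums(X[, conf]) + 0.30 * rowSums(X[, risk])   # null treatment effect
  Y <- rbinom(n, 1, plogis(lin_Y))                  # binary outcome
  sim_data   <- data.frame(A = A, Y = Y, X)         # columns X1 ... X100
  predictors <- paste0("X", 1:p)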
We varied selected parameters described in Table 2 to consider six scenarios. These parameters were specifically chosen to illustrate strengths and limitations of the proposed framework. For each scenario, we simulated 1000 datasets. For each dataset, we applied 10 different variable selection approaches (Table 3). Selected variables were included in a main effects logistic regression propensity score model. We adjusted for propensity scores using the following weighting methods,54–56 where A denotes treatment and ps the estimated propensity score:

Stabilized inverse probability of treatment weights (IPTW): w = A · P(A = 1)/ps + (1 − A) · P(A = 0)/(1 − ps)

Standardized mortality ratio weights (SMRW): w = A + (1 − A) · ps/(1 − ps)

Overlap weights: w = A · (1 − ps) + (1 − A) · ps

Matching weights: w = A · min(ps, 1 − ps)/ps + (1 − A) · min(ps, 1 − ps)/(1 − ps)
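For concreteness, the lines below translate these formulas directly into R, continuing with the illustrative sim_data and predictors objects from the simulation sketch above (hypothetical names). The overlap-weighted risk difference at the end is just one example of how such weights might be applied; in practice, each analysis in Table 3 would supply its own covariate set to the propensity score model before forming the weights.

  # Estimate the propensity score with a main-effects logistic model, then form weights
  ps_model <- glm(reformulate(predictors, "A"), data = sim_data, family = binomial())
  ps <- fitted(ps_model)
  A  <- sim_data$A
  Y  <- sim_data$Y
  pA <- mean(A)                                              # marginal P(A = 1)
  w_iptw <- A * pA / ps + (1 - A) * (1 - pA) / (1 - ps)      # stabilized IPTW
  w_smrw <- A + (1 - A) * ps / (1 - ps)                      # SMR weights
  w_ow   <- A * (1 - ps) + (1 - A) * ps                      # overlap weights
  w_mw   <- pmin(ps, 1 - ps) * (A / ps + (1 - A) / (1 - ps)) # matching weights
  # Example: overlap-weighted risk difference
  rd_ow <- weighted.mean(Y[A == 1], w_ow[A == 1]) - weighted.mean(Y[A == 0], w_ow[A == 0])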
For each analysis, we calculated the mean bias in the estimated treatment effect, the mean pseudo-bias, and the pseudo-coverage, defined as the percentage of the 1000 analyses whose confidence limits for the pseudo-effect estimate (Step 7 in Table 1) contained the true null value. The pseudo-coverage is equivalent to 1 minus the proportion of analyses screened across all 1000 simulated datasets.
For the treatment prediction model used to generate the synthetic datasets (Step 3 in Table 1), we used a Lasso regression model to identify predictors of treatment and then included these variables in a logistic regression model to predict treatment assignment (propensity score model 9 in Table 3). We then estimated the pseudo-bias for each analysis using the approach in Table 1 and calculated the 95% confidence interval using the bootstrapped standard error (Step 7 in Table 1). However, running 1000 bootstraps for each of the 1000 simulated datasets was not computationally feasible. Therefore, for each scenario, we only calculated the bootstrapped standard error for 10 of the simulated datasets and used the mean of these 10 bootstrapped standard errors as an approximation of the bootstrapped standard error for all simulated datasets.
TABLE 2. Simulation Scenarios

Scenario   Treatment Effecta    Unmeasured Confounding   % Confoundersb   % Instrumentsb   Value of Kc
1          Null                 None                     40%              10%               1
2          Null                 None                     40%              10%              20
3          Heterogeneity 1      None                     40%              10%              20
4          Heterogeneity 2      None                     40%              10%              20
5          Null                 Yes                      40%              10%              20
6          Null                 Yes                      10%              40%              20

aScenarios 3–4: treatment effect heterogeneity on the absolute (risk difference) scale. For scenario 4, treatment modified the effect of confounders on the outcome so that all confounders were associated with the outcome only in the treatment group (i.e., partial exchangeability does not imply full exchangeability).
bAll scenarios consisted of 100 baseline covariates. Of these 100 covariates, 40% were spurious variables (not associated with either treatment or outcome) and 10% were risk factors (associated with outcome but not treatment). We then varied the percentage of variables that were simulated as confounders and instruments.
cValue of K in Step 5d of the algorithm described in Table 1.
To ensure that this approximation did not produce incorrect estimation of coverage for the pseudo-effect estimates, we repeated scenario 1 after calculating the bootstrapped standard error for each simulated dataset (i.e., for each simulated dataset, we used the standard error from 1000 bootstrapped samples to calculate confidence limits). We found that this did not meaningfully change the calculation of pseudo-coverage (eFigure 1; http://links.lww.com/EDE/B911).
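A rough sketch of the pseudo-bias and bootstrap calculations is given below, reusing shift_probs and the naming conventions from the sketch after Table 1 and the simulated cohort sim_data from above (all hypothetical). It simplifies the procedure in places, for example by reusing the shifted probabilities within bootstrap samples and by using overlap weighting for a single illustrative covariate set.

  # Rebuild the pieces of the Table 1 framework on the simulated cohort
  step3_fit <- glm(reformulate(predictors, "A"), data = sim_data, family = binomial())
  controls  <- subset(sim_data, A == 0)
  p_star    <- shift_probs(predict(step3_fit, newdata = controls, type = "response"),
                           target_mean = mean(sim_data$A))

  # Pseudo-effect for one pseudo-population: overlap-weighted risk difference of the
  # observed control-group outcomes across pseudo-treatment groups (true value = 0)
  one_pseudo_effect <- function(ctrl, covars, p) {
    ctrl$A_pseudo <- rbinom(nrow(ctrl), 1, p)                                # Step 5c
    ps <- fitted(glm(reformulate(covars, "A_pseudo"), data = ctrl, family = binomial()))
    w  <- ctrl$A_pseudo * (1 - ps) + (1 - ctrl$A_pseudo) * ps                # overlap weights
    weighted.mean(ctrl$Y[ctrl$A_pseudo == 1], w[ctrl$A_pseudo == 1]) -
      weighted.mean(ctrl$Y[ctrl$A_pseudo == 0], w[ctrl$A_pseudo == 0])
  }
  pseudo_bias <- function(ctrl, covars, p, K = 20)
    mean(replicate(K, one_pseudo_effect(ctrl, covars, p)))                   # Step 6

  # Step 7: bootstrap the control group for a standard error and a screening decision
  covars <- predictors[1:50]   # hypothetical covariate set for the analysis under review
  B <- 200                     # illustrative only; computationally heavy at realistic sizes
  boot_vals <- replicate(B, {
    idx <- sample(nrow(controls), replace = TRUE)
    pseudo_bias(controls[idx, ], covars, p_star[idx])   # shifted probabilities reused for simplicity
  })
  est <- pseudo_bias(controls, covars, p_star)
  ci  <- est + c(-1.96, 1.96) * sd(boot_vals)
  screened <- (ci[1] > 0) | (ci[2] < 0)                 # screen if the CI excludes the null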
For scenarios involving treatment effect heterogeneity, the true treatment effect was calculated by simulating the counterfactual population for each target population (e.g., full study population for IPTW, treated population for SMRW, overlap population for overlap weights, and matching population for matching weights). We then calculated bias by taking the difference between the estimated and true effect.
Empirical Example
We compared the effect of nonselective nonsteroidal anti-inflammatory drugs (NSAIDs) versus Cox-2 inhibitors on
gastrointestinal (GI) bleed in a population of Medicare ben-
eficiaries. The study population included 49,653 individu-
als with 17,611 (35.5%) initiating nonselective NSAIDs and
32,042 (64.5%) initiating Cox-2 inhibitors. In this population,
outcome events were relatively uncommon with 552 total GI
events (1.1%). The empirical study has been previously pub-
lished and described in detail elsewhere.2,57,58 The study was
approved by the institutional review board of Brigham and
Women’s Hospital.
We adjusted for 24 investigator-specified variables
that were selected using background knowledge. We then
added various numbers of empirically selected covariates for
adjustment. We used the high-dimensional propensity score algorithm2 to generate thousands of binary variables and selected various sets of these variables using the approaches described in Table 3. Selected variables were included in a main effects logistic regression model to estimate the propensity score. We adjusted for propensity scores using the same
weighting approaches described previously. To reduce suscep-
tibility to unmeasured confounding, we implemented analyses
with 1% asymmetric trimming within the full study popula-
tion.59 Propensity score trimming has been recommended to
mitigate unmeasured confounding and misclassification where
individuals are treated contrary to prediction.59–61 Asymmetric
trimming was based on the distribution of predicted values
from the treatment prediction model used to generate the syn-
thetic datasets. For 1% asymmetric trimming, the synthetic
datasets were generated after trimming on the full study popu-
lation. There was no additional trimming within the synthetic
datasets.
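The sketch below shows one way this step might look, assuming a data frame cohort for the empirical study with treatment indicator A, a vector selected_vars of covariates chosen by one of the selection methods, and a vector treatment_predictors used for the Step 3 treatment model (all hypothetical names). The trimming rule shown, dropping treated patients below the 1st percentile of the score among the treated and comparators above the 99th percentile among the comparators, is one common reading of 1% asymmetric trimming in the spirit of Stürmer et al.59

  # Hypothetical inputs: cohort, selected_vars, treatment_predictors
  ps_fit_analysis <- glm(reformulate(selected_vars, "A"), data = cohort, family = binomial())
  step3_fit_emp   <- glm(reformulate(treatment_predictors, "A"), data = cohort, family = binomial())

  # 1% asymmetric trimming on the score from the treatment model used to generate
  # the synthetic datasets (one common implementation, not necessarily the authors')
  score <- predict(step3_fit_emp, type = "response")
  lo <- quantile(score[cohort$A == 1], 0.01)   # 1st percentile among the treated
  hi <- quantile(score[cohort$A == 0], 0.99)   # 99th percentile among the comparators
  trimmed <- cohort[!(cohort$A == 1 & score < lo) & !(cohort$A == 0 & score > hi), ]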
To screen analyses, we first assessed covariate bal-
ance after propensity score adjustment in the full population.
We assessed balance for each analysis only on the variables
included in the given propensity score model. We screened
analyses with standardized differences >0.1 for any covariate in the given propensity score model.33,34 For analyses that we did not screen, we used synthetic negative control studies to consider additional screening.
TABLE 3. Confounder Selection Methods

Models for simulation study
1–8. Top N Bross bias ranking: We used the Bross bias formula to rank baseline covariates by their potential confounding impact (a sketch of this ranking follows the table). PS models 1–8 included the top 10, 20, 30, 40, 50, 60, 70, and 80 ranked variables, respectively.
9. Treatment lasso: Lasso regression for the treatment using all 100 baseline variables. All variables whose coefficient was not shrunk to 0 were included for confounder adjustment.
10. Outcome lasso: Lasso regression with outcome as the dependent variable. Predictors included treatment and all 100 baseline variables. All variables whose coefficient was shrunk to 0 were excluded (the treatment coefficient was constrained and was not shrunk). All other variables (except for treatment) were included for confounder adjustment.

Models for empirical study
1. Investigator-specified variables: A set of 24 investigator-specified variables were selected manually using background knowledge.
2–7. Top N Bross bias ranking: The top N HDPS-generated variables, as ranked by the Bross formula, were selected for confounder adjustment. Values of N included 50, 100, 200, 300, 400, and 500.
8. Treatment lasso: Lasso regression for the treatment using the top 1000 HDPS-generated variables and the prespecified investigator-specified variables as the predictors. Variables whose coefficient was not shrunk to 0 were included for confounder adjustment. This model selected 472 variables.
9. Outcome lasso: Lasso regression with outcome as the dependent variable. Predictors included treatment, the top 1000 HDPS-ranked variables, and the prespecified variables. Variables whose coefficient was shrunk to 0 were excluded (the treatment coefficient was constrained and was not shrunk). All other variables (except for treatment) were included for confounder adjustment. This model selected 102 variables.

HDPS indicates high-dimensional propensity score; PS, propensity score.
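The Bross-formula ranking used by the models above can be sketched as follows. This is one common form of the Bross bias multiplier as used in the high-dimensional propensity score algorithm,2 with hypothetical names (cohort, candidate_vars); actual implementations may differ in details such as how covariate–outcome relative risks below 1 or empty cells are handled.

  # Bross bias multiplier for one binary covariate x: prevalence among treated (pc1)
  # and comparators (pc0), and the covariate-outcome relative risk (forced >= 1 here)
  bross_score <- function(x, A, Y) {
    pc1 <- mean(x[A == 1]); pc0 <- mean(x[A == 0])
    rr  <- mean(Y[x == 1]) / mean(Y[x == 0])
    if (!is.finite(rr) || rr <= 0) return(NA_real_)   # skip degenerate cells in this sketch
    if (rr < 1) rr <- 1 / rr                          # use the stronger direction of association
    abs(log((pc1 * (rr - 1) + 1) / (pc0 * (rr - 1) + 1)))
  }
  # e.g., with the simulated data from the earlier sketch:
  # cohort <- sim_data; candidate_vars <- predictors
  scores <- sapply(candidate_vars, function(v) bross_score(cohort[[v]], cohort$A, cohort$Y))
  top_n  <- function(n) names(sort(scores, decreasing = TRUE))[seq_len(n)]
  top_n(50)   # e.g., the covariate set added in empirical model 2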
To ensure that the assignment of pseudo-treatment closely approximated the treatment assignment mechanism within the full population, we compared standardized differences in covariates across treatment groups in the full population with standardized differences in covariates across pseudo-treatment groups.
RESULTS
Simulation Study Results
The Figure shows results for scenarios 1 and 2 described in Table 2 and illustrates the impact of varying the tuning parameter K in Step 5d of Table 1. These scenarios involve no unmeasured confounding, correct propensity score model specification, and settings where adjustment sets that satisfy partial exchangeability also satisfy full exchangeability. Under these conditions, the Figure shows that the power to reject biased analyses increased as the degree of bias increased and as the selected value for K increased. For example, analyses based on Model 2 were biased. The Figure shows that the power to reject this analysis (1 minus the pseudo-coverage) was ≈35% when K = 1 (Figure D) and ≈87% when K = 20 (Figure H). The Figure also shows that for analyses that were approximately unbiased, the framework maintained proper coverage for pseudo-effect estimates of ≈95%, implying that the rejection rate for unbiased analyses was controlled at ≈5%.
Scenarios 3 through 6 illustrate limitations of the pro-
posed framework. Results for these scenarios are shown in
eFigures 2 and 3; http://links.lww.com/EDE/B911. eFigure 2;
http://links.lww.com/EDE/B911 shows results for scenarios
3 and 4 and illustrates an important limitation when adjust-
ment sets that satisfy partial exchangeability do not satisfy
full conditional exchangeability. Scenario 3 involves eect
heterogeneity on the risk difference scale in a setting where
confounders are associated with the outcome in both the
treated and control group. In this setting, the framework per-
formed well with the power to reject biased analyses (1 minus
pseudo-coverage) increasing as bias increased while main-
taining proper pseudo-coverage of ≈95% for unbiased analy-
ses (eFigure 2D; http://links.lww.com/EDE/B911). Scenario
4 shown in eFigure 2; http://links.lww.com/EDE/B911, how-
ever, involves a setting where confounders are associated with
the outcome only in the treatment group (adjustment sets sat-
isfying partial exchangeability do not satisfy full conditional
exchangeability). In this setting, the framework still maintains
proper pseudo-coverage for unbiased analyses but can lose
power to reject biased analyses. However, performance is not
negatively impacted for analyses that target the average treat-
ment effect on the treated (e.g., SMRW) since this estimand
only requires partial exchangeability for identification (dis-
cussed further in the Discussion).43,62
Finally, eFigure 3; http://links.lww.com/EDE/B911 shows results for scenarios 5 and 6 and highlights additional limitations when the assumption of no unmeasured confounding is violated. All analyses in eFigure 3; http://links.lww.com/EDE/B911 are biased, illustrating that the framework is unable to identify bias caused by unmeasured confounding. Results in eFigure 3; http://links.lww.com/EDE/B911 further show that in the presence of unmeasured confounding, it is possible for the framework to result in poorer pseudo-coverage for analyses that exclude instrumental variables compared with analyses that adjust for instruments (scenario 6, eFigure 3; http://links.lww.com/EDE/B911).
FIGURE. Simulation results for scenarios 1 (Plots A through D) and 2 (Plots E through H): the mean risk difference, mean bias in the estimated risk difference, mean pseudo-bias, and pseudo-coverage across all 1000 simulated datasets. Scenarios 1 and 2 differed in the value of the tuning parameter, K, described in Step 5d of Table 1; this value was set at 1 for scenario 1 and 20 for scenario 2. Propensity score (PS) models differed in which covariates were selected for adjustment. Crude is the unadjusted model. Models 1–10 are described in Table 3.
In other words, in the presence of unmeasured confounding, the framework can penalize analyses that exclude instrumental variables by being more likely to screen these analyses. This limitation, however, is not unique to the proposed framework and can impact any model selection algorithm that conditions on treatment to evaluate covariate associations with the outcome (discussed further in the Discussion).
Empirical Example Results
All analyses resulted in well-balanced cohorts for each scenario, with standardized covariate differences well below 0.1 for variables included in the given propensity score model (eFigures 4 through 7; http://links.lww.com/EDE/B911). After generating synthetic control datasets, patterns of imbalance in covariates across pseudo-treatment groups closely aligned with patterns of imbalance in the study cohort (eFigure 8; http://links.lww.com/EDE/B911). Table 4 shows that screening with synthetic control studies resulted in a narrower range of estimates for investigators to consider when interpreting results. Table 4 also shows that for IPTW analyses, patterns in the movement of pseudo-effect estimates did not follow general patterns of movement in the estimated treatment effects. As illustrated in simulations, this can occur when there is strong effect heterogeneity or strong unmeasured confounding. In healthcare database studies, heterogeneity and unmeasured confounding are often greater in the tails of the propensity score where individuals are treated contrary to prediction.59,63,64 Because IPTW analyses upweight individuals in the tails of the propensity score distribution, it is possible that unmeasured confounding and heterogeneity had a stronger impact on these analyses.
TABLE 4. Results for Empirical Example

Overlap weights (OW)
  PS Modelb     RD (95% CI)a           Pseudo-bias (95% CI)a    Screened
  Unadjusted    −0.05 (−0.25, 0.14)    −0.25 (−0.34, −0.17)     Yes
  Model 1        0.07 (−0.12, 0.26)    −0.14 (−0.22, −0.06)     Yes
  Model 2        0.11 (−0.08, 0.29)    −0.09 (−0.16, −0.01)     Yes
  Model 3        0.14 (−0.05, 0.32)    −0.06 (−0.13, 0.01)      No
  Model 4        0.14 (−0.04, 0.33)    −0.05 (−0.12, 0.01)      No
  Model 5        0.16 (−0.02, 0.35)    −0.04 (−0.11, 0.02)      No
  Model 6        0.16 (−0.02, 0.35)    −0.03 (−0.10, 0.02)      No
  Model 7        0.15 (−0.03, 0.33)    −0.02 (−0.07, 0.03)      No
  Model 8        0.13 (−0.05, 0.32)    −0.01 (−0.04, 0.02)      No
  Model 9        0.11 (−0.07, 0.30)    −0.07 (−0.14, −0.01)     Yes

Matching weights (MW)
  PS Modelb     RD (95% CI)a           Pseudo-bias (95% CI)a    Screened
  Unadjusted    −0.05 (−0.25, 0.14)    −0.25 (−0.34, −0.17)     Yes
  Model 1        0.04 (−0.14, 0.23)    −0.15 (−0.23, −0.07)     Yes
  Model 2        0.09 (−0.09, 0.27)    −0.09 (−0.17, −0.02)     Yes
  Model 3        0.12 (−0.06, 0.30)    −0.07 (−0.14, 0.00)      No
  Model 4        0.13 (−0.05, 0.31)    −0.06 (−0.13, 0.01)      No
  Model 5        0.15 (−0.03, 0.33)    −0.05 (−0.11, 0.02)      No
  Model 6        0.16 (−0.02, 0.34)    −0.04 (−0.10, 0.02)      No
  Model 7        0.16 (−0.02, 0.34)    −0.02 (−0.08, 0.03)      No
  Model 8        0.16 (−0.02, 0.34)    −0.01 (−0.03, 0.02)      No
  Model 9        0.10 (−0.08, 0.29)    −0.08 (−0.14, −0.01)     Yes

Standardized mortality ratio weights (SMRW)
  PS Modelb     RD (95% CI)a           Pseudo-bias (95% CI)a    Screened
  Unadjusted    −0.05 (−0.25, 0.14)    −0.25 (−0.34, −0.17)     Yes
  Model 1        0.06 (−0.12, 0.24)    −0.15 (−0.23, −0.07)     Yes
  Model 2        0.11 (−0.07, 0.29)    −0.10 (−0.17, −0.02)     Yes
  Model 3        0.13 (−0.05, 0.31)    −0.07 (−0.14, 0.00)      No
  Model 4        0.14 (−0.04, 0.32)    −0.06 (−0.13, 0.01)      No
  Model 5        0.16 (−0.02, 0.34)    −0.05 (−0.11, 0.01)      No
  Model 6        0.18 (0.00, 0.35)     −0.04 (−0.10, 0.02)      No
  Model 7        0.18 (0.00, 0.36)     −0.02 (−0.08, 0.04)      No
  Model 8        0.15 (−0.03, 0.33)     0.01 (−0.02, 0.03)      No
  Model 9        0.11 (−0.07, 0.29)    −0.08 (−0.14, −0.01)     Yes

Stabilized inverse probability of treatment weights (IPTW)
  PS Modelb     RD (95% CI)a           Pseudo-bias (95% CI)a    Screened
  Unadjusted    −0.05 (−0.25, 0.14)    −0.25 (−0.34, −0.17)     Yes
  Model 1        0.15 (−0.05, 0.35)    −0.12 (−0.21, −0.04)     Yes
  Model 2        0.20 (0.00, 0.40)     −0.07 (−0.15, 0.01)      No
  Model 3        0.24 (0.04, 0.42)     −0.04 (−0.12, 0.04)      No
  Model 4        0.22 (0.02, 0.42)     −0.04 (−0.11, 0.03)      No
  Model 5        0.20 (0.00, 0.40)     −0.04 (−0.11, 0.03)      No
  Model 6        0.17 (−0.03, 0.36)    −0.04 (−0.11, 0.02)      No
  Model 7        0.14 (−0.06, 0.34)    −0.03 (−0.09, 0.03)      No
  Model 8        0.08 (−0.11, 0.28)    −0.03 (−0.06, 0.01)      No
  Model 9        0.15 (−0.05, 0.35)    −0.06 (−0.14, 0.01)      No

aRisk difference × 100 and using bootstrapped SE for confidence intervals.
bModel 1 adjusted for investigator-specified variables. Models 2–7 adjusted for investigator-specified variables along with the top 50 (Model 2), 100 (Model 3), 200 (Model 4), 300 (Model 5), 400 (Model 6), and 500 (Model 7) Bross-ranked HDPS-generated variables. Model 8 included variables selected by a Lasso model for treatment, and Model 9 included variables selected by a Lasso model for the outcome.
CI indicates confidence interval; HDPS, high-dimensional propensity score; PS, propensity score.
This could explain the dissimilarity in the patterns of movement between pseudo-bias and estimated treatment effects.
Finally, Table 4 illustrates that the width of the confi-
dence interval for pseudo-bias decreases as more predictors of
treatment are included in the propensity score model. This is
because of Steps 5d and 6 in Table 1. As K in Step 5d increases,
Step 6 produces a distribution of pseudo-effect estimates and
takes the mean of this distribution as the estimate for pseudo-
bias to average over the variation in this distribution. However,
because Step 5d only simulates pseudo-treatment status and
not the outcome, Step 6 only averages over variation that is
caused by controlling for variability in the pseudo-treatment
assignment mechanism (i.e., averages over variation caused by
controlling for predictors of treatment). Consequently, analy-
ses that control for more variables in the model that is used to
assign pseudo-treatment will have less variation around the
estimated pseudo-bias. This averaging allows for a more pre-
cise assessment of how much of the difference between the estimated pseudo-bias and the null effect is due to random variation and is not intended to produce estimates that mimic the properties of actual treatment effect estimates.
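As a small numerical illustration of this averaging, continuing the hypothetical objects from the earlier sketches, one can compare the Monte Carlo variation of a single pseudo-effect draw with that of a mean over K draws:

  # Variation of the pseudo-effect across repeated pseudo-treatment assignments:
  # a single draw (K = 1) versus the mean of K = 20 draws for the same analysis
  single_draws <- replicate(200, one_pseudo_effect(controls, covars, p_star))
  averaged_K20 <- replicate(200,
    mean(replicate(20, one_pseudo_effect(controls, covars, p_star))))
  var(single_draws)   # larger: includes the full assignment-step noise
  var(averaged_K20)   # smaller: the noise is averaged over K pseudo-populations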
DISCUSSION
We proposed a framework for generating synthetic neg-
ative controls to screen large-scale propensity score analyses
that are unlikely to adequately control for measured confound-
ing. We used simulations to discuss strengths and limitations
of the framework and we illustrated its application through
an empirical example. Under certain assumptions, we found
that synthetic negative control studies can supplement bal-
ance diagnostics to provide an objective framework to screen
analyses that show signs of measured confounding bias for
the given study implementation. The framework allows investigators, in an algorithmically predefined way, to exclude likely biased analyses when interpreting results while maintaining
proper pseudo-coverage for unbiased analyses, thereby con-
trolling the rejection rate.
Because the framework relies on a well-specified pro-
pensity score model to generate the synthetic datasets, one
may wish to simply compare results directly in the full study
population with the estimated effect obtained from adjustment based on this model. However, directly comparing results to a gold standard effect estimate risks making model selection decisions that are influenced by factors other than the model's performance in controlling for measured confounding. For example, if investigators are interested in examining how the treatment effect changes across alternative propensity score adjustment methods and propensity score models, it is difficult to disentangle whether differences in estimated effects are due to models excluding confounding variables, heterogeneity in the treatment effect, or residual imbalance from the adjustment approach. Making comparisons after weighting analyses to the same target population would not allow for direct comparisons of adjustment approaches (e.g., comparing robustness of overlap weights versus inverse probability weighting). Balance checks can help identify problems with propensity score analyses. However, balance checks do not inform researchers on which variables balance should be assessed. Further, it can be difficult to quantify the impact of
many small residual imbalances in high-dimensional settings.
The proposed framework can supplement balance diagnostics
to allow investigators to objectively evaluate and compare
analyses in their ability to control for measured confounder
information without relying on comparisons to estimated
eects in the full population.
A few limitations deserve attention. First, our findings
are supported by illustrative simulated data rather than com-
prehensive evidence or proof. Statistical theory supporting
the proposed framework is not provided in this work. Second,
the proposed method is only useful for screening analyses
that show signs of bias caused by measured confounding.
Analyses that are not screened are not necessarily accepted
as valid. In particular, the framework can only test for viola-
tions of partial exchangeability caused by lack of control for
measured confounders. Although partial exchangeability is
necessary for identification of average treatment effects and sufficient for some causal effects (e.g., the average treatment effect in the treated or ATT), it is not sufficient for identification of all causal effects.43,62 For target parameters that require full conditional exchangeability, the power to reject biased analyses can deteriorate when covariate sets that satisfy partial exchangeability do not satisfy full conditional exchangeability. Third, in this study, we only considered parametric
models for estimating propensity scores. It is possible that the
framework performs best when there is some restriction on
the complexity of the propensity score model. Future research
could explore settings that require more flexible machine
learning algorithms for propensity score modeling. In this
case, sample splitting may be necessary for proper cover-
age.65,66 Future research could also compare the described
approach with other proposed frameworks for model valida-
tion in causal inference.35,39,47,67,68 A thorough comparison of
these alternative frameworks is beyond our scope.
Finally, the proposed framework assumes all con-
founder information is identifiable in the data. In the presence
of strong unmeasured confounding, the proposed framework
can favor models that err on the inclusion rather than exclu-
sion of instrumental variables. This is because conditioning
on treatment when generating the synthetic datasets induces
correlations between instrumental variables and unmeasured
confounders which can make instruments act like confound-
ers in the synthetic datasets. However, this is a general limita-
tion of data-driven confounder selection and is not unique to
the proposed approach. When unmeasured confounding can-
not be eliminated or nearly eliminated, all analyses would be
flawed regardless of the analytic approach.
In summary, we have described a framework to help
researchers screen large-scale propensity score analyses that
show signs of measured confounding bias for a given study
implementation. Under certain assumptions, the framework can
supplement balance diagnostics to improve robustness of large-
scale propensity score analyses in healthcare database studies.
REFERENCES
1. Schneeweiss S. Automated data-adaptive analytics for electronic
healthcare data to study causal treatment effects. Clin Epidemiol.
2018;10:771–788.
2. Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA.
High-dimensional propensity score adjustment in studies of treatment
effects using health care claims data. Epidemiology. 2009;20:512–522.
3. Brookhart MA, Wyss R, Layton JB, Stürmer T. Propensity score methods
for confounding control in nonexperimental research. Circ Cardiovasc
Qual Outcomes. 2013;6:604–611.
4. Glynn RJ, Schneeweiss S, Stürmer T. Indications for propensity scores
and review of their use in pharmacoepidemiology. Basic Clin Pharmacol
Toxicol. 2006;98:253–259.
5. Rosenbaum PR, Rubin DB. The central role of the propensity score in
observational studies for causal effects. Biometrika. 1983;70:41–55.
6. Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer
T. Variable selection for propensity score models. Am J Epidemiol.
2006;163:1149–1156.
7. VanderWeele TJ. Principles of confounder selection. Eur J Epidemiol.
2019;34:211–219.
8. Bhattacharya J, Vogt WB. Do instrumental variables belong in propensity
scores? National Bureau of Economic Research; 2007. (NBER Technical
Working Paper no. 343).
9. Myers JA, Rassen JA, Gagne JJ, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol.
2011;174:1213–1222.
10. Sauer BC, Brookhart MA, Roy J, VanderWeele T. A review of covari-
ate selection for non-experimental comparative effectiveness research.
Pharmacoepidemiol Drug Saf. 2013;22:1139–1145.
11. Schisterman EF, Cole SR, Platt RW. Overadjustment bias and unnecessary
adjustment in epidemiologic studies. Epidemiology. 2009;20:488–495.
12. Wooldridge J. Should instrumental variables be used as matching vari-
ables? Res Econ. 2016;70:232–237.
13. Wyss R, Girman CJ, LoCasale RJ, Brookhart AM, Stürmer T. Variable
selection for propensity score models when estimating treatment effects
on multiple outcomes: a simulation study. Pharmacoepidemiol Drug Saf.
2013;22:77–85.
14. Ju C, Benkeser D, van der Laan MJ. Robust inference on the average
treatment effect using the outcome highly adaptive lasso. Biometrics.
2020;76:109–118.
15. Ju C, Gruber S, Lendle SD, et al. Scalable collaborative targeted learning
for high-dimensional data. Stat Methods Med Res. 2019;28:532–554.
16. Ju C, Wyss R, Franklin JM, Schneeweiss S, Häggström J, van der Laan
MJ. Collaborative-controlled LASSO for constructing propensity score-
based estimators in high-dimensional data. Stat Methods Med Res.
2019;28:1044–1063.
17. Koch B, Vock DM, Wolfson J. Covariate selection with group lasso and
doubly robust estimation of causal effects. Biometrics. 2018;74:8–17.
18. Koch B, Vock DM, Wolfson J, Vock LB. Variable selection and estimation
in causal inference using Bayesian spike and slab priors. Stat Methods
Med Res. 2020;29:2445–2469.
19. Shortreed SM, Ertefaie A. Outcome-adaptive lasso: variable selection for
causal inference. Biometrics. 2017;73:1111–1122.
20. Ertefaie A, Asgharian M, Stephens DA. Variable selection in causal
inference using a simultaneous penalization method. J Causal Inference.
2018;6:20170010.
21. Franklin JM, Eddings W, Glynn RJ, Schneeweiss S. Regularized regression
versus the high-dimensional propensity score for confounding adjustment
in secondary database analyses. Am J Epidemiol. 2015;182:651–659.
22. Ju C, Combs M, Lendle SD, et al. Propensity score prediction for elec-
tronic healthcare databases using Super Learner and High-dimensional
Propensity Score Methods. J Appl Stat. 2019;46:2216–2236.
23. Karim ME, Pang M, Platt RW. Can we train machine learning meth-
ods to outperform the high-dimensional propensity score algorithm?
Epidemiology. 2018;29:191–198.
24. Schneeweiss S, Eddings W, Glynn RJ, Patorno E, Rassen J, Franklin
JM. Variable selection for confounding adjustment in high-dimensional
covariate spaces when analyzing healthcare databases. Epidemiology.
2017;28:237–248.
25. Wyss R, Schneeweiss S, van der Laan M, Lendle SD, Ju C, Franklin JM.
Using super learner prediction modeling to improve high-dimensional
propensity score estimation. Epidemiology. 2018;29:96–106.
26. Schuemie JM, Cepeda MS, Suchard MA, et al. How confident are we
about observational findings in health care: a benchmark study. Harv
Data Sci Rev. 2020;2. doi: 10.1162/99608f92.147cc28e
27. Schuemie MJ, Hripcsak G, Ryan PB, Madigan D, Suchard MA. Empirical
confidence interval calibration for population-level effect estimation
studies in observational healthcare data. Proc Natl Acad Sci U S A.
2018;115:2571–2577.
28. Schuemie MJ, Ryan PB, Hripcsak G, Madigan D, Suchard MA.
Improving reproducibility by using high-throughput observational
studies with empirical calibration. Philos Trans A Math Phys Eng Sci.
2018;376:20170356.
29. Tian Y, Schuemie MJ, Suchard MA. Evaluating large-scale propensity
score performance through real-world and synthetic data experiments. Int
J Epidemiol. 2018;47:2005–2014.
30. Athey S, Imbens GW. A measure of robustness to misspecification. Am
Econ Rev. 2015;105:476–480.
31. Coker B, Rudin C, King G. A theory of statistical inference for ensuring
the robustness of scientific results. Manag Sci. 2021;67:6174–6197.
32. Simonsohn U, Simmons JP, Nelson LD. Specification curve analysis. Nat
Hum Behav. 2020;4:1208–1214.
33. Austin PC. Balance diagnostics for comparing the distribution of base-
line covariates between treatment groups in propensity-score matched
samples. Stat Med. 2009;28:3083–3107.
34. Franklin JM, Rassen JA, Ackermann D, Bartels DB, Schneeweiss S.
Metrics for covariate balance in cohort studies of causal effects. Stat Med.
2014;33:1685–1699.
35. Alaa AM, van der Schaar M. Validating causal inference models via
influence functions. In: Proceedings of the 36th International Conference
on Machine Learning, in Proceedings of Machine Learning Research.
2019:191–201.
36. Franklin JM, Schneeweiss S, Polinski JM, Rassen JA. Plasmode simula-
tion for the evaluation of pharmacoepidemiologic methods in complex
healthcare databases. Comput Stat Data Anal. 2014;72:219–226.
37. Hansen BB. Bias reduction in observational studies via prognosis scores.
Statistics Department, University of Michigan; 2006; Technical report
No. 441.
38. Huber M, Lechner M, Wunsch C. The performance of estimators based
on the propensity score. J Econom. 2013;175:1–21.
39. Schuler A, Jung K, Tibshirani R, Hastie T, Shah, N. Synth-validation:
selecting the best causal inference method for a given dataset. arXiv pre-
print arXiv:1711.00083. 2017.
40. Neyman J, Dabrowska DM, Speed TP. On the application of probability
theory to agricultural experiments. Essay on principles. Section 9. Statist
Sci. 1990;5:465–472.
41. Rubin DB. Estimating causal effects of treatments in randomized and
nonrandomized studies. J Educ Psychol. 1974;66:668–701.
42. Greenland S, Robins JM. Identifiability, exchangeability and confound-
ing revisited. Epidemiol Perspect Innov. 2009;6:4.
43. Sarvet AL, Wanis KN, Stensrud MJ, Hernán MA. A graphical description
of partial exchangeability. Epidemiology. 2020;31:365–368.
44. VanderWeele TJ. Concerning the consistency assumption in causal infer-
ence. Epidemiology. 2009;20:880–883.
45. Westreich D, Cole SR. Invited commentary: positivity in practice. Am J
Epidemiol. 2010;171:674–677; discussion 678.
46. Pearl J. Causal Inference. Proceedings of Workshop on Causality:
Objectives and Assessment at NIPS 2008, in Proceedings of Machine
Learning Research. 2010;6:29–58.
47. Athey S, Imbens GW, Metzger J, Munro E. Using Wasserstein gen-
erative adversarial networks for the design of Monte-Carlo simula-
tions [published online ahead of print March 20, 2021]. J Econom. doi:
10.1016/j.jeconom.2020.09.013
48. Bahamyirou A, Blais L, Forget A, Schnitzer ME. Understanding and
diagnosing the potential for bias when using machine learning meth-
ods with doubly robust causal estimators. Stat Methods Med Res.
2019;28:1637–1650.
49. Dorie V, Harada M, Carnegie NB, Hill J. A flexible, interpretable frame-
work for assessing sensitivity to unmeasured confounding. Stat Med.
2016;35:3453–3470.
50. Petersen ML, Porter KE, Gruber S, Wang Y, van der Laan MJ. Diagnosing
and responding to violations in the positivity assumption. Stat Methods
Med Res. 2012;21:31–54.
51. Wyss R, Hansen BB, Ellis AR, et al. The “Dry-Run” analysis: a method
for evaluating risk scores for confounding control. Am J Epidemiol.
2017;185:842–852.
52. Neal B, Huang CW, Raghupathi S. RealCause: realistic causal inference
benchmarking. arXiv preprint arXiv:2011.15007. 2020.
53. Rubin DB. The design versus the analysis of observational studies for
causal effects: parallels with the design of randomized trials. Stat Med.
2007;26:20–36.
54. Li F, Morgan KL, Zaslavsky AM. Balancing covariates via propensity
score weighting. J Am Stat Assoc. 2018;113:390–400.
55. Li L, Greene T. A weighting analogue to pair matching in propensity
score analysis. Int J Biostat. 2013;9:215–234.
56. Cole SR, Hernán MA. Constructing inverse probability weights for mar-
ginal structural models. Am J Epidemiol. 2008;168:656–664.
57. Patorno E, Bohn RL, Wahl PM, et al. Anticonvulsant medications
and the risk of suicide, attempted suicide, or violent death. JAMA.
2010;303:1401–1409.
58. Patorno E, Glynn RJ, Hernández-Díaz S, Liu J, Schneeweiss S. Studies
with many covariates and few outcomes: selecting covariates and imple-
menting propensity-score-based confounding adjustments. Epidemiology.
2014;25:268–278.
59. Stürmer T, Rothman KJ, Avorn J, Glynn RJ. Treatment effects in the pres-
ence of unmeasured confounding: dealing with observations in the tails
of the propensity score distribution–a simulation study. Am J Epidemiol.
2010;172:843–854.
60. Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Dealing with limited overlap
in estimation of average treatment effects. Biometrika. 2009;96:187–199.
61. Walker AM, Patrick AR, Lauer MS, et al. A tool for assessing the feasibil-
ity of comparative effectiveness research. Comp Eff Res. 2013;3:11–20.
62. Hansen BB. The prognostic analogue of the propensity score. Biometrika.
2008;95:481–488.
63. Glynn RJ, Lunt M, Rothman KJ, Poole C, Schneeweiss S, Stürmer
T. Comparison of alternative approaches to trim subjects in the tails
of the propensity score distribution. Pharmacoepidemiol Drug Saf.
2019;28:1290–1298.
64. Stürmer T, Webster-Clark M, Lund JL, et al. Propensity score
weighting and trimming strategies for reducing variance and bias
of treatment effect estimates: a simulation study. Am J Epidemiol.
2021;190:1659–1670.
65. Naimi AI, Mishler AE, Kennedy EH. Challenges in obtaining valid causal
effect estimates with machine learning algorithms [published online
ahead of print July 15, 2021]. Am J Epidemiol. doi: 10.1093/aje/kwab201
66. Zivich PN, Breskin A. Machine Learning for Causal Inference: On the
Use of Cross-fit Estimators. Epidemiology. 2021;32:393–401.
67. Saito Y, Yasui S. Counterfactual cross-validation: effective causal model
selection from observational data. arXiv preprint arXiv:1909.05299. 2019.
68. Rolling CA, Yang Y. Model selection for estimating treatment eects. J R
Stat Soc Series B. 2014;76:749–769.