(www.interscience.wiley.com) DOI: 10.1002/sim.0000
Web-based supporting materials for “A flexible,
interpretable framework for assessing sensitivity
to unmeasured confounding”
Vincent Dorie, Masataka Harada, Nicole Bohme Carnegie, and Jennifer Hill
1. BART for Binary Response Data
This section explores the sensitivity of Bayesian Additive Regression Trees (BART) for binary outcomes to the choice of
end-node prior, specifically via the hyperparameter ‘k’. For three illustrative datasets, misclassification rates are computed
using cross-validation across a range of values of k. The optimal hyperparameter varies in a non-obvious fashion, such
that no single default is sufficient for all cases. As such, further work on binary BART is required before it can be included
as an assignment mechanism model in our sensitivity analysis algorithm.
BART operates by using many “small learners”, each of which divides the covariate space in a regression tree and
the results of which are all summed together. To make this approach fully Bayesian, BART includes priors over the tree
structure and the parameters in the end-nodes of those trees. For full details of BART, see Chipman et al. [1]. In its original
derivation, the end-node prior for binary data is a normal distribution with a standard deviation given by the equation:
\[ \mathrm{sd} = \frac{3}{k \sqrt{\text{num trees}}}. \]
For any fixed covariate, the prediction is the sum of a draw for each tree so that the marginal prior is normal with a standard
deviation that is equal to 3/k. The default value for k is 2. Smaller values of k indicate a more diffuse prior, while larger
values correspond to shrinking the end-node parameters to a specific value.
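The step from the per-tree standard deviation to the marginal value of 3/k can be checked numerically. The following is a minimal sketch, assuming independent end-node draws centered at zero; the function names and the use of 50 trees in the simulation are illustrative choices, not taken from any BART implementation.

```python
import math
import random


def end_node_sd(k: float, num_trees: int) -> float:
    """Prior sd of a single end-node parameter: 3 / (k * sqrt(num_trees))."""
    return 3.0 / (k * math.sqrt(num_trees))


def marginal_sd(k: float, num_trees: int) -> float:
    """Sd of the sum of num_trees independent end-node draws.

    Variances add, so the sd of the sum is sqrt(num_trees) times the
    per-tree sd, which simplifies to 3 / k.
    """
    return math.sqrt(num_trees) * end_node_sd(k, num_trees)


def simulated_marginal_sd(k: float, num_trees: int, draws: int = 4000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (zero prior mean is an assumption)."""
    rng = random.Random(seed)
    sd = end_node_sd(k, num_trees)
    sums = [sum(rng.gauss(0.0, sd) for _ in range(num_trees)) for _ in range(draws)]
    mean = sum(sums) / draws
    return math.sqrt(sum((s - mean) ** 2 for s in sums) / (draws - 1))


if __name__ == "__main__":
    for k in (0.5, 1.0, 2.0, 4.0):
        print(k, marginal_sd(k, 200), simulated_marginal_sd(k, 50))
```

The analytic value does not depend on the number of trees, which is why k alone controls how diffuse the prior on the summed prediction is.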
We evaluate the sensitivity of binary BART to k in the context of three classification problems:
1. Determining Republican/Democrat presidential votes from income, race, and sex in the 2008 U.S. national election
[2]. After dropping incomplete cases, there are 1426 observations, one continuous variable and two categorical,
yielding 7 predictors for BART to split.
2. Predicting the presence of cardiac arrhythmia using patient physical attributes and characteristics of
electrocardiograms [3]. After dropping variables with missing or constant values, 191 continuous features were
combined with 66 binary ones to create 257 predictors. The dataset contains 452 observations.
3. One of the MONK’s problems, a constructed example wherein positive cases derive from specific relations between
subsets of categorical features [4]. There are 432 observations and 6 categorical variables, so that BART has 15
predictors to use. The specific problem fit is the first of three, in which positive cases arise when either the first two
categorical variables are the same or the fifth takes on a specific value.
The latter two datasets are part of the UCI Machine Learning Repository [5].
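The counts quoted for the MONK’s problem can be reproduced by enumerating the full design. The sketch below rests on three assumptions that are not stated in the text above: the six attributes have 3, 3, 2, 3, 4, and 2 levels, the "specific value" of the fifth attribute is 1, and the 15 predictors arise from coding two-level attributes as a single indicator and all others as one indicator per level.

```python
from itertools import product

# Assumed numbers of levels for the six categorical attributes.
LEVELS = (3, 3, 2, 3, 4, 2)


def monk1_label(a):
    """Positive when the first two attributes match or the fifth equals 1 (assumed value)."""
    return int(a[0] == a[1] or a[4] == 1)


def encode(a):
    """Indicator coding: one column per level, but a single column for two-level attributes."""
    row = []
    for value, n_levels in zip(a, LEVELS):
        if n_levels == 2:
            row.append(int(value == 2))
        else:
            row.extend(int(value == level) for level in range(1, n_levels + 1))
    return row


# Every combination of attribute levels appears exactly once.
cases = list(product(*(range(1, n + 1) for n in LEVELS)))
X = [encode(a) for a in cases]
y = [monk1_label(a) for a in cases]

print(len(cases), len(X[0]), sum(y))  # 432 15 216
```

Under these assumptions exactly half of the cases are positive, and the positive class depends on an interaction between the first two attributes that no main-effects model can represent.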
To test the sensitivity of binary BART to k, we use K-fold cross-validation to calculate the misclassification rate on
held-out data across a range of values. For 200 replications, a random fifth of the data are set aside while BART is fit to
the remainder using 500 burn-in samples and 200 posterior draws. In addition to binary BART, we fit logistic regressions
to each dataset. The examples used are notable for the difficulties they present to standard techniques. The first and second
exhibit quasi-complete separation [6], while the third is such that if interactions are included, the number of observations
exactly equals the number of predictors. Consequently, we use a Bayesian version of logistic regression which penalizes
the likelihood by imposing a Cauchy prior on the regression coefficients [7]. In addition to a main effects model obtained
by regressing against all variables as initially defined, we also fit a second order model that contains quadratic terms and
interactions. For the arrhythmia data, we use second order terms obtained from only the five most statistically significant
predictors in the main-effects model.
Statist. Med. 0000, 00 1–3 Copyright © 0000 John Wiley & Sons, Ltd.
Prepared using simauth.cls [Version: 2010/03/10 v3.00]
[Figure 1 plot omitted; panel title: “Sensitivity of Binary BART to End-Node Parameter”.]
Figure 1. Misclassification rate for held-out data as a function of the binary BART parameter k. The thick black lines are BART results, gray lines are from a Bayesian logistic
regression using only main effects, and thin lines are from similar regressions with second order effects. For ANES, the thin and gray lines largely overlap; for MONK1, the thin
line is on the horizontal axis.
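The hold-out procedure described above (200 replications, a random fifth of the data set aside each time, repeated over a grid of k) can be sketched as follows. This is an illustration only: the classifier is a hypothetical toy stand-in that shrinks per-group success rates toward the overall rate, not BART, and the synthetic data, the grid of k values, and all function names are assumptions made for the example.

```python
import random


def fit_shrunken_rates(xs, ys, k):
    """Toy classifier: per-group success rates shrunk toward the overall rate.

    Larger k (must be > 0) means more shrinkage, loosely mimicking the role
    that k plays for the end-node prior. This is not BART.
    """
    base = sum(ys) / len(ys)
    counts = {}
    for x, y in zip(xs, ys):
        n, pos = counts.get(x, (0, 0))
        counts[x] = (n + 1, pos + y)

    def predict(x):
        n, pos = counts.get(x, (0, 0))
        return int((pos + k * base) / (n + k) >= 0.5)

    return predict


def holdout_error(xs, ys, k, replications=200, holdout_fraction=0.2, seed=0):
    """Average misclassification rate on a random held-out fifth of the data."""
    rng = random.Random(seed)
    n = len(ys)
    n_test = max(1, round(holdout_fraction * n))
    total = 0.0
    for _ in range(replications):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        predict = fit_shrunken_rates([xs[i] for i in train], [ys[i] for i in train], k)
        total += sum(predict(xs[i]) != ys[i] for i in test) / n_test
    return total / replications


# Synthetic data: one categorical predictor with ten levels.
data_rng = random.Random(1)
xs = [data_rng.randrange(10) for _ in range(300)]
ys = [int(data_rng.random() < (0.8 if x < 5 else 0.2)) for x in xs]

errors = {k: holdout_error(xs, ys, k) for k in (0.5, 1, 2, 4, 8)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

Using the same seed for every value of k means each setting is scored on identical train/test splits, so differences in the error curve reflect the hyperparameter rather than the partitioning.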
The results of this comparison are shown in Figure 1. The vote prediction problem is such that most of the predictors
used have little relationship to the outcome: the second order logistic model performs little better than the main effects fit
and binary BART excels on out of sample data only as its predictions are shrunk towards the baseline success probability.
The predictor-rich arrhythmia data exhibits widely varying performance, but is best with a relatively diffuse prior, that is, a strong
preference for the likelihood. Finally, MONK’s problem shows that too much prior shrinkage can prevent BART from
identifying deep interactions.
These three examples demonstrate how difficult it is to find an optimal hyperparameter without directly exploring that
space. Attempting to estimate the optimal k by using a simpler model in the ANES data would show little relationship
between the predictors and outcome, while passing over the deeper interactions in a case like the MONK dataset.
Exhaustively exploring the space of all higher-order interactions to determine if this is the case is computationally
prohibitive. Finally, the arrhythmia example shows how sensitive performance can be in the context of a difficult problem.
While it is beyond the scope of this work to directly investigate how to choose k, streamlining the cross-validation
procedure has shown some promise and, as a Bayesian algorithm, including uncertainty in k itself as a hyper-prior remains
[1] Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Annals of Applied Statistics
[2] The American National Election Studies. The ANES 2008 time series study 2008. URL http://www.
[3] Guvenir HA, Acar B, Demiroz G, Cekin A. A supervised machine learning algorithm for arrhythmia analysis.
Proceedings of the Computers in Cardiology Conference, Lund, Sweden, 1997.
[4] Thrun SB, Bala J, Bloedorn E, Bratko I, Cestnik B, Cheng J, Jong KD, Dzeroski S, Fahlman SE, Fisher D, et al. The
MONK’s problems: A performance comparison of different learning algorithms. Technical Report 1991.
[5] Bache K, Lichman M. UCI machine learning repository 2013. URL http://archive.ics.uci.edu/ml.
[6] Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika
[7] Gelman A, Jakulin A, Pittau MG, Su YS. A weakly informative default prior distribution for logistic and other
regression models. The Annals of Applied Statistics 2008; :1360–1383.