Research Article
in Medicine
Received XXXX
( DOI: 10.1002/sim.0000
Web-based supporting materials for “A flexible,
interpretable framework for assessing sensitivity
to unmeasured confounding”
Vincent Dorie, Masataka Harada, Nicole Bohme Carnegie, and Jennifer Hill
1. BART for Binary Response Data
This section explores the sensitivity of Bayesian Additive Regression Trees (BART) for binary outcomes to the choice of
end-node prior, specifically via the hyperparameter ‘k’. For three illustrative datasets, misclassification rates are computed
using cross-validation across a range of values of k. The optimal hyperparameter varies in a non-obvious fashion, such
that no single default is sufficient for all cases. As such, further work on binary BART is required before it can be included
as an assignment mechanism model in our sensitivity analysis algorithm.
BART operates by using many “small learners”, each of which partitions the covariate space with a regression tree; the
predictions of all trees are then summed. To make this approach fully Bayesian, BART includes priors over the tree
structure and the parameters in the end-nodes of those trees. For full details of BART, see Chipman et al. [1]. In its original
derivation, the end-node prior for binary data is a normal distribution with a standard deviation given by the equation:
$$\mathrm{sd} = \frac{3}{k\sqrt{\text{num trees}}}.$$
For any fixed covariate value, the prediction is the sum of one draw from each tree, so that the marginal prior is normal with a
standard deviation equal to 3/k. The default value for k is 2. Smaller values of k indicate a more diffuse prior, while larger
values shrink the end-node parameters more strongly toward a specific value.
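The relationship between the per-tree standard deviation and the marginal value of 3/k can be checked numerically. The following is a minimal sketch, assuming the end-node draws are independent across trees a priori; the function names and the choice of 200 trees are illustrative and do not come from the original analysis.

```python
import math


def end_node_prior_sd(k: float, num_trees: int) -> float:
    """Prior sd of one end-node parameter: 3 / (k * sqrt(num_trees))."""
    return 3.0 / (k * math.sqrt(num_trees))


def marginal_prior_sd(k: float, num_trees: int) -> float:
    """Sd of the sum of num_trees end-node draws (independence assumed)."""
    per_tree_variance = end_node_prior_sd(k, num_trees) ** 2
    # Variances of independent normal draws add, so the sum has
    # variance num_trees * 9 / (k^2 * num_trees) = 9 / k^2.
    return math.sqrt(num_trees * per_tree_variance)


if __name__ == "__main__":
    num_trees = 200  # illustrative value, not taken from the paper
    for k in (0.5, 1.0, 2.0, 4.0):
        print(k, end_node_prior_sd(k, num_trees), marginal_prior_sd(k, num_trees))
```

Under these assumptions the marginal standard deviation does not depend on the number of trees, which is why k alone controls how diffuse the prior on the summed prediction is.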
We evaluate the sensitivity of binary BART to k in the context of three classification problems:
1. Determining Republican/Democrat presidential votes from income, race, and sex in the 2008 U.S. national election
[2]. After dropping incomplete cases, there are 1426 observations, one continuous variable and two categorical,
yielding 7 predictors for BART to split.
2. Predicting the presence of cardiac arrhythmia using patient physical attributes and characteristics of
electrocardiograms [3]. After dropping variables with missing or constant values, 191 continuous features were
combined with 66 binary ones to create 257 predictors. The dataset contains 452 observations.
3. One of the MONK’s problems, a constructed example wherein positive cases derive from specific relations between
subsets of categorical features [4]. There are 432 observations and 6 categorical variables, so that BART has 15
predictors to use. The specific problem fit is the first of three, in which positive cases arise when either the first two
categorical variables are the same or the fifth takes on a specific value.
The latter two datasets are part of the UCI Machine Learning Repository [5].
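The structure of the first MONK’s problem can be reproduced by enumerating every combination of the six attributes. The sketch below is illustrative only: the attribute level counts and the target value of 1 for the fifth attribute follow the public description of the dataset rather than this text, and the encoding rule (one column for a two-level attribute, one indicator per level otherwise) is an assumption that happens to reproduce the 15 predictors reported above.

```python
from itertools import product

# Number of levels of attributes a1..a6 (assumed from the dataset description).
LEVELS = (3, 3, 2, 3, 4, 2)


def monk1_label(row):
    """Positive when a1 equals a2, or when a5 takes the value 1 (assumed)."""
    a1, a2, _, _, a5, _ = row
    return int(a1 == a2 or a5 == 1)


# Full factorial grid over all attribute levels.
rows = list(product(*(range(1, n + 1) for n in LEVELS)))
labels = [monk1_label(r) for r in rows]

# Assumed encoding: binary attribute -> 1 column, otherwise one indicator per level.
num_predictors = sum(1 if n == 2 else n for n in LEVELS)

print(len(rows), sum(labels), num_predictors)  # prints: 432 216 15
```

The grid has as many rows as the dataset has observations, and half of them are positive, so a classifier that ignores the interaction between the first two attributes cannot do well.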
To test the sensitivity of binary BART to k, we use K-fold cross-validation to calculate the misclassification rate on
held-out data across a range of values. For 200 replications, a random fifth of the data are set aside while BART is fit to
the remainder using 500 burn-in samples and 200 posterior draws. In addition to binary BART, we fit logistic regressions
to each dataset. The examples used are notable for the difficulties they present to standard techniques. The first and second
exhibit quasi-complete separation [6], while the third is such that if interactions are included, the number of observations
exactly equals the number of predictors. Consequently, we use a Bayesian version of logistic regression which penalizes
the likelihood by imposing a Cauchy prior on the regression coefficients [7]. In addition to a main effects model obtained
by regressing against all variables as initially defined, we also fit a second order model that contains quadratic terms and
Statist. Med. 0000, 00 1–3 Copyright © 0000 John Wiley & Sons, Ltd.
Prepared using simauth.cls [Version: 2010/03/10 v3.00]
[Figure 1 plot not reproduced. Title: “Sensitivity of Binary BART to End-Node Parameter”; axis label: misclassification rate.]
Figure 1. Misclassification rate for held-out data as a function of the binary BART parameter k. The thick black lines are BART results, gray lines are from a Bayesian logistic
regression using only main effects, and thin lines are from similar regressions with second order effects. For ANES, the thin and gray lines largely overlap; for MONK1, the thin
line is on the horizontal axis.
interactions. For the arrhythmia data, we use second order terms obtained from only the five most statistically significant
predictors in the main-effects model.
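The hold-out procedure described above (200 replications, a random fifth of the data set aside each time) can be sketched as follows. This is not the code used for the reported results: `holdout_misclassification` and `fit_majority` are hypothetical names, the majority-class rule merely stands in for a real model, and the toy data are invented for illustration.

```python
import random


def holdout_misclassification(X, y, fit, n_reps=200, holdout_frac=0.2, seed=0):
    """Average misclassification rate over repeated random hold-out splits.

    `fit(X_train, y_train)` must return a function mapping one row to 0 or 1.
    """
    rng = random.Random(seed)
    n = len(y)
    n_test = max(1, round(holdout_frac * n))
    rates = []
    for _ in range(n_reps):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        errors = sum(predict(X[i]) != y[i] for i in test)
        rates.append(errors / n_test)
    return sum(rates) / n_reps


def fit_majority(X_train, y_train):
    """Placeholder model: always predict the majority class of the training set."""
    majority = int(2 * sum(y_train) >= len(y_train))
    return lambda row: majority


# Toy data (invented): 50 rows, 80% of which are labelled 1.
X = [[float(i)] for i in range(50)]
y = [1 if i % 5 else 0 for i in range(50)]
rate = holdout_misclassification(X, y, fit_majority, n_reps=20)
print(round(rate, 3))
```

To trace a curve like the one in Figure 1, the same function would be called once per candidate value of k, with `fit` replaced by a wrapper around whichever binary BART implementation is available.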
The results of this comparison are shown in Figure 1. The vote prediction problem is such that most of the predictors
used have little relationship to the outcome: the second order logistic model performs little better than the main effects fit
and binary BART excels on out of sample data only as its predictions are shrunk towards the baseline success probability.
The predictor-rich arrhythmia data exhibits widely varying performance, but is best with a relatively diffuse prior/strong
preference for the likelihood. Finally, MONK’s problem shows that too much prior shrinkage can prevent BART from
identifying deep interactions.
These three examples demonstrate how difficult it is to find an optimal hyperparameter without directly exploring that
space. Attempting to estimate the optimal k from a simpler model would show little relationship between the predictors
and the outcome in the ANES data, while missing the deeper interactions in a case like the MONK dataset.
Exhaustively exploring the space of all higher-order interactions to determine if this is the case is computationally
prohibitive. Finally, the arrhythmia example shows how sensitive performance can be in the context of a difficult problem.
While it is beyond the scope of this work to directly investigate how to choose k, streamlining the cross-validation
procedure has shown some promise and, because BART is a Bayesian algorithm, placing a hyper-prior on k to account for
uncertainty in k itself remains an option.
[1] Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Annals of Applied Statistics
2010; 4(1):266–298.
[2] The American National Election Studies. The ANES 2008 time series study 2008. URL http://www.
[3] Guvenir HA, Acar B, Demiroz G, Cekin A. A supervised machine learning algorithm for arrhythmia analysis.
Proceedings of the Computers in Cardiology Conference, Lund, Sweden, 1997.
[4] Thrun SB, Bala J, Bloedorn E, Bratko I, Cestnik B, Cheng J, Jong KD, Dzeroski S, Fahlman SE, Fisher D, et al. The
MONK’s problems: A performance comparison of different learning algorithms. Technical Report 1991.
[5] Bache K, Lichman M. UCI machine learning repository 2013. URL
[6] Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika
1984; 71(1):1–10.
[7] Gelman A, Jakulin A, Pittau MG, Su YS. A weakly informative default prior distribution for logistic and other
regression models. The Annals of Applied Statistics 2008; :1360–1383.