Access to this full-text is provided by Springer Nature.
Content available from BMC Medical Research Methodology
This content is subject to copyright. Terms and conditions apply.
Kawabataetal.
BMC Medical Research Methodology (2024) 24:278
https://doi.org/10.1186/s12874-024-02382-4
RESEARCH Open Access
© The Author(s) 2024. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/.
BMC Medical Research
Methodology
Accounting forbias due tooutcome
data missing notatrandom: comparison
andillustration oftwo approaches
toprobabilistic bias analysis: asimulation study
Emily Kawabata1,2†, Daniel Major‑Smith1,2†, Gemma L. Clayton1,2†, Chin Yang Shapland1,2, Tim P. Morris3,
Alice R. Carter1,2, Alba Fernández‑Sanlés4, Maria Carolina Borges1,2, Kate Tilling1,2, Gareth J. Griffith1,2,
Louise A. C. Millard1,2, George Davey Smith1,2, Deborah A. Lawlor1,2 and Rachael A. Hughes1,2*
Abstract
Background Bias from data missing not at random (MNAR) is a persistent concern in health‑related research. A bias
analysis quantitatively assesses how conclusions change under different assumptions about missingness using bias
parameters that govern the magnitude and direction of the bias. Probabilistic bias analysis specifies a prior distribu‑
tion for these parameters, explicitly incorporating available information and uncertainty about their true values.
A Bayesian bias analysis combines the prior distribution with the data’s likelihood function whilst a Monte Carlo bias
analysis samples the bias parameters directly from the prior distribution. No study has compared a Monte Carlo bias
analysis to a Bayesian bias analysis in the context of MNAR missingness.
Methods We illustrate an accessible probabilistic bias analysis using the Monte Carlo bias analysis approach
and a well‑known imputation method. We designed a simulation study based on a motivating example from the UK
Biobank study, where a large proportion of the outcome was missing and missingness was suspected to be MNAR.
We compared the performance of our Monte Carlo bias analysis to a principled Bayesian bias analysis, complete case
analysis (CCA) and multiple imputation (MI) assuming missing at random.
Results As expected, given the simulation study design, CCA and MI estimates were substantially biased, with 95%
confidence interval coverages of 7–48%. Including auxiliary variables (i.e., variables not included in the substan‑
tive analysis that are predictive of missingness and the missing data) in MI’s imputation model amplified the bias
due to assuming missing at random. With reasonably accurate and precise information about the bias parameter,
the Monte Carlo bias analysis performed as well as the Bayesian bias analysis. However, when very limited information
was provided about the bias parameter, only the Bayesian bias analysis was able to eliminate most of the bias due
to MNAR whilst the Monte Carlo bias analysis performed no better than the CCA and MI.
†Emily Kawabata, Daniel Major‑Smith and Gemma L. Clayton contributed
equally to this work.
*Correspondence:
Rachael A. Hughes
rachael.hughes@bristol.ac.uk
Full list of author information is available at the end of the article
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 2 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
Conclusion The Monte Carlo bias analysis we describe is easy to implement in standard software and, in the setting
we explored, is a viable alternative to a Bayesian bias analysis. We caution careful consideration of choice of auxiliary
variables when applying imputation where data may be MNAR.
Keywords Bayesian bias analysis, Inverse probability weighting, Missing not at random, Monte Carlo bias analysis,
Multiple imputation, Probabilistic bias analysis, Sensitivity analysis, UK Biobank
Introduction
e main aim of many epidemiology studies is to esti-
mate the causal effect of an exposure on an outcome
(here onward, shortened to exposure effect). Inference
about the exposure effect may be invalid when the sam-
ple included in the analysis is not a representative (ran-
dom) sample of the target population. e choice of
method for dealing with missing data partly depends on
the mechanism causing the data to be missing (called
missingness mechanisms). ese mechanisms are com-
monly classified as missing completely at random (prob-
ability of missingness is independent of the observed and
missing data), missing at random (MAR; probability of
missingness is independent of the missing data given the
observed data) and missing not at random (MNAR; prob-
ability of missingness depends on the missing data even
after conditioning on the observed data) [1]. We focus on
a MNAR missingness mechanism where the value of a
variable directly affects its own probability of missingness
[2]. Implementations of multiple imputation (MI) and
inverse probability weighting (IPW) assume by default
that data are MAR and so may give biased results when
the missingness mechanism is MNAR. Note that imple-
mentations of MI and IPW incorporating MNAR mecha-
nisms also exist (e.g., [3–5]).
Information about the missingness mechanism may
be available from ancillary data such as instruments for
missingness [6], record-linkage data [7, 8], and respon-
siveness data [9]. In the absence of such information, the
analyst cannot distinguish between MAR and MNAR
missingness mechanisms based on the observed data
only [10]. Instead, the analyst must base their decision on
expert knowledge or available literature. When MNAR
missingness is suspected, a bias analysis (also known as
a sensitivity analysis) is recommended to quantify the
potential impact of MNAR missingness on their study
conclusions [11–13].
A bias analysis for MNAR missingness (here onward,
shortened to bias analysis) requires a model (known as
a bias model) for the data and missingness mechanism.
Two commonly used approaches are selection models
and pattern-mixture models [11] (chapter15, references
therein). In the context of an outcome MNAR, the selec-
tion model usually consists of a model for the substan-
tive analysis of interest and a model for the missingness
mechanism that characterizes how missingness depends
on the outcome. In contrast, the pattern-mixture model
describes how the distribution of the outcome depends
on missingness and may consist of a model for the sub-
stantive analysis that differs between participants with
observed and missing outcome. Both types of model can
be fitted using maximum likelihood, within a Bayesian
framework or using multiple imputation [11, 14, 15].
Under MNAR both the selection and pattern-mixture
models are unidentified models since the observed data
does not provide any information about the parameters
governing the dependency between the outcome and
missingness (known as bias or sensitivity parameters).
Setting the bias parameters to prespecified values enables
estimation of the remaining parameters of the model and
provides an estimate of the exposure effect adjusted for
bias due to MNAR (here onward, called the bias-adjusted
exposure effect estimate). By changing the values of
these bias parameters, a bias analysis estimates the bias-
adjusted exposure effect under different assumptions
about the missingness mechanism.
A bias analysis can be implemented as a determinis-
tic or probabilistic bias analysis [12, 15]. In a determin-
istic bias analysis, a range of values is specified for all
bias parameters and then for each plausible combination
of values, the bias model is estimated by fixing the bias
parameters to these values. is approach provides the
analyst with information about the range of possible esti-
mates for the exposure effect but does not indicate which
of these estimates are most likely to occur, making inter-
pretation of the results challenging [12]. Alternatively, a
probabilistic bias analysis specifies a prior probability dis-
tribution for the bias parameters which explicitly incor-
porates the analyst’s assumptions about plausible values
and the combinations of values most likely to occur. e
probabilistic bias analysis generates a distribution of bias-
adjusted exposure effect estimates which is then summa-
rised as a point estimate (e.g., the median as a measure
of central tendency) and a 95% interval estimate (e.g.,
2.5th and 97.5th percentiles as limits of the interval) that
accounts for the analyst’s uncertainty about the MNAR
missingness mechanism in addition to the usual random
sampling error.
A probabilistic bias analysis can be implemented as a
Bayesian bias analysis (where the prior distribution of the
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 3 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
bias parameters is combined with the likelihood function
for the data) or as a Monte Carlo bias analysis (where val-
ues of the bias parameters are directly sampled from their
prior distribution and then used to fix the bias param-
eters to enable estimation of the bias-adjusted exposure
effect) [16]. Generally, a Monte Carlo bias analysis is sim-
pler to understand, quicker and easier to implement as it
requires no Bayesian computation [12, 17, 18]. We note
that the term “Monte Carlo” is also used to describe sim-
ulation-based techniques for Bayesian inference. To avoid
confusion, we shall use the term “Markov Chain Monte
Carlo (MCMC)” when referring to sampling from a pos-
terior distribution and “Monte Carlo bias analysis” when
referring to a type of probabilistic bias analysis.
In the context of bias analysis to unmeasured con-
founding or misclassification, a small number of studies
have compared a Monte Carlo bias analysis to a Bayesian
bias analysis [16–21]. Along with some theoretical argu-
ments, these studies indicate that the Monte Carlo bias
analysis is a good approximation of a Bayesian bias analy-
sis provided the prior distribution for the bias parameters
only specifies plausible values given the observed data
[17–20]. Otherwise, the Monte Carlo bias analysis can
give interval estimates that are either too wide or too nar-
row [16, 19]. No study has compared a Monte Carlo bias
analysis to a Bayesian bias analysis in the context of a bias
analysis to MNAR missingness.
Currently, there is limited guidance on implementing a
probabilistic bias analysis to data MNAR. Recent excep-
tions for cross-sectional analyses include: (1) a pattern-
mixture approach where draws from a prior distribution
(of the bias parameters) are used to impute a categorical
covariate under MNAR [22, 23] and (2) a Bayesian imple-
mentation of a selection model for a partially observed
continuous outcome [24]. Additionally, in the context of
selection bias due to non-random selection of partici-
pants into a study, Banack etal. review and compare a
Monte Carlo bias analysis to an alternative approach that
simulates the entire dataset under different assumptions
about the selection bias [25] and Jayaweera et al. con-
ducted a Monte Carlo bias analysis by inversely weighting
participants based on their probability of inclusion (i.e.,
participating and remaining in the study combined) [26].
In this paper, we illustrate a Monte Carlo bias analy-
sis [12, 17] using a pattern-mixture version of fully con-
ditional specification (FCS) imputation [5, 27, 28]. Via
a data example and simulations, we compare the per-
formance of our Monte Carlo bias analysis to a Bayes-
ian bias analysis in a setting where a large proportion of
the outcome is missing and missingness is suspected to
be MNAR. R and Stata software code implementing the
Monte Carlo and Bayesian bias analyses is available from
https:// github. com/ MRCIEU/ COVID ITY_ ProbQ BA.
Methods
Hypothetical example
We want to estimate the effect of an exposure (or treat-
ment)
X
on an outcome
Y
, denoted
βX
. To estimate
βX
,
our substantive analysis is a generalised linear regres-
sion of
Y
on
X
adjusted for measured confounders
Z
and
W
where g−
1
Y
(·
)
denotes the inverse link function. We
assume all confounders of the
Y
–
X
association are meas-
ured and without error, and in the absence of missing
data that the substantive analysis would give unbiased
results for
βX
. Outcome
Y
is observed in a small pro-
portion of study participants. e study recorded data
on auxiliary variables (i.e., variables not included in the
substantive analysis) that are predictive of the miss-
ing values of
Y
and whether
Y
was observed or miss-
ing. Also, a small proportion of participants are missing
data on exposure
X
and some of the confounders and
auxiliary variables. Let
Z
and
W
denote the fully and
partially observed confounders, respectively, and
A
and
D
denote the fully and partially observed auxiliary vari-
ables, respectively. To simplify the notation, and with-
out loss of generality, we assume that
A
denotes a single
variable, and similarly for
D,Z
and
W
. Binary variables
MY,MX,MW
and
MD
denote the missingness indicators
of
Y,X,W
and
D
, respectively (e.g.,
MY=1
when
Y
is
missing and
MY=0
otherwise).
Figure 1 depicts two missingness directed acy-
clic graphs (m-DAGs [29]) showing the relationships
among the variables of our substantive analysis of inter-
est (
Y,X,Z
and
W
), the auxiliary variables (
A
and
D
),
and the missingness mechanisms of
Y
, and of
X,W
and
D
. Note that m-DAGs do not specify the form of these
relationships (e.g., nonlinear relationships between var-
iables). Exposure effect,
βX
, represents the total effect
of
X
on
Y
(i.e., direct effect and indirect effect via aux-
iliaries A and D). We consider two scenarios, when
βX
is not-null (Fig.1a) and null (Fig.1b). Note that
UWZ
and
UDA
denote unmeasured shared ancestors of
W
and
Z
, and
D
and
A
, respectively. Outcome
Y
is MNAR
depending on fully observed auxiliary
A
, the missing
values of
Y
, and the observed and missing values of
exposure
X
and auxiliary
D
. Note that missingness of
Y
does not depend on
W
or
Z
, and we exclude the special
case where the MNAR mechanism depends on
X
and
Y
independently [30]. Variables
X,W
and
D
are MAR
depending on fully observed confounder
Z
and auxil-
iary
A
; hence this MAR mechanism applies across all
missing data patterns of
Y,X,W
, and
D
.
(1)
E
(Y|X,Z,W)=g
−1
Y
(β0+β
X
X+β
Z
Z+β
W
W
)
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 4 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
Complete case analysis andignorable missing data
methods
Popular missing data methods include complete case
analysis (CCA) and ignorable implementations of MI
and IPW that assume MAR (hereafter referred to as
MI and IPW, respectively). A detailed comparison of
these methods is provided elsewhere (e.g., [10, 31, 32]).
For our hypothetical example, CCA and MI are
expected to give a biased estimate for
βX
in both the null
and not-null scenarios. Since missingness of
Y
depends
jointly on
X
and
Y
, CCA is an invalid approach even when
the substantive analysis is a logistic regression [30]. For
MI, the MAR assumption is not valid, regardless of the
variables included in the imputation model, since miss-
ingness of
Y
depends directly on
Y
(i.e., path
Y→MY
).
See the Supplementary materials for information about
IPW in the context of our hypothetical example.
We next describe two non-ignorable missing data
methods, a Bayesian bias analysis using a selection
model (here onward, called Bayesian SM) and a Monte
Carlo bias analysis using a pattern-mixture model (here
onward, called Monte Carlo NARFCS). e bias mod-
els of Bayesian SM and Monte Carlo NARFCS consist
of a collection of generalised linear regressions. For
simplicity and without loss of generality, we describe
Bayesian SM and Monte Carlo NARFCS with respect to
continuous variable
X
and binary variables
Y,W,A
and
D
, whilst
Z
is left unspecified and, by definition, miss-
ingness indicators
MY,MX,MW
and
MD
are binary.
Bayesian SM
Bias model specied asaselection model
We use the sequential modelling approach [33–36] to
jointly model the substantive analysis, the MNAR miss-
ingness mechanism for
Y
, and the models to estimate the
missing values of
X,W
and
D
. e sequential modelling
approach factorises a joint distribution into a sequence of
simpler univariate distributions, where each univariate
distribution is modelled using an appropriate regression
model (e.g., linear regression for continuous variables
and logistic regression for binary variables). We specify
the following regression models for the joint distribution
of
W
,
X
,
Y
,
A
,
D
,
MY|Z
:
where
expit
k=exp k /1+exp k , and
δSM
is
the bias parameter representing the difference in the log-
odds of observing
Y
between those with
Y=1
and
Y=0
,
conditional on
D,A,Y
and
X
. Note that as missing-
ness of
Y
is conditionally independent of
W
and
Z
given
D, A, Y
and
X
then the bias model correctly assumes that
Pr
M
Y
=1|D,A,Y,X,W,Z
=Pr
M
Y
=1|D,A,Y,X
for all values of
W
and
Z
. Le t
SM
denote the set of all
(2)
Pr
(W=1|Z)∼expit{η0+ηZZ},X|W,Z∼N
ζ0+ζWW+ζZZ,ξ
2
,
Pr
(Y=1|X,W,Z)∼expit{β0+βXX+βWW+βZZ},
Pr
(A=1|Y,X,W,Z)∼expit{θ0+θYY+θXX+θWW+θZZ},
Pr
(D=1|A,Y,X,W,Z)∼expit{ω0+ωAA+ωYY+ωXX+ωWW+ωZZ}
,
Pr
MY=1|D,A,Y,X
∼expit
ψ0+ψ
D
D+ψ
A
A+δSM Y+ψ
X
X
,
Fig. 1 Missingness directed acyclic graphs (m‑DAGs) of the scenario investigated by the simulation study when the exposure effect,
βX
,
is (a) not‑null and (b) null. Black edges depict the relationships in the fully observed data, and the blue and red edges depict the missingness
mechanisms of the outcome and baseline variables (exposure, confounders, and auxiliary variables), respectively
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 5 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
estimable parameters of model [2] (i.e., all except
δSM
),
noting
SM
includes exposure effect
βX
.
Different orderings of these regression models may
result in different joint distributions [37]. We specified
this ordering because it includes: the substantive analysis,
a model for the MNAR missingness mechanism of
Y
, and
incorporates auxiliary variables
A
and
D
without altering
the substantive analysis. is ordering is compatible with a
selection model framework. e ordering of the remaining
models can be with respect to the amount of missing data
(i.e., starting with the model for the variable with the least
amount of missing data). Note that previous studies have
reported that a Bayesian implementation of the sequential
modelling approach appears robust to the ordering of the
models [38, 39] but it may affect computational time [33].
Prior probability distributions
We assign independent prior distributions for all param-
eters. Following standard practice, we assign a normal
distribution for each coefficient of the regression mod-
els and an inverse gamma distribution for the variance
parameter of a linear regression [40]. For
δSM
we assign
Normal distribution
δSM
∼N
µ
SM
,σ
SM
where values
for mean
µSM
and variance
σSM
are chosen based on
external information such as published results, expert
opinion, or external data. In practice, external informa-
tion about
δSM
may be unattainable. Instead, it may be
easier to obtain external information about a related
parameter (such as the marginal difference in the log-
odds of observing
Y
between those with
Y=1
and
Y=0
) that can then be converted into information about
δSM
. We illustrate this concept when deriving values for
hyperparameters
µSM
and
σSM
in our motivating exam-
ple. For all remaining parameters,
SM
, we assign vague
priors; namely,
N(0,100)
for the coefficients and Inv-
Gamma(0.01,0.01) for the variances.
Bayesian implementation
In the Bayesian framework, Bayes’ theorem is applied
to combine the prior distributions for the bias model
parameters with the likelihood function for the data to
obtain the joint posterior distribution of (
δSM
,
�SM
).
erefore, application of Bayes’ theorem may rule out
certain values of
δSM
because they are incompatible with
the data [16]. From the joint posterior distribution of
(
δSM ,�SM
), we can derive the conditional posterior dis-
tribution of a single parameter, such as
βX
.
e Bayesian framework views the missing data
of
Y,X,W,
and
D
and parameters
δSM
and
SM
as
unknown quantities to be estimated. Since direct sam-
pling from the joint distribution of these unknown
quantities is difficult, we fit the selection model using
MCMC estimation, specifically Gibbs sampling imple-
mented by JAGS (version 4.3.0) [41–43] using R package
jagsUI (version 1.5.2) [44].
Monte Carlo NARFCS
Bias model specied asapattern‑mixture model
We use the Not-At-Random Fully Conditional Speci-
fication (NARFCS) approach [5] which is an MNAR
extension of the MAR imputation method FCS [28] (see
references therein for other variants of FCS). Like FCS,
NARFCS imputes each variable under a separate univari-
ate regression model (of type appropriate to the variable
being imputed) and updates the missing data for each
variable in turn using an iterative algorithm which we
shall call the FCS algorithm [45, 46]. Note that the uni-
variate distributions implied by these regression models
may not be consistent with the same joint distribution
and different orderings of these regression models within
the FCS algorithm could lead to sampling from dif-
ferent joint distributions [46, 47]. In practice, FCS has
been shown to be a robust approach even when the set
of regression models are not compatible with the same
joint distribution ( [46], references therein). e order in
which the partially observed variables are updated within
the FCS algorithm is typically determined by the amount
of missing data [27].
We specify the following regression models for our
NARFCS bias model:
where
δNARFCS
is the bias parameter, representing the
difference in the log-odds of
Y=1
between those with
observed and missing values of
Y
. Let
NARFCS
denote
the set of all estimable parameters of model [3] (i.e., all
except
δNARFCS
), which does not include
βX
.
NARFCS differs from FCS in two ways which we shall
illustrate using the regression model for
Y
in the bias
model, [3], above. First, NARFCS includes missingness
(3)
Pr
W=1|Y,X,D,A,Z,M
Y
,M
X
,M
D
∼expit
α0+αYY+αXX+αDD+αAA+αZZ+αMYM
Y
+αMXM
X
+αMDM
D
,
X
Y,W,D,A,Z,MY,MW,MD∼Nγ0+γYY+γWW+γDD+γAA+γZZ+γMYMY+γMWMW+γMDMD,ε2,
Pr
D=1|Y,X,W,A,Z,MY,MX,MW∼expitκ0+κYY+κXX+κWW+κAA+κZZ+κMYMY+κMXMX+κMWMW,
Pr
Y=1|X,W,D,A,Z,MY,MX,MW,MD
∼expit
0+
X
X+
W
W+
D
D+
A
A+
Z
Z+δNARFCSMY+
M
XMX+
M
WMW+
M
DMD
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 6 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
indicator
MY
as an independent variable in the regression
model for
Y
in order to quantify how the distribution of
Y
differs between participants with observed and miss-
ing values of
Y
. Hence NARFCS belongs to the class of
pattern-mixture models. Second, NARFCS includes the
missingness indicators of the other partially observed
variables,
MX,MW,MD
, as independent variables in the
regression model for
Y
in order to maximise the amount
of correlation between the variables captured by the
model [5]. Note that the regression model for
X
omits
MX
as an independent variable because we assume
X
is
MAR given
A
and
Z
. Similarly, for the regression models
of
W
and
D
.
Prior probability distributions
NARFCS does not assign a prior distribution for
δNARFCS ,
instead it fixes
δNARFCS
to a prespecified value before
applying the FCS algorithm. For remaining param-
eters,
NARFCS
, NARFCS independently samples the
parameters of each regression model from a posterior
distribution (or an approximation) under a vague prior
distribution (to ensure uncertainty from estimating the
imputation model parameters is propagated through to
the resulting imputations [27]). For example, for a regres-
sion with coefficients
υ
(and if applicable variance param-
eter
ς
) Stata command mi impute chained (version 17
[48]) and R package mice (version 3.14.0) specify prior
p(ν,ς)∝
1
ς
for a linear regression and prior
p(ν)∝1
for a logistic regression [49, 50].
Note that Tompsett etal. [5] illustrate a determinis-
tic bias analysis using NARFCS where (in the context of
our hypothetical example) the user prespecifies multiple
values for
δNARFCS
and then repeatedly applies NAR-
FCS by fixing
δNARFCS
to each pre-specified value in
turn. As we are implementing a probabilistic bias analy-
sis using NARFCS, we must specify a prior distribution
for
δNARFCS
. In keeping with Bayesian SM, we use prior
p
δ
NARFCS
∼N
µ
NARFCS
,σ
NARFCS
with
µNARFCS
and
σNARFCS
set to values based on external information
about
δNARFCS
, or more practically on a related parameter
that is then converted into information about
δNARFCS
.
Monte Carlo bias analysis
e Monte Carlo bias analysis repeatedly samples
directly from the prior distribution for
δNARFCS
before
fitting the bias model. erefore, no sampled values of
δNARFCS
are rejected due to incompatibility with the
observed data. Using the NARFCS bias model, we gener-
ate a Monte Carlo frequency distribution of bias-adjusted
estimates of
βX
by repeatedly carrying out the following
steps
S
(S>1)
times: for
s=1, ··· ,S
i. Randomly draw a value for the bias parameter
directly from its prior distribution,
δNARFCS(s)
∼
N
µ
NARFCS
,σ
NARFCS
.
ii. Impute the observed data
K
(K≥1)
times using the
NARFCS bias model with the bias parameter fixed
at
δNARFCS(s)
. Fit the substantive analysis separately
to each imputed dataset using maximum likelihood
estimation and combine the multiple sets of results
for
βX
using Rubin’s rules [1]. Let
β
δ
NARFCS(s)
X
and
V
δ
NARFCS(s)
X
denote the combined estimate of
βX
and
accompanying variance, respectively.
iii. Incorporate random sampling error
β
δ
NARFCS(s)
X
∼
N
βδNARFCS(s)
X,
VδNARFCS(s)
X
.
After
S
steps, we compute the median, 2.5th and
97.5th percentiles of the frequency distribution of
β
δ
NARFCS(1)
X
,
βδ
NARFCS(2)
X
,··· ,
βδ
NARFCS(S)
X
to obtain our Monte
Carlo NARFCS point and interval estimates of
βX
. Monte
Carlo NARFCS was implemented in R using the NAR-
FCS extension to mice [51] and in Stata using mi impute
with option offset.
Simulation study design
We compared the performance of Monte Carlo NAR-
FCS with Bayesian SM when a large proportion of data
were missing under a very strong MNAR mechanism.
We evaluated these methods when the prior distribution
for the bias parameter was (i) inaccurate and imprecise,
(ii) accurate and reasonably precise, and (iii) accurate
and very precise. We repeated the simulation study for
βX=0
and
βX=ln(3)
and for two data generating mod-
els: based on the selection model framework (SM data
generating model) and the pattern-mixture model frame-
work (PMM data generating model). For all combinations
of the simulation settings, we generated 1000 simulated
data sets, each with 100,000 observations for the full
sample.
Generation ofthecomplete data
e simulation study was based on the hypotheti-
cal example described above with the exception that
Z=(Z1,Z2,Z3)
denotes three fully observed confound-
ers and
A=(A1,A2)
denotes two fully observed auxiliary
variables. Exposure
X
and
Z2
were continuous variables
with mean 0 and standard deviation of 1, and the remain-
ing variables were binary (
Z1,Z3,A,
outcome
Y
, partially
observed confounder
W
, partially observed auxiliary
D
and missingness indicators
MY,MX,MW
, and
MD
).
First, we simulated (complete) data on
X,Y,Z,W,A,D,
and
MY
from their joint distribution factorised into
a series univariate regressions: logistic regression for
Y,Z1,Z3,W,A,D,
and
MY
, and linear regression for
X
and
Z2
. We considered two factorisations of this
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 7 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
joint distribution, with the factorisation for the SM and
PMM data generating models chosen to resemble the
bias model of Bayesian SM and Monte Carlo NARFCS,
respectively. See the Supplementary materials for further
details.
Most of the parameter values of the SM data gener-
ating model were set to estimates from an analysis of a
real dataset (our motivating example, described in the
next section). Note that the value of
δSM
was artificially
derived to simulate a very strong MNAR mechanism.
e marginal prevalence of
Y
was fixed at 5% for both the
βX=0
and
βX=ln(3)
scenarios. We were unable to ana-
lytically derive the corresponding parameter values of the
PMM data generating model. Instead, we fitted the PMM
data generating model to a dataset of 50,000,000 observa-
tions simulated under the SM data generating model and
then used the resulting estimates as the parameter values
of the PMM data generating model [4].
Generation ofthemissing data
Following generation of the complete data, which
included missingness indicator
MY
, values of
Y
were set
to missing when
MY=1
. Missing data for
X,W,
and
D
were subsequently generated independently of each other
and of
Y
using the following missingness mechanisms of
MAR given fully observed variables:
where all parameter values were derived from the
observed data of our motivating example. ese miss-
ingness mechanisms were the same for both the
βx=0
and
βx=ln(3)
scenarios and the SM and PMM data gen-
erating models, resulting in a non-monotone missingness
pattern. Close to 5% of the observations of
X,W,
and
D
were set as missing.
Missing data methods andevaluation
Probabilistic bias analyses, Bayesian SM and Monte Carlo
NARFCS, were implemented as described previously
Based on running standard convergence checks [40] on
one randomly selected dataset, Bayesian SM was applied
with 50,000 iterations, of which 5,000 were burn-in itera-
tions. Monte Carlo NARFCS was applied with 10,000
Monte Carlo steps and single imputation within each
step. To assess sensitivity to the number of Monte Carlo
steps and imputed datasets, we also conducted Monte
Carlo NARFCS using 10,000 Monte Carlo steps with five
imputations, and 5,000 Monte Carlo steps with single
(4)
Pr
M
X
=1|Z1,Z2,Z3,A1,A2
=expit{−3.20 +0.233 ×Z1−0.0570 ×Z2−0.133 ×Z3+0.363 ×A1+0.763 ×A2}
,
PrMW=1|Z1,Z2,Z3,A1,A2=expit {−2.90 +0.0720 ×Z1−0.232 ×Z2−0.774 ×Z3+0.169 ×A1+0.417 ×A2}
,
Pr
MD=1|Z1,Z2,Z3,A1,A2
=expit{−2.95 −0.0590 ×Z1−0.0290 ×Z2−0.190 ×Z3+0.130 ×A1+0.192 ×A2
}
imputation. e number of burn-in iterations of the FCS
algorithm was always set to 10. We applied Bayesian
SM and Monte Carlo NARFCS with three different pri-
ors for the bias parameter: (i) vague prior
N(0,100)
, (ii)
informative prior
N(truth,4
)
, and (iii) very informative
prior
N(truth,1
)
, where
truth
denotes the true value of
the bias parameter. Note that the true value of
δNARFCS
was unknown (since it was not a parameter of either data
generating model) and so we instead used an estimate
of
δNARFCS
based on a simulated dataset of 50,000,000
observations.
We compared Bayesian SM and Monte Carlo NARFCS
to a CCA and MI. We applied MI using FCS imputation
with 10 burn-in iterations and 50 imputations, imput-
ing the binary and continuous variables using logistic
and linear regressions, respectively. (See Supplementary
materials for further details on all missing data methods).
e estimand of interest was the exposure effect
βX
.
For the SM data generating model, the true value of
βX
was known as it was a parameter of this model, whilst
for the PMM data generating model, a value for
βX
was
computed by fitting the substantive analysis to a data-
set of size 50,000,000 before data deletion. Performance
measures of interest were bias, empirical, and model-
based standard errors, and 95% confidence interval (CI)
coverage of estimates of
βX
. We used Stata version 17.0
[48] to generate the data. e remaining methods were
conducted in R 4.1.0 [52]. Bayesian SM and Monte Carlo
NARFCS were applied using high performance comput-
ing for parallel processing [53] across the simulated data-
sets. R package rsimsum (version 0.11.3) [54] was used to
compute the simulation results.
Motivating example
e motivating example for our simulation study is a pre-
viously described study where the substantive analysis of
interest is a logistic regression of SARS-CoV-2 infection
(0 not infected, 1 infected) on body mass index (BMI)
adjusted for confounders age, sex (0 female, 1 male), uni-
versity degree (0 no, 1 yes), and current smoker (0 no, 1
yes) [55]. ere are three auxiliary variables: diagnosis of
asthma (0 no, 1 yes), diabetes (0 no, 1 yes), and hyper-
tension (0 no, 1 yes). is motivating example illustrates
derivation of an informative prior for
δSM
and
δNARFCS
.
As this is an illustrative example, we have ignored other
potential sources of bias (such as selection bias due to
non-random participation in UK Biobank [56]), and we
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 8 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
have only considered a small number of confounders of
the outcome–exposure relationship.
Motivating case study
Using data from the UK Biobank study (UKB) [56], we
define our target population as middle aged and elderly
adults (aged 47 – 86, with close to 75% of participants
aged 61 or older) resident and alive in England on 1st
January 2020. Active SARS-CoV-2 infection was defined
as either a positive SARS-CoV-2 PCR test (from linked
Public Health England data) or COVID-19 recorded on a
death certificate between 1st January 2020 and 18th May
2020 (i.e., the date mass testing became available in the
UK; [57]). Testing for SARS-CoV-2 was highly restricted
during this period and so data on SARS-CoV-2 infec-
tion were missing for over 98% of participants. Data on
SARS-CoV-2 infection were suspected to be MNAR since
testing among the majority of the UK population (i.e.,
non-healthcare workers) was mainly restricted to those
who experienced symptoms of COVID-19 [58]. Observed
factors associated with the chance of being tested in UKB
included having higher BMI, being a current smoker,
having a pre-existing condition (such as asthma, diabetes,
or hypertension), being female, and having a university
degree or higher [55].
Among the 445,377 participants included in the UKB
study, we excluded 24,465 (5.49%) participants who died
before 2020 and 65 (0.0146%) who were not tested for
SARS-CoV-2 but were diagnosed with COVID-19 post-
mortem. Of the remaining 420,847 participants eligible
for analysis, 405,174 (96.3%) were missing the outcome
only, 10,870 (2.58%) were missing the outcome and
at least one covariate (BMI, smoker, or degree), 4,610
(1.10%) had complete data and 193 (0.0459%) had an
observed outcome but were missing at least one covariate
(Supplementary Table3). Confounders age and sex, and
auxiliary variables asthma, diabetes, and hypertension
were fully observed. Figure2 shows the m-DAG for this
motivating example based on subject-matter knowledge
and our investigations of observed predictors of miss-
ingness (Supplementary tables4 and 5). We assume the
covariate data were MAR and there were no unmeasured
common causes after accounting for age, sex, degree,
smoker, BMI, asthma, diabetes, and hypertension.
Statistical analyses
We analysed the data using CCA, MI, Bayesian SM,
Monte Carlo NARFCS, and a “population-based compar-
ison group approach” where untested participants were
assumed to be not infected with SARS-CoV-2 [59–61].
(See the Supplementary materials for further details).
Due to convergence problems encountered when apply-
ing Bayesian SM to the full data, we restricted all analy-
ses to the 409,784 participants with complete data on
the covariates. is simplified the imputation, weighting,
and bias models by reducing the number of parameters
to be estimated. Given the small percentage of dropped
participants (the majority of which had a missing out-
come), the characteristics of the full sample and the
restricted sample were virtually the same (Supplemen-
tary Table6). In keeping with the preceding paper [55],
and to improve the efficiency of MCMC sampling by
Fig. 2 Missingness directed acyclic graph for the UK Biobank example. Black edges depict the assumed relationships in the fully observed data
between the outcome (SARS‑CoV‑2 infection), exposure (body mass index (BMI)), confounders (age, sex, degree, and smoker), and auxiliary variables
(asthma, diabetes, and hypertension). Tested, MBMI, and Mdegree,smoker denote missingness indicators for the outcome, exposure, and confounders,
respectively. Blue and red edges depict the missingness mechanisms of the outcome and covariates (exposure and confounders), respectively.
Note, we have not included all edges between the variables
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 9 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
reducing autocorrelation in the chains, each continuous
variable (age and BMI) was standardized by subtracting
its observed mean and dividing by its observed stand-
ard deviation. ese standardised variables were used
in all analyses. We applied MI with 50 imputed datasets,
Bayesian SM using 50,000 MCMC iterations (including
5,000 burn-in iterations), and Monte Carlo NARFCS with
10,000 Monte Carlo steps and single imputation. Bayes-
ian SM and Monte Carlo NARFCS were applied using an
informative prior for
δSM
and
δNARFCS
, respectively.
Derivation oftheinformative prior
δSM
and
δNARFCS
e hyperparameters of the informative priors p
δ
SM
∼
N
µ
SM
,σ
SM
and p
δ
NARFCS
∼N
µ
NARFCS
,σ
NARFCS
were
derived from published results of the REal-time Assess-
ment of Community Transmission-2 (REACT-2) national
study [62]. e REACT-2 study sent home-based SARS-
CoV-2 antibody test kits to over 100,000 randomly sam-
pled adults living in England between 20th June and 13th
July 2020. Among 65–74-year-olds (similar age range to
our study), SARS-CoV-2 antibody prevalence was esti-
mated to be 3.2% [95% CI 2.8–3.6%] [62] by mid-July
2020.
Bias parameters
δSM
and
δNARFCS
are conditional
parameters on the log-odds scale. So, we used an
algorithm from Tompsett etal. [5] to compute approxi-
mate values of
δSM
and
δNARFCS
calibrated to mar-
ginal prevalences of SARS-CoV-2 infection. For
prior p
δ
NARFCS
∼N
µ
NARFCS
,σ
NARFCS
, we set
µNARFCS =−2.6
(the value of
δNARFCS
calibrated to a
prevalence of 3.2%) and set
σNARFCS =0.222
such that
95% of the sampled values of
δNARFCS
were expected to be
between -3.0 and -2.2 (which were the values of
δNARFCS
calibrated to prevalences of approximately 2.2% and 4.2%,
respectively). Note that we allowed for additional uncer-
tainty because the prevalence of infection was unknown
in our UKB study. e comparable prior for Bayesian SM
was
p
δ
SM
∼N
2.6, 0.22
2
. See Supplementary materi-
als for further details.
Results
Simulation study results
When there were no missing data, the full data estimate of
βX
was unbiased and CI coverage was close to the nominal
level in all scenarios. Figure3 shows the bias and coverage
of estimating
βX
in the presence of missing data using dif-
ferent missing data methods when the true value of
βX
was
ln(3)
and 0 and the data were generated using the SM data
generating model (detailed results reported in Supplemen-
tary tables8 and 9). ere was substantial bias and severe
Fig. 3 Bias and 95% confidence interval coverage of exposure effect,
βX
, according to the not null (
β
X=
ln(3)
) and null (
βX=0
) scenarios for data
generated using SM data generating model. Error bars denote 95% Monte Carlo intervals, and the vertical dashed line denotes zero bias (top)
and nominal coverage (bottom). Results for Bayesian SM were based on 926–928 simulated datasets; the remaining methods were based on 1,000
simulated datasets
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 10 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
CI under-coverage for the CCA estimates, with similar lev-
els of bias for
βX=ln(3)
and
βX=0
but slightly higher CI
coverage for
βX=ln(3)
due to wider CIs. When
βX=0
,
MI had broadly comparable levels of bias and CI cover-
age to CCA. However, when
βX=ln(3)
, the bias of the
MI estimates was noticeably larger than that of CCA. is
was likely due to amplification of the bias (resulting from
incorrect assumptions about the missingness mechanism)
caused by including variables in the imputation model
that were strongly predictive of
Y
[63] (see Supplementary
materials for further details).
For both the
βX=ln(3)
and
βX=0
scenarios, there was
negligible bias when applying Monte Carlo NARFCS with
an informative or very informative prior. Applying Monte
Carlo NARFCS with a vague prior resulted in biased esti-
mates where the level of bias was slightly lower than that of
CCA for the
βX=0
scenario but higher for the
βX=ln(3)
scenario (and comparable to that of MI). In accordance
with MI, the higher level of bias for the
βX=ln(3)
sce-
nario was likely due to the auxiliary variables amplifying
the bias from misspecification of the missingness mecha-
nism. Despite the (relatively) high level of bias, CI cover-
age was nominal due to the imprecision of the vague prior.
Very similar results were obtained when applying Monte
Carlo NARFCS with 10,000 Monte Carlo steps with 5
imputations and 5,000 Monte Carlo steps with single
imputation (Supplementary tables15 and 16).
Method Bayesian SM failed to produce results for 72
to 74 simulated datasets (further details in Supplemen-
tary Sect.3.5) whilst the other methods returned results
for all 1,000 simulated datasets. Similar to Monte Carlo
NARFCS, applying Bayesian SM with an informative or
very informative prior resulted in minimal bias. How-
ever, compared to Monte Carlo NARFCS, Bayesian SM
showed slightly higher levels of bias and inefficiency (i.e.,
larger empirical standard errors), leading to moderate
levels of CI under-coverage. is seeming under-perfor-
mance of Bayesian SM may have been due to the omitted
estimates caused by nonconvergence in a small number
of datasets. Unlike Monte Carlo NARFCS, Bayesian SM
with a vague prior eliminated some of the bias in both
the
βX=ln(3)
and
βX=0
scenarios, with bias levels
at least 50% lower than those of CCA. Also, the model-
based standard errors of Bayesian SM were considerably
smaller than those of Monte Carlo NARFCS. A likely
explanation is that some information was gained from
the application of Bayes’ theorem combining the prior
for
δSM
with the observed data. Supporting this claim,
we note that when applied with an a priori mean of 0 for
δSM
, across the simulations the mean of the posterior
estimates of
δSM
was 8.83 (95% Monte Carlo interval 8.31
to 8.96) and 6.36 (95% Monte Carlo interval 6.20 to 6.52)
for the
βX=ln(3)
and
βX=0
scenarios, respectively
(where the true value was 7.85).
For both Bayesian SM and Monte Carlo NARFCS with
(very) informative priors, there was CI overcoverage
when the estimates of
βX
were unbiased (or negligibly
biased). is overcoverage was likely due to generating
the data using a fixed value for the bias parameter which
is known to lead to CI overcoverage when applying an
analysis with an informative prior centred on the true
value of the parameter [64].
Similar patterns were noted on the relative perfor-
mances of the methods for data generated using the
PMM data generating model (Supplementary tables 18
and 19). For both data generating models and
βX=ln(3)
and
βX=0
scenarios, Bayesian SM took substantially
longer to run than Monte Carlo NARFCS with Monte
Carlo NARFCS taking approximately 2days per dataset
in R (approximately 1day per dataset in Stata) and Bayes-
ian SM taking approximately 6days per dataset.
Results ofthemotivating example
Of the 409,784 participants included in our analysis with
complete covariate data, 4,610 (1.12%) were tested for
SARS-CoV-2, leaving 405,174 (98.9%) with a missing
outcome. Out of the 4,610 participants tested for SARS-
CoV-2, 1,317 (28.6%) tested positive. Figure 4 shows the
results for the exposure odds ratio (i.e., odds ratio of SARS-
CoV-2 infection per standard deviation increase in BMI)
estimated using CCA, MI, Bayesian SM, Monte Carlo
NARFCS, and the population-based comparison group
approach. All analyses suggested that participants with a
higher BMI tended to be at a higher risk of SARS-CoV-2
infection. e two probabilistic bias analyses, Bayesian
SM and Monte Carlo NARFCS, gave similar results with
slightly higher point estimates than CCA and MI, although
there was substantial overlap between the CIs of these
methods. e results for the population-based comparison
group approach were markedly different from those of the
other methods.
e patterns in the results were consistent with our
prior knowledge that untested participants tended to
have a lower BMI and were less likely to have experienced
symptoms of SARS-CoV-2 infection than tested partici-
pants. For example, under this missingness mechanism
we expected that dropping untested participants would
lead to an underestimate of the exposure odds ratio (as
demonstrated by the simulation study) and setting all
untested participants as “not infected” would lead to an
overestimate. All analyses except CCA were based on the
untested and tested participants but had similar levels of
precision to that of CCA. is was unsurprising given
that (i) the precision of binary outcome estimators is pri-
marily determined by the number of cases (i.e., positive
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 11 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
SARS-CoV-2 infections) and (ii) for our study population
and study period, the prevalence of SARS-CoV-2 infec-
tion was estimated to be relatively low (3.2% [95% CI
2.8–3.6%] [62]) and so a large proportion of the untested
participants were likely not infected with SARS-CoV-2.
e distinct results of the population-based comparison
group approach was due to the imposed extreme sce-
nario which implied that the prevalence of infection in
the study sample was only 0.32%.
Discussion
We have illustrated the feasibility and practicality of
conducting a probabilistic bias analysis to data MNAR
when a large proportion of an outcome is missing under
a strong MNAR mechanism. In the specific setting we
considered, our simulation study demonstrated that
given reasonably accurate and precise information about
the bias parameter, the simpler, Monte Carlo NARFCS
method performed as well as the more principled, Bayes-
ian SM method. When very limited information was pro-
vided about the bias parameter, the Bayesian bias analysis
was able to eliminate most of the bias due to data MNAR
while the Monte Carlo bias analysis performed no better
than the CCA and the MAR implementation of MI. We
have also shown how including auxiliary variables in an
imputation model can amplify bias due to data MNAR.
Monte Carlo NARFCS has three key advantages for
non-specialist analysts over the Bayesian SM approach:
(1) a Monte Carlo bias analysis is simpler and less daunt-
ing to implement because it does not require knowledge
about Bayesian inference and specialist statistical soft-
ware. (2) e bias model of Monte Carlo NARFCS uses
an MNAR-extension of the popular imputation approach,
FCS, which has been implemented in several software
environments. (3) For our study, Monte Carlo NARFCS
was less computationally demanding than Bayesian SM,
resulting in substantially faster run-times. erefore, it
is encouraging that Monte Carlo NARFCS can perform
as well as the more principled Bayesian SM. is is sup-
ported by previous research, which has established the
robustness of FCS imputation to its theoretical weak-
ness (that the joint distribution implied by the univariate
regression models may not exist [46, 47, 65]). During the
simulation study we experienced some minor technical
difficulties with Bayesian SM. However, these issues can
be easily resolved when applying the method in practice.
For example, nonconvergence would be identified using
standard Bayesian diagnostic tools and resolved by run-
ning a longer burn-in, and failure of the Bayesian sampler
could be rectified by using different starting values or
switching to a different Monte Carlo algorithm. In keep-
ing with McCandless and Gustafson [16], we found that
applying a Bayesian bias analysis using a vague prior for
the bias parameter gained some information about the
MNAR mechanism and consequently eliminated some of
the bias due to missing data. is was likely due to the
Bayesian process ruling out certain MNAR mechanisms
(i.e., values of the bias parameter) incompatible with the
observed data [16]. In contrast, since the Monte Carlo
bias analysis samples directly from the prior distribution
of the bias parameter, irrespective of the observed data,
then applying Monte Carlo NARFCS with a vague prior
performed as badly as the MAR methods. erefore, a
Bayesian bias analysis is recommended when there is
limited information available about the bias parameters.
Another difference between the two probabilistic
bias analyses is that the bias model of Bayesian SM is a
selection model while that of Monte Carlo NARFCS is
Fig. 4 Forest plot of the results for exposure odds ratio,
exp{βX}
, estimated by complete case analysis (CCA), multiple imputation assuming missing
at random (MI), population‑based comparison group approach (Missing not infected), and the probabilistic bias analyses, Monte Carlo NARFCS
and Bayesian SM. Dashed line denotes the null effect
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 12 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
a pattern-mixture model. e advantage of the selection
model framework is that it is coherent with our under-
standing of how the observed data arises and there is a
logical separation of the parameters of interest from the
bias parameters [66]. However, others have argued that
the bias parameters of the pattern-mixture model are
usually easier to interpret and so this framework is more
convenient for conducting bias analyses [67–69]. In our
applied example, the available external information was
not ideally suited for the bias parameter of either the
selection or pattern-mixture model. Overall, the pattern-
mixture framework is credited as being more accessible
and widely available [70], although the selection model
framework may be preferable when the missingness
mechanism is of primary interest.
Our simulation study has several limitations. First,
our comparison of a Bayesian bias analysis to a Monte
Carlo bias analysis also differed with respect to the bias
model. However, the primary objective of our study was
to illustrate an easy to apply probabilistic bias analy-
sis (for non-specialist analysts) and to compare it to a
principled approach. Second, we simulated the data
using a fixed value for the bias parameter (as opposed
to sampling from an appropriate prior). However, we
consider the anticipated overcoverage of the probabilis-
tic bias analyses acceptable as we focus on what Rubin
termsconfidence validity(i.e., intervals that coverat least
nominally)rather than randomisation validity (i.e., inter-
vals that cover exactly nominally) [71]. ird, we only
explored a small number of scenarios because of the time
it took to run each probabilistic bias analysis in a large
data setting (typical of cohort studies). To achieve our
objective of evaluating the robustness of Monte Carlo
NARFCS using a small-scale simulation study, we consid-
ered an extreme setting of a large proportion of missing
data with a strong MNAR mechanism.
We note that the sequential modelling approach of
Bayesian SM and the FCS-type approach of Monte Carlo
NARFCS can both flexibly incorporate nonlinear terms
and interactions between the outcome and the predictors
(i.e., covariates and auxiliary variables), and between the
predictors (e.g., [34, 37, 72]). Future work should com-
pare Bayesian SM and Monte Carlo NARFCS when the
bias models include nonlinear or interaction terms.
Alternative approaches to a bias analysis are available
[11]. ese include (i) reference based methods used for
handling missing data in randomized clinical trials (e.g.,
[73]), (ii) placing restrictions on the model parameters
(e.g., [74]), (iii) instrumental variable(s) for missingness
[6], and (iv) use of additional data (e.g., information from
recontacting nonparticipants [75]).
In the extreme setting we explored, our simpler Monte
Carlo bias analysis is a viable alternative to a Bayesian bias
analysis provided information is available on plausible val-
ues of the bias parameter. However, when limited infor-
mation is available, a Bayesian bias analysis is preferred.
By illustrating two different types of probabilistic bias
analyses and providing code to replicate them, we hope to
encourage the increased adoption of such bias analyses in
epidemiological research. Finally, in keeping with [63, 76],
we caution careful consideration of the choice of auxiliary
variables when applying MI where data may be MNAR.
Abbreviations
BMI Body mass index
CCA Complete case analysis
FCS Fully conditional specification
IPW Inverse probability weighting
MAR Missing at random
MCMC Markov Chain Monte Carlo
MI Multiple imputation
MNAR Missing not at random
NARFCS Not‑at‑random fully conditional specification
PMM Pattern mixture model
REACT‑2 REal‑time Assessment of Community Transmission‑2
SM Selection model
UKB UK Biobank study
Supplementary Information
The online version contains supplementary material available at https:// doi.
org/ 10. 1186/ s12874‑ 024‑ 02382‑4.
Supplementary Material 1.
Acknowledgements
Not applicable.
Authors’ contributions
Authors EK, DM‑S, GLC, CYS, TPM, ARC, AF‑S, MCB, GJG, LACM and RAH
designed the study with critical review from KT, DL AND GDS. EK, DM‑S and
GLC performed the simulation study and statistical analyses under the supervi‑
sion of RAH, CYS and TPM. EK, DM‑S, GLC and RAH drafted the paper with input
from the remaining authors. All authors were responsible for critical revision of
the manuscript and have approved the final version to be published.
Funding
This work was supported by the Bristol British Heart Foundation (BHF)
Accelerator Award (AA/18/7/34219), the University of Bristol and Medical
Research Council (MRC) Integrative Epidemiology Unit (MC_UU_00032/01, 02
& 05), the BHF‑National Institute of Health Research (NIHR) COVIDITY flagship
project, the John Templeton Foundation (61917), and the Wellcome Trust
and Royal Society (215408/Z/19/Z). TPM was funded by the UKRI Medical
Research Council, grant number MC_UU_00004/09 and GJF was funded by an
MQ fellowship, grant number MQF22\22. The computation work was carried
out using the computational facilities of the Advanced Computing Research
Centre, University of Bristol—http://www.bristol.ac.uk/acrc/.
Data availability
The software code to generate the simulated datasets analysed during the simu‑
lated study are available in the COVIDITY_ProbQBA repository, https://github.
com/MRCIEU/COVIDITY_ProbQBA. The UK Biobank study dataset analysed
during the current study is available from the UK Biobank Access Management
Team (https://www.ukbiobank.ac.uk/learn‑more‑about‑uk‑biobank/contact‑us)
but restrictions may apply to the availability of these data, which were used
under license for the current study, and so are not publicly available. All methods
discussed in this paper can be implemented using the provided software code
available from the COVIDITY_ProbQBA repository, https://github.com/MRCIEU/
COVIDITY_ProbQBA.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 13 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
Declarations
Ethics approval and consent to participate
For the simulation study, data were completely simulated, which did not
require approval from an ethics committee or consent from participants.
UKB received ethical approval from the UK National Health Service’s National
Research Ethics Service (ref. 11/NW/0382). All participants provided written
and informed consent for data collection, analysis, and record linkage. This
research was conducted under UKB application number 16729.
Consent for publication
Not applicable.
Competing interests
TPM has received consultancy fees from: Bayer Healthcare Pharmaceuticals,
Alliance Pharmaceuticals, Gilead Sciences, and Kite Pharmaceuticals. Since
January 2023, ARC has been an employee of Novo Nordisk Research Centre
Oxford, which is not related to the current work and had no involvement in
the decision to publish. The remaining authors declare that they have no
competing interests.
Author details
1 MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK. 2 Popula‑
tion Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK.
3 MRC Clinical Trials Unit at UCL, London, UK. 4 MRC Unit for Lifelong Health
and Ageing at University College London, London, UK.
Received: 24 March 2024 Accepted: 21 October 2024
References
1. Rubin D. Inference and missing data. Biometrika. 1976;63:581–92.
2. Li Y, Miao W, Shpitser I, Tchetgen Tchetgen EJ. A self‑censoring model for
multivariate nonignorable nonmonotone missing data. Biometrics. 2023;:
1–12.
3. Giusti C, Little RJ. An analysis of nonignorable nonresponse to
income in a survey with a rotating panel design. J Official Statistics.
2011;27(2):211–29.
4. White IR, Carpenter J, Horton NJ. A mean score method for sensitivity
analysis to depatures from the missing at random assumption in ran‑
domised trials. Stat Sin. 2018;28(4):1985–2003.
5. Tompsett DM, Leacy F, Moreno‑Betancu M, Heron J, White IR. On the use
of the not‑at‑random fully conditional specification (NARFCS) procedure
in practice. Stat Med. 2018;37:2338–53.
6. Tchetgen Tchetgen EJ, Wirth KE. A general instrumental variable
framework for regression analysis with outcome missing not at random.
Biometrics. 2017;73:1123–31.
7. Cornish RP, Macleod J, Carpenter JR, Tilling K. Multiple imputation using
linked proxy outcome data results in important bias reduction and
efficiency gains: a simulation study. Emerg Themes Epidemiol. 2017;14:14.
8. Gray L, Gorman E, White IR, Vittal Katikireddi S, McCartney G, Rutherford
L, et al. Correcting for non‑participation bias in health surveys usuing
record‑linkage, synthetic observations and pattern mixture modelling.
Stat Methods Med Res. 2020;29(4):1212–26.
9. Doidge JC. Responsivenss‑informed multiple imputation and inverse
probability weighting in cohort studies with missing data that are
non‑monotone or not missing at random. Stat Methods Med Res.
2018;27(2):352–63.
10. Hughes RA, Heron J, Sterne JA, Tilling K. Accounting for missing data in
statistical analyses: multiple imputation is not always the answer. Int J
Epidemiol. 2019;48(4):1294–304.
11. Little RA, Rubin DB. Statistical Analysis with Missing Data. 3rd ed. New
York: John Wiley & Sons; 2019.
12. Fox MP, MacLehose RF, Lash TL. Applying quantitative bias analysis to
epidemiologic data. 2nd ed. New York: Springer; 2021.
13. Lee K, Tilling K, Cornish R, Little R, Bell M, Goetghebeur E, et al. Framework
for the treatment and reporting of missing data in observational studies:
The Treatment And Reporting of Missing data in Observational Studies
framework. J Clin Epidemiol. 2021;134:79–88.
14. Andridge RR, Little RJ. Proxy pattern‑mixture analysis for survey nonre‑
sponse. J Official Statistics. 2011;27(2):153–80.
15. Andridge R, Little RJ. Proxy pattern‑mixture analysis for a binary variable
subject to nonresponse. J Official Statistics. 2020;36(3):703–28.
16. McCandless LC, Gustafson P. A comparison of Bayesian and Monte
Carlo sensitivity analysis for unmeasured confounding. Stat Med.
2017;36(18):2887–901.
17. Greenland S. Multiple‑bias modelling for analysis of observational data. J
R Stat Soc Ser A. 2005;168:267–306.
18. MacLeohse RF, Gustafson P. Is probabilistic bias analysis approximately
Bayesian? Epidemiology. 2012;23(1):151–8.
19. Flanders WD, Waller LA, Zhang Q, Getahun D, Silverberg M, Goodman
M. Negative Control Exposures ‑ Causal effect identifiability and use in
probabilistic‑bias and Bayesian analyses with unmeasured confounders.
Epidemiology. 2022;33(6):832–9.
20. Gustafson P, McCandless L. Priors, parameters and probability ‑ A Bayesian
perspective on sensitivity analysis. Epidemiology. 2014;25(6):910–2.
21. Corbin M, Haslett S, Pearce N, Maule M, Greenland S. A comparison of
sensitivity‑specificity imputation, direct imputation and fully Bayesian
analysis to adjust for exposure misclassification when validation data are
unavailable. Int J Epidemiol. 2017;46(3):1063–72.
22. Gachau S, Quartagno M, Njeru Njagi E, Owuor N, English M, Ayieko
P. Handling missing data in modelling quality of clinician‑prescribed
routine care: Sensitivity analysis of departure from missing at random
assumption. Stat Methods Med Res. 2020;29(10):3076–92.
23. Smuk M, Carpenter JR, Morris TP. What impact do assumptions about
missing data have on conclusions? A practical sensitivity analysis for a
cancer survival registry. BMC Med Res Methodol. 2017;17:21.
24. Mason A, Richardson S, Plewis I, Best N. Strategy for modelling nonran‑
dom missing data mechanisms in observational studies using Bayesian
methods. J Official Stat. 2012;28(2):279–302.
25. Banack HR, Hayes‑Larsin E, Mayeda E. Monte Carlo Simulation
Approaches for Quantitative Bias Analysis: A Tutorial. Epidemiolog Rev.
2021;43:106–17.
26. Jayaweera RT, Bradshaw PT, Gerdts C, Egwuatu I, Grosso B, Kristianingrum
I, et al. Accounting for misclassification and selection bias in estimat‑
ing effectiveness of self‑managed medication abortion. Epidemiology.
2023;34:140–9.
27. White IR, Royston P, Wood AM. Multiple imputation using chained equa‑
tions: Issues and guidance for practice. Stat Med. 2011;30:377–99.
28. van Buuren S. Multiple imputation of discrete and continuous data by
fully conditional specification. Stat Methods Med Res. 2007;16:219–42.
29. Moreno‑Betancur M, Lee KJ, Leacy FP, White IR, Simpson JA, Carlin JB.
Canonical Causal Diagrams to Guide the Treatment of Missing Data in
Epidemiologic Studies. Am J Epidemiol. 2018;187(12):2705–15.
30. Bartlett JW, Harel O, Carpenter JR. Asymptotically unbiased estimation
of exposure odds ratios in complete records logistic regression. Am J
Epidemiol. 2015;182(8):730–6.
31. Carpenter JR, Smuk M. Missing data: A statistical framework for practice.
Biom J. 2021;63:915–47.
32. Little RJ, Carpenter JR, Lee KJ. A comparison of three popular methods for
handling missing data: complete‑analysis, inverse probability weighting,
and multiple imputation. Sociol Methods Res. 2024;53(3):1105–35.
33. Erler NS, Rizopoulos D, Rosmalen JV, Jaddoe VW, Franco OH, Lesaffre EM.
Dealing with missing covariates in epidemiologic studies: a comparison
between multiple imputation and a full Bayesian approach. Stat Med.
2016;35(17):2955–74.
34. Lüdtke O, Robitzsch A, West SG. Regression models involving nonlinear
effects with missing data: A sequential modeling approach using Bayes‑
ian estimation. Psychol Methods. 2020;25(2):157.
35. Ibrahim JG, Chen M, Lipsitz SR. Bayesian methods for generalized
linear models with covariates missing at random. Canadian J Statistics.
2002;30(1):55–78.
36. Du H, Enders C, Keller BT, Bradbury TN, Karney BR. A Bayesian latent vari‑
able selection model for nonignorable missingness. Multivar Behav Res.
2022;57(2–3):478–512.
37. Bartlett JW, Seaman SR, White IR, Carpenter JR. Alzheimer’s disease neuro‑
imaging initiative. Multiple imputation of covariates by fully conditional
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 14 of 14
Kawabataetal. BMC Medical Research Methodology (2024) 24:278
specification: accommodating the substantive model. Stat Methods Med
Res. 2015;24(4):462–87.
38. Chen M, Ibrahim JG. Maximum likelihood methods for cure rate models
with missing covariates. Biometrics. 2001;57(1):43–52.
39. Zhu J, Raghunathan TE. Convergence properties of a sequen‑
tial regression multiple imputation algorithm. J Am Stat Assoc.
2015;110(511):1112–24.
40. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 3rd ed.
New York: Chapman Hall/CRC; 2013.
41. Gelfand AE, Smith AF. Sampling‑Based Approaches to Calculating Mar‑
ginal Densities. J Am Stat Assoc. 1990;85(410):398–409.
42. Casella G, George EI. Explaining the Gibbs Sampler. Am Stat.
1992;46(3):167–74.
43. Plummer M. JAGS Version 4.3.0 user manual. https:// people. stat. sc. edu/
hanso nt/ stat7 40/ jags_ user_ manual. pdf; 2017. Available from: chrome‑
extension://efaidnbmnnnibpcajpcglclefindmkaj/. Cited 2024 March 19.
44. Kellner K, Meredith M. jagsUI: a wrapper around ‘rjags’ to streamline ‘JAGS’
analyses. 2024. Available from: https:// kenke llner. com/ jagsUI/. Cited 2024
March 19.
45. van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing
blood pressure covariates in survival analysis. Statistics in Medicine.
1999;18:681–94.
46. Hughes RA, White IR, Seaman RS, Carpenter JR, Tilling K, Sterne JA. Joint
modelling rationale for chained equations. BMC Med Res Methodol.
2014;14:28.
47. Liu J, Gelman A, Hill J, Su Y, Kropko J. On the stationary distribution of
iterative imputations. Biometrika. 2014;101(1):155–73.
48. StataCorp. Stata Statistical Software: Release 17. 2021.
49. StataCorp. Stata 17 Multiple‑Imputation Reference Manual College Sta‑
tion. TX: Stata Press; 2021.
50. van Buuren S, Groothuis‑Oudshoorn K. mice: Multivariate Imputation by
Chained Equations in R. J Stat Softw. 2011;45(3):1–67. https:// doi. org/ 10.
18637/ jss. v045. i03.
51. Moreno‑Betancur M, Leacy F, Tompsett D, White I. mice: The NARFCS
procedure for sensitivity analyses. 2019. Available from: https:// github.
com/ moreno‑ betan cur/ NARFCS/ blob/ master/ README. md. Cited 2023
September 4.
52. R Core Team. R: A Language and Environment for Statistical Computing. 2021.
53. University of Bristol. High Performance Computing. Available from:
https:// www. brist ol. ac. uk/ acrc/ high‑ perfo rmance‑ compu ting/. Cited
2023 12 30.
54. Gasparini A. rsimsum: Summarise results from Monte Carlo simulation
studies. J Open Source Software. 2018;3: 739.
55. Millard LC, Fernández‑Sanlés A, Carter AR, Hughes RA, Tilling K, Morris TP,
Major‑Smith D, Griffith GJ, Clayton GL, Kawabata E, Davey Smith G, Lawlor
DA, Borges MC. Exploring the impact of selection bias in observational
studies of COVID‑19: a simulation study. Int J Epidemiol. 2023;52(1):44–57.
https:// doi. org/ 10. 1093/ ije/ dyac2 21.
56. Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, et al.
Comparison of Sociodemographic and Health‑Related Characteristics
of UK Biobank Participants with Those of the General Population. Am J
Epidemiol. 2017;186(9):1026–34.
57. Armstrong J, Rudkin JK, Allen N, Crook DW, Wilson DJ, Wyllie DH, et al.
Dynamic linkage of COVID‑19 test results between public health Eng‑
land’s second generation surveillance system and UK Biobank. Microbial
Genomics. 2020;6(7):mgen000397.
58. Carter AR, Clayton GL, Borges MC, Howe LD, Hughes RA, Davey Smith
G, et al. Time‑sensitive testing pressures and COVID‑19 outcomes~: are
socioeconomic inequalities over the first year of the pandemic explained
by selection bias. BMC Public Health. 1863;2023(23):1–12.
59. COVID‑19 Host Genetics Initiative. Mapping the human genetic architec‑
ture of COVID‑19. Nature. 2021; 600: 472–477.
60. Lassale C, Gaye B, Hamer M, Gale CR, Batty GD. Ethnic disparities in
hospitalisation for COVID‑19 in England: the role of socioeconomic fac‑
tors, mental health, and inflammatory and proinflammatory factors in a
community‑based cohort study. Brain Behav Immun. 2020;88:44–9.
61. Western Cape Department of Health in collaboration with the National
Institute for Communicable Diseases SA. Risk factors for coronavirus
disease 2019 (COVID‑19) death in a population cohort study from the
Western Cape Province. South Africa Clin Infect Dis. 2021;73:e2005‑2015.
62. Ward H, Atchison C, Whitaker M, Ainslie KE, Elliott J, Okell L, et al. SARS‑
CoV‑2 antibody prevalence in England following the first peak of the
pandemic. Nat Commun. 2021;12(1):905.
63. Thoemmes F, Rose N. A cautious note on auxiliary variables that
can increase bias in missing data problems. Multivar Behav Res.
2014;49:443–59.
64. White IR, Pham TM, Quartagno M, Morris TP. How to check a simulation
study. Int J Epidemiol. 2024;53(1):1–7. https:// doi. org/ 10. 1093/ ije/ dyad1
34.
65. van Buuren S, Brand JL, Groothuis‑Oudshoorn CM, Rubin DB. Fully
conditional specification in multivariate imputation. J Stat Comput Simul.
2006;76(12):1049–64.
66. Scharfstein DO, Daniels MJ, Robins JM. Incorporating prior beliefs about
selection bias into the analysis of randomized trials with missing out‑
comes. Biostatistics. 2003;4:495–512.
67. White IR, Carpenter JE, Evans S, Schroter S. Eliciting and using expert
opinions about dropout bias in randomized controlled trials. Clin Trials.
2007;4(2):125–39.
68. Daniels MJ, Hogan JW. Reparameterizing the Pattern Mixture Model
for Sensitivity Analyses Under Informative Dropout. Biometrics.
2000;56(4):1241–8.
69. White IR, Higgins JP, Wood AM. Allowing for uncertainty due to
missing data in meta‑analysis ‑ Part 1: Two‑stage methods. Stat Med.
2008;27(5):711–27.
70. Carpenter JR, Kenward MG. Sensitivity analysis with multiple imputation.
In: Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G, edi‑
tors. Handbook of missing data methodology. New York: CRC Press; 2015.
p. 435–70.
71. Rubin DB. Multiple Imputation After 18+ Years. J Am Stat Assoc.
1996;91(434):473–89.
72. Tilling K, Williamson EJ, Spratt M, Sterne JA, Carpenter JR. Appropriate
inclusion of interactions was needed to avoid bias in multiple imputa‑
tion. J Clin Epidemiol. 2016;80:107–15.
73. Cro S, Morris TP, Kenward MG, Carpenter JR. Sensitivity analysis for clinical
trials with missing continuous outcome data using controlled multiple
imputation: A practical guide. Stat Med. 2020;39:2815–42.
74. Linero AR, Daniels MJ. Bayesian approaches for missing not at ran‑
dom outcome data: The role of identifying restrictions. Stat Sci.
2018;33(2):198–213.
75. Karvanen J, Tolonen H, Harkanen T, Jousliahti P, Kuulasamaa K. Selection
bias was reduced by recontacting nonparticipants. J Clin Epidemiol.
2016;76:209–17.
76. Curnow E, Cornish RP, Heron JE, Carpenter JR, Tilling K. Multiple imputa‑
tion assuming missing at random: auxiliary imputation variables that only
predict missingness can increase bias due to data missing not at random.
2023. Available from: https:// www. medrx iv. org/ conte nt/ 10. 1101/ 2023. 10.
17. 23297 137v1. Cited 2024 March 24.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in pub‑
lished maps and institutional affiliations.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
not:
use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
control;
use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
writing;
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at
onlineservice@springernature.com