Missing Not at Random Models for Latent Growth Curve Analyses
Craig K. Enders
Arizona State University
The past decade has seen a noticeable shift toward missing data handling techniques that assume a missing at
random (MAR) mechanism, where the propensity for missing data on an outcome is related to other analysis
variables. Although MAR is often reasonable, there are situations where this assumption is unlikely to hold,
leading to biased parameter estimates. One such example is a longitudinal study of substance use where
participants with the highest frequency of use also have the highest likelihood of attrition, even after
controlling for other correlates of missingness. There is a large body of literature on missing not at random
(MNAR) analysis models for longitudinal data, particularly in the field of biostatistics. Because these methods
allow for a relationship between the outcome variable and the propensity for missing data, they require a
weaker assumption about the missing data mechanism. This article describes 2 classic MNAR modeling
approaches for longitudinal data: the selection model and the pattern mixture model. To date, these models
have been slow to migrate to the social sciences, in part because they required complicated custom computer
programs. These models are now quite easy to estimate in popular structural equation modeling programs,
particularly Mplus. The purpose of this article is to describe these MNAR modeling frameworks and to
illustrate their application on a real data set. Despite their potential advantages, MNAR-based analyses are not
without problems and also rely on untestable assumptions. This article offers practical advice for implement-
ing and choosing among different longitudinal models.
Keywords: missing data, pattern mixture model, selection model, attrition, missing not at random
Supplemental materials: http://dx.doi.org/10.1037/a0022640.supp
Missing data handling techniques have received considerable
attention in the methodological literature during the past 40 years.
This literature has largely discredited most of the simple proce-
dures that have enjoyed widespread use for decades, including
methods that discard incomplete cases (e.g., listwise deletion,
pairwise deletion) and approaches that impute the data with a
single set of replacement values (e.g., mean imputation, regression
imputation, last observation carried forward). The past decade has
seen a noticeable shift to analytic techniques that assume a missing
at random (MAR) mechanism, whereby an individual’s propensity
for missing data on a variable Y is potentially related to other
variables in the analysis (or in the imputation model) but not to the
unobserved values of Y itself (Little & Rubin, 2002; Rubin, 1976).
Maximum likelihood estimation and multiple imputation are ar-
guably the predominant MAR-based approaches, although inverse
probability weighting methods have gained traction in the statistics
literature (e.g., Carpenter, Kenward, & Vansteelandt, 2006; Robins
& Rotnitzky, 1995; Scharfstein, Rotnitzky, & Robins, 1999). A
number of resources are available to readers who are interested in
additional details on these methods (e.g., Carpenter et al., 2006;
Enders, 2010; Little & Rubin, 2002; Rotnitzky, 2009; Schafer,
1997; Schafer & Graham, 2002).
Although the MAR mechanism is often reasonable, there are situ-
ations where this assumption is unlikely to hold. For example, in a
longitudinal study of substance use, it is reasonable to expect partic-
ipants with the highest frequency of use to have the highest likelihood
of attrition, even after controlling for other correlates of missingness.
Similarly, in a study that examines quality of life changes throughout
the course of a clinical trial for a new cancer medication, it is likely
that patients with rapidly decreasing quality of life scores are more
likely to leave the study because they die or become too ill to
participate. The previous scenarios are characterized by a relationship
between the outcome variable (i.e., substance use, quality of life) and
the propensity for missing data. This so-called missing not at random
(MNAR) mechanism is problematic because MAR-based analyses
are likely to produce biased parameter estimates. Unfortunately, there
is no empirical test of the MAR mechanism, so it is generally
impossible to fully rule out MNAR missingness. This underscores the
need for MNAR analysis methods.
There is a large body of literature on MNAR analysis models for longitudinal data, particularly in the field of biostatistics (e.g.,
Albert & Follmann, 2000, 2009; Diggle & Kenward, 1994; Follmann
& Wu, 1995; Little, 1995, 2009; Molenberghs & Kenward, 2007;
Verbeke, Molenberghs, & Kenward, 2000; Wu & Bailey, 1989; Wu
& Carroll, 1988). This literature addresses a wide variety of substan-
tive applications and includes models for categorical outcomes, count
data, and continuous variables, to name a few. Although researchers
are sometimes quick to discount MAR-based analyses, MNAR mod-
els are not without their own problems. In particular, MNAR analyses
rely heavily on untestable assumptions (e.g., normally distributed
latent variables), and even relatively minor violations of these as-
sumptions can introduce substantial bias. This fact has led some
methodologists to caution against the routine use of these models
(Demirtas & Schafer, 2003; Schafer, 2003). A common viewpoint is
that MNAR models are most appropriate for exploring the sensitivity
of one’s results to a variety of different assumptions and conditions. Despite their potential problems, MNAR models are important options to consider, particularly when outcome-related attrition seems plausible. At the very least, these procedures can augment the results from an MAR-based analysis.

2011, Vol. 16, No. 1, 1–16
© 2011 American Psychological Association
Correspondence concerning this article should be addressed to Craig K. Enders, Box 871104, Department of Psychology, Arizona State University, Tempe, AZ 85287–1104. E-mail: firstname.lastname@example.org
Although MNAR analysis models have been in the literature for
many years, they have been slow to migrate to the social and the
behavioral sciences. To date, most substantive applications have
appeared in the medical literature (e.g., Hogan, Roy, & Korkont-
zelou, 2004; Kenward, 1998; Michiels, Molenberghs, Bijnens,
Vangeneugden, & Thijs, 2002). The adoption of any novel statis-
tical procedure is partially a function of awareness but is also
driven by software availability. MNAR analyses were traditionally
difficult to implement because they required complicated custom
programming. These models are now quite easy to estimate in
popular structural equation modeling programs, particularly Mplus
(L. K. Muthén & Muthén, 1998–2010). Consequently, the purpose
of this article is to describe two classic MNAR modeling families
for longitudinal data—selection models and pattern mixture mod-
els—and illustrate their use on a real data set. Methodologists
continue to develop MNAR analysis methods, most of which
extend the models that I describe in this article (e.g., Beunckens,
Molenberghs, Verbeke, & Mallinckrodt, 2008; Dantan, Proust-
Lima, Letenneur, & Jacqmin-Gadda, 2008; Lin, McCulloch, &
Rosenheck, 2004; B. Muthén, Asparouhov, Hunter, & Leuchter,
2011; Roy, 2003; Roy & Daniels, 2008; Yuan & Little, 2009). By
limiting the scope of this article to classic techniques, I hope to
provide readers with the necessary background information for
accessing these newer approaches. B. Muthén et al. (2011) have provided an excellent overview of these recent innovations.
The organization of this article is as follows. I begin with an
overview of Rubin’s (1976) missing data theory, including a
discussion of how selection models and pattern mixture models fit
into Rubin’s definition of an MNAR mechanism. After a brief
review of growth curve models, I then describe classic selection
models and pattern mixture models for longitudinal data. Next, I
use a series of data analysis examples to illustrate the estimation
and interpretation of the models. I then conclude with a discussion
of model selection and sensitivity analyses.
Rubin’s Missing Data Theory

Some background information on Rubin’s (1976) missing data
theory is useful for understanding the rationale behind MNAR
analysis models. According to Rubin, the propensity for missing
data is a random variable that has a distribution. In practical terms,
this implies that each variable potentially yields a pair of scores: an
underlying Y value that may or may not be observed and a
corresponding R value that denotes whether Y is observed or is missing (e.g., R = 0 if Y is observed and R = 1 if Y is missing).
Under an MNAR mechanism, the data and the probability of
missingness have a joint distribution:
p(Yi, Ri | θ, φ),   (1)
where p denotes a probability distribution, Yi is the outcome variable for case i, Ri is the corresponding missing data indicator, θ is a set of parameters that describes the distribution of Y (e.g., growth model parameters), and φ contains parameters that describe the propensity for missing data on Y (e.g., a set of logistic regression coefficients that predict R). Collectively, the parameters
of the joint distribution dictate the mutual occurrence of different
Y values and missing data.
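To make this concrete, the following sketch (with arbitrary, hypothetical parameter values) simulates an MNAR mechanism in which the probability that Y is missing depends on Y itself, and shows the resulting complete-case bias:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an outcome Y and a missingness indicator R whose probability
# depends on Y itself, i.e., an MNAR mechanism (all values hypothetical).
n = 100_000
y = rng.normal(loc=50.0, scale=10.0, size=n)

# Logistic model for R: higher Y values are more likely to be missing.
logit = -2.0 + 0.1 * (y - 50.0)
p_missing = 1.0 / (1.0 + np.exp(-logit))
r = rng.binomial(1, p_missing)              # R = 1 means Y is missing

# A complete-case analysis sees only the R = 0 cases and is biased
# downward, because high scores are preferentially deleted.
complete_case_mean = y[r == 0].mean()
print(round(y.mean(), 2), round(complete_case_mean, 2))
```

Because the high scores are preferentially deleted, the complete-case mean understates the true mean, and no analysis restricted to the observed data can detect this from the data alone.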
Under an MAR mechanism, Equation 1 simplifies, and it is
unnecessary to estimate the parameters that dictate missingness (i.e., φ). For this reason, an MAR mechanism is often referred to
as ignorable missingness. In contrast, an MNAR mechanism re-
quires an analysis model that includes all parameters of the joint
distribution, not just those that are of substantive interest. In
practical terms, this means that the statistical analysis must incor-
porate a submodel that describes the propensity for missing data
(e.g., a logistic regression that predicts R). Both the selection
model and the pattern mixture model incorporate a model for R
into the analysis, but they do so in different ways.
The selection model and the pattern mixture model factor the
joint distribution of Y and R into the product of two separate
distributions. In the selection modeling framework, the joint dis-
tribution is as follows:
p(Yi, Ri | θ, φ) = p(Yi | θ) p(Ri | Yi, φ),   (2)
where p(Yi | θ) is the marginal distribution of Y, and p(Ri | Yi, φ) is the conditional distribution of missing data, given Y. The preceding
factorization implies a two-part model where the marginal distri-
bution corresponds to the substantive analysis (e.g., a growth
model) and where the conditional distribution corresponds to a
regression model that uses Y to predict the probability of missing
data. The regression of R on Y is inherently inestimable because Y
is always missing whenever R equals one. The selection model
achieves identification by imposing strict distributional assump-
tions, typically multivariate normality. The model tends to be
highly sensitive to this assumption, and even slight departures
from normality can produce substantial bias.
In the pattern mixture modeling framework, the factorization
reverses the role of Y and R as follows:
p(Yi, Ri | θ, φ) = p(Yi | Ri, θ) p(Ri | φ),   (3)
where p(Yi | Ri, θ) is the conditional distribution of Y, given a particular value of R, and p(Ri | φ) is the marginal distribution of
R. The preceding factorization implies a two-part model where
the conditional distribution of Y represents the substantive
model parameters for a group of cases that shares the same
missing data pattern and where the marginal distribution of R
describes the incidence of different missing data patterns. This
factorization implies the following strategy: Stratify the sample
into subgroups that share a common missing data pattern, and
estimate the substantive model separately within each pattern.
Although it is not immediately obvious, the pattern mixture
model is also inestimable without invoking additional assump-
tions. For example, a growth model is underidentified in a group
of cases with only two observed data points. Therefore, these
assumptions would take the form of assumed values for the
inestimable parameters. I discuss these assumptions in detail
later in the article, but suffice it to say that the model is prone
to bias when its assumptions are incorrect.
The selection model and pattern mixture model are equivalent
in the sense that they describe the same joint distribution. However,
because the two frameworks require different assumptions, they
can (and often do) produce very different estimates of the substan-
tive model parameters. There is usually no way to judge the
relative accuracy of the two models because both rely heavily on
untestable assumptions. For this reason, methodologists generally
recommend sensitivity analyses that apply different models (and
thus different assumptions) to the same data. I illustrate the appli-
cation of these models to longitudinal data later in the article.
Brief Overview of Growth Curve Models
Much of the methodological work on MNAR models has cen-
tered on longitudinal data analyses, particularly growth curve
models (also known as mixed effects models, random coefficient
models, and multilevel models). Because this article focuses solely
on longitudinal data analyses, a brief overview of the growth curve
model is warranted before proceeding. A growth model expresses
the outcome variable as a function of a temporal predictor variable
that captures the passage of time. For example, the unconditional
linear growth curve model is as follows:
Yti = β0 + β1(TIMEti) + b0i + b1i(TIMEti) + εti,   (4)
where Yti is the outcome score for case i at time t, TIMEti is the value of the temporal predictor for case i at time t (e.g., the elapsed time since the onset of the study), β0 is the mean intercept, β1 is the mean growth rate, b0i and b1i are residuals (i.e., random effects) that allow the intercepts and the change rates, respectively, to vary across individuals, and εti is a time-specific residual that captures the difference between an individual’s fitted linear trajectory and
his or her observed data. The model can readily incorporate non-
linear change by means of polynomial terms. For example, the
unconditional quadratic growth model is as follows:
Yti = β0 + β1(TIMEti) + β2(TIMEti²) + b0i + b1i(TIMEti) + b2i(TIMEti²) + εti,   (5)
where β0 is the mean intercept, β1 is the average instantaneous linear change when TIME equals zero, and β2 is the mean curvature. As before, the model uses a set of random effects to incorporate individual heterogeneity into the developmental trajectories (i.e., b0i, b1i, and b2i), and εti is a time-specific residual.
The previous models are estimable from the multilevel, mixed
model or from the structural equation modeling frameworks.
Structural equation modeling—and the Mplus software package, in
particular—provides a convenient platform for estimating MNAR
models. Cast as a structural equation model, the individual growth
components (i.e., b0i, b1i, and b2i) are latent variables, the means of which (i.e., β0, β1, and β2) define the average growth trajectory.
To illustrate, Figure 1 shows a path diagram of a linear growth
model from a longitudinal study with four equally spaced assess-
ments. The unit factor loadings for the intercept latent variable
reflect the fact that the intercept is a constant component of each
individual’s idealized growth trajectory, and the loadings for the
linear latent variable capture the timing of the assessments (i.e., the
TIME scores in Equation 4). A quadratic growth model incorporates
an additional latent factor with loadings equal to the square of the
linear factor loadings. A number of resources are available to readers
who want additional details on growth curve models (Bollen &
Curran, 2006; Hancock & Lawrence, 2006; Hedeker & Gibbons,
2006; Singer & Willett, 2003). As an aside, mixed modeling software
programs (e.g., PROC MIXED in SAS) can also estimate some of the
MNAR models that I describe in this article (e.g., the selection
models). Although different modeling frameworks often yield iden-
tical parameter estimates, the latent growth curve approach is argu-
ably more convenient for implementing MNAR models.
Selection Models for Longitudinal Data
Heckman (1976, 1979) originally proposed the selection model
as a bias correction method for regression analyses with MNAR
data on the outcome variable. Like their classic predecessor, se-
lection models for longitudinal data combine a substantive model
(i.e., a growth curve model) with a set of regression equations that
predict missingness. The two parts of the model correspond to the
factorization on the right side of Equation 2. The literature de-
scribes two classes of longitudinal models that posit different
linkages between the repeated measures variables and the missing
data indicators. Wu and Carroll’s (1988) model indirectly links the
repeated measures variables to the response probabilities through
the individual intercepts and slopes (i.e., the b0i, and b1i, terms in
Equation 4). This approach is commonly referred to as the random
coefficient selection model or the shared parameter model.1 In
contrast, Diggle and Kenward’s (1994) selection model directly
relates the probability of missing data at time t to the outcome
variable at time t. Although these models have commonalities,
1 Authors often treat the shared parameter model as a distinct MNAR
approach. Because the structural features of Wu and Carroll’s (1988)
model are similar to those of Diggle and Kenward’s (1994) model (i.e., one
or more variables from the substantive model predict missingness), I treat
both as selection models.
Figure 1. Path diagram of a linear growth model. β0 = mean intercept; β1 = mean growth rate; b0i and b1i = residuals that allow the intercepts and the change rates, respectively, to vary across individuals; Y1–Y4 = outcome variables; ε1–ε4 = time-specific residuals.
they require somewhat different assumptions and may produce
different estimates. This section provides a brief description of the
two models, and a number of resources are available to readers
who are interested in additional technical details (Albert & Foll-
mann, 2009; Diggle & Kenward, 1994; Little, 2009; Molenberghs
& Kenward, 2007; Verbeke, Molenberghs, & Kenward, 2000).
Wu and Carroll’s (1988) Model
Wu and Carroll’s (1988) model uses the individual growth
trajectories to predict the probability of missing data at time t. To
illustrate, Figure 2 shows a path diagram of a linear growth curve
model of the type developed by Wu and Carroll. The rectangles
labeled R2, R3, and R4 are missing data indicators that denote whether the outcome variable is observed at a particular assessment (e.g., Rt = 0 if Yt is observed, and Rt = 1 if Yt is missing). Note that the model does not require an R1 indicator when the baseline assessment is complete, as is the case in the figure. The
dashed arrows that link the latent variables (i.e., the individual
intercepts and slopes) to the missing data indicators represent
logistic regression equations.2 Regressing the indicator variables
on the intercepts and slopes effectively allows the probability of
missing data to depend on the entire set of repeated measures
variables, including the unobserved scores from later assessments.
Although this proposition may seem awkward, linking the re-
sponse probabilities to the intercepts and slopes is useful when
missingness is potentially dependent on an individual’s overall
developmental trajectory rather than a single error-prone realization of the outcome variable (Albert & Follmann, 2009; Little, 1995).
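A small simulation conveys the logic (all parameter values are hypothetical): when dropout is a function of the individual slopes b1i, the completers’ slopes no longer represent the population average.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical illustration: dropout depends on the individual slope b1i
# (the shared parameter), not on any single realization of the outcome.
n = 50_000
b1 = rng.normal(0.0, 1.0, size=n)           # individual slope deviations

# Logistic regression of the dropout indicator on the random slope:
# steeper decliners (more negative b1i) are more likely to drop out.
logit = -1.0 - 1.5 * b1
p_drop = 1.0 / (1.0 + np.exp(-logit))
dropout = rng.binomial(1, p_drop)

# The completers' slopes overstate the population mean slope (zero here),
# because the steepest decliners are missing from the observed data.
completer_mean_slope = b1[dropout == 0].mean()
print(round(completer_mean_slope, 3))
```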
Diggle and Kenward’s (1994) Model
Diggle and Kenward’s (1994) model also combines a growth
curve model with a set of regression equations that predict miss-
ingness. However, unlike Wu and Carroll’s (1988) model, the
probability of missing data at wave t depends directly on the
repeated measures variables. To illustrate, Figure 3 shows a path
diagram of a linear Diggle and Kenward growth curve model. As
before, the rectangles labeled R2, R3, and R4 are missing data indicators that denote whether the outcome variable is observed or missing, and the dashed arrows represent logistic regression equations. Notice that the probability of missing data at time t now depends directly on the outcome variable at time t as well as on the outcome variable from the preceding assessment (e.g., Y1 and Y2 predict R2, Y2 and Y3 predict R3, and so on).
As an aside, the logistic regression equations in the previous
models potentially carry information about the missing data mech-
anism. For example, in Diggle and Kenward’s (1994) model, a
significant path between Rt and Yt implies an MNAR mechanism because dropout at wave t is concurrently related to the outcome. Similarly, a significant association between Rt and Yt−1 provides evidence for an MAR mechanism because dropout at time t is related to the outcome at the previous assessment. Finally, the absence of any relationship between the outcomes and the missing data indicators is consistent with a missing completely at random (MCAR) mechanism because
dropout is unrelated to the variables in the model. Although it is
tempting to use the logistic regressions to make inferences about
the missing data mechanism, it is important to reiterate that these
associations are estimable only because of strict distributional
assumptions. Consequently, using the logistic regressions to eval-
uate the missing data mechanism is tenuous, at best.
Selection Model Assumptions
Although it is not immediately obvious, longitudinal selection
models rely on distributional assumptions to achieve identification,
and these distributional assumptions dictate the accuracy of the
resulting parameter estimates. For Wu and Carroll’s (1988) model,
identification is driven by distributional assumptions for the ran-
dom effects (i.e., the individual intercepts and slopes), whereas
Diggle and Kenward’s (1994) model requires distributional as-
sumptions for the repeated measures variables. Without these
assumptions, the models are inestimable (e.g., in Diggle & Kenward’s, 1994, model, the regression of Rt on Yt is inestimable
because Y is always missing whenever R equals one). With con-
tinuous outcomes, the typical practice is to assume a multivariate
normal distribution for the individual intercepts and slopes or for
the repeated measures variables. Wu and Carroll’s model addition-
ally assumes that the repeated measures variables and the missing
data indicators are conditionally independent, given the random
effects (i.e., after controlling for the individual growth trajectories,
there is no residual correlation between Yt and Rt). Collectively,
these requirements are difficult to assess with missing data, so the
accuracy of the resulting parameter estimates ultimately relies on
one or more untestable assumptions.
2 A logistic model is not the only possibility for the missing data indicators; probit models are also common.
Figure 2. Path diagram of a linear Wu and Carroll (1988) growth model. R2–R4 = missing data indicators; β0 = mean intercept; β1 = mean growth rate; b0i and b1i = residuals that allow the intercepts and the change rates, respectively, to vary across individuals; Y1–Y4 = outcome variables; ε1–ε4 = time-specific residuals.
Coding the Missing Data Indicators
Thus far, I have been purposefully vague about the missing data
indicators because the appropriate coding scheme depends on the
exact configuration of missing values. The models of Wu and
Carroll (1988) and Diggle and Kenward (1994) were originally
developed for studies with permanent attrition (i.e., a monotone
missing data pattern). In this scenario, it makes sense to utilize
discrete-time survival indicators, such that Rt takes on a value of zero prior to dropout, a value of one at the assessment where dropout occurs, and a missing value code at all subsequent assessments (e.g., B. Muthén & Masyn, 2005; Singer & Willett, 2003). In contrast, when a study has only intermittent missing values, it is reasonable to represent the indicators as a series of independent Bernoulli trials, such that Rt takes on a value of zero at any assessment where Yt is observed and takes on a value of one at any assessment where Yt is missing.
Most longitudinal studies have a mixture of sporadic missing-
ness and permanent attrition. One option for dealing with this
configuration of missingness is to use discrete-time survival indi-
cators to represent the dropout patterns and code intermittent
missing values as though they were observed (i.e., for intermittently missing values, Rt takes on a value of zero). Because
intermittent missingness is not treated as a target event, this coding
strategy effectively assumes that these values are consistent with
an MAR mechanism. A second option for dealing with intermittent
missingness and permanent attrition is to create indicators that are
consistent with a multinomial logistic regression (Albert & Foll-
mann, 2009; Albert, Follmann, Wang, & Suh, 2002), such that the
two types of missingness have distinct numeric codes. I illustrate
these various coding strategies in the subsequent data analysis examples.
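The first coding strategy can be sketched as follows; the observed-data masks and the `survival_indicators` helper are hypothetical constructions for illustration only:

```python
import numpy as np

# Hypothetical observed-data masks for a four-wave study: 1 = observed,
# 0 = missing. Dropout is defined by the last observed assessment.
observed = np.array([
    [1, 1, 1, 1],   # completer
    [1, 1, 0, 1],   # intermittent missingness only
    [1, 1, 0, 0],   # drops out after wave 2
    [1, 0, 0, 0],   # drops out after wave 1
])

def survival_indicators(mask):
    """Discrete-time survival coding for R2-R4: 0 before dropout, 1 at the
    dropout wave, missing (np.nan) afterward. Intermittent missing values
    before the last observed wave are coded 0, i.e., treated as MAR."""
    n_waves = mask.shape[1]
    out = np.zeros((mask.shape[0], n_waves - 1))
    for i, row in enumerate(mask):
        last = np.nonzero(row)[0].max()         # last observed wave (0-based)
        for t in range(1, n_waves):
            if t <= last:
                out[i, t - 1] = 0.0             # still in the study
            elif t == last + 1:
                out[i, t - 1] = 1.0             # dropout occurs here
            else:
                out[i, t - 1] = np.nan          # after dropout: not modeled
    return out

print(survival_indicators(observed))
```

Note how the second case, whose only gap is intermittent, receives zeros throughout, consistent with the strategy of treating intermittent missingness as MAR.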
Pattern Mixture Models for Longitudinal Data
Like the selection model, the pattern mixture approach inte-
grates a model for the missing data into the analysis, but it does
so in a very different way. Specifically, a pattern mixture
analysis stratifies the sample into subgroups that share the same
missing data pattern and estimates a growth model separately
within each pattern. For example, in a four-wave study with a
monotone missing data pattern, the complete cases would form
one pattern, the cases that drop out following the baseline
assessment would constitute a second pattern, the cases that
leave the study after the second wave would form a third
pattern, and the cases with missing data at the final assessment
only would form the fourth pattern. Assuming a sufficient
sample size within each pattern, the four missing data groups
would yield unique estimates of the growth model parameters.
Returning to Equation 3, these pattern-specific estimates correspond to the conditional distribution p(Yi | Ri, θ), and the group proportions correspond to p(Ri | φ).
Although the pattern-specific estimates are often informative,
the usual substantive goal is to estimate the population growth
trajectory. Computing the weighted average of the pattern-specific
estimates yields a marginal estimate that averages over the distri-
bution of missingness. For example, the average intercept from the
hypothetical four-wave study is as follows:
β̂0 = π̂(1)β̂0(1) + π̂(2)β̂0(2) + π̂(3)β̂0(3) + π̂(4)β̂0(4),   (6)
where the numeric superscript denotes the missing data pattern, π̂(p) is the proportion of cases in missing data pattern p, and β̂0(p) is the pattern-specific intercept estimate. Of importance, a pattern
mixture analysis does not automatically produce standard errors
for the average estimates because these quantities are a function of
the model parameters. Consequently, it is necessary to use the
multivariate delta method to derive an approximate standard error
(Hedeker & Gibbons, 1997; Hogan & Laird, 1997). Fortunately,
performing these additional computations is unnecessary because
Mplus can readily compute the average estimates and their standard errors.
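For readers who want to see the arithmetic, the following sketch computes the weighted average in Equation 6 and a delta-method standard error from hypothetical pattern-specific estimates. The covariance structure (independent intercept estimates, multinomial proportions, no cross-block covariance) is an assumption of the sketch, not a feature of the model:

```python
import numpy as np

# Hypothetical pattern-specific intercept estimates and pattern
# proportions for a four-pattern study.
betas = np.array([10.2, 9.6, 8.9, 8.1])      # beta0-hat for patterns 1-4
props = np.array([0.55, 0.20, 0.15, 0.10])   # pi-hat for patterns 1-4

# Assumed sampling covariance: independent beta estimates plus the
# multinomial covariance of the proportions (n = 400 cases), with no
# cross-covariance between the two blocks.
n = 400
v_beta = np.diag([0.04, 0.09, 0.16, 0.25])
v_prop = (np.diag(props) - np.outer(props, props)) / n
cov = np.block([[v_beta, np.zeros((4, 4))],
                [np.zeros((4, 4)), v_prop]])

estimate = props @ betas                      # weighted average intercept

# Delta method: gradient of sum(pi_p * beta_p) with respect to the
# stacked parameter vector (betas, then props).
grad = np.concatenate([props, betas])
se = np.sqrt(grad @ cov @ grad)

print(round(estimate, 3), round(se, 3))
```

The gradient stacks the proportions (derivatives with respect to the intercepts) and the intercepts (derivatives with respect to the proportions), which is all the multivariate delta method requires for a linear combination.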
As an aside, stratifying cases by missing data pattern is also an
old MAR-based strategy that predates current maximum likelihood
missing data handling techniques (B. Muthén, Kaplan, & Hollis,
1987). This so-called multiple group approach used between-
pattern equality constraints on the model parameters to trick ex-
isting structural equation modeling programs into producing a
single set of MAR-based estimates. Although this procedure
closely resembles a pattern mixture model, forcing the missing
data patterns to have the same parameter estimates effectively
ignores the pattern-specific conditioning that is central to the
MNAR factorization in Equation 3.
Figure 3. Path diagram of Diggle and Kenward’s (1994) linear growth model. β0 = mean intercept; β1 = mean growth rate; b0i and b1i = residuals that allow the intercepts and the change rates, respectively, to vary across individuals; Y1–Y4 = outcome variables; ε1–ε4 = time-specific residuals; R2–R4 = missing data indicators.

Although its resemblance to a multiple group analysis makes the pattern mixture model conceptually straightforward, implementing the procedure is made difficult by the fact that one or more of the
pattern-specific parameters are usually inestimable. To illustrate,
consider a four-wave study that uses a quadratic growth model.
The model is identified only for the subgroup of participants with
complete data. For cases with two complete observations, the
linear trend is estimable, but the quadratic coefficient and certain
variance components are not. The identification issue is most
evident in the subgroup that drops out following the baseline
assessment, where neither the linear nor the quadratic coefficients are estimable.
Estimating a pattern mixture model requires the user to specify
values for the inestimable parameters, either explicitly or implic-
itly. Using code variables as predictors in a growth model is one
way to accomplish this (Hedeker & Gibbons, 1997, 2006). For
example, Hedeker and Gibbons (1997) classified participants from
a psychiatric drug trial as completers (cases with data at every
wave) or dropouts (cases that left the study at some point after the
baseline assessment), and they subsequently included the binary
missing data indicator as a predictor of the intercepts and slopes in
a linear growth model. A linear model with the missing data
indicator as the only predictor would be as follows:
Yti = β0 + β1(TIMEti) + β2(DROPOUTi) + β3(DROPOUTi)(TIMEti) + b0i + b1i(TIMEti) + εti,   (7)
where DROPOUT denotes the missing data pattern (0 = completers, 1 = dropouts), β0 and β1 are the mean intercept and slope, respectively, for the complete cases, β2 is the intercept difference for the dropouts, and β3 is the slope difference for the dropouts.
Hedeker and Gibbons’s (1997, 2006) approach achieves identifi-
cation by sharing information across patterns. For example, the
model in Equation 7 implicitly assumes that early dropouts have
the same developmental trajectory as the cases that drop out later
in the study. The model also assumes that all missing data patterns
share the same covariance structure.
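The following sketch (with hypothetical coefficient values) shows how the pattern-specific trajectories implied by Equation 7 combine into a marginal trajectory by weighting each pattern by its proportion:

```python
import numpy as np

# Hypothetical estimates for the model in Equation 7, with the dropout
# indicator as the only predictor of the intercepts and slopes.
beta0, beta1 = 10.0, -1.0      # completer intercept and slope
beta2, beta3 = -0.8, -0.5      # intercept and slope differences for dropouts
pi_dropout = 0.30              # proportion of dropouts

times = np.arange(4)
completers = beta0 + beta1 * times
dropouts = (beta0 + beta2) + (beta1 + beta3) * times

# Averaging the two pattern trajectories over the dropout proportion
# gives the marginal (population-averaged) growth trajectory.
marg_intercept = beta0 + pi_dropout * beta2
marg_slope = beta1 + pi_dropout * beta3
marginal = marg_intercept + marg_slope * times

print(round(marg_intercept, 2), round(marg_slope, 2))
```

The marginal intercept and slope here are exactly the proportion-weighted averages of the two pattern trajectories, the two-pattern special case of Equation 6.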
A second estimation strategy is to implement so-called iden-
tifying restrictions that explicitly equate the inestimable param-
eters from one pattern to the estimable parameters from one or
more of the other patterns. Later in the article, I illustrate three
such restrictions: the complete case missing variable restriction,
the neighboring case missing variable restriction, and the avail-
able case missing variable restriction. As its name implies, the
complete case missing variable restriction equates the inesti-
mable parameters to the estimates from the complete cases. The
neighboring case missing variable restriction replaces inestima-
ble parameters with estimates from a group of cases that share
a comparable missing data pattern. For example, in a four-wave
study, the cases that drop out after the third wave can serve as
a donor pattern for the cases that drop out after the second
wave, such that the two patterns share the same quadratic
coefficient. Finally, the available case missing variable restric-
tion replaces inestimable growth parameters with the weighted
average of the estimates from other patterns. Still considering a
group of cases with two observations, this identifying restric-
tion would replace the inestimable quadratic term with the
average coefficient from the complete cases and the cases that
drop out following the third wave. Additional details and ex-
amples of various identification strategies are available else-
where in the literature (Demirtas & Schafer, 2003; Enders,
2010; Molenberghs, Michiels, Kenward, & Diggle, 1998; Thijs,
Molenberghs, Michiels, & Curran, 2002; Verbeke et al., 2000).
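To make the three restrictions concrete, the following sketch computes candidate values for an inestimable quadratic coefficient in a four-wave study. The pattern labels, estimates, and sample sizes are hypothetical, and Python is used purely for illustration (the article's analyses use Mplus):

```python
# Hypothetical quadratic-coefficient estimates from the patterns that can
# identify this parameter (labels and numbers are invented for illustration).
estimable = {
    "complete": {"quadratic": 0.05, "n": 200},
    "dropout_after_wave3": {"quadratic": 0.09, "n": 50},
}

def ccmv(estimates):
    # Complete case missing variable restriction: borrow the estimate
    # from the complete cases.
    return estimates["complete"]["quadratic"]

def ncmv(estimates):
    # Neighboring case missing variable restriction: borrow from the
    # nearest donor pattern (here, cases that drop out after Wave 3).
    return estimates["dropout_after_wave3"]["quadratic"]

def acmv(estimates):
    # Available case missing variable restriction: n-weighted average
    # of the estimates from all donor patterns.
    total_n = sum(p["n"] for p in estimates.values())
    return sum(p["quadratic"] * p["n"] / total_n for p in estimates.values())

print(ccmv(estimable))            # 0.05
print(ncmv(estimable))            # 0.09
print(round(acmv(estimable), 3))  # (200*0.05 + 50*0.09)/250 = 0.058
```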
Pattern Mixture Model Assumptions
The assumed values for the inestimable parameters dictate
the accuracy of the pattern mixture model. To the extent that the
values are correct, the model can reduce or eliminate the bias
from an MNAR mechanism. However, like the selection model,
there is ultimately no way to gauge the accuracy of the resulting
estimates, and implementing different identification constraints
can (and often does) produce disparate sets of results. At first
glance, the need to specify values for inestimable parameters
may appear to be a serious weakness of the pattern mixture
model. However, some methodologists argue that this require-
ment is advantageous because it forces researchers to make
their assumptions explicit. This is in contrast to the selection
model, which relies on implicit distributional assumptions that
are not obvious. This aspect of the pattern mixture model also
provides flexibility because it allows researchers to explore the
sensitivity of the substantive model parameters to a number of
different identification constraints (i.e., assumed parameter val-
ues). In truth, the previous identifying restrictions are simply
arbitrary rules of thumb for generating parameter values. Any
number of other restrictions is possible (e.g., a restriction that
specifies a flat trajectory shape after the last observed data
point; Little, 2009), and performing a sensitivity analysis that
applies a variety of identification strategies to the same data is
usually a good idea.
Data Analysis Examples
To date, applications of longitudinal models for MNAR data are
relatively rare in the social and the behavioral sciences, perhaps
because these analyses have traditionally required complex custom
programming. Software availability is no longer a limiting factor
because the Mplus package provides a straightforward platform for
estimating a variety of selection models and pattern mixture mod-
els. This section describes a series of data analyses that apply the
MNAR models from earlier in the article. The Mplus 6 syntax files
for the analyses are available at www.appliedmissingdata.com/.
The analysis examples use the psychiatric trial data from
Hedeker and Gibbons (1997, 2006).3 Briefly, the data were
collected as part of the National Institute of Mental Health
Schizophrenia Collaborative Study and consist of repeated mea-
surements from 437 individuals. In the original study, partici-
pants were assigned to one of four experimental conditions (a
placebo condition and three drug regimens), but the subsequent
analyses collapsed these categories into a dichotomous treatment indicator (0 = placebo, 1 = drug). The primary substantive goal was to assess treatment-related changes in illness
severity over time. The outcome was measured on a 7-point
scale, such that higher scores reflect greater severity (e.g., 1 = normal, not at all ill; 7 = among the most extremely ill). Most of the measurements were collected at baseline, Week 1, Week 3, and Week 6, but a small number of participants also had measurements at Week 2, Week 4, or Week 5. To simplify the presentation, I excluded these irregular observations from all analyses. Finally, note that the discrete measurement scale violates multivariate normality, by definition. Although these data are still useful for illustration purposes, the normality violation is likely problematic for the selection model analyses.

3 The data are used here with Hedeker's permission and are available at his website: http://tigger.uic.edu/~he
The data set contains nine distinct missing data patterns that
represent a mixture of permanent attrition and intermittent miss-
ingness. The left column of Table 1 summarizes these patterns. To
provide some sense about the developmental trends, Figure 4
shows the observed means for each pattern by treatment condition.
The fitted trajectories in the figure suggest nonlinear growth. In
their analyses of the same data, Hedeker and Gibbons (1997, 2006)
linearize the trajectories by modeling illness severity as a function
of the square root of weeks. Although this decision is very sensi-
ble, I used a quadratic growth model for the subsequent analyses
because it provides an opportunity to illustrate the complexities
that arise with MNAR models, particularly pattern mixture models
with identifying restrictions. The analysis model is as follows:
$$Y_{ti} = \gamma_0 + \gamma_1(\mathrm{TIME}_{ti}) + \gamma_2(\mathrm{TIME}_{ti}^2) + \gamma_4(\mathrm{DRUG}_i) + \gamma_5(\mathrm{DRUG}_i)(\mathrm{TIME}_{ti}) + \gamma_6(\mathrm{DRUG}_i)(\mathrm{TIME}_{ti}^2) + b_{0i} + b_{1i}(\mathrm{TIME}_{ti}) + b_{2i}(\mathrm{TIME}_{ti}^2) + \varepsilon_{ti}, \quad (8)$$

where γ0, γ1, and γ2 define the average growth trajectory for the placebo cases (i.e., DRUG = 0), and γ4, γ5, and γ6 capture the mean differences between the treatment conditions.
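To make the fixed-effects part of Equation 8 concrete, the sketch below computes model-implied means at the four assessment times. The γ values are hypothetical placeholders, not estimates from the psychiatric trial data; the random effects drop out of the mean because they have expectation zero.

```python
def implied_mean(time, drug, g):
    # Fixed-effects part of Equation 8: the random effects (b0i, b1i, b2i)
    # and the residual have mean zero, so they drop out of the implied mean.
    g0, g1, g2, g4, g5, g6 = g
    return (g0 + g1 * time + g2 * time**2
            + g4 * drug + g5 * drug * time + g6 * drug * time**2)

gammas = (5.3, -0.6, 0.05, 0.0, -0.4, 0.04)  # hypothetical values only
for drug in (0, 1):
    # Means at the four assessment times (weeks 0, 1, 3, and 6).
    print([round(implied_mean(t, drug, gammas), 2) for t in (0, 1, 3, 6)])
```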
In an intervention study, the usual goal is to assess treatment
group differences at the end of the study. Centering the temporal
predictor at the final assessment (e.g., by fixing the final slope
factor loading to a value of 0) addresses this question because the
regression of the intercept on treatment group membership quan-
tifies the mean difference. However, implementing identifying
restrictions in a pattern mixture model is made easier by centering
the intercept at the baseline assessment, particularly when perma-
nent attrition is the primary source of missingness. Consequently,
I fixed the linear slope factor loadings to values of 0, 1, 3, and 6
for all subsequent analyses (the quadratic factor loadings are the
squares of these values). Despite this parameterization, it is
straightforward to construct a test of the endpoint mean difference.
Algebraically manipulating the growth model parameters gives the
model-implied mean difference at the final assessment as follows:
$$\hat{\mu}_{\mathrm{Drug}} - \hat{\mu}_{\mathrm{Placebo}} = \hat{\gamma}_4 + 6(\hat{\gamma}_5) + 36(\hat{\gamma}_6), \quad (9)$$

where the γ̂ terms are the regression coefficients that link the growth factors to the treatment indicator (i.e., the latent mean differences), 6 is the value of the linear factor loading at the final assessment (i.e., the time score, weeks since baseline), and 36 is the corresponding quadratic factor loading. Among other things,
the MODEL CONSTRAINT command in Mplus allows users to
define new parameters that are functions of the estimated param-
eters. In the subsequent analyses, I used this command to estimate
the mean difference in Equation 9 and its standard error.
MAR-Based Growth Curve Model
As a starting point, I used MAR-based maximum likelihood
missing data handling to estimate the quadratic growth curve
model. Figure 5 shows the path diagram for the analysis. Table 2
lists the estimates and the standard errors for selected parameters,
and Figure 6 displays the corresponding model-implied trajecto-
ries. The figure clearly suggests that participants in the drug
condition experienced greater reductions in illness severity relative
to the placebo group. However, it is important to emphasize that
these estimates assume that an individual’s propensity for missing
data at week t is completely determined by treatment group mem-
bership or by his or her severity score at earlier assessments (i.e.,
the missing values conform to an MAR mechanism). Substituting
the appropriate quantities from the maximum likelihood analysis
into Equation 9 gives a mean difference of −1.424 (SE = 0.182, p < .001). Expressed relative to the model-implied estimate of the baseline standard deviation (i.e., the square root of the sum of the intercept variance and the residual variance), the standardized mean difference is d = 1.563.
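Equation 9 and the standardized effect size described above involve only simple arithmetic on the estimates (the article performs this step with the Mplus MODEL CONSTRAINT command). A minimal Python sketch, with hypothetical coefficient and variance values:

```python
import math

def endpoint_difference(g4, g5, g6, final_time=6.0):
    # Equation 9: the latent mean differences weighted by the factor
    # loadings at the final assessment (time score 6, quadratic loading 36).
    return g4 + final_time * g5 + final_time**2 * g6

def cohens_d(mean_diff, intercept_var, residual_var):
    # Standardize against the model-implied baseline SD: the square root
    # of the intercept variance plus the residual variance.
    return mean_diff / math.sqrt(intercept_var + residual_var)

diff = endpoint_difference(g4=0.0, g5=-0.4, g6=0.04)  # hypothetical values
print(round(diff, 3))
print(round(cohens_d(diff, intercept_var=0.4, residual_var=0.5), 3))
```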
Diggle and Kenward’s (1994) Selection Models
Mplus is ideally suited for estimating selection models because
it can accommodate normally distributed (e.g., the repeated mea-
sures) and categorical outcomes (e.g., the missing data indicators)
in the same model. To illustrate, I fit two selection models of the
type developed by Diggle and Kenward (1994) to the psychiatric
trial data. The first analysis treated permanent attrition (Patterns
2–4) as MNAR and treated intermittent missingness (Patterns 5–9)
as MAR. As noted previously, missing data indicators that are
consistent with a discrete-time survival model are appropriate
when modeling permanent dropout. Normally, a set of three miss-
ing data indicators could represent the dropout patterns in Table 1,
but the small number of cases in Pattern 4 made it impossible to
model attrition at the second assessment. Consequently, the model
incorporated indicator variables at the final two waves with the
following coding scheme:
$$R_t = \begin{cases} 0 & \text{observed or intermittent missingness} \\ 1 & \text{dropout at time } t \\ 99 & \text{dropout at a previous time,} \end{cases}$$

where 99 represents a missing value code. Of importance, assigning a code of 0 to the intermittently missing values effectively defines sporadic missingness (Patterns 5–9) as MAR. Finally, note that the Pattern 4 cases had indicator codes of R3 = 1 and R4 = 99. This treats the missing Y2 values as MAR and the missing Y3 values as MNAR dropout. The middle columns of Table 1 summarize the indicator codes for each missing data pattern.

[Table 1. Missing Data Patterns and Indicator Codes for the Data. O = observed; M = missing. For dropout codes, 0 = observed, 1 = dropout, and 99 = a missing value code. For multinomial coding, 0 = intermittent missingness, 1 = dropout, and 2 = observed.]
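Readers who want to construct these indicators outside of Mplus can sketch the logic in a few lines; the following function is illustrative (the variable names and wave set are assumptions, not part of the article's setup):

```python
def dropout_indicators(observed, waves=(3, 4)):
    # observed: set of wave numbers at which a case has data (waves 1-4).
    # Permanent dropout occurs after the last observed wave; any hole
    # before that is intermittent missingness and keeps a code of 0
    # (i.e., it is treated as MAR).
    last = max(observed)
    codes = {}
    for w in waves:
        if w <= last:
            codes[w] = 0   # observed or intermittently missing
        elif w == last + 1:
            codes[w] = 1   # dropout occurs at this wave
        else:
            codes[w] = 99  # already dropped out: missing value code
    return codes

print(dropout_indicators({1, 2, 3, 4}))  # complete case
print(dropout_indicators({1, 2}))        # dropout after the second wave
print(dropout_indicators({1, 3}))        # intermittent miss, dropout after 3
```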
Figure 7 shows a path diagram of Diggle and Kenward’s (1994)
selection model. The different types of dashed arrows represent
equality constraints on the regression coefficients in the logistic
part of the model (e.g., the regression of R4 on Y4 is set as equal to the regression of R3 on Y3). Describing the specification of a
discrete-time survival model is beyond the scope of this article, but
readers who are interested in the rationale behind these constraints
can consult Singer and Willett (2003) and B. Muthén and Masyn (2005), among others.
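Although the full discrete-time survival specification is beyond the scope of the article, the equality constraints boil down to a single severity slope shared across waves, with wave-specific intercepts. A sketch with invented coefficients:

```python
import math

def hazard(intercept, slope_y, y, slope_drug=0.0, drug=0):
    # Conditional probability of dropout at wave t, given participation at
    # t - 1, from a logistic regression. The equality constraints in the
    # text mean slope_y is the SAME value for the regression of R3 on Y3
    # and of R4 on Y4; only the wave intercepts differ.
    logit = intercept + slope_y * y + slope_drug * drug
    return 1.0 / (1.0 + math.exp(-logit))

shared_slope = 0.5  # hypothetical common coefficient for Yt
print(round(hazard(-2.0, shared_slope, y=4.0), 3))  # wave 3, severity 4
print(round(hazard(-2.5, shared_slope, y=4.0), 3))  # wave 4, same slope
```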
Table 3 gives selected parameter estimates and standard errors
from the analysis. The model-implied growth trajectories were
quite similar to those in Figure 6, although the selection model
produced a larger mean difference between the treatment condi-
tions at Week 6. Specifically, substituting the appropriate estimates
into Equation 9 yields a model-implied mean difference of −1.665 (SE = 0.198, p < .001) at the final assessment. Expressed relative to the model-implied estimate of the baseline standard deviation, this mean difference corresponds to a standardized effect size of d = 1.810. Notice that the selection model produced the same
substantive conclusion as the maximum likelihood analysis (i.e.,
the drug condition experienced greater reductions in illness sever-
ity), albeit with a larger effect size. Again, the normality violation
should cast doubt on the validity of the selection model estimates.
Turning to the logistic portion of the model, the regression
coefficients quantify the influence of treatment group membership
and the repeated measures variables on the hazard probabilities
(i.e., the conditional probability of dropout, given participation at
the previous assessment). For example, the significant positive
association between Rtand Ytsuggests that participants with
higher illness severity scores at wave t were more likely to drop
out, even after controlling for treatment group membership and
scores from the previous assessment. Although the accuracy of this
coefficient depends on untenable distributional assumptions, it
does provide some evidence for an MNAR mechanism. It is
important to note that estimating the model from 100 random
starting values produced two sets of solutions with different logistic regression coefficients (the log likelihood values were −2,565.814 and −2,573.115). In the second solution, the association between Rt and Yt switched signs, such that cases with lower
illness severity scores were more likely to drop out.

[Figure 4. Observed means and fitted trajectories for each of the nine missing data patterns in the psychiatric trial data. The shaded circles denote the drug condition means, and the clear circles represent the placebo group.]

It is unclear whether this sensitivity to different starting values is a symptom of
model misspecification (e.g., the logistic portion of the model
omits an important predictor of missingness) or normality viola-
tion. Because these models are weakly identified to begin with, a
quadratic model may be too complex, although a linear model
showed similar instability. Regardless of the underlying cause, this
finding underscores the importance of using random starting val-
ues when estimating these models.
The previous analysis treated intermittent missing values as
MAR. As an alternative, creating indicators that are consistent
with a multinomial logistic regression can distinguish between
intermittent and permanent missing values (Albert & Follmann,
2009; Albert, Follmann, Wang, & Suh, 2002). The following
coding scheme is one such example:
$$R_t = \begin{cases} 0 & \text{intermittent missingness at time } t \\ 1 & \text{dropout at time } t \\ 2 & \text{observed at time } t \\ 99 & \text{dropout at an earlier time,} \end{cases}$$
where 99 is a missing value code. By default, Mplus treats the
highest nonmissing category (e.g., 2) in a multinomial logistic
regression as the reference group. Therefore, assigning the highest
code to the observed values yields logistic regression coefficients
that quantify the probability of each type of missingness relative to
complete data. After minor alterations to accommodate sparse
missing data patterns, I estimated Diggle and Kenward’s (1994)
model under this alternate coding scheme. The right columns of
Table 1 summarize the indicator coding for the analysis. The
model with multinomial indicators produced mean difference and effect size estimates that were quite similar to those of the previous Diggle and Kenward model. The logistic portion of the model was also comparable. The similarity of the two coding schemes suggests that treating intermittent missing values as MAR had very little impact on the final estimates, perhaps because permanent attrition accounts for the vast majority of the missing data.

[Figure 5. Quadratic growth model for the psychiatric data. The figure omits the latent variable intercepts and the residual covariances among the latent variables to reduce visual clutter. b0i, b1i, and b2i = individual growth components; Y1–Y4 = outcome variables; ε1–ε4 = time-specific residuals.]

[Figure 6. Model-implied growth trajectories from the MAR-based maximum likelihood analysis. MAR = missing at random.]

[Figure 7. Diggle and Kenward's (1994) quadratic growth model for the psychiatric data. The figure omits the latent variable intercepts and the residual covariances among the latent variables to reduce visual clutter. b0i, b1i, and b2i = individual growth components; Y1–Y4 = outcome variables; ε1–ε4 = time-specific residuals; R3 and R4 = missing data indicators.]

[Table 2. MAR-Based Maximum Likelihood Estimates. MAR = missing at random.]
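The multinomial coding scheme can be sketched the same way as the dropout indicators; the function below is illustrative, not the article's actual data preparation:

```python
def multinomial_indicators(observed, waves=(3, 4)):
    # 0 = intermittent missingness at wave t, 1 = dropout at wave t,
    # 2 = observed at wave t (the Mplus reference category, because it is
    # the highest nonmissing code), 99 = dropout at an earlier wave.
    last = max(observed)
    codes = {}
    for w in waves:
        if w in observed:
            codes[w] = 2   # observed at this wave
        elif w < last:
            codes[w] = 0   # a hole before the last observation: intermittent
        elif w == last + 1:
            codes[w] = 1   # dropout occurs at this wave
        else:
            codes[w] = 99  # already dropped out: missing value code
    return codes

print(multinomial_indicators({1, 2, 4}))  # intermittent miss at wave 3
print(multinomial_indicators({1, 2}))     # dropout after the second wave
```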
Wu and Carroll’s (1988) Selection Model
In Diggle and Kenward’s (1994) models, the probability of
missing data was directly related to the repeated measures vari-
ables. In contrast, Wu and Carroll’s (1988) selection model uses
individual intercepts and slopes as predictors of missingness. Al-
though it is possible to apply the previous missing data indicator
codes to Wu and Carroll’s models, only the model with discrete-
time survival indicators converged to a proper solution. Conse-
quently, I limit the subsequent discussion to an analysis that treated
permanent attrition (Patterns 2–4) as MNAR and treated intermit-
tent missingness (Patterns 5–9) as MAR. The missing data indi-
cators were identical to the discrete-time coding scheme in the
middle columns of Table 1 (i.e., 0 = observed or intermittent missingness, 1 = dropout at time t, 99 = dropout at a previous time). An initial analysis failed to converge because the latent
variable covariance matrix was not positive definite. Constraining
the quadratic factor variance to zero eliminated this problem and
produced plausible parameter estimates. Because of this modifi-
cation, the final model used treatment group membership and the
individual intercepts and linear slopes to predict attrition. Figure 8
shows a path diagram of the final model. As before, different types
of dashed arrows represent equality constraints on the regression
coefficients in the logistic part of the model.
It is important to note that estimating the model from 100
random starting values produced 85 convergence failures, even
after eliminating the quadratic variance from the model. The 15
sets of starts that successfully converged produced comparable log
likelihood values but slightly different parameter estimates. Sim-
plifying the model by examining change as a linear function of the
square root of time reduced this problem and produced sets of
solutions with identical estimates and identical log likelihood
values. This finding suggests that a quadratic model is too complex
for these data, but it could also be the case that model misspeci-
fication or normality violations contributed to the convergence
failures. For illustration purposes, I report the quadratic model
estimates from the solution with the highest log likelihood, but
these results should be viewed with caution.
Table 4 gives selected parameter estimates and standard
errors from Wu and Carroll’s (1988) selection model. Wu and
Carroll’s model produced a smaller effect size than Diggle and
Kenward’s (1994) selection model. Specifically, substituting
the appropriate estimates into Equation 9 yields a mean difference of −1.363 (SE = 0.183, p < .001) at the final assessment and a standardized effect size of d = 1.576. Turning to the
logistic portion of the model, the regression coefficients quan-
tify the influence of the individual intercepts and linear slopes
on the hazard probability. Because the time scores are centered
at the baseline assessment, the linear slope represents instantaneous change at the beginning of the study. Consequently, the negative coefficient for the regression of Rt on the linear growth factor suggests that participants who experienced immediate reductions in illness severity were most likely to drop out, even after controlling for initial severity level (i.e., the intercept) and treatment group membership.

[Table 3. Diggle and Kenward's (1994) Selection Model Estimates With Missing Not at Random Dropout.]

[Figure 8. Wu and Carroll's (1988) quadratic growth model for the psychiatric data. The figure omits the latent variable intercepts and the residual covariances among the latent variables to reduce visual clutter. R3 and R4 = missing data indicators; b0i, b1i, and b2i = individual growth components; Y1–Y4 = outcome variables; ε1–ε4 = time-specific residuals.]

[Table 4. Wu and Carroll's (1988) Selection Model Estimates With Missing Not at Random Dropout.]
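The shared-parameter logic of Wu and Carroll's model can be sketched as a logistic hazard driven by a person's own growth components. All coefficients below are invented for illustration; only the sign pattern mirrors the finding described above:

```python
import math

def wu_carroll_hazard(wave_intercept, b0, b1, drug,
                      coef_b0=0.2, coef_b1=-0.8, coef_drug=-0.3):
    # Dropout hazard as a logistic function of the individual intercept
    # (b0), the individual linear slope (b1), and treatment. All
    # coefficients are hypothetical; the negative slope coefficient
    # mirrors the article's finding that steeper early improvement
    # predicted dropout.
    logit = wave_intercept + coef_b0 * b0 + coef_b1 * b1 + coef_drug * drug
    return 1.0 / (1.0 + math.exp(-logit))

# A case improving quickly (b1 = -1.0) vs. one improving slowly (b1 = -0.1):
fast = wu_carroll_hazard(-2.0, b0=5.0, b1=-1.0, drug=1)
slow = wu_carroll_hazard(-2.0, b0=5.0, b1=-0.1, drug=1)
print(round(fast, 3), round(slow, 3))
```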
Overview of the Pattern Mixture Models
Hedeker and Gibbons (1997, 2006) illustrated a pattern mix-
ture modeling approach that uses the missing data pattern
(represented by one or more dummy variables) as a predictor in
the growth model. This method is advantageous because stan-
dard mixed modeling procedures (e.g., the MIXED procedures
in SPSS and SAS) can estimate the model. Mplus offers finite
mixture modeling options (B. Muthén & Shedden, 1999) that
are ideally suited for implementing a variety of other pattern
mixture models that are difficult or impossible to estimate with
standard software (e.g., pattern mixture models with identifying
restrictions). Because Hedeker and Gibbons thoroughly de-
scribed the use of pattern indicators as predictors of growth, I
limit the subsequent examples to pattern mixture models with
identifying restrictions. Interested readers can consult B. Muthén et al. (2011) for other interesting variations of the
pattern mixture model.
Within the Mplus finite mixture modeling framework, each
missing data pattern functions as a pseudolatent class. In the
conventional pattern mixture model, these classes simply reflect
a manifest grouping variable that is derived from the observed
missing data patterns. For example, in a simple model, the
complete cases could form one class, and the cases with one or
more missing values could form a second class. The KNOWNCLASS subcommand in Mplus uses a grouping variable from
the input data set to assign cases to classes with a probability of
zero or one. Although the pattern mixture models in this section
are effectively multiple group growth models, the finite mixture
modeling framework provides a convenient mechanism for
implementing various identifying restrictions (a multiple group
model does not allow the user to specify equality constraints for
inestimable parameters). Roy (2003) and B. Muthén et al.
(2011) described modeling variations that treat class member-
ship as a true latent variable.
Returning to the psychiatric trial data, Figure 4 shows the
observed means and the fitted trajectories for each of the nine
missing data patterns. With a small number of patterns and a
sufficiently large sample size, it would be possible to define
each pattern as a distinct class, but the number of cases in
Patterns 4 through 9 precludes this option. To simplify the
models, I reduced the number of classes by aggregating patterns
with comparable trajectory shapes. Considering the first three
patterns, there appears to be a relationship between dropout
time and the rate of initial decline, such that rapid improvement
is associated with earlier dropout, at least in the drug condition.
Consequently, it is reasonable to treat the first three patterns as
distinct classes. Although the decision was somewhat arbitrary,
I combined Patterns 3 and 4 because these groups were com-
parable with respect to the timing of dropout. Next, consider the
cases with intermittent missingness (Patterns 5–9). Although it
is reasonable to treat these patterns as a distinct group, the
trajectory shapes roughly resemble the growth curves for the
complete cases. Because the Bayesian information criterion
values from a series of preliminary analyses clearly favored a
model that combined Patterns 5 through 9 with Pattern 1, the
final set of pattern mixture models used three classes: (a) cases
with complete data and intermittent missing values, (b) cases
that dropped out after the third assessment, and (c) cases that
dropped out after the first or the second assessment.
Recall that pattern mixture models are inherently underiden-
tified because they typically involve one or more inestimable
parameters. With respect to the mean structure, Classes 1 and 2
have sufficient data to estimate a quadratic trend, but the
quadratic intercept and the regression of the quadratic growth
factor on the treatment group indicator are inestimable for Class
3. The subsequent models used one of three identifying restric-
tions to achieve identification. The complete case missing vari-
able restriction equated the inestimable quadratic parameters to
those of Class 1 (complete data and intermittent missingness).
The second model implemented the neighboring case missing
variable restriction by replacing the inestimable parameters
with those of Class 2 (dropout after the third assessment). The
final model used the available case missing variable restriction
and equated the quadratic parameters for Class 3 to the
weighted average of the estimates from Classes 1 and 2. In
Mplus, specifying between-class equality constraints (e.g., us-
ing the MODEL CONSTRAINT command) implements these
restrictions. Although the same identification strategies are ap-
plicable to the covariance structure, the subsequent models
assumed a common covariance matrix for the three classes.
Complete Case Missing Variable Restriction
Recall that the pattern mixture model produces unique param-
eter estimates for each class (i.e., estimates that are conditional on
the missing data pattern). Although the substantive goal is to
generate a single set of estimates that averages across the distri-
bution of missing data, it is important to inspect the class-specific
results. To better illustrate the estimates, Figure 9A shows the
model-implied growth curves for each class. Notice that the fitted
trajectories for the Class 3 drug condition and the Class 2 placebo
condition fall outside the plausible score range. For Class 3, the
identifying restriction clearly underestimated the degree of curva-
ture. For Class 2, the mean structure was identified, but attrition at
the final assessment produced an inaccurate extrapolation. After
some experimentation, changing the constrained value of the Class
3 regression coefficient from .021 to .080 produced a reasonable
trajectory that stayed within bounds. Similarly, constraining the Class 2 intercept to a value of .070 or lower returned plausible estimates.
At first glance, it may seem unreasonable to arbitrarily change
parameter values. However, it is important to remember that the
identifying constraints essentially represent assumptions about tra-
jectory shapes that would have been observed if the data had been
complete. Because the growth curves in Figure 9A clearly repre-
sent incorrect predictions about the unobserved data points, it is
difficult to defend a set of marginal estimates that average across
the missing data patterns. Consequently, model modification
seems necessary in this case. In truth, the identifying restrictions
are nothing more than arbitrary rules of thumb for generating plausible parameter values, so viewing the restrictions as tentative starting points for estimation and altering them as needed is a reasonable strategy.
After implementing new parameter constraints, the model pro-
duced plausible class-specific estimates. The top section of Table
5 gives the updated estimates, and Figure 9B displays the corre-
sponding model-implied trajectories. Computing the weighted
mean of the class-specific values yields an estimate of the popu-
lation growth trajectory that averages over the distribution of
missingness. For these analyses, the population estimate is as follows:

$$\hat{\theta} = \hat{\pi}^{(1)}\hat{\theta}^{(1)} + \hat{\pi}^{(2)}\hat{\theta}^{(2)} + \hat{\pi}^{(3)}\hat{\theta}^{(3)}, \quad (10)$$

where the numeric superscript denotes the missing data pattern, π̂(p) is the proportion of cases in missing data class p, and θ̂(p) is the class-specific estimate. Because the averaging process is identical for all estimates, θ̂ generically denotes a model parameter.
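The pattern-weighted averaging in Equation 10 is easy to verify by hand. In the sketch below, the class sizes (336, 53, and 48 of 437) come from the analyses reported later, while the class-specific estimates are hypothetical:

```python
def marginal_estimate(estimates, ns):
    # Equation 10: weight each class-specific estimate by the proportion
    # of cases in that missing data class, then sum.
    total = sum(ns)
    return sum(est * n / total for est, n in zip(estimates, ns))

ns = [336, 53, 48]                    # class sizes from the psychiatric data
class_slopes = [-0.60, -0.75, -0.90]  # hypothetical class-specific estimates
print(round(marginal_estimate(class_slopes, ns), 4))
```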
Table 6 gives the average estimates and the standard errors
for selected parameters. The trajectory shapes from the pattern
mixture model resemble those from the previous analyses, but
the mean difference at the final assessment is somewhat larger.
Specifically, using the class-specific coefficients to construct a
mean difference for each missing data pattern and computing
the weighted average of these estimates gives a difference of −1.827 (SE = 0.374, p < .001) at the final assessment. Expressed relative to the model-implied estimate of the baseline standard deviation, this mean difference equates to a standardized effect size of d = 2.019. Because the marginal estimates
(i.e., the result of Equation 10) are a function of the model
parameters, a pattern mixture analysis does not automatically
produce standard errors. Consequently, it is necessary to use the
multivariate delta method to derive an approximate standard
error (Hedeker & Gibbons, 1997; Hogan & Laird, 1997). For-
tunately, the Mplus MODEL CONSTRAINT command can
generate the average estimates and their standard errors, so
further computations are unnecessary. Descriptions of the mul-
tivariate delta method are available elsewhere in the literature
(MacKinnon, 2008; Raykov & Marcoulides, 2004), and Enders (2010) sketches the computational details for various identifying restrictions.

[Figure 9. Class-specific model-implied growth trajectories. The complete case missing variable restriction generated Panels A and B, the neighboring case missing variable restriction produced Panels C and D, and the available case missing variable restriction produced Panels E and F. Panels A, C, and E are implausible trajectories from initial analyses.]
Neighboring Case Missing Variable Restriction
The second pattern mixture model analysis used the neighboring
case missing variable restriction to equate the inestimable qua-
dratic parameters for Class 3 (dropout after the second assessment)
to the estimates from Class 2 (dropout after the third assessment).
Consistent with the complete case restriction, the initial estimates
produced fitted trajectories that fell outside the plausible score
range. Figure 9C shows the model-implied growth trajectories
from the initial analysis. Some experimentation revealed that con-
straining the quadratic intercept for Pattern 2 (and, by extension,
the quadratic intercept for Pattern 3) to a value of .070 or lower
produced growth curves that stayed within bounds. The middle
portion of Table 5 lists the class-specific estimates from the revised
model, and Figure 9D displays the corresponding trajectories.
Table 6 gives the average estimates and the standard errors from
the neighboring case missing variable restriction. The model-
implied mean difference is −1.957 (SE = 0.522, p < .001), and the corresponding effect size is d = 2.163. The effect size difference is largely due to the elevated growth trajectory for the placebo
condition in Class 3. Again, it is important to reiterate that the
differences between the two models result from applying different
sets of assumptions about the unobserved data. There is no way to
empirically assess the accuracy of competing estimates.
Available Case Missing Variable Restriction
The final analysis implemented the available case missing vari-
able restriction. Recall that this approach achieves identification by
equating an inestimable parameter to the weighted average of the
estimates from other patterns. Applied to the current example, the
available case restriction replaced the quadratic intercept for Class
3 (dropout after the second assessment) with the weighted average
of the intercept estimates from the first two classes. The weight for
Class 1 was 336/389 = .864, and the weight for Class 2 was 53/389 = .136. Consistent with the previous analyses, the initial
model produced trajectories that fell outside the plausible score
range (see Figure 9E). Because the available case restriction ap-
plied the largest weight to the estimates from the complete cases,
the growth curves in Figure 9E closely resemble those from the
complete case missing variable restriction in Figure 9A. Changing
Class 1’s contribution to the inestimable regression coefficient
from .021 to .080 and constraining the quadratic intercept for Class
2 to a value of .070 or lower produced plausible growth curves.
Notice that these modifications are the same as those from the
previous analysis. The bottom section of Table 5 gives the class-
specific estimates from the revised model, and Figure 9F displays
the corresponding trajectories.
Table 6 displays the population estimates and their standard
errors. It is, perhaps, not surprising that the available case restric-
tion produced estimates that were virtually identical to those of the
complete case missing variable restriction. The similarity is owed
to the fact that complete cases primarily determined the values of
the inestimable parameters (the weight for this group was .864,
compared with .136 for Class 2). The mean difference and stan-
dardized effect size values from the analysis (−1.845 and 2.038, respectively) were also virtually identical to those of the complete case restriction (−1.827 and 2.019, respectively).
[Table 6: Pattern Mixture Model Estimates Averaged Across Missing Data Patterns, including the Week 6 difference. Note. CCMV = complete case missing variable restriction; NCMV = neighboring case missing variable restriction; ACMV = available case missing variable restriction.]

[Table 5: Class-Specific Estimates From Pattern Mixture Models for Class 1 (n = 336), Class 2 (n = 53), and Class 3 (n = 48), under the complete case, neighboring case, and available case identifying restrictions. Note. Italic typeface denotes donor estimates for Class 3; bold typeface denotes constrained parameters.]

The preceding analysis examples applied seven different models—and thus seven sets of assumptions—to the psychiatric trial data. Although the analyses produced the same substantive conclusion (i.e., the drug group exhibited dramatic improvement relative to the placebo group), the standardized effect size estimates had a range of nearly seven tenths of a standard deviation unit.
Because the models applied different assumptions, this variation
might not come as a surprise. Nevertheless, the fluctuation in the
effect size estimates is disconcerting. Had the intervention effect
not been so dramatic, it could have easily been the case that the
models produced conflicting evidence about the efficacy of the
drug condition. Unfortunately, it is relatively common for sensi-
tivity analyses to produce discrepant estimates (Demirtas & Scha-
fer, 2003; Foster & Fang, 2004). The next section offers some
practical advice on model selection.
Choosing Among Competing Models
MNAR modeling is an active area of methodological research,
and the procedures in this article represent just a few possible
options. Given the wide array of analytic choices, model selection
becomes an important practical consideration; this is particularly
true when different models produce disparate estimates, as they do
in the preceding examples. Although this is somewhat disconcerting, it is impossible to provide general recommendations about model selection because every analytic option—MAR or MNAR—relies on
one or more untestable assumptions. Although an MAR and an
MNAR model may produce identical fit to the observed data, they
make fundamentally different predictions about the unobserved
score values (Molenberghs & Kenward, 2007). Because there is no
way to empirically assess the validity of these predictions, model
selection is not about choosing a single correct model. Rather,
researchers must choose the model with the most defensible set of
assumptions and construct a logical argument that defends that
choice. In some situations, it is possible to discount certain models
a priori (e.g., the preceding selection model analyses are suspect
because of the normality violations). In other situations, substan-
tive considerations may lead researchers to prefer one model over
the other. This section outlines a few such considerations.
To begin, consider the selection modeling framework. Although
Wu and Carroll’s (1988) and Diggle and Kenward’s (1994) models
have commonalities, study-specific features may influence model
selection. To illustrate, consider two hypothetical research scenar-
ios. First, suppose that a psychologist is studying quality of life in
a clinical trial for a new cancer medication and finds that a number
of patients become so ill (i.e., their quality of life becomes so poor)
that they can no longer participate in the study. In this situation, it
is reasonable to believe that attrition is related to one’s develop-
mental trajectory, such that patients with rapidly decreasing quality
of life scores are most likely to leave the study because they die or
become too ill to participate. To the extent that this assumption is
correct, Wu and Carroll’s model may be preferred because the
developmental trajectories—as opposed to single realizations of
the quality of life measure—are probable determinants of miss-
ingness. Methodologists have also suggested that the random co-
efficient model is well-suited for situations where the outcome
measure is highly variable over time (Albert & Follmann, 2009) or
is an unreliable indicator of an underlying latent construct (Little, 1995).
As a second example, consider a drug treatment study that tracks
substance use in the weeks following an intervention. In this
situation, it seems plausible that attrition is related to the actual
outcome at time t, such that participants who use drugs prior to an
assessment fail to show up because they will screen positive for
substance use. Diggle and Kenward’s (1994) model may be most
appropriate for this scenario because the outcome scores at a
particular time point—as opposed to the developmental trends—
are likely to determine missingness. Although the substantive
research problem may favor one selection model over the other, it
is important to reiterate that the data provide no basis for empir-
ically comparing the two models. Consequently, conducting a
sensitivity analysis that fits both models to the same data is usually
a good strategy.
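The distinction between the two scenarios can be made concrete with a minimal sketch of a Diggle and Kenward (1994)-style missingness submodel. The logistic coefficients below are hypothetical; the point is only that the dropout probability depends on the current, possibly unobserved score:

```python
import math

def dropout_prob(y_prev, y_curr, b0=-2.0, b1=0.3, b2=0.8):
    """Diggle-Kenward-style missingness submodel (hypothetical coefficients):
    logit Pr(dropout at time t) = b0 + b1 * y[t-1] + b2 * y[t].
    A nonzero b2 encodes MNAR missingness, because dropout then depends on
    the current -- and possibly unobserved -- outcome score."""
    eta = b0 + b1 * y_prev + b2 * y_curr
    return 1.0 / (1.0 + math.exp(-eta))
```

Setting b2 to zero reduces the submodel to an MAR dropout process, so examining how the growth estimates change across values of b2 is one way to frame a sensitivity analysis.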
Substantive and practical considerations also come into play
with pattern mixture models. The idea of estimating developmental
trajectories separately for each missing data pattern is intuitively
appealing, particularly for researchers who are familiar with mul-
tiple group structural equation models. In some situations, the
class-specific estimates can provide additional insight into one’s
substantive hypotheses. For example, in an intervention study, it
may be interesting to examine the response to treatment within
each dropout class in addition to estimating a marginal treatment
effect that averages over missing data patterns. Although the
previous analysis examples did not illustrate this possibility, the
pattern mixture model can incorporate predictors of dropout class
membership. This too can provide useful substantive information
(e.g., by identifying factors that are related to dropout or to a
particular developmental trajectory). One of the pattern mixture
model’s often-cited advantages is that it forces researchers to
explicitly state their assumptions in the form of values for the
inestimable parameters. The identifying restrictions that I imple-
mented in the earlier analysis examples are just a few possibilities,
and experimenting with different options is quite easy in Mplus.
The ability to identify the members of each missing data pattern is
potentially useful in this regard. For example, if the members of a
particular dropout group share a common set of characteristics
(e.g., in a school-based study, the early dropout class has a high
proportion of learning disabled children), it might be possible to
use previous research or substantive knowledge to formulate rea-
sonable predictions for the inestimable parameters. The flexibility
of the pattern mixture model makes it a highly useful tool for
conducting sensitivity analyses.
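The averaging step that converts class-specific estimates into marginal (population-average) estimates can be sketched as follows. Only the pattern sizes come from the example; the class-specific week 6 means are hypothetical placeholders:

```python
import numpy as np

# Pattern mixture marginalization: average class-specific estimates over
# the observed proportions of each missing data pattern.
n = np.array([336, 53, 48])              # pattern sizes from the example
pi = n / n.sum()                         # pattern proportions
mu_week6 = np.array([1.10, 1.60, 2.05])  # hypothetical class-specific means
marginal_mean = float(pi @ mu_week6)     # weighted average over patterns
```

Standard errors for such weighted averages are typically obtained with the delta method (e.g., Raykov & Marcoulides, 2004), because the marginal estimate is a nonlinear function of the class proportions and the class-specific parameters.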
In the missing data literature, a common viewpoint is that
researchers should explore the stability of their substantive con-
clusions by fitting alternate models to the same data. I previously
illustrated this procedure by fitting seven different models to the
psychiatric trial data. Exploring alternate models is just one form
of sensitivity analysis, and methodologists have outlined many
other procedures. Although it is impossible to briefly summarize
the broad range of viewpoints and analytic approaches from the
sensitivity analysis literature, it is nevertheless important to raise
awareness of this topic. Molenberghs and colleagues (Molen-
berghs & Kenward, 2007; Molenberghs, Verbeke, & Kenward,
2009) provided a detailed discussion of these procedures, and this
section summarizes a few of their key points.
Within a given modeling framework, it is useful to explore the
sensitivity of key parameter estimates to various model modifica-
tions. As an example, consider the selection modeling framework.
Both Diggle and Kenward’s (1994) and Wu and Carroll’s (1988)
models are sensitive to minor violations of distributional assump-
tions; the former assumes that the repeated measures variables are
multivariate normal, and the latter assumes that the random effects
(i.e., the individual intercepts and slopes) are normal. Examining
the change in key parameter estimates after modifying distribu-
tional assumptions is an important type of sensitivity analysis.
Although it is not the only method for doing so, finite mixture
modeling (e.g., growth mixtures) is a useful tool for representing
nonnormal manifest variables as well as nonnormal random effects
(McLachlan & Peel, 2000; B. Muthén & Asparouhov, 2009). In
the context of MNAR analyses, methodologists have outlined
latent class versions of selection models of the types developed by
Diggle and Kenward and Wu and Carroll (Beunckens et al., 2008;
B. Muthén et al., 2011) that are readily estimable with Mplus.
Similar strategies are available for pattern mixture models (B. Muthén et al., 2011; Roy, 2003; Roy & Daniels, 2008).
Modifying the growth model’s covariance structure is a second
option for exploring sensitivity within a given modeling frame-
work. Conventional wisdom suggests that modifying the covari-
ance structure has little to no impact on average growth rate
estimates (Singer & Willett, 2003). In large part, this is because the
mean and the covariance structure are independent in a complete-
data maximum likelihood analysis (i.e., the off-diagonal elements
in the parameter covariance matrix equal zero). Because this
independence is lost with missing data, modifying the covariance
structure (e.g., estimating class-specific variance components; es-
timating residual covariances; introducing an alternate covariance
structure) can potentially alter the latent variable means; Molen-
berghs et al. (2009) gave an example that illustrates this point.
Although it is unclear whether these modifications materially
affect the performance of MNAR models, they are nevertheless
easy to implement.
Finally, methodologists have developed local influence statistics
that attempt to identify cases that unduly impact the parameters of
the missingness model (e.g., the logistic regressions from Diggle &
Kenward’s, 1994, model) or the substantive model. These statistics
are conceptually similar to familiar measures from the ordinary
least squares regression literature (e.g., Cook’s D). Although these
influence statistics do not necessarily identify respondents with an
MNAR missingness mechanism, they can provide important in-
sight into the behavior of a model. For example, there is evidence
to suggest that a complete case with an anomalous score profile
can influence estimates in a way that gives credence to an MNAR
mechanism (Jansen et al., 2006; Kenward, 1998). Interested read-
ers can consult various work by Molenberghs and colleagues for a
detailed overview of local influence measures for missing data
analyses (Jansen et al., 2006; Molenberghs & Kenward, 2007;
Molenberghs et al., 2009).
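For readers unfamiliar with case-deletion diagnostics, a sketch of the ordinary least squares analogue (Cook's D) follows; the local influence statistics for missingness models generalize this basic idea. This is an illustration only and does not operate on a missingness model:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D influence measure for ordinary least squares: how much each
    case shifts the fitted coefficients, combining its residual size with
    its leverage."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                     # OLS coefficients
    resid = y - X @ beta                         # raw residuals
    h = np.einsum('ij,jk,ik->i', X, xtx_inv, X)  # leverage values h_i
    s2 = resid @ resid / (n - p)                 # residual variance estimate
    return resid**2 * h / (p * s2 * (1.0 - h)**2)
```

Cases with unusually large values warrant inspection; as noted above, an anomalous complete case can single-handedly lend apparent support to an MNAR mechanism.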
Methodologists have long advocated for the use of MAR-based
missing data handling procedures. The MAR assumption is often
very reasonable, but there are many situations where missingness
is related to the outcome variable itself. This so-called MNAR
mechanism is problematic because MAR-based procedures pro-
duce biased estimates. MNAR analysis models have received
considerable attention in the biostatistics literature, particularly in
the context of longitudinal data. Although some of these models
have been in the literature for many years, they have been slow to
migrate to the social and the behavioral sciences. The purpose of
this article is to describe two classic MNAR modeling frame-
works: the selection model and pattern mixture model. The com-
monality among MNAR models is that they integrate a submodel
that describes the propensity for missing data into the analysis. The
selection model augments the growth curve analysis with a set of
logistic regressions that describe the probability of missing data at
each occasion. The pattern mixture approach estimates the growth
model separately within each missing data pattern and subse-
quently averages over the missing data patterns.
The fundamental problem with missing data analyses is that it is
generally impossible to fully rule out MNAR missingness; by the same token, it is impossible to verify the MAR assumption.
Despite their intuitive appeal, MNAR analyses rely on untestable
assumptions (e.g., normally distributed latent variables, accurate
values for inestimable parameters), and relatively minor violations
of these assumptions can introduce substantial bias. The fact that
MNAR models produce accurate estimates under a relatively nar-
row range of conditions has led some methodologists to caution
against their routine use. A common opinion is that these models
are most appropriate for sensitivity analyses that apply different
models (and thus different assumptions) to the same data.
MNAR analysis techniques continue to receive a great deal of
attention in the methodological literature, and they are likely to
gain in popularity. Despite their limitations, these models are
important options to consider, particularly when outcome-related
attrition seems plausible. At the very least, MNAR models can
augment the results from an MAR-based analysis. Although sen-
sitivity analyses are useful for exploring the impact of modeling
choices on key parameter estimates, the observed data provide no
basis for model selection. Ultimately, choosing a missing data
handling technique—be it MAR or MNAR—is really a matter of
choosing among a set of competing assumptions. Consequently,
researchers should choose a model with the most defensible set of
assumptions, and they should provide a logical argument that
supports this choice.
Albert, P. S., & Follmann, D. A. (2000). Modeling repeated count data
subject to informative dropout. Biometrics, 56, 667–677.
Albert, P. S., & Follmann, D. A. (2009). Shared-parameter models. In G.
Fitzmaurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.),
Longitudinal data analysis (pp. 433–452). Boca Raton, FL: Chapman & Hall.
Albert, P. S., Follmann, D. A., Wang, S. A., & Suh, E. B. (2002). A latent
autoregressive model for longitudinal binary data subject to informative
missingness. Biometrics, 58, 631–642.
Beunckens, C., Molenberghs, G., Verbeke, G., & Mallinckrodt, C. (2008).
A latent-class mixture model for incomplete longitudinal Gaussian data.
Biometrics, 64, 96–105.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural
equation approach. Hoboken, NJ: Wiley.
Carpenter, J. R., Kenward, M. G., & Vansteelandt, S. (2006). A compar-
ison of multiple imputation and doubly robust estimation for analyses
with missing data. Journal of the Royal Statistical Society, Series A, 169,
Dantan, E., Proust-Lima, C., Letenneur, L., & Jacqmin-Gadda, H. (2008).
Pattern mixture models and latent class models for the analysis of
multivariate longitudinal data with informative dropouts. International
Journal of Biostatistics, 4, 1–26.
Demirtas, H., & Schafer, J. L. (2003). On the performance of random-
coefficient pattern-mixture models for non-ignorable drop-out. Statistics
in Medicine, 22, 2553–2575.
Diggle, P., & Kenward, M. G. (1994). Informative dropout in longitudinal
data analysis. Applied Statistics, 43, 49–94.
Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press.
Follmann, D., & Wu, M. (1995). An approximate generalized linear model with random effects for informative missing data. Biometrics, 51, 151–168.
Foster, E. M., & Fang, G. Y. (2004). Alternative methods for handling
attrition: An illustration using data from the Fast Track evaluation.
Evaluation Review, 28, 434–464.
Hancock, G. R., & Lawrence, F. R. (2006). Using latent growth models to
evaluate longitudinal change. In G. R. Hancock & R. O. Mueller (Eds.),
Structural equation modeling: A second course (pp. 171–196). Greenwich, CT: Information Age.
Heckman, J. J. (1976). The common structure of statistical models of
truncation, sample selection and limited dependent variables and a
simple estimator for such models. Annals of Economic and Social
Measurement, 5, 475–492.
Heckman, J. J. (1979). Sample selection bias as a specification error.
Econometrica, 47, 153–161.
Hedeker, D., & Gibbons, R. D. (1997). Application of random-effects
pattern-mixture models for missing data in longitudinal studies. Psycho-
logical Methods, 2, 64–78.
Hedeker, D., & Gibbons, R. D. (2006). Longitudinal data analysis. Hobo-
ken, NJ: Wiley.
Hogan, J. W., & Laird, N. M. (1997). Mixture models for the joint
distribution of repeated measures and event times. Statistics in Medi-
cine, 16, 239–257.
Hogan, J. W., Roy, J., & Korkontzelou, C. (2004). Handling drop-out in
longitudinal studies. Statistics in Medicine, 23, 1455–1497.
Jansen, I., Hens, N., Molenberghs, G., Aerts, M., Verbeke, G., & Kenward,
M. G. (2006). The nature of sensitivity in missing not at random models.
Computational Statistics and Data Analysis, 50, 830–858.
Kenward, M. G. (1998). Selection models for repeated measurements with
non-random dropout: An illustration of sensitivity. Statistics in Medi-
cine, 17, 2723–2732.
Lin, H., McCulloch, C. E., & Rosenheck, R. A. (2004). Latent pattern
mixture models for informative intermittent missing data in longitudinal
studies. Biometrics, 60, 295–305.
Little, R. (1995). Modeling the drop-out mechanism in repeated-measures
studies. Journal of the American Statistical Association, 90, 1112–1121.
Little, R. (2009). Selection and pattern mixture models. In G. Fitzmaurice,
M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Longitudinal data
analysis (pp. 409–431). Boca Raton, FL: Chapman & Hall.
Little, R., & Rubin, D. B. (2002). Statistical analysis with missing data
(2nd ed.). Hoboken, NJ: Wiley.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis.
Mahwah, NJ: Erlbaum.
McLachlan, G., & Peel, D. (2000). Finite mixture models. New York, NY: Wiley.
Michiels, B., Molenberghs, G., Bijnens, L., Vangeneugden, T., & Thijs, H.
(2002). Selection models and pattern-mixture models to analyse longi-
tudinal quality of life data subject to drop-out. Statistics in Medicine, 21,
Molenberghs, G., & Kenward, M. G. (2007). Missing data in clinical
studies. West Sussex, England: Wiley.
Molenberghs, G., Michiels, B., Kenward, M. G., & Diggle, P. J. (1998).
Monotone missing data and pattern-mixture models. Statistica Neer-
landica, 52, 153–161.
Molenberghs, G., Verbeke, G., & Kenward, M. G. (2009). Sensitivity
analysis for incomplete data. In G. Fitzmaurice, M. Davidian, G. Ver-
beke, & G. Molenberghs (Eds.), Longitudinal data analysis (pp. 501–
551). Boca Raton, FL: Chapman & Hall.
Muthén, B., & Asparouhov, T. (2009). Growth mixture modeling: Analysis
with non-Gaussian random effects. In G. Fitzmaurice, M. Davidian, G.
Verbeke, & G. Molenberghs (Eds.), Longitudinal data analysis (pp.
144–165). Boca Raton, FL: Chapman & Hall.
Muthén, B., Asparouhov, T., Hunter, A., & Leuchter, A. (2011). Growth
modeling with non-ignorable dropout: Alternative analyses of the
STAR*D antidepressant trial. Psychological Methods, 16.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation
modeling with data that are not missing completely at random. Psy-
chometrika, 52, 431–462.
Muthén, B., & Masyn, K. (2005). Discrete-time survival mixture analysis. Journal of Educational and Behavioral Statistics, 30, 27–58.
Muthén, B., & Shedden, K. (1999). Finite mixture modeling with mixture
outcomes using the EM algorithm. Biometrics, 55, 463–469.
Muthén, L. K., & Muthén, B. O. (1998–2010). Mplus user's guide (6th ed.). Los Angeles, CA: Muthén & Muthén.
Raykov, T., & Marcoulides, G. A. (2004). Using the delta method for
approximate interval estimation of parameter functions in SEM. Struc-
tural Equation Modeling: A Multidisciplinary Journal, 11, 621–637.
Robins, J. M., & Rotnitzky, A. (1995). Semiparametric efficiency in
multivariate regression models with missing data. Journal of the Amer-
ican Statistical Association, 90, 122–129.
Rotnitzky, A. (2009). Inverse probability weighted methods. In G. Fitz-
maurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Longi-
tudinal data analysis (pp. 453–476). Boca Raton, FL: Chapman & Hall.
Roy, J. (2003). Modeling longitudinal data with nonignorable dropout
using a latent dropout class model. Biometrics, 59, 829–836.
Roy, J., & Daniels, M. J. (2008). A general class of pattern mixture models
for nonignorable dropout with many possible dropout times. Biometrics,
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London,
England: Chapman & Hall.
Schafer, J. L. (2003). Multiple imputation in multivariate problems when
the imputation and analysis models differ. Statistica Neerlandica, 57,
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state
of the art. Psychological Methods, 7, 147–177.
Scharfstein, D. O., Rotnitzky, A., & Robins, J. M. (1999). Adjusting for
nonignorable drop-out using semi-parametric nonresponse models. Journal of the American Statistical Association, 94, 1096–1146.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York, NY: Oxford University Press.
Thijs, H., Molenberghs, G., Michiels, B., & Curran, D. (2002). Strategies
to fit pattern-mixture models. Biostatistics, 3, 245–265.
Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York, NY: Springer-Verlag.
Wu, M. C., & Bailey, K. R. (1989). Estimation and comparison of changes
in the presence of informative right censoring: Conditional linear model.
Biometrics, 45, 939–955.
Wu, M. C., & Carroll, R. J. (1988). Estimation and comparison of changes
in the presence of informative right censoring by modeling the censoring
process. Biometrics, 44, 175–188.
Yuan, Y., & Little, R. J. A. (2009). Mixed-effect hybrid models for
longitudinal data with nonignorable dropout. Biometrics, 65, 478–486.
Received January 4, 2010
Revision received August 3, 2010
Accepted November 11, 2010