A Bayesian Two-Part Latent Class Model for Longitudinal Medical
Expenditure Data: Assessing the Impact of Mental Health and Substance
Abuse Parity
Brian Neelon1,∗, A. James O’Malley2, and Sharon-Lise T. Normand2,3
1Nicholas School of the Environment, Duke University, Durham, North Carolina, U.S.A.
2Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, U.S.A.
3Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, U.S.A.
Summary: In 2001, the U.S. Office of Personnel Management required all health plans participating
in the Federal Employees Health Benefits Program to offer mental health and substance abuse
benefits on par with general medical benefits. The initial evaluation found that, on average, parity did
not result in either large spending increases or increased service use over the four-year observational
period. However, some groups of enrollees may have benefited from parity more than others. To
address this question, we propose a Bayesian two-part latent class model to characterize the effect
of parity on mental health use and expenditures. Within each class, we fit a two-part random
effects model to separately model the probability of mental health or substance abuse use and mean
spending trajectories among those having used services. The regression coefficients and random effect
covariances vary across classes, thus permitting class-varying correlation structures between the two
components of the model. Our analysis identified three classes of subjects: a group of low spenders
that tended to be male, had relatively rare use of services, and decreased their spending pattern
over time; a group of moderate spenders, primarily female, that had an increase in both use and
mean spending after the introduction of parity; and a group of high spenders that tended to have
chronic service use and constant spending patterns. By examining the joint 95% highest probability
density regions of expected changes in use and spending for each class, we confirmed that parity had
an impact only on the moderate spender class.
Biometrics xx, 0–21 DOI: xxx
Key words: Bayesian analysis; Growth mixture model; Latent class model; Mental health parity;
Semi-continuous data; Two-part model.
Bayesian Two-Part Latent Class Model1
The Federal Employees Health Benefits (FEHB) Program sponsors health insurance benefits
for more than 8.5 million federal employees and retirees, plus their spouses and dependents.
Over 250 health plans currently participate in the FEHB program. At the beginning of
2001, the U.S. Office of Personnel Management implemented a parity policy that required
all health plans participating in the FEHB Program to offer mental health and substance
abuse benefits on par with general medical benefits (U.S. OPM, 2000). An early evaluation
of the policy examined changes in total mental health expenditures, including out-of-pocket
and plan spending, from 1999 to 2002, and found that, on average, parity did not result
in either the large increases in spending predicted by opponents of parity or the increased
service use anticipated by mental health advocates (Goldman et al., 2006). Because most
of the literature on the impact of parity has focused on the average effect of the policy on
costs and access to mental health and substance abuse care, little is known about its impact
on specific enrollee subpopulations—for example, the sickest patients or those carrying the
greatest financial burden of illness.
Answering this question requires addressing three key features of longitudinal medical expenditure
data. First, the data are semi-continuous, assuming non-negative
values with a spike at zero for those who use no services, followed by a continuous, right-skewed
distribution for those who have used services. Table 1 provides a description of the
total spending data for a sample of 1581 FEHB enrollees from one state, each with four
years of data, yielding a total of 6324 observations. Over 80% of enrollees had no annual
mental health expenditures, while a small fraction had large expenditures. The percentage of
spenders increased steadily over time, while median spending increased immediately following
introduction of the parity directive and then returned to baseline levels by 2002.
[Table 1 about here.]
Another important feature of the data concerns repeated measurements. In the FEHB
data, each enrollee contributes an observation for each of the four study years, introducing
within-subject correlation. Moreover, in each year, there are two outcomes per enrollee:
use of mental health/substance abuse services, and if use, the level of use as measured by
expenditures. Further, it may be reasonable to assume that the probability of some use
is correlated with the expected level of spending. An appropriate statistical model should
address these multiple sources of correlation.
One modeling strategy is to apply a longitudinal two-part model (Olsen and Schafer, 2001;
Tooze, Grunwald, and Jones, 2002; Ghosh and Albert, 2009). Two-part models are mixtures
of a point mass at zero and a right-skewed distribution (e.g., lognormal) for the
nonzero values. The two mixture components are modeled in stages. First, the probability
of service use is modeled via mixed effects probit or logistic regression. Next, conditional on
some usage, the expected spending level is modeled through (most commonly) a lognormal
mixed effects model. The random effects for the two components are typically assumed to
be correlated; ignoring this potential correlation can yield biased inferences (Su, Tom, and
Finally, because enrollees tend to share characteristics related to spending, it is reasonable
to assume that FEHB enrollees’ trajectories fall into a small number of classes. One natural
mechanism to handle this feature is to use latent class models, in particular latent class
“heterogeneity” or “growth mixture” models (Verbeke and Lesaffre, 1996; Muthén and
Shedden, 1999; Muthén et al., 2002). Growth mixture models (GMMs) assume that subjects
first fall into one of a finite number of latent classes characterized by a class-specific mean
trajectory; then, about these class means, subjects have their own unique longitudinal
trajectories defined by a set of random effects with class-specific variance parameters. As
such, GMMs can be viewed as finite mixtures of random effects models.
Growth mixtures have become increasingly popular as a way of decomposing complex
heterogeneity in longitudinal models. Lin et al. (2000) developed a GMM to estimate class-
specific PSA trajectories among men at risk for prostate cancer. Proust-Lima, Letenneur,
and Jacqmin-Gadda (2007) proposed a GMM to jointly model a set of correlated longitudinal
biomarkers and a binary event. Lin et al. (2002) and Proust-Lima et al. (2009) developed
related models to analyze longitudinal biomarkers and a time to event. Beunckens et al.
(2008) proposed a GMM for incomplete longitudinal data. In the Bayesian setting, Lenk and
DeSarbo (2000) describe computational strategies for fitting GMMs; Elliott et al. (2005)
developed a Bayesian GMM to jointly analyze daily affect and negative event occurrences
during a 35-day study period; and recently, Leiby et al. (2009) fitted a Bayesian latent class
factor-analytic model to analyze multiple outcomes from a clinical trial evaluating a new
treatment for interstitial cystitis.
We build on this previous work to develop a Bayesian two-part growth mixture model for
characterizing the effect of parity on mental health use and expenditures. The advantages
of Bayesian inference are well-known and include elicitation of prior beliefs, avoidance of
asymptotic approximations, and, as we demonstrate below, practical estimation of parameter
contrasts and multidimensional credible regions. Within each class, we fit a probit-lognormal
model with class-specific regression coefficients and random effects. An attractive feature
of the model is that it permits the random effect covariance to vary across the classes.
For example, one class might comprise enrollees with frequent high expenditures (positive
correlation between the probability of spending and the actual amount spent), whereas
another class might comprise enrollees with frequent but modest expenditure (negative
correlation between probability of spending and amount spent).
The remainder of this paper is organized as follows: Section 2 outlines the proposed
model; Section 3 describes prior elicitation, posterior computation, model comparison, and
evaluation of model fit; Section 4 describes a small simulation study; Section 5 applies the
method to the FEHB study; and the final section provides a discussion and directions for
2. The Two-Part Growth Mixture Model
The two-part model for semi-continuous data is a mixture of a degenerate distribution at
zero and a positive continuous distribution, such as a lognormal (LN), for the nonzero values.
The probability distribution is expressed as
\[
f(y_i) = (1 - \phi)^{1 - d_i} \left[ \phi \times \mathrm{LN}(y_i; \mu, \tau^2) \right]^{d_i}, \quad i = 1, \ldots, n; \quad 0 \leq \phi \leq 1,
\]
where $y_i$ is the observed response of the random variable $Y_i$; $d_i$ is an indicator that $y_i > 0$;
$\phi = \Pr(Y_i > 0)$; and $\mathrm{LN}(y_i; \mu, \tau^2)$ denotes the lognormal density evaluated at $y_i$, with $\mu$
and $\tau^2$ representing the mean and variance of $\log(Y_i \mid Y_i > 0)$. When $\phi = 0$, the distribution
is degenerate at 0, and when $\phi = 1$, there are no zeros and the distribution reduces to a
lognormal distribution.
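The two-part density above can be evaluated directly. The following is a minimal sketch, not the authors' code; the function names and the numerical values used to exercise it are illustrative assumptions.

```python
import math


def lognormal_pdf(y, mu, tau2):
    """Lognormal density at y > 0; mu and tau2 are the mean and variance of log(Y)."""
    z = (math.log(y) - mu) ** 2 / (2.0 * tau2)
    return math.exp(-z) / (y * math.sqrt(2.0 * math.pi * tau2))


def two_part_density(y, phi, mu, tau2):
    """Point mass (1 - phi) at zero and phi * LN(y; mu, tau2) for y > 0."""
    if y < 0:
        raise ValueError("y must be non-negative")
    if y == 0:
        return 1.0 - phi          # d_i = 0: only the zero component contributes
    return phi * lognormal_pdf(y, mu, tau2)  # d_i = 1: scaled lognormal density


# Illustrative values only: a 20% probability of any spending.
mass_at_zero = two_part_density(0.0, phi=0.2, mu=5.0, tau2=1.0)
density_at_100 = two_part_density(100.0, phi=0.2, mu=5.0, tau2=1.0)
```

Setting `phi=1.0` removes the point mass, so the function returns the plain lognormal density for every positive `y`.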
The model can be extended to allow for repeated measures and latent classes by introducing
a latent categorical variable $C_i$ that takes the value $k$ ($k = 1, \ldots, K$) if subject $i$ belongs to
class $k$. In its most general form, the model is given by
\[
\begin{aligned}
f(y_{ij} \mid C_i = k, b_i) &= (1 - \phi_{ijk})^{1 - d_{ij}} \left[ \phi_{ijk} \times \mathrm{LN}(y_{ij}; \mu_{ijk}, \tau_k^2) \right]^{d_{ij}}, \\
g(\phi_{ijk}) &= x_{1ij}' \alpha_k + z_{1ij}' b_{1i} \quad \text{(binomial component)}, \\
\log(\mu_{ijk}) &= x_{2ij}' \beta_k + z_{2ij}' b_{2i} \quad \text{(lognormal component)},
\end{aligned}
\tag{1}
\]
where $y_{ij}$ is the $j$-th observed response for subject $i$ ($j = 1, \ldots, n_i$); $g$ denotes a link function,
such as the probit or logit link; $x_{lij}$ and $z_{lij}$ are $p_l \times 1$ and $q_l \times 1$ vectors of fixed and random
effect covariates for component $l$ ($l = 1, 2$), including appropriate time-related variables (e.g.,
polynomials of time or binary indicators representing measurement occasions); $\alpha_k$ and $\beta_k$
are fixed effect coefficients specific to class $k$; and $b_i \mid C_i = (b_{1i}', b_{2i}')' \mid C_i \sim N_{q_1 + q_2}(0, \Sigma_k)$ is
a stacked vector of random effects for subject $i$, with class-specific covariance $\Sigma_k$. When
$q_1 = q_2 = 1$, the model reduces to the widely used random-intercept two-part model. For
the standard random-slope model, $q_1 = q_2 = 2$ and $\Sigma_k$ is a $4 \times 4$ matrix that includes
cross-covariances between the random intercepts and slopes of the two components. While
this model captures additional heterogeneity over time, restrictions on $\Sigma_k$ may be needed to
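To complete the model, we assume that the class indicator $C_i$ has a “categorical distribution”
taking the value $k$ with probability $\pi_{ik}$, where $\pi_{ik}$ is linked to an $r$-dimensional covariate
vector, $w_i$, via a generalized logit model; that is,
\[
C_i \sim \mathrm{Cat}(\pi_{i1}, \ldots, \pi_{iK}), \qquad
\pi_{ik} = \frac{\exp(w_i' \gamma_k)}{\sum_{h=1}^{K} \exp(w_i' \gamma_h)}, \quad \text{with } \gamma_1 = 0 \text{ for identifiability.}
\tag{2}
\]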
(We use the term “categorical distribution” in lieu of “multinomial distribution” since $C_i$
is an integer-valued variable ranging from 1 to K.) And finally, throughout the paper, we
assume that the number of classes, K, is known. In Section 3.3, we discuss Bayesian model-
selection strategies for determining the optimal value of K, and in the Discussion section,
we note alternatives to fixing K.
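The generalized logit link for the class-membership probabilities can be sketched as follows. This is an illustrative implementation under the stated identifiability constraint (first coefficient vector fixed at zero); the function name and the example covariate values are assumptions, not part of the original analysis.

```python
import math


def class_probabilities(w, gammas):
    """Return (pi_1, ..., pi_K) with pi_k proportional to exp(w' gamma_k).

    w is the covariate vector for one subject; gammas is a list of K
    coefficient vectors whose first entry is all zeros (reference class).
    """
    etas = [sum(wj * gj for wj, gj in zip(w, g)) for g in gammas]
    shift = max(etas)  # subtract the maximum for numerical stability
    weights = [math.exp(eta - shift) for eta in etas]
    total = sum(weights)
    return [v / total for v in weights]


# Hypothetical subject: intercept, female indicator, employee indicator.
w_i = [1.0, 1.0, 0.0]
gammas = [[0.0, 0.0, 0.0], [0.5, 0.8, -0.2], [-1.0, 0.3, 0.4]]
pi_i = class_probabilities(w_i, gammas)
```

Fixing the first coefficient vector at zero makes class 1 the reference category, so the remaining coefficients are log odds relative to class 1.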
The latent class two-part model is quite general in that it allows the fixed effects and
random effect covariances to differ across classes. For the special case when K = 1, the model
reduces to a Bayesian version of the standard two-part model for semi-continuous data (cf.
Tooze et al., 2002). For $K \geq 2$ classes, the model introduces two levels of between-subject
heterogeneity: one induced by the latent classes, and a second represented by the within-class
covariances Σk. Our model can therefore be viewed as a two-part growth mixture model.
Note that $\Sigma_k$ may vary across classes. For example, for some classes, $b_{1i}$ and $b_{2i}$ may be
positively correlated, while for others they may be negatively correlated or even uncorrelated.
In fact, the structure of $\Sigma_k$ can itself vary across classes. For instance, in one class, there
may be no particular structure (unstructured covariance), while in another, an exchangeable
or an AR1 structure may be more suitable.
3. Prior Specification, Posterior Computation, and Model Selection
3.1 Prior Specification
Under a fully Bayesian approach, prior distributions are assumed for all model parameters.
To ensure a well-identified model with proper posteriors determined almost entirely by
the data, we assign weakly informative proper distributions to all class-specific parameters
$\{\alpha_k, \beta_k, \tau_k^2, \Sigma_k, \gamma_k\}$. For the fixed effects, we assume exchangeable normal priors:
$\alpha_k \sim N_{p_1}(\mu_\alpha, \Sigma_\alpha)$ and $\beta_k \sim N_{p_2}(\mu_\beta, \Sigma_\beta)$. We assume that the prior hyperparameters are identical
across classes, but this is not necessary. Each $\Sigma_k$ is assumed to have a conjugate inverse-Wishart
$\mathrm{IW}(\nu_0, D_0)$ ($\nu_0 \geq k$) distribution. Our experience suggests that the conjugate IW
prior performs well in zero-inflated models with low-dimensional random effect covariance
matrices (cf. Neelon, O’Malley, and Normand, 2010).
For the lognormal precisions, $\tau_k^{-2}$, we assume conjugate $\mathrm{Ga}(\lambda, \delta)$ priors. Following Garrett
and Zeger (2000) and Elliott et al. (2005), we recommend $\gamma_k \sim N_r[0, (9/4) I_r]$, which
induces a prior for $\pi_{ik}$ centered at $1/K$ and bounded away from 0 and 1. If there are no
class-membership covariates (i.e., $r = 1$), a conjugate $\mathrm{Dirichlet}(e_1, \ldots, e_K)$ prior can be
placed directly on the class-membership probabilities $\pi = (\pi_1, \ldots, \pi_K)'$, which can lead to
convenient closed-form full conditionals. Frühwirth-Schnatter (2006) recommends choosing
$e_k > 1$ to bound $\pi_k$ away from zero.
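To see what the recommended normal prior with variance 9/4 implies for the class probabilities, one can simulate from it. The sketch below covers only the simplest case, assumed here for illustration: two classes and an intercept-only membership model, so that the class-2 probability is the inverse logit of a single coefficient. The number of draws and the seed are arbitrary.

```python
import math
import random


def induced_class2_probability(n_draws=20000, sd=1.5, seed=0):
    """Draw gamma_2 ~ N(0, sd^2) and map each draw to pi_2 = expit(gamma_2)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        gamma2 = rng.gauss(0.0, sd)  # prior variance 9/4 means sd = 3/2
        draws.append(1.0 / (1.0 + math.exp(-gamma2)))
    return draws


draws = induced_class2_probability()
prior_mean = sum(draws) / len(draws)
# Share of prior mass within 0.01 of the boundaries 0 and 1.
extreme_share = sum(1 for p in draws if p < 0.01 or p > 0.99) / len(draws)
```

In this two-class case the induced prior is symmetric about 1/2, and very little prior mass falls close to 0 or 1, which is the behaviour the text describes.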
3.2 Posterior Computation
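Let $\theta_k = (\alpha_k', \beta_k', \tau_k^2)'$. Assuming prior independence, the corresponding joint posterior is
\[
p(\theta, \Sigma, \gamma, b, C \mid y) \propto \prod_{k=1}^{K} p(\alpha_k)\, p(\beta_k)\, p(\tau_k^{-2})\, p(\Sigma_k) \prod_{k=2}^{K} p(\gamma_k) \times \prod_{i=1}^{n} \prod_{k=1}^{K} \Big[ \pi_{ik}\, N_{q_1 + q_2}(b_i; 0, \Sigma_k) \prod_{j=1}^{n_i} f(y_{ij} \mid \theta_k, b_i) \Big]^{I(C_i = k)},
\]
where $f(y_{ij} \mid \theta_k, b_i)$ is given in equation (1) and $b = (b_1', \ldots, b_n')'$.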
For posterior computation, we propose an MCMC algorithm that combines draws from full
conditionals and Metropolis steps. After assigning initial values to the model parameters,
the algorithm iterates between the following steps:
(1) For $k = 2, \ldots, K$, update the vector $\gamma_k$ using a random-walk Metropolis step;
(2) Sample the class indicators $C_i$ ($i = 1, \ldots, n$) from a categorical distribution with posterior
probability vector $p_i = (p_{i1}, \ldots, p_{iK})$;
(3) For $k = 1, \ldots, K$, sample $\alpha_k$, $\beta_k$, $\tau_k^{-2}$, and $\Sigma_k$ from their full conditionals;
(4) Update $b_i$ using a random-walk Metropolis step.
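Step (2) can be sketched as follows. The exact full conditional is given in the Web Appendix, which is not reproduced here, so this sketch assumes the generic finite-mixture form: the posterior class probability is proportional to the prior class probability times the class-specific likelihood of the subject's data and random effects. The function names and inputs are hypothetical.

```python
import math
import random


def class_posterior(log_pi, log_lik):
    """Normalize log(pi_ik) + log-likelihood under class k into probabilities."""
    log_w = [a + b for a, b in zip(log_pi, log_lik)]
    shift = max(log_w)  # log-sum-exp trick avoids underflow
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def sample_class(probs, rng):
    """Draw a class label in 1..K from a categorical distribution."""
    u = rng.random()
    cumulative = 0.0
    for k, p in enumerate(probs, start=1):
        cumulative += p
        if u < cumulative:
            return k
    return len(probs)  # guard against rounding in the cumulative sum


rng = random.Random(42)
p_i = class_posterior([math.log(0.5), math.log(0.5)], [-1000.0, -1001.0])
c_i = sample_class(p_i, rng)
```

Working on the log scale matters in practice because the likelihood of a subject's full response vector can be far too small to represent directly.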
Details of the algorithm are provided in the Web Appendix. Convergence is monitored by
running multiple chains from dispersed initial values and then applying standard Bayesian
diagnostics, such as trace plots; autocorrelation statistics; Geweke’s (1992) Z-diagnostic,
which evaluates the mean and variance of parameters at various points in the chain; and the
Brooks-Gelman-Rubin scale-reduction statistic $\hat{R}$, which compares the within-chain variation
to the between-chain variation (Gelman et al., 2004). As a practical rule of thumb, a 0.975
quantile for $\hat{R} \leq 1.2$ is indicative of convergence. In the application below, convergence
diagnostics were performed using the R package boa (Smith, 2007).
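The comparison of within-chain and between-chain variation can be illustrated with the basic point estimate of the scale-reduction statistic. This is a simplified sketch for equal-length chains of a single scalar parameter; it does not compute the 0.975 quantile reported by the boa package, and the function name and toy chains are assumptions.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Point estimate of the potential scale reduction factor for one parameter."""
    n = len(chains[0])                            # draws per chain
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])  # average within-chain variance
    between = n * variance(chain_means)           # between-chain variance
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


# Toy chains: the first pair agrees, the second pair sits in different regions.
r_agree = gelman_rubin([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]])
r_apart = gelman_rubin([[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]])
```

Values well above 1 indicate that the chains have not yet mixed over the same region of the posterior.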
A well-known computational issue for Bayesian finite mixture models is “label switching”
in which draws of class-specific parameters may be associated with different class labels
during the course of the MCMC run. Consequently, class-specific posterior summaries that
average across the draws will be invalid. In some cases, label switching can be avoided by
placing constraints on the class probabilities (Lenk and DeSarbo, 2000) or on the model
parameters themselves (Congdon, 2005). However, as Frühwirth-Schnatter (2006) notes,
these constraints must be carefully chosen to ensure a unique labeling. She describes several
exploratory procedures useful for identifying appropriate constraints. As an alternative,
Stephens (2000) proposed a post-hoc relabeling algorithm that minimizes the Kullback-Leibler
distance between the posterior probability $p_{ij}$ that individual $i$ is assigned to class $j$
under the current labeling, and the posterior probability under the “true” labeling, estimated
as the posterior mean of $p_{ij}$. We apply Stephens’ approach in the case study below.
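The core step of such a relabeling scheme can be sketched for a single MCMC draw. This is a simplified illustration, not Stephens' full algorithm, which alternates between updating the reference probabilities and relabeling every draw until no labels change. The sketch assumes a fixed reference matrix and searches all permutations, which is only practical for a small number of classes; all names are hypothetical.

```python
import math
from itertools import permutations


def kl_cost(p_draw, q_ref, perm):
    """Kullback-Leibler distance from relabeled draw probabilities to the reference.

    perm[j] is the label in the current draw that is mapped to reference class j.
    """
    eps = 1e-12  # guards against log(0)
    cost = 0.0
    for p_i, q_i in zip(p_draw, q_ref):
        for j, src in enumerate(perm):
            p = p_i[src]
            cost += p * math.log((p + eps) / (q_i[j] + eps))
    return cost


def best_permutation(p_draw, q_ref):
    """Search all K! label permutations for the one with the smallest cost."""
    n_classes = len(q_ref[0])
    return min(permutations(range(n_classes)),
               key=lambda perm: kl_cost(p_draw, q_ref, perm))


def relabel(p_draw, perm):
    """Reorder the class probabilities of each subject according to perm."""
    return [[p_i[src] for src in perm] for p_i in p_draw]


# Two subjects, two classes; the draw has its labels swapped relative to the reference.
q_ref = [[0.9, 0.1], [0.2, 0.8]]
p_draw = [[0.1, 0.9], [0.8, 0.2]]
perm = best_permutation(p_draw, q_ref)
```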
3.3 Determining the Number of Classes
To determine the number of latent classes, we adopt a model selection approach and use
the deviance information criterion (DIC) proposed by Spiegelhalter et al. (2002) to compare
models under various fixed values of K (K = 1,...,Kmax). This approach has been applied
in several previous studies involving latent class models (Elliott et al., 2005; White et al.,
2008; Leiby et al., 2009).
Like Akaike’s information criterion (AIC), the DIC provides an assessment of model fit as
well as a penalty for model complexity. The DIC is defined as $\bar{D}(\theta) + p_D$, where
$\bar{D}(\theta) = E[D(\theta) \mid y]$ is the posterior mean of the deviance, $D(\theta)$, and
$p_D = \bar{D}(\theta) - \hat{D}(\theta) = E[D(\theta) \mid y] - D(E[\theta \mid y])$ is the difference in the posterior mean of the deviance and the deviance evaluated
at the posterior mean of the parameters. The deviance, typically taken as negative twice
the log-likelihood, is a measure of the model’s relative fit, whereas $p_D$ is a penalty for the
model’s complexity. For fixed effect models, the complexity—as measured by the number
of model parameters—is easily determined. For random effect models, the dimension of the
parameter space is less clear and depends on the degree of heterogeneity between subjects
(more heterogeneity implies more “effective” parameters). The $p_D$ term was proposed to estimate the
number of effective parameters in a Bayesian hierarchical model.
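Given MCMC output, the original DIC reduces to simple arithmetic on deviance values. The sketch below assumes the deviance has already been evaluated at each retained draw and at the posterior mean of the parameters; the function name and the toy numbers are illustrative.

```python
def dic_summary(deviance_draws, deviance_at_posterior_mean):
    """Return the posterior mean deviance, effective parameters p_D, and DIC."""
    d_bar = sum(deviance_draws) / len(deviance_draws)  # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean           # complexity penalty
    return {"D_bar": d_bar, "p_D": p_d, "DIC": d_bar + p_d}


# Toy deviance values, not taken from the FEHB analysis.
summary = dic_summary([10.0, 12.0, 14.0], deviance_at_posterior_mean=11.0)
```

Here the posterior mean deviance is 12, so the penalty is 12 - 11 = 1 and the DIC is 12 + 1 = 13.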
As a rule of thumb, if two models differ in DIC by more than three, the one with the
smaller DIC is considered the best fitting (Spiegelhalter et al., 2002). For finite mixture
models, Celeux et al. (2006) recommend a modified DIC, termed DIC$_3$, which estimates
$\hat{D}(\theta)$ using the posterior mean of the marginal likelihood averaged across the classes, a
measure invariant to label switching. Specifically,
\[
\mathrm{DIC}_3 = 2\bar{D}(\theta) + 2 \log \prod_{i=1}^{n} \hat{f}(y_i),
\]
where $\hat{f}(y_i)$ is the posterior mean of the marginal likelihood contribution for subject $i$
averaged across the classes. As Celeux et al. (2006) point out, DIC$_3$ is closely related to
the measure proposed by Richardson (2002) to avoid overfitting the number of components.
In the application below, we use a hybrid DIC that combines DIC$_3$ with the original DIC
measure: for the fixed effects, we average across classes, as in DIC$_3$; for the within-class
random effects, we condition on the posterior draws, as in the original DIC. This approach
preserves the conditional nature of the model, and provides a penalty for the effective
number of random effect parameters. The approach can also be viewed as a natural extension
of the one-class DIC measure provided in standard Bayesian software, such as WinBUGS
(Spiegelhalter et al., 2003).
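The DIC3 measure of Celeux et al. (2006) described above can be computed from per-subject marginal log-likelihood values saved at each draw. The following sketch covers plain DIC3 only, not the hybrid measure used in the application, and assumes the class-averaged log-likelihood contributions have already been evaluated; the function name and input layout are hypothetical.

```python
import math


def dic3(loglik_draws):
    """DIC3 from loglik_draws[t][i], the log marginal likelihood of subject i at draw t."""
    n_draws = len(loglik_draws)
    n_subjects = len(loglik_draws[0])
    # Posterior mean deviance: average of -2 * total log-likelihood over draws.
    d_bar = sum(-2.0 * sum(row) for row in loglik_draws) / n_draws
    # Log of the product over subjects of the posterior mean likelihood.
    log_f_hat = 0.0
    for i in range(n_subjects):
        f_hat_i = sum(math.exp(row[i]) for row in loglik_draws) / n_draws
        log_f_hat += math.log(f_hat_i)
    return 2.0 * d_bar + 2.0 * log_f_hat


# Toy input: two draws with identical likelihood values for two subjects.
rows = [[math.log(0.5), math.log(0.25)], [math.log(0.5), math.log(0.25)]]
value = dic3(rows)
```

When every draw gives the same likelihood, the penalty vanishes and DIC3 equals the deviance itself. With real data the exponentiation may underflow, so a log-sum-exp formulation would be a safer choice.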
3.4 Assessment of the Final Model Fit
To assess the adequacy of the selected model, we use posterior predictive checking (Gelman,
Meng, and Stern, 1996), whereby the observed data are compared to data replicated from
the posterior predictive distribution. If the model fits well, the replicated data, $y^{rep}$, should
resemble the observed data, y. To quantify the similarity, we can choose a discrepancy
measure, T = T(y,θ), that takes an extreme value if the model is in conflict with the
observed data. Popular choices for T include sample moments and quantiles, and residual-
The Bayesian predictive p-value ($P_B$) denotes the probability that the discrepancy measure
based on the predictive sample, $T^{rep} = T(y^{rep}, \theta)$, is more extreme than the observed measure
$T$. A Monte Carlo estimate of $P_B$ can be computed by evaluating the proportion of draws
in which $T^{rep} > T$. A p-value close to 0.50 represents adequate model fit, while p-values near
0 or 1 indicate lack of fit. The cut-off for determining lack of fit is subjective, although by
analogy to the classical p-value, a Bayesian p-value between 0.05 and 0.95 suggests adequate
fit. In some cases, a stricter range, such as (0.20, 0.80), might be more appropriate.
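The Monte Carlo estimate of the predictive p-value is a simple proportion. The sketch below uses the proportion of nonzero observations as the discrepancy measure, as an illustrative choice; the function names and toy data are assumptions.

```python
def proportion_positive(y):
    """Discrepancy measure: share of observations greater than zero."""
    return sum(1 for v in y if v > 0) / len(y)


def predictive_p_value(t_obs, t_rep):
    """Share of replicated discrepancies that exceed the observed one."""
    return sum(1 for t in t_rep if t > t_obs) / len(t_rep)


# Toy data: observed spending for five enrollee-years and four replicated statistics.
t_observed = proportion_positive([0.0, 0.0, 3.5, 0.0, 1.2])
p_b = predictive_p_value(t_observed, [0.3, 0.5, 0.45, 0.2])
```

In this toy case two of the four replicated values exceed the observed proportion of 0.4, giving a p-value of 0.5.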
For the latent class two-part model, we recommend two test statistics to assess the fit of
both the binomial and lognormal components. For the binomial component, we recommend
$T_1$ = the proportion of observations greater than zero. For the nonzero observations, we
suggest a modification of the omnibus chi-square measure proposed by Gelman et al. (2004):
\[
T_2 = \frac{1}{M} \sum_{i,j:\, y_{ij} > 0} \frac{[\log(y_{ij}) - \mu_{ijk}]^2}{\tau_k^2},
\]
where, for the random intercept model, $\mu_{ijk} = x_{ij}' \beta_k + b_{2i}$ and $M$ denotes the number
of nonzero observations. To generate replicate data, we first draw replicate class indicators
$C_i^{rep}$ ($i = 1, \ldots, n$) using expression (2); then, conditional on $C_i^{rep}$, we generate
$b_i^{rep} \sim N_2(0, \Sigma_k)$; finally, we draw $y_{ij}^{rep}$ from (1). An alternative approach is to use the actual
posterior draws of $C_i$ and $b_i$; however, this approach does not mimic the data-generating
process as accurately as the former approach. That said, for the analysis presented in Section
5, the two approaches yield similar results.
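The three-stage replicate-generation procedure can be sketched for one subject under a probit-lognormal random-intercept model. This is an illustrative sketch, not the authors' code: the parameter dictionary keys, the parameterization of the covariance by two standard deviations and a correlation, and all numerical values are assumptions.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal CDF, used here as the inverse probit link."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def draw_bivariate_normal(sd1, sd2, rho, rng):
    """Draw (b1, b2) from N2(0, Sigma) with standard deviations sd1, sd2 and correlation rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    b1 = sd1 * z1
    b2 = sd2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return b1, b2


def replicate_subject(x_rows, class_probs, params, rng):
    """Generate one replicate response vector for a single subject."""
    # Stage 1: replicate class indicator from the class-membership probabilities.
    k = rng.choices(range(len(class_probs)), weights=class_probs)[0]
    p = params[k]
    # Stage 2: replicate random intercepts from the class-specific covariance.
    b1, b2 = draw_bivariate_normal(p["sd1"], p["sd2"], p["rho"], rng)
    # Stage 3: replicate responses from the two-part model.
    y_rep = []
    for x in x_rows:
        phi = std_normal_cdf(sum(a * v for a, v in zip(p["alpha"], x)) + b1)
        mu = sum(b * v for b, v in zip(p["beta"], x)) + b2  # mean of log-spending
        used_services = rng.random() < phi
        y_rep.append(math.exp(rng.gauss(mu, p["tau"])) if used_services else 0.0)
    return y_rep


# Hypothetical design: intercept plus three year indicators, two classes.
rows = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]
low = {"alpha": [-1.5, 0.1, 0.1, 0.2], "beta": [5.0, 0.0, 0.1, 0.0],
       "tau": 1.0, "sd1": 0.5, "sd2": 0.5, "rho": -0.3}
high = {"alpha": [1.0, 0.0, 0.1, 0.1], "beta": [7.0, 0.1, 0.1, 0.1],
        "tau": 0.8, "sd1": 0.7, "sd2": 0.6, "rho": 0.4}
y_rep = replicate_subject(rows, [0.7, 0.3], [low, high], random.Random(1))
```

Repeating this for every subject at each retained posterior draw yields the replicated datasets on which the discrepancy measures are evaluated.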
4. Simulation Study
To examine the properties of the proposed model, we conducted a small simulation study.
First, we simulated 100 datasets from a three-class model according to equation (1). The
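datasets contained 500 subjects, each with five observations, for a total of 2,500 observations
per dataset. The binomial and lognormal components each contained class-specific fixed-effect
intercepts, a fixed-effect linear time trend, and random intercepts. That is, $\alpha_k = (\alpha_{k1}, \alpha_{k2})'$,
$\beta_k = (\beta_{k1}, \beta_{k2})'$, and, given $C_i = k$, $b_i = (b_{1i}, b_{2i})' \sim N_2(0, \Sigma_k)$, with
\[
\Sigma_k = \begin{pmatrix} \sigma_{1k}^2 & \rho_k \sigma_{1k} \sigma_{2k} \\ \rho_k \sigma_{1k} \sigma_{2k} & \sigma_{2k}^2 \end{pmatrix}, \quad k = 1, 2, 3.
\]
We also allowed the class membership probabilities to include an intercept and a single
covariate $w_i$; hence, $\gamma_2 = (\gamma_{21}, \gamma_{22})'$, $\gamma_3 = (\gamma_{31}, \gamma_{32})'$, and $\gamma_1 = (0, 0)'$ for identifiability.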
Next, we fitted one to four class models to each of the 100 datasets and compiled the results.
Web Table 1 presents the DIC statistics for each of the four fitted models. As expected, the
average DIC across the 100 simulations was lowest for the three-class fitted model (i.e., the
true model). Moreover, the three-class model had the lowest DIC values for each of the 100
datasets, followed in general by the two-class model, which had the second-lowest DIC in 97
of the 100 simulations. For the most part, the four-class model had the highest DIC score,
alleviating concerns that the hybrid DIC measure proposed in Section 3.3 overestimates the
number of classes.
Web Table 3 provides summary statistics for the three-class model parameters. Column
1 presents the estimated class percentages, averaged across the 100 simulations. These were
identical (up to two decimal places) to the true class percentages of 31%, 26%, and 43% for
classes 1, 2, and 3, respectively. Column 5 presents the average posterior estimates across
the 100 simulations. The bias was extremely low for all parameters, including the random
effect variance components. The coverage rates ranged from 0.91 to 0.99, but for the most
part, were close to the nominal value of 0.95. Variability in coverage rates was likely due to
the size of the simulation.
5. Assessing the Impact of Mental Health and Substance Abuse Parity
To analyze the FEHB data described in the introduction, we fitted a series of two-part
growth mixture models, allowing the number of classes, $K$, to range from one to four. For
each value of $K$, we fit a fixed effects model, a model with uncorrelated random intercepts, a
model with correlated random intercepts, and a model with random intercepts for each
component and a random slope for the lognormal component. We also fitted a model with
an additional random slope for the binomial component (i.e., four random effects), but the
model was poorly identified and failed to converge according to standard MCMC diagnostics.
We consider identifiability issues related to this model further in the Discussion section.
Within each class, we assumed a probit-lognormal two-part model as in equations (1) and
(2). For both components, the fixed-effect covariate vector $x_{ij}$ comprised an intercept term
and three dummy indicators representing years 2000–2002. Because our study included only
four measurement occasions, we chose to model time categorically to allow for maximum
flexibility in capturing the time trend. Alternative parameterizations of the time trend—
such as polynomials or splines—may be appealing in other settings, particularly if there are
a large number of time points. For $K \geq 2$ classes, we allowed gender and employee status
to serve as class-membership covariates; specifically, $w_i$ in equation (2) represented a $3 \times 1$
vector consisting of an intercept and indicator variables for female gender and employee vs.
dependent status. To investigate the impact of between-subject heterogeneity, we compared
fixed effects models to models with correlated and uncorrelated random intercepts.
The models were fitted in R version 2.8 (R Development Core Team, 2008) using MCMC
code developed by the authors. For each model, we ran three initially dispersed MCMC
chains for 200,000 iterations each, discarding the first 50,000 as a burn-in to ensure that a
steady-state distribution had been reached. We retained every 50th draw to reduce autocor-
relation. Run times ranged from six to 12 hours depending on the number of classes. MCMC
diagnostics, such as trace plots, Geweke Z-statistics (Geweke, 1992), and Brooks-Gelman-
Rubin scale reduction statistics (Gelman et al., 2004), were used to assess convergence
of the chains. There was little evidence of label switching within individual chains, and
Stephens’ (2000) relabeling algorithm tended to converge rapidly. In some cases, the class
labels required reordering across chains, but the proper order was easily identified in each
For model comparison, we used the hybrid DIC measure proposed in Section 3.3. The
results are presented in Table 2. For each value of $K$, the correlated random intercept model had
the lowest DIC value. Overall, the three-class model with correlated random intercepts was
preferred, followed by the two- and four-class correlated models.
[Table 2 about here.]
Web Figure 1 presents post-burn-in trace plots for four representative parameters from
the 3-class random intercepts model: $\alpha_{22}$ (change in log odds use at year 2 compared to
year 1, class 2); $\beta_{22}$ (increase in log-spending at year 2 for class 2); $\gamma_{22}$ (log odds of class-two
membership, female vs. male); and $\rho_2$ (class-2 random effect correlation). For clarity
of presentation, we have graphed only two of the three MCMC chains. The overlapping
trajectory lines suggest convergence and efficient mixing of the chains. The Geweke Z-diagnostic
p-values ranged from 0.35 ($\beta_{22}$) to 0.64 ($\alpha_{22}$), indicating no significant difference in
posterior means across regions of the chains; the 0.975 quantiles of the Brooks-Gelman-Rubin
statistic were each less than 1.04, again indicating convergence of the chains. However, we
did observe modest autocorrelation in the chains: the lag-10 autocorrelations ranged from
0.01 for $\alpha_{22}$ to 0.16 for $\rho_2$.
Table 3 presents the posterior means and 95% posterior intervals for the three-class model.
[Table 3 about here.]
A few general trends are worth noting: