
MULTIPLE IMPUTATION OF INCOMPLETE CATEGORICAL DATA USING LATENT CLASS ANALYSIS

Jeroen K. Vermunt*

Joost R. van Ginkel†

L. Andries van der Ark*

Klaas Sijtsma*

We propose using latent class analysis as an alternative to log-linear analysis for the multiple imputation of incomplete categorical data. Similar to log-linear models, latent class models can be used to describe complex association structures between the variables used in the imputation model. However, unlike log-linear models, latent class models can be used to build large imputation models containing more than a few categorical variables. To obtain imputations reflecting uncertainty about the unknown model parameters, we use a nonparametric bootstrap procedure as an alternative to the more common full Bayesian approach. The proposed multiple imputation method, which is implemented in Latent GOLD software for latent class analysis, is illustrated with two examples. In a simulated data example, we compare the new method to well-established methods such as maximum likelihood estimation with incomplete data and multiple imputation using a saturated log-linear model. This example shows that the proposed method yields unbiased parameter estimates and standard errors. The second example concerns an application using a typical social sciences data set. It contains 79 variables that are all included in the imputation model. The proposed method is especially useful for such large data sets because standard methods for dealing with missing data in categorical variables break down when the number of variables is so large.

We would like to thank Paul Allison and Jay Magidson, as well as the editor and the three anonymous reviewers, for their comments, which very much helped to improve this article. We would also like to thank Greg Richards for providing the data of the ATLAS Cultural Tourism Research Project, 2003.

*Tilburg University
†Leiden University

1. INTRODUCTION

Multiple imputation (MI) has become a widely accepted method for

dealing with missing data problems (Rubin 1987:2–4). One of its attrac-

tive features is that it allows the handling of the missing data problem

prior to the actual data analysis; that is, once the missing values are

replaced by imputed values, the statistical analyses of interest can be

performed using standard techniques. Another attractive feature of MI

is that, contrary to single imputation, the multiply imputed versions of

a data set reflect uncertainty about the imputed values, which is a requirement for obtaining unbiased standard errors in statistical analyses.

MI requires the specification of an imputation model, the exact choice of which will typically depend on the scale types of the variables in the data set. For (approximately) continuous variables, the most widely used imputation model is the multivariate normal model (Schafer 1997),

which is available in SAS PROC MI (Yuan 2000) and in the missing-

data library of S-plus (2006), as well as in stand-alone programs such as

NORM (Schafer 1999) and AMELIA (King et al. 2001; Honaker, King, and Blackwell 2007). Graham and Schafer (1999) showed that MI under

the multivariate normal model is rather robust to violations of normal-

ity. The most appropriate imputation model for categorical variables is

the log-linear model (Schafer 1997), which is also implemented in the

missing-data library of S-plus (2006). For data sets containing both cat-

egorical and continuous variables, Schafer (1997) proposed imputation

using the general location model, a combination of a log-linear and a

multivariate normal model implemented in the missing-data library of

S-plus.

MI based on log-linear modeling provides an elegant and

sound solution for many missing-data problems concerning categorical


variables. This was confirmed in simulation studies by Ezzati-Rice et al.

(1995), Schafer et al. (1996), and Schafer (1997), who showed that log-

linear imputation yields unbiased statistical inference, and it is robust

against departures from the assumed imputation model. The main lim-

itation of MI under the log-linear model is, however, that it can be ap-

plied only when the number of variables used in the imputation model

is small—that is, only when we are able to set up and process the full

multi-way cross-tabulation required for the log-linear analysis. Whereas

social science data sets with 100 variables or more are very common,

it is impossible to estimate a log-linear model for say a frequency table

cross-classifying 100 trichotomous variables: The resulting table with

3^100 (≈ 5.15 × 10^47) cell entries is much too large to be stored and processed. Note that the necessity to process each cell in the maximum

likelihood estimation of log-linear models holds even if the specified

model is very restricted—for example, if the model contains only two-

and three-way association terms. An exception is the situation in which

the log-linear model is collapsible (Agresti 2002), but it is unlikely that

one will use such a model for imputation.

A first possible solution to the problem of a limited number of

variables associated with the log-linear approach is to ignore the cate-

gorical nature of the variables and use an imputation model for contin-

uous data instead, where discrete imputed values may be obtained by

rounding the non-integer imputed values to the nearest feasible integer.

Van Ginkel, Van der Ark, and Sijtsma (2007a; 2007b) found that MI

under the multivariate normal model with rounding produces reliable

results in discrete (ordinal) psychological test data—for example, in the

estimation of Cronbach’s alpha. Other authors, however, showed that

rounding continuous imputations to the nearest admissible integer val-

ues may lead to serious bias (Allison 2005; Horton, Lipsitz, and Parzen

2003), especially if the variables concerned are used as independent variables in a regression analysis. This was confirmed by Bernaards, Belin,

and Schafer (2007), who showed that the bias may be reduced by using

a more sophisticated (adaptive) rounding procedure. Despite the fact

that this approach may sometimes work well with dichotomous and

ordinal categorical variables, it is clearly much more problematic when

used with nominal variables.

A second possible solution is to use hot-deck imputation (Rubin

1987:9) rather than a statistical imputation model. This nonparamet-

ric imputation method involves a search for complete cases that have


(almost) the same values on the observed variables as the case with

missing values, and then imputes the missing values of the latter by

drawing from the empirical distribution defined by the former. Hot-

deck imputation, which is available in the SOLAS program (2001), can

be used for data sets containing large numbers of categorical variables.

Whereas the standard hot-deck does not yield proper imputation, a variant called approximate Bayesian bootstrap (Rubin and Schenker 1986) does. However, Little and Rubin (2002:69) indicated the following about the nearest neighbor hot-deck procedure: “Since imputed values are relatively complex functions of the responding items, quasi-randomization

properties of estimates derived from such matching procedures remain

largely unexplored.” This means that it is difficult to demonstrate that

estimates will be unbiased under the missing at random (MAR) as-

sumption. A simulation study by Schafer and Graham (2002) showed

that hot-deck imputation may produce biased results irrespective of the

missing data mechanism.

A third possible solution is to use one of the recently proposed

sequential regression imputation methods, which include the MICE and

ICE methods (Van Buuren and Oudshoorn 2000; Raghunathan et al.

2001; Van Buuren et al. 2006). Rather than specifying a model for the

joint distribution of the variables involved in the imputation, the imputation model consists of a series of models for the univariate conditional

distributions of the variables with missing values. For categorical vari-

ables, this will typically be a series of logistic regression equations. We

found that for large numbers of variables the specification of the series of imputation models is rather difficult and time-consuming. Especially

problematic is that, unlike log-linear imputation, sequential imputation

does not pick up higher-order interactions, unless these are included

explicitly in the imputation model. This means that it is likely that one

misses important interactions that may seriously bias subsequent anal-

yses. Also, unlike the log-linear approach, sequential regression impu-

tation methods lack a strong statistical underpinning; that is, there is

no guarantee that iterations will converge to the posterior distribution

of the missing values.

Alternatively, we propose using the latent class (LC) model as an

imputation model for categorical data. It is a statistically sound cate-

gorical data method that resolves the most important limitation of the

log-linear approach; that is, it can be applied to data sets containing

more than a few variables. An LC model can be viewed as a mixture


model of independent multinomial distributions. Mixture models have

been shown to be very flexible tools for density estimation because they

can be used to approximate any type of distribution by choosing the

number of mixture components (latent classes) sufficiently large (e.g.,

McLachlan and Peel 2000:11–14). Using the LC model as an imputa-

tion model resembles the use of LC models by Vermunt and Magidson

(2003), who proposed using the LC model as a prediction or classifica-

tion tool. As is explained in more detail below, the local independence

assumption (Lazarsfeld 1950a, 1950b; Goodman, 1974) makes it pos-

sible to obtain maximum likelihood estimates of the parameters of LC

models for large numbers of categorical variables with missing values.

Multiple imputations should reflect uncertainty about not only

the missing values but also the unknown parameters of the imputation

model. This parameter uncertainty is typically dealt with by using a full

Bayesian approach: random imputations are based on random draws of

the parameters from their posterior distribution (Rubin 1987; Schafer

1997). An alternative that allows staying within a frequentist framework

is to use the nonparametric bootstrap, as is done in the Amelia II soft-

ware (King et al. 2001; Honaker et al. 2007). This is also the approach

used in this article and implemented in the syntax version of the Latent GOLD program (Vermunt and Magidson 2008).

The remainder of this article will focus on the following: First,

we introduce the basic principles of MI. Second, we discuss MI using

LC analysis (from here on abbreviated as LC MI) and also discuss issues such as dealing with parameter uncertainty, model selection, and model identifiability. Third, a constructed example is presented comparing LC MI with complete-case analysis, maximum likelihood estimation with incomplete data, and log-linear MI, as well as demonstrating the validity

of LC MI. Fourth, a large real-data example is discussed that illustrates

LC MI for a situation in which log-linear MI is no longer feasible.

We offer conclusions about the proposed LC MI approach and discuss

possible extensions of this tool.

2. THE BASIC PRINCIPLE OF MULTIPLE IMPUTATION

This section describes the basic principles of MI. Let us first introduce

the relevant notation. Let Y denote the N × J data matrix of interest

with entries yij, where N is the number of cases and J the number of


variables, with indices i and j such that 1 ≤ i ≤ N and 1 ≤ j ≤ J. In the

presence of missing data, the data matrix has an observed part and a

missing part. These two parts are denoted as Yobs and Ymis, respectively,

where Y = (Yobs, Ymis). The unknown parameters that govern Y are

collected in vector θ. Let R be a response-indicator matrix with entries

rij, where rij = 0 if the value of yij is missing and 1 otherwise. The

unknown parameters that govern R are collected in vector ξ.

The basic idea of multiple imputation is to construct multiple,

say M, complete data sets by random imputation of the missing values

in Ymis. The researcher interested in a particular analysis model—say

a linear regression model—can estimate the analysis model with each

of these M complete data sets using standard complete data methods.

The M results can be combined into a single set of estimates and stan-

dard errors reflecting the uncertainty about the imputed missing values

(Rubin 1987:76–79).
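The combining step can be sketched in a few lines. Below is a minimal illustration of Rubin's (1987) rules for a single scalar parameter; the function name and the numbers in the usage example are invented for illustration:

```python
import numpy as np

def pool_mi(estimates, variances):
    """Combine M point estimates and their squared standard errors
    using Rubin's (1987) rules for multiple imputation."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    q_bar = estimates.mean()                 # pooled point estimate
    w_bar = variances.mean()                 # within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    t = w_bar + (1 + 1 / M) * b              # total variance
    return q_bar, np.sqrt(t)

# Example: a regression coefficient estimated on M = 5 imputed data sets
est, se = pool_mi([0.52, 0.48, 0.55, 0.50, 0.45],
                  [0.01, 0.01, 0.01, 0.01, 0.01])
```

The pooled estimate is simply the average of the M estimates; the total variance adds the average within-imputation variance to an inflated between-imputation variance, which is what makes the resulting standard errors reflect imputation uncertainty.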

What is needed to construct multiply imputed data sets is an

imputation model: a model for the joint distribution of the response

indicators and the survey variables in the data set, which is denoted as

P(R,Yobs,Ymis;θ,ξ). When defining a model for P(R,Yobs,Ymis;θ,ξ), one

typically separates the model for Y from the model for the missing data

mechanism. This is achieved by the following decomposition:

P(R,Yobs,Ymis;θ,ξ) = P(Yobs,Ymis;θ)P(R|Yobs,Ymis;ξ),

(1)

where P(Yobs,Ymis;θ) is the marginal distribution of the survey vari-

ables and P(R|Yobs,Ymis;ξ) is the conditional distribution of the re-

sponse indicators given the survey variables. It can easily be seen that

the decomposition in equation (1) transforms the definition of a model

for P(R,Yobs,Ymis;θ,ξ) into the definition of two submodels, one for

P(Yobs,Ymis;θ) and one for P(R| Yobs,Ymis;ξ). Specification and estima-

tion of the imputation model can be simplified further by making the

additional assumption that the probability of having a certain pattern

of missing values is independent of the variables with missing values

conditionally on the observed; that is,

P(R|Yobs,Ymis;ξ) = P(R|Yobs;ξ).

(2)

If equation (2) holds, then the missing data are said to be missing at random (MAR; Rubin 1976; Little and Rubin 2002:12). When in addition


the submodels for P(Yobs,Ymis;θ) and P(R|Yobs,Ymis;ξ) do not have common parameters, the submodel for the missing data mechanism can be

ignored when estimating the model for P(Yobs,Ymis;θ). The MAR assumption will be violated either when there are direct effects of variables with missing values on the response indicators after controlling for Yobs, or when there are variables that are not in the imputation model affecting both Ymis and R. In these cases, the missingness mechanism is not

MAR (i.e., it is NMAR), and the validity of the results from likelihood-

based methods cannot be guaranteed, unless the correct NMAR model

for the missingness mechanism is specified.

It may be noted that the MAR assumption becomes more plau-

sible when a larger number of variables are included in the imputation

model (Schafer 1997:28). If the set Yobs becomes larger, it becomes less likely that dependencies remain between R and Ymis after conditioning on Yobs (Schafer 1997:28). The main advantage of imputation methods compared to parameter estimation with incomplete data is that we

can put more effort into building a model that is in agreement with the

MAR assumption. Incomplete-data likelihood methods usually make

use of a smaller set of variables, typically only the variables needed in

the analysis model of interest.

To minimize the risk of bias, Schafer (1997:143) advocated using

an imputation model that is as general as possible—for example, an

unrestricted multivariate normal model or a saturated model. At worst,

standard errors of the parameters derived from MI may slightly increase when the imputation model contains associations that can be attributed to sampling fluctuations (Schafer 1997:140–44). On the other hand, if the imputation model is too restrictive, results may be biased because the MAR assumption is violated (Schafer 1997:142–43). For this reason,

Schafer recommended generating imputed values that are as much as

possible in accordance with the observed data, so that the imputed

values behave “neutral” in the subsequent statistical analyses.

The actual imputation of the missing values involves generating

random draws from the distribution P(Ymis|Yobs), which is defined as

follows (Rubin 1987; Schafer 1997):

P(Y_{mis} | Y_{obs}) = \int P(Y_{mis} | Y_{obs}; \theta) P(\theta | Y_{obs}) \, d\theta
                     = \int \frac{P(Y_{obs}, Y_{mis}; \theta)}{P(Y_{obs}; \theta)} P(\theta | Y_{obs}) \, d\theta.    (3)


The most popular way to perform this sampling is by Bayesian Markov

chain Monte Carlo methods (Schafer 1997:105; Tanner and Wong

1987), which involves a two-step procedure. In the first step, values

of θ are drawn from the posterior distribution of the parameters P(θ|

Yobs). These values of θ are used in the actual imputation step to obtain draws from P(Ymis|Yobs;θ). Since each imputed data set is based on new draws from P(θ|Yobs), the multiple imputations will reflect not only uncertainty about Ymis but also uncertainty about the parameters

of the imputation model, yielding what Rubin referred to as proper

imputations.

Schafer (1997:289–331) proposed using log-linear analysis as an

imputation tool for categorical data. The strong points of log-linear

models are that they yield an accurate description of P(Yobs,Ymis;θ) and

that they can easily be estimated with incomplete data. A serious limi-

tation of the log-linear modeling approach is, however, that it can only

be used with small numbers of variables, whereas imputation should

preferably be based on large sets of variables. To overcome this draw-

back, we propose using LC analysis as a tool for the imputation of

incomplete categorical data. The LC model is a categorical data model

that can be used to describe the relationships between the survey vari-

ables as accurately as needed. Parameter estimation of LC models does

not break down when the number of variables is large. Moreover, the

model parameters can be easily estimated in the presence of missing

data.

3. MULTIPLE IMPUTATION UNDER A LATENT CLASS MODEL

3.1. Latent Class Analysis with Incomplete Data

This section deals with MI using latent class models. We first intro-

duce latent class models for incomplete categorical data, where special

attention is paid to issues that are relevant when using these models

in the context of MI. Then we discuss model selection and the exact

implementation of the LC-based imputation procedure.

Let yi denote a vector containing the responses of person i on J categorical variables, xi a discrete latent variable, K the number of categories of xi or, equivalently, the number of latent classes, and k a


particular latent class (k = 1, ..., K). The model we propose for imputation is an unrestricted LC model, in which P(yi;θ), the joint probability density of yi, is assumed to have the following well-known form

(Lazarsfeld 1950a, 1950b; Goodman 1974; Vermunt and Magidson

2004):

P(y_i; \theta) = \sum_{k=1}^{K} P(x_i = k; \theta_x) P(y_i | x_i = k; \theta_y)
              = \sum_{k=1}^{K} P(x_i = k; \theta_x) \prod_{j=1}^{J} P(y_{ij} | x_i = k; \theta_{y_j}),    (4)

where θ = (θx, θy) or θ = (θx, θy1,...,θyj,...,θyJ). The indices in

θx, θy, and θyj indicate to which set of multinomial probabilities the

unknown parameters concerned belong. Equation (4) shows the two

basic assumptions of the latent-class model:

1. The density P(yi;θ) is a mixture—or weighted average—of class-specific densities P(yi|xi = k;θy), where the unconditional latent class proportions P(xi = k;θx) serve as weights.

2. Responses are independent within latent classes, such that the joint conditional density P(yi|xi = k;θy) equals the product of the J univariate densities P(yij|xi = k;θyj). Note that P(yij|xi = k;θyj) is the probability that person i provides response yij to variable j conditional on membership of class k. This is generally referred to as the local independence assumption.

By choosing the number of latent classes sufficiently large, like any type

of mixture model, an LC model will accurately pick up the first, sec-

ond, and higher-order observed moments of the J response variables

(McLachlan and Peel 2000:11–14). In the context of categorical vari-

ables, these moments are the univariate distributions, bivariate associa-

tions, and the higher-order interactions.

It is important to emphasize that we are using the LC model not

as a clustering or scaling tool but as a tool for density estimation—that

is, as a practical tool to obtain a sufficiently exact representation of the

true P(yi; θ), even for large J. This has the following implications:


1. Contrary to typical LC applications, there is no need to find interpretable latent classes or clusters. In fact, there is no need to interpret the parameters of the LC imputation model at all. This is not specific to LC analysis imputation: also when using a multivariate normal or a log-linear imputation model, the parameters will not be interpreted.

2. Overfitting the data is less of a problem than underfitting; that is, picking up certain random fluctuations that are sample specific is less problematic than ignoring important associations or interactions between the variables in the imputation model. Note that overfitting an LC model is similar to using a log-linear imputation model that includes nonsignificant parameters. This is likely to occur when using a saturated log-linear model. Underfitting an LC model is comparable to using a nonsaturated log-linear model in which important higher-order interactions are omitted.

3. It is well-known that LC models may be unidentified when the number of classes is large compared with the number of observed variables (for example, see Goodman 1974). Unidentifiability means that different values of θ yield the same P(yi;θ), which makes the interpretation of the θ parameters problematic. However, in the context of imputation, this is not a problem since we are interested only in P(yi;θ), which is uniquely defined even if the θ parameters are not.

4. For large K, a solution may be obtained that is a local instead of a global maximum of the incomplete data log-likelihood function. Even if we increase the number of start sets to say 100—as we did in our analysis with the automated starting values procedure of Latent GOLD—there is no guarantee that we will find the global maximum likelihood solution. Whereas this is problematic if we wish to interpret the model parameters, in the context of MI this does not seem to be a problem, especially because a local maximum will typically give a representation of P(yi;θ) that is nearly as good as the global maximum.

We will use a simulated data example to illustrate these issues below.

Equation (4) describes the LC model, neglecting the missing data. However, for parameter estimation using maximum likelihood, as well as for the derivation of P(Ymis|Yobs;θ) needed for the actual imputation

(see equation 3), the LC model must be expressed as a model for the

observed data density P(yi,obs; θ); that is,


P(y_{i,obs}; \theta) = \sum_{k=1}^{K} P(x_i = k; \theta_x) P(y_{i,obs} | x_i = k; \theta_y)
                    = \sum_{k=1}^{K} P(x_i = k; \theta_x) \prod_{j=1}^{J} \left[ P(y_{ij} | x_i = k; \theta_{y_j}) \right]^{r_{ij}},    (5)

where again rij = 0 if the value of yij is missing and 1 otherwise. Note that for a respondent with a missing value on variable j (i.e., rij = 0), [P(yij|xi = k;θyj)]^rij = 1, and thus the corresponding term cancels from

equation (5). This illustrates how missing values are dealt with in the

estimation of an LC model: Case i contributes to the estimation of the

unknown model parameters only for the variables that are observed.

Maximum likelihood (ML) estimates of the parameters of an

LC model can be obtained by maximizing the sum of the log of equa-

tion (5) across all cases—for example, by means of the EM algorithm

(Goodman 1974; Dempster, Laird, and Rubin 1977). Because of the local independence assumption, the problem is collapsible, which implies that the M step of the EM algorithm involves processing J two-way xi-by-yij cross-classifications. Thus, even for large J, ML estimation remains

feasible. Except for very specific situations, log-linear analysis, on the

contrary, requires processing the full J dimensional table.
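The estimation scheme just described can be sketched as follows for dichotomous items. This is an illustrative EM implementation under the stated assumptions (the function and variable names are ours), not the algorithm of any particular package: the E step uses only each case's observed items, as in equation (5), and the M step reduces to J weighted two-way x-by-y tables.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_lc(Y, K, n_iter=200):
    """Bare-bones EM for an unrestricted LC model with dichotomous items
    and missing entries coded np.nan."""
    N, J = Y.shape
    obs = ~np.isnan(Y)                       # r_ij response indicators
    pi = np.full(K, 1.0 / K)                 # P(x = k)
    p = rng.uniform(0.2, 0.8, size=(K, J))   # P(y_j = 1 | x = k)
    for _ in range(n_iter):
        # E step: posterior class membership from observed items only
        logpost = np.log(pi)[None, :].repeat(N, axis=0)
        for j in range(J):
            m = obs[:, j]
            yj = Y[m, j][:, None]
            logpost[m] += np.log(yj * p[:, j] + (1 - yj) * (1 - p[:, j]))
        logpost -= logpost.max(axis=1, keepdims=True)
        post = np.exp(logpost)
        post /= post.sum(axis=1, keepdims=True)
        # M step: class sizes, then one two-way table per item
        pi = np.clip(post.mean(axis=0), 1e-12, None)
        pi /= pi.sum()
        for j in range(J):
            m = obs[:, j]
            w = post[m]                      # weights of cases observed on j
            p[:, j] = (w * Y[m, j][:, None]).sum(axis=0) / w.sum(axis=0)
        p = np.clip(p, 1e-6, 1 - 1e-6)       # guard against log(0)
    return pi, p, post

# Tiny demo: two well-separated classes, roughly 10% of entries missing
true_z = rng.integers(0, 2, size=300)
Y = (rng.random((300, 4)) < np.where(true_z[:, None] == 0, 0.9, 0.1)).astype(float)
Y[rng.random(Y.shape) < 0.1] = np.nan
pi, p, post = em_lc(Y, K=2)
```

Note that the inner loops touch only J two-way tables per iteration, never the full J-dimensional cross-table, which is the collapsibility property described above.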

3.2. Model Selection

As was already mentioned, mixture models can approximate a wide

variety of distributions (McLachlan and Peel 2000:11–14). For LC-

based MI, this means that the imputation model will accurately approximate the distribution of yi by choosing K sufficiently large. Following Schafer’s (1997:140–44) advice to use an imputation model that

describes the data as accurately as possible, we should thus use a large

number of latent classes.

There is one particular question of interest: What is a sufficiently large value for K? We propose using the Bayesian information criterion (BIC; Schwarz 1978), Akaike’s information criterion (AIC; Akaike

1974), and a variant of AIC called AIC3 (Bozdogan 1993; Andrews

and Currim 2003), which are typically used for model selection in LC

analysis with sparse tables. These measures have in common that they

combine model fit (log-likelihood value, LL) and parsimony (number of parameters, Npar) into a single value. The model with the lowest BIC,

AIC, or AIC3 value is the preferred one. The three information criteria

are defined as follows:

BIC = −2LL + log(N) × Npar,

(6)

AIC = −2LL + 2 × Npar,

(7)

AIC3 = −2LL + 3 × Npar.

(8)

Equations (6), (7), and (8) show that the three criteria differ only

with respect to the weight attributed to parsimony. Because the log of

the sample size—log(N)—is usually larger than 3, BIC tends to select a

model with fewer latent classes than AIC and AIC3. Simulation studies

by Lin and Dayton (1997), Andrews and Currim (2003), and Dias (2004) showed that BIC tends to underestimate the number of classes, whereas AIC tends to select a model with too many classes. Andrews and Currim

(2003) and Dias (2004) also showed that for selecting K in LC models,

AIC3 provides a good compromise between BIC and AIC.
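Given the log-likelihood value, the number of parameters, and the sample size, the three criteria of equations (6) through (8) are immediate to compute; the fit statistics in the example below are hypothetical:

```python
import math

def info_criteria(LL, Npar, N):
    """BIC, AIC, and AIC3 as in equations (6)-(8)."""
    return {"BIC": -2 * LL + math.log(N) * Npar,
            "AIC": -2 * LL + 2 * Npar,
            "AIC3": -2 * LL + 3 * Npar}

# Hypothetical fit statistics for one candidate model on N = 10,000 cases
ic = info_criteria(LL=-35000.0, Npar=20, N=10000)
```

Since log(10,000) ≈ 9.2 is well above 3, BIC penalizes these 20 parameters about three times as heavily as AIC3, which is exactly why it tends to favor fewer classes.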

3.3. Imputation Procedure

Multiple imputation involves obtaining M draws from P(yi,mis|yi,obs)

(see equation 3). This requires obtaining draws from P(θ|yi,obs) and

subsequently from P(yi,mis|yi,obs; θ). As is also done in the AMELIA II

software (King et al. 2001; Honaker et al. 2007), we propose obtaining M

sets of parameters θ using a nonparametric bootstrap procedure. (For

a general introduction in the bootstrap procedure and an application

in LC analysis see Efron and Tibshirani [1993] and Dias and Vermunt

[forthcoming], respectively.) First, M nonparametric bootstrap samples from Y are obtained and denoted as Y∗_1, ..., Y∗_m, ..., Y∗_M. Second, for each bootstrap sample an LC model is estimated, resulting in M sets of parameters θ∗_1, ..., θ∗_m, ..., θ∗_M. For imputed data set m, we sample from P(yi,mis|yi,obs; θ = θ∗_m). In this way, we take the uncertainty of the parameter estimates into account.

In an LC model, the conditional distribution P(yi,mis|yi,obs;θ) can be written as


P(y_{i,mis} | y_{i,obs}; \theta) = \sum_{k=1}^{K} P(x_i = k, y_{i,mis} | y_{i,obs}; \theta)
                                = \sum_{k=1}^{K} P(x_i = k | y_{i,obs}; \theta) P(y_{i,mis} | x_i = k; \theta_y)
                                = \sum_{k=1}^{K} P(x_i = k | y_{i,obs}; \theta) \prod_{j=1}^{J} \left[ P(y_{ij} | x_i = k; \theta_{y_j}) \right]^{1 - r_{ij}}.    (9)

In the first row of equation (9), we introduce the discrete latent variable xi. The second row is obtained using the local independence assumption between yi,mis and yi,obs given xi. The last row again uses the local independence assumption, but now among the variables with missing values. It should be noted that P(xi = k|yi,obs;θ) is a subject’s latent classification probability for class k, which is part of the standard output of an LC analysis. These probabilities, which are also referred to as posterior class

membership probabilities, can be obtained as follows:

P(x_i = k | y_{i,obs}; \theta) = \frac{P(x_i = k; \theta_x) P(y_{i,obs} | x_i = k; \theta_y)}{P(y_{i,obs}; \theta)}.    (10)

The terms in equation (10) were defined in equation (5).

Equation (9) suggests how to sample from P(yi,mis|yi,obs; θ) in

LC MI; that is, in a first step, assigning a person randomly to one

of the K latent classes using P(xi= k|yi,obs; θ) (see Goodman 2007)

and subsequently, in a second step, sampling yi,mis conditionally on

the assigned class using P(yi,mis|xi= k; θ). This second step can be

performed separately for each variable with a missing value; that is, by

means of independent draws from univariate multinomial distributions

with probabilities P(yij|xi= k;θyj).
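The two-step draw can be sketched as follows for a toy fitted model (the class and response probabilities are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy fitted LC model: K = 2 classes, J = 3 dichotomous items
class_probs = np.array([0.5, 0.5])
resp_probs = np.array([[0.9, 0.8, 0.7],      # P(y_j = 1 | x = 1)
                       [0.1, 0.2, 0.3]])     # P(y_j = 1 | x = 2)

def impute_case(y):
    """Draw y_mis for one case via equation (9): first sample a class
    from the posterior P(x = k | y_obs) of equation (10), then sample
    each missing item from P(y_j | x = k). Missing entries are np.nan."""
    y = np.array(y, dtype=float)
    obs = ~np.isnan(y)
    # posterior class membership based on observed items only
    lik = np.prod(np.where(y[obs] == 1, resp_probs[:, obs],
                           1 - resp_probs[:, obs]), axis=1)
    post = class_probs * lik
    post /= post.sum()
    k = rng.choice(len(class_probs), p=post)   # step 1: assign a class
    miss = ~obs
    # step 2: independent univariate draws for the missing items
    y[miss] = (rng.random(miss.sum()) < resp_probs[k, miss]).astype(float)
    return y

completed = impute_case([1.0, np.nan, 1.0])
```

Because step 2 factorizes over items given the drawn class, the same routine scales to any number of variables with missing values.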

The possibility to perform the second step for each variable sep-

arately shows that the LC imputation method is applicable to large

data sets and has the additional advantage that it can be implemented

using standard software for LC analysis with missing data, such as

Latent GOLD 4.0 (Vermunt and Magidson 2005), LEM (Vermunt

1997), Mplus (Muthén and Muthén 2006), and the R library poLCA (Linzer and Lewis 2007). Each of these packages provides both the latent

classification probabilities P(xi= k|yi,obs; θ) and the response proba-

bilities P(yij|xi= k;θyj) as output. Using these probabilities one can


draw random values for the missing data with pseudo random number

generators that are readily available in general statistical packages. The

new syntax module of Latent GOLD 4.5 implements the procedure described above, including the nonparametric bootstrap, fully automatically. We provide more details about the implementation of LC MI in

Latent GOLD 4.5 in the appendix.
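Putting the pieces together, the bootstrap-based procedure has the following overall shape. To keep the sketch short and self-contained, the LC estimation step is replaced by a trivial one-class (independence) fit; in an actual application the `fit_lc` stand-in would be a K-class EM routine as described above:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_lc(Y):
    """Stand-in for the LC estimation step: a 1-class model (item means
    computed over observed entries), so the bootstrap structure stays
    visible. In practice this would be a K-class EM fit."""
    return np.nanmean(Y, axis=0)               # P(y_j = 1) per item

def impute_with(theta, Y):
    """Fill the missing entries of Y by drawing from the fitted model."""
    Yc = Y.copy()
    miss = np.isnan(Yc)
    draws = (rng.random(Yc.shape) < theta).astype(float)
    Yc[miss] = draws[miss]
    return Yc

def lc_mi(Y, M=5):
    """Proper MI via the nonparametric bootstrap: refit the imputation
    model on a resampled data set before generating each imputation."""
    N = Y.shape[0]
    imputed = []
    for _ in range(M):
        boot = Y[rng.integers(0, N, size=N)]   # bootstrap sample of cases
        theta_star = fit_lc(boot)              # one parameter draw
        imputed.append(impute_with(theta_star, Y))
    return imputed

Y = np.array([[1, np.nan], [0, 1], [np.nan, 0], [1, 1]], dtype=float)
data_sets = lc_mi(Y, M=3)
```

Refitting on a fresh bootstrap sample before each imputation is what makes the M completed data sets reflect parameter uncertainty as well as uncertainty about the missing values.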

4. EXAMPLES

In this section, LC MI is illustrated and compared with other methods

using two examples. The first uses a simulated data set, which makes

it possible to compare the obtained results with the known truth. The

second example uses a large real data set with many categorical variables

with missing values.

4.1. A Simulated Data Example

In the introduction we already described various simulation studies

showing that (proper) MI—say, under a correctly specified log-linear

model—is able to yield unbiased parameter estimates and unbiased

asymptotic standard errors for the analysis model of interest (e.g.,

Ezzati-Rice et al. 1995; Schafer et al. 1996; Schafer 1997). Also, the

effect of sample size and the effect of different types of violations of the MAR assumption have been studied. Therefore, rather than performing

an extended simulation study in which the same issues are investigated

for LC MI, we concentrate on those aspects that are specific for the LC

MI method proposed in this paper.

The main question is this: How does the proposed LC MI method

compare to ML estimation with incomplete data and MI using a (sat-

urated) log-linear model? There are also two more specific questions:

What happens if we select too few latent classes? What happens if we

select too many latent classes?

We simulated one large data set (N = 10,000) from a population

model with the following characteristics:

- Six dichotomous variables: y1 to y6; y1 to y5 were the independent variables and y6 was the dependent variable.


- For the relationship among the independent variables, we assumed a log-linear model with, under dummy coding, one-variable terms equal to −2.0 and two-variable associations equal to 1.0; that is,

  \log P(y_1, y_2, y_3, y_4, y_5) \propto \sum_{s=1}^{5} -2.0 \, y_s + \sum_{s=1}^{5} \sum_{t=s+1}^{5} 1.0 \, y_s y_t.

  This is equivalent to a model defined under effect coding with all one-variable terms equal to 0.0 and all two-variable terms equal to .25.

- For the dependent variable, we assumed a logit model containing main effects of the independent variables as well as a two-way interaction between y2 and y3. Using dummy coding, the population logit equation was defined as follows:

  \mathrm{logit} \, P(y_6) = -3.0 + 1.0 y_1 + 2.0 y_2 + 2.0 y_3 + 1.0 y_4 + 1.0 y_5 - 2.0 y_2 y_3.

- Both y1 and y2 were assumed to have MAR missing values, where the missingness on y1 depended on y3 and y4—missingness probabilities were .1, .4, .4, and .7, respectively, for the four possible combinations of y3 and y4—and the missingness on y2 depended on y5 and y6—with probabilities equal to .7, .4, .4, and .1, respectively. In total, almost 70 percent of the 10,000 cases had at least one missing value.
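For readers who want to reproduce the setup, the design can be coded directly. Note that the assignment of the four missingness probabilities to the four combinations of (y3, y4) and of (y5, y6) is our reading of the description above:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
N = 10_000

# Joint distribution of y1..y5: log P ∝ sum_s -2.0*y_s + sum_{s<t} 1.0*y_s*y_t
patterns = np.array(list(product([0, 1], repeat=5)))
logp = -2.0 * patterns.sum(axis=1).astype(float)
for s in range(5):
    for t in range(s + 1, 5):
        logp += 1.0 * patterns[:, s] * patterns[:, t]
probs = np.exp(logp)
probs /= probs.sum()

idx = rng.choice(len(patterns), size=N, p=probs)
Y = patterns[idx].astype(float)
y1, y2, y3, y4, y5 = Y.T

# y6 from the population logit equation
eta = -3.0 + 1.0*y1 + 2.0*y2 + 2.0*y3 + 1.0*y4 + 1.0*y5 - 2.0*y2*y3
y6 = (rng.random(N) < 1 / (1 + np.exp(-eta))).astype(float)

# MAR missingness: P(y1 missing) depends on (y3, y4), P(y2 missing) on (y5, y6)
p_mis1 = np.select([(y3 == 0) & (y4 == 0), (y3 == 0) & (y4 == 1),
                    (y3 == 1) & (y4 == 0)], [0.1, 0.4, 0.4], default=0.7)
p_mis2 = np.select([(y5 == 0) & (y6 == 0), (y5 == 0) & (y6 == 1),
                    (y5 == 1) & (y6 == 0)], [0.7, 0.4, 0.4], default=0.1)
y1_obs = np.where(rng.random(N) < p_mis1, np.nan, y1)
y2_obs = np.where(rng.random(N) < p_mis2, np.nan, y2)

frac_incomplete = np.mean(np.isnan(y1_obs) | np.isnan(y2_obs))
```

Because there are only 2^5 = 32 patterns for the independent variables, the log-linear distribution can be enumerated exactly and sampled directly; with these settings the fraction of incomplete cases should land roughly in the range reported above.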

Note that we used a large sample because we were not interested in assessing the effect of sampling fluctuations. We assumed a MAR model with a large proportion of missing values in the predictor variables because this is the kind of situation in which LC MI should work.

The key element in the population model specification is the inclusion

of a large interaction term in the regression model for y6, which in fact

implies that there is a three-variable association between y2, y3, and

y6. While such an association is automatically picked up by a saturated

log-linear model, it should be investigated whether an LC model picks

it up as well.

Table 1 shows the log-likelihood, BIC, AIC, and AIC3 values for

1- to 10-class models and for the saturated log-linear model estimated

with the simulated data set. Based on BIC, we should select the 3-class

model, and based on either the AIC or AIC3 criterion one should select the 6-class model. This difference in suggested number of latent classes is larger than usually encountered in standard applications of LC analysis,