Multiple imputation of incomplete categorical data using latent class analysis

Sociological Methodology (Impact Factor: 3). 04/2008; 38(1). DOI: 10.1111/j.1467-9531.2008.00202.x

ABSTRACT We propose using latent class analysis as an alternative to log-linear analysis for the multiple imputation of incomplete categorical data. Similar to log-linear models, latent class models can be used to describe complex association structures between the variables used in the imputation model. However, unlike log-linear models, latent class models can be used to build large imputation models containing more than a few categorical variables. To obtain imputations reflecting uncertainty about the unknown model parameters, we use a nonparametric bootstrap procedure as an alternative to the more common full Bayesian approach. The proposed multiple imputation method, which is implemented in Latent GOLD software for latent class analysis, is illustrated with two examples. In a simulated data example, we compare the new method to well-established methods such as maximum likelihood estimation with incomplete data and multiple imputation using a saturated log-linear model. This example shows that the proposed method yields unbiased parameter estimates and standard errors. The second example concerns an application using a typical social sciences data set. It contains 79 variables that are all included in the imputation model. The proposed method is especially useful for such large data sets because standard methods for dealing with missing data in categorical variables break down when the number of variables is so large.

  • [Show abstract] [Hide abstract]
    ABSTRACT: The performance of multiple imputation in questionnaire data has been studied in various simulation studies. However, in practice, questionnaire data are usually more complex than simulated data. For example, items may be counterindicative or may have unacceptably low factor loadings on every subscale, or completely missing subscales may complicate computations. In this article, it was studied how well multiple imputation recovered the results of several psychometrically important statistics in a data set with such properties. Analysis of this data set revealed that multiple imputation was able to recover the results of these analyses well. Also, a simulation study showed that multiple imputation produced small bias in these statistics for simulated data sets with the same properties.
    Multivariate Behavioral Research 05/2010; 45(3):574-598. · 1.66 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Latent class analysis (LCA) has been found to have important applications in social and behavioural sciences for modelling categorical response variables, and non-response is typical when collecting data. In this study, the non-response mainly included ‘contingency questions’ and real ‘missing data’. The primary objective of this study was to evaluate the effects of some potential factors on model selection indices in LCA with non-response data. We simulated missing data with contingency question and evaluated the accuracy rates of eight information criteria for selecting the correct models. The results showed that the main factors are latent class proportions, conditional probabilities, sample size, the number of items, the missing data rate and the contingency data rate. Interactions of the conditional probabilities with class proportions, sample size and the number of items are also significant. From our simulation results, the impact of missing data and contingency questions can be amended by increasing the sample size or the number of items.
    Journal of Statistical Computation and Simulation 06/2012; · 0.71 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: In this paper we propose a latent class based multiple imputation approach for analyzing missing categorical covariate data in a highly stratified data model. In this approach, we impute the missing data assuming a latent class imputation model and we use likelihood methods to analyze the imputed data. Via extensive simulations, we study its statistical properties and make comparisons with complete case analysis, multiple imputation, saturated log-linear multiple imputation and the Expectation–Maximization approach under seven missing data mechanisms (including missing completely at random, missing at random and not missing at random). These methods are compared with respect to bias, asymptotic standard error, type I error, and 95% coverage probabilities of parameter estimates. Simulations show that, under many missingness scenarios, latent class multiple imputation performs favorably when jointly considering these criteria. A data example from a matched case–control study of the association between multiple myeloma and polymorphisms of the Inter-Leukin 6 genes is considered.
    Journal of Statistical Planning and Inference 11/2010; 140(11):3252–3262. · 0.60 Impact Factor

Full-text (3 Sources)

Available from
Dec 27, 2014