
Multiple imputation of incomplete categorical data using latent class analysis

Sociological Methodology (Impact Factor: 3). 04/2008; 38(1). DOI: 10.1111/j.1467-9531.2008.00202.x

ABSTRACT

We propose using latent class analysis as an alternative to log-linear analysis for the multiple imputation of incomplete categorical data. Similar to log-linear models, latent class models can be used to describe complex association structures between the variables used in the imputation model. However, unlike log-linear models, latent class models can be used to build large imputation models containing more than a few categorical variables. To obtain imputations reflecting uncertainty about the unknown model parameters, we use a nonparametric bootstrap procedure as an alternative to the more common full Bayesian approach. The proposed multiple imputation method, which is implemented in Latent GOLD software for latent class analysis, is illustrated with two examples. In a simulated data example, we compare the new method to well-established methods such as maximum likelihood estimation with incomplete data and multiple imputation using a saturated log-linear model. This example shows that the proposed method yields unbiased parameter estimates and standard errors. The second example concerns an application using a typical social sciences data set. It contains 79 variables that are all included in the imputation model. The proposed method is especially useful for such large data sets because standard methods for dealing with missing data in categorical variables break down when the number of variables is so large.
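The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Latent GOLD implementation: the function names (`fit_latent_class`, `impute_once`, `lc_multiple_imputation`), the fixed number of classes `K`, the single random start, the fixed number of EM iterations, and the coding of missing values as `-1` are all hypothetical choices made for the example. For each of the `m` imputations, a nonparametric bootstrap sample of the rows is drawn, an unrestricted latent class model is fitted to it by EM using only the observed entries, and the missing entries of the original data are then drawn from the fitted model given each case's observed values.

```python
import numpy as np

def class_posterior(X, pi, theta):
    """P(class | observed items) per row; entries coded -1 are skipped."""
    log_post = np.tile(np.log(pi), (X.shape[0], 1))
    for j in range(X.shape[1]):
        rows = X[:, j] >= 0
        # theta[j][k, c] = P(X_j = c | class k)
        log_post[rows] += np.log(theta[j][:, X[rows, j]]).T
    post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    return post / post.sum(axis=1, keepdims=True)

def fit_latent_class(X, n_cat, K, rng, n_iter=100):
    """EM for an unrestricted latent class model with incomplete data."""
    pi = np.full(K, 1.0 / K)
    theta = [rng.dirichlet(np.ones(c), size=K) for c in n_cat]  # random start
    for _ in range(n_iter):
        post = class_posterior(X, pi, theta)           # E-step
        pi = np.clip(post.mean(axis=0), 1e-12, None)   # M-step: class sizes
        pi /= pi.sum()
        for j, c_j in enumerate(n_cat):                # M-step: item probabilities
            counts = np.full((K, c_j), 1e-6)           # tiny constant avoids log(0)
            for c in range(c_j):
                counts[:, c] += post[X[:, j] == c].sum(axis=0)
            theta[j] = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta

def impute_once(X, n_cat, pi, theta, rng):
    """Draw a class for each incomplete row, then draw its missing items."""
    post = class_posterior(X, pi, theta)
    X_imp = X.copy()
    for i in range(X.shape[0]):
        missing = np.flatnonzero(X[i] < 0)
        if missing.size == 0:
            continue
        k = rng.choice(len(pi), p=post[i])
        for j in missing:
            X_imp[i, j] = rng.choice(n_cat[j], p=theta[j][k])
    return X_imp

def lc_multiple_imputation(X, n_cat, K=3, m=5, seed=0):
    """Return m completed copies of X, one per bootstrap replicate."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    completed = []
    for _ in range(m):
        boot = X[rng.integers(0, n, size=n)]  # nonparametric bootstrap of rows
        pi, theta = fit_latent_class(boot, n_cat, K, rng)
        completed.append(impute_once(X, n_cat, pi, theta, rng))
    return completed
```

Refitting the model on a fresh bootstrap sample for every imputation is what lets the completed data sets reflect parameter uncertainty; fitting once and imputing `m` times from the same estimates would understate it. In practice the number of classes would be chosen by a fit criterion and EM would be run from several starting values, both of which this sketch omits.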

Available from: Jeroen K. Vermunt, Dec 27, 2014
    • "The convergence to the joint distribution is often obtained for a low number of iterations (5 can be sufficient), but [25, p.113] underlines that this number can be higher in some cases. In addition, FCS is more computationally intensive than JM [15] [25]. This is not a practical issue when the data set is small, but it becomes so on a data set of high dimensions. "
    ABSTRACT: We propose a multiple imputation method to deal with incomplete categorical data. This method imputes the missing entries using the principal components method dedicated to categorical data: multiple correspondence analysis (MCA). The uncertainty concerning the parameters of the imputation model is reflected using a non-parametric bootstrap. Multiple imputation using MCA (MIMCA) requires estimating a small number of parameters due to the dimensionality reduction property of MCA. It allows the user to impute a large range of data sets. In particular, a high number of categories per variable, a high number of variables, or a small number of individuals are not an issue for MIMCA. Through a simulation study based on real data sets, the method is assessed and compared to the reference methods (multiple imputation using the loglinear model, multiple imputation by logistic regressions) as well as to the latest works on the topic (multiple imputation by random forests or by the Dirichlet process mixture of products of multinomial distributions model). The proposed method performs well in terms of bias and coverage for an analysis model such as a main effects logistic regression model. In addition, MIMCA has the great advantage that it is substantially less time consuming on data sets of high dimensions than the other multiple imputation methods.
    Full-text · Article · May 2015 · Statistics and Computing
    • "For categorical data, logistic regression is a natural choice and has the advantage of accurately modelling the distribution of the missing data given the observed data. The parameters are easily estimated via the incomplete observed data [11] [12]. However, the model-based approaches in conjunction with logistic regression become problematic for data sets with a large number of variables. "
    ABSTRACT: The imputation of missing data is often a crucial step in the analysis of survey data. This study reviews typical problems with missing data and discusses a method for the imputation of missing survey data with a large number of categorical variables which do not have a monotone missing pattern. We develop a method for constructing a monotone missing pattern that allows for imputation of categorical data in data sets with a large number of variables using a model-based MCMC approach. We report the results of imputing the missing data from a case study, using educational, sociopsychological, and socioeconomic data from the National Latino and Asian American Study (NLAAS). We report the results of multiply imputed data on a substantive logistic regression analysis predicting socioeconomic success from several educational, sociopsychological, and familial variables. We compare the results of conducting inference using a single imputed data set to those using a combined test over several imputations. Findings indicate that, for all variables in the model, all of the single tests were consistent with the combined test.
    Full-text · Article · Jan 2015 · Journal of Applied Mathematics