Multiple imputation of incomplete categorical data using latent class analysis

Sociological Methodology (Impact Factor: 3). 04/2008; 38(1). DOI:10.1111/j.1467-9531.2008.00202.x

ABSTRACT We propose using latent class analysis as an alternative to log-linear analysis for the multiple imputation of incomplete categorical data. Similar to log-linear models, latent class models can be used to describe complex association structures between the variables used in the imputation model. However, unlike log-linear models, latent class models can be used to build large imputation models containing more than a few categorical variables. To obtain imputations reflecting uncertainty about the unknown model parameters, we use a nonparametric bootstrap procedure as an alternative to the more common full Bayesian approach. The proposed multiple imputation method, which is implemented in Latent GOLD software for latent class analysis, is illustrated with two examples. In a simulated data example, we compare the new method to well-established methods such as maximum likelihood estimation with incomplete data and multiple imputation using a saturated log-linear model. This example shows that the proposed method yields unbiased parameter estimates and standard errors. The second example concerns an application using a typical social sciences data set. It contains 79 variables that are all included in the imputation model. The proposed method is especially useful for such large data sets because standard methods for dealing with missing data in categorical variables break down when the number of variables is so large.
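The procedure outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Latent GOLD implementation: the function names (`fit_lca`, `posterior`, `impute_once`, `multiple_impute`), the use of `-1` as the missing-value code, the fixed number of EM iterations, and the small smoothing constant are all hypothetical choices made for this example. For each of `m` imputations it draws a nonparametric bootstrap sample, fits an unrestricted latent class model to that sample by EM using only the observed entries, and then fills in the missing entries of the original data by sampling a class from its posterior and the missing items from that class's conditional probabilities.

```python
import numpy as np

def posterior(X, pi, theta):
    """P(class | observed items) per row; entries coded -1 are treated as missing."""
    n, p = X.shape
    logp = np.tile(np.log(pi), (n, 1))
    for j in range(p):
        obs = X[:, j] >= 0
        # theta[j] has shape (n_classes, n_cats[j]); skip missing entries
        logp[obs] += np.log(theta[j][:, X[obs, j]]).T
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    return post / post.sum(axis=1, keepdims=True)

def fit_lca(X, n_classes, n_cats, n_iter=100, seed=0):
    """EM for a latent class model fitted to incomplete categorical data."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pi = np.full(n_classes, 1.0 / n_classes)
    # Random starting values for P(item j = c | class k)
    theta = [rng.dirichlet(np.ones(n_cats[j]), size=n_classes) for j in range(p)]
    for _ in range(n_iter):
        post = posterior(X, pi, theta)            # E-step
        pi = post.mean(axis=0) + 1e-12            # M-step: class sizes
        pi /= pi.sum()
        for j in range(p):                        # M-step: item probabilities
            counts = np.zeros((n_classes, n_cats[j]))
            for c in range(n_cats[j]):
                counts[:, c] = post[X[:, j] == c].sum(axis=0)
            counts += 1e-6                        # assumed smoothing to avoid zeros
            theta[j] = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta

def impute_once(X, pi, theta, rng):
    """Draw missing entries from P(missing | observed) under the fitted model."""
    post = posterior(X, pi, theta)
    Xi = X.copy()
    for i in range(X.shape[0]):
        miss = np.flatnonzero(X[i] < 0)
        if miss.size == 0:
            continue
        k = rng.choice(len(pi), p=post[i])        # sample a latent class
        for j in miss:
            Xi[i, j] = rng.choice(theta[j].shape[1], p=theta[j][k])
    return Xi

def multiple_impute(X, n_classes, n_cats, m=5, seed=0):
    """Return m completed copies of X, one per bootstrap replicate."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    completed = []
    for b in range(m):
        boot = X[rng.integers(0, n, size=n)]      # nonparametric bootstrap sample
        pi, theta = fit_lca(boot, n_classes, n_cats, seed=seed + b)
        completed.append(impute_once(X, pi, theta, rng))
    return completed
```

Refitting the model on a fresh bootstrap sample for every imputation is what lets the completed data sets reflect parameter uncertainty without a full Bayesian sampler. In practice the number of classes would be chosen by model fit rather than fixed in advance, and the EM would be run from several starting values.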

  • ABSTRACT: Latent class analysis (LCA) has been found to have important applications in social and behavioural sciences for modelling categorical response variables, and non-response is typical when collecting data. In this study, the non-response mainly included ‘contingency questions’ and real ‘missing data’. The primary objective of this study was to evaluate the effects of some potential factors on model selection indices in LCA with non-response data. We simulated missing data with contingency question and evaluated the accuracy rates of eight information criteria for selecting the correct models. The results showed that the main factors are latent class proportions, conditional probabilities, sample size, the number of items, the missing data rate and the contingency data rate. Interactions of the conditional probabilities with class proportions, sample size and the number of items are also significant. From our simulation results, the impact of missing data and contingency questions can be amended by increasing the sample size or the number of items.
    Journal of Statistical Computation and Simulation - J STAT COMPUT SIM. 01/2012;
  • ABSTRACT: Missing data, such as item responses in multilevel data, are ubiquitous in educational research settings. Researchers in the item response theory (IRT) context have shown that ignoring such missing data can create problems in the estimation of the IRT model parameters. Consequently, several imputation methods for dealing with missing item data have been proposed and shown to be effective when applied with traditional IRT models. Additionally, a nonimputation direct likelihood analysis has been shown to be an effective tool for handling missing observations in clustered data settings. This study investigates the performance of six simple imputation methods, which have been found to be useful in other IRT contexts, versus a direct likelihood analysis, in multilevel data from educational settings. Multilevel item response data were simulated on the basis of two empirical data sets, and some of the item scores were deleted, such that they were missing either completely at random or simply at random. An explanatory IRT model was used for modeling the complete, incomplete, and imputed data sets. We showed that direct likelihood analysis of the incomplete data sets produced unbiased parameter estimates that were comparable to those from a complete data analysis. Multiple-imputation approaches of the two-way mean and corrected item mean substitution methods displayed varying degrees of effectiveness in imputing data that in turn could produce unbiased parameter estimates. The simple random imputation, adjusted random imputation, item means substitution, and regression imputation methods seemed to be less effective in imputing missing item scores in multilevel data settings.
    Behavior Research Methods 10/2011; 44(2):516-31. · 2.12 Impact Factor
  • ABSTRACT: Training in the five steps of evidence-based practice (EBP) has been recommended for inclusion in entry-level health professional training. The effectiveness of EBP education has been explored predominantly in the medical and nursing professions and more commonly in post-graduate than entry-level students. Few studies have investigated longitudinal changes in EBP attitudes and behaviours. This study aimed to assess the changes in EBP knowledge, attitudes and behaviours in entry-level physiotherapy students transitioning into the workforce. A prospective, observational, longitudinal design was used, with two cohorts. From 2008, 29 participants were tested in their final year in a physiotherapy program, and after the first and second workforce years. From 2009, 76 participants were tested in their final entry-level and first workforce years. Participants completed an Evidence-Based Practice Profile questionnaire (EBP2), which includes self-report EBP domains [Relevance, Terminology (knowledge of EBP concepts), Confidence, Practice (EBP implementation), Sympathy (disposition towards EBP)]. Mixed model analysis with sequential Bonferroni adjustment was used to analyse the matched data. Effect sizes (ES) (95% CI) were calculated for all changes. Effect sizes of the changes in EBP domains were small (ES range 0.02 to 0.42). While most changes were not significant there was a consistent pattern of decline in scores for Relevance in the first workforce year (ES -0.42 to -0.29) followed by an improvement in the second year (ES +0.27). Scores in Terminology improved (ES +0.19 to +0.26) in each of the first two workforce years, while Practice scores declined (ES -0.23 to -0.19) in the first year and improved minimally in the second year (ES +0.04). Confidence scores improved during the second workforce year (ES +0.27). Scores for Sympathy showed little change. During the first two years in the workforce, there was a transitory decline in the self-reported practice and sense of relevance of EBP, despite increases in confidence and knowledge. The pattern of progression of EBP skills beyond these early professional working years is unknown.
    BMC Medical Education 11/2011; 11:100. · 1.41 Impact Factor


Available from: Jeroen K. Vermunt