Multiple imputation of missing covariates with non-linear effects and interactions: evaluation of statistical methods

MRC Biostatistics Unit, Institute of Public Health, Cambridge CB2 0SR, UK.
BMC Medical Research Methodology (Impact Factor: 2.27). 04/2012; 12(1):46. DOI: 10.1186/1471-2288-12-46
Source: PubMed


Multiple imputation is often used for missing data. When a model contains as covariates more than one function of a variable, it is not obvious how best to impute missing values in these covariates. Consider a regression with outcome Y and covariates X and X². In 'passive imputation' a value X* is imputed for X and then X² is imputed as (X*)². A recent proposal is to treat X² as 'just another variable' (JAV) and impute X and X² under multivariate normality.
We use simulation to investigate the performance of three methods that can easily be implemented in standard software: 1) linear regression of X on Y to impute X then passive imputation of X²; 2) the same regression but with predictive mean matching (PMM); and 3) JAV. We also investigate the performance of analogous methods when the analysis involves an interaction, and study the theoretical properties of JAV. The application of the methods when complete or incomplete confounders are also present is illustrated using data from the EPIC Study.
JAV gives consistent estimation when the analysis is linear regression with a quadratic or interaction term and X is missing completely at random. When X is missing at random, JAV may be biased, but this bias is generally less than for passive imputation and PMM. Coverage for JAV was usually good when bias was small. However, in some scenarios with a more pronounced quadratic effect, bias was large and coverage poor. When the analysis was logistic regression, JAV's performance was sometimes very poor. PMM generally improved on passive imputation, in terms of bias and coverage, but did not eliminate the bias.
Given the current state of available software, JAV is the best of a set of imperfect imputation methods for linear regression with a quadratic or interaction effect, but should not be used for logistic regression.
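The difference between passive imputation and JAV can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`fit_linear`, `impute_passive`, `impute_jav`) are hypothetical, only one imputed dataset is produced, and the regression parameters are fixed at their complete-case estimates rather than drawn from a posterior, as a proper multiple imputation procedure would do.

```python
import numpy as np

def fit_linear(y_obs, targets_obs):
    """Least-squares fit of each target column on (1, y).

    Returns the coefficient matrix and the residual covariance matrix.
    """
    design = np.column_stack([np.ones_like(y_obs), y_obs])
    coef, *_ = np.linalg.lstsq(design, targets_obs, rcond=None)
    resid = targets_obs - design @ coef
    dof = len(y_obs) - design.shape[1]
    return coef, np.atleast_2d(resid.T @ resid / dof)

def impute_passive(x, y, rng):
    """Impute X from a linear regression on Y, then set the square to (X*)^2."""
    miss = np.isnan(x)
    coef, cov = fit_linear(y[~miss], x[~miss, None])
    design_mis = np.column_stack([np.ones(miss.sum()), y[miss]])
    noise = rng.normal(0.0, np.sqrt(cov[0, 0]), miss.sum())
    x_imp = x.copy()
    x_imp[miss] = (design_mis @ coef)[:, 0] + noise
    return x_imp, x_imp ** 2  # the squared term is derived, never drawn

def impute_jav(x, y, rng):
    """Treat X and its square as two jointly normal variables given Y."""
    miss = np.isnan(x)
    targets = np.column_stack([x[~miss], x[~miss] ** 2])
    coef, cov = fit_linear(y[~miss], targets)
    design_mis = np.column_stack([np.ones(miss.sum()), y[miss]])
    noise = rng.multivariate_normal(np.zeros(2), cov, size=miss.sum())
    draws = design_mis @ coef + noise
    x_imp, x2_imp = x.copy(), x ** 2
    x_imp[miss], x2_imp[miss] = draws[:, 0], draws[:, 1]
    return x_imp, x2_imp  # imputed square need not equal imputed X squared
```

Under passive imputation the imputed pair always satisfies the deterministic relation between X and its square; under JAV that relation is deliberately given up in the imputed rows, which is the trade-off the abstract evaluates.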

  • Source
    • "We adjusted for the same covariates as the main model for each exposure. Consistent with the recommendations of Seaman et al. (2012), these analyses were restricted to participants with known values of all covariates. The p-value for interaction was calculated by the likelihood ratio test comparing the log-likelihood for the model with the interaction terms to the log-likelihood for the model without the interaction term."
    ABSTRACT: Long-term exposure to persistent pollutants with hormonal properties (endocrine-disrupting chemicals, EDCs) may contribute to the risk of prostate cancer (PCa). However, epidemiological evidence remains limited. We investigated the relationship between PCa and plasma concentrations of universally widespread pollutants, in particular p,p'-dichlorodiphenyl dichloroethene (DDE) and the non-dioxin-like polychlorinated biphenyl congener 153 (PCB153). We evaluated 576 men with newly diagnosed PCa, before treatment, and 655 controls in Guadeloupe (French West Indies). Exposure was analyzed according to case-control status. Associations were assessed by unconditional logistic regression analysis, controlling for confounding factors. Missing data were handled by multiple imputation. We estimated a significant positive association between DDE and PCa (adjusted odds ratio [OR] 1.53; 95% CI 1.02, 2.30 for the highest versus lowest quintile of exposure; P for trend = 0.01). PCB153 was inversely associated with PCa (OR 0.30; 95% CI 0.19, 0.47 for the highest versus lowest quintile of exposure values; P for trend < 0.001). Also, PCB153 was more strongly associated with low-grade than with high-grade PCa. Associations of PCa with DDE and PCB153 were in opposite directions. This may reflect differences in the mechanisms of action of these EDCs, and although our findings need to be replicated in other populations, they are consistent with complex effects of EDCs on human health.
    Environmental Health Perspectives 11/2014; 123(4). DOI:10.1289/ehp.1408407 · 7.98 Impact Factor
  • Source
    • "One of the biggest challenges for users of MI is specifying the imputation model correctly. This is not always easy to do, even for seemingly simple analyses: for instance when the analysis model contains nonlinear functions of incomplete covariates [8]."
    ABSTRACT: Background: Multiple imputation is a commonly used method for handling incomplete covariates as it can provide valid inference when data are missing at random. This depends on being able to correctly specify the parametric model used to impute missing values, which may be difficult in many realistic settings. Imputation by predictive mean matching (PMM) borrows an observed value from a donor with a similar predictive mean; imputation by local residual draws (LRD) instead borrows the donor’s residual. Both methods relax some assumptions of parametric imputation, promising greater robustness when the imputation model is misspecified. Methods: We review development of PMM and LRD and outline the various forms available, and aim to clarify some choices about how and when they should be used. We compare performance to fully parametric imputation in simulation studies, first when the imputation model is correctly specified and then when it is misspecified. Results: In using PMM or LRD we strongly caution against using a single donor, the default value in some implementations, and instead advocate sampling from a pool of around 10 donors. We also clarify which matching metric is best. Among the current MI software there are several poor implementations. Conclusions: PMM and LRD may have a role for imputing covariates (i) which are not strongly associated with outcome, and (ii) when the imputation model is thought to be slightly but not grossly misspecified. Researchers should spend efforts on specifying the imputation model correctly, rather than expecting predictive mean matching or local residual draws to do the work.
    BMC Medical Research Methodology 06/2014; 14(1):75. DOI:10.1186/1471-2288-14-75 · 2.27 Impact Factor
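The donor-pool idea described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular package: `pmm_impute` and its parameters are hypothetical names, the predictive means come from a single least-squares fit of X on Y, and the coefficient uncertainty that a proper multiple imputation procedure would propagate is ignored.

```python
import numpy as np

def pmm_impute(x, y, rng, n_donors=10):
    """Predictive mean matching for x given y, sampling from a donor pool."""
    miss = np.isnan(x)
    design = np.column_stack([np.ones_like(y), y])
    # Fit the imputation model on the observed cases only.
    coef, *_ = np.linalg.lstsq(design[~miss], x[~miss], rcond=None)
    pred = design @ coef  # predictive mean for every case
    x_obs, pred_obs = x[~miss], pred[~miss]
    x_imp = x.copy()
    for i in np.flatnonzero(miss):
        # Pool: the n_donors observed cases whose predictive means are closest.
        pool = np.argsort(np.abs(pred_obs - pred[i]))[:n_donors]
        # Borrow the observed value of one donor chosen at random from the pool.
        x_imp[i] = x_obs[rng.choice(pool)]
    return x_imp
```

Because every imputed value is copied from an observed case, imputations always lie within the observed range of X. Setting `n_donors=1` gives the single-donor variant that the abstract cautions against.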
  • Source
    • "Women admitted to hospital in latent phase (<3 cm cervical dilation) have been shown to be at higher risk of obstetrical interventions, including electronic fetal monitoring, epidural analgesia, oxytocin, and caesarean section, than those who are admitted in active labour [9-13]. Approaches to early labour care including home visits [14,15], standardized definitions of labour onset [16] or targeted interventions to manage discomfort [17] have been evaluated in randomized controlled trials and have not been shown to influence labour outcomes."
    ABSTRACT: Background: Progress during early labour may impact subsequent labour trajectories. Women admitted to hospital in latent phase (<3 cm cervical dilation) labour have been shown to be at higher risk of obstetrical interventions. Methods: We conducted a secondary analysis of data from a randomized controlled trial of 1247 healthy nulliparous women in spontaneous labour at term with a singleton fetus in cephalic presentation at seven hospitals in Southwestern British Columbia. We computed relative risks and their 95% confidence intervals to examine our primary outcome of cesarean section and secondary outcomes including obstetrical interventions and maternal and newborn outcomes according to women’s perception of length of pre-hospital labour. Women were asked on admission to hospital how long they had been experiencing contractions prior to coming to hospital. Results: Women indicating that they had been in labour for 24 hours or longer at the time of hospital admission were at elevated risk for cesarean birth, relative risk (RR) 1.40 (95% confidence interval 1.15-1.72), admission with a cervical dilation of 3 cm or less, RR 1.21 (1.07-1.36), and more obstetrical interventions, including continuous electronic fetal monitoring, RR 1.11 (1.03-1.20), augmentation of labour, RR 1.33 (1.23-1.44), use of narcotics, RR 1.21 (1.06-1.37), and epidural analgesia, RR 1.18 (1.09-1.28). Adverse neonatal outcomes did not differ apart from a significant increase in meconium-stained amniotic fluid, RR 1.60 (1.09-2.35). Conclusions: A single question asked of women on presentation to hospital was an important predictor of cesarean birth and may have utility in identifying women who would benefit from close observation and more active management of labour.
    BMC Pregnancy and Childbirth 05/2014; 14(1):182. DOI:10.1186/1471-2393-14-182 · 2.19 Impact Factor