Project
Prediction calibration using multiple imputations to account for missing predictor values
Project log
In this paper, we extend the methodology presented in Mertens et al. (2020, Biometrical Journal) to the study of lifetime (survival) outcomes subject to censoring, when imputation is used to account for missing values. We consider the problem where missing values can occur in both the calibration data and in new, to-be-predicted observations (validation data). We focus on the Cox model. Methods are described to combine imputation with predictive calibration in survival modeling subject to censoring, and application to cross-validation is discussed. Our conclusions broadly confirm those of the first paper, which was restricted to binary outcomes. Specifically, prediction-averaging appears to have superior statistical properties, notably smaller predictive variation, compared with a direct application of Rubin's rules. Distinct methods for handling the baseline hazard are discussed for the Rubin's rules-based approaches.
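To make the prediction-averaging idea concrete in the censored-outcome setting, here is a minimal sketch, assuming the `mice` and `survival` packages. The data set `calib`, its variables (`time`, `status`, `x1`, `x2`), the new-data object `newpatients`, and the five-year horizon are all hypothetical, and for brevity the new observations are taken as complete (the paper's methodology also imputes missing predictors in the new data).

```r
library(mice)
library(survival)

## Impute the missing predictor values in the calibration data m times.
m   <- 10
imp <- mice(calib, m = m, printFlag = FALSE)

## Prediction-averaging: fit one Cox model per completed data set, then
## average the predicted survival probabilities at a fixed horizon
## (here t = 5), rather than pooling coefficients via Rubin's rules.
surv5 <- sapply(seq_len(m), function(i) {
  fit <- coxph(Surv(time, status) ~ x1 + x2, data = complete(imp, i))
  sf  <- survfit(fit, newdata = newpatients)
  as.vector(summary(sf, times = 5)$surv)
})
pred <- rowMeans(surv5)  # averaged survival probability per new patient
```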
The package `mipred` contains two basic functions. The first, `mipred.cv`, estimates cross-validated predictions when predictors contain missing values, using multiple imputation. The second, `mipred`, allows users to apply the same methodology to predict outcomes for new observations, based on past calibration data. Both the new observations and the calibration data may contain missing values in the predictors. This document describes data analysis approaches using the `mipred` package functions for the above objectives. We first discuss cross-validation of predictions with `mipred.cv`, using both the `averaging` and `rubin` methods described in the paper by Mertens et al. (see ResearchGate), to estimate the expected prediction performance on future data. We then describe use of the `mipred` function to estimate predictions on new patient data, based on past data. Finally, `mipred` package functionality and options are discussed.
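For orientation, a minimal sketch of how the two functions might be called. The formula, the data objects `calib` and `newpatients`, and the variables are hypothetical, and the argument names (`nimp`, `folds`, `method`) reflect our reading of the package documentation; consult `help(mipred.cv)` and `help(mipred)` for the exact interface.

```r
library(mipred)

## Cross-validated predictions on the calibration data, using the
## prediction-averaging method (argument names are assumptions; see
## help(mipred.cv)).
cvpred <- mipred.cv(outcome ~ age + marker, family = binomial,
                    data = calib, nimp = 10, folds = 10,
                    method = "averaging")

## Predictions for new observations based on the calibration data,
## here with the Rubin's rules-based method.
newpred <- mipred(outcome ~ age + marker, family = binomial,
                  data = calib, newdata = newpatients,
                  nimp = 10, folds = 10, method = "rubin")
```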
The latest version of the software can be downloaded from GitHub: https://github.com/BartJAMertens/mipred.
We are also happy to say that the package is now on CRAN, so you can install it directly from there.
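Either route works; for example:

```r
## Install the released version from CRAN:
install.packages("mipred")

## Or the development version from GitHub (one common way is via the
## 'remotes' package):
# install.packages("remotes")
remotes::install_github("BartJAMertens/mipred")
```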
We investigate the problem of calibration and assessment of predictive rules in prognostic designs when missing values are present in the predictors. Our paper has two entwined key objectives. The first is to investigate how calibration of the prediction rule can be combined with the use of multiple imputation to account for missing predictor observations. The second is to propose methods that can be implemented with current multiple imputation software, while allowing for unbiased predictive assessment through validation on new observations for which outcome is not yet available. To inform the definition of methodology, we commence with a review of the theoretical background of multiple imputation as a model estimation approach, as opposed to a purely algorithmic description. We specifically contrast the application of multiple imputation for parameter (effect) estimation with predictive calibration. Based on this review, two approaches are formulated: the first averages probabilities from models fitted on single imputations to directly approximate the predictive density for future observations, while the second applies the classical Rubin's rules for parameter estimation. We present implementations using current software which allow for validatory or cross-validatory estimation of performance measures, as well as imputation of missing data in predictors on future data where outcome is by definition as yet unobserved. We restrict discussion to binary outcomes and logistic regression throughout, though the principles discussed are generally applicable. We present two data sets as examples from our regular consultative practice. Results show little difference between methods in accuracy, but substantial reductions in the variation of calibrated probabilities when using the first (averaging) approach.
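To illustrate the contrast between the two approaches in the binary-outcome setting, a minimal sketch using the `mice` package; the data set `calib` and its variables (`y`, `x1`, `x2`) are hypothetical.

```r
library(mice)

## Impute the missing predictor values m times.
m   <- 10
imp <- mice(calib, m = m, printFlag = FALSE)

## Approach 1 ("averaging"): average the predicted probabilities from
## the m single-imputation logistic regression fits.
p_avg <- rowMeans(sapply(seq_len(m), function(i) {
  fit <- glm(y ~ x1 + x2, family = binomial, data = complete(imp, i))
  predict(fit, type = "response")
}))

## Approach 2 ("rubin"): pool the coefficients with Rubin's rules first,
## then compute probabilities from the pooled linear predictor. Using a
## single completed design matrix here is a simplification; in practice
## each imputation has its own completed predictors.
pooled <- pool(with(imp, glm(y ~ x1 + x2, family = binomial)))
beta   <- summary(pooled)$estimate
X      <- model.matrix(~ x1 + x2, data = complete(imp, 1))
p_rub  <- plogis(drop(X %*% beta))
```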