Article

The Multiple Adaptations of Multiple Imputation

Taylor & Francis
Journal of the American Statistical Association
Authors: Jerome P. Reiter and Trivellore E. Raghunathan

Abstract

Multiple imputation was first conceived as a tool that statistical agencies could use to handle nonresponse in large-sample, public-use surveys. In the last two decades, the multiple imputation framework has been adapted for other statistical contexts. As examples, individual researchers use multiple imputation to handle missing data in small samples; statistical agencies disseminate multiply-imputed datasets for purposes of protecting data confidentiality; and survey methodologists and epidemiologists use multiple imputation to correct for measurement errors. In some of these settings, Rubin's original rules for combining the point and variance estimates from the multiply-imputed datasets are not appropriate, because what is known, and therefore what enters the conditional expectations and variances used to derive inferential methods, differs from the missing data context. These applications require new combining rules and methods of inference. In fact, more than ten combining rules exist in the ...


... After this, each party locally combines the resulting K analysis models, by either distilling a single combined model out of them or setting up a suitable ensemble [12,31]. Following [12], we use Rubin's rules to combine the obtained parameter and standard error estimates into a single model, analogous to the concept of multiple imputation [32,33], where missing data is repeatedly replaced with resampled available data. For each of the ...
... Note that v_K^(−) can be negative. When this occurs, we use a more conservative alternative of Rubin's rules [33] and set the variance to v_K^(+) ...
... As detailed in Section Sharing multiple synthetic data sets we additionally rely on repeated sampling of synthetic data and application of Rubin's rules [32,33] to quantify the additional uncertainty introduced by sampling finite (small) data sets from the learned generative models at no additional privacy cost. ...
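For readers unfamiliar with the combining rules referenced in these excerpts, the sketch below illustrates Rubin's (1987) rules for a scalar estimand; the function name is ours, and the synthetic-data settings discussed in later excerpts swap in different total-variance formulas.

```python
import numpy as np

def rubin_pool(estimates, std_errors):
    """Pool m point estimates and standard errors with Rubin's (1987) rules.

    estimates, std_errors: length-m sequences of per-dataset results for a
    scalar estimand (e.g., one regression coefficient). Returns the pooled
    estimate, its standard error, and the degrees of freedom for t-based
    inference.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2      # within-dataset variances
    m = len(q)

    q_bar = q.mean()                                  # pooled point estimate
    u_bar = u.mean()                                  # average within-dataset variance
    b = q.var(ddof=1)                                 # between-dataset variance
    t_var = u_bar + (1 + 1 / m) * b                   # total variance (missing-data rules)
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, np.sqrt(t_var), df
```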
Article
Full-text available
Background Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank. Methods We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study’s Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores. Results We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups. Conclusions Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.
... These pooling formulas are similar to those used in missing data analysis (Rubin, 1987) but differ in the expression for the pooled variance, because the variance between the synthetic data sets is a result of only the synthesis but not incomplete data. Other pooling methods exist for different types of analyses (e.g., model comparisons; see Reiter & Raghunathan, 2007) and other types of synthetic data (e.g., fully synthetic data; see Raghunathan et al., 2003;Reiter & Raghunathan, 2007). ...
... The synthetic data are then analyzed separately, and the results are pooled as in the conventional MI approach to synthetic data (see Reiter, 2003; Reiter & Raghunathan, 2007). Note that the masked ...
Figure: Schematic Representation of the Data-Augmented Multiple Imputation (DA-MI) Approach to Synthetic Data
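The pooling formula for partially synthetic data alluded to in these excerpts differs from the missing-data rule only in how the between-dataset variance enters. A minimal sketch, following our reading of Reiter (2003); the function name is ours.

```python
import numpy as np

def pool_partially_synthetic(estimates, std_errors):
    """Pool results from m partially synthetic datasets (Reiter, 2003).

    The total variance is T_p = u_bar + b/m, because the variance between
    the synthetic datasets reflects the synthesis alone rather than
    missing information.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2
    m = len(q)

    q_bar = q.mean()
    u_bar = u.mean()
    b = q.var(ddof=1)
    t_var = u_bar + b / m                        # note: no (1 + 1/m) inflation of b
    df = (m - 1) * (1 + m * u_bar / b) ** 2      # reference t distribution
    return q_bar, np.sqrt(t_var), df
```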
Article
Full-text available
In recent years, psychological research has faced a credibility crisis, and open data are often regarded as an important step toward a more reproducible psychological science. However, privacy concerns are among the main reasons that prevent data sharing. Synthetic data procedures, which are based on the multiple imputation (MI) approach to missing data, can be used to replace sensitive data with simulated values, which can be analyzed in place of the original data. One crucial requirement of this approach is that the synthesis model is correctly specified. In this article, we investigated the statistical properties of synthetic data with a particular emphasis on the reproducibility of statistical results. To this end, we compared conventional approaches to synthetic data based on MI with a data-augmented approach (DA-MI) that attempts to combine the advantages of masking methods and synthetic data, thus making the procedure more robust to misspecification. In multiple simulation studies, we found that the good properties of the MI approach strongly depend on the correct specification of the synthesis model, whereas the DA-MI approach can provide useful results even under various types of misspecification. This suggests that the DA-MI approach to synthetic data can provide an important tool that can be used to facilitate data sharing and improve reproducibility in psychological research. In a working example, we also demonstrate the implementation of these approaches in widely available software, and we provide recommendations for practice.
... We restrict attention to methods for imputing item missing data (imputing the subset of values that are missing for an incomplete observation) in settings with independent observations. Much of the discussion also applies to other data structures, and to problems other than item missing data where MI has proven useful (see Reiter and Raghunathan (2007) for some examples of other uses for multiple imputation). ...
... Barnard and Rubin (1999) proposed an alternative degrees of freedom estimate with better behavior in moderate samples, suggesting it for general use. See Reiter and Raghunathan (2007) for a review of combining rules for more general estimands. ...
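The Barnard and Rubin (1999) adjustment mentioned above caps the pooled degrees of freedom by the complete-data degrees of freedom, which matters in moderate samples. A brief sketch of the adjustment as we understand it; the function and argument names are ours.

```python
import numpy as np

def barnard_rubin_df(estimates, std_errors, df_complete):
    """Small-sample degrees of freedom of Barnard and Rubin (1999).

    df_complete is the degrees of freedom the analysis would have had with
    complete data (e.g., n - k for a linear regression with k parameters).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2
    m = len(q)

    u_bar = u.mean()
    b = q.var(ddof=1)
    t_var = u_bar + (1 + 1 / m) * b
    lam = (1 + 1 / m) * b / t_var                  # proportion of variance due to missingness
    df_old = (m - 1) / lam ** 2                    # Rubin's (1987) large-sample df
    df_obs = ((df_complete + 1) / (df_complete + 3)) * df_complete * (1 - lam)
    return 1 / (1 / df_old + 1 / df_obs)           # adjusted (smaller) df
```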
Preprint
Multiple imputation is a straightforward method for handling missing data in a principled fashion. This paper presents an overview of multiple imputation, including important theoretical results and their practical implications for generating and using multiple imputations. A review of strategies for generating imputations follows, including recent developments in flexible joint modeling and sequential regression/chained equations/fully conditional specification approaches. Finally, we compare and contrast different methods for generating imputations on a range of criteria before identifying promising avenues for future research.
... We use a framework that integrates three key ideas from the literature on confidentiality protection and data access. The first idea is to provide synthetic public use files, as proposed by Rubin (1993) and others (e.g., Little, 1993;Fienberg, 1994;Raghunathan et al., 2003;Reiter and Raghunathan, 2007;Drechsler, 2011). Such files comprise individual records with every value replaced with simulated draws from an estimate of the multivariate distribution of the confidential data. ...
... If desired by the OPM, we could create and release multiple synthetic datasets by repeating the data generation process. An advantage of releasing multiple synthetic datasets is that users can propagate uncertainty from the synthesis process through their inferences using simple combining rules (Raghunathan et al., 2003;Reiter and Raghunathan, 2007;Drechsler, 2011). ...
Preprint
Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. One approach suggested in the literature is that data stewards generate and release synthetic data, i.e., data simulated from statistical models, while also providing users access to a verification server that allows them to assess the quality of inferences from the synthetic data. We present an application of the synthetic data plus verification server approach to longitudinal data on employees of the U.S. federal government. As part of the application, we present a novel model for generating synthetic career trajectories, as well as strategies for generating high dimensional, longitudinal synthetic datasets. We also present novel verification algorithms for regression coefficients that satisfy differential privacy. We illustrate the integrated use of synthetic data plus verification via analysis of differentials in pay by race. The integrated system performs as intended, allowing users to explore the synthetic data for potential pay differentials and learn through verifications which findings in the synthetic data hold up and which do not. The analysis on the confidential data reveals pay differentials across races not documented in published studies.
... We must now combine the M point estimates of the marginal treatment effect and their variances to generate a posterior distribution. Pooling across multiple syntheses is a topic that has already been investigated within the domain of statistical disclosure limitation [50][51][52][53][54][55][56]. ...
... Raghunathan et al. [50] describe full synthesis as a two-step process: (1) construct multiple synthetic populations by repeatedly drawing from the posterior predictive distribution, conditional on a model fitted to the original data; and (2) draw random samples from each synthetic population, releasing these synthetic samples to the public. In practice, as indicated by Reiter and Raghunathan [55], it is not a requirement to generate the populations, but only to generate values for the synthetic samples. Once the samples are released, the analyst seeks inferences based on the synthetic data alone. ...
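As a toy illustration of that two-step recipe for a single normally distributed variable: fit a model to the confidential data, draw parameters from its posterior, then draw a fresh synthetic sample from the posterior predictive in lieu of sampling from a synthetic population. This is a sketch under simplified assumptions, not the procedure of any particular paper; the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(2007)

def fully_synthetic_samples(y, m=10, n_syn=None):
    """Generate m fully synthetic releases of a single numeric variable.

    Assumes y ~ Normal(mu, sigma^2) with a noninformative prior; for each
    release we (1) draw (mu, sigma^2) from the posterior given the
    confidential data and (2) draw n_syn values from the posterior
    predictive as the released synthetic sample.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    n_syn = n if n_syn is None else n_syn
    y_bar, s2 = y.mean(), y.var(ddof=1)

    releases = []
    for _ in range(m):
        sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)             # posterior draw of sigma^2
        mu = rng.normal(y_bar, np.sqrt(sigma2 / n))              # posterior draw of mu
        releases.append(rng.normal(mu, np.sqrt(sigma2), n_syn))  # synthetic sample
    return releases
```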
Article
Full-text available
Background When studying the association between treatment and a clinical outcome, a parametric multivariable model of the conditional outcome expectation is often used to adjust for covariates. The treatment coefficient of the outcome model targets a conditional treatment effect. Model-based standardization is typically applied to average the model predictions over the target covariate distribution, and generate a covariate-adjusted estimate of the marginal treatment effect. Methods The standard approach to model-based standardization involves maximum-likelihood estimation and use of the non-parametric bootstrap. We introduce a novel, general-purpose, model-based standardization method based on multiple imputation that is easily applicable when the outcome model is a generalized linear model. We term our proposed approach multiple imputation marginalization (MIM). MIM consists of two main stages: the generation of synthetic datasets and their analysis. MIM accommodates a Bayesian statistical framework, which naturally allows for the principled propagation of uncertainty, integrates the analysis into a probabilistic framework, and allows for the incorporation of prior evidence. Results We conduct a simulation study to benchmark the finite-sample performance of MIM in conjunction with a parametric outcome model. The simulations provide proof-of-principle in scenarios with binary outcomes, continuous-valued covariates, a logistic outcome model and the marginal log odds ratio as the target effect measure. When parametric modeling assumptions hold, MIM yields unbiased estimation in the target covariate distribution, valid coverage rates, and similar precision and efficiency than the standard approach to model-based standardization. Conclusion We demonstrate that multiple imputation can be used to marginalize over a target covariate distribution, providing appropriate inference with a correctly specified parametric outcome model and offering statistical performance comparable to that of the standard approach to model-based standardization.
... We additionally employ a simple technique, drawing inspiration from multiple imputation literature (Rubin, 1987;Reiter and Raghunathan, 2007), to quantify the additional uncertainty introduced by sampling finite (small) data sets from the learned generative models, which is detailed in Sec. 4.4. ...
... After this, each party locally combines the resulting K predictive models, by either distilling a single combined model out of them or setting up a suitable ensemble. Following (Räisä et al., 2023), we use Rubin's rules to combine the obtained parameter and standard error estimates into a single model, analogous to the concept of multiple imputation (Rubin, 1987;Reiter and Raghunathan, 2007), where missing data is repeatedly replaced with resampled available data. We set the number of synthetic data sets sampled by each party as K = 100. ...
Preprint
Full-text available
Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible. We propose a framework in which each party shares a differentially private synthetic twin of their data. We study the feasibility of combining such synthetic twin data sets for collaborative learning on real-world health data from the UK Biobank. We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of target statistics compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups. Based on our results we conclude that sharing of synthetic twins is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. The setting of distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.
... In this case, the partial synthesis approach simply replaces all variables with synthetic values. More details about generating fully synthetic data according to the original proposal in Rubin (1993) can be found in Raghunathan et al. (2003); Reiter and Raghunathan (2007);Drechsler (2011). ...
... For fully synthetic data, Rubin (1993) originally proposes we use T_f = (1 + 1/m)B − W for the variance estimate of β (see Reiter, 2002, for an alternative non-negative variance estimator) and v_f = (m − 1)(1 − W/((1 + 1/m)B))^2 for the degrees of freedom of the corresponding t distribution. We refer interested readers to Raghunathan et al. (2003); Reiter and Raghunathan (2007); Drechsler (2011) for further details on these combining rules. Additionally, Raab et al. (2017b) provides an overview of other ways to estimate the variance and recommendations on which to use under different settings. ...
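The quoted variance and degrees-of-freedom formulas translate directly into code. The sketch below mirrors them for a scalar estimate (the function name is ours) and only flags the negative-variance case rather than implementing the Reiter (2002) alternative.

```python
import numpy as np

def pool_fully_synthetic(estimates, std_errors):
    """Combining rule for fully synthetic data quoted above.

    Returns the pooled estimate, the total variance T_f = (1 + 1/m)B - W,
    and the degrees of freedom v_f = (m - 1)(1 - W/((1 + 1/m)B))^2.
    T_f can be negative; in that case a non-negative alternative
    (Reiter, 2002) should be substituted (not implemented here).
    """
    q = np.asarray(estimates, dtype=float)
    w = (np.asarray(std_errors, dtype=float) ** 2).mean()   # W: mean within-dataset variance
    m = len(q)
    b = q.var(ddof=1)                                       # B: between-synthesis variance

    t_f = (1 + 1 / m) * b - w
    v_f = (m - 1) * (1 - w / ((1 + 1 / m) * b)) ** 2
    if t_f < 0:
        print("T_f is negative; use a non-negative alternative estimator.")
    return q.mean(), t_f, v_f
```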
Preprint
Synthetic data generation is a powerful tool for privacy protection when considering public release of record-level data files. Initially proposed about three decades ago, it has generated significant research and application interest. To meet the pressing demand of data privacy protection in a variety of contexts, the field needs more researchers and practitioners. This review provides a comprehensive introduction to synthetic data, including technical details of their generation and evaluation. Our review also addresses the challenges and limitations of synthetic data, discusses practical applications, and provides thoughts for future work.
... We must now combine the M point estimates of the marginal treatment effect and their variances to generate a posterior distribution. Pooling across multiple syntheses is a topic that has already been investigated within the domain of statistical disclosure limitation [40,41,42,43,44,45,46]. ...
... Raghunathan et al. [40] describe full synthesis as a two-step process: (1) construct multiple synthetic populations by repeatedly drawing from the posterior predictive distribution, conditional on a model fitted to the original data; and (2) draw random samples from each synthetic population, releasing these synthetic samples to the public. In practice, as indicated by Reiter and Raghunathan [45], it is not a requirement to generate the populations, but only to generate values for the synthetic samples. Once the samples are released, the analyst seeks inferences based on the synthetic data alone. ...
Preprint
Full-text available
When studying the association between treatment and a clinical outcome, a parametric multivariable model of the conditional outcome expectation is often used to adjust for covariates. The treatment coefficient of the outcome model targets a conditional treatment effect. Model-based standardization is typically applied to average the model predictions over the target covariate distribution, and generate a covariate-adjusted estimate of the marginal treatment effect. The standard approach to model-based standardization involves maximum-likelihood estimation and use of the non-parametric bootstrap. We introduce a novel, general-purpose, model-based standardization method based on multiple imputation that is easily applicable when the outcome model is a generalized linear model. We term our proposed approach multiple imputation marginalization (MIM). MIM consists of two main stages: the generation of synthetic datasets and their analysis. MIM accommodates a Bayesian statistical framework, which naturally allows for the principled propagation of uncertainty, integrates the analysis into a probabilistic framework, and allows for the incorporation of prior evidence. We conduct a simulation study to benchmark the finite-sample performance of MIM in conjunction with a parametric outcome model. The simulations provide proof-of-principle in scenarios with binary outcomes, continuous-valued covariates, a logistic outcome model and the marginal log odds ratio as the target effect measure. When parametric modeling assumptions hold, MIM yields unbiased estimation in the target covariate distribution, valid coverage rates, and similar precision and efficiency than the standard approach to model-based standardization.
... The authors have discussed multiple imputation in a large database with missing data and measurement error [19]. Nested multiple imputations and standard multiple imputations are discussed in [19]. ...
... The authors have discussed multiple imputation in a large database with missing data and measurement error [19]. Nested multiple imputations and standard multiple imputations are discussed in [19]. Data collections, data analysis, data processing, data interpretations, recommendations, and reasons for missing data are discussed as a review in [20]. ...
Article
Nowadays, digital data processing plays an important role in various significant fields such as biomedical, marketing, data analytics, machine learning, etc. Bulk data collection and processing are complex tasks due to the complexity of collection and manipulation. Also, from bulk data, there is a lot of probability of losing the data due to improper collection. Missing data prediction can be performed manually and automatically. Manual data prediction is possible for databases with a small size but complex in cases of the larger database. This paper proposed a novel technique based on the Support Vector Machine (SVM) to predict the lost data from a bulk dataset automatically. This paper uses various algorithms to get the similarity between the users and the contents. Here it uses City Block distance (CBD) metrics, Root Mean Square Error (RMSE), Pearson Coefficient Measurement (PCM), and Cosine Similarity Measurement (CSM). This paper is validated using the Movie Lens dataset, and it produces promising performances. Compared to the system's performance with the other existing methods among all methods, it has better performance.
... The coarse data framework is very useful for characterising the possible ways in which observed data may differ from their true values, and while it incorporates missing data as a type of coarsening, its extension to other data problems such as measurement error is limited on theoretical grounds. Recent advances in multiple imputation theory do indeed pose solutions to data measured with error (see, particularly, Ghosh-Dastidar and Schafer, 2003), but associated with this is (1) a necessary change in the operation of imputation algorithms and (2) a modification of the combination rules required for valid statistical inference from multiply imputed datasets (Reiter and Raghunathan, 2007). ...
... • Conducting complete-case analysis from multiply imputed data using the correct combination rules. Depending on the problem under investigation, these combination rules may differ from Rubin's Rules (Reiter and Raghunathan, 2007). • Testing the sensitivity of the results. ...
Chapter
Full-text available
Employment income data are coarsened as a result of questionnaire design. In the previous chapter we saw that Statistics South Africa (SSA) ask two employment income questions: an exact income question with a showcard follow-up. In public-use datasets, this results in two income variables: a continuously distributed variable for exact income responses and a categorical variable for bounded income responses with separate categories for nonresponse. It is the task of the researcher to then generate a single income variable that effectively deals with this mixture of data types. Following Heitjan and Rubin (1991), we call a variable with this mixture of data types “coarse data”.
... The coarse data framework is very useful for characterising the possible ways in which observed data may differ from their true values, and while it incorporates missing data as a type of coarsening, its extension to other data problems such as measurement error is limited on theoretical grounds. Recent advances in multiple imputation theory do indeed pose solutions to data measured with error (see, particularly, Ghosh-Dastidar and Schafer, 2003), but associated with this is (1) a necessary change in the operation of imputation algorithms and (2) a modification of the combination rules required for valid statistical inference from multiply imputed datasets (Reiter and Raghunathan, 2007). ...
... • Conducting complete-case analysis from multiply imputed data using the correct combination rules. Depending on the problem under investigation, these combination rules may differ from Rubin's Rules (Reiter and Raghunathan, 2007). • Testing the sensitivity of the results. ...
Chapter
Full-text available
This chapter identifies a framework for investigating microdata quality that is particularly useful to researchers working with public-use micro datasets where limited information about the data quality protocols of the survey organisation are present. It then utilises this framework to investigate South African labour market household surveys from the mid 1990s to 2007. In order to develop the framework, we rely on the total survey error (TSE) framework to articulate the forms of statistical imprecision that exist in any public-use dataset. The magnitudes of statistical imprecision are largely dependent on the efficacy of the survey organisation’s data quality control protocols, which are, in turn, affected by human resource and budget constraints.
... The coarse data framework is very useful for characterising the possible ways in which observed data may differ from their true values, and while it incorporates missing data as a type of coarsening, its extension to other data problems such as measurement error is limited on theoretical grounds. Recent advances in multiple imputation theory do indeed pose solutions to data measured with error (see, particularly, Ghosh-Dastidar and Schafer, 2003), but associated with this is (1) a necessary change in the operation of imputation algorithms and (2) a modification of the combination rules required for valid statistical inference from multiply imputed datasets (Reiter and Raghunathan, 2007). ...
... • Conducting complete-case analysis from multiply imputed data using the correct combination rules. Depending on the problem under investigation, these combination rules may differ from Rubin's Rules (Reiter and Raghunathan, 2007). • Testing the sensitivity of the results. ...
Chapter
Full-text available
The income question in household surveys is one of the most socially sensitive constructs. Two problems that arise with social sensitivity concern the probability of obtaining a response and the type of response provided. In survey error terms, this translates into an important relationship between questionnaire design (construct validity) and item non-response. In turn, these affect the statistical distribution of income that has both univariate and multivariate implications. Consequently, the interrelationship between questionnaire design and response type is crucial to understand when conducting analyses of the income variable.
... The coarse data framework is very useful for characterising the possible ways in which observed data may differ from their true values, and while it incorporates missing data as a type of coarsening, its extension to other data problems such as measurement error is limited on theoretical grounds. Recent advances in multiple imputation theory do indeed pose solutions to data measured with error (see, particularly, Ghosh-Dastidar and Schafer, 2003), but associated with this is (1) a necessary change in the operation of imputation algorithms and (2) a modification of the combination rules required for valid statistical inference from multiply imputed datasets (Reiter and Raghunathan, 2007). ...
... • Conducting complete-case analysis from multiply imputed data using the correct combination rules. Depending on the problem under investigation, these combination rules may differ from Rubin's Rules (Reiter and Raghunathan, 2007). • Testing the sensitivity of the results. ...
Chapter
Full-text available
Household survey data are subject to multiple forms of survey error that can have a direct bearing on data quality, influencing end-user estimates of parameters of interest in unpredictable ways. This book has focussed specifically on employee income, but the insights are generalisable to any component of income.
... We must account for a third source of variation to produce valid statistical inference: the uncertainty due to the data being synthesized. This is incorporated by pooling across multiple syntheses, a question shared with the domain of statistical disclosure limitation [196][197][198][199][200][201][202]. ...
... Raghunathan et al. [196] describe full synthesis as a two-step process: (1) construct multiple synthetic populations by repeatedly drawing from the posterior predictive distribution, conditional on a model fitted to the original data; and (2) draw random samples from each synthetic population and release these synthetic samples to the public. In practice, as indicated by Reiter and Raghunathan [201], it is not a requirement to generate the populations, but only to generate values for the synthetic samples. Once the samples are released, the analyst seeks inferences based on the synthetic data alone. ...
Thesis
Health technology assessment systems base their decision-making on health-economic evaluations. These require accurate relative treatment effect estimates for specific patient populations. In an ideal scenario, a head-to-head randomized controlled trial, directly comparing the interventions of interest, would be available. Indirect treatment comparisons are necessary to contrast treatments which have not been analyzed in the same trial. Population-adjusted indirect comparisons estimate treatment effects where there are: no head-to-head trials between the interventions of interest, limited access to patient-level data, and cross-trial differences in effect measure modifiers. Health technology assessment agencies are increasingly accepting evaluations that use these methods across a diverse range of therapeutic areas. Popular approaches include matching-adjusted indirect comparison (MAIC), based on propensity score weighting, and simulated treatment comparison (STC), based on outcome regression. There is limited formal evaluation of these methods and whether they can be used to accurately compare treatments. Thus, I undertake a review and a simulation study that compares the standard unadjusted indirect comparisons, MAIC and STC across 162 scenarios. This simulation study assumes that the trials are investigating survival outcomes and measure continuous covariates, with the log hazard ratio as the measure of effect — one of the most widely used setups in health technology assessment applications. MAIC yields unbiased treatment effect estimates under no failures of assumptions. The typical usage of STC produces bias because it targets a conditional treatment effect where the target estimand should be a marginal treatment effect. The incompatibility of estimates in the indirect comparison leads to bias as the measure of effect is non-collapsible. When adjusting for covariates, one must integrate or average the conditional model over the population of interest to recover a compatible marginal treatment effect. I propose a marginalization method based on parametric G-computation that can be easily applied where the outcome regression is a generalized linear model or a Cox model. In addition, I introduce a novel general-purpose method based on the ideas underlying multiple imputation, which is termed multiple imputation marginalization (MIM) and is applicable to a wide range of models, including parametric survival models. The approaches view the covariate adjustment regression as a nuisance model and separate its estimation from the evaluation of the marginal treatment effect of interest. Both methods can accommodate a Bayesian statistical framework, which naturally integrates the analysis into a probabilistic framework, typically required for health technology assessment. Another simulation study provides proof-of-principle for the methods and benchmarks their performance against MAIC and the conventional STC. The simulations are based on scenarios with binary outcomes and continuous covariates, with the log-odds ratio as the measure of effect. The marginalized outcome regression approaches achieve more precise and more accurate estimates than MAIC, particularly when covariate overlap is poor, and yield unbiased marginal treatment effect estimates under no failures of assumptions. 
Furthermore, regression-adjusted estimates of the marginal effect provide greater precision and accuracy than the conditional estimates produced by the conventional STC, which are systematically biased because the log-odds ratio is a non-collapsible measure of effect. The marginalization methods outlined in this thesis are necessary and important for health technology assessment more generally, because marginal treatment effects should be the preferred inferential target for reimbursement decisions at the population level. Treatment effectiveness inputs in health economic models are often informed by the treatment coefficient of a multivariable regression. An often overlooked issue is that this has a conditional interpretation, and that the coefficients of the regression must be marginalized over the target population of interest to produce a relevant estimate for reimbursement decisions at the population level.
... This is incorporated by pooling across multiple syntheses, a question shared with the domain of statistical disclosure limitation. [121][122][123][124][125][126][127] In statistical disclosure limitation, data agencies mitigate the risk of identity disclosure by releasing multiple fully synthetic datasets, i.e., datasets that only contain simulated values, in lieu of the original confidential data of real survey respondents. Raghunathan et al. 121 describe full synthesis as a two-step process: (1) construct multiple synthetic populations by repeatedly drawing from the posterior predictive distribution, conditional on a model fitted to the original data; and (2) draw random samples from each synthetic population and release these synthetic samples to the public. ...
... Raghunathan et al. 121 describe full synthesis as a two-step process: (1) construct multiple synthetic populations by repeatedly drawing from the posterior predictive distribution, conditional on a model fitted to the original data; and (2) draw random samples from each synthetic population and release these synthetic samples to the public. In practice, as indicated by Reiter and Raghunathan, 126 it is not a requirement to generate the populations, but only to generate values for the synthetic samples. Once the samples are released, the analyst seeks inferences based on the synthetic data alone. ...
Article
Population adjustment methods such as matching-adjusted indirect comparison (MAIC) are increasingly used to compare marginal treatment effects when there are cross-trial differences in effect modifiers and limited patient-level data. MAIC is sensitive to poor covariate overlap and cannot extrapolate beyond the observed covariate space. Current outcome regression-based alternatives can extrapolate but target a conditional treatment effect that is incompatible in the indirect comparison. When adjusting for covariates, one must integrate or average the conditional estimate over the population of interest to recover a compatible marginal treatment effect. We propose a marginalization method based on parametric G-computation that can be easily applied where the outcome regression is a generalized linear model or a Cox model. In addition, we introduce a novel general-purpose method based on multiple imputation, which we term multiple imputation marginalization (MIM) and is applicable to a wide range of models. Both methods can accommodate a Bayesian statistical framework, which naturally integrates the analysis into a probabilistic framework. A simulation study provides proof-of-principle for the methods and benchmarks their performance against MAIC and the conventional outcome regression. The marginalized outcome regression approaches achieve more precise and more accurate estimates than MAIC, particularly when covariate overlap is poor, and yield unbiased marginal treatment effect estimates under no failures of assumptions. Furthermore, the marginalized regression-adjusted estimates provide greater precision and accuracy than the conditional estimates produced by the conventional outcome regression, which are systematically biased because the measure of effect is non-collapsible.
... Below we review the inferential procedures of Rubin (1987) for a scalar-valued estimand, and the procedures of Rubin (1987) and Li, Raghunathan, and Rubin (1991) for a vector-valued estimand. In addition to Rubin (1987) and Li, Raghunathan, and Rubin (1991), one may also refer to Rubin (1996), Schafer (1997), Little and Rubin (2002) and Reiter and Raghunathan (2007) for more information on these procedures, and for additional procedures. ...
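For the vector-valued case mentioned in this excerpt, the Wald-type pooled statistic of Li, Raghunathan, and Rubin (1991) can be sketched as below. The notation and function name are ours; the statistic is referred to an F distribution whose denominator degrees of freedom are given in that paper and omitted here.

```python
import numpy as np

def pool_wald_test(q_draws, u_within, q0):
    """Wald-type test for a k-dimensional estimand from m imputed datasets.

    q_draws: (m, k) array of per-dataset point estimates.
    u_within: (m, k, k) array of per-dataset covariance matrices.
    q0: length-k null value.
    Returns the D1 statistic and the average relative increase in
    variance r; D1 is compared with an F(k, v) reference distribution.
    """
    q = np.asarray(q_draws, dtype=float)
    m, k = q.shape
    q_bar = q.mean(axis=0)
    u_bar = np.mean(u_within, axis=0)                      # average within covariance
    b = np.cov(q, rowvar=False, ddof=1)                    # between covariance
    r = (1 + 1 / m) * np.trace(b @ np.linalg.inv(u_bar)) / k
    diff = q_bar - np.asarray(q0, dtype=float)
    d1 = diff @ np.linalg.solve(u_bar, diff) / (k * (1 + r))
    return d1, r
```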
Article
Full-text available
In this paper we consider the scenario where continuous microdata have been noise infused using a differentially private Laplace mechanism for the purpose of statistical disclosure control. We assume the original data are independent and identically distributed, having distribution within a parametric family of continuous distributions. We use a variant of the Laplace mechanism that allows the range of the original data to be unbounded by first truncating the original data and then adding appropriate Laplace random noise. We propose methodology to analyze the noise infused data using multiple imputation. This approach allows the data user to analyze the released data as if it were original, i.e., not noise infused, and then to obtain inference that accounts for the noise infusion mechanism using standard multiple imputation combining formulas. Methodology is presented for univariate data, and some simulation studies are presented to evaluate the performance of the proposed method. An extension of the proposed methodology to multivariate data is also presented. IJSS, Vol. 24(2) Special, December, 2024, pp 95-122
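A minimal sketch of the kind of truncate-then-perturb Laplace release described in this abstract, as an illustration under assumed bounds and a per-record privacy budget rather than the authors' exact mechanism; the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_laplace_release(x, lower, upper, epsilon):
    """Release noise-infused microdata for one continuous variable.

    Clamps each value to [lower, upper] and adds independent Laplace noise
    calibrated to the clamped range, so that changing any single record's
    value perturbs the release by at most (upper - lower) before noise.
    """
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    scale = (upper - lower) / epsilon       # per-record sensitivity / privacy budget
    return x + rng.laplace(0.0, scale, size=x.shape)
```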
... Second, the imputation procedure must be carried out, resulting in a number of imputed data sets. Third, the data sets must be analyzed separately, and the resulting parameter estimates are combined according to the rules described in Rubin (1987; for alternatives, see Carpenter & Kenward, 2013;Reiter & Raghunathan, 2007). These steps can be carried out using the mitml package. ...
Preprint
The treatment of missing data can be difficult in multilevel research because state-of-the-art procedures such as multiple imputation (MI) may require advanced statistical knowledge or a high degree of familiarity with certain statistical software. In the missing data literature, pan has been recommended for MI of multilevel data. In this article, we provide an introduction to MI of multilevel missing data using the R package pan, and we discuss its possibilities and limitations in accommodating typical questions in multilevel research. In order to make pan more accessible to applied researchers, we make use of the mitml package, which provides a user-friendly interface to the pan package and several tools for managing and analyzing multiply imputed data sets. We illustrate the use of pan and mitml with two empirical examples that represent common applications of multilevel models, and we discuss how these procedures may be used in conjunction with other software.
... Consequently, in generating predictive values for individual subjects or estimating uncertainty in model parameters, we use an average over all M fair datasets (y, x). This approach of creating multiple datasets is also used in the privacy settings [Reiter, 2005] and multiple imputation [Rubin, 2004; Reiter and Raghunathan, 2007], where a common default value is M = 10 [Buuren and Groothuis-Oudshoorn, 2011]. In the fairness setting, we have the additional goal of limiting the effect of stochastic synthetic data x on individual predictions, so we use a larger default value of M = 50. ...
Preprint
Predictive modeling is increasingly being employed to assist human decision-makers. One purported advantage of replacing or augmenting human judgment with computer models in high stakes settings-- such as sentencing, hiring, policing, college admissions, and parole decisions-- is the perceived "neutrality" of computers. It is argued that because computer models do not hold personal prejudice, the predictions they produce will be equally free from prejudice. There is growing recognition that employing algorithms does not remove the potential for bias, and can even amplify it if the training data were generated by a process that is itself biased. In this paper, we provide a probabilistic notion of algorithmic bias. We propose a method to eliminate bias from predictive models by removing all information regarding protected variables from the data to which the models will ultimately be trained. Unlike previous work in this area, our framework is general enough to accommodate data on any measurement scale. Motivated by models currently in use in the criminal justice system that inform decisions on pre-trial release and parole, we apply our proposed method to a dataset on the criminal histories of individuals at the time of sentencing to produce "race-neutral" predictions of re-arrest. In the process, we demonstrate that a common approach to creating "race-neutral" models-- omitting race as a covariate-- still results in racially disparate predictions. We then demonstrate that the application of our proposed method to these data removes racial disparities from predictions with minimal impact on predictive accuracy.
... The most direct line of work compared to ours uses methods that generate synthetic survey responses and weights [HSW22]. These methods can produce synthetic data that are interoperable with existing analyses and admit combining-rules-based approaches to inferences with synthetic data [RR07]. Our approach differs in key ways. ...
Preprint
Full-text available
In general, it is challenging to release differentially private versions of survey-weighted statistics with low error for acceptable privacy loss. This is because weighted statistics from complex sample survey data can be more sensitive to individual survey response and weight values than unweighted statistics, resulting in differentially private mechanisms that can add substantial noise to the unbiased estimate of the finite population quantity. On the other hand, simply disregarding the survey weights adds noise to a biased estimator, which also can result in an inaccurate estimate. Thus, the problem of releasing an accurate survey-weighted estimate essentially involves a trade-off among bias, precision, and privacy. We leverage this trade-off to develop a differentially private method for estimating finite population quantities. The key step is to privately estimate a hyperparameter that determines how much to regularize or shrink survey weights as a function of privacy loss. We illustrate the differentially private finite population estimation using the Panel Study of Income Dynamics. We show that optimal strategies for releasing DP survey-weighted mean income estimates require orders-of-magnitude less noise than naively using the original survey weights without modification.
... The proportion of missing data was about 3.4%. The missing values were imputed using multiple imputation methods with a minimum of 5 imputations (Rubin, 1987) and at least 10 iterations per imputation (Reiter & Raghunathan, 2007). The multiple imputation was carried out using the R package MICE (van Buuren & Groothuis-Oudshoorn, 2011). ...
... In the synthetic data literature, the problem illustrated in Theorem 2 is often addressed by releasing multiple synthetic datasets and using combining rules to account for the increased variability due to the synthetic data generation procedure (Raghunathan et al., 2003;Reiter and Raghunathan, 2007;Reiter, 2002). However, it still remains that the synthetic data do not follow the same distribution as the original dataset, and the combining rules are often designed for only specific statistics. ...
... Higher scores reflect higher endorsement of depressive symptoms, with scores in the 10-11 range considered indicative of clinically significant levels of depression among Spanish-speaking populations (Garcia-Esteve et al., 2003). Due to miskeyed items in the Spanish language version of the survey at several time points and other random sources of missingness, missing data were imputed using Mplus 7 (Muthén & Muthén, 1998; Reiter & Raghunathan, 2007) for all time points from the prenatal assessment through 24 weeks (additional information on imputation rationale and methods is provided in the online supplemental materials). ...
Article
Full-text available
The current study used novel methodology to characterize intraindividual variability in the experience of dynamic, within-person changes in postpartum depressive (PPD) symptoms across the first year postpartum and evaluated maternal and infant characteristics as predictors of between-person differences in intraindividual variability in PPD symptoms over time. With a sample of 322 low-income Mexican-origin mothers (Mage = 27.79; SD = 6.48), PPD symptoms were assessed at 11 time points from 3 weeks to 1 year postpartum (Edinburgh Perinatal Depression Scale; Cox & Holden, 2003). A prenatal cumulative risk index was calculated from individual psychosocial risk factors. Infant temperamental negativity was assessed via a maternal report at the infant age of 6 weeks (Infant Behavior Questionnaire; Putnam et al., 2014). Multilevel location scale analyses in a dynamic structural equation modeling (Asparouhov et al., 2018) framework were conducted. Covariates included prenatal depressive symptoms. On average, within-mother change in depressive symptoms at one time point was found to carry over to the next time point. Nonnull within-mother volatility in PPD symptoms reflected substantial ebbs and flows in PPD symptoms over the first year postpartum. Results of the between-level model demonstrated that mothers differed in their equilibriums, carryover, and volatility of their PPD symptoms. Mothers with more negative infants and those with higher prenatal cumulative risk exhibited higher equilibriums of PPD symptoms and more volatility in symptoms but did not differ in their carryover of PPD symptoms.
... Figure 7 represents the decision tree as an example. Decision trees have found many fields of implementation due to their simple analysis [32]. ...
Article
Full-text available
Text mining is an intriguing area of research, considering there is an abundance of text across the Internet and in social media. Nevertheless, outliers pose a challenge for textual data processing. The ability to identify this sort of irrelevant input is consequently crucial in developing high-performance models. In this paper, a novel unsupervised method for identifying outliers in text data is proposed. In order to spot outliers, we concentrate on the degree of similarity between any two documents and the density of related documents that might support integrated clustering throughout processing. To compare the effectiveness of our proposed approach with alternative classification techniques, we performed a number of experiments on a real dataset. Experimental findings demonstrate that the suggested model can obtain accuracy greater than 98% and performs better than the other existing algorithms.
... On the other hand, the model comparison itself can become challenging when using MI because it requires the results to be pooled across the imputed data sets. Pooling methods for various types of analyses have been proposed in the missing data literature (for an overview, see Reiter & Raghunathan, 2007). The main focus of this article is on model comparisons with LRTs, but similar methods exist for Wald and score tests (Mansolf et al., 2020). ...
Article
Full-text available
Different statistical models are often compared with likelihood ratio tests (LRTs) in psychology. However, participants sometimes do not fully complete their responses, and modern methods such as multiple imputation are often used to replace missing responses with plausible values. Conducting LRTs with imputed data is more difficult than with complete data, and several different methods have been developed for this task. In this article, we evaluate the performance of all available methods in applications with linear regression, logistic regression, and structural equation modeling. In addition, we developed an R package that can be used to apply these methods, and we provide an example with an annotated R script, in which we use these methods to investigate measurement invariance.
... We used MI to create and analyze five multiply imputed datasets. Incomplete variables were imputed under fully conditional specification (Markov chain Monte Carlo method), with ten iterations maximum for each imputation as recommended (Reiter & Raghunathan, 2007). Results from the five datasets were pooled. ...
Thesis
Full-text available
Student engagement (SE) is associated with higher academic performance and persistence, influencing academic completion. Social and emotional competencies (SECs) are fundamental protective factors for a healthy development. Despite being associated with SE, there are research gaps in this area. Therefore, our goal was to analyse the association between SECs and SE, accounting for individual and environmental factors. At first, a systematic review was performed to summarise previous research. Then, a questionnaire to assess emotion regulation strategies (ERS) in youth was validated. Subsequently, four quantitative studies were performed to examine the association between SECs and SE: the first, focused on ERS and included a representative sample of Portuguese youth (10-25 years old), the second one compared students who lived with their parents vs in residential care, and integrated school success perception and absenteeism; the third one had a sample of university students from nine countries and analysed the influence of the country's development index, and the fourth used a longitudinal methodology and also analysed the impact on mental health. Our findings suggest that SECs are positively associated with higher SE and lower absenteeism, regardless of family or sociocultural context; SE protects the maintenance of SECs in adverse environments; school success perception decreases absenteeism; and CSEs seem to be predictive of higher SE and better mental health. To enhance the effectiveness of health-promoting programmes, and based on evidence, we highlight that: social and emotional learning (SEL) programmes must be universal and integrated into the academic curriculum (including at university) but consider the developmental characteristics and needs of individuals; SE is fundamental to promote health, especially for the most vulnerable students and in adverse situations; SE support must include the implementation of youth-friendly policies.
... We will only review the combining rules for univariate estimates here borrowing heavily from Drechsler (2011c). The interested reader is referred to Reiter and Raghunathan (2007), which offers a full review of all combining rules for synthetic data and also for the nonresponse context. To understand the procedure of analyzing multiply imputed synthetic datasets, think of an analyst interested in an unknown scalar parameter Q, where Q could be, for example, the mean of a variable, the correlation coefficient between two variables, or a regression coefficient in a linear regression. ...
Preprint
Full-text available
The idea to generate synthetic data as a tool for broadening access to sensitive microdata has been proposed for the first time three decades ago. While first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in parts by some recent developments in computer science. We consider the upcoming 30th jubilee of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We will also discuss the various strategies that have been suggested to measure the utility and remaining risk of disclosure of the generated data.
... Thus, unlike in the missing data setting, the between-imputation variance captures variability both due to uncertainty about µ in the observed data estimate and the additional variability due to effectively taking new random samples of size n_syn from the population for each imputation (Reiter and Raghunathan, 2007). The expected value of V_syn is then ...
Preprint
G-formula is a popular approach for estimating treatment or exposure effects from longitudinal data that are subject to time-varying confounding. G-formula estimation is typically performed by Monte-Carlo simulation, with non-parametric bootstrapping used for inference. We show that G-formula can be implemented by exploiting existing methods for multiple imputation (MI) for synthetic data. This involves using an existing modified version of Rubin's variance estimator. In practice missing data is ubiquitous in longitudinal datasets. We show that such missing data can be readily accommodated as part of the MI procedure, and describe how MI software can be used to implement the approach. We explore its performance using a simulation study.
... We define the estimated prevalence coefficients as µ (t) kj for the confidential data and µ λ,(t),m kj for the mth synthetic data with protection level λ. There are a total of M partially synthetic datasets and once M > 1 are generated, confidence intervals of coefficients can be calculated using widely-used combining rules (Reiter & Raghunathan, 2007;Drechsler, 2011). ...
Article
Full-text available
Privacy concerns emerge when online users of popular user-generated content (UGC) platforms are identified through a combination of their structured data (e.g., location and name) and textual content (e.g., word choices and writing style). To overcome this problem, we introduce a Bayesian sequential synthesis methodology for organizations to share structured data adjoined to textual content. Our proposed approach enables platforms to use a single shrinkage parameter to control the privacy level of their released UGC data. Our results show that our synthesis strategy decreases the probability of identification of a user to an acceptable threshold while maintaining much of the textual content present in the structured data. Additionally, we find that the value of sharing our protected data exceeds that of sharing the unprotected structured data and textual content separately. These findings encourage UGC platforms that wish to be known for consumer privacy to protect anonymity of their online users with synthetic data.
... Standard errors are adjusted for between-dataset variation within each block of 800 plausible completed datasets, using an adaptation of the method described in Reiter and Raghunathan (2007) for small-sample multiple imputation. We also apply Reiter and Raghunathan-style penalties to the allowed degrees of freedom in t-distributions for significance testing. ...
... Next, the results are aggregated according to a set of rules, and we finally recover a more robust estimator for our downstream analysis. Further discussion of Rubin's rules can be found, for example, in Reiter and Raghunathan (2007). ...
Preprint
Full-text available
Differentially private (DP) release of multidimensional statistics typically considers an aggregate sensitivity, e.g. the vector norm of a high-dimensional vector. However, different dimensions of that vector might have widely different magnitudes and therefore DP perturbation disproportionately affects the signal across dimensions. We observe this problem in the gradient release of the DP-SGD algorithm when using it for variational inference (VI), where it manifests in poor convergence as well as high variance in outputs for certain variational parameters, and make the following contributions: (i) We mathematically isolate the cause for the difference in magnitudes between gradient parts corresponding to different variational parameters. Using this as prior knowledge we establish a link between the gradients of the variational parameters, and propose an efficient yet simple fix for the problem to obtain a less noisy gradient estimator, which we call aligned gradients. This approach allows us to obtain the updates for the covariance parameter of a Gaussian posterior approximation without a privacy cost. We compare this to alternative approaches for scaling the gradients using analytically derived preconditioning, e.g. natural gradients. (ii) We suggest using iterate averaging over the DP parameter traces recovered during the training, to reduce the DP-induced noise in parameter estimates at no additional cost in privacy. Finally, (iii) to accurately capture the additional uncertainty DP introduces to the model parameters, we infer the DP-induced noise from the parameter traces and include that in the learned posteriors to make them noise aware. We demonstrate the efficacy of our proposed improvements through various experiments on real data.
... Rubin (1993) advocates for such synthetic data to 'release no actual microdata but only synthetic microdata constructed using multiple imputation so that they can be validly analyzed' (Rubin, 1993, p. 461). Several studies have developed techniques to make valid statistical inference on multiply imputed synthetic data (e.g., Reiter & Raghunathan, 2007). Generating synthetic data for replication purposes is less ambitious, as the utility of replication data requires that the results are qualitatively similar to those from the original data (e.g., Khan & Kabir, 2021), and only one synthetic dataset would be released with the article. ...
Article
Full-text available
Empirical studies in agricultural economics usually involve policy implications. In many cases, such studies rely on proprietary or confidential data that cannot be published along with the article, challenging the replicability and credibility of the results. To overcome this problem, the use of synthetic data—that is, data that do not contain a single unit of the original data—has been proposed. In this note, we illustrate the utility of synthetic data generation methods for replication purposes using a range of methods from agricultural production analysis. More specifically, we compare input elasticities and technical efficiency scores based on different farm‐level production data between original data and synthetic data. We generate synthetic data using a non‐parametric method of classification and regression trees (CART) and parametric linear regressions. We find synthetic data result in elasticities and technical efficiency distributions that are very similar to the original data, especially when generated with CART, and conclude with implications for the research community.
... We used MI to create and analyze five multiply imputed datasets. Incomplete variables were imputed under fully conditional specification (Markov chain Monte Carlo method), with a maximum of ten iterations for each imputation, as recommended (Reiter & Raghunathan, 2007). Results from the five datasets were pooled. ...
Article
Full-text available
Background School absenteeism is associated with multiple negative short and long-term impacts, such as school grade retention and mental health difficulties. Objective The present study aimed to understand the role of resilience-related internal assets, student engagement, and perception of school success as protective factors for truancy. Additionally, we investigated whether there were differences in these variables between students living in residential care and students living with their parents. Methods This study included 118 participants aged 11 to 23 years old (M = 17.16, SE = 0.26). The majority were female (n = 61, 51.7 %) and Portuguese (n = 98, 83.1 %), with half living in residential care. In this cross-sectional study, participants responded to self-report questionnaires. Hierarchical regression analysis was used to understand the factors associated with truancy. Results There were no group differences in resilience-related internal assets and their perception of school success. On the contrary, participants in residential care reported more unexcused school absences, more grade retentions, higher levels of depression, and lower levels of student engagement. Moreover, hierarchical linear regression controlling for key variables (i.e., living in residential care or with parents, school grade retention, and depression) showed that perception of school success and resilience-related internal assets significantly contributed to truancy. Conclusions Results are discussed in the context of universal and selective interventions. These interventions can foster individual strengths and provide opportunities for every student to experience success. Consequently, they promote engagement and reduce the likelihood of school absences, especially for those in more vulnerable situations such as youth in residential care.
... We next imputed 25 complete data sets in Mplus where patterns of missing data can first be identified and then generated from Markov chain Monte Carlo (MCMC) simulation given there is 25% or more missing data (e.g., Larsen, 2011). The imputation model corresponds to the full mediation model with missing data and the resulting parameter estimates are combined according to the rules described in Rubin (1987; for alternatives, see Reiter & Raghunathan, 2007). We found that the results from the imputation model were largely consistent with the mediation model with missing data, thereby suggesting that the data were likely missing at random (MAR). ...
Article
Full-text available
Prominent disclosure models elucidate decisions to disclose health information, yet explanations for disclosure consequences remain underdeveloped. Drawing on Chaudoir and Fisher's disclosure process model, this study aims to advance understandings of how disclosure to a parent contributes to well-being for college students with mental illness. We tested a mediational model in which, at the within-person level, perceived support quality explains the association between on-going disclosure of mental illness-related experiences and well-being. Participants were 163 college students who self-identified as having mental illness and who completed six consecutive, weekly surveys. A multilevel analysis showed that increases in disclosures of mental illness-related experiences, relative to participants' mean level, were associated with enhanced well-being via perceptions of higher quality support, above and beyond between-person differences. This study contributes to the literature by offering an explanation for the effects of disclosure on well-being and underscores the importance of capturing disclosures over time.
... We average the within-impute design-based variance estimates of µ_m (via Taylor linearization) to get Ū. We then estimate the between-impute variance of µ_m to get B. The combined variance estimate is then T = (1 + 1/J)B + Ū. We then generate symmetric asymptotic intervals using the t-distribution (see Section 2.1.1 of Reiter and Raghunathan, 2007). ...
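The snippet above writes out the standard missing-data combining rule T = (1 + 1/J)B + Ū, which differs from the partial-synthesis rule sketched earlier. A minimal sketch of that computation, including the usual large-sample degrees-of-freedom formula for the t interval, might look as follows; rubin_combine is a hypothetical name and the code is illustrative, not taken from the cited work.

```python
import numpy as np
from scipy import stats


def rubin_combine(estimates, within_variances, alpha=0.05):
    """Standard Rubin combining rule for J multiply imputed datasets (illustrative sketch)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(within_variances, dtype=float)
    J = len(q)
    q_bar = q.mean()                      # combined point estimate
    u_bar = u.mean()                      # average within-imputation variance (U-bar)
    B = q.var(ddof=1)                     # between-imputation variance
    T = u_bar + (1 + 1 / J) * B           # total variance, as in the snippet above
    df = (J - 1) * (1 + u_bar / ((1 + 1 / J) * B)) ** 2
    half_width = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(T)
    return q_bar, T, (q_bar - half_width, q_bar + half_width)
```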
Preprint
Full-text available
Nonprobability (convenience) samples are increasingly sought to stabilize estimations for one or more population variables of interest that are performed using a randomized survey (reference) sample by increasing the effective sample size. Estimation of a population quantity derived from a convenience sample will typically result in bias since the distribution of variables of interest in the convenience sample is different from the population. A recent set of approaches estimates conditional (on sampling design predictors) inclusion probabilities for convenience sample units by specifying reference sample-weighted pseudo likelihoods. This paper introduces a novel approach that derives the propensity score for the observed sample as a function of conditional inclusion probabilities for the reference and convenience samples as our main result. Our approach allows specification of an exact likelihood for the observed sample. We construct a Bayesian hierarchical formulation that simultaneously estimates sample propensity scores and both conditional and reference sample inclusion probabilities for the convenience sample units. We compare our exact likelihood with the pseudo likelihoods in a Monte Carlo simulation study.
... For simplicity, assume that the data are completely observed (i.e., Y = Y_obs). Following the notation of Reiter and Raghunathan [23], let, given n observations, Z_i = 1 if any of the values of unit i = 1, 2, . . . , n are to be replaced by imputations, and Z_i = 0 otherwise, with Z = (Z_1, Z_2, . . . ...
Article
Full-text available
Synthetic datasets simultaneously allow for the dissemination of research data while protecting the privacy and confidentiality of respondents. Generating and analyzing synthetic datasets is straightforward, yet a synthetic data analysis pipeline is seldom adopted by applied researchers. We outline a simple procedure for generating and analyzing synthetic datasets with the multiple imputation software mice (Version 3.13.15) in R. We demonstrate through simulations that the analysis results obtained on synthetic data yield unbiased and valid inferences and lead to synthetic records that cannot be distinguished from the true data records. The ease of use when synthesizing data with mice along with the validity of inferences obtained through this procedure opens up a wealth of possibilities for data dissemination and further research on initially private data.
... Multiple imputation (MI) is a useful tool for dealing with missing data, given its attractive theoretical properties, its ability to handle any pattern of missing data, and the numerous computation platforms that are available in practice. Since the initial development by Rubin (1987), MI has been successfully applied in a variety of fields for missing data and more broadly to handle related problems such as measurement error, confidentiality protection, and finite population inference (Reiter and Raghunathan 2007;Van Buuren 2012;Carpenter and Kenward 2013). ...
Article
Multiple imputation (MI) is a popular and well-established method for handling missing data in multivariate data sets, but its practicality for use in massive and complex data sets has been questioned. One such data set is the Panel Study of Income Dynamics (PSID), a longstanding and extensive survey of household income and wealth in the United States. Missing data for this survey are currently handled using traditional hot deck methods because of the simple implementation; however, the univariate hot deck results in large random wealth fluctuations. MI is effective but faced with operational challenges. We use a sequential regression/chained-equation approach, using the software IVEware, to multiply impute cross-sectional wealth data in the 2013 PSID, and compare analyses of the resulting imputed data with those from the current hot deck approach. Practical difficulties, such as non-normally distributed variables, skip patterns, categorical variables with many levels, and multicollinearity, are described together with our approaches to overcoming them. We evaluate the imputation quality and validity with internal diagnostics and external benchmarking data. MI produces improvements over the existing hot deck approach by helping preserve correlation structures, such as the associations between PSID wealth components and the relationships between the household net worth and sociodemographic factors, and facilitates completed data analyses with general purposes. MI incorporates highly predictive covariates into imputation models and increases efficiency. We recommend the practical implementation of MI and expect greater gains when the fraction of missing information is large.
Article
We present methodology for creating synthetic data and an application to create a publicly releasable synthetic version of the Longitudinal Aging Study in India (LASI). The LASI, a health and retirement survey, is used for research and educational purposes, but it can only be shared under restricted access due to privacy considerations. We present novel methods to synthesize the survey, maintaining three nested levels of observation—individuals, couples, and households—with both continuous and categorical variables and survey weights. We show that the synthetic data maintains the distributional patterns of the confidential data and largely mitigates identification and attribute disclosure risk. We also present a novel method for controlling the risk and utility tradeoff for the synthetic data that take into account the survey sampling rates. Specifically, we down-weight records that have a high likelihood of being uniquely identifiable in the population due to unique demographic information and oversampling. We show this approach reduces both identification and attribute risk for records while preserving better utility over another common approach of coarsening records. Our methods and evaluations provide a foundation for creating a synthetic version of surveys with multiple units of observation, such as the LASI.
Chapter
We present an overview of the evolution of single imputation synthetic datasets in Statistical Disclosure Control (SDC). Imputation is a widely used technique for generating privacy-preserving synthetic data that allows for the release of statistical information while protecting individual privacy. We focus on the evolution of techniques for generating a single dataset for release, and on the evolution of the methods that allow its analysis. While the review does not delve into specific practical aspects or implementations, it provides a comprehensive understanding of the evolution in the generation and analysis of single imputation synthetic datasets. By synthesizing the existing literature, the paper aims to contribute to the knowledge base in SDC and assist researchers and practitioners in making informed decisions regarding the generation and analysis of synthetic datasets for statistical purposes.
Article
We study statistical inference procedures in coarsened time series through the generalized method of moments. A new model for the coarsened time series via multiple potential outcomes is proposed. It can be naturally extended for inferring multi‐variate coarsened time series. We show that this framework generates a general class of estimators. It neatly generalizes the classical Horvitz–Thompson estimator for handling coarsened time series data. Asymptotic properties, including consistency and limiting distribution, of the proposed estimators are investigated. Estimators of the optimal weight matrix and the long‐run covariance matrix are also derived. In particular, confidence intervals of the mean function of the potential outcome as a function of coarsening index can be constructed. A real‐data application on air quality in the USA is investigated.
Article
Full-text available
Nonprobability (convenience) samples are increasingly sought to reduce the estimation variance for one or more population variables of interest that are estimated using a randomized survey (reference) sample by increasing the effective sample size. Estimation of a population quantity derived from a convenience sample will typically result in bias since the distribution of variables of interest in the convenience sample is different from the population distribution. A recent set of approaches estimates inclusion probabilities for convenience sample units by specifying reference sample-weighted pseudo likelihoods. This paper introduces a novel approach that derives the propensity score for the observed sample as a function of inclusion probabilities for the reference and convenience samples as our main result. Our approach allows specification of a likelihood directly for the observed sample as opposed to the approximate or pseudo likelihood. We construct a Bayesian hierarchical formulation that simultaneously estimates sample propensity scores and the convenience sample inclusion probabilities. We use a Monte Carlo simulation study to compare our likelihood based results with the pseudo likelihood based approaches considered in the literature.
Article
Synthetic data generation is a powerful tool for privacy protection when considering public release of record‐level data files. Initially proposed about three decades ago, it has generated significant research and application interest. To meet the pressing demand of data privacy protection in a variety of contexts, the field needs more researchers and practitioners. This review provides a comprehensive introduction to synthetic data, including technical details of their generation and evaluation. Our review also addresses the challenges and limitations of synthetic data, discusses practical applications, and provides thoughts for future work.
Article
We propose two synthetic microdata approaches to generate private tabular survey data products for public release. We adapt a pseudo posterior mechanism that downweights by-record likelihood contributions with weights ∈[0,1] based on their identification disclosure risks to producing tabular products for survey data. Our method applied to an observed survey database achieves an asymptotic global probabilistic differential privacy guarantee. Our two approaches synthesize the observed sample distribution of the outcome and survey weights, jointly, such that both quantities together possess a privacy guarantee. The privacy-protected outcome and survey weights are used to construct tabular cell estimates (where the cell inclusion indicators are treated as known and public) and associated standard errors to correct for survey sampling bias. Through a real data application to the Survey of Doctorate Recipients public use file and simulation studies motivated by the application, we demonstrate that our two microdata synthesis approaches to construct tabular products provide superior utility preservation as compared to the additive noise approach of the Laplace Mechanism. Moreover, our approaches allow the release of microdata to the public, enabling additional analyses at no extra privacy cost.
Chapter
An essential precondition for Artificial Intelligence applications in health is the possibility of having recourse to Big Data. The inconvenient truth, however, is that ever-growing quantities of biomedical data sit idle in fragmented silos, and their use is hindered, among other things, by the need to act in compliance with the General Data Protection Regulation. A recent H2020 Research and Innovation project, MyHealthMyData, highlighted two privacy-enhancing technologies which can provide new solutions for data sharing in health. The first is based on bringing the algorithms to the data in what is called the “visiting mode”, based on secure computation mechanisms, by which only the outcomes are released, not the original data. The second approach leverages the capacity of producing synthetic data through generative adversarial neural networks, adding a further non-reidentification guarantee by making use also of differential privacy. Synthetic data generation and secure computation may thus complement the GDPR by developing an ecosystem in which health data can safely provide the Big Data needed to fully make use of AI in medicine. Keywords: Artificial Intelligence; Big Data; Anonymisation; Pseudonymisation; Data Sharing; Personalised Medicine; Secure Computation; Synthetic Data; Differential Privacy
Article
Full-text available
There is a significant public demand for rapid data-driven scientific investigations using aggregated sensitive information. However, many technical challenges and regulatory policies hinder efficient data sharing. In this study, we describe a partially synthetic data generation technique for creating anonymized data archives whose joint distributions closely resemble those of the original (sensitive) data. Specifically, we introduce the DataSifter technique for time-varying correlated data (DataSifter II), which relies on an iterative model-based imputation using generalized linear mixed model and random effects-expectation maximization tree. DataSifter II can be used to generate synthetic repeated measures data for testing and validating new analytical techniques. Compared to the multiple imputation method, DataSifter II application on simulated and real clinical data demonstrates that the new method provides extensive reduction of re-identification risk (data privacy) while preserving the analytical value (data utility) in the obfuscated data. The performance of the DataSifter II on a simulation involving 20% artificial missingness in the data shows at least 80% reduction of the disclosure risk, compared to the multiple imputation method, without a substantial impact on the data analytical value. In a separate clinical data (Medical Information Mart for Intensive Care III) validation, a model-based statistical inference drawn from the original data agrees with an analogous analytical inference obtained using the DataSifter II obfuscated (sifted) data. For large time-varying datasets containing sensitive information, the proposed technique provides an automated tool for alleviating the barriers of data sharing and facilitating effective, advanced, and collaborative analytics.
Article
We consider settings where an analyst of multiply imputed data desires an integer-valued point estimate and an associated interval estimate, for example, a count of the number of individuals with certain characteristics in a population. Even when the point estimate in each completed dataset is an integer, the multiple imputation point estimator, i.e., the average of these completed-data estimators, is not guaranteed to be an integer. One natural approach is to round the standard multiple imputation point estimator to an integer. Another seemingly natural approach is to use the median of the completed-data point estimates (when they are integers). However, these two approaches have not been compared; indeed, methods for obtaining multiple imputation inferences associated with the median of the completed-data point estimates do not even exist. In this article, we evaluate and compare these two approaches. In doing so, we derive an estimator of the variance of the median-based multiple imputation point estimator, as well as a method for obtaining associated multiple imputation confidence intervals. Using simulation studies, we show that both methods can offer well-calibrated coverage rates and have similar repeated sampling properties, and hence are both useful for this analysis task.
Article
Full-text available
Conventional multiple imputation (MI) (Rubin, 1987) replaces the missing values in a dataset by m > 1 sets of simulated values. We describe a two-stage extension of MI in which the missing values are partitioned into two groups and imputed N = mn times in a nested fashion. Two-stage MI divides the missing information into two components of variability, lending insight when the missing values are of two qualitatively different types. It also opens new possibilities for making different assumptions about the mechanisms producing the two kinds of missing values. Point estimates and standard errors from the N complete-data analyses are consolidated by simple rules derived by analogy to nested analysis of variance. After reviewing the theory and practice of two-stage MI, we illustrate the method with a simple analysis of binary variables from a longitudinal survey.
Article
Full-text available
This article describes and evaluates a procedure for imputing missing values for a relatively complex data structure when the data are missing at random. The imputations are obtained by fitting a sequence of regression models and drawing values from the corresponding predictive distributions. The types of regression models used are linear, logistic, Poisson, generalized logit or a mixture of these depending on the type of variable being imputed. Two additional common features in the imputation process are incorporated: restriction to a relevant subpopulation for some variables and logical bounds or constraints for the imputed values. The restrictions involve subsetting the sample individuals that satisfy certain criteria while fitting the regression models. The bounds involve drawing values from a truncated predictive distribution. The development of this method was partly motivated by the analysis of two data sets which are used as illustrations. The sequential regression procedure is applied to perform multiple imputation analysis for the two applied problems. The sampling properties of inferences from multiply imputed data sets created using the sequential regression method are evaluated through simulated data sets.
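To make the sequential regression idea concrete, the stripped-down sketch below cycles linear models over the incomplete columns of a data frame and draws imputations from an approximate predictive distribution. It assumes continuous variables only and omits the restrictions, bounds, parameter draws, and the variety of model types described above; names such as sequential_regression_impute are ours, not code from the cited work.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def sequential_regression_impute(df, n_iter=10, seed=None):
    """One simplified chained-equations pass over the incomplete columns (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    df = df.copy()
    missing = {c: df[c].isna() for c in df.columns if df[c].isna().any()}
    for c in missing:                          # crude starting values: column means
        df[c] = df[c].fillna(df[c].mean())
    for _ in range(n_iter):
        for c, mask in missing.items():
            X = df.drop(columns=[c])
            model = LinearRegression().fit(X[~mask], df.loc[~mask, c])
            resid_sd = np.std(df.loc[~mask, c] - model.predict(X[~mask]))
            # draw from an approximate predictive distribution (ignores coefficient
            # uncertainty, so this is not yet a "proper" imputation in Rubin's sense)
            df.loc[mask, c] = model.predict(X[mask]) + rng.normal(0.0, resid_sd, mask.sum())
    return df
```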
Article
Full-text available
Recent developments in record linkage technology together with vast increases in the amount of personally identified information available in machine readable form raise serious concerns about the future of public use datasets. One possibility raised by Rubin [1993] is to release only simulated data created by multiple imputation techniques using the actual data. This paper uses the multiple imputation software developed for the Survey of Consumer Finances (Kennickell [1991]) to develop a series of experimental simulated versions of the 1995 survey data.
Article
Full-text available
We present a method of analyzing a series of independent cross-sectional surveys in which some questions are not answered in some surveys and some respondents do not answer some of the questions posed. The method is also applicable to a single survey in which different questions are asked or different sampling methods are used in different strata or clusters. Our method involves multiply imputing the missing items and questions by adding, to existing methods of imputation designed for single surveys, a hierarchical regression that allows covariates at the individual and survey levels. Information from survey weights is exploited by including in the analysis the variables on which the weights are based, and then reweighting individual responses (observed and imputed) to estimate population quantities. We also develop diagnostics for checking the fit of the imputation model based on comparing imputed data to nonimputed data. We illustrate with the example that motivated this project: a study of pre-election public opinion polls in which not all the questions of interest are asked in all the surveys, so that it is infeasible to impute within each survey separately.
Article
The Survey of Consumer Finances (SCF) focuses intensely on the details of households' finances. Owing to the perceived sensitivity of this topic to some people and the difficulty of answering some questions, unit and item nonresponse rates in the SCF are substantial. The FRITZ Multiple imputation (MI) routine developed for the SCF has provided a means of providing a public data set that is more informative overall than anything that could be constructed with the data available to the public while also providing a more honest picture of the limits of our knowledge about the missing data. MI also plays a key role in the SCF in disclosure limitation as a tool for a limited form of data simulation in the public version of the data. This paper reviews the implementation of MI for the SCF and provides some empirical evidence on the performance of the FRITZ system.
Article
A multiple-imputation method is developed for analyzing data from an observational study where some covariate values are not observed. A hybrid approach is presented where the imputations are created under a Bayesian model involving an extended set of variables, although the ultimate analysis may be based on a regression model with a smaller set of variables. The imputations are the random draws from the posterior predictive distribution of the missing values, given the observed values. Gibbs sampling under an extension of the Olkin-Tate general location-scale model is used for the imputation. The method proposed is used to analyze data from a population-based case-control study investigating the association between drug therapy and primary cardiac arrest among pharmacologically treated hypertensives. The sensitivity of the inference to the assumptions about the mechanism for the missing data is explored by creating imputations under several non-ignorable mechanisms for missing data. The sampling properties of the estimates from the hybrid multiple-imputation approach are compared with those based on the complete data and maximum likelihood approaches through simulated data sets. This comparison suggests that much efficiency can be gained through the hybrid approach. Also, the multiple-imputation approach seems to be fairly robust to departures from the assumed normality unless the actual distribution of the continuous covariates is very skew.
Article
The National Health Interview Survey (NHIS) provides a rich source of data for studying relationships between income and health and for monitoring health and health care for persons at different income levels. However, the nonresponse rates are high for two key items, total family income in the previous calendar year and personal earnings from employment in the previous calendar year. To handle the missing data on family income and personal earnings in the NHIS, multiple imputation of these items, along with employment status and ratio of family income to the federal poverty threshold (derived from the imputed values of family income), has been performed for the survey years 1997-2004. (There are plans to continue this work for years beyond 2004 as well.) Files of the imputed values, as well as documentation, are available at the NHIS website (http://www.cdc.gov/nchs/nhis.htm). This article describes the approach used in the multiple-imputation project and evaluates the methods through analyses of the multiply imputed data. The analyses suggest that imputation corrects for biases that occur in estimates based on the data without imputation, and that multiple imputation results in gains in efficiency as well.
Article
Multiple imputation was designed to handle the problem of missing data in public-use data bases where the data-base constructor and the ultimate user are distinct entities. The objective is valid frequency inference for ultimate users who in general have access only to complete-data software and possess limited knowledge of specific reasons and models for nonresponse. For this situation and objective, I believe that multiple imputation by the data-base constructor is the method of choice. This article first provides a description of the assumed context and objectives, and second, reviews the multiple imputation framework and its standard results. These preliminary discussions are especially important because some recent commentaries on multiple imputation have reflected either misunderstandings of the practical objectives of multiple imputation or misunderstandings of fundamental theoretical results. Then, criticisms of multiple imputation are considered, and, finally, comparisons are made to alternative strategies.
Article
Multiple imputation is applied to a demographic data set with coarse age measurements for Tanzanian children. The heaped ages are multiply imputed with plausible true ages using (a) a simple naive model and (b) a new, relatively complex model that relates true age to the observed values of heaped age, sex, and anthropometric variables. The imputed true ages are used to create valid inferences under the models and compare inferences across models, thereby revealing sensitivity of inferences to prior specifications, from naive to complex. In addition, diagnostic analyses applied to the imputed data are used to suggest which models appear most appropriate. Because it is not clear just what set of heaping intervals should be used, the models are applied under various assumptions about the heaping: rounding (to the nearest year or half year) versus a combination of rounding and truncation as practiced in the United States, and medium versus wide heaping interval sizes. The most striking conclusions are the following: (a) inferences are very sensitive to the assumption of strict rounding versus rounding combined with truncation, yet judging from the diagnostics, the data cannot distinguish between such models; and (b) the diagnostics consistently favor the new, more complex model, which, although theoretically more satisfactory, can lead to inferences very similar to those obtained with the naive model. It is concluded that knowledge of the interval widths and heaping process sharpens valid inferences from data of this kind, and that given a specified process, simple and easily programmed multiple-imputation methods can lead to valid inferences.
Article
We describe methods used to create a new Census data base that can be used to study comparability of industry and occupation classification systems. This project represents the most extensive application of multiple imputation to date, and the modeling effort was considerable as well—hundreds of logistic regressions were estimated. One goal of this article is to summarize the strategies used in the project so that researchers can better understand how the new data bases were created. Another goal is to show how modifications of maximum likelihood methods were made for the modeling and imputation phases of the project. To multiply-impute 1980 census-comparable codes for industries and occupations in two 1970 census public-use samples, logistic regression models were estimated with flattening constants. For many of the regression models considered, the data were too sparse to support conventional maximum likelihood analysis, so some alternative had to be employed. These methods solve existence and related computational problems often encountered with maximum likelihood methods. Inferences pertaining to effects of predictor variables and inferences regarding predictions from logit models are also more satisfactory. The Bayesian strategy used in this project can be applied in other sparse-data settings where logistic regression is used because the approach can be implemented easily with any standard computer program for logit regression or log-linear analysis.
Article
We present a procedure for computing significance levels from data sets whose missing values have been multiply imputed. This procedure uses moment-based statistics, m ≤ 3 repeated imputations, and an F reference distribution. When m = ∞, we show first that our procedure is essentially the same as the ideal procedure in cases of practical importance and, second, that its deviations from the ideal are basically a function of the coefficient of variation of the canonical ratios of complete to observed information. For small m our procedure's performance is largely governed by this coefficient of variation and the mean of these ratios. Using simulation techniques with small m, we compare our procedure's actual and nominal large-sample significance levels and conclude that it is essentially calibrated and thus represents a definite improvement over previously available procedures. Furthermore, we compare the large-sample power of the procedure as a function of m and other factors, such as the dimensionality of the estimand and fraction of missing information, to provide guidance on the choice of the number of imputations; generally, we find the loss of power due to small m to be quite modest in cases likely to occur in practice.
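For reference, moment-based procedures of this type are usually presented along the following lines for a k-dimensional estimand Q with null value Q_0, m imputations, average within-imputation covariance Ū_m, and between-imputation covariance B_m; this is a sketch drawn from the general multiple imputation literature and should be checked against the article for the exact constants.

```latex
% Sketch of a moment-based combined Wald statistic with an F reference (verify against the article).
\begin{align*}
r_m &= \Bigl(1+\frac{1}{m}\Bigr)\,\mathrm{tr}\!\bigl(B_m \bar{U}_m^{-1}\bigr)/k, \\
D_m &= \frac{(\bar{Q}_m - Q_0)'\,\bar{U}_m^{-1}\,(\bar{Q}_m - Q_0)}{k\,(1+r_m)}.
\end{align*}
% D_m is referred to an F distribution with k numerator degrees of freedom and, for
% t = k(m-1) > 4, denominator degrees of freedom \nu = 4 + (t-4)[1 + (1 - 2/t)\,r_m^{-1}]^2.
```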
Article
Multiple imputation is becoming a standard tool for handling nonresponse in sample surveys. A difficult problem in the analysis of a multiply- imputed data set concerns how to combine repeated p-values efficiently to create a valid significance level. Here we propose, justify, and evaluate the validity of a new procedure, which is superior to the current standard. This problem is inherently difficult when the number of multiple imputations is small, as it must be in common practice, as made clear by its close relationship to a multivariate version of the classic Behrens-Fisher problem with small degrees of freedom.
Article
Existing procedures for obtaining significance levels from multiply- imputed data either (i) require access to the completed-data point estimates and variance-covariance matrices, which may not be available in practice when the dimensionality of the estimand is high, or (ii) directly combine p-values with less satisfactory results. Taking advantage of the well-known relationship between the Wald and log likelihood ratio test statistics, we propose a complete-data log likelihood ratio based procedure. It is shown that, for any number of multiple imputations, the proposed procedure is equivalent in large samples to the existing procedure based on the point estimates and the variance-covariance matrices, yet it only requires the point estimates and evaluations of the complete-data log likelihood ratio statistic as a function of these estimates and the completed data. The proposed procedure, therefore, is especially attractive with highly multiparameter incomplete-data problems since it does not involve the computation of any matrices.
Article
To avoid disclosures, Rubin proposed creating multiple, synthetic data sets for public release so that (i) no unit in the released data has sensitive data from an actual unit in the population, and (ii) statistical procedures that are valid for the original data are valid for the released data. In this article, I show through simulation studies that valid inferences can be obtained from synthetic data in a variety of settings, including simple random sampling, probability proportional to size sampling, two-stage cluster sampling, and stratified sampling. I also provide guidance on specifying the number and size of synthetic data sets and demonstrate the benefit of including design variables in the released data sets.
Article
To limit disclosure risks, one approach is to release partially synthetic, public use microdata sets. These comprise the units originally surveyed, but some collected values, for example sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple imputations. This article presents and evaluates the use of classification and regression trees to generate partially synthetic data. Two potential applications of CART are studied via simulation: (i) generate synthetic data for sensitive variables; and, (ii) generate synthetic data for variables that are key identifiers.
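A minimal sketch of CART-based partial synthesis for one continuous variable is given below: fit a regression tree for the variable given the other columns, then replace each record's value with a draw from the observed values in the same leaf. The use of scikit-learn, the simple within-leaf bootstrap (in place of the Bayesian bootstrap often recommended), and the name cart_synthesize are our illustrative choices, not the article's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def cart_synthesize(df, target, predictors, min_samples_leaf=5, seed=None):
    """Replace `target` with synthetic values drawn from CART leaves (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf)
    tree.fit(df[predictors], df[target])
    leaf_ids = tree.apply(df[predictors])          # leaf index for every record
    observed = df[target].to_numpy()
    synthetic = observed.copy()
    for leaf in np.unique(leaf_ids):
        idx = np.flatnonzero(leaf_ids == leaf)
        # draw replacement values, with replacement, from the observed values in this leaf
        synthetic[idx] = rng.choice(observed[idx], size=idx.size, replace=True)
    out = df.copy()
    out[target] = synthetic
    return out
```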
Article
Missing data is a major issue in many applied problems, especially in the biomedical sciences. We review four common approaches for inference in generalized linear models (GLMs) with missing covariate data: maximum likelihood (ML), multiple imputation (MI), fully Bayesian (FB), and weighted estimating equations (WEEs). There is considerable interest in how these four methodologies are related, the properties of each approach, the advantages and disadvantages of each methodology, and computational implementation. We examine data that are missing at random and nonignorable missing. For ML we focus on techniques using the EM algorithm, and in particular, discuss the EM by the method of weights and related procedures as discussed by Ibrahim. For MI, we examine the techniques developed by Rubin. For FB, we review approaches considered by Ibrahim et al. For WEE, we focus on the techniques developed by Robins et al. We use a real dataset and a detailed simulation study to compare the four methods.
Article
The Fatal Accident Reporting System (FARS) is a database collected for the US National Highway Traffic Safety Administration (NHTSA) at the site of all fatal traffic accidents. Variables include location and time of accident, number and position of vehicles, age, sex and driving record of the driver, seat-belt use and blood alcohol content of the driver. The last two variables are of great interest but have substantial proportions of missing data. The NHTSA is interested in a method of imputation that allows appropriate estimates and standard errors to be computed from the filled-in data. This paper explores the use of multiple imputation based on predictive mean matching as a means of achieving these goals. Two specific methods are described and applied to a sample of the FARS data. A simulation study compares the frequency properties of the methods.
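To make the predictive mean matching idea concrete, the bare-bones sketch below regresses the incomplete variable on complete covariates and, for each missing case, donates the observed value whose predicted mean is among the k closest to that case's prediction. It produces a single imputation and, unlike a proper multiple imputation, does not redraw the regression coefficients; the function name pmm_impute is ours.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def pmm_impute(X, y, k=5, seed=None):
    """Single predictive-mean-matching imputation of missing entries in y (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float).copy()
    missing = np.isnan(y)
    model = LinearRegression().fit(X[~missing], y[~missing])
    pred_obs = model.predict(X[~missing])          # predicted means for observed cases
    pred_mis = model.predict(X[missing])           # predicted means for missing cases
    y_obs = y[~missing]
    donations = []
    for p in pred_mis:
        nearest = np.argsort(np.abs(pred_obs - p))[:k]   # k closest observed predictions
        donations.append(y_obs[rng.choice(nearest)])     # donate one observed value at random
    y[missing] = donations
    return y
```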
Book
Contents: Introduction; Assumptions; EM and Inference by Data Augmentation; Methods for Normal Data; More on the Normal Model; Methods for Categorical Data; Loglinear Models; Methods for Mixed Data; Further Topics; Appendices; References; Index
Article
The multiple-matrix item sampling designs that provide information about population characteristics most efficiently administer too few responses to students to estimate their proficiencies individually. Marginal estimation procedures, which estimate population characteristics directly from item responses, must be employed to realize the benefits of such a sampling design. Numerical approximations of the appropriate marginal estimation procedures for a broad variety of analyses can be obtained by constructing, from the results of a comprehensive extensive marginal solution, files of plausible values of student proficiencies. This article develops the concepts behind plausible values in a simplified setting, sketches their use in the National Assessment of Educational Progress (NAEP), and illustrates the approach with data from the Scholastic Aptitude Test (SAT).
Article
This paper describes a method for estimating disease–exposure odds ratios in a case–control study where information on the exposure variable is available from several, possibly imperfect, sources. A hybrid approach is developed where a Bayesian perspective is used in combining information from multiple sources, although the ultimate analysis of the disease–exposure association is likelihood based and incorporates the design considerations from a frequentist perspective, namely matching cases and controls on the basis of certain characteristics. The basic analytical strategy involves using Gibbs sampling to draw several sets of actual exposure variables at random from their posterior distribution, conditional on the exposure ascertainment from several sources and other pertinent variables. Each set of drawn values of the actual exposure variable and the confounding variables are used as independent variables in a conditional logistic regression model with case–control status as the dependent variable. The resulting point estimates and their covariance matrices are then combined. This method is applied to a population-based case–control study of the risk of primary cardiac arrest and the intake of n-3 polyunsaturated fatty acids derived mainly from fish and seafood, which motivated this research. This hybrid strategy was developed for pragmatic reasons as these data will be used for several analyses from differing perspectives by different analysts. Hence, this paper also reports an evaluation from a frequentist perspective that investigates the sampling properties of estimates so derived through a simulation study that is similar in many respects to the actual data set analysed. These results show that the estimate of the log-odds ratio obtained by using the method described in this paper is better in terms of bias, the mean-square error and the confidence coverage when compared with the estimate obtained by using only one of the several sources as the exposure variable.
Article
Multiple imputation can handle missing data and disclosure limitation simultaneously. First, fill in the missing data to generate m completed datasets, then replace confidential values in each completed dataset with r imputations. I investigate how to select m and r.
Article
To limit the risks of disclosures when releasing data to the public, it has been suggested that statistical agencies release multiply imputed, synthetic microdata. For example, the released microdata can be fully synthetic, comprising random samples of units from the sampling frame with simulated values of variables. Or, the released microdata can be partially synthetic, comprising the units originally surveyed with some collected values, e.g. sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. This article presents inferential methods for synthetic data for multi-component estimands, in particular procedures for Wald and likelihood ratio tests. The performance of the procedures is illustrated with simulation studies.
Article
An appealing feature of multiple imputation is the simplicity of the rules for combining the multiple complete-data inferences into a final inference, the repeated-imputation inference. This inference is based on a t distribution and is derived from a Bayesian paradigm under the assumption that the complete-data degrees of freedom, ν_com, are infinite, but the number of imputations, m, is finite. When ν_com is small and there is only a modest proportion of missing data, the calculated repeated-imputation degrees of freedom, ν_m, for the t reference distribution can be much larger than ν_com, which is clearly inappropriate. Following the Bayesian paradigm, we derive an adjusted degrees of freedom, ν̃_m, with the following three properties: for fixed m and estimated fraction of missing information, ν̃_m monotonically increases in ν_com; ν̃_m is always less than or equal to ν_com; and ν̃_m equals ν_m when ν_com is infinite. A small simulation study demonstrates the superior frequentist performance when using ν̃_m rather than ν_m.
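The adjusted degrees of freedom described here is commonly written in the following form, with B_m the between-imputation variance, T_m the total variance, and ν_com the complete-data degrees of freedom; this is the version implemented in standard MI software, given as a sketch to be checked against the article.

```latex
% Sketch of the small-sample adjusted degrees of freedom (verify against the article).
\begin{align*}
\hat{\gamma}_m &= \frac{(1+m^{-1})\,B_m}{T_m}, &
\nu_m &= \frac{m-1}{\hat{\gamma}_m^{2}}, \\
\hat{\nu}_{\mathrm{obs}} &= \frac{\nu_{\mathrm{com}}+1}{\nu_{\mathrm{com}}+3}\,\nu_{\mathrm{com}}\,(1-\hat{\gamma}_m), &
\tilde{\nu}_m &= \Bigl(\frac{1}{\nu_m}+\frac{1}{\hat{\nu}_{\mathrm{obs}}}\Bigr)^{-1}.
\end{align*}
```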
Conference Paper
Statistical agencies alter values of identifiers to protect respondents' confidentiality. When these identifiers are survey design variables, leaving the original survey weights on the file can be a disclosure risk. Additionally, the original weights may not correspond to the altered values, which impacts the quality of design-based (weighted) inferences. In this paper, we discuss some strategies for altering survey weights when altering design variables. We do so in the context of simulating identifiers from probability distributions, i.e. partially synthetic data. Using simulation studies, we illustrate aspects of the quality of inferences based on the different strategies.
Conference Paper
This paper describes ongoing research to protect confidentiality in longitudinal linked data through creation of multiply-imputed, partially synthetic data. We present two enhancements to the methods of [2]. The first is designed to preserve marginal distributions in the partially synthetic data. The second is designed to protect confidential links between sampling frames.
Article
Contents: Introduction; General Conditions for the Randomization-Validity of Infinite-m Repeated-Imputation Inferences; Examples of Proper and Improper Imputation Methods in a Simple Case with Ignorable Nonresponse; Further Discussion of Proper Imputation Methods; The Asymptotic Distribution of (Q̄_m, Ū_m, B_m) for Proper Imputation Methods; Evaluations of Finite-m Inferences with Scalar Estimands; Evaluation of Significance Levels from the Moment-Based Statistics D_m and Δ_m with Multicomponent Estimands; Evaluation of Significance Levels Based on Repeated Significance Levels
Article
Conducting sample surveys, imputing incomplete observations, and analyzing the resulting data are three indispensable phases of modern practice with public-use data files and with many other statistical applications. Each phase inherits different input, including the information preceding it and the intellectual assessments available, and aims to provide output that is one step closer to arriving at statistical inferences with scientific relevance. However, the role of the imputation phase has often been viewed as merely providing computational convenience for users of data. Although facilitating computation is very important, such a viewpoint ignores the imputer's assessments and information inaccessible to the users. This view underlies the recent controversy over the validity of multiple-imputation inference when a procedure for analyzing multiply imputed data sets cannot be derived from (is "uncongenial" to) the model adopted for multiple imputation. Given sensible imputations and complete-data analysis procedures, inferences from standard multiple-imputation combining rules are typically superior to, and thus different from, users' incomplete-data analyses. The latter may suffer from serious nonresponse biases because such analyses often must rely on convenient but unrealistic assumptions about the nonresponse mechanism. When it is desirable to conduct inferences under models for nonresponse other than the original imputation model, a possible alternative to recreating imputations is to incorporate appropriate importance weights into the standard combining rules. These points are reviewed and explored by simple examples and general theory, from both Bayesian and frequentist perspectives, particularly from the randomization perspective. Some convenient terms are suggested for facilitating communication among researchers from different perspectives when evaluating multiple-imputation inferences with uncongenial sources of input.
Article
We consider the asymptotic behaviour of various parametric multiple imputation procedures which include but are not restricted to the ‘proper’ imputation procedures proposed by Rubin (1978). The asymptotic variance structure of the resulting estimators is provided. This result is used to compare the relative efficiencies of different imputation procedures. It also provides a basis to understand the behaviour of two Monte Carlo iterative estimators, stochastic EM (Celeux & Diebolt, 1985; Wei & Tanner, 1990) and simulated EM (Ruud, 1991). We further develop properties of these estimators when they stop at iteration K with imputation size m . An application to a measurement error problem is used to illustrate the results.
Article
We derive an estimator of the asymptotic variance of both single and multiple imputation estimators. We assume a parametric imputation model but allow for non- and semiparametric analysis models. Our variance estimator, in contrast to the estimator proposed by Rubin (1987), is consistent even when the imputation and analysis models are misspecified and incompatible with one another.
Article
In this review paper, we discuss the theoretical background of multiple imputation, describe how to build an imputation model and how to create proper imputations. We also present the rules for making repeated imputation inferences. Three widely used multiple imputation methods, the propensity score method, the predictive model method and the Markov chain Monte Carlo (MCMC) method, are presented and discussed.
Article
Despite advances in public health practice and medical technology, the disparities in health among the various racial/ethnic and socioeconomic groups remain a concern, which has prompted the Department of Health and Human Services to designate the elimination of disparities in health as an overarching goal of Healthy People 2010. To assess the progress towards this goal, suitable measures are needed at the population level that can be tracked over time; statistical inferential procedures have to be developed for these population level measures; and the data sources have to be identified to allow for such inferences to be conducted. Popular data sources for health disparities research are large surveys such as the National Health Interview Survey (NHIS) or the Behavior Risk Factor Surveillance System (BRFSS). The self-report disease status collected in these surveys may be inaccurate and the errors may be correlated with variables used in defining the groups. This article uses the National Health and Nutritional Examination Survey (NHANES) 99-00 to assess the extent of error in the self-report disease status; uses a Bayesian framework to develop corrections for the self-report disease status in the National Health Interview Survey (NHIS) 99-00; and compares inferences about various measures of health disparities, with and without correcting for measurement error. The methodology is illustrated using the disease outcome hypertension, a common risk factor for cardiovascular disease.
Article
Multiple imputation is a model based technique for handling missing data problems. In this application we use the technique to estimate the distribution of times from HIV seroconversion to AIDS diagnosis with data from a cohort study of 4954 homosexual men with 4 years of follow-up. In this example the missing data are the dates of diagnosis with AIDS. The imputation procedure is performed in two stages. In the first stage, we estimate the residual AIDS-free time distribution as a function of covariates measured on the study participants with data provided by the participants who were seropositive at study entry. Specifically, we assume the residual AIDS-free times follow a log-normal regression model that depends on the covariates measured at enrolment on the seropositive participants. In the second stage we impute the date of AIDS diagnosis for the participants who seroconverted during the course of the study and are AIDS-free with use of the log-normal distribution estimated in the first stage and the covariates from each seroconverter's latest visit. The estimated proportions developing AIDS within 4 and within 7 years of seroconversion are 15 and 36 per cent respectively, with associated 95 per cent confidence intervals of (10, 21) and (26, 47) per cent. We discuss the Bayesian foundations of the multiple imputation technique and the statistical and scientific assumptions.
Article
Rubin's multiple imputation is a three-step method for handling complex missing data, or more generally, incomplete-data problems, which arise frequently in medical studies. At the first step, m (> 1) completed-data sets are created by imputing the unobserved data m times using m independent draws from an imputation model, which is constructed to reasonably approximate the true distributional relationship between the unobserved data and the available information, and thus reduce potentially very serious nonresponse bias due to systematic difference between the observed data and the unobserved ones. At the second step, m complete-data analyses are performed by treating each completed-data set as a real complete-data set, and thus standard complete-data procedures and software can be utilized directly. At the third step, the results from the m complete-data analyses are combined in a simple, appropriate way to obtain the so-called repeated-imputation inference, which properly takes into account the uncertainty in the imputed values. This paper reviews three applications of Rubin's method that are directly relevant for medical studies. The first is about estimating the reporting delay in acquired immune deficiency syndrome (AIDS) surveillance systems for the purpose of estimating survival time after AIDS diagnosis. The second focuses on the issue of missing data and noncompliance in randomized experiments, where a school choice experiment is used as an illustration. The third looks at handling nonresponse in United States National Health and Nutrition Examination Surveys (NHANES). The emphasis of our review is on the building of imputation models (i.e. the first step), which is the most fundamental aspect of the method.
Article
We propose a general semiparametric method based on multiple imputation for Cox regression with interval-censored data. The method consists of iterating the following two steps. First, from the finite-interval-censored (but not right-censored) observations, exact failure times are imputed using Tanner and Wei's poor man's or asymptotic normal data augmentation scheme, based on the current estimates of the regression coefficient and the baseline survival curve. Second, a standard statistical procedure for right-censored data, such as the Cox partial likelihood method, is applied to the imputed data to update the estimates. Through simulation, we demonstrate that the resulting estimate of the regression coefficient and its associated standard error provide a promising alternative to the nonparametric maximum likelihood estimate. Our proposal is easily implemented by taking advantage of existing computer programs for right-censored data.
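The impute-then-refit iteration can be sketched as follows. This sketch assumes the lifelines package, a data frame with hypothetical columns L and R for the interval bounds (R = inf for right-censored rows, L > 0) and a single covariate x, and it draws each exact failure time from the currently estimated conditional survival within the interval; it is written in the spirit of the scheme described above, not as a reproduction of it.

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    def impute_exact_times(L, R, x, grid, s0, beta, rng):
        # Draw an exact failure time inside each finite interval (L, R] from the
        # current estimate of S(t | x) = S0(t) ** exp(beta * x); rows with R = inf
        # are treated as right-censored at L.
        event = np.isfinite(R)
        times = np.where(event, R, L).astype(float)
        for i in np.flatnonzero(event):
            in_int = (grid > L[i]) & (grid <= R[i])
            if not in_int.any():
                continue                              # keep the right endpoint
            s = s0 ** np.exp(beta * x[i])             # subject-specific survival
            j = np.searchsorted(grid, L[i], side="right") - 1
            s_lo = s[j] if j >= 0 else 1.0
            mass = np.clip(-np.diff(np.concatenate(([s_lo], s[in_int]))), 0.0, None)
            if mass.sum() > 0:
                times[i] = rng.choice(grid[in_int], p=mass / mass.sum())
        return times, event.astype(int)

    def mi_cox_interval(df, n_iter=5, seed=0):
        # Alternate between imputing exact times and refitting the Cox model,
        # starting from interval midpoints.
        rng = np.random.default_rng(seed)
        finite = np.isfinite(df["R"].to_numpy())
        work = pd.DataFrame({
            "T": np.where(finite, (df["L"] + df["R"]) / 2.0, df["L"]),
            "E": finite.astype(int),
            "x": df["x"].to_numpy(),
        })
        cph = CoxPHFitter()
        for _ in range(n_iter):
            cph.fit(work, duration_col="T", event_col="E")
            base = cph.baseline_survival_             # step function on the event times
            times, events = impute_exact_times(
                df["L"].to_numpy(float), df["R"].to_numpy(float),
                df["x"].to_numpy(float), base.index.to_numpy(float),
                base.iloc[:, 0].to_numpy(float), cph.params_["x"], rng)
            work["T"], work["E"] = times, events
        return cph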
Article
In problems with missing or latent data, a standard approach is to first impute the unobserved data and then perform all statistical analyses on the completed dataset (the observed data together with the imputed unobserved data) using standard procedures for complete-data inference. Here, we extend this approach to model checking by demonstrating the advantages of using completed-data model diagnostics on the imputed completed datasets. The approach is set in the theoretical framework of Bayesian posterior predictive checks (but, as with missing-data imputation, our methods of missing-data model checking can also be interpreted as "predictive inference" in a non-Bayesian context). We consider graphical diagnostics within this framework. Advantages of the completed-data approach include: (1) one can often check model fit in terms of quantities that are of key substantive interest in a natural way, which is not always possible using observed data alone; (2) in problems with missing data, checks may be devised that do not require modelling the missingness or inclusion mechanism, which is useful for the analysis of ignorable but unknown data collection mechanisms, such as are often assumed in the analysis of sample surveys and observational studies; and (3) in many problems with latent data, it is possible to check qualitative features of the model (for example, independence of two variables) that can be naturally formalized with the help of the latent data. We illustrate with several applied examples.
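A crude version of such a completed-data check can be coded directly. The sketch below assumes a simple normal model for a single completed variable, ignores parameter uncertainty for brevity, and compares a user-chosen statistic on each completed data set with its distribution over replicated data sets; it illustrates the mechanics rather than the paper's general framework.

    import numpy as np

    def completed_data_ppc(completed, stat=np.max, n_rep=200, rng=None):
        # Completed-data posterior predictive check under a normal model:
        # for each completed (observed + imputed) data vector, simulate
        # replicated data from the fitted model and compare stat(y) with
        # stat(y_rep). Parameter uncertainty is ignored here for brevity.
        rng = np.random.default_rng(rng)
        pvals = []
        for y in completed:                  # one entry per imputation
            y = np.asarray(y, float)
            mu, sd = y.mean(), y.std(ddof=1)
            reps = rng.normal(mu, sd, size=(n_rep, y.size))
            pvals.append(np.mean(stat(reps, axis=1) >= stat(y)))
        return np.array(pvals)               # values near 0 or 1 flag misfit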
Article
Across multiply imputed data sets, variable selection methods such as stepwise regression and other criterion-based strategies that include or exclude particular variables typically result in models with different selected predictors, thus presenting a problem for combining the results from separate complete-data analyses. Here, drawing on a Bayesian framework, we propose two alternative strategies to address the problem of choosing among linear regression models when there are missing covariates. One approach, which we call "impute, then select" (ITS) involves initially performing multiple imputation and then applying Bayesian variable selection to the multiply imputed data sets. A second strategy is to conduct Bayesian variable selection and missing data imputation simultaneously within one Gibbs sampling process, which we call "simultaneously impute and select" (SIAS). The methods are implemented and evaluated using the Bayesian procedure known as stochastic search variable selection for multivariate normal data sets, but both strategies offer general frameworks within which different Bayesian variable selection algorithms could be used for other types of data sets. A study of mental health services utilization among children in foster care programs is used to illustrate the techniques. Simulation studies show that both ITS and SIAS outperform complete-case analysis with stepwise variable selection and that SIAS slightly outperforms ITS.
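The "impute, then select" workflow can be caricatured in a few lines of Python. The sketch below swaps the paper's stochastic search variable selection for an off-the-shelf lasso and simply tallies how often each covariate is retained across imputations; the scikit-learn classes used (IterativeImputer with sample_posterior=True, LassoCV) are assumptions of this sketch, not part of the paper's Bayesian procedure.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LassoCV

    def impute_then_select(X, y, m=10, random_state=0):
        # Generate m stochastic imputations of the missing covariates, run a
        # penalised selection within each completed data set, and report the
        # per-predictor selection frequency across imputations.
        counts = np.zeros(X.shape[1])
        for i in range(m):
            imputer = IterativeImputer(sample_posterior=True,
                                       random_state=random_state + i)
            X_completed = imputer.fit_transform(X)
            coef = LassoCV(cv=5, random_state=i).fit(X_completed, y).coef_
            counts += np.abs(coef) > 1e-8
        return counts / m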
Article
There are many methods for measurement-error correction, but they remain rarely used despite the ubiquity of measurement error. Treating measurement error as a missing-data problem, the authors show how multiple imputation for measurement error (MIME) correction can be done using SAS software and evaluate the approach with a simulation experiment. Based on hypothetical data from a planned cohort study of 600 children with chronic kidney disease, the estimated hazard ratio for end-stage renal disease from the complete data was 2.0 [95% confidence limits (95% CL) 1.4, 2.8] and was reduced to 1.5 (95% CL 1.1, 2.1) using a misclassified exposure of low glomerular filtration rate at study entry (sensitivity of 0.9 and specificity of 0.7). The MIME-corrected hazard ratio was 2.0 (95% CL 1.2, 3.3), the regression calibration (RC) hazard ratio was 2.0 (95% CL 1.1, 3.7), and restriction to a 25% validation substudy yielded a hazard ratio of 2.0 (95% CL 1.0, 3.7). Based on Monte Carlo simulations across eight scenarios, MIME was approximately unbiased, had approximately correct coverage, and was sometimes more powerful than the misclassified or RC analyses. Using root mean squared error as a criterion, the bias correction achieved by MIME is sometimes outweighed by added imprecision. The choice between MIME and RC depends on performance, ease, and objectives. The usefulness of MIME correction in specific applications will depend upon the sample size or the proportion validated. MIME correction may be valuable in interpreting imperfectly measured epidemiological data.
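The basic MIME idea, imputing the true exposure from a validation substudy and pooling with Rubin's rules, can be sketched as follows. This sketch substitutes a logistic outcome model for the article's Cox model, uses hypothetical column names x (true exposure), w (misclassified exposure) and y (outcome), bootstraps the validation data as an approximation to fully proper imputation, and assumes the statsmodels package; it is an outline, not the authors' SAS implementation.

    import numpy as np
    import statsmodels.api as sm

    def mime_correct(main, valid, m=20, seed=0):
        # 'valid' is a validation substudy holding both the true exposure x and
        # the misclassified exposure w; 'main' holds only w and the outcome y.
        rng = np.random.default_rng(seed)
        q, u = [], []
        for _ in range(m):
            # bootstrap the validation data to propagate imputation-model
            # uncertainty (an approximation to fully proper imputation)
            boot = valid.sample(frac=1.0, replace=True,
                                random_state=int(rng.integers(1 << 31)))
            imp = sm.Logit(boot["x"], sm.add_constant(boot[["w", "y"]])).fit(disp=0)
            p = imp.predict(sm.add_constant(main[["w", "y"]]))
            x_imp = rng.binomial(1, p)                 # one draw of the true exposure
            out = sm.Logit(main["y"], sm.add_constant(x_imp)).fit(disp=0)
            q.append(np.asarray(out.params)[-1])       # log odds ratio for exposure
            u.append(np.asarray(out.bse)[-1] ** 2)
        q, u = np.array(q), np.array(u)
        total_var = u.mean() + (1 + 1 / m) * q.var(ddof=1)   # Rubin's rules
        return np.exp(q.mean()), total_var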
Article
When performing multi-component significance tests with multiply-imputed datasets, analysts can use a Wald-like test statistic and a reference F-distribution. The currently employed degrees of freedom in the denominator of this F-distribution are derived assuming an infinite sample size. For modest complete-data sample sizes, this degrees-of-freedom value can be unrealistic; for example, it may exceed the complete-data degrees of freedom. This paper presents an alternative denominator degrees of freedom that is always less than or equal to the complete-data denominator degrees of freedom and that equals the currently employed denominator degrees of freedom for infinite sample sizes. Its advantages over the currently employed degrees of freedom are illustrated with a simulation. Copyright 2007, Oxford University Press.
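For reference, the widely used Wald-like statistic for a k-dimensional estimand and its large-sample denominator degrees of freedom (the quantity this paper argues can be too large in modest samples) can be computed as below. The formulas are the standard repeated-imputation combining rules for multi-component tests as I recall them, and the paper's proposed small-sample alternative is deliberately not reproduced here.

    import numpy as np

    def mi_wald_test(qs, us, q0=None):
        # qs is (m, k): point estimates from the m imputed data sets;
        # us is (m, k, k): complete-data covariance matrices. Compare the
        # returned statistic with an F(k, nu) reference distribution.
        qs, us = np.asarray(qs, float), np.asarray(us, float)
        m, k = qs.shape
        q0 = np.zeros(k) if q0 is None else np.asarray(q0, float)
        qbar, ubar = qs.mean(axis=0), us.mean(axis=0)
        b = np.atleast_2d(np.cov(qs, rowvar=False, ddof=1))      # between-imputation covariance
        r = (1 + 1 / m) * np.trace(b @ np.linalg.inv(ubar)) / k  # relative increase in variance
        d = (qbar - q0) @ np.linalg.solve(ubar, qbar - q0) / (k * (1 + r))
        t = k * (m - 1)
        if t > 4:
            nu = 4 + (t - 4) * (1 + (1 - 2 / t) / r) ** 2        # large-sample denominator df
        else:
            nu = t * (1 + 1 / k) * (1 + 1 / r) ** 2 / 2
        return d, k, nu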
Article
The paper presents an illustration and empirical study of releasing multiply imputed, fully synthetic public use microdata. Simulations based on data from the US Current Population Survey are used to evaluate the potential validity of inferences based on fully synthetic data for a variety of descriptive and analytic estimands, to assess the degree of protection of confidentiality that is afforded by fully synthetic data and to illustrate the specification of synthetic data imputation models. Benefits and limitations of releasing fully synthetic data sets are discussed. Copyright 2005 Royal Statistical Society.
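For comparison with the missing-data case, the combining rule commonly used with fully synthetic data differs from Rubin's original rule in the sign of the within-imputation term. A scalar version is sketched below, with the caveat that the variance estimate can be negative in small synthetic samples, in which case more conservative non-negative alternatives are recommended in the literature.

    import numpy as np

    def combine_fully_synthetic(q, u):
        # Combining rule for fully synthetic data, scalar estimand: q holds the
        # m point estimates, u the corresponding variances computed on the
        # synthetic data sets.
        q, u = np.asarray(q, float), np.asarray(u, float)
        m = len(q)
        qbar = q.mean()
        t = (1 + 1 / m) * q.var(ddof=1) - u.mean()   # note the minus sign
        # t can be negative in small samples; a more conservative non-negative
        # alternative is then substituted in practice.
        return qbar, t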