Conference Paper

Routine multiple imputation in statistical databases

Dept. of Stat., TNO-PG, Leiden
DOI: 10.1109/SSDM.1994.336960. Conference: Proceedings of the Seventh International Working Conference on Scientific and Statistical Database Management, 1994.
Source: IEEE Xplore


This paper deals with problems concerning missing data in
statistical databases. Multiple imputation is a statistically sound
technique for handling incomplete data. Two problems should be addressed
before the routine application of the technique becomes feasible. First,
if imputations are to be appropriate for more than one statistical
analysis, they should be generated independently of any scientific
models that are to be applied to the data at a later stage. This is done
by finding imputations that will extrapolate the structure of the data,
as well as the uncertainty about this structure. A second problem is to
use complete-data methods in an efficient way. The HERMES workstation
encapsulates existing statistical packages in a client-server model. It
forms a natural and convenient environment for implementing multiple
imputation.

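The abstract's core idea, that each imputation should be drawn from a distribution reflecting uncertainty about the fitted structure and not only the structure itself, can be sketched in a few lines. This is a minimal illustration assuming a single covariate and a linear model; the function name `multiply_impute` is hypothetical and not from the paper, and a bootstrap of the complete cases stands in for a proper posterior draw of the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def multiply_impute(y, x, m=5, rng=rng):
    """Draw m imputed copies of y (with missing entries) given a fully
    observed covariate x.  Resampling the complete cases approximates
    drawing the regression parameters from their posterior, so the
    imputations carry uncertainty about the structure as well as
    residual noise.  Illustrative sketch, not the paper's method."""
    obs = ~np.isnan(y)
    mis = np.isnan(y)
    imputations = []
    for _ in range(m):
        # 1. Bootstrap the complete cases: approximate parameter uncertainty.
        idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
        X = np.column_stack([np.ones(idx.size), x[idx]])
        beta, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        sigma = (y[idx] - X @ beta).std(ddof=2)
        # 2. Draw the missing values from the predictive distribution.
        Xmis = np.column_stack([np.ones(mis.sum()), x[mis]])
        filled = y.copy()
        filled[mis] = Xmis @ beta + rng.normal(0.0, sigma, size=mis.sum())
        imputations.append(filled)
    return imputations

# Example: y depends linearly on x, with 20% of y missing.
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=200)
y[rng.choice(200, size=40, replace=False)] = np.nan
completed = multiply_impute(y, x, m=5)
```

Because each copy is drawn separately, the imputed cells differ across the m completed datasets; analyzing each copy and pooling the results is what makes the technique statistically sound.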
    • "On the other hand, Alzola and Harrell [36] introduced a function that imputes each incomplete attribute by cubic spline regression given all the other attributes, without assuming that the data must be modeled by the multivariate distribution. A multiple imputation environment called multivariate imputation by chained equations (MICE), which provides a full spectrum of conditional distributions and related regression-based methods, was developed by Buuren and Oudshoorn [33] [37]. MICE incorporates logistic regression, polytomous regression, and linear regression, uses a Gibbs sampler [38] to generate multiple imputations, and is furnished with a comprehensive, state-of-the-art missing data imputation software package."
    ABSTRACT: Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to a half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill-in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a Hot deck method, a Naïve-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with high amounts (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for the support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, improve classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naïve-Bayes were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation.
    Pattern Recognition 12/2008; 41(12):3692-3705. DOI: 10.1016/j.patcog.2008.05.019
    ABSTRACT: On-Line Analytical Processing (OLAP) technologies are being used widely for business-data analysis, and these technologies are also being used increasingly in medical applications, e.g., for patient-data analysis. The lack of effective means of handling data imprecision, which occurs when exact values are not known precisely or are entirely missing, represents a major obstacle in applying OLAP technology to the medical domain, as well as many other domains. OLAP systems are mainly based on a multidimensional model of data and include constructs such as dimension hierarchies and granularities. This paper develops techniques for handling imprecision that aim to maximally reuse these existing constructs. With imprecise data now available in the database, queries are tested to determine whether or not they may be answered precisely given the available data; if not, alternative queries that are unaffected by the imprecision are suggested. When a user elects to proceed with ...
    ABSTRACT: This thesis is about data modeling and query processing for complex multidimensional data. Multidimensional data has become the subject of much attention in both academia and industry in recent years, fueled by the popularity of data warehousing and On-Line Analytical Processing (OLAP) applications. One application area where complex multidimensional data is common is medical informatics, an area that may benefit significantly from the functionality offered by data warehousing and OLAP. However, the special nature of clinical applications poses new requirements on data warehousing technologies beyond those posed by conventional data warehouse applications. This thesis presents a number of exciting new research challenges posed by clinical applications, to be met by the database research community. These include the need for complex-data modeling features, advanced temporal support, advanced classification structures, continuously valued data, dimensionally reduced ...
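The chained-equations (MICE) scheme described in the quoted passage, where each incomplete attribute is regressed on all the others and its missing cells are redrawn in Gibbs-style sweeps, can be sketched as follows. This is an illustrative sketch using linear regressions only (the real MICE package also offers logistic and polytomous models); the function name `chained_equations` is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def chained_equations(data, n_iter=10, rng=rng):
    """One MICE-style pass: each incomplete column is regressed on all
    the other columns, and its missing cells are redrawn from the
    fitted predictive distribution; the column-by-column sweeps are
    repeated until the fills stabilise.  Illustrative sketch only."""
    X = data.copy()
    miss = np.isnan(X)
    # Initialise missing cells with column means.
    col_means = np.nanmean(data, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):              # Gibbs-style sweeps over columns
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            obs = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            sigma = (X[obs, j] - A[obs] @ beta).std(ddof=1)
            # Redraw the missing cells, adding residual noise.
            X[miss[:, j], j] = A[miss[:, j]] @ beta + rng.normal(
                0.0, sigma, size=miss[:, j].sum())
    return X

# Example: three correlated columns with about 10% of cells missing.
Z = rng.normal(size=(100, 3)) @ np.array([[1, .5, .2], [0, 1, .5], [0, 0, 1.]])
Z[rng.random(Z.shape) < 0.1] = np.nan
imputed = chained_equations(Z)
```

Running the whole procedure m times with different random seeds would yield the multiple imputations that MICE pools; observed cells are never touched, only the missing ones are redrawn.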