Conference Paper

Routine multiple imputation in statistical databases

Dept. of Stat., TNO-PG, Leiden
DOI: 10.1109/SSDM.1994.336960
Conference: Proceedings of the Seventh International Working Conference on Scientific and Statistical Database Management, 1994
Source: IEEE Xplore

ABSTRACT This paper deals with problems concerning missing data in
statistical databases. Multiple imputation is a statistically sound
technique for handling incomplete data. Two problems should be addressed
before the routine application of the technique becomes feasible. First,
if imputations are to be appropriate for more than one statistical
analysis, they should be generated independently of any scientific
models that are to be applied to the data at a later stage. This is done
by finding imputations that will extrapolate the structure of the data,
as well as the uncertainty about this structure. A second problem is to
use complete-data methods in an efficient way. The HERMES workstation
encapsulates existing statistical packages in a client-server model. It
forms a natural and convenient environment for implementing multiple

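As an illustration of the general technique the abstract refers to, the following is a minimal sketch of multiple imputation for the mean of a single incomplete variable, pooled with Rubin's rules. It is not the method or the HERMES implementation described in the paper: the univariate normal imputation model, the noninformative prior, and the function names `impute_normal`, `pool_rubin`, and `multiply_impute_mean` are all assumptions made for this example.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Combine m complete-data results with Rubin's rules.

    estimates: point estimates Q_1..Q_m, one per imputed dataset
    variances: their complete-data variances U_1..U_m
    Returns the pooled estimate and its total variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    u_bar = u.mean()                 # within-imputation variance
    b = q.var(ddof=1)                # between-imputation variance
    total = u_bar + (1 + 1 / m) * b  # total variance
    return q_bar, total


def impute_normal(y, rng):
    """Create one completed copy of y (NaN marks a missing value).

    Assumed model: univariate normal with a noninformative prior.
    The parameters are drawn from their posterior first, so the
    imputations reflect uncertainty about the data structure too.
    """
    miss = np.isnan(y)
    obs = y[~miss]
    n = obs.size
    sigma2 = (n - 1) * obs.var(ddof=1) / rng.chisquare(n - 1)
    mu = rng.normal(obs.mean(), np.sqrt(sigma2 / n))
    filled = y.copy()
    filled[miss] = rng.normal(mu, np.sqrt(sigma2), size=miss.sum())
    return filled


def multiply_impute_mean(y, m=5, seed=0):
    """Impute m times, run the complete-data analysis, then pool."""
    rng = np.random.default_rng(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_normal(y, rng)
        # Complete-data analysis: sample mean and its variance.
        estimates.append(completed.mean())
        variances.append(completed.var(ddof=1) / completed.size)
    return pool_rubin(estimates, variances)
```

The complete-data analysis inside the loop is deliberately separate from the imputation step, so the same imputed datasets could be reused for a different analysis, with only the pooling step repeated.
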
  • ABSTRACT: Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to a half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill-in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a Hot deck method, a Naïve-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, improve classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naïve-Bayes were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation.
    Pattern Recognition 12/2008; · 2.58 Impact Factor
  • ABSTRACT: "A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Science, Department of Electrical and Computer Engineering". Thesis (M.Sc.)--University of Alberta, 2005. Includes bibliographical references.
    IEEE Transactions on Systems, Man, and Cybernetics, Part A. 01/2007; 37:692-709.
  • ABSTRACT: On-line analytical processing (OLAP) systems considerably improve data analysis and are finding wide-spread use. OLAP systems typically employ multidimensional data models to structure their data. This paper identifies 11 modeling requirements for multidimensional data models. These requirements are derived from an assessment of complex data found in real-world applications. A survey of 14 multidimensional data models reveals shortcomings in meeting some of the requirements. Existing models do...
    Inf. Syst. 01/2001; 26:383-423.