Conference Paper

Routine multiple imputation in statistical databases

Dept. of Stat., TNO-PG, Leiden
DOI: 10.1109/SSDM.1994.336960
Conference: Proceedings of the Seventh International Working Conference on Scientific and Statistical Database Management, 1994
Source: IEEE Xplore

ABSTRACT This paper deals with problems concerning missing data in
statistical databases. Multiple imputation is a statistically sound
technique for handling incomplete data. Two problems should be addressed
before the routine application of the technique becomes feasible. First,
if imputations are to be appropriate for more than one statistical
analysis, they should be generated independently of any scientific
models that are to be applied to the data at a later stage. This is done
by finding imputations that will extrapolate the structure of the data,
as well as the uncertainty about this structure. A second problem is to
use complete-data methods in an efficient way. The HERMES workstation
encapsulates existing statistical packages in a client-server model. It
forms a natural and convenient environment for implementing multiple
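
As an illustration of how complete-data methods are reused under multiple imputation, the following minimal sketch pools the results of one analysis repeated on several imputed datasets, using Rubin's standard combining rules. The abstract does not spell out these formulas or any HERMES interface; the function name `pool_estimates` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Pool m complete-data results with Rubin's combining rules.

    estimates: point estimate obtained from each imputed dataset
    variances: squared standard error obtained from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(variances)            # within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b    # total variance of the pooled estimate
    return q_bar, total


# Hypothetical results of one analysis run on m = 3 imputed datasets.
q, t = pool_estimates([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
```

The between-imputation term carries the uncertainty about the missing values, which is why a single imputed dataset would understate the variance.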

  • ABSTRACT: Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to half of their entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a Hot deck method, a Naïve-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with a high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for the support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, improve classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naïve-Bayes were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation.
    Pattern Recognition 12/2008; DOI:10.1016/j.patcog.2008.05.019 · 2.58 Impact Factor
  • ABSTRACT: "A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Science, Department of Electrical and Computer Engineering". Thesis (M.Sc.)--University of Alberta, 2005. Includes bibliographical references.
  • ABSTRACT: On-line analytical processing (OLAP) systems considerably improve data analysis and are finding widespread use. OLAP systems typically employ multidimensional data models to structure their data. This paper identifies 11 modeling requirements for multidimensional data models. These requirements are derived from an assessment of complex data found in real-world applications. A survey of 14 multidimensional data models reveals shortcomings in meeting some of the requirements. Existing models do not support many-to-many relationships between facts and dimensions, lack built-in mechanisms for handling change and time, lack support for imprecision, and are generally unable to insert data with varying granularities. This paper defines an extended multidimensional data model and algebraic query language that address all 11 requirements. The model reuses the common multidimensional concepts of dimension hierarchies and granularities to capture imprecise data. For queries that cannot be answered precisely due to the imprecise data, techniques are proposed that take into account the imprecision in the grouping of the data, in the subsequent aggregate computation, and in the presentation of the imprecise result to the user. In addition, alternative queries unaffected by imprecision are offered. The data model and query evaluation techniques discussed in this paper can be implemented using relational database technology. The approach is also capable of exploiting multidimensional query processing techniques like pre-aggregation. This yields a practical solution with low computational overhead.
    Information Systems 07/2001; 26(5):383-423. DOI:10.1016/S0306-4379(01)00023-0 · 1.24 Impact Factor