Missing value imputation improves clustering and interpretation of gene expression microarray data.

Department of Information Technology and TUCS, University of Turku, FI-20014 Turku, Finland.
BMC Bioinformatics (Impact Factor: 3.02). 02/2008; 9:202. DOI: 10.1186/1471-2105-9-202

ABSTRACT Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods, since their performance seems to vary drastically depending on the dataset being used.
We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods.
The results demonstrate that, while missing values still severely complicate microarray data analysis, their impact on the discovery of biologically meaningful gene groups can, up to a certain degree, be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
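
As a rough illustration of the measurement-level evaluation the abstract refers to, the sketch below hides a fraction of known values, imputes them in three ways, and compares the root-mean-square error on the hidden entries. It is not the paper's protocol: the data are synthetic, the function names are hypothetical, the nearest-neighbour routine is a simplified stand-in for KNNimpute, and BPCA is not implemented. It also assumes that no gene has all of its values missing.

```python
import numpy as np

def mask_at_random(data, missing_rate, rng):
    """Return a copy of data with a random fraction of entries set to NaN, and the mask."""
    mask = rng.random(data.shape) < missing_rate
    masked = data.copy()
    masked[mask] = np.nan
    return masked, mask

def impute_zero(masked):
    """Replace every missing entry with zero."""
    return np.where(np.isnan(masked), 0.0, masked)

def impute_row_mean(masked):
    """Replace every missing entry with the mean of the observed values of that gene (row)."""
    row_means = np.nanmean(masked, axis=1, keepdims=True)
    return np.where(np.isnan(masked), row_means, masked)

def impute_knn(masked, k=10):
    """Simplified nearest-neighbour imputation (an assumption, not KNNimpute itself):
    fill each gene's gaps with the average of its k most similar genes."""
    filled = impute_row_mean(masked)  # provisional values, used for distances
    out = masked.copy()
    for i in np.flatnonzero(np.isnan(masked).any(axis=1)):
        dist = np.sqrt(((filled - filled[i]) ** 2).mean(axis=1))
        dist[i] = np.inf  # a gene is not its own neighbour
        neighbours = np.argsort(dist)[:k]
        missing_cols = np.isnan(masked[i])
        out[i, missing_cols] = filled[neighbours][:, missing_cols].mean(axis=0)
    return out

def rmse(truth, imputed, mask):
    """Root-mean-square error restricted to the artificially hidden entries."""
    return float(np.sqrt(np.mean((truth[mask] - imputed[mask]) ** 2)))

# Synthetic "expression matrix": 200 genes x 12 samples drawn from 4 cluster profiles.
rng = np.random.default_rng(0)
profiles = rng.normal(0.0, 2.0, size=(4, 12))
labels = rng.integers(0, 4, size=200)
truth = 5.0 + profiles[labels] + rng.normal(0.0, 0.3, size=(200, 12))

masked, mask = mask_at_random(truth, 0.10, rng)
methods = {"zero": impute_zero, "row_mean": impute_row_mean, "knn": impute_knn}
scores = {name: rmse(truth, impute(masked), mask) for name, impute in methods.items()}

for name, score in scores.items():
    print(f"{name:>8}: RMSE = {score:.3f}")
```

A cluster-level evaluation of the kind the abstract argues for would instead cluster the complete and the imputed matrices and compare the resulting gene groups, which this sketch does not attempt.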

  • ABSTRACT: Many attempts have been made to deal with missing values (MV) in microarray gene expression data. This is a problematic issue because many data analysis techniques are not robust to missing data. Most of the MV imputation methods currently in use have been evaluated only in terms of the similarity between the original and imputed data. However, the imputed expression values are not of interest in themselves; the major concern is whether they are reliable enough to use in subsequent analysis. This paper studies the impact of different MV imputation methods on classification accuracy. Three popular imputation methods were first implemented, namely Singular Value Decomposition (SVD), weighted K-nearest neighbors (KNNimpute), and zero replacement. The robustness of the three methods to the amount of missing data was then studied by repeating the experiments on datasets with missing rates (MR) ranging from 0 to 20%. For supervised two-class classification we adopted a twofold approach, presenting the classifiers with all gene expressions as well as with a subset of selected genes. Genes were selected using Fisher Discriminant Analysis (FDA), which noticeably improved the performance of the classifiers. The classifier accuracies obtained using data imputed by the three methods show only slight variations over the specified range of MR, indicating that the three imputation methods are robust.
    Biomedical Engineering (MECBME), 2011 1st Middle East Conference on; 01/2011
  • ABSTRACT: Many bioinformatics analytical tools, especially for cancer classification and prediction, require complete data matrices. Missing values in gene expression studies significantly influence the interpretation of the final data. To most analysts' dismay, however, missing values have become a common problem, and relevant imputation algorithms therefore have to be developed or refined to address it. This paper reviews the preferred and available methods for imputing missing values in gene expression data. Focus is placed on the ability of the algorithms to exploit local or global data correlation when estimating the missing values. The algorithms are categorized into global, local, hybrid, and knowledge-assisted approaches. The methods presented are accompanied by suitable performance evaluation. The aim of this review is to highlight possible improvements to existing research techniques, rather than to recommend new algorithms with the same functional aim.
    Current Bioinformatics. 01/2014; 9(1):18-22.
