Missing value imputation improves clustering and interpretation of gene expression microarray data

Department of Information Technology and TUCS, University of Turku, FI-20014 Turku, Finland.
BMC Bioinformatics (Impact Factor: 2.67). 02/2008; 9:202. DOI: 10.1186/1471-2105-9-202
Source: PubMed

ABSTRACT Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to the microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods since their performance seems to vary drastically depending on the dataset being used.
We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods.
The results demonstrate that, while missing values are still severely complicating microarray data analysis, their impact on the discovery of biologically meaningful gene groups can - up to a certain degree - be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
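The measurement-level evaluation mentioned above can be illustrated with a small sketch. It is not the authors' protocol: the synthetic data, the 10% missing rate, the helper names, and the exact NRMSE definition (RMSE over the masked entries divided by the standard deviation of their true values) are all assumptions made for illustration. It compares only the two simple baselines named in the abstract, zero replacement and gene-wise average replacement, and does not cover the advanced methods or the cluster-level evaluation.

```python
import math
import random


def make_expression_matrix(n_genes, n_samples, rng):
    """Synthetic matrix (assumption): each gene has its own baseline plus noise."""
    matrix = []
    for _ in range(n_genes):
        baseline = rng.gauss(0.0, 2.0)
        matrix.append([baseline + rng.gauss(0.0, 0.5) for _ in range(n_samples)])
    return matrix


def mask_entries(matrix, missing_rate, rng):
    """Copy the matrix, set a random fraction of entries to None, record where."""
    masked = [row[:] for row in matrix]
    positions = []
    for i, row in enumerate(masked):
        for j in range(len(row)):
            if rng.random() < missing_rate:
                row[j] = None
                positions.append((i, j))
    return masked, positions


def impute_zero(masked):
    """Baseline 1: replace every missing value with zero."""
    return [[0.0 if v is None else v for v in row] for row in masked]


def impute_row_mean(masked):
    """Baseline 2: replace missing values with the gene's observed average."""
    filled = []
    for row in masked:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def nrmse(original, imputed, positions):
    """RMSE on the masked entries, normalised by the std of their true values."""
    truth = [original[i][j] for i, j in positions]
    guess = [imputed[i][j] for i, j in positions]
    mse = sum((t - g) ** 2 for t, g in zip(truth, guess)) / len(truth)
    mean_t = sum(truth) / len(truth)
    var_t = sum((t - mean_t) ** 2 for t in truth) / len(truth)
    return math.sqrt(mse / var_t)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = make_expression_matrix(200, 12, rng)
masked, positions = mask_entries(original, 0.10, rng)

score_zero = nrmse(original, impute_zero(masked), positions)
score_mean = nrmse(original, impute_row_mean(masked), positions)
print(f"NRMSE zero fill: {score_zero:.3f}")
print(f"NRMSE row mean:  {score_mean:.3f}")
```

A lower NRMSE means the imputed values are closer to the hidden originals. The abstract's point is that such measurement-level differences need not carry over to how well gene clusters are reproduced, so a fuller comparison would also cluster the imputed matrices and compare the resulting partitions.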

  • ABSTRACT: Missing value imputation is crucial for microarray data analysis, since missing values degrade the performance of downstream analyses such as identification of differentially expressed genes, gene clustering, or classification. Although many missing value imputation algorithms have been proposed, convenient software tools are still lacking. The existing tools are not easy to use and cannot tell users how to choose the optimal imputation algorithm for their dataset. In this paper, we present an easy-to-use web server named IMDE (Impute Missing Data Easily). IMDE has two unique features. First, it provides many more missing value imputation algorithms than any existing tool. Second, it can suggest the optimal imputation algorithm for a user's dataset after performing a performance evaluation. We used four different datasets to show that different optimal algorithms may be chosen for different datasets and for different selection schemes. We expect that IMDE will be a very useful server for solving the missing value problem in microarray data.
    2014 11th IEEE International Conference on Control & Automation (ICCA); 06/2014
  • ABSTRACT: Background: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. Results and conclusions: We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by the mean or median values performed as well as more complex strategies. The datasets analyzed in this study are available at
    BMC Bioinformatics 02/2015; 16:64. DOI:10.1186/s12859-015-0494-3 · 2.67 Impact Factor