Missing value imputation improves clustering and interpretation of gene expression microarray data

Department of Information Technology and TUCS, University of Turku, FI-20014 Turku, Finland.
BMC Bioinformatics (Impact Factor: 2.67). 02/2008; 9(1):202. DOI: 10.1186/1471-2105-9-202
Source: PubMed

ABSTRACT Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods, since their performance seems to vary drastically depending on the dataset being used.
We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods.
The results demonstrate that, while missing values still severely complicate microarray data analysis, their impact on the discovery of biologically meaningful gene groups can, up to a certain degree, be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
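
The measurement-level comparison described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' pipeline: the data are synthetic, a simple k-nearest-neighbour imputation stands in for the advanced methods (BPCA itself is not implemented here), and every function name and parameter below is a hypothetical choice made for illustration. The sketch hides a random 10% of entries, fills them with zeros, row averages, or the kNN estimate, and reports the normalized root mean squared error (NRMSE) on the hidden entries.

```python
import numpy as np


def make_data(n_genes=200, n_conds=20, n_clusters=4, noise=0.2, seed=0):
    """Synthetic expression matrix (assumption): each gene follows one of a few cluster profiles."""
    rng = np.random.default_rng(seed)
    profiles = rng.normal(0.0, 1.0, size=(n_clusters, n_conds))
    labels = rng.integers(0, n_clusters, size=n_genes)
    return profiles[labels] + rng.normal(0.0, noise, size=(n_genes, n_conds))


def mask_random(x, rate=0.1, seed=1):
    """Return a copy of x with a random fraction of entries set to NaN, plus the mask."""
    rng = np.random.default_rng(seed)
    missing = rng.random(x.shape) < rate
    x_miss = x.copy()
    x_miss[missing] = np.nan
    return x_miss, missing


def impute_zero(x_miss):
    """Replace every missing entry with zero."""
    return np.where(np.isnan(x_miss), 0.0, x_miss)


def impute_row_mean(x_miss):
    """Replace every missing entry with the mean of the observed values of its gene."""
    row_means = np.nanmean(x_miss, axis=1, keepdims=True)
    return np.where(np.isnan(x_miss), row_means, x_miss)


def impute_knn(x_miss, k=10):
    """Fill each missing entry with the column mean over the k most similar genes."""
    observed = ~np.isnan(x_miss)
    filled = impute_row_mean(x_miss)  # fallback if no donor gene is available
    for g in np.flatnonzero(~observed.all(axis=1)):
        shared = observed & observed[g]  # columns observed in both genes
        diff = np.where(shared, x_miss - x_miss[g], 0.0)
        n_shared = shared.sum(axis=1)
        # Root mean squared difference over the shared columns only.
        dist = np.sqrt((diff ** 2).sum(axis=1) / np.maximum(n_shared, 1))
        dist[n_shared == 0] = np.inf
        dist[g] = np.inf  # a gene is never its own neighbour
        order = np.argsort(dist)
        for j in np.flatnonzero(~observed[g]):
            donors = order[observed[order, j]][:k]  # nearest genes with column j observed
            if donors.size > 0:
                filled[g, j] = x_miss[donors, j].mean()
    return filled


def nrmse(x_true, x_imputed, missing):
    """RMSE on the hidden entries, normalized by the standard deviation of their true values."""
    err = x_imputed[missing] - x_true[missing]
    return np.sqrt(np.mean(err ** 2)) / np.std(x_true[missing])


x_true = make_data()
x_miss, missing = mask_random(x_true)
methods = {"zero": impute_zero, "row_mean": impute_row_mean, "knn": impute_knn}
errors = {name: nrmse(x_true, fn(x_miss), missing) for name, fn in methods.items()}
for name, value in errors.items():
    print(f"{name:>8}: NRMSE = {value:.3f}")
```

A lower NRMSE only reflects measurement-level accuracy. The abstract's point is that such differences between advanced methods can shrink once the imputed data are judged by how well they reproduce the original gene clusters, which would require a clustering step this sketch does not include.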


Available from: Johannes Tuikkala, Jul 06, 2015
  • Source
    ABSTRACT: The imputation of unknown or missing data is a crucial task in the analysis of biomedical datasets. In several situations, instances must be classified or identified from incomplete vectors, and missing values can substantially degrade the performance of the classification or recognition algorithms. Learning accurately from incomplete data raises a number of issues, some of which have not been completely solved in machine learning applications, so effective missing value estimation methods are required. Different methods for missing data imputation exist, but most of the time selecting the appropriate technique involves testing several methods, comparing them, and choosing the best one. Furthermore, applying these methods is in most cases not straightforward, as they involve several technical details, and for microarray datasets in particular they require huge computational resources. As far as we know, there is no public software application that provides the computing capabilities required for data imputation. This paper presents a new public tool for missing data imputation that is attached to a computer cluster in order to execute computationally demanding tasks. The software, WIMP (Web IMPutation), is a publicly available website where registered users can create, execute, analyze, and store their simulations related to missing data imputation.
    Computer methods and programs in biomedicine 09/2012; 108(3). DOI:10.1016/j.cmpb.2012.08.006 · 1.09 Impact Factor
  • Source
    ABSTRACT: Proteomic datasets are often incomplete due to identification range and sensitivity issues. It is therefore important to develop methodologies to estimate missing proteomic data, allowing better interpretation of proteomic datasets and of the metabolic mechanisms underlying complex biological systems. In this study, we applied an artificial neural network to approximate the relationships between cognate transcriptomic and proteomic datasets of Desulfovibrio vulgaris, and to predict protein abundance for the proteins not experimentally detected, based on several relevant predictors, such as mRNA abundance, cellular role and triple codon counts. The results showed that the coefficients of determination for the trained neural network models ranged from 0.47 to 0.68, providing better modeling than several previous regression models. The validity of the trained neural network model was evaluated using biological information (i.e. operons). To understand the mechanisms causing missing proteomic data, we used a multivariate logistic regression analysis, and the result suggested that some key factors, such as protein instability index, aliphatic index, mRNA abundance, effective number of codons (N(c)) and codon adaptation index (CAI) values, may help explain whether a given expressed protein can be detected. In addition, we demonstrated that biological interpretation can be improved by use of imputed proteomic datasets.
    Comparative and Functional Genomics 05/2011; 2011:780973. DOI:10.1155/2011/780973 · 1.75 Impact Factor
  • Source
    ABSTRACT: Epistatic miniarray profiling (E-MAP) is a powerful tool for analyzing gene functions and their biological relevance. However, E-MAP data suffer from a large proportion of missing values, which often results in misleading and biased analysis results. Effective missing value estimation methods for E-MAP are therefore urgently needed. Although several independent algorithms can be applied to achieve this goal, their performance varies significantly across datasets, indicating that each algorithm has its own advantages and disadvantages. In this paper, we propose a novel ensemble approach, EMDI, based on high-level diversity, which imputes missing values using two global and four local base estimators. Experimental results on five E-MAP datasets show that EMDI outperforms all single base algorithms, demonstrating that an appropriate combination provides complementarity among different methods. Comparisons between several fusion strategies also demonstrate that the proposed high-level diversity scheme is superior to the others. EMDI is freely available at
    Genomics 03/2011; 97(5):257-64. DOI:10.1016/j.ygeno.2011.03.001 · 2.79 Impact Factor