Missing value imputation improves clustering and interpretation of gene expression microarray data

Department of Information Technology and TUCS, University of Turku, FI-20014 Turku, Finland.
BMC Bioinformatics (Impact Factor: 2.58). 02/2008; 9(1):202. DOI: 10.1186/1471-2105-9-202
Source: PubMed


Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to the microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods since their performance seems to vary drastically depending on the dataset being used.
We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods.
The results demonstrate that, while missing values are still severely complicating microarray data analysis, their impact on the discovery of biologically meaningful gene groups can - up to a certain degree - be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
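The measurement-level comparison described above can be sketched on synthetic data. The snippet below is a minimal illustration, not the authors' evaluation pipeline: it hides 10% of the entries of a simulated genes-by-samples matrix and compares zero filling, row-average filling, and a simple k-nearest-neighbour imputation by root mean squared error on the hidden entries. The kNN routine is a deliberately simple stand-in for the advanced methods, since BPCA is too long to reproduce here. All function names, the synthetic data model (four cluster profiles plus a per-gene baseline), and the parameters are assumptions made for this sketch.

```python
import numpy as np

def zero_impute(x):
    """Replace every missing entry (NaN) with 0."""
    return np.where(np.isnan(x), 0.0, x)

def row_mean_impute(x):
    """Replace each missing entry with the mean of its gene's observed values."""
    means = np.nanmean(x, axis=1, keepdims=True)
    return np.where(np.isnan(x), means, x)

def knn_impute(x, k=10):
    """Fill each missing entry with the average of that column over the k genes
    closest to the incomplete gene (mean squared difference on shared columns)."""
    filled = x.copy()
    observed = ~np.isnan(x)
    for g in np.flatnonzero(~observed.all(axis=1)):
        # Columns observed both in gene g and in each other gene.
        shared = observed & observed[g]
        n_shared = shared.sum(axis=1)
        sq_diff = np.where(shared, (x - x[g]) ** 2, 0.0).sum(axis=1)
        dist = np.where(n_shared > 0, sq_diff / np.maximum(n_shared, 1), np.inf)
        dist[g] = np.inf  # a gene is never its own neighbour
        for j in np.flatnonzero(~observed[g]):
            donors = np.flatnonzero(observed[:, j] & np.isfinite(dist))
            if donors.size == 0:
                filled[g, j] = np.nanmean(x[g])  # fall back to the row average
                continue
            nearest = donors[np.argsort(dist[donors])[:k]]
            filled[g, j] = x[nearest, j].mean()
    return filled

def rmse(truth, imputed, mask):
    """Root mean squared error restricted to the artificially hidden entries."""
    return float(np.sqrt(np.mean((truth[mask] - imputed[mask]) ** 2)))

# Synthetic "expression" matrix (assumed model): cluster profile + gene baseline + noise.
rng = np.random.default_rng(0)
n_genes, n_samples, n_clusters = 200, 20, 4
profiles = rng.normal(size=(n_clusters, n_samples))
labels = rng.integers(n_clusters, size=n_genes)
baseline = rng.normal(size=(n_genes, 1))
truth = profiles[labels] + baseline + 0.1 * rng.normal(size=(n_genes, n_samples))

# Hide 10% of the entries completely at random.
mask = rng.random(truth.shape) < 0.10
data = truth.copy()
data[mask] = np.nan

methods = {"zero": zero_impute, "row_mean": row_mean_impute, "knn": knn_impute}
scores = {name: rmse(truth, fn(data), mask) for name, fn in methods.items()}
for name, score in scores.items():
    print(f"{name:>8}: RMSE = {score:.3f}")
```

On data with this kind of cluster structure, the neighbour-based method should give the lowest error and zero filling the highest. This only measures entry-level accuracy; the abstract's point is that conclusions can change when methods are instead judged by how well the imputed data reproduce the original gene clusters.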


Available from: Johannes Tuikkala,
  • Source
    • "Second, few independent rounds of the imputed procedure were performed (usually 10 times). Third, single performance measure was used [33,34]. Here, we present a fair and comprehensive evaluation to assess the performances of different imputation algorithms on different datasets using different performance measures. "
    ABSTRACT: Microarray data are usually peppered with missing values due to various reasons. However, most downstream analyses of microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed to improve the performance of microarray data analyses. Although many algorithms have been developed, there is much debate about the selection of the optimal algorithm. Studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm depends mainly on the type of dataset and not on the species from which the samples come. In addition to the statistical measure, two other measures with biological meaning are useful for reflecting the impact of missing value imputation on downstream data analyses. Our study suggests that local-least-squares-based methods are good choices for handling missing values in most microarray datasets. In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation.
Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.
    BMC Systems Biology 12/2013; 7 Suppl 6(Suppl 6):S12. DOI:10.1186/1752-0509-7-S6-S12 · 2.44 Impact Factor
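The idea of choosing an imputation algorithm for a given dataset "through a series of simulations" can be sketched as follows. This is a hypothetical illustration, not MissVIA's actual procedure or API: in each run a small fraction of the observed entries is hidden, every candidate method fills them in, and the method with the lowest average normalized RMSE over all runs is selected. The function `select_imputer`, the three simple candidate fillers, the normalization by the standard deviation of the hidden values, and all parameter values are assumptions made for this example.

```python
import numpy as np

def zero_fill(x):
    """Replace missing entries with 0."""
    return np.where(np.isnan(x), 0.0, x)

def row_mean_fill(x):
    """Replace missing entries with the mean of the gene (row)."""
    return np.where(np.isnan(x), np.nanmean(x, axis=1, keepdims=True), x)

def col_mean_fill(x):
    """Replace missing entries with the mean of the sample (column)."""
    return np.where(np.isnan(x), np.nanmean(x, axis=0, keepdims=True), x)

def select_imputer(data, candidates, n_runs=20, rate=0.05, seed=0):
    """Return (best_name, mean_errors) from repeated hide-and-impute simulations.

    `data` may already contain NaN; only observed entries are ever hidden,
    so each run has a known ground truth to score against.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(data)
    errors = {name: [] for name in candidates}
    for _ in range(n_runs):
        hide = observed & (rng.random(data.shape) < rate)
        trial = data.copy()
        trial[hide] = np.nan
        scale = data[hide].std()  # normalize so errors are comparable across datasets
        for name, fn in candidates.items():
            diff = fn(trial)[hide] - data[hide]
            errors[name].append(np.sqrt(np.mean(diff ** 2)) / scale)
    mean_errors = {name: float(np.mean(v)) for name, v in errors.items()}
    return min(mean_errors, key=mean_errors.get), mean_errors

# Toy dataset (assumed): strong per-gene baselines, so row means should win.
rng = np.random.default_rng(1)
data = rng.normal(scale=2.0, size=(100, 1)) + 0.5 * rng.normal(size=(100, 12))
data[rng.random(data.shape) < 0.05] = np.nan  # pre-existing missing values

candidates = {"zero": zero_fill, "row_mean": row_mean_fill, "col_mean": col_mean_fill}
best, mean_errors = select_imputer(data, candidates)
print("selected:", best)
```

Averaging over many independent runs, as the abstract emphasizes, reduces the chance that the ranking reflects one lucky or unlucky choice of hidden entries.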
  • Source
    • "Other valid approaches to deal with incomplete datasets tend to use supervised learning or statistical analysis to impute the missing data so as to use the total number of samples available in the dataset [1] [9] [16] [18] [25] [27] [23] [22] [21]. In fact, most of the biomedical studies have focussed on developing missing value estimation methods for incomplete biomedical or microarray datasets [35] [38] [42] [20] [2] [12] [19] [11] [30] [5] [39] [3] [36] [41] [7] [10]. "
    ABSTRACT: The imputation of unknown or missing data is a crucial task in the analysis of biomedical datasets. There are several situations where it is necessary to classify or identify instances given incomplete vectors, and the existence of missing values can greatly degrade the performance of the algorithms used for classification or recognition. The task of learning accurately from incomplete data raises a number of issues, some of which have not been completely solved in machine learning applications. Effective missing value estimation methods are therefore required. Different methods for missing data imputation exist, but most of the time selecting the appropriate technique involves testing several methods, comparing them, and choosing the best one. Furthermore, applying these methods is in most cases not straightforward, as they involve several technical details, and in cases such as microarray datasets their application requires huge computational resources. As far as we know, there is no public software application that provides the computing capabilities required for carrying out the task of data imputation. This paper presents a new public tool for missing data imputation that is attached to a computer cluster in order to execute computationally intensive tasks. The software WIMP (Web IMPutation) is a publicly available website where registered users can create, execute, analyze, and store their simulations related to missing data imputation.
    Computer methods and programs in biomedicine 09/2012; 108(3). DOI:10.1016/j.cmpb.2012.08.006 · 1.90 Impact Factor
  • Source
    • "In particular, when partial proteomics dataset was used in various integrated "omics" studies, the undetected proteins were simply assigned a "zero" value and were excluded from relationship modeling, which could bias any conclusion resulted from the integrated studies [14]. To overcome this problem, several methods have been adapted from the estimation of missing values in transcriptomic data to estimate the missing proteomics values by using the available measurements from other proteins, such as the k nearest neighbor and Bayesian Principal Component Analysis (BPCA) methods for imputing missing proteomic values in gel-based proteomics dataset [15] [16], and by integrating the GO (Gene Ontology) information into the proteomic data imputation [17]. Based on the assumption that there exists meaningful correlation between two types of datasets [14], in recent years we have developed the Zero-inflated Poisson (ZIP) linear regression model [18] and a stochastic Gradient Boosted Trees (GBT) nonlinear model [19] "
    ABSTRACT: Proteomic datasets are often incomplete due to identification range and sensitivity issues. It is therefore important to develop methodologies to estimate missing proteomic data, allowing better interpretation of proteomic datasets and of the metabolic mechanisms underlying complex biological systems. In this study, we applied an artificial neural network to approximate the relationships between cognate transcriptomic and proteomic datasets of Desulfovibrio vulgaris, and to predict protein abundance for the proteins not experimentally detected, based on several relevant predictors, such as mRNA abundance, cellular role and triple codon counts. The results showed that the coefficients of determination for the trained neural network models ranged from 0.47 to 0.68, providing better modeling than several previous regression models. The validity of the trained neural network model was evaluated using biological information (i.e. operons). To understand the mechanisms causing missing proteomic data, we used a multivariate logistic regression analysis, and the result suggested that some key factors, such as protein instability index, aliphatic index, mRNA abundance, effective number of codons (N(c)) and codon adaptation index (CAI) values, may influence whether a given expressed protein can be detected. In addition, we demonstrated that biological interpretation can be improved by use of imputed proteomic datasets.
    Comparative and Functional Genomics 05/2011; 2011:780973. DOI:10.1155/2011/780973 · 2.03 Impact Factor