Clustering methods for the analysis of DNA microarray data

Technical report. Stanford: Department of Statistics, Stanford University; November 1999.
Source: CiteSeer


It is now possible to simultaneously measure the expression of thousands of genes during cellular differentiation and response through the use of DNA microarrays. A major statistical task is to understand the structure in the data arising from this technology. In this paper we review various methods of clustering and illustrate how they can be used to arrange both the genes and cell lines from a set of DNA microarray experiments. The methods discussed are global clustering techniques, including hierarchical clustering, K-means, block clustering, and tree-structured vector quantization. Finally, we propose a new method for identifying structure in subsets of both genes and cell lines that is potentially obscured by the global clustering approaches.

1 Introduction

DNA microarrays and other high-throughput methods for analyzing complex nucleic acid samples now make it possible to measure rapidly, efficiently and accurately the levels of virtually all genes expressed in a biologi...
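The global techniques named in the abstract can be illustrated with a minimal sketch: average-linkage hierarchical clustering of a small expression matrix (rows = genes, columns = cell lines), cut into flat clusters. The matrix and its two planted gene groups below are invented for illustration and are not the report's data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Toy expression matrix: 10 genes x 6 cell lines, with two planted
# gene groups (high vs. low expression). Purely illustrative data.
high = rng.normal(loc=2.0, scale=0.3, size=(5, 6))
low = rng.normal(loc=-2.0, scale=0.3, size=(5, 6))
X = np.vstack([high, low])

# Average-linkage hierarchical clustering of the genes (rows),
# then cut the dendrogram into two flat clusters.
Z = linkage(X, method="average", metric="euclidean")
gene_labels = fcluster(Z, t=2, criterion="maxclust")
print(gene_labels)
```

Running the same two calls on `X.T` would arrange the cell lines (columns) instead, which is how both margins of a microarray matrix are typically ordered.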

    • "We present the TriGen (Triclustering-Genetic based) algorithm based on an evolutionary heuristic, genetic algorithms, which finds groups of pattern similarity for genes in a three-dimensional space, thus taking into account the gene, condition and time factors. Although many clustering and biclustering models define similarity based on distance functions [12] [37], these functions are not always adequate to capture similarities among genes, since correlations may still exist among a set of genes which are expressed at different levels of magnitude. Therefore we propose two different evaluation functions: the first one finds triclusters of coherent values and is based on a three-dimensional adaptation of the Mean Square Residue measure (MSR), which is a classic biclustering distance measure for gene expression analysis [9]; the second one is a correlation measure that identifies triclusters of coherent behavior based on the least square approximation (LSL), which calculates the distances among the slopes of the least square lines from a tricluster. "
    ABSTRACT: Analyzing microarray data represents a computational challenge due to the characteristics of these data. Clustering techniques are widely applied to create groups of genes that exhibit a similar behavior under the conditions tested. Biclustering emerges as an improvement of classical clustering since it relaxes the constraints for grouping, allowing genes to be evaluated only under a subset of the conditions and not under all of them. However, this technique is not appropriate for the analysis of longitudinal experiments in which the genes are evaluated under certain conditions at several time points. We present the TriGen algorithm, a genetic algorithm that finds triclusters of gene expression that take into account the experimental conditions and the time points simultaneously. We have used TriGen to mine datasets related to synthetic data, yeast (Saccharomyces cerevisiae) cell cycle and human inflammation and host response to injury experiments. TriGen has proved to be capable of extracting groups of genes with similar patterns in subsets of conditions and times, and these groups have been shown to be related in terms of their functional annotations extracted from the Gene Ontology.
    Neurocomputing 05/2014; 132 (in press). DOI:10.1016/j.neucom.2013.03.061
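The Mean Square Residue (MSR) named in the excerpt above is the classic Cheng–Church biclustering score, which measures deviation from an additive row-plus-column model. A minimal sketch of the standard two-dimensional version follows; the toy matrices are invented for illustration, and this is the base measure, not TriGen's three-dimensional adaptation.

```python
import numpy as np

def mean_square_residue(A):
    """Cheng-Church Mean Square Residue of a bicluster A (genes x conditions).

    Scores deviation from the additive model a_ij ~ rowmean_i + colmean_j - mean;
    0 means a perfectly coherent (additive) bicluster.
    """
    A = np.asarray(A, dtype=float)
    residue = (A
               - A.mean(axis=1, keepdims=True)   # row effects
               - A.mean(axis=0, keepdims=True)   # column effects
               + A.mean())                        # overall mean
    return float((residue ** 2).mean())

# Each row is a shifted copy of one base pattern, so the MSR is (near) zero:
# the rows are coherent even though their magnitudes differ.
base = np.array([1.0, 3.0, 2.0, 4.0])
coherent = np.vstack([base + shift for shift in (0.0, 0.5, 1.0)])

noisy = coherent + np.random.default_rng(0).normal(scale=0.5, size=coherent.shape)

print(mean_square_residue(coherent))  # ~0.0
print(mean_square_residue(noisy))     # strictly positive
```

This is the sense in which MSR tolerates genes expressed at different magnitudes: a constant per-gene shift leaves the residue untouched, whereas a plain distance function would not.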
    • "In this paper a criterion for partitioning other than a constant value was also proposed, for example a two-way analysis of variance model and a mean squared residue scoring approach. This method was later improved by [35], which introduced a backward pruning method for generating an optimal number of two-way clusters. [6] "
    ABSTRACT: In the past few years co-clustering has emerged as an important data mining tool for two-way data analysis. Co-clustering is more advantageous than traditional one-dimensional clustering in many ways, such as the ability to find highly correlated sub-groups of rows and columns. However, one of the overlooked benefits of co-clustering is that it can be used to extract meaningful knowledge for various other knowledge extraction purposes. For example, building predictive models with high-dimensional data and a heterogeneous population is a non-trivial task. Co-clusters extracted from such data, which show similar patterns in both dimensions, can be used for more accurate predictive model building. Several applications, such as finding patient-disease cohorts in health care analysis, finding user-genre groups in recommendation systems and community detection problems, can benefit from a co-clustering technique that utilizes the predictive power of the data to generate co-clusters for improved data analysis. In this paper, we present the novel idea of Predictive Overlapping Co-Clustering (POCC) as an optimization problem for a more effective and improved predictive analysis. Our algorithm generates optimal co-clusters by maximizing the predictive power of the co-clusters subject to constraints on the number of row and column clusters. In this paper, precision, recall and F-measure have been used as evaluation measures of the resulting co-clusters. Results of our algorithm have been compared with two other well-known techniques, K-means and spectral co-clustering, over four real data sets, namely the Leukemia, Internet-Ads, Ovarian cancer and MovieLens data sets. The results demonstrate the effectiveness and utility of our algorithm POCC in practice.
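One of the baselines named in the abstract, spectral co-clustering, is available in scikit-learn. A minimal sketch on a matrix with two planted row-column blocks follows; the data are invented for illustration, and this shows the baseline technique only, not the POCC algorithm.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)

# Non-negative 8 x 10 matrix with two planted blocks, i.e. two
# row-column co-clusters. Purely illustrative data.
X = rng.uniform(0.1, 1.0, size=(8, 10))
X[:4, :5] += 5.0
X[4:, 5:] += 5.0

# Spectral co-clustering assigns every row and every column to
# exactly one co-cluster simultaneously.
model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)

print(model.row_labels_)
print(model.column_labels_)
```

With blocks this pronounced the planted structure is recovered exactly; on real expression data the co-clusters are far noisier, which is the gap predictive variants such as POCC aim to address.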
    • "Euclidean distance and Pearson's correlation coefficient are two simple, commonly used metrics [2]. Popular clustering algorithms are hierarchical clustering, k-means clustering and Self-Organizing Maps [3]. "
    ABSTRACT: Background: While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein-encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. Large gene expression datasets from microarray experiments already exist, and many of these can be used to help assign potential functions to these genes. We have applied Support Vector Machines (SVM), a sigmoid fitting function and a stratified cross-validation approach to analyze a large microarray experiment dataset from Drosophila melanogaster in order to predict possible functions for previously un-annotated genes. A total of approximately 5043 different genes, or about one-third of the predicted genes in the D. melanogaster genome, are represented in the dataset, and 1854 (or 37%) of these genes are un-annotated. Results: 39 Gene Ontology Biological Process (GO-BP) categories were found with precision equal to or larger than 0.75 when recall was fixed at the 0.4 level. For two of those categories, we have provided additional support for assigning given genes to the category by showing that the majority of transcripts for the genes belonging to a given category have a similar localization pattern during embryogenesis. Additionally, by assessing the predictions using a confidence score, we have been able to provide a putative GO-BP term for 1422 previously un-annotated genes, or about 77% of the un-annotated genes represented on the microarray and about 19% of all of the un-annotated genes in the D. melanogaster genome. Conclusions: Our study successfully employs a number of SVM classifiers, accompanied by detailed calibration and validation techniques, to generate a number of predictions for new annotations for D. melanogaster genes. Applying a probabilistic analysis to the SVM output improves the interpretability of the prediction results and the objectivity of the validation procedure.
    BioData Mining 04/2013; 6(1):8. DOI:10.1186/1756-0381-6-8
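The two metrics named in the excerpt above behave differently on genes that share an expression pattern at different magnitudes, which is the usual reason correlation is preferred for co-expression analysis. A small sketch, with hypothetical gene profiles invented for illustration:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def pearson(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    return float(np.corrcoef(x, y)[0, 1])

# Two hypothetical genes with the same rising pattern, one expressed
# at 10x the level of the other.
gene_a = np.array([1.0, 2.0, 3.0, 4.0])
gene_b = 10.0 * gene_a

# Euclidean distance sees them as far apart; Pearson's correlation
# sees only the shared shape.
print(euclidean(gene_a, gene_b))  # large (~49.3)
print(pearson(gene_a, gene_b))    # ~1.0
```

A distance-based clustering would separate these two genes, while a correlation-based one would group them, mirroring the distance-versus-correlation distinction raised in the TriGen excerpt earlier on this page.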