Clustering methods for the analysis of DNA microarray data

Technical report. Stanford: Department of Statistics, Stanford University; 11/1999.
Source: CiteSeer


It is now possible to simultaneously measure the expression of thousands of genes during cellular differentiation and response, through the use of DNA microarrays. A major statistical task is to understand the structure in the data that arise from this technology. In this paper we review various methods of clustering, and illustrate how they can be used to arrange both the genes and cell lines from a set of DNA microarray experiments. The methods discussed are global clustering techniques including hierarchical, K-means, and block clustering, and tree-structured vector quantization. Finally, we propose a new method for identifying structure in subsets of both genes and cell lines that are potentially obscured by the global clustering approaches.

1 Introduction

DNA microarrays and other high-throughput methods for analyzing complex nucleic acid samples make it now possible to measure rapidly, efficiently and accurately the levels of virtually all genes expressed in a biologi...

    • "Clustering is the far most used method in gene expression analysis. Tibshirani et al. [15] and Aas [16] provide a classification of clustering methods in two categories: one-way clustering and two-way clustering. Methods of the first category are used to group either genes with similar behavior or samples with similar gene expressions."

    Article · Jul 2015
    • "We present the TriGen (Triclustering-Genetic based) algorithm based on an evolutionary heuristic, genetic algorithms, which finds groups of pattern similarity for genes on a three dimensional space, thus taking into account the gene, conditions and time factor. Although many clustering and biclustering models define similarity based on distance functions [12] [37], these functions are not always adequate to capture similarities among genes, since correlations may still exist among a set of genes which are expressed at different levels of magnitude. Therefore we propose two different evaluation functions: the first one finds triclusters of coherent values and is based on a three dimensions adaptation of the Mean Square Residue measure (MSR) which is a classic biclustering distance measure for gene expression analysis [9], the second one is a correlation measure that identifies triclusters of coherent behavior based on the least square approximation (LSL) which calculates the distances among the slopes of the least square lines from a tricluster. "
    ABSTRACT: Analyzing microarray data represents a computational challenge due to the characteristics of these data. Clustering techniques are widely applied to create groups of genes that exhibit similar behavior under the conditions tested. Biclustering emerges as an improvement on classical clustering since it relaxes the grouping constraint, so that genes are evaluated under only a subset of the conditions rather than all of them. However, this technique is not appropriate for the analysis of longitudinal experiments in which the genes are evaluated under certain conditions at several time points. We present the TriGen algorithm, a genetic algorithm that finds triclusters of gene expression that take into account the experimental conditions and the time points simultaneously. We have used TriGen to mine synthetic data as well as datasets from yeast (Saccharomyces cerevisiae) cell cycle and human inflammation and host response to injury experiments. TriGen has proved to be capable of extracting groups of genes with similar patterns in subsets of conditions and times, and these groups have been shown to be related in terms of their functional annotations extracted from the Gene Ontology.
    Article · May 2014 · Neurocomputing
    • "In this paper a criteria for partitioning other than a constant value was also proposed, for example a two way analysis of variance model and a mean squared residue scoring approach. Later this method was improved by [35] that introduced a backward pruning method for generating an optimal number of two way clusters. [6] "
    ABSTRACT: In the past few years co-clustering has emerged as an important data mining tool for two-way data analysis. Co-clustering has several advantages over traditional one-dimensional clustering, such as the ability to find highly correlated sub-groups of rows and columns. However, one of the overlooked benefits of co-clustering is that it can be used to extract meaningful knowledge for various other knowledge extraction purposes. For example, building predictive models with high-dimensional data and a heterogeneous population is a non-trivial task. Co-clusters extracted from such data, which show similar patterns in both dimensions, can be used to build more accurate predictive models. Several applications, such as finding patient-disease cohorts in health care analysis, finding user-genre groups in recommendation systems, and community detection problems, can benefit from a co-clustering technique that utilizes the predictive power of the data to generate co-clusters for improved data analysis. In this paper, we present the novel idea of Predictive Overlapping Co-Clustering (POCC) as an optimization problem for a more effective and improved predictive analysis. Our algorithm generates optimal co-clusters by maximizing the predictive power of the co-clusters subject to constraints on the number of row and column clusters. Precision, recall and F-measure are used as evaluation measures of the resulting co-clusters. The results of our algorithm have been compared with those of two other well-known techniques, K-means and Spectral co-clustering, on four real data sets: Leukemia, Internet-Ads, Ovarian cancer and MovieLens. The results demonstrate the effectiveness and utility of our algorithm POCC in practice.
    Article · Mar 2014