Clustering methods for the analysis of DNA microarray data
ABSTRACT: It is now possible to simultaneously measure the expression of thousands of genes during cellular differentiation and response, through the use of DNA microarrays. A major statistical task is to understand the structure in the data that arise from this technology. In this paper we review various methods of clustering, and illustrate how they can be used to arrange both the genes and cell lines from a set of DNA microarray experiments. The methods discussed are global clustering techniques including hierarchical, K-means, and block clustering, and tree-structured vector quantization. Finally, we propose a new method for identifying structure in subsets of both genes and cell lines that are potentially obscured by the global clustering approaches.
1 Introduction
DNA microarrays and other high-throughput methods for analyzing complex nucleic acid samples make it now possible to measure rapidly, efficiently and accurately the levels of virtually all genes expressed in a biologi...
- Source available from: Francisco Martínez-Álvarez
- "We present the TriGen (Triclustering-Genetic based) algorithm based on an evolutionary heuristic, genetic algorithms, which finds groups of pattern similarity for genes on a three dimensional space, thus taking into account the gene, conditions and time factor. Although many clustering and biclustering models define similarity based on distance functions, these functions are not always adequate to capture similarities among genes, since correlations may still exist among a set of genes which are expressed at different levels of magnitude. Therefore we propose two different evaluation functions: the first one finds triclusters of coherent values and is based on a three dimensions adaptation of the Mean Square Residue measure (MSR) which is a classic biclustering distance measure for gene expression analysis, the second one is a correlation measure that identifies triclusters of coherent behavior based on the least square approximation (LSL) which calculates the distances among the slopes of the least square lines from a tricluster."
ABSTRACT: Analyzing microarray data represents a computational challenge due to the characteristics of these data. Clustering techniques are widely applied to create groups of genes that exhibit similar behavior under the conditions tested. Biclustering emerged as an improvement on classical clustering, since it relaxes the grouping constraints so that genes are evaluated under only a subset of the conditions rather than all of them. However, this technique is not appropriate for the analysis of longitudinal experiments in which the genes are evaluated under certain conditions at several time points. We present the TriGen algorithm, a genetic algorithm that finds triclusters of gene expression that take into account the experimental conditions and the time points simultaneously. We have used TriGen to mine synthetic data as well as datasets from yeast (Saccharomyces cerevisiae) cell cycle and human inflammation and host response to injury experiments. TriGen has proved capable of extracting groups of genes with similar patterns in subsets of conditions and times, and these groups have been shown to be related in terms of their functional annotations extracted from the Gene Ontology.
Neurocomputing 05/2014; 132:In press. DOI:10.1016/j.neucom.2013.03.061 · 2.01 Impact Factor
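The excerpt above refers to the Mean Square Residue (MSR) as a classic biclustering measure. As a rough illustration only, the sketch below computes the standard two-dimensional MSR of a bicluster: the mean of the squared residues a_ij - (row mean) - (column mean) + (overall mean). It is not the three-dimensional adaptation used by TriGen, whose exact form is not given here. The function name `mean_squared_residue` and the toy matrices are hypothetical, chosen for this example.

```python
def mean_squared_residue(matrix, rows, cols):
    """Two-dimensional MSR of the bicluster matrix[rows][cols].

    Illustrative sketch only: this is the classic bicluster form,
    not TriGen's three-dimensional variant.
    """
    # Extract the submatrix selected by the row and column index lists.
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])

    # Row means, column means, and the overall mean of the submatrix.
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows

    # Average the squared residues over all cells.
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical toy data: rows differ only by an additive shift,
# so the bicluster is perfectly coherent.
additive = [[1, 2, 3],
            [2, 3, 4],
            [4, 5, 6]]
coherent_score = mean_squared_residue(additive, [0, 1, 2], [0, 1, 2])

# A checkerboard pattern has no additive structure.
checker = [[1, 0],
           [0, 1]]
incoherent_score = mean_squared_residue(checker, [0, 1], [0, 1])
```

A score near zero indicates that the selected genes rise and fall together across the selected conditions up to an additive offset, which is why distance-like measures of this kind favour coherent values rather than coherent behavior at different magnitudes.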
- "In this paper a criterion for partitioning other than a constant value was also proposed, for example a two-way analysis of variance model and a mean squared residue scoring approach. Later this method was improved by  that introduced a backward pruning method for generating an optimal number of two-way clusters."
Article: Predictive Overlapping Co-Clustering
ABSTRACT: In the past few years co-clustering has emerged as an important data mining tool for two-way data analysis. Co-clustering has several advantages over traditional one-dimensional clustering, such as the ability to find highly correlated sub-groups of rows and columns. However, one of the overlooked benefits of co-clustering is that it can be used to extract meaningful knowledge for other knowledge extraction purposes. For example, building predictive models with high-dimensional data and a heterogeneous population is a non-trivial task. Co-clusters extracted from such data, which show similar patterns in both dimensions, can be used to build more accurate predictive models. Several applications, such as finding patient-disease cohorts in health care analysis, finding user-genre groups in recommendation systems and community detection problems, can benefit from a co-clustering technique that uses the predictive power of the data to generate co-clusters for improved data analysis. In this paper, we present the novel idea of Predictive Overlapping Co-Clustering (POCC) as an optimization problem for more effective predictive analysis. Our algorithm generates optimal co-clusters by maximizing the predictive power of the co-clusters subject to constraints on the number of row and column clusters. Precision, recall and F-measure are used as evaluation measures of the resulting co-clusters. The results of our algorithm have been compared with those of two other well-known techniques, K-means and spectral co-clustering, on four real data sets, namely Leukemia, Internet-Ads, Ovarian cancer and MovieLens. The results demonstrate the effectiveness and utility of our algorithm POCC in practice.
- "Good general references of books on clustering are Everitt et al. (2003), Kaufman and Rousseeuw (1990) and Gordon (1999), etc. Review papers about clustering methods applied in microarray data include Brazma and Vilo (2000), Jiang and Zhang (2002), Sharan and Shamir (2002), Tibshirani et al. (1999) and Tseng (2004), etc. Notably, there is no universal and single best clustering algorithm for all types of data (Jain and Dubes, 1988; Patrik, 2005). Each algorithm imposes its own biases on the clusters it constructs, and therefore, it is often difficult, if not impossible, to determine the superiority of specific algorithms. "
ABSTRACT: A new clustering algorithm, Message Passing Clustering (MPC), is proposed. MPC employs the concept of message passing to describe a parallel and spontaneous clustering process by allowing data objects to communicate with each other. MPC also provides an extensible framework to accommodate additional features in clustering, such as adaptive feature weight scaling, stochastic cluster merging, and semi-supervised constraint guiding. Extensive experiments were performed using both simulated and real microarray gene expression and phylogenetic data. The results showed that MPC compared favourably with other popular clustering algorithms, and that MPC with the additional features integrated gave an even higher accuracy rate than MPC alone.
International Journal of Data Mining and Bioinformatics 02/2008; 2(2):95-120. DOI:10.1504/IJDMB.2008.019092 · 0.66 Impact Factor