Clustering methods for the analysis of DNA microarray data
Technical report. Stanford: Department of Statistics, Stanford University;
It is now possible to simultaneously measure the expression of thousands of genes during cellular differentiation and response, through the use of DNA microarrays. A major statistical task is to understand the structure in the data that arise from this technology. In this paper we review various methods of clustering, and illustrate how they can be used to arrange both the genes and cell lines from a set of DNA microarray experiments. The methods discussed are global clustering techniques including hierarchical, K-means, and block clustering, and tree-structured vector quantization. Finally, we propose a new method for identifying structure in subsets of both genes and cell lines that are potentially obscured by the global clustering approaches.
1 Introduction
DNA microarrays and other high-throughput methods for analyzing complex nucleic acid samples make it now possible to measure rapidly, efficiently and accurately the levels of virtually all genes expressed in a biologi...
Available from: ijirset.com
- "Clustering is by far the most used method in gene expression analysis. Tibshirani et al. and Aas provide a classification of clustering methods into two categories: one-way clustering and two-way clustering. Methods of the first category are used to group either genes with similar behavior or samples with similar gene expression."
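The one-way clustering described in the excerpt above can be sketched with a plain K-means (Lloyd's algorithm) applied first to the rows (genes) and then to the columns (samples) of an expression matrix. This is only an illustration under stated assumptions: the `kmeans` helper and the toy matrix `expr` are hypothetical and do not come from any of the cited papers.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric lists."""
    rng = random.Random(seed)
    # Assumption: initialise centers from k distinct input points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy matrix: rows are genes, columns are samples.
expr = [
    [5.0, 5.1, 0.1, 0.0],
    [4.9, 5.2, 0.2, 0.1],
    [0.1, 0.0, 4.8, 5.0],
    [0.2, 0.1, 5.1, 4.9],
]

gene_labels = kmeans(expr, 2)                                  # cluster genes
sample_labels = kmeans([list(c) for c in zip(*expr)], 2)       # cluster samples
```

Clustering the transposed matrix is all it takes to switch from grouping genes to grouping samples. Two-way methods differ in that they group rows and columns jointly rather than in two independent passes.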
Available from: Francisco Martínez-Álvarez
- "We present the TriGen (Triclustering-Genetic based) algorithm, based on an evolutionary heuristic, genetic algorithms, which finds groups of pattern similarity for genes in a three-dimensional space, thus taking into account the gene, condition and time factors. Although many clustering and biclustering models define similarity based on distance functions, these functions are not always adequate to capture similarities among genes, since correlations may still exist among a set of genes which are expressed at different levels of magnitude. Therefore we propose two different evaluation functions: the first one finds triclusters of coherent values and is based on a three-dimensional adaptation of the Mean Square Residue measure (MSR), which is a classic biclustering distance measure for gene expression analysis; the second one is a correlation measure that identifies triclusters of coherent behavior based on the least squares approximation (LSL), which calculates the distances among the slopes of the least squares lines from a tricluster."
ABSTRACT: Analyzing microarray data represents a computational challenge due to the characteristics of these data. Clustering techniques are widely applied to create groups of genes that exhibit a similar behavior under the conditions tested. Biclustering emerges as an improvement on classical clustering, since it relaxes the constraints so that genes are grouped by their behavior under only a subset of the conditions rather than under all of them. However, this technique is not appropriate for the analysis of longitudinal experiments in which the genes are evaluated under certain conditions at several time points. We present the TriGen algorithm, a genetic algorithm that finds triclusters of gene expression that take into account the experimental conditions and the time points simultaneously. We have used TriGen to mine synthetic data as well as datasets from yeast (Saccharomyces cerevisiae) cell cycle and human inflammation and host response to injury experiments. TriGen has proved capable of extracting groups of genes with similar patterns in subsets of conditions and times, and these groups have been shown to be related in terms of their functional annotations extracted from the Gene Ontology.
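As a point of reference for the Mean Square Residue mentioned in the TriGen excerpt, the classic two-dimensional (bicluster) form of the measure can be sketched as follows. This is not TriGen's three-dimensional adaptation, whose exact definition is not given here. The function name `mean_squared_residue` and the toy matrix are hypothetical. Each residue is the entry minus its row mean, minus its column mean, plus the overall mean of the submatrix.

```python
def mean_squared_residue(matrix, rows, cols):
    """Classic 2-D mean squared residue of the submatrix matrix[rows][cols]."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(c) / n_rows for c in zip(*sub)]
    overall = sum(row_means) / n_rows
    # Residue of an entry: value - row mean - column mean + overall mean.
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)

# Hypothetical toy matrix: the first three rows differ only by a constant
# shift, so together they form a perfectly coherent (additive) bicluster.
data = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
    [9.0, 0.0, 7.0],
]

coherent = mean_squared_residue(data, [0, 1, 2], [0, 1, 2])
with_outlier = mean_squared_residue(data, [0, 1, 2, 3], [0, 1, 2])
```

A score near zero indicates that the selected rows follow the same pattern up to an additive offset, even though they are expressed at different levels. Adding the incoherent fourth row raises the score.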
Available from: Jaideep Srivastava
- "In this paper, criteria for partitioning other than a constant value were also proposed, for example a two-way analysis of variance model and a mean squared residue scoring approach. This method was later improved by the introduction of a backward pruning method for generating an optimal number of two-way clusters."
ABSTRACT: In the past few years, co-clustering has emerged as an important data mining
tool for two-way data analysis. Co-clustering has several advantages over
traditional one-dimensional clustering, such as the ability to find
highly correlated sub-groups of rows and columns. However, one of the
overlooked benefits of co-clustering is that it can be used to extract
meaningful knowledge for various other knowledge extraction purposes. For
example, building predictive models with high-dimensional data and
a heterogeneous population is a non-trivial task. Co-clusters extracted from such
data, which show similar patterns in both dimensions, can be used to build more
accurate predictive models. Several applications, such as finding
patient-disease cohorts in health care analysis, finding user-genre groups in
recommendation systems, and community detection, can benefit from
a co-clustering technique that uses the predictive power of the data to
generate co-clusters for improved data analysis.
In this paper, we present the novel idea of Predictive Overlapping
Co-Clustering (POCC) as an optimization problem for more effective
predictive analysis. Our algorithm generates optimal co-clusters by
maximizing the predictive power of the co-clusters subject to constraints on
the number of row and column clusters. Precision, recall, and
F-measure are used as evaluation measures of the resulting co-clusters.
The results of our algorithm are compared with two other well-known techniques,
K-means and spectral co-clustering, on four real data sets, namely Leukemia,
Internet-Ads, Ovarian cancer, and MovieLens. The results demonstrate
the effectiveness and utility of POCC in practice.
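The evaluation measures named in this abstract (precision, recall and F-measure) can be illustrated with a small sketch. The abstract does not say how POCC applies them to co-clusters, so the set-based formulation below is an assumption, and the function name and example identifiers are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Set-based precision, recall and F1 for one predicted group."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical example: members assigned to a group vs. the true members.
predicted_members = {"p1", "p2", "p3", "p4"}
true_members = {"p2", "p3", "p4", "p5", "p6"}

precision, recall, f1 = precision_recall_f1(predicted_members, true_members)
```

Here three of the four predicted members are correct (precision 0.75) and three of the five true members are recovered (recall 0.6).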