Conference Paper

Clustering Scatter Plots Using Data Depth Measures

DOI: 10.4172/2155-6180.S5-001
Conference: Proceedings of the 2010 International Conference on Data Mining, DMIN 2010, July 12-15, 2010, Las Vegas, Nevada, USA
Source: DBLP


Clustering is rapidly becoming a powerful data mining technique and has been broadly applied in domains such as bioinformatics and text mining. However, existing methods can only deal with a data matrix of scalars. In this paper, we introduce a hierarchical clustering procedure that can handle a data matrix of scatter plots. To reflect the nature of the data more accurately, we introduce a dissimilarity statistic based on "data depth" that measures the discrepancy between two bivariate distributions without oversimplifying the underlying pattern. We then combine hypothesis testing with hierarchical clustering to simultaneously cluster the rows and columns of the data matrix of scatter plots. We also propose novel painting metrics and construct heat maps to allow visualization of the clusters. We demonstrate the utility and power of our new clustering method through simulation studies and an application to a microbe-host-interaction study.
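The abstract does not give the dissimilarity statistic itself, so the following is only a rough sketch of the general idea, not the authors' method. It assumes Mahalanobis depth as the depth function and a Liu-Singh style quality index Q = P(D(X) <= D(Y)) as the building block, and it uses plain average-linkage clustering on a single list of scatter plots. The hypothesis-testing step and the simultaneous row/column clustering from the paper are not reproduced. All function names (`depth_fn`, `quality_index`, `depth_dissimilarity`, `average_linkage`, `make_plot`) are hypothetical.

```python
import random
from statistics import fmean


def depth_fn(sample):
    """Return a function giving the Mahalanobis depth 1 / (1 + d^2) of a
    2-D point with respect to `sample` (assumed stand-in depth measure)."""
    n = len(sample)
    mx = fmean(p[0] for p in sample)
    my = fmean(p[1] for p in sample)
    sxx = sum((x - mx) ** 2 for x, _ in sample) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in sample) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in sample) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix

    def depth(point):
        dx, dy = point[0] - mx, point[1] - my
        # Squared Mahalanobis distance via the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        return 1.0 / (1.0 + d2)

    return depth


def quality_index(ref, other):
    """Estimate Q = P(D_ref(X) <= D_ref(Y)), X drawn from ref, Y from other.
    Q is near 0.5 when both samples come from the same distribution."""
    depth = depth_fn(ref)
    d_ref = [depth(p) for p in ref]
    d_oth = [depth(p) for p in other]
    hits = sum(dx <= dy for dy in d_oth for dx in d_ref)
    return hits / (len(d_ref) * len(d_oth))


def depth_dissimilarity(a, b):
    """Symmetric, assumed dissimilarity: how far both indices are from 0.5."""
    return abs(quality_index(a, b) - 0.5) + abs(quality_index(b, a) - 0.5)


def average_linkage(dist, k):
    """Agglomerative clustering with average linkage until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = fmean(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


rng = random.Random(0)  # fixed seed so the toy data are reproducible


def make_plot(center_x, slope, n=150):
    """Simulate one scatter plot: x ~ N(center_x, 1), y = slope * x + noise."""
    pts = []
    for _ in range(n):
        x = rng.gauss(center_x, 1.0)
        pts.append((x, slope * x + rng.gauss(0.0, 0.5)))
    return pts


# Three plots with a positive trend, three shifted plots with a negative trend.
plots = [make_plot(0.0, 1.0) for _ in range(3)]
plots += [make_plot(3.0, -1.0) for _ in range(3)]

dist = [[depth_dissimilarity(a, b) for b in plots] for a in plots]
clusters = sorted(sorted(c) for c in average_linkage(dist, 2))
print(clusters)
```

On this toy data the two simulated groups are far apart, so the two clusters should separate the first three plots from the last three. Because each scatter plot is compared as a whole point cloud through depth ranks, no plot is reduced to a single scalar such as a correlation coefficient before clustering.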



Available from: Daniel R. Jeske, Dec 27, 2013
  • Source
    ABSTRACT: Organisms simplify the orchestration of gene expression by coregulating genes whose products function together in the cell. Many proteins serve different roles depending on the demands of the organism, and therefore the corresponding genes are often coexpressed with different groups of genes under different situations. This poses a challenge in analyzing whole-genome expression data, because many genes will be similarly expressed to multiple, distinct groups of genes. Because most commonly used analytical methods cannot appropriately represent these relationships, the connections between conditionally coregulated genes are often missed. We used a heuristically modified version of fuzzy k-means clustering to identify overlapping clusters of yeast genes based on published gene-expression data following the response of yeast cells to environmental changes. We have validated the method by identifying groups of functionally related and coregulated genes, and in the process we have uncovered new correlations between yeast genes and between the experimental conditions based on similarities in gene-expression patterns. To investigate the regulation of gene expression, we correlated the clusters with known transcription factor binding sites present in the genes' promoters. These results give insights into the mechanism of the regulation of gene expression in yeast cells responding to environmental changes. Fuzzy k-means clustering is a useful analytical tool for extracting biological insights from gene-expression data. Our analysis presented here suggests that a prevalent theme in the regulation of yeast gene expression is the condition-specific coregulation of overlapping sets of genes.
    Genome biology 11/2002; 3(11):RESEARCH0059. DOI:10.1186/gb-2002-3-11-research0059 · 10.81 Impact Factor
  • Source
    ABSTRACT: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression. The output is displayed graphically, conveying the clustering and the underlying expression data simultaneously in a form intuitive for biologists. We have found in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data. Thus patterns seen in genome-wide expression experiments can be interpreted as indications of the status of cellular processes. Also, coexpression of genes of known function with poorly characterized or novel genes may provide a simple means of gaining leads to the functions of many genes for which information is not available currently.
    Proceedings of the National Academy of Sciences 11/1999; 95(25):14863-8. DOI:10.1073/pnas.95.25.14863 · 9.67 Impact Factor
  • Source
    ABSTRACT: In this paper we present a structured overview of methods for two-mode clustering, that is, methods that provide a simultaneous clustering of the rows and columns of a rectangular data matrix. Key structuring principles include the nature of row, column and data clusters and the type of model structure or associated loss function. We illustrate with analyses of symptom data on archetypal psychiatric patients.
    Statistical Methods in Medical Research 11/2004; 13(5):363-94. DOI:10.1191/0962280204sm373ra · 4.47 Impact Factor