Conference Paper

A General Approach to Mining Quality Pattern-Based Clusters from Microarray Data

DOI: 10.1007/11408079_18 Conference: Database Systems for Advanced Applications, 10th International Conference, DASFAA 2005, Beijing, China, April 17-20, 2005, Proceedings
Source: DBLP


Pattern-based clustering has broad applications in microarray data analysis, customer segmentation, e-business data analysis, etc. However, pattern-based clustering often returns a large number of highly overlapping clusters, which makes it hard for users to identify interesting patterns in the mining results. Moreover, a general model for pattern-based clustering is lacking: different kinds of patterns or different measures of pattern coherence may require different algorithms. In this paper, we address these two problems by proposing a general quality-driven approach to mining top-k quality pattern-based clusters. We examine our quality-driven approach using real-world microarray data sets. The experimental results show that our method is general, effective, and efficient.

  • Source
    • "Clustering is an important data mining problem (Aggarwal et al., 1999; Aggarwal and Yu, 2000; Cheng et al., 1999; Ester et al., 1996; Pei et al., 2003). For a set of objects, clustering is the process of grouping the objects into a set of disjoint classes, called clusters, such that objects within a cluster have high similarity to each other, while objects in different clusters are dissimilar (Jiang et al., 2005). Recent efforts in data mining have focused on methods for efficient and effective cluster analysis (Zhang et al., 1996) in large databases, e.g., microarray datasets. "
    ABSTRACT: Mining subspace clusters from DNA microarrays could help researchers identify genes that commonly contribute to a disease, where a subspace cluster indicates a subset of genes whose expression levels are similar under a subset of conditions. Since the number of genes in a DNA microarray is far larger than the number of conditions, previously proposed algorithms that compute the maximum dimension sets (MDSs) for every pair of genes take a long time to mine subspace clusters. In this article, we propose the Large Itemset-Based Clustering (LISC) algorithm for mining subspace clusters. Instead of constructing MDSs for every pair of genes, we construct MDSs only for every pair of conditions. We then transform the task of finding the maximal possible gene sets into the problem of mining large itemsets from the condition-pair MDSs. Since we are interested only in subspace clusters with gene sets as large as possible, it is desirable to focus on gene sets that have reasonably large support values in the condition-pair MDSs. Our simulation results show that the proposed algorithm needs less processing time than previously proposed algorithms that need to construct gene-pair MDSs.
    Journal of computational biology: a journal of computational molecular cell biology 06/2009; 16(5):745-68. DOI:10.1089/cmb.2008.0161 · 1.74 Impact Factor
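
The condition-pair idea in the abstract above can be illustrated with a minimal sketch. The abstract does not define an MDS, so the code assumes a pCluster-style definition: for a pair of conditions (a, b), an MDS is a maximal set of genes whose expression differences between a and b lie within a tolerance `delta`. The function names `condition_pair_mds` and `gene_pair_support`, the `delta` and `min_genes` parameters, and the toy data are all hypothetical, and counting the support of gene pairs is a simplified stand-in for a full large-itemset miner. This is not the LISC algorithm itself.

```python
from itertools import combinations
from collections import Counter


def condition_pair_mds(data, a, b, delta, min_genes=2):
    """Return maximal gene sets whose (a - b) differences span at most delta.

    Assumption: pCluster-style MDS definition, not taken from the abstract.
    data is a list of rows (one row per gene, one column per condition).
    """
    # Sort genes by their expression difference between conditions a and b.
    diffs = sorted((row[a] - row[b], g) for g, row in enumerate(data))
    result = []
    start = 0
    for end in range(len(diffs)):
        # Shrink the window from the left until its spread fits within delta.
        while diffs[end][0] - diffs[start][0] > delta:
            start += 1
        # The window is maximal if it cannot be extended to the right.
        is_last = end == len(diffs) - 1
        if is_last or diffs[end + 1][0] - diffs[start][0] > delta:
            if end - start + 1 >= min_genes:
                result.append(frozenset(g for _, g in diffs[start:end + 1]))
    return result


def gene_pair_support(data, delta, min_genes=2):
    """Count, for each gene pair, how many condition pairs have an MDS containing it."""
    n_conditions = len(data[0])
    support = Counter()
    for a, b in combinations(range(n_conditions), 2):
        # Collect pairs in a set so overlapping MDSs of the same
        # condition pair do not count a gene pair twice.
        pairs = set()
        for mds in condition_pair_mds(data, a, b, delta, min_genes):
            pairs.update(combinations(sorted(mds), 2))
        support.update(pairs)
    return support


# Toy data (hypothetical): gene 1 is gene 0 shifted by a constant.
data = [
    [1, 3, 6],  # gene 0
    [2, 4, 7],  # gene 1
    [5, 1, 9],  # gene 2
]
support = gene_pair_support(data, delta=0)
```

Gene sets with high support across condition-pair MDSs are the candidates the abstract describes as worth attention. Because the number of conditions is small, the outer loop over condition pairs stays cheap even when there are many genes.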
  • Source
    ABSTRACT: In this paper, we propose a novel algorithm to discover the top-k covering rule groups for each row of gene expression profiles. Several experiments on real bioinformatics datasets show that the new top-k covering rule mining algorithm is orders of magnitude faster than previous association rule mining algorithms. Furthermore, we propose a new classification method, RCBT. The RCBT classifier is constructed from the top-k covering rule groups. The rule groups generated for building RCBT are bounded in number. This is in contrast to existing rule-based classification methods like CBA [19], which, despite generating an excessive number of redundant rules, are still unable to cover some training data with the discovered rules. Experiments show that the RCBT classifier can match or outperform other state-of-the-art classifiers on several benchmark gene expression datasets. In addition, the top-k covering rule groups themselves directly provide insights into the mechanisms responsible for diseases.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005; 01/2005