Conference Paper

A General Approach to Mining Quality Pattern-Based Clusters from Microarray Data.

DOI: 10.1007/11408079_18 Conference: Database Systems for Advanced Applications, 10th International Conference, DASFAA 2005, Beijing, China, April 17-20, 2005, Proceedings
Source: DBLP

ABSTRACT Pattern-based clustering has broad applications in microarray data analysis, customer segmentation, e-business data analysis, etc. However, pattern-based clustering often returns a large number of highly overlapping clusters, which makes it hard for users to identify interesting patterns in the mining results. Moreover, a general model for pattern-based clustering is lacking: different kinds of patterns or different measures of pattern coherence may require different algorithms. In this paper, we address these two problems by proposing a general quality-driven approach to mining top-k quality pattern-based clusters. We examine our quality-driven approach using real-world microarray data sets. The experimental results show that our method is general, effective, and efficient.

  • ABSTRACT: Pattern-based clustering, which captures the similarity of the patterns exhibited by objects in a subset of dimensions, has broad applications in DNA microarray data analysis, customer segmentation, e-business data analysis, etc. However, pattern-based clustering often returns a large number of highly overlapping clusters, which makes it hard for users to identify interesting patterns in the huge set of mining results. Moreover, there is no general measure for evaluating the quality of the clusters that pattern-based clustering produces. In this paper, we discuss the factors that cause high overlap, perform error analysis and pattern weighting, and propose qScore as a key parameter for evaluating cluster quality. An algorithm based on qScore is presented to address the problem of high overlap and obtain better-quality clustering results.
    Eighth International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2011, 26-28 July 2011, Shanghai, China; 01/2011
  • ABSTRACT: Mining subspace clusters from DNA microarrays could help researchers identify genes that commonly contribute to a disease, where a subspace cluster indicates a subset of genes whose expression levels are similar under a subset of conditions. Since the number of genes in a DNA microarray is far larger than the number of conditions, previously proposed algorithms that compute the maximum dimension sets (MDSs) for every pair of genes take a long time to mine subspace clusters. In this article, we propose the Large Itemset-Based Clustering (LISC) algorithm for mining subspace clusters. Instead of constructing MDSs for every pair of genes, we construct MDSs only for every pair of conditions. Then, we transform the task of finding the maximal possible gene sets into the problem of mining large itemsets from the condition-pair MDSs. Since we are interested only in subspace clusters with gene sets as large as possible, it is desirable to focus on gene sets that have reasonably large support values in the condition-pair MDSs. Our simulation results show that the proposed algorithm needs less processing time than previously proposed algorithms that must construct gene-pair MDSs.
    Journal of computational biology: a journal of computational molecular cell biology 06/2009; 16(5):745-68. · 1.69 Impact Factor
  • ABSTRACT: In this paper, we propose a novel algorithm to discover the top-k covering rule groups for each row of gene expression profiles. Several experiments on real bioinformatics datasets show that the new top-k covering rule mining algorithm is orders of magnitude faster than previous association rule mining algorithms. Furthermore, we propose a new classification method, RCBT. The RCBT classifier is constructed from the top-k covering rule groups. The rule groups generated for building RCBT are bounded in number. This is in contrast to existing rule-based classification methods like CBA [19], which, despite generating an excessive number of redundant rules, is still unable to cover some training data with the discovered rules. Experiments show that the RCBT classifier can match or outperform other state-of-the-art classifiers on several benchmark gene expression datasets. In addition, the top-k covering rule groups themselves directly provide insights into the mechanisms responsible for diseases.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005; 01/2005
