Discriminative frequent subgraph mining with optimality guarantees

Statistical Analysis and Data Mining 09/2010; 3(5):302 - 318. DOI: 10.1002/sam.10084
Source: DBLP

ABSTRACT The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality function criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 302-318, 2010

  • [Show abstract] [Hide abstract]
    ABSTRACT: One of the most powerful techniques to study protein structures is to look for recurrent fragments (also called substructures or spatial motifs), then use them as patterns to characterize the proteins under study. An emergent trend consists in parsing proteins three-dimensional (3D) structures into graphs of amino acids. Hence, the search of recurrent spatial motifs is formulated as a process of frequent subgraph discovery where each subgraph represents a spatial motif. In this scope, several efficient approaches for frequent subgraph discovery have been proposed in the literature. However, the set of discovered frequent subgraphs is too large to be efficiently analyzed and explored in any further process. In this paper, we propose a novel pattern selection approach that shrinks the large number of discovered frequent subgraphs by selecting the representative ones. Existing pattern selection approaches do not exploit the domain knowledge. Yet, in our approach we incorporate the evolutionary information of amino acids defined in the substitution matrices in order to select the representative subgraphs. We show the effectiveness of our approach on a number of real datasets. The results issued from our experiments show that our approach is able to considerably decrease the number of motifs while enhancing their interestingness.
  • [Show abstract] [Hide abstract]
    ABSTRACT: Graph classification is an important data mining task, and various graph kernel methods have been proposed recently for this task. These methods have proven to be effective, but they tend to have high computational overhead. In this paper, we propose an alternative approach to graph classification that is based on feature vectors constructed from different global topological attributes, as well as global label features. The main idea is that the graphs from the same class should have similar topological and label attributes. Our method is simple and easy to implement, and via a detailed comparison on real benchmark datasets, we show that our topological and label feature-based approach delivers competitive classification accuracy, with significantly better results on those datasets that have large unlabeled graph instances. Our method is also substantially faster than most other graph kernels. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012 © 2012 Wiley Periodicals, Inc.
    Statistical Analysis and Data Mining 08/2012; 5(4):265-283.
  • [Show abstract] [Hide abstract]
    ABSTRACT: Classification of structured data is essential for a wide range of problems in bioinformatics and cheminformatics. One such problem is in silico prediction of small molecule properties such as toxicity, mutagenicity and activity. In this paper, we propose a new feature selection method for graph kernels that uses the subtrees of graphs as their feature sets. A masking procedure which boils down to feature selection is proposed for this purpose. Experiments conducted on several data sets as well as a comparison of our method with some frequent subgraph based approaches are presented.
    International Journal of Data Mining and Bioinformatics 01/2013; 8(3):294-310. · 0.39 Impact Factor


Available from