# Discriminative frequent subgraph mining with optimality guarantees

**ABSTRACT** The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 302-318, 2010



Marisa Thoma∗ · Hong Cheng† · Arthur Gretton‡ · Jiawei Han§ · Hans-Peter Kriegel∗ · Alex Smola¶ · Le Song‡ · Philip S. Yu‖ · Xifeng Yan∗∗ · Karsten M. Borgwardt††

July 7, 2010


1 Introduction

In a graph classification problem, we are given a set of training graphs {G1, ..., Gn} with class labels {(G_i, y_i)}_{i=1}^{n}, y_i ∈ {1, ..., K}. Given these training examples, our task is to train a classifier for correctly predicting the labels of unclassified test graphs.

Such graph classification algorithms have a wide variety of real-world applications. In biology and chemistry, for example, graph classification quantitatively correlates chemical structures with biological and chemical processes, such as active or inactive in an anti-cancer screen, or toxic or non-toxic to human beings [21]. This makes graph classification scientifically and commercially valuable (e.g. in drug discovery). In computer vision, images can be abstracted as graphs, where nodes are spatial entities and edges are their mutual relationships. Such models can be used to identify the type of foreground objects in an image. In software engineering, a program can also be modeled as a graph, where program blocks are nodes and flows of the program are edges. Static and dynamic analysis of program behaviors can then be carried out on these graphs. For instance, anomaly detection of control flows is essentially a graph classification problem.

∗ Institute for Informatics, Ludwig-Maximilians-Universität München
† Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong
‡ School of Computer Science, Carnegie Mellon University
§ University of Illinois at Urbana-Champaign
¶ Yahoo! Research, Santa Clara, California
‖ University of Illinois at Chicago, Chicago, Illinois
∗∗ Department of Computer Science, University of California at Santa Barbara
†† Max Planck Institute for Developmental Biology and Max Planck Institute for Biological Cybernetics, Tübingen

Recent research in graph classification comprises three branches:

• first, the family of frequent pattern approaches [19, 10, 8]. Each graph is represented by its frequent subgraphs, i.e., its set of subgraphs that occur in at least σ% of all graphs in the database. This frequent pattern approach is also referred to as the (frequent) substructure or fragment approach, and we will use these terms interchangeably.

• second, the family of approaches that consider all subgraphs of a certain type in a graph [18, 36, 30]. For instance, the graph kernels by [18, 30] belong to this class; they count common walks and subtree patterns in two graphs, respectively.

• third, the family of wrapper approaches that select informative subgraphs for classification during the training phase. Typical instances of this family are the boosting approach by [22] and the lasso approach by [33].

In this article, we are concerned with the first of these three families, the family of frequent subgraph approaches. There are two reasons for adopting frequent subgraphs in graph classification. First, it is computationally difficult to enumerate all of the substructures existing in a large graph dataset, while it is possible to mine frequent patterns due to the recent development of efficient graph mining algorithms. Second, the discriminative power of extremely infrequent substructures is small due to their limited coverage in the dataset.

Therefore, it is a promising approach to use frequent substructures as features in classification models.

However, the vast number of substructures poses three challenges.

1. Redundancy: Most frequent substructures only differ slightly in structure and co-occur in the same graphs.

2. Statistical significance: Frequency alone is not a good measure of the discriminative power of a subgraph, as both frequent and infrequent subgraphs may be uniformly distributed over all classes. Only frequent subgraphs whose presence is statistically significantly correlated with class membership are promising contributors for classification.

3. Efficiency: Very frequent subgraphs are not useful for classification due to their lack of discriminative power. Therefore, frequent subgraph based classification usually sets an extremely low frequency threshold, resulting in thousands or even millions of features. Given such a tremendous number of features, any runtime- or memory-intensive feature selection algorithm will fail.

Consequently, we need an efficient algorithm to select discriminative features among a large number of frequent subgraphs. In [32], we introduced a near-optimal approach to feature selection among frequent subgraphs generated by gSpan [39] for two-class problems. Our method greedily chooses frequent subgraphs according to the submodular quality criterion CORK (Correspondence-based Quality Criterion). The use of a submodular function in a greedy approach ensures a solution close to the optimal solution [24]. We furthermore showed that CORK can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining. Other approaches use heuristic strategies for feature selection (such as [8, 13]) or do not provide optimality guarantees [22, 29, 28, 33, 38, 17]. We will present an overview of related algorithms in Section 3.1.

Goal. The goal of this paper is to refresh the idea of near-optimal feature selection in subgraph patterns and to introduce improvements for future use. As a review of [32], we will first formalize the optimization problem to be solved (Section 2.1), and then we will summarize the essential ingredients of our graph feature selector: first, submodularity and its use in feature selection (Section 2.2); second, gSpan, the method to find frequent subgraphs (Section 2.3). We will review our selection criterion CORK for two-class problems in Section 2.4, and explain its integration as an additional pruning criterion into pattern-growth based graph miners like gSpan in Section 2.6.

Many applications for graph learning actually define more than the commonly-used two classes: biological molecules can be categorized into a wide catalog of functional or structural classes, social network communities are involved with various topics, and process flows can be analyzed with respect to multiple attributes. As a new contribution, we will thus generalize CORK to multi-class problems in Section 2.7.

Finally, to increase the flexibility of our algorithm, in Section 2.8 we will also provide an extension for using the proposed pruning approach on pre-mined graphs. After a review of related work in Section 3, we thoroughly evaluate the proposed algorithms in Section 4 on 11 real-world datasets, and conclude with a discussion and outlook in Section 5.

2 Near-optimal feature selection among frequent subgraphs

We formalize the given dataset as a collection of graphs G = ∪_{i=1}^{K} K_i that each belong to one of the K classes K_i. In this paper we exclude overlapping classes; however, the proposed selection approach can easily be extended to graphs with multiple labels.

As a notational convention, the vertex set of a graph G ∈ G is denoted by V(G) and the edge set by E(G). A label function, l, maps a vertex or an edge to a label. A graph G is a subgraph of another graph G′ if there exists a subgraph isomorphism from G to G′, denoted by G ⊑ G′. Accordingly, G′ is called a super-graph of G (G′ ⊒ G). Due to its importance for this article, we here recite the definition of a subgraph isomorphism.

Definition 2.1. (Subgraph Isomorphism) A subgraph isomorphism is an injective function f : V(G) → V(G′), such that

1. ∀u ∈ V(G): l(u) = l′(f(u)), and

2. ∀(u,v) ∈ E(G): (f(u), f(v)) ∈ E(G′) and l(u,v) = l′(f(u), f(v)),

where l and l′ are the label functions of G and G′, respectively. f is called an embedding of G in G′.

Given a graph database G, we denote by G_{G1} the number of graphs in G of which G is a subgraph, and by G_{G0} the number of graphs in G of which G is not a subgraph. G_{G1} is called the (absolute) support. A graph G is frequent if its support is no less than a minimum support threshold, σ. Hence, being frequent is a relative concept: whether or not a graph is frequent depends on the value of σ and on the number of elements |G| contained in G.
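A minimal sketch of support counting follows. To keep it self-contained, graphs are represented as sets of labeled edges and set inclusion stands in for the NP-complete subgraph isomorphism test; the data and function names are invented for illustration.

```python
# Toy stand-in: represent each graph by its set of labeled edges and use
# edge-set inclusion as a coarse proxy for subgraph isomorphism G ⊑ G'.
# A real miner would run a proper isomorphism test here.
def contains(pattern, graph):
    return pattern <= graph  # frozenset inclusion

def support(pattern, database):
    """Absolute support: number of graphs in the database containing the pattern."""
    return sum(1 for g in database if contains(pattern, g))

database = [
    frozenset({("X", "a", "Y"), ("Y", "b", "Z")}),
    frozenset({("X", "a", "Y")}),
    frozenset({("Y", "b", "Z"), ("Z", "a", "X")}),
]

pattern = frozenset({("X", "a", "Y")})
sigma = 2  # minimum support threshold
print(support(pattern, database))           # 2
print(support(pattern, database) >= sigma)  # True: the pattern is frequent
```

With σ = 2 and |G| = 3, the pattern is frequent; raising σ to 3 would make it infrequent, illustrating the relativity of the notion.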


2.1 Combinatorial optimization problem. Feature selection among frequent subgraphs can be defined as a combinatorial optimization problem. We denote by D the full set of features, which in our case will correspond to the frequent subgraphs generated by gSpan. When using these features to predict the class membership of individual graph instances, clearly, only a subset E ⊆ D of features will be relevant. We denote the relevance of a feature set for class membership by q(E), where q is a quality criterion measuring the discriminative power of E. It is computed by restricting the dataset's representation to the features in E. We then formulate feature selection as:

D‡ = argmax_{E ⊆ D} q(E)  s.t.  |E| ≤ s    (2.1)

where |·| computes the cardinality of a set and s is the maximally allowed number of selected features.

The optimal solution of this problem would require us to search all possible subsets of features exhaustively. Due to the exponential number of all feature combinations, this approach is prohibitive for large feature sets like frequent subgraphs. The common remedy is to resort to heuristic alternatives, whose solutions cannot be guaranteed to be globally optimal or even close to the globally optimal solution. Hence, the key point in this article is to employ a heuristic approach which does allow for such quality guarantees, namely a greedy strategy which achieves near-optimal results.

2.2 Feature selection and submodularity. Assume that we are measuring the discriminative power q(E) of a feature set E in terms of a quality function q. A near-optimal solution is reached for a submodular quality function q when used in combination with greedy feature selection. Greedy forward feature selection consists in iteratively picking the feature that, in union with the features selected so far, maximises the quality function q over the prospective feature set. In general, this strategy will not yield an optimal solution, but it can be shown to yield a near-optimal solution if q is submodular:

Definition 2.2. (Submodular set function) A quality function q is said to be submodular on a set D if for E′ ⊆ E ⊆ D and X ∈ D \ E,

q(E′ ∪ {X}) − q(E′) ≥ q(E ∪ {X}) − q(E)    (2.2)

If q is submodular and we employ greedy forward feature selection, then we can exploit the following theorem from [24]:

Theorem 2.1. If q is a submodular, non-decreasing set function on a set D and q(∅) = 0, then greedy forward feature selection is guaranteed to find a set of features E† ⊆ D such that

q(E†) ≥ (1 − 1/e) · max_{E ⊆ D: |E| = s} q(E),    (2.3)

where s is the number of features to be selected.

As a direct consequence, the result from greedy feature selection achieves at least (1 − 1/e) ≈ 63% of the score of the optimal solution to the feature selection problem. This property is referred to as being near-optimal in the literature (e.g. [14]).
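Greedy forward selection under a submodular, non-decreasing q can be sketched as follows. This is a toy illustration, not the paper's implementation: the coverage function and feature sets below are invented (coverage is a standard submodular function), and the (1 − 1/e) guarantee is checked against the exhaustive optimum.

```python
import math
from itertools import combinations

# Invented toy features: q is a coverage function (submodular, non-decreasing,
# q(empty) == 0), standing in for a criterion such as CORK.
features = {
    "F1": {1, 2, 3},
    "F2": {3, 4},
    "F3": {5, 6, 7, 8},
    "F4": {1, 5},
}

def q(selected):
    covered = set()
    for f in selected:
        covered |= features[f]
    return len(covered)

def greedy_select(s):
    """Iteratively add the feature with maximal marginal gain (Section 2.2)."""
    chosen = []
    for _ in range(s):
        best = max((f for f in features if f not in chosen),
                   key=lambda f: q(chosen + [f]))
        chosen.append(best)
    return chosen

s = 2
greedy_val = q(greedy_select(s))
optimal_val = max(q(list(c)) for c in combinations(features, s))
# Theorem 2.1: the greedy score is at least (1 - 1/e) of the optimum.
assert greedy_val >= (1 - 1 / math.e) * optimal_val
```

On this instance greedy even matches the exhaustive optimum; the theorem only guarantees the ≈ 63% bound in general.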

Figure 1: gSpan: rightmost extension. [Figure: three copies of an example graph with DFS-tree vertices v0, v1, v2, v3, vertex labels X, Y, Z and edge labels a, b; the middle copy illustrates a backward extension, the right copy a forward extension.]

2.3 gSpan. If we found a useful submodular criterion for feature selection on frequent subgraphs, we could yield a near-optimal solution to problem (2.1). But how do we determine the frequent subgraphs in the first place? For this purpose, we use the frequent subgraph mining algorithm gSpan [39], which we will outline in the following.

The discovery of frequent graphs usually consists of two steps. In the first step, we generate frequent subgraph candidates, while in the second step, we check the frequency of each candidate. The second step involves a subgraph isomorphism test, which is NP-complete. Fortunately, efficient isomorphism testing algorithms have been developed, making such testing affordable in practice. Most studies of frequent subgraph discovery pay attention to the first step; that is, how to generate as few frequent subgraph candidates as possible, and as fast as possible.

The initial frequent graph mining algorithms, such as AGM [16], FSG [23] and the path-join algorithm [35], share similar characteristics with Apriori-based itemset mining [1]. All of them require a join operation to merge two (or more) frequent substructures into one larger substructure candidate. To avoid this overhead, non-Apriori-based algorithms such as gSpan [39], MoFa [3], FFSM [15], and Gaston [25] adopt the pattern-growth methodology, which attempts to generate candidate graphs from a single graph. For each discovered graph G, these methods recursively add new edges until all the frequent supergraphs of G have been discovered. The recursion stops once no more frequent graph can be generated.

gSpan introduced a sophisticated extension method, which is built on a depth-first search (DFS) tree. Given a graph G, we label the root, i.e. the starting vertex of the DFS tree, as v0, and the last visited vertex as vn. vn is also called the rightmost vertex. Consequently, the straight path from v0 to vn is the rightmost path. Figure 1 shows an example. The darkened edges form a DFS tree. The vertices are discovered in the order v0, v1, v2, v3; thus v3 is the rightmost vertex. The rightmost path is (v0, v1, v3).

This method, called rightmost extension, restricts the extension of new edges in a graph as follows: for a given graph and a DFS tree, a new edge e can be added between the rightmost vertex and other vertices on the rightmost path (backward extension), or it can introduce a new vertex originating from a vertex on the rightmost path (forward extension). As we do not allow duplicate connections, the only legal backward extension candidate of the graph in Figure 1 is (v3, v0). The forward extension candidates can be edges from v3, v1, or v0 introducing a new vertex. Since there may be multiple DFS trees for one graph, gSpan establishes a set of rules to select one of them as representative, so that the backward and forward extensions will only take place in one DFS tree. One of those rules is the restriction of newly generated edges to the vertices along the rightmost path. Another rule, the minimality test, checks that the currently examined graph has not been treated before. For a detailed description of gSpan, see [39].
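The pattern-growth recursion can be illustrated on sequences instead of graphs, where rightmost extension is trivially canonical: appending a symbol never produces the same pattern twice, so no minimality test is needed. All names below are illustrative, not gSpan's actual code.

```python
# Toy analogue of gSpan's pattern-growth recursion on strings instead of
# graphs: a pattern grows one symbol at a time ("rightmost extension"), and
# a branch is pruned as soon as the pattern drops below minimum support.
ALPHABET = "ab"

def support(pattern, database):
    return sum(1 for seq in database if pattern in seq)

def mine(pattern, database, sigma, result):
    """Recursively grow `pattern`; collect all frequent patterns in `result`."""
    if support(pattern, database) < sigma:
        return result  # anti-monotone pruning: no extension can be frequent
    if pattern:
        result.append(pattern)
    for symbol in ALPHABET:  # rightmost (suffix) extension
        mine(pattern + symbol, database, sigma, result)
    return result

db = ["abab", "abba", "baba"]
frequent = mine("", db, sigma=2, result=[])
print(sorted(frequent))  # ['a', 'ab', 'aba', 'b', 'ba', 'bab']
```

The anti-monotonicity of support plays the same role as in gSpan: once a pattern is infrequent, the whole extension subtree is skipped.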

Algorithm 2.1. gSpan(G, G, σ, S)
Input: Graph G, graph dataset G, threshold σ, set of subgraphs S
Output: The set of frequent subgraphs S.

1: if G ≠ dfs(G) then    // G is not minimal
2:    return S
3: insert G into S
4: set C to ∅
5: scan G once: find all the edges e such that G can be rightmost extended to G ⋄r e
6: insert G ⋄r e into C and count its frequency
7: for each frequent G ⋄r e in C do
8:    call gSpan(G ⋄r e, G, σ, S)
9: done
10: return S

Algorithm 2.1 outlines the pseudocode of gSpan. G ⋄r e denotes that an edge e extends graph G via rightmost extension. Step 1 is the minimality test, where dfs(G), the canonical form of graph G [39], is compared to the edge order of G. Therefore, G is only processed at its first encounter.

Once we have determined the frequent subgraphs using gSpan, a natural way of representing each graph G is in terms of a binary indicator vector of length |S|:

Definition 2.3. (Indicator vector) Given a graph Gi from a dataset G and a set of frequent subgraph features S discovered by gSpan, we define an indicator vector v(i) for Gi as

v(i)_d = 1 if S_d ⊑ G_i (S_d is a subgraph of G_i), and v(i)_d = 0 otherwise,    (2.4)

where v(i)_d is the d-th component of v(i) and S_d is the d-th graph in S.
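A minimal sketch of Definition 2.3, using set inclusion as a stand-in for the subgraph test S_d ⊑ G_i; the patterns and graphs here are invented:

```python
# Build binary indicator vectors over a list of mined patterns.
# Patterns and graphs are frozensets of (invented) edge identifiers.
patterns = [frozenset({"e1"}), frozenset({"e2"}), frozenset({"e1", "e3"})]
graphs = [frozenset({"e1", "e3"}), frozenset({"e2"})]

def indicator(graph, patterns):
    """v_d = 1 iff pattern S_d is contained in the graph (Equation 2.4)."""
    return [1 if p <= graph else 0 for p in patterns]

vectors = [indicator(g, patterns) for g in graphs]
print(vectors)  # [[1, 0, 1], [0, 1, 0]]
```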

2.4 Definition of CORK. We now define our feature selection criterion q for two-class problems. It will be generalized to multi-class problems in Section 2.7.

Definition 2.4. Let G be a dataset of binary vectors, consisting of two disjoint classes G = A ∪ B. Let D denote a set of features of the data objects in G, represented by indicator vectors v(i) for graphs Gi ∈ G.

As we aim to separate the two classes, we pay specific attention to pairs of inter-class instances with the same pattern in the given feature set. These instance pairs are correspondences:

Definition 2.5. (Correspondence) A pair of data objects (v(i), v(j)) is called a correspondence in a set of features indicated by indices U ⊆ {1, ..., |D|} (or, w.r.t. a set of features U) iff

(v(i) ∈ A) ∧ (v(j) ∈ B) ∧ ∀d ∈ U: (v(i)_d = v(j)_d),    (2.5)

where v(i)_d is the value of feature d in vector v(i).

Our quality criterion consequently punishes the number of correspondences remaining under a given feature set.

Definition 2.6. (CORK) We define a quality criterion q, called CORK (Correspondence-based Quality Criterion), for a subset of features E as

q(E) = (−1) · (number of correspondences w.r.t. E)    (2.6)

Theorem 2.2. q is submodular.


Proof. For q to be submodular, adding feature X ∈ D to a feature set E′ ⊆ E ⊆ D has to increase q(E′) at least as much as adding feature X to E increases q(E). This law of diminishing returns is obviously fulfilled if removing a correspondence from E by adding feature X also results in a correspondence being eliminated in E′ by adding feature X.

Let us first state that an instance pair (v(i), v(j)) that is a correspondence in E must also be a correspondence in E′. Note that the opposite is not necessarily true.

In the following, let x be the index of feature X in D. Whenever adding a feature X to E removes the above correspondence from E, this means that v(i)_x ≠ v(j)_x, since the other features in E must match. Therefore, the two formerly corresponding feature patterns for (v(i), v(j)) cannot match in E′ ∪ {X} either. Thus, if a feature X eliminates a correspondence from E, this very correspondence (possibly together with further correspondences) is also removed from E′, and we satisfy the submodularity condition of Equation 2.2. □

This submodular criterion can be turned (by adding the constant |A| · |B|) into a submodular set function fulfilling the conditions of Theorem 2.1.
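CORK and its diminishing-returns property can be checked numerically. The vectors below are made up for illustration; q counts correspondences naively over all inter-class pairs, directly following Definitions 2.5 and 2.6.

```python
from itertools import product

# Naive CORK (Definition 2.6): q(E) = -(number of inter-class pairs whose
# indicator vectors agree on every feature index in E). Toy two-class data.
A = [(1, 0, 1), (0, 1, 0)]   # class A, vectors over three features
B = [(1, 0, 0), (0, 1, 0)]   # class B

def q(E):
    """E is a set of feature indices; an empty E leaves every pair a correspondence."""
    return -sum(1 for a, b in product(A, B) if all(a[d] == b[d] for d in E))

# Diminishing returns (Equation 2.2): the gain from adding feature 2 to the
# smaller set {0} is at least the gain from adding it to the larger set {0, 1}.
gain_small = q({0, 2}) - q({0})
gain_large = q({0, 1, 2}) - q({0, 1})
assert gain_small >= gain_large
print(q(set()), q({2}))  # -4 -2
```

With no features selected, all |A| · |B| = 4 inter-class pairs are correspondences; adding discriminative features removes them and raises q toward 0.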

2.5 Computation of CORK. The CORK value for one feature X in a dataset of the classes A and B can be computed from the number of inter-class pairs of objects that both contain X (with A_{X1} instances in A and B_{X1} instances in B) or that both do not contain X (A_{X0} and B_{X0} objects):

q({X}) = −(A_{X0} · B_{X0} + A_{X1} · B_{X1})    (2.7)

For feature sets, CORK can be efficiently computed by recursively dividing the dataset into equivalence classes:

Definition 2.7. (Equivalence Classes) Given a two-class dataset G = A ∪ B represented as binary indicator vectors over the feature set U, let P ⊆ 2^U be the set of all unique binary indicator vectors occurring in G, with |P| = l. Then the equivalence class of an indicator vector v(i) ∈ G is defined as the set

{v(j) | v(j) ∈ G ∧ ∀d ∈ U: v(i)_d = v(j)_d}    (2.8)

Each of these unique indicator vectors P_c forms an equivalence class E_c (c ∈ {1, ..., l}) containing all graphs with an indicator vector equal to P_c. We denote by

A_{Pc} = |{v(i) ∈ A | ∀d ∈ U: v(i)_d = P_c[d]}|    (2.9)

the number of instances of equivalence class E_c in A, and by

B_{Pc} = |{v(i) ∈ B | ∀d ∈ U: v(i)_d = P_c[d]}|    (2.10)

the number of instances of equivalence class E_c in B.

In each greedy iteration step, those equivalence classes can be efficiently split into hits and misses. The CORK score for a feature set U ⊆ {1, ..., |D|} can thus be calculated by adding up the correspondences of all occurring equivalence classes E_c in U:

q(U) = (−1) · Σ_{P_c ∈ P} (A_{Pc} · B_{Pc})    (2.11)
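Equation 2.11 can be sketched by bucketing vectors into equivalence classes with a hash map; the helper names and data below are our own. The result agrees with naive pair counting while touching each vector only once per evaluation instead of enumerating all |A| · |B| pairs.

```python
from collections import Counter
from itertools import product

# CORK via equivalence classes (Equation 2.11): group each vector by its
# restriction to the feature set U, then sum A_Pc * B_Pc over unique patterns.
A = [(1, 0, 1), (0, 1, 0), (1, 0, 0)]   # toy class A
B = [(1, 0, 0), (0, 1, 0)]              # toy class B

def q_equivalence(U):
    key = lambda v: tuple(v[d] for d in sorted(U))
    counts_A = Counter(key(v) for v in A)
    counts_B = Counter(key(v) for v in B)
    return -sum(counts_A[p] * counts_B[p] for p in counts_A)

def q_naive(U):
    """Reference implementation: count correspondences pair by pair."""
    return -sum(1 for a, b in product(A, B) if all(a[d] == b[d] for d in U))

for U in [set(), {0}, {1, 2}, {0, 1, 2}]:
    assert q_equivalence(U) == q_naive(U)
```

The equivalence-class form is what makes greedy iterations cheap: each newly added feature merely splits existing buckets into hits and misses.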

We can now use q for greedy forward feature selection on a pre-mined set S of frequent subgraphs in G and receive a result set S† ⊆ S of discriminative subgraphs with a guaranteed quality bound. However, the success of S† strongly depends on the choice of the minimum support σ. If σ is chosen too low, we can quickly generate too many features for the selection step to finish in a reasonable runtime. Setting σ too high can cause the loss of all informative features. In the following, we will introduce a selection approach which directly mines only discriminative subgraphs, which is nested in gSpan and which can act independently from a frequency threshold.

2.6 Pruning gSpan's search space via CORK. gSpan exploits the fact that the frequency of a subgraph S ∈ S is an upper bound for the frequency of all of its supergraphs T ⊒ S (all subgraphs containing S) when pruning the search space for frequent subgraphs. We will show how to derive an upper bound for the CORK values of all supergraphs of a subgraph S, which allows us to further prune the search space.

Let us emphasize that this technique can also be applied in other graph miners which employ a kind of hierarchical subgraph pattern growth [3, 15, 25] or Apriori-based join [16, 23, 15]. The only necessary pre-condition for including CORK as a pruning step is a supergraph relation (T ⊒ S) for patterns mined at a later stage.

Theorem 2.3. Let S, T ∈ S be frequent subgraphs, and T be a supergraph of S. Let A_{S1} denote the number of graphs in class A that contain S ('hits'), A_{S0} the number of graphs in A that do not contain S ('misses'), and define B_{S0}, B_{S1} analogously. Then

q({T}) ≤ q({S}) + max{ A_{S1} · (B_{S1} − B_{S0}), (A_{S1} − A_{S0}) · B_{S1}, 0 }    (2.12)
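The bound of Theorem 2.3 can be checked on toy occurrence counts. Since support is anti-monotone, a supergraph T can only satisfy A_{T1} ≤ A_{S1} and B_{T1} ≤ B_{S1}; enumerating all such counts for an invented instance confirms that none exceeds the bound.

```python
# Sanity check of the pruning bound (Theorem 2.3) on invented counts:
# S occurs in 4 of 5 graphs of class A and 3 of 5 graphs of class B.
def q_single(a1, a0, b1, b0):
    # Equation 2.7: negated number of correspondences induced by one feature.
    return -(a0 * b0 + a1 * b1)

def cork_bound(a1, a0, b1, b0):
    # Equation 2.12: upper bound on q({T}) for any supergraph T of S.
    return q_single(a1, a0, b1, b0) + max(a1 * (b1 - b0), (a1 - a0) * b1, 0)

nA, nB = 5, 5
AS1, BS1 = 4, 3  # graphs containing S in A and B, respectively
bound = cork_bound(AS1, nA - AS1, BS1, nB - BS1)

# A supergraph T occurs in a subset of S's occurrences (anti-monotonicity),
# so AT1 <= AS1 and BT1 <= BS1; none of these counts beats the bound.
for AT1 in range(AS1 + 1):
    for BT1 in range(BS1 + 1):
        assert q_single(AT1, nA - AT1, BT1, nB - BT1) <= bound
print(bound)  # -5
```

Whenever this bound falls below the CORK gain of the best feature found so far, the entire extension subtree of S can be pruned during mining.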