
Fundamenta Informaticae 66 (2005) 53–82

IOS Press


A General Framework for Mining Frequent Subgraphs from Labeled Graphs

Akihiro Inokuchi

Tokyo Research Laboratory*

IBM Japan

1623-14, Shimotsuruma, Yamato, Kanagawa, 242-8502, Japan

inokuchi@jp.ibm.com

Takashi Washio and Hiroshi Motoda

The Institute of Scientific and Industrial Research

Osaka University

8-1, Mihogaoka, Ibaraki, Osaka, 567-0047, Japan

washio@ar.sanken.osaka-u.ac.jp

motoda@ar.sanken.osaka-u.ac.jp

Abstract. The derivation of frequent subgraphs from a dataset of labeled graphs has high computational complexity because the hard problems of isomorphism and subgraph isomorphism have to be solved as part of this derivation. To deal with this computational complexity, all previous approaches have focused on one particular kind of graph. In this paper, we propose an approach to conduct a complete search for various classes of frequent subgraphs in a massive dataset of labeled graphs within a practical time. The power of our approach comes from the algebraic representation of graphs, its associated operations, and well-organized bias constraints to limit the search space efficiently. The performance has been evaluated using real world datasets, and the high scalability and flexibility of our approach have been confirmed with respect to the amount of data and the computation time.

Keywords: Data Mining, Graph Mining, Frequent Subgraph, Bias, Canonical Form, Subgraph Isomorphism

*Address for correspondence: Tokyo Research Laboratory, IBM Japan, 1623-14, Shimotsuruma, Yamato, Kanagawa, 242-8502, Japan


A. Inokuchi et al./A General Framework for Mining Frequent Subgraphs from Labeled Graphs

1. Introduction

Graph mining algorithms that discover characteristic subgraph patterns embedded in a dataset of labeled

graphs have a broad range of applications. However, it is hard to develop methods with practical run

times because the search for candidate frequent subgraphs has exponential complexity and includes the

subgraph isomorphism problem, which is known to be NP-complete.

To address these issues, various approaches to mine a complete set of frequent patterns from massive

datasets of labeled graphs or labeled trees have been proposed. Although each method can efficiently

discover the patterns, the subgraphs to be searched are limited within a specific class. For example,

MolFea efficiently mines frequent paths from labeled graphs [7]. TreeMiner [27] and FREQT [2] can

quickly discover all frequent patterns from ordered trees. However, they cannot mine more complex

substructures such as labeled subgraphs. On the other hand, the AGM algorithm [10, 12], FSG [17], and

gSpan [24] can mine frequent subgraphs from a set of labeled graphs. However, they cannot efficiently

discover frequent patterns of paths and trees, because their data structures and their search operations are

not dedicated to path and tree structure mining.

In this paper, we propose a generic and efficient framework to mine various classes of substructures.

By introducing a bias for each class of substructure, e.g., connected subgraphs, ordered subtrees, and

path structures, to the AGM algorithm, a complete search for the frequent substructures of each class is

achieved. The bias includes restrictions on the search space of the frequent patterns, on the ambiguity of

the structural representation, and on the criteria used for subgraph isomorphism checking. We call this

new framework Biased Apriori-based Graph Mining (B-AGM). We evaluate its performance in terms of

the required computation time for real world datasets of various sizes.

The rest of this paper is organized as follows. Section 2 defines the frequent subgraph pattern mining

problem, and describes the basic concepts of the Apriori-based Graph Mining algorithm used for mining

frequent patterns in a dataset consisting of labeled graphs. Section 3 defines some additional specific

biases to derive various types of patterns, e.g., general subgraphs, connected subgraphs, ordered subtrees

and path patterns. Section 4 provides an experimental evaluation of our algorithm on some real datasets

consisting of chemical compounds, Web access logs, and XML data. In Section 5, we discuss future

extensions of our framework. We provide a discussion and some related work in Section 6, and finally

conclude in Section 7.

2. Problem Definitions and the AGM Algorithm

We use the basic principles of the AGM algorithm in our extended framework. By applying some specific

biases to the algorithm, our B-AGM discovers frequent subgraphs of various classes. In this section, we

define the problem and explain the AGM algorithm. In the next section, we propose biases to enable

graph mining of various classes.

2.1. Problem Definition

The input for frequent subgraph mining is a set of labeled graphs in which each vertex and each edge have a vertex label and an edge label, respectively. The label of each vertex (edge) does not need to be unique, and the same label can be used for several vertices (edges). Each graph in the dataset is represented as $G = (V, E, L_V, L_E, lb)$, where

$V = \{v_1, \ldots, v_k\}$,

$E = \{e_h = (v_i, v_j) \mid v_i, v_j \in V\}$,

$L_V = \{lb(v_i) \mid v_i \in V\}$,

$L_E = \{lb(e_h) \mid e_h \in E\}$, and

$lb : V \cup E \to L_V \cup L_E$

are sets of vertices, edges, vertex labels, edge labels, and a function to assign a label to a vertex or to an edge, respectively. For convenience of description, the sets of vertices, edges, vertex labels, and edge labels of the labeled graph $G$ are represented as $V(G)$, $E(G)$, $L_V(G)$, and $L_E(G)$, respectively. The number of vertices, $|V(G)|$, is called the “size” of the graph $G$.

A graph can be represented by using an “adjacency matrix”. For calculation efficiency, let $num(lb(v_i))$ and $num(lb(e_h))$ be natural numbers assigned to a vertex label $lb(v_i)$ and an edge label $lb(e_h)$, respectively. Given a labeled graph $G$ whose size is $k$, the $(i, j)$-element $x_{i,j}$ of an adjacency matrix $X_k$ of the graph $G$ is represented as follows.

$$x_{i,j} = \begin{cases} num(lb(e_h)) & \text{if } e_h = (v_i, v_j) \in E(G), \\ 0 & \text{if } (v_i, v_j) \notin E(G), \end{cases}$$

where $i, j \in \{1, \ldots, k\}$. The vertex corresponding to the $i$-th row ($i$-th column) of an adjacency matrix is called the $i$-th vertex, and the graph structure of an adjacency matrix $X_k$ is represented as $G(X_k)$.

By choosing different assignments of rows and columns to vertices in a graph, multiple adjacency matrix representations for a single graph can be obtained. To remove this ambiguity, we use a “canonical form” of the adjacency matrices to represent a graph. To mathematically define the canonical form of a graph and to deal efficiently with matrices, the code of an adjacency matrix is defined as follows. For an undirected graph, the function $code$ of an adjacency matrix $X_k$ is defined as

$$code(X_k) = x_{1,2}\, x_{1,3}\, x_{2,3}\, x_{1,4} \cdots x_{k-2,k}\, x_{k-1,k},$$

which is a concatenation of the $(i, j)$-elements with $i < j$ as shown in Figure 1. For a directed graph, the code is defined analogously so that both the $(i, j)$-element and the $(j, i)$-element of each vertex pair are included, in the order shown in Figure 2. Furthermore, a function $CODE$ including the vertex labels is defined as

$$CODE(X_k) = num(lb(v_1))\, num(lb(v_2)) \cdots num(lb(v_k))\; code(X_k),$$

which is a concatenation of the numbers assigned to the vertex labels and $code(X_k)$.

The canonical form of the adjacency matrices representing a graph is the unique matrix having the maximum (or minimum) $CODE$. The choice of the maximum or minimum $CODE$ depends on the class of substructure patterns to be mined, which is included in the definition of each bias. For example, when connected subgraphs are mined, the maximum $CODE$ is used to define the canonical form, while the minimum $CODE$ is used for subtree mining. Both are applicable to the conventional AGM algorithm.
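The code and canonical form of an undirected graph can be illustrated with a minimal sketch. It assumes the maximum-CODE convention, a graph given as a list of vertex label numbers plus a symmetric matrix of edge label numbers, and a brute-force search over all vertex orderings (factorial cost, so suitable only for tiny graphs). The function names `code`, `full_code`, `permute`, and `canonical_form` are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def code(adj):
    """Concatenate the upper-triangular elements column by column:
    x[0][1], x[0][2], x[1][2], x[0][3], ... (undirected graph)."""
    k = len(adj)
    return tuple(adj[i][j] for j in range(1, k) for i in range(j))


def full_code(labels, adj):
    """CODE: the vertex label numbers followed by code(adj)."""
    return tuple(labels) + code(adj)


def permute(labels, adj, order):
    """Reorder the vertices so that new vertex p is old vertex order[p]."""
    new_labels = [labels[v] for v in order]
    new_adj = [[adj[u][v] for v in order] for u in order]
    return new_labels, new_adj


def canonical_form(labels, adj):
    """Brute force: try every vertex ordering and keep the maximum CODE."""
    orderings = permutations(range(len(adj)))
    return max(
        (permute(labels, adj, order) for order in orderings),
        key=lambda pair: full_code(*pair),
    )
```

Two adjacency matrices of the same graph give the same result from `canonical_form`, which is the property the canonical form is introduced for.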

Figure 1. Order of Matrix Elements to Define a Function code for an Undirected Graph.

Figure 2. Order of Matrix Elements to Define a Function code for a Directed Graph.

Adjacency matrices corresponding to an identical graph are mutually convertible using the following “transformation matrix” (permutation matrix). When adjacency matrices $X_k$ and $Y_k$ representing an identical graph of size $k$ are given, each element $s_{i,j}$ of a transformation matrix $S_k$ is defined as follows.

$$s_{i,j} = \begin{cases} 1 & \text{if the } i\text{-th vertex of } G(X_k) \text{ corresponds to the } j\text{-th vertex of } G(Y_k), \\ 0 & \text{otherwise.} \end{cases}$$

$Y_k$ is expressed as $Y_k = S_k^T X_k S_k$.

Given graphs $G$ and $G_s$, if there is a function $\phi : V(G_s) \to V(G)$ that satisfies

1. $lb(v_i) = lb(\phi(v_i))$ for every $v_i \in V(G_s)$, and

2. $(\phi(v_i), \phi(v_j)) \in E(G)$ and $lb((v_i, v_j)) = lb((\phi(v_i), \phi(v_j)))$ for every $(v_i, v_j) \in E(G_s)$,

then $G_s$ is a “subgraph” of $G$, which is represented as $G_s \subseteq G$. Additionally, if the function satisfies

3. $(\phi(v_i), \phi(v_j)) \notin E(G)$ for every $v_i, v_j \in V(G_s)$ with $(v_i, v_j) \notin E(G_s)$,

then $G_s$ is an “induced subgraph” of $G$, which is represented as $G_s \subseteq_{ind} G$.

A “path” is a sequence of consecutive vertices and edges in a graph. Given an undirected graph $G$, if a path exists between any two vertices of the graph, then $G$ is called a “connected graph”. In the case of a directed graph $G$, $G$ is called a connected graph if a path exists between any two vertices in the undirected graph obtained by ignoring the directions of the edges in $G$. An “unordered tree” is a directed acyclic graph with a root vertex and where every other vertex has one entering edge. An “ordered tree” is a tree with a left-to-right ordering among the children of each vertex.

Figure 3. Examples of Labeled Graphs.

Given a set of labeled graphs $GD$, the “support” $sup(G_s)$ of an induced subgraph pattern $G_s$ is defined as

$$sup(G_s) = \frac{|\{G \mid G \in GD,\; G_s \subseteq_{ind} G\}|}{|GD|},$$

where $\subseteq_{ind}$ stands for inclusion of an induced subgraph in a graph. When a user would like to derive all of the frequent patterns that are contained as subgraphs, the support is defined as

$$sup(G_s) = \frac{|\{G \mid G \in GD,\; G_s \subseteq G\}|}{|GD|}.$$

There is an induced subgraph derivation and a general subgraph derivation for each class of structure except for a subtree. These derivations are introduced independently of any bias for each class of structure, which is defined in Section 3. By combining an induced or general subgraph derivation with a bias, the B-AGM algorithm can mine the frequent induced subgraphs separately from the frequent general subgraphs. Any derived subgraph having support greater than or equal to the “minimum support” specified by a user is called a “frequent subgraph”. A frequent subgraph with $k$ vertices is called a frequent $k$-subgraph. When a dataset which consists of labeled graphs and the minimum support are given as input, the frequent subgraph mining problem is to derive all frequent subgraphs in the dataset that have support greater than or equal to the minimum support value [11].

For example, two labeled graphs as shown in Figure 3 are given as an input dataset $GD$, where the numbers 1, 2, and 3 are assigned to the three vertex labels, respectively, and 1 is assigned to an edge label. The canonical form of a graph in Figure 3 is expressed as an adjacency matrix, and its $CODE$ is the concatenation of the vertex label numbers and the $code$ of that matrix, where the italic characters represent the vertex label part. When the dataset $GD$ is given as input and the minimum support is set to 100%, the search space for mining frequent induced subgraphs is as represented in Figure 4, where each graph in a rectangle corresponds to a subgraph pattern which has support greater than 0%. (In Figure 4, any pattern whose support is 0% is omitted due to space limitations.) A rectangle for a subgraph pattern is linked to the rectangles of its induced subgraphs. The subgraph patterns above the dashed line in Figure 4 are the frequent subgraphs. With two graphs in $GD$, a pattern contained in both graphs has a support value of 2/2 = 100% and is frequent, whereas a pattern contained in only one of them has a support value of 1/2 = 50%, and the subgraph is not added to the set of frequent subgraphs.

Figure 4. Search Space to Mine Subgraphs in Data in Figure 3.
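The support definition for induced subgraph patterns can be made concrete with a small sketch. It assumes undirected graphs stored as a list of vertex label numbers plus a symmetric matrix whose entries are 0 (no edge) or an edge label number, and it uses a plain brute-force search over injective vertex mappings rather than the counting method of the paper. The names `is_induced_subgraph` and `support` are hypothetical.

```python
from itertools import permutations


def is_induced_subgraph(p_labels, p_adj, g_labels, g_adj):
    """True if the pattern occurs in the graph as an induced subgraph.
    Every matrix entry must match, so edges, edge labels, and
    non-edges of the pattern are all preserved by the mapping."""
    k, n = len(p_labels), len(g_labels)
    for image in permutations(range(n), k):
        labels_ok = all(p_labels[i] == g_labels[image[i]] for i in range(k))
        if labels_ok and all(
            p_adj[i][j] == g_adj[image[i]][image[j]]
            for i in range(k)
            for j in range(k)
        ):
            return True
    return False


def support(p_labels, p_adj, dataset):
    """Fraction of graphs in the dataset that contain the pattern."""
    hits = sum(
        is_induced_subgraph(p_labels, p_adj, g_labels, g_adj)
        for g_labels, g_adj in dataset
    )
    return hits / len(dataset)
```

For the general subgraph variant, the entry-wise equality test would be relaxed so that only the pattern's edges (nonzero entries) need to match.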

2.2. Apriori-based Graph Mining Algorithm

In our previous work, we proposed an approach named the AGM (Apriori-based Graph Mining) algorithm, in which the knowledge representation and the search operations are highly dedicated to graph structured data mining [10, 12]. The AGM algorithm is so generic that it can discover not only connected frequent subgraphs, but also disconnected frequent subgraphs. We use the basic concept of the AGM algorithm as the framework for frequent subgraph mining. By adding some additional biases, the AGM framework can discover various types of subgraphs, such as connected subgraphs, subtree structures, and path structures.

The AGM algorithm derives all frequent subgraphs in ascending order of the size of the graph based on the anti-monotonic property of the support measure. Frequent subgraphs are derived stepwise from the top in the lattice search space as depicted in Figure 4. Figure 5 is the outline of our AGM algorithm. First, a $1 \times 1$ adjacency matrix representing a vertex is generated for every vertex, and they are substituted for $C_1$ (Line 1). Next, the support for each candidate frequent subgraph is calculated by scanning the database (Lines 4 and 5). Next, the Generate-Candidate function generates the candidate frequent subgraphs of size $k+1$ from the frequent $k$-subgraphs in $F_k$, and they are substituted for $C_{k+1}$ (Line 6). These steps are repeated until $C_k$ becomes empty. Finally, all of the frequent subgraphs are returned (Line 9).

    // GD is a database consisting of labeled graphs.
    // F_k is a set of adjacency matrices of frequent k-subgraphs
    // C_k is a set of adjacency matrices of candidate k-subgraphs
    // minsup is the minimum support.
    0)  Main(GD, minsup) {
    1)    C_1 <- all adjacency matrices consisting of one element;
    2)    k <- 1;
    3)    while (C_k is not empty) {
    4)      Count(GD, C_k);
    5)      F_k <- { c_k | c_k in C_k and sup(G(c_k)) >= minsup };
    6)      C_{k+1} <- Generate-Candidate(F_k);
    7)      k <- k + 1;
    8)    }
    9)    return the union over k of { f_k | f_k in F_k and f_k is canonical };
    10) }

Figure 5. Outline of the Apriori-based Graph Mining Algorithm.
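The control flow of Figure 5 can be sketched as a generic level-wise loop. This is only an illustration under stated assumptions: the support function and the candidate generator are passed in as callbacks, the final filter on canonical forms (Line 9) is omitted, and the usage example swaps adjacency matrices for plain item sets so that the block stays short and runnable. All function names are hypothetical.

```python
from itertools import combinations


def level_wise_mining(dataset, minsup, initial, support, generate):
    """Level-wise search: count candidates, keep the frequent ones,
    generate the next level, and stop when no candidates remain."""
    frequent, candidates = [], list(initial)          # C_1
    while candidates:                                 # Line 3
        f_k = [c for c in candidates
               if support(c, dataset) >= minsup]      # Lines 4-5
        frequent.extend(f_k)
        candidates = generate(f_k)                    # Line 6
    return frequent


# Toy stand-ins (item sets instead of graphs), used only to exercise the loop.
def itemset_support(items, transactions):
    """Fraction of transactions that contain every item of the set."""
    return sum(items <= t for t in transactions) / len(transactions)


def itemset_join(f_k):
    """Join frequent k-item sets that differ in one item, then keep only
    candidates whose k-item subsets are all frequent (anti-monotonic pruning)."""
    known = set(f_k)
    joined = {a | b for a, b in combinations(f_k, 2)
              if len(a | b) == len(a) + 1}
    return sorted((c for c in joined
                   if all(c - {x} in known for x in c)), key=sorted)
```

In the graph setting, `generate` would correspond to the Generate-Candidate function (join, subgraph check, canonicalization) and `support` to the database scan.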

2.2.1. Join Operation

The Generate-Candidate function referenced in Figure 5 consists of three parts: the join operation, the subgraph-check operation, and the canonicalize operation. In the join operation, the adjacency matrices of the candidate frequent subgraphs of size $k+1$ are generated by joining two adjacency matrices of the frequent $k$-subgraphs in $F_k$. Given two adjacency matrices $X_k$ and $Y_k$ representing frequent subgraphs, they are joinable if and only if all of the following conditions to join are satisfied.

Condition 1: $X_k$ and $Y_k$ are identical except for the $k$-th row and the $k$-th column, i.e.,

$$X_k = \begin{pmatrix} X_{k-1} & x_1 \\ x_2^T & 0 \end{pmatrix}, \qquad Y_k = \begin{pmatrix} X_{k-1} & y_1 \\ y_2^T & 0 \end{pmatrix},$$

and the $i$-th vertex of $G(X_k)$ and the $i$-th vertex of $G(Y_k)$ have the same label for $i = 1, \ldots, k-1$.

Condition 2: $X_k$ is the canonical form of $G(X_k)$.

Condition 3: $CODE(X_k) \ge CODE(Y_k)$ is fulfilled.¹

If $X_k$ and $Y_k$ are joinable, their join operation is defined as follows.

$$Z_{k+1} = \begin{pmatrix} X_{k-1} & x_1 & y_1 \\ x_2^T & 0 & z_{k,k+1} \\ y_2^T & z_{k+1,k} & 0 \end{pmatrix},$$

where the $i$-th vertex of $G(Z_{k+1})$ has the label of the $i$-th vertex of $G(X_k)$ for $i = 1, \ldots, k$, and the $(k+1)$-th vertex of $G(Z_{k+1})$ has the label of the $k$-th vertex of $G(Y_k)$.

These matrices $X_k$ and $Y_k$ are called the “first generator matrix” and the “second generator matrix” of $Z_{k+1}$, respectively. The two elements $z_{k,k+1}$ and $z_{k+1,k}$ of $Z_{k+1}$ are not determined by $X_k$ and $Y_k$. For the undirected graph, the possible graph structures for $G(Z_{k+1})$ are those where there is a labeled edge or where there is no edge between the $k$-th vertex and the $(k+1)$-th vertex. For these undirected graphs, the $(|L_E| + 1)$ adjacency matrices under $z_{k,k+1} = z_{k+1,k}$ are then generated, where $|L_E|$ is the number of edge labels, while $(|L_E| + 1)^2$ adjacency matrices are generated for directed graphs. The adjacency matrix generated under the above conditions is called a “normal form”.

Figure 6 shows an example of the join operation for undirected graphs when there is only one edge label. Since $X_3$ and $Y_3$ are joinable, the two adjacency matrices $Z_4$ and $Z'_4$ are generated, where the difference is the pair consisting of the $(3,4)$-element and the $(4,3)$-element. In the two matrices, each pair consists of 0s or 1s.

$$X_3 = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \quad Y_3 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad Z_4 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad Z'_4 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

Figure 6. Example of Join Operation.
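As a hedged illustration of the join for undirected graphs, the sketch below builds the candidate matrices of size $k+1$ from two $k \times k$ matrices that share their upper-left $(k-1) \times (k-1)$ block. It checks only the shared block and the first $k-1$ vertex labels; the canonical-form and CODE-ordering conditions are assumed to have been checked by the caller. The function name `join`, the list-based representation, and the vertex label numbers used in the test are assumptions made for this example.

```python
def join(x_labels, x, y_labels, y, edge_labels=(1,)):
    """Join two k x k adjacency matrices of undirected graphs.
    Returns a list of (labels, matrix) candidates of size k + 1,
    one per choice of the undetermined element (0 = no edge)."""
    k = len(x)
    same_block = all(
        x[i][j] == y[i][j] for i in range(k - 1) for j in range(k - 1)
    )
    if not same_block or x_labels[:k - 1] != y_labels[:k - 1]:
        return []  # the shared-block condition is not met
    candidates = []
    for z in (0, *edge_labels):
        # Rows of the shared block, extended by the last column of y.
        m = [x[i][:] + [y[i][k - 1]] for i in range(k - 1)]
        # Last row of x, extended by the undetermined element z.
        m.append(x[k - 1][:] + [z])
        # Last row of y (without its diagonal), then z and the new diagonal 0.
        m.append(y[k - 1][:k - 1] + [z, 0])
        candidates.append((x_labels + [y_labels[k - 1]], m))
    return candidates
```

Applied to the two $3 \times 3$ matrices of Figure 6 with a single edge label, this yields the two $4 \times 4$ matrices shown there.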

2.2.2. Subgraph-Check Operation

For the necessary condition of $G(Z_{k+1})$ being a frequent subgraph, all induced subgraphs of $G(Z_{k+1})$ must be frequent subgraphs according to the anti-monotonic property of the support. This condition reduces the candidates. When the subgraph-check operation for a graph of size $k+1$ is done, it can be assumed that one of the transformation matrices from every normal form matrix to its canonical form matrix whose size is less than $k+1$ is known, since the complete search was done in the previous $k$ steps.

¹ In the case that the canonical form is defined as the unique matrix having the minimum $CODE$, Condition 3 is $CODE(X_k) \le CODE(Y_k)$.

An adjacency matrix $X_k$ of an induced subgraph of size $k$ is obtained by removing the elements in the $i$-th row and the $i$-th column ($1 \le i \le k+1$) of $Z_{k+1}$. Then $X_k$ is transformed into the normal form by applying the algorithm shown in Figure 7. This is necessary because the AGM algorithm generates only normal form matrices, and the support of $X_k$ is easily checked by using the normal form matrices obtained in the earlier steps. Let the upper left $m \times m$ submatrix of the adjacency matrix $X_k$ be $X_m$, the matrix to transform $X_m$ into the canonical form be $S_m$, and the unit matrix of size $n$ be $I_n$. The transformation matrices $S_k$ and $T_k$ in Figure 7 are generated from $S_m$, unit matrices, and zero blocks. Line 4 in Figure 7 is the operation to transform $X_m$ into its canonical form, and Line 7 changes the order of the remaining vertices.

Figure 7. Normalization Algorithm.

2.2.3. Canonicalization Operation

After generating the matrices of candidate subgraphs, a database is accessed to calculate their supports. However, since multiple normal form matrices can represent the same graph, the canonical form of each of these matrices must be identified to collect all supports of the subgraph.

When the canonical form of $X_k$ and its associated transformation matrix are searched for, it can be assumed that one of the transformation matrices from each normal form matrix into its canonical form matrix of size $k-1$ is known, because of the stepwise extension of the graph size in the search. Let the transformation matrix of $X_{k-1}^m$ to its normal form be $T_{k-1}^m$, where $X_{k-1}^m$ is the adjacency matrix obtained by removing the elements in the $m$-th row and $m$-th column of $X_k$. Also let the matrix to transform the normalized matrix into its canonical form be $S_{k-1}^m$. The transformation matrices $S_k^m$ and $T_k^m$ for $X_k$ are generated from $S_{k-1}^m$ and $T_{k-1}^m$. A canonical form for $X_k$ is given by the matrix attaining

$$\max_{m = 1, \ldots, k} CODE\left( (T_k^m S_k^m)^T X_k \, (T_k^m S_k^m) \right). \qquad (1)$$

The matrix to transform $X_k$ into its canonical form is represented as $T_k^m S_k^m$, which makes Equation (1) the maximum. If a transformation matrix that transforms $X_k$ into the canonical form through one of these intermediate matrices has already been found, the canonical form of $X_k$ is provided by that matrix, and thus the calculations for all of the $m$s in Equation (1) are not required. It should be noted, however, that the canonical form might not be found by the above method in some cases, where the canonical form and its transformation matrix must be searched for in accordance with the permutations. The principles of this canonical-form finding method are described in detail in the literature [12].

2.2.4. Counting the Frequency of Each Candidate

After all of the canonical forms of the candidate subgraphs are obtained, the database is accessed, and the frequency of each candidate subgraph is calculated. It is known that subgraph isomorphism [8] is NP-complete, while ordered subtree isomorphism matching and subtree isomorphism matching require time and space polynomial in the sizes of the two graphs being matched [14].

Figure 8. Graph Data and Candidate Subgraphs.

We now explain the counting in the case of frequent induced subgraph derivation. Let the canonical form of the candidate $k$-subgraph be $X_k$, its first generator matrix be $X_{k-1}$, and the graph in the database be $G$. For example, let $G$, $G(X_2)$, and $G(X_3)$ be the graphs in Figure 8 (a), (b), and (c), respectively. The numbers assigned to the vertices in Figure 8 are vertex IDs.

If the brute force method checks whether $G$ includes the graph $G(X_2)$ by a depth first search in the ascending order of the vertex IDs when $G(X_2)$ is the candidate subgraph, it turns out that the graph $G$ includes the graph $G(X_2)$, and the correspondences of the vertices between $G$ and $G(X_2)$ are 2=I and 3=II, where 2=I shows that the vertex whose ID is 2 is mapped to I. The search tree for this case is shown in Figure 9. When $G(X_3)$ is the candidate subgraph, it turns out that graph $G(X_3)$ is included, and the correspondences of the vertices are 2=I, 3=II, and 6=III, as shown in Figure 10. In this case, the search in the part on the left side of the path root-I2-II3 in Figure 10 is not necessary since this part has already been checked in Figure 9. Therefore, if the correspondence relation of the vertices of $G$ and $G(X_{k-1})$ is recorded, $G$'s inclusion of a graph structure which has $X_{k-1}$ as the first generator matrix can be efficiently checked.

Figure 9. Search Tree for $G$ and $G(X_2)$.

Figure 10. Search Tree for $G$ and $G(X_3)$.

We use this method as the default method to count the frequency. However, this method is implemented so that it can be overwritten. As mentioned later, the method is modified to compare B-AGM with other tree mining methods.
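The idea of reusing recorded vertex correspondences can be sketched as follows. The example assumes undirected graphs with symmetric matrices, induced subgraph matching, and that every embedding of the first generator (the first $k-1$ pattern vertices) into a database graph has been kept as a tuple of graph vertex indices. The function name `extend_embeddings` and this storage scheme are assumptions for illustration, not the implementation used in the paper.

```python
def extend_embeddings(embeddings, p_labels, p_adj, g_labels, g_adj):
    """Extend recorded embeddings of the first generator by one vertex.
    Only the last pattern vertex is matched; the first k - 1 vertices
    are taken from the recorded correspondences and not searched again."""
    last = len(p_labels) - 1
    result = []
    for emb in embeddings:
        for v in range(len(g_labels)):
            if v in emb or g_labels[v] != p_labels[last]:
                continue
            # The new vertex must agree with the pattern on edges and
            # non-edges to every already mapped vertex (induced matching).
            if all(p_adj[i][last] == g_adj[emb[i]][v] for i in range(last)):
                result.append(emb + (v,))
    return result
```

A candidate is counted for a database graph whenever the returned list is non-empty, and the list is stored for use at the next level.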

2.3. Completeness of Search of Frequent Subgraphs

The completeness of the search of frequent subgraphs in this join operation is proven as follows:

Theorem 2.1. Given a canonical form matrix

?

?

and

?

?

?of an undirected graph and its

!"#

?

?

?

?

?

?

?

?

???

?

?

?

?

?

???

?

??

?

?

?

??

???

?

??

?

?

?

?

?

??

????

?

?

?

?

?

?

?

??

?

???

?

?

?

??

?

?

??

?

?

?

???

?

?

?

?

?

?. Then

???

?

??

?

?

?

??

?

???

?

??

?

?

?

?

?

???

or

???

???

?

??

?

?

?

??

?

???

???

?

??

?

?

?

?

?

??

?

??

?

???

?

?

?

??

?

?

?

??

?

?

?

???

?

?

?

??

?

?

?

holds. Similarly,

?

??

?

?

?

??

?

?

??

?

?

?

?

?

???

or

and

???

?

??

?

?

?

??

?

???

?

??

?

?

?

?

?

??

?

?

?

?

?

?

???

?

?

?

?

?

?

?

???

?

?

?

?

?

?

holds for a directed graph, where

?

?

?

?

?

?

?

?

?

.

Proof:

Consider a matrix

matrix

?

?

?

?

?obtained by permutation of the

?-th and

?

?

?

?

?-th rows and columns of the

?

?

?

?.

!

Accordingly,

"#

?

?

?

?

?

?

?

?

?

???

?

?

?

?

?

?

???

?

??

?

?

?

?

?

??

???

?

??

?

?

?

??

???

or

and

?

?

?

?

?

?

?

?

??

?

?

?

???

?

?

?

??

?

?

?

?

??

?

???

?

?

?

??

?

?

?

?

??

?

?

!"#

?

?

?

?

?

?

?

?

?

!"#

?

?

?

?

?

?

?when

???

?

??

?

?

?

??

???

?

??

?

?

?

?

?

???

???

?

??

?

?

?

??

?

???

?

??

?

?

?

?

?

??

?

??

?

???

?

?

?

??

?

?

??

?

?

?

???

?

?

?

??

?

?

?

?

On the other hand,

invariant over the permutation of rows and columns. This contradicts the assumption that

canonical form matrix. The same argument applies to the directed graph.

??

?

?

?

?

?

?

?

??

?

?

?

?

?, because the graph represented by an adjacency matrix is

?

?

?

?is a

?

?

Theorem 2.2. The first generator matrix $X_{k-1}$ of a canonical form matrix $X_k$ is also a canonical form matrix.

Proof:

If

?

?is a canonical form matrix, but

?

?

?

?is not, then the matrices

?

?

?and its first generator matrix

?

?

?

?

?meeting the following conditions must exist:

???

?

?

?

?

?

?

?

?

???

?

?

?

?

?

??

or

and

???

?

?

?

?

?

?

?

?

???

?

?

?

?

?

?

????

?

?

?

?

?

?

?

?

????

?

?

?

?

?

??

where

??

?

?

?

?

?

??

?

?

?and

??

?

?

?

?

?

?

?

??

?

?

?

?

??


In the latter condition, the labels of the vertices corresponding to the last rows and columns of

are identical, i.e.,

relation satisfied:

?

?

?and

?

?

???

?

??

?

??

?

??

?

???

?

??

?

?

?

??, because

??

?

?

?

?

?

??

?

?

?. Accordingly, the following

!"#

?

?

?

?

?

?

?

???

?

?

?

?

?

?

?

???

?

??

?

?

?

?

??

????

?

?

?

?

?

?

?

?

?

??

?

???

?

?

?

?

??

?

?

!"#

?

?

?

?

?

?

???

?

?

?

?

?

?

???

?

??

?

?

?

??

????

?

?

?

?

?

?

?

??

?

???

?

?

?

??

?

?

This contradicts the assumption that

matrix.

?

?is a canonical form matrix. Thus,

?

?

?

?is a canonical form

?

?

Theorem 2.3. Given

the canonical form

first generator matrix

?

?

??all frequent subgraphs of size

?

?and

?

?

?

??

?

?

?

??

?

?

?

?

?

?, and

?

?is

?, for a given

?

?

?

?

?

?then let

?

?

?

?

?

?

?

??

?

?

?

??

?

?

?

?

?

?

?

??

?

?

?

?

?shares its

?

?

?

?with

?

?

?and

!"#

?

?

?

?

?

?

!"#

?

?

?

?

?

?,

?

?

?

?

?

?

?

?

??

?

?

?

?

?

?

?

?,

?

Proof:

Each

follows from Condition 1:

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?is derived by the join operation between

?

?

?and

?

?

?

?

?

?

?

?

?

?

?, and

?

?

?

?

?

?

?

?

?

??

?

?

?

?

?

?

?

?

?

?. Then

?

?

?

?includes all

?

?

?

?s in

?

?

?

?

?.

?

?

?

?

?

?meets Condition 2. The codes of

?

?and

?

?for undirected graphs are represented as

!"#

?

?

?

?

?

?

???

?

?

?

?

?

?

???

?

??

?

?

?

??

????

?

?

?

?

?

?

?

??

?

???

?

?

?

??

?

?

!"#

?

?

?

?

?

?

???

?

?

?

?

?

?

???

?

??

?

?

?

??

????

?

?

?

?

?

?

?

??

?

???

?

?

?

??

?

?

Hence, Condition 3 can be rewritten as follows (#1):

???

?

??

?

?

?

??

?

???

?

??

?

?

?

???

or

and

???

?

??

?

?

?

??

?

???

?

??

?

?

?

??

?

??

?

???

?

?

?

?

??

?

?

?

??

?

???

?

?

?

??

?

?

Also, the label

??

?

?

?

?

?

?and the element values

?

??

?

?

?

???

?

?

?

??

?

?

?of

?

?

?

?corresponds to

??

?

?

?

?and

?

having its first generator matrix

the case of directed graphs. From this observation and the fact that the join operations are applied to all

??

?

???

?

?

?

??

?, respectively. These constraints on

!"#

?s are identical with those of Theorem 2.1 when

?

?

?

?is considered as

?

?

?

?. The elements

?

?

?

?

?

?and

?

?

?

??

?in

?

?

?take any values in

???

?

??

?

?

?

???

?

????

?

??

?

??

?(#2).

??

?

?

?and

??

?

?

?are frequent. Through the join operations of any

?

?and

?

?s satisfying these constraints, all canonical form matrices

?

?

?

?s representing frequent subgraphs and

?

?are derived in

?

?

?

?

?

?

?

?

?. The corresponding discussion applies to

?

every

?s in

?

?

?, we conclude that every canonical form matrix

?

?

?

?where the first generator matrix is one

of

has the first generator matrix

?

?

?s in

?

?

?is completely derived in

?

?

?

?

?. On the other hand, every canonical form matrix

?

?

?

?

?

?which is a canonical form from Theorem 2.1. Since

?

?

?is complete,

?

?

?has the first generator matrix

?

?in

?

?

?. Therefore,

?

?

?

?

?includes the complete set of

?

sets of

completely enumerated at the initial search. Accordingly, the complete

step

?

?

?s in

After deriving

objective data are used to derive

?

?

?

?

?.

?

?

?

?

?

?

?, complete pruning of infrequent

?

?

?

?s and frequency counts of

?

?

?

?s in the

?

?

?

?and

?

?

?

?

?as described later. At the level

?

?

?, all complete

?

?,

?

?

?and

?

?

?are derived, since all frequent single vertices and their

?

?

?matrix notions are

?

?and

?

?

?are found in every

? from Theorem 2.3.


3. Extension to Mine Various Classes of Structures

The original AGM performs the complete mining of the frequent subgraphs. However, the variation of AGM that we introduce here contains a bias to derive only the frequent induced subgraphs [10, 12]. An induced subgraph of a graph $G$ has a subset of the vertices of $G$ and the same edges between pairs of vertices as in $G$. To limit the search of the frequent subgraphs within this class, the following bias has been applied in the past work. When counting the frequency of each candidate frequent subgraph, the AGM algorithm checks whether it is contained in each graph in a database as an induced subgraph.

In the following subsections, we propose further biases that allow for the graph mining of various classes based on the AGM framework, as depicted in Figure 11. We call this framework B-AGM (Biased-Apriori based Graph Mining). A bias for a specific class of graph structure consists of dedicated definitions of the canonical form and the join operation. By choosing an appropriate bias on the platform of the AGM framework, the complete mining of the frequent subgraphs of the objective class is defined.
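The difference between a general subgraph and an induced subgraph, and its effect on frequency counting, can be illustrated with a small brute-force sketch. This is not the AGM implementation: the function names `contains` and `support` are hypothetical, vertex and edge labels are ignored, graphs are undirected, and the exhaustive search over vertex mappings is only practical for very small graphs.

```python
from itertools import permutations

def contains(pattern_edges, pattern_n, graph_edges, graph_n, induced=True):
    """Return True if the pattern occurs in the graph.

    With induced=True, mapped vertex pairs must be adjacent in the graph
    exactly when they are adjacent in the pattern. With induced=False,
    extra edges in the graph are tolerated (general subgraph).
    Assumption: unlabeled, undirected graphs with vertices 0..n-1.
    """
    p = {frozenset(e) for e in pattern_edges}
    g = {frozenset(e) for e in graph_edges}
    pairs = [(u, v) for u in range(pattern_n) for v in range(u + 1, pattern_n)]
    for image in permutations(range(graph_n), pattern_n):
        def matches(u, v):
            in_p = frozenset((u, v)) in p
            in_g = frozenset((image[u], image[v])) in g
            return in_p == in_g if induced else (in_g or not in_p)
        if all(matches(u, v) for u, v in pairs):
            return True
    return False

def support(pattern_edges, pattern_n, database, induced=True):
    """Count the graphs (edges, n) in the database that contain the pattern."""
    return sum(
        contains(pattern_edges, pattern_n, edges, n, induced)
        for edges, n in database
    )
```

For instance, a three-vertex path is a general subgraph of a triangle but not an induced one, because the triangle has an edge between the two end vertices that the path lacks.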

Figure 11. B-AGM Framework.

3.1. Bias for Connected Subgraph Derivation

For calculation efficiency, the B-AGM algorithm with this bias mines all of the frequent connected subgraphs and some semi-connected subgraphs, each of which consists of a connected subgraph and an isolated vertex. The semi-connected graphs are not added to the output of the frequent subgraphs (see footnote 2).

Canonical Form

The definition of canonical form is altered from the original. Given the upper left $i \times i$ submatrix of the adjacency matrix $X_k$ as $X_k^i$ ($1 \le i \le k$), the following set $\Gamma(G)$ of adjacency matrices representing an identical graph $G$ is defined.

$$\Gamma(G) = \{\, X_k \mid G(X_k^i) \text{ is connected for } i = 1, \ldots, k-1, \;\; G(X_k) \equiv G \,\}$$

2 B-AGM with this bias is available from http://www.alphaworks.ibm.com/tech/fsm.


The adjacency matrix $C_k$ with the largest code in $\Gamma(G)$ is called the canonical form.

$$C_k = \arg\max_{X_k \in \Gamma(G)} \mathrm{code}(X_k)$$
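The biased canonical form can be sketched by brute force: enumerate the vertex orderings, keep only those whose leading submatrices are connected up to size $k-1$, and take the largest code among them. The sketch below rests on several assumptions that are not confirmed by the text here: vertex labels are ignored, edges are unlabeled (0/1), and the code is taken to be the upper-triangular elements read column by column. The function names are hypothetical.

```python
from itertools import permutations

def is_connected(adj, vertices):
    """Depth-first search restricted to the given vertex ids."""
    vertices = list(vertices)
    if not vertices:
        return True
    allowed, seen, stack = set(vertices), {vertices[0]}, [vertices[0]]
    while stack:
        u = stack.pop()
        for v in allowed:
            if adj[u][v] and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == allowed

def code(adj, order):
    """Assumed code: upper-triangular elements, column by column
    (x12, x13, x23, x14, ...), for the vertex ordering `order`."""
    return "".join(
        str(adj[order[i]][order[j]])
        for j in range(1, len(order))
        for i in range(j)
    )

def admissible_orderings(adj):
    """Orderings whose leading i x i submatrices are connected for i < k."""
    k = len(adj)
    for order in permutations(range(k)):
        if all(is_connected(adj, order[:i]) for i in range(1, k)):
            yield order

def canonical_code(adj):
    """Largest code over the admissible orderings, or None if there are none."""
    return max((code(adj, o) for o in admissible_orderings(adj)), default=None)
```

Because the connectivity requirement stops at size $k-1$, a semi-connected graph (a connected subgraph plus one isolated vertex placed last) still has a canonical form under this definition, whereas a graph with two or more isolated vertices has none.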

Join Operation

The original Conditions 1 and 2 are retained, and the following Conditions 3 and 4 are introduced.

!

determined from the flags of its first and second generator matrices. If both flags are connected, then the

flag of a graph which is made from the first and second matrices becomes connected.

If the second generator matrix corresponds to a disconnected graph,

does not have to be satisfied to join the adjacency matrices. For example, let

and 2, respectively, in Figure 12. The canonical form of

and it is generated by joining two adjacency matrices, such as

???

?

?

?

??

?

?

??

?

?

?is a connected graph.

!???

?

?

?

??

?

?

??

?

?

?is not a connected graph, otherwise

!"#

?

?

?

?

?

?

!"#

?

?

?

?

?.

Each frequent subgraph of

??

?

?

?has a flag to represent whether or not it is connected. The flag is

!"#

?

?

?

?

?

?

!"#

?

?

?

?

?

???

?

?

?and

???

?

?

?be 1

??

?

?

?in Figure 12 is

!"#

?

?

?

?

?

?

?

?

?

?

?

?,

!"#

?

?

?

?

?

?

?

??and

!"#

?

?

?

?

?

?

??

?. Therefore, the condition that the second generator matrix corresponds to a disconnected graph is needed (see footnote 3).

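$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad Y = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad Z = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

Figure 12. Example of Join Operation Bias for Connected Subgraph Derivation.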
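The join in Figure 12 and the connectivity flag can be sketched as follows. This is a simplified illustration, not the B-AGM code: vertex labels are ignored, edges are unlabeled and undirected, the code-ordering condition between the two generator matrices is omitted, and the names `join` and `joined_flag` are hypothetical. The fallback to an explicit connectivity check when one generator is disconnected is an assumption, since the text only states the case in which both flags are connected.

```python
def join(x, y):
    """Join two k x k adjacency matrices that agree on their leading
    (k-1) x (k-1) block. Returns the candidate (k+1) x (k+1) matrices,
    one per value of the element that neither generator determines."""
    k = len(x)
    if len(y) != k or any(
        x[i][j] != y[i][j] for i in range(k - 1) for j in range(k - 1)
    ):
        return []
    candidates = []
    for z in (0, 1):  # assumption: unlabeled edges, so the new element is 0 or 1
        m = [row[:] + [0] for row in x] + [[0] * (k + 1)]
        for i in range(k - 1):
            m[i][k] = y[i][k - 1]  # last column of y becomes the new column
            m[k][i] = y[k - 1][i]  # last row of y becomes the new row
        m[k][k] = y[k - 1][k - 1]
        m[k - 1][k] = m[k][k - 1] = z
        candidates.append(m)
    return candidates

def is_connected(adj):
    """Depth-first search over the whole graph."""
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v, edge in enumerate(adj[u]):
            if edge and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)

def joined_flag(first_connected, second_connected, candidate):
    """Connectivity flag of a joined graph (generators of size k >= 2)."""
    if first_connected and second_connected:
        return True  # both generators share k-1 >= 1 vertices
    return is_connected(candidate)  # assumed fallback: check explicitly
```

With the matrices of Figure 12, `join(X, Y)` yields two candidates, one of which is `Z`; although `Y` is disconnected, `Z` itself is connected, which is why the join with a disconnected second generator cannot simply be skipped.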

The completeness of the search by this join operation can be proven. However, due to space limitations, only the points to be altered in the aforementioned proof for the standard join operation are explained.

Theorem 2.1 is altered as follows.

Theorem 3.1. Given a canonical form matrix

?

?

?

?of an undirected graph and its

!"#

?

?

?

?

?

?

?

?

???

?

?

?

?

?

?

???

?

?

??

?

?

?

??

???

?

?

??

?

?

?

?

?

??

????

?

?

?

?

?

?

?

??

?

???

?

?

?

??

?

?

??

?

?

?

???

?

?

?

?

?

?. For

?

?

?

?representing a connected graph,

1.

???

?

??

?

?

?

??

?

???

?

??

?

?

?

?

?

??or

2.

???

?

??

?

?

??

?

???

??

?

?

?

?

?

??and

?

??

?

???

?

?

?

??

?

?

?

??

?

?

?

???

?

?

?

??

?

?

?, or

3.

???

?

??

?

?

?

??

???

?

??

?

?

?

?

?

??,

???

?

??

?

?

?

??

?

???

?

??

?

?

?

?

?

??,

?

??

?

?

?

?

???

?

?

?

?

??

?

?

?

?

?, and

?

?

?

?

?

?

?

?

?,

3 In the case of the conventional AGM algorithm, which finds not only connected subgraphs but also disconnected subgraphs, the

canonical form of

?

?

?

?

?in Figure 12 becomes

?????

?

?

?

????

?

??.