Thesis

An Efficient Method for Mining Correlation in Graph Databases


Abstract

Correlation mining is recognized as one of the most important data mining tasks for its capability to identify underlying dependencies between objects. As data mining techniques are increasingly applied to non-traditional domains, existing approaches for finding frequent itemsets and association rules from large volumes of data cannot be used, because they cannot model the requirements of these domains. Graph-based data mining techniques are advantageous because of their capability to model various complex real-life scenarios, so we have focused on graph databases. Graph databases have a wide range of application domains, but existing works either find structural-similarity-based correlation or correlation with one specific query graph. In this thesis we propose a new method for finding the underlying correlations among the graphs in a graph database using a new graph correlation measure, gConfidence. We first discuss the necessary terminology and related work, then present the scenarios that motivated this work, and then introduce the new measure along with some of its properties, proving its downward closure pruning capability. We also state and prove a supporting lemma, introduce our correlation mining algorithm, and analyze its performance, finding it scalable in running time and memory usage; the algorithm remains fast when mining correlation in graph databases with varying numbers of graphs, graph densities, and threshold values. Compared with existing graph correlation mining algorithms, it achieves more than a two-fold improvement in speed. Finally, we illustrate some application scenarios and application domains where our algorithm can be applied.
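As a rough illustration of how a downward-closed correlation measure enables level-wise pruning, the sketch below grows candidate subgraph patterns one edge at a time and extends only those that meet the threshold. The measure function gconfidence and the graph helpers candidate_extensions and contains_subgraph are hypothetical placeholders, not the thesis's actual definitions or implementation.

```python
# Minimal sketch of level-wise correlation mining over a graph database,
# assuming the correlation measure is downward closed (as the abstract claims
# for gConfidence).  `gconfidence`, `candidate_extensions` and
# `contains_subgraph` are hypothetical placeholders; patterns are assumed to
# be hashable objects.

def mine_correlated_patterns(graph_db, min_gconfidence, gconfidence,
                             candidate_extensions, contains_subgraph):
    """Return subgraph patterns whose correlation measure meets the threshold."""
    results = []
    # Level 1: single-edge patterns occurring anywhere in the database.
    level = {p for g in graph_db for p in candidate_extensions(None, g)}
    while level:
        next_level = set()
        for pattern in level:
            support_set = [g for g in graph_db if contains_subgraph(g, pattern)]
            if gconfidence(pattern, support_set, graph_db) >= min_gconfidence:
                results.append(pattern)
                # Downward closure: extensions of a pattern below the
                # threshold can never reach it, so only survivors are grown.
                for g in support_set:
                    next_level.update(candidate_extensions(pattern, g))
        level = next_level
    return results
```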

References
Chapter
The multiprocessor scheduling problem is stated as finding a schedule for a task graph to be executed on a multiprocessor system such that the execution time of the graph can be minimized. This problem is known to be NP-hard, in all but a few restricted cases. To solve the problem, we apply the well-known state space reduction algorithm, A*. To alleviate the impediments of large space and time requirements, we employ three new techniques, processor isomorphism, task isomorphism, and node isomorphism. We demonstrate the effectiveness of our algorithm using several computer vision tasks as benchmarks. Finally, we also present an efficient heuristic algorithm for solving the problem in a reasonable amount of computation time.
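For orientation, a generic A* skeleton is sketched below under the assumption of an admissible heuristic; the scheduling-specific state space and the processor/task/node isomorphism pruning described in the paper are not reproduced here.

```python
import heapq
from itertools import count

def a_star(start, is_goal, successors, heuristic):
    """Generic A* search, not the paper's scheduler: `successors(s)` yields
    (next_state, step_cost) pairs, and `heuristic` must never overestimate the
    remaining cost for the returned solution to be optimal."""
    tie = count()  # tie-breaker so states themselves are never compared
    frontier = [(heuristic(start), next(tie), 0, start)]
    best_g = {start: 0}
    while frontier:
        _, _, g, state = heapq.heappop(frontier)
        if is_goal(state):
            return state, g
        for nxt, cost in successors(state):
            ng = g + cost
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + heuristic(nxt), next(tie), ng, nxt))
    return None, float("inf")
```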
Article
Cohesive subgroups have always represented an important construct for sociologists who study individuals and organizations. In this article, I apply recent advances in the statistical modelling of social network data to the task of identifying cohesive subgroups from social network data. Further, through simulated data, I describe a process for obtaining the probability that a given sample of data could have been obtained from a network in which actors were no more likely to engage in interaction with subgroup members than with members of other subgroups. I obtain the probability for a specific data set, and then, through further simulations, develop a model which can be applied to future data sets. Also through simulated data, I characterize the extent to which a simple hill-climbing algorithm recovers known subgroup memberships. I apply the algorithm to data indicating the extent of professional discussion among teachers in a high school, and I show the relationship between membership in cohesive subgroups and teachers' orientations towards teaching.
Conference Paper
Because many databases contain or can be embellished with structural information, a method for identifying interesting and repetitive substructures is an essential component to discovering knowledge in such databases. This paper describes the SUBDUE system, which uses the minimum description length (MDL) principle to discover substructures that compress the database and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. Inclusion of background knowledge guides SUBDUE toward appropriate substructures for a particular domain or discovery goal, and the use of an inexact graph match allows a controlled amount of deviations in the instance of a substructure concept. We describe the application of SUBDUE to a variety of domains. We also discuss approaches to combining SUBDUE with non-structural discovery systems.
Conference Paper
We consider the problem of finding highly correlated pairs in a large data set. That is, given a threshold not too small, we wish to report all the pairs of items (or binary attributes) whose (Pearson) correlation coefficients are greater than the threshold. Correlation analysis is an important step in many statistical and knowledge-discovery tasks. Normally, the number of highly correlated pairs is quite small compared to the total number of pairs. Identifying highly correlated pairs in a naive way by computing the correlation coefficients for all the pairs is wasteful. With massive data sets, where the total number of pairs may exceed the main-memory capacity, the computational cost of the naive method is prohibitive. In their KDD'04 paper (15), Hui Xiong et al. address this problem by proposing the TAPER algorithm. The algorithm goes through the data set in two passes. It uses the first pass to generate a set of candidate pairs whose correlation coefficients are then computed directly in the second pass. The efficiency of the algorithm depends greatly on the selectivity (pruning power) of its candidate-generating stage. In this work, we adopt the general framework of the TAPER algorithm but propose a different candidate-generation method. For a pair of items, TAPER's candidate-generation method considers only the frequencies (supports) of individual items. Our method also considers the frequency (support) of the pair but does not explicitly count this frequency (support). We give a simple randomized algorithm whose false-negative probability is negligible. The space and time complexities of generating the candidate set in our algorithm are asymptotically the same as TAPER's. We conduct experiments on synthesized and real data. The results show
Conference Paper
The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description length principle. The SUBDUE system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. SUBDUE uses a computationally-bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints. In addition to the minimum description length principle, other background knowledge can be used by SUBDUE to guide the search towards more appropriate substructures. Experiments in a variety of domains demonstrate SUBDUE's ability to find substructures capable of compressing the original data and to discover structural concepts important to the domain. Description of Online Appendix: This is a compressed tar file containing the SUBDUE discovery system, written in C. The program accepts as input databases represented in graph form, and will output discovered substructures with their corresponding value.
Conference Paper
A machine learning technique called Graph-Based Induction (GBI) efficiently extracts typical patterns from directed graph data by stepwise pair expansion (pairwise chunking). In this paper, we expand the capability of the Graph-Based Induction to handle not only tree structured data but also multi-inputs/outputs nodes and loop structure (including a self-loop) which cannot be treated in the conventional way. The method is verified to work as expected using artificially generated data and we evaluated experimentally the computation time of the implemented program. We, further, show the effectiveness of our approach by applying it to two kinds of the real-world data: World Wide Web browsing data and DNA sequence data.
Article
A new graph similarity calculation procedure is introduced for comparing labeled graphs. Given a minimum similarity threshold, the procedure consists of an initial screening process to determine whether it is possible for the measure of similarity between the two graphs to exceed the minimum threshold, followed by a rigorous maximum common edge subgraph (MCES) detection algorithm to compute the exact degree and composition of similarity. The proposed MCES algorithm is based on a maximum clique formulation of the problem and is a significant improvement over other published algorithms. It presents new approaches to both lower and upper bounding as well as vertex selection.
Article
The derivation of frequent subgraphs from a dataset of labeled graphs has high computational complexity because the hard problems of isomorphism and subgraph isomorphism have to be solved as part of this derivation. To deal with this computational complexity, all previous approaches have focused on one particular kind of graph. In this paper, we propose an approach to conduct a complete search for various classes of frequent subgraphs in a massive dataset of labeled graphs within a practical time. The power of our approach comes from the algebraic representation of graphs, its associated operations and well-organized bias constraints to limit the search space efficiently. The performance has been evaluated using real world datasets, and the high scalability and flexibility of our approach have been confirmed with respect to the amount of data and the computation time.
Article
Similarity search of complex structures is an important operation in graph-related applications since exact matching is often too restrictive. In this article, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum allowed feature misses, our structural filtering algorithm can filter graphs without performing pairwise similarity computation. It is further shown that using either too few or too many features can result in poor filtering performance. Thus the challenge is to design an effective feature set selection strategy that could maximize the filtering capability. We prove that the complexity of optimal feature set selection is Ω(2^m) in the worst case, where m is the number of features for selection. In practice, we identify several criteria to build effective feature sets for filtering, and demonstrate that combining features with similar size and selectivity can improve the filtering and search performance significantly within a multifilter composition framework. The proposed feature-based filtering concept can be generalized and applied to searching approximate nonconsecutive sequences, trees, and other structured data as well.
Article
Graph database models can be defined as those in which data structures for the schema and instances are modeled as graphs or generalizations of them, and data manipulation is expressed by graph-oriented operations and type constructors. These models took off in the eighties and early nineties alongside object-oriented models. Their influence gradually died out with the emergence of other database models, in particular geographical, spatial, semistructured, and XML. Recently, the need to manage information with graph-like nature has reestablished the relevance of this area. The main objective of this survey is to present the work that has been conducted in the area of graph database modeling, concentrating on data structures, query languages, and integrity constraints.
Conference Paper
Existing algorithms that mine graph datasets to discover patterns corresponding to frequently occurring subgraphs can operate efficiently on graphs that are sparse, contain a large number of relatively small connected components, have vertices with low and bounded degrees, and contain well-labeled vertices and edges. However, for graphs that do not share these characteristics, these algorithms become highly unscalable. In this paper we present a heuristic algorithm called GREW to overcome the limitations of existing complete or heuristic frequent subgraph discovery algorithms. GREW is designed to operate on a large graph and to find patterns corresponding to connected subgraphs that have a large number of vertex-disjoint embeddings. Our experimental evaluation shows that GREW is efficient, can scale to very large graphs, and find non-trivial patterns.
Conference Paper
Whereas data mining in structured data focuses on frequent data values, in semistructured and graph data the emphasis is on frequent labels and common topologies. Here, the structure of the data is just as important as its content. We study the problem of discovering typical patterns of graph data. The discovered patterns can be useful for many applications, including: compact representation of source information and a road-map for browsing and querying information sources. Difficulties arise in the discovery task from the complexity of some of the required sub-tasks, such as sub-graph isomorphism. This paper proposes a new algorithm for mining graph data, based on a novel definition of support. Empirical evidence shows practical, as well as theoretical, advantages of our approach.
Conference Paper
As data mining techniques are being increasingly applied to non-traditional domains, existing approaches for finding frequent itemsets cannot be used as they cannot model the requirement of these domains. An alternate way of modeling the objects in these data sets is to use graphs. Within that model, the problem of finding frequent patterns becomes that of discovering subgraphs that occur frequently over the entire set of graphs. We present a computationally efficient algorithm for finding all frequent subgraphs in large graph databases. We evaluated the performance of the algorithm by experiments with synthetic datasets as well as a chemical compound dataset. The empirical results show that our algorithm scales linearly with the number of input transactions and it is able to discover frequent subgraphs from a set of graph transactions reasonably fast, even though we have to deal with computationally hard problems such as canonical labeling of graphs and subgraph isomorphism which are not necessary for traditional frequent itemset discovery.
Article
Whereas data mining in structured data focuses on frequent data values, in semistructured and graph data mining, the issue is frequent labels and common specific topologies. The structure of the data is just as important as its content. We study the problem of discovering typical patterns of graph data, a task made difficult because of the complexity of required subtasks, especially subgraph isomorphism. In this paper, we propose a new apriori-based algorithm for mining graph data, where the basic building blocks are relatively large, disjoint paths. The algorithm is proven to be sound and complete. Empirical evidence shows practical advantages of our approach for certain categories of graphs
Article
Given a user-specified minimum correlation threshold θ and a market-basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions are large, the computation cost of this query can be very high. The goal of this paper is to provide computationally efficient algorithms to answer the all-strong-pairs correlation query. Indeed, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient, but also exhibits special monotone properties which allow pruning of many item pairs even without computing their upper bounds. A two-step all-strong-pairs correlation query (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent of or improves when the number of items is increased in data sets with Zipf-like or linear rank-support distributions. Experimental results from synthetic and real-world data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives. Finally, we demonstrate that the algorithmic ideas developed in the TAPER algorithm can be extended to efficiently compute negative correlation and uncentered Pearson's correlation coefficient.
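The phi coefficient and its support-only upper bound can be written directly from the standard definitions; the sketch below illustrates the filter-and-refine idea rather than reproducing the paper's exact pseudocode.

```python
from math import sqrt

def phi(supp_a, supp_b, supp_ab):
    """Pearson's correlation (phi coefficient) for two binary items,
    computed from relative supports."""
    denom = sqrt(supp_a * supp_b * (1 - supp_a) * (1 - supp_b))
    return (supp_ab - supp_a * supp_b) / denom

def phi_upper_bound(supp_a, supp_b):
    """Support-only upper bound used for pruning; it follows from
    supp(AB) <= min(supp(A), supp(B)).  Assumes supp_a >= supp_b."""
    return sqrt((supp_b / supp_a) * ((1 - supp_a) / (1 - supp_b)))

# A pair can be skipped whenever phi_upper_bound(...) < theta, without
# ever counting the pair's joint support.
```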
Article
Many techniques for association rule mining and feature selection require a suitable metric to capture the dependencies among variables in a data set. For example, metrics such as support, confidence, lift, correlation, and collective strength are often used to determine the interestingness of association patterns. However, many such measures provide conflicting information about the interestingness of a pattern, and the best metric to use for a given application domain is rarely known. In this paper, we present an overview of various measures proposed in the statistics, machine learning and data mining literature. We describe several key properties one should examine in order to select the right measure for a given application domain. A comparative study of these properties is made using twenty one of the existing measures. We show that each measure has different properties which make them useful for some application domains, but not for others. We also present two scenarios in which most of the existing measures agree with each other, namely, support-based pruning and table standardization. Finally, we present an algorithm to select a small set of tables such that an expert can select a desirable measure by looking at just this small set of tables.
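As a small worked example of the kind of measures surveyed, the following snippet computes support, confidence and lift for a rule A -> B from raw counts; the property analysis in the paper is not reproduced.

```python
def basic_measures(n_ab, n_a, n_b, n):
    """Support, confidence and lift for the rule A -> B from raw counts:
    n_ab = transactions with both A and B, n_a/n_b = with A/B, n = total."""
    supp_a, supp_b, supp_ab = n_a / n, n_b / n, n_ab / n
    confidence = supp_ab / supp_a          # P(B | A)
    lift = supp_ab / (supp_a * supp_b)     # > 1 means positive dependence
    return {"support": supp_ab, "confidence": confidence, "lift": lift}

# Example: 100 transactions, A in 40, B in 50, both in 30:
# support 0.30, confidence 0.75, lift 1.5 (up to floating-point rounding).
print(basic_measures(30, 40, 50, 100))
```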
Thesis
Only the table of contents is available for this entry: 1 Introduction (Problem Statement; Overview of Approach; Outline of Thesis); 2 Background, covering 2.1 Graph Matching (Definitions and Background; Common Approaches; Invariants; Conventional Approaches; Other Approaches) and 2.2 Solid Modeling and Feature-Based Design (Constructive Solid Geometry (CSG); Boundary Representation (B-rep); Feature-based Modeling; Feature Recognition From Solid Models; ...).
Article
Weighted frequent pattern (WFP) mining is more practical than frequent pattern mining because it can consider different semantic significance (weight) of the items. For this reason, WFP mining becomes an important research issue in data mining and knowledge discovery. However, existing algorithms cannot be applied for incremental and interactive WFP mining and also for stream data mining because they are based on a static database and require multiple database scans. In this paper, we present two novel tree structures IWFPTWA (Incremental WFP tree based on weight ascending order) and IWFPTFD (Incremental WFP tree based on frequency descending order), and two new algorithms IWFPWA and IWFPFD for incremental and interactive WFP mining using a single database scan. They are effective for incremental and interactive mining to utilize the current tree structure and to use the previous mining results when a database is updated or a minimum support threshold is changed. IWFPWA gets advantage in candidate pattern generation by obtaining the highest weighted item in the bottom of IWFPTWA. IWFPFD ensures that any non-candidate item cannot appear before candidate items in any branch of IWFPTFD and thus speeds up the prefix tree and conditional tree creation time during mining operation. IWFPTFD also achieves the highly compact incremental tree to save memory space. To our knowledge, this is the first research work to perform single-pass incremental and interactive mining for weighted frequent patterns. Extensive performance analyses show that our tree structures and algorithms are very efficient and scalable for single-pass incremental and interactive WFP mining.
Article
Subgraph isomorphism can be determined by means of a brute-force tree-search enumeration procedure. In this paper a new algorithm is introduced that attains efficiency by inferentially eliminating successor nodes in the tree search. To assess the time actually taken by the new algorithm, subgraph isomorphism, clique detection, graph isomorphism, and directed graph isomorphism experiments have been carried out with random and with various nonrandom graphs. A parallel asynchronous logic-in-memory implementation of a vital part of the algorithm is also described, although this hardware has not actually been built. The hardware implementation would allow very rapid determination of isomorphism.
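A plain backtracking enumeration of the kind the paper starts from can be sketched as follows; the inferential refinement step that gives the algorithm its efficiency is deliberately omitted, so this is only the brute-force baseline for unlabeled, undirected graphs.

```python
def subgraph_isomorphism(pattern, target):
    """Brute-force backtracking test: is `pattern` isomorphic to some subgraph
    of `target`?  Graphs are dicts mapping a vertex to the set of its
    neighbours.  No refinement step, so this is only the baseline tree search,
    not Ullmann's full algorithm."""
    p_nodes = list(pattern)

    def extend(mapping, used):
        if len(mapping) == len(p_nodes):
            return True
        u = p_nodes[len(mapping)]
        for v in target:
            if v in used:
                continue
            # every already-mapped neighbour of u must map to a neighbour of v
            if all(mapping[w] in target[v] for w in pattern[u] if w in mapping):
                mapping[u] = v
                if extend(mapping, used | {v}):
                    return True
                del mapping[u]
        return False

    return extend({}, set())
```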
Article
A new concept-learning method called CLIP (concept learning from inference patterns) is proposed that learns new concepts from inference patterns, not from positive/negative examples that most conventional concept learning methods use. The learned concepts enable an efficient inference on a more abstract level. We use a colored digraph to represent inference patterns. The graph representation is expressive enough and enables the quantitative analysis of the inference pattern frequency. The learning process consists of the following two steps: (1) Convert the original inference patterns to a colored digraph, and (2) Extract a set of typical patterns which appears frequently in the digraph. The basic idea is that the smaller the digraph becomes, the smaller the amount of data to be handled becomes and, accordingly, the more efficient the inference process that uses these data. Also, we can reduce the size of the graph by replacing each frequently appearing graph pattern with a single node, and each reduced node represents a new concept. Experimentally, CLIP automatically generates multilevel representations from a given physical/single-level representation of a carry-chain circuit. These representations involve abstract descriptions of the circuit, such as mathematical and logical descriptions.
Conference Paper
It is shown that any recognition problem solved by a polynomial time-bounded nondeterministic Turing machine can be "reduced" to the problem of determining whether a given propositional formula is a tautology. Here "reduced" means, roughly speaking, that the first problem can be solved deterministically in polynomial time provided an oracle is available for solving the second. From this notion of reducible, polynomial degrees of difficulty are defined, and it is shown that the problem of determining tautologyhood has the same polynomial degree as the problem of determining whether the first of two given graphs is isomorphic to a subgraph of the second. Other examples are discussed. A method of measuring the complexity of proof procedures for the predicate calculus is introduced and discussed.
Conference Paper
Process engineering and workflow analysis both aim to enhance business operations, product manufacturing and software development by applying proven process models to solve individual problem cases. However, most applications assume that a process model already exists and is available. In many situations, though, the more important and interesting problem to solve is that of discovering or recovering the model by reverse engineering given an abundance of execution logs or history. In this paper, a new algorithmic solution is presented for process model discovery, which is treated as a special case of the Maximal Overlap Sets problem in graph matching. The paradigm of planning and scheduling by resource management is used to tackle the combinatorial complexity and to achieve efficiency and practicality in real world applications. The effectiveness of the algorithm, for this generally NP (nondeterministic polynomial) problem, is demonstrated with a broad set of experimental results.
Conference Paper
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study, we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods.
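A minimal sketch of the FP-tree construction step is given below, assuming items are strings or other hashable, orderable values; header-table node links and the FP-growth mining recursion are omitted.

```python
from collections import defaultdict

class FPNode:
    """One node of a frequent-pattern (FP) tree: an item, a count, children."""
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    """Count items, drop infrequent ones, then insert each transaction with
    its items in descending frequency order so that shared prefixes are
    compressed into shared paths.  (Header-table links and the FP-growth
    recursion over conditional trees are not part of this sketch.)"""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i for i, c in counts.items() if c >= min_support}
    root = FPNode()
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, counts
```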
Conference Paper
This paper proposes a novel approach named AGM to efficiently mine the association rules among the frequently appearing substructures in a given graph dataset. A graph transaction is represented by an adjacency matrix, and the frequent patterns appearing in the matrices are mined through the extended algorithm of the basket analysis. Its performance has been evaluated for the artificial simulation data and the carcinogenesis data of Oxford University and NTP. Its high efficiency has been confirmed for the size of a real-world problem. ...
Article
Objects in a database are interrelated. When an update operation is applied to an object, it may also impact on its related objects, depending on the semantics of their relationships. Current OODBMSs provide no support for update propagation but hard-coding. In this paper, we study update propagation support for generic update operations in object-oriented databases. We take a declarative approach, specifying propagation policies for each identified reference attribute in classes of an object-oriented database schema. Propagation policies for generic update propagation are well defined. However, we also discover that potential conflicts among propagation policies may occur if the policies can be arbitrarily specified by a designer. Therefore, we promote the update propagation problem to a higher level, investigating possible dependencies between objects. As such, the designer only needs to specify the dependency property for each reference attribute. Propagation policies are predefined for each type of dependency. By introducing some restrictions on an object-oriented database schema, conflict-free propagation policies can be achieved. Implementation issues for update propagation support in object-oriented database systems are also addressed.
Article
High utility pattern (HUP) mining is one of the most important research issues in data mining. Although HUP mining extracts important knowledge from databases, it requires long calculations and multiple database scans. Therefore, HUP mining is often unsuitable for real-time data processing schemes such as data streams. Furthermore, many HUPs may be unimportant due to the poor correlations among the items inside of them. Hence, the fast discovery of fewer but more important HUPs would be very useful in many practical domains. In this paper, we propose a novel framework to introduce a very useful measure, called frequency affinity, among the items in a HUP and the concept of interesting HUP with a strong frequency affinity for the fast discovery of more applicable knowledge. Moreover, we propose a new tree structure, utility tree based on frequency affinity (UTFA), and a novel algorithm, high utility interesting pattern mining (HUIPM), for single-pass mining of HUIPs from a database. Our approach mines fewer but more valuable HUPs, significantly reduces the overall runtime of existing HUP mining algorithms and is applicable to real-time data processing. Extensive performance analyses show that the proposed HUIPM algorithm is very efficient and scalable for interesting HUP mining with a strong frequency affinity.
Article
In this paper, we consider a data mining problem for semi-structured data. Modeling semi-structured data as labeled ordered trees, we present an efficient algorithm for discovering frequent substructures from a large collection of semi-structured data. By extending the enumeration technique developed by Bayardo (SIGMOD'98) for discovering long itemsets, our algorithm scales almost linearly in the total size of maximal tree patterns contained in an input collection, depending mildly on the size of the longest pattern. We also developed several pruning techniques that significantly speed up the search. Experiments on Web data show that our algorithm, combined with the proposed pruning techniques, runs efficiently on real-life datasets over a wide range of parameters.
Article
Correlation mining has gained great success in many application domains for its ability to capture underlying dependencies between objects. However, research on correlation mining from graph databases is still lacking despite that graph data, especially in scientific domains, proliferate in recent years. We propose a new problem of correlation mining from graph databases, called Correlated Graph Search (CGS). CGS adopts Pearson's correlation coefficient as the correlation measure to take into account the occurrence distributions of graphs. However, the CGS problem poses significant challenges, since every subgraph of a graph in the database is a candidate but the number of subgraphs is exponential. We derive two necessary conditions that set bounds on the occurrence probability of a candidate in the database. With this result, we devise an efficient algorithm that mines the candidate set from a much smaller projected database and thus we are able to obtain a significantly smaller set of candidates. Three heuristic rules are further developed to refine the candidate set. We also make use of the bounds to directly answer high-support queries without mining the candidates. Our experimental results demonstrate the efficiency of our algorithm. Finally, we show that our algorithm provides a general solution when most of the commonly used correlation measures are used to generalize the CGS problem.
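The correlation measure CGS adopts can be illustrated by computing Pearson's correlation between the occurrence indicator vectors of two graphs across the database; `contains` is an assumed subgraph-containment test, not something provided by the paper.

```python
from math import sqrt

def graph_correlation(graph_db, contains, query, candidate):
    """Pearson's correlation between the occurrences of `query` and
    `candidate` across the database, as in the CGS problem statement.
    `contains(g, p)` is an assumed subgraph-containment test."""
    n = len(graph_db)
    occ_q = [1 if contains(g, query) else 0 for g in graph_db]
    occ_c = [1 if contains(g, candidate) else 0 for g in graph_db]
    supp_q, supp_c = sum(occ_q) / n, sum(occ_c) / n
    supp_qc = sum(a & b for a, b in zip(occ_q, occ_c)) / n
    denom = sqrt(supp_q * supp_c * (1 - supp_q) * (1 - supp_c))
    return (supp_qc - supp_q * supp_c) / denom if denom else 0.0
```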
Article
Thesis (Ph. D.)--University of Massachusetts Boston, 2002.
Article
A new algorithm is presented that allows one to uncover underlying relations between the biological activity of a diverse set of molecules and a global parameter such as the partition coefficient, solubility, and/or the redox properties.
Conference Paper
We introduce a novel method of indexing graph databases in order to facilitate subgraph isomorphism and similarity queries. The index is comprised of two major data structures. The primary structure is a directed acyclic graph which contains a node for each of the unique, induced subgraphs of the database graphs. The secondary structure is a hash table which cross-indexes each subgraph for fast isomorphic lookup. In order to create a hash key independent of isomorphism, we utilize a code-based canonical representation of adjacency matrices, which we have further refined to improve computation speed. We validate the concept by demonstrating its effectiveness in answering queries for two practical datasets. Our experiments show that for subgraph isomorphism queries, our method outperforms existing methods by more than an order of magnitude.
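One naive way to obtain a code that is independent of vertex ordering is to minimize the adjacency-matrix code over all permutations, as sketched below for tiny unlabeled graphs; the paper's refined code-based canonical representation is considerably faster and handles labels.

```python
from itertools import permutations

def naive_canonical_code(adj):
    """Smallest upper-triangle adjacency code over every vertex ordering.
    Isomorphic (unlabeled, undirected) graphs get the same code, so it can
    serve as an isomorphism-independent hash key -- but this exhaustive
    version is exponential and only illustrates the idea."""
    n = len(adj)
    best = None
    for p in permutations(range(n)):
        code = tuple(adj[p[i]][p[j]] for i in range(n) for j in range(i + 1, n))
        if best is None or code < best:
            best = code
    return best
```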
Conference Paper
Graphs have become popular for modeling structured data. As a result, graph queries are becoming common and graph indexing has come to play an essential role in query processing. We introduce the concept of a graph closure, a generalized graph that represents a number of graphs. Our indexing technique, called Closure-tree, organizes graphs hierarchically where each node summarizes its descendants by a graph closure. Closure-tree can efficiently support both subgraph queries and similarity queries. Subgraph queries find graphs that contain a specific subgraph, whereas similarity queries find graphs that are similar to a query graph. For subgraph queries, we propose a technique called pseudo subgraph isomorphism which approximates subgraph isomorphism with high accuracy. For similarity queries, we measure graph similarity through edit distance using heuristic graph mapping methods. We implement two kinds of similarity queries: K-NN query and range query. Our experiments on chemical compounds and synthetic graphs show that for subgraph queries, Closure-tree outperforms existing techniques by up to two orders of magnitude in terms of candidate answer set size and index size. For similarity queries, our experiments validate the quality and efficiency of the presented algorithms.
Conference Paper
We investigate new approaches for frequent graph-based pattern mining in graph datasets and propose a novel algorithm called gSpan (graph-based substructure pattern mining), which discovers frequent substructures without candidate generation. gSpan builds a new lexicographic order among graphs, and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently. Our performance study shows that gSpan substantially outperforms previous algorithms, sometimes by an order of magnitude.
Article
Data mining is defined as the process of discovering significant and potentially useful patterns in large volumes of data. Discovering associations between items in a large database is one such data mining activity. In finding associations, support is used as an indicator as to whether an association is interesting. In this paper, we discuss three alternative interest measures for associations: any-confidence, all-confidence, and bond. We prove that the important downward closure property applies to both all-confidence and bond. We show that downward closure does not hold for any-confidence. We also prove that, if associations have a minimum all-confidence or minimum bond, then those associations will have a given lower bound on their minimum support and the rules produced from those associations will have a given lower bound on their minimum confidence as well. However, associations that have that minimum support (and likewise their rules that have minimum confidence) may not satisfy the minimum all-confidence or minimum bond constraint. We describe the algorithms that efficiently find all associations with a minimum all-confidence or minimum bond and present some experimental results.
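Written from the usual definitions, the snippet below computes any-confidence, all-confidence and bond of an itemset directly from a transaction list; the paper's efficient mining algorithms are not reproduced.

```python
def interest_measures(transactions, itemset):
    """any-confidence, all-confidence and bond of `itemset`, following the
    usual definitions: supp(X) divided by the minimum / maximum single-item
    support, and by the support of transactions containing *any* item of X.
    Assumes every item of X occurs in at least one transaction."""
    itemset = set(itemset)
    n = len(transactions)
    supp_x = sum(itemset <= set(t) for t in transactions) / n
    item_supps = [sum(i in t for t in transactions) / n for i in itemset]
    supp_any = sum(bool(itemset & set(t)) for t in transactions) / n
    return {
        "any_confidence": supp_x / min(item_supps),
        "all_confidence": supp_x / max(item_supps),
        "bond": supp_x / supp_any,
    }
```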
Article
A tight integration of Mitchell's version space algorithm with Agrawal et al.'s Apriori algorithm is presented. The algorithm can be used to generate patterns that satisfy a variety of constraints on data. Constraints that can be imposed on patterns include the generality relation among patterns and imposing a minimum or a maximum frequency on data sets of interest. The theoretical framework is applied to an important application in chemo-informatics, i.e. that of finding fragments of interest within a given set of compounds. Fragments are linearly connected substructures of compounds. An implementation as well as preliminary experiments within the application are presented.
Article
One of the basic problems in knowledge discovery in databases (KDD) is the following: given a data set r, a class L of sentences for defining subgroups of r, and a selection predicate, find all sentences of L deemed interesting by the selection predicate. We analyze the simple levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. For this, we introduce the concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm. We also consider the verification problem of a KDD process: given r and a set of sentences S ⊆ L, determine whether S is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.
Article
To formulate a meaningful query on semistructured data, such as on the Web, that matches some of the source's structure, we need first to discover something about how the information is represented in the source. This is referred to as schema discovery and was considered for a single object recently. In the case of multiple objects, the task of schema discovery is to identify typical structuring information of those objects as a whole. We motivate schema discovery in this general setting and propose a framework and algorithm for it. We apply the framework to a real Web database, the Internet Movies Database, to discover typical schema of most voted movies. As the amount of data available on-line grows rapidly, we find that more and more of the data is semistructured. In the semistructured world, data has no absolute schema fixed in advance, and the structure of data may be irregular or incomplete. Semistructured data arise when the source does not impose a ...
Article
The structure of a document refers to the role and hierarchy of subdocument references. Many online documents are similarly structured, though not identically structured. We study the problem of discovering "typical" structures of a collection of such documents, where the user specifies the minimum frequency of a typical structure. We will consider structural features of subdocument references such as labeling, nesting, ordering, cyclicity, and wild-card references, like those found on the Web and digital libraries. Typical structures can be used to serve the following purposes. (a) The "table-of-contents" for gaining the general information of a source. (b) A road map for browsing and querying a source. (c) A basis for clustering documents. (d) Partial schemas for building structured layers to provide standard database access methods. (e) User/customer's interests and browsing patterns. We present a solution to the discovery problem.
Article
Mining frequent trees is very useful in domains like bioinformatics, web mining, mining semi-structured data, and so on. We formulate the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. We present TreeMiner, a novel algorithm to discover all frequent subtrees in a forest, using a new data structure called scope-list. We contrast TreeMiner with a pattern matching tree mining algorithm (PatternMatcher). We conduct detailed experiments to test the performance and scalability of these methods. We find that TreeMiner outperforms the pattern matching approach by a factor of 4 to 20, and has good scaleup properties. We also present an application of tree mining to analyze real web logs for usage patterns.
Article
Obtaining accurate structural alerts for the causes of chemical cancers is a problem of great scientific and humanitarian value. This paper follows up on earlier research that demonstrated the use of Inductive Logic Programming (ILP) for predictions for the related problem of mutagenic activity amongst nitroaromatic molecules. Here we are concerned with predicting carcinogenic activity in rodent bioassays using data from the U.S. National Toxicology Program conducted by the National Institute of Environmental Health Sciences. The 330 chemicals used here are significantly more diverse than the previous study, and form the basis for obtaining Structure-Activity Relationships (SARs) relating molecular structure to cancerous activity in rodents. We describe the use of the ILP system Progol to obtain SARs from this data. The rules obtained from Progol are comparable in accuracy to those from expert chemists, and more accurate than most state-of-the-art toxicity prediction methods. The rules can also be interpreted to give clues about the biological and chemical mechanisms of carcinogenesis, and make use of those learnt by Progol for mutagenesis. Finally, we present details of, and predictions for, an ongoing international blind trial aimed specifically at comparing prediction methods. This trial provides ILP algorithms an opportunity to participate at the leading-edge of scientific discovery.
Article
We present a new approach for personalized presentation
Article
Machine Learning algorithms are being increasingly used for knowledge discovery tasks. Approaches can be broadly divided by distinguishing discovery of procedural from that of declarative knowledge. Client requirements determine which of these is appropriate. This paper discusses an experimental application of machine learning in an area related to drug design. The bottleneck here is in finding appropriate constraints to reduce the large number of candidate molecules to be synthesised and tested. Such constraints can be viewed as declarative specifications of the structural elements necessary for high medicinal activity and low toxicity. The first-order representation used within Inductive Logic Programming (ILP) provides an appropriate description language for such constraints. Within this application area knowledge accreditation requires not only a demonstration of predictive accuracy but also, and crucially, a certification of novel insight into the structural chemistry.