A Multi-Relational Rule Discovery System
Mahmut Uludağ1, Mehmet R. Tolun2, Thure Etzold1
1 LION Bioscience Ltd., Compass House, 80-82, Newmarket Road, Cambridge, CB5 8DZ,
United Kingdom
{mahmut.uludag, thure.etzold}@uk.lionbioscience.com
2 Atilim University, Dept. of Computer Engineering, 06836 Incek, Ankara, Turkey
tolun@atilim.edu.tr
http://cmpe.emu.edu.tr/rila/
Abstract. This paper describes a rule discovery system that has been developed
as part of an ongoing research project. The system discovers multi-relational
rules from data stored in relational databases. Its basic assumption is that
the objects to be analyzed are stored in a set of tables. The discovered
multi-relational rules can be used either to predict an unknown object
attribute value or to reveal hidden relationships between the objects'
attribute values. The system was designed to work with any 'connected' schema
in which the tables concerned are linked by foreign keys. To achieve
reasonable performance, the hypothesis search algorithm constructs new
hypotheses by refining previously constructed hypotheses, thereby avoiding
the cost of re-computation.
1 Introduction
Most current data mining algorithms are designed to use data from a single
table and require each object to be described by a fixed set of attributes.
Compared to a single table of data, a relational database containing multiple
tables can represent more complex and structured data. In addition, a
significant amount of scientific data today is stored in relational databases.
For these reasons, it is important to have discovery algorithms that operate
on relational data in its natural form, without requiring the data to be
viewed as a single table.
A relational data model consisting of multiple tables may represent several
object classes: within a schema, one set of tables may represent one class of
objects while a different set represents another. Before starting the
discovery process, users should analyze the schema and select the tables that
represent the kind of objects they are interested in. One of the selected
tables is central for the objects, and each row in that table should
correspond to a single object in the database. In previous multi-relational
data mining publications, this central table is called the 'target table' in
[1] and [2], the 'primary table' in [3], the 'master relation' in [4], and the
'hub table' in [5].
[Figure 1 depicts two tables, gene (Geneid, Essential, Chromosome,
Localization) and composition (Geneid, Motif, Function, Complex, Class,
Phenotype), together with the rule:]
IF Composition.Class = 'ATPases' AND Composition.Complex = 'Intracellular transport'
THEN Gene.Localization = 'extracellular'
Fig. 1. An example multi-relational rule that refers to the composition table
in its conditions and refers to the gene table in its right-hand side
In a multi-relational rule, each attribute name in the conditions is qualified
with the name of the relational table to which the attribute belongs. Figure 1
shows an example of such a multi-relational rule.
The concepts suggested in the multi-relational data mining framework described
in [1], namely selection graphs, the target table and the target attribute,
all helped during the initial stages of building the present multi-relational
rule discovery system.
2 Architecture
The architecture of the rule discovery system is shown in Figure 2. The system
uses the JDBC API to communicate with database management systems (DBMSs).
When a data mining session is started, the system sends meta-data queries to
the connected DBMS. After the user selects a set of tables, the target table
and the target attribute, the data mining process starts, during which the
system sends a number of SQL queries to the DBMS. These queries are generally
used for building valid hypotheses about the data.
In order to reduce the complexity of communication between the rule discovery
system and the DBMS, the information about covered objects and the discretized
columns is stored in temporary tables in the DBMS rather than in internal data
structures on the rule discovery system side. These temporary tables were also
adopted for performance reasons.
The temporary table 'covered' has two columns, 'id' and 'mark'. Each time the
rule discovery system starts processing a new class, this table is
reinitialized by inserting one row for each object belonging to the current
class. The 'id' field is given the value of the primary key and the 'mark'
field is set to zero. When a new rule is generated, the 'mark' fields of the
rows that refer to the objects covered by the new rule are changed to one.
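As an illustration, the bookkeeping around the 'covered' table can be sketched as follows. This is a minimal stand-in using SQLite from Python (the paper's system uses Java and JDBC against a full DBMS), and the gene rows are invented:

```python
import sqlite3

# Minimal in-memory sketch of the 'covered' bookkeeping (SQLite from Python
# stands in for the paper's Java/JDBC setup; the gene rows are invented).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gene (pk INTEGER PRIMARY KEY, localization TEXT)")
con.executemany("INSERT INTO gene VALUES (?, ?)",
                [(1, "nucleus"), (2, "nucleus"), (3, "cytoplasm")])
con.execute("CREATE TABLE covered (id INTEGER, mark INTEGER)")

def reinit_covered(current_class):
    """One row per object of the current class, mark = 0."""
    con.execute("DELETE FROM covered")
    con.execute("INSERT INTO covered SELECT pk, 0 FROM gene "
                "WHERE localization = ?", (current_class,))

def mark_covered(object_ids):
    """Set mark = 1 for objects covered by a newly generated rule."""
    con.executemany("UPDATE covered SET mark = 1 WHERE id = ?",
                    [(i,) for i in object_ids])

reinit_covered("nucleus")
mark_covered([1])
remaining = [r[0] for r in
             con.execute("SELECT id FROM covered WHERE mark = 0 ORDER BY id")]
```

After one rule covering object 1, only the still-uncovered objects of the class remain with mark = 0.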
[Figure 2 depicts the discovery system communicating with the DBMS through a
JDBC driver: hypotheses and SQL/meta-data queries flow from the system to the
DBMS, result sets flow back, and the system outputs rules.]
Fig. 2. The basic architecture of the system
The rule discovery system discretizes numeric attribute values during the
pre-processing stage of a data mining session and initializes the table 'disc'
by inserting one row per discretization interval. Each interval is represented
by the following columns.
• table_name: name of the table the numeric attribute is from
• column_name: name of the column the numeric attribute is associated with
• interval_name: name of the interval between two successive cut points
• min_val: minimum value of the interval
• max_val: maximum value of the interval
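A minimal sketch of how the 'disc' table might be initialized, assuming equal-width intervals (only the table layout follows the paper; the binning scheme and the SQLite-via-Python setup are illustrative assumptions):

```python
import sqlite3

# Illustrative initialization of the 'disc' table with equal-width intervals
# (the binning scheme here is an assumption; the table layout follows the
# paper).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE disc (
    table_name TEXT, column_name TEXT, interval_name TEXT,
    min_val REAL, max_val REAL)""")

def init_disc(table_name, column_name, values, n_bins):
    """Insert one row per discretization interval of the given column."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    for i in range(n_bins):
        con.execute("INSERT INTO disc VALUES (?, ?, ?, ?, ?)",
                    (table_name, column_name, f"bin{i}",
                     lo + i * width, lo + (i + 1) * width))

init_disc("interaction", "expr", [0.0, 1.0, 2.0, 5.0, 10.0], 4)
rows = list(con.execute(
    "SELECT interval_name, min_val, max_val FROM disc ORDER BY min_val"))
```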
There is a concurrency problem with these temporary tables if more than one
user runs the rule discovery system simultaneously on the same data. We
therefore plan to improve the solution to take concurrency issues into
account.
3 The Algorithm
The multi-relational rule discovery algorithm of the system was adapted from
ILA (Inductive Learning Algorithm) [6]. ILA is a 'covering' learning algorithm
that takes each class in turn and seeks a way of covering all of its instances
while excluding instances that are not in the class. There is also an improved
version of ILA, named ILA-2, that uses a penalty factor to produce better
results on noisy data [7]. In this paper, the adapted version of the ILA-2
algorithm is called Relational-ILA.
ILA requires a particular feature of the objects under consideration to be
used as the dependent attribute for classification. In Relational-ILA, the
dependent attribute corresponds to the target attribute of the target table,
and the target table is assumed to be connected to other tables through
foreign key relations. Relational-ILA is composed of initial hypothesis
generation, hypothesis evaluation, hypothesis refinement and rule selection
steps. The relationship between these steps is summarized in Figure 3 for
processing examples of a single class.
The database schema is treated as a graph where nodes represent relations (tables)
and edges represent foreign keys. The schema graph is searched in a breadth-first
search manner starting from the target table. While searching the schema graph, the
algorithm keeps track of the path followed; no table is processed for the second time.
Fig. 3. The simplified Relational-ILA algorithm for processing examples of a
single class
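The breadth-first traversal of the schema graph can be sketched as follows. The toy graph mirrors the tables of the genes dataset; the path bookkeeping is illustrative, not the system's actual data structure:

```python
from collections import deque

# Toy schema graph whose nodes are tables and edges are foreign keys; the
# table names mirror Figure 4, the traversal logic is the point here.
schema = {
    "gene": ["composition", "interaction"],
    "composition": ["gene"],
    "interaction": ["gene"],
}

def bfs_tables(target_table):
    """Breadth-first search from the target table; every table is visited
    at most once, and the join path used to reach it is recorded."""
    paths = {target_table: []}            # table -> list of traversed edges
    queue = deque([target_table])
    order = []
    while queue:
        table = queue.popleft()
        order.append(table)
        for neighbour in schema[table]:
            if neighbour not in paths:    # no table is processed twice
                paths[neighbour] = paths[table] + [(table, neighbour)]
                queue.append(neighbour)
    return order, paths

order, paths = bfs_tables("gene")
```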
Initial hypotheses are composed of a single condition. While the initial
hypothesis set is being built, each time a table in the schema graph is
visited the following template is used to generate SQL queries that find
hypotheses together with their frequency values.
Select attr, count(distinct targetTable.pk)
from table, covered, table_list
where join_list and
targetTable.targetAttr = currentClass and
covered.id = targetTable.pk and
covered.mark=0
group by attr
In the template,
• attr is the column name for which hypotheses are being searched
• targetTable is the table that has one row for each object being analyzed
• pk is the name of the primary key column in the target table
• table refers to the current table where the hypotheses are being searched
• table_list is the list of tables used to connect the current table to the
target table
• join_list is the list of join conditions used to connect the current table
to the target table
• targetAttr is the class column
• currentClass is the current class for the hypotheses being searched
The template is applied for each column except the foreign and primary key col-
umns and the class column, i.e. the target attribute. If the current table is the target
table then the algorithm uses a simplified version of the template.
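Instantiating the template is mechanical string construction. The helper below is an illustrative sketch; its name, signature and use of a named parameter for the class value are mine, not the system's:

```python
def initial_hypotheses_query(attr, table, target_table, pk, target_attr,
                             table_list, join_list):
    """Fill the paper's initial-hypotheses SQL template for one candidate
    column (helper name and signature are illustrative)."""
    tables = ", ".join([table, "covered"] + table_list)
    joins = " and ".join(join_list)
    return (f"select {attr}, count(distinct {target_table}.{pk}) "
            f"from {tables} "
            f"where {joins} and "
            f"{target_table}.{target_attr} = :currentClass and "
            f"covered.id = {target_table}.{pk} and covered.mark = 0 "
            f"group by {attr}")

q = initial_hypotheses_query(
    attr="composition.class", table="composition", target_table="gene",
    pk="geneid", target_attr="localization", table_list=[],
    join_list=["composition.geneid = gene.geneid"])
```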
The algorithm also needs to know about the frequency of the hypotheses in classes
other than the current class. The following template is used to generate the necessary
SQL queries.
Select attr, count(distinct targetTable.pk)
from table_list, targetTable
where join_list and
targetTable.targetAttr <> currentClass
group by attr
Similarly, for the target table, the algorithm uses a simplified version of
the template. When the above templates are used for a numeric attribute, the
'select' clauses are changed so that the attribute column 'attr' is replaced
by the three columns interval_name, min_val and max_val; the 'group by'
clauses receive the same replacement. Also, the 'from' clauses are extended
with the table 'disc' and the join conditions are extended with the following
conditions.
disc.column_name = 'attr' and
attr > disc.min_val and
attr < disc.max_val
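The numeric-attribute variant of the template can be sketched the same way. The helper is again illustrative, and the 'disc' columns follow the table definition given in Section 2:

```python
def numeric_hypotheses_query(attr, table, target_table, pk, target_attr,
                             table_list, join_list):
    """Numeric-attribute variant: the single column is replaced by the three
    interval columns and the 'disc' table is joined in (illustrative helper,
    not the system's actual code)."""
    cols = "interval_name, min_val, max_val"
    tables = ", ".join([table, "covered", "disc"] + table_list)
    joins = " and ".join(join_list + [
        f"disc.column_name = '{attr}'",
        f"{attr} > disc.min_val",
        f"{attr} < disc.max_val",
    ])
    return (f"select {cols}, count(distinct {target_table}.{pk}) "
            f"from {tables} "
            f"where {joins} and "
            f"{target_table}.{target_attr} = :currentClass and "
            f"covered.id = {target_table}.{pk} and covered.mark = 0 "
            f"group by {cols}")

numeric_q = numeric_hypotheses_query(
    attr="expr", table="interaction", target_table="gene", pk="geneid",
    target_attr="localization", table_list=[],
    join_list=["interaction.geneid1 = gene.geneid"])
```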
After the initial hypotheses are generated, they are sorted by the output of
the ILA hypothesis evaluation function, which measures how well a hypothesis
satisfies the conditions for being a valid rule. If any hypothesis can be used
to generate a new rule, the one with the maximum score is converted into a
rule and the objects covered by that rule are marked in the temporary table
'covered'. If, after rule selection, some rules were selected but some objects
are still not covered, the initial hypotheses are rebuilt using only the
objects not covered by the rules already generated. If no new rule can be
generated, the hypothesis refinement step is started.
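The select-and-cover loop just described can be sketched as follows, with set intersection standing in for the SQL marking of covered objects and a simple size threshold standing in for the ILA evaluation function (both simplifications are mine):

```python
# Hypotheses are (description, ids-of-covered-objects) pairs; set
# intersection stands in for the SQL marking, and a plain size threshold
# stands in for the ILA evaluation function.
def select_rules(hypotheses, all_ids, min_cover=1):
    uncovered = set(all_ids)
    rules = []
    while uncovered:
        # restrict every hypothesis to the objects not yet covered
        scored = [(desc, ids & uncovered) for desc, ids in hypotheses]
        scored = [(d, ids) for d, ids in scored if len(ids) >= min_cover]
        if not scored:
            break        # no valid rule left; refinement would start here
        best_desc, best_ids = max(scored, key=lambda h: len(h[1]))
        rules.append(best_desc)          # convert best hypothesis to a rule
        uncovered -= best_ids            # 'mark' its objects as covered
    return rules, uncovered

hyps = [("Class='ATPases'", {1, 2}), ("Complex='transport'", {2, 3})]
rules, left = select_rules(hyps, {1, 2, 3, 4})
```

Object 4 survives both rules, which is the situation in which the real system would rebuild or refine hypotheses.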
Refinement of a multi-relational hypothesis means extending the description of the
hypothesis. It results in a new selection of objects that is a subset of the selection as-
sociated with the original hypothesis.
Similar to the initial hypotheses build case, to extend a hypothesis, the schema
graph is searched, starting from the target table, by following the foreign key relations
between tables. When a table in the schema graph is reached the following template is
used to generate SQL queries for refining the hypothesis.
Select attr, count(distinct targetTable.pk)
from covered, table_list, hypothesis.table_list()
where targetAttr = currentClass and
join_list and
hypothesis.join_list() and
covered.id = targetTable.pk and
covered.mark=0
group by attr;
Here hypothesis is the hypothesis object being refined. The object has two
methods that help the SQL construction process. The table_list method
returns the list of tables to which the features in the hypothesis refer,
plus the tables that connect each feature to the target table. The
join_list method returns the list of join conditions for the features in
the hypothesis, plus the join conditions that connect each feature to the
target table.
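A hypothesis object exposing these two methods might look like the sketch below; the feature tuple layout is my own guess at a structure sufficient for the described behaviour:

```python
# Illustrative hypothesis object; the feature representation is an
# assumption, only the two method contracts follow the text.
class Hypothesis:
    def __init__(self, features):
        # each feature: (table, column, value, path), where path lists
        # (join_condition, table) pairs linking the feature's table back
        # to the target table
        self.features = features

    def table_list(self):
        """Tables the features refer to, plus the connecting tables."""
        tables = []
        for table, _col, _val, path in self.features:
            for _join, link in path:
                if link not in tables:
                    tables.append(link)
            if table not in tables:
                tables.append(table)
        return tables

    def join_list(self):
        """Join conditions for the features and their connecting paths."""
        joins = []
        for _table, _col, _val, path in self.features:
            for join, _link in path:
                if join not in joins:
                    joins.append(join)
        return joins

h = Hypothesis([("composition", "class", "ATPases",
                 [("composition.geneid = gene.geneid", "composition")])])
```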
In order to know the frequency of the extended hypotheses in the classes other than
the current class the following SQL template is used.
Select attr, count(distinct targetTable.pk)
from targetTable, table_list, hypothesis.table_list()
where targetAttr <> currentClass and
join_list and
hypothesis.join_list()
group by attr;
4 Experiments
A set of experiments was conducted using the genes dataset of KDD Cup 2001
[8]. The original genes dataset contains two tables. One table (interaction)
specifies which genes interact with which other genes. The other table (gene)
specifies a variety of properties of 862 different genes; there can be more
than one row per gene, and the attribute gene_id identifies a gene uniquely.
Tests were conducted to generate rules for the localization attribute.
Because the discovery system requires the target table to have a primary key,
the schema was normalized as shown in Figure 4.
[Figure 4 shows the normalized schema:
Gene (862 rows): Geneid, Essential, Chromosome, Localization
Composition (4346 rows): Geneid, Complex, Class, Phenotype, Motif, Function
Interaction (910 rows): Geneid1, Geneid2, Type, Expression]
Fig. 4. Schema of the KDD Cup 2001 genes data after normalization
The dataset has one numeric attribute and several attributes with missing
values. The numeric attribute, 'expr', in the interaction table was divided
into 20 bins using a class-blind binning method. Missing attribute values
were ignored. In the experiments, the ILA-2 penalty factor was set to 1, and
the maximum hypothesis size was limited to 4.
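The paper does not detail the class-blind binning beyond its name. An equal-frequency split, which ignores the class label entirely, is one plausible reading and is sketched here:

```python
# Equal-frequency cut points: one plausible reading of "class-blind"
# binning (the class label plays no role in choosing the cuts).
def equal_frequency_cut_points(values, n_bins):
    ordered = sorted(values)
    return [ordered[round(i * len(ordered) / n_bins)]
            for i in range(1, n_bins)]

# 20 bins need 19 interior cut points
cuts = equal_frequency_cut_points(list(range(100)), 20)
```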
The results of the experiments are presented in Table 1 and Table 2. In both
tables, the first column shows the selected minimum support pruning
parameter. In Table 1, the second column shows the amount of time the
learning process required on a machine with a 1.8 GHz Athlon processor. The
third column shows the percentage of the training objects covered, and the
last column shows the number of rules generated.
Table 1. The results of the training process

Minimum support pruning percentage   time (sec.)   % covered   # of rules
0%                                   1536          74.71       218
2%                                   883           58.93       87
5%                                   354           44.78       37
In Table 2, the second column shows the percentage of the test objects covered by
the rules and the third column shows the accuracy of the rules on the objects covered.
The last column shows the accuracy of the rules if a default rule is used for objects not
covered by any rule. The default rule was selected as the majority class value.
Table 2. The results on the test data

Minimum support pruning percentage   % covered   % accuracy   % accuracy using a default rule
0%                                   64.57       79.67        62.20
2%                                   55.90       84.50        61.94
5%                                   46.50       87.00        60.60
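As a sanity check on Table 2 (my own arithmetic, not from the paper): overall accuracy with a default rule decomposes as covered fraction times rule accuracy plus uncovered fraction times default-rule accuracy, so the default rule's accuracy on the uncovered test objects can be backed out:

```python
# Back-of-the-envelope check on Table 2 (my arithmetic, not the paper's):
#   overall = covered * rule_acc + (1 - covered) * default_acc
# so the default rule's accuracy on uncovered objects can be backed out.
def implied_default_accuracy(covered_pct, rule_acc_pct, overall_pct):
    covered = covered_pct / 100
    return 100 * (overall_pct / 100
                  - covered * rule_acc_pct / 100) / (1 - covered)

rows = [(64.57, 79.67, 62.20), (55.90, 84.50, 61.94), (46.50, 87.00, 60.60)]
implied = [round(implied_default_accuracy(*r), 1) for r in rows]
```

The implied 30 to 38 percent default-rule accuracy on the uncovered objects is consistent with a majority-class fallback on a many-valued target attribute.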
The results in the last column of Table 2 indicate that the discovered rules
have about 10% lower prediction accuracy than the cup winner's test set
accuracy, which was 72% [8]. The poorer performance is due to the present
system's inability to use the relational information between genes defined by
the interaction table.
5 Conclusions and Future Work
A multi-relational rule discovery system, Relational-ILA, has been
implemented. It extracts rules from relational databases in which the tables
concerned are directly or indirectly connected to each other by foreign key
relations. The system requires a primary key in the target table to identify
individual objects.
One of the important issues we would like to address in the next stage of
this study is to make the hypothesis search algorithm more aware of the path
followed to reach a table, and to allow a table to be visited a second time
when it was previously reached via a different route. It will then be
possible to mine schemas whose related objects are recursively defined. For
example, the present system discovers rules that help explain properties of
individual genes, but the next version will discover rules that help explain
the relations between genes.
References
1. Knobbe, A.J., Blockeel, H., Siebes, A., Van der Wallen, D.M.G.: Multi-Relational Data
Mining, In Proceedings of Benelearn’99, (1999)
2. Leiva, H., and Honavar, V.: Experiments with MRDTL—A Multi-Relational Decision Tree
Learning Algorithm. In Dzeroski, S., Raedt, L.D., and Wrobel, S. (editors): Proceedings of
the Workshop on Multi-Relational Data Mining (MRDM-2002), University of Alberta, Ed-
monton, Canada, (2002) 97-112
3. Crestana-Jensen, V. and Soparkar, N.: Frequent Item-set Counting across Multiple Tables.
PAKDD 2000, (2000) 49-61
4. Wrobel, S.: An Algorithm for Multi-Relational Discovery of Subgroups, Proceedings of
PKDD’97, Springer-Verlag, Berlin, New York, (1997)
5. SRS-Relational White Paper, Working with relational databases using SRS, LION Biosci-
ence Ltd. http://www.lionbioscience.com/solutions/products/srs
6. Tolun, M.R. and Abu-Soud, S.M.: ILA: An Inductive Learning Algorithm for Rule Extrac-
tion, Expert Systems with Applications, 14(3), (1998) 361-370
7. Tolun, M.R., Sever, H., Uludağ, M. and Abu-Soud, S.M.: ILA-2: An Inductive Learning
Algorithm for Knowledge Discovery, Cybernetics and Systems: An International Journal,
Vol. 30, (1999) 609-628
8. Cheng, J., Krogel, M., Sese, J., Hatzis, C., Morishita, S., Hayashi, H. and Page, D.: KDD
Cup 2001 Report, ACM Special Interest Group on Knowledge Discovery and Data Mining
(SIGKDD) Explorations, Vol. 3, Issue 2, (2002)
... Many algorithms ID3 (Quinlan, 1986), ILA (Tolun & Abu-Soud, 1998), (Mohamed & Shukri-Bin-Jahabar, 2006), (Uludağ, et al., 2003) etc. belongs to this area. ...
... Sazo Dzeroski (Dzeroski, 2003) and M. Tolun (Uludağ, et al., 2003) presented their work about relational rule mining which involves checking conditions of a rule in different tables. For example: ...
Thesis
Full-text available
This research practically demonstrates how to use data mining technology to supply knowledge to the rule based system. It lays down a framework for utilization of data mining concepts to provide a sustained supply of knowledge to a rule based system at run-time. A novel ‗record couple‘ based production rule mining algorithm has been proposed to extract production rules from large datasets. Production rules thus extracted are added to the knowledge base of a rule based system. Conventional programming techniques are not efficient for performing knowledge oriented data consistency checks, especially in the domain involving large body of regularly updating knowledge. ‗Rule based systems‘, an established technique of Artificial Intelligence (AI), is well suited to perform data consistency checks on large datasets and perform corrective measures accordingly. A problem that limits the use of rule based systems is their implementation environment, which differs from data storage environment of the business world. Rule based systems and other AI related systems are normally developed in their specific environments such as Prolog, LISP, OPS-5 etc., while data of the business world is stored in relational environment. This research proves that rule based programming is more efficient than conventional programming techniques. Further, to increase the utility of rule based systems, relational database environment has been used for implementation. A Structured Query Language (SQL) based representation of production rules has been proposed. Rule based engine has been designed, developed, and evaluated in relational database environment. Moreover, the problem of knowledge acquisition has been persistent in the area of rule based systems. Manual knowledge editing techniques have been used so far to add knowledge to rule based systems. Although machine learning techniques have been used to overcome knowledge acquisition bottlenecks, yet limited success has been reported so far. 
In recent years, data mining has emerged as a useful technique to solve knowledge acquisition problems. Most of the data mining work produces very useful information for business executives and decision makers; however it leaves to the choice of decision makers either to use it or to disregard that information which limits the utilization of data mining technology considerably. Additionally, the area of ‗production rule mining‘ is not adequately explored by the data mining research community and often it is mixed up with ‗association rule mining‘. A system integrated with machine learning module is termed as ‗Learning Apprentice System‘. Configuration of the system emerging as result of this research is termed as the ‗architecture of a learning apprentice system‘ as it involves a ‗rule based system module‘ integrated with ‗data mining based learning module‘. ‗Production rule mining algorithm‘ and ‗feeding the extracted production rules to a rule based system designed in relational database environment‘ are novel contributions of this research. Medical billing compliance has been used as testing domain for concepts and ideas developed during this research. The rule based system is populated with the knowledge of medical claim processing rules using a knowledge editor. The proposed production rule mining algorithm has been utilized to mine medical claim processing rules which are then applied by the system to scrub medical claims i.e. detecting errors and performing minor, legitimate, and corrective actions. Performance of the proposed system has been proved to be efficient and has problem-solving ability as compared to the systems based on conventional programming. Conventional programming based systems are good enough to perform small and simple checks, however where complex and knowledge oriented checks are involved, the techniques proposed have proven to be much better. 
Evaluation of data mining driven rule based learning apprentice system proved that a lot of time were saved by prompt identification of knowledge oriented data consistency errors from the medical billing data. Although the current system has been tested in medical billing compliance domain, it can be applied to many real life problem domains which involve large numbers of knowledge oriented data consistency checks, such as credit card processing system, loan approval system, examination system etc.
... Acknowledging the need to benefit from relational database management systems (RDBMS) in learning algorithms, research was driven once a step forward by implementing a relational database inductive learning algorithm called RILA [13] which aims to develop data analysis solutions for relational data without requiring it to be transformed into a single table, but did not put much concentration on solving the learning rules from distributed databases. RILA was developed with two rule selection strategies: ...
... • ILA2 [12]: fast inductive learning algorithm for learning from single table with solution for overfitting problem (noise-tolerant version of the ILA rule induction algorithm). • RILA [13], [14]: relational learning algorithm from centralized database based on ILA2 algorithm. Our general strategy for designing an algorithm for learning from distributed data that is provably exact with respect to its centralized counterpart follows from the observation that most of the learning algorithms use only certain statistics computed from the data D in the process of generating the hypotheses that they output. ...
Article
Full-text available
This paper describes a new rule discovery algorithm called Distributed Relational Inductive Learning DRILA, which has been developed as part of ongoing research of the Inductive Learning Algorithm (ILA) [11], and its extension ILA2 [12] which were built to learn from a single table, and the Relational Inductive Learning Algorithm (RILA) [13], [14] which was developed to learn from a group of interrelated tables, i.e. a centralized database. DRILA allows discovery of distributed relational rules using data from distributed relational databases. It consists of a collection of sites, each of which maintains a local database system, or a collection of multiple, logically interrelated databases distributed over a computer network. The basic assumption of the algorithm is that objects to be analyzed are stored in a set of tables that are distributed over many locations. Distributed relational rules discovered would either be used in predicting an unknown object attribute value, or they can be used to extract the hidden relationship between the objects' attribute values. The rule discovery algorithm, developed, was designed to use data available from many locations (sites), any possible ‘connected’ schema at each location where tables concerned are connected by foreign keys. In order to have a reasonable performance, the ‘hypotheses search’ algorithm was implemented to allow construction of new hypotheses by refining previously constructed hypotheses, thereby avoiding the work of recomputing. Unlike many other relational learning algorithms, the DRILA algorithm does not need its own copy of distributed relational data to process it. This is important in terms of the scalability and usability of the distributed relational data mining solution that has been developed. The architecture proposed can be used as a framework to upgrade other propositional learning algorithms to relational learning.
... Acknowledging the need to benefit from relational database management systems (RDBMS) in learning algorithms, research was driven once a step forward by implementing a relational database inductive learning algorithm called RILA [13] which aims to develop data analysis solutions for relational data without requiring it to be transformed into a single table, but did not put much concentration on solving the learning rules from distributed databases. RILA was developed with two rule selection strategies: ...
... • ILA2 [12]: fast inductive learning algorithm for learning from single table with solution for overfitting problem (noise-tolerant version of the ILA rule induction algorithm). • RILA [13], [14]: relational learning algorithm from centralized database based on ILA2 algorithm. Our general strategy for designing an algorithm for learning from distributed data that is provably exact with respect to its centralized counterpart follows from the observation that most of the learning algorithms use only certain statistics computed from the data D in the process of generating the hypotheses that they output. ...
Article
Full-text available
This paper describes a new rule discovery algorithm called Distributed Relational Inductive Learning DRILA, which has been developed as part of ongoing research of the Inductive Learning Algorithm (ILA) [11], and its extension ILA2 [12] which were built to learn from a single table, and the Relational Inductive Learning Algorithm (RILA) [13], [14] which was developed to learn from a group of interrelated tables, i.e. a centralized database. DRILA allows discovery of distributed relational rules using data from distributed relational databases. It consists of a collection of sites, each of which maintains a local database system, or a collection of multiple, logically interrelated databases distributed over a computer network. The basic assumption of the algorithm is that objects to be analyzed are stored in a set of tables that are distributed over many locations. Distributed relational rules discovered would either be used in predicting an unknown object attribute value, or they can be used to extract the hidden relationship between the objects' attribute values. The rule discovery algorithm, developed, was designed to use data available from many locations (sites), any possible 'connected' schema at each location where tables concerned are connected by foreign keys. In order to have a reasonable performance, the 'hypotheses search' algorithm was implemented to allow construction of new hypotheses by refining previously constructed hypotheses, thereby avoiding the work of re-computing. Unlike many other relational learning algorithms, the DRILA algorithm does not need its own copy of distributed relational data to process it. This is important in terms of the scalability and usability of the distributed relational data mining solution that has been developed. The architecture proposed can be used as a framework to upgrade other propositional learning algorithms to relational learning.
... The relational rule discovery system Rila that we have improved for mining recursive many-to-many relations was previously described in [10]. The architecture of the rule discovery system developed is shown in Figure 2. The system was written in Java and uses Java JDBC API to communicate with the database management systems (DBMS). ...
Conference Paper
Full-text available
Many-to-many relations are often observed between real life objects. When many-to- many relations are between objects in the same class the data mining process becomes more complicated than mining objects when there are no such recursive relations. Mining objects re- lated to other objects in the same class requires construction and execution of recursive queries and hence interpretation of the results of recursive queries. Here, we describe Rila, a new rela- tional rule discovery system that was extended for mining objects related to other objects in the same class. Experimental results are provided on Mutagenesis and KDD Cup 2001 data sets.
Article
Full-text available
Data Mining is the process of extracting useful knowledge from large set of data. There are number of data mining techniques available to find hidden knowledge from huge set of Data. Among these techniques classification is one of the techniques to predict the class label for unknown data based on previously known class labeled dataset. Several classification techniques like decision tree induction, Naivy Bayes model, rough set approach, fuzzy set theory and neural network are used for pattern extraction. Now a day's most of the real world data stored in relational database but the decision tree induction method is used to find knowledge from flat data relations only, but can't discover pattern from relational database. So to extract multi-relational pattern from relational tables we use MRDTL approach. In real world Missing value problem are common in many data mining application. This paper provides survey of multi-relational decision tree learning algorithm to discover hidden multi-relational pattern from relational data sets and also includes some simple technique to deal with missing value.
Article
This paper describes a new rule induction system, rila, which can extract frequent patterns from multiple connected relations. The system supports two different rule selection strategies, namely the select early and select late strategies. Pruning heuristics are used to control the number of hypotheses generated during the learning process. Experimental results are provided on the mutagenesis and the segmentation data sets. The present rule induction algorithm is also compared to the similar relational learning algorithms. Results show that the algorithm is comparable to similar algorithms.
Article
Full-text available
We describe experiments with an implementation of Multi-relational decision tree learning (MRDTL) algorithm for induction of decision trees from relational databases using an approach proposed by Knobbe et al. (1999a). Our results show that the performance of MRDTL is competitive with that of other algorithms for learning classifiers from multiple relations including Progol (Muggleton, 1995), FOIL (Quinlan, 1993), Tilde (Blockeel, 1998). Preliminary results indicate that MRDTL, when augmented with principled methods for handling missing attribute values, could be competitive with the state-of-the-art algorithms for learning classifiers from multiple relations on real-world data sets such as those used in the KDD Cup 2001 data mining competition (Cheng et al., 2002).
Article
In this paper we present a novel inductive learning algorithm called the Inductive Learning Algorithm (ILA) for extracting production rules from a set of examples. We also describe application of the ILA to a range of data sets with different numbers of attributes and classes. The results obtained show that the ILA is more general and robust than most other algorithms for inductive learning. Most of the time, ILA appears to be comparable to other well-known algorithms, such as AQ and ID3, if not better.
Article
This paper presents results and lessons from KDD Cup 2001. KDD Cup 2001 focused on mining biological databases. It involved three cutting-edge tasks related to drug design and genomics.
Article
In this paper we describe the ILA-2 rule induction algorithm, an improved version of the inductive learning algorithm ILA. We first outline the basic ILA algorithm, and then present how it is improved using a new evaluation metric that handles uncertainty in the data. Through this soft-computing metric, users can express their preferences via a penalty factor that controls the behavior of the algorithm. ILA-2 also has a faster pass criterion, not available in basic ILA, which reduces processing time without sacrificing much accuracy. We experimentally show that the performance of ILA-2 is comparable to that of well-known inductive learning algorithms, namely CN2, OC1, ID3, and C4.5.
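The abstract above does not give the metric itself, but a penalty-factor evaluation in this spirit can be sketched as follows: a candidate rule is rewarded for the examples it classifies correctly and penalized, with a user-chosen weight, for those it misclassifies. The formula and names below are illustrative assumptions, not the exact ILA-2 metric.

```python
def rule_score(true_positives, false_positives, penalty_factor=1.0):
    """Score a candidate rule: correct coverage minus weighted incorrect coverage.

    A higher penalty_factor expresses a stronger user preference for
    precise rules over general ones.
    """
    return true_positives - penalty_factor * false_positives

# A rule covering 30 positive and 5 negative examples, penalty factor 2:
print(rule_score(30, 5, penalty_factor=2.0))  # 20.0
```

Raising the penalty factor makes the search prefer rules that cover fewer negatives, trading coverage for accuracy.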
Article
An important aspect of data mining algorithms and systems is that they should scale well to large databases. A consequence of this is that most data mining tools are based on machine learning algorithms that work on data in attribute-value format. Experience has proven that such 'single-table' mining algorithms indeed scale well. The downside of this format, however, is that more complex patterns are simply not expressible in it and, thus, cannot be discovered. One way to enlarge the expressiveness is to generalize, as in ILP, from one-table mining to multiple-table mining, i.e., to support mining on full relational databases. The key step in such a generalization is to ensure that the search space does not explode and that efficiency and, thus, scalability are maintained. In this paper we present a framework and an architecture that provide such a generalization. In this framework the semantic information in the database schema, e.g., foreign keys, is exploited ...
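To make the foreign-key idea concrete, the sketch below shows how a multi-relational hypothesis can be evaluated by following a foreign key from a target table into a connected table, rather than flattening everything into one attribute-value table first. The schema, table names, and hypothesis are invented for illustration.

```python
import sqlite3

# Two connected tables: purchase.customer_id is a foreign key into customer.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(id),
                           item TEXT);
    INSERT INTO customer VALUES (1, 'gold'), (2, 'silver');
    INSERT INTO purchase VALUES (10, 1, 'book'), (11, 1, 'pen'), (12, 2, 'book');
""")

# Evaluate a candidate hypothesis over the target table `customer`:
# "customers with at least one 'book' purchase".
cur.execute("""
    SELECT DISTINCT c.id, c.segment
    FROM customer c
    JOIN purchase p ON p.customer_id = c.id
    WHERE p.item = 'book'
""")
covered = sorted(cur.fetchall())
print(covered)  # both customers are covered by this hypothesis
```

The schema's foreign keys thus define which joins (refinements) are legal, which is how such frameworks keep the hypothesis space from exploding.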
Conference Paper
Available technology for mining data usually applies to centrally stored data (i.e., homogeneous, and in one single repository and schema). The few extensions to mining algorithms for decentralized data have largely been for load balancing. In this paper, we examine mining decentralized data for the task of finding frequent itemsets. In contrast to current techniques where data is first joined to form a single table, we exploit the inter-table foreign key relationships to obtain decentralized algorithms that execute concurrently on the separate tables, and thereafter, merge the results. In particular, for typical warehouse schema designs, our approach adapts standard algorithms, and works efficiently. We provide analyses and empirical validation for important cases to exhibit how our approach performs well. In doing so, we also compare two of our approaches in merging results from individual tables, and thereby, we exhibit certain memory vs I/O trade-offs that are inherent in merging of decentralized partial results.
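A simplified sketch of the count-then-merge idea described above: each partition of the data counts candidate itemsets locally, and the partial counts are merged afterwards to apply the support threshold. This is a toy illustration of the general pattern, not the paper's algorithm, and all names are assumptions.

```python
from collections import Counter
from itertools import combinations

def local_counts(transactions, k=2):
    """Count k-itemsets in one partition of the data."""
    counts = Counter()
    for t in transactions:
        for combo in combinations(sorted(t), k):
            counts[combo] += 1
    return counts

def merge(counts_list, min_support):
    """Merge partial counts from all partitions, then filter by support."""
    total = Counter()
    for c in counts_list:
        total.update(c)
    return {itemset: n for itemset, n in total.items() if n >= min_support}

part1 = [{"a", "b"}, {"a", "b", "c"}]
part2 = [{"a", "b"}, {"b", "c"}]
merged = merge([local_counts(part1), local_counts(part2)], min_support=3)
print(merged)  # {('a', 'b'): 3}
```

In the decentralized setting the interesting trade-off, as the abstract notes, is how much partial state must be kept in memory versus re-read from disk during the merge.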
Conference Paper
Although there is a growing need for multi-relational data mining solutions in KDD, the use of obvious candidates from the field of Inductive Logic Programming (ILP) has been limited. In our view this is mainly due to the variation in ILP engines, especially with respect to input specification, as well as the limited attention to relational database issues. In this paper we describe an approach which uses UML as the common specification language for a large range of ILP engines. Having such a common language will enable a wide range of users, including non-experts, to model problems and apply different engines without any extra effort. The process involves transformation of UML into a language called CDBL, which is then translated to a variety of input formats for different engines.
Article
We consider the problem of finding statistically unusual subgroups in a multi-relation database, and extend previous work on single-relation subgroup discovery. We give a precise definition of the multi-relation subgroup discovery task, propose a specific form of declarative bias based on foreign links as a means of specifying the hypothesis space, and show how propositional evaluation functions can be adapted to the multi-relation setting. We then describe an algorithm for this problem setting that uses optimistic-estimate and minimal-support pruning, an optimal refinement operator, and sampling to ensure efficiency, and can easily be parallelized. 1 Introduction. Data Mining or Knowledge Discovery in Databases (KDD) is concerned with the computer-aided extraction of novel, useful and interesting knowledge from (large) databases. A particularly important subclass of knowledge discovery tasks is the discovery of interesting subgroups in populations, where interestingness is defined as di...
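A standard example of the "propositional evaluation functions" that such work adapts to the multi-relation setting is weighted relative accuracy (WRAcc), which balances how large a subgroup is against how much its positive rate deviates from the population's. The implementation below is a generic illustration, not code from the paper.

```python
def wracc(n_subgroup, n_subgroup_pos, n_total, n_total_pos):
    """Weighted relative accuracy of a subgroup.

    WRAcc = coverage * (subgroup positive rate - overall positive rate),
    so it is high only for subgroups that are both large and unusual.
    """
    coverage = n_subgroup / n_total
    return coverage * (n_subgroup_pos / n_subgroup - n_total_pos / n_total)

# Subgroup covers 20 of 100 objects, 15 of them positive;
# overall, 40 of the 100 objects are positive.
print(round(wracc(20, 15, 100, 40), 3))  # 0.07
```

An optimistic estimate for pruning, as mentioned in the abstract, bounds the best WRAcc any refinement of a subgroup could still reach, so whole branches of the search can be cut off safely.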
SRS-Relational White Paper: Working with relational databases using SRS. LION Bioscience Ltd. http://www.lionbioscience.com/solutions/products/srs