A Multi-Relational Rule Discovery System
Mahmut Uludağ1, Mehmet R. Tolun2, Thure Etzold1
1 LION Bioscience Ltd., Compass House, 80-82, Newmarket Road, Cambridge, CB5 8DZ,
United Kingdom
{mahmut.uludag, thure.etzold}@uk.lionbioscience.com
2 Atilim University, Dept. of Computer Engineering, 06836 Incek, Ankara, Turkey
tolun@atilim.edu.tr
http://cmpe.emu.edu.tr/rila/
Abstract. This paper describes a rule discovery system that has been developed
as part of an ongoing research project. The system allows discovery of multi-
relational rules using data from relational databases. The basic assumption of
the system is that the objects to be analyzed are stored in a set of tables. The discovered
multi-relational rules can be used either to predict an unknown attribute value of an
object or to reveal hidden relationships between the attribute values of the objects. The
rule discovery system was designed to use data from any ‘connected’ schema in which
the tables concerned are linked by foreign keys. To achieve reasonable performance, the
‘hypotheses search’ algorithm was implemented to construct new hypotheses by refining
previously constructed hypotheses, thereby avoiding re-computation.
1 Introduction
Most of the current data mining algorithms are designed to use data from a single
table. They require each object to be described by a fixed set of attributes. Compared
to a single table of data, a relational database containing multiple tables makes it pos-
sible to represent more complex and structured data. In addition, a significant amount of
scientific data is today stored in relational databases. For these reasons, it is important to
have discovery algorithms that work on relational data in its natural form, without
requiring the data to be flattened into a single table.
A relational data model consisting of multiple tables may represent several object
classes: within a schema, one set of tables may represent one class of objects while a
different set represents another class. Before starting a discovery process, users should
analyze the schema and select the list of tables that represents the kind of objects they
are interested in. One of the selected tables will be central for the objects, and each row
in this table should correspond to a single object in the database. In previous
multi-relational data mining publications, this central table is called the ‘target table’ in
[1] and [2], the ‘primary table’ in [3], the ‘master relation’ in [4], and the ‘hub table’
in [5].
(Figure 1 shows the Gene table, with columns Geneid, Essential, Chromosome and
Localization, connected to the Composition table, with columns Geneid, Class, Complex,
Phenotype, Motif and Function, together with the following rule.)
IF Composition.Class = ‘ATPases’ AND Composition.Complex = ‘Intracellular transport’
THEN Gene.Localization = extracellular
Fig. 1. An example multi-relational rule that refers to the composition table in its conditions
and refers to the gene table in its right hand side
In a multi-relational rule, each attribute name is qualified with the name of the relational
table to which the attribute belongs. Figure 1 shows an example of such a
multi-relational rule.
The concepts suggested in the multi-relational data mining framework described in [1],
namely selection graphs, the target table and the target attribute, all helped during the
initial stages of building the present multi-relational rule discovery system.
2 Architecture
The architecture of the rule discovery system is shown in Figure 2. The discovery system
uses the JDBC API to communicate with the database management system (DBMS).
When a data mining session is started, the system sends meta-data queries to the
connected DBMS. After the user selects a set of tables, the target table and the target
attribute, the data mining process starts, during which the system sends a number of
SQL queries to the DBMS. These SQL queries mainly serve to build valid hypotheses
about the data.
In order to reduce the complexity of communication between the rule discovery system
and the DBMS, the information about covered objects and the discretized columns is
stored in temporary tables in the DBMS rather than in internal data structures on the
rule discovery system side. These temporary tables were also chosen for performance
reasons.
The temporary table ‘covered’ has two columns named ‘id’ and ‘mark’. Each time the
rule discovery system starts processing a new class, this table is reinitialized by inserting
one row for each object belonging to the current class. The ‘id’ field is given the value of
the primary key and the ‘mark’ field is set to zero. When a new rule is generated, the
‘mark’ fields of the rows that refer to the objects covered by the new rule are
changed to one.
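As an illustration, a minimal sketch of how the ‘covered’ table could be defined and
reinitialized is given below; the paper does not give the exact statements, and the column
types and the class value 'nucleus' are assumptions made here.

create table covered (
  id   varchar(32) not null,  -- primary key value of the object (e.g. geneid)
  mark integer     not null   -- 0 = not yet covered, 1 = covered by a rule
);

-- reinitialization when processing a new class, e.g. localization = 'nucleus'
delete from covered;
insert into covered (id, mark)
  select geneid, 0 from gene where localization = 'nucleus';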
(Figure 2 shows the discovery system connected to the DBMS through a JDBC driver:
hypotheses are sent as SQL and meta-data queries, result sets are returned, and rules are
produced as output.)
Fig. 2. The basic architecture of the system
The rule discovery system discretizes the numeric attribute values during the pre-
processing stage of a data mining session and initializes the table ‘disc’ by inserting
one row per discretization interval. Each interval is represented by the following
columns; a sketch of a possible table definition is given after the list.
• table_name: name of the table the numeric attribute is from
• column_name: name of the column the numeric attribute is associated with
• interval_name: name of the interval between two successive cut points
• min_val: minimum value of the interval
• max_val: maximum value of the interval
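As a sketch only, the ‘disc’ table could be defined and populated as shown below, with
two illustrative rows for the numeric attribute ‘expr’ of the interaction table; the column
types and the cut points are hypothetical.

create table disc (
  table_name    varchar(64),  -- table the numeric attribute is from
  column_name   varchar(64),  -- column the numeric attribute is associated with
  interval_name varchar(64),  -- name of the interval between two cut points
  min_val       float,        -- minimum value of the interval
  max_val       float         -- maximum value of the interval
);

-- one row per discretization interval (hypothetical cut points)
insert into disc values ('interaction', 'expr', 'bin_1', 0.00, 0.05);
insert into disc values ('interaction', 'expr', 'bin_2', 0.05, 0.10);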
Using these temporary tables creates a concurrency problem if more than one user runs
the rule discovery system simultaneously on the same data. We therefore plan to improve
the solution to take concurrency issues into account.
3 The Algorithm
The multi-relational rule discovery algorithm of the system was adapted from ILA
(Inductive Learning Algorithm) [6]. ILA is a ‘covering’ type learning algorithm that takes
each class in turn and seeks a way of covering all of its instances while excluding the
instances that are not in the class. There is also
an improved version of the ILA algorithm named ILA-2 that uses a penalty factor that
helps to produce better results for noisy data [7]. In this paper, the adapted version of
the ILA-2 algorithm is named Relational-ILA.
ILA requires a particular feature of the object under consideration to be used as a
dependent attribute for classification. In Relational-ILA, however, the dependent
attribute corresponds to the target attribute of the target table. It is assumed that the
target table is connected to other tables through foreign key relations. Relational-ILA
is composed of initial hypotheses generation, hypotheses evaluation, hypotheses re-
finement and rule selection steps. The relationship between these steps is summarized
in Figure 3 for processing examples of a single class.
The database schema is treated as a graph where nodes represent relations (tables)
and edges represent foreign keys. The schema graph is searched in a breadth-first
search manner starting from the target table. While searching the schema graph, the
algorithm keeps track of the path followed; no table is processed a second time.
Fig. 3. The simplified Relational-ILA algorithm for processing examples of a single
class
Initial hypotheses are composed of only one condition. During the building of the
initial hypotheses set, each time a table in the schema graph is visited the following
template is used to generate SQL queries that find hypotheses together with their
frequency values.
Select attr, count(distinct targetTable.pk)
from table, covered, table_list
where join_list and
targetTable.targetAttr = currentClass and
covered.id = targetTable.pk and
covered.mark=0
group by attr
In the template,
• attr is the column name for which hypotheses are being searched
• targetTable is the table that has one row for each object being analyzed
• pk is the name of the primary key column in the target table
• table refers to the current table where the hypotheses are being searched
• table_list is the list of the tables that are used to connect the current table to
the target table
• join_list is the list of join conditions that are used to connect the current ta-
ble to the target table
• targetAttr is the class column
• currentClass is the current class for the hypotheses that are being searched
The template is applied for each column except the foreign and primary key col-
umns and the class column, i.e. the target attribute. If the current table is the target
table then the algorithm uses a simplified version of the template.
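For illustration, with the gene table of Section 4 as the target table, localization as the
target attribute, 'nucleus' assumed as the current class, and the class column of the
composition table as the column being searched, the template would expand roughly as
follows (an assumed expansion, not taken from the paper):

Select composition.class, count(distinct gene.geneid)
from composition, covered, gene
where composition.geneid = gene.geneid and
      gene.localization = 'nucleus' and
      covered.id = gene.geneid and
      covered.mark = 0
group by composition.class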
The algorithm also needs to know about the frequency of the hypotheses in classes
other than the current class. The following template is used to generate the necessary
SQL queries.
Select attr, count(distinct targetTable.pk)
from table_list, targetTable
where join_list and
targetTable.targetAttr <> currentClass
group by attr
Similarly, for the target table, the algorithm uses a simplified version of the template.
When the above templates are used for a numeric attribute, the ‘select’ clauses in the
templates are changed such that the attribute column ‘attr’ is replaced by the following
three column names: interval_name, min_val and max_val. The ‘group by’ clauses of the
queries have a similar replacement. In addition, the ‘from’ clauses are extended with the
table ‘disc’ and the join conditions are extended with the following conditions; an
example query is given after them.
disc.column_name = 'attr' and
attr > disc.min_val and
attr < disc.max_val
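For example, for the numeric attribute 'expr' of the interaction table the first template
would take roughly the following form; the join on geneid1 and the additional
disc.table_name condition are assumptions made here for clarity.

Select disc.interval_name, disc.min_val, disc.max_val,
       count(distinct gene.geneid)
from interaction, covered, gene, disc
where interaction.geneid1 = gene.geneid and
      gene.localization = 'nucleus' and
      covered.id = gene.geneid and
      covered.mark = 0 and
      disc.table_name = 'interaction' and
      disc.column_name = 'expr' and
      interaction.expr > disc.min_val and
      interaction.expr < disc.max_val
group by disc.interval_name, disc.min_val, disc.max_val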
After the initial hypotheses are generated, they are sorted based on the output of the
ILA hypothesis evaluation function, which measures how well a hypothesis satisfies the
conditions for being a valid rule. If any of the hypotheses can be used for generating a
new rule, then the one with the maximum score is converted to a new rule and the objects
covered by the new rule are marked in the temporary table ‘covered’. After the rule
selection process, if some rules were selected but there are still objects not yet covered,
the initial hypotheses are rebuilt using only the objects that are not covered by the rules
already generated. If no new rule can be generated, then the hypotheses
refinement step is started.
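Marking the covered objects can be done with a single update statement; the sketch
below uses the example rule of Figure 1, with the subquery standing for the selection
defined by the new rule's conditions, and is not taken from the paper.

update covered
set mark = 1
where id in (select distinct gene.geneid
             from gene, composition
             where composition.geneid = gene.geneid and
                   composition.class = 'ATPases' and
                   composition.complex = 'Intracellular transport');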
Refinement of a multi-relational hypothesis means extending the description of the
hypothesis. It results in a new selection of objects that is a subset of the selection as-
sociated with the original hypothesis.
As in the initial hypotheses building step, to extend a hypothesis the schema graph is
searched, starting from the target table, by following the foreign key relations between
tables. When a table in the schema graph is reached, the following template is used to
generate SQL queries for refining the hypothesis.
Select attr, count(distinct targetTable.pk)
from covered, table_list, hypothesis.table_list()
where targetAttr = currentClass and
join_list and
hypothesis.join_list() and
covered.id = targetTable.pk and
covered.mark=0
group by attr;
Here hypothesis is the hypothesis object being refined. The object has two methods
that help the SQL construction process. The table_list method returns the list of
the tables to which the features in the hypothesis refer, plus the tables that connect each
feature to the target table. The join_list method returns the list of join conditions
for the features in the hypothesis, plus the list of join conditions needed to connect
each feature to the target table.
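Continuing the illustration, refining a hypothesis that already contains the condition
composition.class = 'ATPases' by searching the complex column of the composition table
would give roughly the following query; this is an assumed expansion, written without
table aliases for simplicity.

Select composition.complex, count(distinct gene.geneid)
from covered, gene, composition
where gene.localization = 'nucleus' and
      composition.geneid = gene.geneid and
      composition.class = 'ATPases' and
      covered.id = gene.geneid and
      covered.mark = 0
group by composition.complex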
In order to determine the frequency of the extended hypotheses in the classes other than
the current class, the following SQL template is used.
Select attr, count(distinct targetTable.pk)
from targetTable, table_list, hypothesis.table_list()
where targetAttr <> currentClass and
join_list and
hypothesis.join_list()
group by attr;
4 Experiments
A set of experiments was conducted using the genes dataset of KDD Cup 2001 [8].
There are two tables in the original genes dataset. One table (interaction) specifies
which genes interact with which other genes. The other table (gene) specifies a variety
of properties of individual genes. The gene table has information about 862 different
genes. There can be more than one row for each gene. The attribute gene_id identifies
a gene uniquely. Tests were conducted to generate rules for the localization attribute.
Because the discovery system requires the target table to have a primary key, the schema
was normalized as shown in Figure 4.
Gene (862 rows): Geneid, Essential, Chromosome, Localization
Composition (4346 rows): Geneid, Complex, Class, Phenotype, Motif, Function
Interaction (910 rows): Geneid1, Geneid2, Type, Expression
Fig. 4. Schema of the KDD Cup 2001 genes data after normalization
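As a sketch only, the normalized schema of Figure 4 could be created with statements
along the following lines; the column types and key constraints are assumptions, since
the paper does not list them.

create table gene (
  geneid       varchar(32) primary key,
  essential    varchar(16),
  chromosome   varchar(8),
  localization varchar(32)   -- the target attribute in the experiments
);

create table composition (
  geneid    varchar(32) references gene(geneid),
  complex   varchar(64),
  class     varchar(64),
  phenotype varchar(64),
  motif     varchar(64),
  function  varchar(64)
);

create table interaction (
  geneid1 varchar(32) references gene(geneid),
  geneid2 varchar(32) references gene(geneid),
  type    varchar(32),
  expr    float        -- the numeric attribute, labelled Expression in Figure 4
);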
The dataset has one numeric attribute and several attributes with missing values.
The numeric attribute, ‘expr’, in the interaction table was divided into 20 bins using
the class-blind binning method. Missing attribute values were ignored. In the experi-
ments, the ILA-2 penalty factor was set to 1, and the maximum hypothesis size was
limited to 4.
The results of the experiments are presented in Table 1 and Table 2. In both tables, the
first column shows the selected minimum support pruning parameter. In Table 1, the
second column shows the amount of time the learning process required on a machine
with a 1.8 GHz Athlon processor. The third column shows the percentage of the training
objects covered and the last column shows the number of rules generated.
Table 1. The results of the training process

Minimum support pruning percentage    Time (sec.)    % covered    # of rules
0%                                        1536          74.71           218
2%                                         883          58.93            87
5%                                         354          44.78            37
In Table 2, the second column shows the percentage of the test objects covered by
the rules and the third column shows the accuracy of the rules on the objects covered.
The last column shows the accuracy of the rules if a default rule is used for objects not
covered by any rule. The default rule was selected as the majority class value.
Table 2. The results on the test data

Minimum support pruning percentage    % covered    % accuracy    % accuracy using a default rule
0%                                        64.57         79.67         62.20
2%                                        55.90         84.50         61.94
5%                                        46.50         87.00         60.60
The results in the last column of Table 2 indicate that the discovered rules have about
10% lower prediction accuracy than the cup winner’s test set accuracy, which was
72% [8]. The lower performance is due to the present system’s inability to exploit the
relational information between genes defined by the interaction table.
5 Conclusions and Future Work
A multi-relational rule discovery system, Relational-ILA, has been implemented which
extracts rules from relational database management systems where the tables concerned
are directly or indirectly connected to each other by foreign key relations. The system
requires a primary key in the target table to identify individual objects.
One of the important issues we would like to address during the next stage of this
study is to make the hypotheses search algorithm more aware of the path followed to
reach a table and to allow tables to be visited a second time if they were previously
reached via a different route. It will then be possible to mine schemas having related
objects that are recursively defined. For example, the present system discovers rules
that can help to understand properties of individual genes but the next version will
discover rules that can help to understand the relations between genes.
References
1. Knobbe, A.J., Blockeel, H., Siebes, A., Van der Wallen, D.M.G.: Multi-Relational Data
Mining, In Proceedings of Benelearn’99, (1999)
2. Leiva, H., and Honavar, V.: Experiments with MRDTL—A Multi-Relational Decision Tree
Learning Algorithm. In Dzeroski, S., Raedt, L.D., and Wrobel, S. (editors): Proceedings of
the Workshop on Multi-Relational Data Mining (MRDM-2002), University of Alberta, Ed-
monton, Canada, (2002) 97-112
3. Crestana-Jensen, V. and Soparkar, N.: Frequent Item-set Counting across Multiple Tables.
PAKDD 2000, (2000) 49-61
4. Wrobel, S.: An Algorithm for Multi-Relational Discovery of Subgroups, Proceedings of
PKDD’97, Springer-Verlag, Berlin, New York, (1997)
5. SRS-Relational White Paper, Working with relational databases using SRS, LION Biosci-
ence Ltd. http://www.lionbioscience.com/solutions/products/srs
6. Tolun, M.R. and Abu-Soud, S.M.: ILA: An Inductive Learning Algorithm for Rule Extrac-
tion, Expert Systems with Applications, 14(3), (1998) 361-370
7. Tolun, M.R., Sever, H., Uludağ, M. and Abu-Soud, S.M.: ILA-2: An Inductive Learning
Algorithm for Knowledge Discovery, Cybernetics and Systems: An International Journal,
Vol. 30, (1999) 609-628
8. Cheng, J., Krogel, M., Sese, J., Hatzis, C., Morishita, S., Hayashi, H. and Page, D.: KDD
Cup 2001 Report, ACM Special Interest Group on Knowledge Discovery and Data Mining
(SIGKDD) Explorations, Vol. 3, Issue 2, (2002)