Scalable Multi-Relational Association Mining

Amanda Clare
Department of Computer Science,
University of Wales Aberystwyth,
Aberystwyth, SY23 3DB, UK
afc@aber.ac.uk
Hugh E. Williams Nicholas Lester
School of Computer Science and Information Technology,
RMIT University, GPO Box 2476V,
Melbourne, Australia 3001.
{hugh,nml}@cs.rmit.edu.au
Abstract
We propose the new RADAR technique for multi-relational data mining. This permits the mining of very large collections and provides a new technique for discovering multi-relational associations. Results show that RADAR is reliable and scalable for mining a large yeast homology collection, and that it does not have the main-memory scalability constraints of the Farmer and Warmr tools.
1. Introduction
Large collections of multi-relational data present significant new challenges to data mining. These challenges are reflected in the annual KDD Cup competition, which involved relational datasets in 2001 and 2002, and network mining in 2003. The July 2003 edition of the ACM SIGKDD Explorations is devoted to position papers outlining the current frontiers in multi-relational data mining. Similar problems exist in bioinformatics databases such as those at MIPS¹ that provide integrated data on a genome-wide scale for whole organisms, with multiple cross references to other databases.
The vast majority of association mining algorithms are
designed for single table, propositional datasets. We pro-
pose a novel technique for multi-relational association min-
ing that permits efficient and scalable discovery of relation-
ships. To our knowledge, the only existing multi-relational
association mining algorithms are upgrades of Apriori [1]
and, with the field in its infancy, there is much scope for
improving the scalability of these solutions. Our technique
uses an inverted index, a largely disk-based search structure
that is used to support querying in all practical Information
Retrieval systems and web search engines.
This work carried out at and supported by the School of Computer
Science and Information Technology at RMIT University.
¹ See: http://mips.gsf.de/
2. Inverted Indexes
An inverted index is a well-known structure used in all practical text retrieval systems [8]. It consists of an in-memory (or partially in-memory) search structure that stores the vocabulary of searchable terms, and on-disk postings that store, for each term, the location of that term in the collection. In practice, the vocabulary is typically the words that occur in the collection [8].
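Using the notation of Zobel and Moffat [10], each term $t$ has postings $\langle d, f_{d,t} \rangle$, where $f_{d,t}$ is the frequency of term $t$ in document $d$. Consider an example for the term “mining” that occurs in four documents:
$\langle 11, 2 \rangle \; \langle 19, 1 \rangle \; \langle 72, 1 \rangle \; \langle 107, 2 \rangle$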
This postings list shows that the word “mining” occurs
twice in document 11, once in document 19, once in docu-
ment 72, and twice in document 107. The documents them-
selves are ordinally numbered, and a mapping table asso-
ciates each document number to its location on disk. De-
spite its simplicity, this inverted index structure is sufficient
to support the popular ranked query mode that is used by
most search engine users.
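As a minimal sketch of the structure just described, the following builds an in-memory index that maps each term to postings of the form (d, f_dt). The function name `build_index`, the whitespace tokenisation, and the three sample documents are assumptions made for illustration; a practical system would keep the postings on disk.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Map each term t to its postings list [(d, f_dt), ...] in document order."""
    index = defaultdict(list)
    # Documents are numbered ordinally from 1, as in the text.
    for d, text in enumerate(documents, start=1):
        term_counts = Counter(text.lower().split())
        for term, f_dt in sorted(term_counts.items()):
            index[term].append((d, f_dt))
    return dict(index)


# Hypothetical three-document collection.
docs = [
    "data mining and text mining",
    "text retrieval",
    "mining large collections",
]
index = build_index(docs)
print(index["mining"])  # [(1, 2), (3, 1)]
```

Because documents are processed in increasing order, each postings list is sorted by document number, which is what makes later merging and compression straightforward.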
The organisation, compression, and processing of postings lists is crucial to retrieval system performance. Compression is important for three reasons: first, a compressed representation requires less storage space than an uncompressed one; second, a retrieval system is faster when compression is used, since the cost of transferring compressed lists and decompressing them is typically much less than the cost of transferring uncompressed data; and, last, caching is improved because more lists fit into main-memory than when uncompressed lists are used. Scholer et al. [7] recently showed that compressing postings lists more than halves query evaluation times compared with using no compression.
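To make the idea of a compressed postings list concrete, here is a small sketch that stores the gaps between successive document numbers and writes each gap with a variable-byte code. The choice of variable-byte coding is an assumption for illustration only; the text does not say which scheme RADAR uses, and all function names below are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers 7 bits at a time; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # terminating byte of the current integer
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def compress_postings(doc_numbers):
    """Store the first document number, then gaps between successive numbers."""
    gaps = [d - prev for prev, d in zip([0] + doc_numbers[:-1], doc_numbers)]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild document numbers by summing the decoded gaps."""
    doc_numbers, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_numbers.append(total)
    return doc_numbers


postings = [11, 19, 72, 107]
packed = compress_postings(postings)
print(len(packed))  # 4
```

Each of the four gaps (11, 8, 53, 35) is below 128, so the whole list occupies four bytes. Small gaps are typical of frequent terms, which is where this kind of coding saves the most space.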
3. Multi-Relational Association Mining
The first mining technique to find associations in multi-table relational data was Warmr [4]. Warmr is a first-order upgrade of Apriori, with the additional introduction of a user-defined language bias to restrict the search space. Blockeel et al. [2] have been investigating enhancements such as query packs to the underlying Prolog compiler to address efficiency issues. They have also implemented techniques to allow the user to limit the amount of data required to be loaded into main-memory. With Warmr, the user has the full power of the Prolog programming language for specifying the data and background knowledge.
PolyFARM [3] was based on the ideas of Warmr and
written for distribution on a Beowulf cluster by partition-
ing the data to be counted. Unfortunately, although the size
of the database is reduced by partitioning, the size of the
candidate associations held in main-memory can grow im-
practically large.
Nijssen and Kok’s Farmer [6] is a new multi-relational mining technique, with a running time that is an order of magnitude improvement over Warmr; indeed, on small data sets, Farmer can be astonishingly fast. However, it still requires that all data be available in main-memory, which remains a significant problem for large datasets, and its main-memory use increases steadily throughout each search.
4. RADAR
We propose RADAR, the Relational Association Datamining AlgoRithm². RADAR is the first multi-relational association mining algorithm that uses compressed inverted indexing techniques to provide a scalable solution for mining large databases.
Our aim is to count all frequent associations in a
database. We use the language of first order logic to rep-
resent the associations. A frequent association is a con-
junction or set of atoms that occurs with at least the min-
imum support frequency in the database [4]. For example,
“a chardonnay wine that is made by an Australian grower”
is represented by the association:
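As a purely illustrative sketch, an association of this kind can be held as a list of atoms that share variables. The predicate names `chardonnay`, `made_by`, and `australian`, the variable names, and the helper functions are assumptions introduced here and are not taken from the paper.

```python
from typing import NamedTuple, Tuple


class Atom(NamedTuple):
    predicate: str         # predicate symbol (hypothetical names below)
    args: Tuple[str, ...]  # variable names


# Hypothetical encoding of
# "a chardonnay wine that is made by an Australian grower".
association = [
    Atom("chardonnay", ("W",)),
    Atom("made_by", ("W", "G")),
    Atom("australian", ("G",)),
]


def variables(assoc):
    """Return the distinct variables of an association in order of first use."""
    seen = []
    for atom in assoc:
        for v in atom.args:
            if v not in seen:
                seen.append(v)
    return seen


def is_frequent(support_count, min_support):
    """An association is frequent if its count reaches the minimum support."""
    return support_count >= min_support


print(variables(association))  # ['W', 'G']
```

The shared variables W and G are what tie the atoms together: the same wine must be a chardonnay and be made by the grower who is Australian.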
Inspired by the Eclat algorithm [9], we propose to mine these frequent associations by flattening the database, building an inverted index of the flattened database, and repeatedly joining postings lists.
In a multi-table relational database, we must decide which field in which table is our main key or notion of transaction, that is, what we are counting. For example, in a database representing wines, retailers, and growers, we must decide if we are interested in counting the number of Australian growers that make chardonnay wines, or the number of chardonnay wines that are made by Australian growers. We refer to this field as the COUNTKEY, so as to distinguish it from the common notion of a database key field.
² The RADAR software and sample databases are available from http://www.aber.ac.uk/compsci/Research/bio/dss/radar/
Figure 1 Five tables representing molecules by atoms and bonds.
Table                       Fields
Atom                        atom_id, mol_id, element, quanta, charge
Bond                        atom1_id, atom2_id, bondtype
Ring                        ring_id, ringtype
Molecule                    mol_id, activity
(fifth table, unlabelled)   ring_id, atom_id
Figure 2 Example of the two-column flattened database with keys. For example, line 2 describes a double bond in mol 12 between atoms 10 and 11.
Keys (Arguments)    Attributes (Predicate Symbols)
m12                 inactive
m12, a10, a11       bond double
m12, a10            elem carbon, quanta 27, charge medium
m12, a11, a12       bond double
m12, a11, a13       bond single
m12, a11            elem carbon, quanta 22, charge medium
m12, r47, a10       ring benzene
m12, r47, a11       ring benzene
m12, r47, a12       ring benzene
To prepare for indexing, the database is flattened into a
single table with a two-column format. The first column
stores the database keys (which represent the arguments to
the predicates), and the second column stores the database
items, that is, descriptive attributes (which represent the
predicate names). We refer to these as keys and predicate
symbols respectively. Each row of the flattened database
can hold multiple keys and multiple predicate symbols.
The attributes in a simple multi-table relational database
describing molecules represented by bonds and atoms are
shown in Figure 1. Selected flattened rows from this
database are shown in Figure 2. Flattening can be made
more or less explicit depending on the application require-
ments.
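The flattening step can be sketched for three of the tables in Figure 1. The sample records, the row order, and the way attribute values are turned into predicate symbols (for example "bond double") are assumptions chosen so that the output resembles Figure 2; this is not the RADAR flattening tool itself.

```python
# Hypothetical sample records for the Molecule, Atom, and Bond tables.
molecule = [{"mol_id": "m12", "activity": "inactive"}]
atom = [
    {"atom_id": "a10", "mol_id": "m12", "element": "carbon",
     "quanta": 27, "charge": "medium"},
    {"atom_id": "a11", "mol_id": "m12", "element": "carbon",
     "quanta": 22, "charge": "medium"},
]
bond = [{"atom1_id": "a10", "atom2_id": "a11", "bondtype": "double"}]


def flatten(molecule, atom, bond):
    """Return (keys, predicate symbols) rows, one per source record."""
    # The Bond table has no mol_id, so look it up through the first atom.
    mol_of = {a["atom_id"]: a["mol_id"] for a in atom}
    rows = []
    for m in molecule:
        rows.append(((m["mol_id"],), (m["activity"],)))
    for b in bond:
        keys = (mol_of[b["atom1_id"]], b["atom1_id"], b["atom2_id"])
        rows.append((keys, (f"bond {b['bondtype']}",)))
    for a in atom:
        symbols = (
            f"elem {a['element']}",
            f"quanta {a['quanta']}",
            f"charge {a['charge']}",
        )
        rows.append(((a["mol_id"], a["atom_id"]), symbols))
    return rows


rows = flatten(molecule, atom, bond)
for keys, symbols in rows:
    print(", ".join(keys), "|", ", ".join(symbols))
```

Every row carries the molecule identifier as its first key, which is what later allows counting with respect to molecules.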
Keys are used to form the arguments to the predicates. For example, if a predicate that relates a wine to its grower is to be a possible atom in associations, then any row in the flattened database that contains an instance of that predicate's term in the second column must always have both the Wine and Grower keys listed in the first column of that row.
To create an inverted index for the flattened database, we number each row sequentially and use these numbers as the document numbers. All terms within a row are indexed, that is, both keys and attributes. A section of the inverted index for Figure 2 is shown in Figure 3.
Figure 3 A section of the inverted index of the flattened database from Figure 2. For compactness, the postings lists show only document numbers; we have omitted the within-document frequencies $f_{d,t}$.
Term ($t$)      Postings list ($d$)
inactive        1
bond double     2,4
bond single     5
elem carbon     3,6
ring benzene    7,8,9
quanta 22       6
m12             1,2,3,4,5,6,7,8,9
a10             2,3,7
a11             2,4,5,6,8
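The indexing step can be checked against the figures with a short sketch. It takes the nine rows of Figure 2, numbers them from 1, and indexes every key and predicate symbol. The names `ROWS` and `index_rows` are hypothetical, and the index is kept uncompressed and in memory for clarity.

```python
from collections import defaultdict

# The nine rows of Figure 2 as (keys, predicate symbols).
ROWS = [
    (["m12"], ["inactive"]),
    (["m12", "a10", "a11"], ["bond double"]),
    (["m12", "a10"], ["elem carbon", "quanta 27", "charge medium"]),
    (["m12", "a11", "a12"], ["bond double"]),
    (["m12", "a11", "a13"], ["bond single"]),
    (["m12", "a11"], ["elem carbon", "quanta 22", "charge medium"]),
    (["m12", "r47", "a10"], ["ring benzene"]),
    (["m12", "r47", "a11"], ["ring benzene"]),
    (["m12", "r47", "a12"], ["ring benzene"]),
]


def index_rows(rows):
    """Index both keys and predicate symbols; row numbers act as document numbers."""
    index = defaultdict(list)
    for doc_number, (keys, symbols) in enumerate(rows, start=1):
        # dict.fromkeys removes repeats within a row while keeping order.
        for term in dict.fromkeys(keys + symbols):
            index[term].append(doc_number)
    return dict(index)


index = index_rows(ROWS)
print(index["bond double"])  # [2, 4]
print(index["a11"])          # [2, 4, 5, 6, 8]
```

The postings lists produced this way agree with the entries listed in Figure 3.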
To mine the data, the user provides the flattened database and a language bias (the set of factors that influence and direct the search). In our case, this is a list of the COUNTKEYs, a list of all the predicates for use in associations, the types and modes of their arguments, and other constraints. Associations are then generated depth-first.
All arguments to the predicates in an association are variables that can be satisfied by particular database keys. To count how frequently an association appears in the database with respect to the COUNTKEY, we need to test whether, for each possible COUNTKEY, there is a set of keys that satisfies this relationship. This means that when we have multi-relational data we cannot simply intersect postings lists for predicates that appear within the same association, because we are seeking to identify predicates that share the correct set of keys that hold the relationships between the predicates. The algorithm for counting associations using our compressed inverted index is shown in Figure 4.
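The point about shared keys can be illustrated with a simplified sketch over the rows of Figure 2. It is not the algorithm of Figure 4: the association format, the function names, and the step that reads key positions back from the flattened rows are all assumptions made to keep the example short.

```python
from collections import defaultdict

ROWS = [  # the rows of Figure 2 as (keys, predicate symbols)
    (["m12"], ["inactive"]),
    (["m12", "a10", "a11"], ["bond double"]),
    (["m12", "a10"], ["elem carbon", "quanta 27", "charge medium"]),
    (["m12", "a11", "a12"], ["bond double"]),
    (["m12", "a11", "a13"], ["bond single"]),
    (["m12", "a11"], ["elem carbon", "quanta 22", "charge medium"]),
    (["m12", "r47", "a10"], ["ring benzene"]),
    (["m12", "r47", "a11"], ["ring benzene"]),
    (["m12", "r47", "a12"], ["ring benzene"]),
]


def build_index(rows):
    index = defaultdict(list)
    for doc, (keys, symbols) in enumerate(rows, start=1):
        for term in dict.fromkeys(keys + symbols):
            index[term].append(doc)
    return dict(index)


def intersect(a, b):
    """Merge-join two sorted postings lists."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def satisfiable(atoms, binding, rows, index):
    """Depth-first search for keys that satisfy every atom consistently."""
    if not atoms:
        return True
    (symbol, variables), rest = atoms[0], atoms[1:]
    docs = index.get(symbol, [])
    for v in variables:  # join with the postings of every key already bound
        if v in binding:
            docs = intersect(docs, index.get(binding[v], []))
    for doc in docs:
        keys = rows[doc - 1][0]
        if len(keys) != len(variables):
            continue
        trial = dict(binding)
        if all(trial.setdefault(v, k) == k for v, k in zip(variables, keys)):
            if satisfiable(rest, trial, rows, index):
                return True
    return False


def count_association(atoms, countkey_var, countkeys, rows, index):
    """Count the COUNTKEY values for which the association can be satisfied."""
    return sum(satisfiable(atoms, {countkey_var: ck}, rows, index)
               for ck in countkeys)


index = build_index(ROWS)
from_carbon = [("bond single", ("M", "A", "B")), ("elem carbon", ("M", "A"))]
to_carbon = [("bond single", ("M", "A", "B")), ("elem carbon", ("M", "B"))]
print(intersect(index["bond single"], index["elem carbon"]))     # []
print(count_association(from_carbon, "M", ["m12"], ROWS, index))  # 1
print(count_association(to_carbon, "M", ["m12"], ROWS, index))    # 0
```

Intersecting the two predicate lists directly finds nothing, because the bond and the element are recorded in different rows. Joining through the shared atom key finds that molecule m12 has a single bond starting at a carbon atom (a11), but none ending at an atom recorded as carbon, so the two associations receive different counts.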
Figure 4: Algorithm for counting an association
function countassoc(association)
  fetch postings lists for each predicate in association
  foreach countkey in COUNTKEYS do
    fetch postings list for countkey
    join with each appropriate predicate list
    if all joined lists are non-empty then
      if other args exist then
        if doargs(next argnum, joined lists) then
          count++
      else
        count++
  return count
function doargs(argnum, lists)
  find shortest docs list amongst appropriate predicates
  foreach doc in shortest docs list do
    key ← key of appropriate type for argnum from doc
    fetch postings list for key
    join with each appropriate predicate list
    if all joined lists are non-empty then
      if other args exist then
        if doargs(next argnum, joined lists) then
          return true
      else
        return true
  return false
5. Results
We present results of using RADAR, Warmr, and Farmer. All measurements were carried out on a 1.66 GHz AMD Athlon-based workstation running Linux with 2 GB of main-memory. We used two small collections, for which RADAR is not optimised but which are well known and well suited to the other schemes, and a large collection that illustrates the scalability of RADAR. MUTA is a well-known mutagenesis dataset [5], consisting of descriptions of molecules, including their atoms, bonds, and ring structures. KDD2002 is the collection used in Task 2 of the KDD 2002 Cup competition³, which describes yeast proteins and their interactions. YEASTHOM is a large collection⁴ that describes homologous relationships between yeast genes and proteins in the SwissProt database.
³ See: http://www.biostat.wisc.edu/~craven/kddcup/
⁴ http://www.aber.ac.uk/compsci/Research/bio/dss/yeastdata/
We compared RADAR to Warmr (version ACE 1.2.6) and
Farmer (2003). A fair, direct comparison is not straightfor-
ward as each algorithm has its own distinct properties. In
particular, Farmer does not allow a limit on the length of
the association, but only on the maximum use of each in-
dividual predicate. This means that we cannot stop Farmer
from finding more, longer associations than the other algo-
rithms.
Table 1 shows the results of our experiments. The results
for MUTA and KDD2002 illustrate the general properties of
the schemes: RADAR uses 34 Mb of main-memory for both
collections, while the memory use of the other schemes
varies significantly with the number of discovered associ-
ations (from 25 to 119 Mb for Warmr, and from 387 to
11 Mb for Farmer). Constant memory use comes at a price
for small collections: RADAR is two to three times slower
than the other schemes on the MUTA task, and unacceptably
slow on the KDD2002 task compared to the fast Farmer.
Data       Algorithm   Data size   Data size   Maximum memory   Time         Associations
                       (original)  (compiled)  use (Mb)                      found
MUTA       Warmr       823 kb      1,292 kb    25               7.8 mins     2,756
MUTA       Farmer      823 kb      -           387              10.9 mins    95,715
MUTA       RADAR       596 kb      526 kb      34               25.0 mins    12,530
KDD2002    Warmr       1,407 kb    1,556 kb    119              31.1 mins    7,523
KDD2002    Farmer      1,407 kb    -           11               0.1 mins     20,359
KDD2002    RADAR       1,023 kb    418 kb      34               361.0 mins   9,130
YEASTHOM   Warmr       841 Mb      880 Mb      800              25 days      7,712*
YEASTHOM   Farmer      1,465 Mb    -           1,254            18 days      698,974
YEASTHOM   RADAR       1,565 Mb    163 Mb      56               25 days      34,782*
Table 1. Experiments on the MUTA, KDD2002 and YEASTHOM collections. For MUTA, support = 20 molecules (10.6%), max. assoc. length = 3 predicates (excluding ). Farmer continued to find associations to length 11. For KDD2002, support = 20 ORFs (0.84%), max. assoc. length = 3 predicates (excluding ). Farmer continued to find associations to length 8. For YEASTHOM, support = 20 ORFs (0.31%), max. assoc. length = 3 predicates (excluding ). Italicised figures indicate that the algorithm was still running. Farmer continued to find associations to length 7 but stopped before completion due to main memory exhaustion. Warmr’s maximum memory use was set to 800 Mb.
The results for YEASTHOM illustrate the advantages of RADAR, and the disadvantages of the other approaches. RADAR is highly scalable: despite the almost thousand-fold increase in data size from KDD2002 to YEASTHOM, main-memory use only increases from 34 Mb to 56 Mb. Farmer, which is impressive on small datasets, is unsuitable for this task: main-memory use increases steadily throughout the lifetime of the task, since it holds the database and associations in memory. Indeed, after 18 days, main-memory was exhausted. Warmr processes associations in packs that group together common subparts for faster counting. This means that no results are given until a whole level is complete. For the YEASTHOM collection, associations of length two were produced after about six hours, and then the system gave no further output for several weeks.
RADAR is structured similarly to Farmer as an anytime algorithm that produces continuous output. Further, RADAR can be seeded with an association, so that the application can be restarted at any time. This aspect is useful for large-scale mining problems that run for weeks.
6. Conclusion
Large multi-relational collections are the next frontier
for data mining. In this paper we have shown how com-
pressed inverted indexes used in text retrieval systems can
be adapted for multi-relational data mining. Our technique,
RADAR, is both scalable and reliable on large amounts of
data. It produces output continuously, with the option of
stopping and resuming the mining process later. For small
datasets — for which RADAR is not designed — the Warmr
and Farmer techniques should be used in preference.
Acknowledgements
This work was supported by the Australian Research
Council.
References
[1] R. Agrawal and R. Srikant. Fast algorithms for mining asso-
ciation rules in large databases. In 20th International Con-
ference on Very Large Databases (VLDB 94), 1994.
[2] H. Blockeel et al. Improving the efficiency of Inductive
Logic Programming through the use of query packs. Journal
of Artificial Intelligence Research, 16:135–166, 2002.
[3] A. Clare and R. D. King. Data mining the yeast genome in
a lazy functional language. In Practical Aspects of Declar-
ative Languages (PADL’03), 2003.
[4] L. Dehaspe. Frequent Pattern Discovery in First Order
Logic. PhD thesis, Department of Computer Science,
Katholieke Universiteit Leuven, 1998.
[5] R. King, S. Muggleton, A. Srinivasan, and M. Sternberg. Structure-activity relationships derived by machine learning. Proc. Nat. Acad. Sci. USA, 93:438–442, 1996.
[6] S. Nijssen and J. N. Kok. Efficient frequent query discovery
in FARMER. In 13th International Conference on Inductive
Logic Programming (ILP 2003), 2003.
[7] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In K. Järvelin, M. Beaulieu, R. Baeza-Yates, and S. H. Myaeng, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 222–229, Tampere, Finland, 2002.
[8] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes:
Compressing and Indexing Documents and Images. Morgan
Kaufmann Publishers, Los Altos, CA 94022, USA, second
edition, 1999.
[9] M. J. Zaki. Scalable algorithms for association mining.
IEEE Transactions on Knowledge and Data Engineering,
12(3):372–390, 2000.
[10] J. Zobel and A. Moffat. Exploring the similarity space. ACM
SIGIR Forum, 32(1):18–34, 1998.
... to denote the induced subgraph under disjunctive and conjunctive interpretation respectively. 2 The attribute naming and the names of the problems and the graph properties we introduce later are inspired by the bibliography dataset (AU-THORS -PAPERS -TOPICS). Now let Ψ denote a graph property. ...
... There has already been some effort on multi-relational mining [2,3,4]. The approach taken has been to generalize apriori-like data-mining algorithms to the multi-relational case using inductive logic programming concepts. ...
Conference Paper
Full-text available
Traditional data mining methods consider the problem of mining a single relation that relates two different attributes. For example, in a scientific bibliography database, authors are related to papers, and we may be interested in discovering association rules between authors based on the papers that they have co-authored. However, in real life it is often the case that we have multiple attributes related through chains of relations. For example, authors write papers, and papers belong to one or more topics, defining a three-level chain of relations. In this paper we consider the problem of mining such relational chains. We formulate a generic problem of finding selector sets (subsets of objects from one of the attributes) such that the projected dataset—the part of the dataset determined by the selector set—satisfies a specific property. The motivation for our approach is that a given property might not hold on the whole dataset, but holds when projecting the data on a subset of objects. We show that many existing and new data mining problems can be formulated in the framework. We discuss various algorithms and identify the conditions when apriori technique can be used. We experimentally demonstrate the effectiveness and efficiency of our methods.
... The first viable multi-relational mining proposals go back to Inductive Logic Programming (ILP), having techniques based on pattern representation from first order logic [3], with its main representative being the WARMR algorithm [18]. Other existing ILP algorithms are the FARMER [19] and the RADAR [20], both proposing performance improvements of the WARMR. ...
Conference Paper
Full-text available
The multi-relational Data Mining approach has emerged as alternative to the analysis of structured data, such as relational databases. Unlike traditional algorithms, the multi-relational proposals allow mining directly multiple tables, avoiding the costly join operations. In this paper, is presented a comparative study involving the traditional Patricia Mine algorithm and its corresponding multi-relational proposed, MR-Radix in order to evaluate the performance of two approaches for mining association rules are used for relational databases. This study presents two original contributions: the proposition of an algorithm multi-relational MR-Radix, which is efficient for use in relational databases, both in terms of execution time and in relation to memory usage and the presentation of the empirical approach multi-relational advantage in performance over several tables, which avoids the costly join operations from multiple tables.
... The first viable multi-relational mining proposals go back to Inductive Logic Programming (ILP), having techniques based on pattern representation from first order logic [3], with its main representative being the WARMR algorithm [20] . Other existing ILP algorithms are the FARMER [21] and the RADAR [22] , both proposing performance improvements of the WARMR. The extraction of multi-relational association rules can also be done with graphs. ...
Article
Full-text available
Background Once multi-relational approach has emerged as an alternative for analyzing structured data such as relational databases, since they allow applying data mining in multiple tables directly, thus avoiding expensive joining operations and semantic losses, this work proposes an algorithm with multi-relational approach. Methods Aiming to compare traditional approach performance and multi-relational for mining association rules, this paper discusses an empirical study between PatriciaMine - an traditional algorithm - and its corresponding multi-relational proposed, MR-Radix. Results This work showed advantages of the multi-relational approach in performance over several tables, which avoids the high cost for joining operations from multiple tables and semantic losses. The performance provided by the algorithm MR-Radix shows faster than PatriciaMine, despite handling complex multi-relational patterns. The utilized memory indicates a more conservative growth curve for MR-Radix than PatriciaMine, which shows the increase in demand of frequent items in MR-Radix does not result in a significant growth of utilized memory like in PatriciaMine. Conclusion The comparative study between PatriciaMine and MR-Radix confirmed efficacy of the multi-relational approach in data mining process both in terms of execution time and in relation to memory usage. Besides that, the multi-relational proposed algorithm, unlike other algorithms of this approach, is efficient for use in large relational databases.
... Second, the overhead paid at each iteration is considerably large, since Workers are necessarily stateless and they need to read a database partition at each iteration. Third, the number of candidate patterns held in main-memory can grow impractically large [9]. Our contribution aims to overcome these limits through a different approach based on independent multi-sample mining [33]. ...
Article
Full-text available
The amount of data produced by ubiquitous computing ap- plications is quickly growing, due to the pervasive presence of small de- vices endowed with sensing, computing and communication capabilities. Heterogeneity and strong interdependence, which characterize 'ubiqui- tous data', require a (multi-)relational approach to their analysis. How- ever, relational data mining algorithms do not scale well and very large data sets are hardly processable. In this paper we propose an exten- sion of a relational algorithm for multi-level frequent pattern discovery, which resorts to data sampling and distributed computation in Grid en- vironments, in order to overcome the computational limits of the original serial algorithm. The set of patterns discovered by the new algorithm ap- proximates the set of exact solutions found by the serial algorithm. The quality of approximation depends on three parameters: the proportion of data in each sample, the minimum support thresholds and the number of samples in which a pattern has to be frequent in order to be con- sidered globally frequent. Considering that the first two parameters are hardly controllable, we focus our investigation on the third one. Theoret- ically derived conclusions are also experimentally confirmed. Moreover, an additional application in the context of event log mining proves the viability of the proposed approach to relational frequent pattern mining from very large data sets.
Article
Full-text available
The multi relational data mining approach has developed as an alternative way for handling the structured data such that RDBMS. This will provides the mining in multiple tables directly. In MRDM the patterns are available in multiple tables (relations) from a relational database. As the data are available over the many tables which will affect the many problems in the practice of the data mining. To deal with this problem, one either constructs a single table by Propositionalisation, or uses a Multi-Relational Data Mining algorithm. MRDM approaches have been successfully applied in the area of bioinformatics. Three popular pattern finding techniques classification, clustering and association are frequently used in MRDM. Multi relational approach has developed as an alternative for analyzing the structured data such as relational database. MRDM allowing applying directly in the data mining in multiple tables. To avoid the expensive joining operations and semantic losses we used the MRDM technique. This paper focuses some of the application areas of MRDM and feature directions as well as the comparison of ILP, GM, SSDM and MRDM
Chapter
Most of the sectors transferred their information to the internet environment once the technology used became widespread and cheap. Print media organizations, which have a vital role in informing public opinion, and the data that are the services of these organizations are also shared over internet media. The fact that the continuously increasing amount of data includes the rich data it causes interesting and important data to be overlooked. Having large numbers of responses returned from queries about a specific event, person or place brings query owners face to face with unwanted or unrelated query results. Therefore, it is necessary to access textual data in press-publication sources that are reachable over internet in a fast and effective way and to introduce studies about producing meaningful and important information from these resources.
Article
Full-text available
In many application domains, the amount of available data increased so much that humans need help from automatic computerized methods for extracting relevant information. Moreover, it is becoming more and more common to store data that possess inherently structural or relational characteristics. These types of data are best represented by graphs, which can very naturally represent entities, their attributes, and their relationships to other entities. In this article, we review the state of the art in graph mining, and we present advances in processing trees and graphs by two computational intelligence classes of methods, namely neural networks and kernel methods.
Conference Paper
This paper presents a case study of multi-relational data mining using the ConnectionBlock algorithm, applied to the database of a sugar mill. The algorithm handles multiple tables not explicitly correlated but which influence one another according to the semantics of the data involved. The experiment revealed very interesting and useful patterns that are not found using traditional algorithms. The paper aims to present how the data were prepared to obtain better expressiveness of the rules generated, showing the potential of the algorithm to find patterns in semantically related data.
Article
Many databases do not consist of a single table of fixed dimensions, but of objects that are related to each other: the databases are relational, or structured. We study the discovery of patterns in such data. In our approach, a data analyst specifies constraints on patterns that she believes to be of interest, and the computer searches for patterns that satisfy these constraints. An important constraint on which we focus, is the constraint that a pattern should have a significant number of occurrences in the data. Constraints like this allow the search to be performed reasonably efficiently. We develop algorithms for searching ppatterns taht are represented in formal first order logic, tree data structures and graph data structures. We perform experiments in which these algorithms, and algorithms proposed by other researchers, are compared with each other, and study which properties determine the efficiency of the algorithms. As a result, we are able to develop more efficient algorithms. As application we study the discovery of fragments in molecular datasets. The aim is to discover fragments that relate the structure of molecules to their activity.
Conference Paper
Full-text available
Compression reduces both the size of indexes and the time needed to evaluate queries. In this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. First, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. Second, we explore the impact of choice of compression scheme on retrieval efficiency.In experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact Golomb-Rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the CPU cache is less for an appropriately compressed index than for an uncompressed index. Moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. We conclude that fast byte-aligned codes should be used to store integers in inverted lists.
Article
Full-text available
The information model chosen by the ISO for the management of open systems is object‐oriented. We provide an effective mapping from the structural and behavioural specification of the managed objects of open systems to a compact logical form suitable ...
Article
Full-text available
We present a general approach to forming structure-activity relationships (SARs). This approach is based on representing chemical structure by atoms and their bond connectivities in combination with the inductive logic programming (ILP) algorithm PROGOL. Existing SAR methods describe chemical structure by using attributes which are general properties of an object. It is not possible to map chemical structure directly to attribute-based descriptions, as such descriptions have no internal organization. A more natural and general way to describe chemical structure is to use a relational description, where the internal construction of the description maps that of the object described. Our atom and bond connectivities representation is a relational description. ILP algorithms can form SARs with relational descriptions. We have tested the relational approach by investigating the SARs of 230 aromatic and heteroaromatic nitro compounds. These compounds had been split previously into two subsets, 188 compounds that were amenable to regression and 42 that were not. For the 188 compounds, a SAR was found that was as accurate as the best statistical or neural network-generated SARs. The PROGOL SAR has the advantages that it did not need the use of any indicator variables handcrafted by an expert, and the generated rules were easily comprehensible. For the 42 compounds, PROGOL formed a SAR that was significantly (P < 0.025) more accurate than linear regression, quadratic regression, and back-propagation. This SAR is based on an automatically generated structural alert for mutagenicity.
Article
Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets, and then forming conditional implication rules among them. We present efficient algorithms for the discovery of frequent itemsets, which forms the compute-intensive phase of the task. The algorithms utilize the structural properties of frequent itemsets to facilitate fast discovery. The items are organized into a subset lattice search space, which is decomposed into small independent chunks or sublattices, which can be solved in memory. Efficient lattice traversal techniques are presented which quickly identify all the long frequent itemsets and their subsets if required. We also present the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques. We experimentally compare the new algorithms against the previous approaches, obtaining improvements of more than an order of magnitude for our test databases.
Conference Paper
Critics of lazy functional languages contend that the languages are only suitable for toy problems and are not used for real systems. We present an application (PolyFARM) for distributed data mining in relational bioinformatics data, written in the lazy functional language Haskell. We describe the problem we wished to solve, the reasons we chose Haskell and relate our experiences. Laziness did cause many problems in controlling heap space usage, but these were solved by a variety of methods. The many advantages of writing software in Haskell outweighed these problems. These included clear expression of algorithms, good support for data structures, abstraction, modularity and generalisation leading to fast prototyping and code reuse, parsing tools, profiling tools, language features such as strong typing and referential transparency, and the support of an enthusiastic Haskell community. PolyFARM is currently in use mining data from the Saccharomyces cerevisiae genome and is freely available for non-commercial use at http://www.aber.ac.uk/compsci/Research/bio/dss/polyfarm/.
Conference Paper
The upgrade of frequent item set mining to a setup with multiple relations (frequent query mining) poses many efficiency problems. Taking Object Identity as starting point, we present several optimization techniques for frequent query mining algorithms. The resulting algorithm has a better performance than a previous ILP algorithm and competes with more specialized graph mining algorithms in performance.
Article
Ranked queries are used to locate relevant documents in text databases. In a ranked query a list of terms is specified, then the documents that most closely match the query are returned, in decreasing order of similarity, as answers. Crucial to the efficacy of ranked querying is the use of a similarity heuristic, a mechanism that assigns a numeric score indicating how closely a document and the query match. In this note we explore and categorise a range of similarity heuristics described in the literature. We have implemented all of these measures in a structured way, and have carried out retrieval experiments with a substantial subset of these measures. Our purpose with this work is threefold: first, in enumerating the various measures in an orthogonal framework we make it straightforward for other researchers to describe and discuss similarity measures; second, by experimenting with a wide range of the measures, we hope to observe which features yield good retrieval behaviour in a variety of retrieval environments; and third, by describing our results so far, to gather feedback on the issues we have uncovered. We demonstrate that it is surprisingly difficult to identify which techniques work best, and comment on the experimental methodology required to support any claims as to the superiority of one method over another.