Scalable Multi-Relational Association Mining
Department of Computer Science,
University of Wales Aberystwyth,
Aberystwyth, SY23 3DB, UK
Hugh E. Williams Nicholas Lester
School of Computer Science and Information Technology,
RMIT University, GPO Box 2476V,
Melbourne, Australia 3001.
We propose the new RADAR technique for multi-
relational data mining. This permits the mining of very
large collections and provides a new technique for discovering
multi-relational associations. Results show that RADAR
is reliable and scalable for mining a large yeast homology
collection, and that it does not have the main-memory scal-
ability constraints of the Farmer and Warmr tools.
1. Introduction
Large collections of multi-relational data present sig-
niﬁcant new challenges to data mining. These challenges
are reﬂected in the annual KDD Cup competition, which
involved relational datasets in 2001 and 2002, and net-
work mining in 2003. The July 2003 edition of the ACM
SIGKDD Explorations is devoted to position papers outlining
the current frontiers in multi-relational data mining.
Similar problems exist in bioinformatics databases,
such as those at MIPS, that provide integrated data on
a genome-wide scale for whole organisms, with multiple
cross references to other databases.
The vast majority of association mining algorithms are
designed for single table, propositional datasets. We pro-
pose a novel technique for multi-relational association min-
ing that permits efﬁcient and scalable discovery of relation-
ships. To our knowledge, the only existing multi-relational
association mining algorithms are upgrades of Apriori 
and, with the ﬁeld in its infancy, there is much scope for
improving the scalability of these solutions. Our technique
uses an inverted index, a largely disk-based search structure
that is used to support querying in all practical Information
Retrieval systems and web search engines.
This work was carried out at and supported by the School of Computer
Science and Information Technology at RMIT University.
2. Inverted Indexes
An inverted index is a well-known structure used in
all practical text retrieval systems. It consists of an
in-memory (or partially in-memory) search structure that
stores the vocabulary of searchable terms, and on-disk postings
that store, for each term, the location of that term in the
collection. In practice, the vocabulary is typically the words
that occur in the collection.
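Using the notation of Zobel and Moffat, each term $t$
has postings $\langle d, f_{d,t} \rangle$, where $f_{d,t}$ is the frequency
of term $t$ in document $d$. Consider an example for the term
“mining” that occurs in four documents:

$\langle 11, 2 \rangle \; \langle 19, 1 \rangle \; \langle 72, 1 \rangle \; \langle 107, 2 \rangle$
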
This postings list shows that the word “mining” occurs
twice in document 11, once in document 19, once in docu-
ment 72, and twice in document 107. The documents them-
selves are ordinally numbered, and a mapping table asso-
ciates each document number to its location on disk. De-
spite its simplicity, this inverted index structure is sufﬁcient
to support the popular ranked query mode that is used by
most search engine users.
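
As an illustration of the structure described above, the following minimal Python sketch builds an in-memory inverted index whose postings are (document number, frequency) pairs. The function name `build_index` and the toy collection are hypothetical; they are chosen only so that the postings for "mining" match the example in the text, and the sketch ignores the on-disk layout and mapping table.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term t to a postings list of (d, f_dt) pairs, ordered by d."""
    index = defaultdict(list)
    for d in sorted(docs):
        # Count how often each whitespace-separated term occurs in document d.
        counts = Counter(docs[d].lower().split())
        for term, f_dt in counts.items():
            index[term].append((d, f_dt))
    return dict(index)


# Hypothetical toy collection; the document numbers mirror the "mining" example.
docs = {
    11: "mining text and mining data",
    19: "association mining",
    72: "relational mining",
    107: "mining yeast data by mining indexes",
}
index = build_index(docs)
print(index["mining"])
```

Because documents are processed in increasing document number, each postings list is already sorted, which is what later makes list joins and gap-based compression straightforward.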
The organisation, compression, and processing of post-
ings lists is crucial to retrieval system performance. Com-
pression is important for three reasons: ﬁrst, a compressed
representation requires less storage space than an uncom-
pressed one; second, a retrieval system is faster when com-
pression is used, since the cost of transferring compressed
lists and decompressing them is typically much less than the
cost of transferring uncompressed data; and, last, caching
is improved because more lists fit into main-memory than
when uncompressed lists are used. Scholer et al. recently
showed that compressing postings lists more than halves
query evaluation times compared with using no compression.
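
To make the idea of postings compression concrete, the sketch below stores the gaps between successive document numbers and codes each gap with a variable-byte scheme. This is an assumption made for illustration: the text does not state which coding scheme is used, and the function names are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers 7 bits per byte, low-order bits first.
    The high bit is set on the final byte of each number."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Invert vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # final byte of this number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def compress_postings(doc_numbers):
    """Store a sorted list of document numbers as variable-byte coded gaps."""
    gaps = [d - prev for prev, d in zip([0] + doc_numbers[:-1], doc_numbers)]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document numbers by summing the decoded gaps."""
    docs, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        docs.append(total)
    return docs


postings = [11, 19, 72, 107, 100000]
packed = compress_postings(postings)
print(len(packed), decompress_postings(packed))
```

Small gaps fit in a single byte, so dense postings lists shrink considerably compared with fixed-width integers, at the cost of a sequential decode.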
3. Multi-Relational Association Mining
The first mining technique to find associations in multi-table
relational data was Warmr. Warmr is a first-order
upgrade of Apriori, with the additional introduction
of a user-defined language bias to restrict the search space.
Blockeel et al. have been investigating enhancements,
such as query packs, to the underlying Prolog compiler
to address efficiency issues. They have also implemented
techniques to allow the user to limit the amount of data
required to be loaded into main-memory. With Warmr, the
user has the full power of the Prolog programming language
for specifying the data and background knowledge.
PolyFARM  was based on the ideas of Warmr and
written for distribution on a Beowulf cluster by partition-
ing the data to be counted. Unfortunately, although the size
of the database is reduced by partitioning, the size of the
candidate associations held in main-memory can grow im-
Nijssen and Kok’s Farmer is a new multi-relational
mining technique, with a running time that is an order
of magnitude improvement over Warmr; indeed, on small
data sets, Farmer can be astonishingly fast. However, it
still requires that all data is available in main-memory, which
remains a significant problem for large datasets, and its
main-memory use increases steadily throughout each search.
We propose RADAR, the Relational Association
. RADAR is the ﬁrst multi-
relational association mining algorithm that uses com-
pressed inverted indexing techniques to provide a scalable
solution for mining large databases.
Our aim is to count all frequent associations in a
database. We use the language of first order logic to
represent the associations. A frequent association is a
conjunction or set of atoms that occurs with at least the
minimum support frequency in the database. For example,
“a chardonnay wine that is made by an Australian grower”
is represented by the association:
Inspired by the Eclat algorithm, we propose to mine
these frequent associations by flattening the database,
building an inverted index of the flattened database, and
repeatedly joining postings lists.
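
For readers unfamiliar with Eclat, the following sketch shows the underlying idea in the simpler single-table (propositional) setting: each item carries the set of transaction identifiers in which it occurs, and longer itemsets are counted depth-first by intersecting these sets. It is a minimal illustration and not the multi-relational procedure proposed here; the item names and the `eclat` function are hypothetical.

```python
def eclat(tidlists, min_support):
    """Depth-first search over itemsets. Each itemset carries the set of
    transaction ids (its postings list) in which all of its items occur."""
    frequent = {}

    def extend(prefix, prefix_tids, candidates):
        for i, (item, tids) in enumerate(candidates):
            # Join the prefix's list with the next item's list.
            joined = tids if prefix_tids is None else prefix_tids & tids
            if len(joined) >= min_support:
                itemset = prefix + (item,)
                frequent[itemset] = len(joined)
                # Only later items are considered, so each set is seen once.
                extend(itemset, joined, candidates[i + 1:])

    extend((), None, sorted(tidlists.items()))
    return frequent


# Hypothetical transaction-id lists for three items.
tidlists = {
    "chardonnay": {1, 2, 3, 5},
    "australian": {2, 3, 4, 5},
    "organic": {3, 6},
}
print(eclat(tidlists, min_support=2))
```

In the multi-relational case a plain intersection is not sufficient, because the joined rows must also agree on the keys that link the predicates.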
In a multi-table relational database, we must decide
which field in which table is our main key or notion of
transaction, that is, what we are counting. For example,
in a database representing wines, retailers, and growers, we
must decide if we are interested in counting the number
of Australian growers that make chardonnay wines, or the
number of chardonnay wines that are made by Australian
growers. We refer to this field as the COUNTKEY, so as to
distinguish it from the common notion of a database key.

The RADAR software and sample databases are available from

Figure 1 Five tables representing molecules by atoms and

Figure 2 Example of the two-column flattened database
with keys. For example, line 2 describes a double bond in
mol 12 between atoms 10 and 11.

Line  Keys (Arguments)  Attributes (Predicate Symbols)
2     m12, a10, a11     bond double
3     m12, a10          elem carbon, quanta 27, charge medium
4     m12, a11, a12     bond double
5     m12, a11, a13     bond single
6     m12, a11          elem carbon, quanta 22, charge medium
7     m12, r47, a10     ring benzene
8     m12, r47, a11     ring benzene
9     m12, r47, a12     ring benzene

To prepare for indexing, the database is ﬂattened into a
single table with a two-column format. The ﬁrst column
stores the database keys (which represent the arguments to
the predicates), and the second column stores the database
items, that is, descriptive attributes (which represent the
predicate names). We refer to these as keys and predicate
symbols respectively. Each row of the ﬂattened database
can hold multiple keys and multiple predicate symbols.
The attributes in a simple multi-table relational database
describing molecules represented by bonds and atoms are
shown in Figure 1. Selected ﬂattened rows from this
database are shown in Figure 2. Flattening can be made
more or less explicit depending on the application requirements.
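
The flattening step can be sketched as follows. The two input tables and their column layouts are assumptions made for illustration (the original tables of Figure 1 are not reproduced here), and the row order produced is not intended to match Figure 2.

```python
def flatten(bonds, atoms):
    """Return (keys, predicate_symbols) rows in the two-column format.

    bonds: hypothetical rows of (molecule, atom1, atom2, bond_type)
    atoms: hypothetical rows of (molecule, atom, element, quanta, charge)
    """
    rows = []
    for mol, a1, a2, kind in bonds:
        # Keys become predicate arguments; the attribute names the predicate.
        rows.append(((mol, a1, a2), (f"bond {kind}",)))
    for mol, atom, elem, quanta, charge in atoms:
        # One flattened row may carry several predicate symbols.
        rows.append(((mol, atom),
                     (f"elem {elem}", f"quanta {quanta}", f"charge {charge}")))
    return rows


bonds = [("m12", "a10", "a11", "double"), ("m12", "a11", "a13", "single")]
atoms = [("m12", "a10", "carbon", 27, "medium")]
rows = flatten(bonds, atoms)
for keys, symbols in rows:
    print(", ".join(keys), "|", ", ".join(symbols))
```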
Keys are used to form the arguments to the predicates.
For example, if is to be a possible
atom in associations, then any row in the ﬂattened database
that contains an instance of the term in the second
column must always have both Wine and Grower keys listed
in the ﬁrst column of that row.
To create an inverted index for the flattened database, we
number each row sequentially and use these numbers as the
document numbers. All terms within a row are indexed, that
is, both keys and attributes. A section of the inverted index
for Figure 2 is shown in Figure 3.

Figure 3 A section of the inverted index of the flattened
database from Figure 2. For compactness, the postings lists
show only document numbers; we have omitted the frequencies $f_{d,t}$.

Term ($t$)     Postings list ($d$)
bond double    2,4
bond single    5
ring benzene   7,8,9
quanta 22      6
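
The index section in Figure 3 can be reproduced with a few lines of Python. This is a sketch under two assumptions: the rows shown in Figure 2 are taken to be rows 2 to 9 of the flattened database, and row 4 is taken to carry "bond double", as the postings in Figure 3 imply. The function name `index_rows` is hypothetical.

```python
from collections import defaultdict

# Rows 2-9 of the flattened database: row number -> (keys, attributes).
# Row numbering and the attribute of row 4 are assumptions (see lead-in).
rows = {
    2: (["m12", "a10", "a11"], ["bond double"]),
    3: (["m12", "a10"], ["elem carbon", "quanta 27", "charge medium"]),
    4: (["m12", "a11", "a12"], ["bond double"]),
    5: (["m12", "a11", "a13"], ["bond single"]),
    6: (["m12", "a11"], ["elem carbon", "quanta 22", "charge medium"]),
    7: (["m12", "r47", "a10"], ["ring benzene"]),
    8: (["m12", "r47", "a11"], ["ring benzene"]),
    9: (["m12", "r47", "a12"], ["ring benzene"]),
}


def index_rows(rows):
    """Index every term of every row, both keys and attributes."""
    index = defaultdict(list)
    for number in sorted(rows):
        keys, attributes = rows[number]
        for term in keys + attributes:
            index[term].append(number)
    return dict(index)


index = index_rows(rows)
for term in ("bond double", "bond single", "ring benzene", "quanta 22"):
    print(term, index[term])
```

Keys are indexed in exactly the same way as attributes, so the postings list for a key such as a10 gives every row in which that atom participates.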
To mine the data, the user provides the flattened database
and a language bias (the set of factors that influence and
direct the search). In our case, this is a list of the
COUNTKEYS, a list of all the predicates for use in associations, the
types and modes of their arguments, and other constraints.
Associations are then generated depth-first.
All arguments to the predicates in an association are variables
that can be satisfied by particular database keys. To
count how frequently an association appears in the database
with respect to the COUNTKEY, we need to test
whether, for each possible COUNTKEY, there is a set of
keys that satisfies this relationship. This means that when we
have multi-relational data we cannot simply intersect postings
lists for predicates that appear within the same association,
because we are seeking to identify predicates that share
the correct set of keys that hold the relationships between
the predicates. The algorithm for counting associations using
our compressed inverted index is shown in Figure 4.
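
The need for key-aware joins can be seen in a small example. The sketch below is a simplified illustration and not the algorithm of Figure 4: it works on an uncompressed in-memory index, assumes that the keys of a row are the positional arguments of its predicates, and uses hypothetical names (`count_association`, `count_var`) and hypothetical data for a second molecule, m13.

```python
from collections import defaultdict

rows = {  # row number -> (keys, attributes); m13 rows are hypothetical
    2: (("m12", "a10", "a11"), ("bond double",)),
    3: (("m12", "a10"), ("elem carbon",)),
    10: (("m13", "a1", "a2"), ("bond double",)),
    11: (("m13", "a3"), ("elem carbon",)),
}
index = defaultdict(list)
for number in sorted(rows):
    keys, attributes = rows[number]
    for term in keys + attributes:
        index[term].append(number)


def count_association(rows, index, association, count_var, countkeys):
    """Count the COUNTKEY values for which every atom of the association
    is matched by rows whose keys agree on the shared variables."""

    def satisfiable(atoms, candidates, binding):
        if not atoms:
            return True
        _, variables = atoms[0]
        for number in candidates[0]:
            keys = rows[number][0]
            if len(keys) != len(variables):
                continue
            new = dict(binding)
            # A variable keeps its first binding; later rows must agree.
            if all(new.setdefault(v, k) == k for v, k in zip(variables, keys)):
                if satisfiable(atoms[1:], candidates[1:], new):
                    return True
        return False

    count = 0
    for ck in countkeys:
        ck_rows = set(index.get(ck, ()))
        # Join each predicate's postings list with the COUNTKEY's list.
        candidates = [sorted(ck_rows.intersection(index.get(pred, ())))
                      for pred, _ in association]
        if all(candidates) and satisfiable(association, candidates,
                                           {count_var: ck}):
            count += 1
    return count


# "A molecule M with a double bond from a carbon atom A to some atom B".
association = [("bond double", ("M", "A", "B")), ("elem carbon", ("M", "A"))]
print(count_association(rows, index, association, "M", ["m12", "m13"]))
```

Directly intersecting the postings lists for "bond double" and "elem carbon" yields nothing, since the two predicates never share a row, while checking only that both occur somewhere in a molecule would count m13 as well. Requiring the rows to agree on the atom key A leaves m12 as the only match.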
We present results of using RADAR, Warmr, and Farmer.
All measurements were carried out on a 1.66 GHz AMD
Athlon-based workstation running Linux with 2 GB of
main-memory. We used two small collections, for which
RADAR is not optimised but that are well-known and well-suited
to the other schemes, and a large collection that
illustrates the scalability of RADAR. MUTA is a well-known
mutagenesis dataset, consisting of descriptions
of molecules, including their atoms, bonds, and ring structures.
KDD2002 is the collection used in Task 2 of the KDD
2002 Cup competition, which describes yeast proteins and
their interactions. YEASTHOM is a large collection that
describes homologous relationships between yeast genes and
proteins in the SwissProt database.

See: http://www.biostat.wisc.edu/~craven/kddcup/

Figure 4: Algorithm for counting an association
function countassoc( )
fetch postings lists for each predicate in
foreach in do
fetch postings list for
join with each appropriate
if all are non-empty then
if other args exist then
function doargs( )
find shortest docs list amongst appropriate predicates
key of appropriate type for argnum from
fetch postings list for
join with each appropriate
if all are non-empty then
if other args exist then

We compared RADAR to Warmr (version ACE 1.2.6) and
Farmer (2003). A fair, direct comparison is not straightfor-
ward as each algorithm has its own distinct properties. In
particular, Farmer does not allow a limit on the length of
the association, but only on the maximum use of each in-
dividual predicate. This means that we cannot stop Farmer
from finding more, longer associations than the other algorithms.
Table 1 shows the results of our experiments. The results
for MUTA and KDD2002 illustrate the general properties of
the schemes: RADAR uses 34 Mb of main-memory for both
collections, while the memory use of the other schemes
varies signiﬁcantly with the number of discovered associ-
ations (from 25 to 119 Mb for Warmr, and from 387 to
11 Mb for Farmer). Constant memory use comes at a price
for small collections: RADAR is two to three times slower
than the other schemes on the MUTA task, and unacceptably
slow on the KDD2002 task compared to the fast Farmer.
The results for YEASTHOM illustrate the advantages of
RADAR, and the disadvantages of the other approaches.
RADAR is highly scalable: despite the almost thousand-fold
increase in data size from KDD2002 to YEASTHOM, main-memory
use only increases from 34 Mb to 56 Mb. Farmer,
which is impressive on small datasets, is unsuitable for
this task: main-memory use increases steadily throughout
the lifetime of the task, since it holds the database and
associations in memory. Indeed, after 18 days, main-memory
was exhausted. Warmr processes associations in packs that
group together common subparts for faster counting. This
means that no results are given until a whole level is
complete. For the YEASTHOM collection, associations of length
two were produced after about six hours, and then the
system gave no further output for several weeks.

Data      Algorithm  Data size             Maximum memory  Time        Associations
                     Original   Compiled   use (Mb)                    found
MUTA      Warmr      823 kb     1,292 kb   25              7.8 mins    2,756
MUTA      Farmer     823 kb     -          387             10.9 mins   95,715
MUTA      RADAR      596 kb     526 kb     34              25.0 mins   12,530
KDD2002   Warmr      1,407 kb   1,556 kb   119             31.1 mins   7,523
KDD2002   Farmer     1,407 kb   -          11              0.1 mins    20,359
KDD2002   RADAR      1,023 kb   418 kb     34              361.0 mins  9,130
YEASTHOM  Warmr      841 Mb     880 Mb     800             25 days     7,712*
YEASTHOM  Farmer     1,465 Mb   -          1,254           18 days     698,974
YEASTHOM  RADAR      1,565 Mb   163 Mb     56              25 days     34,782*

Table 1. Experiments on the MUTA, KDD2002 and YEASTHOM collections. For MUTA, support = 20
molecules (10.6%), max. assoc. length = 3 predicates (excluding ). Farmer continued to find
associations to length 11. For KDD2002, support = 20 ORFs (0.84%), max. assoc. length = 3 predicates
(excluding ). Farmer continued to find associations to length 8. For YEASTHOM, support = 20
ORFs (0.31%), max. assoc. length = 3 predicates (excluding ). Figures marked * indicate that
the algorithm was still running. Farmer continued to find associations to length 7 but stopped before
completion due to main memory exhaustion. Warmr’s maximum memory use was set to 800 Mb.

RADAR is structured, similarly to Farmer, as an anytime
algorithm that produces continuous output. Further,
RADAR can be seeded with an association, so that the appli-
cation can be restarted at any time. This aspect is useful for
large-scale mining problems that run for weeks.
Large multi-relational collections are the next frontier
for data mining. In this paper we have shown how com-
pressed inverted indexes used in text retrieval systems can
be adapted for multi-relational data mining. Our technique,
RADAR, is both scalable and reliable on large amounts of
data. It produces output continuously, with the option of
stopping and resuming the mining process later. For small
datasets — for which RADAR is not designed — the Warmr
and Farmer techniques should be used in preference.
This work was supported by the Australian Research Council.
 R. Agrawal and R. Srikant. Fast algorithms for mining asso-
ciation rules in large databases. In 20th International Con-
ference on Very Large Databases (VLDB 94), 1994.
 H. Blockeel et al. Improving the efﬁciency of Inductive
Logic Programming through the use of query packs. Journal
of Artiﬁcial Intelligence Research, 16:135–166, 2002.
 A. Clare and R. D. King. Data mining the yeast genome in
a lazy functional language. In Practical Aspects of Declar-
ative Languages (PADL’03), 2003.
 L. Dehaspe. Frequent Pattern Discovery in First Order
Logic. PhD thesis, Department of Computer Science,
Katholieke Universiteit Leuven, 1998.
 R. King, S. Muggleton, A. Srinivasan, and M. Sternberg.
Structure-activity relationships derived by machine learning.
Proc. Nat. Acad. Sci. USA, 93:438–442, 1996.
 S. Nijssen and J. N. Kok. Efﬁcient frequent query discovery
in FARMER. In 13th International Conference on Inductive
Logic Programming (ILP 2003), 2003.
 F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Com-
pression of inverted indexes for fast query evaluation. In
K. Järvelin, M. Beaulieu, R. Baeza-Yates, and S. H. Myaeng,
editors, Proc. ACM-SIGIR International Conference on Re-
search and Development in Information Retrieval, pages
222–229, Tampere, Finland, 2002.
 I. Witten, A. Moffat, and T. Bell. Managing Gigabytes:
Compressing and Indexing Documents and Images. Morgan
Kaufmann Publishers, Los Altos, CA 94022, USA, second
 M. J. Zaki. Scalable algorithms for association mining.
IEEE Transactions on Knowledge and Data Engineering,
 J. Zobel and A. Moffat. Exploring the similarity space. ACM
SIGIR Forum, 32(1):18–34, 1998.