Finding Closed Frequent Item Sets
by Intersecting Transactions
Christian Borgelt
European Centre for Soft Computing
c/ Gonzalo Gutiérrez Quirós s/n
E-33600 Mieres (Asturias), Spain
christian.borgelt@softcomputing.es
Xiaoyuan Yang
Telefonica Research
Via Augusta, 177
E-08021 Barcelona
yxiao@tid.es
Ruben Nogales-Cadenas
Facultad de Ciencias Físicas
Universidad Complutense
E-28040 Madrid, Spain
ruben.nogales@fdi.ucm.es
Pedro Carmona-Saez
Facultad de Ciencias Físicas
Universidad Complutense
E-28040 Madrid, Spain
pcarmona@fis.ucm.es
Alberto Pascual-Montano
Functional Bioinformatics Group
National Center for Biotechnology-CSIC
E-28040 Madrid, Spain
pascual@cnb.csic.es
ABSTRACT
Most known frequent item set mining algorithms work by
enumerating candidate item sets and pruning infrequent can-
didates. An alternative method, which works by intersect-
ing transactions, is much less researched. To the best of our
knowledge, there are only two basic algorithms: a cumula-
tive scheme, which is based on a repository with which new
transactions are intersected, and the Carpenter algorithm,
which enumerates and intersects candidate transaction sets.
These approaches yield the set of so-called closed frequent
item sets, since any such item set can be represented as the
intersection of some subset of the given transactions. In this
paper we describe a considerably improved implementation
scheme of the cumulative approach, which relies on a pre-
fix tree representation of the already found intersections. In
addition, we present an improved way of implementing the
Carpenter algorithm. We demonstrate that on specific data
sets, which occur particularly often in the area of gene ex-
pression analysis, our implementations significantly outper-
form enumeration approaches to frequent item set mining.
Categories and Subject Descriptors
I.5.5 [Pattern Recognition]: Implementation
Keywords
frequent item set mining; closed item set; intersection
1. INTRODUCTION
It is hardly an exaggeration to say that the popular research
area of data mining was started by the tasks of frequent item
set mining and association rule induction (see Section 2.1).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
EDBT 2011, March 22–24, 2011, Uppsala, Sweden.
Copyright 2011 ACM 978-1-4503-0528-0/11/0003 ...$10.00

At least, these tasks have a strong and long-standing tradition in data mining and knowledge discovery in databases
and triggered an abundance of publications in data mining
conferences and journals. Research efforts devoted to these
tasks have led to a variety of sophisticated and efficient al-
gorithms to find frequent item sets. Among the best-known
approaches are Apriori [2, 1], Eclat [22] and FP-growth [11].
Most of these approaches enumerate candidate item sets,
determine their support, and prune candidates that fail to
reach the user-specified minimum support. A brief abstract
description of a general search scheme, which can also be
interpreted as a depth-first search in the subset lattice of
the item sets, is provided in Section 2.2. In this view, enu-
meration approaches work “top-down”, since they start at
the one-element item sets and work their way downward by
extending found frequent item sets by new items.
An alternative algorithmic scheme, which has been much less
researched, intersects (subsets of) the given transactions. To
the best of our knowledge, only two main variants of this
approach have been studied up to now: [14] proposed a cu-
mulative scheme, in which new transactions are intersected
with a repository of already found closed item sets, and [15]
proposed the Carpenter algorithm (and [16] its closely re-
lated variant Cobbler), which works by enumerating and in-
tersecting candidate transaction sets. These approaches can
be seen as working “bottom-up”, because they start with
large item sets, namely the transactions, which are reduced
to smaller sets by intersecting them with other transactions.
The main reason why the intersection approach is less re-
searched is that it is often not competitive with the item set
enumeration approaches, at least on standard benchmark
data sets. However, standard benchmark data sets contain
comparatively few items (a few hundred), and very many
transactions (tens or even hundreds of thousands). Natu-
rally, if there are few items, there are (relatively) few candi-
date item sets to enumerate and thus the search space of the
enumeration approaches is of manageable size. In contrast
to this, the more transactions there are, the more work an
intersection approach has to do, especially, since it is not
linear in the number of transactions like the support com-
putation of the item set enumeration approaches.
However, not all data sets, to which frequent item set min-
ing and association rule induction can profitably be applied,
have this structure. As we discuss in Section 4, in gene ex-
pression analysis one often meets data sets with very many
items (several thousand to tens of thousands), but fairly few
transactions (several dozens to hundreds). On these data
sets item set enumeration approaches are likely to strug-
gle, because of the huge search space, while the intersection
methods can be fast, because there are only few transac-
tions to intersect. Indeed, as we demonstrate on such data,
our implementations of the intersection methods can signif-
icantly outperform the item set enumeration approaches.
The rest of this paper is structured as follows: in Section 2
we review the basics of frequent item set mining, provide a
brief, abstract description of the item set enumeration ap-
proach, and define the important concept of a closed (fre-
quent) item set. We exploit an alternative characterization
of closed item sets to establish the intersection approach,
and review the Galois connection by which it can be for-
mally justified. In Section 3 we discuss our improvements
of the Carpenter algorithm as well as the basic cumulative
/ repository-based algorithm and our prefix tree implemen-
tation of it. Section 4 describes the application area of gene
expression analysis, the data sets we used in our experi-
ments, and the results we obtained on them. Finally, in
Section 5, we draw conclusions from our discussion.
2. FREQUENT ITEM SET MINING
Frequent item set mining is a data analysis method that
was originally developed for market basket analysis and that
aims at finding regularities in the shopping behavior of the
customers of supermarkets, mail-order companies and online
shops. In particular, it tries to identify sets of products that
are frequently bought together. Once identified, such sets of
associated products may be used to optimize the organiza-
tion of the offered products on the shelves of a supermarket
or the pages of a mail-order catalog or web shop, or can give
hints which products may conveniently be bundled or may
be suggested to a new customer.
2.1 Basic Notions and Notation
Formally, the task of frequent item set mining can be described as follows: we are given a set B of items, called the item base, and a database T of transactions. Each item represents a product, and the item base represents the set of all products offered by a store. The term item set refers to any subset of the item base B. Each transaction is an item set and represents a set of products that has been bought by an actual customer. Since two or even more customers may have bought the exact same set of products, the total of all transactions must be represented as a vector, a bag or a multiset, since in a simple set each transaction could occur at most once. (Alternatively, each transaction may be enhanced by a unique transaction identifier, and these enhanced transactions may then be combined in a simple set.) Note that the item base B is often not given explicitly, but implicitly as the union of all transactions.
Let T = (t1, ..., tn) be a transaction database over an item base B. The cover KT(I) of an item set I ⊆ B w.r.t. this database is the set of indices of the transactions that contain it, that is,

  KT(I) = {k ∈ {1,...,n} | I ⊆ tk}.

The support sT(I) of an item set I ⊆ B is the size of its cover: sT(I) = |KT(I)|, that is, the number of transactions in the database T it is contained in. Given a user-specified minimum support smin ∈ ℕ, an item set I is called frequent in T iff sT(I) ≥ smin. The goal of frequent item set mining is to identify all item sets I ⊆ B that are frequent in a given transaction database T. Note that the task of frequent item set mining may also be defined with a relative minimum support (a fraction or percentage of the transactions in T); this alternative definition is obviously equivalent.
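To make these notions concrete, the following C sketch computes cover size (support) and the frequency test. It is an illustration only: transactions are encoded as bit masks over a small item base, which is not the representation used by the algorithms discussed later.

```c
#include <stddef.h>

/* Transactions as bit masks: bit i of t is set iff item i is in t.
   (Illustrative encoding, limited to small item bases.) */

/* support s_T(I): number of transactions t_k with I a subset of t_k */
static size_t support(const unsigned *db, size_t n, unsigned I)
{
    size_t k, s = 0;
    for (k = 0; k < n; k++)
        if ((db[k] & I) == I)    /* I is a subset of t_k */
            s++;
    return s;
}

/* frequency test w.r.t. a user-specified minimum support smin */
static int is_frequent(const unsigned *db, size_t n, unsigned I, size_t smin)
{
    return support(db, n, I) >= smin;
}
```

For example, with the four transactions {a,b,c}, {a,b}, {a,c}, {a}, the item set {a,b} has support 2 and is frequent for smin = 2, while {b,c} (support 1) is not.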
2.2 Item Set Enumeration Algorithms
A standard approach to find all frequent item sets w.r.t.
a given database T and support threshold smin, which is adopted by basically all frequent item set mining algorithms (except those of the Apriori family), is a depth-first search in the subset lattice of the item base B. This approach can be interpreted as a simple divide-and-conquer scheme. For some chosen item i, the problem of finding all frequent item sets is split into two subproblems: (1) find all frequent item sets containing the item i and (2) find all frequent item sets not containing the item i. Each subproblem is then further divided based on another item j: find all frequent item sets containing (1.1) both items i and j, (1.2) item i, but not j, (2.1) item j, but not i, (2.2) neither item i nor j, and so on.
All subproblems that occur in this divide-and-conquer re-
cursion can be defined by a conditional transaction database
and a prefix. The prefix is a set of items that has to be added to all frequent item sets that are discovered in the conditional database. Formally, all subproblems are tuples S = (C, P), where C is a conditional transaction database and P ⊆ B is a prefix. The initial problem, with which the recursion is started, is S = (T, ∅), where T is the transaction database to mine and the prefix is empty.

A subproblem S0 = (C0, P0) is processed as follows: Choose an item i ∈ B0, where B0 is the set of items occurring in C0. This choice is, in principle, arbitrary, but usually follows some predefined order of the items. If sC0(i) ≥ smin, then report the item set P0 ∪ {i} as frequent with the support sC0(i), and form the subproblem S1 = (C1, P1) with P1 = P0 ∪ {i}. The conditional transaction database C1 comprises all transactions in C0 that contain the item i, but with the item i removed. This also implies that transactions that contain no other item than i are entirely removed: no empty transactions are ever kept. If C1 contains at least smin transactions, process S1 recursively. In any case (that is, regardless of whether sC0(i) ≥ smin or not), form the subproblem S2 = (C2, P2), where P2 = P0 and the conditional transaction database C2 comprises all transactions in C0 (regardless of whether they contain the item i or not), but again with the item i removed. If C2 contains at least smin transactions, process S2 recursively.
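The recursive scheme above can be sketched in a few lines of C. This is an illustration, not one of the algorithms discussed here: transactions are bit masks, and the support of each candidate is recomputed on the full database instead of maintaining conditional databases. The include/exclude split appears as the recursive call (include item i) versus the next loop iteration (exclude item i).

```c
#include <stddef.h>

/* Enumerate all frequent item sets that extend 'prefix' with items
   >= next_item; returns how many frequent item sets were found. */
static size_t mine(const unsigned *db, size_t n, unsigned prefix,
                   int next_item, int n_items, size_t smin)
{
    size_t cnt = 0;
    int i;
    for (i = next_item; i < n_items; i++) {
        unsigned cand = prefix | (1u << i);   /* extend prefix by item i */
        size_t k, s = 0;
        for (k = 0; k < n; k++)               /* compute support of cand */
            if ((db[k] & cand) == cand) s++;
        if (s >= smin)                        /* report cand and recurse */
            cnt += 1 + mine(db, n, cand, i + 1, n_items, smin);
    }
    return cnt;
}
```

On the database {a,b,c}, {a,b}, {a,c}, {a} with smin = 2 this finds the five frequent item sets {a}, {b}, {c}, {a,b}, {a,c}.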
Eclat, FP-growth, and several other frequent item set mining
algorithms (see [7, 8]) rely on this basic scheme, but differ
in how they represent the conditional databases. The main
approaches are horizontal and vertical representations. In a
horizontal representation, the database is stored as a list (or
array) of transactions, each of which is a list (or array) of the
items contained in it. In a vertical representation, a database
is represented by first referring with a list (or array) to the
different items. For each item a list (or array) of identifiers
is stored, which indicate the transactions that contain the
item. However, this distinction is not pure, since there are
many algorithms that use a combination of the two forms of
representing a database. For example, while Eclat [22] uses
a purely vertical representation and SaM (Split and Merge)
[3] uses a purely horizontal representation, FP-growth [11]
combines in its FP-tree structure a (compressed) horizontal
representation (prefix tree of transactions) and a vertical
representation (links between tree branches).
The basic recursive processing scheme described above can
easily be improved with so-called perfect extension pruning,
which relies on the following simple idea: given an item set I, an item i ∉ I is called a perfect extension of I iff I and I ∪ {i} have the same support, that is, if i is contained in all transactions containing I. Perfect extensions have the following properties: (1) if the item i is a perfect extension of an item set I, then it is also a perfect extension of any item set J ⊇ I as long as i ∉ J, and (2) if I is a frequent item set and K is the set of all perfect extensions of I, then all sets I ∪ J with J ∈ 2^K (where 2^K denotes the power set of K) are also frequent and have the same support as I.
These properties can be exploited by collecting in the recur-
sion not only prefix items, but also, in a third element of a
subproblem description, perfect extension items. Once iden-
tified, perfect extension items are no longer processed in the
recursion, but are only used to generate all supersets of the
prefix that have the same support. Depending on the data
set, this can lead to a considerable speed-up. It should be
clear that this optimization can, in principle, be applied in
all frequent item set mining algorithms that work according
to the described divide-and-conquer scheme.
2.3 Types of Frequent Item Sets
One of the first observations one makes when mining fre-
quent item sets is that the output is often huge—it may
even exceed the size of the transaction database to mine.
As a consequence, there are several approaches that try to
reduce the output, if possible without any loss of informa-
tion. The most basic of these approaches is to restrict the
output to so-called closed or maximal frequent item sets.
A frequent item set is called closed if there does not exist a superset that has the same support, or formally:

  I ⊆ B is closed  ⇔  sT(I) ≥ smin ∧ ∀i ∈ B − I: sT(I ∪ {i}) < sT(I).

A frequent item set is called maximal if there does not exist any superset that is frequent, or formally:

  I ⊆ B is maximal  ⇔  sT(I) ≥ smin ∧ ∀i ∈ B − I: sT(I ∪ {i}) < smin.
Restricting the output of a frequent item set mining algo-
rithm to only the closed or even only the maximal frequent
item sets can sometimes reduce it by orders of magnitude.
However, little information is lost: From the set of all max-
imal frequent item sets the set of all frequent item sets can
be reconstructed, since any frequent item set has at least
one maximal superset. Therefore the union of all subsets of
maximal item sets is the set of all frequent item sets.
Closed frequent item sets even preserve knowledge of the
support values. The reason is that each frequent item set
has a uniquely determined closed superset with the same
support. Hence the support of a frequent item set that is
not closed can be computed as the maximum of the support
values of all closed frequent item sets that contain it (the
maximum has to be used, because no superset can have a
greater support—the so-called apriori property). As a con-
sequence, closed frequent item sets are the most popular
form of compressing the result of frequent item set mining.
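This reconstruction of support values can be illustrated with a small C sketch (bit-mask encoding as before; the function name is ours, purely illustrative): given the closed frequent item sets and their supports, the support of any frequent item set is the maximum support over its closed supersets.

```c
#include <stddef.h>

/* Support of an arbitrary frequent item set I, recovered from the
   closed item sets alone: maximum support of any closed superset. */
static size_t supp_from_closed(const unsigned *closed, const size_t *supp,
                               size_t n, unsigned I)
{
    size_t k, best = 0;
    for (k = 0; k < n; k++)
        if ((closed[k] & I) == I && supp[k] > best)
            best = supp[k];     /* closed[k] is a superset of I */
    return best;
}
```

For the database {a,b,c}, {a,b}, {a,c}, {a} the closed item sets are {a,b,c}:1, {a,b}:2, {a,c}:2, {a}:4; the non-closed set {b} then correctly gets support 2 (from its closed superset {a,b}).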
Note that closed item sets are closely related to perfect ex-
tensions: an item set is closed if it does not have a perfect
extension. However, using perfect extension pruning does
not mean that the output is restricted to closed item sets,
because in the search not all possible extension items are
considered (conditional databases do not contain all items).
2.4 Characterizing Closed Item Sets
In the preceding section we characterized closed frequent
item sets based on their support and the support of their
supersets. With the help of the notion of the cover KT(I) of an item set I (as defined above) we can define

  I ⊆ B is closed  ⇔  |KT(I)| ≥ smin ∧ I = ∩_{k∈KT(I)} tk.
That is, an item set is closed if it is equal to the intersection
of all transactions that contain it. This definition is obvi-
ously equivalent to the one given in Section 2.3: if an item
set is a proper subset of the intersection of the transactions
it is contained in, there exists a superset (especially the in-
tersection of the transactions itself) that has the same cover
and thus the same support. If, however, an item set is equal
to the intersection of the transactions containing it, adding
any item will remove at least one transaction from its cover
and will thus reduce the item set support.
This characterization allows us to find closed item sets by forming, for a minimum support smin, all intersections of k transactions, k ∈ {smin, ..., n}, and removing duplicates. Although implementing this procedure directly is prohibitively costly, it provides the basis for intersection approaches.
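The characterization can be checked directly in code. The following C sketch (bit-mask encoding as before, illustrative only) computes the intersection of all transactions containing a given item set; the item set is closed iff the result equals the item set itself (provided its cover is non-empty).

```c
#include <stddef.h>

/* Intersection of all transactions that contain I; with a non-empty
   cover, I is closed iff closure(db, n, I) == I. ~0u stands for the
   full item base B (intersection over an empty cover). */
static unsigned closure(const unsigned *db, size_t n, unsigned I)
{
    unsigned c = ~0u;
    size_t k;
    for (k = 0; k < n; k++)
        if ((db[k] & I) == I)   /* t_k contains I */
            c &= db[k];         /* intersect with t_k */
    return c;
}
```

For example, in the database {a,b,c}, {a,b}, {a,c}, {a}, the item set {b} is contained in {a,b,c} and {a,b}, whose intersection {a,b} is a proper superset, so {b} is not closed, while {a,b} is.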
2.5 Galois Connection
Intersection approaches to find closed frequent item sets can
nicely be justified in a formal way by analyzing the Galois
connection between the set of all possible item sets 2^B (the power set of the item base B) and the set of all possible sets of transaction indices 2^{1,...,n} (where n is the number of transactions), as it was emphasized and explored in detail in [17]. Consider the two functions

  f: 2^B → 2^{1,...,n},  I ↦ KT(I),  and
  g: 2^{1,...,n} → 2^B,  K ↦ ∩_{k∈K} tk.

It is easy to show [17] that this function pair is a Galois connection of 2^B and 2^{1,...,n}. As a consequence, the compound function f◦g: 2^B → 2^B, I ↦ ∩_{k∈KT(I)} tk, is a closure operator. The closed frequent item sets are then simply the frequent item sets that are closed w.r.t. this closure operator.

More interesting, however, is the fact that, with f and g being a Galois connection, the other compound function, namely g◦f: 2^{1,...,n} → 2^{1,...,n}, K ↦ KT(∩_{k∈K} tk) = {j | ∩_{k∈K} tk ⊆ tj}, is also a closure operator. Thus, if both sets (that is, 2^B and 2^{1,...,n}) are restricted to the closed elements w.r.t. the two closure operators, we arrive at a bijective mapping. That is, if X = {I ⊆ B | (f◦g)(I) = I} and Y = {K ⊆ {1,...,n} | (g◦f)(K) = K}, then f|X is a bijection from X to Y. As a consequence, we can find the closed frequent item sets by finding the closed sets of transaction indices that have at least the size smin.
3. INTERSECTING TRANSACTIONS
We discuss two ways of implementing the intersection ap-
proach: enumerating transaction sets as it is done in the
Carpenter algorithm [15] and a cumulative scheme [14]. For
both schemes we present improved implementations.
3.1 Enumerating Transaction Sets
The Carpenter algorithm [15] implements the intersection
approach by enumerating sets of transactions (or, equiva-
lently, sets of transaction indices) and intersecting them.
This is done with basically the same divide-and-conquer
scheme as for the item set enumeration approaches, only
that it is applied to transactions (that is, items and trans-
actions exchange their meaning, cf. [17]). Technically, the
task to enumerate all transaction index sets is split into two
sub-tasks: (1) enumerate all transaction index sets that con-
tain the index 1 and (2) enumerate all transaction index sets
that do not contain the index 1. These sub-tasks are then
further divided w.r.t. the transaction index 2: enumerate all
transaction index sets containing (1.1) both indices 1 and 2,
(1.2) index 1, but not index 2, (2.1) index 2, but not index 1,
(2.2) neither index 1 nor index 2, and so on.
Formally, all subproblems occurring in the recursion can be described by triples S = (I, K, ℓ). Here K ⊆ {1,...,n} is a set of transaction indices, I = ∩_{k∈K} tk, that is, I is the item set that results from intersecting the transactions referred to by K, and ℓ is a transaction index, namely the index of the next transaction to consider. The initial problem, with which the recursion is started, is S = (B, ∅, 1), where B is the item base and no transactions have been intersected yet.

A subproblem S0 = (I0, K0, ℓ0) is processed as follows: form the intersection I1 = I0 ∩ tℓ0. If I1 = ∅, do nothing (return from the recursion). If |K0| + 1 ≥ smin, and there is no transaction tj with j ∈ {1,...,n} − (K0 ∪ {ℓ0}) such that I1 ⊆ tj, report I1 with support sT(I1) = |K0| + 1. If ℓ0 < n, then form the subproblems S1 = (I1, K1, ℓ1) with K1 = K0 ∪ {ℓ0} and ℓ1 = ℓ0 + 1, and S2 = (I2, K2, ℓ2) with I2 = I0, K2 = K0 and ℓ2 = ℓ0 + 1, and process them recursively.
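For illustration, this recursion can be sketched directly in C. This is emphatically not the Carpenter implementation described below: transactions are bit masks, the set K is a bit mask of used transaction indices (so n ≤ 32 here), the repository is replaced by a direct scan for a containing transaction, and all optimizations are omitted. The function returns the number of closed frequent item sets.

```c
#include <stddef.h>

/* Carpenter-style recursion: I is the current intersection, cnt = |K0|,
   l the next transaction index (0-based), used the bit mask of K0. */
static size_t carp(const unsigned *db, size_t n, size_t smin,
                   unsigned I, size_t cnt, size_t l, unsigned used)
{
    size_t out = 0;
    if (l >= n) return 0;            /* no more transactions */
    unsigned I1 = I & db[l];         /* intersect with t_l */
    if (I1) {                        /* if intersection is non-empty */
        if (cnt + 1 >= smin) {       /* if minimum support reached */
            size_t j; int dup = 0;   /* check for a transaction outside */
            for (j = 0; j < n; j++)  /* K0 + {l} that contains I1 */
                if (!((used >> j) & 1u) && j != l
                &&  (db[j] & I1) == I1) { dup = 1; break; }
            if (!dup) out++;         /* report I1 (count it) */
        }                            /* include t_l: */
        out += carp(db, n, smin, I1, cnt + 1, l + 1, used | (1u << l));
    }                                /* exclude t_l: */
    out += carp(db, n, smin, I, cnt, l + 1, used);
    return out;
}
```

Starting from I = B (all bits set), cnt = 0, l = 0 and used = 0, the database {a,b,c}, {a,b}, {a,c}, {a} with smin = 2 yields the three closed frequent item sets {a,b}, {a,c} and {a}, each reported exactly once (namely from the transaction set that is its full cover).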
3.1.1 List-based Implementation
From an implementation point of view the core issues of
the Carpenter algorithm consist in quickly finding the in-
tersection of the current item set with the next transaction,
and to be able to test quickly whether there is a transaction
(other than those that have been intersected) that contains
the current item set. For the former, the implementation
described in [15] uses a vertical transaction database repre-
sentation. That is, for each item an array is created that
lists the indices of those transactions that contain the item.
A collection of these arrays is used to represent the current item set I. The intersection is carried out by collecting all arrays that contain the next transaction index ℓ0. In a programming language that supports pointer arithmetic (like C) this can be done very efficiently by keeping track of the next unprocessed transaction index for each item.
The check whether there exists a transaction that has not been used to form the intersection and which contains the current item set is more complicated. This check is split into the checks (1) whether there exists a transaction tj with j > ℓ0 such that I1 ⊆ tj, and (2) whether there exists a transaction tj with j ∈ {1,...,ℓ0 − 1} − K0 such that I1 ⊆ tj. The first check is actually easy, since the transactions tj with j > ℓ0 are considered in the recursive processing, which can return whether such a transaction exists. The problematic part is (2), which is solved in the implementation described in [15] by maintaining a repository of already found closed frequent item sets and always solving subproblem S1 before S2 (including a transaction before excluding it). In this way one can simply check (2) by determining whether I1 is already in the repository. If there is a transaction tj with j ∈ {1,...,ℓ0 − 1} − K0 and I1 ⊆ tj, then I1 is already in the repository, since K0 ∪ {j} is considered before K0.
In order to make the lookup in the repository efficient, it is
laid out as a prefix tree, with its top level structured as a flat
array of all items and its deeper levels consisting of nodes
linking to their first child and their right sibling. Especially
the flat structure of the top level is important, because the
data sets Carpenter is designed for have many items. Since
the top level is likely to be almost fully populated, a flat
array avoids traversing a long sibling list. Deeper levels,
however, can be expected to be sparser, and consequently
we did not observe significant improvements from laying out
deeper levels as flat arrays or search trees for sibling nodes.
An important optimization of the above scheme is the analog of perfect extension pruning in the item set enumeration scheme [15]: if I1 = I0, it is not necessary to solve the second subproblem, because it cannot produce any output: any item set considered in the solution of S2 can be intersected with tℓ0 without changing the item set, so the test for a containing transaction must fail. In our own implementation, we also added the following optimization: since we know from the arrays of transaction indices per item how many of the transactions tj with j ≥ ℓ0 contain a given item, we can immediately exclude any item i from the intersection for which |K0| + |{j | ℓ0 ≤ j ≤ n ∧ i ∈ tj}| < smin. The reason is that in this case no item set that will be constructed in the recursion and which contains i can reach the minimum support. This optimization leads to a considerable speed-up.
3.1.2 Table-based Implementation
An implementation based on lists of transaction indices per item has the disadvantage that collecting the reduced lists for an intersection and updating the number of remaining transactions containing an item not only requires time, but also memory. In order to reduce these costs, we designed a table- or matrix-based implementation, which represents the data set by an n × |B| matrix M as follows:

  mki = 0                               if item i ∉ tk,
  mki = |{j | k ≤ j ≤ n ∧ i ∈ tj}|      otherwise.
  transaction database        matrix representation
  t1: a b c                        a  b  c  d  e
  t2: a d e                   t1:  4  5  5  0  0
  t3: b c d                   t2:  3  0  0  6  3
  t4: a b c d                 t3:  0  4  4  5  0
  t5: b c                     t4:  2  3  3  4  0
  t6: a b d                   t5:  0  2  2  0  0
  t7: d e                     t6:  1  1  0  3  0
  t8: c d e                   t7:  0  0  0  2  2
                              t8:  0  0  1  1  1

Table 1: Matrix representation of a transaction database for the improved Carpenter variant.
As an example, Table 1 shows a transaction database with
5 items and 8 transactions and its matrix representation.
Although such a matrix representation needs more memory
than the lists of transaction indices (since the zero entries
in the matrix need not be represented in the list representa-
tion), it saves memory in the recursion, since only the items
in the current intersection need to be represented and no ref-
erences to the current positions in the transaction index lists
are needed. In addition, the intersection can be formed by indexing the matrix with ℓ0 and the item identifiers of the current set, which is faster than traversing the lists and checking whether they contain ℓ0 as the next transaction index.
To check for the existence of transactions that have not been
used for the intersections that resulted in the current item
set, but contain it, the same repository technique is used
that was outlined in the preceding section.
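The matrix of Table 1 can be computed in a single backward pass over the transactions, since each entry is a suffix count per item. The following C sketch illustrates this (bit-mask transactions and a flat row-major int array, both illustrative simplifications; the function name is ours):

```c
#include <stddef.h>

/* Fill m (n rows, n_items columns, row-major) with
   m[k][i] = 0 if item i is not in t_k, and otherwise the number of
   transactions t_j, j in {k,...,n} (1-based), that contain item i. */
static void build_matrix(const unsigned *db, size_t n, int n_items, int *m)
{
    int i;
    for (i = 0; i < n_items; i++) {
        int cnt = 0;              /* occurrences of item i in t_k..t_n */
        size_t k = n;
        while (k-- > 0) {         /* scan transactions back to front */
            if ((db[k] >> i) & 1u) cnt++;
            m[k * (size_t)n_items + i] = ((db[k] >> i) & 1u) ? cnt : 0;
        }
    }
}
```

Running this on the eight transactions of Table 1 reproduces the matrix shown there, e.g. the first row 4 5 5 0 0 for t1 = {a, b, c}.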
3.2 Cumulative Intersections
An alternative to the transaction set enumeration approach
is a scheme that maintains a repository of all closed item
sets, which is updated by intersecting it with the next trans-
action. To justify this approach formally, we consider the set
of all closed frequent item sets for smin = 1, that is, the set

  C(T) = { I ⊆ B | ∃S ⊆ T: S ≠ ∅ ∧ I = ∩_{t∈S} t }.

As was already noted in [14] and exploited for the implementation accompanying that paper, the set C(T) satisfies the following simple recursive relation:

  C(∅) = ∅,
  C(T ∪ {t}) = C(T) ∪ {t} ∪ { I | ∃s ∈ C(T): I = s ∩ t }.    (1)
As a consequence, we can start the procedure with an empty
set of closed item sets and then process the transactions one
by one, each time updating the set of closed item sets by
adding the new transaction itself and the additional closed
item sets that result from intersecting it with the already
known closed item sets. In addition, the support of already
known closed item sets may have to be updated.
Furthermore, we have to consider that in practice we will not
work with a minimum support smin = 1 as it underlies C(T).
Unfortunately, removing intersections early because they do not reach the user-specified minimum support is difficult:
typedef struct node {            // a prefix tree node
  int          step;             // most recent update step
  int          item;             // assoc. item (last in set)
  int          supp;             // support of item set
  struct node *sibling;          // successor in sibling list
  struct node *children;         // list of child nodes
} NODE;

Figure 1: Structure of the prefix tree nodes.
in principle, enough of the transactions to be processed in
the future could contain the item set under consideration
in order to make it frequent. This is a fundamental prob-
lem of the intersection approach. We improve on it by an
improved version of the method that was already outlined
in [14], which is analogous to the item elimination scheme
described in Section 3.1.1: in an initial pass through the
transaction database we determine the frequency of the in-
dividual items. (This is done by virtually all frequent item
set mining algorithms anyway in order to remove infrequent
items and to determine a good order in which to process the
items.) The obtained counters are updated with each processed transaction, so that they always represent the number of occurrences of each item in the unprocessed transactions (cf. also the matrix representation shown in Table 1).
Based on these counters, we can apply the following pruning scheme: suppose that after having processed k of a total of n transactions the support of a closed item set I is sTk(I) = x, and let y be the minimum of the counter values for the items contained in I. If x + y < smin, then I can be discarded, because it cannot reach the minimum support. The reason is that it cannot occur more than y times in the remaining transactions, because one of its items does not occur more often.
We have to be a bit careful, though, because I may be needed in order to construct certain subsets of it, namely those that result from intersections of it with new transactions. These subsets may still be frequent, even though I is not. As a consequence, we do not simply remove the item set, but selectively remove items from it, which do not occur frequently enough in the remaining transactions (in exactly the same way in which we eliminated items in our improved implementation of the Carpenter algorithm). Although in this way we may construct non-closed item sets, we do not create problems for the final output: either the reduced item set also occurs as the intersection of enough transactions and thus is closed, or it will not reach the minimum support threshold.
3.3 Prefix Tree Implementation
The core problem of implementing the scheme outlined in
the preceding section is to find a data structure for storing
the closed item sets that allows us to quickly compute the
intersections of these sets with a new transaction and to
merge the result with them. To achieve this, we rely on a
prefix tree, each node of which represents an item set and
is structured as shown in Figure 1: the field children holds
the head of the list of child nodes, which are linked by the
field sibling, which thus implements a sibling list. The item
set that is represented by a node consists of the item stored
in it plus the items in the nodes on the path from the root.
void isect (NODE *node, NODE **ins)
{                                // intersect with transaction
  int  i;                        // buffer for current item
  NODE *d;                       // to allocate new nodes
  while (node) {                 // traverse the sibling list
    i = node->item;              // get the current item
    if (trans[i]) {              // if item is in intersection
      while ((d = *ins) && (d->item > i))
        ins = &d->sibling;       // find the insertion position
      if (d                      // if an intersection node with
      &&  (d->item == i)) {      // the item already exists
        if (d->step >= step) d->supp--;
        if (d->supp < node->supp)
            d->supp = node->supp;
        d->supp++;               // update intersection support
        d->step = step; }        // and set current update step
      else {                     // if there is no corresp. node
        d = malloc(sizeof(NODE));
        d->step = step;          // create a new node and
        d->item = i;             // set item and support
        d->supp = node->supp+1;
        d->sibling = *ins; *ins = d;
        d->children = NULL;
      }                          // insert node into the tree
      if (i <= imin) return;     // if beyond last item, abort
      isect(node->children, &d->children); }
    else {                       // if item is not in intersection
      if (i <= imin) return;     // if beyond last item, abort
      isect(node->children, ins);
    }                            // intersect with subtree
    node = node->sibling;        // go to the next sibling
  }                              // end of while (node)
}  // isect()
Figure 2: Code of the intersection function.
The support of this item set is stored in the field supp. In
order to avoid duplicate representations of the same item set,
all children of a node must refer to items with lower codes
than the item referred to by the node itself. In addition, we
require that the items referred to by the nodes in a sibling
list are in descending order w.r.t. their codes.
Finally, the field step is used in the intersection process to
indicate whether the support of a node has already been
updated from an intersection or not. This is important,
because multiple different closed item sets can, when inter-
sected with a new transaction, give rise to the same item
set. The step counter (which can be seen as an incremen-
tal update flag, thus eliminating the need to clear the flag)
ensures that the correct support value is computed.
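Based on this description, the node structure of Figure 1 might be declared as follows (a sketch; the field types and the helper function new_node are assumptions for illustration, and the actual source code may lay this out differently):

```c
#include <stdlib.h>

typedef struct node {             /* a prefix tree node (cf. Figure 1) */
    int          item;            /* item stored in this node */
    int          supp;            /* support of the represented item set */
    int          step;            /* last update step (incremental flag) */
    struct node *sibling;         /* successor in the sibling list */
    struct node *children;        /* head of the list of child nodes */
} NODE;

/* Allocate and initialize a new node for the given item and support
   (a hypothetical helper, not part of the paper's code). */
NODE *new_node (int item, int supp)
{
    NODE *n = malloc(sizeof(NODE));
    n->item = item; n->supp = supp; n->step = 0;
    n->sibling = n->children = NULL;
    return n;
}
```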
The algorithm works on the prefix tree as follows: at the
beginning an empty tree (consisting only of a root node,
which does not store any item and thus represents the empty
set) is created. Then the transactions are processed one by
one. Each new transaction is first simply added to the prefix
tree (cf. the recursive relation (1)). Any new nodes created
in this step are initialized with a support of zero.
In the next step we compute the intersections of the new transaction with all sets represented by the current prefix
tree. This is achieved with a recursive procedure that is
basically a selective depth-first traversal of the prefix tree
and matches the items in the tree nodes with the items of the
transaction. For this purpose the transaction is represented
by a global flag array trans with one element per item,
which is set if the item is contained in the transaction and
cleared otherwise. In addition, the item with the lowest
code in the current transaction is stored in a global variable
imin, so that the recursion can avoid branches that cannot
produce any intersection results. Finally, there is a global
variable step indicating the current update step, which is
equal to the index of the current transaction.
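The global state described here could be set up as follows before each intersection pass (a sketch; the fixed item count N_ITEMS and the helper set_trans are hypothetical, and the actual source code organizes this differently):

```c
#define N_ITEMS 100   /* hypothetical total number of items */

int trans[N_ITEMS];   /* flag array: trans[i] != 0 iff item i is in
                         the current transaction */
int imin;             /* lowest item code in the current transaction */
int step;             /* current update step = transaction index */

/* Prepare the global state for processing one transaction:
   'items' holds the n item codes of the transaction, 'tid' its index. */
void set_trans (const int *items, int n, int tid)
{
    int i;
    for (i = 0; i < N_ITEMS; i++) trans[i] = 0;
    imin = N_ITEMS;               /* clear flags, reset minimum item */
    for (i = 0; i < n; i++) {
        trans[items[i]] = 1;      /* set flag for contained item */
        if (items[i] < imin) imin = items[i];
    }
    step = tid;                   /* record current update step */
}
```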
The code for the intersection step is shown in Figure 2. (This
is a considerably simplified version that illustrates the core
steps, but does not contain all optimizations we added. For
a detailed version refer to the source code, see URL below.)
Each call of the function isect traverses a sibling list, the
start of which is passed as the parameter node.
In principle, the recursive procedure could generate all in-
tersections and store them in a separate prefix tree, which is
afterwards merged with the original tree. (Actually this was
our first implementation.) However, the two steps of form-
ing intersections and merging the result can be combined.
This is achieved with the parameter ins, which points to the
location in the prefix tree that represents the item set that
resulted from intersecting the already processed part of the
current transaction with the item set represented by node.
Hence it indicates the location where new nodes may have
to be inserted in order to represent new closed item sets that
result from intersections of the subtree rooted at node with
(the unprocessed part of) the current transaction.
In more detail, the procedure works as follows: whenever the
item in the current sibling equals an item in the transaction
(that is, whenever trans[i] is true) and thus the item is in
the intersection, it is checked whether the sibling list starting
at *ins contains a node with this item, because this node
has to represent the extended intersection. If such a node
exists, its support is updated. For this update the step field
and variable are vital, because they allow us to determine
whether the current transaction was already counted for the
support in the node or not. If the value of the step field in
the node equals the current step, the node has already been
updated and therefore the transaction must be discounted
again before taking the maximum. The maximum is taken,
because we have to determine the support from the largest
set of transactions containing the item set represented by the
node. If a node with the intersection item does not exist,
a new node is allocated and inserted into the tree at the
location indicated by ins.
In both cases (with the current item in the transaction or
not, that is, with trans[i] true or false) the subtree is pro-
cessed recursively, unless the item of the current node is not
larger than the minimum item in the current transaction.
In this case no later siblings can produce any intersection,
because they and their descendants refer to items with lower
codes. (Recall that items are in descending order in a sibling
list and from parent to child.) The only difference between
the recursive calls is that in the case where the current item
is in the intersection, the insertion position is advanced to
the children of the current node.
As an illustration of how the prefix tree is built, Figure 3
shows how three simple transactions are processed. In all
of these trees arrows pointing downwards are child pointers
Figure 3: An example of building the prefix tree (diagrams of the tree after steps 0, 1, 2, 3.1, 3.2, and 3.3).
void report (NODE *node, int min)
{                                 /* recursively report item sets */
  NODE *c;                        /* to traverse the child nodes */
  int   max = -1;                 /* maximum support of a child */
  for (c = node->children; c; c = c->sibling) {
    if (c->supp < min)            /* traverse the child nodes, */
      continue;                   /* but skip infrequent item sets */
    if (c->supp > max)            /* find maximum child support */
      max = c->supp;              /* for the closedness check */
    report(c, min);               /* recursively report item sets */
  }
  if (node->supp > max)           /* if no child has same support */
    ⟨report the item set represented by the path to node⟩
}  /* report() */
Figure 4: Code of the recursive reporting.
(field children in the node structure) and arrows pointing
to the right are sibling pointers (field sibling in the node
structure). Initially, the prefix tree is empty (step 0, at the
top left). In step 1 the transaction {e, c, a} is added to the tree. Since it was empty before, no new intersections have to be taken care of, so no further processing is carried out. In step 2 the transaction {e, d, b} is added. This is done in two steps: first the two nodes on the bottom left are added to represent the transaction. Then, in the recursion, it is discovered that the new transaction overlaps with the one that was already present on the item e. As a consequence, the support value of the node that represents the item set {e} is incremented to 2. In the third step the transaction {d, c, b, a} is processed. Since things are more complicated now, we split the processing into three steps. In step 3.1 the new transaction is added to the tree, with new nodes initialized to support 0. Steps 3.2 and 3.3 show how the prefix tree is extended by the two intersections {d, b} = {e, d, b} ∩ {d, c, b, a} and {c, a} = {e, c, a} ∩ {d, c, b, a}. Note that the increment of the counters in the nodes for the full transaction {d, c, b, a} is not shown as a separate step.
Finally, after all transactions have been processed, the closed
frequent item sets have to be reported. This is done with
the function shown in Figure 4. Note that not every node
of the prefix tree generates output. In the first place, item
sets that do not reach the user-specified minimum support
(parameter min) must be discarded. In addition, however,
we must also check whether a child has the same support.
If this is the case, the item set represented by a node is not
closed and must not be reported. In order to take care of
this, the function first traverses the children and determines their maximum support. Only if the node's own support exceeds this maximum (and thus we know that the item set represented by it is closed) is it reported.
3.4 Item and Transaction Orders
It is well known from the enumeration approaches to fre-
quent item set mining that the order in which items are
processed can have a huge impact on the processing time.
Usually it is best to process them in the order of increasing
frequency in the transaction database. In the intersection
approach similar considerations apply, because how the items are coded and in which order the transactions are processed can have a huge impact on the processing time.1
By experimenting with our implementation, we found the
following: it is usually most efficient to assign the item codes
w.r.t. ascending frequency in the database (the rarest item
has code 0, the next code 1 etc.) and to process the trans-
actions in the order of increasing size (number of contained
items). The order of transactions of the same size seems to
have very little influence. We use a lexicographical order
of the transactions based on a descending order of items in
each transaction. An intuitive explanation why this scheme
is fastest is that it manages to have few and small closed
item sets and thus small prefix trees at the beginning, so
that many transactions can be processed fast. With the re-
verse processing order for the transactions the prefix tree
becomes fairly large already after few transactions, which
slows down the processing for all later transactions.
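The transaction order described above can be sketched with a standard qsort comparison (a sketch under assumptions: the TRANS structure is hypothetical, and items inside each transaction are assumed to be stored in descending order of their codes, so the tie-break compares the descending item lists directly):

```c
#include <stdlib.h>

typedef struct {        /* a transaction */
    int  size;          /* number of contained items */
    int *items;         /* item codes, in descending order */
} TRANS;

/* Compare two transactions: primarily by increasing size, with ties
   broken lexicographically on the descending item lists. */
static int cmp_trans (const void *a, const void *b)
{
    const TRANS *x = a, *y = b;
    int i;
    if (x->size != y->size)       /* increasing number of items */
        return x->size - y->size;
    for (i = 0; i < x->size; i++) /* lexicographic tie-break */
        if (x->items[i] != y->items[i])
            return x->items[i] - y->items[i];
    return 0;                     /* transactions are identical */
}

/* Sort a transaction database into the processing order. */
void sort_trans (TRANS *db, int n)
{ qsort(db, n, sizeof(TRANS), cmp_trans); }
```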
4. GENE EXPRESSION ANALYSIS
DNA microarray technology is a powerful method for mon-
itoring the expression level of complete genomes in a single
experiment. In the last few years, this technique has been
widely used in several contexts such as tumor profiling, drug
discovery or temporal analysis of cell behavior [19]. Due to
the widespread use of this high-throughput technique in the
study of several biological systems, a large collection of gene
expression data sets is available to the scientific community,
some of which contain tens or hundreds of different experimental conditions and constitute reference databases or
“compendiums” of gene expression profiles. A key task to
derive biological knowledge from gene expression data is to
detect the presence of sets of genes that share similar ex-
pression patterns and common biological properties, such as
function or regulatory mechanisms. Frequent item set min-
ing has proved to be very efficient for the integrative analysis
of such data [5, 4]. This methodology is able to integrate
different types of data in the same analytic framework to un-
cover significant associations among gene expression profiles
1Note that the order of the items is independent of the order
of the codes used to represent them, which we referred to in
the preceding section. We may assign the codes so that the
items are numbered ascendingly or descendingly w.r.t. their
frequency in the transaction database or in any other way.
and also take into account multiple biological properties of
the genes based on co-occurrence patterns.
In this study we have used, among others, the data provided
in [12] which contains expression profiles for 6316 transcripts
corresponding to 300 diverse mutations and chemical treat-
ments in Saccharomyces cerevisiae (baker’s yeast). This organism is one of the most intensively studied eukaryotic
model organisms in molecular and cell biology for several
reasons, including the fact that it shares the complex in-
ternal cell structure of plants and animals. We believe this
is a good representative example of the types of data and
problems that are common in any molecular biology lab.
The original data is composed of an expression matrix where
genes correspond to rows and experimental conditions to
columns. By creating a transaction database from this for-
mat, genes will be considered as transactions while experi-
mental conditions will be considered as items. This allows
the extraction of relationships among experimental condi-
tions. In this scenario we have many more transactions than
items. However, the matrix may also be transposed to con-
sider genes as items to extract relationships between them.
Contrary to the previous case, the number of items in this
scenario is much larger than the number of transactions.
Due to the dual properties of the analysis, this use case is
an excellent test base for the methodology we propose.
To construct the transaction database from this data set
we converted the expression matrix into a Boolean matrix.
For this purpose, and following the criteria used in other
studies (see, for example, [5]), we have used two expression
thresholds: genes with log expression values greater than 0.2 were considered over-expressed and genes with log expression values lower than −0.2 were considered under-expressed. Values between these two thresholds were treated as neither expressed nor inhibited.
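As an illustration, a row of log expression values could be turned into a transaction as follows (a sketch with over-/under-expression thresholds of 0.2 and −0.2; the item coding, even item codes for over-expression and odd codes for under-expression, is a hypothetical choice, not taken from the paper):

```c
/* Convert one row of log expression values (one gene) into items.
   Condition j over-expressed  -> item 2*j,
   condition j under-expressed -> item 2*j+1 (hypothetical coding).
   Returns the number of items written to 'items'. */
int discretize (const double *logexp, int ncond, int *items)
{
    int j, n = 0;
    for (j = 0; j < ncond; j++) {
        if      (logexp[j] >  0.2) items[n++] = 2*j;    /* over-expressed  */
        else if (logexp[j] < -0.2) items[n++] = 2*j+1;  /* under-expressed */
        /* values in between yield no item */
    }
    return n;                     /* size of the resulting transaction */
}
```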
As a second data set we used the publicly available NCBI60
cancer cell line microarray expression database as reported
in [18], which was pre-processed in a similar way. Further-
more, we used the test part of the Thrombin data set that
was made publicly available for the KDD Cup 2001. Even
though this is not a gene expression data set (each record
describes a molecule that binds or does not bind to thrombin
by 139,351 binary features), it exhibits similar characteris-
tics. In order to be able to run detailed experiments, we se-
lected the first 64 records as transactions. Finally, we used
the transposed version of the well-known BMS-Webview-1
data set, which is a common benchmark for frequent item
set mining. It was derived from click streams of the online
shop of a leg-care company that no longer exists and has
been used in the KDD cup 2000 [13]. We used the trans-
posed form to obtain a data set with many items and few
transactions, as this is the type of data set the algorithms
presented here are designed for.
5. EXPERIMENTAL RESULTS
We compared our implementation of the intersection ap-
proach to frequent item set mining, which we call IsTa (for
Intersecting Transactions), to our implementations of the
two variants of the Carpenter algorithm as well as to FP-
growth (or rather FP-close, since we used the version that
finds closed frequent item sets) in the implementation of
[9, 10] (version of 2004), which won the FIMI’03 best im-
plementation award, and to LCM3 [20, 21], which won (in
version 2) the FIMI’04 best implementation award.2
We also tried the implementation of the intersection ap-
proach described in [14], but do not report the results here,
because the execution times are vastly larger than those of
our implementation (often exceeding a factor of 100), which
is due to the fact that this implementation does not employ
a prefix tree, but a simple flat structure. Furthermore, we
tried to compare to the original implementation of the Car-
penter algorithm by its designers (see [15]), as it is made
available as part of the GEMini (Gene Expression Mining)
software package3 for Weka. However, we encountered several technical problems in doing so. In the first place, GEMini is
available only as compiled code for Windows, as it ships
partially as dynamic link libraries (DLLs). In addition, it
works only with a very specific version of Weka (namely
3.5.6), while the Carpenter module did not show up in any
other Weka version we tried. Finally, the Carpenter program
(including Weka) would crash extremely often. For exam-
ple, it was impossible to mine the baker’s yeast data set for
any support value less than 23 (regardless of the memory as-
signed to the Java virtual machine or the Windows version
used). However, for those support values we could try with-
out crashing, the GEMini implementation already needed an
order of magnitude longer than our implementation. Due to
these results and the comparative results reported in [6],
we believe that we can also confidently claim that our im-
plementation of the Carpenter algorithm is faster than the
RERII and REPT algorithms developed in [6]. We could
not compare to these algorithms directly, because we could
not get hold of the implementations.
All of our experiments were carried out on a desktop com-
puter with an Intel Core 2 Quad Q9650 processor (3GHz)
and 8 GB memory running Ubuntu Linux 10.04, kernel ver-
sion 2.6.32-24-generic (64 bit). The programs are written
either in C or in C++ and were compiled with gcc or g++,
respectively, version 4.4.3.
The results are shown in Figures 5 to 8. On the yeast
data set (Figure 5) IsTa and Carpenter are slightly slower
than the enumeration approaches for minimum support val-
ues greater than about 20–24 (where, however, all execu-
tion times are far less than a second). For lower support
the algorithms heavily diverge in their time consumption:
whereas FP-growth and LCM3 exceed 1 minute for a mini-
mum support of 8, and grow even more heavily afterwards,
IsTa manages to keep the execution time around 5 seconds.
The reason for this is, of course, the fairly small number of
transactions (300), combined with a huge number of items
(close to 10000, at least at the lowest support values), which
clearly favors the intersection approach. It should be noted
that the enumeration approaches still outperform the im-
plementation of the intersection approach described in [14],
which did not finish in a reasonable time on this data set,
even for larger support (and thus we terminated the run).
2The source code of these implementations can be down-
loaded from http://fimi.cs.helsinki.fi/ and
http://research.nii.ac.jp/~uno/codes.htm
3http://nusdm.comp.nus.edu.sg/projects.htm
Figure 5: Results on the baker’s yeast data.
Figure 6: Results on the ncbi60 data (plot of log(time/seconds) over minimum support for IsTa, Carp. table, and Carp. lists).
Of the Carpenter variants, the table-based is somewhat bet-
ter than the list-based, but neither can compete with IsTa.
On the NCBI60 data set (Figure 6) the table-based variant of
Carpenter and IsTa perform basically on par until IsTa can
gain a small advantage at the lowest support value. The list-
based Carpenter version is clearly slower by a constant factor
(note the logarithmic scale on the vertical axis). There are
no results for FP-growth or LCM3 on this data set, because
both programs either crashed with segmentation faults or
entered an infinite loop (LCM3 for higher support values).
The Thrombin subset behaves very similarly to the NCBI60
data set, with the table-based variant of Carpenter and IsTa
performing basically on par until IsTa achieves a slight ad-
vantage at the lowest support value. The list-based variant is again slower by a constant factor. LCM3 and FP-growth are
competitive only down to a minimum support of 32 to 34,
where, however, all execution times are less than a second.
Finally, on the transposed BMS-webview-1 data set the be-
havior is similar to the yeast data set. The table-based vari-
ant of Carpenter is slightly better than the list-based, both
of which are clearly outperformed by IsTa. FP-growth and
LCM3 are competitive only down to a minimum support of
about 11, where all execution times are less than a second.
From the reported experimental results we can confidently
infer that the intersection approach is the method of choice
for data sets with few transactions and (very) many items,
Figure 7: Results on a subset of the thrombin data (plot of log(time/seconds) over minimum support for FP-close, LCM3, IsTa, Carp. table, and Carp. lists).
Figure 8: Results on the transposed webview data (plot of log(time/seconds) over minimum support for FP-close, LCM3, IsTa, Carp. table, and Carp. lists).
as they commonly occur in gene expression analysis, where
a large to huge number of genes (items) is tested in a mod-
erate number of conditions (transactions): both IsTa and
Carpenter clearly outperform the item set enumeration ap-
proaches, which were represented by particularly fast imple-
mentations. While they are on par on two data sets, IsTa
clearly outperforms Carpenter on the other two.
6. CONCLUSIONS
In this paper we presented an improved implementation of
the cumulative intersection approach for mining closed fre-
quent item sets, which was originally introduced by [14]. By
using a prefix tree structure to represent the already found
closed item sets, and by devising a recursive procedure that
quickly finds all intersections with a new transaction and
adds them to the prefix tree, we obtained an implementation
that is orders of magnitude faster than the implementation
of [14]. In addition, we presented an improved implementa-
tion of the standard list-based scheme of the Carpenter al-
gorithm, which significantly outperforms the version by the
original authors. Finally, we suggested a table-based vari-
ant of the Carpenter algorithm, which yields even shorter
execution times, even though it loses (on two data sets)
against IsTa, our implementation of the cumulative intersec-
tion approach. By comparing the algorithms to the fastest
representatives of the item set enumeration approaches, we
showed that on a particular type of data sets, which com-
monly occur in gene expression analysis, our implementa-
tions significantly outperform the fastest enumeration ap-
proaches, for lower support values even by huge factors.
Software
The source code of our implementations can be found at
http://www.borgelt.net/ista.html
http://www.borgelt.net/carpenter.html
Acknowledgments
The work presented here was partially supported by the Eu-
ropean Commission under the 7th Framework Program FP7-
ICT-2007-C FET-Open, contract no. BISON-211898.
7. REFERENCES
[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and
A. Verkamo. Fast discovery of association rules. In
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
R. Uthurusamy, editors, Advances in Knowledge
Discovery and Data Mining, pages 307–328. AAAI
Press / MIT Press, Cambridge, MA, USA, 1996.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining
association rules. In Proc. 20th Int. Conf. on Very
Large Databases (VLDB 1994, Santiago de Chile),
pages 487–499, San Mateo, CA, USA, 1994. Morgan
Kaufmann.
[3] C. Borgelt and X. Wang. SaM: A split and merge
algorithm for fuzzy frequent item set mining. In Proc.
13th Int. Fuzzy Systems Association World Congress
and 6th Conf. of the European Society for Fuzzy Logic
and Technology (IFSA/EUSFLAT’09, Lisbon,
Portugal), Lisbon, Portugal, 2009. IFSA/EUSFLAT
Organization Committee.
[4] P. Carmona-Sáez, M. Chagoyen, A. Rodríguez,
O. Trelles, J. M. Carazo, and A. Pascual-Montano.
Integrated analysis of gene expression by association
rules discovery. BMC Bioinformatics, 7:54, 2006.
[5] C. Creighton and S. Hanash. Mining gene expression
databases for association rules. Bioinformatics,
19:79–86, 2003.
[6] G. Cong, K.-L. Tan, A. Tung, and F. Pan. Mining
frequent closed patterns in microarray data. In Proc.
4th IEEE International Conference on Data Mining
(ICDM 2004, Brighton, UK), pages 363–366,
Piscataway, NJ, USA, 2004. IEEE Press.
[7] B. Goethals and M. Zaki. Advances in frequent
itemset mining implementations: Introduction to
FIMI’03. In Proc. Workshop Frequent Item Set Mining
Implementations (FIMI 2003, Melbourne, FL),
Aachen, Germany, 2003. CEUR Workshop
Proceedings 90.
[8] B. Goethals and M. Zaki, editors. Proc. Workshop
Frequent Item Set Mining Implementations (FIMI
2004, Brighton, UK), Aachen, Germany, 2004. CEUR
Workshop Proceedings 126.
[9] G. Grahne and J. Zhu. Efficiently using prefix-trees in
mining frequent itemsets. In Proc. Workshop Frequent
Item Set Mining Implementations (FIMI 2003,
Melbourne, FL), Aachen, Germany, 2003. CEUR
Workshop Proceedings 90.
[10] G. Grahne and J. Zhu. Reducing the main memory
consumptions of FPmax* and FPclose. In Proc.
Workshop Frequent Item Set Mining Implementations
(FIMI 2004, Brighton, UK), Aachen, Germany, 2004.
CEUR Workshop Proceedings 126.
[11] J. Han, H. Pei, and Y. Yin. Mining frequent patterns
without candidate generation. In Proc. Conf. on the
Management of Data (SIGMOD’00, Dallas, TX),
pages 1–12, New York, NY, USA, 2000. ACM Press.
[12] T. Hughes, M. Marton, A. Jones, C. Roberts,
R. Stoughton, C. Armour, H. Bennett, E. Coffey,
H. Dai, Y. He, M. Kidd, A. King, M. Meyer, D. Slade,
P. Lum, S. Stepaniants, D. Shoemaker, D. Gachotte,
K. Chakraburtty, J. Simon, M. Bard, and S. Friend.
Functional discovery via a compendium of expression
profiles. Cell, 102:109–126, 2000.
[13] R. Kohavi, C. Bradley, B. Frasca, L. Mason, and
Z. Zheng. KDD-Cup 2000 organizers’ report: Peeling the onion. SIGKDD Explorations, 2:86–93, 2000.
[14] T. Mielikäinen. Intersecting data to closed sets with
constraints. In Proc. Workshop Frequent Item Set
Mining Implementations (FIMI 2003, Melbourne, FL),
Aachen, Germany, 2003. CEUR Workshop
Proceedings 90.
[15] F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki.
Carpenter: Finding closed patterns in long biological
datasets. In Proc. 9th ACM SIGKDD Int. Conf. on
Knowledge Discovery and Data Mining (KDD 2003,
Washington, DC), pages 637–642, New York, NY,
USA, 2003. ACM Press.
[16] F. Pan, A. Tung, G. Cong, and X. Xu. Cobbler:
Combining column and row enumeration for closed
pattern discovery. In Proc. 16th Int. Conf. on
Scientific and Statistical Database Management
(SSDBM 2004, Santorini Island, Greece), page 21ff,
Piscataway, NJ, USA, 2004. IEEE Press.
[17] F. Rioult, J.-F. Boulicaut, B. Crémilleux, and
J. Besson. Using transposition for pattern discovery
from microarray data. In Proc. 8th ACM SIGMOD
Workshop on Research Issues in Data Mining and
Knowledge Discovery (DMKD 2003, San Diego, CA),
pages 73–79, New York, NY, USA, 2003. ACM Press.
[18] U. Scherf, D. Ross, M. Waltham, L. Smith, J. Lee,
L. Tanabe, K. Kohn, W. Reinhold, T. Myers,
D. Andrews, D. Scudiero, M. Eisen, E. Sausville,
Y. Pommier, D. Botstein, P. Brown, and J. Weinstein.
A gene expression database for the molecular
pharmacology of cancer. Nature Genetics, pages
236–244, 2000.
[19] R. Stoughton. Applications of DNA microarrays in
biology. Annual Review of Biochemistry, 2004.
[20] T. Uno, T. Asai, Y. Uchida, and H. Arimura. Lcm:
An efficient algorithm for enumerating frequent closed
item sets. In Proc. Workshop Frequent Item Set
Mining Implementations (FIMI 2003, Melbourne, FL),
Aachen, Germany, 2003. CEUR Workshop
Proceedings 90.
[21] T. Uno, M. Kiyomi, and H. Arimura. Lcm ver. 2:
Efficient mining algorithms for
frequent/closed/maximal itemsets. In Proc. Workshop
Frequent Item Set Mining Implementations (FIMI
2004, Brighton, UK), Aachen, Germany, 2004. CEUR
Workshop Proceedings 126.
[22] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li.
New algorithms for fast discovery of association rules.
In Proc. 3rd Int. Conf. on Knowledge Discovery and
Data Mining (KDD’97, Newport Beach, CA), pages
283–296, Menlo Park, CA, USA, 1997. AAAI Press.
... Due to the complexity of the game environments, we choose to implement an association rules analysis. In contrast to decision trees, association rule analysis works efficiently with a high number of features [3,18]. Learning a tree would have a large overhead during the feature selection phase and result in a tree with high width and depth [9]. ...
... However, the hypothesis (1) and (2) can be excluded, in case we combine the observations of level 1 and level 2. Both levels will result in play-traces, which consist of multiple time-step-transitions in which a step to the right or to the left does not end the game. Deciding which of the other three hypotheses (3)(4)(5) is the most suitable cannot be done by just observing playtraces of a single level. Judging by complexity hypothesis (3) is the simplest and is correct for both levels. ...
... None of the remaining generated hypotheses does fully describe the actual termination set. Nevertheless, we can combine hypotheses (3)(4)(5)(6) in a termination set to fully describe the game's original model. ...
Chapter
Computational intelligence agents can reach expert levels in many known games, such as Chess, Go, and Morris. Those systems incorporate powerful machine learning algorithms, which are able to learn from, e.g., observations, play-traces, or by reinforcement learning. While many black box systems, such as deep neural networks, are able to achieve high performance in a wide range of applications, they generally lack interpretability. Additionally, previous systems often focused on a single or a small set of games, which makes it a cumbersome task to rebuild and retrain the agent for each possible application. This paper proposes a method, which extracts an interpretable set of game rules for previously unknown games. Frequent pattern mining is used to find common observation patterns in the game environment. Finally, game rules as well as winning-/losing-conditions are extracted via association rule analysis. Our evaluation shows that a wide range of game rules can be successfully extracted from previously unknown games. We further highlight how the application of fuzzy methods can advance our efforts in generating explainable artificial intelligence (AI) agents.
... In this paper, we propose the first PARASOL method to find a -covered set of the FIs. The proposed method is based on two key techniques: incremental intersection, which is an incremental way to compute the closed itemsets (Borgelt et al. 2011;Yen et al. 2011), and minimum entry deletion, which is a space-saving technique for PC and RC approximation (Manku and Motwani 2002;Metwally et al. 2005;Yamamoto et al. 2014). We firstly show that the output obtained from these existing techniques composes a -covered set of the FIs. ...
... Especially, we focus on a binomial spanning tree, called weeping tree. Unlike the previously proposed data structures, such as prefix trees (Borgelt et al. 2011;Shin et al. 2014) and vertical format index (Yen et al. 2011), the weeping tree captures not only each itemset but its occurrence sequence in the transactions. As a result, the weeping tree exhibits several interesting features that are essential to prune redundant update process. ...
... Incremental intersection (Borgelt et al. 2011;Yen et al. 2011) is used for computing the closed itemsets. It is based on the following cumulative and incremental features of the closed itemsets. ...
Article
Full-text available
Here, we present a novel algorithm for frequent itemset mining in streaming data (FIM-SD). For the past decade, various FIM-SD methods in one-pass approximation settings that allow to approximate the support of each itemset have been proposed. They can be categorized into two approximation types: parameter-constrained (PC) mining and resource-constrained (RC) mining. PC methods control the maximum error that can be included in the approximate support based on a pre-defined parameter. In contrast, RC methods limit the maximum memory consumption based on resource constraints. However, the existing PC methods can exponentially increase the memory consumption, while the existing RC methods can rapidly increase the maximum error. In this study, we address this problem by introducing a hybrid approach of PC-RC approximations, called PARASOL. For any streaming data, PARASOL ensures to provide a condensed representation, called a Δ-covered set, which is regarded as an extension of the closedness compression; when Δ = 0, the solution corresponds to the ordinary closed itemsets. PARASOL searches for such approximate closed itemsets that can restore the frequent itemsets and their supports while the maximum error is bounded by an integer, Δ. Then, we empirically demonstrate that the proposed algorithm significantly outperforms the state-of-the-art PC and RC methods for FIM-SD.
... A landmark intersection-based approach is adapted to batch FCI mining in [1]. They use a two-pass scheme and store nearly all CIs (stripped of infrequent items) and no tidsets. ...
Conference Paper
Full-text available
Mining association rules from data streams is a challenging task due to the (typically) limited resources available vs. the large size of the result. Frequent closed itemsets (FCI) enable an efficient first step, yet current FCI stream miners are not optimal in resource consumption, e.g. they store a large number of extra itemsets at an additional cost. In a search for a better storage-efficiency tradeoff, we designed Ciclad, an intersection-based sliding-window FCI miner. Leveraging in-depth insights into FCI evolution, it combines minimal storage with quick access. Experimental results indicate that Ciclad's memory imprint is much lower and its performance globally better than that of competitor methods.
Chapter
In this paper, a fuzzy association algorithm based on a load classifier is proposed to study fuzzy association rules over numerical data streams. A method for dynamically partitioning the data blocks of a data stream with a load classifier is proposed, together with an optimized membership function. The FP-Growth algorithm is used to parallelize the processing of fuzzy association rules. First, based on the load-balancing classifier, a variable window is proposed to divide the original data stream. Second, the continuous data are preprocessed and converted into fuzzy interval data by the improved membership function. Finally, in simulation experiments with the load classifier, compared with the four reference algorithms, the data processing time is similar after convergence, and the data processing time of SDBA (Spark Dynamic Block Adjustment) stays below 25 ms.
Conference Paper
Full-text available
The growth of bioinformatics has resulted in datasets with new characteristics. These datasets typically contain a large number of columns and a small number of rows. For example, many gene expression datasets may contain 10,000-100,000 columns but only 100-1000 rows. Such datasets pose a great challenge for existing (closed) frequent pattern discovery algorithms, since these have an exponential dependence on the average row length. In this paper, we describe a new algorithm called CARPENTER that is specially designed to handle datasets with a large number of attributes and a relatively small number of rows. Several experiments on real bioinformatics datasets show that CARPENTER is orders of magnitude better than previous closed pattern mining algorithms like CLOSET and CHARM.
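The row-enumeration idea behind CARPENTER can be conveyed by a deliberately brute-force sketch (our own illustration, without CARPENTER's search-tree pruning; the function name `closed_by_row_enum` is hypothetical): since every closed itemset is the intersection of the rows that contain it, enumerating row subsets of size at least the minimum support and intersecting them yields all frequent closed itemsets — attractive exactly when there are few rows and many columns:

```python
from itertools import combinations

def closed_by_row_enum(rows, minsupp):
    """Enumerate row (transaction) subsets instead of item subsets."""
    rows = [frozenset(r) for r in rows]
    result = {}
    # Only row sets of size >= minsupp can witness a frequent itemset.
    for k in range(minsupp, len(rows) + 1):
        for combo in combinations(rows, k):
            i = frozenset.intersection(*combo)
            if i:
                # Exact support: number of rows containing the intersection.
                supp = sum(1 for r in rows if i <= r)
                if supp >= minsupp:
                    result[i] = supp
    return result
```

Every reported set is closed (an intersection of transactions is always closed), and every frequent closed set appears, because the set of rows containing it is itself one of the enumerated combinations. The exponential cost here is in the number of rows, not the row length, which matches the gene-expression setting described above.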
Conference Paper
Full-text available
Abstract—This paper presents SaM, a split and merge algorithm for frequent item set mining. Its distinguishing qualities are an exceptionally simple algorithm and data structure, which not only render it easy to implement, but also convenient to execute on external storage. Furthermore, it can easily be extended to allow for “fuzzy” frequent item set mining in the sense that missing items can be inserted into transactions with a user-specified penalty. In order to demonstrate its performance, we report experiments comparing it with the “fuzzy” frequent item set mining version of RElim (an algorithm we suggested in an earlier paper [15] and improved in the meantime). Keywords—data mining, frequent item set mining, fuzzy frequent item set, fault tolerant data mining
Conference Paper
Full-text available
We describe a method for computing closed sets with data-dependent constraints. In particular, we show how the method can be adapted to find frequent closed sets in a given data set. The current preliminary implementation of the method is quite inefficient, but more powerful pruning techniques could be used. Also, the method can easily be applied to a wide variety of constraints. Regardless of the potential practical usefulness of the method, we hope that the sketched approach can shed some additional light on frequent closed set mining.
Conference Paper
We analyze expression matrices to identify a priori interesting sets of genes, e.g., genes that are frequently co-regulated. Such matrices provide expression values for given biological situations (the rows) and given genes (the columns). The frequent itemset (sets of columns) extraction technique makes it possible to process difficult cases (millions of rows, hundreds of columns), provided that the data is not too dense. However, expression matrices can be dense and generally have only a few rows relative to the number of columns. Known algorithms, including the recent algorithms that compute the so-called condensed representations, can fail. Thanks to the properties of Galois connections, we propose an original technique that processes the transposed matrices while computing the sets of genes. We validate the potential of this framework by looking for the closed sets in two microarray data sets.
Conference Paper
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study, we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods.
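The FP-tree construction described in this abstract can be sketched compactly (our own minimal illustration of the first compression step only; pattern-growth mining over the tree is omitted, and the names `FPNode` and `build_fp_tree` are hypothetical). Items are reordered by descending frequency so that transactions share common prefixes in the tree:

```python
class FPNode:
    """One node of the prefix tree: an item, a count, and child links."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, minsupp):
    # First scan: count item frequencies.
    freq = {}
    for t in transactions:
        for x in t:
            freq[x] = freq.get(x, 0) + 1
    # Second scan: insert each transaction, keeping only frequent items,
    # sorted by descending frequency to maximize prefix sharing.
    root = FPNode(None)
    for t in transactions:
        items = sorted((x for x in t if freq[x] >= minsupp),
                       key=lambda x: (-freq[x], x))
        node = root
        for x in items:
            node = node.children.setdefault(x, FPNode(x))
            node.count += 1
    return root, freq
```

For the transactions [a,b], [b,c,d], [a,b,c] with a minimum support of 2, the infrequent item d is dropped and all three transactions share the prefix b, so the tree stores them in far fewer nodes than a flat representation would.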
Conference Paper
In (4), we gave FPgrowth*, FPmax* and FPclose for mining all, maximal and closed frequent itemsets, respectively. In this short paper, we describe two approaches for improving the main memory consumption of FPmax* and FPclose. Experimental results show that the two approaches successfully reduce the main memory requirements of the two algorithms, and that in particular one of the approaches does not incur any practically significant extra running time.
Conference Paper
In this paper, we propose three algorithms, LCMfreq, LCM, and LCMmax, for mining all frequent sets, frequent closed item sets, and maximal frequent sets, respectively, from transaction databases. The main theoretical contribution is that we construct tree-shaped traversal routes composed only of frequent closed item sets, induced by a parent-child relationship defined on frequent closed item sets. By traversing the route in a depth-first manner, LCM finds all frequent closed item sets in polynomial time per item set, without storing previously obtained closed item sets in memory. Moreover, we introduce several algorithmic techniques that exploit the sparse and dense structures of the input data. Algorithms for enumerating all frequent item sets and maximal frequent item sets are obtained from LCM as its variants. In computational experiments on real-world and synthetic databases comparing their performance to previous algorithms, we found that our algorithms are fast on large real-world datasets with natural distributions, such as the KDD-Cup 2000 datasets, and on many other synthetic databases.
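The closure operator on which such enumeration algorithms rest can be stated in a few lines (our own illustration of the operator only, not LCM's prefix-preserving closure extension or its traversal; the name `closure` is hypothetical): the closure of an itemset is the intersection of all transactions that contain it, and an itemset is closed exactly when it equals its own closure.

```python
def closure(itemset, transactions):
    """Intersection of all transactions containing the given itemset."""
    covering = [frozenset(t) for t in transactions if itemset <= set(t)]
    # If no transaction contains the itemset, it is its own closure.
    return frozenset.intersection(*covering) if covering else frozenset(itemset)
```

For the transactions {a,b,c}, {a,c}, {a,b}, the closure of {c} is {a,c}: every transaction containing c also contains a, so {c} is not closed and need not be reported separately.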