Data Mining and Knowledge Discovery, 10, 87–116, 2005
© 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.
Efficient Algorithms for Mining and Incremental
Update of Maximal Frequent Sequences∗
BEN KAO kao@cs.hku.hk
MINGHUA ZHANG mhzhang@cs.hku.hk
CHI-LAP YIP clyip@cs.hku.hk
DAVID W. CHEUNG dcheung@cs.hku.hk
Department of Computer Science, The University of Hong Kong, Hong Kong
Editor: Usama Fayyad
Received April 15, 2003; Revised April 14, 2004
Abstract. We study two problems: (1) mining frequent sequences from a transactional database, and (2) incremental
update of frequent sequences when the underlying database changes over time. We review existing
sequence mining algorithms including GSP, PrefixSpan, SPADE, and ISM. We point out the large memory
requirement of PrefixSpan, SPADE, and ISM, and evaluate the performance of GSP. We discuss the high I/O cost
of GSP, particularly when the database contains long frequent sequences. To reduce the I/O requirement, we
propose an algorithm MFS, which can be considered a generalization of GSP. The general strategy of MFS is to
first find an approximate solution to the set of frequent sequences and then perform successive refinement until
the exact set of frequent sequences is obtained. We show that this successive refinement approach results in a
significant improvement in I/O cost. We discuss how MFS can be applied to the incremental update problem. In
particular, the result of a previous mining exercise can be used (by MFS) as a good initial approximate solution for
the mining of an updated database. This results in an I/O-efficient algorithm. To improve processing efficiency,
we devise pruning techniques that, when coupled with GSP or MFS, result in algorithms that are both CPU and I/O
efficient.
Keywords: data mining, sequence, incremental update
1. Introduction
Data mining has recently attracted considerable attention from database practitioners and
researchers because of its applicability in many areas such as decision support, market
strategy and financial forecasts. Combining techniques from the fields of machine learning,
statistics, and databases, data mining enables us to extract useful and valuable information
from huge databases.
One of the many data mining problems is the extraction of frequent sequences from
transactional databases. The goal is to discover frequent sequences of events. For example,
an on-line bookstore may find that most customers who have purchased the book “The
Gunslinger” are likely to come back again in the future to buy “The Gunslinger II” in another
∗This research is supported by Hong Kong Research Grants Council grant HKU 7040/02E.
transaction. Knowledge of this sort enables the store manager to conduct promotional
activities and to come up with good marketing strategies. As another example, applying
sequence mining on a medical database may reveal that the occurrence of a sequence of
events would likely lead to the occurrence of certain illness or symptoms. As a third example,
a web site manager may mine the web-log sequences to find the visitors’ access patterns so
as to improve the site design.
The problem of mining frequent sequences was first introduced by Agrawal and Srikant
(1995). In their model, a database is a collection of transactions. Each transaction is a
set of items (or an itemset) and is associated with a customer ID and a time ID. If one
groups the transactions by their customer IDs, and then sorts the transactions of each
group by their time IDs in increasing value, the database is transformed into a number of
customer sequences. Each customer sequence shows the order of transactions a customer
has conducted. Roughly speaking, the problem of mining frequent sequences is to
discover “subsequences” (of itemsets) that occur frequently enough among all the customer
sequences.
Many works have been published on the problem and its variations (Agrawal and Srikant,
1995; Srikant and Agrawal, 1996; Zaki, 2000; Parthasarathy et al., 1999; Garofalakis et al.,
1999; Pei et al., 2001). Among them, PrefixSpan (Pei et al., 2001) and SPADE (Zaki,
2000) are two very efficient algorithms. Unfortunately, both algorithms require storing a
(transformed) portion of the database in memory to achieve efficiency. If the database is
large compared with the amount of memory available, memory paging and/or disk I/O
would substantially slow them down. Thus, PrefixSpan and SPADE are good for small
databases. For large ones, a good choice is GSP (Srikant and Agrawal, 1996).
GSP is a multi-phase iterative algorithm. It scans the database a number of times. Very
similar to the structure of the Apriori algorithm (Agrawal and Swami, 1993) for mining
association rules, GSP starts by finding all frequent length-1 sequences1 by scanning the
database. This set of length-1 frequent sequences is then used to generate a set of candidate
length-2 sequences. The supports, or occurrence frequencies, of the candidate sequences
are counted by scanning the database again. Those length-2 sequences that are frequent are
used to generate candidate sequences of length 3, and so on. This process is repeated until
no more frequent sequences are discovered during a database scan, or no candidates are
generated.
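To make the iterative structure concrete, the following is a minimal Python sketch of the GSP loop just described. The helpers gsp_gen (the candidate generation function discussed below) and support_count (one scan of the database) are our own illustrative assumptions, not the authors' implementation.

def gsp(db, min_support):
    """db: a list of customer sequences; each sequence is a list of itemsets.
    Returns all frequent sequences, discovered level by level."""
    min_count = min_support * len(db)
    items = {i for seq in db for itemset in seq for i in itemset}
    candidates = [[frozenset([i])] for i in sorted(items)]  # length-1 candidates
    frequent = []
    while candidates:
        counts = support_count(db, candidates)  # one database scan
        level = [s for s, c in zip(candidates, counts) if c >= min_count]
        if not level:
            break
        frequent.extend(level)
        candidates = gsp_gen(level)  # length-(k+1) candidates from length-k frequent ones
    return frequent

The loop terminates exactly as described: either no candidate survives the scan, or the generation function produces no candidates.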
The key of the GSP algorithm is its candidate generation function GSP Gen. GSP Gen
wisely generates only those candidate sequences that have the potential of being frequent.
This makes GSP computationally efficient. However, we observe that each iteration of
GSP only discovers frequent sequences of the same length. In particular, a length-k frequent
sequence is discovered only in the k-th iteration of the algorithm. Consequently, the number
of iterations (and hence database scans) is dependent on the length of the longest frequent
sequences in the database. Therefore, if the database is huge and if it contains very long
frequent sequences, GSP suffers from a high I/O cost.
In many applications, the content of a database changes over time. For example, new
customer sequences are added to a bookstore's database as the store recruits new customers.
Similarly, every visit to a web site will add a new log to the site’s log database. There are
also cases in which we have to delete sequences from the database. As an example, when
mining current access patterns of a web site, we may need to delete some out-of-date logs
(say those that are more than a year old). Since an operational database changes continu-
ously, the set of frequent sequences has to be updated incrementally. One simple strategy
is to apply GSP on the updated database. However, this strategy fails to take advantage
of the valuable information obtained from a previous mining exercise. We note that this
information is particularly useful if the updated database and the old one share a signifi-
cant portion of common sequences. An incremental update algorithm that makes use of a
previous mining result should therefore be much more efficient than a mining-from-scratch
approach.
The goals of this paper are twofold: (1) we analyze and improve the I/O performance
of GSP and (2) we study the problem of incremental maintenance of frequent sequences.
In the rest of this section, we give a preview of our approach to the two problems and the
techniques used.
As we have mentioned, an intrinsic problem of GSP is that frequent sequences are
discovered incrementally length-wise, with each iteration (database scan) discovering frequent
sequences that are one unit longer than those found in the previous iteration. This property
is a direct consequence of GSP Gen being capable of generating only candidates of
the same length. To improve I/O performance, we devise a new algorithm called MFS. Our
algorithm consists of the following two components:
1. We first find an approximate solution, Sest, to the set of frequent sequences. One way to
obtain Sest is to mine a sample of the database using, for example, GSP. We then scan
the database once to verify and to retain those sequences in Sest that are frequent w.r.t.
the whole database.
2. We generalize GSP Gen so that the generation function is capable of accepting a set
of frequent sequences of various lengths as input, and returning a set of candidate
sequences of various lengths as output. We call the new candidate generation function
MGen. We apply MGen on Sest to generate candidate sequences. By scanning the database
once, we refine Sest by adding to it those candidate sequences that are frequent. The
candidate-generation-refinement process is repeated until no more candidates can be
generated.
As we will see in Section 6, coupling the above techniques makes MFS much more
I/O efficient than GSP. Intuitively, the initial approximate solution gives MFS a head-start
towards finding the exact set of frequent sequences. As an example, consider a database
that contains a length-10 frequent sequence s. With GSP, 10 iterations (database scans)
are needed to discover s. With MFS, however, if all the length-9 subsequences of s are
discovered in the approximate solution, then only 1 iteration is needed to obtain s. We will
discuss how sampling is done to obtain a good approximate solution that allows MFS to
achieve its I/O efficiency in this paper.
We note that MFS can be effectively applied to the incremental update problem. This is
because the set of frequent sequences found in the old database can well be used as Sest for
mining the new database. We can thus save the effort in mining the sample.
To further improve the efficiency of the incremental update algorithm, we propose a
pruning technique that removes candidate sequences before their supports w.r.t. the database
are counted. The pruning technique is based on the following observations:
• If one knows the support of a frequent sequence in the old database, then the
sequence's support w.r.t. the new database can be deduced by scanning the inserted
customer sequences and the deleted customer sequences. The portion (typically the majority)
of the database that has not been changed need not be processed.
• Given that a sequence is not frequent w.r.t. the old database, the sequence cannot become
frequent unless its support in the inserted customer sequences is large enough and
its support in the deleted customer sequences is small enough. This observation allows
us to determine whether a candidate sequence should be considered by “looking” at the
small portion of the database that has been changed.
By applying the pruning technique on GSP and MFS, we obtain their incremental versions,
namely, GSP+ and MFS+.
We conducted an extensive experiment comparing the performance of the various
algorithms for the sequence mining problem and also for the incremental update problem. For
the sequence mining problem, we will show that in many cases, the I/O cost of MFS is
significantly less than that of GSP. We will also show how the performance gain MFS achieves
depends on the accuracy of the initial estimate Sest. As an extreme case, MFS requires only
two database scans if Sest covers all frequent sequences of the database. This number is
independent of the length of the longest frequent sequences. For a database containing long
frequent sequences, the I/O saving is significant. For the incremental update problem, our
experiment results show that GSP+ and MFS+ save CPU cost significantly compared with
GSP and MFS.
The rest of this paper is organized as follows. In Section 2 we give a formal definition of the
problem of mining frequent sequences and that of incremental update. In Section 3 we review
some related algorithms including GSP, SPADE, ISM, and PrefixSpan. Section 4 presents
the MFS algorithm and its candidate generation function MGen. Section 5 introduces the
pruning technique and presents the two incremental algorithms GSP+ and MFS+. Experiment
results comparing the performance of the algorithms are shown in Section 6. Finally, we
conclude the paper in Section 7.
2. Problem definition
In this section we formally define the problem of mining frequent sequences and the incre-
mental update problem. We also define some notations to simplify our discussion.
Let I = {i1, i2, ..., im} be a set of literals called items. An itemset X is a set of items
(hence, X ⊆ I). A sequence s = ⟨t1, t2, ..., tn⟩ is an ordered set of itemsets. The length
of a sequence s is defined as the number of items contained in s (denoted by |s|). If an
item occurs several times in different itemsets of a sequence, the item is counted for each
occurrence. For example, if s = ⟨{1}, {2,3}, {1,4}⟩, then |s| = 5.
Figure 1. Definitions of D, D′, Δ−, D−, and Δ+.
Consider two sequences s1 = ⟨a1, a2, ..., am⟩ and s2 = ⟨b1, b2, ..., bn⟩. We say that s1
contains s2, or equivalently, s2 is a subsequence of s1, if there exist integers j1, j2, ..., jn
such that 1 ≤ j1 < j2 < ··· < jn ≤ m and b1 ⊆ a_j1, b2 ⊆ a_j2, ..., bn ⊆ a_jn. We
represent this relationship by s2 ⊑ s1. As an example, the sequence s2 = ⟨{a},{b,c},{d}⟩ is
contained in s1 = ⟨{e},{a,d},{g},{b,c,f},{d}⟩ because {a} ⊆ {a,d}, {b,c} ⊆ {b,c,f},
and {d} ⊆ {d}. Hence, s2 ⊑ s1. On the other hand, s2 is not contained in s3 = ⟨{a,d},{d},{b,c,f}⟩
because {b,c} occurs before {d} in s2, which is not the case in s3.
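The containment test can be implemented with a single greedy pass over s1. The sketch below (Python, with itemsets as Python sets) is our illustration of the definition, not code from the paper.

def contains(s1, s2):
    """Return True if s2 is a subsequence of s1 (s2 is contained in s1)."""
    j = 0
    for itemset in s1:
        if j < len(s2) and s2[j] <= itemset:  # s2[j] is a subset of itemset
            j += 1
    return j == len(s2)

# The example above: s2 is contained in s1 but not in s3.
s1 = [{'e'}, {'a', 'd'}, {'g'}, {'b', 'c', 'f'}, {'d'}]
s2 = [{'a'}, {'b', 'c'}, {'d'}]
s3 = [{'a', 'd'}, {'d'}, {'b', 'c', 'f'}]
assert contains(s1, s2) and not contains(s3, s2)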
Given a sequence set V and a sequence s, if there exists a sequence s′ ∈ V such that
s ⊑ s′, we write s ⊑ V. In words, s ⊑ V if s is contained in some sequence of V.
Given a sequence set V, a sequence s ∈ V is maximal if s is not contained in any other
sequence in V. That is, s is maximal if ∄s′ s.t. (s′ ∈ V) ∧ (s′ ≠ s) ∧ (s ⊑ s′). We use
Max(V) to represent the set of all maximal sequences in V.
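Max(V) then follows directly from the containment test; the sketch below reuses the contains function above and is again only an illustration.

def maximal(V):
    """Max(V): the sequences in V not contained in any other sequence of V."""
    return [s for s in V
            if not any(t is not s and contains(t, s) for t in V)]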
Given a database D of sequences, the support count of a sequence s, denoted by δ_D^s, is
defined as the number of sequences in D that contain s. The fraction of sequences in D
that contain s is called the support of s (represented by sup(s)). If we use the symbol |D|
to denote the number of sequences in D (or the size of D), we have sup(s) = δ_D^s / |D|.
If sup(s) is not less than a user-specified support threshold ρs, s is a frequent sequence.
We use the symbol Li to denote the set of all length-i frequent sequences. Also, we use L
to denote the set of all frequent sequences; that is, L = ∪_{i=1}^∞ Li. The problem of mining
frequent sequences is to find all maximal frequent sequences in a database D (i.e., Max(L)).
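In these terms, support counting is simply a containment test against every sequence in D; a small sketch (the function names are ours):

def support(db, s):
    """sup(s): the fraction of sequences in D that contain s."""
    count = sum(1 for seq in db if contains(seq, s))  # the support count of s in D
    return count / len(db)

def is_frequent(db, s, rho_s):
    return support(db, s) >= rho_s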
For the incremental update problem, we assume that a previous mining exercise has
been executed on a database D to obtain the supports of the frequent sequences in D. The
database D is then updated by deleting a set of sequences Δ− followed by inserting a set of
sequences Δ+. Let us denote the updated database D′. Note that D′ = (D − Δ−) ∪ Δ+. Let
the set of sequences shared by D and D′ be D−; hence, D− = D − Δ− = D′ − Δ+. Since
the relative order of the sequences within a database does not affect the mining results,
we may assume (without loss of generality) that all deleted sequences are located at the
beginning of the database and all new sequences are appended at the end, as illustrated in
figure 1.
The incremental update problem is to find all maximal frequent sequences in the database
D′ given Δ−, D−, Δ+, and the result of mining D. Table 1 summarizes the notations we
use in this paper.
3. Related works
In this section we review some related algorithms. These algorithms include GSP,
PrefixSpan, SPADE, and ISM. As we have indicated in Section 1, SPADE and PrefixSpan are two
Table 1. Notations.

  Symbol    Description
  s         A sequence
  δ_X^s     Support count of sequence s in database X
  ρs        Support threshold
  |X|       Number of sequences in database X
  Li        The set of length-i frequent sequences
  L         The set of all frequent sequences
  D         Old database
  D′        Updated database
  Δ−        The set of deleted sequences
  Δ+        The set of inserted sequences
  D−        The set of sequences shared by D and D′
efficient algorithms for mining frequent sequences. However, their memory requirements
are large, and thus they are only suitable for mining small databases. ISM is an incremental update
algorithm derived from SPADE. Readers who are familiar with these algorithms may skip
this section.
3.1. GSP
Agrawal and Srikant first studied the problem of mining frequent sequences (Agrawal and
Srikant, 1995) and they proposed an algorithm called AprioriAll. Later, they improved
AprioriAll and came up with a more efficient algorithm called GSP (Srikant and Agrawal,
1996). GSP can also be used to solve other generalized versions of the frequent-sequence
mining problem. For example, a user can specify a sliding time window: items that occur
in itemsets falling within the sliding time window can be treated as occurring in the
same itemset.
Similar to the structure of the Apriori algorithm (Agrawal and Swami, 1993) for mining
association rules, GSP starts by finding all frequent length-1 sequences from the database.
A set of candidate length-2 sequences is then generated. The support counts of the
candidate sequences are then counted by scanning the database once. The frequent length-2
sequences are then used to generate candidate sequences of length 3, and so on. In general,
GSP uses a function GSP Gen to generate candidate sequences of length k + 1 given the
set of all frequent length-k sequences. The algorithm terminates when no more frequent
sequences are discovered during a database scan.
The candidate generation function GSP Gen works as follows. Given the set of all frequent
length-k sequences, Lk, as input, GSP Gen considers every pair of sequences s1 and s2 in
Lk. If the sequence obtained by deleting the first item of s1 is equal to the sequence obtained
by deleting the last item of s2, then a candidate is generated by appending the last item
of s2 to the end of s1. For example, consider s1 = ⟨{A,B},{C}⟩ and s2 = ⟨{B},{C,D}⟩.
Table 2. Database.

  Customer ID   Sequence
  1             ⟨{A,B,C},{E,G},{C,D}⟩
  2             ⟨{A,B},{A,D,E}⟩
  3             ⟨{A,F},{B,E}⟩
  4             ⟨{A,B,F},{C,E}⟩
Table 3. Mining process of GSP.

  C1: ⟨{A}⟩, ⟨{B}⟩, ⟨{C}⟩, ⟨{D}⟩, ⟨{E}⟩, ⟨{F}⟩, ⟨{G}⟩
  L1: ⟨{A}⟩, ⟨{B}⟩, ⟨{E}⟩
  C2: ⟨{A,B}⟩, ⟨{A,E}⟩, ⟨{B,E}⟩, ⟨{A},{A}⟩, ⟨{A},{B}⟩, ⟨{A},{E}⟩, ⟨{B},{A}⟩, ⟨{B},{B}⟩,
      ⟨{B},{E}⟩, ⟨{E},{A}⟩, ⟨{E},{B}⟩, ⟨{E},{E}⟩
  L2: ⟨{A,B}⟩, ⟨{A},{E}⟩, ⟨{B},{E}⟩
  C3: ⟨{A,B},{E}⟩
  L3: ⟨{A,B},{E}⟩
  C4: ∅
Since removing the leading item A from s1 gives the same sequence (⟨{B},{C}⟩) as that
obtained by removing the trailing item D from s2, a candidate sequence ⟨{A,B},{C,D}⟩ is
generated by GSP Gen. After a candidate sequence s is generated, GSP Gen checks whether
all subsequences of s are frequent. If not, the candidate is thrown away.
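The join step can be sketched as follows, with sequences represented as lists of sorted tuples; the helper names are ours. Whether the appended item extends the last itemset or starts a new one depends on whether it shared an itemset with other items in s2, as the example shows.

def drop_first(s):
    """Remove the first item of the first itemset of s."""
    head = sorted(s[0])
    return ([tuple(head[1:])] if len(head) > 1 else []) + list(s[1:])

def drop_last(s):
    """Remove the last item of the last itemset of s."""
    tail = sorted(s[-1])
    return list(s[:-1]) + ([tuple(tail[:-1])] if len(tail) > 1 else [])

def gsp_join(s1, s2):
    """Join two length-k sequences into a length-(k+1) candidate, or None."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = sorted(s2[-1])[-1]
    if len(s2[-1]) > 1:
        # the last item of s2 merges into the final itemset of the candidate
        return list(s1[:-1]) + [tuple(sorted(set(s1[-1]) | {last}))]
    return list(s1) + [(last,)]  # the last item forms a new trailing itemset

# s1 = <{A,B},{C}> and s2 = <{B},{C,D}> give the candidate <{A,B},{C,D}>.
assert gsp_join([('A', 'B'), ('C',)], [('B',), ('C', 'D')]) == [('A', 'B'), ('C', 'D')]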
As an example, consider the database shown in Table 2. If the support threshold is 75%,
then a sequence is frequent if it is contained in at least 3 customer sequences. In this case,
the candidates (Ci’s) generated by GSP Gen and the frequent sequences (Li’s) discovered
by GSP for each iteration (i) are listed in Table 3.
GSP is an efficient algorithm. However, the number of database scans it requires is deter-
mined by the length of the longest frequent sequences. Consequently, if there are very long
frequent sequences and if the database is huge, the I/O cost of GSP could be substantial.
3.2. PrefixSpan
PrefixSpan is a recently devised efficient algorithm for mining frequent sequences (Pei
et al., 2001). PrefixSpan mines frequent sequences by intermediate database generation
instead of the traditional approach of candidate sequence generation. PrefixSpan is shown
to be efficient if a sufficient amount of memory is available.
In Pei et al. (2001), it is assumed without loss of generality that all items within an itemset
are listed in alphabetical order. Before we review the algorithm, let us first introduce several
terms defined in Pei et al. (2001).
• prefix. Given two sequences s1 = ⟨t1, t2, ..., tn⟩ and s2 = ⟨t′1, t′2, ..., t′m⟩ (m ≤ n), s2 is
called a prefix of s1 if (1) t′i = ti for i ≤ m − 1; (2) t′m ⊆ tm; and (3) all items in (tm − t′m)
are alphabetically ordered after those in t′m.
For example, if s1 = ⟨{a},{b,c,d},{e}⟩, s2 = ⟨{a},{b}⟩, and s3 = ⟨{a},{d}⟩, then s2 is a
prefix of s1, but s3 is not.
• projection. Given a sequence s1 and one of its subsequences s2 (i.e., s2 ⊑ s1), a sequence
p is called the projection of s1 w.r.t. prefix s2 if (1) p ⊑ s1; (2) s2 is a prefix of p; and
(3) p is the “maximal” sequence that satisfies conditions (1) and (2); that is, ∄p′ s.t.
(p ⊑ p′ ⊑ s1) ∧ (p′ ≠ p) ∧ (s2 is a prefix of p′).
For example, if s1 = ⟨{a},{b,c,d},{e},{f}⟩ and s2 = ⟨{a},{c,d}⟩, then p = ⟨{a},{c,d},{e},{f}⟩
is the projection of s1 w.r.t. prefix s2.
• postfix. If p is the projection of s1 w.r.t. prefix s2, then the sequence s3 obtained by removing
the prefix s2 from p is called the postfix of s1 w.r.t. prefix s2.
For example, if s1 = ⟨{a},{b,c,d},{e},{f}⟩ and s2 = ⟨{a},{c,d}⟩, then p = ⟨{a},{c,d},{e},{f}⟩
is the projection of s1 w.r.t. prefix s2, and the postfix of s1 w.r.t. prefix s2 is ⟨{e},{f}⟩.
If s2 is not a subsequence of s1, then both the projection and the postfix of s1 w.r.t. s2 are
empty.
There are three major steps of PrefixSpan.
• Find frequent length-1 sequences.
In this step, PrefixSpan scans the database D once to find all frequent items. The set of
frequent length-1 sequences is L1 = {⟨{i}⟩ | i is a frequent item}. For example, given the
database shown in Table 2 and a support count threshold of 3, the set of frequent items
is {A, B, E}.
• Divide the search space into smaller subspaces.
The set of all frequent sequences can be divided into several groups, such that the
sequences within a group share the same prefix item.
For example, if {A, B, E} is the set of frequent items discovered in the first step, then
all the frequent sequences can be divided into three groups, corresponding to the three
prefixes ⟨{A}⟩, ⟨{B}⟩, and ⟨{E}⟩.
• Discover frequent sequences in each subspace.
In this step, PrefixSpan finds frequent sequences in each subspace. We use an example
to illustrate the procedure.
Using the running example, to find the frequent sequences with prefix ⟨{A}⟩, PrefixSpan
first projects the database D to get an intermediate database D|{A}. For every sequence
s in D, D|{A} contains the postfix of s w.r.t. ⟨{A}⟩. The projected database D|{A} of our
example is shown in Table 4. In the table, an underscore '_' preceding an item x indicates
that x is contained in the same itemset as the last item of the prefix. For example, w.r.t.
the prefix ⟨{A}⟩, the postfix sequence ⟨{_B,C},{E,G},{C,D}⟩ indicates that the items
B and C are contained in the same itemset as A in an original database sequence.
After D|{A} is obtained, PrefixSpan scans D|{A} once to get all frequent items in
D|{A}. In our example, the frequent items are {_B, E}. So there are in total two length-2
frequent sequences with prefix ⟨{A}⟩, namely, ⟨{A,B}⟩ and ⟨{A},{E}⟩. Then, recursively,
the database D|{A} is projected w.r.t. the prefixes ⟨{_B}⟩ and ⟨{E}⟩ to obtain D|⟨{A,B}⟩
Table 4. Projected database.

  Customer ID   Postfix sequence
  1             ⟨{_B,C},{E,G},{C,D}⟩
  2             ⟨{_B},{A,D,E}⟩
  3             ⟨{_F},{B,E}⟩
  4             ⟨{_B,F},{C,E}⟩

and D|⟨{A},{E}⟩. Each one is recursively mined to obtain frequent sequences with the
corresponding prefix.
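For the special case of a single-item prefix, the projection of Table 4 can be sketched as below. The '_' marker convention follows the table; the function name and representation are our assumptions, not the authors' code.

def project(db, item):
    """Build D|<{item}>: for each sequence, the postfix w.r.t. the first
    occurrence of `item`; '_' marks the remaining items of that itemset."""
    projected = []
    for seq in db:
        for pos, itemset in enumerate(seq):
            if item in itemset:
                rest = sorted(i for i in itemset if i > item)  # items ordered after `item`
                postfix = ([tuple(['_'] + rest)] if rest else []) + list(seq[pos + 1:])
                if postfix:
                    projected.append(postfix)
                break
    return projected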
Very different from GSP, PrefixSpan discovers frequent sequences by projecting
databases and counting items’ supports. This implies that only the supports of sequences
that actually occur in the database are counted. In contrast, a candidate sequence generated
by GSP may not appear in the database at all. The time spent generating such a
candidate sequence and checking whether it is a subsequence of database
sequences is wasted. This factor contributes to the efficiency of PrefixSpan over
GSP.
The major cost of PrefixSpan is that of generating projected databases. It can be shown
that for every frequent sequence discovered, a projected database has to be computed for
it. Hence, the number of intermediate databases is very large if there are many frequent
sequences. If the database D is large, then PrefixSpan requires a substantial amount of
memory. Since we are interested in scenarios where large databases are involved (which
is exactly when an efficient algorithm is needed most), we do not study
PrefixSpan further because of its high memory requirement.
3.3. SPADE
The algorithms we have reviewed so far, namely, GSP and PrefixSpan, assume a horizontal
database representation. In this representation, each row in the database table represents
a transaction. Each transaction is associated with a customer ID, a transaction timestamp,
and an itemset. Table 5 shows an example database in the horizontal representation.
Table 5. Horizontal database.

  Customer ID   Transaction timestamp   Itemset
  1             110                     A
  1             120                     B, C
  2             210                     A
  2             220                     C, D
Table 6. Vertical database.

  Item   Customer ID   Transaction timestamp
  A      1             110
         2             210
  B      1             120
  C      1             120
         2             220
  D      2             220
In Zaki (2000), it is observed that a vertical representation of the database may be better
suited for sequence mining. In the vertical representation, every item in the database is
associated with an id-list. For an item a, its id-list is a list of (customer ID, transaction
timestamp) pairs. Each such pair identifies a unique transaction that contains a. A vertical
database is composed of the id-lists of all items. Table 6 shows the vertical representation
of the database shown in Table 5.
In Zaki (2000), the algorithm SPADE is proposed, which uses a vertical database to mine
frequent sequences. As shown in previous studies (Zaki, 2000), SPADE outperforms
GSP for small databases. To understand SPADE, let us first define two terms: generating
subsequences and sequence id-list.
• generating subsequences. For a sequence s such that |s| ≥ 2, the two generating subsequences
of s are obtained by removing the first or the second item of s.
• sequence id-list. Similar to the id-list of an item, we can also associate an id-list with a
sequence. The id-list of a sequence s is a list of (Customer ID, transaction timestamp)
pairs. If the pair (C, t) is in the id-list of a sequence s, then s is contained in the sequence
of Customer C, and the first item of s occurs in the transaction of Customer C at
timestamp t. Table 7 shows the id-list of ⟨{A},{C}⟩.
We note that if id-lists are available, counting the supports of sequences is trivial. In
particular, the support count of a length-1 sequence can be obtained by inspecting the
vertical database. In general, the support count of a sequence s is given by the number of
distinct customer IDs in s's id-list. The problem of support counting is thus reduced to the
problem of sequence id-list computation.
With the vertical database, only the id-lists of length-1 sequences can be readily obtained.
The id-lists of longer sequences have to be computed. In Zaki (2000), it is shown that the
id-list of a sequence s can be computed easily by intersecting the id-lists of the two generating
subsequences of s.
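As a simplified illustration, the sketch below performs the id-list intersection for the sequence-extension case only (joining ⟨{a}⟩ and ⟨{b}⟩ into ⟨{a},{b}⟩): a (customer, timestamp) pair of a survives if the same customer also has b strictly later. The full SPADE join (Zaki, 2000) also handles itemset extensions; the names here are ours.

from collections import defaultdict

def temporal_join(idlist_a, idlist_b):
    """Id-list of <{a},{b}> from the id-lists of items a and b."""
    b_times = defaultdict(list)
    for cid, t in idlist_b:
        b_times[cid].append(t)
    return [(cid, t) for cid, t in idlist_a
            if any(t2 > t for t2 in b_times[cid])]

def idlist_support(idlist):
    """Support count = number of distinct customer IDs in the id-list."""
    return len({cid for cid, _ in idlist})

# With the id-lists of Table 6, temporal_join([(1, 110), (2, 210)],
# [(1, 120), (2, 220)]) yields [(1, 110), (2, 210)], matching Table 7.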
Table 7. ID-list of ⟨{A},{C}⟩.

  Customer ID   Transaction timestamp
  1             110
  2             210
Table 8. Horizontal database generated for computing L2.

  Customer ID   (item, transaction timestamp) pairs
  1             (A, 110) (C, 120)
  2             (A, 210) (C, 220)
Here, we summarize the key steps of SPADE.
1. Find frequent length-1 sequences. As we have explained, the support count of a length-1
sequence can be obtained by simply scanning the id-list of the lone item in the sequence.
The first step of SPADE is to discover all frequent length-1 sequences by scanning the
vertical database once.
2. Find frequent length-2 sequences. Suppose there are M frequent items; then the number
of candidate length-2 sequences is O(M²). If the support counts of these length-2
sequences are obtained by first computing their id-lists using the intersection procedure,
we have to access id-lists from the vertical database O(M²) times.2 This could be very
expensive.
Instead, SPADE solves the problem by building a horizontal database on the fly that
involves only frequent items. In the horizontal database, every customer is associated
with a list of (item, transaction timestamp) pairs. For each frequent item found in Step
1, SPADE reads its id-list from disk and the horizontal database is updated accordingly.
For example, if the frequent items of our example database (Table 6) are A and C, then
the constructed horizontal database is shown in Table 8. After obtaining the horizontal
database, the supports of all candidate length-2 sequences are computed from it.
We remark that maintaining the horizontal database might require a lot of memory. This
is especially true if the number of frequent items and the vertical database are large.
3. Find long frequent sequences. In step 3, SPADE generates the id-lists of long candidate
sequences (those of length ≥3) by the intersection procedure. SPADE carefully controls
the order in which candidate sequences (and their id-lists) are generated to keep the
memory requirement at a minimum. For details, readers are again referred to Zaki
(2000).
Although SPADE is shown to be an efficient algorithm, we do not extensively evaluate
SPADE for the following reasons. First, SPADE requires a vertical representation of the
database. In many applications, a horizontal database is more natural. Hence, in order to
apply SPADE, the database has to be converted to the vertical representation first, which could
be computationally expensive. Second, SPADE generates id-lists and a horizontal database,
which require a large amount of memory. The memory requirement of SPADE grows with the
database. For example, we implemented GSP and SPADE, and executed them on a synthetic
database of 1 million sequences. GSP required 38 MB of memory, while SPADE needed
354 MB. With a larger database of 1.5 million sequences, GSP required about the same
amount of memory, while SPADE took 575 MB. Hence, unless memory is abundant, SPADE
is efficient only when applied to small databases. In Section 6.5, we further discuss the
space requirement of SPADE. In particular, we briefly compare SPADE with our algorithm
over a range of database sizes.
3.4. ISM
ISM is an incremental update algorithm based on SPADE. It was designed to handle database
updates where new transactions are appended to existing sequences, or whole new sequences
are added to the database. Similar to SPADE, ISM requires the availability of a vertical
database. Besides that, it requires the following information w.r.t. the old database D:
• all frequent sequences (and their support counts);
• all sequences (and their support counts) in the negative border (NB).
Under ISM, a sequence s is in the negative border (NB) if s is not frequent and either
|s| = 1 or both of s's generating subsequences are frequent. Note that this definition of
negative border is different from that used in other papers. (The other definition requires
that all subsequences of s be frequent.)
The above information is used to construct an increment sequence lattice, or ISL. A
sequence that is frequent in D or is in the NB of D is represented by a node in the ISL. The
node also contains the support count of the sequence w.r.t. D. Edges in the ISL connect a
sequence with its generating subsequences.
ISM assumes that the ISL of the old database is available before the incremental update.
This ISL is obtained, for example, by a prior execution of SPADE or ISM. Here, we
summarize the three key steps of ISM. For further details, readers are referred to Parthasarathy
et al. (1999).
1. Update ISL if the support count requirement is changed. After obtaining ISL, ISM
checks whether there are new sequences added to the old database D in the update. If
there are, ISM computes the new support count threshold and adjusts ISL accordingly.
In the adjustment, frequent sequences may remain frequent, be moved to the negative
border, or be deleted from ISL. Also, sequences in the negative border may stay in the
negative border or be removed.
2. Update support counts. The next step of ISM is to update the support counts of the
sequences that are present in ISL. Since the support of s w.r.t. the old database is known,
the updated support count can be efficiently found. In this step, some sequences in the
negative border may be moved to the frequent sequence set.
3. Capture sequences that were not originally in ISL. The third step of ISM is to check
whether there are new sequences that should be added to ISL due to the update. Here,
ISM executes a SPADE-like procedure for generating candidate sequences and counting
their supports by id-list intersection.
We have conducted experiments on ISM. We found that, in general, ISM is efficient under
two conditions: (1) the ISL structure derived from the database is small, and (2) the amount
of update to the database (and hence the amount of adjustment made to ISL) is small. The
former happens, for example, when there are relatively few items in the database. Otherwise,
SPADE could be more efficient than ISM even for the incremental update problem. Being
derived from SPADE, ISM has a memory requirement similar to that of SPADE. Finally, ISM
does not handle database updates in which transactions or sequences are deleted from the
database.
3.5. Other related works
In Wang (1997), Wang studied the problem of incremental update of frequent substrings. In
the paper, the author proposed a data structure called dynamic suffix tree that allows very
efficient updates of frequent substrings. Our sequence mining model differs from that used
in Wang (1997) in two aspects. First, we consider sequences of transactions (or itemsets)
instead of sequences of single symbols. Although we could map our sequence model
to the sub-string model by using a unique symbol to represent each possible itemset, the
number of symbols would be huge. That would make the algorithm inefficient. Second, our
objective is to mine frequent subsequences instead of frequent substrings. In particular,
the sequences mined could represent events that are not necessarily consecutive. As an
example, we consider ⟨{a},{g}⟩ a subsequence of ⟨{a},{c},{g}⟩. In Wang (1997),
however, the objective is to mine frequent substrings, i.e., sequences of symbols that occur
consecutively.
There are also some papers on applying sampling techniques for pattern mining (e.g.,
Lee et al., 1998; Provost et al., 1999). For example, in Provost et al. (1999), the authors
proposed an interesting method that mines a sample of the database, with the sample size
increased progressively until no further improvement in the mining accuracy is achieved.
These sample-based mining algorithms are shown to be very efficient. However, the results
obtained by those mining algorithms are only approximate ones. In this paper, we study
how sampling can be used to assist us in obtaining the complete set of frequent sequences
efficiently.
Finally, a number of studies have been done on the problem of maintaining discovered
association rules. Some examples include (Ayan et al., 1999; Cheung et al., 1996a, 1997,
1996b; Lee and Cheung, 1997; Omiecinski and Savasere, 1998; Sarda and Srinivas, 1998;
Thomas et al., 1997).
4. MFS
In this section we describe our algorithm MFS. As a recapitulation, the general strategy of
MFS is to (1) efficiently find an estimate of the set of frequent sequences and (2) refine the
estimate successively until no more refinement can be achieved. We note that if the initial
estimate is sufficiently close to the final result, then only a few iterations (database scans)
of the refinement step are needed, resulting in an I/O-efficient algorithm.
Figure 2. Algorithm MFS and function MGen.
An initial estimate, Sest, can be obtained in a couple of ways. One possibility is to mine
a small sample (say, 10%) of the database using GSP. In this case, the sample could
be mined using the same support threshold as used to mine the whole dataset. As another
possibility, if the database is updated and mined regularly, the result obtained from a previous
mining exercise can be used as Sest. In the latter case, no effort is spent in obtaining the
estimated set. This makes MFS suitable for the incremental update problem.
Figure 2 shows the MFS algorithm. MFS takes four inputs: the database D, the support
threshold ρs, the set of all items I, and an estimated set of frequent sequences Sest.
Algorithm MFS maintains a set MFSS, which is composed of all and only those maximal
frequent sequences known so far. Because all subsequences of a frequent sequence are
frequent, MFSS is sufficient to represent the set of all frequent sequences known.
Since3 no frequent sequences are known when the algorithm starts, MFSS is initialized
to the empty set. MFS then scans the database once to count the supports of all length-1
sequences and all the sequences contained in Sest. The maximal frequent sequences found
in this step are put in the set MFSS. After that, MFS iterates the following two steps: (1)
apply a candidate generation function MGen on MFSS to obtain a candidate set; (2) scan
the database to find out which sequences in the candidate set are frequent. Those frequent
candidate sequences are then used to update MFSS. This successive refinement procedure
stops when MFS cannot find any new frequent sequences in an iteration.
The heart of MFS is the MGen function. MGen is a generalization of the GSP Gen function
of GSP. It generates a set of candidate sequences of various lengths given a set of frequent
sequences (represented by MFSS) of various lengths. MGen takes three parameters, namely,
MFSS—the set of all maximal frequent sequences known so far; Iteration—a loop counter
that MFS maintains; and AlreadyCounted—a set of sequences whose supports have already
been counted.
MGen generates candidate sequences by “joining” MFSS with itself (lines 3–8). For every
pair of frequent sequences s1 and s2 in MFSS that have a common subsequence x, MGen
generates candidates by prepending an item i1 from s1 to x and appending an item i2
from s2 to x. For example, if s1 = ⟨{A,B},{C},{D,E,F}⟩ and s2 = ⟨{B},{C,G},{H}⟩,
then a common subsequence would be ⟨{B},{C}⟩. By the above generation rule, MGen
would generate the candidates ⟨{A,B},{C,G}⟩ and ⟨{A,B},{C},{H}⟩. Note that, unlike
in GSP Gen, the generating sequences s1 and s2 can have different lengths. After a candidate
sequence s is generated, it is removed (from the candidate set) if any one of the following
conditions is true (lines 9–13):
• s ⊑ MFSS. This implies that s is already known to be frequent.
• s ∈ AlreadyCounted. This implies that the support of s has been counted already.
• Some subsequence of s with length |s| − 1 is not known to be frequent. This implies
that s cannot be frequent.
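These filtering conditions translate directly into a candidate filter. The sketch below is our reading of lines 9–13 of figure 2, not the figure's exact code; it assumes hashable canonical sequence representations and a hypothetical helper one_shorter_subsequences(s) enumerating the length-(|s|−1) subsequences of s.

def filter_candidates(cands, mfss, already_counted, known_frequent):
    """Drop candidates that need not have their supports counted."""
    kept = []
    for s in cands:
        if any(contains(m, s) for m in mfss):      # s contained in MFSS: already frequent
            continue
        if s in already_counted:                   # support counted before
            continue
        if not all(sub in known_frequent
                   for sub in one_shorter_subsequences(s)):
            continue                               # s cannot be frequent
        kept.append(s)
    return kept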
4.1. Theorems
In this subsection we summarize a few properties of MFS, and show its correctness. Details
of the proofs are given in the appendix. For notational convenience, we use C_GGen and C_MGen
to represent the sets of candidate sequences generated by GSP Gen and MGen, respectively.
Both C_GGen and C_MGen include all length-1 sequences.
Theorem 1. Given a suggested frequent sequence set Sest, C_GGen ⊆ C_MGen ∪ {s | s ⊑ Sest}.

Since C_GGen and C_MGen ∪ {s | s ⊑ Sest} are the sets of candidate sequences whose supports
are counted by GSP and MFS, respectively, Theorem 1 says that all candidates whose supports
are counted by GSP are also counted by MFS. Consequently, if GSP finds all frequent
sequences, so does MFS. Therefore, MFS is correct.
Theorem 2. C_MGen ⊆ C_GGen.
Theorem 2 says that the set of candidates generated by MGen is a subset of that generated
by GSP Gen. Thus MGen does not generate any unnecessary candidates and does not waste
resources for counting their supports.
Theorem 3. If Sest = ∅, then C_MGen = C_GGen.
Theorem 3 says that without an initial estimate, MFS and GSP are equivalent in the sense
that they consider the same set of candidate sequences.
5. Incremental update algorithms
In this section we describe a pruning technique that, when applied to GSP and MFS, results in
two efficient incremental update algorithms, GSP+ and MFS+. In our model, the database is
updated continuously and incrementally. In the following discussion, D is an old database
version; Δ− and Δ+ are the sets of sequences removed from and added to D, respectively.
The update results in a new database version D′. The algorithms can be similarly
applied when the database is subsequently changed again, with D′ taking the role of D,
etc.
The general idea of the incremental update algorithms is that, given a sequence s, we use
the support count of s in D (if available) to deduce whether s could have enough support in
the updated database D′. In the deduction process, the portion of the database that has been
changed, namely, Δ− and Δ+, might have to be scanned. If we deduce that s cannot be
frequent in D′, s's support in D− (the portion of the database that has not been changed) is
not counted. If D− is large compared with Δ− and Δ+, the pruning technique saves much
CPU cost.
Before we present the algorithms, let us consider a few mathematical equations that
allow us to perform the pruning deductions. (As a reminder, readers are referred to Table 1
for the symbol definitions.)
First of all, since D′ = (D − Δ−) ∪ Δ+ = D− ∪ Δ+, we have, for any sequence s,

  δ_D^s = δ_{D−}^s + δ_{Δ−}^s,    (1)
  δ_{D′}^s = δ_{D−}^s + δ_{Δ+}^s,    (2)
  δ_{D′}^s = δ_D^s + δ_{Δ+}^s − δ_{Δ−}^s.    (3)
Let us define b_X^s = min_{s′} δ_X^{s′} for any sequence s and database X, where (s′ ⊑ s) ∧ (|s′| =
|s| − 1). That is to say, if s is a length-k sequence, b_X^s is the smallest support count among the
length-(k−1) subsequences of s in the database X. Since the support count of a sequence
s must not be larger than the support count of any subsequence of s, b_X^s is an upper bound
of δ_X^s.
The reason for considering b_X^s is that it allows us to estimate δ_X^s without counting it. As
we will see later, under both GSP+ and MFS+, a candidate sequence s is considered (and
may have its support counted) only if all of s's subsequences are frequent. Since frequent
sequences would already have their supports counted (in order to conclude that they
are frequent), we would have the necessary information to deduce b_X^s when we consider
s.
To illustrate how the bound is used in the deduction, let us consider the following simple
Lemmas:
Lemma 1. If a sequence s is frequent in D′, then
δ_D^s + b_{Δ+}^s ≥ δ_D^s + b_{Δ+}^s − δ_{Δ−}^s ≥ |D′| × ρs.

Proof: If s is frequent in D′, we have

  |D′| × ρs ≤ δ_{D′}^s    (by definition)
            = δ_D^s + δ_{Δ+}^s − δ_{Δ−}^s    (by Equation (3))
            ≤ δ_D^s + b_{Δ+}^s − δ_{Δ−}^s
            ≤ δ_D^s + b_{Δ+}^s.

Thus Lemma 1 follows.
Given a sequence s, if s is frequent in D, we know δ_D^s. If b_{Δ+}^s is available, we can
compute δ_D^s + b_{Δ+}^s and conclude that s is not frequent in D′ if the quantity is less than
|D′| × ρs. Otherwise, we scan Δ− to find δ_{Δ−}^s. We conclude that s is not frequent in D′ if
δ_D^s + b_{Δ+}^s − δ_{Δ−}^s is less than the required support count (|D′| × ρs). Note that in the above
cases, the deduction is made without processing D− or Δ+.
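In code, the Lemma 1 tests are ordered from cheapest to most expensive. In this sketch (our naming, not the paper's pseudocode), count_in_deleted is a callable so that Δ− is scanned only if the first test fails.

def prune_by_lemma1(delta_D, b_plus, required, count_in_deleted):
    """True if a candidate s that is frequent in D cannot be frequent in D'.
    delta_D: support count of s in D; b_plus: the bound b on s's support
    count in the inserted sequences; required: |D'| * rho_s."""
    if delta_D + b_plus < required:
        return True                                    # pruned with no scan at all
    if delta_D + b_plus - count_in_deleted() < required:
        return True                                    # pruned after scanning the deleted part only
    return False                                       # s's support must be counted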
If a sequence s is not frequent in D, δ_D^s is unavailable. The pruning tricks derived from
Lemma 1 are thus not applicable. However, being not frequent in D means that the support
of s (in D) is small. The following lemma allows us to prune such sequences.
Lemma 2. If a sequence s is frequent in D′ but not in D, then
b_{Δ+}^s ≥ b_{Δ+}^s − δ_{Δ−}^s ≥ δ_{Δ+}^s − δ_{Δ−}^s > (|Δ+| − |Δ−|) × ρs.

Proof: If s is frequent in D′ but not in D, we have, by definition,

  δ_{D′}^s ≥ |D′| × ρs = (|D−| + |Δ+|) × ρs,
  δ_D^s < |D| × ρs = (|D−| + |Δ−|) × ρs.

Hence,

  δ_{D′}^s − δ_D^s > (|D−| + |Δ+|) × ρs − (|D−| + |Δ−|) × ρs
  δ_{Δ+}^s − δ_{Δ−}^s > (|Δ+| − |Δ−|) × ρs.    (by Equation (3))

Also,

  b_{Δ+}^s ≥ b_{Δ+}^s − δ_{Δ−}^s ≥ δ_{Δ+}^s − δ_{Δ−}^s.

Lemma 2 thus follows.
Given a candidate sequence s that is not frequent in the old database D, we first compare
b_{Δ+}^s against (|Δ+| − |Δ−|) × ρs. If b_{Δ+}^s is not large enough, s cannot be frequent in D′,
and hence s can be pruned. Otherwise, we scan Δ− to find b_{Δ+}^s − δ_{Δ−}^s and see if s can be
pruned. If not, we scan Δ+ and consider δ_{Δ+}^s − δ_{Δ−}^s.
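The Lemma 2 tests follow the same cheapest-first pattern; again, the callables are our illustration of when each changed part of the database is scanned.

def prune_by_lemma2(b_plus, rho_s, n_inserted, n_deleted,
                    count_in_deleted, count_in_inserted):
    """True if a candidate s that is not frequent in D cannot be frequent in D'."""
    threshold = (n_inserted - n_deleted) * rho_s   # (|Δ+| − |Δ−|) × ρs
    if b_plus <= threshold:
        return True                                # pruned with no scan
    d_minus = count_in_deleted()
    if b_plus - d_minus <= threshold:
        return True                                # pruned after scanning the deleted part
    if count_in_inserted() - d_minus <= threshold:
        return True                                # pruned after scanning the inserted part
    return False                                   # the unchanged part must be scanned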
Similar to Lemma 1, Lemma 2 allows us to prune some candidate sequences without
completely counting their supports in the updated database.
5.1. GSP+
Based on the lemmas, we modify GSP to incorporate the pruning techniques described
above, obtaining a new algorithm, GSP+.
GSP+ shares the same structure as GSP. GSP+ is an iterative algorithm. During each
iteration i, a set of candidate sequences Ci is generated based on Li−1 (the set of frequent
sequences of length i − 1). Before the database is scanned to count the supports of the
candidate sequences, the pruning tests derived from Lemmas 1 and 2 are applied. Depending
on the test results, the datasets Δ− and/or Δ+ may have to be processed to count the support
of a candidate sequence. GSP+ carefully controls when such countings are necessary. If all
pruning tests fail on a candidate sequence s, GSP+ checks whether s is frequent in D. If
so, δ_D^s is available. Hence, δ_{D′}^s can be computed as δ_D^s + δ_{Δ+}^s − δ_{Δ−}^s. Finally, if s is not
frequent in D, the unchanged part of the database, D−, is scanned to find out the actual
support of s. Since D− is typically much larger than Δ+ and Δ−, saving is achieved by
avoiding processing D− for certain candidate sequences. As we will see later in Section 6,
the pruning tests can prune up to 60% of the candidate sequences in our experiment setting.
The tests are thus quite effective.
In terms of I/O performance, GSP+ generally has a slightly higher I/O cost than that of
applying GSP directly on the updated database D′. This is because GSP+ scans and processes
Δ−, the deleted portion of the database, which is not needed by GSP. However, in some
cases, the pruning tests (in GSP+) remove all candidate sequences during an iteration of the
algorithm. In such cases, GSP+ saves some database passes and is slightly more I/O
efficient than GSP.
5.2. MFS+
The pruning tests can also be applied to MFS. We call the resulting algorithm MFS+. The
interesting thing about MFS+ is that it uses the set of frequent sequences (L_old) of the old
database D as an initial estimate of the set of frequent sequences of the new database D′.
These sequences, together with all possible length-1 sequences, are put into a candidate set
CandidateSet. MFS+ then scans Δ−, D−, and Δ+ to obtain δ_{Δ−}^s, δ_{D−}^s, and δ_{Δ+}^s for each
sequence s in CandidateSet. From these counts, we can deduce which sequences in CandidateSet
are frequent in D′. The maximals of such frequent sequences are put into the set MFSS.
MFS+ then executes a loop, trying to refine MFSS. During each iteration, MFS+ generates a
set of candidate sequences CandidateSet from MFSS. MFS+ then deduces which candidate
sequences must not be frequent by applying the pruning tests. In the process, the datasets
Δ− and Δ+ may have to be scanned. For those sequences that are not pruned by the tests,
D− is scanned to obtain their exact support counts. Since sequences that are originally
frequent in the old database D have already had their supports (w.r.t. D′) counted in the
initial part of MFS+, all candidate sequences considered by MFS+ in the loop section are
not frequent in D. Hence, only those pruning tests resulting from Lemma 2 are used. MFS+
terminates when no refinement is made to MFSS during an iteration.
6. Performance
We performed a number of experiments comparing the various algorithms for the sequence
mining problem and the incremental update problem. We study the amount of I/O savings
MFS could achieve, and how effective sampling is in discovering an initial estimate of the set
of frequent sequences Sest. We also study the effectiveness of the pruning technique when
applied to GSP and MFS in solving the incremental update problem. The experiments were
done on a Sun Enterprise 4000 machine with 12 UltraSparc II 250 MHz CPUs and 3 GB of
main memory running Solaris 7. In this section we present some representative results from our
experiments.
6.1. Synthetic dataset
We used synthetic data as the test databases. The data was generated using the generator of
the IBM Quest data mining project; readers are referred to the IBM Quest project documentation
for the details of the data generator. The generator is also used in many other sequence mining studies (e.g., Agrawal
and Srikant, 1995; Srikant and Agrawal, 1996; Zaki, 2000; Pei et al., 2001). The values of
the parameters used in the data generation are listed in Table 9.
6.2. Coverages and I/O savings
Recall that MFS requires an estimated frequent sequence set, Sest. Obviously, the number of
frequent sequences contained in Sest affects how much I/O gain MFS can achieve. In one
extreme case, if Sest contains no frequent sequences, MFS reduces to GSP, and it needs the
same number of I/O passes as GSP. In the other extreme case, if Sest contains all frequent
sequences, MFS can finish the whole mining process in only two database scans: one pass to
get all frequent sequences, the other to verify that no other frequent sequences can be found.
Table 9. Parameters and values for data generation.

  Parameter   Description                                                            Value
  |C|         Average number of transactions (itemsets) per sequence                 10
  |T|         Average number of items per transaction (itemset)                      2.5
  |S|         Average number of itemsets in maximal potentially frequent sequences   4
  |I|         Average size of itemsets in maximal potentially frequent sequences     1.25
  N_S         Number of maximal potentially frequent sequences                       5,000
  N_I         Number of maximal potentially frequent itemsets                        25,000
  N           Number of items                                                        10,000
Figure 3. No. of I/O passes needed by MFS versus coverage under different ρs.
In this subsection we study how the “coverage” of Sest affects the performance of MFS.
By coverage, we roughly mean the fraction of all the frequent sequences that are contained
in Sest. We formally define coverage by:

  coverage = |{s | s ⊑ Sest} ∩ (∪_{i=2}^∞ Li)| / |∪_{i=2}^∞ Li|

where Li represents the set of all frequent sequences of length i. Notice that we only consider
those frequent sequences of length 2 or longer. This is because all length-1 sequences will
be checked by MFS during its first scan of the database (see figure 2, line 20). Therefore,
whether Sest contains frequent length-1 sequences or not is immaterial to the number of I/O
passes required by MFS.
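Given the full mining result, coverage is a straightforward ratio; the sketch below reuses the contains test of Section 2 and is merely our illustration of the formula.

def coverage(s_est, frequent):
    """Fraction of frequent sequences of length >= 2 contained in S_est."""
    long_freq = [s for s in frequent
                 if sum(len(itemset) for itemset in s) >= 2]
    covered = sum(1 for s in long_freq
                  if any(contains(e, s) for e in s_est))
    return covered / len(long_freq)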
In our first experiment, we generated a database of about 131,000 sequences using the
parameter values listed in Table 9. We then applied GSP on the database to obtain all frequent
sequences. After that, we randomly selected some frequent sequences to form an estimated
set Sest. MFS was then applied on the database with the obtained Sest. The number of I/O
passes taken by MFS was noted. We did the experiment using different support thresholds
ρs. The results for ρs = 0.3%, 0.4%, and 0.5% are shown in figure 3.
In figure 3, the x-axis is the coverage of Sest, and the y-axis is the number of I/O passes taken
by MFS. We use three kinds of points to represent the three support thresholds
0.3%, 0.4%, and 0.5%, respectively. For example, point A has coordinates
(0.718, 4). This means that when the support threshold was 0.5% and the coverage of
Sest was 0.718, MFS took 4 I/O passes. Note that the lines connecting points of the same
kind are there for legibility reasons only and should not be interpreted as interpolation.
We see that when coverage increases, the number of I/O passes required by MFS decreases.
In general, the curves show the following trend:
• When coverage = 0 (e.g., when Sest is empty), MFS degenerates to GSP. Therefore, the
number of I/O passes taken by MFS is the same as that of GSP. Hence, the y-intercepts of
the curves show the I/O costs of GSP under their respective support thresholds.
• When coverage is small, Sest would contain very few long frequent sequences. This is
because if Sest covers a long frequent sequence s, it also covers every subsequence of
s. These subsequences are frequent, and if s is long, they are numerous. The coverage of
Sest would thus be high. Since few long frequent sequences are covered by Sest, quite a
number of I/O passes are required to discover them. Hence, with a small coverage, MFS
does not reduce the I/O cost at all.
• When coverage is moderate, MFS becomes more effective. The amount of I/O saving
increases with the coverage.
• When coverage is 100% (i.e., Sest covers all frequent sequences in the database), MFS
requires only two passes over the database: one pass to find all frequent sequences in Sest,
another pass to verify that no more frequent sequences can be found.
6.3. Sampling
As we have mentioned, one way to obtain Sest is to mine a sample of the whole database.
We performed a set of experiments to study the effectiveness of the sampling approach.
In this set of experiments, we used a database of about 131,000 sequences and we fixed
the support threshold at ρs = 0.4%. The experiments were done as follows. We first applied
GSP on our synthetic dataset to obtain the number of I/O passes it required. Then a random
sample of the database was drawn. After that, we ran GSP on the sample to obtain Sest for
MFS. When mining the sample, GSP used a sample support threshold ρs-sample = ρs = 0.4%.
We then executed MFS with the Sest found. This exercise was repeated 64 times, each with
a different random sample. Finally, the experiment was repeated using various sample
sizes.
We compared the performance of GSP and MFS in three aspects—I/O cost, the number
of candidates they counted, and CPU cost. The I/O cost of MFS includes the cost of mining
the sample. For example, if a 1/8 sample was used, if GSP scanned the sample database 8
times to obtain Sest, and if MFS took 4 database scans to finish, then the total I/O cost of MFS
is calculated as 1/8 × 8 + 4 = 5 passes.
The number of candidates MFS counted is measured by the following formula: (# of
candidates counted to obtain Sest) × (sample size) + (# of candidates counted in MFS).
Similarly, the amount of CPU time MFS took included that of mining the sample.
The result of the experiments with different sample sizes is shown in Table 10. The values
of MFS are averages over 64 runs. Note that the average number of candidates counted and
the average CPU cost of MFS are shown in relative quantities with those of GSP set to 1.
From Table 10, we see that MFS required fewer I/O passes than GSP (8 passes). As the
sample size increases, the coverage of Sest becomes higher, and fewer I/O passes are required
for MFS to discover all the frequent sequences given Sest. This accounts for the drop of I/O
cost from 6.777 passes to 5.807 passes as the sample size increases from 1/128 to 1/8.
As the sample size increases further, however, the I/O cost of mining the sample becomes
substantial. The benefit obtained by having a better-coverage Sest is outweighed by the
penalty of mining the sample. Hence, the overall I/O cost increases as the sample size
increases from 1/8 to 1/2. Also, similar trends are observed for the number of candidates
being counted and the CPU cost.
Table 10. Performance of MFS vs. sample size (ρs = 0.4%, ρs-sample = 0.4%).

  Sample size       1/128   1/64    1/32    1/16    1/8     1/4     1/2     0 (GSP)
  Avg. coverage     0.718   0.758   0.788   0.793   0.861   0.900   0.924   n/a
  Avg. I/O cost     6.777   6.533   6.090   6.059   5.807   6.262   7.945   8
  Avg. # of cand.   1.685   1.279   1.149   1.106   1.146   1.265   1.503   1
  Avg. CPU cost     3.693   1.699   1.320   1.198   1.174   1.249   1.441   1
The experiment result shows that sampling is a good way to get Sest. For example, even if
the sample size is as small as 1/128, the coverage of Sest is over 70%. It also shows that MFS
can achieve good I/O efficiency by sampling. For a 1/8 sample, for example, MFS reduces
the I/O cost by about (8 − 5.807)/8 = 27% at the expense of a 17% increase in CPU cost. In all
our experiments, we find that a sample size of 1/8 or 1/16 generally gives a large reduction
in I/O cost with a small expense in CPU cost.
In the above experiments, the support threshold used in mining the sample (ρs-sample) is
the same as the support threshold used for mining the whole database (ρs). What if we use
a ρs-sample different from ρs? Intuitively, if ρs-sample < ρs, more sequences in the
sample will satisfy the support requirement. Therefore, more sequences will be included in
Sest. This will potentially improve the coverage of Sest, and hence a larger reduction in I/O
cost is achieved. The disadvantage is a larger CPU cost, since more support counting would
have to be done in the sample-mining process as well as in verifying which sequences in
Sest are frequent.
To study the effect of using a smaller ρs-sample, we conducted an experiment varying the
ratio ρs-sample/ρs from 1.0 to 0.7. In the experiment, we fixed the sample size at 1/8 and ρs
at 0.4%. Figure 4 shows the performance of MFS. In the figure, the average I/O cost and the
average CPU cost of MFS use the left scale and the right scale, respectively. The number
associated with each point shows the average coverage of Sest given the corresponding
ρs-sample/ρs value.
Figure 4. Performance of MFS vs. ρs-sample/ρs.
From the figure, we see that a smaller ρs-sample gives a smaller I/O cost and a larger CPU
cost. For example, when ρs-sample equals 0.9 × ρs, MFS took about 5 passes over the data
(GSP took 8) at a CPU penalty of 27%. Figure 4 shows that MFS is a flexible algorithm: one
can adjust ρs-sample to achieve efficiency based on the system's characteristics. For example,
if the machine on which mining is performed has a fast CPU but a slow storage device, a
small ρs-sample should be used.
6.4. CPU cost
Our previous experiments show that in order to achieve I/O efficiency, MFS has to pay a
non-trivial CPU penalty. In this subsection we discuss the CPU requirement of MFS. We will
show that the relative CPU cost of MFS (compared with GSP) decreases as the database
size increases. For large databases, the CPU penalty is negligible.
Compared with GSP, there are two reasons why MFS might require a higher CPU cost.
First, there is the cost of mining a sample. Second, the candidate generation function
MGen is more complicated than GSP_Gen: there are usually more subsequence tests in the
generation process, and more pruning tests to check. However, we remark that the
generation function is executed once per iteration of the algorithm, and its cost is independent
of the database size. On the other hand, there is a factor that makes MFS more CPU
efficient than GSP. Recall that under MFS or GSP, during each iteration, each sequence s in
the database is matched against a set of candidate sequences to see which candidates are
contained in s, so that their support counts can be increased. Since MFS performs fewer database
scans than GSP, the database sequences are matched against candidate sequences
fewer times. This makes MFS more CPU efficient. We note that the larger the database, the
more prominent is the saving achieved. Let us call this factor the support counting benefit.
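The following minimal sketch shows this per-pass matching loop; `is_subsequence` is an assumed helper that tests whether a candidate is contained in a database sequence.

```python
def count_supports(database, candidates, is_subsequence):
    """One database pass: every database sequence s is matched against
    every candidate c, and the support count of each contained candidate
    is increased. Per pass, the matching work is proportional to
    |database| * |candidates|, so an algorithm that finishes in fewer
    passes performs fewer matchings overall."""
    support = {c: 0 for c in candidates}
    for s in database:
        for c in candidates:
            if is_subsequence(c, s):
                support[c] += 1
    return support
```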
To illustrate, we performed an experiment varying the database size from the original
setting of 131,000 sequences to 2.5 million sequences. Again, a 1/8 sample was used and
ρs was set to 0.4%. Figure 5 shows the relative CPU cost of MFS (over GSP) as the database
size changes.
Figure 5. Relative CPU cost of MFS versus database size.
From the figure, we see that as the database size increases, the relative CPU cost of MFS
decreases. This is because the penalty of a more expensive candidate generation process
suffered by MFS does not grow with the database size. Hence, when the database is very
large, the candidate generation penalty is negligible compared with the other two factors
that affect MFS's CPU performance, namely, the sampling penalty and the support counting
benefit. Since these two factors are affected by the database size in a similar way, the relative
CPU cost of MFS approaches a constant factor as the database becomes very large. From
Figure 5, we see that under our experiment setting, the CPU penalty of MFS is only about 4%
for large databases. We remark that the relative I/O saving achieved by MFS is unaffected
by the database size.
6.5. MFS vs. SPADE
In this subsection we briefly compare the performance of MFS and SPADE in a limited-memory
environment. To study the effect of memory availability on SPADE's performance,
we compared SPADE and MFS running on a PC with 512 MB of memory. As
we have discussed, SPADE is a very efficient algorithm for small databases. Its performance,
however, is less impressive if the database size is large relative to the amount
of memory available. Figure 6 shows that when the database size is large (e.g., when there
are more than 1.4 million sequences), the performance of SPADE degrades greatly due to
memory paging.
6.6. Incremental update
In this section we study the performance of the algorithms GSP, MFS, GSP+, and MFS+ when
they are applied to the incremental update problem. For GSP, the algorithm is simply applied
to the updated database D′. For MFS, it is applied to D′ using the set of frequent sequences of
the old database D as Sest. Both GSP+ and MFS+ use the mining result of D and the pruning
technique to discover the frequent sequences in D′.
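A minimal sketch of this seeding is shown below; `mfs` stands in for a callable MFS implementation, and its keyword arguments are illustrative assumptions.

```python
def incremental_mine(d_new, old_frequent, mfs, rho_s):
    """The frequent sequences mined from the old database D serve directly
    as the initial estimate S_est when mining the updated database D';
    unlike the sampling approach, no sample needs to be drawn or mined."""
    return mfs(d_new, s_est=old_frequent, min_support=rho_s)
```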
Figure 6. Execution time of MFS and SPADE under different database sizes.
Figure 7. Comparison of four algorithms under different support thresholds.
We conducted a series of experiments using a database D of 1.5 million sequences. We
first executed a sequence mining program on D to obtain all frequent sequences and their
support counts. Then, 10% (150,000) of the sequences in D were deleted. These sequences
formed the dataset ∆−. Another 10% (150,000 sequences), which formed the set ∆+, were
added to form the updated database D′. After that, the four algorithms were executed
to mine D′. In the experiment, we varied the support threshold ρs from 0.35% to 0.65%.
We compare the CPU costs and I/O costs of the four algorithms. I/O cost is measured in
terms of database scans normalized by the size of D′. For example, if GSP scans D′ 8 times,
then the I/O cost of GSP is 8. For an incremental algorithm, if it reads ∆− n1 times, D− (= D
with ∆− removed) n2 times, and ∆+ n3 times, then its I/O cost is (n1|∆−| + n2|D−| + n3|∆+|)/|D′|. We note that
while the I/O costs of GSP and MFS are integral numbers (because D′ is the only dataset they
read), those of GSP+ and MFS+ can be fractional.
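As a small illustration, this normalization can be computed as follows; the parameter names are illustrative.

```python
def incremental_io_cost(n1, n2, n3, size_dm, size_d_minus, size_dp, size_d_new):
    """I/O cost of an incremental algorithm in units of full scans of D':
    Delta- (of size size_dm) is read n1 times, D- (D with Delta- removed,
    of size size_d_minus) n2 times, and Delta+ (of size size_dp) n3 times;
    the weighted total is normalized by |D'|."""
    return (n1 * size_dm + n2 * size_d_minus + n3 * size_dp) / size_d_new
```

Figure 7 shows the experimental results.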
Figure 7(a) shows that as ρs increases, the CPU costs of all four algorithms decrease. This
is because a larger ρs means fewer frequent sequences, and thus there are fewer candidate
sequences whose supports need to be counted.
Among the four algorithms, GSP has the highest CPU cost. Since MFS uses the mining
result of D as the estimate Sest, no time is spent on mining a sample. Because of the support
counting benefit, MFS is more CPU efficient than GSP for incremental update. We also see
that GSP+ and MFS+ are more CPU efficient than their respective counterparts. The saving
is obtained by the pruning effect of the incremental algorithms: some candidate sequences
are pruned by processing only ∆− and/or ∆+, so scanning the set D− for them is avoided. In our experiment
setting, the size of D− is 9 times that of ∆+ and ∆−. Avoiding the counting of the supports of the
pruned candidate sequences in D− results in a significant CPU cost reduction.
Table 11 shows the effectiveness of the pruning tests used in GSP+. The total number
of candidate sequences processed by the pruning tests is shown in the second row, and
the number of them that require the scanning of D− is shown in the third row. From the
table, we see that only about 40% of the candidate sequences require the processing of D−
to obtain their support counts. In other words, about 60% of the candidate sequences are
pruned. This accounts for the saving in CPU cost achieved by GSP+ over GSP. Similarly,
MFS+ outperforms MFS mainly due to candidate pruning.
Table 11. Effectiveness of pruning tests.

ρs                            0.35%    0.4%     0.45%    0.5%     0.55%    0.6%     0.65%
Total # of candidates         34,065   18,356   10,024   5,812    3,365    2,053    1,160
# requiring scanning of D−    13,042   7,161    3,966    2,313    1,353    867      509
Percentage (row 3 / row 2)    38%      39%      40%      40%      40%      42%      44%
Finally, when ρs becomes large, the CPU times of the algorithms are roughly the same.
This is because a large support threshold implies few and short frequent sequences. In such
a case, GSP and MFS take a similar number of iterations, and hence the CPU saving of MFS over
GSP diminishes. Moreover, there are far fewer candidate sequences when ρs is large,
and thus far fewer opportunities for the incremental algorithms to prune candidate
sequences.
Figure 7(b) shows the I/O costs of the four algorithms. We see that as ρs increases, the
I/O costs of the algorithms decrease. This is due to the fact that a larger ρs leads to shorter
frequent sequences. Hence, the number of iterations (and database scans) the algorithms
take is small.
Comparing GSP and MFS, we see that MFS is a very I/O-efficient algorithm. Since MFS
uses the frequent sequences in D as an estimate of those in D′, this gives MFS a head start
and allows it to discover all maximal frequent sequences in many fewer database scans.
The incremental algorithms generally require a slightly higher I/O cost than their non-incremental
counterparts. The reason is that the incremental algorithms scan and process ∆−
to make pruning deductions, which is not needed by GSP or MFS. However, in some cases, the
pruning tests remove all candidate sequences during an iteration of the algorithm. In such
cases, the incremental algorithms save some database passes. For example, when ρs < 0.45%,
GSP+ has a smaller I/O cost than GSP (see Figure 7(b)).
From Figure 7 we can conclude that GSP has a high CPU cost and a high I/O cost. GSP+
reduces the CPU requirement but does not help in terms of I/O. MFS is very I/O-efficient
and it also performs better than GSP in terms of CPU cost. MFS+ is the overall winner. It
requires the least amount of CPU time and its I/O cost is comparable to that of MFS.
6.6.1. Varying |∆+| and |∆−|. As we have discussed, GSP+ and MFS+ achieve efficiency
by processing ∆+ and ∆− to prune candidate sequences. The performance of GSP+ and
MFS+ thus depends on how large ∆+ and ∆− are. Intuitively, the larger the deltas, the
more effort has to be paid. We ran an experiment to study how |∆+| and |∆−| affect
the performance of GSP+ and MFS+. In the experiment, we set |D| = |D′| = 1.5 million
sequences and ρs = 0.5%, and we varied |∆+| and |∆−| from 15,000 to 600,000 (1%–40% of
|D|). Figure 8 shows the experiment results.
Figure 8(a) shows that the CPU costs of GSP and MFS stay relatively steady. This is
expected since |D′| does not change. The CPU costs of GSP+ and MFS+, on the other hand,
increase linearly with the sizes of ∆+ and ∆−. This increase is due to the longer processing
time taken by GSP+ and MFS+ to deal with ∆+ and ∆−. From the figure, we see that MFS+
outperforms the others even when the database is changed substantially. In particular, if |∆+|
Figure 8. Comparison of four algorithms under different insertion and deletion sizes.
and |∆−| are less than 375,000 sequences (or 25% of |D|), MFS+ is the most CPU-efficient
algorithm in our experiment. For cases in which only a small fraction of the database is
changed, the incremental algorithms achieve significant performance gains.
Finally, as we have discussed previously, GSP+ and MFS+ usually have slightly higher I/O
costs than their non-incremental counterparts, since they have to read ∆−. Therefore, the
I/O performance difference between GSP and GSP+ (and also that between MFS and MFS+)
widens as |∆−| increases. This is shown in Figure 8(b).
We also conducted experiments varying |∆+| with |∆−| fixed, varying |∆−| with
|∆+| fixed, varying |∆+| with |∆−| = 0, and varying |∆−| with |∆+| = 0. The results of
these experiments are the same as the ones described here.
7. Conclusion
In this paper we proposed an I/O-efficient algorithm MFS for mining frequent sequences.
A new candidate generation function MGen was proposed, which can generate candidate
sequences of various lengths given a set of frequent sequences of various lengths. Because
long sequences are generated and processed early, MFS can effectively reduce the I/O cost.
Experimental results show that MFS saves I/O passes significantly compared with GSP, especially
when an estimate (Sest) of the set of frequent sequences with a good coverage is
available. We showed how mining a small sample of the database leads to a good Sest. By
using a smaller support threshold (ρs-sample) in mining the sample, we showed that MFS
outperformed GSP in I/O cost by a wide margin. The I/O saving is obtained, however, at a
mild CPU cost. We showed that the CPU penalty is insignificant for large databases.
We also put forward a candidate pruning technique for incremental update of frequent
sequences. By applying the pruning technique to GSP and MFS, we obtained two efficient
incremental algorithms, GSP+ and MFS+. We performed extensive experiments comparing
the performance of the incremental algorithms and their non-incremental counterparts. Our
results showed that the pruning technique is very effective. In particular, MFS+ is shown
to be both I/O and CPU efficient.
Appendix
In this appendix we prove the theorems mentioned in Section 4.1. Recall that CGGen and
CMGen represent the sets of candidate sequences generated by GSP_Gen and MGen, respectively.
Both CGGen and CMGen include all length-1 sequences.
Theorem 1. Given a suggested frequent sequence set Sest, CGGen ⊆ CMGen ∪ {s | s ⊑ Sest}, where s ⊑ Sest means that s is a subsequence of some sequence in Sest.
Proof: Let X = CGGen − (CMGen ∪ {s | s ⊑ Sest}) = {s | s ∈ CGGen ∧ s ∉ CMGen ∧ s ̸⊑ Sest}.
Suppose X ≠ ∅.
We select a shortest sequence s ∈ X. s must be of length 2 or longer, because all length-1
sequences belong to CMGen ∪ {s | s ⊑ Sest}.
s ∈ X
⇒ s ∈ CGGen (by definition of X)
⇒ all length-(|s|−1) subsequences of s are frequent (by the generation rule of GSP_Gen)
⇒ all length-(|s|−1) subsequences of s are in CGGen (by the correctness of GSP)
Now,
all length-(|s|−1) subsequences of s are not in X (s is a shortest sequence in X)
⇒ all length-(|s|−1) subsequences of s are in CMGen ∪ {s | s ⊑ Sest} (by definition of X)
⇒ all length-(|s|−1) subsequences of s are found to be frequent by MFS (by the logic of MFS)
⇒ the support of s is counted by MFS (by the logic of MFS)
⇒ s ∈ CMGen ∪ {s | s ⊑ Sest} (by the logic of MFS)
⇒ s ∉ X (by definition of X)
This contradicts the fact that s ∈ X. Therefore, X = ∅, i.e., CGGen ⊆ CMGen ∪ {s | s ⊑ Sest}.
Theorem 2. CMGen ⊆ CGGen.
Proof: ∀s ∈ CMGen:
all length-(|s|−1) subsequences of s are frequent (by the logic of MGen)
⇒ s ∈ CGGen (by the logic of GSP_Gen)
Therefore, CMGen ⊆ CGGen.
Theorem 3. If Sest = ∅, then CMGen = CGGen.
Proof: If Sest = ∅, Theorem 1 becomes CGGen ⊆ CMGen. By Theorem 2, we have CMGen ⊆ CGGen.
Hence, CMGen = CGGen.
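For reference, and using the notation above, the three results can be restated compactly in standard set notation (here s ⊑ Sest again abbreviates "s is a subsequence of some sequence in Sest"):

```latex
\begin{align*}
\textbf{Theorem 1:}\quad & C_{GGen} \subseteq C_{MGen} \cup \{\, s \mid s \sqsubseteq S_{est} \,\}\\
\textbf{Theorem 2:}\quad & C_{MGen} \subseteq C_{GGen}\\
\textbf{Theorem 3:}\quad & S_{est} = \emptyset \;\Longrightarrow\; C_{MGen} = C_{GGen}
\end{align*}
```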
Notes
1. A length-i sequence is one that contains i items.
2. This is because computing the id-list of a length-2 sequence requires accessing the 2 id-lists of the 2 items
involved.
3. ⟨i1, s, i2⟩ is the sequence obtained by adding item i1 to the beginning of sequence s and adding item i2 to the
end of s. Whether i1 is in a separate itemset in the result sequence is determined by whether i1 is in a separate
itemset in s1; similarly, whether i2 is in a separate itemset in the result sequence is determined by whether i2 is
in a separate itemset in s2.
References
Agrawal, R., Imielinski, T., and Swami, A.N. 1993. Mining association rules between sets of items in large
databases. In Proc. ACM SIGMOD International Conference on Management of Data, Washington, D.C.,
pp. 207–216.
Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Proc. of the 11th Int'l Conference on Data
Engineering. Taipei, Taiwan, pp. 3–14.
Ayan, N.F., Tansel, A.U., and Arkun, E. 1999. An efficient algorithm to update large itemsets with early pruning.
In Proc. 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego,
CA, USA, pp. 287–291.
Cheung, D.W., Han, J., Ng, V., and Wong, C.Y. 1996a. Maintenance of discovered association rules in large
databases: An incremental updating technique. In Proc. 12th IEEE International Conference on Data Engi-
neering (ICDE). New Orleans, Louisiana, USA, pp. 106–114.
Cheung, D.W., Lee, S.D., and Kao, B. 1997. A general incremental technique for maintaining discovered asso-
ciation rules. In Proc. International Conference on Database Systems for Advanced Applications (DASFAA).
Melbourne, Australia, pp. 185–194.
Cheung, D.W., Ng, V., and Tam, B.W. 1996b. Maintenance of discovered knowledge: A case in multi-level
association rules. In Proc. Second International Conference on Knowledge Discovery and Data Mining (KDD).
Portland, Oregon, pp. 307–310.
Garofalakis, M.N., Rastogi, R., and Shim, K. 1999. SPIRIT: Sequential pattern mining with regular expression
constraints. In Proceedings of the 25th International Conference on Very Large Data Bases. Edinburgh, Scotland,
UK, pp. 223–234.
IBM. http://www.almaden.ibm.com/cs/quest/.
Lee, S., Cheung, D.W., and Kao, B. 1998. Is sampling useful in data mining? A case in the maintenance of
discovered association rules. Data Mining and Knowledge Discovery, 2:233–262.
Lee, S.D. and Cheung, D.W. 1997. Maintenance of discovered association rules: When to update? In Proc. 1997
ACM-SIGMOD Workshop on Data Mining and Knowledge Discovery (DMKD). Tucson, Arizona.
Omiecinski, E. and Savasere, A. 1998. Efficient mining of association rules in large dynamic databases. In Proc.
BNCOD’98, pp. 49–63.
Parthasarathy, S., Zaki, M.J., Ogihara, M., and Dwarkadas, S. 1999. Incremental and interactive sequence mining.
In Proceedings of the 1999 ACM 8th International Conference on Information and Knowledge Management
(CIKM’99). Kansas City, MO USA, pp. 251–258.
Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., and Hsu, M.-C. 2001. PrefixSpan: Mining
sequential patterns by prefix-projected growth. In Proc. 17th IEEE International Conference on Data Engineering
(ICDE). Heidelberg, Germany, pp. 215–224.
Provost, F., Jensen, D., and Gates, T. 1999. Efficient progressive sampling. In Proceedings of the fifth ACM
SIGKDD international conference on Knowledge discovery and data mining. San Diego CA, USA, pp. 23–32.
Sarda, N.L. and Srinivas, N.V. 1998. An adaptive algorithm for incremental mining of association rules. In Proc.
DEXA Workshop’98, pp. 240–245.
Srikant, R. and Agrawal, R. 1996. Mining sequential patterns: Generalizations and performance improvements.
In Proc. of the 5th Conference on Extending Database Technology (EDBT). Avignon, France, pp. 3–17.
Thomas, S., Bodagala, S., Alsabti, K., and Ranka, S. 1997. An efficient algorithm for the incremental updation of
association rules in large databases. In Proc. KDD’97, pp. 263–266.
Wang, K. 1997. Discovering patterns from large and dynamic sequential data. Journal of Intelligent Information
Systems, 9:33–56.
Zaki, M.J. 2000. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, pp. 31–60.
Zhang, M., Kao, B., Cheung, D., and Yip, C.-L. 2002. Efficient algorithms for incremental update of frequent
sequences. In Proc. of the sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
Taiwan, pp. 186–197.
Zhang, M., Kao, B., Yip, C., and Cheung, D. 2001. A GSP-Based efficient algorithm for mining frequent sequences.
In Proc. of IC-AI’2001. Las Vegas, Nevada, USA, pp. 497–503.