On Reducing Classifier Granularity in Mining Concept-Drifting Data Streams∗
Peng Wang† Haixun Wang‡ Xiaochen Wu† Wei Wang† Baile Shi†
†Fudan University, China {041021057,0124078,weiwang1,bshi}@fudan.edu.cn
‡IBM T. J. Watson Research Center, U.S.A. haixun@us.ibm.com
Abstract
Many applications use classification models on stream-
ing data to detect actionable alerts. Due to concept drifts
in the underlying data, how to maintain a model's up-to-dateness has become one of the most challenging tasks in
mining data streams. State of the art approaches, including
both the incrementally updated classifiers and the ensemble
classifiers, have proved that model update is a very costly
process. In this paper, we introduce the concept of model
granularity. We show that reducing model granularity will
reduce model update cost. Indeed, models of fine granularity enable us to efficiently pinpoint local components in the model that are affected by a concept drift. They also enable us to derive new components that can easily integrate with the model to reflect the current data distribution, thus
avoiding expensive updates on a global scale. Experiments
on real and synthetic data show that our approach is able
to maintain good prediction accuracy at a fraction of model
updating cost of state of the art approaches.
1 Introduction
Traditional classification methods work on static data,
and they usually require multiple scans of the training data
in order to build a model [13, 7, 6]. The advent of new ap-
plication areas such as ubiquitous computing, e-commerce,
and sensor networks leads to intensive research on data
streams. In particular, mining data streams for actionable
insights has become an important and challenging task for
a wide range of applications [5, 9, 14, 3, 8].
State of the art For many applications, the major chal-
lenge in mining data streams lies not in the tremendous data
volume, but rather, in the time changing nature of the data,
a.k.a. the concept drifts [9, 15]. If the data distribution is
static, we can always use a subset of the data to learn a fixed
model and use it for all future data. Unfortunately, the data distribution is constantly changing, which means the models have to be constantly revised to reflect the current characteristics of the data.
∗This work is partially supported by the National Natural Science Foun-
dation of China (No. 69933010, 60303008).
It is not difficult to see why model update incurs a ma-
jor cost. Stream classifiers that handle concept drifts can
be roughly divided into two categories. The first category
is known as the incrementally updated classifiers. The
CVFDT approach [9], for example, uses a single decision
tree to model streams with concept drifts. However, even a
slight drift of the concept may trigger substantial changes in
the tree (e.g., replacing old branches with new branches, re-
growing or building alternative sub-trees), which severely
compromise learning efficiency. Aside from this undesirable aspect, incremental methods also suffer in prediction accuracy, because they discard old examples at a fixed rate (whether or not they represent the changed concept). Thus, the learned model is supported only by the data in the current window – a snapshot that contains a relatively small amount of data. This causes large variances in prediction.
The second category is known as the ensemble classifiers. Instead of maintaining a single model, the ensemble approach divides the stream into data chunks of fixed size, and learns a classifier from each chunk [15]. To make a prediction, all valid classifiers have to be consulted, which is an expensive process. Besides, the ensemble approach has a high model update cost: i) it keeps learning new models on new data, whether or not the data contains concept drifts; ii) it keeps checking the accuracy of old models by applying each of them to the new training data. This introduces considerable cost in modeling high-speed data streams.
If models are not updated in a timely manner because of high update cost, their prediction accuracy will eventually drop. This
causes a severe problem, especially to applications that han-
dle large volume streaming data at very high speed.
Model Granularity In this paper, we introduce the con-
cept of model granularity. We argue that state of the art
stream classifiers incur a significant model update cost be-
cause they use monolithic models that do not allow a se-
mantic decomposition.
Current approaches for classifying stream data are
adapted from algorithms designed for static data, for which
monolithic models are not a problem. For incrementally
updated classifiers, the fact that even a small disturbance
from the data may bring a complete change to the model in-
dicates that monolithic models are not appropriate for data
streams. The ensemble approach lowered model granularity
Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05)
1550-4786/05 $20.00 © 2005 IEEE
by dividing the model into components, each making independent predictions. However, this division is not semantic-aware: in the face of concept drifts, it is still very costly to tell which components are affected and hence must be replaced, and what new components must be brought in to represent the new concept. In this sense, the ensemble model is still monolithic.
In this paper, we are concerned with the problem of reducing model update cost in classifying data streams. We observe that in most real life applications, concepts evolve slowly, which means we should be able to avoid making global changes to the model all the time. We achieve this by building a model consisting of semantic-aware components of fine granularity. In the face of concept drifts, this enables us to figure out, in an efficient way, i) which components are affected by the current concept drift, and ii) what new components shall be introduced to model the new concept without affecting the rest of the model. We show that our approach gives accurate predictions at a fraction of the cost required by state of the art stream classifiers.
Our Contribution In summary, our paper makes the fol-
lowing contributions:
• We propose the concept of model granularity, and show that granularity plays an important role in determining model update cost.
• We introduce a model consisting of semantic-aware components of fine granularity. It enables us to immediately pinpoint components in the model that become obsolete when concept drifts occur.
• We introduce a low cost method to revise the model. Instead of learning new model components from raw data, we propose a heuristic that derives new components efficiently from a novel synopsis data structure.
• Experiments show that our model update cost is reduced without compromising classification accuracy.
2 Related Work
Traditional classification methods, including, for example, C4.5 and the Bayesian network, are designed for static data. Toward the same goal, rule-based approaches called associative classification have also been proposed [11, 10]. A rule-based classifier is composed of high quality association rules learned from training data sets using user-specified support and confidence thresholds. Since association rules explore highly confident associations among multiple variables, the rule-based approach overcomes the constraint of the decision-tree induction method, which examines one variable at a time. As a result, rule-based classifiers usually have higher accuracy than the traditional classification methods. However, the techniques used in these methods focus on mining rules in static data, and do not apply to infinite data streams, nor do they handle concept drifts.
In the wake of recent interest in data stream applications, several classification algorithms have been introduced for streaming data [9, 15]. The CVFDT [9] is an incremental
approach. It refines a decision tree by continuously incor-
porating new data from the data stream. In order to handle
concept drifts, it retires old examples at a predetermined
“fixed rate” and discards or re-grows sub-trees. However,
because decision trees are unstable structures, even slight
changes in the underlying data distribution may trigger sub-
stantial changes in the tree, which may severely compro-
mise learning efficiency. The ensemble approach [15], on
the other hand, constructs a weighted ensemble of clas-
sifiers. Classifiers in the ensemble are learned from data
chunks of fixed size. However, whether concept drifts occur or not, it keeps training new classifiers and recomputing the weights of existing classifiers in the ensemble, which introduces considerable cost and makes it vulnerable when dealing with high-speed data streams.
Our work introduces a rule-based stream classifier. It
aims at maintaining a model made up of tiny components
that are individually revisable. To access these components efficiently, we use tree-based index structures. Using trees as summary structures of data streams has been studied mostly in the field of finding frequent patterns in data streams. Manku et al. [12] proposed an approximate
algorithm that mines frequent patterns over data in the en-
tire stream up to now. The estDec method [2] finds recent
frequent patterns, and it defines frequency using an aging
function. The Moment algorithm [4] uses index trees to
mine closed frequent patterns in a sliding window. How-
ever, although similar data structures are used, they are for
different purposes. Their goal is to find all patterns whose
occurrence is above a threshold, and the problem of predic-
tion is not considered.
3 Motivation
For streams with concept drifts, model updates are unavoidable. Our task is to design a model that can be updated easily and efficiently. To achieve this goal, we need
to ensure the following:
1. The model is decomposable into smaller components,
and each component can be revised independently of
other components;
2. The decomposition is semantic-aware in the sense that
when concept drift occurs, there is an efficient way to
pinpoint which component is obsolete, and what new
component shall be built.
Clearly, the incrementally updated classifiers meet nei-
ther of the two requirements, while the ensemble classifiers
satisfy only the first one. Our motivation is to reduce update
cost by reducing model granularity. We first use decision
tree classifiers to illustrate the impact of model granularity
on update cost. Then, we introduce a rule-based classifier,
and discuss the possibility it offers to reduce update cost.
3.1 Monolithic Models
We consider a stream as a sequence of records r1, ..., rk, ..., where each record has d attributes A1, ..., Ad. In addition, each training record is associated with a class label Ci. We place a moving window on the stream. Let Wi denote the window over records ri, ..., ri+w−1, where w is the window size. From window Wi, we learn a model Ci.
We use decision trees to illustrate the cost of model update. The reason is that decision trees are considered "interpretable" models; that is, unlike a black box, their semantics allow them to be updated incrementally. Figure 1 shows a data stream with moving windows of size 6. Each record has 3 attributes and a class label. The decision tree for W1 is shown in Figure 2(a). After the arrival of records r7 and r8, the window moves to W3. The decision tree of W3 is shown in Figure 2(b).
Although the majority of the data, and the concept it embodies, remain the same, we find that the two decision trees are completely different. This illustrates that small disturbances in the data stream may cause global changes in the model. Thus, even for an interpretable model, incrementally maintaining the model is, in many cases, as costly as rebuilding one from scratch.
The ensemble approach uses multiple models. However, the models are usually homogeneous, each of which tries to capture the global data characteristics as accurately as possible. In this sense, each model is still monolithic, and it is replaced as a whole when accuracy drops [15].
Figure 1. Moving windows on streaming data
(a) W1(b) W3
Figure 2. Decision tree models
3.2 Rule-based Models
A rule has the form p1 ∧ p2 ∧ ... ∧ pk → Cj, where Cj is a class label, and each pi is a predicate of the form Ai = v. We also denote p1 ∧ p2 ∧ ... ∧ pk as a pattern.
We learn rules from the records in each window Wi. If a rule's support and confidence are above the predefined thresholds minsup and minconf, we call it a valid rule. All valid rules learned from window Wi form a classifier Ci.
We use rules to model the data shown in Figure 1. Assume minsup and minconf are set at 0.3 and 0.8 respectively. The valid rules of W1 are¹
a1, b2 → C1;  b1 → C2;  a3, c1 → C3;  a3 → C3   (1)
After the window moves to W3, the valid rules become
c3, b2 → C4;  b1 → C2;  a3, c1 → C3;  a3 → C3   (2)
From (1) and (2) above, we make the following observations: i) only the first rule has changed, which shows that, indeed, a small disturbance in the data does not always require an overall model update; ii) rule-based models have very low granularity, because each component, whether a rule, a pattern, or a predicate, is interpretable and replaceable.
Thus, our goal is that, as long as the majority of the data remains the same between two windows, we shall be able to slightly change certain components of the model to maintain its up-to-dateness. In the rest of the paper, we develop algorithms that allow us to i) efficiently pinpoint components that become outdated, and ii) efficiently derive new components to represent emerging concepts.
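As a concrete illustration of the rule notions above, here is a minimal Python sketch; the dictionary-based pattern representation, function names, and the toy window are our own assumptions, not the authors' code:

```python
def support_confidence(pattern, label, window):
    """pattern: {attribute: value} predicates; window: list of (record, class).
    support = fraction of window records matching the pattern;
    confidence = fraction of matching records carrying the rule's class."""
    matching = [c for r, c in window
                if all(r.get(a) == v for a, v in pattern.items())]
    sup = len(matching) / len(window)
    conf = (matching.count(label) / len(matching)) if matching else 0.0
    return sup, conf

def is_valid(pattern, label, window, minsup=0.3, minconf=0.8):
    """A rule is valid when both thresholds are cleared."""
    sup, conf = support_confidence(pattern, label, window)
    return sup >= minsup and conf >= minconf
```

With minsup = 0.3 and minconf = 0.8 as in the example above, a rule is kept only if at least 30% of the window matches its pattern and at least 80% of those matches carry its class label.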
4 A Low Granularity Stream Classifier
4.1 Overview
We maintain a classifier that consists of a set of rules. Let Wi be the most recent window, and let Ci be the classifier for Wi. When window Wi moves to Wi+1, we update the support and the confidence of the rules in Ci. The new classifier Ci+1 contains the old rules of Ci that are still valid (support and confidence above threshold) as well as the new rules we find in Wi+1. To classify an unlabelled record, we use the rule that has the highest confidence among all the rules that match the record. If the record does not match any valid rule, we classify it to be the majority class of the current window.
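The prediction procedure just described can be sketched as follows (a hypothetical Python rendering; the tuple-based rule representation is ours):

```python
def predict(record, rules, majority_class):
    """rules: list of (pattern, label, confidence), with pattern a
    {attribute: value} dict; return the label of the highest-confidence
    matching rule, or the majority class when nothing matches."""
    matching = [(conf, label) for pattern, label, conf in rules
                if all(record.get(a) == v for a, v in pattern.items())]
    return max(matching)[1] if matching else majority_class
```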
The main technique of our approach lies in its handling
of concept drifts.
• If the concept drifts are not too dramatic and the window of the stream has an appropriate size², most rules do not change their status from valid to invalid or from invalid to valid. In this case, our approach incurs minimal learning cost.
¹ When no confusion can arise, we use a1 to denote the predicate A1 = a1.
² Windows that are too big will build up conflicting concepts, and windows that are too small will give rise to the overfitting problem. Both have an adverse effect on prediction accuracy.
• Our approach detects concept drifts by tracking misclassified records. The tracking is performed by designating a historical window for each class as a reference for accuracy comparison. This allows us to efficiently pinpoint the components in the model that are outdated.
• We introduce a heuristic to derive new model components from the distribution of misclassified records, thus avoiding learning new models from the raw data again.
4.2 Dealing with concept drifts
We describe methods that detect concept drifts, pinpoint
obsolete model components, and derive new model compo-
nents efficiently.
4.2.1 Detecting Concept Drifts
We group rules by their class label. For each class, we designate a historical window as its reference window. To detect concept drifts related to class Ci, we compare the predictive accuracy of the rules corresponding to Ci in the reference window and in the current window. The rationale is that when the data distribution is stable and the window size is appropriate, the classifier has stable predictive accuracy. In other words, if at a certain point the accuracy drops considerably, it means some concept drifts have occurred and some new rules that represent the new concept may have emerged.
Predictive accuracy of a model is usually measured by
the percentage of records that the model misclassified. For
instance, in the ensemble approach [15], a previous classi-
fier is considered obsolete if it has a high percentage of mis-
classified records, in which case, the entire classifier will be
discarded. Clearly, this approach does not tell us which part
of the classifier gives rise to the inaccuracy so that only this
particular part needs to be updated.
In our approach, instead of using the percentage, we study the distribution of the misclassified records. Once a new record, say ri+w, arrives and the window moves to Wi+1, we derive classifier Ci from classifier Ci−1 by updating the confidence and the support of the rules in Ci−1; then we use Ci to classify ri+w.
Definition 1. Misclassified Records
Record ri+w is called a misclassified record if i) it is assigned a wrong label by classifier Ci, or ii) it has no matching rule in Ci.
We then group misclassified records by their true class labels, and we use the number of such misclassified records as an indicator of the stability of the stream.
Definition 2. Number of misclassified records: Nij
Let Wi be a window. Nij is the number of records in Wi whose true class is Cj but which are misclassified.
Nij indicates the number of misclassified records belonging to class Cj in window Wi. In [16], we prove that when the stream is stable (no concept drifts), Nij also remains stable; when Nij increases dramatically, a concept drift related to Cj may have occurred with high probability. Moreover, the misclassified records enable us to pinpoint the exact subset of rules that conflict with the new data distribution, and furthermore, through a careful analysis of them, we can derive the emerging new rules.
To measure whether an increase of Nij amounts to a concept drift in window Wi, we choose a historical window Wk as a reference. Note that the choice of a reference window shall depend on the data distribution. For example, if we always use the window immediately before Wi (that is, k = i−1) as the reference window, we may not be able to detect any concept drift, because concept drifts usually build up slowly, and only become apparent over a certain period of time.
Definition 3. Reference Window
Let Wi be the current window. We say window Wk is the reference window for class Cj if Nkj = min_{l≤i} Nlj.
Clearly, for different classes, the reference windows may be different. The reference window enables us to tell how far the concepts (with regard to a particular class) have drifted away from the state in which they are accurately modelled. With this knowledge, we can decide whether we need to mine new rules that model this concept. Formally, if the difference between Nij and Nkj reaches a user-defined threshold minWR, i.e., Nij − Nkj ≥ minWR, it may indicate that we need new rules for class Cj to model the change of concepts [16].
Nij is computed by the following equation:
Nij = Ni−1,j + g(ri+w−1, i, j) − g(ri−1, i, j)   (3)
where g(r, i, j) = 1 if r's true label is Cj and r is misclassified by Ci to some other class, and 0 otherwise. We refer the readers to [16] for the correctness of Eq (3).
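Eq (3) can be rendered as a small sketch, assuming a classifier callable that returns None when no rule matches (Definition 1 counts that case as a misclassification); the function names are ours:

```python
def g(record, true_label, classify, j):
    """g(r, i, j) from Eq (3): 1 iff r's true label is class j and the
    current classifier mislabels it (None models "no matching rule")."""
    return 1 if true_label == j and classify(record) != true_label else 0

def slide_N(N_prev, j, incoming, outgoing, classify):
    """Eq (3): N_{i,j} = N_{i-1,j} + g(r_in, i, j) - g(r_out, i, j);
    incoming/outgoing are (record, true_label) pairs."""
    return N_prev + g(*incoming, classify, j) - g(*outgoing, classify, j)
```

The update is O(1) per record per class, which is what makes tracking Nij cheap as the window slides.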
4.2.2 Finding new rules
Assume that in window Wi, we find Nij − Nkj ≥ minWR. We need to find new rules to deal with the drop in accuracy. To avoid learning the new rules from scratch, we analyze the misclassified records being tracked to find clues about the patterns of the new rules.
Assume all misclassified records whose true class label is Cj satisfy two predicates A1 = v1 and A2 = v2. Then, it is very likely that a new rule of the form P → Cj has emerged, where P contains one or both of the two predicates. On the other hand, if a predicate is satisfied by few misclassified records, the new rules probably do not contain the predicate. We use this heuristic to form new rules based on the information in the misclassified records.
Formally, we use Lij to denote the set of predicates each of which is satisfied by no fewer than c misclassified records that belong to class Cj. We represent Lij in the form of {pi : ci}, where pi is a predicate, and ci ≥ c is the number of misclassified records belonging to Cj that satisfy pi. We use Lij to generate candidate patterns of the new rules.
Let us return to the data stream in Figure 1 as an example. Assume minWR is 2. For window W1, classifier C1 (shown in Eq 1) classifies every record correctly, so we have N1,i = 0 for 1 ≤ i ≤ 4. For window W3, both r7 and r8 are misclassified, so N3,4 becomes 2. Since the increase of N3,4 is ≥ minWR, we decide to mine new rules in W3. For class C4, the predicates satisfied by misclassified records, together with their frequencies, are: {b2 : 2, c3 : 2, a1 : 1, a2 : 1}. We then decide that the new rule is very likely to have a pattern that includes predicate b2 and/or predicate c3, and we use these predicates to generate the pattern of the new rule, ignoring other patterns. It turns out that c3, b2 → C4 is exactly the new rule we are looking for (Eq 2).
We now describe our method of mining new rules more formally. We use a table T to store Lij for each class Cj. Table T is updated as the window moves, so it always contains the misclassified predicates and their frequencies in the most recent window. We update T as follows. When Wi becomes the new window and record ri−1 (the record that moves out) is a previously misclassified record whose true label is Cj, then for each attribute Ad, we decrease the count of Ad = vd by 1, where vd is the value of ri−1 for attribute Ad. We do the same for ri+w−1 (the record that moves in), but increase the count instead of decreasing it.
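The bookkeeping for table T can be sketched as follows; the class name and the dictionary-of-counters representation are our assumptions:

```python
from collections import Counter, defaultdict

class PredicateTable:
    """Sketch of table T: for each class, count how many misclassified
    records in the current window satisfy each (attribute, value)
    predicate. Structure names are illustrative."""
    def __init__(self):
        self.T = defaultdict(Counter)   # class label -> predicate counts

    def moves_in(self, record, true_label, misclassified):
        if misclassified:
            self.T[true_label].update(record.items())

    def moves_out(self, record, true_label, was_misclassified):
        if was_misclassified:
            self.T[true_label].subtract(record.items())

    def L(self, j, c):
        """L_{ij}: predicates satisfied by at least c misclassified
        records of class C_j, with their counts."""
        return {p: n for p, n in self.T[j].items() if n >= c}
```

Replaying the paper's example, two misclassified C4 records sharing b2 and c3 yield exactly the counts {b2 : 2, c3 : 2} at threshold c = 2.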
Algorithm MINERULE
1. sort all candidate predicates by their frequency
2. choose the top-K predicates p1, ..., pKto construct a
Candidate Rule Set (CRS)
3. scan the window to compute the support and confi-
dence of rules in CRS
4. add valid rules to the current classifier
For every w records (w is the window size), we compare Nij with Nkj for each j. We invoke procedure MINERULE to mine new rules if the difference exceeds minWR. First, we construct a set of candidate patterns. We sort all predicates by their occurrence frequencies in descending order. Then, we use the top-K predicates to construct the patterns. We restrict the patterns to be within a certain length N. The reasons are: i) a rule with many predicates has low support and cannot form a valid rule, ii) complex rules tend to overfit the data, and iii) evaluating rules with a long pattern is time consuming. Then, we construct a candidate rule set (CRS) for class Cj using the patterns we have obtained. We compute the support and the confidence of the rules in the CRS. If a candidate rule is valid, that is, its support and confidence exceed minsup and minconf, we add this new rule to the current classifier.
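A minimal sketch of procedure MINERULE under these constraints; the function signature and pattern representation are illustrative, not the authors' implementation:

```python
from itertools import combinations

def mine_rule(freqs, window, j, K=7, N=5, minsup=0.3, minconf=0.8):
    """Rank predicates by frequency among misclassified records of class j
    (freqs: {(attr, value): count}), form candidate patterns from the
    top-K predicates up to length N, then keep candidates whose
    support/confidence on the current window clear the thresholds."""
    top = [p for p, _ in sorted(freqs.items(), key=lambda kv: -kv[1])[:K]]
    valid = []
    for length in range(1, min(N, len(top)) + 1):
        for pattern in combinations(top, length):
            matching = [c for r, c in window
                        if all(r.get(a) == v for a, v in pattern)]
            if not matching:
                continue
            sup = len(matching) / len(window)
            conf = matching.count(j) / len(matching)
            if sup >= minsup and conf >= minconf:
                valid.append((dict(pattern), j, conf))
    return valid
```

On a toy window mirroring the running example, only the patterns built from b2 and c3 survive the thresholds; candidates involving the rarer predicates fall below minsup.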
One remaining problem is: are the new rules discovered by procedure MINERULE enough to represent the latest data distribution? After all, the constraints we use might have prevented us from discovering some subtle concepts. Indeed, if the constraints are relaxed, we may find more rules, but some just reflect noise in the data. The new rules will be evaluated as the window moves ahead. At some window Wi down the stream, when we are required to compare Nij and Nkj, we will have accumulated more statistics. In most cases, the introduction of the new rules in window Wi will have reduced Nij so that its difference from Nkj is smaller than minWR, which means the new rules are sufficient. In case the difference between Nij and Nkj is still larger than minWR, which means either we have missed some subtle rules or there is another concept drift, we invoke procedure MINERULE again. The experiments show that in most cases, we only need to apply procedure MINERULE once to get enough new rules for one concept drift.
4.3 The algorithm
To build a stream classifier that adapts efficiently to new concepts, one major issue is how to access the records and update the rules efficiently. In this section, we describe data structures and algorithms for this purpose.
4.3.1 Data Structure
We use two tree structures to maintain the rules and the
records for the most recent window.
The RS-tree We assume there is a total order among attributes A1 ≺ ··· ≺ Ad. We can sort predicates and patterns based on this order. We store the current rules in a prefix tree called the RS-tree. Each node N represents a unique rule R : P → Ci. A node N′ that represents rule P′ → Cj is a child node of N iff:
1. P ⊂ P′
2. P ≺ P′
3. no other rule P″ → Ck exists such that P ⊂ P″ ⊂ P′ and P ≺ P″ ≺ P′
A node stores sup(R) and conf(R) for the rule R it represents. An example RS-tree is shown in Figure 3(b). Node N1 represents rule (a1, b2) → C1, whose support and confidence are 0.33 and 1 respectively. Node N3 is the child of N2 since {a3} ⊂ {a3, c1} and {a3} ≺ {a3, c1}.
The REC-tree We think of each record r as a sequence ⟨vd, ..., v1, C⟩, where vi is r's value for attribute Ai, and C is r's class label. We insert record r in its sequence representation ⟨vd, ..., v1, Ci⟩ into a tree structure, which we call the REC-tree. A path from any internal node N to the root node represents a unique postfix {Ai = vi, Ai+1 = vi+1, ..., Ad = vd}.
Each internal node keeps a counter, which denotes how many records of the current window contain the postfix represented by the node. A node in the REC-tree may point to nodes in the RS-tree. Assume p1 ≺ ··· ≺ pk. Node N points to rule p1 ∧ p2 ∧ ··· ∧ pk → Ci in the RS-tree if:
1. node N satisfies p1
2. the postfix that starts at N contains the pattern p1 ∧ p2 ∧ ··· ∧ pk
Intuitively, node N represents a projection of a record r, and it points to all rules whose patterns r satisfies. For each record that moves into the window, we update the support
and the confidence of the rules it matches. The rule pointers
speed up this process.
An example REC-tree is shown in Figure 3(a). Record {a2, b1, c1 : C1} is stored on the left-most path. Node (b1 : 1) in the path points to rule b1 → C2 in the RS-tree.
The REC-tree is associated with an array of record ids [i, ..., i+w−1]. Each record id points to the leaf node that represents that record. When a new record arrives, we insert it into the REC-tree and also insert an entry into the rid array. The record id array enables us to access any record in the window efficiently.
(a) REC-tree (b) RS-tree
Figure 3. REC-tree and RS-tree
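The REC-tree bookkeeping above can be sketched as follows; the node class and function names are illustrative, and rule pointers plus the rid array are omitted for brevity:

```python
class RecNode:
    """Minimal REC-tree sketch: a record is inserted as the reversed
    sequence <v_d, ..., v_1, C>, sharing common postfixes; each node
    counts how many windowed records contain the postfix it represents."""
    def __init__(self):
        self.count = 0
        self.children = {}

def rec_insert(root, values, label):
    """values: [v_1, ..., v_d] in attribute order; returns the leaf node,
    which a record-id array entry would point to."""
    node = root
    for key in list(reversed(values)) + [label]:
        node = node.children.setdefault(key, RecNode())
        node.count += 1
    return node

def rec_delete(root, values, label):
    """Decrement counters along the path; zero-count nodes are kept
    lazily, as in the paper, and pruned only when the tree grows large."""
    node = root
    for key in list(reversed(values)) + [label]:
        node = node.children[key]
        node.count -= 1
```

Two records sharing a suffix of attribute values share a path prefix, which is what keeps the structure compact.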
4.3.2 Update the classifier using REC-tree
Assume the current window is Wi. When a new record ri+w arrives, the window becomes Wi+1 and we derive a new classifier Ci+1. First we insert ri+w into the REC-tree. We update the support and the confidence of the rules pointed to by the nodes involved in the insertion as follows:
supi+1(R) = (supi(R) · w + 1) / w
confi+1(R) = (confi(R) · supi(R) + 1) / (supi(R) + 1)   if Ci = Cj
confi+1(R) = (confi(R) · supi(R)) / (supi(R) + 1)       if Ci ≠ Cj
where supi(R) and confi(R) are the old support and confidence of R, and supi+1(R) and confi+1(R) are the new ones; Ci is the class label of ri+w and Cj is the class label of R.
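These updates can be sketched as below; note that the support formula treats sup as a fraction of the window size w, so in the confidence update we read supi(R) as the absolute match count n = supi(R) · w (our interpretation of the notation):

```python
def arrival_update(sup_i, conf_i, w, same_class):
    """Sketch of the update for a rule R whose pattern the incoming
    record matches. sup_i is a fraction of the window size w; n is the
    corresponding absolute number of matching records."""
    n = sup_i * w                      # records in the window matching R
    sup_next = (sup_i * w + 1) / w
    hit = 1 if same_class else 0       # record's class equals R's class
    conf_next = (conf_i * n + hit) / (n + 1)
    return sup_next, conf_next
```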
The insertion of ri+w may create new nodes, in which case the counter is set to 1. Moreover, new rule pointers, if necessary, are added to the node. To find which rules are matched by a postfix, we need to scan the RS-tree. Assume a new node represents Ai = v. Since the RS-tree is a prefix tree, we only need to scan the subtrees whose root's rule has Ai = v as the first predicate of its pattern.
We delete ri from the REC-tree and update the rules it matches. We decrease the counters of the nodes involved. When the counter of a node becomes 0, we do not delete it from the REC-tree immediately: since it contains the information of the rules it points to, this information can be reused when a record with the same postfix arrives. However, when the number of nodes in the REC-tree exceeds a threshold, we delete the nodes whose counters are 0.
4.3.3 The main algorithm
We now describe our algorithm as a whole. It contains two phases. The first phase is the initial phase. We use the first w records to train all valid rules for window W1. Based on them, we construct the RS-tree and the REC-tree. The second phase is the update phase. When record ri+w arrives, we insert it into the REC-tree and update the support and the confidence of the rules it matches. Then we delete the oldest record and also update the rules it matches accordingly. For every w records, we compare Ni+1,j and Nkj for each class label. If for some j their difference exceeds minWR, we apply procedure MINERULE to find the new rules. Algorithm UPDATE describes the update phase.
Algorithm UPDATE
Input: ri: record that moves out of the window;
Input: ri+w: record that moves into the window;
1. let N be the node that represents ri in the REC-tree;
2. for each node n from N to the root node
3.     decrement n's counter by 1;
4.     update the rules pointed to by n;
5. for m ← d to 1
6.     if Am = vm already exists in the REC-tree
7.     then increment its counter by 1;
8.         update the rules' support and confidence;
9.     else create a new node with counter = 1;
10.        scan the RS-tree and add rule pointers if necessary;
11. add a new entry in the record id array;
12. update Ni+1,j and Li+1,j;
13. if (i+1) mod w = 0 and (Ni+1,j − Nkj) ≥ minWR
14. then apply MINERULE;
5 Experiments
We conduct experiments on both synthetic and real life
data streams. Tests are carried out on a PC with a 1.7GHz
CPU and 256 MB main memory.
Datasets We create synthetic data with drifting concepts using a moving hyperplane. A hyperplane in d-dimensional space is represented by: Σ_{i=1..d} ai·xi = a0. Records satisfying Σ_{i=1..d} ai·xi < a0 are labelled positive, otherwise negative. Hyperplanes have been used to simulate time-changing concepts because the orientation and the position of the hyperplane can be changed in a smooth manner by changing the magnitude of the weights [9, 15].
We generate random records uniformly distributed in [0, 1]^d. Weights ai (1 ≤ i ≤ d) are initialized by random values in [0, 1]. We set a0 = (1/2) Σ_{i=1..d} ai so that the hyperplane cuts the multi-dimensional space into two parts of the same volume. Thus, roughly half of the examples are positive, and the other half are negative.
We simulate concept drifts using several parameters. Parameter k specifies the number of dimensions whose weights are changing. Parameter ti ∈ R, 1 ≤ i ≤ k,
specifies the magnitude of the change for weight ai, and si ∈ {−1, 1} specifies the direction of change for ai.
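The generator and the drift step can be sketched as follows; the function and parameter names are our assumptions:

```python
import random

def hyperplane_stream(d=10, n=1000, seed=7, weights=None):
    """Uniform records in [0,1]^d, weights a_i drawn from [0,1],
    a_0 = (1/2) * sum(a_i); a record is positive iff
    sum(a_i * x_i) < a_0, so roughly half the labels are positive."""
    rng = random.Random(seed)
    a = weights if weights is not None else [rng.random() for _ in range(d)]
    a0 = 0.5 * sum(a)
    out = []
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        label = 'pos' if sum(ai * xi for ai, xi in zip(a, x)) < a0 else 'neg'
        out.append((x, label))
    return a, out

def drift(a, dims, t, s):
    """Shift the weights of the chosen dimensions by s_i * t_i,
    mirroring the k / t_i / s_i drift parameters."""
    for i, ti, si in zip(dims, t, s):
        a[i] += si * ti
    return a
```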
We also used the real life dataset 'nursery' from the UCI ML Repository [1]. The dataset has 10,344 records and 8 dimensions. We randomly sample records from the dataset to generate a stream. To simulate concept drifts, for every 50,000 records sampled, we randomly select some attributes of the data set and change their values in a consistent way. One method we use is to shuffle the values; for instance, we change values a1 → a2 → ··· → an → a1 for all records and keep the class labels intact.
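The value-shuffling drift can be sketched as a cyclic relabeling of one attribute; the attribute and value names below are only for illustration:

```python
def cyclic_shift(records, attribute, values):
    """Simulated drift on a categorical attribute: map
    a1 -> a2 -> ... -> an -> a1 for every record, keeping class labels
    intact; 'values' fixes the cycle order."""
    nxt = {v: values[(i + 1) % len(values)] for i, v in enumerate(values)}
    return [{**r, attribute: nxt[r[attribute]]} for r in records]
```

Applying the shift consistently to all records changes the mapping between attribute values and classes, which is exactly what forces the rules mentioning that attribute to become invalid.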
5.1 Effect of model updating
We show the effectiveness of updating a rule-based classifier model in handling concept drifts. We compare three approaches: i) training an initial rule-based classifier and using it without change to classify streaming data, ii) continuously revising the initial classifier by updating the support and confidence of its rules, and iii) continuously revising the initial classifier by updating the confidence/support of the existing rules and discovering new rules.
In the synthetic dataset we use, each record has 10 di-
mensions. The window size is 5,000. We introduce a con-
cept drift by randomly choosing 4 dimensions and changing
their weights for every 50,000 records. The results in Figure 4 are obtained for a run with parameters minsup = .03, minconf = .7, minWR = 150, K = 7 (the top-K parameter), and N = 5 (the maximum pattern length).
[Figure 4. Error Rate Comparison — error rate (%) vs. data size (×5,000) for the static classifier, the update-only classifier, and the classifier that also mines new rules]
Figure 4 indicates that: 1) after the first concept drift, the accuracy of the initial classifier drops significantly, which shows that concept drifts can be effectively detected through misclassified records; 2) rule updating can adjust the classifier to the new concept; and 3) classifiers that mine new rules can further improve accuracy.
5.2 The relation of concept drifts and $N_{ij}$
In this subsection, we verify the approach of detecting concept drifts using abnormal increases of $N_{ij}$. We test it on a hyperplane dataset. After every 50,000 records, we adjust the hyperplane to simulate a concept drift. The result is shown in Figure 5; the two curves show $N_{i1}$ and $N_{i0}$ respectively. We can see that each time a concept drift happens, $N_{i1}$ and $N_{i0}$ burst dramatically, while they stay stable when the streaming data is stable.
[Figure 5. Relation of concept drifts and $N_{ij}$ — number of misclassified records vs. records (×5,000)]
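A simple burst detector over the per-window misclassification counts $N_{i0}$ and $N_{i1}$ could look like the following (a sketch only; the threshold rule and warm-up period are our own assumptions, not the paper's exact test):

```python
def detect_drift(history, n0, n1, factor=3.0, warmup=3):
    """Flag a concept drift when the current window's misclassified
    counts (n0 for class 0, n1 for class 1) burst well above their
    running averages over previous windows."""
    burst = False
    if len(history) >= warmup:
        avg0 = sum(h[0] for h in history) / len(history)
        avg1 = sum(h[1] for h in history) / len(history)
        burst = n0 > factor * avg0 or n1 > factor * avg1
    history.append((n0, n1))
    return burst
```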
5.3 Effect of rule composition
We verify the effectiveness of choosing the top-K candidate predicates when composing new rules. We compare its accuracy against choosing predicates randomly, using the same parameters as in the previous experiment. The result is shown in Figure 6. It shows that by using the most frequently occurring predicates in misclassified records to construct the CRS, we can obtain rules that effectively represent the new concept.
[Figure 6. Choosing literals — error rate (%) vs. data size (×5,000) for random predicate selection vs. top-K ('norandom') selection]
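Selecting the top-K most frequent predicates from misclassified records can be sketched as follows (our own illustration; predicates are simplified to (attribute, value) pairs):

```python
from collections import Counter

def top_k_predicates(misclassified, k):
    """Count (attribute, value) predicates over the misclassified
    records and return the k most frequent ones as candidates for
    composing new rules."""
    counts = Counter()
    for rec in misclassified:          # rec: dict of attribute -> value
        for attr, val in rec.items():
            counts[(attr, val)] += 1
    return [pred for pred, _ in counts.most_common(k)]
```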
5.4 Accuracy and time
While achieving similar or better classification accuracy, our rule-based approach is much faster than state-of-the-art stream classifiers. We compare our method with the ensemble classifier [15] and CVFDT [9] in terms of accuracy and run time on both synthetic and real-life data.
We used different parameters (e.g., window size, number of classifiers in the ensemble, etc.) to tune the classifiers. A typical result is shown in Figure 7, obtained using windows of size 10,000 and an ensemble of 10 classifiers, each trained on 1,000 records. We report the error rate for every 5,000 records.
Figure 7(a) shows that the accuracy of our rule-based
approach is higher than that of CVFDT, and is similar to
the ensemble classifier. Compared with CVFDT, our rule-
based approach can catch concept drifts and adjust the clas-
sifier more quickly. This is because the rule-based classifier
[Figure 7. Synthetic data — (a) error rate (%) and (b) runtime (s) vs. data size (×5,000) for the rule-based classifier, CVFDT, and the ensemble classifier (EC)]
[Figure 8. Real life data — (a) error rate (%) and (b) runtime (s) vs. data size (×5,000) for the rule-based classifier, CVFDT, and the ensemble classifier (EC)]
has low granularity, which means it can quickly determine which component to revise and quickly figure out how to revise it, while CVFDT has to learn the new concept by re-growing a decision tree level by level.
Figure 7(b) shows that our rule-based approach is much faster than the ensemble classifier and CVFDT. Unlike the ensemble classifier and CVFDT, which keep learning new models, most of the time our approach only adjusts the support and confidence of matched rules in the model. This operation is made very efficient by the RS-tree and REC-tree structures. Even when the new-rule detection procedure is triggered, the cost is still small, because the update has already been limited to a very small space, embodied by the top-K predicates. The second experiment is run on the real-life 'nursery' dataset, and the results shown in Figure 8 are consistent with those on the synthetic data.
6 Conclusion
An important task in mining data streams is to overcome the effects of concept drifts, which pose a major challenge to stream classification algorithms because of the high cost of keeping models up to date. Current stream classifiers are adapted from algorithms designed for static data, and they are hardly incrementally maintainable because it is difficult, if not impossible, to semantically break down the model into smaller pieces. In this paper, we addressed the issue of classifier granularity, and we showed that by reducing this granularity, change detection and model update can be made much more efficient without compromising classification accuracy, as reported by our extensive experiments.
References
[1] C. Blake and C. Merz. UCI repository of machine
learning databases. Univ. of California, Dept. of
Information and Computer Science, 1998.
[2] J. H. Chang and W. S. Lee. Finding recent fre-
quent itemsets adaptively over online data streams. In
SIGKDD, 2003.
[3] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang.
Multi-dimensional regression analysis of time-series
data streams. In VLDB, Hong Kong, China, 2002.
[4] Yun Chi, Haixun Wang, Philip S. Yu, and Richard R.
Muntz. Moment: Maintaining closed frequent item-
sets over a stream sliding window. In ICDM, 2004.
[5] P. Domingos and G. Hulten. Mining high-speed data
streams. In SIGKDD, pages 71–80, Boston, MA,
2000. ACM Press.
[6] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh.
BOAT– optimistic decision tree construction. In SIG-
MOD, 1999.
[7] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainFor-
est: A framework for fast decision tree construction of
large datasets. In VLDB, 1998.
[8] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan.
Clustering data streams. In FOCS, pages 359–366,
2000.
[9] G. Hulten, L. Spencer, and P. Domingos. Mining time-
changing data streams. In SIGKDD, pages 97–106,
San Francisco, CA, 2001. ACM Press.
[10] Wenmin Li, Jiawei Han, and Jian Pei. CMAR: Ac-
curate and efficient classification based on multiple
class-association rules. In ICDM, 2001.
[11] Bing Liu, Wynne Hsu, and Yiming Ma. Integrat-
ing classification and association rule mining. In
SIGKDD, 1998.
[12] G. Manku and R. Motwani. Approximate frequency
counts over data streams. In VLDB, 2002.
[13] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scal-
able parallel classifier for data mining. In VLDB, 1996.
[14] W. Nick Street and YongSeog Kim. A streaming en-
semble algorithm (SEA) for large-scale classification.
In SIGKDD, 2001.
[15] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han.
Mining concept-drifting data streams using ensemble
classifiers. In SIGKDD, 2003.
[16] Peng Wang, Haixun Wang, Xiaochen Wu, Wei
Wang, and Baile Shi. On reducing classi-
fier granularity in mining concept-drifting data
streams. Technical report, http://wis.cs.ucla.edu/˜
hxwang/publications/wangtech05.pdf, IBM T. J. Wat-
son Research Center, 2005.