On Reducing Classifier Granularity in Mining Concept-Drifting Data Streams
Peng Wang, Haixun Wang, Xiaochen Wu, Wei Wang, Baile Shi
Fudan University, China  {041021057, 0124078, weiwang1, bshi}@fudan.edu.cn
IBM T. J. Watson Research Center, U.S.A.  haixun@us.ibm.com
Abstract
Many applications use classification models on streaming data to detect actionable alerts. Due to concept drifts in the underlying data, how to maintain a model's up-to-dateness has become one of the most challenging tasks in mining data streams. State of the art approaches, including both the incrementally updated classifiers and the ensemble classifiers, have proved that model update is a very costly process. In this paper, we introduce the concept of model granularity. We show that reducing model granularity will reduce model update cost. Indeed, models of fine granularity enable us to efficiently pinpoint local components in the model that are affected by the concept drift. It also enables us to derive new components that can easily integrate with the model to reflect the current data distribution, thus avoiding expensive updates on a global scale. Experiments on real and synthetic data show that our approach is able to maintain good prediction accuracy at a fraction of the model updating cost of state of the art approaches.
1 Introduction
Traditional classification methods work on static data,
and they usually require multiple scans of the training data
in order to build a model [13, 7, 6]. The advent of new ap-
plication areas such as ubiquitous computing, e-commerce,
and sensor networks leads to intensive research on data
streams. In particular, mining data streams for actionable
insights has become an important and challenging task for
a wide range of applications [5, 9, 14, 3, 8].
State of the art. For many applications, the major chal-
lenge in mining data streams lies not in the tremendous data
volume, but rather, in the time changing nature of the data,
a.k.a. the concept drifts [9, 15]. If the data distribution is
static, we can always use a subset of the data to learn a fixed
model and use it for all future data. Unfortunately, the data
distribution is constantly changing, which means the mod-
els have to be constantly revised to reflect the current data
feature.
This work is partially supported by the National Natural Science Foun-
dation of China (No. 69933010, 60303008).
It is not difficult to see why model update incurs a ma-
jor cost. Stream classifiers that handle concept drifts can
be roughly divided into two categories. The first category
is known as the incrementally updated classifiers. The
CVFDT approach [9], for example, uses a single decision
tree to model streams with concept drifts. However, even a
slight drift of the concept may trigger substantial changes in
the tree (e.g., replacing old branches with new branches, re-
growing or building alternative sub-trees), which severely
compromise learning efficiency. Aside from this undesir-
able aspect, incremental methods are also hindered by their
prediction accuracy. This is so because they discard old examples at a fixed rate, regardless of whether they represent the changed concept or not. Thus, the learned model is supported only by the data in the current window – a snapshot that contains a relatively small amount of data. This causes large variances in prediction.
The second category is known as the ensemble classi-
fiers. Instead of maintaining a single model, the ensemble
approach divides the stream into data chunks of fixed size,
and learns a classifier from each chunk [15]. To make
a prediction, all valid classifiers have to be consulted, which
is an expensive process. Besides, the ensemble approach
has high model update cost: i) it keeps learning new mod-
els on new data, whether it contains concept drifts or not;
ii) it keeps checking the accuracy of old models by apply-
ing each of them on the new training data. This apparently
introduces considerable cost in modeling high speed data
streams.
If models are not updated timely because of high update
cost, their prediction accuracy will drop eventually. This
causes a severe problem, especially to applications that han-
dle large volume streaming data at very high speed.
Model Granularity. In this paper, we introduce the con-
cept of model granularity. We argue that state of the art
stream classifiers incur a significant model update cost be-
cause they use monolithic models that do not allow a se-
mantic decomposition.
Current approaches for classifying stream data are
adapted from algorithms designed for static data, for which
monolithic models are not a problem. For incrementally
updated classifiers, the fact that even a small disturbance
from the data may bring a complete change to the model in-
dicates that monolithic models are not appropriate for data
streams. The ensemble approach lowered model granularity
by dividing the model into components, each making independent predictions. However, this division is not semantic-aware: in the face of concept drifts, it is still very costly to tell which components are affected and hence must be replaced, and what new components must be brought in to represent the new concept. In this sense, the ensemble model is still monolithic.
In this paper, we are concerned with the problem of re-
ducing model update cost in classifying data streams. We
observe that in most real life applications, concepts evolve
slowly, which means we shall be able to avoid making
global changes to the model all the time. We achieve this
by building a model consisting of semantic-aware compo-
nents of fine granularity. In the face of concept drifts, this enables us to figure out, in an efficient way, i) which components are affected by the current concept drift, and ii) what new components shall be introduced to model the new concept without affecting the rest of the model. We show that our approach gives accurate predictions at a fraction of the cost required by state-of-the-art stream classifiers.
Our Contribution. In summary, our paper makes the fol-
lowing contributions:
- We propose the concept of model granularity, and show that granularity plays an important role in determining model update cost.
- We introduce a model consisting of semantic-aware components of fine granularity. It enables us to immediately pinpoint components in the model that become obsolete when concept drifts occur.
- We introduce a low cost method to revise the model. Instead of learning new model components from raw data, we propose a heuristic that derives new components efficiently from a novel synopsis data structure.
- Experiments show that our model update cost is reduced without compromising classification accuracy.
2 Related Work
Traditional classification methods, including, for exam-
ple, C4.5 and the Bayesian network, are designed for static
data. To the same goal, rule-based approaches called as-
sociative classification have also been proposed [11, 10].
A rule-based classifier is composed of high quality asso-
ciation rules learned from training data sets using user-
specified support and confidence thresholds. Since asso-
ciation rules explore highly confident associations among
multiple variables, the rule-based approach overcomes the
constraint of the decision-tree induction method which ex-
amines one variable at a time. As a result, they usually have
higher accuracy than the traditional classification methods.
However, the techniques used in these methods focus on mining rules in static data; they do not apply to infinite data streams, nor do they handle concept drifts.
In the wake of recent interest in data stream applications,
several classification algorithms have been introduced for
streaming data [9, 15]. The CVFDT [9] is an incremental
approach. It refines a decision tree by continuously incor-
porating new data from the data stream. In order to handle
concept drifts, it retires old examples at a predetermined
“fixed rate” and discards or re-grows sub-trees. However,
because decision trees are unstable structures, even slight
changes in the underlying data distribution may trigger sub-
stantial changes in the tree, which may severely compro-
mise learning efficiency. The ensemble approach [15], on
the other hand, constructs a weighted ensemble of clas-
sifiers. Classifiers in the ensemble are learned from data
chunks of fixed size. However, no matter whether concept
drifts occur or not, it keeps training new classifiers and re-
computing weights of existing classifiers in the ensemble,
which introduces considerable cost and makes the approach vulnerable in dealing with high-speed data streams.
Our work introduces a rule-based stream classifier. It
aims at maintaining a model made up of tiny components
that are individually revisable. To access these compo-
nents efficiently, we use tree based index structures. Us-
ing trees as summary structures of data streams has been
studied mostly in the field of finding frequent patterns in
data streams. Manku et al. [12] proposed an approximate
algorithm that mines frequent patterns over data in the en-
tire stream up to now. The estDec method [2] finds recent
frequent patterns, and it defines frequency using an aging
function. The Moment algorithm [4] uses index trees to
mine closed frequent patterns in a sliding window. How-
ever, although similar data structures are used, they are for
different purposes. Their goal is to find all patterns whose
occurrence is above a threshold, and the problem of predic-
tion is not considered.
3 Motivation
For streams with concept drifts, model updates are not
avoidable. Our task is to design a model that can be up-
dated easily and efficiently. To achieve this goal, we need
to ensure the following:
1. The model is decomposable into smaller components,
and each component can be revised independently of
other components;
2. The decomposition is semantic-aware in the sense that
when concept drift occurs, there is an efficient way to
pinpoint which component is obsolete, and what new
component shall be built.
Clearly, the incrementally updated classifiers meet nei-
ther of the two requirements, while the ensemble classifiers
satisfy only the first one. Our motivation is to reduce update
cost by reducing model granularity. We first use decision
tree classifiers to illustrate the impact of model granularity
on update cost. Then, we introduce a rule-based classifier,
and discuss the possibility it offers to reduce update cost.
3.1 Monolithic Models
We consider a stream as a sequence of records $r_1, \cdots, r_k, \cdots$, where each record has $d$ attributes $A_1, \cdots, A_d$. In addition, each training record is associated with a class label $C_i$. We place a moving window on the stream. Let $W_i$ denote the window over records $r_i, \ldots, r_{i+w-1}$, where $w$ is the window size. From window $W_i$, we learn a model $C_i$.
We use decision trees to illustrate the cost of model update. We choose decision trees because they are considered "interpretable" models; that is, unlike a black box, their semantics allow them to be updated incrementally. Figure 1 shows a data stream with moving windows of size 6. Each record has 3 attributes and a class label. The decision tree for $W_1$ is shown in Figure 2(a). After the arrival of records $r_7$ and $r_8$, the window moves to $W_3$. The decision tree of $W_3$ is shown in Figure 2(b).
Although the majority of the data, and the concept it embodies, remain the same, we find that the two decision trees are completely different. This illustrates that small disturbances in the data stream may cause global changes in the model. Thus, even for an interpretable model, in many cases incrementally maintaining the model is as costly as rebuilding one from scratch.
The ensemble approach uses multiple models. However, the models are usually homogeneous, each of which tries to capture the global data feature as accurately as possible. In this sense, each model is still monolithic, and it is replaced as a whole when its accuracy drops [15].
Figure 1. Moving windows on streaming data
(a) $W_1$ (b) $W_3$
Figure 2. Decision tree models
3.2 Rule-based Models
A rule has the form of p1p2... pkCj, where
Cjis a class label, and each piis a predicate in the form of
Ai=v. We also denote p1p2... pkas a pattern.
We learn rules from records in each window Wi.If
a rule’s support and confidence are above the predefined
threshold minsup and minconf, we call it a valid rule. All
valid rules learned from window Wiform a classifier Ci.
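To make the rule and pattern notation concrete, here is a minimal sketch (ours, not the paper's code) of how such a rule and its validity test could be represented; the class name Rule and the dictionary-based record layout are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """A rule 'pattern -> class_label' together with its current support/confidence."""
    pattern: frozenset        # set of predicates, e.g. frozenset({("A1", "a1"), ("A2", "b2")})
    class_label: str
    support: float = 0.0
    confidence: float = 0.0

    def matches(self, record: dict) -> bool:
        # A record satisfies the rule's pattern if every predicate A = v holds in it.
        return all(record.get(attr) == val for attr, val in self.pattern)

    def is_valid(self, minsup: float, minconf: float) -> bool:
        # "Valid" rules meet both user-specified thresholds.
        return self.support >= minsup and self.confidence >= minconf
```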
We use rules to model the data shown in Figure 1. Assume minsup and minconf are set at 0.3 and 0.8 respectively. The valid rules of $W_1$ are (see Footnote 1)

$$a_1, b_2 \rightarrow C_1; \quad b_1 \rightarrow C_2; \quad a_3, c_1 \rightarrow C_3; \quad a_3 \rightarrow C_3 \qquad (1)$$

After the window moves to $W_3$, the valid rules become

$$c_3, b_2 \rightarrow C_4; \quad b_1 \rightarrow C_2; \quad a_3, c_1 \rightarrow C_3; \quad a_3 \rightarrow C_3 \qquad (2)$$
From (1) and (2) above, we make the following observations: i) only the first rule has changed, which shows that a small disturbance in the data indeed does not always require a global model update; ii) rule-based models have very low granularity, because each component, whether a rule, a pattern, or a predicate, is interpretable and replaceable.
Thus, our goal is that, as long as the majority of the data
remains the same between two windows, we shall be able to
slightly change certain components of the model to main-
tain its up-to-dateness. In the rest of the paper, we develop
algorithms that allow us to i) efficiently pinpoint compo-
nents that become outdated, and ii) efficiently derive new
components to represent emerging concepts.
4 A Low Granularity Stream Classifier
4.1 Overview
We maintain a classifier that consists of a set of rules.
Let $W_i$ be the most recent window, and let $C_i$ be the classifier for $W_i$. When window $W_i$ moves to $W_{i+1}$, we update the support and the confidence of the rules in $C_i$. The new classifier $C_{i+1}$ contains the old rules of $C_i$ that are still valid (support and confidence above threshold) as well as new rules we find in $W_{i+1}$. To classify an unlabelled record, we use the rule that has the highest confidence among all the rules that match the record. If the record does not match any valid rule, we classify it as the majority class of the current window.
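The prediction step just described (highest-confidence matching valid rule, with a majority-class fallback) could look roughly as follows; this reuses the hypothetical Rule sketch from Section 3.2 and assumes window records are dictionaries with a "class" key.

```python
from collections import Counter

def classify(record: dict, rules: list, window: list, minsup: float, minconf: float) -> str:
    """Predict a class label for `record` using the current rule set."""
    matching = [r for r in rules
                if r.is_valid(minsup, minconf) and r.matches(record)]
    if matching:
        # Use the matching valid rule with the highest confidence.
        return max(matching, key=lambda r: r.confidence).class_label
    # No valid rule matches: fall back to the majority class of the current window.
    labels = Counter(rec["class"] for rec in window)
    return labels.most_common(1)[0][0]
```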
The main technique of our approach lies in its handling
of concept drifts.
- If the concept drifts are not too dramatic and the window of the stream has an appropriate size (see Footnote 2), most rules do not change their status from valid to invalid or from invalid to valid. In this case, our approach incurs minimal learning cost.
Footnote 1: When no confusion can arise, we use $a_1$ to denote the predicate $A = a_1$.
Footnote 2: Windows that are too big will build up conflicting concepts, and windows that are too small will give rise to the overfitting problem. Both have an adverse effect on prediction accuracy.
- Our approach detects concept drifts by tracking misclassified records. The tracking is performed by designating a historical window for each class as a reference for accuracy comparison. This allows us to efficiently pinpoint the components in the model that are outdated.
- We introduce a heuristic to derive new model components from the distribution of misclassified records, thus avoiding learning new models from the raw data again.
4.2 Dealing with concept drifts
We describe methods that detect concept drifts, pinpoint
obsolete model components, and derive new model compo-
nents efficiently.
4.2.1 Detecting Concept Drifts
We group rules by their class label. For each class, we des-
ignate a historical window as its reference window. To de-
tect concept drifts related to class Ci, we compare the pre-
dictive accuracy of rules corresponding to Ciin the refer-
ence window and in the current window. The rationale is
that when the data distribution is stable and the window size
is appropriate, the classifier has stable predictive accuracy.
In other words, if at certain point, the accuracy drops con-
siderably, it means some concept drifts have occurred and
some new rules that represent the new concept may have
emerged.
Predictive accuracy of a model is usually measured by
the percentage of records that the model misclassified. For
instance, in the ensemble approach [15], a previous classi-
fier is considered obsolete if it has a high percentage of mis-
classified records, in which case, the entire classifier will be
discarded. Clearly, this approach does not tell us which part
of the classifier gives rise to the inaccuracy so that only this
particular part needs to be updated.
In our approach, instead of using the percentage, we study the distribution of the misclassified records. Once a new record, say $r_{i+w}$, arrives and the window moves to $W_{i+1}$, we derive classifier $C_i$ from classifier $C_{i-1}$ by updating the confidence and the support of the rules in $C_{i-1}$; then we use $C_i$ to classify $r_{i+w}$.
Definition 1. (Misclassified Records) Record $r_{i+w}$ is called a misclassified record if i) it is assigned a wrong label by classifier $C_i$, or ii) it has no matching rule in $C_i$.
We then group misclassified records by their true class
labels, and we use the number of such misclassified records
as an indicator of the stability of the stream.
Definition 2. (Number of misclassified records $N_{ij}$) Let $W_i$ be a window. $N_{ij}$ is the number of records in $W_i$ whose true class is $C_j$ but which are misclassified.
$N_{ij}$ indicates the number of misclassified records belonging to class $C_j$ in window $W_i$. In [16], we prove that when the stream is stable (no concept drifts), $N_{ij}$ also stays stable; when $N_{ij}$ increases dramatically, a concept drift related to $C_j$ may have occurred with high probability. Moreover, the misclassified records enable us to pinpoint the exact subset of rules that conflict with the new data distribution; furthermore, through a careful analysis of them, we can derive the emerging new rules.
To measure whether an increase of $N_{ij}$ amounts to a concept drift in window $W_i$, we choose a historical window $W_k$ as a reference. Note that the choice of a reference window depends on the data distribution of the stream. For example, if we always use the window immediately before $W_i$ (that is, $k = i-1$) as the reference window, we may not be able to detect any concept drift, because concept drifts usually build up slowly and only become apparent over a certain period of time.
Definition 3. (Reference Window) Let $W_i$ be the current window. We say window $W_k$ is the reference window for class $C_j$ if $N_{kj} = \min_{l \le i} N_{lj}$.
Clearly, for different classes, the reference windows may be different. The reference window enables us to tell how far the concepts (with regard to a particular class) have drifted away from the state in which they are accurately modelled. With this knowledge, we can decide whether we need to mine new rules that model this concept. Formally, if the difference between $N_{ij}$ and $N_{kj}$ reaches a user-defined threshold minWR, i.e., $N_{ij} - N_{kj} \ge$ minWR, it may indicate that we need new rules for class $C_j$ to model the change of concepts [16].
$N_{ij}$ is computed by the following equation:

$$N_{ij} = N_{i-1,j} + g(r_{i+w-1}, i, j) - g(r_{i-1}, i, j) \qquad (3)$$

where $g(r, i, j) = 1$ if $r$'s true label is $C_j$ and $r$ is misclassified by $C_i$ to some other class, and 0 otherwise. We refer the readers to [16] for the correctness of Eq. (3).
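A small sketch of how $N_{ij}$ could be maintained incrementally in the spirit of Eq. (3) and compared against the reference-window minimum of Definition 3; the function and variable names are illustrative, not the authors'.

```python
def update_misclassified_count(N_prev: int, newest_missed: bool, oldest_missed: bool) -> int:
    """Eq. (3): N_ij = N_{i-1,j} + g(newest record) - g(oldest record)."""
    return N_prev + int(newest_missed) - int(oldest_missed)

def drift_detected(N_ij: int, reference_min: int, minWR: int) -> bool:
    """Class-wise drift test: compare against the smallest N_lj seen so far (reference window)."""
    return N_ij - reference_min >= minWR

# Example bookkeeping for one class j as the window slides:
# reference_min_j = min(reference_min_j, N_ij)
# if drift_detected(N_ij, reference_min_j, minWR): invoke MINERULE for class j
```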
4.2.2 Finding new rules
Assume that in window $W_i$ we find $N_{ij} - N_{kj} \ge$ minWR. We need to find new rules to deal with the drop in accuracy. To avoid learning the new rules from scratch, we analyze the misclassified records being tracked to find clues about the patterns of the new rules.
Assume all misclassified records whose true class label is $C_j$ satisfy two predicates $A_1 = v_1$ and $A_2 = v_2$. Then it is very likely that a new rule of the form $P \rightarrow C_j$ has emerged, where $P$ contains one or both of the two predicates. On the other hand, if a predicate is satisfied by few misclassified records, the new rules probably do not contain the predicate. We use this heuristic to form new rules based on the information in the misclassified records.
Formally, we use $L_{ij}$ to denote the set of predicates each of which is satisfied by no fewer than $c$ misclassified records that belong to class $C_j$. We represent $L_{ij}$ in the form $\{p_i : c_i\}$, where $p_i$ is a predicate and $c_i \ge c$ is the number of misclassified records belonging to $C_j$ that satisfy $p_i$. We use $L_{ij}$ to generate candidate patterns of the new rules.
Let us return to the data stream in Figure 1 as an example. Assume minWR is 2. For window $W_1$, classifier $C_1$ (shown in Eq. 1) classifies every record correctly, so we have $N_{1,i} = 0$ for $1 \le i \le 4$. For window $W_3$, both $r_7$ and $r_8$ are misclassified, so $N_{3,4}$ becomes 2. Since the increase of $N_{3,4}$ is $\ge$ minWR, we decide to mine new rules in $W_3$. For class $C_4$, the misclassified predicates and their misclassified frequencies are $\{b_2: 2,\ c_3: 2,\ a_1: 1,\ a_2: 1\}$. We then decide that the new rule is very likely to have a pattern that includes predicate $b_2$ and/or predicate $c_3$, and we use these predicates to generate the pattern of the new rule, ignoring other patterns. It turns out that $c_3, b_2 \rightarrow C_4$ is exactly the new rule we are looking for (Eq. 2).
We describe our method of mining new rules more formally. We use a table $T$ to store $L_{ij}$ for each class $C_j$. Table $T$ is updated as the window moves, so it always contains the misclassified predicates and their frequencies in the most recent window. We update $T$ as follows. When $W_i$ becomes the new window and record $r_{i-1}$ (the record that moves out) is a previously misclassified record whose true label is $C_j$, then for each attribute $A_d$ we decrease the count of $A_d = v_d$ by 1, where $v_d$ is the value of $r_{i-1}$ for attribute $A_d$. We do the same for $r_{i+w-1}$ (the record that moves in), but increase the count instead of decreasing it.
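A possible sketch of the table $T$ of misclassified-predicate counts and its sliding-window maintenance, assuming records are dictionaries whose non-class entries act as predicates; the per-class Counter layout is our choice, not the paper's data structure.

```python
from collections import Counter, defaultdict

# T[class_j] maps a predicate (attribute, value) to the number of currently
# windowed misclassified records of class_j that satisfy it.
T = defaultdict(Counter)

def record_predicates(record: dict):
    # Every (attribute, value) pair of the record is a candidate predicate.
    return [(a, v) for a, v in record.items() if a != "class"]

def on_record_leaves(record: dict, was_misclassified: bool):
    # The oldest record moves out of the window: decrease its predicate counts.
    if was_misclassified:
        T[record["class"]].subtract(record_predicates(record))

def on_record_arrives(record: dict, is_misclassified: bool):
    # The newest record moves into the window: increase its predicate counts.
    if is_misclassified:
        T[record["class"]].update(record_predicates(record))
```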
Algorithm MINERULE
1. sort all candidate predicates by their frequency;
2. choose the top-K predicates $p_1, \ldots, p_K$ to construct a Candidate Rule Set (CRS);
3. scan the window to compute the support and confidence of the rules in the CRS;
4. add valid rules to the current classifier.
For every $w$ records ($w$ is the window size), we compare $N_{ij}$ with $N_{kj}$ for each $j$. We invoke procedure MINERULE to mine new rules if the difference exceeds minWR. First, we construct a set of candidate patterns. We sort all predicates by their occurrence frequencies in descending order. Then, we use the top-K predicates to construct the patterns. We restrict the patterns to be within a certain length N. The reasons are: i) a rule with many predicates has low support and cannot form a valid rule; ii) complex rules tend to overfit the data; and iii) evaluating rules with a long pattern is time consuming. Then, we construct a candidate rule set (CRS) for class $C_j$ using the patterns we have obtained. We compute the support and the confidence of the rules in the CRS. If a candidate rule is valid, that is, its support and confidence exceed minsup and minconf, we add this new rule into the current classifier.
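The MINERULE procedure described above might be sketched as follows: take the top-K most frequent predicates among the misclassified records of class $C_j$, enumerate candidate patterns up to length N, scan the window once for support and confidence, and keep the valid rules. The enumeration via itertools and the reuse of the earlier Rule sketch are illustrative assumptions.

```python
from itertools import combinations

def mine_rule(window, predicate_counts, class_j, K, N, minsup, minconf):
    """Derive new candidate rules for class_j from the misclassified-predicate counts."""
    top_k = [p for p, _ in predicate_counts.most_common(K)]
    candidates = [frozenset(c) for length in range(1, N + 1)
                  for c in combinations(top_k, length)]
    new_rules = []
    for pattern in candidates:
        # One scan of the window per candidate pattern to get support/confidence.
        covered = [r for r in window if all(r.get(a) == v for a, v in pattern)]
        if not covered:
            continue
        support = len(covered) / len(window)
        confidence = sum(r["class"] == class_j for r in covered) / len(covered)
        if support >= minsup and confidence >= minconf:
            new_rules.append(Rule(pattern, class_j, support, confidence))  # Rule from the earlier sketch
    return new_rules
```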
One remaining problem is: are the new rules discovered by procedure MINERULE enough to represent the latest data distribution? After all, the constraints we use might have prevented us from discovering some subtle concepts. Indeed, if the constraints are relaxed, we may find more rules, but some of them just reflect noise in the data. The new rules are evaluated as the window moves ahead. At some window $W_i$ down the stream, when we are required to compare $N_{ij}$ and $N_{kj}$, we will have accumulated more statistics. In most cases, the introduction of the new rules will have reduced $N_{ij}$ so that its difference from $N_{kj}$ is smaller than minWR, which means the new rules are sufficient. In case the difference between $N_{ij}$ and $N_{kj}$ is still larger than minWR, which means either we have missed some subtle rules or another concept drift has occurred, we invoke procedure MINERULE again. The experiments show that in most cases, we only need to apply procedure MINERULE once to obtain enough new rules for one concept drift.
4.3 The algorithm
To build a stream classifier that adapts efficiently to new concepts, one major issue is how to access the records and update the rules efficiently. In this section, we describe data structures and algorithms for this purpose.
4.3.1 Data Structure
We use two tree structures to maintain the rules and the
records for the most recent window.
The RS-tree. We assume there is a total order among attributes, $A_1 \prec \cdots \prec A_d$. We can sort predicates and patterns based on this order. We store the current rules in a prefix tree called the RS-tree. Each node $N$ represents a unique rule $R: P \rightarrow C_i$. A node $N'$ that represents rule $P' \rightarrow C_j$ is a child node of $N$ iff:
1. $P \subset P'$;
2. $P \prec P'$;
3. no other rule $P'' \rightarrow C_k$ exists such that $P \subset P'' \subset P'$ and $P \prec P'' \prec P'$.
A node stores sup(R) and conf(R) for the rule $R$ it represents. An example RS-tree is shown in Figure 3(b). Node $N_1$ represents rule $(a_1, b_2) \rightarrow C_1$, whose support and confidence are 0.33 and 1 respectively. Node $N_3$ is the child of $N_2$ since $\{a_3\} \subset \{a_3, c_1\}$ and $\{a_3\} \prec \{a_3, c_1\}$.
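Under our reading of the (partly garbled) parent-child conditions, a parent's pattern is both a proper subset and, under the attribute order, a prefix of its child's pattern. The rough sketch below tests that relation; it reflects this interpretation and an assumed attribute order, not code from the paper.

```python
ATTR_ORDER = {"A1": 1, "A2": 2, "A3": 3}  # assumed total order A1 ≺ A2 ≺ A3

def sort_pattern(pattern):
    # Sort predicates (attribute, value) by the global attribute order.
    return sorted(pattern, key=lambda p: ATTR_ORDER[p[0]])

def may_be_ancestor(P, P_child) -> bool:
    """P can sit above P_child in the RS-tree: P is a proper subset of P_child
    and, once sorted by the attribute order, P is a prefix of P_child."""
    if not set(P) < set(P_child):
        return False
    sp, sc = sort_pattern(P), sort_pattern(P_child)
    return sc[:len(sp)] == sp
```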
The REC-tree. We think of each record $r$ as a sequence $\langle v_d, \cdots, v_1, C \rangle$, where $v_i$ is $r$'s value for attribute $A_i$ and $C$ is $r$'s class label. We insert record $r$ in its sequence representation $\langle v_d, \cdots, v_1, C_i \rangle$ into a tree structure, which we call the REC-tree. A path from any internal node $N$ to the root node represents a unique postfix $\{A_i = v_i, A_{i+1} = v_{i+1}, \cdots, A_d = v_d\}$.
Each internal node keeps a counter, which denotes how many records of the current window contain the postfix represented by the node. A node in the REC-tree may point to nodes in the RS-tree. Assume $p_1 \prec \cdots \prec p_k$. Node $N$ points to rule $p_1 \wedge p_2 \wedge \cdots \wedge p_k \rightarrow C_i$ in the RS-tree if:
1. node $N$ satisfies $p_1$;
2. the postfix that starts at $N$ contains the pattern $p_1 \wedge p_2 \wedge \cdots \wedge p_k$.
Intuitively, node $N$ represents a projection of a record $r$, and it points to all rules whose patterns $r$ satisfies. For each record that moves into the window, we update the support and the confidence of the rules it matches. The rule pointers speed up this process.
An example REC-tree is shown in Figure 3(a). Record $\{a_2, b_1, c_1 : C_1\}$ is stored on the left-most path. Node $(b_1 : 1)$ in that path points to rule $b_1 \rightarrow C_2$ in the RS-tree.
The REC-tree is associated with an array of record ids $[i, \cdots, i+w-1]$. Each record id points to the leaf node that represents that record. When a new record arrives, we insert it into the REC-tree and also insert an entry into the rid array. The record id array enables us to access any record in the window efficiently.
(a) REC-tree (b) RS-tree
Figure 3. REC-tree and RS-tree
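A bare-bones sketch of REC-tree insertion as described above: a record is inserted as the reversed attribute sequence followed by its class label, counters along the path are incremented, and the leaf is registered in the record-id array. Rule pointers and deletion are omitted; all names are illustrative.

```python
class RecNode:
    def __init__(self, value=None):
        self.value = value          # attribute value (or class label at the leaf)
        self.count = 0              # records in the window sharing this postfix
        self.children = {}          # child value -> RecNode
        self.rule_pointers = []     # pointers into the RS-tree (omitted in this sketch)

class RecTree:
    def __init__(self):
        self.root = RecNode()
        self.rid_array = {}         # record id -> leaf node

    def insert(self, rid, values_reversed, class_label):
        # values_reversed is <v_d, ..., v_1>; the class label terminates the path.
        node = self.root
        for v in list(values_reversed) + [class_label]:
            node = node.children.setdefault(v, RecNode(v))
            node.count += 1
        self.rid_array[rid] = node
```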
4.3.2 Update the classifier using REC-tree
Assume the current window is $W_i$. When a new record $r_{i+w}$ arrives, the window becomes $W_{i+1}$ and we derive a new classifier $C_{i+1}$. First we insert $r_{i+w}$ into the REC-tree. We update the support and the confidence of the rules pointed to by the nodes involved in the insertion as follows:

$$\mathrm{sup}_{i+1}(R) = \frac{\mathrm{sup}_i(R)\cdot w + 1}{w}$$

$$\mathrm{conf}_{i+1}(R) = \begin{cases} \dfrac{\mathrm{conf}_i(R)\,\mathrm{sup}_i(R) + 1}{\mathrm{sup}_i(R) + 1} & : C_i = C_j \\[1ex] \dfrac{\mathrm{conf}_i(R)\,\mathrm{sup}_i(R)}{\mathrm{sup}_i(R) + 1} & : C_i \neq C_j \end{cases}$$

where $\mathrm{sup}_i(R)$ and $\mathrm{conf}_i(R)$ are the old support and confidence of $R$, $\mathrm{sup}_{i+1}(R)$ and $\mathrm{conf}_{i+1}(R)$ are the new ones, $C_i$ is the class label of $r_{i+w}$, and $C_j$ is the class label of $R$.
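The update formulas translate directly into code; a minimal sketch that follows the displayed equations literally (support kept as a fraction of the window, confidence updated from the old support), with illustrative names.

```python
def update_rule_stats(sup, conf, w, record_label, rule_label):
    """Incremental support/confidence update when a record matching the rule arrives."""
    new_sup = (sup * w + 1) / w
    if record_label == rule_label:
        new_conf = (conf * sup + 1) / (sup + 1)   # record confirms the rule's class
    else:
        new_conf = (conf * sup) / (sup + 1)       # record contradicts the rule's class
    return new_sup, new_conf
```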
The insertion of $r_{i+w}$ can create new nodes, in which case their counters are set to 1. Moreover, new rule pointers, if necessary, are added to each new node. To find which rules are matched by a postfix, we need to scan the RS-tree. Assume a new node represents $A_i = v$. Since the RS-tree is a prefix tree, we only need to scan the subtrees whose root rule has $A_i = v$ as the first predicate of its pattern.
We also delete $r_i$ from the REC-tree and update the rules matched by it, decreasing the counters of the nodes involved. When the counter of a node becomes 0, we do not delete it from the REC-tree immediately: since it retains the pointers to the rules it matches, this information can be reused when a record with the same postfix arrives later. However, when the number of nodes in the REC-tree exceeds a threshold, we delete the nodes whose counters are 0.
4.3.3 The main algorithm
We now describe our algorithm as a whole. It contains two
phases. The first phase is the initial phase. We use the first
wrecords to train all valid rules for window W1. Based on
them, we construct the RS-tree and the REC-tree. The sec-
ond phase is the update phase. When record ri+warrives,
we insert it into the REC-tree and update the support and
the confidence of the rules matched by it. Then we delete
the oldest record and also update the rules matched accord-
ing to it. For every wrecords, we compare Ni+1,j and Nkj
for each class label. If for some j, their difference exceeds
minWR, we apply procedure MINERULE to find the new
rules. Algorithm UPDATE describes the update phase.
Algorithm UPDATE
Input: $r_i$: record that moves out of the window;
Input: $r_{i+w}$: record that moves into the window;
1. let $N$ be the node that represents $r_i$ in the REC-tree;
2. for each node $n$ from $N$ to the root node
3.   decrement $n$'s counter by 1;
4.   update the rules pointed to by $n$;
5. for $m \leftarrow d$ down to 1
6.   if a node for $A_m = v_m$ already exists in the REC-tree
7.     then increment its counter by 1;
8.       update the matched rules' support and confidence;
9.     else create a new node with counter = 1;
10.      scan the RS-tree and add rule pointers if necessary;
11. add a new entry to the record id array;
12. update $N_{i+1,j}$ and $L_{i+1,j}$;
13. if $(i+1) \bmod w = 0$ and $(N_{i+1,j} - N_{kj}) \ge$ minWR
14.   then apply MINERULE;
5 Experiments
We conduct experiments on both synthetic and real life
data streams. Tests are carried out on a PC with a 1.7GHz
CPU and 256 MB main memory.
Datasets. We create synthetic data with drifting concepts using a moving hyperplane. A hyperplane in $d$-dimensional space is represented by $\sum_{i=1}^{d} a_i x_i = a_0$. Records satisfying $\sum_{i=1}^{d} a_i x_i < a_0$ are labelled positive, otherwise negative. Hyperplanes have been used to simulate time-changing concepts because the orientation and the position of the hyperplane can be changed in a smooth manner by changing the magnitude of the weights [9, 15].
We generate random records uniformly distributed in $[0,1]^d$. Weights $a_i$ ($1 \le i \le d$) are initialized to random values in $[0, 1]$. We set $a_0 = \frac{1}{2}\sum_{i=1}^{d} a_i$ so that the hyperplane cuts the multi-dimensional space into two parts of the same volume. Thus, roughly half of the examples are positive, and the other half are negative.
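A sketch of the moving-hyperplane generator described above (weight-drift parameters omitted for brevity); the use of numpy and the function name are our choices, not the paper's.

```python
import numpy as np

def hyperplane_stream(d=10, n=100_000, seed=0):
    """Yield (x, label) pairs from a random hyperplane in [0, 1]^d."""
    rng = np.random.default_rng(seed)
    a = rng.random(d)            # weights a_i in [0, 1]
    a0 = 0.5 * a.sum()           # hyperplane splits the unit cube into equal volumes
    for _ in range(n):
        x = rng.random(d)
        label = 1 if a @ x < a0 else 0   # positive iff sum_i a_i x_i < a_0
        yield x, label
```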
We simulate concept drifts using several parameters. Parameter $k$ specifies the number of dimensions whose weights are changing. Parameter $t_i \in \mathbb{R}$, $1 \le i \le k$, specifies the magnitude of the change for weight $a_i$, and $s_i \in \{-1, 1\}$ specifies the direction of change for $a_i$.
We also use the real life dataset 'nursery' from the UCI ML Repository [1]. The dataset has 10,344 records and 8 dimensions. We randomly sample records from the dataset to generate a stream. To simulate concept drifts, for every 50,000 records sampled, we randomly select some attributes of the data set and change their values in a consistent way. One method we use is to shuffle the values; for instance, we change values $a_1 \rightarrow a_2 \rightarrow \cdots \rightarrow a_n \rightarrow a_1$ for all records and keep the class label intact.
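A sketch of the value-shuffling drift injection on a categorical dataset: the values of a chosen attribute are cyclically remapped $a_1 \rightarrow a_2 \rightarrow \cdots \rightarrow a_n \rightarrow a_1$ while the class label stays intact. The dictionary-based record layout is an assumption.

```python
def cyclic_shift_values(records, attribute):
    """Simulate a concept drift by cyclically remapping the values of one attribute."""
    values = sorted({r[attribute] for r in records})
    mapping = {v: values[(i + 1) % len(values)] for i, v in enumerate(values)}
    for r in records:
        r[attribute] = mapping[r[attribute]]   # the class label is left unchanged
    return records
```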
5.1 Effect of model updating
We show the effectiveness of updating a rule-based classifier model in handling concept drifts. We compare three approaches: i) training an initial rule-based classifier and using it without change to classify streaming data; ii) continuously revising the initial classifier by updating the support and confidence of its rules; and iii) continuously revising the initial classifier by updating the confidence/support of the existing rules and discovering new rules.
In the synthetic dataset we use, each record has 10 dimensions. The window size is 5,000. We introduce a concept drift by randomly choosing 4 dimensions and changing their weights for every 50,000 records. The results in Figure 4 are obtained for a run with parameters minsup = 0.03, minconf = 0.7, minWR = 150, K = 7 (the top-K parameter), and N = 5 (the maximum pattern length).
[Figure 4. Error Rate Comparison: error rate (%) vs. data size (×5,000) for the static classifier, the update-only classifier, and the classifier that also mines new rules.]
Figure 4 indicates that, 1) after the first concept drift, the
accuracy of the initial classifier drops significantly, which
means concept drifts can be effectively detected by misclas-
sified records; 2) rule updating can adjust the classifier to
adapt to the new concept; 3) classifiers that mine new rules
can further improve accuracy.
5.2 The relation between concept drifts and $N_{ij}$
In this subsection, we verify the approach of detecting concept drifts using abnormal increases of $N_{ij}$. We test it on a hyperplane dataset. After every 50,000 records, we adjust the hyperplane to simulate a concept drift. The two curves in Figure 5 indicate $N_{i1}$ and $N_{i0}$ respectively. We can see that each time a concept drift happens, $N_{i1}$ and $N_{i0}$ burst dramatically, while they remain stable when the streaming data is stable.
[Figure 5. Relation of concept drifts and $N_{ij}$: number of misclassified records vs. records (×5,000).]
5.3 Effect of rule composition
We verify the effectiveness of choosing the top-K can-
didate predicates in composing new rules. We compare its
accuracy against choosing predicates randomly. We used
the same parameters as in the previous experiment. The
result is shown in Figure 6. It shows that by using most
frequently occurring predicates in misclassified records to
construct CRS, we can obtain rules that effectively repre-
senting the new concept.
[Figure 6. Choosing literals: accuracy of top-K predicate selection vs. random predicate selection.]
5.4 Accuracy and time
While achieving improved or similar classification accuracy, our rule-based approach is much faster than state-of-the-art
stream classifiers. We compare our method with the ensem-
ble classifier [15] and CVFDT [9] in terms of accuracy and
run time on both synthetic data and real life data.
We used different parameters (e.g., window size, number
of classifiers in the ensemble, etc.) to tune the classifiers,
and a typical result is shown in Figure 7, which is obtained
using windows of size 10,000, and the ensemble contains 10
classifiers, each trained on 1,000 records. We report error
rate for every 5,000 records.
[Figure 7. Synthetic data: (a) error rate (%) and (b) runtime (s) vs. data size (×5,000) for the rule-based approach, CVFDT, and the ensemble classifier (EC).]
[Figure 8. Real life data: (a) error rate (%) and (b) runtime (s) vs. data size (×5,000) for the rule-based approach, CVFDT, and the ensemble classifier (EC).]
Figure 7(a) shows that the accuracy of our rule-based approach is higher than that of CVFDT, and is similar to that of the ensemble classifier. Compared with CVFDT, our rule-based approach can catch concept drifts and adjust the classifier more quickly. This is because the rule-based classifier has low granularity, which means it can quickly determine which component to revise and quickly figure out how to revise it, while CVFDT has to learn the new concepts by re-growing a decision tree level by level.
Figure 7(b) shows that our rule-based approach is much
faster than the ensemble classifier and CVFDT. Unlike the
ensemble classifier and CVFDT, which keep learning new
models, most of the time our approach only adjusts the sup-
port and confidence of matched rules in the model. This
operation is made very efficient using the RS-tree and the
REC-tree structure. Even when the new rule detection pro-
cedure is triggered, the cost is still small, because the update
has already been limited to a very small space, which is em-
bodied by the top-K predicates. The second experiment is
run on the real life ‘nursery’ dataset, and the results shown
in Figure 8 are consistent with those on the synthetic data.
6 Conclusion
An important task in mining data streams is to overcome
the effects of concept drifts, which pose a major challenge
to stream classification algorithms because of the high cost
associated with maintaining the uptodateness of the mod-
els. Current stream classifiers are adapted from algorithms
designed for static data, and they are hardly incrementally
maintainable because it is not easy, if not impossible, to se-
mantically break down the model into smaller pieces. In
this paper, we addressed the issue of classifier granularity,
and we showed that by reducing this granularity, change de-
tection and model update can be made much more efficient
without compromising classification accuracy, as reported
by our extensive experiments.
References
[1] C. Blake and C. Merz. UCI repository of machine
learning databases. In Univ. of California, Dept. of
Information and Computer Science, 1998.
[2] J. H. Chang and W. S. Lee. Finding recent fre-
quent itemsets adaptively over online data streams. In
SIGKDD, 2003.
[3] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang.
Multi-dimensional regression analysis of time-series
data streams. In VLDB, Hongkong, China, 2002.
[4] Yun Chi, Haixun Wang, Philip S. Yu, and Richard R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, 2004.
[5] P. Domingos and G. Hulten. Mining high-speed data
streams. In SIGKDD, pages 71–80, Boston, MA,
2000. ACM Press.
[6] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh.
BOAT– optimistic decision tree construction. In SIG-
MOD, 1999.
[7] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainFor-
est: A framework for fast decision tree construction of
large datasets. In VLDB, 1998.
[8] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan.
Clustering data streams. In FOCS, pages 359–366,
2000.
[9] G. Hulten, L. Spencer, and P. Domingos. Mining time-
changing data streams. In SIGKDD, pages 97–106,
San Francisco, CA, 2001. ACM Press.
[10] Wenmin Li, Jiawei Han, and Jian Pei. CMAR: Ac-
curate and efficient classification based on multiple
class-association rules. In ICDM, 2001.
[11] Bing Liu, Wynne Hsu, and Yiming Ma. Integrat-
ing classification and association rule mining. In
SIGKDD, 1998.
[12] G. Manku and R. Motwani. Approximate frequency
counts over data streams. In VLDB, 2002.
[13] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In VLDB, 1996.
[14] W. Nick Street and YongSeog Kim. A streaming en-
semble algorithm (SEA) for large-scale classification.
In SIGKDD, 2001.
[15] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han.
Mining concept-drifting data streams using ensemble
classifiers. In SIGKDD, 2003.
[16] Peng Wang, Haixun Wang, Xiaochen Wu, Wei Wang, and Baile Shi. On reducing classifier granularity in mining concept-drifting data streams. Technical report, IBM T. J. Watson Research Center, 2005. http://wis.cs.ucla.edu/~hxwang/publications/wangtech05.pdf