Use of ensembles of Fourier spectra in capturing recurrent concepts in data streams

Sripirakas Sakthithasan and Russel Pears
School of Computer and Mathematical Sciences
Auckland University of Technology, New Zealand
Email: {ssakthit, rpears }@aut.ac.nz
Albert Bifet and Bernhard Pfahringer
Department of Computer Science
University of Waikato, New Zealand
Email: {albifet , bfahringer }@cs.waikato.ac.nz
Abstract—In this research, we apply ensembles of Fourier
encoded spectra to capture and mine recurring concepts in a data
stream environment. Previous research showed that compact ver-
sions of Decision Trees can be obtained by applying the Discrete
Fourier Transform to accurately capture recurrent concepts in a
data stream. However, in highly volatile environments where new
concepts emerge often, the approach of encoding each concept
in a separate spectrum is no longer viable due to memory
overload and thus in this research we present an ensemble
approach that addresses this problem. Our empirical results on
real world data and synthetic data exhibiting varying degrees
of recurrence reveal that the ensemble approach outperforms
the single spectrum approach in terms of classification accuracy,
memory and execution time.
I. INTRODUCTION
In many real world applications, patterns or concepts recur
over time. Machine learning applications that model, capture
and recognize concept re-occurrence gain significant efficiency
and accuracy advantages over systems that simply re-learn
concepts each time they re-occur. When such applications
include safety and time critical requirements, the need for
concept re-use to support decision making becomes even more
compelling.
Auto-pilot systems sense environmental changes and take
appropriate action (classifiers, in the supervised machine learn-
ing context) to avoid disasters and to fly smoothly. As en-
vironmental conditions change, appropriate actions must be
taken in the shortest possible time in the interest of safety.
Thus for example, a situation that involves the occurrence
of a sudden low pressure area coupled with high winds (a
concept that would be captured by a classifier) would require
appropriate action to keep the aircraft on a steady trajectory. A
machine learning system that is coupled to a flight simulator
can learn such concepts in the form of classifiers and store
them in a repository for timely re-use when the aircraft is on
live flying missions. In live flying mode the autopilot system
can quickly re-use the stored classifiers when such situations
re-occur. Additionally, in live flying mode, new potentially
hazardous situations not experienced in simulator mode can
also be learned and stored as classifiers in the repository for
future use.
In a real world setting, there is an abundance of applications
that exhibit such recurring behavior such as stock and sales
applications where timely decision making results in improved
productivity. Our research setting is a data stream environment
where we seek to capture concepts as they occur, store them
in highly compressed form in a repository and to re-use
such concepts for classification when the need arises in the
future. A number of challenges need to be overcome. Firstly,
a compression scheme that captures concepts using minimal
storage is required, as memory overhead will be a prime
concern in a highly volatile, high dimensional environment:
the number of concepts will grow continuously in time given
the unbounded nature of data streams. Secondly, in real-world
environments, concepts rarely, if ever, occur in exactly their
original form and so a mechanism is needed to recognize par-
tial re-occurrence of concepts. Thirdly, the concept encoding
scheme needs to be efficient in order to support high speed
data stream environments.
In order to meet the above challenges, we extend the work
proposed in [16] in a number of ways. In [16] concepts were
initially captured using decision trees and the Discrete Fourier
Transform (DFT) was applied to encode them into spectra
yielding compressed versions of the original decision trees.
Firstly, instead of encoding each concept using its own
Fourier spectrum, we use an ensemble approach to aggregate
individual spectra into a single unified spectrum. This has two
advantages, the first of which is reduction of memory overhead.
Memory is further reduced as Fourier coefficients that are
common between different spectra can be combined into a
single coefficient, thus eliminating redundancy. The second
advantage arises from the use of an ensemble: new concepts
that manifest as a combination of previously occurring con-
cepts already present in the ensemble have a higher likelihood
of being recognized, resulting in better accuracy and stability
over large segments of the data stream.
Secondly, we devise an efficient scheme for spectral energy
thresholding that directly controls the degree of compression
that can be obtained in encoding concepts in the repository.
Thirdly, we optimize the DFT encoding process by re-
moving the need for computing a potentially expensive inner
product operation on vectors.
II. RELATED RESEARCH
While a vast literature on concept drift detection exists
[13], only a small body of work exists so far on exploitation
of recurrent concepts. The methods that exist fall into two
broad categories. Firstly, methods that store past concepts as
models and then use a meta-learning mechanism to find the
best match when a concept drift is triggered [5], [7]. Secondly,
methods that store past concepts as an ensemble of classifiers.
The method proposed in this research belongs to the second
category where ensembles remember past concepts.
An algorithm called REDDLA is presented in [14]. This
algorithm is designed to handle recurring concepts with
unlabeled data instances. One of its key issues is that explicit
domain knowledge is required on the concept recurrence interval.
The other issue is high memory overhead.
Lazarescu in [11] proposed an evidence forgetting mechanism
based on a multiple window approach and a prediction
module to adapt classifiers based on estimating the future rate
of change. Whenever the difference between the observed and
estimated rates of change is above a threshold, a classifier
that best represents the current concept is stored in a repository.
Experimentation on the STAGGER data set showed that
the proposed approach outperformed the FLORA method on
classification accuracy with re-emergence of previous concepts
in the stream.
Ramamurthy and Bhatnagar [15] use an ensemble approach
based on a set of classifiers in a global set G. An ensemble of
classifiers is built dynamically from a collection of classifiers
in G, if none of the existing individual classifiers are able to
meet a minimum accuracy threshold based on a user defined
acceptance factor. Whenever the ensemble accuracy falls below
the accuracy threshold, G is updated with a new classifier
trained on the current chunk of data.
Another ensemble based approach by Katakis et al. is
proposed in [9]. A mapping function is applied on data stream
instances to form conceptual vectors which are then grouped
together into a set of clusters. A classifier is incrementally built
on each cluster and an ensemble is formed based on the set
of classifiers. Experimentation on the Usenet data set showed
that the ensemble approach produced better accuracy than a
simple incremental version of the Naive Bayes classifier.
Gomes et al. [7] used a two layer approach with the first
layer consisting of a set of classifiers trained on the current
concept, while the second layer contains classifiers created
from past concepts. A concept drift detector flags when a
warning state is triggered and incoming data instances are
buffered to prepare a new classifier. If the number of instances
in the warning window is below a threshold, the classifier in
layer 1 is used instead of re-using classifiers in layer 2. One
major issue with this method is validity of the assumption that
explicit contextual information is available in the data stream.
Gama and Kosina also proposed a two layered system in
[5] which is designed for delayed labelling, similar in some
respects to the Gomes et al. [7] approach. In their approach,
Gama and Kosina pair a base classifier in the first layer
with a referee in the second layer. Each referee learns the regions
of feature space in which its corresponding base classifier predicts
accurately and is thus able to express a level of confidence in
its base classifier with respect to a newly generated concept.
The base classifier which receives the highest confidence score
is selected, provided that the score is above a user defined hit ratio
parameter; if not, a new classifier is learnt.
Just-in-Time classifiers are the solution proposed by Alippi
et al. [1] to deal with recurrent concepts. Concept change
detection is carried out on the classification accuracy as well as
by observing the distribution of input instances. The drawback
is that this model is designed for abrupt drifts and is weak at
handling gradual changes.
Recently, Sakthithasan and Pears in [16] used the Discrete
Fourier Transform (DFT) to encode decision trees into a
highly compressed form for future use. They showed that DFT
encoding is very effective in improving classification accuracy,
memory usage and processing time in general. Their method maintains a
pool of Fourier spectra and a decision tree forest in parallel.
The decision tree forest dominates the model when none of the
existing Fourier spectra matches the current concept; otherwise
classification is done by the best performing Fourier spectrum.
III. APPLICATION OF THE DISCRETE FOURIER
TRANSFORM ON DECISION TREES
The Discrete Fourier Transform (DFT) has a vast area of
application in diverse domains such as time series analysis,
signal processing, image processing and so on. It turns out,
as Park [12] and Kargupta [10] show, that the DFT is very
effective in terms of classification when applied on a decision
tree model.
Kargupta et al. [10], working in the domain of distributed
data mining, showed that the Fourier spectrum fully captures
a decision tree in algebraic form, meaning that the Fourier
representation preserves the same classification power as the
original decision tree.
A. Transforming Decision Tree into Fourier Spectrum
A decision tree can be represented in compact algebraic
form by applying the DFT to paths of the tree. Each Fourier
coefficient $\omega_j$ is given by:
$$\omega_j = \frac{1}{2^d} \sum_{x} f(x)\, \psi_j^{\lambda}(x); \qquad \psi_j^{\lambda}(x) = \prod_{m} \exp\!\left(\frac{2\pi i}{\lambda_m}\, x_m j_m\right) \tag{1}$$
where $j$ and $x$ are strings of length $d$; $x_m$ and $j_m$ represent
the $m$th attribute value in $x$ and $j$ respectively; $f(x)$ is the
classification outcome of path vector $x$ and $\psi_j^{\lambda}(x)$ is
the Fourier basis function.
Fig. 1. Decision Tree with 3 binary features
Figure 1 shows a simple example with 3 binary valued
features $x_1$, $x_2$ and $x_3$, out of which only $x_1$ and $x_3$ are
actually used in the classification.
With the wild card operator $*$ in place we can use equations
(1) and (2) to calculate non zero coefficients. Thus for example
we can compute:
$$\omega_{000} = \tfrac{4}{8} f(**0)\,\psi_{000}(**0) + \tfrac{2}{8} f(0{*}1)\,\psi_{000}(0{*}1) + \tfrac{2}{8} f(1{*}1)\,\psi_{000}(1{*}1) = \tfrac{3}{4}$$
$$\omega_{001} = \tfrac{4}{8} f(**0)\,\psi_{001}(**0) + \tfrac{2}{8} f(0{*}1)\,\psi_{001}(0{*}1) + \tfrac{2}{8} f(1{*}1)\,\psi_{001}(1{*}1) = \tfrac{1}{4}$$
and so on.
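As a minimal sketch, the two coefficients above can be reproduced by enumerating all 8 instances instead of using wild card paths. The text does not give the leaf values of the tree in Figure 1, so the function `f` below is an assumption: class 1 for $x_3 = 0$, and for $x_3 = 1$ the class follows $x_1$. The values of $\omega_{000}$ and $\omega_{001}$ do not depend on which of the two $x_3 = 1$ leaves is assumed to carry class 1. The names `f`, `basis` and `coefficient` are illustrative only.

```python
from fractions import Fraction
from itertools import product

D = 3  # number of binary features x1, x2, x3


def f(x):
    """Assumed example tree: x3 = 0 gives class 1; otherwise the class follows x1."""
    x1, _x2, x3 = x
    if x3 == 0:
        return 1
    return 1 if x1 == 1 else 0


def basis(j, x):
    """Binary Fourier basis function: psi_j(x) = (-1)^(j . x)."""
    return -1 if sum(a * b for a, b in zip(j, x)) % 2 else 1


def coefficient(j):
    """omega_j = (1 / 2^d) * sum_x f(x) * psi_j(x), as in equation (1)."""
    total = sum(f(x) * basis(j, x) for x in product((0, 1), repeat=D))
    return Fraction(total, 2 ** D)


spectrum = {j: coefficient(j) for j in product((0, 1), repeat=D)}
print(spectrum[(0, 0, 0)], spectrum[(0, 0, 1)])
```

Under this assumed tree, every partition that involves $x_2$ has a zero coefficient, which is why only a few coefficients need to be stored.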
Kargupta et al. in [10] showed that the Fourier spectrum of
a given decision tree can be approximated by computing only a
small number of low order coefficients, thus reducing storage
overhead. With a suitable thresholding scheme in place, the
Fourier spectrum consisting of the set of low order coefficients
is thus an ideal mechanism for capturing past concepts.
Furthermore, classification of unlabeled data instances can
be done directly in the Fourier domain, as it is well known that
the inverse of the DFT defined in expression (1) can be used
to recover the classification value, thus avoiding the need for
expensive reconstruction of a decision tree from its Fourier
spectrum. The inverse Fourier Transform is given by
$$f(x) = \sum_{j} \omega_j\, \overline{\psi}_j^{\lambda}(x) \tag{2}$$
where $\overline{\psi}_j^{\lambda}(x)$ is the complex conjugate of
$\psi_j^{\lambda}(x)$. An instance
can be transformed into a binary vector through the symbolic
mapping between the actual attribute value and the mapped value
(either 0 or 1 in the binary case). It can then be classified using
the inverse function in equation (2). Suppose the instance is 010;
the classification value f(010) can be calculated as follows:
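$$\begin{aligned}
f(010) = {} & (-1)^{000 \cdot 010}\,\omega_{000} + (-1)^{001 \cdot 010}\,\omega_{001} + (-1)^{010 \cdot 010}\,\omega_{010} + (-1)^{011 \cdot 010}\,\omega_{011} \\
& + (-1)^{100 \cdot 010}\,\omega_{100} + (-1)^{101 \cdot 010}\,\omega_{101} + (-1)^{110 \cdot 010}\,\omega_{110} + (-1)^{111 \cdot 010}\,\omega_{111} = 1
\end{aligned} \tag{3}$$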
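The classification of instance 010 in the Fourier domain can be sketched as follows. The tree function `f` is again an assumption (class 1 for $x_3 = 0$, otherwise the class follows $x_1$), since the leaf values are not listed in the text. In the binary case the basis function is real, so it equals its own complex conjugate.

```python
from itertools import product

D = 3
VECTORS = list(product((0, 1), repeat=D))


def f(x):
    """Assumed example tree: x3 = 0 gives class 1; for x3 = 1 the class follows x1."""
    x1, _x2, x3 = x
    return 1 if x3 == 0 else x1


def basis(j, x):
    """Binary Fourier basis (-1)^(j . x); real valued, hence self-conjugate."""
    return -1 if sum(a * b for a, b in zip(j, x)) % 2 else 1


# Forward transform, equation (1)
spectrum = {j: sum(f(x) * basis(j, x) for x in VECTORS) / 2 ** D for j in VECTORS}


def classify(x, spec):
    """Inverse transform, equation (2): f(x) = sum_j omega_j * psi_j(x)."""
    return sum(w * basis(j, x) for j, w in spec.items())


recovered = {x: classify(x, spectrum) for x in VECTORS}
print(recovered[(0, 1, 0)])
```

With the full spectrum the inverse recovers the tree's output on every instance; with a thresholded spectrum the sum would only approximate it, so a decision rule such as rounding would be needed.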
IV. EXPLOITATION OF THE FOURIER TRANSFORM FOR
RECURRENT CONCEPT CAPTURE
We first present the basic algorithm used in Section IV-A
and then go on to discuss an optimization that we used for
energy thresholding in Section IV-B.
We use CBDT [8] as the base classifier, which maintains a
forest of Hoeffding Trees [4]. CBDT is dynamic in the sense
that it can adapt to changing concepts at drift detection points.
As shown in Figure 2, the memory is divided into two
segments: the forest of Hoeffding trees; and a pool of Fourier
Spectra. The forest learns and undergoes structural modifica-
tion on a continuous basis. The pool maintains a collection
of Fourier Spectra encoded from Hoeffding Trees, each of
which had the best classification accuracy across the forest at a
particular concept drift point. Each Hoeffding Tree and Fourier
Spectrum is equipped with an instance of a drift detector. In
this research, we use the SeqDrift2 drift detector [13] as the
default option.
In [16], each Fourier spectrum is represented individually
as a Fourier Concept Tree (FCT). In this work, we aggregate
spectra and maintain a pool of ensemble spectra known as the
Ensemble Pool (EP). The aggregation process is carried out in
two different ways. Algorithm EPa aggregates with reference
to similarity based on accuracy, whereas EP aggregates based
on structural similarity. We describe the EP process in
Algorithm EP and discuss how FCT can be generated from it as a
special case.
Fig. 2. An Architecture for Recurrent Concept Capture
In practice, any incremental decision tree approach that
uses a forest of trees can be used in place of CBDT base
classifier.
A. EP Algorithm
Algorithm EP
Input: Energy Threshold ET, Accuracy Tie Threshold τ
Input: Structural Similarity Threshold α
Output: Best Performing Classifier C that suits current concept
1. Plant a Hoeffding tree rooted on each attribute found in the data stream
2. C is set to a randomly selected Hoeffding tree model from forest
3. Initialise an empty pool
4. Read an instance I from the data stream
5. repeat
6.     Apply all classifiers in forest and pool to classify I
7.     Append 0 to the embedded drift detector's window for each classifier if classification is correct, else 1
8. until Drift is detected by the current best classifier C
9. if C is from the forest
10.     Identify best performing Fourier Spectrum F in pool
11.     if (accuracy(C) - accuracy(F)) > τ
12.         Apply DFT on model C to produce Fourier Spectrum F* using energy threshold ET
13.         if F* is not already in pool
14.             Call Aggregation
15. Identify best classifier C across forest and pool
16. GoTo 4
Algorithm Aggregation
Input: Fourier Spectrum F*, a set of existing ensembles E in pool
Output: Updated Pool
1. repeat Over all data instances i
2.     for Each ensemble E in the pool
3.         d(E) = d(E) + |c(F*, i) − c(E, i)|
4. until Next concept drift point
5. E* = arg min_E d(E)
6. if (d(E*) ≤ α) Merge F* with E*
7. else insert F* as a new spectrum
In step 1, a Hoeffding Tree rooted on each attribute is
created. In step 2, a tree is randomly chosen as the best
performing classifier C. Next, an empty pool is created in step
3. Each incoming instance is routed to all trees in the forest and
pool until a concept drift signal is triggered by the drift detector
instance attached to the best classifier C (steps 4 to 8). At the
first concept drift point, the best performing tree C (in terms
of the drift detector's estimate of accuracy) is transformed into
a Fourier Spectrum F* after energy thresholding [16]. In this
method, the assumption is that the tree with the highest
accuracy helps locate concept changes more precisely than other
trees, because it captures concepts in greater detail than the
others and hence achieves the highest accuracy. Thereafter, F* is
stored in the repository for reuse whenever the concept recurs.
The spectra stored in the repository are fixed in nature as the
intention is to capture past concepts. A new best performing
classifier is then identified as shown in step 15.
At each subsequent drift point, if the best classifier is from
the pool then that classifier is applied to classify data instances
until a new best classifier emerges at a subsequent drift point.
Otherwise, if the best classifier is from the forest, two tests are
made to reduce redundancy in the pool. Firstly, we check whether
the difference in accuracy
between the best Hoeffding tree in the forest (C) and the best
performing Fourier Spectrum (F) in the pool (from step 10) is
greater than a user defined tie threshold τ (step 11). If this test
succeeds, the DFT is applied to C to produce F* (step 12).
Secondly, a test is made to ensure that this Fourier
representation F* is not already in the pool (step 13). If this
test is also passed, algorithm Aggregation is called to integrate
F* into a selected existing Fourier Spectrum (E*) or plant F*
as a separate Fourier spectrum in the pool (step 14).
Algorithm Aggregation searches for the spectrum (E*)
that has the greatest structural similarity to the currently
generated spectrum (F*) (step 5). Step 3 accumulates the degree
of disagreement (d) between the classification decisions (c) of
F* and E on each data instance i. The degree of disagreement between
F* and each of the existing ensembles (E) in the pool can easily be
updated incrementally in Algorithm EP using a single counter
variable at each ensemble E. This removes the need for steps 2 to
4 in Algorithm Aggregation. As an alternative to aggregating
structurally similar spectra, we used accuracy as the measure
that defines similarity. Similarity based on accuracy leads to
aggregating similar performing Fourier Spectra together. Thus,
we test the hypothesis that aggregation of two spectra based on
structural similarity produces better performing classifiers than
aggregation based on accuracy.
As stated earlier, FCT omits the call to Algorithm Aggregation
and inserts F* as it is, and is thus a special case of
EP.
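The merge-or-insert decision of Algorithm Aggregation can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: classifiers are represented only by the labels they predicted since the last drift point, and the threshold `alpha` is assumed to be expressed as a fraction of disagreeing instances (the text does not state whether α is a count or a rate). The function name `choose_ensemble` is hypothetical.

```python
def choose_ensemble(new_preds, pool_preds, alpha):
    """Return the index of the ensemble E* to merge F* into, or None to insert F* as new.

    new_preds:  labels predicted by F* on the instances seen since the last drift point
    pool_preds: one list of predicted labels per existing ensemble E, on the same instances
    alpha:      assumed disagreement threshold, as a fraction of instances
    """
    if not pool_preds or not new_preds:
        return None
    # d(E): number of instances on which F* and E disagree (one counter per ensemble)
    distances = [sum(a != b for a, b in zip(new_preds, preds)) for preds in pool_preds]
    best = min(range(len(distances)), key=distances.__getitem__)
    disagreement_rate = distances[best] / len(new_preds)
    return best if disagreement_rate <= alpha else None


f_star = [1, 0, 1, 1]
pool = [[0, 1, 0, 0], [1, 0, 1, 0]]
print(choose_ensemble(f_star, pool, alpha=0.3))
```

In a streaming setting the `distances` counters would be updated one instance at a time rather than recomputed from stored predictions, which is the incremental update the text describes.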
B. Optimising the Energy Thresholding Process
Sakthithasan et al. in [16] showed that classification
accuracy is sensitive to spectral energy, which is given by the
sum of squares of the coefficients [10]; the higher
the energy, the greater is the classification accuracy in general.
Thresholding on spectral energy is thus an effective method of
obtaining a compact spectrum while retaining the classification
power inherent in the decision tree counterpart.
A solution described in [16] was to iterate through each
order of the spectrum and compute the ratio of the energies
at each pair of consecutive orders. Thresholding can then be
implemented at order O when the ratio is less than some
small tolerance value, say 0.01. The drawback of this simple
solution is that it does not guarantee that the cumulative
energy up to order O contains a given proportion of the
total energy. Fortunately, a solution exists for this problem.
Theorem 1 proves that E(T) (the total energy of the Fourier Spectrum)
equals $\omega_0$ (the 0th coefficient). Thus, total energy can
be computed efficiently, without having to enumerate all the
individual coefficients.
Theorem 1: The total spectral energy $E = \sum_j \omega_j^2 = \omega_0$,
where $\omega_0$ denotes the coefficient with order 0, which is easily
computed as its Fourier basis function is unity.
Proof: Omitted due to lack of space; it can be found
at http://cogprints.org/9879/
This optimization significantly increases processing speed,
especially in high dimensional data stream environments.
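A small sketch of how Theorem 1 can drive thresholding is given below. It assumes class values in {0, 1} and uses the non-zero coefficients of the 3-feature example from Section III-A; the coefficients for partitions 100 and 101 follow from an assumed leaf labelling and are illustrative only. The function `threshold_by_energy` and its `proportion` parameter are hypothetical names: coefficients are kept order by order until the cumulative energy reaches the requested share of $\omega_0$.

```python
from fractions import Fraction

# Non-zero coefficients of the assumed example tree (x2 is never tested)
spectrum = {
    (0, 0, 0): Fraction(3, 4),
    (1, 0, 0): Fraction(-1, 4),
    (0, 0, 1): Fraction(1, 4),
    (1, 0, 1): Fraction(1, 4),
}


def order(j):
    """Order of a partition: the number of non-zero positions in j."""
    return sum(1 for jm in j if jm != 0)


def threshold_by_energy(spec, proportion):
    """Keep the lowest orders whose cumulative energy reaches proportion * omega_0."""
    d = len(next(iter(spec)))
    total_energy = spec[(0,) * d]  # Theorem 1: E = omega_0, no enumeration needed
    kept, cumulative = {}, 0
    for o in range(d + 1):
        for j, w in spec.items():
            if order(j) == o:
                kept[j] = w
                cumulative += w * w
        if cumulative >= proportion * total_energy:
            break
    return kept


energy = sum(w * w for w in spectrum.values())
print(energy == spectrum[(0, 0, 0)])
compact = threshold_by_energy(spectrum, Fraction(9, 10))
print(sorted(compact))
```

Here the order 0 and order 1 coefficients already carry 11/12 of the total energy, so the single order 2 coefficient is dropped at a 90% target.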
The next optimization is applied to the Fourier
basis function calculation in equation (1), especially when
wildcard characters (denoting absence of a feature) are
present in a path vector $x$ of a Hoeffding Tree.
C. Optimizing the Computation of the Fourier Basis Function
The computation of a Fourier basis function for a given
partition $j$ in a generic $n$-ary ($n \ge 2$) domain is given by:
$$\sum_{x \in S} \psi_j(x) = \sum_{x \in S} \prod_{m} \exp\!\left(\frac{2\pi i\, j_m x_m}{\lambda_m}\right) \tag{4}$$
Thus we can see from (4) that the computation of
$\sum_{x \in S} \psi_j(x)$ over a set of schema $S$ requires the computation
of an expensive inner product operation between the $x$ and
$j$ vectors. However, it is possible to optimize this inner product
computation as defined in Theorem 2.
Theorem 2: The computation of $\sum_{x \in S} \psi_j(x)$ can be
optimized as follows:
Case 1: If there exists at least one $(p, *)$ combination
with $p \in j$, $p \neq 0$ and $*$ a wild card character defining a set
of schema $S$, then $\sum_{x \in S} \psi_j(x) = 0$.
Case 2: else if there exist $n$ combinations of $(0, *)$
pairs in the $j$ and $x$ vectors respectively, then
$$\sum_{x \in S} \psi_j(x) = \lambda \prod_{k=n}^{d-1} \exp\!\left(\frac{2\pi i\, j_k x_k}{\lambda_k}\right), \quad \text{where } \lambda = \prod_{l=0}^{n-1} \lambda_l$$
Proof: Omitted due to lack of space; it can be accessed
from: http://cogprints.org/9879/
The value of Case 1 is that a simple scan of the $j$ and
$x$ vectors will save a total of $d$ multiplications and $d-1$
additions.
We now turn our attention to Case 2. Since $\prod_l \lambda_l$ is a
constant for all possible values of $j$ and $x$, the value of Case
2 is that a scan of the two vectors will avoid the overhead of
$n$ multiplications and $n-1$ additions.
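The two cases of Theorem 2 can be checked numerically against a brute-force enumeration of the schema. The sketch below is illustrative: a schema is written as a tuple in which `None` marks a wild card, `lam` holds the assumed attribute cardinalities $\lambda_m$, and the function names are hypothetical.

```python
import cmath
from itertools import product


def basis(j, x, lam):
    """psi_j(x) = prod_m exp(2*pi*i * j_m * x_m / lambda_m)."""
    value = 1
    for jm, xm, lm in zip(j, x, lam):
        value *= cmath.exp(2j * cmath.pi * jm * xm / lm)
    return value


def schema_sum_brute(j, schema, lam):
    """Enumerate every instance x matched by the schema (None marks a wild card)."""
    choices = [range(lm) if sm is None else (sm,) for sm, lm in zip(schema, lam)]
    return sum(basis(j, x, lam) for x in product(*choices))


def schema_sum_fast(j, schema, lam):
    """Theorem 2: a single scan over j and the schema, with no enumeration."""
    value = 1
    for jm, sm, lm in zip(j, schema, lam):
        if sm is None:
            if jm != 0:
                return 0      # Case 1: a non-zero j_m facing a wild card
            value *= lm       # Case 2: each (0, *) pair contributes a factor lambda_m
        else:
            value *= cmath.exp(2j * cmath.pi * jm * sm / lm)
    return value


lam = (2, 3, 4)
schema = (None, 1, None)
case2_gap = abs(schema_sum_brute((0, 2, 0), schema, lam) - schema_sum_fast((0, 2, 0), schema, lam))
case1_brute = abs(schema_sum_brute((1, 2, 0), schema, lam))
print(case2_gap < 1e-9, case1_brute < 1e-9)
```

The brute-force version visits 8 instances for this schema, while the scan touches each of the 3 positions once; the gap grows exponentially with the number of wild cards.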
Even with these optimizations, coefficient calculation may
be expensive in a large dimensional data set. In the next section
we present a strategy to further optimize the derivation of the
spectrum.
D. Localized Approach to Ensemble Learning in the Fourier
Domain
In order to realize the full benefits of ensemble learning in
the Fourier domain, we aggregate individual spectra $s_i(x)$ that
represent different concepts which manifest at different points
in the stream.
$$s_c(x) = \sum_i A_i\, s_i(x) = \sum_i A_i \sum_{j \in P_i} \omega_j^{(i)}\, \overline{\psi}_j(x) \tag{5}$$
where $s_c(x)$ denotes the ensemble spectrum produced from the
individual spectra $s_i(x)$ produced at different points $i$ in the
stream; $A_i$ is the classification accuracy of its corresponding
spectrum and $P_i$ is the set of partitions for non zero coefficients
in spectrum $s_i$.
Park in [12] used ensemble learning with Fourier spectra
in a setting different to ours. They considered a distributed
system with each node $i$ producing its own spectrum $s_i(x)$
and aggregation taking place at a central node. In our setting
of a data stream environment, we do not have all spectra in
advance but we can still use the same principle due to the
distributive nature of the linear weighted sum expressed by
(5). Hence, we use:
$$s_c^{(i+1)}(x) = s_c^{(i)}(x) + A_{i+1}\, s_{i+1}(x) \tag{6}$$
where $s_c^{(i+1)}(x)$ and $s_c^{(i)}(x)$ represent the ensemble spectra at concept
drift points $i+1$ and $i$ respectively in the stream and $s_{i+1}(x)$
is the spectrum produced at drift point $i+1$ with accuracy
$A_{i+1}$.
We use expression (6) for implementing ensemble learning
but with one essential difference. A direct application of (6)
using the entire (global) set of attributes $G$ comprising the data
set would be inefficient. As there are an exponential number
of coefficients with respect to the number of attributes, this
could cause a bottleneck in high dimensional environments.
One practical solution is to populate the spectrum using only
attributes present in a given tree. The major advantage of this
approach is smaller computational overhead as the Fourier
transform effort is directly proportional to the size of the
attribute set used. Then this initial spectrum can be extended to
a full length spectrum containing the attributes that are absent
in the given tree, using a simple transformation scheme.
We define the attribute set of a Decision Tree as that subset
of attributes which define splits in the tree. Suppose that we
are integrating spectra from trees $D_1$ and $D_2$, having attribute
sets $L$ and $M$ respectively. We apply the DFT on $D_1$ to obtain
$S_1$ using only the attributes in its attribute set $L$ and not all
attributes in $G$. Similarly we generate $S_2$ from $D_2$ using only
the attributes defined in $M$.
Now, in order to integrate $S_1$ with $S_2$, we need to account
for differences in the attribute sets $L$ and $M$. To do this, we
take $S_1$ and expand the spectrum by incorporating attributes in
the set $M \setminus L$. The expansion is defined by a single operation:
for each schema instance in the spectrum (say $S_1$), expand
the spectrum by adding 0 to all attribute index positions in
the set $M \setminus L$. The coefficient value after expansion will remain
the same, as the classification value $f$ is unaffected by any of these added
index positions. We are now in a position
to integrate two spectra produced from their own localized
set of attributes. Essentially, this means that we now have
a more efficient method of implementing ensemble learning
using expression (6).
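The expansion step and the incremental update of expression (6) can be sketched as follows. This is a minimal illustration under assumptions not stated in the paper: a local spectrum is stored as a dictionary from partition tuples to coefficients together with the ordered tuple of attribute names it was built on, and the merged spectrum orders attributes alphabetically. The helper names `expand` and `add_to_ensemble` and the sample coefficients are hypothetical.

```python
def expand(spectrum, attrs, target_attrs):
    """Re-index a local spectrum onto target_attrs, writing 0 at every added position."""
    pos = {a: k for k, a in enumerate(attrs)}
    return {
        tuple(j[pos[a]] if a in pos else 0 for a in target_attrs): w
        for j, w in spectrum.items()
    }


def add_to_ensemble(ensemble, ens_attrs, spectrum, attrs, accuracy):
    """Expression (6): s_c <- s_c + A * s, after aligning both on the union of attributes."""
    union = tuple(sorted(set(ens_attrs) | set(attrs)))
    merged = dict(expand(ensemble, ens_attrs, union))
    for j, w in expand(spectrum, attrs, union).items():
        # Coefficients that share a partition collapse into a single entry
        merged[j] = merged.get(j, 0.0) + accuracy * w
    return merged, union


# S1 built on attributes L = {x1, x3}; S2 built on M = {x2} (illustrative values)
s1 = {(0, 0): 0.75, (1, 0): -0.25, (0, 1): 0.25, (1, 1): 0.25}
s2 = {(0,): 0.5, (1,): -0.5}

ens, ens_attrs = add_to_ensemble({}, (), s1, ("x1", "x3"), accuracy=1.0)
ens, ens_attrs = add_to_ensemble(ens, ens_attrs, s2, ("x2",), accuracy=0.5)
print(ens_attrs, len(ens))
```

Only the partitions that actually occur are stored, so the merged spectrum here holds 5 coefficients rather than the 8 of a full spectrum over three attributes, and the two order 0 coefficients share one entry.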
The next section presents the empirical outcomes of the
proposed models with the above mentioned optimizations.
V. EXPERIMENTAL STUDY
The main focus of the study is to assess the effectiveness
of the ensemble EP approach vis-a-vis FCT in respect of
classification accuracy, memory consumption, processing
speed and tolerance to noise. We also assessed the sensitivity
of EP's accuracy to two significant factors, pool size and
the impact of the drift detector. All experimentation was done with
the following parameter values:
Tree Forest: Max Node Count=5000, Max Number of Fourier spectra=10, Tie Threshold τ=0.01
SeqDrift2/ADWIN [2]: drift significance value=0.01
A. Datasets Used for the Experimental Study
1) Synthetic Data: We experimented with the Rotating
Hyperplane data generator that is commonly used in drift
detection and recurrent concept mining. The dataset was
generated within the MOA data stream tool [3]. We injected
concept recurrence into the stream at known points so that we
could evaluate the capabilities of FCT and EP to recognize and
exploit such recurrences. For this dataset 10 different concepts
were generated, each of which spanned 5,000 instances and
each occurred a total of 3 times at different points in the
stream. In order to challenge the concept recognition process,
we added 10% noise by inverting the class labels of 10% of
randomly selected instances.
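The noise injection step can be illustrated with a short sketch. It assumes binary class labels encoded as 0 and 1 and a fixed random seed for repeatability; the function name `flip_labels` is hypothetical and this is not the MOA implementation.

```python
import random


def flip_labels(labels, rate=0.10, seed=0):
    """Invert the class label of a randomly selected fraction `rate` of the instances."""
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    noisy = list(labels)
    for idx in rng.sample(range(len(labels)), n_flip):
        noisy[idx] = 1 - noisy[idx]  # binary labels assumed to be 0 or 1
    return noisy


clean = [0] * 100
noisy = flip_labels(clean)
print(sum(noisy))
```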
2) Real World Data: Spam Data Set: The Spam dataset
was used in its original form¹, which encapsulates an
evolution of Spam messages. There are 9,324 instances and 499
informative attributes.
Electricity Data Set: The NSW Electricity dataset is also used
in its original form². There are two classes, Up and Down,
that indicate the change of price with respect to the moving
average of the prices in the last 24 hours.
¹ from http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift
² from http://moa.cms.waikato.ac.nz/datasets/
Flight Data Set: This dataset is generated through the
use of NASA's FLTz flight simulator, which was designed to
simulate flight conditions experienced with commercial flights.
It consists of a set of 20 separate files, each containing data
about a single flight with four scenarios: take off, climb,
cruise and landing. Data is recorded every second and a data
instance is produced. The "Velocity" feature is chosen as the
class feature as it needs to be adjusted in order to maintain
aircraft stability during various maneuvers such as take off
and landing. Velocity was discretized into binary outcomes
"UP" or "DOWN" depending on the directional change of the
moving average in a window of size 10 data instances.
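One plausible reading of this discretization is sketched below. The exact rule is not given in the text, so the details are assumptions: the moving average is taken over the most recent `window` values, the label compares it with the moving average at the previous instance, and ties are labelled "DOWN". The function name `label_direction` is hypothetical.

```python
from collections import deque


def label_direction(values, window=10):
    """Label each instance UP or DOWN by the change in the trailing moving average."""
    recent = deque(maxlen=window)
    labels = []
    prev_avg = None
    for v in values:
        recent.append(v)
        avg = sum(recent) / len(recent)
        if prev_avg is not None:
            labels.append("UP" if avg > prev_avg else "DOWN")
        prev_avg = avg
    return labels


print(label_direction([1, 2, 3, 2, 1], window=2))
```

The first instance has no previous average to compare against, so the output holds one label fewer than the input has values.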
B. Comparative Study: Ensemble versus Single Spectrum Ap-
proach
Previous research on the use of Fourier spectrum re-
vealed accuracy and memory advantages over meta learning
approaches such as the one employed by Gama and Kosina
in storing past concepts in a repository [16]. For details of
the advantages of the Fourier approach and experimentation
with it the reader is referred to [16]. Our focus here is a
comparative study of the Ensemble approach versus the single
spectrum approach. With this in mind we designed three types
of experiments.
1) Accuracy: Accuracy is a critical performance measure
in many practical applications. Due to the dynamic nature of
data streams, classification accuracy on the current concept was
taken as the performance measure. Figure 3 presents accuracy
values of all algorithms at 10 equal-sized sub-divisions of the
stream. We also present overall mean and standard deviations
of accuracy taken across the entire stream for each dataset.
Figure 3 shows that the individual accuracies across segments
and overall accuracy across the entire stream are consistent
with each other. MetaCT, which uses a referee based strategy,
was found to be the worst performing algorithm on all datasets.
In contrast, EP outperforms the other algorithms in general,
followed by EPa and FCT. These results show clearly that
DFT based methods are superior in a dynamic data stream
environment.
Fig. 3. Accuracy Profiles
FCT does not exploit aggregation of Fourier Spectra and
is hence challenged in a memory constrained environment
where the number of models stored for reuse is limited. Figure
3 depicts the performance in such an environment where
memory is severely limited. This introduces a large burden
on FCT to re-learn concepts after change. EP is more resilient
at small pool sizes as any given concept that recurs can be
approximated by a linear combination of spectra embedded
in the ensemble, just as a waveform of arbitrary shape can
be approximated by a large enough sum of sine functions in
signal processing.
Examining model usage statistics, EP was 3.7 times higher
in model re-use on the Flight dataset. The corresponding value
was 2.6 for EPa on the same dataset. This provides empirical
support for the claim that an aggregation-based model such
as EP has a significant advantage in reducing the degree of
relearning. For Rotating Hyperplane with known recurrence
points the advantage of EP over its counterparts is very explicit.
We display the stream segment for the third round of concept
occurrences, spanning the 10 concepts. Each of the 10 intervals
represents the second recurrence of a concept and the Figure
shows that EP outperforms FCT on 8/10 concepts; EPa and
MetaCT on 7/10 concepts; and CBDT on all 10 concepts.
The next key aspect in a memory constrained environment
is memory consumption which is assessed in the following
section.
2) Memory: Memory consumption is influenced by the
degree of generalizability of a given algorithm. A greater
degree of generalizability promotes higher re-use and reduces
the number of spectra that need to be stored in the repository
to achieve a given level of classification accuracy. In this
context it will be interesting to compare the consumption of
EP with that of FCT as they have contrasting model re-use
characteristics.
MetaCT(SeqDrift2) and CBDT were excluded from mem-
ory comparison due to their relatively poor performance in the
previous experiment.
TABLE I. MEMORY USAGE WITH POOL SIZE SET TO 10
                  Average Pool Memory (in KBs)
Dataset           FCT     EPa     EP
Flight            32.1    20.2    18.1
Electricity       31.6    16.1    14.1
Rot. Hyperplane   48.4    38.6    27.9
Spam              17.3    17.2    16.4
Table I presents the average memory consumption of the pool
over the entirety of each dataset. As mentioned in Section IV,
each of the algorithms in Table I has two components:
a forest and a repository pool. Memory consumed by the forest
is not a distinguishing factor as there was only a very marginal
difference between the algorithms, and thus the focus was on
the repository pool.
Without exception, EP consumed the least memory of the
three algorithms. This was expected, as EP
structurally examines instance vectors (i.e., those corresponding to
classification paths in a Hoeffding tree) and aggregates similar
vectors together. In EPa, on the other hand, structural
similarity is not guaranteed, and two structurally very different
spectra that produce similar accuracy could be chosen as the
candidates for aggregation, resulting in larger spectra.
Table I provides evidence to support this premise, as the
memory consumed by EPa is higher than that of EP but
lower than that of FCT. Averaged over all datasets, EP achieved a
41% reduction in memory consumption relative to FCT; the
corresponding figure for Electricity was 55%. This represents
a significant benefit of applying aggregation in Fourier space.
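The reported reductions can be checked directly against Table I. The short Python sketch below is illustrative only: the dictionary layout and function names are our own, and it assumes that the 41% figure is the reduction in mean pool memory across the four datasets (rather than the mean of the per-dataset reductions).

```python
# Average pool memory in KB, copied from Table I (pool size 10).
POOL_MEMORY_KB = {
    "Flight": {"FCT": 32.1, "EPa": 20.2, "EP": 18.1},
    "Electricity": {"FCT": 31.6, "EPa": 16.1, "EP": 14.1},
    "Rot. Hyperplane": {"FCT": 48.4, "EPa": 38.6, "EP": 27.9},
    "Spam": {"FCT": 17.3, "EPa": 17.2, "EP": 16.4},
}


def reduction_pct(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


def mean_memory(algorithm):
    """Mean pool memory of one algorithm over all datasets in the table."""
    rows = POOL_MEMORY_KB.values()
    return sum(row[algorithm] for row in rows) / len(POOL_MEMORY_KB)


# Reduction of the mean pool size (assumed interpretation of "on average").
overall = reduction_pct(mean_memory("FCT"), mean_memory("EP"))
# Reduction on the Electricity dataset alone.
electricity = reduction_pct(
    POOL_MEMORY_KB["Electricity"]["FCT"], POOL_MEMORY_KB["Electricity"]["EP"]
)

print(f"EP vs FCT, all datasets: {overall:.1f}%")
print(f"EP vs FCT, Electricity:  {electricity:.1f}%")
```

Under this interpretation the two values round to 41% and 55%, matching the figures quoted above.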
3) Processing Speed: DFT application is a potential performance
bottleneck when compared to classification, especially
in high dimensional data streams.
Processing speed depends on a variety of factors:
the need to maintain and classify with a relatively larger number of
Fourier spectra in FCT than in EP and EPa; aggregation in EP
and EPa, which generalizes models and thus reduces re-learning and
the need for DFT application; and, finally, the computational
overhead of aggregation. This section therefore assesses the
trade-off between the single and aggregated Fourier approaches in
terms of processing speed.
TABLE II. PROCESSING SPEED IN INSTANCES PER SECOND
Dataset              FCT       EPa      EP
Flight               797.2     731.2    836.9
Electricity          11600.3   9002.5   11402.5
Rotating Hyperplane  5647.8    5413.8   5804.5
Spam                 4.2       3.9      4.2
Table II shows that EP is the fastest on Flight and Rotating
Hyperplane, ties with FCT on Spam, and is marginally slower than
FCT on Electricity. EPa,
even though it has the potential to be faster due to its simple
aggregation strategy, suffers from inappropriate aggregations
that introduce instability, thus triggering more drift points than
its EP counterpart. EP, on the other hand, performs its
structural similarity comparison efficiently by incrementally updating
simple counters that record the number of disagreements
in classification between the current winner tree and every
Fourier spectrum in the pool. Although EP's
aggregation strategy requires more computational
effort than that of EPa, this effort is compensated for by its stability,
which triggers fewer false drift alarms than either EPa or FCT.
This experiment therefore demonstrates that an expensive
operation such as aggregation, if applied appropriately, can
yield a direct processing speed advantage over a period of
time.
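The counter-based similarity check described above can be sketched as follows. This is a minimal Python illustration and not the authors' implementation: the class name `DisagreementTracker`, the representation of pooled spectra as plain callables, and the rule of picking the pool member with the fewest disagreements are all assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, Optional, Sequence

# A classifier maps an instance (feature vector) to a class label.
Classifier = Callable[[Sequence[float]], int]


class DisagreementTracker:
    """Counts how often each pooled spectrum disagrees with the winner tree."""

    def __init__(self, pool: Dict[Hashable, Classifier]) -> None:
        self.pool = pool
        self.disagreements = {key: 0 for key in pool}
        self.seen = 0

    def update(self, instance: Sequence[float], winner_label: int) -> None:
        # One pass over the pool per instance; no instances are stored.
        self.seen += 1
        for key, classify in self.pool.items():
            if classify(instance) != winner_label:
                self.disagreements[key] += 1

    def most_similar(self) -> Optional[Hashable]:
        """Pool member with the fewest disagreements (hypothetical rule)."""
        if not self.pool or self.seen == 0:
            return None
        return min(self.disagreements, key=self.disagreements.get)


# Toy usage: two stand-in "spectra" and a stand-in winner tree.
pool = {
    "s1": lambda x: int(x[0] > 0.5),
    "s2": lambda x: int(x[1] > 0.5),
}
winner = lambda x: int(x[0] > 0.4)

tracker = DisagreementTracker(pool)
for x in [(0.1, 0.9), (0.45, 0.2), (0.7, 0.8), (0.9, 0.1)]:
    tracker.update(x, winner(x))

print(tracker.disagreements)   # per-spectrum disagreement counts
print(tracker.most_similar())  # candidate for aggregation with the winner
```

Because each arriving instance only increments integer counters, the cost per instance grows with the pool size but not with the length of the stream, which is consistent with the efficiency argument made above.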
4) Effects of Noise: Algorithms that work well in noise-
free environments will fail in noisy environments if they
lack the ability to generalize to new data by removing minor
variations, which often correspond to noise. DFT application, as
mentioned earlier, extracts significant coefficients by ignoring
minor coefficients that may capture noise inherent in data. It
was shown in [16] that DFT application provides robustness in
a noisy environment as opposed to a non-DFT based approach
such as MetaCT. Therefore, this experiment is aimed at testing
whether aggregation has an added advantage over a non-
aggregation based method such as FCT.
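To make the idea of discarding minor coefficients concrete, the sketch below keeps only the largest-magnitude coefficients that together account for a chosen fraction of the spectrum's energy. It is an assumption-laden illustration rather than the procedure used in the experiments: the function name `threshold_by_energy`, the 0.95 default, the toy coefficient values and the binary-string keys standing in for partitions are all hypothetical.

```python
def threshold_by_energy(coefficients, energy_fraction=0.95):
    """Keep the largest coefficients holding `energy_fraction` of the energy.

    `coefficients` maps a partition label to a (real or complex) coefficient.
    Energy is taken to be the sum of squared magnitudes.
    """
    if not 0.0 < energy_fraction <= 1.0:
        raise ValueError("energy_fraction must be in (0, 1]")

    total = sum(abs(w) ** 2 for w in coefficients.values())
    if total == 0.0:
        return {}

    # Visit coefficients from most to least energetic.
    ranked = sorted(
        coefficients.items(), key=lambda kv: abs(kv[1]) ** 2, reverse=True
    )

    kept = {}
    accumulated = 0.0
    for key, w in ranked:
        if accumulated >= energy_fraction * total:
            break  # remaining coefficients are treated as minor
        kept[key] = w
        accumulated += abs(w) ** 2
    return kept


# Toy spectrum: three dominant coefficients and two small ones.
spectrum = {"000": 0.5, "100": 0.4, "010": -0.3, "001": 0.02, "110": 0.01}
compact = threshold_by_energy(spectrum, energy_fraction=0.95)
print(sorted(compact))
```

In this toy spectrum the two small coefficients are dropped. Whether such small coefficients correspond to noise in a real stream depends on the data, so the threshold would need to be tuned empirically.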
Fig. 4. The impact of noise on accuracy
Figure 4 shows the percentage decrease in accuracy at noise
levels of 20% and 30% for FCT and EP, relative to accuracy on the
original Flight dataset. It is clear that the decrease in accuracy
is higher at the 30% noise level. What is interesting is the
higher tolerance of EP to noise compared to FCT. At 20% noise,
EP shows a smaller decrease than its counterpart in 8/10
intervals. At the 30% noise level,
the fraction is 4/10, with the two being tied in
two other intervals. Again, as with the other metrics that we
tracked, the superior performance of EP can be explained in
terms of its power to generalize, which makes it more robust to the
effects of noise.³
Next we examine the sensitivity of EP to key parameters
that significantly affect performance. Due to the superiority of
EP over the other algorithms, the study was confined to this
algorithm. Please refer to [16] for a sensitivity analysis of FCT’s
parameters.
C. Sensitivity Analysis
EP(SeqDrift2) has two key parameters of its own: pool size
and choice of drift detector.
1) Pool Size: In this experiment we contrasted classification
accuracy at the two ends of the pool size scale,
namely 1 and 10. In the context of the Flight dataset, which
has four concepts, a pool size of 1 represents an extremely
memory-limited environment and a size of 10 represents a
situation where memory is plentiful. Figure 5 shows accuracy
values over 10 intervals. Interestingly, EP with pool size 1
has the highest accuracy in 8/10 intervals. There is a 7.6%
and 7.2% gain in accuracy compared to FCT at pool sizes
1 and 10, respectively. This is a significant outcome of this
research. Even in an extremely memory-challenged environment,
EP achieves its best accuracy, exceeding that of a setting with a much
higher memory capacity. The implication is that ensemble
accuracy increases with greater diversity, which resonates with
the research conducted by [6]. This illustrates the strength of
aggregation applied in the EP algorithm. As more memory
becomes available at pool size 10, FCT’s accuracy converges
to that of its counterpart, as expected: at the higher memory
setting FCT can accommodate more spectra in its pool that
are tailored to specific concepts.
Fig. 5. The impact of pool size on the Flight dataset
³ The other three datasets that we experimented with displayed similar trends
to that of the Flight dataset and were thus not included in the interests of
space.
2) Impact of Drift Detector: A drift detector that incorrectly
triggers change points leads to partial learning of a concept
and to underdeveloped classifiers being stored in the pool.
This introduces fluctuations in accuracy, which in turn trigger
further change detections, causing even more fluctuations, and so on;
the problem is cyclic. On the other hand, if a drift detector
fails to detect changes, classifiers are not updated in a timely
fashion, leading to poor performance. This situation may
arise if a drift detector has a significantly high detection delay in
signaling changes. The ADWIN and SeqDrift2 drift detectors,
as shown in [13], have contrasting properties: SeqDrift2 has a
lower false positive rate than ADWIN while having similar
sensitivity. Therefore, the comparative study is
largely governed by false positive detections.
Fig. 6. The impact of drift detector on EP with pool size 10
Figure 6 reveals that SeqDrift2 helped EP to reduce the
frequency of the sudden accuracy drops seen with ADWIN, which are due
to the latter signaling false changes in concepts. In the segment
shown in Figure 6, there is a 5% gain in accuracy from using
SeqDrift2; the gain is 3.4% over the entire dataset.
VI. CONCLUSIONS AND FUTURE WORK
In this research we proposed a novel approach for cap-
turing and exploiting recurring concepts in data streams. We
optimized the derivation of the Fourier spectrum by employing
two mechanisms: one for energy thresholding and the other for
speeding up computation of the Fourier basis functions.
This research revealed that the ensemble approach outperformed
the single spectrum approach and is thus the method
of choice in high speed dynamic environments that generate
large numbers of concepts over the progression of the stream.
In such environments FCT would be challenged in terms of
memory capacity and would be forced to flush portions of
its repository sooner than EP, thus losing its ability to exploit
concept recurrences and in turn leading to a loss of accuracy.
However, as shown in the experimentation, care needs to be
taken in how spectra are combined: a naive approach of simply
combining spectra that perform similarly in terms of accuracy
can be worse than maintaining single spectra. We showed
that the structural similarity scheme outperformed the other
two approaches on a broad set of criteria including accuracy,
robustness to noise and over-fitting, memory consumption and
processing speed.
In terms of future work there are two promising directions.
First, we believe that it is possible to further reduce the computational
effort involved in deriving the spectrum by keeping only the
lowest order coefficient at each leaf node of the Decision
Tree, together with a residual coefficient that captures the
contribution of the other coefficients at that node. Secondly, at
each concept drift point we can compute the
spectrum in one thread while processing incoming instances
in another, in a parallel environment such as the Spark
framework.
REFERENCES
[1] Alippi, C. Boracchi, G. & Roveri, M. Just-In-Time Classifiers for
Recurrent Concepts. IEEE Transactions on Neural Networks and Learning
Systems, vol. 24(4), pages 620–634, 2013.
[2] Bifet, A. & Gavaldà, R. Learning from Time-Changing Data with
Adaptive Windowing. In Proceedings of the 7th SIAM ICDM, pages
443–448. SIAM, 2007.
[3] Bifet, A. Holmes, G. Kirkby, R. & Pfahringer, B. MOA: Massive Online
Analysis. The Journal of Machine Learning Research, vol(11), pages
1601–1604, 2010.
[4] Domingos, P. & Hulten, G. Mining High-speed Data Streams. In
Proceedings of the ACM SIGKDD’00, pages 71–80, New York, NY,
USA, 2000. ACM.
[5] Gama, J. & Kosina, P. Learning about the Learning Process. In Advances
in Intelligent Data Analysis X, vol(7014) of Lecture Notes in Computer
Science, pages 162–172. Springer Berlin Heidelberg, 2011.
[6] Gashler, M., Giraud-Carrier C., & Martinez, T. Decision Tree Ensemble:
Small Heterogeneous Is Better Than Large Homogeneous. In 7th International
Conference on Machine Learning and Applications, pages 900–905.
IEEE Computer Society, 2008.
[7] Gomes, J. Menasalvas, E. & Sousa, P. Tracking Recurrent Concepts
Using Context. In Rough Sets and Current Trends in Computing,
vol(6086), pages 168–177. Springer Berlin Heidelberg, 2010.
[8] Hoeglinger, S. Pears, R. & Koh, Y. CBDT: A Concept Based Approach
to Data Stream Mining. In Proceedings of the PAKDD ’09, pages 1006–
1012, Berlin, Heidelberg, 2009. Springer-Verlag.
[9] Katakis, I. Tsoumakas, G. & Vlahavas, I. An Ensemble of Classifiers for
Coping with Recurring Contexts in Data Streams. In Proceedings of the
ECAI’08, pages 763–764, Amsterdam, The Netherlands,
2008. IOS Press.
[10] Kargupta, H. Park, B. & Dutta, H. Orthogonal Decision Trees. IEEE
Transactions on Knowledge and Data Engineering, vol(18), no(8), pages
1028–1042, 2006.
[11] Lazarescu, M. A Multi-Resolution Learning Approach to Tracking
Concept Drift and Recurrent Concepts. In 5th international workshop
on Pattern Recognition in Information Systems, 2005.
[12] Park, B. Knowledge Discovery from Heterogeneous Data Streams Using
Fourier Spectrum of Decision Trees. PhD thesis, Pullman, WA, USA,
2001.
[13] Pears, R. Sripirakas, S. & Koh, Y. Detecting concept change in dynamic
data streams. Machine Learning, 97:3, pp 259–293, 2014.
[14] Li, P. Wu, X. & Hu, X. Mining Recurring Concept Drifts with
Limited Labeled Streaming Data. ACM Transactions on Intelligent Systems
and Technology, vol. 3(2), pages 29:1–29:32, 2012.
[15] Ramamurthy, S. & Bhatnagar, R. Tracking recurrent concept drift
in streaming data using ensemble classifiers. In 6th International
Conference on Machine Learning Applications, pages 404–409, Dec
2007.
[16] Sripirakas, S. & Pears, R. Mining Recurrent Concepts in Data Streams
Using the Discrete Fourier Transform. In DaWaK’14, vol(8646) of
Lecture Notes in Computer Science, pp 439–451. Springer International
Publishing, 2014.