Droplet Ensemble Learning on Drifting Data
Streams
Pierre-Xavier Loeffel1,2, Albert Bifet4, Christophe Marsala1,2, and Marcin
Detyniecki1,2,3
1Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris
France,
2CNRS, UMR 7606, LIP6, F-75005, Paris, France,
3Polish Academy of Sciences, IBS PAN, Warsaw, Poland
4LTCI, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France
{pierre-xavier.loeffel, christophe.marsala, marcin.detyniecki}@lip6.fr
albert.bifet@telecom-paristech.fr
Abstract. Ensemble learning methods for evolving data streams are powerful learning methods, since they combine the predictions of a set of classifiers in order to improve on the performance of the best single classifier inside the ensemble. In this paper we introduce the Droplet Ensemble Algorithm (DEA), a new method for learning on data streams subject to concept drift which combines ensemble and instance-based learning. Contrary to state-of-the-art ensemble methods, which select their base learners according to their performance on recent observations, DEA dynamically selects the subset of base learners best suited for the region of the feature space where the latest observation was received. Experiments on 25 datasets (most of which are commonly used as benchmarks in the literature), reproducing different types of drift, show that this new method achieves excellent results in accuracy and ranking against SAM-KNN [1], all of its base learners and a majority vote algorithm using the same base learners.
Keywords: Concept Drift, Ensemble Learning, Online-Learning, Su-
pervised Learning, Data Streams
1 Introduction
The explosion of data generated in real time from streams has brought into the limelight the learning algorithms able to handle them. Sensors, stock prices on the financial markets or health monitoring are a few examples of the numerous real-life cases where data streams are generated. It is therefore important to devise learning algorithms that can handle this type of data.
Unfortunately, these data streams are often non-stationary and their characteristics can change over time. For instance, the trend and volatility of stock prices can suddenly change as a consequence of an unexpected economic event. This phenomenon, referred to as concept drift (the underlying distribution which generates the observations on which the algorithm is trying to learn changes over time), raises the need for adaptive algorithms to handle data streams.
In this paper we propose a novel ensemble method which aims at obtaining good performance regardless of the dataset and the type of drift encountered. One of the main characteristics of this method is that it determines the regions of expertise of its base learners (BL) in the feature space and selects the subset of BL best suited to predict on the latest observation. This new method outperforms SAM-KNN [1], a recent classifier for data streams that won the Best Paper award at ICDM 2016.
The main contributions of the paper are the following:
– a new streaming classifier for evolving data streams, which weights its base learners according to their local expertise in the feature space;
– an extensive evaluation over a wide range of datasets and types of drift;
– a discussion on how the new method, DEA, outperforms the best state-of-the-art algorithms.
The paper is organized as follows: Section 2 lays down the framework of our problem and goes through related work. Section 3 details the proposed algorithm, while Section 4 presents the datasets used as well as the experimental protocol. Section 5 presents and discusses the results of the experiments, and finally Section 6 concludes.
2 Framework and related work
In this section we present the framework of our problem and we discuss related work on learning algorithms handling concept drift.
2.1 Framework
The problem being addressed here is supervised classification on a stream of data subject to concept drifts. Formally, a stream endlessly emits observations {x_1, x_2, ...} (where x_i = (x_i^1, ..., x_i^k) ∈ X = R^k, k designates the dimension of the feature space and i designates the time step at which the observation was received) which are unlabeled at first but for which a label y_i ∈ Y = {1, ..., c} is received a constant amount of time u ∈ R^+ after x_i. We will work in the framework where the label of x_i is always received before reception of x_{i+1}. The goal is to create an on-line classifier f : X → Y which can predict, as accurately as possible, the class y_i associated to x_i.
An on-line classifier is a classifier which can operate when data are received in sequence (as opposed to a batch classifier which needs a full dataset from scratch to operate) and can evolve over time (i.e. its learned model is constantly updated with the latest observations). Formally, the operating process of an on-line classifier is described thereafter: when an observation x_t is received at time t, it outputs a prediction ŷ_t. The true class y_t is then released and, after computation of the prediction error according to the 0-1 loss function L(y, ŷ) = I{y ≠ ŷ} (where I is the indicator function), the classifier is updated with the latest observation: f_t = Update(f_{t-1}, {x_t, y_t}). Our goal then is to minimize the average error (1/n) Σ_{i=1}^{n} L(y_i, ŷ_i) over the n observations received so far.
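As an illustration, here is a minimal Python sketch of this test-then-train operating process. The classifier interface (predict/update) and the stream object are hypothetical placeholders; only the 0-1 loss and the running average error follow the definitions above.

```python
def zero_one_loss(y_true, y_pred):
    """0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return int(y_true != y_pred)


def run_online(classifier, stream):
    """Test-then-train loop: predict on x_t, receive y_t, record the loss,
    then update the classifier with (x_t, y_t)."""
    losses = []
    for x_t, y_t in stream:              # label assumed to arrive right after each observation
        y_hat = classifier.predict(x_t)  # prediction made before seeing y_t
        losses.append(zero_one_loss(y_t, y_hat))
        classifier.update(x_t, y_t)      # f_t = Update(f_{t-1}, {x_t, y_t})
    return sum(losses) / len(losses)     # average error over the observations seen so far
```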
In the considered framework, the hidden joint distribution P(X, Y) (called the concept), which generates the couples (x_i, y_i) at each time step, is also allowed to change unexpectedly over time: a phenomenon referred to as concept drift. Formally [1], a concept drift occurs at time t if P_{t-1}(X, Y) ≠ P_t(X, Y). According to Bayes' rule, P(X, Y) = P(Y|X) P(X). Thus, a concept drift can result either in a change of the posterior probability of the classes P(Y|X) (called a real drift), in a change of the distribution of the features P(X) (called a virtual drift), or in both.
The types of drift can be further categorized according to the speed at which they occur. We say that a drift is abrupt when it lasts for one observation (P_{t-1}(X, Y) ≠ P_t(X, Y) and the concept is stable before t-1 and after t), and conversely that it is incremental when it lasts more than one observation (P_{t-k}(X, Y) ≠ ... ≠ P_{t-1}(X, Y) ≠ P_t(X, Y) and the concept is stable before t-k and after t). Recurring drifts happen when a previously learned concept reappears after some time (∃ k ∈ N such that P_{t-k}(X, Y) = P_t(X, Y)).
2.2 Related work
Several methods have been proposed to deal with the issue of drifting concepts on data streams, the majority of which are ensemble methods.
ADACC was introduced in [11]. It maintains a set of BL which are weighted every τ time steps according to their number of wrong predictions. It then randomly selects one BL from the worst half of the ensemble and replaces it with a new one, which is protected from deletion for a few time steps. The final prediction is given by the current best performer. The algorithm also includes a mechanism to remember past concepts.
Dynamic Weighted Majority (DWM) is an ensemble method introduced in [6]. Each of its BL has a weight which is reduced in case of a wrong prediction. When a BL's weight drops below a given threshold, it is deleted from the ensemble. If all the BL output a wrong prediction on an instance, a new classifier is added to the ensemble.
ADWIN Bagging (Bag Ad) was introduced in [9] and improves the Online Bagging algorithm proposed by Oza and Russell [10] by adding the ADWIN algorithm as a change detector. When a change is detected, the worst-performing BL is replaced by a new one.
Similarly, ADWIN Boosting (Boost Ad) improves the on-line Boosting algorithm of Oza and Russell [10] by adding ADWIN to detect changes.
Leveraging Bagging (Lev Bag) was introduced in [7] and further improves the ADWIN Bagging algorithm by increasing re-sampling (using a value λ larger than 1 to compute the Poisson distribution) and by adding randomization at the output of the ensemble by using output codes.
Hoeffding Adaptive Tree (Hoeff Tree) was introduced in [12] and uses
ADWIN to monitor the performance of the branches on the tree. When the
accuracy of a branch decreases, it is replaced with a more accurate one.
Accuracy Updated Ensemble (AUE), described in [4], maintains a weighted ensemble of BL and uses a weighted voting rule for its final prediction. It creates a new BL after each chunk of data, which replaces the weakest-performing one. The weights of the BL are computed according to their individual performances on the latest data chunk.
Finally, SAM-KNN [1], best paper award at ICDM 2016, is a recent improvement of the KNN algorithm. It maintains the past observations in two types of memory (a short-term and a long-term memory). The task of the short-term memory is to remain up to date with respect to the current concept, whereas the long-term memory is in charge of remembering past concepts. When a concept change is detected, the observations from the short-term memory are transferred to the long-term memory.
3 The Droplets Ensemble Algorithm
Fig. 1. Example of map learned in 2 dimensions. Left: before update of the model with
the 6th observation (received at point A). Right: after update of the model with the
6th observation.
Our main goal in designing our new ensemble algorithm for data streams subject to concept drift is to take into account the local expertise of each of its BL in the region of the feature space where the latest observation was received. This means that it gives more weight to the predictions of the BL which have demonstrated an ability to predict accurately in this region.
We propose DEA (Droplets Ensemble Algorithm), an ensemble learning algorithm which dynamically maintains an ensemble of n BL, F = {f^1, ..., f^n}, along with an ensemble of p Droplets, Map = {D^1, ..., D^p}, up to date with respect to the current concept.
The BL can be any learning algorithms, as long as they are able to classify on a data stream subject to concept drifts.
A Droplet is an object which can be represented as a k-dimensional hypersphere (with k the dimension of the feature space). Each Droplet D_t is associated with an observation x_t and holds a pointer to a BL f^i (i ∈ {1, ..., n}). The values taken by x_t correspond to the coordinates of the center of the Droplet in the feature space, whereas f^i corresponds to the BL which managed to achieve the lowest prediction error on a region of the feature space defined around x_t.
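For concreteness, a Droplet can be thought of as the following small Python structure. This is an illustrative sketch, not the authors' MOA (Java) implementation; the field names (learner_idx, errors, wrong_count, created_at) are assumptions introduced here and reused in the later sketches.

```python
import numpy as np

class Droplet:
    """A k-dimensional hypersphere of the Map: its center is the normalized
    observation it was created from, and it points to one base learner."""
    def __init__(self, center, radius, learner_idx, errors, created_at):
        self.center = np.asarray(center, dtype=float)  # normalized coordinates of x_t
        self.radius = radius                           # starts at R_default, may shrink later
        self.learner_idx = learner_idx                 # index of the associated BL f^k
        self.errors = errors                           # 0/1 prediction error of each BL on x_t
        self.wrong_count = 0                           # wrong predictions made through this Droplet
        self.created_at = created_at                   # time step t of creation

    def distance_to(self, x_norm):
        """Euclidean distance from the Droplet center to a normalized observation."""
        return float(np.linalg.norm(self.center - np.asarray(x_norm, dtype=float)))

    def contains(self, x_norm):
        """True if the normalized observation falls inside the hypersphere."""
        return self.distance_to(x_norm) <= self.radius
```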
Figure 1 shows an example of a learned Map, where the numbers represent the time step at which each Droplet was received.
We now go through the algorithm in detail.
3.1 Model Prediction
Algorithm 1 Model Prediction
Inputs: F = {f^1, ..., f^n}: Ensemble of base learners,
  Map = {D^1, ..., D^p}: Ensemble of existing Droplets,
  x_t: Latest unlabeled observation,
  x_const: Normalization constants
Output: ŷ_t: Estimated class for x_t
x_t^norm ← Normalize(x_t, x_const)
OD_t ← Get_Overlapped_Droplets(Map, x_t^norm)
If (OD_t ≠ ∅)
  Foreach D^h ∈ OD_t (h ∈ {a, ..., u})
    ŷ_t^h ← Predict(D^h, x_t)
  End Foreach
  ŷ_t ← Majority_Vote(ŷ_t^a, ..., ŷ_t^u)
Else
  D_nn ← Get_Nearest_Droplet(Map, x_t^norm)
  ŷ_t ← Predict(D_nn, x_t)
End If
At time t, upon reception of a new unlabeled observation x_t, the first step is to normalize the values of x_t according to a vector of normalization constants x_const found during the initialization step^5. Then OD_t, the set of Droplets which contain the normalized coordinates of the latest observation, is computed. If OD_t ≠ ∅, the predicted value for this observation is given by a simple majority vote of the BL associated with the overlapped Droplets in OD_t. On the other hand, if OD_t = ∅, the learner associated with the nearest Droplet D_nn is used for prediction. For instance, in the left plot of Fig. 1, if an observation is received at the position of point A, the BL associated with D_1 and D_2 will be used for prediction, whereas if an observation is received at the position of point B, only the BL associated with D_3 will be used for prediction.
The prediction process is summarized in Algorithm 1.
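A compact Python sketch of this prediction step is shown below. It reuses the hypothetical Droplet structure introduced earlier and assumes base learners exposing a predict method; it illustrates Algorithm 1 and is not the reference implementation.

```python
from collections import Counter

def dea_predict(droplets, learners, x_t, x_norm):
    """Predict the class of x_t: majority vote of the BL attached to the
    overlapped Droplets, or the BL of the nearest Droplet if none overlaps."""
    overlapped = [d for d in droplets if d.contains(x_norm)]
    if overlapped:
        votes = [learners[d.learner_idx].predict(x_t) for d in overlapped]
        return Counter(votes).most_common(1)[0][0]   # simple majority vote
    nearest = min(droplets, key=lambda d: d.distance_to(x_norm))  # assumes a non-empty Map
    return learners[nearest.learner_idx].predict(x_t)
```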
3.2 Model Update
Once the true label y_t associated with the latest observation x_t is released, each BL f^i (with i ∈ {1, ..., n}) predicts on the latest observation and the vector of prediction errors e_{t+1} = (e_{t+1}^1, ..., e_{t+1}^n) (with e_{t+1}^i ∈ {0, 1}) is set aside. The BL are then updated with {x_t, y_t}.
The next step is to search for the BL which will be associated with the new Droplet D_t. This is done by summing the prediction errors achieved by each BL on the N nearest Droplets, where N is a parameter defined by the user. If a unique BL minimizes this sum, it is associated with the new Droplet; otherwise (if at least two BL minimize the sum of prediction errors) the search space is expanded in turn to the N+1, N+2, N+3, ... nearest Droplets until a single best performer is found.
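One way to realize this search is sketched below, again in Python on the hypothetical Droplet structure; the tie-breaking rule when the whole Map is exhausted without a unique winner is an assumption (the first minimizer is kept).

```python
def search_best_learner(droplets, x_norm, n_learners, n_min):
    """Return the index of the BL with the lowest summed error over the N
    nearest Droplets, growing the neighbourhood until the minimum is unique."""
    ranked = sorted(droplets, key=lambda d: d.distance_to(x_norm))  # nearest Droplets first
    n = min(n_min, len(ranked))
    while True:
        sums = [sum(d.errors[i] for d in ranked[:n]) for i in range(n_learners)]
        best = min(sums)
        winners = [i for i, s in enumerate(sums) if s == best]
        if len(winners) == 1 or n >= len(ranked):
            return winners[0]        # unique winner, or tie left once the whole Map is used
        n += 1                       # expand to the N+1, N+2, ... nearest Droplets
```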
The new Droplet D_t is then added to the feature space at the coordinates of x_t^norm. This Droplet is given a default radius R_default (where R_default is a parameter defined by the user), stores the vector of prediction errors e_{t+1} and creates a pointer to the best BL f^k found in the previous step.
The algorithm then goes through the set of overlapped Droplets OD_t and, if it is not empty, it decreases the influence of the Droplets in OD_t which have outputted a wrong prediction on x_t. This is done by shrinking their radius, which will make them less likely to predict on a future observation received in this region of the feature space. Formally, for each such Droplet D_u in OD_t:
1. Compute the overlap between D_u and the latest Droplet:
   Overlap_u = R_default + R_u − ||x_u^norm − x_t^norm|| (where ||.|| denotes the Euclidean distance).
2. Update the radius of D_u: R_{u,t+1} = R_{u,t} − Overlap_u / 2.
3. Delete D_u if R_{u,t+1} ≤ 0.
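A minimal sketch of these three steps, under the same hypothetical Droplet structure as above; the caller is assumed to pass only the overlapped Droplets that mispredicted on x_t.

```python
import numpy as np

def shrink_overlapped(mispredicting_overlapped, new_center, r_default):
    """Shrink the radius of each overlapped Droplet that made a wrong
    prediction on x_t; return the Droplets whose radius stays positive."""
    survivors = []
    for d in mispredicting_overlapped:
        overlap = r_default + d.radius - np.linalg.norm(d.center - np.asarray(new_center))
        d.radius -= overlap / 2.0     # step 2: R_{u,t+1} = R_{u,t} - Overlap_u / 2
        if d.radius > 0:              # step 3: drop the Droplet once its radius reaches 0
            survivors.append(d)
    return survivors
```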
For instance, the right plot of Fig. 1 shows the updated model after reception of an observation at the position of point A, where the BL associated with D_1 outputted a wrong prediction on the 6th observation whereas the BL associated with D_2 predicted correctly.
^5 This is simply done by computing the average µ_i as well as the standard deviation σ_i of each feature on the initialization set and by transforming the ith feature of x_t into (x_t^i − µ_i) / σ_i.
Finally, a memory management module is run at each time step to ensure that p, the user-defined maximum number of Droplets allowed in memory, is not exceeded. If the memory is full, the algorithm uses three criteria to select the Droplet which will be removed:
1. Remove the Droplet with the smallest radius.
2. If all the Droplets have the same radius, remove the Droplet which has outputted the highest number of wrong predictions.
3. If criteria 1 and 2 fail, remove the oldest Droplet.
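One possible reading of this eviction policy, sketched on the hypothetical Droplet structure used above, applies the three criteria as successive tie-breakers:

```python
def evict_one(droplets):
    """Pick the Droplet to remove when the Map is full: smallest radius,
    then highest number of wrong predictions, then oldest creation time."""
    min_radius = min(d.radius for d in droplets)
    candidates = [d for d in droplets if d.radius == min_radius]            # criterion 1
    if len(candidates) > 1:
        max_wrong = max(d.wrong_count for d in candidates)
        candidates = [d for d in candidates if d.wrong_count == max_wrong]  # criterion 2
    if len(candidates) > 1:
        candidates = [min(candidates, key=lambda d: d.created_at)]          # criterion 3: oldest
    return candidates[0]
```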
Algorithm 2 summarizes the model update process.
Algorithm 2 Model Update
Inputs: R_default: Default radius of a Droplet,
  F = {f^1, ..., f^n}: Ensemble of base learners,
  Map = {D^1, ..., D^p}: Ensemble of existing Droplets,
  x_t: Latest unlabeled observation,
  y_t: True label of the latest observation,
  p: Maximum number of Droplets allowed in memory,
  OD_t: Set of overlapped Droplets at time t
Output: Updated DEA
Foreach f^i in F
  e_{t+1}^i ← Get_Prediction_Error(f_t^i, {x_t, y_t})
  f_{t+1}^i ← Update_Base_Learner(f_t^i, {x_t, y_t})
End Foreach
f^k ← Search_Best_Base_Learner(Map, {x_t, y_t}), k ∈ {1, ..., n}
D_t ← Create_Droplet(R_default, x_t^norm, f^k, e_{t+1}, sum_errors = 0)
Map ← Add_Droplet(Map, D_t)
Foreach D^u ∈ OD_t
  R_{u,t+1} ← Update_Radius(R_{u,t})
  If R_{u,t+1} ≤ 0
    Map ← Remove_Droplet(Map, D^u)
  End If
End Foreach
If (Card(Map) ≥ p)
  Map ← Memory_Management(Map)
End If
3.3 Running time and space requirements
Running time: Provided that each of the base learners runs in constant time at each time step, the temporal complexity of both the prediction and update steps of DEA is O(i·p), with i the number of observations generated by the stream so far and p the maximum number of Droplets allowed in memory.
Space requirements: As previously explained, the maximum number p of Droplets saved in computer memory is constrained, and so is the number n of base learners. This means that, as long as each of the n base learners constrains its memory consumption at each time step, the space complexity of DEA will be O(n + p), which is independent of the number of observations generated by the stream so far.
4 Experimental Framework
In this section, we describe the datasets on which the experiments have been
conducted, their characteristics as well as the experimental protocol used.
4.1 Datasets
A total of 25 artificial and real-world datasets have been used. These datasets have been chosen for the diversity of their characteristics, which are summarized thereafter.
Most of these datasets have frequently been used in the literature dedicated to streams subject to concept drift. Also, please note that in this table an "N/A" value doesn't mean that there is no concept drift. It means that, because the dataset comes from the real world, it is impossible to know for sure the number of drifts it includes as well as their type.
The first four datasets, from Agrawal to LED, have been generated using the built-in generators of MOA^6. A precise description of these datasets can be found in the following papers [13,14,15]. The KDD Cup 10 percent dataset was introduced and described in [2]. Rotating Checkerboard was created in [5] and the CB (Constant) version of the dataset was used (constant drift rate). SPAM was introduced in [3] and Usenet was inspired by [8]. Airlines was introduced in the 2009 Data Expo competition. The dataset has been shrunk to the first 153,000 observations.
Multidataset is a new synthetic dataset created for this paper. Every 50,000 observations, the concept drifts to a completely new dataset, starting with Rotating Checkerboard, then Random RBF, then Rotating Hyperplane and finally SEA. In the basic version, the successive concepts overlap each other, whereas in the No Overlap (NO) version the datasets are shifted and the data are randomly generated on each dataset.
Finally, all the datasets listed after Weather have been retrieved from the repository^7 given in the paper of Losing et al. [1].
All the datasets used, as well as the code of the DEA algorithm and the results of the experiments, are available at the following link^8.
4.2 Experimental Setting
MOA has been used to conduct the experiments and provides the implementation of the classifiers. DEA was also implemented in MOA. The code for SAM-KNN was directly retrieved from the link provided in their paper [1]^9.
All the parameters of all the classifiers were set to default values for all the datasets (except for the training period, which was set to 100 observations for all the learners, and the number of observations allowed in memory, which was set to 400 for DEA and SAM-KNN). In the case of the Droplets algorithm, the default radius was set to 0.1 and the minimum number of neighbors considered was set to 20 for all the experiments. We used all the algorithms described in this paper as BL for DEA (they were chosen because of their availability in MOA), with the exception of SAM-KNN and Majority Vote. The simple majority vote algorithm (which uses the same BL as DEA) was used as a baseline for performance comparison.
^6 http://moa.cms.waikato.ac.nz/
^7 https://github.com/vlosing/driftDatasets
^8 https://mab.to/o5iNvZdhH
^9 https://github.com/vlosing/driftDatasets
Leaving all the parameters at default values for all the datasets is required because there are no assumptions regarding the structure of the data or the type of drifts the classifiers will have to deal with. Therefore, it wouldn't be relevant to optimize parameters that would only be suitable for a particular concept, at a particular time and for a particular dataset.
The goals of the experiments were to compare the performance of DEA against one of the currently best adaptive algorithms (SAM-KNN), to assess how DEA fares against another ensemble algorithm which is given the same BL (Majority Vote), and to assess whether DEA is able to outperform each of its BL.
For each dataset, the performance of the algorithms was computed using the prequential method (interleaved test-then-train): when an unlabeled observation is received, the algorithm is first asked to predict its label and the prediction error is recorded (test). Once the true label is released, the classifier is trained with this labeled observation (train). This method has the advantage of making use of the whole dataset.
5 Results and discussion
The accuracy (percentage of correct classifications) obtained by each algorithm on each dataset is reported in Table 2. Bold numbers indicate the best performing algorithm. The bottom two lines show the average accuracy as well as the average rank obtained by each algorithm over all the datasets.
The results indicate that DEA obtained the best average accuracy as well as the best average rank on the 25 datasets considered. In particular, the average rank demonstrates the ability of DEA to perform consistently well regardless of the characteristics of the dataset and of the type of drift encountered. This is an interesting property because it is often impossible to predict how the stream will evolve over time; an algorithm which can deal with a very diversified set of environments is therefore useful, as it wouldn't be possible to pick, right from the beginning, the algorithm best suited for the whole dataset.
This good performance also confirms that using the local expertise of the BL as a selection criterion to decide which subset will be used for prediction should be considered as a way to improve the performance of an ensemble learning algorithm. Indeed, DEA outperformed the ensemble learning algorithms which rely on the latest performances to weight their BL (ADACC, DWM, AUE, ...) as well as a Majority Vote algorithm which simply asks all the algorithms to collaborate for prediction, independently of the observation received.
6 Conclusion
Learning on a data stream subject to concept drifts is a challenging task. The hidden underlying distribution on which the algorithm is trying to learn can change in many unexpected ways, requiring an algorithm capable of good performance regardless of the environment encountered.
In order to tackle this issue, we have proposed the Droplets Ensemble Algorithm (DEA), a novel algorithm which combines the properties of an instance-based learning algorithm with those of an ensemble learning algorithm. It maintains in memory a set of hyperspheres, each of which includes a pointer to the BL which is the most likely to obtain the best performance in the region of the feature space around that observation. When a new observation is received, it selects the BL which are likely to obtain the best performance in this region and uses them for prediction.
The experiments carried out on a set of 25 diversified datasets, reproducing a wide variety of drifts, show that our algorithm is able to outperform each of its base learners, a majority vote algorithm using the same base learners, as well as SAM-KNN (one of the currently best adaptive algorithms), by obtaining the best average accuracy and rank. These results indicate that our algorithm is well suited to be used as a general-purpose algorithm for predicting on data streams with concept drifts, and that taking into account the local expertise of each BL should be considered in order to improve the performance of an ensemble learning algorithm.
The algorithm can still be further improved and future work will focus on improving the efficiency of the search algorithm.
References
[1] Losing, V., Hammer, B., & Wersing, H. (2016). KNN Classifier with Self Adjusting Memory for Heterogeneous Concept Drift. In: ICDM 2016.
[2] Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009). A detailed analysis
of the KDD CUP 99 data set. In Proceedings of the second ieee international
conference on computational intelligence in security and defense applications (pp.
53–58).
[3] Katakis, I., Tsoumakas, G., Vlahavas, I.: Tracking recurring contexts using ensem-
ble classifiers: an application to email filtering. In: Knowledge and Information
Systems, 22(3), pp. 371–391 (2010)
[4] Brzezinski, D., & Stefanowski, J. (2014). Reacting to different types of concept
drift: the Accuracy Updated Ensemble algorithm. IEEE Transactions on Neural
Networks and Learning Systems, 25(1), 81–94.
[5] Elwell, R., & Polikar, R. (2011). Incremental Learning of Concept Drift in Nonsta-
tionary Environments. IEEE Transactions on Neural Networks, 22(10), 1517–1531.
[6] Kolter, J. Z., Maloof, M. A.: Dynamic weighted majority: A new ensemble method for tracking concept drift. In: Data Mining, 2003. ICDM 2003. Third IEEE International Conference, pp. 123–130 (2003)
[7] Bifet, A., Holmes, G., Pfahringer, B.: Leveraging bagging for evolving data streams.
In: Machine Learning and Knowledge Discovery in Databases, pp. 135–150.
Springer Berlin Heidelberg (2010)
[8] I. Katakis, G. Tsoumakas, I. Vlahavas, “An Ensemble of Classifiers for coping
with Recurring Contexts in Data Streams”, 18th European Conference on Artificial
Intelligence, IOS Press, Patras, Greece, 2008.
[9] Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., & Gavaldà, R. (2009). New en-
semble methods for evolving data streams. Proceedings of the 15th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining - KDD ’09.
[10] N. Oza and S. Russell. Online bagging and boosting. In Artificial Intelligence and
Statistics 2001, pages 105–112. Morgan Kaufmann, 2001.
[11] Jaber, G., Cornuéjols, A., & Tarroux, P. (2013). A new on-line learning method
for coping with recurring concepts: The ADACC system. In Lecture Notes in Com-
puter Science (including subseries Lecture Notes in Artificial Intelligence and Lec-
ture Notes in Bioinformatics) (Vol. 8227 LNCS, pp. 595–604).
[12] Bifet, A., & Gavaldà, R. (2009). Adaptive Learning from Evolving Data Streams.
Proceedings of the 8th International Symposium on Intelligent Data Analysis,
249–260.
[13] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone.
Classification and Regression Trees. Wadsworth, 1984.
[14] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Trans. on Knowl. and Data Eng., 5(6):914–925, 1993.
[15] P. Domingos and G. Hulten. Mining high-speed data streams. In Knowledge Dis-
covery and Data Mining, pages 71–80, 2000.