Classifier Concept Drift Detection
and the Illusion of Progress
Albert Bifet
LTCI, Télécom ParisTech
Université Paris-Saclay
75013, Paris, France
albert.bifet@telecom-paristech.fr
Abstract. When a new concept drift detection method is proposed, a common way to show its benefits is to use a classifier in an evaluation where, each time the new algorithm detects change, the current classifier is replaced by a new one. Accuracy in this setting is considered a good measure of the quality of the change detector. In this paper we claim that this is not a good evaluation methodology, and we show how a non-change detector can improve the accuracy of the classifier in this setting. We claim that this is due to the existence of a temporal dependence in the data, and we propose not to evaluate concept drift detectors using only classifiers.
Keywords: concept drift, data streams, incremental, classification, evolving, online
1 Introduction
IoT Analytics is a term used to identify machine learning done using data streams from the Internet of Things (IoT). When dealing with IoT data streams, or data streams in general, drift detection is a very important component of adaptive modeling, since detecting a change gives a signal about when to adapt models [21, 17, 22]. Typically, the streaming error of predictive models is monitored, and when the detector raises a change alarm, the model is updated or replaced by a new one.
We start by discussing an example of how researchers evaluate a concept drift
detector using two real datasets representing a data stream. The Electricity
dataset due to [12] is a popular benchmark for testing adaptive classifiers. It
has been used in over 50 concept drift experiments, for instance, [9, 15, 6, 18].
The Electricity Dataset was collected from the Australian New South Wales
Electricity Market. In this market, prices are not fixed and are affected by the
demand and supply of the market. Prices are set every five minutes. The dataset
contains 45,312 instances which record electricity prices at 30 minute intervals.
The class label identifies the direction of the price change (UP or DOWN) relative to a moving average of the last 24 hours. The data is subject to concept drift due to
changing consumption habits, unexpected events and seasonality.
Table 1: Evaluation results of an adaptive Naive Bayes classifier on Electricity and CoverType datasets.

                   Forest Covertype       Electricity
Change Detector    Accuracy   κ           Accuracy   κ
ADWIN              83.24      73.25       81.03      60.79
CUSUM              81.55      70.66       79.21      56.83
DDM                88.03      80.78       81.18      61.14
Page-Hinckley      80.06      68.40       78.04      54.43
EDDM               86.08      77.67       84.83      68.96
No-Change          88.79      81.97       86.16      71.65
The second dataset, Forest Covertype, contains the forest cover type for 30×30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 581,012 instances and 54 attributes, and has been used in several papers on data stream classification.
Let us test two state-of-the-art data stream classifiers on these datasets. We test an incremental Naive Bayes classifier, and an incremental (streaming) decision tree learner. As a streaming decision tree, we use the Hoeffding Tree [13] with functional leaves, using Naive Bayes classifiers at the leaves. The Hoeffding Tree employs a strategy based on the Hoeffding bound to incrementally grow a decision tree. A node is expanded by splitting as soon as there is sufficient statistical evidence, based on the data seen so far, to support the split, and this decision is based on the distribution-independent Hoeffding bound.
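For reference, the Hoeffding bound states that for a random variable with range R, after n independent observations the sample mean deviates from the true mean by more than ε = sqrt(R² ln(1/δ) / (2n)) with probability at most δ; in the Hoeffding Tree, a split is made once the observed difference in merit between the two best attributes exceeds this ε.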
Tables 1 and 2 show the performance of a Naive Bayes classifier and a Hoeffding Tree classifier that use a change detector to start a new classifier when a change is detected. As we can see in these tables, the best performance is due to the No-Change detector. This detector outputs a change signal every 60 instances; it is a no-change detector in the sense that it is not detecting change in the stream. Surprisingly, the classifiers using this no-change detector obtain better results than those using the standard change detectors.
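As a minimal sketch (in Python, not the actual MOA code), such a no-change "detector" ignores its input entirely and simply fires on a fixed schedule:

```python
class NoChangeDetector:
    """Signals a 'change' every `period` instances, ignoring the input entirely.

    Illustrative sketch only; the parameter name `period` is ours, the paper uses 60.
    """
    def __init__(self, period=60):
        self.period = period
        self.count = 0

    def add_element(self, value):
        # The incoming value (e.g., the classifier's 0/1 error) is ignored.
        self.count += 1
        if self.count % self.period == 0:
            return True   # "change" detected: the wrapped classifier is reset
        return False
```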
This experiment shows us that it is not enough to report the performance of a change detector working with a classifier. There is a need to use other evaluation techniques.
In Section 2, we present the state of the art of change detector algorithms. In
Section 3, we perform an experimental evaluation of concept drift detectors not
using classifiers. In Section 4, we discuss temporal dependence in data streams,
and Section 5 concludes the paper.
2 Change detectors
Table 2: Evaluation results of an adaptive Hoeffding Tree classifier on Electricity and CoverType datasets.

                   Forest Covertype       Electricity
Change Detector    Accuracy   κ           Accuracy   κ
ADWIN              83.36      73.37       83.23      65.41
CUSUM              83.01      72.91       81.71      62.05
DDM                87.35      79.71       85.41      70.05
Page-Hinckley      81.65      70.75       81.95      62.60
EDDM               86.00      77.48       84.91      69.08
No-Change          88.04      80.71       85.54      70.27

A change detector or drift detector is an algorithm that takes a stream of instances as input and outputs an alarm if it detects a change in the distribution of the data. A detector may often be combined with a predictive model to output a prediction of the next instance to come.
In general, the input to a change detection algorithm is a sequence x_1, x_2, ..., x_t, ... of data points whose distribution varies over time in an unknown way. At each time step the algorithm outputs:
1. an estimate of the parameters of the input distribution, and
2. an alarm signal indicating whether a change in this distribution has recently
occurred, or not.
We consider a specific but very frequent case of this setting: all x_t are real values. The desired estimate is usually the current expected value of x_t, and sometimes other statistics of the distribution such as, for instance, the variance. The only assumption about the distribution of x is that each x_t is drawn independently from the others. This assumption may not be satisfied if x_t is an error produced by a classifier that updates itself incrementally, because the update depends on the performance, and the next performance depends on whether we updated it correctly. In practice, however, this effect is negligible, so treating the x_t as independent is a reasonable approach.
The most general structure of a change detection algorithm contains three
components:
1. Memory is the component where the algorithm stores the sample data or
data summaries that are considered to be relevant at the current time, i.e.,
the ones that describe the current data distribution.
2. Estimator is an algorithm that estimates the desired statistics on the in-
put data, which may change over time. The algorithm may or may not use
the data contained in Memory. One of the simplest Estimator algorithms
is the linear estimator, which simply returns the average of the data items
contained in Memory. Other examples of run-time efficient estimators are
Auto-Regressive, Auto Regressive Moving Average, and Kalman filters [14].
3. Change detector (hypothesis testing) outputs an alarm signal when it detects
a change in the input data distribution. It uses the output of the Estimator,
and may or may not in addition use the contents of Memory.
[Fig. 1: General framework of a change detector: the input x_t feeds the Estimator (backed by Memory); the Estimator's estimation g_t feeds the Change Detector, which raises the Alarm.]
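To make the interplay of the three components concrete, the following toy Python sketch wires a Memory, an Estimator, and a hypothesis test together; the class name, window length, and threshold are illustrative choices of ours, not a standard algorithm:

```python
from collections import deque

class SlidingWindowDetector:
    """Toy instance of the Memory/Estimator/Change-detector structure.

    Memory: a fixed-length window of recent values.
    Estimator: the mean of that window.
    Change detector: alarms when the recent mean drifts far from the long-run mean.
    The window size and threshold are arbitrary illustrative choices.
    """
    def __init__(self, window=100, threshold=0.1):
        self.memory = deque(maxlen=window)   # Memory component
        self.long_run_sum = 0.0
        self.long_run_n = 0
        self.threshold = threshold

    def update(self, x_t):
        self.memory.append(x_t)
        self.long_run_sum += x_t
        self.long_run_n += 1
        estimate = sum(self.memory) / len(self.memory)           # Estimator
        long_run_mean = self.long_run_sum / self.long_run_n
        alarm = abs(estimate - long_run_mean) > self.threshold   # hypothesis test
        return estimate, alarm
```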
There are many different algorithms to detect change in time series. We discuss the classical ones used in statistical quality control [2], time series analysis [19], and general statistical methods, as well as more recent ones such as ADWIN [4].
2.1 Statistical Tests with Stopping Rules
These tests decide between the hypothesis that there is change and the hypothesis that there is no change, using a stopping rule. When the stopping rule is triggered, the change detector signals a change. The following methods differ in their stopping rules.
The CUSUM Test. The cumulative sum (CUSUM) algorithm, first proposed in [16], is a change detection algorithm that raises an alarm when the mean of the input data is significantly different from zero. The CUSUM input ε_t can be any filter residual, for instance the prediction error from a Kalman filter. The stopping rule of the CUSUM test is as follows:

    g_0 = 0
    g_t = max(0, g_{t-1} + ε_t − υ)
    if g_t > h then alarm and g_t = 0

The CUSUM test is memoryless, and its accuracy depends on the choice of the parameters υ and h. Note that CUSUM is a one-sided, or asymmetric, test: it assumes that changes can happen only in one direction of the monitored statistic, detecting only increases.
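As an illustration of this stopping rule (a sketch, not a reference implementation), with ε_t denoting the input residual:

```python
class CUSUM:
    """One-sided CUSUM test: alarms when the mean of the input drifts upward."""
    def __init__(self, upsilon=0.005, h=50.0):
        self.upsilon = upsilon   # allowed drift per step
        self.h = h               # alarm threshold
        self.g = 0.0

    def add_element(self, eps_t):
        self.g = max(0.0, self.g + eps_t - self.upsilon)
        if self.g > self.h:
            self.g = 0.0
            return True   # alarm
        return False
```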
The Page-Hinckley Test. The Page-Hinckley test [16] stopping rule is as follows, when the signal is increasing:

    g_0 = 0,  g_t = g_{t-1} + (ε_t − υ)
    G_t = min(g_t, G_{t-1})
    if g_t − G_t > h then alarm and g_t = 0

When the signal is decreasing, instead of G_t = min(g_t, G_{t-1}) we should use G_t = max(g_t, G_{t-1}), with G_t − g_t > h as the stopping rule. Like the CUSUM test, the Page-Hinckley test is memoryless, and its accuracy depends on the choice of the parameters υ and h.
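The corresponding sketch for the increasing-signal Page-Hinckley rule is almost identical; here we also reset G_t after an alarm, which the text above does not specify:

```python
class PageHinckley:
    """Page-Hinckley test for an increasing signal, following the stopping rule above."""
    def __init__(self, upsilon=0.005, h=50.0):
        self.upsilon = upsilon
        self.h = h
        self.g = 0.0
        self.G = 0.0   # running minimum of g

    def add_element(self, eps_t):
        self.g += eps_t - self.upsilon
        self.G = min(self.G, self.g)
        if self.g - self.G > self.h:
            self.g = 0.0
            self.G = 0.0   # reset of G after an alarm is our simplification
            return True    # alarm
        return False
```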
2.2 Drift Detection Method
The drift detection method (DDM) proposed by Gama et al. [10] monitors the number of errors produced by the learning model during prediction. It compares the statistics of two windows: the first contains all the data, and the second contains only the data from the beginning until the number of errors increases. The method does not store these windows in memory; it keeps only the statistics and a window of recent errors.
The number of errors in a sample of n examples is modelled by a binomial distribution. For each point t in the sequence that is being sampled, the error rate is the probability of misclassifying (p_t), with standard deviation s_t = sqrt(p_t (1 − p_t) / t). They assume that the error rate of the learning algorithm (p_t) will decrease as the number of examples increases, provided the distribution of the examples is stationary. A significant increase in the error of the algorithm suggests that the class distribution is changing and, hence, that the current decision model has become inappropriate. Thus, they store the values of p_t and s_t when p_t + s_t reaches its minimum value during the process (obtaining p_min and s_min). DDM then checks whether the following conditions trigger:

- p_t + s_t ≥ p_min + 2 · s_min for the warning level. Beyond this level, the examples are stored in anticipation of a possible change of context.
- p_t + s_t ≥ p_min + 3 · s_min for the drift level. Beyond this level the concept drift is supposed to be true, the model induced by the learning method is reset, and a new model is learnt using the examples stored since the warning level triggered. The values of p_min and s_min are reset.

In the standard notation, they have two hypothesis tests, h_w for warning and h_d for detection:

    g_t = p_t + s_t
    if g_t > h_w, then alarm warning,
    if g_t > h_d, then alarm detection,

where h_w = p_min + 2 s_min and h_d = p_min + 3 s_min.
The test is nearly memoryless: it only needs to store the statistics p_t and s_t, as well as to switch on some memory to store the examples from the time of warning until the time of detection.
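The warning and drift tests can be summarised in the following sketch, driven by the 0/1 error of the classifier. The short warm-up before testing is a practical choice of ours, not part of the description above, and the storage of examples between warning and drift is omitted:

```python
import math

class DDM:
    """Sketch of the DDM test: track the error rate p_t and its standard deviation s_t,
    and compare p_t + s_t against the recorded minimum p_min + s_min."""

    def __init__(self, min_instances=30):
        self.min_instances = min_instances  # warm-up before testing (a practical choice)
        self.reset()

    def reset(self):
        self.t = 0
        self.p = 1.0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def add_element(self, error):
        # error is 1 if the classifier misclassified the instance, 0 otherwise
        self.t += 1
        self.p += (error - self.p) / self.t               # incremental error-rate estimate
        s = math.sqrt(self.p * (1.0 - self.p) / self.t)
        if self.t < self.min_instances:
            return "in-control"
        if self.p + s < self.p_min + self.s_min:          # new minimum of p_t + s_t
            self.p_min, self.s_min = self.p, s
        # strict inequalities avoid a degenerate trigger when s_min = 0
        if self.p + s > self.p_min + 3.0 * self.s_min:    # drift level h_d
            self.reset()
            return "drift"
        if self.p + s > self.p_min + 2.0 * self.s_min:    # warning level h_w
            return "warning"
        return "in-control"
```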
This approach works well for detecting abrupt changes and reasonably fast changes, but it has difficulties detecting slow gradual changes. In the latter case, examples are stored for long periods of time, the drift level can take too long to trigger, and the stored examples may overflow the available memory.
Baena-García et al. proposed a new method, EDDM [1], in order to improve DDM. It is based on the estimated distribution of the distances between classification errors. The window resize procedure is governed by the same heuristics.
2.3 ADWIN: ADaptive sliding WINdow algorithm
ADWIN [3] is a change detector and estimator that solves, in a well-specified way, the problem of tracking the average of a stream of bits or real-valued numbers. ADWIN keeps a variable-length window of recently seen items, with the property that the window has the maximal length statistically consistent with the hypothesis "there has been no change in the average value inside the window".
More precisely, an older fragment of the window is dropped if and only if there is enough evidence that its average value differs from that of the rest of the window. This has two consequences: first, that change can reliably be declared whenever the window shrinks; and second, that at any time the average over the existing window can be reliably taken as an estimate of the current average in the stream (barring a very small or very recent change that is still not statistically visible). These two points appear in [3] as a formal theorem.
ADWIN is parameter- and assumption-free in the sense that it automatically detects and adapts to the current rate of change. Its only parameter is a confidence bound δ, indicating how confident we want to be in the algorithm's output, a parameter inherent to all algorithms dealing with random processes.
ADWIN does not maintain the window explicitly, but compresses it using a variant of the exponential histogram technique. This means that it keeps a window of length W using only O(log W) memory and O(log W) processing time per item.
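As a deliberately naive illustration of the window-cutting idea only: the real ADWIN compresses the window into exponential-histogram buckets to obtain the O(log W) costs, whereas here the window is stored explicitly and every split point is tested, and the cut threshold follows the Hoeffding-style bound with confidence δ as we recall it from [3]:

```python
import math

class NaiveADWIN:
    """Naive illustration of ADWIN's adaptive window: drop the oldest part of the
    window whenever two sub-windows have significantly different means.

    Unlike the real ADWIN, memory and time here are O(W) rather than O(log W).
    """
    def __init__(self, delta=0.002):
        self.delta = delta        # confidence parameter (illustrative default)
        self.window = []

    def add_element(self, x):
        self.window.append(x)
        change = False
        shrinking = True
        while shrinking and len(self.window) > 1:
            shrinking = False
            n = len(self.window)
            total = sum(self.window)
            head_sum = 0.0
            for i in range(1, n):                 # split into window[:i] and window[i:]
                head_sum += self.window[i - 1]
                n0, n1 = i, n - i
                mean0 = head_sum / n0
                mean1 = (total - head_sum) / n1
                m = 1.0 / (1.0 / n0 + 1.0 / n1)   # harmonic mean of the two sizes
                eps_cut = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
                if abs(mean0 - mean1) >= eps_cut:
                    self.window = self.window[i:]  # drop the older fragment
                    change = True
                    shrinking = True
                    break
        return change

    def estimate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0
```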
3 Concept Drift Evaluation
Change detection is a challenging task due to a fundamental limitation [11]: the
design of a change detector is a compromise between detecting true changes and
avoiding false alarms.
When designing a change detection algorithm one needs to balance false
and true alarms and minimize the time from the change actually happening to
detection. The following existing criteria [11, 2] formally capture these properties
for evaluating change detection methods.
Mean Time between False Alarms (MTFA) characterizes how often we get
false alarms when there is no change. The false alarm rate FAR is defined
as 1/MTFA. A good change detector would have high MTFA.
Table 3: Evaluation results with change of α = 0.0001.

Method                Measure  No Change   tc=1,000   tc=10,000   tc=100,000   tc=1,000,000
ADWIN                 1-MDR                0.13       1.00        1.00         1.00
                      MTD                  111.26     1,062.54    1,044.96     1,044.96
                      MTFA     5,315,789
                      MTR                  6,150      5,003       5,087        5,087
CUSUM (h=50)          1-MDR                0.41       1.00        1.00         1.00
                      MTD                  344.50     902.04      915.71       917.34
                      MTFA     59,133
                      MTR                  70         66          65           64
DDM                   1-MDR                0.44       1.00        1.00         1.00
                      MTD                  297.60     2,557.43    7,124.65     42,150.39
                      MTFA     1,905,660
                      MTR                  2,790      745         267          45
Page-Hinckley (h=50)  1-MDR                0.17       1.00        1.00         1.00
                      MTD                  137.10     1,320.46    1,403.49     1,431.88
                      MTFA     3,884,615
                      MTR                  4,769      2,942       2,768        2,713
EDDM                  1-MDR                0.95       1.00        1.00         1.00
                      MTD                  216.95     1,317.68    6,964.75     43,409.92
                      MTFA     37,146
                      MTR                  163        28          5            1
Table 4: Evaluation results with change at tc = 10,000.

Method          Measure  No Change      α=0.00001   α=0.0001   α=0.001
ADWIN           1-MDR                   1.00        1.00       1.00
                MTD                     4,919.34    1,062.54   261.59
                MTFA     5,315,789.47
                MTR                     1,080.59    5,002.89   20,320.76
CUSUM           1-MDR                   1.00        1.00       1.00
                MTD                     3,018.62    902.04     277.76
                MTFA     59,133.49
                MTR                     19.59       65.56      212.89
DDM             1-MDR                   0.55        1.00       1.00
                MTD                     3,055.48    2,557.43   779.20
                MTFA     1,905,660.38
                MTR                     345.81      745.15     2,445.67
Page-Hinckley   1-MDR                   1.00        1.00       1.00
                MTD                     4,659.20    1,320.46   405.50
                MTFA     3,884,615.38
                MTR                     833.75      2,941.88   9,579.70
EDDM            1-MDR                   0.99        1.00       1.00
                MTD                     4,608.01    1,317.68   472.47
                MTFA     37,146.01
                MTR                     7.98        28.19      78.62
Mean Time to Detection (MTD) characterizes the reactivity of the system
to changes after they occur. A good change detector would have small MTD.
Missed Detection Rate (MDR) gives the probability of not receiving an
alarm when there has been drift. It is the fraction of non-detected changes
in all the changes that happened. A good detector would have small or zero
MDR.
Average Run Length (ARL(θ)) generalizes over MTFA and MTD. It quantifies how long we have to wait before we detect a change of size θ in the variable that we are monitoring:

    ARL(θ = 0) = MTFA,   ARL(θ ≠ 0) = MTD
To compare change detectors fairly, the evaluation framework needs to know the ground-truth changes in the data. Thus, we need to use synthetic datasets with ground truth. Before a true change happens, all alarms are considered false alarms. After a true change occurs, the first detection that is flagged is considered the true alarm. After that, and before a new true change occurs, subsequent detections are considered false alarms. If no detection is flagged between two true changes, it is counted as a missed detection.
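This bookkeeping translates directly into code; the sketch below derives MTFA, MTD and MDR from the alarm times and the ground-truth change points (function and variable names are ours, and MTFA is estimated simply as the observed time divided by the number of false alarms):

```python
def evaluate_detections(detections, true_changes, stream_length):
    """Label each alarm as true or false and derive MTFA, MTD and MDR.

    detections:   sorted time indices at which the detector fired
    true_changes: sorted time indices of the ground-truth changes
    """
    false_alarms = 0
    delays = []
    missed = 0
    # Alarms before the first true change are all false alarms.
    false_alarms += sum(1 for d in detections if d < true_changes[0])
    segments = list(zip(true_changes, list(true_changes[1:]) + [stream_length]))
    for change, next_change in segments:
        in_segment = [d for d in detections if change <= d < next_change]
        if in_segment:
            delays.append(in_segment[0] - change)   # first alarm after the change
            false_alarms += len(in_segment) - 1     # later alarms count as false
        else:
            missed += 1
    mtfa = stream_length / false_alarms if false_alarms else float("inf")
    mtd = sum(delays) / len(delays) if delays else float("inf")
    mdr = missed / len(true_changes)
    return mtfa, mtd, mdr
```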
In [7] a new quality evaluation measure was proposed that monitors the compromise between fast detection and false alarms:

    MTR(θ) = (MTFA / MTD) × (1 − MDR) = (ARL(0) / ARL(θ)) × (1 − MDR).   (1)

This measure MTR (Mean Time Ratio) is the ratio between the mean time between false alarms and the mean time to detection, multiplied by the probability of detecting an alarm. An ideal change detection algorithm would have a low false positive rate (which means a high mean time between false alarms), a low mean time to detection, and a low missed detection rate.
Comparing two change detectors for a specific change θ is easy with this new measure: the algorithm with the highest MTR(θ) value is to be preferred.
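For example, plugging in the ADWIN values from Table 3 for the change at tc = 10,000 (MTFA = 5,315,789, MTD = 1,062.54, MDR = 0) gives MTR ≈ 5,315,789 / 1,062.54 ≈ 5,003, which is the value reported in the table.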
3.1 Comparative Experimental Evaluation
We performed a comparison using MOA [5] with the following methods: DDM, ADWIN, EDDM, the Page-Hinckley test, and the CUSUM test. The last two methods were used with υ = 0.005 and h = 50 by default.
The experiments were performed by simulating the error of a classifier system with a binary output, 0 or 1. The probability of having an error is kept at 0.2 during the first tc instances, and then it changes gradually, increasing linearly by a value of α for each instance. The results were averaged over 100 runs.
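A sketch of this simulated error stream (names are ours; the actual experimental code may differ in details):

```python
import random

def simulate_error_stream(t_c, alpha, length, p0=0.2, seed=None):
    """Yield a stream of 0/1 'classifier errors'.

    The error probability stays at p0 for the first t_c instances and then
    increases linearly by alpha per instance (capped at 1.0).
    """
    rng = random.Random(seed)
    for t in range(length):
        p = p0 if t < t_c else min(1.0, p0 + alpha * (t - t_c))
        yield 1 if rng.random() < p else 0
```

Feeding such a stream to each detector and averaging the alarm times over 100 independent runs yields the kind of figures reported in Tables 3-5.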
Tables 3 and 4 show the results. Each row represents an experiment where four different drifts occur at different times in Table 3, and four different drifts with different incremental values occur in Table 4. Note that the MTFA values come from the no-change scenario. We observe the tradeoff between faster detection and a smaller number of false alarms. Page-Hinckley with h = 50 and ADWIN are the methods with the fewest false positives; however, CUSUM is faster at detecting change for some change values. Using the new measure MTR, ADWIN seems to be the algorithm with the best results.
In a second experiment, we apply three different Page-Hinckley tests with three different values h = 25, 50, 75. Table 5 contains the results. We observe that for h = 25, the test is the fastest at detecting change, but it has more false positives. On the other hand, with h = 50, there are fewer false positives, but the detection is slower. Looking at the new MTR measure, the test with h = 75 performs better than with the other values. This type of test has the property that by increasing h we can reduce the number of false positives, at the expense of increasing the detection delay.

Table 5: Evaluation results of different Page-Hinckley tests with change of α = 0.0001.

Method                Measure  No Change      tc=1,000   tc=10,000   tc=100,000   tc=1,000,000
Page-Hinckley (h=25)  1-MDR                   0.17       1.00        1.00         1.00
                      MTD                     137.10     1,315.35    1,396.56     1,386.92
                      MTFA     1,346,666.67
                      MTR                     1,653.31   1,023.81    964.27       970.98
Page-Hinckley (h=50)  1-MDR                   0.17       1.00        1.00         1.00
                      MTD                     137.10     1,320.46    1,403.49     1,431.88
                      MTFA     3,884,615.38
                      MTR                     4,769.15   2,941.88    2,767.84     2,712.95
Page-Hinckley (h=75)  1-MDR                   0.17       1.00        1.00         1.00
                      MTD                     137.10     1,326.01    1,410.96     1,473.90
                      MTFA     4,208,333.33
                      MTR                     5,166.58   3,173.68    2,982.60     2,855.23
4 Temporal Dependence in Data Streams
The excellent results of a No-Change Detector that outputs change every 60 instances on the Electricity dataset are surprising. The reason for this good performance could be the temporal dependence in the Electricity dataset [20, 8, 23]. If the price goes UP now, it is more likely than by chance to go UP again, and vice versa. Secondly, the prior distribution of classes in this data stream is evolving. Figure 2 plots the class distribution of this dataset over a sliding window of 1,000 instances and the autocorrelation function of the target label. We can see that the data is heavily autocorrelated, with very clear cyclical peaks every 48 instances (24 hours), due to electricity consumption habits.
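The two quantities plotted in Figure 2 can be recomputed from the raw label sequence; a sketch (assuming `labels` is the 0/1 class sequence of the Electricity stream):

```python
import numpy as np

def sliding_class_prior(labels, window=1000):
    """Fraction of positive labels in a sliding window (multiply by 100 for percent)."""
    labels = np.asarray(labels, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(labels, kernel, mode="valid")

def label_autocorrelation(labels, max_lag=200):
    """Sample autocorrelation of the label sequence for lags 1..max_lag."""
    x = np.asarray(labels, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return [np.dot(x[:-lag], x[lag:]) / denom for lag in range(1, max_lag + 1)]
```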
[Fig. 2: Characteristics of the Electricity dataset, showing the class prior (%) over time and the autocorrelation of the target label as a function of the lag in instances.]

Table 6: Evaluation results of a No-Change classifier on Electricity and CoverType datasets.

                       Forest Covertype       Electricity
Classifier             Accuracy   κ           Accuracy   κ
No-Change Classifier   95.06      92.07       85.33      69.98

Let us consider a No-Change classifier that uses temporal dependence information by predicting that the next class label will be the same as the last seen class label. It can be compared to a naive weather forecasting rule: the weather
tomorrow will be the same as today. The performance of this classifier is shown in Table 6. We see that this classifier obtains better results than most of the classifiers using state-of-the-art concept drift techniques. In particular, on the Forest Covertype dataset the performance of this No-Change classifier is much better than that of the methods using concept drift detection techniques.
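A sketch of this No-Change classifier under a prequential (first predict, then store the true label) evaluation; the loop structure is ours:

```python
def no_change_classifier_accuracy(labels):
    """Prequential accuracy of the rule 'predict the previously seen class label'."""
    correct, last = 0, None
    for y in labels:
        if last is not None and last == y:
            correct += 1
        last = y   # "training" step: remember the last true label
    return correct / (len(labels) - 1)   # assumes at least two labels
```

On a stream with strong temporal dependence such as Electricity, this trivial rule already reaches the accuracy reported in Table 6.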
5 Conclusions
Change detection is an important component of systems that need to adapt to changes in their input data. We discussed the surprising result that non-change detectors can outperform change detectors in a streaming classification evaluation. We argued that this may be due to the temporal dependence in the data, and that the evaluation of change detectors should not be done using only classifiers. We hope that this paper will open several directions for future research.
Acknowledgments.
This work was supported by the Polish National Science Center under Grant
No. 2014/15/B/ST7/05264.
References
1. M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavaldà, and R. Morales-Bueno. Early drift detection method. In Fourth International Workshop on Knowledge Discovery from Data Streams, 2006.
2. M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.
3. A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. In SIAM International Conference on Data Mining, 2007.
4. A. Bifet and R. Gavaldà. Adaptive learning from evolving data streams. In 8th International Symposium on Intelligent Data Analysis, pages 249–260, 2009.
5. A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. J. of Mach. Learn. Res., 11:1601–1604, 2010.
6. A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà. New ensemble methods for evolving data streams. In Proc. of the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, KDD, pages 139–148, 2009.
7. A. Bifet, J. Read, B. Pfahringer, G. Holmes, and I. Zliobaite. CD-MOA: Change detection framework for massive online analysis. In Advances in Intelligent Data Analysis XII - 12th International Symposium, IDA 2013, London, UK, October 17-19, 2013, Proceedings, pages 92–103, 2013.
8. A. Bifet, J. Read, I. Zliobaite, B. Pfahringer, and G. Holmes. Pitfalls in benchmarking data stream classification and how to avoid them. In ECML PKDD 2013, Proceedings, Part I, pages 465–479, 2013.
9. J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In Brazilian Symp. on Artificial Intelligence, SBIA, pages 286–295, 2004.
10. J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In SBIA Brazilian Symposium on Artificial Intelligence, pages 286–295, 2004.
11. F. Gustafsson. Adaptive Filtering and Change Detection. Wiley, 2000.
12. M. Harries. SPLICE-2 comparative evaluation: Electricity pricing. Tech. report, University of New South Wales, 1999.
13. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proc. of the 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, KDD, pages 97–106, 2001.
14. H. Kobayashi, B. L. Mark, and W. Turin. Probability, Random Processes, and Statistical Analysis. Cambridge University Press, 2011.
15. J. Kolter and M. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. J. of Mach. Learn. Res., 8:2755–2790, 2007.
16. E. S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
17. J. Read, A. Bifet, B. Pfahringer, and G. Holmes. Batch-incremental versus instance-incremental learning in dynamic and evolving data. In IDA 2012, Proceedings, pages 313–323, 2012.
18. G. Ross, N. Adams, D. Tasoulis, and D. Hand. Exponentially weighted moving average charts for detecting concept drift. Pattern Recogn. Lett., 33:191–198, 2012.
19. J. Takeuchi and K. Yamanishi. A unifying framework for detecting outliers and change points from time series. IEEE Tr. Knowl. Data Eng., 18(4):482–492, 2006.
20. I. Zliobaite. How good is the electricity benchmark for evaluating concept drift adaptation. CoRR, abs/1301.3524, 2013.
21. I. Zliobaite, A. Bifet, M. M. Gaber, B. Gabrys, J. Gama, L. L. Minku, and K. Musial. Next challenges for adaptive learning systems. SIGKDD Explorations, 14(1):48–55, 2012.
22. I. Zliobaite, A. Bifet, G. Holmes, and B. Pfahringer. MOA concept drift active learning strategies for streaming data. In WAPA 2011, pages 48–55, 2011.
23. I. Zliobaite, A. Bifet, J. Read, B. Pfahringer, and G. Holmes. Evaluation methods and decision theory for classification of streaming data with temporal dependence. Machine Learning, 98(3):455–482, 2015.