
Regression Trees from Data Streams with Drift Detection


Elena Ikonomovska1, João Gama2,3, Raquel Sebastião2,4, Dejan Gjorgjevik1
1 FEEIT – Ss. Cyril and Methodius University, Karpos II bb, 1000 Skopje, Macedonia
2 LIAAD/INESC – University of Porto, Rua de Ceuta, 118 – 6, 4050-190 Porto, Portugal
3 Faculty of Economics – University of Porto, Rua Roberto Frias, 4200 Porto, Portugal
4 Faculty of Science – University of Porto, R. Campo Alegre 823, 4100 Porto, Portugal
Abstract. The problem of extracting meaningful patterns from time-changing
data streams is of increasing importance for the machine learning and data
mining communities. We present an algorithm that learns regression trees from
fast and unbounded data streams in the presence of concept drift. To the best
of our knowledge, there is no other algorithm for incremental learning of
regression trees equipped with change detection abilities. The FIRT-DD
algorithm has mechanisms for drift detection and model adaptation, which make
it possible to maintain accurate and up-to-date regression models at any time.
The drift detection mechanism is based on sequential statistical tests that
track the evolution of the local error at each node of the tree and inform the
learning process of the detected changes. In response to a local drift, the
algorithm adapts the model only locally, avoiding the need for a global model
adaptation. The adaptation strategy consists of building a new tree whenever a
change is suspected in a region and replacing the old one when the new tree
becomes more accurate. This enables smooth and granular adaptation of the
global model. The results of the empirical evaluation, performed over several
different types of drift, show that the algorithm detects concept drift
consistently and adapts to it properly.
Keywords: data stream, regression trees, concept drift, change detection,
stream data mining.
1 Introduction
Our environment is naturally dynamic, constantly changing in time. Huge amounts of
data are constantly generated by various dynamic systems and applications.
Real-time surveillance systems, telecommunication systems, sensor networks and other
dynamic environments are such examples. Learning algorithms that model the
underlying processes must be able to track this behavior and adapt the decision models
accordingly. The problem takes the form of changes in the target function or the data
distribution over time, and is known as concept drift. Examples of real-world
problems where drift detection is relevant include user modeling, real-time monitoring
of industrial processes, fault detection, fraud detection, spam filtering, safety of
complex systems, and many others [1]. In all these dynamic processes, new concepts
replace old concepts, while the interest of the final user is always to have available a model that
will describe or accurately predict the state of the underlying process at any time.
Therefore, drift detection is clearly important when learning from data streams
and must be taken into consideration. Most machine learning algorithms,
including the FIMT algorithm [2], assume that the training data is generated by
a single concept from a stationary distribution, and are designed for static
environments. However, when learning from data streams, dynamic distributions are the
rule rather than the exception. To meet the challenges posed by the dynamic environment,
learning algorithms must be able to detect changes and react properly in time. This is the
challenge we address in this work: how to embed change detection mechanisms inside a
regression tree learning algorithm and adapt the model properly.
Bearing in mind the importance of the concept drift problem when learning
from data streams, we have studied the effect of local and global drift on the
accuracy and the structure of the learned regression tree. We propose the FIRT-DD (Fast
and Incremental Regression Tree with Drift Detection) algorithm, which is able to
learn regression trees from possibly unbounded, high-speed and time-changing data
streams. The FIRT-DD algorithm has mechanisms for drift detection and model
adaptation, which make it possible to maintain an accurate and up-to-date model at any
time. The drift detection mechanism consists of distributed statistical tests that track the
evolution of the error in every region of the instance space, and inform the algorithm about
significant changes that have affected the learned model locally or globally. If the
drift is local (affects only some regions of the instance space), the algorithm is
able to localize the change and update only those parts of the model that cover the
affected regions.
The paper is organized as follows. In the next section we present the related work
in the field of drift detection when learning in dynamic environments. Section 3
describes our new algorithm FIRT-DD. Section 4 describes the experimental evaluation
and presents a discussion of the obtained results. We conclude in Section 5 and give
further directions.
2 Learning with Drift Detection
The nature of change is diverse. Changes may occur in the context of learning, due to
changes in hidden variables or in the intrinsic properties of the observed variables.
Often these changes make the model built on old data inconsistent with the new data,
and regular updating of the model is necessary. In this work we look for changes in
the joint probability P(X, Y), in particular for changes in the Y values given the
attribute values X, that is P(Y|X). This is usually called concept drift. There are two
types of drift that are commonly distinguished in the literature: abrupt (sudden,
instantaneous) and gradual concept drift. We can also make a distinction between local and
global drift. The local type of drift affects only some parts of the instance space, while
global concept drift affects the whole instance space. Hidden changes in the joint
probability may also cause a change in the underlying data distribution, which is
usually referred to as virtual concept drift (sampling shift). A good review of the types
of concept drift and the existing approaches to the problem is given in [3, 4]. We
distinguish three main approaches for dealing with concept drift:
1. Methods based on data management. These include weighting of examples, or
example selection using time-windows with fixed or adaptive size. Relevant work
is [5].
2. Methods that explicitly detect a change point or a small window where change
occurs. They may follow two different approaches: (1) monitoring the evolution of
some drift indicators [4], or (2) monitoring the data distribution over two different
time-windows. Examples of the former are the FLORA family of algorithms [6],
and the works of Klinkenberg presented in [7, 8]. Examples of the latter are the
algorithms presented in [9, 10].
3. Methods based on managing ensembles of decision models. The key idea is to
maintain several different decision models that correspond to different data
distributions and manage an ensemble of decision models according to the changes in
the performance. All ensemble-based methods use some criteria to dynamically
delete, reactivate or create new ensemble members, which are normally based on the
model’s consistency with the current data [3]. Such examples are [11, 12].
The adaptation strategies are usually divided into blind and informed methods. The
former adapt the model without any explicit detection of changes. These are usually
used with the data management methods (time-windows). The latter adapt
the model only after a change has been explicitly detected. These are usually used
with the drift detection methods and decision model management methods.
Our approach is motivated by the advantages of explicit detection and
informed adaptation methods, because they include information about the process
dynamics: a meaningful description of the change and a quantification of the changes.
Another important aspect of drift management methods that we adopt and stress is the
ability to detect the influence of local drift and adapt only parts of the learned decision
model. In the case of local concept drift, many global models are discarded simply because
their accuracy on the current data chunks falls, although they could be good experts for the
stable parts of the instance space. Therefore, the ability to incrementally update local
parts of the model when required is very important. An example of a system that
possesses this capability is the CVFDT system [13]. The CVFDT algorithm performs regular
periodic validation of its splitting decisions by maintaining the necessary statistics at
each node over a window of examples. Every time a split is found to be invalid, it
starts growing a new decision tree rooted at the corresponding node. The new sub-trees
grown in parallel are intended to replace the old ones, since they are generated using data
that corresponds to the new concepts. To smooth the process of adaptation, CVFDT
keeps the old tree rooted at the affected node until one of the new ones becomes
more accurate. Maintaining the necessary counts for class distributions at each node
requires a significant amount of additional memory and computation, especially when
the tree becomes large. We address this problem by using lightweight
detection units positioned in each node of the tree, which evaluate the goodness of the
split continuously for every region of the space using only a few incrementally
maintained variables. This approach does not require a significant additional amount of
memory or time and is therefore suitable for the streaming scenario, while at the same
time enabling drift localization.
3 The FIRT-DD Algorithm
The FIRT-DD algorithm is an adaptation of the FIMT algorithm [2] to dynamic
environments and time-changing distributions. FIMT is an incremental any-time
algorithm for learning model trees from data streams. FIMT builds trees following the
top-down approach, where each splitting decision is based on a set of examples
corresponding to a certain time period or a sequence of the data stream. Decisions made
in upper nodes of the tree are therefore based on older examples, while the leaves
receive the most current set of examples. Each node has a prediction obtained during
its growing phase. The FIMT algorithm can guarantee high asymptotic similarity of
the incrementally learned tree to the one learned in a batch manner if provided with
enough data. This is done by determining a bound on the probability of selecting the
best splitting attribute. The probability bound provides statistical support and
therefore stability to the splitting decision as long as the distribution of the data is
stationary. However, when the distribution is not stationary and the data contains concept
drifts, some of the splits become invalid. We approach this problem using statistical
process control methodologies for change detection, which are particularly suitable
for data streams. Statistical process control (SPC) methodologies [14] are capable of
handling large volumes of data and have been widely used to monitor, control and
improve the quality of industrial processes. In recent years some SPC techniques, such as
process residual charts, were developed to accommodate auto-correlated data.
3.1 A Fully Distributed Change Detection Method based on Statistical Tests
Although the regression tree is a global model it can be decomposed according to the
partitions of the instance space obtained with the recursive splitting. Namely, each
node with the sub-tree below covers a region (a hyper-rectangle) of the instance
space. The root node covers the whole space, while the descendant nodes cover
subspaces of the space covered by their parent node. When a concept drift occurs locally
in some parts of the instance space, it is much less costly to make adaptations only to
the models that correspond to that region of the instance space. The perception of the
possible advantages of localization of drift has led us to the idea of a fully distributed
change detection system.
In order to detect where in the instance space drift has occurred, we attach a
change detection unit to each node of the tree. The change detection units attached
to the nodes perform local, simultaneous and separate monitoring of every region
of the instance space. If a change has been detected, it suggests that the region of the
instance space covered by the node has been affected by a concept drift. The adaptation
will be made only at that sub-tree of the regression tree. This strategy has a major
advantage over global change detection methods, because the costs of updating the decision
model are significantly lower. Further, it can detect and adapt to changes in different
parts of the instance space at the same time, which speeds up the adaptation process.
The change detection units are designed following the approach of monitoring the
evolution of one or several concept drift indicators. The authors in [8] describe as
common indicators: performance measures, properties of the decision model, and
properties of the data. In the current implementation of our algorithm, we use the
absolute loss as the performance measure¹. When a change occurs in the target concept,
the current model no longer corresponds to the current state of nature and the absolute
loss will increase. This observation suggests a simple method to detect changes:
monitor the evolution of the loss. The absolute loss at each node is the absolute value
of the difference between the prediction (i.e. average of the target variable) from the
period the node was a leaf and the prediction now. If it starts increasing, it may be a
sign of change in the examples’ distribution or in the target concept. To tell with
confidence that there is a significant increase of the error due to a concept drift, we
propose continuously performing statistical tests at every internal node of the tree. The
test monitors the absolute loss at the node, tracking a major statistical increase,
which is taken as a sign that a change has occurred. The alarm of the statistical test
triggers the mechanisms for adaptation of the model.
In this work, we have considered two methods for monitoring the evolution of the
loss. We have studied the CUSUM algorithm [15] and the Page-Hinkley method [16],
both from the same author. The CUSUM charts were proposed by Page [15] to detect
small to moderate shifts in the process mean. Since we are only interested in detecting
increases of the error, the CUSUM algorithm is suitable to use. However, when
compared with the second option, the Page-Hinkley (PH) test [16], the PH test proved
more stable and detected changes more consistently. Therefore, in the
following sections we analyze and evaluate only the second option. The PH test
is a sequential adaptation of the detection of an abrupt change in the average of a
Gaussian signal [1]. This test considers two variables: a cumulative variable m_T, and
its minimum value M_T. The first variable m_T is defined as the cumulated difference
between the observed values and their mean up to the current moment, where T is the
number of all examples seen so far and x_t is the variable’s value at time t:
    m_T = \sum_{t=1}^{T} (x_t - \bar{x}_T - \alpha),    (1)

    \bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t,    (2)

Its minimum value after seeing T examples is computed using the following formula:

    M_T = \min\{ m_t, \; t = 1, \ldots, T \}.    (3)

At every moment the PH test monitors the difference between m_T and the minimum
M_T: PH_T = m_T - M_T. When this difference is greater than a given threshold (λ) it
signals a change in the distribution (i.e. PH_T > λ). The threshold parameter λ depends
on the admissible false alarm rate. Increasing λ will entail fewer false alarms, but
might miss some changes. The parameter α corresponds to the magnitude of changes
that are allowed.
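The PH test described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the class name `PageHinkley`, the helper `first_alarm` and the example threshold are assumptions made here, and the mean is updated incrementally (the running mean after each value is used in place of the mean over all T values), which is a common way of computing the test on a stream.

```python
class PageHinkley:
    """Incremental Page-Hinkley test for detecting an increase in the mean."""

    def __init__(self, alpha=0.05, lam=500.0):
        self.alpha = alpha      # magnitude of changes that are allowed
        self.lam = lam          # alarm threshold (lambda)
        self.n = 0              # number of values seen so far (T)
        self.mean = 0.0         # running mean of the monitored values
        self.m = 0.0            # cumulative variable m_T
        self.M = float("inf")   # minimum of m_t seen so far (M_T)

    def update(self, x):
        """Feed one value; return True when PH_T = m_T - M_T exceeds lambda."""
        self.n += 1
        self.mean += (x - self.mean) / self.n   # running-mean approximation
        self.m += x - self.mean - self.alpha
        self.M = min(self.M, self.m)
        return (self.m - self.M) > self.lam


def first_alarm(values, alpha, lam):
    """Index of the first value at which the test alarms, or None."""
    detector = PageHinkley(alpha, lam)
    for i, x in enumerate(values):
        if detector.update(x):
            return i
    return None


# Hypothetical loss sequence: stable at 1.0, then jumping to 5.0.
stream = [1.0] * 500 + [5.0] * 500
alarm_index = first_alarm(stream, alpha=0.05, lam=50.0)
```

With a smaller λ the alarm is raised after fewer post-change values, at the cost of more false alarms on noisy input, which is the trade-off noted in the text.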
¹ Other measures, such as the squared loss, could be used. For the purposes of change
detection both measures provide similar results.
Each example from the data stream traverses the tree until it reaches one of the
leaves, where the statistics necessary for building the tree are maintained. On its path
from the root to the leaf it traverses several internal nodes, each of them equipped
with a drift detection unit that continuously performs the PH test, monitoring the
evolution of the absolute loss after every consecutive example passing through the
node. When the difference between the observed absolute loss and its average
increases fast and continuously, eventually exceeding a user-defined limit (λ), we
can confidently conclude that the predictions used in the computation of the absolute
loss are continuously and significantly wrong. This is an indication of an invalid
split at the corresponding node.
The computation of the absolute loss can be done using the prediction from the
node where the PH test is performed or from the parent of the leaf node where the
example will fall (the leaf node is not used because it is still in its growing phase). As
a consequence, we consider two methods for change detection and
localization. The prediction (whether it is from the node where the PH test is or from the
parent node of the leaf) can be: 1) the mean of y-values of the examples seen at that
node during its growing phase or; 2) the perceptron’s output for a given example
which is trained incrementally at the node. Preliminary experiments have shown that
the detection of the changes is more consistent and more reliable when using the first
option, that is the average of y values. If the loss is computed using the prediction
from the current node the computation can be performed while the example is passing
the node on its path to the leaf. Therefore, the loss will be monitored first at the root
node and then at its descendant nodes. Because the direction of monitoring the loss is
from the top towards the “bottom” of the tree, this method will be referred to as the
Top-Down (TD) method. If the loss is computed using the prediction from the parent of
the leaf node, the example must first reach the leaf and then the computed difference
at the leaf will be back-propagated towards the root node. While back-propagating the
loss (using the same path the example reached the leaf) the PH tests located in the
internal nodes will monitor the evolution. This reversed monitoring gives the name of
the second method for change detection which will be referred to as Bottom-Up (BU)
method. The idea for the BU method came from the following observation: with the
TD method the loss is being monitored from the root towards the leaves, starting with
the whole instance space and moving towards smaller sub-regions. While moving
towards the leaves, predictions in the internal nodes become more accurate which
reduces the probability of false alarms. This was additionally confirmed with the
empirical evaluation: most of the false alarms were triggered at the root node and its
immediate descendants. Besides this, the predictions from the leaves which are more
precise than those at the internal nodes (as a consequence of splitting) emphasize the
loss in case of drift. Therefore, using them in the process of change detection would
shorten the delays and reduce the number of false alarms. However, this approach has
one drawback, which concerns gradual and slow concept
drift. Namely, the tree grows in parallel with the process of monitoring the loss. If
drift is present, consecutive splitting enables new and better predictions according to
the new concepts, which may prevent proper change detection. The slower the drift,
the higher the probability of non-detection.
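The difference between the TD and BU methods can be illustrated with a small sketch. It is a simplified reading of the description above, not the FIRT-DD implementation: the `Node` and `Detector` classes, the binary threshold splits, the default α and λ, and the use of |y - prediction| as the absolute loss are all assumptions introduced for the example.

```python
class Detector:
    """Small Page-Hinkley-style detector for an increasing mean loss."""

    def __init__(self, alpha=0.05, lam=50.0):
        self.alpha, self.lam = alpha, lam
        self.n, self.mean, self.m, self.M = 0, 0.0, 0.0, float("inf")

    def update(self, loss):
        self.n += 1
        self.mean += (loss - self.mean) / self.n
        self.m += loss - self.mean - self.alpha
        self.M = min(self.M, self.m)
        return self.m - self.M > self.lam


class Node:
    """Tree node; a leaf has no children. `prediction` is a mean of y-values."""

    def __init__(self, prediction, attr=None, threshold=None, left=None, right=None):
        self.prediction = prediction
        self.attr, self.threshold = attr, threshold
        self.left, self.right = left, right
        self.detector = Detector()

    def is_leaf(self):
        return self.left is None


def internal_path(root, x):
    """Internal nodes visited by x, ordered from the root to the leaf's parent."""
    path, node = [], root
    while not node.is_leaf():
        path.append(node)
        node = node.left if x[node.attr] <= node.threshold else node.right
    return path


def monitor_top_down(root, x, y):
    """TD: each internal node on the path tests the loss of its own prediction."""
    alarms = []
    for node in internal_path(root, x):
        if node.detector.update(abs(y - node.prediction)):
            alarms.append(node)
    return alarms


def monitor_bottom_up(root, x, y):
    """BU: the loss of the leaf's parent is propagated back towards the root."""
    path = internal_path(root, x)
    if not path:
        return []
    loss = abs(y - path[-1].prediction)
    alarms = []
    for node in reversed(path):
        if node.detector.update(loss):
            alarms.append(node)
    return alarms
```

In the BU variant every detector on the path sees the same, more precise loss, whereas in the TD variant nodes close to the root see the coarser loss of their own prediction.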
3.2 Adapting to Concept Drift
The natural way of responding to concept drift for the incremental algorithm (if no
adaptation strategy is employed) would be to grow and append new sub-trees to the
old structure, which would eventually give correct predictions. Although the
predictions might be good, the structure of the whole tree would be completely wrong and
misleading. Therefore, proper adaptation of the structure is necessary. The FIRT-DD
without the change detection abilities is the incremental algorithm FIRT [2]. For
comparison, FIRT is also used in the evaluation and is referred to as “No detection”.
Our adaptation mechanism falls in the group of informed adaptation methods:
methods that modify the decision model only after a change has been detected. A common
adaptation to concept drift is forgetting the outdated and inappropriate models, and
re-learning new ones that reflect the current concept. The most straightforward
adaptation strategy consists of pruning the sub-models that correspond to the parts of
the instance space affected by the drift. If the change is properly detected, the
current model will be most consistent with the data and the concept being
modeled. Depending on whether the TD or the BU detection method is used, in the
empirical evaluation this strategy will be referred to as “TD Prune” and “BU Prune”,
respectively. However, when the change has been detected relatively high in the tree,
pruning the sub-tree will significantly decrease the number of predicting nodes
(leaves), which will lead to an unavoidable, drastic short-term degradation in accuracy. In
these circumstances, an outdated sub-tree may still be better than a single leaf. Instead
of pruning when a change is detected, we can start building an alternate tree (a new
tree) from the examples that reach the node. A similar strategy is used in CVFDT
[13], where several alternate trees can be grown at one node at the same time.
This is the general idea of the adaptation method proposed in the FIRT-DD
algorithm. When a change is detected, the change detection unit triggers the adaptation
mechanism for the node where the drift was detected. The node is marked for
regrowing, and new memory is allocated for maintaining the statistics
used for growing a leaf. Examples that traverse a marked node are used for
updating its statistics, as well as the statistics at the leaf node where they eventually
fall. The regrowing process initiates a new alternate tree rooted at the node, which
grows in parallel with the old one. Every example that reaches a node with an
alternate tree is used for growing both trees. The nodes in the alternate tree
do not perform change detection until the new tree replaces the
old one. The old tree is kept and grown in parallel until the new alternate tree
becomes more accurate.
However, if the detected change was a false alarm or the alternate tree cannot
achieve better accuracy, replacing the old tree might never happen. If the alternate
tree shows slow progress or starts to degrade in performance, this should be
considered a sign that growing should be stopped and the alternate tree should be
removed. In order to prevent reactions to false alarms, the node monitors the evolution
of the alternate tree and compares its accuracy with the accuracy of the original
sub-tree. This is performed by monitoring the average difference in accuracy with every
example reaching the node. The process of monitoring the average difference starts
after the node has received twice the chunk size of the growing process (e.g. 400
examples), which should be enough to determine the first split and to grow an
alternate tree with at least three nodes. When this number is reached, the node starts
to maintain the mean squared error for the old and the alternate tree simultaneously.
At a user-defined evaluation interval (e.g. 5000 examples) the difference of the
mean squared error between the old and the new tree is evaluated, and if it is positive
and at least 10% greater than the MSE of the old tree the new alternate tree will
replace the old one. The old tree will be removed, or alternatively it can be stored
for possible reuse in case of recurring concepts. If the MSE difference does not
fulfill the previous conditions, its average is incrementally updated. Additionally, in the
evaluation process a separate parameter can be used to specify a time period that
determines how long a tree is allowed to grow in order to become more
accurate than the old one. When the time period for growing an alternate tree has passed, or
the average of the difference has started to increase instead of decreasing, the alternate tree
is removed from the node together with the maintained statistics, and the memory
is released. In order to prevent premature discarding of the
alternate tree, the average of the MSE difference is evaluated only after several
evaluation intervals have passed. The strategy of growing alternate trees will be
referred to as “TD AltTree” and “BU AltTree”, depending on which change detection
method is used (TD/BU).
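The periodic comparison of the old and the alternate tree can be sketched as follows. This is only one possible reading of the rule described above: the replacement condition is interpreted as an improvement of at least 10% relative to the MSE of the old tree, the names `RunningMSE`, `evaluate_alternate`, `max_examples` and `min_gain` are hypothetical, and the removal of an alternate tree based on the trend of the average difference is left out.

```python
class RunningMSE:
    """Incrementally maintained mean squared error."""

    def __init__(self):
        self.n = 0
        self.value = 0.0

    def update(self, y_true, y_pred):
        self.n += 1
        self.value += ((y_true - y_pred) ** 2 - self.value) / self.n
        return self.value


def evaluate_alternate(mse_old, mse_alt, examples_seen, max_examples, min_gain=0.10):
    """Decision taken at the end of an evaluation interval.

    Returns 'replace' when the alternate tree is sufficiently more accurate,
    'discard' when its allowed growing period has passed, 'keep' otherwise.
    """
    diff = mse_old - mse_alt
    # Assumed reading: the improvement must be positive and at least
    # `min_gain` (10%) of the MSE of the old tree.
    if diff > 0 and diff >= min_gain * mse_old:
        return "replace"
    if examples_seen >= max_examples:
        return "discard"
    return "keep"


# Example call at the end of a 5000-example evaluation interval
# (the MSE values and the 50000-example limit are made up for illustration).
decision = evaluate_alternate(mse_old=10.0, mse_alt=8.0,
                              examples_seen=5000, max_examples=50000)
```

In a full implementation the two `RunningMSE` objects would be updated with every example reaching the node once the alternate tree has received enough examples to make its first split.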
4 Experimental Evaluation
To provide empirical support for FIRT-DD we have performed an evaluation over
several versions of an artificial “benchmark” dataset (simulating several different
types of drift) and over a real-world dataset from the DataExpo09 [18] competition.
Using artificial datasets allows us to control relevant parameters and to empirically
evaluate the drift detection and adaptation mechanisms. The real-world dataset
enables us to evaluate the merit of the method for real-life problems. The empirical
evaluation showed that the FIRT-DD algorithm is capable of detecting
and adapting to changes and is able to maintain an accurate and well-structured
decision model.
4.1 Experimental Setup and Metrics
In the typical streaming scenario data comes in sequential order, reflecting the current
state of the physical process that generates the examples. Traditional evaluation
methodologies common for static distributions are not adequate for the streaming setting
because they do not take into consideration the importance of the order of examples
(i.e. their dependency on the time factor), or the evolution of the model over
time. One convenient methodology is to use the predictive sequential or prequential
error, which can be computed as a cumulative sum of a loss function (error obtained
after every consecutive example) typical for the regression domain (mean squared or
root relative mean squared error). The prequential approach uses all the available data
for training and testing, and draws a pessimistic learning curve that traces the
evolution of the error. Its main disadvantage is that it accumulates the errors from the first
examples of the data stream and therefore hinders precise on-line evaluation of real
performance. Current improvements cannot be easily seen due to past degradation in
accuracy accumulated in the prequential error.
A more adequate methodology for evaluating the performance of an incremental
algorithm is to use an “exponential decay”/“fading factor” evaluation or a sliding window
evaluation. Using the “exponential decay”/“fading factor” method we can diminish
the influence of earlier errors by multiplying the cumulated loss by a function e^{-δt}
of the time t or by a fading factor (a constant value less than one, e.g. 0.99) before
adding the most current error. This method requires setting only the parameter δ or the
fading factor, but since it still includes all the previous information on the error it
gives a slightly smoothed learning curve. With the sliding window method for
evaluation we can obtain a detailed and precise evaluation over the whole period of
training/learning without the influence of earlier errors. With this method we evaluate the
model over a test set determined by a window of examples which the algorithm has
not used for training. The window of examples manipulates the data like a FIFO
(first-in-first-out) queue. The examples which have been used for testing are given to
the algorithm one by one for the purpose of training. The size of the sliding window
determines the level of aggregation and can be adjusted by the user.
The sliding step determines the level of detail or the smoothness of
the learning curve and can also be adjusted. Using the sliding window test set we
measure accuracy in terms of the mean squared error (MSE) or root relative squared
error (RRSE) and the current dimensions of the learned model.
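The fading factor evaluation can be written as a simple recurrence on the cumulated loss. The sketch below is illustrative only: the class name `FadedError` is hypothetical, the squared error is used as the loss, and dividing by a faded count of examples (so that the result is a weighted mean rather than a sum) is an assumption added here, not something stated in the text.

```python
class FadedError:
    """Prequential squared error with a fading factor."""

    def __init__(self, fading=0.99):
        self.fading = fading      # constant smaller than one, e.g. 0.99
        self.loss_sum = 0.0       # faded cumulative loss
        self.weight_sum = 0.0     # faded number of examples (assumed normalizer)

    def update(self, y_true, y_pred):
        """Add the loss of the newest prediction and return the faded mean."""
        loss = (y_true - y_pred) ** 2
        # Earlier losses are multiplied by the fading factor before the
        # most current loss is added.
        self.loss_sum = loss + self.fading * self.loss_sum
        self.weight_sum = 1.0 + self.fading * self.weight_sum
        return self.loss_sum / self.weight_sum


# Example: an early large error is progressively forgotten.
faded = FadedError(fading=0.5)
first = faded.update(2.0, 0.0)    # squared error 4
second = faded.update(0.0, 0.0)   # squared error 0
```

A fading factor close to one gives a smooth curve with long memory, while smaller values make the curve react faster to recent errors.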
We have performed a sensitivity analysis on the values of the parameters α and λ,
which resulted in the pairs of values (0.1, 200), (0.05, 500) and (0.01, 1000),
respectively. Smaller values for α would increase the sensitivity, while smaller values
for λ would shorten the delays in the change detection. However, we should bear in
mind that smaller λ values increase the probability of false alarms. In the
empirical evaluation we have used α = 0.05 and λ = 500 for all the simulations of drift
over the Fried artificial dataset.
4.2 The Datasets
For simulation of the different types of drift we have used the Fried dataset used by
Friedman in [18]. This is an artificial dataset containing 10 continuous predictor
attributes with independent values uniformly distributed in the interval [0, 1]. Of
those 10 predictor attributes only five are relevant, while the rest are
redundant. The original function for computing the predicted variable y is:
    y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \sigma(0, 1),
where σ(0, 1) is a random number generated from a normal distribution with mean 0
and variance 1. The second dataset used is from the Data Expo competition [18] and
contains a large number of records with flight arrival and departure details for all
commercial flights within the USA, from October 1987 to April 2008. This is a large
dataset, with nearly 120 million records. The Depdelay dataset we have used
in the evaluation contains around 14 million records from January 2006 to
April 2008. The dataset is cleaned and records are sorted according to the departure
date (year, month, day) and time (converted to seconds). The target variable is the
departure delay in seconds.
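The generating function of the Fried dataset described above can be sketched in Python as follows. The function names and the use of the standard `random` module are assumptions made for illustration; the original data generator is not specified in the text.

```python
import math
import random


def fried_target(x):
    """Noise-free Fried function; x[0] corresponds to x_1."""
    return (10 * math.sin(math.pi * x[0] * x[1])
            + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3]
            + 5 * x[4])


def fried_example(rng):
    """Draw one (x, y) pair: 10 uniform attributes, 5 of them relevant."""
    x = [rng.random() for _ in range(10)]
    y = fried_target(x) + rng.gauss(0.0, 1.0)  # additive N(0, 1) noise
    return x, y
```

Attributes `x[5]` to `x[9]` are generated but never enter the target, which is what makes half of the predictors uninformative.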
4.3 Results over artificial datasets
Using the artificial dataset we performed a set of controlled experiments. We have
studied several scenarios related to different types of change:
1. Local abrupt drift. The first type of simulated drift is local and abrupt. We have
introduced concept drift in two distinct regions of the instance space. The first
region is defined by the inequalities $x_2 < 0.3$, $x_3 < 0.3$, $x_4 > 0.7$ and
$x_5 < 0.3$. The second region is defined by $x_2 > 0.7$, $x_3 > 0.7$, $x_4 < 0.3$
and $x_5 > 0.7$. We have introduced three points of abrupt concept drift in the
dataset: the first at one quarter of the examples, the second at one half and the
third at three quarters. For all the examples falling in the first region the new
function for computing the predicted variable y is
$y = 10x_1x_2 + 20(x_3 - 0.5) + 10x_4 + 5x_5 + \sigma(0, 1)$. For the second region
the new function is
$y = 10\cos(x_1x_2) + 20(x_3 - 0.5) + e^{x_4} + 5x_5^2 + \sigma(0, 1)$. At every
consecutive change point the region of drift is expanded by removing inequalities
from its definition. More precisely, at the second point of change the first
inequality, $x_2 < 0.3$ ($x_2 > 0.7$ for the second region), is removed, while at
the third point of change two of the inequalities are removed: $x_2 < 0.3$ and
$x_3 < 0.3$ ($x_2 > 0.7$ and $x_3 > 0.7$).
2. Global abrupt drift. The second type of simulated drift is global and abrupt. The
concept drift is a change in the original function over the whole instance space,
obtained by permuting the variables in the function. We have introduced two points
of concept drift: the first at one half of the examples, when the function for
computing the predicted variable becomes
$y = 10\sin(\pi x_4x_5) + 20(x_2 - 0.5)^2 + 10x_1 + 5x_3 + \sigma(0, 1)$, and the
second at three quarters of the examples, when the original function is restored
(reoccurrence).
3. Global gradual drift. The third type of simulated drift is global and gradual. The
first gradual concept drift is initiated at one half of the examples. Starting
from this point, examples from the new concept,
$y = 10\sin(\pi x_4x_5) + 20(x_2 - 0.5)^2 + 10x_1 + 5x_3 + \sigma(0, 1)$, are
gradually introduced among the examples from the first concept. After every 1000
examples the probability of generating an example using the new function is
incremented, so that after 100000 examples only the new concept is present. At
three quarters of the examples a second gradual concept drift is initiated in the
same way. The new concept function is
$y = 10\sin(\pi x_2x_5) + 20(x_4 - 0.5)^2 + 10x_3 + 5x_1 + \sigma(0, 1)$. Examples
from the new concept gradually replace the ones from the previous concept as before.
The gradual drift again ends after 100000 examples; from this point only the last
concept is present in the data.
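Two of the mechanisms in the scenarios above can be sketched in Python: the expanding region of the local abrupt drift, and the probability schedule of the first global gradual drift. All function names are hypothetical. The text only states that the probability is incremented every 1000 examples and reaches 1 after 100000 examples, so equal increments starting from 0 are an assumption of this sketch.

```python
import math
import random


def original_concept(x):
    """Noise-free Fried function; x[0] corresponds to x_1."""
    return (10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3] + 5 * x[4])


def permuted_concept(x):
    """Noise-free function after the first global drift (variables permuted)."""
    return (10 * math.sin(math.pi * x[3] * x[4]) + 20 * (x[1] - 0.5) ** 2
            + 10 * x[0] + 5 * x[2])


def in_first_region(x, change_point):
    """Membership in the first local-drift region after change point 1, 2 or 3.

    Change point 2 drops the condition on x_2; change point 3 drops the
    conditions on x_2 and x_3, which expands the region.
    """
    conds = [x[1] < 0.3, x[2] < 0.3, x[3] > 0.7, x[4] < 0.3]
    return all(conds[change_point - 1:])


def new_concept_probability(n_since_drift, block=1000, duration=100000):
    """Probability of drawing from the new concept (assumed equal increments)."""
    if n_since_drift >= duration:
        return 1.0
    return (n_since_drift // block) / (duration // block)


def gradual_drift_target(x, n_since_drift, rng):
    """Target value while the first gradual drift is in progress."""
    p = new_concept_probability(n_since_drift)
    concept = permuted_concept if rng.random() < p else original_concept
    return concept(x) + rng.gauss(0.0, 1.0)
```

The second region and the second gradual drift follow the same pattern with the inequalities and functions given in the list.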
The first part of the experimental evaluation was focused on analyzing and comparing
the effectiveness of the proposed change detection methods. The comparison was
performed only over the artificial datasets (each with a size of 1 million examples)
because the drift is known and controllable and because they enable precise
measurement of delays, false alarms and missed detections. Table 1 presents the
results averaged over 10 experiments for each of the artificial datasets. We have
measured the number of false alarms for each point of drift, the Page-Hinkley test
delay (the number of examples monitored by the PH test before the detection) and the
global delay (the number of examples processed in the tree from the point of the
concept drift until the moment of the detection). The delay of the Page-Hinkley test
measures how fast the algorithm is able to start the adaptation strategy at the local
level, while the global delay measures how fast the change was detected globally. The
“Num. of change” column specifies the number of the change point.
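Table 1. Averaged results from the evaluation of change detection over 10 experiments

| Data set | Num. of change | TD FA's | TD PH test delay | TD global delay | BU FA's | BU PH test delay | BU global delay |
|---|---|---|---|---|---|---|---|
| Local abrupt drift | 1 | 0.5 | 1896.8 | 5111.7 | 0 | 698.6 | 14799.8 |
| | 2 | 1.7 | 1128.2 | 2551.1 | 0 | 494.1 | 3928.6 |
| | 3 | 0.8 | 3107.1 | 5734.5 | 0 | 461.1 | 5502.4 |
| Global abrupt drift | 1 | 1.5 | 284.5 | 1325.7 | 0 | 260.4 | 260.4 |
| | 2 | 0 | 492.2 | 3586.3 | 0 | 319.9 | 319.9 |
| Global gradual drift | 1 | 1 | 2619.7 | 16692.9 | 0 | 1094.2 | 14726.3 |
| | 2 | 2.8 | 4377.5 | 10846.5 | 0 | 644.7 | 11838.2 |
| No drift | 0 | 0.9 | - | - | 0 | - | - |
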
Results in Table 1 show in general that the TD method for change detection triggers
a significant number of false alarms compared with the BU method (which never
raises false alarms). Both methods detect all the simulated changes for all
types of drift. The detailed analysis for the Local abrupt drift dataset has shown that
most of the false alarms with TD were triggered at the root node. The true positives
with TD are also detected higher in the tree, while with the BU method changes are
typically detected in lower nodes, precisely at the lowest node whose region is
closest to the region with local drift. In this way the BU method localizes the drift
most precisely. With the TD method the global delay for the first and the
second point of drift is smaller (because changes are detected higher), but with the BU
method the PH test delays are much smaller, which enables faster local adaptation. The
analysis of Table 1 for the Global abrupt drift dataset shows a clear advantage
of the BU method over the TD method. The BU method does not raise false alarms,
and both delays are smaller, especially the global delay. With BU, changes are
detected first at the root node (which covers the whole region where the global drift
occurs) and only afterwards in the nodes below. This is not the case with the TD method,
where changes are detected first in the lower nodes and afterwards at their parents
(moving towards the root). False alarms are usually triggered at the root node. For the
Global gradual drift dataset the detailed analysis showed that both methods
detected changes similarly, starting at the lower parts of the tree and moving
towards the root. The main difference is that the BU method detects changes
significantly faster, with smaller delays. However, once a change is detected at the
highest point, the adaptation strategy (Prune or AltTree) prevents the detection of
further changes, although the drift might still be present and increasing. The fast
detection of the drift is in this case a disadvantage for the BU method rather than an
advantage. The TD method, whose delays are bigger, detects changes during the whole
period of gradual increase of the drift. This allows the adaptation to be performed at
the right time, when all of the examples belong to the new concept. The last dataset is
the original Fried dataset without any drift. From the table it can be noted again that
the BU method does not trigger false alarms, while the TD method detected at least one
in nine of the ten generations of the same dataset.
Table 2. Performance results over the last 10000 examples of the data stream averaged over 10
experiments for each type of simulated drift

| Data set | Measures | No detection | TD Prune | TD AltTrees | BU Prune | BU AltTrees |
|---|---|---|---|---|---|---|
| Local abrupt drift | MSE/Std. dev. | 4.947±0.14 | 6.768±0.13 | 3.82±0.08 | 4.969±0.1 | 3.696±0.08 |
| | RRSE | 0.4114 | 0.4845 | 0.363 | 0.4149 | 0.3571 |
| | Growing nodes | 1292.8 | 145.5 | 818.7 | 220.1 | 817.3 |
| Global abrupt drift | MSE/Std. dev. | 5.496±0.11 | 4.758±0.09 | 4.664±0.09 | 4.459±0.08 | 4.465±0.08 |
| | RRSE | 0.46 | 0.4364 | 0.4316 | 0.4221 | 0.4223 |
| | Growing nodes | 977.8 | 195.4 | 310.1 | 229.7 | 228.8 |
| Global gradual drift | MSE/Std. dev. | 16.629±0.32 | 4.682±0.08 | 3.85±0.07 | 5.149±0.11 | 7.461±0.16 |
| | RRSE | 0.7284 | 0.3917 | 0.3541 | 0.4081 | 0.4892 |
| | Growing nodes | 963.4 | 125.8 | 180.4 | 179.6 | 178.5 |

Table 2 gives performance results over the last 10000 examples. These results make it
possible to evaluate the adaptation of the model for the different types of drift once
learning has ended. For the Local abrupt drift dataset it is evident that the BU Prune
strategy gives better results than the TD Prune strategy. This is easy to explain
given the comments in the previous paragraph. Namely, in the case of TD detection
much bigger portions of the tree are pruned than necessary, because the drift is
detected too high in the tree. The tree ends up smaller and even has lower accuracy
than the tree grown without change detection. The BU method performs precise
drift localization, which enables pruning just the right parts of the tree and
therefore achieving the same performance as the “No detection” strategy but with a
significantly lower number of rules. With the AltTree adaptation strategy, reacting to
false alarms is avoided. Accordingly, the performance results for TD AltTree are
much better and even similar to those of BU AltTree. For the Global abrupt drift dataset
the BU approach in general gives slightly better results. It is interesting to notice
that both adaptation strategies are equally good. Since the drift is global they do
the same thing, regrowing the whole tree using examples from the new concept.
However, the TD approach is not very adequate because the change is not detected
first at the root node but somewhere lower. Because of that, neither of the adaptation
strategies enables proper correction of the structure, although accuracy might still be
good. The “No detection” strategy gives the worst results. For the Global gradual
drift dataset the performance results are in favor of the TD method. The trees obtained
are smaller and more accurate because of the on-time adaptation.
Fig. 1. Local abrupt and global abrupt/gradual drift simulation over Fried dataset using sliding
window evaluation over a window of size 10000 and sliding step 1000 examples
Fig. 1 shows the learning curves obtained with the sliding window evaluation, only
for the Local abrupt drift and the Global gradual drift datasets due to lack of
space. The top left figure shows many peaks, corresponding to drastic
degradation in accuracy when huge parts of the tree are pruned or as a reaction to
false alarms (before the first point of drift). The top right figure shows the effects
of smooth adaptation using the AltTree strategy. The trees obtained are smaller and
continuously more accurate. Similar conclusions can also be drawn from the lower
figures, but here the more interesting observation is the advantage of the TD method,
which is especially evident for the second point of drift. Comments on this type of
drift are given in the discussion of Table 1, but the general conclusion is that the
tree obtained using the BU method shows the worst results mainly because it has been
grown during the presence of the two different concepts. Therefore, many of its
splitting decisions are invalid.
4.4 Results over a real-world dataset
The Depdelay dataset represents a highly variable concept which depends on many
time-changing variables. Performance results in Table 3 were obtained using the
sliding window validation over the last 100000 examples. The results show significant
improvement of the accuracy when change detection and adaptation are performed.
The size of the model is also substantially smaller (by an order of magnitude). The
standard deviation of the error for the TD/BU methods is bigger compared to the
“No detection” situation, but detailed results show that this is due to a sudden
increase over the last 100000 examples (3 to 7 times). This can be seen in Fig. 2. Both
TD/BU AltTrees methods perform continuously better compared to the “No detection”
situation. In Fig. 2 it can also be seen that when growing alternate trees the accuracy
of the model is stable, persistent and continuously better than the accuracy of the
model when no drift detection is performed. This is evidence that the data contains
drifts and that the FIRT-DD algorithm is able to detect them and adapt the model
properly.
Table 3. Performance results over the last 100000 examples of the Depdelay dataset

| Measures | No detection | TD Prune | TD AltTrees | BU Prune | BU AltTrees |
|---|---|---|---|---|---|
| MSE/Std. dev. | 738.995±13.6 | 175.877±26.1 | 150.072±21.6 | 181.884±23.5 | 136.35±20.0 |
| RRSE | 0.396951 | 0.200305 | 0.185353 | 0.20379 | 0.181785 |
| Growing nodes | 4531 | 121 | 365 | 103 | 309 |

Fig. 2. Departure delays dataset
5 Conclusion
This paper presents a new algorithm for learning regression trees from time-changing
data streams. To the best of our knowledge, FIRT-DD is the first algorithm for inducing
regression trees from time-changing data streams. It is equipped with a drift detection
mechanism that exploits the structure of the regression tree. The mechanism is based on
change-detection units installed in each internal node that monitor the growing
process. The tree structure is monitored at every moment and in every part of the
instance space. The change-detection units use only a small, constant amount of memory
per node and a small, constant amount of time for each example. The FIRT-DD algorithm
is able to cope with different types of drift, including abrupt or gradual and local or
global concept drift. It effectively keeps its model up to date with the continuous
flow of data even when concept drifts occur. The algorithm enables local adaptation
when required, reducing the cost of updating the whole decision model and performing
faster and better adaptation to the changes in data. Using an adaptation strategy
based on growing alternate trees, FIRT-DD avoids significant short-term performance
degradation by adapting the model smoothly. The model maintained with the FIRT-DD
algorithm continuously exhibits better accuracy than the model grown without any
change detection and proper adaptation. Preliminary application of FIRT-DD to a
real-world domain shows promising results. Our future work will be focused on
implementing these ideas in the domain of classification trees.
Acknowledgments. Thanks to the financial support by FCT under the PhD Grant
References
1. Basseville, M., Nikiforov, I.: Detection of Abrupt Changes: Theory and Applications.
Prentice-Hall Inc (1987)
2. Ikonomovska, E., Gama, J.: Learning Model Trees from Data Streams. In: 11th International
Conference on Discovery Science 2008. LNAI, vol. 5255, pp. 52--63. Springer, Heidelberg (2008)
3. Tsymbal, A.: The problem of concept drift: definitions and related work. Technical Report,
TCD-CS-2004-15, Department of Computer Science, Trinity College Dublin, Ireland (2004)
4. Gama, J., Castillo, G.: Learning with Local Drift Detection. In: Advances in Artificial
Intelligence - SBIA 2004. LNCS, vol. 3171, pp. 286--295. Springer, Heidelberg (2004)
5. Klinkenberg, R.: Learning drifting concepts: Example selection vs. example weighting. J.
Intelligent Data Analysis (IDA), Special Issue on Incremental Learning Systems Capable of
Dealing with Concept Drift vol. 8, 3, 281--300 (2004)
6. Widmer, G., Kubat, M.: Learning in the presence of concept drifts and hidden contexts. J.
Machine Learning 23, 69--101 (1996)
7. Klinkenberg, R., Joachims, T.: Detecting concept drift with support vector machines. In:
Langley, P. (ed) 17th International Conference on Machine Learning, pp. 487--494. Morgan
Kaufmann, San Francisco (2000)
8. Klinkenberg, R., Renz, I.: Adaptive information filtering: Learning in the presence of
concept drifts. In: Learning for Text Categorization, pp. 33--40. AAAI Press, Menlo Park (1998)
9. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: 30th International
Conference on Very Large Data Bases, pp. 180--191. Morgan Kaufmann, San Francisco (2004)
10. Gama, J., Fernandes, R., Rocha, R.: Decision trees for mining data streams. J. Intelligent
Data Analysis vol. 10, (1), 23--46 (2006)
11. Kolter, J. Z., Maloof, M.: Using additive expert ensembles to cope with concept drift. In:
22nd International Conference on Machine Learning, pp. 449--456. ACM, New York (2005)
12. Kolter, J. Z., Maloof, M.: Dynamic weighted majority: A new ensemble method for
tracking concept drift. In: 3rd International Conference on Data Mining, pp. 123--130. IEEE
Computer Society (2003)
13. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: 7th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 97--106.
ACM Press, Menlo Park (2001)
14. Grant, L., Leavenworth, S.: Statistical Quality Control. McGraw-Hill, United States (1996)
15. Page, E. S.: Continuous Inspection Schemes. J. Biometrika 41, 100--115 (1954)
16. Mouss, H., Mouss, D., Mouss, N., Sefouhi, L.: Test of Page-Hinkley, an Approach for Fault
Detection in an Agro-Alimentary Production System. In: 5th Asian Control Conference, vol.
2, pp. 815--818. IEEE Computer Society (2004)
17. Friedman, J. H.: Multivariate Adaptive Regression Splines. J. The Annals of Statistics 19,
pp. 1--141 (1991)
18. ASA Sections on Statistical Computing and Statistical Graphics, Data Expo 2009.
... Concept drifts can be categorized into three classes based on the changes in the joint probabilities (Ikonomovska et al., 2009;Diaz-Rozo et al., 2020) as follows: ...
... The active strategies depend on whether a concept drift has been flagged by the drift detector. When the drift detector identifies a true concept drift, the predictive model is adapted to the recent data (Ikonomovska et al., 2009). ...
In autonomous vehicle systems (AVSs), which are widely used to transfer wafers in semiconductor manufacturing, robust traffic control is a significant challenge because all the vehicles must be monitored and controlled in real time to cope with traffic congestion. Several predictive approaches have been proposed for preventing traffic congestion in stationary traffic environments. However, in real-life traffic situations, there exists concept drift characterized by time-varying traffic conditions which hinder the accurate prediction of congestion. In this study, we propose a concept drift modeling for a robust vehicle control system. The proposed method combines a drift-adaptation learning technique with a drift detector to achieve adaptive traffic prediction in time-varying AVSs. We compare both the effectiveness of the prediction and the efficiency of model updates with representative methods. High-fidelity simulations based on actual data confirm that the proposed method outperforms the alternatives by detecting change patterns and updating the prediction models whenever significant concept drift occurs in the traffic patterns.
... Artificial data set with (known) drifts we used the multivariate data set from the paper of Friedman (1991), which has been also applied before for drift analysis [see, e.g., Ikonomovska et al. (2009)] and which is based on the functional trend given by: ...
Full-text available
Evolving fuzzy systems (EFS) have enjoyed a wide attraction in the community to handle learning from data streams in an incremental, single-pass and transparent manner. The main concentration so far lied in the development of approaches for single EFS models, basically used for prediction purposes. Forgetting mechanisms have been used to increase their flexibility, especially for the purpose to adapt quickly to changing situations such as drifting data distributions. These require forgetting factors steering the degree of timely out-weighing older learned concepts, whose adequate setting in advance or in adaptive fashion is not an easy and not a fully resolved task. In this paper, we propose a new concept of learning fuzzy systems from data streams, which we call online sequential ensembling of fuzzy systems (OS-FS) . It is able to model the recent dependencies in streams on a chunk-wise basis: for each new incoming chunk, a new fuzzy model is trained from scratch and added to the ensemble (of fuzzy systems trained before). This induces (i) maximal flexibility in terms of being able to apply variable chunk sizes according to the actual system delay in receiving target values and (ii) fast reaction possibilities in the case of arising drifts. The latter are realized with specific prediction techniques on new data chunks based on the sequential ensemble members trained so far over time. We propose four different prediction variants including various weighting concepts in order to put higher weights on the members with higher inference certainty during the amalgamation of predictions of single members to a final prediction. In this sense, older members, which keep in mind knowledge about past states, may get dynamically reactivated in the case of cyclic drifts, which induce dynamic changes in the process behavior which are re-occurring from time to time later. 
Furthermore, we integrate a concept for properly resolving possible contradictions among members with similar inference certainties. The reaction onto drifts is thus autonomously handled on demand and on the fly during the prediction stage (and not during model adaptation/evolution stage as conventionally done in single EFS models), which yields enormous flexibility. Finally, in order to cope with large-scale and (theoretically) infinite data streams within a reasonable amount of prediction time, we demonstrate two concepts for pruning past ensemble members, one based on atypical high error trends of single members and one based on the non-diversity of ensemble members. The results based on two data streams showed significantly improved performance compared to single EFS models in terms of a better convergence of the accumulated chunk-wise ahead prediction error trends, especially in the case of regular and cyclic drifts. Moreover, the more advanced prediction schemes could significantly outperform standard averaging over all members’ outputs. Furthermore, resolving contradictory outputs among members helped to improve the performance of the sequential ensemble further. Results on a wider range of data streams from different application scenarios showed (i) improved error trend lines over single EFS models, as well as over related AI methods OS-ELM and MLPs neural networks retrained on data chunks, and (ii) slightly worse trend lines than on-line bagged EFS (as specific EFS ensembles), but with around 100 times faster processing times (achieving low processing times way below requiring milli-seconds for single samples updates).
Most of the current data sources generate large amounts of data over time. Renewable energy generation is one example of such data sources. Machine learning is often applied to forecast time series. Since data flows are usually large, trends in data may change and learned patterns might not be optimal in the most recent data. In this paper, we analyse wind energy generation data extracted from the Sistema de Información del Operador del Sistema (ESIOS) of the Spanish power grid. We perform a study to evaluate detecting concept drifts to retrain models and thus improve the quality of forecasting. To this end, we compare the performance of a linear regression model when it is retrained randomly and when a concept drift is detected, respectively. Our experiments show that a concept drift approach improves forecasting between a 7.88% and a 33.97% depending on the concept drift technique applied.KeywordsMachine learningConcept drift detectionData streamingTime seriesWind energy forecasting
Nowadays, swarm intelligence shows a high accuracy while solving difficult problems, including image processing problem. Image Edge detection is a complex optimization problem due to the high-resolution images involving large matrix of pixels. The current work describes several sensitive to the environment models involving swarm intelligence. The agents’ sensitivity is used in order to guide the swarm to obtain the best solution. Both theoretical general guidance and a practical example for a particular swarm are included. The quality of results is measured using several known measures.KeywordsSwarm intelligenceImage processingImage Edge Detection
The goal of this paper is to improve the predictive accuracy of data streaming algorithms without increasing the processing time of the incoming data. We propose the EnHAT (Ensemble Combined with Hoeffding Adaptive Tree) algorithm, which combines the state-of-the-art Hoeffding Adaptive Tree (HAT) algorithm with an ensemble of J48 decision trees induced from sequential chunks of the data stream. The slack time of HAT adaptation to a new window of incoming records is utilized in parallel for building a decision–tree ensemble. In our experiments on 4 benchmark streaming datasets and 4 synthetic datasets with different types of concept drift, EnHAT has reached the highest predictive accuracy in 23 out of 26 cases compared to an ensemble of J48 trees, HAT, and a single J48 model induced from the last sliding window. Thus, we can conclude that the ensemble/HAT synergy yields better prediction results than each one of the two approaches on its own. The higher accuracy does not come at the expense of any additional computational effort beyond the model induction times of the combined algorithms.
Context: The amount and diversity of data have increased drastically in recent years. However, in certain situations, the data to which a trained Machine Learning model is significantly different from testing data, a problem known as Concept Drift (CD). Because CD can be a serious issue, there has been a wealth of research on how to detect and work around it. However, most of the literature focuses on classification tasks. Objective: Making a Systematic Literature Review (SLR) for CD in the context of regression. Research questions: How to detect CD and how to build CD techniques for regression problems using machine learning? Method: We ran an automatic search process on reference databases, selecting papers from 2010 to August 2020, following the methodological process proposed by (Kitchenhame and Charters) (2007). Results:We selected 41 papers. Drift Detection Methods based on ensembles and neural networks with highlight OS-ELM were the most frequent in the selected papers with superior performance. However, only two papers confirm such superiority statistically. Furthermore, identify CD problems as the batch size, drift points, and where drift happens. Conclusions: SLR focuses on highlighting the existing literature on CD applied to regression.
At present, concept drift in the nonstationary data stream is showing trends with different speeds and different degrees of severity, which has brought great challenges to many fields like data mining and machine learning. In the past two decades, a lot of methods dedicated to handling concept drift in the nonstationary data stream have emerged. A novel perspective is proposed to classify these methods, and the current concept drift handling methods are comprehensively explained from the active handling methods and the passive handling methods. In particular, active handling methods are analyzed from the perspective of handling one specific type of concept drift and handling multiple types of concept drift, and passive handling methods are analyzed from the perspective of single learner and ensemble learning. Many concept drift handling methods in this survey are analyzed and summarized in terms of the comparing algorithms, learning model, applicable drift type, advantages, and disadvantages of the algorithms. Finally, further research directions are given, including the active and passive mixing methods, class imbalance, the existence of novel class in the data stream, and the noise in the data stream.
Online data stream mining is of great significance in practice because of its ubiquity in many real-world scenarios, especially in the big data era. Traditional data mining algorithms cannot be directly applied to data streams due to (1) the possible change of underlying data distribution over time (i.e., concept drift ) and (2) delayed, short, or even no labels for streaming data in practice. A new research area, named unsupervised concept drift detection , has emerged to tackle this difficulty mainly based on two-sample hypothesis tests, such as the Kolmogorov–Smirnov test. However, it is surprising that none of the existing methods in this area exploit the Bayesian nonparametric hypothesis test, which has clear interpretability and straightforward prior knowledge encoding ability and no strict or unrealistic requirement of prefixing the form for the underlying data distribution. In this article, we present a Bayesian nonparametric unsupervised concept drift detection method based on the Polya tree hypothesis test. The basic idea is to decompose the underlying data distribution into a multi-resolution representation that transforms the whole distribution hypothesis test into recursive and simple binomial tests. Also, an incremental mechanism is especially designed to improve its efficiency in the stream setting. The method effectively detect drifts, and it also locates where a drift happens and the posteriors of hypotheses. The experiments on synthetic data verify the desired properties of the proposed method, and the experiments on real-world data show the better performance of the method for data stream mining compared with its frequentist counterpart in the literature.
Evolving systems emerge from the synergy between systems with adaptive structures, and the recursive methods of machine learning. Evolving algorithms construct models and derive decision patterns from stream data produced by dynamically changing environments. Different components can be chosen to assemble the system structure, rules, trees, and neural networks being amongst the most prominent. Evolving systems concern mainly with time-varying environments, and processing of nonstationary stream data using computationally efficient recursive algorithms. They are particularly suitable for on-line, real-time applications, and dynamically changing situations, and operating conditions. This chapter gives an overview of evolving systems focusing on the model components, learning algorithms, and illustrative applications. The aim of to introduce the main ideas and a state of the art view of the area.
Full-text available
In the real world concepts are often not stable but change with time. Typical examples of this are weather prediction rules and customers' preferences. The underlying data distribution may change as well. Often these changes make the model built on old data inconsistent with the new data, and regular updating of the model is necessary. This problem, known as concept drift, complicates the task of learning a model from data and requires special approaches, different from commonly used techniques, which treat arriving instances as equally important contributors to the final concept. This paper considers different types of concept drift, peculiarities of the problem, and gives a critical review of existing approaches to the problem.
Conference Paper
Full-text available
Most of the work in Machine Learning assume that examples are generated at random according to some stationary probability distribution. In this work we study the problem of learning when the distribution that generates the examples changes over time. We present a method for detection of changes in the probability distribution of examples. The idea behind the drift detection method is to monitor the online error-rate of a learning algorithm looking for significant deviations. The method can be used as a wrapper over any learning algorithm. In most problems, a change affects only some regions of the instance space, not the instance space as a whole. In decision models that fit different functions to regions of the instance space, like Decision Trees and Rule Learners, the method can be used to monitor the error in regions of the instance space, with advantages of fast model adaptation. In this work we present experiments using the method as a wrapper over a decision tree and a linear model, and in each internal-node of a decision tree. The experimental results obtained in controlled experiments using artificial data and a real-world problem show a good performance detecting drift and in adapting the decision model to the new concept.
Most of the work in machine learning assumes that examples are generated at random according to some stationary probability distribution. In this work we study the problem of learning when the distribution that generates the examples changes over time. We present a method for detecting changes in the probability distribution of examples. The idea behind the drift detection method is to control the online error rate of the algorithm. The training examples are presented in sequence. When a new training example is available, it is classified using the current model. Statistical theory guarantees that while the distribution is stationary, the error will decrease. When the distribution changes, the error will increase. The method tracks the trace of the online error of the algorithm. For the current context we define a warning level and a drift level. A new context is declared if, in a sequence of examples, the error increases, reaching the warning level at example k_w and the drift level at example k_d. This is an indication of a change in the distribution of the examples. The algorithm then learns a new model using only the examples since k_w. The method was tested with a set of eight artificial datasets and a real-world dataset. We used three learning algorithms: a perceptron, a neural network and a decision tree. The experimental results show good performance in detecting drift and in learning the new concept. We also observe that the method is independent of the learning algorithm.
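The warning-level and drift-level scheme described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the thresholds, so the 2 and 3 standard-deviation factors, the 30-example warm-up, and the names `DriftDetector` and `find_drift` are all assumptions made for the example.

```python
import math


class DriftDetector:
    """Tracks the online error rate and flags warning and drift levels.

    Assumed for illustration (not stated in the abstract): warning at
    p_min + 2 * s_min, drift at p_min + 3 * s_min, 30-example warm-up.
    """

    def __init__(self, min_examples=30, warning_factor=2.0, drift_factor=3.0):
        self.min_examples = min_examples
        self.warning_factor = warning_factor
        self.drift_factor = drift_factor
        self.n = 0                # examples seen in the current context
        self.errors = 0           # mistakes made in the current context
        self.p_min = math.inf     # error rate at the lowest p + s so far
        self.s_min = math.inf     # its standard deviation

    def update(self, mistake):
        """Consume one prediction outcome; return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(mistake)
        p = self.errors / self.n                 # online error rate
        s = math.sqrt(p * (1.0 - p) / self.n)    # binomial standard deviation
        if self.n < self.min_examples:
            return "stable"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_factor * self.s_min:
            return "drift"
        if p + s > self.p_min + self.warning_factor * self.s_min:
            return "warning"
        return "stable"


def find_drift(mistakes):
    """Return (k_w, k_d) as 0-based indices, or None if no drift is declared."""
    detector = DriftDetector()
    k_w = None
    for i, mistake in enumerate(mistakes):
        state = detector.update(mistake)
        if state == "stable":
            k_w = None            # a warning that fades is discarded
        elif state == "warning" and k_w is None:
            k_w = i
        elif state == "drift":
            return (k_w if k_w is not None else i, i)
    return None
```

A learner wrapped this way would start buffering examples at k_w and, once drift is declared at k_d, rebuild its model from that buffer and start a fresh detector.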
In this paper we propose a fast and incremental algorithm for learning model trees from data streams (FIMT) for regression problems. The algorithm is incremental, works online, processes each example once at the speed it arrives, and maintains an any-time regression model. The leaves contain linear models trained online from the examples that reach that leaf, a process with low complexity. The use of linear models in the leaves increases the any-time global performance. FIMT is able to obtain accuracy competitive with batch learners even for medium-size datasets, with training time better by an order of magnitude. We study the properties of FIMT over several artificial and real datasets and evaluate its sensitivity to the order of examples and to the noise level.
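As a rough illustration of a linear model trained online in a leaf, the sketch below updates its weights once per example with constant work per attribute. The abstract does not say which update rule FIMT uses; the delta (least-mean-squares) rule, the learning rate, and the name `OnlineLinearLeaf` are assumptions made for this example only.

```python
class OnlineLinearLeaf:
    """Linear model updated one example at a time (hypothetical leaf model)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate  # assumed constant step size

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        """Delta-rule step: move the weights along the prediction error."""
        error = y - self.predict(x)
        step = self.learning_rate * error
        self.bias += step
        self.weights = [w + step * xi for w, xi in zip(self.weights, x)]


def train_on_line(passes=500):
    """Fit the noise-free target y = 2x + 1 on x in {0.0, 0.1, ..., 1.0}."""
    leaf = OnlineLinearLeaf(n_features=1)
    for _ in range(passes):
        for k in range(11):
            x = k / 10.0
            leaf.update([x], 2.0 * x + 1.0)
    return leaf
```

Each update touches every weight once, so the cost per example grows linearly with the number of attributes, which is consistent with the low complexity the abstract mentions.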
We consider online learning where the target concept can change over time. Previous work on expert prediction algorithms has bounded the worst-case performance on any subsequence of the training data relative to the performance of the best expert. However, because these "experts" may be difficult to implement, we take a more general approach and bound performance relative to the actual performance of any online learner on this single subsequence. We present the additive expert ensemble algorithm AddExp, a new, general method for using any online learner for drifting concepts. We adapt techniques for analyzing expert prediction algorithms to prove mistake and loss bounds for a discrete and a continuous version of AddExp. Finally, we present pruning methods and empirical results for data sets with concept drift.
In this paper we study the problem of constructing accurate decision tree models from data streams. Learning from data streams requires incremental, online, and any-time learning algorithms. One of the most successful algorithms for mining data streams is VFDT. We have extended VFDT in three directions: the ability to deal with continuous data, the use of more powerful classification techniques at tree leaves, and the ability to detect and react to concept drift. The VFDTc system can incorporate and classify new information online, with a single scan of the data, in constant time per example. The most relevant property of our system is its ability to obtain performance similar to that of a standard decision tree algorithm even for medium-size datasets. This is relevant because of the any-time property. We also extend VFDTc with the ability to deal with concept drift by continuously monitoring the difference between two class distributions of the examples: the distribution when a node was built and the distribution in a time window of the most recent examples. We study the sensitivity of VFDTc with respect to drift, noise, the order of examples, and the initial parameters in different problems, and demonstrate its utility on large and medium-size datasets.