Pre-print version
The final publication is available at Springer via
Adaptive Random Forests for Evolving Data Stream Classification
Heitor M. Gomes¹ · Albert Bifet² · Jesse Read²,³ · Jean Paul Barddal¹ ·
Fabrício Enembreck¹ · Bernhard Pfahringer⁵ · Geoff Holmes⁵ · Talel Abdessalem²,⁴
Received: date / Accepted: date
Abstract Random Forests is currently one of the most used machine learning
algorithms in the non-streaming (batch) setting. This preference is attributable to
its high learning performance and low demands with respect to input preparation
and hyper-parameter tuning. However, in the challenging context of evolving data
streams, there is no Random Forests algorithm that can be considered state-of-
the-art in comparison to bagging and boosting based algorithms. In this work, we
present the Adaptive Random Forest (ARF) algorithm for classification of evolv-
ing data streams. In contrast to previous attempts of replicating Random Forests
for data stream learning, ARF includes an effective resampling method and adap-
tive operators that can cope with different types of concept drifts without complex
optimizations for different data sets. We present experiments with a parallel imple-
mentation of ARF which has no degradation in terms of classification performance
in comparison to a serial implementation, since trees and adaptive operators are
independent from one another. Finally, we compare ARF with state-of-the-art al-
gorithms in a traditional test-then-train evaluation and a novel delayed labelling
evaluation, and show that ARF is accurate and uses a feasible amount of resources.
¹ PPGIa, Pontifícia Universidade Católica do Paraná, Brazil
² LTCI, Télécom ParisTech, Université Paris-Saclay, France
³ LIX, École Polytechnique, France
⁴ UMI CNRS IPAL & School of Computing, National University of Singapore, Singapore
⁵ Department of Computer Science, University of Waikato, New Zealand
Keywords Data Stream Mining, Random Forests, Ensemble Learning, Concept Drift
1 Introduction
As technology advances, machine learning is becoming more pervasive in real world
applications. Nowadays many businesses are aided by learning algorithms for sev-
eral tasks such as: predicting users’ interests on advertisements, products or enter-
tainment media recommendations, spam filters, autonomous driving, stock market
predictions, face recognition, cancer detection, weather forecast, credit scoring, and
many others. Some of these applications tolerate offline processing of data, which
can take from a few minutes to weeks, while some of them demand real-time –
or near real-time – processing as their source of data is non-stationary, i.e. it
constitutes an evolving data stream.
While learning from evolving data streams one must be aware that it is in-
feasible to store data prior to learning as it is neither useful (old data may not
represent the current concept) nor practical (data may surpass available memory).
Also, it is expected that the learning algorithm is able to process instances at least
as fast as new ones are made available, otherwise the system will either collapse
due to lack of memory or start discarding upcoming data.
This evolving data stream learning setting has motivated the development of a
multitude of methods for supervised [35, 32, 13, 20, 27], unsupervised [28, 40, 7], and
more recently semi-supervised learning [39,41,37]. Ensemble learners are often
preferred when learning from evolving data streams, since they are able to achieve
high learning performance, without much optimization, and have the advantageous
characteristic of being flexible as new learners can be selectively added, updated,
reset or removed [32,14, 13, 20].
Bagging [16], Boosting [25] and Random Forests [17] are classic ensemble methods
that achieve superior learning performance by aggregating multiple weak learners.
Bagging uses sampling with replacement (i.e. resampling) to train classifiers
on different subsets of instances, which effectively increases the variance of each
classifier without increasing the overall bias. Boosting iteratively trains classifiers
by increasing the weight of instances that were previously misclassified. Random
Forests grow decision trees by training them on resampled versions of the original
data (similarly to Bagging) and by randomly selecting a small number of features
that can be inspected at each node for split. There are multiple versions of Bag-
ging and Boosting that are part of the current state-of-the-art for evolving data
stream learning, such as Leveraging Bagging [13] and Online Smooth-Boost [21].
Random Forests for evolving data stream learning is currently represented by the
Dynamic Streaming Random Forests [2], which lacks a resampling method, uses a
drift detection algorithm with no theoretical guarantees, and was evaluated only
on limited synthetic data (1 data set with 7 million instances, 5 attributes and 5
classes).
In this work we present the Adaptive Random Forests (ARF) algorithm, a
new streaming classifier for evolving data streams. ARF is an adaptation of the
classical Random Forest algorithm [17], and can also be viewed as an updated
version of previous attempts to perform this adaptation [1,2]. Therefore, the main
novelty of ARF is in how it combines the batch algorithm traits with dynamic
update methods to deal with evolving data streams. In comparison to previous
adaptations of Random Forest to the stream setting [1,2], ARF uses a theoretically
sound resampling method based on Online Bagging [35] and an updated adaptive
strategy to cope with evolving data streams. This adaptive strategy is based on
using a drift monitor per tree to track warnings and drifts, and to train new trees in
the background (when a warning is detected) before replacing them (when a drift
is detected). We avoid binding ARF to a specific drift detection algorithm to
facilitate future adaptations, thus we present experiments using both ADWIN [10]
and the Page Hinkley Test [36].
The main contributions of this paper are the following:
– Adaptive Random Forests (ARF): a new Random Forests algorithm for
evolving data stream classification. As shown in the empirical experiments
in Section 6, ARF is able to obtain high classification performance on data
streams with different characteristics without further hyper-parameter tuning.
Since it is a sustainable off-the-shelf learner for the challenging task of evolving
data stream classification, it is going to be useful both for practical applications
and as a benchmark for future algorithm proposals in the field.
– Drift Adaptation: we propose a drift adaptation strategy that does not simply
reset base models whenever a drift is detected. Instead, it starts training a
background tree after a warning has been detected and only replaces the primary
model if the drift occurs. This strategy can be adapted to other ensembles
as it is not dependent on the base model.
– Parallel implementation: we present experiments in terms of CPU time and
RAM-Hours of a parallel implementation of ARF.
– Comprehensive experimental setting: very often experiments with novel
classifiers are focused on the well known test-then-train setting, where it is as-
sumed that labels for an instance are available before the next instance arrives.
We discuss the implications of a setting where labels are not readily available
(delayed setting) and report experiments based on it. Besides using accuracy
to measure classification performance, we also report Kappa M [9] and Kappa
Temporal [43], which allow better estimations for data sets with imbalanced
classes and temporal dependencies, respectively.
– Open source: all data sets and algorithms used in this paper are going to
be available as an extension to the MOA software [11], the most popular open
source software on data stream mining¹, as a publicly available benchmark that
other researchers can use when developing new algorithms.
The remainder of this work is organized as follows. In Section 2 we describe the
challenges, characteristics and different settings concerning evolving data streams
classification. In Section 3 we briefly discuss related works for data stream clas-
sification. Section 4 contains the description of our novel algorithm, i.e. Adaptive
Random Forests. In Section 5 the experimental setting and data sets used are de-
scribed. In Section 6 the results of the experiments are presented and thoroughly
discussed. Finally, Section 7 concludes this work and poses directions for future
work.
2 Problem Statement
Data stream classification, or online classification, is similar to batch classification
in the sense that both are concerned with predicting a nominal (class) value y of
an unlabeled instance represented by a vector of characteristics x. The difference
between online and batch resides in how learning, and predictions, take place. In
data stream classification, instances are not readily available for training as part
of a large static data set, instead, they are provided as a continuous stream of data
in a fast-paced way. Prediction requests are expected to arrive at any time and
the classifier must use its current model to make predictions. On top of that, it
is assumed that concept drifts may occur (evolving data streams), which damage
(or completely invalidate) the current learned model. Concept drifts might be
interleaved with stable periods that vary in length, and as a consequence, besides
learning new concepts it is also expected that the classifier retains previously
learned knowledge. The ability to learn new concepts (plasticity) while retaining
knowledge (stability) is known as the stability-plasticity dilemma [33,26]. In other
words, a data stream learner must be prepared to process a possibly infinite amount
of instances, such that storage for further processing is possible as long as the
algorithm can keep processing instances at least as fast as they arrive. Also, the
algorithm must incorporate mechanisms to adapt its model to concept drifts, while
selectively maintaining previously acquired knowledge.
Formally, a data stream S presents, every u time units, new unlabeled instances
x_t to the classifier for prediction, such that x_t represents a vector of features
made available at time t. Most of the existing works on data stream classification
assume that the true class label y_t corresponding to instance x_t is available before
the next instance x_{t+1} is presented to the learning algorithm, thus the learner
can use it for training immediately after it has been used for prediction. This
setting may be realistic for problems like short-term stock market predictions,
although this is not the only meaningful setting for data stream learning. In some
real-world problems labels are not readily available, or some are never available,
after predictions. In Figure 1 we represent the characteristics of a stream learning
problem according to when labels are made available, and briefly discuss them
below:
– Immediate: labels are presented to the learner before the next instance arrives.
– Delayed: labels arrive with delay d, which may be fixed or vary for different
instances.
– Never: labels are never available to the learner.
Situations where labels are never available (unsupervised learning) or where
some percentage p of labels will never arrive (semi-supervised learning) are outside
the scope of this work. Also, when labels are presented in a delayed fashion, it may
be the case that they arrive in batches of size greater than one, and the learner must
rapidly use these batches to update its model as new instances for prediction might
arrive concomitantly. In this paper we evaluate our Adaptive Random Forests
(ARF) algorithm in both immediate and delayed settings. As well as comparing
the results from both settings in terms of classification accuracy, we also report
CPU time and memory consumption (RAM-hours) as estimates of computational
resources usage.
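The immediate and delayed settings differ only in when a label reaches the learner. The following minimal sketch illustrates that difference under stated assumptions: `run_delayed`, `predict` and `train` are hypothetical names, the delay is fixed and counted in instances, and a delay of 0 reduces to the immediate (test-then-train) setting.

```python
from collections import deque


def run_delayed(stream, predict, train, delay):
    """Test on every instance; train on it only `delay` instances later."""
    pending = deque()      # (x, y) pairs whose labels have not arrived yet
    correct = total = 0
    for x, y in stream:
        correct += int(predict(x) == y)   # test first, with the current model
        total += 1
        pending.append((x, y))
        if len(pending) > delay:          # the oldest pending label arrives now
            train(*pending.popleft())
    return correct / total if total else 0.0


# Usage with a trivial constant predictor (illustrative only).
trained_on = []
accuracy = run_delayed(
    stream=[(i, 0) for i in range(5)],
    predict=lambda x: 0,
    train=lambda x, y: trained_on.append(x),
    delay=2,
)
```

Instances still waiting in `pending` when the stream ends are never used for training, which mirrors labels that have not yet arrived.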
3 Related Work
Ensemble classifiers are often chosen for dealing with evolving data stream classifi-
cation. Besides ensembles achieving (on average) higher classification performance
than single models, this decision is also based on the distinctive trait that ensem-
bles allow selective reset/remove/add/update of base models in response to drifts.
Many state-of-the-art ensemble methods for data stream learning [35,21, 38, 8] are
adapted versions of Bagging [16] and Boosting [24]. The standard Online Bagging
algorithm uses λ = 1, which means that around 37% of the values output by the
Poisson distribution are 0, another 37% are 1, and 26% are greater than 1. This
implies that by using Poisson(1), 37% of the instances are not used for training
(value 0), 37% are used once (value 1), and 26% are trained with repetition (values
greater than 1). Subsequent algorithms like Leveraging Bagging [13] and the
Diversity for Dealing with Drifts ensemble (DDD) [34] use different values of λ to
use more instances for training the base models (as in Leveraging Bagging) or to
induce more diversity to the ensemble by using varying values of λ (as in DDD).
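The 37% / 37% / 26% split quoted for Poisson(1) follows directly from the Poisson probability mass function. A short check, where the helper names `poisson_pmf` and `weight_shares` are illustrative and not taken from any library:

```python
import math


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def weight_shares(lam):
    """Shares of instances with weight 0, weight 1 and weight greater than 1."""
    p0 = poisson_pmf(0, lam)
    p1 = poisson_pmf(1, lam)
    return p0, p1, 1.0 - p0 - p1


p0, p1, p_more = weight_shares(1.0)
print(f"{p0:.3f} {p1:.3f} {p_more:.3f}")  # prints "0.368 0.368 0.264"
```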
One advantage of adapting existing batch ensembles is that they have already
been thoroughly studied, thus as long as the adaptation to online learning retains
the original method properties it can benefit from previous theoretical guarantees.
The first attempt to adapt Random Forests [17] to data stream learning is the
Streaming Random Forests [1]. Streaming Random Forests grow binary Hoeffding
trees while limiting the number of features considered for split at every node to a
random subset of features and by training each tree on random samples (without
replacement) of the training data. Effectively, trees are trained sequentially on
a fixed number of instances controlled by a hyper-parameter tree window, which
means that after a tree’s training is finished its model will never be updated. As
a result, this approach is only reasonable for stationary data streams.
To cope with evolving data streams, ensembles are often coupled with drift
detectors. For instance, Leveraging Bagging [13] and ADWIN Bagging [14] use the
ADaptive WINdowing (ADWIN) algorithm [10], while DDD [34] uses Early Drift
Detection Method (EDDM) [6] to detect concept drifts. Another approach to deal
with concept drifts while using an ensemble of classifiers is to constantly reset low
performance classifiers [32, 20, 27]. This reactive approach is useful to recover from
gradual drifts, while methods based on drift detectors are more appropriate for
rapidly recovering from abrupt drifts.
[Fig. 1 Stream learning according to labels arrival time. Node labels: Stream learning;
Immediate, Delayed, Never; Fixed, Varying; All labeled (Supervised), Some labeled]
The same authors of Streaming Random Forests [1] presented the Dynamic
Streaming Random Forests [2] to cope with evolving data streams. Dynamic
Streaming Random Forests replaces the hyper-parameter tree window by a dynamically
updated parameter tree min, which is supposed to enforce trees that achieve
performance at least better than random guessing. Dynamic Streaming Random
Forests also includes an entropy-based drift detection technique that outputs an
estimated percentage of concept change. According to this estimated percentage of
concept change, more trees are reset. However, even if it is 0, at least 25% of the
trees are reset whenever a new block of labelled instances is available.
Our Adaptive Random Forests (ARF) algorithm resembles Dynamic Streaming
Random Forests, as both use Hoeffding trees as base learners and include a
drift detection operator. The first difference between both algorithms is that ARF
simulates sampling with replacement via online bagging [35] instead of growing each
tree sequentially on different subsets of data. This is not only a more theoretically
sound approach, but also has the practical effect of allowing training trees in
parallel.
Another difference is that Dynamic Streaming Random Forests resets 25% of
its trees with every new batch of labelled instances, while ARF is based on a warning
and drift detection scheme per tree, such that after a warning has been detected
for one tree, another one (background tree) starts growing in parallel and replaces
the tree only if the warning escalates to a drift.
Finally, ARF hyper-parameters are limited to the feature subset size m,
the number of trees n and the thresholds that control the drift detection method
sensitivity, thus it does not depend on difficult to set hyper-parameters such as
the number of instances a tree must be trained on, or the minimum accuracy that
a tree has to achieve before training stops.
4 Adaptive Random Forests
Random Forests [17] is a widely used learning algorithm in non-stream (batch)
classification and regression tasks. Random Forests can grow many trees while
preventing them from overfitting by decorrelating them via bootstrap aggregat-
ing (bagging [16]) and random selection of features during node split. The original
Random Forests algorithm requires multiple passes over input data to create
bootstraps for each tree, while each internal node of every tree requires a pass
over some portion of the original features.
Algorithm 1 RFTreeTrain. Symbols: λ: fixed parameter to Poisson distribution;
GP: grace period before recalculating heuristics for split test.
 1: function RFTreeTrain(m, t, x, y)
 2:     k ← Poisson(λ = 6)
 3:     if k > 0 then
 4:         l ← FindLeaf(t, x)
 5:         UpdateLeafCounts(l, x, k)
 6:         if InstancesSeen(l) ≥ GP then
 7:             AttemptSplit(l)
 8:             if DidSplit(l) then
 9:                 CreateChildren(l, m)
10:             end if
11:         end if
12:     end if
13: end function

In data stream learning it is infeasible to perform multiple passes over input
data. Thus, an adaptation of Random Forests to streaming data depends on: (1)
an appropriate online bootstrap aggregating process; and (2) limiting each leaf
split decision to a subset of features. The second requirement is achieved by
modifying the base tree induction algorithm, effectively by restricting the set of
features considered for further splits to a random subset of size m, where m < M and M
corresponds to the total number of features. To explain our adaptations to address
the first requirement we need to discuss how bagging works in the non-streaming
setting, and how it is simulated in a streaming setting. In non-streaming Bagging [16],
each of the n base models is trained on a bootstrap sample of size Z created
by drawing random samples with replacement from the training set. Each
bootstrapped sample contains an original training instance K times, where P(K = k)
follows a binomial distribution. For large values of Z this binomial distribution
approaches a Poisson(λ = 1) distribution. Based on that, the authors in [35] proposed
the Online Bagging algorithm, which approximates the original random sampling
with replacement by weighting instances² according to a Poisson(λ = 1)
distribution. In ARF, we use Poisson(λ = 6), as in Leveraging Bagging [13], instead
of Poisson(λ = 1). This “leverages” resampling, and has the practical effect of
increasing the probability of assigning higher weights to instances while training
the base models.
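Two claims in the paragraph above can be checked numerically: that a Binomial(Z, 1/Z) distribution approaches Poisson(1) as Z grows, and that moving from λ = 1 to λ = 6 leaves very few instances with weight 0. The sketch below uses only the standard library; the helper names are illustrative.

```python
import math


def binomial_pmf(k, trials, p):
    """P(K = k) for K ~ Binomial(trials, p)."""
    return math.comb(trials, k) * p ** k * (1 - p) ** (trials - k)


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def max_gap(z, ks=range(6)):
    """Largest pmf difference between Binomial(z, 1/z) and Poisson(1) over ks."""
    return max(abs(binomial_pmf(k, z, 1 / z) - poisson_pmf(k, 1.0)) for k in ks)


for z in (10, 100, 10_000):
    print(z, max_gap(z))            # the gap shrinks as Z grows

# Share of instances a tree skips entirely (weight 0) under each setting.
print(poisson_pmf(0, 1.0))          # about 0.368 with lambda = 1
print(poisson_pmf(0, 6.0))          # about 0.0025 with lambda = 6
```

With λ = 6 almost every instance reaches every tree, so diversity comes mainly from the differing weights and the random feature subsets rather than from instances being left out.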
The function responsible for inducing each base tree is detailed in Algorithm
1. Random Forest Tree training (RFTreeTrain) is based on the Hoeffding Tree
algorithm (i.e. Very Fast Decision Tree) [23] with some important differences.
First, RFTreeTrain does not include any early tree pruning. Second, whenever a
new node is created (line 9, Algorithm 1) a random subset of features with size m
is selected and split attempts (line 7, Algorithm 1) are limited to these features for
the given node. Smaller values of GP³ (line 6, Algorithm 1) cause recalculations
of the split heuristic more frequently and tend to yield deeper trees. In general,
deeper trees are acceptable, even desired, in Random Forests. They are acceptable
because even if individual trees overfit, the variance reduction from averaging
multiple trees prevents the whole forest from overfitting. They are desired because
trees with very specialized models tend to differ more from one another.
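The per-node feature restriction described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: `new_node_features` and `best_split_feature` are hypothetical helpers, and the `merits` list stands in for whatever split heuristic the tree computes.

```python
import random


def new_node_features(n_features, m, rng):
    """Pick the m distinct feature indices a newly created node may split on."""
    return rng.sample(range(n_features), m)


def best_split_feature(merits, candidates):
    """Return the candidate feature with the highest split merit.

    Features outside `candidates` are ignored even if their merit is higher.
    """
    return max(candidates, key=lambda f: merits[f])


rng = random.Random(42)
candidates = new_node_features(n_features=10, m=3, rng=rng)

# Feature 0 has the best merit overall, but the node only sees its own subset.
merits = [0.9, 0.1, 0.5, 0.3]
chosen = best_split_feature(merits, candidates=[1, 3])   # chooses feature 3
```

Because each node draws its own subset, two trees that receive the same instances can still grow different structures.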
The overall ARF pseudo-code is presented in Algorithm 2. To cope with
stationary data streams, a simple algorithm where each base tree is trained according
to the RFTreeTrain function as new instances are available would be sufficient, i.e.
lines 11 to 21 could be omitted from Algorithm 2. However, in ARF we aim
at dealing with evolving data streams, thus it is necessary to include other
strategies to cope with concept drifts. Concretely, these strategies include drift/warning
detection methods, weighted voting and training trees in the background before
replacing existing trees. The rest of this section is dedicated to explaining and
justifying these strategies.
² In this context, weighting an instance with a value w for a given base model is analogous
to training the base model w times with that instance.
³ GP was originally introduced in [23] as n_min; we use GP for consistency with the rest of
our nomenclature.

Algorithm 2 Adaptive Random Forests. Symbols: m: maximum features evaluated
per split; n: total number of trees (n = |T|); δ_w: warning threshold; δ_d: drift
threshold; C(·): change detection method; S: data stream; B: set of background
trees; W(t): tree t weight; P(·): learning performance estimation function.
 1: function AdaptiveRandomForests(m, n, δ_w, δ_d)
 2:     T ← CreateTrees(n)
 3:     W ← InitWeights(n)
 4:     B ← ∅
 5:     while HasNext(S) do
 6:         (x, y) ← next(S)
 7:         for all t ∈ T do
 8:             ŷ ← predict(t, x)
 9:             W(t) ← P(W(t), ŷ, y)
10:             RFTreeTrain(m, t, x, y)        ▷ Train t on the current instance (x, y)
11:             if C(δ_w, t, x, y) then        ▷ Warning detected?
12:                 b ← CreateTree()           ▷ Init background tree
13:                 B(t) ← b
14:             end if
15:             if C(δ_d, t, x, y) then        ▷ Drift detected?
16:                 t ← B(t)                   ▷ Replace t by its background tree
17:             end if
18:         end for
19:         for all b ∈ B do                   ▷ Train each background tree
20:             RFTreeTrain(m, b, x, y)
21:         end for
22:     end while
23: end function

To cope with evolving data streams a drift detection algorithm is usually
coupled with the ensemble algorithm [14,13]. The default approach is to reset learners
immediately after a drift is signaled. This may decrease the ensemble classification
performance, since the reset learner has not been trained on any instance, thus making
it unable to positively impact the overall ensemble predictions. Instead of resetting
trees as soon as drifts are detected, in ARF we use a more permissive threshold
to detect warnings (line 11, Algorithm 2) and create “background” trees that are
trained (line 20, Algorithm 2) alongside the ensemble without influencing the ensemble
predictions. If a drift is detected (line 15, Algorithm 2) for the tree that originated
the warning signal, it is then replaced by its respective background tree.
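The warning/drift mechanism for a single ensemble position can be sketched as follows. Everything here is an assumption made for illustration: `ToyTree` is a majority-class stand-in for a Hoeffding tree, `WindowDetector` is a toy error-rate detector (neither ADWIN nor the Page Hinkley test), and the thresholds are arbitrary. Only the control flow follows the description above: a warning starts a background tree, and a drift swaps it in.

```python
from collections import deque


class ToyTree:
    """Hypothetical stand-in for a Hoeffding tree: predicts the majority class."""
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def train(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


class WindowDetector:
    """Toy detector: fires when the error rate over the last `size`
    predictions exceeds `threshold`."""
    def __init__(self, threshold, size=20):
        self.threshold = threshold
        self.errors = deque(maxlen=size)

    def update(self, error):
        self.errors.append(error)
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > self.threshold


class Slot:
    """One ensemble position: a tree, its detectors, an optional background tree."""
    def __init__(self, warn_at=0.3, drift_at=0.6):
        self.warn_at, self.drift_at = warn_at, drift_at
        self.tree, self.background = ToyTree(), None
        self._new_detectors()

    def _new_detectors(self):
        self.warning = WindowDetector(self.warn_at)   # permissive threshold
        self.drift = WindowDetector(self.drift_at)    # strict threshold

    def learn(self, x, y):
        """Test-then-train on (x, y); return True if the tree was replaced."""
        error = self.tree.predict(x) != y
        self.tree.train(x, y)
        if self.background is not None:
            self.background.train(x, y)    # background trees never vote
        warned = self.warning.update(error)
        drifted = self.drift.update(error)
        if warned and self.background is None:
            self.background = ToyTree()    # simplification: one per warning phase
        if drifted:
            # Fall back to a fresh tree if no warning preceded the drift.
            self.tree = self.background if self.background is not None else ToyTree()
            self.background = None
            self._new_detectors()
        return drifted
```

On a stream whose label switches from 0 to 1, the slot first raises a warning, trains the background tree on the new concept, and then replaces the primary tree with a model that already predicts the new label.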
ARF is not bound to a specific detector. To show how different drift detection
methods would perform in our implementation, we present experiments with
ADWIN and the Page Hinkley test (PHT) [36]. Some drift detection algorithms might
depend on many parameters (this is the case for PHT); however, to simplify our
pseudocode we assume only two different parameters: one for warning detection, δ_w,
and another for drift detection, δ_d. Effectively, for ADWIN δ_w and δ_d correspond
to the confidence level of the warning and drift detection, respectively, while in
PHT each would comprise a set of parameters.
In ARF votes are weighted based on the trees’ test-then-train accuracy (line
9, Algorithm 2), i.e., assuming the tree l has seen n_l instances since its last reset
and correctly classified c_l instances, such that c_l ≤ n_l, then its weight will be
c_l / n_l. Assuming the drift and warning detection methods are precise, this
weighting reflects the tree performance on the current concept. An advantage of
using this weighting mechanism is that it does not require a predefined window or
fading factor to estimate accuracy as in other data stream ensembles [19,20, 27].
Similarly to the drift/warning detection method, other voting schemes could be
used. To illustrate that, we also present experiments using a simple majority vote.
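The accuracy-weighted vote can be illustrated with a small sketch. The function name `weighted_vote` and the example counts are made up for illustration; only the weight definition (correct predictions divided by instances seen since the last reset) comes from the text.

```python
from collections import defaultdict


def weighted_vote(predictions, correct, seen):
    """Combine tree votes, weighting tree i by correct[i] / seen[i].

    predictions[i] is the class voted by tree i; a tree that has seen no
    instances since its last reset gets weight 0.
    """
    scores = defaultdict(float)
    for label, c, n in zip(predictions, correct, seen):
        scores[label] += c / n if n else 0.0
    return max(scores, key=scores.get)


# Two weak trees vote "b", one accurate tree votes "a":
# "a" scores 0.9, "b" scores 0.3 + 0.4 = 0.7, so the weighted vote picks "a",
# whereas a simple majority vote would pick "b".
winner = weighted_vote(["a", "b", "b"], correct=[90, 30, 40], seen=[100, 100, 100])
```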
4.1 Theoretical Insights
Given the maximum features per split m, the number of classes c, the number
of leaves l, and the maximum number of possible values per feature v, a single
Hoeffding Tree [23] demands O(lmcv) memory, assuming memory depends only
on the true concept [23]. Given T as the total number of trees and l_max as the
maximum number of leaves for all trees, the ARF algorithm, without warning/drift
detection, requires O(T l_max m c v) memory, while using drift detection additionally
requires space for the detector data structure of each tree. For example, ADWIN [10]
requires O(M · log(W/M)), such that M is the number of buckets, while W is
maximum numbers per memory word [10], thus ARF using ADWIN for warning
and drift detection requires O(T (M · log(W/M) + l_max m c v)) of space.
The number of background trees is never greater than the maximum number
of trees, i.e. |B| ≤ n, thus in the worst case it is necessary to allocate 2n trees
concurrently. However, the warning/drift detection data structures are not activated
in the background trees, thus they require less memory than an actual tree and
this also prevents background trees from triggering warnings which could lead to
multiple recursive creations of background trees.
Finally, in the Hoeffding Tree algorithm [23] the authors present a strategy to
limit memory usage by introducing a threshold that represents the maximum
available memory. If this threshold is reached, the least promising leaves
are deactivated. Assuming p_l is the probability that a leaf node l is reached, and e_l is
the observed error rate at l, then p_l · e_l is an upper bound on the error reduction
achievable by refining l; the least promising leaves are those that achieve the lowest
values of p_l · e_l. Originally, Hoeffding Trees also include a pruning strategy that
removes poor attributes early on, yet we do not include this operation in ARF as
pruning in Random Forests reduces variability.
4.2 Parallelizing the Adaptive Random Forests algorithm
The most time consuming task in ensemble classifiers is often training their base
learners, exceptions being ensembles in which lazy learners are used. In a data
stream configuration, base learners are recurrently responsible for other tasks,
for example, keeping track of drift and updating individual data structures that
represent their weights. In ARF, training a tree with an instance includes updates
to the underlying drift detector, updating its estimated test-then-train accuracy,
and, if a warning is signalled, starting a new background tree. The aforementioned
operations can be executed independently for each tree, thus it is feasible to execute
them in separate threads. To verify the advantages of training trees in parallel
we provide a parallel version ARF[M] and compare it against a standard serial
implementation ARF[S]. Anticipating the results presented in the experiments
section, the parallel version is around 3 times faster than the serial version, and since
we are simply parallelizing independent operations there is no loss in classification
performance, i.e., the results are exactly the same.
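The independence argument can be illustrated with a small sketch in which each tree owns all of its state, including its random number generator, so the order in which threads run cannot change the outcome. `ToyTree` and `run` are hypothetical names and the random weight is only a stand-in for the Poisson draw. Note that this shows the structure only: in CPython, threads do not speed up CPU-bound pure-Python code.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class ToyTree:
    """Hypothetical tree that keeps all of its state private."""
    def __init__(self, seed):
        self.rng = random.Random(seed)    # per-tree RNG: nothing is shared
        self.counts = {}

    def train(self, x, y):
        k = self.rng.randint(0, 2)        # stand-in for the Poisson weight
        self.counts[y] = self.counts.get(y, 0) + k


def run(n_trees, stream, workers=None):
    """Train serially when workers is None, otherwise one task per tree."""
    trees = [ToyTree(seed) for seed in range(n_trees)]
    if workers is None:
        for x, y in stream:
            for tree in trees:
                tree.train(x, y)
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for x, y in stream:
                # list() waits for every tree before the next instance arrives
                list(pool.map(lambda t: t.train(x, y), trees))
    return [tree.counts for tree in trees]


stream = [(i, i % 3) for i in range(50)]
same = run(4, stream) == run(4, stream, workers=4)   # identical models either way
```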
5 Experimental Setting
In this section we present the experimental setting used. We evaluate the experi-
ments in terms of memory, time and classification performance. Memory is mea-
sured in GBs and based on RAM-hours [15], i.e. one GB of memory deployed for
one hour corresponds to one RAM-Hour. Processing time is measured in seconds
and is based on the CPU time used for training and testing. To assess classification
performance we perform 10-fold cross-validation prequential evaluation [9]⁴.
This evaluation ought not be confused with the standard cross-validation from
batch learning, which is not applicable to data stream classification mainly
because instances can be strongly time-dependent, thus making it very difficult to
organize instances in folds that reflect the characteristics of the data. Three
different strategies were proposed in [9] for cross-validation prequential evaluation,
namely: k-fold distributed cross-validation, k-fold distributed split-validation and
k-fold distributed bootstrap validation. These strategies share the characteristic
of training and testing kmodels in parallel, while they differ on how the folds are
built. In our evaluation framework we use the k-fold distributed cross-validation
as recommended in [9]. In this strategy, each instance is used for testing in one
randomly selected model and for training by all others.
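The k-fold distributed cross-validation rule stated above (test on one randomly chosen model, train all the others) can be sketched as follows. `MajorityModel` and `distributed_cv` are hypothetical names, and the fading factor mentioned for prequential evaluation is left out to keep the example short.

```python
import random


class MajorityModel:
    """Toy classifier used only to make the sketch runnable."""
    def __init__(self):
        self.counts = {}
        self.trained = 0

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def train(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
        self.trained += 1


def distributed_cv(stream, k, seed=0):
    """Each instance tests one randomly selected model and trains the other k - 1."""
    rng = random.Random(seed)
    models = [MajorityModel() for _ in range(k)]
    tested, correct = [0] * k, [0] * k
    for x, y in stream:
        held_out = rng.randrange(k)
        tested[held_out] += 1
        correct[held_out] += int(models[held_out].predict(x) == y)
        for i, model in enumerate(models):
            if i != held_out:
                model.train(x, y)
    return models, tested, correct


models, tested, correct = distributed_cv([(i, i % 2) for i in range(200)], k=10)
```

Every instance is tested exactly once overall, and each model is trained on all instances except the ones it was tested on.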
Since accuracy can be misleading on data sets with class imbalance or temporal
dependencies, we also report Kappa M and Kappa Temporal. In [9] the authors show
that the Kappa M measure has advantages over the Kappa statistic as it has a zero value
for a majority class classifier. For data sets that exhibit temporal dependencies it
is advisable to evaluate Kappa Temporal since it replaces the majority class classifier
with the NoChange classifier [43].
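Both measures share the form (p0 − p_ref) / (1 − p_ref), where p0 is the accuracy of the evaluated classifier and p_ref is the accuracy of a reference classifier: the majority class classifier for Kappa M and the NoChange classifier for Kappa Temporal. The sketch below is a simplified offline version, which is an assumption: it computes the majority class over the whole label sequence, whereas a streaming evaluator would maintain these quantities incrementally.

```python
from collections import Counter


def kappa(p0, p_ref):
    """Improvement of accuracy p0 over a reference accuracy p_ref (p_ref < 1)."""
    return (p0 - p_ref) / (1.0 - p_ref)


def kappa_m(y_true, y_pred):
    """Reference: always predict the most frequent true label."""
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_major = Counter(y_true).most_common(1)[0][1] / n
    return kappa(p0, p_major)


def kappa_temporal(y_true, y_pred):
    """Reference: NoChange, i.e. predict the previous true label.

    Both classifiers are scored from the second instance onwards, since
    NoChange has no prediction for the first one.
    """
    rows = list(zip(y_true[1:], y_pred[1:], y_true[:-1]))
    p0 = sum(t == p for t, p, _ in rows) / len(rows)
    p_nochange = sum(t == prev for t, _, prev in rows) / len(rows)
    return kappa(p0, p_nochange)


# A classifier that always answers the majority class scores 0 on Kappa M,
# even though its plain accuracy is 75% on this imbalanced toy sequence.
score = kappa_m([0, 0, 0, 1], [0, 0, 0, 0])
```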
All the experiments were performed on machines with 40 cores⁵ and 200 GB of
RAM. Experiments focusing on resource usage were run individually and repeated
10 times to diminish perturbations on the results. We evaluate algorithms using
the immediate setting and the delayed setting. In the delayed setting, the delay was set
to 1,000 instances and the classification performance estimates are calculated the
same way as they are in the immediate setting, i.e. a 10-fold cross-validation. The
only difference is ‘when’ labels become available to train the classifier, i.e., 1,000
instances after the instance is used for prediction. To verify if there were statistically
significant differences between algorithms, we performed non-parametric tests
using the methodology from [22]. For the statistical test we employ the Friedman
test with α = 0.05 and the null hypothesis “there is no statistical difference
between the given algorithms”. If it is rejected, then we proceed with the Nemenyi
post-hoc test to identify these differences. All experiments were configured and
executed within the MOA (Massive Online Analysis) framework [11].
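As a rough illustration of the Friedman test used here, the statistic can be computed from the average rank of each algorithm across data sets, using the standard formula χ²_F = 12N / (k(k+1)) · (Σ_j R_j² − k(k+1)² / 4) for N data sets and k algorithms. The function names and the score table below are made up for illustration and are not results from this paper; the Nemenyi post-hoc step is omitted.

```python
def average_ranks(scores):
    """scores[i][j]: performance of algorithm j on data set i (higher is better).

    Returns the mean rank of each algorithm (rank 1 = best); ties share the
    average of the ranks they span.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1      # mean of ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic computed from the average ranks."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


# Illustrative accuracies for 3 algorithms on 4 data sets (not real results).
toy = [[0.90, 0.85, 0.80], [0.75, 0.70, 0.72], [0.88, 0.86, 0.81], [0.93, 0.90, 0.91]]
chi2 = friedman_statistic(toy)
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom; for k = 3 and α = 0.05 the critical value is about 5.99.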
We use 10 synthetic and 6 real data sets in our experiments. The synthetic data
sets include abrupt, gradual and incremental drifts and one stationary data stream,
while the real data sets have been thoroughly used in the literature to assess
the classification performance of data stream classifiers and exhibit multiclass
problems, temporal dependencies and class imbalance. The 10-fold distributed
cross-validation for the SPAM data set with 100 base models did not finish for LevBag,
OzaBag and OzaBoost, as the machine ran out of memory (we tried using
up to 200 GB of memory). Therefore we only report SPAM results at the end of
this report to show how ARF performs on a data set with a massive number of
features (see Figure 6). Our goal with this multitude of data sets with different
characteristics is to show how ARF performs in each of these scenarios. Table 1
presents an overview of the data sets, while further details can be found in the
rest of this section.
⁴ Prequential evaluation is similar to test-then-train; the only difference between them is
that prequential includes a fading factor to ‘forget’ the performance of old predictions.
⁵ Intel(R) Xeon(R) CPU E5-2660 v3 2.60GHz
Table 1 Data set configurations (A: Abrupt Drift, G: Gradual Drift, Im: Incremental Drift
(moderate) and If: Incremental Drift (fast)). MF Label and LF Label stand for Most Frequent
and Less Frequent class label, respectively.
Data set  # Instances  # Features  Type       Drifts  # Classes  MF Label  LF Label
LEDa        1,000,000          24  Synthetic  A              10    10.08%     9.94%
LEDg        1,000,000          24  Synthetic  G              10    10.08%     9.94%
SEAa        1,000,000           3  Synthetic  A               2    57.55%    42.45%
SEAg        1,000,000           3  Synthetic  G               2    57.55%    42.45%
AGRa        1,000,000           9  Synthetic  A               2    52.83%    47.17%
AGRg        1,000,000           9  Synthetic  G               2    52.83%    47.17%
RTG         1,000,000          10  Synthetic  N               2    57.82%    42.18%
RBFm        1,000,000          10  Synthetic  Im              5    30.01%     9.27%
RBFf        1,000,000          10  Synthetic  If              5    30.01%     9.27%
HYPER       1,000,000          10  Synthetic  If              2     50.0%     50.0%
AIRL          539,383           7  Real       -               2    55.46%    44.54%
ELEC           45,312           8  Real       -               2    57.55%    42.45%
COVT          581,012          54  Real       -               7    48.76%     0.47%
GMSC          150,000          11  Real       -               2    93.32%     6.68%
KDD99       4,898,431          41  Real       -              23    57.32%  0.00004%
SPAM            9,324      39,917  Real       -               2     74.4%     25.6%
LED. The LED data set simulates both abrupt and gradual drifts based on the
LED generator, originally introduced in [18]. This generator yields instances with 24
boolean features, 17 of which are irrelevant. The remaining 7 features correspond
to the segments of a seven-segment LED display. The goal is to predict the digit
displayed on the LED display, where each feature has a 10% chance of being
inverted. To simulate drifts in this data set, the relevant features are swapped
with irrelevant features. Concretely, we parametrize 3 gradual drifts, each with
an amplitude of 50k instances and centered at the 250k, 500k and 750k instances,
respectively. The first drift swaps 3 features, the second drift swaps 5 features, and
the last one swaps 7 features. LEDg simulates 3 gradual drifts, while LEDa simulates 3
abrupt drifts.
SEA. The SEA generator [42] produces data streams with three continuous
attributes (f1, f2, f3). The range of values that each attribute can assume is
between 0 and 10. Only the first two attributes (f1, f2) are relevant, i.e., f3 does
not influence the class value determination. New instances are obtained by
randomly setting a point in a two-dimensional space, such that these dimensions
correspond to f1 and f2. This two-dimensional space is split into four blocks,
each of which corresponds to one of four different functions. In each block a point
belongs to class 1 if f1 + f2 ≤ θ and to class 0 otherwise. The threshold θ used
to split instances between class 0 and 1 assumes the values 8 (block 1), 9 (block 2), 7
(block 3) and 9.5 (block 4). It is possible to add noise to class values, the
default value being 10%, and to balance the number of instances of each class. SEAg
simulates 3 gradual drifts, while SEAa simulates 3 abrupt drifts.
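The SEA concept functions can be sketched in a few lines. This is a simplified illustration and not the MOA generator: the function names are hypothetical, class balancing is omitted, and the drift positions passed to `sea_abrupt_stream` are an assumption, since the text does not state them for SEA.

```python
import random

# Thresholds theta for blocks 1 to 4, as listed in the text.
SEA_THRESHOLDS = (8.0, 9.0, 7.0, 9.5)


def sea_instance(rng, block, noise=0.10):
    """Draw one SEA instance for the concept function of the given block (0-based)."""
    f1, f2, f3 = (rng.uniform(0.0, 10.0) for _ in range(3))
    label = 1 if f1 + f2 <= SEA_THRESHOLDS[block] else 0  # f3 is irrelevant
    if rng.random() < noise:
        label = 1 - label  # class noise flips the label
    return (f1, f2, f3), label


def sea_abrupt_stream(n, drift_points, seed=1, noise=0.10):
    """Yield n instances, switching to the next concept at each drift point."""
    rng = random.Random(seed)
    block = 0
    for i in range(n):
        if i in drift_points:  # drift positions are an assumption of this sketch
            block = (block + 1) % len(SEA_THRESHOLDS)
        yield sea_instance(rng, block, noise)
```

For example, `sea_abrupt_stream(1000, {250, 500, 750})` produces a short stream with 3 abrupt drifts that visits all four thresholds in order.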
AGRAWAL. The AGRa and AGRg data sets are based on the AGRAWAL
generator [4], which produces data streams with six nominal and three continuous
attributes. There are ten different functions that map instances into two different
classes. A perturbation factor is used to add noise to the data; both AGRg and
AGRa include a 10% perturbation factor. This factor changes the original value of
an attribute by adding a deviation value to it, which is defined according to a
uniform random distribution. AGRg simulates 3 gradual drifts, while AGRa simulates
3 abrupt drifts.
RTG. The Random Tree Generator (RTG) [23] builds a decision tree by ran-
domly selecting attributes as split nodes and assigning random classes to each
leaf. After the tree is built, new instances are obtained through the assignment
of uniformly distributed random values to each attribute. The leaf reached after a
traverse of the tree, according to the attribute values of an instance, determines
its class value. RTG allows customizing the number of nominal and numeric at-
tributes, as well as the number of classes. In our experiments we did not simulate
drifts for the RTG data set.
RBF. The RBFm and RBFf data sets were generated using the Radial Basis
Function (RBF) generator. This generator creates centroids at random positions and
associates them with a standard deviation value, a weight and a class label. To
create new instances one centroid is selected at random, where centroids with
higher weights have more chances to be selected. The new instance input values
are set according to a random direction chosen to offset the centroid. The extent
of the displacement is randomly drawn from a Gaussian distribution according to
the standard deviation associated with the given centroid. To simulate incremental
drifts, centroids move at a continuous rate, effectively causing new instances
that ought to belong to one centroid to belong to another with (maybe) a different
class. Both RBFm and RBFf were parametrized with 50 centroids and all of them drift.
RBFm simulates a “moderate” incremental drift (speed of change set to 0.0001)
while RBFf simulates a more challenging “fast” incremental drift (speed of change
set to 0.001).
HYPER. The HYPER data set simulates an incremental drift and it was
generated based on the hyperplane generator [30]. A hyperplane is a flat,
(n − 1)-dimensional subset of an n-dimensional space that divides it into two
disconnected parts. It is possible to change the orientation and position of a
hyperplane by slightly changing the relative size of its weights wi. This generator
can be used to simulate time-changing concepts by varying the values of its weights
as the stream progresses
[12]. HYPER was parametrized with 10 attributes and a magnitude of change of
Airlines. The Airlines data set was inspired by the regression data set from
Ikonomovska⁶. The task is to predict whether a given flight will be delayed given
information on the scheduled departure. Thus, it has 2 possible classes: delayed or
not delayed. This data set contains 539,383 records with 7 attributes (3 numeric
and 4 nominal).
Electricity. The Electricity data set was collected from the Australian New
South Wales Electricity Market, where prices are not fixed. These prices are af-
fected by demand and supply of the market itself and set every five minutes.
The Electricity data set contains 45,312 instances, where class labels identify the
changes of the price (2 possible classes: up or down) relative to a moving average of
the last 24 hours. An important aspect of this data set is that it exhibits temporal
dependencies.
Covertype. The forest covertype data set represents forest cover type for 30 x
30 meter cells obtained from the US Forest Service Region 2 Resource Information
System (RIS) data. Each class corresponds to a different cover type. This data
set contains 581,012 instances, 54 attributes (10 numeric and 44 binary) and 7
imbalanced class labels.
GMSC. The Give Me Some Credit (GMSC) data set⁷ is a credit scoring
data set where the objective is to decide whether a loan should be allowed. This
decision is crucial for banks since erroneous loans lead to the risk of default and
unnecessary expenses on future lawsuits. The data set contains historical data
provided on 150,000 borrowers, each described by 10 attributes.
KDD99. The KDD’99 data set⁸ is often used for assessing data stream mining
algorithms’ accuracy due to its ephemeral characteristics [3,5]. It corresponds to a
cyber attack detection problem, i.e. attack or common access, an inherent stream-
ing scenario since instances are sequentially presented as a time series [3]. This
data set contains 4,898,431 instances and 41 attributes.
Spam. The Spam Corpus data set was developed in [31] as the result of a
text mining process on an online news dissemination system. The work presented
in [31] aimed to create an incremental email filter that classifies emails as
spam or ham (not spam) and, based on this classification, decides whether an
email is relevant or not for dissemination among users. This data set has 9,324
instances and 39,917 boolean attributes, such that each attribute represents the
presence of a single word (the attribute label) in the instance (e-mail).
5.1 Ensembles and Parametrization
We compare ARF to state-of-the-art ensemble learners for data stream classifi-
cation, including bagging and boosting variants with and without explicit drift
detection and adaptation. Bagging variants include Online Bagging (OzaBag)
[35] and Leveraging Bagging (LevBag) [13]. Boosting-inspired algorithms are
resented by Online Boosting (OzaBoost) [35] and Online Smooth-Boost (OSBoost)
[21]. The Online Accuracy Updated Ensemble (OAUE) [20] is a dynamic ensemble
designed specifically for data stream learning and it is neither based on bagging
nor boosting.
All experiments use the Hoeffding Tree [23] algorithm with Naive Bayes at the
leaves [29] as the base learner, which we refer to as Hoeffding Naive Bayes Tree
(HNBT). ARF uses a variation of HNBT that limits splits to m randomly selected
features, where m = √M + 1 in all our experiments (see Section 6.1 for experiments
varying m). An important parameter of the trees is the grace period GP, which
is used to optimize training time [23] by delaying calculations of the heuristic
measure G used to choose the test features (in this work we use Information
Gain). Using smaller values of GP increases run time (and memory usage)
and also causes trees to grow deeper, which enhances the overall variability of the
forest, and consequently ARF’s classification performance. For consistency, we use
the same base learner configuration for all ensembles, i.e., HNBTs with GP = 50.
We report statistics for ensembles of 100 members, with the exception of ad hoc
experiments that focus on CPU Time and RAM-Hours analysis. In the following
sections we organize experiments as follows: (1) Comparisons among ARF and
some of its variants; (2) Resource usage analysis; and (3) Comparisons of ARF
against other state-of-the-art ensembles.
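As a worked example of the subspace size, the snippet below evaluates m = √M + 1 rounded to the closest integer, which is the rounding described for the SEA data sets in footnote 9. The function name `subspace_size` and the cap at M are assumptions of this sketch, and whether the class attribute counts towards M depends on the implementation.

```python
import math


def subspace_size(num_features):
    """Number of features considered per split: sqrt(M) + 1, rounded, at most M."""
    return min(num_features, round(math.sqrt(num_features) + 1))


# Feature counts taken from Table 1 (SEA, LED, COVT and the 10-feature generators).
for name, num_features in [("SEA", 3), ("LED", 24), ("COVT", 54), ("RBF", 10)]:
    print(name, num_features, subspace_size(num_features))
```

For SEA the result equals the full feature set (m = M = 3), so random feature selection has no effect on that generator.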
6 Experiments
We start our experimentation by comparing variations of ARF to evaluate its sen-
sitivity to parameters (e.g. drift and warning threshold, ensemble size and subspace
size) and variations of the algorithm that deactivate some of its characteristics
(e.g. drift detection, warning detection, weighted vote). The second set of exper-
iments concerns the evaluation of computational resources usage (CPU time and
RAM-Hours). Finally, we present experiments comparing ARF and other state-
of-the-art ensemble classifiers in terms of accuracy, Kappa M and Kappa T, for
immediate and delayed settings.
6.1 ARF variations
Our first analysis is a comparison between 6 variations of the ARF algorithm,
each of which ‘removes’ some characteristics from ARF (e.g. drift detection) or
has a different parametrization (e.g. uses Page Hinkley drift detection). We did
this comparison to illustrate the benefits of using ARF as previously stated in
Section 4, and also to discuss how each strategy included in it contributes to the
overall classification performance. Table 2 presents the immediate setting 10-fold
cross-validation accuracy for these variations. Each variation configuration is as
follows:
ARFmoderate: Adopts a parametrization of ADWIN that results in fewer
drifts/warnings being flagged (δw = 0.0001 and δd = 0.00001).
ARFfast: Uses a parametrization of ADWIN that causes more drifts/warnings
to be detected (δw = 0.01 and δd = 0.001).
ARFPHT: Uses the Page Hinkley test (PHT) to detect drifts/warnings (δw =
0.005, δd = 0.01, other parameters: λ = 50, α = 0.9999).
ARFnoBkg: Removes only the warning detection and background trees, therefore
whenever drifts are detected the associated trees are immediately reset.
ARFstdRF: This is a ‘pure’ online Random Forests version, as it deactivates the
detection algorithm, does not reset trees and uses majority vote.
ARFmaj: Same configuration as ARFmoderate, but it uses majority vote instead
of weighted majority.
Table 2 Accuracy in the immediate setting for ARF variations (# learners = 100)
Data set            ARFmoderate  ARFfast  ARFPHT  ARFnoBkg  ARFstdRF  ARFmaj
LEDa                      73.72    73.74   73.57     73.73      66.5   73.71
LEDg                      72.87    72.89   72.83     72.84     66.36   72.86
SEAa                      89.66    89.66   89.58     89.66     87.27   89.66
SEAg                      89.24    89.23   89.25     89.24      87.2   89.24
AGRa                      89.75    89.98    89.3     89.75     79.88    89.6
AGRg                      84.54     84.6   84.45     84.73     76.96   84.39
RTG                       93.91    93.91   93.91     93.89     93.89   93.89
RBFm                      86.02    86.19   85.18     86.05     74.96   86.01
RBFf                      72.36    72.46   70.73     72.45     47.02   72.21
HYPER                     85.16    85.44   84.87     85.42     78.68   85.16
Synthetic Avg             83.72    83.81   83.37     83.78     75.87   83.67
Synthetic Avg Rank          2.7      1.8     4.1       2.6       5.9     3.9
AIRL                      66.26    66.48   66.03     66.66     65.09   66.23
ELEC                      88.54    89.44   87.04      88.6     85.81    88.5
COVT                      92.32    91.85   91.81     92.35     88.18   92.31
GMSC                      93.55    93.55   93.55     93.55     93.55   93.55
KDD99                     99.97    99.97   99.98     99.97     99.97   99.97
Real Avg                  88.13    88.26   87.68     88.23     86.52   88.11
Real Avg Rank               3.6      2.4     3.6       2.8       4.6       4
Overall Avg               85.19    85.29   84.81     85.26     79.42   85.15
Overall Avg Rank              3        2    3.93      2.67      5.47    3.93
Without any drift detection (ARFstdRF) the results on data streams that
contain drifts are degraded severely. If trees are reset immediately whenever a drift
is detected (ARFnoBkg), the results improve on 2 real data sets (AIRL and COVT),
although we observe slightly better results when using background trees
and drift warnings (ARFmoderate and ARFfast), especially on the synthetic data
sets. In general, the weighted majority vote is capable of improving performance
on almost every data set when we compare ARFmoderate and ARFmaj, which use the
exact same configuration except that the latter uses majority vote instead
of weighted majority. This behavior can be attributed to the variance in weights
during periods of drift, such that trees adapted to the current concept receive
higher weights and overshadow outdated trees. However, if the trees’ weights are
overestimated (or underestimated), this can lead to a combination that is inferior
to majority vote. Therefore, if it is infeasible to obtain accurate weights, e.g.,
accuracy is not a reasonable metric for the data set, then it is safer to use majority
vote or change the weighting function. ARFmoderate and ARFfast differ the most
on the real data set ELEC (almost 1% accuracy), while the other results are quite
similar, with a slight advantage for ARFfast. ARFfast trains background trees
for less time than ARFmoderate as it detects drifts sooner, while ARFnoBkg is
an extreme case with no background training at all. In practice, it is necessary
to experiment with the warning and drift detector parameters to find the optimal
combination for the input data. However, it is unlikely that skipping the training
of trees before adding them to the forest, even for short periods, would benefit the
overall classification performance, as the first decisions of a newly created tree are
essentially random. On the other hand, the improvements obtained by training
background trees are expected to be small, as the most important step remains
resetting trees when drifts occur: short periods of random decisions can be
‘corrected’ as long as not all trees are undergoing this process at the same time.
The best result for RTG is obtained by ARFPHT; however, this data set does not
contain any drift, thus it is not reasonable to attribute its classification
performance to the Page Hinkley Test detector. Also, the
difference between ARFmoderate and ARFPHT is after the second decimal place.
The Friedman test based on the overall rankings of Table 2 (both synthetic and
real data sets) indicated that there were differences among these ARF variations.
The follow-up Nemenyi post-hoc test, presented in Figure 2, indicates that there are
no significant differences between ARFfast, ARFmoderate, ARFPHT, ARFnoBkg
and ARFmaj. Further experiments in this work are based on the ARFmoderate
configuration and referred to solely as ARF (or ARF[M] or ARF[S] when evaluating
resource usage).
Fig. 2 ARF variations Nemenyi test (95% confidence level), immediate setting with 100
learners. CD = 1.947
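The reported critical difference is consistent with the usual Nemenyi formula CD = q_α · sqrt(k(k + 1) / (6N)), with k = 6 compared configurations and N = 15 data sets. The sketch below is a worked check under that assumption; the critical values q_α for α = 0.05 are the ones commonly tabulated for the Nemenyi test, and the function name is hypothetical.

```python
import math

# Commonly tabulated critical values q_alpha for the Nemenyi test at alpha = 0.05,
# indexed by the number of compared algorithms k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850, 7: 2.949}


def nemenyi_cd(k, n):
    """Critical difference between average ranks for k algorithms on n data sets."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n))


print(round(nemenyi_cd(6, 15), 3))
```

Two configurations are considered significantly different when their average ranks differ by more than this value.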
To illustrate the impact of using different values for m (feature subset size) and
n (ensemble size) we present 3D plots of six data sets in Figure 3. In Figures 3a,
3b and 3e it was clearly a good choice to use small values of m; however, this might
not always be the case, as observed in Figures 3c, 3d and 3f. In the COVT, GMSC
and RTG plots (Figures 3c, 3d and 3f) we observe a trend where increasing the
number of features results in classification performance improvements. For RTG
we can affirm that this behavior is associated with the fact that the underlying
data generator is based on a random assignment of values to instances and a
decision tree traversal to determine the class label (see Section 6), so that
no single feature, or subset of features (other than the full set), is strongly
correlated with the class label. Therefore, when each tree is assigned the full set of
features and uses only sampling with replacement to induce diversity, better
performance per base tree is achieved, thus the overall ensemble obtains better
performance as well. We cannot affirm the same for the real data sets
that exhibit similar behavior to RTG (GMSC and COVT), as the underlying data
generator is unknown.
6.2 Resources comparison between ARF[S] and ARF[M]
To assess the benefits in terms of resource usage we compare the ARF[M] and ARF[S]
implementations. We report the average memory and processing time used to process
all data sets for 10, 20, 50 and 100 classifiers. Figures 4a and 4b present the results
of these experiments. It is expected that ARF[M] requires more memory than
ARF[S], yet since it executes faster its average RAM-Hours is lower in comparison
to ARF[S]. Ideally, a parallel implementation over elements that are independent,
in our case the training of the trees, should scale linearly in the number of elements
if enough threads are available. However, some factors limit
scalability in our implementation of ARF[M], such as the number of available
threads, the overhead of job creation at every new training instance and operations
that are not parallelized (e.g. combining votes). Examining Figures 4a and 4b one
can see that when the number of trees is close to or less than 40 (the number of
available processors) the gains are more prominent; this is expected, as there is a
limited number of trees that can be trained at once.
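The trade-off between memory and run time can be made concrete with the RAM-Hours measure, where one RAM-Hour is commonly taken to be one GB of RAM deployed for one hour. The sketch below assumes that definition; the function name and the memory and time figures in the example are hypothetical and are not measurements from the experiments.

```python
def ram_hours(avg_memory_gb, cpu_seconds):
    """RAM-Hours: average memory in GB multiplied by the run time in hours."""
    return avg_memory_gb * cpu_seconds / 3600.0


# Hypothetical figures: the parallel run uses more memory but finishes sooner.
serial = ram_hours(avg_memory_gb=0.5, cpu_seconds=2000.0)
parallel = ram_hours(avg_memory_gb=0.8, cpu_seconds=700.0)
print(serial, parallel)
```

In this made-up case the parallel run has the lower RAM-Hours despite its larger memory footprint, which is the effect described above for ARF[M].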
Fig. 3 ARF: accuracy (immediate) x ensemble size (n) x subspace size (m). Marked lines
highlight m = √M + 1. Panels: (a) AGRg, (b) AIRL, (c) COVT, (d) GMSC, (e) KDD99, (f) RTG.
Fig. 4 ARF[M] and ARF[S] comparison in terms of CPU Time and Memory, for 10, 20, 50
and 100 learners. Panels: (a) CPU Time (s), immediate setting; (b) RAM-Hours (GB-Hour),
immediate setting.
6.3 ARF compared to other ensembles
This section comprises the comparison of ARF against state-of-the-art ensemble
classifiers. First, we report the CPU time and RAM-Hours for ensembles with 100
base models in Tables 3 and 4. Since ARF[M] distributes the training and drift
detection among several threads it is unsurprisingly the most efficient in terms
of CPU time and memory used. Besides that, we note that ARF[S] outperforms
Leveraging Bagging and is close to OAUE in terms of CPU time, while being very
similar to others in terms of RAM-Hours, yet worse than OAUE, OzaBag and
OSBoost.
Table 3 CPU Time - Immediate setting (# learners = 100)
Data set ARF[S] ARF[M] OzaBag OAUE OzaBoost OSBoost LevBag
LEDa 1251.31 582.31 1388.63 1659.16 1305.46 1778.47 2698.73
LEDg 1236.91 679.68 1244.97 1567.29 1154.88 1847.86 2332.57
SEAa 1293.37 490.08 466.27 684.69 507.4 493.18 1431.29
SEAg 1272.34 461.99 459.96 602.56 491.54 454.83 1379.41
AGRa 1864.13 818.63 710.86 854 828.83 661.61 3981.14
AGRg 2002.59 821.73 804.08 903.02 801.96 690.55 3225.2
RTG 5910.1 475.57 571.51 701.78 889.57 636.74 2865.09
RBFm 1713.33 1133.28 1438.18 1876.86 1335.2 1822.05 3440.54
RBFf 1711.51 908.26 1389.02 1815.99 1273.64 1998.32 3517.02
HYPER 1736.24 837.89 976.29 1050.43 922.73 927.8 3708.08
Synthetic Avg 1999.18 720.94 944.98 1171.58 951.12 1131.14 2857.91
Synthetic Avg Rank 5 1.8 2.8 5 3 3.5 6.9
AIRL 2745.49 361.31 544.75 896.69 626.04 448.45 4925.71
ELEC 73.37 31.28 30.01 24.69 34.69 28.01 104.97
COVT 1230.93 686.08 1160.96 1603.41 1148.46 1359.04 2906.84
GMSC 189.55 149.96 100.04 152.69 83.3 76.26 306.23
KDD99 4322.82 2109.04 2945.59 4910.54 3462.14 7553.36 4795.75
Real Avg 1712.43 667.53 956.27 1517.6 1070.93 1893.02 2607.9
Real Avg Rank 5.2 2.2 2.8 4.6 3.2 3.4 6.6
Overall Avg 1903.6 703.14 948.74 1286.92 991.06 1385.1 2774.57
Avg Rank 5.07 1.93 2.8 4.87 3.07 3.47 6.8
Table 4 RAM-Hours - Immediate setting (# learners = 100)
Data set ARF[S] ARF[M] OzaBag OAUE OzaBoost OSBoost LevBag
LEDa 0.054 0.023 0.297 0.151 0.279 0.162 0.056
LEDg 0.054 0.038 0.264 0.109 0.244 0.163 0.053
SEAa 0.219 0.083 0.046 0.022 0.062 0.015 0.607
SEAg 0.229 0.083 0.045 0.03 0.06 0.014 0.341
AGRa 0.855 0.098 0.361 0.13 0.329 0.114 0.174
AGRg 0.856 0.851 0.425 0.096 0.332 0.123 0.486
RTG 1.121 0.09 0.17 0.18 0.317 0.065 0.15
RBFm 0.038 0.025 0.236 0.036 0.177 0.144 0.764
RBFf 0.01 0.006 0.106 0.008 0.1 0.131 0.085
HYPER 0.173 0.084 0.413 0.035 0.361 0.116 1.075
Synthetic Avg 0.361 0.138 0.236 0.08 0.226 0.105 0.379
Synthetic Avg Rank 4.8 2.5 5.2 2.6 4.9 3.1 4.9
AIRL 0.422 0.056 0.023 0.337 0.196 0.216 1.425
ELEC 0.001 0.001 0.001 0 0.001 0 0.003
COVT 0.002 0.002 0.516 0.089 0.557 0.178 0.19
GMSC 0.02 0.016 0.004 0.005 0.004 0.001 0.067
KDD99 0.013 0.007 0.499 0.039 1.335 0.992 0.253
Real Avg 0.092 0.016 0.209 0.094 0.419 0.278 0.388
Real Avg Rank 4.4 2.6 3.4 3.2 5 3.4 6
Overall Avg 0.271 0.098 0.227 0.084 0.29 0.162 0.382
Avg Rank 4.86 2.64 4.43 2.71 4.86 3.07 5.43
The next step in our comparison of ARF to other ensembles is the evaluation of
its overall classification performance according to Accuracy, Kappa M and Kappa
Temporal. We group experiments per evaluation metric and setting used (delayed
or immediate) in Tables 5, 6, 7, 8, 9, and 10. The variations in the rankings
from delayed to immediate suggest that ARF is more suitable to the immediate
setting. In Table 5 we highlight ARF performance on the RBFm and RBFf data sets,
both containing incremental drifts. As previously mentioned in Section 6.1, ARF
cannot obtain good results in RTG while using only m = √M + 1 features; this
is emphasized when comparing ARF against other ensembles, as ARF consistently
obtains the worst results in RTG. ARF performs well on SEAa and SEAg; however,
these results are not related to the random selection of features, as the SEA generator
has only 3 features and each tree ends up using 3 features per split⁹.
Analysing the results from Kappa Temporal in Table 7, we observed that none
of the classifiers were able to surpass the baseline (NoChange classifier [43]) on the
COVT data set. This characteristic is accentuated in the experiments using the
delayed setting displayed in Table 10, where algorithms also failed to overcome
the baseline on the ELEC data set. Probably, using a temporally augmented
wrapper, as suggested in [43], would alleviate this problem for the immediate
setting, although it is unclear if it would solve the problem in the delayed setting.
Through analysis of Kappa M in Tables 6 and 9 we observed that differences in
accuracy that appeared to be small are actually highlighted in Kappa M.
For example, ARF and OzaBoost in data set AIRL achieved 66.26% and 60.83%
accuracy, respectively, while in terms of Kappa M ARF achieves 24.24% and OzaBoost
only 12.05%.
We report only statistical tests based on the average rank of accuracy, since
ranks did not change among Accuracy, Kappa M and Kappa Temporal. Concretely,
we used the results from Tables 5 and 8. The Friedman test indicates that there
were significant differences in both the immediate and delayed settings. We proceeded
with the Nemenyi post-hoc test to identify these differences, the results of which are
plotted in Figure 5.
The statistical tests for the immediate setting indicate that there are no
significant differences between ARF, LevBag, OSBoost and OAUE. Differences
in the delayed setting are less prominent, with OzaBag joining the aforementioned
classifiers. This suggests that active drift detection techniques can have
less impact in the delayed setting, as ARF and LevBag have their overall
performance degraded when training is delayed. This is especially true for
incremental drifts, as drift signals (and warning signals in ARF) are delayed and
action is taken to accommodate a potentially already outdated concept. This is
observable by analysing the accuracy drop from the immediate to the delayed setting
for ARF and LevBag in RBFf (Tables 5 and 8).
There is not much variation with respect to ranking changes when comparing
the synthetic data sets results between the immediate and delayed settings.
The only change is that OzaBag swaps rankings with LevBag in RBFf, which
effectively boosts the overall OzaBag ranking. In the real data sets the variations
are more prominent, such that ARF surpasses OzaBoost in the ELEC data set
for the delayed setting; however, ARF loses 1 average rank from the immediate
to the delayed setting in the real data sets. Finally, OzaBag, OAUE and
OSBoost improved their overall average rankings from the immediate results to the
delayed results, while ARF, OzaBoost and LevBag decreased their classification
performances. Surprisingly, GMSC results improved in the delayed setting in
comparison to those obtained in the immediate setting; this is easier to observe when
comparing the Kappa M results from Tables 6 and 9 for the GMSC data set.
⁹ After rounding √3 + 1 to the closest integer we obtain 3, such that m = M for SEAa and
SEAg.
Table 5 Accuracy - Immediate setting (# learners = 100).
Data set ARF OzaBag OAUE OzaBoost OSBoost LevBag
LEDa 73.72 69.18 73.35 68.88 72.53 73.92
LEDg 72.87 69.17 72.55 69.57 72.47 73.22
SEAa 89.66 87.19 88.77 88.21 89.15 88.36
SEAg 89.24 87.12 88.26 87.87 88.92 89.08
AGRa 89.75 82.83 90.67 88.49 91.02 89.17
AGRg 84.54 79.26 85.29 84.39 87.73 83.4
RTG 93.91 97.2 97 95.97 97.25 97.53
RBFm 86.02 62.62 83.69 36.23 65.84 84.89
RBFf 72.36 38.33 56.19 26.16 42.38 58.28
HYPER 85.16 80.2 87.67 85.93 87.88 87.45
Synthetic Avg 83.72 75.31 82.34 73.17 79.52 82.53
Synthetic Avg Rank 2.5 5.4 2.9 5.1 2.6 2.5
AIRL 66.26 64.96 65.35 60.83 65.62 63.38
ELEC 88.54 82.51 86.37 90.17 87.05 88.53
COVT 92.32 84.05 92.26 93.83 86.34 93.08
GMSC 93.55 93.52 93.55 92.64 92.95 93.54
KDD99 99.97 99.93 2.61 99.01 99.93 99.96
Real Avg 88.13 84.99 68.03 87.29 86.38 87.7
Real Avg Rank 1.6 4.8 4 3.8 3.8 3
Overall Avg 85.19 78.54 77.57 77.88 81.8 84.25
Overall Avg Rank 2.2 5.2 3.27 4.67 3 2.67
Table 6 Kappa M - Immediate setting (# learners = 100).
Data set ARF OzaBag OAUE OzaBoost OSBoost LevBag
LEDa 70.75 65.7 70.35 65.36 69.43 70.98
LEDg 69.8 65.68 69.45 66.13 69.35 70.2
SEAa 74.21 68.05 71.98 70.59 72.93 70.96
SEAg 73.16 67.86 70.71 69.75 72.37 72.77
AGRa 78.26 63.61 80.22 75.6 80.96 77.04
AGRg 67.22 56.04 68.82 66.9 73.98 64.8
RTG 85.56 93.36 92.89 90.45 93.48 94.14
RBFm 80.03 46.59 76.69 8.89 51.19 78.42
RBFf 60.51 11.88 37.41 -5.5 17.67 40.39
HYPER 70.27 60.33 75.29 71.81 75.72 74.85
Synthetic Avg 72.98 59.91 71.38 58 67.71 71.45
Synthetic Avg Rank 2.5 5.4 2.9 5.1 2.6 2.5
AIRL 24.24 21.34 22.21 12.05 22.82 17.8
ELEC 73 58.79 67.89 76.84 69.49 72.97
COVT 85 68.87 84.89 87.96 73.35 86.5
GMSC 3.51 3 3.46 -10.17 -5.54 3.4
KDD99 99.93 99.83 -128.21 97.68 99.85 99.92
Real Avg 57.14 50.37 10.05 52.87 51.99 56.12
Real Avg Rank 1.6 4.8 4 3.8 3.8 3
Overall Avg 67.7 56.73 50.94 56.29 62.47 66.34
Overall Avg Rank 2.2 5.2 3.27 4.67 3 2.67
Table 7 Kappa Temporal - Immediate setting (# learners = 100).
Data set ARF OzaBag OAUE OzaBoost OSBoost LevBag
LEDa 70.79 65.74 70.38 65.41 69.46 71.01
LEDg 69.83 65.72 69.48 66.17 69.39 70.23
SEAa 78.31 73.13 76.43 75.27 77.23 75.58
SEAg 77.44 72.98 75.37 74.57 76.77 77.11
AGRa 77.57 62.45 79.59 74.82 80.35 76.31
AGRg 66.61 55.22 68.24 66.29 73.5 64.15
RTG 87.51 94.26 93.86 91.74 94.37 94.93
RBFm 81.92 51.63 78.89 17.49 55.8 80.45
RBFf 64.24 20.2 43.32 4.46 25.44 46.02
HYPER 70.32 60.4 75.33 71.87 75.77 74.9
Synthetic Avg 74.45 62.17 73.09 60.81 69.81 73.07
Synthetic Avg Rank 2.5 5.4 2.9 5.1 2.6 2.5
AIRL 19.56 16.48 17.39 6.61 18.05 12.71
ELEC 21.86 -19.24 7.08 32.99 11.73 21.78
COVT -55.59 -222.99 -56.81 -24.91 -176.49 -40.07
GMSC 48.29 48.01 48.26 40.95 43.43 48.22
KDD99 -140.48 -471.44 -769385.24 -7717.89 -416.24 -177.84
Real Avg -21.27 -129.84 -153873.86 -1532.45 -103.9 -27.04
Real Avg Rank 1.6 4.8 4 3.8 3.8 3
Overall Avg 42.54 -1.83 -51242.56 -470.28 11.9 39.7
Overall Avg Rank 2.2 5.2 3.27 4.67 3 2.67
Table 8 Accuracy - Delayed setting (# learners = 100).
Data set ARF OzaBag OAUE OzaBoost OSBoost LevBag
LEDa 73.57 69.01 73.19 68.72 72.37 73.76
LEDg 72.76 69.03 72.44 69.44 72.36 73.16
SEAa 89.57 87.11 88.68 88.14 89.06 88.27
SEAg 89.17 87.03 88.17 87.81 88.84 89
AGRa 89.58 82.67 90.48 88.31 90.84 88.98
AGRg 84.48 79.11 85.14 84.28 87.59 83.3
RTG 93.85 97.13 96.94 95.91 97.19 97.46
RBFm 83.45 56.69 80.42 34.7 59.72 81.81
RBFf 29.12 28.76 28.69 26.12 28.43 27.82
HYPER 84.85 79.97 87.27 85.57 87.39 87.06
Synthetic Avg 79.04 73.65 79.14 72.9 77.38 79.06
Synthetic Avg Rank 2.5 5.1 2.9 5.1 2.6 2.8
AIRL 64.93 64.82 65.13 60.63 65.32 62.74
ELEC 75.36 74.27 74.63 71.07 72.91 74.61
COVT 83.79 78.34 84.81 84.48 80.11 85.09
GMSC 93.55 93.52 93.55 92.67 92.96 93.55
KDD99 98.72 99.53 2.4 98.62 99.59 99.38
Real Avg 83.27 82.1 64.1 81.49 82.18 83.07
Real Avg Rank 2.6 4 2.9 5.2 3.4 2.9
Overall Avg 80.45 76.47 74.13 75.76 78.98 80.4
Overall Avg Rank 2.53 4.73 2.9 5.13 2.87 2.83
Table 9 Kappa M - Delayed setting (# learners = 100).
Data set ARF OzaBag OAUE OzaBoost OSBoost LevBag
LEDa 70.58 65.51 70.16 65.18 69.25 70.79
LEDg 69.68 65.53 69.32 65.99 69.24 70.13
SEAa 73.97 67.84 71.77 70.41 72.7 70.74
SEAg 72.99 67.65 70.5 69.59 72.16 72.57
AGRa 77.92 63.25 79.82 75.21 80.58 76.65
AGRg 67.1 55.71 68.5 66.68 73.7 64.6
RTG 85.42 93.2 92.74 90.3 93.34 93.98
RBFm 76.35 38.11 72.03 6.7 42.45 74.02
RBFf -1.27 -1.78 -1.88 -5.56 -2.26 -3.13
HYPER 69.64 59.88 74.5 71.09 74.74 74.07
Synthetic Avg 66.24 57.49 66.74 57.56 64.59 66.44
Synthetic Avg Rank 2.5 5.1 2.9 5.1 2.6 2.8
AIRL 21.26 21.03 21.72 11.61 22.14 16.35
ELEC 41.96 39.38 40.23 31.85 36.19 40.17
COVT 68.36 57.72 70.35 69.71 61.19 70.9
GMSC 3.53 3.08 3.51 -9.69 -5.41 3.51
KDD99 97 98.89 -128.69 96.76 99.05 98.55
Real Avg 46.42 44.02 1.42 40.05 42.63 45.9
Real Avg Rank 2.6 4 3 5.2 3.4 2.8
Overall Avg 59.63 53 44.97 51.72 57.27 59.59
Overall Avg Rank 2.53 4.73 2.93 5.13 2.87 2.8
Table 10 Kappa Temporal - Delayed setting (# learners = 100).
Data set ARF OzaBag OAUE OzaBoost OSBoost LevBag
LEDa 70.62 65.55 70.2 65.23 69.28 70.83
LEDg 69.72 65.57 69.35 66.02 69.27 70.16
SEAa 78.11 72.95 76.25 75.11 77.04 75.39
SEAg 77.29 72.8 75.19 74.43 76.59 76.94
AGRa 77.21 62.08 79.17 74.42 79.96 75.9
AGRg 66.49 54.89 67.91 66.06 73.21 63.94
RTG 87.39 94.12 93.72 91.61 94.24 94.79
RBFm 78.59 43.96 74.67 15.52 47.89 76.47
RBFf 8.3 7.83 7.74 4.41 7.4 6.61
HYPER 69.7 59.95 74.54 71.15 74.79 74.12
Synthetic Avg 68.34 59.97 68.88 60.39 66.97 68.52
Synthetic Avg Rank 2.5 5.1 2.9 5.1 2.6 2.8
AIRL 16.39 16.14 16.87 6.14 17.33 11.17
ELEC -67.95 -75.4 -72.95 -97.18 -84.65 -73.1
COVT -228.29 -338.67 -207.64 -214.26 -302.69 -201.9
GMSC 48.29 48.05 48.28 41.21 43.5 48.28
KDD99 -10010.79 -3644.79 -771004.82 -10839.44 -3103.47 -4775.42
Real Avg -2048.47 -798.93 -154244.05 -2220.7 -685.99 -998.19
Real Avg Rank 2.6 4 3 5.2 3.4 2.8
Overall Avg -637.26 -226.33 -51368.77 -699.97 -184.02 -287.05
Overall Avg Rank 2.53 4.73 2.93 5.13 2.87 2.8
Fig. 5 Nemenyi test with 95% confidence level (CD = 1.947 in both panels). (a) Immediate setting with 100 learners. (b) Delayed setting with 100 learners.
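The critical difference reported in Fig. 5 is consistent with the usual Nemenyi formula CD = q_alpha * sqrt(k(k+1) / (6N)). The sketch below is a check under stated assumptions, not the authors' code: it assumes k = 6 algorithms, N = 15 data sets (the ten synthetic and five real data sets of Tables 8 to 10), and the tabulated critical value q_0.05 = 2.850 for six classifiers. The function name `nemenyi_cd` is hypothetical.

```python
import math

# Assumed critical value of the Studentized range statistic divided by sqrt(2)
# for k = 6 classifiers at alpha = 0.05 (two-tailed Nemenyi test).
Q_ALPHA_K6 = 2.850


def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference between average ranks for the Nemenyi test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


cd = nemenyi_cd(Q_ALPHA_K6, k=6, n_datasets=15)
print(f"CD = {cd:.3f}")

# Two methods differ significantly when their average ranks differ by more than CD.
# Overall Avg Rank values taken from Table 8 (delayed setting, accuracy).
overall_rank = {"ARF": 2.53, "OzaBoost": 5.13, "LevBag": 2.83}
arf_vs_ozaboost = abs(overall_rank["ARF"] - overall_rank["OzaBoost"]) > cd
arf_vs_levbag = abs(overall_rank["ARF"] - overall_rank["LevBag"]) > cd
print(arf_vs_ozaboost, arf_vs_levbag)
```

With these inputs the gap between ARF and OzaBoost exceeds the critical difference, while the gap between ARF and LevBag does not.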
Focusing on the real world data sets, it is clear that ARF consistently obtains
either the best results or at least results that can be considered reasonable. In
contrast, other algorithms achieve very good results on some data sets but
sometimes fail to obtain a reasonable model (e.g. OAUE and OzaBoost on KDD99).
Figure 6 presents some of the experiments from the immediate setting (see Tables 5,
6 and 7). ARF consistently achieves superior accuracy on RBFm (Figure 6c), which
exhibits a moderate incremental drift. On LEDg (Figure 6a) and AGRa (Figure 6b),
ARF obtains reasonable performance: even though it was not the method with the
highest accuracy, it was able to adapt to the abrupt and gradual drifts. Figures 6d
and 6e are interesting because an analysis focused solely on accuracy would
indicate that classifiers stabilize after 200 thousand instances, whereas the
Kappa M plot shows that classifiers are actually still improving relative to the
majority class classifier. Similarly, the GMSC and KDD99 plots in Figures 6f, 6g,
6h and 6i show that using Kappa M on an imbalanced data set intensifies the
differences between methods. Finally, on SPAM only ARF, OAUE and OSBoost could
finish executing. The results in Figures 6j, 6k and 6l show that Kappa M for
OSBoost is below -100 (not shown in the plot), indicating that it is not a
reasonable choice for this data set. Also, every SPAM plot shows that OAUE and
OSBoost degrade over time while ARF maintains stable performance.
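To make the difference between these measures concrete, the following sketch computes accuracy, Kappa M and Kappa Temporal for a short labelled sequence. It uses the common definitions kappa = (p0 - p_ref) / (1 - p_ref), where p_ref is the accuracy of the majority class classifier (Kappa M) or of the no-change classifier that repeats the previous true label (Kappa Temporal). It is a simplified batch computation with assumptions that a stream evaluator would not make: the majority class is taken over the whole sequence rather than incrementally, and the no-change classifier is counted as wrong on the first instance. The function name is hypothetical, and the values are fractions rather than the percentages shown in the tables.

```python
from collections import Counter


def kappa_m_and_temporal(y_true, y_pred):
    """Return (accuracy, Kappa M, Kappa Temporal) as fractions.

    Raises ZeroDivisionError if a reference classifier is perfect.
    """
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Reference 1: always predict the most frequent class of the sequence.
    p_majority = Counter(y_true).most_common(1)[0][1] / n
    # Reference 2: predict the previous true label (no prediction for index 0).
    p_no_change = sum(y_true[i] == y_true[i - 1] for i in range(1, n)) / n
    kappa_m = (p0 - p_majority) / (1 - p_majority)
    kappa_t = (p0 - p_no_change) / (1 - p_no_change)
    return p0, kappa_m, kappa_t


# A temporally dependent toy stream: labels arrive in long runs.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # three mistakes

acc, k_m, k_t = kappa_m_and_temporal(y_true, y_pred)
print(round(acc, 2), round(k_m, 2), round(k_t, 2))
```

In this toy stream the classifier beats the majority class baseline (positive Kappa M) but loses to the no-change baseline (negative Kappa Temporal). When the no-change classifier is almost perfect, the denominator approaches zero, which is how very large negative Kappa Temporal values such as those reported for KDD99 can arise.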
7 Conclusion
In this work we have presented the Adaptive Random Forests (ARF) algorithm,
which adapts the Random Forests algorithm to evolving data stream learning.
We provide a serial and a parallel implementation of our algorithm, ARF[S] and
ARF[M], respectively, and show that the parallel version can process the same
number of instances in reasonable time without any decrease in classification
performance. As a byproduct and additional contribution of this work, we discuss
stream learning according to when labels become available (immediate and delayed
settings). We also remark that several of the techniques implemented in ARF, such
as warning detection and background trees, can be used in other ensembles.
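As an illustration of how warning detection and background learners could be attached to any ensemble member, consider the toy sketch below. Everything in it is an assumption made for the example: `MajorityLearner` stands in for a real base tree, `WindowErrorDetector` is a simple two-threshold error-rate monitor (not ADWIN or any detector used in the paper), and the thresholds and window size are arbitrary. A warning starts a background learner that trains without voting; a drift signal swaps it in.

```python
from collections import Counter, deque


class MajorityLearner:
    """Toy base learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


class WindowErrorDetector:
    """Toy detector: 'warning' or 'drift' based on the recent error rate."""

    def __init__(self, size=20, warn=0.3, drift=0.6):
        self.window = deque(maxlen=size)
        self.warn, self.drift = warn, drift

    def update(self, error):
        self.window.append(error)
        if len(self.window) < self.window.maxlen:
            return None
        rate = sum(self.window) / len(self.window)
        if rate >= self.drift:
            return "drift"
        return "warning" if rate >= self.warn else None


class Member:
    """One ensemble member with an optional background learner."""

    def __init__(self):
        self.foreground = MajorityLearner()
        self.background = None
        self.detector = WindowErrorDetector()
        self.replacements = 0

    def predict(self, x):
        return self.foreground.predict(x)  # the background learner never votes

    def learn(self, x, y):
        signal = self.detector.update(int(self.foreground.predict(x) != y))
        if signal == "warning" and self.background is None:
            self.background = MajorityLearner()  # start training a replacement
        elif signal == "drift":
            if self.background is not None:
                self.foreground = self.background  # promote the warmed-up learner
            else:
                self.foreground = MajorityLearner()
            self.background = None
            self.detector = WindowErrorDetector()
            self.replacements += 1
        self.foreground.learn(x, y)
        if self.background is not None:
            self.background.learn(x, y)


member = Member()
for label in ["a"] * 100 + ["b"] * 100:  # abrupt change halfway through
    member.learn(None, label)
print(member.predict(None), member.replacements)
```

The design point is that the replacement learner has already seen some post-change instances when it takes over, instead of starting from scratch at the moment the drift is confirmed.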
Fig. 6 Sorted plots of Accuracy, Kappa M and Kappa T over time (100 classifiers per
ensemble). Solid and dashed vertical lines indicate drifts and drift window
start/end, respectively. Panels: (a) Accuracy LEDg, (b) Accuracy AGRa, (c) Accuracy
RBFm, (d) Accuracy AIRL, (e) Kappa M AIRL, (f) Accuracy GMSC, (g) Kappa M GMSC,
(h) Accuracy KDD99, (i) Kappa M KDD99, (j) Accuracy SPAM, (k) Kappa M SPAM,
(l) Kappa T SPAM.

We use a diverse set of data sets to show empirical evidence that ARF obtains
good results in terms of classification performance (Accuracy, Kappa M and
Kappa Temporal) and reasonable resource usage, even for the sequential version
ARF[S], when compared to other state-of-the-art ensembles. The classification
performance experiments are further divided into the usual immediate setting and
the delayed setting. From these experiments we highlight the following
characteristics of ARF:
- ARF obtains good classification performance in both delayed and immediate
  settings, especially on real world data sets;
- ARF can be used to process data streams with a large number of features, such
  as the SPAM data set with almost forty thousand features, using a relatively
  small number of trees (100 in our experiments);
- ARF can train its base trees in parallel without affecting its classification
  performance. This is an implementation concern, but it is useful to investigate
  it and make it available along with the algorithm, as scalability is often a
  concern;
- ARF might not be able to improve on data sets where all features are necessary
  to build a reasonable model (such as RTG).
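The claim that parallel training does not change the model follows from the members sharing no state. The sketch below illustrates only that property and is not the ARF[M] implementation: label counters stand in for trees, each member owns a privately seeded random generator, and instance weights are drawn from a Poisson distribution with lambda = 6, which is an assumption of this example. Threads are used purely to show that the results are identical; the sketch makes no claim about speed-up.

```python
import math
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def poisson(rng, lam):
    """Draw a Poisson(lam) variate from the given generator (Knuth's method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def train_member(seed, stream, lam=6.0):
    """Train one toy member; weighted label counts stand in for a tree."""
    rng = random.Random(seed)  # private generator: nothing shared between members
    counts = Counter()
    for _x, y in stream:
        counts[y] += poisson(rng, lam)  # online-bagging style instance weight
    return counts


stream = [(i, "pos" if i % 3 else "neg") for i in range(300)]
seeds = range(10)

serial = [train_member(s, stream) for s in seeds]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(lambda s: train_member(s, stream), seeds))

print(serial == parallel)
```

Because each member's randomness depends only on its own seed, the order in which members are scheduled cannot affect what they learn.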
In future work we will investigate how to optimize the run-time performance
of ARF by limiting the number of detectors, as it is wasteful to maintain several
detectors that often trigger at the same time. Another possibility is to implement
a big data stream version of ARF, as we show in this work that each tree can
be trained independently (the most time consuming task) without affecting the
classification performance. Besides enhancing execution performance we are also
interested in investigating the development of a semi-supervised strategy to deal
with different real world scenarios, which might also lead to better performance
on the delayed setting.
Acknowledgements
This project was partially financially supported by the Coordenação de
Aperfeiçoamento de Pessoal de Nível Superior (CAPES) through the Programa de
Suporte à Pós-Graduação de Instituições de Ensino Particulares (PROSUP) program
for Doctorate students.
1. Hanady Abdulsalam, David B Skillicorn, and Patrick Martin. Streaming random forests.
In Database Engineering and Applications Symposium, 2007. IDEAS 2007. 11th Interna-
tional, pages 225–232. IEEE, 2007.
2. Hanady Abdulsalam, David B Skillicorn, and Patrick Martin. Classifying evolving data
streams using dynamic streaming random forests. In Database and Expert Systems Ap-
plications, pages 643–651. Springer, 2008.
3. Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A framework for
clustering evolving data streams. In Proceedings of the 29th International Conference on
Very Large Data Bases - Volume 29, VLDB ’03, pages 81–92. VLDB Endowment, 2003.
4. Rakesh Agrawal, Tomasz Imilielinski, and Arun Swani. Database mining: A performance
perspective. IEEE Trans. on Knowledge and Data Engineering, 5(6):914–925, Dec. 1993.
5. Amineh Amini and Teh Ying Wah. On density-based data streams clustering algorithms:
A survey. Journal of Computer Science and Technology, 29(1):116–141, 2014.
6. Manuel Baena-Garc´ıa, Jos´e del Campo-´
Avila, Ra´ul Fidalgo, Albert Bifet, Ricard Gavald`a,
and Rafael Morales-Bueno. Early drift detection method. 2006.
7. Jean Paul Barddal, Heitor Murilo Gomes, and Fabr´ıcio Enembreck. Sncstream: A social
network-based data stream clustering algorithm. In Proceedings of the 30th Annual ACM
Symposium on Applied Computing, SAC ’15, pages 935–940, New York, NY, USA, 2015.
8. Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for
online boosting. pages 2323–2331, 2015.
9. Albert Bifet, Gianmarco de Francisci Morales, Jesse Read, Geoff Holmes, and Bernhard
Pfahringer. Efficient online evaluation of big data stream classifiers. In Proceedings of the
21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 59–68. ACM, 2015.
10. Albert Bifet and Ricard Gavald`a. Learning from time-changing data with adaptive win-
dowing. In SIAM, 2007.
Adaptive Random Forests for Evolving Data Stream Classification 25
11. Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. Moa: Massive
online analysis. The Journal of Machine Learning Research, 11:1601–1604, 2010.
12. Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. MOA Data
Stream Mining - A Practical Approach. Centre for Open Software Innovation,
13. Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Leveraging bagging for evolving
data streams. In PKDD, pages 135–150, 2010.
14. Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavald`a.
New ensemble methods for evolving data streams. In Proceedings of the 15th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages 139–
148. ACM SIGKDD, Jun. 2009.
15. Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, and Eibe Frank. Fast perceptron
decision tree learning from evolving data streams. In PAKDD, Lecture Notes in Computer
Science, pages 299–310. Springer, 2010.
16. Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
17. Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
18. Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification
and regression trees. CRC press, 1984.
19. Dariusz Brzezi´nski and Jerzy Stefanowski. Accuracy updated ensemble for data streams
with concept drift. In Hybrid Artificial Intelligent Systems, pages 155–163. Springer, 2011.
20. Dariusz Brzezinski and Jerzy Stefanowski. Combining block-based and online methods in
learning ensembles from concept drifting data streams. Information Sciences, 265:50–67,
21. Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. An online boosting algorithm with
theoretical justifications. In Proceedings of the International Conference on Machine
Learning (ICML), June 2012.
22. Janez Demˇsar. Statistical comparisons of classifiers over multiple data sets. Journal of
Machine Learning Research, 7:1–30, December 2006.
23. Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the
sixth ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 71–80. ACM SIGKDD, Sep. 2000.
24. Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of computer and system sciences, 55(1):119–139,
25. Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In
ICML, volume 96, pages 148–156, 1996.
26. Jo˜ao Gama, Indre Zliobaite, Albert Bifet, Mykole Pechenizkiy, and Abderlhamid
Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 46(4):44:1–
44:37, March 2014.
27. Heitor Murilo Gomes and Fabr´ıcio Enembreck. Sae2: Advances on the social adaptive
ensemble classifier for data streams. In Proceedings of the 29th Annual ACM Symposium
on Applied Computing (SAC), SAC 2014, pages 199–206. ACM, March 2014.
28. Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan. Clustering data
streams. In Foundations of computer science, 2000. proceedings. 41st annual symposium
on, pages 359–366. IEEE, 2000.
29. Geoffrey Holmes, Richard Kirkby, and Bernhard Pfahringer. Stress-testing hoeffding trees.
In PKDD, pages 495–502, 2005.
30. Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams.
In Proceedings of the seventh ACM SIGKDD international conference on Knowledge dis-
covery and data mining, pages 97–106. ACM, 2001.
31. Ioannis Katakis, Grigorios Tsoumakas, Evangelos Banos, Nick Bassiliades, and Ioannis
Vlahavas. An adaptive personalized news dissemination system. Journal of Intelligent
Information Systems, 32(2):191–212, 2009.
32. Jeremy Z Kolter, Marcus Maloof, et al. Dynamic weighted majority: A new ensemble
method for tracking concept drift. In Data Mining, 2003. ICDM 2003. Third IEEE
International Conference on, pages 123–130. IEEE, 2003.
33. Chee Peng Lim and Robert F Harrison. Online pattern classification with multiple neu-
ral network systems: an experimental study. IEEE Transactions on Systems, Man, and
Cybernetics, Part C: Applications and Reviews, 33(2):235–247, 2003.
26 Heitor M. Gomes1et al.
34. Leandro L. Minku and Xin Yao. Ddd: A new ensemble approach for dealing with concept
drift. IEEE Transactions on Knowledge and Data Engineering, 24(4):619–633, 2012.
35. N.C. Oza. Online bagging and boosting. In Systems, Man and Cybernetics, 2005 IEEE
International Conference on, volume 3, pages 2340–2345 Vol. 3, Oct 2005.
36. ES Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
37. Brandon Shane Parker and Latifur Khan. Detecting and tracking concept class drift and
emergence in non-stationary fast data streams. In Twenty-Ninth AAAI Conference on
Artificial Intelligence, 2015.
38. Raphael Pelossof, Michael Jones, Ilia Vovsha, and Cynthia Rudin. Online coordinate boost-
ing. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International
Conference on, pages 1354–1361. IEEE, 2009.
39. Xiangju Qin, Yang Zhang, Chen Li, and Xue Li. Learning from data streams with only
positive and unlabeled data. Journal of Intel ligent Information Systems, 40(3):405–430,
40. Carlos Ruiz, Ernestina Menasalvas, and Myra Spiliopoulou. Discovery Science: 12th Inter-
national Conference, DS 2009, Porto, Portugal, October 3-5, 2009, chapter C-DenStream:
Using Domain Knowledge on a Data Stream, pages 287–301. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2009.
41. Tegjyot Singh Sethi, Mehmed Kantardzic, Elaheh Arabmakki, and Hanquing Hu. An
ensemble classification approach for handling spatio-temporal drifts in partially labeled
data streams. In Information Reuse and Integration (IRI), 2014 IEEE 15th International
Conference on, pages 725–732. IEEE, 2014.
42. W Nick Street and YongSeog Kim. A streaming ensemble algorithm (sea) for large-scale
classification. In Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 377–382. ACM, 2001.
43. Indr˙e ˇ
Zliobait˙e, Albert Bifet, Jesse Read, Bernhard Pfahringer, and Geoff Holmes. Eval-
uation methods and decision theory for classification of streaming data with temporal
dependence. Machine Learning, 98(3):455–482, 2015.