Streaming Random Patches for Evolving Data Stream Classification

Heitor Murilo Gomes∗†, Jesse Read, Albert Bifet∗†
University of Waikato, Hamilton, New Zealand
{heitor.gomes, albert.bifet}@waikato.ac.nz
LTCI, Télécom Paris, IP-Paris, Paris, France
LIX, École Polytechnique, Palaiseau, France
jesse.read@polytechnique.edu
Abstract—Ensemble methods are a popular choice for learning
from evolving data streams. This popularity is due to (i) the
ability to simulate simple yet successful ensemble learning
strategies, such as bagging and random forests; (ii) the possibility
of incorporating drift detection and recovery in conjunction with the
ensemble algorithm; (iii) the availability of efficient incremental
base learners, such as Hoeffding Trees. In this work, we introduce
the Streaming Random Patches (SRP) algorithm, an ensemble
method specially adapted to stream classification which combines
random subspaces and online bagging. We provide theoretical
insights and empirical results illustrating different aspects of SRP.
In particular, we explain how the widely adopted incremental
Hoeffding trees are not, in fact, unstable learners, unlike their
batch counterparts, and how this fact significantly influences
ensemble methods design and performance. We compare SRP
against state-of-the-art ensemble variants for streaming data in
a multitude of datasets. The results show that SRP produces
high predictive performance on both real and synthetic datasets.
In addition, we analyze the diversity over time and the average
tree depth, which provide insights into the differences between
local subspace randomization (as in random forest) and global
subspace randomization (as in random subspaces).
Index Terms—Stream Data Mining, Ensemble Learning, Random Subspaces, Random Patches
I. INTRODUCTION
Machine learning applications of data streams have grown
in importance in recent years due to the tremendous amount of
real-time data generated by networks, mobile phones and the
wide variety of sensors currently available. Building predictive
models from data streams is central to many applications [1].
The underlying assumption of data stream learning is that the
algorithms must process large amounts of data in a fast-paced
way. In a supervised learning scenario, this characteristic
brings forward two crucial challenges:
Computational efficiency. The algorithm must use a
limited budget of computational resources to be able to
process examples at least as fast as new examples are
available;
Evolving data. The continuous flow of data might be sub-
ject to changes over time, where the canonical example
is concept drift [2]. Concept drifts can be characterized
as changes in the underlying data distribution that affect
the fitted model, such that to maintain its predictive
performance the model must be updated or even reset.
To tackle evolving data, many strategies have been proposed, with
particular attention to ensemble-based methods. Ensembles are
often used to cope with concept drifts by selectively resetting
component learners [3]–[6]. Concerning computational
efficiency, ensembles of learners require more computational
resources than a single learner; however, many are very easy
to parallelize [6].
In the traditional batch learning setting, several ensem-
ble methodologies are widely used, such as Random Sub-
spaces [7], Pasting [8], Bagging [9], Random Forest [10],
SubBag [11], and Random Patches [12]. The main differences
among these algorithms lie in how they induce
diversity into the ensemble. Random subspace methods train
each base learner on a separate randomly selected subset of
features. Pasting and Bagging train base learners on samples
of instances drawn without and with replacement, respectively,
from the original dataset. Random Forest extends Bagging
and randomly selects subsets of features to be considered
for splits in its base learners (decision trees). SubBag and
Random Patches combine Bagging with Random Subspaces
and Pasting with Random Subspaces, respectively; thus, through
very similar means, they train base learners on random subsets
of features and samples. Other ensembles that are popular in
batch learning, such as AdaBoost [13], are less attractive
for data streams, partially because the original batch learning
implementations introduce dependencies among the base
learners, which are difficult to simulate appropriately in a
streaming setting [14].
In this work, we propose strategies to cope with classifi-
cation problems on evolving data streams using an ensemble
strategy that combines random subspaces and bagging. We
name this ensemble Streaming Random Patches (SRP) as it
is inspired by the Random Subspaces method [7] and Online
Bagging [15], and thus resembles the Random Patches [12]
algorithm. SRP incorporates an active drift detection strategy,
similarly to other ensemble methods, e.g., Leveraging Bagging
[5] and Adaptive Random Forest (ARF) [6]. The drift
detection and recovery strategy in SRP follows the approach
used in ARF. ARF consistently outperforms other state-of-the-art
ensembles for evolving data streams, partially due to this
strategy [6]; moreover, by using the same procedure in
SRP, we can compare it to ARF in terms of the ensemble
strategy without interference from the approach used to cope
with evolving data.
Similar algorithms based on the Random Subspaces
method [7] or combinations of resampling and random sub-
spaces [11], [12] have been previously explored on batch learn-
ing for high-dimensional datasets [16] and also for evolving
data stream classification [5], [6], [17], [18]. Nevertheless,
to the best of our knowledge, none of these previous works
thoroughly investigated the impact of online bagging and
random subspaces, concomitantly, for evolving data streams.
Similarly, previous works have not outlined the similarities
and differences between a global and a local randomization
strategy for the subset of features for streaming data. We
use the same definition of global and local randomization as
in [12], i.e., in the random subspaces method, the subspace of
features is selected globally once for the whole base learner,
while in the random forest algorithm the subspaces are selected
locally for each leaf of the base tree [12]. We discuss the
impact of both strategies in our experiments (Section V) while
comparing ARF and SRP. Panov and Džeroski, and Louppe
and Geurts conducted a similar investigation for the batch
setting in [11] and [12], respectively.
Paper contributions and roadmap. Our main contributions
can be summarized as follows:
1) Streaming Random Patches (SRP): We introduce an
ensemble-based method, namely SRP, that achieves high
accuracy by training base models on random subsets of
features and instances¹;
2) Theoretical insights: We analyze the SRP algorithm with
particular attention to the questions of the stability and
diversity of Hoeffding trees, and the impact of global
subspace randomization in SRP in opposition to the local
randomization in ARF;
3) Empirical Analysis: We compare SRP against state-of-the-art
ensemble variants for streaming data in a
multitude of datasets. The results give a clear overview
of predictive performance and resource usage. In addition,
we analyze the diversity over time and the average tree
depth, which provide some insights into the differences
between local and global subspace randomization.
The rest of this paper is organized as follows. In Section
II we introduce the problem of learning classification models
from evolving data streams. In Section III, we present the SRP
algorithm and theoretical insights. In Section IV, related works
are discussed and compared to our approach. In Section V, we
present the experiments conducted to analyze SRP in terms of
accuracy, computational resources, diversity and decision trees
depth. Finally, Section VI concludes this work and presents
directions for future work.
II. PROBLEM SETTING
Let $X = \{x_{-\infty}, \ldots, x_{-1}, x_0\}$ be an open-ended sequence
of observations collected over time, containing input examples
in which $x_k \in \mathbb{R}^n$ and $n \geq 1$. Similarly, let $y$ be an
open-ended sequence of corresponding class labels, such that every
example in $X$ has a corresponding entry in $y$. Moreover, $y_k$
has a finite set of possible values, i.e., $y_k \in \{l_1, \ldots, l_L\}$ for
$L \geq 2$, such that a classification task is defined. Furthermore,
we assume a problem setting where new input examples
$x$ are presented every $u$ time units to the learning model
for prediction, such that $x_k^t$ represents a vector of features
available at time $t$. The true class label $y_k^{t+1}$, corresponding
to instance $x_k^t$, is available before the next instance $x^{t+1}$
appears, and thus, it can be used for training immediately
after it has been used for prediction. We emphasise that this
experimental setting can be naturally extended to the delayed
and weakly-supervised settings considering a non-negligible
time delay between observing $x$ and its class label $y$, including
an infinite delay (i.e., the label is never observed). However,
the conclusions drawn from experimenting in such settings are
similar to those in the “immediate” setting, as shown in [6].
Therefore, for simplicity, we omit such results in this paper.

¹ The implementation and instructions are available at: https://tinyurl.com/yytbom4e
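The "immediate" setting described above corresponds to a test-then-train loop: each instance is first used for prediction and then, once its label arrives, for training. The following is a minimal sketch of that loop. The `MajorityClass` learner and the tiny hand-written stream are hypothetical placeholders and are not part of the SRP implementation.

```python
from collections import Counter


class MajorityClass:
    """Toy incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Test on each instance first, then train on it (immediate labels)."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:  # prediction happens before the label is used
            correct += 1
        total += 1
        model.learn(x, y)          # label becomes available, so train
    return correct / total if total else 0.0


# Hypothetical four-instance stream of (features, label) pairs.
stream = [((0.1,), "a"), ((0.2,), "a"), ((0.9,), "b"), ((0.3,), "a")]
accuracy = prequential_accuracy(stream, MajorityClass())
```

A delayed-label variant would buffer `(x, y)` pairs and call `learn` only after the delay has elapsed; that extension is left out here.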
An important characteristic of data stream classification is
whether the data distribution is stationary or evolving. In
this work, we assume evolving data distributions. Thus, we
expect the occurrence of concept drifts² that might influence
decision boundaries. Note that if a concept drift is accurately
detected (without false negatives) and dealt with (by fully or
partially resetting models as appropriate), an iid assumption can
be made on a per-concept basis: each concept can be
treated as a separate iid stream, leaving a series of iid streams to
be dealt with. Nevertheless, the typical nature of a data stream
as being fast and dynamic encourages the in-depth study that
we present in this work.
III. STREAMING RANDOM PATCHES
Streaming Random Patches (SRP) can be viewed as an
adaptation of batch learning ensemble methods that combine
random samples of instances and random subspaces
of features [11], [12]. Following the terminology introduced
in [12], in the rest of this work, we refer to random
subsets of both features and instances as random patches.
Fig. 1 presents an example of subsampling both instances and
features, simultaneously, from streaming data, where only the
shaded intersections of the matrix belong to the subsample,
i.e., $\{v_{1,1}, v_{2,1}, v_{6,1}, v_{1,3}, v_{2,3}, v_{6,3}\}$.
Our motivation for exploiting an ensemble of base models
trained on random patches is based mainly on the high
predictive performance of ensembles for data stream learning
that add randomization to the base models by training
them on random samples of instances [5], random subsets of
features [17], or both [6]. We investigate whether selecting the
subset of features globally, once and before constructing each
base model, outperforms locally selecting subsets of features at
each node while constructing base trees, as in Random Forest.
In [12], authors show empirical evidence that Random Patches
combined with tree-based models achieved similar accuracy to
other randomization strategies, including Random Forest [10],
while using less memory.

x1      x2      x3      x4      x...    xm
v1,1    v1,2    v1,3    v1,4    v1,5    v1,6
v2,1    v2,2    v2,3    v2,4    v2,5    v2,6
v3,1    v3,2    v3,3    v3,4    v3,5    v3,6
v5,1    v5,2    v5,3    v5,4    v5,5    v5,6
v6,1    v6,2    v6,3    v6,4    v6,5    v6,6
v...,1  v...,2  v...,3  v...,4  v...,5  v...,6
Fig. 1: Representation of a data stream as an unbounded table
where the rows are infinite, but the columns are constrained
by m input features.

² A formal definition of concept drift can be found in [19].
The original Random Patches algorithm [12] is defined
in terms of all possible subsets of features and instances,
such that $R(p_s, p_f, D)$ denotes all random patches of size
$p_s N_s \times p_f N_f$ that can be drawn from the training set $D$,
where $N_s$ and $N_f$ represent the number of instances and
features, respectively, in $D$. The hyperparameters $p_s \in [0,1]$
and $p_f \in [0,1]$ control, respectively, the fraction of samples
and features in each patch $r \in R(p_s, p_f, D)$. In SRP, the
set of all possible streaming random patches $R_s(\lambda, p_f, S)$ is
infinite in the sample dimension, as the input training data is
represented by a data stream $S$. We control the number of
samples in the streaming patch using the Poisson parameter $\lambda$
(Section III-A).
A. Random Subsets of Instances
In the batch setting, Bagging builds $L$ base models, training
each model with a bootstrap sample from the original training
dataset of size $N$. Each bootstrap contains each original training
example $K$ times, where $\Pr(K = k)$ follows a binomial
distribution which, for large $N$, tends to a Poisson($\lambda = 1$)
distribution. Using this fact, Oza and Russell [15] proposed
Online Bagging, an online method that, instead of sampling
with replacement, gives each example a weight according to
Poisson($\lambda = 1$).
Leveraging Bagging [5] and Adaptive Random Forest [6]
train their base models according to a Poisson($\lambda = 6$)
distribution, which on average increases the weight of each
training instance and diminishes the probability of not using an
instance for training: $\Pr[\text{Poisson}(\lambda = 6) = 0] \approx 0.25\%$,
while $\Pr[\text{Poisson}(\lambda = 1) = 0] \approx 36.8\%$. Using
Poisson($\lambda = 6$) tends to improve the predictive performance of
the ensemble as the base models are updated more often, but
this benefit comes at the expense of computational resources.
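The two probabilities quoted above follow from the Poisson probability mass function at zero, $\Pr[K = 0] = e^{-\lambda}$. The sketch below checks them numerically and draws per-instance training weights with Knuth's multiplication method. The function names are illustrative only and do not come from the SRP code base.

```python
import math
import random


def poisson_zero_probability(lam):
    """Probability that a Poisson(lam) weight is 0, i.e. the instance is skipped."""
    return math.exp(-lam)


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


skip_lam1 = 100 * poisson_zero_probability(1)  # percentage of skipped instances
skip_lam6 = 100 * poisson_zero_probability(6)

rng = random.Random(0)
weights = [sample_poisson(6, rng) for _ in range(20000)]
mean_weight = sum(weights) / len(weights)      # should be close to lambda = 6
```

With $\lambda = 6$ each base model sees, on average, six weighted copies of every instance, which is where the extra computational cost comes from.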
Minku et al. [20] used λ as a proxy for diversity, i.e.,
the lower λ, the more diversity would be induced into the
ensemble. As pointed out by Stapenhurst [21], for iid data the base
models will eventually converge, and even faster if given larger
values of λ. One important question to be addressed then is:
why does Poisson(6) work if only a small portion of the data is not
presented to each learner? In the long run, the base models
start to converge. This can be visualized in Section V, where
diversity is shown over time for the AGRAWAL generator:
once a concept becomes stable, the average Kappa statistic
starts to increase (i.e., the outputs of the base models start to
converge) if the only means of decorrelating the base models
is resampling with replacement simulated with Poisson(6). This
motivates the addition of other techniques to induce diversity
(Section III-B).
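As an illustration of the diversity measure mentioned above, the sketch below computes Cohen's kappa between the outputs of two base models and averages it over all pairs. It assumes the standard definition $\kappa = (p_o - p_e)/(1 - p_e)$; the exact windowing used for the plots in Section V is not reproduced here, and returning 1.0 when $p_e = 1$ is a convention chosen for this sketch.

```python
import itertools


def kappa(preds_a, preds_b):
    """Cohen's kappa between two equally long prediction sequences."""
    n = len(preds_a)
    labels = set(preds_a) | set(preds_b)
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    expected = sum(
        (preds_a.count(label) / n) * (preds_b.count(label) / n) for label in labels
    )
    if expected == 1.0:  # both models always output the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def average_pairwise_kappa(all_preds):
    """Mean kappa over every pair of base models; higher means less diversity."""
    pairs = list(itertools.combinations(all_preds, 2))
    return sum(kappa(a, b) for a, b in pairs) / len(pairs)


model_1 = [0, 0, 1, 1]
model_2 = [0, 0, 1, 1]  # identical outputs: kappa of 1
model_3 = [0, 1, 0, 1]  # agreement at chance level: kappa of 0
ensemble_kappa = average_pairwise_kappa([model_1, model_2, model_3])
```

A rising average kappa over time signals that the base models are converging to the same outputs.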
B. Random Subsets of Features
Random Subspaces are sensitive to the hyperparameters $m$
(size of subspace) and $n$ (number of learners). For a feature
space of $M$ features, there are $2^M - 1$ different non-empty
subsets of features. Thus, it is infeasible to train one learner
per subset even for moderate values of $M$, especially for streaming data
where processing time and memory are restricted [22]. Ho
noted in [7] that highly accurate ensembles could be obtained
far before all possible combinations of subspaces are explored.
Later, Kuncheva et al. [16] provided a thorough analysis
of the random subspace method for the functional magnetic
resonance imaging (fMRI) data problem, which resulted in
insights for selecting values of $m$ and $n$ that generate usable
learners, i.e., learners that contain at least one ‘relevant’ feature in their
subset of features.
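To make the numbers concrete, the sketch below counts the non-empty subsets of $M$ features, draws one random subspace of size $m$, and computes the chance that such a subspace contains at least one relevant feature. The probability uses a standard hypergeometric argument with a hypothetical parameter `Q` (the number of relevant features); it is an illustration and not necessarily the analysis carried out in [16].

```python
import math
import random


def count_nonempty_subsets(M):
    """Number of non-empty feature subsets of an M-dimensional space."""
    return 2 ** M - 1


def random_subspace(M, m, rng):
    """Indices of m distinct features chosen uniformly at random."""
    return sorted(rng.sample(range(M), m))


def prob_at_least_one_relevant(M, m, Q):
    """Chance that a size-m subspace holds at least one of Q relevant features."""
    return 1.0 - math.comb(M - Q, m) / math.comb(M, m)


total_subsets = count_nonempty_subsets(10)
subspace = random_subspace(10, 3, random.Random(0))
p_usable = prob_at_least_one_relevant(M=10, m=3, Q=2)
```

Increasing `m` raises the chance that a single learner is usable, at the cost of making the subspaces (and therefore the learners) more alike.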
In our problem setting, one reason to train base models
on random subspaces of features on top of training them on
different subsets of instances is to add even further diversity to
the models. Even if they converge because of iid data (Section
III-A), by training them on separate subspaces of features we
have higher chances of producing models that maintain some
level of diversity.
There is a risk of subspaces including only irrelevant
features. Two mechanisms help mitigate this situation:
(i) resetting subspaces once a model is reset in response to a
concept drift; (ii) assigning weights to the votes of base models
based on their predictive performance, so that base models with
only irrelevant features, which are expected to produce poor
predictive performance, have their votes dominated by those of
other base models.
C. Drift Detection and Recovery
The ultimate goal of drift detection in our context is to allow
automatic recovery from a state where the model performance
is degrading. To achieve this goal we need an accurate drift
detector and a proper action that will be triggered as a response
to the drift signal. Currently, the most successful supervised
learning methods follow a simple, yet effective, approach:
when a concept drift is detected, the underlying model is
reset [5], [6]. If the detection algorithm misses a change or takes too long
to detect it, then it will let the model degrade. On
the other hand, if it yields too many false positives, it will
continuously trigger model resets and consequently prevent
the algorithm from building an accurate model.
We use the same strategy to detect and recover from concept
drifts as introduced in the Adaptive Random Forest (ARF) [6]
algorithm. In this strategy, the correct/incorrect predictions
of each base model are monitored by a detection algorithm.
When the drift detection algorithm flags a warning, a new base
model starts training in the ‘background’, where ‘background’
means that it does not influence the ensemble decision with
its predictions. If the warning escalates to a concept drift, then
the background model replaces the associated base model.
The strategy accommodates different drift detection
algorithms; however, to facilitate discussion, we
focus the experiments and analysis on SRP with the ADaptive
WINdow (ADWIN) algorithm [23]. ADWIN is a change
detector and estimator that solves in a well-specified way the
problem of tracking the average of a stream of bits or real-
valued numbers. ADWIN keeps a variable-length window of
recently seen items, with the property that the window has
the maximal length statistically consistent with the hypothesis
“there has been no change in the average value inside the
window”. More precisely, an older fragment of the window
is dropped if and only if there is enough evidence that its
average value differs from that of the rest of the window.
This has two consequences: one, that change is reliably declared
whenever the window shrinks; and two, that at any time the
average over the existing window can be reliably taken as
an estimation of the current average in the stream (barring a
very small or very recent change that is still not statistically
visible). A formal and quantitative statement of these two
points (a theorem) appears in [23]. ADWIN is parameter-
and assumption-free in the sense that it automatically detects
and adapts to the current rate of change. Its only parameter is
the confidence bound δ, indicating how confident we want to
be in the algorithm’s output, inherent to all algorithms dealing
with random processes.
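The window test described above can be sketched in a few lines. This is a deliberately naive, quadratic-time version that checks every split of the window and drops the oldest item while some split looks inconsistent. It assumes inputs in [0, 1] and the Hoeffding-style threshold $\epsilon_{cut} = \sqrt{\frac{1}{2m}\ln\frac{4n}{\delta}}$, where $m$ is the harmonic mean term $1/(1/n_0 + 1/n_1)$. The actual ADWIN [23] compresses the window into buckets for efficiency, so the class below (`NaiveAdwin`, a name chosen here) is only an approximation of the idea.

```python
import math


class NaiveAdwin:
    """Quadratic-time sketch of ADWIN's window test for values in [0, 1]."""

    def __init__(self, delta=0.002):
        self.delta = delta  # confidence parameter
        self.window = []    # recently seen items, oldest first

    def _has_inconsistent_split(self):
        n = len(self.window)
        total = sum(self.window)
        head = 0.0
        for n0 in range(1, n):  # split into the oldest n0 items and the rest
            head += self.window[n0 - 1]
            n1 = n - n0
            harmonic = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * harmonic))
            if abs(head / n0 - (total - head) / n1) >= eps_cut:
                return True
        return False

    def update(self, value):
        """Add one value; return True if the window shrank (change declared)."""
        self.window.append(value)
        shrank = False
        while self._has_inconsistent_split():
            del self.window[0]  # drop the oldest item and test again
            shrank = True
        return shrank

    def mean(self):
        return sum(self.window) / len(self.window)


# Error indicator stream: a model that is always right, then always wrong.
detector = NaiveAdwin()
flags = [detector.update(v) for v in [0] * 200 + [1] * 200]
```

After the change, the window keeps only a handful of old items, so `detector.mean()` tracks the new error rate, which matches the second consequence stated above.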
There are no guarantees that a detection algorithm based
on the correct/incorrect predictions will be accurate, but it
will at least be able to detect changes in the underlying data
that genuinely affected the decision boundary (real drifts),
while neglecting those that did not (virtual drifts) [24]. One
disadvantage of this strategy is that it requires access to
labelled data, which is not an issue given our problem setting
(Section II); for problems that include verification latency
or weakly-labeled streams, however, other drift detection strategies
must be explored [25].
The pseudocode for SRP is depicted in Alg. 1. The training
instances are used to evaluate the classification performance
of each base model, before being used for training, and this
estimation is used as the learner weight during voting (line 9,
Alg. 1). For non-stationary data streams, we should consider
that the relevant features, i.e., those that can effectively be used
to predict the class label, may change over time. Therefore,
when a background learner is created, a new random subspace
is generated for it (line 12, Alg. 1). Background models are
trained during the period between the warning that triggered
their creation and the concept drift signal that causes them
to replace the previous base model; thus, a model added to the
ensemble has already been trained on recent instances and is
never an entirely new base model (line 15, Alg. 1).
Algorithm 1 Streaming Random Patches.
Symbols: m: maximum features per subset; λ: Poisson distribution
parameter; n: total number of models (n = |L|);
δw: warning threshold; δd: drift threshold; S: data stream;
B: set of background models; W(l): model l weight; P(·):
model predictive performance estimation function; d(·): drift
detection method.
1: function TRAINSRP(m, n, δw, δd)
2:   L ← CreateBaseModels(n, m)
3:   W ← InitWeights(n)
4:   B ← ∅
5:   while HasNext(S) do
6:     (x, y) ← next(S)
7:     for all l ∈ L do
8:       ŷ ← predict(l, x)
9:       W(l) ← P(W(l), ŷ, y)
10:      Train(m, l, x, y)
11:      if d(δw, l, x, y) then    ▷ Warning detected?
12:        B(l) ← CreateBkgModel(m)
13:      end if
14:      if d(δd, l, x, y) then    ▷ Drift detected?
15:        l ← B(l)    ▷ Replace l by bkg learner
16:      end if
17:    end for
18:    for all b ∈ B do
19:      Train(m, b, x, y)
20:    end for
21:  end while
22: end function
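The control flow of Alg. 1 can be sketched in Python as follows; comments refer to the corresponding lines of the pseudocode. Several pieces are stand-ins chosen for brevity and are assumptions, not the authors' implementation: `CentroidLearner` replaces the Hoeffding tree, `ErrorRateDetector` replaces ADWIN (so `warn` and `drift` are error-rate thresholds with `warn <= drift`, not confidence values), the vote weight is a smoothed running accuracy, and a background model is created only once per warning period.

```python
import math
import random
from collections import defaultdict, deque


def poisson(lam, rng):
    """Knuth's method for sampling a Poisson(lam) training weight."""
    limit, k, product = math.exp(-lam), 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


class CentroidLearner:
    """Hypothetical base learner: nearest class centroid on a feature subset."""

    def __init__(self, features):
        self.features = features
        self.sums = defaultdict(lambda: [0.0] * len(features))
        self.weights = defaultdict(float)

    def predict(self, x):
        if not self.weights:
            return None
        z = [x[f] for f in self.features]

        def sq_dist(label):
            w = self.weights[label]
            return sum((zi - s / w) ** 2 for zi, s in zip(z, self.sums[label]))

        return min(self.weights, key=sq_dist)

    def learn(self, x, y, weight):
        if weight > 0:
            for i, f in enumerate(self.features):
                self.sums[y][i] += weight * x[f]
            self.weights[y] += weight


class ErrorRateDetector:
    """Hypothetical detector (not ADWIN): error rate over the last `size` items."""

    def __init__(self, size=50):
        self.errors = deque(maxlen=size)

    def add(self, error):
        self.errors.append(error)

    def exceeds(self, threshold):
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > threshold


class SRPSketch:
    def __init__(self, n_features, m, n_models, lam=6.0, warn=0.3, drift=0.5, seed=1):
        self.rng = random.Random(seed)
        self.n_features, self.m, self.lam = n_features, m, lam
        self.warn, self.drift = warn, drift  # assumes warn <= drift
        self.models = [self._new_model() for _ in range(n_models)]      # line 2
        self.detectors = [ErrorRateDetector() for _ in range(n_models)]
        self.correct = [0] * n_models                                   # line 3
        self.seen = [0] * n_models
        self.background = {}                                            # line 4

    def _new_model(self):
        # Global randomization: one subspace drawn per base model.
        return CentroidLearner(self.rng.sample(range(self.n_features), self.m))

    def predict(self, x):
        votes = defaultdict(float)
        for i, model in enumerate(self.models):
            label = model.predict(x)
            if label is not None:  # vote weighted by smoothed accuracy
                votes[label] += (self.correct[i] + 1) / (self.seen[i] + 2)
        return max(votes, key=votes.get) if votes else None

    def learn(self, x, y):
        for i, model in enumerate(self.models):                         # line 7
            error = model.predict(x) != y                               # line 8
            self.seen[i] += 1                                           # line 9
            self.correct[i] += 0 if error else 1
            model.learn(x, y, poisson(self.lam, self.rng))              # line 10
            self.detectors[i].add(error)
            if i not in self.background and self.detectors[i].exceeds(self.warn):
                self.background[i] = self._new_model()                  # line 12
            if self.detectors[i].exceeds(self.drift):                   # line 14
                self.models[i] = self.background.pop(i)                 # line 15
                self.detectors[i] = ErrorRateDetector()
                self.correct[i] = self.seen[i] = 0
        for learner in self.background.values():                        # lines 18-20
            learner.learn(x, y, poisson(self.lam, self.rng))


# Hypothetical stream: 4 uniform features, only feature 0 determines the label.
data_rng = random.Random(7)
srp = SRPSketch(n_features=4, m=2, n_models=10)
hits = []
for _ in range(2000):
    x = [data_rng.random() for _ in range(4)]
    y = int(x[0] > 0.5)
    hits.append(srp.predict(x) == y)  # test first
    srp.learn(x, y)                   # then train
recent_accuracy = sum(hits[-500:]) / 500
```

In this toy stream, models whose subspace misses feature 0 stay near chance level, trigger the detector, and are replaced by background models with a fresh subspace, which illustrates mechanism (i) from Section III-B.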
D. SRP Theoretical Insights
Bagging is well-known in the machine learning literature for
its effect on reducing variance, both in regression and classi-
fication [9], [26], which allows it to perform competitively in
a wide range of scenarios, including data streams [5], [15].
In theory, the reduction of the error is strictly related to how
uncorrelated prediction errors are [9]. Entirely uncorrelated
predictions are rarely achievable in practice, yet it is achieved
to some extent by encouraging diversity among the learning
models [27]. This itself implies a need to use unstable learners.
The standard (batch, unpruned) decision tree is a prime
example of an unstable learner: small changes to a training
sample can result in remarkably different models, and thus
diversity among predictions. Indeed, one readily observes that
decision trees are used throughout the literature.
In the context of data streams, Hoeffding trees [28] are the
popular choice of decision tree, since they are incremental.
However, crucially, Hoeffding trees – unlike their batch coun-
terparts – are in fact stable learners. As far as we are aware
we are among the first to focus on this fact in the context of
ensembles.
Splitting is supported statistically under the Hoeffding
bound. This guarantees to a certain (user-specified) confidence
level that under a sufficiently large number of examples a
Hoeffding tree built incrementally will be equivalent to a
batch-built tree. Until such a number of examples is seen,
however, Hoeffding trees will not grow and this implies
stability.
Formally, we may measure the stability of an algorithm as,
for example, hypothesis stability. In the following we adapt
the discussion of [29] to the streaming setting.
Suppose that $A_S$ denotes that an algorithm $A$ (e.g., C4.5,
or Hoeffding tree inducer) induces decision function $f$ (e.g.,
a decision tree) over data stream segment $S$ of pairs $(x_k, y_k)$
(the segment is of length $|S| = n$). Let also $S^{\setminus i}$ represent
$S$ without the $i$-th sample. Then hypothesis stability can be
expressed as
$$\mathbb{E}_{(x,y)}\left[\,\left|\ell(A_S, (x, y)) - \ell(A_{S^{\setminus i}}, (x, y))\right|\,\right] < \beta$$
under evaluation function/metric $\ell$.
This captures the intuition that if we remove a sample from
the stream, the absolute difference in error of another model
trained on this new segment should be less than $\beta$ when
compared to the error of the same model built on the original
(thus indicating its stability in terms of $\beta$).
We cannot compute this exactly unless we know the true
generating distribution (‘concept’ in stream terminology) from
which $(x_k, y_k)$ pairs are drawn. However, by replacing the
expectation with a sum over leave-one-out samples from a
real stream we can empirically investigate and compare the
‘β-stability’ among learning algorithms with regard to such a
stream.
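One way to carry out the leave-one-out estimate described above is sketched below. It is an interpretation, not the authors' procedure: the expectation is replaced by a mean over a set of evaluation points, and the largest change over all removed samples is reported. The two toy learners (majority class and 1-nearest-neighbour on a single numeric feature) and the 0-1 loss are hypothetical choices used only to contrast a stable learner with an unstable one.

```python
def empirical_stability(train, loss, segment, eval_points):
    """Largest mean absolute loss change caused by removing one training sample."""
    full_model = train(segment)
    worst = 0.0
    for i in range(len(segment)):
        reduced_model = train(segment[:i] + segment[i + 1:])  # S without sample i
        change = sum(
            abs(loss(full_model, z) - loss(reduced_model, z)) for z in eval_points
        ) / len(eval_points)
        worst = max(worst, change)
    return worst


def train_majority(data):
    """Stable toy learner: always predicts the most frequent label."""
    labels = [y for _, y in data]
    top = max(set(labels), key=labels.count)
    return lambda x: top


def train_1nn(data):
    """Unstable toy learner: label of the closest training point."""
    return lambda x: min(data, key=lambda pair: abs(pair[0] - x))[1]


def zero_one(model, z):
    x, y = z
    return 0.0 if model(x) == y else 1.0


segment = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "a")]
eval_points = [(2.9, "b"), (3.1, "b"), (0.5, "a"), (1.5, "a")]
beta_majority = empirical_stability(train_majority, zero_one, segment, eval_points)
beta_1nn = empirical_stability(train_1nn, zero_one, segment, eval_points)
```

Removing the single "b" sample flips two of the four 1-NN predictions, while the majority learner is unaffected by any single removal, so the estimated β is smaller for the more stable learner.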
Repeatedly rebuilding models on relatively small samples
of instances is unavoidable in a stream which may experience
drift, implying that trees must be fully or partially regrown. By
small we mean “insufficiently large wrt the Hoeffding bound”.
These episodes add up over the life of a stream to a non-
negligible loss of accuracy.
Suppose that this number is $n$. Like any well-regularized
algorithm, a Hoeffding tree does not adhere strongly to the
principle of empirical risk minimization, but rather it is forced
to accept many errors as a trade-off for long-term similarity
to a batch-built tree. This is a problem in terms of Hoeffding
tree ensembles, since these errors are likely to be the same,
rendering the ensemble decisions likely to be useless (no
advantage compared to a single model). In terms of the bias-variance
trade-off, variance goes down at the cost of bias due
to Hoeffding stability [30]. However, bagging-based ensemble
schemes are primarily for reducing variance and may even
increase bias; since variance has already been reduced by
stability, bagging alone is not likely to have a positive effect.
This provides a suitable explanation as to why our proposed
SRP method performs well: by effectively reducing the feature
space of individual trees, Hoeffding trees are operating on a
‘sub-concept’, and are stable wrt that concept but unstable wrt
the complete concept, meaning that the variance reduction of
an ensemble still has a beneficial effect.
Furthermore, one reason Random Subspaces are so beneficial in the
data stream setting is that we can look at decision trees
as adaptive nearest neighbours [31], and Random Subspaces
as transformations that preserve the Euclidean geometry [32].
Decision trees split the overall space into several regions, one
for each of their leaves. The prediction for the instances
in each leaf is based on the majority vote of the
instances in that leaf. We can consider the instances in that
leaf as the neighbours of the instances to predict. Random
Subspaces are linear transformations that map instances
to another space, preserving their Euclidean geometry, a very
useful property when applied to nearest neighbours. This
is due to the fact that there exist Johnson-Lindenstrauss
guarantees that Random Subspaces approximately preserve
the Euclidean geometry of the data with high probability, as
shown in Lemma 1 [32].
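Lemma 1. Let $X = \{x_{-(N-1)}, \ldots, x_{-1}, x_0\}$ be a sequence
of observations collected over time, containing input examples
in which $x_i \in \mathbb{R}^n$ for every $i \in \{-(N-1), \ldots, -1, 0\}$, $n \geq 1$,
and satisfying $\|x_i^2\|_\infty \leq \frac{c}{n}\|x_i\|_2^2$, where $c \in \mathbb{R}^+$ is a constant
with $1 \leq c \leq n$. Let $\epsilon, \delta \in (0, 1]$, and let $k \geq \frac{c^2}{2\epsilon^2} \ln \frac{N^2}{\delta}$ be an
integer. Let $RS$ be a random subspace projection from $\mathbb{R}^n \mapsto \mathbb{R}^k$.
Then with probability at least $1 - \delta$ over the random draws
of $RS$ we have, for every $i, j \in \{-(N-1), \ldots, -1, 0\}$:
$$(1 - \epsilon)\|x_i - x_j\|_2^2 \leq \frac{n}{k}\|RS(x_i - x_j)\|_2^2 \leq (1 + \epsilon)\|x_i - x_j\|_2^2$$
The required subspace dimension $k$ is logarithmic in the
number of examples, but with a larger constant term.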
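A small numerical check helps to see where the factor $n/k$ in Lemma 1 comes from. Averaged over all size-$k$ subspaces, the rescaled squared norm of a projected difference vector equals the original squared norm, while individual subspaces can distort it considerably when a few coordinates carry most of the mass (the situation that the regularity condition on $c$ controls). The vector `v` below is a hypothetical difference $x_i - x_j$ chosen for illustration.

```python
import itertools


def scaled_sq_norm(v, subspace):
    """(n / k) * squared norm of v restricted to the chosen coordinates."""
    n, k = len(v), len(subspace)
    return (n / k) * sum(v[j] ** 2 for j in subspace)


def subspace_statistics(v, k):
    """Mean, min and max of the rescaled squared norm over all size-k subspaces."""
    values = [
        scaled_sq_norm(v, subspace)
        for subspace in itertools.combinations(range(len(v)), k)
    ]
    return sum(values) / len(values), min(values), max(values)


v = [3.0, -1.0, 2.0, 0.5, 4.0]            # hypothetical difference x_i - x_j
true_sq_norm = sum(c ** 2 for c in v)      # squared Euclidean norm of v
mean_value, low, high = subspace_statistics(v, k=2)
```

Here the mean matches `true_sq_norm`, but the spread between `low` and `high` is wide because `k` is tiny and the coordinates of `v` are uneven; the lemma bounds this spread with high probability once `k` is large enough.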
Finally, another explanation of the success of Random
Patches is its similarity to dropout [33]. Dropout is a technique used in
Deep Learning to improve the accuracy of Neural Networks
by randomly removing neurons. Random Patches uses a similar
technique to sample instances and attributes, removing many
of them in an efficient random way.
Thus, overall, our proposal creates an artificially smaller
feature space, thus encouraging faster growth, and further-
more, even when tree growth is conservative, can encourage
disagreement (avoid correlation) among the leaf classifiers
even if they would be stable models if run outside the context
of such an ensemble. Empirical results are given in Section
V, which offer further support to these arguments.
IV. RELATED WORK
There is an extensive literature on ensemble methods for
data stream classification. This preference is counterintuitive
given the need for algorithms that use computational re-
sources judiciously. The justification for this preference is
attributable to the flexibility and high predictive performance
that ensemble models provide [14]. The seminal work of
Kolter and Maloof [3] introduced the Dynamic Weighted
Majority (DWM) ensemble method which featured heuristics
to cope with evolving data streams, such as removing base
models if their weight dropped below a given threshold, and
adding new ones according to the global performance of the
ensemble. DWM introduces a hyperparameter to control the
period (window) between base model addition, removal and
weight updates. Similarly to DWM, the Online Accuracy
Updated Ensemble (OAUE) [4] algorithm relies on a window
hyperparameter to determine which instances will be used to
train a new base model (candidate) and whether it should replace the
base model that achieved the lowest classification performance
in the latest window of instances. OAUE does not use an active
drift detection approach; thus it relies on gradual resets of
the ensemble through candidates to adapt to concept drifts.
Also, it introduces a weighting mechanism that contributes to
the ensemble adaptation to concept drifts, since the weighting
function is designed to assign higher impact to predictions
on recently presented instances. Note that DWM and OAUE
use incremental base learners; however, they still require
the definition of a window to orchestrate their adaptation
techniques to evolving data.
Many ensemble methods for data stream learning exploit
strategies developed initially for batch learning. Online Bagging
[15] trains base models on samples drawn from the
data stream, simulating sampling with replacement as in the
classical Bagging algorithm [9]. Chen et al. introduce a
generalization of SmoothBoost [34], namely Online Smooth-Boost
(OB) [35], an algorithm that generates only smooth
distributions that do not assign too much weight to single
examples. OB is guaranteed to achieve an arbitrarily small
error rate given that the number of weak learners and the number of examples
are sufficiently large.
Ensembles designed to cope with evolving data streams
combine decorrelating base models (e.g., bagging) and voting
(e.g., weighted majority vote [36]) with active drift recovery
strategies based on change detection algorithms. The Leverag-
ing Bagging (LB) [5] algorithm combines an adapted version
of Online Bagging [15] with the ADaptive WINdow (ADWIN)
drift detection algorithm, such that base models are selectively
reset whenever their corresponding ADWIN instance flags a
drift. Heuristic Updatable Weighted Random Subspaces
(HUWRS) [17] trains batch learners (C4.5 decision trees) on
random subspaces of features, following the Random Subspace
Method (RSM) introduced by Ho [7]. HUWRS detects virtual and
real concept drift by computing the Hellinger distance between
the binned feature values of every base model and the feature
distribution of the latest window of instances when labels are
not available, and by computing Hellinger distances between
the per-class feature distributions over the latest window of
instances otherwise. The weighting of the base models in HUWRS
relies on the severity of the change in the distribution of
the features associated with each model's random subspace. The
Adaptive Random Forest (ARF) [6] and the Dynamic Streaming
Random Forest (DSRF) [37] both aim to adapt the classic Random
Forest [10] algorithm to streaming data. Both ARF and DSRF use
the incremental decision tree algorithm Hoeffding tree [28];
however, they differ in how the base trees are trained. ARF
simulates resampling as in Leveraging Bagging, while DSRF
trains trees sequentially on different subsets of data.
Moreover, ARF uses a drift detection
and recovery strategy based on detecting warnings and drifts
per base tree, such that after a warning is triggered another
tree is created and trained without affecting the ensemble
predictions (background tree). If the warning escalates to a
drift detection, then the base tree is replaced by the background
tree.
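The warning and drift handling described above for ARF can be
pictured with a small state machine. The following is only a
minimal sketch under stated assumptions: the class names
(CountingLearner, EnsembleMember) are hypothetical, the learner
is a stand-in that merely counts instances, and the warning and
drift signals are passed in as booleans instead of being
produced by a change detector such as ADWIN.

```python
class CountingLearner:
    """Hypothetical stand-in for an incremental tree: it only counts instances."""

    def __init__(self):
        self.seen = 0

    def learn_one(self, x, y):
        self.seen += 1


class EnsembleMember:
    """One ensemble slot holding a foreground learner and an optional background learner."""

    def __init__(self, make_learner):
        self.make_learner = make_learner
        self.foreground = make_learner()
        self.background = None

    def learn_one(self, x, y, warning=False, drift=False):
        if drift:
            # Drift confirmed: promote the background learner if one exists,
            # otherwise start again from a fresh learner.
            if self.background is not None:
                self.foreground = self.background
            else:
                self.foreground = self.make_learner()
            self.background = None
        elif warning and self.background is None:
            # Warning raised: start a background learner that trains in
            # parallel but does not take part in the ensemble predictions.
            self.background = self.make_learner()

        self.foreground.learn_one(x, y)
        if self.background is not None:
            self.background.learn_one(x, y)
```

In this sketch only the foreground learner would be consulted
for predictions, so the background learner can warm up on the
new concept before it is promoted.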
We briefly introduced the concepts of active and reactive
strategies for concept drift recovery and the vast literature in
ensemble learning for evolving data stream classification. We
refer the reader to [24] and [19] for further information on
concept drift, and to [14] for a detailed overview and taxonomy
of existing ensemble methods for data stream classification.
V. EXPERIMENTS
We evaluate the SRP implementation against state-of-the-art
classification algorithms, concerning both predictive
performance and computational resource usage. To analyze the
diversity among base models in our newly proposed methods, we
present plots depicting the average pairwise kappa over time.
Also, to analyze how fast (and deep) the base trees are grown
by each ensemble strategy, we include plots of the average tree
depth over time. We assess predictive performance through
accuracy results using a test-then-train evaluation strategy,
where every instance is used first for testing and then for
training.
The algorithms used in the comparisons are Hoeffding Trees
(HT), Naive Bayes (NB), Leveraging Bagging (LB), Adaptive
Random Forest (ARF), Online Accuracy Updated Ensemble
(OAUE), Dynamic Weighted Majority (DWM), and Online
Smooth Boosting (OB). HT and NB serve the purpose of
baselines since they are single classifiers often used in data
stream classification. LB and ARF are ensemble methods that
consistently outperform other ensemble classifiers, as shown
in [6] on a benchmark similar to the one used in this work.
OB represents a boosting adaptation to online learning, while
DWM and OAUE are ensemble methods explicitly developed
for data stream classification that rely on different heuristics
to address concept drift.
To analyze how SRP compares to “simple” variants of
itself, we present two variations in the experiments, namely
the Streaming Random Subspaces (SRS) and a Bagging-like
strategy (BAG). SRS trains on random subspaces of features as
in SRP, but on all instances, without simulating bootstraps,
while BAG only simulates bagging, using all features. In the
online resources (footnote 3) we provide two tables analyzing
the impact of m in SRP (ranging from 10% up to 100%, which is
the same as the variant BAG) and of λ = 1, which impacts the
bagging simulation. The experiments in the paper summarize the
results in the online resources; still, the complete results
are available to the interested reader.
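The relation between SRP, SRS and BAG can be made concrete with
a short sketch. It is a simplified illustration under stated
assumptions: the function names are hypothetical, the subspace
size is obtained by rounding m times the number of features
(the rounding rule of the actual implementation is not stated
here), and the Poisson draw uses Knuth's multiplication method.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    p = 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def make_subspace(n_features, m, rng):
    """Pick a fixed random subset of feature indices; m is a fraction in (0, 1]."""
    size = max(1, round(m * n_features))  # assumed rounding rule
    return sorted(rng.sample(range(n_features), size))


def make_patch(x, subspace, lam, rng, use_subspace=True, use_bagging=True):
    """Return (projected features, training weight) for one base model.

    use_subspace=True,  use_bagging=True  -> SRP-like
    use_subspace=True,  use_bagging=False -> SRS-like
    use_subspace=False, use_bagging=True  -> BAG-like
    """
    features = [x[i] for i in subspace] if use_subspace else list(x)
    weight = poisson(lam, rng) if use_bagging else 1
    return features, weight
```

A weight of zero means that the base model skips the instance,
and a weight larger than one means that the instance is
presented with that weight (or repeatedly) to the base model.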
[Figure 2: Nemenyi critical difference (CD) plots. Left panel: rank axis 1 to 10, CD = 3.757, learners SRP, BAG, ARF, LB, SRS, OAUE, DWM, OB, HT, NB. Right panel: rank axis 1 to 8, CD = 3.031, learners SRP, SRS, BAG, ARF, LB, OB, OAUE, DWM.]
Fig. 2: Nemenyi test (95% confidence level) - n = 10 base models on the left; and Nemenyi test (95% confidence level) -
n = 100 base models on the right. The avg rank obtained in the SPAM dataset for n = 100 was not considered for any learner
since there are no results for LB and BAG.
Regarding hyperparameters, we use HT as the base learner
for all the ensemble methods. The default subspace size is
m = 60% for SRS, SRP, and ARF, except for experiments
with the high dimensionality dataset SPAM and n = 100,
where m = 10% (Table II). In the online resources (Section
A) we present complementary experiments varying m from
10% up to 100% (equivalent to BAG) in all datasets. The
HT grace period was set to GP = 50, the split confidence to
c = 0.01 (footnote 4), and the decision strategy used at
leaves was Naive Bayes Adaptive, i.e., either Naive Bayes or
Majority vote is used at a leaf depending on which one is more
accurate [38]. This HT configuration tends to generate splits
earlier at the expense of processing time [6]. ADWIN is used
as a drift detector for all ensembles that rely on active
drift detection (i.e., ARF, LB, SRP, SRS, and BAG). The δ
parameter, which controls the confidence in the change
detected, was defined as δ = 0.0001 for warning detection and
δ = 0.00001 for drift detection in ARF, SRP, SRS and BAG. In
LB, δ was set according to its default value [5], i.e.,
δ = 0.002.
Footnote 3: https://github.com/hmgomes/StreamingRandomPatches
Footnote 4: GP and c were originally identified as nmin and δ by
Domingos and Hulten [28]; however, we choose to keep their
acronyms as in the Massive Online Analysis (MOA) framework to
facilitate reproducibility.
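To give a rough feeling for why a larger split confidence c
leads to earlier splits, recall that c plays the role of δ in
the Hoeffding bound of [28], ε = sqrt(R² ln(1/δ) / (2n)). The
sketch below evaluates this bound for the value used here
(c = 0.01) and for a much stricter value chosen purely for
illustration (1e-7); the function name and the two-class range
R = log2(2) are assumptions of the example.

```python
import math


def hoeffding_bound(value_range, confidence, n):
    """Hoeffding bound epsilon for n observations of a variable with the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / confidence) / (2.0 * n))


# Range of information gain for a two-class problem (assumption of this example).
R = math.log2(2)

# Bound after GP = 50 instances with the split confidence used in the experiments.
eps_paper = hoeffding_bound(R, 0.01, 50)

# Same number of instances with an illustrative, much stricter confidence.
eps_strict = hoeffding_bound(R, 1e-7, 50)
```

With these inputs the bound is roughly 0.21 for c = 0.01 and
roughly 0.40 for the stricter value, so a smaller difference
between the two best split candidates suffices to trigger a
split in the first case.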
The datasets used in the experiments include 6 synthetic
data streams and 7 real datasets. The synthetic datasets sim-
ulate abrupt, gradual, and incremental drifts, while the real
datasets have been thoroughly used in the literature to assess
data stream classifiers. Further information concerning the
datasets, instructions on how to execute the experiments and
other details for reproducibility are available in Appendix A.
A. Streaming Random Patches vs. Others
The results presented in Table I show how SRP compares
against other algorithms. Similarly, Table II presents how SRP
and other ensembles perform when configured to use n = 100
learners. Besides presenting the average ranking (Avg Rank)
for each algorithm, we also highlight the average ranking
for the synthetic datasets (Avg Rank Synt.) and the average
ranking for real-world datasets (Avg Rank Real). The reason
to report these rankings separately is that some techniques
may perform better on synthetic data, while not so well
overall, and it is important to highlight and discuss that.
Good performance on the synthetic datasets may indicate
an effective drift recovery strategy; however, synthetic data
stream concepts tend to be simple or biased towards a specific
learning algorithm. Therefore, an algorithm that produces good
results only on synthetic data may offer less credibility. We
apply the methodology presented in [39] to compare results
among several datasets and algorithms for the experiments
presented in Tables I and II. We first attempt to reject the
hypothesis that all learners produce equivalent results using a
Friedman test at a significance level α = 0.05. The Friedman
test indicated significant differences in both sets of results,
and it was followed by a post-hoc Nemenyi test. Figure 2
presents the results for the post-hoc Nemenyi test. We note
that no significant difference has been found among SRP, BAG,
ARF, LB, SRS and OAUE using n = 10, while using n = 100
there was no significant difference among SRP, SRS, BAG,
ARF and LB.
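The critical difference (CD) values reported in Figure 2 can be
checked with the Nemenyi formula from [39],
CD = q_α sqrt(k(k+1) / (6N)), where k is the number of learners
and N the number of datasets. The sketch below is only a sanity
check: the function name is ours, the q values are the α = 0.05
critical values as tabulated in [39], and we assume N = 13 for
n = 10 and N = 12 for n = 100 (SPAM excluded, as stated in the
caption of Figure 2).

```python
import math

# Two-tailed Nemenyi critical values for alpha = 0.05, indexed by the
# number of compared learners k (values as tabulated in [39]).
Q_ALPHA_005 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def critical_difference(k, n_datasets):
    """Nemenyi critical difference for k learners compared over n_datasets datasets."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


cd_n10 = critical_difference(10, 13)   # 10 learners, 13 datasets
cd_n100 = critical_difference(8, 12)   # 8 ensembles, 12 datasets (SPAM excluded)
```

Under these assumptions the sketch gives approximately 3.757
and 3.031, which agrees with the CD values shown in Figure 2.
Two learners whose average ranks differ by less than CD are not
significantly different according to the test.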
We can observe the influence of the m hyperparameter when
we compare SRP and BAG results. For example, in AIRLINES,
even though the number of features is only 7, using m = 60%
produced better results than BAG, as shown in Tables I and
II, while intuitively it seems that using all features for low
dimensionality datasets is better. For the SPAM dataset, SRP,
ARF and SRS were configured with m = 10% for the n = 100
experiments, as m = 60% failed to finish. LB and BAG
could not finish; both failed after around 60% of the
execution, as the 100GB maximum memory allocation pool was
insufficient.
SRP with n= 10 performs well in the real datasets, but
not as well in the synthetic datasets as BAG and LB, which
are very similar models (i.e., use all features and simulate
resampling). However, in the experiments using n= 100
the algorithms that exploit random subspaces (ARF, SRP,
SRS) benefited the most from the addition of more learners,
followed by BAG and LB. This characteristic of ARF, SRP,
and SRS can be attributed to their ability to cover a larger
number of feature subspaces. OB and DWM improved in
comparison to their results using n = 10, while OAUE
decreased its performance. OAUE obtains results far below NB
and HT for the KDD99, ADS and NOMAO datasets, while performing
well in the synthetic datasets with simulated concept drifts.
TABLE I: Test-then-train accuracy (%) using n = 10 base models.
Data set        NB      HT      LB      OAUE    DWM     OB      ARF     SRP     SRS     BAG
LED(A)          53.964  69.032  73.918  74.007  73.742  69.898  73.945  73.588  73.533  73.944
LED(G)          54.02   68.649  73.076  73.167  72.723  69.562  73.01   72.416  72.296  73.151
AGR(A)          65.739  81.045  86.954  90.932  82.97   84.91   85.646  91.788  91.558  85.733
AGR(G)          65.759  77.374  80.709  86.339  79.418  79.73   79.885  87.762  88.538  81.347
RBF(M)          30.994  45.491  84.714  78.581  57.81   69.894  84.49   83.28   81.685  85.431
RBF(F)          29.136  32.292  74.102  50.021  54.861  42.915  70.715  70.825  59.061  74.891
AIRLINES        64.55   65.078  62.319  66.637  63.88   65.184  65.786  66.776  67.085  61.296
ELEC            73.362  79.195  90.157  88.275  87.756  85.253  88.718  88.82   89.4    89.502
COVTYPE         60.521  80.312  94.861  90.17   88.286  90.327  94.691  95.254  92.764  95.467
KDD99           95.603  99.903  99.974  2.473   99.951  99.944  99.975  99.984  99.979  99.972
ADS             68.161  85.91   99.665  15.401  97.499  86.917  99.726  99.756  98.445  99.665
NOMAO           86.865  92.128  97.035  58.407  95.462  94.252  97.055  97.232  96.451  97.064
SPAM            74.571  79.043  94.745  80.899  89.286  89.489  96.214  95.967  92.954  94.745
Avg Rank        9.54    8.29    3.5     7       7       6.86    3.57    2.43    3.71    3.36
Avg Rank Synt.  10      9       3.33    3.5     6.67    7.5     4.17    3.67    4.5     2.67
Avg Rank Real   9.14    8.14    4       7.71    6.86    6.57    3.14    1.86    3.71    3.86
TABLE II: Test-then-train accuracy (%) using n = 100 base models. Underlined results mean the performance increased in
comparison to the n = 10 version. BAG and LB did not finish execution for the SPAM dataset.
Data set        LB      OAUE    DWM     OB      ARF     SRP     SRS     BAG
LED(A)          73.953  73.393  73.958  72.475  73.96   74.027  74.04   73.975
LED(G)          73.225  72.582  73.031  72.117  73.094  73.233  73.179  73.215
AGR(A)          88.717  90.164  88.299  90.374  87.929  92.869  92.807  86.663
AGR(G)          83.713  85.244  79.437  87.834  82.288  89.651  90.259  82.52
RBF(M)          84.338  84.262  60.977  74.514  86.958  86.039  84.821  86.671
RBF(F)          76.771  57.147  54.531  48.698  76.291  76.375  61.622  77.686
AIRLINES        62.82   65.229  64.025  64.556  66.417  68.564  68.303  62.093
ELEC            89.508  87.407  87.754  89.515  89.672  89.859  90.267  89.822
COVTYPE         95.104  92.857  88.519  92.695  94.967  95.348  93.461  95.288
KDD99           99.965  2.445   99.951  99.936  99.972  99.981  99.973  99.974
ADS             99.634  15.401  97.499  90.393  99.695  99.726  98.353  99.634
NOMAO           97.072  58.233  95.462  96.393  97.197  97.383  96.57   97.226
SPAM            NA      80.781  89.361  86.519  97.319  97.437  95.924  NA
Avg Rank        4.46    6.33    6.67    6.17    4       1.58    3.17    3.63
Avg Rank Synt.  4.17    5.67    6.67    6.17    4.67    2       2.83    3.83
Avg Rank Real   4.75    7       6.67    6.17    3.33    1.17    3.5     3.42
B. Average Tree Depth and Diversity
To investigate how efficient SRP is in terms of inducing
diversity into the ensemble, we plot the average kappa over
time for AGR(G) and LED(A) in Figures 3 and 4. In Figure 6, we
can observe how average kappa for BAG and ARF converge
after the same concept has been in place. We notice that SRS
and SRP obtain low values of average kappa in comparison
to ARF and BAG. However, when we take into account the
accuracy results in Table I, we can see that SRP or SRS do not
necessarily outperform BAG in these datasets, i.e., even if
the difference is small, ARF and BAG outperform SRS and SRP
in LED(A). These results corroborate the conclusions by
Stapenhurst [21] that ensemble diversity influences the
recovery from a concept drift; still, it is not as crucial as
the actual drift detection and recovery strategy. In AGR(G),
SRS and SRP outperform ARF and BAG; still, the average
kappa values in this experiment are quite similar, thus no
clear conclusions can be made about why SRS and SRP
perform better based solely on the average pairwise kappa. The
overall conclusion is that increasing diversity is not enough
to improve accuracy. Therefore, we complement our analysis of
the ensembles' diversity by presenting the average tree depth
in Figures 5 and 6. SRP consistently grows trees faster and
deeper than ARF, SRS and BAG. Splitting sooner can lead to
overfitting the models or to splits that could use a better
feature (or split point) if more instances were observed.
However, as shown by the SRP predictive performance in the
empirical experiments, it can be beneficial to an ensemble
strategy.
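For readers unfamiliar with the diversity measure used above,
the average pairwise kappa can be sketched as follows. This is
an illustration under stated assumptions, not the code used to
produce Figures 3 and 4: the function names are hypothetical and
kappa is computed from the predictions of two base models over
the same window of instances as (p0 - pe) / (1 - pe), where p0
is the observed agreement and pe the agreement expected by
chance.

```python
from collections import Counter
from itertools import combinations


def kappa(preds_a, preds_b):
    """Kappa statistic between the predictions of two classifiers on the same instances."""
    n = len(preds_a)
    # Observed agreement: fraction of instances with identical predictions.
    p0 = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    # Chance agreement, computed from the marginal prediction frequencies.
    counts_a, counts_b = Counter(preds_a), Counter(preds_b)
    pe = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if pe == 1.0:
        return 1.0  # both classifiers always predict the same single class
    return (p0 - pe) / (1.0 - pe)


def average_pairwise_kappa(all_preds):
    """Mean kappa over all pairs of base models (lower values suggest more diversity)."""
    pairs = list(combinations(all_preds, 2))
    return sum(kappa(a, b) for a, b in pairs) / len(pairs)
```

A value close to 1 indicates that the base models agree almost
always, while a value close to 0 indicates agreement no better
than chance.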
[Figure 3 plot: x-axis # Instances (0 to 750000); y-axis average kappa (0 to 0.8); series SRP, SRS, ARF, BAG.]
Fig. 3: AGR(G) - Avg kappa over time (n = 10).
[Figure 4 plot: x-axis # Instances (0 to 1000000); y-axis average kappa (0 to 1); series SRP, SRS, ARF, BAG.]
Fig. 4: LED(A) - Avg kappa over time (n = 10).
[Figure 5 plot: x-axis # Instances (0 to 750000); y-axis average tree depth (0 to 20); series SRP, SRS, ARF, BAG.]
Fig. 5: AGR(G) - Avg tree depth over time (n = 10).
Fig. 6: LED(A) - Avg tree depth over time (n = 10).
C. Time and Memory Usage Analysis
The computational resources are estimated based on the
CPU time and RAM Hours (across all the experiments). The
results for n = 100 are presented in Figures 7 and 8
(footnote 5). We note that SRP performs similarly to LB,
requires fewer resources than BAG, but demands more resources
than SRS and ARF. The SRS efficiency is attributable to the
fact that it does not simulate resampling. In SRP, BAG, ARF
and LB each learner is trained on each instance, on average,
λ times, where λ = 6 in our experiments. If we use
Poisson(λ = 1), we also increase the chances of obtaining
zeros (i.e., not using the instance for training), which
positively affects the memory and processing time (training on
fewer instances), but negatively impacts the classification
performance, as the base models are trained on fewer instances.
Footnote 5: The results in Figures 7 and 8 exclude SPAM CPU Time
and RAM Hours for all algorithms, since BAG and LB did not
finish executing.
Fig. 7: CPU time (n = 100).
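The effect of λ on the chance of skipping an instance, discussed
in the analysis of time and memory usage, follows directly from
the Poisson probability mass function P(k) = e^(-λ) λ^k / k!.
The short sketch below evaluates P(0) for λ = 1 and λ = 6; the
function name is ours and the example is only meant as a worked
calculation.

```python
import math


def poisson_pmf(k, lam):
    """Probability that a Poisson(lam) variable takes the value k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Probability that a base model does not train on a given instance.
p_skip_lambda_1 = poisson_pmf(0, 1.0)
p_skip_lambda_6 = poisson_pmf(0, 6.0)
```

With λ = 1 an instance is skipped by a given base model about
36.8% of the time, whereas with λ = 6 this happens only about
0.25% of the time, at the cost of training each base model on
each instance about six times on average.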
Fig. 8: RAM Hours usage (n = 100).
VI. CONCLUSIONS
In this work, we have taken an in-depth look at the
performance of Random Subspaces and Bagging ensemble
methods and their application to streams. In particular,
following theoretical considerations and empirical
investigations, we developed and presented the Streaming
Random Patches (SRP) method. SRP is a combination of Random
Subspaces and Online Bagging, as each base model is trained on
a random patch of data (i.e., a random subset of features and
instances). We show how SRP can be highly accurate on
many benchmark streaming scenarios, and compare it against
several ensemble methods for data stream classification,
including bagging, boosting and random forest variations. We
discussed the differences and similarities between SRP and
the Adaptive Random Forest (ARF) algorithm. We showed
how SRP compared against a Streaming Random Subspaces
(SRS) method and a Bagging method using the same drift
detection and recovery strategies. We highlight that SRP has
the same number of hyperparameters as ARF; still, it can also
be used to train base models that are not decision trees.
We discussed and demonstrated how methods using random
subspaces yield several significant advantages, such as
diversity enhancement (even for stable methods), which makes
them particularly suited to Hoeffding-tree based methods, which
can be seen as stable methods. On top of that, these methods
tend to improve accuracy with the addition of more base models.
As only a subset of features is considered, training is
efficient compared to popular existing methods for data stream
classification, such as Leveraging Bagging. Furthermore, even
though it is beyond the scope of this paper, our method is
particularly well suited to distributed computation, as the
base models are independent. These characteristics set
out an exciting path for future investigation.
REFERENCES
[1] L. Da Xu, W. He, and S. Li, “Internet of things in industries: A survey,”
IEEE Transactions on industrial informatics, vol. 10, no. 4, pp. 2233–2243,
2014.
[2] G. Widmer and M. Kubat, “Learning in the presence of concept drift
and hidden contexts,” Mach. Learn., vol. 23, no. 1, pp. 69–101, Apr.
1996.
[3] J. Z. Kolter, M. Maloof et al., “Dynamic weighted majority: A new
ensemble method for tracking concept drift,” in Data Mining, 2003.
ICDM 2003. Third IEEE International Conference on. IEEE, 2003,
pp. 123–130.
[4] D. Brzezinski and J. Stefanowski, “Combining block-based and online
methods in learning ensembles from concept drifting data streams,”
Information Sciences, vol. 265, pp. 50–67, 2014.
[5] A. Bifet, G. Holmes, and B. Pfahringer, “Leveraging bagging for
evolving data streams,” in PKDD, 2010, pp. 135–150.
[6] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck,
B. Pfahringer, G. Holmes, and T. Abdessalem, “Adaptive random forests
for evolving data stream classification,” Machine Learning, pp. 1–27,
2017. [Online]. Available: http://dx.doi.org/10.1007/s10994-017-5642-8
[7] T. K. Ho, “The random subspace method for constructing decision
forests,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 20, no. 8, pp. 832–844, 1998.
[8] L. Breiman, “Pasting small votes for classification in large databases
and on-line,” Machine learning, vol. 36, no. 1-2, pp. 85–103, 1999.
[9] ——, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–
140, 1996.
[10] ——, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32,
2001.
[11] P. Panov and S. Džeroski, “Combining bagging and random subspaces
to create better ensembles,” in International Symposium on Intelligent
Data Analysis. Springer, 2007, pp. 118–129.
[12] G. Louppe and P. Geurts, “Ensembles on random patches,” in Joint
European Conference on Machine Learning and Knowledge Discovery
in Databases. Springer, 2012, pp. 346–361.
[13] Y. Freund, R. E. Schapire et al., “Experiments with a new boosting
algorithm,” in ICML, vol. 96, 1996, pp. 148–156.
[14] H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet, “A survey
on ensemble learning for data stream classification,” ACM Comput.
Surv., vol. 50, no. 2, pp. 23:1–23:36, 2017. [Online]. Available:
http://doi.acm.org/10.1145/3054925
[15] N. Oza and S. Russell, “Online bagging and boosting,” in Artificial
Intelligence and Statistics 2001. Morgan Kaufmann, 2001, pp. 105–
112.
[16] L. I. Kuncheva, J. J. Rodríguez, C. O. Plumpton, D. E. Linden, and S. J.
Johnston, “Random subspace ensembles for fmri classification,” IEEE
transactions on medical imaging, vol. 29, no. 2, pp. 531–542, 2010.
[17] T. R. Hoens, N. V. Chawla, and R. Polikar, “Heuristic updatable
weighted random subspaces for non-stationary environments,” in Data
Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE,
2011, pp. 241–250.
[18] C. O. Plumpton, L. I. Kuncheva, N. N. Oosterhof, and S. J. Johnston,
“Naive random subspace ensemble with linear classifiers for real-time
classification of fmri data,” Pattern Recognition, vol. 45, no. 6, pp. 2101–
2108, 2012.
[19] G. I. Webb, R. Hyde, H. Cao, H. L. Nguyen, and F. Petitjean, “Charac-
terizing concept drift,” Data Mining and Knowledge Discovery, vol. 30,
no. 4, pp. 964–994, 2016.
[20] L. L. Minku, A. P. White, and X. Yao, “The impact of diversity on online
ensemble learning in the presence of concept drift,” IEEE Transactions
on Knowledge and Data Engineering, vol. 22, no. 5, pp. 730–742, 2010.
[21] R. J. Stapenhurst, “Diversity, margins and non-stationary learning,”
Ph.D. dissertation, University of Manchester, UK, 2012.
[22] A. Bifet, E. Frank, G. Holmes, and B. Pfahringer, “Ensembles of
restricted hoeffding trees,” ACM TIST, vol. 3, no. 2, pp. 30:1–30:20,
2012. [Online]. Available: http://doi.acm.org/10.1145/2089094.2089106
[23] A. Bifet and R. Gavaldà, “Learning from time-changing data with
adaptive windowing,” in SIAM, 2007.
[24] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A
survey on concept drift adaptation,” ACM Computing Surveys, vol. 46,
no. 4, pp. 44:1–44:37, Mar. 2014.
[25] I. Žliobaite, “Change with delayed labeling: When is it detectable?” in
Data Mining Workshops (ICDMW), 2010 IEEE International Conference
on. IEEE, 2010, pp. 843–850.
[26] P. M. Domingos, “A unified bias-variance decomposition for zero-one
and squared loss,” in AAAI 2000, 2000, pp. 564–569.
[27] L. I. Kuncheva, “That elusive diversity in classifier ensembles,” in
Iberian conference on pattern recognition and image analysis. Springer,
2003, pp. 1126–1138.
[28] P. Domingos and G. Hulten, “Mining high-speed data streams,” in
Proceedings of the sixth ACM SIGKDD international conference on
Knowledge discovery and data mining. ACM SIGKDD, Sep. 2000,
pp. 71–80.
[29] O. Bousquet and A. Elisseeff, “Stability and generalization,” Journal of
Machine Learning Research, vol. 2, pp. 499–526, 2002.
[30] E. Ikonomovska, J. Gama, and S. Džeroski, “Learning model trees from
evolving data streams,” Data mining and knowledge discovery, vol. 23,
no. 1, pp. 128–168, 2011.
[31] Y. Lin and Y. Jeon, “Random forests and adaptive nearest neighbors,”
Journal of the American Statistical Association, vol. 101, no. 474, pp.
578–590, 2006.
[32] N. Lim and R. J. Durrant, “Linear dimensionality reduction
in linear time: Johnson-lindenstrauss-type guarantees for random
subspace,” arXiv, vol. 1705.06408, 2017. [Online]. Available: https:
//arxiv.org/abs/1705.06408
[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: a simple way to prevent neural networks
from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1,
pp. 1929–1958, 2014.
[34] R. A. Servedio, “Smooth boosting and learning with malicious noise,”
Journal of Machine Learning Research, vol. 4, no. Sep, pp. 633–648,
2003.
[35] S.-T. Chen, H.-T. Lin, and C.-J. Lu, “An online boosting algorithm with
theoretical justifications,” in Proceedings of the International Conference
on Machine Learning (ICML), June 2012.
[36] N. Littlestone and M. K. Warmuth, “The weighted majority algorithm,”
Information and computation, vol. 108, no. 2, pp. 212–261, 1994.
[37] H. Abdulsalam, D. B. Skillicorn, and P. Martin, “Classifying evolving
data streams using dynamic streaming random forests,” in International
Conference on Database and Expert Systems Applications. Springer,
2008, pp. 643–651.
[38] G. Holmes, R. Kirkby, and B. Pfahringer, “Stress-testing hoeffding
trees,” in Knowledge Discovery in Databases: PKDD 2005, 2005, pp.
495–502. [Online]. Available: https://doi.org/10.1007/11564126_50
[39] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,”
Journal of Machine Learning Research, vol. 7, pp. 1–30, Dec. 2006.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1248547.1248548
[40] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “Moa: Massive online
analysis,” The Journal of Machine Learning Research, vol. 11, pp. 1601–
1604, 2010.
APPENDIX
A. Software and hardware
All the experiments were executed in the Massive Online
Analysis (MOA) framework [40], version 2019.04 build.
Details about the hardware and software configuration are
shown below:
CPU: 40 cores, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
Operating System: Ubuntu 16.04.5 LTS
Java Virtual Machine (JVM) version: JDK 1.8.0
JVM: Xmx = 100GB, Xms = 50MB
To reproduce the experiments the reader should access the
GitHub repository
https://github.com/hmgomes/StreamingRandomPatches
which contains the code, the datasets used, and instructions
on how to execute the algorithm.
B. Datasets
Table III presents an overview of the datasets.
TABLE III: Datasets (Drifts: (A) Abrupt, (G) Gradual, (M)
Incremental (moderate) and (F) Incremental (fast)). AGR and
LED concept drifts are introduced every 250k instances.
Dataset #Instances #Features Type Drifts #Classes
LED(A) 1,000,000 24 Synthetic A 10
LED(G) 1,000,000 24 Synthetic G 10
AGR(A) 1,000,000 9 Synthetic A 2
AGR(G) 1,000,000 9 Synthetic G 2
RBF(M) 1,000,000 10 Synthetic M 5
RBF(F) 1,000,000 10 Synthetic F 5
AIRL 539,383 7 Real - 2
ELEC 45,312 8 Real - 2
COVT 581,012 54 Real - 7
KDD99 4,898,431 41 Real - 23
ADS 3,279 1,559 Real - 2
NOMAO 34,465 119 Real - 2
SPAM 9,324 39,917 Real - 2
... This framework can be deployed on IoT cloud servers to process the big data streams transmitted from the IoT end devices through wireless communication strategies, as shown in Fig. 1. The proposed framework is an ensemble learning framework that uses the combinations of two popular drift detection methods, adaptive windowing (ADWIN) [12] and drift detection method (DDM) [13], and two state-of-theart drift adaptation methods, adaptive random forest (ARF) [14] and streaming random patches (SRP) [15], to construct base learners. The base learners are weighted according to their real-time performance and integrated to construct a robust anomaly detection ensemble model with improved drift adaptation performance. ...
... Additionally, ARF has an effective resampling technique and the adaptability to different types of drifts. Gomes et al. [15] also proposed a novel adaptive ensemble method named Streaming Random Patches (SRP) for streaming data analytics. SRP combines the random subspace and online bagging method to make predictions. ...
... Thus, each base learner already has a stronger data stream analysis capability than most existing drift adaptation methods. 2) ARF and SRP are both state-of-the-art drift adaptation methods whose performance has been proven to be better than other existing drift adaptation methods by experimental studies in [14] [15]. 3) Unlike block-based ensembles (e.g., SEA, AWE, and AUE), ARF and SRP are both online ensembles that do not require the tuning of data chunk sizes. ...
Conference Paper
As the number of Internet of Things (IoT) devices and systems have surged, IoT data analytics techniques have been developed to detect malicious cyber-attacks and secure IoT systems; however, concept drift issues often occur in IoT data analytics, as IoT data is often dynamic data streams that change over time, causing model degradation and attack detection failure. This is because traditional data analytics models are static models that cannot adapt to data distribution changes. In this paper, we propose a Performance Weighted Probability Averaging Ensemble (PWPAE) framework for drift adaptive IoT anomaly detection through IoT data stream analytics. Experiments on two public datasets show the effectiveness of our proposed PWPAE method compared against state-of-the-art methods.
... The Random Patches (RP) framework follows a very simple, yet effective, ensemble method that builds each individual model of the pool from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset [104]. Gomes, Read, and Bifet [67] presented the Streaming Random Patches (SRP) method that is an extension of Random Patches for classification tasks and meets the streaming setting requirements. ...
... The Streaming Random Patches [67] (SRP) trains each base learner on a different subset of the features and the instances from the original data, namely a random patch. This diversity enhancing strategy, is similar to the one used in random forests [69] by combing random subspaces and online bagging, yet it is not restricted to using decision trees as base learner. ...
... • SRP: Streaming Random Patches [67] is based on patches that are combination of random sub-spaces of both data instances and features. ...
Thesis
With the rapid growth of Internet-of-Things (IoT) devices and sensors, sources that are continuously releasing and curating vast amount of data at high pace in the form of stream. The ubiquitous data streams are essential for data driven decisionmaking in different business sectors using Artificial Intelligence (AI) and Machine Learning (ML) techniques in order to extract valuable knowledge and turn it to appropriate actions. Besides, the data being collected is often associated with a temporal indicator, referred to as temporal data stream that is a potentially infinite sequence of observations captured over time at regular intervals, but not necessarily. Forecasting is a challenging tasks in the field of AI and aims at understanding the process generating the observations over time based on past data in order to accurately predict future behavior. Stream Learning is the emerging research field which focuses on learning from infinite and evolving data streams. The thesis tackles dynamic model combination that achieves competitive results despite their high computational costs in terms of memory and time. We study several approaches to estimate the predictive performance of individual forecasting models according to the data and contribute by introducing novel windowing and meta-learning based methods to cope with evolving data streams. Subsequently, we propose different selection methods that aim at constituting a committee of accurate and diverse models. The predictions of these models are then weighted and aggregated. The second part addresses model compression that aims at building a single model to mimic the behavior of a highly performing and complex ensemble while reducing its complexity. Finally, we present the first streaming competition ”Real-time Machine Learning Competition on Data Streams”, at the IEEE Big Data 2019 conference, using the new SCALAR platform
... ARF [19], an adaption to the random forest algorithm [10], includes an effective resampling method that handles different types of concept drifts. Streaming Random Patches (SRP) [20] is also a ensemble method that combines random subspaces and bagging while using a strategy to detect drifts similar to the one introduced in ARF [19]. To adapt the configuration of heterogeneous algorithms to changing data streams, in [13], authors proposed an AutoML approach that uses different adaption strategies to retrain AutoML instances, such as H 2 O, Autosklearn and GAMA [18], but without taking into account costly retrainings of offline AutoML instances. ...
... In line 13, the data stream starts and a mutation is applied with a rate of f SS (lines [14][15][16][17][18][19][20]. Within the mutation steps the algorithm selects, in the first step, the best P best and weakest P weak pipeline configuration. ...
Book
Full-text available
Automated Machine Learning (AutoML) deals with finding well-performing machine learning models and their corresponding configurations without the need of machine learning experts. However, if one assumes an online learning scenario, where an AutoML instance executes on evolving data streams, the question for the best model and its configuration with respect to occurring changes in the data distribution remains open. Algorithms developed for online learning settings rely on few and homogeneous models and do not consider data mining pipelines or the adaption of their configuration. We, therefore, introduce EvoAutoML, an evolution-based online learning framework consisting of heterogeneous and connectable models that supports large and diverse configuration spaces and adapts to the online learning scenario. We present experiments with an implementation of EvoAutoML on a diverse set of synthetic and real datasets, and show that our proposed approach outperforms state-of-the-art online algorithms as well as strong ensemble baselines in a traditional test-then-train evaluation.
... Additionally, ensembles can naturally manage concept drift by incorporating new base learners trained on most recent data and discarding outdated ones (Cano and Krawczyk, 2020). New concepts offer a natural way of maintaining diversity among ensemble members, allowing them to continuously be mutually complementary (Gomes et al., 2019a). When looking at the possible approaches to ensemble learning for data streams, three main paths exist : (i) dynamic combiners; (ii) dynamic ensemble setup; and (iii) dynamic ensemble updating. ...
... (Oza, 2005) OBOA: Online Boosting with ADWIN. (Gomes et al., 2019a) SRP: Streaming Random Patches. ...
Article
Data streams are potentially unbounded sequences of instances arriving over time to a classifier. Designing algorithms that are capable of dealing with massive, rapidly arriving information is one of the most dynamically developing areas of machine learning. Such learners must be able to deal with a phenomenon known as concept drift, where the data stream may be subject to various changes in its characteristics over time. Furthermore, distributions of classes may evolve over time, leading to a highly difficult non-stationary class imbalance. In this work we introduce Robust Online Self-Adjusting Ensemble (ROSE), a novel online ensemble classifier capable of dealing with all of the mentioned challenges. The main features of ROSE are: (i) online training of base classifiers on variable size random subsets of features; (ii) online detection of concept drift and creation of a background ensemble for faster adaptation to changes; (iii) sliding window per class to create skew-insensitive classifiers regardless of the current imbalance ratio; and (iv) self-adjusting bagging to enhance the exposure of difficult instances from minority classes. The interplay among these features leads to an improved performance in various data stream mining benchmarks. An extensive experimental study comparing with 30 ensemble classifiers shows that ROSE is a robust and well-rounded classifier for drifting imbalanced data streams, especially under the presence of noise and class imbalance drift, while maintaining competitive time complexity and memory consumption. Results are supported by a thorough non-parametric statistical analysis.
... Streaming Random Patches (SRP) [32] is an ensemble method specially adapted to stream classification, which combines random subspaces and online bagging. Unlike ARF, SRP is not constrained to a specific base learner, since its diversity-inducing mechanisms are not built into the base learner, i.e., SRP uses global randomization while ARF uses local randomization. ...
... Even so, all the experiments in [32] focused on Hoeffding trees and showed that SRP could produce deeper trees, which may lead to increased diversity in the ensemble. ...
Preprint
In recent years, the Edge Computing (EC) paradigm has emerged as an enabling factor for developing technologies like the Internet of Things (IoT) and 5G networks, bridging the gap between Cloud Computing services and end-users, supporting low latency, mobility, and location awareness to delay-sensitive applications. Most solutions in EC employ machine learning (ML) methods to perform data classification and other information processing tasks on continuous and evolving data streams. Usually, such solutions have to cope with vast amounts of data that come as data streams while balancing energy consumption, latency, and the predictive performance of the algorithms. Ensemble methods achieve remarkable predictive performance when applied to evolving data streams due to the combination of several models and the possibility of selective resets. This work investigates strategies for optimizing the performance (i.e., delay, throughput) and energy consumption of bagging ensembles to classify data streams. The experimental evaluation involved six state-of-the-art ensemble algorithms (OzaBag, OzaBag Adaptive Size Hoeffding Tree, Online Bagging ADWIN, Leveraging Bagging, Adaptive Random Forest, and Streaming Random Patches) applying five widely used machine learning benchmark datasets with varied characteristics on three computer platforms. Such strategies can significantly reduce energy consumption in 96% of the experimental scenarios evaluated. Despite the trade-offs, it is possible to balance them to avoid significant loss in predictive performance.
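The combination of random subspaces and online bagging that the excerpts above attribute to SRP can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `CentroidLearner` and `StreamingPatchesSketch`, the toy nearest-centroid base learner (a stand-in for a Hoeffding tree), the fixed subspace size, and the default Poisson rate `lam=6.0` are all assumptions made for illustration, and drift detection and learner resets are left out entirely.

```python
import math
import random
from collections import defaultdict


def poisson(lam, rng):
    """Draw a Poisson(lam) integer with Knuth's multiplication method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class CentroidLearner:
    """Hypothetical toy incremental learner: predicts the nearest class centroid."""

    def __init__(self):
        self.sums = {}                      # label -> weighted per-feature sums
        self.weights = defaultdict(float)   # label -> total instance weight

    def learn(self, x, y, weight):
        sums = self.sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            sums[i] += weight * value
        self.weights[y] += weight

    def predict(self, x):
        best, best_dist = None, float("inf")
        for y, sums in self.sums.items():
            dist = sum((v - s / self.weights[y]) ** 2 for v, s in zip(x, sums))
            if dist < best_dist:
                best, best_dist = y, dist
        return best


class StreamingPatchesSketch:
    """Each learner sees a fixed random feature subset (global subspace
    randomization) and a Poisson-weighted copy of each instance (online bagging)."""

    def __init__(self, n_features, n_learners=10, subspace_size=2, lam=6.0, seed=1):
        self.rng = random.Random(seed)
        self.lam = lam  # assumed default, not taken from the text above
        self.subspaces = [
            sorted(self.rng.sample(range(n_features), subspace_size))
            for _ in range(n_learners)
        ]
        self.learners = [CentroidLearner() for _ in range(n_learners)]

    def learn_one(self, x, y):
        for subspace, learner in zip(self.subspaces, self.learners):
            k = poisson(self.lam, self.rng)  # how many times this learner "sees" x
            if k > 0:
                learner.learn([x[i] for i in subspace], y, k)

    def predict_one(self, x):
        votes = defaultdict(int)
        for subspace, learner in zip(self.subspaces, self.learners):
            label = learner.predict([x[i] for i in subspace])
            if label is not None:
                votes[label] += 1
        return max(votes, key=votes.get) if votes else None
```

Because the feature subset is chosen once per learner, outside the learner itself, any incremental classifier could replace `CentroidLearner`; this is the sense in which the randomization is global rather than built into the base learner.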
... Inspired by the subsampling approach of J+aB, our inferential approach is rooted in an ensemble built by taking tiny random subsamples of both observations and features in tabular data. This idea of double subsampling appears first in the context of random forests [41,27], linear regression [37], and more recently has been termed "minipatch ensembles" by [65,80]. We adopt this idea of minipatch ensembles and we are the first to develop inferential approaches using this approach. ...
Preprint
In order to trust machine learning for high-stakes problems, we need models to be both reliable and interpretable. Recently, there has been a growing body of work on interpretable machine learning which generates human understandable insights into data, models, or predictions. At the same time, there has been increased interest in quantifying the reliability and uncertainty of machine learning predictions, often in the form of confidence intervals for predictions using conformal inference. Yet, there has been relatively little attention given to the reliability and uncertainty of machine learning interpretations, which is the focus of this paper. Our goal is to develop confidence intervals for a widely-used form of machine learning interpretation: feature importance. We specifically seek to develop universal model-agnostic and assumption-light confidence intervals for feature importance that will be valid for any machine learning model and for any regression or classification task. We do so by leveraging a form of random observation and feature subsampling called minipatch ensembles and show that our approach provides assumption-light asymptotic coverage for the feature importance score of any model. Further, our approach is fast as computations needed for inference come nearly for free as part of the ensemble learning process. Finally, we also show that our same procedure can be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both model predictions and interpretations. We validate our intervals on a series of synthetic and real data examples, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.
... The most popular approach lies in combining resampling techniques with Online Bagging (Wang et al., 2015; Wang and Pineau, 2016). Similar strategies can be applied to Adaptive Random Forest (Gomes et al., 2017), Online Boosting (Gomes et al., 2019), Dynamic Weighted Majority (Lu et al., 2017), Dynamic Feature Selection (Wu et al., 2014), Adaptive Random Forest with resampling (Ferreira et al., 2019), Kappa Updated Ensemble (Cano and Krawczyk, 2020), Robust Online Self-Adjusting Ensemble (Cano and Krawczyk, 2022) or any ensemble that can incrementally update its base learners (Ancy and Paulraj, 2020; Li et al., 2020). It is interesting to note that preprocessing approaches enhance diversity among base classifiers (Zyblewski et al., 2019). ...
Preprint
Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures on how to evaluate these algorithms. This work presents a taxonomy of algorithms for imbalanced data streams and proposes a standardized, exhaustive, and informative experimental testbed to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to the largest experimental study conducted so far in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods. In this way, we propose the first standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create trustworthy and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.
... It has been argued that uncorrelated predictions are important for the error reduction effect (Breiman 1996), and consequently that learner diversity is key to uncorrelated predictions (Kuncheva 2003), implying that it is desirable to have learners that are affected by small changes to the stream, creating diversity in the ensemble. However, Hoeffding Tree has a very conservative splitting mechanism (Gomes et al. 2019); that is, being provided slightly different versions of a stream does not greatly alter the decision tree model produced, on account of the requirement that the selected split should be better than any alternative. In contrast, Hoeffding AnyTime Tree requires only that the selected split be better than no split whatsoever, increasing the likelihood that different (but nonetheless useful) splits will be selected in different models. ...
Article
Decision tree ensembles are widely used in practice. In this work, we study in ensemble settings the effectiveness of replacing the split strategy for the state-of-the-art online tree learner, Hoeffding Tree, with a rigorous but more eager splitting strategy that we had previously published as Hoeffding AnyTime Tree. Hoeffding AnyTime Tree (HATT) uses the Hoeffding Test to determine whether the current best candidate split is superior to the current split, with the possibility of revision, while Hoeffding Tree aims to determine whether the top candidate is better than the second best and, if a test is selected, fixes it for all posterity. HATT converges to the ideal batch tree while Hoeffding Tree does not. We find that HATT is an efficacious base learner for online bagging and online boosting ensembles. On UCI and synthetic streams, HATT as a base learner outperforms HT at a 0.05 significance level for the majority of tested ensembles on what we believe is the largest and most comprehensive set of testbenches in the online learning literature. Our results indicate that HATT is a superior alternative to Hoeffding Tree in a large number of ensemble settings.
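The difference between the two split tests described above can be made concrete with a small Python sketch. It assumes the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and the function names and the example gains are hypothetical; tie-breaking thresholds, grace periods, and HATT's later split revision are omitted.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability at least 1 - delta, the true mean is
    within epsilon of the mean observed over n samples of the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def ht_should_split(gains, value_range, delta, n):
    """Hoeffding Tree style test: the best candidate must beat the second best."""
    ranked = sorted(gains, reverse=True)
    return ranked[0] - ranked[1] > hoeffding_bound(value_range, delta, n)


def hatt_should_split(gains, current_gain, value_range, delta, n):
    """HATT style test: the best candidate must beat the current split.
    current_gain = 0.0 models a leaf, i.e. "no split whatsoever"."""
    return max(gains) - current_gain > hoeffding_bound(value_range, delta, n)


# Two candidates with nearly equal merit after 200 instances (illustrative values).
gains = [0.30, 0.28, 0.05]
print(ht_should_split(gains, 1.0, 1e-7, 200))         # False: 0.02 is below epsilon
print(hatt_should_split(gains, 0.0, 1.0, 1e-7, 200))  # True: 0.30 exceeds epsilon
```

With two closely matched candidates, the first test keeps waiting for more data while the second splits at once, which is the more eager behaviour the abstract refers to.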
Article
As an excellent ensemble algorithm, Gradient Boosting Decision Tree (GBDT) has been tested extensively with static data. However, real-world applications often involve dynamic data streams, which suffer from concept drift problems where the data distribution changes over time. The performance of a GBDT model degrades when it is applied to predict data streams with concept drift. Although incremental learning can help to alleviate such degradation, finding a perfect learning rate (i.e., the iteration in GBDT) that suits all time periods with all their different drift severity levels can be difficult. In this paper, we convert the issue of determining an optimal learning rate into the issue of choosing the best adaptive iterations when tuning GBDT. We theoretically prove that drift severity is closely related to the convergence rate of the model. Accordingly, we propose a novel drift adaptation method, called adaptive iterations (AdIter), that automatically chooses the number of iterations for different drift severities to improve the prediction accuracy for data streams under concept drift. In a series of comprehensive tests with seven state-of-the-art drift adaptation methods on both synthetic and real-world data, AdIter yielded superior accuracy levels.
Article
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
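The test-then-train (prequential) evaluation mentioned above can be sketched in a few lines of Python. The `predict_one`/`learn_one` interface and the `MajorityClass` baseline are hypothetical names chosen for this illustration; they are not taken from the ARF paper.

```python
from collections import Counter


class MajorityClass:
    """Hypothetical baseline learner: always predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Each instance is first used to test the model and only then to train it,
    so every prediction is made on data the model has not yet seen."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict_one(x) == y)  # test
        model.learn_one(x, y)                      # then train
        total += 1
    return correct / total if total else 0.0
```

In a delayed labelling variant, the `learn_one` call would be postponed until the label arrives, while the prediction is still made immediately.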
Article
Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integrated with drift detection algorithms and incorporate dynamic updates, such as selective removal or addition of classifiers. This work proposes a taxonomy for data stream ensemble learning as derived from reviewing over 60 algorithms. Important aspects such as combination, diversity, and dynamic updates, are thoroughly discussed. Additional contributions include a listing of popular open-source tools and a discussion about current data stream research challenges and how they relate to ensemble learning (big data streams, concept evolution, feature drifts, temporal dependencies, and others).
Article
Most machine learning models are static, but the world is dynamic, and increasing online deployment of learned models gives increasing urgency to the development of efficient and effective mechanisms to address learning in the context of non-stationary distributions, or as it is commonly called concept drift. However, the key issue of characterizing the different types of drift that can occur has not previously been subjected to rigorous definition and analysis. In particular, while some qualitative drift categorizations have been proposed, few have been formally defined, and the quantitative descriptions required for detailed understanding of learner performance have not existed. We present the first comprehensive framework for quantitative analysis of drift. This supports the development of the first comprehensive set of formal definitions of types of concept drift. The formal definitions clarify ambiguities and identify gaps in previous definitions, giving rise to a new comprehensive taxonomy of concept drift types and a solid foundation for research into mechanisms to detect and address concept drift.
Article
The success of simple methods for classification shows that it is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this article, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the log-odds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons' learning rate using the ADWIN change detection method for data streams, and also use ADWIN to reset ensemble members (i.e., Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption. We also show that our stacking method can improve the performance of a bagged ensemble.
Article
Internet of Things (IoT) has provided a promising opportunity to build powerful industrial systems and applications by leveraging the growing ubiquity of radio-frequency identification (RFID), and wireless, mobile, and sensor devices. A wide range of industrial IoT applications have been developed and deployed in recent years. In an effort to understand the development of IoT in industries, this paper reviews the current research of IoT, key enabling technologies, major IoT applications in industries, and identifies research trends and challenges. A main contribution of this review paper is that it summarizes the current state-of-the-art IoT in industries systematically.
Conference Paper
In this paper, we consider supervised learning under the assumption that the available memory is small compared to the dataset size. This general framework is relevant in the context of big data, distributed databases and embedded systems. We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset. We carry out an extensive and systematic evaluation of this method on 29 datasets, using decision tree-based estimators. With respect to popular ensemble methods, these experiments show that the proposed method provides on par performance in terms of accuracy while simultaneously lowering the memory needs, and attains significantly better performance when memory is severely constrained.
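Drawing a random patch, as described above, amounts to sampling a subset of rows and a subset of columns and keeping only their intersection. The Python sketch below shows that step; the function names, the fractions, and the choice of sampling without replacement are assumptions made for illustration.

```python
import random


def draw_patch(n_rows, n_cols, row_frac, col_frac, rng):
    """Return sorted row and column indices of one random patch.
    Sampling without replacement is an assumption of this sketch."""
    n_r = max(1, int(row_frac * n_rows))
    n_c = max(1, int(col_frac * n_cols))
    rows = sorted(rng.sample(range(n_rows), n_r))
    cols = sorted(rng.sample(range(n_cols), n_c))
    return rows, cols


def extract_patch(X, y, rows, cols):
    """Materialize the patch: selected instances restricted to selected features."""
    return [[X[r][c] for c in cols] for r in rows], [y[r] for r in rows]


# Usage: a 10 x 4 dataset reduced to a 5 x 2 patch for one ensemble member.
X = [[r * 10 + c for c in range(4)] for r in range(10)]
y = [r % 2 for r in range(10)]
rows, cols = draw_patch(len(X), len(X[0]), 0.5, 0.5, random.Random(0))
X_patch, y_patch = extract_patch(X, y, rows, cols)
```

Each ensemble member only ever needs its own patch in memory, which is why the approach suits the small-memory setting the abstract targets.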
Article
Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets. © 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.
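The train and test behaviour described in this abstract can be sketched in plain Python for a single layer's activations. The function names and `keep_prob` parameter are hypothetical, and scaling the activations at test time is used here as a stand-in for scaling the outgoing weights, which gives the same result for a linear layer that follows.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Training: drop each unit independently with probability 1 - keep_prob."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def dropout_test(activations, keep_prob):
    """Test: keep every unit but scale by keep_prob, so the output matches the
    expected value of the randomly thinned training-time output."""
    return [a * keep_prob for a in activations]
```

Averaging `dropout_train` over many random masks approaches the output of `dropout_test`, which is the approximation to model averaging that the abstract describes.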