
Streaming Random Patches for Evolving Data Stream Classification


Heitor Murilo Gomes∗†, Jesse Read, Albert Bifet∗†
University of Waikato, Hamilton, New Zealand
{heitor.gomes, albert.bifet}
Télécom Paris, IP-Paris, Paris, France
LIX, École Polytechnique, Palaiseau, France
Abstract—Ensemble methods are a popular choice for learning
from evolving data streams. This popularity is due to (i) the
ability to simulate simple yet successful ensemble learning
strategies, such as bagging and random forests; (ii) the possibility
of incorporating drift detection and recovery in conjunction with the
ensemble algorithm; (iii) the availability of efficient incremental
base learners, such as Hoeffding Trees. In this work, we introduce
the Streaming Random Patches (SRP) algorithm, an ensemble
method specially adapted to stream classification which combines
random subspaces and online bagging. We provide theoretical
insights and empirical results illustrating different aspects of SRP.
In particular, we explain how the widely adopted incremental
Hoeffding trees are not, in fact, unstable learners, unlike their
batch counterparts, and how this fact significantly influences
ensemble method design and performance. We compare SRP
against state-of-the-art ensemble variants for streaming data on
a multitude of datasets. The results show that SRP produces
high predictive performance on both real and synthetic datasets.
In addition, we analyze the diversity over time and the average
tree depth, which provide insights into the differences between
local subspace randomization (as in random forest) and global
subspace randomization (as in random subspaces).
Index Terms—Stream Data Mining, Ensemble Learning, Ran-
dom Subspaces, Random Patches
I. INTRODUCTION
Machine learning applications of data streams have grown in importance in recent years due to the tremendous amount of real-time data generated by networks, mobile phones and the wide variety of sensors currently available. Building predictive models from data streams is central to many applications [1]. The underlying assumption of data stream learning is that the algorithms must process large amounts of data in a fast-paced way. In a supervised learning scenario, this characteristic brings forward two crucial challenges:
• Computational efficiency. The algorithm must use a limited budget of computational resources to be able to process examples at least as fast as new examples are made available.
• Evolving data. The continuous flow of data might be subject to changes over time, where the canonical example is concept drift [2]. Concept drifts can be characterized as changes in the underlying data distribution that affect the fitted model, such that to maintain its predictive performance the model must be updated or even reset.
To tackle evolving data many strategies were proposed with
particular attention to ensemble-based methods. Ensembles are
often used to cope with concept drifts by selectively reset-
ting component learners [3]–[6]. Concerning computational
efficiency, ensembles of learners require more computational
resources than a single learner; however many are very easy
to parallelize [6].
In the traditional batch learning setting, several ensem-
ble methodologies are widely used, such as Random Sub-
spaces [7], Pasting [8], Bagging [9], Random Forest [10],
SubBag [11], and Random Patches [12]. The main differences among these algorithms lie in how they induce diversity into the ensemble. Random subspace methods train each base learner on a separate randomly selected subset of features. Pasting and Bagging train base learners on samples of instances drawn without and with replacement, respectively, from the original dataset. Random Forest extends Bagging and randomly selects subsets of features to be considered for splits in its base learners (decision trees). SubBag and Random Patches combine Bagging and Random Subspaces, and Pasting and Random Subspaces, respectively; thus, through very similar means, they train base learners on random subsets of features and samples. Other ensembles that are popular in batch learning, such as AdaBoost [13], are less attractive for data streams, partially because the original batch learning implementations introduce dependencies among the base learners, which are difficult to simulate appropriately in a streaming setting [14].
In this work, we propose strategies to cope with classifi-
cation problems on evolving data streams using an ensemble
strategy that combines random subspaces and bagging. We
name this ensemble Streaming Random Patches (SRP) as it
is inspired by the Random Subspaces method [7] and Online
Bagging [15], and thus resembles the Random Patches [12]
algorithm. SRP incorporates an active drift detection strategy, similarly to other ensemble methods, e.g., Leveraging Bagging [5] and Adaptive Random Forest (ARF) [6]. The drift detection and recovery strategy in SRP follows the approach used in ARF. ARF consistently outperforms other state-of-the-art ensembles for evolving data streams, partially due to this strategy [6]; moreover, by using the same procedure in SRP, we can compare it to ARF in terms of the ensemble strategy without interference from the approach used to cope with evolving data.
Similar algorithms based on the Random Subspaces
method [7] or combinations of resampling and random sub-
spaces [11], [12] have been previously explored on batch learn-
ing for high-dimensional datasets [16] and also for evolving
data stream classification [5], [6], [17], [18]. Nevertheless,
to the best of our knowledge, none of these previous works
thoroughly investigated the impact of online bagging and
random subspaces, concomitantly, for evolving data streams.
Similarly, previous works have not outlined the similarities
and differences between a global and a local randomization
strategy for the subset of features for streaming data. We
use the same definition of global and local randomization as
in [12], i.e., in the random subspaces method, the subspace of
features is selected globally once for the whole base learner,
while in the random forest algorithm the subspaces are selected
locally for each leaf of the base tree [12]. We discuss the
impact of both strategies in our experiments (Section V) while
comparing ARF and SRP. Panov and Džeroski, and Louppe
and Geurts conducted a similar investigation for the batch
setting in [11] and [12], respectively.
Paper contributions and roadmap. Our main contributions
can be summarized as follows:
1) Streaming Random Patches (SRP): We introduce an
ensemble-based method, namely SRP, that achieves high
accuracy by training base models on random subsets of
features and instances¹;
2) Theoretical insights: We analyze the SRP algorithm with
particular attention to the questions of the stability and
diversity of Hoeffding trees, and the impact of global
subspace randomization in SRP in opposition to the local
randomization in ARF;
3) Empirical Analysis: We compare SRP against state-
of-the-art ensemble variants for streaming data in a
multitude of datasets. The results give a clear overview
of predictive performance and resource usage. In addition,
we analyze the diversity over time and the average tree
depth, which provide some insights into the differences
between local and global subspace randomization.
The rest of this paper is organized as follows. In Section
II we introduce the problem of learning classification models
from evolving data streams. In Section III, we present the SRP
algorithm and theoretical insights. In Section IV, related work
is discussed and compared to our approach. In Section V, we
present the experiments conducted to analyze SRP in terms of
accuracy, computational resources, diversity and decision tree
depth. Finally, Section VI concludes this work and presents
directions for future work.
II. PROBLEM SETTING
Let $X = \{x_{-\infty}, \ldots, x_{-1}, x_0\}$ be an open-ended sequence of observations collected over time, containing input examples in which $x_k \in \mathbb{R}^n$ and $n \geq 1$. Similarly, let $y$ be an open-ended sequence of corresponding class labels, such that every example in $X$ has a corresponding entry in $y$. Moreover, $y_k$ has a finite set of possible values, i.e., $y_k \in \{l_1, \ldots, l_L\}$ for $L \geq 2$, such that a classification task is defined. Furthermore, we assume a problem setting where new input examples $x$ are presented every $u$ time units to the learning model for prediction, such that $x_k^t$ represents a vector of features available at time $t$. The true class label $y_k^{t+1}$, corresponding to instance $x_k^t$, is available before the next instance $x_{k+1}^{t+1}$ appears, and thus it can be used for training immediately after it has been used for prediction. We emphasise that this experimental setting can be naturally extended to the delayed and weakly-supervised settings by considering a non-negligible time delay between observing $x$ and its class label $y$, including an infinite delay (i.e., the label is never observed). However, the conclusions drawn from experimenting in such settings are similar to those in the “immediate” setting, as shown in [6]. Therefore, for simplicity, we omit such results in this paper.
¹ The implementation and instructions are available at:
An important characteristic of data stream classification is whether the data distribution is stationary or evolving. In this work, we assume evolving data distributions. Thus, we expect the occurrence of concept drifts² that might influence decision boundaries. Note that if a concept drift is accurately detected (without false negatives) and dealt with (by fully or partially resetting models as appropriate), an iid assumption can be made on a per-concept basis, since each concept can be treated as a separate iid stream; the problem then becomes a series of iid streams to be dealt with. Nevertheless, the typical nature of a data stream as being fast and dynamic encourages the in-depth study that we present in this work.
III. STREAMING RANDOM PATCHES
Streaming Random Patches (SRP) can be viewed as an adaptation of batch learning ensemble methods that combine random samples of instances and random subspaces of features [11], [12]. Following the terminology introduced in [12], in the rest of this work, we refer to random subsets of both features and instances as random patches. Fig. 1 presents an example of subsampling both instances and features, simultaneously, from streaming data, where only the shaded intersections of the matrix belong to the subsample, i.e., $\{v_{1,1}, v_{2,1}, v_{6,1}, v_{1,3}, v_{2,3}, v_{6,3}\}$.
Our motivation for exploiting an ensemble of base models trained on random patches is based mainly on the high predictive performance of ensembles for data stream learning that added randomization to the base models by training them on random samples of instances [5], random subsets of features [17], or both [6]. We investigate whether selecting the subset of features globally, once and before constructing each base model, outperforms locally selecting subsets of features at each node while constructing base trees, as in Random Forest. In [12], the authors show empirical evidence that Random Patches combined with tree-based models achieved similar accuracy to other randomization strategies, including Random Forest [10], while using less memory.
[Figure 1: table with columns $x_1, x_2, x_3, x_4, \ldots, x_m$]
Fig. 1: Representation of a data stream as an unbounded table where the rows are infinite, but the columns are constrained by $m$ input features.
² A formal definition of concept drift can be found in [19].
The original Random Patches algorithm [12] is defined in terms of all possible subsets of features and instances, such that $R(p_s, p_f, D)$ denotes all random patches of size $p_s N_s \times p_f N_f$ that can be drawn from the training set $D$, where $N_s$ and $N_f$ represent the number of instances and features, respectively, in $D$. The hyperparameters $p_s \in [0, 1]$ and $p_f \in [0, 1]$ represent, respectively, the fraction of samples and features in each patch $r \in R(p_s, p_f, D)$. In SRP, the set of all possible streaming random patches $R_s(\lambda, p_f, S)$ is infinite in the sample dimension, as the input training data is represented by a data stream $S$. We control the number of samples in the streaming patch using the Poisson parameter $\lambda$ (Section III-A).
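As a minimal sketch of the batch definition above, the following draws one patch of size $p_s N_s \times p_f N_f$ by sampling row and column indices without replacement. The function names, the rounding of the patch size, and the floor of one row and one column are assumptions made for illustration; they are not taken from [12].

```python
import random

def draw_random_patch(n_samples, n_features, p_s, p_f, rng):
    """Return (row indices, column indices) of one random patch.

    Assumption: the patch size is rounded to the nearest integer and
    is never allowed to be empty.
    """
    n_rows = max(1, round(p_s * n_samples))
    n_cols = max(1, round(p_f * n_features))
    rows = sorted(rng.sample(range(n_samples), n_rows))    # instances
    cols = sorted(rng.sample(range(n_features), n_cols))   # features
    return rows, cols

def extract_patch(data, rows, cols):
    """Materialize the sub-table selected by the patch indices."""
    return [[data[i][j] for j in cols] for i in rows]
```

In the streaming case the row dimension cannot be sampled up front, which is why SRP replaces $p_s$ with the Poisson parameter $\lambda$.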
A. Random Subsets of Instances
In the batch setting, Bagging builds $L$ base models, training each model with a bootstrap sample from the original training dataset of size $N$. Each bootstrap sample contains each original training example $K$ times, where $\Pr(K = k)$ follows a binomial distribution which, for large $N$, tends to a Poisson($\lambda = 1$) distribution. Using this fact, Oza and Russell [15] proposed Online Bagging, an online method that, instead of sampling with replacement, gives each example a weight according to Poisson($\lambda = 1$).
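The limit behind this argument can be written out explicitly. Each of the $N$ bootstrap draws selects a given example with probability $1/N$, so the count $K$ is Binomial($N$, $1/N$), and the standard Poisson limit gives:

```latex
\[
\Pr(K = k)
  = \binom{N}{k}\left(\frac{1}{N}\right)^{k}\left(1 - \frac{1}{N}\right)^{N-k}
  \;\xrightarrow[N \to \infty]{}\; \frac{e^{-1}}{k!},
  \qquad k = 0, 1, 2, \ldots
\]
```

The right-hand side is the Poisson probability mass function with $\lambda = 1$.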
Leveraging Bagging [5] and Adaptive Random Forest [6] train their base models according to a Poisson($\lambda = 6$) distribution, which on average increases the weight of each training instance and diminishes the probability of not using an instance for training: $\Pr[\text{Poisson}(\lambda = 6) = 0] \approx 0.25\%$, while $\Pr[\text{Poisson}(\lambda = 1) = 0] \approx 36.8\%$. Using Poisson($\lambda = 6$) tends to improve the predictive performance of the ensemble as the base models are updated more often, but this benefit comes at the expense of computational resources.
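The two probabilities quoted above follow directly from the Poisson probability mass function, since $\Pr(K = 0) = e^{-\lambda}$. A small check, with helper names chosen here only for illustration:

```python
import math

def poisson_pmf(k, lam):
    """Probability that a Poisson(lam) variable equals k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Probability that a base model skips a training instance entirely.
p_skip_lambda_1 = poisson_pmf(0, 1.0)
p_skip_lambda_6 = poisson_pmf(0, 6.0)

# Average training weight per instance under lambda = 6 (truncated sum).
mean_weight_lambda_6 = sum(k * poisson_pmf(k, 6.0) for k in range(60))
```

The mean weight equals $\lambda$, so moving from $\lambda = 1$ to $\lambda = 6$ multiplies the expected training effort per instance by six, which is the computational cost mentioned above.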
Minku et al. [20] used $\lambda$ as a proxy for diversity, i.e., the lower $\lambda$, the more diversity would be induced into the ensemble. As pointed out by Stapenhurst [21], for iid data the base models will eventually converge, and even faster if given larger values of $\lambda$. One important question to be addressed then is: why does Poisson(6) work if only a small portion of the data is not presented to each learner? In the long run, the base models start to converge. This can be visualized in Section V, where diversity is shown over time for the AGRAWAL generator: once a concept becomes stable, the average Kappa statistic starts to increase (i.e., the outputs of the base models start to converge) if the only means of decorrelating the base models is resampling with replacement simulated with Poisson(6). This motivates the addition of other techniques to induce diversity (Section III-B).
B. Random Subsets of Features
Random Subspaces are sensitive to the hyperparameters $m$ (size of subspace) and $n$ (number of learners). For a feature space of $M$ features, there are $2^M - 1$ different non-empty subsets of features. Thus, it is infeasible to train one learner per subset for even moderate values of $M$, especially for streaming data where processing time and memory are restricted [22]. Ho noted in [7] that highly accurate ensembles could be obtained far before all possible combinations of subspaces are explored. Later, Kuncheva et al. [16] provided a thorough analysis of the random subspace method for the functional magnetic resonance imaging (fMRI) data problem, which resulted in insights for selecting values of $m$ and $n$ that generate usable learners, i.e., learners whose subset of features contains at least one ‘relevant’ feature.
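To make the counting argument concrete, the sketch below computes the number of non-empty subspaces and the chance that a single subspace is 'usable'. It assumes each subspace is a uniformly random $m$-subset of the $M$ features and that exactly $r$ features are relevant. The second function is a standard hypergeometric calculation and is not claimed to be the analysis of [16]; all names are illustrative.

```python
import math
import random

def count_nonempty_subspaces(n_total):
    """Number of non-empty feature subsets of an M-dimensional space."""
    return 2 ** n_total - 1

def prob_has_relevant(n_total, subspace_size, n_relevant):
    """P(a uniformly drawn subspace contains at least one relevant feature)."""
    none_relevant = math.comb(n_total - n_relevant, subspace_size)
    return 1.0 - none_relevant / math.comb(n_total, subspace_size)

def sample_subspace(n_total, subspace_size, rng):
    """Draw one global subspace: feature indices fixed for a whole learner."""
    return sorted(rng.sample(range(n_total), subspace_size))
```

For example, with $M = 10$, $m = 6$ and $r = 2$, about 87% of subspaces contain a relevant feature, while the number of possible subspaces is already 1023.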
In our problem setting, one reason to train base models
on random subspaces of features on top of training them on
different subsets of instances is to add even further diversity to
the models. Even if they converge because of iid data (Section
III-A), by training them on separate subspaces of features we
have a higher chance of producing models that maintain some
level of diversity.
There is a risk of subspaces including only irrelevant features. Two mechanisms help mitigate this situation: (i) resetting subspaces once a model is reset in response to a concept drift; (ii) assigning weights to the votes of base models based on their predictive performance, so that base models with only irrelevant features are expected to produce poor predictive performance and other base models dominate their votes.
C. Drift Detection and Recovery
The ultimate goal of drift detection in our context is to allow
automatic recovery from a state where the model performance
is degrading. To achieve this goal we need an accurate drift
detector and a proper action that will be triggered as a response
to the drift signal. Currently, the most successful supervised
learning methods follow a simple, yet effective, approach:
when a concept drift is detected the underlying model is
reset [5], [6]. If the detection algorithm misses a change or takes too
long to detect it, then it will let the model degrade. On
the other hand, if it yields too many false positives, it will
continuously trigger model resets and consequently prevent
the algorithm from building an accurate model.
We use the same strategy to detect and recover from concept
drifts as introduced in the Adaptive Random Forest (ARF) [6]
algorithm. In this strategy, the correct/incorrect predictions
of each base model are monitored by a detection algorithm.
When the drift detection algorithm flags a warning, a new base
model starts training in the ‘background’, where ‘background’
means that it does not influence the ensemble decision with
its predictions. If the warning escalates to a concept drift, then
the background model replaces the associated base model.
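The warning and drift handling described above can be sketched as a small state holder for one ensemble member. The class and method names are hypothetical, `make_model` is an assumed factory for fresh learners, and the fallback to a fresh model when a drift arrives without a prior warning is an assumption of this sketch rather than something stated in the text.

```python
class EnsembleMember:
    """One base model plus its optional background model."""

    def __init__(self, make_model):
        self.make_model = make_model   # hypothetical factory for new learners
        self.model = make_model()
        self.background = None

    def handle_signals(self, warning, drift):
        """Apply the detector outputs for the current instance."""
        if warning:
            # A warning starts a new background model.
            self.background = self.make_model()
        if drift:
            # The background model takes over; assumed fallback: fresh model.
            if self.background is not None:
                self.model = self.background
            else:
                self.model = self.make_model()
            self.background = None

    def voters(self):
        """Models that take part in the ensemble decision."""
        return [self.model]

    def trainees(self):
        """Models that receive each new training instance."""
        if self.background is None:
            return [self.model]
        return [self.model, self.background]
```

The separation between `voters` and `trainees` captures the point that a background model learns from the stream without influencing predictions.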
The strategy accommodates different drift detection algorithms; however, to facilitate discussion, we focus the experiments and analysis on SRP with the ADaptive WINdow (ADWIN) algorithm [23]. ADWIN is a change detector and estimator that solves in a well-specified way the problem of tracking the average of a stream of bits or real-valued numbers. ADWIN keeps a variable-length window of recently seen items, with the property that the window has the maximal length statistically consistent with the hypothesis “there has been no change in the average value inside the window”. More precisely, an older fragment of the window is dropped if and only if there is enough evidence that its average value differs from that of the rest of the window. This has two consequences: first, change is reliably declared whenever the window shrinks; and second, at any time the average over the existing window can be reliably taken as an estimate of the current average in the stream (barring a very small or very recent change that is still not statistically visible). A formal and quantitative statement of these two points (a theorem) appears in [23]. ADWIN is parameter- and assumption-free in the sense that it automatically detects and adapts to the current rate of change. Its only parameter is the confidence bound δ, indicating how confident we want to be in the algorithm’s output, which is inherent to all algorithms dealing with random processes.
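The window-shrinking rule can be illustrated with a naive, uncompressed variant that stores every item and tests every split point. This is a sketch for intuition only: the ADWIN algorithm of [23] uses a compressed bucket structure to avoid the linear scan, and the class name here is hypothetical. The cut threshold used is the Hoeffding-style bound $\epsilon_{cut} = \sqrt{\frac{1}{2m}\ln\frac{4}{\delta'}}$ with $m = 1/(1/n_0 + 1/n_1)$ and $\delta' = \delta/n$, for values in $[0, 1]$.

```python
import math
from collections import deque

class NaiveAdwin:
    """Uncompressed sliding window with an ADWIN-style cut test."""

    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = deque()

    def update(self, value):
        """Add a value in [0, 1]; return True if the window shrank."""
        self.window.append(float(value))
        changed = False
        shrunk = True
        while shrunk and len(self.window) > 1:
            shrunk = False
            items = list(self.window)
            n = len(items)
            total = sum(items)
            delta_prime = self.delta / n
            head_sum = 0.0
            for n0 in range(1, n):          # split: oldest n0 vs newest n1
                head_sum += items[n0 - 1]
                n1 = n - n0
                m = 1.0 / (1.0 / n0 + 1.0 / n1)
                eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
                if abs(head_sum / n0 - (total - head_sum) / n1) > eps_cut:
                    self.window.popleft()   # drop the oldest item and retest
                    changed = shrunk = True
                    break
        return changed

    def estimate(self):
        """Current estimate of the stream average."""
        return sum(self.window) / len(self.window)
```

In SRP the monitored values would be the 0/1 correctness of each base model's predictions, with one detector per model.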
There are no guarantees that a detection algorithm based
on the correct/incorrect predictions will be accurate, but it
will at least be able to detect changes in the underlying data
that genuinely affected the decision boundary (real drifts),
while neglecting those that did not (virtual drifts) [24]. One
disadvantage of this strategy is that it requires access to
labelled data, which is not an issue given our problem setting
(Section II), but for problems that include verification latency
or weakly-labeled streams, then other drift detection strategies
must be explored [25].
The pseudocode for SRP is depicted in Alg. 1. The training instances are used to evaluate the classification performance of each base model before being used for training, and this estimate is used as the learner weight during voting (line 9, Alg. 1). For non-stationary data streams, we should consider that the relevant features, i.e., those that can effectively be used to predict the class label, may change over time. Therefore, when a background learner is created, a new random subspace is generated for it (line 12, Alg. 1). Background models are trained during the period between the warning that triggered their creation and the concept drift signal that causes them to replace the previous base model; thus, a model added to the ensemble has already been trained on recent data rather than being an entirely new base model (line 15, Alg. 1).
Algorithm 1 Streaming Random Patches.
Symbols: m: maximum features per subset; λ: Poisson distribution parameter; n: total number of models (n = |L|); δw: warning threshold; δd: drift threshold; S: data stream; B: set of background models; W(l): weight of model l; P(·): model predictive performance estimation function; d(·): drift detection method.
 1: function TRAINSRP(m, n, δw, δd)
 2:     L ← CreateBaseModels(n, m)
 3:     W ← InitWeights(n)
 4:     B ← ∅
 5:     while HasNext(S) do
 6:         (x, y) ← next(S)
 7:         for all l ∈ L do
 8:             ŷ ← predict(l, x)
 9:             W(l) ← P(W(l), ŷ, y)
10:             Train(m, l, x, y)
11:             if d(δw, l, x, y) then        ▷ Warning detected?
12:                 B(l) ← CreateBkgModel(m)
13:             end if
14:             if d(δd, l, x, y) then        ▷ Drift detected?
15:                 l ← B(l)                  ▷ Replace l by bkg learner
16:             end if
17:         end for
18:         for all b ∈ B do
19:             Train(m, b, x, y)
20:         end for
21:     end while
22: end function
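A runnable sketch of lines 5 to 10 of Alg. 1 plus weighted voting is given below, under several assumptions. The base learner is a simple weighted nearest-centroid classifier standing in for a Hoeffding tree, the vote weight is a Laplace-smoothed running accuracy, and the warning and drift branches (lines 11 to 20) are omitted. All names are hypothetical and this is not the authors' implementation.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's sampler for Poisson(lam); adequate for small lam such as 1 or 6."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

class CentroidLearner:
    """Stand-in for a Hoeffding tree: weighted class centroids computed
    only on this learner's feature subset (its global subspace)."""

    def __init__(self, features):
        self.features = list(features)
        self.sums = {}     # label -> weighted per-feature sums
        self.weight = {}   # label -> total training weight

    def learn(self, x, y, w):
        z = [x[i] for i in self.features]
        sums = self.sums.setdefault(y, [0.0] * len(z))
        for j, v in enumerate(z):
            sums[j] += w * v
        self.weight[y] = self.weight.get(y, 0.0) + w

    def predict(self, x):
        if not self.weight:
            return None    # nothing learned yet
        z = [x[i] for i in self.features]
        def sq_dist(label):
            return sum((v - s / self.weight[label]) ** 2
                       for v, s in zip(z, self.sums[label]))
        return min(self.weight, key=sq_dist)

def train_srp(stream, n_features, n_models=10, subspace_size=2, lam=6.0, seed=1):
    """Test-then-train loop; returns the models and the ensemble accuracy."""
    rng = random.Random(seed)
    models = [CentroidLearner(rng.sample(range(n_features), subspace_size))
              for _ in range(n_models)]
    hits = [0] * n_models            # per-model correct predictions
    seen = ensemble_hits = 0
    for x, y in stream:
        votes = {}
        for i, model in enumerate(models):
            y_hat = model.predict(x)
            if y_hat is not None:
                # Smoothed accuracy so far is used as the vote weight.
                votes[y_hat] = votes.get(y_hat, 0.0) + (hits[i] + 1) / (seen + 2)
            hits[i] += int(y_hat == y)
            k = poisson(lam, rng)    # online bagging weight for this instance
            if k > 0:
                model.learn(x, y, k)
        ensemble_hits += int(bool(votes) and max(votes, key=votes.get) == y)
        seen += 1
    return models, ensemble_hits / max(seen, 1)
```

Each model sees the same stream but a different feature subset and a different Poisson weight per instance, which is the streaming analogue of a random patch.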
D. SRP Theoretical Insights
Bagging is well-known in the machine learning literature for
its effect on reducing variance, both in regression and classi-
fication [9], [26], which allows it to perform competitively in
a wide range of scenarios, including data streams [5], [15].
In theory, the reduction of the error is strictly related to how
uncorrelated prediction errors are [9]. Entirely uncorrelated
predictions are rarely achievable in practice, yet it is achieved
to some extent by encouraging diversity among the learning
models [27]. This itself implies a need to use unstable learners.
The standard (batch, unpruned) decision tree is a prime
example of an unstable learner: small changes to a training
sample can result in remarkably different models, and thus
diversity among predictions. Indeed, one readily observes that
decision trees are used throughout the literature.
In the context of data streams, Hoeffding trees [28] are the
popular choice of decision tree, since they are incremental.
However, crucially, Hoeffding trees – unlike their batch coun-
terparts – are in fact stable learners. As far as we are aware
we are among the first to focus on this fact in the context of
Splitting is supported statistically under the Hoeffding
bound. This guarantees to a certain (user-specified) confidence
level that under a sufficiently large number of examples a
Hoeffding tree built incrementally will be equivalent to a
batch-built tree. Until such a number of examples is seen,
however, Hoeffding trees will not grow and this implies
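The text does not restate the bound itself. In its standard form for Hoeffding trees [28], after $n$ observations of a quantity with range $R$, the true mean lies within $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$ of the observed mean with probability $1 - \delta$, and a split is made once the gap between the two best candidate splits exceeds $\epsilon$. The sketch below evaluates this bound; the function names and the example values are illustrative assumptions.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the
    observed mean with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the bound drops to epsilon or below."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta)
                     / (2.0 * epsilon ** 2))
```

For instance, with $R = 1$ and $\delta = 10^{-7}$, separating two split candidates whose merit differs by 0.05 takes several thousand examples at a leaf, which illustrates why a tree may not grow on short stream segments.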
Formally, we may measure the stability of an algorithm as,
for example, hypothesis stability. In the following we adapt
the discussion of [29] to the streaming setting.
Suppose that $A_S$ denotes the decision function $f$ (e.g., a decision tree) induced by an algorithm $A$ (e.g., C4.5, or a Hoeffding tree inducer) over a data stream segment $S$ of pairs $(x_k, y_k)$ (the segment is of length $|S| = n$). Let also $S^{\setminus i}$ represent $S$ without the $i$-th sample. Then hypothesis stability can be expressed as
$$\mathbb{E}_{(x,y)}\left[\,\left|\ell(A_S, (x, y)) - \ell(A_{S^{\setminus i}}, (x, y))\right|\,\right] < \beta$$
under evaluation function/metric $\ell$.
This captures the intuition that if we remove a sample from the stream, the absolute difference between the error of a model trained on this new segment and the error of the same model built on the original segment should be less than $\beta$ (thus indicating its stability in terms of $\beta$).
We cannot compute this exactly unless we know the true generating distribution (‘concept’ in stream terminology) from which $(x_k, y_k)$ pairs are drawn. However, by replacing the expectation with a sum over leave-one-out samples from a real stream, we can empirically investigate and compare the ‘$\beta$-stability’ among learning algorithms with regard to such a metric.
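An empirical version of this leave-one-out comparison might look as follows. The sketch replaces the expectation with an average over a fixed set of evaluation points and reports the largest change over all removed samples, which is one possible choice and an assumption here. The two toy learners (majority class and 1-nearest-neighbour on a single feature) are stand-ins for a stable and an unstable algorithm and are not Hoeffding trees.

```python
from collections import Counter

def fit_majority(train):
    """Very stable learner: always predicts the most frequent label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def fit_1nn(train):
    """Unstable learner: label of the closest training point."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def zero_one_loss(model, x, y):
    return float(model(x) != y)

def loo_stability(train, eval_points, fit, loss):
    """Largest mean absolute loss change caused by removing one sample."""
    full = fit(train)
    base = [loss(full, x, y) for x, y in eval_points]
    worst = 0.0
    for i in range(len(train)):
        model = fit(train[:i] + train[i + 1:])
        change = sum(abs(b - loss(model, x, y))
                     for b, (x, y) in zip(base, eval_points)) / len(eval_points)
        worst = max(worst, change)
    return worst

# Toy segment: label is 1 for x >= 7, evaluated at two fixed points.
train = [(float(v), int(v >= 7)) for v in range(10)]
eval_points = [(0.1, 0), (6.9, 0)]
beta_majority = loo_stability(train, eval_points, fit_majority, zero_one_loss)
beta_1nn = loo_stability(train, eval_points, fit_1nn, zero_one_loss)
```

On this toy segment the majority learner never changes its output when a sample is removed, while removing the point at $x = 7$ flips the 1-nearest-neighbour prediction at $x = 6.9$.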
Repeatedly rebuilding models on relatively small samples
of instances is unavoidable in a stream which may experience
drift, implying that trees must be fully or partially regrown. By
small we mean “insufficiently large wrt the Hoeffding bound”.
These episodes add up over the life of a stream to a non-
negligible loss of accuracy.
Suppose that this number is $n$. Like any well-regularized algorithm, a Hoeffding tree does not adhere strongly to the principle of empirical risk minimization, but rather it is forced to accept many errors as a trade-off for long-term similarity to a batch-built tree. This is a problem in terms of Hoeffding tree ensembles, since these errors are likely to be the same, rendering the ensemble decisions likely to be useless (no advantage compared to a single model). In terms of the bias-variance trade-off, variance goes down at the cost of bias due to Hoeffding stability [30]. However, bagging-based ensemble schemes are primarily for reducing variance and may even increase bias; since variance has already been reduced by stability, bagging is not likely to have a positive effect.
This provides a suitable explanation as to why our proposed
SRP method performs well: by effectively reducing the feature
space of individual trees, Hoeffding trees are operating on a
‘sub-concept’, and are stable wrt that concept but unstable wrt
the complete concept, meaning that the variance reduction of
an ensemble still has a beneficial effect.
Furthermore, one reason Random Subspaces are so beneficial in the data stream setting is that we can look at decision trees as adaptive nearest neighbours [31], and at Random Subspaces as transformations that preserve the Euclidean geometry [32]. Decision trees split the overall space into several regions, one for each of their leaves. The prediction for the instances in each leaf is based on the majority vote of the instances in that leaf. We can consider the instances in that leaf as the neighbours of the instances to predict. Random Subspaces are linear transformations that map instances to another space while preserving their Euclidean geometry, a very useful property when applied to nearest neighbours. This is due to the fact that there exist Johnson-Lindenstrauss guarantees that Random Subspaces approximately preserve the Euclidean geometry of the data with high probability, as shown in Lemma 1 [32].
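Lemma 1. Let $X = \{x_{-(N-1)}, \ldots, x_{-1}, x_0\}$ be a sequence of observations collected over time, containing input examples in which $x_i \in \mathbb{R}^n$ for every $i \in \{-(N-1), \ldots, -1, 0\}$, $n \geq 1$, and satisfying $\|x_i\|_\infty^2 \leq \frac{c}{n}\|x_i\|_2^2$, where $c \in \mathbb{R}^+$ is a constant with $1 \leq c \leq n$. Let $\epsilon, \delta \in (0, 1]$, and let $k \geq \frac{c^2}{2\epsilon^2} \ln \frac{N^2}{\delta}$ be an integer. Let $RS$ be a random subspace projection from $\mathbb{R}^n \mapsto \mathbb{R}^k$. Then with probability at least $1 - \delta$ over the random draws of $RS$ we have, for every $i, j \in \{-(N-1), \ldots, -1, 0\}$:
$$(1 - \epsilon)\|x_i - x_j\|_2^2 \leq \frac{n}{k}\|RS(x_i) - RS(x_j)\|_2^2 \leq (1 + \epsilon)\|x_i - x_j\|_2^2.$$
The required subspace dimension $k$ is logarithmic in the number of examples, but with a larger constant term.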
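The intuition behind such a guarantee can be checked on a toy example. If a random subspace keeps $k$ of the $n$ coordinates chosen uniformly at random, each coordinate survives with probability $k/n$, so the squared distance in the subspace rescaled by $n/k$ equals the full squared distance on average. The sketch below verifies this by enumerating every $k$-subset; it illustrates only this unbiasedness, not the high-probability statement of Lemma 1, and the function names are chosen here for illustration.

```python
from itertools import combinations

def sq_dist(a, b, coords):
    """Squared Euclidean distance restricted to the given coordinates."""
    return sum((a[i] - b[i]) ** 2 for i in coords)

def mean_rescaled_sq_dist(a, b, k):
    """Average of (n/k) * subspace distance over all k-subsets of coordinates."""
    n = len(a)
    subsets = list(combinations(range(n), k))
    return sum((n / k) * sq_dist(a, b, s) for s in subsets) / len(subsets)

a = [1.0, 0.0, 2.0, -1.0, 3.0]
b = [0.0, 1.0, 0.0, 1.0, 1.0]
full = sq_dist(a, b, range(len(a)))
averaged = mean_rescaled_sq_dist(a, b, k=2)
```

How tightly individual subspaces concentrate around this average depends on how evenly the distance is spread across coordinates, which is the role a regularity constant such as $c$ plays in results of this kind.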
Finally, another explanation of the success of Random
Patches is dropout [33]. Dropout is a technique used in
Deep Learning to improve the accuracy of Neural Networks,
randomly removing neurons. Random Patches uses a similar
technique to sample instances and attributes, removing many
of them, in an efficient random way.
Thus, overall, our proposal creates an artificially smaller feature space, which encourages faster growth and, even when tree growth is conservative, can encourage disagreement (avoid correlation) among the leaf classifiers even if they would be stable models if run outside the context of such an ensemble. Empirical results are given in Section V, which offer further support to these arguments.
IV. RELATED WORK
There is an extensive literature on ensemble methods for
data stream classification. This preference is counterintuitive
given the need for algorithms that use computational re-
sources judiciously. The justification for this preference is
attributable to the flexibility and high predictive performance
that ensemble models provide [14]. The seminal work of
Kolter and Maloof [3] introduced the Dynamic Weighted
Majority (DWM) ensemble method which featured heuristics
to cope with evolving data streams, such as removing base
models if their weight dropped below a given threshold, and
adding new ones according to the global performance of the
ensemble. DWM introduces a hyperparameter to control the
period (window) between base model additions, removals and
weight updates. Similarly to DWM, the Online Accuracy
Updated Ensemble (OAUE) [4] algorithm relies on a window
hyperparameter to determine which instances will be used to
train a new base model (candidate) and if it should replace the
base model that achieved the least classification performance
in the latest window of instances. OAUE does not use an active
drift detection approach; thus it relies on gradual resets of
the ensemble through candidates to adapt to concept drifts.
Also, it introduces a weighting mechanism that contributes to
the ensemble adaptation to concept drifts, since the weighting
function is designed to assign higher impact to predictions
on recently presented instances. Note that DWM and OAUE
use incremental base learners; however, they still require
the definition of a window to orchestrate their adaptation
techniques to evolving data.
Many ensemble methods for data stream learning exploit
strategies developed initially for batch learning. Online Bagging [15] trains base models on samples drawn from the data stream, simulating sampling with replacement as in the classical Bagging algorithm [9]. Chen et al. introduce a generalization of SmoothBoost [34], namely Online SmoothBoost (OB) [35], an algorithm that generates only smooth distributions that do not assign too much weight to single examples. OB is guaranteed to achieve an arbitrarily small error rate given that the number of weak learners and examples are sufficiently large.
Ensembles designed to cope with evolving data streams
combine decorrelating base models (e.g., bagging) and voting
(e.g., weighted majority vote [36]) with active drift recovery
strategies based on change detection algorithms. The Leverag-
ing Bagging (LB) [5] algorithm combines an adapted version
of Online Bagging [15] with the ADaptive WINdow (ADWIN)
drift detection algorithm, such that base models are selectively
reset whenever their corresponding ADWIN instance flags a
drift. Heuristic Updatable Weighted Random Subspaces
(HUWRS) [17] trains batch learners (C4.5 decision trees) on
random subspaces of features, following the Random Sub-
space Method (RSM) introduced by Ho [7]. HUWRS detects
virtual and real concept drift by computing the Hellinger
distance between the binned feature values of every base
model and the latest window of instances feature distribution
when labels are not available, and by computing Hellinger
distances between the feature distribution per class over the
latest window of instances, otherwise. The weighting of the
base models in HUWRS relies on the severity of the change
in the distribution of the features associated with its random
subspace. The Adaptive Random Forest (ARF) [6] and the
Dynamic Streaming Random Forest (DSRF) [37] both aim
to adapt the classic Random Forest [10] algorithm to streaming
data. Both ARF and DSRF use the incremental decision tree
algorithm Hoeffding tree [28]; however, they differ in how
the base trees are trained. ARF simulates resampling as in
Leveraging Bagging, while DSRF trains trees sequentially on
different subsets of data. Moreover, ARF uses a drift detection
and recovery strategy based on detecting warnings and drifts
per base tree, such that after a warning is triggered another
tree is created and trained without affecting the ensemble
predictions (background tree). If the warning escalates to a
drift detection, then the base tree is replaced by the background
tree.
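The warning and drift scheme described for ARF can be sketched as follows. This is a simplified illustration under stated assumptions: the `warning` and `drift` flags are passed in by the caller (in the paper they come from a change detector such as ADWIN), and `MajorityClass` and `EnsembleMember` are hypothetical names, with the toy learner standing in for a Hoeffding tree.

```python
from collections import Counter


class MajorityClass:
    """Toy incremental learner used here instead of a Hoeffding tree."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, x, y):
        self.counts[y] += 1

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class EnsembleMember:
    """One ensemble slot holding an active learner and an optional background one."""

    def __init__(self, make_learner):
        self.make_learner = make_learner
        self.active = make_learner()
        self.background = None

    def predict(self, x):
        # Only the active learner votes; the background learner stays silent.
        return self.active.predict(x)

    def learn(self, x, y, warning=False, drift=False):
        if drift:
            # Promote the background learner, or start fresh if none exists.
            if self.background is not None:
                self.active = self.background
            else:
                self.active = self.make_learner()
            self.background = None
        elif warning and self.background is None:
            # A warning starts a background learner trained in parallel.
            self.background = self.make_learner()
        self.active.learn(x, y)
        if self.background is not None:
            self.background.learn(x, y)


member = EnsembleMember(MajorityClass)
for _ in range(5):
    member.learn(None, "a")
member.learn(None, "b", warning=True)  # background learner starts here
member.learn(None, "b")
member.learn(None, "b", drift=True)    # background learner replaces the active one
print(member.predict(None))            # b
```

Because the background learner has only seen data since the warning, it already reflects the new concept when it is promoted.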
We briefly introduced the concepts of active and reactive
strategies for concept drift recovery and the vast literature in
ensemble learning for evolving data stream classification. We
refer the reader to [24] and [19] for further information on
concept drift, and to [14] for a detailed overview and taxonomy
of existing ensemble methods for data stream classification.
We evaluate the SRP implementation against state-of-the-art
classification algorithms, concerning both predictive
performance and computational resource usage. To analyze the
diversity among base models in our newly proposed methods, we
present plots depicting the average pairwise kappa over time.
Also, to analyze how fast (and deep) the base trees are grown
by each ensemble strategy we include plots of the average tree
depth over time. We assess predictive performance through ac-
curacy results using a test-then-train evaluation strategy, where
every instance is used first for testing and then for training.
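The test-then-train protocol can be written as a short loop. The sketch below assumes a model exposing `predict` and `learn` methods; `LastLabel` and `prequential_accuracy` are hypothetical names used only for illustration.

```python
class LastLabel:
    """Toy incremental learner: predicts the most recently seen label."""

    def __init__(self):
        self.last = None

    def predict(self, x):
        return self.last

    def learn(self, x, y):
        self.last = y


def prequential_accuracy(stream, model):
    """Use each instance first for testing, then for training."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:  # test first
            correct += 1
        model.learn(x, y)          # then train
        total += 1
    return correct / total if total else 0.0


stream = [([0], "a"), ([1], "a"), ([2], "a"), ([3], "b")]
print(prequential_accuracy(stream, LastLabel()))  # 0.5
```

Every instance contributes to the accuracy estimate before the model has seen its label, so no separate hold-out set is needed.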
The algorithms used in the comparisons are Hoeffding Trees
(HT), Naive Bayes (NB), Leveraging Bagging (LB), Adaptive
Random Forest (ARF), Online Accuracy Updated Ensemble
(OAUE), Dynamic Weighted Majority (DWM), and Online
Smooth Boosting (OB). HT and NB serve the purpose of
baselines since they are single classifiers often used in data
stream classification. LB and ARF are ensemble methods that
consistently outperform other ensemble classifiers, as shown
in [6] on a benchmark similar to the one used in this work.
OB represents a boosting adaptation to online learning, while
DWM and OAUE are ensemble methods explicitly developed
for data stream classification that rely on different heuristics
to address concept drift.
To analyze how SRP compares to “simple” variants of
itself, we present two variations in the experiments, namely
the Streaming Random Subspaces (SRS) and a Bagging-like
strategy (BAG). SRS trains on random subspaces of features as
in SRP, but on all instances without simulating bootstraps, while
BAG only simulates bagging using all features. In the online
resources³ we provide two tables analyzing the impact of m
in SRP (ranging from 10% up to 100%, the latter being the same
as the variant BAG) and of λ = 1, which impacts the bagging
simulation. The experiments in the paper summarize the results
in the online resources; still, they are available to the interested reader.
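The subspace part of the three variants can be illustrated with a few helper functions. This is a sketch under assumptions: the helper names are hypothetical, and rounding m to the nearest whole number of features is a choice made here, not a rule stated in the paper. SRS corresponds to drawing a subspace with m below 100% and training once on every instance, BAG to keeping all features (m = 100%) with resampling, and SRP to combining both.

```python
import random


def subspace_size(m_percent, num_features):
    """Features kept per base model; rounding to nearest is an assumption."""
    return max(1, round(num_features * m_percent / 100))


def draw_subspace(num_features, m_percent, rng):
    """Pick one global feature subset (by index) for a base model."""
    k = subspace_size(m_percent, num_features)
    return sorted(rng.sample(range(num_features), k))


def project(x, subspace):
    """Keep only the features of instance x that belong to the subspace."""
    return [x[i] for i in subspace]


rng = random.Random(7)
subspace = draw_subspace(num_features=7, m_percent=60, rng=rng)
print(len(subspace))  # 4 features out of 7
print(project([10, 11, 12, 13, 14, 15, 16], subspace))
```

The subset is drawn once per base model and reused for every instance, which is the global randomization that distinguishes random subspaces from the per-split randomization of random forests.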
Fig. 2: Nemenyi test (95% confidence level): n = 10 base models on the left (CD = 3.757) and
n = 100 base models on the right (CD = 3.031). The avg rank obtained in the SPAM dataset for
n = 100 was not considered for any learner since there are no results for LB and BAG.
Regarding hyperparameters, we use HT as the base learner
for all the ensemble methods. The default subspace size is
m = 60% for SRS, SRP, and ARF, except for experiments
with the high-dimensionality dataset SPAM and n = 100,
where m = 10% (Table II). In the online resources (Section
A) we present complementary experiments varying m from
10% up to 100% (equivalent to BAG) in all datasets. The
HT grace period was set to GP = 50, the split confidence to
c = 0.01⁴, and the decision strategy used at leaves was Naive
Bayes Adaptive, i.e., either Naive Bayes or Majority vote is
used at a leaf depending on which one is more accurate [38].
This HT configuration tends to generate splits earlier at the
expense of processing time [6]. ADWIN is used as a drift
detector for all ensembles that rely on active drift detection
(i.e., ARF, LB, SRP, SRS, and BAG). The δ parameter, which
controls the confidence in the change detected, was defined
as δ = 0.0001 for warning detection and δ = 0.00001 for
drift detection in ARF, SRP, SRS and BAG. In LB, δ was set
according to its default value [5], i.e., δ = 0.002.
⁴ GP and c were originally identified as nmin and δ by Domingos and
Hulten [28]; however, we chose to keep their acronyms as in the Massive
Online Analysis (MOA) framework to facilitate reproducibility.
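The role of the split confidence can be made concrete with the Hoeffding bound from [28], ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split criterion and n the number of instances observed at a leaf. The function below is a minimal sketch; the name `hoeffding_bound` and the input values are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the observed
    mean with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


# Illustrative values: information gain on a two-class problem has range
# R = log2(2) = 1; delta and n are example inputs.
eps = hoeffding_bound(value_range=math.log2(2), delta=0.01, n=50)
print(round(eps, 4))  # 0.2146
```

A larger δ or more observed instances gives a smaller ε, so the tree needs a smaller gap between the two best candidate splits before it commits to one.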
The datasets used in the experiments include 6 synthetic
data streams and 7 real datasets. The synthetic datasets sim-
ulate abrupt, gradual, and incremental drifts, while the real
datasets have been thoroughly used in the literature to assess
data stream classifiers. Further information concerning the
datasets, instructions on how to execute the experiments and
other details for reproducibility are available in Appendix A.
A. Streaming Random Patches vs. Others
The results presented in Table I show how SRP compares
against other algorithms. Similarly, Table II presents how SRP
and other ensembles perform when configured to use n = 100
learners. Besides presenting the average ranking (Avg Rank)
for each algorithm, we also highlight the average ranking
for the synthetic datasets (Avg Rank Synt.) and the average
ranking for real-world datasets (Avg Rank Real). The reason
to report these rankings separately is that some techniques
may perform better on synthetic data while not performing so
well overall, and it is important to highlight and discuss that.
Good performance on the synthetic datasets may indicate
an effective drift recovery strategy; however, synthetic data
stream concepts tend to be simple or biased towards a specific
learning algorithm, therefore an algorithm that produces good
results only on synthetic data may offer less credibility. We
apply the methodology presented in [39] to compare results
among several datasets and algorithms for the experiments
presented in Tables I and II. We first attempt to reject the
hypothesis that all learners produce equivalent results using a
Friedman test at a significance level α = 0.05. The Friedman
test indicated significant differences for both sets of results and it
was followed by a post-hoc Nemenyi test. Figure 2 presents
the results for the post-hoc Nemenyi test. We note that no
significant difference has been found among SRP, BAG, ARF,
LB, SRS and OAUE using n = 10, while using n = 100
there was no significant difference among SRP, SRS, BAG,
ARF and LB.
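The critical distance (CD) of the Nemenyi test follows the formula in [39], CD = q_α sqrt(k(k + 1) / (6N)), for k learners compared over N datasets. The sketch below assumes the tabulated critical values q_0.05 = 3.164 for k = 10 and q_0.05 = 3.031 for k = 8; the function name `nemenyi_cd` is hypothetical.

```python
import math

# Critical values q_alpha (alpha = 0.05) for the two-tailed Nemenyi test,
# indexed by the number of compared learners k (tabulated values).
Q_ALPHA_005 = {8: 3.031, 10: 3.164}


def nemenyi_cd(k, n_datasets, q_alpha):
    """Two learners differ significantly if their average ranks differ by more than CD."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


# n = 10 experiments: 10 learners over 13 datasets.
print(round(nemenyi_cd(10, 13, Q_ALPHA_005[10]), 3))  # 3.757
# n = 100 experiments: 8 learners over 12 datasets (SPAM excluded).
print(round(nemenyi_cd(8, 12, Q_ALPHA_005[8]), 3))    # 3.031
```

Both values agree with the CD values reported with Figure 2.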
We can observe the influence of the m hyperparameter when
we compare SRP and BAG results. For example, in AIRLINES,
even though the number of features is only 7, using m = 60%
produced better results than BAG, as shown in Tables I and
II, while intuitively it seems that using all features for
low-dimensionality datasets is better. For the SPAM dataset, SRP,
ARF and SRS were configured with m = 10% for the n = 100
experiments, as m = 60% failed to finish. LB and BAG
could not finish; both failed after around 60% of the execution, as
the 100GB maximum memory allocation pool was insufficient.
SRP with n = 10 performs well in the real datasets, but
not as well in the synthetic datasets as BAG and LB, which
are very similar models (i.e., both use all features and simulate
resampling). However, in the experiments using n = 100,
the algorithms that exploit random subspaces (ARF, SRP,
SRS) benefited the most from the addition of more learners,
followed by BAG and LB. This characteristic of ARF, SRP,
and SRS can be attributed to them being able to cover a
larger number of subspaces of features. OB and DWM
improved in comparison to their results using n = 10, while
OAUE decreased its performance. OAUE obtains results far
below NB and HT for the KDD99, ADS and NOMAO datasets
while performing well in the synthetic datasets with simulated
concept drifts.
TABLE I: Test-then-train accuracy (%) using n = 10 base models.
LED(A) 53.964 69.032 73.918 74.007 73.742 69.898 73.945 73.588 73.533 73.944
LED(G) 54.02 68.649 73.076 73.167 72.723 69.562 73.01 72.416 72.296 73.151
AGR(A) 65.739 81.045 86.954 90.932 82.97 84.91 85.646 91.788 91.558 85.733
AGR(G) 65.759 77.374 80.709 86.339 79.418 79.73 79.885 87.762 88.538 81.347
RBF(M) 30.994 45.491 84.714 78.581 57.81 69.894 84.49 83.28 81.685 85.431
RBF(F) 29.136 32.292 74.102 50.021 54.861 42.915 70.715 70.825 59.061 74.891
AIRLINES 64.55 65.078 62.319 66.637 63.88 65.184 65.786 66.776 67.085 61.296
ELEC 73.362 79.195 90.157 88.275 87.756 85.253 88.718 88.82 89.4 89.502
COVTYPE 60.521 80.312 94.861 90.17 88.286 90.327 94.691 95.254 92.764 95.467
KDD99 95.603 99.903 99.974 2.473 99.951 99.944 99.975 99.984 99.979 99.972
ADS 68.161 85.91 99.665 15.401 97.499 86.917 99.726 99.756 98.445 99.665
NOMAO 86.865 92.128 97.035 58.407 95.462 94.252 97.055 97.232 96.451 97.064
SPAM 74.571 79.043 94.745 80.899 89.286 89.489 96.214 95.967 92.954 94.745
Avg Rank 9.54 8.29 3.5 7 7 6.86 3.57 2.43 3.71 3.36
Avg Rank Synt. 10 9 3.33 3.5 6.67 7.5 4.17 3.67 4.5 2.67
Avg Rank Real 9.14 8.14 4 7.71 6.86 6.57 3.14 1.86 3.71 3.86
TABLE II: Test-then-train accuracy (%) using n = 100 base models. Underlined results mean the performance increased in
comparison to the n = 10 version. BAG and LB did not finish execution for the SPAM dataset.
LED(A) 73.953 73.393 73.958 72.475 73.96 74.027 74.04 73.975
LED(G) 73.225 72.582 73.031 72.117 73.094 73.233 73.179 73.215
AGR(A) 88.717 90.164 88.299 90.374 87.929 92.869 92.807 86.663
AGR(G) 83.713 85.244 79.437 87.834 82.288 89.651 90.259 82.52
RBF(M) 84.338 84.262 60.977 74.514 86.958 86.039 84.821 86.671
RBF(F) 76.771 57.147 54.531 48.698 76.291 76.375 61.622 77.686
AIRLINES 62.82 65.229 64.025 64.556 66.417 68.564 68.303 62.093
ELEC 89.508 87.407 87.754 89.515 89.672 89.859 90.267 89.822
COVTYPE 95.104 92.857 88.519 92.695 94.967 95.348 93.461 95.288
KDD99 99.965 2.445 99.951 99.936 99.972 99.981 99.973 99.974
ADS 99.634 15.401 97.499 90.393 99.695 99.726 98.353 99.634
NOMAO 97.072 58.233 95.462 96.393 97.197 97.383 96.57 97.226
SPAM NA 80.781 89.361 86.519 97.319 97.437 95.924 NA
Avg Rank 4.46 6.33 6.67 6.17 4 1.58 3.17 3.63
Avg Rank Synt. 4.17 5.67 6.67 6.17 4.67 2 2.83 3.83
Avg Rank Real 4.75 7 6.67 6.17 3.33 1.17 3.5 3.42
B. Average Tree Depth and Diversity
To investigate how efficient SRP is in terms of inducing
diversity into the ensemble, we plot the average kappa over
time for AGR(G) and LED(A) in Figures 3 and 4. In Figure 6, we
can observe how the average kappa for BAG and ARF converges
after the same concept has been in place. We notice that SRS
and SRP obtain low values of average kappa in comparison
to ARF and BAG. However, when we take into account the
accuracy results in Table I, we can see that SRP or SRS do not
necessarily outperform BAG in these datasets, i.e., even if the
difference is small, ARF and BAG outperform SRS and SRP
in LED(A). These results corroborate the conclusions
of Stapenhurst [21] that ensemble diversity influences
the recovery from a concept drift; still, it is not as crucial as
the actual drift detection and recovery strategy. In AGR(G),
SRS and SRP outperform ARF and BAG; still, the average
kappa values in this experiment are quite similar, thus no
clear conclusions can be made about why SRS and SRP
perform better based solely on the average pairwise kappa. The
overall conclusion is that increasing diversity is not enough to
improve accuracy. Therefore, we complement our analysis of
the ensembles' diversity by presenting the average tree depth
in Figures 5 and 6. SRP consistently grows trees faster and
deeper than ARF, SRS and BAG. Splitting sooner can lead to
overfitting the models or to splits that could use a better feature
(or split point) if more instances were observed. However, as
shown by the SRP predictive performance in the empirical
experiments, it can be beneficial to an ensemble strategy.
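The average pairwise kappa used as the diversity measure above can be computed from the predictions of the base models. The sketch below uses Cohen's kappa between every pair of prediction lists; the function names are hypothetical, and how predictions are windowed over time is left out.

```python
from itertools import combinations


def cohen_kappa(preds_a, preds_b):
    """Agreement between two prediction lists, corrected for chance agreement."""
    n = len(preds_a)
    labels = set(preds_a) | set(preds_b)
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    expected = sum(
        (preds_a.count(label) / n) * (preds_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both models always predict the same single label
    return (observed - expected) / (1.0 - expected)


def average_pairwise_kappa(all_preds):
    """Mean kappa over all pairs of base models; lower means more diverse."""
    pairs = list(combinations(all_preds, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


identical = cohen_kappa([0, 1, 0, 1], [0, 1, 0, 1])
chance = cohen_kappa([0, 0, 1, 1], [0, 1, 0, 1])
print(identical, chance)  # 1.0 0.0
```

A kappa of 1 means two models always agree, while values near 0 mean they agree no more often than chance, so a low average indicates a diverse ensemble.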
Fig. 3: AGR(G) - Avg kappa over time (n = 10); x-axis: # Instances.
Fig. 4: LED(A) - Avg kappa over time (n = 10); x-axis: # Instances.
Fig. 5: AGR(G) - Avg tree depth over time (n = 10); x-axis: # Instances.
Fig. 6: LED(A) - Avg tree depth over time (n = 10).
C. Time and Memory Usage Analysis
The computational resources are estimated based on the
CPU time and RAM Hours (across all the experiments). The
results for n = 100 are presented in Figures 7 and 8⁵. We
note that SRP performs similarly to LB, requires fewer resources
than BAG, but demands more resources than SRS and ARF.
The SRS efficiency is attributable to the fact that it does not
simulate resampling. In SRP, BAG, ARF and LB each learner
is trained on each instance, on average, λ times, where
λ = 6 in our experiments. If we use Poisson(λ = 1) we
also increase the chances of obtaining zeros (i.e., not using the
instance for training), which positively affects the memory and
processing time (training on fewer instances), but negatively impacts
the classification performance as the base models are trained
on fewer instances.
⁵ The results in Figures 7 and 8 exclude SPAM CPU Time and RAM Hours
for all algorithms, since BAG and LB did not finish executing.
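The trade-off between λ = 6 and λ = 1 can be quantified directly: for k drawn from Poisson(λ), the probability of k = 0 is e^(-λ) and the expected number of training repetitions is λ. The helper name `prob_zero` below is a hypothetical one used for illustration.

```python
import math


def prob_zero(lam):
    """P(k = 0) for k ~ Poisson(lam): the chance an instance is skipped."""
    return math.exp(-lam)


for lam in (1.0, 6.0):
    print(f"lambda={lam}: P(skip)={prob_zero(lam):.4f}, mean copies={lam}")
# lambda=1.0: P(skip)=0.3679, mean copies=1.0
# lambda=6.0: P(skip)=0.0025, mean copies=6.0
```

With λ = 1 a base model skips roughly 37% of the instances, whereas with λ = 6 it skips almost none but performs about six times as many training updates.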
Fig. 7: CPU time (n= 100).
Fig. 8: RAM Hours usage (n = 100).
In this work, we have taken an in-depth look at the
performance of Random Subspaces and Bagging ensemble
methods and their application to streams. In particular,
following theoretical considerations and empirical investigations,
we developed and presented the Streaming Random Patches
(SRP) method. SRP is a combination of Random Subspaces
and Online Bagging, as each base model is trained on a
random patch of data (i.e., a random subset of features and
instances). We show how SRP can be highly accurate on
many benchmark streaming scenarios, and compare it against
several ensemble methods for data stream classification,
including bagging, boosting and random forest variations. We
discussed the differences, and similarities, between SRP and
the Adaptive Random Forest (ARF) algorithm. We showed
how SRP compared against a Streaming Random Subspaces
(SRS) method and a Bagging method using the same drift
detection and recovery strategies. We highlight that SRP has
the same number of hyperparameters as ARF; still, it can also
be used to train base models that are not decision trees.
We discussed and demonstrated how methods using random
subspaces yield several significant advantages, such as
diversity enhancement (even for stable methods), which is
particularly suited to Hoeffding-tree-based methods, which can
be seen as stable methods. On top of that, these methods tend
to improve in accuracy with the addition of more base models.
As only a subset of features is considered, training is
efficient compared to popular existing methods for data stream
classification, such as Leveraging Bagging. Furthermore, even
though it is beyond the scope of this paper, our method is
particularly well suited to distributed computation,
as the base models are independent. These characteristics set
out an exciting path for future investigation.
[1] L. Da Xu, W. He, and S. Li, “Internet of things in industries: A survey,”
IEEE Transactions on Industrial Informatics, vol. 10, no. 4, pp. 2233–2243,
2014.
[2] G. Widmer and M. Kubat, “Learning in the presence of concept drift
and hidden contexts,” Mach. Learn., vol. 23, no. 1, pp. 69–101, Apr.
[3] J. Z. Kolter, M. Maloof et al., “Dynamic weighted majority: A new
ensemble method for tracking concept drift,” in Data Mining, 2003.
ICDM 2003. Third IEEE International Conference on. IEEE, 2003,
pp. 123–130.
[4] D. Brzezinski and J. Stefanowski, “Combining block-based and online
methods in learning ensembles from concept drifting data streams,”
Information Sciences, vol. 265, pp. 50–67, 2014.
[5] A. Bifet, G. Holmes, and B. Pfahringer, “Leveraging bagging for
evolving data streams,” in PKDD, 2010, pp. 135–150.
[6] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck,
B. Pfahringer, G. Holmes, and T. Abdessalem, “Adaptive random forests
for evolving data stream classification,” Machine Learning, pp. 1–27,
2017. [Online]. Available:
[7] T. K. Ho, “The random subspace method for constructing decision
forests,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 20, no. 8, pp. 832–844, 1998.
[8] L. Breiman, “Pasting small votes for classification in large databases
and on-line,” Machine learning, vol. 36, no. 1-2, pp. 85–103, 1999.
[9] ——, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–
140, 1996.
[10] ——, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32,
[11] P. Panov and S. Džeroski, “Combining bagging and random subspaces
to create better ensembles,” in International Symposium on Intelligent
Data Analysis. Springer, 2007, pp. 118–129.
[12] G. Louppe and P. Geurts, “Ensembles on random patches,” in Joint
European Conference on Machine Learning and Knowledge Discovery
in Databases. Springer, 2012, pp. 346–361.
[13] Y. Freund, R. E. Schapire et al., “Experiments with a new boosting
algorithm,” in ICML, vol. 96, 1996, pp. 148–156.
[14] H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet, “A survey
on ensemble learning for data stream classification,” ACM Comput.
Surv., vol. 50, no. 2, pp. 23:1–23:36, 2017. [Online]. Available:
[15] N. Oza and S. Russell, “Online bagging and boosting,” in Artificial
Intelligence and Statistics 2001. Morgan Kaufmann, 2001, pp. 105–
[16] L. I. Kuncheva, J. J. Rodríguez, C. O. Plumpton, D. E. Linden, and S. J.
Johnston, “Random subspace ensembles for fmri classification,” IEEE
transactions on medical imaging, vol. 29, no. 2, pp. 531–542, 2010.
[17] T. R. Hoens, N. V. Chawla, and R. Polikar, “Heuristic updatable
weighted random subspaces for non-stationary environments,” in Data
Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE,
2011, pp. 241–250.
[18] C. O. Plumpton, L. I. Kuncheva, N. N. Oosterhof, and S. J. Johnston,
“Naive random subspace ensemble with linear classifiers for real-time
classification of fmri data,” Pattern Recognition, vol. 45, no. 6, pp. 2101–
2108, 2012.
[19] G. I. Webb, R. Hyde, H. Cao, H. L. Nguyen, and F. Petitjean, “Charac-
terizing concept drift,” Data Mining and Knowledge Discovery, vol. 30,
no. 4, pp. 964–994, 2016.
[20] L. L. Minku, A. P. White, and X. Yao, “The impact of diversity on online
ensemble learning in the presence of concept drift,” IEEE Transactions
on Knowledge and Data Engineering, vol. 22, no. 5, pp. 730–742, 2010.
[21] R. J. Stapenhurst, “Diversity, margins and non-stationary learning,”
Ph.D. dissertation, University of Manchester, UK, 2012.
[22] A. Bifet, E. Frank, G. Holmes, and B. Pfahringer, “Ensembles of
restricted hoeffding trees,” ACM TIST, vol. 3, no. 2, pp. 30:1–30:20,
2012. [Online]. Available:
[23] A. Bifet and R. Gavaldà, “Learning from time-changing data with
adaptive windowing,” in SIAM, 2007.
[24] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A
survey on concept drift adaptation,” ACM Computing Surveys, vol. 46,
no. 4, pp. 44:1–44:37, Mar. 2014.
[25] I. Žliobaitė, “Change with delayed labeling: When is it detectable?” in
Data Mining Workshops (ICDMW), 2010 IEEE International Conference
on. IEEE, 2010, pp. 843–850.
[26] P. M. Domingos, “A unified bias-variance decomposition for zero-one
and squared loss,” in AAAI 2000, 2000, pp. 564–569.
[27] L. I. Kuncheva, “That elusive diversity in classifier ensembles,” in
Iberian conference on pattern recognition and image analysis. Springer,
2003, pp. 1126–1138.
[28] P. Domingos and G. Hulten, “Mining high-speed data streams,” in
Proceedings of the sixth ACM SIGKDD international conference on
Knowledge discovery and data mining. ACM SIGKDD, Sep. 2000,
pp. 71–80.
[29] O. Bousquet and A. Elisseeff, “Stability and generalization,” Journal of
Machine Learning Research, vol. 2, pp. 499–526, 2002.
[30] E. Ikonomovska, J. Gama, and S. Džeroski, “Learning model trees from
evolving data streams,” Data mining and knowledge discovery, vol. 23,
no. 1, pp. 128–168, 2011.
[31] Y. Lin and Y. Jeon, “Random forests and adaptive nearest neighbors,”
Journal of the American Statistical Association, vol. 101, no. 474, pp.
578–590, 2006.
[32] N. Lim and R. J. Durrant, “Linear dimensionality reduction
in linear time: Johnson-lindenstrauss-type guarantees for random
subspace,” arXiv, vol. 1705.06408, 2017. [Online]. Available: https:
[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: a simple way to prevent neural networks
from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1,
pp. 1929–1958, 2014.
[34] R. A. Servedio, “Smooth boosting and learning with malicious noise,”
Journal of Machine Learning Research, vol. 4, no. Sep, pp. 633–648,
[35] S.-T. Chen, H.-T. Lin, and C.-J. Lu, “An online boosting algorithm with
theoretical justifications,” in Proceedings of the International Conference
on Machine Learning (ICML), June 2012.
[36] N. Littlestone and M. K. Warmuth, “The weighted majority algorithm,”
Information and computation, vol. 108, no. 2, pp. 212–261, 1994.
[37] H. Abdulsalam, D. B. Skillicorn, and P. Martin, “Classifying evolving
data streams using dynamic streaming random forests,” in International
Conference on Database and Expert Systems Applications. Springer,
2008, pp. 643–651.
[38] G. Holmes, R. Kirkby, and B. Pfahringer, “Stress-testing hoeffding
trees,” in Knowledge Discovery in Databases: PKDD 2005, 2005, pp.
495–502. [Online]. Available:
[39] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,”
Journal of Machine Learning Research, vol. 7, pp. 1–30, Dec. 2006.
[Online]. Available:
[40] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “Moa: Massive online
analysis,” The Journal of Machine Learning Research, vol. 11, pp. 1601–
1604, 2010.
A. Software and hardware
All the experiments were executed in the Massive Online
Analysis (MOA) framework [40] version 2019.04 build. De-
tails about the hardware configuration are shown below:
CPU: 40 cores, Intel(R) Xeon(R) CPU E5-2660 v3 @
Operating System: Ubuntu 16.04.5 LTS
Java Virtual Machine (JVM) version: JDK 1.8.0
JVM: Xmx = 100GB, Xms = 50MB
To reproduce the experiments the reader should access the
GitHub repository
which contains the code, the datasets used, and instructions
on how to execute the algorithm.
B. Datasets
Table III presents an overview of the datasets.
TABLE III: Datasets (Drifts: (A) Abrupt, (G) Gradual, (M)
Incremental (moderate) and (F) Incremental (fast)). AGR and
LED concept drifts are introduced every 250k instances.
Dataset   #Instances  #Features  Type       Drifts  #Classes
LED(A)    1,000,000   24         Synthetic  A       10
LED(G)    1,000,000   24         Synthetic  G       10
AGR(A)    1,000,000   9          Synthetic  A       2
AGR(G)    1,000,000   9          Synthetic  G       2
RBF(M)    1,000,000   10         Synthetic  M       5
RBF(F)    1,000,000   10         Synthetic  F       5
AIRL      539,383     7          Real       -       2
ELEC      45,312      8          Real       -       2
COVT      581,012     54         Real       -       7
KDD99     4,898,431   41         Real       -       23
ADS       3,279       1,559      Real       -       2
NOMAO     34,465      119        Real       -       2
SPAM      9,324       39,917     Real       -       2
... The Streaming Random Patches (SRP) algorithm [91] is another HT-based ensemble method for evolving data streams. It extends the idea of random patches and random subspace methods to the streaming setting. ...
... On the other hand, the 'drift detector' hyperparameter is selected from two methods, namely ADWIN and EDDM, to effectively handle both gradual and sudden drifts in the data. Additionally, the framework's performance in handling PLA problems is evaluated by comparing it with state-of-the-art online ML models, including HT [89], Leveraging Bagging (LB) [96], Aggregated Mondrian Forest (AMF) [97], Extremely Fast Decision Tree (EFDT) [98], Hoeffding Adaptive Tree (HAT) [99], ARF [90], and SRP [91]. ...
... The same five metrics -accuracy, precision, recall, F1-score, and average learning time per sample -are used to assess the effectiveness and efficiency of the proposed framework in intrusion detection tasks. Furthermore, the performance of the proposed framework is compared with that of the same state-of-the-art online learning methods used in the PLA experiments, including HT [89], LB [96], AMF [97], EFDT [98], HAT [99], ARF [90], and SRP [91]. ...
Full-text available
The transition from 5G to 6G networks necessitates network automation to meet the escalating demands for high data rates, ultra-low latency, and integrated technology. Recently, Zero-Touch Networks (ZTNs), leveraging AI and ML, have emerged as a promising solution for enhancing automation in 5G/6G networks but face significant challenges. Specifically, they are vulnerable to cyber-attacks, and the development of AI/ML-based cybersecurity mechanisms requires substantial specialized expertise and encounters model drift issues. Therefore, this paper proposes an automated security framework targeting Physical Layer Authentication (PLA) and Cross-Layer Intrusion Detection Systems (CLIDS) to address security concerns at multiple Internet protocol layers. The proposed framework employs drift-adaptive online learning techniques and a novel enhanced Successive Halving (SH)-based Automated ML (AutoML) method to automatically generate optimized ML models for dynamic networking environments. Experimental results illustrate that the proposed framework achieves high performance on the public ORACLE RF fingerprinting and CICIDS2017 datasets, showcasing its effectiveness in addressing PLA and CLIDS tasks within dynamic and complex networking environments. Furthermore, the paper explores open challenges and research directions in the 5G/6G cybersecurity domain. This framework represents a significant advancement towards fully autonomous and secure 6G networks, paving the way for future innovations in network automation and cybersecurity.
... Then, the ensemble model reacts to this drift by adding new classifiers, updating them, or restarting their learning process. Adaptive Random Forest (ARF) [30], Leverage bagging [9], Adaptive Classifiers Ensemble (ACE) [54], Heterogeneous Dynamic Weighted Majority (HDWM) [39], comprehensive active learning method for multi-class imbalanced streaming data with concept drift (CALMID) [48], Streaming Random Patches (SRP) [31], and Robust online self-adjusting ensemble for continual learning on imbalanced drifting data streams (ROSE) [17] are among the well-known algorithms in this category. ...
... SRP [31] Streaming Random Patches is an ensemble method that utilizes both bagging and random subspaces. * Descriptions of each method are based on the information on scikit-mutliflow [53] web page or their original paper. ...
Full-text available
In a data stream environment, classification models must effectively and efficiently handle concept drift. Ensemble methods are widely used for this purpose; however, the ones available in the literature either use a large data chunk to update the model or learn the data one by one. In the former, the model may miss the changes in the data distribution, while in the latter, the model may suffer from inefficiency and instability. To address these issues, we introduce a novel ensemble approach based on the Broad Learning System (BLS), where mini chunks are used at each update. BLS is an effective lightweight neural architecture recently developed for incremental learning. Although it is fast, it requires huge data chunks for effective updates and is unable to handle dynamic changes observed in data streams. Our proposed approach, named Broad Ensemble Learning System (BELS), uses a novel updating method that significantly improves best-in-class model accuracy. It employs an ensemble of output layers to address the limitations of BLS and handle drifts. Our model tracks the changes in the accuracy of the ensemble components and reacts to these changes. We present the mathematical derivation of BELS, perform comprehensive experiments with 35 datasets that demonstrate the adaptability of our model to various drift types, and provide its hyperparameter, ablation, and imbalanced dataset performance analysis. The experimental results show that the proposed approach outperforms 10 state-of-the-art baselines, and supplies an overall improvement of 18.59% in terms of average prequential accuracy.
... Streaming Random Patches (SRP) (Gomes et al, 2019) was proposed in 2019, which is a combination of the Online Bagging (Oza and Russell, 2001) and Random Subspaces approach (Ho, 1998). The strategy used in SRP to enhance the ensemble performance is to construct random subspaces of instances and their attributes to feed the base classifiers with these random subspaces. ...
... PWPAE (Yang et al, 2021), as an ensemble-based classification, is designed for anomaly detection in IoT data streams . This Performance Weighted Probability Averaging Ensemble is designed based on the ARF (Gomes et al, 2017b) and SRP (Gomes et al, 2019) approaches. On the other hand, ADWIN and DDM drift detectors are used for tracking any concept changes in the stream. ...
Full-text available
Data streams are sequences of fast-growing and high-speed data points that typically suffer from the infinite length, large volume, and specifically unstable data distribution. These potential issues of data streams bold the necessity of data stream mining tasks. Ensemble learning as a prevalent classification approach is widely used in data stream mining studies. Besides the impressive performance of ensemble learning algorithms in providing a collection of diverse and accurate classifiers, they are specifically efficient in handling non-stationary data streams. Due to the component-based nature and the chance of dynamic updates for the components of the ensemble, this category is appropriate for dynamically learning the changing concepts of the data. This paper aims to provide a thorough review of the most significant ensemble-based data stream classification approaches, along with a discussion about the potential issues of the non-stationary data streams. Furthermore, comprehensive experimental analysis is performed to compare the classification performance of the well-known state-of-the-art ensemble-based data stream classification approaches on 24 synthetic non-stationary data streams. The superiority of the approaches is proved by conducting various statistical tests.
... Once drift is detected, adapting to the drift becomes crucial to update the ML models and maintain their performance in the context of data streams. There are three primary types of drift adaptation methods [ [73], and Performance Weighted Probability Averaging Ensemble (PWPAE) [65], integrate multiple incremental learning models to enhance continual learning performance. Ensemble learning models are generally efficient for handling gradual and recurring drifts but may struggle with abrupt drifts and increased computational complexity. ...
Full-text available
The sixth generation of wireless networks (6G) will require network automation to meet the rapidly increasing demands for high data rate services, ultra-low latency, massive connectivity, and seamless integration with emerging technologies, while effectively reducing operating costs. To address these demands, the concept of Zero-Touch Networks (ZTNs) has been proposed, where Artificial Intelligence (AI) and Machine Learning (ML) play crucial roles in optimizing network performance, enabling intelligent decision-making, and ensuring efficient resource allocation. However, the implementation of ZTNs is subject to security challenges that may hinder their development and deployment. In particular, two critical challenges arise: the need for human expertise in developing AI/ML-based security mechanisms, and the threat of specific attacks targeting AI/ML models. In this survey paper, a comprehensive review of the security vulnerabilities and issues with ZTNs is conducted. Additionally, potential automated solutions to ZTN security concerns, with a specific focus on leveraging Automated ML (AutoML) technologies, are investigated. Two case studies are conducted to address security issues in ZTNs and further corroborate our findings: the development of autonomous intrusion detection systems and the creation of defense mechanisms against Adversarial ML (AML) attacks. Finally, some of the challenges and future research directions for the development of ZTN security approaches are discussed.
... RP combines Bagging (Breiman, 1996) and Random Subspace (Ho, 1998), the two best-known generation methods. Because of this combination, RP can create more diverse samples and does not require prior assumptions about the data, which makes it a generation strategy that is adaptable to all time series and less sensitive to outliers (Louppe and Geurts, 2012; Panov and Džeroski, 2007; Gomes et al., 2019). ...
Because the size of a data stream and the arrival time and order of its instances cannot be determined in advance, data streams exhibit concept drift and class imbalance, which make the classification task difficult. To address concept drift and class imbalance, the Matthews Adaptive XGBoost (MAXGB) algorithm is proposed. It uses the Matthews correlation coefficient as the evaluation index of the base classifiers and can select the base classifiers that adapt to the current data stream. MAXGB uses random feature subspaces, resampling and sliding window methods to train base classifiers and to ensure diversity among them. The experimental analysis of MAXGB is carried out on both synthetic and real datasets. The results show that the average performance of MAXGB on the two classification performance indices, Kappa and G-mean, increases by 4.5% and 9.3%, respectively. At the same time, the average training speed of MAXGB increases by 19%. Keywords: Data Stream Mining, Imbalanced Data, Concept Drift, XGBoost
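As a rough illustration of using the Matthews correlation coefficient to rank base classifiers, the sketch below computes the standard binary MCC from confusion-matrix counts and picks the highest-scoring candidate. The function names, the dictionary layout, and the convention of returning 0 when the denominator is zero are assumptions made for this example; they are not taken from the MAXGB implementation.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Binary MCC from confusion-matrix counts.

    Returns 0.0 when any marginal is empty (a common convention, assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def best_by_mcc(confusions):
    """Pick the classifier name with the highest MCC.

    `confusions` is a hypothetical mapping: name -> (tp, tn, fp, fn).
    """
    return max(confusions, key=lambda name: matthews_corrcoef(*confusions[name]))

# Hypothetical counts for two base classifiers on a recent window.
candidates = {"clf_a": (5, 5, 0, 0), "clf_b": (3, 3, 2, 2)}
selected = best_by_mcc(candidates)
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which is why it is often preferred when classes are imbalanced.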
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integrated with drift detection algorithms and incorporate dynamic updates, such as selective removal or addition of classifiers. This work proposes a taxonomy for data stream ensemble learning as derived from reviewing over 60 algorithms. Important aspects such as combination, diversity, and dynamic updates, are thoroughly discussed. Additional contributions include a listing of popular open-source tools and a discussion about current data stream research challenges and how they relate to ensemble learning (big data streams, concept evolution, feature drifts, temporal dependencies, and others).
Most machine learning models are static, but the world is dynamic, and increasing online deployment of learned models gives increasing urgency to the development of efficient and effective mechanisms to address learning in the context of non-stationary distributions, or as it is commonly called concept drift. However, the key issue of characterizing the different types of drift that can occur has not previously been subjected to rigorous definition and analysis. In particular, while some qualitative drift categorizations have been proposed, few have been formally defined, and the quantitative descriptions required for detailed understanding of learner performance have not existed. We present the first comprehensive framework for quantitative analysis of drift. This supports the development of the first comprehensive set of formal definitions of types of concept drift. The formal definitions clarify ambiguities and identify gaps in previous definitions, giving rise to a new comprehensive taxonomy of concept drift types and a solid foundation for research into mechanisms to detect and address concept drift.
The success of simple methods for classification shows that it is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this article, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the log-odds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons' learning rate using the ADWIN change detection method for data streams, and also use ADWIN to reset ensemble members (i.e., Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption. We also show that our stacking method can improve the performance of a bagged ensemble.
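The combination step described above can be sketched as follows for a single class: each tree's probability estimate is converted to log-odds, and a sigmoid perceptron learns a weight per tree. This is a minimal sketch under stated assumptions, not the authors' implementation. The class name `SigmoidPerceptron`, the clipping constant `eps`, the log-loss gradient update, and the fixed learning rate are all hypothetical (the abstract sets the learning rate with ADWIN, which is not reproduced here).

```python
import math

def log_odds(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1 (eps is an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SigmoidPerceptron:
    """One perceptron for one class; inputs are the trees' log-odds."""

    def __init__(self, n_trees, learning_rate=0.1):
        self.weights = [0.0] * n_trees
        self.bias = 0.0
        self.learning_rate = learning_rate  # fixed here; an assumption

    def predict(self, tree_probs):
        x = [log_odds(p) for p in tree_probs]
        z = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return sigmoid(z)

    def update(self, tree_probs, target):
        """One stochastic gradient step on the log loss; target is 0 or 1."""
        x = [log_odds(p) for p in tree_probs]
        error = target - self.predict(tree_probs)
        for i, xi in enumerate(x):
            self.weights[i] += self.learning_rate * error * xi
        self.bias += self.learning_rate * error

# Two hypothetical trees that both favour the class; the true label is 1.
perceptron = SigmoidPerceptron(n_trees=2)
initial = perceptron.predict([0.9, 0.8])
for _ in range(20):
    perceptron.update([0.9, 0.8], target=1)
trained = perceptron.predict([0.9, 0.8])
```

With zero initial weights the perceptron starts at a neutral 0.5 and moves toward the trees that agree with the observed labels.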
Internet of Things (IoT) has provided a promising opportunity to build powerful industrial systems and applications by leveraging the growing ubiquity of radio-frequency identification (RFID), and wireless, mobile, and sensor devices. A wide range of industrial IoT applications have been developed and deployed in recent years. In an effort to understand the development of IoT in industries, this paper reviews the current research of IoT, key enabling technologies, major IoT applications in industries, and identifies research trends and challenges. A main contribution of this review paper is that it summarizes the current state-of-the-art IoT in industries systematically.
In this paper, we consider supervised learning under the assumption that the available memory is small compared to the dataset size. This general framework is relevant in the context of big data, distributed databases and embedded systems. We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset. We carry out an extensive and systematic evaluation of this method on 29 datasets, using decision tree-based estimators. With respect to popular ensemble methods, these experiments show that the proposed method provides on par performance in terms of accuracy while simultaneously lowering the memory needs, and attains significantly better performance when memory is severely constrained.
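To make the notion of a random patch concrete, the sketch below draws a random subset of instance indices and a random subset of feature indices, then extracts the corresponding sub-matrix. The function names, the fraction parameters, and the choice to sample without replacement are assumptions for illustration; the toy matrix `X` is made up.

```python
import random

def draw_random_patch(n_instances, n_features, instance_frac, feature_frac, rng):
    """Return (row_indices, feature_indices) for one random patch.

    Both subsets are drawn without replacement (an assumption of this sketch).
    """
    n_rows = max(1, round(instance_frac * n_instances))
    n_cols = max(1, round(feature_frac * n_features))
    rows = sorted(rng.sample(range(n_instances), n_rows))
    cols = sorted(rng.sample(range(n_features), n_cols))
    return rows, cols

def extract_patch(X, rows, cols):
    """Keep only the selected rows and columns of a list-of-lists matrix."""
    return [[X[r][c] for c in cols] for r in rows]

rng = random.Random(0)
# Toy dataset: 6 instances, 4 features; each value encodes its (row, column).
X = [[10 * r + c for c in range(4)] for r in range(6)]
rows, cols = draw_random_patch(len(X), len(X[0]), 0.5, 0.5, rng)
patch = extract_patch(X, rows, cols)
```

Each ensemble member would be trained on its own patch, so the memory needed per model scales with the patch size and not with the full dataset.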
Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
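The two phases described above can be sketched in a few lines: during training each unit is kept with some probability and zeroed otherwise, and at test time the outgoing weights are scaled down by that same probability. This is a minimal sketch on plain lists; the function names and the `keep_prob` parameter are assumptions for illustration, not an API from any library.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Training phase: keep each unit with probability keep_prob, else zero it."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def scale_weights_for_test(weights, keep_prob):
    """Test phase: use the full network with weights shrunk by keep_prob."""
    return [w * keep_prob for w in weights]

rng = random.Random(42)
hidden = [0.5, 1.2, 0.3, 0.9]          # hypothetical hidden-unit activations
thinned = dropout_train(hidden, keep_prob=0.5, rng=rng)
test_weights = scale_weights_for_test([1.0, 2.0], keep_prob=0.5)
```

Scaling the weights makes the expected input to each downstream unit at test time match what it received on average during training.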