Streaming Random Patches for Evolving Data Stream Classification

Heitor Murilo Gomes∗†, Jesse Read, Albert Bifet∗†
University of Waikato, Hamilton, New Zealand
{heitor.gomes, albert.bifet}@waikato.ac.nz
LTCI, Télécom Paris, IP-Paris, Paris, France
LIX, École Polytechnique, Palaiseau, France
jesse.read@polytechnique.edu
Abstract—Ensemble methods are a popular choice for learning
from evolving data streams. This popularity is due to (i) the
ability to simulate simple yet successful ensemble learning
strategies, such as bagging and random forests; (ii) the possibility
of incorporating drift detection and recovery in conjunction with the
ensemble algorithm; and (iii) the availability of efficient incremental
base learners, such as Hoeffding trees. In this work, we introduce
the Streaming Random Patches (SRP) algorithm, an ensemble
method specially adapted to stream classification which combines
random subspaces and online bagging. We provide theoretical
insights and empirical results illustrating different aspects of SRP.
In particular, we explain how the widely adopted incremental
Hoeffding trees are not, in fact, unstable learners, unlike their
batch counterparts, and how this fact significantly influences
the design and performance of ensemble methods. We compare SRP
against state-of-the-art ensemble variants for streaming data on
a multitude of datasets. The results show that SRP produces
high predictive performance for both real and synthetic datasets.
In addition, we analyze the diversity over time and the average
tree depth, which provide insights into the differences between
local subspace randomization (as in random forest) and global
subspace randomization (as in random subspaces).
Index Terms—Stream Data Mining, Ensemble Learning, Ran-
dom Subspaces, Random Patches
I. INTRODUCTION
Machine learning applications of data streams have grown
in importance in recent years due to the tremendous amount of
real-time data generated by networks, mobile phones and the
wide variety of sensors currently available. Building predictive
models from data streams is central to many applications [1].
The underlying assumption of data stream learning is that the
algorithms must process large amounts of data in a fast-paced
way. In a supervised learning scenario, this characteristic
brings forward two crucial challenges:
• Computational efficiency. The algorithm must use a
  limited budget of computational resources to be able to
  process examples at least as fast as new examples are
  available;
• Evolving data. The continuous flow of data might be subject
  to changes over time, where the canonical example
  is concept drift [2]. Concept drifts can be characterized
  as changes in the underlying data distribution that affect
  the fitted model, such that to maintain its predictive
  performance the model must be updated or even reset.
To tackle evolving data, many strategies have been proposed, with
particular attention to ensemble-based methods. Ensembles are
often used to cope with concept drifts by selectively resetting
component learners [3]–[6]. Concerning computational
efficiency, ensembles of learners require more computational
resources than a single learner; however, many are very easy
to parallelize [6].
In the traditional batch learning setting, several ensemble
methodologies are widely used, such as Random Subspaces [7],
Pasting [8], Bagging [9], Random Forest [10],
SubBag [11], and Random Patches [12]. The main differences
among these algorithms lie in how they induce
diversity into the ensemble. Random subspace methods train
each base learner on a separate randomly selected subset of
features. Pasting and Bagging train base learners on samples
of instances drawn without and with replacement, respectively,
from the original dataset. Random Forest extends Bagging
and randomly selects subsets of features to be considered
for splits in its base learners (decision trees). SubBag and
Random Patches combine Bagging with Random Subspaces
and Pasting with Random Subspaces, respectively; thus, through
very similar means, they train base learners on random subsets
of features and samples. Other ensembles that are popular in
batch learning, such as AdaBoost [13], are less attractive
for data streams, partially because the original batch learning
implementations introduce dependencies among the base
learners, which are difficult to simulate appropriately in a
streaming setting [14].
In this work, we propose strategies to cope with classifi-
cation problems on evolving data streams using an ensemble
strategy that combines random subspaces and bagging. We
name this ensemble Streaming Random Patches (SRP) as it
is inspired by the Random Subspaces method [7] and Online
Bagging [15], and thus resembles the Random Patches [12]
algorithm. SRP incorporates an active drift detection strategy,
similarly to other ensemble methods, e.g., Leveraging Bagging [5]
and Adaptive Random Forest (ARF) [6]. The drift
detection and recovery strategy in SRP follows the approach
used in ARF. ARF consistently outperforms other state-of-the-art
ensembles for evolving data streams, partially due to this
strategy [6]; on top of that, by using the same procedure in
SRP, we can compare it to ARF in terms of the ensemble
strategy without interference from the approach used to cope
with evolving data.
Similar algorithms based on the Random Subspaces
method [7] or combinations of resampling and random sub-
spaces [11], [12] have been previously explored on batch learn-
ing for high-dimensional datasets [16] and also for evolving
data stream classification [5], [6], [17], [18]. Nevertheless,
to the best of our knowledge, none of these previous works
thoroughly investigated the impact of online bagging and
random subspaces, concomitantly, for evolving data streams.
Similarly, previous works have not outlined the similarities
and differences between a global and a local randomization
strategy for the subset of features for streaming data. We
use the same definition of global and local randomization as
in [12], i.e., in the random subspaces method, the subspace of
features is selected globally once for the whole base learner,
while in the random forest algorithm the subspaces are selected
locally for each leaf of the base tree [12]. We discuss the
impact of both strategies in our experiments (Section V) while
comparing ARF and SRP. Panov and Džeroski, and Louppe
and Geurts conducted a similar investigation for the batch
setting in [11] and [12], respectively.
Paper contributions and roadmap. Our main contributions
can be summarized as follows:
1) Streaming Random Patches (SRP): We introduce an
ensemble-based method, namely SRP, that achieves high
accuracy by training base models on random subsets of
features and instances¹;
2) Theoretical insights: We analyze the SRP algorithm with
particular attention to the questions of the stability and
diversity of Hoeffding trees, and the impact of global
subspace randomization in SRP in opposition to the local
randomization in ARF;
3) Empirical Analysis: We compare SRP against state-
of-the-art ensemble variants for streaming data in a
multitude of datasets. The results show a clear overview
of predictive performance and resources usage. Besides,
we analyze the diversity over time and the average tree
depth, which provides some insights on the differences
between local and global subspace randomization.
The rest of this paper is organized as follows. In Section
II we introduce the problem of learning classification models
from evolving data streams. In Section III, we present the SRP
algorithm and theoretical insights. In Section IV, related works
are discussed and compared to our approach. In Section V, we
present the experiments conducted to analyze SRP in terms of
accuracy, computational resources, diversity and decision trees
depth. Finally, Section VI concludes this work and presents
directions for future works.
II. PROBLEM SETTING
Let $X = \{x_{-\infty}, \ldots, x_{-1}, x_0\}$ be an open-ended sequence
of observations collected over time, containing input examples
in which $x_k \in \mathbb{R}^n$ and $n \geq 1$. Similarly, let $y$ be an
open-ended sequence of corresponding class labels, such that every
example in $X$ has a corresponding entry in $y$. Moreover, $y_k$
has a finite set of possible values, i.e., $y_k \in \{l_1, \ldots, l_L\}$ for
$L \geq 2$, such that a classification task is defined. Furthermore,
we assume a problem setting where new input examples
$x$ are presented every $u$ time units to the learning model
for prediction, such that $x_k^t$ represents a vector of features
available at time $t$. The true class label $y_k^{t+1}$, corresponding
to instance $x_k^t$, is available before the next instance $x^{t+1}$
appears, and thus, it can be used for training immediately
after it has been used for prediction. We emphasise that this
experimental setting can be naturally extended to the delayed
and weakly-supervised settings considering a non-negligible
time delay between observing $x$ and its class label $y$, including
an infinite delay (i.e., the label is never observed). However,
the conclusions drawn from experimenting in such settings are
similar to those in the “immediate” setting, as shown in [6].
Therefore, for simplicity, we omit such results in this paper.
¹ The implementation and instructions are available at: https://tinyurl.com/yytbom4e
An important characteristic of data stream classification is
whether the data distribution is stationary or evolving. In
this work, we assume evolving data distributions. Thus, we
expect the occurrence of concept drifts² that might influence
decision boundaries. Note that if a concept drift is accurately
detected (without false negatives) and dealt with (by fully or
partially resetting models as appropriate), an iid assumption can
be made on a per-concept basis, since each concept can be
treated as a separate iid stream; the problem then becomes a
series of iid streams to be dealt with. Nevertheless, the typical
nature of a data stream as being fast and dynamic encourages the
in-depth study that we present in this work.
III. STREAMING RANDOM PATCHES
Streaming Random Patches (SRP) can be viewed as an
adaptation of batch learning ensemble methods that combine
random samples of instances and random subspaces
of features [11], [12]. Following the terminology introduced
in [12], in the rest of this work, we refer to random
subsets of both features and instances as random patches.
Fig. 1 presents an example of subsampling both instances and
features, simultaneously, from streaming data, where only the
shaded intersections of the matrix belong to the subsample,
i.e., $\{v_{1,1}, v_{2,1}, v_{6,1}, v_{1,3}, v_{2,3}, v_{6,3}\}$.
Our motivation for exploiting an ensemble of base models
trained on random patches is based mainly on the high
predictive performance of ensembles for data stream learning
that add randomization to the base models by training
them on random samples of instances [5], random subsets of
features [17], or both [6]. We investigate whether selecting the
subset of features globally, once, before constructing each
base model outperforms selecting subsets of features locally at
each node while constructing base trees, as in Random Forest.
In [12], the authors show empirical evidence that Random Patches
combined with tree-based models achieved similar accuracy to
other randomization strategies, including Random Forest [10],
while using less memory.

x1      x2      x3      x4      x...    xm
v1,1    v1,2    v1,3    v1,4    v1,5    v1,6
v2,1    v2,2    v2,3    v2,4    v2,5    v2,6
v3,1    v3,2    v3,3    v3,4    v3,5    v3,6
v5,1    v5,2    v5,3    v5,4    v5,5    v5,6
v6,1    v6,2    v6,3    v6,4    v6,5    v6,6
v...,1  v...,2  v...,3  v...,4  v...,5  v...,6
Fig. 1: Representation of a data stream as an unbounded table
where the rows are infinite, but the columns are constrained
by m input features.

² A formal definition of concept drift can be found in [19].
The original Random Patches algorithm [12] is defined
in terms of all possible subsets of features and instances,
such that $R(p_s, p_f, D)$ denotes all random patches of size
$p_s N_s \times p_f N_f$ that can be drawn from the training set $D$,
where $N_s$ and $N_f$ represent the number of instances and
features, respectively, in $D$. The hyperparameters $p_s \in [0,1]$
and $p_f \in [0,1]$ represent, respectively, the fraction of samples
and features in each patch $r \in R(p_s, p_f, D)$. In SRP, the
set of all possible streaming random patches $R_s(\lambda, p_f, S)$ is
infinite in the sample dimension as the input training data is
represented by a data stream $S$. We control the number of
samples in the streaming patch using the Poisson parameter $\lambda$
(Section III-A).
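For the batch case, drawing one patch of size $p_s N_s \times p_f N_f$ can be sketched as below. This is only an illustration under stated assumptions: the function name `draw_random_patch` is hypothetical, instances are drawn without replacement (as in Pasting), and patch sizes are rounded to the nearest integer.

```python
import random


def draw_random_patch(n_samples, n_features, p_s, p_f, rng):
    """Return (row indices, column indices) of one random patch.

    Rows and columns are both sampled without replacement; p_s and p_f
    are fractions in [0, 1] of the available instances and features.
    """
    n_rows = round(p_s * n_samples)
    n_cols = round(p_f * n_features)
    rows = sorted(rng.sample(range(n_samples), n_rows))
    cols = sorted(rng.sample(range(n_features), n_cols))
    return rows, cols


# A 10 x 6 training set with p_s = p_f = 0.5 gives a 5 x 3 patch.
rows, cols = draw_random_patch(10, 6, 0.5, 0.5, random.Random(0))
```

In the streaming case the row dimension cannot be enumerated in advance, which is why the sample side is controlled through the Poisson parameter instead.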
A. Random Subsets of Instances
In the batch setting, Bagging builds $L$ base models, training
each model with a bootstrap sample from the original training
dataset of size $N$. Each bootstrap contains each original training
example $K$ times, where $\Pr(K = k)$ follows a binomial
distribution which, for large $N$, tends to a Poisson($\lambda = 1$)
distribution. Using this fact, Oza and Russell [15] proposed
Online Bagging, an online method that, instead of sampling
with replacement, gives each example a weight according to
Poisson($\lambda = 1$).
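The Poisson-based weighting can be sketched as follows. This is a minimal illustration rather than the implementation of [15]: `poisson_sample` (Knuth's multiplication method) and `CountingLearner` are hypothetical helpers introduced only for this example.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw k ~ Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


class CountingLearner:
    """Hypothetical stand-in for an incremental base learner."""

    def __init__(self):
        self.total_weight = 0

    def learn_one(self, x, y, weight):
        self.total_weight += weight


def online_bagging_step(learners, x, y, lam, rng):
    """Give (x, y) to each learner with weight k ~ Poisson(lam)."""
    for learner in learners:
        k = poisson_sample(lam, rng)
        if k > 0:  # k == 0 means this learner never sees the instance
            learner.learn_one(x, y, weight=k)


rng = random.Random(42)
ensemble = [CountingLearner() for _ in range(10)]
for _ in range(1000):
    online_bagging_step(ensemble, x=[0.0], y=0, lam=1.0, rng=rng)
mean_weight = sum(l.total_weight for l in ensemble) / len(ensemble)
```

With $\lambda = 1$ each learner accumulates, on average, a total weight close to the number of instances seen, mirroring a bootstrap sample of the same size.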
Leveraging Bagging [5] and Adaptive Random Forest [6]
train their base models according to a Poisson($\lambda = 6$)
distribution, which on average increases the weight of each
training instance and diminishes the probability of not using an
instance for training, i.e., Pr[Poisson($\lambda = 6$) = 0] ≈ 0.25%,
while Pr[Poisson($\lambda = 1$) = 0] ≈ 36.8%. Using
Poisson($\lambda = 6$) tends to improve the predictive performance of
the ensemble as the base models are updated more often, but
this benefit comes at the expense of computational resources.
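The probability that a base model skips an instance follows directly from the Poisson mass function, $\Pr[\text{Poisson}(\lambda) = 0] = e^{-\lambda}$. A short check of the two cases of interest (the helper name `prob_instance_skipped` is introduced here only for illustration):

```python
import math


def prob_instance_skipped(lam):
    """Pr[Poisson(lam) = 0] = exp(-lam)."""
    return math.exp(-lam)


for lam in (1.0, 6.0):
    print(f"lambda={lam:g}: Pr[k=0] = {100 * prob_instance_skipped(lam):.2f}%")
# lambda=1: Pr[k=0] = 36.79%
# lambda=6: Pr[k=0] = 0.25%
```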
Minku et al. [20] used $\lambda$ as a proxy for diversity, i.e.,
the lower $\lambda$, the more diversity would be induced into the
ensemble. As pointed out by Stapenhurst [21], for iid data the base
models will eventually converge, and even faster if given larger
values of $\lambda$. One important question to be addressed then is:
why does Poisson(6) work if only a small portion of the data is not
presented to each learner? In the long run, the base models
start to converge. This can be visualized in Section V, where
diversity is shown over time for the AGRAWAL generator:
once a concept becomes stable, the average Kappa statistic
starts to increase (i.e., the outputs of the base models start to
converge) if the only means of decorrelating the base models
is resampling with replacement simulated with Poisson(6). This
motivates the addition of other techniques to induce diversity
(Section III-B).
B. Random Subsets of Features
Random Subspaces are sensitive to the hyperparameters $m$
(size of subspace) and $n$ (number of learners). For a feature
space of $M$ features, there are $2^M - 1$ different non-empty
subsets of features. Thus, it is infeasible to train one learner
per subset for even moderate values of $M$, especially for streaming data
where processing time and memory are restricted [22]. Ho
noted in [7] that highly accurate ensembles could be obtained
far before all possible combinations of subspaces are explored.
Later, Kuncheva et al. [16] provided a thorough analysis
of the random subspace method for the functional magnetic
resonance imaging (fMRI) data problem, which resulted in
insights for selecting values of $m$ and $n$ that generate usable
learners, i.e., learners that contain at least one ‘relevant’ feature in their
subset of features.
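Both quantities above are easy to compute for small cases. The sketch below counts the non-empty subsets and, under the assumption that a subspace of $m$ features is drawn uniformly without replacement and that the number of relevant features is known, gives the probability that a single learner is ‘usable’. The closed form is the standard hypergeometric argument and is not necessarily the analysis used in [16]; both function names are hypothetical.

```python
from math import comb


def num_nonempty_subsets(M):
    """Number of non-empty feature subsets of an M-feature space."""
    return 2 ** M - 1


def prob_usable(M, m, relevant):
    """Probability that a uniformly drawn subspace of m out of M
    features contains at least one of the `relevant` features."""
    return 1.0 - comb(M - relevant, m) / comb(M, m)


print(num_nonempty_subsets(20))        # 1048575
print(round(prob_usable(10, 3, 2), 4)) # 0.5333
```

Even with 20 features there are over a million candidate subspaces, while a subspace of 3 out of 10 features misses both relevant features almost half of the time in this toy setting.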
In our problem setting, one reason to train base models
on random subspaces of features on top of training them on
different subsets of instances is to add even further diversity to
the models. Even if they converge because of iid data (Section
III-A) by training them on separate subspaces of features we
have higher chances of producing models that maintain some
level of diversity.
There is a risk of subspaces including only irrelevant
features. Two mechanisms help mitigate this situation:
(i) resetting subspaces once a model is reset in response to a
concept drift; and (ii) assigning weights to the votes of base models
based on their predictive performance, so that base models
with only irrelevant features, which are expected to have poor
predictive performance, are dominated by the votes of other
base models.
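Mechanism (ii) can be illustrated with a weighted majority vote. In this sketch the weight of each model is assumed to be its estimated accuracy, and the function name `weighted_vote` is hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total vote weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three models near chance level (e.g., trained on irrelevant features)
# vote "a"; two accurate models vote "b" and outweigh them (1.8 > 1.5).
winner = weighted_vote(["a", "a", "a", "b", "b"], [0.5, 0.5, 0.5, 0.9, 0.9])
print(winner)  # b
```

An unweighted majority vote would return "a" here, which is the situation the weighting is meant to avoid.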
C. Drift Detection and Recovery
The ultimate goal of drift detection in our context is to allow
automatic recovery from a state where the model performance
is degrading. To achieve this goal we need an accurate drift
detector and a proper action that will be triggered as a response
to the drift signal. Currently, the most successful supervised
learning methods follow a simple, yet effective, approach:
when a concept drift is detected the underlying model is
reset [5], [6]. If the detection algorithm misses a change or takes too long
to detect it, then it will let the model degrade. On
the other hand, if it yields too many false positives, it will
continuously trigger model resets and consequently prevent
the algorithm from building an accurate model.
We use the same strategy to detect and recover from concept
drifts as introduced in the Adaptive Random Forest (ARF) [6]
algorithm. In this strategy, the correct/incorrect predictions
of each base model are monitored by a detection algorithm.
When the drift detection algorithm flags a warning, a new base
model starts training in the ‘background’, where ‘background’
means that it does not influence the ensemble decision with
its predictions. If the warning escalates to a concept drift, then
the background model replaces the associated base model.
The strategy accommodates different drift detection
algorithms; however, to facilitate discussion, we
focus the experiments and analysis on SRP with the ADaptive
WINdow (ADWIN) algorithm [23]. ADWIN is a change
detector and estimator that solves in a well-specified way the
problem of tracking the average of a stream of bits or real-
valued numbers. ADWIN keeps a variable-length window of
recently seen items, with the property that the window has
the maximal length statistically consistent with the hypothesis
“there has been no change in the average value inside the
window”. More precisely, an older fragment of the window
is dropped if and only if there is enough evidence that its
average value differs from that of the rest of the window.
This has two consequences: one, that change is reliably declared
whenever the window shrinks; and two, that at any time the
average over the existing window can be reliably taken as
an estimation of the current average in the stream (barring a
very small or very recent change that is still not statistically
visible). A formal and quantitative statement of these two
points (a theorem) appears in [23]. ADWIN is parameter-
and assumption-free in the sense that it automatically detects
and adapts to the current rate of change. Its only parameter is
the confidence bound δ, indicating how confident we want to
be in the algorithm’s output, inherent to all algorithms dealing
with random processes.
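The window-shrinking idea can be sketched with a deliberately naive detector that tests every split of the window on each update. This is not the bucket-based, memory-efficient ADWIN of [23]: the class name `NaiveAdwin`, the default δ = 0.002 and the exhaustive search are assumptions made for illustration, and the threshold uses the Hoeffding-style cut condition described in [23] with δ′ = δ/n.

```python
import math
from collections import deque


class NaiveAdwin:
    """Simplified ADWIN-style change detector (quadratic-time sketch)."""

    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = deque()

    def update(self, value):
        """Add a value; return True if the window shrank (change declared)."""
        self.window.append(value)
        shrunk = False
        while self._drop_oldest_if_changed():
            shrunk = True
        return shrunk

    def _drop_oldest_if_changed(self):
        n = len(self.window)
        if n < 2:
            return False
        delta_prime = self.delta / n
        total = sum(self.window)
        head_sum = 0.0
        # Try every split of the window into an older and a newer part.
        for n0, value in enumerate(list(self.window)[:-1], start=1):
            head_sum += value
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(head_sum / n0 - (total - head_sum) / n1) >= eps_cut:
                self.window.popleft()  # drop the oldest item and re-check
                return True
        return False

    @property
    def estimate(self):
        """Average over the current window."""
        return sum(self.window) / len(self.window) if self.window else 0.0


detector = NaiveAdwin()
# 100 correct predictions (error bit 0) followed by 100 errors (bit 1).
flags = [detector.update(v) for v in [0.0] * 100 + [1.0] * 100]
```

On this toy stream no change is flagged during the stable prefix, the window shrinks shortly after the error rate jumps, and the window average moves towards the new error rate.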
There are no guarantees that a detection algorithm based
on the correct/incorrect predictions will be accurate, but it
will at least be able to detect changes in the underlying data
that genuinely affected the decision boundary (real drifts),
while neglecting those that did not (virtual drifts) [24]. One
disadvantage of this strategy is that it requires access to
labelled data. This is not an issue given our problem setting
(Section II), but for problems that include verification latency
or weakly labelled streams, other drift detection strategies
must be explored [25].
The pseudocode for SRP is depicted in Alg. 1. The training
instances are used to evaluate the classification performance
of each base model, before being used for training, and this
estimation is used as the learner weight during voting (line 9,
Alg. 1). For non-stationary data streams, we should consider
that the relevant features, i.e., those that can effectively be used
to predict the class label, may change over time. Therefore,
when a background learner is created, a new random subspace
is generated for it (line 12, Alg. 1). Background models are
trained during the period between the warning that triggered
their creation and the concept drift signal that causes them
to replace the previous base model. Thus, a model that is
added to the ensemble is never an entirely new, untrained
base model (line 15, Alg. 1).
Algorithm 1 Streaming Random Patches.
Symbols: m: maximum features per subset; λ: Poisson distribution
parameter; n: total number of models (n = |L|);
δw: warning threshold; δd: drift threshold; S: data stream;
B: set of background models; W(l): weight of model l; P(·):
model predictive performance estimation function; d(·): drift
detection method.
 1: function TRAINSRP(m, n, δw, δd)
 2:   L ← CreateBaseModels(n, m)
 3:   W ← InitWeights(n)
 4:   B ← ∅
 5:   while HasNext(S) do
 6:     (x, y) ← next(S)
 7:     for all l ∈ L do
 8:       ŷ ← predict(l, x)
 9:       W(l) ← P(W(l), ŷ, y)
10:       Train(m, l, x, y)
11:       if d(δw, l, x, y) then        ▷ Warning detected?
12:         B(l) ← CreateBkgModel(m)
13:       end if
14:       if d(δd, l, x, y) then        ▷ Drift detected?
15:         l ← B(l)                    ▷ Replace l by bkg learner
16:       end if
17:     end for
18:     for all b ∈ B do
19:       Train(m, b, x, y)
20:     end for
21:   end while
22: end function
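The control flow of Alg. 1 can be sketched in Python as follows. This is a toy under several assumptions, not the released implementation: `CentroidLearner` stands in for the Hoeffding tree, `Member.error_shift` is a crude replacement for ADWIN (so `delta_w` and `delta_d` are thresholds on an error-rate difference rather than confidence levels), and a background model is created only if none exists yet. All class and function names are hypothetical.

```python
import math
import random
from collections import defaultdict, deque


def poisson(lam, rng):
    """Knuth's method: draw k ~ Poisson(lam)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k, p = k + 1, p * rng.random()
    return k


class CentroidLearner:
    """Toy incremental classifier restricted to a feature subspace."""

    def __init__(self, subspace):
        self.subspace = subspace
        self.sums = defaultdict(lambda: [0.0] * len(subspace))
        self.counts = defaultdict(int)

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        proj = [x[i] for i in self.subspace]

        def dist(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2 for v, s in zip(proj, self.sums[label]))

        return min(self.counts, key=dist)

    def learn(self, x, y, weight):
        for j, i in enumerate(self.subspace):
            self.sums[y][j] += weight * x[i]
        self.counts[y] += weight


class Member:
    """Base model l with its weight W(l), error history and background B(l)."""

    def __init__(self, learner, window=50):
        self.learner, self.background, self.window = learner, None, window
        self.correct = self.seen = 0
        self.errors = deque(maxlen=2 * window)

    @property
    def weight(self):  # running accuracy, used as the voting weight
        return self.correct / self.seen if self.seen else 0.0

    def record(self, hit):
        self.correct += int(hit)
        self.seen += 1
        self.errors.append(0 if hit else 1)

    def error_shift(self):
        """Toy change statistic: recent error rate minus older error rate."""
        if len(self.errors) < 2 * self.window:
            return 0.0
        errs = list(self.errors)
        return (sum(errs[self.window:]) - sum(errs[:self.window])) / self.window


def train_srp(stream, n_features, m, n, lam, delta_w, delta_d, seed=1):
    rng = random.Random(seed)

    def new_learner():  # each model gets its own random subspace of m features
        return CentroidLearner(rng.sample(range(n_features), m))

    members = [Member(new_learner()) for _ in range(n)]  # Alg. 1, lines 2-4
    for x, y in stream:  # lines 5-6
        for idx, mem in enumerate(members):  # line 7
            mem.record(mem.learner.predict(x) == y)  # lines 8-9
            for model in (mem.learner, mem.background):  # lines 10, 18-20
                k = poisson(lam, rng) if model is not None else 0
                if k > 0:
                    model.learn(x, y, k)
            shift = mem.error_shift()
            if shift > delta_w and mem.background is None:  # lines 11-12
                mem.background = new_learner()
            if shift > delta_d and mem.background is not None:  # lines 14-15
                members[idx] = Member(mem.background)
    return members


def predict_srp(members, x):
    """Weighted majority vote over the base models."""
    votes = defaultdict(float)
    for mem in members:
        label = mem.learner.predict(x)
        if label is not None:
            votes[label] += mem.weight
    return max(votes, key=votes.get) if votes else None


rng = random.Random(0)
xs = [[rng.random() for _ in range(4)] for _ in range(600)]
# Abrupt drift at t = 300: the labelling rule on feature 0 is inverted.
stream = [(x, int((x[0] > 0.5) == (t < 300))) for t, x in enumerate(xs)]
ensemble = train_srp(stream, 4, m=2, n=5, lam=6.0, delta_w=0.1, delta_d=0.2)
```

Because each replacement model draws a fresh subspace, a member that was stuck with irrelevant features has a chance of recovering after a drift, which is the behaviour described for line 12 of Alg. 1.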
D. SRP Theoretical Insights
Bagging is well-known in the machine learning literature for
its effect on reducing variance, both in regression and classi-
fication [9], [26], which allows it to perform competitively in
a wide range of scenarios, including data streams [5], [15].
In theory, the reduction of the error is strictly related to how
uncorrelated prediction errors are [9]. Entirely uncorrelated
predictions are rarely achievable in practice, yet it is achieved
to some extent by encouraging diversity among the learning
models [27]. This itself implies a need to use unstable learners.
The standard (batch, unpruned) decision tree is a prime
example of an unstable learner: small changes to a training
sample can result in remarkably different models, and thus
diversity among predictions. Indeed, one readily observes that
decision trees are used throughout the literature.
In the context of data streams, Hoeffding trees [28] are the
popular choice of decision tree, since they are incremental.
However, crucially, Hoeffding trees – unlike their batch coun-
terparts – are in fact stable learners. As far as we are aware
we are among the first to focus on this fact in the context of
ensembles.
Splitting is supported statistically under the Hoeffding
bound. This guarantees to a certain (user-specified) confidence
level that under a sufficiently large number of examples a
Hoeffding tree built incrementally will be equivalent to a
batch-built tree. Until such a number of examples is seen,
however, Hoeffding trees will not grow and this implies
stability.
Formally, we may measure the stability of an algorithm as,
for example, hypothesis stability. In the following we adapt
the discussion of [29] to the streaming setting.
Suppose that $A_S$ denotes that an algorithm $A$ (e.g., C4.5,
or Hoeffding tree inducer) induces decision function $f$ (e.g.,
a decision tree) over data stream segment $S$ of pairs $(x_k, y_k)$
(the segment is of length $|S| = n$). Let also $S^{\setminus i}$ represent
$S$ without the $i$-th sample. Then hypothesis stability can be
expressed as
$$\mathbb{E}_{(x,y)}\left[\,\left|\ell(A_S, (x, y)) - \ell(A_{S^{\setminus i}}, (x, y))\right|\,\right] < \beta$$
under evaluation function/metric $\ell$.
This captures the intuition that if we remove a sample from
the stream, the absolute difference in error of another model
trained on this new segment should be less than $\beta$ when
compared to the error of the same model built on the original
(thus indicating its stability in terms of $\beta$).
We cannot compute this exactly unless we know the true
generating distribution (‘concept’ in stream terminology) from
which (xk, yk)pairs are drawn. However, by replacing the
expectation with a sum over leave-one-out samples from a
real stream we can empirically investigate and compare the
‘β-stability’ among learning algorithms with regard to such a
stream.
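Such an empirical comparison can be sketched as below. The sketch makes several assumptions: the expectation is replaced by an average over a small set of labelled probe pairs, the result is the largest such average over all leave-one-out segments, the metric is 0/1 loss, and the two toy learners (a majority-class predictor and a 1-nearest-neighbour rule on one feature) merely stand in for a stable and an unstable algorithm. All names are hypothetical.

```python
from collections import Counter


def empirical_stability(fit, loss, segment, probes):
    """Largest mean |loss(A_S, z) - loss(A_{S without i}, z)| over all i."""
    full = fit(segment)
    worst = 0.0
    for i in range(len(segment)):
        reduced = fit(segment[:i] + segment[i + 1:])
        diff = sum(abs(loss(full, z) - loss(reduced, z)) for z in probes)
        worst = max(worst, diff / len(probes))
    return worst


def fit_majority(segment):
    """Stable toy learner: always predicts the majority class."""
    label = Counter(y for _, y in segment).most_common(1)[0][0]
    return lambda x: label


def fit_1nn(segment):
    """Unstable toy learner: 1-nearest neighbour on a single feature."""
    return lambda x: min(segment, key=lambda pair: abs(pair[0] - x))[1]


def zero_one(model, z):
    x, y = z
    return float(model(x) != y)


segment = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (4.0, 1), (5.0, 1)]
probes = [(0.2, 0), (1.4, 0), (1.6, 1), (3.5, 1)]
print(empirical_stability(fit_majority, zero_one, segment, probes))  # 0.0
print(empirical_stability(fit_1nn, zero_one, segment, probes))       # 0.25
```

Removing any single example never changes the majority-class model on this segment, whereas removing either example next to the class boundary flips one of the four probe predictions of the nearest-neighbour rule.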
Repeatedly rebuilding models on relatively small samples
of instances is unavoidable in a stream which may experience
drift, implying that trees must be fully or partially regrown. By
small we mean “insufficiently large wrt the Hoeffding bound”.
These episodes add up over the life of a stream to a non-
negligible loss of accuracy.
Suppose that this number is $n$. Like any well-regularized
algorithm, a Hoeffding tree does not adhere strongly to the
principle of empirical risk minimization, but rather it is forced
to accept many errors as a trade-off for long-term similarity
to a batch-built tree. This is a problem in terms of Hoeffding
tree ensembles, since these errors are likely to be the same,
rendering the ensemble decisions likely useless (no
advantage compared to a single model). In terms of the bias-variance
trade-off, variance goes down at the cost of bias due
to Hoeffding stability [30]. However, bagging-based ensemble
schemes are primarily for reducing variance and may even
increase bias; since variance has already been reduced by
stability, bagging is not likely to have a positive effect.
This provides a suitable explanation as to why our proposed
SRP method performs well: by effectively reducing the feature
space of individual trees, Hoeffding trees are operating on a
‘sub-concept’, and are stable wrt that concept but unstable wrt
the complete concept, meaning that the variance reduction of
an ensemble still has a beneficial effect.
Furthermore, one reason Random Subspaces are so beneficial in the
data stream setting is that we can look at decision trees
as adaptive nearest neighbours [31], and at Random Subspaces
as transformations that preserve the Euclidean geometry [32].
Decision trees split the overall space into several regions, one
for each of their leaves. The prediction for the instances
in each leaf is based on the majority vote of the
instances in that leaf. We can consider the instances in that
leaf as the neighbours of the instances to predict. Random
Subspaces are linear transformations that map instances
to another space while preserving their Euclidean geometry, a very
useful property when applied to nearest neighbours. This
is because there exist Johnson-Lindenstrauss
guarantees that Random Subspaces approximately preserve
the Euclidean geometry of the data with high probability, as
shown in Lemma 1 [32].
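Lemma 1. Let $X = \{x_{-(N-1)}, \ldots, x_{-1}, x_0\}$ be a sequence
of observations collected over time, containing input examples
in which $x_i \in \mathbb{R}^n$ for every $i \in \{-(N-1), \ldots, -1, 0\}$, $n \geq 1$,
and satisfying $\|x_i^2\|_\infty \leq \frac{c}{n}\|x_i\|_2^2$, where $c \in \mathbb{R}^+$ is a constant
with $1 \leq c \leq n$. Let $\epsilon, \delta \in (0, 1]$, and let
$k \geq \frac{c^2}{2\epsilon^2}\ln\frac{N^2}{\delta}$ be an
integer. Let $RS$ be a random subspace projection from $\mathbb{R}^n \mapsto \mathbb{R}^k$.
Then with probability at least $1 - \delta$ over the random draws
of $RS$ we have, for every $i, j \in \{-(N-1), \ldots, -1, 0\}$:
$$(1 - \epsilon)\|x_i - x_j\|_2^2 \leq \frac{n}{k}\|RS(x_i - x_j)\|_2^2 \leq (1 + \epsilon)\|x_i - x_j\|_2^2$$
The required number of spaces $k$ is logarithmic in the
number of examples, but with a larger constant term.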
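The role of the scaling factor $n/k$ in Lemma 1 can be checked exactly on a small vector: averaged over all coordinate subsets of size $k$, the scaled squared norm of the projection equals the original squared norm, because each coordinate appears in a fraction $k/n$ of the subsets. The sketch below verifies only this unbiasedness property, not the high-probability bound itself; the vector and the helper name `scaled_sq_norm` are arbitrary choices for illustration.

```python
from itertools import combinations


def scaled_sq_norm(x, subspace):
    """(n / k) * ||RS(x)||_2^2 for the coordinate subset `subspace`."""
    n, k = len(x), len(subspace)
    return n / k * sum(x[i] ** 2 for i in subspace)


x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
k = 3
values = [scaled_sq_norm(x, s) for s in combinations(range(len(x)), k)]
mean_value = sum(values) / len(values)
true_value = sum(v ** 2 for v in x)  # 16.5
# Regularity constant of the lemma for this vector: c = n * max(x_i^2) / ||x||^2.
c = len(x) * max(v ** 2 for v in x) / true_value
```

Individual subspaces can still distort the norm considerably when one coordinate dominates (a large $c$), which is why the bound on $k$ grows with $c^2$.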
Finally, another explanation of the success of Random
Patches is dropout [33]. Dropout is a technique used in
Deep Learning to improve the accuracy of Neural Networks,
randomly removing neurons. Random Patches uses a similar
technique to sample instances and attributes, removing many
of them, in an efficient random way.
Thus, overall, our proposal creates an artificially smaller
feature space, thus encouraging faster growth, and further-
more, even when tree growth is conservative, can encourage
disagreement (avoid correlation) among the leaf classifiers
even if they would be stable models if run outside the context
of such an ensemble. Empirical results are given in Section
V, which offer further support to these arguments.
IV. RELATED WORK
There is an extensive literature on ensemble methods for
data stream classification. This preference is counterintuitive
given the need for algorithms that use computational re-
sources judiciously. The justification for this preference is
attributable to the flexibility and high predictive performance
that ensemble models provide [14]. The seminal work of
Kolter and Maloof [3] introduced the Dynamic Weighted
Majority (DWM) ensemble method which featured heuristics
to cope with evolving data streams, such as removing base
models if their weight dropped below a given threshold, and
adding new ones according to the global performance of the
ensemble. DWM introduces a hyperparameter to control the
period (window) between base model additions, removals and
weight updates. Similarly to DWM, the Online Accuracy
Updated Ensemble (OAUE) [4] algorithm relies on a window
hyperparameter to determine which instances will be used to
train a new base model (candidate) and whether it should replace the
base model that achieved the lowest classification performance
in the latest window of instances. OAUE does not use an active
drift detection approach; thus it relies on gradual resets of
the ensemble through candidates to adapt to concept drifts.
Also, it introduces a weighting mechanism that contributes to
the ensemble adaptation to concept drifts, since the weighting
function is designed to assign higher impact to predictions
on recently presented instances. Note that DWM and OAUE
use incremental base learners; however, they still require
the definition of a window to orchestrate their adaptation
techniques to evolving data.
Many ensemble methods for data stream learning exploit
strategies developed initially for batch learning. Online Bagging [15]
trains base models on samples drawn from the
data stream, simulating sampling with replacement as in the
classical Bagging algorithm [9]. Chen et al. introduce a
generalization of SmoothBoost [34], namely Online SmoothBoost
(OB) [35], an algorithm that generates only smooth
distributions, i.e., distributions that do not assign too much weight to single
examples. OB is guaranteed to achieve an arbitrarily small
error rate given that the number of weak learners and examples
is sufficiently large.
Ensembles designed to cope with evolving data streams
combine decorrelating base models (e.g., bagging) and voting
(e.g., weighted majority vote [36]) with active drift recovery
strategies based on change detection algorithms. The Leverag-
ing Bagging (LB) [5] algorithm combines an adapted version
of Online Bagging [15] with the ADaptive WINdow (ADWIN)
drift detection algorithm, such that base models are selectively
reset whenever their corresponding ADWIN instance flags a
drift. Heuristic Updatable Weighted Random Subspaces
(HUWRS) [17] trains batch learners (C4.5 decision trees) on
random subspaces of features, following the Random Sub-
space Method (RSM) introduced by Ho [7]. HUWRS detects
virtual and real concept drift by computing the Hellinger
distance between the binned feature values of every base
model and the latest window of instances feature distribution
when labels are not available, and by computing Hellinger
distances between the feature distribution per class over the
latest window of instances, otherwise. The weighting of the
base models in HUWRS relies on the severity of the change
in the distribution of the features associated with its random
subspace. The Adaptive Random Forest (ARF) [6] and the
Dynamic Streaming Random Forest (DSRF) [37] both aim
to adapt the classic Random Forest [10] algorithm to streaming
data. Both ARF and DSRF use the incremental decision tree
algorithm Hoeffding tree [28]; however, they differ in how
the base trees are trained. ARF simulates resampling as in
Leveraging Bagging, while DSRF trains trees sequentially on
different subsets of data. Moreover, ARF uses a drift detection
and recovery strategy based on detecting warnings and drifts
per base tree, such that after a warning is triggered another
tree is created and trained without affecting the ensemble
predictions (background tree). If the warning escalates to a
drift detection, then the base tree is replaced by the background
tree.
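The warning and drift handling described above for ARF can be sketched as follows. This is a simplified illustration under stated assumptions, not the ARF implementation: `EnsembleSlot` and `CountingModel` are hypothetical names, the detector signals are passed in as plain booleans, and starting from a fresh model when a drift arrives without a prior warning is an assumption of this sketch.

```python
class CountingModel:
    """Toy incremental model (hypothetical); it only counts training instances."""

    def __init__(self):
        self.n = 0

    def learn_one(self, x, y):
        self.n += 1


class EnsembleSlot:
    """One ensemble position: an active model plus an optional background model."""

    def __init__(self, make_model):
        self.make_model = make_model  # factory returning a fresh base model
        self.active = make_model()
        self.background = None

    def update(self, x, y, warning, drift):
        """Train on (x, y) and react to the detector signals for this slot."""
        if drift:
            # Replace the active model by the background model if one exists;
            # otherwise (assumption of this sketch) start from a fresh model.
            if self.background is not None:
                self.active = self.background
            else:
                self.active = self.make_model()
            self.background = None
        elif warning and self.background is None:
            # A warning starts a background model that does not vote yet.
            self.background = self.make_model()
        self.active.learn_one(x, y)
        if self.background is not None:
            self.background.learn_one(x, y)
```

Only `active` would contribute to the ensemble prediction, so the background model can be trained without affecting the ensemble output until it is promoted.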
We briefly introduced the concepts of active and reactive
strategies for concept drift recovery and the vast literature in
ensemble learning for evolving data stream classification. We
refer the reader to [24] and [19] for further information on
concept drift, and to [14] for a detailed overview and taxonomy
of existing ensemble methods for data stream classification.
V. EXPERIMENTS
We evaluate the SRP implementation against state-of-the-
art classification algorithms, both concerning predictive per-
formance and computational resources usage. To analyze the
diversity among base models in our new proposed methods, we
present plots depicting the average pairwise kappa over time.
Also, to analyze how fast (and deep) the base trees are grown
by each ensemble strategy we include plots of the average tree
depth over time. We assess predictive performance through ac-
curacy results using a test-then-train evaluation strategy, where
every instance is used first for testing and then for training.
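The test-then-train strategy can be sketched in a few lines. The function `prequential_accuracy` and the toy classifier `LastLabel` are hypothetical names introduced for illustration; any incremental classifier exposing prediction and training on a single instance could be plugged in instead.

```python
class LastLabel:
    """Toy incremental classifier (hypothetical): predicts the last label it saw."""

    def __init__(self):
        self.last = None

    def predict_one(self, x):
        return self.last

    def learn_one(self, x, y):
        self.last = y


def prequential_accuracy(stream, model):
    """Test-then-train: each instance is used first for testing, then for training."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict_one(x) == y)  # test on the unseen instance
        total += 1
        model.learn_one(x, y)                      # then train on it
    return 100.0 * correct / total if total else 0.0


stream = [((i,), label) for i, label in enumerate([0, 0, 1, 1, 1])]
accuracy = prequential_accuracy(stream, LastLabel())  # 3 of 5 correct: 60.0
```

Because every prediction is made before the corresponding label is used for training, no separate hold-out set is needed.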
The algorithms used in the comparisons are Hoeffding Trees
(HT), Naive Bayes (NB), Leveraging Bagging (LB), Adaptive
Random Forest (ARF), Online Accuracy Updated Ensemble
(OAUE), Dynamic Weighted Majority (DWM), and Online
Smooth Boosting (OB). HT and NB serve the purpose of
baselines since they are single classifiers often used in data
stream classification. LB and ARF are ensemble methods that
consistently outperform other ensemble classifiers, as shown
in [6] on a benchmark similar to the one used in this work.
OB represents a boosting adaptation to online learning, while
DWM and OAUE are ensemble methods explicitly developed
for data stream classification that rely on different heuristics
to address concept drift.
To analyze how SRP compares to “simple” variants of
itself we present two variations in the experiments, namely
the Streaming Random Subspaces (SRS) and a Bagging-like
strategy (BAG). SRS trains on random subspaces of features as
in SRP, but on all instances without simulating bootstraps, while
BAG only simulates bagging using all features. In the online
resources (footnote 3) we provide two tables analyzing the impact of m
in SRP (ranging from 10% up to 100%, the latter being the same as the variant
BAG) and of using λ = 1, which impacts the bagging simulation. The
experiments in the paper summarize the results in the online
resources; the full results remain available to the interested reader.
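The relation between SRP, SRS and BAG can be made concrete with a small sketch of how a feature subspace might be drawn for one base model. The names `VARIANTS`, `subspace_size` and `draw_subspace` are hypothetical, and converting the percentage m to a feature count by rounding (with a minimum of one feature) is an assumption of this sketch rather than a detail stated in the text.

```python
import random

# Which source of randomness each variant uses, as described in the text.
VARIANTS = {
    "SRP": {"random_subspace": True, "resampling": True},
    "SRS": {"random_subspace": True, "resampling": False},
    "BAG": {"random_subspace": False, "resampling": True},
}


def subspace_size(n_features, m_percent):
    """Features kept when m is a percentage (assumption: rounded, at least 1)."""
    return max(1, round(n_features * m_percent / 100))


def draw_subspace(n_features, m_percent, variant, rng):
    """Feature indices seen by one base model of the given variant."""
    if not VARIANTS[variant]["random_subspace"]:
        return list(range(n_features))  # BAG: every model sees all features
    k = subspace_size(n_features, m_percent)
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(42)
airlines_subspace = draw_subspace(7, 60, "SRP", rng)  # 4 of the 7 AIRLINES features
bag_subspace = draw_subspace(7, 60, "BAG", rng)       # all 7 features
```

Under this rounding assumption, m = 60% on the 7 features of AIRLINES gives each base model 4 features, and m = 10% on the 39,917 features of SPAM gives 3,992.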
Regarding hyperparameters, we use HT as the base learner
for all the ensemble methods. The default subspace size is
m = 60% for SRS, SRP, and ARF, except for experiments
with the high dimensionality dataset SPAM and n = 100,
where m = 10% (Table II). In the online resources (Section
A) we present complementary experiments varying m from
10% up to 100% (equivalent to BAG) in all datasets. The
HT grace period was set to GP = 50, the split confidence to
c = 0.01 (footnote 4), and the decision strategy used at leaves was Naive
3 https://github.com/hmgomes/StreamingRandomPatches
4 GP and c were originally identified as n_min and δ by Domingos and
Hulten [28]; however, we choose to keep their acronyms as in the Massive
Online Analysis (MOA) framework to facilitate reproducibility.
Fig. 2: Nemenyi test (95% confidence level). Left: n = 10 base models (CD = 3.757); right: n = 100 base models (CD = 3.031). The average rank obtained in the SPAM dataset for n = 100 was not considered for any learner
since there are no results for LB and BAG.
Bayes Adaptive, i.e., either Naive Bayes or Majority vote is
used at a leaf depending on which one is more accurate [38].
This HT configuration tends to generate splits earlier at the
expense of processing time [6]. ADWIN is used as a drift
detector for all ensembles that rely on active drift detection
(i.e., ARF, LB, SRP, SRS, and BAG). The δ parameter, which
controls the confidence in the change detected, was defined
as δ = 0.0001 for warning detection and δ = 0.00001 for
drift detection in ARF, SRP, SRS and BAG. In LB, δ was set
according to its default value [5], i.e., δ = 0.002.
The datasets used in the experiments include 6 synthetic
data streams and 7 real datasets. The synthetic datasets sim-
ulate abrupt, gradual, and incremental drifts, while the real
datasets have been thoroughly used in the literature to assess
data stream classifiers. Further information concerning the
datasets, instructions on how to execute the experiments and
other details for reproducibility are available in Appendix A.
A. Streaming Random Patches vs. Others
The results presented in Table I show how SRP compares
against other algorithms. Similarly, Table II presents how SRP and
other ensembles perform when configured to use n = 100
learners. Besides presenting the average ranking (Avg Rank)
for each algorithm, we also highlight the average ranking
for the synthetic datasets (Avg Rank Synt.) and the average
ranking for real-world datasets (Avg Rank Real). We
report these rankings separately because some techniques
may perform better on synthetic data while not performing so well
overall, and it is important to highlight and discuss that.
Good performance on the synthetic datasets may indicate
an effective drift recovery strategy; however, synthetic data
stream concepts tend to be simple or biased towards a specific
learning algorithm, so an algorithm that produces good
results only on synthetic data may offer less credibility. We
apply the methodology presented in [39] to compare results
among several datasets and algorithms for the experiments
presented in Tables I and II. We first attempt to reject the
hypothesis that all learners produce equivalent results using a
Friedman test at a significance level α = 0.05. The Friedman
test indicated significant differences for both sets of results, and it
was followed by a post-hoc Nemenyi test. Figure 2 presents
the results of the post-hoc Nemenyi test. We note that no
significant difference was found among SRP, BAG, ARF,
LB, SRS and OAUE using n = 10, while using n = 100
there was no significant difference among SRP, SRS, BAG,
ARF and LB.
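For reference, the critical difference (CD) of the Nemenyi test in the methodology of [39] is CD = q_α · sqrt(k(k + 1) / (6N)), where k is the number of compared algorithms and N the number of datasets. The short sketch below evaluates this formula; the function name `critical_difference` is hypothetical, the q_α values are the α = 0.05 critical values tabulated in [39], and taking N = 13 for n = 10 and N = 12 for n = 100 (SPAM excluded) is an assumption that is consistent with the CD values reported in Fig. 2.

```python
import math

# Critical values q_alpha of the Nemenyi test at alpha = 0.05, indexed by the
# number of compared algorithms k (values as tabulated in [39]).
Q_ALPHA_005 = {8: 3.031, 10: 3.164}


def critical_difference(k, n_datasets):
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * N)) for the Nemenyi test."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


cd_n10 = critical_difference(k=10, n_datasets=13)   # approximately 3.757
cd_n100 = critical_difference(k=8, n_datasets=12)   # approximately 3.031
```

Two algorithms are considered significantly different when their average ranks differ by at least the CD.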
We can observe the influence of the m hyperparameter when
we compare SRP and BAG results. For example, in AIRLINES,
even though the number of features is only 7, using m = 60%
produced better results than BAG, as shown in Tables I and
II, although intuitively it seems that using all features would be better for low
dimensionality datasets. For the SPAM dataset, SRP,
ARF and SRS were configured with m = 10% for the n = 100
experiments as m = 60% failed to finish. LB and BAG
could not finish; both failed after around 60% of the execution, as
the 100GB maximum memory allocation pool was insufficient.
SRP with n = 10 performs well in the real datasets, but
not as well in the synthetic datasets as BAG and LB, which
are very similar models (i.e., use all features and simulate
resampling). However, in the experiments using n = 100
the algorithms that exploit random subspaces (ARF, SRP,
SRS) benefited the most from the addition of more learners,
followed by BAG and LB. This characteristic of ARF, SRP,
and SRS can be attributed to them being able to cover a more
significant number of subspaces of features. OB and DWM
improved in comparison to their results using n = 10, while
the performance of OAUE decreased. OAUE obtains results far
below NB and HT for the KDD99, ADS and NOMAO datasets
while performing well in the synthetic datasets with simulated
concept drifts.
B. Average Tree Depth and Diversity
To investigate how efficient SRP is in terms of inducing
diversity into the ensemble, we plot the average kappa over
time for AGR(G) and LED(A) in Figures 3 and 4. In Figure 6, we
can observe how the average kappa for BAG and ARF converges
after the same concept has been in place. We notice that SRS
and SRP obtain low values of average kappa in comparison
to ARF and BAG. However, when we take into account the
accuracy results in Table I we can see that SRP and SRS do not
necessarily outperform BAG in these datasets, i.e., even if the
difference is small, ARF and BAG outperform SRS and SRP
in LED(A). These results corroborate the conclusions
of Stapenhurst [21] that ensemble diversity influences
the recovery from a concept drift, but is not as crucial as
the actual drift detection and recovery strategy. In AGR(G),
SRS and SRP outperform ARF and BAG; still, the average
kappa values in this experiment are quite similar, thus no
clear conclusions can be drawn about why SRS and SRP
perform better based solely on the average pairwise kappa. The
overall conclusion is that increasing diversity is not enough to
improve accuracy. Therefore, we complement our analysis of
TABLE I: Test-then-train accuracy (%) using n = 10 base models.
Data set        NB      HT      LB      OAUE    DWM     OB      ARF     SRP     SRS     BAG
LED(A)          53.964  69.032  73.918  74.007  73.742  69.898  73.945  73.588  73.533  73.944
LED(G)          54.02   68.649  73.076  73.167  72.723  69.562  73.01   72.416  72.296  73.151
AGR(A)          65.739  81.045  86.954  90.932  82.97   84.91   85.646  91.788  91.558  85.733
AGR(G)          65.759  77.374  80.709  86.339  79.418  79.73   79.885  87.762  88.538  81.347
RBF(M)          30.994  45.491  84.714  78.581  57.81   69.894  84.49   83.28   81.685  85.431
RBF(F)          29.136  32.292  74.102  50.021  54.861  42.915  70.715  70.825  59.061  74.891
AIRLINES        64.55   65.078  62.319  66.637  63.88   65.184  65.786  66.776  67.085  61.296
ELEC            73.362  79.195  90.157  88.275  87.756  85.253  88.718  88.82   89.4    89.502
COVTYPE         60.521  80.312  94.861  90.17   88.286  90.327  94.691  95.254  92.764  95.467
KDD99           95.603  99.903  99.974  2.473   99.951  99.944  99.975  99.984  99.979  99.972
ADS             68.161  85.91   99.665  15.401  97.499  86.917  99.726  99.756  98.445  99.665
NOMAO           86.865  92.128  97.035  58.407  95.462  94.252  97.055  97.232  96.451  97.064
SPAM            74.571  79.043  94.745  80.899  89.286  89.489  96.214  95.967  92.954  94.745
Avg Rank        9.54    8.29    3.5     7       7       6.86    3.57    2.43    3.71    3.36
Avg Rank Synt.  10      9       3.33    3.5     6.67    7.5     4.17    3.67    4.5     2.67
Avg Rank Real   9.14    8.14    4       7.71    6.86    6.57    3.14    1.86    3.71    3.86
TABLE II: Test-then-train accuracy (%) using n = 100 base models. Underlined results mean the performance increased in
comparison to the n = 10 version. BAG and LB did not finish execution for the SPAM dataset.
Data set        LB      OAUE    DWM     OB      ARF     SRP     SRS     BAG
LED(A)          73.953  73.393  73.958  72.475  73.96   74.027  74.04   73.975
LED(G)          73.225  72.582  73.031  72.117  73.094  73.233  73.179  73.215
AGR(A)          88.717  90.164  88.299  90.374  87.929  92.869  92.807  86.663
AGR(G)          83.713  85.244  79.437  87.834  82.288  89.651  90.259  82.52
RBF(M)          84.338  84.262  60.977  74.514  86.958  86.039  84.821  86.671
RBF(F)          76.771  57.147  54.531  48.698  76.291  76.375  61.622  77.686
AIRLINES        62.82   65.229  64.025  64.556  66.417  68.564  68.303  62.093
ELEC            89.508  87.407  87.754  89.515  89.672  89.859  90.267  89.822
COVTYPE         95.104  92.857  88.519  92.695  94.967  95.348  93.461  95.288
KDD99           99.965  2.445   99.951  99.936  99.972  99.981  99.973  99.974
ADS             99.634  15.401  97.499  90.393  99.695  99.726  98.353  99.634
NOMAO           97.072  58.233  95.462  96.393  97.197  97.383  96.57   97.226
SPAM            NA      80.781  89.361  86.519  97.319  97.437  95.924  NA
Avg Rank        4.46    6.33    6.67    6.17    4       1.58    3.17    3.63
Avg Rank Synt.  4.17    5.67    6.67    6.17    4.67    2       2.83    3.83
Avg Rank Real   4.75    7       6.67    6.17    3.33    1.17    3.5     3.42
the ensemble diversity by presenting the average tree depth
in Figures 5 and 6. SRP consistently grows trees faster and
deeper than ARF, SRS and BAG. Splitting sooner can lead to
overfitting, or to splits that would have used a better feature
(or split point) had more instances been observed. However, as
shown by the SRP predictive performance in the empirical
experiments, it can be beneficial to an ensemble strategy.
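The average pairwise kappa used in this subsection as a diversity measure can be computed as in the following minimal sketch, which applies Cohen's kappa to the predictions of every pair of base models; lower values indicate less agreement and therefore more diversity. The function names are hypothetical, and computing the statistic over a fixed list of predictions (rather than over time, as in the figures) is a simplification of this sketch.

```python
from collections import Counter
from itertools import combinations


def kappa(pred_a, pred_b):
    """Cohen's kappa between the predictions of two classifiers."""
    n = len(pred_a)
    observed = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    freq_a, freq_b = Counter(pred_a), Counter(pred_b)
    # Agreement expected by chance from the marginal label frequencies.
    chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if chance == 1.0:
        return 1.0  # both classifiers always predict the same single class
    return (observed - chance) / (1.0 - chance)


def average_pairwise_kappa(predictions):
    """Mean kappa over all pairs of base models (one prediction list per model)."""
    pairs = list(combinations(predictions, 2))
    return sum(kappa(a, b) for a, b in pairs) / len(pairs)
```

For example, two models that always agree obtain kappa = 1, while two models whose agreement matches what is expected by chance obtain kappa = 0.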
Fig. 3: AGR(G) - Avg kappa over time (n = 10). Series: SRP, SRS, ARF, BAG; x-axis: # Instances.
C. Time and Memory Usage Analysis
The computational resources are estimated based on the
CPU time and RAM Hours (across all the experiments). The
Fig. 4: LED(A) - Avg kappa over time (n = 10). Series: SRP, SRS, ARF, BAG; x-axis: # Instances.
results for n = 100 are presented in Figures 7 and 8 (footnote 5). We
note that SRP performs similarly to LB, requires fewer resources
than BAG, but demands more resources than SRS and ARF.
The SRS efficiency is attributable to the fact that it does not
simulate resampling. In SRP, BAG, ARF and LB each learner
is trained on each instance, on average, λ times, where
λ = 6 in our experiments. If we use Poisson(λ = 1) we
5 The results in Figures 7 and 8 exclude SPAM CPU Time and RAM Hours
for all algorithms, since BAG and LB did not finish executing.
Fig. 5: AGR(G) - Avg tree depth over time (n = 10). Series: SRP, SRS, ARF, BAG; x-axis: # Instances.
Fig. 6: LED(A) - Avg tree depth over time (n = 10).
also increase the chances of obtaining zeros (i.e., not using the
instance for training), which positively affects the memory and
processing time (training on fewer instances), but negatively impacts
the classification performance as the base models are trained
on fewer instances.
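The trade-off between λ = 6 and λ = 1 can be quantified directly from the Poisson distribution: the probability that a learner skips an instance is P(k = 0) = e^(-λ), and the expected number of times it trains on an instance is λ. The helper `prob_zero` below is a hypothetical name used only for this worked example.

```python
import math


def prob_zero(lam):
    """P(k = 0) for k ~ Poisson(lam): the chance that a learner skips an instance."""
    return math.exp(-lam)


skip_lambda_1 = prob_zero(1.0)  # about 0.368: roughly 37% of instances are skipped
skip_lambda_6 = prob_zero(6.0)  # about 0.0025: almost every instance is used
```

With λ = 1 each learner therefore ignores roughly one instance in three, whereas with λ = 6 it trains on nearly every instance, and on average six times per instance.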
Fig. 7: CPU time (n= 100).
VI. CONCLUSIONS
In this work, we have taken an in-depth look at the
performance of Random Subspaces and Bagging ensemble
methods and their application to streams. In particular, fol-
lowing theoretical considerations and empirical investigations,
we developed and presented the Streaming Random Patches
(SRP) method. SRP is a combination of Random Subspaces
and Online Bagging as each base model is trained on a
Fig. 8: RAM Hours usage (n = 100).
random patch of data (i.e., a random subset of features and
instances). We show how SRP can be highly accurate on
many benchmark streaming scenarios, and compare it against
several ensemble methods for data stream classification, in-
cluding bagging, boosting and random forest variations. We
discussed the differences, and similarities, between SRP and
the Adaptive Random Forest (ARF) algorithm. We showed
how SRP compared against a Streaming Random Subspaces
(SRS) method and a Bagging method using the same drift
detection and recovery strategies. We highlight that SRP has
the same number of hyperparameters as ARF; in addition, it can
be used to train base models that are not decision trees.
We discussed and demonstrated how methods using ran-
dom subspaces yield several significant advantages, such as
diversity enhancement (even for stable methods), which is
particularly suited to Hoeffding-tree based methods which can
be seen as stable methods. On top of that, these methods tend
to improve accuracy from the addition of more base models.
As only a subset of features is considered, training is efficient
compared to popular existing methods for data stream
classification, such as Leveraging Bagging. Furthermore, even
though it is beyond the scope of this paper, our method is
particularly well suited to distributed computation,
as the base models are independent. These characteristics set
out an exciting path for future investigation.
REFERENCES
[1] L. Da Xu, W. He, and S. Li, “Internet of things in industries: A survey,”
IEEE Transactions on Industrial Informatics, vol. 10, no. 4, pp. 2233–
2243, 2014.
[2] G. Widmer and M. Kubat, “Learning in the presence of concept drift
and hidden contexts,” Mach. Learn., vol. 23, no. 1, pp. 69–101, Apr.
1996.
[3] J. Z. Kolter, M. Maloof et al., “Dynamic weighted majority: A new
ensemble method for tracking concept drift,” in Data Mining, 2003.
ICDM 2003. Third IEEE International Conference on. IEEE, 2003,
pp. 123–130.
[4] D. Brzezinski and J. Stefanowski, “Combining block-based and online
methods in learning ensembles from concept drifting data streams,”
Information Sciences, vol. 265, pp. 50–67, 2014.
[5] A. Bifet, G. Holmes, and B. Pfahringer, “Leveraging bagging for
evolving data streams,” in PKDD, 2010, pp. 135–150.
[6] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck,
B. Pfahringer, G. Holmes, and T. Abdessalem, “Adaptive random forests
for evolving data stream classification,” Machine Learning, pp. 1–27,
2017. [Online]. Available: http://dx.doi.org/10.1007/s10994-017-5642-8
[7] T. K. Ho, “The random subspace method for constructing decision
forests,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 20, no. 8, pp. 832–844, 1998.
[8] L. Breiman, “Pasting small votes for classification in large databases
and on-line,” Machine learning, vol. 36, no. 1-2, pp. 85–103, 1999.
[9] ——, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–
140, 1996.
[10] ——, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32,
2001.
[11] P. Panov and S. Džeroski, “Combining bagging and random subspaces
to create better ensembles,” in International Symposium on Intelligent
Data Analysis. Springer, 2007, pp. 118–129.
[12] G. Louppe and P. Geurts, “Ensembles on random patches,” in Joint
European Conference on Machine Learning and Knowledge Discovery
in Databases. Springer, 2012, pp. 346–361.
[13] Y. Freund, R. E. Schapire et al., “Experiments with a new boosting
algorithm,” in ICML, vol. 96, 1996, pp. 148–156.
[14] H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet, “A survey
on ensemble learning for data stream classification,” ACM Comput.
Surv., vol. 50, no. 2, pp. 23:1–23:36, 2017. [Online]. Available:
http://doi.acm.org/10.1145/3054925
[15] N. Oza and S. Russell, “Online bagging and boosting,” in Artificial
Intelligence and Statistics 2001. Morgan Kaufmann, 2001, pp. 105–
112.
[16] L. I. Kuncheva, J. J. Rodríguez, C. O. Plumpton, D. E. Linden, and S. J.
Johnston, “Random subspace ensembles for fmri classification,” IEEE
transactions on medical imaging, vol. 29, no. 2, pp. 531–542, 2010.
[17] T. R. Hoens, N. V. Chawla, and R. Polikar, “Heuristic updatable
weighted random subspaces for non-stationary environments,” in Data
Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE,
2011, pp. 241–250.
[18] C. O. Plumpton, L. I. Kuncheva, N. N. Oosterhof, and S. J. Johnston,
“Naive random subspace ensemble with linear classifiers for real-time
classification of fmri data,” Pattern Recognition, vol. 45, no. 6, pp. 2101–
2108, 2012.
[19] G. I. Webb, R. Hyde, H. Cao, H. L. Nguyen, and F. Petitjean, “Charac-
terizing concept drift,” Data Mining and Knowledge Discovery, vol. 30,
no. 4, pp. 964–994, 2016.
[20] L. L. Minku, A. P. White, and X. Yao, “The impact of diversity on online
ensemble learning in the presence of concept drift,” IEEE Transactions
on Knowledge and Data Engineering, vol. 22, no. 5, pp. 730–742, 2010.
[21] R. J. Stapenhurst, “Diversity, margins and non-stationary learning,”
Ph.D. dissertation, University of Manchester, UK, 2012.
[22] A. Bifet, E. Frank, G. Holmes, and B. Pfahringer, “Ensembles of
restricted hoeffding trees,” ACM TIST, vol. 3, no. 2, pp. 30:1–30:20,
2012. [Online]. Available: http://doi.acm.org/10.1145/2089094.2089106
[23] A. Bifet and R. Gavaldà, “Learning from time-changing data with
adaptive windowing,” in SIAM, 2007.
[24] J. Gama, I. Žliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A
survey on concept drift adaptation,” ACM Computing Surveys, vol. 46,
no. 4, pp. 44:1–44:37, Mar. 2014.
[25] I. Žliobaite, “Change with delayed labeling: When is it detectable?” in
Data Mining Workshops (ICDMW), 2010 IEEE International Conference
on. IEEE, 2010, pp. 843–850.
[26] P. M. Domingos, “A unified bias-variance decomposition for zero-one
and squared loss,” in AAAI 2000, 2000, pp. 564–569.
[27] L. I. Kuncheva, “That elusive diversity in classifier ensembles,” in
Iberian conference on pattern recognition and image analysis. Springer,
2003, pp. 1126–1138.
[28] P. Domingos and G. Hulten, “Mining high-speed data streams,” in
Proceedings of the sixth ACM SIGKDD international conference on
Knowledge discovery and data mining. ACM SIGKDD, Sep. 2000,
pp. 71–80.
[29] O. Bousquet and A. Elisseeff, “Stability and generalization,” Journal of
Machine Learning Research, vol. 2, pp. 499–526, 2002.
[30] E. Ikonomovska, J. Gama, and S. Džeroski, “Learning model trees from
evolving data streams,” Data Mining and Knowledge Discovery, vol. 23,
no. 1, pp. 128–168, 2011.
[31] Y. Lin and Y. Jeon, “Random forests and adaptive nearest neighbors,”
Journal of the American Statistical Association, vol. 101, no. 474, pp.
578–590, 2006.
[32] N. Lim and R. J. Durrant, “Linear dimensionality reduction
in linear time: Johnson-lindenstrauss-type guarantees for random
subspace,” arXiv, vol. 1705.06408, 2017. [Online]. Available: https:
//arxiv.org/abs/1705.06408
[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: a simple way to prevent neural networks
from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1,
pp. 1929–1958, 2014.
[34] R. A. Servedio, “Smooth boosting and learning with malicious noise,”
Journal of Machine Learning Research, vol. 4, no. Sep, pp. 633–648,
2003.
[35] S.-T. Chen, H.-T. Lin, and C.-J. Lu, “An online boosting algorithm with
theoretical justifications,” in Proceedings of the International Conference
on Machine Learning (ICML), June 2012.
[36] N. Littlestone and M. K. Warmuth, “The weighted majority algorithm,”
Information and Computation, vol. 108, no. 2, pp. 212–261, 1994.
[37] H. Abdulsalam, D. B. Skillicorn, and P. Martin, “Classifying evolving
data streams using dynamic streaming random forests,” in International
Conference on Database and Expert Systems Applications. Springer,
2008, pp. 643–651.
[38] G. Holmes, R. Kirkby, and B. Pfahringer, “Stress-testing hoeffding
trees,” in Knowledge Discovery in Databases: PKDD 2005, 2005, pp.
495–502. [Online]. Available: https://doi.org/10.1007/11564126_50
[39] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,”
Journal of Machine Learning Research, vol. 7, pp. 1–30, Dec. 2006.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1248547.1248548
[40] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “Moa: Massive online
analysis,” The Journal of Machine Learning Research, vol. 11, pp. 1601–
1604, 2010.
APPENDIX
A. Software and hardware
All the experiments were executed in the Massive Online
Analysis (MOA) framework [40] version 2019.04 build. De-
tails about the hardware configuration are shown below:
CPU: 40 cores, Intel(R) Xeon(R) CPU E5-2660 v3 @
2.60GHz
Operating System: Ubuntu 16.04.5 LTS
Java Virtual Machine (JVM) version: JDK 1.8.0
JVM: Xmx = 100GB, Xms = 50MB
To reproduce the experiments the reader should access the
GitHub repository
https://github.com/hmgomes/StreamingRandomPatches
which contains the code, the datasets used, and instructions
on how to execute the algorithm.
B. Datasets
Table III presents an overview of the datasets.
TABLE III: Datasets (Drifts: (A) Abrupt, (G) Gradual, (M)
Incremental (moderate) and (F) Incremental (fast)). AGR and
LED concept drifts are introduced every 250k instances.
Dataset   #Instances  #Features  Type       Drifts  #Classes
LED(A)    1,000,000   24         Synthetic  A       10
LED(G)    1,000,000   24         Synthetic  G       10
AGR(A)    1,000,000   9          Synthetic  A       2
AGR(G)    1,000,000   9          Synthetic  G       2
RBF(M)    1,000,000   10         Synthetic  M       5
RBF(F)    1,000,000   10         Synthetic  F       5
AIRL      539,383     7          Real       -       2
ELEC      45,312      8          Real       -       2
COVT      581,012     54         Real       -       7
KDD99     4,898,431   41         Real       -       23
ADS       3,279       1,559      Real       -       2
NOMAO     34,465      119        Real       -       2
SPAM      9,324       39,917     Real       -       2
... The accuracy and F-measure of PWPAE are impressive at 99.20% and 97.67%, respectively, with an execution time of 6.4 ms. Furthermore, a study [56] introduced Deep Neural Networks (DNN) coupled with the k-nearest neighbor (kNN) algorithm to counter application attacks. This hybrid approach achieves remarkable accuracy, reaching 99.85%, with an F-measure of 97.67%. ...
Article
Full-text available
The surge in Internet of Things usage has raised security breaches within the IoT ecosystem. Consequently, there is a pressing need to deploy robust Intrusion Detection Systems (IDSs) to safeguard IoT environments. This paper proposes a framework designed to establish stringent decision boundaries for effective attack detection, leveraging two prevalent datasets: CICIDS2017 and EDGE-IIOT. These datasets exhibit imbalanced class distributions and encompass numerous features with distinct characteristics. To address the class imbalance, the framework employs sampling techniques such as the synthetic minority oversampling technique with a genetic algorithm (GA-SMOTE) and with particle swarm optimization (SMOTE-PSO) along with random undersampling (RUS). The proposed framework utilizes tree-based learning algorithms, Decision Tree, Random Forest, and XGBoost, to identify cyberattacks and associated anomalies within the constrained IoT landscape. Feature selection is performed using the Boruta and WOA algorithms, and pruning algorithms are used to optimize the complexity of the model. The efficacy of the framework is evaluated using standard metrics on both workstations and Raspberry Pi boards to demonstrate its effectiveness on constrained IoT devices. The evaluation results demonstrate that the proposed model achieves a remarkable accuracy of 99.99% in identifying cyberattacks and related anomalies, exceeding the performance of existing baseline models in the CICIDS2017 dataset. It also obtains a high accuracy of 99.5% on EDGE-IIOT dataset. Furthermore, the framework shows promising results in terms of memory usage and execution time, achieving the best performance of 3.07 MB of memory usage and 4.26 s of execution time for the CICIDS2017 dataset and 1.93 MB of memory usage and 4.09 s of execution time for the EDGE-IIOT dataset when implemented on Raspberry Pi boards.
... The Hoeffding Adaptive Tree (HAT) enhances adaptiveness by replacing old branches dynamically using metrics such as Adaptive Windowing (ADWIN) algorithm [34] and also proposes a bootstrapping sampling on top of Hoeffding Trees. Bagging [35] and boosting-based [36] techniques have recently proven their success as part of ensembles in data stream learning like Adaptive Random Forests (ARF) [3] and Streaming Random Patches (SRP) [37]. ARF [3] is an enhanced adaptive ensemble with diversity through resampling and random node splitting, equipped with drift detection per node for adaptive training. ...
Conference Paper
The Internet of Things is an example domain where data is perpetually generated in ever-increasing quantities, reflecting the pro- liferation of connected devices and the formation of continuous data streams over time. Consequently, the demand for ad-hoc, cost-effective machine learning solutions must adapt to this evolving data influx. This study tackles the task of offloading in small gateways, exacerbated by their dynamic availability over time. An approach leveraging CPU uti- lization metrics using online and continual machine learning techniques is proposed to predict gateway availability. These methods are compared to popular machine learning algorithms and a recent time-series founda- tion model, Lag-Llama, for fine-tuned and zero-shot setups. Their per- formance is benchmarked on a dataset of CPU utilization measurements over time from an IoT gateway and focuses on model metrics such as pre- diction errors, training and inference times, and memory consumption. Our primary objective is to study new efficient ways to predict CPU performance in IoT environments. Across various scenarios, our findings highlight that ensemble and online methods offer promising results for this task in terms of accuracy while maintaining a low resource footprint. Code is available at https://github.com/sebasmos/AML4CPU
Article
Data stream learning is a very relevant paradigm because of the increasing real-world scenarios generating data at high velocities and in unbounded sequences. Stream learning aims at developing models that can process instances as they arrive, so models constantly adapt to new concepts and the temporal evolution in the stream. In multi-label data stream environments where instances have the peculiarity of belonging simultaneously to more than one class, the problem becomes even more complex and poses unique challenges such as different concept drifts impacting different labels at simultaneous or distinct times, higher class imbalance, or new labels emerging in the stream. This paper proposes a novel approach to multi-label data stream classification called Multi-Label Hoeffding Adaptive Tree (MLHAT). MLHAT leverages the Hoeffding adaptive tree to address these challenges by considering possible relations and label co-occurrences in the partitioning process of the decision tree, dynamically adapting the learner in each leaf node of the tree, and implementing a concept drift detector that can quickly detect and replace tree branches that are no longer performing well. The proposed approach is compared with other 18 online multi-label classifiers on 41 datasets. The results, validated with statistical analysis, show that MLHAT outperforms other state-of-the-art approaches in 12 well-known multi-label metrics.
Article
Concept drift is an important characteristic and inevitable difficult problem in streaming data mining. Ensemble learning is commonly used to deal with concept drift. However, most ensemble methods cannot balance the accuracy and diversity of base learners after drift occurs, and cannot adjust adaptively according to the drift type. To solve these problems, this paper proposes a targeted ensemble learning (Targeted EL) method to improve the accuracy and diversity of ensemble learning for streaming data with abrupt and gradual concept drift. Firstly, to improve the accuracy of the base learners, the method adopts different sample weighting strategies for different types of drift to realize bidirectional transfer of new and old distributed samples. Secondly, the difference matrix is constructed by the prediction results of the base learners on the current samples. According to the drift type, the submatrix with appropriate size and maximum difference sum is extracted adaptively to select appropriate, accuracy and diverse base learners for ensemble. The experimental results show that the proposed method can achieve good generalization performance when dealing with the streaming data with abrupt and gradual concept drift.
Article
Full-text available
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
Article
Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integrated with drift detection algorithms and incorporate dynamic updates, such as selective removal or addition of classifiers. This work proposes a taxonomy for data stream ensemble learning as derived from reviewing over 60 algorithms. Important aspects such as combination, diversity, and dynamic updates, are thoroughly discussed. Additional contributions include a listing of popular open-source tools and a discussion about current data stream research challenges and how they relate to ensemble learning (big data streams, concept evolution, feature drifts, temporal dependencies, and others).
Article
Most machine learning models are static, but the world is dynamic, and increasing online deployment of learned models gives increasing urgency to the development of efficient and effective mechanisms to address learning in the context of non-stationary distributions, or, as it is commonly called, concept drift. However, the key issue of characterizing the different types of drift that can occur has not previously been subjected to rigorous definition and analysis. In particular, while some qualitative drift categorizations have been proposed, few have been formally defined, and the quantitative descriptions required for detailed understanding of learner performance have not existed. We present the first comprehensive framework for quantitative analysis of drift. This supports the development of the first comprehensive set of formal definitions of types of concept drift. The formal definitions clarify ambiguities and identify gaps in previous definitions, giving rise to a new comprehensive taxonomy of concept drift types and a solid foundation for research into mechanisms to detect and address concept drift.
Article
The success of simple methods for classification shows that it is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this article, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the log-odds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons' learning rate using the ADWIN change detection method for data streams, and also use ADWIN to reset ensemble members (i.e., Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption. We also show that our stacking method can improve the performance of a bagged ensemble.
Article
Internet of Things (IoT) has provided a promising opportunity to build powerful industrial systems and applications by leveraging the growing ubiquity of radio-frequency identification (RFID), and wireless, mobile, and sensor devices. A wide range of industrial IoT applications have been developed and deployed in recent years. In an effort to understand the development of IoT in industries, this paper reviews the current research of IoT, key enabling technologies, major IoT applications in industries, and identifies research trends and challenges. A main contribution of this review paper is that it summarizes the current state-of-the-art IoT in industries systematically.
Conference Paper
In this paper, we consider supervised learning under the assumption that the available memory is small compared to the dataset size. This general framework is relevant in the context of big data, distributed databases and embedded systems. We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset. We carry out an extensive and systematic evaluation of this method on 29 datasets, using decision tree-based estimators. With respect to popular ensemble methods, these experiments show that the proposed method provides on par performance in terms of accuracy while simultaneously lowering the memory needs, and attains significantly better performance when memory is severely constrained.
Article
Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets. © 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.
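The train-time and test-time behaviour described in the dropout abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `dropout_train` and `dropout_test`, the retention probability `keep_prob`, and the toy activation values are assumptions chosen for illustration. Scaling a unit's activation by `keep_prob` at test time stands in for shrinking its outgoing weights, which gives the same input to a following linear layer.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Training-time pass: keep each unit with probability keep_prob, otherwise zero it."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def dropout_test(activations, keep_prob):
    """Test-time pass: keep every unit but scale it by keep_prob.

    For a following linear layer this matches multiplying the outgoing
    weights by keep_prob (the "smaller weights" of the unthinned network).
    """
    return [a * keep_prob for a in activations]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
hidden = [1.0, 2.0, 3.0, 4.0]   # hypothetical hidden-layer activations
keep_prob = 0.5                 # assumed retention probability

# Average the outputs of many randomly thinned networks.
n_samples = 20000
totals = [0.0] * len(hidden)
for _ in range(n_samples):
    for i, value in enumerate(dropout_train(hidden, keep_prob, rng)):
        totals[i] += value
mc_average = [t / n_samples for t in totals]

# The single scaled pass approximates that average.
scaled = dropout_test(hidden, keep_prob)
print("average over thinned samples:", [round(m, 2) for m in mc_average])
print("single scaled pass:          ", scaled)
```

With enough samples, the average over thinned passes should approach the single scaled pass, which is the approximation the abstract refers to. Many modern libraries instead rescale by `1 / keep_prob` during training (inverted dropout) and leave the test-time pass unchanged; the sketch follows the formulation in the abstract.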