
Streaming Random Patches for Evolving Data Stream Classification

Heitor Murilo Gomes∗†, Jesse Read‡, Albert Bifet∗†
∗University of Waikato, Hamilton, New Zealand
{heitor.gomes, albert.bifet}@waikato.ac.nz
†LTCI, Télécom Paris, IP-Paris, Paris, France
‡LIX, École Polytechnique, Palaiseau, France
jesse.read@polytechnique.edu

Abstract—Ensemble methods are a popular choice for learning from evolving data streams. This popularity is due to (i) the ability to simulate simple, yet successful, ensemble learning strategies, such as bagging and random forests; (ii) the possibility of incorporating drift detection and recovery in conjunction with the ensemble algorithm; (iii) the availability of efficient incremental base learners, such as Hoeffding Trees. In this work, we introduce the Streaming Random Patches (SRP) algorithm, an ensemble method specially adapted to stream classification which combines random subspaces and online bagging. We provide theoretical insights and empirical results illustrating different aspects of SRP. In particular, we explain how the widely adopted incremental Hoeffding trees are not, in fact, unstable learners, unlike their batch counterparts, and how this fact significantly influences ensemble method design and performance. We compare SRP against state-of-the-art ensemble variants for streaming data in a multitude of datasets. The results show that SRP produces high predictive performance on both real and synthetic datasets. In addition, we analyze the diversity over time and the average tree depth, which provides insights on the differences between local subspace randomization (as in random forest) and global subspace randomization (as in random subspaces).

Index Terms—Stream Data Mining, Ensemble Learning, Random Subspaces, Random Patches

I. INTRODUCTION

Machine learning applications of data streams have grown in importance in recent years due to the tremendous amount of real-time data generated by networks, mobile phones, and the wide variety of sensors currently available. Building predictive models from data streams is central to many applications [1]. The underlying assumption of data stream learning is that the algorithms must process large amounts of data in a fast-paced way. In a supervised learning scenario, this characteristic brings forward two crucial challenges:

• Computational efficiency. The algorithm must use a limited budget of computational resources to be able to process examples at least as fast as new examples arrive;

• Evolving data. The continuous flow of data might be subject to changes over time, where the canonical example is concept drift [2]. Concept drifts can be characterized as changes in the underlying data distribution that affect the fitted model, such that to maintain its predictive performance the model must be updated or even reset.

To tackle evolving data, many strategies have been proposed, with particular attention to ensemble-based methods. Ensembles are often used to cope with concept drifts by selectively resetting component learners [3]–[6]. Concerning computational efficiency, ensembles of learners require more computational resources than a single learner; however, many are easy to parallelize [6].

In the traditional batch learning setting, several ensemble methodologies are widely used, such as Random Subspaces [7], Pasting [8], Bagging [9], Random Forest [10], SubBag [11], and Random Patches [12]. The main differences among these algorithms lie in how they induce diversity into the ensemble. Random subspace methods train each base learner on a separate, randomly selected subset of features. Pasting and Bagging train base learners on samples of instances drawn without and with replacement, respectively, from the original dataset. Random Forest extends Bagging and randomly selects subsets of features to be considered for splits in its base learners (decision trees). SubBag and Random Patches combine Bagging with Random Subspaces and Pasting with Random Subspaces, respectively; thus, through very similar means, they train base learners on random subsets of features and samples. Other ensembles that are popular in batch learning, such as AdaBoost [13], are less attractive for data streams, partially because the original batch learning implementations introduce dependencies among the base learners, which are difficult to simulate appropriately in a streaming setting [14].

In this work, we propose strategies to cope with classification problems on evolving data streams using an ensemble strategy that combines random subspaces and bagging. We name this ensemble Streaming Random Patches (SRP) as it is inspired by the Random Subspaces method [7] and Online Bagging [15], and thus resembles the Random Patches [12] algorithm. SRP incorporates an active drift detection strategy, similarly to other ensemble methods, e.g., Leveraging Bagging [5] and Adaptive Random Forest (ARF) [6]. The drift detection and recovery strategy in SRP follows the approach used in ARF. ARF consistently outperforms other state-of-the-art ensembles for evolving data streams, partially due to this strategy [6]; moreover, by using the same procedure in SRP, we can compare it to ARF in terms of the ensemble strategy, without interference from the approach used to cope with evolving data.

Similar algorithms based on the Random Subspaces method [7], or on combinations of resampling and random subspaces [11], [12], have been previously explored in batch learning for high-dimensional datasets [16] and also for evolving data stream classification [5], [6], [17], [18]. Nevertheless, to the best of our knowledge, none of these previous works thoroughly investigated the joint impact of online bagging and random subspaces for evolving data streams. Similarly, previous works have not outlined the similarities and differences between a global and a local randomization strategy for the subset of features on streaming data. We use the same definition of global and local randomization as in [12], i.e., in the random subspaces method, the subspace of features is selected globally once for the whole base learner, while in the random forest algorithm the subspaces are selected locally for each leaf of the base tree [12]. We discuss the impact of both strategies in our experiments (Section V) while comparing ARF and SRP. Panov and Džeroski, and Louppe and Geurts, conducted similar investigations for the batch setting in [11] and [12], respectively.

Paper contributions and roadmap. Our main contributions can be summarized as follows:

1) Streaming Random Patches (SRP): We introduce an ensemble-based method, namely SRP, that achieves high accuracy by training base models on random subsets of features and instances (the implementation and instructions are available at https://tinyurl.com/yytbom4e);

2) Theoretical insights: We analyze the SRP algorithm with particular attention to the questions of the stability and diversity of Hoeffding trees, and the impact of the global subspace randomization in SRP in opposition to the local randomization in ARF;

3) Empirical analysis: We compare SRP against state-of-the-art ensemble variants for streaming data in a multitude of datasets. The results give a clear overview of predictive performance and resource usage. In addition, we analyze the diversity over time and the average tree depth, which provides some insights on the differences between local and global subspace randomization.

The rest of this paper is organized as follows. In Section II, we introduce the problem of learning classification models from evolving data streams. In Section III, we present the SRP algorithm and theoretical insights. In Section IV, related works are discussed and compared to our approach. In Section V, we present the experiments conducted to analyze SRP in terms of accuracy, computational resources, diversity, and decision tree depth. Finally, Section VI concludes this work and presents directions for future work.

II. PROBLEM SETTING

Let $X = \{x_{-\infty}, \ldots, x_{-1}, x_0\}$ be an open-ended sequence of observations collected over time, containing input examples in which $x_k \in \mathbb{R}^n$ and $n \geq 1$. Similarly, let $y$ be an open-ended sequence of corresponding class labels, such that every example in $X$ has a corresponding entry in $y$. Moreover, $y_k$ has a finite set of possible values, i.e., $y_k \in \{l_1, \ldots, l_L\}$ for $L \geq 2$, such that a classification task is defined. Furthermore, we assume a problem setting where new input examples $x$ are presented every $u$ time units to the learning model for prediction, such that $x_k^t$ represents a vector of features available at time $t$. The true class label $y_k^{t+1}$, corresponding to instance $x_k^t$, is available before the next instance $x^{t+1}$ appears, and thus it can be used for training immediately after it has been used for prediction. We emphasise that this experimental setting can be naturally extended to the delayed and weakly-supervised settings by considering a non-negligible time delay between observing $x$ and its class label $y$, including an infinite delay (i.e., the label is never observed). However, the conclusions drawn from experimenting in such settings are similar to those in the "immediate" setting, as shown in [6]. Therefore, for simplicity, we omit such results in this paper.

An important characteristic of data stream classification is whether the data distribution is stationary or evolving. In this work, we assume evolving data distributions; thus we expect the occurrence of concept drifts (a formal definition of concept drift can be found in [19]) that might influence decision boundaries. Note that if a concept drift is accurately detected (without false negatives) and dealt with (by fully or partially resetting models as appropriate), an iid assumption can be made on a per-concept basis, since each concept can be treated as a separate iid stream, yielding a series of iid streams to be dealt with. Nevertheless, the typically fast and dynamic nature of a data stream encourages the in-depth study that we present in this work.

III. STREAMING RANDOM PATCHES

Streaming Random Patches (SRP) can be viewed as an adaptation of batch learning ensemble methods that combine random samples of instances and random subspaces of features [11], [12]. Following the terminology introduced in [12], in the rest of this work we refer to random subsets of both features and instances as random patches. Fig. 1 presents an example of subsampling both instances and features simultaneously from streaming data, where only the shaded intersections of the matrix belong to the subsample, i.e., $\{v_{1,1}, v_{2,1}, v_{6,1}, v_{1,3}, v_{2,3}, v_{6,3}\}$.

Fig. 1: Representation of a data stream as an unbounded table where the rows are infinite, but the columns are constrained by m input features.

Our motivation for exploiting an ensemble of base models trained on random patches is based mainly on the high predictive performance of ensembles for data stream learning that add randomization to the base models, either by training them on random samples of instances [5], on random subsets of features [17], or on both [6]. We investigate whether selecting the subset of features globally, once before constructing each base model, outperforms selecting subsets of features locally at each node while constructing base trees, as in Random Forest. In [12], the authors show empirical evidence that Random Patches combined with tree-based models achieves accuracy similar to other randomization strategies, including Random Forest [10], while using less memory.

The original Random Patches algorithm [12] is defined in terms of all possible subsets of features and instances, such that $R(p_s, p_f, D)$ denotes all random patches of size $p_s N_s \times p_f N_f$ that can be drawn from the training set $D$, where $N_s$ and $N_f$ represent the number of instances and features, respectively, in $D$. The hyperparameters $p_s \in [0,1]$ and $p_f \in [0,1]$ represent, respectively, the proportion of samples and features in each patch $r \in R(p_s, p_f, D)$. In SRP, the set of all possible streaming random patches $R_s(\lambda, p_f, S)$ is infinite in the sample dimension, as the input training data is represented by a data stream $S$. We control the number of samples in the streaming patch using the Poisson parameter $\lambda$ (Section III-A).

A. Random Subsets of Instances

In the batch setting, Bagging builds $L$ base models, training each model with a bootstrap sample from the original training dataset of size $N$. Each bootstrap contains each original training example $K$ times, where $Pr(K = k)$ follows a binomial distribution which, for large $N$, tends to a Poisson(λ = 1) distribution. Using this fact, Oza and Russell [15] proposed Online Bagging, an online method that, instead of sampling with replacement, gives each example a weight according to Poisson(λ = 1).

Leveraging Bagging [5] and Adaptive Random Forest [6] train their base models according to a Poisson(λ = 6) distribution, which on average augments the weight of each training instance and diminishes the probability of not using an instance for training, i.e., Pr[Poisson(λ = 6) = 0] ≈ 0.25%, while Pr[Poisson(λ = 1) = 0] ≈ 36.8%. Using Poisson(λ = 6) tends to improve the predictive performance of the ensemble as the base models are updated more often, but this benefit comes at the expense of computational resources.
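As a concrete illustration of this resampling simulation, the following Python sketch (an illustration under assumed names, not the MOA implementation; `model.learn` is a hypothetical incremental-learning call) applies a Poisson weight per base model and instance:

```python
import numpy as np

rng = np.random.default_rng(42)

def online_bagging_step(models, x, y, lam=6.0):
    """One online bagging step: each model sees (x, y) k times, k ~ Poisson(lam).

    lam = 1 mimics classical bagging for large N; lam = 6 (as in LB, ARF,
    and SRP) updates each model more often: Pr[k = 0] = e^{-6} ~ 0.25%,
    versus e^{-1} ~ 36.8% for lam = 1.
    """
    for model in models:
        k = rng.poisson(lam)   # per-(model, instance) weight
        for _ in range(k):     # hypothetical incremental update
            model.learn(x, y)
```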

Minku et al. [20] used λ as a proxy for diversity, i.e., the lower λ, the more diversity would be induced into the ensemble. As pointed out by Stapenhurst [21], for iid data the base models will eventually converge, and even faster given larger values of λ. One important question to be addressed then is: why does Poisson(6) work if only a small portion of the data is withheld from each learner? In the long run, the base models start to converge. This can be visualized in Section V, where diversity is shown over time for the AGRAWAL generator: once a concept becomes stable, the average Kappa statistic starts to increase (i.e., the outputs of the base models start to converge) if the only means of decorrelating the base models is resampling with replacement simulated with Poisson(6). This motivates the addition of other techniques to induce diversity (Section III-B).

B. Random Subsets of Features

Random Subspaces are sensitive to the hyperparameters m (size of subspace) and n (number of learners). For a feature space of M features, there are $2^M - 1$ different non-empty subsets of features. Thus, it is infeasible to train one learner per possible subset for even moderate values of M, especially for streaming data, where processing time and memory are restricted [22]. Ho noted in [7] that highly accurate ensembles could be obtained far before all possible combinations of subspaces are explored. Later, Kuncheva et al. [16] provided a thorough analysis of the random subspace method for the functional magnetic resonance imaging (fMRI) data problem, which resulted in insights for selecting values of m and n that generate usable learners, i.e., learners that contain at least one 'relevant' feature in their subset of features.

In our problem setting, one reason to train base models on random subspaces of features, on top of training them on different subsets of instances, is to add even further diversity to the models. Even if the models converge because of iid data (Section III-A), by training them on separate subspaces of features we have higher chances of producing models that maintain some level of diversity.

There is a risk of subspaces including only irrelevant features. Two mechanisms help mitigate this situation, as sketched below: (i) resetting subspaces once a model is reset in response to a concept drift; and (ii) weighting the votes of base models based on their predictive performance, such that base models with only irrelevant features are expected to produce poor predictive performance and the other base models dominate their votes.
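The following sketch illustrates both mechanisms under assumed names (the learner dictionaries, `make_model`, and `predict` are hypothetical placeholders; this is not the MOA code): a global subspace per learner that is redrawn on reset, and an accuracy-weighted vote.

```python
import random
from collections import Counter

def draw_subspace(num_features, m):
    """Globally select a random subspace of m feature indices for one learner."""
    return random.sample(range(num_features), m)

def weighted_vote(learners, x):
    """Accuracy-weighted majority vote: learners stuck with irrelevant
    subspaces accumulate low weights and are dominated by the others."""
    votes = Counter()
    for l in learners:
        x_sub = [x[i] for i in l["subspace"]]  # project x onto the subspace
        votes[l["model"].predict(x_sub)] += l["weight"]
    return votes.most_common(1)[0][0]

def reset_learner(l, make_model, num_features, m):
    """Mechanism (i): on a drift-triggered reset, redraw the subspace too."""
    l["model"] = make_model()
    l["subspace"] = draw_subspace(num_features, m)
    l["weight"] = 1.0
```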

C. Drift Detection and Recovery

The ultimate goal of drift detection in our context is to allow automatic recovery from a state where the model performance is degrading. To achieve this goal, we need an accurate drift detector and a proper action to be triggered as a response to the drift signal. Currently, the most successful supervised learning methods follow a simple, yet effective, approach: when a concept drift is detected, the underlying model is reset [5], [6]. If the detection algorithm misses a change or takes too long to detect it, the model will degrade. On the other hand, if it yields too many false positives, it will continuously trigger model resets and consequently prevent the algorithm from building an accurate model.

We use the same strategy to detect and recover from concept drifts as introduced in the Adaptive Random Forest (ARF) [6] algorithm. In this strategy, the correct/incorrect predictions of each base model are monitored by a detection algorithm. When the drift detection algorithm flags a warning, a new base model starts training in the 'background', where 'background' means that it does not influence the ensemble decision with its predictions. If the warning escalates to a concept drift, then the background model replaces the associated base model.

The strategy accommodates different drift detection algorithms; however, to facilitate discussion, we focus the experiments and analysis on SRP with the ADaptive WINdowing (ADWIN) algorithm [23]. ADWIN is a change detector and estimator that solves, in a well-specified way, the problem of tracking the average of a stream of bits or real-valued numbers. ADWIN keeps a variable-length window of recently seen items, with the property that the window has the maximal length statistically consistent with the hypothesis "there has been no change in the average value inside the window". More precisely, an older fragment of the window is dropped if and only if there is enough evidence that its average value differs from that of the rest of the window. This has two consequences: one, that change is reliably declared whenever the window shrinks; and two, that at any time the average over the existing window can be reliably taken as an estimation of the current average in the stream (barring a very small or very recent change that is still not statistically visible). A formal and quantitative statement of these two points (a theorem) appears in [23]. ADWIN is parameter- and assumption-free in the sense that it automatically detects and adapts to the current rate of change. Its only parameter is the confidence bound δ, indicating how confident we want to be in the algorithm's output, a parameter inherent to all algorithms dealing with random processes.
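As a rough, didactic approximation only (the exact cut criterion, guarantees, and the efficient bucket-based implementation are given in [23]), an ADWIN-style detector can be sketched as follows; it keeps a window of recent 0/1 outcomes and drops the older fragment when the means of two sub-windows differ beyond a Hoeffding-style bound:

```python
import math
from collections import deque

class SimplifiedAdwin:
    """Didactic sketch of an ADWIN-style detector (not the real ADWIN,
    which compresses the window into exponential buckets for efficiency)."""

    def __init__(self, delta=0.002, max_len=2000):
        self.delta = delta
        self.window = deque(maxlen=max_len)

    def update(self, value):
        """Add a 0/1 outcome; return True if a change was detected."""
        self.window.append(value)
        items = list(self.window)
        n = len(items)
        for cut in range(16, n - 16):            # candidate cut points
            old, new = items[:cut], items[cut:]
            m = 1.0 / (1.0 / len(old) + 1.0 / len(new))  # harmonic size
            eps = math.sqrt(math.log(4.0 / self.delta) / (2.0 * m))
            if abs(sum(old) / len(old) - sum(new) / len(new)) > eps:
                for _ in range(cut):             # drop the older fragment
                    self.window.popleft()
                return True
        return False
```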

There are no guarantees that a detection algorithm based on the correct/incorrect predictions will be accurate, but it will at least be able to detect changes in the underlying data that genuinely affect the decision boundary (real drifts), while neglecting those that do not (virtual drifts) [24]. One disadvantage of this strategy is that it requires access to labelled data, which is not an issue given our problem setting (Section II); however, for problems that include verification latency or weakly-labeled streams, other drift detection strategies must be explored [25].

The pseudocode for SRP is depicted in Alg. 1. The training instances are used to evaluate the classification performance of each base model before being used for training, and this estimation is used as the learner weight during voting (line 9, Alg. 1). For non-stationary data streams, we should consider that the relevant features, i.e., those that can effectively be used to predict the class label, may change over time. Therefore, when a background learner is created, a new random subspace is generated for it (line 12, Alg. 1). Background models are trained during the period between the warning that triggered their creation and the concept drift signal that causes them to replace the previous base model; thus, models added to the ensemble never start as entirely new base models (line 15, Alg. 1).

Algorithm 1 Streaming Random Patches.
Symbols: m: maximum features per subset; λ: Poisson distribution parameter; n: total number of models (n = |L|); δw: warning threshold; δd: drift threshold; S: data stream; B: set of background models; W(l): weight of model l; P(·): model predictive performance estimation function; d(·): drift detection method.

1: function TrainSRP(m, n, δw, δd)
2:   L ← CreateBaseModels(n, m)
3:   W ← InitWeights(n)
4:   B ← ∅
5:   while HasNext(S) do
6:     (x, y) ← next(S)
7:     for all l ∈ L do
8:       ŷ ← predict(l, x)
9:       W(l) ← P(W(l), ŷ, y)
10:      Train(m, l, x, y)
11:      if d(δw, l, x, y) then    ▷ Warning detected?
12:        B(l) ← CreateBkgModel(m)
13:      end if
14:      if d(δd, l, x, y) then    ▷ Drift detected?
15:        l ← B(l)                ▷ Replace l by bkg learner
16:      end if
17:    end for
18:    for all b ∈ B do
19:      Train(m, b, x, y)
20:    end for
21:  end while
22: end function
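For readability, the control flow of Alg. 1 can be mirrored in Python roughly as follows (a schematic sketch: `make_model`, `make_detector`, `update_performance`, and the `draw_subspace` helper from the earlier sketch are assumed/hypothetical; it is not the MOA implementation):

```python
def train_srp(stream, n, num_features, m, make_model, make_detector):
    """Schematic mirror of Alg. 1; comments refer to Alg. 1 line numbers."""
    learners = [{"model": make_model(),
                 "subspace": draw_subspace(num_features, m),
                 "weight": 1.0,
                 "warning": make_detector(),   # monitors with delta_w
                 "drift": make_detector()}     # monitors with delta_d
                for _ in range(n)]             # lines 2-3
    background = {}                            # B, line 4

    for x, y in stream:                        # lines 5-6
        for i, l in enumerate(learners):       # line 7
            x_sub = [x[j] for j in l["subspace"]]
            err = int(l["model"].predict(x_sub) != y)          # line 8
            l["weight"] = update_performance(l["weight"], err) # line 9
            l["model"].learn(x_sub, y)                         # line 10
            if l["warning"].update(err):                       # line 11
                background[i] = {"model": make_model(),        # line 12
                                 "subspace": draw_subspace(num_features, m)}
            if l["drift"].update(err) and i in background:     # line 14
                bg = background.pop(i)                         # line 15
                l["model"], l["subspace"] = bg["model"], bg["subspace"]
                l["weight"] = 1.0
        for b in background.values():          # lines 18-19
            b["model"].learn([x[j] for j in b["subspace"]], y)
```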

D. SRP Theoretical Insights

Bagging is well known in the machine learning literature for its effect on reducing variance, both in regression and classification [9], [26], which allows it to perform competitively in a wide range of scenarios, including data streams [5], [15]. In theory, the reduction of the error is strictly related to how uncorrelated the prediction errors are [9]. Entirely uncorrelated predictions are rarely achievable in practice, yet they are achieved to some extent by encouraging diversity among the learning models [27]. This itself implies a need to use unstable learners. The standard (batch, unpruned) decision tree is a prime example of an unstable learner: small changes to a training sample can result in remarkably different models, and thus diversity among predictions. Indeed, one readily observes that decision trees are used throughout the literature.

In the context of data streams, Hoeffding trees [28] are the popular choice of decision tree, since they are incremental. However, crucially, Hoeffding trees, unlike their batch counterparts, are in fact stable learners. As far as we are aware, we are among the first to focus on this fact in the context of ensembles.

Splitting is supported statistically under the Hoeffding bound. This guarantees, to a certain (user-specified) confidence level, that under a sufficiently large number of examples a Hoeffding tree built incrementally will be equivalent to a batch-built tree. Until such a number of examples is seen, however, Hoeffding trees will not grow, and this implies stability.

Formally, we may measure the stability of an algorithm as, for example, hypothesis stability. In the following, we adapt the discussion of [29] to the streaming setting. Suppose that $A_S$ denotes that an algorithm $A$ (e.g., a C4.5 or Hoeffding tree inducer) induces decision function $f$ (e.g., a decision tree) over data stream segment $S$ of pairs $(x_k, y_k)$ (the segment is of length $|S| = n$). Let also $S^{\setminus i}$ represent $S$ without the $i$-th sample. Then hypothesis stability can be expressed as

$$\mathbb{E}_{(x,y)}\big[\,|\ell(A_S, (x, y)) - \ell(A_{S^{\setminus i}}, (x, y))|\,\big] < \beta$$

under evaluation function/metric $\ell$.

This captures the intuition that if we remove a sample from the stream, the absolute difference in error of another model trained on this new segment should be less than $\beta$ when compared to the error of the same model built on the original segment (thus indicating its stability in terms of $\beta$). We cannot compute this exactly unless we know the true generating distribution ('concept' in stream terminology) from which $(x_k, y_k)$ pairs are drawn. However, by replacing the expectation with a sum over leave-one-out samples from a real stream, we can empirically investigate and compare the '$\beta$-stability' among learning algorithms with regard to such a stream.
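For illustration, such a leave-one-out estimate could be computed as in the sketch below (assumptions: a hypothetical batch `make_learner().fit(...)` API, the 0/1 loss as the metric ℓ, and a held-out set standing in for the expectation over (x, y)):

```python
def estimate_beta(segment, heldout, make_learner):
    """Leave-one-out estimate of hypothesis stability over a stream segment.

    segment: list of (x, y) pairs forming S, with |S| = n
    heldout: (x, y) pairs approximating the expectation over the concept
    Returns the average absolute 0/1-loss difference (an estimate of beta).
    """
    def losses(model):
        return [int(model.predict(x) != y) for x, y in heldout]

    base = losses(make_learner().fit(segment))     # trained on S
    diffs = []
    for i in range(len(segment)):                  # trained on S \ i
        reduced = make_learner().fit(segment[:i] + segment[i + 1:])
        li = losses(reduced)
        diffs.append(sum(abs(a - b) for a, b in zip(base, li)) / len(base))
    return sum(diffs) / len(diffs)
```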

Repeatedly rebuilding models on relatively small samples of instances is unavoidable in a stream which may experience drift, implying that trees must be fully or partially regrown. By small we mean "insufficiently large with respect to the Hoeffding bound". These episodes add up over the life of a stream to a non-negligible loss of accuracy.

Suppose that this number of examples is n. Like any well-regularized algorithm, a Hoeffding tree does not adhere strongly to the principle of empirical risk minimization; rather, it is forced to accept many errors as a trade-off for long-term similarity to a batch-built tree. This is a problem in terms of Hoeffding tree ensembles, since these errors are likely to be the same, rendering the ensemble decisions likely useless (no advantage compared to a single model). In terms of the bias-variance trade-off, variance goes down at the cost of bias due to Hoeffding stability [30]. However, bagging-based ensemble schemes are primarily for reducing variance and may even increase bias; since variance has already been reduced by stability, bagging is not likely to have a positive effect.

This provides a suitable explanation as to why our proposed SRP method performs well: by effectively reducing the feature space of individual trees, Hoeffding trees are operating on a 'sub-concept', and are stable with respect to that sub-concept but unstable with respect to the complete concept, meaning that the variance reduction of an ensemble still has a beneficial effect.

Furthermore, Random Subspaces are beneficial in the data stream setting because we can view decision trees as adaptive nearest neighbours [31], and Random Subspaces as transformations that preserve the Euclidean geometry of the data [32]. A decision tree splits the overall space into several regions, one for each of its leaves. The prediction for the instances falling into a leaf is based on the majority vote of the instances in that leaf, so we can consider the instances in that leaf as the neighbours of the instances to predict. Random Subspaces are linear transformations that map instances to another space while approximately preserving their Euclidean geometry, a very useful property when applied to nearest neighbours. This is due to the fact that there exist Johnson-Lindenstrauss-type guarantees that Random Subspaces approximately preserve the Euclidean geometry of the data with high probability, as shown in Lemma 1 [32].

Lemma 1. Let $X = \{x_{-(N-1)}, \ldots, x_{-1}, x_0\}$ be a sequence of observations collected over time, containing input examples in which $x_i \in \mathbb{R}^n$ for every $i \in \{-(N-1), \ldots, -1, 0\}$, $n \geq 1$, and satisfying $\|x_i\|_\infty^2 \leq \frac{c}{n}\|x_i\|_2^2$, where $c \in \mathbb{R}^+$ is a constant, $1 \leq c \leq n$. Let $\epsilon, \delta \in (0, 1]$, and let $k \geq \frac{c^2}{2\epsilon^2}\ln\frac{N^2}{\delta}$ be an integer. Let $RS$ be a random subspace projection from $\mathbb{R}^n \mapsto \mathbb{R}^k$. Then, with probability at least $1 - \delta$ over the random draws of $RS$, we have, for every $i, j \in \{-(N-1), \ldots, -1, 0\}$:

$$(1 - \epsilon)\,\|x_i - x_j\|_2^2 \;\leq\; \frac{n}{k}\,\|RS(x_i - x_j)\|_2^2 \;\leq\; (1 + \epsilon)\,\|x_i - x_j\|_2^2$$

The required subspace dimension $k$ is logarithmic in the number of examples, but with a larger constant term.
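This property is easy to check numerically. The sketch below (sizes are arbitrary, chosen for illustration) projects toy data onto a random coordinate subspace and compares the rescaled pairwise distance with the original one:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, k = 200, 1000, 300                  # examples, input dims, subspace size

X = rng.normal(size=(N, n))               # toy data in R^n
S = rng.choice(n, size=k, replace=False)  # random subspace: k coordinates
Xp = X[:, S]                              # RS projection onto R^k

i, j = 0, 1
orig = np.sum((X[i] - X[j]) ** 2)
proj = (n / k) * np.sum((Xp[i] - Xp[j]) ** 2)  # rescaled as in Lemma 1
print(f"original: {orig:.1f}  projected (rescaled): {proj:.1f}")
# For a suitable k, the ratio proj/orig lies within (1 - eps, 1 + eps)
# with probability at least 1 - delta over the draw of S.
```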

Finally, another explanation of the success of Random Patches is dropout [33]. Dropout is a technique used in deep learning to improve the accuracy of neural networks by randomly removing neurons. Random Patches uses a similar technique on instances and attributes, removing many of them, in an efficient random way.

Thus, overall, our proposal creates an artificially smaller feature space, encouraging faster growth; furthermore, even when tree growth is conservative, it can encourage disagreement (avoiding correlation) among the leaf classifiers, even if they would be stable models if run outside the context of such an ensemble. Empirical results, which offer further support to these arguments, are given in Section V.

IV. RELATED WORK

There is an extensive literature on ensemble methods for data stream classification. This preference is counterintuitive given the need for algorithms that use computational resources judiciously; the justification for it is attributable to the flexibility and high predictive performance that ensemble models provide [14]. The seminal work of Kolter and Maloof [3] introduced the Dynamic Weighted Majority (DWM) ensemble method, which featured heuristics to cope with evolving data streams, such as removing base models if their weight dropped below a given threshold and adding new ones according to the global performance of the ensemble. DWM introduces a hyperparameter to control the period (window) between base model additions, removals, and weight updates. Similarly to DWM, the Online Accuracy Updated Ensemble (OAUE) [4] algorithm relies on a window hyperparameter to determine which instances will be used to train a new base model (candidate) and whether it should replace the base model with the worst classification performance in the latest window of instances. OAUE does not use an active drift detection approach; thus, it relies on gradual resets of the ensemble through candidates to adapt to concept drifts. Also, it introduces a weighting mechanism that contributes to the ensemble's adaptation to concept drifts, since the weighting function is designed to assign higher impact to predictions on recently presented instances. Note that DWM and OAUE use incremental base learners; however, they still require the definition of a window to orchestrate their adaptation techniques on evolving data.

Many ensemble methods for data stream learning exploit strategies developed initially for batch learning. Online Bagging [15] trains base models on samples drawn from the data stream, simulating sampling with replacement as in the classical Bagging algorithm [9]. Chen et al. introduce a generalization of SmoothBoost [34], namely Online SmoothBoost (OB) [35], an algorithm that generates only smooth distributions, i.e., it does not assign too much weight to single examples. OB is guaranteed to achieve an arbitrarily small error rate given that the numbers of weak learners and examples are sufficiently large.

Ensembles designed to cope with evolving data streams combine decorrelating base models (e.g., bagging) and voting (e.g., weighted majority vote [36]) with active drift recovery strategies based on change detection algorithms. The Leveraging Bagging (LB) [5] algorithm combines an adapted version of Online Bagging [15] with the ADaptive WINdowing (ADWIN) drift detection algorithm, such that base models are selectively reset whenever their corresponding ADWIN instance flags a drift. Heuristic Updatable Weighted Random Subspaces (HUWRS) [17] trains batch learners (C4.5 decision trees) on random subspaces of features, following the Random Subspace Method (RSM) introduced by Ho [7]. HUWRS detects virtual and real concept drift by computing the Hellinger distance between the binned feature values of every base model and the feature distribution over the latest window of instances when labels are not available, and by computing Hellinger distances between the per-class feature distributions over the latest window of instances otherwise. The weighting of the base models in HUWRS relies on the severity of the change in the distribution of the features associated with each random subspace. The Adaptive Random Forest (ARF) [6] and the Dynamic Streaming Random Forest (DSRF) [37] both aim to adapt the classic Random Forest [10] algorithm to streaming data. Both ARF and DSRF use the incremental decision tree algorithm Hoeffding tree [28]; however, they differ in how the base trees are trained. ARF simulates resampling as in Leveraging Bagging, while DSRF trains trees sequentially on different subsets of data. Moreover, ARF uses a drift detection and recovery strategy based on detecting warnings and drifts per base tree, such that after a warning is triggered, another tree is created and trained without affecting the ensemble predictions (background tree). If the warning escalates to a drift detection, then the base tree is replaced by the background tree.

We briefly introduced the concepts of active and reactive strategies for concept drift recovery and the vast literature on ensemble learning for evolving data stream classification. We refer the reader to [24] and [19] for further information on concept drift, and to [14] for a detailed overview and taxonomy of existing ensemble methods for data stream classification.

V. EXPERIMENTS

We evaluate the SRP implementation against state-of-the-art classification algorithms, concerning both predictive performance and computational resource usage. To analyze the diversity among base models in our newly proposed methods, we present plots depicting the average pairwise kappa over time. Also, to analyze how fast (and deep) the base trees are grown by each ensemble strategy, we include plots of the average tree depth over time. We assess predictive performance through accuracy results using a test-then-train evaluation strategy, where every instance is used first for testing and then for training. The algorithms used in the comparisons are Hoeffding Trees (HT), Naive Bayes (NB), Leveraging Bagging (LB), Adaptive Random Forest (ARF), Online Accuracy Updated Ensemble (OAUE), Dynamic Weighted Majority (DWM), and Online SmoothBoost (OB). HT and NB serve the purpose of baselines, since they are single classifiers often used in data stream classification. LB and ARF are ensemble methods that consistently outperform other ensemble classifiers, as shown in [6] on a benchmark similar to the one used in this work. OB represents a boosting adaptation to online learning, while DWM and OAUE are ensemble methods explicitly developed for data stream classification that rely on different heuristics to address concept drift.
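For reference, the test-then-train (prequential) accuracy used throughout this section follows a loop of this shape (a minimal sketch with a hypothetical incremental classifier API; MOA provides this evaluation natively):

```python
def test_then_train_accuracy(stream, classifier):
    """Every instance is used first for testing and then for training."""
    correct = total = 0
    for x, y in stream:
        correct += int(classifier.predict(x) == y)  # test first
        classifier.learn(x, y)                      # then train
        total += 1
    return 100.0 * correct / total
```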

To analyze how SRP compares to "simple" variants of itself, we present two variations in the experiments, namely Streaming Random Subspaces (SRS) and a Bagging-like strategy (BAG). SRS trains on random subspaces of features as in SRP, but on all instances, without simulating bootstraps, while BAG only simulates bagging, using all features. In the online resources³ we provide two tables analyzing the impact of m in SRP (ranging from 10% up to 100%, the latter matching the variant BAG) and of λ = 1, which impacts the bagging simulation. The experiments in the paper summarize the results in the online resources; still, they are available to the interested reader.

Regarding hyperparameters, we use HT as the base learner for all the ensemble methods. The default subspace size is m = 60% for SRS, SRP, and ARF, except for the experiments with the high-dimensionality dataset SPAM and n = 100, where m = 10% (Table II). In the online resources (Section A), we present complementary experiments varying m from 10% up to 100% (equivalent to BAG) in all datasets. The HT grace period was set to GP = 50, the split confidence to c = 0.01⁴, and the decision strategy used at leaves was Naive Bayes Adaptive, i.e., either Naive Bayes or majority vote is used at a leaf, depending on which one is more accurate [38].

³ https://github.com/hmgomes/StreamingRandomPatches
⁴ GP and c were originally identified as nmin and δ by Domingos and Hulten [28]; however, we choose to keep their acronyms as in the Massive Online Analysis (MOA) framework to facilitate reproducibility.

Fig. 2: Nemenyi test (95% confidence level) with n = 10 base models (left, CD = 3.757) and n = 100 base models (right, CD = 3.031). The avg rank obtained on the SPAM dataset for n = 100 was not considered for any learner since there are no results for LB and BAG.

This HT configuration tends to generate splits earlier at the expense of processing time [6]. ADWIN is used as the drift detector for all ensembles that rely on active drift detection (i.e., ARF, LB, SRP, SRS, and BAG). The δ parameter, which controls the confidence in the change detected, was defined as δ = 0.0001 for warning detection and δ = 0.00001 for drift detection in ARF, SRP, SRS, and BAG. In LB, δ was set according to its default value [5], i.e., δ = 0.002.

The datasets used in the experiments include 6 synthetic data streams and 7 real datasets. The synthetic datasets simulate abrupt, gradual, and incremental drifts, while the real datasets have been widely used in the literature to assess data stream classifiers. Further information concerning the datasets, instructions on how to execute the experiments, and other details for reproducibility are available in Appendix A.

A. Streaming Random Patches vs. Others

The results presented in Table I show how SRP compares against the other algorithms. Similarly, Table II presents how SRP and the other ensembles perform when configured to use n = 100 learners. Besides presenting the average ranking (Avg Rank) for each algorithm, we also highlight the average ranking for the synthetic datasets (Avg Rank Synt.) and the average ranking for the real-world datasets (Avg Rank Real). The reason to report these rankings separately is that some techniques may perform better on synthetic data while not doing so well overall, and it is important to highlight and discuss that. Good performance on the synthetic datasets may indicate an effective drift recovery strategy; however, synthetic data stream concepts tend to be simple or biased towards a specific learning algorithm, therefore an algorithm that produces good results only on synthetic data may offer less credibility. We apply the methodology presented in [39] to compare results among several datasets and algorithms for the experiments presented in Tables I and II. We first attempt to reject the hypothesis that all learners produce equivalent results using a Friedman test at a significance level α = 0.05. The Friedman test indicated significant differences in both sets of results and was followed by a post-hoc Nemenyi test. Figure 2 presents the results of the post-hoc Nemenyi test. We note that no significant difference was found among SRP, BAG, ARF, LB, SRS, and OAUE using n = 10, while using n = 100 there was no significant difference among SRP, SRS, BAG, ARF, and LB.
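The Friedman step of this methodology can be reproduced with SciPy as sketched below (the three accuracy lists are a truncated excerpt of Table I for illustration; the Nemenyi post-hoc test is available in third-party packages such as scikit-posthocs):

```python
from scipy.stats import friedmanchisquare

# One accuracy list per algorithm, aligned over the same datasets
# (excerpt of Table I: LED(A), LED(G), AGR(A); use all rows in practice).
acc_srp = [73.588, 72.416, 91.788]
acc_arf = [73.945, 73.010, 85.646]
acc_lb  = [73.918, 73.076, 86.954]

stat, p = friedmanchisquare(acc_srp, acc_arf, acc_lb)
if p < 0.05:  # alpha = 0.05, as in the text
    print(f"Friedman rejects equivalence (p = {p:.4f}); run post-hoc Nemenyi")
else:
    print(f"No significant difference detected (p = {p:.4f})")
```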

We can observe the influence of the m hyperparameter when we compare the SRP and BAG results. For example, on AIRLINES, even though the number of features is only 7, using m = 60% produced better results than BAG, as shown in Tables I and II, while intuitively it seems that using all features would be better for low-dimensionality datasets. For the SPAM dataset, SRP, ARF, and SRS were configured with m = 10% for the n = 100 experiments, as m = 60% failed to finish. LB and BAG could not finish either; both failed after around 60% of the execution, as the 100GB maximum memory allocation pool was insufficient. SRP with n = 10 performs well on the real datasets, but not as well on the synthetic datasets as BAG and LB, which are very similar models (i.e., they use all features and simulate resampling). However, in the experiments using n = 100, the algorithms that exploit random subspaces (ARF, SRP, SRS) benefited the most from the addition of more learners, followed by BAG and LB. This characteristic of ARF, SRP, and SRS can be attributed to their ability to cover a larger number of feature subspaces. OB and DWM improved in comparison to their results using n = 10, while OAUE decreased its performance. OAUE obtains results far below NB and HT on the KDD99, ADS, and NOMAO datasets, while performing well on the synthetic datasets with simulated concept drifts.

B. Average Tree Depth and Diversity

To investigate how efficient SRP is in terms of inducing diversity into the ensemble, we plot the average kappa over time for AGR(G) and LED(A) in Figures 3 and 4. In Figure 4, we can observe how the average kappa for BAG and ARF converges after the same concept has been in place for a while. We notice that SRS and SRP obtain low values of average kappa in comparison to ARF and BAG. However, when we take into account the accuracy results in Table I, we can see that SRP and SRS do not necessarily outperform BAG on these datasets; e.g., even if the difference is small, ARF and BAG outperform SRS and SRP on LED(A). These results corroborate the conclusions by Stapenhurst [21] that ensemble diversity influences the recovery from a concept drift, yet is not as crucial as the actual drift detection and recovery strategy. On AGR(G), SRS and SRP outperform ARF and BAG; still, the average kappa diversity in this experiment is quite similar, thus no clear conclusions can be made about why SRS and SRP perform better based solely on the average pairwise kappa. The overall conclusion is that increasing diversity is not enough to improve accuracy.
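For clarity, the diversity measure plotted in Figures 3 and 4 is the kappa statistic averaged over all pairs of base models; a sketch of this computation over a window of predictions (hypothetical inputs; lower values indicate more diversity) is:

```python
from itertools import combinations

def kappa(preds_a, preds_b, labels):
    """Kappa agreement between two prediction sequences on the same window."""
    n = len(preds_a)
    p0 = sum(a == b for a, b in zip(preds_a, preds_b)) / n   # observed
    pe = sum((preds_a.count(l) / n) * (preds_b.count(l) / n)
             for l in labels)                                # chance
    return (p0 - pe) / (1 - pe) if pe < 1 else 1.0

def avg_pairwise_kappa(all_preds, labels):
    """Average kappa over all pairs of base models in the ensemble."""
    pairs = list(combinations(all_preds, 2))
    return sum(kappa(a, b, labels) for a, b in pairs) / len(pairs)
```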

TABLE I: Test-then-train accuracy (%) using n = 10 base models.

Data set NB HT LB OAUE DWM OB ARF SRP SRS BAG

LED(A) 53.964 69.032 73.918 74.007 73.742 69.898 73.945 73.588 73.533 73.944

LED(G) 54.02 68.649 73.076 73.167 72.723 69.562 73.01 72.416 72.296 73.151

AGR(A) 65.739 81.045 86.954 90.932 82.97 84.91 85.646 91.788 91.558 85.733

AGR(G) 65.759 77.374 80.709 86.339 79.418 79.73 79.885 87.762 88.538 81.347

RBF(M) 30.994 45.491 84.714 78.581 57.81 69.894 84.49 83.28 81.685 85.431

RBF(F) 29.136 32.292 74.102 50.021 54.861 42.915 70.715 70.825 59.061 74.891

AIRLINES 64.55 65.078 62.319 66.637 63.88 65.184 65.786 66.776 67.085 61.296

ELEC 73.362 79.195 90.157 88.275 87.756 85.253 88.718 88.82 89.4 89.502

COVTYPE 60.521 80.312 94.861 90.17 88.286 90.327 94.691 95.254 92.764 95.467

KDD99 95.603 99.903 99.974 2.473 99.951 99.944 99.975 99.984 99.979 99.972

ADS 68.161 85.91 99.665 15.401 97.499 86.917 99.726 99.756 98.445 99.665

NOMAO 86.865 92.128 97.035 58.407 95.462 94.252 97.055 97.232 96.451 97.064

SPAM 74.571 79.043 94.745 80.899 89.286 89.489 96.214 95.967 92.954 94.745

Avg Rank 9.54 8.29 3.5 7 7 6.86 3.57 2.43 3.71 3.36

Avg Rank Synt. 10 9 3.33 3.5 6.67 7.5 4.17 3.67 4.5 2.67

Avg Rank Real 9.14 8.14 4 7.71 6.86 6.57 3.14 1.86 3.71 3.86

TABLE II: Test-then-train accuracy (%) using n = 100 base models. Underlined results mean the performance increased in comparison to the n = 10 version. BAG and LB did not finish execution for the SPAM dataset.

Data set LB OAUE DWM OB ARF SRP SRS BAG

LED(A) 73.953 73.393 73.958 72.475 73.96 74.027 74.04 73.975

LED(G) 73.225 72.582 73.031 72.117 73.094 73.233 73.179 73.215

AGR(A) 88.717 90.164 88.299 90.374 87.929 92.869 92.807 86.663

AGR(G) 83.713 85.244 79.437 87.834 82.288 89.651 90.259 82.52

RBF(M) 84.338 84.262 60.977 74.514 86.958 86.039 84.821 86.671

RBF(F) 76.771 57.147 54.531 48.698 76.291 76.375 61.622 77.686

AIRLINES 62.82 65.229 64.025 64.556 66.417 68.564 68.303 62.093

ELEC 89.508 87.407 87.754 89.515 89.672 89.859 90.267 89.822

COVTYPE 95.104 92.857 88.519 92.695 94.967 95.348 93.461 95.288

KDD99 99.965 2.445 99.951 99.936 99.972 99.981 99.973 99.974

ADS 99.634 15.401 97.499 90.393 99.695 99.726 98.353 99.634

NOMAO 97.072 58.233 95.462 96.393 97.197 97.383 96.57 97.226

SPAM NA 80.781 89.361 86.519 97.319 97.437 95.924 NA

Avg Rank 4.46 6.33 6.67 6.17 4 1.58 3.17 3.63

Avg Rank Synt. 4.17 5.67 6.67 6.17 4.67 2 2.83 3.83

Avg Rank Real 4.75 7 6.67 6.17 3.33 1.17 3.5 3.42

Therefore, we complement our analysis of the ensembles' diversity by presenting the average tree depth in Figures 5 and 6. SRP consistently grows trees faster and deeper than ARF, SRS, and BAG. Splitting sooner can lead to overfitting, or to splits that could have used a better feature (or split point) had more instances been observed. However, as shown by the predictive performance of SRP in the empirical experiments, it can be beneficial to an ensemble strategy.

Fig. 3: AGR(G) - Avg kappa over time (n = 10).

Fig. 4: LED(A) - Avg kappa over time (n = 10).

Fig. 5: AGR(G) - Avg tree depth over time (n = 10).

Fig. 6: LED(A) - Avg tree depth over time (n = 10).

C. Time and Memory Usage Analysis

The computational resources are estimated based on CPU time and RAM-Hours (across all the experiments). The results for n = 100 are presented in Figures 7 and 8⁵. We note that SRP performs similarly to LB, requires fewer resources than BAG, but demands more resources than SRS and ARF. The efficiency of SRS is attributable to the fact that it does not simulate resampling. In SRP, BAG, ARF, and LB, each learner is trained on each instance, on average, λ times, where λ = 6 in our experiments. If we use Poisson(λ = 1), we also increase the chances of obtaining zeros (i.e., not using the instance for training), which positively affects memory and processing time (training on fewer instances), but negatively impacts classification performance, as the base models are trained on fewer instances.

⁵ The results in Figures 7 and 8 exclude the SPAM CPU time and RAM-Hours for all algorithms, since BAG and LB did not finish executing.

Fig. 7: CPU time (n = 100).

Fig. 8: RAM-Hours usage (n = 100).

VI. CONCLUSIONS

In this work, we have taken an in-depth look at the performance of Random Subspaces and Bagging ensemble methods and their application to streams. In particular, following theoretical considerations and empirical investigations, we developed and presented the Streaming Random Patches (SRP) method. SRP is a combination of Random Subspaces and Online Bagging, as each base model is trained on a random patch of data (i.e., a random subset of features and instances). We show how SRP can be highly accurate in many benchmark streaming scenarios, and compare it against several ensemble methods for data stream classification, including bagging, boosting, and random forest variations. We discussed the differences and similarities between SRP and the Adaptive Random Forest (ARF) algorithm. We showed how SRP compares against a Streaming Random Subspaces (SRS) method and a Bagging method using the same drift detection and recovery strategies. We highlight that SRP has the same number of hyperparameters as ARF; moreover, it can also be used to train base models that are not decision trees.

We discussed and demonstrated how methods using random subspaces yield several significant advantages, such as diversity enhancement even for stable learners, which makes them particularly suited to Hoeffding-tree-based methods, as these can be seen as stable. On top of that, these methods tend to improve in accuracy from the addition of more base models.

As only a subset of features is considered, training is efficient compared to popular existing methods for data stream classification, such as Leveraging Bagging. Furthermore, even though it is beyond the scope of this paper, our method is particularly well suited to distributed computation, as the base models are independent. These characteristics set out an exciting path for future investigation.

REFERENCES

[1] L. Da Xu, W. He, and S. Li, "Internet of things in industries: A survey," IEEE Transactions on Industrial Informatics, vol. 10, no. 4, pp. 2233–2243, 2014.
[2] G. Widmer and M. Kubat, "Learning in the presence of concept drift and hidden contexts," Mach. Learn., vol. 23, no. 1, pp. 69–101, Apr. 1996.
[3] J. Z. Kolter, M. Maloof et al., "Dynamic weighted majority: A new ensemble method for tracking concept drift," in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE, 2003, pp. 123–130.
[4] D. Brzezinski and J. Stefanowski, "Combining block-based and online methods in learning ensembles from concept drifting data streams," Information Sciences, vol. 265, pp. 50–67, 2014.
[5] A. Bifet, G. Holmes, and B. Pfahringer, "Leveraging bagging for evolving data streams," in PKDD, 2010, pp. 135–150.
[6] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes, and T. Abdessalem, "Adaptive random forests for evolving data stream classification," Machine Learning, pp. 1–27, 2017. [Online]. Available: http://dx.doi.org/10.1007/s10994-017-5642-8
[7] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[8] L. Breiman, "Pasting small votes for classification in large databases and on-line," Machine Learning, vol. 36, no. 1-2, pp. 85–103, 1999.
[9] ——, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[10] ——, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[11] P. Panov and S. Džeroski, "Combining bagging and random subspaces to create better ensembles," in International Symposium on Intelligent Data Analysis. Springer, 2007, pp. 118–129.
[12] G. Louppe and P. Geurts, "Ensembles on random patches," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2012, pp. 346–361.
[13] Y. Freund, R. E. Schapire et al., "Experiments with a new boosting algorithm," in ICML, vol. 96, 1996, pp. 148–156.
[14] H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet, "A survey on ensemble learning for data stream classification," ACM Comput. Surv., vol. 50, no. 2, pp. 23:1–23:36, 2017. [Online]. Available: http://doi.acm.org/10.1145/3054925
[15] N. Oza and S. Russell, "Online bagging and boosting," in Artificial Intelligence and Statistics 2001. Morgan Kaufmann, 2001, pp. 105–112.
[16] L. I. Kuncheva, J. J. Rodríguez, C. O. Plumpton, D. E. Linden, and S. J. Johnston, "Random subspace ensembles for fMRI classification," IEEE Transactions on Medical Imaging, vol. 29, no. 2, pp. 531–542, 2010.
[17] T. R. Hoens, N. V. Chawla, and R. Polikar, "Heuristic updatable weighted random subspaces for non-stationary environments," in Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 2011, pp. 241–250.
[18] C. O. Plumpton, L. I. Kuncheva, N. N. Oosterhof, and S. J. Johnston, "Naive random subspace ensemble with linear classifiers for real-time classification of fMRI data," Pattern Recognition, vol. 45, no. 6, pp. 2101–2108, 2012.
[19] G. I. Webb, R. Hyde, H. Cao, H. L. Nguyen, and F. Petitjean, "Characterizing concept drift," Data Mining and Knowledge Discovery, vol. 30, no. 4, pp. 964–994, 2016.
[20] L. L. Minku, A. P. White, and X. Yao, "The impact of diversity on online ensemble learning in the presence of concept drift," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 5, pp. 730–742, 2010.
[21] R. J. Stapenhurst, "Diversity, margins and non-stationary learning," Ph.D. dissertation, University of Manchester, UK, 2012.
[22] A. Bifet, E. Frank, G. Holmes, and B. Pfahringer, "Ensembles of restricted Hoeffding trees," ACM TIST, vol. 3, no. 2, pp. 30:1–30:20, 2012. [Online]. Available: http://doi.acm.org/10.1145/2089094.2089106
[23] A. Bifet and R. Gavaldà, "Learning from time-changing data with adaptive windowing," in SIAM International Conference on Data Mining (SDM), 2007.
[24] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Computing Surveys, vol. 46, no. 4, pp. 44:1–44:37, Mar. 2014.
[25] I. Žliobaitė, "Change with delayed labeling: When is it detectable?" in Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010, pp. 843–850.
[26] P. M. Domingos, "A unified bias-variance decomposition for zero-one and squared loss," in AAAI 2000, 2000, pp. 564–569.
[27] L. I. Kuncheva, "That elusive diversity in classifier ensembles," in Iberian Conference on Pattern Recognition and Image Analysis. Springer, 2003, pp. 1126–1138.
[28] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM SIGKDD, Sep. 2000, pp. 71–80.
[29] O. Bousquet and A. Elisseeff, "Stability and generalization," Journal of Machine Learning Research, vol. 2, pp. 499–526, 2002.
[30] E. Ikonomovska, J. Gama, and S. Džeroski, "Learning model trees from evolving data streams," Data Mining and Knowledge Discovery, vol. 23, no. 1, pp. 128–168, 2011.
[31] Y. Lin and Y. Jeon, "Random forests and adaptive nearest neighbors," Journal of the American Statistical Association, vol. 101, no. 474, pp. 578–590, 2006.
[32] N. Lim and R. J. Durrant, "Linear dimensionality reduction in linear time: Johnson-Lindenstrauss-type guarantees for random subspace," arXiv, vol. 1705.06408, 2017. [Online]. Available: https://arxiv.org/abs/1705.06408
[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[34] R. A. Servedio, "Smooth boosting and learning with malicious noise," Journal of Machine Learning Research, vol. 4, no. Sep, pp. 633–648, 2003.
[35] S.-T. Chen, H.-T. Lin, and C.-J. Lu, "An online boosting algorithm with theoretical justifications," in Proceedings of the International Conference on Machine Learning (ICML), June 2012.
[36] N. Littlestone and M. K. Warmuth, "The weighted majority algorithm," Information and Computation, vol. 108, no. 2, pp. 212–261, 1994.
[37] H. Abdulsalam, D. B. Skillicorn, and P. Martin, "Classifying evolving data streams using dynamic streaming random forests," in International Conference on Database and Expert Systems Applications. Springer, 2008, pp. 643–651.
[38] G. Holmes, R. Kirkby, and B. Pfahringer, "Stress-testing Hoeffding trees," in Knowledge Discovery in Databases: PKDD 2005, 2005, pp. 495–502. [Online]. Available: https://doi.org/10.1007/11564126_50
[39] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, Dec. 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1248547.1248548
[40] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, "MOA: Massive online analysis," The Journal of Machine Learning Research, vol. 11, pp. 1601–1604, 2010.

APPENDIX

A. Software and hardware

All the experiments were executed in the Massive Online Analysis (MOA) framework [40], version 2019.04 build. Details about the hardware configuration are shown below:

• CPU: 40 cores, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
• Operating System: Ubuntu 16.04.5 LTS
• Java Virtual Machine (JVM) version: JDK 1.8.0
• JVM: Xmx = 100GB, Xms = 50MB

To reproduce the experiments, the reader should access the GitHub repository at https://github.com/hmgomes/StreamingRandomPatches, which contains the code, the datasets used, and instructions on how to execute the algorithm.

B. Datasets

Table III presents an overview of the datasets.

TABLE III: Datasets (Drifts: (A) Abrupt, (G) Gradual, (M)

Incremental (moderate) and (F) Incremental (fast)). AGR and

LED concept drifts are introduced every 250k instances.

Dataset #Instances #Features Type Drifts #Classes

LED(A) 1,000,000 24 Synthetic A 10

LED(G) 1,000,000 24 Synthetic G 10

AGR(A) 1,000,000 9 Synthetic A 2

AGR(G) 1,000,000 9 Synthetic G 2

RBF(M) 1,000,000 10 Synthetic M 5

RBF(F) 1,000,000 10 Synthetic F 5

AIRL 539,383 7 Real - 2

ELEC 45,312 8 Real - 2

COVT 581,012 54 Real - 7

KDD99 4,898,431 41 Real - 23

ADS 3,279 1,559 Real - 2

NOMAO 34,465 119 Real - 2

SPAM 9,324 39,917 Real - 2