Online Ensemble Learning with Abstaining Classifiers for Drifting and Noisy Data Streams

Bartosz Krawczyk*, Alberto Cano

Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA

*Corresponding author. Email addresses: bkrawczyk@vcu.edu (Bartosz Krawczyk), acano@vcu.edu (Alberto Cano)

Preprint submitted to Applied Soft Computing, September 7, 2018
Abstract
Mining data streams is among the most vital contemporary topics in machine learning. Such a scenario requires adaptive algorithms that are able to process constantly arriving instances, adapt to potential changes in data, use limited computational resources, and remain robust to any atypical events that may appear. Ensemble learning has proven itself to be an effective solution, as combining learners leads to improved predictive power, more flexible drift handling, and ease of implementation in high-performance computing environments. In this paper, we propose an enhancement of popular online ensembles that augments them with an abstaining option. Instead of relying on traditional voting, classifiers are allowed to abstain from contributing to the final decision. Their confidence level is monitored for each incoming instance, and only learners that exceed a certain threshold are selected. We introduce a dynamic and self-adapting threshold that adapts to changes in the data stream by monitoring the outputs of the ensemble, allowing the underlying diversity to be exploited in order to efficiently anticipate drifts. Additionally, we show that forcing uncertain classifiers to abstain from making a prediction is especially useful for noisy data streams. Our proposal is a lightweight enhancement that can be applied to any online ensemble method, improving its robustness to drifts and noise. A thorough experimental analysis, validated through statistical tests, proves the usefulness of the proposed approach.
Keywords: Machine learning, Data stream mining, Concept drift, Ensemble learning, Abstaining classifier, Diversity
1. Introduction
In the context of the big data era, information systems produce a continuous flow of massive collections of data, surpassing the storage and computation capabilities of traditional knowledge extraction methods. Big data is characterized by properties that include volume, velocity, variety, veracity, variability, visualization, and value. In recent years, researchers have mainly focused on the scalability of data mining algorithms to address the ever increasing data volume [1]. However, one cannot ignore the importance of the remaining properties, especially velocity and variability. Velocity is critical in real-time decision systems, where new instances are continuously evaluated and fast decisions must be output under
time constraints. Variability refers to the non-stationary properties of data, which may shift over time, leading to a phenomenon known as concept drift. The velocity, variability, and complexity of data through time call for the development of effective and efficient algorithms that can dynamically adapt to changes [2] and provide fast decisions and model updates [3, 4].
Ensemble learning is a popular approach for adapting to the dynamic nature of data streams [5]. An ensemble is a set of individual classifiers whose predictions are combined, leading to better accuracy than that of the individual classifiers. Ensembles optimize the coverage of the decision space by generating complementary and diverse classifiers. Classifier ensembles are attractive for data streams because they facilitate adaptation to changes in data characteristics, e.g., by adding new components trained on recent data and removing components representing outdated concepts. However, concept drifts are often recurrent, and by simply removing outdated classifiers we would be forgetting knowledge that is useful for future recurring drifts. In contrast, an approach based on abstaining classifiers [6, 7, 8] allows both refraining from predictions when confidence regarding new instances is being lost during drift, and exploring the diversity of the ensemble by allowing only a selective subset of learners to partake in the decision making. Moreover, noise in data is a recurrent difficulty in data streams. Noise can be seen as temporal or spatial fluctuations that may mislead drift detectors. Noise may fool a drift detector into believing that a concept drift occurred, leading to an update of the model based on noisy input data and thus deteriorating accuracy.
In this paper, we introduce a lightweight and flexible abstaining extension for online ensembles that allows some of the classifiers to be excluded from the voting process. Our method constantly monitors the certainty of the base classifiers for each incoming instance. If a classifier displays a maximum certainty below a specified threshold, then it abstains from making a prediction (i.e., it is excluded from the voting process). We propose an adaptive scheme for threshold calculation that follows changes in the stream, enhancing or diminishing the role of abstaining according to the current situation. By allowing classifiers to refrain from making a decision, we dynamically change the ensemble set-up (as different classifiers may partake in the classification of each instance), exploiting their local competencies. Additionally, the influence of classifiers that recover poorly from a change is diminished, thus reducing the ensemble error. The approach also exploits the diversity in ensembles, which is a key factor in efficient learning from streams [9]. Here, we are able to cover various subsets of the decision space (due to diversity), but at the same time use only the most accurate ensemble members (as diversity does not necessarily lead to each base learner being competent [10]).¹ Furthermore, the abstaining modification should improve the robustness of online ensembles to noise, without the need for complex and costly filters or data preprocessing solutions. Random noise is most likely to influence the certainty of the classifier regarding a given instance, as it may shift the given object with respect to the learned decision boundary. Our approach is designed to select for the voting step those classifiers that are least likely to be influenced by the noise distribution.
¹ By a competent classifier we mean one that is updated with the current state of the stream and able to accurately recognize new instances. Competence deteriorates when the stream evolves and the classifier is not able to adapt properly, losing its generalization capabilities.
The main contributions of this paper are as follows:

- A novel dynamic and lightweight methodology for online ensembles that extends them with abstaining classifiers, and that can be applied to any online ensemble model with minimal restrictions on the type of base classifier used.
- Efficient selection of the most competent classifiers for each instance, which allows the underlying diversity of the base learners to be exploited and reduces the error during drift recovery.
- Abstaining leading to increased robustness to noisy data streams, without a requirement for costly preprocessing.
- An extensive experimental analysis on a large number of data stream benchmarks, comparing the performance of popular online ensembles with Adaptive Hoeffding Tree, Naïve Bayes, and Multi-layer Perceptron as base classifiers, considering the canonical, static-abstaining, and dynamic-abstaining versions.
- A study of the relationships between ensemble size and accuracy, and between noise and accuracy.
The manuscript is organized as follows. Section 2 presents the background in data stream mining. Section 3 describes the proposed online abstaining classifiers methodology. Section 4 presents the experimental study. Finally, Section 5 summarizes the concluding remarks.
2. Data stream mining
A popular view on data streams is to consider them as ordered sequences of instances that arrive over time and may be potentially of unbounded size [11]. Such settings differ from the canonical static scenario and thus impose specific constraints on learning algorithms designed for such environments. We assume that instances arrive one by one (online processing) or in the form of data blocks (chunk processing). They arrive rapidly within given time intervals; most works assume these intervals are identical, yet in many real-life scenarios they may vary. Due to the potentially infinite size of the stream and the characteristics of contemporary computing systems, one cannot fit the entire stream in memory, and each instance should be processed once and then discarded. Additionally, as instances arrive continuously, their processing time must be as small as possible in order to provide real-time responsiveness and avoid data queues. Finally, the characteristics of streams may change over time due to various conditions, which is known as concept drift [12].
A data stream is a sequence of states $S = \{S_1, S_2, \ldots, S_n\}$. As a state we define a subsequence generated according to a given distribution, where state $S_i$ is generated by a distribution $D_i$. By a stationary data stream we will consider a sequence of instances characterized by transitions $S_j \rightarrow S_{j+1}$, where $D_j = D_{j+1}$. However, the presence of concept drift will lead to changes in the distributions and definitions of learned concepts over time. The presence of drift can affect the underlying properties of the classes that were used to train the current classifier, thus reducing its relevance as the changes progress. At some point the drop in accuracy may be so significant that one cannot consider the classifier competent anymore. Thus, developing methods to tackle the presence of concept drift is of vital importance for data stream mining [2, 13].
Concept drift can be categorized with regard to its influence on the learned classification boundaries. We distinguish two types of drift: virtual and real. The former type of concept drift does not impact the decision boundaries (posterior probabilities), but affects only the conditional probability density functions. Therefore, it should not directly influence the classifier being used, yet it should still be detected. The latter type of concept drift affects the decision boundaries (or posterior probabilities) and may potentially impact the unconditional probability density functions. This type of change may significantly influence the performance of a classifier. Figure 1 depicts both of these types of drift.
(a) Initial concept. (b) Virtual concept drift. (c) Real concept drift.
Figure 1: Two types of concept drift according to their influence on learned classification boundaries.
Another view on the types of concept drift is based on the severity and speed of changes. Sudden concept drift is characterized by $S_j$ being rapidly replaced by $S_{j+1}$, where $D_j \neq D_{j+1}$. Gradual concept drift can be considered as a transition phase in which the examples in $S_{j+1}$ are generated by a mixture of $D_j$ and $D_{j+1}$ with varying proportions. Incremental concept drift has a much slower rate of change, where the difference between $D_j$ and $D_{j+1}$ is not significant, usually not statistically significant. In recurring concept drift, a state from the $k$-th previous iteration may reappear, i.e., $D_{j+1} = D_{j-k}$; this may happen once or periodically. Blips, or outliers, should be ignored due to their random nature and the lack of any meaningful information being carried [14]. Figure 2 depicts these five types of drift. Please note that in most real-life scenarios we do not have a clearly defined type of change, leading to so-called mixed concept drift, which may exhibit hybrid characteristics of the previous types. A minimal simulation of the gradual case is sketched below.
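To make the gradual case concrete, the following minimal Python sketch (our illustration, not part of the original study) draws each instance from either the old concept $D_j$ or the new concept $D_{j+1}$, with the mixture proportion shifting linearly over the transition phase; the concept means and all parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def gradual_drift_stream(n, mean_old=0.0, mean_new=3.0):
    """Toy gradual drift: instance t comes from the new concept D_{j+1}
    with probability t/n, otherwise from the old concept D_j, so the
    mixture proportion varies over the transition phase."""
    for t in range(n):
        mean = mean_new if rng.random() < t / n else mean_old
        yield rng.normal(loc=mean, scale=1.0)  # one-dimensional instance

stream = list(gradual_drift_stream(10_000))
```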
Noise is another difficulty embedded in the nature of data streams. It can be seen as temporal fluctuations in the incoming concept that do not translate into actual changes; they are harmful to the classification system and thus must be filtered out. A number of solutions for handling noisy data streams have been proposed in the literature. However, their applicability is limited, as one must expect the noise to appear, while in real-life scenarios noise may appear unexpectedly or periodically. Therefore, improving general robustness to noise is important for stream classifiers.
In order to tackle the presence of concept drift one may use three general solutions:

- Rebuild the classifier from scratch every time new instances arrive with the stream;
- Monitor the progress of changes in stream characteristics and update the model only when the severity of drift reaches a certain level;
- Use an adaptive learning algorithm that can adapt to new instances and forget old ones in order to naturally follow drifts in the stream.
Figure 2: Four types of concept drift according to severity and speed of changes, and noisy blips.
Obviously, the first solution carries a prohibitive computational cost, and thus only the two remaining ones are used in the data stream mining field. The most important approaches for adapting learning systems to concept drift include:
- Concept drift detectors: these can be seen as external algorithms that are combined with a given classifier. Their purpose is to monitor specific properties of the data stream, such as standard deviation [15], predictive error [16], or instance distribution [17]. Any change in these characteristics is assumed to be caused by the presence of drift. Thus, by measuring the level of changes, detectors are able to report the incoming shift.
- Sliding windows: these techniques keep a buffer with the most recent instances, which are considered representative of the current state of the stream [18] (a minimal sketch follows this list). They are used for training / updating purposes and are discarded once newer instances arrive with the stream. This allows the data stream to be tracked by storing its most recent state [19].
- Online learners: these update the model instance by instance, thus accommodating changes in the stream as soon as they occur. Such learners must fulfill a set of requirements [20]: each object must be processed only once in the course of training, the computational complexity of handling each instance must be as small as possible, and the accuracy should not be lower than that of a classifier trained on batch data.
- Ensemble learners: using a combination of several classifiers is a very popular approach for data stream mining (see a thorough survey by Krawczyk et al. [5]). Due to their compound structure [21] they can easily accommodate changes in the stream, offering gains in both flexibility and predictive power [22]. Two main approaches here assume either a changing line-up of the ensemble [23] or updating the base classifiers [24]. In the former solution a new classifier is trained on recently arrived data (usually collected in the form of a chunk) and added to the ensemble. Pruning is used to control the number of base classifiers and remove irrelevant or oldest models. A weighting scheme allows the highest importance to be assigned to the newest ensemble components, although more sophisticated solutions increase the weights of classifiers that have recently performed best. Here one can use static classifiers, as the dynamic line-up keeps track of the stream's progress. The latter solutions assume that a fixed-size ensemble is kept, but update each component when new data become available.
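As a concrete illustration of the sliding-window strategy above, here is a minimal Python sketch; the class name and default window size are illustrative assumptions rather than any particular library's API.

```python
from collections import deque

class SlidingWindow:
    """Fixed-size buffer holding only the most recent labeled instances,
    treated as representative of the current state of the stream."""

    def __init__(self, size=1000):
        # deque with maxlen discards the oldest instance automatically
        self.buffer = deque(maxlen=size)

    def add(self, x, y):
        self.buffer.append((x, y))

    def training_data(self):
        # snapshot of the current window, e.g., for retraining a classifier
        return list(self.buffer)
```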
Proper evaluation of classifiers for data stream mining is much more complex than in static scenarios, as one must take various characteristics into account. It is worth noting that a good algorithm should not excel at one of them while under-performing on the others. Instead, it should aim to strike a balance among all of the following criteria (a prequential evaluation sketch follows this list):

- Predictive power: a popular criterion measured in all learning systems. However, due to the dynamic nature of streams, the relevance of errors fades with time, making the usage of prequential metrics necessary [25].
- Memory consumption: necessary to evaluate due to the hardware limitations imposed on processing potentially infinite streams.
- Update time: reports how much time a classifier requires to accommodate new instances in its model or decision boundaries.
- Decision time: another time-related measure. It informs us how much time a classifier requires to make a prediction for each new instance (or batch of instances).
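To make the prequential criterion concrete, the sketch below computes a fading-factor variant of prequential accuracy in a test-then-train loop; the `predict`/`update` method names and the fading factor value are illustrative assumptions, not a prescribed interface.

```python
def prequential_accuracy(stream, model, alpha=0.999):
    """Test-then-train evaluation: each instance is first used for testing,
    then for training. The fading factor `alpha` makes the relevance of old
    errors decay over time, as required for evolving streams."""
    hits, seen = 0.0, 0.0
    for x, y in stream:
        hits = alpha * hits + (1.0 if model.predict(x) == y else 0.0)  # test
        seen = alpha * seen + 1.0
        model.update(x, y)  # then train on the now-labeled instance
        yield hits / seen   # current prequential accuracy estimate
```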
3. Online ensembles of abstaining classifiers

In this section we describe in detail the proposed dynamic abstaining modification for online ensemble learning from drifting data streams.
3.1. Dynamic abstaining mechanism for exploring diversity
Diversity is a key factor influencing the performance of ensemble learning methods for mining non-stationary data streams [5, 9]. In stationary scenarios, diversity allows ensembles to combine mutually complementary classifiers to cover the decision space more extensively. In non-stationary cases, diversity may translate into the capability of handling concept drifts. By having a pool of varying base learners, one may be able to anticipate the incoming drift, as at least one of the classifiers will have a decision boundary that can be quickly adapted to the new concept. However, diversity does not translate directly into accuracy. A pool of diverse learners does not necessarily make an accurate combined model. Furthermore, even if one of the classifiers is better suited to managing the new state of the stream, the aggregated decision making may diminish its influence on the final predicted class. Therefore, while diversity is beneficial to the ensemble, one needs a tool to manage it smartly in order to efficiently exploit the capabilities it offers.
We propose to use abstaining to exclude uncertain classifiers from the class label prediction. This means that each base classifier in the ensemble may choose either to output a label prediction or to abstain from making a decision. In the discussed online setting, this translates to selecting an ever-changing subset of classifiers for each incoming instance, based on their predispositions towards a certain object. We may relate such an approach to works on dynamic ensemble selection in static environments [26]. Such methods require a separate validation set to measure the competencies of base learners and are computationally costly [27], which is prohibitive for mining high-speed data streams in online mode. Therefore, we need to investigate different directions.
Abstaining classifiers have mainly been investigated in the context of rule-based classifiers [8]. There, a rule that did not cover the classified instance abstained instead of using any generalization approach. However, we want our approach to be more flexible and work with a wider range of base classifiers. We propose to base abstaining on the certainty displayed by each base classifier regarding the new instance to be classified.
Let the data stream $DS = \{(x_1, j_1), (x_2, j_2), \ldots, (x_k, j_k), \ldots\}$ be a potentially infinite set of instances, where $x_k$ stands for the feature vector ($x_k \in \mathcal{X}$) describing the $k$-th object and $j_k$ for its label ($j_k \in \mathcal{M}$). We assume that we have at our disposal a pool of $L$ individual online classifiers that form an ensemble $\hat{\Psi} = \{\Psi_1, \Psi_2, \ldots, \Psi_L\}$, where each base model is able to give continuous outputs in the form of support functions $F_l(x, j) \in [0, 1]$ for object $x$ belonging to the $j$-th class. Such support functions may be used to measure the certainty of each base learner regarding its label prediction. While most online ensembles use voting with discrete decisions, many online classifiers are able to additionally return their support functions. We propose to utilize a hybrid architecture. The final decision regarding the predicted label is made using majority voting, but abstaining is based on comparing the certainty of each classifier from the pool with a given threshold. If a classifier fails to satisfy the threshold, then it abstains from making a decision. Thus, for each instance, we select for voting only those classifiers that satisfy the current threshold restrictions.
Using a static threshold value will lead to poor performance on drifting data streams. Therefore, we propose to use a dynamic abstaining threshold. We modify its value based on the correctness of the ensemble decision. If the ensemble of selected classifiers was able to correctly predict the label, it means that we have selected competent classifiers and we may lower the threshold in order to probe for additional, similarly competent learners. On the other hand, an incorrect decision may indicate a drift occurrence, as the majority of classifiers were not able to properly classify the instance. In such a case the threshold needs to be increased, in the hope that less competent classifiers will be excluded and we will use the ones most suitable for the current state of the stream.
Such an approach is able to exploit the diversity in the ensemble. When drift occurs, one or a few classifiers may adapt much faster to the change than the remaining ones. The majority voting used in online ensembles will not allow them to be properly utilized, and the overall recovery rate of the entire ensemble will be much slower than that of its most competent components. By employing adaptive abstaining, we are able to rely on those few classifiers for predicting new labels until the remaining classifiers have updated themselves sufficiently to properly classify the new concepts. The proposed abstaining leads to selective control over diversity, allowing either a larger number of base models to be used when they can all contribute useful information, or reverting to a smaller subset of classifiers that can better anticipate the direction of the drift.
A possible drawback of the proposed method lies in the fact that high support may not directly translate into a competent classifier; in fact, it may also point to classifiers that are confident of a wrong decision. However, by using an ensemble that maintains its diversity during stream processing, we aim at reducing the cases in which the majority of highly confident classifiers in the pool would actually be incompetent. Additionally, our algorithm in its current state assumes continuous access to the true class labels. It can easily be extended with the active learning paradigm to work with partially labeled data streams [28].
An illustrative toy example is depicted in Figure 3. It presents two ensembles: a canonical one and an abstaining one. Both use majority voting to make a decision regarding a new instance. The numbers inside the classifiers show their support functions for the selected class, while voting uses only the discrete output information. In the first case, the wrong class is selected because classifiers with low certainty contribute a majority of the votes. In the second case, an abstaining threshold equal to 0.65 is introduced, excluding the four most uncertain classifiers. One may note that even if, instead of voting, one used classifier combination operators on the support functions (maximum or average), the scenario would be exactly the same, and without abstaining the wrong class would be predicted. Classifier combiners based only on support functions may be too restrictive (such as the max operator) or too sensitive to small changes in the support functions (such as the avg operator). One may also use support functions to weight votes, but many works point to the fact that such weighted voting does not improve the results in a statistically significant manner. The proposed hybrid approach allows classifiers to be selected for voting based on their support functions, combining the advantages of both solutions.
The discussed situation can easily be translated into a drifting scenario, where classifiers are bound to lose certainty and competence when drifts occur. By using an adaptive abstaining threshold, we modify the outcome of voting to promote diverse and accurate base learners. Such dynamic abstaining will select the classifiers that display the lowest loss of certainty during the drift occurrence, promoting learners that can quickly recover after a drift or that anticipated the change most accurately. A general framework for the proposed dynamic abstaining approach is presented in Algorithm 1, followed by a minimal code sketch.
Algorithm 1: Proposed general framework for dynamic abstaining online ensembles.

input: ensemble $\hat{\Psi}$, abstaining threshold $\theta \in [0, 1]$, adjustment factor $s \in [0, 1]$

$\theta \leftarrow$ initialize threshold
$L \leftarrow$ size of the ensemble
while end of stream = FALSE do
    obtain new instance $x$ from the stream
    for $l \leftarrow 1$; $l \leq L$; $l{+}{+}$ do
        obtain classifier support $F_{\Psi_l}(x)$ for each class
        if $\max_{j \in \mathcal{M}} F_{\Psi_l}(x, j) < \theta$ then
            $\Psi_l$ abstains from the decision
        else
            $\Psi_l$ participates in voting
    $z \leftarrow$ result of non-abstaining classifiers' voting
    obtain label $y$ of object $x$
    if $z == y$ then
        $\theta \leftarrow \theta - s$ (if $\theta > 0$)
    else
        $\theta \leftarrow \theta + s$ (if $\theta < 1$)
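The following Python sketch mirrors Algorithm 1 under stated assumptions: each base classifier exposes `predict_proba(x)` (returning support functions in [0, 1]) and `update(x, y)`, which are placeholder names, and the fallback to full voting when every classifier abstains is our own addition, as Algorithm 1 does not specify that degenerate case.

```python
import numpy as np

class DynamicAbstainingEnsemble:
    """Sketch of Algorithm 1: majority voting over non-abstaining members,
    with a self-adapting abstaining threshold."""

    def __init__(self, classifiers, theta=0.65, s=0.01):
        self.classifiers = classifiers
        self.theta = theta  # abstaining threshold in [0, 1]
        self.s = s          # threshold adjustment factor

    def predict(self, x):
        votes = []
        for clf in self.classifiers:
            support = clf.predict_proba(x)        # support functions F(x, j)
            if max(support) >= self.theta:        # confident enough to vote
                votes.append(int(np.argmax(support)))
            # otherwise the classifier abstains for this instance
        if not votes:  # assumption: if all abstain, fall back to full voting
            votes = [int(np.argmax(c.predict_proba(x))) for c in self.classifiers]
        return max(set(votes), key=votes.count)   # majority vote

    def test_then_train(self, x, y):
        z = self.predict(x)
        if z == y:
            self.theta = max(0.0, self.theta - self.s)  # probe for more voters
        else:
            self.theta = min(1.0, self.theta + self.s)  # tighten after a mistake
        for clf in self.classifiers:
            clf.update(x, y)  # online update of every base learner
        return z
```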
(a) Online ensemble with majority voting. (b) Online ensemble with majority voting and an abstaining threshold currently set to 0.65.
Figure 3: Toy example of the differences between a standard online ensemble and the proposed abstaining online ensemble. The true class label of the incoming instance is the blue one.
The proposed modification is lightweight, as it only requires keeping and updating a single additional variable and conducting $L$ simple comparisons during the testing phase for each instance. The training phase with a new instance is not affected. The abstaining modification should therefore not impose any significant computational costs on the learning procedure.
3.2. Flexible areas of applicability
The proposed abstaining approach can be applied to almost any existing online ensemble learning algorithm. Therefore, it should be seen as a lightweight enhancement that improves the performance of the underlying ensemble classifier, rather than a classifier in itself. We give the following suggestions regarding the base method to which our augmentation can be applied:
- It should be an online ensemble that ensures diversity among its members and allows the performance of its base classifiers to be monitored. Most streaming algorithms based on Bagging, Boosting, or Random Subspaces are suitable.
- The current version of abstaining assumes equal importance of all base classifiers and the usage of majority voting. However, it can relatively simply be applied to methods using weighted classifier combination, with a requirement for proper weight normalization once some of the base classifiers have been excluded from the label prediction for a given instance.
- The ensemble must use base classifiers that are able to work in an online mode and can return their confidence levels for each new instance (e.g., as support functions).
- We strongly suggest using online ensembles with a drift detector, in order to update the pool of base models with new, competent ones after a change has been identified and to remove the outdated ones. Pruning offers two advantages. Firstly, it allows discarding incompetent classifiers that would otherwise require too much time to adapt to the new state of the stream; they could display high confidence yet low competence, and thus negatively impact the abstaining module. Secondly, adding classifiers trained only on recent instances will positively impact the diversity of the ensemble that our abstaining module aims to exploit.
3.3. Tackling noisy data streams
In the previous subsections we discussed drifting data streams and how abstaining may improve ensemble adaptation to non-stationary conditions. However, it is also interesting to analyze the potential of abstaining for alleviating the influence of noise on the accuracy of online learning methods.
In many real-life problems noise is bound to appear. It may occur due to a corrupted data source, transmission errors, or malicious activities influencing the received data. It has been widely discussed in stationary data mining, as noisy training instances have a huge impact on the generalization capabilities of the learned model [29]. This problem becomes even more challenging when learning from non-stationary data, as not only may incoming instances be corrupted with a varying and evolving level of noise, but one must additionally be able to distinguish between noise and actual concept drift. It is very likely that isolated noisy samples may stimulate a drift detector, thus leading to premature rebuilding of the classification model. A number of solutions have been proposed for mining noisy data streams, usually relying on filtering the incoming data [30], training dedicated classifiers [31], or ensembles [32, 33]. These methods usually impose an additional computational cost that may often be prohibitive when mining high-speed data streams. Additionally, in real-life scenarios we often cannot predict the appearance of noise; noise is not always prevalent in the stream, but is more likely to appear periodically. Paying the cost of using a dedicated method when it is not always necessary is not well motivated. On the other hand, it is difficult to decide when to switch from a standard classifier to one devoted specifically to noisy streams. It therefore seems worthwhile to investigate solutions that do not influence the standard stream mining process, while offering increased robustness to noise at limited or no additional cost.
The proposed dynamic abstaining ensemble modification enhances the robustness of the underlying learning method to noise in the stream. If an instance is influenced by noise, its position shifts with respect to the decision boundary. Learning and classification difficulties can originate from such information corruption. The closer the noisy instance gets to the decision boundary, the lower the certainty of a given classifier. An ensemble solution may benefit from its diversity: because the base classifiers use varying decision boundaries, their decisions may be influenced differently by the noisy sample. Some of them will lose confidence (especially if the noisy instance shifts to the opposite side of their decision boundary), whereas others may still properly recognize it due to having different, yet complementary, class separation boundaries. Therefore, by abstaining the most uncertain classifiers, we enhance the role of those few least likely to be influenced by noise. Nevertheless, we acknowledge the possibility of noise shifting an instance so strongly that a classifier would display high certainty despite classifying it into the wrong class. In such a case the learner will still be allowed to participate in voting, while not being competent. Yet it is not likely that all of the confident classifiers will be similarly incompetent at the same time (once again due to the underlying diversity). Abstaining shows some similarities to a work by Zhu et al. [34], who proposed a dynamic classifier selection for noisy data streams. Their approach, however, required access to a separate validation set for measuring classifier competence, which is costly in dynamic data streams. Our method is lightweight and does not require access to such additional data.
We acknowledge that, while our proposal may improve the robustness of the classification phase to noisy instances, it does not influence the robustness of the training phase. Noisy instances may still impact the classifier update step. However, our aim was not to explicitly address the noise, but to show that the abstaining solution may offer improved noise robustness at no cost. We plan to investigate excluding noisy samples from the training phase using active learning in our future works.
4. Experimental study
In this section we present the experimental study evaluating the effectiveness of online ensemble learning with abstaining classifiers. The experiments were designed to answer the following research questions:

- Does the abstaining modification of online ensemble learning lead to improvements in accuracy when mining drifting data streams?
Table 1: Characteristics of the data stream benchmarks.
Dataset Instances Features Classes Drift
RBF blip 1,000,000 10 6 blips
Hyp slow 1,000,000 10 4 incremental
Hyp fast 1,000,000 10 4 incremental
SEA sudden 1,000,000 3 2 sudden
LED fast 1,000,000 7 10 mixed
LED nodr 1,000,000 7 10 no drift
RTree 1,000,000 10 6 sudden recurring
Waveform 1,000,000 40 3 mixed
CovType 581,012 54 7 virtual
Electricity 45,312 8 2 unknown
Poker 1,000,000 10 10 unknown
- Does the dynamic abstaining threshold that adapts to changes in the data offer a significant improvement over a static one?
- Is the efficiency of abstaining related to the choice of base classifier?
- How does the efficiency of abstaining relate to the ensemble size?
- Is the abstaining modification truly lightweight, not imposing a serious increase in the computational complexity of the ensemble methods?
- Can abstaining improve the robustness of online ensembles in noisy data streams without requiring costly preprocessing?
4.1. Data stream benchmarks
We used 12 data streams to evaluate the performance of the abstaining modification of online ensembles. We selected a diverse set of benchmarks with varying characteristics, including real datasets and stream generators with different properties concerning the nature, speed, and number of concept drifts, which are shown in Table 1.

For the experiments with noisy data streams, we took all 12 datasets and injected random feature noise into them. The noise level ranged between 5% and 50%, allowing us to evaluate the robustness of the examined methods under varying degrees of feature corruption and creating 120 new data stream benchmarks for the second experiment.
4.2. Set-up
During the experiments, we used two online ensemble learning methods to evaluate the efficiency of the proposed abstaining extension (an illustrative sketch of their shared resampling rule follows their descriptions below):
Table 2: Used online classifiers and their parameters.

Acronym  Name                     Parameters
AHT      Adaptive Hoeffding Tree  paths: 10; splitConfidence: 0.01; leaves: Naïve Bayes
NB       Naïve Bayes              -
MLP      Multi-layer Perceptron   hidden nodes: 10; learning: online backpropagation; iterations: 300; learning rate: 0.01; momentum: 0.01
LB       Leveraging Bagging       λ = 2
- Online Bagging (OB) [35] is a modification of the popular ensemble learning approach suitable for streaming scenarios. Here we assume that each incoming instance from the stream may be replicated zero, one, or many times in order to update each ensemble member. Therefore, each base classifier is given $k$ copies of the new instance, where $k$ varies for each of them. The value of $k$ is selected on the basis of the Poisson distribution, with $k \sim Poisson(1)$. We apply an extended version of Online Bagging that uses the ADWIN drift detector [16] to replace the weakest classifier with a new one after a drift happens.
- Leveraging Bagging (LB) [36] is a modification of Online Bagging that aims at increasing the role of randomization in the input to the base classifiers. Leveraging Bagging increases the resampling from $Poisson(1)$ to $Poisson(\lambda)$ (where $\lambda$ is a user-defined parameter). There is a possibility of using error-correcting output codes for classification, but for the abstaining extension we use the canonical majority voting option. Leveraging Bagging also relies on ADWIN for updating the ensemble set-up in case of drift.
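The shared resampling rule of both methods can be sketched in a few lines of Python; `update(x, y)` is a placeholder for the base learner's online training call, and λ = 1 recovers plain Online Bagging.

```python
import numpy as np

rng = np.random.default_rng(0)

def bagging_update(classifiers, x, y, lam=1.0):
    """Online Bagging update rule: each base classifier receives
    k ~ Poisson(lam) copies of the incoming instance. Leveraging
    Bagging simply uses lam > 1 to increase the resampling."""
    for clf in classifiers:
        for _ in range(rng.poisson(lam)):
            clf.update(x, y)
```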
As base learners we utilize three popular online classifiers: Adaptive Hoeffding Tree [37], Naïve Bayes, and Multi-layer Perceptron. Their parameters are given in Table 2. For the examined ensemble methods the main parameter is the number of base learners, which we will examine further in detail.
We use the following experimental framework (a small helper for the ELA metric is sketched after this list):

- Classification methods were evaluated using four different metrics: prequential accuracy, memory consumption, update time, and classification time.
- We used an online learning scenario with a test-then-train solution. This means that each incoming instance is first used to evaluate the performance of the tested ensembles and then to update them. Each experiment was repeated 10 times and we report results averaged over these runs.
- The proposed dynamic abstaining modification used an initial threshold $\theta = 0.65$ and adjustment factor $s = 0.01$. These parameters may easily be adjusted to the specific user's needs and the nature of the analyzed stream. If rapid changes are to be expected, then the adjustment factor should be increased to allow for faster adaptation. If changes are expected to be of a slower nature, then lower values of the adjustment factor will lead to more stable performance. We propose a single value for all types of examined streams in our experimental study. Our initial experiments on the initialization of $\theta$ show that, as long as the selected value is not an extreme one (close to 0 or 1), the ensemble stabilizes very quickly regardless of the chosen parameter.
- To assess the significance of the results, we conducted a rigorous statistical analysis [38]. We used Shaffer post-hoc analysis for multiple comparisons over multiple datasets. For all statistical tests the significance level was set to $\alpha = 0.05$.
- To gain a better understanding of the actual factor behind robustness to noise, we use the Equalized Loss of Accuracy (ELA) [39]. ELA helps us check whether the performance on noisy data is related to actual robustness or just to differences in initial predictive accuracies. It is computed as $ELA_{x\%} = (100 - Acc_{x\%}) / Acc_{0\%}$, where $Acc_{x\%}$ is the test accuracy under a noise level of $x\%$ and $Acc_{0\%}$ is the test accuracy on the original dataset. Therefore, the lower the value of ELA, the more robust a given classifier is to noise. At the same time, ELA takes into account the fact that a classifier with a low base accuracy $Acc_{0\%}$ that does not deteriorate at higher noise levels is still not a better choice than a better classifier suffering only a low loss of accuracy when the noise level is increased.
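As a small helper, the ELA formula from the last bullet can be computed as below; accuracies are percentages, and the example values are made up for illustration.

```python
def ela(acc_noisy, acc_clean):
    """Equalized Loss of Accuracy: ELA_x% = (100 - Acc_x%) / Acc_0%.
    Lower values indicate a classifier that is more robust to noise."""
    return (100.0 - acc_noisy) / acc_clean

# Hypothetical example: 85% clean accuracy dropping to 78% under noise
print(ela(acc_noisy=78.0, acc_clean=85.0))  # ~0.259
```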
4.3. Experiment 1: Learning from drifting data streams
This experiment analyzes the influence of the proposed dynamic abstaining modification on the performance of ensemble learning methods with respect to the committee approach and base learner used. Apart from the canonical and dynamic abstaining versions, we present the results for static abstaining, where the threshold is fixed for the entire data stream. Additionally, we examine the correlation between the ensemble size and the usage of the abstaining mode.
Averaged prequential accuracies, memory consumption, and update and test times of the examined methods are given in Table 3 for ensembles of AHTs, in Table 4 for ensembles of NBs, and in Table 5 for ensembles of MLPs. Memory usage, as well as update and testing times, were calculated over 1000 instances processed in an online mode. We present the best results obtained from the ensemble sizes evaluated in separate experiments, ranging from 5 to 50 base classifiers for each data stream. The relationships between size and accuracy are depicted in Figure 4 for ensembles using AHTs; the dependencies for the remaining base classifiers were identical. For the sake of clarity, Figure 4 depicts only canonical and dynamically abstaining ensembles, as static ensembles are bound to underperform on drifting data streams and thus we do not need to focus on them. The Shaffer post-hoc test results are given in Table 6.
Firstly, we discuss the comparison among the canonical, static abstaining, and dynamic abstaining methods. One can easily see that static ensembles return inferior accuracy on all of the drifting data streams. This can be explained by a lack of adaptiveness to changes in the data. A preset threshold cannot capture the dynamic nature of incoming instances and is thus
Table 3: Average prequential accuracies and computational complexities for canonical and abstaining online ensembles with Adaptive Hoeffding Tree as a base classifier. Best obtained results are in bold.

Dataset      Online Bagging (NoAbst / StAbst / DyAbst)   Leveraging Bagging (NoAbst / StAbst / DyAbst)
RBF grad 91.47 87.43 93.18 93.68 89.85 95.09
RBF blip 92.38 88.16 93.70 94.16 90.12 95.03
Hyp slow 89.93 85.12 89.12 85.48 79.49 84.98
Hyp fast 88.96 86.84 91.38 87.52 85.38 90.07
SEA sudden 88.07 82.13 91.66 87.24 80.98 90.32
LED fast 67.62 63.05 71.16 66.74 59.18 70.30
LED nodr 51.23 51.47 51.47 50.64 50.93 50.93
RTree 43.30 40.35 42.00 39.79 36.81 38.14
Waveform 81.84 80.38 83.97 82.32 81.06 84.04
CovType 80.40 79.11 81.39 81.04 80.03 81.82
Electricity 77.31 75.18 80.48 77.06 74.92 78.11
Poker 61.18 62.03 62.03 82.62 82.81 82.81
Avg. RAM-Hours 0.003 0.003 0.004 0.02 0.02 0.03
Avg. Train time(s.) 4.76 4.76 4.93 12.03 12.03 12.76
Avg. Test time(s.) 0.38 0.38 0.40 0.44 0.44 0.46
not sufficient for non-stationary scenarios. In cases of severe concept drift the threshold may be too high, forcing most of the classifiers to abstain despite their adaptation to the new state of the stream. This happens frequently right after the drift, when ensemble members try to recover and some base classifiers suffer a lower loss of competence than others. A fixed threshold will not differentiate between stable and drifting stages of the stream, thus excluding classifiers in moments when a loss of certainty does not necessarily mean a complete loss of competence; here static abstaining can be seen as too strict. On the other hand, one may imagine a situation in which the stream stabilizes and only the most competent classifiers should be promoted and allowed to participate in voting; here the static threshold may be too small and permit too many classifiers to partake in the final decision making, so static abstaining can be seen as too liberal and not flexible enough to capture the dynamics of drifting streams. Nevertheless, it is interesting to note the performance of static abstaining on the datasets with no drift (LED nodr and Poker). There, it is able to offer a small improvement over canonical online ensemble learning and returns performance identical to its dynamic counterpart. We explain this by the fact that there is no need to adapt the threshold, as there are no drastic changes in the stream. Therefore, the ensembles are able to benefit from excluding less competent classifiers, while at the same time not being affected by a drastic change of concepts. However, as the existence of static data streams is very unlikely in real-world scenarios, we may conclude, to no surprise, that the static abstaining approach should be avoided in online ensemble learning.
Secondly, we compare the canonical online ensembles with the dynamic abstaining modification. One can see that the proposed dynamic abstaining leads to significant improvements in the obtained accuracies on most of the used benchmarks, especially those affected by drift.
Table 4: Average prequential accuracies and computational complexities for canonical and abstaining online ensembles with Naïve Bayes as a base classifier. Best obtained results are in bold.

Dataset      Online Bagging (NoAbst / StAbst / DyAbst)   Leveraging Bagging (NoAbst / StAbst / DyAbst)
RBF grad 88.58 84.54 90.62 90.46 85.15 92.20
RBF blip 88.74 85.19 90.92 90.85 84.97 92.36
Hyp slow 89.17 84.92 88.82 85.02 80.00 84.03
Hyp fast 88.39 86.22 91.02 86.71 84.61 89.74
SEA sudden 85.82 79.12 89.09 84.66 78.76 87.75
LED fast 67.14 62.82 70.88 67.48 62.97 71.02
LED nodr 54.66 55.01 55.01 53.72 53.96 53.96
RTree 39.75 36.69 38.52 37.22 33.91 36.47
Waveform 80.67 79.25 83.01 81.17 79.89 83.27
CovType 78.62 77.41 78.25 79.91 77.85 79.52
Electricity 73.82 70.38 77.00 74.81 71.42 77.96
Poker 60.97 61.13 61.13 81.98 82.06 82.06
Avg. RAM-Hours 0.001 0.001 0.001 0.01 0.01 0.01
Avg. Train time(s.) 2.13 2.13 2.39 7.02 7.02 7.48
Avg. Test time(s.) 0.22 0.22 0.23 0.27 0.27 0.28
For ensembles using AHT and MLP classifiers, the canonical approach was better only on two datasets (Hyp slow and RTree), while for those using the NB classifier it was better on three datasets (Hyp slow, RTree, and CovType). The proposed modification, despite its simplicity, is able to improve the performance of online ensemble learning algorithms in non-stationary scenarios, regardless of the ensemble model or base classifier used. Continuous monitoring of the ensemble performance and adapting the abstaining threshold as the stream progresses play a major role here, by allowing only the most competent classifiers to be selected for the voting process, thus reducing the chance of a number of weak classifiers outvoting the few better adapted ones. Such behavior is especially useful during the presence of drift, when we are able to take advantage of a diverse set of base learners that can anticipate the potential directions in which the data may evolve. Selecting for voting only the ones that show faster recovery is a crucial factor in reducing the error when facing instances from the emerging concept. Additionally, one must underline the interplay between the abstaining module and the concept drift detector utilized by the examined ensembles. When the weakest base classifier is replaced by a new one, the new model is expected to be the most competent one: it has been trained only on the most recent instances, while the remaining models combine information extracted from recent and previous instances, thus mixing concepts. Such a single classifier is likely to be outvoted by the other models, as they may adapt to the novel concept much more slowly. By using abstaining, we ensure that uncertain classifiers are excluded, thus increasing the chances of the new learner guiding the decision making process.
One must also analyze the situations in which abstaining returned unsatisfactory performance.
Table 5: Average prequential accuracies and computational complexities for canonical and abstaining online ensembles with Multi-layer Perceptron as a base classifier. Best obtained results are in bold.

Dataset      Online Bagging (NoAbst / StAbst / DyAbst)   Leveraging Bagging (NoAbst / StAbst / DyAbst)
RBF grad 90.28 86.10 91.97 92.44 88.74 94.21
RBF blip 90.37 85.89 92.01 92.63 88.82 94.25
Hyp slow 88.97 83.06 88.26 86.40 81.34 85.92
Hyp fast 88.62 86.13 91.31 86.99 84.50 91.40
SEA sudden 87.72 81.16 90.98 87.31 80.48 90.49
LED fast 67.93 63.11 71.52 68.11 63.47 71.96
LED nodr 50.95 60.57 60.57 49.89 50.16 50.16
RTree 40.03 36.88 39.01 38.31 35.12 37.28
Waveform 82.03 81.36 84.43 82.14 81.37 84.58
CovType 80.54 79.42 80.93 81.39 80.58 81.91
Electricity 76.15 74.11 79.36 75.93 73.79 79.17
Poker 63.17 63.49 63.49 84.68 85.00 85.00
Avg. RAM-Hours 0.009 0.009 0.0010 0.10 0.10 0.11
Avg. Train time(s.) 6.99 6.99 7.86 20.04 20.04 21.01
Avg. Test time(s.) 0.43 0.43 0.062 0.59 0.59 0.62
Table 6: Shaﬀer’s test for comparison between diﬀerent ensemble approaches. Symbol ’>’ stands for situation
in which dynamic abstaining approach is superior.
AHT NB MLP
Hypothesis p-value Hypothesis p-value Hypothesis p-value
OB-DyAbst vs. OB-NoAbst >(0.027) OB-DyAbst vs. OB-NoAbst >(0.031) OB-DyAbst vs. OB-NoAbst >(0.025)
OB-DyAbst vs. OB-StAbst >(0.000) OB-DyAbst vs. OB-StAbst >(0.000) OB-StAbst vs. OB-StAbst >(0.000)
LB-DyAbst vs. LB-NoAbst >(0.022) LB-DyAbst vs. LB-NoAbst >(0.028) LB-DyAbst vs. LB-NoAbst >(0.018)
LB-DyAbst vs. LB-StAbst >(0.000) LB-DyAbst vs. LB-StAbst >(0.000) LB-StAbst vs. LB-StAbst >(0.000)
For the Hyp slow data stream we may assume that the drop in accuracy is related to the slow nature of the changes. The concept drift introduced here is incremental in nature, but the speed of change is low. A situation may occur in which local mistakes of the ensemble lead to a ratio of threshold changes that is uneven in comparison to the actual drift. Classifiers that could still contribute to the decision making process were abstaining, weakening the collective predictive power. In the case of the RTree stream we deal with a highly randomized dataset. Due to the sudden and random nature of the changes, a situation is bound to happen in which a classifier trained on the previous concept displays a high certainty on a new instance arriving after a rapid drift (e.g., two classes suddenly exchanged their positions). This leads to selecting classifiers that seem competent, but in fact have not yet managed to accommodate the new distribution of instances. Additionally, according to the way most online ensembles are updated, the concept drift detector can replace only one classifier at a time. Even with the abstaining modification, they are vulnerable to situations in which all of the base classifiers suddenly suffer a significant drop in actual competence. Finally, we may assume that the combination
Figure 4: Relationship between ensemble size and accuracy (Adaptive Hoeffding Tree).
of abstaining with the NB classifier cannot properly capture the properties of the CovType stream.
Having discussed the general properties of dynamic abstaining ensembles, we now analyze the correlation between the ensemble size and the achieved performance. Figure 4 depicts the accuracies for ensembles having from 5 to 50 base learners. One can see two tendencies. For static streams, or ones with small drifts, both Online Bagging versions display behavior similar to static Bagging: after growing to a certain size, adding new ensemble members does not contribute to the accuracy, but a larger pool of classifiers does not negatively impact the performance either. This follows the observations in many studies on the behavior of static Bagging. Yet for drifting data streams we observe a different behavior. Here, ensembles improve their performance up to a certain ensemble size, and then adding more base learners leads to a significant drop in accuracy. This can be explained by the negative impact of wrongly managed diversity. While a more diverse pool of learners should allow for better anticipation of potential drifts, having many weak models may actually destabilize the voting procedure during drifts. Interestingly, ensembles augmented with dynamic abstaining achieve better performance when using a larger number of base learners and display less significant drops in accuracy when the ensemble size grows beyond the optimal point for a given dataset. This further proves that abstaining allows diversity within the ensemble to be better managed, allowing it to be used for tackling concept drifts.
Figure 5 shows the progress of the prequential accuracy on the RBF grad data stream for ensembles using AHT as the base classifier. The results were averaged over 1000 instances. By visual inspection one may easily see the moments where concept drift occurred. Both canonical implementations of OB and LB display a significant error for a number of instances after the drift, showing that despite using a drift detector they require some time to recover their performance. On the other hand, their abstaining versions are able to adapt much faster after the drift occurrence, which can easily be observed in the plot: they display much smaller sudden drops in accuracy and are able to recover faster. Abstaining increases the role of the newly added (after-drift) base classifier, which is able to contribute more efficiently to the final decision. In many cases the abstaining ensembles also display improved performance on relatively static parts of the stream. We must note a few cases where they display a higher error than the canonical methods, most likely due to an incorrectly estimated abstaining threshold that excludes too many classifiers. Nevertheless, it is important to note that the abstaining versions always behave better than their canonical counterparts during drift periods, which is a highly desirable property for non-stationary environments.
By analyzing the memory consumption and update times (see Table 3 and Figures 6 and 7), one can see that the proposed abstaining modification imposes negligible additional computational costs upon the ensemble learning procedure. In most cases, we observe only a small increase in training times (on average 2%-4%) and used memory (3%-6%). The classification time is practically the same, as we do not modify the voting procedure itself, only select the classifiers that partake in it. However, one may also assume that the increase is partially due to the abstaining ensembles utilizing slightly larger pools of base classifiers, which obviously affects the computational complexity. Still, the conducted experiments prove that our modification is truly lightweight and does not have a negative impact on the computational requirements.
4.4. Experiment 2: Learning from drifting data streams under the presence of noise
The aim of the second experiment was to evaluate whether the proposed dynamic abstaining modification can increase the robustness of online ensemble learning methods to noise present in data streams. The 12 datasets described in Table 1 were injected with randomized feature noise [40] ranging between 5% and 50%. This means that x% of the attributes in the dataset are corrupted. To corrupt an attribute, approximately x% of the examples in the data set are chosen and the values of their selected features are replaced with random values from the original distribution. A uniform distribution is used for both numerical and nominal attributes (a sketch of this procedure follows below). We repeat the random noise injection every 1000 instances in order to dynamically change the features that are corrupted. If we modified the same features for the entire data stream, then the classifiers might actually learn the underlying noise distribution. We created 120 noisy data stream benchmarks that were used in the following experiments, leading to a thorough study of the influence of noise on online ensemble learning.
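A minimal sketch of this injection procedure for numerical attributes is given below; drawing replacement values uniformly from each attribute's observed range is our simplification of "random values from the original distribution", and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def inject_feature_noise(X, noise_level):
    """Corrupt roughly `noise_level` of the attributes: for each selected
    attribute, roughly `noise_level` of the examples get their value
    replaced by a uniform draw from that attribute's observed range."""
    X = X.copy()
    n, d = X.shape
    n_feats = max(1, round(noise_level * d))
    for f in rng.choice(d, size=n_feats, replace=False):
        lo, hi = X[:, f].min(), X[:, f].max()
        rows = rng.random(n) < noise_level          # affected examples
        X[rows, f] = rng.uniform(lo, hi, size=int(rows.sum()))
    return X

X_noisy = inject_feature_noise(rng.normal(size=(1000, 10)), noise_level=0.2)
```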
In order to provide insight into the influence of noise on classifier performance, we report not only the accuracy but also the ELA metric. It indicates whether the differences in accuracies under a certain level of noise are only due to the original differences between the methods, or are actually caused by one of the methods being more robust.
Prequential accuracies and ELA values, averaged over all datasets with respect to the varying level of noise and the base classifiers used, are given in Figures 8 - 13. Full results for these datasets can be found in the supplementary materials to this paper. Results of Shaffer's post-hoc test over all noise levels are given in Table 7. Additionally, a visualization of the correlation between the noise level and accuracy / ELA for the LED fast data stream is depicted in Figure 14. We present results only for the canonical and dynamic abstaining ensembles, as static abstaining ensembles completely fail in noisy scenarios.
Table 7: Shaﬀer’s test for comparison between diﬀerent ensemble approaches, averaged among all noise levels
with respect to prequential accuracy and ELA metrics. Symbol ’>’ stands for situation in which dynamic
abstaining approach is superior.
AHT NB MLP
Hypothesis p-value Hypothesis p-value Hypothesis p-value
prequential accuracy
OB-DyAbst vs. OB-NoAbst >(0.102) OB-DyAbst vs. OB-NoAbst >(0.099) OB-DyAbst vs. OB-NoAbst >(0.107)
LB-DyAbst vs. LB-NoAbst >(0.0108) LB-DyAbst vs. LB-NoAbst >(0.103) LB-DyAbst vs. LB-NoAbst >(0.111)
ELA
OB-DyAbst vs. OB-NoAbst >(0.032) OB-DyAbst vs. OB-NoAbst >(0.029) OB-DyAbst vs. OB-NoAbst >(0.033)
LB-DyAbst vs. LB-NoAbst >(0.041) LB-DyAbst vs. LB-NoAbst >(0.40) LB-DyAbst vs. LB-NoAbst >(0.042)
The obtained results clearly show that noisy data streams pose a significant challenge for
standard online ensembles. While most methods behave reasonably well under small levels
of noise (5% - 10%), higher levels lead to significant drops in accuracy. Here, dynamic
abstaining alleviates the accuracy loss, offering improved performance regardless of the level
of noise. It is interesting to notice that even for datasets on which the abstaining versions
did not perform well in the noise-free setting (Hyp slow and RTree), they outperform the
canonical approaches once noise is introduced. One may argue that this merely reflects their
improved predictive power, with the differences preserved across all noise levels; the ELA
results, analyzed as a companion to accuracy, address this concern. Here we can see that
dynamic abstaining ensembles offer a very significant improvement over their canonical
versions, and this advantage is preserved regardless of the noise level. This can be explained
by the fact that when noise affects attribute values, the corresponding instance is dislocated
in the decision space. This in turn influences the certainty of the base classifiers: the closer
an instance gets to the decision boundary, the lower the certainty. Such classifiers will
therefore abstain and be excluded from the voting. As we use diversified ensembles, there is
a high chance that some of the base classifiers rely on decision boundaries that remain
effective despite the noise. By reducing the number of voting members, we reduce the
probability that a noisy instance affects the majority of them and leads to an incorrect
decision.
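This voting mechanism can be made concrete with a short sketch. The code below is an illustrative reimplementation rather than the authors' original code: it assumes base learners exposing a scikit-learn-style predict_proba, measures certainty as the maximum posterior probability, and falls back to plain majority voting in the unlikely case that every member abstains.

```python
import numpy as np

def abstaining_predict(ensemble, x, threshold):
    """Combine only those base classifiers whose certainty (maximum
    posterior probability) for instance x reaches the threshold."""
    votes = {}
    for clf in ensemble:
        proba = clf.predict_proba(x.reshape(1, -1))[0]
        if proba.max() >= threshold:       # confident member votes ...
            label = int(proba.argmax())
            votes[label] = votes.get(label, 0) + 1
        # ... while uncertain members abstain and are excluded
    if not votes:                          # all abstained: majority vote
        return abstaining_predict(ensemble, x, threshold=0.0)
    return max(votes, key=votes.get)
```

A noisy instance pushed toward a decision boundary lowers the certainty of the affected members, so they silently drop out of the vote, while the unaffected, diverse members still carry the decision.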
Of course, such a strategy is not guaranteed to work in all cases. It is easy to construct a
counterexample in which an instance is affected so strongly by the noise that it shifts to the
opposite side of the decision boundary, leading to the selection of a classifier with high
certainty but an incorrect prediction. Such cases are bound to happen when the noise level
is high. However, since we use an ensemble approach, it is unlikely that all of the base
classifiers will be affected in such a way.
It should be stressed that the proposed method is a simple and flexible modification that can
be applied to most online ensemble classifiers. Existing approaches for noisy data streams
are based either on costly filters or on specific learning algorithms. Our aim was not to
prove that our method outperforms them, but to show that, at no direct cost and with no
negative influence on accuracy, we are able to improve the robustness of popular online
ensemble learners to noise. In most real-life scenarios noise is unexpected and does not
persist throughout the entire stream processing time; therefore, one cannot predict when to
switch to a dedicated noise-handling approach. Our abstaining modification can be used at
all times during the mining of drifting data streams, and when noise appears it will improve
the robustness of the underlying ensemble.
Figure 5: (Top) Prequential accuracies of the evaluated ensembles (OB-NoAbst, OB-DyAbst, LB-NoAbst,
LB-DyAbst) with Adaptive Hoeffding Tree as a base classifier, averaged over windows of 1000 instances for
the RBF grad data stream. (Bottom) Selected subsets of instances showcasing the drift recovery capabilities
of the examined methods.
Figure 6: Memory usage (log scale, RAM-Hours) of the evaluated ensembles (OB-NoAbst, OB-DyAbst,
LB-NoAbst, LB-DyAbst) with Adaptive Hoeffding Tree as a base classifier, averaged over windows of 1000
instances for the RBF grad data stream. Abstaining imposes almost no additional memory overhead
(between 3-6% more compared to the original version).
Figure 7: Ensemble update time (test + train time, in seconds) with Adaptive Hoeffding Tree as a base
classifier, averaged over windows of 1000 instances for the RBF grad data stream: (left) ensembles without
abstaining, (right) ensembles with abstaining. Abstaining imposes almost no additional time complexity
(between 0-3% more compared to the original version).
Figure 8: Averaged prequential accuracy over all data streams with respect to the varying level of noise
(5%-50%) for canonical (OB_NA, LB_NA) and dynamic abstaining (OB_DA, LB_DA) online ensembles
with Adaptive Hoeffding Tree as a base classifier.
Figure 9: Averaged ELA over all data streams with respect to the varying level of noise for canonical and
abstaining online ensembles with Adaptive Hoeffding Tree as a base classifier.
Figure 10: Averaged prequential accuracy over all data streams with respect to the varying level of noise
for canonical and abstaining online ensembles with Naïve Bayes as a base classifier.
Figure 11: Averaged ELA over all data streams with respect to the varying level of noise for canonical and
abstaining online ensembles with Naïve Bayes as a base classifier.
Figure 12: Averaged prequential accuracy over all data streams with respect to the varying level of noise
for canonical and abstaining online ensembles with Multi-layer Perceptron as a base classifier.
Figure 13: Averaged ELA over all data streams with respect to the varying level of noise for canonical and
abstaining online ensembles with Multi-layer Perceptron as a base classifier.
Figure 14: Performance of the examined ensemble methods (OB-NoAbst, OB-DyAbst, LB-NoAbst,
LB-DyAbst) for varying levels of noise added to the LED fast data stream: (a) accuracy and ELA for
ensembles with Adaptive Hoeffding Tree; (b) accuracy and ELA for ensembles with Naïve Bayes;
(c) accuracy and ELA for ensembles with Multi-layer Perceptron.
5. Conclusions and future works
In this paper we have introduced a lightweight dynamic abstaining approach that can be
used to augment any online ensemble learning scheme. We utilized a certainty threshold to
determine which ensemble members are allowed to participate in voting: if the certainty of
a given base classifier falls below the threshold, it abstains from making a decision and is
excluded from the current class label prediction. This procedure is repeated for each instance
from the data stream, leading to an indirect dynamic ensemble selection, as a different subset
of classifiers may form the sub-ensemble for each new instance. We proposed an adaptive
strategy for calculating the threshold as the data stream progresses, based on monitoring
the correctness of the ensemble decision and adjusting the abstaining value accordingly.
This allows the ensemble to track changes in the stream and promote the most competent
classifiers. We showed that our proposal exploits the underlying diversity in online ensembles,
taking advantage of a larger pool of base learners. The proposed approach imposes almost
no additional cost on the original ensemble learning scheme, making it an attractive
proposition for general-purpose mining of non-stationary data streams. Additionally, we
have shown that the proposed dynamic abstaining improves the robustness of online
ensembles to noise present in data streams.
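To summarize the adaptation mechanism in code, the sketch below shows one plausible form of such a correctness-driven update; the step size delta and the clipping bounds are our own illustrative assumptions, not the exact rule used in the paper.

```python
def update_threshold(threshold, ensemble_correct, delta=0.01,
                     lo=0.5, hi=0.95):
    """Illustrative self-adapting abstaining threshold: relax it after
    correct ensemble decisions (let more members vote), tighten it after
    mistakes (demand higher certainty), and keep it within [lo, hi]."""
    threshold += -delta if ensemble_correct else delta
    return min(max(threshold, lo), hi)
```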
The experimental study utilizing 12 data stream benchmarks and 120 noisy data streams
proved the efficiency of the dynamic abstaining modification. We examined the performance
of our approach on two popular online ensemble models and three online base learners,
backing up our observations with statistical testing. Robustness to noise was evaluated using
both accuracy and the dedicated ELA measure, showing that the good performance of our
method can truly be attributed to better robustness.
The obtained results encourage us to continue this line of work. We plan to further
investigate methods for determining which classifiers should abstain and to further improve
their ability to handle various types of concept drift. At the same time, we intend to maintain
the low computational complexity of our methods; thus we do not plan to follow the direction
of dynamic ensemble selection methods used for data streams, as they require too much
processing time and additional instances. Finally, we envision adapting our method to
mining recurrent concept drifts, by making members abstain on new concepts and retrieving
them when a previously seen concept reemerges.
References
[1] X. Wu, X. Zhu, G. Wu, W. Ding, Data Mining with Big Data, IEEE Trans. Knowl. Data Eng. 26 (2014) 97-107.
[2] E. Lughofer, P. P. Angelov, Handling drifts and shifts in on-line data streams with evolving fuzzy systems, Appl. Soft Comput. 11 (2011) 2057-2068.
[3] M. Chen, B. Chen, A hybrid fuzzy time series model based on granular computing for stock price forecasting, Inf. Sci. 294 (2015) 227-241.
[4] E. Lughofer, E. Weigl, W. Heidl, C. Eitzinger, T. Radauer, Integrating new classes on the fly in evolving fuzzy classifier designs and their application in visual inspection, Appl. Soft Comput. 35 (2015) 558-582.
[5] B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, M. Woźniak, Ensemble learning for data stream analysis: a survey, Information Fusion 37 (2017) 132-156.
[6] T. Pietraszek, On the use of ROC analysis for the optimization of abstaining classifiers, Machine Learning 68 (2007) 137-169.
[7] T. Pietraszek, Classification of intrusion detection alerts using abstaining classifiers, Intelligent Data Analysis 11 (2007) 293-316.
[8] J. Błaszczyński, J. Stefanowski, M. Zając, Ensembles of abstaining classifiers based on rule sets, in: International Symposium on Methodologies for Intelligent Systems, pp. 382-391.
[9] L. L. Minku, A. P. White, X. Yao, The impact of diversity on online ensemble learning in the presence of concept drift, IEEE Trans. Knowl. Data Eng. 22 (2010) 730-742.
[10] T. Windeatt, Accuracy/diversity and ensemble MLP classifier design, IEEE Trans. Neural Networks 17 (2006) 1194-1211.
[11] M. M. Gaber, Advances in data stream mining, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2 (2012) 79-85.
[12] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Computing Surveys 46 (2014) 44:1-44:37.
[13] G. Ditzler, M. Roveri, C. Alippi, R. Polikar, Learning in nonstationary environments: A survey, IEEE Comp. Int. Mag. 10 (2015) 12-25.
[14] L. I. Kuncheva, Classifier ensembles for detecting concept change in streaming data: Overview and perspectives, in: 2nd Workshop SUEMA (ECAI 2008), pp. 5-10.
[15] J. Gama, P. Medas, G. Castillo, P. P. Rodrigues, Learning with drift detection, in: Advances in Artificial Intelligence - SBIA 2004, 17th Brazilian Symposium on Artificial Intelligence, São Luís, Maranhão, Brazil, September 29 - October 1, 2004, Proceedings, pp. 286-295.
[16] A. Bifet, R. Gavaldà, Learning from time-changing data with adaptive windowing, in: Proceedings of the Seventh SIAM International Conference on Data Mining, April 26-28, 2007, Minneapolis, Minnesota, USA, pp. 443-448.
[17] P. Sobolewski, M. Woźniak, Concept drift detection and model selection with simulated recurrence and ensembles of statistical detectors, Journal of Universal Computer Science 19 (2013) 462-483.
[18] M. Woźniak, A hybrid decision tree training method using data streams, Knowl. Inf. Syst. 29 (2011) 335-347.
[19] J. Shan, J. Luo, G. Ni, Z. Wu, W. Duan, CVS: fast cardinality estimation for large-scale data streams over sliding windows, Neurocomputing 194 (2016) 107-116.
[20] P. Domingos, G. Hulten, Mining High-Speed Data Streams, in: I. Parsa, R. Ramakrishnan, S. Stolfo (Eds.), Proceedings of the ACM Sixth International Conference on Knowledge Discovery and Data Mining, ACM Press, Boston, USA, 2000, pp. 71-80.
[21] M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Information Fusion 16 (2014) 3-17.
[22] M. Woźniak, Application of combined classifiers to data stream classification, in: Computer Information Systems and Industrial Management - 12th IFIP TC8 International Conference, CISIM 2013, Krakow, Poland, September 25-27, 2013. Proceedings, pp. 13-23.
[23] Y. Sun, K. Tang, L. L. Minku, S. Wang, X. Yao, Online ensemble learning of data streams with gradually evolved classes, IEEE Trans. Knowl. Data Eng. 28 (2016) 1532-1545.
[24] J. Stefanowski, Adaptive ensembles for evolving data streams - combining block-based and online solutions, in: New Frontiers in Mining Complex Patterns - 4th International Workshop, NFMCP 2015, Held in Conjunction with ECML-PKDD 2015, Porto, Portugal, September 7, 2015, Revised Selected Papers, pp. 3-16.
[25] J. Gama, R. Sebastião, P. P. Rodrigues, On evaluating stream learning algorithms, Machine Learning 90 (2013) 317-346.
[26] R. M. O. Cruz, R. Sabourin, G. D. C. Cavalcanti, T. I. Ren, META-DES: A dynamic ensemble selection framework using meta-learning, Pattern Recognition 48 (2015) 1925-1935.
[27] P. Trajdos, M. Kurzynski, A dynamic model of classifier competence based on the local fuzzy confusion matrix and the random reference classifier, Applied Mathematics and Computer Science 26 (2016) 175.
[28] M. Woźniak, P. Ksieniewicz, B. Cyganek, A. Kasprzak, K. Walkowiak, Active learning classification of drifted streaming data, in: International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA, pp. 1724-1733.
[29] J. A. Sáez, M. Galar, J. Luengo, F. Herrera, INFFC: an iterative class noise filter based on the fusion of classifiers with noise sensitivity control, Information Fusion 27 (2016) 19-32.
[30] X. Zhu, P. Zhang, X. Wu, D. He, C. Zhang, Y. Shi, Cleansing noisy data streams, in: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, 2008, Pisa, Italy, pp. 1139-1144.
[31] S. Hashemi, Y. Yang, Flexible decision tree for data stream classification in the presence of concept change, noise and missing values, Data Min. Knowl. Discov. 19 (2009) 95-131.
[32] P. Li, X. Wu, X. Hu, Q. Liang, Y. Gao, A random decision tree ensemble for mining concept drifts from noisy data streams, Applied Artificial Intelligence 24 (2010) 680-710.
[33] P. Zhang, X. Zhu, Y. Shi, L. Guo, X. Wu, Robust ensemble learning for mining noisy data streams, Decision Support Systems 50 (2011) 469-479.
[34] X. Zhu, X. Wu, Y. Yang, Dynamic classifier selection for effective mining from noisy data streams, in: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 1-4 November 2004, Brighton, UK, pp. 305-312.
[35] N. C. Oza, S. J. Russell, Online bagging and boosting, in: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, AISTATS 2001, Key West, Florida, US, January 4-7, 2001.
[36] A. Bifet, G. Holmes, B. Pfahringer, Leveraging bagging for evolving data streams, in: Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, Proceedings, Part I, pp. 135-150.
[37] A. Bifet, R. Gavaldà, Adaptive learning from evolving data streams, in: Advances in Intelligent Data Analysis VIII, 8th International Symposium on Intelligent Data Analysis, IDA 2009, Lyon, France, August 31 - September 2, 2009. Proceedings, pp. 249-260.
[38] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power, Information Sciences 180 (2010) 2044-2064.
[39] J. A. Sáez, J. Luengo, F. Herrera, Evaluating the classifier behavior with noisy data considering performance and robustness: The equalized loss of accuracy measure, Neurocomputing 176 (2016) 26-35.
[40] X. Zhu, X. Wu, Class noise vs. attribute noise: A quantitative study, Artif. Intell. Rev. 22 (2004) 177-210.