Concept Drift Detection from Multi-Class
Imbalanced Data Streams
1st Łukasz Korycki
Department of Computer Science
Virginia Commonwealth University
Richmond VA, USA
koryckil@vcu.edu
2nd Bartosz Krawczyk
Department of Computer Science
Virginia Commonwealth University
Richmond VA, USA
bkrawczyk@vcu.edu
Abstract—Continual learning from data streams is among the
most important topics in contemporary machine learning. One of
the biggest challenges in this domain lies in creating algorithms
that can continuously adapt to arriving data. However, previously
learned knowledge may become outdated, as streams evolve over
time. This phenomenon is known as concept drift and must be
detected to facilitate efficient adaptation of the learning model.
While there exists a plethora of drift detectors, all of them
assume that we are dealing with roughly balanced classes. In
the case of imbalanced data streams, those detectors will be
biased towards the majority classes, ignoring changes happening
in the minority ones. Furthermore, class imbalance may evolve
over time and classes may change their roles (majority becoming
minority and vice versa). This is especially challenging in the
multi-class setting, where relationships among classes become
complex. In this paper, we propose a detailed taxonomy of
challenges posed by concept drift in multi-class imbalanced data
streams, as well as a novel trainable concept drift detector based
on a Restricted Boltzmann Machine. It is capable of monitoring
multiple classes at once and using reconstruction error to detect
changes in each of them independently. Our detector utilizes
a skew-insensitive loss function that allows it to handle multiple
imbalanced distributions. Due to its trainable nature, it is capable
of following changes in a stream and evolving class roles, and it
can deal with local concept drift occurring in minority classes.
An extensive experimental study on multi-class drifting data
streams, enriched with a detailed analysis of the impact of local
drifts and changing imbalance ratios, confirms the high efficacy
of our approach.
Index Terms—machine learning, data stream mining, continual
learning, concept drift, class imbalance
I. INTRODUCTION
Continual learning from streaming data is a well-
established, yet still rapidly developing field of modern ma-
chine learning [1]. Contemporary data sources generate in-
formation characterized by both volume and velocity, thus
continuously flooding learning systems. This makes traditional
classification methods too slow and unable to handle the ever-
changing properties of arriving instances [2]. Therefore, new
algorithms are being developed with their efficacy and adaptive
capabilities in mind [3]. Learning methods for streaming data
must be capable of working under strictly limited memory
and time consumption while offering the ability to continually
incorporate new instances, and swiftly adapting to evolving
data stream characteristics [4]. This phenomenon is known as
concept drift and may influence the properties of a stream
in a multitude of ways, from changing class distributions [5]
to new features or classes emerging [6]. Timely detection of
concept drift and using this information to adapt the classifier
is of crucial importance to any continual learning system.
One of the biggest challenges in learning from data streams
is the non-stationary class imbalance [7]. Skewed data distributions
are a very challenging topic in standard machine learning,
one that has been studied for over 25 years. In the continual learning
domain, class imbalance becomes even more difficult, as we
not only need to deal with disproportion among classes, but
also with their evolving nature [8]. Class roles and imbalance
ratios are subject to change over time and we need to monitor
them closely to understand which class poses the biggest
problem for the classifier at a given moment, and why [9].
When we extend this to a multi-class imbalance setting, we
get a complex and perplexing scenario that actually occurs
in many real-life applications. Concept drift detection in such
problems becomes extremely demanding, as we need to factor
in both evolving nature of multiple classes, as well as their
non-stationary skewed distributions.
Research goal. To propose a fully trainable drift detector that
is capable of handling multi-class imbalanced data streams
with special focus on evolving minority classes, and with
changes appearing at both global (all classes affected) and
local (some classes affected) levels.
Motivation. While there exists a plethora of drift detectors
proposed in the literature, most of them share two limitations:
(i) they assume roughly balanced data distributions and thus
are likely to omit concept drift happening in minority classes;
and (ii) they monitor global data stream characteristics, thus
detecting concept drifts that affect the entire stream, not partic-
ular classes or decision regions. This makes the state-of-the-art
drift detectors unsuitable for mining imbalanced data streams,
especially when multiple classes are involved. There is a need
to develop a new drift detector that is skew-insensitive, can
monitor multiple classes at once, and can rapidly adapt to
changing imbalance ratios and classes switching roles, as none
of the existing methods is capable of this.
Summary. In this paper we propose RBM-IM, a trainable
concept drift detector for continual learning from multi-class
imbalanced data streams.

2021 IEEE 37th International Conference on Data Engineering (ICDE) · 2375-026X/21/$31.00 ©2021 IEEE · DOI 10.1109/ICDE51399.2021.00097

It is realized as a Restricted Boltzmann Machine neural network with skew-insensitive
modifications of the training procedure. We use it to track
reconstruction error for each class independently and signal
if any of them has been subject to a significant change over
the most recent mini-batch of data. Our drift detector re-trains
itself in an online fashion, allowing it to handle dynamically
changing imbalance ratio, as well as evolving class roles
(minority classes becoming majority and vice versa). RBM-
IM is capable of detecting drifts occurring at both global and
local levels, allowing for complex monitoring of multi-class
imbalanced data streams and understanding the nature of each
change that takes place.
Main contributions. We offer the following novel contribu-
tions to the field of continual learning from data streams.
•RBM-IM: a novel and trainable concept drift detector
realized as a Restricted Boltzmann Machine with a skew-insensitive
loss function that is capable of monitoring multi-class
imbalanced data streams with dynamic imbalance ratio.
•Robustness to class imbalance: RBM-IM provides
robustness to multi-class skewed distributions, offering
excellent detection rates of drifts appearing in minority
classes without being biased towards majority concepts.
•Detecting local and global changes: RBM-IM is capable
of detecting concept drifts affecting only a subset of mi-
nority classes (even when only a single class is affected),
offering a better understanding of the nature of changes
than any state-of-the-art drift detector.
•Taxonomy of multi-class imbalanced data streams: we
propose a systematic view on possible challenges that
can be encountered in continual learning from multi-class
imbalanced data streams and formulate three scenarios
that allow us to model such changes.
•Extensive experimental study: we evaluate the efficacy
of RBM-IM on a thoroughly designed experimental test
bed using both real-world and artificial benchmarks. We
introduce a novel approach towards evaluating concept
drift detectors on imbalanced data streams, by measuring
their reactivity to drifts occurring only in a subset of
minority classes, as well as by checking their robustness
to increasing imbalance ratio among multiple classes.
II. DATA STREAM MINING
A data stream is defined as a sequence $\langle S_1, S_2, \ldots, S_n, \ldots \rangle$,
where each element $S_j$ is a new instance. In this paper,
we assume the (partially) supervised learning scenario with a
classification task, and thus we define each instance as $S_j \sim p_j(x_1, \ldots, x_d, y) = p_j(\mathbf{x}, y)$,
where $p_j(\mathbf{x}, y)$ is a joint distribution of the $j$-th instance, defined by a $d$-dimensional feature
space and assigned to class $y$. Each instance is independent
and drawn randomly from a probability distribution $\Psi_j(\mathbf{x}, y)$.
Concept drift. When all instances come from the same
distribution, we deal with a stationary data stream. In real-
world applications, data very rarely falls under stationary
assumptions [10]. It is more likely to evolve over time and
form temporary concepts, being subject to concept drift [11].
This phenomenon affects various aspects of a data stream and
thus can be analyzed from multiple perspectives. One cannot
simply claim that a stream is subject to drift. The drift needs to
be analyzed and understood so that the specific changes that
occur can be handled adequately [12]. Let us now discuss the
major aspects of concept drift and its characteristics.
Influence on decision boundaries. Firstly, we need to take
into account how concept drift impacts the learned decision
boundaries, distinguishing between real and virtual concept
drifts [13]. The former influences previously learned decision
rules or classification boundaries, decreasing their relevance
for newly incoming instances. Real drift affects posterior
probabilities $p_j(y|\mathbf{x})$ and additionally may impact unconditional
probability density functions. It must be tackled as soon as it
appears, since it negatively impacts the underlying classifier.
Virtual concept drift affects only the distribution of features $\mathbf{x}$
over time:
$$\hat{p}_j(\mathbf{x}) = \sum_{y \in Y} p_j(\mathbf{x}, y), \tag{1}$$
where $Y$ is the set of possible values taken by $S_j$. While it seems
less dangerous than real concept drift, it cannot be ignored.
Despite the fact that only the values of features change, it
may trigger false alarms and thus force unnecessary and costly
adaptations.
Locality of changes. It is important to distinguish between
global and local concept drifts [14]. The former affects the
entire stream, while the latter affects only certain parts of it
(e.g., individual clusters of instances, or subsets of classes).
Speed of changes. Here we distinguish between sudden,
gradual, and incremental concept drifts [11].
•Sudden concept drift is a case when the instance distribution
abruptly changes with the $t$-th example arriving from the
stream:
$$p_j(\mathbf{x}, y) = \begin{cases} D_0(\mathbf{x}, y), & \text{if } j < t \\ D_1(\mathbf{x}, y), & \text{if } j \geq t. \end{cases} \tag{2}$$
•Incremental concept drift is a case when we have a
continuous progression from one concept to another (thus
consisting of multiple intermediate concepts in between),
such that the distance from the old concept is increasing,
while the distance to the new concept is decreasing:
$$p_j(\mathbf{x}, y) = \begin{cases} D_0(\mathbf{x}, y), & \text{if } j < t_1 \\ (1 - \alpha_j) D_0(\mathbf{x}, y) + \alpha_j D_1(\mathbf{x}, y), & \text{if } t_1 \leq j < t_2 \\ D_1(\mathbf{x}, y), & \text{if } t_2 \leq j \end{cases} \tag{3}$$
where
$$\alpha_j = \frac{j - t_1}{t_2 - t_1}. \tag{4}$$
•Gradual concept drift is a case where instances arriving
from the stream oscillate between two distributions during
the duration of the drift, with the old concept appearing
with decreasing frequency:
$$p_j(\mathbf{x}, y) = \begin{cases} D_0(\mathbf{x}, y), & \text{if } j < t_1 \\ D_0(\mathbf{x}, y), & \text{if } t_1 \leq j < t_2 \wedge \delta > \alpha_j \\ D_1(\mathbf{x}, y), & \text{if } t_1 \leq j < t_2 \wedge \delta \leq \alpha_j \\ D_1(\mathbf{x}, y), & \text{if } t_2 \leq j, \end{cases} \tag{5}$$
where $\delta \in [0, 1]$ is a random variable.
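The three drift-speed definitions above can be illustrated with a small sampling sketch. This is our own toy example (the concepts $D_0$, $D_1$, the single Gaussian feature, and all parameter values are illustrative choices, not from the paper):

```python
import random

# Two illustrative concepts D0 and D1: each draws a single feature value
# from a different Gaussian distribution (our choice for illustration).
def D0(rng):
    return rng.gauss(0.0, 1.0)

def D1(rng):
    return rng.gauss(5.0, 1.0)

def sample(j, drift="sudden", t=500, t1=300, t2=700, rng=random.Random(42)):
    """Draw the j-th instance under one of the three drift types (Eqs. 2-5)."""
    if drift == "sudden":                       # Eq. 2: abrupt switch at t
        return D0(rng) if j < t else D1(rng)
    if j < t1:
        return D0(rng)
    if j >= t2:
        return D1(rng)
    alpha = (j - t1) / (t2 - t1)                # Eq. 4: progression in [0, 1]
    if drift == "incremental":                  # Eq. 3: intermediate concepts;
        # a convex combination of draws is one simple way to realize them
        return (1 - alpha) * D0(rng) + alpha * D1(rng)
    if drift == "gradual":                      # Eq. 5: oscillate between pure
        # concepts, old one appearing with decreasing frequency
        return D0(rng) if rng.random() > alpha else D1(rng)
    raise ValueError(drift)

stream = [sample(j) for j in range(1000)]       # a 1000-instance drifting stream
```

Note the difference between incremental and gradual drift in the sketch: the former emits instances from interpolated (intermediate) concepts, while the latter emits every instance from one of the two pure concepts.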
III. RELATED WORKS
Drift detectors. In order to be able to adapt to evolving data
streams, classifiers must either have explicit information on
when to update their model or use continuous learning to
follow the progression of a stream. Concept drift detectors
are external tools that can be paired with any classifier and
used to monitor a state of the stream [15]. Usually, this is
based on tracking the error of the classifier or measuring
the statistical properties of data. One of the first and most
popular drift detectors is Drift Detection Method (DDM) [16]
which analyzes the standard deviation of errors coming from
the underlying classifier. DDM assumes that the increase in
error rates directly corresponds to changes in the incoming
data stream and thus can be used to signal the presence of
drift. This concept was extended by Early Drift Detection
Method (EDDM) [17] by replacing the standard error deviation
with a distance between two consecutive errors. This makes
EDDM more reactive to slower, gradual changes in the stream,
at the cost of losing sensitivity to sudden drifts. Reactive
Drift Detection Method (RDDM) [18] is an improvement upon
DDM that allows detecting sudden and local changes under
access to a reduced number of instances. RDDM offers better
sensitivity than DDM by implementing a pruning mechanism
for discarding outdated instances. Adaptive Windowing (AD-
WIN) [19] is based on a dynamic sliding window that adjusts
its size according to the size of the stable concepts in the
stream. ADWIN stores two sub-windows for old and new
concepts, detecting a drift when mean values in these sub-
windows differ more than a given threshold. Drift Detection
Method based on Hoeffding's bounds (HDDM) [20] uses
the same Hoeffding bound as the SEED detector, but drops the idea of sub-
windows and focuses on measuring both false positive and
false negative rates. Fast Hoeffding Drift Detection Method
(FHDDM) [21] is yet another drift detector utilizing the pop-
ular Hoeffding’s inequality, but its novelty lies in measuring
the probability of correct decisions returned by the underlying
classifier. Wilcoxon Rank Sum Test Drift Detector (WSTD)
[22] uses the Wilcoxon rank-sum statistical test for comparing
distributions in sub-windows.
Drift detectors for imbalanced data streams. There are
almost no concept drift detectors dedicated to imbalanced data
streams, especially to multi-class ones. Most of the works in
this domain focus mainly on making the underlying classifier
skew-insensitive, while assuming it is going to adapt on its
own [23], or erroneously using standard drift detection meth-
ods. Two main dedicated drift detectors for skewed streams
are PerfSim [24] that monitors the changes in the entire
confusion matrix; and Drift Detection Method for Online Class
Imbalance (DDM-OCI) [25] that monitors recall for every
class.
Limitations of existing methods. Existing drift detectors
suffer from two major limitations: (i) lack of self-adaptation
mechanisms; and (ii) lack of robustness mechanisms. The
first problem is rooted in state-of-the-art drift detectors being
based on monitoring the selected properties of a stream while
neglecting the fact that used monitoring criteria should also
be adapted over time. The second problem is rooted in a
lack of research on effective drift detectors when a stream
is suffering from various data-level difficulties, such as class
imbalance or noise presence. In this work, we address those
two limitations with our RBM-IM, a fully trainable drift
detector that autonomously adapts its detection mechanisms,
while offering robustness to imbalanced class distributions.
IV. CHALLENGES IN LEARNING FROM MULTI-CLASS
IMBALANCED DATA STREAMS
In static scenarios there is a plethora of works devoted
to two-class imbalanced problems, but much less attention
is paid to a much more challenging multi-class imbalanced
setup [7]. The same carries over to the continual learning
from data streams, where most of the works focused on binary
streams [4]. This is highly limiting for many modern real-
world applications and thus there is a need to develop skew-
insensitive techniques that can handle multiple classes [26].
There is no single universal approach to how to view and
analyze multi-class imbalanced data streams. Therefore, we
propose a taxonomy of the most crucial problems that can be
encountered in this setting, creating three distinctive scenarios.
They cover various learning difficulties that affect one or more
classes and thus pose significant challenges for both drift
detectors and classifiers.
Fig. 1: Scenario 1 – global concept drift and dynamic imbalance ratio. Panels: (a) before drift, (b) first drift, (c) second drift.
Scenario 1: Global concept drift and dynamic imbalance
ratio. Here we assume that all classes are subject to a real
concept drift that will influence the decision boundaries. Ad-
ditionally, imbalance ratio among the classes changes together
with the drift occurrences. However, class roles remain static
and classes denoted as minority stay minority during the entire
stream processing. This scenario poses challenges to drift
detectors by varying the degree of changes in each class and
how they actually impact the decision boundaries. Changes in
minority classes may get overlooked due to detector bias towards
the majority ones, as detectors usually gather statistics over the
entire data stream. This is depicted in Fig. 1.
Fig. 2: Scenario 2 – global concept drift, dynamic imbalance ratio, and changing class roles. Panels: (a) before drift, (b) first drift, (c) second drift.
Scenario 2: Global concept drift, dynamic imbalance ratio,
and changing class roles. Here we extend Scenario 1 by
adding the third learning difficulty – changing class roles. Now
the imbalance ratio is subject to more significant changes and
as a result classes may switch roles – minority may become
majority and vice versa. This is especially challenging to track
in a multi-class case, where relationships among classes are
more complex. Drift detectors have difficulties with keeping
any reliable statistics coming from classes that rapidly change
their roles. This may lead to frequently switching bias towards
whichever class is currently the most frequent one. This is
depicted in Fig. 2.
Fig. 3: Scenario 3 – local concept drift, dynamic imbalance ratio, and changing class roles. Panels: (a) before drift, (b) first drift, (c) second drift.
Scenario 3: Local concept drift, dynamic imbalance ratio,
and changing class roles. This is the most challenging
scenario that retains dynamic imbalance ratio and changing
class roles from Scenario 2, but moves from global concept
drift to local one. That means in a given moment only a
subset of classes (or even a single one) may be affected by
a real concept drift, while the remaining ones are subject to
no changes or a virtual concept drift that does not impact
decision boundaries (see Sec. II). In such a setting we should
not only be able to tell if drift takes place but also which
classes are affected. It is a big step towards understanding the
dynamics of concept drift and offering classifier adaptation to
specific regions of decision space (leading to savings in time
and computational resources). This is the most challenging
scenario for concept drift detectors, as changes happening in
minority classes will remain unnoticed when a detector is
biased towards the majority class. This is depicted in Fig. 3.
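A toy generator can make Scenario 3 concrete. The sketch below is our own illustration (the class means, priors, and drift point are arbitrary choices, not the paper's benchmark generator): three Gaussian classes with evolving priors, a role swap between classes 0 and 1, and a real drift that affects only the minority class 2:

```python
import random

def make_scenario3_stream(n=3000, drift_at=1500, seed=7):
    """Toy Scenario 3 stream: dynamic imbalance ratio (class 0 shrinks while
    class 1 grows, so the roles swap), plus a local real drift hitting only
    the minority class 2 (its feature mean jumps at `drift_at`)."""
    rng = random.Random(seed)
    means = {0: 0.0, 1: 4.0, 2: 8.0}   # per-class feature means (illustrative)
    stream = []
    for j in range(n):
        t = j / n
        priors = [0.7 - 0.4 * t, 0.2 + 0.4 * t, 0.1]  # evolving class priors
        y = rng.choices([0, 1, 2], weights=priors)[0]
        mean = means[y]
        if y == 2 and j >= drift_at:                  # local drift: class 2 only
            mean = -4.0
        stream.append((rng.gauss(mean, 1.0), y))
    return stream

stream = make_scenario3_stream()
```

A global detector watching the pooled feature distribution of this stream would see almost no change at `drift_at`, since class 2 contributes roughly 10% of the instances; a per-class monitor sees a large shift.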
Real-world problems affected by multi-class imbalance
and concept drift. The three defined scenarios are not only
interesting from the theoretical point of view but also directly
transfer to a plethora of real-world applications. In cyber-
security, we deal with multiple types of attacks that appear
with varying frequencies (multi-class extremely imbalanced
problems). Some of those attacks will dynamically change
over time to bypass new security settings, while legal transac-
tions will not be affected by such concept drift. In computer
vision, target detection focuses on finding few specific targets,
differentiating them from the information coming from a
much bigger background. Targets may change their nature
over time, being subject to variations, or even camouflage.
In natural language processing, we must deal with constantly
evolving wording/slang utilized by various minority groups,
where changes in those groups will happen independently.
V. RESTRICTED BOLTZMANN MACHINE FOR IMBALANCED
DRIFT DETECTION
Overview of the proposed method. We introduce a novel
concept drift detector for multi-class imbalanced data streams,
implemented as a Restricted Boltzmann Machine (RBM-IM)
with robustness to skewed distributions leveraged via a
dedicated loss function. It is a fully trainable drift detector, capable
of autonomous adaptation to the current state of a stream,
imbalance ratios, and class roles, without relying on user-
defined thresholds.
A. Skew-insensitive Restricted Boltzmann Machine
RBM-IM neural network architecture. Restricted Boltzmann
Machines (RBMs) are generative two-layered neural
networks [27], constructed using the layer $\mathbf{v}$ of $V$ visible
neurons and the layer $\mathbf{h}$ of $H$ hidden neurons:
$$\mathbf{v} = [v_1, \cdots, v_V] \in \{0,1\}^V, \quad \mathbf{h} = [h_1, \cdots, h_H] \in \{0,1\}^H. \tag{6}$$
We deal with supervised continual learning from data
streams (as defined in Sec. II), thus we need to extend this
two-layer RBM architecture with a third layer $\mathbf{z}$ for class
representation. It is implemented as a continuous encoding,
meaning that each neuron in $\mathbf{z}$ will return its real-valued
support for each analyzed class (thus being responsible for
the classification process). By $\mathbf{m}_z$ we denote the vector of
RBM outputs with support returned by the $z$-th neuron for the
$m$-th class. This allows us to define $\mathbf{z}$, known also as the class
layer or the softmax layer:
$$\mathbf{z} = [z_1, \cdots, z_Z] \in \mathbf{m}_1, \cdots, \mathbf{m}_Z. \tag{7}$$
This class layer uses the softmax function to estimate the
probabilities of activation of each neuron in $\mathbf{z}$.
RBMs do not have connections between units in the same
layer, which holds for $\mathbf{v}$, $\mathbf{h}$, and $\mathbf{z}$. Neurons in the visible
layer $\mathbf{v}$ are connected with neurons in the hidden layer $\mathbf{h}$, and
neurons in $\mathbf{h}$ are connected with those in the class layer $\mathbf{z}$.
The weight assigned to the connection between the $i$-th visible
neuron $v_i$ and the $j$-th hidden neuron $h_j$ is denoted as $w_{ij}$,
while the weight assigned to the connection between the $j$-th
hidden neuron $h_j$ and the $k$-th class neuron $z_k$ is denoted as
$u_{jk}$. This is used to define the RBM energy function:
$$E(\mathbf{v}, \mathbf{h}, \mathbf{z}) = -\sum_{i=1}^{V} v_i a_i - \sum_{j=1}^{H} h_j b_j - \sum_{k=1}^{Z} z_k c_k - \sum_{i=1}^{V}\sum_{j=1}^{H} v_i h_j w_{ij} - \sum_{j=1}^{H}\sum_{k=1}^{Z} h_j z_k u_{jk}, \tag{8}$$
where $a_i$, $b_j$, and $c_k$ are biases introduced to $\mathbf{v}$, $\mathbf{h}$, and $\mathbf{z}$,
respectively. The energy formula $E(\cdot)$ for state $[\mathbf{v}, \mathbf{h}, \mathbf{z}]$ is used
to calculate the probability of the RBM being in a given state
(i.e., assuming certain weight values), using the Boltzmann
distribution:
$$P(\mathbf{v}, \mathbf{h}, \mathbf{z}) = \frac{\exp\left(-E(\mathbf{v}, \mathbf{h}, \mathbf{z})\right)}{F}, \tag{9}$$
where $F$ is a partition function allowing to normalize the
probability $P(\mathbf{v}, \mathbf{h}, \mathbf{z})$ to 1.
Hidden neurons in $\mathbf{h}$ are independent and use features given
by the visible layer $\mathbf{v}$. The activation probability of the $j$-th
hidden neuron $h_j$ can be calculated as follows:
$$P(h_j|\mathbf{v}, \mathbf{z}) = \frac{1}{1 + \exp\left(-b_j - \sum_{i=1}^{V} v_i w_{ij} - \sum_{k=1}^{Z} z_k u_{jk}\right)} = \sigma\left(b_j + \sum_{i=1}^{V} v_i w_{ij} + \sum_{k=1}^{Z} z_k u_{jk}\right), \tag{10}$$
where $\sigma(\cdot) = 1/(1 + \exp(-\cdot))$ stands for the sigmoid function.
The same assumption may be made for neurons in the
visible layer $\mathbf{v}$, when values of neurons in the hidden layer
$\mathbf{h}$ are known. This allows us to calculate the activation
probability of the $i$-th visible neuron as:
$$P(v_i|\mathbf{h}) = \frac{1}{1 + \exp\left(-a_i - \sum_{j=1}^{H} h_j w_{ij}\right)} = \sigma\left(a_i + \sum_{j=1}^{H} h_j w_{ij}\right), \tag{11}$$
where one must note that given $\mathbf{h}$, the activation probability of
neurons in $\mathbf{v}$ does not depend on $\mathbf{z}$. The activation probability
of the class layer (i.e., the decision which class the object should
be assigned to) is calculated using the softmax function:
$$P(\mathbf{z} = \mathbf{1}_k|\mathbf{h}) = \frac{\exp\left(c_k + \sum_{j=1}^{H} h_j u_{jk}\right)}{\sum_{l=1}^{Z} \exp\left(c_l + \sum_{j=1}^{H} h_j u_{jl}\right)}, \tag{12}$$
where $k \in [1, \cdots, Z]$ and $k \neq l$.
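The conditional probabilities of Eqs. 10-12 can be sketched directly in NumPy. This is a minimal illustration, not the authors' implementation: the layer sizes and random weight initialization are arbitrary, and we use the standard positive-logit softmax form for the class layer:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, Z = 6, 4, 3                        # layer sizes (illustrative)
W = rng.normal(0, 0.1, (V, H))           # visible-hidden weights w_ij
U = rng.normal(0, 0.1, (H, Z))           # hidden-class weights u_jk
a, b, c = np.zeros(V), np.zeros(H), np.zeros(Z)   # biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_vz(v, z):
    # Eq. 10: sigma(b_j + sum_i v_i w_ij + sum_k z_k u_jk)
    return sigmoid(b + v @ W + z @ U.T)

def p_v_given_h(h):
    # Eq. 11: sigma(a_i + sum_j h_j w_ij); independent of z given h
    return sigmoid(a + h @ W.T)

def p_z_given_h(h):
    # Eq. 12: softmax over the class neurons
    logits = c + h @ U
    e = np.exp(logits - logits.max())    # subtract max for numerical stability
    return e / e.sum()
```

The softmax normalization is what lets the class layer produce a real-valued support for every class at once, which RBM-IM later exploits for per-class monitoring.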
RBM training procedure. As RBM is a neural network
model, we may train it using minimization of a loss function
$L(\cdot)$ with any gradient descent method. A standard RBM most
commonly uses the negative log-likelihood of both external
layers $\mathbf{v}$ and $\mathbf{z}$. However, our RBM-IM architecture must be
designed to handle multiple imbalanced classes. Therefore, we
need to modify this loss function to make RBM-IM skew-
insensitive. We will achieve this by using the effective number
of samples approach [28] that measures the contributions of
instances in each class. This allows us to formulate a class-
balanced negative log-likelihood loss for RBM-IM:
$$L(\mathbf{v}, \mathbf{z}) = -\frac{1 - \beta}{1 - \beta^{x_m}} \log\left(P(\mathbf{v}, \mathbf{z})\right), \tag{13}$$
where $\beta^{x_m}$ stands for the contribution of the $x$-th instance to the
$m$-th class. By taking each independent weight $w_{ij}$, we may
now calculate the gradient of the loss function:
$$\nabla L(w_{ij}) = \frac{\delta L(\mathbf{v}, \mathbf{z})}{\delta w_{ij}} = \sum_{\mathbf{v}, \mathbf{h}, \mathbf{z}} P(\mathbf{v}, \mathbf{h}, \mathbf{z}) v_i h_j - \sum_{\mathbf{h}} P(\mathbf{h}|\mathbf{v}, \mathbf{z}) v_i h_j. \tag{14}$$
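The class-balancing term of Eq. 13 follows the effective number of samples scheme from [28]. A minimal sketch of the per-class weights (the function name and dictionary interface are ours; here the weight is driven by the class sample count, as in [28]):

```python
def class_balanced_weights(class_counts, beta=0.999):
    """Class-balanced weighting via the effective number of samples [28]:
    a class with n samples has effective number (1 - beta^n) / (1 - beta),
    so class m receives weight (1 - beta) / (1 - beta^n_m). Rare classes get
    larger weights, which is what makes the loss skew-insensitive."""
    return {m: (1 - beta) / (1 - beta ** n) for m, n in class_counts.items()}

# Illustrative counts: the minority class ends up with a much larger weight.
weights = class_balanced_weights({"majority": 10000, "minority": 100})
```

With beta approaching 1 the weights tend towards inverse class frequency, while beta = 0 weighs all classes equally, so beta controls how aggressively the loss compensates for skew.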
This equation allows us to calculate the loss function gradient
for a single instance. However, as we use RBM as a drift
detector, we must be able to capture the evolving properties of
a data stream. If we based our change detection on variations
induced by a single new instance, we would be highly sensitive
to even the smallest noise ratio. Therefore, our RBM-based
drift detector must be able to work with a batch of the most
recent instances in order to capture the current stream characteristics.
We propose to define the RBM-IM model for learning
on mini-batches of instances. This will offer a significant
speed-up when compared to traditional batch learning used in
data streams. For a mini-batch of $n$ instances arriving at time $t$,
$M_t = \{x^t_1, \cdots, x^t_n\}$, we can rewrite the gradient from Eq. 14
using expected values with the loss function:
$$\frac{\delta L(M_t)}{\delta w_{ij}} = E_{model}[v_i h_j] - E_{data}[v_i h_j], \tag{15}$$
where Edata is the expected value over the current mini-batch
of instances and Emodel is the expected value from the current
state of RBM-IM. Of course, we cannot directly trace the
value of $E_{model}$ (as this would require immediate oracle
access to the ground truth), therefore we must approximate it
using Contrastive Divergence with $k$ Gibbs sampling steps to
reconstruct the input data (CD-$k$):
$$\frac{\delta L(M_t)}{\delta w_{ij}} \approx E_{recon}[v_i h_j] - E_{data}[v_i h_j]. \tag{16}$$
After processing the $t$-th mini-batch $M_t$, we can update
the weights in RBM-IM using any gradient descent method as
follows:
$$w^{t+1}_{ij} = w^t_{ij} - \eta\left(E_{recon}[v_i h_j] - E_{data}[v_i h_j]\right), \tag{17}$$
where $\eta$ stands for the learning rate of the RBM-IM neural
network (responsible for the speed of model update and
forgetting of old information). The way to update the $a_i$, $b_j$,
and $c_k$ biases, as well as the weights $u_{jk}$, is analogous to Eq. 17
and can be expressed as:
$$a^{t+1}_i = a^t_i - \eta\left(E_{recon}[v_i] - E_{data}[v_i]\right), \tag{18}$$
$$b^{t+1}_j = b^t_j - \eta\left(E_{recon}[h_j] - E_{data}[h_j]\right), \tag{19}$$
$$c^{t+1}_k = c^t_k - \eta\left(E_{recon}[z_k] - E_{data}[z_k]\right), \tag{20}$$
$$u^{t+1}_{jk} = u^t_{jk} - \eta\left(E_{recon}[h_j z_k] - E_{data}[h_j z_k]\right). \tag{21}$$
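A single CD-1 update of the visible-hidden weights (Eqs. 16-17) might be sketched as follows. This is a simplified illustration, not the authors' implementation: the helper functions `h_probs_fn`/`v_probs_fn` (standing in for Eqs. 10-11) and the mini-batch shape are our assumptions, and the class layer is omitted for brevity:

```python
import numpy as np

def cd1_update(W, v_data, h_probs_fn, v_probs_fn, eta=0.05, rng=None):
    """One CD-1 step (k = 1 Gibbs step) over a mini-batch, for the
    visible-hidden weights only; the bias and u_jk updates (Eqs. 18-21)
    follow the same E_recon - E_data pattern."""
    rng = rng or np.random.default_rng(0)
    h_data = h_probs_fn(v_data)                      # P(h|v) on the data
    h_sample = (rng.random(h_data.shape) < h_data) * 1.0
    v_recon = v_probs_fn(h_sample)                   # one Gibbs step back
    h_recon = h_probs_fn(v_recon)
    E_data = v_data.T @ h_data / len(v_data)         # E_data[v_i h_j]
    E_recon = v_recon.T @ h_recon / len(v_data)      # E_recon[v_i h_j]
    return W - eta * (E_recon - E_data)              # Eq. 17
```

Because the update only needs the current mini-batch and the model's own reconstruction, the detector can keep re-training itself online, which is what lets it track evolving imbalance ratios and class roles.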
B. Drift detection with RBM-IM
While RBM-IM is a skew-insensitive generative neural
network model, we can use it as an explicit drift detector.
The RBM-IM model stores compressed characteristics of the
distribution of data it was trained on. By using any similarity
measure between the data prototypes and properties of newly
arrived instances, one may evaluate if there are any changes
in the distribution. This allows us to use RBM-IM as a drift
detector. Our model uses an embedded similarity measure for
monitoring the state of a stream and the level to which the
newly arrived instances differ from the previously observed
concepts. RBM-IM tracks the similarity measure for every
single class independently, using the class layer continuous
outputs. RBM-IM is a fully trainable and self-adaptive drift
detector, capable not only of capturing the trends of changes in
each class independently (versus state-of-the-art drift detectors
that monitor changes in all classes with an aggregated mea-
sure), but also of learning and adapting to the current state of a
stream, class imbalance ratios, and class roles. This makes it a
highly attractive approach for handling multi-class imbalanced
streams with various learning difficulties discussed in Sec. IV.
Measuring data similarity. In order to evaluate the similarity
of newly arrived instances to the old concepts stored in RBM-IM,
we will use the reconstruction error metric. We can calculate it
online for each new instance, by inputting a newly arrived
$d$-dimensional instance $S_n = [x^n_1, \cdots, x^n_d, y^n]$ into the $\mathbf{v}$ layer
of the RBM. Then the values of neurons in $\mathbf{v}$ are calculated to
reconstruct the feature values. Finally, the class layer $\mathbf{z}$ is activated
and used to reconstruct the class label. This allows us to keep
track of the reconstruction error for each class independently,
offering per-class drift detection capabilities. We can denote
the reconstructed vector for the $m$-th class as:
$$\tilde{S}^m_n = [\tilde{x}^n_1, \cdots, \tilde{x}^n_d, \tilde{y}^n_1, \cdots, \tilde{y}^n_Z], \tag{22}$$
where the reconstructed vector features and labels are taken
from probabilities calculated using the hidden layer:
$$\tilde{x}^n_i = P(v_i|\mathbf{h}), \tag{23}$$
$$\tilde{y}^n_k = P(z_k|\mathbf{h}). \tag{24}$$
The $\mathbf{h}$ layer is taken from the conditional probability, in which
the $\mathbf{v}$ layer is identical to the input instance:
$$\mathbf{h} \sim P(\mathbf{h}|\mathbf{v} = \mathbf{x}^n, \mathbf{z} = \mathbf{1}_{y^n}). \tag{25}$$
This allows us to write the reconstruction error in the form of
the mean squared error between the true and reconstructed
instance for the $m$-th class:
$$R(S^m_n) = \sqrt{\sum_{i=1}^{d}\left(x^n_i - \tilde{x}^n_i\right)^2 + \sum_{k=1}^{Z}\left(\mathbf{1}_{y^n_k} - \tilde{y}^n_k\right)^2}. \tag{26}$$
For the purpose of obtaining a stable concept drift detector,
we do not look for a change in distribution over a single
instance, but for the change over the newly arriving mini-
batch of instances. Therefore, we need to calculate the average
reconstruction error over the recent mini-batch of data for the
$m$-th class:
$$R(M^m_t) = \frac{1}{n} \sum_{m=1}^{n} R(x^t_m). \tag{27}$$
Adapting reconstruction error to drift detection. In order to
make the reconstruction error a practical measure for detecting
the presence of concept drift, we propose to measure the
evolution of this measure (i.e., its trends) over arriving mini-
batches of instances. The analysis of the trends is done for
each class independently, allowing us to effectively detect
local concept drifts. We achieve this by using the well-known
sliding window technique that will move over the arriving
mini-batches. Let us denote the trend of reconstruction error
for the $m$-th class over time as $Q_r(t)^m$ and calculate it using
the following equation:
$$Q_r(t)^m = \frac{\bar{n}_t \overline{TR}_t - \bar{T}_t \bar{R}_t}{\bar{n}_t \overline{T^2}_t - \left(\bar{T}_t\right)^2}. \tag{28}$$
The trend over time can be computed using a simple linear
regression, with the terms in Eq. 28 being simply sums over
time as follows:
$$\overline{TR}_t = \overline{TR}_{t-1} + t\,R(M^m_t), \tag{29}$$
$$\bar{T}_t = \bar{T}_{t-1} + t, \tag{30}$$
$$\bar{R}_t = \bar{R}_{t-1} + R(M^m_t), \tag{31}$$
$$\overline{T^2}_t = \overline{T^2}_{t-1} + t^2, \tag{32}$$
where $\overline{TR}_0 = 0$, $\bar{T}_0 = 0$, $\bar{R}_0 = 0$, and $\overline{T^2}_0 = 0$. We capture
those statistics for each class using a sliding window of size
$W$. Instead of using a manually set size, which is inefficient
for drifting data streams, we propose to use a self-adaptive
window size [19]. This eliminates the need for manual tuning
of the window size that is used for drift detection. To allow
flexible learning from various sizes of mini-batches, we must
consider the case where $t > W$. Here, we must compute the
terms for the trend regression using the following equations:
$$\overline{TR}_t = \overline{TR}_{t-1} + t\,R(M^m_t) - (t - W)\,R(M^m_{t-W}), \tag{33}$$
$$\bar{T}_t = \bar{T}_{t-1} + t - (t - W), \tag{34}$$
$$\bar{R}_t = \bar{R}_{t-1} + R(M^m_t) - R(M^m_{t-W}), \tag{35}$$
$$\overline{T^2}_t = \overline{T^2}_{t-1} + t^2 - (t - W)^2. \tag{36}$$
The required number of instances $\bar{n}_t$ used to compute the trend
$Q_r(t)^m$ for the $m$-th class at time $t$ is given as follows:
$$\bar{n}_t = \begin{cases} t, & \text{if } t \leq W \\ W, & \text{if } t > W. \end{cases} \tag{37}$$
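The incremental trend computation of Eqs. 28-37 can be sketched as a small sliding-window tracker. This is our own illustration of the update rules, using a fixed window size W rather than the self-adaptive size from [19]:

```python
from collections import deque

class TrendTracker:
    """Sliding-window least-squares slope of the per-class reconstruction
    error (Eqs. 28-37): maintains the running sums TR, T, R, T^2 over the
    last W mini-batches and returns the regression slope Q_r(t)."""

    def __init__(self, W=50):
        self.W = W
        self.window = deque()    # (t, R(M_t)) pairs currently inside the window
        self.TR = self.T = self.R = self.T2 = 0.0

    def update(self, t, r):
        self.window.append((t, r))
        self.TR += t * r         # Eq. 29 / 33
        self.T += t              # Eq. 30 / 34
        self.R += r              # Eq. 31 / 35
        self.T2 += t * t         # Eq. 32 / 36
        if len(self.window) > self.W:        # drop the (t - W)-th contribution
            t_old, r_old = self.window.popleft()
            self.TR -= t_old * r_old
            self.T -= t_old
            self.R -= r_old
            self.T2 -= t_old * t_old
        n = len(self.window)                 # Eq. 37: n = min(t, W)
        denom = n * self.T2 - self.T ** 2
        return 0.0 if denom == 0 else (n * self.TR - self.T * self.R) / denom
```

One tracker would be kept per class, so each class contributes its own trend series regardless of how few instances it currently receives.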
Drift detection. Eq. 28 above allows us to compute the
trends for every analyzed mini-batch of data. In order to detect
the presence of drift, we need the capability of checking whether
the new mini-batch differs significantly from the previous one
for each analyzed class. Our RBM-IM uses the Granger causality
test [29] on trends from subsequent mini-batches of data for
each class, $Q_r(M^m_t)$ and $Q_r(M^m_{t+1})$. This is a statistical test
that determines whether one trend is useful in forecasting
another. As we deal with non-stationary processes, we perform
the variation of the Granger causality test based on first differences
[30]. If the hypothesis is accepted, a Granger-causality
relationship is assumed to exist between $Q_r(M^m_t)$ and
$Q_r(M^m_{t+1})$, which means there is no concept drift in the $m$-th
class. If the hypothesis is rejected, RBM-IM signals the
presence of concept drift in the $m$-th class.
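The first-difference Granger test can be illustrated with a minimal F-statistic computation. The sketch below is a generic textbook version (OLS on lagged first differences), not the authors' implementation, and it omits the final p-value lookup against the F distribution:

```python
import numpy as np

def lag_matrix(s, lags):
    # column k holds the series shifted by (k + 1) steps: s[t-(k+1)] per target s[t]
    return np.column_stack([s[lags - 1 - k: len(s) - 1 - k] for k in range(lags)])

def rss(X, target):
    # residual sum of squares of an OLS fit with intercept
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, target, rcond=None)
    resid = target - X1 @ beta
    return float(resid @ resid)

def granger_f_stat(x, y, lags=2):
    """Does the history of x improve a linear autoregressive prediction of y?
    Works on first differences, as the non-stationary variant requires."""
    dx, dy = np.diff(np.asarray(x, float)), np.diff(np.asarray(y, float))
    target = dy[lags:]
    X_restricted = lag_matrix(dy, lags)                             # y's own lags
    X_full = np.column_stack([X_restricted, lag_matrix(dx, lags)])  # + x's lags
    n = len(target)
    rss_r, rss_u = rss(X_restricted, target), rss(X_full, target)
    # F test for the joint significance of the x-lag coefficients
    return ((rss_r - rss_u) / lags) / (rss_u / (n - 2 * lags - 1))
```

A large F statistic (small p-value) means the hypothesis of no Granger causality is rejected; in RBM-IM's usage, rejection on a class's trend series is what triggers the per-class drift signal.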
VI. EXPERIMENTAL STUDY
In this section we present the experimental study used to
evaluate the quality of RBM-IM. It was carefully designed to
offer an in-depth analysis of the proposed method and gain
insights into its behavior in various multi-class imbalanced
data stream scenarios. We tailored this study to answer the
following research questions.
•RQ1: Does RBM-IM offer better concept drift detection
than state-of-the-art drift detectors designed for standard
data streams?
•RQ2: Does RBM-IM offer better concept drift detection
than state-of-the-art skew-insensitive drift detectors de-
signed for imbalanced data streams?
•RQ3: What is the capability of RBM-IM to detect local
drifts that affect a subset of minority classes?
•RQ4: What robustness to increasing imbalance ratio is
offered by RBM-IM?
All methods and experiments were implemented in the MOA
environment [31] and run on an Intel Core i7-8365U CPU with
64 GB of DDR4 RAM.
A. Data stream benchmarks
For the purpose of this experimental study, we selected 24
benchmark data streams: 12 come from real-world domains
and 12 were generated artificially using the MOA environ-
ment [31]. Such a diverse mix allowed us to evaluate the
effectiveness of RBM-IM over a plethora of scenarios. Using
artificial data streams allows us to control the specific nature
of drift and class imbalance, as well as to inject local concept
drift into selected minority classes. Artificial data streams use
a dynamic imbalance ratio that both increases and decreases
over time. Real-world streams offer challenging problems that
are characterized by a mix of different learning difficulties.
Properties of the data stream benchmarks are given in Tab. I.
We report the highest imbalance ratio among all the classes,
i.e., the ratio between the biggest and the smallest class.
TABLE I: Properties of real-world (top) and artificial (bottom)
imbalanced data stream benchmarks.
Dataset Instances Features Classes IR Drift
Activity-Raw 1 048 570 3 6 128.93 yes
Connect4 67 557 42 3 45.81 unknown
Covertype 581 012 54 7 96.14 unknown
Crimes 878 049 3 39 106.72 unknown
DJ30 138 166 8 30 204.66 yes
EEG 14 980 14 2 29.88 yes
Electricity 45 312 8 2 17.54 yes
Gas 13 910 128 6 138.03 yes
Olympic 271 116 7 4 66.82 unknown
Poker 829 201 10 10 144.00 yes
IntelSensors 2 219 804 5 57 348.26 yes
Tags 164 860 4 11 194.28 unknown
Aggrawal5 1 000 000 20 5 50.00 incremental
Aggrawal10 1 000 000 40 10 80.00 incremental
Aggrawal20 2 000 000 80 20 100.00 incremental
Hyperplane5 1 000 000 20 5 100.00 gradual
Hyperplane10 1 000 000 40 10 200.00 gradual
Hyperplane20 2 000 000 80 20 300.00 gradual
RBF5 1 000 000 20 5 100.00 sudden
RBF10 1 000 000 40 10 200.00 sudden
RBF20 2 000 000 80 20 300.00 sudden
RandomTree5 1 000 000 20 5 100.00 sudden
RandomTree10 1 000 000 40 10 200.00 sudden
RandomTree20 2 000 000 80 20 300.00 sudden
B. Setup
Reference concept drift detectors. As reference methods to
the proposed RBM-IM, we have selected three state-of-the-art
concept drift detectors for standard data: WSTD [22], RDDM
[18], and FHDDM [21]; as well as two state-of-the-art drift
detectors for imbalanced data streams: PerfSim and DDM-
OCI. Parameters of all the six drift detectors are given in
Tab. II.
Parameter tuning. In order to offer a fair and thorough
comparison, we performed parameter tuning for every drift
detector and for every data stream benchmark. As we deal with
a streaming scenario, we used self hyper-parameter tuning [32]
that is based on the online Nelder-Mead optimization.
Base classifier. In order to ensure fairness when comparing the
examined drift detectors they all use Adaptive Cost-Sensitive
Perceptron Trees [33] as a base classifier. This is a skew-
insensitive and efficient classifier capable of handling both
TABLE II: Examined drift detectors and their parameters.
Abbr. Name Parameters
WSTD [22] Wilcoxon Rank Sum Test sliding window size ω∈ {25,50,75,100}
Drift Detection warning significance αw∈ {0.01,0.03,0.05,0.07}
drift significance αd∈ {0.001,0.003,0.005,0.007}
max. no of old instances min ∈ {1000,2000,3000,4000}
RDDM [18] Reactive Drift Detection warning threshold αw∈ {0.90,0.92,0.95,0.98}
drift threshold αd∈ {0.80,0.85,0.90,0.95}
min. no. of errors e∈ {10,30,50,70}
min. no. of instances min ∈ {3000,5000,7000,9000}
max. no. of instances max ∈ {10000,20000,30000,40000}
warning limit wL ∈ {800,1000,1200,1400}
FHDDM [21] Fast Hoeffding Drift Detection sliding window size ω∈ {25,50,75,100}
allowed error δ∈ {0.000001,0.00001,0.0001,0.001}
PerfSim [24] Performance Similarity differentiation weights λ∈ {0.1,0.2,0.3,0.4}
min. no. of errors n∈ {10,30,50,70}
DDM–OCI [25] Drift Detection Method warning threshold αw∈ {0.90,0.92,0.95,0.98}
for online class imbalance drift threshold αd∈ {0.80,0.85,0.90,0.95}
min. no. of errors e∈ {10,30,50,70}
RBM-IM RBM Drift Detection mini–batch size M∈ {25,50,75,100}
for imbalanced data streams visible neurons V=no. of features
hidden neurons H∈ {0.25V,0.5V,0.75V,V}
class neurons Z=no. of classes
learning rate η∈ {0.01,0.03,0.05,0.07}
Gibbs sampling steps k∈ {1,2,3,4}
binary and multi-class imbalanced data streams, but is strongly
dependent on an attached concept drift detection component.
Therefore, it offers an excellent backbone for our experiments,
allowing us to directly measure how a given drift detector
impacts the classification quality.
RBM-IM training. Our drift detector uses the first instance
batch to train itself at the beginning of the stream processing. It
continuously updates itself in an online fashion together with
the base classifier.
Evaluation metrics. As we deal with multi-class imbalanced
and drifting data streams, we evaluated the examined algo-
rithms using prequential multi-class AUC [25] and prequential
multi-class G-mean [34].
Windows. We used a window size W= 1000 for calculating
the prequential metrics. The ADWIN self-adapting window was
used for both RBM-IM and the reference drift detectors to
alleviate the need for manual window size tuning [34].
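A sketch of a prequential multi-class G-mean computed over such a sliding window, assuming pmGM is the geometric mean of per-class recalls over the last W test-then-train pairs (the exact definition in [34] may differ in details):

```python
from collections import deque

def prequential_gmean(y_true, y_pred, window=1000):
    """Prequential multi-class G-mean: after each test-then-train step,
    the geometric mean of per-class recalls over the last `window` pairs."""
    buf = deque(maxlen=window)
    scores = []
    for true, pred in zip(y_true, y_pred):
        buf.append((true, pred))
        recalls = []
        for c in {t for t, _ in buf}:        # classes seen inside the window
            hits = sum(1 for t, p in buf if t == c and p == c)
            total = sum(1 for t, _ in buf if t == c)
            recalls.append(hits / total)
        gmean = 1.0
        for r in recalls:                    # geometric mean of the recalls
            gmean *= r
        scores.append(gmean ** (1.0 / len(recalls)))
    return scores
```

Because the geometric mean is zero whenever any class recall is zero, the metric heavily penalizes classifiers that ignore minority classes, which is why it suits skewed streams.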
Statistical analysis. We used the Friedman ranking test with
Bonferroni-Dunn post-hoc analysis and the Bayesian signed test [35]
for statistical significance over multiple comparisons, with
significance level α= 0.05.
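The ranking procedure behind this test can be sketched as below: average ranks per method (as in the ranks row of Tab. III) and the Friedman chi-square statistic over a datasets-by-methods score matrix. This is a minimal numpy illustration that does not average tied ranks; the study presumably relies on a standard statistical implementation.

```python
import numpy as np

def friedman_ranks(results):
    """Average rank of each method over datasets (rank 1 = highest score).
    Ties are not averaged in this sketch."""
    results = np.asarray(results, dtype=float)
    ranks = (-results).argsort(axis=1).argsort(axis=1) + 1
    return ranks.mean(axis=0)

def friedman_statistic(results):
    """Friedman chi-square statistic for N datasets and k methods."""
    results = np.asarray(results, dtype=float)
    n, k = results.shape
    rj = friedman_ranks(results)
    return 12.0 * n / (k * (k + 1)) * (np.sum(rj ** 2) - k * (k + 1) ** 2 / 4.0)
```

If the statistic exceeds the chi-square critical value, at least one method differs significantly, and the Bonferroni-Dunn post-hoc compares rank differences against a critical distance.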
Drift injection. For experiment 2, we inject local concept drift
starting with the smallest minority class and then add classes
according to their increasing size. This allows us to consider
the most difficult scenarios, where the smallest classes are affected
by the local concept drift and are thus most likely to be neglected.
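The class ordering used for this injection can be sketched as below; `classes_for_local_drift` is an illustrative helper name, not part of the actual experimental code:

```python
from collections import Counter

def classes_for_local_drift(labels, k):
    """Return the k smallest classes by instance count, so injected local
    drift starts with the smallest minority class and moves up in size."""
    counts = Counter(labels)
    smallest_first = sorted(counts, key=lambda c: (counts[c], c))
    return smallest_first[:k]
```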
C. Experiment 1: Drift detectors comparison
The first experiment was designed to analyze the behavior
of the six examined drift detectors under two different metrics
measured on all 24 benchmark data streams. This will allow
us to evaluate how competitive is RBM-IM as compared
with the state-of-the-art reference methods. Results according
to pmAUC and pmGM are given in Tab. III, while Fig. 4
and 5 depict the outcomes of the post-hoc statistical tests
of significance. Fig. 6 and 7 present visualizations of the
Bayesian signed test for pairwise comparisons with two best
performing reference detectors.
Comparison with standard drift detectors. The standard
drift detectors return unsatisfactory performance for all of the
Fig. 4: The Bonferroni-Dunn test (pmAUC).

Fig. 5: The Bonferroni-Dunn test (pmGM).
examined multi-class imbalanced data streams. This shows
that the metrics they collect are unsuitable for monitoring
skewed data streams. It also indicates that drift detectors,
despite not being trainable models themselves, are still prone to
class imbalance. Despite the fact that the underlying classifier
was designed for imbalanced data streams, it could not
offer accurate predictions when being fed incorrect information
by the drift detectors. Especially in the case of data
sets with a high number of classes (such as Crimes, DJ30,
IntelSensors, or the artificial ones), standard drift detectors
returned performance only slightly above a random guess. Those
detectors were not capable of capturing changes affecting
multiple class distributions and imbalance ratios at the same
time. RBM-IM alleviated those limitations, while displaying
comparable computational complexity.
Answer to RQ1: Yes, RBM-IM offers significant improve-
ments over standard drift detectors when applied to monitoring
multi-class imbalanced data streams. Standard detectors can-
not handle both a high number of classes and simultaneous
changes in distributions and imbalance ratios. This shows that
we need to have dedicated drift detectors for such difficult
scenarios.
Comparison with skew-insensitive drift detectors. Skew-
insensitive detectors performed significantly better when com-
pared with their standard counterparts. However, for most of
the real-world benchmarks and for all the artificial ones they
still could not compete with RBM-IM. The only four data sets
on which they returned a slightly better performance were
EEG, Electricity, Gas, and Tags. All of them are relatively
small and have a low number of classes. Especially the former
factor might have had a strong impact on RBM-IM. As this
is a trainable drift detector, it probably suffered from the
problem of underfitting when learning from small data streams.
This could be potentially alleviated by combining RBM-
IM with transfer learning or instance exploitation techniques,
which we will investigate in our future works. For all the
remaining 20 data stream benchmarks, RBM-IM outperformed
both PerfSim and DDM-OCI in a statistically significant manner.
This can be attributed to the compressed information
about the current concept of each class stored within the
RBM-IM structure, which allowed for a significantly more
informative analysis of the changing properties of incoming
instances.
TABLE III: Results according to pmAUC, pmGM and update times per batch for the examined concept drift detectors.
Dataset pmAUC pmGM
WSTD RDDM FHDDM PerfSim DDM–OCI RBM-IM WSTD RDDM FHDDM PerfSim DDM–OCI RBM-IM
Activity-Raw 45.43 46.23 48.45 72.81 74.29 79.92 51.06 54.10 55.82 76.11 78.59 82.04
Connect4 54.19 53.48 55.27 64.19 69.10 75.04 55.03 55.39 56.29 66.08 70.21 77.92
Covertype 33.19 34.12 35.72 41.24 40.58 53.98 32.45 33.10 35.98 40.19 41.02 54.02
Crimes 19.93 20.04 22.11 28.56 30.02 64.59 21.88 23.92 26.01 30.99 32.07 69.58
DJ30 26.94 25.98 26.02 34.11 33.98 59.04 27.45 27.11 28.73 36.71 35.48 61.29
EEG 58.14 59.98 62.29 70.08 74.22 72.03 59.85 60.98 64.67 72.93 77.29 74.13
Electricity 68.94 72.10 73.45 80.04 83.20 79.39 70.45 75.90 77.28 83.92 85.44 81.99
Gas 48.83 47.23 46.92 63.59 67.54 64.20 50.05 49.54 49.17 65.98 70.02 66.13
Olympic 72.98 70.34 74.53 80.08 83.19 87.01 73.95 71.91 76.02 83.19 86.88 89.24
Poker 72.11 69.65 72.98 84.65 87.91 91.03 74.46 70.97 74.52 87.11 89.34 93.06
IntelSensors 9.45 11.45 13.99 36.23 37.08 58.10 10.02 13.01 14.38 37.82 38.03 60.39
Tags 30.45 28.67 29.45 42.68 40.18 39.04 33.10 30.08 31.14 45.28 43.21 41.02
Aggrawal5 78.34 77.45 80.41 84.92 88.34 90.38 77.19 79.02 80.93 85.99 90.02 93.01
Aggrawal10 70.12 68.34 70.23 74.99 78.32 88.02 71.04 70.16 71.88 75.38 79.14 90.49
Aggrawal20 55.62 56.23 58.93 65.76 66.98 83.87 56.45 57.22 59.39 66.28 67.57 85.09
Hyperplane5 62.05 63.66 62.07 70.45 73.98 75.06 65.39 67.20 66.14 74.82 78.05 81.80
Hyperplane10 53.56 54.37 54.02 63.74 66.59 72.30 56.93 59.14 57.92 66.72 70.56 78.03
Hyperplane20 40.04 38.45 42.19 50.10 57.67 66.48 42.06 41.99 40.86 52.19 59.37 68.27
RBF5 80.18 78.56 82.40 90.48 92.36 92.78 83.47 81.59 84.99 92.12 94.82 94.97
RBF10 69.45 67.84 73.29 82.19 84.48 88.82 72.19 70.48 76.44 85.11 87.81 90.26
RBF20 53.18 52.88 54.01 70.24 71.93 83.08 55.98 54.90 57.73 73.89 74.84 85.30
RandomTree5 45.29 47.21 47.93 58.90 64.32 67.98 46.12 48.52 49.11 60.05 66.30 69.93
RandomTree10 31.63 33.19 35.02 50.02 53.87 63.01 32.79 33.90 36.14 51.58 55.20 64.97
RandomTree20 19.83 20.04 21.38 36.29 43.22 59.42 20.02 20.88 22.94 38.01 44.87 60.33
ranks 5.46 4.78 3.84 2.97 2.56 1.39 5.80 5.05 4.15 2.45 2.29 1.26
Avg. test time [s] 17.26±3.11 18.11±4.72 16.54±2.98 8.92±3.07 9.78±4.14 6.28±1.08
Avg. update time [s] 0.02±0.01 0.08±0.02 0.11±0.05 19.83±6.98 18.54±7.82 12.22±0.92
Fig. 6: Visualizations of the Bayesian signed test for comparison between PerfSim and RBM-IM for pmAUC (left) and pmGM (right).
Fig. 7: Visualizations of the Bayesian signed test for comparison between DDM-OCI and RBM-IM for pmAUC (left) and pmGM (right).
Answer to RQ2: Yes, RBM-IM is capable of outperforming
state-of-the-art skew-insensitive drift detectors, while addi-
tionally offering faster detection and update times. This is
especially visible on data sets with a high number of classes,
where monitoring simple performance measures is not enough
to accurately and timely detect occurrences of drifts. Additionally,
by being a trainable detector, RBM-IM can better adapt
to changes in data streams, allowing a fine-tuned encapsulation
of what is currently considered a temporal concept.
D. Experiment 2: Detection of local concept drifts
This experiment was designed to understand if and how the
examined drift detectors can handle the appearance of local
concept drifts on top of changing imbalance ratios and class
roles (see Sec. III, Scenario 3, for more details). We carried out this
experiment only on artificial benchmarks, as they allowed us to
directly inject concept drift into a selected number of classes.
We evaluated how the performance of the drift detectors changes
as the number of classes affected by the concept drift decreases.
For each of the 12 benchmark data streams,
we created scenarios where from 1 to M classes are
affected by the drift, the M case standing for every single class
in the stream being subject to the concept drift. Fig. 8 depicts
the behavior of all six drift detectors under various levels
of the local concept drift for the pmAUC metric. We do not
show plots for pmGM, as they have very similar characteristics
and would not provide any additional insights. Please note that
the smaller the number of classes subject to concept drift,
the more difficult its detection becomes.
Comparison with standard drift detectors. Unsurprisingly,
standard detectors completely failed when facing the task of
local drift detection. When the number of classes subject to
concept drift dropped below 80%, we could see significant
drops in their pmAUC. When the number of affected classes
dropped below 50%, all three detectors started to completely
ignore the presence of any drift. This crucially impacted
[Fig. 8 panels omitted: pmAUC [%] vs. no. of classes with drift for Aggrawal5/10/20, Hyperplane5/10/20, RBF5/10/20, and RandomTree5/10/20, with curves for WSTD, RDDM, FHDDM, PerfSim, DDM-OCI, and RBM-IM.]
Fig. 8: Relationship between pmAUC and the number of
classes affected by the local drift for the artificial benchmarks.
The lower the number of classes subject to concept drift, the
more difficult its detection.
the underlying classifier that lost any adaptation capabilities,
as drift detectors were never signaling any change being
present. Such results clearly support our earlier statement
that standard drift detectors cannot handle local changes, as
the statistics they monitor relate to the entire stream, not to
specific classes. Furthermore, in the case of imbalanced multi-class
drifting streams, the underlying bias toward the majority class
had a strong impact on those statistics. This damaged the
reactivity of those detectors to an even greater degree, as
changes happening in minority classes were obscured by static
properties of the majority class.
Comparison with skew-insensitive drift detectors. This
experiment showed the weak side of the skew-insensitive
drift detectors published so far. While they can display some
robustness to changing class ratios and global concept drift,
they did not perform significantly better than standard detec-
tors when facing local drifts. For more than 90% of classes
being affected by drift, both PerfSim and DDM-OCI returned
satisfactory performance. Their quality started degrading when
less than 70% of classes were being affected, reaching the
lowest plateau for less than 30% of classes being affected. This
shows that, despite monitoring some performance metrics for
each class (e.g., DDM-OCI monitors recall), these detectors
do not extract strong enough properties of those classes to
properly detect local drifts. Only when the majority of classes
become subject to concept drift can those detectors pick up
local changes.
RBM-IM sensitivity to local drifts. RBM-IM displayed an
excellent sensitivity to local drifts, even when they affected
only a single class. This observation holds for any data set,
any imbalance ratio, and any total number of classes. This
can be attributed to the effectiveness of the reconstruction
error, used as a change detection metric, combined with storing
compressed information about each class independently and
being able to compare the reconstruction error of each class
individually. This allows RBM-IM to detect local drifts that
at a given moment affect any number of classes.
Answer to RQ3: RBM-IM is the only drift detector among the
examined ones that can correctly detect local concept drifts,
even when they affect only a single minority class. This allows
one to gain a better understanding of the exact nature of
the changes affecting the data stream and of which classes should
be more carefully analyzed to discover useful knowledge. This
capability of RBM-IM to offer global and local concept drift
detection at the same time is a crucial step towards
explainable drift detection and gaining deeper insights into the
dynamics behind data streams, especially imbalanced ones.
E. Experiment 3: Robustness to changing imbalance ratio
The third experiment was designed for evaluating the ro-
bustness of the examined drift detectors to changing imbalance
ratio, especially for extremely imbalanced cases (IR >400).
This will allow us to test the flexibility and trustworthiness
of skew-insensitive mechanisms used in the detectors and to
see how reliable they are. For each of the 12 benchmark data
streams, we created scenarios in which we generated imbalance
ratios varying from 50 to 500. Fig. 9 depicts the behavior of
the six drift detectors under various levels of class imbalance
for the pmAUC metric. We do not show plots for pmGM as,
analogously to the previous experiment, they have very similar
characteristics.
Analyzing robustness to changing imbalance ratios. As
expected, the standard drift detectors cannot handle any class
imbalance and do not return acceptable results, omitting drift
detection entirely. This can be seen in the extremely poor
performance of the underlying classifier, which stopped being
updated and could not handle new incoming concepts. The two
reference skew-insensitive detectors maintain acceptable
robustness to small and medium imbalance ratios (IR <200),
but start to critically fail as IR increases further. At extreme
levels of IR, their performance becomes similar to that of the
standard detectors. This shows that none of the existing detectors
can handle high imbalance ratios in multi-class data streams.
RBM-IM offers excellent and stable robustness, filling this gap
and providing a sought-after robust drift detection approach.
We can attribute this to a combination of the used loss
function and the ability of RBM-IM to continually learn from
the stream. This is a massive advantage, as all other drift
[Fig. 9 panels omitted: pmAUC [%] vs. multi-class imbalance ratio (50–500) for Aggrawal5/10/20, Hyperplane5/10/20, RBF5/10/20, and RandomTree5/10/20, with curves for WSTD, RDDM, FHDDM, PerfSim, DDM-OCI, and RBM-IM.]
Fig. 9: Relationship between pmAUC and changing imbalance
ratio for the artificial benchmarks. The higher the imbalance
ratio, the higher the disproportions among multiple classes.
detectors use preset rules for deciding whether drift
is present or not. RBM-IM can learn the current distribution
in a skew-insensitive manner, making its drift detection much
more accurate and unaffected by the imbalance ratio.
Answer to RQ4: RBM-IM offers excellent robustness to vari-
ous levels of dynamic imbalance ratio in multi-class scenarios.
Due to its trainable nature, RBM-IM is capable of quickly
adapting to the current state of any stream and re-aligning its
own structure regarding class ratios and class roles. This is
the only drift detector displaying robustness to extremely high
levels of class imbalance (IR >400).
VII. LESSONS LEARNED
Let us now present a short summary of insights and
conclusions that were drawn from both the theoretical and
experimental parts of this paper.
Unified view on challenges in imbalanced multi-class data
streams. Continual learning from non-stationary and skewed
multiple distributions is a challenging topic that requires more
attention from the research community. It offers an excellent
field for developing and evaluating novel learning algorithms,
while calling for enhancing our models with various valuable
robust characteristics. We identified three mutually complementary
scenarios, each dealing with different learning
difficulties embedded in the nature of the data.
Advantages of trainable drift detector. To the best of
our knowledge, the existing state-of-the-art drift detectors are
realized as external modules that track some properties of the
stream and use them to decide if a drift should be detected or
not. However, those models use static rules for determining
the degree of change that constitutes drift presence. This
significantly limits them in capturing unique properties of each
concept and thus may negatively impact their reactivity to
changes. We propose to use a trainable drift detector that
can extract and store the most important characteristics of
the current state of the stream and use them to make an
informed and guided decision on whether the
underlying classifier should be retrained or not.
Handling global and local drifts. Most of the works in
drift detection focus on detecting global drifts that affect the
entire stream. Detectors gather information from every single
instance and use those statistics to make a decision. However,
this makes them less sensitive to local drifts that affect only
certain classes. The situation becomes even more challenging
when combined with multi-class imbalanced distributions.
Here, local drifts affecting the minority classes would go
unnoticed, as gathered statistics will be biased towards the
majority classes. This shows the importance of monitoring
each individual class for local changes.
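A minimal sketch of such per-class monitoring is given below. The deviation rule here is illustrative only (RBM-IM itself combines per-class reconstruction-error trends with a Granger causality test), and the class and parameter names are assumptions for the example.

```python
from collections import defaultdict, deque

class PerClassMonitor:
    """Track a change statistic (e.g. RBM reconstruction error) separately
    for every class, flagging local drift when the recent window mean for a
    class deviates strongly from that class's frozen reference mean."""

    def __init__(self, window=50, factor=3.0):
        self.window = window
        self.factor = factor
        self.buffers = defaultdict(lambda: deque(maxlen=window))
        self.reference = {}   # class -> (mean, std), frozen after warm-up

    def update(self, cls, error):
        """Record one error value for class `cls`; True signals local drift."""
        buf = self.buffers[cls]
        buf.append(error)
        if len(buf) < self.window:
            return False                  # still warming up for this class
        values = list(buf)
        mean = sum(values) / len(values)
        if cls not in self.reference:
            var = sum((v - mean) ** 2 for v in values) / len(values)
            self.reference[cls] = (mean, var ** 0.5 + 1e-12)
            return False
        ref_mean, ref_std = self.reference[cls]
        return abs(mean - ref_mean) > self.factor * ref_std
```

Because every class keeps its own buffer and reference, a change confined to a single minority class raises a flag for that class alone, instead of being drowned out by the static majority classes.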
Impact of class imbalance on drift detection. Not enough
attention has been given to the interplay between the con-
cept drift and class imbalance. We observed that imbalanced
distributions will directly affect each drift detector in two
possible ways: (i) enhancing the presence of small changes
in the majority classes; and (ii) diminishing the importance of
changes in the minority classes. The former problem is caused
by statistics gathered from more abundant classes that will
dominate the detector and thus may cause false alarms, as even
small changes will be magnified by the sheer disproportion
among classes. The latter problem is caused by the minority
classes not contributing enough to the drift detector statistics
and thus not being able to trigger it to cause an alarm.
VIII. CONCLUSIONS AND FUTURE WORKS
In this work, we have discussed an important area of
learning from multi-class imbalanced data streams under con-
cept drift. We proposed a unifying taxonomy of challenges
that may be encountered when learning from such data, and
identified three realistic scenarios representing various types
of learning difficulties. This was the first complete attempt to
understand and organize challenges arising in this area of ma-
chine learning. We introduced RBM-IM, a novel and trainable
drift detector for monitoring changes for continual learning
from multi-class imbalanced data streams. Our research was
motivated by an apparent lack of drift detection methods
designed for skewed multi-class and evolving streams. We
developed our drift detector on the basis of the Restricted
Boltzmann Machine neural network with a skew-insensitive
loss function.
In our future works, we plan to combine RBM-IM with
techniques for handling underfitting (to make it applicable to
small data streams), as well as make it robust to adversarial
concept drifts that may be injected by a malicious party as a
poisoning attack.
REFERENCES
[1] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual
lifelong learning with neural networks: A review,” Neural Networks, vol.
113, pp. 54–71, 2019.
[2] D. Sahoo, Q. Pham, J. Lu, and S. C. H. Hoi, “Online deep learning:
Learning deep neural networks on the fly,” in Proceedings of the Twenty-
Seventh International Joint Conference on Artificial Intelligence, IJCAI
2018, July 13-19, 2018, Stockholm, Sweden, J. Lang, Ed. ijcai.org,
2018, pp. 2660–2666.
[3] G. Ditzler, M. Roveri, C. Alippi, and R. Polikar, “Learning in nonsta-
tionary environments: A survey,” IEEE Comput. Intell. Mag., vol. 10,
no. 4, pp. 12–25, 2015.
[4] B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Wozniak,
“Ensemble learning for data stream analysis: A survey,” Inf. Fusion,
vol. 37, pp. 132–156, 2017.
[5] S. Chandra, A. Haque, H. Tao, J. Liu, L. Khan, and C. C. Aggarwal,
“Ensemble direct density ratio estimation for multistream classification,”
in 34th IEEE International Conference on Data Engineering, ICDE
2018, Paris, France, April 16-19, 2018. IEEE Computer Society, 2018,
pp. 1364–1367.
[6] Z. Wang, Z. Kong, S. Chandra, H. Tao, and L. Khan, “Robust high
dimensional stream classification with novel class detection,” in 35th
IEEE International Conference on Data Engineering, ICDE 2019,
Macao, China, April 8-11, 2019. IEEE, 2019, pp. 1418–1429.
[7] B. Krawczyk, “Learning from imbalanced data: open challenges and
future directions,” Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, 2016.
[8] S. Wang, L. L. Minku, and X. Yao, “A systematic study of online class
imbalance learning with concept drift,” IEEE Trans. Neural Networks
Learn. Syst., vol. 29, no. 10, pp. 4802–4821, 2018.
[9] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and
F. Herrera, Learning from Imbalanced Data Sets. Springer, 2018.
[Online]. Available: https://doi.org/10.1007/978-3-319-98074-4
[10] A. R. Masegosa, A. M. Martínez, D. Ramos-López, H. Langseth, T. D.
Nielsen, and A. Salmerón, “Analyzing concept drift: A case study in the
financial sector,” Intell. Data Anal., vol. 24, no. 3, pp. 665–688, 2020.
[11] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under
Concept Drift: A Review,” IEEE Transactions on Knowledge and Data
Engineering, vol. 31, no. 12, pp. 2346–2363, 2019.
[12] I. Goldenberg and G. I. Webb, “PCA-based drift and shift quantification
framework for multidimensional data,” Knowledge and Information
Systems, vol. 62, no. 7, pp. 2835–2854, 2020.
[13] G. H. F. M. Oliveira, L. L. Minku, and A. L. I. Oliveira, “GMM-VRD:
A Gaussian Mixture Model for Dealing With Virtual and Real Concept
Drifts,” in International Joint Conference on Neural Networks, IJCNN
2019 Budapest, Hungary, July 14-19, 2019. IEEE, 2019, pp. 1–8.
[14] J. Gama and G. Castillo, “Learning with Local Drift Detection,” in
Advanced Data Mining and Applications, Second International Con-
ference, ADMA 2006, Xi’an, China, August 14-16, 2006, Proceedings,
ser. Lecture Notes in Computer Science, vol. 4093. Springer, 2006, pp.
42–55.
[15] R. S. M. de Barros and S. G. T. de Carvalho Santos, “A large-scale
comparison of concept drift detectors,” Information Sciences, vol. 451-
452, pp. 348–370, 2018.
[16] J. Gama, P. Medas, G. Castillo, and P. P. Rodrigues, “Learning with
drift detection,” in Advances in Artificial Intelligence - SBIA 2004, 17th
Brazilian Symposium on Artificial Intelligence, São Luís, Maranhão,
Brazil, September 29 - October 1, 2004, Proceedings, ser. Lecture Notes
in Computer Science, A. L. C. Bazzan and S. Labidi, Eds., vol. 3171.
Springer, 2004, pp. 286–295.
[17] M. Baena-García, J. Campo-Ávila, R. Fidalgo-Merino, A. Bifet,
R. Gavaldà, and R. Morales-Bueno, “Early drift detection method,” in
4th ECML PKDD International Workshop on Knowledge Discovery from
Data Streams, 2006, pp. 77–86.
[18] R. S. M. de Barros, D. R. de Lima Cabral, P. M. G. Jr., and S. G. T.
de Carvalho Santos, “RDDM: reactive drift detection method,” Expert
Syst. Appl., vol. 90, pp. 344–355, 2017.
[19] A. Bifet and R. Gavaldà, “Learning from Time-Changing Data with
Adaptive Windowing,” in Proceedings of the Seventh SIAM International
Conference on Data Mining, April 26-28, 2007, Minneapolis, Minnesota,
USA. SIAM, 2007, pp. 443–448.
[20] I. I. F. Blanco, J. del Campo-Ávila, G. Ramos-Jiménez, R. M. Bueno,
A. A. O. Díaz, and Y. C. Mota, “Online and non-parametric drift
detection methods based on Hoeffding’s bounds,” IEEE Trans. Knowl.
Data Eng., vol. 27, no. 3, pp. 810–823, 2015.
[21] A. Pesaranghader and H. L. Viktor, “Fast Hoeffding drift detection
method for evolving data streams,” in Machine Learning and Knowledge
Discovery in Databases - European Conference, ECML PKDD 2016,
Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II, ser.
Lecture Notes in Computer Science, vol. 9852. Springer, 2016, pp.
96–111.
[22] R. S. M. de Barros, J. I. G. Hidalgo, and D. R. de Lima Cabral,
“Wilcoxon rank sum test drift detector,” Neurocomputing, vol. 275, pp.
1954–1963, 2018.
[23] A. Cano and B. Krawczyk, “Kappa Updated Ensemble for drifting data
stream mining,” Machine Learning, vol. 109, no. 1, pp. 175–218, 2020.
[24] D. K. Antwi, H. L. Viktor, and N. Japkowicz, “The perfsim algorithm for
concept drift detection in imbalanced data,” in 12th IEEE International
Conference on Data Mining Workshops, ICDM Workshops, Brussels,
Belgium, December 10, 2012. IEEE Computer Society, 2012, pp. 619–
628.
[25] S. Wang and L. L. Minku, “AUC estimation and concept drift detection
for imbalanced data streams with multiple classes,” in 2020 International
Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United
Kingdom, July 19-24, 2020. IEEE, 2020, pp. 1–8.
[26] A. Saadallah, L. Moreira-Matias, R. Sousa, J. Khiari, E. Jenelius, and
J. Gama, “BRIGHT - drift-aware demand predictions for taxi networks
(extended abstract),” in 35th IEEE International Conference on Data
Engineering, ICDE 2019, Macao, China, April 8-11, 2019. IEEE,
2019, pp. 2145–2146.
[27] S. Ramasamy, A. Ambikapathi, and K. Rajaraman, “Online RBM:
growing restricted Boltzmann machine on the fly for unsupervised
representation,” Appl. Soft Comput., vol. 92, p. 106278, 2020.
[28] Y. Cui, M. Jia, T. Lin, Y. Song, and S. J. Belongie, “Class-balanced
loss based on effective number of samples,” in IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2019, Long Beach,
CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE,
2019, pp. 9268–9277.
[29] X. Sun, “Assessing nonlinear Granger causality from multivariate time
series,” in Machine Learning and Knowledge Discovery in Databases,
European Conference, ECML/PKDD 2008, Antwerp, Belgium, Septem-
ber 15-19, 2008, Proceedings, Part II, ser. Lecture Notes in Computer
Science, vol. 5212. Springer, 2008, pp. 440–455.
[30] C. Mahjoub, J. Bellanger, A. Kachouri, and R. L. Bouquin-Jeannès,
“On the performance of temporal Granger causality measurements on
time series: a comparative study,” Signal Image Video Process., vol. 14,
no. 5, pp. 955–963, 2020.
[31] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “MOA: Massive
Online Analysis,” Journal of Machine Learning Research, vol. 11, pp.
1601–1604, 2010.
[32] B. Veloso, J. Gama, and B. Malheiro, “Self hyper-parameter tuning for
data streams,” in Discovery Science - 21st International Conference, DS
2018, Limassol, Cyprus, October 29-31, 2018, Proceedings, ser. Lecture
Notes in Computer Science, vol. 11198. Springer, 2018, pp. 241–255.
[33] B. Krawczyk and P. Skryjomski, “Cost-sensitive perceptron decision
trees for imbalanced drifting data streams,” in Machine Learning and
Knowledge Discovery in Databases - European Conference, ECML
PKDD 2017, Skopje, Macedonia, September 18-22, 2017, Proceedings,
Part II, ser. Lecture Notes in Computer Science, vol. 10535. Springer,
2017, pp. 512–527.
[34] L. Korycki and B. Krawczyk, “Online oversampling for sparsely labeled
imbalanced and non-stationary data streams,” in 2020 International
Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United
Kingdom, July 19-24, 2020. IEEE, 2020, pp. 1–8.
[35] A. Benavoli, G. Corani, J. Demsar, and M. Zaffalon, “Time for a change:
a tutorial for comparing multiple classifiers through bayesian analysis,”
J. Mach. Learn. Res., vol. 18, pp. 77:1–77:36, 2017.