Low-Dimensional Representation Learning from
Imbalanced Data Streams
Lukasz Korycki and Bartosz Krawczyk
Department of Computer Science
Virginia Commonwealth University
Richmond, VA, USA
{koryckil,bkrawczyk}@vcu.edu
Abstract. Learning from data streams is among the contemporary chal-
lenges in the machine learning domain, which is frequently plagued by the
class imbalance problem. In non-stationary environments, ratios among
classes, as well as their roles (majority and minority) may change over
time. The class imbalance is usually alleviated by balancing classes with
resampling. However, this suffers from limitations, such as a lack of adap-
tation to concept drift and the possibility of shifting the true class dis-
tributions. In this paper, we propose a novel ensemble approach, where
each new base classifier is built using a low-dimensional embedding. We
use class-dependent entropy linear manifold to find the most discrimi-
native low-dimensional representation that is, at the same time, skew-
insensitive. This allows us to address two challenging issues: (i) learning
efficient classifiers from imbalanced and drifting streams without data
resampling; and (ii) tackling simultaneously high-dimensional and imbal-
anced streams that pose extreme challenges to existing classifiers. Our
proposed low-dimensional representation algorithm is a flexible plug-in
that can work with any ensemble learning algorithm, making it a highly
useful tool for difficult scenarios of learning from high-dimensional im-
balanced and drifting data streams.
Keywords: Machine learning · Data stream mining · Class imbalance · Concept drift · Low-dimensional representation
1 Introduction
We define a data stream as a sequence $\langle S_1, S_2, \ldots, S_n, \ldots \rangle$, in which each element $S_j$ is a collection of instances (batch scenario) or a single instance (online scenario). In this paper, we consider the supervised learning scenario, which allows us to define each element as $S_j \sim p_j(x^1, \ldots, x^d, y) = p_j(\mathbf{x}, y)$, where $p_j(\mathbf{x}, y)$ is the joint distribution of the $j$-th element, defined over a $d$-dimensional feature space and with class label $y$. Each instance in the stream is independent and randomly drawn from a stationary probability distribution $\Psi_j(\mathbf{x}, y)$. Data streams are also subject to a phenomenon known as concept
drift [13], where the properties of a stream evolve over time. Furthermore, two
challenging problems connected with data streams are class imbalance [7] and
high-dimensionality of the feature space [15]. These two problems have been an-
alyzed disjointly, but in realistic scenarios they may appear together [7]. There-
fore, there is a need for developing novel methods that can simultaneously deal
with both of these issues, while being robust to concept drift.
In this paper, we propose a novel approach for learning low-dimensional rep-
resentations of data streams. It is based on finding an optimal projection of
data into a subspace that maximizes the Renyi’s quadratic entropy per class. By
weighting class representations when searching for an optimal low-dimensional
manifold, we ensure that the new projection is not only highly discriminative,
but also skew-insensitive. We show that the proposed technique is a universal
and flexible plug-in that can be used with any ensemble algorithm created for
data streams. Additionally, we highlight and address two important shortcom-
ings of existing methods: (i) reference low-dimensional projections cannot handle
increasing imbalance ratio; and (ii) existing resampling and cost-sensitive algo-
rithms for imbalanced data streams fail with increasing dimensionality of the
feature space. We show that finding a discriminative and skew-insensitive pro-
jection may actually lead to better separation between classes and outperforming
resampling algorithms. The main contributions of this paper are given as follows.
– First low-dimensional embedding for imbalanced data streams. A novel approach to tackling the extremely challenging scenario of mining imbalanced, high-dimensional, and drifting data streams.
– Theoretically and practically sound streaming low-dimensional embedding. An efficient and theoretically grounded entropy-based low-dimensional embedding that can be easily used with most streaming ensembles.
– Extensive empirical evaluation of the proposed method. We carry out a thorough experimental study on 576 diverse generators and three difficult real-world data streams.
2 Low-Dimensional Representation
General idea. High-dimensional data may be challenging for many machine
learning algorithms, due to a phenomenon known as the curse of dimensionality.
Therefore, a projection from $\mathbb{R}^d$ to $\mathbb{R}^k$, where $d \gg k$, is highly attractive for
both classification and visualization purposes. There are two main approaches
for achieving this – linear projections or more complex embeddings.
Linear projections aim to find a matrix $V \in \mathbb{R}^{d \times k}$ which, for a given data set $X \in \mathbb{R}^{d \times N}$, provides a projection $V^T X$ preserving as much of the original data characteristics as possible. This is measured by a selected information measure (which we will generally denote as $IM$), as well as a set of constraints $\varphi_i$. We can define a projection for obtaining a low-dimensional representation as:

$$\max_{V \in \mathbb{R}^{d \times k}} IM(V^T X; X, Y) \qquad (1)$$

subject to $\varphi_i(V)$, $i = 1, \ldots, m$, where $Y$ is a set of additional information used during the projection, such as class labels.
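To make Eq. 1 concrete, the following minimal sketch (our own illustration, not part of the original study) draws a random orthonormal $V$ via a QR decomposition and applies the projection $V^T X$; any information measure $IM$ could then be evaluated on the projected data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 100, 5, 1000                     # original dimensionality, target dimensionality, sample size
X = rng.normal(size=(d, N))                # data set X in R^{d x N} (columns are instances)

# A random orthonormal basis V in R^{d x k}: QR factorization guarantees V^T V = I
V, _ = np.linalg.qr(rng.normal(size=(d, k)))

Z = V.T @ X                                # low-dimensional representation, shape (k, N)
print(np.allclose(V.T @ V, np.eye(k)))     # the orthonormality constraint holds
```

Searching over such orthonormal matrices for the one that maximizes $IM$ is exactly the optimization problem stated in Eq. 1.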
Benefits of low-dimensional representation for imbalanced data streams.
The disproportion between classes is not the sole source of learning difficulties.
As long as classes are well separated, skewed distributions are not going to affect the classifier. However, imbalanced data is often accompanied by a number
of other learning problems [7]. We argue that by using a discriminative and
low-dimensional embedding one may achieve similar or better results than when
using resampling, without risking its drawbacks. In this paper, we propose a
skew-insensitive embedding that is able to learn optimal subspaces that im-
prove separation between classes and alleviate the need for any resampling or
cost-sensitive learning. Additionally, as we generate a new embedding for each
incoming chunk of data, we are able to adapt to concept drift.
3 Low-Dimensional Projection for Imbalanced Data
Initial assumptions. In this work, we aim at finding a projection that yields a low-dimensional representation offering the best discrimination between the minority ($X_{min}$) and majority ($X_{maj}$) classes. This can be done using the Cauchy-Schwarz Divergence ($D_{CS}(\cdot,\cdot)$) for measuring the discriminative power, which allows us to rewrite Eq. 1 as:

$$\max_{V \in \mathbb{R}^{d \times k}} D_{CS}\big([[V^T X_{min}]], [[V^T X_{maj}]]\big) \qquad (2)$$

subject to $V^T V = I$, where $[[\cdot]]$ stands for a density estimator.
In order to apply a low-dimensional projection to data streams, we need to choose a proper information measure. In this work, we decided to use the Renyi's quadratic entropy [5], as it has two main advantages for streaming data: (i) low computational complexity; and (ii) confirmed high usefulness for classification tasks. For a density $f$ on $\mathbb{R}^k$, it can be expressed as:

$$H_2(f) = -\log \int_{\mathbb{R}^k} f^2(x)\,dx. \qquad (3)$$
For computing the Renyi's quadratic entropy on actual data, one needs to select a density estimator and solve the optimization problem of finding an orthonormal base $V$ of a $k$-dimensional subspace that maximizes $H_2([[V^T X]])$. It has been shown that the maximization of the Renyi's quadratic entropy leads to selecting a representation with a high spread, while its minimization offers a condensed representation.
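As a small numerical illustration (ours, with an arbitrary isotropic bandwidth), the quadratic Renyi entropy of a Gaussian kernel density estimate has a closed form through its information potential, $\int f^2 = \frac{1}{N^2}\sum_{i,j}\mathcal{N}(x_i - x_j, 2\Sigma)(0)$, which follows from the Gaussian convolution identity used later in Eq. 11:

```python
import numpy as np
from itertools import product

def gaussian_at_zero(w, cov):
    """N(w, cov) evaluated at 0: (2*pi)^(-k/2) det(cov)^(-1/2) exp(-0.5 * w^T cov^{-1} w)."""
    k = len(w)
    return np.exp(-0.5 * w @ np.linalg.solve(cov, w)) / np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))

def renyi_quadratic_entropy(X, sigma2=1.0):
    """H2 of a Gaussian KDE with an isotropic bandwidth sigma2 (arbitrary choice, for illustration)."""
    N, k = X.shape
    cov2 = 2 * sigma2 * np.eye(k)                    # Sigma + Sigma from the convolution identity
    ip = np.mean([gaussian_at_zero(a - b, cov2) for a, b in product(X, X)])  # information potential
    return -np.log(ip)

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 2))
print(renyi_quadratic_entropy(A), renyi_quadratic_entropy(3.0 * A))   # larger spread -> larger H2
```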
For the problem of imbalanced data, we can express the class-dependent Cauchy-Schwarz Divergence for the minority and majority classes using the mentioned Renyi's quadratic entropy ($H_2$) and Renyi's quadratic cross-entropy ($H_2^{\times}$):

$$D_{CS}(V) = \log \int [[V^T X_{min}]]^2 + \log \int [[V^T X_{maj}]]^2 - 2\log \int [[V^T X_{min}]]\,[[V^T X_{maj}]]$$
$$= -H_2([[V^T X_{min}]]) - H_2([[V^T X_{maj}]]) + 2H_2^{\times}\big([[V^T X_{min}]], [[V^T X_{maj}]]\big). \qquad (4)$$
Objective function for the projection. Let us now formulate the objective function and its gradient for the Renyi's quadratic entropy-based projection, so that it can be solved using any first-order optimization method. We want to project the entire data set $X$ onto $V$ and compute $D_{CS}(V)$ with kernel density estimators of the obtained projection:

$$G^{-1}(V)\,[[V^T X_{min}]] \quad \text{and} \quad G^{-1}(V)\,[[V^T X_{maj}]], \qquad (5)$$

where $G(V) = V^T V$ stands for the Gram matrix. We look for a projection space $V$ that maximizes $D_{CS}$. Such projections do not depend on affine transformations of the data, thus allowing us to restrict the analogous formulas for the sets $V^T X_{min}$ and $V^T X_{maj}$ to such $V$ that consist of linearly independent vectors.
Additionally, we want our projection to be skew-insensitive and to improve the separation between the minority and majority classes. We obtain this by weighting the classes with $W_{min}$ and $W_{maj}$, which stand for the weights assigned to the minority and majority objects, respectively. By assigning higher weights to the minority class, we obtain a better representation after the low-dimensional projection that alleviates the class imbalance problem. Therefore, in order to maximize $D_{CS}$ while remaining skew-insensitive, we need to compute the gradient of the following function:

$$D_{CS}(V_{im}) = D_{CS}\big([[V^T X_{min} W_{min}]], [[V^T X_{maj} W_{maj}]]\big)$$
$$= \log \int [[V^T X_{min} W_{min}]]^2 + \log \int [[V^T X_{maj} W_{maj}]]^2 - 2\log \int [[V^T X_{min} W_{min}]]\,[[V^T X_{maj} W_{maj}]], \qquad (6)$$

where $V_{im}$ stands for a skew-insensitive projection and we consider only linearly independent vectors. In order to avoid numerical instabilities, a penalty factor can be added [6]:

$$D_{CS}(V_{im}) - \|V^T V - I\|^2, \qquad (7)$$

which is used to penalize non-orthonormal $V$'s.
This allows us to formulate the proposed Low-Dimensional Projection for Imbalanced Data ($LDP_{IM}$) objective function used to select the best possible low-dimensional projection that alleviates the effect of class imbalance:

$$LDP_{IM} = D_{CS}(V_{im}) - \|V^T V - I\|^2. \qquad (8)$$

Additionally, we need to be able to compute the gradient of the objective function, $\nabla LDP_{IM}$. For the second term, we can compute this as:

$$\nabla \|V^T V - I\|^2 = 4VV^TV - 4V. \qquad (9)$$
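As a quick sanity check of Eq. 9 (our own illustration; the sizes and the finite-difference step are arbitrary), the analytic gradient of the orthonormality penalty can be compared against a numerical estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 8, 3
V = rng.normal(size=(d, k))

def penalty(V):
    """Squared Frobenius norm of V^T V - I (the orthonormality penalty of Eq. 7)."""
    M = V.T @ V - np.eye(V.shape[1])
    return np.sum(M * M)

analytic = 4 * V @ V.T @ V - 4 * V          # Eq. 9

# Finite-difference gradient, entry by entry
eps, numeric = 1e-6, np.zeros_like(V)
for i in range(d):
    for j in range(k):
        E = np.zeros_like(V); E[i, j] = eps
        numeric[i, j] = (penalty(V + E) - penalty(V - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```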
Let us now present how we can compute the gradient of the first term. For this, we will need to compute the product of the kernel densities of two sets [6].
For notational simplicity, let us assume that set $A$ stands for the minority class $X_{min}$ and set $B$ for the majority class $X_{maj}$. We can estimate the kernel density of a given set with the Gaussian kernel:

$$[[A]] = \frac{1}{|A|}\sum_{a \in A} \mathcal{N}(a, \Sigma_A), \qquad (10)$$

where $\Sigma_A = (h^{\gamma}_A)^2 \mathrm{cov}_A$, $h^{\gamma}_A = \gamma \left(\frac{4}{k+2}\right)^{1/(k+4)} |A|^{-1/(k+4)}$, and $\gamma$ is a scaling hyperparameter.
We need to obtain $\int [[A]][[B]]$, which can be calculated from the following:

$$\int \mathcal{N}(a, \Sigma_A)\,\mathcal{N}(b, \Sigma_B) = \mathcal{N}(a - b, \Sigma_A + \Sigma_B)(0), \qquad (11)$$

leading to:

$$\int [[A]][[B]] = \frac{1}{|A||B|}\sum_{w \in A-B} \mathcal{N}(w, \Sigma_A + \Sigma_B)(0) = \frac{1}{(2\pi)^{k/2}\det^{1/2}(\Sigma_{AB})\,|A||B|}\sum_{w \in A-B}\exp\left(-\tfrac{1}{2}\|w\|^2_{\Sigma_{AB}}\right), \qquad (12)$$

where $A - B = \{a - b : a \in A, b \in B\}$ and $\Sigma_{AB} = (h^{\gamma}_A)^2 \mathrm{cov}_A + (h^{\gamma}_B)^2 \mathrm{cov}_B$.
As we deal with a sequence of linearly independent vectors $V = [V_1, \ldots, V_k] \in \mathbb{R}^{d \times k}$, we may use the following:

$$\Sigma_{AB}(V) = V^T \Sigma_{AB} V \quad \text{and} \quad S_{AB}(V) = \Sigma_{AB}(V)^{-1}, \qquad (13)$$

where $\Sigma_{AB}(V)$ and $S_{AB}(V)$ are square symmetric matrices storing the properties of a given projection onto the space $V$.

Let us now calculate:

$$\varphi_{AB}(V) = \frac{1}{(2\pi)^{k/2}\det^{1/2}(\Sigma_{AB}(V))\,|A||B|}, \qquad (14)$$

and compute the gradient of this function as:

$$\nabla \varphi_{AB}(V) = -\varphi_{AB}(V) \cdot \Sigma_{AB} \cdot V \cdot S_{AB}(V). \qquad (15)$$

To compute the final formula for the first term of $\nabla LDP_{IM}$, we need to calculate the gradient of the function $V \mapsto \det(\Sigma_{AB}(V))$, which is given by the following formula:

$$\nabla \det(\Sigma_{AB}(V)) = 2\det(V^T \Sigma_{AB} V) \cdot \Sigma_{AB} V (V^T \Sigma_{AB} V)^{-1}. \qquad (16)$$
We define the information potential function:

$$\psi^{w}_{AB}(V) = \exp\left(-\tfrac{1}{2}\|V^T w\|^2_{\Sigma_{AB}(V)}\right), \qquad (17)$$

where, for an arbitrarily set value of the parameter $w$, we are able to compute its gradient as follows:
$$\nabla \psi^{w}_{AB}(V) = -\psi^{w}_{AB}(V) \cdot \left(ww^T V S_{AB}(V) - \Sigma_{AB}\, V S_{AB}(V) V^T w w^T V S_{AB}(V)\right). \qquad (18)$$
Finally, we need to define the cross information potential function (between $A$ and $B$) and its gradient:

$$ip^{\times}_{AB}(V) = \varphi_{AB}(V) \sum_{w \in A-B} \psi^{w}_{AB}(V), \qquad (19)$$

$$\nabla ip^{\times}_{AB}(V) = \varphi_{AB}(V) \sum_{w \in A-B} \nabla\psi^{w}_{AB}(V) + \left(\sum_{w \in A-B} \psi^{w}_{AB}(V)\right) \cdot \nabla\varphi_{AB}(V). \qquad (20)$$
This allows us to get back to $D_{CS}(V_{im})$ and rewrite it as:

$$D_{CS}(V_{im}) = \log\big(ip^{\times}_{X_{min}X_{min}}(V)\big) + \log\big(ip^{\times}_{X_{maj}X_{maj}}(V)\big) - 2\log\big(ip^{\times}_{X_{min}X_{maj}}(V)\big), \qquad (21)$$

and calculate its gradient as:

$$\nabla D_{CS}(V_{im}) = \frac{\nabla ip^{\times}_{X_{min}X_{min}}(V)}{ip^{\times}_{X_{min}X_{min}}(V)} + \frac{\nabla ip^{\times}_{X_{maj}X_{maj}}(V)}{ip^{\times}_{X_{maj}X_{maj}}(V)} - 2\,\frac{\nabla ip^{\times}_{X_{min}X_{maj}}(V)}{ip^{\times}_{X_{min}X_{maj}}(V)}. \qquad (22)$$
After these steps, we may properly formulate the $LDP_{IM}$ objective function and its gradient as:

$$LDP_{IM}(V) = D_{CS}(V_{im}) - \|V^T V - I\|^2, \qquad (23)$$

$$\nabla LDP_{IM}(V) = \nabla D_{CS}(V_{im}) - (4VV^TV - 4V). \qquad (24)$$

These equations can be used as an input to any first-order optimization method to find a new $k$-dimensional projection that serves as a discriminative, skew-insensitive, and low-dimensional representation of the original feature space.
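To illustrate how Eqs. 23-24 feed a first-order optimizer, the sketch below maximizes the Cauchy-Schwarz objective with plain gradient ascent. It is a simplified illustration of ours: it omits the class weights of Eq. 6, uses finite-difference gradients instead of the analytic formulas of Eqs. 15-22, and all sizes and step settings are arbitrary.

```python
import numpy as np

def kernel_cov(S, gamma=1.0):
    """Per-class kernel covariance with the bandwidth of Eq. 10."""
    n, k = S.shape
    h = gamma * (4.0 / (k + 2)) ** (1.0 / (k + 4)) * n ** (-1.0 / (k + 4))
    return (h ** 2) * np.cov(S, rowvar=False)

def cross_ip(A, B):
    """Vectorized cross information potential of Eq. 12 between projected sets A and B."""
    k = A.shape[1]
    cov = kernel_cov(A) + kernel_cov(B)
    inv = np.linalg.inv(cov)
    W = A[:, None, :] - B[None, :, :]                  # all pairwise differences a - b
    q = np.einsum('ijl,lm,ijm->ij', W, inv, W)         # quadratic forms w^T Sigma_AB^{-1} w
    return np.exp(-0.5 * q).sum() / ((2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(cov)) * len(A) * len(B))

def ldp_objective(V, X_min, X_maj):
    """Eq. 23: Cauchy-Schwarz divergence of the projected classes minus the orthonormality penalty."""
    A, B = X_min @ V, X_maj @ V
    d_cs = np.log(cross_ip(A, A)) + np.log(cross_ip(B, B)) - 2.0 * np.log(cross_ip(A, B))
    return d_cs - np.sum((V.T @ V - np.eye(V.shape[1])) ** 2)

def numeric_grad(f, V, eps=1e-5):
    G = np.zeros_like(V)
    for idx in np.ndindex(*V.shape):
        E = np.zeros_like(V); E[idx] = eps
        G[idx] = (f(V + E) - f(V - E)) / (2 * eps)
    return G

rng = np.random.default_rng(5)
d, k = 10, 2
X_min = rng.normal(loc=1.0, size=(40, d))              # minority class
X_maj = rng.normal(loc=0.0, size=(200, d))             # majority class
V = np.linalg.qr(rng.normal(size=(d, k)))[0]           # start from an orthonormal basis

for _ in range(30):                                    # plain gradient ascent on Eq. 23
    V += 0.05 * numeric_grad(lambda M: ldp_objective(M, X_min, X_maj), V)
print(ldp_objective(V, X_min, X_maj))
```

In the full method, the analytic gradient of Eq. 24 would replace the finite-difference estimate, and the class weights of Eq. 6 would enter the density estimates.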
Embedding LDPIM into ensembles for data streams. The proposed LDPIM
can be seamlessly embedded into any chunk-based ensemble learning algorithm
dedicated to data streams [10]. Their general idea is based on training a new
base classifier on the most recent chunk of data and using it to replace the most
incompetent classifier in the pool. We propose to use LDPIM as a flexible plug-in
to any such ensemble, performing the low-dimensional embedding when a new
chunk of data arrives and then training a new classifier in the reduced feature
space. This will not only make any ensemble robust to class imbalance, but also
improve the predictive power and speed of training of base classifiers.
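A minimal sketch of this plug-in pattern is given below. It is our own illustration rather than the authors' implementation: scikit-learn decision trees stand in for Hoeffding Trees, fit_ldp_projection is a hypothetical placeholder for the optimization of Eqs. 23-24, and the weakest member is replaced by a simple accuracy criterion instead of any specific ensemble's weighting scheme.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_ldp_projection(X, y, k):
    """Placeholder for the LDP_IM optimization (Eqs. 23-24); here just a random orthonormal V."""
    V, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(X.shape[1], k)))
    return V

class LowDimChunkEnsemble:
    """Chunk-based ensemble: every incoming chunk gets its own embedding and base classifier."""

    def __init__(self, max_size=10, k_ratio=0.05):
        self.max_size, self.k_ratio = max_size, k_ratio
        self.members = []                                    # list of (V, classifier) pairs

    def partial_fit(self, X_chunk, y_chunk):
        k = max(1, int(self.k_ratio * X_chunk.shape[1]))
        V = fit_ldp_projection(X_chunk, y_chunk, k)          # new embedding for this chunk
        clf = DecisionTreeClassifier(max_depth=5).fit(X_chunk @ V, y_chunk)
        if len(self.members) >= self.max_size:               # replace the weakest member
            accs = [m[1].score(X_chunk @ m[0], y_chunk) for m in self.members]
            self.members.pop(int(np.argmin(accs)))
        self.members.append((V, clf))

    def decision_scores(self, X):
        """Average vote for the positive (minority) class, assuming 0/1 labels."""
        return np.mean([m[1].predict(X @ m[0]) for m in self.members], axis=0)
```

In a streaming run, partial_fit is called once per incoming chunk, so every member carries the embedding of the chunk it was trained on.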
Adaptation to concept drift. As LDPIM will be run independently on each
data chunk, this will have two interesting effects on the underlying ensemble.
Firstly, it will allow LDPIM to adapt to concept drift, as each embedding will be done on the most recent data. Secondly, each base classifier will be trained on a different embedding, thus positively impacting the diversity of the ensemble.
4 Experimental study
This experimental study was designed to answer the following research questions.
RQ1: Does the proposed low-dimensional embedding offer robustness to class
imbalance by providing a better representation of the minority class, and can it
outperform the state-of-the-art low-dimensional projection algorithms?
RQ2: Is LDPIM flexible enough to work with a variety of ensemble learning
algorithms designed for drifting data streams?
RQ3: Does using the skew-insensitive low-dimensional representation for high-
dimensional data streams offer better discriminative power than resampling and
cost-sensitive solutions?
RQ4: Is the proposed LDPIM capable of outperforming reference methods for
real-world data streams with various combinations of feature space dimension-
ality and imbalance ratio?
4.1 Data stream benchmarks
For the purpose of evaluating our proposed algorithm, we generated 576 di-
verse and large-scale data stream benchmarks using MOA. We used four generators to create binary imbalanced streams, each with a number of features in
[50,100,250,500,1000,5000], an imbalance ratio in [10,30,50,80,100,150], and
with four types of drifts [no drift, sudden, gradual, incremental]. Exhaustive com-
binations of these factors lead to the creation of 576 data streams with 1M-5M
instances each. Additionally, we use three real-world data streams: CIFAR-100,
ImageNet and SUN-397 transformed to multi-class imbalanced problems [14].
Their properties are given in Tab. 1.
Table 1: Properties of data generators that were used to create 576 imbalanced data
stream benchmarks and three real-world data sets.
Dataset Instances Features Classes IR Drift
Aggrawal 1 000 000 50 – 5000 2 10 – 150 n/s/g/i
Hyperplane 1 000 000 50 – 5000 2 10 – 150 n/s/g/i
RBF 5 000 000 50 – 5000 2 10 – 150 n/s/g/i
RandomTree 2 000 000 50 – 5000 2 10 – 150 n/s/g/i
CIFAR-100 60 000 1024 100 50 unknown
ImageNet 1 200 000 4096 200 50 unknown
SUN-397 108 753 1024 397 137 unknown
4.2 Set-up
Reference algorithms for low-dimensional representation. We have se-
lected three state-of-the-art reference methods for obtaining low-dimensional
projections: (i) Per-Class Principal Component Analysis (pPCA); (ii) Structure-
Preserving Non-Negative Matrix Factorization (NMF) [11]; and (iii) Discrimi-
native Learning using Generalized Eigenvectors (GEM) [9].
Reference algorithms for handling class imbalance. We have selected
three state-of-the-art reference methods for handling imbalanced data streams:
(i) Incremental Oversampling for Data Streams (IOSDS) [1]; (ii) undersam-
pling via Selection-Based Resampling (SRE) [12]; and (iii) Online Multiple Cost-
Sensitive Learning (OMCSL) [16].
[Fig. 1 shows four win/tie/loss bar charts: LDPIM vs. GEM, NMF, and pPCA under the KUE, ARF, GOOWE, and AUE ensembles; the y-axis is the number of data sets (0–576).]
Fig. 1: Comparison of LDPIM and reference algorithms for low-dimensional representation over four different ensemble architectures. Results presented with respect to the number of wins (green), ties (yellow), and losses (red) over 576 data streams. A tie was counted when McNemar's test rejected the significance of the difference between the tested algorithms.
Ensemble learning algorithms. The proposed LDPIM is a flexible plug-in
that can be used with any ensemble learning algorithm. Therefore, to eval-
uate its interplay with various ensembles, we have selected recent and pop-
ular architectures designed for streaming data: (i) Kappa Updated Ensemble
(KUE) [4]; (ii) Adaptive Random Forest (ARF) [8]; (iii) Geometrically Optimum
and Online-Weighted Ensemble (GOOWE) [2]; and (iv) Accuracy Updated En-
semble (AUE) [3]. Each of them worked in a block-based mode, with 10 base
classifiers maintained. We used Hoeffding Trees as base learners.
Parameters. LDPIM uses the inverse imbalance ratio for setting weights in
Eq. 6. All algorithms for low-dimensional representation use k= 0.05d, so they
reduce the input feature space by 95%. All of them were tuned with parameters
and procedures suggested by their authors.
Evaluation metrics. As we deal with imbalanced and drifting data streams,
we evaluated the examined algorithms using prequential AUC [7].
Windows. We used a window size ω = 1000 for calculating the prequential
metrics and training new classifiers.
Statistical analysis. We used McNemar's test for pairwise comparisons and the Bonferroni-Dunn test for multiple comparisons.
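For reference, the window-based test-then-train evaluation used here can be sketched as follows; this is a simplified illustration of ours (per-window AUC on a binary stream, without the forgetting mechanism of the full prequential AUC), assuming a model that exposes partial_fit and decision_scores, such as the ensemble sketched in Section 3.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def windowed_prequential_auc(model, X_stream, y_stream, window=1000):
    """Score each incoming window with the current model, then train on that window."""
    aucs = []
    for start in range(0, len(X_stream) - window + 1, window):
        X_w, y_w = X_stream[start:start + window], y_stream[start:start + window]
        if start > 0 and len(np.unique(y_w)) == 2:           # skip the first (untrained) window
            aucs.append(roc_auc_score(y_w, model.decision_scores(X_w)))
        model.partial_fit(X_w, y_w)                          # test-then-train
    return float(np.mean(aucs))
```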
4.3 Experiment 1: Low-dimensional representations
General comparison. Fig. 1 offers a detailed comparison of the proposed
method with three reference low-dimensional embedding methods using four
underlying ensemble architectures over 576 diverse data stream benchmarks.
We can see that LDPIM is capable of outperforming reference embeddings re-
gardless of the underlying ensemble architecture. This shows that the proposed method is not only able to create more discriminative subspaces (RQ1), but is also highly flexible with regard to the utilized classification scheme and
can be used as a general-purpose plug-in (RQ2). This is further confirmed by
the Bonferroni-Dunn test (see Fig. 5), proving that the differences between the proposed LDPIM and reference algorithms are statistically significant.
Analysis of robustness to high dimensionality and class imbalance.
Fig. 2 shows a detailed analysis of the robustness of each of the examined projection methods to increasing dimensionality and imbalance ratio. As was
Fig. 2: Analysis of the relationship between prequential AUC and increasing imbal-
ance ratio / dimensionality for LDPIM and reference algorithms for low-dimensional
representation. KUE used as the ensemble method.
[Fig. 3 shows four win/tie/loss bar charts: LDPIM vs. OMCSL, SRE, and IOSDS under the KUE, ARF, GOOWE, and AUE ensembles; the y-axis is the number of data sets (0–576).]
Fig. 3: Comparison of LDPIM and reference algorithms for imbalanced data streams over four different ensemble architectures. Results presented with respect to the number of wins (green), ties (yellow), and losses (red) over 576 data streams. A tie was counted when McNemar's test rejected the significance of the difference between the tested algorithms.
expected, all four algorithms can handle data streams with up to thousands of features. LDPIM shows some improvements over the reference methods when facing thousands of features, which can be attributed to using weighted
entropies from both classes. Things are different when we analyze an increasing
imbalance ratio. Here LDPIM clearly outperforms all reference methods, show-
ing that they are not capable of creating discriminative projections when dealing
with highly skewed distributions.
4.4 Experiment 2: Skew-insensitive algorithms
General comparison. Fig. 3 offers a detailed comparison of the proposed
method with three reference resampling and cost-sensitive methods utilizing four
underlying ensemble architectures over 576 diverse data stream benchmarks.
The obtained results show that by learning a skew-insensitive representation via
LDPIM we may be highly competitive with traditional solutions created for han-
dling class imbalance. Furthermore, we achieve this without artificially inflating
Fig. 4: Analysis of the relationship between prequential AUC and increasing imbalance
ratio / dimensionality for LDPIM and reference algorithms for handling imbalanced
data streams. KUE used as the ensemble method.
Fig. 5: The Bonferroni-Dunn tests for the comparison among methods for low-
dimensional representations and handling class imbalance.
the size of the data set or the need for meta-tuning of the cost parameter. This
is further confirmed by the Bonferroni-Dunn test (see Fig. 5).
Analysis of robustness to high dimensionality and class imbalance.
Fig. 4 shows a detailed analysis of the robustness of each of the examined im-
balance alleviation methods to increasing dimensionality and imbalance ratio.
These results show a weakness of the existing solutions, as neither resampling nor
cost-sensitive learning can work under high dimensionality of the feature space.
LDPIM offers comparable performance when the dimensionality is low, but maintains
excellent robustness to even thousands of features. As for the imbalance ratio, all
examined methods can adapt to increasing disproportion between classes, but
only LDPIM can handle imbalanced and high-dimensional streams (RQ3). It is
important to notice that LDPIM also offers excellent performance when dealing
with lower imbalance ratios or smaller feature dimensionality, making it a very
flexible solution to learning from difficult data streams.
4.5 Experiment 3: Evaluation on real-world data streams
Finally, we want to evaluate the performance of the proposed LDPIM on real-
world data. Tab. 2 presents the obtained results according to the prequential
AUC metric and for four examined ensemble classifiers used as base learners. We
Table 2: Results according to prequential multi-class AUC on real-world imbalanced
data sets for all reference methods and four ensemble classifiers used as base learners.
Dataset LDPIM pPCA NMF GEM IOSDS SRE OMCSL
KUE
CIFAR-100 82.72±2.38 56.26±9.73 60.23±9.18 64.34±8.56 65.83±7.02 66.11±6.45 65.23±7.98
ImageNet 41.07±5.09 16.54±8.83 17.99±9.02 20.04±7.81 28.94±6.17 31.43±6.93 32.80±5.98
SUN-397 38.28±6.79 9.28±5.99 11.26±7.03 12.36±8.04 17.29±8.92 18.92±7.48 20.01±9.01
ARF
CIFAR-100 80.19±2.72 51.54±9.29 59.57±9.33 62.41±8.07 62.95±7.38 63.83±6.18 64.01±7.29
ImageNet 40.03±6.11 14.99±9.02 17.58±9.31 21.93±7.48 27.91±7.03 30.52±7.28 31.06±6.77
SUN-397 36.28±7.28 8.62±6.58 10.24±7.33 11.64±7.58 16.03±9.04 17.88±8.04 18.84±9.62
GOOWE
CIFAR-100 78.03±2.99 50.21±9.80 53.07±9.94 56.61±9.17 58.29±7.78 60.06±6.92 60.92±8.64
ImageNet 36.86±5.89 11.72±8.14 13.58±8.22 16.22±8.01 24.17±8.14 26.77±7.94 27.94±7.80
SUN-397 32.09±4.99 5.89±3.44 7.09±2.99 7.82±3.88 11.57±5.01 13.39±4.78 15.28±5.02
AUE
CIFAR-100 79.44±2.76 51.95±9.41 55.11±9.38 58.92±8.15 60.02±7.76 62.17±6.35 63.88±7.48
ImageNet 37.93±4.81 12.77±6.62 15.02±7.13 18.54±6.96 25.82±7.11 28.44±8.09 28.98±5.49
SUN-397 34.18±2.79 7.28±1.99 8.99±2.52 8.90±3.08 13.44±3.72 16.48±4.08 17.24±4.27
can see that LDPIM offers superior performance and exceptional stability for all
data sets and regardless of the used ensemble classifier. This verifies the high
attractiveness of the proposed LDPIM for learning from difficult data streams
in various real-world scenarios (RQ4).
5 Conclusions and future works
In this paper, we have tackled a highly challenging and, to the best of our knowl-
edge, previously unaddressed problem of learning from data streams under class
imbalance and high feature space dimensionality. We have presented LDPIM –
a novel low-dimensional representation learning algorithm designed for imbal-
anced data streams. It optimizes the search for a new subspace that can effec-
tively represent data, while maximizing its discriminative power. We achieve this
by using the Renyi’s quadratic entropy to evaluate the usefulness of potential
subspaces. We additionally compute the individual impact of each class, taking
into account the disproportions in their distributions and using this information
to weight entropy computations. This allowed us to find theoretically-justified
skew-insensitive projections that alleviate the need for any resampling or cost-
sensitive learning. By looking for discriminative low-dimensional representations,
we were able to enhance the predictive power of examined classifiers without the
need of using additional pre-processing methods.
Our claims were confirmed by an extensive empirical evaluation on 576 gener-
ated streaming benchmarks and three challenging real-world data sets. We have
shown that LDPIM can be seamlessly integrated into any streaming ensemble.
References
1. Anupama, N., Jena, S.: A novel approach using incremental oversampling for data
stream mining. Evolving Systems 10(3), 351–362 (2019)
2. Bonab, H.R., Can, F.: GOOWE: geometrically optimum and online-weighted en-
semble classifier for evolving data streams. ACM TKDD 12(2), 25:1–25:33 (2018)
3. Brzezinski, D., Stefanowski, J.: Reacting to different types of concept drift: The
accuracy updated ensemble algorithm. IEEE Trans. Neural Netw. Learning Syst.
25(1), 81–94 (2014)
4. Cano, A., Krawczyk, B.: Kappa updated ensemble for drifting data stream mining.
Machine Learning 109(1), 175–218 (2020)
5. Czarnecki, W.M., Józefowicz, R., Tabor, J.: Maximum entropy linear manifold for
learning discriminative low-dimensional representation. In: Machine Learning and
Knowledge Discovery in Databases - European Conference, ECML PKDD 2015,
Porto, Portugal, September 7-11, 2015, Proceedings, Part I. pp. 52–67 (2015)
6. Czarnecki, W.M., Tabor, J.: Multithreshold entropy linear classifier: Theory and
applications. Expert Syst. Appl. 42(13), 5591–5606 (2015)
7. Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F.: Learning from Imbalanced Data Sets. Springer (2018). https://doi.org/10.1007/978-3-319-98074-4
8. Gomes, H.M., Bifet, A., Read, J., Barddal, J.P., Enembreck, F., Pfahringer, B.,
Holmes, G., Abdessalem, T.: Adaptive random forests for evolving data stream
classification. Machine Learning 106(9-10), 1469–1495 (2017)
9. Karampatziakis, N., Mineiro, P.: Discriminative features via generalized eigenvec-
tors. In: Proceedings of the 31th International Conference on Machine Learning,
ICML 2014, Beijing, China, 21-26 June 2014. pp. 494–502 (2014)
10. Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble
learning for data stream analysis: A survey. Information Fusion 37, 132–156 (2017)
11. Li, Z., Liu, J., Lu, H.: Structure preserving non-negative matrix factorization for di-
mensionality reduction. Computer Vision and Image Understanding 117(9), 1175–
1189 (2013)
12. Ren, S., Zhu, W., Liao, B., Li, Z., Wang, P., Li, K., Chen, M., Li, Z.: Selection-
based resampling ensemble algorithm for nonstationary imbalanced stream data
learning. Knowl.-Based Syst. 163, 705–722 (2019)
13. Wang, S., Minku, L.L., Yao, X.: A systematic study of online class imbalance
learning with concept drift. IEEE Trans. Neural Netw. Learning Syst. 29(10),
4802–4821 (2018)
14. Wang, Y., Ramanan, D., Hebert, M.: Learning to model the tail. In: Guyon, I.,
von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N.,
Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual
Conference on Neural Information Processing Systems 2017, December 4-9, 2017,
Long Beach, CA, USA. pp. 7029–7039 (2017)
15. Wang, Z., Kong, Z., Chandra, S., Tao, H., Khan, L.: Robust high dimensional
stream classification with novel class detection. In: 35th IEEE International Con-
ference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. pp.
1418–1429 (2019)
16. Yan, Y., Yang, T., Yang, Y., Chen, J.: A framework of online learning with im-
balanced streaming data. In: Proceedings of the Thirty-First AAAI Conference
on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. pp.
2817–2823 (2017)