
Low-Dimensional Representation Learning from Imbalanced Data Streams

Lukasz Korycki and Bartosz Krawczyk

Department of Computer Science
Virginia Commonwealth University
Richmond, VA, USA
{koryckil, bkrawczyk}@vcu.edu

Abstract. Learning from data streams is among the contemporary challenges in the machine learning domain, which is frequently plagued by the class imbalance problem. In non-stationary environments, ratios among classes, as well as their roles (majority and minority), may change over time. Class imbalance is usually alleviated by balancing classes with resampling. However, this suffers from limitations, such as a lack of adaptation to concept drift and the possibility of shifting the true class distributions. In this paper, we propose a novel ensemble approach, where each new base classifier is built using a low-dimensional embedding. We use a class-dependent entropy linear manifold to find the most discriminative low-dimensional representation that is, at the same time, skew-insensitive. This allows us to address two challenging issues: (i) learning efficient classifiers from imbalanced and drifting streams without data resampling; and (ii) tackling simultaneously high-dimensional and imbalanced streams that pose extreme challenges to existing classifiers. Our proposed low-dimensional representation algorithm is a flexible plug-in that can work with any ensemble learning algorithm, making it a highly useful tool for difficult scenarios of learning from high-dimensional, imbalanced, and drifting data streams.

Keywords: Machine learning · Data stream mining · Class imbalance · Concept drift · Low-dimensional representation

1 Introduction

We define a data stream as a sequence $\langle S_1, S_2, \ldots, S_n, \ldots \rangle$, in which each element $S_j$ is a collection of instances (batch scenario) or a single instance (online scenario). Each instance in the stream is independent and randomly drawn from a stationary probability distribution $\Psi_j(\mathbf{x}, y)$. In this paper, we consider the supervised learning scenario, which allows us to define each element as $S_j \sim p_j(x_1, \cdots, x_d, y) = p_j(\mathbf{x}, y)$, where $p_j(\mathbf{x}, y)$ is the joint distribution of the $j$-th instance, defined by a $d$-dimensional feature space and a class label $y$. Data streams are also subject to a phenomenon known as concept drift [13], where the properties of a stream evolve over time. Furthermore, two challenging problems connected with data streams are class imbalance [7] and high dimensionality of the feature space [15]. These two problems have been analyzed disjointly, but in realistic scenarios they may appear together [7]. Therefore, there is a need for novel methods that can deal with both of these issues simultaneously, while being robust to concept drift.

In this paper, we propose a novel approach for learning low-dimensional representations of data streams. It is based on finding an optimal projection of the data into a subspace that maximizes Rényi's quadratic entropy per class. By weighting the class representations when searching for an optimal low-dimensional manifold, we ensure that the new projection is not only highly discriminative, but also skew-insensitive. We show that the proposed technique is a universal and flexible plug-in that can be used with any ensemble algorithm created for data streams. Additionally, we highlight and address two important shortcomings of existing methods: (i) reference low-dimensional projections cannot handle an increasing imbalance ratio; and (ii) existing resampling and cost-sensitive algorithms for imbalanced data streams fail with increasing dimensionality of the feature space. We show that finding a discriminative and skew-insensitive projection may actually lead to better separation between classes and outperform resampling algorithms. The main contributions of this paper are as follows.

– First low-dimensional embedding for imbalanced data streams. A novel approach to tackling the extremely challenging scenario of mining imbalanced, high-dimensional, and drifting data streams.
– Theoretically and practically sound streaming low-dimensional embedding. An efficient and theoretically grounded entropy-based low-dimensional embedding that can be easily used with most streaming ensembles.
– Extensive empirical evaluation of the proposed method. We carry out a thorough experimental study on 576 diverse generators and three difficult real-world data streams.

2 Low-Dimensional Representation

General idea. High-dimensional data may be challenging for many machine learning algorithms, due to a phenomenon known as the curse of dimensionality. Therefore, a projection from $\mathbb{R}^d$ to $\mathbb{R}^k$, where $d \gg k$, is highly attractive for both classification and visualization purposes. There are two main approaches for achieving this: linear projections or more complex embeddings.

Linear projections aim to find a matrix $V \in \mathbb{R}^{d \times k}$ which, for a given data set $X \in \mathbb{R}^{d \times N}$, provides a projection $V^T X$ preserving as much of the original data characteristics as possible. This is measured by a selected information measure (which we will generally denote as $IM$), together with a set of constraints $\varphi_i$. We can define a projection for obtaining a low-dimensional representation as:

$$\max_{V \in \mathbb{R}^{d \times k}} IM(V^T X; X, Y), \tag{1}$$

subject to $\varphi_i(V)$, $i = 1, \cdots, m$, where $Y$ is a set of additional information used during the projection, such as class labels.
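To make Eq. (1) concrete, the short Python sketch below (our illustration, not part of the proposed method; the helper name pca_projection is ours) instantiates it with PCA: the information measure is the variance retained after projection, and the constraint is orthonormality of $V$.

```python
import numpy as np

def pca_projection(X, k):
    """PCA as an instance of Eq. (1): IM = retained variance, phi: V^T V = I.

    X is an (N, d) array with instances in rows; returns V (d, k) and the
    k-dimensional projection of the centered data.
    """
    Xc = X - X.mean(axis=0)                      # center the data
    # The top-k right singular vectors of Xc are eigenvectors of cov(X);
    # they maximize tr(V^T cov(X) V) subject to V^T V = I.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                                 # orthonormal d x k basis
    return V, Xc @ V                             # projection V^T X, row-wise
```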


Benefits of low-dimensional representation for imbalanced data streams. The disproportion between classes is not the sole source of learning difficulties. As long as classes are well separated, skewed distributions will not affect the classifier. However, imbalanced data is often accompanied by a number of other learning problems [7]. We argue that by using a discriminative, low-dimensional embedding one may achieve similar or better results than when using resampling, without risking its drawbacks. In this paper, we propose a skew-insensitive embedding that is able to learn optimal subspaces that improve separation between classes and alleviate the need for any resampling or cost-sensitive learning. Additionally, as we generate a new embedding for each incoming chunk of data, we are able to adapt to concept drift.

3 Low-Dimensional Projection for Imbalanced Data

Initial assumptions. In this work, we aim at a projection that finds a low-dimensional representation offering the best discrimination between the minority ($X_{min}$) and majority ($X_{maj}$) classes. This can be done using the Cauchy-Schwarz divergence ($D_{CS}(\cdot,\cdot)$) for measuring the discriminative power, which allows us to rewrite Eq. 1 as:

$$\max_{V \in \mathbb{R}^{d \times k}} D_{CS}([[V^T X_{min}]], [[V^T X_{maj}]]), \tag{2}$$

subject to $V^T V = I$, where $[[\cdot]]$ stands for a density estimator.

In order to apply a low-dimensional projection to data streams, we need to choose a proper information measure. In this work, we decided to use Rényi's quadratic entropy [5], as it has two main advantages for streaming data: (i) low computational complexity; and (ii) confirmed high usefulness for classification tasks. For a density $f$ on $\mathbb{R}^k$, we can express it as:

$$H_2(f) = -\log \int_{\mathbb{R}^k} f^2(x)\,dx. \tag{3}$$

For computing Rényi's quadratic entropy on actual data, one needs to select a density estimator and solve the optimization problem of finding an orthonormal base $V$ of a $k$-dimensional subspace that maximizes $H_2([[V^T X]])$. It has been shown that maximization of Rényi's quadratic entropy leads to selecting a representation with a high spread, while its minimization offers a condensed representation.

For the problem of imbalanced data, we can express the class-dependent Cauchy-Schwarz divergence between the minority and majority classes using the mentioned Rényi's quadratic entropy ($H_2$) and Rényi's quadratic cross-entropy ($H^{\times}_2$):

$$
\begin{aligned}
D_{CS}(V) &= \log \int [[V^T X_{min}]]^2 + \log \int [[V^T X_{maj}]]^2 - 2\log \int [[V^T X_{min}]][[V^T X_{maj}]] \\
&= -H_2([[V^T X_{min}]]) - H_2([[V^T X_{maj}]]) + 2H^{\times}_2([[V^T X_{min}]], [[V^T X_{maj}]]).
\end{aligned} \tag{4}
$$


Objective function for the projection. Let us now formulate the objective function and its gradient for the Rényi's quadratic entropy-based projection, so that it can be solved using any first-order optimization method. We want to project the entire data set $X$ on $V$ and compute $D_{CS}(V)$ with kernel density estimators of the obtained projection:

$$G^{-1}(V)[[V^T X_{min}]] \quad \text{and} \quad G^{-1}(V)[[V^T X_{maj}]], \tag{5}$$

where $G(V) = V^T V$ stands for the grassmannian. We look for a projection space $V$ that maximizes $D_{CS}$. Such projections do not depend on affine transformations of the data, thus allowing us to restrict the analogous formulas for the sets $V^T X_{min}$ and $V^T X_{maj}$ to such $V$ that consist of linearly independent vectors.

Additionally, we want our projection to be skew-insensitive and to improve the separation between the minority and majority classes. We obtain this by weighting the classes with $W_{min}$ and $W_{maj}$, which stand for the weights assigned to the minority and majority objects, respectively. By assigning higher weights to the minority class, we obtain a better representation after the low-dimensional projection, one that alleviates the class imbalance problem. Therefore, in order to maximize $D_{CS}$ while being skew-insensitive, we need to compute the gradient of the following function:

$$
\begin{aligned}
D_{CS}(V_{im}) &= D_{CS}([[V^T X_{min} W_{min}]], [[V^T X_{maj} W_{maj}]]) \\
&= \log \int [[V^T X_{min} W_{min}]]^2 + \log \int [[V^T X_{maj} W_{maj}]]^2 \\
&\quad - 2\log \int [[V^T X_{min} W_{min}]][[V^T X_{maj} W_{maj}]],
\end{aligned} \tag{6}
$$

where $V_{im}$ stands for a skew-insensitive projection and we consider only linearly independent vectors. In order to avoid numerical instabilities, a penalty factor can be added [6]:

$$D_{CS}(V_{im}) - \|V^T V - I\|^2, \tag{7}$$

which is used to penalize non-orthonormal $V$'s.

This allows us to formulate the proposed Low-Dimensional Projection for Imbalanced Data ($LDP_{IM}$) objective function used to select the best possible low-dimensional projection that alleviates the effect of class imbalance:

$$LDP_{IM} = D_{CS}(V_{im}) - \|V^T V - I\|^2. \tag{8}$$

Additionally, we need to be able to compute the gradient of the objective function $\nabla LDP_{IM}$. For the second term, we can compute this as:

$$\nabla \|V^T V - I\|^2 = 4VV^T V - 4V. \tag{9}$$

Let us now present how we can compute the gradient of the first term. For this, we will need to compute the product of the kernel densities of two sets [6]. For notational simplicity, let us assume that set $A$ stands for the minority class $X_{min}$ and set $B$ for the majority class $X_{maj}$. We can estimate the kernel density of a given set with the Gaussian kernel:

$$[[A]] = \frac{1}{|A|}\sum_{a \in A} \mathcal{N}(a, \Sigma_A), \tag{10}$$

where $\Sigma_A = (h^{\gamma}_A)^2 \mathrm{cov}_A$, $h^{\gamma}_A = \gamma\left(\frac{4}{k+2}\right)^{1/(k+4)}|A|^{-1/(k+4)}$, and $\gamma$ is a scaling hyperparameter.

We need to obtain $\int [[A]][[B]]$, which can be calculated from the following:

$$\int \mathcal{N}(a, \Sigma_A)\,\mathcal{N}(b, \Sigma_B) = \mathcal{N}(a - b, \Sigma_A + \Sigma_B)(0), \tag{11}$$

leading to:

$$\int [[A]][[B]] = \frac{1}{|A||B|}\sum_{w \in A-B} \mathcal{N}(w, \Sigma_A + \Sigma_B)(0) = \frac{1}{(2\pi)^{k/2}\det^{1/2}(\Sigma_{AB})\,|A||B|}\sum_{w \in A-B} \exp\left(-\frac{1}{2}\|w\|^2_{\Sigma_{AB}}\right), \tag{12}$$

where $A - B = \{a - b : a \in A,\ b \in B\}$ and $\Sigma_{AB} = (h^{\gamma}_A)^2\mathrm{cov}_A + (h^{\gamma}_B)^2\mathrm{cov}_B$.

As we deal with a sequence of linearly independent vectors $V = [V_1, \cdots, V_k] \in \mathbb{R}^{d \times k}$, we may use the following:

$$\Sigma_{AB}(V) = V^T \Sigma_{AB} V \quad \text{and} \quad S_{AB}(V) = \Sigma_{AB}(V)^{-1}, \tag{13}$$

where $\Sigma_{AB}(V)$ and $S_{AB}(V)$ are square symmetric matrices storing the properties of a given projection onto the space $V$.

Let us now calculate:

$$\phi_{AB}(V) = \frac{1}{(2\pi)^{k/2}\det^{1/2}(\Sigma_{AB}(V))\,|A||B|}, \tag{14}$$

and compute the gradient of this function as:

$$\nabla\phi_{AB}(V) = -\phi_{AB}(V)\cdot\Sigma_{AB}\cdot V\cdot S_{AB}(V). \tag{15}$$

To compute the final formula for the first term of $\nabla LDP_{IM}$, we need to calculate the gradient of the function $V \to \det(\Sigma_{AB}(V))$, which is given by the following formula:

$$\nabla\det(\Sigma_{AB}(V)) = 2\det(V^T\Sigma_{AB}V)\cdot\Sigma_{AB}V(V^T\Sigma_{AB}V)^{-1}. \tag{16}$$

We define the information potential function:

$$\psi^w_{AB}(V) = \exp\left(-\frac{1}{2}\|V^T w\|^2_{\Sigma_{AB}(V)}\right), \tag{17}$$

where, for an arbitrarily set value of the parameter $w$, we are able to compute its gradient as follows:

$$\nabla\psi^w_{AB}(V) = -\psi^w_{AB}(V)\cdot\left(ww^T V S_{AB}(V) - \Sigma_{AB}(V)S_{AB}(V)V^T ww^T V S_{AB}(V)\right). \tag{18}$$

Finally, we need to define the cross information potential function (between $A$ and $B$) and its gradient:

$$ip^{\times}_{AB} = \phi_{AB}(V)\sum_{w \in A-B}\psi^w_{AB}(V), \tag{19}$$

$$\nabla ip^{\times}_{AB} = \phi_{AB}(V)\sum_{w \in A-B}\nabla\psi^w_{AB}(V) + \left(\sum_{w \in A-B}\psi^w_{AB}(V)\right)\cdot\nabla\phi_{AB}(V). \tag{20}$$

This allows us to get back to $D_{CS}(V_{im})$ and rewrite it as:

$$D_{CS}(V_{im}) = \log(ip^{\times}_{X_{min}X_{min}}(V)) + \log(ip^{\times}_{X_{maj}X_{maj}}(V)) - 2\log(ip^{\times}_{X_{min}X_{maj}}(V)), \tag{21}$$

and calculate its gradient as:

$$\nabla D_{CS}(V_{im}) = \frac{\nabla ip^{\times}_{X_{min}X_{min}}(V)}{ip^{\times}_{X_{min}X_{min}}(V)} + \frac{\nabla ip^{\times}_{X_{maj}X_{maj}}(V)}{ip^{\times}_{X_{maj}X_{maj}}(V)} - \frac{2\,\nabla ip^{\times}_{X_{min}X_{maj}}(V)}{ip^{\times}_{X_{min}X_{maj}}(V)}. \tag{22}$$

After these steps, we may properly formulate the $LDP_{IM}$ objective function and its gradient as:

$$LDP_{IM}(V) = D_{CS}(V_{im}) - \|V^T V - I\|^2, \tag{23}$$

$$\nabla LDP_{IM}(V) = \nabla D_{CS}(V_{im}) - (4VV^T V - 4V). \tag{24}$$

These equations can be used as input to any first-order optimization method to find a new $k$-dimensional projection that serves as a discriminative, skew-insensitive, and low-dimensional projection of the original feature space.
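To illustrate how these equations feed a first-order optimizer, the sketch below (a toy-scale illustration, not the paper's implementation) maximizes the penalized objective with scipy's L-BFGS-B using finite-difference gradients rather than the analytic gradient of Eq. (24), and reuses d_cs from the earlier sketch. Our reading of the class weights in Eq. (6) as a rescaling of the projected points, and the exact weight assignment derived from the inverse imbalance ratio, are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def ldp_im_objective(v_flat, Xmin, Xmaj, w_min, w_maj, k):
    """Negative Eq. (23) on the projected, class-weighted data (assumed reading)."""
    d = Xmin.shape[1]
    V = v_flat.reshape(d, k)
    A = w_min * (Xmin @ V)                  # weighted minority projection
    B = w_maj * (Xmaj @ V)                  # weighted majority projection
    penalty = np.linalg.norm(V.T @ V - np.eye(k)) ** 2   # Eq. (7)
    return -(d_cs(A, B) - penalty)          # minimize the negative to maximize

def fit_projection(Xmin, Xmaj, k, seed=0):
    """Find a skew-insensitive projection V in R^{d x k} (toy-scale sketch)."""
    d = Xmin.shape[1]
    # Inverse imbalance ratio weighting (Sec. 4.2); the exact assignment is
    # our assumption: the majority class is downweighted.
    w_min, w_maj = 1.0, len(Xmin) / len(Xmaj)
    rng = np.random.default_rng(seed)
    V0, _ = np.linalg.qr(rng.standard_normal((d, k)))    # orthonormal start
    res = minimize(ldp_im_objective, V0.ravel(),
                   args=(Xmin, Xmaj, w_min, w_maj, k),
                   method="L-BFGS-B")       # finite-difference gradients here
    return res.x.reshape(d, k)
```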

Embedding $LDP_{IM}$ into ensembles for data streams. The proposed $LDP_{IM}$ can be seamlessly embedded into any chunk-based ensemble learning algorithm dedicated to data streams [10]. The general idea of such ensembles is to train a new base classifier on the most recent chunk of data and use it to replace the most incompetent classifier in the pool. We propose to use $LDP_{IM}$ as a flexible plug-in to any such ensemble, performing the low-dimensional embedding when a new chunk of data arrives and then training a new classifier in the reduced feature space, as in the sketch below. This will not only make any ensemble robust to class imbalance, but also improve the predictive power and training speed of the base classifiers.
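A minimal sketch of this plug-in scheme follows (our illustration: scikit-learn decision trees stand in for the Hoeffding Trees used later in the experiments, the oldest member is dropped instead of the least competent one, and fit_projection comes from the earlier sketch).

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

class LDPChunkEnsemble:
    """Chunk-based ensemble with one LDP_IM projection per base classifier."""

    def __init__(self, base=None, max_size=10, k_frac=0.05):
        self.base = base or DecisionTreeClassifier(max_depth=10)
        self.max_size = max_size
        self.k_frac = k_frac                  # k = 0.05 d, as in Sec. 4.2
        self.members = []                     # list of (V, classifier) pairs

    def partial_fit_chunk(self, X, y):
        """Embed the new chunk with LDP_IM, then train a new base classifier."""
        k = max(1, int(self.k_frac * X.shape[1]))
        V = fit_projection(X[y == 1], X[y == 0], k)   # assumes minority is y == 1
        clf = clone(self.base).fit(X @ V, y)
        self.members.append((V, clf))
        if len(self.members) > self.max_size:
            self.members.pop(0)               # simplification: drop the oldest

    def predict_score(self, X):
        """Average vote for the minority class across the members."""
        votes = [clf.predict(X @ V) for V, clf in self.members]
        return np.mean(votes, axis=0)

    def predict(self, X):
        return (self.predict_score(X) >= 0.5).astype(int)
```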

Adaptation to concept drift. As $LDP_{IM}$ is run independently on each data chunk, it has two interesting effects on the underlying ensemble. Firstly, it allows $LDP_{IM}$ to adapt to concept drift, as each embedding is computed on the most recent data. Secondly, each base classifier is trained on a different embedding, thus positively impacting the diversity of the ensemble.


4 Experimental study

This experimental study was designed to answer the following research questions.

RQ1: Does the proposed low-dimensional embedding offer robustness to class imbalance by providing a better representation of the minority class, and can it outperform state-of-the-art low-dimensional projection algorithms?

RQ2: Is $LDP_{IM}$ flexible enough to work with a variety of ensemble learning algorithms designed for drifting data streams?

RQ3: Does using the skew-insensitive low-dimensional representation for high-dimensional data streams offer better discriminative power than resampling and cost-sensitive solutions?

RQ4: Is the proposed $LDP_{IM}$ capable of outperforming reference methods on real-world data streams with various combinations of feature space dimensionality and imbalance ratio?

4.1 Data stream benchmarks

For the purpose of evaluating our proposed algorithm, we generated 576 diverse and large-scale data stream benchmarks using MOA. We used four generators to create binary imbalanced streams, each with a number of features in [50, 100, 250, 500, 1000, 5000], an imbalance ratio in [10, 30, 50, 80, 100, 150], and with four types of drift [no drift, sudden, gradual, incremental]. Exhaustive combinations of these factors lead to the creation of 576 data streams with 1M-5M instances each. Additionally, we use three real-world data streams: CIFAR-100, ImageNet, and SUN-397, transformed into multi-class imbalanced problems [14]. Their properties are given in Tab. 1.

Table 1: Properties of data generators that were used to create 576 imbalanced data stream benchmarks and three real-world data sets.

Dataset       Instances   Features    Classes   IR         Drift
Aggrawal      1 000 000   50 – 5000   2         10 – 150   n/s/g/i
Hyperplane    1 000 000   50 – 5000   2         10 – 150   n/s/g/i
RBF           5 000 000   50 – 5000   2         10 – 150   n/s/g/i
RandomTree    2 000 000   50 – 5000   2         10 – 150   n/s/g/i
CIFAR-100        60 000   1024        100       50         unknown
ImageNet      1 200 000   4096        200       50         unknown
SUN-397         108 753   1024        397       137        unknown

4.2 Set-up

Reference algorithms for low-dimensional representation. We have selected three state-of-the-art reference methods for obtaining low-dimensional projections: (i) Per-Class Principal Component Analysis (pPCA); (ii) Structure-Preserving Non-Negative Matrix Factorization (NMF) [11]; and (iii) Discriminative Learning using Generalized Eigenvectors (GEM) [9].

Reference algorithms for handling class imbalance. We have selected three state-of-the-art reference methods for handling imbalanced data streams: (i) Incremental Oversampling for Data Streams (IOSDS) [1]; (ii) undersampling via Selection-Based Resampling (SRE) [12]; and (iii) Online Multiple Cost-Sensitive Learning (OMCSL) [16].

Fig. 1: Comparison of $LDP_{IM}$ and reference algorithms for low-dimensional representation (pPCA, NMF, GEM) over four different ensemble architectures (KUE, ARF, GOOWE, AUE). Results presented with respect to the number of wins (green), ties (yellow), and losses (red) over 576 data streams. A tie was considered when McNemar's test rejected the significance of the difference between tested algorithms.

Ensemble learning algorithms. The proposed $LDP_{IM}$ is a flexible plug-in that can be used with any ensemble learning algorithm. Therefore, to evaluate its interplay with various ensembles, we have selected recent and popular architectures designed for streaming data: (i) Kappa Updated Ensemble (KUE) [4]; (ii) Adaptive Random Forest (ARF) [8]; (iii) Geometrically Optimum and Online-Weighted Ensemble (GOOWE) [2]; and (iv) Accuracy Updated Ensemble (AUE) [3]. Each of them worked in a block-based mode, with 10 base classifiers maintained. We used Hoeffding Trees as base learners.

Parameters. $LDP_{IM}$ uses the inverse imbalance ratio for setting the weights in Eq. 6. All algorithms for low-dimensional representation use $k = 0.05d$, so they reduce the input feature space by 95%. All of them were tuned with the parameters and procedures suggested by their authors.

Evaluation metrics. As we deal with imbalanced and drifting data streams, we evaluated the examined algorithms using prequential AUC [7].

Windows. We used a window size $\omega = 1000$ for calculating the prequential metrics and training new classifiers.
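In code, block-based prequential evaluation amounts to a test-then-train loop over $\omega$-sized chunks; a sketch reusing the ensemble class above (our illustration) is given below.

```python
from sklearn.metrics import roc_auc_score

def prequential_auc(chunks, ensemble):
    """Test-then-train over a stream of (X, y) chunks of size omega.

    Assumes every evaluated chunk contains instances of both classes.
    """
    aucs = []
    for i, (X, y) in enumerate(chunks):
        if i > 0:                             # the first chunk only trains
            aucs.append(roc_auc_score(y, ensemble.predict_score(X)))
        ensemble.partial_fit_chunk(X, y)      # then adapt to the newest chunk
    return aucs
```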

Statistical analysis. We used McNemar's test for pairwise comparisons and the Bonferroni-Dunn test for multiple comparisons.

4.3 Experiment 1: Low-dimensional representations

General comparison. Fig. 1 offers a detailed comparison of the proposed method with three reference low-dimensional embedding methods using four underlying ensemble architectures over 576 diverse data stream benchmarks. We can see that $LDP_{IM}$ is capable of outperforming the reference embeddings regardless of the underlying ensemble architecture. This shows not only that the proposed method is able to create more discriminative subspaces (RQ1), but also that it is highly flexible with regard to the utilized classification scheme and can be used as a general-purpose plug-in (RQ2). This is further confirmed by the Bonferroni-Dunn test (see Fig. 5), proving that the differences between the proposed $LDP_{IM}$ and the reference algorithms are statistically significant.

Analysis of robustness to high dimensionality and class imbalance. Fig. 2 shows a detailed analysis of the robustness of each of the examined projection methods to increasing dimensionality and imbalance ratio.

Fig. 2: Analysis of the relationship between prequential AUC and increasing imbalance ratio / dimensionality for $LDP_{IM}$ and reference algorithms for low-dimensional representation. KUE used as the ensemble method.

Fig. 3: Comparison of $LDP_{IM}$ and reference algorithms for imbalanced data streams (IOSDS, SRE, OMCSL) over four different ensemble architectures (KUE, ARF, GOOWE, AUE). Results presented with respect to the number of wins (green), ties (yellow), and losses (red) over 576 data streams. A tie was considered when McNemar's test rejected the significance of the difference between tested algorithms.

As expected, all four algorithms can handle data streams with even up to thousands of features. $LDP_{IM}$ shows some improvement over the reference methods when facing thousands of features, which can be attributed to using weighted entropies from both classes. Things are different when we analyze an increasing imbalance ratio. Here, $LDP_{IM}$ clearly outperforms all reference methods, showing that they are not capable of creating discriminative projections when dealing with highly skewed distributions.

4.4 Experiment 2: Skew-insensitive algorithms

General comparison. Fig. 3 offers a detailed comparison of the proposed method with three reference resampling and cost-sensitive methods utilizing four underlying ensemble architectures over 576 diverse data stream benchmarks. The obtained results show that by learning a skew-insensitive representation via $LDP_{IM}$ we may be highly competitive with traditional solutions created for handling class imbalance. Furthermore, we achieve this without artificially inflating the size of the data set or the need for meta-tuning of the cost parameter. This is further confirmed by the Bonferroni-Dunn test (see Fig. 5).

Fig. 4: Analysis of the relationship between prequential AUC and increasing imbalance ratio / dimensionality for $LDP_{IM}$ and reference algorithms for handling imbalanced data streams. KUE used as the ensemble method.

Fig. 5: The Bonferroni-Dunn tests for the comparison among methods for low-dimensional representations and handling class imbalance.

Analysis of robustness to high dimensionality and class imbalance. Fig. 4 shows a detailed analysis of the robustness of each of the examined imbalance alleviation methods to increasing dimensionality and imbalance ratio. These results show a weakness of the existing solutions, as neither resampling nor cost-sensitive learning can work under high dimensionality of the feature space. $LDP_{IM}$ offers comparable performance when the dimensionality is low, and excellent robustness up to even thousands of features. As for the imbalance ratio, all examined methods can adapt to an increasing disproportion between classes, but only $LDP_{IM}$ can handle streams that are both imbalanced and high-dimensional (RQ3). It is important to note that $LDP_{IM}$ also offers excellent performance when dealing with lower imbalance ratios or smaller feature dimensionality, making it a very flexible solution for learning from difficult data streams.

4.5 Experiment 3: Evaluation on real-world data streams

Finally, we want to evaluate the performance of the proposed $LDP_{IM}$ on real-world data. Tab. 2 presents the obtained results according to the prequential AUC metric for the four examined ensemble classifiers used as base learners.

Table 2: Results according to prequential multi-class AUC on real-world imbalanced data sets for all reference methods and four ensemble classifiers used as base learners.

            LDP-IM      pPCA        NMF         GEM         IOSDS       SRE         OMCSL
KUE
CIFAR-100   82.72±2.38  56.26±9.73  60.23±9.18  64.34±8.56  65.83±7.02  66.11±6.45  65.23±7.98
ImageNet    41.07±5.09  16.54±8.83  17.99±9.02  20.04±7.81  28.94±6.17  31.43±6.93  32.80±5.98
SUN-397     38.28±6.79   9.28±5.99  11.26±7.03  12.36±8.04  17.29±8.92  18.92±7.48  20.01±9.01
ARF
CIFAR-100   80.19±2.72  51.54±9.29  59.57±9.33  62.41±8.07  62.95±7.38  63.83±6.18  64.01±7.29
ImageNet    40.03±6.11  14.99±9.02  17.58±9.31  21.93±7.48  27.91±7.03  30.52±7.28  31.06±6.77
SUN-397     36.28±7.28   8.62±6.58  10.24±7.33  11.64±7.58  16.03±9.04  17.88±8.04  18.84±9.62
GOOWE
CIFAR-100   78.03±2.99  50.21±9.80  53.07±9.94  56.61±9.17  58.29±7.78  60.06±6.92  60.92±8.64
ImageNet    36.86±5.89  11.72±8.14  13.58±8.22  16.22±8.01  24.17±8.14  26.77±7.94  27.94±7.80
SUN-397     32.09±4.99   5.89±3.44   7.09±2.99   7.82±3.88  11.57±5.01  13.39±4.78  15.28±5.02
AUE
CIFAR-100   79.44±2.76  51.95±9.41  55.11±9.38  58.92±8.15  60.02±7.76  62.17±6.35  63.88±7.48
ImageNet    37.93±4.81  12.77±6.62  15.02±7.13  18.54±6.96  25.82±7.11  28.44±8.09  28.98±5.49
SUN-397     34.18±2.79   7.28±1.99   8.99±2.52   8.90±3.08  13.44±3.72  16.48±4.08  17.24±4.27

We can see that $LDP_{IM}$ offers superior performance and exceptional stability for all data sets, regardless of the ensemble classifier used. This verifies the high attractiveness of the proposed $LDP_{IM}$ for learning from difficult data streams in various real-world scenarios (RQ4).

5 Conclusions and future works

In this paper, we have tackled a highly challenging and, to the best of our knowledge, previously unaddressed problem of learning from data streams under class imbalance and high feature space dimensionality. We have presented $LDP_{IM}$, a novel low-dimensional representation learning algorithm designed for imbalanced data streams. It optimizes the search for a new subspace that can effectively represent data while maximizing its discriminative power. We achieve this by using Rényi's quadratic entropy to evaluate the usefulness of potential subspaces. We additionally compute the individual impact of each class, taking into account the disproportions in their distributions and using this information to weight the entropy computations. This allowed us to find theoretically justified skew-insensitive projections that alleviate the need for any resampling or cost-sensitive learning. By looking for discriminative low-dimensional representations, we were able to enhance the predictive power of the examined classifiers without the need for additional pre-processing methods.

Our claims were confirmed by an extensive empirical evaluation on 576 generated streaming benchmarks and three challenging real-world data sets. We have shown that $LDP_{IM}$ can be seamlessly integrated into any streaming ensemble.


References

1. Anupama, N., Jena, S.: A novel approach using incremental oversampling for data stream mining. Evolving Systems 10(3), 351–362 (2019)
2. Bonab, H.R., Can, F.: GOOWE: Geometrically optimum and online-weighted ensemble classifier for evolving data streams. ACM TKDD 12(2), 25:1–25:33 (2018)
3. Brzezinski, D., Stefanowski, J.: Reacting to different types of concept drift: The accuracy updated ensemble algorithm. IEEE Trans. Neural Netw. Learning Syst. 25(1), 81–94 (2014)
4. Cano, A., Krawczyk, B.: Kappa updated ensemble for drifting data stream mining. Machine Learning 109(1), 175–218 (2020)
5. Czarnecki, W.M., Józefowicz, R., Tabor, J.: Maximum entropy linear manifold for learning discriminative low-dimensional representation. In: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part I. pp. 52–67 (2015)
6. Czarnecki, W.M., Tabor, J.: Multithreshold entropy linear classifier: Theory and applications. Expert Syst. Appl. 42(13), 5591–5606 (2015)
7. Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F.: Learning from Imbalanced Data Sets. Springer (2018). https://doi.org/10.1007/978-3-319-98074-4
8. Gomes, H.M., Bifet, A., Read, J., Barddal, J.P., Enembreck, F., Pfahringer, B., Holmes, G., Abdessalem, T.: Adaptive random forests for evolving data stream classification. Machine Learning 106(9-10), 1469–1495 (2017)
9. Karampatziakis, N., Mineiro, P.: Discriminative features via generalized eigenvectors. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014. pp. 494–502 (2014)
10. Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble learning for data stream analysis: A survey. Information Fusion 37, 132–156 (2017)
11. Li, Z., Liu, J., Lu, H.: Structure preserving non-negative matrix factorization for dimensionality reduction. Computer Vision and Image Understanding 117(9), 1175–1189 (2013)
12. Ren, S., Zhu, W., Liao, B., Li, Z., Wang, P., Li, K., Chen, M., Li, Z.: Selection-based resampling ensemble algorithm for nonstationary imbalanced stream data learning. Knowl.-Based Syst. 163, 705–722 (2019)
13. Wang, S., Minku, L.L., Yao, X.: A systematic study of online class imbalance learning with concept drift. IEEE Trans. Neural Netw. Learning Syst. 29(10), 4802–4821 (2018)
14. Wang, Y., Ramanan, D., Hebert, M.: Learning to model the tail. In: Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA. pp. 7029–7039 (2017)
15. Wang, Z., Kong, Z., Chandra, S., Tao, H., Khan, L.: Robust high dimensional stream classification with novel class detection. In: 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. pp. 1418–1429 (2019)
16. Yan, Y., Yang, T., Yang, Y., Chen, J.: A framework of online learning with imbalanced streaming data. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. pp. 2817–2823 (2017)