Low-Dimensional Representation Learning from
Imbalanced Data Streams
Lukasz Korycki and Bartosz Krawczyk
Department of Computer Science
Virginia Commonwealth University
Richmond, VA, USA {koryckil,bkrawczyk}
Abstract. Learning from data streams is among the contemporary chal-
lenges in the machine learning domain, which is frequently plagued by the
class imbalance problem. In non-stationary environments, ratios among
classes, as well as their roles (majority and minority) may change over
time. The class imbalance is usually alleviated by balancing classes with
resampling. However, this suffers from limitations, such as a lack of adap-
tation to concept drift and the possibility of shifting the true class dis-
tributions. In this paper, we propose a novel ensemble approach, where
each new base classifier is built using a low-dimensional embedding. We
use class-dependent entropy linear manifold to find the most discrimi-
native low-dimensional representation that is, at the same time, skew-
insensitive. This allows us to address two challenging issues: (i) learning
efficient classifiers from imbalanced and drifting streams without data
resampling; and (ii) tackling simultaneously high-dimensional and imbal-
anced streams that pose extreme challenges to existing classifiers. Our
proposed low-dimensional representation algorithm is a flexible plug-in
that can work with any ensemble learning algorithm, making it a highly
useful tool for difficult scenarios of learning from high-dimensional im-
balanced and drifting data streams.
Keywords: Machine learning · Data stream mining · Class imbalance · Concept drift · Low-dimensional representation
1 Introduction
We define a data stream as a sequence $\langle S_1, S_2, \ldots, S_n, \ldots \rangle$, in which each element $S_j$ is a collection of instances (batch scenario) or a single instance (online scenario). In this paper, we consider the supervised learning scenario, which allows us to define each element as $S_j \sim p_j(x^1, \cdots, x^d, y) = p_j(\mathbf{x}, y)$, where $p_j(\mathbf{x}, y)$ is the joint distribution of the $j$-th element, defined over a $d$-dimensional feature space with class label $y$. Each instance in the stream is independent and randomly drawn from a stationary probability distribution $\Psi_j(\mathbf{x}, y)$. Data streams are also subject to a phenomenon known as concept
drift [13], where the properties of a stream evolve over time. Furthermore, two
challenging problems connected with data streams are class imbalance [7] and
high-dimensionality of the feature space [15]. These two problems have been an-
alyzed disjointly, but in realistic scenarios they may appear together [7]. There-
fore, there is a need for developing novel methods that can simultaneously deal
with both of these issues, while being robust to concept drift.
In this paper, we propose a novel approach for learning low-dimensional rep-
resentations of data streams. It is based on finding an optimal projection of
data into a subspace that maximizes the Renyi’s quadratic entropy per class. By
weighting class representations when searching for an optimal low-dimensional
manifold, we ensure that the new projection is not only highly discriminative,
but also skew-insensitive. We show that the proposed technique is a universal
and flexible plug-in that can be used with any ensemble algorithm created for
data streams. Additionally, we highlight and address two important shortcom-
ings of existing methods: (i) reference low-dimensional projections cannot handle
increasing imbalance ratio; and (ii) existing resampling and cost-sensitive algo-
rithms for imbalanced data streams fail with increasing dimensionality of the
feature space. We show that finding a discriminative and skew-insensitive pro-
jection may actually lead to better separation between classes and outperforming
resampling algorithms. The main contributions of this paper are given as follows.
– First low-dimensional embedding for imbalanced data streams. A novel approach to tackling the extremely challenging scenario of mining imbalanced, high-dimensional and drifting data streams.
– Theoretically and practically sound streaming low-dimensional embedding. An efficient and theoretically grounded entropy-based low-dimensional embedding that can easily be used with most streaming ensembles.
– Extensive empirical evaluation of the proposed method. We carry out a thorough experimental study on 576 diverse generators and three difficult real-world data streams.
2 Low-Dimensional Representation
General idea. High-dimensional data may be challenging for many machine
learning algorithms, due to a phenomenon known as the curse of dimensionality.
Therefore, a projection from $\mathbb{R}^d$ to $\mathbb{R}^k$, where $d \gg k$, is highly attractive for
both classification and visualization purposes. There are two main approaches
for achieving this – linear projections or more complex embeddings.
Linear projections aim to find a matrix $V \in \mathbb{R}^{d \times k}$ which for a given data set $X \in \mathbb{R}^{d \times N}$ provides a projection $V^T X$ preserving as much of the original data characteristics as possible. This is measured by a selected information measure (which we will generally denote as $IM$), as well as a set of constraints $\varphi_i$. We can define a projection for obtaining a low-dimensional representation as:

$\arg\max_{V \in \mathbb{R}^{d \times k}} IM(V^T X;\ X, Y)$   (1)

subject to $\varphi_i(V),\ i = 1, \cdots, m$, where $Y$ is a set of additional information used during the projection, such as class labels.
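As a concrete instance of Eq. 1, taking $IM$ to be the variance of the projected data and the constraint to be orthonormality of $V$ recovers PCA. A minimal numpy sketch (our illustration, not part of the paper's method):

```python
import numpy as np

def variance_maximizing_projection(X, k):
    """An instance of Eq. 1: arg max over orthonormal V of the total
    variance of V^T X -- which is exactly PCA."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc.T))       # ascending eigenvalues
    return vecs[:, np.argsort(vals)[::-1][:k]]      # top-k directions, d x k

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) * np.arange(1, 11)  # feature j has std j+1
V = variance_maximizing_projection(X, k=2)              # 10 -> 2 projection
```

The returned $V$ satisfies the orthonormality constraint $V^T V = I$ by construction, since it consists of eigenvectors of a symmetric matrix.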
Benefits of low-dimensional representation for imbalanced data streams.
The disproportion between classes is not the sole source of learning difficulties.
As long as classes are well separated, skewed distributions are not going to affect the classifier. However, imbalanced data is often accompanied by a number
of other learning problems [7]. We argue that by using a discriminative and
low-dimensional embedding one may achieve similar or better results than when
using resampling, without risking its drawbacks. In this paper, we propose a
skew-insensitive embedding that is able to learn optimal subspaces that im-
prove separation between classes and alleviate the need for any resampling or
cost-sensitive learning. Additionally, as we generate a new embedding for each incoming chunk of data, we are able to adapt to concept drift.
3 Low-Dimensional Projection for Imbalanced Data
Initial assumptions. In this work, we aim at having a projection that is able to find a low-dimensional representation offering the best discrimination between the minority ($X_{min}$) and majority ($X_{maj}$) classes. This can be done using the Cauchy-Schwarz divergence ($D_{CS}(\cdot,\cdot)$) for measuring the discriminative power, which allows us to rewrite Eq. 1 as:

$\arg\max_{V \in \mathbb{R}^{d \times k}} D_{CS}\big([[V^T X_{min}]], [[V^T X_{maj}]]\big)$   (2)

subject to $V^T V = I$, where $[[\cdot]]$ stands for a density estimator.
In order to apply a low-dimensional projection to data streams, we need to
choose a proper information measure. In this work, we decided to use the Renyi’s
quadratic entropy [5], as it has two main advantages for streaming data: (i) low
computational complexity; and (ii) confirmed high usefulness for classification
tasks. We can express it for a density $f$ on $\mathbb{R}^k$ as:

$H_2(f) = -\log \int_{\mathbb{R}^k} f^2(x)\,dx.$   (3)

For computing the Renyi's quadratic entropy on actual data, one needs to select a density estimator and solve an optimization problem of finding an orthonormal base $V$ of a $k$-dimensional subspace that maximizes $H_2([[V^T X]])$. It has been shown that the maximization of the Renyi's quadratic entropy leads to selecting a representation with a high spread, while its minimization offers a condensed representation.
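For intuition, the Renyi's quadratic entropy of a Gaussian KDE admits a closed form via the pairwise information potential, which is what makes it cheap to evaluate on streaming chunks. A minimal 1-D sketch (our illustration; the paper's implementation may differ):

```python
import numpy as np

def gauss(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def renyi_quadratic_entropy(samples, h):
    """H2(f) = -log INT f^2(x) dx for a 1-D Gaussian KDE with bandwidth h.

    Uses the identity INT N(x; a, h^2) N(x; b, h^2) dx = N(a - b; 0, 2h^2),
    so the integral reduces to a double sum over sample pairs."""
    n = len(samples)
    diffs = samples[:, None] - samples[None, :]
    information_potential = gauss(diffs, 2.0 * h * h).sum() / n**2
    return -np.log(information_potential)

# A more spread-out sample yields a higher quadratic entropy.
tight = renyi_quadratic_entropy(np.array([0.0, 0.1, 0.2]), h=0.3)
spread = renyi_quadratic_entropy(np.array([0.0, 2.0, 4.0]), h=0.3)
```

This matches the statement above: the spread-out sample has the larger $H_2$, so maximizing the entropy favors high-spread representations.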
For the problem of imbalanced data, we can express the class-dependent Cauchy-Schwarz divergence for the minority and majority classes using the mentioned Renyi's quadratic entropy ($H_2$) and Renyi's quadratic cross-entropy ($H^{\times}_2$):

$D_{CS}(V) = \log \int [[V^T X_{min}]]^2 + \log \int [[V^T X_{maj}]]^2 - 2\log \int [[V^T X_{min}]]\,[[V^T X_{maj}]]$
$= -H_2([[V^T X_{min}]]) - H_2([[V^T X_{maj}]]) + 2H^{\times}_2([[V^T X_{min}]], [[V^T X_{maj}]]).$   (4)
Objective function for the projection. Let us now formulate the objective function and its gradient for the Renyi's quadratic entropy-based projection, so that it can be solved using any first-order optimization method. We want to project the entire data set $X$ on $V$ and compute $D_{CS}(V)$ with kernel density estimators of the obtained projection:

$[[G^{-1}(V)V^T X_{min}]]$ and $[[G^{-1}(V)V^T X_{maj}]],$   (5)

where $G(V) = V^T V$ stands for the Gramian (Gram matrix). We look for such a projection space $V$ that maximizes $D_{CS}$. Such projections do not depend on affine transformations of the data, thus allowing us to restrict the analogous formulas for the sets $V^T X_{min}$ and $V^T X_{maj}$ to such $V$ that consist of linearly independent vectors.
Additionally, we want our projection to be skew-insensitive and to improve the separation between the minority and majority classes. We obtain this by weighting the classes with $W_{min}$ and $W_{maj}$, which stand for the weights assigned to the minority and majority objects, respectively. By assigning higher weights to the minority class, we obtain a better representation after the low-dimensional projection that alleviates the class imbalance problem. Therefore, in order to maximize $D_{CS}$ while being skew-insensitive, we need to compute the gradient of the following function:

$D_{CS}(V_{im}) = D_{CS}\big([[V^T X_{min} W_{min}]], [[V^T X_{maj} W_{maj}]]\big)$
$= \log \int [[V^T X_{min} W_{min}]]^2 + \log \int [[V^T X_{maj} W_{maj}]]^2 - 2\log \int [[V^T X_{min} W_{min}]]\,[[V^T X_{maj} W_{maj}]],$   (6)

where $V_{im}$ stands for a skew-insensitive projection and we consider only linearly independent vectors. In order to avoid numerical instabilities, a penalty factor can be added [6]:

$D_{CS}(V_{im}) - \|V^T V - I\|^2,$   (7)

which is used to penalize non-orthonormal $V$'s.
This allows us to formulate the proposed Low-Dimensional Projection for Imbalanced Data ($LDP_{IM}$) objective function used to select the best possible low-dimensional projection that alleviates the effect of class imbalance:

$LDP_{IM} = D_{CS}(V_{im}) - \|V^T V - I\|^2.$   (8)

Additionally, we need to be able to compute the gradient of the objective function $LDP_{IM}$. For the second term, we can compute this as:

$\nabla \|V^T V - I\|^2 = 4VV^TV - 4V.$   (9)
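The analytic gradient of the orthonormality penalty is easy to verify against finite differences. A small numpy check (our illustration, not from the paper):

```python
import numpy as np

def penalty(V):
    """||V^T V - I||^2 (squared Frobenius norm), the orthonormality penalty."""
    M = V.T @ V - np.eye(V.shape[1])
    return (M ** 2).sum()

def penalty_grad(V):
    """Analytic gradient from Eq. 9: 4 V V^T V - 4 V."""
    return 4.0 * V @ (V.T @ V) - 4.0 * V

rng = np.random.default_rng(0)
V = rng.standard_normal((5, 2))
G = penalty_grad(V)

# central finite-difference approximation of the same gradient
eps = 1e-6
G_num = np.zeros_like(V)
for i in range(V.shape[0]):
    for j in range(V.shape[1]):
        Vp, Vm = V.copy(), V.copy()
        Vp[i, j] += eps
        Vm[i, j] -= eps
        G_num[i, j] = (penalty(Vp) - penalty(Vm)) / (2 * eps)
```

The two gradients agree to numerical precision, confirming the closed form.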
Let us now present how we can compute the gradient of the first term. For this, we will need to compute the product of kernel densities of two sets [6]. For notational simplicity, let us assume that the set $A$ stands for the minority class $X_{min}$ and the set $B$ for the majority class $X_{maj}$. We can estimate the kernel density of a given set with the Gaussian kernel:

$[[A]] = \frac{1}{|A|} \sum_{a \in A} N(a, \Sigma_A),$   (10)

where $\Sigma_A = (h^{\gamma}_A)^2\,\mathrm{cov}(A)$, $h^{\gamma}_A = \gamma \left(\frac{4}{k+2}\right)^{1/(k+4)} |A|^{-1/(k+4)}$, and $\gamma$ is a scaling factor.

We need to obtain $\int [[A]][[B]]$, which can be calculated from the following:

$\int N(a, \Sigma_A)\, N(b, \Sigma_B) = N(a - b, \Sigma_A + \Sigma_B)(0),$   (11)

leading to:

$\int [[A]][[B]] = \frac{1}{|A||B|} \sum_{w \in A-B} N(w, \Sigma_{A-B})(0),$   (12)

where $A-B = \{a - b : a \in A,\ b \in B\}$ and $\Sigma_{A-B} = (h^{\gamma}_A)^2\,\mathrm{cov}(A) + (h^{\gamma}_B)^2\,\mathrm{cov}(B)$.
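Eq. 11 is the standard Gaussian product identity; a quick 1-D numerical check (our illustration):

```python
import numpy as np

def npdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

a, b = 0.7, -1.2          # kernel centres
va, vb = 0.4, 0.9         # kernel variances (Sigma_A, Sigma_B in 1-D)

# left side: INT N(x; a, va) N(x; b, vb) dx by numerical integration
xs = np.linspace(-12.0, 12.0, 400001)
lhs = np.trapz(npdf(xs, a, va) * npdf(xs, b, vb), xs)

# right side: the closed form of Eq. 11, N(a - b, va + vb) evaluated at 0
rhs = npdf(a - b, 0.0, va + vb)
```

Both sides agree, which is exactly why the cross term of the divergence reduces to a double sum over pairwise differences (Eq. 12).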
As we deal with a sequence of linearly independent vectors $V = [V_1, \cdots, V_k] \in \mathbb{R}^{d \times k}$, we may use the following:

$\Sigma_{A-B}(V) = V^T \Sigma_{A-B} V$ and $S_{A-B}(V) = \Sigma_{A-B}(V)^{-1},$   (13)

where $\Sigma_{A-B}(V)$ and $S_{A-B}(V)$ are square symmetric matrices storing the properties of a given projection onto the space $V$.

Let us now calculate:

$\phi_{A-B}(V) = \frac{1}{(2\pi)^{k/2} \det^{1/2}(\Sigma_{A-B}(V))\, |A||B|},$   (14)

and compute the gradient of this function as:

$\nabla \phi_{A-B}(V) = -\phi_{A-B}(V) \cdot \Sigma_{A-B} \cdot V \cdot S_{A-B}(V).$   (15)

To compute the final formula for the first term of $LDP_{IM}$, we need to calculate the gradient of the function $V \mapsto \det(\Sigma_{A-B}(V))$, which is given by the following:

$\nabla \det(\Sigma_{A-B}(V)) = 2\det(V^T \Sigma_{A-B} V) \cdot \Sigma_{A-B} V (V^T \Sigma_{A-B} V)^{-1}.$   (16)

We define the function for the information potential:

$\psi^w_{A-B}(V) = \exp\left(-\tfrac{1}{2} (V^T w)^T S_{A-B}(V)\, V^T w\right),$   (17)

where for an arbitrarily set value of the $w$ parameter we are able to compute its gradient as follows:

$\nabla \psi^w_{A-B}(V) = \psi^w_{A-B}(V) \cdot \nabla\left(-\tfrac{1}{2} (V^T w)^T S_{A-B}(V)\, V^T w\right).$   (18)

Finally, we need to define the cross information potential function (between $A$ and $B$) and its gradient:

$ip^{\times}_{A-B}(V) = \phi_{A-B}(V) \sum_{w \in A-B} \psi^w_{A-B}(V),$   (19)

$\nabla ip^{\times}_{A-B}(V) = \phi_{A-B}(V) \sum_{w \in A-B} \nabla \psi^w_{A-B}(V) + \left(\sum_{w \in A-B} \psi^w_{A-B}(V)\right) \cdot \nabla \phi_{A-B}(V).$   (20)
This allows us to get back to $D_{CS}(V_{im})$ and rewrite it as:

$D_{CS}(V_{im}) = \log(ip^{\times}_{X_{min}X_{min}}(V)) + \log(ip^{\times}_{X_{maj}X_{maj}}(V)) - 2\log(ip^{\times}_{X_{min}X_{maj}}(V)),$   (21)

and calculate its gradient as:

$\nabla D_{CS}(V_{im}) = \frac{\nabla ip^{\times}_{X_{min}X_{min}}(V)}{ip^{\times}_{X_{min}X_{min}}(V)} + \frac{\nabla ip^{\times}_{X_{maj}X_{maj}}(V)}{ip^{\times}_{X_{maj}X_{maj}}(V)} - 2\frac{\nabla ip^{\times}_{X_{min}X_{maj}}(V)}{ip^{\times}_{X_{min}X_{maj}}(V)}.$   (22)

After these steps, we may properly formulate the $LDP_{IM}$ objective function and its gradient as:

$LDP_{IM}(V) = D_{CS}(V_{im}) - \|V^T V - I\|^2,$   (23)

$\nabla LDP_{IM}(V) = \nabla D_{CS}(V_{im}) - (4VV^TV - 4V).$   (24)

These equations can be used as an input to any first-order optimization method to find a new $k$-dimensional projection that serves as a discriminative, skew-insensitive and low-dimensional projection of the original feature space.
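To make the overall procedure concrete, the following toy sketch scores candidate 1-D projections of 2-D data with the Cauchy-Schwarz divergence of the projected class KDEs and selects the best direction by a simple grid search over angles, standing in for the first-order optimizer (the shared bandwidth, the grid search, and all names are our assumptions, not the authors' implementation):

```python
import numpy as np

def gauss0(w, var):
    """Density of N(0, var) evaluated at w."""
    return np.exp(-w**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def cross_ip(a, b, h):
    """INT [[A]][[B]] for 1-D Gaussian KDEs with shared bandwidth h (cf. Eq. 12)."""
    diffs = a[:, None] - b[None, :]
    return gauss0(diffs, 2 * h * h).mean()

def dcs_1d(a, b, h=0.4):
    """Cauchy-Schwarz divergence between the KDEs of two projected classes."""
    return (np.log(cross_ip(a, a, h)) + np.log(cross_ip(b, b, h))
            - 2 * np.log(cross_ip(a, b, h)))

rng = np.random.default_rng(1)
Xmaj = rng.normal([0, 0], 0.5, size=(200, 2))
Xmin = rng.normal([2, 0], 0.5, size=(20, 2))   # classes differ along axis 0

# grid search over directions, standing in for gradient-based optimization
best_theta = max(np.linspace(0, np.pi, 180, endpoint=False),
                 key=lambda t: dcs_1d(Xmin @ np.array([np.cos(t), np.sin(t)]),
                                      Xmaj @ np.array([np.cos(t), np.sin(t)])))
```

The selected direction aligns with the axis along which the classes actually differ, i.e. the divergence-maximizing projection is the discriminative one.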
Embedding LDPIM into ensembles for data streams. The proposed LDPIM
can be seamlessly embedded into any chunk-based ensemble learning algorithm
dedicated to data streams [10]. Their general idea is based on training a new
base classifier on the most recent chunk of data and using it to replace the most
incompetent classifier in the pool. We propose to use LDPIM as a flexible plug-in
to any such ensemble, performing the low-dimensional embedding when a new
chunk of data arrives and then training a new classifier in the reduced feature
space. This will not only make any ensemble robust to class imbalance, but also
improve the predictive power and speed of training of base classifiers.
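The plug-in pattern described above can be sketched as follows; the embedding and base learner are simple stand-ins (a top-variance feature selector and a nearest-centroid classifier), not LDPIM itself, and all class names are ours:

```python
import numpy as np

class TopVarianceEmbedding:
    """Stand-in for the per-chunk embedding: keeps the k highest-variance features."""
    def __init__(self, k):
        self.k = k
    def fit(self, X, y=None):
        self.idx = np.argsort(X.var(axis=0))[-self.k:]
        return self
    def transform(self, X):
        return X[:, self.idx]

class NearestCentroid:
    """Tiny base learner used only for this sketch."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.centroids = np.stack([X[y == c].mean(axis=0) for c in self.classes])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids[None, :, :])**2).sum(axis=2)
        return self.classes[d.argmin(axis=1)]

class ChunkEnsemble:
    """Each chunk gets its own embedding + base learner; the weakest member
    (measured on the newest chunk) is dropped when the pool is full."""
    def __init__(self, k, max_size=10):
        self.k, self.max_size, self.members = k, max_size, []
    def partial_fit_chunk(self, X, y):
        emb = TopVarianceEmbedding(self.k).fit(X, y)
        clf = NearestCentroid().fit(emb.transform(X), y)
        self.members.append((emb, clf))
        if len(self.members) > self.max_size:
            accs = [(m[1].predict(m[0].transform(X)) == y).mean()
                    for m in self.members]
            self.members.pop(int(np.argmin(accs)))
    def predict(self, X):
        votes = np.stack([clf.predict(emb.transform(X))
                          for emb, clf in self.members])
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

Because each member carries its own embedding, swapping in any other chunk-level projection (such as LDPIM) only changes the object built in `partial_fit_chunk`.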
Adaptation to concept drift. As LDPIM will be run independently on each
data chunk, this will have two interesting effects on the underlying ensemble.
Firstly, it will allow LDPIM to adapt to concept drift, as each embedding will be computed on the most recent data. Secondly, each base classifier will be trained on a different embedding, thus positively impacting the diversity of the ensemble.
4 Experimental study
This experimental study was designed to answer the following research questions.
RQ1: Does the proposed low-dimensional embedding offer robustness to class
imbalance by providing a better representation of the minority class, and can it
outperform the state-of-the-art low-dimensional projection algorithms?
RQ2: Is LDPIM flexible enough to work with a variety of ensemble learning
algorithms designed for drifting data streams?
RQ3: Does using the skew-insensitive low-dimensional representation for high-
dimensional data streams offer better discriminative power than resampling and
cost-sensitive solutions?
RQ4: Is the proposed LDPIM capable of outperforming reference methods for
real-world data streams with various combinations of feature space dimension-
ality and imbalance ratio?
4.1 Data stream benchmarks
For the purpose of evaluating our proposed algorithm, we generated 576 di-
verse and large-scale data stream benchmarks using MOA. We used four gener-
ators to generate binary imbalanced streams, each with a number of features in
[50,100,250,500,1000,5000], an imbalance ratio in [10,30,50,80,100,150], and
with four types of drifts [no drift, sudden, gradual, incremental]. Exhaustive com-
binations of these factors lead to the creation of 576 data streams with 1M-5M
instances each. Additionally, we use three real-world data streams: CIFAR-100,
ImageNet and SUN-397 transformed to multi-class imbalanced problems [14].
Their properties are given in Tab. 1.
Table 1: Properties of data generators that were used to create 576 imbalanced data
stream benchmarks and three real-world data sets.
Dataset Instances Features Classes IR Drift
Agrawal 1 000 000 50 – 5000 2 10 – 150 n/s/g/i
Hyperplane 1 000 000 50 – 5000 2 10 – 150 n/s/g/i
RBF 5 000 000 50 – 5000 2 10 – 150 n/s/g/i
RandomTree 2 000 000 50 – 5000 2 10 – 150 n/s/g/i
CIFAR-100 60 000 1024 100 50 unknown
ImageNet 1 200 000 4096 200 50 unknown
SUN-397 108 753 1024 397 137 unknown
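A generator in the spirit of these benchmarks (imbalanced chunks with an optional sudden drift of the minority concept) can be sketched in a few lines; this is our simplification, not the MOA generators used in the paper:

```python
import numpy as np

def imbalanced_stream(n_chunks, chunk_size, d, ir, drift_at=None, seed=0):
    """Yield (X, y) chunks with roughly 1 minority per `ir` majority instances.
    A sudden drift (if `drift_at` is set) moves the minority-class mean."""
    rng = np.random.default_rng(seed)
    for t in range(n_chunks):
        n_min = max(1, chunk_size // (ir + 1))
        n_maj = chunk_size - n_min
        mu_min = 3.0 if (drift_at is None or t < drift_at) else -3.0
        X = np.vstack([rng.normal(0.0, 1.0, (n_maj, d)),
                       rng.normal(mu_min, 1.0, (n_min, d))])
        y = np.array([0] * n_maj + [1] * n_min)
        idx = rng.permutation(chunk_size)       # shuffle within the chunk
        yield X[idx], y[idx]
```

Varying `d`, `ir`, and `drift_at` mimics the factor grid (features, imbalance ratio, drift type) used to produce the 576 benchmark streams.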
4.2 Set-up
Reference algorithms for low-dimensional representation. We have se-
lected three state-of-the-art reference methods for obtaining low-dimensional
projections: (i) Per-Class Principal Component Analysis (pPCA); (ii) Structure-
Preserving Non-Negative Matrix Factorization (NMF) [11]; and (iii) Discrimi-
native Learning using Generalized Eigenvectors (GEM) [9].
Reference algorithms for handling class imbalance. We have selected
three state-of-the-art reference methods for handling imbalanced data streams:
(i) Incremental Oversampling for Data Streams (IOSDS) [1]; (ii) undersam-
pling via Selection-Based Resampling (SRE) [12]; and (iii) Online Multiple Cost-
Sensitive Learning (OMCSL) [16].
Fig. 1: Comparison of LDPIM and reference algorithms for low-dimensional representation over four different ensemble architectures (KUE, ARF, GOOWE, AUE). Results presented with respect to the number of wins (green), ties (yellow), and losses (red) over 576 data streams. A tie was considered when McNemar's test rejected the significance of the difference between tested algorithms.
Ensemble learning algorithms. The proposed LDPIM is a flexible plug-in
that can be used with any ensemble learning algorithm. Therefore, to eval-
uate its interplay with various ensembles, we have selected recent and pop-
ular architectures designed for streaming data: (i) Kappa Updated Ensemble
(KUE) [4]; (ii) Adaptive Random Forest (ARF) [8]; (iii) Geometrically Optimum
and Online-Weighted Ensemble (GOOWE) [2]; and (iv) Accuracy Updated En-
semble (AUE) [3]. Each of them worked in a block-based mode, with 10 base
classifiers maintained. We used Hoeffding Trees as base learners.
Parameters. LDPIM uses the inverse imbalance ratio for setting weights in
Eq. 6. All algorithms for low-dimensional representation use k= 0.05d, so they
reduce the input feature space by 95%. All of them were tuned with parameters
and procedures suggested by their authors.
Evaluation metrics. As we deal with imbalanced and drifting data streams,
we evaluated the examined algorithms using prequential AUC [7].
Windows. We used a window size ω= 1000 for calculating the prequential
metrics and training new classifiers.
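Window-based prequential evaluation follows the test-then-train protocol: each incoming window is first scored with the current model and only then used for training. A sketch with a rank-based AUC (the `score`/`fit` interface of `model` is our assumption):

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: P(score of a random positive > score of a random negative)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def prequential_auc(model, stream):
    """Test-then-train: score each window before updating the model on it.
    `stream` yields (X, y) windows; `model` exposes score(X) and fit(X, y)."""
    per_window = []
    for X, y in stream:
        per_window.append(auc(model.score(X), y))   # test first...
        model.fit(X, y)                             # ...then train
    return per_window
```

The per-window values can then be plotted against the stream, which is how the AUC-versus-imbalance/dimensionality analyses in this section are produced.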
Statistical analysis. We used McNemar's test for pairwise comparisons and the Bonferroni-Dunn test for multiple comparisons.
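The two tests can be sketched as follows; the Bonferroni-Dunn critical difference uses the normal quantile for $k-1$ comparisons against a control (our implementation, with `scipy`):

```python
import numpy as np
from scipy.stats import norm, chi2

def mcnemar(err_a, err_b):
    """McNemar's test on two classifiers' per-instance error indicators.
    Returns the chi-square statistic (with continuity correction) and p-value."""
    b = int(np.sum(~err_a & err_b))   # A right, B wrong
    c = int(np.sum(err_a & ~err_b))   # A wrong, B right
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, float(chi2.sf(stat, df=1))

def bonferroni_dunn_cd(k, n, alpha=0.05):
    """Critical difference of average ranks when comparing k methods over
    n data sets against a single control method (Bonferroni-Dunn)."""
    q = norm.ppf(1 - alpha / (2 * (k - 1)))   # two-sided, k-1 comparisons
    return q * np.sqrt(k * (k + 1) / (6.0 * n))
```

Two methods are tied (for the win/tie/loss counts) when McNemar's p-value exceeds the significance level, and a method differs from the control on the rank diagram when its average rank is more than the critical difference away.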
4.3 Experiment 1: Low-dimensional representations
General comparison. Fig. 1 offers a detailed comparison of the proposed method with three reference low-dimensional embedding methods using four underlying ensemble architectures over 576 diverse data stream benchmarks. We can see that LDPIM is capable of outperforming the reference embeddings regardless of the underlying ensemble architecture. This shows that the proposed method is not only able to create more discriminative subspaces (RQ1), but is also highly flexible with regard to the utilized classification scheme and can be used as a general-purpose plug-in (RQ2). This is further confirmed by the Bonferroni-Dunn test (see Fig. 5), proving that the differences between the proposed LDPIM and the reference algorithms are statistically significant.
Analysis of robustness to high dimensionality and class imbalance.
Fig. 2 shows a detailed analysis of the robustness of each of the examined projection methods to increasing dimensionality and imbalance ratio. As expected, all four algorithms can handle data streams with up to thousands of features. LDPIM shows some improvement over the reference methods when facing thousands of features, which can be attributed to its use of weighted entropies from both classes. Things are different when we analyze an increasing imbalance ratio: here LDPIM clearly outperforms all reference methods, showing that they are not capable of creating discriminative projections when dealing with highly skewed distributions.

Fig. 2: Analysis of the relationship between prequential AUC and increasing imbalance ratio / dimensionality for LDPIM and reference algorithms for low-dimensional representation. KUE used as the ensemble method.

Fig. 3: Comparison of LDPIM and reference algorithms for imbalanced data streams over four different ensemble architectures (KUE, ARF, GOOWE, AUE). Results presented with respect to the number of wins (green), ties (yellow), and losses (red) over 576 data streams. A tie was considered when McNemar's test rejected the significance of the difference between tested algorithms.
4.4 Experiment 2: Skew-insensitive algorithms
General comparison. Fig. 3 offers a detailed comparison of the proposed method with three reference resampling and cost-sensitive methods utilizing four underlying ensemble architectures over 576 diverse data stream benchmarks. The obtained results show that, by learning a skew-insensitive representation via LDPIM, we can be highly competitive with traditional solutions created for handling class imbalance. Furthermore, we achieve this without artificially inflating the size of the data set or the need for meta-tuning of the cost parameter. This is further confirmed by the Bonferroni-Dunn test (see Fig. 5).

Fig. 4: Analysis of the relationship between prequential AUC and increasing imbalance ratio / dimensionality for LDPIM and reference algorithms for handling imbalanced data streams. KUE used as the ensemble method.

Fig. 5: The Bonferroni-Dunn tests for the comparison among methods for low-dimensional representations and handling class imbalance.
Analysis of robustness to high dimensionality and class imbalance.
Fig. 4 shows a detailed analysis of the robustness of each of the examined imbalance alleviation methods to increasing dimensionality and imbalance ratio. These results show a weakness of the existing solutions, as neither resampling nor cost-sensitive learning can work under high dimensionality of the feature space. LDPIM offers comparable performance when the dimensionality is low, and excellent robustness to even thousands of features. As for the imbalance ratio, all
examined methods can adapt to increasing disproportion between classes, but
only LDPIM can handle imbalanced and high-dimensional streams (RQ3). It is
important to notice that LDPIM also offers excellent performance when dealing
with lower imbalance ratios or smaller feature dimensionality, making it a very
flexible solution to learning from difficult data streams.
4.5 Experiment 3: Evaluation on real-world data streams
Finally, we want to evaluate the performance of the proposed LDPIM on real-
world data. Tab. 2 presents the obtained results according to the prequential
AUC metric and for four examined ensemble classifiers used as base learners. We
Table 2: Results according to prequential multi-class AUC on real-world imbalanced data sets for all reference methods and four ensemble classifiers used as base learners. Rows are grouped by the underlying ensemble (KUE, ARF, GOOWE, AUE, in the order listed in Sec. 4.2); the first results column is LDPIM, followed by the reference methods.

[KUE]
CIFAR-100 82.72±2.38 56.26±9.73 60.23±9.18 64.34±8.56 65.83±7.02 66.11±6.45 65.23±7.98
ImageNet 41.07±5.09 16.54±8.83 17.99±9.02 20.04±7.81 28.94±6.17 31.43±6.93 32.80±5.98
SUN-397 38.28±6.79 9.28±5.99 11.26±7.03 12.36±8.04 17.29±8.92 18.92±7.48 20.01±9.01
[ARF]
CIFAR-100 80.19±2.72 51.54±9.29 59.57±9.33 62.41±8.07 62.95±7.38 63.83±6.18 64.01±7.29
ImageNet 40.03±6.11 14.99±9.02 17.58±9.31 21.93±7.48 27.91±7.03 30.52±7.28 31.06±6.77
SUN-397 36.28±7.28 8.62±6.58 10.24±7.33 11.64±7.58 16.03±9.04 17.88±8.04 18.84±9.62
[GOOWE]
CIFAR-100 78.03±2.99 50.21±9.80 53.07±9.94 56.61±9.17 58.29±7.78 60.06±6.92 60.92±8.64
ImageNet 36.86±5.89 11.72±8.14 13.58±8.22 16.22±8.01 24.17±8.14 26.77±7.94 27.94±7.80
SUN-397 32.09±4.99 5.89±3.44 7.09±2.99 7.82±3.88 11.57±5.01 13.39±4.78 15.28±5.02
[AUE]
CIFAR-100 79.44±2.76 51.95±9.41 55.11±9.38 58.92±8.15 60.02±7.76 62.17±6.35 63.88±7.48
ImageNet 37.93±4.81 12.77±6.62 15.02±7.13 18.54±6.96 25.82±7.11 28.44±8.09 28.98±5.49
SUN-397 34.18±2.79 7.28±1.99 8.99±2.52 8.90±3.08 13.44±3.72 16.48±4.08 17.24±4.27
can see that LDPIM offers superior performance and exceptional stability for all
data sets and regardless of the used ensemble classifier. This verifies the high
attractiveness of the proposed LDPIM for learning from difficult data streams
in various real-world scenarios (RQ4).
5 Conclusions and future works
In this paper, we have tackled a highly challenging and, to the best of our knowl-
edge, previously unaddressed problem of learning from data streams under class
imbalance and high feature space dimensionality. We have presented LDPIM, a novel low-dimensional representation learning algorithm designed for imbalanced data streams. It optimizes the search for a new subspace that can effectively represent data, while maximizing its discriminative power. We achieve this
by using the Renyi’s quadratic entropy to evaluate the usefulness of potential
subspaces. We additionally compute the individual impact of each class, taking
into account the disproportions in their distributions and using this information
to weight entropy computations. This allowed us to find theoretically-justified
skew-insensitive projections that alleviate the need for any resampling or cost-
sensitive learning. By looking for discriminative low-dimensional representations,
we were able to enhance the predictive power of examined classifiers without the
need of using additional pre-processing methods.
Our claims were confirmed by an extensive empirical evaluation on 576 generated streaming benchmarks and three challenging real-world data sets. We have shown that LDPIM can be seamlessly integrated into any streaming ensemble.
References
1. Anupama, N., Jena, S.: A novel approach using incremental oversampling for data
stream mining. Evolving Systems 10(3), 351–362 (2019)
2. Bonab, H.R., Can, F.: GOOWE: geometrically optimum and online-weighted en-
semble classifier for evolving data streams. ACM TKDD 12(2), 25:1–25:33 (2018)
3. Brzezinski, D., Stefanowski, J.: Reacting to different types of concept drift: The
accuracy updated ensemble algorithm. IEEE Trans. Neural Netw. Learning Syst.
25(1), 81–94 (2014)
4. Cano, A., Krawczyk, B.: Kappa updated ensemble for drifting data stream mining.
Machine Learning 109(1), 175–218 (2020)
5. Czarnecki, W.M., J´ozefowicz, R., Tabor, J.: Maximum entropy linear manifold for
learning discriminative low-dimensional representation. In: Machine Learning and
Knowledge Discovery in Databases - European Conference, ECML PKDD 2015,
Porto, Portugal, September 7-11, 2015, Proceedings, Part I. pp. 52–67 (2015)
6. Czarnecki, W.M., Tabor, J.: Multithreshold entropy linear classifier: Theory and
applications. Expert Syst. Appl. 42(13), 5591–5606 (2015)
7. Fern´andez, A., Garc´ıa, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F.: Learn-
ing from Imbalanced Data Sets. Springer (2018).
8. Gomes, H.M., Bifet, A., Read, J., Barddal, J.P., Enembreck, F., Pfahringer, B.,
Holmes, G., Abdessalem, T.: Adaptive random forests for evolving data stream
classification. Machine Learning 106(9-10), 1469–1495 (2017)
9. Karampatziakis, N., Mineiro, P.: Discriminative features via generalized eigenvec-
tors. In: Proceedings of the 31th International Conference on Machine Learning,
ICML 2014, Beijing, China, 21-26 June 2014. pp. 494–502 (2014)
10. Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble
learning for data stream analysis: A survey. Information Fusion 37, 132–156 (2017)
11. Li, Z., Liu, J., Lu, H.: Structure preserving non-negative matrix factorization for di-
mensionality reduction. Computer Vision and Image Understanding 117(9), 1175–
1189 (2013)
12. Ren, S., Zhu, W., Liao, B., Li, Z., Wang, P., Li, K., Chen, M., Li, Z.: Selection-
based resampling ensemble algorithm for nonstationary imbalanced stream data
learning. Knowl.-Based Syst. 163, 705–722 (2019)
13. Wang, S., Minku, L.L., Yao, X.: A systematic study of online class imbalance
learning with concept drift. IEEE Trans. Neural Netw. Learning Syst. 29(10),
4802–4821 (2018)
14. Wang, Y., Ramanan, D., Hebert, M.: Learning to model the tail. In: Guyon, I.,
von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N.,
Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual
Conference on Neural Information Processing Systems 2017, December 4-9, 2017,
Long Beach, CA, USA. pp. 7029–7039 (2017)
15. Wang, Z., Kong, Z., Chandra, S., Tao, H., Khan, L.: Robust high dimensional
stream classification with novel class detection. In: 35th IEEE International Con-
ference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. pp.
1418–1429 (2019)
16. Yan, Y., Yang, T., Yang, Y., Chen, J.: A framework of online learning with im-
balanced streaming data. In: Proceedings of the Thirty-First AAAI Conference
on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. pp.
2817–2823 (2017)
Full-text available
Designing adaptive classifiers for an evolving data stream is a challenging task due to the data size and its dynamically changing nature. Combining individual classifiers in an online setting, the ensemble approach, is a well-known solution. It is possible that a subset of classifiers in the ensemble outperforms others in a time-varying fashion. However, optimum weight assignment for component classifiers is a problem which is not yet fully addressed in online evolving environments. We propose a novel data stream ensemble classifier, called Geometrically Optimum and Online-Weighted Ensemble (GOOWE), which assigns optimum weights to the component classifiers using a sliding window containing the most recent data instances. We map vote scores of individual classifiers and true class labels into a spatial environment. Based on the Euclidean distance between vote scores and ideal-points, and using the linear least squares (LSQ) solution, we present a novel, dynamic, and online weighting approach. While LSQ is used for batch mode ensemble classifiers, it is the first time that we adapt and use it for online environments by providing a spatial modeling of online ensembles. In order to show the robustness of the proposed algorithm, we use real-world datasets and synthetic data generators using the MOA libraries. First, we analyze the impact of our weighting system on prediction accuracy through two scenarios. Second, we compare GOOWE with 8 state-of-the-art ensemble classifiers in a comprehensive experimental environment. Our experiments show that GOOWE provides improved reactions to different types of concept drift compared to our baselines. The statistical tests indicate a significant improvement in accuracy, with conservative time and memory requirements.
Full-text available
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
Full-text available
As an emerging research topic, online class imbalance learning often combines the challenges of both class imbalance and concept drift. It deals with data streams having very skewed class distributions, where concept drift may occur. It has recently received increased research attention; however, very little work addresses the combined problem where both class imbalance and concept drift coexist. As the first systematic study of handling concept drift in class-imbalanced data streams, this paper first provides a comprehensive review of current research progress in this field, including current research focuses and open challenges. Then, an in-depth experimental study is performed, with the goal of understanding how to best overcome concept drift in online learning with class imbalance. Based on the analysis, a general guideline is proposed for the development of an effective algorithm.
This book provides a general and comprehensible overview of imbalanced learning. It contains a formal description of a problem, and focuses on its main features, and the most relevant proposed solutions. Additionally, it considers the different scenarios in Data Science for which the imbalanced classification can create a real challenge. This book stresses the gap with standard classification tasks by reviewing the case studies and ad-hoc performance metrics that are applied in this area. It also covers the different approaches that have been traditionally applied to address the binary skewed class distribution. Specifically, it reviews cost-sensitive learning, data-level preprocessing methods and algorithm-level solutions, taking also into account those ensemble-learning solutions that embed any of the former alternatives. Furthermore, it focuses on the extension of the problem for multi-class problems, where the former classical methods are no longer to be applied in a straightforward way. This book also focuses on the data intrinsic characteristics that are the main causes which, added to the uneven class distribution, truly hinders the performance of classification algorithms in this scenario. Then, some notes on data reduction are provided in order to understand the advantages related to the use of this type of approaches. Finally this book introduces some novel areas of study that are gathering a deeper attention on the imbalanced data issue. Specifically, it considers the classification of data streams, non-classical classification problems, and the scalability related to Big Data. Examples of software libraries and modules to address imbalanced classification are provided. This book is highly suitable for technical professionals, senior undergraduate and graduate students in the areas of data science, computer science and engineering. 
It will also be useful for scientists and researchers to gain insight on the current developments in this area of study, as well as future research directions.
Conference Paper
Representation learning is currently a very hot topic in modern machine learning, mostly due to the great success of the deep learning methods. In particular low-dimensional representation which discriminates classes can not only enhance the classification procedure, but also make it faster, while contrary to the high-dimensional embeddings can be efficiently used for visual based exploratory data analysis. In this paper we propose Maximum Entropy Linear Manifold (MELM), a multidimensional generalization of Multithreshold Entropy Linear Classifier model which is able to find a low-dimensional linear data projection maximizing discriminativeness of projected classes. As a result we obtain a linear embedding which can be used for classification, class aware dimensionality reduction and data visualization. MELM provides highly discriminative 2D projections of the data which can be used as a method for constructing robust classifiers. We provide both empirical evaluation as well as some interesting theoretical properties of our objective function such us scale and affine transformation invariance, connections with PCA and bounding of the expected balanced accuracy error.
In many applications of information systems learning algorithms have to act in dynamic environments where data are collected in the form of transient data streams. Compared to static data mining, processing streams imposes new computational requirements for algorithms to incrementally process incoming examples while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, prediction models are often also required to adapt to concept drifts. Out of several new proposed stream algorithms, ensembles play an important role, in particular for non-stationary environments. This paper surveys research on ensembles for data stream classification as well as regression tasks. Besides presenting a comprehensive spectrum of ensemble approaches for data streams, we also discuss advanced learning concepts such as imbalanced data streams, novelty detection, active and semi-supervised learning, complex data representations and structured outputs. The paper concludes with a discussion of open research problems and lines of future research.
Linear classifiers separate the data with a hyperplane. In this paper we focus on the novel method of construction of multithreshold linear classifier, which separates the data with multiple parallel hyperplanes. Proposed model is based on the information theory concepts -- namely Renyi's quadratic entropy and Cauchy-Schwarz divergence. We begin with some general properties, including data scale invariance. Then we prove that our method is a multithreshold large margin classifier, which shows the analogy to the SVM, while in the same time works with much broader class of hypotheses. What is also interesting, proposed method is aimed at the maximization of the balanced quality measure (such as Matthew's Correlation Coefficient) as opposed to very common maximization of the accuracy. This feature comes directly from the optimization problem statement and is further confirmed by the experiments on the UCI datasets. It appears, that our Multithreshold Entropy Linear Classifier (MELC) obtaines similar or higher scores than the ones given by SVM on both synthetic and real data. We show how proposed approach can be benefitial for the cheminformatics in the task of ligands activity prediction, where despite better classification results, MELC gives some additional insight into the data structure (classes of underrepresented chemical compunds).