Adapting k-means algorithm for pair-wise
constrained clustering of imbalanced data streams
Szymon Wojciechowski1, Germán González-Almagro2, Salvador García2,
and Michał Woźniak1
1Department of Systems and Computer Networks
Wrocław University of Science and Technology
Wrocław, Poland
2Department of Computer Science and Artificial Intelligence (DECSAI) & DaSCI
Andalusian Institute of Data Science and Computational Intelligence
University of Granada
Granada, Spain
Abstract. Contemporary man is addicted to digital media and tools supporting his daily activities, which causes a massive increase of incoming data, both in volume and frequency. Due to the observed trend, unsupervised machine learning methods for data stream clustering have become a popular research topic over the last years. At the same time, semi-supervised constrained clustering is rarely considered in data stream clustering. To address this gap in the field, the authors propose adaptations of k-means constrained clustering algorithms for employing them in imbalanced data stream clustering. In this work, the proposed algorithms were evaluated in a series of experiments concerning synthetic and real data clustering, which verified their ability to adapt to occurring concept drifts.

Keywords: data streams, pair-wise constrained clustering, imbalanced data
1 Introduction
The development of technology and the increasing number of devices connected
to one general-purpose network have made it possible to download and store
massive amounts of constantly streaming data. Regardless of the information
type, which might be readings from a sensor mesh or news outlets [1], it must
be processed into knowledge that the users can further utilize. Therefore, it
has become inevitable for the topic to be widely researched,
leading to the development of data mining techniques for data streams.
Undoubtedly, one of the discussed challenges is to design algorithms capable
of continuous data stream processing [2]. Therefore, the proposed models must
adjust to the constantly changing data characteristics, a phenomenon known as
concept drift [3]. Moreover, models have to provide a prediction before
the next batch of data is available, which constrains the algorithm in terms of
computational complexity.
The vast majority of the recent works focus on data stream classification,
introducing various approaches for dealing with concept drift. Many of the pro-
posed methods employ classifier ensembles [4] and dynamic selection of classi-
fiers [5]. Moreover, various methods are based on concept drift detectors [6], and
a prior probability estimation [7] which indicates when the data characteristics
are changing.
Similarly, data streams were also researched in the context of clustering
task [8]. Proposed algorithms are often adapting the existing methods to data
stream processing [9]. At the same time, research on semi-supervised clustering
with pair-wise constraints [10] has been mainly focused on the integration of
constraints to classical algorithms [11] and methods based on optimization algo-
rithms [12]. The adaptation of those algorithms to data streams was proposed
only in a few recent works.
One of the mainly used techniques for this task is C-DenStream [13], which
adapts the DBSCAN algorithm, employing micro-clusters constructed on
consecutive chunks. Another example is the SemiStream [14] algorithm, based on
initial clustering by MPCk-means, followed by the detection of o-clusters and
their assignment to s-clusters and m-clusters. Finally, the CE-Stream algorithm [15]
extends the E-Stream algorithm, but it is limited to the must-link constraint set.
Noting that there is still a lack of pair-wise constrained clustering algorithms
for data streams, the main contribution of this work is a proposal for
employing the COPk-means and PCk-means algorithms for data stream clustering.
2 Algorithm
The data stream DS is a sequence of data chunks DS = {DS_1, DS_2, ...}.
Each data chunk contains a set of samples described by a feature vector X,
for which the clustering algorithm assigns a label describing a cluster C.
Additionally, each chunk is also provided with two lists of pair-wise constraints,
ml and cl, denoting must-link pairs and cannot-link pairs, respectively. Those
extend the context of clustering with expert knowledge, which does not provide
cluster labels directly, but only declares a relation between two samples.
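The chunk structure described above can be sketched as follows; the dict container and the names used here are illustrative assumptions, not the authors' actual data structures.

```python
import numpy as np

# Illustrative layout of one data chunk DS_t: a feature matrix X plus the two
# constraint lists delivered with it (container and names are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # 200 samples described by 2 features

ml = [(0, 1), (2, 3)]           # must-link: each pair should share a cluster
cl = [(0, 2), (1, 3)]           # cannot-link: each pair must be separated

chunk = {"X": X, "ml": ml, "cl": cl}
```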
The k-means clustering algorithm is an inductive learning method. The model
is trained iteratively to minimize the intra-cluster variance and maximize the
inter-cluster variance. In the first phase of the algorithm, k cluster centroids
are initialized as points in feature space. Then, each observation is assigned
to a cluster by finding the nearest centroid. After this procedure, the new
centroids are calculated by averaging all assigned observations. The procedure
is repeated until the model reaches convergence, which means that the new
centroids cannot be shifted any further. An additional stop condition, which
guarantees algorithm execution in a feasible time, is the maximum number of
iterations. The algorithm pseudocode is presented in Algorithm 1.
Algorithm 1: Generic k-means clustering
  initialize centroids()
  while iteration < max_iter do
      assign labels(X)
      recalculate centroids()
      if convergence then
          break
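Algorithm 1 can be expressed in runnable form; the sketch below is a minimal illustration with random initialization from data points, not the authors' implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal runnable sketch of Algorithm 1 (random init from data points)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]       # initialize
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # assign each sample to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # recalculate centroids; keep the old one if a cluster is empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):                       # convergence
            break
        centroids = new
    return labels, centroids
```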
Constraints can be integrated into the k-means algorithm using one of two
rules, defined as follows:
- hard rule: cluster labels assigned by the algorithm cannot be inconsistent
with the constraints. This rule is used by the COPk-means (copk) algorithm. Each
label assignment is verified for feasibility against the constraints. If one of
the rules is violated, the algorithm terminates, leaving the rest of the samples
unassigned.
- soft rule: cluster labels can be inconsistent with constraints, but constraint
violation is penalized as an additional factor in the minimized criterion. This
rule is used by the PCk-means (pck) algorithm. The criterion minimized in the
algorithm is defined as follows:

J_pckm = (1/2) Σ_{x_i ∈ X} ||x_i − μ_{l_i}||² + Σ_{(x_i, x_j) ∈ ml} w_ij 1[l_i ≠ l_j] + Σ_{(x_i, x_j) ∈ cl} w̄_ij 1[l_i = l_j]

where l_i is the cluster label of sample x_i, μ_{l_i} is its centroid, and
w_ij, w̄_ij are the penalty weights for violating must-link and cannot-link
constraints, respectively.
This approach assures that all samples are assigned to estimated clusters.
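The soft-rule criterion can be illustrated with a small function; as a simplifying assumption, a single scalar weight `w` stands in for the per-pair weights w_ij and w̄_ij.

```python
import numpy as np

def pckm_objective(X, labels, centroids, ml, cl, w=1.0):
    """Sketch of the pck criterion: distortion plus constraint penalties.
    A single scalar `w` replaces the per-pair weights (an assumption)."""
    # intra-cluster distortion term, (1/2) * sum of squared distances
    j = 0.5 * sum(np.sum((X[labels == c] - mu) ** 2)
                  for c, mu in enumerate(centroids))
    j += sum(w for i, k in ml if labels[i] != labels[k])   # violated must-links
    j += sum(w for i, k in cl if labels[i] == labels[k])   # violated cannot-links
    return j
```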
Another crucial part of the algorithm is selecting a proper cluster initialization
method. The common methods are:
- random - which is very fast, but a poor initial match may lead to very slow
convergence;
- kmeans++ - allowing a better estimation of the initial centroids, but it
introduces additional computation overhead.

For constrained clustering problems, centroids can also be initialized based on
the transitive closure of the must-link constraints, averaging the broadest
spanning trees created during this procedure. The complexity of such a method
depends on the backbone spanning-tree search algorithm.
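The must-link transitive closure can be sketched with a union-find pass; this is one plausible realization, and the paper's spanning-tree formulation may differ in detail.

```python
import numpy as np

def ml_closure_centroids(X, ml, k):
    """Sketch: centroids from the transitive closure of must-link constraints,
    averaging the k largest connected groups (union-find realization)."""
    parent = list(range(len(X)))

    def find(i):                          # root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in ml:                       # union the endpoints of each pair
        parent[find(i)] = find(j)

    roots = np.array([find(i) for i in range(len(X))])
    order = sorted(set(roots), key=lambda r: -(roots == r).sum())
    return np.array([X[roots == r].mean(axis=0) for r in order[:k]])
```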
Since the goal is to adapt the algorithm to process data streams, it is
necessary to make changes to the initialization and modification of the centroids
in the copk and pck algorithms. The authors propose two new algorithms, namely
copk-s and pck-s, introducing modifications to the described procedures.

Initialization. In sequential data chunk processing, the model created on the
previous chunk has already determined the centroids that can be used in the
next one. It can be assumed that those centroids are close to optimal and, at
the same time, reusing them reduces additional computational effort.
Assignment. Base algorithms assume that k clusters will be formed. However, a
sudden concept drift may lead to the complete disappearance of one of the
clusters. It may turn out that no sample will be assigned to a given centroid,
which at the same time will prevent it from shifting further. Therefore, an
additional modification is introduced to the proposed algorithms. Centroids to
which no samples were assigned are omitted, but the same centroid will be
reused for clustering in the next chunk.
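Both stream modifications (warm-started centroids and retained empty centroids) can be sketched together; the code below is an illustrative reading of the description above, not the released implementation.

```python
import numpy as np

def stream_kmeans_step(X, centroids, max_iter=50):
    """Illustrative chunk-processing step: centroids are warm-started from the
    previous chunk; a centroid left without samples keeps its old position so
    it can be reused on the next chunk (a reading of the described rule)."""
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        new = centroids.copy()
        for j in range(len(centroids)):
            if np.any(labels == j):              # only shift non-empty clusters
                new[j] = X[labels == j].mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```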
3 Experiments
The proposed methods of adapting the k-means algorithms were tested in a series
of experiments in order to find answers to the following research questions:
RQ1 : What impact do the proposed modifications have on the ability to adapt the
algorithm to the occurring concept drifts?
RQ2 : What impact do the proposed modifications have on the clustering quality?
3.1 Research Protocol
The following experimental protocol was formulated to answer the research ques-
tions. Both synthetic streams for which it was possible to observe the algo-
rithm’s operation with the defined behavior of the concept drift and the real-life
data were used, allowing for actual quality assessment of the proposed methods.
Synthetic data streams were generated with both concept drift and dynamic
imbalance, based on an artificial classification problem. Additionally, to
create sets dedicated to clustering algorithms, two additional sets were
created in which the defined clusters were generated from normal distributions
at the given distribution means. This modification was aimed at creating a set
dedicated to the clustering algorithm, bearing in mind that the objective of
the algorithm is to correctly determine the local concentration of observations
and not, as opposed to classification, to determine the correct labels.
There are no real data sets for the constrained clustering task with constraints
provided by an actual expert. Therefore, it was necessary to generate the
constraints based on a given set of labels for both the synthetic and real
streams. Each set can introduce a natural disproportion between ml and cl,
which is related to the number of classes and their volume. However, with a
complete set of labels, it is possible to determine the n(n−1)/2 possible pairs
of n samples and the constraint type of each pair based on the similarity of
their labels. Therefore, the proposed method of generating the constraints uses
the percentage of possible sample connections, which defines the number of
pairs selected randomly and transformed into constraints.
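A minimal sketch of the described generator, assuming `ratio` is the fraction of all possible pairs to draw (function and parameter names are illustrative):

```python
import numpy as np
from itertools import combinations

def generate_constraints(y, ratio, seed=0):
    """Sketch of the described generator: draw `ratio` of the n*(n-1)/2
    possible pairs at random, then type each pair from label similarity."""
    rng = np.random.default_rng(seed)
    pairs = list(combinations(range(len(y)), 2))     # all n-choose-2 pairs
    idx = rng.choice(len(pairs), int(ratio * len(pairs)), replace=False)
    ml = [pairs[i] for i in idx if y[pairs[i][0]] == y[pairs[i][1]]]
    cl = [pairs[i] for i in idx if y[pairs[i][0]] != y[pairs[i][1]]]
    return ml, cl
```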
The quality of clustering on selected data streams was assessed in a sequen-
tial protocol typical for this type of research. The incoming data are divided
into chunks of equal size. The model is trained using the entire chunk, while
the evaluation process follows the protocol for evaluating clustering algorithms.
Additionally, for all tested methods, constraints lists are provided for each data
chunk, and the set of labels itself is only used to compute the clustering metric.
3.2 Experimental Setup
Eight data streams were used for the experiments, four of which were
synthetically generated, and four were prepared based on real data sets
obtained from the MOA and CSE repositories. A detailed description of the data
sets is presented in Table 1.
Selected data sets were divided into chunks of 200 samples. The real-life
data streams were limited to the first 200 chunks, which preserved the original
problem characteristics and provided a convenient length for reliable analysis.
The constraints were generated for each chunk using original problem labels.
The ratio of generated constraints to all possible constraints was constant along
the stream, but each data set was evaluated in three configurations of 1%, 2%,
and 5% of possible constraints. The metric selected to measure and compare the
clustering quality of evaluated algorithms was Adjusted Rand Index (ARI).
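The sequential protocol described above can be sketched as follows, using scikit-learn's ARI implementation; the `evaluate_stream` helper and the chunk tuple layout are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def evaluate_stream(chunks, cluster_fn):
    """Sketch of the sequential protocol: the model sees only X and the
    constraint lists; ground-truth labels are used solely for the ARI score.
    The chunk tuple layout and names are illustrative assumptions."""
    scores = []
    for X, y, ml, cl in chunks:
        pred = cluster_fn(X, ml, cl)                 # constrained clustering
        scores.append(adjusted_rand_score(y, pred))  # labels only for scoring
    return np.array(scores)
```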
The tested algorithms and the experimental environment were implemented
in the Python programming language, using the scikit-learn [16] and stream-learn [17]
libraries. The repository allowing for the replication of the presented results
is available on GitHub.
3.3 Results
The graphs show ARI scores for both proposed methods and the reference ap-
proaches. In addition, for each method, the metric values are shown (a) for the
entire data stream length and (b) averaged over all data chunks. This approach
allows for a detailed analysis of the algorithm performance while enabling com-
parison in the context of the entire problem. Finally, all presented runs were
smoothed using the cumulative sum.
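One plausible reading of the cumulative-sum smoothing is a running mean of the per-chunk scores, which can be computed as:

```python
import numpy as np

def cumulative_mean(scores):
    """Running mean of per-chunk scores, computed via a cumulative sum (one
    plausible reading of the smoothing applied to the plotted runs)."""
    return np.cumsum(scores) / np.arange(1, len(scores) + 1)
```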
Table 1: Summary of benchmark datasets.

Dataset (Source) - Description
Blobs Gradual (Synthetic) - One normal distribution per class with a drifting mean value. Two separated clusters are intersecting, then returning to the original concept.
Blobs Imbalance (Synthetic) - One normal distribution per class with a drifting imbalance ratio. At the beginning, static clusters have an equal imbalance ratio, which drifts towards 0.1 IR.
Classification Gradual (Synthetic) - One distribution per class with two informative features. The mean of the cluster distributions shifts over time.
Classification Imbalance (Synthetic) - One distribution per class with two informative features. A recurring imbalance drift occurs twice, varying between 0.05 and 0.95 IR.
Forest Covertype (MOA) - Data contains the forest cover type obtained from the US Forest Service (USFS) Region 2 Resource Information System (RIS).
Electricity (MOA) - Data collected from the Australian New South Wales Electricity Market.
Airlines (MOA) - Data contains flight arrival and departure details for all the commercial flights within the USA, from October 1987 to April 2008.
KddCup99 (CSE) - Data created for network intrusion detection in a simulated network. This data set was used for The Third International Knowledge Discovery and Data Mining Tools Competition.
Synthetic Datasets. Research on synthetic sets was carried out to study
the algorithm behavior concerning the occurring drifts. The results obtained for
synthetic data streams with concept drift are presented in Figure 1.
One may observe that for the analyzed problem, pck-s achieved the best
results. It should be underlined that in the case of Blobs Gradual, the clustering
quality of the base pck and of the proposition was equal. Moreover, all the
algorithms performed poorly around the cluster intersection, which was expected
behavior, but ARI significantly improved when 5% of constraints were provided.
At the same time, attention should be paid to the massive variance of the
obtained metric, indicating how high its potential impact on model
diversification is. At the same time, despite the inferior clustering quality of
copk, a significant improvement is noticeable for copk-s. However, addressing
RQ1, it cannot be unequivocally stated that the proposed methods allow for
faster adaptation to the concept drift.
For the Classification Gradual problem, it can be seen that both pck-s and
copk-s are better than the base algorithms and more stable, especially for the
case where 2% of constraints were used. In the runs with 1% of constraints, an
interesting behavior is visible at the beginning of the stream - improving
overall model quality in the early run. It may suggest that the proposed
initialization method causes some delay in achieving the algorithm's
convergence, which stabilizes only after some time.
It is also essential to pay attention to the impact of the number of constraints
on the tested algorithms. In all cases, increasing this parameter did not affect
the quality of clustering by copk and had a slight effect on copk-s, while
significantly improving the pck and pck-s algorithms, leading to a stable result
in Classification Gradual with 5% constraints.
For the Blobs Gradual problem, the pck and pck-s algorithms achieve promising
results using 5% constraints and minimize their variance while becoming
insensitive to the emerging imbalance. It is also worth noting that, in contrast
to the previously analyzed problem, a much smaller decrease in the quality of
pck-s clustering at the time of imbalance can be observed. A similar relationship
can be observed in Classification Gradual. In the scenario where the percentage
of constraints is 0.5% and 1%, there is a sudden increase in the metric as the
classes approach the prior equilibrium, followed by a rapid decrease. In the
last case, pck-s adjusts to the imbalance the fastest, giving the metric the
most stable value.
Real Datasets. The results of experiments carried out on real data streams
are presented in Figure 2. The results analysis is carried out to address RQ2.
It is impossible to indicate the best clustering algorithm for the forest
covertype and electricity sets where 1% and 2% constraints were selected.
Noticeable differences between the algorithms appear only after considering
the 5% constraints, which significantly improve the pck and pck-s algorithms.
An interesting drop in the quality of the pck clustering on the covtypeNorm set
starts in chunk 75, which cannot be observed for pck-s.
Significant differences can be observed in the case of the other two sets. It
can be noted that for the airlines results, all methods performed poorly, but
there is an increase in the quality of clustering for copk and copk-s with an
increasing number of constraints. Thus, this observation seems to be an exception
to the rule observed in the previous sets. However, it should be noted that the
metric only increased by 0.1 between the lowest and the highest percentage of
constraints.
The most interesting observations can be made on the KddCup99 set, for
which copk-s also turned out to be the best clustering algorithm. An interesting
relationship can also be seen for this set, according to which adding constraints
degrades the quality of clustering. Most likely, it is related to the specific
behavior of the algorithm concerning constraint feasibility [18].
(a) Blobs Imbalance (b) Classification Imbalance
(c) Blobs Gradual (d) Classification Gradual
Fig. 1: Results for synthetic datasets.
(a) forest covertype (b) electricity
(c) airlines (d) KddCup99
Fig. 2: Results for real datasets.
4 Conclusions
Two clustering algorithms for data streams, pck-s and copk-s, were presented
in this work. Both algorithms are adapted to a task rarely discussed in the
literature. The series of studies showed the ability of the proposed methods to
adapt to the existing concept drifts and to obtain better results than the base
algorithms.
The presented results provide the basis for further research. One direction is
to extend the proposed methods to be better adapted to the changing nature of
the constraints. In some cases, the clustering quality showed significant
variance throughout the run. An appropriate selection of constraints may be
crucial for stabilizing the results.
In addition, the practical aspect of using the proposed methods should be
emphasized in the context of other machine learning tasks, especially for active
learning. Evaluating only the relation between two samples is easier for an expert
than assigning them to imposed classes, which will reduce the cost of obtaining
such knowledge.
Acknowledgments. This work was supported by the Polish National Science Centre
under the grant No. 2017/27/B/ST6/01325 as well as by the statutory funds of
the Department of Systems and Computer Networks, Wrocław University of Science
and Technology.
References

1. P. Ksieniewicz, P. Zyblewski, M. Choraś, R. Kozik, A. Giełczyk, and M. Woźniak, "Fake News Detection from Data Streams," in 2020 International Joint Conference on Neural Networks (IJCNN). Glasgow, United Kingdom: IEEE, Jul. 2020.
2. B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Woźniak, "Ensemble learning for data stream analysis: A survey," Information Fusion, vol. 37, pp. 132–156, Sep. 2017.
3. S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, and F. Herrera, "A survey on data preprocessing for data stream mining: Current status and future directions," Neurocomputing, vol. 239, pp. 39–57, May 2017.
4. A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà, "New ensemble methods for evolving data streams," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 139–148.
5. P. Zyblewski, R. Sabourin, and M. Woźniak, "Preprocessed dynamic classifier ensemble selection for highly imbalanced drifted data streams," Information Fusion, vol. 66, pp. 138–154, Feb. 2021.
6. F. Guzy, M. Woźniak, and B. Krawczyk, "Evaluating and explaining generative adversarial networks for continual learning under concept drift," in 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 295–303.
7. J. Komorniczak, P. Zyblewski, and P. Ksieniewicz, "Prior Probability Estimation in Dynamically Imbalanced Data Streams," in 2021 International Joint Conference on Neural Networks (IJCNN). Shenzhen, China: IEEE, Jul. 2021, pp. 1–7.
8. J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. de Carvalho, and J. Gama, "Data stream clustering: A survey," ACM Computing Surveys, vol. 46, no. 1, pp. 1–31, Oct. 2013.
9. F. Cao, M. Estert, W. Qian, and A. Zhou, "Density-Based Clustering over an Evolving Data Stream with Noise," pp. 328–339.
10. I. Davidson, "A Survey of Clustering with Instance Level Constraints," ACM Transactions on Knowledge Discovery from Data, p. 41, 2007.
11. S. González, S. García, S.-T. Li, R. John, and F. Herrera, "Fuzzy k-nearest neighbors with monotonicity constraints: Moving towards the robustness of monotonic noise," Neurocomputing, vol. 439, pp. 106–121, Jun. 2021.
12. G. González-Almagro, J. Luengo, J.-R. Cano, and S. García, "Enhancing instance-level constrained clustering through differential evolution," Applied Soft Computing, vol. 108, p. 107435, Sep. 2021.
13. C. Ruiz, E. Menasalvas, and M. Spiliopoulou, "C-DenStream: Using domain knowledge on a data stream," in Discovery Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 287–301.
14. M. Halkidi, M. Spiliopoulou, and A. Pavlou, "A Semi-supervised Incremental Clustering Algorithm for Streaming Data," in Advances in Knowledge Discovery and Data Mining. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, vol. 7301, pp. 578–590, series title: Lecture Notes in Computer Science.
15. T. Sirampuj, T. Kangkachit, and K. Waiyamai, "CE-Stream: Evaluation-based technique for stream clustering with constraints," in The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE). Khon Kaen, Thailand: IEEE, May 2013, pp. 217–222.
16. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
17. P. Ksieniewicz and P. Zyblewski, "stream-learn - open-source Python library for difficult data stream batch analysis," arXiv:2001.11077 [cs, stat], Jan. 2020.
18. I. Davidson, K. L. Wagstaff, and S. Basu, "Measuring Constraint-Set Utility for Partitional Clustering Algorithms," in Knowledge Discovery in Databases: PKDD 2006. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, vol. 4213, pp. 115–126, series title: Lecture Notes in Computer Science.