Adapting k-means algorithm for pair-wise
constrained clustering of imbalanced data
streams
Szymon Wojciechowski1, Germán González-Almagro2, Salvador García2,
and Michał Woźniak1
1Department of Systems and Computer Networks
Wrocław University of Science and Technology
Wrocław, Poland
szymon.wojciechowski@pwr.edu.pl
2Department of Computer Science and Artificial Intelligence (DECSAI) & DaSCI
Andalusian Institute of Data Science and Computational Intelligence
University of Granada
Granada, Spain
Abstract. Contemporary society is increasingly dependent on digital media and tools supporting daily activities, which causes a massive increase of incoming data, both in volume and frequency. Due to this trend, unsupervised machine learning methods for data stream clustering have become a popular research topic in recent years. At the same time, semi-supervised constrained clustering is rarely considered in data stream clustering. To address this gap in the field, the authors propose adaptations of constrained k-means clustering algorithms for imbalanced data stream clustering. The proposed algorithms were evaluated in a series of experiments on synthetic and real data streams, verifying their ability to adapt to occurring concept drifts.
Keywords: data streams, pair-wise constraints, clustering, imbalanced data
1 Introduction
The development of technology and the increasing number of devices connected
to one general-purpose network have made it possible to download and store mas-
sive amounts of constantly streaming data. Regardless of the information type, which might be readings from a sensor mesh or news outlets [1], it must always be processed into a resource that the users can further utilize – knowledge.
edge. Therefore, it has become inevitable for the topic to be widely researched,
leading to the development of data mining techniques for data streams.
Undoubtedly, one of the discussed challenges is to design algorithms capable
of continuous data stream processing [2]. Therefore, the proposed models must adjust to constantly changing data characteristics, a phenomenon known as concept drift [3]. Moreover, models have to provide a prediction before
the next batch of data is available, which constrains the algorithm in terms of
computational complexity.
The vast majority of the recent works focus on data stream classification,
introducing various approaches for dealing with concept drift. Many of the pro-
posed methods employ classifier ensembles [4] and dynamic selection of classi-
fiers [5]. Moreover, various methods are based on concept drift detectors [6] and prior probability estimation [7], which indicate when the data characteristics
are changing.
Similarly, data streams have also been researched in the context of the clustering task [8]. The proposed algorithms often adapt existing methods to data stream processing [9]. At the same time, research on semi-supervised clustering
with pair-wise constraints [10] has been mainly focused on the integration of
constraints to classical algorithms [11] and methods based on optimization algo-
rithms [12]. The adaptation of those algorithms to data streams was proposed
only in a few recent works.
One of the main techniques for this task is C-DenStream [13], which adapts the DBSCAN algorithm by employing micro-clusters constructed on consecutive chunks. Another example is the SemiStream [14] algorithm, based on an initial clustering by MPCk-means, followed by the detection of o-clusters and their assignment to s-clusters and m-clusters. Finally, the CE-Stream algorithm [15] extends the E-Stream algorithm, but it is limited to must-link constraint sets.
Noting that there is still a lack of pair-wise constrained clustering algorithms for data streams, the main contribution of this work is a proposal for employing the COPk-means and PCk-means algorithms for data stream clustering.
2 Algorithm
The data stream DS is a sequence of data chunks DS = {DS_1, DS_2, ..., DS_k}. Each data chunk contains a set of samples described by a feature vector X, for which the clustering algorithm assigns a label describing a cluster C. Additionally, each chunk is also provided with two lists of pair-wise constraints, ml and cl, denoting must-link pairs and cannot-link pairs, respectively. These extend the clustering context with expert knowledge, which does not provide cluster labels directly but only declares a relation between two samples.
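As an illustration only (this representation is assumed, not prescribed by the paper), a single chunk together with its constraint lists could be stored as follows, where each constraint is a pair of sample indices within the chunk:

import numpy as np

# hypothetical chunk representation: feature matrix plus constraint lists
X = np.random.rand(200, 2)      # 200 samples described by 2 features
ml = [(3, 17), (42, 108)]       # must-link pairs: expected to share a cluster
cl = [(5, 9), (60, 61)]         # cannot-link pairs: expected to fall in different clusters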
The k-means clustering algorithm is an inductive learning method. The model is trained iteratively to minimize the intra-cluster variance and maximize the inter-cluster variance. In the first phase of the algorithm, k cluster centroids are initialized as points in the feature space. Then, each observation is assigned to a cluster by finding the nearest centroid. After this procedure, the new centroids are calculated by averaging all assigned observations. The procedure is repeated until the model reaches convergence, which means that the centroids cannot be shifted any further. An additional stop condition, which guarantees algorithm execution in a feasible time, is the maximum number of iterations. The algorithm pseudocode is presented in Algorithm 1.
Algorithm 1: Generic k-means clustering
initialize_centroids()
while iteration < max_iter do
    assign_labels(X)
    recalculate_centroids()
    if convergence then
        break
    end
end
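A minimal NumPy sketch of this loop (illustrative only, not the authors' implementation) could look as follows:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # random initialization: pick k distinct samples as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # assignment step: each sample goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: a centroid becomes the mean of its assigned samples
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids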
Constraints can be integrated into the k-means algorithm using one of two rules, defined as follows:
– hard rule: cluster labels assigned by the algorithm cannot be inconsistent with the constraints. This rule is used by the COPk-means (copk) algorithm. Each label assignment is verified for feasibility against the constraints. If one of the rules is violated, the algorithm terminates, leaving the rest of the samples unassigned.
– soft rule: cluster labels can be inconsistent with the constraints, but constraint violation is penalized as an additional factor in the minimized criterion. This rule is used by the PCk-means (pck) algorithm. The criterion minimized by the algorithm is defined as follows:
J_{pckm} = \frac{1}{2} \sum_{x_i \in X} \| x_i - \mu_{l_i} \|^2 + \sum_{(x_i, x_j) \in ml} w_{ij} \, [l_i \neq l_j] + \sum_{(x_i, x_j) \in cl} \bar{w}_{ij} \, [l_i = l_j]    (1)
This approach assures that all samples are assigned to estimated clusters.
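For illustration, a direct evaluation of the criterion in Eq. (1) could be sketched as below; note that constant penalty weights w and w_bar are assumed here in place of the per-pair weights w_ij and \bar{w}_ij:

import numpy as np

def pckm_objective(X, labels, centroids, ml, cl, w=1.0, w_bar=1.0):
    # squared-distance term of Eq. (1)
    cost = 0.5 * np.sum((X - centroids[labels]) ** 2)
    # must-link violations: the pair ended up in different clusters
    cost += sum(w for i, j in ml if labels[i] != labels[j])
    # cannot-link violations: the pair ended up in the same cluster
    cost += sum(w_bar for i, j in cl if labels[i] == labels[j])
    return cost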
Another crucial part of the algorithm is selecting a proper cluster initialization method. The common methods are:
– random: very fast, but a poor initial match may lead to very slow convergence.
– kmeans++: allows a better estimation of the initial centroids, but introduces additional computational overhead.
For constrained clustering problems, centroids can also be initialized based on
transitive closure of the must-link constraints, averaging the broadest spanning
trees created during this procedure. The complexity of such a method depends
on a backbone spanning-tree search algorithm.
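A possible sketch of this initialization, assuming the broadest spanning trees correspond to the largest connected components of the must-link graph (an interpretation, not the authors' exact procedure), is given below:

import numpy as np

def ml_closure_centroids(X, ml, k):
    # union-find over must-link pairs yields the transitive closure as components
    parent = list(range(len(X)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a, b in ml:
        parent[find(a)] = find(b)
    roots = np.array([find(i) for i in range(len(X))])
    # take the k largest components and use their means as the initial centroids
    comps, sizes = np.unique(roots, return_counts=True)
    top = comps[np.argsort(sizes)[::-1][:k]]
    return np.array([X[roots == c].mean(axis=0) for c in top])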
Since the goal is to adapt the algorithm to process data streams, it is necessary to make changes to the initialization and modification of the centroids in the copk and pck algorithms. The authors propose two new algorithms, namely copk-s and pck-s, introducing modifications to the described procedures.
Initialization In sequential data chunk processing, the model created on the previous chunk has already determined the centroids that can be used in the next one. It can be assumed that those centroids are close to optimal and, at the same time, reusing them reduces additional computational effort.
Assignment The base algorithms assume that k clusters will be formed. However, a sudden concept drift may lead to the complete disappearance of one of the clusters. It may turn out that no pattern is assigned to a given centroid, which at the same time prevents it from shifting any further. Therefore, an additional modification is introduced to the proposed algorithms. Centroids to which no samples were assigned are omitted, but the same centroid will be reused for clustering in the next chunk.
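Both modifications can be summarized in the following sketch of the chunk-processing loop; the names stream, ml_closure_centroids and constrained_kmeans are placeholders for the chunk iterator, the initialization routine and the constrained clustering step, not actual library calls:

import numpy as np

k = 4                                        # assumed number of clusters
prev_centroids = None
for X_chunk, ml, cl in stream:               # placeholder chunk iterator
    if prev_centroids is None:
        # first chunk: initialize, e.g. from the must-link transitive closure
        centroids = ml_closure_centroids(X_chunk, ml, k)
    else:
        # subsequent chunks: warm-start from the previous chunk's centroids
        centroids = prev_centroids
    labels, centroids = constrained_kmeans(X_chunk, centroids, ml, cl)
    # clusters that received no samples keep their previous centroid unchanged,
    # so it is simply carried over and reused when clustering the next chunk
    prev_centroids = centroids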
3 Experiments
The proposed methods of adapting the k-means algorithms were tested in a series
of experiments in order to find answers to the following research questions:
RQ1 : What impact do the proposed modifications have on the ability to adapt the
algorithm to the occurring concept drifts?
RQ2 : What impact do the proposed modifications have on the clustering quality?
3.1 Research Protocol
The following experimental protocol was formulated to answer the research ques-
tions. Both synthetic streams – for which it was possible to observe the algo-
rithm’s operation with the defined behavior of the concept drift – and the real-life
data were used, allowing for actual quality assessment of the proposed methods.
Synthetic data streams were generated with both concept drift and dynamic imbalance, based on an artificial classification problem. Additionally, to create sets dedicated to clustering algorithms, two additional sets were created in which the defined clusters were generated from normal distributions with given means. This modification was aimed at creating a set dedicated to the clustering algorithm, bearing in mind that the objective of the algorithm is to correctly determine the local concentration of observations and not – as opposed to classification – to determine the correct labels.
There are no real data sets for the constrained clustering task with constraints provided by an actual expert. Therefore, it was necessary to generate the constraints from a given set of labels, for both the synthetic and the real streams. Each set can introduce a natural disproportion between ml and cl, which is related to the number of classes and their volumes. However, with a complete set of labels, it is possible to determine n(n-1)/2 pairs from n samples and to derive the constraint type from the similarity of their labels. Therefore, in the proposed method of generating the constraints, the percentage of possible sample connections is used, and it defines the number of pairs selected randomly and transformed into constraints.
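A simple sketch of such a constraint generator, assuming ground-truth labels y and illustrative function and parameter names, could be:

import numpy as np
from itertools import combinations

def generate_constraints(y, ratio=0.01, seed=None):
    # enumerate all n*(n-1)/2 index pairs and sample a fixed percentage of them
    rng = np.random.default_rng(seed)
    pairs = list(combinations(range(len(y)), 2))
    chosen = rng.choice(len(pairs), size=int(ratio * len(pairs)), replace=False)
    ml, cl = [], []
    for idx in chosen:
        i, j = pairs[idx]
        # same label -> must-link, different labels -> cannot-link
        (ml if y[i] == y[j] else cl).append((i, j))
    return ml, cl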
The quality of clustering on the selected data streams was assessed with a sequential protocol typical for this type of research. The incoming data are divided into chunks of equal size. The model is trained using the entire chunk, while the evaluation process follows the protocol for evaluating clustering algorithms. Additionally, for all tested methods, constraint lists are provided for each data chunk, and the set of labels itself is used only to compute the clustering metric.
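The evaluation loop can be sketched as follows; chunks, generate_constraints and model.partial_fit_predict are placeholders for the chunk iterator, the constraint generator above and the streaming clustering interface, while adjusted_rand_score is an actual scikit-learn call:

from sklearn.metrics import adjusted_rand_score

ari_per_chunk = []
for X_chunk, y_chunk in chunks:                  # placeholder iterator of (features, labels)
    ml, cl = generate_constraints(y_chunk, ratio=0.01)
    labels = model.partial_fit_predict(X_chunk, ml, cl)   # placeholder streaming interface
    # the ground-truth labels are used only here, to score the obtained partition
    ari_per_chunk.append(adjusted_rand_score(y_chunk, labels))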
3.2 Experimental Setup
Eight data streams were used for the experiments, four of which were synthetically generated and four were prepared based on real data sets obtained from MOA3 and CSE4. A detailed description of the data sets is presented in Table 1.
Selected data sets were divided into chunks of 200 samples. The real-life
data streams were limited to the first 200 chunks, which preserved the original
problem characteristics and provided a convenient length for reliable analysis.
The constraints were generated for each chunk using original problem labels.
The ratio of generated constraints to all possible constraints was constant along
the stream, but each data set was evaluated in three configurations of 1%, 2%,
and 5% of possible constraints. The metric selected to measure and compare the clustering quality of the evaluated algorithms was the Adjusted Rand Index (ARI).
The tested algorithms and the experimental environment were implemented in the Python programming language, using the scikit-learn [16] and stream-learn [17] libraries. The repository allowing for the replication of the presented results is available on GitHub5.
3.3 Results
The graphs show ARI scores for both proposed methods and the reference ap-
proaches. In addition, for each method, the metric values are shown (a) for the
entire data stream length and (b) averaged over all data chunks. This approach
allows for a detailed analysis of the algorithm performance while enabling com-
parison in the context of the entire problem. Finally, all presented runs were
smoothed using the cumulative sum.
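A running-mean smoothing based on the cumulative sum could look like the sketch below; the exact smoothing applied to the figures may differ:

import numpy as np

def cumulative_mean(scores):
    # running mean of chunk-wise scores obtained from their cumulative sum
    scores = np.asarray(scores, dtype=float)
    return np.cumsum(scores) / np.arange(1, len(scores) + 1)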
3https://moa.cms.waikato.ac.nz/datasets/
4https://www.cse.fau.edu/~xqzhu/stream.html
5https://github.com/w4k2/pcsp
Table 1: Summary of benchmark datasets.

Dataset | Source | Description
Blobs Gradual | Synthetic | One normal distribution per class with a drifting mean value. Two separated clusters are intersecting, then returning to the original concept.
Blobs Imbalance | Synthetic | One normal distribution per class with a drifting imbalance ratio. At the beginning, the static clusters have an equal imbalance ratio, which drifts towards 0.1 IR.
Classification Gradual | Synthetic | One distribution per class with two informative features. The mean of the cluster distributions shifts over time.
Classification Imbalance | Synthetic | One distribution per class with two informative features. A recurring imbalance drift occurs twice, varying between 0.05 and 0.95 IR.
Forest Covertype | MOA | Data contains the forest cover type obtained from the US Forest Service (USFS) Region 2 Resource Information System (RIS).
Electricity | MOA | Data collected from the Australian New South Wales Electricity Market.
Airlines | MOA | Data contains flight arrival and departure details for all the commercial flights within the USA, from October 1987 to April 2008.
KddCup99 | CSE | Data created for network intrusion detection in a simulated network. This data set was used for The Third International Knowledge Discovery and Data Mining Tools Competition.
Synthetic Datasets. Research on synthetic sets was carried out to study
the algorithm behavior concerning the occurring drifts. The results obtained for
synthetic data streams with concept drift are presented in Figure 1.
One may observe that for the analyzed problem, pck-s achieved the best results. It should be underlined that in the case of Blobs Gradual, the clustering quality of the base pck and of the proposed pck-s was equal. Moreover, all the algorithms performed poorly around the cluster intersection – which was the expected behavior – but ARI improved significantly when 5% of constraints were provided. Attention should also be paid to the large variance of the obtained metric, indicating how high its potential impact on model diversification is. At the same time, despite the inferior clustering quality of copk, a significant improvement is noticeable for copk-s. However – addressing RQ1 – it cannot be unequivocally stated that the proposed methods allow for faster adaptation to the concept drift.
For the Classification Gradual problem, it can be seen that both pck-s and copk-s are better than the base algorithms and more stable – especially in the case where 2% of constraints were used. In the runs with 1% of constraints, an interesting behavior is visible at the beginning of the stream – the overall model quality improves in the early run. This may suggest that the proposed initialization method causes some delay in achieving the algorithm's convergence, which stabilizes only after some time.
It is also essential to pay attention to the influence of the number of constraints on the tested algorithms. In all cases, increasing this parameter did not affect the clustering quality of copk and had only a slight effect on copk-s, while significantly improving the pck and pck-s algorithms, leading to a stable result in Classification Gradual with 5% of constraints.
For the Blobs Gradual problem, the pck and pck-s algorithms achieve a promising result using 5% of constraints and minimize its variance while becoming insensitive to the emerging imbalance. It is also worth noting that, in contrast to the previously analyzed problem, a much smaller decrease in the quality of pck-s clustering can be observed at the time of imbalance. A similar relationship can be observed in Classification Gradual. In the scenarios where the percentage of constraints is 0.5% and 1%, there is a sudden increase in the metric as the classes approach the prior equilibrium, followed by a rapid decrease. In the last case, pck-s adjusts to the imbalance the fastest, giving the metric its most stable value.
Real Datasets. The results of experiments carried out on real data streams are presented in Figure 2. The results analysis is carried out to address RQ2. It is impossible to indicate the best clustering algorithm for the forest covertype and electricity sets where 1% and 2% of constraints were selected. Noticeable differences between the algorithms appear only after considering 5% of constraints, which significantly improves the pck and pck-s algorithms. An interesting drop in the quality of pck clustering on the covtypeNorm set starts at chunk 75, which cannot be observed for pck-s.
Significant differences can be observed in the case of the other two sets. It can be noted that for the airlines results, all methods performed poorly, but there is an increase in the clustering quality for copk and copk-s with an increasing number of constraints. Thus, this observation seems to be an exception to the rule observed in the previous sets. However, it should be noted that the metric increased by only 0.1 between the lowest and the highest percentage of constraints.
The most interesting observations can be made on the KddCup99 set, for which copk-s also turned out to be the best clustering algorithm. An interesting relationship can also be seen for this set, according to which adding constraints degrades the quality of clustering. Most likely, this is related to the specific behavior of the algorithm concerning constraint feasibility [18].
Fig. 1: Results for synthetic datasets: (a) Blobs Imbalance, (b) Classification Imbalance, (c) Blobs Gradual, (d) Classification Gradual.
Fig. 2: Results for real datasets: (a) forest covertype, (b) electricity, (c) airlines, (d) KddCup99.
4 Conclusions
Two clustering algorithms for data streams, pck-s and copk-s, were presented in this work. Both algorithms are adapted to a task rarely discussed in the literature. The series of studies showed the ability of the proposed methods to adapt to the existing concept drifts and to obtain better results than the base algorithms.
The presented results provide the basis for further research. One of the aspects discussed in this work is the proposal to extend the proposed methods so that they are better adapted to the changing nature of the constraints. In some cases, the clustering quality showed significant variance throughout the run. An appropriate selection of constraints may be crucial for stabilizing the results.
In addition, the practical aspect of using the proposed methods should be emphasized in the context of other machine learning tasks, especially active learning. Evaluating only the relation between two samples is easier for an expert than assigning them to imposed classes, which reduces the cost of obtaining such knowledge.
Acknowledgement
This work was supported by the Polish National Science Centre under the grant
No. 2017/27/B/ST6/01325 as well as by the statutory funds of the Department
of Systems and Computer Networks, Wroclaw University of Science and Tech-
nology.
References
1. P. Ksieniewicz, P. Zyblewski, M. Choraś, R. Kozik, A. Giełczyk, and M. Woźniak,
“Fake News Detection from Data Streams,” in 2020 International Joint Conference
on Neural Networks (IJCNN). Glasgow, United Kingdom: IEEE, Jul. 2020, pp.
1–8.
2. B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Woźniak, “Ensemble
learning for data stream analysis: A survey,” Information Fusion, vol. 37, pp. 132–
156, Sep. 2017.
3. S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, and F. Herrera, “A
survey on data preprocessing for data stream mining: Current status and future
directions,” Neurocomputing, vol. 239, pp. 39–57, May 2017.
4. A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà, “New ensemble
methods for evolving data streams,” in Proceedings of the 15th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, ser. KDD ’09.
New York, NY, USA: Association for Computing Machinery, 2009, p. 139–148.
5. P. Zyblewski, R. Sabourin, and M. Woźniak, “Preprocessed dynamic classifier en-
semble selection for highly imbalanced drifted data streams,” Information Fusion,
vol. 66, pp. 138–154, Feb. 2021.
6. F. Guzy, M. Woźniak, and B. Krawczyk, “Evaluating and explaining generative
adversarial networks for continual learning under concept drift,” in 2021 Interna-
tional Conference on Data Mining Workshops (ICDMW), 2021, pp. 295–303.
7. J. Komorniczak, P. Zyblewski, and P. Ksieniewicz, “Prior Probability Estimation
in Dynamically Imbalanced Data Streams,” in 2021 International Joint Conference
on Neural Networks (IJCNN). Shenzhen, China: IEEE, Jul. 2021, pp. 1–7.
8. J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. de Carvalho,
and J. Gama, “Data stream clustering: A survey,” ACM Computing Surveys,
vol. 46, no. 1, pp. 1–31, Oct. 2013.
9. F. Cao, M. Estert, W. Qian, and A. Zhou, Density-Based Clustering over an Evolv-
ing Data Stream with Noise, pp. 328–339.
10. I. Davidson, “A Survey of Clustering with Instance Level Constraints,” ACM
Transactions on Knowledge Discovery from Data, p. 41, 2007.
11. S. González, S. García, S.-T. Li, R. John, and F. Herrera, “Fuzzy k-nearest neigh-
bors with monotonicity constraints: Moving towards the robustness of monotonic
noise,” Neurocomputing, vol. 439, pp. 106–121, Jun. 2021.
12. G. González-Almagro, J. Luengo, J.-R. Cano, and S. García, “Enhancing instance-
level constrained clustering through differential evolution,” Applied Soft Comput-
ing, vol. 108, p. 107435, Sep. 2021.
13. C. Ruiz, E. Menasalvas, and M. Spiliopoulou, “C-denstream: Using domain knowl-
edge on a data stream,” in Discovery Science. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2009, pp. 287–301.
14. M. Halkidi, M. Spiliopoulou, and A. Pavlou, “A Semi-supervised Incremental Clus-
tering Algorithm for Streaming Data,” in Advances in Knowledge Discovery and
Data Mining. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, vol. 7301, pp.
578–590, series Title: Lecture Notes in Computer Science.
15. T. Sirampuj, T. Kangkachit, and K. Waiyamai, “CE-Stream : Evaluation-based
technique for stream clustering with constraints,” in The 2013 10th International
Joint Conference on Computer Science and Software Engineering (JCSSE). Khon
Kaen, Thailand: IEEE, May 2013, pp. 217–222.
16. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine
learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830,
2011.
17. P. Ksieniewicz and P. Zyblewski, “stream-learn – open-source Python library for
difficult data stream batch analysis,” arXiv:2001.11077 [cs, stat], Jan. 2020, arXiv:
2001.11077.
18. I. Davidson, K. L. Wagstaff, and S. Basu, “Measuring Constraint-Set Utility for
Partitional Clustering Algorithms,” in Knowledge Discovery in Databases: PKDD
2006. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, vol. 4213, pp. 115–126,
series Title: Lecture Notes in Computer Science.