
Adapting k-means algorithm for pair-wise constrained clustering of imbalanced data streams

Szymon Wojciechowski¹, Germán González-Almagro², Salvador García², and Michał Woźniak¹

¹ Department of Systems and Computer Networks, Wrocław University of Science and Technology, Wrocław, Poland
szymon.wojciechowski@pwr.edu.pl

² Department of Computer Science and Artificial Intelligence (DECSAI) & DaSCI Andalusian Institute of Data Science and Computational Intelligence, University of Granada, Granada, Spain

Abstract. Contemporary society is increasingly dependent on digital media and the tools supporting its daily activities, which causes a massive increase of incoming data, both in volume and frequency. Due to this trend, unsupervised machine learning methods for data stream clustering have become a popular research topic over recent years. At the same time, semi-supervised constrained clustering is rarely considered in data stream clustering. To address this gap in the field, the authors propose adaptations of k-means constrained clustering algorithms for use in imbalanced data stream clustering. In this work, the proposed algorithms were evaluated in a series of experiments concerning synthetic and real data clustering, which verified their ability to adapt to occurring concept drifts.

Keywords: data streams, pair-wise constrained clustering, imbalanced data

1 Introduction

The development of technology and the increasing number of devices connected to a single general-purpose network have made it possible to collect and store massive amounts of constantly streaming data. Regardless of the information type, which might be readings from a sensor mesh or news outlets [1], it will always be required to process it into a resource that users can further utilize: knowledge. Therefore, it has become inevitable for the topic to be widely researched, leading to the development of data mining techniques for data streams.

Undoubtedly, one of the discussed challenges is to design algorithms capable of continuous data stream processing [2]. The resulting models must therefore adjust to constantly changing data characteristics, a phenomenon named concept drift [3]. Moreover, models have to provide a prediction before the next batch of data is available, which constrains the algorithm in terms of computational complexity.

The vast majority of recent works focus on data stream classification, introducing various approaches for dealing with concept drift. Many of the proposed methods employ classifier ensembles [4] and dynamic selection of classifiers [5]. Moreover, various methods are based on concept drift detectors [6] and prior probability estimation [7], which indicate when the data characteristics are changing.

Similarly, data streams have also been researched in the context of the clustering task [8]. The proposed algorithms often adapt existing methods to data stream processing [9]. At the same time, research on semi-supervised clustering with pair-wise constraints [10] has mainly focused on the integration of constraints into classical algorithms [11] and on methods based on optimization algorithms [12]. The adaptation of those algorithms to data streams has been proposed only in a few recent works.

One of the main techniques for this task is C-DenStream [13], which adapts the DBSCAN algorithm, employing micro-clusters constructed on consecutive chunks. Another example is the SemiStream algorithm [14], based on initial clustering by MPCk-means, followed by the detection of o-clusters and their assignment to s-clusters and m-clusters. Finally, the CE-Stream algorithm [15] extends the E-Stream algorithm, but it is limited to must-link constraint sets.

Noting that there is still a lack of pair-wise constrained clustering algorithms for data streams, the main contribution of this work is a proposal for employing the COPk-means and PCk-means algorithms for data stream clustering.

2 Algorithm

The data stream $DS$ is a sequence of data chunks $DS = \{DS_1, DS_2, \ldots, DS_k\}$. Each data chunk contains a set of samples described by a feature vector $X$, for which the clustering algorithm assigns a label describing a cluster $C$. Additionally, each chunk is provided with two lists of pair-wise constraints, $ml$ and $cl$, denoting must-link pairs and cannot-link pairs, respectively. Those extend the context of clustering with expert knowledge, which does not provide cluster labels directly, but only declares a relation between two samples.
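For illustration, such a chunk with its constraint lists can be represented as a simple container; this is a minimal sketch, and the names are ours rather than part of the proposed algorithms.

from dataclasses import dataclass
import numpy as np

@dataclass
class Chunk:
    """One data chunk of the stream with its pair-wise constraints (hypothetical container)."""
    X: np.ndarray                # samples, shape (n_samples, n_features)
    ml: list[tuple[int, int]]    # must-link pairs, as indices into X
    cl: list[tuple[int, int]]    # cannot-link pairs, as indices into X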

The k-means clustering algorithm is an inductive learning method. The model is trained iteratively to minimize the intra-cluster variance and maximize the inter-cluster variance. In the first phase of the algorithm, k cluster centroids are initialized as points in feature space. Then, each observation is assigned to a cluster by finding the nearest centroid. After this procedure, the new centroids are calculated by averaging all assigned observations. The procedure is repeated until the model reaches convergence, which means that the new centroids can no longer be shifted. An additional stop condition, which guarantees algorithm execution in feasible time, is a maximum number of iterations. The algorithm pseudocode is presented in Algorithm 1.

Algorithm 1: Generic k-means clustering

initialize_centroids()
while iteration < max_iter do
    assign_labels(X)
    recalculate_centroids()
    if convergence then
        break
    end
end
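For illustration, a minimal Python rendering of Algorithm 1 is given below. It is a sketch using NumPy with random initialization, not the exact implementation used in the experiments.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Initialize centroids by drawing k distinct samples at random.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: label each sample with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recalculate each centroid as the mean of its samples,
        # keeping a centroid in place if no sample was assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids can no longer be shifted.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels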

Constraints can be integrated into the k-means algorithm using one of two rules, defined as follows:

– hard rule: cluster labels assigned by the algorithm cannot be inconsistent with the constraints. This rule is used by the COPk-means (copk) algorithm. Each label assignment is verified for feasibility against the constraints. If one of them is violated, the algorithm terminates, leaving the rest of the samples unassigned.

– soft rule: cluster labels can be inconsistent with the constraints, but a constraint violation is penalized as an additional factor in the minimized criterion. This rule is used by the PCk-means (pck) algorithm. The criterion minimized by the algorithm is defined as follows:

$$
J_{pckm} = \frac{1}{2}\sum_{x_i \in X} \lVert x_i - \mu_{l_i} \rVert^2
+ \sum_{(x_i, x_j) \in ml} w_{ij}\,\mathbb{1}[l_i \neq l_j]
+ \sum_{(x_i, x_j) \in cl} \bar{w}_{ij}\,\mathbb{1}[l_i = l_j]
\tag{1}
$$

This approach ensures that all samples are assigned to the estimated clusters.
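To make both rules concrete, the sketch below shows a hard-rule feasibility check in the spirit of copk and the soft-rule criterion of Eq. (1) for pck. Constant penalty weights w and w̄ are assumed for simplicity, and the helper names are ours.

import numpy as np

def violates(i, cluster, labels, ml, cl):
    # Hard rule (copk): would assigning sample i to `cluster` contradict a
    # constraint involving an already-labeled sample? (labels[j] is None if
    # sample j has not been assigned yet.)
    for a, b in ml:
        other = b if i == a else a if i == b else None
        if other is not None and labels[other] is not None and labels[other] != cluster:
            return True
    for a, b in cl:
        other = b if i == a else a if i == b else None
        if other is not None and labels[other] is not None and labels[other] == cluster:
            return True
    return False

def pckm_objective(X, labels, centroids, ml, cl, w=1.0, w_bar=1.0):
    # Soft rule (pck): distortion plus penalties for violated constraints, Eq. (1).
    distortion = 0.5 * sum(np.sum((x - centroids[l]) ** 2) for x, l in zip(X, labels))
    ml_penalty = sum(w for i, j in ml if labels[i] != labels[j])
    cl_penalty = sum(w_bar for i, j in cl if labels[i] == labels[j])
    return distortion + ml_penalty + cl_penalty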

Another crucial part of the algorithm is selecting a proper cluster initialization method. The common methods are:

– random, which is very fast, but a poor initial match may lead to very slow convergence;

– kmeans++, which allows a better estimation of the initial centroids, but introduces additional computational overhead.

For constrained clustering problems, centroids can also be initialized based on the transitive closure of the must-link constraints, averaging the broadest spanning trees created during this procedure. The complexity of such a method depends on the underlying spanning-tree search algorithm.
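As an illustration of the last option, the sketch below derives initial centroids from the connected components of the must-link graph, which form its transitive closure. This is one possible reading of the procedure, and it assumes at least k components exist.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def closure_init(X, ml, k):
    # Build the must-link graph and find its connected components,
    # i.e. the transitive closure of the must-link relation.
    n = len(X)
    data = np.ones(len(ml))
    rows = [a for a, b in ml]
    cols = [b for a, b in ml]
    graph = coo_matrix((data, (rows, cols)), shape=(n, n))
    _, component = connected_components(graph, directed=False)
    # Average the k largest components to obtain the initial centroids.
    sizes = np.bincount(component)
    largest = np.argsort(sizes)[::-1][:k]
    return np.array([X[component == c].mean(axis=0) for c in largest])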

Since the goal is to adapt the algorithms to process data streams, it is necessary to change the initialization and the modification of the centroids in the copk and pck algorithms. The authors propose two new algorithms, namely copk-s and pck-s, introducing modifications to the described procedures.

Initialization In sequential data chunk processing, the model created on the previous chunk has already determined centroids that can be used in the next one. It can be assumed that those centroids are close to optimal and, at the same time, reusing them reduces additional computational effort.

Assignment The base algorithms assume that k clusters will be formed. However, a sudden concept drift may lead to the complete disappearance of one of the clusters. It may turn out that no pattern is assigned to a given centroid, which at the same time prevents it from further shifting. Therefore, an additional modification is introduced in the proposed algorithms: centroids to which no samples were assigned are omitted from the update, but the same centroid is reused for clustering in the next chunk.
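A minimal sketch of how both modifications could be combined in chunk processing is shown below; fit_constrained is a hypothetical stand-in for one copk or pck fitting pass on a chunk, not an interface defined in this work.

import numpy as np

def process_stream(chunks, k, fit_constrained, seed=0):
    rng = np.random.default_rng(seed)
    centroids = None
    for X, ml, cl in chunks:
        # Initialization: reuse the centroids fitted on the previous chunk;
        # only the very first chunk is initialized from scratch.
        if centroids is None:
            centroids = X[rng.choice(len(X), size=k, replace=False)]
        labels, fitted = fit_constrained(X, centroids, ml, cl)
        # Assignment: a centroid with no assigned samples is omitted from the
        # update, but kept unchanged so it can be reused in the next chunk.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = fitted[j]
        yield labels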

3 Experiments

The proposed adaptations of the k-means algorithms were tested in a series of experiments in order to find answers to the following research questions:

RQ1: What impact do the proposed modifications have on the algorithms' ability to adapt to the occurring concept drifts?

RQ2: What impact do the proposed modifications have on the clustering quality?

3.1 Research Protocol

The following experimental protocol was formulated to answer the research questions. Both synthetic streams, for which it was possible to observe the algorithms' operation under a defined concept drift behavior, and real-life data were used, allowing for an actual quality assessment of the proposed methods.

Synthetic data streams were generated with both concept drift and dynamic imbalance, based on an artificial classification problem. Additionally, to create a benchmark dedicated to clustering algorithms, two additional sets were created, in which the defined clusters were generated from normal distributions at the given distribution means. This modification was aimed at creating a set dedicated to the clustering algorithm, bearing in mind that the objective of the algorithm is to correctly determine the local concentration of observations and not, as opposed to classification, to determine the correct labels.
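The sketch below illustrates one way such a drifting synthetic stream could be generated. The two-cluster geometry and the drift profile are our own illustrative choices, not the exact generator used in the experiments.

import numpy as np

def blobs_gradual(n_chunks=200, chunk_size=200, seed=0):
    # Two Gaussian clusters whose means drift towards each other and back,
    # emulating the intersect-and-return behavior of Blobs Gradual.
    rng = np.random.default_rng(seed)
    for t in range(n_chunks):
        drift = np.sin(np.pi * t / (n_chunks - 1))  # 0 -> 1 -> 0 over the stream
        m0 = np.array([-2.0, 0.0]) * (1.0 - drift)  # both means reach the origin
        m1 = np.array([+2.0, 0.0]) * (1.0 - drift)  # mid-stream, then return
        n0 = chunk_size // 2
        X = np.vstack([rng.normal(m0, 1.0, size=(n0, 2)),
                       rng.normal(m1, 1.0, size=(chunk_size - n0, 2))])
        y = np.repeat([0, 1], [n0, chunk_size - n0])
        yield X, y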

There are no real data sets for the constrained clustering task with constraints provided by an actual expert. Therefore, it was necessary to generate the constraints from a given set of labels for both the synthetic and real streams. Each set can introduce a natural disproportion between $ml$ and $cl$, which is related to the number of classes and their volume. However, with a complete set of labels, it is possible to determine $\binom{n}{2}$ pairs of $n$ samples and the constraint type based on the equality of their labels. Therefore, the proposed method of generating the constraints uses the percentage of possible sample connections, which defines the number of pairs selected randomly and transformed into constraints.
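A minimal sketch of this generation procedure, given a chunk's label vector y and the percentage of possible pairs, could look as follows (the function name is ours):

import numpy as np
from itertools import combinations

def make_constraints(y, ratio, seed=0):
    # Enumerate all n-choose-2 pairs, draw the requested fraction at random,
    # and derive the constraint type from label (in)equality.
    rng = np.random.default_rng(seed)
    pairs = list(combinations(range(len(y)), 2))
    chosen = rng.choice(len(pairs), size=int(ratio * len(pairs)), replace=False)
    ml = [pairs[i] for i in chosen if y[pairs[i][0]] == y[pairs[i][1]]]
    cl = [pairs[i] for i in chosen if y[pairs[i][0]] != y[pairs[i][1]]]
    return ml, cl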

The quality of clustering on the selected data streams was assessed in a sequential protocol typical for this type of research. The incoming data are divided into chunks of equal size. The model is trained using the entire chunk, while the evaluation process follows the protocol for evaluating clustering algorithms. Additionally, for all tested methods, constraint lists are provided for each data chunk, and the set of labels itself is only used to compute the clustering metric.
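For reference, this sequential protocol can be sketched as below, reusing make_constraints from the previous sketch. The model interface fit_predict(X, ml, cl) is a hypothetical placeholder; only the ARI computation via scikit-learn is taken as given.

from sklearn.metrics import adjusted_rand_score

def evaluate_stream(chunks, model, ratio=0.01):
    scores = []
    for X, y in chunks:
        # Constraints are generated from the labels for every chunk...
        ml, cl = make_constraints(y, ratio)
        pred = model.fit_predict(X, ml, cl)
        # ...while the labels themselves are used only for the metric.
        scores.append(adjusted_rand_score(y, pred))
    return scores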

3.2 Experimental Setup

Eight data streams were used for the experiments, four of which were synthetically generated, and four were prepared based on real data sets obtained from MOA³ and CSE⁴. A detailed description of the data sets is presented in Table 1.

The selected data sets were divided into chunks of 200 samples. The real-life data streams were limited to the first 200 chunks, which preserved the original problem characteristics and provided a convenient length for reliable analysis. The constraints were generated for each chunk using the original problem labels. The ratio of generated constraints to all possible constraints was constant along the stream, but each data set was evaluated in three configurations: 1%, 2%, and 5% of possible constraints. The metric selected to measure and compare the clustering quality of the evaluated algorithms was the Adjusted Rand Index (ARI).

The tested algorithms and the experimental environment were implemented in the Python programming language, using the scikit-learn [16] and stream-learn [17] libraries. The repository allowing for the replication of the presented results is available on GitHub⁵.

3.3 Results

The graphs show ARI scores for both the proposed methods and the reference approaches. In addition, for each method, the metric values are shown (a) for the entire data stream length and (b) averaged over all data chunks. This approach allows for a detailed analysis of algorithm performance while enabling comparison in the context of the entire problem. Finally, all presented runs were smoothed using a cumulative sum.

³ https://moa.cms.waikato.ac.nz/datasets/
⁴ https://www.cse.fau.edu/~xqzhu/stream.html
⁵ https://github.com/w4k2/pcsp

Table 1: Summary of benchmark datasets.

– Blobs Gradual (Synthetic): One normal distribution per class with a drifting mean value. Two separated clusters intersect, then return to the original concept.

– Blobs Imbalance (Synthetic): One normal distribution per class with a drifting imbalance ratio. At the beginning, the static clusters have an equal imbalance ratio, which drifts towards 0.1 IR.

– Classification Gradual (Synthetic): One distribution per class with two informative features. The mean of the cluster distributions shifts over time.

– Classification Imbalance (Synthetic): One distribution per class with two informative features. A recurring imbalance drift occurs twice, varying between 0.05 and 0.95 IR.

– Forest Covertype (MOA): Forest cover type data obtained from the US Forest Service (USFS) Region 2 Resource Information System (RIS).

– Electricity (MOA): Data collected from the Australian New South Wales Electricity Market.

– Airlines (MOA): Flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008.

– KddCup99 (CSE): Data created for network intrusion detection in a simulated network. This data set was used for the Third International Knowledge Discovery and Data Mining Tools Competition.

Synthetic Datasets. Research on synthetic sets was carried out to study the algorithms' behavior with respect to the occurring drifts. The results obtained for synthetic data streams with concept drift are presented in Figure 1.

One may observe that for the analyzed problem, pck-s achieved the best results. It should be underlined that in the case of Blobs Gradual, the clustering quality of the base pck and of the proposed method was equal. Moreover, all the algorithms performed poorly around the cluster intersection, which was the expected behavior, but ARI significantly improved when 5% of constraints were provided. At the same time, attention should be paid to the massive variance of the obtained metric, indicating how high its potential impact on model diversification is. Meanwhile, despite the inferior clustering quality of copk, a significant improvement is noticeable for copk-s. However, addressing RQ1, it cannot be unequivocally stated that the proposed methods allow for faster adaptation to the concept drift.

For the Classification Gradual problem, it can be seen that both pck-s and copk-s are better than the base algorithms and more stable, especially in the case where 2% of constraints were used. In the runs with 1% of constraints, an interesting behavior is visible at the beginning of the stream: the overall model quality improves in the early run. It may suggest that the proposed initialization method causes some delay in achieving the algorithm's convergence, which stabilizes only after some time.

It is also essential to pay attention to the influence of the number of constraints on the tested algorithms. In all cases, increasing this parameter did not affect the clustering quality of copk and had only a slight effect on copk-s, while significantly improving the pck and pck-s algorithms, leading to a stable result in Classification Gradual with 5% of constraints.

For the Blobs Imbalance problem, the pck and pck-s algorithms achieve promising results using 5% of constraints and minimize their variance while becoming insensitive to the emerging imbalance. It is also worth noting that, in contrast to the previously analyzed problem, a much slighter decrease in the quality of pck-s clustering can be observed at the time of imbalance. A similar relationship can be observed in Classification Imbalance. In the scenarios where the percentage of constraints is 1% and 2%, there is a sudden increase in the metric as the classes approach the prior equilibrium, followed by a rapid decrease. In the last case, pck-s adjusts to the imbalance the fastest, giving the metric the most stable value.

Real Datasets. The results of experiments carried out on real data streams are presented in Figure 2. The analysis of the results is carried out to address RQ2. It is impossible to indicate the best clustering algorithm for the forest covertype and electricity sets when 1% and 2% of constraints were selected. Noticeable differences between the algorithms appear only after considering 5% of constraints, which significantly improves the pck and pck-s algorithms. An interesting drop in the quality of pck clustering on the covtypeNorm set starts in chunk 75, which cannot be observed for pck-s.

Significant differences can be observed in the case of the other two sets. It can be noted that on airlines all methods performed poorly, but there is an increase in the clustering quality of copk and copk-s with an increasing number of constraints. Thus, this observation seems to be an exception to the rule observed in the previous sets. However, it should be noted that the metric only increased by 0.1 between the lowest and the highest percentage of constraints.

The most interesting observations can be made on the KddCup99 set, for which copk-s also turned out to be the best clustering algorithm. An interesting relationship can also be seen for this set, according to which adding constraints degrades the quality of clustering. Most likely, it is related to the specific behavior of the algorithm concerning constraint feasibility [18].

[Figure] Fig. 1: Results for synthetic datasets: (a) Blobs Imbalance, (b) Classification Imbalance, (c) Blobs Gradual, (d) Classification Gradual.

[Figure] Fig. 2: Results for real datasets: (a) forest covertype, (b) electricity, (c) airlines, (d) KddCup99.

4 Conclusions

Two clustering algorithms for data streams, pck-s and copk-s, were presented in this work. Both algorithms are adapted to a task rarely discussed in the literature. The series of studies showed the ability of the proposed methods to adapt to the existing concept drifts and to obtain better results than the base algorithms.

The presented results provide the basis for further research. One of the aspects discussed in this work is the proposal to extend the proposed methods to be better adapted to the changing nature of the constraints. In some cases, the clustering quality showed significant variance throughout the run. An appropriate selection of constraints may be crucial for stabilizing the results.

In addition, the practical aspect of using the proposed methods should be emphasized in the context of other machine learning tasks, especially active learning. Evaluating only the relation between two samples is easier for an expert than assigning them to imposed classes, which reduces the cost of obtaining such knowledge.

Acknowledgement

This work was supported by the Polish National Science Centre under the grant No. 2017/27/B/ST6/01325 as well as by the statutory funds of the Department of Systems and Computer Networks, Wrocław University of Science and Technology.

References

1. P. Ksieniewicz, P. Zyblewski, M. Choras, R. Kozik, A. Gielczyk, and M. Wozniak, "Fake News Detection from Data Streams," in 2020 International Joint Conference on Neural Networks (IJCNN). Glasgow, United Kingdom: IEEE, Jul. 2020, pp. 1–8.

2. B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Woźniak, "Ensemble learning for data stream analysis: A survey," Information Fusion, vol. 37, pp. 132–156, Sep. 2017.

3. S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, and F. Herrera, "A survey on data preprocessing for data stream mining: Current status and future directions," Neurocomputing, vol. 239, pp. 39–57, May 2017.

4. A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà, "New ensemble methods for evolving data streams," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 139–148.

5. P. Zyblewski, R. Sabourin, and M. Woźniak, "Preprocessed dynamic classifier ensemble selection for highly imbalanced drifted data streams," Information Fusion, vol. 66, pp. 138–154, Feb. 2021.

6. F. Guzy, M. Woźniak, and B. Krawczyk, "Evaluating and explaining generative adversarial networks for continual learning under concept drift," in 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 295–303.

7. J. Komorniczak, P. Zyblewski, and P. Ksieniewicz, "Prior Probability Estimation in Dynamically Imbalanced Data Streams," in 2021 International Joint Conference on Neural Networks (IJCNN). Shenzhen, China: IEEE, Jul. 2021, pp. 1–7.

8. J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. de Carvalho, and J. Gama, "Data stream clustering: A survey," ACM Computing Surveys, vol. 46, no. 1, pp. 1–31, Oct. 2013.

9. F. Cao, M. Estert, W. Qian, and A. Zhou, "Density-Based Clustering over an Evolving Data Stream with Noise," pp. 328–339.

10. I. Davidson, "A Survey of Clustering with Instance Level Constraints," ACM Transactions on Knowledge Discovery from Data, p. 41, 2007.

11. S. González, S. García, S.-T. Li, R. John, and F. Herrera, "Fuzzy k-nearest neighbors with monotonicity constraints: Moving towards the robustness of monotonic noise," Neurocomputing, vol. 439, pp. 106–121, Jun. 2021.

12. G. González-Almagro, J. Luengo, J.-R. Cano, and S. García, "Enhancing instance-level constrained clustering through differential evolution," Applied Soft Computing, vol. 108, p. 107435, Sep. 2021.

13. C. Ruiz, E. Menasalvas, and M. Spiliopoulou, "C-DenStream: Using domain knowledge on a data stream," in Discovery Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 287–301.

14. M. Halkidi, M. Spiliopoulou, and A. Pavlou, "A Semi-supervised Incremental Clustering Algorithm for Streaming Data," in Advances in Knowledge Discovery and Data Mining. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, vol. 7301, pp. 578–590, series title: Lecture Notes in Computer Science.

15. T. Sirampuj, T. Kangkachit, and K. Waiyamai, "CE-Stream: Evaluation-based technique for stream clustering with constraints," in The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE). Khon Kaen, Thailand: IEEE, May 2013, pp. 217–222.

16. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

17. P. Ksieniewicz and P. Zyblewski, "stream-learn – open-source Python library for difficult data stream batch analysis," arXiv:2001.11077 [cs, stat], Jan. 2020.

18. I. Davidson, K. L. Wagstaff, and S. Basu, "Measuring Constraint-Set Utility for Partitional Clustering Algorithms," in Knowledge Discovery in Databases: PKDD 2006. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, vol. 4213, pp. 115–126, series title: Lecture Notes in Computer Science.