Content available from Applied Intelligence

https://doi.org/10.1007/s10489-022-04231-7

Cluster-based stability evaluation in time series data sets

Gerhard Klassen¹ · Martha Tatusch¹ · Stefan Conrad¹

Accepted: 25 September 2022

©The Author(s) 2022

Abstract

In modern data analysis, time is often considered just another feature. Yet time has a special role that is regularly overlooked.

Procedures are usually only designed for time-independent data and are therefore often unsuitable for the temporal aspect

of the data. This is especially the case for clustering algorithms. Although there are a few evolutionary approaches for time-

dependent data, the evaluation of these and therefore the selection is difficult for the user. In this paper, we present a general

evaluation measure that examines clusterings with respect to their temporal stability and thus provides information about the

achieved quality. For this purpose, we examine the temporal stability of time series with respect to their cluster neighbors,

the temporal stability of clusters with respect to their composition, and finally conclude on the temporal stability of the entire

clustering. We summarise these components in a parameter-free toolkit that we call Cluster Over-Time Stability Evaluation

(CLOSE). In addition to that we present a fuzzy variant which we call FCSETS (Fuzzy Clustering Stability Evaluation

of Time Series). These toolkits enable a number of advanced applications. One of these is parameter selection for any

type of clustering algorithm. We demonstrate parameter selection as an example and evaluate results of classical clustering

algorithms against a well-known evolutionary clustering algorithm. We then introduce a method for outlier detection in

time series data based on CLOSE. We demonstrate the practicality of our approaches on three real world data sets and one

generated data set.

Keywords Time series clustering · Over-time stability evaluation · Evolutionary clustering · Anomalous subsequences

1 Introduction

With the increase of time series (TS) data, their analysis

is becoming more and more important. There are many

different approaches which are all suitable for different

setups. However, most of the methods target the analysis

of individual time series, while only a few aim to analyse

whole TS databases. Without any doubt, the information

gained from a time series database can have a significant

influence on the results, especially compared to an analysis

applied to only one time series of the database.

Gerhard Klassen and Martha Tatusch contributed equally to this research.

Gerhard Klassen: gerhard.klassen@hhu.de

Martha Tatusch: martha.tatusch@hhu.de

Stefan Conrad: stefan.conrad@hhu.de

¹ Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany

A setting which illustrates this circumstance is the stock

market: During an economic crisis most of the shares lose

value. Considering only one share at a time could lead to a

false interpretation (e.g. an outlier sequence within the time

series), whereas considering all time series simultaneously

would lead to a different assessment.

Although the mentioned setting describes extreme

circumstances, it is obvious that similar problems in analysis

and interpretation also occur under normal conditions.

The examination of such setups can prove to be

very difficult, since it can be useful not to look at the

whole database at once, but to look at specific groups

instead. This requires the identification of groups which

is often accomplished by applying suitable clustering

algorithms. Although this is a well researched topic

for time independent data, approaches for time series

are often insufficient, sometimes to an extent that the

produced results are meaningless [25]. As this has been

identified as a major problem in time series clustering, the

research field of evolutionary clustering developed. According

/ Published online: 13 December 2022

Applied Intelligence (2023) 53:16606–16629

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

to [8], evolutionary clustering produces a clustering per

timestamp, hence a series of clusterings. Each clustering

should be similar to the clustering of the preceding timestamp, while

accurately reflecting the properties of its own data. As this

definition does not refer to a specific clustering algorithm,

it has led to a variety of approaches adapted to different

clustering algorithms (see Section 2).

There are also approaches which try to define the necessary

adjustments to a standard clustering algorithm to receive

an evolutionary clustering algorithm [8,10]. However, the

amount of different approaches and different clustering

algorithms makes it difficult to select a suitable method for

a certain task.

The detection of groups in time series can provide

important insights into the data at hand. The application

areas of our toolkits and the methods based on them, such as

outlier detection, are diverse. One conceivable application

of our methods is on medical data, where patients could be

identified who were initially grouped with healthy patients

and whose medical values then slowly move away from this

group. Another area of application is the financial market,

where, for example, companies can be grouped that behave

similarly over time, so that classic company classifications

such as the Standard Industrial Classification (SIC) or the

North American Industry Classification System (NAICS)

can be usefully supplemented. Companies that change

their group more frequently in relation to other companies

may have anomalies that our outlier detection method

would identify. Analyses of the current Corona epidemic

are also conceivable. Using data from the Coronavirus

Government Response Tracker at the University of Oxford1,

the effectiveness of government measures to contain the

epidemic could be analysed. It would also be possible to

identify how good the respective chosen timing of a measure

was. There are countless applications where our methods

can be used. In addition, all our methods are transparent and

provide explainable results.

In this paper we describe two fundamental methods

to evaluate time series clustering according to its over-

time stability. The first algorithm CLOSE (Cluster Over-

Time Stability Evaluation) [51] is designed for multivariate

time series in crisp cluster environments. Hereby we

use an extended definition of evolutionary clustering:

Instead of targeting the similarity of two successive

clusterings we demand the similarity of a clustering to all

previous clusterings. We call this the over-time stability.

We introduced it because small changes between two

timestamps can accumulate into large changes over several time

steps. Such changes would be overlooked when considering

only two consecutive timestamps. A simple example for this

1 https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker

problem is the Covid-19 infection rate in different countries:

If one country changes its cluster peers from one point in

time to the other, this may be reasonable. However, if the

country changes its cluster peers at every time point,

considering solely the previous timestamp is not sufficient,

since it does not capture the historical changes. Therefore this

country could not be directly compared to other countries,

since among those there might be countries which have

changed their peers as well. Hence, the changes before the

previous timestamp contribute to the overall stability.

In the course of this paper we will show that this

adaptation of the definition is especially handy for certain

applications like outlier detection or parameter selection

for time series clustering. Our methodology is also very

different from other approaches in this field of research. In

contrast to a framework or an adapted clustering algorithm,

CLOSE is a ready-to-use toolkit. It does not require

any customization of the user-chosen clustering algorithm,

instead it analyses the produced clusterings per timestamp

and returns a stability score. This can be used to find the best

parameter setting for the underlying clustering algorithm.

The second algorithm FCSETS (Fuzzy Clustering

Stability Evaluation of Time Series) [30] is a toolkit

developed for fuzzy clustering environments. It makes

use of the relative assignment agreement similar to the

equivalence relation in the Hüllermeier-Rifqi Index [20] and

achieves a stability score by regarding the average weighted

difference between the relative assignment agreements of

one time series to the others. The methodology of FCSETS

is very similar to the one of CLOSE and therefore further

adjustments of the chosen underlying fuzzy clustering

algorithm are not required. Further, we present an

outlier detection algorithm [50] which is an application

of CLOSE. We give two variants [52] of the procedure

which focus on cluster transitions and are therefore capable

of detecting a new kind of outlier, which is based on the

behavior of a time series in relation to its cluster peers. The

implementation of the approaches as well as the generated

data sets are available on Github2.

In order to present the results of the introduced

algorithms, we use three real world data sets and one

generated data set. We apply CLOSE in combination

with DBSCAN [16] and K-Means [38] and FCSETS in

combination with Fuzzy C-Means [6] to the selected data

sets to get the best parameter settings. We qualitatively

analyse the resulting clusterings and in the case of K-

Means we subsequently use the possibility to compare the

CLOSE score with that of the evolutionary K-Means from

[8]. Further, we apply the outlier detection algorithms to the

data sets and explain the results in detail.

2https://github.com/tatusch/ots-eval

2 Related work

Since this work addresses many different problems and

approaches, such as the (over-time) stability evaluation

of (fuzzy) clusters and the detection of anomalous

subsequences, this section deals with related work from

various domains.

2.1 Time series clustering

There are various techniques for clustering TS data in

the field of time series analysis. In [56] the approaches

are divided into three categories: raw-data-based, feature-

based and model-based clustering algorithms. The first type

describes approaches, which consider the TS data without

any preprocessing. The second one works with feature

vectors extracted from the time series. In the third case,

models are approximated for the representation of the TS

data.

When considering approaches that work with the

unprocessed TS data that is given, a common approach

is clustering subsequences of a time series [3,22]. As

this is usually done in order to find motifs in time series,

only a single TS is considered at once. This approach

is controversial, since Keogh et al. state in [25] that

the clustering of subsequences of a single time series is

meaningless. Chen, however, argues that it is possible to

obtain meaningful results if the correct distance measure is

used [9]. In our context, the clustering has to be applied to

multiple time series, though. Clustering subsequences has

some disadvantages. First, outlier data points may have a

negative impact on the results. Second, the determination

of a meaningful length for the considered subsequences is

difficult but needed, since the examination of subsequences

of all lengths is usually very time-consuming. In our

approaches, subsequences of any length can selectively

be investigated and therefore provide more insights.

Nevertheless, it has to be noted, that only subsequences

starting at the first existing timestamp are considered. This

is reasoned by the assumption, that the entire time course

from the beginning is relevant for the analysis.

Another raw-data-based approach is the clustering of

entire sequences [15,17,41]. Since potential correlations

between subsequences of different TS are not recognized,

this procedure is not suitable for our applications.

In our context, the exact course of time series is not

relevant, but rather the trend they follow. This can e.g.

be achieved by algorithms using Dynamic Time Warping

(DTW) as distance measure [2,11,21,33] or methods of

the second type, where the sequences are transformed to

feature vectors first [19]. By extracting relevant features,

the exact course gets blurred. However, the problem of not

recognizing correlating subsequences still persists.

When considering the third type of TS clustering, a

major approach is the usage of autoregressive integrated moving-

average (ARIMA) models [42,57]. For this purpose, an ARIMA

model/mixture is fitted to every time series. Those

sequences, whose models are similar to each other, are

grouped to the same cluster. Also, the sequences can be

modeled by the Haar Wavelet decomposition [54], their

approximated seasonality [31] or with the help of Markov

Chains [44]. However, all approaches share the idea of

clustering whole time series. In our application, correlating

subsequences and the movement of sequences with regard

to their neighbors are of interest. Therefore, those methods

are not applicable.

Approaches, which deal with the clustering of streaming

data [18,40] are also not comparable to our method, as they

deal with other problems such as high memory requirements

and time complexity, and in addition to that usually consider

only one sequence at once.

2.2 Evolutionary clustering

Evolutionary clustering describes the task of clustering

temporal data per timestamp under the consideration of two

criteria: on the one hand, the clustering should be reasonable

for the current data, and on the other the clustering should

not deviate significantly from one timestamp to another

[8]. Different frameworks have been developed, which

meet both criteria regarding streaming data [10], TS data

[58] and dynamic networks [27]. The framework, which is

presented in [8], for instance, is developed for streaming

data and therefore an incremental approach, which for each

timestamp $t$ tries to find a clustering $C_t$ that optimizes the

following formula:

$$sq(C_t, M_t) - cp \cdot hc(C_{t-1}, C_t), \tag{1}$$

where $sq(C_t, M_t)$ is the snapshot quality regarding an

object relationship matrix $M_t$, $cp$ is a change parameter

and $hc(C_{t-1}, C_t)$ is the history cost. The snapshot quality

measures the quality of a clustering at a certain time point

with respect to the calculated $n \times n$ matrix $M_t$ which

represents the relationship of all $n$ objects to each other.

The history cost is calculated by the comparison of the

clusterings of two consecutive time points, whereby the

comparison may be applied on different data levels. For

example, simply the partitions of both clusterings may

be compared, or the best matching between two sets of

centroids regarding K-Means [38]. The change parameter

cp > 0 is a hyperparameter which trades off between

sq and hc. With this flexible framework a stable over-

time clustering may be achieved, which can be used as the

underlying clustering for our outlier detection algorithm.

Yet, due to the comparison of only consecutive time points,

short-term changes may have a strongly negative impact on

the result and large long-term changes may occur, which is

not desirable.
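
The trade-off in (1) can be illustrated with a minimal Python sketch. It is not the framework of [8]: the snapshot quality is assumed to be the negative sum of squared errors on one-dimensional data, the history cost is assumed to be the fraction of object pairs that are grouped differently at two consecutive timestamps, and all function names and data are hypothetical.

```python
from itertools import combinations


def sse(points, labels):
    """Sum of squared distances of 1-D points to the mean of their cluster."""
    clusters = {}
    for x, label in zip(points, labels):
        clusters.setdefault(label, []).append(x)
    total = 0.0
    for members in clusters.values():
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return total


def pair_disagreement(prev_labels, labels):
    """Fraction of object pairs grouped differently in the two clusterings."""
    pairs = list(combinations(range(len(labels)), 2))
    changed = sum(
        (prev_labels[i] == prev_labels[j]) != (labels[i] == labels[j])
        for i, j in pairs
    )
    return changed / len(pairs)


def objective(points, labels, prev_labels, cp):
    """Value of formula (1) under the assumed choices of sq and hc."""
    sq = -sse(points, labels)                    # assumption: higher is better
    hc = pair_disagreement(prev_labels, labels)  # assumption: partition-level cost
    return sq - cp * hc


def best_candidate(points, candidates, prev_labels, cp):
    """Name of the candidate clustering that maximises the objective."""
    return max(
        candidates,
        key=lambda name: objective(points, candidates[name], prev_labels, cp),
    )


# Hypothetical toy data: the middle point has drifted towards the right group.
points = [0.0, 0.1, 0.6, 0.9, 1.0]
prev_labels = [0, 0, 0, 1, 1]
candidates = {"keep": [0, 0, 0, 1, 1], "move": [0, 0, 1, 1, 1]}

print(best_candidate(points, candidates, prev_labels, cp=0.1))
print(best_candidate(points, candidates, prev_labels, cp=1.0))
```

With a small change parameter the clustering that fits the current data better is preferred, while a large change parameter favours the clustering that agrees with the previous timestamp.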

The problem of identifying so called Moving Clusters

[23] seems to be a closely related topic, but addresses

a slightly different task. In contrast to clustering time

series, this field of research deals with the detection of

already given clusters that remain mostly the same with

regard to their members. In addition, it is assumed that a

cluster remains approximately the same size over time. This

may apply to some tasks, such as herd tracking, which is

examined in [23], but in most cases this requirement can not

be met.

2.3 Internal cluster evaluation measures

For the evaluation of clusters and clusterings, various

evaluation measures have been developed over the years.

There are two types of cluster evaluation measures: external

and internal measures. The difference between the two is,

that while the expected result – also known as ground truth

– is known for the external measures, it is missing for

the internal ones. Therefore, external evaluation measures

make a qualitative comparison between the expected and

the real result. Internal measures, however, focus on other

describing characteristics, such as the compactness or

separation of clusters in order to evaluate the quality of the

result.

One common metric is the Sum of Squared Errors (SSE)

that evaluates the compactness of clusters. In case of fuzzy

clusterings this measure can be used by weighting the

membership degrees. The SSE is based on the calculation of

the overall distance between the members and the centroid

of a cluster. The centroid usually describes the mean of

all cluster members. Since this measure only considers

the compactness of clusters, further validity measures have

been developed, which evaluate the compactness as well

as the separation. Common examples are the Silhouette

Coefficient [47], Davies-Bouldin Index [12] or Dunn Index

[14]. When considering fuzzy clusterings, there are for

example validity measures which use only membership

degrees [28,35] or include the distances between data points

and cluster prototypes [5,7].

However, all these metrics cannot directly be compared

to our method since they lack a temporal aspect, but they

can be applied in our stability evaluation methods.

2.4 Stability evaluation of clusterings

There are also several approaches addressing the stability

measurement of a clustering algorithm. One example is the

Rand Index [45], which is usually intended for the external

evaluation of a clustering. Given the clustering $\zeta_p$ and the

expected result $\zeta_t$, it examines on the one hand all object

pairs that are located in the same cluster in $\zeta_p$ as well as

$\zeta_t$, and on the other hand all pairs that belong to different

clusters in both clusterings. The measure is defined by the

number of corresponding object pairs in relation to the

number of all possible object pairs.
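
The pair counting described above can be sketched in a few lines of Python. The function name and the label-list representation of a clustering are assumptions made for this illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand Index of two clusterings given as label lists over the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two objects are required")
    # A pair agrees if it is together in both clusterings or apart in both.
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Identical partitions with renamed labels agree on every pair.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Moving one object changes three of the six pairs.
print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because only co-membership of pairs is compared, the index does not depend on how the clusters are named.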

The measurement of the stability of a clustering

algorithm is for instance executed when searching for

the optimal parameter setting. In 2002, Roth et al. [46]

introduced the resampling approach for cluster validation.

Roth et al. put forward the hypothesis that if multiple

partitionings of a clustering algorithm for the same

parameter setting are similar to each other, the parameter

setting is good. The higher the similarity, the better is the

parameter choice.

The unsupervised cluster stability value $s(c)$ that is used

in Roth et al.'s approach [46] is calculated as the average

pairwise distance between $m$ partitionings:

$$s(c) = \frac{\sum_{i=1}^{m-1} \sum_{j=i+1}^{m} d(U_{c_i}, U_{c_j})}{m \cdot (m-1)/2}, \tag{2}$$

where $U_{c_i}$ and $U_{c_j}$, $1 \le i < j \le m$, are two

partitionings produced for $c$ clusters and $d(U_{c_i}, U_{c_j})$ is

an arbitrary similarity index of partitionings. The Rand

Index can be used for stability evaluation by including it

in this formula. Such stability measures pursue a different

objective and obviously do not take a temporal linkage

into consideration [55]. Our stability measure is similar

to the unsupervised cluster stability value but it includes

the temporal dependencies of clusterings. An intuitive idea

for achieving a temporal linkage would be to simply

compare clustering pairs of successive points in time. This

approach would strongly weight variation between two

points in time and neglect long-term changes. An ongoing

change would for instance be punished only slightly, since

consecutive clusterings would be very similar, while short-

term deviations would stand out, although the overall

behavior might be stable. Also, the index would be strongly

negatively affected by separations or merges of clusters of

successive time points. Even when comparing clustering

pairs of all different time points these problems would

persist.
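
As an illustration of the unsupervised cluster stability value in (2), the following Python sketch averages a similarity index over all pairs of partitionings. The Rand Index is plugged in as the similarity index, as suggested in the text. The function names and the toy partitionings are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which both partitionings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def stability_value(partitionings, similarity):
    """Average pairwise similarity of m partitionings, following formula (2)."""
    m = len(partitionings)
    if m < 2:
        raise ValueError("at least two partitionings are required")
    total = sum(
        similarity(partitionings[i], partitionings[j])
        for i, j in combinations(range(m), 2)
    )
    return total / (m * (m - 1) / 2)


# Three hypothetical partitionings of four objects for the same parameter setting.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
print(stability_value(runs, rand_index))
```

Note that this value treats the partitionings as an unordered collection, which is exactly why it carries no temporal information.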

In addition, the referred methods exclusively evaluate

the (over-time) stability of clusterings. As stated in [4,32],

however, stability alone is not sufficient for a proper

evaluation of a clustering. CLOSE takes both into account,

the over-time stability as well as the quality of a clustering,

to give an overall rating for an over-time clustering.

2.5 Anomaly detection in time series

When regarding works dealing with outlier detection in

time series, various definitions of the term outlier can be

found. Many approaches consider only single conspicuous

data points such as additive outliers or change points

[24,37,43] and focus on a single time series [1,39].

However, in our context the detection of anomalous

subsequences is considered, so that only algorithms, which

either handle outlier subsequences or analyse the group

behavior of multiple time series over time, are relevant.

For the latter, approaches such as Probabilistic Suffix

Trees (PST) [49], Random Block Coordinate Descents

(RBCD) [59] and various neural networks [26] have been

developed and been shown to achieve convincing results.

However, while these methods examine the deviation of

one time series to all others in the data set, we focus

on the behavior of a time series compared to its steady

neighbors, since the consideration of the whole data set is

only meaningful, if all TS have a similar course. This is

for example the case in sensor data. In order to analyse

the group behavior over time, we first have to identify

continuous peers by clustering the TS data per time point.

Then, the transitions of sequences between different clusters

over time can be analysed. This type of transitions is also

evaluated in cluster evolution methods. Landauer et al. [34]

make use of such a method in order to calculate a prediction-

based anomaly score for a single data point. Similar to

our approach, the TS data is clustered per timestamp.

The cluster transitions of a considered time series are

then analysed by cluster evolution methods in order to

approximate a model which predicts the next data point.

Although groups of time series are identified, the detection

of outliers is therefore based on the prediction of a single

sequence. In contrast to Landauer et al. we refer to several

time series.

Our approach is very different from clustering whole

time series or their subsequences, since in that case the

outlier detection relies on the single fact whether a sequence

is assigned to a cluster or not. Such an approach does

not take the cluster transitions of a sequence into account,

which may be an expressive feature on its own. Hence, our

approach might recognize anomalous subsequences which

in a subsequence clustering would have been assigned to a

cluster and therefore not been marked as outlier.

Apart from clustering subsequences, there are also other

approaches for the detection of conspicuous subsequences

or so called discords [25,36]. Those often consider only

a single time series at once. Therefore, only anomalous

behavior with regard to the course of one sequence is

recognized. Though, in the context of the whole data set,

this behavior might for example be normal. Such methods

are thus not applicable in our context.

3 Methodology

The agglomeration of similar time series is a problem which

arises in many applications. There are various approaches

and a lot of research has been conducted in this field. Since the

definitions differ in related works, we first present our

notations of relevant concepts for our work. Subsequently,

we will describe the principles of our approaches CLOSE

[51] and FCSETS [30].

3.1 Notations

The following definitions are based on our previous works

[30,50,51].

Definition 1 (Data Set) A data set $D = \{T_1, \ldots, T_m\}$ is a set of $m$ time series of same length $n$ and equivalent points in time. Equivalent means, that they are either identical or they can be mapped to a reference timestamp.

Definition 2 (Time Series) A time series $T = o_{t_1}, \ldots, o_{t_n}$ is an ordered set of $n$ real valued data points of arbitrary dimension. The data points are chronologically ordered by their time of recording, with $t_1$ and $t_n$ indicating the first and last timestamp, respectively.

The vectors of all time series are denoted as the set $O = \{o_{t_1,1}, \ldots, o_{t_n,m}\}$, with the second index indicating the time series where this data point originates from. For the ease of reference, we write $O_{t_i}$ for all data points at a certain point in time.

Definition 3 (Subsequence) A subsequence $T_{t_i,t_j,l} = o_{t_i,l}, \ldots, o_{t_j,l}$ with $j > i$ is an ordered set of successive real valued data points beginning at time $t_i$ and ending at $t_j$ from time series $T_l$.

Definition 4 (Cluster) A cluster $C_{t_i,j} \subseteq O_{t_i}$ at time $t_i$, with $j \in \{1, \ldots, N_C\}$ being a unique identifier (e.g. counter), is a set of similar data points, identified by a cluster algorithm, where $N_C$ is the number of clusters. This means that all clusters have distinct labels regardless of time.

Definition 5 (Cluster Member) A data point $o_{t_i,l}$ at time $t_i$, that is assigned to a cluster $C_{t_i,j}$ is called a member of cluster $C_{t_i,j}$.

Definition 6 (Noise) A data point $o_{t_i,l}$ at time $t_i$ is considered as noise, if it is not assigned to any cluster. A data point that belongs to noise is also called an outlier. $Noise$ describes the set of noise data points of all timestamps, i.e. $Noise = \bigcup_k Noise_{t_k}$.

Definition 7 (Clustering) A clustering is the overall result of a clustering algorithm for all timestamps. It is defined by the set $\zeta = \{C_{t_1,1}, \ldots, C_{t_n,N_C}\} \cup Noise$.

Definition 8 (Time Clustering) A time clustering is the result of a clustering algorithm at one timestamp. It is defined by the set $\zeta_{t_k} = \{C_{t_k,a}, \ldots, C_{t_k,b}\} \cup Noise_{t_k}$ of all clusters at time $t_k$.

Definition 9 (Fuzzy Cluster Membership) The membership degree $u_{C_{t_i,j}}(o_{t_i,l}) \in [0,1]$ expresses the relative degree of belonging of the data object $o_{t_i,l}$ of time series $T_l$ to cluster $C_{t_i,j}$ at time $t_i$.

Definition 10 (Fuzzy Time Clustering) A fuzzy time clustering is the result of a fuzzy clustering algorithm at one timestamp. It is defined by the membership matrix $U_{t_i} = [u_{C_{t_i,j}}(o_{t_i,l})]$.

Definition 11 (Fuzzy Clustering) A fuzzy clustering of time series is the overall result of a fuzzy clustering algorithm for all timestamps. It is defined by the ordered set $U = U_{t_1}, \ldots, U_{t_n}$ of all membership matrices.

An example for the above definitions can also be seen in Figs. 1 and 4. In Fig. 4, five time series of a data set $D = \{T_a, T_b, T_c, T_d, T_e\}$ are clustered per timestamp for the time points $t_i$, $t_j$ and $t_k$. The data points of a time series $T_l$ are denoted by the identifier $l$ for simplicity reasons. The shown clustering consists of six clusters. It can be described by the set $\zeta = \{C_{t_i,l}, C_{t_i,u}, C_{t_j,v}, C_{t_j,f}, C_{t_k,g}, C_{t_k,h}\} \cup \{o_{t_i,e}\}$. As $o_{t_i,e}$ is not assigned to any cluster in $t_i$, it is marked as noise for this timestamp. The data points $o_{t_i,a}, o_{t_i,b}$ of time series $T_a$ and $T_b$ in $t_i$ are cluster members of the yellow cluster $C_{t_i,l}$. The subsequences $T_{t_i,t_j,a}$ and $T_{t_i,t_j,b}$ from time series $T_a$ and $T_b$ both move from the yellow ($C_{t_i,l}$) to the red ($C_{t_j,v}$) cluster. The green ($C_{t_k,h}$) and pink ($C_{t_k,g}$) clusters can be summarized by the time clustering $\zeta_{t_k}$ at time $t_k$.

3.2 Over-time stability evaluation

Since we want to measure the stability of an over-time

clustering, whereby the partitioning may be produced

by an arbitrary (evolutionary) clustering algorithm, we

assume that different clusterings constitute different cluster

connectedness based on the underlying TS members. Time

series, which separate from their clusters’ members often,

indicate a low over-time stability. For this reason, we first

analyse the behaviour of every subsequence of a time series

$T = o_{t_1}, \ldots, o_{t_k}$, with $t_k \le t_n$, starting at the first

timestamp. In case of a hard clustering, subsequently, every

cluster is rated by a stability function, based on the previous

subsequence analysis of its members and the number of

clusters that merged into the considered cluster. The final

over-time stability score for the whole clustering can then be

calculated with the rating of each cluster. When regarding

fuzzy clusterings, the over-time clustering is directly rated

based on the subsequence scores.

Fig. 1 Illustration of the most important definitions. Lines between objects of a time series represent the development of the sequence [29]

3.2.1 CLOSE

Given a TS data set $D = \{T_l \mid 1 \le l \le m\}$ with $n$ timestamps

and an over-time clustering $\zeta$, let $C_{t_i,a}$ and $C_{t_j,b}$ be two

clusters of $\zeta$, with $t_i, t_j \in \{t_1, \ldots, t_n\}$. The temporal cluster

intersection, which is used for the stability evaluation of a

subsequence, is defined as follows

$$\cap_t \{C_{t_i,a}, C_{t_j,b}\} = \{T_l \mid o_{t_i,l} \in C_{t_i,a} \land o_{t_j,l} \in C_{t_j,b}\}, \tag{3}$$

with $l \in \{1, \ldots, m\}$. The resulting set consists of the time series

whose data points belong to $C_{t_i,a}$ at $t_i$ and to $C_{t_j,b}$ at $t_j$,

i.e. the time series that are grouped together at both time points. The transition of a subsequence from

one cluster $C_{t_i,a}$ in $t_i$ to another $C_{t_j,b}$ in $t_j$ along with its

group behaviour, which may be interpreted as team spirit,

can now be expressed by the proportion of members of $C_{t_i,a}$

remaining together in $t_j$

$$p(C_{t_i,a}, C_{t_j,b}) = \begin{cases} 0 & \text{if } C_{t_i,a} = \emptyset \\ \dfrac{|C_{t_i,a} \cap_t C_{t_j,b}|}{|C_{t_i,a}|} & \text{else} \end{cases} \tag{4}$$

with $t_i < t_j$. Regarding the example in Fig. 4, the proportion

for $C_{t_i,l}$ and $C_{t_j,v}$ is defined by

$$p(C_{t_i,l}, C_{t_j,v}) = \frac{|\{a, b\}|}{|\{a, b\}|} = \frac{2}{2} = 1.0 .$$
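
The temporal cluster intersection (3) and the proportion (4) can be sketched in Python as follows. The representation of a clustering as a nested dictionary, the function names and the toy clustering are assumptions for this illustration. The toy clustering is hypothetical and does not reproduce Fig. 4; it only shares the property that $T_a$ and $T_b$ move together from cluster $l$ to cluster $v$.

```python
def members(labels, t, cluster):
    """Time series assigned to `cluster` at timestamp `t`."""
    return {series for series, c in labels[t].items() if c == cluster}


def temporal_intersection(labels, ti, a, tj, b):
    """Time series that are in cluster `a` at `ti` and in cluster `b` at `tj`."""
    return members(labels, ti, a) & members(labels, tj, b)


def proportion(labels, ti, a, tj, b):
    """Share of the members of cluster `a` at `ti` that are in `b` at `tj`."""
    first = members(labels, ti, a)
    if not first:
        return 0.0  # empty cluster, first case of (4)
    return len(temporal_intersection(labels, ti, a, tj, b)) / len(first)


# Hypothetical over-time clustering: labels[timestamp][series] = cluster id,
# with None marking a noise point.
labels = {
    "t_i": {"a": "l", "b": "l", "c": "u", "d": "u", "e": None},
    "t_j": {"a": "v", "b": "v", "c": "v", "d": "f", "e": "f"},
    "t_k": {"a": "g", "b": "g", "c": "h", "d": "h", "e": "g"},
}

print(proportion(labels, "t_i", "l", "t_j", "v"))
print(proportion(labels, "t_j", "v", "t_k", "g"))
```

The proportion is not symmetric: it is normalised by the size of the earlier cluster only, so a small cluster that is absorbed completely by a larger one still obtains the value 1.0.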

This proportion can be used to evaluate the over-time

stability of a subsequence by rating its history with a

subsequence score. In order to address the clusters a data

point is assigned to, we first need to introduce an auxiliary

function, which we call cluster-identity function:

$$cid(o_{t_i,j}) = \begin{cases} \emptyset & \text{if the data point is not assigned to a cluster} \\ C_{t_i,l} & \text{else} \end{cases} \tag{5}$$

For a data point $o_{t_i,j}$ at time $t_i$ the function returns the

cluster it is assigned to. The subsequence score is then

defined by

$$subseq\_score(o_{t_k,l}) = \frac{1}{k} \cdot \sum_{i=1}^{k-1} p(cid(o_{t_i,l}), cid(o_{t_k,l})), \tag{6}$$

with $l \in \{1, \ldots, m\}$ and $k$ being the number of timestamps

where the data point exists. That means, that all time points

in which an object is an outlier, get the worst possible score

of 0. The subsequence score takes into account how many

cluster members of the object from the previous timestamps

have migrated together over time.

In the example of Fig. 4, the score of time series $T_a$ at

time point $t_k$ would be:

$$subseq\_score(o_{t_k,a}) = \frac{1}{2} \cdot \left(\frac{2}{2} + \frac{2}{3}\right) = 0.83 .$$

This value reflects a quite high stability, which can be

explained by the fact that $T_a$ moves with most of its cluster

members over the time period. The time series $d$ gets a

significantly lower value of $subseq\_score(o_{t_k,d}) = 0.5$, as

it never moves with any of its cluster members. Note, that

the impact of transitions of single TS becomes significantly

lower when considering larger data sets.
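
A minimal Python sketch of the subsequence score is given below. It rests on two assumptions that should be checked against the reference implementation: the normalising factor is read as the number of preceding timestamps at which the time series has a data point, which reproduces the value 0.83 of the worked example, and every time series exists at every timestamp. The data structure, the function names and the toy clustering are hypothetical and do not reproduce Fig. 4.

```python
def members(labels, t, cluster):
    """Time series assigned to `cluster` at timestamp `t`."""
    return {series for series, c in labels[t].items() if c == cluster}


def proportion(labels, ti, a, tj, b):
    """Share of the members of cluster `a` at `ti` that are in `b` at `tj`."""
    first = members(labels, ti, a)
    if not first:
        return 0.0
    return len(first & members(labels, tj, b)) / len(first)


def subseq_score(labels, timestamps, series, index):
    """Score of `series` at timestamps[index], with a 0-based index >= 1."""
    if index < 1:
        raise ValueError("the score needs at least one preceding timestamp")
    current_t = timestamps[index]
    current = labels[current_t][series]
    previous = timestamps[:index]
    total = 0.0
    for t in previous:
        cluster = labels[t][series]
        if cluster is None or current is None:
            continue  # noise contributes the worst possible value 0
        total += proportion(labels, t, cluster, current_t, current)
    # Assumption: normalise by the number of preceding timestamps.
    return total / len(previous)


# Hypothetical over-time clustering, None marks a noise point.
timestamps = ["t_i", "t_j", "t_k"]
labels = {
    "t_i": {"a": "l", "b": "l", "c": "u", "d": "u", "e": None},
    "t_j": {"a": "v", "b": "v", "c": "v", "d": "f", "e": "f"},
    "t_k": {"a": "g", "b": "g", "c": "h", "d": "h", "e": "g"},
}

for series in ["a", "d", "e"]:
    print(series, round(subseq_score(labels, timestamps, series, 2), 2))
```

In this toy clustering, the series that is noise at the first timestamp is penalised because that timestamp adds nothing to the sum but still counts in the normalisation.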

The stability of a cluster can now be evaluated, focussing

on two factors. The first one is the number of different

clusters of previous timestamps, that merged into the

regarded cluster. This can be expressed by

$$m(C_{t_k,i}) = |\{C_{t_l,j} \mid t_l < t_k \wedge \exists a : o_{t_l,a} \in C_{t_l,j} \wedge o_{t_k,a} \in C_{t_k,i}\}| . \qquad (7)$$

Furthermore, a cluster’s stability score depends on the

subsequence rating of all its cluster members. The second

factor is therefore the sum of all subsequence scores of

the data points within the considered cluster. Hence, the

over-time stability of a cluster is defined as

$$\mathit{ot\_stability}(C_{t_k,i}) = \frac{\frac{1}{|C_{t_k,i}|} \cdot \sum_{o_{t_k,l} \in C_{t_k,i}} \mathit{subseq\_score}(o_{t_k,l})}{\frac{1}{k-1} \cdot m(C_{t_k,i})} \qquad (8)$$

for $k > 1$. For a cluster at time point $t_k$, the entire preceding time frame $[t_1, t_{k-1}]$ is considered. We define clusters at the first timestamp to be stable and set $\mathit{ot\_stability}(C_{t_1,i}) = 1.0$. In order to make clusters comparable, the sum of subseq_score is averaged by the number of data points in the viewed cluster, while the number of merged clusters is averaged by the number of timestamps before the regarded cluster. There are clustering algorithms which do not assign a cluster to every data point. Those data points are usually denoted as outliers. It is important to mention that the number of merged clusters does not take these outliers into account.

Regarding the example of Fig. 4, the stability of the cluster $C_{t_k,g}$ is given by:

$$\mathit{ot\_stability}(C_{t_k,g}) = \frac{\frac{1}{3} \cdot (0.83 + 0.58 + 0.25)}{\frac{1}{2} \cdot 4} = 0.28 .$$

This low score can be explained by the fact that the cluster under consideration contains only three data points. One of those ($T_e$) moves completely independently of its cluster members, and the remaining two are not perfectly stable either.
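Formulas (7) and (8) can be sketched in Python as follows. The data layout (one dictionary of cluster label to member names per timestamp, with noise simply left out) and all function names are assumptions made for illustration. The case $m = 0$ with $k > 1$ is not discussed in the text, so the sketch rejects it instead of guessing a value.

```python
from typing import Dict, FrozenSet, List

# One {cluster label: member names} dict per timestamp; noise is left out.
Clustering = List[Dict[str, FrozenSet[str]]]


def merged_clusters(clustering: Clustering, k: int, label: str) -> int:
    """Formula (7): number of earlier clusters sharing a member with the target.

    `k` is the 0-based index of the timestamp of the target cluster.
    """
    target = clustering[k][label]
    return sum(
        1
        for earlier in clustering[:k]
        for members in earlier.values()
        if members & target
    )


def ot_stability(member_scores: List[float], n_merged: int, n_previous: int) -> float:
    """Formula (8): mean subsequence score divided by merged clusters per timestamp."""
    if n_previous == 0:
        return 1.0  # clusters at the first timestamp are defined to be stable
    if n_merged == 0:
        raise ValueError("m = 0 is not covered by Formula (8)")
    mean_score = sum(member_scores) / len(member_scores)
    return mean_score / (n_merged / n_previous)
```

Calling `ot_stability([0.83, 0.58, 0.25], 4, 2)` with the values of the worked example gives approximately 0.28.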

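Finally, the over-time stability of a clustering $\zeta$ can be calculated by

$$\mathit{CLOSE}(\zeta) = \frac{1}{N_C} \cdot \left(1 - \left(\frac{n}{N_C}\right)^2\right) \cdot \sum_{C \in \zeta} \mathit{ot\_stability}(C) \cdot (1 - \mathit{quality}(C)), \qquad (9)$$

with $N_C$ being the number of clusters of the over-time clustering $\zeta$, $n$ being the number of timestamps and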

quality being an arbitrary cluster evaluation measure. When working with normalised data $\in [0,1]^d$, we suggest the mean squared error (MSE), but any other rating function can also be used. Make sure to use a function whose results lie in the interval $[0,1]$ in order to get appropriate results. When using a function that evaluates the quality instead of the deficiency of a clustering (that is, higher values indicate a higher quality), the term $(1 - \mathit{quality}(C))$ may, for example, be replaced by $(1 - \mathit{quality}(C)^{-1})$ or $\mathit{quality}(C)$, depending on the quality measure.

As long as the output of the quality function is between 0 and 1 and there exists at least one cluster per timestamp, CLOSE also returns a score between 0 and 1, with 1 indicating a good over-time clustering.

The first pre-factor results from averaging by the number of clusters. The second factor $1 - (n/N_C)^2$ is intended to prevent a clustering that consists of one large cluster from getting a high score. Without it, such a clustering would raise the CLOSE score, since it automatically exhibits a very high over-time stability. Note that the clusters of the first point in time are also included in the evaluation measure. Since they are assumed to have a stability of 1.0, the score is in general slightly increased and, for the first timestamp, only influenced by the quality of the clusters.
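The aggregation in Formula (9) can be sketched as below. The function name and the input layout (one pair of stability and deficiency per cluster) are assumptions made for illustration. The formula is implemented literally, so with fewer clusters than timestamps the pre-factor becomes negative, a case the text excludes by requiring at least one cluster per timestamp.

```python
from typing import List, Tuple


def close_score(clusters: List[Tuple[float, float]], n_timestamps: int) -> float:
    """Formula (9), given one (ot_stability, quality) pair per cluster.

    `quality` is read as a deficiency in [0, 1], i.e. lower values are better.
    """
    n_clusters = len(clusters)
    if n_clusters == 0:
        raise ValueError("the clustering contains no clusters")
    prefactor = 1.0 - (n_timestamps / n_clusters) ** 2
    total = sum(stab * (1.0 - qual) for stab, qual in clusters)
    return prefactor * total / n_clusters
```

For example, exactly one perfectly stable cluster per timestamp yields a score of 0, while six such clusters over three timestamps yield 1 - (3/6)^2 = 0.75.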

Remark 1 (Time Point Comparison) In contrast to the evaluation function integrated in evolutionary clustering [8, 27, 58], where only consecutive points in time are compared, CLOSE compares clusterings of all preceding time points with the last timestamp of the considered subsequence. This has multiple effects. First, the stability score is robust against outliers. Second, short-term transitions between clusters are weighted more lightly. Third, long-term changes that develop slowly over time are punished more severely. Note: The formula cannot be transformed to simply iterate over all cluster pairs. Since the over-time stability is weighted with the quality of the cluster, the results would differ.

Remark 2 (Handling Outliers) Our calculations are suitable for both cleaned data and data with noise. Currently, outliers have only a minor impact on the score. That is because they are solely considered in the subsequence score and not in the cluster stability. However, apart from decreasing the subsequence score, they have an additional indirect influence on the clustering score. Since the pre-factor in Formula (9) favours a large number of clusters, it may be more advantageous for the clustering algorithm to assign data points to smaller clusters than to interpret them as noise and recognize only a few large clusters.

This weak treatment of outliers is motivated by the idea that the over-time clustering might be used for outlier detection. In this case, the algorithm should not be pushed into assigning every data object to a cluster. Nevertheless, different strategies for treating outliers might be investigated in future work.

One way to penalize noise more strongly would be to insert an exploitation term which represents the number of data points that are assigned to a cluster, $N_{co}$, in relation to the number of all existing data points $N_o$. In order to achieve high CLOSE scores, this term should then be maximized. The formula including the exploitation term is given by

$$\mathit{CLOSE}(\zeta) = \frac{1}{N_C} \cdot \left(1 - \left(\frac{n}{N_C}\right)^2\right) \cdot \sum_{C \in \zeta} \mathit{ot\_stability}(C) \cdot (1 - \mathit{quality}(C)) \cdot \frac{N_{co}}{N_o} . \qquad (10)$$

Remark 3 (Merge & Split of Clusters) Considering the subsequence score (Formula (6)), a merge of clusters does not have a negative impact on the score. On the contrary: if two clusters fuse entirely, the score is actually increased, as all objects move together with all their cluster members and therefore show a good team spirit. This is intended, since the focus lies primarily on the cohesion of time series. A good team spirit is rewarded in every case.

When considering cluster splits, though, the subsequence score is lowered. Since a split indicates that time series which have been members of the same cluster at some point in time separate from each other, this behaviour is also wanted. Note that in the case where smaller clusters have previously merged and then separate again in the same way as before, the influence on the score is not high and vanishes over time.

However, in some applications the punishment of cluster

merges might be desired. As we will show in Section 4

regarding our proposed outlier detection algorithm, the

Jaccard Index can be used in the proportion calculation, in

order to penalize merges and splits in the same way.

Remark 4 (Additional Remarks) As Ben et al. stated, the sample size has a high impact on the stability evaluation of a clustering [4]. This is not only the case when considering constant data points. When examining the over-time stability of a clustering, a small sample size also leads to a high sensitivity to transitions between clusters. The larger the considered data set, the easier a statement about the (over-time) stability can be made. In order to extend the method to a broader field of quality measures, the formula of CLOSE can be modified so that quality measures for clusterings instead of clusters can be used. For this purpose, the average cluster stability avg_stab per timestamp clustering $\zeta_{t_i}$

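must be considered. The score is then normalised using the number of timestamps $n$:

$$\mathit{CLOSE}(\zeta) = \frac{1}{n} \cdot \left(1 - \left(\frac{n}{N_C}\right)^2\right) \cdot \sum_{\zeta_{t_i} \subset \zeta} \mathit{avg\_stab}(\zeta_{t_i}) \cdot (1 - \mathit{quality}(\zeta_{t_i})). \qquad (11)$$

Figure 2 summarises the calculation process and explains the most important formulas.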

3.2.2 FCSETS

Given a TS data set $D = \{T_i \mid 1 \le i \le m\}$ with $n$ timestamps and a fuzzy over-time clustering $U$, let $U_{t_i} \subset U$ be a fuzzy partitioning of the data objects $O_{t_i}$ of all time series at time $t_i$ in $k_{t_i}$ clusters. The relative assignment agreement of two data objects $o_{t_i,l}$ and $o_{t_i,s}$ from time series $T_l$ and $T_s$ to all clusters in the partitioning $U_{t_i}$ at time $t_i$ can be calculated using the equivalence relation from the Hüllermeier-Rifqi Index (HRI) [20]:

$$E_{U_{t_i}}(o_{t_i,l}, o_{t_i,s}) = 1 - \frac{1}{2} \sum_{j=1}^{k_{t_i}} |u_{C_{t_i,j}}(o_{t_i,l}) - u_{C_{t_i,j}}(o_{t_i,s})|, \qquad (12)$$

with $u_{C_{t_i,j}}(o_{t_i,l})$ being the membership degree of the data point $o_{t_i,l}$ regarding the cluster $C_{t_i,j}$ (see Definition 9). In order to measure the relation of two time series $T_l$ and $T_s$, we calculate the difference between their relative assignment agreements by subtracting the relative assignment agreement values:

$$D_{t_i,t_r}(T_l, T_s) = |E_{U_{t_i}}(o_{t_i,l}, o_{t_i,s}) - E_{U_{t_r}}(o_{t_r,l}, o_{t_r,s})| . \qquad (13)$$

Leaning on the Hüllermeier-Rifqi Index [20], which deals with a slightly different task by calculating the normalised degree of concordance between two partitions, we define the over-time stability of a time series $T_l$ as the average weighted difference between the relative assignment agreements to all other time series:

$$\mathit{stability}(T_l) = 1 - \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{r=i+1}^{n} \frac{\sum_{s=1}^{m} E_{U_{t_i}}(o_{t_i,l}, o_{t_i,s})^m \, D_{t_i,t_r}(T_l, T_s)^2}{\sum_{s=1}^{m} E_{U_{t_i}}(o_{t_i,l}, o_{t_i,s})^m} . \qquad (14)$$

The difference between the assignment agreements $D_{t_i,t_r}(T_l, T_s)$ is weighted by the assignment agreement between pairs of TS at a previous time point in order to damp large differences for stable time series caused by the arrival of new peers. On the other hand, time series that leave their cluster peers when changing their cluster membership are penalized.

The over-time stability of a fuzzy clustering $U$ can now be expressed by the average over-time stability of all time series in the data set:

$$\mathit{FCSETS}(U) = \frac{1}{m} \sum_{l=1}^{m} \mathit{stability}(T_l) . \qquad (15)$$
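Formulas (12) to (15) can be sketched in plain Python as follows. The nested-list layout `U[t][l][j]` (membership degree of time series $l$ in cluster $j$ at time $t$) and the function names are assumptions made for illustration. The exponent on the agreement values is read as $m$, the number of time series, as the notation of Formula (14) suggests, and it is exposed as a parameter in case another reading is intended.

```python
from typing import List, Optional

Memberships = List[List[List[float]]]  # U[t][l][j]


def agreement(u_l: List[float], u_s: List[float]) -> float:
    """Formula (12): relative assignment agreement of two data objects."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(u_l, u_s))


def stability(U: Memberships, l: int, exponent: Optional[int] = None) -> float:
    """Formula (14): over-time stability of time series l."""
    n, m = len(U), len(U[0])
    if n < 2:
        raise ValueError("at least two timestamps are needed")
    if exponent is None:
        exponent = m  # assumption: the exponent equals the number of series
    total = 0.0
    for i in range(n - 1):
        e_i = [agreement(U[i][l], U[i][s]) for s in range(m)]
        weights = [e ** exponent for e in e_i]
        for r in range(i + 1, n):
            # Formula (13): change of the agreement between t_i and t_r
            diffs = [abs(e_i[s] - agreement(U[r][l], U[r][s])) for s in range(m)]
            total += sum(w * d ** 2 for w, d in zip(weights, diffs)) / sum(weights)
    return 1.0 - 2.0 * total / (n * (n - 1))


def fcsets(U: Memberships) -> float:
    """Formula (15): mean stability over all time series."""
    m = len(U[0])
    return sum(stability(U, l) for l in range(m)) / m
```

The denominator in `stability` is never zero, because the agreement of a time series with itself is 1. For memberships that do not change over time, `fcsets` returns 1.0.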

A more efficient substitute for the HRI is the Subset Similarity Index (SSI) proposed by Runkler [48]. The efficiency gain stems from the similarity calculation, which in SSI considers cluster pairs, while HRI concentrates on the assignment agreement of data point pairs. In our context, where the clustering should be used for further analysis such as outlier detection, we aim to describe the over-time stability of a clustering by the team spirit of the considered time series. Therefore, we believe that the degree of the assignment agreement between TS pairs to clusters at different timestamps provides a greater information gain than the similarity between cluster pairs. For this reason, the SSI is not suitable for our over-time stability evaluation. Figure 3 summarises the calculation process and explains the most important formulas.

4 Applications

Our evaluation measures can not only be used for the over-time stability evaluation of clusterings, but also for further analyses such as parameter selection or outlier detection [50, 52, 53]. For this purpose, for example, the part of CLOSE that evaluates subsequences can be used.

In [50], we present an approach called DOOTS

(Detecting Outliers regarding their Over-Time Stability)

for finding conspicuous subsequences of all lengths with

an underlying over-time clustering regarding the following

definition:

Definition 12 (Anomalous Subsequence) A subsequence $T_{t_i,t_j,l}$ is called anomalous, if it is significantly more unstable than its cluster members at time $t_j$.

For this, the subsequence score from Formula (6) has to be reformulated in order to handle subsequences with arbitrary starting points. The subsequence score of a subsequence $T_{t_i,t_j,l}$ of time series $T_l$ starting at $t_i$ and ending at $t_j$ is defined as

$$\mathit{subsequence\_score}(T_{t_i,t_j,l}) = \frac{1}{k} \cdot \sum_{v=i}^{j-1} p(\mathit{cid}(o_{t_v,l}), \mathit{cid}(o_{t_j,l})) \qquad (16)$$

with $l \in \{1, ..., m\}$ and $k \in [1, j-i]$ being the number of timestamps between $t_i$ and $t_j$ where the time series exists [50].

Fig. 2 Detailed step-by-step explanation of CLOSE

Fig. 3 Detailed step-by-step explanation of FCSETS

One noteworthy aspect is that the score is always 0 if the last data point of the considered subsequence is marked as noise. In most cases, this does not lead to any drawbacks regarding the analysis, since all partial sequences of these subsequences are treated normally. Nevertheless, a more detailed discussion of such situations will be provided in the further course of this work.

Fig. 4 Example for cluster transitions of time series $T_a, ..., T_e$ over time [52]

As already mentioned, the proportion from Formula (4) is asymmetric and punishes splits while ignoring merges. In order to counteract this circumstance, the Jaccard index can be used, as proposed in [52]. For this purpose, the temporal cluster union of two clusters $C_{t_i,a}, C_{t_j,b}$ has to be introduced first:

$$\cup_t\{C_{t_i,a}, C_{t_j,b}\} = \{T_l \mid o_{t_i,l} \in C_{t_i,a} \vee o_{t_j,l} \in C_{t_j,b}\} \qquad (17)$$

with $l \in \{1, ..., m\}$. The proportion $\hat{p}$ can then be expressed by the Jaccard index of two clusters:

$$\hat{p}(C_{t_i,a}, C_{t_j,b}) = \begin{cases} 0 & \text{if } C_{t_i,a} = \emptyset \wedge C_{t_j,b} = \emptyset \\ \dfrac{|C_{t_i,a} \cap_t C_{t_j,b}|}{|C_{t_i,a} \cup_t C_{t_j,b}|} & \text{else} \end{cases} \qquad (18)$$

with $t_i < t_j$. In contrast to the proportion from Formula (4), regarding the example in Fig. 4 the Jaccard proportion is

$$\hat{p}(C_{t_i,l}, C_{t_j,v}) = \frac{|\{a, b\}|}{|\{a, b, c\}|} = \frac{2}{3} = 0.67$$

since the merge of (parts of) the yellow ($C_{t_i,l}$) and turquoise ($C_{t_i,u}$) cluster gets punished.
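A minimal Python sketch of the Jaccard proportion from Formula (18) is given below. As an assumption for illustration, clusters are represented as frozensets of time series names, so that the temporal intersection and union reduce to ordinary set operations; the function name is hypothetical.

```python
from typing import FrozenSet


def jaccard_proportion(earlier: FrozenSet[str], later: FrozenSet[str]) -> float:
    """Formula (18): shared members divided by all members of both clusters."""
    union = earlier | later
    if not union:
        return 0.0  # both clusters are empty
    return len(earlier & later) / len(union)
```

For the clusters {a, b} and {a, b, c} of the example, the function returns 2/3, and swapping the arguments gives the same value, which is what makes merges and splits count equally.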

Another characteristic of the subsequence score from CLOSE (Formula (6)) is the equal impact of all considered timestamps regarding the over-time stability of a subsequence. When considering longer sequences, however, this may lead to a tendency towards a worse rating, since slow changes in cluster memberships might influence the score considerably. Assuming that the nearer past is more significant than the more distant past, a weighting function can be integrated in the subsequence score.

Using Gauss's formula for the sum of the first $k$ natural numbers, the weighting of the proportion at time $t_i$ regarding the time interval $[t_1, t_k]$ can be calculated by

$$\frac{i}{\sum_{a=1}^{k} a} = \frac{i}{\frac{k(k+1)}{2}} = \frac{2 \cdot i}{k(k+1)} . \qquad (19)$$

Adjusting this weighting function to a time interval with arbitrary starting point $t_s \ge t_1$, the subsequence score is then defined by

$$\mathit{weighted\_subseq\_score}(T_{t_i,t_j,l}) = \sum_{v=i}^{j-1} \frac{2 \cdot (v-i+1)}{k(k+1)} \, p(\mathit{cid}(o_{t_v,l}), \mathit{cid}(o_{t_j,l})) \qquad (20)$$

with $k \in [1, j-i]$ again being the number of timestamps between $t_i$ and $t_j$ where the considered time series exists [52]. There is no need to normalize the score to the interval $[0,1]$ by averaging it, as the sum of all weightings of a subsequence's timestamps is always 1 due to the division by Gauss's formula.

In contrast to the subsequence score, regarding the example in Fig. 4 the weighted subsequence score is given by

$$\mathit{subseq\_score}(o_{t_k,a}) = \frac{1}{3} \cdot \frac{1}{2} + \frac{2}{3} \cdot \frac{2}{3} = 0.61$$

which is a bit higher, since the immediately preceding (higher) score gets a greater weighting than the more distant one.
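The weighting from Formulas (19) and (20) can be sketched as follows. The function names are hypothetical, and the sketch assumes that the time series exists at every timestamp of the subsequence, so that $k$ equals the number of proportions passed in.

```python
from typing import List


def gauss_weights(k: int) -> List[float]:
    """Formula (19): weights 2*i / (k*(k+1)) for i = 1, ..., k; they sum to 1."""
    return [2 * i / (k * (k + 1)) for i in range(1, k + 1)]


def weighted_subseq_score(proportions: List[float]) -> float:
    """Formula (20), given the proportions ordered from oldest to newest."""
    k = len(proportions)
    return sum(w * p for w, p in zip(gauss_weights(k), proportions))
```

With the proportions 1/2 and 2/3 of the example, the weights are 1/3 and 2/3, and `weighted_subseq_score([1/2, 2/3])` gives approximately 0.61.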

In summary, four options can be used: (i) the ordinary subsequence score (DOOTS), (ii) the weighted subsequence score (wDOOTS), (iii) the ordinary subsequence score using the Jaccard proportion (jDOOTS) and (iv) the weighted subsequence score using the Jaccard proportion (jwDOOTS).

With this score, a subsequence can now be compared

with its cluster members, in order to determine, if its over-

time stability stands out. In this respect we consider the

following assumptions:

Assumption 1 If the score of a subsequence is significantly

lower than those of its cluster members, its over-time

behavior is conspicuous.

Assumption 2 If the score of a subsequence is low, but so

are those of its cluster members, its over-time behavior is

not conspicuous, since this low over-time stability shows a

pattern of regularity.

In order to find outlier sequences of all lengths, every

possible subsequence receives an outlier score indicating the

probability of being anomalous. The outlier score describes

the deviation of a subsequence’s stability from the best

subsequence score of its cluster. Figuratively, one can

imagine that the time series with the highest subsequence score represents a kind of leader and that a large deviation from this leader is to be considered conspicuous. The best subsequence score of a cluster $C_{t_j,a}$ regarding subsequences starting at time $t_i$ is expressed by the following formula:

$$\mathit{best\_score}(t_i, C_{t_j,a}) = \max(\{\mathit{subsequence\_score}(T_{t_i,t_j,l}) \mid \mathit{cid}(o_{t_j,l}) = C_{t_j,a}\}) \qquad (21)$$

The outlier score can then be calculated by

$$\mathit{outlier\_score}(T_{t_i,t_j,l}) = \mathit{best\_score}(t_i, \mathit{cid}(o_{t_j,l})) - \mathit{subsequence\_score}(T_{t_i,t_j,l}). \qquad (22)$$

With respect to Assumptions 1 and 2, the outlier score

depends on the best score of a cluster’s members. Therefore,

an outlier score of 100% can only be achieved in clusters

consisting exclusively of completely stable subsequences.

On the other hand, a cluster with small stabilities only, can

lead to a situation where no subsequence score is considered

conspicuous, no matter how low it is. As mentioned in

Assumption 2, this behavior is desired.

Using the outlier score and a threshold parameter $\tau$, a more precise definition of an outlier can now be given.

Definition 13 (Outlier) Given a threshold $\tau \in [0,1]$, a subsequence $T_{t_i,t_j,l}$ is called an outlier, if its probability of being an outlier is greater than or equal to $\tau$. That means, if $\mathit{outlier\_score}(T_{t_i,t_j,l}) \ge \tau$.

Even though the parameter $\tau$ is constant, it can be

considered as a dynamic threshold, since the greatest

possible deviation from the best subsequence score – and

simultaneously the greatest outlier score – is dependent

on the best score of the considered cluster. Leaning on

Assumption 2, clusters which show a low stability have a

lower probability of containing an outlier than stable ones,

because all their cluster members exhibit irregularities,

which represents a pattern of instability. Thus, in this case,

a small subsequence score is not conspicuous.
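Formulas (21) and (22) together with Definition 13 can be sketched as below for all subsequences that share the same start and end time. The dictionary-based inputs and the function names are assumptions made for illustration. Subsequences whose last data point is noise are expected to be handled separately by the caller, since no best score exists for them.

```python
from typing import Dict, Hashable, List


def outlier_scores(
    scores: Dict[str, float], cluster_of: Dict[str, Hashable]
) -> Dict[str, float]:
    """Formulas (21) and (22): distance of each score to the best score in its cluster."""
    best: Dict[Hashable, float] = {}
    for name, score in scores.items():
        cluster = cluster_of[name]
        best[cluster] = max(best.get(cluster, 0.0), score)
    return {name: best[cluster_of[name]] - score for name, score in scores.items()}


def detect_outliers(
    scores: Dict[str, float], cluster_of: Dict[str, Hashable], tau: float
) -> List[str]:
    """Definition 13: subsequences whose outlier score is at least tau."""
    result = outlier_scores(scores, cluster_of)
    return sorted(name for name, value in result.items() if value >= tau)
```

As a usage example with subsequence scores 0.83, 0.58 and 0.25 in one cluster and a hypothetical threshold of 0.5, only the subsequence with score 0.25 is reported, because its distance to the leader is 0.58.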

Subsequences that consist entirely of noise data points

are automatically identified as outliers and are called

intuitive outliers. This special treatment is needed, since

subsequences whose last data point is labeled as noise

do not have any cluster members which the best score

can be calculated from. Therefore, no outlier score can be

determined for them. Hence, in our outlier detection we

consider three types of outliers: anomalous subsequences

regarding Definition 13, intuitive outliers and data points

marked as noise by a clustering algorithm.

Imagine examining a subsequence $T_{t_i,t_j,l}$ whose last data point at time $t_j$ is marked as noise. In addition, suppose its subsequence $T_{t_i,t_{j-1},l}$ gets a high outlier score and is therefore detected as an outlier. Intuitively, one would expect the subsequence under consideration $T_{t_i,t_j,l}$ to be identified as an outlier as well. In our approach, this would only be the case if the sequence was recognized as an intuitive outlier, i.e. the previous data point was categorized as noise, too. Nevertheless, the subsequence $T_{t_i,t_k,l}$ with $k > j$, which for the first time is assigned to a cluster again at its last time point $t_k$, would be detected as an outlier. Thus, in the end $T_{t_i,t_j,l}$ would be covered.

Still, in the marginal case where a data point is labeled as noise at the last time of the entire time series, a subsequence with end time $t_m$ would never be detected as an outlier if it is not marked as noise in $t_{m-1}$. This drawback should be investigated in future works.

Remark 5 (Modifications) As DOOTS builds on the presented evaluation measure, the modification of the proportion calculation using the Jaccard index as well as the weighting function for the subsequence score may naturally also be applied to CLOSE, if desired.

5 Evaluation

In this section, we present several experiments. First, we

describe the different data sets, which we use in order to

illustrate our results. Then we present clusterings calculated

with K-Means [38] and DBSCAN [16]. In order to create

those clusterings, we use common methods to identify good

parameters per timestamp. Afterwards, we compare the

results with clusterings whose parameters were identified

with the help of CLOSE. These results are then compared

to those of the evolutionary clustering presented in [8].

We also evaluate clusterings retrieved by Fuzzy C-Means

[6] and focus on the achieved FCSETS scores. Finally, the

comparison of clusterings is followed by applications to

the outlier detection algorithm. We finish the section with

qualitative analyses of the results.

5.1 Data sets

In the following, we present the real-world and generated data sets our analyses are based on.

5.1.1 COVID-19 data set

The COVID-19 pandemic is currently affecting the whole

world. In this context, the hashtag #FlattenTheCurve is

intended to encourage people all over the world to behave

in a way that prevents the distribution of infections over

time and thus counteracts overloading of the health care

systems. Although the hashtag is used in an inflationary

way, few people realise that the curve is actually a time

series. Because of the current relevance of the data set,

it is an excellent candidate for applying our methods.

We obtained the data from the official GitHub repository of Johns Hopkins University (footnote 3). Specifically, we used the

daily reports on worldwide COVID-19 infections for our

analyses. Depending on the country, the data set contains

data on the individual regions (such as federal states) of

the country concerned. We have aggregated these data

so that for each available country, only one entry per

point in time has been created. Over time, other features

such as incidence were added. In order to provide the

incidence for all points in time, we calculated the incidence

using population data for the countries. For this purpose,

we have obtained the population data for the respective countries from theglobaleconomy.com (footnote 4). We then calculated

the seven-day incidence for the countries. The incidence

reflects the number of infections in the last seven days

per 100,000 inhabitants. Due to the low infection figures

at the beginning of the pandemic, the incidence value is

particularly low at some times. For this reason, we give

the number of infections per 10,000,000 inhabitants. In

addition, we do not consider directly consecutive days,

because the fluctuation in these is relatively small. Instead,

we look at every seventh day, reflecting the development

within a week.
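The described preprocessing can be sketched as follows. The sketch assumes that the daily reports have already been aggregated per country into a list of cumulative case counts, one entry per day; this input layout and the function name are assumptions and are not taken from the repository.

```python
from typing import List


def weekly_incidence(
    cumulative_cases: List[int], population: int, per: int = 10_000_000
) -> List[float]:
    """New infections of the last seven days per `per` inhabitants.

    Only every seventh day is evaluated, so each value covers one week.
    """
    result = []
    for day in range(7, len(cumulative_cases), 7):
        new_cases = cumulative_cases[day] - cumulative_cases[day - 7]
        result.append(new_cases * per / population)
    return result
```

For a country with one million inhabitants and ten new infections per day, each weekly value is 70 new infections, i.e. 700 per 10,000,000 inhabitants.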

5.1.2 ElectricityLoadDiagrams20112014 data set

The ElectricityLoadDiagrams20112014 dataset is from

the UCI Machine Learning Repository [13]. It contains

data on the electricity consumption of 370 customers at

quarter-hourly intervals. We have summarised the electricity

consumption into monthly intervals and selected the first 30

customers for a better overview. Summarising on a monthly

basis also has the advantage that the resulting dataset has

no missing values. For better comparability, we have also

applied a min-max scaling to the data.

5.1.3 TheGlobalEconomy.com data set

We extracted this data set from theglobaleconomy.com (footnote 2). The website offers over 400 indicators on 200 countries for over 80 years. The indicators include data such as GDP, inflation, population data, employment rates and many more. All available data have been obtained from reliable official sources. From the large number of available indicators, we selected two for illustration purposes, namely the unemployment rate and the education spending. In addition, we have only considered twenty countries for the purpose of the overview.

3 https://github.com/CSSEGISandData/COVID-19

4 https://www.theglobaleconomy.com

5.1.4 Generated data set

In order to show specific characteristics of CLOSE and our

outlier detection algorithm, we generated two artificial data

sets. The first contains 40 time series with 6 time points and two-dimensional feature vectors in $[0,1]^2$. For every timestamp, four cluster centroids have been set, to each of which 10 time series were assigned with a maximal distance of 0.1. The cluster members remain the same for the whole time period, but the clusters merge and split over time. More precisely, at any time point only three clusters are visible, since at the moment where one cluster splits ($t_4$), two others merge into one.

For the evaluation of our outlier detection algorithm,

three transition-based outliers have been inserted in the data

set. For each timestamp, the outlier sequences have been

randomly assigned to a cluster centroid with a maximal

distance of 0.1.

5.2 Density-based clustering

Since to the best of our knowledge there are no other evaluation measures for the over-time stability of clusterings-per-timestamp, a quantitative evaluation against other measures is not possible. The comparison to other common stability measures is not meaningful either, as the targeted stability definition differs. Nevertheless, the evaluation of clusterings retrieved with parameter settings determined by CLOSE against those of evolutionary clustering algorithms may serve as a surrogate for such an analysis, as the objective function which is optimized in evolutionary clustering includes a similar definition of over-time stability. Apart from the comparison with evolutionary clusterings, our evaluation section deals with different experiments on real-world and artificially generated data sets in order to discuss different characteristics of CLOSE and its applications.

In the first experiment we investigate the behavior of the CLOSE score depending on the parameter setting of DBSCAN regarding the GlobalEconomy data set. This behavior is illustrated in Fig. 6. For each minPts a colored line is drawn, which shows the CLOSE score depending on $\epsilon$. We tested all minPts $\in [2, 6]$ and $\epsilon \in [0.1, 0.4]$ with a step size of 0.01. The best result was achieved with minPts = 2 and $\epsilon = 0.2$ and is shown in Fig. 5.
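The parameter search can be sketched as a plain grid search. In the sketch below, `score_fn` is a hypothetical callable that is assumed to cluster every timestamp with DBSCAN for the given parameters and to return the resulting CLOSE score; it is not part of any library, and the function names are chosen for illustration.

```python
from typing import Callable, List, Tuple

Params = Tuple[int, float]  # (minPts, epsilon)


def dbscan_grid() -> List[Params]:
    """All tested pairs: minPts in [2, 6], epsilon in [0.1, 0.4] with step 0.01."""
    eps_values = [round(0.1 + 0.01 * step, 2) for step in range(31)]
    return [(min_pts, eps) for min_pts in range(2, 7) for eps in eps_values]


def select_parameters(
    grid: List[Params], score_fn: Callable[[int, float], float]
) -> Params:
    """Return the parameter pair with the highest score."""
    return max(grid, key=lambda params: score_fn(*params))
```

The grid contains 5 x 31 = 155 parameter settings, each of which requires one over-time clustering and one CLOSE evaluation.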

Fig. 5 Best resulting clustering with DBSCAN ($\epsilon = 0.2$, minPts = 2) achieving a CLOSE score of 0.514 on the GlobalEconomy data set

As can be seen, the resulting clustering is quite stable although the data set is rather dispersed and some of its data objects have irregular movements. For example, Jamaica (JAM) and Ireland (IRL) are completely stable over time as they are always together in one cluster. Such a stability can only be achieved with minPts = 2 since bigger clusters would lead to more cluster transitions. This characteristic can also be read off the diagram in Fig. 6, where the curve of minPts = 2 reaches higher CLOSE scores than the others in most cases. Regarding this data set, it is difficult to determine one optimal $\epsilon$ since the groups of objects move towards each other. The choice of one fixed parameter setting leads, for example, to the creation of a single cluster in the last considered timestamp. Although it is not desired to have only one cluster, since it does not lead to a high information gain, it is an intuitive result in this case. When choosing a smaller $\epsilon$ in order to counteract this circumstance, the over-time stability would be significantly decreased.

Fig. 6 Resulting CLOSE score for different minPts depending on $\epsilon$

When considering the line of minPts = 6 in Fig. 6, the results might seem unintuitive since the CLOSE score is 0 for most of the time and it gets higher with $\epsilon > 0.3$ although it already reached a score of 0 before. The first characteristic can be explained by the high minPts value, since $\epsilon$ has to be chosen relatively high in order to reach enough data points to put together in one cluster. The second characteristic is caused by the pre-factor of CLOSE, which sets the score to 0 if there are not at least $k$ clusters, where $k$ is the number of timestamps. For $\epsilon = 0.3$ only one cluster per timestamp is found, which causes a high amount of outliers. By increasing $\epsilon$, new clusters are created whose members have been marked as noise for lower $\epsilon$. This applies in particular to the years 2012 and 2013.

Fig. 7 Resulting CLOSE score, stability and quality for minPts = 2 depending on $\epsilon$

In Fig. 7 the behavior of the ot_stability, quality and the CLOSE score (see Formula (9)) depending on $\epsilon$ can be compared. minPts was set to 2, as it proved to be the best choice on the GlobalEconomy data set. The quality was measured by the amount of objects that are assigned to a cluster in relation to all objects at the considered time point. The usage of such a simple measure can be justified by the fact that the density of the resulting clusters is already indirectly evaluated by the clustering algorithm DBSCAN. Also, evaluation measures addressing the separation and compactness of clusters are not suitable for density-based clustering algorithms. Therefore, the aim is to minimize the amount of outliers as they are not caught in the formula of CLOSE. The diagram shows that, as long as the quality is lower than the stability ($\epsilon \le 0.13$), it has a high impact on the CLOSE score. Afterwards, the curve of CLOSE is very similar to the stability. For $\epsilon > 0.26$ the CLOSE score gets worse, although the quality as well as the stability increases. The CLOSE score decreases rapidly to 0, which is caused by the fact that the number of clusters falls below the number of timestamps. In other cases the score would highly depend on the number of clusters as long as they exceed the number of timestamps, if the quality and stability remain almost the same.

5.3 K-Means

In this paragraph we compare the achievable CLOSE score

of K-Means with those of the evolutionary K-Means of

[8]. For this, we first used the one-dimensional COVID-

19 data set. The evolutionary clustering approach from

[8] softens the definition of partitioning clustering: At a

point in time, the space is classically partitioned into k

regions, but the assignment of individual elements to a

cluster is also based on the partitionings of the previous

points in time. The assignment function is therefore based

on two components, the so-called history costs and the

distance to the cluster centre. The user must specify a

weighting for these two components in advance. In addition,

this approach requires an unknown function $f$ that maps

clusters from two points in time to each other. Although

this function seems intuitive at first glance, it constitutes a

separate field of research. Despite the problems mentioned

above, evolutionary clustering has a decisive advantage

that becomes more relevant when calculating stability. The

assignment function of evolutionary clustering from [8] can

assign objects to a cluster even if they lie in a different

cluster from the point of view of a classical partitioning

method. This can positively influence the stability of time

series, clusters and thus also clusterings.

An adaptation of the classical K-Means to previous

points in time can be realised with the help of varying ks.

A search for the most stable clustering with varying ksis

also possible with CLOSE, but we consider this scenario

impractical because the number of configurations to be

tested would increase considerably: For 10 time points and

ak∈[2,5], this would already be 410 =1048576

combinations. A corresponding evaluation of the stability

for time-dependent kwould therefore be difficult to realise.

For this reason, we search for one kthat fits best for all

time points. The clustering that achieves the highest CLOSE

score is then compared with evolutionary clustering. In the

following evaluations, the asymmetric proportion and the

mean squared error as quality measure were used.
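
The size of the search space mentioned above can be checked with a few lines of Python. The helper count_configurations is a hypothetical name introduced only for this illustration.

```python
from itertools import product


def count_configurations(k_values, n_timestamps, same_k_everywhere):
    """Number of clusterings to evaluate for a given search strategy."""
    if same_k_everywhere:
        # One k shared by all timestamps.
        return len(k_values)
    # An independent choice of k at every timestamp.
    return len(k_values) ** n_timestamps


k_values = range(2, 6)  # k in [2, 5]
varying = count_configurations(k_values, 10, same_k_everywhere=False)
fixed = count_configurations(k_values, 10, same_k_everywhere=True)
print(varying, fixed)

# Cross-check of the formula by explicit enumeration on a small case.
assert len(list(product(k_values, repeat=3))) == 4 ** 3
```

Restricting the search to a single k reduces the number of candidate clusterings from 1048576 to 4 in this setting.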

5.3.1 K-means and evolutionary K-means applied

to the COVID-19 data set

The results of the two clustering algorithms applied to the

COVID-19 data set are very different. First, the best k was

identified for both approaches using CLOSE. Here, all ks in

the interval [2,10] were examined. For both algorithms,

k=4 was identified as the k that leads to the most stable

clustering.

For the evolutionary approach, the change parameter

was set to 0.5. The results can be viewed in Fig. 8.

The differences are particularly striking at times five to

seven. These can be explained by the extended

assignment function of the evolutionary approach described above. In this

specific case, however, the evolutionary approach does not

lead to a higher CLOSE score than the classical approach.

Specifically, the standard approach produces a clustering

that is 0.04 more stable than the evolutionary approach.

This may not be a big difference, but it shows that the

adjustments from [8] made for the evolutionary approach do

not necessarily lead to a better CLOSE score.

5.3.2 K-means and evolutionary K-means applied to the

generated data set

In contrast to the results with the COVID-19 data set, the

clusterings of the classical K-Means and the evolutionary

K-Means [8] are identical. The result can be seen in Fig. 9.

This is mainly due to the nature of the generated data set. As

mentioned earlier, the generated data set actually contains

four clusters at each time point, two of which split off from

each other and merge in t4 respectively. Although intuitively

one would identify three clusters at each time point, both

algorithms identified only two clusters each. This result

shows that both methods recognise that categorisation

into three clusters would lead to more changes within

the clusters and thus to less cluster stability. The only

clustering that could compete with this clustering in terms

of stability would be one in which all four original clusters

were identified. However, this result is not achievable due

Fig. 8 K-Means and evolutionary K-Means applied to the COVID-19 Data Set

to the partitioning property of K-Means. The relatively

large distance between the clusters does not prevent the

evolutionary algorithm from recognising only two clusters.

This can be explained by the high influence of history

costs. In this case, we have set the weighting of the change

parameter to 0.5 again; it can be assumed that the result

will be different with a lower weight. In fact, a much lower

weight leads to the detection of three clusters. We identified

the k that leads to a clustering with the highest stability for

both approaches with CLOSE (k=2). Here we examined

all k in the interval [2,6]. In Fig. 10 one can see the

development of the CLOSE score as a function of the chosen

k for the classical K-Means. In this data set, the highest

CLOSE score is reached at k=2. Higher ks lead to lower

CLOSE scores. Figure 9 gives the impression that three

clusters would be more intuitive at any point in time, but the

problem is that such a setup would lead to more data points

changing their cluster peers over time. This circumstance

then leads to less stability of the individual time series,

clusters and thus the entire clustering. More clusters lead

to distributions in which objects have even more changing

cluster peers. It should be noted that in a scenario with more

clusters, quality increases but stability decreases. Together

with the stability, the pre-factor then has a higher influence

than the quality.
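
The effect of a finer partition on the cluster peers can be illustrated with a small toy computation. The sketch below is a simplified stand-in and not the CLOSE score itself: the function retained_peer_share, the convention of returning 1.0 for an object without peers, and the labels are assumptions chosen for illustration.

```python
def peers(labels, obj):
    """All other objects that share a cluster with obj."""
    return {o for o, c in labels.items() if c == labels[obj] and o != obj}


def retained_peer_share(labels_t, labels_next, obj):
    """Share of obj's peers at one timestamp that are still peers at the next."""
    before = peers(labels_t, obj)
    if not before:
        return 1.0  # assumption: an object without peers cannot lose any
    return len(before & peers(labels_next, obj)) / len(before)


def mean_share(labels_t, labels_next):
    return sum(retained_peer_share(labels_t, labels_next, o)
               for o in labels_t) / len(labels_t)


# Hypothetical labels for six objects at two consecutive timestamps.
coarse_t1 = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
coarse_t2 = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
fine_t1 = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}
fine_t2 = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2, "f": 2}

coarse = mean_share(coarse_t1, coarse_t2)
fine = mean_share(fine_t1, fine_t2)
print(coarse, fine)
```

In this toy case the object c drifts towards a and b. Under the coarse partition it never leaves its cluster, whereas under the finer partition both c and d lose all of their peers, which lowers the average.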

5.4 Fuzzy C-means

In this section we discuss the results of FCSETS on the

COVID-19 data set. The clusterings evaluated here were

created using fuzzy C-Means, a fuzzy variant of K-Means.

Figure 11b) shows the development of the FCSETS score as

a function of the number of clusters.

It is noticeable that the FCSETS scores achieved are

significantly higher than the CLOSE scores. This is mainly

due to the fact that FCSETS has no function for evaluating

the cluster quality. While the highest CLOSE score was

achieved with four clusters, the highest FCSETS score

Fig. 9 K-Means and evolutionary K-Means from [8] applied to the Generated Data Set

Fig. 10 Resulting CLOSE score for standard K-Means with different ks

was reached with two clusters (0.941). The fact that the two

methods assign the best score to different numbers of clusters

score is expected due to the different approaches of the

underlying clustering algorithms. This also means that other

clustering algorithms could achieve better or worse results

in the crisp but also fuzzy case. The decisive factor for the

evaluation of an over-time clustering in the fuzzy case is

the change in the degrees of membership over time. Fuzzy

C-Means achieves the smallest change in these with two

clusters per timestamp, which also reflects the most stable

result in this case. The main reason for this is the rate

of change of membership degrees from one time point to

another. In the case of the COVID-19 data set, a higher

number of clusters provides a higher rate of change, so

that the cluster membership is less stable over time. This

is especially the case when the movement of objects within

clusters is high. However, usually the movement has only

little influence on the highest degree of membership of an

object to a cluster, but the other degrees of membership

change strongly. In the case of the COVID-19 data set,

this change is strongest with five clusters per timestamp. In

Fig. 11a) we have visualised the clustering with the highest

FCSETS score. We have assigned the objects to the cluster

to which they have the highest membership degree.
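
The observation that the highest membership degree can stay in place while the remaining degrees shift can be sketched as follows. This is not the FCSETS computation: the function names, the membership values, and the assumption that the cluster indices are comparable across the two timestamps are made up for illustration.

```python
def hard_assignment(memberships):
    """Index of the cluster with the highest membership degree."""
    return max(range(len(memberships)), key=memberships.__getitem__)


def membership_change(before, after):
    """Sum of absolute changes of the membership degrees."""
    return sum(abs(a - b) for a, b in zip(before, after))


# Hypothetical membership degrees of one object at two timestamps.
u_t1 = [0.50, 0.40, 0.05, 0.05]
u_t2 = [0.50, 0.05, 0.05, 0.40]

print(hard_assignment(u_t1), hard_assignment(u_t2))
print(membership_change(u_t1, u_t2))
```

The crisp assignment used for the visualisation is identical at both timestamps, although a considerable part of the membership mass has moved between the other clusters.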

5.5 Outlier detection

In this part of the paper, we present a qualitative analysis of

the presented outlier detection and its variants. In particular,

we address the effects of the different proportions and

weightings chosen and illustrate this using the COVID-

19 data set and the generated data set. In all the analyses

presented, to identify the most stable clustering, we applied

CLOSE to determine the parameters.

5.5.1 COVID-19 data set

In this section we compare the effect of asymmetric

proportion and symmetric (Jaccard) proportion on outlier

detection. For this purpose we use the one-dimensional

COVID-19 data set because it is particularly suitable

for illustration. We clustered the data with K-Means,

identifying the most stable clustering (k=4) with

CLOSE. In Fig. 12 we can see the results obtained.

The black graphs correspond to the outliers found. At

first glance, it is immediately apparent that the outlier

detection method with the symmetric Jaccard proportion

detects significantly fewer outliers than its asymmetric

counterpart. This is due to the different evaluation of

merged clusters. While merges of clusters have no influence

with the asymmetric proportion, the symmetric Jaccard

proportion evaluates them negatively. This has a direct

impact on the subsequence scores, in the sense that

they all become smaller in our example. This is reflected

accordingly in the best score, which corresponds to the

maximum subsequence score of a cluster. Overall smaller

subsequence scores also lead to smaller outlier scores,

because the difference between the best score and the

individual subsequence scores also becomes smaller. With

Fig. 11 Fuzzy C-Means applied to the COVID-19 data set

Fig. 12 Detected outliers on the COVID-19 data set with τ=0.6. Black lines represent outliers. Clustering identified with CLOSE (K-Means, k=4)

constant τ, as in this example, this leads to a smaller outlier

detection rate. So in the case of the COVID-19 data set, we

would prefer the outlier detection method with asymmetric

proportion. The one-dimensional example also illustrates

the type of outliers detected. In particular, we notice a time

series that was detected as a whole by the system and has

the highest incidence rate at the end. This time series is the

incidence value of Luxembourg. There, on 31 May 2020,

the highest incidence value of the European countries we

looked at was reported. The high number of changes in

the cluster environment is particularly striking. The first

change occurs from week four to week five, followed by the

change in week seven to week eight and finally the change

from week nine to week ten. The constant change of cluster

members leads to a relatively small subsequence score,

which then shows a high difference to the best scores of

the individual clusters.

Another rather inconspicuous time series detected by

outlier detection has an incidence rate just above 0.2 at the

last time point. This time series reflects the development of

the pandemic in Romania. It is detected mainly because it

completely changes its cluster members twice. Firstly, the

incidence rate in Romania at time one does not develop

like that of its cluster members at time zero: In contrast to

Romania’s cluster members at time zero, the incidence rate

in Romania does not continue to rise but remains at about

the same level. The other change occurs from time ten to

time eleven: Here, Romania’s incidence rate jumps within

one week, so that it is now in a cluster with countries of a

higher infection level.

In this example, the difference between the two applied

proportions is not only that the asymmetric proportion

detects more outliers. The Jaccard proportion also detects

other outliers. Exemplary for this is the sequence of the top

orange-colored time series. This is detected by the outlier

detection with Jaccard proportion, since a merge of clusters

takes place in the last time point and this is penalised

by the symmetric proportion. With the asymmetric

proportion, in contrast, the merge has no effect on the

subsequence score of the time series.
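
The different treatment of merges can be reproduced with two short set computations. The sketch assumes that the asymmetric proportion relates the intersection of two clusters to the size of the earlier cluster and that the symmetric variant is the Jaccard index. The function names and the cluster contents are hypothetical.

```python
def asymmetric_proportion(cluster_before, cluster_after):
    """Share of the earlier cluster that is found again in the later one."""
    return len(cluster_before & cluster_after) / len(cluster_before)


def jaccard_proportion(cluster_before, cluster_after):
    """Intersection over union of the two clusters."""
    return (len(cluster_before & cluster_after)
            / len(cluster_before | cluster_after))


# Merge: the earlier cluster is absorbed completely into a larger one.
before = {"A", "B", "C"}
merged = {"A", "B", "C", "D", "E", "F"}
print(asymmetric_proportion(before, merged))
print(jaccard_proportion(before, merged))

# Split: only half of the earlier cluster stays together.
large = {"A", "B", "C", "D"}
split = {"A", "B"}
print(asymmetric_proportion(large, split))
print(jaccard_proportion(large, split))
```

Under these assumptions the merge leaves the asymmetric value at 1.0, while the Jaccard value drops to 0.5. The split is penalised by both variants.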

Overall, relatively many outliers are found in this

example. This is mainly due to the choice of the parameter

τ and the relatively over-time stable composition of the time

series. The clustering has many time series that remain in

a cluster over time with comparatively many time series.

This leads to high subsequence scores and thus to high

best scores. Time series that change their cluster members

only once have a comparatively low subsequence score,

which also leads directly to classification as outliers due to

the selected τ. This example also shows how to deal with

missing data: A time series only begins in the sixth week, and its

sequence from week six to week eight is recognised as an

outlier. On the one hand, this can be explained by the change

in the cluster composition from week seven to week eight

and, on the other hand, by the shortness of the time series.

5.6 ElectricityLoadDiagrams20112014 data set

We clustered the one-dimensional data set with DBSCAN

and determined the best parameters with CLOSE. The

highest CLOSE score of 0.323 was achieved with ε =

0.11 and minpts = 2. The aim of this experiment is

to show the influence of the outlier parameter τ of the

DOOTS algorithm; we therefore applied four different τ values

to the clustered data set. The low CLOSE score is due to

the large differences in the consumption data, which make

clustering fundamentally difficult. As shown in Fig. 13,

one or two clusters were identified at each time point. As

expected, electricity consumption is highest in the winter

months for most time series. It is also interesting to note

that most time series show a local maximum in July (time

7). The expected outliers are the time series that either move

completely independently of other time series or those that

move with other time series at some points but then diverge

from them. Figure 13a shows the result with τ=0.3 and

has the highest number of outliers of the four panels in

Fig. 13. For example, an interesting outlier in this figure

is the top time series at time 3, which was detected as

an outlier from time 2 to time 6. This has to do with the

comparatively low subsequence score of the time series at

these times. The subsequence score of the time series is

obviously even significantly worse than the subsequence

score of the time series in the same cluster at time point 4.

This other time series was classified as an intuitive outlier

in the first two time points, so that these time points fall

out of the calculation for the subsequence score. The top

time series at time 3 is no longer recognised as an outlier

from τ = 0.4. The difference in the subsequence score

to the time series with which it was clustered is therefore

obviously greater than or equal to 0.3 but less than 0.4. The

outliers detected with the different τ values differ not

only in number but also in length. For example, the top

sequence at time 7 with τ=0.4 is detected as an outlier

from time 6 to time 10, while the same time series with

τ=0.6 is only detected as an outlier from time 6 to time 9.

This indicates that the subsequence score diverges from

the other time series even before the actual detection. The τ

thus determines, among other things, how early an outlier is

already classified as such.
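
The role of τ as a threshold on the gap between the best score and a subsequence score can be sketched as follows. The comparison with "greater than or equal" follows the reasoning in the text above. The function name is_outlier and the two score values are hypothetical and are not taken from the experiment.

```python
def is_outlier(subsequence_score, best_score, tau):
    """Flag a subsequence whose score falls short of the best score by at least tau."""
    return best_score - subsequence_score >= tau


# Hypothetical scores with a gap of about 0.35.
best_score = 0.90
subsequence_score = 0.55

flags = {tau: is_outlier(subsequence_score, best_score, tau)
         for tau in (0.3, 0.4, 0.6)}
print(flags)
```

A subsequence with such a gap is flagged for τ = 0.3 only, which mirrors the behaviour described for the top time series at time 3.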

5.6.1 Generated data set

For the evaluation of DOOTS on the generated data set,

the clustering setting achieving the best CLOSE score was

chosen as underlying clustering. Therefore, K-Means with

k=4 was used. Figure 14 shows the detected outlier

sequences on the bivariate data set. All four proposed

derivatives of our algorithm have been tested: the original

method (DOOTS), the one using the Jaccard index in

the proportion calculation (jDOOTS), the one using a

weighting in the subsequence score (wDOOTS) and the

method combining the weighting and the Jaccard index

(jwDOOTS).

As can be seen, both approaches using the weighting

function got the same results (Fig. 14b). The same applies to

the remaining two (Fig. 14a). Both results are very similar

to each other, as they differ only at one timestamp and that is

the last one. Each method detects all three outlier sequences

(42, 43, 44) in the first four timestamps. At time 5, all

approaches are in agreement that there are only two outliers:

42 and 43. But at the last timestamp the weighted methods

mark only one sequence (42) as an outlier