Cluster Analysis of Time Series
via Kendall Distribution
Fabrizio Durante and Roberta Pappadà
School of Economics and Management,
Free University of Bozen–Bolzano, Bolzano, Italy
{fabrizio.durante,roberta.pappada}@unibz.it
Abstract. We present a method to cluster time series according to
the calculation of the pairwise Kendall distribution function between
them. A case study with environmental data illustrates the introduced
methodology.
Keywords: Cluster analysis, Copula, Kendall distribution, Tail depen-
dence.
1 Introduction
Cluster analysis plays an important role in extracting information from a group
of different time series. It can be used, for instance, to find some dependence
information, which is a key tool in geosciences and hydrology in order to un-
derstand the relationships between different variables. In general, a time series
clustering procedure involves the choice of an adequate metric between the univariate
time series, which allows one to group together series exhibiting common
trends occurring at different times or similar sub-patterns in the data, according
to the idea of similarity one has adopted (see [1]).
A widely used approach to measure similarity is to consider a Pearson-
correlation based distance metric. However, recent studies have underlined that
classical correlation measures are often inadequate to capture the real depen-
dence structure between individual risk factors, especially in a financial and
environmental context (see, for instance, [2], [3]). As such, several investigations
have been carried out in recent years from different perspectives, with tools
ranging from extreme-value analysis ([4], [5]) to the concept of tail copulas
(see, for instance, [6] and the references therein). In particular, many research
efforts have remarked on the usefulness of extreme value theory in assessing cli-
mate changes and detecting spatial clusters (see, for instance [7], [8]). Moreover,
recent developments in statistical hydrology have shown the great potential of
copulas for the construction of multivariate cumulative distribution functions
and for carrying out a multivariate frequency analysis ([9], [10]). Extreme value
copulas have been largely used to investigate the spatial dependencies between
the involved variables, introducing a novel contribution to the interpretation of
meteorological and hydrological phenomena ([11], [12], [13]). From another perspective,
methods have been recently proposed in order to cluster time series
observations according to a suitable copula-based dissimilarity measure, with
applications in the financial setting. Such an approach has been adopted, for
instance, in [14] focusing on the use of conditional Spearman’s correlation, and
in [15], [16] where the clustering procedure is based on the estimation of pairwise
tail dependence coefficients.

This work was supported by the Free University of Bozen-Bolzano via the project MODEX.
© Springer International Publishing Switzerland 2015. P. Grzegorzewski et al. (eds.),
Strengthening Links between Data Analysis & Soft Computing,
Advances in Intelligent Systems and Computing 315, DOI: 10.1007/978-3-319-10765-3_25

Management of environmental resources often requires the analysis of spatial
rainfall extremes which typically exhibit some form of dependence as a result
of the regional nature of hydrological phenomena. Reliable estimates of extreme
rainfall events are required for several hydrological purposes and their spatial
distribution is of both physical and practical interest, particularly in the case of
regional studies. Several approaches are available in the literature for the char-
acterization of spatial extremes, relying on a likelihood-based approach ([17],
[18]), a Bayesian approach ([19]) and cluster analysis for assessing the spatial
distribution of extremes ([20], [21]). In particular, the detection of spatial clus-
ters can help in summarizing available data, extracting useful information and
formulating hypotheses for further research. Clustering could be used in order to
identify homogeneous regions to be considered for regionalization procedures.
In the present contribution, we would like to use the Kendall distribution
function associated with a random vector in order to develop a novel clustering
procedure for grouping random vectors. We outline here briefly the possible
application of the proposed methodology to hydrological data by analysing time
series of maximum annual rainfall data collected at rain gauges of different sites
in the province of Bolzano-Bozen (Italy). Notice that according to the approach
in [22], homogeneity in the sense of Kendall’s distance implies homogeneity in
the sense of return period, a notion frequently used in environmental sciences
for the identification of dangerous events and risk assessment (see also [23],[24]).
2 Clustering via Kendall Distribution
We recall that a (bivariate) copula is a joint cumulative probability distribution
function with uniform univariate margins on $I = [0,1]$. If we consider a
random pair $(X, Y)$ with continuous cumulative distribution function $H$, then
the bivariate probability integral transform is the random variable defined by
$W = H(X, Y)$. It is known that $W$ depends only on the copula $C$ of $(X, Y)$ and
is equal in distribution to $C(U, V)$, where $U = F_X(X)$ and $V = F_Y(Y)$, with
$F_X$, $F_Y$ being the univariate marginals of $X$ and $Y$, respectively. First introduced in
[25] for inference on Archimedean copulas, the Kendall distribution function (see
also [26]) is simply the distribution function of $W$ and is given by
$$K(q) = P(W \le q),$$
where $q \in [0,1]$ is a probability level.
There are two important particular cases for the Kendall distribution. When
$X$ and $Y$ are comonotonic, one finds $K(q) = K_M(q) = q$ for all $0 \le q \le 1$, which
corresponds to $C(u, v) = M(u, v) = \min(u, v)$, where $M$ is the Fréchet-Hoeffding
upper bound copula. Under the hypothesis of independence between
$X$ and $Y$, which is equivalent to considering $C(u, v) = \Pi(u, v) = uv$, $K$ has the
form $K(q) = K_\Pi(q) = q - q \log(q)$, $0 \le q \le 1$. Thus, on the graph of $K$ based
on pseudo-random samples from a positively dependent bivariate vector $(X, Y)$,
perfect positive dependence would translate into data points aligned on the line
$y = x$, while the plot will come close to the curve $K_\Pi(q)$ as the data
become less and less dependent. Notice that, for each Kendall distribution $K$,
one has the lower bound $K \ge K_M$ on $I$. Starting with [27] (see also [28]), ordering
properties of Kendall distributions have been used to detect dependence in
copula models. Here we show how to use them to provide a clustering procedure
for time series.
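As a quick check of the independence case, the form of $K_\Pi$ can be recovered by conditioning on $U$. The following is a standard derivation added here for illustration, assuming $U$ and $V$ independent and uniform on $I$:

```latex
\begin{align*}
K_\Pi(q) &= P(UV \le q)
          = \int_0^1 P\!\left(V \le \tfrac{q}{u}\right) du \\
         &= \int_0^q 1 \, du + \int_q^1 \frac{q}{u} \, du
          = q - q \log(q), \qquad 0 < q \le 1 .
\end{align*}
```

Since $-q \log(q) \ge 0$ on $I$, this is consistent with the bound $K_\Pi \ge K_M$.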
Suppose that we have at our disposal a set of time series $X_1^t, \dots, X_n^t$, corresponding
to $n$ different measurements collected at time $t \in \{1, \dots, T\}$. Such time series
are assumed to be a random sample from an unknown vector $X = (X_1, \dots, X_n)$.
In order to interpret the following results properly, it is also convenient to suppose
that all the pairs in $X$ are positively quadrant dependent, i.e. their
copula is greater than or equal to $\Pi$. We would like to group the components of
$X$ according to the strength of their inter-dependence. To do this, following the
general principle applied in [14], we may proceed as follows:
1. Calculate the Kendall distribution function $K(\cdot)$ for each pair $(X_i, X_j)$, and
denote it by $K_{ij}$.
2. Define a kind of distance between $X_i$ and $X_j$ in terms of the related Kendall
distribution $K_{ij} = K$ and the Kendall distribution $K_M$ of comonotone random
variables by one of the following definitions:
$$d_2(K, K_M) = \int_0^1 (q - K(q))^2 \, dq,$$
$$d_\infty(K, K_M) = \sup_{q \in [0,1]} |q - K(q)|.$$
Intuitively, two random variables have small distance if their Kendall distribution
is close to $K_M$ or, in other words, if they tend to be comonotone.
3. From these metrics, create a suitable dissimilarity matrix $D := (\delta_{ij})$, $i, j = 1, \dots, n$,
for instance by using $\delta_{ij} = d_2(K_{ij}, K_M)$. In fact, if the random
variables are comonotone, their dissimilarity is 0, while this number increases
as they become less and less dependent. Hence, in this construction,
the larger the distance, the weaker the dependence.
4. Apply classical cluster techniques to the obtained dissimilarity matrix. In
particular, agglomerative hierarchical methods with nearest distance (single
linkage), furthest distance (complete linkage) and average distance (average
linkage) can be used as grouping criteria.
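To give a sense of the scale of the two distances in step 2, the sketch below evaluates them numerically for the independence case $K_\Pi(q) = q - q\log(q)$. It is only an illustration: the function names and the grid size are assumptions made here, and the integral and supremum are approximated on a uniform grid rather than computed in closed form.

```python
import math

def k_independence(q):
    """Kendall distribution under independence: q - q*log(q), set to 0 at q = 0."""
    return 0.0 if q == 0.0 else q - q * math.log(q)

def d2(K, n=100_000):
    """Midpoint-rule approximation of the integral of (q - K(q))^2 over [0, 1]."""
    h = 1.0 / n
    return h * sum(((i + 0.5) * h - K((i + 0.5) * h)) ** 2 for i in range(n))

def d_inf(K, n=100_000):
    """Grid approximation of sup |q - K(q)| over [0, 1]."""
    return max(abs(i / n - K(i / n)) for i in range(n + 1))

d2_indep = d2(k_independence)       # closed form: 2/27
dinf_indep = d_inf(k_independence)  # closed form: 1/e, attained at q = 1/e
print(d2_indep, dinf_indep)
```

Under independence the two distances equal $2/27 \approx 0.074$ and $1/e \approx 0.368$, while both vanish for $K = K_M$.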
As for the estimation of the Kendall distribution function,
we rely on non-parametric estimation by using the empirical distribution
function computed as in [29]. Suppose that $(X_{11}, X_{12}), \dots, (X_{T1}, X_{T2})$ is a random
sample from a distribution $H$ with copula $C$. The empirical Kendall distribution
function $K_T$ is given, for all $q \in [0,1]$, by
$$K_T(q) = \frac{1}{T} \sum_{j=1}^{T} \mathbf{1}(W_j \le q),$$
where, for each $j \in \{1, \dots, T\}$,
$$W_j = \frac{1}{T+1} \sum_{t=1}^{T} \mathbf{1}(X_{t1} < X_{j1}, X_{t2} < X_{j2}).$$
The limiting behaviour of the empirical process $\sqrt{T}(K_T - K)$ has been discussed
in [30], where the convergence in law to a centered Gaussian limit under mild
regularity conditions is proved.
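The estimator above can be sketched in a few lines of plain Python. This is a direct transcription of the two formulas with a naive double loop, not the authors' implementation; the function names and the toy sample are assumptions made for illustration.

```python
def pseudo_observations(xs, ys):
    """W_j = (1/(T+1)) * #{t : x_t < x_j and y_t < y_j}, for j = 1..T."""
    T = len(xs)
    return [
        sum(1 for t in range(T) if xs[t] < xs[j] and ys[t] < ys[j]) / (T + 1)
        for j in range(T)
    ]

def empirical_kendall(xs, ys):
    """Return the step function q -> K_T(q) = (1/T) * #{j : W_j <= q}."""
    w = pseudo_observations(xs, ys)
    T = len(w)
    return lambda q: sum(1 for wj in w if wj <= q) / T

# Toy comonotone sample (illustrative values, not taken from the paper's data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 20.0, 30.0, 40.0, 50.0]
w = pseudo_observations(xs, ys)
K_T = empirical_kendall(xs, ys)
print(w)         # W_j = (j - 1) / 6 for j = 1..5
print(K_T(0.5))  # four of the five W_j are <= 0.5
```

The double loop costs $O(T^2)$ comparisons per pair of series, which is negligible for samples of a few dozen annual maxima.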
3 An Empirical Case Study
In order to briefly illustrate a possible application of the proposed methodology,
we present here a case study with environmental data. The data were collected by the
“Ufficio Idrografico” of the province of Bolzano-Bozen and are available online.
They consist of daily rainfall measurements recorded at 18 gauge stations
spread across the province of Bolzano-Bozen in North-Eastern Italy. This
results in a set of $d = 18$ time series originally formed by $T = 18262$ observations.
Tab. 1 reports the available information on the analysed rainfall records. From
these time series, we extracted annual maxima at each spatial location, resulting
in a $50 \times 18$ matrix of time series observations $\tilde{X}_1^m, \dots, \tilde{X}_d^m$, $m \in \{1, \dots, 50\}$,
summarized by Fig. 1. The selection of annual maxima has two main goals: it
transforms data with strong seasonality into data that can be assumed to be
independent and identically distributed; it transforms data that may have a
general dependence structure into data that are positively dependent (actually,
they are coupled by an extreme-value copula). For more details, see [5]. The
latter property is quite relevant since it allows one to apply the method described in
Section 2 in order to detect the presence of clusters of the analysed sites on the
basis of the componentwise maxima.
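The extraction of annual maxima from a daily series can be sketched as follows. The input layout (a start date plus one reading per consecutive day) and the synthetic values are assumptions made for illustration; the actual files of the data provider may be organised differently and may contain missing values that this sketch does not handle. The period 1961 to 2010 is the one reported in the caption of Fig. 1.

```python
from datetime import date, timedelta

def annual_maxima(start, values):
    """Map each calendar year to the maximum of the daily values falling in it.

    `start` is the date of the first observation; `values` holds one reading
    per consecutive day (a hypothetical layout, not the provider's format).
    """
    maxima = {}
    for offset, value in enumerate(values):
        year = (start + timedelta(days=offset)).year
        if year not in maxima or value > maxima[year]:
            maxima[year] = value
    return maxima

# Synthetic daily series covering 1961-2010 (made-up values, illustration only).
start = date(1961, 1, 1)
n_days = (date(2011, 1, 1) - start).days
daily = [float((7 * k) % 101) for k in range(n_days)]

maxima = annual_maxima(start, daily)
print(n_days, len(maxima))
```

The 50 calendar years from 1961 to 2010 contain 18262 days, which agrees with the value of $T$ reported above and yields 50 maxima per station.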
Specifically, we compute the dissimilarity matrix $D := (\delta_{ij})$, $i, j = 1, \dots, d$,
such that the dissimilarity between two time series is defined as the distance
$$\delta_{ij} = d_2(\hat{K}_{ij}, K_M) = \int_0^1 (q - \hat{K}_{ij}(q))^2 \, dq,$$
where $\hat{K}_{ij}$ is the empirical Kendall distribution function based on the maxima
observations $(\tilde{X}_i^m, \tilde{X}_j^m)$, $m \in \{1, \dots, 50\}$.
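Because the empirical Kendall distribution is a step function, the integral defining $\delta_{ij}$ can be evaluated exactly, piece by piece. The sketch below shows one way to do this and to fill the dissimilarity matrix. It is an illustration under assumptions: the function names are invented here, the diagonal is set to 0 by convention, and the three toy series are not the rainfall data.

```python
def pseudo_observations(xs, ys):
    """W_j = (1/(T+1)) * #{t : x_t < x_j and y_t < y_j}."""
    T = len(xs)
    return [
        sum(1 for t in range(T) if xs[t] < xs[j] and ys[t] < ys[j]) / (T + 1)
        for j in range(T)
    ]

def d2_empirical(xs, ys):
    """Exact integral of (q - K_hat(q))^2 over [0, 1] for the step function K_hat."""
    w = sorted(pseudo_observations(xs, ys))
    T = len(w)
    breaks = [0.0] + w + [1.0]
    total = 0.0
    for k in range(T + 1):
        # On [breaks[k], breaks[k+1]) exactly k of the W_j are <= q, so K_hat = k/T.
        a, b, level = breaks[k], breaks[k + 1], k / T
        total += ((b - level) ** 3 - (a - level) ** 3) / 3.0
    return total

def dissimilarity_matrix(series):
    """Symmetric matrix of pairwise d2 distances; diagonal left at 0 by convention."""
    n = len(series)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = d2_empirical(series[i], series[j])
    return D

# Toy series: b is comonotone with a, c is countermonotone with a.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
c = [5.0, 4.0, 3.0, 2.0, 1.0]
D = dissimilarity_matrix([a, b, c])
print(D[0][1], D[0][2])
```

For this toy sample the comonotone pair gives $1/30$ rather than exactly 0, because $\hat{K}$ is a step function on a finite sample, while the countermonotone pair gives $1/3$.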
The choice of this metric reflects the final goal of the clustering procedure
in the sense that two strongly dependent time series will give an extremely low
value of their dissimilarity. The results of the clustering procedure are illustrated
by a tree diagram usually referred to as dendrogram, which represents the arrangement
of the clusters produced by hierarchical agglomerative clustering. In
Fig. 2, the dendrogram based on complete linkage is displayed. The vertical axis
represents the distance at which two clusters are joined. From the dendrogram it
is possible to identify, e.g., four different groups, by cutting at about height 0.06.

Table 1. Summary of the rainfall measurement stations
Code  Station                 Longitude (°E)  Latitude (°N)  Height (m)
0220  S.VALENTINO ALLA MUTA   10.5277         46.7745        1520
0310  TUBRE                   10.4775         46.6503        1119
2090  PLATA                   11.1783         46.8225        1147
3140  FLERES                  11.3477         46.9639        1246
3260  VIPITENO-CONVENTO       11.4295         46.8978         948
8320  BOLZANO                 11.3127         46.4976         254
9150  SESTO                   12.3477         46.7035        1310
0250  MONTE MARIA             10.5213         46.7057        1310
0480  MAZIA                   10.6175         46.6943        1570
1580  VERNAGO                 10.8493         46.7357        1700
2170  S.LEONARDO PASSIARIA    11.2471         46.8091         644
2670  PAVICOLO                11.1093         46.6278        1400
3450  RIDANNA                 11.3068         46.9091        1350
4450  S.MADDALENA IN CASIES   12.2427         46.8353        1398
6650  FUNDRES                 11.7029         46.8872        1159
8570  BRONZOLO                11.3111         46.4065         226
8730  REDAGNO                 11.3968         46.3465        1562
9100  ANTERIVO                11.3678         46.2773        1209

Fig. 1. Boxplot of annual maxima at each station from 1961 to 2010. The station codes
are as in Tab. 1. On the y-axis the amount of rainfall is measured in millimeters.
Fig. 2. Dendrogram for the 18 rainfall measurement stations listed in Tab. 1 based on
the complete linkage method. The vertical axis (Height) ranges from 0.00 to 0.08.
Fig. 3. Map of the rainfall measurement stations marked according to the 4-clusters
solution (Cluster 1 to Cluster 4) in the province of Bolzano–Bozen (North-Eastern Italy).
Axes: longitude from 10.5°E to 12.5°E, latitude from 46°N to 47.4°N.
The 4-clusters solution is visualized on the map in Fig. 3, where the stations are
marked according to their cluster.
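The complete-linkage step and the cut of the dendrogram at a given height can be sketched with a naive agglomerative loop. This is a didactic illustration rather than the software used for the case study; in practice one would typically rely on a statistical library. The function names and the small dissimilarity matrix are hypothetical.

```python
def complete_linkage(D):
    """Naive agglomerative clustering with complete linkage.

    D is a symmetric dissimilarity matrix (list of lists). Returns the merge
    history as (members_a, members_b, height) tuples, in merge order.
    """
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                h = max(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

def cut_tree(D, height):
    """Clusters obtained by applying only the merges whose height is <= `height`."""
    groups = {i: {i} for i in range(len(D))}
    for left, right, h in complete_linkage(D):
        if h > height:  # merge heights are non-decreasing under complete linkage
            break
        union = groups[left[0]] | groups[right[0]]
        for i in union:
            groups[i] = union
    return sorted({tuple(sorted(g)) for g in groups.values()})

# Hypothetical 4x4 dissimilarity matrix (made-up numbers, not the paper's).
D = [
    [0.00, 0.01, 0.07, 0.08],
    [0.01, 0.00, 0.06, 0.09],
    [0.07, 0.06, 0.00, 0.02],
    [0.08, 0.09, 0.02, 0.00],
]
print(cut_tree(D, 0.05))  # [(0, 1), (2, 3)]
```

In this toy matrix the two pairs merge at heights 0.01 and 0.02 and join each other only at 0.09, so any cut between 0.02 and 0.09 returns two groups.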
For the hydrological interpretation of the results it seems that several factors
should be taken into account in order to determine correlated rainfall extremes.
In fact, not only the geographical proximity plays a role, but also the strong
heterogeneity in morphological and climatic features.
4 Conclusions
We have presented a procedure for grouping time series according to a copula-based
dependence function among them. In particular, we considered a dissimilarity
measure that is based on the Kendall distribution associated with two
continuous random variables, since such a function provides useful information
in terms of environmental risk, as shown in [22]. The proposed approach comple-
ments similar methods provided by the authors about copula-based clustering
of time series (see, e.g., [14], [16]).
References
1. Liao, T.W.: Clustering of time series data - a survey. Pattern Recogn. 38(11),
1857–1874 (2005)
2. Embrechts, P., McNeil, A.J., Straumann, D.: Correlation and Dependence in Risk
Management: Properties and Pitfalls. Cambridge Univ. Press, New York (2001)
3. Poulin, A., Huard, D., Favre, A.-C., Pugin, S.: Importance of tail dependence in
bivariate frequency analysis. J. Hydrol. Eng. 12, 394–403 (2007)
4. Gudendorf, G., Segers, J.: Extreme-value copulas. In: Jaworski, P., Durante, F.,
Härdle, W., Rychlik, T. (eds.) Copula Theory and its Applications. Lecture Notes
in Statistics - Proceedings, vol. 198, pp. 127–145. Springer, Heidelberg (2010)
5. Salvadori, G., De Michele, C., Kottegoda, N.T., Rosso, R.: Extremes in Nature. An
Approach Using Copulas. Water Sci. and Technology Library 56. Springer (2007)
6. Jaworski, P.: Tail behaviour of copulas. In: Jaworski, P., Durante, F., Härdle, W.,
Rychlik, T. (eds.) Copula Theory and its Applications. Lecture Notes in Statistics
- Proceedings, vol. 198, pp. 161–186. Springer, Heidelberg (2010)
7. Gaetan, C., Grigoletto, M.: A hierarchical model for the analysis of spatial rainfall
extremes. J. ABES 12(4), 434–449 (2007)
8. Scotto, M.G., Barbosa, S.M., Alonso, A.M.: Extreme value and cluster analysis of
European daily temperature series. Journal of Applied Statistics 38(12), 2793–2804
(2011)
9. Favre, A.-C., Adlouni, S.E., Perreault, L., Thiemonge, N., Bobee, B.: Multivariate
hydrological frequency analysis using copulas. Water Resour. Res. 40 (2004)
10. Salvadori, G., De Michele, C.: Frequency analysis via copulas: theoretical aspects
and applications to hydrological events. Water Resour. Res. 40 (2004)
11. Bárdossy, A.: Copula-based geostatistical models for groundwater quality parameters.
Water Resour. Res. 42(11) (2006)
12. Bonazzi, A., Cusack, S., Mitas, C., Jewson, S.: The spatial structure of European
wind storms as characterized by bivariate extreme-value Copulas. Nat. Hazards
Earth Syst. Sci. 12, 1769–1782 (2012)
13. Genest, C., Favre, A.-C.: Everything you always wanted to know about copula
modeling but were afraid to ask. J. Hydrologic Eng. 12(4), 347–368 (2007)
14. Durante, F., Pappadà, R., Torelli, N.: Clustering of financial time series in risky
scenarios. Adv. Data Anal. Classif. (2013) (in press), doi: 10.1007/s11634-013-0160-4
15. De Luca, G., Zuccolotto, P.: A tail dependence-based dissimilarity measure for
financial time series clustering. Adv. Data Anal. Classif. 5(4), 323–340 (2011)
16. Durante, F., Pappadà, R., Torelli, N.: Clustering of extreme observations via tail
dependence estimation. Statist. Papers (in press, 2014)
17. Buishand, T., de Haan, L., Zhou, C.: On spatial extremes: With application to a
rainfall problem. Ann. Appl. Statist. 2, 624–642 (2008)
18. Cooley, D., Naveau, P., Poncet, P.: Variograms for spatial max-stable random fields.
In: Dependence in Probability and Statistics. Lectures Notes in Statistics, pp. 373–
390. Springer, Heidelberg (2006)
19. Cooley, D., Nychka, D., Naveau, P.: Bayesian spatial modeling of extreme precipi-
tation return levels. J. Amer. Statist. Assoc. 102, 824–840 (2007)
20. Robeson, S.M., Doty, J.A.: Identifying rogue air temperature stations using cluster
analysis of percentile trends. J. Climate 18, 1275–1287 (2005)
21. Scotto, M.G., Alonso, A.M., Barbosa, S.M.: Clustering time series of sea levels:
Extreme value approach. J. Waterway, Port, Coastal, and Ocean Engrg. 136, 215–
225 (2010)
22. Salvadori, G., De Michele, C., Durante, F.: On the return period and design in a
multivariate framework. Hydrol. Earth Syst. Sci. 15, 3293–3305 (2011)
23. Salvadori, G., Durante, F., De Michele, C.: Multivariate return period calculation
via survival functions. Water Resour. Res. 49(4), 2308–2311 (2013)
24. Salvadori, G., Durante, F., Perrone, E.: Semi–parametric approximation of the
Kendall’s distribution and multivariate return periods. J. SFdS 154(1), 151–173
(2013)
25. Genest, C., Rivest, L.-P.: Statistical inference procedures for bivariate Archimedean
copulas. J. Amer. Statist. Assoc. 88(423), 1034–1043 (1993)
26. Genest, C., Rivest, L.-P.: On the multivariate probability integral transformation.
Statist. Probab. Lett. 53(4), 391–399 (2001)
27. Capéraà, P., Fougères, A.-L., Genest, C.: A stochastic ordering based on a decomposition
of Kendall’s tau. In: Beneš, V., Štěpán, J. (eds.) Distributions with Given
Marginals and Moment Problems, pp. 81–86. Kluwer Academic Publishers, Dordrecht
28. Nelsen, R.B., Quesada-Molina, J.J., Rodríguez-Lallena, J.A., Úbeda-Flores, M.:
Kendall distribution functions. Statist. Probab. Lett. 65, 263–268 (2003)
29. Genest, C., Nešlehová, J., Ziegel, J.: Inference in multivariate Archimedean
copula models. TEST 20, 223–256 (2011)
30. Barbe, P., Genest, C., Ghoudi, K., Rémillard, B.: On Kendall’s process. J. Multivar.
Anal. 58, 197–229 (1996)