# Correlation Clustering for Learning Mixtures of Canonical Correlation Models


X. Z. Fern∗, C. E. Brodley†, M. A. Friedl‡

Abstract

This paper addresses the task of analyzing the correlation between two related domains X and Y. Our research is motivated by an Earth Science task that studies the relationship between vegetation and precipitation. A standard statistical technique for such problems is Canonical Correlation Analysis (CCA). A critical limitation of CCA is that it can only detect linear correlation between the two domains that is globally valid throughout both data sets. Our approach addresses this limitation by constructing a mixture of local linear CCA models through a process we name correlation clustering. In correlation clustering, both data sets are clustered simultaneously according to the data's correlation structure such that, within a cluster, domain X and domain Y are linearly correlated in the same way. Each cluster is then analyzed using traditional CCA to construct local linear correlation models. We present results on both artificial data sets and Earth Science data sets to demonstrate that the proposed approach can detect useful correlation patterns, which traditional CCA fails to discover.

1 Introduction

In Earth science applications, researchers are often interested in studying the correlation structure between two domains in order to understand the nature of the relationship between them. The inputs to our correlation analysis task can be considered as two data sets X and Y whose instances are described by feature vectors $\vec{x}$ and $\vec{y}$ respectively. The dimensions of $\vec{x}$ and $\vec{y}$ do not need to be the same, although there must be a one-to-one mapping between instances of X and instances of Y. Thus, it is often more convenient to consider these two data sets as one compound data set whose instances are described by two feature vectors $\vec{x}$ and $\vec{y}$. Indeed, throughout the remainder of this paper we will refer to the input of our task as one data set, and the goal is to study how the two sets of features are correlated to each other.

Canonical Correlation Analysis (CCA) [4, 6] is a multivariate statistical technique commonly used to identify and quantify the correlation between two sets of random variables. Given a compound data set described by feature vectors $\vec{x}$ and $\vec{y}$, CCA seeks to find a linear transformation of $\vec{x}$ and a linear transformation of $\vec{y}$ such that the resulting two new variables are maximally correlated.

∗School of Elec. and Comp. Eng., Purdue University, West Lafayette, IN 47907, USA

†Dept. of Comp. Sci., Tufts University, Medford, MA 02155, USA

‡Dept. of Geography, Boston University, Boston, MA, USA

In Earth science research, CCA has often been applied to examine whether there is a cause-and-effect relationship between two domains, or to predict the behavior of one domain based on another. For example, in [13] CCA was used to analyze the relationship between the monthly mean sea-level pressure (SLP) and sea-surface temperature (SST) over the North Atlantic in the months of December, January and February. This analysis confirmed the hypothesis that atmospheric SLP anomalies cause SST anomalies.

Because CCA is based on linear transformations, the scope of its applications is necessarily limited. One way to tackle this limitation is to use nonlinear canonical correlation analysis (NLCCA) [5, 8]. NLCCA applies nonlinear functions to the original variables in order to extract correlated components from the two sets of variables. Although promising results have been achieved by NLCCA in some Earth science applications, such techniques tend to be difficult to apply because of the complexity of the model and the lack of robustness due to overfitting [5].

In this paper we propose to use a mixture of local linear correlation models to capture the correlation structure between two sets of random variables (features). Mixtures of local linear models not only provide an alternative solution to capturing nonlinear correlations, but also have the potential to detect correlation patterns that are significant only in a part (a local region) of the data. The philosophy of using multiple local linear models to model global nonlinearity has been successfully applied to other statistical approaches with similar linearity limitations, such as principal component analysis [12] and linear regression [7]. Our approach uses a two-step procedure. Given a compound data set, we first solve a clustering problem that partitions the data set into clusters such that each cluster contains instances whose $\vec{x}$ features and $\vec{y}$ features are linearly correlated. We then independently apply CCA to each cluster to form a mixture of correlation models that are locally linear.

In designing this two-step process, we need to address the following two critical questions.

1. Assume we are informed a priori that we can model the correlation structure using k local linear CCA models. How should we cluster the data in the context of correlation analysis?

2. In real-world applications, we are rarely equipped with knowledge of k. How can we decide how many clusters there are in the data, or whether a global linear structure will suffice?

Note that the goal of clustering in the context of correlation analysis is different from that of traditional clustering. In traditional clustering, the goal is to group together instances that are similar (as measured by a certain distance or similarity metric). In contrast, here we need to group instances based on how their $\vec{x}$ features and $\vec{y}$ features correlate to each other, i.e., instances that share a similar correlation structure between the two sets of features should be clustered together. To differentiate this clustering task from traditional clustering, we name it correlation clustering,¹ and in Section 3 we propose an iterative greedy k-means style algorithm for this task.

To address the second question, we apply the technique of cluster ensembles [2] to our correlation clustering algorithm, which provides the user with a visualization of the results that can be used to determine the proper number of clusters in the data. Note that our correlation clustering algorithm is a k-means style algorithm and as such may have many locally optimal solutions: different initializations may lead to significantly different clustering results. By using cluster ensembles, we can also address the local optima problem of our clustering algorithm and find a stable clustering solution.

To demonstrate the efficacy of our approach, we apply it to both artificial data sets and real-world Earth science data sets. Our results on the artificial data sets show that (1) the proposed correlation clustering algorithm is capable of finding a good partition of the data when the correct k is used, and (2) cluster ensembles provide an effective tool for finding k. When applied to the Earth science data sets, our technique detected significantly different correlation patterns in comparison to what was found via traditional CCA. These results led our domain expert to highly interesting hypotheses that merit further investigation.

¹Note that the term correlation clustering has also been used by [1] as the name of a technique for traditional clustering.

The remainder of the paper is arranged as follows. In Section 2, we review the basics of CCA. Section 3 introduces the intuitions behind our correlation clustering algorithm and formally describes the algorithm, which is then applied to artificially constructed data sets to demonstrate its efficacy in finding correlation clusters. Section 4 demonstrates how cluster ensemble techniques can be used to determine the number of clusters in the data and to address the local optima problem of the k-means style correlation clustering algorithm. Section 5 explains our motivating application, presents results, and describes how our domain expert interprets the results. Finally, in Section 6 we conclude the paper and discuss future directions.

2 Basics of CCA

Given a data set whose instances are described by two feature vectors $\vec{x}$ and $\vec{y}$, the goal of CCA is to find linear transformations of $\vec{x}$ and linear transformations of $\vec{y}$ such that the resulting new variables are maximally correlated.

In particular, CCA constructs a sequence of pairs of strongly correlated variables $(u_1, v_1), (u_2, v_2), \cdots, (u_d, v_d)$ through linear transformations, where d is the minimum of the dimensions of $\vec{x}$ and $\vec{y}$. These new variables, the $u_i$'s and $v_i$'s, are named canonical variates (sometimes referred to as canonical factors). They are similar to principal components in the sense that principal components are linear combinations of the original variables that capture the most variance in the data, whereas canonical variates are linear combinations of the original variables that capture the most correlation between two sets of variables.

To construct these canonical variates, CCA first seeks to transform $\vec{x}$ and $\vec{y}$ into a pair of new variables $u_1$ and $v_1$ by the linear transformations

$$u_1 = (\vec{a}_1)^T \vec{x}, \quad v_1 = (\vec{b}_1)^T \vec{y}$$

where the transformation vectors $\vec{a}_1$ and $\vec{b}_1$ are defined such that $corr(u_1, v_1)$ is maximized subject to the constraint that both $u_1$ and $v_1$ have unit variance.² Once $\vec{a}_1, \vec{b}_1; \cdots; \vec{a}_i, \vec{b}_i$ are determined, we then find the next pair of transformations $\vec{a}_{i+1}$ and $\vec{b}_{i+1}$ such that the correlation between $(\vec{a}_{i+1})^T \vec{x}$ and $(\vec{b}_{i+1})^T \vec{y}$ is maximized, with the constraint that the resulting $u_{i+1}$ and $v_{i+1}$ are uncorrelated with all previous canonical variates.³ Note that the correlation between $u_i$ and $v_i$ becomes weaker as i increases: letting $r_i$ denote the correlation between the ith pair of canonical variates, we have $r_i \geq r_{i+1}$.

²This constraint ensures unique solutions.

³This constraint ensures that the extracted canonical variates contain no redundant information.

It can be shown that to find the projection vectors for the canonical variates, we only need to find the eigenvectors of the following matrices:

$$M_x = (\Sigma_{xx})^{-1} \Sigma_{xy} (\Sigma_{yy})^{-1} \Sigma_{yx}$$

and

$$M_y = (\Sigma_{yy})^{-1} \Sigma_{yx} (\Sigma_{xx})^{-1} \Sigma_{xy}$$

The eigenvectors of $M_x$, ordered according to decreasing eigenvalues, are the transformation vectors $\vec{a}_1, \vec{a}_2, \cdots, \vec{a}_d$, and the eigenvectors of $M_y$ are $\vec{b}_1, \vec{b}_2, \cdots, \vec{b}_d$. In addition, the eigenvalues of the two matrices are identical, and the square root of the ith eigenvalue is $\sqrt{\lambda_i} = r_i$, i.e., the correlation between the ith pair of canonical variates $u_i$ and $v_i$. Note that in most applications, only the first few most significant pairs of canonical variates are of real interest. Assuming that we are interested in the first d pairs of variates, we can represent all the useful information of the linear correlation structure as a model M, defined as

$$M = \{(u_j, v_j), r_j, (\vec{a}_j, \vec{b}_j) : j = 1 \cdots d\}$$

where $(u_j, v_j)$ is the jth pair of canonical variates, $r_j$ is the correlation between them, and $(\vec{a}_j, \vec{b}_j)$ are the projection vectors that generate them. We refer to M as a CCA model.
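The eigen-decomposition above can be sketched in a few lines of NumPy. This is an illustrative implementation of the construction in this section, not the authors' code; the function name and the unit-variance rescaling details are our own choices:

```python
import numpy as np

def fit_cca(X, Y, d):
    """Fit a linear CCA model via the eigenvectors of
    Mx = Sxx^-1 Sxy Syy^-1 Syx, as described in Section 2.
    X: (n, p), Y: (n, q); returns projections A (p, d), B (q, d)
    and canonical correlations r (d,), with r[i] = sqrt(lambda_i)."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Mx = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    lam, vecs = np.linalg.eig(Mx)
    order = np.argsort(-lam.real)[:d]          # decreasing eigenvalues
    r = np.sqrt(np.clip(lam.real[order], 0.0, 1.0))   # sqrt(lambda_i) = r_i
    A = vecs.real[:, order]
    # b_j is proportional to Syy^-1 Syx a_j
    B = np.linalg.solve(Syy, Sxy.T) @ A
    # rescale so each canonical variate has unit variance
    A /= np.sqrt(np.diag(A.T @ Sxx @ A))
    B /= np.sqrt(np.diag(B.T @ Syy @ B))
    return A, B, r
```

The empirical correlation between the first pair of fitted variates should then match the reported $r_1$ up to sampling noise.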

Once a CCA model is constructed, the next step is for the domain experts to examine the variates as well as the transformation vectors in order to understand the relationship between the two domains. This can be done in different ways depending on the application. In our motivating Earth science task, the results of CCA can be visualized as colored maps and interpreted by Earth scientists. We explain this process in Section 5.

3 Correlation Clustering

In this section, we first explain the basic intuitions that led to our algorithm and formally present our k-means style correlation clustering algorithm. We then apply the proposed algorithm to artificially constructed data sets and analyze the results.

3.1 Algorithm Description

Given a data set described by two sets of features $\vec{x}$ and $\vec{y}$, and the prior knowledge that the correlation structure of the data can be modeled by k local linear models, the goal of correlation clustering is to partition the data into k clusters such that, for instances in the same cluster, the features of $\vec{x}$ and $\vec{y}$ are linearly correlated in the same way. The critical question is how we should cluster the data to reach this goal. Our answer is based on the following important intuitions.

Table 1: A correlation clustering algorithm

**Input:** a data set of n instances, each described by two random vectors $\vec{x}$ and $\vec{y}$; k, the desired number of clusters

**Output:** k clusters and k linear CCA models, one for each cluster

**Algorithm:**

1. Randomly assign instances to the k clusters.
2. For $i = 1 \cdots k$, apply CCA to cluster i to build $M_i = \{(u_j, v_j), r_j, (\vec{a}_j, \vec{b}_j) : j = 1 \cdots d\}$, i.e., the top d pairs of canonical variates, the correlation $r_j$ between each pair, and the corresponding d pairs of projection vectors.
3. Reassign each instance to a cluster based on its $\vec{x}$ and $\vec{y}$ features and the k CCA models.
4. If no assignment has changed from the previous iteration, return the current clusters and CCA models. Otherwise, go to step 2.

Intuition 1: If a given set of instances contains multiple correlation structures, applying CCA to this instance set will not detect a strong linear correlation.

This is because when we put together instances that have different correlation structures, the original correlation patterns are weakened, as each is now valid in only part of the data. Conversely, if CCA detects strong correlation in a cluster, it is likely that the instances in the cluster share the same correlation structure. This suggests that we can use the strength of the correlation between the canonical variates extracted by CCA to measure the quality of a cluster. Note that it is computationally intractable to evaluate all possible clustering solutions in order to select the optimal one. This motivates us to examine a k-means style algorithm: starting from a random clustering solution, in each iteration we build a CCA model for each cluster and then reassign each instance to its most appropriate cluster according to its $\vec{x}$ and $\vec{y}$ features and the CCA models. In Table 1, we describe the basic steps of such a generic correlation clustering procedure.

The remaining question is how to assign instances to clusters. Note that in traditional k-means clustering, each iteration reassigns instances to clusters according to the distance between instances and cluster centers. For correlation clustering, minimizing the

Table 2: Procedure for assigning instances to clusters

1. For each cluster i with CCA model $M_i = \{(u^i_j, v^i_j), r^i_j, (\vec{a}^i_j, \vec{b}^i_j) : j = 1 \cdots d\}$, construct d linear regression models $\hat{v}^i_j = \beta^i_j u^i_j + \alpha^i_j$, $j = 1 \cdots d$, one for each pair of canonical variates.
2. Given an instance $(\vec{x}, \vec{y})$, for each cluster i, compute the instance's canonical variates under $M_i$ as $u_j = (\vec{a}^i_j)^T \vec{x}$ and $v_j = (\vec{b}^i_j)^T \vec{y}$, $j = 1 \cdots d$; calculate $\hat{v}_j = \beta^i_j u_j + \alpha^i_j$, $j = 1 \cdots d$; and compute the weighted error
   $$err_i = \sum_{j=1}^{d} \frac{r^i_j}{r^i_1} (v_j - \hat{v}_j)^2,$$
   where $r^i_j / r^i_1$ is the weight for the jth prediction error.
3. Assign instance $(\vec{x}, \vec{y})$ to the cluster minimizing $err_i$.

distance between instances and their cluster centers is no longer our goal. Instead, instance reassignment is performed based on the intuition described below.

Intuition 2: If CCA detects a strong correlation pattern in a cluster, i.e., the canonical variates u and v are highly correlated, we expect to be able to predict the value of v from u (or vice versa) using a linear regression model.

This is demonstrated in Figure 1, where we plot a pair of canonical variates with correlation 0.9. Shown as a solid line is the linear regression model constructed to predict one variate from the other. Intuition 2 suggests that, for each cluster, we can compute its most significant pair of canonical variates $(u_1, v_1)$ and construct a linear regression model to predict $v_1$ from $u_1$. To assign an instance to its proper cluster, we can simply select the cluster whose regression model best predicts the instance's variate $v_1$ from its variate $u_1$. In some cases, we are interested in the first few pairs of canonical variates rather than only the first pair. It is thus intuitive to construct one linear regression model for each pair, and assign instances to clusters based on the combined prediction error. Note that because the correlation $r_i$ between the variates $u_i, v_i$ decreases as i increases, we set the weight for the ith error to be $r_i / r_1$. In this manner, the weight for the prediction error between $u_1$ and $v_1$ is always one, whereas the weights for the ensuing ones will be smaller depending on the strength

Figure 1: Scatter plot of a pair of canonical variates (r = 0.9) and the linear regression model constructed to predict one variate from the other.

of the correlations. This ensures that more focus is put on the canonical variates that are more strongly correlated. In Table 2, we describe the exact procedure for reassigning instances to clusters.

Tables 1 and 2 complete the description of our correlation clustering algorithm. To apply this algorithm, the user needs to specify d, the number of pairs of canonical variates that are used in computing the prediction errors and reassigning the instances. Based on our empirical observations with both artificial and real-world data sets, we recommend that d be set to the same as, or slightly larger than, the total number of variates that bear interest in the application. In our application, our domain expert is interested in only the top two or three pairs of canonical variates; consequently we used d = 4 as the default choice for our experiments.

The proposed correlation clustering algorithm is a greedy iterative algorithm. We want to point out that it is not guaranteed to converge. Specifically, after reassigning instances to clusters at each iteration, there is no guarantee that the resulting new clusters will have more strongly correlated variates. In our experiments, we did observe fluctuations in the objective function, i.e., the weighted prediction error. But such fluctuations typically occur only after an initial period in which the error computed by the objective function quickly decreases. Moreover, after this rapid initial convergence, the ensuing fluctuations are relatively small. Thus we recommend that one specify a maximum number of iterations; in our experiments we set this to 200 iterations.

Table 3: An artificial data set and results

|       | D1 (true) | D2 (true) | Global CCA | Mixture of CCA: clust. 1 | Mixture of CCA: clust. 2 |
|-------|-----------|-----------|------------|--------------------------|--------------------------|
| $r_1$ | 0.85      | 0.9       | 0.521      | 0.856 (.001)             | 0.904 (.001)             |
| $r_2$ | 0.6       | 0.7       | 0.462      | 0.619 (.001)             | 0.685 (.004)             |
| $r_3$ | 0.3       | 0.4       | 0.302      | 0.346 (.003)             | 0.436 (.003)             |

3.2 Experiments on Artificial Data Sets

To examine the efficacy of the proposed correlation clustering algorithm, we apply it to artificially generated data sets that have pre-specified nonlinear correlation structures. We generate such data by first separately generating multiple component data sets, each with a different linear correlation structure, and then mixing these component data sets together to form a composite data set. Obviously the resulting data set's correlation structure is no longer globally linear. However, a properly constructed mixture of local linear models should be able to separate the data set into the original component data sets and recover the correlation patterns in each part. Therefore, we are interested in (1) testing whether our correlation clustering algorithm can find the correct partition of the data, (2) testing whether it can recover the original correlation patterns represented as the canonical variates, and (3) comparing its results to the results of global CCA on the composite data set.

In Table 3, we present the results of our correlation clustering algorithm and traditional CCA on a composite data set formed by two component data sets, each of which contains 1000 instances. We generate each component data set as follows.⁴ Given the desired correlation values $r_1$, $r_2$, and $r_3$, we first create a multivariate Gaussian distribution with six random variables $u_1, u_2, u_3, v_1, v_2, v_3$, where $u_i$ and $v_i$ are intended to be the ith pair of canonical variates. We set the covariance matrix to be:

$$\Sigma = \begin{pmatrix}
1 & 0 & 0 & r_1 & 0 & 0 \\
0 & 1 & 0 & 0 & r_2 & 0 \\
0 & 0 & 1 & 0 & 0 & r_3 \\
r_1 & 0 & 0 & 1 & 0 & 0 \\
0 & r_2 & 0 & 0 & 1 & 0 \\
0 & 0 & r_3 & 0 & 0 & 1
\end{pmatrix}$$

This ensures that $corr(u_j, v_j) = r_j$ for $j = 1, 2, 3$ and $corr(u_i, u_j) = corr(v_i, v_j) = corr(u_i, v_j) = 0$ for $i \neq j$. We then randomly sample 1000 points from this joint Gaussian distribution and form the final vector $\vec{x}$ using linear combinations of the $u_j$'s and the vector $\vec{y}$ using linear combinations of the $v_j$'s.

⁴The Matlab code for generating a component data set is available at http://www.ecn.purdue.edu/∼xz
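The generation procedure above can be sketched as follows. The mixing matrices `Wx` and `Wy` are arbitrary random (almost surely invertible) linear maps of our own choosing, since the paper does not specify the exact combinations used; canonical correlations are invariant under such invertible mixings:

```python
import numpy as np

def make_component(n, rs, seed=0):
    """Sample one component data set with prescribed canonical
    correlations rs = (r1, ..., rd), following Section 3.2."""
    rng = np.random.default_rng(seed)
    d = len(rs)
    # covariance of (u1..ud, v1..vd): unit variances, corr(u_j, v_j) = r_j,
    # all other pairs uncorrelated
    C = np.eye(2 * d)
    for j, r in enumerate(rs):
        C[j, d + j] = C[d + j, j] = r
    Z = rng.multivariate_normal(np.zeros(2 * d), C, size=n)
    U, V = Z[:, :d], Z[:, d:]
    # mix the variates into observed features via random invertible maps
    Wx = rng.standard_normal((d, d))
    Wy = rng.standard_normal((d, d))
    return U @ Wx, V @ Wy
```

Running CCA on a large sample generated this way should recover the prescribed correlations up to sampling error.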

Columns 2 and 3 of Table 3 specify the correlations between the first three pairs of canonical variates of each of the constructed data sets, D1 and D2. These are the values that were used to generate the data. We applied traditional CCA to the composite data set (D1 and D2 combined) and report the top three detected canonical correlations in Column 4. We see from the results that, as expected, global CCA is unable to extract the true correlation structure from the data.

The last two columns of Table 3 show the results of applying the proposed correlation clustering algorithm to the composite data set with k = 2 and d = 4. The results, shown in Columns 5 and 6, are the averages over ten runs with different random initializations (the standard deviations are shown in parentheses). We observe that the detected canonical correlations are similar to the true values. In Figure 2, we plot the canonical variates extracted by our algorithm (y axis) versus the true canonical variates (x axis); the plots of the first two pairs of variates are shown. We observe that the first pair of variates extracted by our algorithm is very similar to the original variates. This can be seen by noticing that for both $u_1$ and $v_1$ most points lie on or close to the line of unit slope (shown as a red line). For the second pair, we see more deviation from the red line. This is possibly because our algorithm puts less focus on the second pair of variates during clustering. Finally, we observe that the clusters formed by our algorithm correspond nicely to the original component data sets. On average, only 2.5% of the 2000 instances were assigned to the wrong cluster.

These results show that our correlation clustering algorithm can discover local linear correlation patterns given prior knowledge of k, the true number of clusters in the data. Our algorithm performs consistently well on artificially constructed data sets. This is in part due to the fact that these data sets are highly simplified examples of nonlinearly correlated data. In real applications, the nonlinear correlation structure is often more complex. Indeed, when applied to our Earth science data sets, we observe greater instability of our algorithm: different initializations lead to different clustering solutions. We conjecture that this is because our clustering algorithm is a k-means style greedy algorithm and has a large number of locally optimal solutions.

4 Cluster Ensembles for Correlation Clustering

In this section we address a problem in the practical application of the proposed correlation clustering algorithm: identification of the number of clusters in the data. A complicating factor is that, because we are dealing with a k-means style greedy algorithm, there may be many locally optimal solutions; in particular, different initializations may lead to different clusters. Below we show how to apply cluster ensemble techniques to address these issues.

[Figure 2 here: four scatter plots of the extracted canonical variates versus the original canonical variates, with axes from −4 to 4; panel (a) shows the first pair ($u_1$, $v_1$), panel (b) the second pair ($u_2$, $v_2$).]

Figure 2: Comparing the first two pairs of canonical variates extracted by our mixture of CCA algorithm and the original canonical variates.

The concept of cluster ensembles has recently seen increasing popularity in the clustering community [11, 2, 10, 3], in part because it can be applied to any type of clustering as a generic tool for boosting clustering performance. The basic idea is to generate an ensemble of different clustering solutions, each capturing some structure of the data. The anticipated result is that by combining the ensemble of clustering solutions, a better final clustering solution can be obtained. Cluster ensembles have been successfully applied to determine the number of clusters [10] and to improve clustering performance for traditional clustering tasks [11, 2, 3]. Although our clustering task is significantly different from traditional clustering in terms of its goal, we believe similar benefits can be achieved by using cluster ensembles.

To generate a cluster ensemble, we run our correlation clustering algorithm on a given data set with k = 2 for r times, each run starting from a different initial assignment, where r is the size of the ensemble. We then combine these different clustering solutions into an n × n matrix S, which describes for each pair of instances the frequency with which they are clustered together (n is the total number of instances in the data set). As defined, each element of S is a number between 0 and 1. We refer to S as a similarity matrix because S(i, j) can be considered as the similarity (correlation similarity instead of the conventional similarity) between instances i and j.
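Constructing S from an ensemble of label vectors is straightforward; a minimal sketch (the function name is ours):

```python
import numpy as np

def coassociation_matrix(label_runs):
    """S[i, j] = fraction of ensemble runs in which instances i and j
    are assigned to the same cluster, so every entry lies in [0, 1]."""
    L = np.asarray(label_runs)        # shape (r, n): r runs, n instances
    r, n = L.shape
    S = np.zeros((n, n))
    for labels in L:
        S += labels[:, None] == labels[None, :]
    return S / r
```

Note that the diagonal of S is always 1, since every instance is trivially clustered with itself in every run.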

After the similarity matrix is constructed, we can visualize it using a technique introduced by [10] to help determine how many clusters there are in the data. This visualization technique has two steps. First, it orders the instances such that instances that are similar to each other are arranged next to each other. It then maps the 0-1 range of the similarity values to a gray scale such that 0 corresponds to white and 1 corresponds to black. The similarity matrix is then displayed as an image, in which darker areas indicate strong similarity and lighter areas indicate little to no similarity. For example, if all clustering solutions in the ensemble agree with one another perfectly, the similarity matrix S will have similarity value 1 for pairs of instances from the same cluster and similarity value 0 for those from different clusters. Because the instances are ordered such that similar instances are arranged next to each other, the visualization will produce black squares along the diagonal of the image. For a detailed description of the visualization technique, please refer to [10].
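One simple way to reproduce the ordering step is spectral seriation on the ensemble similarity matrix. This is an illustrative stand-in for the exact ordering technique of [10], which we do not reproduce here:

```python
import numpy as np

def seriation_order(S):
    """Order instances so that similar ones end up adjacent, using the
    Fiedler vector of the graph Laplacian of S (a common seriation
    heuristic, not the exact method of [10])."""
    Lap = np.diag(S.sum(axis=1)) - S
    w, V = np.linalg.eigh(Lap)
    return np.argsort(V[:, 1])        # sort by the Fiedler vector

def display_image(S):
    """Reorder S and map similarities to gray-scale intensity
    (similarity 0 -> white = 1.0, similarity 1 -> black = 0.0)."""
    o = seriation_order(S)
    return 1.0 - S[np.ix_(o, o)]
```

On a well-separated ensemble, the reordered image shows one dark square per cluster along the diagonal, as described above.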

To demonstrate the effect of cluster ensembles on

our correlation clustering, we generate three artificial

data sets using the same procedure as described in

Section 3.2. These three data sets contain one, two, and

three correlation clusters respectively. We apply our
correlation clustering algorithm 20 times with different
initializations and construct a similarity matrix for each
data set. In Figure 3 we show the images of the resulting
similarity matrices for these three data sets and make
the following observations.

• For the one-cluster data set, shown in Figure 3 (a),

the produced similarity matrix does not show any

clear clustering pattern. This is because our cor-

relation clustering algorithm splits the data ran-

domly in each run—by combining the random runs

through the similarity matrix, we can easily reach

the conclusion that the given data set contains only

one correlation cluster.

• For the two-cluster data set, shown in Figure 3 (b),
we first see two dark squares along the diagonal,
indicating there are two correlation clusters in the
data. This shows that, as we expect, the similarity
matrix constructed via cluster ensembles reveals

information about the true number of clusters in

the data.

In addition to the two dark diagonal squares, we

also see small gray areas in the image, indicating

that some of the clustering solutions in the ensem-

ble disagree with each other on some instances.

This is because different initializations sometimes

lead to different local optimal solutions. Further,

we argue that these different solutions sometimes

make different mistakes—combining them can po-

tentially correct some of the mistakes and produce

a better solution.5 Indeed, our experiments show

that, for this particular two-cluster data set, ap-

plying the average-link agglomerative clustering to

the resulting similarity matrix reduces the cluster-

ing error rate from 2.0% (the average error rate of

the 20 clustering runs) to 1.1%. In this case, clus-

ter ensembles corrected for the local optima prob-

lem of our correlation clustering algorithm. Cluster

ensembles have been shown to boost the clustering

performance for traditional clustering tasks; here

we confirm that correlation clustering can also ben-

efit from cluster ensembles.

• For the last data set, shown in Figure 3 (c), we see

three dark squares along the diagonal, indicating

that there are three correlation clusters in the data.

Compared to the two-cluster case, we see signifi-

cantly larger areas of gray. In this case, our corre-

lation clustering algorithm was asked to partition

the data into two parts although the data actually

contains three clusters. Therefore, it is not sur-

prising that many of the clustering solutions do not
agree with each other, because they may split or

merge clusters in many different ways when differ-

ent initializations are used, resulting in much larger

chance for disagreement. However, this does not

stop us from finding the correct number of clusters

from the similarity matrix. Indeed, by combining

multiple solutions, these random splits and merges

tend to cancel each other out and the true structure

of the data emerges.

With the help of the similarity matrix, we now

know there are three clusters in the last data set. We

then constructed another cluster ensemble for this data

set, but this time we set k=3 for each clustering run.

The resulting similarity matrix S′ is shown in Figure 3

(d). In this case, the average error rate achieved by the

individual clustering solutions in the ensemble is 7.5%

and the average-link agglomerative clustering algorithm

applied to S′ reduces the error rate to 6.8%.
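A minimal sketch of this consensus step, assuming SciPy's standard agglomerative-clustering routines and treating 1 - S(i,j) as the distance between instances i and j (the function name is ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_partition(S, k):
    """Derive a final k-way partition from the ensemble similarity
    matrix S by average-link agglomerative clustering on the
    distances 1 - S(i, j)."""
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)  # squareform expects a zero diagonal
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=k, criterion='maxclust')

# ensemble that mostly agrees on {0, 1, 2} vs {3, 4}
S = np.array([[1.0, 0.9, 0.8, 0.1, 0.0],
              [0.9, 1.0, 0.9, 0.0, 0.1],
              [0.8, 0.9, 1.0, 0.1, 0.0],
              [0.1, 0.0, 0.1, 1.0, 0.9],
              [0.0, 0.1, 0.0, 0.9, 1.0]])
labels = consensus_partition(S, k=2)
```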

To conclude, cluster ensembles help to achieve two

goals. First, they provide information about the true

structure of the data. Second, they help improve

clustering performance of our correlation clustering

algorithm.

5 It should be noted that if the different solutions make the

same mistakes, these mistakes will not be corrected by using

cluster ensembles.


Figure 3: Visualization of similarity matrices: (a) S for the one-cluster data set; (b) S for the two-cluster data
set; (c) S for the three-cluster data set; and (d) S′ for the three-cluster data set.

5 Experiments on Earth Science Data Sets

We have demonstrated on artificial data sets that our

correlation clustering algorithm is capable of finding locally linear

correlation patterns in the data. In this section, we ap-

ply our techniques to Earth science data sets. The task

is to investigate the relationship between the variability

in precipitation and the dynamics of vegetation. Below,

we briefly introduce the data sets and then compare our

technique to traditional CCA.

In this study, the standardized precipitation index

(SPI) is used to describe the precipitation domain and

the normalized difference vegetation index (NDVI) is

used to describe the vegetation domain [9]. The data for

both domains are collected and aligned at monthly time

intervals from July 1981 to October 2000 (232 months).

Our analysis is performed at the continental level for the

continents of North America, South America, Australia

and Africa. For each of these continents, we form a

data set whose instances correspond to time points. For

a particular continent, the feature vector x records the
SPI value at each grid location of that continent; thus
the dimension of x equals the number of grid locations
of that continent. Similarly, y records the NDVI values.
Note that the dimensions of x and y are not equal

because different grid resolutions are used to collect the

data. The effect of applying our technique to the data

is to cluster the data points in time. This is motivated

by the hypothesis that during different time periods the

relationship between vegetation and precipitation may

vary.

Figure 4: The results of conventional CCA and Mixture of CCA (MCCA) for Africa: (a) conventional CCA;
(b) cluster 1 of MCCA; (c) cluster 2 of MCCA. Top panel shows the NDVI and SPI canonical variates (time
series); middle and bottom panels show the NDVI and SPI maps.

For our application, a standard way to visualize
CCA results is to use colored maps. In particular, to
analyze a pair of canonical variates, which are in this

case a pair of correlated time series, one for SPI and

one for NDVI. We produce one map for SPI and one

map for NDVI. For example, to produce a map for

SPI, we take the correlation between the time series

of the SPI canonical variate and the SPI time series

of each grid point, generating a value between −1

(negative correlation) and 1 (positive correlation) for

each grid point. We then display these values on the

map via color coding. Areas of red (blue) color are

positively (negatively) correlated with the SPI canonical

variate. Considered together, the NDVI map and SPI

map identify regions where SPI correlates with NDVI.
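The map construction described above amounts to a per-grid-point Pearson correlation with the canonical-variate time series; a minimal sketch (the function name and the T x p array layout are our own assumptions):

```python
import numpy as np

def correlation_map(canonical_variate, grid_series):
    """Correlate the canonical-variate time series (length T) with
    the time series of every grid point (T x p matrix), yielding a
    value in [-1, 1] per grid point for color-coded display."""
    v = canonical_variate - canonical_variate.mean()
    G = grid_series - grid_series.mean(axis=0)
    num = G.T @ v
    den = np.sqrt((G ** 2).sum(axis=0) * (v ** 2).sum())
    return num / den

# toy check: a grid point equal to the variate correlates positively,
# its negation negatively
rng = np.random.default_rng(0)
v = rng.standard_normal(100)
grid = np.column_stack([v, -v, rng.standard_normal(100)])
r = correlation_map(v, grid)
```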

Since our technique produces local CCA models, we

can visualize each cluster using the same technique.

Note that an exact geophysical interpretation of the

produced maps is beyond the scope of this paper, as it
would require familiarity with geoscience terminology
and concepts from our audience. Instead, we will

present the maps produced by traditional CCA and the

maps produced by our technique, as well as plots of

the time series of the SPI and NDVI canonical variates.

Finally, a high level interpretation of the results is

provided by our domain expert. For brevity, the rest

of our discussion will focus on the continent of Africa,

which is a representative example where our method

finds patterns of interest that were not discovered by

traditional CCA.

We apply our technique to the data set of Africa

by setting k=2 and constructing a cluster ensemble of
size 200.6 The final two clusters were obtained using
the average-link agglomerative algorithm applied to the
similarity matrix.

Figure 4 (a) shows the maps and the NDVI and SPI

time series generated by traditional CCA. Figures 4 (b)

and (c) show the maps and the time series for each of

the two clusters. Note that each of the maps is asso-

ciated with the first pair of canonical variates for that

dataset/cluster. Inspection of the time series and the

spatial patterns that are associated with the canonical

variates for each cluster demonstrates that the mixture

of CCA approach provides information that is clearly

different from results produced by conventional CCA.

For Africa, the interannual dynamics in precipitation
are strongly influenced by a complex set of processes
that depend on El Niño and La Niña, and on the re-
sulting sea surface temperature regimes in the Indian Ocean
and the southern Atlantic Ocean off the coast of west Africa.

Although exact interpretation of these results requires

more study, the maps of Figures 4 (b) and (c) show that

the proposed approach was able to isolate important

quasi-independent modes of precipitation-vegetation co-

variability that linear methods are unable to identify.

As shown in [9], conventional CCA is effective in iso-

lating precipitation and vegetation anomalies in eastern

Africa associated with El Niño, but less successful in

isolating similar patterns in the Sahelian region of west-

ern Africa. In contrast, Figures 4 (b) and (c) show that

the mixture of CCA technique isolates the pattern in

eastern Africa, and additionally identifies a mode of co-

variability in the Sahel that is probably related to ocean-

atmosphere dynamics in the southern Atlantic Ocean.

6 Conclusions and Future Work

This paper presented a method for constructing mix-

tures of local CCA models in an attempt to address the

limitations of the conventional CCA approach. We de-

veloped a correlation clustering algorithm, which parti-

tions a given data set according to the correlation be-

tween two sets of features. We further demonstrated

that cluster ensembles can be used to identify the num-

ber of clusters in the data and ameliorate the local op-

tima problem of the proposed clustering algorithm. We

applied our technique to Earth science data sets. In

comparison to traditional CCA, our technique led to in-

teresting and encouraging new discoveries in the data.

As an ongoing effort, we will work closely with our
domain expert to verify our findings in the data from
a geoscience viewpoint. For future work, we would also
like to apply our technique to more artificial and real-
world data sets that have complex nonlinear correlation
structure. Finally, we are developing a probabilistic
approach to learning mixtures of CCA models.

6 We use large ensemble sizes for the Earth science data sets
because they contain a small number of instances, which makes
large ensembles computationally feasible; larger ensemble sizes also
ensure that the clusters we found in the data are not obtained by chance.

References

[1] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56:89–113, 2004.
[2] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[3] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 281–288, 2004.
[4] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
[5] W. Hsieh. Nonlinear canonical correlation analysis by neural networks. Neural Networks, 13:1095–1105, 2000.
[6] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, 1992.
[7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.
[8] P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5):365–377, 2000.
[9] A. Lotsch and M. Friedl. Coupled vegetation-precipitation variability observed from satellite and climate record. Geophysical Research Letters, in submission.
[10] S. Monti, P. Tamayo, J. Mesirov, and T. Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52:91–118, 2003.
[11] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.
[12] M. Tipping and C. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11, 1999.
[13] E. Zorita, V. Kharin, and H. von Storch. The atmospheric circulation and sea surface temperature in the North Atlantic area in winter: Their interaction and relevance for Iberian precipitation. Journal of Climate, 5:1097–1108, 1992.
