# Correlation Clustering for Learning Mixtures of Canonical Correlation Models


X. Z. Fern∗, C. E. Brodley†, M. A. Friedl‡

∗School of Elec. and Comp. Eng., Purdue University, West Lafayette, IN 47907, USA
†Dept. of Comp. Sci., Tufts University, Medford, MA 02155, USA
‡Dept. of Geography, Boston University, Boston, MA, USA

Abstract

This paper addresses the task of analyzing the correlation between two related domains $X$ and $Y$. Our research is motivated by an Earth Science task that studies the relationship between vegetation and precipitation. A standard statistical technique for such problems is Canonical Correlation Analysis (CCA). A critical limitation of CCA is that it can only detect linear correlation between the two domains that is globally valid throughout both data sets. Our approach addresses this limitation by constructing a mixture of local linear CCA models through a process we name correlation clustering. In correlation clustering, both data sets are clustered simultaneously according to the data's correlation structure such that, within a cluster, domain $X$ and domain $Y$ are linearly correlated in the same way. Each cluster is then analyzed using traditional CCA to construct local linear correlation models. We present results on both artificial data sets and Earth Science data sets to demonstrate that the proposed approach can detect useful correlation patterns, which traditional CCA fails to discover.

1 Introduction

In Earth science applications, researchers are often interested in studying the correlation structure between two domains in order to understand the nature of the relationship between them. The inputs to our correlation analysis task can be considered as two data sets $X$ and $Y$ whose instances are described by feature vectors $\vec{x}$ and $\vec{y}$ respectively. The dimension of $\vec{x}$ and that of $\vec{y}$ do not need to be the same, although there must be a one-to-one mapping between instances of $X$ and instances of $Y$. Thus, it is often more convenient to consider these two data sets as one compound data set whose instances are described by two feature vectors $\vec{x}$ and $\vec{y}$. Indeed, throughout the remainder of this paper, we will refer to the input of our task as one data set, and the goal is to study how the two sets of features are correlated with each other.

Canonical Correlation Analysis (CCA) [4, 6] is a multivariate statistical technique commonly used to identify and quantify the correlation between two sets of random variables. Given a compound data set described by feature vectors $\vec{x}$ and $\vec{y}$, CCA seeks to find a linear transformation of $\vec{x}$ and a linear transformation of $\vec{y}$ such that the resulting two new variables are maximally correlated.

In Earth science research, CCA has often been applied to examine whether there is a cause-and-effect relationship between two domains or to predict the behavior of one domain based on another. For example, in [13] CCA was used to analyze the relationship between the monthly mean sea-level pressure (SLP) and sea-surface temperature (SST) over the North Atlantic in the months of December, January and February. This analysis confirmed the hypothesis that atmospheric SLP anomalies cause SST anomalies.

Because CCA is based on linear transformations, the scope of its applications is necessarily limited. One way to tackle this limitation is to use nonlinear canonical correlation analysis (NLCCA) [5, 8]. NLCCA applies nonlinear functions to the original variables in order to extract correlated components from the two sets of variables. Although promising results have been achieved by NLCCA in some Earth science applications, such techniques tend to be difficult to apply because of the complexity of the model and the lack of robustness due to overfitting [5].

In this paper we propose to use a mixture of local linear correlation models to capture the correlation structure between two sets of random variables (features). Mixtures of local linear models not only provide an alternative solution to capturing nonlinear correlations, but also have the potential to detect correlation patterns that are significant only in a part (a local region) of the data. The philosophy of using multiple local linear models to model global nonlinearity has been successfully applied to other statistical approaches with similar linearity limitations, such as principal component analysis [12] and linear regression [7]. Our approach uses a two-step procedure. Given a compound data set, we propose to first solve a clustering problem that partitions the data set into clusters such that each cluster contains instances whose $\vec{x}$ features and $\vec{y}$ features are linearly correlated. We then independently apply CCA to each cluster to form a mixture of correlation models that are locally linear.

In designing this two-step process, we need to address the following two critical questions.

1. Assume we are informed a priori that we can model the correlation structure using $k$ local linear CCA models. How should we cluster the data in the context of correlation analysis?

2. In real-world applications, we are rarely equipped with knowledge of $k$. How can we decide how many clusters there are in the data, or whether a global linear structure will suffice?

Note that the goal of clustering in the context of correlation analysis is different from traditional clustering. In traditional clustering, the goal is to group together instances that are similar (as measured by a certain distance or similarity metric). In contrast, here we need to group instances based on how their $\vec{x}$ features and $\vec{y}$ features correlate with each other, i.e., instances that share a similar correlation structure between the two sets of features should be clustered together. To differentiate this clustering task from traditional clustering, we name it correlation clustering¹ and, in Section 3, we propose an iterative greedy k-means style algorithm for this task.

To address the second question, we apply the technique of cluster ensembles [2] to our correlation clustering algorithm, which provides a user with a visualization of the results that can be used to determine the proper number of clusters in the data. Note that our correlation clustering algorithm is a k-means style algorithm and as such may have many locally optimal solutions: different initializations may lead to significantly different clustering results. By using cluster ensembles, we can also address the local optima problem of our clustering algorithm and find a stable clustering solution.

To demonstrate the efficacy of our approach, we apply it to both artificial data sets and real-world Earth science data sets. Our results on the artificial data sets show that (1) the proposed correlation clustering algorithm is capable of finding a good partition of the data when the correct $k$ is used, and (2) cluster ensembles provide an effective tool for finding $k$. When applied to the Earth science data sets, our technique detected significantly different correlation patterns in comparison to what was found via traditional CCA. These results led our domain expert to highly interesting hypotheses that merit further investigation.

¹ Note that the term correlation clustering has also been used by [1] as the name of a technique for traditional clustering.

The remainder of the paper is arranged as follows. In Section 2, we review the basics of CCA. Section 3 introduces the intuitions behind our correlation clustering algorithm and formally describes the algorithm, which is then applied to artificially constructed data sets to demonstrate its efficacy in finding correlation clusters in the data. Section 4 demonstrates how cluster ensemble techniques can be used to determine the number of clusters in the data and to address the local optima problem of the k-means style correlation clustering algorithm. Section 5 explains our motivating application, presents results, and describes how our domain expert interprets the results. Finally, in Section 6 we conclude the paper and discuss future directions.

2 Basics of CCA

Given a data set whose instances are described by two feature vectors $\vec{x}$ and $\vec{y}$, the goal of CCA is to find linear transformations of $\vec{x}$ and linear transformations of $\vec{y}$ such that the resulting new variables are maximally correlated.

In particular, CCA constructs a sequence of pairs of strongly correlated variables $(u_1, v_1), (u_2, v_2), \cdots, (u_d, v_d)$ through linear transformations, where $d$ is the minimum of the dimensions of $\vec{x}$ and $\vec{y}$. These new variables, the $u_i$'s and $v_i$'s, are named canonical variates (sometimes referred to as canonical factors). They are similar in spirit to principal components: whereas principal components are linear combinations of the original variables that capture the most variance in the data, canonical variates are linear combinations of the original variables that capture the most correlation between the two sets of variables.

To construct these canonical variates, CCA first seeks to transform $\vec{x}$ and $\vec{y}$ into a pair of new variables $u_1$ and $v_1$ by the linear transformations

$$u_1 = \vec{a}_1^{\,T}\vec{x}, \quad \text{and} \quad v_1 = \vec{b}_1^{\,T}\vec{y}$$

where the transformation vectors $\vec{a}_1$ and $\vec{b}_1$ are defined such that $\mathrm{corr}(u_1, v_1)$ is maximized, subject to the constraint that both $u_1$ and $v_1$ have unit variance.² Once $\vec{a}_1, \vec{b}_1; \cdots; \vec{a}_i, \vec{b}_i$ are determined, we then find the next pair of transformations $\vec{a}_{i+1}$ and $\vec{b}_{i+1}$ such that the correlation between $\vec{a}_{i+1}^{\,T}\vec{x}$ and $\vec{b}_{i+1}^{\,T}\vec{y}$ is maximized, with the constraint that the resulting $u_{i+1}$ and $v_{i+1}$ are uncorrelated with all previous canonical variates.³ Note that the correlation between $u_i$ and $v_i$ becomes weaker as $i$ increases. Letting $r_i$ represent the correlation between the $i$th pair of canonical variates, we have $r_i \geq r_{i+1}$.

² This constraint ensures unique solutions.

³ This constraint ensures that the extracted canonical variates contain no redundant information.

It can be shown that to find the projection vectors for the canonical variates, we only need to find the eigenvectors of the following matrices:

$$M_x = \Sigma_{xx}^{-1}\,\Sigma_{xy}\,\Sigma_{yy}^{-1}\,\Sigma_{yx}$$

and

$$M_y = \Sigma_{yy}^{-1}\,\Sigma_{yx}\,\Sigma_{xx}^{-1}\,\Sigma_{xy}$$

The eigenvectors of $M_x$, ordered according to decreasing eigenvalues, are the transformation vectors $\vec{a}_1, \vec{a}_2, \cdots, \vec{a}_d$, and the eigenvectors of $M_y$ are $\vec{b}_1, \vec{b}_2, \cdots, \vec{b}_d$. In addition, the eigenvalues of these two matrices are identical, and the square root of the $i$th eigenvalue satisfies $\sqrt{\lambda_i} = r_i$, i.e., it equals the correlation between the $i$th pair of canonical variates $u_i$ and $v_i$. Note that in most applications, only the first few most significant pairs of canonical variates are of real interest. Assuming that we are interested in the first $d$ pairs of variates, we can represent all the useful information of the linear correlation structure as a model $M$, defined as

$$M = \{(u_j, v_j),\ r_j,\ (\vec{a}_j, \vec{b}_j) : j = 1 \cdots d\}$$

where $(u_j, v_j)$ represents the $j$th pair of canonical variates, $r_j$ is the correlation between them, and $(\vec{a}_j, \vec{b}_j)$ represents the projection vectors used to generate them. We refer to $M$ as a CCA model.
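
As an illustration, the eigendecomposition above translates directly into a few lines of NumPy. The following is a minimal sketch under the assumption that the covariance matrices are nonsingular, not the authors' implementation; the function name `cca_model` and the use of `np.linalg.solve` in place of explicit inverses are our own choices.

```python
import numpy as np

def cca_model(X, Y, d):
    """Top-d CCA model via the eigendecomposition of M_x described above.

    X: (n, p) array, Y: (n, q) array; rows are instances.
    Returns projection vectors A (p, d), B (q, d) and correlations r (d,).
    """
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    # M_x = Sigma_xx^{-1} Sigma_xy Sigma_yy^{-1} Sigma_yx (solve avoids explicit inverses)
    Mx = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    vals, vecs = np.linalg.eig(Mx)
    order = np.argsort(-vals.real)[:d]
    r = np.sqrt(np.clip(vals.real[order], 0.0, 1.0))  # sqrt(lambda_i) = r_i
    A = vecs.real[:, order]
    B = np.linalg.solve(Syy, Sxy.T) @ A               # b_j proportional to Syy^{-1} Syx a_j
    # rescale so the variates u_j = a_j^T x, v_j = b_j^T y have unit variance
    A = A / np.sqrt(np.sum(A * (Sxx @ A), axis=0))
    B = B / np.sqrt(np.sum(B * (Syy @ B), axis=0))
    return A, B, r
```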

Once a CCA model is constructed, the next step is for the domain experts to examine the variates as well as the transformation vectors in order to understand the relationship between the two domains. This can be done in different ways depending on the application. In our motivating Earth science task, the results of CCA can be visualized as colored maps and interpreted by Earth scientists. We explain this process in Section 5.

3 Correlation Clustering

In this section, we first explain the basic intuitions that led to our algorithm and formally present our k-means style correlation clustering algorithm. We then apply the proposed algorithm to artificially constructed data sets and analyze the results.

3.1 Algorithm Description Given a data set described by two sets of features $\vec{x}$ and $\vec{y}$, and the prior knowledge that the correlation structure of the data can be modeled by $k$ local linear models, the goal of correlation clustering is to partition the data into $k$ clusters such that, for instances in the same cluster, the features of $\vec{x}$ and $\vec{y}$ are linearly correlated in the same way. The critical question is how we should cluster the data to reach this goal. Our answer is based on the following important intuitions.

Table 1: A correlation clustering algorithm

Input: a data set of $n$ instances, each described by two random vectors $\vec{x}$ and $\vec{y}$; $k$, the desired number of clusters
Output: $k$ clusters and $k$ linear CCA models, one for each cluster
Algorithm:

1. Randomly assign instances to the $k$ clusters.
2. For $i = 1 \cdots k$, apply CCA to cluster $i$ to build $M_i = \{(u_j, v_j), r_j, (\vec{a}_j, \vec{b}_j) : j = 1 \cdots d\}$, i.e., the top $d$ pairs of canonical variates, the correlation $r_j$ between each pair, and the corresponding $d$ pairs of projection vectors.
3. Reassign each instance to a cluster based on its $\vec{x}$ and $\vec{y}$ features and the $k$ CCA models.
4. If no assignment has changed from the previous iteration, return the current clusters and CCA models. Otherwise, go to step 2.

Intuition 1: If a given set of instances contains multiple correlation structures, applying CCA to this instance set will not detect a strong linear correlation.

This is because when we put instances that have different correlation structures together, the original correlation patterns are weakened, because they are now only valid in part of the data. Conversely, if CCA detects strong correlation in a cluster, it is likely that the instances in the cluster share the same correlation structure. This suggests that we can use the strength of the correlation between the canonical variates extracted by CCA to measure the quality of a cluster. Note that it is computationally intractable to evaluate all possible clustering solutions in order to select the optimal one. This motivates us to examine a k-means style algorithm. Starting from a random clustering solution, in each iteration we build a CCA model for each cluster and then reassign each instance to its most appropriate cluster according to its $\vec{x}$ and $\vec{y}$ features and the CCA models. In Table 1, we describe the basic steps of such a generic correlation clustering procedure.

The remaining question is how to assign instances to their clusters. Note that in traditional k-means clustering, each iteration reassigns instances to clusters according to the distance between instances and cluster centers. For correlation clustering, minimizing the distance between instances and their cluster centers is no longer our goal. Instead, our instance reassignment is performed based on the intuition described below.

Intuition 2: If CCA detects a strong correlation pattern in a cluster, i.e., the canonical variates $u$ and $v$ are highly correlated, we expect to be able to predict the value of $v$ from $u$ (or vice versa) using a linear regression model.

This is demonstrated in Figure 1, where we plot a pair of canonical variates with correlation 0.9. Shown as a solid line is the linear regression model constructed to predict one variate from the other. Intuition 2 suggests that, for each cluster, we can compute its most significant pair of canonical variates $(u_1, v_1)$ and construct a linear regression model to predict $v_1$ from $u_1$. To assign an instance to its proper cluster, we can simply select the cluster whose regression model best predicts the instance's variate $v_1$ from its variate $u_1$. In some cases, we are interested in the first few pairs of canonical variates rather than only the first pair. It is thus intuitive to construct one linear regression model for each pair, and assign instances to clusters based on the combined prediction error. Note that because the correlation $r_i$ between variates $v_i, u_i$ decreases as $i$ increases, we set the weight for the $i$th error to be $r_i / r_1$. In this manner, the weight for the prediction error between $u_1$ and $v_1$ is always one, whereas the weights for the ensuing ones will be smaller depending on the strength of the correlations. This ensures that more focus is put on the canonical variates that are more strongly correlated. In Table 2, we describe the exact procedure for reassigning instances to clusters.

Figure 1: Scatter plot of a pair of canonical variates ($r = 0.9$) and the linear regression model constructed to predict one variate from the other.

Table 2: Procedure for assigning instances to clusters

1. For each cluster $i$ and its CCA model $M_i$, described as $\{(u^i_j, v^i_j), r^i_j, (\vec{a}^i_j, \vec{b}^i_j) : j = 1 \cdots d\}$, construct $d$ linear regression models $\hat{v}^i_j = \beta^i_j u^i_j + \alpha^i_j$, $j = 1 \cdots d$, one for each pair of canonical variates.
2. Given an instance $(\vec{x}, \vec{y})$, for each cluster $i$, compute the instance's canonical variates under $M_i$ as $u_j = (\vec{a}^i_j)^T\vec{x}$ and $v_j = (\vec{b}^i_j)^T\vec{y}$, $j = 1 \cdots d$; calculate $\hat{v}_j = \beta^i_j u_j + \alpha^i_j$, $j = 1 \cdots d$; and compute the weighted error $\mathrm{err}_i = \sum_{j=1}^{d} \frac{r^i_j}{r^i_1}\,(v_j - \hat{v}_j)^2$, where $r^i_j / r^i_1$ is the weight for the $j$th prediction error.
3. Assign instance $(\vec{x}, \vec{y})$ to the cluster minimizing $\mathrm{err}_i$.
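
To make the reassignment concrete, here is a minimal sketch of the procedure in Table 2, assuming the `cca_model` function from the earlier sketch; `fit_regressions` and `assign` are hypothetical helper names, not the authors' code.

```python
import numpy as np

def fit_regressions(U, V):
    """Step 1 of Table 2: per-pair least-squares fits v_hat_j = beta_j * u_j + alpha_j."""
    return np.array([np.polyfit(U[:, j], V[:, j], 1) for j in range(U.shape[1])])

def assign(x, y, models):
    """Steps 2-3 of Table 2: pick the cluster whose regressions best predict v from u.

    models: one (A, B, r, coef) tuple per cluster, where coef comes from
    fit_regressions applied to that cluster's canonical variates.
    """
    errs = []
    for A, B, r, coef in models:
        u, v = A.T @ x, B.T @ y              # the instance's variates under model i
        v_hat = coef[:, 0] * u + coef[:, 1]  # beta_j * u_j + alpha_j
        w = r / r[0]                         # weight r_j / r_1 for the j-th error
        errs.append(np.sum(w * (v - v_hat) ** 2))
    return int(np.argmin(errs))
```

The full clustering loop of Table 1 then alternates `cca_model`/`fit_regressions` per cluster with a pass of `assign` over all instances until assignments stop changing.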

Tables 1 and 2 complete the description of our correlation clustering algorithm. To apply this algorithm, the user needs to specify $d$, the number of pairs of canonical variates that are used in computing the prediction errors and reassigning the instances. Based on our empirical observations with both artificial and real-world data sets, we recommend that $d$ be set to be the same as, or slightly larger than, the total number of variates that are of interest in the application. In our application, our domain expert is interested in only the top two or three pairs of canonical variates; consequently we used $d = 4$ as the default choice for our experiments.

The proposed correlation clustering algorithm is a greedy iterative algorithm. We want to point out that it is not guaranteed to converge. Specifically, after reassigning instances to clusters at each iteration, there is no guarantee that the resulting new clusters will have more strongly correlated variates. In our experiments, we did observe fluctuations in the objective function, i.e., the weighted prediction error. But fluctuations typically occur only after an initial period in which the error computed by the objective function quickly decreases. Moreover, after this rapid initial convergence, the ensuing fluctuations are relatively small. Thus we recommend that one specify a maximum number of iterations; in our experiments we set this to 200 iterations.

Table 3: An artificial data set and results

|       | $D_1$ (true) | $D_2$ (true) | Global CCA | Mixture of CCA, clust. 1 | Mixture of CCA, clust. 2 |
|-------|--------------|--------------|------------|--------------------------|--------------------------|
| $r_1$ | 0.85         | 0.9          | 0.521      | 0.856 (.001)             | 0.904 (.001)             |
| $r_2$ | 0.6          | 0.7          | 0.462      | 0.619 (.001)             | 0.685 (.004)             |
| $r_3$ | 0.3          | 0.4          | 0.302      | 0.346 (.003)             | 0.436 (.003)             |

3.2 Experiments on Artificial Data Sets To examine the efficacy of the proposed correlation clustering algorithm, we apply it to artificially generated data sets that have pre-specified nonlinear correlation structures. We generate such data by first separately generating multiple component data sets, each with a different linear correlation structure, and then mixing these component data sets together to form a composite data set. Obviously the resulting data set's correlation structure is no longer globally linear. However, a properly constructed mixture of local linear models should be able to separate the data set into the original component data sets and recover the correlation patterns in each part. Therefore, we are interested in (1) testing whether our correlation clustering algorithm can find the correct partition of the data, (2) testing whether it can recover the original correlation patterns represented as the canonical variates, and (3) comparing its results to the results of global CCA on the composite data set.

In Table 3, we present the results of our correlation clustering algorithm and traditional CCA on a composite data set formed by two component data sets, each of which contains 1000 instances. We generate each component data set as follows.⁴ Given the desired correlation values $r_1$, $r_2$, and $r_3$, we first create a multivariate Gaussian distribution with six random variables $u_1, u_2, u_3, v_1, v_2, v_3$, where $u_i$ and $v_i$ are intended to be the $i$th pair of canonical variates. We set the covariance matrix to be:

$$\begin{pmatrix} 1 & 0 & 0 & r_1 & 0 & 0 \\ 0 & 1 & 0 & 0 & r_2 & 0 \\ 0 & 0 & 1 & 0 & 0 & r_3 \\ r_1 & 0 & 0 & 1 & 0 & 0 \\ 0 & r_2 & 0 & 0 & 1 & 0 \\ 0 & 0 & r_3 & 0 & 0 & 1 \end{pmatrix}$$

This ensures that $\mathrm{corr}(u_j, v_j) = r_j$ for $j = 1, 2, 3$ and $\mathrm{corr}(u_i, u_j) = \mathrm{corr}(v_i, v_j) = \mathrm{corr}(u_i, v_j) = 0$ for $i \neq j$. We then randomly sample 1000 points from this joint Gaussian distribution and form the final vector $\vec{x}$ using linear combinations of the $u_j$'s and the vector $\vec{y}$ using linear combinations of the $v_j$'s.

⁴ The Matlab code for generating a component data set is available at http://www.ecn.purdue.edu/~xz
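The generation procedure above is straightforward to reproduce. The sketch below is our own NumPy reconstruction, under the assumption that the mixing step uses arbitrary (almost surely invertible) random linear maps; the paper's Matlab code, linked in the footnote, is the authoritative version.

```python
import numpy as np

def make_component(r, n=1000, seed=0):
    """Sample one component data set with prescribed canonical correlations r = (r1, r2, r3)."""
    rng = np.random.default_rng(seed)
    C = np.eye(6)                        # covariance of (u1, u2, u3, v1, v2, v3)
    for j, rj in enumerate(r):
        C[j, j + 3] = C[j + 3, j] = rj   # corr(u_j, v_j) = r_j, all else uncorrelated
    Z = rng.multivariate_normal(np.zeros(6), C, size=n)
    U, V = Z[:, :3], Z[:, 3:]
    # observed features: (assumed) random linear mixes of the variates
    return U @ rng.normal(size=(3, 3)), V @ rng.normal(size=(3, 3))

# composite data set: two components with different correlation structure
X1, Y1 = make_component((0.85, 0.6, 0.3), seed=1)
X2, Y2 = make_component((0.9, 0.7, 0.4), seed=2)
X, Y = np.vstack([X1, X2]), np.vstack([Y1, Y2])
```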

Columns 2 and 3 of Table 3 specify the correlation between the first three pairs of canonical variates of each of the constructed data sets, $D_1$ and $D_2$. These are the values that were used to generate the data. We applied traditional CCA to the composite data set ($D_1$ and $D_2$ combined together) and report the top three detected canonical correlations in Column 4. We see from the results that, as expected, global CCA is unable to extract the true correlation structure from the data.

The last two columns of Table 3 show the results of applying the proposed correlation clustering algorithm to the composite data set with $k = 2$ and $d = 4$. The results, shown in Columns 5 and 6, are the average over ten runs with different random initializations (the standard deviations are shown in parentheses). We observe that the detected canonical correlations are similar to the true values. In Figure 2, we plot the canonical variates extracted by our algorithm ($y$ axis) versus the true canonical variates ($x$ axis); the plots of the first two pairs of variates are shown. We observe that the first pair of variates extracted by our algorithm is very similar to the original variates. This can be seen by noticing that, for both $u_1$ and $v_1$, most points lie on or close to the line of unit slope (shown as a red line). For the second pair, we see more deviation from the red line. This is possibly because our algorithm puts less focus on the second pair of variates during clustering. Finally, we observe that the clusters formed by our algorithm correspond nicely to the original component data sets. On average, only 2.5% of the 2000 instances were assigned to the wrong cluster.

These results show that our correlation clustering algorithm can discover local linear correlation patterns given prior knowledge of $k$, the true number of clusters in the data. Our algorithm performs consistently well on artificially constructed data sets. This is in part due to the fact that these data sets are highly simplified examples of nonlinearly correlated data. In real applications, the nonlinear correlation structure is often more complex. Indeed, when applied to our Earth science data sets, we observe greater instability of our algorithm: different initializations lead to different clustering solutions. We conjecture that this is because our clustering algorithm is a k-means style greedy algorithm and has a large number of locally optimal solutions.

Figure 2: Comparing the first two pairs of canonical variates extracted by our mixture of CCA algorithm and the original canonical variates. (a) The first pair of canonical variates; (b) the second pair of canonical variates. In each panel the extracted variate is plotted against the original variate.

4 Cluster Ensembles for Correlation Clustering

In this section we address a problem in the practical application of the proposed correlation clustering algorithm: identification of the number of clusters in the data. A complicating factor is that, because we are dealing with a k-means style greedy algorithm, there may be many locally optimal solutions. In particular, different initializations may lead to different clusters. In this section we show how to apply cluster ensemble techniques to address these issues.

The concept of cluster ensembles has recently seen increasing popularity in the clustering community [11, 2, 10, 3], in part because it can be applied to any type of clustering as a generic tool for boosting clustering performance. The basic idea is to generate an ensemble of different clustering solutions, each capturing some structure of the data. The anticipated result is that, by combining the ensemble of clustering solutions, a better final clustering solution can be obtained. Cluster ensembles have been successfully applied to determine the number of clusters [10] and to improve clustering performance for traditional clustering tasks [11, 2, 3]. Although our clustering task differs significantly from traditional clustering in terms of its goal, we believe similar benefits can be achieved by using cluster ensembles.

To generate a cluster ensemble, we run our correlation clustering algorithm on a given data set with $k = 2$ for $r$ times, each run starting from a different initial assignment, where $r$ is the size of the ensemble. We then combine these different clustering solutions into an $n \times n$ matrix $S$, which records, for each pair of instances, the frequency with which they are clustered together ($n$ is the total number of instances in the data set). As defined, each element of $S$ is a number between 0 and 1. We refer to $S$ as a similarity matrix because $S(i, j)$ can be considered as the similarity (correlation similarity instead of the conventional similarity) between instances $i$ and $j$.
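
Constructing $S$ amounts to averaging co-membership indicators over the runs. A minimal sketch (`coassociation` is our own helper name, not from the paper):

```python
import numpy as np

def coassociation(labelings):
    """n-by-n similarity matrix S: fraction of runs in which instances i, j co-cluster."""
    labelings = np.asarray(labelings)        # shape (r, n): r runs over n instances
    S = np.zeros((labelings.shape[1],) * 2)
    for lab in labelings:
        S += lab[:, None] == lab[None, :]    # 1 where i and j share a cluster in this run
    return S / labelings.shape[0]
```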

After the similarity matrix is constructed, we can visualize the matrix using a technique introduced by [10] to help determine how many clusters there are in the data. This visualization technique has two steps. First, it orders the instances such that instances that are similar to each other are arranged next to each other. It then maps the 0-1 range of the similarity values to a gray scale such that 0 corresponds to white and 1 corresponds to black. The similarity matrix is then displayed as an image, in which darker areas indicate strong similarity and lighter areas indicate little to no similarity. For example, if all clustering solutions in the ensemble agree with one another perfectly, the similarity matrix $S$ will have similarity value 1 for those pairs of instances that are from the same cluster and similarity value 0 for those from different clusters. Because the instances are ordered such that similar instances are arranged next to each other, the visualization will produce black squares along the diagonal of the image. For a detailed description of the visualization technique, please refer to [10].
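
The two steps map naturally onto a reorder-then-display routine. The sketch below uses the leaf order of an average-link dendrogram as a stand-in for the ordering step; this is an assumption on our part, since [10] defines its own ordering.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

def show_similarity(S):
    """Reorder instances so similar ones are adjacent, then display S in gray scale."""
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)                 # squareform expects a zero diagonal
    order = leaves_list(linkage(squareform(D, checks=False), method="average"))
    plt.imshow(S[np.ix_(order, order)], cmap="gray_r", vmin=0.0, vmax=1.0)
    plt.show()                               # 0 -> white, 1 -> black, as in the text
```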

To demonstrate the effect of cluster ensembles on our correlation clustering, we generate three artificial data sets using the same procedure as described in Section 3.2. These three data sets contain one, two, and three correlation clusters respectively. We apply our correlation clustering algorithm 20 times with different initializations and construct a similarity matrix for each data set. In Figure 3 we show the images of the resulting similarity matrices for these three data sets and make the following observations.

• For the one-cluster data set, shown in Figure 3 (a), the produced similarity matrix does not show any clear clustering pattern. This is because our correlation clustering algorithm splits the data randomly in each run; by combining the random runs through the similarity matrix, we can easily reach the conclusion that the given data set contains only one correlation cluster.

• For the two-cluster data set, shown in Figure 3 (b), we first see two dark squares along the diagonal, indicating that there are two correlation clusters in the data. This shows that, as we expected, the similarity matrix constructed via cluster ensembles reveals information about the true number of clusters in the data.

In addition to the two dark diagonal squares, we also see small gray areas in the image, indicating that some of the clustering solutions in the ensemble disagree with each other on some instances. This is because different initializations sometimes lead to different locally optimal solutions. Further, we argue that these different solutions sometimes make different mistakes; combining them can potentially correct some of the mistakes and produce a better solution.⁵ Indeed, our experiments show that, for this particular two-cluster data set, applying average-link agglomerative clustering to the resulting similarity matrix reduces the clustering error rate from 2.0% (the average error rate of the 20 clustering runs) to 1.1%. In this case, cluster ensembles corrected for the local optima problem of our correlation clustering algorithm. Cluster ensembles have been shown to boost clustering performance for traditional clustering tasks; here we confirm that correlation clustering can also benefit from cluster ensembles.

• For the last data set, shown in Figure 3 (c), we see three dark squares along the diagonal, indicating that there are three correlation clusters in the data. Compared to the two-cluster case, we see significantly larger areas of gray. In this case, our correlation clustering algorithm was asked to partition the data into two parts although the data actually contains three clusters. Therefore, it is not surprising that many of the clustering solutions don't agree with each other: they may split or merge clusters in many different ways when different initializations are used, resulting in a much larger chance of disagreement. However, this does not stop us from finding the correct number of clusters from the similarity matrix. Indeed, by combining multiple solutions, these random splits and merges tend to cancel each other out and the true structure of the data emerges.

With the help of the similarity matrix, we now know there are three clusters in the last data set. We then constructed another cluster ensemble for this data set, but this time we set $k = 3$ for each clustering run. The resulting similarity matrix $S'$ is shown in Figure 3 (d). In this case, the average error rate achieved by the individual clustering solutions in the ensemble is 7.5%, and the average-link agglomerative clustering algorithm applied to $S'$ reduces the error rate to 6.8%.
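
Extracting a final consensus partition from $S$ is a short exercise with standard hierarchical-clustering tools. The sketch below treats $1 - S$ as a distance matrix, matching the average-link procedure used above; the helper name `consensus_labels` is ours.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def consensus_labels(S, k):
    """Cut an average-link dendrogram built on distance 1 - S into k consensus clusters."""
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")   # one label per instance, in 1..k
```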

To conclude, cluster ensembles help to achieve two goals. First, they provide information about the true structure of the data. Second, they help improve the clustering performance of our correlation clustering algorithm.

⁵ It should be noted that if the different solutions make the same mistakes, these mistakes will not be corrected by using cluster ensembles.

Figure 3: Visualization of similarity matrices: (a) $S$ for the one-cluster data set; (b) $S$ for the two-cluster data set; (c) $S$ for the three-cluster data set; and (d) $S'$ for the three-cluster data set.

5 Experiments on Earth Science Data Sets

We have demonstrated on artificial data sets that our correlation clustering algorithm is capable of finding locally linear correlation patterns in the data. In this section, we apply our techniques to Earth science data sets. The task is to investigate the relationship between the variability in precipitation and the dynamics of vegetation. Below, we briefly introduce the data sets and then compare our technique to traditional CCA.

In this study, the standardized precipitation index (SPI) is used to describe the precipitation domain and the normalized difference vegetation index (NDVI) is used to describe the vegetation domain [9]. The data for both domains are collected and aligned at monthly time intervals from July 1981 to October 2000 (232 months). Our analysis is performed at the continental level for the continents of North America, South America, Australia and Africa. For each of these continents, we form a data set whose instances correspond to time points. For a particular continent, the feature vector $\vec{x}$ records the SPI value at each grid location of that continent; thus the dimension of $\vec{x}$ equals the number of grid locations of that continent. Similarly, $\vec{y}$ records the NDVI values. Note that the dimensions of $\vec{x}$ and $\vec{y}$ are not equal because different grid resolutions are used to collect the data. The effect of applying our technique to the data is to cluster the data points in time. This is motivated by the hypothesis that during different time periods the relationship between vegetation and precipitation may vary.

For our application, a standard way to visualize CCA results is to use colored maps. In particular, to analyze a pair of canonical variates, which in this case are a pair of correlated time series, one for SPI and one for NDVI, we produce one map for SPI and one map for NDVI. For example, to produce a map for SPI, we take the correlation between the time series of the SPI canonical variate and the SPI time series of each grid point, generating a value between −1 (negative correlation) and 1 (positive correlation) for each grid point. We then display these values on the map via color coding. Areas of red (blue) color are positively (negatively) correlated with the SPI canonical variate. Considered together, the NDVI map and SPI map identify regions where SPI correlates with NDVI. Since our technique produces local CCA models, we can visualize each cluster using the same technique.

Figure 4: The results of conventional CCA and Mixture of CCA (MCCA) for Africa: (a) conventional CCA; (b) cluster 1 of MCCA; (c) cluster 2 of MCCA. The top panel shows the NDVI and SPI canonical variates (time series); the middle and bottom panels show the NDVI and SPI maps.
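
The per-grid-point correlation map described above reduces to correlating one time series against many. A minimal sketch follows (`correlation_map` is our own helper name; the actual color-coded map rendering and grid projection are omitted):

```python
import numpy as np

def correlation_map(u, G):
    """Correlation of one canonical-variate time series with every grid point's series.

    u: (T,) canonical variate, e.g. the SPI variate; G: (T, m) series, one column
    per grid point. Returns m values in [-1, 1], ready to be color coded on a map.
    """
    uz = (u - u.mean()) / u.std()
    Gz = (G - G.mean(axis=0)) / G.std(axis=0)
    return (Gz * uz[:, None]).mean(axis=0)   # mean of standardized products = Pearson r
```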

Note that an exact geophysical interpretation of the produced maps is beyond the scope of this paper; doing so would require familiarity with geoscience terminology and concepts from our audience. Instead, we present the maps produced by traditional CCA and the maps produced by our technique, as well as plots of the time series of the SPI and NDVI canonical variates. Finally, a high-level interpretation of the results is provided by our domain expert. For brevity, the rest of our discussion will focus on the continent of Africa, which is a representative example where our method finds patterns of interest that were not discovered by traditional CCA.

We apply our technique to the data set for Africa by setting $k = 2$ and constructing a cluster ensemble of size 200.⁶ The final two clusters were obtained using the average-link agglomerative algorithm applied to the similarity matrix.

⁶ We use large ensemble sizes for the Earth science data sets because they contain a small number of instances, which makes this computationally feasible; larger ensemble sizes also ensure that the clusters we find in the data are not obtained by chance.

Figure 4 (a) shows the maps and the NDVI and SPI time series generated by traditional CCA. Figures 4 (b) and (c) show the maps and the time series for each of the two clusters. Note that each of the maps is associated with the first pair of canonical variates for that dataset/cluster. Inspection of the time series and the spatial patterns that are associated with the canonical variates for each cluster demonstrates that the mixture of CCA approach provides information that is clearly different from the results produced by conventional CCA. For Africa, the interannual dynamics in precipitation are strongly influenced by a complex set of dynamics that depend on El-Nino and La Nina, and on the resulting sea surface temperature regimes in the Indian Ocean and the southern Atlantic Ocean off the coast of west Africa. Although exact interpretation of these results requires more study, the maps of Figures 4 (b) and (c) show that the proposed approach was able to isolate important quasi-independent modes of precipitation-vegetation covariability that linear methods are unable to identify. As shown in [9], conventional CCA is effective in isolating precipitation and vegetation anomalies in eastern Africa associated with El-Nino, but less successful in isolating similar patterns in the Sahelian region of western Africa. In contrast, Figures 4 (b) and (c) show that the mixture of CCA technique isolates the pattern in eastern Africa, and additionally identifies a mode of covariability in the Sahel that is probably related to ocean-atmosphere dynamics in the southern Atlantic Ocean.

6 Conclusions and Future Work

This paper presented a method for constructing mixtures of local CCA models in an attempt to address the limitations of the conventional CCA approach. We developed a correlation clustering algorithm, which partitions a given data set according to the correlation between two sets of features. We further demonstrated that cluster ensembles can be used to identify the number of clusters in the data and to ameliorate the local optima problem of the proposed clustering algorithm. We applied our technique to Earth science data sets. In comparison to traditional CCA, our technique led to interesting and encouraging new discoveries in the data. As an ongoing effort, we will work closely with our domain expert to verify our findings in the data from a geoscience viewpoint. For future work, we would also like to apply our technique to more artificial and real-world data sets that have complex nonlinear correlation structure. Finally, we are developing a probabilistic approach to learning mixtures of CCA models.

References

[1] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56:89–113, 2004.

[2] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.

[3] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 281–288, 2004.

[4] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.

[5] W. Hsieh. Nonlinear canonical correlation analysis by neural networks. Neural Networks, 13:1095–1105, 2000.

[6] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, 1992.

[7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.

[8] P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5):365–377, 2000.

[9] A. Lotsch and M. Friedl. Coupled vegetation-precipitation variability observed from satellite and climate records. Geophysical Research Letters, in submission.

[10] S. Monti, P. Tamayo, J. Mesirov, and T. Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52:91–118, 2003.

[11] A. Strehl and J. Ghosh. Cluster ensembles: A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.

[12] M. Tipping and C. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11, 1999.

[13] E. Zorita, V. Kharin, and H. von Storch. The atmospheric circulation and sea surface temperature in the North Atlantic area in winter: Their interaction and relevance for Iberian precipitation. Journal of Climate, 5:1097–1108, 1992.
