Correlation Clustering for Learning Mixtures of Canonical Correlation Models
X. Z. Fern (School of Elec. and Comp. Eng., Purdue University, West Lafayette, IN 47907, USA)
C. E. Brodley (Dept. of Comp. Sci., Tufts University, Medford, MA 02155, USA)
M. A. Friedl (Dept. of Geography, Boston University, Boston, MA, USA)
Abstract
This paper addresses the task of analyzing the correlation between two related domains X and Y. Our research is motivated by an Earth Science task that studies the relationship between vegetation and precipitation. A standard statistical technique for such problems is Canonical Correlation Analysis (CCA). A critical limitation of CCA is that it can only detect linear correlation between the two domains that is globally valid throughout both data sets. Our approach addresses this limitation by constructing a mixture of local linear CCA models through a process we name correlation clustering. In correlation clustering, both data sets are clustered simultaneously according to the data's correlation structure such that, within a cluster, domain X and domain Y are linearly correlated in the same way. Each cluster is then analyzed using traditional CCA to construct local linear correlation models. We present results on both artificial data sets and Earth Science data sets to demonstrate that the proposed approach can detect useful correlation patterns, which traditional CCA fails to discover.
1 Introduction
In Earth science applications, researchers are often interested in studying the correlation structure between two domains in order to understand the nature of the relationship between them. The inputs to our correlation analysis task can be considered as two data sets X and Y whose instances are described by feature vectors $\vec{x}$ and $\vec{y}$ respectively. The dimension of $\vec{x}$ and that of $\vec{y}$ do not need to be the same, although there must be a one-to-one mapping between instances of X and instances of Y. Thus, it is often more convenient to consider these two data sets as one compound data set whose instances are described by two feature vectors $\vec{x}$ and $\vec{y}$. Indeed, throughout the remainder of this paper, we will refer to the input of our task as one data set, and the goal is to study how the two sets of features are correlated to each other.

Canonical Correlation Analysis (CCA) [4, 6] is a
multivariate statistical technique commonly used to identify and quantify the correlation between two sets of random variables. Given a compound data set described by feature vectors $\vec{x}$ and $\vec{y}$, CCA seeks to find a linear transformation of $\vec{x}$ and a linear transformation of $\vec{y}$ such that the resulting two new variables are maximally correlated.
In Earth science research, CCA has often been applied to examine whether there is a cause-and-effect relationship between two domains or to predict the behavior of one domain based on another. For example, in [13] CCA was used to analyze the relationship between the monthly mean sea-level pressure (SLP) and sea-surface temperature (SST) over the North Atlantic in the months of December, January, and February. This analysis confirmed the hypothesis that atmospheric SLP anomalies cause SST anomalies.
Because CCA is based on linear transformations, the scope of its applications is necessarily limited. One way to tackle this limitation is to use nonlinear canonical correlation analysis (NLCCA) [5, 8]. NLCCA applies nonlinear functions to the original variables in order to extract correlated components from the two sets of variables. Although promising results have been achieved by NLCCA in some Earth science applications, such techniques tend to be difficult to apply because of the complexity of the model and the lack of robustness due to overfitting [5].
In this paper we propose to use a mixture of local linear correlation models to capture the correlation structure between two sets of random variables (features). Mixtures of local linear models not only provide an alternative solution to capturing nonlinear correlations, but also have the potential to detect correlation patterns that are significant only in a part (a local region) of the data. The philosophy of using multiple local linear models to model global nonlinearity has been successfully applied to other statistical approaches with similar linearity limitations, such as principal component analysis [12] and linear regression [7]. Our approach uses a two-step procedure. Given a compound data set, we propose to first solve a clustering problem that partitions the data set into clusters such that each cluster contains instances whose $\vec{x}$ features and $\vec{y}$ features are linearly correlated. We then independently apply CCA
to each cluster to form a mixture of correlation models that are locally linear.

In designing this two-step process, we need to address the following two critical questions.
1. Assume we are informed a priori that we can model the correlation structure using k local linear CCA models. How should we cluster the data in the context of correlation analysis?

2. In real-world applications, we are rarely equipped with knowledge of k. How can we decide how many clusters there are in the data or whether a global linear structure will suffice?
Note that the goal of clustering in the context of correlation analysis is different from traditional clustering. In traditional clustering, the goal is to group instances that are similar (as measured by a certain distance or similarity metric) together. In contrast, here we need to group instances based on how their $\vec{x}$ features and $\vec{y}$ features correlate to each other, i.e., instances that share a similar correlation structure between the two sets of features should be clustered together. To differentiate this clustering task from traditional clustering, we name it correlation clustering¹ and, in Section 3, we propose an iterative greedy k-means style algorithm for this task.
To address the second question, we apply the technique of cluster ensembles [2] to our correlation clustering algorithm, which provides the user with a visualization of the results that can be used to determine the proper number of clusters in the data. Note that our correlation clustering algorithm is a k-means style algorithm and as such may have many locally optimal solutions: different initializations may lead to significantly different clustering results. By using cluster ensembles, we can also address the local optima problem of our clustering algorithm and find a stable clustering solution.
To demonstrate the efficacy of our approach, we apply it to both artificial data sets and real-world Earth science data sets. Our results on the artificial data sets show that (1) the proposed correlation clustering algorithm is capable of finding a good partition of the data when the correct k is used and (2) cluster ensembles provide an effective tool for finding k. When applied to the Earth science data sets, our technique detected significantly different correlation patterns in comparison to what was found via traditional CCA. These results led our domain expert to highly interesting hypotheses that merit further investigation.
¹Note that the term correlation clustering has also been used by [1] as the name of a technique for traditional clustering.
The remainder of the paper is arranged as follows. In Section 2, we review the basics of CCA. Section 3 introduces the intuitions behind our correlation clustering algorithm and formally describes the algorithm, which is then applied to artificially constructed data sets to demonstrate its efficacy in finding correlation clusters. Section 4 demonstrates how cluster ensemble techniques can be used to determine the number of clusters in the data and address the local optima problem of the k-means style correlation clustering algorithm. Section 5 explains our motivating application, presents results, and describes how our domain expert interprets the results. Finally, in Section 6 we conclude the paper and discuss future directions.
2 Basics of CCA
Given a data set whose instances are described by two feature vectors $\vec{x}$ and $\vec{y}$, the goal of CCA is to find linear transformations of $\vec{x}$ and linear transformations of $\vec{y}$ such that the resulting new variables are maximally correlated.
In particular, CCA constructs a sequence of pairs of strongly correlated variables $(u_1, v_1), (u_2, v_2), \cdots, (u_d, v_d)$ through linear transformations, where d is the minimum dimension of $\vec{x}$ and $\vec{y}$. These new variables, the $u_i$'s and $v_i$'s, are named canonical variates (sometimes referred to as canonical factors). They are similar to principal components in the sense that principal components are linear combinations of the original variables that capture the most variance in the data, whereas canonical variates are linear combinations of the original variables that capture the most correlation between two sets of variables.

To construct these canonical variates, CCA first seeks to transform $\vec{x}$ and $\vec{y}$ into a pair of new variables $u_1$ and $v_1$ by the linear transformations

$$u_1 = \vec{a}_1^T \vec{x}, \quad v_1 = \vec{b}_1^T \vec{y}$$

where the transformation vectors $\vec{a}_1$ and $\vec{b}_1$ are defined such that $corr(u_1, v_1)$ is maximized subject to the constraint that both $u_1$ and $v_1$ have unit variance.²
Once $\vec{a}_1, \vec{b}_1; \cdots; \vec{a}_i, \vec{b}_i$ are determined, we then find the next pair of transformations $\vec{a}_{i+1}$ and $\vec{b}_{i+1}$ such that the correlation between $(\vec{a}_{i+1})^T \vec{x}$ and $(\vec{b}_{i+1})^T \vec{y}$ is maximized, with the constraint that the resulting $u_{i+1}$ and $v_{i+1}$ are uncorrelated with all previous canonical variates.³ Note that the correlation between $u_i$ and $v_i$ becomes weaker as i increases. Letting $r_i$ represent the correlation between the ith pair of canonical variates, we have $r_i \geq r_{i+1}$.
²This constraint ensures unique solutions.
³This constraint ensures that the extracted canonical variates contain no redundant information.
It can be shown that to find the projection vectors for the canonical variates, we only need to find the eigenvectors of the following matrices:

$$M_x = \Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}$$

$$M_y = \Sigma_{yy}^{-1} \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy}$$
The eigenvectors of $M_x$, ordered according to decreasing eigenvalues, are the transformation vectors $\vec{a}_1, \vec{a}_2, \cdots, \vec{a}_d$, and the eigenvectors of $M_y$ are $\vec{b}_1, \vec{b}_2, \cdots, \vec{b}_d$. In addition, the eigenvalues of the two matrices are identical, and the square root of the ith eigenvalue satisfies $\sqrt{\lambda_i} = r_i$, i.e., the correlation between the ith pair of canonical variates $u_i$ and $v_i$. Note that in most applications, only the first few most significant pairs of canonical variates are of real interest. Assuming that we are interested in the first d pairs of variates, we can represent all the useful information of the linear correlation structure as a model M, defined as

$$M = \{(u_j, v_j), r_j, (\vec{a}_j, \vec{b}_j) : j = 1 \cdots d\}$$

where $(u_j, v_j)$ represents the jth pair of canonical variates, $r_j$ is the correlation between them, and $(\vec{a}_j, \vec{b}_j)$ are the projection vectors for generating them. We refer to M as a CCA model.
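To make the construction concrete, the following is a minimal numpy sketch of this eigendecomposition route to a CCA model. It is an illustration of the equations above, not the implementation used in our experiments, and the normalization details are one of several equivalent choices.

```python
import numpy as np

def cca(X, Y, d):
    """CCA via the eigendecomposition of M_x above (sketch, not the
    authors' code). X: (n, p), Y: (n, q); returns projection matrices
    A (p, d), B (q, d) and canonical correlations r (d,), descending."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    # M_x = Sxx^{-1} Sxy Syy^{-1} Syx; its eigenvalues are r_i^2
    Mx = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    evals, evecs = np.linalg.eig(Mx)
    order = np.argsort(-evals.real)[:d]
    r = np.sqrt(np.clip(evals.real[order], 0.0, 1.0))
    A = evecs.real[:, order]
    # b_j is proportional to Syy^{-1} Syx a_j (standard CCA identity)
    B = np.linalg.solve(Syy, Sxy.T) @ A
    # rescale so each canonical variate has unit variance
    A /= np.sqrt(np.sum((Xc @ A) ** 2, axis=0) / (n - 1))
    B /= np.sqrt(np.sum((Yc @ B) ** 2, axis=0) / (n - 1))
    return A, B, r
```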
Once a CCA model is constructed, the next step is
for the domain experts to examine the variates as well
as the transformation vectors in order to understand
the relationship between the two domains. This can be
done in different ways depending on the application. In
our motivating Earth science task, the results of CCA can be visualized as colored maps and interpreted by Earth scientists. We explain this process in Section 5.
3 Correlation Clustering
In this section, we first explain the basic intuitions that led to our algorithm and formally present our k-means style correlation clustering algorithm. We then apply the proposed algorithm to artificially constructed data sets and analyze the results.
3.1 Algorithm Description. Given a data set described by two sets of features $\vec{x}$ and $\vec{y}$, and the prior knowledge that the correlation structure of the data can be modeled by k local linear models, the goal of correlation clustering is to partition the data into k clusters such that for instances in the same cluster the features of $\vec{x}$ and $\vec{y}$ are linearly correlated in the same way. The critical question is how we should cluster the data to reach this goal. Our answer is based on the following important intuitions.
Table 1: A correlation clustering algorithm

Input: a data set of n instances, each described by two random vectors $\vec{x}$ and $\vec{y}$; k, the desired number of clusters
Output: k clusters and k linear CCA models, one for each cluster
Algorithm:
1. Randomly assign instances to the k clusters.
2. For $i = 1 \cdots k$, apply CCA to cluster i to build $M_i = \{(u_j, v_j), r_j, (\vec{a}_j, \vec{b}_j) : j = 1 \cdots d\}$, i.e., the top d pairs of canonical variates, the correlation $r_j$ between each pair, and the corresponding d pairs of projection vectors.
3. Reassign each instance to a cluster based on its $\vec{x}$ and $\vec{y}$ features and the k CCA models.
4. If no assignment has changed from the previous iteration, return the current clusters and CCA models. Otherwise, go to step 2.
Intuition 1: If a given set of instances contains multiple correlation structures, applying CCA to this instance set will not detect a strong linear correlation.

This is because when we put instances that have different correlation structures together, the original correlation patterns will be weakened because they are now only valid in part of the data. Conversely, if CCA detects strong correlation in a cluster, it is likely that the instances in the cluster share the same correlation structure. This suggests that we can use the strength of the correlation between the canonical variates extracted by CCA to measure the quality of a cluster. Note that it is computationally intractable to evaluate all possible clustering solutions in order to select the optimal one. This motivates us to examine a k-means style algorithm. Starting from a random clustering solution, in each iteration we build a CCA model for each cluster and then reassign each instance to its most appropriate cluster according to its $\vec{x}$ and $\vec{y}$ features and the CCA models. In Table 1, we describe the basic steps of such a generic correlation clustering procedure.
The remaining question is how to assign instances to their clusters. Note that in traditional k-means clustering, each iteration reassigns instances to clusters according to the distance between instances and cluster centers.
Table 2: Procedure for assigning instances to clusters

1. For each cluster i and its CCA model $M_i$, described as $\{(u^i_j, v^i_j), r^i_j, (\vec{a}^i_j, \vec{b}^i_j) : j = 1 \cdots d\}$, construct d linear regression models $\hat{v}^i_j = \beta^i_j u^i_j + \alpha^i_j$, $j = 1 \cdots d$, one for each pair of canonical variates.
2. Given an instance $(\vec{x}, \vec{y})$, for each cluster i, compute the instance's canonical variates under $M_i$ as $u_j = (\vec{a}^i_j)^T \vec{x}$ and $v_j = (\vec{b}^i_j)^T \vec{y}$, $j = 1 \cdots d$; calculate $\hat{v}_j = \beta^i_j u_j + \alpha^i_j$, $j = 1 \cdots d$, and the weighted error $err_i = \sum_{j=1}^{d} \frac{r^i_j}{r^i_1} (v_j - \hat{v}_j)^2$, where $\frac{r^i_j}{r^i_1}$ is the weight for the jth prediction error.
3. Assign instance $(\vec{x}, \vec{y})$ to the cluster minimizing $err_i$.
For correlation clustering, minimizing the distance between instances and their cluster centers is no longer our goal. Instead, our instance reassignment is performed based on the intuition described below.

Intuition 2: If CCA detects a strong correlation pattern in a cluster, i.e., the canonical variates u and v are highly correlated, we expect to be able to predict the value of v from u (or vice versa) using a linear regression model.
This is demonstrated in Figure 1, where we plot a pair of canonical variates with correlation 0.9. Shown as a solid line is the linear regression model constructed to predict one variate from the other. Intuition 2 suggests that, for each cluster, we can compute its most significant pair of canonical variates $(u_1, v_1)$ and construct a linear regression model to predict $v_1$ from $u_1$. To assign an instance to its proper cluster, we can simply select the cluster whose regression model best predicts the instance's variate $v_1$ from its variate $u_1$. In some cases, we are interested in the first few pairs of canonical variates rather than only the first pair. It is thus intuitive to construct one linear regression model for each pair, and assign instances to clusters based on the combined prediction error. Note that because the correlation $r_i$ between variates $v_i$ and $u_i$ decreases as i increases, we set the weight for the ith error to be $r_i / r_1$. In this manner, the weight for the prediction error between $u_1$ and $v_1$ is always one, whereas the weights for the ensuing ones will be smaller depending on the strength of the correlations. This ensures that more focus is put on the canonical variates that are more strongly correlated.
Figure 1: Scatter plot of a pair of canonical variates (r = 0.9) and the linear regression model constructed to predict one variate from another.
In Table 2, we describe the exact procedure for reassigning instances to clusters.
Tables 1 and 2 complete the description of our correlation clustering algorithm. To apply this algorithm, the user needs to specify d, the number of pairs of canonical variates that are used in computing the prediction errors and reassigning the instances. Based on our empirical observations with both artificial and real-world datasets, we recommend that d be set to be the same as or slightly larger than the total number of variates that bear interest in the application. In our application, our domain expert is interested in only the top two or three pairs of canonical variates; consequently we used d = 4 as the default choice for our experiments.
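Putting Tables 1 and 2 together, a Python sketch of the full procedure might look as follows. It assumes the cca() helper sketched in Section 2; the function name, the seeding, and the guard against degenerate clusters are our own illustrative choices.

```python
import numpy as np

def correlation_clustering(X, Y, k, d=4, max_iter=200, seed=None):
    """k-means style correlation clustering (Tables 1 and 2).

    Sketch only: assumes the cca() helper above and d <= min(dim x, dim y).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(0, k, size=n)            # step 1: random assignment
    for _ in range(max_iter):
        err = np.full((n, k), np.inf)
        for i in range(k):
            members = labels == i
            if members.sum() <= max(X.shape[1], Y.shape[1]):
                continue                           # degenerate cluster: skip
            A, B, r = cca(X[members], Y[members], d)   # step 2: local CCA model
            U, V = X @ A, Y @ B                    # variates for every instance
            e = np.zeros(n)
            for j in range(d):
                # regression v_hat = beta*u + alpha, fit on the cluster members
                # (intercepts absorb the fact that X, Y are not re-centered here)
                beta, alpha = np.polyfit(U[members, j], V[members, j], 1)
                e += (r[j] / r[0]) * (V[:, j] - (beta * U[:, j] + alpha)) ** 2
            err[:, i] = e
        new_labels = err.argmin(axis=1)            # step 3: reassign instances
        if np.array_equal(new_labels, labels):     # step 4: stop when stable
            return labels
        labels = new_labels
    return labels
```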
The proposed correlation clustering algorithm is a greedy iterative algorithm. We want to point out that it is not guaranteed to converge. Specifically, after reassigning instances to clusters at each iteration, there is no guarantee that the resulting new clusters will have more strongly correlated variates. In our experiments, we did observe fluctuations in the objective function, i.e., the weighted prediction error. But fluctuations typically occur only after an initial period in which the error computed by the objective function quickly decreases. Moreover, after this rapid initial convergence, the ensuing fluctuations are relatively small. Thus we recommend that one specify a maximum number of iterations; in our experiments we set this to 200 iterations.
Table 3: An artificial data set and results

        Data Sets          Global     Mixture of CCA
        D1       D2        CCA        clust. 1        clust. 2
r1      0.85     0.9       0.521      0.856 (.001)    0.904 (.001)
r2      0.6      0.7       0.462      0.619 (.001)    0.685 (.004)
r3      0.3      0.4       0.302      0.346 (.003)    0.436 (.003)
3.2 Experiments on Artificial Data Sets. To examine the efficacy of the proposed correlation clustering algorithm, we apply it to artificially generated data sets that have prespecified nonlinear correlation structures. We generate such data by first separately generating multiple component data sets, each with a different linear correlation structure, and then mixing these component data sets together to form a composite data set. Obviously the resulting data set's correlation structure is no longer globally linear. However, a properly constructed mixture of local linear models should be able to separate the data set into the original component data sets and recover the correlation patterns in each part. Therefore, we are interested in (1) testing whether our correlation clustering algorithm can find the correct partition of the data, (2) testing whether it can recover the original correlation patterns represented as the canonical variates, and (3) comparing its results to the results of global CCA on the composite data set.

In Table 3, we present the results of our correlation clustering algorithm and traditional CCA on a composite data set formed by two component data sets, each of which contains 1000 instances. We generate each component data set as follows.⁴ Given the desired correlation values $r_1$, $r_2$, and $r_3$, we first create a multivariate Gaussian distribution with six random variables $u_1, u_2, u_3, v_1, v_2, v_3$, where $u_i$ and $v_i$ are intended to be the ith pair of canonical variates. We set the covariance matrix to be
$$\Sigma = \begin{pmatrix} 1 & 0 & 0 & r_1 & 0 & 0 \\ 0 & 1 & 0 & 0 & r_2 & 0 \\ 0 & 0 & 1 & 0 & 0 & r_3 \\ r_1 & 0 & 0 & 1 & 0 & 0 \\ 0 & r_2 & 0 & 0 & 1 & 0 \\ 0 & 0 & r_3 & 0 & 0 & 1 \end{pmatrix}$$
This ensures that $corr(u_j, v_j) = r_j$ for $j = 1, 2, 3$ and $corr(u_i, u_j) = corr(v_i, v_j) = corr(u_i, v_j) = 0$ for $i \neq j$. We then randomly sample 1000 points from this joint Gaussian distribution and form the final vector $\vec{x}$ using linear combinations of the $u_j$'s and the vector $\vec{y}$ using linear combinations of the $v_j$'s.
⁴The Matlab code for generating a component data set is available at http://www.ecn.purdue.edu/∼xz
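For readers without Matlab, a Python analogue of this generation procedure is sketched below; the mixing matrices Wx and Wy are arbitrary assumptions for the sketch, not taken from our code.

```python
import numpy as np

def make_component(r, n=1000, seed=None):
    """Sample one component data set with prescribed canonical
    correlations r = (r1, r2, r3). The 3x3 mixing matrices Wx, Wy
    are arbitrary choices for this sketch."""
    rng = np.random.default_rng(seed)
    r1, r2, r3 = r
    cov = np.eye(6)
    cov[0, 3] = cov[3, 0] = r1   # corr(u1, v1) = r1
    cov[1, 4] = cov[4, 1] = r2   # corr(u2, v2) = r2
    cov[2, 5] = cov[5, 2] = r3   # corr(u3, v3) = r3
    z = rng.multivariate_normal(np.zeros(6), cov, size=n)
    u, v = z[:, :3], z[:, 3:]
    Wx = rng.normal(size=(3, 3))  # x = linear combinations of the u_j's
    Wy = rng.normal(size=(3, 3))  # y = linear combinations of the v_j's
    return u @ Wx, v @ Wy
```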
Columns 2 and 3 of Table 3 specify the correlation between the first three pairs of canonical variates of each of the constructed datasets, D1 and D2. These are the values that were used to generate the data. We applied traditional CCA to the composite data set (D1 and D2 combined together) and report the top three detected canonical correlations in Column 4. We see from the results that, as expected, global CCA is unable to extract the true correlation structure from the data.

The last two columns of Table 3 show the results of applying the proposed correlation clustering algorithm to the composite data set with k = 2 and d = 4.
The results, shown in Columns 5 and 6, are the averages over ten runs with different random initializations (the standard deviations are shown in parentheses). We observe that the detected canonical correlations are similar to the true values. In Figure 2, we plot the canonical variates extracted by our algorithm (y axis) versus the true canonical variates (x axis); the plots of the first two pairs of variates are shown. We observe that the first pair of variates extracted by our algorithm is very similar to the original variates. This can be seen by noticing that for both $u_1$ and $v_1$ most points lie on or close to the line of unit slope (shown as a red line). For the second pair, we see more deviation from the red line. This is possibly because our algorithm put less focus on the second pair of variates during clustering. Finally, we observe that the clusters formed by our algorithm correspond nicely to the original component data sets. On average, only 2.5% of the 2000 instances were assigned to the wrong cluster.
These results show that our correlation clustering algorithm can discover local linear correlation patterns given prior knowledge of k, the true number of clusters in the data. Our algorithm performs consistently well on artificially constructed data sets. This is in part due to the fact that these data sets are highly simplified examples of nonlinearly correlated data. In real applications, the nonlinear correlation structure is often more complex. Indeed, when applied to our Earth science data sets, we observe greater instability of our algorithm: different initializations lead to different clustering solutions. We conjecture that this is because our clustering algorithm is a k-means style greedy algorithm and has a large number of locally optimal solutions.
4 Cluster Ensembles for Correlation Clustering
In this section we address a problem in the practical application of the proposed correlation clustering algorithm: identification of the number of clusters in the data. A complicating factor is that because we are dealing with a k-means style greedy algorithm, there may be many locally optimal solutions; in particular, different initializations may lead to different clusters. In this section we show how to apply cluster ensemble techniques to address these issues.
Figure 2: Comparing the first two pairs of canonical variates extracted by our mixture of CCA algorithm and the original canonical variates. (a) The first pair (Original $u_1$ vs. Extracted $u_1$; Original $v_1$ vs. Extracted $v_1$). (b) The second pair (Original $u_2$ vs. Extracted $u_2$; Original $v_2$ vs. Extracted $v_2$).
The concept of cluster ensembles has recently seen increasing popularity in the clustering community [11, 2, 10, 3], in part because it can be applied to any type of clustering as a generic tool for boosting clustering performance. The basic idea is to generate an ensemble of different clustering solutions, each capturing some structure of the data. The anticipated result is that by combining the ensemble of clustering solutions, a better final clustering solution can be obtained. Cluster ensembles have been successfully applied to determine the number of clusters [10] and to improve clustering performance for traditional clustering tasks [11, 2, 3]. Although our clustering tasks are significantly different from traditional clustering in terms of the goal, we believe similar benefits can be achieved by using cluster ensembles.
To generate a cluster ensemble, we run our correlation clustering algorithm on a given data set with k = 2 for r times, each run starting from a different initial assignment, where r is the size of the ensemble. We then combine these different clustering solutions into an n × n matrix S, which describes for each pair of instances the frequency with which they are clustered together (n is the total number of instances in the data set). As defined, each element of S is a number between 0 and 1. We refer to it as a similarity matrix because S(i, j) can be considered as the similarity (correlation similarity instead of the conventional similarity) between instances i and j.
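A sketch of this co-association construction, assuming the correlation_clustering() routine sketched in Section 3, might look as follows.

```python
import numpy as np

def ensemble_similarity(X, Y, k=2, runs=20, d=4):
    """Co-association matrix S: S[i, j] = fraction of runs in which
    instances i and j were clustered together."""
    n = X.shape[0]
    S = np.zeros((n, n))
    for t in range(runs):
        labels = correlation_clustering(X, Y, k, d=d, seed=t)
        S += labels[:, None] == labels[None, :]
    return S / runs
```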
After the similarity matrix is constructed, we can then visualize the matrix using a technique introduced by [10] to help determine how many clusters there are in the data. This visualization technique has two steps. First, it orders the instances such that instances that are similar to each other are arranged to be next to each other. It then maps the 0–1 range of the similarity values to a grayscale such that 0 corresponds to white and 1 corresponds to black. The similarity matrix is then displayed as an image, in which darker areas indicate strong similarity and lighter areas indicate little to no similarity. For example, if all clustering solutions in the ensemble agree with one another perfectly, the similarity matrix S will have similarity value 1 for those pairs of instances that are from the same cluster and similarity value 0 for those from different clusters. Because the instances are ordered such that similar instances are arranged next to each other, the visualization will produce black squares along the diagonal of the image. For a detailed description of the visualization technique, please refer to [10].
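One possible realization of these two steps is sketched below; the leaf ordering of an average-link dendrogram is our stand-in for the ordering step of [10], which may differ in detail.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def show_similarity(S):
    """Display S as a grayscale image (0 = white, 1 = black) after
    reordering instances so that similar ones sit next to each other."""
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    order = leaves_list(Z)   # leaf order groups similar instances together
    plt.imshow(S[np.ix_(order, order)], cmap="gray_r", vmin=0.0, vmax=1.0)
    plt.show()
```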
To demonstrate the effect of cluster ensembles on our correlation clustering, we generate three artificial data sets using the same procedure as described in Section 3.2. These three data sets contain one, two, and three correlation clusters respectively. We apply our correlation clustering algorithm 20 times with different initializations and construct a similarity matrix for each data set. In Figure 3 we show the images of the resulting similarity matrices for these three data sets and make the following observations.
• For the one-cluster data set, shown in Figure 3 (a), the produced similarity matrix does not show any clear clustering pattern. This is because our correlation clustering algorithm splits the data randomly in each run; by combining the random runs through the similarity matrix, we can easily reach the conclusion that the given data set contains only one correlation cluster.
• For the two-cluster data set, shown in Figure 3 (b), we first see two dark squares along the diagonal, indicating there are two correlation clusters in the data. This shows that, as we expect, the similarity matrix constructed via cluster ensembles reveals information about the true number of clusters in the data.
In addition to the two dark diagonal squares, we also see small gray areas in the image, indicating that some of the clustering solutions in the ensemble disagree with each other on some instances. This is because different initializations sometimes lead to different locally optimal solutions. Further, we argue that these different solutions sometimes make different mistakes; combining them can potentially correct some of the mistakes and produce a better solution.⁵ Indeed, our experiments show that, for this particular two-cluster data set, applying average-link agglomerative clustering to the resulting similarity matrix reduces the clustering error rate from 2.0% (the average error rate of the 20 clustering runs) to 1.1%. In this case, cluster ensembles corrected for the local optima problem of our correlation clustering algorithm. Cluster ensembles have been shown to boost clustering performance for traditional clustering tasks; here we confirm that correlation clustering can also benefit from cluster ensembles.
• For the last data set, shown in Figure 3 (c), we see three dark squares along the diagonal, indicating that there are three correlation clusters in the data. Compared to the two-cluster case, we see significantly larger areas of gray. In this case, our correlation clustering algorithm was asked to partition the data into two parts although the data actually contains three clusters. Therefore, it is not surprising that many of the clustering solutions don't agree with each other: they may split or merge clusters in many different ways when different initializations are used, resulting in a much larger chance of disagreement. However, this does not stop us from finding the correct number of clusters from the similarity matrix. Indeed, by combining multiple solutions, these random splits and merges tend to cancel each other out and the true structure of the data emerges.
With the help of the similarity matrix, we now know there are three clusters in the last data set. We then constructed another cluster ensemble for this data set, but this time we set k = 3 for each clustering run. The resulting similarity matrix S′ is shown in Figure 3 (d). In this case, the average error rate achieved by the individual clustering solutions in the ensemble is 7.5%, and the average-link agglomerative clustering algorithm applied to S′ reduces the error rate to 6.8%.
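The combining step can be sketched as follows, treating 1 − S as a distance matrix and cutting an average-link dendrogram at k clusters; scipy's hierarchical clustering is our choice for the sketch, not necessarily what produced the reported numbers.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_clusters(S, k):
    """Cut an average-link dendrogram over the distances 1 - S into k
    clusters to obtain the final consensus partition."""
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")
```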
To conclude, cluster ensembles help to achieve two
goals. First, they provide information about the true
structure of the data. Second, they help improve
clustering performance of our correlation clustering
algorithm.
⁵It should be noted that if the different solutions make the same mistakes, these mistakes will not be corrected by using cluster ensembles.
Figure 3: Visualization of similarity matrices: (a) S for the one-cluster data set; (b) S for the two-cluster data set; (c) S for the three-cluster data set; (d) S′ for the three-cluster data set.
5 Experiments on Earth Science Data Sets
We have demonstrated on artificial data sets that our correlation clustering algorithm is capable of finding locally linear correlation patterns in the data. In this section, we apply our techniques to Earth science data sets. The task is to investigate the relationship between the variability in precipitation and the dynamics of vegetation. Below, we briefly introduce the data sets and then compare our technique to traditional CCA.
In this study, the standardized precipitation index (SPI) is used to describe the precipitation domain and the normalized difference vegetation index (NDVI) is used to describe the vegetation domain [9]. The data for both domains are collected and aligned at monthly time intervals from July 1981 to October 2000 (232 months). Our analysis is performed at the continental level for the continents of North America, South America, Australia, and Africa. For each of these continents, we form a data set whose instances correspond to time points. For a particular continent, the feature vector $\vec{x}$ records the SPI value at each grid location of that continent; thus the dimension of $\vec{x}$ equals the number of grid locations of that continent. Similarly, $\vec{y}$ records the NDVI values. Note that the dimensions of $\vec{x}$ and $\vec{y}$ are not equal because different grid resolutions are used to collect the data. The effect of applying our technique to the data is to cluster the data points in time. This is motivated by the hypothesis that during different time periods the relationship between vegetation and precipitation may vary.
Figure 4: The results of conventional CCA and mixture of CCA (MCCA) for Africa: (a) conventional CCA; (b) cluster 1 of MCCA; (c) cluster 2 of MCCA. The top panel shows the NDVI and SPI canonical variates (time series); the middle and bottom panels show the NDVI and SPI maps.

For our application, a standard way to visualize CCA results is to use colored maps. In particular, to
analyze a pair of canonical variates, which are in this case a pair of correlated time series, one for SPI and one for NDVI, we produce one map for SPI and one map for NDVI. For example, to produce a map for SPI, we take the correlation between the time series of the SPI canonical variate and the SPI time series of each grid point, generating a value between −1 (negative correlation) and 1 (positive correlation) for each grid point. We then display these values on the map via color coding. Areas of red (blue) color are positively (negatively) correlated with the SPI canonical variate. Considered together, the NDVI map and SPI map identify regions where SPI correlates with NDVI.
Since our technique produces local CCA models, we
can visualize each cluster using the same technique.
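A sketch of this per-grid-point map computation, with hypothetical array shapes, is given below.

```python
import numpy as np

def correlation_map(variate, grid_series):
    """Pearson correlation between a canonical-variate time series
    (length T) and every grid cell's time series (T, n_cells),
    giving one value in [-1, 1] per cell for the colored map."""
    v = variate - variate.mean()
    G = grid_series - grid_series.mean(axis=0)
    return (G.T @ v) / np.sqrt((G ** 2).sum(axis=0) * (v ** 2).sum())
```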
Note that an exact geophysical interpretation of the produced maps is beyond the scope of this paper; doing so would require familiarity with geoscience terminology and concepts from our audience. Instead, we present the maps produced by traditional CCA and the maps produced by our technique, as well as plots of the time series of the SPI and NDVI canonical variates. Finally, a high-level interpretation of the results is provided by our domain expert. For brevity, the rest of our discussion will focus on the continent of Africa, which is a representative example where our method finds patterns of interest that were not discovered by traditional CCA.
We apply our technique to the data set of Africa by setting k = 2 and constructing a cluster ensemble of size 200.⁶ The final two clusters were obtained using the average-link agglomerative algorithm applied to the similarity matrix.
Figure 4 (a) shows the maps and the NDVI and SPI time series generated by traditional CCA. Figures 4 (b) and (c) show the maps and the time series for each of the two clusters. Note that each of the maps is associated with the first pair of canonical variates for that dataset/cluster. Inspection of the time series and the spatial patterns that are associated with the canonical variates for each cluster demonstrates that the mixture of CCA approach provides information that is clearly different from the results produced by conventional CCA. For Africa, the interannual dynamics in precipitation are strongly influenced by a complex set of dynamics that depend on El Niño and La Niña, and on the resulting sea surface temperature regimes in the Indian Ocean and the southern Atlantic Ocean off the coast of west Africa. Although exact interpretation of these results requires more study, the maps of Figures 4 (b) and (c) show that the proposed approach was able to isolate important quasi-independent modes of precipitation-vegetation covariability that linear methods are unable to identify. As shown in [9], conventional CCA is effective in isolating precipitation and vegetation anomalies in eastern Africa associated with El Niño, but less successful in isolating similar patterns in the Sahelian region of western Africa. In contrast, Figures 4 (b) and (c) show that the mixture of CCA technique isolates the pattern in eastern Africa, and additionally identifies a mode of covariability in the Sahel that is probably related to ocean-atmosphere dynamics in the southern Atlantic Ocean.
6 Conclusions and Future Work
This paper presented a method for constructing mixtures of local CCA models in an attempt to address the limitations of the conventional CCA approach. We developed a correlation clustering algorithm, which partitions a given data set according to the correlation between two sets of features. We further demonstrated that cluster ensembles can be used to identify the number of clusters in the data and ameliorate the local optima problem of the proposed clustering algorithm. We applied our technique to Earth science data sets. In comparison to traditional CCA, our technique led to interesting and encouraging new discoveries in the data. As an ongoing effort, we will work closely with our domain expert to verify our findings in the data from a geoscience viewpoint. For future work, we would also like to apply our technique to more artificial and real-world data sets that have complex nonlinear correlation structure. Finally, we are developing a probabilistic approach to learning mixtures of CCA models.

⁶We use large ensemble sizes for the Earth science data sets because they contain a small number of instances, making this computationally feasible; larger ensemble sizes also ensure that the clusters we find in the data are not obtained by chance.
References
[1] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56:89–113, 2004.
[2] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[3] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 281–288, 2004.
[4] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
[5] W. Hsieh. Nonlinear canonical correlation analysis by neural networks. Neural Networks, 13:1095–1105, 2000.
[6] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, 1992.
[7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.
[8] P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5):365–377, 2000.
[9] A. Lotsch and M. Friedl. Coupled vegetation-precipitation variability observed from satellite and climate record. Geophysical Research Letters, in submission.
[10] S. Monti, P. Tamayo, J. Mesirov, and T. Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52:91–118, 2003.
[11] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.
[12] M. Tipping and C. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11, 1999.
[13] E. Zorita, V. Kharin, and H. von Storch. The atmospheric circulation and sea surface temperature in the North Atlantic area in winter: Their interaction and relevance for Iberian precipitation. Journal of Climate, 5:1097–1108, 1992.