
Cluster Analysis for Gene

Expression Data: A Survey

Daxin Jiang, Chun Tang, and Aidong Zhang

Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of

genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene

expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of

genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass

of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering

techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying

data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group

are more similar to each other than to points in different groups. A very rich literature on cluster analysis has developed over the past

three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new

algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven

useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of

microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis

for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and

introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various

methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest promising trends in

this field.

Index Terms—Microarray technology, gene expression data, clustering.

1 INTRODUCTION

1.1 Introduction to Microarray Technology

1.1.1 Measuring mRNA Levels

COMPARED with the traditional approach to genomic research, which has focused on the local examination and collection of data on single genes, microarray technologies have now made it possible to monitor the expression levels for tens of thousands of genes in parallel. The two major types of microarray experiments are the cDNA microarray [54] and oligonucleotide arrays (abbreviated oligo chip) [44]. Despite differences in the details of their experimental protocols, both types of experiments involve three common basic procedures [67]:

. Chip manufacture: A microarray is a small chip (made of chemically coated glass, nylon membrane, or silicon), onto which tens of thousands of DNA molecules (probes) are attached in fixed grids. Each grid cell corresponds to a DNA sequence.

. Target preparation, labeling, and hybridization: Typically, two mRNA samples (a test sample and a control sample) are reverse transcribed into cDNA (targets), labeled using either fluorescent dyes or radioactive isotopes, and then hybridized with the probes on the surface of the chip.

. The scanning process: Chips are scanned to read the signal intensity that is emitted from the labeled and hybridized targets.

Generally, both cDNA microarray and oligo chip

experiments measure the expression level for each DNA

sequence by the ratio of signal intensity between the test

sample and the control sample; therefore, data sets resulting

from both methods share the same biological semantics. In

this paper, unless explicitly stated, we will refer to both the

cDNA microarray and the oligo chip as microarray technol-

ogy and term the measurements collected via both methods

as gene expression data.

1.1.2 Preprocessing of Gene Expression Data

A microarray experiment typically assesses a large number

of DNA sequences (genes, cDNA clones, or expressed

sequence tags [ESTs]) under multiple conditions. These

conditions may be a time series during a biological process

(e.g., the yeast cell cycle) or a collection of different tissue

samples (e.g., normal versus cancerous tissues). In this

paper, we will focus on the cluster analysis of gene

expression data without making a distinction among

DNA sequences, which will uniformly be called “genes.”

Similarly, we will uniformly refer to all kinds of experi-

mental conditions as “samples” if no confusion will be

caused. A gene expression data set from a microarray

experiment can be represented by a real-valued expression

matrix M = {w_ij | 1 ≤ i ≤ n, 1 ≤ j ≤ m} (Fig. 1a), where the rows (G = {g⃗_1, ..., g⃗_n}) form the expression patterns of genes, the columns (S = {s⃗_1, ..., s⃗_m}) represent the expression profiles of samples, and each cell w_ij is the measured

1370 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004

. The authors are with the Department of Computer Science and Engineering, State University of New York at Buffalo, 201 Bell Hall, Buffalo, NY 14260. E-mail: {djiang3, chuntang, azhang}@cse.buffalo.edu.

Manuscript received 23 Apr. 2002; revised 20 Mar. 2003; accepted 14 Aug. 2003.

For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number 116396.

1041-4347/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society


expression level of gene i in sample j. Fig. 1b includes some

notation that will be used in the following sections.

The original gene expression matrix obtained from a

scanning process contains noise, missing values, and

systematic variations arising from the experimental proce-

dure. Data preprocessing is indispensable before any cluster

analysis can be performed. Some problems of data pre-

processing have themselves become interesting research

topics. Those questions are beyond the scope of this survey;

an examination of the problem of missing value estimation

appears in [69], and the problem of data normalization is

addressed in [32], [55]. Furthermore, many clustering

approaches apply one or more of the following preproces-

sing procedures: filtering out genes with expression levels

which do not change significantly across samples, perform-

ing a logarithmic transformation of each expression level, or

standardizing each row of the gene expression matrix with

a mean of zero and a variance of one. In the following

discussion of clustering algorithms, we will set aside the

details of preprocessing procedures and assume that the

input data set has already been properly preprocessed.
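The three preprocessing steps just mentioned can be sketched as follows (a minimal illustration only; the fold-change filtering threshold and the use of base-2 logarithms are arbitrary choices of this sketch, not prescriptions from the works cited above):

```python
import math

def preprocess(matrix, min_fold_change=2.0):
    """Illustrative preprocessing of an expression matrix (rows = genes).

    1. Filter out genes whose expression does not change significantly
       across samples (here: max/min ratio below an arbitrary threshold).
    2. Apply a logarithmic transformation to each expression level.
    3. Standardize each row to mean zero and variance one.
    """
    processed = []
    for row in matrix:
        # 1. Filtering (the threshold is an illustrative choice).
        if max(row) / min(row) < min_fold_change:
            continue
        # 2. Logarithmic transformation.
        logged = [math.log2(v) for v in row]
        # 3. Row standardization to zero mean, unit variance.
        mean = sum(logged) / len(logged)
        std = math.sqrt(sum((v - mean) ** 2 for v in logged) / len(logged))
        processed.append([(v - mean) / std for v in logged])
    return processed

rows = preprocess([[1.0, 8.0, 4.0],    # varies 8-fold across samples: kept
                   [5.0, 5.5, 5.2]])   # nearly flat: filtered out
```

The flat gene is removed by step 1, and the surviving row has mean zero and unit variance by construction.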

1.1.3 Applications of Clustering Gene Expression Data

Clustering techniques have proven to be helpful to under-

stand gene function, gene regulation, cellular processes, and

subtypes of cells. Genes with similar expression patterns

(coexpressed genes), which often share similar cellular functions, can be clustered together. This approach may further our understanding of the functions of many genes for which information

has not been previously available [66], [20]. Furthermore,

coexpressed genes in the same cluster are likely to be

involved in the same cellular processes, and a strong

correlation of expression patterns between those genes

indicates coregulation. Searching for common DNA se-

quences at the promoter regions of genes within the same

cluster allows regulatory motifs specific to each gene cluster

to be identified and cis-regulatory elements to be proposed

[9], [66]. The inference of regulation through the clustering

of gene expression data also gives rise to hypotheses

regarding the mechanism of the transcriptional regulatory

network [16]. Finally, clustering different samples on the

basis of corresponding expression profiles may reveal

subcell types which are hard to identify by traditional

morphology-based approaches [2], [24].

1.2 Introduction to Clustering Techniques

In this section, we will first introduce the concepts of clusters and clustering. We will then divide the clustering tasks for gene expression data into three categories according to different clustering purposes. Finally, we will discuss the issue of proximity measure in detail.

1.2.1 Clusters and Clustering

Clustering is the process of grouping data objects into a set

of disjoint classes, called clusters, so that objects within a

class have high similarity to each other, while objects in

separate classes are more dissimilar. Clustering is an

example of unsupervised classification. “Classification” refers

to a procedure that assigns data objects to a set of classes.

“Unsupervised” means that clustering does not rely on

predefined classes and training examples while classifying

the data objects. Thus, clustering is distinguished from

pattern recognition or the areas of statistics known as

discriminant analysis and decision analysis, which seek to

find rules for classifying objects from a given set of

preclassified objects.

1.2.2 Categories of Gene Expression Data Clustering

Currently, a typical microarray experiment contains 10^3 to 10^4 genes, and this number is expected to reach the order of 10^6. However, the number of samples involved in a

microarray experiment is generally less than 100. One of

the characteristics of gene expression data is that it is

meaningful to cluster both genes and samples. On one

hand, coexpressed genes can be grouped in clusters based

on their expression patterns [7], [20]. In such gene-based

clustering, the genes are treated as the objects, while the

samples are the features. On the other hand, the samples

can be partitioned into homogeneous groups. Each group

may correspond to some particular macroscopic phenotype,

such as clinical syndromes or cancer types [24]. Such

sample-based clustering regards the samples as the objects

and the genes as the features. The distinction of gene-based

clustering and sample-based clustering is based on different

characteristics of clustering tasks for gene expression data.

Some clustering algorithms, such as K-means and hierarch-

ical approaches, can be used both to group genes and to

partition samples. We will introduce those algorithms as

gene-based clustering approaches and will discuss how to

apply them as sample-based clustering in Section 2.2.1.


Fig. 1. (a) A gene expression matrix. (b) Notation in this paper.


Both the gene-based and sample-based clustering ap-

proaches search exclusive and exhaustive partitions of

objects that share the same feature space. However, current

thinking in molecular biology holds that only a small subset

of genes participate in any cellular process of interest and

that a cellular process takes place only in a subset of the

samples. This belief calls for the subspace clustering to

capture clusters formed by a subset of genes across a subset

of samples. For subspace clustering algorithms, genes and

samples are treated symmetrically, so that either genes or

samples can be regarded as objects or features. Further-

more, clusters generated through such algorithms may have

different feature spaces.

While a gene expression matrix can be analyzed from

different angles, gene-based clustering, sample-based clustering, and subspace clustering face very different

challenges. Thus, we may have to adopt very different

computational strategies in the three situations. The details

of the challenges and the representative clustering techni-

ques pertinent to each clustering category will be discussed

in Section 2.

1.2.3 Proximity Measurement for Gene Expression Data

Proximity measurement measures the similarity (or distance)

between two data objects. Gene expression data objects, whether genes or samples, can be formalized as numerical vectors O⃗_i = {o_ij | 1 ≤ j ≤ p}, where o_ij is the value of the jth feature for the ith data object and p is the number of features. The proximity between two objects O_i and O_j is measured by a proximity function of the corresponding vectors O⃗_i and O⃗_j.

Euclidean distance is one of the most commonly used

methods to measure the distance between two data objects.

The distance between objects Oi and Oj in p-dimensional

space is defined as:

Euclidean(O_i, O_j) = √( Σ_{d=1}^{p} (o_{id} − o_{jd})² ).

However, for gene expression data, the overall shapes of

gene expression patterns (or profiles) are of greater interest

than the individual magnitudes of each feature. Euclidean

distance does not score well for shifting or scaled patterns

(or profiles) [71]. To address this problem, each object

vector is standardized with zero mean and variance one

before calculating the distance [66], [59], [56].

An alternate measure is Pearson’s correlation coefficient,

which measures the similarity between the shapes of two

expression patterns (profiles). Given two data objects Oi

and Oj, Pearson’s correlation coefficient is defined as

Pearson(O_i, O_j) = Σ_{d=1}^{p} (o_{id} − ō_i)(o_{jd} − ō_j) / ( √(Σ_{d=1}^{p} (o_{id} − ō_i)²) · √(Σ_{d=1}^{p} (o_{jd} − ō_j)²) ),

where ō_i and ō_j are the means for O⃗_i and O⃗_j, respectively.

Pearson’s correlation coefficient views each object as a

random variable with p observations and measures the

similarity between two objects by calculating the linear

relationship between the distributions of the two corre-

sponding random variables.
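As a concrete illustration, the definition above can be transcribed directly into code (a plain-Python sketch; in practice a library routine would be used):

```python
def pearson(x, y):
    """Pearson's correlation coefficient between two expression vectors."""
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Two patterns with the same shape but different magnitudes are
# perfectly correlated, which is why this measure captures "shape."
r = pearson([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```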

Pearson’s correlation coefficient is widely used and has

proven effective as a similarity measure for gene

expression data [36], [64], [65], [74]. However, empirical

study has shown that it is not robust with respect to

outliers [30], thus potentially yielding false positives which

assign a high similarity score to a pair of dissimilar

patterns. If two patterns have a common peak or valley at

a single feature, the correlation will be dominated by this

feature, although the patterns at the remaining features

may be completely dissimilar. This observation evoked an

improved measure called Jackknife correlation [19], [30],

defined as Jackknife(O_i, O_j) = min{ ρ_ij^(1), ..., ρ_ij^(l), ..., ρ_ij^(p) }, where ρ_ij^(l) is the Pearson's correlation coefficient of data

objects Oi and Oj with the lth feature deleted. Use of the

Jackknife correlation avoids the “dominance effect” of

single outliers. More general versions of Jackknife

correlation that are robust to more than one outlier can

similarly be derived. However, the generalized Jackknife

correlation, which would involve the enumeration of

different combinations of features to be deleted, would be

computationally costly and is rarely used.
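The leave-one-out computation described above can be sketched as follows (the data are illustrative, and Pearson's coefficient is restated so the snippet is self-contained):

```python
def pearson(a, b):
    p = len(a)
    ma, mb = sum(a) / p, sum(b) / p
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den

def jackknife(x, y):
    """Minimum Pearson correlation over all single-feature deletions,
    guarding against the 'dominance effect' of one outlying feature."""
    return min(pearson(x[:l] + x[l + 1:], y[:l] + y[l + 1:])
               for l in range(len(x)))

# Two otherwise anticorrelated patterns share one extreme spike:
# Pearson is dominated by the spike, while Jackknife deletes it.
x = [1.0, 2.0, 1.0, 2.0, 100.0]
y = [2.0, 1.0, 2.0, 1.0, 100.0]
naive = pearson(x, y)     # close to +1: a false positive
robust = jackknife(x, y)  # close to -1: the mismatch is exposed
```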

Another drawback of Pearson’s correlation coefficient is

that it assumes an approximate Gaussian distribution of the

points and may not be robust for non-Gaussian distribu-

tions [14], [16]. To address this, the Spearman’s rank-order

correlation coefficient has been suggested as the similarity

measure. The ranking correlation is derived by replacing

the numerical expression level o_id with its rank r_id among all conditions. For example, r_id = 3 if o_id is the third highest value among o_ik, where 1 ≤ k ≤ p. Spearman's correlation

coefficient does not require the assumption of Gaussian

distribution and is more robust against outliers than

Pearson’s correlation coefficient. However, as a conse-

quence of ranking, a significant amount of information

present in the data is lost. Our experimental results indicate

that, on average, Spearman’s rank-order correlation coeffi-

cient does not perform as well as Pearson’s correlation

coefficient.
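The rank replacement described above can be sketched as follows (ties are not handled, so this is an illustration rather than a complete Spearman implementation):

```python
def pearson(a, b):
    p = len(a)
    ma, mb = sum(a) / p, sum(b) / p
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den

def ranks(v):
    """Replace each value by its rank among all conditions (no tie handling)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's rank-order correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# An extreme value distorts Pearson on the raw data but not on the ranks:
# both vectors increase monotonically, so Spearman reports a perfect +1.
rho = spearman([1.0, 2.0, 3.0, 100.0], [1.0, 2.0, 3.0, 4.0])
```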

Almost all of the clustering algorithms mentioned in this

survey use either Euclidean distance or Pearson’s correla-

tion coefficient as the proximity measure. When Euclidean

distance is selected as the proximity measure, the standardization process o′_id = (o_id − μ_Oi) / σ_Oi is usually applied, where o_id is the dth feature of object O_i, while μ_Oi and σ_Oi are the mean and standard deviation of O⃗_i, respectively. Suppose O′_i and O′_j are the standardized “objects” of O_i and O_j. Then, we can prove Pearson(O_i, O_j) = Pearson(O′_i, O′_j) and

Euclidean(O′_i, O′_j) = √( 2p (1 − Pearson(O′_i, O′_j)) ).

These two equations disclose the consistency between Pearson's correlation coefficient and Euclidean distance after data standardization, i.e., if a pair of data objects O_i1, O_j1 has a higher correlation than pair O_i2, O_j2 (Pearson(O′_i1, O′_j1) > Pearson(O′_i2, O′_j2)), then pair O_i1, O_j1 has a smaller distance than pair O_i2, O_j2 (Euclidean(O′_i1, O′_j1) < Euclidean(O′_i2, O′_j2)). Thus, we can expect the effectiveness of a clustering algorithm to be equivalent whether Euclidean distance or


Pearson’s correlation coefficient is chosen as the proximity

measure.
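The consistency between the two measures after standardization is easy to verify numerically (a sketch with arbitrary example vectors; standardization here uses the population standard deviation, which is what makes Σ o′² = p and hence the identity hold):

```python
import math

def standardize(v):
    """z-score a vector: zero mean, unit (population) variance."""
    mean = sum(v) / len(v)
    std = math.sqrt(sum((x - mean) ** 2 for x in v) / len(v))
    return [(x - mean) / std for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    p = len(a)
    ma, mb = sum(a) / p, sum(b) / p
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

oi = [2.0, 4.0, 6.0, 3.0, 5.0]   # arbitrary example vectors
oj = [1.0, 5.0, 4.0, 2.0, 8.0]
si, sj = standardize(oi), standardize(oj)
p = len(oi)

lhs = euclidean(si, sj)
rhs = math.sqrt(2 * p) * math.sqrt(1 - pearson(si, sj))
# lhs equals rhs, and Pearson is unchanged by the standardization.
```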

2 CLUSTERING ALGORITHMS

As we mentioned in Section 1.2.2, a gene expression matrix can be analyzed in two ways. For gene-based clustering, genes are treated as data objects, while samples are considered as features. Conversely, for sample-based clustering, samples serve as data objects to be clustered, while genes play the role

of features. The third category of cluster analysis applied to

gene expression data, which is subspace clustering, treats

genes and samples symmetrically such that either genes or

samples can be regarded as objects or features. Gene-based,

sample-based, and subspace clustering face very different

challenges, and different computational strategies are

adopted for each situation. In this section, we will introduce

the gene-based clustering, sample-based clustering, and

subspace clustering techniques, respectively.

2.1 Gene-Based Clustering

In this section, we will discuss the problem of clustering genes based on their expression patterns. The purpose of gene-based clustering is to group together coexpressed genes which indicate cofunction and coregulation. We will first present the challenges of gene-based clustering and then review a series of clustering algorithms which have been applied to group genes. For each clustering algorithm, we will first introduce the basic idea of the clustering process and then highlight some features of the algorithm.

2.1.1 Challenges of Gene Clustering

Due to the special characteristics of gene expression data,

and the particular requirements from the biological domain,

gene-based clustering presents several new challenges and

is still an open problem.

. First, cluster analysis is typically the first step in data mining and knowledge discovery. The purpose of clustering gene expression data is to reveal the natural data structures and gain some initial insights regarding data distribution. Therefore, a good clustering algorithm should depend as little as possible on prior knowledge, which is usually not available before cluster analysis. For example, a clustering algorithm which can accurately estimate the “true” number of clusters in the data set would be more favored than one requiring a predetermined number of clusters.

. Second, due to the complex procedures of microarray experiments, gene expression data often contain a huge amount of noise. Therefore, clustering algorithms for gene expression data should be capable of extracting useful information from a high level of background noise.

. Third, our empirical study has demonstrated that gene expression data are often “highly connected” [37], and clusters may be highly intersected with each other or even embedded one in another [36]. Therefore, algorithms for gene-based clustering should be able to effectively handle this situation.

. Finally, users of microarray data may be interested not only in the clusters of genes, but also in the relationships between the clusters (e.g., which clusters are close to each other and which are remote) and the relationships between the genes within the same cluster (e.g., which gene can be considered representative of the cluster and which genes lie at the boundary of the cluster). A clustering algorithm which can not only partition the data set but also provide some graphical representation of the cluster structure would be more favored by biologists.

2.1.2 K-Means

The K-means algorithm [46] is a typical partition-based

clustering method. Given a prespecified number K, the

algorithm partitions the data set into K disjoint subsets

which optimize the following objective function:

E = Σ_{i=1}^{K} Σ_{O ∈ C_i} |O − μ_i|².

Here, O is a data object in cluster C_i and μ_i is the centroid (mean of objects) of C_i. Thus, the objective function E tries

to minimize the sum of the squared distances of objects

from their cluster centers.
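A minimal sketch of the alternation between the assignment and update steps implied by the objective function E (random initialization and a fixed iteration count are simplifications of this sketch; real implementations use smarter seeding and convergence tests):

```python
import random

def kmeans(objects, k, iterations=100, seed=0):
    """Plain K-means over expression vectors (an illustrative sketch)."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(objects, k)]
    assignment = [0] * len(objects)
    for _ in range(iterations):
        # Assignment step: each object goes to its nearest centroid.
        for idx, obj in enumerate(objects):
            assignment[idx] = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(obj, centroids[i])))
        # Update step: each centroid becomes the mean of its members.
        for i in range(k):
            members = [o for o, a in zip(objects, assignment) if a == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Two well-separated groups of two-sample expression patterns.
data = [[0.1, 0.0], [0.0, 0.2], [5.0, 5.1], [5.2, 4.9]]
labels, _ = kmeans(data, k=2)
# The first two objects share one label, the last two the other.
```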

The K-means algorithm is simple and fast. The time

complexity of K-means is O(l · k · n), where l is the number of iterations, k is the number of clusters, and n is the number of data objects.

study has shown that the K-means algorithm typically

converges in a small number of iterations. However, it also

has several drawbacks as a gene-based clustering algorithm.

First, the number of gene clusters in a gene expression data

set is usually unknown in advance. To detect the optimal

number of clusters, users usually run the algorithms

repeatedly with different values of k and compare the

clustering results. For a large gene expression data set

which contains thousands of genes, this extensive para-

meter fine-tuning process may not be practical. Second,

gene expression data typically contain a huge amount of

noise; however, the K-means algorithm forces each gene

into a cluster, which may cause the algorithm to be sensitive

to noise [59], [57].

Recently, several new clustering algorithms [51], [31],

[59] have been proposed to overcome the drawbacks of the

K-means algorithm. These algorithms typically use some

global parameters to control the quality of resulting clusters

(e.g., the maximal radius of a cluster and/or the minimal

distance between clusters). Clustering is the process of

extracting all of the qualified clusters from the data set. In

this way, the number of clusters can be automatically

determined and those data objects which do not belong to

any qualified clusters are regarded as outliers. However,

the qualities of clusters in gene expression data sets may

vary widely. Thus, it is often a difficult problem to choose

the appropriate globally constraining parameters.

2.1.3 Self-Organizing Map

The Self-Organizing Map (SOM) was developed by Koho-

nen [39], on the basis of a single layered neural network.


The data objects are presented at the input and the output

neurons are organized with a simple neighborhood

structure such as a two-dimensional p × q grid. Each neuron

of the neural network is associated with a reference vector,

and each data point is “mapped” to the neuron with the

“closest” reference vector. In the process of running the

algorithm, each data object acts as a training sample which

directs the movement of the reference vectors towards the

denser areas of the input vector space, so that those

reference vectors are trained to fit the distributions of the

input data set. When the training is complete, clusters are

identified by mapping all data points to the output neurons.

One of the remarkable features of SOM is that it

generates an intuitively appealing map of a high-dimen-

sional data set in 2D or 3D space and places similar clusters

near each other. The neuron training process of SOM

provides a relatively more robust approach than K-means

to the clustering of highly noisy data [62], [29]. However,

SOM requires users to input the number of clusters and the

grid structure of the neuron map. These two parameters are

preserved through the training process; hence, improperly specified parameters will prevent the recovery of the

natural cluster structure. Furthermore, if the data set is

abundant with irrelevant data points, such as genes with

invariant patterns, SOM will produce an output in which

this type of data will populate the vast majority of clusters

[29]. In this case, SOM is not effective because most of the

interesting patterns may be merged into only one or two

clusters and cannot be identified.
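The training loop described above can be sketched for a small one-dimensional map (the learning-rate and neighborhood schedules here are illustrative assumptions; Kohonen's formulation typically uses a 2D grid and smoother decay):

```python
import random

def train_som(data, grid_size=4, epochs=50, seed=0):
    """Train a 1D self-organizing map: each output neuron holds a reference
    vector that is dragged toward the samples it (or a grid neighbor) wins,
    fitting the map to the distribution of the input data."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(grid_size)]
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)              # decaying learning rate
        radius = 1 if epoch < epochs // 2 else 0     # shrinking neighborhood
        for x in data:
            # Best-matching unit: the neuron with the closest reference vector.
            bmu = min(range(grid_size),
                      key=lambda i: sum((a - w) ** 2 for a, w in zip(x, weights[i])))
            # Move the BMU and its grid neighbors toward the sample.
            for i in range(grid_size):
                if abs(i - bmu) <= radius:
                    weights[i] = [w + lr * (a - w) for w, a in zip(weights[i], x)]
    return weights

def map_to_neurons(data, weights):
    """After training, clusters are read off by mapping each point to its BMU."""
    return [min(range(len(weights)),
                key=lambda i: sum((a - w) ** 2 for a, w in zip(x, weights[i])))
            for x in data]

data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
weights = train_som(data)
labels = map_to_neurons(data, weights)
# The two tight groups map to disjoint sets of output neurons.
```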

2.1.4 Hierarchical Clustering

In contrast to partition-based clustering, which attempts to

directly decompose the data set into a set of disjoint

clusters, hierarchical clustering generates a hierarchical

series of nested clusters which can be graphically repre-

sented by a tree, called dendrogram. The branches of a

dendrogram not only record the formation of the clusters

but also indicate the similarity between the clusters. By

cutting the dendrogram at some level, we can obtain a

specified number of clusters. By reordering the objects such

that the branches of the corresponding dendrogram do not

cross, the data set can be arranged with similar objects

placed together.

Hierarchical clustering algorithms can be further divided

into agglomerative approaches and divisive approaches based

on how the hierarchical dendrogram is formed. Agglom-

erative algorithms (bottom-up approach) initially regard

each data object as an individual cluster, and at each step,

merge the closest pair of clusters until all the groups are

merged into one cluster. Divisive algorithms (top-down

approach) start with one cluster containing all the data objects and, at each step, split a cluster until only singleton clusters of individual objects remain. For agglomerative approaches,

different measures of cluster proximity, such as single link,

complete link, and minimum-variance [18], [38], derive

various merge strategies. For divisive approaches, the

essential problem is to decide how to split clusters at each

step. Some are based on heuristic methods such as the

deterministic annealing algorithm [3], while many others

are based on the graph theoretical methods which we will

discuss later.
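As an illustration of the bottom-up process, the following single-link agglomerative pass merges the closest pair of clusters until k clusters remain, which is equivalent to cutting the dendrogram at that level (a naive sketch, not an efficient implementation):

```python
def single_link_clusters(objects, k):
    """Agglomerative hierarchical clustering with single-link proximity:
    each cluster is a list of object indices; at every step the two
    clusters whose closest members are nearest are merged."""
    clusters = [[i] for i in range(len(objects))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def link(c1, c2):
        # Single link: distance between the closest pair of members.
        return min(dist(objects[i], objects[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# One-dimensional toy profiles: two tight groups plus a distant outlier.
groups = single_link_clusters([[0.0], [0.1], [0.9], [1.0], [5.0]], k=2)
```

Cutting at k = 2 leaves the outlier in its own cluster, since single link merges the two tight groups first.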

Eisen et al. [20] applied an agglomerative algorithm

called UPGMA (Unweighted Pair Group Method with

Arithmetic Mean) and adopted a method to graphically

represent the clustered data set. In this method, each cell of

the gene expression matrix is colored on the basis of the

measured fluorescence ratio and the rows of the matrix are

reordered based on the hierarchical dendrogram structure

and a consistent node-ordering rule. After clustering, the

original gene expression matrix is represented by a colored

table (a cluster image) where large contiguous patches of

color represent groups of genes that share similar expres-

sion patterns over multiple conditions.

Alon et al. [3] split the genes through a divisive

approach, called the deterministic-annealing algorithm

(DAA) [53], [52]. First, two initial cluster centroids Cj,

j ¼ 1;2, were randomly defined. The expression pattern of

gene k was represented by a vector~ g gk, and the probability of

gene k belonging to cluster j was assigned according to a

two-component Gaussian model:

P_j(g⃗_k) = exp(−β |g⃗_k − C_j|²) / Σ_j exp(−β |g⃗_k − C_j|²).

The cluster centroids were recalculated by

C_j = Σ_k g⃗_k P_j(g⃗_k) / Σ_k P_j(g⃗_k).

An iterative process (the EM algorithm) was then applied to

solve Pj and Cj (the details of the EM algorithm will be

discussed later). For β = 0, there was only one cluster, C_1 = C_2. When β was increased in small steps until a

threshold was reached, two distinct, converged centroids

emerged. The whole data set was recursively split until each

cluster contained only one gene.
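The split step described above can be sketched as an EM-style iteration at a fixed β (the data and the symmetric centroid initialization are assumptions of this illustration; Alon et al. additionally anneal β upward and split recursively):

```python
import math

def daa_split(vectors, beta, iterations=20):
    """One split step of the deterministic-annealing approach: soft-assign
    each gene vector to two centroids via the two-component Gaussian model,
    then recompute the centroids, iterating EM-style at a fixed beta."""
    dim = len(vectors[0])
    # Symmetric initialization: perturb the global mean in both directions.
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    centroids = [[m + 0.01 for m in mean], [m - 0.01 for m in mean]]
    for _ in range(iterations):
        # E-step: P_j(g) proportional to exp(-beta * |g - C_j|^2).
        probs = []
        for v in vectors:
            w = [math.exp(-beta * sum((a - c) ** 2 for a, c in zip(v, cj)))
                 for cj in centroids]
            total = sum(w)
            probs.append([x / total for x in w])
        # M-step: C_j = sum_k g_k P_j(g_k) / sum_k P_j(g_k).
        for j in (0, 1):
            norm = sum(p[j] for p in probs)
            centroids[j] = [sum(v[d] * p[j] for v, p in zip(vectors, probs)) / norm
                            for d in range(dim)]
    return centroids

# At a sufficiently large beta, the centroids separate toward the two groups.
cs = daa_split([[0.0, 0.0], [0.2, 0.1], [3.0, 3.1], [3.2, 2.9]], beta=5.0)
```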

Hierarchical clustering not only groups together genes

with similar expression pattern but also provides a natural

way to graphically represent the data set. The graphic

representation allows users to thoroughly inspect the whole data set and obtain an initial impression of the

distribution of data. Eisen’s method is much favored by

many biologists and has become the most widely used tool

in gene expression data analysis [20], [3], [2], [33], [50].

However, the conventional agglomerative approach suffers

from a lack of robustness [62], i.e., a small perturbation of

the data set may greatly change the structure of the

hierarchical dendrogram. Another drawback of the hier-

archical approach is its high-computational complexity. To

construct a “complete” dendrogram (where each leaf node corresponds to one data object and the root node corresponds to the whole data set), the clustering process should take (n² − n)/2 merging (or splitting) steps. The time

complexity for a typical agglomerative hierarchical algo-

rithm is Oðn2lognÞ [34]. Furthermore, for both agglomera-

tive and divisive approaches, the “greedy” nature of

hierarchical clustering prevents the refinement of the

previous clustering. If a “bad” decision is made in the

initial steps, it can never be corrected in the following steps.


2.1.5 Graph-Theoretical Approaches

Given a data set X, we can construct a proximity matrix P,

where P[i, j] = proximity(O_i, O_j), and a weighted graph
