Page 1

Cluster Analysis for Gene

Expression Data: A Survey

Daxin Jiang, Chun Tang, and Aidong Zhang

Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of

genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene

expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of

genes and the complexity of biological networks greatly increases the challenges of comprehending and interpreting the resulting mass

of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering

techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying

data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group

are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past

three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new

algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven

useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of

microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis

for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and

introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various

methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in

this field.

Index Terms—Microarray technology, gene expression data, clustering.

?

1

1.1

INTRODUCTION

Introduction to Microarray Technology

1.1.1 Measuring mRNA Levels

C

and collection of data on single genes, microarray technol-

ogies have now made it possible to monitor the expression

levels for tens of thousands of genes in parallel. The two

major types of microarray experiments are the cDNA

microarray [54] and oligonucleotide arrays (abbreviated oligo

chip) [44]. Despite differences in the details of their

experiment protocols, both types of experiments involve

three common basic procedures [67]:

OMPARED with the traditional approach to genomic

research, which has focused on the local examination

.

Chip manufacture: A microarray is a small chip (made

of chemically coated glass, nylon membrane, or

silicon), onto which tens of thousands of DNA

molecules (probes) are attached in fixed grids. Each

grid cell relates to a DNA sequence.

Target preparation, labeling, and hybridization: Typi-

cally, two mRNA samples (a test sample and a

control sample) are reverse transcribed into cDNA

(targets), labeled using either fluorescent dyes or

radioactive isotopics, and then hybridized with the

probes on the surface of the chip.

.

.

The scanning process: Chips are scanned to read the

signal intensity that is emitted from the labeled and

hybridized targets.

Generally, both cDNA microarray and oligo chip

experiments measure the expression level for each DNA

sequence by the ratio of signal intensity between the test

sample and the control sample, therefore, data sets resulting

from both methods share the same biological semantics. In

this paper, unless explicitly stated, we will refer to both the

cDNA microarray and the oligo chip as microarray technol-

ogy and term the measurements collected via both methods

as gene expression data.

1.1.2 Preprocessing of Gene Expression Data

A microarray experiment typically assesses a large number

of DNA sequences (genes, cDNA clones, or expressed

sequence tags [ESTs]) under multiple conditions. These

conditions may be a time series during a biological process

(e.g., the yeast cell cycle) or a collection of different tissue

samples (e.g., normal versus cancerous tissues). In this

paper, we will focus on the cluster analysis of gene

expression data without making a distinction among

DNA sequences, which will uniformly be called “genes.”

Similarly, we will uniformly refer to all kinds of experi-

mental conditions as “samples” if no confusion will be

caused. A gene expression data set from a microarray

experiment can be represented by a real-valued expression

matrix M ¼ fwijj 1 ? i ? n;1 ? j ? mg (Fig. 1a), where the

rows (G ¼ f~ g1 g1;...; ~ gn

genes, the columns (S ¼ f~ s1 s1;...; ~

sion profiles of samples, and each cell wijis the measured

gng) form the expression patterns of

sm

smg) represent the expres-

1370 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,VOL. 16,NO. 11,NOVEMBER 2004

. The authors are with the Department of Computer Science and

Engineering, State University of New York at Buffalo, 201 Bell Hall,

Buffalo, NY 14260. E-mail: {djiang3, chuntang, azhang}@cse.buffalo.edu.

Manuscript received 23 Apr. 2002; revised 20 Mar. 2003; accepted 14 Aug.

2003.

For information on obtaining reprints of this article, please send e-mail to:

tkde@computer.org, and reference IEEECS Log Number 116396.

1041-4347/04/$20.00 ? 2004 IEEE Published by the IEEE Computer Society

Page 2

expression level of gene i in sample j. Fig. 1b includes some

notation that will be used in the following sections.

The original gene expression matrix obtained from a

scanning process contains noise, missing values, and

systematic variations arising from the experimental proce-

dure. Data preprocessing is indispensable before any cluster

analysis can be performed. Some problems of data pre-

processing have themselves become interesting research

topics. Those questions are beyond the scope of this survey;

an examination of the problem of missing value estimation

appears in [69], and the problem of data normalization is

addressed in [32], [55]. Furthermore, many clustering

approaches apply one or more of the following preproces-

sing procedures: filtering out genes with expression levels

which do not change significantly across samples, perform-

ing a logarithmic transformation of each expression level, or

standardizing each row of the gene expression matrix with

a mean of zero and a variance of one. In the following

discussion of clustering algorithms, we will set aside the

details of preprocessing procedures and assume that the

input data set has already been properly preprocessed.

1.1.3 Applications of Clustering Gene Expression Data

Clustering techniques have proven to be helpful to under-

stand gene function, gene regulation, cellular processes, and

subtypes of cells. Genes with similar expression patterns

(coexpressed genes) can be clustered together with similar

cellular functions. This approach may further understand-

ing of the functions of many genes for which information

has not been previously available [66], [20]. Furthermore,

coexpressed genes in the same cluster are likely to be

involved in the same cellular processes, and a strong

correlation of expression patterns between those genes

indicates coregulation. Searching for common DNA se-

quences at the promoter regions of genes within the same

cluster allows regulatory motifs specific to each gene cluster

to be identified and cis-regulatory elements to be proposed

[9], [66]. The inference of regulation through the clustering

of gene expression data also gives rise to hypotheses

regarding the mechanism of the transcriptional regulatory

network [16]. Finally, clustering different samples on the

basis of corresponding expression profiles may reveal

subcell types which are hard to identify by traditional

morphology-based approaches [2], [24].

1.2

In this section, we will first introduce the concepts of clusters

and clustering. We will then divide the clustering tasks for

gene expression data into three categories according to

different clustering purposes. Finally, we will discuss the

issue of proximity measure in detail.

Introduction to Clustering Techniques

1.2.1 Clusters and Clustering

Clustering is the process of grouping data objects into a set

of disjoint classes, called clusters, so that objects within a

class have high similarity to each other, while objects in

separate classes are more dissimilar. Clustering is an

example of unsupervised classification. “Classification” refers

to a procedure that assigns data objects to a set of classes.

“Unsupervised” means that clustering does not rely on

predefined classes and training examples while classifying

the data objects. Thus, clustering is distinguished from

pattern recognition or the areas of statistics known as

discriminant analysis and decision analysis, which seek to

find rules for classifying objects from a given set of

preclassified objects.

1.2.2 Categories of Gene Expression Data Clustering

Currently, a typical microarray experiment contains 103to

104genes, and this number is expected to reach to the order

of 106. However, the number of samples involved in a

microarray experiment is generally less than 100. One of

the characteristics of gene expression data is that it is

meaningful to cluster both genes and samples. On one

hand, coexpressed genes can be grouped in clusters based

on their expression patterns [7], [20]. In such gene-based

clustering, the genes are treated as the objects, while the

samples are the features. On the other hand, the samples

can be partitioned into homogeneous groups. Each group

may correspond to some particular macroscopic phenotype,

such as clinical syndromes or cancer types [24]. Such

sample-based clustering regards the samples as the objects

and the genes as the features. The distinction of gene-based

clustering and sample-based clustering is based on different

characteristics of clustering tasks for gene expression data.

Some clustering algorithms, such as K-means and hierarch-

ical approaches, can be used both to group genes and to

partition samples. We will introduce those algorithms as

gene-based clustering approaches and will discuss how to

apply them as sample-based clustering in Section 2.2.1.

JIANG ET AL.: CLUSTER ANALYSIS FOR GENE EXPRESSION DATA: A SURVEY1371

Fig. 1. (a) A gene expression matrix. (b) Notation in this paper.

Page 3

Both the gene-based and sample-based clustering ap-

proaches search exclusive and exhaustive partitions of

objects that share the same feature space. However, current

thinking in molecular biology holds that only a small subset

of genes participate in any cellular process of interest and

that a cellular process takes place only in a subset of the

samples. This belief calls for the subspace clustering to

capture clusters formed by a subset of genes across a subset

of samples. For subspace clustering algorithms, genes and

samples are treated symmetrically, so that either genes or

samples can be regarded as objects or features. Further-

more, clusters generated through such algorithms may have

different feature spaces.

While a gene expression matrix can be analyzed from

different angles, the gene-based, sample-based clustering,

and subspace clustering analysis face very different

challenges. Thus, we may have to adopt very different

computational strategies in the three situations. The details

of the challenges and the representative clustering techni-

ques pertinent to each clustering category will be discussed

in Section 2.

1.2.3 Proximity Measurement for Gene Expression Data

Proximity measurement measures the similarity (or distance)

between two data objects. Gene expression data objects, no

matter genes or samples, can be formalized as numerical

vectors~ O Oi¼ foijj1 ? j ? pg, where oijis the value of the jth

feature for the ith data object and p is the number of

features. The proximity between two objects Oi and Oj is

measured by a proximity function of corresponding vectors

~ O Oiand~ O Oj.

Euclidean distance is one of the most commonly used

methods to measure the distance between two data objects.

The distance between objects Oi and Oj in p-dimensional

space is defined as:

EuclideanðOi;OjÞ ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

d¼1

X

p

ðoid? ojdÞ2

s

:

However, for gene expression data, the overall shapes of

gene expression patterns (or profiles) are of greater interest

than the individual magnitudes of each feature. Euclidean

distance does not score well for shifting or scaled patterns

(or profiles) [71]. To address this problem, each object

vector is standardized with zero mean and variance one

before calculating the distance [66], [59], [56].

An alternate measure is Pearson’s correlation coefficient,

which measures the similarity between the shapes of two

expression patterns (profiles). Given two data objects Oi

and Oj, Pearson’s correlation coefficient is defined as

Pp

PearsonðOi;OjÞ ¼

d¼1ðoid? ?oiÞðojd? ?ojÞ

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

Pp

d¼1ðoid? ?oiÞ2

q

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

Pp

d¼1ðojd? ?ojÞ2

q

;

where ?oiand ?ojare the means for~ O Oiand~ O Oj, respectively.

Pearson’s correlation coefficient views each object as a

random variable with p observations and measures the

similarity between two objects by calculating the linear

relationship between the distributions of the two corre-

sponding random variables.

Pearson’s correlation coefficient is widely used and has

proven effective as a similarity measure for gene

expression data [36], [64], [65], [74]. However, empirical

study has shown that it is not robust with respect to

outliers [30], thus potentially yielding false positives which

assign a high similarity score to a pair of dissimilar

patterns. If two patterns have a common peak or valley at

a single feature, the correlation will be dominated by this

feature, although the patterns at the remaining features

may be completely dissimilar. This observation evoked an

improved measure called Jackknife correlation [19], [30],

defined as JackknifeðOi;OjÞ ¼ minf?ð1Þ

where ?ðlÞ

ijis the Pearson’s correlation coefficient of data

objects Oi and Oj with the lth feature deleted. Use of the

Jackknife correlation avoids the “dominance effect” of

single outliers. More general versions of Jackknife

correlation that are robust to more than one outlier can

similarly be derived. However, the generalized Jackknife

correlation, which would involve the enumeration of

different combinations of features to be deleted, would be

computationally costly and is rarely used.

Another drawback of Pearson’s correlation coefficient is

that it assumes an approximate Gaussian distribution of the

points and may not be robust for non-Gaussian distribu-

tions [14], [16]. To address this, the Spearman’s rank-order

correlation coefficient has been suggested as the similarity

measure. The ranking correlation is derived by replacing

the numerical expression level oidwith its rank ridamong all

conditions. For example, rid¼ 3 if oid is the third highest

value among oik, where 1 ? k ? p. Spearman’s correlation

coefficient does not require the assumption of Gaussian

distribution and is more robust against outliers than

Pearson’s correlation coefficient. However, as a conse-

quence of ranking, a significant amount of information

present in the data is lost. Our experimental results indicate

that, on average, Spearman’s rank-order correlation coeffi-

cient does not perform as well as Pearson’s correlation

coefficient.

Almost all of the clustering algorithms mentioned in this

survey use either Euclidean distance or Pearson’s correla-

tion coefficient as the proximity measure. When Euclidean

distance is selected as proximity measure, the standardiza-

tion process~ O O0

?Oi

the dth feature of object Oi, while ?Oiand ?Oiare the mean

and standard deviation of~ O Oi, respectively. Suppose O0

O0

can prove PearsonðOi;OjÞ ¼ PearsonðO0

ffiffiffiffiffi

These two equations disclose the consistency between

Pearson’s correlation coefficient and Euclidean distance

after data standardization, i.e., if a pair of data objects

Oi1;Oj1has a higher correlation than pair

ij;...;?ðlÞ

ij;...;?ðpÞ

ijg,

id¼

~ O Oid??Oi

is usually applied, where~ O Oid is

iand

jare the standardized “objects” of Oi and Oj. Then, we

i;O0

jÞ and

EuclideanðO0

i;O0

jÞ ¼

2p

p

ð

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

1 ? PearsonðO0

i;O0

jÞ

q

Þ:

Oi2;Oj2ðPearsonðO0

then pair Oi1;Oj1has a smaller distance than pair

i1;O0

j1Þ > PearsonðO0

i2;O0

j2ÞÞ;

Oi2;Oj2ðEuclideanðO0

Thus, we can expect the effectiveness of a clustering

algorithm to be equivalent whether Euclidean distance or

i1;O0

j1Þ < EuclideanðO0

i2;O0

j2ÞÞ:

1372IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,VOL. 16, NO. 11,NOVEMBER 2004

Page 4

Pearson’s correlation coefficient is chosen as the proximity

measure.

2CLUSTERING ALGORITHMS

AswementionedinSection1.2.2,geneexpressionmatrixcan

be analyzed in two ways. For gene-based clustering, genes are

treated as data objects, while samples are considered as

features. Conversely, for sample-based clustering, samples

serveasdataobjectstobeclustered,whilegenesplaytherole

of features. The third category of cluster analysis applied to

gene expression data, which is subspace clustering, treats

genes and samples symmetrically such that either genes or

samples can be regarded as objects or features. Gene-based,

sample-based, and subspace clustering face very different

challenges, and different computational strategies are

adopted for each situation. In this section, we will introduce

the gene-based clustering, sample-based clustering, and

subspace clustering techniques, respectively.

2.1

In this section, we will discuss the problem of clustering

genes based on their expression patterns. The purpose of

gene-based clustering is to group together coexpressed

genes which indicate cofunction and coregulation. We will

first present the challenges of gene-based clustering and

then review a series of clustering algorithms which have

been applied to group genes. For each clustering algorithm,

we will first introduce the basic idea of the clustering

process, and then highlight some features of the algorithm.

Gene-Based Clustering

2.1.1 Challenges of Gene Clustering

Due to the special characteristics of gene expression data,

and the particular requirements from the biological domain,

gene-based clustering presents several new challenges and

is still an open problem.

.

First, cluster analysis is typically the first step in data

mining and knowledge discovery. The purpose of

clustering gene expression data is to reveal the

natural data structures and gain some initial insights

regarding data distribution. Therefore, a good

clustering algorithm should depend as little as

possible on prior knowledge, which is usually not

available before cluster analysis. For example, a

clustering algorithm which can accurately estimate

the “true” number of clusters in the data set would

be more favored than one requiring the predeter-

mined number of clusters.

Second, due to the complex procedures of micro-

array experiments, gene expression data often

contains a huge amount of noise. Therefore, cluster-

ing algorithms for gene expression data should be

capable of extracting useful information from a high

level of background noise.

Third, our empirical study has demonstrated that

gene expression data are often “highly connected”

[37], and clusters may be highly intersected with

each other or even embedded one in another [36].

Therefore, algorithms for gene-based clustering

should be able to effectively handle this situation.

.

.

.

Finally, users of microarray data may not only be

interested in the clusters of genes, but also be

interested in the relationship between the clusters

(e.g., which clusters are more close to each other and

which clusters are remote from each other), and the

relationship between the genes within the same

cluster (e.g., which gene can be considered as the

representative of the cluster and which genes are at

the boundary area of the cluster). A clustering

algorithm, which cannot only partition the data set

but also provides some graphical representation of

the cluster structure would be more favored by the

biologists.

2.1.2 K-Means

The K-means algorithm [46] is a typical partition-based

clustering method. Given a prespecified number K, the

algorithm partitions the data set into K disjoint subsets

which optimize the following objective function:

E ¼

X

K

i¼1

X

O2Ci

jO ? ?ij2:

Here, O is a data object in cluster Ciand ?iis the centroid

(mean of objects) of Ci. Thus, the objective function E tries

to minimize the sum of the squared distances of objects

from their cluster centers.

The K-means algorithm is simple and fast. The time

complexity of K-means is Oðl ? k ? nÞ, where l is the number

of iterations and k is the number of clusters. Our empirical

study has shown that the K-means algorithm typically

converges in a small number of iterations. However, it also

has several drawbacks as a gene-based clustering algorithm.

First, the number of gene clusters in a gene expression data

set is usually unknown in advance. To detect the optimal

number of clusters, users usually run the algorithms

repeatedly with different values of k and compare the

clustering results. For a large gene expression data set

which contains thousands of genes, this extensive para-

meter fine-tuning process may not be practical. Second,

gene expression data typically contain a huge amount of

noise; however, the K-means algorithm forces each gene

into a cluster, which may cause the algorithm to be sensitive

to noise [59], [57].

Recently, several new clustering algorithms [51], [31],

[59] have been proposed to overcome the drawbacks of the

K-means algorithm. These algorithms typically use some

global parameters to control the quality of resulting clusters

(e.g., the maximal radius of a cluster and/or the minimal

distance between clusters). Clustering is the process of

extracting all of the qualified clusters from the data set. In

this way, the number of clusters can be automatically

determined and those data objects which do not belong to

any qualified clusters are regarded as outliers. However,

the qualities of clusters in gene expression data sets may

vary widely. Thus, it is often a difficult problem to choose

the appropriate globally constraining parameters.

2.1.3 Self-Organizing Map

The Self-Organizing Map (SOM) was developed by Koho-

nen [39], on the basis of a single layered neural network.

JIANG ET AL.: CLUSTER ANALYSIS FOR GENE EXPRESSION DATA: A SURVEY1373

Page 5

The data objects are presented at the input and the output

neurons are organized with a simple neighborhood

structure such as a two-dimensional p ? q grid. Each neuron

of the neural network is associated with a reference vector,

and each data point is “mapped” to the neuron with the

“closest” reference vector. In the process of running the

algorithm, each data object acts as a training sample which

directs the movement of the reference vectors towards the

denser areas of the input vector space, so that those

reference vectors are trained to fit the distributions of the

input data set. When the training is complete, clusters are

identified by mapping all data points to the output neurons.

One of the remarkable features of SOM is that it

generates an intuitively appealing map of a high-dimen-

sional data set in 2D or 3D space and places similar clusters

near each other. The neuron training process of SOM

provides a relatively more robust approach than K-means

to the clustering of highly noisy data [62], [29]. However,

SOM requires users to input the number of clusters and the

grid structure of the neuron map. These two parameters are

preserved through the training process; hence, improperly-

specified parameters will prevent the recovering of the

natural cluster structure. Furthermore, if the data set is

abundant with irrelevant data points, such as genes with

invariant patterns, SOM will produce an output in which

this type of data will populate the vast majority of clusters

[29]. In this case, SOM is not effective because most of the

interesting patterns may be merged into only one or two

clusters and cannot be identified.

2.1.4 Hierarchical Clustering

In contrast to partition-based clustering, which attempts to

directly decompose the data set into a set of disjoint

clusters, hierarchical clustering generates a hierarchical

series of nested clusters which can be graphically repre-

sented by a tree, called dendrogram. The branches of a

dendrogram not only record the formation of the clusters

but also indicate the similarity between the clusters. By

cutting the dendrogram at some level, we can obtain a

specified number of clusters. By reordering the objects such

that the branches of the corresponding dendrogram do not

cross, the data set can be arranged with similar objects

placed together.

Hierarchical clustering algorithms can be further divided

into agglomerative approaches and divisive approaches based

on how the hierarchical dendrogram is formed. Agglom-

erative algorithms (bottom-up approach) initially regard

each data object as an individual cluster, and at each step,

merge the closest pair of clusters until all the groups are

merged into one cluster. Divisive algorithms (top-down

approach) starts with one cluster containing all the data

objects and, at each step split, only singleton clusters of

individual objects remain. For agglomerative approaches,

different measures of cluster proximity, such as single link,

complete link, and minimum-variance [18], [38], derive

various merge strategies. For divisive approaches, the

essential problem is to decide how to split clusters at each

step. Some are based on heuristic methods such as the

deterministic annealing algorithm [3], while many others

are based on the graph theoretical methods which we will

discuss later.

Eisen et al. [20] applied an agglomerative algorithm

called UPGMA (Unweighted Pair Group Method with

Arithmetic Mean) and adopted a method to graphically

represent the clustered data set. In this method, each cell of

the gene expression matrix is colored on the basis of the

measured fluorescence ratio and the rows of the matrix are

reordered based on the hierarchical dendrogram structure

and a consistent node-ordering rule. After clustering, the

original gene expression matrix is represented by a colored

table (a cluster image) where large contiguous patches of

color represent groups of genes that share similar expres-

sion patterns over multiple conditions.

Alon et al. [3] split the genes through a divisive

approach, called the deterministic-annealing algorithm

(DAA) [53], [52]. First, two initial cluster centroids Cj,

j ¼ 1;2, were randomly defined. The expression pattern of

gene k was represented by a vector~ g gk, and the probability of

gene k belonging to cluster j was assigned according to a

two-component Gaussian model:

Pjð~ g gkÞ ¼ expð??j~ g gk? Cjj2Þ=

X

j

expð??j~ g gk? Cjj2Þ:

The cluster centroids were recalculated by

X

An iterative process (the EM algorithm) was then applied to

solve Pj and Cj (the details of the EM algorithm will be

discussed later). For ? ¼ 0, there was only one cluster,

C1¼ C2. When ? was increased in small steps until a

threshold was reached, two distinct, converged centroids

emerged. The whole data set was recursively split until each

cluster contained only one gene.

Hierarchical clustering not only groups together genes

with similar expression pattern but also provides a natural

way to graphically represent the data set. The graphic

representation allows users a thorough inspection of the

whole data set and obtain an initial impression of the

distribution of data. Eisen’s method is much favored by

many biologists and has become the most widely used tool

in gene expression data analysis [20], [3], [2], [33], [50].

However, the conventional agglomerative approach suffers

from a lack of robustness [62], i.e., a small perturbation of

the data set may greatly change the structure of the

hierarchical dendrogram. Another drawback of the hier-

archical approach is its high-computational complexity. To

construct a “complete” dendrogam (where each leaf node

corresponds to one data object, and the root node

corresponds to the whole data set), the clustering process

should take

2

merging (or splitting) steps. The time

complexity for a typical agglomerative hierarchical algo-

rithm is Oðn2lognÞ [34]. Furthermore, for both agglomera-

tive and divisive approaches, the “greedy” nature of

hierarchical clustering prevents the refinement of the

previous clustering. If a “bad” decision is made in the

initial steps, it can never be corrected in the following steps.

Cj¼

k

~ g gkPjð~ g gkÞ=

X

k

Pjð~ g gkÞ:

n2?n

2.1.5 Graph-Theoretical Approaches

Given a data set X, we can construct a proximity matrix P,

where P½i;j? ¼ proximityðOi;OjÞ, and a weighted graph

1374IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,VOL. 16, NO. 11,NOVEMBER 2004

Page 6

GðV;EÞ, called a proximity graph, where each data point

corresponds to a vertex. For some clustering methods, each

pair of objects is connected by an edge with weight assigned

according to the proximity value between the objects [56],

[73]. For other methods, proximity is mapped only to either

0 or 1 on the basis of some threshold, and edges only exist

between objects i and j, where P½i;j? equals 1 [7], [26].

Graph-theoretical clustering techniques are explicitly pre-

sented in terms of a graph, thus converting the problem of

clustering a data set into such graph theoretical problems as

finding minimum cut or maximal cliques in the proximity

graph G.

CLICK. CLICK (CLuster Identification via Connectivity

Kernels) [56] seeks to identify highly connected components

in the proximity graph as clusters. CLICK makes the

probabilistic assumption that after standardization, pair-

wise similarity values between elements (no matter if they

are in the same cluster or not) are normally distributed.

Under this assumption, the weight !ij of an edge ði;jÞ is

defined asthe probability that vertices i and j are in thesame

cluster. The clustering process of CLICK iteratively finds the

minimum cut in the proximity graph and recursively splits

the data set into a set of connected components from the

minimum cut. CLICK also takes two postpruning steps to

refine the cluster results. The adoption step handles the

remaining singletons and updates the current clusters, while

the merging step iteratively merges two clusters with

similarity exceeding a predefined threshold.

In [56], the authors compared the clustering results of

CLICK on two public gene expression data sets with those

of GENECLUSTER [62] (a SOM approach) and Eisen’s

hierarchical approach [20], respectively. In both cases,

clusters obtained by CLICK demonstrated better quality in

terms of homogeneity and separation (these two concepts

will be discussed in Section 3). However, CLICK has little

guarantee of not going astray and generating highly

unbalanced partitions, e.g., a partition that only separates

a few outliers from the remaining data objects. Further-

more, in gene expression data, two clusters of coexpressed

genes, C1 and C2, may be highly intersected with each

other. In such situations, C1and C2are not likely to be split

by CLICK, but would be reported as one highly connected

component.

CAST. Ben-Dor et al. [7] introduced the idea of a

corrupted clique graph data model. The input data set is

assumed to come from the underlying cluster structure by

“contamination” with random errors caused by the com-

plex process of gene expression measurement. Specifically,

it is assumed that the true clusters of the data points can be

represented by a clique graph H, which is a disjoint union of

complete subgraphs with each clique corresponding to a

cluster. The similarity graph G is derived from H by flipping

each edge/nonedge with probability ?. Therefore, cluster-

ing a data set is equivalent to identifying the original clique

graph H from the corrupted version G with as few flips

(errors) as possible.

In [7], Ben-Dor et al. presented both a theoretical

algorithm and a practical heuristic called CAST (Cluster

Affinity Search Technique). CAST takes as input a real,

symmetric, n-by-n similarity matrix SðSði;jÞ 2 ½0;1?Þ and an

affinity threshold t. The algorithm searches the clusters one at

a time. The currently searched cluster is denoted by Copen.

Each element x has an affinity value aðxÞ with respect to

Copen as aðxÞ ¼P

a low affinity value. CAST alternates between adding high-

affinity elements to the current cluster and removing low-

affinity elements from it. When the process stabilizes, Copen

is considered a complete cluster, and this process continues

with each new cluster until all elements have been assigned

to a cluster.

The affinity threshold t of the CAST algorithm is actually

the average of pairwise similarities within a cluster. CAST

specifies the desired cluster quality through t and applies a

heuristic searching process to identify qualified clusters one

at a time. Therefore, CAST does not depend on a user-

defined number of clusters and deals with outliers effec-

tively. Nevertheless, CAST has the usual difficulty of

determining a “good” value for the global parameter t.

y2CopenSðx;yÞ. An element x has a high

affinity value if it satisfies aðxÞ ? tjCopenj; otherwise, x has

2.1.6 Model-Based Clustering

Model-based clustering approaches [21], [76], [23], [45]

provide a statistical framework to model the cluster

structure of gene expression data. The data set is assumed

to come from a finite mixture of underlying probability

distributions, with each component corresponding to a

different cluster. The goal is to estimate the parameters ? ¼

f?ij 1 ? i ? kg and ? ¼ f?i

imize the likelihood Lmixð?;?Þ ¼Pk

components, xr is a data object (i.e., a gene expression

pattern), fiðxrj?iÞ is the density function of xrof component

Ci with some unknown set of parameters ?i (model

parameters), and ?i

r(hidden parameters) represents the prob-

ability that xrbelongs to Ci. Usually, the parameters ? and

? are estimated by the EM algorithm. The EM algorithm

iterates between Expectation (E) steps and Maximization

(M) steps. In the E step, hidden parameters ? are

conditionally estimated from the data with the current

estimated ?. In the M step, model parameters ? are

estimated so as to maximize the likelihood of complete

data given the estimated hidden parameters. When the

EM algorithm converges, each data object is assigned to

the component (cluster) with the maximum conditional

probability.

An important advantage of model-based approaches is

that they provide an estimated probability ?i

object i will belong to cluster k. As we will discuss in

Section 2.1.1, gene expression data are typically “highly-

connected”; there may be instances in which a single gene

has a high correlation with two different clusters. Thus, the

probabilistic feature of model-based clustering is particu-

larly suitable for gene expression data. However, model-

based clustering relies on the assumption that the data set

fits a specific distribution. This may not be true in many

cases. The modeling of gene expression data sets, in

particular, is an ongoing effort by many researchers, and,

to the best of our knowledge, there is currently no well-

established model to represent gene expression data.

rj 1 ? i ? k;1 ? r ? ng that max-

i¼1?i

rfiðxrj?iÞ, where n

is the number of data objects, k is the number of

kthat data

JIANG ET AL.: CLUSTER ANALYSIS FOR GENE EXPRESSION DATA: A SURVEY1375

Page 7

Yeung et al. [76] studied several kinds of commonly used

data transformations and assessed the degree to which

three gene expression data sets fit the multivariant

Gaussian model assumption. The raw values from all three

data sets fit the Gaussian model poorly and there is no

uniform rule to indicate which transformation would best

improve this fit.

2.1.7 A Density-Based Hierarchical Approach: DHC

In [36], the authors proposed a new clustering algorithm,

DHC (a density-based, hierarchical clustering method), to

identify the coexpressed gene groups from gene expression

data. DHC is developed based on the notions of “density”

and “attraction” of data objects. The basic idea is to consider

a cluster as a high-dimensional dense area, where data

objects are “attracted” with each other. At the “core” part of

the dense area, objects are crowded closely with each other

and, thus, have high density. Objects at the peripheral area

of the cluster are relatively sparsely distributed and are

“attracted” to the “core” part of the dense area.

Once the “density” and “attraction” of data objects are

defined, DHC organizes the cluster structure of the data set

in two-level hierarchical structures. At the first level, an

attraction tree is constructed to represent the relationship

between the data objects in the dense area. Each node on the

attraction tree corresponds to a data object, and the parent

of each node is its attractor. The only exception is the data

object which has the highest density in the data set. This

data object becomes the root of the attraction tree. However,

the structure of the attraction tree would be hard to

interpret when the data set becomes large and the data

structure becomes complicated. To address this problem, at

the second structure level, DHC summarizes the cluster

structure of the attraction tree into a density tree. Each node

of the density tree represents a dense area. Initially, the

whole data set is considered as a single dense area and is

represented by the root node of the density tree. This dense

area is then split into several subdense areas based on some

criteria, where each subdense area is represented by a child

node of the root node. These subdense areas are further

split, until each subdense area contains a single cluster.

As a density-based approach, DHC effectively detects

the coexpressed genes (which have relatively higher

density) from noise (which have relatively lower density),

and thus is robust in the noisy environment. Furthermore,

DHC is particularly suitable for the “high-connectivity”

characteristic of gene expression data because it first

captures the “core” part of the cluster and then divides

the borders of clusters on the basis of the “attraction”

between the data objects. The two-level hierarchical

representation of the data set not only discloses the

relationship between the clusters (via density tree), but

also organizes the relationship between data objects within

the same cluster (via attraction tree). However, to compute

the density of data objects, DHC calculates the distance

between each pair of data objects in the data set. The

computational complexity of this step is Oðn2Þ, which

makes DHC not efficient. Furthermore, two global para-

meters are used in DHC to control the splitting process of

dense areas. Therefore, DHC does not escape from the

typical difficulty to determine the appropriate value of

parameters.

2.1.8 Summary

In this section, we have reviewed a series of approaches to

gene clustering. The purpose of clustering genes is to

identify groups of highly coexpressed genes from noisy

gene expression data. Clusters of coexpressed genes

provide a useful basis for further investigation of gene

function and gene regulation. Some conventional clustering

algorithm, such as K-means, SOM, and hierarchical

approaches (UPGMA), were applied in the early stage and

have proven to be useful. However, those algorithms were

designed for the general purpose of clustering, and may not

be effective to address the particular challenges for gene-

based clustering. Recently, several new clustering algo-

rithms, such as CLICK, CAST, and DHC have been

proposed specifically to aim at gene expression data. The

experimental study [56], [36] has shown that these new

clustering algorithms may provide better performance than

the conventional ones on some gene expression data.

However, different clustering algorithms are based on

different clustering criteria and/or different assumptions

regarding data distribution. The performance of each

clustering algorithm may vary greatly with different data

sets and there is no absolute “winner” among the clustering

algorithms reviewed in this section. For example, K-means

or SOM may outperform other approaches if the target data

set contains few outliers and the number of clusters in the

data set is known. While for a very noisy gene expression

data set in which the number of clusters is unknown, CAST

or CLICK may be a better choice. Table 1 lists some gene

expression data sets to which gene-based clustering

approaches have commonly been applied.

2.2

Within a gene expression matrix, there are usually several

particular macroscopic phenotypes of samples related to

some diseases or drug effects, such as diseased samples,

normal samples, or drug treated samples. The goal of

sample-based clustering is to find the phenotype structures

or substructures of the samples. Previous studies [24] have

demonstrated that phenotypes of samples can be discrimi-

nated through only a small subset of genes whose

expression levels strongly correlate with the class distinc-

tion. These genes are called informative genes. The remaining

genes in the gene expression matrix are irrelevant to the

division of samples of interest and thus are regarded as

noise in the data set.

Although the conventional clustering methods, such as

K-means, self-organizing maps (SOM), hierarchical cluster-

ing (HC), can be directly applied to cluster samples using

all the genes as features, the signal-to-noise ratio (i.e., the

number of informative genes versus that of irrelevant

genes) is usually smaller than 1 : 10, which may seriously

degrade the quality and reliability of clustering results [73],

[63]. Thus, particular methods should be applied to identify

informative genes and reduce gene dimensionality for

clustering samples to detect their phenotypes.

The existing methods of selecting informative genes to

cluster samples fall into two major categories: supervised

Sample-Based Clustering

1376IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16,NO. 11,NOVEMBER 2004

Page 8

analysis (clustering based on supervised informative gene

selection) and unsupervised analysis(unsupervised clustering

and informative gene selection).

2.2.1 Clustering Based on Supervised Informative Gene

Selection

The supervised approach assumes that phenotype informa-

tion is attached to the samples, for example, the samples are

labeled as diseased versus normal. Using this information, a

“classifier” which only contains the informative genes can

be constructed. Based on this “classifier,” samples can be

clustered to match their phenotypes and labels can be

predicted for the future coming samples from the expres-

sion profiles. Supervised methods are widely used by

biologists to pick up informative genes. The major steps to

build the classifier include:

.

Training sample selection. In this step, a subset of

samples is selected to form the training set. Since the

number of samples is limited (less than 100), the size

of the training set is usually at the same order of

magnitude with the original size of samples.

Informative gene selection. The goal of informative

gene selection step is to pick out those genes whose

expression patterns can distinguish different pheno-

types of samples. For example, a gene is uniformly

high in one sample class and uniformly low in the

other [24]. A series of approaches to select informa-

tive genes include: the neighborhood analysis

approach [24], the supervised learning methods

such as the support vector machine (SVM) [10],

and a variety of ranking-based methods [6], [43],

[47], [49], [68], [70].

Sample clustering and classification. After about 50 ?

200 [24], [42] informative genes which manifest the

phenotype partition within the training samples are

.

.

selected, the whole set of samples are clustered

using only the informative genes as features. Since

the feature volume is relatively small, conventional

clustering algorithms, such as K-means or SOM, are

usually applied to cluster samples. The future

coming samples can also be classified based on the

informative genes, thus the supervised methods can

be used to solve sample classification problem.

2.2.2 Unsupervised Clustering and Informative Gene

Selection

Unsupervised sample-based clustering assumes no pheno-

type information being assigned to any sample. Since the

initial biological identification of sample classes has been

slow, typically evolving through years of hypothesis-driven

research, automatically discovering samples’ phenotypes

presents a significant contribution in gene expression data

analysis [24]. As an unsupervised learning method, clustering

also serves as an exploratory task intended to discover

unknown substructures in the sample space.

Unsupervised sample-based clustering is much more

complex than a supervised manner since no training set of

samples can be utilized as a reference to guide informative

gene selection. Many mature statistic methods and other

supervised methods cannot be applied without the pheno-

types of samples known in advance. The following two new

challenges of unsupervised sample-based clustering make it

very hard to detect phenotypes of samples and select

informative genes.

.

Since the number of samples is very limited while

the volume of genes is very large, such data sets are

very sparse in high-dimensional genes space. No

distinct class structures of samples can be properly

detected by the conventional techniques (for exam-

ple, density-based approaches).

JIANG ET AL.: CLUSTER ANALYSIS FOR GENE EXPRESSION DATA: A SURVEY1377

TABLE 1

Some Data Sets for Gene-Based Analysis

Page 9

.

Most of the genes collected may not necessarily be of

interest. A small percentage (less than 10 percent

[24]) of genes which manifest meaningful sample

phenotype structure are buried in large amount of

noise. Uncertainty about which genes are relevant

makes it difficult to select informative genes.

Two general strategies have been employed to address

the problem of unsupervised clustering and information

gene selection: unsupervised gene selection and interrelated

clustering.

Unsupervised gene selection. The first strategy differ-

entiates gene selection and sample clustering as indepen-

dent processes. First, the gene (feature) dimension is

reduced, then the conventional clustering algorithms are

applied. Since no training samples are available, gene

selection only relies on statistical models to analyze the

variance in the gene expression data.

Alter et al. [4] applied the principal component analysis

(PCA) to capture the majority of the variations within the

genes by a small set of principal components (PCs), called

“eigengenes.” The samples are then projected on the new

lower-dimensional PC space. However, eigengenes do not

necessarily have a strong correlation with informative

genes. Due to the large number of irrelevant genes,

discriminatory information of gene expression data is not

guaranteed to be the type of user-interested variations. The

effectiveness of applying PCA before clustering is discussed

in [75].

Ding [17] used a F-statistic method to select the genes

which show large variance in the expression matrix. Then, a

min-max cut hierarchical divisive clustering approach is

applied to cluster samples. Finally, the samples are ordered

such that adjacent samples are similar and samples far

away are different. However, this approach relies on the

assumption that informative genes exhibit larger variance

than irrelevant genes which is not necessarily true for the

gene expression data sets [75]. Therefore, the effectiveness

of this approach also depends on the data distribution.

Interrelated clustering. When we have a closer look at

the problems of informative gene selection and sample

clustering, we will find they are closely interrelated. Once

informative genes have been identified, then it is relatively

easy to use conventional clustering algorithms to cluster

samples. On the other hand, once samples have been

correctly partitioned, some supervised methods such as t-

test scores and separation scores [68] can be used to rank the

genes according to their relevance to the partition. Genes

with high relevance to the partition are considered as

informative genes. Based on this observation, the second

strategy has been suggested to dynamically use the

relationship between the genes and samples and iteratively

combine a clustering process and a gene selection process.

Intuitively, although we do not know the exact sample

partition in advance, for each iteration, we can expect to

obtain an approximate partition that is close to the target

sample partition. The approximate partition allows the

selection of a moderately good gene subset, which will,

hopefully, draw the approximate partition even closer to

the target partition in the next iteration. After several

iterations, the sample partition will converge to the true

sample structure and the selected genes will be feasible

candidates for the set of informative genes.

Xing and Karp [73] presented a sample-based clustering

algorithm named CLIFF (CLustering via Iterative Feature

Filtering) which iteratively use sample partitions as a

reference to filter genes. In [73], noninformative genes were

divided into the following three categories:

1.

2.

nondiscriminative genes (genes in the “off” state),

irrelevant genes (genes do not respond to the

physiological event), and

redundant genes (genes that are redundant or

secondary responses to the biological or experimen-

tal conditions that distinguish different samples).

CLIFF first uses a two-component Gaussian model to rank all

genes in terms of their discriminability and then select a set

of most discriminant genes. It then applies a graph-

theoretical clustering algorithm, NCut(Approximate Nor-

malized Cut), to generate an initial partition for the samples

and enters an iteration process. For each iteration, the input

is a reference partition C of the samples and the selected

genes. First a scoring method, named information gain

ranking, is applied to select a set of most “relevant” genes

based on the sample partition C. The Markov blanket filter is

then used to filter “redundant” genes. The remaining genes

are used as the features to generate a new partition C0of the

samples by NCut clustering algorithm. The new partition C0

and the remaining genes will be the input of the next

iteration. The iteration ends if this new partition C0is

identical to the input reference partition C. However, this

approach is sensitive to the outliers and noise of the

samples since the gene filtering highly depends on the

result of the NCut algorithm which is not robust to the noise

and outliers.

Tang et al. [64], [65] proposed iterative strategies for

interrelated sample clustering and informative gene selec-

tion. The problem of sample-based clustering is formulated

via an interplay between sample partition detection and

irrelevant gene pruning. The interrelated clustering ap-

proaches contained three phases: an initialization partition

phase, an interrelated iteration phase, and a class validation

phase.In the first phase,samples and genes are grouped into

several exclusive smaller groups by conventional clustering

methods K-means or SOM. In the iteration phase, the

relationship between the groups of the samples and the

groups of the genes are measured and analyzed. A

representation degree measurement is defined to detect the

sample groups with high internal coherence as well as a

large difference between each other. Sample groups with a

high representation degree are posted to form a partial or

approximate sample partition called representative pattern.

The representative pattern is then used to direct the elimina-

tion of irrelevant genes. In turn, the remaining meaningful

genes were used to guide further representative pattern

detection. The termination of the series of iterations is

determined by evaluating the quality of the sample parti-

tion. This is achieved in the class validation phase by

assigning coefficient of variation (CV) to measure the

“internally similar and well-separated” degree of the

selected genes and the related sample partition. The formula

for the coefficient of variation is: CV ¼1

3.

K

PK

t¼1

?t

jj~ ? ?tjj, where K

1378IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,VOL. 16, NO. 11, NOVEMBER 2004

Page 10

represents the number of sample groups, ~ ? ?t indicates the

center sample vector of group k, and ?t represents the

standard deviation of group t. When a stable and significant

sample partition emerges, the iteration stops, and the finial

sample partition become the result of the process. This

approach delineates the relationships between sample

groups and gene groups while conducting an iterative

searchforsamples’phenotypesandinformative genes.Since

the representative pattern identified in each step is only

formed by “internally similar and well-separated” sample

groups, this approach is robust to the noise and outliers of

the samples.

2.2.3 Summary

In this section, we reviewed a series of approaches to

sample-based clustering. The goal of sample-based cluster-

ing is to find the phenotype structures of the samples. The

clustering techniques can be divided into the following

categories and subcategories:

1.

clustering based on supervised informative gene

selection and

unsupervised clustering and informative gene

selection

2.

.

.

unsupervised gene selection and

interrelated clustering.

Since the percentage of the informative genes is rather

low, the major challenge of sample-based clustering is

informative gene selection. Supervised informative gene

selection techniques which use samples’ phenotype infor-

mation to select informative genes is widely applied and

relatively easy to use to get a high clustering accuracy rate

since the majority of the samples are used as the training set

to select informative genes.

Unsupervised sample-based clustering as well as in-

formative gene selection is complex since no prior knowl-

edge is supposed to be known in advance. Basically, two

strategies have been adopted to address this problem. The

first strategy reduces the number of genes before clustering

samples. However, since these approaches rely on some

statistical models [4], [17], the effectiveness of them heavily

depends on the data distribution [75]. Another strategy

utilizes the relationship between the genes and samples to

perform gene selection and sample clustering simulta-

neously in an iterative paradigm. Two novel approaches

based on the this idea were proposed by Xing and Karp [73]

and Tang and Zhang [65]. For both approaches, the iteration

converges into an accurate partition of the samples and a set

of informative genes as well. One drawback of these

approaches is that the gene filtering process is noninver-

tible. The deterministic filtering will cause data to be

grouped based on local decisions.

There are two more issues regarding the quality of

sample-based clustering techniques that need to be further

discussed:

.

The number of clusters K. Usually, the number of

phenotypes within a gene expression matrix is very

small and known in advance. For example, the

number of phenotypes is 2 for the well-known

leukemia microarray set [24], [58] which often serves

as the benchmark for microarray analysis methods.

Thus, for sample-based analysis, the number of

clusters K is always predefined, namely, as an input

parameter of the clustering method.

Time complexity of the sample-based clustering techni-

ques. Since the number of samples m is far less than

the volume of genes n in a typical gene expression

data set, we only investigate the time complexity of

the algorithms with respect to the volume of genes n.

The time complexity of supervised informative gene

selection and unsupervised gene selection ap-

proaches is usually OðnÞ because each gene only

needs to be checked once. The time complexity of

interrelated clustering methods is Oðn ? lÞ, while l is

the number of iterations which usually is hard to

estimate.

Table 2 lists some widely used gene expression data sets

for sample-based clustering methods. Since there are many

supervised clustering methods, their applications are not

exhaustively listed.

.

2.3

The clustering algorithms discussed in the previous sections

are examples of “global clustering”; for a given data set to be

clustered, the feature space is globally determined and is

shared by all resulting clusters, and the resulting clusters

Subspace Clustering

JIANG ET AL.: CLUSTER ANALYSIS FOR GENE EXPRESSION DATA: A SURVEY 1379

TABLE 2

Some Data Sets for Sample-Based Analysis

Page 11

are exclusive and exhaustive. However, it is well known in

molecular biology that only a small subset of the genes

participates in any cellular process of interest and that any

cellular process takes place only in a subset of the samples.

Furthermore, a single gene may participate in multiple

pathways that may or may not be coactive under all

conditions, so that a gene can participate in multiple

clusters or in none at all. Recently, a series of subspace

clustering methods have been proposed [22], [11], [40] to

capture coherence exhibited by the “blocks” within gene

expression matrices. In this context, a “block” is a submatrix

defined by a subset of genes on a subset of samples.

Subspace clustering was first proposed by Agrawal et al. in

general data mining domain [1] to find subsets of objects

such that the objects appear as a cluster in a subspace

formed by a subset of the features. Fig. 2 shows an example

of the subspace clusters (A and B) embedded in a gene

expression matrix. In subspace clustering, the subsets of

features for various subspace clusters can be different. Two

subspace clusters can share some common objects and

features, and some objects may not belong to any subspace

cluster.

For a gene expression matrix containing n genes and

m samples, the computational complexity of a complete

combination of genes and samples is 2nþmso that the

problem of globally optimal block selection is NP-hard. The

subspace clustering methods usually define models to

describe the target block and then adopt some heuristics

to search in the gene-sample space. In the following section,

we will discuss some representative subspace clustering

algorithms proposed for gene expression matrices. In these

representative subspace clustering algorithms, genes and

samples are treated symmetrically such that either genes or

samples can be regarded as objects or features.

2.3.1 Coupled Two-Way Clustering (CTWC)

Getz et al. [22] model the block as a stable cluster with

features (Fi) and objects (Oj), where both Fiand Ojcan be

either genes or samples. The cluster is “stable” in the sense

that, when only the features in Fi are used to cluster the

corresponding Oj, Ojdoes not split below some threshold.

CTWC provides a heuristic to avoid brute-force enumera-

tion of all possible combinations. Only subsets of genes or

samples that are identified as stable clusters in previous

iterations are candidates for the next iteration.

CTWC begins with only one pair of gene set and sample

set (G0,S0), where G0is the set containing all genes and S0is

the set that contains all samples. A hierarchical clustering

method, called the superparamagnetic clustering algorithm

(SPC) [8], is applied to each set and the stable clusters of

genes and samples yielded by this first iteration are Gi

Sj

1. CTWC dynamically maintains two lists of stable clusters

(gene list GL and sample list SL) and a pair list of pairs of

gene and sample subsets (Gi

gene subset from GL and one sample subset from SL that

have not been previously combined are coupled and

clustered mutually as objects and features. Newly gener-

ated stable clusters are added to GL and SL, and a pointer

that identifies the parent pair is recorded in the pair list to

indicate the origin of the clusters. The iteration continues

until no new clusters are found which satisfy some

criterion, such as stability or critical size.

CTWC was applied to a leukemia data set [24] and a

colon cancer data set [3]. For the leukemia data set, CTWC

converges to 49 stable gene clusters and 35 stable sample

clusters in two iterations. For the colon cancer data set,

76 stable sample clusters and 97 stable gene clusters were

reported by CTWC in two iterations. The experiments

demonstrated the capability of CTWC to identify substruc-

tures of gene expression data which cannot be clearly

identified when all genes or samples are used as objects or

features.

However, CTWC searches for blocks in a deterministic

manner and the clustering results are therefore sensitive to

initial clustering settings. For example, suppose (G, S) is a

pair of stable clusters. If, during the previous iterations, G

was separately assigned to several clusters according to

features S0, or S was separated in several clusters according

to features G0, then (G, S) can never be found by CTWC in

the following iterations. Another drawback of CTWC is that

clustering results are sometimes redundant and hard to

interpret. For example, for the colon cancer data, a total of

76 sample clusters and 97 gene clusters were identified.

Among these, four different gene clusters partitioned the

samples in a normal/cancer classification and were there-

fore redundant, while many of the clusters were not of

interest, i.e., hard to interpret. More satisfactory results

would be produced if the framework can provide a

systematic mechanism to minimize redundancy and rank

the resulting clusters according to significance.

1and

n, Sj

m). For each iteration, one

2.3.2 Plaid Model

The plaid model [40] regards gene expression data as a sum of

multiple “layers,” where each layer may represent the

presence of a particular biological process with only a subset

of genes and a subset of samples involved. The generalized

plaid model is formalized as Yij¼PK

coming from multiple sources. To be specific, ?ij0 is the

background expression level for the whole data set, and ?ijk

describesthecontribution fromlayerk.Theparameter?ik(or

?jk) equals 1 when gene i (or sample j) belongs to layer k,

and equals 0 otherwise.

The clustering process searches the layers in the data set

one after another, using the EM algorithm to estimate the

model parameters. Suppose the first K ? 1 layers have been

extracted, the Kth layer is identified by minimizing the sum

of squared errors Q ¼1

k¼0?ijk?ik?jk, where the

expression level Yijof gene i under sample j is considered

2

Pn

i¼1

Pm

j¼1ðZij? ?ijK?iK?jkÞ2, where

1380IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,VOL. 16, NO. 11,NOVEMBER 2004

Fig. 2. Illustration of subspace clusters.

Page 12

Zij¼ Yij? ?ij0?PK?1

variance of expression levels within the current layer is

smaller than a threshold.

The plaid model was applied to a yeast gene expression

data set combined from several time series under different

cellular processes [40]. Totally, 34 layers were extracted

from the data set, among which interesting clusters were

found. For example, the second layer was recognized as

dominated by genes that produce ribosomal proteins

involved in protein synthesis in which mRNA is translated.

However, the plaid model is based on the questionable

assumption that, if a gene participates in several cellular

processes, then its expression level is the sum of the terms

involved in the individual processes. Thus, the effectiveness

and interpret ability of the discovered layers need further

investigation.

k¼1?ijk?ij?jkis the residual from the first

K ? 1 layers. The clustering process stops when the

2.3.3 Biclustering and ?-Clusters

Cheng and Church [11] introduced the bicluster concept to

model a block, along with a score called the mean-squared

residue to measure the coherence of genes and conditions in

the block. Let G0and S0be subsets of genes and samples.

The pair ðG0;S0Þ specifies a submatrix with the mean-

squared residue score

HðG0;S0Þ ¼

1

jG0jjS0j

X

i2G0;j2S0

ðwij? ?iS0 ? ?G0jþ ?G0S0Þ2;

where

?iS0 ¼

1

jS0j

X

j2J

wij;?G0j¼

1

jG0j

X

i2I

wij;?S0J¼

1

jG0jjS0j

X

i2G0;j2S0

wij

are the row and column means and the means in the

submatrix. A submatrix is called a ?-bicluster if HðG0;S0Þ ?

? for some ? > 0. A low mean-squared residue score

together with a large variation from the constant suggest

a good criterion for identifying a block.

However, the problem of finding a minimum set of

biclusters to cover all the elements in a data matrix has been

shown to be NP-hard. A greedy method which provides an

approximation of the optimal solution and reduces the

complexity to polynomial-time has been introduced in [11].

To find a bicluster, the score H is computed for each

possible row/column addition/deletion, and the action that

decreases H the most is applied. If no action will decrease H

or if H ? ?, a bicluster is returned. However, this algorithm

(the brute-force deletion and addition of rows/columns)

requires computational time Oððn þ mÞ ? mnÞ, where n and

m are the number of genes and samples, respectively, and it

is time-consuming when dealing with a large gene expres-

sion data sets. A more efficient algorithm based on multiple

row/column addition/deletion (the biclustering algorithm)

with time-complexity OðmnÞ was also proposed in [11].

After one bicluster is identified, the elements in the

corresponding submatrix are replaced (masked) by random

numbers. The biclusters are successively extracted from the

raw data matrix until a prespecified number of clusters

have been identified. However, the biclustering algorithm

also has several drawbacks. First, the algorithm stops when

a prespecified number of clusters have been identified. To

cover the majority of elements in the data matrix, the

specified number is usually large. However, the bicluster-

ing algorithm does not guarantee that the biclusters

identified earlier will be of superior quality to those

identified later, which adds to the difficulty of the

interpretation of the resulting clusters. Second, biclustering

“mask” the identified biclusters with random numbers,

preventing the identification of overlapping biclusters.

Yang et al. [74] present a subspace clustering method

named “?-clusters” to capture K embedded subspace

clusters simultaneously. They use average residue across

every entry in the submatrix to measure the coherence

within a submatrix. A heuristic move-based method

called FLOC (FLexible Overlapped Clustering) is applied

to search K embedded subspace clusters. FLOC starts

with K randomly selected submatrices as the subspace

clusters then iteratively tries to add/remove each row/

column into/out of the subspace clusters to lower the

residue value, until a local minimum residue value is

reached. The time complexity of the ?-clusters algorithm is

Oððn þ mÞ ? n ? m ? k ? lÞ, where k is the number of

clusters and l is the number of iterations. The ?-clusters

algorithm also requires that the number of clusters be

prespecified. The advantage of the “?-clusters” approach

is that it is robust to missing values since the residue of a

submatrix only computed by existing values. “?-clusters”

can also detect overlapping embedded subspace clusters.

2.3.4 Summary

For the subspace clustering techniques applied to gene

expression data, a cluster is a “block” formed by a subset of

genes and a subset of experimental conditions, where the

genes in the “block” illustrate coherent expression patterns

under the conditions within the same “block.” Different

approaches adopt different greedy heuristics to approx-

imate the optimal solution and make the problem tractable.

The commonly used data sets are listed in Table 3.

JIANG ET AL.: CLUSTER ANALYSIS FOR GENE EXPRESSION DATA: A SURVEY 1381

TABLE 3

Some Data Sets for Subspace Clustering

Page 13

3CLASS VALIDATION

The previous sections have reviewed a number of clustering

algorithms which partition the data set based on different

clustering criteria. For gene expression data, clustering

results in groups of coexpressed genes, groups of samples

with a common phenotype, or “blocks” of genes and

samples involved in specific biological processes. However,

different clustering algorithms, or even a single clustering

algorithm using different parameters, generally result in

different sets of clusters. Therefore, it is important to

compare various clustering results and select the one that

best fits the “true” data distribution. Cluster validation is

the process of assessing the quality and reliability of the

cluster sets derived from various clustering processes.

Generally, cluster validity has three aspects. First, the

quality of clusters can be measured in terms of homogeneity

and separation on the basis of the definition of a cluster:

Objects within one cluster are similar to each other, while

objects in different clusters are dissimilar with each other.

The second aspect relies on a given “ground truth” of the

clusters. The “ground truth” could come from domain

knowledge, such as known function families of genes or

from other sources such as the clinical diagnosis of normal

or cancerous tissues. Cluster validation is based on the

agreement between clustering results and the “ground

truth.” The third aspect of cluster validity focuses on the

reliability of the clusters or the likelihood that the cluster

structure is not formed by chance. In this section, we will

discuss these three aspects of cluster validation.

3.1

There are various definitions for the homogeneity of

clusters which measures the similarity of data objects in

cluster C. For example,

P

jjCjj ? ðjjCjj ? 1Þ

This definition represents the homogeneity of cluster C by

the average pairwise object similarity within C. An alternate

definition evaluates the homogeneity with respect to the

“centroid” of the cluster C, i.e.,

Homogeneity and Separation

H1ðCÞ ¼

Oi;Oj2C;Oi6¼OjSimilarityðOi;OjÞ

:

H2ðCÞ ¼

1

jjCjj

X

Oi2C

SimilarityðOi;?O OÞ;

where?O O is the “centroid” of C. Other definitions, such as

the representation of cluster homogeneity via maximum or

minimum pairwise or centroid-based similarity within C

can also be useful and perform well under certain

conditions. Cluster separation is analogously defined from

various perspectives to measure the dissimilarity between

two clusters C1, C2. For example,

P

and S2ðC1;C2Þ ¼ Similarityð?O O1;?O O2Þ. Since these definitions

of homogeneity and separation are based on the similarity

between objects, the quality of C increases with higher

homogeneity values within C and lower separation values

between C and other clusters. Once we have defined the

S1ðC1;C2Þ ¼

Oi2C1;Oj2C2SimilarityðOi;OjÞ

jjC1jj ? jjC2jj

homogeneity of a cluster and the separation between a pair

of clusters, for a given clustering result C¼ fC1;C2;...;CKg,

we can define the homogeneity and the separation of C. For

example, Shamir and Sharan [56] used definitions of Have¼

1

N

X

to measure the average homogeniety and separation for the

set of clustering results C.

3.2Agreement with Reference Partition

If the “ground truth” of the cluster structure of the data

set is available, we can test the performance of a

clustering process by comparing the clustering results

with the “ground truth.” Given the clustering results

C ¼ fC1;...;Cpg, we can construct a n ? n binary matrix

C, where n is the number of data objects, Cij¼ 1 if Oi

and Ojbelong to the same cluster, and Cij¼ 0 otherwise.

Similarly, we can build the binary matrix P for the

“ground truth” P ¼ fP1;...;Psg. The agreement between

C and P can be disclosed via the following values:

.

n11 is the number of object pairs ðOi;OjÞ, where

Cij¼ 1 and Pij¼ 1.

.

n10 is the number of object pairs ðOi;OjÞ, where

Cij¼ 1 and Pij¼ 0.

.

n01 is the number of object pairs ðOi;OjÞ, where

Cij¼ 0 and Pij¼ 1.

.

n00 is the number of object pairs ðOi;OjÞ, where

Cij¼ 0 and Pij¼ 0.

Some commonly used indices [25], [60] have been defined

to measure the degree of similarity between C and P:

P

Ci2CjjCijj ? H2ðCiÞ and

Save¼

1

P

Ci6¼CjjjCijj ? jjCjjj

Ci6¼Cj

ðjjCijj ? jjCjjjÞS2ðCi;CjÞ

Rand index : Rand ¼

n11þ n00

n11þ n10þ n01þ n00;

n11

n11þ n10þ n01;

Jaccard coefficient : JC ¼

Minkowski measure : Minkowski ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

n10þ n01

n11þ n01

r

:

The Rand index and the Jaccard coefficient measure the

extent of agreement between C and P, while the Minkowski

measure illustrates the proportion of disagreements to the

total number of object pairs ðOi;OjÞ, where Oi, Ojbelong to

the same set in P. It should be noted that the Jaccard

coefficient and the Minkowski measure do not (directly)

involve the term n00. These two indices may be more

effective in gene-based clustering because a majority of

pairs of objects tend to be in separate clusters and the term

n00would dominate the other three terms in both good and

bad solutions. Other methods are also available to measure

the correlation between the clustering results and the

“ground truth” [25]. Again, the optimal index selection is

application dependent.

3.3

While a validation index can be used to compare different

clustering results, this comparison will not reveal the

reliability of the resulting clusters; that is, the probability

that the clusters are not formed by chance. In the following

section, we will review two approaches to measuring the

significance of the derived clusters.

Reliability of Clusters

1382IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,VOL. 16, NO. 11,NOVEMBER 2004

Page 14

P-value of a cluster. In [66], Tavazoie et al. mapped the

genes in each resulting cluster to the 199 functional

categories in the Martinsried Institute of Protein Sciences

function classification scheme (MIPS) database. For each

cluster, P-values were calculated to measure the statistical

significance for functional category enrichment. To be

specific, the authors used the hypergeometric distribution

to calculate the probability of observing at least k genes

from a functional category within a cluster of size n:

? ?

P ¼ 1 ?

X

k?1

i¼0

f

i

g ? f

n ? i

g

n

?

?

?

?

;

where f is the total number of genes within a functional

category and g is the total number of genes within the

genome. Since the expectation of P within the cluster would

be higher than 0:05%, the authors regarded clusters with

P-values smaller than 3 ? 10?4as significant. Jakt et al. [35]

integrated the assessment of the potential functional

significance of both gene clusters and the corresponding

postulated regulatory motifs (common DNA sequence

patterns), and developed a method to estimate the prob-

ability (P-value) of finding a certain number of matches to a

motif in all of the gene clusters. A smaller probability

indicates a higher significance of the clustering results.

Prediction strength. A novel approach to evaluating the

reliability of sample clusters is based on the concept that if a

clustering result reflects true cluster structure, then a

predictor based on the resulting clusters should accurately

estimate the cluster labels for new test samples. For gene

expression data, extra data objects are rarely used as test

samples since the number of available samples is limited.

Rather, a cross-validation method is applied. The generated

clusters are assessed by repeatedly measuring the prediction

strength with one or a few of the data objects left out, in turn,

as “test samples” while the remaining data objects are used

for clustering.

Golub et al. introduced a method based on this idea.

Suppose a clustering algorithm partitions the samples into

two groups C1and C2, with samples ~ s s1;...;~ s skbelonging to

C1and samples ~ s skþ1;...;~ s sp?1belonging to C2, ~ s spbeing the

left out test sample, and G being a set of genes most

correlated with the current partition. Given the test sample

~ s sp, each gene ~ g gi2 G votes for either C1or C2, depending on

whether the gene expression level wipof ~ g giin ~ s spis closer to

?1or ?2(which denote, respectively, the mean expression

levels of~ s s1;...;~ s sk(C1) and~ s skþ1;...;~ s sp?1(C2)). The votes for

C1and C2are summed as V1and V2, respectively, and the

prediction strength for~ s spis defined as jV1?V2

of the genes in G uniformly vote ~ s spfor C1(or for C2), the

value of the predication strength will be high. High values

of prediction strength with respect to sufficient test samples

indicate a biologically significant clustering.

Golub’s method constructs a predictor based on derived

clusters and converts the reliability assessment of sample

clusters to a “supervised” classification problem. In [77],

Yeung et al. extended the idea of “prediction strength” and

proposed an approach to cluster validation for gene

V1þV2j. Clearly, if most

clusters. Intuitively, if a cluster of genes has possible

biological significance, then the expression levels of the

genes within that cluster should also be similar to each

other in “test” samples that were not used to form the

cluster. Yeung et al. proposed a specific figure of merit

(FOM), to estimate the predictive power of a clustering

algorithm. Suppose C1;...;Ck are the resulting clusters

based on samples 1;...;ðe ? 1Þ;ðe þ 1Þ;...;m, and sample e

is left out to test the prediction strength. Let Rðg;eÞ be the

expression level of gene g under sample e in the raw data

matrix. Let ?CiðeÞ be the average expression level in sample

e of the genes in cluster Ci. The figure of merit with respect to

e and the number of clusters k is defined as

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

i¼1

FOMðe;kÞ ¼

1

n?

X

k

X

x2Ci

ðRðx;eÞ ? ?CiðeÞÞ2

v

u

u

t

:

Each of the m samples can be left out, in turn, and the

aggregate figure of merit is defined as

FOMðkÞ ¼

X

m

e¼1

FOMðe;kÞ:

The FOM measures the mean deviation of the expression

levels of genes in e relative to their corresponding cluster

means. Thus, a small value of FOM indicates a strong

prediction strength and, therefore, a high-level reliability of

the resulting clusters. Levine and Domany [41] proposed

another figure of merit M based on a resampling scheme.

The basic idea is that the cluster structure derived from the

whole data set should be able to “predict” the cluster

structure of subsets of the full data. M measures the extent

to which the clustering assignments obtained from the

resamples (subsets of the full data) agree with those from

the full data. A high value of M against a wide range of

resampling indicates a reliable clustering result.

4CURRENT AND FUTURE RESEARCH DIRECTIONS

Recent DNA microarray technologies have made it possible

to monitor transcription levels of tens of thousands of genes

in parallel. Gene expression data generated by microarray

experiments offer tremendous potential for advances in

molecular biology and functional genomics. In this paper,

we reviewed both classical and recently developed cluster-

ing algorithms, which have been applied to gene expression

data, with promising results.

Gene expression data can be clustered on both genes and

samples. As a result, the clustering algorithms can be

divided into three categories: gene-based clustering, sam-

ple-based clustering, and subspace clustering. Each cate-

gory has specific applications and present specific

challenges for the clustering task. For each category, we

have analyzed its particular problems and reviewed several

representative algorithms.

Given the variety of available clustering algorithms, one

of the problems faced by biologists is the selection of the

algorithm most appropriate to a given gene expression data

set.However,thereisnosingle“best”algorithmwhichisthe

“winner” in every aspect. Researchers typically select a few

candidate algorithms and compare the clustering results.

Nevertheless, we have shown that there are three aspects of

cluster validation and, for each aspect, various approaches

JIANG ET AL.: CLUSTER ANALYSIS FOR GENE EXPRESSION DATA: A SURVEY1383

Page 15

can be used to assess the quality or reliability of the

clustering results. In this instance as well, there are no

existing standard validity metrics. In fact, the performance

of different clustering algorithms and different validation

approaches is strongly dependent on both data distribution

and application requirements. The choice of the clustering

algorithm and validity metric is often guided by a combina-

tion of evaluation criteria and the user’s experience.

A gene expression data set typically contains thousands

of genes. However, biologists often have different require-

ments on cluster granularity for different subsets of genes.

For some purpose, biologists may be particularly interested

in some specific subsets of genes and prefer small and tight

clusters while for other genes, people may only need a

coarse overview of the data structure. However, most of the

existing clustering algorithms only provide a crisp set of

clusters and may not be flexible to different requirements

for cluster granularity on a single data set. For gene

expression data, it would be more appropriate to avoid

the direct partition of the data set and instead provide a

scalable graphical representation of the data structure,

leaving the partition problem to the users. Several existing

approaches, such as hierarchical clustering, SOM, and

Optics [5], can graphically represent the cluster structure.

However, these algorithms may not be able to adapt to

different user requirements on cluster granularity for

different subsets of the data.

Clustering is generally recognized as an “unsupervised”

learning problem. Prior to undertaking a clustering task,

“global” information regarding the data set, such as the total

number of clusters and the complete data distribution in the

object space, is usually unknown. However, some “partial”

knowledge is often available regarding a gene expression

data set. For example, the functions of some genes have

been studied in the literature, which can provide guidance

to the clustering. Furthermore, some groups of the experi-

mental conditions are known to be strongly correlated and

the differences among the cluster structures under these

different groups may be of particular interest. If a clustering

algorithm could integrate such partial knowledge as some

clustering constraints when carrying out the clustering task,

we can expect that the clustering results would be more

biologically meaningful. In this way, clustering could cease

to be a “pure” unsupervised process and become an

interactive exploration of the data set.

REFERENCES

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan,

“Automatic Subspace Clustering of High Dimensional Data for

Data Mining Applications,” SIGMOD 1998, Proc. ACM SIGMOD

Int’l Conf. Management of Data, pp. 94-105, 1998.

A.A. Alizadeh et al., “Distinct Types of Diffuse Large B-Cell

Lymphoma Identified by Gene Expression Profiling,” Nature,

vol. 403, pp. 503-511, Feb. 2000.

U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack,

and A.J. Levine, “Broad Patterns of Gene Expression Revealed by

Clustering Analysis of Tumor and Normal Colon Tissues Probed

by Oligonucleotide Array,” Proc. Nat’l Academy of Science, vol. 96,

no. 12, pp. 6745-6750, June 1999.

O. Alter, P.O. Brown, and D. Bostein, “Singular Value Decom-

position for Genome-Wide Expression Data Processing and

Modeling,” Proc. Nat’l Academy of Science, vol. 97, no. 18,

pp. 10101-10106, Aug. 2000.

M. Ankerst, M.M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS:

Ordering Points to Identify the Clustering Structure,” Sigmod,

pp. 49-60, 1999.

[2]

[3]

[4]

[5]

[6]

A. Ben-Dor, N. Friedman, and Z. Yakhini, “Class Discovery in

Gene Expression Data,” Proc. Fifth Ann. Int’l Conf. Computational

Molecular Biology (RECOMB 2001), pp. 31-38, 2001.

A. Ben-Dor, R. Shamir, and Z. Yakhini, “Clustering Gene

Expression Patterns,” J. Computational Biology, vol. 6, nos. 3/4,

pp. 281-297, 1999.

M. Blat, S. Wiseman, and E. Domany, “Super-Paramagnetic

Clustering of Data,” Physical Review Letters, vol. 76, pp. 3251-

3255, 1996.

A. Brazma and J. Vilo, “Minireview: Gene Expression Data

Analysis,” Federation of European Biochemical Soc., vol. 480, pp. 17-

24, June 2000.

[10] M.P.S. Brown, W.N. Grundy, D. Lin, N. Cristianini, C.W. Sugnet,

T.S. Furey, M. Ares Jr., and D. Haussler, “Knowledge-Based

Analysis of Microarray Gene Expression Data Using Support

Vector Machines,” Proc. Nat’l Academy of Science, vol. 97, no. 1,

pp. 262-267, Jan. 2000.

[11] Y. Cheng and G.M. Church, “Biclustering of Expression Data,”

Proc. Eighth Int’l Conf. Intelligent Systems for Molecular Biology

(ISMB), vol. 8, pp. 93-103, 2000.

[12] R.J. Cho, M.J. Campbell, E.A. Winzeler, L. Steinmetz, A. Conway,

L. Wodicka, T.G. Wolfsberg, A.E. Gabrielian, D. Landsman, D.J.

Lockhart, and R.W. Davis, “A Genome-Wide Transcriptional

Analysis of the Mitotic Cell Cycle,” Molecular Cell, vol. 2, no. 1,

pp. 65-73, July 1998.

[13] S. Chu et al., “The Transcriptional Program of Sporulation in

Budding Yeast,” Science, vol. 282, no. 5389, pp. 699-705, 1998.

[14] D.R. Bickel, “Robust Cluster Analysis of DNA Microarray Data:

An Application of Nonparametric Correlation Dissimilarity,” Proc.

Joint Statistical Meetings of the Am. Statistical Assoc., (Biometrics

Section), 2001.

[15] J.L. DeRisi, V.R. Iyer, and P.O. Brown, “Exploring the Metabolic

and Genetic Control of Gene Expression on a Genomic Scale,”

Science, pp. 680-686, 1997.

[16] P. D’haeseleer, X. Wen, S. Fuhrman, and R. Somogyi, “Mining the

Gene Expression Matrix: Inferring Gene Relationships From Large

Scale Gene Expression Data,” Information Processing in Cells and

Tissues, pp. 203-212, 1998.

[17] C. Ding, “Analysis of Gene Expression Profiles: Class Discovery

and Leaf Ordering,” Proc. Int’l Conf. Computational Molecular

Biology (RECOMB), pp. 27-136, Apr. 2002.

[18] R. Dubes and A. Jain, Algorithms for Clustering Data. Prentice Hall,

1988.

[19] B. Efron, “The Jackknife, the Bootstrap, and Other Resampling

Plans,” Proc. CBMS-NSF Regional Conf. Series in Applied Math.,

vol. 38, 1982.

[20] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster

Analysis and Display of Genome-Wide Expression Patterns,” Proc.

Nat’l Academy of Science, vol. 95, no. 25, pp. 14863-14868, Dec. 1998.

[21] C. Fraley and A.E. Raftery, “How Many Clusters? Which

Clustering Method? Answers Via Model-Based Cluster Analysis,”

The Computer J., vol. 41, no. 8, pp. 578-588, 1998.

[22] G. Getz, E. Levine, and E. Domany, “Coupled Two-Way

Clustering Analysis of Gene Microarray Data,” Proc. Nat’l

Academy of Science, vol. 97, no. 22, pp. 12079-12084, Oct. 2000.

[23] D. Ghosh and A.M. Chinnaiyan, “Mixture Modelling of Gene

Expression Data from Microarray Experiments,” Bioinformatics,

vol. 18, pp. 275-286, 2002.

[24] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gassenbeek,

J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri,

D.D. Bloomfield, and E.S. Lander, “Molecular Classification of

Cancer: Class Discovery and Class Prediction by Gene Expression

Monitoring,” Science, vol. 286, no. 15, pp. 531-537, Oct. 1999.

[25] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “On Clustering

Validation Techniques,” Intelligent Information Systems J., 2001.

[26] E. Hartuv and R. Shamir, “A Clustering Algorithm Based on

Graph Connectivity,” Information Processing Letters, vol. 76, nos. 4-

6, pp. 175-181, 2000.

[27] T. Hastie, R. Tibshirani, D. Boststein, and P. Brown, “Supervised

Harvesting of Expression Trees,” Genome Biology, vol. 2, no. 1,

pp. 0003.1-0003.12, Jan. 2001.

[28] I. Hedenfalk, D. Duggan, Y.D. Chen, M. Radmacher, M. Bittner,

R. Simon, P. Meltzer, B. Gusterson, M. Esteller, O.P. Kallioniemi,

B. Wilfond, A. Borg, and J. Trent, “Gene-Expression Profiles in

Hereditary Breast Cancer,” The New England J. Medicine, vol. 344,

no. 8, pp. 539-548, Feb. 2001.

[7]

[8]

[9]

1384 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,VOL. 16, NO. 11, NOVEMBER 2004

Page 16

[29] J. Herrero, A. Valencia, and J. Dopazo, “A Hierarchical Unsu-

pervised Growing Neural Network for Clustering Gene Expres-

sion Patterns,” Bioinformatics, vol. 17, pp. 126-136, 2001.

[30] L.J. Heyer, S. Kruglyak, and S. Yooseph, “Exploring Expression

Data: Identification and Analysis of Coexpressed Genes,” Genome

Research, 1999.

[31] L.J. Heyer, S. Kruglyak, and S. Yooseph, “Exploring Expression

Data: Identification and Analysis of Coexpressed Genes,” Genome

Research, vol. 9, no. 11, pp. 1106-1115, 1999.

[32] A. Hill, E. Brown, M. Whitley, G. Tucker-Kellogg, C. Hunter, and

D. Slonim, “Evaluation of Normalization Procedures for Oligo-

nucleotide Array Data Based on Spiked cRNA Contros,” Genome

Biology, vol. 2, no. 12, pp. research0055.-1-0055.13, 2001.

[33] V.R. Iyer, M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J.C.F. Lee,

J.M. Trent, L.M. Staudt, J. Hudson Jr., M.S. Boguski, D. Lashkari,

D. Shalon, D. Botstein, and P.O. Brown, “The Transcriptional

Program in the Response of Human Fibroblasts to Serum,” Science,

vol. 283, pp. 83-87, 1999.

[34] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A

Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 254-323, Sept.

1999.

[35] L.M. Jakt, L. Cao, K.S.E. Cheah, and D.K. Smith, “Assessing

Clusters and Motifs from Gene Expression Data,” Genome

Research, vol. 11, pp. 1112-123, 2001.

[36] D. Jiang, J. Pei, and A. Zhang, “DHC: A Density-Based

Hierarchical Clustering Method for Time-Series Gene Expression

Data,” Proc. BIBE2003: Third IEEE Int’l Symp. Bioinformatics and

Bioeng., 2003.

[37] D. Jiang, J. Pei, and A. Zhang, “Interactive Exploration of

Coherent Patterns in Time-Series Gene Expression Data,” Proc.

Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data

Mining (SIGKDD ’03), 2003.

[38] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An

Introduction to Cluster Analysis. John Wiley and Sons, 1990.

[39] T. Kohonen, Self-Organization and Associative Memory. Berlin:

Spring-Verlag, 1984.

[40] L. Lazzeroni and A. Owen, “Plaid Models for Gene Expression

Data,” Statistica Sinica, vol. 12, no. 1, pp. 61-86, 2002.

[41] E. Levine and E. Domany, “Resampling Methods for Unsuper-

vised Estimation of Cluster Validity,” Neural Computation, vol. 13,

pp. 2573-2593, 2001.

[42] L. Li, W. Leping, C.R. Weinberg, T.A. Darden, and L.G. Pedersen,

“Gene Selection for Sample Classification Based on Gene Expres-

sion Data: Study of Sensitivity to Choice of Parameters of the ga/

knn Method,” Bioinformatics, vol. 17, pp. 1131-1142, 2001.

[43] W. Li, “Zipf’s Law in Importance of Genes for Cancer

ClassificationUsing Microarray Data,”

Genetics, Rockefeller Univ., Apr. 2001.

[44] D. Lockhart et al., “Expression Monitoring by Hybridization to

High-Density Oligonucleotide Arrays,” Nature Biotechnology,

vol. 14, pp. 1675-1680, 1996.

[45] G.J. McLachlan, R.W. Bean, and D. Peel, “A Mixture Model-Based

Approach to the Clustering of Microarray Expression Data,”

Bioinformatics, vol. 18, 413-422, 2002.

[46] J.B. McQueen, “Some Methods for Classification and Analysis of

Multivariate Observations,” Proc. Fifth Berkeley Symp. Math.

Statistics and Probability, vol. 1, pp. 281-297, 1967.

[47] E.J. Moler, M.L. Chow, and I.S. Mian, “Analysis of Molecular

Profile Data Using Generative and Discriminative Methods.”

Physiological Genomics, vol. 4, no. 2, pp. 109-126, 2000.

[48] L.T. Nguyen et al., “Flow Cytometric Analysis of in Vitro

Proinflammatory Cytokine Secretion in Peripheral Blood from

Multiple Sclerosis Patients,” J. Clinical Immunology, vol. 19, no. 3,

pp. 179-185, 1999.

[49] P.J. Park, M. Pagano, and M. Bonetti, “A Nonparametric Scoring

Algorithm for Identifying Informative Genes from Microarray

Data,” Proc. Pacific Symp. Biocomputing, pp. 52-63, 2001.

[50] C.M. Perou, S.S. Jeffrey, M.V.D. Rijn, C.A. Rees, M.B. Eisen, D.T.

Ross, A. Pergamenschikov, C.F. Williams, S.X. Zhu, J.C.F. Lee, D.

Lashkari, D. Shalon, P.O. Brown, and D. Bostein, “Distinctive

Gene Expression Patterns in Human Mammary Epithelial Cells

and Breast Cancers,” Proc. Nat’l Academy of Science, vol. 96, no. 16,

pp. 9212-9217, Aug. 1999.

[51] P.A. Ralf-Herwig, C. Muller, C. Bull, H. Lehrach, and J. O’Brien,

“Large-Scale Clustering of cDNA-Fingerprinting Data,” Genome

Research, vol. 9, pp. 1093-1105, 1999.

Lab of Statistical

[52] K. Rose, “Deterministic Annealing for Clustering, Compression,

Classification, Regression, and Related Optimization Problems,”

Proc. IEEE, vol. 96, pp. 2210-2239, 1998.

[53] K. Rose, E. Gurewitz, and G. Fox, Physical Rev. Letters, vol. 65,

pp. 945-948, 1990.

[54] M.D. Schena, R. Shalon, R. Davis, and P. Brown, “Quantitative

Monitoring of Gene Expression Patterns with a Compolementatry

DNA Microarray,” Science, vol. 270, pp. 467-470, 1995.

[55] J. Schuchhardt, D. Beule, A. Malik, E. Wolski, H. Eickhoff, H.

Lehrach, and H. Herzel, “Normalization Strategies for cDNA

Microarrays,” Nucleic Acids Research, vol. 28, no. 10, 2000.

[56] R. Shamir and R. Sharan, “Click: A Clustering Algorithm for Gene

Expression Analysis,” Proc. Eighth Int’l Conf. Intelligent Systems for

Molecular Biology (ISMB ’00), 2000.

[57] G. Sherlock, “Analysis of Large-Scale Gene Expression Data,”

Current Opinion in Immunology, vol. 12, no. 2, pp. 201-205, 2000.

[58] J.N. Siedow, “Meeting Report: Making Sense of Microarrays,”

Genome Biology, vol. 2, no. 2, pp. reports 4003.1-4003.2, 2001.

[59] F.D. Smet, J. Mathys, K. Marchal, G. Thijs, M. Moor, D. Bart, and

Y. Moreau, “Adaptive Quality-Based Clustering of Gene Expres-

sion Profiles,” Bioinformatics, vol. 18, pp. 735-746, 2002.

[60] R.R. Sokal, “Clustering and Classification: Background and

Current Directions,” Classifincation and Clustering, J. Van Ryzin,

ed., Academic Press, 1977.

[61] P.T. Spellman et al., “Comprehensive Identification of Cell Cycle-

Regulated Genes of the Yeast Saccharomyces Cerevisiae by

Microarray Hybridization,” Molecular Biology of the Cell, vol. 9,

no. 12, pp. 3273-3297, 1998.

[62] P. Tamayo, D. Solni, J. Mesirov, Q. Zhu, S. Kitareewan, E.

Dmitrovsky, E.S. Lander, and T.R. Golub, “Interpreting Patterns

of Gene Expression with Self-Organizing Maps:

Application to Hematopoietic Differentiation,” Proc. Nat’l Acad-

emy of Science, vol. 96, no. 6, pp. 2907-2912, Mar. 1999.

[63] C. Tang, A. Zhang, and J. Pei, “Mining Phenotypes and

Informative Genes from Gene Expression Data,” Proc. Ninth

ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining

(SIGKDD ’03), 2003.

[64] C. Tang, L. Zhang, A. Zhang, and M. Ramanathan, “Interrelated

Two-Way Clustering: An Unsupervised Approach for Gene

Expression Data Analysis,” Proc. BIBE2001: Second IEEE Int’l

Symp. Bioinformatics and Bioeng., pp. 41-48, 2001.

[65] C. Tang and A. Zhang, “An Iterative Strategy for Pattern

Discovery in High-Dimensional Data Sets,” Proc. 11th Int’l Conf.

Information and Knowledge Management (CIKM ’02), 2002.

[66] S. Tavazoie, D. Hughes, M.J. Campbell, R.J. Cho, and G.M.

Church, “Systematic Determination of Genetic Network Archi-

tecture,” Nature Genetics, pp. 281-285, 1999.

[67] A. Tefferi, E. Bolander, M. Ansell, D. Wieben, and C. Spelsberg,

“Primer on Medical Genomics Part III: Microarray Experiments

and Data Analysis,” Mayo Clinic Proc., vol. 77, pp. 927-940, 2002.

[68] J.G. Thomas, J.M. Olson, S.J. Tapscott, and L.P. Zhao, “An Efficient

and Robust Statistical Modeling Approach to Discover Differen-

tially Expressed Genes Using Genomic Expression Profiles,”

Genome Research, vol. 11, no. 7, pp. 1227-1236, 2001.

[69] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R.

Tibshirani, D. Botstein, and R. Altman, “Missing Value Estimation

Methods for Dna Microarrays,” Bioinformatics, in press.

[70] V.G. Tusher, R. Tibshirani, and G. Chu, “Significance Analysis of

Microarrays Applied to the Ionizing Radiation Response,” Proc.

Nat’l Academy of Science, vol. 98, no. 9, pp. 5116-5121, Apr. 2001.

[71] H. Wang, W. Wang, Y. Wei, J. Yang, and P.S. Yu, “Clustering by

Pattern Similarity in Large Data Sets,” SIGMOD 2002, Proc. ACM

SIGMOD Int’l Conf. Management of Data, pp. 394-405, 2002.

[72] X. Wen, S. Fuhrman, G.S. Michaels, D.B. Carr, S. Smith, J.L. Barker,

and R. Smomgyi, “Large-Scale Temporal Gene Expression

Mapping of Central Nervous System Development,” Proc. Nat’l

Academy of Science, vol. 95, pp. 334-339, Jan. 1998.

[73] E.P. Xing and R.M. Karp, “Cliff: Clustering of High-Dimensional

Microarray Data via Iterative Feature Filtering Using Normalized

Cuts,” Bioinformatics, vol. 17, no. 1, pp. 306-315, 2001.

[74] J. Yang, W. Wang, H. Wang, and P.S. Yu, “?-Cluster: Capturing

Subspace Correlation in a Large Data Set,” Proc. 18th Int’l Conf.

Data Eng. (ICDE 2002), pp. 517-528, 2002.

[75] K.Y. Yeung and W.L. Ruzzo, “An Empirical Study on Principal

Component Analysis for Clustering Gene Expression Data,”

Technical Report UW-CSE-2000-11-03, Dept. of Computer Science

& Eng., Univ. of Washington, 2000.

Methods and

JIANG ET AL.: CLUSTER ANALYSIS FOR GENE EXPRESSION DATA: A SURVEY1385

Page 17

[76] K.Y. Yeung, C. Fraley, A. Murua, A.E. Raftery, and W.L. Ruzz,

“Model-Based Clustering and Data Transformations for Gene

Expression Data,” Bioinformatics, vol. 17, pp. 977-987, 2001.

[77] K.Y. Yeung, D.R. Haynor, and W.L. Ruzzo, “Validating Clustering

for Gene Expression Data,” Bioinformatics, vol. 17, no. 4, pp. 309-

318, 2001.

Daxin Jiang received the BS degree in compu-

ter science from the University of Science and

Technology of China in 1998. He is currently a

PhD student in the Department of Computer

Science and Engineering, State University of

New York at Buffalo. His research interests

include bioinformatics, data mining, machine

learning, and infomation retrieval.

Chun Tang received the BS and MS degrees,

both in computer science, from Peking Univer-

sity, China, in 1996 and 1999, respectively. She

is currently a PhD candidate in the Department

of Computer Science and Engineering, State

University of New York at Buffalo. Her research

interests include bioinformatics, data mining,

machine learning, information retrieval, and

geographical information systems.

Aidong Zhang received the PhD degree in

computer science from Purdue University, West

Lafayette, Indiana, in 1994. She was an assis-

tant professor from 1994 to 1999, an associate

professor from 1999 to 2002, and has been a

professor since 2002 in the Department of

Computer Science and Engineering at State

University of New York at Buffalo. Her research

interests include multimedia systems, content-

based image retrieval, geographical information

systems, distributed database systems, bioinformatics, and data mining.

She is an author of more than 130 research publications in these areas.

Dr. Zhang’s research has been funded by the US National Science

Foundation, the US National Imagery and Mapping Agency, and the US

National Institutes of Health. She is on the ACM SIGMOD executive

committee and serves on the editorial boards of ACM Multimedia

Systems, the International Journal of Multimedia Tools and Applications,

International Journal of Distributed and Parallel Databases, and ACM

SIGMOD DiSC (Digital Symposium Collection). She was cochair of the

technical program committee for ACM Multimedia 2001. She has also

served on various conference program committees. Dr. Zhang is a

recipient of the National Science Foundation CAREER award and SUNY

Chancellor’s Research Recognition award.

. For more information on this or any other computing topic,

please visit our Digital Library at www.computer.org/publications/dlib.

1386 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,VOL. 16, NO. 11, NOVEMBER 2004