Cluster Analysis for Gene
Expression Data: A Survey
Daxin Jiang, Chun Tang, and Aidong Zhang
Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of
genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene
expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of
genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass
of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering
techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying
data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group
are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past
three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new
algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven
useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of
microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis
for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and
introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various
methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in
this field.
Index Terms—Microarray technology, gene expression data, clustering.
1 INTRODUCTION
1.1 Introduction to Microarray Technology
1.1.1 Measuring mRNA Levels
Compared with the traditional approach to genomic research, which has focused on the local examination and collection of data on single genes, microarray technologies have now made it possible to monitor the expression levels for tens of thousands of genes in parallel. The two major types of microarray experiments are the cDNA microarray [54] and oligonucleotide arrays (abbreviated oligo chip) [44]. Despite differences in the details of their experiment protocols, both types of experiments involve three common basic procedures [67]:
. Chip manufacture: A microarray is a small chip (made of chemically coated glass, nylon membrane, or silicon), onto which tens of thousands of DNA molecules (probes) are attached in fixed grids. Each grid cell relates to a DNA sequence.
. Target preparation, labeling, and hybridization: Typically, two mRNA samples (a test sample and a control sample) are reverse transcribed into cDNA (targets), labeled using either fluorescent dyes or radioactive isotopes, and then hybridized with the probes on the surface of the chip.
. The scanning process: Chips are scanned to read the signal intensity that is emitted from the labeled and hybridized targets.
Generally, both cDNA microarray and oligo chip experiments measure the expression level for each DNA sequence by the ratio of signal intensity between the test sample and the control sample; therefore, data sets resulting from both methods share the same biological semantics. In this paper, unless explicitly stated, we will refer to both the cDNA microarray and the oligo chip as microarray technology and term the measurements collected via both methods as gene expression data.
1.1.2 Preprocessing of Gene Expression Data
A microarray experiment typically assesses a large number
of DNA sequences (genes, cDNA clones, or expressed
sequence tags [ESTs]) under multiple conditions. These
conditions may be a time series during a biological process
(e.g., the yeast cell cycle) or a collection of different tissue
samples (e.g., normal versus cancerous tissues). In this
paper, we will focus on the cluster analysis of gene
expression data without making a distinction among
DNA sequences, which will uniformly be called “genes.”
Similarly, we will uniformly refer to all kinds of experimental conditions as "samples" if no confusion will be caused. A gene expression data set from a microarray experiment can be represented by a real-valued expression matrix M = {w_ij | 1 ≤ i ≤ n, 1 ≤ j ≤ m} (Fig. 1a), where the rows (G = {g_1, ..., g_n}) form the expression patterns of genes, the columns (S = {s_1, ..., s_m}) represent the expression profiles of samples, and each cell w_ij is the measured expression level of gene i in sample j. Fig. 1b includes some notation that will be used in the following sections.

1370 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004
The authors are with the Department of Computer Science and Engineering, State University of New York at Buffalo, 201 Bell Hall, Buffalo, NY 14260. E-mail: {djiang3, chuntang, azhang}@cse.buffalo.edu.
Manuscript received 23 Apr. 2002; revised 20 Mar. 2003; accepted 14 Aug. 2003.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number 116396.
1041-4347/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society.
The original gene expression matrix obtained from a scanning process contains noise, missing values, and systematic variations arising from the experimental procedure. Data preprocessing is indispensable before any cluster analysis can be performed. Some problems of data preprocessing have themselves become interesting research topics. Those questions are beyond the scope of this survey; an examination of the problem of missing value estimation appears in [69], and the problem of data normalization is addressed in [32], [55]. Furthermore, many clustering approaches apply one or more of the following preprocessing procedures: filtering out genes with expression levels which do not change significantly across samples, performing a logarithmic transformation of each expression level, or standardizing each row of the gene expression matrix with a mean of zero and a variance of one. In the following discussion of clustering algorithms, we will set aside the details of preprocessing procedures and assume that the input data set has already been properly preprocessed.
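The three preprocessing procedures above (log transformation, filtering of nearly invariant genes, row standardization) can be sketched in a few lines of NumPy. This is an illustrative implementation, not the survey's; the `min_std` variation threshold is a hypothetical parameter.

```python
import numpy as np

def preprocess(M, min_std=0.5):
    """Sketch of common preprocessing steps: log-transform the ratios,
    filter out genes whose expression barely changes across samples,
    and standardize each gene (row) to mean 0, variance 1.
    M: genes x samples matrix of expression ratios (test/control);
    min_std is an illustrative variation threshold, not from the survey."""
    M = np.log2(np.asarray(M, float))        # logarithmic transformation
    keep = M.std(axis=1) >= min_std          # filter nearly invariant genes
    M = M[keep]
    # standardize each remaining row to zero mean and unit variance
    M = (M - M.mean(axis=1, keepdims=True)) / M.std(axis=1, keepdims=True)
    return M, keep
```

After this step, each surviving row describes the shape of a gene's expression pattern rather than its absolute magnitude, which is what the proximity measures discussed in Section 1.2.3 compare.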
1.1.3 Applications of Clustering Gene Expression Data
Clustering techniques have proven to be helpful to understand gene function, gene regulation, cellular processes, and subtypes of cells. Genes with similar expression patterns (coexpressed genes) can be clustered together with similar cellular functions. This approach may further understanding of the functions of many genes for which information has not been previously available [66], [20]. Furthermore, coexpressed genes in the same cluster are likely to be involved in the same cellular processes, and a strong correlation of expression patterns between those genes indicates coregulation. Searching for common DNA sequences at the promoter regions of genes within the same cluster allows regulatory motifs specific to each gene cluster to be identified and cis-regulatory elements to be proposed [9], [66]. The inference of regulation through the clustering of gene expression data also gives rise to hypotheses regarding the mechanism of the transcriptional regulatory network [16]. Finally, clustering different samples on the basis of corresponding expression profiles may reveal subcell types which are hard to identify by traditional morphology-based approaches [2], [24].
1.2 Introduction to Clustering Techniques
In this section, we will first introduce the concepts of clusters and clustering. We will then divide the clustering tasks for gene expression data into three categories according to different clustering purposes. Finally, we will discuss the issue of proximity measure in detail.
1.2.1 Clusters and Clustering
Clustering is the process of grouping data objects into a set
of disjoint classes, called clusters, so that objects within a
class have high similarity to each other, while objects in
separate classes are more dissimilar. Clustering is an
example of unsupervised classification. “Classification” refers
to a procedure that assigns data objects to a set of classes.
“Unsupervised” means that clustering does not rely on
predefined classes and training examples while classifying
the data objects. Thus, clustering is distinguished from
pattern recognition or the areas of statistics known as
discriminant analysis and decision analysis, which seek to
find rules for classifying objects from a given set of
preclassified objects.
1.2.2 Categories of Gene Expression Data Clustering
Currently, a typical microarray experiment contains 10^3 to 10^4 genes, and this number is expected to reach the order of 10^6. However, the number of samples involved in a microarray experiment is generally fewer than 100. One of the characteristics of gene expression data is that it is meaningful to cluster both genes and samples. On one hand, coexpressed genes can be grouped in clusters based on their expression patterns [7], [20]. In such gene-based clustering, the genes are treated as the objects, while the samples are the features. On the other hand, the samples can be partitioned into homogeneous groups. Each group may correspond to some particular macroscopic phenotype, such as clinical syndromes or cancer types [24]. Such sample-based clustering regards the samples as the objects and the genes as the features. The distinction between gene-based clustering and sample-based clustering rests on the different characteristics of clustering tasks for gene expression data. Some clustering algorithms, such as K-means and hierarchical approaches, can be used both to group genes and to partition samples. We will introduce those algorithms as gene-based clustering approaches and will discuss how to apply them as sample-based clustering in Section 2.2.1.
Fig. 1. (a) A gene expression matrix. (b) Notation in this paper.
Both the gene-based and sample-based clustering approaches search for exclusive and exhaustive partitions of objects that share the same feature space. However, current thinking in molecular biology holds that only a small subset of genes participate in any cellular process of interest and that a cellular process takes place only in a subset of the samples. This belief calls for subspace clustering to capture clusters formed by a subset of genes across a subset of samples. For subspace clustering algorithms, genes and samples are treated symmetrically, so that either genes or samples can be regarded as objects or features. Furthermore, clusters generated through such algorithms may have different feature spaces.

While a gene expression matrix can be analyzed from different angles, gene-based clustering, sample-based clustering, and subspace clustering analysis face very different challenges. Thus, we may have to adopt very different computational strategies in the three situations. The details of the challenges and the representative clustering techniques pertinent to each clustering category will be discussed in Section 2.
1.2.3 Proximity Measurement for Gene Expression Data
Proximity measurement measures the similarity (or distance) between two data objects. Gene expression data objects, no matter genes or samples, can be formalized as numerical vectors O_i = {o_ij | 1 ≤ j ≤ p}, where o_ij is the value of the jth feature for the ith data object and p is the number of features. The proximity between two objects O_i and O_j is measured by a proximity function of the corresponding vectors O_i and O_j.

Euclidean distance is one of the most commonly used methods to measure the distance between two data objects. The distance between objects O_i and O_j in p-dimensional space is defined as:

Euclidean(O_i, O_j) = sqrt( Σ_{d=1}^{p} (o_id − o_jd)² ).
However, for gene expression data, the overall shapes of
gene expression patterns (or profiles) are of greater interest
than the individual magnitudes of each feature. Euclidean
distance does not score well for shifting or scaled patterns
(or profiles) [71]. To address this problem, each object
vector is standardized with zero mean and variance one
before calculating the distance [66], [59], [56].
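As a small illustration of why standardization helps, the following NumPy sketch (function names are ours, not from the survey) shows the distance computation and the zero-mean, unit-variance standardization. A pattern and its scaled version, which raw Euclidean distance scores as far apart, coincide after standardization.

```python
import numpy as np

def euclidean(oi, oj):
    # Euclidean(O_i, O_j) = sqrt(sum_d (o_id - o_jd)^2)
    oi, oj = np.asarray(oi, float), np.asarray(oj, float)
    return np.sqrt(np.sum((oi - oj) ** 2))

def standardize(o):
    # rescale a pattern to zero mean and unit variance, so that only
    # its shape, not its magnitude, contributes to the distance
    o = np.asarray(o, float)
    return (o - o.mean()) / o.std()
```

For example, [1, 2, 3] and its scaled version [2, 4, 6] are sqrt(14) apart in raw space, but after standardization both become the same vector and their distance drops to zero.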
An alternate measure is Pearson’s correlation coefficient, which measures the similarity between the shapes of two expression patterns (profiles). Given two data objects O_i and O_j, Pearson’s correlation coefficient is defined as

Pearson(O_i, O_j) = Σ_{d=1}^{p} (o_id − ō_i)(o_jd − ō_j) / ( sqrt(Σ_{d=1}^{p} (o_id − ō_i)²) · sqrt(Σ_{d=1}^{p} (o_jd − ō_j)²) ),

where ō_i and ō_j are the means of O_i and O_j, respectively. Pearson’s correlation coefficient views each object as a random variable with p observations and measures the similarity between two objects by calculating the linear relationship between the distributions of the two corresponding random variables.
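The definition above translates directly into NumPy; a minimal sketch (ours, not from the survey):

```python
import numpy as np

def pearson(oi, oj):
    """Pearson(O_i, O_j): linear correlation between two expression
    patterns, ranging from -1 (opposite shapes) to +1 (same shape)."""
    oi, oj = np.asarray(oi, float), np.asarray(oj, float)
    di, dj = oi - oi.mean(), oj - oj.mean()   # center each pattern
    return np.sum(di * dj) / np.sqrt(np.sum(di ** 2) * np.sum(dj ** 2))
```

A pattern and any positively scaled or shifted version of it score +1; a reversed pattern scores -1.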
Pearson’s correlation coefficient is widely used and has proven effective as a similarity measure for gene expression data [36], [64], [65], [74]. However, empirical study has shown that it is not robust with respect to outliers [30], thus potentially yielding false positives which assign a high similarity score to a pair of dissimilar patterns. If two patterns have a common peak or valley at a single feature, the correlation will be dominated by this feature, although the patterns at the remaining features may be completely dissimilar. This observation evoked an improved measure called Jackknife correlation [19], [30], defined as

Jackknife(O_i, O_j) = min{ρ_ij^(1), ..., ρ_ij^(l), ..., ρ_ij^(p)},

where ρ_ij^(l) is the Pearson’s correlation coefficient of data objects O_i and O_j with the lth feature deleted. Use of the Jackknife correlation avoids the "dominance effect" of single outliers. More general versions of Jackknife correlation that are robust to more than one outlier can similarly be derived. However, the generalized Jackknife correlation, which would involve the enumeration of different combinations of features to be deleted, would be computationally costly and is rarely used.
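A sketch of the Jackknife correlation as the minimum leave-one-feature-out Pearson correlation (the helper `pearson` is ours). Two patterns that agree only at a single shared spike score high under Pearson but low under Jackknife, illustrating the "dominance effect" described above.

```python
import numpy as np

def pearson(a, b):
    # plain Pearson correlation of two 1-D arrays
    da, db = a - a.mean(), b - b.mean()
    return np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))

def jackknife(oi, oj):
    """Jackknife(O_i, O_j) = min over l of the Pearson correlation
    computed with the lth feature deleted (robust to a single outlier)."""
    oi, oj = np.asarray(oi, float), np.asarray(oj, float)
    return min(pearson(np.delete(oi, l), np.delete(oj, l))
               for l in range(len(oi)))
```

Here the two patterns [1, 2, 1, 2, 10] and [2, 1, 2, 1, 10] disagree everywhere except at the shared spike in the last feature; Pearson scores them highly similar, while Jackknife exposes the disagreement.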
Another drawback of Pearson’s correlation coefficient is that it assumes an approximate Gaussian distribution of the points and may not be robust for non-Gaussian distributions [14], [16]. To address this, Spearman’s rank-order correlation coefficient has been suggested as the similarity measure. The ranking correlation is derived by replacing the numerical expression level o_id with its rank r_id among all conditions. For example, r_id = 3 if o_id is the third highest value among o_ik, where 1 ≤ k ≤ p. Spearman’s correlation coefficient does not require the assumption of Gaussian distribution and is more robust against outliers than Pearson’s correlation coefficient. However, as a consequence of ranking, a significant amount of information present in the data is lost. Our experimental results indicate that, on average, Spearman’s rank-order correlation coefficient does not perform as well as Pearson’s correlation coefficient.
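Spearman’s coefficient can be sketched as Pearson’s correlation applied to ranks. The code below (ours) ranks values in descending order to match the example in the text (rank 1 for the highest value) and, for simplicity, does not handle ties, which a production implementation would average.

```python
import numpy as np

def pearson(a, b):
    da, db = a - a.mean(), b - b.mean()
    return np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))

def spearman(a, b):
    """Spearman rank-order correlation: replace each value o_id with its
    rank r_id (1 = highest, as in the text), then correlate the ranks."""
    def ranks(v):
        order = np.argsort(-np.asarray(v, float))  # descending order
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        return r
    return pearson(ranks(a), ranks(b))
```

Because only ranks are compared, any monotonically increasing transformation of a pattern leaves the coefficient at +1, which is exactly the information loss the text refers to.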
Almost all of the clustering algorithms mentioned in this survey use either Euclidean distance or Pearson’s correlation coefficient as the proximity measure. When Euclidean distance is selected as the proximity measure, the standardization process o'_id = (o_id − μ_i) / σ_i is usually applied, where o_id is the dth feature of object O_i, while μ_i and σ_i are the mean and standard deviation of O_i, respectively. Suppose O'_i and O'_j are the standardized "objects" of O_i and O_j. Then, we can prove Pearson(O_i, O_j) = Pearson(O'_i, O'_j) and

Euclidean(O'_i, O'_j) = sqrt(2p) · sqrt(1 − Pearson(O'_i, O'_j)).

These two equations disclose the consistency between Pearson’s correlation coefficient and Euclidean distance after data standardization, i.e., if a pair of data objects O_i1, O_j1 has a higher correlation than pair O_i2, O_j2 (Pearson(O'_i1, O'_j1) > Pearson(O'_i2, O'_j2)), then pair O_i1, O_j1 has a smaller distance than pair O_i2, O_j2 (Euclidean(O'_i1, O'_j1) < Euclidean(O'_i2, O'_j2)). Thus, we can expect the effectiveness of a clustering algorithm to be equivalent whether Euclidean distance or Pearson’s correlation coefficient is chosen as the proximity measure.
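The consistency between the two measures can be checked numerically. In the sketch below (helper functions are ours; the standard deviation is the population one, dividing by p, which is what makes the identity exact), both identities hold for arbitrary vectors:

```python
import numpy as np

def pearson(a, b):
    da, db = a - a.mean(), b - b.mean()
    return np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))

def standardize(a):
    # zero mean, unit (population) variance: each standardized vector
    # then has squared norm exactly p, which yields the identity below
    return (a - a.mean()) / a.std()

a = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
b = np.array([2.0, 7.0, 1.0, 8.0, 2.0])
sa, sb = standardize(a), standardize(b)
p = len(a)

# Euclidean(O'_i, O'_j) versus sqrt(2p) * sqrt(1 - Pearson(O'_i, O'_j))
lhs = np.sqrt(np.sum((sa - sb) ** 2))
rhs = np.sqrt(2 * p) * np.sqrt(1 - pearson(sa, sb))
```

The derivation is short: each standardized vector has squared norm p and inner product p · Pearson, so Euclidean² = 2p − 2p · Pearson = 2p(1 − Pearson).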
2 CLUSTERING ALGORITHMS
As we mentioned in Section 1.2.2, a gene expression matrix can be analyzed in two ways. For gene-based clustering, genes are treated as data objects, while samples are considered as features. Conversely, for sample-based clustering, samples serve as data objects to be clustered, while genes play the role of features. The third category of cluster analysis applied to gene expression data, which is subspace clustering, treats genes and samples symmetrically such that either genes or samples can be regarded as objects or features. Gene-based, sample-based, and subspace clustering face very different challenges, and different computational strategies are adopted for each situation. In this section, we will introduce the gene-based clustering, sample-based clustering, and subspace clustering techniques, respectively.
2.1 Gene-Based Clustering
In this section, we will discuss the problem of clustering genes based on their expression patterns. The purpose of gene-based clustering is to group together coexpressed genes which indicate cofunction and coregulation. We will first present the challenges of gene-based clustering and then review a series of clustering algorithms which have been applied to group genes. For each clustering algorithm, we will first introduce the basic idea of the clustering process, and then highlight some features of the algorithm.
2.1.1 Challenges of Gene Clustering
Due to the special characteristics of gene expression data, and the particular requirements from the biological domain, gene-based clustering presents several new challenges and is still an open problem.
. First, cluster analysis is typically the first step in data mining and knowledge discovery. The purpose of clustering gene expression data is to reveal the natural data structures and gain some initial insights regarding data distribution. Therefore, a good clustering algorithm should depend as little as possible on prior knowledge, which is usually not available before cluster analysis. For example, a clustering algorithm which can accurately estimate the "true" number of clusters in the data set would be more favored than one requiring the predetermined number of clusters.
. Second, due to the complex procedures of microarray experiments, gene expression data often contain a huge amount of noise. Therefore, clustering algorithms for gene expression data should be capable of extracting useful information from a high level of background noise.
. Third, our empirical study has demonstrated that gene expression data are often "highly connected" [37], and clusters may be highly intersected with each other or even embedded one in another [36]. Therefore, algorithms for gene-based clustering should be able to effectively handle this situation.
. Finally, users of microarray data may not only be interested in the clusters of genes, but also in the relationship between the clusters (e.g., which clusters are close to each other and which clusters are remote from each other) and the relationship between the genes within the same cluster (e.g., which gene can be considered as the representative of the cluster and which genes are at the boundary area of the cluster). A clustering algorithm which can not only partition the data set but also provide some graphical representation of the cluster structure would be more favored by the biologists.
2.1.2 K-Means
The K-means algorithm [46] is a typical partition-based clustering method. Given a prespecified number K, the algorithm partitions the data set into K disjoint subsets which optimize the following objective function:

E = Σ_{i=1}^{K} Σ_{O ∈ C_i} |O − μ_i|².

Here, O is a data object in cluster C_i and μ_i is the centroid (mean of objects) of C_i. Thus, the objective function E tries to minimize the sum of the squared distances of objects from their cluster centers.
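A minimal sketch of the K-means iteration (Lloyd's algorithm) that optimizes the objective above; the random initialization from data points and the stopping rule are common choices, not prescribed by the survey:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: alternate between assigning objects to the
    nearest centroid and recomputing centroids as cluster means, which
    monotonically decreases E = sum_i sum_{O in C_i} |O - mu_i|^2."""
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # assignment step: nearest centroid in Euclidean distance
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                      # assignments stable: converged
        labels = new_labels
        # update step: each centroid becomes the mean of its objects
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids
```

On well-separated data the loop stabilizes in a handful of iterations, which matches the fast convergence the text reports; its drawbacks (K must be given, every gene is forced into a cluster) are visible directly in the code.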
The K-means algorithm is simple and fast. The time complexity of K-means is O(l · k · n), where l is the number of iterations and k is the number of clusters. Our empirical study has shown that the K-means algorithm typically converges in a small number of iterations. However, it also has several drawbacks as a gene-based clustering algorithm. First, the number of gene clusters in a gene expression data set is usually unknown in advance. To detect the optimal number of clusters, users usually run the algorithms repeatedly with different values of k and compare the clustering results. For a large gene expression data set which contains thousands of genes, this extensive parameter fine-tuning process may not be practical. Second, gene expression data typically contain a huge amount of noise; however, the K-means algorithm forces each gene into a cluster, which may cause the algorithm to be sensitive to noise [59], [57].
Recently, several new clustering algorithms [51], [31], [59] have been proposed to overcome the drawbacks of the K-means algorithm. These algorithms typically use some global parameters to control the quality of resulting clusters (e.g., the maximal radius of a cluster and/or the minimal distance between clusters). Clustering is the process of extracting all of the qualified clusters from the data set. In this way, the number of clusters can be automatically determined and those data objects which do not belong to any qualified clusters are regarded as outliers. However, the qualities of clusters in gene expression data sets may vary widely. Thus, it is often a difficult problem to choose the appropriate globally constraining parameters.
2.1.3 Self-Organizing Map
The Self-Organizing Map (SOM) was developed by Kohonen [39], on the basis of a single-layered neural network.
The data objects are presented at the input and the output
neurons are organized with a simple neighborhood
structure such as a two-dimensional p × q grid. Each neuron
of the neural network is associated with a reference vector,
and each data point is “mapped” to the neuron with the
“closest” reference vector. In the process of running the
algorithm, each data object acts as a training sample which
directs the movement of the reference vectors towards the
denser areas of the input vector space, so that those
reference vectors are trained to fit the distributions of the
input data set. When the training is complete, clusters are
identified by mapping all data points to the output neurons.
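The training loop described above can be sketched as follows. The grid size, learning rate, and Gaussian neighborhood width are illustrative assumptions; real SOM implementations usually decay both the rate and the neighborhood over time, which this sketch omits.

```python
import numpy as np

def train_som(X, grid=(2, 2), n_epochs=20, lr=0.5, sigma=0.5, seed=0):
    """Minimal SOM training sketch. Each output neuron on the p x q grid
    holds a reference vector; every data object pulls its winning neuron,
    and more weakly the winner's grid neighbors, toward itself."""
    rng = np.random.default_rng(seed)
    p, q = grid
    coords = np.array([(r, c) for r in range(p) for c in range(q)], float)
    W = rng.normal(size=(p * q, X.shape[1]))             # reference vectors
    for _ in range(n_epochs):
        for x in X:
            win = np.argmin(np.linalg.norm(W - x, axis=1))   # closest neuron
            # Gaussian neighborhood on the output grid, centered at winner
            h = np.exp(-np.sum((coords - coords[win]) ** 2, axis=1)
                       / (2 * sigma ** 2))
            W += lr * h[:, None] * (x - W)                   # move toward x
    return W

def assign(X, W):
    # map each data object to its nearest reference vector (its cluster)
    return np.array([np.argmin(np.linalg.norm(W - x, axis=1)) for x in X])
```

After training, distinct regions of the data pull different reference vectors toward themselves, and clusters are read off by mapping each object to its nearest neuron, as the text describes.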
One of the remarkable features of SOM is that it generates an intuitively appealing map of a high-dimensional data set in 2D or 3D space and places similar clusters near each other. The neuron training process of SOM provides a relatively more robust approach than K-means to the clustering of highly noisy data [62], [29]. However, SOM requires users to input the number of clusters and the grid structure of the neuron map. These two parameters are preserved through the training process; hence, improperly specified parameters will prevent the recovery of the natural cluster structure. Furthermore, if the data set is abundant with irrelevant data points, such as genes with invariant patterns, SOM will produce an output in which this type of data populates the vast majority of clusters [29]. In this case, SOM is not effective because most of the interesting patterns may be merged into only one or two clusters and cannot be identified.
2.1.4 Hierarchical Clustering
In contrast to partition-based clustering, which attempts to directly decompose the data set into a set of disjoint clusters, hierarchical clustering generates a hierarchical series of nested clusters which can be graphically represented by a tree, called a dendrogram. The branches of a dendrogram not only record the formation of the clusters but also indicate the similarity between the clusters. By cutting the dendrogram at some level, we can obtain a specified number of clusters. By reordering the objects such that the branches of the corresponding dendrogram do not cross, the data set can be arranged with similar objects placed together.
Hierarchical clustering algorithms can be further divided into agglomerative approaches and divisive approaches based on how the hierarchical dendrogram is formed. Agglomerative algorithms (bottom-up approach) initially regard each data object as an individual cluster and, at each step, merge the closest pair of clusters until all the groups are merged into one cluster. Divisive algorithms (top-down approach) start with one cluster containing all the data objects and, at each step, split a cluster until only singleton clusters of individual objects remain. For agglomerative approaches, different measures of cluster proximity, such as single link, complete link, and minimum variance [18], [38], derive various merge strategies. For divisive approaches, the essential problem is to decide how to split clusters at each step. Some are based on heuristic methods such as the deterministic annealing algorithm [3], while many others are based on the graph-theoretical methods which we will discuss later.
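A naive sketch of the agglomerative (bottom-up) procedure with single-link or complete-link proximity. It recomputes inter-cluster distances from scratch at every step, so it is far slower than production implementations, but it shows the merge logic plainly; stopping at `n_clusters` corresponds to cutting the dendrogram at some level.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="single"):
    """Bottom-up hierarchical clustering sketch: start with singleton
    clusters and repeatedly merge the closest pair.
    linkage: 'single' (min pairwise distance) or 'complete' (max)."""
    X = np.asarray(X, float)
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    agg = np.min if linkage == "single" else np.max
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # inter-cluster proximity under the chosen linkage
                d = agg(D[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)    # merge the closest pair
    return clusters
```

The greedy character the text criticizes is explicit here: once two clusters are merged, no later step ever revisits that decision.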
Eisen et al. [20] applied an agglomerative algorithm called UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and adopted a method to graphically represent the clustered data set. In this method, each cell of the gene expression matrix is colored on the basis of the measured fluorescence ratio, and the rows of the matrix are reordered based on the hierarchical dendrogram structure and a consistent node-ordering rule. After clustering, the original gene expression matrix is represented by a colored table (a cluster image) where large contiguous patches of color represent groups of genes that share similar expression patterns over multiple conditions.
Alon et al. [3] split the genes through a divisive approach, called the deterministic-annealing algorithm (DAA) [53], [52]. First, two initial cluster centroids C_j, j = 1, 2, were randomly defined. The expression pattern of gene k was represented by a vector g_k, and the probability of gene k belonging to cluster j was assigned according to a two-component Gaussian model:

P_j(g_k) = exp(−β |g_k − C_j|²) / Σ_j' exp(−β |g_k − C_j'|²).

The cluster centroids were recalculated by

C_j = Σ_k g_k P_j(g_k) / Σ_k P_j(g_k).

An iterative process (the EM algorithm) was then applied to solve P_j and C_j (the details of the EM algorithm will be discussed later). For β = 0, there was only one cluster, C_1 = C_2. When β was increased in small steps until a threshold was reached, two distinct, converged centroids emerged. The whole data set was recursively split until each cluster contained only one gene.
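One level of the split described above can be sketched as an EM-style loop with a fixed β; the initialization from two perturbed data points is an illustrative choice, and the gradual increase of β from zero is omitted here for brevity.

```python
import numpy as np

def soft_two_means(X, beta, n_iter=50, seed=0):
    """Sketch of one level of the deterministic-annealing split:
    soft assignments under the two-component Gaussian model above,
    with centroids recomputed as probability-weighted means."""
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    # start from two data points, slightly perturbed so C_1 != C_2
    C = X[rng.choice(len(X), size=2, replace=False)] \
        + rng.normal(scale=1e-3, size=(2, X.shape[1]))
    for _ in range(n_iter):
        # P_j(g_k) proportional to exp(-beta |g_k - C_j|^2), normalized over j
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        P = np.exp(-beta * d2)
        P /= P.sum(axis=1, keepdims=True)
        # C_j = sum_k g_k P_j(g_k) / sum_k P_j(g_k)
        C = (P.T @ X) / P.sum(axis=0)[:, None]
    return P, C
```

With β large enough, the assignments become nearly hard and the two centroids separate, which is the point at which the recursive split proceeds on each half.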
Hierarchical clustering not only groups together genes with similar expression patterns but also provides a natural way to graphically represent the data set. The graphic representation allows users a thorough inspection of the whole data set and an initial impression of the distribution of the data. Eisen’s method is much favored by many biologists and has become the most widely used tool in gene expression data analysis [20], [3], [2], [33], [50]. However, the conventional agglomerative approach suffers from a lack of robustness [62], i.e., a small perturbation of the data set may greatly change the structure of the hierarchical dendrogram. Another drawback of the hierarchical approach is its high computational complexity. To construct a "complete" dendrogram (where each leaf node corresponds to one data object, and the root node corresponds to the whole data set), the clustering process should take (n² − n)/2 merging (or splitting) steps. The time complexity for a typical agglomerative hierarchical algorithm is O(n² log n) [34]. Furthermore, for both agglomerative and divisive approaches, the "greedy" nature of hierarchical clustering prevents the refinement of the previous clustering. If a "bad" decision is made in the initial steps, it can never be corrected in the following steps.
2.1.5 GraphTheoretical Approaches
Given a data set X, we can construct a proximity matrix P, where P[i, j] = proximity(O_i, O_j), and a weighted graph