
DHC: A Density-based Hierarchical Clustering Method for Time

Series Gene Expression Data

Daxin Jiang, Jian Pei, Aidong Zhang

Department of Computer Science and Engineering, State University of New York at Buffalo

Email: {djiang3, jianpei, azhang}@cse.buffalo.edu

Abstract

Clustering the time series gene expression data is an im-

portant task in bioinformatics research and biomedical ap-

plications. Recently, some clustering methods have been

adaptedorproposed. However, someconcernsstillremain,

suchastherobustnessoftheminingmethods, aswellasthe

quality and the interpretability of the mining results.

In this paper, we tackle the problem of effectively clus-

tering time series gene expression data by proposing al-

gorithm DHC, a density-based, hierarchical clustering

method. We use a density-based approach to identify the

clusters such that the clustering results are of high quality

and robustness. Moreover, the mining result is in the form

of a density tree, which uncovers the embedded clusters in

a data set. The inner structures, the borders and the outliers of the clusters can be further investigated using the

attraction tree, which is an intermediate result of the min-

ing. By these two trees, the internal structure of the data

set can be visualized effectively. Our empirical evaluation

using some real-world data sets shows that the method is

effective, robust and scalable. It matches the ground truth

provided by bioinformatics experts very well in the sample

data sets.

1 Introduction

DNA microarray technology [11, 12] has now made it possible to monitor simultaneously the expression levels of thousands of genes during important biological processes [15] and across collections of related samples [1]. It is often an important task to find genes with similar expression

patterns (co-expressed genes) from DNA microarray data.

First, co-expressed genes may demonstrate a significant

enrichment for function analysis of the genes [2, 17, 6, 13].

We may understand the functions of some poorly charac-

terized or novel genes better by testing them together with

the genes with known functions. Second, co-expressed

genes with strong expression pattern correlations may in-

dicate co-regulation and help uncover the regulatory ele-

ments in transcriptional regulatory networks [17]. Cluster

techniques, which are essential in data mining process for

exploring natural structure and identifying interesting pat-

terns inunderlying data, haveprovedto beusefulinfinding

co-expressed genes.

In cluster analysis, one wishes to partition the given data

set into groups based on the given features such that the

data objects in the same group are more similar to each

other than the data objects in other groups. Various cluster-

ing algorithms have been applied on gene expression data

with promising results [6, 17, 16, 4, 13]. However, as indicated in some previous studies (e.g., [8, 16]), many conventional clustering algorithms originating from non-biological fields may suffer from problems when mining gene expression data, such as having to specify the number of clusters, lacking robustness to noise, and handling embedded clusters and highly intersected clusters poorly.

Recently, some specifically designed algorithms for clus-

tering gene expression data have been proposed aiming at

those problems [8, 4].

Unlike other kinds of data, gene expression data usually have several distinguishing characteristics. First, gene expression data sets are often small (on the scale of thousands) compared to some other large databases (e.g., multimedia databases and transaction databases). A gene expression data set can often be held in main memory.

Second, for many dedicated microarray experiments, peo-

ple are usually interested in the expression patterns of only

a subset of all the genes. Other gene patterns are roughly

considered insignificant, and thus become noise. Hence, in

the gene expression data analysis, people are much more

concerned with the effectiveness and interpretability of the

clustering results than the efficiency of the clustering algo-

rithm. How to group co-expressed genes together meaningfully and extract the useful patterns intelligently from noisy data sets are two major challenges for clustering gene expression data.

In this paper, we investigate the problems of effectively

clustering gene expression data and make the following

contributions. First, we analyze and examine a good num-

ber of existing clustering algorithms in the context of

clustering gene expression data, and clearly identify the

challenges. Second, we develop DHC, a density-based, hierarchical clustering method aiming at gene expression data. DHC is a density-based approach, so it effectively solves some problems that most distance-based approaches


cannot handle. Moreover, DHC is a hierarchical method.

The mining result is in the form of a tree of clusters. The

internal structure of the data set can be visualized effec-

tively. At last, we conduct an extensive performance study

on DHC and some related methods. Our experimental re-

sults show that DHC is effective. The mining results match

the ground truth given by the bioinformatics experts nicely

on real data sets. Moreover, DHC is robust with respect to

noise and scalable with respect to database size.

The remainder of the paper is organized as follows. In

Section 2, we analyze some important existing clustering

methods in the context of clustering gene expression data,

and identify the challenges. In Section 3, we discuss the

density measurement for density-based clustering of gene

expression, and develop algorithm DHC. The extensive experimental results are reported in Section 4. The paper is

concluded in Section 5.

2 Related Work

Various clustering algorithms have been applied to gene expression data. It has been shown that clustering is helpful in identifying groups of co-expressed genes and the corresponding expression patterns. Nevertheless, several chal-

lenges arise. In this section, we identify the challenges by

a brief survey of some typical clustering methods.

2.1 Partition-based Algorithms

K-means [17] and SOM (Self-Organizing Map) [16] are two typical partition-based clustering algorithms. Although useful, these algorithms suffer from the following drawbacks, as pointed out in [8, 14]. First, both K-means and SOM require the users to provide the number of clusters as

a parameter. Since clustering is usually an explorative task

in the initial analysis of gene expression data sets, such in-

formation is often unavailable. Another disadvantage of

the partition-based approaches is that they force each data

object into a cluster, which makes the partition-based ap-

proaches sensitive to outliers.

Recently, some new algorithms have been developed for

clustering time-series gene expression data. They partic-

ularly addressed the problems of outliers and the number

of clusters discussed above. For example, Ben-Dor et al. [4] introduced the idea of a "corrupted clique graph" data model and presented a heuristic algorithm CAST (for Cluster Affinity Search Technique) based on the data model. In [8], the authors described a two-step procedure (Adapt) to identify one cluster, where the first step is to estimate the cluster center $c$ and the second step is to estimate the radius $r$ of the cluster. Once $c$ and $r$ are determined, a cluster is defined as $\{ o \mid dist(o, c) \le r \}$. Both algorithms extract clusters from the data set one after another until no more clusters can be found. Therefore, the algorithms can automatically determine the number of clusters, and genes not belonging to any cluster are regarded as outliers. However, the criteria for clusters defined by those algorithms are either based on some global parameters or dependent on some assumptions about the cluster structure of the data set. For example, CAST uses an affinity threshold parameter to control the average pairwise similarity between objects within a cluster, and Adapt assumes that the cluster has the same radius in each direction in the high-dimensional object space. However, the clustering results may be quite sensitive to different parameter settings, and the assumptions about cluster structure may not always hold. In particular, CAST and Adapt may not be effective with embedded clusters and highly intersected clusters, respectively.

2.2 Hierarchical Clustering

A hierarchical clustering algorithm does not generate a set

of disjoint clusters. Instead, it generates a hierarchy of

nested clusters that can be represented by a tree, called a

dendrogram. Based on how the hierarchical decomposition

is formed, hierarchical clustering algorithms can be further

divided into agglomerative algorithms (i.e., bottom-up ap-

proaches, e.g., [6]) and divisive algorithms (top-down ap-

proaches, e.g., [2, 13]). In fact, the hierarchical methods are particularly favored by biologists because they may give more insight into the structure of the clusters than the other methods.

However, the hierarchical methods also have some

drawbacks. First, it is sometimes subtle to determine where

to cut the dendrogram and derive clusters. Usually, this

step is done by domain experts’ visual inspection. Second,

it is hard to tell the inner structure of a cluster from the dendrogram, e.g., which object is the medoid of the cluster and

which objects are the borders of the cluster. Last, many

hierarchical methods are considered to lack robustness

and uniqueness [16]. They may be sensitive to the order of

input and small perturbations in the data.

2.3 Density-based clustering

Density-based clustering algorithms [7, 9, 3] characterize the data distribution by the density of each data object. Clustering is the process of identifying dense areas in the object space. Conventional density-based approaches, such as DBSCAN [7], classify a data object $o$ as one of the cores of a cluster if $o$ has more than $MinPts$ neighbors within its $Eps$-neighborhood. Clusters are formed by connecting neighboring "core" objects, and the "non-core" objects either serve as the boundaries of clusters or become outliers. Since the noise in a data set is typically randomly distributed, the density within a cluster should be significantly higher than that of the noise. Therefore, density-based approaches have the advantage of extracting clusters from a highly noisy environment, which is the case for time-series gene expression data.

However, the performance of DBSCAN is quite sensitive to the parameters of object density, namely, $Eps$ and $MinPts$.


For a complex data set, the appropriate parameters are hard to specify. Our experimental study has demonstrated that DBSCAN tends to result in either a large number of trivial clusters or a few huge clusters merged from several smaller ones for time-series gene expression data. Other density-based approaches (e.g., OPTICS [3] and DENCLUE [9]) are more robust to their algorithm parameters. However, none of the existing density-based algorithms provides a hierarchical cluster structure, which would give a thorough picture of the data distribution and help users understand the relationships between the clusters and the data objects.
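The core-object criterion described above can be sketched as follows. This is a minimal illustration, not DBSCAN itself; `eps` and `min_pts` stand for the Eps and MinPts parameters, and the distances are assumed to be precomputed:

```python
def is_core(i, dist_matrix, eps, min_pts):
    """DBSCAN-style test: object i is a core object if at least min_pts
    other objects lie within distance eps of it."""
    neighbors = sum(1 for j, d in enumerate(dist_matrix[i])
                    if j != i and d <= eps)
    return neighbors >= min_pts
```

Note that the result depends directly on the global `eps` and `min_pts` values, which is exactly the sensitivity discussed above.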

2.4 What are the challenges?

Based on the above analysis, to conduct effective clustering

analysis over gene expression data, we need to develop an

algorithm meeting the following requirements.

First, the clustering result should be highly interpretable

and easy to visualize. As gene expression data is com-

plicated, the interpretability and visualization become very

important. Second, the clustering method should be able to

determine the number of clusters automatically. Gene ex-

pression data sets are usually noisy. It is hard to guess the

number of clusters. A method adaptive to the natural num-

ber of clusters is highly preferable. Third, the clustering

method should be robust to noise, outliers, and the param-

eters. It is well recognized that the gene expression data are

usually noisy and the rules behind the data are unknown.

Thus, the method should be robust so that it can be used as

the first step to explore the valuable patterns. Last but not least, the clustering method should be able to handle embedded clusters and highly intersected clusters effectively.

The structure of gene expression data is often complicated.

It is unlikely that the data space can be clearly divided into

several independent clusters. Instead, the user may be in-

terested in both the clusters and the connections among the

clusters. The clustering method should be able to uncover

the whole picture.

3 The Algorithms

In this section, we propose DHC, an algorithm mining the

cluster structure of a data set as a density tree. First, we

discuss how to define the density of objects properly, and

then we develop the algorithm. The algorithm works in

two steps. First, all objects in a data set are organized into

an attraction tree. Then, the attraction tree is summarized

as clusters and dense areas, and a density tree is derived as

the summary structure.

3.1 Definition of density

When clustering gene expression data, we want to group

genes with similar expression patterns. Thus, we choose

the correlation coefficient, which is capable of capturing the similarity between two vectors based on their expression patterns rather than their absolute magnitudes. The correlation coefficient for two data objects $o_i$ and $o_j$ in a $d$-dimension space is defined as
$$\rho(o_i, o_j) = \frac{\sum_{k=1}^{d} (o_{ik} - \mu_{o_i})(o_{jk} - \mu_{o_j})}{\sqrt{\sum_{k=1}^{d} (o_{ik} - \mu_{o_i})^2} \cdot \sqrt{\sum_{k=1}^{d} (o_{jk} - \mu_{o_j})^2}},$$
where $o_{ik}$ is the $k$-th scalar of data object $o_i$ and $\mu_{o_i}$ is the mean of the scalars of data object $o_i$. Note that the correlation coefficient $\rho$ ranges between $-1$ and $1$. The larger the value, the more similar the two objects are to each other.

We define the distance between objects $o_i$ and $o_j$ as
$$dist(o_i, o_j) = \begin{cases} 1 - \rho(o_i, o_j) & \text{if } \rho(o_i, o_j) \ge 0, \\ 1 & \text{otherwise.} \end{cases}$$
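As a minimal sketch, the correlation coefficient and the correlation-based distance of this section can be computed as follows; the constant value used for the negative-correlation branch is an assumption of this illustration:

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    d = len(x)
    mx, my = sum(x) / d, sum(y) / d
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - my) ** 2 for yi in y))
    return num / den

def distance(x, y):
    """Correlation-based distance; the 'otherwise' value 1 is an assumption."""
    rho = correlation(x, y)
    return 1.0 - rho if rho >= 0 else 1.0
```

Two perfectly correlated profiles thus have distance 0, regardless of their absolute magnitudes.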

Given a radius $r$, the neighborhood of $o_i$ w.r.t. $r$ in a $d$-dimension space forms a hyper-sphere $S(o_i, r)$. The set of objects in the hyper-sphere is $N(o_i, r) = \{ o_j \mid dist(o_i, o_j) \le r \}$. The volume of the hyper-sphere is $V(S(o_i, r)) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$. We ignore the global constant coefficient $\frac{\pi^{d/2}}{\Gamma(d/2+1)}$ and define $V'(S(o_i, r)) = r^d$.

To precisely describe the distribution of neighbors of object $o_i$, we divide $S(o_i, r)$ into a series of hyper-shells, such that each hyper-shell occupies exactly a unit volume. The idea is demonstrated in Figure 1.

Figure 1: Hyper-sphere and hyper-shells w.r.t. object $o_i$, with the associated radius table ($r_1, r_2, r_3, \ldots$) and weight table ($w_1, w_2, w_3, \ldots$).

The radius of the $k$-th hyper-shell is $r_k = V'^{-1}(k) = k^{1/d}$ ($k = 1, 2, \ldots$), so that each hyper-shell $Shell_k = N(o_i, r_{k+1}) \setminus N(o_i, r_k) = \{ o_j \mid r_k < dist(o_i, o_j) \le r_{k+1} \}$ occupies exactly a unit volume.

By hyper-shells, the neighborhood of $o_i$ is discretized and forms a histogram, where each bin $b_k$ contains the objects falling into hyper-shell $Shell_k$. For each bin $b_k$, we define a weight $weight(b_k)$ for the contribution from the objects in $b_k$ to the density of $o_i$; intuitively, bins closer to $o_i$ receive higher weights. Now, we are ready to define the density of an object $o$:
$$density(o) = \sum_{k=1}^{n} weight(b_k) \cdot count(b_k),$$
where $count(b_k)$ is the number of objects falling into bin $b_k$.
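The shell-based density can be sketched as follows. For simplicity, this illustration uses Euclidean distance and an assumed geometrically decaying weight of 1/2^k per shell; the correlation-based distance and the paper's actual weights would slot in the same way:

```python
import math

def shell_density(points, i, n_shells=5, weight=lambda k: 0.5 ** k):
    """Density of object i as weighted counts over unit-volume hyper-shells.
    Shell k covers distances ((k-1)^(1/d), k^(1/d)]; after dropping the
    constant sphere coefficient, every such shell has unit volume."""
    d = len(points[i])
    radii = [k ** (1.0 / d) for k in range(n_shells + 1)]  # r_0 = 0, r_k = k^(1/d)
    density = 0.0
    for j, p in enumerate(points):
        if j == i:
            continue
        dist = math.dist(points[i], p)      # Euclidean, for illustration only
        for k in range(1, n_shells + 1):
            if radii[k - 1] < dist <= radii[k]:
                density += weight(k)
                break
    return density
```

An object surrounded by close neighbors accumulates contributions from the innermost, highest-weight shells, while an isolated object accumulates little or nothing.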


In our density definition, we do not simply set up a threshold $r$ and count the number of data objects within the neighborhood $r$ as the density. On the contrary, we discretize the neighborhood of object $o_i$ into a series of hyper-shells and calculate the density of an object as the sum of the contributions from the individual hyper-shells. Thus, our definition avoids the difficulty of choosing a good global threshold $r$ and accurately reflects the neighbor distribution around the specific object.

3.2 Building an attraction tree

As the first step of the mining, we organize all objects in

the data set into a hierarchy based on their density. The

resulting structure is called an attraction tree, since the tree

is built by considering the attraction among objects in a

data set. Intuitively, an object with high density “attracts”

some other objects with lower density. We formalize the

ideas as follows.

The attraction between two data objects $o_1$ and $o_2$ ($o_1 \ne o_2$) in a $d$-dimensional space is defined as follows, where $d$ is the dimensionality of the objects:
$$attract(o_1, o_2) = \frac{density(o_1) \cdot density(o_2)}{dist(o_1, o_2)^{d-1}}.$$

The attraction is said to be from $o_i$ to $o_j$ if $density(o_i) < density(o_j)$, denoted as $o_i \rightarrow o_j$. In the case that two objects tie, we can artificially assign $o_i \rightarrow o_j$ for $i < j$. Thus, an object $o$ is attracted by a set of objects $A(o)$ whose densities are larger than that of $o$, i.e., $A(o) = \{ o' \mid density(o') > density(o) \}$. We define the attractor of $o$ as the object $o^* \in A(o)$ with the largest attraction to $o$, i.e.,
$$attractor(o) = \arg\max_{o' \in A(o)} attract(o', o).$$

The process of determining the attractor of each data object is as follows. For each data object $o_i$, we initialize $o_i$'s attractor as itself. Then we search for $A(o_i)$ and compare the attraction from each object in $A(o_i)$ to $o_i$. Finally, the winner becomes the attractor of $o_i$. The only special case is the object $o_{max}$ that has the largest density, for which $A(o_{max})$ will be empty. In this case, we set $o_{max}$'s attractor as itself.

The attraction relation from one object to another (i.e., $o_i \rightarrow o_j$) is a partial order. Based on this order, we can derive an attraction tree. The tree construction process is as follows. First, each data object $o_i$ is a singleton attraction tree $T_i$. Then, we scan the data set once. For each data object $o_i$, we find its attractor $o_j$ and insert $T_i$ as a child of $o_j$. Thus, the original singleton trees merge with the others during the scanning process. When the scanning process is over, all data objects form an attraction tree reflecting the attraction hierarchy. The one special case here is $o_{max}$, whose attractor is itself. The corresponding attraction tree $T_{max}$ cannot be a child of any other tree; instead, $o_{max}$ is the root of the resulting attraction tree, and all other data objects are its descendants.
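The attractor computation can be sketched as follows. This is a simplified rendering under assumptions: the exponent on the distance is a configurable placeholder, and the tiny additive constant only guards against zero distances:

```python
def find_attractors(density, dist, exponent=1):
    """parent[i] = index of the attractor of object i; the object with the
    globally largest density ends up as its own parent (the tree root)."""
    n = len(density)
    parent = list(range(n))            # every object starts as its own attractor
    for i in range(n):
        best, best_attr = i, float("-inf")
        for j in range(n):
            # only denser objects may attract i (ties broken by index)
            if density[j] > density[i] or (density[j] == density[i] and j > i):
                attr = density[i] * density[j] / (dist[i][j] ** exponent + 1e-12)
                if attr > best_attr:
                    best, best_attr = j, attr
        parent[i] = best
    return parent
```

The returned parent array encodes the attraction tree directly: following parents from any object eventually reaches the maximum-density root.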

3.3 Deriving a density tree

The attraction tree constructed in Section 3.2 includes ev-

ery object in the data set. Thus, the tree can be bushy. To

identify the really meaningful clusters and their hierarchi-

cal structures, we need to identify the clusters and prune

the noise and outliers. This is done by a summarization of

clusters and dense areas in the form of a density tree.

There are two kinds of nodes in a density tree, namely

the collection nodes and the cluster nodes. A collection

node is an internal node in the tree and represents a dense

area of the data set. A cluster node represents a cluster that

will not be decomposed further. Each node has the medoid

of the dense area or the cluster as its representative.

Figure 2 is an example of a density tree. At the root of the tree, the whole data set is regarded as a dense area and denoted as a root (collection) node $A_0$. This dense area consists of two dense sub-areas, i.e., $A_1$ and $A_2$. The dense sub-areas can be further decomposed into finer sub-dense areas until the sub-areas meet some termination criterion. The sub-areas at the leaf level are represented by cluster nodes (here $C_1$, $C_2$, $C_3$, and $C_4$) in the density tree. Each cluster node corresponds to one cluster in the data set.

Figure 2: A density tree.

How to derive a density tree from an attraction tree?

The basic idea is that, first, we have the whole data set as

one dense area to split, then, we recursively split the dense

areas until each dense sub-area contains only one cluster.

To determine the clusters, the dense areas and the

bridges between them, we introduce two parameters: a similarity threshold $\epsilon$ and a minimum number of objects threshold $MinPts$. For an edge $o_i \rightarrow o_j$ in an attraction tree, where $o_i$ is the parent of $o_j$, the attraction sub-tree $T_j$ rooted at $o_j$ is a dense sub-area if and only if (1) $count(T_j) \ge MinPts$ and (2) $sim(o_j, o_i) \le \epsilon$. In other words, a dense area is identified if and only if there are at least $MinPts$ objects in the area and the similarity between the


center of the area and the center of the higher level area is

no more than $\epsilon$.

Once the dense areas are identified, we can derive the density tree using the dense areas and their centers' attraction relations stored in the attraction tree. The algorithm is presented in Figure 3. Procedure deriveDensityTree derives the density tree from the attraction tree by recursively splitting the dense areas. We maintain a queue splitQueue, initialized with only one element, i.e., the whole attraction tree. For each iteration, we extract an element spTree from splitQueue, which represents the dense area to split. Then we call the function splitTree to identify sub-dense areas in spTree. If spTree cannot be further divided, splitTree returns spTree unchanged. In this case, we serialize spTree into a list of data objects and record it in the cluster list clusters. Otherwise, splitTree returns a collection-type node with the split sub-areas as its children. In this case, we treat all of the children as candidate split areas and put them into splitQueue. The iteration stops when splitQueue is empty, i.e., no more sub-areas can be split.

3.4 Why is DHC effective and efficient?

By measuring the density for each data object, DHC cap-

tures the natural distribution of the data. Intuitively, a

group of highly co-expressed genes will form a dense area

(cluster), and the gene with the highest density within the

group becomes the medoid of the cluster. There may be

many noise objects. However, they are distributed sparsely in

the object space and cannot show a high degree of co-

expression, and thus have low densities.

In summary, DHC has some distinct advantages over previously proposed methods. Compared to K-means and SOM, DHC does not need a parameter for the number of clusters, and the resulting clusters are not affected by outliers. On the one hand, by locating the dense areas in the object space, DHC automatically detects the number of clusters. On the other hand, since DHC uses the expression pattern of the medoid to represent the average expression pattern of the co-expressed genes, the resulting clusters will not be corrupted by outliers.

Compared to CAST and Adapt, DHC can handle embedded clusters and highly intersected clusters uniformly. Figure 4 shows an example: Figure 4(a) illustrates two embedded clusters, and Figure 4(b) shows two highly intersected clusters. In both figures, let $A_0$ be the whole data set, $C_1$ and $C_2$ be the two clusters in $A_0$, and $o_1$ and $o_2$ be the medoids of $C_1$ and $C_2$, respectively. Suppose $density(o_1) > density(o_2)$. After the attraction tree is built, in both situations, the following three facts hold: (1) $o_1$ will be the root of the attraction tree of the data set, since it has a higher density than any other data object; (2) $o_2$ will be the root of a subtree that contains the data objects belonging to $C_2$, since $o_2$ is the medoid of $C_2$; and (3) $o_2$ will be attracted by some data

Proc deriveDensityTree(attractionTree)
  splitQueue.add(attractionTree)
  while (!splitQueue.isEmpty())
    spTree = splitQueue.extract()
    parentTree = spTree.parent
    node = splitTree(spTree)
    if (parentTree == NULL) then root = node
    else parentTree.addChild(node)
    end if
    if (node.type == CLUSTERTYPE) then
      c = node.serialize()
      clusters.add(c)
    else // node.type == COLLECTIONTYPE
      for each child chTree of node do
        chTree.parent = node
        node.remove(chTree)
        splitQueue.add(chTree)
      end for
    end if
  end while
End Proc

Struct MaxCut(t, p, dist)
  tree = t; parent = p; distance = dist
End Struct

Func splitTree(tree)
  finished = false
  currentDistance = MAXDISTANCE
  while (!finished)
    cut = findMaxCut(tree, currentDistance)
    if (cut == null) then finished = true
    else
      currentDistance = cut.distance
      if (splitable(cut.tree, tree)) then
        cut.parent.remove(cut.tree)
        finished = true
      end if
    end if
  end while
  if (cut == null) then return tree
  else
    collection = new DensityTree(COLLECTIONTYPE)
    collection.addChild(cut.tree)
    collection.addChild(tree)
    return collection
  end if
End Func

Figure 3: Algorithm DHC.
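The control flow of deriveDensityTree can be rendered in Python roughly as follows. This is a simplified sketch: split_tree is assumed to return the node itself when the area cannot be divided, or a fresh collection node whose children are the sub-areas:

```python
from collections import deque

def derive_density_tree(attraction_tree, split_tree):
    """Repeatedly split dense areas, collecting unsplittable ones as clusters."""
    clusters = []
    queue = deque([attraction_tree])
    while queue:
        sp_tree = queue.popleft()           # next dense area to split
        node = split_tree(sp_tree)
        if node is sp_tree:                 # cannot be divided further: a cluster
            clusters.append(node)
        else:                               # a collection: split its children further
            queue.extend(node.children)
    return clusters
```

The queue discipline makes the splitting terminate exactly when no dense sub-area can be divided further, mirroring the loop in Figure 3.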

(a) Embedded clusters    (b) Highly intersected clusters

Figure 4: An example of embedded clusters and highly intersected clusters.

object with a higher density and become one of the children of that object. Figure 5(a) demonstrates the generated attraction tree. After the
