DHC: A Density-based Hierarchical Clustering Method for Time Series Gene Expression Data
Daxin Jiang, Jian Pei, Aidong Zhang
Department of Computer Science and Engineering, State University of New York at Buffalo
Email: {djiang3, jianpei, azhang}@cse.buffalo.edu
Abstract
Clustering time series gene expression data is an important task in bioinformatics research and biomedical applications. Recently, some clustering methods have been adapted or proposed. However, some concerns still remain, such as the robustness of the mining methods, as well as the quality and the interpretability of the mining results.
In this paper, we tackle the problem of effectively clustering time series gene expression data by proposing algorithm DHC, a density-based, hierarchical clustering method. We use a density-based approach to identify the clusters such that the clustering results are of high quality and robustness. Moreover, the mining result is in the form of a density tree, which uncovers the embedded clusters in a data set. The inner structures, the borders and the outliers of the clusters can be further investigated using the attraction tree, which is an intermediate result of the mining. With these two trees, the internal structure of the data set can be visualized effectively. Our empirical evaluation using some real-world data sets shows that the method is effective, robust and scalable. It matches the ground truth provided by bioinformatics experts very well in the sample data sets.
1 Introduction
DNA microarray technology [11, 12] has made it possible to monitor simultaneously the expression levels of thousands of genes during important biological processes [15] and across collections of related samples [1]. It is often an important task to find genes with similar expression patterns (co-expressed genes) from DNA microarray data. First, co-expressed genes may demonstrate a significant enrichment for function analysis of the genes [2, 17, 6, 13]. We may understand the functions of some poorly characterized or novel genes better by testing them together with the genes with known functions. Second, co-expressed genes with strong expression pattern correlations may indicate co-regulation and help uncover the regulatory elements in transcriptional regulatory networks [17]. Clustering techniques, which are essential in the data mining process for exploring natural structure and identifying interesting patterns in underlying data, have proved to be useful in finding co-expressed genes.
In cluster analysis, one wishes to partition the given data set into groups based on the given features such that the data objects in the same group are more similar to each other than to the data objects in other groups. Various clustering algorithms have been applied to gene expression data with promising results [6, 17, 16, 4, 13]. However, as indicated in some previous studies (e.g., [8, 16]), many conventional clustering algorithms originating from non-biological fields may suffer from some problems when mining gene expression data, such as having to specify the number of clusters, lacking robustness to noise, and handling embedded clusters and highly intersected clusters poorly. Recently, some algorithms specifically designed for clustering gene expression data have been proposed to address those problems [8, 4].
Compared with other kinds of data, gene expression data usually have several distinguishing characteristics. First, gene expression data sets are often small (on the scale of thousands) compared to some other large databases (e.g., multimedia databases and transaction databases). A gene expression data set can often be held in main memory. Second, for many dedicated microarray experiments, people are usually interested in the expression patterns of only a subset of all the genes. Other gene patterns are roughly considered insignificant, and thus become noise. Hence, in gene expression data analysis, people are much more concerned with the effectiveness and interpretability of the clustering results than with the efficiency of the clustering algorithm. How to group co-expressed genes together meaningfully and how to extract the useful patterns intelligently from noisy data sets are two major challenges for clustering gene expression data.
In this paper, we investigate the problems of effectively clustering gene expression data and make the following contributions. First, we analyze and examine a good number of existing clustering algorithms in the context of clustering gene expression data, and clearly identify the challenges. Second, we develop DHC, a density-based, hierarchical clustering method aimed at gene expression data. DHC is a density-based approach, so it effectively solves some problems that most distance-based approaches cannot handle. Moreover, DHC is a hierarchical method. The mining result is in the form of a tree of clusters. The internal structure of the data set can be visualized effectively. Finally, we conduct an extensive performance study on DHC and some related methods. Our experimental results show that DHC is effective. The mining results match the ground truth given by the bioinformatics experts nicely on real data sets. Moreover, DHC is robust with respect to noise and scalable with respect to database size.
The remainder of the paper is organized as follows. In Section 2, we analyze some important existing clustering methods in the context of clustering gene expression data, and identify the challenges. In Section 3, we discuss the density measurement for density-based clustering of gene expression data, and develop algorithm DHC. The extensive experimental results are reported in Section 4. The paper is concluded in Section 5.
2 Related Work
Various clustering algorithms have been applied to gene expression data. Clustering has proved helpful in identifying groups of co-expressed genes and the corresponding expression patterns. Nevertheless, several challenges arise. In this section, we identify the challenges through a brief survey of some typical clustering methods.
2.1 Partition-based Algorithms
K-means [17] and SOM (Self-Organizing Map) [16] are two typical partition-based clustering algorithms. Although useful, these algorithms suffer from the following drawbacks, as pointed out by [8, 14]. First, both K-means and SOM require the users to provide the number of clusters as a parameter. Since clustering is usually an explorative task in the initial analysis of gene expression data sets, such information is often unavailable. Another disadvantage of the partition-based approaches is that they force each data object into a cluster, which makes the partition-based approaches sensitive to outliers.
Recently, some new algorithms have been developed for clustering time-series gene expression data. They particularly address the problems of outliers and the number of clusters discussed above. For example, Ben-Dor et al. [4] introduced the idea of a “corrupted clique graph” data model and presented a heuristic algorithm CAST (for Cluster Affinity Search Technique) based on the data model. In [8], the authors described a two-step procedure (Adapt) to identify one cluster: the first step estimates the cluster center and the second step estimates the radius of the cluster. Once the center and the radius are determined, a cluster is defined as the set of genes within that radius of the center. Both algorithms extract clusters from the data set one after another until no more clusters can be found. Therefore, the algorithms can automatically determine the number of clusters, and genes not belonging to any cluster are regarded as outliers. However, the criteria for clusters defined by those algorithms are either based on some global parameters or dependent on some assumptions about the cluster structure of the data set. For example, CAST uses an affinity threshold parameter to control the average pairwise similarity between objects within a cluster. Adapt assumes that the cluster has the same radius in each direction in the high-dimensional object space. However, the clustering results may be quite sensitive to different parameter settings, and the assumptions about cluster structure may not always hold. In particular, CAST and Adapt may not be effective with the embedded clusters and the highly intersected clusters, respectively.
2.2 Hierarchical Clustering
A hierarchical clustering algorithm does not generate a set of disjoint clusters. Instead, it generates a hierarchy of nested clusters that can be represented by a tree, called a dendrogram. Based on how the hierarchical decomposition is formed, hierarchical clustering algorithms can be further divided into agglomerative algorithms (i.e., bottom-up approaches, e.g., [6]) and divisive algorithms (i.e., top-down approaches, e.g., [2, 13]). In fact, the hierarchical methods are particularly favored by biologists because they may give more insight into the structure of the clusters than the other methods.
However, the hierarchical methods also have some drawbacks. First, it is sometimes subtle to determine where to cut the dendrogram and derive clusters. Usually, this step is done by domain experts' visual inspection. Second, it is hard to tell the inner structure of a cluster from the dendrogram, e.g., which object is the medoid of the cluster and which objects are the borders of the cluster. Last, many hierarchical methods are considered to lack robustness and uniqueness [16]. They may be sensitive to the order of input and to small perturbations in the data.
2.3 Density-based clustering
Density-based clustering algorithms [7, 9, 3] characterize the data distribution by the density of each data object. Clustering is the process of identifying dense areas in the object space. Conventional density-based approaches, such as DBSCAN [7], classify a data object $O$ as one of the cores of a cluster if $O$ has more than $MinPts$ neighbors within neighborhood $\epsilon$. Clusters are formed by connecting neighboring “core” objects, and those “non-core” objects either serve as the boundaries of clusters or become outliers. Since the noise in the data set is typically randomly distributed, the density within a cluster should be significantly higher than that of the noise. Therefore, density-based approaches have the advantage of extracting clusters from a highly noisy environment, which is the case for time-series gene expression data.
However, the performance of DBSCAN is quite sensitive to the parameters of object density, namely, $MinPts$ and $\epsilon$. For a complex data set, the appropriate parameters are hard to specify. Our experimental study has demonstrated that DBSCAN tends to result in either a large number of trivial clusters or a few huge clusters merged from several smaller ones for time-series gene expression data. Other density-based approaches (e.g., OPTICS [3] and DENCLUE [9]) are more robust to their algorithm parameters. However, none of the existing density-based algorithms provides a hierarchical cluster structure that gives people a thorough picture of the data distribution and helps people understand the relationship between the clusters and the data objects well.
2.4 What are the challenges?
Based on the above analysis, to conduct effective clustering analysis over gene expression data, we need to develop an algorithm meeting the following requirements.
First, the clustering result should be highly interpretable and easy to visualize. As gene expression data is complicated, interpretability and visualization become very important. Second, the clustering method should be able to determine the number of clusters automatically. Gene expression data sets are usually noisy. It is hard to guess the number of clusters. A method adaptive to the natural number of clusters is highly preferable. Third, the clustering method should be robust to noise, outliers, and the parameters. It is well recognized that gene expression data are usually noisy and the rules behind the data are unknown. Thus, the method should be robust so that it can be used as the first step to explore the valuable patterns. Last but not least, the clustering method should be able to handle embedded clusters and highly intersected clusters effectively. The structure of gene expression data is often complicated. It is unlikely that the data space can be clearly divided into several independent clusters. Instead, the user may be interested in both the clusters and the connections among the clusters. The clustering method should be able to uncover the whole picture.
3 The Algorithms
In this section, we propose DHC, an algorithm that mines the cluster structure of a data set as a density tree. First, we discuss how to define the density of objects properly, and then we develop the algorithm. The algorithm works in two steps. First, all objects in a data set are organized into an attraction tree. Then, the attraction tree is summarized as clusters and dense areas, and a density tree is derived as the summary structure.
3.1 Definition of density
When clustering gene expression data, we want to group genes with similar expression patterns. Thus, we choose the correlation coefficient, which is capable of capturing the similarity between two vectors based on their expression patterns rather than on their absolute magnitudes. The correlation coefficient for two data objects $O_i$ and $O_j$ in a $d$-dimensional space is defined as
$$similarity(O_i, O_j) = \frac{\sum_{l=1}^{d} (o_{il} - \mu_{o_i})(o_{jl} - \mu_{o_j})}{\sqrt{\sum_{l=1}^{d} (o_{il} - \mu_{o_i})^2}\,\sqrt{\sum_{l=1}^{d} (o_{jl} - \mu_{o_j})^2}},$$
where $o_{il}$ is the $l$-th scalar of data object $O_i$ and $\mu_{o_i}$ is the mean of the scalars of data object $O_i$. Note that the correlation coefficient ranges between $-1$ and $1$. The larger the value, the more similar the two objects are to each other.
We define the distance between objects $O_i$ and $O_j$, $distance(O_i, O_j)$, piecewise as a function of $similarity(O_i, O_j)$: one expression applies if $similarity(O_i, O_j)$ exceeds a given bound, and another applies otherwise [the two expressions are illegible in the source].
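As a minimal sketch of the correlation coefficient used as the similarity measure here, the following Python function computes it for two expression profiles given as plain lists. The function name `similarity` mirrors the formula; the sample profiles `gene_a`, `gene_b` and `gene_c` are made up for illustration. The distance function is intentionally left out of the sketch.

```python
import math


def similarity(a, b):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Numerator: sum of products of deviations from the per-object means.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical profiles: same pattern at different magnitudes, and a mirrored pattern.
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [10.0, 20.0, 30.0, 20.0]
gene_c = [3.0, 2.0, 1.0, 2.0]
```

Here `gene_a` and `gene_b` differ in magnitude by a factor of ten but share the same shape, so their similarity is at the top of the range, while the mirrored profile `gene_c` is at the bottom. This is the property that motivates choosing a pattern-based measure over an absolute-magnitude one.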
Given a radius $r$, the neighborhood of $O_i$ w.r.t. $r$ in a $d$-dimensional space forms a hypersphere. The set of objects in the hypersphere is $N_r(O_i) = \{O_j \mid distance(O_i, O_j) \le r\}$. The volume of the hypersphere is $Vol(N_r) = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \cdot r^d$. We ignore the global constant coefficient $\frac{\pi^{d/2}}{\Gamma(d/2+1)}$ and define $Vol'(N_r) = r^d$.
To precisely describe the distribution of neighbors of object $O_i$, we divide $N_r(O_i)$ into a series of hypershells, such that each hypershell occupies exactly a unit volume. The idea is demonstrated in Figure 1.
Figure 1: Hypersphere and hypershell w.r.t. object $O_i$. (The figure shows concentric hypershells around $O$ with radii $r_1, r_2, r_3, \ldots$, together with a RadiusTable holding $r_1, r_2, r_3, \ldots$ and a WeightTable holding $w_1, w_2, w_3, \ldots$.)
The radius of the $k$-th hypershell is denoted $r_k$ [its closed-form expression is illegible in the source]. Then, we have
$$H_k(O_i) = N_{r_{k+1}}(O_i) - N_{r_k}(O_i) = \{O_j \mid r_k < distance(O_i, O_j) \le r_{k+1}\}.$$
By hypershells, the neighborhood of $O_i$ is discretized and forms a histogram, where each bin $h_k$ contains the objects falling into hypershell $H_k(O_i)$. For each $h_k$, we define a weight $weight(H_k(O_i))$ of the contribution from $h_k$ to the density of $O_i$ [the weight expression is illegible in the source].
Now, we are ready to define the density of an object $O_i$:
$$density(O_i) = \sum_{k \ge 1} weight(H_k(O_i)) \cdot |h_k|,$$
where $|h_k|$ is the number of objects in bin $h_k$.
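The hypershell histogram and the weighted-sum density can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the radii $r_k = k^{1/d}$ with $r_0 = 0$ are one choice that gives every shell unit volume under $Vol'(N_r) = r^d$, and the weights are supplied by the caller because the paper's weight expression is not reproduced here. The names `unit_volume_radii`, `shell_histogram` and the sample weights are hypothetical.

```python
import bisect


def unit_volume_radii(d, n_shells):
    """Assumed radii r_0..r_n with r_k = k**(1/d), so r_{k+1}**d - r_k**d == 1."""
    return [k ** (1.0 / d) for k in range(n_shells + 1)]


def shell_histogram(distances, radii):
    """Count neighbours per hypershell; bin k holds r_k < dist <= r_{k+1}."""
    counts = [0] * (len(radii) - 1)
    for dist in distances:
        # First index whose radius is >= dist, minus one, is the shell index.
        k = bisect.bisect_left(radii, dist) - 1
        if 0 <= k < len(counts):  # skips dist <= r_0 and dist beyond the last shell
            counts[k] += 1
    return counts


def density(distances, radii, weights):
    """Weighted sum of shell counts; the weights are an assumption of this sketch."""
    counts = shell_histogram(distances, radii)
    return sum(w * c for w, c in zip(weights, counts))


# Hypothetical example in a 2-dimensional space with three shells.
radii = unit_volume_radii(2, 3)
neighbour_distances = [0.5, 0.9, 1.2, 1.5, 3.0]
example_weights = [1.0, 0.5, 0.25]  # made-up decreasing weights
example_density = density(neighbour_distances, radii, example_weights)
```

Passing the weights in as a parameter keeps the sketch independent of any particular weighting scheme: inner shells can be made to count more than outer ones by choosing a decreasing sequence.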
In our density definition, we do not simply set up a threshold $\epsilon$ and count the number of data objects within neighborhood $\epsilon$ as the density. On the contrary, we discretize the neighborhood of object $O_i$ into a series of hypershells and calculate the density of an object as the sum of contributions from individual hypershells. Thus, our definition avoids the difficulty of choosing a good global threshold $\epsilon$ and accurately reflects the neighbor distribution around the specific object.
3.2 Building an attraction tree
As the first step of the mining, we organize all objects in the data set into a hierarchy based on their density. The resulting structure is called an attraction tree, since the tree is built by considering the attraction among objects in a data set. Intuitively, an object with high density “attracts” some other objects with lower density. We formalize the ideas as follows.
The attraction between two data objects $O_1$ and $O_2$ ($O_1 \ne O_2$) in a $d$-dimensional space is defined as follows, where $d$ is the dimensionality of the objects.
$$attraction(O_1, O_2) = \frac{density(O_1) \cdot density(O_2)}{distance(O_1, O_2)^{d-1}}.$$
The attraction is said to be from $O_i$ to $O_j$ if $density(O_i) < density(O_j)$, denoted as $O_i \rightarrow O_j$. In the case that two objects are tied, we can artificially assign $O_i \rightarrow O_j$ for $i < j$. Thus, an object $O$ is attracted by a set of objects $A(O)$ whose densities are larger than that of $O$, i.e., $A(O) = \{O_j \mid density(O_j) > density(O)\}$. We define the attractor of $O$ as the object $O_j \in A(O)$ with the largest attraction to $O$, i.e.,
$$attractor(O) = \arg\max_{O_j \in A(O)} attraction(O, O_j).$$
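A small Python sketch of the attraction and attractor definitions may help. It assumes that densities and a pairwise distance matrix have already been computed and that distances between distinct objects are positive; the function names and the toy numbers are hypothetical. Ties in density are broken by index, following the rule that $O_i \rightarrow O_j$ for $i < j$.

```python
def attraction(dens_a, dens_b, dist, d):
    """density(a) * density(b) / distance(a, b) ** (d - 1); assumes dist > 0."""
    return dens_a * dens_b / dist ** (d - 1)


def find_attractor(i, densities, dist, d):
    """Index of the attractor of object i, or i itself if nothing attracts it."""
    best, best_attraction = i, float("-inf")
    for j in range(len(densities)):
        # j attracts i if it is denser; equal densities are ordered by index.
        if (densities[j], j) > (densities[i], i):
            a = attraction(densities[i], densities[j], dist[i][j], d)
            if a > best_attraction:
                best, best_attraction = j, a
    return best


# Toy data: three objects in a 3-dimensional space (made-up values).
densities = [5.0, 3.0, 1.0]
dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
]
attractors = [find_attractor(i, densities, dist, d=3) for i in range(3)]
```

In this toy example object 2 is attracted by object 1 rather than by the denser object 0, because object 1 is much closer and the distance term dominates.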
The process of determining the attractor of each data object is as follows. For each data object $O_i$, we initialize $O_i$'s attractor as itself. Then we search for $A(O_i)$ and compare the attraction of each $O_j \in A(O_i)$ to $O_i$. Finally, the winner becomes the attractor of $O_i$. The only special case is $O_{max}$, the object that has the largest density, where $A(O_{max})$ will be empty. In this case, we set $O_{max}$'s attractor as itself.
The attraction relation from an object to another (i.e., $O_i \rightarrow O_j$) is a partial order. Based on this order, we can derive an attraction tree $T$. Each node has an object $O$ such that
$$parent(O) = \begin{cases} \text{null} & \text{if } attractor(O) = O \\ attractor(O) & \text{otherwise.} \end{cases}$$
The tree construction process is as follows. First, each data object $O_i$ is a singleton attraction tree $T_i$. Then, we scan the data set once. For each data object $O_i$, we find its attractor $O_j$. We insert $T_i$ as a child of $O_j$. Thus the original singleton cluster trees merge with the others during the scanning process. When the scanning process is over, all data objects form an attraction tree reflecting the attraction hierarchy. One special case here is $O_{max}$, whose attractor is itself. The corresponding attraction tree $T_{max}$ cannot be a child of any others. Instead, $O_{max}$ is the root of the resulting attraction tree. All other data objects are its descendants.
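The single-scan construction can be sketched in a few lines of Python. The sketch assumes the attractor of every object is already known and is given as a list of indices, with the densest object listed as its own attractor; the dictionary-of-children representation and the function names are illustrative choices, not part of the paper.

```python
def build_attraction_tree(attractors):
    """Link each object under its attractor in one scan.

    attractors[i] is the index of the attractor of object i; the object
    that is its own attractor becomes the root.
    """
    children = {i: [] for i in range(len(attractors))}
    roots = []
    for i, a in enumerate(attractors):
        if a == i:
            roots.append(i)        # special case: the densest object
        else:
            children[a].append(i)  # insert the singleton tree of i under a
    if len(roots) != 1:
        raise ValueError("expected exactly one object that is its own attractor")
    return roots[0], children


def descendants(node, children):
    """All objects in the subtree rooted at node, the node itself included."""
    seen, stack = [], [node]
    while stack:
        current = stack.pop()
        seen.append(current)
        stack.extend(children[current])
    return seen


# Hypothetical attractor list for five objects.
example_attractors = [0, 0, 1, 1, 0]
root, children = build_attraction_tree(example_attractors)
```

Because every non-root object points to a strictly "stronger" object, following attractors always ends at the root, which is why the result is a single tree covering the whole data set.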
3.3 Deriving a density tree
The attraction tree constructed in Section 3.2 includes every object in the data set. Thus, the tree can be bushy. To identify the really meaningful clusters and their hierarchical structures, we need to identify the clusters and prune the noise and outliers. This is done by a summarization of clusters and dense areas in the form of a density tree.
There are two kinds of nodes in a density tree, namely the collection nodes and the cluster nodes. A collection node is an internal node in the tree and represents a dense area of the data set. A cluster node represents a cluster that will not be decomposed further. Each node has the medoid of the dense area or the cluster as its representative.
Figure 2 is an example of a density tree. At the root of the tree, the whole data set is regarded as a dense area and is denoted as a root (collection) node $A_0$. This dense area consists of two dense subareas, i.e., $A_1$ and $A_2$. The dense subareas can be further decomposed into finer subdense areas, such as $C_1$ and $C_2$. DHC recursively splits the dense subareas until the subareas meet some termination criterion. The subareas at the leaf level are represented by cluster nodes in the density tree. Each cluster node corresponds to one cluster in the data set.
Figure 2: A density tree. (Node labels in the figure: $A_0$, $A_1$, $A_2$, $C_1$, $C_2$, $C_3$, $C_4$.)
How do we derive a density tree from an attraction tree? The basic idea is as follows. First, we take the whole data set as one dense area to split. Then, we recursively split the dense areas until each dense subarea contains only one cluster.
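The recursive splitting idea can be sketched as a queue-driven loop. The sketch below is generic: the `split` argument is a caller-supplied function that either returns sub-areas or an empty list when an area should stay a single cluster. The toy splitter `split_at_largest_gap`, which cuts a sorted tuple of numbers at its widest gap, is purely a stand-in for illustration and is not the criterion DHC uses.

```python
from collections import deque


def derive_density_tree(whole_area, split):
    """Split areas breadth-first until no area can be split further."""
    root = {"area": whole_area, "children": []}
    clusters = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        parts = split(node["area"])
        if not parts:
            clusters.append(node["area"])  # leaf: a cluster node
            continue
        for part in parts:                 # internal: a collection node
            child = {"area": part, "children": []}
            node["children"].append(child)
            queue.append(child)
    return root, clusters


def split_at_largest_gap(area, min_gap=5):
    """Toy splitter (assumption): cut a sorted tuple at its widest gap."""
    if len(area) < 2:
        return []
    gap, i = max((area[k + 1] - area[k], k) for k in range(len(area) - 1))
    if gap <= min_gap:
        return []
    return [area[: i + 1], area[i + 1:]]


tree, found = derive_density_tree((1, 2, 3, 20, 21, 40), split_at_largest_gap)
```

Using an explicit queue instead of recursion keeps the traversal order obvious and avoids deep call stacks on large trees; the leaves collected in `found` play the role of cluster nodes and the internal dictionaries the role of collection nodes.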
To determine the clusters, the dense areas and the bridges between them, we introduce two parameters: a similarity threshold $\epsilon$ and a minimum number of objects threshold $MinPts$. For an edge between $O_i$ and $O_j$ in an attraction tree, where $O_i$ is the parent of $O_j$, the attraction subtree $T_{O_j}$ is a dense subarea if and only if (1) $|T_{O_j}| \ge MinPts$ and (2) $similarity(O_j, O_i) \le \epsilon$. In other words, a dense area is identified if and only if there are at least $MinPts$ objects in the area and the similarity between the center of the area and the center of the higher level area is no more than $\epsilon$.
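The two-part test for a dense subarea can be illustrated with a short Python sketch. It assumes the attraction tree is available as a dictionary of children and that the similarity along each child-to-parent edge has been precomputed; the names `subtree_sizes`, `dense_subareas`, `min_pts` and `eps`, as well as the toy values, are hypothetical.

```python
def subtree_sizes(root, children):
    """Number of objects in the subtree rooted at each node."""
    order, stack = [], [root]
    while stack:                      # pre-order: parents before children
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    sizes = {}
    for node in reversed(order):      # children are sized before their parent
        sizes[node] = 1 + sum(sizes[c] for c in children.get(node, []))
    return sizes


def dense_subareas(root, children, edge_similarity, min_pts, eps):
    """Children whose subtree passes both the size and the similarity test."""
    sizes = subtree_sizes(root, children)
    return [
        child
        for (child, parent), sim in edge_similarity.items()
        if sizes[child] >= min_pts and sim <= eps
    ]


# Toy attraction tree: 0 is the root, 1 and 4 hang under 0, 2 and 3 under 1.
toy_children = {0: [1, 4], 1: [2, 3]}
toy_similarity = {(1, 0): 0.2, (4, 0): 0.1, (2, 1): 0.9, (3, 1): 0.95}
dense = dense_subareas(0, toy_children, toy_similarity, min_pts=2, eps=0.5)
```

In the toy tree only the subtree rooted at object 1 qualifies: object 4 is dissimilar enough from the root but too small, while objects 2 and 3 are both too small and too similar to their parent.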
Once the dense areas are identified, we can derive the density tree using the dense areas and their centers' attraction relation stored in the attraction tree. The algorithm is presented in Figure 3. Function deriveDensityTree derives the density tree from the attraction tree attractionTree. We maintain a queue splitQueue to recursively split the dense areas. The splitQueue is initialized with only one element, i.e., the attractionTree. In each iteration, we extract an element spTree from the splitQueue, which represents the dense area to split. Then we call the function splitTree to identify subdense areas in spTree. If spTree cannot be further divided, function splitTree returns spTree unchanged. In this case, we serialize the area represented by spTree into a list of data objects and record it in the cluster list clusters. Otherwise, splitTree returns a square type (collection) node with the split subareas as its children. In this case, we take all of the children as candidate split areas and put them into splitQueue. The iteration stops when the splitQueue is empty, i.e., no more subareas can be split.
3.4 Why is DHC effective and efficient?
By measuring the density for each data object, DHC captures the natural distribution of the data. Intuitively, a group of highly co-expressed genes will form a dense area (cluster), and the gene with the highest density within the group becomes the medoid of the cluster. There may be many noise objects. However, they are distributed sparsely in the object space and cannot show a high degree of co-expression, and thus have low densities.
In summary, DHC has some distinct advantages over some previously proposed methods. Compared to K-means and SOM, DHC does not need a parameter for the number of clusters, and the resulting clusters are not affected by outliers. On the one hand, by locating the dense areas in the object space, DHC automatically detects the number of clusters. On the other hand, since DHC uses the expression pattern of the medoid to represent the average expression patterns of co-expressed genes, the resulting clusters will not be corrupted by outliers.
Proc deriveDensityTree(attractionTree)
    splitQueue.add(attractionTree)
    while (!splitQueue.isEmpty())
        spTree = splitQueue.extract()
        parentTree = spTree.parent
        node = splitTree(spTree)
        if (parentTree == NULL) then root = node
        else parentTree.addChild(node)
        end if
        if (node.type == CLUSTERTYPE) then
            c = node.serialize()
            clusters.add(c)
        else // node.type == COLLECTIONTYPE
            for each child chTree of node do
                chTree.parent = node
                node.remove(chTree)
                splitQueue.add(chTree)
            end for
        end if
    end while
End Proc
Struct MaxCut(t, p, dist)
    tree = t; parent = p; distance = dist
End Struct MaxCut
Func splitTree(tree)
    finished = false
    currentDistance = MAXDISTANCE
    while (!finished)
        cut = findMaxCut(tree, currentDistance)
        if (cut == null) then finished = true
        else
            currentDistance = cut.distance
            if (splitable(cut.tree, tree)) then
                cut.parent.remove(cut.tree)
                finished = true
            end if
        end if
    end while
    if (cut == null) then return tree
    else
        collection = new DensityTree(COLLECTIONTYPE)
        collection.addChild(cut.tree)
        collection.addChild(tree)
        return collection
    end if
End Func
Figure 3: Algorithm DHC.
Figure 4: An example of embedded cluster and highly intersected cluster. (a) Embedded cluster. (b) Highly intersected cluster. (Both panels show the data set $A_0$ containing clusters $C_1$ and $C_2$ with medoids $O_1$ and $O_2$.)
Compared to CAST and Adapt, DHC can handle the embedded clusters and highly intersected clusters uniformly. Figure 4 shows an example. Figure 4(a) illustrates two embedded clusters, and Figure 4(b) shows two highly intersected clusters. In both figures, let $A_0$ be the whole data set, $C_1$ and $C_2$ be the two clusters in $A_0$, and $O_1$ and $O_2$ be the medoids of $C_1$ and $C_2$. Suppose $density(O_1) > density(O_2)$. After the attractor search and the attraction tree construction, in both situations, the following three facts hold: (1) $O_1$ will be the root of the attraction tree of the data set, since it has a higher density than any other data object; (2) $O_2$ will be the root of a subtree that contains the data objects belonging to $C_2$, since $O_2$ is the medoid of $C_2$; and (3) $O_2$ will be attracted by some data object $O_k$ and become one of the children of $O_k$. Figure 5(a) demonstrates the generated attraction tree. After the