Genome Informatics 16(1): 3–12 (2005)
Evaluating Distance Functions for Clustering
Department of Electrical and Computer Engineering, Boston University, Boston, MA,
Laboratory for Biocomputing and Informatics, Boston University, Boston, MA, USA
Departments of Computer Science and Biology, Graduate Program in Bioinformatics,
Boston University, Boston, MA, USA
Tandem repeats are an important class of DNA repeats and much research has focused on
their efficient identification [2, 4, 5, 11, 12], their use in DNA typing and fingerprinting [6, 16, 18],
and their causative role in trinucleotide repeat diseases such as Huntington Disease, myotonic
dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into
groups or families based on sequence similarity so that their biological importance may be further
explored. To cluster tandem repeats we need a notion of pairwise distance, which we obtain by
alignment. In this paper we evaluate five distance functions used to produce those alignments:
Consensus, Euclidean, Jensen-Shannon Divergence, Entropy-Surface, and Entropy-weighted. Comparing
these functions is important because the choice of distance function forms the
core of any clustering algorithm. We employ a novel method to compare alignments and thereby
compare the distance functions themselves. We rank the distance functions using two cluster
validation techniques, Average Cluster Density and Average Silhouette Width. Finally, we propose
a multi-phase clustering method which produces good-quality clusters. In this study, we analyze
clusters of tandem repeats from five sequences: Human Chromosomes 3, 5, 10 and X and C. elegans
Keywords: tandem repeats, distance functions, cluster analysis, cluster validation
DNA molecules are subject to a variety of mutational events, one of which is tandem duplication,
which produces tandem repeats. A tandem repeat is an occurrence of two or more adjacent, often
approximate, copies of a sequence of nucleotides. For example,
ACTTAGT ACTTAGT ACTAAGT ACTTAGT
We are interested in clustering repeats into families based on their sequence similarities. Members of
a family have similar sequences but occur at different locations in a genome or in different genomes.
Families have been detected in both prokaryotic and eukaryotic genomes, including the E. coli, P.
aeruginosa, S. cerevisiae, C. elegans, and human genomes. To compare repeats accurately and effectively,
we cannot use standard measures like BLAST [1] or straightforward sequence alignment [15],
because variant copy number and copy ordering are problematic for these methods. Benson [3] used
a profile representation of the repeats to overcome these difficulties. A profile [8] is a sequence whose
length equals the number of columns in a multiple alignment and whose individual elements are the
character compositions of the columns.
Rao et al.
Figure 1: Multiple alignment view of a tandem repeat. Individual copies are aligned to the repeat
consensus to obtain the profile representation of the repeat. Common mutations among the copies, as
are evident in this view, are reflected in the profile compositions.
In this representation, the $n$ individual copies of a single tandem repeat are aligned to form a
multiple alignment $M$ of length $k$ (see Figure 1). Let $M_{i,j}$ represent the element in the $i$th row and $j$th
column of $M$. A profile for $M$ is a sequence $S = C_1, C_2, \ldots, C_k$ of compositions, where each $C_j$ is a
vector of frequencies of characters in $M_{*,j}$: $C_j = (f_A, f_C, f_G, f_T, f_-)$, with $f_\sigma$ indicating the frequency
of letter $\sigma$ and $f_-$ indicating the frequency of gaps in the column.
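The profile definition above can be illustrated with a minimal sketch. It assumes the copies have already been aligned to the consensus (here the four copies from the example in the introduction, which happen to need no gaps) and that "frequency" means relative frequency within a column; the function name `profile` and the constant `ALPHABET` are illustrative and not taken from the paper.

```python
from collections import Counter

ALPHABET = "ACGT-"  # four nucleotides plus the gap character (assumed ordering)


def profile(alignment):
    """Return the profile of a multiple alignment as a list of compositions.

    `alignment` is a list of n equal-length strings over ALPHABET, one per copy.
    Each composition is a tuple (fA, fC, fG, fT, f-) of relative frequencies
    (an assumption; raw counts would work equally well up to scaling).
    """
    if not alignment:
        raise ValueError("alignment must contain at least one row")
    k = len(alignment[0])
    if any(len(row) != k for row in alignment):
        raise ValueError("all rows must have the same length")
    n = len(alignment)
    compositions = []
    for j in range(k):
        # Count the characters appearing in column j across all copies.
        counts = Counter(row[j] for row in alignment)
        compositions.append(tuple(counts[c] / n for c in ALPHABET))
    return compositions


# The four copies from the introductory example, treated as pre-aligned rows.
copies = ["ACTTAGT", "ACTTAGT", "ACTAAGT", "ACTTAGT"]
S = profile(copies)
```

In this example the fourth column holds T, T, A, T, so its composition records the single T-to-A substitution as a mixed column rather than discarding it, which is the property that makes profiles useful for comparing repeats.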
Alignment of profiles requires a pairwise distance function for compositions. In [3], Benson explored
a distance function based on minimal path lengths along an entropy surface. In this paper we explore
five distance functions: two related to the entropy surface (Entropy-weighted and Entropy-Surface), a
third, the Jensen-Shannon Divergence, which is also based on entropy, and two others, Euclidean and
Consensus. Each function produces a different score, and the alignments may differ as well. Gauging
the effect of the distance functions strictly from their scores is therefore neither convincing nor effective
and might lead us to wrong conclusions. To overcome this difficulty we propose a new approach
for gauging the closeness or similarity of these distance functions to each other by comparing the
alignments which they produce. By the end of our experiment we thus obtain a metric between these
distance functions with respect to the alignments. From this analysis and further examination of
clusters produced with these functions, we choose a single function for use in clustering.
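To make the idea of a distance between two compositions concrete, the following sketch gives the standard textbook forms of two of the five functions, Euclidean distance and Jensen-Shannon divergence. The paper's own definitions appear in Section 3 and may differ in weighting or normalization; equal weights and base-2 logarithms are assumptions made here, and the function names are illustrative.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two composition vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def entropy(p):
    """Shannon entropy in bits; zero-frequency terms contribute nothing."""
    return -sum(a * math.log2(a) for a in p if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence with equal weights (assumed), in bits.

    Computed as H((p + q) / 2) - (H(p) + H(q)) / 2.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Compositions are ordered (fA, fC, fG, fT, f-), as in the profile definition.
pure_a = (1.0, 0.0, 0.0, 0.0, 0.0)
pure_c = (0.0, 1.0, 0.0, 0.0, 0.0)
mixed = (0.25, 0.0, 0.0, 0.75, 0.0)

d_euclid = euclidean(pure_a, pure_c)
d_js = jensen_shannon(pure_a, pure_c)
d_self = jensen_shannon(mixed, mixed)
```

The two functions are on different scales even for the same pair of columns, which illustrates why raw alignment scores from different functions cannot be compared directly.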
Finally, we present a multi-phase clustering scheme, which initially uses the hierarchical clustering
method and, as a secondary step, uses the Partitioning Around Medoids (PAM) [10] algorithm to obtain
good-quality clusters. Multi-phase clustering methods like CURE [9] have been used previously to
refine cluster quality. Clustering methods available in the R statistical programming language [14]
were used in our analysis.
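The two-phase idea can be sketched as follows. This is not the R code used in the analysis: it is a standard-library Python illustration that assumes average linkage for the hierarchical phase and replaces the full BUILD/SWAP procedure of PAM with a simpler alternate-and-update k-medoids refinement seeded from the hierarchical clusters. All function names and the toy distance matrix are hypothetical.

```python
def average_linkage(dist, k):
    """Agglomerative clustering on a distance matrix, stopping at k clusters."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between members of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def medoid(cluster, dist):
    """Member with the smallest total distance to the rest of its cluster."""
    return min(cluster, key=lambda c: sum(dist[c][j] for j in cluster))


def assign(medoids, dist):
    """Assign every item to its nearest medoid."""
    clusters = [[] for _ in medoids]
    for i in range(len(dist)):
        nearest = min(range(len(medoids)), key=lambda m: dist[i][medoids[m]])
        clusters[nearest].append(i)
    return clusters


def refine_with_medoids(clusters, dist, max_iter=100):
    """Simplified k-medoids refinement (not full PAM), seeded by `clusters`."""
    medoids = [medoid(c, dist) for c in clusters]
    for _ in range(max_iter):
        clusters = assign(medoids, dist)
        # Keep the old medoid if a cluster happens to become empty.
        updated = [medoid(c, dist) if c else m for c, m in zip(clusters, medoids)]
        if updated == medoids:
            break
        medoids = updated
    return assign(medoids, dist), medoids


# Toy data: six items on a line, forming two well-separated groups.
points = [0, 1, 2, 50, 51, 52]
dist = [[abs(a - b) for b in points] for a in points]

initial = average_linkage(dist, k=2)
final_clusters, final_medoids = refine_with_medoids(initial, dist)
```

Seeding the medoid phase from the hierarchical result avoids a random initialization, while the reassignment step lets items that were merged early into a poor cluster move to a closer medoid.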
The paper is organized as follows. Section 2 describes the repeats data we used in our analysis.
Section 3 defines the distance functions and describes our method for comparing the alignments
produced by the different functions. Section 4 describes our analysis of cluster quality and the multi-phase
clustering approach. Finally, Section 5 summarizes our conclusions.
2.1 Data Collection and Cleaning
It is important to start with a good set of data so that our conclusions will be robust. The data used for
this analysis were obtained from the Tandem Repeats Database (TRDB) [17] using default parameters
and consist of 1000 pairs of related tandem repeats from Human Chromosomes 3, 5, 10 and X (NCBI
[1] Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D., Basic local alignment search tool,
J. Mol. Biol., 215:403–410, 1990.
[2] Benson, G., Tandem repeats finder: A program to analyze DNA sequences, Nucleic Acids Res.,
[3] Benson, G., A new distance measure for comparing sequence profiles based on paths along an
entropy surface, Bioinformatics, 18:S44–S53, 2002.
[4] Castelo, A., Martins, W., and Gao, G., TROLL-tandem repeat occurrence locator,
Bioinformatics, 18:634–636, 2002.
[5] Delgrange, O. and Rivals, E., STAR: An algorithm to search for tandem approximate repeats,
Bioinformatics, 20:2812–2820, 2004.
[6] Edwards, A., Hammond, H., Jin, L., Caskey, C., and Chakraborty, R., Genetic variation at
five trimeric and tetrameric tandem repeat loci in four human population groups, Genomics,
[7] Everitt, B.S., Cluster Analysis, Edward Arnold, 1992.
[8] Gribskov, M., Lüthy, R., and Eisenberg, D., Profile analysis, Methods in Enzymol., 183:146–159,
[9] Guha, S., Rastogi, R., and Shim, K., CURE: An efficient clustering algorithm for large databases,
Proc. ACM SIGMOD International Conference on Management of Data 1998, 73–84, 1998.
[10] Kaufman, L. and Rousseeuw, P.J., Finding groups in data: An introduction to cluster analysis,
John Wiley and Sons, 1990.
[11] Kolpakov, R., Bana, G., and Kucherov, G., Mreps: Efficient and flexible detection of tandem
repeats in DNA, Nucleic Acids Res., 31:3672–3678, 2003.
[12] Landau, G., Schmidt, J., and Sokol, D., An algorithm for approximate tandem repeats, J. Comp.
Biol., 8:1–18, 2001.
[13] Maes, M., On a cyclic string-to-string correction problem, Information Processing Letters, 35:73–
[14] R Development Core Team, R: A language and environment for statistical computing,
R Foundation for Statistical Computing, 2005.
[15] Smith, T. and Waterman, M., Comparison of biosequences, Advances in Applied Mathematics,
[16] Weber, J. and May, P., Abundant class of human DNA polymorphisms which can be typed using
the polymerase chain reaction, Am. J. Hum. Genet., 44:388–396, 1989.
[17] Benson, G., TRDB, http://tandem.bu.edu/cgi-bin/trdb/trdb.exe
[18] Ruitberg, C., Reeder, D., and Butler, J., STRBase: A short tandem repeat DNA database for
the human identity testing community, http://www.cstl.nist.gov/biotech/strbase, 2001.