Genome Informatics 16(1): 3–12 (2005)

Evaluating Distance Functions for Clustering Tandem Repeats

Suyog Rao (1, 2), suyog@bu.edu

Alfredo Rodriguez (2), alfredo@bu.edu

Gary Benson (2, 3), gbenson@bu.edu

(1) Department of Electrical and Computer Engineering, Boston University, Boston, MA, USA

(2) Laboratory for Biocomputing and Informatics, Boston University, Boston, MA, USA

(3) Departments of Computer Science and Biology, Graduate Program in Bioinformatics, Boston University, Boston, MA, USA

Abstract

Tandem repeats are an important class of DNA repeats and much research has focused on

their efficient identification [2, 4, 5, 11, 12], their use in DNA typing and fingerprinting [6, 16, 18],

and their causative role in trinucleotide repeat diseases such as Huntington Disease, myotonic

dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into

groups or families based on sequence similarity so that their biological importance may be further

explored. To cluster tandem repeats we need a notion of pairwise distance which we obtain by

alignment. In this paper we evaluate five distance functions used to produce those alignments –

Consensus, Euclidean, Jensen-Shannon Divergence, Entropy-Surface, and Entropy-weighted. It is

important to analyze and compare these functions because the choice of distance metric forms the

core of any clustering algorithm. We employ a novel method to compare alignments and thereby

compare the distance functions themselves. We rank the distance functions based on the cluster

validation techniques – Average Cluster Density and Average Silhouette Width. Finally, we propose

a multi-phase clustering method which produces good-quality clusters. In this study, we analyze

clusters of tandem repeats from five sequences: Human Chromosomes 3, 5, 10 and X and C. elegans

Chromosome III.

Keywords: tandem repeats, distance functions, cluster analysis, cluster validation

1 Introduction

DNA molecules are subjected to a variety of mutational events, one of which is tandem duplication

which produces tandem repeats. A tandem repeat is an occurrence of two or more adjacent, often

approximate copies of a sequence of nucleotides. For example,

ACTTAGT ACTTAGT ACTAAGT ACTTAGT

We are interested in clustering repeats into families based on their sequence similarities. Members of

a family have similar sequence but occur at different locations in a genome or in different genomes.

Families have been detected in both prokaryotic and eukaryotic genomes, including the E. coli, P.

aeruginosa, S. cerevisiae, C. elegans, and human genomes. To accurately and effectively compare

repeats, we cannot use standard measures like BLAST [1] or straightforward sequence alignment [15],

because variant copy number and copy ordering are problematic for these methods. Benson [3] used

a profile representation of the repeats to overcome these difficulties. A profile [8] is a sequence whose

length equals the number of columns in a multiple alignment and whose individual elements are the

character compositions of the columns.

Figure 1: Multiple alignment view of a tandem repeat. Individual copies are aligned to the repeat

consensus to obtain the profile representation of the repeat. Common mutations among the copies, as

are evident in this view, are reflected in the profile compositions.

In this representation, the $n$ individual copies of a single tandem repeat are aligned to form a multiple alignment $M$ of length $k$ (see Figure 1). Let $M_{i,j}$ represent the element in the $i$th row and $j$th column of $M$. A profile for $M$ is a sequence $S = C_1, C_2, \ldots, C_k$ of compositions, where each $C_j$ is a vector of frequencies of characters in $M_{*,j}$: $C_j = (f_A, f_C, f_G, f_T, f_-)$, with $f_\sigma$ indicating the frequency of letter $\sigma$ and $f_-$ indicating the frequency of gaps in the column.
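As an illustration of the profile definition, the following minimal Python sketch computes the compositions column by column. The function name `profile` and the fixed character order `ACGT-` are our own choices for the example, and the input rows are assumed to be already aligned to equal length.

```python
from collections import Counter

ALPHABET = "ACGT-"  # four bases plus the gap character (order is an assumption)

def profile(alignment_rows):
    """Return one composition (fA, fC, fG, fT, f-) per alignment column."""
    n = len(alignment_rows)
    k = len(alignment_rows[0])
    if any(len(row) != k for row in alignment_rows):
        raise ValueError("all rows must have the same length")
    compositions = []
    for j in range(k):
        # Count the characters that appear in column j across all copies.
        counts = Counter(row[j] for row in alignment_rows)
        compositions.append(tuple(counts[ch] / n for ch in ALPHABET))
    return compositions

# The four copies from the example in the Introduction, one copy per row.
rows = ["ACTTAGT", "ACTTAGT", "ACTAAGT", "ACTTAGT"]
prof = profile(rows)
print(prof[3])  # composition of the column that holds the T/A mutation
```

The fourth column is the only one with a mixed composition, because one of the four copies carries an A where the others carry a T.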

Alignment of profiles requires a pairwise distance function for compositions. In [3], Benson explored

a distance function based on minimal path lengths along an entropy surface. In this paper we explore five distance functions: two related to the entropy surface (Entropy-weighted and Entropy-Surface); a third, the Jensen-Shannon Divergence, which is also based on entropy; and two others, Euclidean and Consensus. Each function produces a different score, and the resulting alignments may differ as well. Hence, gauging the effect of distance functions strictly from scores is neither convincing nor effective and might lead to wrong conclusions. To overcome these difficulties we propose a new approach

for gauging the closeness or similarity of these distance functions to each other by comparing the

alignments which they produce. Thus by the end of our experiment we obtain a metric between these

distance functions with respect to the alignments. From this analysis and further examination of

clusters produced with these functions, we choose a single function for use in clustering.

Finally, we present a multi-phase clustering scheme, which initially uses the Hierarchical clustering

Method, and as a secondary step uses the Partition Around Medoids (PAM) [10] algorithm to obtain

good quality clusters. Multiphase clustering methods like CURE [9] have been used previously to

refine cluster quality. Clustering methods available in the R statistical programming language [14]

were used in our analysis.

The paper is organized as follows. Section 2 describes the repeats data we used in our analysis. Section 3 defines the distance functions and describes our method for comparing the alignments produced by the different functions. Section 4 describes our analysis of cluster quality and the multiphase clustering approach. Finally, Section 5 summarizes our conclusions.

2 Repeats

2.1 Data Collection and Cleaning

It is important to start with a good set of data so our conclusions will be robust. The data used for

this analysis were obtained from the Tandem Repeats Database (TRDB) [17] using default parameters

and consist of 1000 pairs of related tandem repeats from Human Chromosomes 3, 5, 10 and X (NCBI

Build 34, July 2003 Assembly) and C. elegans Chromosome III (Sanger Institute, Aug 2002). These

repeats were obtained using the Tandem Repeats Finder (TRF) [2] program. From the original set of

repeats in each chromosome, we used the TRDB filtering capability to select only repeats which have

a copy number greater than 5 and whose pattern size is greater than 35. The results of this selection

are shown in Table 1.

Table 1: Results of TRF analysis and TRDB filtering on the chromosomes.

| Chromosome | Size in bp (incl. gaps) | Number of repeats | Repeats after filter |
|---|---|---|---|
| Human Chr. 3 | 199344050 | 37643 | 660 |
| Human Chr. 5 | 181034922 | 35922 | 971 |
| Human Chr. 10 | 135037215 | 29510 | 795 |
| Human Chr. X | 153692391 | 31779 | 794 |
| C. elegans Chr. III | 13002367 | 5011 | 354 |

These repeats were subjected to the pre-existing clustering algorithm in TRDB. Repeats from each chromosome were clustered separately with a connected components algorithm using Entropy-Surface as the distance function. In total, the clustering produced 223 clusters over all five chromosomes. Our data set was sampled from this pool of repeats in an automated and randomized manner. From the pool, we chose 400 pairs of repeats, each pair from a single cluster, from each of Human Chromosomes 3, 5, 10 and X, and 150 pairs from C. elegans Chromosome III. This candidate set of 1750 pairs was then subjected to the data cleansing process described in the next section.

2.2 Data Cleansing - Identifying and Removing Subpatterns in Tandem Repeats

A tandem repeat in the data set may contain within itself one or more copies of tandem repeats

(repeats within a repeat), which we define to be subpatterns of the original repeat. A tandem repeat

is considered to have a perfect subpattern if its pattern length is a perfect multiple of the subpattern

length. Although there are many tandem repeats which have perfect subpatterns, it is important that

we also consider tandem repeats whose pattern size is a close multiple of some subpattern length.

Thus, we are interested in subpatterns whose copies span exactly or almost the entire length of the

original pattern. We call these strong subpatterns. Note also that the subpatterns in a tandem repeat

may be approximate copies of each other.

Why care about strong subpatterns?

Because we are comparing alignments produced by the different scoring functions, we do not want to

include situations where the alignments differ because a repeat has cyclically shifted in the alignment

simply because it contains a strong subpattern. Hence we eliminate these repeats in our data set. As

an illustration, consider the tandem repeat $X$, consisting of subpatterns $X_1$, $X_2$ and $X_3$ as shown in Figure 2. If $X_1 \approx X_2 \approx X_3$ and we try to align this tandem repeat with another tandem repeat, the pattern might rotate cyclically as shown in Figure 2, depending on the distance function used, which is undesirable. Formally, given a tandem repeat sequence $X$, the problem is to establish the existence

of a strong subpattern within it. In order to achieve this we first identify the subpatterns in a given

set of repeats and then associate with each, a notion of a subpattern score or strength. This allows

us to identify a threshold that separates the strong subpatterns from the weak subpatterns. We omit

further discussion of this task.

Figure 2: Subpatterns can cause cyclic shifting of a repeat within an alignment when using different

distance functions. (a) is a tandem repeat pattern X with subpatterns. (b) is one such cyclic rotation

of subpatterns in X.

3 Comparing Alignments

3.1 Distance Functions

We evaluated five distance functions. In what follows, $C_i$ is a composition vector of $k = 5$ character frequencies $f_{\sigma_1}, \ldots, f_{\sigma_k}$, one for each DNA base and the gap character:

1. Consensus:

$$\mathrm{Cons}(C_1, C_2) = \begin{cases} 0 & \text{if the majority characters in } C_1 \text{ and } C_2 \text{ match,} \\ 1 & \text{otherwise.} \end{cases}$$

2. Euclidean:

$$\mathrm{Euc}(C_1, C_2) = \sqrt{\sum_{i=1}^{k} \Delta^2_{\sigma_i}}$$

where $\Delta^2_{\sigma_i}$ is the square of the difference between the frequencies for character $\sigma_i$ in $C_1$ and $C_2$.

3. Jensen-Shannon Divergence:

$$JS(C_1, C_2) = H(\pi_1 C_1 + \pi_2 C_2) - \pi_1 H(C_1) - \pi_2 H(C_2)$$

where $H(C)$ is the entropy of vector $C$, that is, $H(C) = -\sum_{i=1}^{k} f_{\sigma_i} \log_2(f_{\sigma_i})$, and $\pi_i$ is a weighting factor for vector $C_i$. We used $\pi_i = 0.5$.

4. Entropy-Surface: This function and the next are related to one defined by Benson in [3]. They are based on the entropy function

$$H(C) = -\sum_{\sigma} f_\sigma \log(f_\sigma)$$

defined over all possible compositions $C$. The entropy function describes a six dimensional curved surface (five for the character frequencies $f_\sigma$ and one for the entropy value). Any composition, $C$, defines a point $H(C)$ which is the projection of $C$ in 5-space onto the entropy surface. For the distance measure, we project the straight line segment connecting $C_1$ and $C_2$ onto the entropy surface. The distance between $C_1$ and $C_2$ is the length of the resulting curve (which we numerically approximate with chords).

5. Entropy-weighted: Similar to the preceding, except the length of the curve is weighted by the entropy value itself (in our numerical approximation, the length of each approximating chord is multiplied by its midpoint entropy).
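The five definitions above can be sketched in Python as follows. This is an illustrative reading rather than the implementation used in our experiments: the function names, the use of base-2 logarithms throughout, the tie-breaking rule in `consensus`, the number of chords (`steps`), and the choice to evaluate the "midpoint entropy" at the midpoint composition of each chord are all assumptions made for the example. Scaling to the range 0 to 255 is not shown.

```python
import math

def entropy(c):
    """Shannon entropy in bits, taking 0 * log(0) as 0 (assumed convention)."""
    return -sum(f * math.log2(f) for f in c if f > 0)

def consensus(c1, c2):
    """0 if the majority characters match, 1 otherwise (ties: first index, an assumption)."""
    return 0 if c1.index(max(c1)) == c2.index(max(c2)) else 1

def euclidean(c1, c2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def jensen_shannon(c1, c2, pi1=0.5, pi2=0.5):
    mix = [pi1 * a + pi2 * b for a, b in zip(c1, c2)]
    return entropy(mix) - pi1 * entropy(c1) - pi2 * entropy(c2)

def surface_path(c1, c2, steps=100, weighted=False):
    """Chord approximation of the curve lying above the segment from c1 to c2.

    weighted=False sketches Entropy-Surface; weighted=True sketches Entropy-weighted.
    """
    def point(t):
        c = [(1 - t) * a + t * b for a, b in zip(c1, c2)]
        return c + [entropy(c)]  # five frequencies plus the entropy value

    total = 0.0
    prev = point(0.0)
    for s in range(1, steps + 1):
        cur = point(s / steps)
        chord = math.dist(prev, cur)  # chord length in six dimensions
        if weighted:
            # Entropy at the chord's midpoint composition (assumed reading).
            mid = [(a + b) / 2 for a, b in zip(prev[:-1], cur[:-1])]
            chord *= entropy(mid)
        total += chord
        prev = cur
    return total

pure_a = (1.0, 0.0, 0.0, 0.0, 0.0)
pure_c = (0.0, 1.0, 0.0, 0.0, 0.0)
print(consensus(pure_a, pure_c), euclidean(pure_a, pure_c), jensen_shannon(pure_a, pure_c))
print(surface_path(pure_a, pure_c), surface_path(pure_a, pure_c, weighted=True))
```

Because the curve rises above the straight segment, the unweighted path length is never smaller than the Euclidean distance between the same two compositions.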

Each distance function was scaled to values between 0 and 255 inclusive. To correct for differences between the number of copies in each repeat, all composition vectors were normalized to standard vectors for a 10 copy repeat. The frequencies in a standard vector are drawn from {0, 0.1, 0.2, ..., 0.9, 1.0}. The standard vector closest by Euclidean distance to an original vector becomes that vector's normalized representative.
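A brute-force sketch of this normalization is given below. It assumes that a standard vector corresponds to exactly 10 copies, so its five frequencies sum to 1; the text does not state this explicitly. The names `standard_vectors` and `normalize` are hypothetical.

```python
from itertools import product

def standard_vectors(copies=10, k=5):
    """Yield every k-part count vector summing to `copies`, as frequencies."""
    for counts in product(range(copies + 1), repeat=k):
        if sum(counts) == copies:  # assumption: a standard vector sums to 1
            yield tuple(n / copies for n in counts)

def normalize(composition, copies=10):
    """Return the standard vector nearest to `composition`.

    Minimizing the squared Euclidean distance gives the same result as
    minimizing the Euclidean distance itself.
    """
    return min(
        standard_vectors(copies, len(composition)),
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s, composition)),
    )

# A column from a hypothetical 24-copy repeat: 20 A, 3 C, 1 G.
example = normalize((20 / 24, 3 / 24, 1 / 24, 0.0, 0.0))
print(example)
```

Exhaustive search is affordable here because there are only 1001 ways to distribute 10 copies over 5 characters.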

3.2 Calculating the Distance between Repeats

The distance between repeats is calculated using alignment scores. We use a cyclic alignment algorithm [13] in conjunction with our distance functions because the relative starting position of one profile to another may be incorrect in the original data. (That is why we omit repeats with strong subpatterns.) We create the inter-repeat distance matrix for all repeats in our dataset, which becomes the input to our clustering method. The distance between two repeats $R_1$ and $R_2$ is calculated as follows:

$$\mathrm{Dist}(R_1, R_2) = \frac{\text{Alignment Score}(R_1, R_2)}{255 \times \text{Alignment length}(R_1, R_2)} \qquad (1)$$

where 255 is the worst score possible (scaled), using any distance function.
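As a worked check of Equation (1), the short Python sketch below recomputes the Distance column of Table 2 from the alignment scores. The alignment lengths are inferred from the worst scores in that table divided by 255; the function name `repeat_distance` is our own.

```python
WORST_COLUMN_SCORE = 255  # largest scaled distance for a single aligned column

def repeat_distance(alignment_score, alignment_length):
    """Equation (1): alignment score divided by the worst possible score."""
    return alignment_score / (WORST_COLUMN_SCORE * alignment_length)

# (alignment score, alignment length) for the repeat pair shown in Table 2.
table2 = {
    "Consensus": (1020, 15),
    "Entropy-weighted": (727, 15),
    "Euclidean": (1185, 14),
    "Jensen-Shannon": (1105, 13),
    "Entropy-Surface": (1265, 13),
}
distances = {
    name: round(repeat_distance(score, length), 2)
    for name, (score, length) in table2.items()
}
print(distances)
```

The rounded values agree with the last column of Table 2, which confirms that the score is the numerator and the worst score the denominator.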

3.3 Effect of Distance Functions on the Alignments

After collecting a good set of data and cleaning it to remove repeats with strong subpatterns, we

perform our experiment of comparing the different distance functions. We cannot simply use the

different alignment scores produced by the five functions on a particular repeat pair, as each function

has its own properties and the distribution of alignment scores is very much dependent on the distance

function used.

Table 2: Alignments produced by distance functions. The starting position of the upper pattern is

cyclically permuted in these alignments. Note that columns aligning dash to dash are an artifact of

the consensus pattern representation. Dash columns are not present in consensus patterns, but are

present in the profile when a repeat contains characters in a column in less than half its copies. These

columns often occur when a repeat has many copies as is the case here.

| Distance function | Aligned repeats | Alignment score | Worst score | Distance |
|---|---|---|---|---|
| Consensus | - C - - T - - - C C C A G - C | 1020 | 3825 | 0.27 |
| | - A - - T - - A C A C A - - C | | | |
| Entropy-weighted | - C - - T - - - C C C A G - C | 727 | 3825 | 0.19 |
| | - A - - T - - A C A C A - - C | | | |
| Euclidean | A - - G - - C - C T - C C C | 1185 | 3570 | 0.33 |
| | A - - T - - A - C A - C A C | | | |
| Jensen-Shannon | A G - C - - - C T - C C C | 1105 | 3315 | 0.33 |
| | A - - T - - A C A - C A C | | | |
| Entropy-Surface | A G - C - - - C T - C C C | 1265 | 3315 | 0.38 |
| | A - - T - - A C A - C A C | | | |

For example, consider two tandem repeats from Human Chr. 10 with consensus patterns ATACACAC

and CTCCCAGC, and copy numbers 24.6 and 30.0 respectively. Table 2 shows the alignments

and the alignment scores with different distance functions, and also the worst possible scores for the

same pair. The worst possible score depends on the alignment length and the alignment length could

vary with distance function used. The last column is the distance calculated by Equation (1).

3.4 Computing Relative Distances of the Distance Functions with Respect to Alignments

To compare two distance functions we use the following procedure:

• Let $A$ and $B$ be the two repeats to be aligned, and let $D_1$ and $D_2$ be the two distance functions to be compared.

• Align $A$ and $B$ with $D_1$ to get the alignment $AP_1$.

• Align $A$ and $B$ with $D_2$ to get the alignment $AP_2$.

• Calculate the number of identical pairs in $AP_1$ and $AP_2$. By identical pairs we mean the same pair of nucleotides aligned in both alignments.

Iterating this procedure for $N$ pairs of repeats, we calculate the distance between two functions as:

$$\mathrm{Dist}_{nb}(D_1, D_2) = 1 - \frac{1}{N} \sum_{N} \frac{2 \times \text{No. of identical pairs}}{\mathrm{len}(AP_1) + \mathrm{len}(AP_2)} \qquad (2)$$
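The procedure and Equation (2) can be sketched in Python as follows. The representation is hypothetical: each alignment is a list of columns `(i, j)`, where `i` and `j` are positions in repeats A and B and `None` marks a gap. Counting a column as an identical pair whenever the same `(i, j)` occurs in both alignments, including columns with a gap, is an assumption chosen so that two identical alignments are at distance 0.

```python
def pair_similarity(ap1, ap2):
    """2 * identical pairs / (len(AP1) + len(AP2)) for one repeat pair."""
    identical = len(set(ap1) & set(ap2))
    return 2 * identical / (len(ap1) + len(ap2))

def function_distance(alignment_pairs):
    """Equation (2) over a list of (AP1, AP2) tuples, one tuple per repeat pair."""
    n = len(alignment_pairs)
    return 1 - sum(pair_similarity(a, b) for a, b in alignment_pairs) / n

# Toy alignments of the same two repeats under two distance functions.
ap1 = [(0, 0), (1, 1), (2, 2), (3, None)]
ap2 = [(0, 0), (1, 1), (2, None), (3, 2)]

print(function_distance([(ap1, ap2)]))              # two of four columns agree
print(function_distance([(ap1, ap2), (ap1, ap1)]))  # averaged over two repeat pairs
```

Evaluating `function_distance` for every pair of distance functions yields the small distance matrix from which a tree such as the one in Figure 3 can be built.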

Figure 3: Cluster tree depicting the relative comparison of distance functions. Consensus function is farthest away from all the other distance functions with respect to alignments produced. The height represents the inter-function distances as calculated by Equation (2).

Figure 4: Hierarchical cluster tree of repeats in Human Chromosome 10. The horizontal line indicates the clustering cut-off value, which is at a height or distance of 25.5, calculated from Equation (1). In terms of similarity (inverse of distance) the cut-off is 90%.

Thus we take into account the number of positions at which the alignments are identical, and use

this to form the output of our alignment experiment. Figure 3 shows a tree comparing the 5 distance

functions. The tree was obtained by hierarchical single linkage clustering of the distance functions.

Based on this tree, Euclidean and Entropy-Surface are the closest in terms of the alignments produced.

Height in the figure represents distance between the functions.

4 Clustering

Using the repeats from Human Chromosome 10, we produced clusters with the Hierarchical Agglomerative clustering method using single linkage [7]. Hierarchical clustering is widely used despite its time complexity. It is a bottom-up strategy that initially places each data point in a singleton cluster. It then merges these clusters into larger and larger clusters based on the cluster linkage criterion. The single linkage method works by merging the two clusters or points which are closest to each other. The Hierarchical clustering algorithm takes as input an $N \times N$ distance matrix and a cut-off value which specifies the height at which the clustering is terminated. We performed this clustering procedure with the different distance functions and a cut-off height (distance) of 25.5. See Figure 4 for a visualization of the clusters produced from the repeats in Human Chr. 10 using the Entropy-weighted distance function.
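Cutting a single-linkage tree at a given height yields the connected components of the graph that joins every pair of points whose distance is at most that height. The Python sketch below uses this equivalence with a union-find structure instead of building the full tree; it is not the R routine used in our analysis, and the toy matrix (on the 0 to 255 scale implied by the cut-off of 25.5) is invented for illustration.

```python
def single_linkage_clusters(dist, cutoff):
    """Cut a single-linkage hierarchy at `cutoff`; return one label per point."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every pair that lies within the cut-off.
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

dist = [
    [0, 10, 30, 200],
    [10, 0, 20, 210],
    [30, 20, 0, 190],
    [200, 210, 190, 0],
]
labels = single_linkage_clusters(dist, 25.5)
print(labels)
```

Points 0 and 2 are 30 apart, beyond the cut-off, yet they share a cluster because each is within 25.5 of point 1. This chaining behavior is characteristic of single linkage.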

4.1 Cluster Validation

The importance and effect of cluster structure with respect to tandem repeat families is still unclear.

However, we analyze the shape and density of the clusters and would like to produce good clusters

using these metrics. We assess the quality of clusters produced by the Hierarchical clustering method

using the cluster validation techniques Average Cluster Density and Average Silhouette Width defined

by [10]. Consequently we can rank the individual distance functions based on the quality of clusters

they produce.

Average Cluster Density: This measures the compactness and density of the clusters. The cluster

density is calculated by using the cluster diameter, which is the largest distance between any pair of

points in the cluster.

$$\mathrm{ClusterDensity} = \frac{\mathrm{AverageLength}}{\mathrm{ClusterDiameter}} \qquad (3)$$

where Average Length is the average distance between any two points in the cluster. The Average

Cluster Density is the average over all clusters. If the Average Cluster Density is close to 1, we have

highly compact clusters.
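A minimal Python sketch of Equation (3) and its average over clusters follows. Two conventions in it are assumptions, since the text does not cover these cases: clusters with a single member are skipped, and a cluster whose members are all at distance 0 is given density 1.

```python
from itertools import combinations

def cluster_density(dist, members):
    """Equation (3): average pairwise distance divided by the cluster diameter."""
    pairs = [dist[i][j] for i, j in combinations(members, 2)]
    diameter = max(pairs)
    if diameter == 0:
        return 1.0  # all members identical; treating this as fully compact is an assumption
    return sum(pairs) / len(pairs) / diameter

def average_cluster_density(dist, labels):
    clusters = {}
    for point, label in enumerate(labels):
        clusters.setdefault(label, []).append(point)
    # Singleton clusters have no pairwise distances; skipping them is an assumption.
    densities = [cluster_density(dist, m) for m in clusters.values() if len(m) > 1]
    return sum(densities) / len(densities) if densities else float("nan")

dist = [
    [0, 10, 30, 200],
    [10, 0, 20, 210],
    [30, 20, 0, 190],
    [200, 210, 190, 0],
]
print(average_cluster_density(dist, [0, 0, 0, 1]))
```

For the three-member cluster the pairwise distances are 10, 20 and 30, so the average length is 20, the diameter is 30, and the density is 2/3.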

Table 3: Human Chr. 10 clustering results using the Hierarchical clustering algorithm, with different

distance functions and distance 25.5 as the cut-off.

| Distance function | No. of clusters | Sil Width | Avg Diameter |
|---|---|---|---|
| Consensus | 38 | 0.8 | 0.85 |
| Entropy-weighted | 40 | 0.73 | 0.90 |
| Euclidean | 38 | 0.64 | 0.93 |
| Entropy-Surface | 40 | 0.76 | 0.89 |
| Jensen-Shannon | 36 | 0.62 | 0.93 |

Silhouette Width: This is a measure of the membership of an object $i$ to a cluster $C$. The silhouette width shows which objects lie well within the cluster and which ones are between clusters. Consider an object $i$ of the data set, and let $C_i$ denote the cluster to which it is assigned. We calculate: 1) $a(i)$ = average distance of $i$ to all other objects of $C_i$; 2) for each cluster $C$ such that $C \neq C_i$, $d(i, C)$ = the average distance of $i$ to all other objects of $C$; and 3) over all clusters $C \neq C_i$, $b(i) = \min(d(i, C))$, the average distance of $i$ to its nearest neighbor cluster. Silhouette Width $s(i)$ is given by

$$s(i) = \frac{b(i) - a(i)}{\max\{b(i), a(i)\}}.$$

From this, $s(i)$ lies in the range $[-1, 1]$. The Average Silhouette Width $s_{avg}(i)$ is the average over all the objects in the dataset. If $s_{avg}(i)$ is close to 1, the objects are well clustered or structured.
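The silhouette computation can be sketched in Python directly from these definitions. Setting $s(i) = 0$ for an object that is alone in its cluster, or when only one cluster exists, is a convention assumed here; the function names and the toy distance matrix are ours.

```python
def silhouette_widths(dist, labels):
    """Return s(i) for every object, given a distance matrix and cluster labels."""
    clusters = {}
    for point, label in enumerate(labels):
        clusters.setdefault(label, []).append(point)

    widths = []
    for i, own in enumerate(labels):
        same = [j for j in clusters[own] if j != i]
        others = [m for label, m in clusters.items() if label != own]
        if not same or not others:
            widths.append(0.0)  # assumed convention for singletons or a single cluster
            continue
        a = sum(dist[i][j] for j in same) / len(same)
        b = min(sum(dist[i][j] for j in m) / len(m) for m in others)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

def average_silhouette_width(dist, labels):
    widths = silhouette_widths(dist, labels)
    return sum(widths) / len(widths)

dist = [
    [0, 10, 100, 100],
    [10, 0, 100, 100],
    [100, 100, 0, 20],
    [100, 100, 20, 0],
]
print(silhouette_widths(dist, [0, 0, 1, 1]))
print(average_silhouette_width(dist, [0, 0, 1, 1]))
```

In this toy example the first cluster is tighter (a = 10) than the second (a = 20) while both are 100 away from each other, so its members receive the higher width.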

Figure 5: These graphs show the relationship between the number of clusters and percentage of repeats

clustered at different distance cut-offs (75% – 99%) using the Hierarchical clustering method and the

Entropy-weighted distance function. The “mountain” line is the number of multi-repeat clusters

(unary clusters are not counted), the descending line is the number of repeats in multi-repeat clusters.

Comparison of these graphs with those produced by the other functions (not shown) indicated that

Entropy-weighted was able to cluster a higher percentage of repeats than the other distance functions.

This was one criterion for picking Entropy-weighted as the preferred distance function.

We calculate the cluster statistics using different distance functions. Table 3 shows the cluster

qualities of each of the five distance functions on Human Chr. 10. We chose the Entropy-weighted

distance function for the remainder of our analysis because it scores well on both measures and was

best in terms of number of clusters produced and the percentage of repeats clustered (see Figure 5).

4.2 Multiple-Phase Clustering

A defect of Hierarchical clustering is that clusters can be low-quality in the sense that they are

elongated and less dense. To split these chained clusters formed by single-linkage, we can subject

them to clustering again, using other partition based clustering methods. Using the cluster validation

techniques, we can identify these low-quality clusters. We use the Partition Around Medoids (PAM) algorithm [10]

to re-cluster the chained clusters, splitting them into smaller clusters.

PAM is one of the variants of the popular K-means approach but is more robust than K-means because medoids are less influenced by outliers. PAM consists of two steps: a BUILD step, where $k$ representative objects, called medoids, are initially selected; and a SWAP step, where one medoid is swapped with another data point, iteratively, until the clustering function is minimized. PAM requires as input $k$, the number of clusters to be formed from the data set, and an $N \times N$ distance matrix. To determine $k$, we run PAM on the data set several times, each time with a different $k$, and select the $k^*$ which yields the highest Average Silhouette Width: $k^* = \arg\max_k s_{avg}(k)$. Human Chromosome 10, when subjected to this multi-phase clustering, yielded 44 clusters with an Average Silhouette Width of 0.76. The Hierarchical algorithm had produced 40 clusters with an Average Silhouette Width of 0.73.
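The BUILD and SWAP steps and the selection of $k^*$ can be sketched in Python as below. This is a simplified illustration, not the R implementation used in our analysis: the SWAP loop accepts the first improving swap rather than searching for the best one, the cost is taken to be the sum of distances from each object to its nearest medoid, singletons receive silhouette 0, and candidate values of k are assumed to be at least 2.

```python
def total_cost(dist, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    n = len(dist)
    # BUILD: greedily add the object that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        best = min((c for c in range(n) if c not in medoids),
                   key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)
    # SWAP: exchange a medoid with a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current:
                    medoids, current, improved = trial, cost, True
    # Assign every object to its nearest medoid.
    labels = [min(range(k), key=lambda p: dist[i][medoids[p]]) for i in range(n)]
    return medoids, labels

def average_silhouette(dist, labels):
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        same = [j for j in clusters[lab] if j != i]
        if not same:
            continue  # singleton: s(i) = 0 (assumed convention)
        a = sum(dist[i][j] for j in same) / len(same)
        b = min(sum(dist[i][j] for j in m) / len(m)
                for other, m in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

def best_k(dist, k_values):
    """k* = argmax_k of the Average Silhouette Width of the PAM clustering."""
    return max(k_values, key=lambda k: average_silhouette(dist, pam(dist, k)[1]))

# Toy data: six points on a line forming two obvious groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in positions] for p in positions]
print(pam(dist, 2))
print(best_k(dist, range(2, 5)))
```

On the toy data the two groups are recovered with their central points as medoids, and k = 2 gives the highest Average Silhouette Width among the candidates 2, 3 and 4.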

Figure 6: Result of PAM on a cluster produced by Hierarchical clustering method. Multiple alignments

of 2 pairs of repeats (different) show inter-cluster and intra-cluster similarity.

Figure 6 shows an example where a poor-quality cluster produced by the Hierarchical method on

Human Chromosome 10 was re-clustered using the PAM method. Initially the Hierarchical clustering

produced a cluster containing 138 repeats and an Average Silhouette Width of 0.45. Running PAM

on this cluster produced two smaller clusters while increasing the Average Silhouette Width to 0.6. As

the alignments illustrate, repeats within a cluster are much more closely related than those between

clusters.

5 Conclusion

We have described a new quantitative approach to evaluate distance functions with respect to alignments, and we study their effects on discovering families of tandem repeats. We describe a relative comparison between the distance functions and also an individual evaluation using cluster validation techniques. Tandem repeats from Human Chromosome 10 were clustered with a multi-phase approach that combines the Hierarchical clustering method and Partition Around Medoids.

Our results show that for clustering repeats a multi-phase clustering approach produces better quality

clusters. The two entropy based functions – Entropy-weighted and Entropy-Surface – outscore the

other distance functions in our alignment experiment, quality and number of clusters produced, and

also the number of repeats clustered. Future clustering tools in the Tandem Repeats Database will

employ entropy based distance functions and multi-phase clustering as demonstrated in this work.

References

[1] Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D., Basic local alignment search tool,

J. Mol. Biol., 215:403–410, 1990.

[2] Benson, G., Tandem repeats finder: A program to analyze DNA sequences, Nucleic Acids Res.,

27:573–580, 1999.

[3] Benson, G., A new distance measure for comparing sequence profiles based on paths along an

entropy surface, Bioinformatics, 18:S44–S53, 2002.

[4] Castelo, A., Martins, W., and Gao, G., TROLL-tandem repeat occurrence locator, Bioinformatics, 18:634–636, 2002.

[5] Delgrange, O. and Rivals, E., STAR: An algorithm to search for tandem approximate repeats,

Bioinformatics, 20:2812–2820, 2004.

[6] Edwards, A., Hammond, H., Jin, L., Caskey, C., and Chakraborty, R., Genetic variation at

five trimeric and tetrameric tandem repeat loci in four human population groups, Genomics,

12:241–253, 1992.

[7] Everitt, B.S., Cluster Analysis, Edward Arnold, 1992.

[8] Gribskov, M., Lüthy, R., and Eisenberg, D., Profile analysis, Methods in Enzymol., 183:146–159, 1990.

[9] Guha, S., Rastogi, R., and Shim, K., CURE: An efficient clustering algorithm for large databases,

Proc. ACM SIGMOD International Conference on Management of Data 1998, 73–84, 1998.

[10] Kaufman, L. and Rousseeuw, P.J., Finding groups in data: An introduction to cluster analysis,

John Wiley and Sons, 1990.

[11] Kolpakov, R., Bana, G., and Kucherov, G., Mreps: Efficient and flexible detection of tandem

repeats in DNA, Nucleic Acids Res., 31:3672–3678, 2003.

[12] Landau, G., Schmidt, J., and Sokol, D., An algorithm for approximate tandem repeats, J. Comp.

Biol., 8:1–18, 2001.

[13] Maes, M., On a cyclic string-to-string correction problem, Information Processing Letters, 35:73–78, 1990.

[14] R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, 2005.

[15] Smith, T. and Waterman, M., Comparison of biosequences, Advances in Applied Mathematics,

2:482–489, 1981.

[16] Weber, J. and May, P., Abundant class of human DNA polymorphisms which can be typed using

the polymerase chain reaction, Am. J. Hum. Genet., 44:388–396, 1989.

[17] Benson, G., TRDB, http://tandem.bu.edu/cgi-bin/trdb/trdb.exe

[18] Ruitberg, C., Reeder, D., and Butler, J., STRBase: A short tandem repeat DNA database for

the human identity testing community, http://www.cstl.nist.gov/biotech/strbase, 2001.