Genome Informatics 16(1): 3–12 (2005)

Evaluating Distance Functions for Clustering Tandem Repeats

Suyog Rao 1,2

suyog@bu.edu

Alfredo Rodriguez 2

alfredo@bu.edu

Gary Benson 2,3

gbenson@bu.edu

1 Department of Electrical and Computer Engineering, Boston University, Boston, MA, USA

2 Laboratory for Biocomputing and Informatics, Boston University, Boston, MA, USA

3 Departments of Computer Science and Biology, Graduate Program in Bioinformatics, Boston University, Boston, MA, USA

Abstract

Tandem repeats are an important class of DNA repeats and much research has focused on

their efficient identification [2, 4, 5, 11, 12], their use in DNA typing and fingerprinting [6, 16, 18],

and their causative role in trinucleotide repeat diseases such as Huntington Disease, myotonic

dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into

groups or families based on sequence similarity so that their biological importance may be further

explored. To cluster tandem repeats we need a notion of pairwise distance which we obtain by

alignment. In this paper we evaluate five distance functions used to produce those alignments –

Consensus, Euclidean, Jensen-Shannon Divergence, Entropy-Surface, and Entropy-weighted. It is

important to analyze and compare these functions because the choice of distance metric forms the

core of any clustering algorithm. We employ a novel method to compare alignments and thereby

compare the distance functions themselves. We rank the distance functions based on the cluster

validation techniques – Average Cluster Density and Average Silhouette Width. Finally, we propose

a multi-phase clustering method which produces good-quality clusters. In this study, we analyze

clusters of tandem repeats from five sequences: Human Chromosomes 3, 5, 10 and X and C. elegans

Chromosome III.

Keywords: tandem repeats, distance functions, cluster analysis, cluster validation

1 Introduction

DNA molecules are subjected to a variety of mutational events, one of which is tandem duplication

which produces tandem repeats. A tandem repeat is an occurrence of two or more adjacent, often

approximate copies of a sequence of nucleotides. For example,

ACTTAGT ACTTAGT ACTAAGT ACTTAGT

We are interested in clustering repeats into families based on their sequence similarities. Members of

a family have similar sequence but occur at different locations in a genome or in different genomes.

Families have been detected in both prokaryotic and eukaryotic genomes, including the E. coli, P.

aeruginosa, S. cerevisiae, C. elegans, and human genomes. To accurately and effectively compare

repeats, we cannot use standard measures like BLAST [1] or straightforward sequence alignment [15],

because variant copy number and copy ordering are problematic for these methods. Benson [3] used

a profile representation of the repeats to overcome these difficulties. A profile [8] is a sequence whose

length equals the number of columns in a multiple alignment and whose individual elements are the

character compositions of the columns.

Figure 1: Multiple alignment view of a tandem repeat. Individual copies are aligned to the repeat

consensus to obtain the profile representation of the repeat. Common mutations among the copies, as

are evident in this view, are reflected in the profile compositions.

In this representation, the n individual copies of a single tandem repeat are aligned to form a

multiple alignment M of length k (see Figure 1). Let $M_{i,j}$ represent the element in the $i$th row and $j$th

column of M. A profile for M is a sequence $S = C_1, C_2, \ldots, C_k$ of compositions, where each $C_j$ is a

vector of frequencies of characters in $M_{*,j}$: $C_j = (f_A, f_C, f_G, f_T, f_-)$, with $f_\sigma$ indicating the frequency

of letter $\sigma$ and $f_-$ indicating the frequency of gaps in the column.
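
As an illustration of the profile definition, the following minimal Python sketch builds the composition vectors for the four copies of the example repeat from the Introduction. The function name `profile` and the ordering of characters in `ALPHABET` are assumptions made for this sketch, not part of the original method description.

```python
from collections import Counter

# Assumed ordering of entries in each composition vector: (f_A, f_C, f_G, f_T, f_-).
ALPHABET = "ACGT-"

def profile(alignment):
    """Return one composition vector per column of a multiple alignment.

    `alignment` is a list of equal-length strings over ALPHABET,
    one string per aligned copy of the repeat.
    """
    if not alignment:
        return []
    k = len(alignment[0])
    if any(len(row) != k for row in alignment):
        raise ValueError("all rows of the alignment must have the same length")
    n = len(alignment)
    compositions = []
    for j in range(k):
        counts = Counter(row[j] for row in alignment)
        # Frequency of each character (including the gap) in column j.
        compositions.append(tuple(counts[ch] / n for ch in ALPHABET))
    return compositions

copies = ["ACTTAGT", "ACTTAGT", "ACTAAGT", "ACTTAGT"]
example_profile = profile(copies)
```

In this example the fourth column holds T, T, A, T, so its composition has frequency 0.25 for A and 0.75 for T.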

Alignment of profiles requires a pairwise distance function for compositions. In [3], Benson explored

a distance function based on minimal path lengths along an entropy surface. In this paper we explore

five distance functions: two related to the entropy surface (Entropy-weighted and Entropy-Surface); a

third, the Jensen-Shannon Divergence, which is also based on entropy; and two others, Euclidean and

Consensus. Each function produces scores on its own scale, and the resulting alignments may also differ. Hence,

gauging the effect of the distance functions strictly from scores is neither convincing nor effective

and might lead to wrong conclusions. To overcome these difficulties, we propose a new approach

for gauging the closeness or similarity of these distance functions to each other by comparing the

alignments which they produce. Thus by the end of our experiment we obtain a metric between these

distance functions with respect to the alignments. From this analysis and further examination of

clusters produced with these functions, we choose a single function for use in clustering.

Finally, we present a multi-phase clustering scheme, which first uses hierarchical clustering

and, as a second step, uses the Partitioning Around Medoids (PAM) [10] algorithm to obtain

good-quality clusters. Multi-phase clustering methods like CURE [9] have been used previously to

refine cluster quality. Clustering methods available in the R statistical programming language [14]

were used in our analysis.

The paper is organized as follows. Section 2 describes the repeats data we used in our analysis.

Section 3 defines the distance functions and describes our method for comparing the alignments

produced by the different functions. Section 4 describes our analysis of cluster quality and the multi-phase

clustering approach. Finally, Section 5 summarizes our conclusions.

2 Repeats

2.1 Data Collection and Cleaning

It is important to start with a good set of data so our conclusions will be robust. The data used for

this analysis were obtained from the Tandem Repeats Database (TRDB) [17] using default parameters

and consist of 1000 pairs of related tandem repeats from Human Chromosomes 3, 5, 10 and X (NCBI

Build 34, July 2003 Assembly) and C. elegans Chromosome III (Sanger Institute, Aug 2002). These

repeats were obtained using the Tandem Repeats Finder (TRF) [2] program. From the original set of

repeats in each chromosome, we used the TRDB filtering capability to select only repeats which have

a copy number greater than 5 and whose pattern size is greater than 35. The results of this selection

are shown in Table 1.

Table 1: Results of TRF analysis and TRDB filtering on the chromosomes.

Chromosome            Size in bp (incl. gaps)   Number of repeats   Repeats after filter
Human Chr. 3          199344050                 37643               660
Human Chr. 5          181034922                 35922               971
Human Chr. 10         135037215                 29510               795
Human Chr. X          153692391                 31779               794
C. elegans Chr. III   13002367                  5011                354

These repeats were subjected to the pre-existing clustering algorithm in TRDB. Repeats from each

chromosome were clustered separately with a connected components algorithm using Entropy-Surface

as the distance function. In total, the clustering produced 223 clusters over all five chromosomes.

Our data set was sampled from this pool of repeats in an automated and randomized

manner. From the pool, we chose 400 pairs of repeats, each pair from a single cluster, from each of

Human Chromosomes 3, 5, 10 and X, and 150 pairs from C. elegans Chromosome III. This candidate

set of 1750 pairs was then subjected to the data cleansing process described in the next section.

2.2 Data Cleansing - Identifying and Removing Subpatterns in Tandem Repeats

A tandem repeat in the data set may contain within itself one or more copies of tandem repeats

(repeats within a repeat), which we define to be subpatterns of the original repeat. A tandem repeat

is considered to have a perfect subpattern if its pattern length is a perfect multiple of the subpattern

length. Although there are many tandem repeats which have perfect subpatterns, it is important that

we also consider tandem repeats whose pattern size is a close multiple of some subpattern length.

Thus, we are interested in subpatterns whose copies span exactly or almost the entire length of the

original pattern. We call these strong subpatterns. Note also that the subpatterns in a tandem repeat

may be approximate copies of each other.

Why care about strong subpatterns?

Because we are comparing alignments produced by the different scoring functions, we do not want to

include situations where the alignments differ because a repeat has cyclically shifted in the alignment

simply because it contains a strong subpattern. Hence we eliminate these repeats from our data set. As

an illustration, consider the tandem repeat X, consisting of subpatterns $X_1$, $X_2$ and $X_3$ as shown in

Figure 2. If $X_1 \approx X_2 \approx X_3$ and we try to align this tandem repeat with another tandem repeat, the

pattern might rotate cyclically as shown in Figure 2, depending on the distance function used, which

is undesirable. Formally, given a tandem repeat sequence X, the problem is to determine whether

a strong subpattern exists within it. To achieve this, we first identify the subpatterns in a given

set of repeats and then associate with each a subpattern score, or strength. This allows

us to identify a threshold that separates the strong subpatterns from the weak subpatterns. We omit

further discussion of this task.

Figure 2: Subpatterns can cause cyclic shifting of a repeat within an alignment when using different

distance functions. (a) is a tandem repeat pattern X with subpatterns. (b) is one such cyclic rotation

of subpatterns in X.

3 Comparing Alignments

3.1 Distance Functions

We evaluated five distance functions. In what follows, $C_i$ is a composition vector of $k = 5$ character

frequencies $f_{\sigma_1}, \ldots, f_{\sigma_k}$, one for each DNA base and the gap character:

1. Consensus:

$$\mathrm{Cons}(C_1, C_2) = \begin{cases} 0 & \text{if the majority characters in } C_1 \text{ and } C_2 \text{ match,} \\ 1 & \text{otherwise.} \end{cases}$$

2. Euclidean:

$$\mathrm{Euc}(C_1, C_2) = \sqrt{\sum_{i=1}^{k} \Delta^2_{\sigma_i}}$$

where $\Delta^2_{\sigma_i}$ is the square of the difference between the frequencies for character $\sigma_i$ in $C_1$ and $C_2$.

3. Jensen-Shannon Divergence:

$$\mathrm{JS}(C_1, C_2) = H(\pi_1 C_1 + \pi_2 C_2) - \pi_1 H(C_1) - \pi_2 H(C_2)$$

where $H(C)$ is the entropy of vector $C$, that is, $H(C) = -\sum_{i=1}^{k} f_{\sigma_i} \log_2(f_{\sigma_i})$, and $\pi_i$ is a weighting

factor for vector $C_i$. We used $\pi_i = 0.5$.

4. Entropy-Surface: This function and the next are related to one defined by Benson in [3]. They

are based on the entropy function

$$H(C) = -\sum_{\sigma} f_\sigma \log(f_\sigma)$$

defined over all possible compositions $C$. The entropy function describes a curved surface in six

dimensions (five for the character frequencies $f_\sigma$ and one for the entropy value). Any composition

$C$ defines a point $H(C)$, which is the projection of $C$ in 5-space onto the entropy surface.

For the distance measure, we project the straight line segment connecting $C_1$ and $C_2$ onto the

entropy surface. The distance between $C_1$ and $C_2$ is the length of the resulting curve (which we

numerically approximate with chords).

5. Entropy-weighted: Similar to the preceding, except the length of the curve is weighted by the

entropy value itself (in our numerical approximation, the length of each approximating chord is

multiplied by its midpoint entropy).
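
The five definitions above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in this study: composition vectors are tuples ordered as (A, C, G, T, gap), entropy is taken in base 2 throughout, ties for the majority character are broken by position, the "midpoint entropy" is taken to be the entropy of the midpoint composition of each chord, and the number of chords (`steps`) is arbitrary. The scaling of each function to the range 0 to 255 is not shown.

```python
import math

def entropy(c):
    """Shannon entropy in bits; zero frequencies contribute nothing."""
    return -sum(f * math.log2(f) for f in c if f > 0)

def consensus(c1, c2):
    """0 if the majority characters match, 1 otherwise (ties: first index wins)."""
    return 0 if c1.index(max(c1)) == c2.index(max(c2)) else 1

def euclidean(c1, c2):
    """Square root of the summed squared frequency differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def jensen_shannon(c1, c2, pi1=0.5, pi2=0.5):
    """H(pi1*C1 + pi2*C2) - pi1*H(C1) - pi2*H(C2)."""
    mix = [pi1 * a + pi2 * b for a, b in zip(c1, c2)]
    return entropy(mix) - pi1 * entropy(c1) - pi2 * entropy(c2)

def _surface_chords(c1, c2, steps):
    """Yield (chord length, midpoint entropy) along the projected segment."""
    def point(t):
        return [(1 - t) * a + t * b for a, b in zip(c1, c2)]
    prev = point(0.0)
    for i in range(1, steps + 1):
        cur = point(i / steps)
        # Chord between the surface points (C, H(C)) in six dimensions.
        d2 = sum((a - b) ** 2 for a, b in zip(prev, cur))
        d2 += (entropy(prev) - entropy(cur)) ** 2
        mid = [(a + b) / 2 for a, b in zip(prev, cur)]
        yield math.sqrt(d2), entropy(mid)
        prev = cur

def entropy_surface(c1, c2, steps=100):
    """Approximate length of the curve on the entropy surface."""
    return sum(length for length, _ in _surface_chords(c1, c2, steps))

def entropy_weighted(c1, c2, steps=100):
    """As entropy_surface, with each chord weighted by its midpoint entropy."""
    return sum(length * h for length, h in _surface_chords(c1, c2, steps))

pure_a = (1.0, 0.0, 0.0, 0.0, 0.0)
pure_c = (0.0, 1.0, 0.0, 0.0, 0.0)
```

For the two pure compositions `pure_a` and `pure_c`, the Euclidean distance is the square root of 2, and the Jensen-Shannon Divergence with equal weights is 1 bit, because the mixture is an even split between two characters while each input has zero entropy.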

Each distance function was scaled to values between 0 and 255 inclusive. To correct for differences

in the number of copies in each repeat, all composition vectors were normalized to standard

vectors for a 10-copy repeat. The frequencies in a standard vector are drawn from {0, 0.1, 0.2, ..., 0.9, 1.0}.

The standard vector closest by Euclidean distance to an original vector becomes that vector's

normalized representative.
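
The normalization step can be sketched as a nearest-neighbor search over all standard vectors. The sketch below assumes that a standard vector corresponds to 10 copies distributed over the five characters, so that its frequencies sum to 1; it also assumes ties are broken by enumeration order. The names `STANDARD` and `normalize` are hypothetical.

```python
from itertools import product

COPIES = 10  # standard vectors describe a 10-copy repeat

# All ways to distribute 10 copies over the 5 characters (A, C, G, T, gap),
# stored as integer counts; dividing by COPIES gives the frequencies.
STANDARD = [c for c in product(range(COPIES + 1), repeat=5) if sum(c) == COPIES]

def normalize(composition):
    """Return the standard vector closest to `composition` by Euclidean distance."""
    target = [COPIES * f for f in composition]
    best = min(
        STANDARD,
        key=lambda counts: sum((n - t) ** 2 for n, t in zip(counts, target)),
    )
    return tuple(n / COPIES for n in best)

normalized = normalize((0.24, 0.0, 0.0, 0.76, 0.0))
```

Exhaustive search is used here only because it is easy to verify; there are just 1001 candidate vectors, so the cost per composition is small.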

3.2 Calculating the Distance between Repeats

The distance between repeats is calculated using alignment scores. We use a cyclic alignment

algorithm [13] in conjunction with our distance functions because the relative starting position of one

profile to another may be incorrect in the original data. (This is why we omit repeats with strong

subpatterns.) We create the inter-repeat distance matrix for all repeats in our dataset, which becomes

the input to our clustering method. The distance between two repeats $R_1$ and $R_2$ is calculated as

follows:

$$\mathrm{Dist}(R_1, R_2) = \frac{\text{Alignment Score}(R_1, R_2)}{255 \times \text{Alignment Length}(R_1, R_2)} \qquad (1)$$

where 255 is the worst score possible (scaled), using any distance function.
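
As a check on the normalization in Equation (1), which divides the alignment score by 255 times the alignment length, the short Python sketch below recomputes the distances listed in Table 2 from the reported alignment scores and alignment lengths (the number of aligned columns). The function and constant names are hypothetical.

```python
WORST_COLUMN_SCORE = 255  # worst (scaled) score for a single aligned column

def repeat_distance(alignment_score, alignment_length):
    """Normalized distance: alignment score over the worst possible score."""
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    return alignment_score / (WORST_COLUMN_SCORE * alignment_length)

# (alignment score, alignment length in columns) as reported in Table 2.
table2 = {
    "Consensus": (1020, 15),
    "Entropy-weighted": (727, 15),
    "Euclidean": (1185, 14),
    "Jensen-Shannon": (1105, 13),
    "Entropy-Surface": (1265, 13),
}
distances = {name: round(repeat_distance(s, n), 2) for name, (s, n) in table2.items()}
```

The worst score for each row of Table 2 is 255 times the number of aligned columns, for example 255 × 15 = 3825 for the Consensus alignment.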

3.3 Effect of Distance Functions on the Alignments

After collecting a good set of data and cleaning it to remove repeats with strong subpatterns, we

perform our experiment of comparing the different distance functions. We cannot simply use the

different alignment scores produced by the five functions on a particular repeat pair, as each function

has its own properties and the distribution of alignment scores is very much dependent on the distance

function used.

Table 2: Alignments produced by distance functions. The starting position of the upper pattern is

cyclically permuted in these alignments. Note that columns aligning dash to dash are an artifact of

the consensus pattern representation. Dash columns are not present in consensus patterns, but are

present in the profile when a repeat contains characters in a column in less than half its copies. These

columns often occur when a repeat has many copies, as is the case here.

Distance Function   Aligned Repeats                 Alignment Score   Worst Score   Distance

Consensus           - C - - T - - - C C C A G - C   1020              3825          0.27
                    - A - - T - - A C A C A - - C

Entropy-weighted    - C - - T - - - C C C A G - C   727               3825          0.19
                    - A - - T - - A C A C A - - C

Euclidean           A - - G - - C - C T - C C C     1185              3570          0.33
                    A - - T - - A - C A - C A C

Jensen-Shannon      A G - C - - - C T - C C C       1105              3315          0.33
                    A - - T - - A C A - C A C

Entropy-Surface     A G - C - - - C T - C C C       1265              3315          0.38
                    A - - T - - A C A - C A C

For example, consider two tandem repeats from Human Chr. 10 with consensus patterns ATACACAC

and CTCCCAGC, and copy numbers 24.6 and 30.0, respectively. Table 2 shows the alignments

and the alignment scores with different distance functions, and also the worst possible scores for the