JOURNAL OF COMPUTATIONAL BIOLOGY

Volume 15, Number 2, 2008

© Mary Ann Liebert, Inc.

Pp. 129–138

DOI: 10.1089/cmb.2007.0155

Sequence Alignment with an Appropriate

Substitution Matrix

XIAOQIU HUANG

ABSTRACT

A widely used algorithm for computing an optimal local alignment between two sequences

requires a parameter set with a substitution matrix and gap penalties. It is recognized that a

proper parameter set should be selected to suit the level of conservation between sequences.

We describe an algorithm for selecting an appropriate substitution matrix at given gap

penalties for computing an optimal local alignment between two sequences. In the algorithm,

a substitution matrix that leads to the maximum alignment similarity score is selected among

substitution matrices at various evolutionary distances. The evolutionary distance of the

selected substitution matrix is defined as the distance of the computed alignment. To show

the effects of gap penalties on alignments and their distances and help select appropriate

gap penalties, alignments and their distances are computed at various gap penalties. The

algorithm has been implemented as a computer program named SimDist. The SimDist

program was compared with an existing local alignment program named SIM for finding

reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where

RBPs are commonly used as an operational definition of orthologous sequences. SimDist

produced more accurate results than SIM on 50 of the 100 families, whereas both programs

produced the same results on the other 50 families. SimDist was also used to compare three

types of substitution matrices in scoring 444,461 pairs of homologous sequences from the

100 families.

Key words: evolutionary distance, sequence alignment, substitution matrix.

1. INTRODUCTION

A widely used algorithm (Gotoh, 1982; Smith and Waterman, 1981) for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences (Altschul, 1993; Brutlag et al., 1990; Vingron and Argos, 1991). Diverse approaches are taken to study the effects of parameters on alignments and select proper parameter values (Durbin et al., 1998; Fernández-Baca and Srinivasan, 1991; Fitch and Smith, 1983; Gonnet et al., 1992; Gotoh, 1990; Gusfield et al., 1992; Huang and Brutlag, 2007; Pearson, 1995; Reese and Pearson, 2002; Vingron and Waterman, 1994; Vogt et al., 1995; Waterman et al., 1992; Webb et al., 2002). Computing the accurate similarity score and evolutionary distance between sequences with a parameter set that is appropriate for the sequences is crucial for analysis of a large family of homologous sequences.

Department of Computer Science, Iowa State University, Ames, Iowa.

We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. For a real number $t > 0$, let $S(t)$ be a substitution matrix at evolutionary distance $t$ in PAM (Point Accepted Mutations) units (Dayhoff et al., 1978). The algorithm computes an estimate $\hat{t}$ at given gap penalties such that using the substitution matrix $S(\hat{t})$ leads to the maximum alignment score. The estimate $\hat{t}$ is defined as the evolutionary distance of the computed alignment. Note that the evolutionary distance, which is commonly used in construction of phylogenetic trees (Felsenstein, 2004), is different from the edit distance of an alignment. At given gap penalties, the new algorithm is about 10 times slower than the Smith-Waterman algorithm.

To show the effects of gap penalties on alignments and their distances and help select appropriate

gap penalties, alignments and their distances are computed at various gap penalties. An increase in gap

penalties has three types of possible effects on alignments: alignment breakage, no change in alignment

configuration, and a decrease in the number of gaps accompanied by an increase in the number of mismatches. The

third type of effect results in an alignment with a larger distance. Changes in alignment configuration and

distance, induced by using various gap penalties, are helpful in selecting appropriate gap penalties.

The algorithm has been implemented as a computer program named SimDist (Similarity and Distance).

The SimDist program was used to compute alignments and their distances at various gap penalties on

2282 randomly selected pairs of homologous protein sequences from 100 families. Alignments and their

distances from the program at various gap penalties were examined to select appropriate gap penalties.

The SimDist program was compared with an existing local alignment program named SIM (Huang and

Miller, 1991) for finding reciprocally best-matching pairs (RBPs) of sequences in each of the 100 protein

families, where RBPs are commonly used as an operational definition of orthologous sequences (Tatusov

et al., 1997). SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both

programs produced the same results on the other 50 families. SimDist was also used to compare three

types of substitution matrices (Henikoff and Henikoff, 1992; Jones et al., 1992; Müller and Vingron, 2000)

in scoring 444,461 pairs of homologous sequences from the 100 protein families.

2. METHODS

We describe the new algorithm in the context of protein sequences; it can be modified to work for DNA sequences. We start with the Markov model of Dayhoff et al. (1978) for protein evolution. Let $P(t)$ denote the 20 by 20 transition probability matrix of the Markov model over a time period $t$, where for residues $a$ and $b$, $P_{ab}(t)$ is the transition probability from residue $a$ to residue $b$ over the time period $t$. We denote by $\pi$ the equilibrium probability distribution of the residues, where $\pi(a)$ is the equilibrium probability of residue $a$. Then $\pi(a) P_{ab}(t)$ is the joint probability of seeing $a$ aligned with $b$. We denote by $S(t)$ the substitution matrix at time $t$ that is obtained from $P(t)$ by the formula (Dayhoff et al., 1978): for residues $a$ and $b$, and a scaling factor $c > 0$,

$$S_{ab}(t) = c \cdot \log_2 \frac{\pi(a) P_{ab}(t)}{\pi(a) \pi(b)} = c \cdot \log_2 \frac{P_{ab}(t)}{\pi(b)}. \tag{1}$$

The substitution matrix $S(t)$ is said to be in $1/c$ bit units. Like the widely used BLOSUM matrix of Henikoff and Henikoff (1992), a 23 by 23 substitution matrix is constructed from the 20 by 20 substitution matrix to accommodate three additional symbols B, Z, and X. For efficiency, the scaling factor $c$ is set to 30 and all values in the substitution matrix are rounded to integers.
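As an illustration of formula (1), the following minimal Python sketch builds a rounded log-odds matrix from a transition matrix and an equilibrium distribution. The function name `substitution_matrix` and the two-letter toy model are assumptions made for this example; they are not the 20 by 20 protein model or the SimDist code.

```python
import math

def substitution_matrix(P, pi, c=30):
    """Return S with S[a][b] = round(c * log2(P[a][b] / pi[b])), as in formula (1)."""
    n = len(pi)
    return [[round(c * math.log2(P[a][b] / pi[b])) for b in range(n)]
            for a in range(n)]

# Toy two-letter alphabet with hypothetical numbers.
# Rows of P sum to 1, and pi[0] * P[0][1] == pi[1] * P[1][0] (a reversible model).
pi = [0.6, 0.4]
P = [[0.90, 0.10],
     [0.15, 0.85]]

S = substitution_matrix(P, pi)   # scores in 1/30 bit units
```

Because the toy model is reversible, the resulting matrix is symmetric, which is the property expected of a substitution matrix used for scoring alignments.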

The transition probability matrix $P(t)$ is computed for any evolutionary time (distance) $t \ge 0$ through an instantaneous rate matrix $Q$, which is independent of $t$ (Müller and Vingron, 2000). Assume that $Q$ can be diagonalized by $D = U^{-1} Q U$, where $D$ is a diagonal matrix with diagonal elements $d_1, d_2, \ldots, d_{20}$. Then the matrix $P(t)$ is computed with the formula (Müller and Vingron, 2000):

$$P(t) = \exp(t \cdot Q) = \sum_{k=0}^{\infty} \frac{Q^k t^k}{k!} = \sum_{k=0}^{\infty} \frac{U (tD)^k U^{-1}}{k!} = U \exp(t \cdot D) U^{-1}, \tag{2}$$

where $\exp(x) = e^x$, and $\exp(t \cdot D)$ is a diagonal matrix with diagonal elements $\exp(t \cdot d_1), \exp(t \cdot d_2), \ldots, \exp(t \cdot d_{20})$.
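The matrix exponential in formula (2) can be checked numerically on a small example. The sketch below is an assumption-laden illustration: it sums the power series form of formula (2) with scaling and squaring, instead of the diagonalization used in the paper, and it uses a hypothetical two-state rate matrix for which a closed form is known. The names `mat_mul` and `transition_matrix` are introduced here only for the example.

```python
import math

def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(Q, t, terms=30, squarings=10):
    """Approximate P(t) = exp(t*Q) from the power series in formula (2)."""
    n = len(Q)
    scale = t / (2 ** squarings)
    A = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    # Sum I + A + A^2/2! + ... for the scaled-down matrix A.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    # exp(tQ) = (exp(tQ / 2^s))^(2^s): square the partial result s times.
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical two-state rate matrix; each row sums to zero.
Q = [[-0.02, 0.02],
     [0.03, -0.03]]
P = transition_matrix(Q, 50.0)

# Closed form for two states: P00(t) = pi0 + (1 - pi0) * exp(-(0.02 + 0.03) * t), pi0 = 0.6.
expected_p00 = 0.6 + 0.4 * math.exp(-0.05 * 50.0)
```

For the 20 by 20 protein model, the diagonalization route of formula (2) is cheaper per evaluation, since $U$, $U^{-1}$, and the $d_i$ are computed once and reused for every $t$.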

Here we use the equilibrium distribution $\pi$ and rate matrix $Q$ of Müller and Vingron (2000), which are derived from alignments of varying degree of divergence. The rate matrix $Q$ is normalized so that the rate of substitution at equilibrium, $\sum_a \sum_{b \ne a} \pi(a) Q_{ab}$, is 0.01, meaning that on average 1% of the amino acids are changed in one unit of time (Müller and Vingron, 2000). The rate matrix $Q$ is transformed into a diagonal form with numerical algorithms (Press et al., 1992). To obtain the substitution matrix $S(t)$ at any value $t = \hat{t}$, the transition probability matrix $P(\hat{t})$ is first computed by formula (2), and then the substitution matrix $S(\hat{t})$ is computed by formula (1).

Let nonnegative variables $q$ and $r$ be gap open and extension penalties in the same $1/c$ bit units as the substitution matrix $S(t)$, where a gap penalty $w$ given in $1/g$ bit units is converted to the gap penalty $w(c/g)$ in $1/c$ bit units. For instance, a gap open penalty of 12 in 1/3 bit units corresponds to that of 120 in 1/30 bit units. Integer substitution scores and gap penalties in 1/30 bit units have a higher resolution than those in 1/3 bit units, whereas scores and penalties in 1/3 or 1/2 bit units are commonly used. Let $\hat{q}$ and $\hat{r}$ be a pair of $q$ and $r$ values. We introduce a function of $t$, denoted by $s_{AB}(t; \hat{q}, \hat{r})$, where the value of the function at a value $t = \hat{t}$ is the similarity score of an optimal local alignment between sequences $A$ and $B$ computed with the substitution matrix $S(\hat{t})$ and gap penalties $\hat{q}$ and $\hat{r}$. In other words, the similarity score of an optimal local alignment between sequences $A$ and $B$ at gap penalties $\hat{q}$ and $\hat{r}$ is a function of evolutionary distance $t$. Naturally, selecting an appropriate substitution matrix for aligning $A$ and $B$ at gap penalties $\hat{q}$ and $\hat{r}$ involves finding the value of $t$, from an interval, maximizing $s_{AB}(t; \hat{q}, \hat{r})$.
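The unit conversion $w(c/g)$ described above is simple arithmetic; a short Python sketch makes it concrete. The helper name `convert_penalty` is hypothetical and is used here only for illustration.

```python
def convert_penalty(w, g, c=30):
    """Convert a penalty w given in 1/g bit units to 1/c bit units: w * (c / g)."""
    return w * c / g

open_penalty = convert_penalty(12, 3)   # 12 in 1/3 bit units -> 120 in 1/30 bit units
```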

Define $d_{AB}(\hat{q}, \hat{r})$ to be the value of $t$, from an interval, maximizing $s_{AB}(t; \hat{q}, \hat{r})$. For protein sequences, a commonly used interval for $t$ is from 10 to 200 PAMs (Reese and Pearson, 2002), which is used in this paper. After $d_{AB}(\hat{q}, \hat{r})$ is computed, the substitution matrix $S(d_{AB}(\hat{q}, \hat{r}))$ and gap penalties $\hat{q}$ and $\hat{r}$ are used to compute an optimal local alignment between $A$ and $B$, where $s_{AB}(d_{AB}(\hat{q}, \hat{r}); \hat{q}, \hat{r})$ is the similarity score of the alignment, and $d_{AB}(\hat{q}, \hat{r})$ is the evolutionary distance of the alignment.

The distance $d_{AB}(\hat{q}, \hat{r})$ is computed by the numerical procedure of Brent (1973), which is for maximization of functions without derivatives. The Brent procedure, given a function and an initial value in an abscissa interval, iteratively finds a value in the abscissa interval to maximize the function. The procedure combines a parabolic interpolation with the golden section algorithm to bracket the maximum of the function. The procedure has a linear rate of convergence, where the function is evaluated at the current abscissa value once in each iteration.
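To give a feel for derivative-free maximization on an interval, the sketch below implements only the golden section component, without the parabolic interpolation step of the full Brent procedure. It assumes a unimodal function and interprets the fractional cutoff as a fraction of the initial interval width; both are simplifying assumptions, and the function `toy_score` is a hypothetical stand-in for the alignment score function.

```python
import math

def golden_section_max(f, lo, hi, frac_tol=0.01):
    """Maximize a unimodal f on [lo, hi] with one new evaluation per iteration."""
    inv_phi = (math.sqrt(5) - 1) / 2           # about 0.618
    width0 = hi - lo
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > frac_tol * width0:
        if f1 < f2:                            # maximum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
        else:                                  # maximum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
    return (x1, f1) if f1 >= f2 else (x2, f2)

def toy_score(t):
    """Hypothetical smooth score curve peaked at t = 47 PAMs."""
    return 3980.0 - 0.05 * (t - 47.0) ** 2

best_t, best_score = golden_section_max(toy_score, 10.0, 200.0)
```

Each pass through the loop shrinks the bracket by a constant factor of about 0.618 and costs one function evaluation, which matches the linear convergence and one-evaluation-per-iteration behavior described above.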

To evaluate the function $s_{AB}(t; \hat{q}, \hat{r})$ at a value $t = t_k$, the transition probability matrix $P(t_k)$ is first computed by formula (2), which takes 20 scalar exponentiations and 2 matrix multiplications of dimension 20 by 20. Then the substitution matrix $S(t_k)$ is computed by formula (1), which takes on the order of $20^2$ scalar operations. Next the score of an optimal local alignment between $A$ and $B$ is computed by the Smith and Waterman algorithm with the substitution matrix $S(t_k)$ and gap penalty values $\hat{q}$ and $\hat{r}$. The alignment score is the function value $s_{AB}(t_k; \hat{q}, \hat{r})$.
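The last step of the function evaluation, scoring an optimal local alignment with affine gap penalties, can be sketched as follows. This is a minimal score-only version of the Smith-Waterman recurrence with the affine-gap refinement of Gotoh; it assumes that a gap of length $k$ costs $q + k \cdot r$, and the name `local_alignment_score` and the toy scoring function are introduced only for this example.

```python
def local_alignment_score(a, b, score, q, r):
    """Best local alignment score; a gap of length k is assumed to cost q + k * r."""
    neg = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)    # H values of the previous row
    f = [neg] * (n + 1)       # best score ending with a gap that consumes a[i-1]
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (n + 1)
        e = neg               # best score ending with a gap that consumes b[j-1]
        for j in range(1, n + 1):
            e = max(e - r, h_cur[j - 1] - q - r)       # extend or open a gap
            f[j] = max(f[j] - r, h_prev[j] - q - r)    # extend or open a gap
            diag = h_prev[j - 1] + score(a[i - 1], b[j - 1])
            h_cur[j] = max(0, diag, e, f[j])           # 0 restarts the local alignment
            best = max(best, h_cur[j])
        h_prev = h_cur
    return best

def toy_score(x, y):
    """Hypothetical scoring: +10 for a match, -10 for a mismatch."""
    return 10 if x == y else -10

gapped = local_alignment_score("AAAACCCC", "AAAAGCCCC", toy_score, q=5, r=1)
```

In the toy call, the best alignment matches all eight residues of the first sequence and skips the extra G with one gap of length 1, giving 80 - 6 = 74. In the setting of the paper, `score` would look up entries of $S(t_k)$, and the running time is proportional to the product of the sequence lengths.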

To compute the distance $d_{AB}(\hat{q}, \hat{r})$, the Brent procedure is called with an initial value $t_0$ in an interval, a fractional precision cutoff, and the definition of the function $s_{AB}(t; \hat{q}, \hat{r})$. In the first iteration of the Brent procedure, the function $s_{AB}(t; \hat{q}, \hat{r})$ is evaluated once at the initial value $t = t_0$, and a first $t$ value $t_1$ is found. In the second iteration, the function $s_{AB}(t; \hat{q}, \hat{r})$ is evaluated once at $t = t_1$, and a second $t$ value $t_2$ is found. This is repeated until the $t$ values meet a termination condition formed with the given fractional precision cutoff. A $t$ value with the maximum function score is selected among $t_0, t_1, t_2, \ldots$. This $t$ value is the evolutionary distance $d_{AB}(\hat{q}, \hat{r})$, which maximizes the function $s_{AB}(t; \hat{q}, \hat{r})$.

We illustrate the algorithm by applying it to a pair of protein sequences $A$ and $B$ (SwissProt accession nos Q29545 and O77811). Set $\hat{q} = 150$ in 1/30 bit units (15 in 1/3) and $\hat{r} = 30$ in 1/30 (3 in 1/3). The Brent procedure is called with an initial value $t_0 = 123.00$ PAMs in an interval from 10 to 200 PAMs, a fractional precision cutoff of 0.01, and the function $s_{AB}(t; 150, 30)$. In the first iteration, the function is evaluated at $t_0 = 123.00$ PAMs, which involves computing the transition probability matrix $P(123.00)$ and the substitution matrix $S(123.00)$, and computing the score of an optimal local alignment between $A$ and $B$ with the substitution matrix $S(123.00)$ and gap penalties 150 and 30. The function value at $t_0 = 123.00$ is 3384 in 1/30 bit units. A first value $t_1 = 79.84$ PAMs is generated. In the second iteration, the function is evaluated at $t_1 = 79.84$ PAMs, yielding the function value 3809 in 1/30 bit units. A second value $t_2 = 53.16$ PAMs is generated. This is repeated until the termination condition

is met. The $t$ values (distances) found by the Brent procedure along with the function values (alignment scores) at those $t$ values are shown in order of generation and in the form of (distance, alignment score): (123.00, 3384), (79.84, 3809), (53.16, 3965), (36.68, 3948), (48.20, 3975), (46.90, 3980), (42.99, 3970), (46.05, 3974), and (47.37, 3971). A distance of 46.90 PAMs with the maximum score 3980 is selected by the Brent procedure. The procedure takes 9 iterations on this example. Finally, an optimal alignment between $A$ and $B$ is produced by the Smith-Waterman algorithm with the substitution matrix $S(46.90)$ and gap penalties 150 and 30.
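The final selection step of this example can be reproduced directly from the nine (distance, alignment score) pairs listed above. The short Python sketch below only replays that selection; the variable names are chosen for illustration.

```python
# (distance in PAMs, alignment score in 1/30 bit units), in order of generation.
iterates = [
    (123.00, 3384), (79.84, 3809), (53.16, 3965), (36.68, 3948), (48.20, 3975),
    (46.90, 3980), (42.99, 3970), (46.05, 3974), (47.37, 3971),
]

# Select the iterate with the maximum alignment score.
best_distance, best_score = max(iterates, key=lambda pair: pair[1])
```

Sorting the pairs by distance shows that the listed scores are not strictly unimodal near the maximum (3971 at 47.37 PAMs lies below 3975 at 48.20 PAMs), so taking the best of all evaluated points is a natural safeguard compared with returning only the last iterate.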

The time requirement of the new algorithm for computing a local alignment with an appropriate substitution matrix is analyzed as follows. Each iteration of the Brent procedure takes one function evaluation in addition to a constant number of basic operations (Brent, 1973). The function evaluation takes time proportional to that of the alignment computation because the computation of a transition probability matrix and a substitution matrix takes only on the order of $20^3$ scalar operations, independent of the sequence lengths. Thus, the time requirement of the new algorithm is a multiple of that of the standard alignment algorithm. The multiple can be controlled by choosing a fractional precision cutoff. Another possibility is to use an upper bound on the number of iterations. A third possibility is to select a distance based on the percent identity of an initial alignment and use the substitution matrix at this distance in the final alignment computation. The correlation between the alignment distance and alignment percent identity can be obtained by running the new algorithm on pairs of sequences with various levels of conservation.
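The analysis above can be summarized in a rough cost estimate. The symbols $k$ (number of Brent iterations), $m$ and $n$ (sequence lengths), and $T_{\text{align}}$ (cost of one Smith-Waterman score computation) are introduced here as assumptions for illustration:

```latex
T_{\text{new}}(m, n) \;\approx\; k \,\bigl( T_{\text{align}}(m, n) + O(20^3) \bigr) + T_{\text{align}}(m, n),
\qquad T_{\text{align}}(m, n) = O(mn).
```

The last term accounts for the final alignment computation. With $k$ around 10, this estimate is in line with the statement that the new algorithm is about 10 times slower than the Smith-Waterman algorithm at given gap penalties.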

3. RESULTS

The new algorithm has been implemented as a computer program named SimDist. SimDist takes as input

a file of multiple sequences in FASTA format along with given gap penalties. For each possible pair of

sequences or each of a specified number of random pairs of sequences, SimDist determines an appropriate

evolutionary distance $\hat{t}$ at the gap penalties, and computes an optimal local alignment between the two

sequences with the substitution matrix $S(\hat{t})$ and the gap penalties. SimDist has an option to help select

an appropriate gap open penalty $\hat{q}$. Under this option, for each of the equally spaced gap open penalty

values in a given interval and at a given spacing, SimDist computes an optimal local alignment between

the sequences with an appropriate substitution matrix at the gap penalty value, and reports the alignment

along with its similarity score and distance. To show the effects of different penalty values on alignments,

differences in configuration between alignments computed at adjacent gap open penalty values are marked

with special symbols on the alignment obtained at the larger penalty value.

In addition to the rate matrix and equilibrium distribution from Müller and Vingron (2000), SimDist

has options to use, in construction of substitution matrices, any of the rate matrices and equilibrium

distributions included in version 5.2 of the TREE-PUZZLE package (Schmidt et al., 2002). To make

substitution matrices at specified distances available to the user, SimDist reports a directory of substitution

matrices computed from a specified rate matrix and equilibrium distribution at each of the equally-spaced

distances in a specified interval and at a specified spacing. The substitution matrices are in a specified

fraction of bit units.

3.1. Selection of appropriate gap open penalties

The SimDist program was used to study the effects of the gap open penalty $q$ on alignments and their distances. Because the gap extension penalty has much smaller effects on alignments (Reese and Pearson, 2002), we fixed it at a constant value $\hat{r}$ in this study. To study the statistical behavior of alignment distances, the function $d_{AB}(q, \hat{r})$ was sampled uniformly over the entire range of $q$ values for many pairs of sequences. An appropriate value for $q$ was selected by considering the statistical behavior of alignment distances, following the framework of Vingron and Waterman (1994) and Pearson (1995) in selection of gap penalties based on the statistical behavior of alignment scores. Our results below show that there is a phase transition in growth of alignment distance with gap open penalty. The distance $d_{AB}(q, \hat{r})$ grows rapidly over small $q$ values but shows less variation over large $q$ values.

The SimDist program was used on 2282 randomly selected pairs of homologous sequences from 100 families of protein sequences, which were first collected for evaluation of an alignment program (Huang and Brutlag, 2007). For each pair of sequences, alignments and their distances at 31 equally spaced $q$ values over an interval from 0 to 45 were computed, with all distances in PAM units and all gap penalty values in 1/3 bit units. The gap extension penalty $r$ was fixed at 3. A fractional accuracy cutoff of 0.01 was used to terminate the search for an appropriate evolutionary distance at each $q$ value. This means that the evolutionary distance is accurate within $0.01 \times 200 = 2$ PAMs. For each of the 31 $q$ values, the Brent procedure iterated 10 times on average to find an appropriate evolutionary distance at the given value.
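The sampling grid and the accuracy bound stated above follow from simple arithmetic, sketched here in Python. The variable names are chosen for illustration and do not come from SimDist.

```python
num_values, q_min, q_max = 31, 0.0, 45.0

# 31 equally spaced values over [0, 45] give a spacing of 1.5 (in 1/3 bit units).
spacing = (q_max - q_min) / (num_values - 1)
q_grid = [q_min + i * spacing for i in range(num_values)]

# Largest distance error implied by a fractional cutoff of 0.01 at the
# upper end of the search interval, t = 200 PAMs.
max_error_pams = 0.01 * 200
```

The grid includes the value 15.0, the gap open penalty used in the worked example of Section 2 (150 in 1/30 bit units).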

Because it is slow to compute alignments and their distances at various q values, the user may take the

following approach to deal with a large number of sequences. Perform this slow computation on a few

representative pairs of sequences. Select an appropriate q value based on the effects of the q values on

alignments and their distances. Then use SimDist only with the selected q value on a large number of

sequences.

We present several observations about the statistical behavior of alignment evolutionary distances sampled at the various $q$ values on the 2282 pairs of homologous sequences. The observations are illustrated, in Figure 1, with four penalty-distance plots from SimDist on four representative pairs of sequences. For each pair of sequences, the evolutionary distance increases rapidly with $q$ for $q < 10$. If the two sequences contain a strong similarity, then the evolutionary distance is small with little variation for $q > 10$. As an example, consider a pair of protein sequences (SwissProt accession nos Q29545 and O77811) with a local alignment of 62% identity over a length of 650 residues. The pairs of penalty and distance values from SimDist on this pair of sequences are shown in Plot 1 of Figure 1. For $q > 10$, the distance is around 47 and shows little variation. The alignment computed at $q = 15.0$ contains a short block of only a match

FIG. 1. Plots of alignment evolutionary distances against gap open penalties for four pairs of protein sequences. Gap open penalties are in 1/3 bit units, and evolutionary distances are in PAMs. A dot at coordinate $(g, d)$ in a plot indicates that SimDist with a gap open penalty of $g$ produced an alignment with an evolutionary distance of $d$ on the two sequences. A gap extension penalty of 3 in 1/3 bit units was used in all alignment computations.