JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 15, Number 2, 2008
© Mary Ann Liebert, Inc.
Pp. 129–138
DOI: 10.1089/cmb.2007.0155
Sequence Alignment with an Appropriate
Substitution Matrix
XIAOQIU HUANG
ABSTRACT
A widely used algorithm for computing an optimal local alignment between two sequences
requires a parameter set with a substitution matrix and gap penalties. It is recognized that a
proper parameter set should be selected to suit the level of conservation between sequences.
We describe an algorithm for selecting an appropriate substitution matrix at given gap
penalties for computing an optimal local alignment between two sequences. In the algorithm,
a substitution matrix that leads to the maximum alignment similarity score is selected among
substitution matrices at various evolutionary distances. The evolutionary distance of the
selected substitution matrix is defined as the distance of the computed alignment. To show
the effects of gap penalties on alignments and their distances and help select appropriate
gap penalties, alignments and their distances are computed at various gap penalties. The
algorithm has been implemented as a computer program named SimDist. The SimDist
program was compared with an existing local alignment program named SIM for finding
reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where
RBPs are commonly used as an operational definition of orthologous sequences. SimDist
produced more accurate results than SIM on 50 of the 100 families, whereas both programs
produced the same results on the other 50 families. SimDist was also used to compare three
types of substitution matrices in scoring 444,461 pairs of homologous sequences from the
100 families.
Key words: evolutionary distance, sequence alignment, substitution matrix.
1. INTRODUCTION
A widely used algorithm (Gotoh, 1982; Smith and Waterman, 1981) for computing an optimal
local alignment between two sequences requires a parameter set with a substitution matrix and gap
penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation
between sequences (Altschul, 1993; Brutlag et al., 1990; Vingron and Argos, 1991). Diverse approaches
are taken to study the effects of parameters on alignments and select proper parameter values (Durbin
et al., 1998; Fernández-Baca and Srinivasan, 1991; Fitch and Smith, 1983; Gonnet et al., 1992; Gotoh,
1990; Gusfield et al., 1992; Huang and Brutlag, 2007; Pearson, 1995; Reese and Pearson, 2002; Vingron
and Waterman, 1994; Vogt et al., 1995; Waterman et al., 1992; Webb et al., 2002). Computing the accurate
similarity score and evolutionary distance between sequences with a parameter set that is appropriate for
the sequences is crucial for analysis of a large family of homologous sequences.
Department of Computer Science, Iowa State University, Ames, Iowa.
We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for
computing an optimal local alignment between two sequences. For a real number $t > 0$, let $S(t)$ be a
substitution matrix at evolutionary distance $t$ in PAM (Point Accepted Mutations) units (Dayhoff et al.,
1978). The algorithm computes an estimate $\hat{t}$ at given gap penalties such that using the substitution matrix
$S(\hat{t})$ leads to the maximum alignment score. The estimate $\hat{t}$ is defined as the evolutionary distance of
the computed alignment. Note that the evolutionary distance, which is commonly used in construction of
phylogenetic trees (Felsenstein, 2004), is different from the edit distance of an alignment. At given gap
penalties, the new algorithm is about 10 times slower than the Smith-Waterman algorithm.
To show the effects of gap penalties on alignments and their distances and help select appropriate
gap penalties, alignments and their distances are computed at various gap penalties. An increase in gap
penalties has three types of possible effects on alignments: alignment breakage, no change in alignment
configuration, and a decrease in the number of gaps accompanied by an increase in the number of mismatches. The
third type of effect results in an alignment of a larger distance. Changes in alignment configuration and
distance, induced by using various gap penalties, are helpful in selecting appropriate gap penalties.
The algorithm has been implemented as a computer program named SimDist (Similarity and Distance).
The SimDist program was used to compute alignments and their distances at various gap penalties on
2282 randomly selected pairs of homologous protein sequences from 100 families. Alignments and their
distances from the program at various gap penalties were examined to select appropriate gap penalties.
The SimDist program was compared with an existing local alignment program named SIM (Huang and
Miller, 1991) for finding reciprocally best-matching pairs (RBPs) of sequences in each of the 100 protein
families, where RBPs are commonly used as an operational definition of orthologous sequences (Tatusov
et al., 1997). SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both
programs produced the same results on the other 50 families. SimDist was also used to compare three
types of substitution matrices (Henikoff and Henikoff, 1992; Jones et al., 1992; Müller and Vingron, 2000)
in scoring 444,461 pairs of homologous sequences from the 100 protein families.
2. METHODS
We describe the new algorithm in the context of protein sequences; it can be modified to work
for DNA sequences. We start with the Markov model of Dayhoff et al. (1978) for protein evolution. Let
$P(t)$ denote the 20 by 20 transition probability matrix over time period $t$ for the Markov model, where
for residues $a$ and $b$, $P_{ab}(t)$ is the transition probability from residue $a$ to residue $b$ over time period $t$.
We denote by $\pi$ the equilibrium probability distribution of the residues, where $\pi(a)$ is the equilibrium
probability of residue $a$. Then $\pi(a) P_{ab}(t)$ is the joint probability of seeing $a$ aligned with $b$. We denote
by $S(t)$ the substitution matrix at time $t$ that is obtained from $P(t)$ by the formula (Dayhoff et al., 1978):
for residues $a$ and $b$, and a scaling factor $c > 0$,
$$S_{ab}(t) = c \cdot \log_2 \frac{\pi(a) P_{ab}(t)}{\pi(a)\pi(b)} = c \cdot \log_2 \frac{P_{ab}(t)}{\pi(b)}. \tag{1}$$
The substitution matrix $S(t)$ is said to be in $1/c$ bit units. Like the widely used BLOSUM matrix of
Henikoff and Henikoff (1992), a 23 by 23 substitution matrix is constructed from the 20 by 20 substitution
matrix to accommodate three additional symbols B, Z, and X. For efficiency, the scaling factor $c$ is set to
30 and all values in the substitution matrix are rounded to integers.
The transition probability matrix $P(t)$ is computed for any evolutionary time (distance) $t \ge 0$ through
an instantaneous rate matrix $Q$, which is independent of $t$ (Müller and Vingron, 2000). Assume that $Q$ can
be diagonalized by $D = U^{-1} Q U$, where $D$ is a diagonal matrix with diagonal elements $d_1, d_2, \ldots, d_{20}$.
Then the matrix $P(t)$ is computed with the formula (Müller and Vingron, 2000):
$$P(t) = \exp(t \cdot Q) = \sum_{k=0}^{\infty} \frac{Q^k t^k}{k!} = \sum_{k=0}^{\infty} \frac{U (tD)^k U^{-1}}{k!} = U \exp(t \cdot D) U^{-1}, \tag{2}$$
where $\exp(x) = e^x$, and $\exp(t \cdot D)$ is a diagonal matrix with diagonal elements $\exp(t \cdot d_1)$,
$\exp(t \cdot d_2), \ldots, \exp(t \cdot d_{20})$.
Here we use the equilibrium distribution $\pi$ and rate matrix $Q$ of Müller and Vingron (2000), which
are derived from alignments of varying degree of divergence. The rate matrix $Q$ is normalized so that the
rate of substitution at equilibrium $\sum_a \sum_{b \ne a} \pi(a) Q_{ab}$ is 0.01, meaning that on average 1% of the amino
acids are changed in one unit of time (Müller and Vingron, 2000). The rate matrix $Q$ is transformed into
a diagonal form with numerical algorithms (Press et al., 1992). To obtain the substitution matrix $S(t)$ at
any value $t = \hat{t}$, the transition probability matrix $P(\hat{t})$ is first computed by formula (2), and then the
substitution matrix $S(\hat{t})$ is computed by formula (1).
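As an illustration of formulas (1) and (2), the following Python sketch builds a rate matrix, computes a transition probability matrix, and derives an integer substitution matrix. It is not the SimDist implementation. It assumes a made-up reversible model on a toy 4-letter alphabet (the values in `pi` and `exchange` are invented), and it sums the power series in formula (2) with scaling and squaring instead of diagonalizing the rate matrix. All function names are hypothetical.

```python
import math

def mat_mul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def reversible_rate_matrix(exchange, pi, rate=0.01):
    """Build Q with Q_ab = exchange_ab * pi(b) for a != b, then rescale it so
    that sum_a sum_{b != a} pi(a) * Q_ab equals `rate`."""
    n = len(pi)
    q = [[exchange[a][b] * pi[b] if a != b else 0.0 for b in range(n)]
         for a in range(n)]
    total = sum(pi[a] * q[a][b] for a in range(n) for b in range(n))
    q = [[v * rate / total for v in row] for row in q]
    for a in range(n):
        q[a][a] = -sum(q[a])          # each row of Q sums to zero
    return q

def transition_matrix(q, t, terms=20, squarings=10):
    """P(t) = exp(t * Q) from the power series in formula (2).

    The series is summed for t * Q / 2**squarings, and the result is then
    squared `squarings` times (scaling and squaring).
    """
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[v * scale for v in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def substitution_matrix(p, pi, c=30):
    """Formula (1): S_ab = c * log2(P_ab / pi(b)), rounded to integers."""
    n = len(p)
    return [[round(c * math.log2(p[a][b] / pi[b])) for b in range(n)]
            for a in range(n)]

# Toy 4-letter model; these numbers are made up for illustration only.
pi = [0.1, 0.2, 0.3, 0.4]
exchange = [[0, 1, 2, 1],
            [1, 0, 1, 3],
            [2, 1, 0, 1],
            [1, 3, 1, 0]]
q = reversible_rate_matrix(exchange, pi)
p50 = transition_matrix(q, 50.0)      # transition probabilities at t = 50
s50 = substitution_matrix(p50, pi)    # scores in 1/30 bit units
```

With a 20 by 20 rate matrix and equilibrium distribution, the same functions would apply unchanged. Diagonalizing the rate matrix once, as the text describes, avoids repeating the series summation for every value of t.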
Let nonnegative variables $q$ and $r$ be gap open and extension penalties in the same $1/c$ bit units as the
substitution matrix $S(t)$, where a gap penalty $w$ given in $1/g$ bit units is converted to the gap penalty
$w(c/g)$ in $1/c$ bit units. For instance, a gap open penalty of 12 in 1/3 bit units corresponds to that of 120
in 1/30 bit units. Integer substitution scores and gap penalties in 1/30 bit units have a higher resolution
than those in 1/3 bit units, whereas scores and penalties in 1/3 or 1/2 bit units are commonly used. Let $\hat{q}$
and $\hat{r}$ be a pair of $q$ and $r$ values. We introduce a function of $t$, denoted by $s_{AB}(t, \hat{q}, \hat{r})$, where the value
of the function at a value $t = \hat{t}$ is the similarity score of an optimal local alignment between sequences $A$
and $B$ computed with the substitution matrix $S(\hat{t})$ and gap penalties $\hat{q}$ and $\hat{r}$. In other words, the similarity
score of an optimal local alignment between sequences $A$ and $B$ at gap penalties $\hat{q}$ and $\hat{r}$ is a function
of evolutionary distance $t$. Naturally, selecting an appropriate substitution matrix for aligning $A$ and $B$ at
gap penalties $\hat{q}$ and $\hat{r}$ involves finding the value of $t$, from an interval, maximizing $s_{AB}(t, \hat{q}, \hat{r})$.
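The conversion of a gap penalty between bit-unit scales can be checked with a few lines of Python. The function name `convert_gap_penalty` is hypothetical, and exact rational arithmetic is used here only to avoid rounding questions.

```python
from fractions import Fraction

def convert_gap_penalty(w, g, c=30):
    """Convert a penalty w given in 1/g bit units into 1/c bit units: w * (c / g)."""
    return w * Fraction(c, g)

# The example from the text: 12 in 1/3 bit units is 120 in 1/30 bit units.
print(convert_gap_penalty(12, 3))
```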
Define $d_{AB}(\hat{q}, \hat{r})$ to be the value of $t$, from an interval, maximizing $s_{AB}(t, \hat{q}, \hat{r})$. For protein sequences,
a commonly used interval for $t$ is from 10 to 200 PAMs (Reese and Pearson, 2002), which is used in this
paper. After $d_{AB}(\hat{q}, \hat{r})$ is computed, the substitution matrix $S(d_{AB}(\hat{q}, \hat{r}))$ and gap penalties $\hat{q}$ and $\hat{r}$ are
used to compute an optimal local alignment between $A$ and $B$, where $s_{AB}(d_{AB}(\hat{q}, \hat{r}), \hat{q}, \hat{r})$ is the similarity
score of the alignment, and $d_{AB}(\hat{q}, \hat{r})$ is the evolutionary distance of the alignment.
The distance $d_{AB}(\hat{q}, \hat{r})$ is computed by the numerical procedure of Brent (1973), which is for
maximization of functions without derivatives. The Brent procedure, given a function and an initial value
in an abscissa interval, iteratively finds a value in the abscissa interval to maximize the function. The
procedure combines a parabolic interpolation with the golden section algorithm to bracket the maximum
of the function. The procedure has a linear rate of convergence, where the function is evaluated at the
current abscissa value once in each iteration.
To evaluate the function $s_{AB}(t, \hat{q}, \hat{r})$ at a value $t = t_k$, the transition probability matrix $P(t_k)$ is first
computed by formula (2), which takes 20 scalar exponentiations and 2 matrix multiplications of dimension
20 by 20. Then the substitution matrix $S(t_k)$ is computed by formula (1), which takes an order of $20^2$
scalar operations. Next the score of an optimal local alignment between $A$ and $B$ is computed by the Smith
and Waterman algorithm with the substitution matrix $S(t_k)$ and gap penalty values $\hat{q}$ and $\hat{r}$. The alignment
score is the function value $s_{AB}(t_k, \hat{q}, \hat{r})$.
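The alignment-score step of a function evaluation can be sketched as a score-only local alignment with affine gap penalties, in the style of Gotoh (1982). The sketch assumes that a gap of length k costs q + k * r, a convention the text does not spell out. The function name and the toy match/mismatch scorer are hypothetical stand-ins for a substitution matrix $S(t_k)$.

```python
def local_alignment_score(a, b, score, q, r):
    """Best local alignment score of a and b.

    `score(x, y)` returns the substitution score of residues x and y.
    A gap of length k is assumed to cost q + k * r.
    """
    neg = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)   # best scores ending at the previous row
    f = [neg] * (n + 1)      # per column: best score ending with a residue of a against a gap
    best = 0
    for i in range(1, len(a) + 1):
        h_curr = [0] * (n + 1)
        e = neg              # best score ending with a residue of b against a gap
        for j in range(1, n + 1):
            e = max(e - r, h_curr[j - 1] - q - r)
            f[j] = max(f[j] - r, h_prev[j] - q - r)
            diag = h_prev[j - 1] + score(a[i - 1], b[j - 1])
            h_curr[j] = max(0, diag, e, f[j])
            best = max(best, h_curr[j])
        h_prev = h_curr
    return best

def toy_score(x, y):
    """Hypothetical scorer: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1

print(local_alignment_score("ACGTACGT", "ACGACGT", toy_score, q=2, r=1))
```

In a full evaluation of $s_{AB}(t_k, \hat{q}, \hat{r})$, `score` would look up entries of the integer matrix $S(t_k)$, with q and r in the same 1/30 bit units.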
To compute the distance $d_{AB}(\hat{q}, \hat{r})$, the Brent procedure is called with an initial value $t_0$ in an interval, a
fractional precision cutoff, and the definition of the function $s_{AB}(t, \hat{q}, \hat{r})$. In the first iteration of the Brent
procedure, the function $s_{AB}(t, \hat{q}, \hat{r})$ is evaluated once at the initial value $t = t_0$, and a first $t$ value $t_1$ is
found. In the second iteration, the function $s_{AB}(t, \hat{q}, \hat{r})$ is evaluated once at $t = t_1$, and a second $t$ value $t_2$
is found. This is repeated until the $t$ values meet a termination condition formed with the given fractional
precision cutoff. A $t$ value with the maximum function score is selected among $t_0, t_1, t_2, \ldots$. This $t$ value
is the evolutionary distance $d_{AB}(\hat{q}, \hat{r})$, which maximizes the function $s_{AB}(t, \hat{q}, \hat{r})$.
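The search loop can be sketched with a plain golden-section search, which is one of the two ingredients of the Brent procedure. This sketch omits the parabolic interpolation step, so it is a simplification and not the procedure of Brent (1973). The stopping rule (bracket narrower than `tol` times the initial interval width), the function names, and the quadratic stand-in for $s_{AB}(t, \hat{q}, \hat{r})$ are all assumptions made for illustration.

```python
import math

def maximize_golden(f, lo, hi, tol=0.01):
    """Golden-section search for a maximum of f on [lo, hi].

    Returns (best_t, best_value, history), where history lists every
    (t, f(t)) pair evaluated, in order of generation.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    width = hi - lo
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    history = [(x1, f1), (x2, f2)]
    while hi - lo > tol * width:
        if f1 < f2:                       # maximum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)                    # one new evaluation per iteration
            history.append((x2, f2))
        else:                             # maximum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
            history.append((x1, f1))
    # As in the text, the t value with the maximum score among all
    # evaluated values is selected.
    best_t, best_value = max(history, key=lambda pair: pair[1])
    return best_t, best_value, history

def toy_function(t):
    """Hypothetical unimodal stand-in for the alignment score as a function of t."""
    return 3980.0 - (t - 46.9) ** 2

best_t, best_value, history = maximize_golden(toy_function, 10.0, 200.0)
```

Each loop iteration costs exactly one function evaluation, which in the real setting is one alignment computation.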
We illustrate the algorithm by applying it to a pair of protein sequences $A$ and $B$ (SwissProt accession
nos Q29545 and O77811). Set $\hat{q} = 150$ in 1/30 bit units (15 in 1/3) and $\hat{r} = 30$ in 1/30 (3 in 1/3).
The Brent procedure is called with an initial value $t_0 = 123.00$ PAMs in an interval from 10 to 200
PAMs, a fractional precision cutoff of 0.01, and the function $s_{AB}(t, 150, 30)$. In the first iteration, the
function is evaluated at $t_0 = 123.00$ PAMs, which involves computing the transition probability matrix
$P(123.00)$ and the substitution matrix $S(123.00)$, and computing the score of an optimal local alignment
between $A$ and $B$ with the substitution matrix $S(123.00)$ and gap penalties 150 and 30. The function
value at $t_0 = 123.00$ is 3384 in 1/30 bit units. A first value $t_1 = 79.84$ PAMs is generated. In the second
iteration, the function is evaluated at $t_1 = 79.84$ PAMs, yielding the function value 3809 in 1/30 bit
units. A second value $t_2 = 53.16$ PAMs is generated. This is repeated until the termination condition
is met. The $t$ values (distances) found by the Brent procedure along with the function values (alignment
scores) at those $t$ values are shown in order of generation and in the form of (distance, alignment score):
(123.00, 3384), (79.84, 3809), (53.16, 3965), (36.68, 3948), (48.20, 3975), (46.90, 3980), (42.99, 3970),
(46.05, 3974), and (47.37, 3971). A distance of 46.90 PAMs with the maximum score 3980 is selected
by the Brent procedure. The procedure takes 9 iterations on this example. Finally, an optimal alignment
between $A$ and $B$ is produced by the Smith-Waterman algorithm with the substitution matrix $S(46.90)$
and gap penalties 150 and 30.
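The final selection step on this example amounts to taking the pair with the largest score from the reported trace. The short Python check below uses the (distance, score) pairs listed above; the variable names are illustrative.

```python
# (distance in PAMs, alignment score in 1/30 bit units), in order of generation
trace = [(123.00, 3384), (79.84, 3809), (53.16, 3965), (36.68, 3948),
         (48.20, 3975), (46.90, 3980), (42.99, 3970), (46.05, 3974),
         (47.37, 3971)]

best_distance, best_score = max(trace, key=lambda pair: pair[1])
print(len(trace), best_distance, best_score)  # prints: 9 46.9 3980
```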
The time requirement of the new algorithm for computing a local alignment with an appropriate
substitution matrix is analyzed as follows. Each iteration of the Brent procedure takes one function
evaluation in addition to a constant number of basic operations (Brent, 1973). The function evaluation takes
time proportional to that of the alignment computation because the computation of a transition probability
matrix and a substitution matrix takes only an order of $20^3$ scalar operations, which does not depend on the
sequence lengths. Thus, the time requirement of the
new algorithm is a multiple of that of the standard alignment algorithm. The multiple can be controlled
by choosing a fractional precision cutoff. Another possibility is to use an upper bound on the number of
iterations. A third possibility is to select a distance based on the percent identity of an initial alignment
and use the substitution matrix at this distance in the final alignment computation. The correlation between
the alignment distance and alignment percent identity can be obtained by running the new algorithm on
pairs of sequences with various levels of conservation.
3. RESULTS
The new algorithm has been implemented as a computer program named SimDist. SimDist takes as input
a file of multiple sequences in FASTA format along with given gap penalties. For each possible pair of
sequences or each of a specified number of random pairs of sequences, SimDist determines an appropriate
evolutionary distance $\hat{t}$ at the gap penalties, and computes an optimal local alignment between the two
sequences with the substitution matrix $S(\hat{t})$ and the gap penalties. SimDist has an option to help select
an appropriate gap open penalty $\hat{q}$. Under this option, for each of the equally spaced gap open penalty
values in a given interval and at a given spacing, SimDist computes an optimal local alignment between
the sequences with an appropriate substitution matrix at the gap penalty value, and reports the alignment
along with its similarity score and distance. To show the effects of different penalty values on alignments,
differences in configuration between alignments computed at adjacent gap open penalty values are marked
with special symbols on the alignment obtained at the larger penalty value.
In addition to the rate matrix and equilibrium distribution from Müller and Vingron (2000), SimDist
has options to use, in construction of substitution matrices, any of the rate matrices and equilibrium
distributions included in version 5.2 of the TREE-PUZZLE package (Schmidt et al., 2002). To make
substitution matrices at specified distances available to the user, SimDist reports a directory of substitution
matrices computed from a specified rate matrix and equilibrium distribution at each of the equally-spaced
distances in a specified interval and at a specified spacing. The substitution matrices are in a specified
fraction of bit units.
3.1. Selection of appropriate gap open penalties
The SimDist program was used to study the effects of the gap open penalty $q$ on alignments and
their distances. Because the gap extension penalty has much smaller effects on alignments (Reese and
Pearson, 2002), we fixed it at a constant value $\hat{r}$ in this study. To study the statistical behavior of alignment
distances, the function $d_{AB}(q, \hat{r})$ was sampled uniformly over the entire range of $q$ values for many pairs
of sequences. An appropriate value for $q$ was selected by considering the statistical behavior of alignment
distances, following the framework of Vingron and Waterman (1994) and Pearson (1995) in selection of
gap penalties based on the statistical behavior of alignment scores. Our results below show that there is
a phase transition in growth of alignment distance with gap open penalty. The distance $d_{AB}(q, \hat{r})$ grows
rapidly over small $q$ values but shows less variation over large $q$ values.
The SimDist program was used on 2282 randomly selected pairs of homologous sequences from 100
families of protein sequences, which were first collected for evaluation of an alignment program (Huang
and Brutlag, 2007). For each pair of sequences, alignments and their distances at 31 equally spaced $q$
values over an interval from 0 to 45 were computed, with all distances in PAM units and all gap penalty
values in 1/3 bit units. The gap extension penalty $r$ was fixed at 3. A fractional accuracy cutoff of 0.01
was used to terminate the search for an appropriate evolutionary distance at each $q$ value. This means that
the evolutionary distance is accurate within $0.01 \times 200 = 2$ PAMs. For each of the 31 $q$ values, the Brent
procedure iterated 10 times on average to find an appropriate evolutionary distance at the given value.
Because it is slow to compute alignments and their distances at various $q$ values, the user may take the
following approach to deal with a large number of sequences. Perform this slow computation on a few
representative pairs of sequences. Select an appropriate $q$ value based on the effects of the $q$ values on
alignments and their distances. Then use SimDist only with the selected $q$ value on a large number of
sequences.
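The sampling grid and the stated accuracy can be reproduced with a short Python sketch. The function name `gap_open_grid` is hypothetical; the numbers (31 values from 0 to 45, a cutoff of 0.01, and an upper distance bound of 200 PAMs) come from the text.

```python
def gap_open_grid(q_min=0.0, q_max=45.0, count=31):
    """Equally spaced gap open penalties (in 1/3 bit units) from q_min to q_max."""
    step = (q_max - q_min) / (count - 1)
    return [q_min + i * step for i in range(count)]

grid = gap_open_grid()          # spacing of 1.5 between adjacent penalties
accuracy_pams = 0.01 * 200      # fractional cutoff times the upper bound on t
```

The spacing of 1.5 is consistent with the penalty values such as 10.5, 16.5, and 25.5 that appear in the discussion of individual sequence pairs.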
We present several observations about the statistical behavior of alignment evolutionary distances
sampled at the various $q$ values on the 2282 pairs of homologous sequences. The observations are illustrated,
in Figure 1, with four penalty-distance plots from SimDist on four representative pairs of sequences. For
each pair of sequences, the evolutionary distance increases rapidly with $q$ for $q < 10$. If the two sequences
contain a strong similarity, then the evolutionary distance is small with little variation for $q > 10$. As an
example, consider a pair of protein sequences (SwissProt accession nos Q29545 and O77811) with a local
alignment of 62% identity over a length of 650 residues. The pairs of penalty and distance values from
SimDist on this pair of sequences are shown in Plot 1 of Figure 1. For $q > 10$, the distance is around 47
and shows little variation. The alignment computed at $q = 15.0$ contains a short block of only a match
(L, L) surrounded by a gap on each side. The block disappears from the alignment at $q = 16.5$. There is no
change in alignment configuration for $q \ge 16.5$. Thus, a gap open penalty of 16.5 or higher is appropriate
for this example.
FIG. 1. Plots of alignment evolutionary distances against gap open penalties for four pairs of protein sequences.
Gap open penalties are in 1/3 bit units, and evolutionary distances are in PAMs. A dot at coordinate $(g, d)$ in a plot
indicates that SimDist with a gap open penalty of $g$ produced an alignment with an evolutionary distance of $d$ on the
two sequences. A gap extension penalty of 3 in 1/3 bit units was used in all alignment computations.
If the two sequences contain a medium similarity, then the evolutionary distance may show significant
variation for $q > 10$. Consider a pair of protein sequences (Q29545 and P08582) on which SimDist
produced a local alignment of 40% identity and with 38 gaps over a length of about 650 residues at
$q = 10.5$ and another alignment of 36% identity and with 16 gaps at $q = 45$. The pairs of penalty and
distance values from SimDist on this pair of sequences are shown in Plot 2 of Figure 1. As the gap open
penalty increases, the alignment often has fewer gaps and more mismatches, resulting in an increase in
evolutionary distance. There are two significant increases in alignment distance as $q$ increases between
10.5 and 24, with many short substitution blocks with one or two consecutive matches disappearing. For
$q \ge 25.5$, there is only one significant increase in alignment distance. As $q$ changes from 36 to 37.5, two
blocks each with five consecutive matches disappear from the alignment, resulting in an increase of 11 in
alignment distance. Thus, $q = 25.5$ is appropriate for this example.
As another example of medium similarity, the pairs of penalty and distance values from SimDist on a
pair of sequences (P31226 and Q921I1) are shown in Plot 3 of Figure 1. In this case, as $q$ changes from
19.5 to 21, a block of only two matches disappears from the alignment. For $q \ge 21$, there is little change
in alignment distance with $q$. Thus, $q = 21$ is appropriate for this example.
If the two sequences contain a weak similarity, then the evolutionary distance reaches a much higher
value before $q = 10$. For example, consider a pair of protein sequences (O34580 and O84645) with a
local alignment of 32% identity over a length of 113 residues. The pairs of penalty and distance values
from SimDist on this pair of sequences are shown in Plot 4 of Figure 1. The evolutionary distance reaches
a peak of 171 at $q = 7.5$, falls back to 127, and stays there for $q \ge 9$. There is no change in alignment
configuration for $q \ge 9$. Thus, a $q$ value of 9 or higher is appropriate for this example.
3.2. Evaluation of SimDist and SIM
The SimDist program and an existing local alignment program named SIM (Huang and Miller, 1991)
were evaluated for finding RBPs of sequences in each of the 100 protein families. RBPs are commonly
used as an operational definition of orthologous sequences (Tatusov et al., 1997). Let $\mathrm{score}(A, B)$ denote
the similarity score of sequences $A$ and $B$. Two sequences $A$ and $B$ in a family form an RBP in the family
if $\mathrm{score}(A, B) \ge \mathrm{score}(A, C)$ and $\mathrm{score}(A, B) \ge \mathrm{score}(B, C)$ for every other member $C$ in the family. In
the SimDist program, the similarity score $\mathrm{score}(A, B)$ is computed as $\max\{ s_{AB}(t, \hat{q}, \hat{r}) : 10 \le t \le 200 \}$,
the maximum alignment score over substitution matrices at distances between 10 and 200. On the other
hand, in the SIM program, the similarity score $\mathrm{score}(A, B)$ is computed as $s_{AB}(\hat{t}, \hat{q}, \hat{r})$, an alignment score
obtained by using a substitution matrix at a constant distance $\hat{t}$.
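The RBP definition translates directly into a pairwise check over a table of similarity scores. The Python sketch below is an illustration and not the code used in the evaluation. The function name is hypothetical, and the example applies the definition to only three sequences, using the SimDist scores reported for the human, mouse, and chicken SAPS domain family member 3 sequences, whereas a real run would cover every member of the family.

```python
from itertools import combinations

def reciprocal_best_pairs(names, score):
    """Return all pairs (a, b) with score(a, b) >= score(a, c) and
    score(a, b) >= score(b, c) for every other member c.

    `score` maps frozenset({x, y}) to the similarity score of x and y.
    """
    pairs = []
    for a, b in combinations(names, 2):
        s_ab = score[frozenset((a, b))]
        others = [c for c in names if c not in (a, b)]
        if all(s_ab >= score[frozenset((a, c))] and
               s_ab >= score[frozenset((b, c))] for c in others):
            pairs.append((a, b))
    return pairs

names = ["human", "mouse", "chicken"]
simdist_scores = {
    frozenset(("human", "mouse")): 9428,
    frozenset(("human", "chicken")): 9048,
    frozenset(("mouse", "chicken")): 8469,
}
rbps = reciprocal_best_pairs(names, simdist_scores)
print(rbps)  # prints: [('human', 'mouse')]
```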
Obviously, RBPs from SimDist are based on a more accurate similarity measure than RBPs from SIM.
However, it is unclear how often the use of the more accurate similarity measure results in a difference
in practice. This question is addressed by evaluating SimDist and SIM on the 100 protein families with a
total of 7092 sequences.
The rate matrix and equilibrium distribution from Müller and Vingron (2000) were used in this evaluation.
In addition, the following parameter values were used: $\hat{q} = 18$ and $\hat{r} = 3$ (both in 1/3 bit units), and
$\hat{t} = 160$ PAMs. The substitution matrix at the distance of 160 was selected for use with SIM since this
matrix is close to the widely used BLOSUM62 matrix (Müller and Vingron, 2000).
For each of the 100 protein families, SimDist and SIM were each run on the family, and the number of
unique RBPs from each program and the number of common RBPs were calculated. A common RBP was
reported by both programs, whereas a unique RBP was reported by only one of the programs. On 50 of
the 100 protein families, at least one of the two programs produced a unique RBP, indicating that the two
programs produced different results on each of the 50 protein families. Specifically, on those 50 protein
families, SimDist produced a total of 79 unique RBPs, SIM produced a total of 57 unique RBPs, and the
two programs produced a total of 1254 common RBPs. On the 100 families, the two programs produced
a total of 1945 common RBPs, with the score of an RBP from SimDist, on average, 70% larger than
the score of the same RBP from SIM. SimDist was about 10 times slower than SIM on the 100 protein
families.
We present results on one of the 50 families in detail. On a protein family named SAPS (Secreted Aspartyl
Proteinases), SimDist reported 1 unique RBP and SIM reported 1 unique RBP, with 6 common RBPs
from both programs. The unique RBP from SimDist consists of human SAPS domain family member 3
(Q5H9R7) and mouse SAPS domain family member 3 (Q922D4). The unique RBP from SIM consists of
the same human sequence and chicken SAPS domain family member 3 (Q5F471). The similarity scores
from SimDist are 9428 on the human-mouse pair, 9048 on the human-chicken pair, and 8469 on the
mouse-chicken pair, whereas the scores from SIM are 5121 on the human-mouse pair, 5121 on the human-
chicken pair, and 4789 on the mouse-chicken pair. All scores are in 1/3 bit units. The similarity scores
from SimDist on the three pairs of sequences indicate that the human and mouse sequences are more
similar than the human and chicken sequences, and the mouse and chicken sequences. On the other hand,
the similarity scores from SIM indicate that the human and mouse sequences are at the same level of
similarity as the human and chicken sequences. Thus, the unique RBP from SimDist is correct, whereas
that from SIM is wrong.
3.3. Comparison of substitution matrices
The SimDist program allows us to perform a “sequence-oriented” study on comparison of substitution
matrices, where for each pair of sequences, the sequences are scored by using each type of substitution
matrices at a proper distance. Previous studies on comparison of substitution matrices, on the other hand,
are “matrix-oriented,” where substitution matrices of different types at constant distances are used to score
various pairs of sequences (Henikoff and Henikoff, 1992). As shown in the last experiment on finding
RBPs, there is a large difference between an alignment score from a substitution matrix at a constant
distance and an alignment score from a substitution matrix at a proper distance. Below we present results
from a sequence-oriented study on comparison of substitution matrices.
The SimDist program was used to compare three types of substitution matrices, BLOSUM62 (Henikoff
and Henikoff, 1992), JTT (Jones et al., 1992), and VT (Müller and Vingron, 2000), for aligning 444,461
pairs of homologous sequences from the 100 protein families. The JTT and VT types of matrices are
based on evolutionary models and thus useful for sequence alignment and tree construction, whereas
BLOSUM62 is designed for sequence alignment only. To construct the three types of substitution matrices
at any distance, we used the rate matrices and equilibrium distributions in version 5.2 of the TREE-PUZZLE
package (Schmidt et al., 2002). By the BLOSUM62 type of substitution matrices, we mean the group of
substitution matrices at various distances that can be constructed from the BLOSUM62 rate matrix and
equilibrium distribution in the TREE-PUZZLE package.
For every pair of homologous sequences from each of the 100 protein families, three similarity scores
on the sequence pair, one score from each type of substitution matrices, were computed by SimDist.
Recall that the similarity score from SimDist is the maximum alignment score over substitution matrices
at distances between 10 and 200, where a substitution matrix at a distance is computed from a specified
pair of rate matrix and equilibrium distribution. All substitution matrices generated during the computation
were in the same scale of 1/30 bit units. The gap open and extension penalties were fixed at 180 and 30
in 1/30 bit units.
The behaviors of the three types of matrices on pairs of unrelated sequences were taken into account as
follows. For each pair of homologous sequences, a pair of unrelated sequences, each with the same length
and composition, was produced by shuffling each of the homologous sequences many times. Then three
similarity scores on the pair of unrelated sequences, one score from each type of substitution matrices,
were computed by SimDist. Next the score on the pair of unrelated sequences was subtracted from the
score on the pair of homologous sequences for each type of substitution matrices. Thus, three adjusted
scores, one from each type of substitution matrices, were obtained for each pair of homologous sequences.
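The adjustment can be sketched in Python as follows. This is a minimal illustration with hypothetical names. It assumes that one uniform random permutation per sequence is an adequate stand-in for shuffling "many times", and it uses a toy scorer (`identity_count`) in place of the SimDist similarity score.

```python
import random

def shuffled(seq, rng):
    """Return a random permutation of seq (same length and composition)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def adjusted_score(a, b, score_fn, rng):
    """Score of the homologous pair minus the score of a shuffled pair.

    Also returns the two shuffled sequences so the result can be inspected.
    """
    a_shuf, b_shuf = shuffled(a, rng), shuffled(b, rng)
    return score_fn(a, b) - score_fn(a_shuf, b_shuf), a_shuf, b_shuf

def identity_count(x, y):
    """Hypothetical stand-in scorer: number of identical aligned positions."""
    return sum(1 for u, v in zip(x, y) if u == v)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
adj, a_shuf, b_shuf = adjusted_score("MKTAYIAKQR", "MKTAYLAKQR",
                                     identity_count, rng)
```

In the study, this subtraction is carried out separately for each of the three matrix types, giving three adjusted scores per homologous pair.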
The effectiveness of each type of substitution matrices in scoring alignments of homologous protein
sequences is assessed by counting the number of pairs of homologous sequences on which the adjusted score from
the matrix type is sufficiently larger than the adjusted scores from the other two matrix types. However,
it is unclear what amounts to being sufficiently larger. A simple way of dealing with this issue is to use
several percentage levels. Let $S_B$, $S_J$, and $S_V$ be three adjusted scores on a pair of sequences from the
three types of substitution matrices, respectively. For a percentage level $L$, we say $S_B$ is $L$ percent larger
than $S_J$ and $S_V$ if $S_B > S_J \cdot (1 + L/100)$ and $S_B > S_V \cdot (1 + L/100)$. Six percentage levels, 0.1%,
0.5%, 1.0%, 3.0%, 5.0%, and 10.0%, were used to assess the effectiveness of the three matrix types in
scoring the 444,461 pairs of homologous sequences from the 100 protein families.
TABLE 1. FOR EACH (T) OF THE THREE TYPES OF SUBSTITUTION
MATRICES AND EACH (L) OF THE SIX PERCENTAGE LEVELS, THE NUMBER
OF PAIRS OF HOMOLOGOUS SEQUENCES ON WHICH THE ADJUSTED SCORE
FROM MATRIX TYPE T IS L PERCENT LARGER THAN THE ADJUSTED
SCORES FROM THE OTHER TWO TYPES OF MATRICES
Level (%)   BLOSUM62   JTT      VT        Sum
0.1         245,898    50,092   131,059   427,049
0.5         217,031    43,381   102,833   363,245
1.0         183,591    37,632    75,063   296,286
3.0          87,288    24,705    25,538   137,531
5.0          50,415    16,722    13,880    81,017
10.0         22,704      9340      6739    38,783
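The "L percent larger" test can be written as a small Python predicate. The function names and the adjusted scores in the example are hypothetical; only the inequality itself comes from the definition above.

```python
def is_l_percent_larger(s, others, level):
    """True if s > o * (1 + level / 100) for every score o in others."""
    factor = 1 + level / 100
    return all(s > o * factor for o in others)

def winning_type(adjusted, level):
    """Return the matrix type whose adjusted score is `level` percent larger
    than both other adjusted scores, or None if no type qualifies."""
    for name, s in adjusted.items():
        others = [v for k, v in adjusted.items() if k != name]
        if is_l_percent_larger(s, others, level):
            return name
    return None

# Hypothetical adjusted scores for one pair of homologous sequences.
adjusted = {"BLOSUM62": 1040, "JTT": 1000, "VT": 1010}
print(winning_type(adjusted, 1.0), winning_type(adjusted, 3.0))
```

At most one type can qualify for a given pair and level, so counting the winners over all pairs yields one row of Table 1, and the Sum column is the number of pairs for which some type qualifies.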
The results for the three types of substitution matrices at the six percentage levels are shown in Table 1.
Two observations are obvious from those results. First, at each of the six percentage levels, the number of
sequence pairs on which BLOSUM62 produced sufficiently larger scores than JTT and VT is more than
the numbers of sequence pairs from JTT and VT combined. Second, the number of sequence pairs from
VT is larger than that from JTT at each of the first four percentage levels, whereas the opposite is true at
each of the other two percentage levels.
The entire computation took 6 days on a processor. It took 1 day for SimDist to align the 444,461 pairs of
homologous sequences for one type of substitution matrices. With the three types of substitution matrices
combined with the 444,461 pairs of homologous sequences and 444,461 pairs of unrelated sequences, the
computation times on the six combinations added up to 6 days. The three adjusted scores on each of the
444,461 pairs of homologous sequences were saved in three arrays. Then the comparison of the three
arrays of adjusted scores at each of the six percentage levels was performed quickly.
4. DISCUSSION
We have developed an algorithm for selecting an appropriate substitutionmatrix at given gap penalties for
computing an optimal local alignment between two sequences. Alignments and their evolutionary distances
produced by the algorithm at various gap penalties are helpful for selecting gap penalties that are specific
to the given sequences. Based on the level of variation in alignment distance and the level of tolerance on
short substitution blocks, appropriate gap penalties are selected. The similarity and distance measures from
the algorithm are useful for analysis of a large family of homologous sequences. For example, SimDist
produced more accurate reciprocally best-matching pairs of sequences than SIM on 50 of 100 protein
families. In addition, SimDist is useful for comparison of different types of substitution matrices in scoring
sequence alignments. Our work is a step in developing a theoretical and empirical basis for finding the
most sensitive parameter values in sequence alignment.
For a given pair of gap open and extension penalties, the Smith-Waterman algorithm is iterated 10 times
on average in the new algorithm. A higher alignment score is often obtainable by iterating the Smith-
Waterman algorithm a few times. Thus, the new algorithm can be used in applications where the efficiency
of the Smith-Waterman algorithm is acceptable.
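The cost of the selection step is governed by how many distances are scored, since each evaluation is one run of the Smith-Waterman algorithm. The following Python sketch shows one way to keep that number small. It ASSUMES that the alignment score is a strictly unimodal function of the matrix distance and uses a plain ternary search with caching; this is an illustrative search strategy, not necessarily the one used in SimDist. The callable `score_at` is hypothetical and stands for "run the local alignment with the substitution matrix at distance d and return the score".

```python
def select_distance(score_at, low, high):
    """Find the integer distance in [low, high] with the maximum score.

    score_at: hypothetical callable, distance -> alignment score. Each call
              stands for one run of the alignment algorithm.
    Returns (best_distance, best_score, number_of_evaluations).

    ASSUMPTION: the score is strictly unimodal in the distance, so a
    ternary search can discard part of the interval at every step.
    """
    cache = {}

    def score(d):
        # Cache results so that no distance is aligned twice.
        if d not in cache:
            cache[d] = score_at(d)
        return cache[d]

    while high - low > 2:
        third = (high - low) // 3
        m1, m2 = low + third, high - third
        if score(m1) < score(m2):
            low = m1 + 1   # the maximum lies to the right of m1
        else:
            high = m2 - 1  # the maximum lies to the left of m2
    best = max(range(low, high + 1), key=score)
    return best, score(best), len(cache)


if __name__ == "__main__":
    # Toy score with a single peak at distance 73 (made-up function).
    def toy(d):
        return -(d - 73) ** 2

    best, value, evaluations = select_distance(toy, 10, 200)
    print(best, value, evaluations)
```

If the score is not unimodal for a given pair, a search of this kind can stop at a local maximum; scanning a coarse grid of distances first is a simple safeguard at the price of more alignment runs.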
Existing algorithms for computing the maximum likelihood evolutionary distance between two sequences
take as input a gap-free alignment between the sequences (Felsenstein, 2004; Schmidt et al., 2002). Our
results show that gap penalties are important parameters affecting the evolutionary distance between two
sequences. Our algorithm combines alignment and distance computations into one framework and explores
the effects of substitution matrix and gap penalties on the similarity and distance between the sequences.
The techniques proposed in this paper can be extended to select appropriate values for additional
parameters in variations of standard alignment algorithms. For example, a constant penalty is given to
a block of different regions between blocks of similar regions (Huang and Chao, 2003). An appropriate
difference block penalty can be selected by studying the effects of various difference block penalties on
alignments and their distances. As another example, two parameter sets with different levels of stringency
are used to deal with sequences with variation in the level of conservation among regions (Huang and
Brutlag, 2007). Given two sets of gap penalties, two substitution matrices are selected among substitution
matrices at different distances to form, with the given sets of gap penalties, two parameter sets that lead
to the maximum alignment score.
An alternative way of selecting a substitution matrix at a proper distance is by maximizing the statistical
significance of alignments over substitution matrices at distances between 10 and 200. The statistical
significance of an alignment can be computed by using the model of Karlin and Altschul (1990), where
the model parameters depend on the composition of the two sequences and the substitution matrix used in
the alignment computation. The model parameters can be estimated by using the method of Mott (2000).
It is unclear whether the alternative approach compares fairly in efficiency and accuracy with the approach
proposed in this paper.
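To make the alternative concrete: in the Karlin and Altschul model, the expected number of chance local alignments with score at least S between sequences of lengths m and n is E = K m n exp(-lambda S), and the corresponding P-value is 1 - exp(-E). The Python sketch below selects the distance whose alignment has the smallest E, which is equivalent to the smallest P-value. The parameters lambda and K must be estimated separately for each substitution matrix (and, for gapped alignments, for the gap penalties); the numerical values in the usage example are invented for illustration only, and the function names are hypothetical.

```python
import math


def expected_count(score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S) for sequence lengths m and n."""
    return k * m * n * math.exp(-lam * score)


def p_value(score, m, n, lam, k):
    """P = 1 - exp(-E), computed with expm1 for accuracy when E is tiny."""
    return -math.expm1(-expected_count(score, m, n, lam, k))


def most_significant(results, m, n):
    """Pick the entry with the smallest E among per-distance results.

    results: iterable of (distance, score, lam, k) tuples, where lam and k
             are the model parameters estimated for the matrix at that
             distance.
    """
    return min(results, key=lambda r: expected_count(r[1], m, n, r[2], r[3]))


if __name__ == "__main__":
    # Invented (distance, score, lambda, K) values, for illustration only.
    candidates = [
        (50, 120.0, 0.25, 0.10),
        (100, 150.0, 0.22, 0.08),
        (150, 140.0, 0.20, 0.07),
    ]
    best = most_significant(candidates, m=300, n=350)
    print(best[0], p_value(best[1], 300, 350, best[2], best[3]))
```

Note that the raw scores of different matrices are not directly comparable under this criterion, because lambda and K change with the matrix; the distance with the highest score need not be the one with the most significant alignment.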
AVAILABILITY
The SimDist program is freely available for academic use at: http://deepc2.psi.iastate.edu/aat/align/align.html.
ACKNOWLEDGMENTS
I would like to acknowledge the inclusion, in SimDist, of a file of rate matrices and equilibrium
distributions from version 5.2 of the TREE-PUZZLE package.
REFERENCES
Altschul, S.F. 1993. A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol. 36,
290–300.
Brent, R.P. 1973. Algorithms for Minimization without Derivatives, Prentice-Hall, Englewood Cliffs, NJ.
Brutlag, D.L., Dautricourt, J.P., Maulik, S., et al. 1990. Improved sensitivity of biological sequence database searches.
Comput. Appl. Biosci. 6, 237–245.
Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. 1978. A model of evolutionary change in proteins, 345–358. In:
Dayhoff, M.O., ed., Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical Research
Foundation, Washington, DC.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. 1998. Biological Sequence Analysis: Probabilistic Models of
Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, MA.
Fernández-Baca, D., and Srinivasan, S. 1991. Constructing the minimization diagram of a two-parameter problem.
Operat. Res. Lett. 10, 87–93.
Fitch, W., and Smith, T. 1983. Optimal sequence alignments. Proc. Natl. Acad. Sci. USA 80, 1382–1386.
Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein sequence database.
Science 256, 1443–1445.
Gotoh, O. 1982. An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705–708.
Gotoh, O. 1990. Optimal sequence alignment allowing for long gaps. Bull. Math. Biol. 52, 359–373.
Gusfield, D., Balasubramanian, K., and Naor, D. 1992. Parametric optimization of sequence alignment. Proc. Third
Annu. ACM-SIAM Symp. Discrete Algorithms, 432–439. SIAM, Philadelphia, PA.
Henikoff, S., and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci.
USA 89, 10915–10919.
Huang, X., and Brutlag, D.L. 2007. Dynamic use of multiple parameter sets in sequence alignment. Nucleic Acids
Res. 35, 678–686.
Huang, X., and Chao, K.-M. 2003. A generalized global alignment algorithm. Bioinformatics 19, 228–233.
Huang, X., and Miller, W. 1991. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 12, 337–357.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein
sequences. Comput. Appl. Biosci. 8, 275–282.
Karlin, S., and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features
by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268.
Mott, R. 2000. Accurate formula for P-values of gapped local sequence and profile alignments. J. Mol. Biol. 300,
649–659.
Müller, T., and Vingron, M. 2000. Modeling amino acid replacement. J. Comput. Biol. 7, 761–776.
Pearson, W.R. 1995. Comparison of methods for searching protein sequence databases. Protein Sci. 4, 1145–1160.
Press, W.H., Teukolsky, S.A., Vetterling, W.T., et al. 1992. Numerical Recipes in C. Cambridge University Press,
Cambridge, UK.
Reese, J.T., and Pearson, W.R. 2002. Empirical determination of effective gap penalties for sequence comparison.
Bioinformatics 18, 1500–1507.
Schmidt, H.A., Strimmer, K., Vingron, M., et al. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis
using quartets and parallel computing. Bioinformatics 18, 502–504.
Smith, T.F., and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Tatusov, R.L., Koonin, E.V., and Lipman, D.J. 1997. A genomic perspective on protein families. Science 278, 631–637.
Vingron, M., and Argos, P. 1991. Motif recognition and alignment for many sequences by comparison of dot-matrices.
J. Mol. Biol. 218, 33–43.
Vingron, M., and Waterman, M.S. 1994. Sequence alignment and penalty choice: review of concepts, case studies and
implications. J. Mol. Biol. 235, 1–12.
Vogt, G., Etzold, T., and Argos, P. 1995. An assessment of amino acid exchange matrices in aligning protein sequences:
the twilight zone revisited. J. Mol. Biol. 249, 816–831.
Waterman, M.S., Eggert, M., and Lander, E. 1992. Parametric sequence comparisons. Proc. Natl. Acad. Sci. USA 89,
6090–6093.
Webb, B.-J.M., Liu, J.S., and Lawrence, C.E. 2002. BALSA: Bayesian algorithm for local sequence alignment. Nucleic
Acids Res. 30, 1268–1277.
Address reprint requests to:
Dr. Xiaoqiu Huang
Department of Computer Science
Iowa State University
226 Atanasoff Hall
Ames, IA 50011-1040
E-mail: xqhuang@cs.iastate.edu