JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 15, Number 2, 2008
© Mary Ann Liebert, Inc.
Sequence Alignment with an Appropriate
A widely used algorithm for computing an optimal local alignment between two sequences
requires a parameter set with a substitution matrix and gap penalties. It is recognized that a
proper parameter set should be selected to suit the level of conservation between sequences.
We describe an algorithm for selecting an appropriate substitution matrix at given gap
penalties for computing an optimal local alignment between two sequences. In the algorithm,
a substitution matrix that leads to the maximum alignment similarity score is selected among
substitution matrices at various evolutionary distances. The evolutionary distance of the
selected substitution matrix is defined as the distance of the computed alignment. To show
the effects of gap penalties on alignments and their distances and help select appropriate
gap penalties, alignments and their distances are computed at various gap penalties. The
algorithm has been implemented as a computer program named SimDist. The SimDist
program was compared with an existing local alignment program named SIM for finding
reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where
RBPs are commonly used as an operational definition of orthologous sequences. SimDist
produced more accurate results than SIM on 50 of the 100 families, whereas both programs
produced the same results on the other 50 families. SimDist was also used to compare three
types of substitution matrices in scoring 444,461 pairs of homologous sequences from the
100 protein families.
Key words: evolutionary distance, sequence alignment, substitution matrix.
Department of Computer Science, Iowa State University, Ames, Iowa.
A widely used algorithm (Gotoh, 1982; Smith and Waterman, 1981) for computing an optimal
local alignment between two sequences requires a parameter set with a substitution matrix and gap
penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation
between sequences (Altschul, 1993; Brutlag et al., 1990; Vingron and Argos, 1991). Diverse approaches
are taken to study the effects of parameters on alignments and select proper parameter values (Durbin
et al., 1998; Fernández-Baca and Srinivasan, 1991; Fitch and Smith, 1983; Gonnet et al., 1992; Gotoh,
1990; Gusfield et al., 1992; Huang and Brutlag, 2007; Pearson, 1995; Reese and Pearson, 2002; Vingron
and Waterman, 1994; Vogt et al., 1995; Waterman et al., 1992; Webb et al., 2002). Computing the accurate
similarity score and evolutionary distance between sequences with a parameter set that is appropriate for
the sequences is crucial for analysis of a large family of homologous sequences.
We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for
computing an optimal local alignment between two sequences. For a real number $t > 0$, let $S(t)$ be a
substitution matrix at evolutionary distance $t$ in PAM (Point Accepted Mutations) units (Dayhoff et al.,
1978). The algorithm computes an estimate $\hat{t}$ at given gap penalties such that using the substitution matrix
$S(\hat{t})$ leads to the maximum alignment score. The estimate $\hat{t}$ is defined as the evolutionary distance of
the computed alignment. Note that the evolutionary distance, which is commonly used in construction of
phylogenetic trees (Felsenstein, 2004), is different from the edit distance of an alignment. At given gap
penalties, the new algorithm is about 10 times slower than the Smith-Waterman algorithm.
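The selection step can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: a four-letter alphabet with Jukes-Cantor log-odds scores stands in for the Dayhoff-style protein matrices, the distance is in expected substitutions per site rather than PAM units, a gap of length k is assumed to cost `gap_open + k * gap_ext`, and the distance estimate is located by a plain grid search. The names `toy_matrix`, `local_score`, and `estimate_distance` are hypothetical.

```python
import math

ALPHABET = "ACGT"

def toy_matrix(t, c=30):
    """Rounded log-odds scores c*log2(P_ab(t)/pi(b)) under Jukes-Cantor (toy stand-in)."""
    decay = math.exp(-4.0 * t / 3.0)
    same = round(c * math.log2(1.0 + 3.0 * decay))   # score for a == b
    diff = round(c * math.log2(1.0 - decay))         # score for a != b
    return {(a, b): (same if a == b else diff) for a in ALPHABET for b in ALPHABET}

def local_score(x, y, scores, gap_open, gap_ext):
    """Optimal local alignment score with affine gaps (score only, no traceback)."""
    neg_inf = float("-inf")
    n = len(y)
    h_prev = [0] * (n + 1)
    f = [neg_inf] * (n + 1)        # best score ending in a gap that consumes x
    best = 0
    for i in range(1, len(x) + 1):
        h_cur = [0] * (n + 1)
        e = neg_inf                # best score ending in a gap that consumes y
        for j in range(1, n + 1):
            e = max(e - gap_ext, h_cur[j - 1] - gap_open - gap_ext)
            f[j] = max(f[j] - gap_ext, h_prev[j] - gap_open - gap_ext)
            h = max(0, h_prev[j - 1] + scores[(x[i - 1], y[j - 1])], e, f[j])
            h_cur[j] = h
            best = max(best, h)
        h_prev = h_cur
    return best

def estimate_distance(x, y, gap_open, gap_ext, grid):
    """Return (t, score) for the grid distance whose matrix gives the highest score."""
    best_t, best_score = None, -1
    for t in grid:
        score = local_score(x, y, toy_matrix(t), gap_open, gap_ext)
        if score > best_score:     # strict: the smallest maximizing t is kept
            best_t, best_score = t, score
    return best_t, best_score

grid = [k / 20 for k in range(1, 41)]   # t = 0.05, 0.10, ..., 2.00
print(estimate_distance("ACGTTGCA", "ACGTTGCA", 300, 30, grid))
print(estimate_distance("ACGTTGCA", "ACTTTACA", 300, 30, grid))
```

In this sketch every candidate distance costs one full alignment, so the running time grows with the size of the grid. Identical sequences select the smallest distance on the grid, while the pair with two mismatches selects a larger one.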
To show the effects of gap penalties on alignments and their distances and help select appropriate
gap penalties, alignments and their distances are computed at various gap penalties. An increase in gap
penalties has three types of possible effects on alignments: alignment breakage, no change in alignment
configuration, and a decrease in the number of gaps with an increase in the number of mismatches. The
third type of effect results in an alignment with a larger distance. Changes in alignment configuration and
distance, induced by using various gap penalties, are helpful in selecting appropriate gap penalties.
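The second and third types of effect can be reproduced on a toy example. The sketch below is an illustration under simplifying assumptions, not the SimDist procedure: it uses a fixed match/mismatch score instead of a substitution matrix, a linear gap penalty (each gap column costs `gap`), and a hypothetical function `local_align` that returns the optimal local score together with the numbers of gap columns and mismatches found by traceback.

```python
def local_align(x, y, match, mismatch, gap):
    """Smith-Waterman with a linear gap penalty.

    Returns (score, gap_columns, mismatches) for one optimal local alignment.
    `mismatch` is a negative score; `gap` is a positive cost per gap column.
    """
    rows, cols = len(x) + 1, len(y) + 1
    h = [[0] * cols for _ in range(rows)]
    best, bi, bj = 0, 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] - gap, h[i][j - 1] - gap)
            if h[i][j] > best:
                best, bi, bj = h[i][j], i, j

    # Trace back from the best cell until the score drops to zero.
    i, j = bi, bj
    gaps = mismatches = 0
    while h[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            if x[i - 1] != y[j - 1]:
                mismatches += 1
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] - gap:
            gaps += 1
            i -= 1
        else:
            gaps += 1
            j -= 1
    return best, gaps, mismatches

# One inserted residue in y: cheap gaps keep the gap, expensive gaps trade it
# for a mismatch on a single diagonal.
for gap in (2, 4, 12):
    print(gap, local_align("AAAACCCC", "AAAAGCCCC", 5, -4, gap))
```

In this toy case the alignment configuration is unchanged between gap costs 2 and 4, and at gap cost 12 the single gap is replaced by a mismatch.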
The algorithm has been implemented as a computer program named SimDist (Similarity and Distance).
The SimDist program was used to compute alignments and their distances at various gap penalties on
2282 randomly selected pairs of homologous protein sequences from 100 families. Alignments and their
distances from the program at various gap penalties were examined to select appropriate gap penalties.
The SimDist program was compared with an existing local alignment program named SIM (Huang and
Miller, 1991) for finding reciprocally best-matching pairs (RBPs) of sequences in each of the 100 protein
families, where RBPs are commonly used as an operational definition of orthologous sequences (Tatusov
et al., 1997). SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both
programs produced the same results on the other 50 families. SimDist was also used to compare three
types of substitution matrices (Henikoff and Henikoff, 1992; Jones et al., 1992; Müller and Vingron, 2000)
in scoring 444,461 pairs of homologous sequences from the 100 protein families.
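As a minimal sketch of the RBP idea, assume two sets of sequences (for example, from two genomes) and a table of pairwise similarity scores between them. This two-set formulation, the function name `reciprocal_best_pairs`, and the sequence labels are assumptions made for illustration; the exact procedure used in the comparison is not reproduced here.

```python
def reciprocal_best_pairs(scores):
    """Find reciprocally best-matching pairs.

    `scores` maps (a, b) to a similarity score, with a from one set and b from
    the other. A pair (a, b) is returned when b is the best match of a and a is
    the best match of b. Ties are broken in favor of the pair seen first.
    """
    best_for_a, best_for_b = {}, {}
    for (a, b), s in scores.items():
        if a not in best_for_a or s > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or s > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    return sorted((a, b) for a, b in best_for_a.items() if best_for_b[b] == a)

example_scores = {
    ("a1", "b1"): 90, ("a1", "b2"): 40,
    ("a2", "b1"): 70, ("a2", "b2"): 30,
}
print(reciprocal_best_pairs(example_scores))
```

Because the result depends only on the ranking of scores, any change in how scores are computed, such as choosing a different substitution matrix per pair, can change which pairs are reported.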
We describe the new algorithm in the context of protein sequences; it can be modified to work
for DNA sequences. We start with the Markov model of Dayhoff et al. (1978) for protein evolution. Let
$P(t)$ denote the 20 by 20 transition probability matrix of the Markov model over a time period $t$, where
for residues $a$ and $b$, $P_{ab}(t)$ is the transition probability from residue $a$ to residue $b$ over the time period $t$.
We denote by $\pi$ the equilibrium probability distribution of the residues, where $\pi(a)$ is the equilibrium
probability of residue $a$. Then $\pi(a) P_{ab}(t)$ is the joint probability of seeing $a$ aligned with $b$. We denote
by $S(t)$ the substitution matrix at time $t$ that is obtained from $P(t)$ by the formula (Dayhoff et al., 1978):
for residues $a$ and $b$, and a scaling factor $c > 0$,
$$S_{ab}(t) = c \cdot \log_2 \frac{\pi(a)\, P_{ab}(t)}{\pi(a)\, \pi(b)} = c \cdot \log_2 \frac{P_{ab}(t)}{\pi(b)}.$$
The substitution matrix $S(t)$ is said to be in $1/c$ bit units. Like the widely used BLOSUM matrix of
Henikoff and Henikoff (1992), a 23 by 23 substitution matrix is constructed from the 20 by 20 substitution
matrix to accommodate three additional symbols B, Z, and X. For efficiency, the scaling factor c is set to
30 and all values in the substitution matrix are rounded to integers.
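The log-odds construction can be sketched in a few lines of Python. The Jukes-Cantor model on a four-letter alphabet is used here only as a convenient stand-in for the 20 by 20 protein model, with the distance in expected substitutions per site; the function names are hypothetical, and the extension to the symbols B, Z, and X is not shown.

```python
import math

def jukes_cantor(t):
    """4 by 4 transition matrix P(t) of the Jukes-Cantor model (toy stand-in)."""
    decay = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def substitution_matrix(p, pi, c=30):
    """S_ab = round(c * log2(P_ab / pi(b))): integer scores in 1/c bit units.

    Requires every P_ab > 0, which holds for the toy model when t > 0.
    """
    n = len(pi)
    return [[round(c * math.log2(p[a][b] / pi[b])) for b in range(n)] for a in range(n)]

pi = [0.25] * 4   # equilibrium distribution of the toy model
for t in (0.1, 0.5, 1.0):
    s = substitution_matrix(jukes_cantor(t), pi)
    print(t, "match:", s[0][0], "mismatch:", s[0][1])
```

With c = 30, rounding to integers loses little precision while allowing integer arithmetic in the alignment computation. Match scores shrink and mismatch scores become less negative as the distance grows.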
The transition probability matrix $P(t)$ is computed for any evolutionary time (distance) $t \ge 0$ through
an instantaneous rate matrix $Q$, which is independent of $t$ (Müller and Vingron, 2000). Assume that $Q$ can
be diagonalized by $D = U^{-1} Q U$, where $D$ is a diagonal matrix with diagonal elements $d_1, d_2, \ldots, d_{20}$.
Then the matrix $P(t)$ is computed with the formula (Müller and Vingron, 2000):
$$P(t) = \exp(tQ) = U \exp(tD)\, U^{-1},$$
where $\exp(tD)$ is the diagonal matrix with diagonal elements $\exp(t d_1), \exp(t d_2), \ldots, \exp(t d_{20})$.
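The diagonalization formula can be checked on a small case. The sketch below assumes a hypothetical two-state rate matrix Q = [[-a, a], [b, -b]] whose eigen-decomposition is written out by hand (eigenvalues 0 and -(a + b)); a 20 by 20 protein rate matrix would instead be diagonalized numerically. The function names and the rates a and b are illustrative only.

```python
import math

def matmul(m1, m2):
    """Multiply two small dense matrices given as nested lists."""
    return [[sum(m1[i][k] * m2[k][j] for k in range(len(m2)))
             for j in range(len(m2[0]))] for i in range(len(m1))]

def transition_matrix(t, a=0.3, b=0.1):
    """P(t) = U exp(tD) U^{-1} for the toy rate matrix Q = [[-a, a], [b, -b]]."""
    # Columns of U are eigenvectors of Q: (1, 1) for eigenvalue 0 and
    # (a, -b) for eigenvalue -(a + b), so that D = U^{-1} Q U is diagonal.
    u = [[1.0, a], [1.0, -b]]
    u_inv = [[b / (a + b), a / (a + b)], [1.0 / (a + b), -1.0 / (a + b)]]
    d = [0.0, -(a + b)]
    exp_td = [[math.exp(t * d[0]), 0.0], [0.0, math.exp(t * d[1])]]
    return matmul(matmul(u, exp_td), u_inv)

for t in (0.0, 1.0, 100.0):
    print(t, transition_matrix(t))
```

Since Q does not depend on t, U and its inverse are computed once, and each new distance requires only the exponentials of the scaled eigenvalues and two matrix products. In the toy model, P(0) is the identity matrix and each row of P(t) approaches the equilibrium distribution (b/(a+b), a/(a+b)) as t grows.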
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein
sequences. Comput. Appl. Biosci. 8, 275–282.
Karlin, S., and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features
by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268.
Mott, R. 2000. Accurate formula for P-values of gapped local sequence and profile alignments. J. Mol. Biol. 300,
Müller, T., and Vingron, M. 2000. Modeling amino acid replacement. J. Comput. Biol. 7, 761–776.
Pearson, W.R. 1995. Comparison of methods for searching protein sequence databases. Protein Sci. 4, 1145–1160.
Press, W.H., Teukolsky, S.A., Vetterling, W.T., et al. 1992. Numerical Recipes in C. Cambridge University Press,
Reese, J.T., and Pearson, W.R. 2002. Empirical determination of effective gap penalties for sequence comparison.
Bioinformatics 18, 1500–1507.
Schmidt, H.A., Strimmer, K., Vingron, M., et al. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis
using quartets and parallel computing. Bioinformatics 18, 502–504.
Smith, T.F., and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Tatusov, R.L., Koonin, E.V., and Lipman, D.J. 1997. A genomic perspective on protein families. Science 278, 631–637.
Vingron, M., and Argos, P. 1991. Motif recognition and alignment for many sequences by comparison of dot-matrices.
J. Mol. Biol. 218, 33–43.
Vingron, M., and Waterman, M.S. 1994. Sequence alignment and penalty choice: review of concepts, case studies and
implications. J. Mol. Biol. 235, 1–12.
Vogt, G., Etzold, T., and Argos, P. 1995. An assessment of amino acid exchange matrices in aligning protein sequences:
the twilight zone revisited. J. Mol. Biol. 249, 816–831.
Waterman, M.S., Eggert, M., and Lander, E. 1992. Parametric sequence comparisons. Proc. Natl. Acad. Sci. USA 89,
Webb, B.-J.M., Liu, J.S., and Lawrence, C.E. 2002. BALSA: Bayesian algorithm for local sequence alignment. Nucleic
Acids Res. 30, 1268–1277.
Address reprint requests to:
Dr. Xiaoqiu Huang
Department of Computer Science
Iowa State University
226 Atanasoff Hall
Ames, IA 50011-1040