Sequence alignment with an appropriate substitution matrix.

Department of Computer Science, Iowa State University, Ames, Iowa 50011-1040, USA.
Journal of Computational Biology (Impact Factor: 1.67). 04/2008; 15(2):129-38. DOI: 10.1089/cmb.2007.0155
Source: PubMed

ABSTRACT A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.
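The selection step described in the abstract can be illustrated with a minimal sketch. It is not the SimDist implementation. It assumes a toy family of log-odds matrices derived from a Jukes-Cantor model on a four-letter alphabet (the paper works with protein substitution matrices), an affine gap cost of `gap_open + k * gap_extend` for a gap of length `k`, and hypothetical function names (`jc_log_odds`, `local_score`, `select_matrix`). For each candidate distance it computes the optimal local alignment score and keeps the distance whose matrix gives the maximum score.

```python
import math


def jc_log_odds(distance, alphabet="ACGT"):
    """Toy log-odds matrix (in bits) at a given distance (> 0).

    Assumption for illustration: a Jukes-Cantor model with uniform
    background frequencies, standing in for a real matrix family.
    """
    k = len(alphabet)
    decay = math.exp(-k * distance / (k - 1))
    p_same = 1 / k + (k - 1) / k * decay   # P(residue unchanged)
    p_diff = 1 / k - decay / k             # P(one specific change)
    match = math.log2(k * p_same)
    mismatch = math.log2(k * p_diff)
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def local_score(a, b, matrix, gap_open, gap_extend):
    """Optimal local alignment score with affine gaps (score only)."""
    neg = float("-inf")
    prev_h = [0.0] * (len(b) + 1)   # best score ending at (i-1, j)
    prev_f = [neg] * (len(b) + 1)   # best score ending in a vertical gap
    best = 0.0
    for i in range(1, len(a) + 1):
        cur_h = [0.0] * (len(b) + 1)
        cur_f = [neg] * (len(b) + 1)
        e = neg                     # best score ending in a horizontal gap
        for j in range(1, len(b) + 1):
            e = max(e - gap_extend, cur_h[j - 1] - gap_open - gap_extend)
            cur_f[j] = max(prev_f[j] - gap_extend,
                           prev_h[j] - gap_open - gap_extend)
            diag = prev_h[j - 1] + matrix[(a[i - 1], b[j - 1])]
            cur_h[j] = max(0.0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


def select_matrix(a, b, distances, gap_open, gap_extend):
    """Return (distance, score) for the matrix with the maximum score."""
    scored = [(local_score(a, b, jc_log_odds(d), gap_open, gap_extend), d)
              for d in distances]
    best_score, best_distance = max(scored, key=lambda t: t[0])
    return best_distance, best_score


if __name__ == "__main__":
    d, s = select_matrix("ACGTACGTAC", "ACGTTCGTAC",
                         [0.05, 0.1, 0.2, 0.5, 1.0], 2.0, 0.5)
    print(f"selected distance: {d}, score: {s:.2f} bits")
```

Scores are comparable across candidate matrices here only because every matrix is expressed in the same units (bits). Repeating the call over several `(gap_open, gap_extend)` pairs would mirror the abstract's suggestion of examining alignments and distances at various gap penalties.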

  • ABSTRACT: Pairwise sequence alignment forms the basis of numerous other applications in bioinformatics. The quality of an alignment is gauged by statistical significance rather than by alignment score alone. Therefore, accurate estimation of the statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, it was shown that pairwise statistical significance performs better in practice than database statistical significance, and also provides quicker individual pairwise estimates of statistical significance without requiring a time-consuming database search. Under an evolutionary model, a substitution matrix can be derived from a rate matrix and a fixed distance. Although commonly used substitution matrices such as BLOSUM62 were not originally derived from a rate matrix under an evolutionary model, the corresponding rate matrices can be back-calculated. Many researchers have derived different rate matrices using different methods and data. In this paper, we show that pairwise statistical significance using rate matrices with a sequence-pair-specific distance performs significantly better than using a fixed distance. Pairwise statistical significance using sequence-pair-specific distanced substitution matrices also outperforms database statistical significance reported by BLAST.
  • ABSTRACT: CodeML (part of the PAML package) implements a maximum likelihood-based approach to detect positive selection on a specific branch of a given phylogenetic tree. While CodeML is widely used, it is very compute-intensive. We present SlimCodeML, an optimized version of CodeML for the branch-site model. Our performance analysis shows that SlimCodeML substantially outperforms CodeML (up to 9.38 times faster), especially for large-scale genomic analyses.
    IEEE International Workshop on High Performance Computational Biology (HiCOMB'12); 05/2012
  • ABSTRACT: We outline a procedure for jointly sampling substitution matrices and multiple sequence alignments, according to an approximate posterior distribution, using an MCMC-based algorithm. This procedure provides an efficient and simple method by which to generate alternative alignments according to their expected accuracy, and allows appropriate parameters for substitution matrices to be selected in an automated fashion. In the cases considered here, the sampled alignments with the highest likelihood have an accuracy consistently higher than alignments generated using the standard BLOSUM62 matrix.

