Alignment of protein sequences by their profiles

Mission Bay Genentech Hall, University of California, San Francisco, San Francisco, CA 94143, USA.
Protein Science (Impact Factor: 2.85). 05/2004; 13(4):1071-87. DOI: 10.1110/ps.03379804
Source: PubMed

ABSTRACT The accuracy of an alignment between two protein sequences can be improved by including other detectably related sequences in the comparison. We optimize and benchmark such an approach that relies on aligning two multiple sequence alignments, each one including one of the two protein sequences. Thirteen different protocols for creating and comparing profiles corresponding to the multiple sequence alignments are implemented in the SALIGN command of MODELLER. A test set of 200 pairwise, structure-based alignments with sequence identities below 40% is used to benchmark the 13 protocols as well as a number of previously described sequence alignment methods, including heuristic pairwise sequence alignment by BLAST, pairwise sequence alignment by global dynamic programming with an affine gap penalty function by the ALIGN command of MODELLER, sequence-profile alignment by PSI-BLAST, Hidden Markov Model methods implemented in SAM and LOBSTER, pairwise sequence alignment relying on predicted local structure by SEA, and multiple sequence alignment by CLUSTALW and COMPASS. The alignment accuracies of the best new protocols were significantly better than those of the other tested methods. For example, the fraction of the correctly aligned residues relative to the structure-based alignment by the best protocol is 56%, which can be compared with the accuracies of 26%, 42%, 43%, 48%, 50%, 49%, 43%, and 43% for the other methods, respectively. The new method is currently applied to large-scale comparative protein structure modeling of all known sequences.
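The accuracy measure quoted in the abstract (the fraction of correctly aligned residues relative to the structure-based alignment) can be illustrated with a minimal sketch. It assumes each pairwise alignment is given as two equal-length gapped strings with "-" as the gap character, and that "correct" means a residue pair appears in both the test and the reference alignment. The function names `aligned_pairs` and `fraction_correct` are hypothetical and are not part of MODELLER; the paper's exact definition may differ in detail.

```python
from typing import Set, Tuple

Alignment = Tuple[str, str]  # two gapped rows of equal length


def aligned_pairs(row_a: str, row_b: str, gap: str = "-") -> Set[Tuple[int, int]]:
    """Return the (i, j) residue index pairs matched by a pairwise alignment.

    i and j are 0-based positions in the ungapped sequences.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))  # residue i of A is aligned with residue j of B
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def fraction_correct(test: Alignment, reference: Alignment) -> float:
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref = aligned_pairs(*reference)
    if not ref:
        return 0.0
    return len(aligned_pairs(*test) & ref) / len(ref)


# Toy data (invented for illustration, not taken from the benchmark).
reference = ("ACDEF-G", "AC-EFHG")  # stands in for a structure-based alignment
test = ("ACDEFG", "ACEFHG")         # ungapped sequence-based alignment
score = fraction_correct(test, reference)
print(f"{score:.2f}")
```

In this toy case the reference aligns five residue pairs and the test alignment recovers three of them, so the score is 0.60. Averaging such a score over a benchmark set would give a percentage of the kind reported in the abstract.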

Available from: Marc Marti-Renom, Aug 02, 2015
    • "We compare our CNF threading method, CNFpred, with the top-notch profile-based and threading methods such as HHpred (Söding et al., 2005), MUSTER (Wu and Zhang, 2008), SPARKS/SP3/SP5 (Zhou and Zhou, 2005), SALIGN (Marti-Renom et al., 2004), RAPTOR (Xu et al., 2003) and BThreader (Peng and Xu, 2009). We use the published results for SPARKS/SP3/SP5 since they have their own template file formats and we cannot correctly run them locally."
    ABSTRACT: Motivation: Alignment errors are still the main bottleneck for current template-based protein modeling (TM) methods, including protein threading and homology modeling, especially when the sequence identity between two proteins under consideration is low (<30%). Results: We present a novel protein threading method, CNFpred, which achieves much more accurate sequence–template alignment by employing a probabilistic graphical model called a Conditional Neural Field (CNF), which aligns one protein sequence to its remote template using a non-linear scoring function. This scoring function accounts for correlation among a variety of protein sequence and structure features, makes use of information in the neighborhood of two residues to be aligned, and is thus much more sensitive than the widely used linear or profile-based scoring function. To train this CNF threading model, we employ a novel quality-sensitive method, instead of the standard maximum-likelihood method, to maximize directly the expected quality of the training set. Experimental results show that CNFpred generates significantly better alignments than the best profile-based and threading methods on several public (but small) benchmarks as well as our own large dataset. CNFpred outperforms others regardless of the lengths or classes of proteins, and works particularly well for proteins with sparse sequence profiles due to the effective utilization of structure information. Our methodology can also be adapted to protein sequence alignment. Supplementary information: Supplementary data are available at Bioinformatics online.
    Bioinformatics 06/2012; 28(12):i59-66. DOI:10.1093/bioinformatics/bts213 · 4.62 Impact Factor
    • "SALIGN's default settings suffice for many applications. It has been fine tuned and extensively tested for alignment accuracy (Davis et al., 2006; Madhusudhan et al., 2006; 2009; Marti-Renom et al., 2004; 2007; Pieper et al., 2011). Nevertheless, the interface allows the user to manipulate many options if so desired. "
    ABSTRACT: Accurate alignment of protein sequences and/or structures is crucial for many biological analyses, including functional annotation of proteins, classifying protein sequences into families, and comparative protein structure modeling. Described here is a web interface to SALIGN, the versatile protein multiple sequence/structure alignment module of MODELLER. The web server automatically determines the best alignment procedure based on the inputs, while allowing the user to override default parameter values. Multiple alignments are guided by a dendrogram computed from a matrix of all pairwise alignment scores. When aligning sequences to structures, SALIGN uses structural environment information to place gaps optimally. If two multiple sequence alignments of related proteins are input to the server, a profile-profile alignment is performed. All features of the server have been previously optimized for accuracy, especially in the contexts of comparative modeling and identification of interacting protein partners. The SALIGN web server is freely accessible to the academic community. SALIGN is a module of the MODELLER software, which is also freely available to academic users.
    Bioinformatics 05/2012; 28(15):2072-3. DOI:10.1093/bioinformatics/bts302 · 4.62 Impact Factor
    • "PDB files were manually modified to include only amino acids of the defined 100-residue region of the DBD. Then, a multiple-structure alignment of the DBD was constructed with the SALIGN module from the MODELLER version 9.9 software package [16] [25]. The SALIGN module reports a table with the number of equivalent Cα positions (the alignment length; 3.5 Å cut-off), the root mean squared (RMS) distance of equivalent positions, and the sequence identity of equivalent residues for all pairs of proteins, as well as the multiple-sequence alignment (MSA) derived from the multiple optimal superposition of protein structures."
    ABSTRACT: Currently, about 20 crystal structures per day are released and deposited in the Protein Data Bank. A significant fraction of these structures is produced by research groups associated with the structural genomics consortium. The biological function of many of these proteins is generally unknown or not validated by experiment. Therefore, a growing need for functional prediction of protein structures has emerged. Here we present an integrated bioinformatics method that combines sequence-based relationships and three-dimensional (3D) structural similarity of transcriptional regulators with computer prediction of their cognate DNA binding sequences. We applied this method to the AraC/XylS family of transcription factors, which is a large family of transcriptional regulators found in many bacteria controlling the expression of genes involved in diverse biological functions. Three putative new members of this family with known 3D structure but unknown function were identified for which a probable functional classification is provided. Our bioinformatics analyses suggest that they could be involved in plant cell wall degradation (Lin2118 protein from Listeria innocua, PDB code 3oou), symbiotic nitrogen fixation (protein from Chromobacterium violaceum, PDB code 3oio), and either metabolism of plant-derived biomass or nitrogen fixation (protein from Rhodopseudomonas palustris, PDB code 3mn2).
    BioMed Research International 03/2012; 2012:103132. DOI:10.1155/2012/103132 · 2.71 Impact Factor