CORAL: Aligning conserved core regions across domain families

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Bioinformatics (Impact Factor: 4.98). 06/2009; 25(15):1862-8. DOI: 10.1093/bioinformatics/btp334
Source: PubMed


Homologous protein families share highly conserved sequence and structure regions that are frequent targets for comparative analysis of related proteins and families. Many protein families, such as the curated domain families in the Conserved Domain Database (CDD), exhibit similar structural cores. To improve accuracy in aligning such protein families, we propose a profile-profile method, CORAL, that aligns individual core regions as gap-free units.
CORAL computes an optimal local alignment of two profiles, with heuristics to preserve continuity within core regions. We benchmarked its performance on curated domains in CDD, which have pre-defined core regions, against COMPASS, HHalign and PSI-BLAST, using structure superpositions and comprehensive curator-optimized alignments as standards of truth. CORAL improves alignment accuracy on core regions over general profile methods, returning a balanced score of 0.57 for over 80% of all domain families in CDD, compared with the highest balanced score of 0.45 from other methods. Further, CORAL provides E-values to aid in detecting homologous protein families and, by respecting block boundaries, produces alignments with improved 'readability' that facilitate manual refinement.
CORAL will be included in future versions of the NCBI Cn3D/CDTree software, which can be downloaded at
Supplementary data are available at Bioinformatics online.
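
The idea of aligning core regions as gap-free units can be illustrated with a small sketch. This is not the CORAL implementation: the abstract does not specify its scoring function or heuristics, so the column score, the `skip_penalty` parameter, and the names `column_score`, `block_score` and `align_blocks` are all assumptions made for illustration. Each profile is represented as a list of blocks, each block is a list of columns (residue frequency vectors), and a Smith-Waterman-style recurrence runs over whole blocks, so gaps can only fall between blocks and never inside one.

```python
from typing import List, Sequence

Column = Sequence[float]  # residue frequencies at one profile position
Block = List[Column]      # one core region: consecutive columns, no internal gaps


def column_score(a: Column, b: Column) -> float:
    """Toy column-column score: dot product minus a hypothetical 0.5 baseline."""
    return sum(x * y for x, y in zip(a, b)) - 0.5


def block_score(p: Block, q: Block) -> float:
    """Best gap-free overlap of two blocks over all relative offsets (floor 0)."""
    best = 0.0
    for offset in range(-(len(q) - 1), len(p)):
        total = 0.0
        for j, col_q in enumerate(q):
            i = offset + j  # position in p paired with position j in q
            if 0 <= i < len(p):
                total += column_score(p[i], col_q)
        best = max(best, total)
    return best


def align_blocks(p_blocks: List[Block], q_blocks: List[Block],
                 skip_penalty: float = 1.0) -> float:
    """Local alignment score where the aligned units are whole blocks.

    skip_penalty (hypothetical) is charged for leaving a block unaligned.
    """
    n, m = len(p_blocks), len(q_blocks)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                0.0,  # local alignment: restart anywhere
                dp[i - 1][j - 1] + block_score(p_blocks[i - 1], q_blocks[j - 1]),
                dp[i - 1][j] - skip_penalty,  # skip a block of the first profile
                dp[i][j - 1] - skip_penalty,  # skip a block of the second profile
            )
            best = max(best, dp[i][j])
    return best


# Usage with one-hot columns over a three-letter toy alphabet.
A, B, C = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
profile_1 = [[A, B, C], [C, C]]
profile_2 = [[A, B, C], [C, C]]
score = align_blocks(profile_1, profile_2)
```

Because the inner comparison only slides one block against another, every aligned block stays contiguous, which is the property the abstract associates with more readable alignments. A real profile-profile method would use a statistically grounded column score and would also report E-values, neither of which is attempted here.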

Available from: Aron Marchler-Bauer,
    • "Extensions of these SCRs with other geometric features such as backbone conformations have been shown to improve the performance of comparative modeling (Deane, et al., 2001; Montalvao, et al., 2005). More "elastic" alignment methods, such as those based on comparison of intramolecular contacts, emphasize similarities in the local structural environment and allow deducing correspondences even for structural elements with larger deviations (Fong and Marchler-Bauer, 2009; Hasegawa and Holm, 2009; Holm and Sander, 1996). While the number of solved structures is growing rapidly, it still pales in comparison to the amount of available sequence data (Levitt, 2007). "
    ABSTRACT: MOTIVATION: The structures of homologous proteins are generally better conserved than their sequences. This phenomenon is demonstrated by the prevalence of structurally conserved regions (SCRs) even in highly divergent protein families. Defining SCRs requires the comparison of two or more homologous structures and is affected by their availability and divergence, and our ability to deduce structurally equivalent positions among them. In the absence of multiple homologous structures, it is necessary to predict SCRs of a protein using information from only a set of homologous sequences and (if available) a single structure. Accurate SCR predictions can benefit homology modeling and sequence alignment. RESULTS: Using pairwise DaliLite alignments among a set of homologous structures, we devised a simple measure of structural conservation, termed structural conservation index (SCI). SCI was used to distinguish SCRs from non-SCRs. A database of SCRs was compiled from 386 SCOP superfamilies containing 6,489 protein domains. Artificial neural networks were then trained to predict SCRs with various features deduced from a single structure and homologous sequences. Assessment of the predictions via a 5-fold cross-validation method revealed that predictions based on features derived from a single structure perform similarly to ones based on homologous sequences, while combining sequence and structural features was optimal in terms of accuracy (0.755) and Matthews correlation coefficient (0.476). These results suggest that even without information from multiple structures, it is still possible to effectively predict SCRs for a protein. Finally, inspection of the structures with the worst predictions pinpoints difficulties in SCR definitions. AVAILABILITY: The SCR database and the prediction server can be found at CONTACT:, SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Online.
    Bioinformatics 11/2012; 29(2). DOI:10.1093/bioinformatics/bts682 · 4.98 Impact Factor
    • "Conventional homology detection methods, which operate on the level of whole proteins or domains, consider connections between SCOP superfamilies as false positives (Gough et al., 2001). It becomes obvious, that analysis of evolutionary relationships on the level of functional closed loops requires a special approach (Andreeva et al., 2007; Fong and Marchler-Bauer, 2009; Xie and Bourne, 2008). Although most of the derived prototypes (>70%, data not shown) have matches in Pfam, Prosite and CDD, functional annotation can not be directly transferred from the databases defining the function on whole-protein or domain level (Bateman et al., 2004; Lo Conte et al., 2000; Marchler-Bauer, et al., 2009; Sigrist et al., 2010), resulting in ambiguous annotations, and requiring additional manual curation. "
    ABSTRACT: Motivation: Earlier studies of protein structure revealed closed loops with a characteristic size of 25–30 residues and a ring-like shape as a basic universal structural element of globular proteins. Elementary functional loops (EFLs) have specific signatures and provide functional residues important for binding/activation and the principal chemical transformation steps of the enzymatic reaction. The goal of this work is to show how these functional loops evolved from pre-domain peptides and to find a set of prototypes from which the EFLs of contemporary proteins originated. Results: This article describes a computational method for deriving prototypes of EFLs based on the sequences of complete genomes. The procedure comprises the iterative derivation of sequence profiles followed by their hierarchical clustering. The scoring function takes into account the information content of profile positions, thus preserving the signature. The statistical significance of scores is evaluated from the empirical distribution of scores of the background model. A set of prototypes of EFLs from archaeal proteomes is derived. This set delineates evolutionary connections between major functions and illuminates how folds and functions emerged in pre-domain evolution as a combination of prototypes. Contact:
    Bioinformatics 09/2010; 26(18):i497-503. DOI:10.1093/bioinformatics/btq374 · 4.98 Impact Factor