Refining multiple sequence alignments with conserved core regions

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Nucleic Acids Research (Impact Factor: 9.11). 02/2006; 34(9):2598-606. DOI: 10.1093/nar/gkl274
Source: PubMed


Accurate multiple sequence alignments of proteins are very important to several areas of computational biology and provide an understanding of the phylogenetic history of domain families, their identification and classification. This article presents a new algorithm, REFINER, that refines a multiple sequence alignment by iterative realignment of its individual sequences with the predetermined conserved core (block) model of a protein family. Realignment of each sequence can correct misalignments between a given sequence and the rest of the profile while preserving the family's overall block model. Large-scale benchmarking studies showed a noticeable improvement of alignment after refinement. This can be inferred from the increased alignment score and the enhanced sensitivity of database searches that use sequence profiles derived from refined alignments rather than from the original alignments. A standalone version of the program is available by ftp distribution and will be incorporated into the next release of the Cn3D structure/alignment viewer.
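
The iterative, leave-one-out realignment idea described in the abstract can be illustrated with a minimal sketch. This is not the REFINER implementation: the function names (`block_profile`, `score_at`, `refine`), the log-odds scoring with a flat pseudocount and uniform background, and the greedy block-by-block shifting are all assumptions chosen to keep the example short. Each conserved block is ungapped and is stored as one start position per sequence; a sequence is removed, a profile is built from the remaining rows, and the block is moved within the sequence only if that strictly improves its score and does not cross neighbouring blocks.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def block_profile(seqs, starts, length, skip, pseudo=1.0):
    """Log-odds scores per column of one block, built from all rows except `skip`.

    Assumption: uniform background and a flat pseudocount, for illustration only.
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(
            seq[start + col]
            for row, (seq, start) in enumerate(zip(seqs, starts))
            if row != skip
        )
        total = sum(counts.values()) + pseudo * len(ALPHABET)
        profile.append(
            {a: math.log(((counts[a] + pseudo) / total) / background) for a in ALPHABET}
        )
    return profile


def score_at(seq, start, profile):
    """Score of placing the block at `start` in `seq`; unknown residues score 0."""
    return sum(col.get(seq[start + k], 0.0) for k, col in enumerate(profile))


def refine(seqs, block_starts, block_lengths, max_rounds=10):
    """Greedy leave-one-out refinement of block positions.

    block_starts[b][i] is the start of block b in sequence i. Block count,
    lengths and order are never changed, so the block model is preserved.
    """
    starts = [list(row) for row in block_starts]
    for _ in range(max_rounds):
        changed = False
        for i, seq in enumerate(seqs):
            for b, length in enumerate(block_lengths):
                profile = block_profile(seqs, starts[b], length, skip=i)
                # The block may slide only between its neighbouring blocks.
                lo = starts[b - 1][i] + block_lengths[b - 1] if b > 0 else 0
                nxt = starts[b + 1][i] if b + 1 < len(block_lengths) else len(seq)
                hi = nxt - length
                current = starts[b][i]
                best = max(range(lo, hi + 1), key=lambda s: score_at(seq, s, profile))
                # Move only on a strict improvement, so ties keep the current placement.
                if score_at(seq, best, profile) > score_at(seq, current, profile):
                    starts[b][i] = best
                    changed = True
        if not changed:
            break
    return starts
```

For example, given four toy sequences that share the motif `WCHK`, with the last row's block placed one residue too early, `refine(["AAWCHKAA", "GAWCHKGG", "TTWCHKTT", "SSWCHKSS"], [[2, 2, 2, 1]], [4])` moves that block back into register with the others. A full treatment would use a proper substitution model and could also let block boundaries change, which this sketch deliberately leaves out.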



Available from: Anna R Panchenko
  • Source
    • "Secondly we applied the “phylogenetically aware” method PRANK with the '+F' option to account for insertion deletion events [38]. The REFINER method [81] was used to assess the quality of all of the resulting alignments from the different algorithms using the estimated norMD values [41]. For each gene family, the alignment with the highest norMD score was used and where more than one alignment had an equal top score the alignment method was chosen at random."
    ABSTRACT: Placental mammals display a huge range of life history traits, including size, longevity, metabolic rate and germ line generation time. Although a number of general trends have been proposed between these traits, there are exceptions that warrant further investigation. Species such as naked mole rat, human and certain bat species all exhibit extreme longevity with respect to body size. It has long been established that telomeres and telomere maintenance have a clear role in ageing but it has not yet been established whether there is evidence for adaptation in telomere maintenance proteins that could account for increased longevity in these species. Here we carry out a molecular investigation of selective pressure variation, specifically focusing on telomere associated genes across placental mammals. In general we observe a large number of instances of positive selection acting on telomere genes. Although these signatures of selection overall are not significantly correlated with either longevity or body size we do identify positive selection in the microbat species Myotis lucifugus in functionally important regions of the telomere maintenance genes DKC1 and TERT, and in naked mole rat in the DNA repair gene BRCA1. These results demonstrate the multifarious selective pressures acting across the mammal phylogeny driving lineage-specific adaptations of telomere associated genes. Our results show that regardless of the longevity of a species, these proteins have evolved under positive selection thereby removing increased longevity as the single selective force driving this rapid rate of evolution. However, evidence of molecular adaptations specific to naked mole rat and Myotis lucifugus highlight functionally significant regions in genes that may alter the way in which telomeres are regulated and maintained in these longer-lived species.
    BMC Evolutionary Biology 11/2013; 13(1):251. DOI:10.1186/1471-2148-13-251 · 3.37 Impact Factor
  • Source
    • "In the post-genomic era, the growing complexity of the multiple alignment problem has led to the development of novel methods that use a combination of different alignment algorithms [22], [23], [24], [25] or that incorporate biological information other than the sequence itself [26], [27]. A number of specific MSA problems have also been addressed by programs such as POA [28] for the alignment of non-linear sequences or PRANK [10] for the detailed evolutionary analysis of more closely related sequences."
    ABSTRACT: Multiple comparison or alignment of protein sequences has become a fundamental tool in many different domains in modern molecular biology, from evolutionary studies to prediction of 2D/3D structure, molecular function and inter-molecular interactions etc. By placing the sequence in the framework of the overall family, multiple alignments can be used to identify conserved features and to highlight differences or specificities. In this paper, we describe a comprehensive evaluation of many of the most popular methods for multiple sequence alignment (MSA), based on a new benchmark test set. The benchmark is designed to represent typical problems encountered when aligning the large protein sequence sets that result from today's high throughput biotechnologies. We show that alignment methods have significantly progressed and can now identify most of the shared sequence features that determine the broad molecular function(s) of a protein family, even for divergent sequences. However, we have identified a number of important challenges. First, the locally conserved regions, that reflect functional specificities or that modulate a protein's function in a given cellular context, are less well aligned. Second, motifs in natively disordered regions are often misaligned. Third, the badly predicted or fragmentary protein sequences, which make up a large proportion of today's databases, lead to a significant number of alignment errors. Based on this study, we demonstrate that the existing MSA methods can be exploited in combination to improve alignment accuracy, although novel approaches will still be needed to fully explore the most difficult regions. We then propose knowledge-enabled, dynamic solutions that will hopefully pave the way to enhanced alignment construction and exploitation in future evolutionary systems biology studies.
    PLoS ONE 03/2011; 6(3):e18093. DOI:10.1371/journal.pone.0018093 · 3.23 Impact Factor
  • Source
    • "As a consequence, the first methods were introduced that combined both global and local information in a single alignment program, such as DbClustal (55), T-Coffee (56), MAFFT (57), Muscle (33), Probcons (58) or PROMALS (59). Other authors introduced different kinds of information in the sequence alignment, such as 3D structure in 3DCoffee (60) and MUMMALS (61) or domain organization in REFINER (62). A number of methods were also developed to address specific problems, such as the accurate alignment of closely related sequences in PRANK (63) or the alignment of sequences with different domain organizations in POA (64). "
    ABSTRACT: The post-genomic era presents many new challenges for the field of bioinformatics. Novel computational approaches are now being developed to handle the large, complex and noisy datasets produced by high throughput technologies. Objective evaluation of these methods is essential (i) to assure high quality, (ii) to identify strong and weak points of the algorithms, (iii) to measure the improvements introduced by new methods and (iv) to enable non-specialists to choose an appropriate tool. Here, we discuss the development of formal benchmarks, designed to represent the current problems encountered in the bioinformatics field. We consider several criteria for building good benchmarks and the advantages to be gained when they are used intelligently. To illustrate these principles, we present a more detailed discussion of benchmarks for multiple alignments of protein sequences. As in many other domains, significant progress has been achieved in the multiple alignment field and the datasets have become progressively more challenging as the existing algorithms have evolved. Finally, we propose directions for future developments that will ensure that the bioinformatics benchmarks correspond to the challenges posed by the high throughput data.
    Nucleic Acids Research 11/2010; 38(21):7353-63. DOI:10.1093/nar/gkq625 · 9.11 Impact Factor