ABSTRACT: Birds are the most species-rich class of tetrapod vertebrates and have wide relevance across many research fields. We explored bird macroevolution using full genomes from 48 avian species representing all major extant clades. The avian genome is principally characterized by its constrained size, which predominantly arose because of lineage-specific erosion of repetitive elements, large segmental deletions, and gene loss. Avian genomes furthermore show a remarkably high degree of evolutionary stasis at the levels of nucleotide sequence, gene synteny, and chromosomal structure. Despite this pattern of conservation, we detected many non-neutral evolutionary changes in protein-coding genes and noncoding regions. These analyses reveal that pan-avian genomic diversity covaries with adaptations to different lifestyles and convergent evolution of traits.
With ~10,500 living species (1), birds are the most species-rich class of tetrapod vertebrates. Birds originated from a theropod lineage more than 150 million years ago during the Jurassic and are the only extant descendants of dinosaurs (2, 3). The earliest diversification of extant birds (Neornithes) occurred during the Cretaceous period. However, the Neoaves, the most diverse avian clade, later underwent a rapid global expansion and radiation after a mass extinction event ~66 million years ago near the Cretaceous-Paleogene (K-Pg) boundary (4, 5). As a result, the extant avian lineages exhibit extremely diverse morphologies and rates of diversification. Given the nearly complete global inventory of avian species, and the immense collected amount of distributional and biological data, birds are widely used as models for investigating evolutionary and ecological questions (6, 7).
The chicken (Gallus gallus), zebra finch (Taeniopygia guttata), and pigeon (rock dove) (Columba livia) are also important model organisms in disciplines such as neuroscience and developmental biology (8). In addition, birds are widely used for global conservation priorities (9) and are culturally important to human societies. A number of avian species have been domesticated and are economically important. Farmed and wild water birds are key players in the global spread of pathogens, such as avian influenza virus (10). Despite the need to better understand avian genomics, annotated avian genomic data was previously available for only a few species: the domestic chicken, domestic turkey (Meleagris gallopavo) and zebra finch (11–13), together with a few others only published recently (14–16). To build an understanding of the genetic
ABSTRACT: The field of non-coding RNA biology has been hampered by the lack of availability of a comprehensive, up-to-date collection of accessioned RNA sequences. Here we present the first release of RNAcentral, a database that collates and integrates information from an international consortium of established RNA sequence databases. The initial release contains over 8.1 million sequences, including representatives of all major functional classes. A web portal (http://rnacentral.org) provides free access to data, search functionality, cross-references, source code and an integrated genome browser for selected species.
Nucleic Acids Research 10/2014; · 8.81 Impact Factor
ABSTRACT: Here we present the results of a large-scale bioinformatic annotation of
non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models
of hand-curated families from the Rfam database to infer conserved RNA families
within each avian genome. We supplement these annotations with predictions from
the tRNA annotation tool tRNAscan-SE and with microRNAs from miRBase. We show that
a number of lncRNA-associated loci are conserved between birds and mammals,
including several intriguing cases where the reported mammalian lncRNA function
is not conserved in birds. We also demonstrate extensive conservation of
classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g.,
snoRNAs and miRNAs) in birds. Furthermore, we describe numerous "losses" of
several RNA families, and attribute these to genuine loss, divergence or
missing data. In particular, we show that many of these losses are due to the
challenges associated with assembling avian microchromosomes. These combined
results illustrate the utility of applying homology-based methods for
annotating novel vertebrate genomes.
ABSTRACT: The development of RNA bioinformatic tools began more than 30 years ago with the description of the Nussinov and Zuker dynamic programming algorithms for single-sequence RNA secondary structure prediction. Since then, many tools have been developed for various RNA sequence analysis problems such as homology search, multiple sequence alignment, de novo RNA discovery, read-mapping, and many more. In this issue, we have collected a sampling of reviews and original research that demonstrate some of the many ways bioinformatics is integrated with current RNA biology research.
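As an illustration of the kind of dynamic programming this abstract refers to, here is a minimal sketch of a Nussinov-style base-pair maximization. It is not taken from the cited work: the function name `nussinov_max_pairs`, the `min_loop` parameter, and the choice to allow G-U wobble pairs are assumptions made for this example, and it only counts pairs rather than minimizing free energy as Zuker-style methods do.

```python
# Hypothetical helper for illustration; a simplified Nussinov-style recursion.
# Assumption: Watson-Crick pairs plus G-U wobble pairs are allowed.
ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def nussinov_max_pairs(seq, min_loop=0):
    """Return the maximum number of nested base pairs in an RNA sequence.

    min_loop is the minimum number of unpaired bases required between the
    two partners of a pair (an assumed parameter, 0 means no constraint).
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    # Fill the table by increasing span so smaller intervals are ready first.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: leave position i or position j unpaired.
            best = max(dp[i + 1][j], dp[i][j - 1])
            # Case 2: pair i with j, if the bases are compatible.
            if (seq[i], seq[j]) in ALLOWED_PAIRS:
                inner = dp[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            # Case 3: split the interval into two independent parts.
            for k in range(i + 1, j):
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAAUCC", min_loop=3))
```

The table is filled in O(n^3) time and O(n^2) space; recovering the structure itself would need an additional traceback step, which is omitted here to keep the sketch short.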
ABSTRACT: The Rfam database (available via the website at http://rfam.sanger.ac.uk and through our mirror at http://rfam.janelia.org) is a collection of non-coding RNA families, primarily RNAs with a conserved RNA secondary structure, including both RNA genes and mRNA cis-regulatory elements. Each family is represented by a multiple sequence alignment, predicted secondary structure and covariance model. Here we discuss updates to the database in the latest release, Rfam 11.0, including the introduction of genome-based alignments for large families, the introduction of the Rfam Biomart as well as other user interface improvements. Rfam is available under the Creative Commons Zero license.
Nucleic Acids Research 11/2012; · 8.81 Impact Factor
ABSTRACT: Homology detection is critical to genomics. Identifying homologous sequence
allows us to transfer information gathered in one organism to another quickly
and with a high degree of confidence. Non-coding RNA (ncRNA) presents a
challenge for homology detection, as the primary sequence is often poorly
conserved and de novo structure prediction remains difficult. This chapter
introduces methods developed by the Rfam database for identifying "families" of
homologous ncRNAs from single "seed" sequences using manually curated
alignments to build powerful statistical models known as covariance models
(CMs). We provide a brief overview of the state of alignment and secondary
structure prediction algorithms. This is followed by a step-by-step iterative
protocol for identifying homologs, then constructing an alignment and
corresponding CM. We also work through an example, building an alignment and CM
for the bacterial small RNA MicA, discovering a previously unreported family of
divergent MicA homologs in Xenorhabdus in the process. This chapter will
provide readers with the background necessary to begin defining their own ncRNA
families suitable for use in comparative, functional, and evolutionary studies
of structured RNA elements.
ABSTRACT: InterPro amalgamates predictive protein signatures from a number of well-known partner databases into a single resource. To aid with interpretation of results, InterPro entries are manually annotated with terms from the Gene Ontology (GO). The InterPro2GO mappings are comprised of the cross-references between these two resources and are the largest source of GO annotation predictions for proteins. Here, we describe the protocol by which InterPro curators integrate GO terms into the InterPro database. We discuss the unique challenges involved in integrating specific GO terms with entries that may describe a diverse set of proteins, and we illustrate, with examples, how InterPro hierarchies reflect GO terms of increasing specificity. We describe a revised protocol for GO mapping that enables us to assign GO terms to domains based on the function of the individual domain, rather than the function of the families in which the domain is found. We also discuss how taxonomic constraints are dealt with and those cases where we are unable to add any appropriate GO terms. Expert manual annotation of InterPro entries with GO terms enables users to infer function, process or subcellular information for uncharacterized sequences based on sequence matches to predictive models. Database URL: http://www.ebi.ac.uk/interpro. The complete InterPro2GO mappings are available at: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/interpro2go.
Database The Journal of Biological Databases and Curation 01/2012; 2012:bar068. · 4.46 Impact Factor
ABSTRACT: Curated databases are an integral part of the tool set that researchers use on a daily basis for their work. For most users, however, how databases are maintained, and by whom, is rather obscure. The International Society for Biocuration (ISB) represents biocurators, software engineers, developers and researchers with an interest in biocuration. Its goals include fostering communication between biocurators, promoting and describing their work, and highlighting the added value of biocuration to the world. The ISB recently conducted a survey of biocurators to better understand their educational and scientific backgrounds, their motivations for choosing a curatorial job and their career goals. The results are reported here. From the responses received, it is evident that biocuration is performed by highly trained scientists and perceived to be a stimulating career, offering both intellectual challenges and the satisfaction of performing work essential to the modern scientific community. It is also apparent that the ISB has at least a dual role to play to facilitate biocurators' work: (i) to promote biocuration as a career within the greater scientific community; (ii) to aid the development of resources for biomedical research through promotion of nomenclature and data-sharing standards that will allow interconnection of biological databases and better exploit the pivotal contributions that biocurators are making. Database URL: http://biocurator.org.
Database The Journal of Biological Databases and Curation 01/2012; 2012:bar059. · 4.46 Impact Factor
ABSTRACT: InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
Nucleic Acids Research 11/2011; 40(Database issue):D306-12. · 8.81 Impact Factor