ABSTRACT: Here we present the results of a large-scale bioinformatics annotation of non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models of hand-curated families from the Rfam database to infer conserved RNA families within each avian genome. We supplement these annotations with predictions from the tRNA annotation tool, tRNAscan-SE, and microRNAs from miRBase. We identify 34 lncRNA-associated loci that are conserved between birds and mammals and validate 12 of these in chicken. We report several intriguing cases where a reported mammalian lncRNA, but not its function, is conserved. We also demonstrate extensive conservation of classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g., snoRNAs and miRNAs) in birds. Furthermore, we describe numerous "losses" of several RNA families, and attribute these to either genuine loss, divergence or missing data. In particular, we show that many of these losses are due to the challenges associated with assembling avian microchromosomes. These combined results illustrate the utility of applying homology-based methods for annotating novel vertebrate genomes.
PLoS ONE 03/2015; 10(3):e0121797. DOI:10.1371/journal.pone.0121797 · 3.23 Impact Factor
ABSTRACT: The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families.
ABSTRACT: The primary task of the Rfam database is to collate experimentally validated noncoding RNA (ncRNA) sequences from the published literature and facilitate the prediction and annotation of new homologues in novel nucleotide sequences. We group homologous ncRNA sequences into "families" and related families are further grouped into "clans." We collate and manually curate data cross-references for these families from other databases and external resources. Our Web site offers researchers a simple interface to Rfam and provides tools with which to annotate their own sequences using our covariance models (CMs), through our tools for searching, browsing, and downloading information on Rfam families. In this chapter, we will work through examples of annotating a query sequence, collating family information, and searching for data.
ABSTRACT: The field of non-coding RNA biology has been hampered by the lack of availability of a comprehensive, up-to-date collection of accessioned RNA sequences. Here we present the first release of RNAcentral, a database that collates and integrates information from an international consortium of established RNA sequence databases. The initial release contains over 8.1 million sequences, including representatives of all major functional classes. A web portal (http://rnacentral.org) provides free access to data, search functionality, cross-references, source code and an integrated genome browser for selected
Nucleic Acids Research 10/2014; 43. DOI:10.1093/nar/gku991 · 9.11 Impact Factor
ABSTRACT: Here we present the results of a large-scale bioinformatic annotation of non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models of hand-curated families from the Rfam database to infer conserved RNA families within each avian genome. We supplement these annotations with predictions from the tRNA annotation tool, tRNAscan-SE, and microRNAs from miRBase. We show that a number of lncRNA-associated loci are conserved between birds and mammals, including several intriguing cases where the reported mammalian lncRNA function is not conserved in birds. We also demonstrate extensive conservation of classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g., snoRNAs and miRNAs) in birds. Furthermore, we describe numerous "losses" of several RNA families, and attribute these to genuine loss, divergence or missing data. In particular, we show that many of these losses are due to the challenges associated with assembling avian microchromosomes. These combined results illustrate the utility of applying homology-based methods for annotating novel vertebrate genomes.
ABSTRACT: The development of RNA bioinformatic tools began more than 30 years ago with the description of the Nussinov and Zuker dynamic programming algorithms for single sequence RNA secondary structure prediction. Since then, many tools have been developed for various RNA sequence analysis problems such as homology search, multiple sequence alignment, de novo RNA discovery, read-mapping, and many more. In this issue, we have collected a sampling of reviews and original research that demonstrate some of the many ways bioinformatics is integrated with current RNA biology research.
ABSTRACT: The Rfam database (available via the website at http://rfam.sanger.ac.uk and through our mirror at http://rfam.janelia.org) is a collection of non-coding RNA families, primarily RNAs with a conserved RNA secondary structure, including both RNA genes and mRNA cis-regulatory elements. Each family is represented by a multiple sequence alignment, predicted secondary structure and covariance model. Here we discuss updates to the database in the latest release, Rfam 11.0, including the introduction of genome-based alignments for large families, the introduction of the Rfam Biomart as well as other user interface improvements. Rfam is available under the Creative Commons Zero license.
ABSTRACT: Homology detection is critical to genomics. Identifying homologous sequence allows us to transfer information gathered in one organism to another quickly and with a high degree of confidence. Non-coding RNA (ncRNA) presents a challenge for homology detection, as the primary sequence is often poorly conserved and de novo structure prediction remains difficult. This chapter introduces methods developed by the Rfam database for identifying "families" of homologous ncRNAs from single "seed" sequences using manually curated alignments to build powerful statistical models known as covariance models (CMs). We provide a brief overview of the state of alignment and secondary structure prediction algorithms. This is followed by a step-by-step iterative protocol for identifying homologs, then constructing an alignment and corresponding CM. We also work through an example, building an alignment and CM for the bacterial small RNA MicA, discovering a previously unreported family of divergent MicA homologs in Xenorhabdus in the process. This chapter will provide readers with the background necessary to begin defining their own ncRNA families suitable for use in comparative, functional, and evolutionary studies of structured RNA elements.
ABSTRACT: Curated databases are an integral part of the tool set that researchers use on a daily basis for their work. For most users, however, how databases are maintained, and by whom, is rather obscure. The International Society for Biocuration (ISB) represents biocurators, software engineers, developers and researchers with an interest in biocuration. Its goals include fostering communication between biocurators, promoting and describing their work, and highlighting the added value of biocuration to the world. The ISB recently conducted a survey of biocurators to better understand their educational and scientific backgrounds, their motivations for choosing a curatorial job and their career goals. The results are reported here. From the responses received, it is evident that biocuration is performed by highly trained scientists and perceived to be a stimulating career, offering both intellectual challenges and the satisfaction of performing work essential to the modern scientific community. It is also apparent that the ISB has at least a dual role to play to facilitate biocurators’ work: (i) to promote biocuration as a career within the greater scientific community; (ii) to aid the development of resources for biomedical research through promotion of nomenclature and data-sharing standards that will allow interconnection of biological databases and better exploit the pivotal contributions that biocurators are making.
Database The Journal of Biological Databases and Curation 02/2012; 2012:bar059. DOI:10.1093/database/bar059 · 3.37 Impact Factor
ABSTRACT: InterPro amalgamates predictive protein signatures from a number of well-known partner databases into a single resource. To aid with interpretation of results, InterPro entries are manually annotated with terms from the Gene Ontology (GO). The InterPro2GO mappings are comprised of the cross-references between these two resources and are the largest source of GO annotation predictions for proteins. Here, we describe the protocol by which InterPro curators integrate GO terms into the InterPro database. We discuss the unique challenges involved in integrating specific GO terms with entries that may describe a diverse set of proteins, and we illustrate, with examples, how InterPro hierarchies reflect GO terms of increasing specificity. We describe a revised protocol for GO mapping that enables us to assign GO terms to domains based on the function of the individual domain, rather than the function of the families in which the domain is found. We also discuss how taxonomic constraints are dealt with and those cases where we are unable to add any appropriate GO terms. Expert manual annotation of InterPro entries with GO terms enables users to infer function, process or subcellular information for uncharacterized sequences based on sequence matches to predictive models.
Database URL: http://www.ebi.ac.uk/interpro. The complete InterPro2GO mappings are available at: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/interpro2go
Database The Journal of Biological Databases and Curation 01/2012; 2012:bar068. DOI:10.1093/database/bar068 · 3.37 Impact Factor
ABSTRACT: InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.