ABSTRACT: The Ensembl project (http://www.ensembl.org) is a system for genome annotation, analysis, storage and dissemination designed to facilitate access to genomic annotation
from chordates and key model organisms. It provides access to data from 87 species across our main and early access Pre! websites.
This year we introduced three newly annotated species and released numerous updates across our supported species with a concentration
on data for the latest genome assemblies of human, mouse, zebrafish and rat. We also provided two data updates for the previous
human assembly, GRCh37, through a dedicated website (http://grch37.ensembl.org). Our tools, in particular the VEP, have been improved significantly through integration of additional third party data.
REST is now capable of larger-scale analysis and our regulatory data BioMart can deliver faster results. The website is now
capable of displaying long-range interactions such as those found in cis-regulated datasets. Finally, we have launched a website optimized for mobile devices providing views of genes, variants and
phenotypes. Our data is made available without restriction and all code is available from our GitHub organization site (http://github.com/Ensembl) under an Apache 2.0 license.
Article · Dec 2015 · Nucleic Acids Research
ABSTRACT: Homeobox genes are a group of genes coding for transcription factors with a DNA-binding helix-turn-helix structure called a homeodomain and which play a crucial role in pattern formation during embryogenesis. Many homeobox genes are located in clusters and some of these, most notably the HOX genes, are known to have antisense or opposite strand long non-coding RNA (lncRNA) genes that play a regulatory role. Because automated annotation of both gene clusters and non-coding genes is fraught with difficulty (over-prediction, under-prediction, inaccurate transcript structures), we set out to manually annotate all homeobox genes in the mouse and human genomes. This includes all supported splice variants, pseudogenes and both antisense and flanking lncRNAs. One of the areas where manual annotation has a significant advantage is the annotation of duplicated gene clusters. After comprehensive annotation of all homeobox genes and their antisense genes in human and in mouse, we found some discrepancies with the current gene set in RefSeq regarding exact gene structures and coding versus pseudogene locus biotype. We also identified previously un-annotated pseudogenes in the DUX, Rhox and Obox gene clusters, which helped us re-evaluate and update the gene nomenclature in these regions. We found that human homeobox genes are enriched in antisense lncRNA loci, some of which are known to play a role in gene or gene cluster regulation, compared to their mouse orthologues. Of the annotated set of 241 human protein-coding homeobox genes, 98 have an antisense locus (41%), while of the 277 orthologous mouse genes, only 62 protein-coding genes have an antisense locus (22%), based on publicly available transcriptional evidence.
Article · Sep 2015 · Database: The Journal of Biological Databases and Curation
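The antisense-locus percentages quoted in the abstract above follow directly from the stated counts. A minimal sketch of that arithmetic, where the function name `percent_with_antisense` is only an illustrative label and not something from the paper:

```python
def percent_with_antisense(with_antisense: int, total: int) -> int:
    """Return the share of genes with an antisense locus, rounded to a whole percent."""
    return round(100 * with_antisense / total)


# Counts as stated in the abstract.
human = percent_with_antisense(98, 241)   # 98 of 241 human protein-coding homeobox genes
mouse = percent_with_antisense(62, 277)   # 62 of 277 orthologous mouse genes

print(f"human: {human}%  mouse: {mouse}%")
```

The rounded values agree with the 41% and 22% reported in the abstract.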
ABSTRACT: Annotation on the reference genome of the C57BL6/J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principal focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we will outline how the C57BL6/J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species.
Electronic supplementary material
The online version of this article (doi:10.1007/s00335-015-9583-x) contains supplementary material, which is available to authorized users.
ABSTRACT: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al. recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based.
We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome.
The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
ABSTRACT: Ensembl (http://www.ensembl.org) is a genomic interpretation system providing the most up-to-date annotations, querying tools and access methods for chordates
and key model organisms. This year we released updated annotation (gene models, comparative genomics, regulatory regions and
variation) on the new human assembly, GRCh38, although we continue to support researchers using the GRCh37.p13 assembly through
a dedicated site (http://grch37.ensembl.org). Our Regulatory Build has been revamped to identify regulatory regions of interest and to efficiently highlight their activity
across disparate epigenetic data sets. A number of new interfaces allow users to perform large-scale comparisons of their
data against our annotations. The REST server (http://rest.ensembl.org), which allows programs written in any language to query our databases, has moved to a full service alongside our upgraded
website tools. Our online Variant Effect Predictor tool has been updated to process more variants and calculate summary statistics.
Lastly, the WiggleTools package enables users to summarize large collections of data sets and view them as single tracks in
Ensembl. The Ensembl code base itself is more accessible: it is now hosted on our GitHub organization page (https://github.com/Ensembl) under an Apache 2.0 open source license.
Article · Oct 2014 · Nucleic Acids Research
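The abstract above notes that the REST server can be queried by programs written in any language. As a rough illustration only, the sketch below uses the Python standard library to look up a gene by symbol. It assumes the `lookup/symbol/:species/:symbol` endpoint and JSON content negotiation through a `Content-Type` header; the helper names `build_lookup_url` and `fetch_gene` and the example gene are hypothetical choices, not taken from the paper.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the HTTPS form of the server named in the abstract.
REST_SERVER = "https://rest.ensembl.org"


def build_lookup_url(species: str, symbol: str, server: str = REST_SERVER) -> str:
    """Build the URL for a gene lookup by symbol (hypothetical helper)."""
    species_part = urllib.parse.quote(species, safe="")
    symbol_part = urllib.parse.quote(symbol, safe="")
    return f"{server}/lookup/symbol/{species_part}/{symbol_part}"


def fetch_gene(species: str, symbol: str) -> dict:
    """Request the lookup record as JSON; needs network access."""
    request = urllib.request.Request(
        build_lookup_url(species, symbol),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Example call; the fields returned depend on the current Ensembl release.
    record = fetch_gene("homo_sapiens", "BRCA2")
    print(record.get("id"), record.get("seq_region_name"))
```

Keeping URL construction separate from the network call makes the request easy to inspect or reuse with another HTTP client.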
ABSTRACT: Historically pseudogenes were believed to represent nonfunctional genomic fossils; however, there is emerging evidence that many of them could be biologically active. This possibility has ignited interest in pseudogene loci and made the need for their high-quality annotation more pressing as an accurate knowledge of all pseudogenes in the human reference genome sequence facilitates confident functional analysis. GENCODE have undertaken the first genome-wide pseudogene assignment for protein-coding genes combining both large-scale manual annotation and computational pseudogene prediction pipelines. Multiple computational predictions provide an unbiased set of hints for manual annotators to investigate, both during first-pass annotation and as part of QC to identify any potential missing pseudogene loci. Where a pseudogene is identified, the extent of its homology to the parent locus is fully investigated by a manual annotator; a pseudogene model is built and assigned to one of eight pseudogene biotypes depending on the mechanism of creation and on the presence of locus-specific transcriptional or proteomic data. The high-quality, information-rich set of pseudogenes created has been integrated with ENCODE functional genomics data, specifically expression level, transcription factor and RNA polymerase II binding, and chromatin marks. In this way we have been able to identify some pseudogenes that possess conventional characteristics of functionality as well as others with interesting patterns of partial activity, which might suggest that putatively inactive loci could be gaining a novel function, for example as long noncoding RNAs. The activity data associated with every pseudogene is stored in the psiDR resource.
Article · Oct 2014 · Methods in molecular biology (Clifton, N.J.)
ABSTRACT: Pseudogenes are degraded fossil copies of genes. Here, we report a comparison of pseudogenes spanning three phyla, leveraging the completed annotations of the human, worm, and fly genomes, which we make available as an online resource. We find that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodeling processes marking each organism's genome evolution. The majority of human pseudogenes are processed, resulting from a retrotranspositional burst at the dawn of the primate lineage. This burst can be seen in the largely uniform distribution of pseudogenes across the genome, their preservation in areas with low recombination rates, and their preponderance in highly expressed gene families. In contrast, worm and fly pseudogenes tell a story of numerous duplication events. In worm, these duplications have been preserved through selective sweeps, so we see a large number of pseudogenes associated with highly duplicated families such as chemoreceptors. However, in fly, the large effective population size and high deletion rate resulted in a depletion of the pseudogene complement. Despite large variations between these species, we also find notable similarities. Overall, we identify a broad spectrum of biochemical activity for pseudogenes, with the majority in each organism exhibiting varying degrees of partial activity. In particular, we identify a consistent amount of transcription (∼15%) across all species, suggesting a uniform degradation process. Also, we see a uniform decay of pseudogene promoter activity relative to their coding counterparts and identify a number of pseudogenes with conserved upstream sequences and activity, hinting at potential regulatory roles.
Article · Aug 2014 · Proceedings of the National Academy of Sciences
ABSTRACT: Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein coding potential is the detection of cellular protein expression through peptide mass spectrometry experiments. Here we mapped peptides detected in 7 large-scale proteomics studies to almost 60% of the protein coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for more than 96% of genes that evolved before bilateria. At the opposite end of the scale we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2,001 potentially non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes, and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
Article · Jun 2014 · Human Molecular Genetics
ABSTRACT: Background / Purpose:
Improving the manual annotation of the GENCODE geneset using RNAseq, CAGE and polyAseq data.
GENCODE now has more complete gene models with confident support for the transcription start site and the polyA tail(s). We are now supporting the new H38 gene build.
ABSTRACT: The genetic contribution to the variation in human lifespan is approximately 25%. Despite the large number of identified disease-susceptibility loci, it is not known which loci influence population mortality. We performed a genome-wide association meta-analysis of 7729 long-lived individuals of European descent (≥ 85 years) and 16121 younger controls (< 65 years) followed by replication in an additional set of 13060 long-lived individuals and 61156 controls. In addition, we performed a subset analysis in cases ≥ 90 years. We observed genome-wide significant association with longevity, as reflected by survival to ages beyond 90 years, at a novel locus, rs2149954, on chromosome 5q33.3 (OR = 1.10, P = 1.74 × 10⁻⁸). We also confirmed association of rs4420638 on chromosome 19q13.32 (OR = 0.72, P = 3.40 × 10⁻³⁶), representing the TOMM40/APOE/APOC1 locus. In a prospective meta-analysis (n = 34103) the minor allele of rs2149954 (T) on chromosome 5q33.3 associates with increased survival (HR = 0.95, P = 0.003). This allele has previously been reported to associate with low blood pressure in middle age. Interestingly, the minor allele (T) associates with decreased cardiovascular mortality risk, independent of blood pressure. We report on the first GWAS-identified longevity locus on chromosome 5q33.3 influencing survival in the general European population. The minor allele of this locus associates with low blood pressure in middle age, although the contribution of this allele to survival may be less dependent on blood pressure. Hence, the pleiotropic mechanisms by which this intragenic variation contributes to lifespan regulation have to be elucidated.
Article · Mar 2014 · Human Molecular Genetics
ABSTRACT: The determination of whether a gene is protein coding or not is a key goal of
genome annotation projects. Peptide mass spectrometry is a powerful tool for
detecting cellular expression of proteins and therefore is an attractive
approach for verifying the protein coding potential of genes. However, a range
of technical difficulties limit the coverage from proteomics experiments, with
the highest recorded coverage of the human proteome being approximately 50% of
human genes. Here we map the peptides detected in 7 large-scale proteomics
studies to the GENCODE v12 annotation of the human genome and identify almost
60% of the protein coding genes. We find that there are surprisingly strong
correlations between peptide detection and cross-species conservation, gene age
and the presence of protein-like features. The age of the gene and its
conservation across vertebrate species are key indicators of whether a peptide
will be detected in proteomics experiments. We find peptides for most highly
conserved genes and for practically all genes that evolved before bilateria. At
the same time there is little or no evidence for protein expression for genes
that have appeared since primates or that do not have any protein-like features
or conservation. Based on our results we describe a set of 2,001 genes that
have no protein features and poor conservation, or have ambiguous annotations
in gene or protein databases. We suggest that many of the genes that lack
supporting evidence and that are not detected in proteomics experiments do not
code for proteins under normal circumstances and that they should not be
included in the human protein coding gene catalogue.
ABSTRACT: Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate
model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded
our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our
core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been
extended with additional support for comparative genomics and ontology information. Finally, we provide updated information
about our methods for data access and resources for user training.
Article · Dec 2013 · Nucleic Acids Research
ABSTRACT: The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five
reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to
easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility
complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA)
update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly.
The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and
display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated
updated patches from the Genome Reference Consortium (GRC).
Article · Dec 2013 · Nucleic Acids Research
ABSTRACT: The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
Article · Nov 2013 · Nucleic Acids Research