ABSTRACT: Annotation on the reference genome of the C57BL6/J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principal focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we will outline how the C57BL6/J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species.
ABSTRACT: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation, and a wide range of tools are available to this end. McCarthy et al. recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based.
We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome.
The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
ABSTRACT: Homeobox genes are a group of genes coding for transcription factors with a DNA-binding helix-turn-helix structure called a homeodomain and which play a crucial role in pattern formation during embryogenesis. Many homeobox genes are located in clusters and some of these, most notably the HOX genes, are known to have antisense or opposite strand long non-coding RNA (lncRNA) genes that play a regulatory role. Because automated annotation of both gene clusters and non-coding genes is fraught with difficulty (over-prediction, under-prediction, inaccurate transcript structures), we set out to manually annotate all homeobox genes in the mouse and human genomes. This includes all supported splice variants, pseudogenes and both antisense and flanking lncRNAs. One of the areas where manual annotation has a significant advantage is the annotation of duplicated gene clusters. After comprehensive annotation of all homeobox genes and their antisense genes in human and in mouse, we found some discrepancies with the current gene set in RefSeq regarding exact gene structures and coding versus pseudogene locus biotype. We also identified previously un-annotated pseudogenes in the DUX, Rhox and Obox gene clusters, which helped us re-evaluate and update the gene nomenclature in these regions. We found that human homeobox genes are enriched in antisense lncRNA loci, some of which are known to play a role in gene or gene cluster regulation, compared to their mouse orthologues. Of the annotated set of 241 human protein-coding homeobox genes, 98 have an antisense locus (41%), while of the 277 orthologous mouse genes, only 62 protein-coding genes have an antisense locus (22%), based on publicly available transcriptional evidence.
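The percentages quoted in the abstract above can be checked directly from the counts it gives. The following is a minimal sketch of that arithmetic only; the helper name `percent` is introduced here for illustration and is not from the paper.

```python
# Counts taken from the abstract: homeobox genes with an antisense locus.
human_total, human_antisense = 241, 98
mouse_total, mouse_antisense = 277, 62


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


human_pct = percent(human_antisense, human_total)
mouse_pct = percent(mouse_antisense, mouse_total)

# One decimal place shows how the rounded figures in the text arise.
print(f"human: {human_pct:.1f}%  mouse: {mouse_pct:.1f}%")
```

Rounded to whole numbers, these values agree with the 41% and 22% stated in the abstract.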
ABSTRACT: Ensembl (http://www.ensembl.org) is a genomic interpretation system providing the most up-to-date annotations, querying tools and access methods for chordates
and key model organisms. This year we released updated annotation (gene models, comparative genomics, regulatory regions and
variation) on the new human assembly, GRCh38, although we continue to support researchers using the GRCh37.p13 assembly through
a dedicated site (http://grch37.ensembl.org). Our Regulatory Build has been revamped to identify regulatory regions of interest and to efficiently highlight their activity
across disparate epigenetic data sets. A number of new interfaces allow users to perform large-scale comparisons of their
data against our annotations. The REST server (http://rest.ensembl.org), which allows programs written in any language to query our databases, has moved to a full service alongside our upgraded
website tools. Our online Variant Effect Predictor tool has been updated to process more variants and calculate summary statistics.
Lastly, the WiggleTools package enables users to summarize large collections of data sets and view them as single tracks in
Ensembl. The Ensembl code base itself is more accessible: it is now hosted on our GitHub organization page (https://github.com/Ensembl) under an Apache 2.0 open source license.
Nucleic Acids Research 10/2014; 43(D1). DOI:10.1093/nar/gku1010 · 9.11 Impact Factor
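As an illustration of the REST access described in the abstract above, here is a minimal sketch of a client that looks up a gene by symbol. It assumes the `lookup/symbol` endpoint and the `content-type` query parameter of the Ensembl REST service; the function names `lookup_url` and `fetch_gene`, and the example symbol, are choices made for this sketch. Only the URL is built when the block runs; the network call is left to the caller.

```python
import json
import urllib.parse
import urllib.request

# Server address as given in the abstract.
BASE_URL = "http://rest.ensembl.org"


def lookup_url(symbol: str, species: str = "homo_sapiens",
               base: str = BASE_URL) -> str:
    """Build the URL for a gene lookup by symbol (assumed endpoint layout)."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    # Ask for JSON via a query parameter rather than a request header.
    query = urllib.parse.urlencode({"content-type": "application/json"})
    return f"{base}{path}?{query}"


def fetch_gene(symbol: str, species: str = "homo_sapiens",
               timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON record for a gene (requires network access)."""
    with urllib.request.urlopen(lookup_url(symbol, species),
                                timeout=timeout) as resp:
        return json.load(resp)


# Building the URL needs no network access.
print(lookup_url("BRCA2"))
```

Calling `fetch_gene("BRCA2")` would then return the decoded record as a dictionary, provided the service is reachable and the endpoint behaves as assumed here.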
ABSTRACT: Historically pseudogenes were believed to represent nonfunctional genomic fossils; however, there is emerging evidence that many of them could be biologically active. This possibility has ignited interest in pseudogene loci and made the need for their high-quality annotation more pressing as an accurate knowledge of all pseudogenes in the human reference genome sequence facilitates confident functional analysis. GENCODE have undertaken the first genome-wide pseudogene assignment for protein-coding genes combining both large-scale manual annotation and computational pseudogene prediction pipelines. Multiple computational predictions provide an unbiased set of hints for manual annotators to investigate, both during first-pass annotation and as part of QC to identify any potential missing pseudogene loci. Where a pseudogene is identified, the extent of its homology to the parent locus is fully investigated by a manual annotator; a pseudogene model is built and assigned to one of eight pseudogene biotypes depending on the mechanism of creation and on the presence of locus-specific transcriptional or proteomic data. The high-quality, information-rich set of pseudogenes created has been integrated with ENCODE functional genomics data, specifically expression level, transcription factor and RNA polymerase II binding, and chromatin marks. In this way we have been able to identify some pseudogenes that possess conventional characteristics of functionality as well as others with interesting patterns of partial activity, which might suggest that putatively inactive loci could be gaining a novel function, for example as long noncoding RNAs. The activity data associated with every pseudogene is stored in the psiDR resource.
ABSTRACT: Pseudogenes are degraded fossil copies of genes. Here, we report a comparison of pseudogenes spanning three phyla, leveraging the completed annotations of the human, worm, and fly genomes, which we make available as an online resource. We find that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodeling processes marking each organism's genome evolution. The majority of human pseudogenes are processed, resulting from a retrotranspositional burst at the dawn of the primate lineage. This burst can be seen in the largely uniform distribution of pseudogenes across the genome, their preservation in areas with low recombination rates, and their preponderance in highly expressed gene families. In contrast, worm and fly pseudogenes tell a story of numerous duplication events. In worm, these duplications have been preserved through selective sweeps, so we see a large number of pseudogenes associated with highly duplicated families such as chemoreceptors. However, in fly, the large effective population size and high deletion rate resulted in a depletion of the pseudogene complement. Despite large variations between these species, we also find notable similarities. Overall, we identify a broad spectrum of biochemical activity for pseudogenes, with the majority in each organism exhibiting varying degrees of partial activity. In particular, we identify a consistent amount of transcription (∼15%) across all species, suggesting a uniform degradation process. Also, we see a uniform decay of pseudogene promoter activity relative to their coding counterparts and identify a number of pseudogenes with conserved upstream sequences and activity, hinting at potential regulatory roles.
Proceedings of the National Academy of Sciences 08/2014; 111(37). DOI:10.1073/pnas.1407293111 · 9.67 Impact Factor
ABSTRACT: Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein coding potential is the detection of cellular protein expression through peptide mass spectrometry experiments. Here we mapped peptides detected in 7 large-scale proteomics studies to almost 60% of the protein coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for more than 96% of genes that evolved before bilateria. At the opposite end of the scale we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2,001 potentially non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes, and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
Human Molecular Genetics 06/2014; 23(22). DOI:10.1093/hmg/ddu309 · 6.39 Impact Factor
ABSTRACT: Background / Purpose:
Improving the manual annotation of the GENCODE geneset using RNAseq, CAGE and polyAseq data.
GENCODE now has more complete gene models with confident support for the transcription start site and the polyA tail(s). We are now supporting the new H38 gene build.
ABSTRACT: The genetic contribution to the variation in human lifespan is approximately 25%. Despite the large number of identified disease-susceptibility loci, it is not known which loci influence population mortality. We performed a genome-wide association meta-analysis of 7729 long-lived individuals of European descent (≥ 85 years) and 16121 younger controls (< 65 years) followed by replication in an additional set of 13060 long-lived individuals and 61156 controls. In addition, we performed a subset analysis in cases ≥ 90 years. We observed genome-wide significant association with longevity, as reflected by survival to ages beyond 90 years, at a novel locus, rs2149954, on chromosome 5q33.3 (OR = 1.10, P = 1.74 × 10^-8). We also confirmed association of rs4420638 on chromosome 19q13.32 (OR = 0.72, P = 3.40 × 10^-36), representing the TOMM40/APOE/APOC1 locus. In a prospective meta-analysis (n = 34103) the minor allele of rs2149954 (T) on chromosome 5q33.3 associates with increased survival (HR = 0.95, P = 0.003). This allele has previously been reported to associate with low blood pressure in middle age. Interestingly, the minor allele (T) associates with decreased cardiovascular mortality risk, independent of blood pressure. We report on the first GWAS-identified longevity locus on chromosome 5q33.3 influencing survival in the general European population. The minor allele of this locus associates with low blood pressure in middle age, although the contribution of this allele to survival may be less dependent on blood pressure. Hence, the pleiotropic mechanisms by which this intragenic variation contributes to lifespan regulation have to be elucidated.
Human Molecular Genetics 03/2014; 23(16). DOI:10.1093/hmg/ddu139 · 6.39 Impact Factor
ABSTRACT: The determination of whether a gene is protein coding or not is a key goal of
genome annotation projects. Peptide mass spectrometry is a powerful tool for
detecting cellular expression of proteins and therefore is an attractive
approach for verifying the protein coding potential of genes. However, a range
of technical difficulties limit the coverage from proteomics experiments, with
the highest recorded coverage of the human proteome being approximately 50% of
human genes. Here we map the peptides detected in 7 large-scale proteomics
studies to the GENCODE v12 annotation of the human genome and identify almost
60% of the protein coding genes. We find that there are surprisingly strong
correlations between peptide detection and cross-species conservation, gene age
and the presence of protein-like features. The age of the gene and its
conservation across vertebrate species are key indicators of whether a peptide
will be detected in proteomics experiments. We find peptides for most highly
conserved genes and for practically all genes that evolved before bilateria. At
the same time there is little or no evidence for protein expression for genes
that have appeared since primates or that do not have any protein-like features
or conservation. Based on our results we describe a set of 2,001 genes that
have no protein features and poor conservation, or have ambiguous annotations
in gene or protein databases. We suggest that many of the genes that lack
supporting evidence and that are not detected in proteomics experiments, do not
code for proteins under normal circumstances and that they should not be
included in the human protein coding gene catalogue.
ABSTRACT: Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate
model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded
our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our
core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been
extended with additional support for comparative genomics and ontology information. Finally, we provide updated information
about our methods for data access and resources for user training.
ABSTRACT: The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC).
ABSTRACT: The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
Nucleic Acids Research 11/2013; 42(D1). DOI:10.1093/nar/gkt1059 · 9.11 Impact Factor
ABSTRACT: We evaluated 25 protocol variants of 14 independent computational methods for exon identification, transcript reconstruction and expression-level quantification from RNA-seq data. Our results show that most algorithms are able to identify discrete transcript components with high success rates but that assembly of complete isoform structures poses a major challenge even when all constituent elements are identified. Expression-level estimates also varied widely across methods, even when based on similar transcript models. Consequently, the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination that are likely to remain limiting factors for the analysis of current-generation RNA-seq data.
ABSTRACT: High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.
ABSTRACT: The last decade has seen tremendous effort committed to the annotation of the human genome sequence, most notably perhaps in the form of the ENCODE project. One of the major findings of ENCODE, and other genome analysis projects, is that the human transcriptome is far larger and more complex than previously thought. This complexity manifests, for example, as alternative splicing within protein-coding genes, as well as in the discovery of thousands of long noncoding RNAs. It is also possible that significant numbers of human transcripts have not yet been described by annotation projects, while existing transcript models are frequently incomplete. The question as to what proportion of this complexity is truly functional remains open, however, and this ambiguity presents a serious challenge to genome scientists. In this article, we will discuss the current state of human transcriptome annotation, drawing on our experience gained in generating the GENCODE gene annotation set. We highlight the gaps in our knowledge of transcript functionality that remain, and consider the potential computational and experimental strategies that can be used to help close them. We propose that an understanding of the true overlap between transcriptional complexity and functionality will not be gained in the short term. However, significant steps toward obtaining this knowledge can now be taken by using an integrated strategy, combining all of the experimental resources at our disposal.
Genome Research 10/2013; 23(12). DOI:10.1101/gr.161315.113 · 14.63 Impact Factor
ABSTRACT: RNA sequencing has opened new avenues for the study of transcriptome composition. Significant evidence has accumulated showing that the human transcriptome contains in excess of a hundred thousand different transcripts. However, it is still not clear to what extent this diversity prevails when considering the relative abundances of different transcripts from the same gene.
Here we show that, in a given condition, most protein coding genes have one major transcript expressed at significantly higher level than others, that in human tissues the major transcripts contribute almost 85 percent to the total mRNA from protein coding loci, and that often the same major transcript is expressed in many tissues. We detect a high degree of overlap between the set of major transcripts and a recently published set of alternatively spliced transcripts that are predicted to be translated utilizing proteomic data. Thus, we hypothesize that although some minor transcripts may play a functional role, the major ones are likely to be the main contributors to the proteome. However, we still detect a non-negligible fraction of protein coding genes for which the major transcript does not code a protein.
Overall, our findings suggest that the transcriptome from protein coding loci is dominated by one transcript per gene and that not all the transcripts that contribute to transcriptome diversity are equally likely to contribute to protein diversity. This observation can help to prioritize candidate targets in proteomics research and to predict the functional impact of the detected changes in variation studies.
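The notion of a "major transcript" in the abstract above can be made concrete with a small sketch: for each gene, pick the transcript with the highest abundance and compute its share of the gene's total, then the share of all expression accounted for by major transcripts. The gene and transcript names and the abundance values below are hypothetical, and this is not the authors' pipeline, only one simple way to compute such shares.

```python
from collections import defaultdict

# Hypothetical transcript abundances, keyed by (gene, transcript).
expression = {
    ("GENE_A", "tx1"): 90.0,
    ("GENE_A", "tx2"): 10.0,
    ("GENE_B", "tx1"): 30.0,
    ("GENE_B", "tx2"): 20.0,
}


def major_transcripts(expr):
    """Map each gene to (major transcript, its fraction of the gene total)."""
    by_gene = defaultdict(dict)
    for (gene, transcript), value in expr.items():
        by_gene[gene][transcript] = value

    result = {}
    for gene, transcripts in by_gene.items():
        total = sum(transcripts.values())
        if total == 0:
            continue  # gene not expressed; no major transcript defined
        top = max(transcripts, key=transcripts.get)
        result[gene] = (top, transcripts[top] / total)
    return result


majors = major_transcripts(expression)

# Share of all expression that comes from the major transcripts.
major_sum = sum(expression[(gene, tx)] for gene, (tx, _) in majors.items())
overall_share = major_sum / sum(expression.values())

for gene, (tx, share) in sorted(majors.items()):
    print(f"{gene}: major transcript {tx} ({share:.0%} of gene total)")
print(f"major transcripts overall: {overall_share:.0%}")
```

On real data the same calculation would be run per tissue or condition, since the abstract reports dominance "in a given condition".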
ABSTRACT: Background:
The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems.
The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes. A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated evolution as compared to 4.1% across the entire genome.
This extensive annotation dramatically extends the genome-based knowledge of the molecular genetics and structure of a major portion of the porcine immunome. Our complementary functional approach using co-expression during immune response has provided new putative immune response annotation for over 500 porcine genes. Our phylogenetic analysis of this core immunome cluster confirms rapid evolutionary change in this set of genes, and that, as in other species, such genes are important components of the pig's adaptation to pathogen challenge over evolutionary time. These comprehensive and integrated analyses increase the value of the porcine genome sequence and provide important tools for global analyses and data-mining of the porcine immune response.