GENCODE: producing a reference annotation for ENCODE

Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, UK.
Genome Biology (Impact Factor: 10.47). 02/2006; 7 Suppl 1(Suppl 1):S4.1-9. DOI: 10.1186/gb-2006-7-s1-s4
Source: PubMed

ABSTRACT The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results.
The GENCODE gene features are divided into eight different categories of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions.
In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of the GENCODE annotation with RefSeq and ENSEMBL shows that only 40% of GENCODE exons are contained within the two sets, which reflects the high number of alternative splice forms with unique exons annotated. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP, and the GENCODE collaboration is continuing to refine its annotation of 1% of the human genome with the aid of experimental validation.


Available from: Alexandre Reymond, Jun 28, 2015
  • Source
    ABSTRACT: Large-scale RNA sequencing has revealed a large number of long mRNA-like transcripts (lncRNAs) that do not code for proteins. The evolutionary history of these lncRNAs has been notoriously hard to study systematically due to their low level of sequence conservation that precludes comprehensive homology-based surveys and makes them nearly impossible to align. An increasing number of special cases, however, has been shown to be at least as old as the vertebrate lineage. Here we use the conservation of splice sites to trace the evolution of lncRNAs. We show that >85% of the human GENCODE lncRNAs were already present at the divergence of placental mammals and many hundreds of these RNAs date back even further. Nevertheless, we observe a fast turnover of intron/exon structures. We conclude that lncRNA genes are evolutionarily ancient components of vertebrate genomes that show an unexpected and unprecedented evolutionary plasticity. We offer a public web service (http://splicemap.bioinf.uni-leipzig.de) that allows users to retrieve sets of orthologous splice sites and to produce overview maps of evolutionarily conserved splice sites for visualization and further analysis. An electronic supplement containing the ncRNA data sets used in this study is available at http://www.bioinf.uni-leipzig.de/publications/supplements/12-001. © 2015 Nitsche et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
    RNA 03/2015; 21(5). DOI:10.1261/rna.046342.114 · 4.62 Impact Factor
  • Source
    ABSTRACT: Classically, gene prediction programs are based on detecting signals such as boundary sites (splice sites, starts, and stops) and coding regions in the DNA sequence in order to build potential exons and join them into a gene structure. Although nowadays it is possible to improve their performance with additional information from related species and/or cDNA databases, further improvement at any step could help to obtain better predictions. Here, we present WISCOD, a web-enabled tool for the identification of significant protein coding regions, a novel software tool that tackles the exon prediction problem in eukaryotic genomes. WISCOD has the capacity to detect real exons from large lists of potential exons, and it provides an easy-to-use global P value, called the expected probability of being a false exon (EPFE), that is useful for ranking potential exons in a probabilistic framework, without additional computational costs. The advantage of our approach is that it significantly increases the specificity and sensitivity (both between 80% and 90%) in comparison to other ab initio methods (where they are in the range of 70–75%). WISCOD is written in JAVA and R and is available to download and to run in a local mode on Linux and Windows platforms.
    BioMed Research International 09/2014; 2014(282343):10. DOI:10.1155/2014/282343 · 2.71 Impact Factor
  • Source
    ABSTRACT: The Cd247 gene encodes a transmembrane protein important for the expression and assembly of the TCR/CD3 complex on the surface of T lymphocytes. Down-regulation of CD247 has functional consequences in systemic autoimmunity and has been shown to be associated with Type 1 Diabetes in the NOD mouse. In this study, we have utilized the wealth of high-throughput sequencing data produced during the Encyclopedia of DNA Elements (ENCODE) project to identify spatially conserved regulatory elements within the Cd247 gene from human and mouse. We show the presence of two transcription factor binding sites, supported by histone marks and ChIP-seq data, that specifically have features of an enhancer and a promoter, respectively. We also identified a putative long non-coding RNA from the characteristically long first intron of the Cd247 gene. The long non-coding RNA annotation is supported by manual annotations from the GENCODE project in human and by our expression quantification analysis performed in NOD and B6 mice using qRT-PCR. Furthermore, 17 of the 23 SNPs already known to be implicated in T1D were observed within the long non-coding RNA region in mouse. The spatially conserved regulatory elements identified in this study have the potential to enrich our understanding of the role of the Cd247 gene in autoimmune diabetes.
    Gene 05/2014; DOI:10.1016/j.gene.2014.05.004 · 2.08 Impact Factor