Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 7, S4

Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, UK.
Genome Biology (Impact Factor: 10.81). 02/2006; 7(Suppl 1):S4.1-9. DOI: 10.1186/gb-2006-7-s1-s4


The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results.
The GENCODE gene features are divided into eight different categories of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions.
In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of the GENCODE annotation with RefSeq and ENSEMBL shows that only 40% of GENCODE exons are contained within the two sets, which reflects the high number of alternative splice forms with unique exons annotated. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP, and the GENCODE collaboration is continuing to refine its annotation of 1% of the human genome with the aid of experimental validation.
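The proportions implied by the counts in the abstract can be checked with a few lines of arithmetic. The sketch below is only illustrative: the variable names and the `percent` helper are hypothetical, and the only inputs are the counts quoted above (487 annotated loci, 434 coding; 1,215 predicted exon pairs tested, 14 validated; 31 novel plus 15 putative transcripts).

```python
# Counts taken directly from the abstract; names are illustrative only.
TOTAL_LOCI = 487            # loci in the GENCODE reference set
CODING_LOCI = 434           # of which coding
EXON_PAIRS_TESTED = 1215    # predicted exon pairs outside the annotation
EXON_PAIRS_VALIDATED = 14   # novel exon pairs that were validated
NOVEL_TRANSCRIPTS = 31
PUTATIVE_TRANSCRIPTS = 15


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


coding_share = percent(CODING_LOCI, TOTAL_LOCI)
noncoding_loci = TOTAL_LOCI - CODING_LOCI
novel_exon_rate = percent(EXON_PAIRS_VALIDATED, EXON_PAIRS_TESTED)
validated_noncoding = NOVEL_TRANSCRIPTS + PUTATIVE_TRANSCRIPTS

print(f"coding share of annotated loci: {coding_share}%")
print(f"non-coding loci: {noncoding_loci}")
print(f"validated predictions outside the annotation: {novel_exon_rate}%")
print(f"validated loci without coding evidence: {validated_noncoding}")
```

The low validation rate for predictions outside the annotation is what the abstract uses as evidence that the annotation is close to comprehensive in the tested regions.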

    • "We used Tophat (Langmead et al., 2009) to map the reads to the human genome (hg19) and the GencodeV7 (Harrow et al., 2006) transcriptome annotation. The mapped reads in BAM file format were converted to SAM format, using SAMtools (Li et al., 2009), and sorted. "
    ABSTRACT: Autism spectrum disorder (ASD) is a disorder of brain development. Most cases lack a clear etiology or genetic basis, and the difficulty of re-enacting human brain development has precluded understanding of ASD pathophysiology. Here we use three-dimensional neural cultures (organoids) derived from induced pluripotent stem cells (iPSCs) to investigate neurodevelopmental alterations in individuals with severe idiopathic ASD. While no known underlying genomic mutation could be identified, transcriptome and gene network analyses revealed upregulation of genes involved in cell proliferation, neuronal differentiation, and synaptic assembly. ASD-derived organoids exhibit an accelerated cell cycle and overproduction of GABAergic inhibitory neurons. Using RNA interference, we show that overexpression of the transcription factor FOXG1 is responsible for the overproduction of GABAergic neurons. Altered expression of gene network modules and FOXG1 are positively correlated with symptom severity. Our data suggest that a shift toward GABAergic neuron fate caused by FOXG1 is a developmental precursor of ASD. Copyright © 2015 Elsevier Inc. All rights reserved.
    Cell 07/2015; 162(2):375-390. DOI:10.1016/j.cell.2015.06.034 · 32.24 Impact Factor
    • "(3) In order to investigate the conservation of zebrafish lncRNAs we use the eight-way zebrafish MULTIZ alignment (containing five teleosts, frog, mouse, and human) since the 46-way vertebrate alignment contains only sequences that are alignable to the human genome. As basis set for transcripts we use a recent RefSeq track (10/2012, 40,373 transcripts) obtained from UCSC as well as the GENCODE v.14 collection of transcripts (Harrow et al. 2006). In addition we extracted all splice sites supported by at least one expressed sequence tag (EST) in the data collection of the UCSC genome browser (downloaded 08/2012). "
    ABSTRACT: Large-scale RNA sequencing has revealed a large number of long mRNA-like transcripts (lncRNAs) that do not code for proteins. The evolutionary history of these lncRNAs has been notoriously hard to study systematically due to their low level of sequence conservation, which precludes comprehensive homology-based surveys and makes them nearly impossible to align. An increasing number of special cases, however, have been shown to be at least as old as the vertebrate lineage. Here we use the conservation of splice sites to trace the evolution of lncRNAs. We show that >85% of the human GENCODE lncRNAs were already present at the divergence of placental mammals and many hundreds of these RNAs date back even further. Nevertheless, we observe a fast turnover of intron/exon structures. We conclude that lncRNA genes are evolutionarily ancient components of vertebrate genomes that show an unexpected and unprecedented evolutionary plasticity. We offer a public web service that allows users to retrieve sets of orthologous splice sites and to produce overview maps of evolutionarily conserved splice sites for visualization and further analysis. An electronic supplement containing the ncRNA data sets used in this study is available. © 2015 Nitsche et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
    RNA 03/2015; 21(5). DOI:10.1261/rna.046342.114 · 4.94 Impact Factor
    • "and R ( [12]) programming languages. It is freely distributed and continuously improved. "
    ABSTRACT: Classically, gene prediction programs are based on detecting signals such as boundary sites (splice sites, starts, and stops) and coding regions in the DNA sequence in order to build potential exons and join them into a gene structure. Although nowadays it is possible to improve their performance with additional information from related species and/or cDNA databases, further improvement at any step could help to obtain better predictions. Here, we present WISCOD, a web-enabled tool for the identification of significant protein coding regions, a novel software tool that tackles the exon prediction problem in eukaryotic genomes. WISCOD has the capacity to detect real exons from large lists of potential exons, and it provides an easy way to use a global P value called expected probability of being a false exon (EPFE) that is useful for ranking potential exons in a probabilistic framework, without additional computational costs. The advantage of our approach is that it significantly increases the specificity and sensitivity (both between 80% and 90%) in comparison to other ab initio methods (where they are in the range of 70–75%). WISCOD is written in JAVA and R and is available to download and to run in a local mode on Linux and Windows platforms.
    BioMed Research International 09/2014; 2014(282343):10. DOI:10.1155/2014/282343 · 3.17 Impact Factor