GENCODE: the reference human genome annotation for the ENCODE project

Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
Genome Research (Impact Factor: 14.63). 09/2012; 22(9):1760-74. DOI: 10.1101/gr.135350.111


The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two-exon models, indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from and via the Ensembl and UCSC Genome Browsers.

Available from: Jose Manuel Rodríguez
  • Source
    • "The potential PCR clones in each library were removed using Stacks:clone_filter program (Cole et al, 2005) to keep unique reads. The reads in each library were further screened for mapping to reference genome (hg19, GRCh37 including all alternative haplotypes), transcriptome (GENCODE v19 comprehensive transcripts (Harrow et al, 2012); RefSeq genes; human all mRNAs (Pruitt et al, 2005); UCSC genes (Hsu et al, 2006); Ensembl genes (Hubbard et al, 2002); lincRNAs (Trapnell et al, 2010); and human ribosomal RNA sequences); abundant sequences (vector sequences [http://]; phage sequences (Leinonen et al, 2011); and polyA/C sequences), bacterial rRNA sequences (Cole et al, 2005), and bacterial and viral genomic sequences (Leinonen et al, 2011). "
    ABSTRACT: Crucial parts of the genome including genes encoding microRNAs and noncoding RNAs went unnoticed for years, and even now, despite extensive annotation and assembly of the human genome, RNA-sequencing continues to yield millions of unmappable and thus uncharacterized reads. Here, we examined > 300 billion reads from 536 normal donors and 1,873 patients encompassing 21 cancer types, identified ~300 million such uncharacterized reads, and using a distinctive approach de novo assembled 2,550 novel human transcripts, which mainly represent long noncoding RNAs. Of these, 230 exhibited relatively specific expression or non-expression in certain cancer types, making them potential markers for those cancers, whereas 183 exhibited tissue specificity. Moreover, we used lentiviral-mediated expression of three selected transcripts that had higher expression in normal than in cancer patients and found that each inhibited the growth of HepG2 cells. Our analysis provides a comprehensive and unbiased resource of unmapped human transcripts and reveals their associations with specific cancers, providing potentially important new genes for therapeutic targeting. This article is a U.S. Government work and is in the public domain in the USA. Published under the terms of the CC BY 4.0 license.
    Full-text · Article · Aug 2015 · Molecular Systems Biology
  • Source
    • "All obtained reads from each sample were mapped against the human genome (hg19 build) with Bowtie/Tophat v2.0.2, which allows mapping across splice sites by reads segmentation (Trapnell et al., 2012). The uniquely mapped reads were subsequently assembled into transcripts guided by reference annotation (Gencode v14 and RefSeq gene models) (Harrow et al., 2012; Pruitt et al., 2012) with Cufflinks v2.0.2 (Trapnell et al., 2012). The expression level of each gene was quantified with normalized FPKM (fragments per kilobase of exon per million mapped fragments). "
    ABSTRACT: Long non-coding RNAs (lncRNAs) regulate diverse biological processes, including cell lineage specification. Here, we report transcriptome profiling of human endoderm and pancreatic cell lineages using purified cell populations. Analysis of the data sets allows us to identify hundreds of lncRNAs that exhibit differentiation-stage-specific expression patterns. As a first step in characterizing these lncRNAs, we focus on an endoderm-specific lncRNA, definitive endoderm-associated lncRNA1 (DEANR1), and demonstrate that it plays an important role in human endoderm differentiation. DEANR1 contributes to endoderm differentiation by positively regulating expression of the endoderm factor FOXA2. Importantly, overexpression of FOXA2 is able to rescue endoderm differentiation defects caused by DEANR1 depletion. Mechanistically, DEANR1 facilitates FOXA2 activation by facilitating SMAD2/3 recruitment to the FOXA2 promoter. Thus, our study not only reveals a large set of differentiation-stage-specific lncRNAs but also characterizes a functional lncRNA that is important for endoderm differentiation. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
    Full-text · Article · Apr 2015 · Cell Reports
  • Source
    • "RNA-Seq libraries were prepared using SOLiD™ Total RNA-Seq Kit (Life Technologies, Carlsbad US), according to the manufacturer's recommendations and were sequenced on the SOLiD sequencing platform (Life Technologies, Carlsbad US). Sequences were aligned against the human reference genome (hg19; GRCh37) using Tophat2 [22] and Gencode data [23] (Build 15) as the transcriptome database. Aligned sequences were merged and only those alignments with quality greater than 20 (Q>20) were used for further analysis. "
    ABSTRACT: Neoadjuvant chemoradiotherapy (nCRT) may lead to complete tumor regression in rectal cancer patients. Prediction of complete response to nCRT may allow a personalized management of rectal cancer and spare patients from unnecessary radical total mesorectal excision with or without sphincter preservation. To identify a gene expression signature capable of predicting complete pathological response (pCR) to nCRT, we performed a gene expression analysis in 25 pretreatment biopsies from patients who underwent 5FU-based nCRT using RNA-Seq. A supervised learning algorithm was used to identify expression signatures capable of predicting pCR, and the predictive value of these signatures was validated using independent samples. We also evaluated the utility of previously published signatures in predicting complete response in our cohort. We identified 27 differentially expressed genes between patients with pCR and patients with incomplete responses to nCRT. Predictive gene signatures using subsets of these 27 differentially expressed genes peaked at 81.8% accuracy. However, signatures with the highest sensitivity showed poor specificity, and vice-versa, when applied in an independent set of patients. Testing previously published signatures on our cohort also showed poor predictive value. Our results indicate that currently available predictive signatures are highly dependent on the sample set from which they are derived, and their accuracy is not superior to current imaging and clinical parameters used to assess response to nCRT and guide surgical intervention. Copyright © 2015 Elsevier Inc. All rights reserved.
    Full-text · Article · Mar 2015 · Cancer Genetics
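
The Cell Reports excerpt above quantifies expression as FPKM (fragments per kilobase of exon per million mapped fragments). As a rough illustration of what that unit means, the sketch below computes the plain textbook ratio. It is not the estimator Cufflinks itself uses, which fits a statistical model across transcripts; the function name, parameter names, and the numbers in the example are all hypothetical.

```python
def fpkm(fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Plain FPKM definition (an assumed simplification, not Cufflinks' model).

    fragments:              fragments assigned to the gene
    exon_length_bp:         summed exon length of the gene, in base pairs
    total_mapped_fragments: all mapped fragments in the library
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2,000 bp of exons, 20 million mapped fragments
example_value = fpkm(500, 2_000, 20_000_000)
print(f"FPKM = {example_value:.2f}")
```

Dividing by exon length and by library size is what makes values comparable between genes of different lengths and between samples sequenced to different depths.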