Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotech. 28, 503-510 (2010).

Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.
Nature Biotechnology (Impact Factor: 41.51). 05/2010; 28(5):503-10. DOI: 10.1038/nbt.1633
Source: PubMed


Massively parallel cDNA sequencing (RNA-Seq) provides an unbiased way to study a transcriptome, including both coding and noncoding genes. Until now, most RNA-Seq studies have depended crucially on existing annotations and thus focused on expression levels and variation in known transcripts. Here, we present Scripture, a method to reconstruct the transcriptome of a mammalian cell using only RNA-Seq reads and the genome sequence. We applied it to mouse embryonic stem cells, neuronal precursor cells and lung fibroblasts to accurately reconstruct the full-length gene structures for most known expressed genes. We identified substantial variation in protein coding genes, including thousands of novel 5' start sites, 3' ends and internal coding exons. We then determined the gene structures of more than a thousand large intergenic noncoding RNA (lincRNA) and antisense loci. Our results open the way to direct experimental manipulation of thousands of noncoding RNAs and demonstrate the power of ab initio reconstruction to render a comprehensive picture of mammalian transcriptomes.


Available from: Manuel Garber
    • "Identification of lncRNAs from genome sequences through computational methods is difficult because of the low expression, poor sequence conservation and complex functions of these molecules (Da Sacco et al., 2012). Nevertheless, detection from transcriptome sequencing data is possible because lncRNAs are frequently polyadenylated (polyA) (Mortazavi et al., 2008; Guttman et al., 2010; Pauli et al., 2011; Boerner & McGinnis, 2012; Fatica & Bozzoni, 2013). LncRNAs have been identified from the genomes of mammals, including human, mouse and pig (Cabili et al., 2011; Luo et al., 2013; Zhou et al., 2014), as well as from plants, such as Arabidopsis, maize and rice (Song et al., 2009; Boerner & McGinnis, 2012; Liu et al., 2012; Wang et al., 2014; Zhang et al., 2014). "
    ABSTRACT: Long noncoding RNAs (lncRNAs) regulate gene expression and biological processes. With the development of high-throughput RNA sequencing technology, lncRNAs have been extensively studied in recent years. Nevertheless, the expression and evolution of lncRNAs in plants remain poorly understood. Here, we identified 413 and 709 multi-exon noncoding transcripts from 353 and 595 loci of the cultivar tomato Heinz1706 and its wild relative LA1589, respectively. Systematic comparison of the sequence and expression of lncRNAs showed that they are poorly conserved in Solanaceae, with only < 0.4% lncRNAs present in all sequenced genomes of tomato and potato. Sequence analysis of Lycopersicon-specific lncRNA loci in Solanum lycopersicum and S. pennellii showed that the origins of these molecules are associated with transposable elements (TEs). LncRNA-314, a fruit-specific lncRNA expressed in S. lycopersicum and S. pimpinellifolium, but not in S. pennellii, originated through two evolutionary events: speciation of S. pennellii resulted in insertion of a long terminal repeat (LTR) retrotransposon into chromosome 10 and contributed to most of the transcribed region of lncRNA-314; and a large deletion in Lycopersicon generated the promoter region and part of the transcribed region of lncRNA-314. These results provide novel insights into the evolution of lncRNAs in plants.
    New Phytologist 10/2015; DOI:10.1111/nph.13718 · 7.67 Impact Factor
    • "Next, we subjected the same nuclear RNA from TN-DROSHA expressing HEK293T cells to Illumina RNA sequencing to test its suitability for transcriptome-wide pri-miRNA assembly. After generating a very deep RNA-seq data set (193,346,087 100-bp paired-end reads), we evaluated several transcriptome assemblers, such as StringTie, Cufflinks (Trapnell et al. 2010), IsoLasso (Li et al. 2011), and Scripture (Guttman et al. 2010), to assess their performance for this application (Supplemental Table S2). By evaluating the assembly of pri-miRNAs that are annotated in RefSeq, we found that StringTie correctly assembled the highest number of pri-miRNA transcripts in considerably less time than the other assemblers."
    ABSTRACT: Precise regulation of microRNA (miRNA) expression is critical for diverse physiologic and pathophysiologic processes. Nevertheless, elucidation of the mechanisms through which miRNA expression is regulated has been greatly hindered by the incomplete annotation of primary miRNA (pri-miRNA) transcripts. While a subset of miRNAs are hosted in protein-coding genes, the majority of pri-miRNAs are transcribed as poorly characterized noncoding RNAs that are 10's to 100's of kilobases in length and low in abundance due to efficient processing by the endoribonuclease DROSHA, which initiates miRNA biogenesis. Accordingly, these transcripts are poorly represented in existing RNA-seq data sets and exhibit limited and inaccurate annotation in current transcriptome assemblies. To overcome these challenges, we developed an experimental and computational approach that allows genome-wide detection and mapping of pri-miRNA structures. Deep RNA-seq in cells expressing dominant-negative DROSHA resulted in much greater coverage of pri-miRNA transcripts compared with standard RNA-seq. A computational pipeline was developed that produces highly accurate pri-miRNA assemblies, as confirmed by extensive validation. This approach was applied to a panel of human and mouse cell lines, providing pri-miRNA transcript structures for 1291/1871 human and 888/1181 mouse miRNAs, including 594 human and 425 mouse miRNAs that fall outside protein-coding genes. These new assemblies uncovered unanticipated features and new potential regulatory mechanisms, including links between pri-miRNAs and distant protein-coding genes, alternative pri-miRNA splicing, and transcripts carrying subsets of miRNAs encoded by polycistronic clusters. These results dramatically expand our understanding of the organization of miRNA-encoding genes and provide a valuable resource for the study of mammalian miRNA regulation. © 2015 Chang et al.; Published by Cold Spring Harbor Laboratory Press.
    Genome Research 08/2015; 25(9). DOI:10.1101/gr.193607.115 · 14.63 Impact Factor
    • "Several reference-based assemblers (i.e., mapping assemblers) [10] [13] [32] [38] are developed to reconstruct the full set of expressed mRNAs. With high-quality references, even some erroneous reads can be correctly mapped and thus, expressed transcripts can be accurately reconstructed."
    ABSTRACT: Motivation: RNA-seq has made feasible the analysis of a whole set of expressed mRNAs. Mapping-based assembly of RNA-seq reads sometimes is infeasible due to lack of high-quality references. However, de novo assembly is very challenging due to uneven expression levels among transcripts and also the read coverage variation within a single transcript. Existing methods either apply de Bruijn graphs of single-sized k-mers to assemble the full set of transcripts, or conduct multiple runs of assembly, but still apply graphs of single-sized k-mers at each run. However, a single k-mer size is not suitable for all the regions of the transcripts with varied coverage. Contribution: This paper presents a de novo assembler Bermuda with new insights for handling uneven coverage. Opposed to existing methods that use a single k-mer size for all the transcripts in each run of assembly, Bermuda self-adaptively uses a few k-mer sizes to assemble different regions of a single transcript according to their local coverage. As such, Bermuda can deal with uneven expression levels and coverage not only among transcripts, but also within a single transcript. Extensive tests show that Bermuda outperforms popular de novo assemblers in reconstructing unevenly-expressed transcripts with longer length, better contiguity and lower redundancy. Further, Bermuda is computationally efficient with moderate memory consumption.