Figures
PacBio transcriptomes of 6-month-old adult male mouse cortex and hippocampus. a-b) Isoform diversity for one representative dataset each of cortex and hippocampus. c) 2262 genes with higher novel read counts (NIC + NNC) than known read counts, of which 442 are higher only in cortex and 428 only in hippocampus. d-f) GO semantic maps for d) cortex genes, e) hippocampus genes, and f) shared genes. g) Example of Mef2a isoform expression in cortex and hippocampus.

Source publication
Preprint
Full-text available
Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq lacks the ability to resolve full-length transcript isoforms despite increasingly sophisticated computat...

Contexts in source publication

Context 1
... expression was highly correlated across biological replicates (Pearson r > 0.9) (Figure S18a-d) and on average, we detected 10,000 known genes and 14,000 known transcripts for each tissue. The diversity of the isoform categories was similar between cortex and hippocampus (Figure 5a, 5b). We focused our analysis on genes that had more reads assigned to NIC and NNC novel isoforms than known transcript models for both areas and found a shared set of 1,393 genes with an additional 442 and 429 being specific to cortex and hippocampus, respectively (Figure 5c; Table S34-36). ...
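The gene grouping described in Context 1 can be illustrated with plain set operations. The following is a minimal sketch, not the authors' pipeline: the function names, the per-gene `(known_reads, novel_reads)` layout, and the toy gene names and counts are all assumptions made for illustration, with "novel" taken to mean NIC + NNC reads.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout: gene -> (known_reads, novel_reads), where novel = NIC + NNC.
Counts = Dict[str, Tuple[int, int]]


def novel_dominant(counts: Counts) -> Set[str]:
    """Return genes with strictly more novel reads than known reads."""
    return {gene for gene, (known, novel) in counts.items() if novel > known}


def partition(cortex: Counts, hippocampus: Counts) -> Dict[str, Set[str]]:
    """Split novel-dominant genes into shared and tissue-specific sets."""
    cx = novel_dominant(cortex)
    hc = novel_dominant(hippocampus)
    return {
        "shared": cx & hc,            # novel-dominant in both tissues
        "cortex_only": cx - hc,       # novel-dominant in cortex only
        "hippocampus_only": hc - cx,  # novel-dominant in hippocampus only
    }


# Toy counts (invented, not taken from the publication).
cortex = {"GeneA": (10, 25), "GeneB": (40, 5), "GeneC": (3, 9)}
hippocampus = {"GeneA": (8, 30), "GeneB": (2, 12), "GeneC": (20, 4)}
groups = partition(cortex, hippocampus)
```

With these toy inputs, `GeneA` falls in the shared set, `GeneC` is cortex-only, and `GeneB` is hippocampus-only. A strict `>` comparison is used here; whether ties were handled differently in the source analysis is not stated in the excerpt.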
Context 2
... order to understand the gene categories that had more novel than known transcripts in our transcriptomes, we performed a GO analysis using Metascape 38 and REVIGO 39 for visualization and semantically arranged the terminology. We found that key genes Jun. 18, 2019; involved in locomotor behavior, long-term synaptic potentiation, and behavior are enriched in the cortex-specific set, which is the main center of knowledge integration and movement (Figure 5d; Table S37). The hippocampus was enriched for terms associated with synaptic vesicles and neuron projection (Figure 5e; Table S38). ...
Context 3
... found that key genes involved in locomotor behavior, long-term synaptic potentiation, and behavior are enriched in the cortex-specific set, which is the main center of knowledge integration and movement (Figure 5d; Table S37). The hippocampus was enriched for terms associated with synaptic vesicles and neuron projection (Figure 5e; Table S38). Not surprisingly, the shared terms for cortex and hippocampus are related to synaptic terms, mostly because there is a significant set of ATPase and GTPase metabolism genes as well as cell to cell adhesion molecules (Figure 5f; Table S39). ...
Context 4
... hippocampus was enriched for terms associated with synaptic vesicles and neuron projection (Figure 5e; Table S38). Not surprisingly, the shared terms for cortex and hippocampus are related to synaptic terms, mostly because there is a significant set of ATPase and GTPase metabolism genes as well as cell to cell adhesion molecules (Figure 5f; Table S39). Interestingly, RNA splicing GO terms are associated with this group of under-annotated transcript models. ...
Context 5
... regions such as the hippocampus, the Mef2 family can control the number of synapses and dendrite remodeling 42. We find 13 novel isoforms that pass platform-specific filtering from a total of 20 isoforms, all of which contain the DNA binding domain (Figure 5g). We then predicted protein sequences for these novel transcripts using TransDecoder 43 and performed protein domain analysis searches on the resultant sequences using Hmmer 44 ... predicted to lack the beta-sandwich domain (ENCODEM00000394337) that is present in all known Mef2a protein isoforms except one. ...

Citations

... The ability of long read RNA-Seq to generate reads corresponding to full-length transcripts provides an opportunity to discover novel transcripts and thereby enable the quantification of isoform expression using context-specific annotations 16 . Tools such as FLAIR 17 , TALON 18 , or StringTie2 19 have been developed for transcript discovery from long read RNA-Seq and have been shown to identify novel transcripts even in well annotated genomes. However, RNA degradation, sequencing, and alignment artefacts can introduce false positive transcript candidates and impact quantification 20 . ...
... To deal with possible false positive novel transcripts, existing methods rely on user defined thresholds such as the minimum read count to filter novel transcript candidates [17][18][19] . ...
... Therefore, unlike arbitrary thresholds for parameters used in other tools [17][18][19] , a more stringent NDR threshold guarantees higher precision while providing an upper bound on the FDR. This is especially relevant for well annotated genomes or analyses which involve high numbers of samples where precision is more important than sensitivity to obtain accurate annotations and quantification results. ...
Preprint
Full-text available
Most approaches to transcript quantification rely on fixed reference annotations. However, the transcriptome is dynamic, and depending on the context, such static annotations contain inactive isoforms for some genes while they are incomplete for others. To address this, we have developed Bambu, a method that performs machine-learning based transcript discovery to enable quantification specific to the context of interest using long-read RNA-Seq data. To identify novel transcripts, Bambu employs a precision-focused threshold referred to as the novel discovery rate (NDR), which replaces arbitrary per-sample thresholds with a single interpretable parameter. Bambu retains the full-length and unique read counts, enabling accurate quantification in presence of inactive isoforms. Compared to existing methods for transcript discovery, Bambu achieves greater precision without sacrificing sensitivity. We show that context-aware annotations improve abundance estimates for both novel and known transcripts. We apply Bambu to human embryonic stem cells to quantify isoforms from repetitive HERVH-LTR7 retrotransposons, demonstrating the ability to estimate transcript expression specific to the context of interest.
... Yet, the relatively high long reads error rates, of above 10% for both direct RNA and cDNA sequences, complicate the detection of the transcript's exact exon structure [10,11,20,21]. Several computational and sequencing methods have been developed to overcome this challenge [22][23][24], yet all these methods are applicable only to cDNA. ...
... Given the high number of expected transcripts assembled (greater than threefold than the number of genes), it is likely that the low ONT read yield imposes a limitation to accurately quantify transcripts. Thus, despite the premise of long reads in constructing transcripts, their low throughput and low sequence accuracy demonstrate that generating accurate transcriptomes from imperfect RNA reads is still a challenge [22,31]. Moreover, it is hard to distinguish whether reads with premature starts and ends indicate native internal transcription start or end sites, or technical issues such as fragmented reads or blocked pores. ...
... The TALON package [22] was applied to identify and quantify isoforms in ONT samples (cDNA and Direct RNA). The alignments, pooled from both replicates, were pre-processed with talon_label_reads to remove artefacts of internal priming with A-rich sequences (20 bp window). ...
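The internal-priming check mentioned above can be pictured as measuring how A-rich the genomic sequence is in a fixed window just past the end of a read alignment. The sketch below is only an illustration of that idea and is not TALON's code: the function names are hypothetical, only plus-strand alignments are handled, and the 0.5 cutoff is an assumed value chosen for the example. The 20 bp window size is the only parameter taken from the excerpt.

```python
def fraction_a(window: str) -> float:
    """Fraction of adenines in a sequence window (0.0 for an empty window)."""
    if not window:
        return 0.0
    return window.upper().count("A") / len(window)


def downstream_window(genome_seq: str, align_end: int, size: int = 20) -> str:
    """Genomic bases immediately after a plus-strand alignment end (0-based, exclusive)."""
    return genome_seq[align_end:align_end + size]


def looks_internally_primed(genome_seq: str, align_end: int,
                            size: int = 20, max_frac_a: float = 0.5) -> bool:
    """Flag a read whose downstream genomic window exceeds the assumed A-fraction cutoff."""
    return fraction_a(downstream_window(genome_seq, align_end, size)) > max_frac_a


# Toy sequences (invented): one A-rich downstream stretch, one balanced stretch.
a_rich_genome = "ACGTACGT" + "A" * 18 + "CG"
balanced_genome = "ACGT" * 10
a_rich_flag = looks_internally_primed(a_rich_genome, 8)
balanced_flag = looks_internally_primed(balanced_genome, 8)
```

A genomic A-rich stretch right after the alignment end suggests the oligo-dT primer may have annealed inside the transcript rather than at a true poly(A) tail, which is why such reads are treated as possible artefacts. A minus-strand read would need the reverse-complemented upstream window instead, which this sketch omits.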
Article
Full-text available
Alternative splicing produces various mRNAs, and thereby various protein products, from one gene, impacting a wide range of cellular activities. However, accurate reconstruction and quantification of full-length transcripts using short-reads is limited, due to their length. Long-reads sequencing technologies may provide a solution by sequencing full-length transcripts. We explored the use of both Illumina short-reads and two long Oxford Nanopore Technology (cDNA and Direct RNA) RNA-Seq reads for detecting global differential splicing during mouse embryonic stem cell differentiation, applying several bioinformatics strategies: gene-based, isoform-based and exon-based. We detected the strongest similarity among the sequencing platforms at the gene level compared to exon-based and isoform-based. Furthermore, the exon-based strategy discovered many differential exon usage (DEU) events, mostly in a platform-dependent manner and in non-differentially expressed genes. Thus, the platforms complemented each other in the ability to detect DEUs (i.e. long-reads exhibited an advantage in detecting DEUs at the UTRs, and short-reads detected more DEUs). Exons within 20 genes, detected in one or more platforms, were here validated by PCR, including key differentiation genes, such as Mdb3 and Aplp1. We provide an important analysis resource for discovering transcriptome changes during stem cell differentiation and insights for analysing such data.
... RNA-seq can be leveraged to study many aspects of RNA biology such as expression, splicing, gene fusion detection, and structure [77]. Currently, innovations in RNA-seq, such as direct RNA-seq [78], detection of long reads [79], spatialomics [80], and the development of new computational tools (TALON for transcriptome quantification [81] or FLAIR for transcriptome assembly [82]) for data analysis have contributed to a complete understanding of the biology of the transcriptome in biomedical research. ...
Article
Full-text available
SARS-CoV-2 is a coronavirus family member that appeared in China in December 2019 and caused the disease called COVID-19, which was declared a pandemic in 2020 by the World Health Organization. In recent months, great efforts have been made in the field of basic and clinical research to understand the biology and infection processes of SARS-CoV-2. In particular, transcriptome analysis has contributed to generating new knowledge of the viral sequences and intracellular signaling pathways that regulate the infection and pathogenesis of SARS-CoV-2, generating new information about its biology. Furthermore, transcriptomics approaches including spatial transcriptomics, single-cell transcriptomics and direct RNA sequencing have been used for clinical applications in monitoring, detection, diagnosis, and treatment to generate new clinical predictive models for SARS-CoV-2. Consequently, RNA-based therapeutics and their relationship with SARS-CoV-2 have emerged as promising strategies to battle the SARS-CoV-2 pandemic with the assistance of novel approaches such as CRISPR-CAS, ASOs, and siRNA systems. Lastly, we discuss the importance of precision public health in the management of patients infected with SARS-CoV-2 and establish that the fusion of transcriptomics, RNA-based therapeutics, and precision public health will allow a linkage for developing health systems that facilitate the acquisition of relevant clinical strategies for rapid decision making to assist in the management and treatment of the SARS-CoV-2-infected population to combat this global public health problem.
... We evaluated the detection performance of splicing variants and fusion genes. Although several Nanopore long-reads RNA-seq analysis pipelines exist [21,[29][30][31][32][33][34], to the best of our knowledge, no software can detect both splicing variants and fusion genes. In this study, we used sequence data from MCF-7 and an HCC sample (RK107C) and compared the performance of SPLICE and four other methods (TALON [34], FLAIR [21], StringTie [31], and Bambu [29]) for the splicing variant detection and SPLICE and two other methods (LongGF [32] and JAFFAL [30]) for the fusion gene detection. ...
... Although several Nanopore long-reads RNA-seq analysis pipelines exist [21,[29][30][31][32][33][34], to the best of our knowledge, no software can detect both splicing variants and fusion genes. In this study, we used sequence data from MCF-7 and an HCC sample (RK107C) and compared the performance of SPLICE and four other methods (TALON [34], FLAIR [21], StringTie [31], and Bambu [29]) for the splicing variant detection and SPLICE and two other methods (LongGF [32] and JAFFAL [30]) for the fusion gene detection. In the analysis of splicing variants, SPLICE identified the third highest number of transcripts after FLAIR and StringTie (S3A We next compared the fusion genes detected from MCF7 by SPLICE, LongGF, and JAFFAL. ...
... To our knowledge, SPLICE is the only long-reads RNA-seq pipeline that can detect both splicing variants and fusion genes. While most long-reads RNA-seq analysis pipelines for detecting splicing variants use short-reads RNA-seq data for the alignment correction of splice junctions [21,31,34], SPLICE considers error rates around splicing sites (S3C Fig) and can obtain accurate results using only long-reads RNA-seq data (S3 Fig). SPLICE also identified a larger number of fusion genes than existing software (S4 Fig). ...
Article
Full-text available
Genes generate transcripts of various functions by alternative splicing. However, in most transcriptome studies, short-reads sequencing technologies (next-generation sequencers) have been used, leaving full-length transcripts unobserved directly. Although long-reads sequencing technologies would enable the sequencing of full-length transcripts, the data analysis is difficult. In this study, we developed an analysis pipeline named SPLICE and analyzed cDNA sequences from 42 pairs of hepatocellular carcinoma (HCC) and matched non-cancerous livers with an Oxford Nanopore sequencer. Our analysis detected 46,663 transcripts from the protein-coding genes in the HCCs and the matched non-cancerous livers, of which 5,366 (11.5%) were novel. A comparison of expression levels identified 9,933 differentially expressed transcripts (DETs) in 4,744 genes. Interestingly, 746 genes with DETs, including the LINE1-MET transcript, were not found by a gene-level analysis. We also found that fusion transcripts of transposable elements and hepatitis B virus (HBV) were overexpressed in HCCs. In vitro experiments on DETs showed that LINE1-MET and HBV-human transposable elements promoted cell growth. Furthermore, fusion gene detection showed novel recurrent fusion events that were not detected in the short-reads. These results suggest the efficiency of full-length transcriptome studies and the importance of splicing variants in carcinogenesis.
... Lastly, a read count matrix of the filtered transcriptome was extracted using the talon_abundance module. To organize the long-read annotation results, a GTF-formatted annotation for transcripts and genes supported by long reads was generated using the talon_create_GTF utility based on the filtered transcriptome and the reference annotation (Wyman et al., 2020). ...
Article
Full-text available
The zebra finch (Taeniopygia guttata), a representative oscine songbird species, has been widely studied to investigate behavioral neuroscience, most notably the neurobiological basis of vocal learning, a rare trait shared in only a few animal groups including humans. In 2019, an updated zebra finch genome annotation (bTaeGut1_v1.p) was released from the Ensembl database and is substantially more comprehensive than the first version published in 2010. In this study, we utilized the publicly available RNA-seq data generated from Illumina-based short-reads and PacBio single-molecule real-time (SMRT) long-reads to assess the bird transcriptome. To analyze the high-throughput RNA-seq data, we adopted a hybrid bioinformatic approach combining short and long-read pipelines. From our analysis, we added 220 novel genes and 8,134 transcript variants to the Ensembl annotation, and predicted a new proteome based on the refined annotation. We further validated 18 different novel proteins by using mass-spectrometry data generated from zebra finch caudal telencephalon tissue. Our results provide additional resources for future studies of zebra finches utilizing this improved bird genome annotation and proteome.
... In addition, the high similarity and short divergence time between copies prevented the accurate quantification of ADS expression using classical methods. Wyman et al. showed that PacBio Iso-Seq data have the ability to quantify genes (Wyman et al., 2019). We introduced PacBio Hi-Fi Iso-Seq reads for the quantification of ADS and other ABP genes, thus avoiding complex primary and subsequent operations when using RNA-seq or qPCR. ...
Article
Full-text available
Artemisia annua is the major natural resource for globally commonly used anti-malarial medicine – artemisinin. Here, we present chromosomal-level haploid maps for two chemotypic-distinct A. annua strains to explore the relationships between the genomic organization and artemisinin production. High-fidelity sequencing, optical mapping, and chromatin conformation capture sequencing were used to assemble the heterogeneous and repetitive genome and resolve the haplotypes of A. annua. Approximately 50,000 genes were annotated for each haplotype genome, and a triplication event that occurred an estimated 58.2 million years ago was examined for the first time in this species. A total of 3,903,467 to 5,193,414 variants (SNPs, indels and structural variants) were identified in the 1.5-Gb genome during the pairwise comparison between haplotypes, which is consistent with the high heterozygosity of this species. There was a correlation between artemisinin yield and the copy numbers of amorpha-4,11-diene synthase (ADS), and the correlation was observed by 36-sample population sequencing. Circular consensus sequencing of transcripts facilitated detection of paralog expression. The novel findings in this work provide insights into the biosynthesis and regulation of artemisinin and will contribute to finally conquering malaria worldwide.
... We have hereby applied this pipeline ( Fig. 1) to the analysis of two publicly available mouse neural datasets, including scRNA-seq Smart-Seq2 (primary visual cortex, generated by Tasic et al. 43,44 ) and bulk ENCODE PacBio long-read data (mouse cortex and hippocampus, generated by Wyman et al. 45 ). As a result, we successfully detected cell type-level co-expression of isoforms in a manner that was independent of gene-level expression. ...
... from genome version GRCm38.p6 and assembly accession GCF_00001635.26. Long-read datasets from mouse hippocampus and cortex, generated by Wyman et al. 45 , were downloaded from ENCODE accessions ENCSR214HSG and ENCSR340GWV, respectively. Long read-defined transcriptome files (generated using above-cited long-read data and reference files, details in Supplementary Note) have been made available at the tappAS repository of annotation files. ...
Article
Full-text available
Alternative splicing (AS) is a highly-regulated post-transcriptional mechanism known to modulate isoform expression within genes and contribute to cell-type identity. However, the extent to which alternative isoforms establish co-expression networks that may be relevant in cellular function has not been explored yet. Here, we present acorde, a pipeline that successfully leverages bulk long reads and single-cell data to confidently detect alternative isoform co-expression relationships. To achieve this, we develop and validate percentile correlations, an innovative approach that overcomes data sparsity and yields accurate co-expression estimates from single-cell data. Next, acorde uses correlations to cluster co-expressed isoforms into a network, unraveling cell type-specific alternative isoform usage patterns. By selecting same-gene isoforms between these clusters, we subsequently detect and characterize genes with co-differential isoform usage (coDIU) across cell types. Finally, we predict functional elements from long read-defined isoforms and provide insight into biological processes, motifs, and domains potentially controlled by the coordination of post-transcriptional regulation. The code for acorde is available at https://github.com/ConesaLab/acorde. Alternative splicing (AS) is a highly-regulated post-transcriptional mechanism known to modulate isoform expression within genes and contribute to cell-type identity. Here, the authors present acorde, a pipeline that successfully leverages bulk long reads and single-cell data to confidently detect alternative isoform co-expression relationships.
... The initial gene prediction was produced with the Braker pipeline using all RNA-Seq data stated above (32). Furthermore, Nanopore-cDNA-Seq data produced by CTR-Seq was used to predict the final Pv 5.2.4 gene set using Talon (33). This gene set was validated for completeness with BUSCO (34). ...
Article
Full-text available
Non-biting midges (Chironomidae) are known to inhabit a wide range of environments, and certain species can tolerate extreme conditions, where the rest of insects cannot survive. In particular, the sleeping chironomid Polypedilum vanderplanki is known for the remarkable ability of its larvae to withstand almost complete desiccation by entering a state called anhydrobiosis. Chromosome numbers in chironomids are higher than in other dipterans and this extra genomic resource might facilitate rapid adaptation to novel environments. We used improved sequencing strategies to assemble a chromosome-level genome sequence for P. vanderplanki for deep comparative analysis of genomic location of genes associated with desiccation tolerance. Using whole genome-based cross-species and intra-species analysis, we provide evidence for the unique functional specialization of Chromosome 4 through extensive acquisition of novel genes. In contrast to other insect genomes, in the sleeping chironomid a uniquely high degree of subfunctionalization in paralogous anhydrobiosis genes occurs in this chromosome, as well as pseudogenization in a highly duplicated gene family. Our findings suggest that the Chromosome 4 in Polypedilum is a site of high genetic turnover, allowing it to act as a ‘sandbox’ for evolutionary experiments, thus facilitating the rapid adaptation of midges to harsh environments.
... Different third-generation sequencing platforms have superior specificity. As a representative sequencing platform, Oxford Nanopore (ONT) platform has 1-Mb of sequencing reads (Wyman et al., 2019). ...
... The third-generation sequencing technology seems to offer an opportunity to solve above-mentioned difficulties, this is the case because it can obtain some transcripts large enough to cover the FL genes (Eid et al., 2009; Feng et al., 2015). Given the advantage of ONT platform in long-read sequencing (Wyman et al., 2019), thus it can be applied to sequence the FL transcriptome of R. typus. In the present study, a total of 14,930 FL transcripts were generated and these sequences helped us identify 714 novel transcripts and 1,642 novel genetic loci in the R. typus genome. ...
... In the present study, 175 fusion transcripts were detected, which is relatively low. We suspected that the poor accuracy of ONT platform sequencing led to this finding (Wyman et al., 2019). Due to the lack of genomic data at the chromosomal level, we have not been able to determine whether these fusion transcripts are from intergenic splicing or chromosomal rearrangement. ...
Article
Full-text available
Rhincodon typus is a keystone and indicator species in marine ecosystems. Meanwhile, R. typus has been listed on the IUCN red list of vulnerable species. Here we used the ONT platform to determine the full-length (FL) transcriptome of R. typus and obtained 14,930 FL transcripts. Among all FL transcripts, 14,915 transcripts covered 11,892 genetic loci, and 1,642 novel genetic loci were further found. Meanwhile, we identified 714 novel transcripts by comparing FL transcripts with the R. typus genome. Based on FL transcripts, we also predicted the distribution patterns of ASs, LncRNAs, polyAs, CDSs and methylation sites on the FL transcriptome of R. typus. Furthermore, a total of 31,021 (97.86%) CDSs obtained annotation information. Overall, our work provides the first FL transcriptome, and these sequences complete the annotated R. typus genome information. Furthermore, this information is a potential resource for studying biological processes of R. typus.