Comparative analysis of RNA-seq alignment algorithms and the RNA-Seq Unified Mapper (RUM)

Penn Center for Bioinformatics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA.
Bioinformatics (Impact Factor: 4.98). 07/2011; 27(18):2518-28. DOI: 10.1093/bioinformatics/btr427
Source: PubMed


A critical task in high-throughput sequencing is aligning millions of short reads to a reference genome. Alignment is especially complicated for RNA sequencing (RNA-Seq) because of RNA splicing. A number of RNA-Seq algorithms are available, and claim to align reads with high accuracy and efficiency while detecting splice junctions. RNA-Seq data are discrete in nature; therefore, with reasonable gene models and comparative metrics RNA-Seq data can be simulated to sufficient accuracy to enable meaningful benchmarking of alignment algorithms. The exercise to rigorously compare all viable published RNA-Seq algorithms has not been performed previously.
We developed an RNA-Seq simulator that models the main impediments to RNA alignment, including alternative splicing, insertions, deletions, substitutions, sequencing errors and intron signal. We used this simulator to measure the accuracy and robustness of available algorithms at the base and junction levels. Additionally, we used reverse transcription-polymerase chain reaction (RT-PCR) and Sanger sequencing to validate the ability of the algorithms to detect novel transcript features such as novel exons and alternative splicing in RNA-Seq data from mouse retina. A pipeline based on BLAT was developed to explore the performance of established tools for this problem, and to compare it to the recently developed methods. This pipeline, the RNA-Seq Unified Mapper (RUM), performs comparably to the best current aligners and provides an advantageous combination of accuracy, speed and usability.
The RUM pipeline is distributed via the Amazon Cloud and for computing clusters using the Sun Grid Engine.
The RNA-Seq sequence reads described in the article are deposited at GEO, accession GSE26248.
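
Because the true origin of every simulated read is known, accuracy at the base level can be scored by comparing, base by base, where an aligner placed a read against where the simulator drew it from. The Python sketch below illustrates that idea only. The data layout (a dict mapping read IDs to per-base genome coordinates) and the function name base_level_accuracy are assumptions made for illustration; they are not taken from the paper, the simulator, or RUM.

```python
from typing import Dict, List, Optional


def base_level_accuracy(
    truth: Dict[str, List[int]],
    aligned: Dict[str, List[Optional[int]]],
) -> Dict[str, float]:
    """Score aligned bases against simulated truth.

    truth:   read ID -> true genome coordinate of each base (hypothetical layout).
    aligned: read ID -> coordinate reported by an aligner for each base,
             or None where that base was left unaligned. Reads missing
             from this dict count as entirely unaligned.
    Returns the fractions of bases that are correct, wrong, and unaligned.
    """
    total = correct = wrong = 0
    for read_id, true_positions in truth.items():
        called = aligned.get(read_id)
        for i, true_pos in enumerate(true_positions):
            total += 1
            # A base is unaligned if the read is absent or the slot is None.
            pos = called[i] if called is not None and i < len(called) else None
            if pos is None:
                continue
            if pos == true_pos:
                correct += 1
            else:
                wrong += 1
    if total == 0:
        raise ValueError("truth contains no bases to score")
    unaligned = total - correct - wrong
    return {
        "correct": correct / total,
        "wrong": wrong / total,
        "unaligned": unaligned / total,
    }


if __name__ == "__main__":
    truth = {"r1": [100, 101, 102, 103], "r2": [500, 501]}
    aligned = {"r1": [100, 101, 102, 200]}  # r2 was not aligned at all
    print(base_level_accuracy(truth, aligned))
```

Keeping wrong and unaligned bases as separate fractions lets an aligner that declines to place a difficult read be distinguished from one that places it incorrectly.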



Available from: Michael H Farkas
  • Source
    • "The values in the table are average expression levels (and where applicable ± SEM) in RPKM of itch related neuropeptides and their receptors. RPKM is an acronym for Reads Per Kilobase of exon model per Million mapped reads, a normalization which takes into consideration the length of coding exons of genes and depth of sequencing [9]. The mouse DRG dataset consists of TRPV1 lineage and non-TRPV1 lineage RNA samples which were obtained from BAC-TRPV1 promoter-Cre mice as described (Mishra et al., 2011) [6]. "
    ABSTRACT: Background Three neuropeptides, gastrin releasing peptide (GRP), natriuretic precursor peptide B (NPPB), and neuromedin B (NMB) have been proposed to play roles in itch sensation. However, the tissues in which these peptides are expressed and their positions in the itch circuit have recently become the subject of debate. Here we used next-gen RNA-Seq to examine the expression of transcripts coding for GRP, NPPB, NMB, and other peptides in DRG, trigeminal ganglion, and the spinal cord as well as expression levels for their cognate receptors in these tissues. Results RNA-Seq demonstrates that GRP is not transcribed in mouse, rat, or human sensory ganglia. NPPB, which activates natriuretic peptide receptor 1 (NPR1), is well expressed in mouse DRG and less so in rat and human, whereas NPPA, which also acts on the NPR1 receptor, is expressed in all three species. Analysis of transcripts expressed in the spinal cord of mouse, rat, and human reveals no expression of Nppb, but unambiguously detects expression of Grp and the GRP-receptor (Grpr). The transcripts coding for NMB and tachykinin peptides are among the most highly expressed in DRG. Bioinformatics comparisons using the sequence of the peptides used to produce GRP-antibodies with proteome databases revealed that the C-terminal primary sequence of NMB and Substance P can potentially account for results from previous studies which showed GRP-immunostaining in the DRG. Conclusions RNA-Seq corroborates a primary itch afferent role for NPPB in mouse and potentially NPPB and NPPA in rats and humans, but does not support GRP as a primary itch neurotransmitter in mouse, rat, or humans. As such, our results are at odds with the initial proposal of Sun and Chen (2007) that GRP is expressed in DRG. By contrast, our data strongly support an itch pathway where the itch-inducing actions of GRP are exerted through its release from spinal cord neurons.
    Full-text · Article · Aug 2014 · Molecular Pain
  • Source
    • "Reads from ribosomal RNA and genomic repeats were identified by aligning the 5′ 50 bp of each read to ribosomal sequences and mouse repeats in RepBase using Bowtie [33], allowing up to three mismatches. The remaining reads were processed with RUM [34] and aligned to the set of known transcripts included in RefSeq, UCSC known genes, and ENSEMBL transcripts, and the mouse genome (mm9). Transcript-, exon-, and intron-level quantification was done using only the uniquely aligning reads. "
    ABSTRACT: Objective Glucagon-like peptide-1 (GLP-1) plays a major role in pancreatic β-cell function and survival by increasing cytoplasmic cAMP levels, which are thought to affect transcription through activation of the basic leucine zipper (bZIP) transcription factor CREB. Here, we test CREB function in the adult β-cell through inducible gene deletion. Methods We employed cell type-specific and inducible gene ablation to determine CREB function in pancreatic β-cells in mice. Results By ablating CREB acutely in mature β-cells in tamoxifen-treated CrebloxP/loxP;Pdx1-CreERT2 mice, we show that CREB has little impact on β-cell turnover, in contrast to what had been postulated previously. Rather, CREB is required for GLP-1 to elicit its full effects on stimulating glucose-induced insulin secretion and protection from cytokine-induced apoptosis. Mechanistically, we find that CREB regulates expression of the pro-apoptotic gene p21 (Cdkn1a) in β-cells, thus demonstrating that CREB is essential to mediating this critical aspect of GLP-1 receptor signaling. Conclusions In sum, our studies using conditional gene deletion put into question current notions about the importance of CREB in regulating β-cell function and mass. However, we reveal an important role for CREB in the β-cell response to GLP-1 receptor signaling, further validating CREB as a therapeutic target for diabetes.
    Full-text · Article · Aug 2014 · Molecular Metabolism
  • Source
    • "Few RNA-Seq simulators have been proposed in the last years (BEERS Simulator [31], RSEM Read Simulator [35], RNASeqReadSimulator [44]). In this work we used Flux Simulator [45] (available at, "
    ABSTRACT: Background The main goal of the whole transcriptome analysis is to correctly identify all expressed transcripts within a specific cell/tissue - at a particular stage and condition - to determine their structures and to measure their abundances. RNA-seq data promise to allow identification and quantification of transcriptome at unprecedented level of resolution, accuracy and low cost. Several computational methods have been proposed to achieve such purposes. However, it is still not clear which promises are already met and which challenges are still open and require further methodological developments. Results We carried out a simulation study to assess the performance of 5 widely used tools, such as: CEM, Cufflinks, iReckon, RSEM, and SLIDE. All of them have been used with default parameters. In particular, we considered the effect of the following three different scenarios: the availability of complete annotation, incomplete annotation, and no annotation at all. Moreover, comparisons were carried out using the methods in three different modes of action. In the first mode, the methods were forced to only deal with those isoforms that are present in the annotation; in the second mode, they were allowed to detect novel isoforms using the annotation as guide; in the third mode, they were operating in fully data driven way (although with the support of the alignment on the reference genome). In the latter modality, precision and recall are quite poor. On the contrary, results are better with the support of the annotation, even though it is not complete. Finally, abundance estimation error often shows a very skewed distribution. The performance strongly depends on the true real abundance of the isoforms. Lowly (and sometimes also moderately) expressed isoforms are poorly detected and estimated. In particular, lowly expressed isoforms are identified mainly if they are provided in the original annotation as potential isoforms. Conclusions Both detection and quantification of all isoforms from RNA-seq data are still hard problems and they are affected by many factors. Overall, the performance significantly changes since it depends on the modes of action and on the type of available annotation. Results obtained using complete or partial annotation are able to detect most of the expressed isoforms, even though the number of false positives is often high. Fully data driven approaches require more attention, at least for complex eucaryotic genomes. Improvements are desirable especially for isoform quantification and for isoform detection with low abundance.
    Full-text · Article · May 2014 · BMC Bioinformatics
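
The RPKM normalization quoted in the first excerpt above (Reads Per Kilobase of exon model per Million mapped reads) reduces to a single formula: the read count for a gene, divided by its exon length in kilobases and by the number of mapped reads in millions. The Python sketch below is a minimal illustration of that arithmetic. The function name and parameter names are assumptions made here; they do not come from any of the cited papers or tools.

```python
def rpkm(read_count: float, exon_length_bp: float, total_mapped_reads: float) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads.

    read_count:         reads mapped to the gene's exons
    exon_length_bp:     summed exon length of the gene, in base pairs
    total_mapped_reads: all mapped reads in the sample
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a gene with 2 kb of exon, in a sample of 10 million mapped reads
    print(rpkm(500, 2000, 10_000_000))
```

In this example the gene has 500 reads over 2 kb of exon in a sample with 10 million mapped reads, so it receives 500 / 2 / 10 = 25 RPKM.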