Article

Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM)

Penn Center for Bioinformatics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA.
Bioinformatics (Impact Factor: 4.62). 07/2011; 27(18):2518-28. DOI: 10.1093/bioinformatics/btr427
Source: PubMed

ABSTRACT A critical task in high-throughput sequencing is aligning millions of short reads to a reference genome. Alignment is especially complicated for RNA sequencing (RNA-Seq) because of RNA splicing. A number of RNA-Seq algorithms are available and claim to align reads with high accuracy and efficiency while detecting splice junctions. RNA-Seq data are discrete in nature; therefore, with reasonable gene models and comparative metrics, RNA-Seq data can be simulated to sufficient accuracy to enable meaningful benchmarking of alignment algorithms. A rigorous comparison of all viable published RNA-Seq algorithms has not previously been performed.
We developed an RNA-Seq simulator that models the main impediments to RNA alignment, including alternative splicing, insertions, deletions, substitutions, sequencing errors and intron signal. We used this simulator to measure the accuracy and robustness of available algorithms at the base and junction levels. Additionally, we used reverse transcription-polymerase chain reaction (RT-PCR) and Sanger sequencing to validate the ability of the algorithms to detect novel transcript features such as novel exons and alternative splicing in RNA-Seq data from mouse retina. A pipeline based on BLAT was developed to explore the performance of established tools for this problem, and to compare it to the recently developed methods. This pipeline, the RNA-Seq Unified Mapper (RUM), performs comparably to the best current aligners and provides an advantageous combination of accuracy, speed and usability.
The RUM pipeline is distributed via the Amazon Cloud and for computing clusters using the Sun Grid Engine (http://cbil.upenn.edu/RUM).
Contact: ggrant@pcbi.upenn.edu; epierce@mail.med.upenn.edu
The RNA-Seq sequence reads described in the article are deposited at GEO, accession GSE26248.
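
The abstract describes measuring accuracy "at the base and junction levels" against simulated reads whose true positions are known. The following is a minimal sketch of what such a comparison could look like, not the evaluation code used in the article. The block representation (inclusive genomic intervals per read), the function names, and the assumption that alignments contain no insertions or deletions are all hypothetical simplifications chosen for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical representation: one read's alignment is a list of inclusive
# genomic intervals (blocks); a gap between blocks is a spliced-out intron.
Block = Tuple[int, int]


def block_positions(blocks: List[Block]) -> List[int]:
    """Expand blocks into the genomic coordinate of each read base, in order."""
    return [pos for start, end in blocks for pos in range(start, end + 1)]


def base_accuracy(truth: List[Block], called: List[Block]) -> float:
    """Fraction of read bases placed at their true genomic coordinate.

    Assumes no indels, so the i-th base of both alignments is the same
    read base. Bases missing from the called alignment count as wrong.
    """
    true_pos = block_positions(truth)
    called_pos = block_positions(called)
    correct = sum(1 for t, c in zip(true_pos, called_pos) if t == c)
    return correct / len(true_pos)


def junctions(blocks: List[Block]) -> Set[Tuple[int, int]]:
    """Junctions implied by consecutive blocks: (last base, next first base)."""
    return {(blocks[i][1], blocks[i + 1][0]) for i in range(len(blocks) - 1)}


def junction_counts(truth: List[Block], called: List[Block]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for junctions."""
    t, c = junctions(truth), junctions(called)
    return len(t & c), len(c - t), len(t - c)


# A simulated 100-base read spanning one intron, and a called alignment
# that places the splice junction two bases too far to the right.
truth = [(100, 149), (1000, 1049)]
called = [(100, 151), (1002, 1049)]

print(base_accuracy(truth, called))    # 98 of 100 bases are placed correctly
print(junction_counts(truth, called))  # the junction itself is missed
```

In this toy case the base-level score is high while the junction is wrong, which is one reason to report the two levels separately.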

  • ABSTRACT: Hematopoietic stem cells (HSCs) possess unique gene expression programs that enforce their identity and regulate lineage commitment. Long non-coding RNAs (lncRNAs) have emerged as important regulators of gene expression and cell fate decisions, although their functions in HSCs are unclear. Here we profiled the transcriptome of purified HSCs by deep sequencing and identified 323 unannotated lncRNAs. Comparing their expression in differentiated lineages revealed 159 lncRNAs enriched in HSCs, some of which are likely HSC specific (LncHSCs). These lncRNA genes share epigenetic features with protein-coding genes, including regulated expression via DNA methylation, and knocking down two LncHSCs revealed distinct effects on HSC self-renewal and lineage commitment. We mapped the genomic binding sites of one of these candidates and found enrichment for key hematopoietic transcription factor binding sites, especially E2A. Together, these results demonstrate that lncRNAs play important roles in regulating HSCs, providing an additional layer to the genetic circuitry controlling HSC function. Copyright © 2015 Elsevier Inc. All rights reserved.
  • ABSTRACT: Circadian rhythms are daily endogenous oscillations of behavior, metabolism, and physiology. At a molecular level, these oscillations are generated by transcriptional-translational feedback loops composed of core clock genes. In turn, core clock genes drive the rhythmic accumulation of downstream outputs, termed clock-controlled genes (CCGs), whose rhythmic translation and function ultimately underlie daily oscillations at a cellular and organismal level. Given the circadian clock's profound influence on human health and behavior, considerable efforts have been made to systematically identify CCGs. The recent development of next-generation sequencing has dramatically expanded our ability to study the expression, processing, and stability of rhythmically expressed mRNAs. Nevertheless, like any new technology, there are many technical issues to be addressed. Here, we discuss considerations for studying circadian rhythms using genome scale transcriptional profiling, with a particular emphasis on RNA sequencing. We make a number of practical recommendations, including the choice of sampling density, read depth, alignment algorithms, read-depth normalization, and cycling detection algorithms, based on computational simulations and our experience from previous studies. We believe that these results will be of interest to the circadian field and help investigators design experiments to derive the most value from these large and complex data sets. © 2015 Elsevier Inc. All rights reserved.
  • ABSTRACT: Genomics and genetics have invaded all aspects of biology and medicine, opening uncharted territory for scientific exploration. The definition of "gene" itself has become ambiguous, and the central dogma is continuously being revised and expanded. Computational biology and computational medicine are no longer intellectual domains of the chosen few. Next generation sequencing (NGS) technology, together with novel methods of pattern recognition and network analyses, has revolutionized the way we think about fundamental biological mechanisms and cellular pathways. In this review, we discuss NGS-based genome-wide approaches that can provide deeper insights into retinal development, aging and disease pathogenesis. We first focus on gene regulatory networks (GRNs) that govern the differentiation of retinal photoreceptors and modulate adaptive response during aging. Then, we discuss NGS technology in the context of retinal disease and develop a vision for therapies based on network biology. We should emphasize that basic strategies for network construction and analyses can be transported to any tissue or cell type. We believe that specific and uniform guidelines are required for generation of genome, transcriptome and epigenome data to facilitate comparative analysis and integration of multi-dimensional data sets, and for constructing networks underlying complex biological processes. As cellular homeostasis and organismal survival are dependent on gene-gene and gene-environment interactions, we believe that network-based biology will provide the foundation for deciphering disease mechanisms and discovering novel drug targets for retinal neurodegenerative diseases. Copyright © 2015. Published by Elsevier Ltd.
    Progress in Retinal and Eye Research 02/2015; DOI:10.1016/j.preteyeres.2015.01.005 · 9.90 Impact Factor
