Improving PacBio Long Read Accuracy by Short Read Alignment

Department of Statistics, Stanford University, Stanford, California, United States of America.
PLoS ONE (Impact Factor: 3.23). 10/2012; 7(10):e46679. DOI: 10.1371/journal.pone.0046679
Source: PubMed


The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show that LSC can correct PacBio long reads and reduce the error rate more than 3-fold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq studies. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.
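The abstract does not spell out the homopolymer compression (HC) transformation, so the following is only a minimal sketch under one assumption: that HC collapses every run of identical bases to a single base while recording the run lengths. The function names and the example reads are hypothetical and are not taken from LSC itself.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases to one base.

    Returns the compressed sequence plus the original run lengths,
    so the transformation can be reversed later.
    (Assumed definition of HC; not LSC's actual code.)
    """
    bases = []
    run_lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(sum(1 for _ in run))
    return "".join(bases), run_lengths


def homopolymer_expand(compressed, run_lengths):
    """Rebuild a sequence from its compressed form and run lengths."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


# Hypothetical reads: the long read has one extra base in two homopolymer runs.
long_read = "AAACCGTTTTA"
short_read = "AACCGTTTA"

lr_compressed, lr_runs = homopolymer_compress(long_read)
sr_compressed, sr_runs = homopolymer_compress(short_read)
```

In this toy case the two reads differ in raw form but become identical after compression, which illustrates why such a transformation could make SR-LR alignment more sensitive to reads whose errors sit in homopolymer runs. Under this assumption, the recorded short-read run lengths could then be used to restore the corrected run sizes.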

    • "To further extend the R. microplus mitochondrial genome coverage, we used the Illumina assembled contigs to identify raw Pacbio reads with sequence similarity to validated mitochondrial contigs. These raw Pacbio reads were error-corrected using LSC 1.alpha (Au et al., 2012) and then utilized to resolve repeats within the R. microplus Texas mitochondrial genome. ClustalW from the MacVector 12.7.5 software suite (MacVector, Inc., Cary, NC, USA) was used to align the mitochondrial genomes of "
    ABSTRACT: The cattle fever tick, Rhipicephalus (Boophilus) microplus, is one of the most significant medical veterinary pests in the world, vectoring several serious livestock diseases that negatively impact the agricultural economies of tropical and subtropical countries around the world. In our study, we assembled the complete R. microplus mitochondrial genome from Illumina and PacBio sequencing reads obtained from the ongoing R. microplus (Deutsch strain from Texas, USA) genome sequencing project. We compared the Deutsch strain mitogenome to the mitogenome from a Brazilian R. microplus and from an Australian cattle tick that has recently been taxonomically designated as Rhipicephalus australis after previously being considered R. microplus. The sequence divergence of the Texas and Australia ticks is much higher than the divergence between the Texas and Brazil ticks. This is consistent with the idea that the Australian ticks are distinct from the R. microplus of the Americas. Copyright © 2015. Published by Elsevier B.V.
    Gene 06/2015; 571(1). DOI:10.1016/j.gene.2015.06.060 · 2.14 Impact Factor
    • "All of the SMRT subreads were mapped against the S. miltiorrhiza genome, with 96% of the reads successfully mapped using BLAT (Kent, 2002; Figure S1). To resolve the high error rates of the subreads, all 796 011 SMRT subreads were corrected using the approximately 500 million NGS reads as input data (Figure 2; Au et al., 2012). After removing the redundant sequences for all SMRT subreads using CD-HIT-EST (c = 0.90), 160 468 non-redundant reads were produced, with a mean read length of 2059 bases. "
    ABSTRACT: Danshen, Salvia miltiorrhiza Bunge, is one of the most widely utilized herbs in traditional Chinese medicine, wherein its rhizome/roots are particularly valued. The corresponding bioactive components include the tanshinone diterpenoids, whose biosynthesis is a subject of considerable interest. Previous investigations of the S. miltiorrhiza transcriptome have relied on short-read next-generation sequencing (NGS) technology, and the vast majority of the resulting isotigs do not represent full-length cDNA sequences. Moreover, these efforts have been targeted at either whole plants or hairy root cultures. Here we demonstrate that the tanshinone pigments are produced and accumulate in the root periderm, and apply a combination of NGS and single molecule real-time (SMRT) sequencing to various root tissues, particularly including the periderm, to provide a more complete view of the S. miltiorrhiza transcriptome, with further insight into tanshinone biosynthesis as well. In addition, use of SMRT long-read sequencing offered the ability to examine alternative splicing, which was found to occur in approximately 40% of the detected gene loci, including several involved in isoprenoid/terpenoid metabolism. This article is protected by copyright. All rights reserved.
    The Plant Journal 04/2015; 82(6). DOI:10.1111/tpj.12865 · 5.97 Impact Factor
    • "(i) The hierarchical genome assembly process (HGAP) uses shorter SMRT reads contained within longer reads to generate pre-assemblies and to calculate consensus sequences (Chin et al., 2013). (ii) PacBioToCA (Koren et al., 2012) and LSC (Au et al., 2012) utilize Illumina short reads in a hybrid approach to correct SMRT reads. Indeed, these approaches result in higher quality LRs. "
    ABSTRACT: Motivation: Today, the base code of DNA is mostly determined through sequencing by synthesis as provided by the Illumina sequencers. Although highly accurate, resulting reads are short, making their analyses challenging. Recently, a new technology, single molecule real-time (SMRT) sequencing, was developed that could address these challenges, as it generates reads of several thousand bases. But, their broad application has been hampered by a high error rate. Therefore, hybrid approaches that use high-quality short reads to correct erroneous SMRT long reads have been developed. Still, current implementations have great demands on hardware, work only in well-defined computing infrastructures and reject a substantial amount of reads. This limits their usability considerably, especially in the case of large sequencing projects. Results: Here we present proovread, a hybrid correction pipeline for SMRT reads, which can be flexibly adapted on existing hardware and infrastructure from a laptop to a high-performance computing cluster. On genomic and transcriptomic test cases covering Escherichia coli, Arabidopsis thaliana and human, proovread achieved accuracies up to 99.9% and outperformed the existing hybrid correction programs. Furthermore, proovread-corrected sequences were longer and the throughput was higher. Thus, proovread combines the most accurate correction results with an excellent adaptability to the available hardware. It will therefore increase the applicability and value of SMRT sequencing. Availability and implementation: proovread is available at the following URL:
    Bioinformatics 07/2014; 30(21). DOI:10.1093/bioinformatics/btu392 · 4.98 Impact Factor