Improving PacBio Long Read Accuracy by Short Read Alignment

Department of Statistics, Stanford University, Stanford, California, United States of America.
PLoS ONE (Impact Factor: 3.23). 10/2012; 7(10):e46679. DOI: 10.1371/journal.pone.0046679
Source: PubMed


Recently developed third-generation sequencing (TGS) technologies generate much longer reads than second-generation sequencing (SGS) and thus provide a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback of most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) using SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs on the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show that LSC can correct PacBio long reads and reduce the error rate by more than 3-fold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq studies. Compared with another hybrid correction tool, LSC achieves over double the sensitivity with similar specificity.
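The homopolymer compression idea mentioned in the abstract can be illustrated with a minimal sketch. This is not LSC's actual code: the function names `compress_homopolymers` and `decompress_homopolymers` are hypothetical, and the sketch only assumes that HC collapses each run of identical bases to a single base while recording the run lengths, so that reads differing only in homopolymer length become identical before alignment.

```python
from itertools import groupby


def compress_homopolymers(seq):
    """Collapse each run of identical bases to one base.

    Returns the compressed sequence and the list of run lengths,
    which is enough to rebuild the original sequence.
    (Hypothetical helper for illustration, not part of LSC.)
    """
    bases = []
    run_lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(len(list(run)))
    return "".join(bases), run_lengths


def decompress_homopolymers(compressed, run_lengths):
    """Rebuild the original sequence from its compressed form."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


# A short read and a long read that differ only in homopolymer lengths
# (one missing A, one extra T) compress to the same string.
short_read = "AAACCGTTTT"
long_read = "AACCGTTTTT"
sr_hc, sr_runs = compress_homopolymers(short_read)
lr_hc, lr_runs = compress_homopolymers(long_read)
```

In this toy case both reads compress to `ACGT`, so an aligner working in compressed space would see no mismatch, and the recorded run lengths of the short read (`[3, 2, 1, 4]`) could then be used to restore the homopolymer lengths.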

    • "Despite recent advances, the production of reference genomes remains hampered by factors such as a high repeat content, gene and genome duplication[4,5]. Incorporating repeat spanning mate pair (MP) data and newer long read third generation sequencing platforms such as Single Molecule Real-Time (SMRT) DNA sequencing have partially resolved this, though the high error rates of long read data can also confound accurate assembly[6,7]. The accuracy of a final assembly is "
    ABSTRACT: Background: There has been an exponential growth in the number of genome sequencing projects since the introduction of next generation DNA sequencing technologies. Genome projects have increasingly involved assembly of whole genome data which produces inferior assemblies compared to traditional Sanger sequencing of genomic fragments cloned into bacterial artificial chromosomes (BACs). While whole genome shotgun sequencing using next generation sequencing (NGS) is relatively fast and inexpensive, this method is extremely challenging for highly complex genomes, where polyploidy or high repeat content confounds accurate assembly, or where a highly accurate 'gold' reference is required. Several attempts have been made to improve genome sequencing approaches by incorporating NGS methods, with variable success. Results: We present the application of a novel BAC sequencing approach which combines indexed pools of BACs, Illumina paired read sequencing, a sequence assembler specifically designed for complex BAC assembly, and a custom bioinformatics pipeline. We demonstrate this method by sequencing and assembling BAC cloned fragments from bread wheat and sugarcane genomes. Conclusions: We demonstrate that our assembly approach is accurate, robust, cost effective and scalable, with applications for complete genome sequencing in large and complex genomes.
    Preview · Article · Dec 2016 · Plant Methods
    • "The technology has proven useful for unraveling isoform diversity at complex loci[186], and for determining allele-specific expression from single reads[187]. Nevertheless, long-read sequencing has its own set of limitations, such as a still high error rate that limits de novo transcript identifications and forces the technology to leverage the reference genome[188]. Moreover, the relatively low throughput of SMRT cells hampers the quantification of transcript expression. "

    Full-text · Article · Jan 2016 · Genome Biology
    • "To further extend the R. microplus mitochondrial genome coverage, we used the Illumina assembled contigs to identify raw Pacbio reads with sequence similarity to validated mitochondrial contigs. These raw Pacbio reads were error-corrected using LSC 1.alpha (Au et al., 2012) and then utilized to resolve repeats within the R. microplus Texas mitochondrial genome. ClustalW from the MacVector 12.7.5 software suite (MacVector, Inc., Cary, NC, USA) was used to align the mitochondrial genomes of "
    ABSTRACT: The cattle fever tick, Rhipicephalus (Boophilus) microplus, is one of the most significant medical veterinary pests in the world, vectoring several serious livestock diseases that negatively impact the agricultural economies of tropical and subtropical countries around the world. In our study, we assembled the complete R. microplus mitochondrial genome from Illumina and PacBio sequencing reads obtained from the ongoing R. microplus (Deutsch strain from Texas, USA) genome sequencing project. We compared the Deutsch strain mitogenome to the mitogenome from a Brazilian R. microplus and from an Australian cattle tick that has recently been taxonomically designated as Rhipicephalus australis after previously being considered R. microplus. The sequence divergence of the Texas and Australia ticks is much higher than the divergence between the Texas and Brazil ticks. This is consistent with the idea that the Australian ticks are distinct from the R. microplus of the Americas. Copyright © 2015. Published by Elsevier B.V.
    Full-text · Article · Jun 2015 · Gene