Article

Improving PacBio Long Read Accuracy by Short Read Alignment

Department of Statistics, Stanford University, Stanford, California, United States of America.
PLoS ONE (Impact Factor: 3.23). 10/2012; 7(10):e46679. DOI: 10.1371/journal.pone.0046679
Source: PubMed
ABSTRACT
The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show that LSC can correct PacBio long reads, reducing the error rate more than 3-fold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq studies. Compared with another hybrid correction tool, LSC achieves more than double the sensitivity with similar specificity.

Kin Fai Au¹, Jason G. Underwood², Lawrence Lee², Wing Hung Wong¹*
1 Department of Statistics, Stanford University, Stanford, California, United States of America, 2 Pacific Biosciences of California, Menlo Park, California, United States of
America
Citation: Au KF, Underwood JG, Lee L, Wong WH (2012) Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS ONE 7(10): e46679. doi:10.1371/
journal.pone.0046679
Editor: Yi Xing, University of Iowa, United States of America
Received June 8, 2012; Accepted September 2, 2012; Published October 4, 2012
Copyright: © 2012 Au et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the National Institutes of Health [R01HG005717 to K.F.A. and R01HD057970 to W.H.W.]. The funders had no role in study
design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: J.G.U. and L.L. are full-time employees and stock holders of Pacific Biosciences, a company commercializing single-molecule, real-time
nucleic acid sequencing technologies.
* E-mail: W.H.W. (whwong@stanford.edu)
Introduction
The advent of second generation sequencing (SGS) has opened
up a new era of genome-wide and transcriptome-wide research.
Currently a single lane of an SGS instrument such as the HiSeq®
instrument from Illumina can generate 10^8 short reads (SR) of
length up to 200 bp with a low error rate (2%) [1]. While the high
read count of SGS allows for accurate quantitative analysis, the
relatively short length of the reads greatly reduces the utility of
SGS in tasks such as de novo genome assembly and full length
mRNA isoform reconstruction. Given RNA-seq with SRs only, the
reconstruction of gene isoforms must rely on assumptions. For
example, SLIDE [2] uses statistical modeling under a sparsity
assumption, and Cufflinks [3] imposes solution constraints. These
assumptions are not supported by direct experimental evidence
and may induce many false positives.
Recently, a third generation sequencing (TGS) technology
capable of much longer reads has become available. The PacBio
RS can yield reads of average length over 2,500 bp and some
longer reads can reach 10,000 bp [4]. These continuous long
reads (CLR) can capture large isoform fragments or even full
length isoform transcripts. However, these CLRs tend to have a high error
rate (up to 15%). The sequencing accuracy can be greatly
improved by an approach called "circular consensus sequencing"
(CCS), which uses the additional information from multiple passes
across the insert to build a consensus with higher intramolecular accuracy.
However, the requirement of 2 or more full passes across the
insert for CCS read generation limits the insert size to ~1.5 Kb
for Pacific Biosciences' C2 chemistry and sequencing mode, not
allowing the interrogation of extremely long transcripts by CCS
reads. Furthermore, the number of reads per run from the PacBio
RS is only in the range of 50,000. The relatively modest
throughput makes it difficult to obtain full sampling of the
transcriptome.
Some researchers have attempted to combine PacBio long
reads and SGS short reads, for example in the genome assembler
Allpaths-LG [5]. In this paper, we introduce an error correction
approach that combines the strengths of SGS and TGS for the
task of isoform assembly from RNA-seq data. In particular, we use
homopolymer compression (HC) transformation as a means to
allow accurate alignment of SRs to LRs. HC transformation has
previously proven useful in seeking possible
alignment matches [6] of pyrosequencing reads (454 platform).
Since the SRs have a lower sequencing error rate, the LR can be
modified based on information from the aligned SRs to form a
"corrected" LR with a much lower error rate than that of the
original LR.
This method was implemented in the Python program LSC,
which is freely available to the research community and can be
downloaded at http://www.stanford.edu/~kinfai/LSC/LSC.html.
Methods
There are five main steps in LSC: HC transformation, SR
quality control, SR-LR alignment, error correction and
decompression transformation (Figure 1A).
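As a rough illustration of the first and last of these steps, the sketch below collapses each homopolymer run to a single base and records the run lengths so the sequence can later be expanded again. It is not taken from the LSC source code: the function names hc_compress and hc_decompress, the toy reads, and the choice to restore run lengths from the short read are assumptions made only for this example.

```python
from itertools import groupby


def hc_compress(seq):
    """Collapse every homopolymer run to one base.

    Returns the compressed sequence and the length of each original run,
    which is what makes a later decompression possible.
    (Hypothetical helper, not the LSC implementation.)
    """
    bases = []
    run_lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(len(list(run)))
    return "".join(bases), run_lengths


def hc_decompress(compressed, run_lengths):
    """Expand a compressed sequence using one run length per base."""
    if len(compressed) != len(run_lengths):
        raise ValueError("need exactly one run length per compressed base")
    return "".join(base * n for base, n in zip(compressed, run_lengths))


# Toy reads: the same underlying sequence, differing only in the
# lengths of two homopolymer runs (made-up data for illustration).
long_read = "AAACCGTTTTA"
short_read = "AACCGTTTA"

lr_hc, lr_runs = hc_compress(long_read)
sr_hc, sr_runs = hc_compress(short_read)

# After compression the two reads are identical, so they can be matched
# even though their raw sequences differ in homopolymer lengths.
print(lr_hc, lr_runs)
print(sr_hc, sr_runs)

# Illustrative assumption: take the run lengths from the more accurate
# short read when expanding the compressed long read.
print(hc_decompress(lr_hc, sr_runs))
```

In this toy case, reads that disagree only in homopolymer length become identical in compressed space, which is the intuition behind aligning SRs to LRs after HC transformation.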
