Improving PacBio Long Read Accuracy by Short Read Alignment
Kin Fai Au, Jason G. Underwood, Lawrence Lee, Wing Hung Wong
1 Department of Statistics, Stanford University, Stanford, California, United States of America, 2 Pacific Biosciences of California, Menlo Park, California, United States of America
The recent development of third generation sequencing (TGS) generates much longer reads than second generation
sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However,
higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method,
LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in
homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation
strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000
PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain
RNA-seq data. The results show that LSC can correct PacBio long reads and reduce the error rate more than 3-fold. The
improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq
studies. Compared with another hybrid correction tool, LSC achieves over double the sensitivity and similar specificity.
Citation: Au KF, Underwood JG, Lee L, Wong WH (2012) Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS ONE 7(10): e46679. doi:10.1371/
Editor: Yi Xing, University of Iowa, United States of America
Received June 8, 2012; Accepted September 2, 2012; Published October 4, 2012
Copyright: © 2012 Au et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the National Institutes of Health [R01HG005717 to K.F.A. and R01HD057970 to W.H.W.]. The funders had no role in study
design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: J.G.U. and L.L. are full-time employees and stock holders of Pacific Biosciences, a company commercializing single-molecule, real-time
nucleic acid sequencing technologies.
* E-mail: W.H.W. (firstname.lastname@example.org)
The advent of second generation sequencing (SGS) has opened
up a new era of genome-wide and transcriptome-wide research.
Currently, a single lane of an SGS instrument such as the HiSeq®
instrument from Illumina can generate 10
short reads (SR) of
length up to 200 bp with a low error rate (2%). While the high
read count of SGS allows for accurate quantitative analysis, the
relatively short length of the reads greatly reduces the utility of
SGS in tasks such as de novo genome assembly and full length
mRNA isoform reconstruction. Given RNA-seq with SRs only, the
reconstruction of gene isoforms must rely on assumptions. For
example, SLIDE uses statistical modeling under a sparsity
assumption, and Cufflinks imposes solution constraints. These
assumptions are not supported by direct experimental evidence
and may induce many false positives.
Recently, a third generation sequencing (TGS) technology
capable of much longer reads has become available. The PacBio
RS can yield reads of average length over 2,500 bp and some
longer reads can reach 10,000 bp. These continuous long
reads (CLR) can capture large isoform fragments or even full
length isoform transcripts, but they tend to have a high error
rate (up to 15%). The sequencing accuracy can be greatly
improved by an approach called “circular consensus sequencing”
(CCS), which uses the additional information from multiple passes
across the insert to achieve a higher intramolecular accuracy.
However, the requirement of 2 or more full passes across the
insert for CCS read generation limits the insert size to ~1.5 kb
for Pacific Biosciences’ C2 chemistry and sequencing mode, which
does not allow the interrogation of extremely long transcripts by CCS
reads. Furthermore, the number of reads from the PacBio
RS is only in the range of 50,000 per run. The relatively modest
throughput makes it difficult to obtain full sampling of the
Some researchers have attempted to combine PacBio long
reads and SGS short reads, for example in the genome assembler
Allpaths-LG. In this paper, we introduce an error correction
approach that combines the strengths of SGS and TGS for the
task of isoform assembly from RNA-seq data. In particular, we use
homopolymer compression (HC) transformation as a means to
allow accurate alignment of SR to LR. HC transformation has
previously proven useful in seeking possible
alignment matches of pyrosequencing reads (454 platform).
Since the SRs have lower sequencing error, the LR can be
modified based on information from the aligned SRs to form a
“corrected” LR with a much lower error rate than that of the
This method was implemented in the Python program LSC,
which is freely available for the research community and can be
downloaded at http://www.stanford.edu/~kinfai/LSC/LSC.html.
There are five main steps in LSC: HC transformation, SR
quality control, SR-LR alignment, error correction and
decompression transformation (Figure 1A).
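The first and last of these steps can be illustrated with a minimal Python sketch of a homopolymer compression and decompression pair. This is not LSC's actual code: the function names hc_compress and hc_decompress, the choice to store run lengths in a parallel list, and the toy sequences are all assumptions made for illustration. The sketch only shows why two reads that differ in homopolymer length become identical after compression, which is the property that makes SR-LR alignment more sensitive.

```python
from itertools import groupby


def hc_compress(seq):
    """Collapse every homopolymer run to a single base.

    Returns the compressed sequence and a parallel list of run lengths
    (hypothetical bookkeeping; LSC's own data layout is not shown here).
    """
    bases = []
    runs = []
    for base, group in groupby(seq):
        bases.append(base)
        runs.append(sum(1 for _ in group))
    return "".join(bases), runs


def hc_decompress(compressed, runs):
    """Expand a compressed sequence back using one run length per base."""
    if len(compressed) != len(runs):
        raise ValueError("need exactly one run length per compressed base")
    return "".join(base * n for base, n in zip(compressed, runs))


if __name__ == "__main__":
    # Toy example: the long read has two homopolymer length errors.
    short_read = "ACGGGTTTTAC"
    long_read = "ACGGTTTTTAC"

    sr_hc, sr_runs = hc_compress(short_read)
    lr_hc, lr_runs = hc_compress(long_read)

    # After compression the two reads match exactly.
    print(sr_hc == lr_hc)

    # Decompressing with the short read's run lengths recovers the
    # short read's version of the sequence.
    print(hc_decompress(lr_hc, sr_runs) == short_read)
```

In this toy case the run lengths from the more accurate short read replace those of the long read at decompression. How LSC itself chooses run lengths when several short reads cover the same position is not specified in this section, so that choice is left out of the sketch.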
PLOS ONE | www.plosone.org 1 October 2012 | Volume 7 | Issue 10 | e46679