Improving PacBio Long Read Accuracy by Short Read Alignment
Kin Fai Au1, Jason G. Underwood2, Lawrence Lee2, Wing Hung Wong1*
1Department of Statistics, Stanford University, Stanford, California, United States of America, 2Pacific Biosciences of California, Menlo Park, California, United States of America
The recent development of third generation sequencing (TGS) generates much longer reads than second generation
sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However,
higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method,
LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in
homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation
strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000
PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain
RNA-seq data. The results show that LSC can correct PacBio long reads to reduce the error rate by more than 3-fold. The
improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq
study. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.
Citation: Au KF, Underwood JG, Lee L, Wong WH (2012) Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS ONE 7(10): e46679. doi:10.1371/
Editor: Yi Xing, University of Iowa, United States of America
Received June 8, 2012; Accepted September 2, 2012; Published October 4, 2012
Copyright: © 2012 Au et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the National Institutes of Health [R01HG005717 to K.F.A. and R01HD057970 to W.H.W.]. The funders had no role in study
design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: J.G.U. and L.L. are full-time employees and stock holders of Pacific Biosciences, a company commercializing single-molecule, real-time
nucleic acid sequencing technologies.
* E-mail: W.H.W. (firstname.lastname@example.org)
The advent of second generation sequencing (SGS) has opened
up a new era of genome-wide and transcriptome-wide research.
Currently a single lane of an SGS instrument such as the HiSeq®
instrument from Illumina can generate 10^8 short reads (SR) of
length up to 200 bp with a low error rate (2%). While the high
read count of SGS allows for accurate quantitative analysis, the
relatively short length of the reads greatly reduces the utility of
SGS in tasks such as de novo genome assembly and full length
mRNA isoform reconstruction. Given RNA-seq with SRs only, the
reconstruction of gene isoforms must rely on assumptions. For
example, SLIDE uses statistical modeling under a sparsity
assumption, and Cufflinks imposes solution constraints. These
assumptions are not supported by direct experimental evidence
and may induce many false positives.
Recently, a third generation sequencing (TGS) technology
capable of much longer reads has become available. The PacBio
RS can yield reads of average length over 2,500 bp and some
longer reads can reach 10,000 bp. These continuous long
reads (CLR) can capture large isoform fragments or even
full-length isoform transcripts, but they tend to have a high error
rate (up to 15%). The sequencing accuracy can be greatly
improved by an approach called ‘‘circular consensus sequencing’’
(CCS), which uses the additional information from multiple passes
across the insert to build a higher intramolecular accuracy.
However, the requirement of 2 or more full passes across the
insert for CCS read generation limits the insert size to ~1.5 Kb
for Pacific Biosciences’ C2 chemistry and sequencing mode, which
prevents the interrogation of extremely long transcripts by CCS
reads. Furthermore, the number of reads per run from the PacBio
RS is only in the range of 50,000. The relatively modest
throughput makes it difficult to obtain full sampling of the
Some researchers have been attempting to combine PacBio long
reads and SGS short reads, for example the genome assembler
Allpaths-LG. In this paper, we introduce an error correction
approach that combines the strengths of SGS and TGS for the
task of isoform assembly from RNA-seq data. In particular, we use
homopolymer compression (HC) transformation as a means to
allow accurate alignment of SR to LR. HC transformation has
previously been proven useful in seeking possible
alignment matches of pyrosequencing reads (454 platform).
Since the SRs have lower sequencing error, the LR can be
modified based on information from the aligned SRs to form a
‘‘corrected’’ LR with a much lower error rate than that of the
This method was implemented in the Python program LSC,
which is freely available for the research community and can be
downloaded at http://www.stanford.edu/~kinfai/LSC/LSC.html.
There are five main steps in LSC: HC transformation, SR
quality control, SR-LR alignment, error correction and decom-
pression transformation (Figure 1A).
PLOS ONE | www.plosone.org | October 2012 | Volume 7 | Issue 10 | e46679
Figure 1. The workflow of standard LSC and the outline of error correction based on HC transformation. (a) LSC consists of five steps:
HC transformation of SRs and LRs, SR quality control, SR-LR alignment, error correction and decompression transformation. LSC outputs the sequence
from the left-most SR-covered point to the right-most SR-covered one. (b) In the SR-LR layout, correction points consist of four types: HC points, point
mismatches, deletions and insertions. Each correction point is treated independently and replaced by the consensus sequence from SRs for the first
three types. Insertion sequence at position i is treated as a whole at the gap between two positions i and i+1 of the compressed LRs. A consensus
sequence of this gap is inserted to the final output at the corresponding position.
Error Correction of PacBio Long Read
The sequences in LRs and SRs are transformed by homopol-
ymer compression so that each homopolymer run is replaced by a
single nucleotide of the same type, for example, a run of 5
consecutive adenosines, ‘‘AAAAA’’ is replaced by a single ‘‘A’’. If
the original SR is of reasonable length (≥75 bp), the
compressed SR is likely to still be of sufficient length to allow
reliable alignment to the compressed LRs. For a human
chromosome, the ratio of its HC compressed length to its original
length is around 0.65 (Table S1). For example, 75 bp SRs are
compressed to ~48 bp on average. However, in a compressed
read, any two consecutive nucleotides must be different, which
further reduces the information content useful for alignment.
Instead of having 4 degrees of freedom (A, C, G, T) at each new
position, as raw reads do, compressed reads possess only 3 degrees
of freedom. Therefore, in terms of finding alignment hits
by chance, a compressed read is equivalent to a regular read whose
length is scaled by a factor of log4(3), which is about 0.79. Thus an HC
transformed 75 bp SR (~48 bp after compression) is
roughly equivalent to a regular 40 bp SR in terms of its ability to
identify a genomic location. This is generally sufficient unless the
SR is from a repetitive region.
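The HC transformation and the length-equivalence argument above can be illustrated with a minimal Python sketch. The function names hc_compress and effective_length are hypothetical and are not taken from the LSC source code; the sketch only reproduces the reasoning in the text.

```python
import math
from itertools import groupby


def hc_compress(seq):
    """Collapse every homopolymer run to a single base (hypothetical helper)."""
    return "".join(base for base, _run in groupby(seq))


def effective_length(compressed_len):
    """Length of a regular read with roughly the same information content.

    Each position of a compressed read has 3 choices instead of 4,
    so the length is scaled by log4(3).
    """
    return compressed_len * math.log(3, 4)


print(hc_compress("AAAAACGGTTTA"))     # ACGTA
print(round(effective_length(48), 1))  # 38.0
```

Under these assumptions, a 75 bp SR that compresses to about 48 bp carries roughly as much positional information as a regular read of about 38 bp, in line with the "roughly 40 bp" estimate above.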
To illustrate this with real data, we used 64,313,204 × 75 bp
SGS (Illumina) reads as an example and mapped them (before or
after compression) to the human RefSeq annotation with Novoalign
(V2.07.10), allowing 1 bp mismatch. Before compression, 37,564,778 reads
were mapped to 70,247,943 hits. After compression, 37,556,821 reads
were mappable and were mapped to 70,468,788 hits.
Therefore, with compression, very little sensitivity is sacrificed in
mapping (0.02%) and also very little specificity (0.31%).
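The two percentages can be reproduced from the reported counts. The short Python check below assumes that the sensitivity figure is the relative loss in mapped reads and that the specificity figure is the relative increase in hits; the text does not state the formulas, so this reading is an assumption, and the variable names are illustrative.

```python
# Counts reported in the text (before and after homopolymer compression).
reads_before, hits_before = 37_564_778, 70_247_943
reads_after, hits_after = 37_556_821, 70_468_788

# Assumed definitions: relative loss of mapped reads and relative gain in hits.
sensitivity_loss = (reads_before - reads_after) / reads_before
extra_hits = (hits_after - hits_before) / hits_before

print(f"{sensitivity_loss:.2%}")  # 0.02%
print(f"{extra_hits:.2%}")        # 0.31%
```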
SR Quality Control
Some of the compressed SRs are of poor quality either because
they are very short or contain uncertain bases (indicated by ‘N’).
Such reads are likely to cause alignment errors. By default, LSC filters out
compressed SRs that have fewer than 40 ‘non-N’ nucleotides or
more than one ‘N’. These parameters can be changed in
the LSC configuration file according to data quality.
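The default filter described above can be sketched as a single predicate. The function name passes_qc and the parameter names min_non_n and max_n are hypothetical; they are not the option names used in the LSC configuration file.

```python
def passes_qc(compressed_sr, min_non_n=40, max_n=1):
    """Return True if a compressed SR survives the default filter.

    A read is dropped when it has fewer than `min_non_n` non-N bases
    or more than `max_n` N bases (parameter names are illustrative).
    """
    n_count = compressed_sr.upper().count("N")
    non_n = len(compressed_sr) - n_count
    return non_n >= min_non_n and n_count <= max_n


reads = ["ACGT" * 10, "ACGT" * 9, "N" + "ACGT" * 10 + "N"]
kept = [r for r in reads if passes_qc(r)]
print(len(kept))  # 1
```

Exposing both thresholds as arguments mirrors the statement that the parameters can be tuned to the data quality.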
The basic assumption of LR error correction is that the aligned
SRs and LRs are derived from the same region of the source
sequence. In genome sequencing, the source sequence is a segment
of genomic sequence, whereas in RNA-seq, it is a segment of a
cDNA. In either case, SRs can be mapped to LRs with only
allowance for substitutions or small indels. This is in contrast to
standard RNA-seq alignment to the reference genome which must
allow for large introns. In this work, we use Novoalign (V2.07.10)
as the default aligner. Novoalign is known to have high sensitivity
and specificity, but is computationally demanding. If computa-
tional efficiency is a concern, Novoalign can be replaced by a
faster alternative such as BWA or Seqalto.
As input to Novoalign, LSC concatenates the HC transformed
LRs together into human chromosome-sized reference sequences
with n bp poly-N inserts between successive LRs, where n is the
Figure 2. The histogram of sequence identities of ecLRs by LSC
and rLRs. Red bars are the LSC ecLRs and purple ones are rLRs. After
correction, many more reads have accuracy higher than 0.9.
Table 1. The comparison of the identities between LSC ecLRs and rLRs.
Sequence identity (I)    LSC ecLR           rLR
I ≤ 0.8                  8,485 (9.42%)      51,145 (56.81%)
0.8 < I ≤ 0.9            17,328 (19.25%)    27,431 (30.47%)
0.9 < I ≤ 0.95           26,043 (28.93%)    10,543 (11.71%)
0.95 < I ≤ 1.00          36,355 (40.38%)    875 (0.97%)
Figure 3. The comparison of sequence identities between SR-
covered and SR-uncovered regions in each ecLR. The sequence
identity distribution of SR-covered regions is in red and that of
SR-uncovered regions is in green.
length of original SRs. This is regarded as the reference genome
against which the HC transformed SRs are aligned.
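The construction of the alignment reference can be sketched as follows. This is a simplified illustration, not the LSC implementation: it builds a single reference string rather than several chromosome-sized ones, and the function name build_reference and the offset bookkeeping are assumptions added so that alignment coordinates can be traced back to individual LRs.

```python
def build_reference(compressed_lrs, sr_length):
    """Concatenate HC-transformed LRs with poly-N spacers of length `sr_length`.

    `compressed_lrs` is a list of (name, sequence) pairs. Returns the
    concatenated reference and the start offset of each LR within it.
    """
    spacer = "N" * sr_length
    pieces, offsets, position = [], {}, 0
    for name, seq in compressed_lrs:
        if pieces:  # separate successive LRs so no SR can span two of them
            pieces.append(spacer)
            position += sr_length
        offsets[name] = position
        pieces.append(seq)
        position += len(seq)
    return "".join(pieces), offsets


reference, offsets = build_reference([("lr1", "ACGT"), ("lr2", "TGCA")], 3)
print(reference)  # ACGTNNNTGCA
print(offsets)    # {'lr1': 0, 'lr2': 7}
```

Making the spacer as long as an original SR means that an SR cannot bridge two neighbouring LRs without covering mostly N bases.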
After SR alignment, the alignment of each HC transformed LR
and the corresponding SRs is laid out as in Figure 1B. The LR is
then modified according to consensus information from the
aligned SRs in order to correct LR errors. Error correction is
performed at four types of correction points: HC points, point
mismatches, deletions and insertions. HC points in both the LR and
the SRs are considered to be potential correction points. At each
correction point of the first three types, the sequences of all SRs
that cover it are temporarily decompressed, and their consensus
sequence replaces this correction point. For insertion
points, LSC first checks whether the majority of aligned SRs
have insertions at this point. If so, the consensus
decompressed sequence is inserted at this point.
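The per-point correction rule can be sketched in Python as below. The text does not specify how the consensus is computed, so the sketch assumes a simple plurality vote over the decompressed SR segments; all function names are hypothetical and the alignment layout itself is taken as given.

```python
from collections import Counter


def consensus(segments):
    """Most frequent decompressed segment (assumed plurality vote)."""
    return Counter(segments).most_common(1)[0][0]


def correct_point(lr_segment, sr_segments):
    """HC point, mismatch or deletion: replace the LR segment with the
    SR consensus, or keep it when no SR covers the point."""
    return consensus(sr_segments) if sr_segments else lr_segment


def insertion_at_gap(sr_insertions):
    """Insertion point: one entry per aligned SR, "" if that SR has no
    insertion here. Insert only when a majority of SRs carry one."""
    with_insertion = [s for s in sr_insertions if s]
    if 2 * len(with_insertion) > len(sr_insertions):
        return consensus(with_insertion)
    return ""


print(correct_point("AA", ["AAA", "AAA", "AA"]))  # AAA
print(insertion_at_gap(["G", "G", ""]))           # G
print(repr(insertion_at_gap(["G", "", ""])))      # ''
```

Treating each correction point independently keeps the logic local, which matches the description of the layout in Figure 1B.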
After all correction points are replaced with their SR-consensus
sequences, all remaining (i.e. uncorrected by SRs) HC points are
decompressed. Lastly, LSC outputs the decompressed sequence
from the left-most SR-covered point to the right-most SR-covered
point, while the SR-uncovered regions at the 3′/5′ ends are not
We tested LSC on two RNA-seq data sets: (1) human brain
cerebellum polyA RNA processed to enrich for full-length cDNA for
the PacBio RS platform under C2 chemistry conditions as CLR data
at 8,259 bp), and (2) human brain data from Illumina’s Human Body
Map 2.0 project (GSE30611, 64,313,204 single end reads, 75 bp) as
Figure 4. The scatter plots of SR-covered sequence percentage (SP) and sequence identity of ecLRs. (a) overview (b) zoom-in view from
SP of 0.2 to 1.0 and sequence identity from 0.8 to 1.0. Sequence identity is positively correlated with SP.
Table 2. The percentage of matches with the reference genome (hg19) in different SCD bins.
SCD    The percentage of matches    Number of matches    Number of mismatches
0      0.7250                       13,422,373           5,090,948
1      0.9073                       5,105,739            521,544
2      0.9413                       2,365,912            147,471
3      0.9547                       1,512,073            71,607
5      0.9659                       888,164              31,314
7      0.9723                       633,977              18,032
>7     0.9820                       20,980,352           385,334
SR data. After applying the default filtering described in the
‘‘Methods’’ section, 63,519,800 SRs were retained. LSC output
90,036 error-corrected LRs (ecLR).
Comparison between LRs with and without LSC Correction
The segment from the left-most SR-covered point to the
right-most SR-covered point in the raw LR is denoted as rLR. rLRs
were mapped to the human genome (hg19) by BLASR, which
was developed specifically for PacBio data alignment. ecLRs were
mapped to the human genome by BLAT. As a measure of LR
accuracy, the sequence identity (I) is defined as:
I~the number of matches
The majority of ecLRs have sequence identity (I) higher than
0.9. In contrast, the sequence identities of rLRs spread from 0.2 to
0.9 (Figure 2). Out of all 90,036 ecLRs, 62,465 (69.31%) have
identity higher than or equal to 0.9, while only 11,418 rLRs pass
this cutoff (Table 1). This suggests that ecLRs will be much more
informative than rLRs in downstream analyses such as assembly of
Figure 5. The histogram of lengths of LSC ecLRs (red bars) and
PacBioToCA ecLRs (purple bars). There are many more ecLRs from
LSC than from PacBioToCA in every bin.
Figure 6. The pie chart of the LSC ecLRs (I ≥ 0.9). The LSC ecLRs are categorized by their identities and lengths. 22.40% of these outputs are
comparable to the PacBioToCA results, while LSC also outputs many other ecLRs of various lengths with good accuracy.
Table 3. The comparison of the overall performance of LSC
Running time       10 hours (8 hours Novoalign + 2 hours LSC)
Hard disk usage    20 G                            800 ~ 1,000 G
Output             Top 13,995 ecLRs (≥460 bp)*     13,995 ecLRs
Averaged length    917 bp                          880 bp
Averaged I         0.9664                          0.9603
*Among all LSC ecLRs longer than or equal to 460 bp, the best 13,995 outputs
have sequence identities I ≥ 0.94586.