Incorporating sequence quality data into alignment improves DNA read mapping.
ABSTRACT New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the amount of correctly mapped reads from 49 to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
Article: Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning.
ABSTRACT: Cytosine DNA methylation is important in regulating gene expression and in silencing transposons and other repetitive sequences. Recent genomic studies in Arabidopsis thaliana have revealed that many endogenous genes are methylated either within their promoters or within their transcribed regions, and that gene methylation is highly correlated with transcription levels. However, plants have different types of methylation controlled by different genetic pathways, and detailed information on the methylation status of each cytosine in any given genome is lacking. To this end, we generated a map at single-base-pair resolution of methylated cytosines for Arabidopsis, by combining bisulphite treatment of genomic DNA with ultra-high-throughput sequencing using the Illumina 1G Genome Analyser and Solexa sequencing technology. This approach, termed BS-Seq, unlike previous microarray-based methods, allows one to sensitively measure cytosine methylation on a genome-wide scale within specific sequence contexts. Here we describe methylation on previously inaccessible components of the genome and analyse the DNA methylation sequence composition and distribution. We also describe the effect of various DNA methylation mutants on genome-wide methylation patterns, and demonstrate that our newly developed library construction and computational methods can be applied to large genomes such as that of mouse.Nature 04/2008; 452(7184):215-9. · 36.28 Impact Factor
ABSTRACT: New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.Genome Research 09/2008; 18(11):1851-8. · 13.61 Impact Factor
ABSTRACT: The nucleotide sequencing process produces not only the sequence of nucleotides, but also associated quality values. Quality values provide valuable information, but are primarily used only for trimming sequences and generally ignored in subsequent analyses. This article describes how the scoring schemes of standard alignment algorithms can be modified to take into account quality values to produce improved alignments and statistically more accurate scores. A prototype implementation is also provided, and used to post-process a set of BLAST results. Quality-adjusted alignment is a natural extension of standard alignment methods, and can be implemented with only a small constant factor performance penalty. The method can also be applied to related methods including heuristic search algorithms like BLAST and FASTA. http://malde.org/~ketil/qaa.Bioinformatics 05/2008; 24(7):897-900. · 5.47 Impact Factor
Incorporating sequence quality data into alignment
improves DNA read mapping
Martin C. Frith*, Raymond Wan and Paul Horton
Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, 2-4-7 Aomi,
Koto-ku, Tokyo 135-0064, Japan
Received November 13, 2009; Revised January 5, 2010; Accepted January 6, 2010
New DNA sequencing technologies have achieved
breakthroughs in throughput, at the expense of
higher error rates. The primary way of interpreting
biological sequences is via alignment, but standard
alignment methods assume the sequences are
accurate. Here, we describe how to incorporate
the per-base error probabilities reported by seq-
uencers into alignment. Unlike existing tools for
DNA read mapping, our method models both
sequencer errors and real sequence differences.
This approach consistently improves mapping
accuracy, even when the rate of real sequence
difference is only 0.2%. Furthermore, when mapping
Drosophila melanogaster reads to the Drosophila
simulans genome, it increased the amount of
correctly mapped reads from 49 to 66%. This
approach enables more effective use of DNA reads
from organisms that lack reference genomes, are
extinct or are highly polymorphic.
INTRODUCTION
The major approach to interpreting biological sequences is
to align them to other sequences. As a result, alignment
algorithms such as BLAST are important and ubiquitous.
Standard alignment algorithms assume that the sequences
are accurate, and ignore per-base quality data that is
typically available from DNA sequencing instruments.
Recent sequencing technologies, however, have achieved
breakthroughs in throughput, at the expense of higher
error rates. It has thus become more important to
consider the quality data during the initial analysis step,
which is nearly always some form of alignment.
Surprisingly, we can find no previous method that
systematically incorporates quality data into sequence
alignment. Several methods for mapping DNA reads to
genomes do use quality data, but they lack scoring
matrices that model differences other than sequencing
errors [e.g. (1,2)]. Instead, a limited form of alignment is
employed, which assumes that the sequences are (almost)
identical apart from sequencing errors. A publication by
Malde describes alignment using quality data, but this also
replaces the scoring matrix with quality-derived scores
instead of combining the two (3). The work of Na et al.
(4) does combine a standard score matrix with quality
scores, but has some serious drawbacks that we describe
later in this article.
In this article, we provide an effective solution for the
task of xeno-mapping, the mapping of reads onto a refer-
ence genome which may differ from the genomic source
of the reads. Xeno-mapping is important for several
reasons. First, the vast majority of species currently lack
reference genome sequences. If we obtained DNA reads
from (say) zebra, the best way to interpret them would
probably involve mapping them to the horse genome.
It might take a decade before all 5000 mammal species
are sequenced, and longer if ever for the millions of
insect species, only a fraction of which have even been
described. Second, DNA from extinct organisms is
fascinating, but assembling genomes from their meager
DNA is at best hard and at worst impossible, and so
modern genomes are typically used as ‘scaffolds’, e.g.
mammoth reads versus the elephant genome (5). Finally,
many wild organisms are highly polymorphic, extreme
examples being Ciona intestinalis (1.2%) and Ciona
savignyi (4.6%) (6), so that real sequence differences
are frequent even when aligning sequences from the
same species.
Traditional sequence alignment
Traditional sequence alignment methods (e.g. BLAST)
allow for sequence differences by using a scoring
scheme: matching bases in an alignment get positive
scores, and mismatches and gaps receive negative scores.
More generally, they use a scoring matrix $S$, where $S_{xy}$
($x, y \in \{a, c, g, t\}$) specifies the score for aligning the
*To whom correspondence should be addressed. Tel: +81 3 3599 8080; Fax: +81 3 3599 8081; Email: email@example.com
Published online 27 January 2010
Nucleic Acids Research, 2010, Vol. 38, No. 7 e100
© The Author(s) 2010. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
nucleotides $x$ and $y$. The scoring matrix can be interpreted
as a log likelihood ratio:
$$S_{xy} = T \ln(R_{xy}), \tag{1}$$
where $T$ is an arbitrary scaling factor, and $R_{xy}$ is the likelihood ratio:
$$R_{xy} = \frac{P(xy|A)}{P(x)\,P(y)}.$$
Here, $P(xy|A)$ is the probability of observing $x$ aligned to
$y$ in a probabilistic model of aligned sequences, and $P(x)$
and $P(y)$ are the probabilities of observing $x$ and $y$
on their own (the background frequencies). The probabilities
$P(xy|A)$ are called the 'target frequencies' of the scoring matrix (7). A scoring matrix
is optimal for distinguishing related sequences from
chance similarities when its target frequencies equal
those in an accurate alignment of related sequences
(7,8). For example, to construct an optimal scoring
matrix for 90% identical sequences of uniform composition,
we would set the match and mismatch frequencies to:
$$P(xy|A) = \begin{cases} 0.9/4 & \text{for the four match frequencies } (x = y) \\ 0.1/12 & \text{for the twelve mismatch frequencies } (x \neq y). \end{cases}$$
The 12 kinds of mutations are, however, not equally
frequent: usually, the four types of transitions (A ↔ G
and T ↔ C) occur more often than the eight types of
transversions. If 60% of substitutions are transitions, we
would use these mismatch frequencies:
$$P(xy|A) = \begin{cases} 0.1 \times 0.6/4 & \text{for transitions} \\ 0.1 \times 0.4/8 & \text{for transversions.} \end{cases}$$
In short, we know how to construct optimal scoring
matrices for sequences with a known degree of divergence.
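The construction described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a uniform background frequency of 0.25 per base and a scaling factor T of 1; the function names `target_frequencies` and `scoring_matrix` are hypothetical and do not come from any published software.

```python
import math

BASES = "acgt"
# Ordered pairs that count as transitions (purine<->purine, pyrimidine<->pyrimidine).
TRANSITIONS = {("a", "g"), ("g", "a"), ("c", "t"), ("t", "c")}


def target_frequencies(identity=0.9, transition_fraction=0.6):
    """Return P(xy|A) for all 16 ordered base pairs (hypothetical helper)."""
    mismatch = 1.0 - identity
    freqs = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                freqs[x, y] = identity / 4
            elif (x, y) in TRANSITIONS:
                freqs[x, y] = mismatch * transition_fraction / 4
            else:
                freqs[x, y] = mismatch * (1 - transition_fraction) / 8
    return freqs


def scoring_matrix(freqs, scale=1.0, background=0.25):
    """S_xy = T * ln(P(xy|A) / (P(x) P(y))), assuming uniform background."""
    return {
        pair: scale * math.log(p / (background * background))
        for pair, p in freqs.items()
    }


if __name__ == "__main__":
    scores = scoring_matrix(target_frequencies())
    for x in BASES:
        print(x, " ".join(f"{scores[x, y]:6.2f}" for y in BASES))
```

With these settings, matches receive a positive score, and transitions are penalized less heavily than transversions, as the text leads us to expect.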
Sequencer error probabilities
DNA sequencing instruments can report a probability
that each sequenced base is erroneous. These probabilities
are usually reported in a log-transformed form, called
‘phred score’ or ‘quality score’:
$$\text{quality score} = -10 \log_{10}(\varepsilon).$$
Here, $\varepsilon$ is the error probability. So, for example, an
error probability of 0.01 is reported as a quality score of
20. The following variant is sometimes used instead:
$$\text{quality score} = -10 \log_{10}(\varepsilon / (1 - \varepsilon)).$$
Sometimes, four quality scores are reported per base,
which reflect the probability that the base is a, c, g or t.
There are a few common file formats for sequence
qualities, including FASTQ (one quality score per base)
and PRB (four quality scores per base).
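The two quality-score definitions can be inverted to recover error probabilities, as in the following sketch. The function names are hypothetical. The FASTQ decoder assumes the common convention of storing each score as an ASCII character with an offset of 33; some older Illumina files use an offset of 64, so the offset is left as a parameter.

```python
import math


def quality_from_error(error_prob):
    """Phred-style score: -10 * log10(error probability)."""
    return -10 * math.log10(error_prob)


def error_from_quality(quality):
    """Invert the phred-style score."""
    return 10 ** (-quality / 10)


def error_from_odds_quality(quality):
    """Invert the variant score -10 * log10(e / (1 - e))."""
    odds = 10 ** (-quality / 10)
    return odds / (1 + odds)


def decode_fastq_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to integer scores.

    The offset of 33 is an assumption about the file encoding.
    """
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    for q in decode_fastq_qualities("I5!"):
        print(q, error_from_quality(q), error_from_odds_quality(q))
```

The two definitions nearly coincide for high scores and diverge for low ones: a variant score of 0 corresponds to an error probability of 0.5, whereas a phred score of 0 corresponds to an error probability of 1.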
Median and mean error probabilities for two sets of
Illumina reads are shown in Figure 1. They exhibit a
typical pattern where error rates increase toward the end
of the read. In the first data set, the median error proba-
bility is <1% for all but the last six bases. The mean error
rates tend to be much higher than the medians, due to the
skewed distributions (i.e. most of the error probabilities
are low, but a few are much higher). In the second data
set, there was clearly a problem with the 32nd base, since
the average reported error probability is exactly 1: this
phenomenon is not unusual (1). These two data sets are
not especially atypical (Supplementary Figure S1).
In this study, we describe a way to merge sequence
quality data into the traditional sequence alignment
framework. This means that we model both sequencer
errors and real sequence differences at the same time.
[Figure 1, panels A and B. Horizontal axis: position in read.]
Figure 1. Estimated error rates for two DNA short-read data sets. (A) Error rates for a set of 36-bp reads from the Solexa 1G Genome Analyzer
(the first 100000 reads of SRR001981). (B) Error rates for a set of 51-bp reads from the Illumina Genome Analyzer II (the first 100000 reads of
SRR016157). For both panels, the error rates were obtained from FASTQ files in the NCBI Short Read Archive.
We apply the method to simulated DNA reads (where we
know the correct mapping locations), and show that it
improves mapping accuracy compared with modeling
either only sequencer errors or only real sequence differ-
ences. Finally, we show a dramatic improvement in
mapping of real Drosophila melanogaster reads to the
Drosophila simulans genome (which simulates mapping
mammoth reads to the elephant genome or the like).
MATERIALS AND METHODS
In this section, we show how quality scores can be used for
aligning sequences and mapping DNA reads to genomes.
While these techniques are described in sufficient detail to
be incorporated into any alignment software, we also give
a brief summary of a publicly available system called
LAST that includes these features.
Incorporating sequence quality data into alignment
We wish to extend the standard scoring matrix derivation
to take sequencing error probabilities into account. In our
scenario, one sequence (the read) has per-base error
probabilities, and the other sequence (the genome) does
not. We assume that the sequencing instrument estimates
$P(y|d)$, the probability that a base is $y$ (where $y \in \{a, c, g, t\}$)
based on some observed data d (e.g. image intensities).
Following the likelihood-ratio principle, we define a generalization
of standard substitution scores:
$$R'_{xd} = \frac{P(xd|A)}{P(x)\,P(d)}.$$
This formula can be rearranged by observing,
$$P(xd|A) = \sum_{y} P(xy|A)\,P(d|y),$$
and by Bayes formula,
$$P(d|y) = \frac{P(y|d)\,P(d)}{P(y)}.$$
Simplifying, we obtain:
$$R'_{xd} = \sum_{y} R_{xy}\,P(y|d). \tag{11}$$
Finally, we define scores by the usual log transformation:
$$S'_{xd} = T \ln R'_{xd}.$$
This scoring scheme can be implemented efficiently by
converting each length $n$ read to a $4 \times n$ position-specific
scoring matrix, which holds the scores for aligning a, c, g
and t to each position in the read. Details of how our
software performs these calculations are provided in the
Supplementary Data.
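As an illustration of the position-specific scoring matrix idea, the sketch below computes, for each read position, the score of aligning each genome base as T times the log of the sum over possible true bases y of R_xy weighted by P(y|d). It rests on two assumptions that are not taken from the article: when only one quality score per base is available, the error probability is spread evenly over the three other bases, and the likelihood ratios come from a simple 90%-identity model with uniform composition. All function names are hypothetical, and this is not the LAST implementation.

```python
import math

BASES = "acgt"


def uniform_ratios(identity=0.9, background=0.25):
    """Likelihood ratios R_xy for a simple match/mismatch model (assumed)."""
    match = (identity / 4) / (background * background)
    mismatch = ((1 - identity) / 12) / (background * background)
    return {(x, y): match if x == y else mismatch for x in BASES for y in BASES}


def base_probabilities(called_base, error_prob):
    """P(y|d), assuming errors are equally likely to yield each other base."""
    return {
        y: (1 - error_prob) if y == called_base else error_prob / 3
        for y in BASES
    }


def read_to_pssm(read, error_probs, ratios, scale=1.0):
    """Return one column per read position: {genome base x: score}."""
    pssm = []
    for called, err in zip(read, error_probs):
        probs = base_probabilities(called, err)
        column = {}
        for x in BASES:
            ratio = sum(ratios[x, y] * probs[y] for y in BASES)
            column[x] = scale * math.log(ratio) if ratio > 0 else -math.inf
        pssm.append(column)
    return pssm


if __name__ == "__main__":
    for column in read_to_pssm("ac", [0.0, 0.75], uniform_ratios()):
        print({x: round(s, 3) for x, s in column.items()})
```

An error-free position reproduces the ordinary matrix scores, while a position with error probability 0.75 (no information about the base) scores approximately zero against every genome base.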
Comparison with expected score method
Na et al. (4) proposed a method for combining quality
scores with traditional scoring matrices. Although pre-
sented in a more general form, which allows for
mismatch and indel sequencing errors in both sequences,
overall their method is similar to ours. They consider
aligning two sequences, each one represented as a 4 × n
matrix, holding the probability of each base at each
position. Unfortunately, their method suffers from two
problems. First, the computation is not well justified the-
oretically, and second, the method breaks down when very
similar sequences are aligned.
In their notation, the score of an alignment column is
$$\gamma P_m + \delta P_n + \mu P_g, \tag{13}$$
where $\gamma$, $\delta$ and $\mu$ represent the traditional score matrix
score for matches, mismatches and gaps, respectively,
and $P_m$, $P_n$ and $P_g$ represent the probability, given the
sequencer data, of each of those three cases. We submit
that the following equation, which corresponds to our $R'_{xd}$
[equation (11)], would be more justified:
$$e^{\gamma/T} P_m + e^{\delta/T} P_n + e^{\mu/T} P_g. \tag{14}$$
This follows from the fact that (i) $P_m$, $P_n$ and $P_g$ are
the probabilities of disjoint events (indeed $P_m + P_n + P_g = 1$),
(ii) as pointed out in equation (1), $e^{\gamma/T}$, $e^{\delta/T}$
and $e^{\mu/T}$ are the likelihood ratios of
matches, mismatches and gaps in the 'real' sequences
and (iii) the reasonable assumption that sequencer error
is independent of real sequence differences.
This is not just a theoretical point, but makes a significant
difference in practice. Consider the case of aligning
two sequences expected to be identical except for
sequencing error. In this case, the appropriate value for
$\delta$ and $\mu$ would be $T \cdot \ln(0) = -\infty$, and thus by equation
(13), any column involving a non-zero sequencer error
probability would be assigned the same score of $-\infty$.
On the other hand, equation (14), or equivalently $R'_{xd}$,
would assign a score proportional to $P_m$, which is the
probability assigned by the sequencer that the two bases
match each other.
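The breakdown described above can be seen numerically. The sketch below compares the expected-score form with the likelihood-ratio form for a single alignment column, with mismatch and gap scores set to minus infinity. The function names are hypothetical, and the likelihood-ratio score is taken here to be T times the log of the weighted sum of exponentiated scores, which is our reading of how that sum would be turned into a score.

```python
import math


def expected_score(p_match, p_mismatch, p_gap, gamma, delta, mu):
    """Probability-weighted average of the traditional scores.

    Zero-probability cases are skipped so that 0 * -inf does not give NaN.
    """
    total = 0.0
    for prob, score in ((p_match, gamma), (p_mismatch, delta), (p_gap, mu)):
        if prob > 0:
            total += prob * score
    return total


def likelihood_ratio_score(p_match, p_mismatch, p_gap, gamma, delta, mu, scale=1.0):
    """T * ln of the probability-weighted sum of likelihood ratios e^(score/T)."""
    ratio = sum(
        prob * math.exp(score / scale)
        for prob, score in ((p_match, gamma), (p_mismatch, delta), (p_gap, mu))
    )
    return scale * math.log(ratio) if ratio > 0 else -math.inf


if __name__ == "__main__":
    inf = math.inf
    for p_err in (0.0, 0.01, 0.1):
        args = (1 - p_err, p_err, 0.0, 1.0, -inf, -inf)
        print(p_err, expected_score(*args), likelihood_ratio_score(*args))
```

Under the expected-score form, any non-zero error probability drives the column score to minus infinity, whereas the likelihood-ratio form degrades smoothly as the error probability grows.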
In traditional sequence alignment, we simply report align-
ments with significantly high scores, and it does not matter
if one query sequence has more than one high-scoring
alignment. For read mapping, on the other hand, we
suppose that each read comes from just one place in the
genome. Since genomes contain many duplications and
simple repeats, though, it is common for one read to
have multiple high-scoring alignments.
This problem can be addressed by calculating mapping
probabilities (1,2). Suppose that one read has high-scoring
alignments at several genome locations. The alignment
score at location $i$ is denoted as $S_i$. The mapping probability is:
$$p(\text{read maps to } i) = \frac{e^{S_i/T}}{\sum_{j} e^{S_j/T}}.$$
This formula can be derived from probabilistic align-
ment models (Supplementary Data). Thus, if a read has
a much higher alignment score at one location than any
other, we can be confident that it comes from there. If it
has nearly equal alignment scores at many locations, we
cannot confidently map it.
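A mapping probability of this kind, in which each candidate location is weighted by the exponential of its score divided by T and the weights are normalized to sum to one, can be computed as in the sketch below. The function name is hypothetical, the value of the scale T is left to the caller, and subtracting the maximum score before exponentiating is only a numerical precaution that does not change the result.

```python
import math


def mapping_probabilities(scores, scale):
    """Return p(read maps to i) for each candidate alignment score S_i."""
    top = max(scores)
    # Shift by the best score to avoid overflow in exp(); ratios are unchanged.
    weights = [math.exp((s - top) / scale) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    print(mapping_probabilities([200, 100], scale=10.0))
    print(mapping_probabilities([150, 150, 148], scale=10.0))
```

A read with one dominant score gets a probability close to 1 at that location, while a read with several near-equal scores gets no location above the confidence cutoff.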
This calculation assumes that the read certainly comes
from one of the locations found by the alignment procedure.
This is a very dubious assumption in practice, for several reasons:
(i) The read might not come from the genome at all
(i.e. it might be a contaminant).
(ii) The read might come from part of the genome that
is not present in the reference sequence. (Many ref-
erence sequences are incomplete.)
(iii) The alignment algorithm might fail to find all
high-scoring alignments. Because of the large size
of the data sets, heuristic algorithms are normally
used which may miss some alignments.
Therefore, these mapping probabilities should not be
trusted absolutely. Nevertheless, they prove useful.
We expect that many existing read-mapping tools could be
modified to incorporate the scoring scheme defined above.
To demonstrate our method, we have incorporated it into
our own large-scale alignment tool, LAST.
Since we intend to describe LAST in detail in a separate
publication, we only give a minimal description here.
LAST follows the same three steps as BLAST (9). It finds
seeds (exact matches), extends gapless alignments from the
seeds and finally extends gapped alignments. We incorpo-
rated the new scoring scheme into the last two phases,
since the seed-finding phase does not use scores at all.
The main innovation of LAST is its use of adaptive
seeds whose length adapts to the repetitiveness of the
sequence. This makes it much faster for genomic data.
Specifically, it finds exact matches of any length that
occur no more than (say) 10 times in the genome. These
can be found efficiently using enhanced suffix arrays (10).
LAST can also use spaced seeds (11). Finally, LAST can
use seeds that are both spaced and adaptive: in fact, this is
the default algorithm in this study (Supplementary Data).
LAST is freely available at: http://last.cbrc.jp/.
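To make the adaptive-seed idea concrete, the following sketch extends a match from a given query position until it occurs no more than a chosen number of times in the genome. It is deliberately naive: it counts occurrences by repeated string search rather than with the enhanced suffix arrays that the text mentions, and the function names are hypothetical. It illustrates the concept only and is not how LAST is implemented.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def adaptive_seed(genome, query, start, max_count=10):
    """Shortest match starting at query[start] that is rare enough.

    Returns (seed, occurrences), or None if the match dies out before
    becoming rare enough or the query ends first.
    """
    for end in range(start + 1, len(query) + 1):
        seed = query[start:end]
        occurrences = count_occurrences(genome, seed)
        if occurrences == 0:
            return None
        if occurrences <= max_count:
            return seed, occurrences
    return None


if __name__ == "__main__":
    genome = "acgtacgtacgaacgt"
    print(adaptive_seed(genome, "acga", 0, max_count=1))
    print(adaptive_seed(genome, "acga", 0, max_count=4))
```

Repetitive regions therefore yield longer seeds and unique regions yield shorter ones, which is what keeps the number of seed hits bounded.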
Alignment parameter settings
In this study, the gapless and gapped score thresholds of
LAST (-d and -e) were set to 120, and the gapless
max-drop parameter (-y) was set to 99999 (effectively
infinite). Gapped alignment was used only when aligning
reads to the D. simulans genome. When mapping 51-bp
reads, a score threshold of 150 was used instead of 120.
When modeling sequencer errors only, the mismatch cost
(-q) was set to 1 million (effectively infinite). To clarify:
despite the infinite mismatch cost, mismatches were
tolerated due to the modeling of sequencer errors.
Finding all alignments with up to two mismatches
For part of our analysis, we wished to avoid using
adaptive seeds, and instead guarantee to find all alignments
with up to two mismatches (and score ≥120). We
did this with LAST using spaced seeds. We first found all
matches of length 26 between each read and the genome,
requiring only 18 out of the 26 bases to match. The posi-
tions required to match are indicated by ‘1’s in the follow-
ing pattern: 11111011000111110110001111. Any
length 36 read with up to two mismatches (and no gaps)
is guaranteed to have a match using this pattern.
Moreover, this pattern is optimal for this problem, in
that no pattern with >18 ‘1’s provides this guarantee.
The LAST documentation includes a table of optimal
spaced seeds for various read lengths and mismatch
limits, obtained using algorithms from others (12,13).
Finally, as usual, we extended alignments from every
seed match, and reported those with alignment score ≥120.
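The guarantee claimed for the spaced-seed pattern can be checked by brute force. The sketch below enumerates every placement of up to two mismatches in a 36-base gapless read and tests whether the pattern fits at some offset such that all of its '1' positions avoid the mismatches. The function names are hypothetical. It verifies only the guarantee, not the claim that 18 is the maximum possible number of '1's.

```python
from itertools import combinations

PATTERN = "11111011000111110110001111"
READ_LENGTH = 36


def seed_hits(mismatches, pattern=PATTERN, read_length=READ_LENGTH):
    """True if some placement of the pattern has no mismatch at a '1' position."""
    required = [k for k, flag in enumerate(pattern) if flag == "1"]
    for offset in range(read_length - len(pattern) + 1):
        if all(offset + k not in mismatches for k in required):
            return True
    return False


def guarantees(max_mismatches, pattern=PATTERN, read_length=READ_LENGTH):
    """True if every read with up to max_mismatches mismatches has a seed hit."""
    for m in range(max_mismatches + 1):
        for positions in combinations(range(read_length), m):
            if not seed_hits(set(positions), pattern, read_length):
                return False
    return True


if __name__ == "__main__":
    print("two mismatches guaranteed:", guarantees(2))
    print("three mismatches guaranteed:", guarantees(3))
```

The check succeeds for two mismatches and fails for three; for example, mismatches at 0-based read positions 10, 11 and 15 defeat every placement of the pattern.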
Mapping DNA reads to Drosophila genomes
We obtained the genome sequences of D. melanogaster
(dm3 excluding chrUextra) and D. simulans (droSim1)
from the UCSC genome database. We only tested reads
that could be confidently mapped to the D. melanogaster
genome (53748 reads in default mapping mode; 51898
reads in two-mismatch guarantee mode). We considered
a read confidently mapped if it has alignment score ≥150
and mapping probability ≥0.99. (Alignments with score
≥120 were used to calculate the mapping probabilities.)
To cross-reference the mappings, we used the genome
alignment file dm3.droSim1.all.chain from UCSC (14).
RESULTS
Our results show the effectiveness of combining quality
scores with sequence alignment by applying LAST within
two experiment settings: the first with synthetic data and
the second with real data based on cross-species mapping.
Test with simulated DNA reads
In our first experiment, we employ simulated reads since
we are able to know exactly where they should map to. We
began by sampling 100000 random 36-bp fragments from
human chromosome 1 (hg19, both strands). To simulate
real sequence differences, we made random substitutions
at a low level (0.2, 0.5, 2 or 5%). These substitutions con-
sisted of 60% transitions and 40% transversions: a realis-
tic proportion (6). To keep this initial test simple, we did
not introduce any insertions or deletions. Finally,
we assigned 100000 real quality score strings (those
summarized in Figure 1A) to the simulated reads, and
randomly mutated each base according to the correspond-
ing error probability.
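The simulation procedure just described can be sketched as follows. Each base of a genome fragment first receives a 'real' substitution with a given probability (60% of which are transitions by default), and is then mutated by a simulated sequencer error with the probability implied by its quality score. The function names are hypothetical. The assumption that a sequencer error yields each of the three other bases with equal probability is ours, since the text does not specify it.

```python
import random

BASES = "acgt"
TRANSITION = {"a": "g", "g": "a", "c": "t", "t": "c"}


def substitute(base, rng, transition_fraction=0.6):
    """Apply a 'real' substitution: a transition or one of two transversions."""
    if rng.random() < transition_fraction:
        return TRANSITION[base]
    transversions = [b for b in BASES if b != base and b != TRANSITION[base]]
    return rng.choice(transversions)


def simulate_read(fragment, error_probs, substitution_rate, rng,
                  transition_fraction=0.6):
    """Return a simulated read for one genome fragment."""
    read = []
    for base, err in zip(fragment, error_probs):
        if rng.random() < substitution_rate:
            base = substitute(base, rng, transition_fraction)
        if rng.random() < err:
            # Assumed error model: uniform over the three other bases.
            base = rng.choice([b for b in BASES if b != base])
        read.append(base)
    return "".join(read)


if __name__ == "__main__":
    rng = random.Random(1)
    fragment = "acgt" * 9  # a 36-base fragment
    print(simulate_read(fragment, [0.01] * 36, 0.02, rng))
```

Passing a seeded `random.Random` instance keeps the simulation reproducible, which matters when comparing scoring schemes on the same set of reads.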
We then aligned the reads to chromosome 1, and
checked whether or not they mapped back to their
original locations. The ‘real’ sequence differences were
modeled by using suitable alignment score parameters
for each level of divergence (Table 1). We obtained alignments
with score ≥120 (equivalent to 20 error-free
matching bases), then calculated mapping probabilities,
and kept alignments with mapping probability ≥0.99.
Figure 2 shows the relationship between the number of
correctly and incorrectly mapped reads, as the score
threshold is varied between 216 (the maximum possible)
and 120. As the score threshold approaches 120, falsely
mapped reads increase dramatically: this is because the
mapping probabilities become less reliable since they fail
to account for alignments with scores ≤119. In all cases,
however, mapping accuracy improves (i.e. we obtain more
correctly mapped reads for a given number of incorrectly
mapped ones) when we model both sequencer errors and
‘real’ substitutions. If we model only sequencer errors,
there is the potential to do worse than traditional align-
ment, where only substitutions are modeled.
To check whether these conclusions hold for a different
read length and quality score distribution, we repeated the
[Figure 2, four panels. Axes: incorrectly mapped reads and correctly mapped reads. Legend label recovered: qualities + matrix.]
Figure 2. Mapping accuracy for 100000 simulated 36-bp reads. The reads differ from the genome by a certain rate of ‘real’ substitutions (0.2, 0.5,
1 or 2%) plus sequencer errors. Each line shows the relationship between the number of correctly and incorrectly mapped reads as the alignment
score cutoff is varied. Circles indicate a score cutoff of 150. Dotted lines show the accuracy when we model the substitutions but not the sequencer
errors. Dashed lines show the accuracy when we model the sequencer errors but not the substitutions. Solid lines show the accuracy when we model both.
Table 1. Alignment score parameters for DNA with various substitu-
aApplies when there is no transition/transversion bias (i.e. one in three
substitutions are transitions).
bFor the case where 60% of substitutions are transitions.
cFor the case where 45% of substitutions are transitions.