Article

Incorporating sequence quality data into alignment improves DNA read mapping.

Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan.
Nucleic Acids Research (impact factor: 8.03). 04/2010; 38(7):e100. DOI:10.1093/nar/gkq010
Source: PubMed

ABSTRACT: New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the proportion of correctly mapped reads from 49% to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
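
As a rough illustration of the idea in the abstract (not the authors' implementation), the sketch below converts a Phred quality score Q to an error probability with the standard relation p = 10^(-Q/10), then scores an aligned base pair by averaging substitution likelihood ratios over the possible true read bases. The function names, the +1/-1 match/mismatch scores, the scale parameter, and the assumption that a miscalled base is equally likely to be any of the other three bases are all illustrative assumptions.

```python
import math

BASES = "ACGT"


def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def quality_aware_score(read_base, ref_base, q, match=1.0, mismatch=-1.0, scale=1.0):
    """Score one aligned base pair, taking the read base's quality into account.

    Assumption: if the call is wrong, the true base is uniformly one of the
    other three bases. The match/mismatch scores are treated as scaled
    log likelihood ratios, so exp(score / scale) recovers the ratio.
    """
    e = phred_to_error_prob(q)
    total = 0.0
    for true_base in BASES:
        # Probability that the read's true base is `true_base`, given the call.
        p_true = (1.0 - e) if true_base == read_base else e / 3.0
        s = match if true_base == ref_base else mismatch
        total += p_true * math.exp(s / scale)
    return scale * math.log(total)


if __name__ == "__main__":
    for q in (40, 20, 10, 2):
        m = quality_aware_score("A", "A", q)
        x = quality_aware_score("A", "C", q)
        print(f"Q={q:2d}  match={m:+.3f}  mismatch={x:+.3f}")
```

Under these assumptions, a high-quality base scores close to the plain match or mismatch value, while a low-quality base is pulled toward a neutral score: a mismatch at an unreliable position is penalised less because it may be a sequencer error rather than a real sequence difference.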

  • Article: Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning.
    ABSTRACT: Cytosine DNA methylation is important in regulating gene expression and in silencing transposons and other repetitive sequences. Recent genomic studies in Arabidopsis thaliana have revealed that many endogenous genes are methylated either within their promoters or within their transcribed regions, and that gene methylation is highly correlated with transcription levels. However, plants have different types of methylation controlled by different genetic pathways, and detailed information on the methylation status of each cytosine in any given genome is lacking. To this end, we generated a map at single-base-pair resolution of methylated cytosines for Arabidopsis, by combining bisulphite treatment of genomic DNA with ultra-high-throughput sequencing using the Illumina 1G Genome Analyser and Solexa sequencing technology. This approach, termed BS-Seq, unlike previous microarray-based methods, allows one to sensitively measure cytosine methylation on a genome-wide scale within specific sequence contexts. Here we describe methylation on previously inaccessible components of the genome and analyse the DNA methylation sequence composition and distribution. We also describe the effect of various DNA methylation mutants on genome-wide methylation patterns, and demonstrate that our newly developed library construction and computational methods can be applied to large genomes such as that of mouse.
    Nature 04/2008; 452(7184):215-9. · 36.28 Impact Factor
  • Article: Mapping short DNA sequencing reads and calling variants using mapping quality scores.
    ABSTRACT: New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.
    Genome Research 09/2008; 18(11):1851-8. · 13.61 Impact Factor
  • Article: The effect of sequence quality on sequence alignment.
    ABSTRACT: The nucleotide sequencing process produces not only the sequence of nucleotides, but also associated quality values. Quality values provide valuable information, but are primarily used only for trimming sequences and generally ignored in subsequent analyses. This article describes how the scoring schemes of standard alignment algorithms can be modified to take into account quality values to produce improved alignments and statistically more accurate scores. A prototype implementation is also provided, and used to post-process a set of BLAST results. Quality-adjusted alignment is a natural extension of standard alignment methods, and can be implemented with only a small constant factor performance penalty. The method can also be applied to related methods including heuristic search algorithms like BLAST and FASTA. Availability: http://malde.org/~ketil/qaa
    Bioinformatics 05/2008; 24(7):897-900. · 5.47 Impact Factor

Keywords: alignment; approach enables; Drosophila simulans genome; effective use; higher error rates; interpreting biological sequences; lack reference genomes; method models; New DNA sequencing technologies; per-base error probabilities; polymorphic; real sequence differences; sequencer errors; standard alignment methods; tools