Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America.
PLoS ONE. 07/2012; 7(7):e41356. DOI: 10.1371/journal.pone.0041356
Source: PubMed


While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate the association of SSEs with various features in the reads and sequence context. This data set is typically either a part of the data set being "recalibrated" (Genome Analysis ToolKit, or GATK) or a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs at certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
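The Phred-scale arithmetic behind these comparisons is simple to verify. The sketch below (illustrative, not code from the paper) shows the standard Q = -10·log10(P) conversion, and how a 13-unit overestimate at a CpG site translates into error probability:

```python
import math

def phred(p_error):
    """Phred-scaled quality: Q = -10 * log10(P(error))."""
    return -10.0 * math.log10(p_error)

def error_prob(q):
    """Inverse conversion: P(error) = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

# A reported Q30 base claims a 1-in-1000 error probability.
assert abs(error_prob(30) - 0.001) < 1e-12

# If systematic errors at a CpG site raise the true error rate to
# 1 in 50, the empirical quality is ~17, i.e. 13 Phred units below
# the reported Q30 -- the size of the largest discrepancy above.
empirical_q = phred(1 / 50)
```

Recalibration against spike-ins amounts to measuring such empirical qualities from mismatches in reads mapped to the known synthetic sequences, where every mismatch is guaranteed to be an error rather than a variant.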



Available from: Justin M Zook
    • "In regard to base-substitution errors, there is no reason to believe that RRL sequencing techniques are immune to them, and as such, one should be less confident that an A/C substitution is a true SNP when compared to a substitution of type T/C. Establishing the error rate may require spiking sequencing runs with synthetic sequences to obtain experiment-specific error estimates (Zook et al. 2012). "
    ABSTRACT: Single nucleotide polymorphisms (SNPs) have become the marker of choice for genetic studies in organisms of conservation, commercial, or biological interest. Most SNP discovery projects in non-model organisms apply a strategy for identifying putative SNPs based on filtering rules that account for random sequencing errors. Here, we analyze data used to develop 4,723 novel SNPs for the commercially important deep-sea fish, orange roughy (Hoplostethus atlanticus), in order to measure the impact of not accounting for systematic sequencing errors when filtering identified polymorphisms to be added to the SNP chip. We used SAMtools to identify polymorphisms in a Velvet assembly of genomic DNA sequence data from seven individuals. The resulting set of polymorphisms was filtered to minimise ‘bycatch’ – polymorphisms caused by sequencing or assembly error. An Illumina Infinium SNP chip was used to genotype a final set of 7,714 polymorphisms across 1,734 individuals. Five predictors of SNP validity were examined for their effect on the probability of obtaining an assayable SNP: depth of coverage, number of reads that support a variant, polymorphism type (e.g., A/C), strand bias, and SNP probe design score. Our results support a strategy of filtering out systematic sequencing errors in order to improve the efficiency of SNP discovery. We show that blastx can be used as an efficient tool to identify single-copy genomic regions in the absence of a reference genome. The results have implications for research aiming to identify SNPs and build SNP genotyping assays for non-model organisms.
    Article · Nov 2014 · Molecular Ecology Resources
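The filtering strategy that the abstract describes (depth of coverage, variant read support, strand bias, and extra scrutiny for error-prone substitution types such as A/C) can be sketched as below. Field names and thresholds are hypothetical illustrations, not the values used in the study:

```python
# Illustrative candidate-SNP filter; thresholds and the set of
# error-prone substitution types are invented for this sketch.
def passes_filters(snp, min_depth=10, min_alt_reads=3,
                   max_strand_bias=0.9,
                   error_prone_types=frozenset({("A", "C"), ("G", "T")})):
    """Return True if a candidate polymorphism survives basic filtering.

    snp: dict with keys 'depth', 'alt_reads', 'fwd_alt', 'rev_alt',
    'ref', and 'alt'.
    """
    # Depth of coverage and variant read support.
    if snp["depth"] < min_depth or snp["alt_reads"] < min_alt_reads:
        return False
    # Strand bias: fraction of variant-supporting reads on one strand.
    total = snp["fwd_alt"] + snp["rev_alt"]
    if total and max(snp["fwd_alt"], snp["rev_alt"]) / total > max_strand_bias:
        return False
    # Substitution types prone to systematic error (e.g. A/C) require
    # stronger read support than types like T/C.
    pair = tuple(sorted((snp["ref"], snp["alt"])))
    if pair in error_prone_types and snp["alt_reads"] < 2 * min_alt_reads:
        return False
    return True
```

The type-dependent threshold mirrors the quoted point above: an A/C substitution warrants less confidence than a T/C substitution at the same read support.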
    • "Strand bias related inaccuracies and decreased depth of coverage or uneven coverage (due to allele dropout in case of sampling error or as a function of tumor heterogeneity) can also compound the problem of mutation calling inaccuracies. Accurate base calling algorithms for Dx assays must minimally utilize spike-in controls during technical feasibility experiments and raw data controls for software training that include mutation calls in regions of predicted poor base calling if those are part of the assay design (41, 43, 66). The use of a highly sequenced reference sample, such as NA12878 by NIST (v.2.15) for software training and algorithm development has been proposed in many forums such as the NIST “Genome in a Bottle” Consortium (92). "
    ABSTRACT: Over the past decade, next-generation sequencing (NGS) technology has experienced meteoric growth in the aspects of platform, technology, and supporting bioinformatics development allowing its widespread and rapid uptake in research settings. More recently, NGS-based genomic data have been exploited to better understand disease development and patient characteristics that influence response to a given therapeutic intervention. Cancer, as a disease characterized by and driven by the tumor genetic landscape, is particularly amenable to NGS-based diagnostic (Dx) approaches. NGS-based technologies are particularly well suited to studying cancer disease development, progression and emergence of resistance, all key factors in the development of next-generation cancer Dxs. Yet, to achieve the promise of NGS-based patient treatment, drug developers will need to overcome a number of operational, technical, regulatory, and strategic challenges. Here, we provide a succinct overview of the state of the clinical NGS field in terms of the available clinically targeted platforms and sequencing technologies. We discuss the various operational and practical aspects of clinical NGS testing that will facilitate or limit the uptake of such assays in routine clinical care. We examine the current strategies for analytical validation and Food and Drug Administration (FDA)-approval of NGS-based assays and ongoing efforts to standardize clinical NGS and build quality control standards for the same. The rapidly evolving companion diagnostic (CDx) landscape for NGS-based assays will be reviewed, highlighting the key areas of concern and suggesting strategies to mitigate risk. The review will conclude with a series of strategic questions that face drug developers and a discussion of the likely future course of NGS-based CDx development efforts.
    Article · Apr 2014 · Frontiers in Oncology
    • "The correction factors are applied to the sequence counts of the unknown samples to increase the accuracy of the quantitative estimates. Similar spike-in standards are also applied to account for biases in studies using nextgeneration sequencing to look at differential gene expression (Jiang et al. 2011; Zook et al. 2012). If the control materials and unknown samples are both treated in an identical fashion during the methodological protocol, this approach should account for many of the species-specific methodological biases in a single correction (e.g. "
    ABSTRACT: Ecologists are increasingly interested in quantifying consumer diets based on food DNA in dietary samples and high-throughput sequencing of marker genes. It is tempting to assume that food DNA sequence proportions recovered from diet samples are representative of the consumer's diet proportions, despite the fact that captive feeding studies do not support that assumption. Here, we examine the idea of sequencing control materials of known composition along with dietary samples in order to correct for technical biases introduced during amplicon sequencing, and biological biases such as variable gene copy number. Using the Ion Torrent PGM, we sequenced prey DNA amplified from scats of captive harbour seals (Phoca vitulina) fed a constant diet including three fish species in known proportions. Alongside, we sequenced a prey tissue mix matching the seals’ diet to generate Tissue Correction Factors (TCFs). TCFs improved the diet estimates (based on sequence proportions) for all species and reduced the average estimate error from 28 ± 15% (uncorrected) to 14 ± 9% (TCF corrected). The experimental design also allowed us to infer the magnitude of prey-specific digestion biases and calculate Digestion Correction Factors (DCFs). The DCFs were compared to possible proxies for differential digestion (e.g., fish % protein, % lipid, % moisture), revealing a strong relationship between the DCFs and percent lipid of the fish prey, suggesting prey-specific corrections based on lipid content would produce accurate diet estimates in this study system. These findings demonstrate the value of parallel sequencing of food tissue mixtures in diet studies and offer new directions for future research in quantitative DNA diet analysis.
    Article · Jan 2014
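The correction-factor idea above (derive per-species factors from a control mixture of known composition, then rescale an unknown sample's counts) can be sketched as follows. Species names and counts are invented for illustration:

```python
def correction_factors(known_proportions, observed_counts):
    """TCF-style factors: ratio of a species' true proportion in a
    control mixture to its observed sequence proportion."""
    total = sum(observed_counts.values())
    return {sp: known_proportions[sp] / (observed_counts[sp] / total)
            for sp in known_proportions}

def corrected_proportions(sample_counts, factors):
    """Apply the factors to an unknown sample's counts, then
    renormalize so the corrected proportions sum to 1."""
    scaled = {sp: n * factors[sp] for sp, n in sample_counts.items()}
    total = sum(scaled.values())
    return {sp: v / total for sp, v in scaled.items()}

# Control mix: true proportions known, counts show recovery bias.
known = {"herring": 0.5, "eulachon": 0.3, "salmon": 0.2}
control_counts = {"herring": 700, "eulachon": 200, "salmon": 100}
tcf = correction_factors(known, control_counts)
```

Because the control mixture and the unknown samples pass through the same protocol, dividing out the control's bias in this way removes species-specific amplification and recovery biases in a single step.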