Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America.
PLoS ONE (Impact Factor: 3.23). 07/2012; 7(7):e41356. DOI: 10.1371/journal.pone.0041356
Source: PubMed


While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) come to dominate at high depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate the association of SSEs with various features in the reads and sequence context. This data set is typically either a part of the data set being "recalibrated" (Genome Analysis ToolKit, or GATK) or a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs at certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
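The Phred scale referenced in the abstract maps a quality score Q to an error probability P = 10^(-Q/10). A minimal sketch (not from the paper; function names are illustrative) shows what a 5-unit quality-score correction means in probability terms:

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score Q to error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability back to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# Q30 corresponds to a 1-in-1000 error probability.
q30_prob = phred_to_error_prob(30)  # 0.001

# The paper's mean 5-unit improvement implies the uncorrected scores
# misstate the error probability by a factor of 10^(5/10) ≈ 3.16;
# the 13-unit discrepancy at CpG sites implies a factor of ≈ 20.
ratio_5 = phred_to_error_prob(25) / phred_to_error_prob(30)
ratio_13 = phred_to_error_prob(17) / phred_to_error_prob(30)
```

Because the scale is logarithmic, even modest quality-score miscalibration translates into several-fold errors in the per-base error probability used by downstream variant callers.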



    • "In regard to base-substitution errors, there is no reason to believe that RRL sequencing techniques are immune to them, and as such, one should be less confident that an A/C substitution is a true SNP when compared to a substitution of type T/C. Establishing the error rate may require spiking sequencing runs with synthetic sequences to obtain experiment-specific error estimates (Zook et al. 2012). "
    ABSTRACT: Single nucleotide polymorphisms (SNPs) have become the marker of choice for genetic studies in organisms of conservation, commercial, or biological interest. Most SNP discovery projects in non-model organisms apply a strategy for identifying putative SNPs based on filtering rules that account for random sequencing errors. Here, we analyze data used to develop 4723 novel SNPs for the commercially important deep-sea fish, orange roughy (Hoplostethus atlanticus), in order to measure the impact of not accounting for systematic sequencing errors when filtering identified polymorphisms to be added to the SNP chip. We used SAMtools to identify polymorphisms in a Velvet assembly of genomic DNA sequence data from seven individuals. The resulting set of polymorphisms was filtered to minimise 'bycatch' – polymorphisms caused by sequencing or assembly error. An Illumina Infinium SNP chip was used to genotype a final set of 7,714 polymorphisms across 1,734 individuals. Five predictors of SNP validity were examined for their effect on the probability of obtaining an assayable SNP: depth of coverage, number of reads that support a variant, polymorphism type (e.g., A/C), strand-bias, and SNP probe design score. Our results support a strategy of filtering out systematic sequencing errors in order to improve the efficiency of SNP discovery. We show that blastx can be used as an efficient tool to identify single-copy genomic regions in the absence of a reference genome. The results have implications for research aiming to identify SNPs and build SNP genotyping assays for non-model organisms.
    Molecular Ecology Resources 11/2014; 15(4). DOI:10.1111/1755-0998.12343 · 3.71 Impact Factor
    • "Strand bias related inaccuracies and decreased depth of coverage or uneven coverage (due to allele dropout in case of sampling error or as a function of tumor heterogeneity) can also compound the problem of mutation calling inaccuracies. Accurate base calling algorithms for Dx assays must minimally utilize spike-in controls during technical feasibility experiments and raw data controls for software training that include mutation calls in regions of predicted poor base calling if those are part of the assay design (41, 43, 66). The use of a highly sequenced reference sample, such as NA12878 by NIST (v.2.15) for software training and algorithm development has been proposed in many forums such as the NIST “Genome in a Bottle” Consortium (92). "
    ABSTRACT: Over the past decade, next-generation sequencing (NGS) technology has experienced meteoric growth in the aspects of platform, technology, and supporting bioinformatics development allowing its widespread and rapid uptake in research settings. More recently, NGS-based genomic data have been exploited to better understand disease development and patient characteristics that influence response to a given therapeutic intervention. Cancer, as a disease characterized by and driven by the tumor genetic landscape, is particularly amenable to NGS-based diagnostic (Dx) approaches. NGS-based technologies are particularly well suited to studying cancer disease development, progression and emergence of resistance, all key factors in the development of next-generation cancer Dxs. Yet, to achieve the promise of NGS-based patient treatment, drug developers will need to overcome a number of operational, technical, regulatory, and strategic challenges. Here, we provide a succinct overview of the state of the clinical NGS field in terms of the available clinically targeted platforms and sequencing technologies. We discuss the various operational and practical aspects of clinical NGS testing that will facilitate or limit the uptake of such assays in routine clinical care. We examine the current strategies for analytical validation and Food and Drug Administration (FDA)-approval of NGS-based assays and ongoing efforts to standardize clinical NGS and build quality control standards for the same. The rapidly evolving companion diagnostic (CDx) landscape for NGS-based assays will be reviewed, highlighting the key areas of concern and suggesting strategies to mitigate risk. The review will conclude with a series of strategic questions that face drug developers and a discussion of the likely future course of NGS-based CDx development efforts.
    Frontiers in Oncology 04/2014; 4:78. DOI:10.3389/fonc.2014.00078
    • "We conducted a RNA-seq experiment of Taraxacum officinale RNA mixed with ERCC RNA spikes. Twenty-three libraries were multiplexed using Illumina’s multiplex sequencing assay and pooled with a 2% ERCC spike [12]. Subsequently they were sequenced on 2 Hiseq lanes yielding a total of 2.4 million 100bp read pairs mapping to the ERCC spike set. "
    ABSTRACT: Priming of random hexamers in cDNA synthesis is known to show sequence bias, but in addition it has been suggested recently that mismatches in random hexamer priming could be a cause of mismatches between the original RNA fragment and observed sequence reads. To explore random hexamer mispriming as a potential source of these errors, we analyzed two independently generated RNA-seq datasets of synthetic ERCC spikes for which the reference is known. First strand cDNA synthesized by random hexamer priming on RNA showed consistent position and nucleotide-specific mismatch errors in the first seven nucleotides. The mismatch errors found in both datasets are consistent in distribution and thermodynamically stable mismatches are more common. This strongly indicates that RNA-DNA mispriming of specific random hexamers causes these errors. Due to their consistency and specificity, mispriming errors can have profound implications for downstream applications if not dealt with properly.
    PLoS ONE 12/2013; 8(12):e85583. DOI:10.1371/journal.pone.0085583 · 3.23 Impact Factor
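The mispriming analysis above amounts to tallying mismatch rates between reads and a known reference (here, ERCC spikes) as a function of read position, and checking for elevation in the first seven cycles. A minimal sketch of that tally (not the authors' code; the function name and toy data are illustrative, and real reads would first need gap-free alignment to the spike references):

```python
from collections import defaultdict

def mismatch_rate_by_position(alignments, max_pos=20):
    """Per-cycle mismatch rates from (read, reference) pairs of aligned,
    gap-free sequences. Elevated rates at positions 0-6 would be
    consistent with random-hexamer mispriming."""
    mismatches = defaultdict(int)
    totals = defaultdict(int)
    for read, ref in alignments:
        for i, (r, t) in enumerate(zip(read, ref)):
            if i >= max_pos:
                break
            totals[i] += 1
            if r != t:
                mismatches[i] += 1
    return {i: mismatches[i] / totals[i] for i in totals}

# Toy data: one of two reads carries a mismatch at position 2.
pairs = [("ACGTAC", "ACGTAC"), ("ACTTAC", "ACGTAC")]
rates = mismatch_rate_by_position(pairs)
```

Because spike-in references are known exactly, every observed mismatch is attributable to the sequencing workflow rather than to biological variation, which is what makes this kind of position-specific error profiling possible.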