Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America.
PLoS ONE (Impact Factor: 3.23). 07/2012; 7(7):e41356. DOI: 10.1371/journal.pone.0041356
Source: PubMed

ABSTRACT While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being "recalibrated" (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.

Download full-text


Available from: Justin M Zook, Sep 28, 2015
30 Reads
  • Source
    • "Strand bias related inaccuracies and decreased depth of coverage or uneven coverage (due to allele dropout in case of sampling error or as a function of tumor heterogeneity) can also compound the problem of mutation calling inaccuracies. Accurate base calling algorithms for Dx assays must minimally utilize spike-in controls during technical feasibility experiments and raw data controls for software training that include mutation calls in regions of predicted poor base calling if those are part of the assay design (41, 43, 66). The use of a highly sequenced reference sample, such as NA12878 by NIST (v.2.15) for software training and algorithm development has been proposed in many forums such as the NIST “Genome in a Bottle” Consortium (92). "
    [Show abstract] [Hide abstract]
    ABSTRACT: Over the past decade, next-generation sequencing (NGS) technology has experienced meteoric growth in the aspects of platform, technology, and supporting bioinformatics development allowing its widespread and rapid uptake in research settings. More recently, NGS-based genomic data have been exploited to better understand disease development and patient characteristics that influence response to a given therapeutic intervention. Cancer, as a disease characterized by and driven by the tumor genetic landscape, is particularly amenable to NGS-based diagnostic (Dx) approaches. NGS-based technologies are particularly well suited to studying cancer disease development, progression and emergence of resistance, all key factors in the development of next-generation cancer Dxs. Yet, to achieve the promise of NGS-based patient treatment, drug developers will need to overcome a number of operational, technical, regulatory, and strategic challenges. Here, we provide a succinct overview of the state of the clinical NGS field in terms of the available clinically targeted platforms and sequencing technologies. We discuss the various operational and practical aspects of clinical NGS testing that will facilitate or limit the uptake of such assays in routine clinical care. We examine the current strategies for analytical validation and Food and Drug Administration (FDA)-approval of NGS-based assays and ongoing efforts to standardize clinical NGS and build quality control standards for the same. The rapidly evolving companion diagnostic (CDx) landscape for NGS-based assays will be reviewed, highlighting the key areas of concern and suggesting strategies to mitigate risk. The review will conclude with a series of strategic questions that face drug developers and a discussion of the likely future course of NGS-based CDx development efforts.
    Frontiers in Oncology 04/2014; 4:78. DOI:10.3389/fonc.2014.00078
  • Source
    • "We conducted a RNA-seq experiment of Taraxacum officinale RNA mixed with ERCC RNA spikes. Twenty-three libraries were multiplexed using Illumina’s multiplex sequencing assay and pooled with a 2% ERCC spike [12]. Subsequently they were sequenced on 2 Hiseq lanes yielding a total of 2.4 million 100bp read pairs mapping to the ERCC spike set. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Priming of random hexamers in cDNA synthesis is known to show sequence bias, but in addition it has been suggested recently that mismatches in random hexamer priming could be a cause of mismatches between the original RNA fragment and observed sequence reads. To explore random hexamer mispriming as a potential source of these errors, we analyzed two independently generated RNA-seq datasets of synthetic ERCC spikes for which the reference is known. First strand cDNA synthesized by random hexamer priming on RNA showed consistent position and nucleotide-specific mismatch errors in the first seven nucleotides. The mismatch errors found in both datasets are consistent in distribution and thermodynamically stable mismatches are more common. This strongly indicates that RNA-DNA mispriming of specific random hexamers causes these errors. Due to their consistency and specificity, mispriming errors can have profound implications for downstream applications if not dealt with properly.
    PLoS ONE 12/2013; 8(12):e85583. DOI:10.1371/journal.pone.0085583 · 3.23 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: RNA-Seq promises to be used in clinical settings as a gene-expression profiling tool; however, questions about its variability and biases remain and need to be addressed. Thus, RNA controls with known concentrations and sequence identities originally developed by the External RNA Control Consortium (ERCC) for microarray and qPCR platforms have recently been proposed for RNA-Seq platforms, but only with a limited number of samples. In this study, we report our analysis of RNA-Seq data from 92 ERCC controls spiked in a diverse collection of 447 RNA samples from eight ongoing studies involving five species (human, rat, mouse, chicken, and Schistosoma japonicum) and two mRNA enrichment protocols, i.e., poly(A) and RiboZero. The entire collection of datasets consisted of 15650143175 short sequence reads, 131603796 (i.e., 0.84%) of which were mapped to the 92 ERCC references. The overall ERCC mapping ratio of 0.84% is close to the expected value of 1.0% when assuming a 2.0% mRNA fraction in total RNA, but showed a difference of 2.8-fold across studies and 4.3-fold among samples from the same study with one tissue type. This level of fluctuation may prevent the ERCC controls from being used for cross-sample normalization in RNA-Seq. Furthermore, we observed striking biases of quantification between poly(A) and RiboZero which are transcript-specific. For example, ERCC-00116 showed a 7.3-fold under-enrichment in poly(A) compared to RiboZero. Extra care is needed in integrative analysis of multiple datasets and technical artifacts of protocol differences should not be taken as true biological findings.
    Science China. Life sciences 02/2013; 56(2):134-42. DOI:10.1007/s11427-013-4437-9 · 1.69 Impact Factor
Show more