Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America.
PLoS ONE (Impact Factor: 3.53). 07/2012; 7(7):e41356. DOI: 10.1371/journal.pone.0041356
Source: PubMed

ABSTRACT While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being "recalibrated" (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.

Download full-text


Available from: Justin M Zook, Jul 28, 2015
  • [Show abstract] [Hide abstract]
    ABSTRACT: Despite the recent technological advances in DNA quantitation by sequencing, accurate delineation of the quantitative relationship among different DNA sequences is yet to be elaborated due to difficulties in correcting the sequence-specific quantitation biases. We here developed a novel DNA quantitation method via spiking-in a neighbor genome for competitive PCR amplicon sequencing (SiNG-PCRseq). This method utilizes genome-wide chemically equivalent but easily discriminable homologous sequences with a known copy arrangement in the neighbor genome. By comparing the amounts of selected human DNA sequences simultaneously to those of matched sequences in the orangutan genome, we could accurately draw the quantitative relationships for those sequences in the human genome (root-mean-square deviations <0.05). Technical replications of cDNA quantitation performed using different reagents at different time points also resulted in excellent correlations (R(2) > 0.95). The cDNA quantitation using SiNG-PCRseq was highly concordant with the RNA-seq-derived version in inter-sample comparisons (R(2) = 0.88), but relatively discordant in inter-sequence quantitation (R(2) < 0.44), indicating considerable level of sequence-dependent quantitative biases in RNA-seq. Considering the measurement structure explicitly relating the amount of different sequences within a sample, SiNG-PCRseq will facilitate sharing and comparing the quantitation data generated under different spatio-temporal settings.
    Scientific Reports 5:11879. DOI:10.1038/srep11879 · 5.58 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: RNA-Seq promises to be used in clinical settings as a gene-expression profiling tool; however, questions about its variability and biases remain and need to be addressed. Thus, RNA controls with known concentrations and sequence identities originally developed by the External RNA Control Consortium (ERCC) for microarray and qPCR platforms have recently been proposed for RNA-Seq platforms, but only with a limited number of samples. In this study, we report our analysis of RNA-Seq data from 92 ERCC controls spiked in a diverse collection of 447 RNA samples from eight ongoing studies involving five species (human, rat, mouse, chicken, and Schistosoma japonicum) and two mRNA enrichment protocols, i.e., poly(A) and RiboZero. The entire collection of datasets consisted of 15650143175 short sequence reads, 131603796 (i.e., 0.84%) of which were mapped to the 92 ERCC references. The overall ERCC mapping ratio of 0.84% is close to the expected value of 1.0% when assuming a 2.0% mRNA fraction in total RNA, but showed a difference of 2.8-fold across studies and 4.3-fold among samples from the same study with one tissue type. This level of fluctuation may prevent the ERCC controls from being used for cross-sample normalization in RNA-Seq. Furthermore, we observed striking biases of quantification between poly(A) and RiboZero which are transcript-specific. For example, ERCC-00116 showed a 7.3-fold under-enrichment in poly(A) compared to RiboZero. Extra care is needed in integrative analysis of multiple datasets and technical artifacts of protocol differences should not be taken as true biological findings.
    Science China. Life sciences 02/2013; 56(2):134-42. DOI:10.1007/s11427-013-4437-9 · 1.51 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Ecologists are increasingly interested in quantifying consumer diets based on food DNA in dietary samples and high-throughput sequencing of marker genes. It is tempting to assume that food DNA sequence proportions recovered from diet samples are representative of consumer's diet proportions, despite the fact that captive feeding studies do not support that assumption. Here, we examine the idea of sequencing control materials of known composition along with dietary samples in order to correct for technical biases introduced during amplicon sequencing, and biological biases such as variable gene copy number. Using the Ion Torrent PGM©, we sequenced prey DNA amplified from scats of captive harbour seals (Phoca vitulina) fed a constant diet including three fish species in known proportions. Alongside, we sequenced a prey tissue mix matching the seals' diet to generate Tissue Correction Factors (TCFs). TCFs improved the diet estimates (based on sequence proportions) for all species and reduced the average estimate error from 28 ± 15% (uncorrected), to 14 ± 9% (TCF corrected). The experimental design also allowed us to infer the magnitude of prey-specific digestion biases and calculate Digestion Correction Factors (DCFs). The DCFs were compared to possible proxies for differential digestion (e.g., fish% protein,% lipid,% moisture) revealing a strong relationship between the DCFs and percent lipid of the fish prey, suggesting prey-specific corrections based on lipid content would produce accurate diet estimates in this study system. These findings demonstrate the value of parallel sequencing of food tissue mixtures in diet studies and offer new directions for future research in quantitative DNA diet analysis. This article is protected by copyright. All rights reserved.
    Molecular Ecology 09/2013; 23(15). DOI:10.1111/mec.12523 · 5.84 Impact Factor
Show more