Targeted enrichment beyond the consensus coding DNA sequence exome reveals exons with higher variant densities

Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA.
Genome Biology (Impact Factor: 10.81). 07/2011; 12(7):R68. DOI: 10.1186/gb-2011-12-7-r68
Source: PubMed


Enrichment of loci by DNA hybridization-capture, followed by high-throughput sequencing, is an important tool in modern genetics. Currently, the most common targets for enrichment are the protein coding exons represented by the consensus coding DNA sequence (CCDS). The CCDS, however, excludes many actual or computationally predicted coding exons present in other databases, such as RefSeq and Vega, as well as non-coding functional elements such as untranslated and regulatory regions. The number of variants per base pair (variant density) in regions outside the CCDS, and our ability to interrogate those regions, are consequently less well understood.
We examine capture sequence data from outside of the CCDS regions and find that extremes of GC content that are present in different subregions of the genome can reduce the local capture sequence coverage to less than 50% relative to the CCDS. This effect is due to biases inherent in both the Illumina and SOLiD sequencing platforms that are exacerbated by the capture process. Interestingly, for two subregion types, microRNA and predicted exons, the capture process yields higher than expected coverage when compared to whole genome sequencing. Lastly, we examine the variation present in non-CCDS regions and find that predicted exons, as well as exonic regions specific to RefSeq and Vega, show much higher variant densities than the CCDS.
We show that regions outside of the CCDS perform less efficiently in capture sequence experiments. Further, we show that the variant density in computationally predicted exons is more than 2.5 times higher than that observed in the CCDS.
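The two quantities the abstract relies on, variant density (variants per base pair) and GC content, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Region` class, the function names, and the toy coordinates and variant positions are all hypothetical and chosen only to make the arithmetic visible. Regions are assumed to be non-overlapping, 0-based, half-open intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical target interval; 0-based, half-open coordinates."""
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


def variant_density(regions, variant_positions):
    """Return variants per base pair over a set of non-overlapping regions.

    variant_positions maps a chromosome name to an iterable of
    0-based variant positions on that chromosome.
    """
    total_bp = sum(r.length for r in regions)
    if total_bp == 0:
        raise ValueError("regions cover zero bases")
    n_variants = sum(
        1
        for r in regions
        for pos in variant_positions.get(r.chrom, ())
        if r.start <= pos < r.end
    )
    return n_variants / total_bp


def gc_fraction(sequence):
    """Return the fraction of G and C among unambiguous A/C/G/T bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence has no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / acgt


# Toy data (invented for illustration, not taken from the paper).
ccds_regions = [Region("chr1", 0, 1000)]
predicted_regions = [Region("chr1", 2000, 2500)]
variants = {"chr1": [10, 500, 2100, 2300]}

ccds_density = variant_density(ccds_regions, variants)
predicted_density = variant_density(predicted_regions, variants)
fold_difference = predicted_density / ccds_density

print(f"CCDS density:      {ccds_density:.4f} variants/bp")
print(f"Predicted density: {predicted_density:.4f} variants/bp")
print(f"Fold difference:   {fold_difference:.2f}")
```

Normalizing by total target length is what makes region classes of very different sizes comparable; a raw variant count would favor whichever class covers more bases.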



  • Source
    • "A peripheral blood sample was submitted to the Baylor College of Medicine clinical exome sequencing service. This analysis was performed based on standard procedures within that laboratory, based on published methods [Bainbridge et al., 2011] and information at the Baylor Human Genome Sequencing Center: https:// Paired-End_Capture_Library_Preparation.pdf and analyzed by their clinical testing pipeline: "
    ABSTRACT: The TARP syndrome (Talipes equinovarus, Atrial septal defect, Robin sequence, and Persistent left superior vena cava) is an X-linked disorder that was determined to be caused by mutations in RBM10 in two families, and confirmed in a subsequent case report. The two original families were quite similar in phenotype, with uniform early lethality, although the confirmatory case report showed survival into childhood. Here we report on five affected individuals from three newly recognized families, including patients with atypical manifestations. None of the five patients had talipes, and some also lacked the cardinal TARP features of Robin sequence and atrial septal defect. All three families demonstrated de novo mutations, and one of the families had two recurrences, with demonstrable maternal mosaicism. © 2013 Wiley Periodicals, Inc.
    American Journal of Medical Genetics Part A 01/2014; 164(1). DOI:10.1002/ajmg.a.36212 · 2.16 Impact Factor
  • Source
    • "Finally, we assessed the quality and sensitivity of SNV detection in our FFPE libraries compared to matched fresh-frozen pairs, and accounting for difference between TruSeq and ScriptSeq protocols. Transition:transversion (Ti/Tv) ratios of the RiboZeroGold ScriptSeq FFPE libraries were within the range reported from DNA sequencing studies [14,15], and highly similar to their matched RiboZeroGold ScriptSeq counterparts for known SNVs, 2.21 and 2.15 respectively. However, for novel SNVs, the Ti/Tv ratio was slightly higher in FFPE than fresh-frozen material, 2.23 and 1.81 respectively, likely a result of formalin-fixation. "
    ABSTRACT: Advantages of RNA-Seq over array based platforms are quantitative gene expression and discovery of expressed single nucleotide variants (eSNVs) and fusion transcripts from a single platform, but the sensitivity for each of these characteristics is unknown. We measured gene expression in a set of manually degraded RNAs, nine pairs of matched fresh-frozen, and FFPE RNA isolated from breast tumor with the hybridization based, NanoString nCounter (226 gene panel) and with whole transcriptome RNA-Seq using RiboZeroGold ScriptSeq V2 library preparation kits. We performed correlation analyses of gene expression between samples and across platforms. We then specifically assessed whole transcriptome expression of lincRNA and discovery of eSNVs and fusion transcripts in the FFPE RNA-Seq data. For gene expression in the manually degraded samples, we observed Pearson correlations of >0.94 and >0.80 with NanoString and ScriptSeq protocols, respectively. Gene expression data for matched fresh-frozen and FFPE samples yielded mean Pearson correlations of 0.874 and 0.783 for NanoString (226 genes) and ScriptSeq whole transcriptome protocols respectively, p < 2×10^-16. Specifically for lincRNAs, we observed superb Pearson correlation (0.988) between matched fresh-frozen and FFPE pairs. FFPE samples across NanoString and RNA-Seq platforms gave a mean Pearson correlation of 0.838. In FFPE libraries, we detected 53.4% of high confidence SNVs and 24% of high confidence fusion transcripts. Sensitivity of fusion transcript detection was not overcome by an increase in depth of sequencing up to 3-fold (increase from ~56 to ~159 million reads). Both NanoString and ScriptSeq RNA-Seq technologies yield reliable gene expression data for degraded and FFPE material. The high degree of correlation between NanoString and RNA-Seq platforms suggests discovery based whole transcriptome studies from FFPE material will produce reliable expression data. The RiboZeroGold ScriptSeq protocol performed particularly well for lincRNA expression from FFPE libraries, but detection of eSNV and fusion transcripts was less sensitive.
    PLoS ONE 11/2013; 8(11):e81925. DOI:10.1371/journal.pone.0081925 · 3.23 Impact Factor
  • Source
    • "Figure 2b summarizes the Ti/Tv ratio of the SNP sets by four single-sample calling pipelines. The raw SNPs have average Ti/Tv ratio 2.79, 2.79, and 2.73 for SAMtools, GATK, and glfSingle, while the filtered have 2.96, 2.99, 2.96 and 2.97 (Atlas2), closer to the expected value 3.02 (see Table S2 in [24]). These confirmed that filtering is important to improve variant calling quality. "
    ABSTRACT: Next generation sequencing (NGS) has been leading the genetic study of human disease into an era of unprecedented productivity. Many bioinformatics pipelines have been developed to call variants from NGS data. The performance of these pipelines depends crucially on the variant caller used and on the calling strategies implemented. We studied the performance of four prevailing callers, SAMtools, GATK, glftools and Atlas2, using single-sample and multiple-sample variant-calling strategies. Using the same aligner, BWA, we built four single-sample and three multiple-sample calling pipelines and applied the pipelines to whole exome sequencing data taken from 20 individuals. We obtained genotypes generated by Illumina Infinium HumanExome v1.1 Beadchip for validation analysis and then used Sanger sequencing as a "gold-standard" method to resolve discrepancies for selected regions of high discordance. Finally, we compared the sensitivity of three of the single-sample calling pipelines using known simulated whole genome sequence data as a gold standard. Overall, for single-sample calling, the called variants were highly consistent across callers and the pairwise overlapping rate was about 0.9. Compared with other callers, GATK had the highest rediscovery rate (0.9969) and specificity (0.99996), and the Ti/Tv ratio out of GATK was closest to the expected value of 3.02. Multiple-sample calling increased the sensitivity. Results from the simulated data suggested that GATK outperformed SAMtools and glfSingle in sensitivity, especially for low coverage data. Further, for the selected discrepant regions evaluated by Sanger sequencing, variant genotypes called by exome sequencing versus the exome array were more accurate, although the average variant sensitivity and overall genotype consistency rate were as high as 95.87% and 99.82%, respectively. In conclusion, GATK showed several advantages over other variant callers for general purpose NGS analyses. The GATK pipelines we developed perform very well.
    PLoS ONE 09/2013; 8(9):e75619. DOI:10.1371/journal.pone.0075619 · 3.23 Impact Factor