Efficiency and Power as a Function of Sequence Coverage, SNP Array Density, and Imputation

Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.
PLoS Computational Biology (Impact Factor: 4.62). 07/2012; 8(7):e1002604. DOI: 10.1371/journal.pcbi.1002604
Source: PubMed

ABSTRACT Author Summary
In this work we address a series of questions prompted by the rise of next-generation sequencing as a data collection strategy for genetic studies. How does low-coverage sequencing compare to traditional microarray-based genotyping? Do studies increase sensitivity by collecting both sequencing and array data? What can we learn about technology error modes from SNPs for which sequence and array data disagree? To answer these questions, we developed a statistical framework to estimate genotypes from sequence reads, array intensities, and imputation. Through experiments with intensity and read data from the HapMap and 1000 Genomes (1000G) Projects, we show that 1M SNP arrays used for genome-wide association studies perform similarly to 1× sequencing. We find that adding low-coverage sequence reads to dense array data significantly increases rare variant sensitivity, but adding dense array data to low-coverage sequencing has only a small impact. Finally, we describe an improved SNP calling algorithm used in the 1000G Project, inspired by a novel next-generation sequencing error mode identified through analysis of disputed SNPs. These results inform the use of next-generation sequencing in genetic studies and model an approach to further improve genotype calling methods.
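
As a rough illustration of what combining sequence reads, array data, and a population-level prior can look like, the following is a minimal sketch and not the framework described in the paper. It assumes a biallelic SNP, a simple binomial read-error model, a Hardy-Weinberg prior standing in for imputation, and conditional independence between data sources. All function names, the error rate, and the array likelihood vector are hypothetical.

```python
import math

def read_likelihoods(n_ref, n_alt, error_rate=0.01):
    """P(read counts | genotype) for genotypes with 0, 1, 2 alternate alleles.

    Assumption: each read independently shows the alternate allele with
    probability error_rate (hom-ref), 0.5 (het), or 1 - error_rate (hom-alt).
    """
    n = n_ref + n_alt
    liks = []
    for alt_copies in (0, 1, 2):
        frac = alt_copies / 2
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        liks.append(math.comb(n, n_alt) * p_alt**n_alt * (1 - p_alt)**n_ref)
    return liks

def hardy_weinberg_prior(alt_freq):
    """Genotype prior from an alternate allele frequency (stand-in for imputation)."""
    ref_freq = 1 - alt_freq
    return [ref_freq**2, 2 * ref_freq * alt_freq, alt_freq**2]

def genotype_posterior(prior, likelihood_sources):
    """Multiply the prior by each source's likelihoods, then normalize.

    Assumption: the data sources are conditionally independent given the genotype.
    """
    unnorm = list(prior)
    for liks in likelihood_sources:
        unnorm = [u * l for u, l in zip(unnorm, liks)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Example: two reads (one ref, one alt) plus a hypothetical array likelihood
# vector that mildly favors the heterozygote.
reads = read_likelihoods(n_ref=1, n_alt=1)
array = [0.2, 0.7, 0.1]  # hypothetical values, not real intensity output
posterior = genotype_posterior(hardy_weinberg_prior(0.05), [reads, array])
```

With no reads at all, every read likelihood equals 1 and the posterior reduces to the prior, which is one way to see why very low coverage leans heavily on imputation or array data.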

  • Source
    • "While exome sequencing is becoming a more widely accessible and available tool in the context of translational medicine research and in clinical diagnosis (Need et al. 2012), the choice of an accurate and reliable pipeline is of fundamental importance. The clinical environment has additional pressure to reduce the number of false-positive variant calls, while keeping the sensitivity as high as possible (Ku et al. 2011; Flannick et al. 2012). As new analytical methods are developed, simply comparing the characteristics and quality of the calls alone is not sufficient."
    ABSTRACT: The choice of an appropriate variant calling pipeline for exome sequencing data is becoming increasingly important in translational medicine projects and clinical contexts. Within GOSgene, which facilitates genetic analysis as part of a joint effort of University College London and Great Ormond Street Hospital, we aimed to optimize a variant calling pipeline suitable for our clinical context. We implemented the GATK/Queue framework and evaluated the performance of its two callers: the classical UnifiedGenotyper and the new variant discovery tool HaplotypeCaller. We performed an experimental validation of the loss-of-function (LoF) variants called by the two methods using Sequenom technology. UnifiedGenotyper showed a total validation rate of 97.6% for LoF single-nucleotide polymorphisms (SNPs) and 92.0% for insertions or deletions (INDELs), whereas HaplotypeCaller showed 91.7% for SNPs and 55.9% for INDELs. We confirm that GATK/Queue is a reliable pipeline in translational medicine and clinical contexts. We conclude that in our working environment, UnifiedGenotyper is the caller of choice, being an accurate method with a high validation rate for error-prone calls such as LoF variants. Finally, we highlight the importance of experimental validation, especially for INDELs, as part of a standard pipeline in clinical environments.
    01/2014; 2(1):58-63. DOI:10.1002/mgg3.42
  • Source
    • "Although these ‘off-target’ reads are sometimes discarded in analyses, in our view, they can be extremely useful, particularly in samples that have not been genotyped with whole genome arrays. These off-target reads can be used to estimate the local or global ancestry of each sample (enabling improved case–control matching in association analyses or admixture mapping analyses), can be combined with a panel of reference haplotypes to estimate genotypes across the genome (59–61) and can facilitate detection of large structural variants (such as deletions of entire genes) (62). "
    ABSTRACT: Genetic association and linkage studies can provide insights into complex disease biology, guiding the development of new diagnostic and therapeutic strategies. Over the past decade, genetic association studies have largely focused on common, easy to measure genetic variants shared between many individuals. These common variants typically have subtle functional consequence and translating the resulting association signals into biological insights can be challenging. In the last few years, exome sequencing has emerged as a cost-effective strategy for extending these studies to include rare coding variants, which often have more marked functional consequences. Here, we provide practical guidance in the design and analysis of complex trait association studies focused on rare, coding variants.
    Human Molecular Genetics 09/2012; 21(R1):R1-9. DOI:10.1093/hmg/dds387 · 6.39 Impact Factor
  • Source
    ABSTRACT: Researchers have successfully applied exome sequencing to discover causal variants in selected individuals with familial, highly penetrant disorders. We demonstrate the utility of exome sequencing followed by imputation for discovering low-frequency variants associated with complex quantitative traits. We performed exome sequencing in a reference panel of 761 African Americans and then imputed newly discovered variants into a larger sample of more than 13,000 African Americans for association testing with the blood cell traits hemoglobin, hematocrit, white blood count, and platelet count. First, we illustrate the feasibility of our approach by demonstrating genome-wide-significant associations for variants that are not covered by conventional genotyping arrays; for example, one such association is that between higher platelet count and an MPL c.117G>T (p.Lys39Asn) variant encoding a p.Lys39Asn amino acid substitution of the thrombopoietin receptor gene (p = 1.5 × 10^-11). Second, we identified an association between missense variants of LCT and higher white blood count (p = 4 × 10^-13). Third, we identified low-frequency coding variants that might account for allelic heterogeneity at several known blood cell-associated loci: MPL c.754T>C (p.Tyr252His) was associated with higher platelet count; CD36 c.975T>G (p.Tyr325*) was associated with lower platelet count; and several missense variants at the α-globin gene locus were associated with lower hemoglobin. By identifying low-frequency missense variants associated with blood cell traits not previously reported by genome-wide association studies, we establish that exome sequencing followed by imputation is a powerful approach to dissecting complex, genetically heterogeneous traits in large population-based studies.
    The American Journal of Human Genetics 10/2012; 91(5). DOI:10.1016/j.ajhg.2012.08.031 · 10.93 Impact Factor