Efficiency and Power as a Function of Sequence Coverage, SNP Array Density, and Imputation

Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.
PLoS Computational Biology (Impact Factor: 4.83). 07/2012; 8(7):e1002604. DOI: 10.1371/journal.pcbi.1002604
Source: PubMed

ABSTRACT (Author Summary)
In this work we address a series of questions prompted by the rise of next-generation sequencing as a data collection strategy for genetic studies. How does low-coverage sequencing compare to traditional microarray-based genotyping? Do studies increase sensitivity by collecting both sequencing and array data? What can we learn about technology error modes from analysis of SNPs for which sequence and array data disagree? To answer these questions, we developed a statistical framework to estimate genotypes from sequence reads, array intensities, and imputation. Through experiments with intensity and read data from the HapMap and 1000 Genomes (1000G) Projects, we show that the 1M SNP arrays used for genome-wide association studies perform similarly to 1× sequencing. We find that adding low-coverage sequence reads to dense array data significantly increases rare-variant sensitivity, but adding dense array data to low-coverage sequencing has only a small impact. Finally, we describe an improved SNP-calling algorithm used in the 1000G Project, inspired by a novel next-generation sequencing error mode identified through analysis of disputed SNPs. These results inform the use of next-generation sequencing in genetic studies and model an approach to further improving genotype-calling methods.
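The core idea behind estimating genotypes from low-coverage sequence reads can be illustrated with a simple Bayesian model: a binomial read-count likelihood under a per-base error rate, combined with a Hardy-Weinberg prior from the population allele frequency. This is only a minimal sketch of the general technique, not the authors' actual framework (which also integrates array intensities and imputation); the error rate and allele frequency below are placeholder values.

```python
from math import comb

def genotype_posteriors(n_ref, n_alt, eps=0.01, alt_freq=0.2):
    """Posterior over genotypes {RR, RA, AA} given allele read counts.

    Assumes independent reads with per-base error rate eps and a
    Hardy-Weinberg prior built from the population alt-allele
    frequency. eps and alt_freq are illustrative placeholders.
    """
    n = n_ref + n_alt
    # Probability that a single read shows the alt allele, per genotype.
    p_alt = {"RR": eps, "RA": 0.5, "AA": 1.0 - eps}
    p_ref_allele = 1.0 - alt_freq
    prior = {
        "RR": p_ref_allele ** 2,
        "RA": 2 * p_ref_allele * alt_freq,
        "AA": alt_freq ** 2,
    }
    # Binomial likelihood of the observed read counts under each genotype.
    like = {g: comb(n, n_alt) * p_alt[g] ** n_alt * (1 - p_alt[g]) ** n_ref
            for g in p_alt}
    joint = {g: like[g] * prior[g] for g in like}
    z = sum(joint.values())
    return {g: joint[g] / z for g in joint}

# At ~1x coverage, a single alt-bearing read leaves real uncertainty
# between heterozygote and sequencing error:
print(genotype_posteriors(n_ref=0, n_alt=1))
```

With deeper coverage the likelihood dominates the prior and the posterior concentrates on one genotype, which is why even modest additional reads help rare-variant sensitivity.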

  • ABSTRACT: The choice of an appropriate variant-calling pipeline for exome sequencing data is becoming increasingly important in translational medicine projects and clinical contexts. Within GOSgene, which facilitates genetic analysis as part of a joint effort of University College London and the Great Ormond Street Hospital, we aimed to optimize a variant-calling pipeline suitable for our clinical context. We implemented the GATK/Queue framework and evaluated the performance of its two callers: the classical UnifiedGenotyper and the new variant discovery tool HaplotypeCaller. We performed an experimental validation of the loss-of-function (LoF) variants called by the two methods using Sequenom technology. UnifiedGenotyper showed a total validation rate of 97.6% for LoF single-nucleotide polymorphisms (SNPs) and 92.0% for insertions or deletions (INDELs), whereas HaplotypeCaller achieved 91.7% for SNPs and 55.9% for INDELs. We confirm that GATK/Queue is a reliable pipeline in translational medicine and clinical contexts. We conclude that in our working environment UnifiedGenotyper is the caller of choice, being an accurate method with a high validation rate for error-prone calls such as LoF variants. Finally, we highlight the importance of experimental validation, especially for INDELs, as part of a standard pipeline in clinical environments.
    01/2014; 2(1):58-63. DOI:10.1002/mgg3.42
  • ABSTRACT: There is considerable debate about the most efficient way to interrogate rare coding variants in association studies. The options include direct genotyping of specific known coding variants in genes or, alternatively, sequencing across the entire exome to capture known as well as novel variants. Each strategy has advantages and disadvantages, but the availability of cost-efficient exome arrays has made the former appealing. Here we consider the utility of a direct genotyping chip, the Illumina HumanExome array (HE), by evaluating its content based on (1) functionality and (2) amenability to imputation. We explored these issues by genotyping a large, ethnically diverse cohort on the HumanOmniExpressExome array (HOEE), which combines the HE with content from the GWAS array (HOE). We find that use of the HE is likely to be a cost-effective way of expanding GWAS, but it does have some drawbacks that deserve consideration when planning studies.
    Gene 04/2014; 540(1):104–109. DOI:10.1016/j.gene.2014.01.069
  • ABSTRACT: Although genome-wide association studies (GWAS) have identified many common variants associated with complex traits, low-frequency and rare variants have not been interrogated in a comprehensive manner. Imputation from dense reference panels, such as the 1000 Genomes Project (1000G), enables testing of ungenotyped variants for association. Here we present the results of imputation using a large, new population-specific panel: the Genome of the Netherlands (GoNL). We benchmarked the performance of the 1000G and GoNL reference sets by comparing imputed genotypes with 'true' genotypes typed on ImmunoChip in three European populations (Dutch, British, and Italian). GoNL showed significant improvement in imputation quality for rare variants (MAF 0.05–0.5%) compared with 1000G. In Dutch samples, the mean observed Pearson correlation, r², increased from 0.61 to 0.71. We also saw improved imputation accuracy for other European populations (in the British samples, r² improved from 0.58 to 0.65, and in the Italian samples from 0.43 to 0.47). A combined reference set comprising 1000G and GoNL improved the imputation of rare variants even further; the Italian samples benefited the most from this combined reference (mean r² increased from 0.47 to 0.50). We conclude that the creation of a large population-specific reference is advantageous for imputing rare variants, and that a combined reference panel across multiple populations yields the best imputation results.
    European Journal of Human Genetics (EJHG) 06/2014; DOI:10.1038/ejhg.2014.19
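The accuracy metric reported in the benchmark above, the squared Pearson correlation r² between imputed allele dosages and true genotypes, is straightforward to compute. A minimal sketch with made-up example data (the genotype and dosage vectors below are purely illustrative):

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes (0/1/2 copies
    of the alt allele) and imputed dosages (continuous values in [0, 2]).
    """
    n = len(true_genotypes)
    mx = sum(true_genotypes) / n
    my = sum(imputed_dosages) / n
    cov = sum((x - mx) * (y - my)
              for x, y in zip(true_genotypes, imputed_dosages))
    var_x = sum((x - mx) ** 2 for x in true_genotypes)
    var_y = sum((y - my) ** 2 for y in imputed_dosages)
    return (cov * cov) / (var_x * var_y)

# Illustrative data: imputation that tracks the truth closely yields r² near 1.
truth  = [0, 0, 1, 2, 0, 1, 0, 2]
dosage = [0.1, 0.0, 0.9, 1.8, 0.2, 1.2, 0.1, 1.7]
print(imputation_r2(truth, dosage))
```

Because r² is computed per variant and then averaged within a frequency bin, a gain such as 0.61 to 0.71 for rare variants reflects a meaningful improvement in effective sample size for downstream association tests.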
