Efficiency and power as a function of sequence coverage, SNP array density, and imputation.

Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.
PLoS Computational Biology (Impact Factor: 4.87). 07/2012; 8(7):e1002604. DOI: 10.1371/journal.pcbi.1002604
Source: PubMed

ABSTRACT: High-coverage whole-genome sequencing provides near-complete information about genetic variation. However, other technologies can be more efficient in some settings by (a) reducing redundant coverage within samples and (b) exploiting patterns of genetic variation across samples. To characterize as many samples as possible, many genetic studies therefore employ lower-coverage sequencing or SNP array genotyping coupled to statistical imputation. To compare these approaches individually and in conjunction, we developed a statistical framework to estimate genotypes jointly from sequence reads, array intensities, and imputation. In European samples, we find similar sensitivity (89%) and specificity (99.6%) from imputation with either 1× sequencing or 1 M SNP arrays. Sensitivity is increased, particularly for low-frequency polymorphisms (MAF < 5%), when low-coverage sequence reads are added to dense genome-wide SNP arrays; the converse, however, is not true. At sites where sequence reads and array intensities produce different sample genotypes, joint analysis reduces genotype errors and identifies novel error modes. Our joint framework informs the use of next-generation sequencing in genome-wide association studies and supports development of improved methods for genotype calling.
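
The idea of estimating genotypes jointly from several data sources can be illustrated with a minimal sketch. This is not the authors' framework: it assumes, purely for illustration, that imputation supplies a prior over the three genotypes at a biallelic site, that sequence reads follow a simple binomial error model, and that the array and read likelihoods are independent given the genotype. The function names `read_likelihoods` and `joint_posterior`, the error rate, and all numbers are hypothetical.

```python
from math import comb


def read_likelihoods(n_ref, n_alt, err=0.01):
    """P(reads | genotype) for genotypes carrying 0, 1, or 2 alt alleles.

    Assumed model: each read independently shows the alt allele with a
    probability set by the genotype and a per-read error rate `err`.
    """
    n = n_ref + n_alt
    liks = []
    for alt_copies in (0, 1, 2):
        frac_alt = alt_copies / 2
        # Probability that a single read shows the alt allele.
        p_alt = frac_alt * (1 - err) + (1 - frac_alt) * err
        liks.append(comb(n, n_alt) * p_alt**n_alt * (1 - p_alt) ** n_ref)
    return liks


def joint_posterior(prior, *likelihoods):
    """Multiply a genotype prior by per-source likelihoods and normalize.

    Assumes the sources are conditionally independent given the genotype.
    """
    unnorm = list(prior)
    for lik in likelihoods:
        unnorm = [u * l for u, l in zip(unnorm, lik)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Illustrative numbers only; none of these come from the paper.
imputation_prior = [0.90, 0.09, 0.01]  # imputation favours homozygous reference
array_lik = [0.30, 0.60, 0.10]         # array intensities weakly favour heterozygous
seq_lik = read_likelihoods(n_ref=1, n_alt=1)  # two reads: one ref, one alt

posterior = joint_posterior(imputation_prior, array_lik, seq_lik)
best = max(range(3), key=lambda g: posterior[g])  # index = number of alt alleles
```

In this toy case, two reads and a weakly informative array signal are enough to move the call from the imputation prior's homozygous-reference genotype to the heterozygous one, which is the kind of disagreement the abstract describes joint analysis as resolving.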

  • ABSTRACT: Here we report the results of gene expression analyses using multiple probesets aimed at determining the incidence of Ikaros/IKZF1 deletions in pediatric B-precursor acute lymphoblastic leukemia (BPL). Primary leukemia cells from 122 Philadelphia chromosome (Ph)(+) BPL patients and 237 Ph(-) BPL patients as well as normal hematopoietic cells from 74 normal non-leukemic bone marrow specimens were organized according to expression levels of IKZF1 transcripts using a two-way hierarchical clustering technique to identify specimens with low IKZF1 expression for the 10 probesets interrogating Exons 1 through 4 and Exon 8. Our analysis demonstrated no changes in expression that would be expected from homozygous or heterozygous deletions of IKZF1 in primary leukemic cells. Similar results were obtained in gene expression analysis of primary leukemic cells from 20 Ph(+) and 155 Ph(-) BPL patients in a validation dataset. Taken together, our gene expression analyses in 534 pediatric BPL cases, including 142 cases with Ph(+) BPL, contradict previous reports that were based on SNP array data and suggested that Ph(+) pediatric BPL is characterized by a high frequency of homozygous or heterozygous IKZF1 deletions. Further, exon-specific genomic PCR analysis of primary leukemia cells from 21 high-risk pediatric BPL patients, 11 standard-risk pediatric BPL patients, and 8 patients with infant BPL did not show any evidence for homozygous IKZF1 locus deletions. Nor was there any evidence for homozygous or heterozygous intragenic IKZF1 deletions.
    International journal of molecular medical science. 07/2013; 3(9):72-82.
  • ABSTRACT: Imputation using external reference panels is a widely used approach for increasing power in GWAS and meta-analysis. Existing HMM-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants (increasing to 87% (60%) when summary LD information is available from target samples) versus 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and is computationally very fast. As an empirical demonstration, we apply our method to 7 case-control phenotypes from the WTCCC data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of $\chi^2$ association statistics) compared to HMM-based imputation from individual-level genotypes at the 227 (176) published SNPs in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of 4 lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic vs. non-genic loci for these traits, as compared to an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.
    Bioinformatics (Oxford, England). 09/2013.
  • ABSTRACT: Although genome-wide association studies (GWAS) have identified many common variants associated with complex traits, low-frequency and rare variants have not been interrogated in a comprehensive manner. Imputation from dense reference panels, such as the 1000 Genomes Project (1000G), enables testing of ungenotyped variants for association. Here we present the results of imputation using a large, new population-specific panel: the Genome of The Netherlands (GoNL). We benchmarked the performance of the 1000G and GoNL reference sets by comparing imputation genotypes with 'true' genotypes typed on ImmunoChip in three European populations (Dutch, British, and Italian). GoNL showed significant improvement in the imputation quality for rare variants (MAF 0.05–0.5%) compared with 1000G. In Dutch samples, the mean observed Pearson correlation, r², increased from 0.61 to 0.71. We also saw improved imputation accuracy for other European populations (in the British samples, r² improved from 0.58 to 0.65, and in the Italians from 0.43 to 0.47). A combined reference set comprising 1000G and GoNL improved the imputation of rare variants even further. The Italian samples benefitted the most from this combined reference (the mean r² increased from 0.47 to 0.50). We conclude that the creation of a large population-specific reference is advantageous for imputing rare variants and that a combined reference panel across multiple populations yields the best imputation results. European Journal of Human Genetics advance online publication, 4 June 2014; doi:10.1038/ejhg.2014.19
    European journal of human genetics: EJHG (Impact Factor: 3.56). 06/2014.
