The Use of Imputed Values in the Meta-Analysis of Genome-Wide Association Studies

Cancer Prevention Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA.
Genetic Epidemiology (Impact Factor: 2.95). 11/2011; 35(7):597-605. DOI: 10.1002/gepi.20608
Source: PubMed

ABSTRACT In genome-wide association studies (GWAS), it is common practice to impute the genotypes of untyped single nucleotide polymorphisms (SNPs) by exploiting the linkage disequilibrium structure among SNPs. The use of imputed genotypes improves genome coverage and makes it possible to perform meta-analysis combining results from studies genotyped on different platforms. A popular way of using imputed data is the "expectation-substitution" method, which treats the imputed dosage as if it were the true genotype. In current practice, the estimates given by the expectation-substitution method are usually combined in meta-analysis using an inverse variance weighting (IVM) scheme. However, the IVM is not optimal, as the estimates given by the expectation-substitution method are generally biased. The optimal weight is, in fact, proportional to the inverse variance and the expected value of the effect size estimates. We show both theoretically and numerically that the bias of the estimates is very small under practical conditions of low effect sizes in GWAS. This finding validates the use of the expectation-substitution method, and shows that the inverse variance is a good approximation of the optimal weight. Through simulation, we compared the power of the IVM method with several methods including the optimal weight, the regular z-score meta-analysis and a recently proposed "imputation aware" meta-analysis method (Zaitlen and Eskin [2010] Genet Epidemiol 34:537-542). Our results show that the performance of the inverse variance weight is always indistinguishable from the optimal weight and similar to or better than the other two methods.
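The quantities named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook formulas for an imputed dosage, a fixed-effect inverse variance weighted estimate, and a sample-size weighted z-score combination. It is not the authors' code, it does not implement their optimal weight or the "imputation aware" method, and the function names are hypothetical.

```python
import math


def dosage(p_het, p_hom_alt):
    """Expected count of the coded allele, given the imputation posterior
    probabilities of the heterozygous and homozygous-alternate genotypes.
    Expectation-substitution uses this value in place of the true genotype."""
    return p_het + 2.0 * p_hom_alt


def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse variance weighted meta-analysis.

    betas: per-study effect size estimates
    ses:   per-study standard errors
    Returns (pooled estimate, pooled standard error, z statistic)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


def zscore_meta(zs, ns):
    """Z-score meta-analysis with weights proportional to sqrt(sample size)
    (an assumed, commonly used weighting)."""
    numerator = sum(math.sqrt(n) * z for z, n in zip(zs, ns))
    return numerator / math.sqrt(sum(ns))
```

For example, `inverse_variance_meta([0.1, 0.3], [0.1, 0.1])` pools two equally precise studies, so the pooled estimate is their simple average and the pooled standard error shrinks by a factor of the square root of two.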

  • ABSTRACT: Adult body height is a quantitative trait for which genome-wide association studies (GWAS) have identified numerous loci, primarily in European populations. These loci, comprising common variants, explain <10% of the phenotypic variance in height. We searched for novel associations between height and common (minor allele frequency, MAF≥5%) or infrequent (0.5%<MAF<5%) variants across the exome in African Americans. Using a reference panel of 1,692 African Americans and 471 Europeans from the National Heart, Lung, and Blood Institute's (NHLBI) Exome Sequencing Project (ESP), we imputed whole-exome sequence data into 13,719 African Americans with existing array-based GWAS data (discovery). Variants achieving a height-association threshold of P<5E-06 in the imputed dataset were followed up in an independent sample of 1,989 African Americans with whole-exome sequence data (replication). We used P<2.5E-07 (=0.05/196,779 variants) to define statistically significant associations in meta-analyses combining the discovery and replication sets (N=15,708). We discovered and replicated 3 independent loci for association: 5p13.3/C5orf22/rs17410035 (MAF=0.10, β=0.64 cm, P=8.3E-08), 13q14.2/SPRYD7/rs114089985 (MAF=0.03, β=1.46 cm, P=4.8E-10), and 17q23.3/GH2/rs2006123 (MAF=0.30, β=0.47 cm, P=4.7E-09). Conditional analyses suggested 5p13.3 (C5orf22/rs17410035) and 13q14.2 (SPRYD7/rs114089985) may harbor novel height alleles independent of previous GWAS-identified variants (r² with GWAS loci<0.01), whereas 17q23.3/GH2/rs2006123 was correlated with GWAS-identified variants in European and African populations. Notably, 13q14.2/rs114089985 is infrequent in African Americans (MAF=3%), extremely rare in European Americans (MAF=0.03%), and monomorphic in Asian populations, suggesting it may be an African American-specific height allele. Our findings demonstrate that whole-exome imputation of sequence variants can identify low frequency variants and discover novel variants in non-European populations.
    Human Molecular Genetics 07/2014; DOI:10.1093/hmg/ddu361 · 6.68 Impact Factor
  • ABSTRACT: Genotype imputation has become standard practice in modern genetic studies. As sequencing-based reference panels continue to grow, more markers are being imputed well, but at the same time even more markers with relatively low minor allele frequency are being imputed with low imputation quality. Here, we propose new methods that incorporate imputation uncertainty for downstream association analysis, with improved power and/or computational efficiency. We consider two scenarios: I) when posterior probabilities of all potential genotypes are estimated; and II) when only the one-dimensional summary statistic, imputed dosage, is available. For scenario I, we have developed an expectation-maximization likelihood-ratio test (EM-LRT) for association based on posterior probabilities. When only imputed dosages are available (scenario II), we first sample the genotype probabilities from their posterior distribution given the dosages, and then apply the EM-LRT on the sampled probabilities. Our simulations show that the type I error of the proposed EM-LRT methods is protected under both scenarios. Compared with existing methods, EM-LRT-Prob (for scenario I) offers optimal statistical power across a wide spectrum of MAF and imputation quality. EM-LRT-Dose (for scenario II) achieves a similar level of statistical power as EM-LRT-Prob and outperforms the standard Dosage method, especially for markers with relatively low MAF or imputation quality. Applications to two real data sets, the Cebu Longitudinal Health and Nutrition Survey study and the Women's Health Initiative Study, provide further support to the validity and efficiency of our proposed methods.
    PLoS ONE 11/2014; 9(11):e110679. DOI:10.1371/journal.pone.0110679 · 3.53 Impact Factor
  • ABSTRACT: In the large cohorts that have been used for genome-wide association studies (GWAS), it is prohibitively expensive to sequence all cohort members. A cost-effective strategy is to sequence subjects with extreme values of quantitative traits or those with specific diseases. By imputing the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, one can dramatically increase the number of subjects with information on rare variants. However, ignoring the uncertainties of imputed rare variants in downstream association analysis will inflate the type I error when sequenced subjects are not a random subset of the GWAS subjects. In this article, we provide a valid and efficient approach to combining observed and imputed data on rare variants. We consider commonly used gene-level association tests, all of which are constructed from the score statistic for assessing the effects of individual variants on the trait of interest. We show that the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects is unbiased. We derive a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests always have correct type I error. We demonstrate through extensive simulation studies that the proposed tests are substantially more powerful than the use of accurately imputed variants only and the use of sequencing data alone. We provide an application to the Women's Health Initiative. The relevant software is freely available.
    Proceedings of the National Academy of Sciences 01/2015; DOI:10.1073/pnas.1406143112 · 9.81 Impact Factor

