From the SelectedWorks of Shuo Jiao
The use of imputed values in the meta-analysis of
genome-wide association studies.
Genetic Epidemiology (2011)
The Use of Imputed Values in the Meta-Analysis of Genome-Wide Association Studies
Shuo Jiao,1* Li Hsu,2 Carolyn M. Hutter,1 and Ulrike Peters1,3
1 Cancer Prevention Program, Fred Hutchinson Cancer Research Center, Seattle, Washington
2 Biostatistics and Biomathematics Program, Fred Hutchinson Cancer Research Center, Seattle, Washington
3 Department of Epidemiology, School of Public Health, University of Washington, Seattle, Washington
In genome-wide association studies (GWAS), it is a common practice to impute the genotypes of untyped single nucleotide
polymorphism (SNP) by exploiting the linkage disequilibrium structure among SNPs. The use of imputed genotypes
improves genome coverage and makes it possible to perform meta-analysis combining results from studies genotyped on
different platforms. A popular way of using imputed data is the ‘‘expectation-substitution’’ method, which treats the imputed
dosage as if it were the true genotype. In current practice, the estimates given by the expectation-substitution method are
usually combined using an inverse variance weighting (IVM) scheme in meta-analysis. However, the IVM is not optimal as the
estimates given by the expectation-substitution method are generally biased. The optimal weight is, in fact, proportional to the
inverse variance and the expected value of the effect size estimates. We show both theoretically and numerically that the bias
of the estimates is very small under practical conditions of low effect sizes in GWAS. This finding validates the use of the
expectation-substitution method, and shows the inverse variance is a good approximation of the optimal weight. Through
simulation, we compared the power of the IVM method with several methods including the optimal weight, the regular
z-score meta-analysis and a recently proposed ‘‘imputation aware’’ meta-analysis method (Zaitlen and Eskin [2010] Genet
Epidemiol 34:537–542). Our results show that the performance of the inverse variance weight is always indistinguishable from
the optimal weight and similar to or better than the other two methods. Genet. Epidemiol. 2011.
© 2011 Wiley-Liss, Inc.
Key words: GWAS; imputation; bias; meta-analysis; weight
Contract grant sponsor: National Institutes of Health; Contract grant numbers: 5R01 CA059045; 5U01 CA137088; R01AG14358;
P01CA53996; Contract grant sponsors: Division of Cancer Prevention; National Cancer Institute; National Institutes of Health;
Department of Health and Human Services; NIH GEI; Contract grant numbers: Z01 CP 010200; U01 HG 004438.
*Correspondence to: Shuo Jiao, Cancer Prevention Program, Fred Hutchinson Cancer Research Center, Seattle, Washington.
Received 6 April 2011; Revised 2 June 2011; Accepted 3 June 2011
Published online in Wiley Online Library (wileyonlinelibrary.com).
INTRODUCTION
The advance of high-throughput technology makes it
possible to genotype hundreds of thousands of single
nucleotide polymorphisms (SNPs) simultaneously, which
allows researchers to examine genetic variation across the
genome in genome-wide association studies
(GWAS). By testing the association between SNPs and
complex traits and diseases, GWAS have successfully
uncovered hundreds of novel susceptibility loci to date
[Hindorff et al., 2009].
Even though current GWAS platforms include markers
for hundreds of thousands or even millions of SNPs, they
still only directly assay a proportion of the whole genome.
Obviously, if only directly genotyped SNPs are considered,
associated SNPs can go undetected. Another
drawback of the partial coverage is that the selected SNP
panel often varies for different platforms [Barrett and
Cardon, 2006]. When different studies use different
platforms, combining across studies will lead to a much
reduced set of SNPs genotyped in all the studies. For
example, the overlap between the Affymetrix SNP Array
6.0 and Illumina OmniExpress genotyping array is less
than 30%. An effective approach to overcome the afore-
mentioned problems is to impute the untyped SNPs based
on a common reference panel.
The basic idea behind genotype imputation is to take
advantage of the linkage disequilibrium (LD) information
among SNPs. Because of the LD and haplotype structure,
genotyped variants can provide information about untyped
SNPs. It is feasible to use data on genotyped SNPs along
with an appropriate reference panel containing informa-
tion on a larger set of SNPs to predict the genotypes of the
ungenotyped SNPs. Currently, the HapMap project [The
International HapMap Consortium, 2005, 2007] provides
such reference panels, and future studies are likely to
extend to the 1000 Genomes Project [The 1000 Genomes
Project Consortium, 2010] or other whole genome or
exome sequence data. The most popular imputation
programs include MACH [Li et al., 2010], IMPUTE
[Marchini et al., 2007], and Beagle [Browning and
Browning, 2009], among others.
There are several approaches to using imputed values in
the association analysis. Suppose a SNP of a given subject $i$
has genotype $g_i$, where $g_i$ takes one of the three values 0, 1,
and 2, the number of copies of one of the alleles (typically
the ‘‘minor’’ or lower frequency allele). The output of an
imputation program usually includes three probabilities:
$p_{i0} = P(g_i = 0)$; $p_{i1} = P(g_i = 1)$; $p_{i2} = P(g_i = 2)$. One method is
to use the most likely genotype (the genotype with the
highest probability) as if it were the true genotype.
However, it has been shown in Lin and Huang [2007] that
this method leads to intrinsically biased estimates because
of the unavoidable discrepancy between the most likely
genotype and the true genotype. A popular alternative
approach is the so-called expectation-substitution method.
Instead of using the most likely genotype, this method
uses the imputed dosage, that is, the expected number of
alleles $= p_{i1} + 2p_{i2}$, as if it were the true genotype. In the
haplotype analysis framework, several studies [Kraft et al.,
2005; Kraft and Stram, 2007; Cordell, 2006] have shown
through a series of simulation experiments that the
expectation-substitution method has no noticeable bias
under practical settings. It is also possible to use Bayesian
methods [Marchini et al., 2007; Servin and Stephens, 2007]
to perform the imputation and the association test at the
same time; however, these methods are usually computationally
intensive and hence not feasible on a genome-wide
scale. Therefore, in the remainder of the article, we will
focus on the expectation-substitution method.
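As a small illustration of the two approaches described above, the following sketch computes the imputed dosage and the most likely genotype from a matrix of imputation probabilities. The function names and the example probabilities are hypothetical and are not taken from any imputation program.

```python
import numpy as np

def dosage(probs):
    """Expected allele count p_i1 + 2 * p_i2 for each row (p_i0, p_i1, p_i2)."""
    probs = np.asarray(probs, dtype=float)
    return probs[:, 1] + 2.0 * probs[:, 2]

def most_likely_genotype(probs):
    """Genotype (0, 1, or 2) with the highest imputation probability."""
    return np.argmax(np.asarray(probs, dtype=float), axis=1)

# Hypothetical imputation output for three subjects.
probs = np.array([
    [0.90, 0.10, 0.00],
    [0.20, 0.50, 0.30],
    [0.05, 0.15, 0.80],
])

print(dosage(probs))                # fractional dosages used by expectation-substitution
print(most_likely_genotype(probs))  # hard calls used by the most-likely-genotype method
```

The second subject shows the difference most clearly: the hard call is 1, whereas the dosage keeps the information that genotype 2 is more probable than genotype 0.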
If multiple studies are imputed using the same
reference, then the different studies have data on a
common set of SNPs, making meta-analysis across studies
possible. Because combining studies increases sample size,
meta-analysis increases power and allows detection of loci
not found in individual studies. One way of performing
meta-analysis is to use the regular z-score meta-analysis
(MetaZ), which combines z-scores weighted by square root
of sample sizes. Alternatively, the effect size meta-analysis
(MetaBeta) combines effect sizes by computing a weighted
average of the estimates. For meta-analysis that involves
imputed genotypes, the imputation quality is an important
factor. Hence, it seems natural that the imputation quality
should also be reflected in the weight for meta-analysis.
For MetaZ, de Bakker et al. [2008] suggested scaling the
weighted sum of z-scores by the imputation quality
measure. Based on this idea, Zaitlen and Eskin [2010]
have recently proposed an ‘‘imputation aware’’ method to
combine z-scores. In the ‘‘imputation aware’’ method, the
weight for the z-score of each study is proportional to $R\sqrt{n}$,
where $R^2$ is the imputation quality measure and $n$ is the
sample size. Results have shown that the ‘‘imputation aware’’
method is more powerful than the regular z-score meta-analysis
when the imputation quality varies among
studies [Zaitlen and Eskin, 2010].
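The two z-score schemes can be sketched as follows, assuming the regular MetaZ weight is the square root of the sample size and the ‘‘imputation aware’’ weight is the square root of the sample size times the imputation quality. The function names and the input values are hypothetical.

```python
import numpy as np

def combine_z(z, weights):
    """Weighted z-score combination: sum(w * z) / sqrt(sum(w^2))."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * z) / np.sqrt(np.sum(w ** 2)))

def meta_z(z, n):
    """Regular z-score meta-analysis: weights sqrt(n)."""
    return combine_z(z, np.sqrt(np.asarray(n, dtype=float)))

def imputation_aware_z(z, n, r2):
    """Imputation-aware variant: weights sqrt(n * R^2)."""
    n = np.asarray(n, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    return combine_z(z, np.sqrt(n * r2))

# Two hypothetical studies of equal size; the first is poorly imputed.
z = [2.0, 3.0]
n = [1000, 1000]
r2 = [0.25, 1.0]

z_regular = meta_z(z, n)
z_aware = imputation_aware_z(z, n, r2)
print(z_regular, z_aware)
```

With equal sample sizes, the regular scheme weights both studies equally, while the imputation-aware scheme down-weights the poorly imputed study.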
For MetaBeta, most studies use the traditional inverse
variance weighting (IVM) to combine estimates from
imputed and genotyped SNPs in current practice [Soranzo
et al., 2009; Willer et al., 2008]. However, it is unknown
whether the IVM is the optimal weighting scheme under
this situation. In this article, we address this question. For
imputed SNPs, we find that the optimal weight is
proportional to both the expected value and inverse
variance of estimates given by the expectation-substitution
method. While the expectation-substitution method does
not give unbiased estimators in general, the bias is usually
very small under practical situations of GWAS. Based on
this finding, we show that the inverse-variance weighting
scheme is a good approximation of the optimal weight for
the meta-analysis of imputed SNPs. These results are
important, because they validate that the expectation-
substitution method and the IVM scheme currently being
used in GWAS meta-analysis are adequate and close to
optimal in GWAS settings.
MATERIALS AND METHODS
Consider a case-control study of $n$ individuals. For a
given SNP, suppose for subject $i$, $i = 1,\ldots,n$, the true
genotype is $g_i = 0$, 1, or 2 and the disease status is $d_i = 0$ or 1,
where 0 indicates control and 1 indicates case, then the
standard logistic model for modeling the association
between the SNP and disease status is:
$$\mu(g_i;\beta_0,\beta_1) \equiv P(d_i = 1;\beta_0,\beta_1) = \frac{\exp(\beta_0+\beta_1 g_i)}{1+\exp(\beta_0+\beta_1 g_i)}. \tag{1}$$
Note that model (1) is designed for a prospective study
where subjects are first selected, then followed up for
disease development. However, in many GWAS, the study
design is retrospective. In a seminal article by Prentice and
Pyke [1979], the authors showed that it is valid to apply
model (1) to a case-control study as if the data were
prospectively collected and the resulting estimators of $\beta_1$
are consistent for the true values and asymptotically
normal. Because of its simplicity and the appealing
interpretation of $\exp(\beta_1)$, which approximates the relative risk
for a rare disease, model (1) has been widely used in practice
and will be used throughout this article.
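If the genotype for this given SNP is unknown, the
expectation-substitution method replaces the unknown
genotype by the dosage from the imputation, $\tilde g_i = p_{i1} + 2p_{i2}$.
In this case, model (1) becomes
$$\mu(\tilde g_i;\beta_0,\beta_1) = P(d_i = 1;\beta_0,\beta_1) = \frac{\exp(\beta_0+\beta_1 \tilde g_i)}{1+\exp(\beta_0+\beta_1 \tilde g_i)}. \tag{2}$$
The likelihood function can be written as:
$$L(\beta_0,\beta_1) = \prod_{i=1}^{n} \mu(\tilde g_i;\beta_0,\beta_1)^{d_i}\{1 - \mu(\tilde g_i;\beta_0,\beta_1)\}^{1-d_i}. \tag{3}$$
By Taylor's expansion, the maximum likelihood estimator
$(\hat\beta_0,\hat\beta_1)$ for $(\beta_0,\beta_1)$ satisfies
$$U(\beta_0,\beta_1) = n^{-1}\sum_{i=1}^{n}\begin{pmatrix} d_i - \mu(\tilde g_i;\beta_0,\beta_1) \\ \tilde g_i\{d_i - \mu(\tilde g_i;\beta_0,\beta_1)\} \end{pmatrix} = -n^{-1}\left.\frac{\partial^2 \log L}{\partial\beta\,\partial\beta^{T}}\right|_{(\beta_0^*,\beta_1^*)}\begin{pmatrix} \hat\beta_0 - \beta_0 \\ \hat\beta_1 - \beta_1 \end{pmatrix}, \tag{4}$$
where $\beta = (\beta_0,\beta_1)^{T}$ and $(\beta_0^*,\beta_1^*)$ is on the line segment joining $(\hat\beta_0,\hat\beta_1)$ and $(\beta_0,\beta_1)$.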
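Taking the expectation of $U(\beta_0,\beta_1)$ in Equation (4), we have
$$E\{U(\beta_0,\beta_1)\} = n^{-1}\sum_{i=1}^{n}\begin{pmatrix} E(d_i \mid p_{i0},p_{i1},p_{i2}) - \mu(\tilde g_i;\beta_0,\beta_1) \\ \tilde g_i\{E(d_i \mid p_{i0},p_{i1},p_{i2}) - \mu(\tilde g_i;\beta_0,\beta_1)\} \end{pmatrix}. \tag{5}$$
When $\beta_1 = 0$ (no association) or one of $p_{i0}$, $p_{i1}$, $p_{i2}$ is 1
(perfectly imputed), it is obvious that
$$E(d_i \mid p_{i0},p_{i1},p_{i2}) - \mu(\tilde g_i;\beta_0,\beta_1) = 0$$
and $\hat\beta_1$ is unbiased. Therefore, the expectation-substitution
method does not cause potential inflation in type I error
rate. On the other hand, if $\beta_1 \neq 0$ and the imputation is
imperfect, $\hat\beta_1$ from (4) is biased, which, as we show below,
could cause potential problems.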
OPTIMAL WEIGHT FOR META-ANALYSIS
WITH IMPUTED VALUES
Suppose for a given imputed SNP, the $\beta_1\ (\neq 0)$ estimate
from (4) in the $i$th study ($i = 1,\ldots,M$) is $\hat\beta_1^i$; the estimated
variance of $\hat\beta_1^i$ is $\hat V_i$; the weight for the $i$th study is $w_i$, then
the estimate for $\beta_1$ from the meta-analysis is
$$\hat\beta_1 = \frac{\sum_{i=1}^{M} w_i \hat\beta_1^i}{\sum_{i=1}^{M} w_i}.$$
Denoting $E(\hat\beta_1^i)$ by $\mu_i$, the test statistic is
$$T = \frac{\sum_{i=1}^{M} w_i \hat\beta_1^i}{\sqrt{\sum_{i=1}^{M} w_i^2 \hat V_i}}. \tag{8}$$
Based on (8), maximizing the power
to detect the association is equivalent to maximizing
$$\frac{\sum_{i=1}^{M} w_i \mu_i}{\sqrt{\sum_{i=1}^{M} w_i^2 V_i}}. \tag{9}$$
A simple derivation shows that $w_i$ needs to be proportional
to $\mu_i/V_i$ in order to maximize (9). Note that, even if the
effect size is the same across studies, $\mu_i$ may still vary
among studies because variation in imputation quality
between studies will yield a different degree of bias in the $\beta_1$
estimates. This contrasts with the directly genotyped data,
where $\mu_i = \beta_1$ for all studies, so $w_i$ needs only to be
proportional to $1/\hat V_i$. However, this optimal weight, which
incorporates both the variance and $\mu_i$, is hard to estimate in
practice because of the difficulty in estimating $\mu_i$.
Fortunately, we can show theoretically that the bias of
$\hat\beta_1$ is very small when the true $\beta_1$ is small, regardless of
the imputation quality. For example, when $\beta_0 = 0$ and
$\beta_1 = \log(1.2)$, the bias of $\hat\beta_1$ satisfies $|E(\hat\beta_1) - \beta_1| < 0.002$; when
$\beta_1 = \log(1.5)$, $|E(\hat\beta_1) - \beta_1| < 0.02$. Further theoretical details
showing the upper bound of the bias are provided in
Appendix A. The theoretical results about the approximate
unbiasedness are also verified by extensive simulations in
the Results section.
Given the approximate unbiasedness of the $\beta_1$ estimators,
the optimal weight can therefore be approximated by the
regular inverse variance weight.
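The weighting argument can be checked numerically. The sketch below evaluates the expected standardized statistic, sum(w * mu) / sqrt(sum(w^2 * V)), for the optimal weight and for the inverse variance weight. The expected estimates (written `mu` in the code) and the variances are hypothetical values chosen so that the per-study bias is small.

```python
import numpy as np

def expected_z(w, mu, v):
    """Expected standardized meta-analysis statistic for weights w."""
    w, mu, v = (np.asarray(a, dtype=float) for a in (w, mu, v))
    return float(np.sum(w * mu) / np.sqrt(np.sum(w ** 2 * v)))

# Hypothetical expected estimates and variances for two studies.
mu = np.array([0.17, 0.18])
v = np.array([0.010, 0.004])

ncp_opt = expected_z(mu / v, mu, v)   # optimal weight, proportional to mu_i / V_i
ncp_ivm = expected_z(1.0 / v, mu, v)  # inverse variance weight
print(ncp_opt, ncp_ivm)
```

By the Cauchy-Schwarz inequality no other set of weights can exceed `ncp_opt`, and when the expected estimates are nearly equal across studies the inverse variance weight loses very little.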
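INVERSE VARIANCE INCORPORATES IMPUTATION QUALITY
We have shown the inverse variance weight can
approximate the optimal weight. For imputed SNPs, it
seems natural that the weight for $\hat\beta_1$ should increase as
imputation quality increases. For this reason, we will
explore whether the IVM scheme incorporates imputation
quality. The variance of $(\hat\beta_0,\hat\beta_1)'$ can be estimated by $I^{-1}(\hat\beta_0,\hat\beta_1)$,
the inverse of the information matrix. Let
$h(g;\hat\beta_0,\hat\beta_1) = \mu(g;\hat\beta_0,\hat\beta_1)\{1 - \mu(g;\hat\beta_0,\hat\beta_1)\}$, we have
$$\widehat{\mathrm{var}}(\hat\beta_1) = \frac{\sum_{i=1}^{n} h(\tilde g_i;\hat\beta_0,\hat\beta_1)}{\sum_{i=1}^{n} h(\tilde g_i;\hat\beta_0,\hat\beta_1)\sum_{i=1}^{n} \tilde g_i^2 h(\tilde g_i;\hat\beta_0,\hat\beta_1) - \{\sum_{i=1}^{n} \tilde g_i h(\tilde g_i;\hat\beta_0,\hat\beta_1)\}^2}. \tag{10}$$
The first derivative of $h(g;\hat\beta_0,\hat\beta_1)$ with respect to $g$ is
$\hat\beta_1 h(g;\hat\beta_0,\hat\beta_1)\{1 - 2\mu(g;\hat\beta_0,\hat\beta_1)\}$, which
is approximately 0 when $\hat\beta_1$ is sufficiently small. Hence,
we can consider $h(\tilde g_i;\hat\beta_0,\hat\beta_1)$ as a constant $c$, and write
Equation (10) as
$$\mathrm{var}(\hat\beta_1) \approx (nc)^{-1}\{\widehat{\mathrm{var}}(\tilde g_i)\}^{-1} = (nc)^{-1}\{R^2\,\mathrm{var}(g_i)\}^{-1}, \tag{11}$$
where $R^2$ is the imputation quality measure in MACH [Li
et al., 2010] defined as the ratio of the sample variance of $\tilde g_i$
and the expected variance of $g_i$, which is equivalent to the
squared correlation between true and imputed genotypes.
From Equation (11), we can see that the inverse variance of
$\hat\beta_1$ is approximately proportional to the imputation quality.
Thus, we show that the current IVM scheme automatically
incorporates imputation quality in the meta-analysis.
Simulation results confirm the positive correlation between
the imputation quality and inverse variances (see Results
section).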
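The quality of this approximation can be examined with a short numerical sketch. It evaluates the variance of the slope from the information matrix, as in Equation (10), and compares it with the approximation in Equation (11). The dosages are generated by shrinking simulated genotypes toward their mean, which is only a crude stand-in for real imputation output, and all names and parameter values are hypothetical.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def var_beta1_exact(dos, b0, b1):
    """Slope variance from the inverse information matrix, as in Equation (10)."""
    mu = expit(b0 + b1 * dos)
    h = mu * (1.0 - mu)
    s0, s1, s2 = h.sum(), (dos * h).sum(), (dos ** 2 * h).sum()
    return s0 / (s0 * s2 - s1 ** 2)

def var_beta1_approx(dos, b0, b1, maf):
    """Approximation (11): 1 / (n * c * R^2 * var(g))."""
    n = dos.size
    mu_bar = expit(b0 + b1 * dos.mean())
    c = mu_bar * (1.0 - mu_bar)          # h treated as a constant
    var_g = 2.0 * maf * (1.0 - maf)      # expected genotype variance
    r2 = dos.var() / var_g               # ratio of sample to expected variance
    return 1.0 / (n * c * r2 * var_g)

rng = np.random.default_rng(0)
maf = 0.2
g = rng.binomial(2, maf, size=2000)
dos = 2 * maf + 0.8 * (g - 2 * maf)      # hypothetical shrunken dosage

for b1 in (0.0, np.log(1.2)):
    print(b1, var_beta1_exact(dos, 0.0, b1), var_beta1_approx(dos, 0.0, b1, maf))
```

When the slope is zero the two expressions coincide, and for a small slope they differ only slightly, which is the sense in which the inverse variance tracks the imputation quality.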
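Another interesting observation is that there is a
connection between the IVM scheme and the ‘‘imputation
aware’’ method in Zaitlen and Eskin [2010] through (11).
Let $z_i = \hat\beta_1^i/\hat V_i^{1/2}$ be the z-score of the $i$th study, with sample size $n_i$,
imputation quality $R_i^2$, and genotype variance $\mathrm{var}(g_i)$. Taking $c$ to be the
same across studies, the IVM estimator can be written as
$$\frac{\sum_{i} \hat\beta_1^i/\hat V_i}{\sqrt{\sum_{i} 1/\hat V_i}} \approx \frac{\sum_{i} \sqrt{n_i R_i^2\,\mathrm{var}(g_i)}\; z_i}{\sqrt{\sum_{i} n_i R_i^2\,\mathrm{var}(g_i)}} \tag{12}$$
and the ‘‘imputation aware’’ method can be written as
$$\frac{\sum_{i} \sqrt{n_i R_i^2}\; z_i}{\sqrt{\sum_{i} n_i R_i^2}}. \tag{13}$$
We can see that the only difference between (12) and (13) is
the $\mathrm{var}(g_i)$ part. Since $\mathrm{var}(g_i)$ depends on minor allele
frequency (MAF), we expect those two methods to perform
similarly when the MAFs of the SNP across studies are
similar. Generally, we do not expect the MAF to vary
much for studies with similar ethnicity. However, if
meta-analysis was conducted across different ethnic
groups [Xiong et al., 2009; Chapman et al., 2008], the
MAF variation can be substantial. In such cases, we expect
the IVM method to have better power.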
RESULTS
In this section, we first use simulation to demonstrate
the finite sample properties of $\hat\beta_1$ given by the expectation-substitution
method, such as the approximate unbiasedness
and the relationship between $\mathrm{var}(\hat\beta_1)$ and imputation
quality. Then, we compare the power of the IVM method
in the meta-analysis with various other methods.
FINITE SAMPLE PROPERTIES OF $\hat\beta_1$
We generated the genotypes
of two SNPs, considering a range of MAF combinations
$(f_1,f_2)$ of the two SNPs, and a range of values of the LD measure $D'$. To
mimic the imputation scenario, we assumed that genotypes
of the second SNP are unknown, and imputed its dosage
based on the genotypes of the first SNP. We varied the
imputation quality by changing the LD measure $D'$.
A population of 10,000 was generated based on the logistic
regression model in Equation (1) with genotypes at the
second SNP and $\beta_1$ set to 0, $\log(1.2)$, $\log(1.5)$, and $\log(2)$,
the nonzero values corresponding to odds ratios 1.2, 1.5,
and 2. Then 1,000:1,000 case-control samples were randomly
selected from this population of 10,000. We fitted
model (2) to the case-control samples with the imputed
dosage at the second SNP as $\tilde g_i$. For comparison, we also
fitted model (1) with the true genotype $g_i$. For each
parameter setting, we replicated the above procedure
10,000 times. The results are summarized in Table I.
When $\beta_1 = 0$, all the estimated type I error rates are well
controlled at the nominal $\alpha$ level 0.05. When $\beta_1 = \log(1.2)$
and $\log(1.5)$, the relative bias of $\hat\beta_1$ is very small (<2%),
regardless of the MAF of both SNPs. In contrast, when $\beta_1$ is
larger, $\log(2)$, $\hat\beta_1$ slightly underestimates the true $\beta_1$ and the
bias is greater as the imputation quality worsens. Under
the simulation settings in Table I, for any given MAF
combination $(f_1,f_2)$, $\beta_0$, $\beta_1$, and $D'$, we obtained a numeric
solution of $\beta_1^*$, where $\hat\beta_1 \to \beta_1^*$, by solving the following
system of equations:
$$E\{d_i - \mu(\tilde g_i;\beta_0^*,\beta_1^*)\} = 0, \qquad E[\tilde g_i\{d_i - \mu(\tilde g_i;\beta_0^*,\beta_1^*)\}] = 0. \tag{14}$$
TABLE I. Simulation results of the expectation-substitution method under various parameter settings based on 10,000
simulated data sets, each with 1,000 cases and 1,000 controls
[Table body not recovered from the source. Columns include $\beta_1$, $f_1$, $f_2$, $D'$, Bias (%), SE, SD, SD*, 95% CP, $R^2$, Power, and Power*.]
$\beta_1$ is the true value; $f_1$ and $f_2$ are the MAFs for SNP 1 (the marker) and 2 (the disease-causing SNP with missing genotypes), respectively;
$D'$ is the LD measure; Bias (%) is the percentage of relative bias ($100\{E(\hat\beta_1) - \beta_1\}/\beta_1$); SE and SD are standard error and standard deviation
estimates of $\hat\beta_1$ from 10,000 replicates, respectively; 95% CP is the estimated coverage probability for the 95% confidence interval; $R^2$ is the
imputation quality measure; Power is obtained at a significance level of 0.05. SD* and Power* are the counterparts of SD and Power when
fitting the model with true genotypes. MAF, minor allele frequency; SNP, single nucleotide polymorphism; LD, linkage disequilibrium.
a The percentage of relative bias $100\{E(\hat\beta_1) - \beta_1\}/\beta_1$ is not defined when $\beta_1 = 0$.
b Estimated type I error rate under the null.
Figure 1 shows that even with the worst imputation
quality in Table I (when $D' = 0.5$), the bias of $\hat\beta_1$ is still
less than 5% for an odds ratio as large as 2. Since it is
less common for the associated alleles identified by
GWAS to have an odds ratio greater than 2 [Hindorff
et al., 2009, 2011], this bias is not really problematic in
practice.
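A calculation of this kind can be sketched as follows. The code is an illustration under stated assumptions, not the authors' implementation: it takes expectations over a prospective population in Hardy-Weinberg equilibrium, builds the two-SNP haplotype frequencies from the MAFs and a positive $D'$, imputes the second SNP by its conditional mean given the first, and solves the two estimating equations by Newton-Raphson. The function name `limiting_beta` is hypothetical.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def limiting_beta(f1, f2, d_prime, beta0, beta1, n_iter=50):
    """Limiting values (beta0*, beta1*) when the dosage replaces the genotype."""
    # Haplotype frequencies (allele at SNP 1, allele at SNP 2); 1 = minor allele.
    d = d_prime * min(f1 * (1 - f2), f2 * (1 - f1))
    hap = {(1, 1): f1 * f2 + d, (1, 0): f1 * (1 - f2) - d,
           (0, 1): (1 - f1) * f2 - d, (0, 0): (1 - f1) * (1 - f2) + d}
    # Joint genotype distribution P(g1, g2) from two independent haplotypes.
    joint = np.zeros((3, 3))
    for (a1, a2), p in hap.items():
        for (c1, c2), q in hap.items():
            joint[a1 + c1, a2 + c2] += p * q
    p_g1 = joint.sum(axis=1)
    cond = joint / p_g1[:, None]              # P(g2 | g1)
    g = np.arange(3.0)
    dos = cond @ g                            # imputed dosage for each value of g1
    e_d = cond @ expit(beta0 + beta1 * g)     # E(d | g1) under the true model
    x = np.column_stack([np.ones(3), dos])
    b = np.zeros(2)
    for _ in range(n_iter):                   # Newton-Raphson on the two equations
        mu = expit(x @ b)
        score = x.T @ (p_g1 * (e_d - mu))
        info = x.T @ (x * (p_g1 * mu * (1 - mu))[:, None])
        b = b + np.linalg.solve(info, score)
    return b

b_star = limiting_beta(0.2, 0.2, 0.5, 0.0, np.log(2))
rel_bias = (b_star[1] - np.log(2)) / np.log(2)
print(f"relative bias: {100 * rel_bias:.2f}%")
```

Under these assumptions the limiting slope equals the true slope when the two SNPs are in perfect LD or when there is no association, and it falls slightly below the true slope otherwise.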
In Table I, the mean of standard errors (SE) and the
standard deviation (SD) of the estimates over 10,000
simulated data sets agree with each other very well,
suggesting that the SE estimates are reliable. Furthermore,
the SE of $\hat\beta_1$ decreases as the imputation quality $R^2$ increases;
as a result, the power (Power) increases. As a comparison,
we also show the standard deviations of parameter estimates
(SD*) and power (Power*) if the genotypes for SNP 2 are
known. As we can see, SD* is always less than SD and
Power* is always greater than Power, which implies that
there is efficiency loss using imputed genotypes. For
example, when $\beta_1 = \log(1.2)$ and $f_1 = f_2 = 0.2$, the power loss
decreases from 67% to 0.6% as the imputation quality
increases. Taken together, we can see that even with very
small $R^2$, the power is still acceptable in many cases using
imputed genotypes. The estimated coverage probabilities are
all very close to the nominal value 0.95, indicating that the
confidence interval estimates are very accurate.
Real imputation data. In order to explore the
performance of the expectation-substitution method in a
more realistic setting, we used GWAS scans from the Prostate,
Lung, Colorectal, and Ovarian Cancer Screening Trial
(PLCO) [Prorok et al., 2000; Hayes et al., 2000]. PLCO is a
randomized, two-arm trial coordinated by the NCI in 10
screening centers.
The PLCO data include 2,520 samples, genotyped on
Illumina Human Hap 300k&240k, 550k and 610k platforms.
We randomly selected 1,000 genotyped SNPs on
chromosome 22 and masked their genotypes. Then we
used MACH to impute the genotypes of the 1,000 SNPs as
if they were untyped, using HapMap II release 24 as the
reference panel. In this way, we have both the true
genotypes and the imputed dosages. Similarly, as in the
previous section, case-control samples were generated
based on model (1) using true genotypes and $\hat\beta_1$ was
estimated by fitting model (2) with imputed dosages.
We set $\beta_1$ to be 0, $\log(1.2)$, $\log(1.5)$, and $\log(2)$. For each
value of $\beta_1$, we replicated the procedure 50 times for each of
the 1,000 SNPs. Figure 2 shows a boxplot of the percent
bias of $\hat\beta_1$ of SNPs grouped by MAF and $R^2$. We can see that
$\hat\beta_1$ is approximately unbiased regardless of the imputation
quality $R^2$, which agrees with the theoretical results. On the
other hand, the variability of the estimates is much greater
when $R^2 < 0.3$ and MAF < 0.05.
PERFORMANCE OF IVM IN THE META-ANALYSIS
We generated the data in the same way as in the previous
section. Here, we let $\beta_1$ take 10 equally spaced values from
0.05 to $\log(2)$, the MAFs $(f_1,f_2)$ of the two SNPs be (0.2, 0.2) for
both studies, and the LD measure $D'$ be 0.5 and 0.99 for the two
studies, respectively. We conducted meta-analysis for the
two studies using the following four methods and
compared their power:
1. The optimal weighting, which is proportional to $\mu_i/\hat V_i$.
In practice, it is usually impossible to estimate $\mu_i$.
However, with $\beta_1$, $f_1$, $f_2$, and $D'$ known in the simulation,
we can compute $\mu_i$ from (14). Hence, we can estimate
the optimal weight for the purpose of comparison.
2. The IVM method, which is an approximation of the
optimal weighting under practical situations in GWAS.
3. The ‘‘imputation aware’’ method by Zaitlen and Eskin [2010].
4. The regular z-score meta-analysis (MetaZ) method
without correcting for imputation quality.
Fig. 1. The theoretical relative bias (%) of $\hat\beta_1$ as a function of
the true $\beta_1$. The biases are computed from (14) with different $f_1$, $f_2$,
and $\beta_1$. $\beta_0$ is fixed at 0 and $D'$ is fixed at 0.5.
Fig. 2. Boxplot of the bias of $\hat\beta_1$ of SNPs grouped by different
MAF and $R^2$ categories: (1) MAF > 0.05 and $R^2$ < 0.3; (2) MAF > 0.05 and
0.3 < $R^2$ < 0.6; (3) MAF > 0.05 and $R^2$ > 0.6; (4) MAF < 0.05 and $R^2$ < 0.3;
(5) MAF < 0.05 and 0.3 < $R^2$ < 0.6; (6) MAF < 0.05 and $R^2$ > 0.6. MAF, minor
allele frequency; SNP, single nucleotide polymorphism.
As we can see from Figure 3, the optimal weighting,
IVM and the ‘‘imputation aware’’ method have indis-
tinguishable performance. In addition, they are all more
powerful than the regular MetaZ method which does not
account for imputation quality. This confirms that the IVM
method is a good approximation of the optimal weight
and it automatically incorporates the imputation quality.
We also simulated a situation where the MAFs are
different between the two studies, which results in
different $\mathrm{var}(g_i)$. Instead of letting the MAF = 0.2 for both
studies, we let the MAF = 0.1 for the first study and 0.4 for
the second study. The power comparison is shown in
Figure 4. As we expected, the IVM method has better
performance than the ‘‘imputation aware’’ method in this
case because it is an approximation to the optimal weight.
In practice, we would not expect MAFs to differ substantially
for studies of similar populations. However, for a cross-ethnicity
meta-analysis, the IVM is superior to the
‘‘imputation aware’’ method since it accounts for the
MAF variation among different ethnic groups.
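The effect of differing MAFs can be illustrated with an approximate analytic power calculation in place of simulation. The sketch assumes unbiased estimates with a common effect size, variances given by the approximation in Equation (11) with a common constant c, and a normally distributed test statistic. The sample sizes, imputation qualities, and effect size are hypothetical and are not the settings behind Figure 4.

```python
from math import log, sqrt
from statistics import NormalDist

def ncp(weights, lam):
    """Mean of the combined z-score sum(w * z) / sqrt(sum(w^2))."""
    return sum(w * l for w, l in zip(weights, lam)) / sqrt(sum(w * w for w in weights))

def power(ncp_value, alpha=5e-8):
    """Two-sided power of a unit-variance normal statistic with the given mean."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return (1 - nd.cdf(z_crit - ncp_value)) + nd.cdf(-z_crit - ncp_value)

# Hypothetical settings for two studies with different MAFs.
beta, c = log(1.3), 0.25
n = [2000, 2000]
r2 = [0.3, 0.95]
maf = [0.1, 0.4]

var_g = [2 * f * (1 - f) for f in maf]
inv_v = [ni * c * r * v for ni, r, v in zip(n, r2, var_g)]  # 1 / V_i from (11)
lam = [beta * sqrt(iv) for iv in inv_v]                     # expected z-scores

weights = {
    "IVM": [sqrt(iv) for iv in inv_v],
    "imputation aware": [sqrt(ni * r) for ni, r in zip(n, r2)],
    "MetaZ": [sqrt(ni) for ni in n],
}
powers = {name: power(ncp(w, lam)) for name, w in weights.items()}
for name, value in powers.items():
    print(f"{name}: {value:.3f}")
```

If the two MAFs are set equal, the IVM and imputation-aware weights become proportional and the two methods give the same power under these assumptions.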
DISCUSSION
As imputation has been widely used to recover
information from GWAS data, the expectation-substitution
method is the most commonly used method to analyze
imputed SNPs while accounting for genotype uncertainty.
Our work shows, both numerically and theoretically, that
the expectation-substitution method gives approximately
unbiased estimates under practical conditions of low effect
sizes for GWAS studies of common diseases. We also show
that the IVM scheme approximates the optimal weight
well and always has the best power among different meta-
analysis methods compared.
Two recent articles have outlined the advantages of
using meta-analysis, and discussed study design, quality
control, and analysis issues to consider when implement-
ing meta-analysis of GWAS data [Cantor et al., 2010;
Zeggini and Ioannidis, 2009]. These articles address
weighting schemes for combining results, but focus more
on random-effects vs. fixed-effects analysis, rather than on
methods to include imputation quality.
The different imputation software packages provide
information not only on the probability of each genotype
but also an overall imputation quality measure. This
measure is typically defined as the ratio of the sample
variance of the genotype to the expected variance, with
lower scores indicating less well-imputed SNPs. Studies
often exclude SNPs with either low $R^2$ or low MAF.
A threshold of imputation $R^2 = 0.3$ has been recommended
by MACH as the imputation quality cut-off for estimates
[MACH Homepage]. Our results show that in terms of
bias, the combination of imputation quality and MAF
seems to be most relevant. In particular, we show that the
variability of estimates is large for lower imputation
quality and lower MAF. In current practice, rare variants
(MAF < 0.05) are often excluded from imputation and
subsequent meta-analysis. In this situation, either not
using a filter, or using a filter based only on $R^2$, is likely
sufficient. However, as meta-analyses grow larger and
data become available to impute rare variants, we
recommend using both the imputation quality and the
MAF to set filtering criteria. For example, in our
simulation results (Fig. 2), the optimal filter appears to
be excluding SNPs with both MAF < 0.05 and $R^2 < 0.3$,
rather than all SNPs with $R^2 < 0.3$. In this article, we used
the imputation quality measure $R^2$ defined by MACH [Li
et al., 2010], which is the squared correlation between true
genotypes and imputed dosages. In Beagle [Browning and
Browning, 2009], $R^2$ is defined as the squared correlation
between true and the most likely genotypes. To investigate
whether the choice of different quality measures makes
much difference, we randomly chose 10,000 imputed
SNPs on chromosome 22 in the PLCO data [Prorok et al.,
2000; Hayes et al., 2000] and computed their MACH $R^2$
and Beagle $R^2$. It turns out that the two $R^2$'s are highly
correlated ($r > 0.99$). Thus, although the cut-offs for the two
$R^2$'s could be slightly different, the general conclusion
should still hold.
Fig. 3. The power of optimal weighting (optimal), the IVM method, the
‘‘imputation aware’’ method (‘‘Imputation Aware’’ Z), and the
regular z-score meta-analysis (MetaZ) from the meta-analysis of two studies. The MAFs of
the disease-causing SNP in both studies are 0.2. A commonly
used genome-wide P-value cut-off of $5\times10^{-8}$ is used as the
significance level. IVM, inverse variance weighting; MAF,
minor allele frequency; SNP, single nucleotide polymorphism.
Fig. 4. The power of optimal weighting (optimal), the IVM method, the
‘‘imputation aware’’ method (‘‘Imputation Aware’’ Z), and the
regular z-score meta-analysis (MetaZ) from the meta-analysis of two studies. The MAFs of
the disease-causing SNP in the two studies are 0.1 and 0.4,
respectively. A commonly used genome-wide P-value cut-off of
$5\times10^{-8}$ is used as the significance level. MAF, minor allele
frequency; IVM, inverse variance weighting; SNP, single
nucleotide polymorphism.
As we move into the post-GWAS era, our results
provide important guidance for investigators on how to
optimally conduct meta-analysis in the presence of
imputed genotypes for marginal SNP associations. We
support the current practice of using the expectation-substitution
method and the IVM in meta-analysis.
Additional theoretical and numerical work is needed to
evaluate the use of imputed data in more sophisticated
analysis, including proposed methods for gene-gene and
gene-environment interactions.
ACKNOWLEDGMENTS
We thank two reviewers for their helpful comments.
This work was supported by the National Institutes of
Health (5R01 CA059045 and 5U01 CA137088,
R01AG14358, P01CA53996). Genotype data included in
these analyses from the Prostate, Lung, Colorectal, and
Ovarian (PLCO) Cancer Screening Trial were supported by
the Intramural Research Program of the Division of Cancer
Epidemiology and Genetics and supported by contracts
from the Division of Cancer Prevention, National Cancer
Institute, National Institutes of Health, Department of
Health and Human Services. The authors thank
Drs. Christine Berg and Philip Prorok, Division of Cancer
Prevention, National Cancer Institute, the Screening
Center investigators and staff of the Prostate, Lung,
Colorectal, and Ovarian (PLCO) Cancer Screening Trial,
Mr. Tom Riley and staff, Information Management
Services, Inc., Ms. Barbara O’Brien and staff, Westat, Inc.,
and Drs. Bill Kopp, Wen Shao, and staff, SAIC-Frederick.
Most importantly, we acknowledge the study participants
for their contributions to making this study possible.
Data included in these analyses were also generated
from the GWAS of Lung Cancer and Smoking. Funding for
this work was provided through the National Institutes of
Health Genes, Environment and Health Initiative [NIH
GEI] (Z01 CP 010200). The human subjects participating in
the GWAS were from The Environment and Genetics in
Lung Cancer Etiology (EAGLE) case-control study and the
Prostate, Lung, Colon and Ovarian Screening Trial, and
these studies are supported by intramural resources of the
National Cancer Institute. Assistance with genotype
cleaning, as well as with general study coordination, was
provided by the Gene Environment Association Studies,
GENEVA Coordinating Center (U01 HG004446). Assistance
with data cleaning was provided by the National
Center for Biotechnology Information. Funding support
for genotyping, which was performed at the Johns
Hopkins University Center for Inherited Disease Research,
was provided by the NIH GEI (U01 HG 004438). The data
sets used for the analyses described in this manuscript were
obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap
through dbGaP accession number ph000093 v2.p2.c1.
In addition, data generated from the Cancer Genetic
Markers of Susceptibility (CGEMS) [CGEMS] prostate
cancer scan were also included in this analysis. The data
sets used for the analyses described in this manuscript were
accessed with appropriate approval through dbGaP,
accession number 000207 v.1p1.c1.
Barrett JC, Cardon LR. 2006. Evaluating coverage of genome-wide
association studies. Nat Genet 38:659–662.
Browning BL, Browning SR. 2009. A unified approach to genotype
imputation and haplotype phase inference for large data sets of
trios and unrelated individuals. Am J Hum Genet 84:210–223.
Cancer Genetic Markers of Susceptibility (CGEMS) Data. 2009. http://
cgems.cancer.gov/data/. May 10, 2009.
Cantor RM, Lange K, Sinsheimer JS. 2010. Prioritizing GWAS results:
a review of statistical methods and recommendations for their
application. Am J Hum Genet 86:6–22.
Chapman K, Takahashi A, Meulenbelt I, Rodriguez J, Egli R, Tsezou A,
Malizos KN, Kloppenburg M, Southam L, Breggen R, Donn R,
Qin J, Doherty M, Slagboom PE, Wallis G, Kamatani N, Jiang Q,
Gonzalez A, Loughlin J, Ikegawa S. 2008. A meta-analysis of
European and Asian cohorts reveals a global role of a functional
SNP in the 50UTR of GDF5 with osteoarthritis susceptibility. Hum
Mol Genet 17:1497–1504.
Cordell HJ. 2006.Estimation
haplotype effects in case-control studies: comparison of weighted
regression and multiple imputation procedures. Genet Epidemiol
de Bakker PIW, Ferreira MAR, Jia X, Neale BM, Raychaudhuri S,
Voight BF. 2008. Practical aspects of imputation-driven meta-
analysis of genome-wide association studies. Hum Mol Genet 17:
Hayes RB, Reding D, Kopp W, Subar AF, Bhat N, Rothman N,
Caporaso N, Ziegler RG, Johnson CC, Weissfeld JL, Hoover RN,
Hartge P, Palace C, Gohagan JK, Prostate, Lung, Colorectal and
Ovarian Cancer Screening Trial Project Team. 2000. Etiologic and
early marker studies in the prostate, lung, colorectal and ovarian
(PLCO) cancer screening trial. Control Clin Trials 21:349S–355S.
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP,
Collins FS, Manolio TA. 2009. Potential etiologic and functional
implications of genome-wide association loci for human diseases
and traits. Proc Natl Acad Sci USA 106:9362–9367.
Hindorff LA, Junkins HA, Hall PN, Mehta JP, Manolio TA. 2011.
A catalog of published genome-wide association studies. Available
at: www.genome.gov/gwastudies. Accessed March 29.
Kraft P, Stram OD. 2007. RE: the use of inferred haplotypes in
downstream analysis. Am J Hum Genet 81:863–865.
Kraft P, Cox DG, Paynter RA, Hunter D, De Vivo I. 2005. Accounting
for haplotype uncertainty in matched association studies: a
comparison of simple and flexible techniques. Genet Epidemiol
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. 2010. MaCH: using
sequence and genotype data to estimate haplotypes and
unobserved genotypes. Genet Epidemiol 34:816–834.
Lin DY, Huang BE. 2007. The use of inferred haplotypes in
downstream analyses. Am J Hum Genet 80:577–579.
Marchini J, Howie B, Myers S, McVean G, Donnelly P. 2007. A new
multipoint method for genome-wide association studies via
imputation of genotypes. Nat Genet 39:906–913.
Prentice RL, Pyke R. 1979. Logistic disease incidence models and case-
control studies. Biometrika 66:403–411.
Prorok PC, Andriole GL, Bresalier RS, Buys SS, Chia D, Crawford ED,
Fogel R, Gelmann EP, Gilbert F, Hasson MA, Hayes RB,
Johnson CC, Mandel JS, Oberman A, O’Brien B, Oken MM,
Rafla S, Reding D, Rutt W, Weissfeld JL, Yokochi L, Gohagan JK,
Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial
Project Team. 2000. Design of the prostate, lung, colorectal and
ovarian (PLCO) cancer screening trial. Control Clin Trials 21:
Servin B, Stephens M. 2007. Imputation-based analysis of association
studies: candidate regions and quantitative traits. PLoS Genet 3:
Soranzo N, Rivadeneira F, Chinappen-Horsley U, et al. 2009. Meta-
analysis of genome-wide scans for human adult stature identifies
novel loci and associations with measures of skeletal frame size.
PLoS Genet 5:e1000445.
The International HapMap Consortium. 2005. A haplotype map of the
human genome. Nature 437:1299–1320.
The International HapMap Consortium. 2007. A second generation
human haplotype map of over 3.1 million SNPs. Nature 449:
The 1000 Genomes Project Consortium. 2010. A map of human
genome variation from population-scale sequencing. Nature 467:
Willer CJ, Speliotes EK, Loos RJ, Li S, Lindgren CM, Heid IM,
Berndt SI, Elliott AL, Jackson AU, Lamina C, Lettre G, Lim N,
Lyon HN, McCarroll SA, Papadakis K, Qi L, Randall JC,
Roccasecca RM, Sanna S, Scheet P, Weedon MN, Wheeler E,
Zhao JH, Jacobs LC, Prokopenko I, Soranzo N, Tanaka T,
Timpson NJ, Almgren P, Bennett A, Bergman RN, Bingham SA,
Bonnycastle LL, Brown M, Burtt NP, Chines P, Coin L, Collins FS,
Connell JM, Cooper C, Smith GD, Dennison EM, Deodhar P,
Elliott P, Erdos MR, Estrada K, Evans DM, Gianniny L, Gieger C,
Gillson CJ, Guiducci C, Hackett R, Hadley D, Hall AS,
Havulinna AS, Hebebrand J, Hofman A, Isomaa B, Jacobs KB,
Johnson T, Jousilahti P, Jovanovic Z, Khaw KT, Kraft P,
Kuokkanen M, Kuusisto J, Laitinen J, Lakatta EG, Luan J,
Luben RN, Mangino M, McArdle WL, Meitinger T, Mulas A,
Munroe PB, Narisu N, Ness AR, Northstone K, O’Rahilly S,
Purmann C, Rees MG, Ridderstråle M, Ring SM, Rivadeneira F,
Ruokonen A, Sandhu MS, Saramies J, Scott LJ, Scuteri A,
Stringham HM, Tung YC,
Watanabe RM, Waterworth DM, Watkins N, Wellcome Trust Case
Zillikens MC, Altshuler D, Caulfield MJ, Chanock SJ, Farooqi IS,
Ferrucci L, Guralnik JM, Hattersley AT, Hu FB, Jarvelin MR,
Laakso M, Mooser V, Ong KK, Ouwehand WH, Salomaa V,
Samani NJ, Spector TD, Tuomi T, Tuomilehto J, Uda M,
Uitterlinden AG, Wareham NJ, Deloukas P, Frayling TM,
Groop LC, Hayes RB, Hunter DJ, Mohlke KL, Peltonen L,
Schlessinger D, Strachan DP, Wichmann HE, McCarthy MI,
Boehnke M, Barroso I, Abecasis GR, Hirschhorn JN, Genetic
Investigation of ANthropometric Traits Consortium. 2008. Six new
loci associated with body mass index highlight a neuronal
influence on body weight regulation. Nat Genet 41:25–34.
Xiong DH, Liu XG, Guo YF, Tan LJ, Wang L, Sha BY, Tang ZH, Pan F,
Yang TL, Chen XD, Lei SF, Yerges LM, Zhu XZ, Wheeler VW,
Patrick AL, Bunker CH, Guo Y, Yan H, Pei YF, Zhang YP, Levy S,
Papasian CJ, Xiao P, Lundberg YW, Recker RR, Liu YZ, Liu YJ,
Zmuda JM, Deng HW. 2009. Genome-wide association and follow-
up replication studies identified ADAMTS18 and TGFBR3 as bone
mass candidate genes in different ethnic groups. Am J Hum Genet
Zaitlen N, Eskin E. 2010. Imputation aware meta-analysis of genome-
wide association studies. Genet Epidemiol 34:537–542.
Zeggini E, Ioannidis JP. 2009. Meta-analysis in genome-wide
association studies. Pharmacogenomics 10:191–201.
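First, we introduce some notation. Let
$\mu(\bar g_i;\beta_0,\beta_1) = \exp(\beta_0+\beta_1\bar g_i)/(1+\exp(\beta_0+\beta_1\bar g_i))$.
For convenience, we will interchangeably use the notation
$\mu(\cdot;\beta_0,\beta_1)$ and $\mu(\cdot)$ in the Appendix. Denote the first
derivative of $\mu(g)$ with respect to $g$ by $\mu'(g)$. For an interval
$[x_1,x_2]$, let $U[x_1,x_2] = \sup_{g\in[x_1,x_2]}\mu'(g)$ and
$L[x_1,x_2] = \inf_{g\in[x_1,x_2]}\mu'(g)$, and define
$\Delta_U[x_1,x_2] = |U[x_1,x_2] - \{\mu(x_2)-\mu(x_1)\}/(x_2-x_1)|$;
$\Delta_L[x_1,x_2] = |L[x_1,x_2] - \{\mu(x_2)-\mu(x_1)\}/(x_2-x_1)|$.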
Lemma 1 shows that the extrema of
$|E(d_i\mid p_{i0},p_{i1},p_{i2}) - \mu(\bar g_i;\beta_0,\beta_1)|$ can only be
achieved on the boundary. It also computes the extrema of
$|E(d_i\mid p_{i0},p_{i1},p_{i2}) - \mu(\bar g_i;\beta_0,\beta_1)|$ on each
boundary condition and chooses the maximum one as the upper bound for
$|E(d_i\mid p_{i0},p_{i1},p_{i2}) - \mu(\bar g_i;\beta_0,\beta_1)|$.
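Lemma 2 shows that there exists some $\delta(\beta_0,\beta_1)$ (which
depends on the upper bound given by Lemma 1), such that
when $\tilde\beta_1 \ge \beta_1+\delta(\beta_0,\beta_1)$,
$\mu(\bar g_i;\beta_0,\tilde\beta_1) - E(d_i\mid p_{i0},p_{i1},p_{i2}) > 0$ for
any $\bar g_i\in[0,2]$; and when $\tilde\beta_1 \le \beta_1-\delta(\beta_0,\beta_1)$,
$\mu(\bar g_i;\beta_0,\tilde\beta_1) - E(d_i\mid p_{i0},p_{i1},p_{i2}) < 0$ for
any $\bar g_i\in[0,2]$. As a result, $\beta_1^*$, the root of the score equation
$\sum_{i=1}^{n}\bar g_i(\mu(\bar g_i;\beta_0,\tilde\beta_1) - E(d_i\mid p_{i0},p_{i1},p_{i2})) = 0$,
lies between $\beta_1-\delta(\beta_0,\beta_1)$ and $\beta_1+\delta(\beta_0,\beta_1)$.
Given that $\hat\beta_1 \to \beta_1^*$, the theorem is proved.
Lemma 1.
$|E(d_i\mid p_{i0},p_{i1},p_{i2}) - \mu(\bar g_i;\beta_0,\beta_1)| \le M(\beta_0,\beta_1)\min(\bar g_i, 2-\bar g_i)$,
where
$M(\beta_0,\beta_1) = \max(\Delta_U[0,2],\Delta_L[0,2],\Delta_U[0,1],\Delta_L[0,1],\Delta_U[1,2],\Delta_L[1,2])$.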
Proof. We can rewrite $E(d_i\mid p_{i0},p_{i1},p_{i2}) - \mu(\bar g_i)$ in terms of
$p_{i0}$ and $\bar g_i$ by following the constraints $p_{i0}+p_{i1}+p_{i2} = 1$ and
$p_{i1}+2p_{i2} = \bar g_i$. This gives
$$f(\bar g_i,p_{i0}) = E(d_i\mid p_{i0},p_{i1},p_{i2}) - \mu(\bar g_i)
= p_{i0}\mu(0) + (2-2p_{i0}-\bar g_i)\mu(1) + (p_{i0}+\bar g_i-1)\mu(2) - \mu(\bar g_i).$$
The extrema of $f(\bar g_i,p_{i0})$ occur when the derivative equals 0
or at the boundary. Taking the first derivative of $f(\bar g_i,p_{i0})$
w.r.t. $p_{i0}$, we can see that there is no solution for the
derivative equaling 0. So, the extrema can only occur at the
boundary: $p_{i0} = 1-\bar g_i/2$ or $p_{i0} = 1-\bar g_i$ or $p_{i0} = 0$. We can
calculate the extrema for each boundary condition.
When $p_{i0} = 1-\bar g_i/2$,
$f(\bar g_i,p_{i0}) = (1-\bar g_i/2)\mu(0) + (\bar g_i/2)\mu(2) - \mu(\bar g_i)$.
We can see that the value of $\mu(\bar g_i)$ is between
$[\mu(0)+L[0,2]\,\bar g_i,\ \mu(0)+U[0,2]\,\bar g_i]$ and also
$[\mu(2)-U[0,2](2-\bar g_i),\ \mu(2)-L[0,2](2-\bar g_i)]$. Plugging the upper
and lower bounds of $\mu(\bar g_i)$ into $f(\bar g_i,p_{i0})$, we have
$|f(\bar g_i,p_{i0})| < \max(\Delta_U[0,2],\Delta_L[0,2])\min(\bar g_i, 2-\bar g_i)$.
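Similarly, we can show that when $p_{i0} = 1-\bar g_i$,
$|f(\bar g_i,p_{i0})| < \max(\Delta_U[0,1],\Delta_L[0,1])\min(\bar g_i, 2-\bar g_i)$;
and when $p_{i0} = 0$,
$|f(\bar g_i,p_{i1})| < \max(\Delta_U[1,2],\Delta_L[1,2])\min(\bar g_i, 2-\bar g_i)$.
Combining all the results above, we have
$$|E(d_i\mid p_{i0},p_{i1},p_{i2}) - \mu(\bar g_i;\beta_0,\beta_1)| \le M(\beta_0,\beta_1)\min(\bar g_i, 2-\bar g_i). \qquad (A1)$$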
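The bound in Lemma 1 can be checked numerically. The following is only a minimal sketch, assuming the genotype probabilities are drawn uniformly from the probability simplex and that the supremum and infimum of the derivative are approximated on a grid; the function names (`mu`, `deltas`, `bound_constant`, `check_lemma1`) and the choice of $\beta_0 = 0$, $\beta_1 = \log(1.5)$ are illustrative assumptions, not part of the original analysis.

```python
import math
import random


def mu(g, b0, b1):
    """Logistic mean exp(b0 + b1*g) / (1 + exp(b0 + b1*g))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))


def deltas(x1, x2, b0, b1, n_grid=2001):
    """Return (Delta_U, Delta_L) on [x1, x2]; sup/inf of mu' taken over a grid."""
    grid = [x1 + (x2 - x1) * k / (n_grid - 1) for k in range(n_grid)]
    slopes = [b1 * mu(g, b0, b1) * (1.0 - mu(g, b0, b1)) for g in grid]
    secant = (mu(x2, b0, b1) - mu(x1, b0, b1)) / (x2 - x1)
    return abs(max(slopes) - secant), abs(min(slopes) - secant)


def bound_constant(b0, b1):
    """M(b0, b1): largest Delta over the intervals [0,2], [0,1] and [1,2]."""
    intervals = [(0.0, 2.0), (0.0, 1.0), (1.0, 2.0)]
    return max(d for a, b in intervals for d in deltas(a, b, b0, b1))


def check_lemma1(b0, b1, n_draws=10000, seed=0):
    """Count random genotype distributions that violate the Lemma 1 bound."""
    rng = random.Random(seed)
    m_const = bound_constant(b0, b1)
    violations = 0
    for _ in range(n_draws):
        # Uniform draw from the simplex via two sorted uniforms (an assumption).
        u, v = sorted((rng.random(), rng.random()))
        p0, p1, p2 = u, v - u, 1.0 - v
        g_bar = p1 + 2.0 * p2
        expected_d = (p0 * mu(0.0, b0, b1) + p1 * mu(1.0, b0, b1)
                      + p2 * mu(2.0, b0, b1))
        gap = abs(expected_d - mu(g_bar, b0, b1))
        # Small tolerance guards against floating-point rounding near g_bar = 0 or 2.
        if gap > m_const * min(g_bar, 2.0 - g_bar) + 1e-12:
            violations += 1
    return violations


print(check_lemma1(0.0, math.log(1.5)))
```

A count of zero violations is what the lemma predicts; a nonzero count would point to a grid that is too coarse for the chosen parameters rather than to a failure of the argument.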
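Lemma 2. Let
$\delta(\beta_0,\beta_1) = \sup_{\bar g_i\in[0,2]} M(\beta_0,\beta_1)/[(1-\mu(\bar g_i;\beta_0,\beta_1))\mu(\bar g_i;\beta_0,\beta_1)]$.
When $\tilde\beta_1 \ge \beta_1+\delta(\beta_0,\beta_1)$,
$\mu(\bar g_i;\beta_0,\tilde\beta_1) - E(d_i\mid p_{i0},p_{i1},p_{i2}) > 0$ for any
$\bar g_i\in[0,2]$; and when $\tilde\beta_1 \le \beta_1-\delta(\beta_0,\beta_1)$,
$\mu(\bar g_i;\beta_0,\tilde\beta_1) - E(d_i\mid p_{i0},p_{i1},p_{i2}) < 0$ for any
$\bar g_i\in[0,2]$.
Proof. Consider the following equation of $\tilde\beta_1$: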
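$$\mu(\bar g_i;\beta_0,\tilde\beta_1) - E(d_i\mid p_{i0},p_{i1},p_{i2}) = 0.$$
The root for this equation would be
$$\tilde\beta_1 = \frac{-\log\left(E(d_i\mid p_{i0},p_{i1},p_{i2})^{-1} - 1\right) - \beta_0}{\bar g_i}.$$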
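Denote $E(d_i\mid p_{i0},p_{i1},p_{i2}) - \mu(\bar g_i)$ by $\Delta_i$. When $\Delta_i$ is small,
$-\log(E(d_i\mid p_{i0},p_{i1},p_{i2})^{-1} - 1)$ can be
approximated by $-\log(\mu(\bar g_i)^{-1} - 1) + \Delta_i/[(1-\mu(\bar g_i))\mu(\bar g_i)]$
following a first-order Taylor expansion, so that $\tilde\beta_1$ is approximately
$\beta_1 + (\Delta_i/\bar g_i)/[(1-\mu(\bar g_i))\mu(\bar g_i)]$. From Equation (A1),
$|\Delta_i| \le M(\beta_0,\beta_1)\min(\bar g_i, 2-\bar g_i)$. It follows that
$|\tilde\beta_1-\beta_1| < M(\beta_0,\beta_1)/[(1-\mu(\bar g_i))\mu(\bar g_i)]$.
Let $\delta(\beta_0,\beta_1) = \sup_{\bar g_i\in[0,2]} M(\beta_0,\beta_1)/[(1-\mu(\bar g_i))\mu(\bar g_i)]$.
As $\mu(\bar g_i;\beta_0,\tilde\beta_1)$ is an increasing function
of $\tilde\beta_1$, combining with the fact that the root satisfies
$|\tilde\beta_1-\beta_1| < \delta(\beta_0,\beta_1)$,
we can see that when $\tilde\beta_1 \ge \beta_1+\delta(\beta_0,\beta_1)$,
$\mu(\bar g_i;\beta_0,\tilde\beta_1) - E(d_i\mid p_{i0},p_{i1},p_{i2}) > 0$ for any
$\bar g_i\in[0,2]$; and when $\tilde\beta_1 \le \beta_1-\delta(\beta_0,\beta_1)$,
$\mu(\bar g_i;\beta_0,\tilde\beta_1) - E(d_i\mid p_{i0},p_{i1},p_{i2}) < 0$ for any
$\bar g_i\in[0,2]$.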
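Theorem. $|E(\hat\beta_1) - \beta_1| < \delta(\beta_0,\beta_1)$.
Proof. $\beta_1^*$ is the root of the equation of $\tilde\beta_1$:
$$\sum_{i=1}^{n}\bar g_i\{\mu(\bar g_i;\beta_0,\tilde\beta_1) - E(d_i\mid p_{i0},p_{i1},p_{i2})\} = 0. \qquad (A4)$$
Applying Lemma 2, when $\tilde\beta_1 \ge \beta_1+\delta(\beta_0,\beta_1)$, the LHS of
Equation (A4) will be positive; when $\tilde\beta_1 \le \beta_1-\delta(\beta_0,\beta_1)$, the
LHS of Equation (A4) will be negative. As the LHS of
Equation (A4) is also an increasing function of $\tilde\beta_1$, the
root $\beta_1^*$ lies in
$[\beta_1-\delta(\beta_0,\beta_1),\ \beta_1+\delta(\beta_0,\beta_1)]$. Given that
$\hat\beta_1 \to \beta_1^*$, we have
$|E(\hat\beta_1) - \beta_1| < \delta(\beta_0,\beta_1)$.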
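The argument of the Theorem can be illustrated on simulated data. The sketch below is an illustration under stated assumptions, not the authors' simulation design: genotype probabilities are drawn uniformly from the simplex, $\beta_0$ is treated as known, the supremum in $\delta(\beta_0,\beta_1)$ is approximated on a grid, and the root of the score equation is found by bisection. All function names are hypothetical.

```python
import math
import random


def mu(g, b0, b1):
    """Logistic mean exp(b0 + b1*g) / (1 + exp(b0 + b1*g))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))


def delta_bound(b0, b1, n_grid=2001):
    """delta = sup_g M / [(1 - mu(g)) mu(g)], with sup/inf taken over a grid."""
    def deltas(x1, x2):
        grid = [x1 + (x2 - x1) * k / (n_grid - 1) for k in range(n_grid)]
        slopes = [b1 * mu(g, b0, b1) * (1.0 - mu(g, b0, b1)) for g in grid]
        secant = (mu(x2, b0, b1) - mu(x1, b0, b1)) / (x2 - x1)
        return abs(max(slopes) - secant), abs(min(slopes) - secant)

    intervals = [(0.0, 2.0), (0.0, 1.0), (1.0, 2.0)]
    m_const = max(d for a, b in intervals for d in deltas(a, b))
    grid = [2.0 * k / (n_grid - 1) for k in range(n_grid)]
    return max(m_const / ((1.0 - mu(g, b0, b1)) * mu(g, b0, b1)) for g in grid)


def score(b, b0, g_bars, expected_d):
    """LHS of the score equation: sum_i g_i * (mu(g_i; b0, b) - E(d_i | p_i))."""
    return sum(g * (mu(g, b0, b) - e) for g, e in zip(g_bars, expected_d))


def solve_root(b0, g_bars, expected_d, lo=-5.0, hi=5.0, tol=1e-10):
    """Bisection; valid because the score is increasing in b."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, b0, g_bars, expected_d) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def simulate_root(b0, b1, n=500, seed=0):
    """Draw genotype probabilities, then solve for the root beta_1^*."""
    rng = random.Random(seed)
    g_bars, expected_d = [], []
    for _ in range(n):
        u, v = sorted((rng.random(), rng.random()))
        p0, p1, p2 = u, v - u, 1.0 - v
        g_bars.append(p1 + 2.0 * p2)
        expected_d.append(p0 * mu(0.0, b0, b1) + p1 * mu(1.0, b0, b1)
                          + p2 * mu(2.0, b0, b1))
    return solve_root(b0, g_bars, expected_d)


b0, b1 = 0.0, math.log(1.5)
root = simulate_root(b0, b1)
print(abs(root - b1), delta_bound(b0, b1))
```

The first printed number is the distance between the root and the true $\beta_1$, and the second is the bound; the Theorem implies the first should be the smaller of the two. Uniform draws from the simplex give much more imputation uncertainty than is typical in practice, so this is a pessimistic setting.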
To show the magnitude of $\delta(\beta_0,\beta_1)$, which is the
upper bound of the bias of $\hat\beta_1$, we tried a few different
values of $\beta_1$. For example, when $\beta_0 = 0$, $\beta_1 = \log(1.2)$,
$\delta(\beta_0,\beta_1) = \Delta_U[0,2]/[(1-\mu(2;\beta_0,\beta_1))\mu(2;\beta_0,\beta_1)] = 0.002$;
when $\beta_1 = \log(1.5)$,
$\delta(\beta_0,\beta_1) = \Delta_U[0,2]/[(1-\mu(2;\beta_0,\beta_1))\mu(2;\beta_0,\beta_1)] = 0.02$.
Those upper bounds of bias have also
been confirmed by the simulation studies.
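The two numerical values quoted above can be reproduced directly from the stated expression. The following is a small sketch that evaluates $\Delta_U[0,2]/[(1-\mu(2))\mu(2)]$ with $\beta_0 = 0$; the function name `delta_example` and the grid approximation of the supremum are assumptions made for illustration.

```python
import math


def mu(g, b0, b1):
    """Logistic mean exp(b0 + b1*g) / (1 + exp(b0 + b1*g))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))


def delta_example(b0, b1, n_grid=2001):
    """Evaluate Delta_U[0,2] / [(1 - mu(2)) * mu(2)] as quoted in the text."""
    grid = [2.0 * k / (n_grid - 1) for k in range(n_grid)]
    # Supremum of mu'(g) = b1 * mu(g) * (1 - mu(g)) over [0, 2], on a grid.
    sup_slope = max(b1 * mu(g, b0, b1) * (1.0 - mu(g, b0, b1)) for g in grid)
    # Slope of the secant line joining (0, mu(0)) and (2, mu(2)).
    secant = (mu(2.0, b0, b1) - mu(0.0, b0, b1)) / 2.0
    m2 = mu(2.0, b0, b1)
    return abs(sup_slope - secant) / ((1.0 - m2) * m2)


for odds_ratio in (1.2, 1.5):
    print(odds_ratio, round(delta_example(0.0, math.log(odds_ratio)), 4))
```

With these inputs the function returns roughly 0.0021 for an odds ratio of 1.2 and roughly 0.0245 for an odds ratio of 1.5, in line with the rounded values of 0.002 and 0.02 given in the text.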