How to deal with the early GWAS data when imputing
and combining different arrays is necessary
Hae-Won Uh*,1,2, Joris Deelen2,3, Marian Beekman3, Quinta Helmer1, Fernando Rivadeneira2,4,5,
Jouke-Jan Hottenga6, Dorret I Boomsma6, Albert Hofman2,4,5, André G Uitterlinden2,4,5, PE Slagboom2,3,
Stefan Böhringer1 and Jeanine J Houwing-Duistermaat1
Genotype imputation has become an essential tool in the analysis of genome-wide association scans. This technique allows
investigators to test association at ungenotyped genetic markers, and to combine results across studies that rely on different
genotyping platforms. In addition, imputation is used within long-running studies to reuse genotypes produced across
generations of platforms. Typically, genotypes of controls are reused and cases are genotyped on more novel platforms yielding a
case–control study that is not matched for genotyping platforms. In this study, we scrutinize such a situation and validate GWAS
results by actually retyping top-ranking SNPs with the Sequenom MassArray platform. We discuss the needed quality controls
(QCs). In doing so, we report a considerable discrepancy between the results from imputed and retyped data when applying
recommended QCs from the literature. These discrepancies appear to be caused by extrapolating differences between arrays by
the process of imputation. To avoid false positive results, we recommend that more stringent QCs should be applied. We also
advocate reporting the imputation quality measure (RT2) for the post-imputation QCs in publications.
European Journal of Human Genetics (2012) 20, 572–576; doi:10.1038/ejhg.2011.231; published online 21 December 2011
Keywords: GWAS; imputation; quality control
Imputation-based association methods provide a powerful framework
for testing ungenotyped variants for association with phenotypes.
Genotype imputation is particularly useful for combining results
across studies that use different genotyping platforms, because a
meta-analysis of several studies with relatively modest ﬁndings can
result in a number of strongly associated loci that were not previously
indicated. Many successes of such meta-analysis have been reported.1,2
Here, we consider the use of imputation to pool subjects genotyped
with different platforms within studies. For example, when the data of
control groups such as the Wellcome Trust Case Control Consortium3
are reused, the cases are typically not matched regarding genotyping
platforms or arrays.4 Another example concerns combining expression
quantitative trait loci studies with data being generated at very
different time points from different platforms, thereby requiring
genotype imputation.5 Although reusing such existing data seems to
be an efﬁcient approach, it may increase chances of observing spurious
associations due to chip differences. In this paper, we discuss whether
more stringent quality controls (QCs) should be applied.
In general, the following QCs are performed at the preimputation
stage: minor allele frequency (MAF) ≥1–5%, Hardy–Weinberg equilibrium
(HWE) P-value >10⁻⁴–10⁻⁶, SNP call rate ≥90–99%,
sample call rate ≥90–98%, and other checks such as sex mismatch
and Mendelian errors. For the details of QCs in GWAS, we refer to
Anderson et al.6 Imputation software such as MACH7 or IMPUTE8
can be used to impute SNPs based on the HapMap CEU-phased
haplotypes. There seems to be no consensus yet on the QCs after
imputation, and on reporting the quality of imputed genotypes in
publications. In the tutorial of MACH an inclusion threshold r2 of 0.3
is recommended. In addition to the preanalysis information measures,
such as r2 of MACH and info of IMPUTE, which are the information
measures about the population allele frequency, SNPTEST8 provides a
post-analysis information measure about the association parameter
for unrelated samples. Here we propose a similar post-analysis
information measure to test related samples, called RT2.
As the focus of a meta-analysis is on combining estimates of
association parameters, it seems prudent to base QC on post-analysis
information measures that also cover the strength of association, such
as SNPTEST info or RT2. These measures can be used to obtain
homogeneity and to increase the comparability between the studies.9
Marchini et al10 showed that based on a simulated data set of 1000
cases and 1000 controls the MACH and IMPUTE preanalysis infor-
mation measures were highly correlated, and that there was a good
agreement between the IMPUTE preanalysis information measure and
the SNPTEST post-analysis information measure when testing an
additive genetic model. In this paper we investigate whether good
agreement holds for strongly associated SNPs between the pre- and
postanalysis information measures, and whether the post-analysis
information measures such as SNPTEST info and RT2 can have an
important role as an inclusion criterion of candidate SNPs.
Received 16 April 2011; revised 28 October 2011; accepted 9 November 2011; published online 21 December 2011
1Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands; 2Netherlands Consortium for Healthy Ageing, Leiden University
Medical Center, Leiden, The Netherlands; 3Section of Molecular Epidemiology, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden,
The Netherlands; 4Department of Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands; 5Department of Internal Medicine, Erasmus Medical Center, Rotterdam,
The Netherlands and 6Department of Biological Psychology, Vrije Universiteit, Amsterdam, The Netherlands
*Correspondence: Dr H-W Uh, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands.
Tel: +31 71 5269718; Fax: +31 71 5268280; E-mail: email@example.com
MATERIALS AND METHODS
In 2007 we performed a GWAS for the Leiden Longevity Study (LLS)11 with an
affected sibling pair (ASP) and control design. One sibling from each of 420
long-lived sibling pairs was genotyped with the ﬁrst generation Affymetrix Gene
Chip Human Mapping 500K Array (Affy500, Perlegen Sciences, Mountain View,
CA, USA). This Affy500 data set was discarded for the analysis that was eventually
published.12 To illustrate the situation in which data obtained by an early platform
are combined with data generated on more recent platforms, we have here included
the Affy500 data yet again. The remaining siblings were genotyped with Illumina
Inﬁnium HD Human660W-Quad BeadChips (Illumina660, San Diego, CA, USA).
Using the following per-individual QC6 of GWA data, we excluded individuals with
discordant sex information, individuals with sample call rate <0.95, and duplicated
individuals. Per-marker QC was carried out for including SNPs with the following
criteria: SNP call rate >0.95, MAF >0.01, and HWE P-value >10⁻⁴. After QC,
517K SNPs remained on the Illumina and 350K SNPs remained on the Affy500
arrays. Of these, only 60K SNPs of Affy500 overlapped with Illumina660. To reuse
the genotypes we used MACH for imputation of missing 457K SNPs in Affy500
based on HapMap CEU individuals. To guarantee the quality of imputation, we
set the inclusion threshold to r2 = 0.3 as recommended. For 1670 (younger
unrelated) controls from the Rotterdam Study, genotypes were generated with
Illumina Inﬁnium II HumanHap 550K and HumanHap550-Duo BeadChips
(Illumina550).12,13 Our data, therefore, differs from the usual simulation setting
in the following way: the sib of each sibship genotyped with Affy500 was
imputed to match the SNPs of other siblings and controls. The description of
the study design and the different arrays used is given in Figure 1 and Table 1.
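The per-marker filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function names, the genotype coding (0, 1, 2, with `None` for a missing call), and the choice of a 1-d.f. chi-square goodness-of-fit test for HWE are all hypothetical choices made for the example.

```python
import math

def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-d.f. chi-square goodness-of-fit test for HWE (assumed test)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)            # frequency of the first allele
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                              # monomorphic: no evidence against HWE
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chisq / 2.0))

def passes_marker_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-4):
    """Apply the three per-marker criteria: call rate, MAF, and HWE P-value."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    call_rate = len(called) / len(genotypes)
    counts = [called.count(k) for k in (0, 1, 2)]
    freq = (counts[1] + 2 * counts[2]) / (2 * len(called))
    maf = min(freq, 1.0 - freq)
    hwe_p = hwe_chisq_pvalue(counts[0], counts[1], counts[2])
    return call_rate > min_call_rate and maf > min_maf and hwe_p > min_hwe_p
```

The default thresholds mirror the values quoted in the text (call rate >0.95, MAF >0.01, HWE P-value >10⁻⁴); stricter values can be passed as keyword arguments.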
An additional check of the imputation accuracy was performed; 10% of the
SNPs were randomly masked, and correctness of imputation was determined
by comparing imputed genotypes with the masked ones. More than 99% of
masked SNPs passed the default imputation threshold of r2 = 0.3, so that our
data passed this additional QC. For validation of the GWAS results, the 89 top-
ranking SNPs were re-genotyped with the Sequenom MassArray platform.
Here, we compare imputed and measured genotypes of these top-ranking SNPs.
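A comparison of imputed and masked genotypes can be illustrated with a small helper that scores best-guess genotypes against the true calls. The sketch is hypothetical: `best_guess` and `concordance` are illustrative names, and the study itself judged the masked SNPs by the MACH r2 threshold rather than by this concordance rate.

```python
def best_guess(posterior):
    """Most probable genotype (0, 1 or 2) given posterior probabilities (p0, p1, p2)."""
    return max(range(3), key=lambda g: posterior[g])

def concordance(true_genotypes, posteriors):
    """Fraction of masked genotypes recovered by the best-guess imputed genotype."""
    if len(true_genotypes) != len(posteriors):
        raise ValueError("one posterior triple is needed per masked genotype")
    matches = sum(
        1 for g, post in zip(true_genotypes, posteriors) if best_guess(post) == g
    )
    return matches / len(true_genotypes)
```

Such a rate is easy to read but ignores how confident each call was, which is one reason dosage-based measures such as r2 are usually reported alongside it.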
Score test. Modeling the LLS data needs to account for (1) ascertainment,
that is, cases were long-lived sibling pairs (ASPs), and (2) the fact that one of
the sibs in each pair had most markers imputed because it belonged to the
Affy500 data. On the basis of the argument that the ascertainment event
depends on the phenotype but is conditionally independent of the genotype
given a phenotype, we use the score statistic corresponding to the retrospective
likelihood for testing.
We let $X = (X_1, \ldots, X_n)$ be the $n \times 1$ vector of genotype data. We code each
genotype as 0, 1, or 2, corresponding to the number of minor alleles present at
that locus. For $n$ individuals, we let $Y = (Y_1, \ldots, Y_n)$ be the $n \times 1$ vector of the
case–control status, which is coded 0 for control subjects and 1 for case
subjects. Further, $\bar{Y}$ denotes the proportion of cases. The score statistic for
testing for an additive effect of a diallelic locus on phenotype is given as
$$U_X = (Y - \bar{Y})^{\top} X.$$
Under the null hypothesis of no association between genotype
and disease, the score test $U_X^2/\mathrm{Var}(U_X)$ is asymptotically distributed as $\chi^2$ with 1
degree of freedom. To account for relatedness of cases we used the kinship
coefficients matrix when computing the variance of the score statistic.14
Imputation is dealt with by accounting for loss of information due to genotype
uncertainty. A detailed derivation of the score test is given in the Appendix.
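The score test for genotyped data can be sketched as follows. This is a minimal illustration, not the authors' C++ program: the function name `score_test` is hypothetical, the allele frequency is estimated naively from the whole sample, and the loss of information due to imputation is deliberately left out. The correlation matrix is assumed to hold 1 on the diagonal and twice the kinship coefficient off the diagonal.

```python
import math

def score_test(y, x, kinship_corr=None):
    """Score statistic U^2 / Var(U) for an additive effect, with its P-value.

    y: case-control status (0/1); x: genotype counts (0, 1, 2);
    kinship_corr: optional n x n correlation matrix K (identity if omitted).
    """
    n = len(y)
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]
    u = sum(r * xi for r, xi in zip(resid, x))         # U = (Y - Ybar)' X
    p = sum(x) / (2 * n)                                # naive allele frequency estimate
    sigma2 = 2 * p * (1 - p)                            # genotypic variance under HWE
    if kinship_corr is None:
        quad = sum(r * r for r in resid)
    else:
        quad = sum(
            resid[i] * kinship_corr[i][j] * resid[j]
            for i in range(n) for j in range(n)
        )
    var_u = sigma2 * quad                               # sigma^2 (Y - Ybar)' K (Y - Ybar)
    if var_u <= 0:
        raise ValueError("score variance is zero (monomorphic SNP or constant phenotype)")
    stat = u * u / var_u
    p_value = math.erfc(math.sqrt(stat / 2.0))          # chi-square, 1 d.f. upper tail
    return stat, p_value
```

Positive correlation between related cases enlarges the variance of the score, so the same data yield a smaller statistic once relatedness is taken into account.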
Post-analysis information measures. Let the posterior probability of
imputed genotypes be $p_i = (p_{i0}, p_{i1}, p_{i2})$ for subject $i$, and the expected dosage
for the genotype counts of the $i$th individual be $E(X_i) = p_{i1} + 2p_{i2}$. Further,
let $p$ denote the population minor allele frequency. Assuming HWE, the
MACH r2 is defined by
$$r^2 = \frac{\frac{1}{n}\sum_{i=1}^{n}\bigl(E(X_i) - 2p\bigr)^2}{2p(1-p)},$$
so that this preanalysis information measure depends only on the allele
frequency and imputed genotypes. When data are genotyped, r2 equals one.
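The preanalysis measure can be computed directly from the posterior genotype probabilities. The sketch below assumes the commonly cited form of the MACH r2, the empirical variance of the expected dosage divided by the HWE variance 2p(1-p), with p estimated from the dosages themselves; the function name `mach_r2` is hypothetical.

```python
def mach_r2(posteriors):
    """Ratio of the empirical dosage variance to the HWE-expected variance 2p(1-p).

    posteriors: list of (p0, p1, p2) genotype probabilities, one triple per subject.
    """
    n = len(posteriors)
    dosages = [p1 + 2 * p2 for _, p1, p2 in posteriors]
    mean_dosage = sum(dosages) / n
    p = mean_dosage / 2                       # allele frequency estimated from dosages
    if p <= 0 or p >= 1:
        raise ValueError("monomorphic SNP: r2 is undefined")
    variance = sum((d - mean_dosage) ** 2 for d in dosages) / n
    return variance / (2 * p * (1 - p))
```

For hard genotype calls in exact HWE proportions the ratio is one, whereas posteriors that merely reproduce the HWE prior give identical dosages for every subject and a ratio of zero.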
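As in the Appendix, let $K$ denote the genetic correlation matrix. The
genotypic variance of the sample is denoted by $\Sigma$, and $\Sigma_{\text{loss}}$ is the loss of
information due to uncertainty. The relative efficiency measure for case–control
design of Uh et al15 can be used as an information measure about the
association parameter:
$$R_T^2 = 1 - \frac{(Y - \bar{Y})^{\top} (K \circ \Sigma_{\text{loss}}) (Y - \bar{Y})}{(Y - \bar{Y})^{\top} (K \circ \Sigma) (Y - \bar{Y})},$$
where $\circ$ denotes the (Hadamard) term-wise product. Consequently with
genotyped data $\Sigma_{\text{loss}} = 0$, hence, RT2 equals 1. In contrast to the preanalysis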
Figure 1 Study samples and arrays used. Affy500 stands for the ﬁrst genera-
tion Affymetrix Gene Chip Human Mapping 500K Array, Illumina660 for
Illumina Inﬁnium HD Human660W-Quad BeadChips, and Illumina550 for
Illumina Inﬁnium II HumanHap 550K and HumanHap550-Duo BeadChips.
Sib 2 and controls were all genotyped, and for Sib1 in addition to the over-
lapping genotyped 60K SNPs, the remaining 457K SNPs were imputed. After
post-imputation QC, 451K SNPs were analyzed using ASP–control design.
Table 1 Study designs and arrays used in Figure 3
Figure 3 Study design Sample No. of SNPs(a) Overlap Imputed SNPs QC passed and tested SNPs Genomic control λGC
a ASP–control Sib 2 and control
60K 457K 451K 1.16
b Case–control Sib 2 and control 517K 517K 517K 1.03
c ASP–control Sib 2 and control
60K 60K 1.06
d ASP–control Sib 2 and control
(a) No. of SNPs that passed QC at the pre-imputation stage.
(b) No. of SNPs with RT2 ≥ 0.98.
H-W Uh et al
European Journal of Human Genetics
information measure r2, this post-analysis information measure RT2 assigns
more weight to associated SNPs.
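A post-analysis measure of this kind can be sketched as one minus the ratio of lost to complete information in the score variance. The code below is an assumption-laden illustration rather than the released C++ program: it takes the individual loss to be the posterior variance of the genotype, builds the loss matrix from the square roots of those individual losses, and uses the hypothetical function name `rt2`.

```python
import math

def rt2(y, posteriors, kinship_corr=None):
    """Assumed form: 1 - v'(K o S_loss)v / v'(K o S)v with v = Y - Ybar.

    y: case-control status (0/1); posteriors: (p0, p1, p2) per subject;
    kinship_corr: optional correlation matrix K (identity if omitted).
    """
    n = len(y)
    y_bar = sum(y) / n
    v = [yi - y_bar for yi in y]
    dosages = [p1 + 2 * p2 for _, p1, p2 in posteriors]
    p = sum(dosages) / (2 * n)
    sigma2 = 2 * p * (1 - p)                       # complete-data genotypic variance
    # Individual loss: posterior variance of the genotype (assumption of this sketch)
    loss = [max(p1 + 4 * p2 - d * d, 0.0) for (_, p1, p2), d in zip(posteriors, dosages)]
    root = [math.sqrt(l) for l in loss]
    total = 0.0
    lost = 0.0
    for i in range(n):
        for j in range(n):
            if kinship_corr is not None:
                k = kinship_corr[i][j]
            else:
                k = 1.0 if i == j else 0.0
            weight = v[i] * k * v[j]
            total += weight * sigma2
            lost += weight * root[i] * root[j]
    if total <= 0:
        raise ValueError("complete-data information is zero")
    return 1.0 - lost / total
```

Because each subject is weighted by its phenotype residual, uncertainty in the smaller group (often the cases) costs more information than the same uncertainty in the larger group, which is how the measure comes to depend on the association analysis rather than on the genotypes alone.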
An executable C++ program for the score test and RT2 is available (http://
RESULTS
The difference between the pre- and postanalysis information
measures, MACH r2 and RT2, is shown in Figure 2. Using Sib 1 and
controls data, we randomly selected 1000 SNPs each from three classes
of SNPs: P-values greater than 0.05, P-values smaller than 0.001, and
intermediate ones. Although for unassociated SNPs (P-value >0.05)
the two measures show good agreement, they are quite different for
strongly associated SNPs (P-value <0.001). The post-analysis measure,
therefore, can be a useful tool for selecting SNPs for meta-analysis.
Quantile–quantile (Q–Q) plots in Figure 3 illustrate the GWAS
results using different study designs as described in Table 1. The test
statistics in all Q–Q plots were corrected by their genomic control
inflation factor λGC.16 First we used combined data of ASPs (imputed
Sib 1 and genotyped Sib 2) and genotyped controls. Results
(Figure 3a) show deviation from the first diagonal (dashed line), hence,
inflation of test statistics (λGC = 1.16). Next (Figure 3b), we compared
genotyped Sib 2 and controls (Illumina660 for cases and Illumina550
for controls, respectively): λGC = 1.03. One might conjecture that
inflated test statistics in Figure 3a were caused by also considering
imputed sibling data. We then investigated whether this inflation is an
artifact solely from imputation, or due to combining different arrays.
To determine the possibility of a chip (or batch) effect, we conducted
ASP and control analysis only on genotyped overlapping 60K SNPs
with Affy500 (Sib 1), Illumina660 (Sib 2), and Illumina550 (control).
In Figure 3c, the genomic control inflation factor is decreased from
1.16 to 1.06 as compared with Figure 3a and increased from 1.03 to
1.06 as compared with Figure 3b. This may suggest that there is a chip
effect, which was amplified by the imputation. Figure 3d shows that by
applying a very stringent extra QC (RT2 >0.98, 60K genotyped and
97K imputed SNPs) inflation of the test statistics could be dealt with
(λGC = 1.05). Therefore, the significantly biased results (Figure 3a)
seem to be caused by the different chips, one of which is of inferior quality.
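The genomic control inflation factor used throughout these comparisons is commonly computed as the median of the observed 1-d.f. chi-square statistics divided by the median of the chi-square distribution with 1 d.f. (about 0.455). The sketch below follows that convention; the function names are hypothetical, and leaving the statistics unchanged when the factor is below 1 is an assumed convention rather than something stated in the text.

```python
import statistics

# Median of the chi-square distribution with 1 d.f. (approximately 0.4549)
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control_lambda(chisq_stats):
    """Inflation factor: observed median statistic over the expected median."""
    return statistics.median(chisq_stats) / CHI2_1DF_MEDIAN

def gc_correct(chisq_stats):
    """Divide every statistic by lambda; no correction is applied when lambda < 1."""
    lam = max(genomic_control_lambda(chisq_stats), 1.0)
    return [s / lam for s in chisq_stats]
```

A factor of 1.16 therefore means that the median statistic is 16% larger than expected under the null hypothesis, and every statistic is shrunk by that factor before the Q–Q plot is drawn.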
For validation, the 89 top-ranking SNPs (MACH r2 >0.3) resulting
from the association analysis using the first design were retyped with
the Sequenom MassArray platform. We checked the quality of
genotyping (of the different platforms) as well as that of imputation.
Figure 4 illustrates the comparison of minor allele frequencies (MAFs)
in the long-lived siblings. In the top panel, the deviation of the points
from the first diagonal (dashed line) indicates the poor match of the
Affy500 data and retyped sample. Meanwhile, the retyping of the
Illumina660 data shows better agreement (bottom panel). Visual
inspection of cluster plots of the sole exception (the red filled circle)
confirmed the results of the Sequenom array.
DISCUSSION
Our study illustrates that imputation while combining different
arrays in GWAS, using data from the earliest platforms without
sufficiently stringent QCs, may produce false positive associations. A
simple remedy for improving quality is to choose a stricter threshold for
inclusion at the pre- and postimputation stages. For preimputation
QCs we refer to Anderson et al.6
In addition to the preanalysis measures such as r2 of MACH
and info of IMPUTE, which are the relative information measures
only depending on the population allele frequency and imputation
accuracy, we proposed an additional post-analysis measure RT2.
Figure 2 Comparison of the pre- and the postanalysis imputation
information measure. The x axis shows the preanalysis information measure
(r2), and the y axis the post-analysis information measure (RT2). The blue
points indicate the SNPs with no association (P-value >0.05); there is little
effect of case–control status, and the two measures agree. The red ones are the
SNPs that show strong association (P-value <0.001), and the green ones
are intermediate cases.
Figure 3 Quantile–quantile plots obtained from LLS GWAS analyses. The
triangles indicate the SNPs at which the test statistic exceeds 30
(corresponding P-value <5 × 10⁻⁸). The 95% concentration bands (shaded
gray) are included. (a) ASP–control design: combined data of imputed
Affy500 (Sib 1), typed Illumina660 (Sib 2), and typed Illumina550
(control). Deviation from the dashed line indicates inflation of test statistics.
(b) Case–control design: genotyped with Illumina660 (Sib 2) and
Illumina550 (control). (c) ASP–control design: 60K overlap using combined
typed data of Affy500 (Sib 1), Illumina660 (Sib 2), and Illumina550
(control). (d) ASP–control design: as in (a), but only SNPs with RT2 >0.98.
Details are provided in Table 1.
Our measure is an information measure that assesses the above
information but also includes strength of association. When testing
independent samples, this is equivalent to the information measure
of SNPTEST. For a recessive or dominant model, Marchini et al10
showed that the post-analysis measures are quite different from the
preanalysis information measure r2. For strongly associated SNPs
under an additive model we showed that RT2 and r2 could be quite
different (Figure 2). For example, meta-analyses aim to combine
estimates of association parameters, which argues for the use of
post-analysis QC measures such as RT2 and SNPTEST info. In
situations such as ours, filtering on RT2 leads to a reduction in
heterogeneity between studies, making the studies more comparable
and meta-analysis more powerful. To interpret the results of meta-
analysis properly, it also is important to report the difference between
the studies, such as the quality of both genotyping and imputation.
All information measures need to be carefully considered in further
analysis. In our study, by re-genotyping strongly associated SNPs, we
found that an extremely tight inclusion threshold of our imputation
quality measure RT2 greater than 0.98 was needed to achieve reliable
results as shown in Figures 3 and 4; only 18 of the 89 top-ranking
SNPs passed the post-analysis QC. These plots suggest that false
positive ﬁndings are caused by imputation based on arrays of inferior
quality, when cases and controls are not matched for genotyping
platforms. Actually, in our GWAS for longevity we discarded the
Affy500 data set because of the small number of reliable SNPs. It
should be noted that 97K imputed SNPs remained in the analysis even
for this stringent cutoff (Table 1). We also retyped the Affy500 cases
with the Illumina 660K platform and recently published our GWAS.12
For Figure 3c, one may ask whether the Q–Q plot using only 60K
overlapping SNPs is comparable to Q–Q plots using a larger number of
SNPs. We compared the distribution of association P-values using 60K
SNPs and 350K SNPs in cases and controls, and both distributions
were quite similar (data not shown).
The results presented here were based on early scan data with a
small sample size. When combining modern arrays within studies, less
bias may be expected due to better genotyping quality. On the other
hand, the enormous sample size of pooled studies may amplify even
the small individual effects, for example, due to platform effects,
population strata, or genotyping batch effects, resulting in false
positive ﬁndings, as heterogeneity between studies is ampliﬁed by
imputation. Imputation of genotypes while combining different data
sets can be a very powerful method, and has identiﬁed susceptibility
loci using early scan data.17,18 However, our ﬁndings stress that when
combining newer data sets with early scan data, rigorous QCs should
be applied at both the pre- and post-analysis stages to ensure
reproducible findings. Moreover, we recommend that post-analysis QC
measures should be reported in publications as they give the most
direct insight into the influence of imputation on association.
CONFLICT OF INTEREST
The authors declare no conﬂict of interest.
ACKNOWLEDGEMENTS
We acknowledge R van der Breggen, N Lakenberg, D Kremer, and HED
Suchiman for their efforts in genotyping by Sequenom MassArray. This work is
supported by a grant from the Netherlands Organization for Scientiﬁc Research
(NWO 917.66.334). We thank all the participants of the Leiden Longevity
Study and the Rotterdam Study. This study was supported by a grant from the
Innovation-Oriented Research Program on Genomics (SenterNovem
IGE05007), the Centre for Medical Systems Biology, and the Netherlands
Consortium for Healthy Ageing (Grant 050–060-810), all in the framework of
the Netherlands Genomics Initiative/Netherlands Organization for Scientiﬁc
Research (NWO), and BBMRI-NL (Biobanking and Biomolecular Resources
Research Infrastructure). The generation and management of GWAS genotype
data for the Rotterdam study is supported by the Netherlands Organization for
Scientiﬁc Research NWO Investments (No. 175.010.2005.011, 911-03-012).
This study is funded by the Research Institute for Diseases in the Elderly
(014-93-015; RIDE2) and the Netherlands Genomics Initiative (NGI)/Netherlands
Organization for Scientiﬁc Research (NWO) Project No. 050-060-810; we
thank P Arp, M Jhamai, M Verkerk, L Herrera, and M Peters for their
help in creating the GWAS database. The Rotterdam Study is funded by the
Erasmus Medical Center and Erasmus University, Rotterdam, the Netherlands
Organization for the Health Research and Development (ZonMw), the
Research Institute for Diseases in the Elderly, the Ministry of Education,
Culture and Science, the Ministry for Health, Welfare and Sports, the
European Commission (DG XII), and the Municipality of Rotterdam.
REFERENCES
1 Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Annu Rev Genomics Hum
Genet 2009; 10: 387–406.
2 Howie BN, Donnelly P, Marchini J: A ﬂexible and accurate genotype imputation method
for the next generation of genome-wide association studies. PLoS Genet 2009; 5:
3 The Wellcome Trust Case Control Consortium: Genome-wide association study of
14 000 cases of seven common diseases and 3000 shared controls. Nature 2007;
4 ANZ genes: Genome-wide association study identiﬁes new multiple sclerosis suscept-
ibility loci on chromosome 12 and 20. Nat Genet 2009; 41: 824–828.
5 Zhong H, Yang X, Kaplan LM, Molony C, Schadt EE: Integrating pathway analysis and
genetics of gene expression for genome-wide association studies. Am J Hum Genet
2010; 86: 581–591.
6 Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT:
Data quality control in genetic case-control association studies. Nat Protoc 2010; 5:
[Figure 4 plot residue: panel title "Imputed & genotyped (Affymetrix)"; both axes run from 0.0 to 0.5; legend entries RT2 ≤ 0.98 and RT2 > 0.98.]
Figure 4 Comparison of the MAF between GWAS and replication data.
Top: x axis shows MAF of imputed Sib 1 data using Affy500, and y axis
MAF of the same SNPs replicated with Sequenom. The green colored points did
not pass the threshold RT2 >0.98. Bottom: x axis shows MAF of (genotyped)
Sib 2 data using Illumina660, and y axis MAF of the same SNPs replicated
with Sequenom. The red-filled circle in both panels indicates the same SNP.
7 Li Y, Abecasis G: Mach 1.0: rapid haplotype reconstruction and missing genotype
inference. Am J Hum Genet 2006; S79: 2290.
8 Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for
genome-wide association studies via imputation of genotypes. Nat Genet 2007; 39:
9 Cantor RM, Lange K, Sinsheimer JS: Prioritizing GWAS results: a review of statistical
methods and recommendations for their approach. Am J Hum Genet 2010; 86:6–22.
10 Marchini J, Howie B: Genotype imputation for genome-wide association studies. Nat
Rev Genet 2010; 11:499–511.
11 Westendorp RG, van Heemst D, Rozing MP et al: Nonagenarian siblings and their
offspring display lower risk for mortality and morbidity than sporadic nonagenarians:
the Leiden Longevity Study. J Am Geriatr Soc 2009; 59: 1634–1637.
12 Deelen J, Beekman M, Uh HW et al: Genome-wide association study identifies a single
major locus contributing to survival into old age; the APOE locus revisited. Ageing Cell
2011; 10: 686–698.
13 Hofman A, Breteler MM, Van Duijn CM et al: The Rotterdam Study: 2010 objectives
and design update. Eur J Epidemiol 2009; 24:553–572.
14 Uh HW, Wijk HJ, Houwing-Duistermaat JJ: Testing for genetic association taking into
account phenotypic information of relatives. BMC Proc 2009; 5(Suppl 7): S123.
15 Uh H-W, Houwing-Duistermaat JJ, Putter H, van Houwelingen HC: Assessment of
global phase uncertainty in case-control studies. BMC Genet 2009; 10:54.
16 Devlin B, Roeder K: Genomic control for association studies. Biometrics 1999; 55:
17 Stuart PE, Nair RP, Ellinghaus E et al: Genome-wide association analysis identiﬁes
three psoriasis susceptibility loci. Nat Genet 2010; 42: 1000–1004.
18 Ellinor PT, Lunetta KL, Clazer NL et al: Common variants in KCNN3 are associated with
lone atrial ﬁbrillation. Nat Genet 2010; 42:240–244.
This work is licensed under the Creative Commons
Attribution-NonCommercial-No Derivative Works
3.0 Unported Licence. To view a copy of this licence, visit http://
APPENDIX
We first address the ascertainment of the independent cases. Let
$Y = (Y_1, \ldots, Y_n)$ be the phenotype, and let $X = (X_1, \ldots, X_n)$ denote the genotype
dosage 0, 1, or 2. Further, $\bar{Y}$ is the mean of $Y$ in the whole sample,
or the proportion of cases in case–control studies. As the ascertainment
event $S$ depends on the phenotype but is conditionally
independent of the genotype given $Y$, $P(X \mid Y, S) = P(X \mid Y)$. Therefore,
the retrospective likelihood based on $P(X \mid Y)$ is appropriate under
selection. On the basis of retrospective likelihood, the score statistic for
testing for an additive effect of a genotyped locus on phenotype is as
follows. The score is
$$U_X = \sum_{i=1}^{n} (Y_i - \bar{Y}) X_i = (Y - \bar{Y})^{\top} X, \qquad (1)$$
and the variance of $U_X$ is
$$\mathrm{Var}(U_X) = \sigma_X^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sigma_X^2 (Y - \bar{Y})^{\top} (Y - \bar{Y}),$$
where $\sigma_X^2$ is the genotypic variance. Under HWE assumption, $\sigma_X^2$ can
be estimated by $2\hat{p}(1 - \hat{p})$ with the MAF estimate $\hat{p}$. Under $H_0$, the
test statistic $U_X^2/\mathrm{Var}(U_X)$ is asymptotically distributed as $\chi^2$ with 1
degree of freedom.
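When using multiplex cases from the same pedigree, we need to
take into account correlations. We define the correlation matrix $K$ for
$n$ subjects as follows:
$$K = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1n} \\ r_{12} & 1 & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1n} & r_{2n} & \cdots & 1 \end{pmatrix}.$$
The off-diagonal entries, $r_{ij}$, are twice the kinship coefficient between
individuals $i$ and $j$ ($i \neq j$). Then, the expression of the denominator of
the score statistic is replaced by
$$\mathrm{Var}(U_X) = \sigma_X^2 (Y - \bar{Y})^{\top} K (Y - \bar{Y}).$$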
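To deal with imputed genotypes, the uncertainty caused by imputation
needs to be considered. On the basis of the statistical theory for
missing data, the genotype data can be partitioned into two parts,
$X = (X_{\text{obs}}, X_{\text{mis}})$.
The log likelihoods for the complete data ($l_{\text{comp}}$) and observed
(incomplete) data ($l_{\text{obs}}$) are given by
$$l_{\text{comp}}(\theta) = \log P(X_{\text{obs}}, X_{\text{mis}} \mid Y; \theta), \qquad l_{\text{obs}}(\theta) = \log \sum_{X_{\text{mis}}} P(X_{\text{obs}}, X_{\text{mis}} \mid Y; \theta).$$
Let $U(\theta)$ be the complete data score $\partial l_{\text{comp}}/\partial\theta$, and $I(\theta)$ the complete
data information $-\partial^2 l_{\text{comp}}/\partial\theta^2$, respectively.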
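Instead of observing $X$, for imputed genotypes the posterior
probability $p_i = (p_{i0}, p_{i1}, p_{i2})$ is given for subject $i = 1, \ldots, n$. Let the
expected dosage for the genotype counts of the $i$th individual be
$\tilde{X}_i = E X_i = p_{i1} + 2p_{i2}$. Then we replace the genotype counts $X$ by
$\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_n)$ in the score statistic (1).
Let $\Sigma$ be the $n \times n$ matrix with the genotypic variance $\sigma_X^2$,
$$\Sigma = \sigma_X^2 \mathbf{1}\mathbf{1}^{\top},$$
where $\mathbf{1}$ represents a vector of ones of length $n$. The $n \times n$ matrix
$\Sigma_{\text{loss}}$ denotes the loss of information.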
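Then, the score and information for the observed data likelihood
are given by
$$U_{\text{obs}}(\theta) = E_{X_{\text{mis}} \mid X_{\text{obs}}} U(\theta),$$
$$I_{\text{obs}}(\theta) = E_{X_{\text{mis}} \mid X_{\text{obs}}} I(\theta) - \mathrm{Var}_{X_{\text{mis}} \mid X_{\text{obs}}} U(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{n} (Y_i - \bar{Y})(Y_j - \bar{Y})\, r_{ij}\, (\Sigma_{ij} - \Sigma_{\text{loss},ij}),$$
with $r_{ii} = 1$.
Here, the term $\mathrm{Var}_{X_{\text{mis}} \mid X_{\text{obs}}}(\cdot)$ represents the loss of information due
to imputation uncertainty. The elements of $\Sigma_{\text{loss}}$ are defined by the
outer product of the square root of individual loss $l_i$,
$$\Sigma_{\text{loss}} = \sqrt{l}\,\sqrt{l}^{\top}, \qquad l = (l_1, \ldots, l_n).$$
Thus, on the diagonal we have $\Sigma_{\text{loss},ii} = l_i$ and off the diagonal we have
$$\Sigma_{\text{loss},ij} = \sqrt{l_i l_j}$$
for $i, j = 1, \ldots, n$. Then the variance of the score statistic can be
written as
$$\mathrm{Var}(U_X) = (Y - \bar{Y})^{\top} \{K \circ (\Sigma - \Sigma_{\text{loss}})\} (Y - \bar{Y}),$$
where $\circ$ denotes the (Hadamard) term-wise product.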
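As a worked illustration of the loss term, consider a single imputed subject. The numbers are invented for the example, and identifying the individual loss $l_i$ with the posterior variance of the genotype is an assumption that follows from the conditional variance term above rather than a formula quoted from the text.

```latex
% Hypothetical subject with posterior p_i = (0.25, 0.5, 0.25) at a SNP with MAF p = 0.5
\begin{align*}
\tilde{X}_i &= p_{i1} + 2 p_{i2} = 0.5 + 2(0.25) = 1, \\
l_i &= E(X_i^2) - \tilde{X}_i^2 = (p_{i1} + 4 p_{i2}) - \tilde{X}_i^2 = 1.5 - 1 = 0.5, \\
\sigma_X^2 &= 2p(1-p) = 0.5, \qquad \sigma_X^2 - l_i = 0 .
\end{align*}
```

In this example the posterior simply reproduces the HWE prior, so the subject contributes no information to the score variance. A directly genotyped subject has $l_i = 0$ and contributes the full $\sigma_X^2$.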
1. Uh HW, Wijk HJ, Houwing-Duistermaat JJ: Testing for genetic
association taking into account phenotypic information of relatives.
BMC Proc 2009; (Suppl 7): S123.
2. Louis TA: Finding the observed information matrix when using
the EM algorithm. J R Stat Soc 1982; 44: 226-233.