How to deal with the early GWAS data when imputing
and combining different arrays is necessary
Hae-Won Uh*,1,2, Joris Deelen2,3, Marian Beekman3, Quinta Helmer1, Fernando Rivadeneira2,4,5,
Jouke-Jan Hottenga6, Dorret I Boomsma6, Albert Hofman2,4,5, André G Uitterlinden2,4,5, PE Slagboom2,3,
Stefan Böhringer1 and Jeanine J Houwing-Duistermaat1
Genotype imputation has become an essential tool in the analysis of genome-wide association scans. This technique allows
investigators to test association at ungenotyped genetic markers, and to combine results across studies that rely on different
genotyping platforms. In addition, imputation is used within long-running studies to reuse genotypes produced across
generations of platforms. Typically, genotypes of controls are reused and cases are genotyped on more novel platforms yielding a
case–control study that is not matched for genotyping platforms. In this study, we scrutinize such a situation and validate GWAS
results by actually retyping top-ranking SNPs with the Sequenom MassArray platform. We discuss the needed quality controls
(QCs). In doing so, we report a considerable discrepancy between the results from imputed and retyped data when applying
recommended QCs from the literature. These discrepancies appear to be caused by extrapolating differences between arrays by
the process of imputation. To avoid false positive results, we recommend that more stringent QCs should be applied. We also
advocate reporting the imputation quality measure ($R_T^2$) for the post-imputation QCs in publications.
European Journal of Human Genetics (2012) 20, 572–576; doi:10.1038/ejhg.2011.231; published online 21 December 2011
Keywords: GWAS; imputation; quality control
INTRODUCTION
Imputation-based association methods provide a powerful framework
for testing ungenotyped variants for association with phenotypes.
Genotype imputation is particularly useful for combining results
across studies that use different genotyping platforms, because a
meta-analysis of several studies with relatively modest findings can
result in a number of strongly associated loci that were not previously
indicated. Many successes of such meta-analysis have been reported.1,2
Here, we consider the use of imputation to pool subjects genotyped
with different platforms within studies. For example, when the data of
control groups such as the Wellcome Trust Case Control Consortium3
are reused, the cases are typically not matched regarding genotyping
platforms or arrays.4 Another example concerns combining expression
quantitative trait loci studies with data being generated at very
different time points from different platforms, thereby requiring
genotype imputation.5 Although reusing such existing data seems to
be an efficient approach, it may increase chances of observing spurious
associations due to chip differences. In this paper, we discuss whether
more stringent quality controls (QCs) should be applied.
In general, the following QCs are performed at the preimputation
stage: minor allele frequency (MAF) ≥1–5%, Hardy–Weinberg equilibrium
(HWE) P-value > $10^{-4}$ to $10^{-6}$, SNP call rate ≥90–99%,
sample call rate ≥90–98%, and other checks such as sex mismatch
and Mendelian errors. For the details of QCs in GWAS, we refer to
Anderson et al.6 Imputation software such as MACH7 or IMPUTE8
can be used to impute SNPs based on the HapMap CEU-phased
haplotypes. There seems to be no consensus yet on the QCs after
imputation, or on reporting the quality of imputed genotypes in
publications. In the tutorial of MACH an inclusion threshold $r^2$ of 0.3
is recommended. In addition to the preanalysis information measures,
such as $r^2$ of MACH and info of IMPUTE, which are information
measures about the population allele frequency, SNPTEST8 provides a
post-analysis information measure about the association parameter
for unrelated samples. Here we propose a similar post-analysis
information measure to test related samples, called $R_T^2$.
Because the focus in a meta-analysis is on combining estimates of
association parameters, it seems prudent to base QC on post-analysis
information measures that also cover the strength of association, such
as SNPTEST info or $R_T^2$. These measures can be used to obtain
homogeneity and to increase the comparability between the studies.9
Marchini et al10 showed that, based on a simulated data set of 1000
cases and 1000 controls, the MACH and IMPUTE preanalysis information
measures were highly correlated, and that there was good
agreement between the IMPUTE preanalysis information measure and
the SNPTEST post-analysis information measure when testing an
additive genetic model. In this paper we investigate whether this good
agreement between the pre- and post-analysis information measures
also holds for strongly associated SNPs, and whether post-analysis
information measures such as SNPTEST info and $R_T^2$ can have an
important role as an inclusion criterion for candidate SNPs.
Received 16 April 2011; revised 28 October 2011; accepted 9 November 2011; published online 21 December 2011
1Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands; 2Netherlands Consortium for Healthy Ageing, Leiden University
Medical Center, Leiden, The Netherlands; 3Section of Molecular Epidemiology, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden,
The Netherlands; 4Department of Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands; 5Department of Internal Medicine, Erasmus Medical Center, Rotterdam,
The Netherlands and 6Department of Biological Psychology, Vrije Universiteit, Amsterdam, The Netherlands
*Correspondence: Dr H-W Uh, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands.
Tel: +31 71 5269718; Fax: +31 71 5268280; E-mail: h.uh@lumc.nl
© 2012 Macmillan Publishers Limited. All rights reserved 1018-4813/12
www.nature.com/ejhg
MATERIALS AND METHODS
In 2007 we performed a GWAS for the Leiden Longevity Study (LLS)11 with an
affected sibling pair (ASP) and control design. One sibling from each of 420
long-lived sibling pairs was genotyped with the first generation Affymetrix Gene
Chip Human Mapping 500K Array (Affy500, Perlegen Sciences, Mountain View,
CA, USA). This Affy500 data set was discarded for the analysis that was eventually
published.12 To illustrate the situation in which data obtained by an early platform
are combined with data generated on more recent platforms, we have here included
the Affy500 data yet again. The remaining siblings were genotyped with Illumina
Infinium HD Human660W-Quad BeadChips (Illumina660, San Diego, CA, USA).
Using the following per-individual QC6 of GWA data, we excluded individuals with
discordant sex information, individuals with sample call rate < 0.95, and duplicated
individuals. Per-marker QC was carried out to include SNPs meeting the following
criteria: SNP call rate > 0.95, MAF > 0.01, and HWE P-value > $10^{-4}$. After QC,
517K SNPs remained on the Illumina and 350K SNPs remained on the Affy500
arrays. Of these, only 60K SNPs of Affy500 overlapped with Illumina660. To reuse
the genotypes we used MACH for imputation of the missing 457K SNPs in Affy500
based on HapMap CEU individuals. To guarantee the quality of imputation, we
set the inclusion threshold to $r^2$ = 0.3 as recommended. For 1670 (younger
unrelated) controls from the Rotterdam Study, genotypes were generated with
Illumina Infinium II HumanHap 550K and HumanHap550-Duo BeadChips
(Illumina550).12,13 Our data, therefore, differ from the usual simulation setting
in the following way: the sib of each sibship genotyped with Affy500 was
imputed to match the SNPs of other siblings and controls. The description of
the study design and the different arrays used is given in Figure 1 and Table 1.
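The per-marker criteria stated above (SNP call rate > 0.95, MAF > 0.01, HWE P-value > $10^{-4}$) can be sketched as a small filter. This is only an illustration, not the pipeline used in the study: the function names are hypothetical, genotypes are assumed to be coded 0/1/2 with NaN for missing calls, and a Pearson chi-square test with 1 degree of freedom is assumed for HWE (the text does not say whether an exact test was used).

```python
import math

import numpy as np


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Pearson chi-square (1 df) P-value for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the first allele
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    if min(expected) == 0:
        return 1.0  # monomorphic marker: nothing to test
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(stat / 2))


def marker_passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01,
                     min_hwe_p=1e-4):
    """Apply the three per-marker thresholds to one SNP (0/1/2, NaN = missing)."""
    g = np.asarray(genotypes, dtype=float)
    called = g[~np.isnan(g)]
    if called.size == 0:
        return False
    call_rate = called.size / g.size
    freq = called.mean() / 2
    maf = min(freq, 1 - freq)
    counts = [int(np.sum(called == k)) for k in (0, 1, 2)]
    hwe_p = hwe_chisq_pvalue(*counts)
    return bool(call_rate > min_call_rate and maf > min_maf
                and hwe_p > min_hwe_p)
```

For example, a marker with 25, 50, and 25 individuals in the three genotype classes passes all three checks, whereas a marker where every individual is heterozygous fails the HWE check.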
An additional check of the imputation accuracy was performed; 10% of the
SNPs were randomly masked, and correctness of imputation was determined
by comparing imputed genotypes with the masked ones. More than 99% of
masked SNPs passed the default imputation threshold of $r^2$ = 0.3, so that our
data passed this additional QC. For validation of the GWAS results, the 89 top-
ranking SNPs were re-genotyped with the Sequenom MassArray platform.
Here, we compare imputed and measured genotypes of these top-ranking
SNPs.
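The masking check described above can be sketched as follows. The imputation step itself (MACH) is external and not shown. The helper names are hypothetical, and the best-guess concordance used here is an assumed accuracy summary for illustration; the text only states that imputed genotypes were compared with the masked ones and judged against the $r^2$ threshold.

```python
import numpy as np


def choose_masked_snps(n_snps, fraction=0.10, seed=0):
    """Pick a random subset of SNP indices to hide before imputation."""
    rng = np.random.default_rng(seed)
    n_mask = int(round(fraction * n_snps))
    return np.sort(rng.choice(n_snps, size=n_mask, replace=False))


def best_guess_concordance(true_genotypes, posterior):
    """Fraction of individuals whose most likely imputed genotype is correct.

    true_genotypes: length-n sequence of 0/1/2 calls that were masked.
    posterior: (n, 3) array of imputed probabilities P(X=0), P(X=1), P(X=2).
    """
    best_guess = np.argmax(np.asarray(posterior, dtype=float), axis=1)
    return float(np.mean(best_guess == np.asarray(true_genotypes)))
```

In use, the masked SNPs would be removed from the input, imputed back, and then scored per SNP with `best_guess_concordance` or with the $r^2$ threshold.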
Methods
Score test. Modeling the LLS data needs to account for (1) ascertainment,
that is, cases were long-lived sibling pairs (ASPs), and (2) the fact that one of
the sibs in each pair had most markers imputed because it belonged to the
Affy500 data. On the basis of the argument that the ascertainment event
depends on the phenotype but is conditionally independent of the genotype
given a phenotype, we use the score statistic corresponding to the retrospective
likelihood for testing.
We let $X = (X_1, \ldots, X_n)$ be the $n \times 1$ vector of genotype data. We code each
genotype as 0, 1, or 2, corresponding to the number of minor alleles present at
that locus. For $n$ individuals, we let $Y = (Y_1, \ldots, Y_n)$ be the $n \times 1$ vector of the
case–control status, which is coded 0 for control subjects and 1 for case
subjects. Further, $\bar{Y}$ denotes the proportion of cases. The score statistic for
testing for an additive effect of a diallelic locus on phenotype is given as
$U_X = (Y - \bar{Y})^\top X$. Under the null hypothesis of no association between genotype
and disease, the score test $U_X^2 / \mathrm{Var}(U_X)$ is asymptotically distributed as $\chi^2$ with 1
degree of freedom. To account for relatedness of cases we used the kinship
coefficients matrix when computing the variance of the score statistic.14
Imputation is dealt with by accounting for loss of information due to genotype
uncertainty. A detailed derivation of the score test is given in the Appendix.
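A minimal sketch of the score test for genotyped data may help fix ideas. It follows the Appendix formulas for the score and its variance, with an optional correlation matrix (twice the kinship coefficients off the diagonal) for related cases. The function name is hypothetical, the MAF is assumed to be estimated from the pooled sample, and this is not the authors' C++ program.

```python
import math

import numpy as np


def score_test(y, x, kinship_corr=None):
    """Score test for an additive effect of one SNP.

    y: case-control status (0/1); x: genotype counts (0/1/2).
    kinship_corr: optional (n, n) matrix with ones on the diagonal and
    twice the kinship coefficient off the diagonal; identity if omitted.
    Returns (chi-square statistic with 1 df, P-value).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = y.size
    resid = y - y.mean()                    # Y - Ybar
    u = resid @ x                           # score U_X
    p_hat = x.mean() / 2                    # pooled MAF estimate (assumption)
    sigma2 = 2 * p_hat * (1 - p_hat)        # genotypic variance under HWE
    k = np.eye(n) if kinship_corr is None else np.asarray(kinship_corr, float)
    var_u = (resid @ k @ resid) * sigma2    # variance of the score
    if var_u <= 0:
        return float("nan"), float("nan")   # monomorphic SNP or constant phenotype
    stat = u * u / var_u
    return float(stat), math.erfc(math.sqrt(stat / 2))
```

With positively correlated cases, such as sibling pairs with off-diagonal entries of 0.5, the variance grows and the statistic shrinks relative to the analysis that treats all subjects as unrelated.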
Post-analysis information measures. Let the posterior probability of
imputed genotypes be $p_i = (p_{i0}, p_{i1}, p_{i2})$ for subject $i$, and the expected dosage
for the genotype counts of the $i$th individual be $E(X_i) = p_{i1} + 2p_{i2}$. Further,
let $p$ denote the population minor allele frequency. Assuming HWE, the
MACH $r^2$ is defined by

$$r^2 = \frac{\sum_{i=1}^{n} X_i^2 / n - \left( \sum_{i=1}^{n} X_i / n \right)^2}{2\hat{p}(1 - \hat{p})}, \tag{1.1}$$

so that this preanalysis information measure depends only on the allele
frequency and imputed genotypes. When data are genotyped, $r^2$ equals one.
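Equation (1.1) can be evaluated directly from the posterior genotype probabilities. In this sketch the function name is hypothetical, $X_i$ is taken to be the expected dosage $p_{i1} + 2p_{i2}$, and $\hat{p}$ is assumed to be estimated as half the mean dosage.

```python
import numpy as np


def mach_r2(posterior):
    """Ratio of observed dosage variance to the HWE variance 2p(1-p).

    posterior: (n, 3) array of P(X=0), P(X=1), P(X=2) per individual.
    """
    post = np.asarray(posterior, dtype=float)
    dosage = post[:, 1] + 2 * post[:, 2]     # expected genotype count
    p_hat = dosage.mean() / 2                # allele frequency estimate
    denom = 2 * p_hat * (1 - p_hat)
    if denom == 0:
        return float("nan")                  # monomorphic: measure undefined
    return float((np.mean(dosage ** 2) - dosage.mean() ** 2) / denom)
```

Hard genotype calls in Hardy–Weinberg proportions give a value of 1, whereas completely uninformative posteriors (every individual with the same dosage) give 0.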
As in the Appendix, let $K$ denote the genetic correlation matrix. The
genotypic variance of the sample is denoted by $\Sigma$, and $\Sigma_{\mathrm{loss}}$ is the loss of
information due to uncertainty. The relative efficiency measure for case–control
design of Uh et al15 can be used as an information measure about the
association parameter:

$$R_T^2 = \frac{(Y - \bar{Y})^\top \left[ K \circ (\Sigma - \Sigma_{\mathrm{loss}}) \right] (Y - \bar{Y})}{(Y - \bar{Y})^\top \left[ K \circ \Sigma \right] (Y - \bar{Y})}, \tag{1.2}$$

where $\circ$ denotes the (Hadamard) term-wise product. Consequently, with
genotyped data $\Sigma_{\mathrm{loss}} = 0$; hence, $R_T^2$ equals 1. In contrast to the preanalysis
Figure 1 Study samples and arrays used. Affy500 stands for the first generation
Affymetrix Gene Chip Human Mapping 500K Array, Illumina660 for
Illumina Infinium HD Human660W-Quad BeadChips, and Illumina550 for
Illumina Infinium II HumanHap 550K and HumanHap550-Duo BeadChips.
Sib 2 and controls were all genotyped, and for Sib 1, in addition to the
overlapping genotyped 60K SNPs, the remaining 457K SNPs were imputed. After
post-imputation QC, 451K SNPs were analyzed using the ASP–control design.
Table 1 Study designs and arrays used in Figure 3

| Figure 3 | Study design | Sample | No. of SNPs (a) | Overlap | Imputed SNPs | QC passed and tested SNPs | Genomic control λGC |
|---|---|---|---|---|---|---|---|
| a | ASP–control | Sib 2 and control; Sib 1 | 517K; 350K | 60K | 457K | 451K | 1.16 |
| b | Case–control | Sib 2 and control | 517K | 517K | – | 517K | 1.03 |
| c | ASP–control | Sib 2 and control; Sib 1 | 517K; 350K | 60K | – | 60K | 1.06 |
| d | ASP–control | Sib 2 and control; Sib 1 | 517K; 350K | 60K | 97K (b) | 157K | 1.05 |

(a) No. of SNPs that passed QC at the pre-imputation stage.
(b) No. of SNPs with $R_T^2$ ≥ 0.98.
information measure $r^2$, this post-analysis information measure $R_T^2$ assigns
more weight to associated SNPs.
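As a hedged illustration of equation (1.2), the sketch below builds $\Sigma$ and $\Sigma_{\mathrm{loss}}$ from posterior genotype probabilities using the individual loss defined in the Appendix. The function names are hypothetical, $\sigma_X^2$ is assumed to be estimated as $2\hat{p}(1-\hat{p})$ from the mean dosage, and the code is not the authors' C++ implementation.

```python
import numpy as np


def individual_loss(posterior):
    """Per-individual loss l_i = p1(1-p1) + 4 p2(1-p2) - 4 p1 p2."""
    post = np.asarray(posterior, dtype=float)
    p1, p2 = post[:, 1], post[:, 2]
    return p1 * (1 - p1) + 4 * p2 * (1 - p2) - 4 * p1 * p2


def r_t_squared(y, posterior, kinship_corr=None):
    """Post-analysis information measure of equation (1.2)."""
    y = np.asarray(y, dtype=float)
    post = np.asarray(posterior, dtype=float)
    n = y.size
    resid = y - y.mean()
    dosage = post[:, 1] + 2 * post[:, 2]
    p_hat = dosage.mean() / 2
    # Sigma = sigma_X^2 * 1 1^T, with sigma_X^2 estimated under HWE (assumption).
    sigma = np.full((n, n), 2 * p_hat * (1 - p_hat))
    # Sigma_loss = outer product of the square roots of the individual losses.
    root = np.sqrt(np.clip(individual_loss(post), 0.0, None))
    sigma_loss = np.outer(root, root)
    k = np.eye(n) if kinship_corr is None else np.asarray(kinship_corr, float)
    # "*" between matrices is the term-wise (Hadamard) product.
    num = resid @ (k * (sigma - sigma_loss)) @ resid
    den = resid @ (k * sigma) @ resid
    return float(num / den)
```

With hard genotype calls every individual loss is zero and the function returns 1, as stated in the text; any uncertainty in the posteriors pulls the value below 1.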
An executable C++ program for the score test and $R_T^2$ is available
(http://www.msbi.nl/uh).
RESULTS
The difference between the pre- and post-analysis information
measures, MACH $r^2$ and $R_T^2$, is shown in Figure 2. Using Sib 1 and
controls data, we randomly selected 1000 SNPs each from three classes
of SNPs: P-values greater than 0.05, P-values smaller than 0.001, and
intermediate ones. Although for unassociated SNPs (P-value > 0.05)
the two measures show good agreement, they are quite different for
strongly associated SNPs (P-value < 0.001). The post-analysis measure,
therefore, can be a useful tool for selecting SNPs for meta-analysis.
Quantile–quantile (Q–Q) plots in Figure 3 illustrate the GWAS
results using different study designs as described in Table 1. The test
statistics in all Q–Q plots were corrected by their genomic control
inflation factor $\lambda_{GC}$.16 First we used combined data of ASPs (imputed
Sib 1 and genotyped Sib 2) and genotyped controls. Results
(Figure 3a) show deviation from the first diagonal (dashed line), hence,
inflation of test statistics ($\lambda_{GC}$ = 1.16). Next (Figure 3b), we compared
genotyped Sib 2 and controls (Illumina660 for cases and Illumina550
for controls, respectively): $\lambda_{GC}$ = 1.03. One might conjecture that the
inflated test statistics in Figure 3a were caused by also considering
imputed sibling data. We then investigated whether this inflation is an
artifact solely of imputation, or due to combining different arrays.
To determine the possibility of a chip (or batch) effect, we conducted
the ASP and control analysis only on the genotyped overlapping 60K SNPs
with Affy500 (Sib 1), Illumina660 (Sib 2), and Illumina550 (control).
In Figure 3c, the genomic control inflation factor is decreased from
1.16 to 1.06 as compared with Figure 3a and increased from 1.03 to
1.06 as compared with Figure 3b. This may suggest that there is a chip
effect, which was amplified by the imputation. Figure 3d shows that by
applying a very stringent extra QC ($R_T^2$ > 0.98, 60K genotyped and
97K imputed SNPs) the inflation of the test statistic could be dealt with
($\lambda_{GC}$ = 1.05). Therefore, the significantly biased results (Figure 3a)
seem to be caused by the different chips, one of which is of
low quality.
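The genomic control inflation factor quoted above is commonly computed as the median of the observed 1-df chi-square statistics divided by the median of the $\chi^2_1$ distribution (about 0.455). The sketch below assumes that definition and a simple division of all statistics by the factor; the function names are hypothetical and the exact procedure used for Figure 3 is not spelled out in the text.

```python
from statistics import NormalDist

import numpy as np

# Median of a chi-square distribution with 1 df: the squared 75th
# percentile of the standard normal, roughly 0.455.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control_lambda(chi2_stats):
    """Inflation factor: observed median statistic over the null median."""
    return float(np.median(chi2_stats) / CHI2_1DF_MEDIAN)


def gc_correct(chi2_stats):
    """Divide every test statistic by the estimated inflation factor."""
    stats = np.asarray(chi2_stats, dtype=float)
    return stats / genomic_control_lambda(stats)
```

A value near 1 indicates no systematic inflation, while values such as 1.16 mean that the median statistic is 16% larger than expected under the null.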
For validation, the 89 top-ranking SNPs (MACH $r^2$ > 0.3) resulting
from the association analysis using the first design were retyped with
the Sequenom MassArray platform. We checked the quality of
genotyping (of the different platforms) as well as that of imputation.
Figure 4 illustrates the comparison of minor allele frequencies (MAFs)
in the long-lived siblings. In the top panel, the deviation of the points
from the first diagonal (dashed line) indicates the poor match between the
Affy500 data and the retyped sample. Meanwhile, the retyping of the
Illumina660 data shows better agreement (bottom panel). Visual
inspection of cluster plots of the sole exception (the red filled circle)
confirmed the results of the Sequenom array.
DISCUSSION
Our study illustrates that imputation while combining different
arrays in GWAS, using data from the earliest platforms without
sufficiently stringent QCs, may produce false positive associations. A
simple remedy to improve quality is to choose a stricter threshold for
inclusion at the pre- and postimputation stages. For preimputation
QCs we refer to Anderson et al.6
In addition to the preanalysis measures such as $r^2$ of MACH
and info of IMPUTE, which are relative information measures
depending only on the population allele frequency and imputation
accuracy, we proposed an additional post-analysis measure $R_T^2$.
Figure 2 Comparison of the pre- and the post-analysis imputation
information measure. The x axis shows the preanalysis information measure
($r^2$), and the y axis the post-analysis information measure ($R_T^2$). The blue
points indicate the SNPs with no association (P-value > 0.05); there is little
effect of case–control status, and the two measures agree. The red ones are the
SNPs that show strong association (P-value < 0.001), and the green ones
are intermediate cases (weak association).
Figure 3 Quantile–quantile plots obtained from LLS GWAS analyses
(x axis: expected; y axis: observed; panel annotations: λGC = 1.16 (a),
1.03 (b), 1.06 (c), and 1.05 (d)). The triangles indicate the SNPs at which
the test statistic exceeds 30 (corresponding P-value < $5 \times 10^{-8}$).
The 95% concentration bands (shaded gray) are included. (a) ASP–control
design: combined data of imputed Affy500 (Sib 1), typed Illumina660 (Sib 2),
and typed Illumina550 (control). Deviation from the dashed line indicates
inflation of test statistics. (b) Case–control design: genotyped with
Illumina660 (Sib 2) and Illumina550 (control). (c) ASP–control design: 60K
overlap using combined typed data of Affy500 (Sib 1), Illumina660 (Sib 2),
and Illumina550 (control). (d) ASP–control design: as in (a), but only SNPs
with $R_T^2$ > 0.98. Details are provided in Table 1.
Our measure is an information measure that assesses the above
information but also includes the strength of association. When testing
independent samples, this is equivalent to the information measure
of SNPTEST. For a recessive or dominant model, Marchini et al10
showed that the post-analysis measures are quite different from the
preanalysis information measure $r^2$. For strongly associated SNPs
under an additive model we showed that $R_T^2$ and $r^2$ could be quite
different (Figure 2). For example, meta-analyses aim to combine
estimates of association parameters, which argues for the use of
post-analysis QC measures such as $R_T^2$ and SNPTEST info. In
situations such as ours, filtering on $R_T^2$ leads to a reduction in
heterogeneity between studies, making the studies more comparable
and meta-analysis more powerful. To interpret the results of
meta-analysis properly, it is also important to report the differences between
the studies, such as the quality of both genotyping and imputation.
All information measures need to be carefully considered in further
analysis. In our study, by re-genotyping strongly associated SNPs, we
found that an extremely tight inclusion threshold of our imputation
quality measure, $R_T^2$ greater than 0.98, was needed to achieve reliable
results as shown in Figures 3 and 4; only 18 of the 89 top-ranking
SNPs passed the post-analysis QC. These plots suggest that false
positive findings are caused by imputation based on arrays of inferior
quality, when cases and controls are not matched for genotyping
platforms. Actually, in our GWAS for longevity we discarded the
Affy500 data set because of the small number of reliable SNPs. It
should be noted that 97K imputed SNPs remained in the analysis even
for this stringent cutoff (Table 1). We also retyped the Affy500 cases
with the Illumina 660K platform and recently published our GWAS.12
In Figure 3c one may ask whether the Q–Q plot using only 60K
overlapping SNPs is comparable to Q–Q plots using a larger number of
SNPs. We compared the distribution of association P-values using 60K
cases and controls and 350K cases and controls, and both distributions
were quite similar (data not shown).
The results presented here were based on early scan data with a
small sample size. When combining modern arrays within studies, less
bias may be expected due to better genotyping quality. On the other
hand, the enormous sample size of pooled studies may amplify even
small individual effects, for example, due to platform effects,
population strata, or genotyping batch effects, resulting in false
positive findings, as heterogeneity between studies is amplified by
imputation. Imputation of genotypes while combining different data
sets can be a very powerful method, and has identified susceptibility
loci using early scan data.17,18 However, our findings stress that when
combining newer data sets with early scan data, rigorous QCs should
be applied at both the pre- and post-analysis stages to ensure
reproducible findings. Moreover, we recommend that post-analysis QC
measures should be reported in publications as they give the most
direct insight into the influence of imputation on association.
CONFLICT OF INTEREST
The authors declare no conflict of interest.
ACKNOWLEDGEMENTS
We acknowledge R van der Breggen, N Lakenberg, D Kremer, and HED
Suchiman for their efforts in genotyping by Sequenom MassArray. This work is
supported by a grant from the Netherlands Organization for Scientific Research
(NWO 917.66.334). We thank all the participants of the Leiden Longevity
Study and the Rotterdam Study. This study was supported by a grant from the
Innovation-Oriented Research Program on Genomics (SenterNovem
IGE05007), the Centre for Medical Systems Biology, and the Netherlands
Consortium for Healthy Ageing (Grant 050–060-810), all in the framework of
the Netherlands Genomics Initiative/Netherlands Organization for Scientific
Research (NWO), and BBMRI-NL (Biobanking and Biomolecular Resources
Research Infrastructure). The generation and management of GWAS genotype
data for the Rotterdam study is supported by the Netherlands Organization for
Scientific Research NWO Investments (No. 175.010.2005.011, 911-03-012).
This study is funded by the Research Institute for Diseases in the Elderly
(014-93-015; RIDE2) and the Netherlands Genomics Initiative (NGI)/Netherlands
Organization for Scientific Research (NWO) Project No. 050-060-810; we
thank P Arp, M Jhamai, M Verkerk, L Herrera, and M Peters for their
help in creating the GWAS database. The Rotterdam Study is funded by the
Erasmus Medical Center and Erasmus University, Rotterdam, the Netherlands
Organization for the Health Research and Development (ZonMw), the
Research Institute for Diseases in the Elderly, the Ministry of Education,
Culture and Science, the Ministry for Health, Welfare and Sports, the
European Commission (DG XII), and the Municipality of Rotterdam.
1 Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Annu Rev Genomics Hum Genet 2009; 10: 387–406.
2 Howie BN, Donnelly P, Marchini J: A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 2009; 5: e1000529.
3 The Wellcome Trust Case Control Consortium: Genome-wide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature 2007; 447: 661–678.
4 ANZ genes: Genome-wide association study identifies new multiple sclerosis susceptibility loci on chromosome 12 and 20. Nat Genet 2009; 41: 824–828.
5 Zhong H, Yang X, Kaplan LM, Molony C, Schadt EE: Integrating pathway analysis and genetics of gene expression for genome-wide association studies. Am J Hum Genet 2010; 86: 581–591.
6 Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT: Data quality control in genetic case-control association studies. Nat Protoc 2010; 5: 1564–1573.
Figure 4 Comparison of the MAF between GWAS and replication data
(both axes run from 0.0 to 0.5; the plot legend separates SNPs by the
threshold $R_T^2$ > 0.98, and the labeled SNP is rs4814335).
Top (Sib 1): x axis shows MAF of imputed Sib 1 data using Affy500
(imputed and genotyped, Affymetrix), and y axis MAF of the same SNPs
replicated with Sequenom. The green-colored points did not pass the
threshold $R_T^2$ > 0.98. Bottom (Sib 2): x axis shows MAF of (genotyped)
Sib 2 data using Illumina660, and y axis MAF of the same SNPs replicated
with Sequenom. The red-filled circle in both panels indicates the same SNP.
7 Li Y, Abecasis G: Mach 1.0: rapid haplotype reconstruction and missing genotype inference. Am J Hum Genet 2006; S79: 2290.
8 Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genome-wide association studies via imputation of genotypes. Nat Genet 2007; 39: 906–913.
9 Cantor RM, Lange K, Sinsheimer JS: Prioritizing GWAS results: a review of statistical methods and recommendations for their approach. Am J Hum Genet 2010; 86: 6–22.
10 Marchini J, Howie B: Genotype imputation for genome-wide association studies. Nat Rev Genet 2010; 11: 499–511.
11 Westendorp RG, van Heemst D, Rozing MP et al: Nonagenarian siblings and their offspring display lower risk for mortality and morbidity than sporadic nonagenarians: the Leiden Longevity Study. J Am Geriatr Soc 2009; 59: 1634–1637.
12 Deelen J, Beekman M, Uh HW et al: Genome-wide association study identifies a single major locus contributing to survival into old age; the APOE locus revisited. Aging Cell 2011; 10: 686–698.
13 Hofman A, Breteler MM, Van Duijn CM et al: The Rotterdam Study: 2010 objectives and design update. Eur J Epidemiol 2009; 24: 553–572.
14 Uh HW, Wijk HJ, Houwing-Duistermaat JJ: Testing for genetic association taking into account phenotypic information of relatives. BMC Proc 2009; 5(Suppl 7): S123.
15 Uh H-W, Houwing-Duistermaat JJ, Putter H, van Houwelingen HC: Assessment of global phase uncertainty in case-control studies. BMC Genet 2009; 10: 54.
16 Devlin B, Roeder K: Genomic control for association studies. Biometrics 1999; 55: 997–1004.
17 Stuart PE, Nair RP, Ellinghaus E et al: Genome-wide association analysis identifies three psoriasis susceptibility loci. Nat Genet 2010; 42: 1000–1004.
18 Ellinor PT, Lunetta KL, Glazer NL et al: Common variants in KCNN3 are associated with lone atrial fibrillation. Nat Genet 2010; 42: 240–244.
This work is licensed under the Creative Commons
Attribution-NonCommercial-No Derivative Works
3.0 Unported Licence. To view a copy of this licence, visit
http://creativecommons.org/licenses/by-nc-nd/3.0/
APPENDIX
We first address the ascertainment of the independent cases. Let
$Y = (Y_1, \ldots, Y_n)$ be the phenotype, and let $X = (X_1, \ldots, X_n)$ denote the genotype
dosage 0, 1, or 2. Further, $\bar{Y}$ is the mean of $Y$ in the whole sample,
or the proportion of cases in case–control studies. As the ascertainment
event $S$ depends on the phenotype but is conditionally
independent of the genotype given $Y$, $P(X \mid Y, S) = P(X \mid Y)$. Therefore,
the retrospective likelihood based on $P(X \mid Y)$ is appropriate under
selection. On the basis of the retrospective likelihood, the score statistic for
testing for an additive effect of a genotyped locus on phenotype is as
follows. The score is

$$U_X = (Y - \bar{Y})^\top X, \tag{1}$$

and the variance of $U_X$ is

$$\mathrm{Var}\, U_X = (Y - \bar{Y})^\top (Y - \bar{Y}) \, \sigma_X^2, \tag{2}$$

where $\sigma_X^2$ is the genotypic variance. Under the HWE assumption, $\sigma_X^2$ can
be estimated by $2\hat{p}(1 - \hat{p})$ with the MAF estimate $\hat{p}$. Under $H_0$, the
test statistic $U_X^2 / \mathrm{Var}\, U_X$ is asymptotically distributed as $\chi^2$ with 1
degree of freedom.
When using multiplex cases from the same pedigree, we need to
take into account correlations. We define the correlation matrix $K$ for
$n$ subjects as follows:

$$K = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1n} \\ r_{12} & 1 & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1n} & r_{2n} & \cdots & 1 \end{pmatrix}$$

The off-diagonal entries, $r_{ij}$, are twice the kinship coefficient between
individuals $i$ and $j$ ($i \neq j$). Then, the expression of the denominator of
the score statistic is replaced by (see Appendix reference 1)

$$\mathrm{Var}\, U_X = (Y - \bar{Y})^\top K (Y - \bar{Y}) \, \sigma_X^2.$$
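To deal with imputed genotypes, the uncertainty caused by imputation
needs to be considered. On the basis of the statistical theory for
missing data, the genotype data can be partitioned into two parts
(see Appendix reference 2):

$$X_{\mathrm{comp}} = [X_{\mathrm{obs}}, X_{\mathrm{mis}}].$$

The log likelihoods for the complete data ($l_{\mathrm{comp}}$) and observed
(incomplete) data ($l_{\mathrm{obs}}$) are given by

$$l_{\mathrm{comp}}(\theta) = \log P(X_{\mathrm{obs}}, X_{\mathrm{mis}} \mid \theta), \qquad l_{\mathrm{obs}}(\theta) = \log \int P(X_{\mathrm{obs}}, X_{\mathrm{mis}} \mid \theta) \, dX_{\mathrm{mis}}.$$

Let $U(\theta)$ be the complete data score $\partial l_{\mathrm{comp}} / \partial\theta$, and $I(\theta)$ the complete
data information $-\partial^2 l_{\mathrm{comp}} / \partial\theta^2$, respectively.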
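Instead of observing $X$, for imputed genotypes the posterior
probability $p_i = (p_{i0}, p_{i1}, p_{i2})$ is given for subject $i = 1, \ldots, n$. Let the
expected dosage for the genotype counts of the $i$th individual be
$\tilde{X}_i = E X_i = p_{i1} + 2p_{i2}$. Then we replace the genotype counts $X$ by $\tilde{X}$,

$$U_{\tilde{X}} = (Y - \bar{Y})^\top \tilde{X},$$

in the score statistic (1).
Let $\Sigma = \sigma_X^2 \mathbf{1}\mathbf{1}^\top$ be the $n \times n$ matrix with the genotypic variance $\sigma_X^2$,
where $\mathbf{1}$ represents a vector of ones of length $n$. The $n \times n$ matrix
$\Sigma_{\mathrm{loss}}$ denotes the loss of information.
Then, the score and information for the observed data likelihood
are given by

$$U_{\mathrm{obs}}(\theta) = E_{X_{\mathrm{mis}} \mid X_{\mathrm{obs}}} U(\theta),$$

$$I_{\mathrm{obs}}(\theta) = E_{X_{\mathrm{mis}} \mid X_{\mathrm{obs}}} I(\theta) - \mathrm{Var}_{X_{\mathrm{mis}} \mid X_{\mathrm{obs}}} U(\theta) = \Sigma - \Sigma_{\mathrm{loss}}.$$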
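Here, the term $\mathrm{Var}_{X_{\mathrm{mis}} \mid X_{\mathrm{obs}}}(\cdot)$ represents the loss of information due
to imputation uncertainty. The elements of $\Sigma_{\mathrm{loss}}$ are defined by the
outer product of the square root of the individual loss $l_i$,

$$l_i = p_{i1}(1 - p_{i1}) + 4p_{i2}(1 - p_{i2}) - 4p_{i1}p_{i2}.$$

Thus, on the diagonal we have $\Sigma_{\mathrm{loss},ii} = l_i$ and off the diagonal we have

$$\Sigma_{\mathrm{loss},ij} = \sqrt{l_i l_j}$$

for $i, j = 1, \ldots, n$. Then the variance of the score statistic can be
expressed as

$$\mathrm{Var}_{X_{\mathrm{obs}}} U_{\tilde{X}} = n^{-1} (Y - \bar{Y})^\top \left[ K \circ (\Sigma - \Sigma_{\mathrm{loss}}) \right] (Y - \bar{Y}),$$

where $\circ$ denotes the (Hadamard) term-wise product.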
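As a brief check, which is our own illustration rather than part of the original derivation, the individual loss $l_i$ can be read as the posterior variance of the genotype count $X_i$ given the imputation probabilities, using $E(X_i) = p_{i1} + 2p_{i2}$ and $E(X_i^2) = p_{i1} + 4p_{i2}$:

```latex
\begin{aligned}
\operatorname{Var}(X_i) &= E(X_i^2) - \{E(X_i)\}^2 \\
  &= (p_{i1} + 4p_{i2}) - (p_{i1} + 2p_{i2})^2 \\
  &= p_{i1} - p_{i1}^2 + 4p_{i2} - 4p_{i2}^2 - 4p_{i1}p_{i2} \\
  &= p_{i1}(1 - p_{i1}) + 4p_{i2}(1 - p_{i2}) - 4p_{i1}p_{i2} = l_i .
\end{aligned}
```

For instance, a subject with posterior probabilities (0.5, 0.5, 0) has $l_i = 0.25$, whereas a subject with a certain genotype has $l_i = 0$.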
References
1. Uh HW, Wijk HJ, Houwing-Duistermaat JJ: Testing for genetic
association taking into account phenotypic information of relatives.
BMC Proc 2009; (Suppl 7): S123.
2. Louis TA: Finding the observed information matrix when using
the EM algorithm. J R Stat Soc 1982; 44: 226-233.
... Genotype imputation is a process to predict or impute undetermined genotypes in a sample of individuals, and has been routinely used in genetic studies, including genome-wide association studies, to improve the power of analysis, fine-mapping association studies, and meta-analysis-combining multiple studies [77][78][79][80]. Before imputation, a prefiltering process is usually conducted for the input genotype data to remove low-quality genetic variants [77,[79][80][81][82][83][84][85]. Such prefiltering usually includes standard quality control steps for genetic association studies. ...
... Specifically, as part of the standard quality control procedure, the Hardy-Weinberg proportion test is conducted on the input genotype data prior to genotype imputation, by the commonly used imputation software programs such as Minimac (implementation of the MaCH algorithm) [82,86] and IMPUTE [77,79,82,86]. The filtering criterion used for the Hardy-Weinberg proportion test is P-value >10 À6 to 10 À4 , which is similar to those used in genetic association studies [80,83]. ...
Chapter
Full-text available
The Hardy-Weinberg principle, one of the most important principles in population genetics, was originally developed for the study of allele frequency changes in a population over generations. It is now, however, widely used in studies of human diseases to detect inbreeding, population stratification, and genotyping errors. For assessment of deviation from Hardy-Weinberg proportions in data, the most popular approaches include the asymptotic Pearson’s chi-squared goodness-of-fit test and the exact test. Pearson’s chi-squared goodness-of-fit test is simple and straightforward, but is very sensitive to a small sample size or rare allele frequency. The exact test of Hardy-Weinberg proportions is preferable in these situations. The exact test can be performed through complete enumeration of heterozygote genotypes or on the basis of the Markov chain Monte Carlo procedure. In this chapter, we describe the Hardy-Weinberg principle and the commonly used Hardy-Weinberg proportion tests and their applications, and we demonstrate how the chi-squared test and exact test of Hardy-Weinberg proportions can be performed step-by-step using the popular software programs SAS, R, and PLINK, which have been widely used in genetic association studies, along with numerical examples. We also discuss approaches for testing Hardy-Weinberg proportions in case–control study designs that are better than traditional approaches for testing Hardy-Weinberg proportions in controls only. Finally, we note that deviation from the Hardy-Weinberg proportions in affected individuals can provide evidence for an association between genetic variants and diseases.
... This was mostly based on routine quality control (QC) applied in association studies and fine mapping. The QC excluded low frequency variants and singletons 9,10. The confidence index threshold for post-imputation information measures was set either between 0.3-0.4 or at a more conservative score of 0.7-0.9 ...
Preprint
Quality control methods for genome-wide association studies and fine mapping are commonly used for imputation, however, they result in loss of many single nucleotide polymorphisms (SNPs). To investigate the consequences of filtration on imputation, we studied the direct effects on the number of markers, their allele frequencies, imputation quality scores and post-filtration events. We pre-phrased 1,031 genotyped individuals from diverse ethnicities and compared the imputed variants to 1,089 NCBI recorded individuals for additional validation. Without variant pre-filtration based on quality control (QC), we observed no impairment in the imputation of SNPs that failed QC whereas with pre-filtration there was an overall loss of information. Significant differences between frequencies with and without pre-filtration were found only in the range of very rare (5E-04-1E-03) and rare variants (1E-03-5E-03) (p < 1E-04). Increasing the post-filtration imputation quality score from 0.3 to 0.8 reduced the number of single nucleotide variants (SNVs) <0.001 2.5 fold with or without QC pre-filtration and halved the number of very rare variants (5E-04). As a result, to maintain confidence and enough SNVs, we propose here a 2-step post-filtration approach to increase the number of very rare and rare variants compared to conservative post-filtration methods.
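To illustrate how the choice of post-imputation quality threshold changes the number of retained variants, the sketch below applies a lenient (0.3) and a stricter (0.8) cutoff to a handful of made-up records. The `ImputedVariant` type, the `retain` helper, and all values are hypothetical. The abstract does not specify the proposed two-step procedure, so this shows plain single-threshold filtering only.

```python
from typing import List, NamedTuple


class ImputedVariant(NamedTuple):
    variant_id: str   # hypothetical identifier
    maf: float        # minor allele frequency
    quality: float    # post-imputation quality score in [0, 1]


def retain(variants: List[ImputedVariant], min_quality: float) -> List[ImputedVariant]:
    """Keep variants whose imputation quality score meets the threshold."""
    return [v for v in variants if v.quality >= min_quality]


# Made-up records for illustration; not data from the study.
VARIANTS = [
    ImputedVariant("var1", 0.0007, 0.35),
    ImputedVariant("var2", 0.0030, 0.55),
    ImputedVariant("var3", 0.0200, 0.82),
    ImputedVariant("var4", 0.2500, 0.97),
]

lenient = retain(VARIANTS, 0.3)
strict = retain(VARIANTS, 0.8)
print(len(lenient), "variants pass at 0.3;", len(strict), "pass at 0.8")
```

In this toy input the two rarest variants are the ones lost at the stricter cutoff, which is the kind of trade-off the abstract describes.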
... In all cases, patients and controls were from the same geographic area, they were genotyped with the same array and using the same genome assembly. In order to reduce bias, we merged controls with cases prior to quality control (QC) and imputation (Mitchell et al., 2014; Uh et al., 2012). The analyses performed to control for population stratification are summarized in Fig. S1. ...
Article
Cocaine dependence is a complex psychiatric disorder that is highly comorbid with other psychiatric traits. Twin and adoption studies suggest that genetic variants contribute substantially to cocaine dependence susceptibility, which has an estimated heritability of 65-79%. Here we performed a meta-analysis of genome-wide association studies of cocaine dependence using four datasets from the dbGaP repository (2085 cases and 4293 controls, all of them selected by their European ancestry). Although no genome-wide significant hits were found in the SNP-based analysis, the gene-based analysis identified HIST1H2BD as associated with cocaine-dependence (10% FDR). This gene is located in a region on chromosome 6 enriched in histone-related genes, previously associated with schizophrenia (SCZ). Furthermore, we performed LD Score regression analysis with comorbid conditions and found significant genetic correlations between cocaine dependence and SCZ, ADHD, major depressive disorder (MDD) and risk taking. We also found, through polygenic risk score analysis, that all tested phenotypes are significantly associated with cocaine dependence status: SCZ (R2 = 2.28%; P = 1.21e-26), ADHD (R2 = 1.39%; P = 4.5e-17), risk taking (R2 = 0.60%; P = 2.7e-08), MDD (R2 = 1.21%; P = 4.35e-15), children's aggressive behavior (R2 = 0.3%; P = 8.8e-05) and antisocial behavior (R2 = 1.33%; P = 2.2e-16). To our knowledge, this is the largest reported cocaine dependence GWAS meta-analysis in European-ancestry individuals. We identified suggestive associations in regions that may be related to cocaine dependence and found evidence for shared genetic risk factors between cocaine dependence and several comorbid psychiatric traits. However, the sample size is limited and further studies are needed to confirm these results.
Preprint
Cocaine dependence is a complex neuropsychiatric disorder that is highly comorbid with other psychiatric traits. Association studies suggest that common genetic variants contribute substantially to cocaine dependence susceptibility. Also, increasing evidence supports the role of shared genetic risk factors in the lifetime co-occurrence of psychiatric traits and cocaine dependence. Here we performed a genome-wide association study (GWAS) meta-analysis of cocaine dependence using four different dbGaP datasets (2,085 cases and 4,293 controls). Although no genome-wide significant hits were found in the SNP-based analysis, the gene-based analysis identified HIST1H2BD as significantly associated with cocaine-dependence (10% FDR). This gene is located in a region on chromosome 6 enriched in histone-related genes, previously associated with schizophrenia. The top SNPs of this region, rs806973 and rs56401801 (P=3.14e-06 and 3.44e-06, respectively), are eQTLs for different genes in multiple brain areas. Furthermore, we performed LD Score regression (LDSC) analysis with comorbid conditions and found significant genetic correlations between cocaine dependence and ADHD, SCZ, MDD and risk-taking behavior. We also found, through polygenic risk score (PRS) analysis, that all tested phenotypes can significantly predict cocaine dependence status: SCZ (R2=2.28%; P=1.21e-26), ADHD (R2=1.39%; P=4.5e-17), risk-taking behavior (R2=0.60%; P=2.7e-08), MDD (R2=1.21%; P=4.35e-15), children's aggressive behavior (R2=0.3%; P=8.8e-05) and antisocial behavior (R2=1.33%; P=2.2e-16). To our knowledge, this is the largest reported cocaine dependence GWAS meta-analysis on European ancestry individuals. Despite the small sample size, we identified suggestive associations in regions that may be related to cocaine dependence. Furthermore, we found evidence for shared genetic risk factors between cocaine dependence and several comorbid psychiatric traits.
... The test statistics were not adjusted for inflation (population stratification) because of the low genomic inflation factor (λ = 1.02). In order to account for the family relationships in the twin cohort we used QTassoc [39], a software tool based on SNPtest that is capable of handling familial data (using the kinship coefficients matrix) and genotype uncertainty. Data from both cohorts were analyzed using linear regression under an additive model and were adjusted for age, gender, glucose tolerance status, insulin sensitivity index, and familiarity (NTR only) as potential confounders. ...
Article
Glucagon-like peptide 1 (GLP-1) stimulated insulin secretion has a considerable heritable component as estimated from twin studies, yet few genetic variants influencing this phenotype have been identified. We performed the first genome-wide association study (GWAS) of GLP-1 stimulated insulin secretion in non-diabetic individuals from the Netherlands Twin register (n = 126). This GWAS was enhanced using a tissue-specific protein-protein interaction network approach. We identified a beta-cell protein-protein interaction module that was significantly enriched for low gene scores based on the GWAS P-values and found support at the network level in an independent cohort from Tübingen, Germany (n = 100). Additionally, a polygenic risk score based on SNPs prioritized from the network was associated (P < 0.05) with glucose-stimulated insulin secretion phenotypes in up to 5,318 individuals in MAGIC cohorts. The network contains both known and novel genes in the context of insulin secretion and is enriched for members of the focal adhesion, extracellular-matrix receptor interaction, actin cytoskeleton regulation, Rap1 and PI3K-Akt signaling pathways. Adipose tissue is, like the beta-cell, one of the target tissues of GLP-1 and we thus hypothesized that similar networks might be functional in both tissues. In order to verify peripheral effects of GLP-1 stimulation, we compared the transcriptome profiling of ob/ob mice treated with liraglutide, a clinically used GLP-1 receptor agonist, versus baseline controls. Some of the upstream regulators of differentially expressed genes in the white adipose tissue of ob/ob mice were also detected in the human beta-cell network of genes associated with GLP-1 stimulated insulin secretion. The findings provide biological insight into the mechanisms through which the effects of GLP-1 may be modulated and highlight a potential role of the beta-cell expressed genes RYR2, GDI2, KIAA0232, COL4A1 and COL4A2 in GLP-1 stimulated insulin secretion.
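The citing excerpt for this article reports a genomic inflation factor (λ = 1.02). As a rough sketch of how such a value is commonly computed, the function below divides the median of the observed 1-df chi-squared association statistics by the median of the chi-squared distribution with one degree of freedom (about 0.4549). The function name and the input values are assumptions for illustration; the article does not state which software produced its λ.

```python
import statistics

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364231195724


def genomic_inflation_factor(chi2_stats):
    """Median-based genomic inflation factor for 1-df test statistics."""
    if not chi2_stats:
        raise ValueError("at least one test statistic is required")
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical statistics, for illustration only.
example_stats = [0.02, 0.31, 0.46, 1.10, 3.90]
print(f"lambda = {genomic_inflation_factor(example_stats):.3f}")
```

Values close to 1 suggest little inflation of the test statistics, which is the reasoning behind leaving the statistics unadjusted in the excerpt.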
... This might cause spurious results in any secondary analysis on related traits. Although a missing SNP can be imputed, it will have a higher degree of inaccuracy in imputed compared with genotyped SNPs, potentially creating differential measurement error that could also lead to bias [41,46,47]. Therefore, we first looked at the overlap of SNPs between different genotyping arrays and identified three broad platform families with high degree of overlap within category (Fig 1). ...
Article
The Nurses’ Health Study (NHS), Nurses’ Health Study II (NHSII), Health Professionals Follow Up Study (HPFS) and the Physicians Health Study (PHS) have collected detailed longitudinal data on multiple exposures and traits for approximately 310,000 study participants over the last 35 years. Over 160,000 study participants across the cohorts have donated a DNA sample and to date, 20,691 subjects have been genotyped as part of genome-wide association studies (GWAS) of twelve primary outcomes. However, these studies utilized six different GWAS arrays making it difficult to conduct analyses of secondary phenotypes or share controls across studies. To allow for secondary analyses of these data, we have created three new datasets merged by platform family and performed imputation using a common reference panel, the 1,000 Genomes Phase I release. Here, we describe the methodology behind the data merging and imputation and present imputation quality statistics and association results from two GWAS of secondary phenotypes (body mass index (BMI) and venous thromboembolism (VTE)). We observed the strongest BMI association for the FTO SNP rs55872725 (β = 0.45, p = 3.48x10⁻²²), and using a significance level of p = 0.05, we replicated 19 out of 32 known BMI SNPs. For VTE, we observed the strongest association for the rs2040445 SNP (OR = 2.17, 95% CI: 1.79–2.63, p = 2.70x10⁻¹⁵), located downstream of F5 and also observed significant associations for the known ABO and F11 regions. This pooled resource can be used to maximize power in GWAS of phenotypes collected across the cohorts and for studying gene-environment interactions as well as rare phenotypes and genotypes.
Article
Although DNA array-based approaches for genome-wide association studies (GWAS) permit the collection of thousands of low-cost genotypes, it is often at the expense of resolution and completeness, as SNP chip technologies are ultimately limited by SNPs chosen during array development. An alternative low-cost approach is low-pass whole genome sequencing (WGS) followed by imputation. Rather than relying on high levels of genotype confidence at a set of select loci, low-pass WGS and imputation rely on the combined information from millions of randomly sampled low-confidence genotypes. To investigate low-pass WGS and imputation in the dog, we assessed accuracy and performance by downsampling 97 high-coverage (> 15×) WGS datasets from 51 different breeds to approximately 1× coverage, simulating low-pass WGS. Using a reference panel of 676 dogs from 91 breeds, genotypes were imputed from the downsampled data and compared to a truth set of genotypes generated from high-coverage WGS. Using our truth set, we optimized a variant quality filtering strategy that retained approximately 80% of 14 M imputed sites and lowered the imputation error rate from 3.0% to 1.5%. Seven million sites remained with a MAF > 5% and an average imputation quality score of 0.95. Finally, we simulated the impact of imputation errors on outcomes for case–control GWAS, where small effect sizes were most impacted and medium-to-large effect sizes were minorly impacted. These analyses provide best practice guidelines for study design and data post-processing of low-pass WGS-imputed genotypes in dogs.
Article
Quality control (QC) methods for genome-wide association studies and fine mapping are commonly used for imputation, however they result in loss of many single nucleotide polymorphisms (SNPs). To investigate the consequences of filtration on imputation, we studied the direct effects on the number of markers, their allele frequencies, imputation quality scores and post-filtration events. We pre-phrased 1031 genotyped individuals from diverse ethnicities and compared the imputed variants to 1089 NCBI recorded individuals for additional validation. Without QC-based variant pre-filtration, we observed no impairment in the imputation of SNPs that failed QC whereas with pre-filtration there was an overall loss of information. Significant differences between frequencies with and without pre-filtration were found only in the range of very rare (5E−04–1E−03) and rare variants (1E−03–5E−03) (p < 1E−04). Increasing the post-filtration imputation quality score from 0.3 to 0.8 reduced the number of single nucleotide variants (SNVs) < 0.001 2.5 fold with or without QC pre-filtration and halved the number of very rare variants (5E−04). Thus, to maintain confidence and enough SNVs, we propose here a two-step filtering procedure which allows less stringent filtering prior to imputation and post-imputation in order to increase the number of very rare and rare variants compared to conservative filtration methods.
Article
Genome-wide association studies (GWAS) have identified hundreds of loci associated with kidney-related traits such as glomerular filtration rate, albuminuria, hypertension, electrolyte and metabolite levels. However, these impressive, large-scale mapping approaches have not always translated into an improved understanding of disease or development of novel therapeutics. GWAS have several important limitations. Nearly all disease-associated risk loci are located in the non-coding region of the genome and therefore, their target genes, affected cell types and regulatory mechanisms remain unknown. Genome-scale approaches can be used to identify associations between DNA sequence variants and changes in gene expression (quantified through bulk and single-cell methods), gene regulation and other molecular quantitative trait studies, such as chromatin accessibility, DNA methylation, protein expression and metabolite levels. Data obtained through these approaches, used in combination with robust computational methods, can deliver robust mechanistic inferences for translational exploitation. Understanding the genetic basis of common kidney diseases means having a comprehensive picture of the genes that have a causal role in disease development and progression, of the cells, tissues and organs in which these genes act to affect the disease, of the cellular pathways and mechanisms that drive disease, and of potential targets for disease prevention, detection and therapy.
Article
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10⁻⁷: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a large number of further signals (including 58 loci with single-point P values between 10⁻⁵ and 5 × 10⁻⁷) likely to yield additional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizes observed at most loci identified. This study thus represents a thorough validation of the GWA approach. It has also demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiology of these important disorders.
We anticipate that our data, results and software, which will be widely available to other investigators, will provide a powerful resource for human genetics research.
Article
By studying the loci that contribute to human longevity, we aim to identify mechanisms that contribute to healthy aging. To identify such loci, we performed a genome-wide association study (GWAS) comparing 403 unrelated nonagenarians from long-living families included in the Leiden Longevity Study (LLS) and 1670 younger population controls. The strongest candidate SNPs from this GWAS have been analyzed in a meta-analysis of nonagenarian cases from the Rotterdam Study, Leiden 85-plus study, and Danish 1905 cohort. Only one of the 62 prioritized SNPs from the GWAS analysis (P < 1 × 10⁻⁴) showed genome-wide significance with survival into old age in the meta-analysis of 4149 nonagenarian cases and 7582 younger controls [OR = 0.71 (95% CI 0.65-0.77), P = 3.39 × 10⁻¹⁷]. This SNP, rs2075650, is located in TOMM40 at chromosome 19q13.32 close to the apolipoprotein E (APOE) gene. Although there was only moderate linkage disequilibrium between rs2075650 and the ApoE ε4 defining SNP rs429358, we could not find an APOE-independent effect of rs2075650 on longevity, either in cross-sectional or in longitudinal analyses. As expected, rs429358 associated with metabolic phenotypes in the offspring of the nonagenarian cases from the LLS and their partners. In addition, we observed a novel association between this locus and serum levels of IGF-1 in women (P = 0.005). In conclusion, the major locus determining familial longevity up to high age as detected by GWAS was marked by rs2075650, which tags the deleterious effects of the ApoE ε4 allele. No other major longevity locus was found.
Article
This protocol details the steps for data quality assessment and control that are typically carried out during case-control association studies. The steps described involve the identification and removal of DNA samples and markers that introduce bias. These critical steps are paramount to the success of a case-control study and are necessary before statistically testing for association. We describe how to use PLINK, a tool for handling SNP data, to perform assessments of failure rate per individual and per SNP and to assess the degree of relatedness between individuals. We also detail other quality-control procedures, including the use of SMARTPCA software for the identification of ancestral outliers. These platforms were selected because they are user-friendly, widely used and computationally efficient. Steps needed to detect and establish a disease association using case-control data are not discussed here. Issues concerning study design and marker selection in case-control studies have been discussed in our earlier protocols. This protocol, which is routinely used in our labs, should take approximately 8 h to complete.
Article
We carried out a meta-analysis of two recent psoriasis genome-wide association studies with a combined discovery sample of 1,831 affected individuals (cases) and 2,546 controls. One hundred and two loci selected based on P value rankings were followed up in a three-stage replication study including 4,064 cases and 4,685 controls from Michigan, Toronto, Newfoundland and Germany. In the combined meta-analysis, we identified three new susceptibility loci, including one at NOS2 (rs4795067, combined P = 4 × 10⁻¹¹), one at FBXL19 (rs10782001, combined P = 9 × 10⁻¹⁰) and one near PSMA6-NFKBIA (rs12586317, combined P = 2 × 10⁻⁸). All three loci were also associated with psoriatic arthritis (rs4795067, combined P = 1 × 10⁻⁵; rs10782001, combined P = 4 × 10⁻⁸; and rs12586317, combined P = 6 × 10⁻⁵) and purely cutaneous psoriasis (rs4795067, combined P = 1 × 10⁻⁸; rs10782001, combined P = 2 × 10⁻⁶; and rs12586317, combined P = 1 × 10⁻⁶). We also replicated a recently identified association signal near RNF114 (rs495337, combined P = 2 × 10⁻⁷).
Article
To identify multiple sclerosis (MS) susceptibility loci, we conducted a genome-wide association study (GWAS) in 1,618 cases and used shared data for 3,413 controls. We performed replication in an independent set of 2,256 cases and 2,310 controls, for a total of 3,874 cases and 5,723 controls. We identified risk-associated SNPs on chromosome 12q13-14 (rs703842, P = 5.4 × 10⁻¹¹; rs10876994, P = 2.7 × 10⁻¹⁰; rs12368653, P = 1.0 × 10⁻⁷) and upstream of CD40 on chromosome 20q13 (rs6074022, P = 1.3 × 10⁻⁷; rs1569723, P = 2.9 × 10⁻⁷). Both loci are also associated with other autoimmune diseases. We also replicated several known MS associations (HLA-DR15, P = 7.0 × 10⁻¹⁸⁴; CD58, P = 9.6 × 10⁻⁸; EVI5-RPL5, P = 2.5 × 10⁻⁶; IL2RA, P = 7.4 × 10⁻⁶; CLEC16A, P = 1.1 × 10⁻⁴; IL7R, P = 1.3 × 10⁻³; TYK2, P = 3.5 × 10⁻³) and observed a statistical interaction between SNPs in EVI5-RPL5 and HLA-DR15 (P = 0.001).
Article
The Methods section of the paper describes how missing genotypes are inferred through the use of a model of an individual's genotype vector Gi conditional upon a set of N known haplotypes H. A Hidden Markov Model (HMM) is used that has the form
Article
To compare the risk of mortality of nonagenarian siblings with that of sporadic nonagenarians (not selected on having a nonagenarian sibling) and to compare the prevalence of morbidity in their offspring with that of the offsprings' partners. Longitudinal (mortality risk) and cross-sectional (disease prevalence). Nationwide sample. The Leiden Longevity Study consists of 991 nonagenarian siblings derived from 420 Caucasian families, 1,365 of their offspring, and 621 of the offsprings' partners. In the Leiden 85-plus Study, 599 subjects aged 85 were included, of whom 275 attained the age of 90 (sporadic nonagenarians). All nonagenarian siblings and sporadic nonagenarians were followed for mortality (with a mean ± standard deviation follow-up time of 2.7 ± 1.4 years and 3.0 ± 1.5 years, respectively). Information on medical history and medication use was collected for offspring and their partners. Nonagenarian siblings had a 41% lower risk of mortality (P < .001) than sporadic nonagenarians. The offspring of nonagenarian siblings had a lower prevalence of myocardial infarction (2.4% vs 4.1%, P = .03), hypertension (23.0% vs 27.5%, P = .01), diabetes mellitus (4.4% vs 7.6%, P = .004), and use of cardiovascular medication (23.0% vs 28.9%, P = .003) than their partners. The lower mortality rate of nonagenarian siblings and lower prevalence of morbidity in their middle-aged offspring reinforce the notion that resilience against disease and death have similar underlying biology that is determined by genetic or familial factors.
Article
A procedure is derived for extracting the observed information matrix when the EM algorithm is used to find maximum likelihood estimates in incomplete data problems. The technique requires computation of a complete‐data gradient vector or second derivative matrix, but not those associated with the incomplete data likelihood. In addition, a method useful in speeding up the convergence of the EM algorithm is developed. Two examples are presented.
Article
In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs.