
Quantifying and correcting for the winner's curse in genetic

association studies

Rui Xiao and Michael Boehnke

Department of Biostatistics and Center for Statistical Genetics, University of Michigan

Abstract

Genetic association studies are a powerful tool to detect genetic variants that predispose to human

disease. Once an associated variant is identified, investigators are also interested in estimating the

effect of the identified variant on disease risk. Estimates of the genetic effect based on new

association findings tend to be upwardly biased due to a phenomenon known as the “winner's

curse”. Overestimation of genetic effect size in initial studies may cause follow-up studies to be

underpowered and so to fail. In this paper, we quantify the impact of the winner's curse on the

allele frequency difference and odds ratio estimators for one- and two-stage case-control

association studies. We then propose an ascertainment-corrected maximum likelihood method to

reduce the bias of these estimators. We show that overestimation of the genetic effect by the

uncorrected estimator decreases as the power of the association study increases and that the

ascertainment-corrected method reduces absolute bias and mean square error unless power to

detect association is high.

Keywords

winner's curse; ascertainment bias; genome-wide association study; maximum likelihood

Introduction

Large-scale genetic association studies are now commonly used to localize genetic variants

that predispose to a wide range of human diseases. In genetic association studies, once the

disease-predisposing variants are identified, it is of interest to estimate the genetic effect of

those variants on disease risk. The simplest method of estimating the effect size of the

variant is to calculate the difference of the observed risk allele frequency between cases and

controls or the corresponding odds ratio. However, these naïve estimators are likely to

overestimate the true genetic effect size as a consequence of the “winner's curse”

[Lohmueller et al., 2003], a phenomenon first described in the auction theory literature

[Bazerman and Samuelson, 1983]. In auctions, participants place bids on an item. Even if

the bids are unbiased, the winning bid is likely to overestimate the true item value since it is

the highest among all the bids. In genetic association studies, an initial positive finding plays

the role of the winning bid, since we generally focus on genetic effect size estimates only for

the variants that yield significant evidence for association, resulting in effect size estimates

that are upwardly biased. We refer to this bias as ‘ascertainment bias’ since it is caused by

ascertaining only those samples that result in significant association evidence. If the sample

size calculation for a subsequent study is based on an overestimated effect size, replication

studies are likely to be underpowered and so more likely to fail. A review of association

Address for correspondence: Michael Boehnke, Ph.D., Department of Biostatistics, School of Public Health, University of Michigan,

Ann Arbor, Michigan, 48109-2029, Phone: (734) 936-1001, FAX: (734) 615-8322, E-mail: boehnke@umich.edu.

NIH Public Access

Author Manuscript

Genet Epidemiol. Author manuscript; available in PMC 2010 July 1.

Published in final edited form as:

Genet Epidemiol. 2009 July ; 33(5): 453–462. doi:10.1002/gepi.20398.


studies [Ioannidis et al., 2001] has described the overestimation in first positive reports,

consistent with the winner's curse.

This problem has drawn attention from several investigators in the context of genetic linkage

and association studies [Göring et al., 2001; Siegmund, 2002; Allison et al., 2003; Sun and

Bull, 2005; Wu et al., 2006; Garner, 2007; Yu et al., 2007; Zöllner and Pritchard, 2007;

Zhong and Prentice, 2008; Ghosh et al., 2008]. Göring et al. [2001] recommended the use of

two independent datasets: one for locus mapping, the other for parameter estimation. An

obvious disadvantage of this strategy is the power loss due to splitting the sample in two.

Sun and Bull [2005] proposed resampling estimators that employ repeated random sample

splitting of the data via cross-validation or the bootstrap. Wu et al. [2006] compared their

bootstrap estimators for locus-specific quantitative trait linkage analysis, and, in the context

of a two-stage design, Yu et al. [2007] applied a bootstrap estimator to correct for stage 1 bias

and improve sample size estimates for stage 2. Zöllner and Pritchard [2007] used computer

simulation to evaluate the magnitude of the winner's curse effect in case-control studies and

proposed a maximum likelihood method to correct for it. Their method estimates the

frequencies of all genotypes and corresponding penetrance parameters based on a known

population prevalence of the disease under different inheritance models. Garner [2007]

studied the source of the upward bias in the odds ratio estimate in genome-wide association

studies, but did not propose a method to correct for it. Zhong and Prentice [2008] and Ghosh

et al. [2008] recently proposed conditional-likelihood-based methods for point and interval

estimation of the (logarithm of the) odds ratio in the context of logistic regression analysis of

case-control status using genotype categories as a covariate.

In this paper, we take a direct approach to evaluate and correct for the effect of the winner's

curse in the context of case-control genetic association studies. In contrast to previous

simulation-based evaluations, we calculate analytically the impact of the winner's curse on

estimates of the allele frequency difference between cases and controls and the

corresponding odds ratios as a function of sample size, allele frequencies, and statistical

significance level. We then describe a simple ascertainment-corrected maximum likelihood

method to estimate the risk allele frequency difference and odds ratio. Our method is most

similar to that of Zöllner and Pritchard [2007], but in contrast to their method, ours estimates

directly the allele frequency difference or odds ratio, instead of estimating the penetrance

parameters. We compare the performance (bias, standard error, and mean square error

(MSE)) of our ascertainment-corrected maximum likelihood estimators (MLEs) to that of

the naïve, uncorrected estimators. We extend these calculations to two-stage association

studies, in which all markers are genotyped on a set of individuals in Stage 1, and the most

promising markers are followed up by genotyping a second set of individuals in Stage 2.

Consistent with Zöllner and Pritchard [2007], we find that (1) the factors that result in

overestimation of the allele frequency difference can be summarized by study power,

independent of sample size and allele frequency, and that overestimation decreases as power

increases; and (2) compared to the uncorrected estimator of the allele frequency difference,

the ascertainment-corrected estimator results in reduced absolute bias when study power is

low or moderate, and has comparable absolute bias when power is high. Further, we find

that (3) for the logarithm of the odds ratio (ln OR), overestimation can again be summarized

by study power, independent of sample size and allele frequency, and that overestimation

decreases as power increases; (4) compared to the uncorrected estimator, the ascertainment-

corrected MLE of the OR generally results in reduced bias and MSE, and (5) for reasonable

two-stage designs [Skol et al., 2007], results mirror those for the corresponding one-stage

designs. We recommend use of this ascertainment-corrected maximum likelihood method

for estimation of genetic effect size in large-scale genetic association studies.


Methods

I. One-stage design

Model and assumptions—We assume independent samples of N cases and N controls

genotyped at an autosomal disease locus with alleles D and d. Let p and p+δ (δ ≠ 0) denote

the frequency of the risk allele D in controls and cases, respectively. For a complex disease,

we expect the genetic effect size to be small, so that Hardy-Weinberg equilibrium

predictions provide a good approximation to the genotype frequencies in both controls and

cases. Under this assumption, the counts m0 and m1 of the risk allele D in controls and cases

follow independent binomial distributions on 2N trials with probabilities of success p and p

+δ, respectively.

Let X be the standard Pearson chi-square test statistic for association in a 2×2 table of allele

counts in cases and controls. Under the assumption of Hardy-Weinberg equilibrium, X

follows a chi-square distribution with one degree of freedom under the null hypothesis of no

association (δ = 0). We declare an association significant if X exceeds the critical value xα at

significance level α.
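The model and test above can be illustrated with a minimal Python sketch. It assumes NumPy and SciPy are available; the function names `allelic_chi_square` and `simulate_study` and the parameter values in the usage line are hypothetical choices made for illustration, not part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2


def allelic_chi_square(m0, m1, n):
    """Pearson chi-square for the 2x2 table of allele counts (no continuity correction).

    m0, m1: risk-allele counts in controls and cases.
    n: number of individuals per group, so each group contributes 2n alleles.
    """
    risk = m0 + m1                 # column total: risk alleles
    other = 4 * n - risk           # column total: non-risk alleles
    if risk == 0 or other == 0:
        return 0.0                 # degenerate table carries no evidence
    # With equal row totals (2n each) the Pearson statistic reduces to this form.
    return 4.0 * n * (m1 - m0) ** 2 / (risk * other)


def simulate_study(n, p, delta, alpha, rng):
    """Draw one case-control study under the binomial model and test it."""
    m0 = int(rng.binomial(2 * n, p))          # controls
    m1 = int(rng.binomial(2 * n, p + delta))  # cases
    x = allelic_chi_square(m0, m1, n)
    x_alpha = chi2.isf(alpha, df=1)           # critical value of chi-square(1)
    return m0, m1, x, x > x_alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(simulate_study(n=500, p=0.30, delta=0.05, alpha=0.05, rng=rng))
```

The closed form used in `allelic_chi_square` relies on the equal numbers of cases and controls assumed in the text; unequal group sizes would need the general Pearson formula.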

Uncorrected (naïve) maximum likelihood estimators (MLEs)—In practice,

investigators generally estimate the allele frequency difference between cases and controls

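by its MLE $\hat{\delta}_{un} = \hat{p}_1 - \hat{p}_0 = (m_1 - m_0)/(2N)$, or the corresponding odds ratio by
$\widehat{OR}_{un} = \hat{p}_1 (1 - \hat{p}_0) / [\hat{p}_0 (1 - \hat{p}_1)]$, where $\hat{p}_0 = m_0/(2N)$ and
$\hat{p}_1 = m_1/(2N)$. We call these uncorrected MLEs “naïve” because they ignore the bias associated
with focusing on genetic markers with statistically significant association results.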

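To model the impact of the winner's curse, we calculate the expected value of the
uncorrected MLE $\hat{\delta}_{un}$ of the allele frequency difference δ conditional on obtaining
significant evidence for association:

$$E(\hat{\delta}_{un} \mid X > x_\alpha) = \frac{\sum_{(m_0, m_1) \in I} \hat{\delta}_{un}(m_0, m_1)\, P(m_0, m_1 \mid p, \delta, N)}{\sum_{(m_0, m_1) \in I} P(m_0, m_1 \mid p, \delta, N)} \tag{1}$$

and from it the bias of the estimator as $E(\hat{\delta}_{un} \mid X > x_\alpha) - \delta$, and the proportional bias as
$[E(\hat{\delta}_{un} \mid X > x_\alpha) - \delta]/\delta$. Here, $I = \{(m_0, m_1) : X(m_0, m_1) > x_\alpha\}$ is the set of allele count pairs that
result in statistically significant evidence for association and

$$P(m_0, m_1 \mid p, \delta, N) = \binom{2N}{m_0} p^{m_0} (1-p)^{2N-m_0} \binom{2N}{m_1} (p+\delta)^{m_1} (1-p-\delta)^{2N-m_1} \tag{2}$$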

Note that the denominator in (1) is the power to detect association if we genotype the

disease SNP.
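As an illustration of this calculation, the sketch below enumerates every pair of allele counts, applies the chi-square test, and evaluates the conditional expectation, bias, proportional bias, and power. It assumes NumPy and SciPy; the function name `conditional_naive_summary` and the dictionary keys are hypothetical, and the enumeration is only practical for moderate N because it builds a (2N+1) × (2N+1) grid.

```python
import numpy as np
from scipy.stats import binom, chi2


def conditional_naive_summary(n, p, delta, alpha):
    """Exact E(delta_hat_un | X > x_alpha), its bias, and the power, by enumeration."""
    trials = 2 * n
    counts = np.arange(trials + 1)
    # m0[i, j] = i (controls), m1[i, j] = j (cases)
    m0, m1 = np.meshgrid(counts, counts, indexing="ij")

    # Joint probability of (m0, m1): product of two independent binomials.
    prob = np.outer(binom.pmf(counts, trials, p),
                    binom.pmf(counts, trials, p + delta))

    # Pearson chi-square for the 2x2 allele-count table (equal group sizes).
    denom = (m0 + m1) * (2 * trials - m0 - m1)
    with np.errstate(divide="ignore", invalid="ignore"):
        x = np.where(denom > 0, (m1 - m0) ** 2 * 2.0 * trials / denom, 0.0)

    significant = x > chi2.isf(alpha, df=1)    # the set I of significant pairs
    delta_hat = (m1 - m0) / trials             # naive estimate for each pair

    power = prob[significant].sum()            # probability of a significant result
    expectation = (delta_hat * prob)[significant].sum() / power
    bias = expectation - delta
    return {
        "expectation": expectation,
        "bias": bias,
        "proportional_bias": bias / delta,
        "power": power,
    }


if __name__ == "__main__":
    for n in (200, 800):
        print(n, conditional_naive_summary(n, p=0.30, delta=0.03, alpha=0.05))
```

Running the usage lines with a larger N (hence higher power) and the same δ is one way to see the proportional bias shrink as power grows.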

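The standard error of the uncorrected MLE $\hat{\delta}_{un}$ can be calculated as:

$$SE(\hat{\delta}_{un} \mid X > x_\alpha) = \sqrt{E(\hat{\delta}_{un}^2 \mid X > x_\alpha) - [E(\hat{\delta}_{un} \mid X > x_\alpha)]^2} \tag{3}$$

where $E(\hat{\delta}_{un}^2 \mid X > x_\alpha)$ may be calculated by replacing $\hat{\delta}_{un}$ by $\hat{\delta}_{un}^2$ in (1).

We also calculate the absolute bias of $\hat{\delta}_{un}$ as:

$$E(|\hat{\delta}_{un} - \delta| \mid X > x_\alpha) \tag{4}$$

which may be calculated by replacing $\hat{\delta}_{un}$ by $|\hat{\delta}_{un} - \delta|$ in (1).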

Analogous formulae allow us to calculate the conditional bias, standard error, and absolute

bias of the uncorrected MLE of the odds ratio OR, and from the expectation, the

proportional bias of the logarithm of the estimator, $[E(\ln \widehat{OR}_{un} \mid X > x_\alpha) - \ln OR]/\ln OR$.

Ascertainment-corrected MLEs—The naive estimators ignore the fact that we typically

are interested in estimates of the allele frequency difference δ and the odds ratio OR only if

we have strong evidence for association. To address this, we propose an ascertainment-

corrected maximum likelihood method that conditions on obtaining evidence for association.

To this end, we calculate the conditional likelihood function

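$$L(p, \delta \mid X > x_\alpha) = \frac{P(m_0, m_1 \mid p, \delta, N)\, 1\{X > x_\alpha \mid m_0, m_1, N\}}{P(X > x_\alpha \mid p, \delta, N)} \tag{5}$$

where $P(m_0, m_1 \mid p, \delta, N)$ is the joint binomial probability of the observed allele counts,
$P(X > x_\alpha \mid p, \delta, N)$ is the power to detect association (the denominator in (1)), and the
indicator function $1\{X > x_\alpha \mid m_0, m_1, N\}$ equals 1 or 0 depending on whether or not
X > xα.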
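A minimal sketch of how such a conditional likelihood might be evaluated and maximized numerically is given below. It assumes NumPy and SciPy, uses SciPy's Nelder-Mead implementation started from the naïve estimates, and assumes the observed counts are significant and lie strictly between 0 and 2N. The function names `significant_set` and `corrected_mle`, the optimizer tolerances, and the example counts are hypothetical; this is not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom, chi2


def significant_set(n, alpha):
    """Boolean grid marking allele-count pairs (m0, m1) with X > x_alpha."""
    trials = 2 * n
    counts = np.arange(trials + 1)
    m0, m1 = np.meshgrid(counts, counts, indexing="ij")
    denom = (m0 + m1) * (2 * trials - m0 - m1)
    with np.errstate(divide="ignore", invalid="ignore"):
        x = np.where(denom > 0, (m1 - m0) ** 2 * 2.0 * trials / denom, 0.0)
    return x > chi2.isf(alpha, df=1)


def corrected_mle(m0, m1, n, alpha):
    """Maximize the likelihood conditional on significance over (p, delta)."""
    trials = 2 * n
    counts = np.arange(trials + 1)
    sig = significant_set(n, alpha)
    if not sig[m0, m1]:
        raise ValueError("observed counts are not significant at level alpha")
    sig_f = sig.astype(float)

    def neg_log_lik(theta):
        p, delta = theta
        q = p + delta                       # risk allele frequency in cases
        if not (0.0 < p < 1.0 and 0.0 < q < 1.0):
            return np.inf                   # outside the parameter space
        log_joint = binom.logpmf(m0, trials, p) + binom.logpmf(m1, trials, q)
        # Power: total probability of all significant count pairs.
        power = binom.pmf(counts, trials, p) @ sig_f @ binom.pmf(counts, trials, q)
        if power <= 0.0:
            return np.inf
        return -(log_joint - np.log(power))

    start = np.array([m0 / trials, (m1 - m0) / trials])   # naive MLEs
    fit = minimize(neg_log_lik, start, method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 2000})
    p_as, delta_as = fit.x
    # Invariance of the MLE: plug the estimates into the allelic odds ratio.
    or_as = (p_as + delta_as) * (1 - p_as) / (p_as * (1 - p_as - delta_as))
    return p_as, delta_as, or_as


if __name__ == "__main__":
    # Borderline-significant data: naive delta estimate is 30/400 = 0.075.
    print(corrected_mle(m0=120, m1=150, n=200, alpha=0.05))
```

For borderline-significant counts such as those in the usage line, the corrected estimate of δ is pulled toward zero relative to the naïve estimate; for overwhelmingly significant counts the power term is nearly constant and the two estimates nearly coincide.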

We maximize L(p, δ | X > xα) as a function of p and δ to obtain the ascertainment-corrected

MLEs pˆas and δˆas by using the Nelder-Mead [1965] simplex method. We calculate the

empirical standard errors of these estimators based on 1000 simulation replicates, and the

asymptotic-theory standard errors by calculating the observed information matrix (see

Appendix) evaluated at the parameter estimates:

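$$I(\hat{p}_{as}, \hat{\delta}_{as}) = -\left.\begin{pmatrix} \dfrac{\partial^2 \ln L}{\partial p^2} & \dfrac{\partial^2 \ln L}{\partial p\, \partial \delta} \\ \dfrac{\partial^2 \ln L}{\partial \delta\, \partial p} & \dfrac{\partial^2 \ln L}{\partial \delta^2} \end{pmatrix}\right|_{p = \hat{p}_{as},\ \delta = \hat{\delta}_{as}} \tag{6}$$

where $L = L(p, \delta \mid X > x_\alpha)$. The covariance matrix for $\hat{p}_{as}$ and $\hat{\delta}_{as}$ can be approximated by $I^{-1}(\hat{p}_{as}, \hat{\delta}_{as})$. We take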

advantage of the invariance property of the MLE to calculate the ascertainment-corrected

MLE for the odds ratio, and apply the delta method [Rao, 1965] to obtain its standard error.

We calculate the mean square error (MSE) for the estimators by taking the sum of the

variance and the squared bias of the estimator.
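For concreteness, the invariance and delta-method steps can be written out as below. This is a sketch under the assumption that the allelic odds ratio is parameterized directly in terms of p and δ and that the delta method is applied on the log scale; the gradient vector g is notation introduced here, and the Appendix derivation may be organized differently.

```latex
% Invariance: plug the corrected MLEs into the allelic odds ratio.
\widehat{OR}_{as} =
  \frac{(\hat{p}_{as} + \hat{\delta}_{as})(1 - \hat{p}_{as})}
       {\hat{p}_{as}\,(1 - \hat{p}_{as} - \hat{\delta}_{as})}

% Gradient of ln OR = ln(p + delta) + ln(1 - p) - ln p - ln(1 - p - delta).
g = \begin{pmatrix}
      \dfrac{\partial \ln OR}{\partial p} \\[2ex]
      \dfrac{\partial \ln OR}{\partial \delta}
    \end{pmatrix}
  = \begin{pmatrix}
      \dfrac{1}{p + \delta} - \dfrac{1}{1 - p} - \dfrac{1}{p} + \dfrac{1}{1 - p - \delta} \\[2ex]
      \dfrac{1}{p + \delta} + \dfrac{1}{1 - p - \delta}
    \end{pmatrix}

% Delta method, with g and the information matrix evaluated at the corrected MLEs.
\widehat{\mathrm{Var}}\bigl(\ln \widehat{OR}_{as}\bigr) \approx
  g^{\top} I^{-1}(\hat{p}_{as}, \hat{\delta}_{as})\, g,
\qquad
\widehat{SE}\bigl(\widehat{OR}_{as}\bigr) \approx
  \widehat{OR}_{as}\,\sqrt{g^{\top} I^{-1}(\hat{p}_{as}, \hat{\delta}_{as})\, g}

% Mean square error as the sum of variance and squared bias.
MSE(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \bigl[E(\hat{\theta}) - \theta\bigr]^2
```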


II. Two-stage design

Model and assumptions—We next consider two-stage association studies, in which N1

cases and N1 controls are genotyped for all markers, and only the most promising markers

are genotyped in the second stage in an additional N2 cases and N2 controls. Let pi and δi be

the risk allele frequencies in controls and the allele frequency difference between cases and

controls in stage i. Given genetic homogeneity between stages 1 and 2, p1 = p2 = p and δ1 =

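δ2 = δ. At each stage, we calculate the association test statistic using the data only from that
stage:

$$Z_i = \frac{\hat{p}_{i1} - \hat{p}_{i0}}{\sqrt{[\hat{p}_{i1}(1 - \hat{p}_{i1}) + \hat{p}_{i0}(1 - \hat{p}_{i0})]/(2N_i)}} \tag{7}$$

where $\hat{p}_{i0}$ and $\hat{p}_{i1}$ are the naïve MLEs of the risk allele frequencies in controls and cases
respectively at stage i, $\hat{p}_{i0} = m_{i0}/(2N_i)$ and $\hat{p}_{i1} = m_{i1}/(2N_i)$, with $m_{i0}$ and $m_{i1}$ the
corresponding risk allele counts. Under the null hypothesis of no disease-marker
association (δ = 0), the association test statistic $Z_i$ follows a standard normal distribution.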

We employ a joint analysis strategy for this two-stage study [Satagopan et al., 2002; Skol et

al., 2006] by calculating

$$Z_{12} = \sqrt{\pi_{sample}}\, Z_1 + \sqrt{1 - \pi_{sample}}\, Z_2 \tag{8}$$

where πsample = N1/(N1+N2) is the proportion of individuals genotyped in Stage 1. We claim

significant association when both |Z1| and |Z12| exceed the relevant critical values C1 and
C12 in joint analysis. C1 is calculated so that P(|Z1| > C1) = πmarker, where πmarker is the
proportion of markers to be genotyped in Stage 2, and C12 by finding the threshold so that
P(|Z1| > C1, |Z12| > C12) = P(|Z12| > C12 | |Z1| > C1) × P(|Z1| > C1) results in the desired
significance level [Skol et al., 2006].
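The threshold search can be sketched numerically as follows. The sketch assumes that, under the null hypothesis, Z1 and Z2 are independent standard normal variables and that Z12 is the weighted sum of Z1 and Z2 with weights sqrt(πsample) and sqrt(1 − πsample); it also assumes α < πmarker and πsample < 1. The function names and the example design values are hypothetical, and one-dimensional quadrature is just one convenient way to evaluate the joint probability.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm


def joint_null_probability(c1, c12, pi_sample):
    """P(|Z1| > c1, |Z12| > c12) under the null, by integrating over Z1."""
    a, b = np.sqrt(pi_sample), np.sqrt(1.0 - pi_sample)

    def integrand(z1):
        # Given Z1 = z1, Z12 = a*z1 + b*Z2 is normal with mean a*z1 and sd b.
        tail = norm.sf((c12 - a * z1) / b) + norm.cdf((-c12 - a * z1) / b)
        return norm.pdf(z1) * tail

    # Relative tolerance only, so that very small probabilities stay accurate.
    upper, _ = quad(integrand, c1, np.inf, epsabs=0.0, epsrel=1e-9)
    return 2.0 * upper   # the region z1 < -c1 contributes equally by symmetry


def two_stage_thresholds(alpha, pi_marker, pi_sample):
    """Critical values (C1, C12) for a joint analysis at overall level alpha."""
    c1 = norm.isf(pi_marker / 2.0)   # P(|Z1| > C1) = pi_marker under the null
    c12 = brentq(
        lambda c: joint_null_probability(c1, c, pi_sample) - alpha, 0.0, 10.0
    )
    return c1, c12


if __name__ == "__main__":
    # Hypothetical design: follow up 10% of markers, half the sample in Stage 1.
    print(two_stage_thresholds(alpha=1e-3, pi_marker=0.10, pi_sample=0.5))
```

Because the joint probability can never exceed P(|Z12| > C12), the resulting C12 is smaller than the critical value a one-stage analysis of the combined sample would use at the same α.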

Uncorrected (naïve) MLEs—The uncorrected MLE of the risk allele frequency

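difference for the two-stage design is $\hat{\delta}_{12} = \pi_{sample} \hat{\delta}_1 + (1 - \pi_{sample}) \hat{\delta}_2$, where
$\hat{\delta}_i = \hat{p}_{i1} - \hat{p}_{i0}$, i = 1, 2. The bias of the uncorrected MLE $\hat{\delta}_{12}$ can be calculated exactly as for the one-stage
design by formula (1) and similarly the proportional bias. However, exact calculation
becomes computationally difficult when N1 or N2 is large, so we simulated n = 1000 datasets
satisfying |Z1| > C1 and |Z12| > C12 and approximated the expectation and empirical standard
error of $\hat{\delta}_{12}$ by calculating the mean and the standard error of the uncorrected MLE of the n
simulated datasets:

$$\hat{E}(\hat{\delta}_{12}) = \frac{1}{n} \sum_{j=1}^{n} \hat{\delta}_{12}^{(j)}, \qquad \widehat{SE}(\hat{\delta}_{12}) = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} \bigl[\hat{\delta}_{12}^{(j)} - \hat{E}(\hat{\delta}_{12})\bigr]^2} \tag{9}$$

where $\hat{\delta}_{12}^{(j)}$ is the uncorrected MLE from the jth simulated dataset.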
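The simulation step can be sketched as a rejection sampler: draw two-stage datasets, keep those that pass both thresholds, and summarize the retained naïve estimates. The sketch assumes NumPy, an unpooled-variance Z statistic at each stage, Z12 formed as the weighted sum of Z1 and Z2 with weights sqrt(πsample) and sqrt(1 − πsample), and a sample standard deviation with divisor n − 1. The function names, batch size, thresholds, and design values are hypothetical.

```python
import numpy as np


def stage_stats(m0, m1, n):
    """Naive allele-frequency difference and Z statistic for one stage."""
    p0, p1 = m0 / (2 * n), m1 / (2 * n)
    var = (p0 * (1 - p0) + p1 * (1 - p1)) / (2 * n)   # unpooled variance (assumed form)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(var > 0, (p1 - p0) / np.sqrt(var), 0.0)
    return p1 - p0, z


def simulate_two_stage(n1, n2, p, delta, c1, c12,
                       n_keep=1000, seed=0, batch=50_000, max_batches=200):
    """Mean and SD of the naive two-stage estimate among significant datasets."""
    rng = np.random.default_rng(seed)
    pi_sample = n1 / (n1 + n2)
    kept, total = [], 0
    for _ in range(max_batches):
        d1, z1 = stage_stats(rng.binomial(2 * n1, p, batch),
                             rng.binomial(2 * n1, p + delta, batch), n1)
        d2, z2 = stage_stats(rng.binomial(2 * n2, p, batch),
                             rng.binomial(2 * n2, p + delta, batch), n2)
        z12 = np.sqrt(pi_sample) * z1 + np.sqrt(1 - pi_sample) * z2
        d12 = pi_sample * d1 + (1 - pi_sample) * d2
        sig = (np.abs(z1) > c1) & (np.abs(z12) > c12)   # both thresholds passed
        kept.append(d12[sig])
        total += int(sig.sum())
        if total >= n_keep:
            break
    else:
        raise RuntimeError("too few significant datasets; power may be very low")
    est = np.concatenate(kept)[:n_keep]
    return est.mean(), est.std(ddof=1)


if __name__ == "__main__":
    mean_est, sd_est = simulate_two_stage(500, 500, p=0.30, delta=0.03,
                                          c1=1.645, c12=2.5)
    print(mean_est, sd_est)
```

Rejection sampling becomes slow when power is very low, which is why the sketch caps the number of batches instead of looping indefinitely.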
