
BioMed Central


BMC Genetics

Open Access

Research article

Power analysis for genome-wide association studies

Robert J Klein

Address: Program in Cancer Biology and Genetics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY, USA

Email: Robert J Klein - kleinr@mskcc.org

Abstract

Background: Genome-wide association studies are a promising new tool for deciphering the

genetics of complex diseases. To choose the proper sample size and genotyping platform for such

studies, power calculations that take into account genetic model, tag SNP selection, and the

population of interest are required.

Results: The power of genome-wide association studies can be computed using a set of tag SNPs

and a large number of genotyped SNPs in a representative population, such as those available through the

HapMap project. As expected, power increases with increasing sample size and effect size. Power

also depends on the tag SNPs selected. In some cases, more power is obtained by genotyping more

individuals at fewer SNPs than fewer individuals at more SNPs.

Conclusion: Genome-wide association studies should be designed thoughtfully, with the choice

of genotyping platform and sample size being determined from careful power calculations.

Background

One goal of modern human genetics is to identify the

genetic variants that predispose individuals to develop

common, complex diseases. It has been proposed that

population-based association studies will be more power-

ful than traditional family-based linkage methods in iden-

tifying such high-frequency, low-penetrance alleles [1].

Such studies require the genotypes of a large number of polymorphisms (usually single nucleotide polymorphisms

[SNPs]) across the genome, each of which is tested for

association with the phenotype of interest. As originally

proposed, this would be a direct test of association, in

which the functional mutation is presumed to be geno-

typed. An alternate approach to association studies takes

advantage of the correlation between SNPs, called linkage

disequilibrium (LD), that can occur due to the genealogi-

cal history of the polymorphisms [2]. In this approach,

often called indirect association, one SNP is genotyped

and used to infer indirectly the genotypes at other SNPs

with which it is in high LD [3]. As one genotyped SNP,

called a "tag" SNP, can be in LD with numerous other

SNPs, far fewer SNPs (10^5 – 10^6) would need to be genotyped to capture the common variation in the genome

[3]. Recent advances in genotyping technology make such

studies feasible [4,5] and the first results of such studies

are being published [6-10].

One key question in designing such studies is the choice

of tag SNPs. Numerous methods for choosing the best set

of tagging SNPs have been developed and compared [11].

One common measure evaluates the pairwise LD, measured by r^2, between the tag SNPs and all other SNPs [12]. The value r^2 is the squared correlation coefficient between two SNPs. It is a useful measure because, if N individuals are needed for a specific power with a direct test of association, N/r^2 individuals would be needed for an indirect test of association [2]. Sets of tag SNPs are usually compared by their

"coverage," or fraction of variants in the genome that are

Published: 28 August 2007

BMC Genetics 2007, 8:58doi:10.1186/1471-2156-8-58

Received: 22 November 2006

Accepted: 28 August 2007

This article is available from: http://www.biomedcentral.com/1471-2156/8/58

© 2007 Klein; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),

which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


in LD (r^2 above some threshold) with at least one tag [12-15].
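The N/r^2 relation quoted above can be illustrated with a short calculation. This is only a sketch of that single relation; the function name `indirect_sample_size` and the example values are assumptions made for illustration and do not come from the article.

```python
import math

def indirect_sample_size(n_direct: int, r2: float) -> int:
    """Individuals needed by an indirect test to match a direct test on n_direct.

    Applies the N / r^2 relation; r2 is the squared correlation between the
    genotyped tag SNP and the untyped SNP of interest.
    """
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in the interval (0, 1]")
    # The small tolerance guards against floating-point noise before rounding up.
    return math.ceil(n_direct / r2 - 1e-9)

# Hypothetical example: a direct test would need 1000 individuals.
for r2 in (1.0, 0.75, 0.5):
    print(f"r^2 = {r2:.2f}: {indirect_sample_size(1000, r2)} individuals")
```

A tag with r^2 = 0.75 therefore still supports detection; it simply requires about a third more individuals than a direct test.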

There are two related problems with this measure of coverage. First, the binary decision of whether r^2 is above or below a threshold does not capture the continuous decrease in power as r^2 decreases. If the cutoff value of r^2 is 0.8, a SNP that shows LD of r^2 = 0.75 with a tag would be called

undetectable since the measure of LD is below the thresh-

old. In truth, association would be detectable, albeit with

reduced power. Second, knowledge of the coverage of a set

of tag SNPs says nothing about the number of individuals

needed for a well-powered study. A better measure to eval-

uate tag SNPs would be an explicit calculation of the prob-

ability that a genome-wide association study will find a

statistically significant association given that such an asso-

ciation exists (i.e., power). To solve this problem, one

needs to be able to calculate the power of a study given a

specified genetic model and sample size. Skol et al. have

proposed a method for computing power, though they

were concerned with issues of study design rather than tag

SNP choice [16]. Jorgenson and Witte, who noted the

same problems, proposed a "cumulative r^2 adjusted power"

that integrates LD and tag SNP information to provide the

overall power of a study [17].

Realistically, one does not have an unlimited choice of

SNPs but rather chooses among several competing com-

mercial products with fixed sets of tag SNPs. Therefore,

instead of choosing a set of tag SNPs, a more common

problem now is how to evaluate which of several fixed sets

of tag SNPs is better for a particular study. Several papers

have looked at power for hypothetical and commercial

sets of tag SNPs through empirical simulations on a subset

of chromosomal regions [13,18]. This approach suffers

from both the speed problem of empirical simulations

and the assumption that the sampled regions are repre-

sentative of the genome as a whole. What is needed is an

application of explicit power calculation methods (such

as that of Jorgenson and Witte [17]) to the commercially

available sets of tag SNPs to allow comparison among

products and power calculation for real studies.

Here, I present a method for computing the power of a

genome-wide association study when a genetic model and

sample size are specified and LD information is available

for the population being studied. This method is equivalent to the cumulative r^2 adjusted power of Jorgenson and

Witte [17], which will be referred to as "power" for brev-

ity. I show that to obtain the best power, different com-

mercial genotyping products should be used for different

populations. I further find that power is sometimes

improved by genotyping more individuals at fewer SNPs

rather than fewer individuals at more SNPs. These calcula-

tions can guide the optimal design of future genome-wide

association studies.

Results and discussion

The power calculations require genotype data on a large

representative sample of common SNPs from the popula-

tion as well as a list of which of these representative SNPs

are the tag SNPs (SNPs to be genotyped). Power is com-

puted in three steps. First the best tag SNP for each of the

representative SNPs is found. Then, the power for detect-

ing association for each of the representative SNPs assum-

ing that SNP directly influences the phenotype is

computed. For this computation, it is assumed that the

study will be performed by testing for genotype frequency

differences between cases and controls using a two-degree-of-freedom χ^2 test in which multiple tests are corrected for

using the Bonferroni correction. This test explicitly

assumes a codominant model. I use this test because it is

the most general, at the cost of reduced power relative to

a model-specific test. While a multimarker tagging

approach could be taken [13], this added level of complexity is not usually included in a first-pass analysis of genome-wide association data; including it in the power calculation would therefore inflate the power one

might expect in real-world application of genome-wide

association studies. Finally, the average power over all the

SNPs is taken to be the power of the study.
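The three steps above can be sketched as follows. This is a minimal illustration, not the article's implementation: it assumes that the noncentrality parameter of the indirect test equals r^2 times that of the direct test (a consequence of the N/r^2 relation), and the names `power_2df`, `best_tag_r2`, `study_power` and all input values are hypothetical.

```python
import math

def power_2df(ncp: float, alpha: float) -> float:
    """Power of a 2-df chi-square test at level alpha for noncentrality ncp."""
    half_crit = -math.log(alpha)  # the 2-df critical value x solves exp(-x/2) = alpha
    half_ncp = ncp / 2.0
    if half_ncp > 700.0:          # exp(-half_ncp) would underflow; treat power as 1
        return 1.0
    # A noncentral chi-square is a Poisson(ncp/2) mixture of central chi-squares
    # with 2 + 2j df, whose survival is exp(-x/2) * sum_{k<=j} (x/2)^k / k!.
    weight = math.exp(-half_ncp)  # Poisson probability of j = 0
    term = alpha                  # exp(-x/2) * (x/2)^k / k! at k = 0
    survival = term
    power, j = 0.0, 0
    while j <= half_ncp or weight > 1e-17:
        power += weight * survival
        j += 1
        weight *= half_ncp / j
        term *= half_crit / j
        survival += term
    return min(power, 1.0)

def best_tag_r2(r2_with_tags: dict) -> dict:
    """Step 1: highest r^2 between each representative SNP and any tag SNP."""
    return {snp: max(r2s, default=0.0) for snp, r2s in r2_with_tags.items()}

def study_power(direct_ncp: dict, r2_with_tags: dict, n_tags: int,
                alpha: float = 0.05) -> float:
    best = best_tag_r2(r2_with_tags)
    level = alpha / n_tags        # Bonferroni correction over the tag SNPs
    # Step 2: per-SNP power, using the assumed scaling ncp_indirect = r^2 * ncp_direct.
    per_snp = [power_2df(best[s] * direct_ncp[s], level) for s in best]
    # Step 3: average over all representative SNPs.
    return sum(per_snp) / len(per_snp)

# Hypothetical inputs: snpA is itself a tag, snpB is tagged at r^2 = 0.85, snpC is untagged.
r2_with_tags = {"snpA": [1.0], "snpB": [0.3, 0.85], "snpC": []}
direct_ncp = {"snpA": 40.0, "snpB": 40.0, "snpC": 40.0}
print(study_power(direct_ncp, r2_with_tags, n_tags=500_000))
```

An untagged SNP contributes only the significance level itself to the average, which is why poorly covered regions pull the overall power down.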

Taking the average power over all the SNPs is justified using probability theory. Assume there are $N$ SNPs present in a given population, each one represented as $S_i$. Let $C_i$ represent SNP $i$ being causative, and $D_i$ represent SNP $i$ being detected. Assume that one of these SNPs is the causative SNP, but it is unknown which of these is the causative SNP. Then the overall power of the study is given by $\sum_{i=1}^{N} \Pr(C_i, D_i)$. The power computed for a specific SNP $S_i$ is given as $P_i = \Pr(D_i \mid C_i)$. Thus, if each $P_i$ is multiplied by $\Pr(C_i)$, we get

\[ \Pr(C_i, D_i) = \Pr(D_i \mid C_i)\,\Pr(C_i) = P_i \Pr(C_i) \]

\[ \text{Power} = \sum_{i=1}^{N} P_i \Pr(C_i) \]

The added assumption that each SNP is equally likely to be causative yields

\[ \Pr(C_i) = \frac{1}{N}, \qquad \text{Power} = \sum_{i=1}^{N} P_i \frac{1}{N} = \frac{1}{N} \sum_{i=1}^{N} P_i \]


This final equation is the same as taking the average power

over all the SNPs.

This method was applied to examine the power of

genome-wide association studies in the four populations

studied in the International HapMap Project [19]. I exam-

ined the performance of the tag SNPs provided by the

major high-density genotyping platforms available com-

mercially: 100 K and 500 K SNP sets from Affymetrix and

300 K and 550 K SNP sets from Illumina. (Since then,

more products have come on the market; the same

approach can be taken with them.) I first asked how many

SNPs on each of these arrays would be useful for studying

a given population by asking what percentage of tag SNPs

provided by each platform are common (minor allele fre-

quency > 5%) in each of the four HapMap populations

(Table 1). The largest fraction of common SNPs is found

when the Illumina chip is used in the CEU population. As

the Illumina chip was designed to optimize coverage of

the CEU population, this result is unsurprising.
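The tally of common SNPs per platform can be sketched as below. This is an illustration under stated assumptions: the denominator is taken to be every SNP on the chip, SNPs absent from the population data are counted as not common, the MAF cutoff is inclusive, and the function name and SNP identifiers are hypothetical.

```python
def common_fraction(chip_snps, maf_by_snp, threshold=0.05):
    """Count chip SNPs whose minor allele frequency is at least `threshold`.

    Returns (number of common SNPs, fraction of all chip SNPs).
    SNPs missing from maf_by_snp are treated as not common (an assumption).
    """
    chip = list(chip_snps)
    n_common = sum(1 for snp in chip if maf_by_snp.get(snp, 0.0) >= threshold)
    return n_common, n_common / len(chip)

# Hypothetical chip content and population allele frequencies.
chip = ["rs1", "rs2", "rs3", "rs4"]
maf_in_population = {"rs1": 0.20, "rs2": 0.01, "rs3": 0.05}
print(common_fraction(chip, maf_in_population))
```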

I next asked how power changes with increasing sample

size for the various genotyping platforms (Figure 1), pop-

ulations, and models. For all sets of tag SNPs, as expected,

power increases both as the sample size increases and as

the magnitude of effect, measured by the genotype relative

risk (GRR), increases. While Figure 1 only shows this data

for a multiplicative model in the CEU population, simi-

larly shaped curves were observed in the other popula-

tions and for other models [see Additional file 1]. In the

Affymetrix 500 K and Illumina 300 K SNP sets, the slope

of the power curve starts leveling off (approaching zero)

with a few thousand individuals when GRR is more than

1.5. For smaller GRRs, the sample sizes required for adequate (at least 50%) power become quite large.
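How sample size and GRR enter a power calculation can be sketched through the noncentrality parameter of the genotypic test. The sketch below is an assumption-laden illustration rather than the article's procedure: it assumes Hardy-Weinberg proportions in the general population, relative risks of 1, GRR and GRR^2 for the multiplicative model, a known disease prevalence, and the usual large-sample approximation for a 2 × 3 case-control table. All function names and example values are hypothetical.

```python
RELATIVE_RISKS = {
    "multiplicative": lambda g: (1.0, g, g * g),
    "dominant": lambda g: (1.0, g, g),
    "recessive": lambda g: (1.0, 1.0, g),
}

def case_control_freqs(p, grr, prevalence, model="multiplicative"):
    """Genotype frequencies (0, 1, 2 risk alleles) in cases and in controls."""
    hwe = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
    rr = RELATIVE_RISKS[model](grr)
    # Baseline penetrance chosen so the population prevalence is matched.
    baseline = prevalence / sum(h * r for h, r in zip(hwe, rr))
    penetrance = [baseline * r for r in rr]
    if max(penetrance) > 1.0:
        raise ValueError("GRR and prevalence imply a penetrance above 1")
    cases = [h * f / prevalence for h, f in zip(hwe, penetrance)]
    controls = [h * (1 - f) / (1 - prevalence) for h, f in zip(hwe, penetrance)]
    return cases, controls

def genotypic_ncp(cases, controls, n_cases, n_controls):
    """Approximate noncentrality of the 2-df genotypic chi-square test."""
    total = n_cases + n_controls
    ncp = 0.0
    for pc, pu in zip(cases, controls):
        pooled = (n_cases * pc + n_controls * pu) / total  # expected under no association
        ncp += n_cases * (pc - pooled) ** 2 / pooled
        ncp += n_controls * (pu - pooled) ** 2 / pooled
    return ncp

# Hypothetical scenario: risk allele frequency 0.3, prevalence 5%, equal cases and controls.
for grr in (1.25, 1.5, 2.0):
    cases, controls = case_control_freqs(0.3, grr, 0.05)
    print(grr, [round(genotypic_ncp(cases, controls, n, n), 1) for n in (1000, 2000, 4000)])
```

Under this approximation the noncentrality grows linearly with sample size at a fixed case-to-control ratio and grows with the GRR, which is consistent with the qualitative shape of the power curves described above.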

One critique of this approach is that the non-specific test

used may not be the most powerful approach if we know

the genetic model the disease follows. For instance, to

study a trait that we believe follows a multiplicative

model, a 2 × 2 contingency table to test for allelic association may be more appropriate. Power calculations for this test (Figure 2) show that the relative pattern is the same as for a test of genotypic association, but the power is generally increased when an allelic test is used instead of a

genotypic test. Similar power calculations can be done if

one wants to use an explicit test for a dominant or reces-

sive mode of inheritance. However, as can be seen in this

comparison between the Affymetrix 500 K and Illumina

550 K genotyping systems, choice of SNPs and sample size

can play a bigger role in determining power than choice of

test. For the specified GRR of 1.5, the Illumina 550 K sys-

tem with a genotypic test is more powerful than the

Affymetrix 500 K system when sample size is greater than

2000 individuals (Figure 2).
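The power of a one-degree-of-freedom allelic test can be sketched with a normal approximation for the difference in allele frequency between cases and controls. This is an illustrative approximation, not the calculation behind Figure 2. The case allele frequency formula holds for a multiplicative model under Hardy-Weinberg proportions; using the population frequency for controls assumes a rare disease. The function names and numbers are hypothetical.

```python
import math
from statistics import NormalDist

def case_allele_freq(p: float, grr: float) -> float:
    """Risk allele frequency in cases under a multiplicative model with HWE."""
    return p * grr / (p * grr + (1.0 - p))

def allelic_power(p_case, p_control, n_cases, n_controls, alpha):
    """Approximate power of a two-sided allelic (2 x 2) test at level alpha."""
    total = n_cases + n_controls
    pooled = (n_cases * p_case + n_controls * p_control) / total
    # Each individual contributes two alleles to the 2 x 2 table.
    se = math.sqrt(pooled * (1 - pooled) * (1 / (2 * n_cases) + 1 / (2 * n_controls)))
    shift = abs(p_case - p_control) / se
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)

# Hypothetical scenario: allele frequency 0.3, GRR 1.5, Bonferroni over 500,000 tests.
p = 0.3
level = 0.05 / 500_000
for n in (1000, 2000, 4000):
    print(n, round(allelic_power(case_allele_freq(p, 1.5), p, n, n, level), 3))
```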

Another possible criticism of this method is that the SNPs

genotyped as part of the International HapMap Project

may not be a representative subset of the common SNPs

in the genome as a whole. To investigate this possibility, I

compared the coverage of the various SNPs in the

ENCODE and non-ENCODE regions from the HapMap

project (Figure 3). Since the ENCODE regions of the Hap-

Map project were completely resequenced in a subset of

48 individuals, I hypothesized that almost all common

(minor allele frequency >5%) variants would have been

identified in that region. If the SNPs genotyped as part of

the HapMap are a representative subset of all of the com-

mon SNPs, then the coverage of an arbitrary set of tag

SNPs should be equal for the two data sets. Assuming tag

SNPs were chosen similarly for the ENCODE and non-

ENCODE regions, relying on the HapMap data slightly

overestimates r^2 with the tag SNPs and therefore could

slightly inflate the power estimation. As the fraction of


Table 1: The number of SNPs present in each population and present in each commercial genotyping system

Population                                  CEU              JPT+CHB          YRI
SNPs in HapMap                              3868157          3890416          3796934
SNPs w/MAF >= 0.05 (%)                      2230515 (58%)    2046163 (53%)    2477182 (65%)
Common SNPs on Affy 100 K chip (%)          91400 (79%)      82995 (72%)      91363 (79%)
Common SNPs on Affy 500 K chip (%)          378415 (77%)     346887 (70%)     409849 (83%)
Common SNPs on Illumina 300 K chip (%)      313265 (99%)     251560 (79%)     252678 (80%)
Common SNPs on Illumina 550 K chip (%)      506543 (91%)     425631 (77%)     441884 (80%)

The percentages given are the fraction of SNPs from the overall SNP set, and from each of the genotyping platforms, that are present with a MAF of at least 0.05 in each population.


SNPs with an r^2 greater than the cutoff differs between the

ENCODE and non-ENCODE regions by at most ten per-

centage points, and an average of three percentage points,

this overestimation is not likely to be extreme.
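The coverage comparison between two sets of regions can be sketched as follows. The function names and r^2 values are hypothetical and serve only to show how a maximum and an average gap in percentage points might be computed from per-SNP best-tag r^2 values.

```python
def coverage(best_r2, threshold):
    """Fraction of SNPs whose best r^2 with any tag exceeds the threshold."""
    return sum(1 for r in best_r2 if r > threshold) / len(best_r2)

def coverage_gap(encode_r2, non_encode_r2, thresholds):
    """Largest and mean absolute coverage difference, in percentage points."""
    gaps = [
        abs(coverage(non_encode_r2, t) - coverage(encode_r2, t)) * 100
        for t in thresholds
    ]
    return max(gaps), sum(gaps) / len(gaps)

# Hypothetical best-tag r^2 values for SNPs in the two kinds of region.
encode = [0.95, 0.40, 0.82, 0.10, 0.77]
non_encode = [0.97, 0.55, 0.85, 0.30, 0.81]
print(coverage_gap(encode, non_encode, thresholds=[0.2, 0.5, 0.8]))
```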

An easy and useful way to compare the power of different

tag SNP sets in different populations is the sample size

needed to achieve 80% power. The Illumina 550 K clearly

performs best in all three populations (Figure 4). For the

CEU population, the Illumina 300 K outperforms the

Affymetrix 500 K, while in the other two populations the

Affymetrix 500 K is better. This is not surprising, as the

Illumina chips were optimized on CEU HapMap data. As

the Affymetrix 500 K set is really two independent 250 K

sets, I also looked at the power of each 250 K set individ-

ually. While the complete 500 K set of SNPs has more

power than either half, the number of individuals

required for 80% power using one half of the set is always less than twice the number required for the full set. This means that

in cases when the number of chips that can be run rather

than number of available samples is the limiting factor, it

might make more sense to genotype more individuals

using only one chip than to genotype fewer individuals

using both chips. To test this hypothesis, I plotted power

versus the number of chips needed for the components of

the Affymetrix 500 K system (Figure 5). The number of

chips is simply the sample size for Nsp and Sty alone, and

twice the sample size for the Nsp+Sty combination. Except

in cases where power gets very high due to a large GRR

and/or sample size, for a constant number of chips using

only one of Nsp or Sty on more individuals provides a

more powerful study.
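The trade-off for a fixed number of chips can be illustrated with the N/r^2 relation, treating n × r^2 as an effective sample size for each SNP. This is a rough proxy only: it ignores the different Bonferroni thresholds of the two designs and does not replace a full power calculation. The function name and r^2 values are hypothetical.

```python
def mean_effective_n(n_chips, arrays_per_individual, best_r2):
    """Mean effective sample size (n * r^2) over SNPs for a fixed chip budget."""
    n_individuals = n_chips // arrays_per_individual
    return n_individuals * sum(best_r2) / len(best_r2)

# Hypothetical best-tag r^2 for the same SNPs with one array versus both arrays.
r2_one_array = [0.60, 0.90, 0.75, 0.40]
r2_both_arrays = [0.90, 1.00, 0.85, 0.70]

budget = 2000  # chips available
one = mean_effective_n(budget, 1, r2_one_array)     # 2000 individuals, one array each
both = mean_effective_n(budget, 2, r2_both_arrays)  # 1000 individuals, two arrays each
print(one, both)
```

Under this proxy, a single array is preferable for a given SNP whenever its best r^2 with that array exceeds half of its best r^2 with both arrays combined.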


Figure 1

Power for the test of genotypic association as a function of sample size at different genotype relative risks (GRR). All panels are

for the CEU HapMap population when the number of cases equals the number of controls and a multiplicative model is used.

(A) Power for the Affymetrix 100 K system. (B) Power for the Illumina 300 K system. (C) Power for the Affymetrix 500 K

system. (D) Power for the Illumina 550 K system.

[Figure 1 plots: panels A (Affymetrix 100K), B (Illumina 300K), C (Affymetrix 500K), and D (Illumina 550K); x-axis: sample size (0 to 10000); y-axis: power (0 to 1); one curve each for GRR 1.25, 1.5, 1.75, 2, and 2.5.]


I have presented a method to compute the power of a

genome-wide association study in which a fixed set of tag

SNPs will be genotyped. For the sake of simplicity, I only

considered one straightforward single-SNP analysis

scheme. While this approach has been used successfully

[6], others have suggested that greater power can be

obtained by looking at multiple tags or haplotypes

[18,20]. This method for computing power can be

adapted to such strategies provided it is possible to com-

pute the power of detecting each SNP in the population

given the set of tagging SNPs. I also assume that each SNP

is equally likely to be functional. If we knew a priori the

probability that a given SNP is functional, we could use

this to weight the average power over all the SNPs. Such a

weighting scheme would prioritize SNPs more likely to be

of interest because of either functional considerations or

location [21]. For instance, assume we assigned each SNP

a probability of being the causative SNP based on external

evidence such as a prior linkage study. If these probabili-

ties are normalized to sum to one, they can be used to

compute a weighted average power in this approach.
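The weighted average described here can be sketched in a few lines. The function name and the example priors are hypothetical; with no priors supplied, the sketch reduces to the unweighted average used in the rest of the analysis.

```python
def weighted_power(powers, priors=None):
    """Average per-SNP power weighted by each SNP's probability of being causative.

    Priors are normalized to sum to one; if omitted, every SNP is weighted
    equally and the result is the plain mean of the per-SNP powers.
    """
    if priors is None:
        priors = [1.0] * len(powers)
    if len(priors) != len(powers):
        raise ValueError("powers and priors must have the same length")
    total = sum(priors)
    if total <= 0:
        raise ValueError("priors must have a positive sum")
    return sum(p * w for p, w in zip(powers, priors)) / total

# Hypothetical per-SNP powers, with the second SNP three times as likely to be causative.
print(weighted_power([0.2, 0.8]))
print(weighted_power([0.2, 0.8], priors=[1.0, 3.0]))
```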

Conclusion

Proper design of a genome-wide association study

requires careful calculation of the power. These calcula-

tions will be invaluable to anyone who is planning a

genome-wide association study. Using these calculations,

the proper sample size to get adequate power in a given

study can be computed. Furthermore, the performance of

different genotyping platforms can be compared, allow-

ing an investigator to choose whatever is best for his or her

study. By performing such calculations, genome-wide

association studies can be optimized to get the maximal

power possible for a given set of resources.

Methods

Genotype data and populations

I used genotype data from release 21 (phase II) of the

International HapMap project [19]. I used data from all

four populations studied in the HapMap project. These

populations are defined by the HapMap project as fol-

lows: Yoruba in Ibadan, Nigeria (abbreviation: YRI); Jap-

anese in Tokyo, Japan (abbreviation: JPT); Han Chinese in

Beijing, China (abbreviation: CHB); and CEPH (Utah res-

idents with ancestry from northern and western Europe)

(abbreviation: CEU). Similar to the analysis performed by

the HapMap project, I combined genotypes from the JPT


Figure 2

Power for genotypic and allelic tests. Data is shown for a

GRR of 1.5 under a multiplicative model, the CEU HapMap

population, and the specified genotyping system.

[Figure 2 plot: x-axis: sample size (0 to 10000); y-axis: power (0 to 1); legend: Affymetrix 500K Genotypic, Illumina 550K Genotypic, Affymetrix 500K Allelic, Illumina 500K Allelic.]


Figure 3

Coverage of tag SNPs. Fraction of non-tag SNPs in LD with a

tag SNP with r^2 above specified threshold for the ENCODE

and non-ENCODE regions of the HapMap project for the

CEU and YRI populations. Results are shown for the Illumina

550 K (A) and Affymetrix 500 K (B) chips. The JPT+CHB

population was not included because the curves generally

overlap with the CEU curves and would make the graph

harder to read. Results for the JPT+CHB population and for

the other chips are qualitatively similar to the curves shown

here.

[Figure 3 plots: panels A (Illumina 550K) and B (Affymetrix 500K); x-axis: Min. r^2 (0 to 1); y-axis: fraction of SNPs with marker above r^2 threshold (0 to 1); curves: CEU ENCODE, CEU non-ENCODE, YRI ENCODE, YRI non-ENCODE.]