Evaluating coverage of genome-wide association studies
Jeffrey C Barrett & Lon R Cardon
Genome-wide association studies involving hundreds of
thousands of SNPs in thousands of cases and controls are now
underway. The first of many analytical challenges in these
studies involves the choice of SNPs to genotype. It is not
practical to construct a different panel of tag SNPs for each
study, so the first generation of genome-wide scans will use
predefined, commercially available marker panels, which will in
part dictate their success or failure. We compare different
approaches in use today, and show that although many of them
provide substantial coverage of common variation in non-
African populations, the precise extent is strongly dependent on
the frequencies of alleles of interest and on specific considerations
of study design. Overall, despite substantial differences
in genotyping technologies, marker selection strategies and
number of markers assayed, the first-generation high-throughput
platforms all offer similar levels of genome coverage.
Falling genotype costs and the recent completion of the International
HapMap Project1,2 have made genome-wide association studies
(GWAS) of complex diseases imminent3–5. Such studies have the
potential to assay 100,000–500,000 genetic markers from the >4
million validated genetic variants now available. Although genotyping
most or all of the genetic variants would be desirable in many settings,
present economic and experimental conditions render it necessary, in
practice, to reduce the complete set of genetic
variants down to a tractable but maximally
There are a number of potential approaches
to this problem that have resulted both from
individual investigators’ interests and from
broader questions such as the importance of
obtaining full coverage of the genome versus
focusing on potentially functional variants6,7.
Most of the ongoing or planned GWAS aim to
evaluate most of the common genetic variants
in the human genome, irrespective of their
genic location3,4. For such designs, an obvious
marker selection approach for any particular
study is to pick a theoretically ‘ideal’ set of
SNPs for the study and genotype them in
large samples. This method is appropriate for
studies of small regions or candidate genes, but it is impractical for
GWAS, as the cost of ordering a de novo SNP set for each new genome
scan is prohibitive. Instead, genome-wide studies must choose from
several commercially available alternatives. These pragmatic concerns
of what is currently available in a high-throughput capacity will be at
least as important as theory-driven marker selection for the first
generation of scans that are now underway or being planned.
The practical necessity of having a fixed set of GWAS markers has
obvious advantages, such as the potential to combine data sets across
disease laboratories and the ability to design statistical methods for
commonly used panels, as done for linkage studies over the past decade.
This broad usage makes it important to appreciate the properties of
different marker selection strategies in terms of genomic coverage, allele
frequency representation and population diversity. Here we evaluate the
different strategies (Box 1) used in several commercially available
GWAS panels, including nonsynonymous SNP (nsSNP)-exclusive
sets8, linkage disequilibrium (LD)-based tagging panels9 and random
SNP collections across the genome10. In order to provide as compre-
hensive an evaluation as possible, we use the recently available HapMap
Phase II data1 (one SNP for every 1,250 bp across the entire genome) to
provide a framework for testing and comparison of common variation.
We examine coverage as measured by simple pairwise correlation
(r2) between a member of the tag set and a potentially captured
SNP11,12. This approach is attractive in that it makes few assumptions
[Figure 1 plot: fraction of SNPs tagged (y axis) against number of tags (× 1,000; x axis, 0 to 1,500), one panel per HapMap analysis panel (one labeled JPT + CHB), with curves for r2 ≥ 0.5, r2 ≥ 0.8 and r2 = 1.]
Figure 1 Genomic coverage by maximally efficient (pairwise) tag sets for three HapMap panels and three
r2 cutoffs. Evaluation of common SNPs is performed against the Phase II HapMap data, which provides
a near-complete catalog of common variation (minor allele frequency ≥ 0.05), including 5 million SNPs
in 270 individuals from populations in North America (CEU), Africa (YRI) and Asia (CHB+JPT)1. The
finished Phase II HapMap contains one common SNP every 1,250 bp in the CEU population and is
estimated to capture 94% of common variation in CEU and CHB+JPT and 81% in YRI1.
Received 3 January; accepted 13 April; published online 21 May 2006; doi:10.1038/ng1801
Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK. Correspondence should be addressed to L.C.
NATURE GENETICS VOLUME 38 | NUMBER 6 | JUNE 2006 659
© 2006 Nature Publishing Group http://www.nature.com/naturegenetics
about downstream analysis and has a simple relationship to sample
size for disease association studies under certain conditions1,13,14, but
it is not always the most efficient9. Another study in this issue15
describes approaches to make use of multiple-marker tests derived
from the HapMap when the initial panel is fixed.
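The pairwise r2 measure referred to above can be illustrated with a minimal sketch. It assumes phased haplotypes coded as 0/1 alleles, which is a simplification: tools such as Haploview estimate haplotype frequencies from unphased genotypes. The function name `pairwise_r2` is hypothetical and not taken from any package.

```python
from typing import Sequence


def pairwise_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Squared correlation (r2) between two biallelic SNPs.

    hap_a and hap_b hold one allele (0 or 1) per phased haplotype,
    listed in the same haplotype order (an assumption of this sketch).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    p_a = sum(hap_a) / n  # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n  # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / denom
```

Two SNPs whose alleles always travel together give r2 = 1, whereas statistically independent SNPs give r2 = 0.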
Coverage calculations may be confounded by at least two important
factors. First, there is an upward bias in coverage when including the
tag SNPs themselves as part of the coverage calculation. Second,
coverage estimates will be biased upward if all or part of the reference
set used for the estimate was used to select tag SNPs. In order to equate
coverage of a reference set to coverage of the genome, the reference set
must be considered representative of all common SNPs. If, on the
other hand, tag SNPs have been chosen specifically to capture all or
part of the reference set, then they are ‘overfitted’ to the reference set
compared with the set of all common SNPs in the genome. Detailed
corrections for these biases are discussed in Methods.
Figure 1 shows cumulative coverage of Phase II HapMap of a
maximally efficient set of tag SNPs (Box 1, strategy 1) in three HapMap
panels for three r2 thresholds. The data for the panel of Americans of
European ancestry (CEU) and the panel of Han Chinese from Beijing
combined with the panel of Japanese from Tokyo (JPT+CHB)
indicate that nearly all common variation in the genome can be
captured at r2 ≥ 0.8 with approximately 500,000 carefully selected
SNPs. As expected, the Yoruba from Nigeria (YRI) panel requires more
than twice as many SNPs to capture the Phase II HapMap.
In all cases, the gain in coverage achieved with increasing number of
tags shows a marked pattern of diminishing returns. For example,
although approximately 500,000 SNPs are required to completely
capture common variation in CEU, a set of 250,000 already captures
85%. This pattern has been previously observed when selecting large
pools of tag SNPs9 and follows from the fact that tagging algorithms
initially select the most useful SNP (that is, the SNP with the largest
number of proxies) and then include the next most useful, and so on9.
One consequence of this effect is that a large fraction of any genome-
wide tag set is devoted to capturing ‘singleton’ SNPs that are not in
strong LD with any other SNPs. The magnitude of this effect varies
strongly depending on the r2 threshold and population, from about
one-third of all tags when tagging CEU or JPT+CHB at r2 ≥ 0.5 to
nearly 1.1 million singletons in YRI at r2 = 1, representing almost 80%
of the total tag set.
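The greedy selection described above (take the SNP with the most proxies, then the next most useful, and so on) can be sketched as follows. This is an illustrative toy, not the Haploview implementation: it assumes the proxy sets have already been computed at a chosen r2 threshold and are symmetric, and the names `greedy_tags` and `count_singleton_tags` are hypothetical.

```python
from typing import Dict, List, Set


def greedy_tags(proxies: Dict[str, Set[str]]) -> List[str]:
    """Greedy pairwise tagging.

    proxies[s] is the set of other SNPs whose r2 with s meets the chosen
    threshold (assumed symmetric and precomputed).
    """
    untagged = set(proxies)
    tags: List[str] = []
    while untagged:
        # Pick the SNP that captures the most still-untagged SNPs, counting
        # itself; ties are broken by name so the result is deterministic.
        best = min(
            proxies,
            key=lambda s: (-len((proxies[s] | {s}) & untagged), s),
        )
        tags.append(best)
        untagged -= proxies[best] | {best}
    return tags


def count_singleton_tags(tags: List[str], proxies: Dict[str, Set[str]]) -> int:
    """Number of tags that have no proxies at all and so capture only themselves."""
    return sum(1 for t in tags if not proxies[t])


# Toy example: A captures B and C; D is in strong LD with nothing.
toy = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": set()}
selected = greedy_tags(toy)
```

In the toy example the first tag captures three SNPs and the second captures only itself, which is the diminishing-returns pattern in miniature.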
Several practical factors influence the coverage of emerging tag SNP
panels, which differ from the theoretically most efficient set. Table 1
(evaluated at r2 ≥ 0.8) and Supplementary Figure 1 online show
coverage for several strategies, all of which are lower than the
theoretically most efficient values (Fig. 1). Most of these differences
are explained because the tag SNPs in presently available sets were
necessarily selected from a less complete reference set. Whereas
250,000 SNPs selected from the Phase II HapMap capture 85% of
common variation in CEU, this efficiency is reduced to only 72%
when tags are selected from the Phase I HapMap data set. The effect of
incomplete reference data is exacerbated by the differences in under-
lying allele frequency spectra of the HapMap phases. Figure 2 and
Supplementary Figure 2 online show that although both phases of
the HapMap are enriched for common SNPs, the skewing is less
pronounced in Phase II, which has a larger proportion of rare (minor
allele frequency (MAF) < 0.05) and moderately rare (0.05 < MAF <
0.10) variants. As rare alleles are more likely to be ‘singletons’, this
difference reduces efficiency.
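The frequency classes used above can be made concrete with a short sketch that computes MAF from genotype counts and assigns a class. The function names are hypothetical, and placing a SNP with MAF of exactly 0.05 in the moderately rare class is an assumption made to agree with the MAF ≥ 0.05 definition of common variation.

```python
def minor_allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """MAF from counts of the three genotypes (AA, AB, BB) at a biallelic SNP."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("at least one genotype is required")
    count_a = 2 * n_aa + n_ab  # copies of allele A
    return min(count_a, total_alleles - count_a) / total_alleles


def frequency_class(maf: float) -> str:
    """Assign the frequency class used in the text (boundary handling assumed)."""
    if maf < 0.05:
        return "rare"
    if maf < 0.10:
        return "moderately rare"
    return "MAF >= 0.10"
```

For 90 samples, for instance, `minor_allele_frequency(88, 2, 0)` is 2/180, which falls in the rare class.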
Coverage values for marker panels that do not incorporate LD
patterns (Box 1, strategy 2) are also shown in Table 1. Two of the most
popular products in this category, the Affymetrix 111K and 500K
panels, are estimated to capture 31% and 65% of common variation in
CEU at r2 ≥ 0.8, respectively. As expected, these products have lower
overall coverage and lower efficiency (coverage per SNP genotyped)
than sets of carefully selected tag SNPs. However, their randomness
offers a protective redundancy against the failure of any given SNP.
Figure 3 shows the relative coverage of the Affymetrix 500K and
Illumina HumanHap-300 products for random failure rates between
BOX 1 SNP SELECTION METHODS FOR GENOME-WIDE ASSOCIATION STUDIES
1. Tag: an LD-based set of tag SNPs carefully chosen to maximize the amount of variation captured per SNP. We first examined the theoretically most efficient set of tag
SNPs and then considered illustrative examples from commercially available products. There are at least two tag sets presently available or soon to be marketed: the
Illumina HumanHap-300 set of 317,000 markers and the impending full HumanHap-500 set of ~500,000 SNPs.
2. Random: a set of SNPs distributed approximately randomly across the genome that ignores LD patterns. We have evaluated two LD ‘agnostic’ marker sets currently
available from Affymetrix, with 111,000 and 500,000 markers.
3. Combined: a combination of these two methods consisting of a set of ‘random’ SNPs augmented by a carefully chosen fill-in set. Owing to cost efficiencies and speed
of genotyping, several disease investigations have combined the random sets of Affymetrix with custom tag sets that aim to fill in the gaps of random coverage.
4. Functional: either a panel of all known, polymorphic nsSNPs (the MegAllele system marketed by Affymetrix involves 12,000 nsSNPs) or a larger set focused more
broadly on genes, such as Illumina’s Human-1 BeadChip. We have not evaluated nsSNP-specific sets for genome-wide coverage, as they are designed to test the
functional variants directly, as opposed to the other products, which indirectly test many variants genome-wide without any prior hypothesis.
Table 1 Genomic coverage of commercial GWAS products for common SNPs at r2 ≥ 0.8, evaluated in Phase II HapMap
[Table body not recoverable: surviving column headers are Type, followed by Coverage (%) and Mean r2 repeated three times (presumably once per HapMap analysis panel); the only surviving row label is "Affymetrix 500K + 175K tag".]
Despite the r2 cutoff of 0.8, the mean r2 for tagged SNPs is very high; also, ‘untagged’ SNPs are covered with intermediate values of r2, providing modest power to detect such alleles (Supplementary Fig. 1).
a Coverage estimates for the Human-1 product are underestimates because some of its SNPs were not genotyped in the HapMap project. As these SNPs are largely rare genic SNPs, it is not expected that they would substantially raise coverage of common variation.
0–20%. This relationship between efficiency and redundancy could be
important, given the serious consequences of failure rate for disease
association studies3,8 and initial indications of nontrivial failure rates
for at least some genome-wide scans16.
Another factor that affects choice of a genome-wide SNP panel is
the population being studied. Table 1 shows that the advantage of
carefully selected tags is reduced when considering coverage in another
of the HapMap analysis panels. Although the Illumina HumanHap300
outperforms the Affymetrix 500K for the CEU samples (Illumina,
75%; Affymetrix, 65%), this advantage is no longer apparent with the
JPT+CHB samples (Illumina, 63%; Affymetrix, 66%), and the advan-
tage is reversed for the YRI samples (Illumina, 28%; Affymetrix, 41%).
Figure 3 further shows that the most efficient design, selecting tags
from the same population to be studied, is most susceptible to marker
failure because nearly all redundancy has been intentionally eliminated.
All of these estimates of coverage are expected to vary somewhat when
applied to non-HapMap samples.
Although coverage of common variation by the larger versions of
currently available products is promising, rare variants are much less
well covered owing to the preferential selection of common SNPs on
these products. At present, publicly available data in which large
genomic regions have been resequenced in large sample sizes do not
yet exist, although such studies of candidate
genes are increasing17. In order to gain some
insight into correlation patterns among rare
SNPs, we used data from the HapMap
ENCODE project1, which suggest that fewer
than 10% are well captured (pairwise r2 ≥
0.8) by any of the genome-wide products in
any of the HapMap analysis panels. We cau-
tion, however, that even these low coverage
levels may be overestimates for several reasons.
First, because only a modest number of indi-
viduals were resequenced, we expect that a
large number of rare SNPs in these regions
remain undiscovered. Second, estimates of r2
for rare SNPs have a high variance, indicating
that observations of ‘perfectly correlated’ rare
SNPs in the HapMap samples may be due to
chance. Finally, many ‘private SNPs’ (that is,
rare alleles observed in only one of the resequenced individuals) will
appear to be perfectly correlated, thus inflating estimates of LD. The
combination of the low estimates derived from ENCODE and the
numerous factors that indicate that true coverage is lower emphasizes
the assumption of these GWAS products that common diseases are at
least partially influenced by common genetic variants.
One possible compromise among the considerations of cost,
efficiency and redundancy is to couple a set of ‘random’ SNPs with
a smaller, personalized set that is highly selected to fill in gaps (Box 1,
strategy 3). In this case, the exact extent to which coverage improves
depends on the goals for the selection of the fill-in set. Using 500,000
random SNPs as a baseline, approximately 360,000 additional SNPs
would be required to capture all remaining common variation in CEU.
Of course, the majority of these would be spent in the singleton tail
(capturing only themselves). Instead, some groups, such as the Well-
come Trust Case Control Consortium (WTCCC), have designed
panels of nonsingleton tags to bring coverage to 86% (Table 1). The
remainder of the SNPs in the combined approach can then be used to
add targeted redundancy to LD bins with many surrogates or focus on
known nonsynonymous variation.
Whereas coverage of common variation in genes is generally
comparable to the rest of genome, other focused classes of functional
variants are captured poorly by SNP sets aimed at common variation.
Figure 4 shows the CEU allele frequency distribution of the 15,500
nsSNPs that are polymorphic in the HapMap. Clearly there is an
overwhelming preponderance of rare nsSNPs in this set. Capturing all
50,000 nsSNPs catalogued in the NCBI SNP database (dbSNP) would
be a considerable challenge, given that coverage of rare SNPs by
genome-wide products is poor. The most direct way to ensure cover-
age of these SNPs is to genotype them directly (Box 1, strategy 4),
either through a separate platform18 or by adding them to the tag set.
Both the HumanHap-300 and combined approaches eschew some
singleton tags to include known polymorphic nsSNPs. The Illumina
Human-1 BeadChip offers an intermediate approach by concentrating
109,000 SNPs in exons and conserved regions. This combines good
coverage for a broad variety of functional SNPs and modest genome-
wide coverage of common SNPs.
It is notable that, despite the many differences in panel focus,
marker selection strategies and technology platforms, the random and
tag-SNP sets generally capture a similar fraction of common variants
in the human genome. Choosing a GWAS SNP set requires careful
evaluation of levels of coverage, population of interest and the balance
between efficiency (via careful tag selection) and redundancy. Newer,
[Figure 2 plot: fraction of SNPs (y axis) against minor allele frequency (x axis, 0.0 to 0.5).]
Figure 2 Minor allele frequency in CEU for 900,000 polymorphic Phase I
HapMap SNPs, 2 million distinct polymorphic SNPs added during the
second phase of HapMap and 10,000 polymorphic ENCODE SNPs (which
approximate the underlying frequency distribution in the genome). Both
phases of HapMap are intentionally biased toward common SNPs, but the
bias for Phase II is less extreme. The pattern in the JPT+CHB and YRI
analysis panels is not substantially different (Supplementary Fig. 2).
[Figure 3 plot: coverage against random genotype failure rate (x axis, 0.00 to 0.20), one panel per HapMap analysis panel (one labeled JPT + CHB).]
Figure 3 Coverage of common variation in the Phase II HapMap by the Affymetrix 500K and Illumina
HumanHap300 products plotted as a function of random genotype failure rate. For each level of failure
rate, markers were excluded at random, and coverage was calculated using only the remaining markers.
Each point is the average of 1,000 replicates. The larger number of SNPs and increased redundancy in
the Affymetrix array provide it greater resistance to decreased coverage due to marker failure. The
Illumina array performs worse in non-CEU populations because its SNPs are so carefully targeted
larger GWAS products, together with functionally focused assays,
promise an ever-more-complete picture of genetic variation. The
challenge will then shift from the number of variants missed to how
to extract meaningful information from those that are not.
Data. Genome-wide coverage evaluation was performed on Phase II HapMap
(release 20) combined with Affymetrix genotypes on the HapMap samples for
those markers contained on their product but not on HapMap. This data set
contained between 2.5 and 3 million polymorphic SNPs, depending on HapMap
analysis panel (CEU, YRI, JPT + CHB), genotyped on 90 samples per panel. For
evaluating rare coverage we used the HapMap ENCODE data set (release 16c.1),
which was created by genotyping all variable sites observed after resequencing
5 Mb in 48 unrelated individuals in the full set of HapMap samples. We
combined the JPT and CHB samples into one analysis panel for all analyses.
Tag SNP selection and coverage evaluation. We selected tag SNPs using the
program Haploview19. A SNP was considered ‘tagged’ by another if they had
pairwise r2 greater than a certain threshold. The maximum allowed physical
distance between an allele and a tag was 200 kb. Common coverage of
predetermined tag sets was evaluated by using Haploview with the same
parameters, but force-including the SNPs in the tag set and force-excluding
all other SNPs. The raw value for coverage is the fraction of all common (MAF
≥ 0.05) SNPs that are captured by the tag set.
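The raw coverage calculation can be sketched in a few lines. This is a toy illustration rather than Haploview's code: it assumes all SNPs lie on one chromosome, that pairwise r2 is supplied by a caller-provided lookup function, and that a SNP counts as captured at r2 ≥ the threshold. The name `raw_coverage` and its parameters are hypothetical.

```python
from typing import Callable, Dict, Iterable


def raw_coverage(
    positions: Dict[str, int],
    maf: Dict[str, float],
    tags: Iterable[str],
    r2: Callable[[str, str], float],
    r2_min: float = 0.8,
    max_dist: int = 200_000,
    maf_min: float = 0.05,
) -> float:
    """Fraction of common reference SNPs captured by a fixed tag set."""
    tag_list = [t for t in tags if t in positions]
    common = [s for s in positions if maf[s] >= maf_min]
    if not common:
        raise ValueError("no common SNPs in the reference set")
    captured = 0
    for snp in common:
        for tag in tag_list:
            # A SNP is captured if it is itself a tag, or if a tag within
            # max_dist base pairs reaches the r2 threshold with it.
            if tag == snp or (
                abs(positions[tag] - positions[snp]) <= max_dist
                and r2(tag, snp) >= r2_min
            ):
                captured += 1
                break
    return captured / len(common)


# Toy reference set: s3 is correlated with the tag but lies beyond 200 kb,
# and s4 is rare, so it is excluded from the denominator.
toy_pos = {"s1": 1_000, "s2": 5_000, "s3": 400_000, "s4": 6_000}
toy_maf = {"s1": 0.30, "s2": 0.25, "s3": 0.20, "s4": 0.01}
toy_r2 = {frozenset(("s1", "s2")): 0.90, frozenset(("s1", "s3")): 0.95}


def toy_lookup(a: str, b: str) -> float:
    return toy_r2.get(frozenset((a, b)), 0.0)
```

With `s1` as the only tag, two of the three common SNPs in the toy set are captured.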
To calculate a corrected value for coverage, consider a reference set of SNPs R,
such as the Phase II HapMap. For a given tag set T, some SNPs are captured
either because they are contained in T, or they are in LD with a SNP in T (we
call this latter set of SNPs L). Thus, a naive estimate of coverage of all SNPs in
the genome (G) would be
$$\frac{L + T}{R}$$
However, this overestimates the true fraction of captured SNPs that are actually
tags, because G > R. To correct for this overestimate, we define the genome-wide
coverage as
$$\frac{\frac{L}{R - T}\,(G - T) + T}{G}$$
This requires a value for G, the number of common SNPs in the genome. Our
coverages are corrected using an estimate derived from the ENCODE regions,
indicating G ≈ 7.5 million common SNPs for CEU samples, but we note
that using larger values up to 10 million SNPs does not materially alter the
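As a worked illustration of this correction, the sketch below compares the naive and genome-wide estimates. It assumes the naive estimate is (L + T)/R and that the genome-wide coverage takes the form [L/(R − T) × (G − T) + T]/G, with every symbol read as a count of SNPs. The function names are hypothetical, and the numbers in the example are invented for illustration, not values from this study.

```python
def naive_coverage(n_ld: int, n_tags: int, n_ref: int) -> float:
    """(L + T) / R: tags and LD-captured SNPs as a share of the reference set."""
    return (n_ld + n_tags) / n_ref


def genome_wide_coverage(n_ld: int, n_tags: int, n_ref: int, n_genome: int) -> float:
    """[L / (R - T) * (G - T) + T] / G, the assumed form of the correction."""
    if n_ref <= n_tags:
        raise ValueError("reference set must contain SNPs that are not tags")
    # Rate at which non-tag reference SNPs are captured through LD.
    ld_rate = n_ld / (n_ref - n_tags)
    # Apply that rate to all non-tag SNPs in the genome, then add the tags.
    return (ld_rate * (n_genome - n_tags) + n_tags) / n_genome


# Hypothetical counts: L = 1.5 million, T = 0.5 million, R = 2.5 million,
# G = 7.5 million.
naive = naive_coverage(1_500_000, 500_000, 2_500_000)
corrected = genome_wide_coverage(1_500_000, 500_000, 2_500_000, 7_500_000)
```

With these invented counts the naive estimate is 0.80 and the corrected estimate is about 0.77; the two coincide when G equals R, because the tags then make up the same share of both sets.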
In the case of tag SNPs chosen using the Phase I HapMap data, we attempt
to correct for this bias by considering coverage of the Phase II-only subset of R,
which contains no overlapping markers with the Phase I tag set. The correction
factor above is thus split into two parts, representing the overfitted portion of
the reference, R1, and the remainder, R2 (see Supplementary Methods):
$$\frac{\frac{L_2}{R_2 - T_2}\,(G - R_1 - T_2) + T_2 + L_1 + T_1}{G}$$
where L1, T1, L2 and T2 are the portion of SNPs captured by LD (L1, L2) and the
tag SNPs (T1, T2) in the Phase I and Phase II data, respectively.
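The two-part correction can be sketched in the same way. The code assumes the expression has the form [L2/(R2 − T2) × (G − R1 − T2) + T2 + L1 + T1]/G, with all symbols read as SNP counts; the function name, argument names and example counts are hypothetical.

```python
def split_reference_coverage(
    n_ld1: int,
    n_tags1: int,
    n_ref1: int,
    n_ld2: int,
    n_tags2: int,
    n_ref2: int,
    n_genome: int,
) -> float:
    """Coverage when part of the reference (R1) was used to choose the tags."""
    if n_ref2 <= n_tags2:
        raise ValueError("R2 must contain SNPs that are not tags")
    # LD capture rate measured only on the portion of the reference that
    # played no role in tag selection.
    ld_rate = n_ld2 / (n_ref2 - n_tags2)
    # SNPs outside R1 that are not tags are assumed to be captured at that rate.
    unseen = n_genome - n_ref1 - n_tags2
    # SNPs in R1 are counted directly as observed (L1 + T1).
    return (ld_rate * unseen + n_tags2 + n_ld1 + n_tags1) / n_genome


# Invented counts: all tags come from R1, so T2 = 0.
example = split_reference_coverage(
    n_ld1=600_000, n_tags1=250_000, n_ref1=1_000_000,
    n_ld2=900_000, n_tags2=0, n_ref2=1_500_000,
    n_genome=7_500_000,
)
```

The design choice is that the overfitted portion contributes its observed captures exactly once, while the LD rate extrapolated to the rest of the genome comes only from SNPs that the tag selection never saw.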
Failure rate. Effects of marker failure on coverage were evaluated by randomly
selecting some fraction (0–20%) of markers and calculating genome-wide
coverage based on the remainder of the ‘successful’ markers. Each data point for
a failure rate of a particular SNP set in a particular analysis panel is the average
of 1,000 replicates.
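The failure-rate procedure can be sketched as a small Monte Carlo loop. This is an assumption-laden illustration: the coverage calculation is passed in as a caller-supplied function, exactly the stated fraction of markers is dropped in each replicate, and the name `coverage_under_failure` and the fixed seed are hypothetical choices.

```python
import random
from typing import Callable, Sequence


def coverage_under_failure(
    tags: Sequence[str],
    coverage_fn: Callable[[Sequence[str]], float],
    failure_rate: float,
    replicates: int = 1000,
    seed: int = 0,
) -> float:
    """Mean coverage after randomly removing a fraction of the markers."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must lie between 0 and 1")
    rng = random.Random(seed)  # seeded so that repeated runs agree
    n_fail = round(failure_rate * len(tags))
    total = 0.0
    for _ in range(replicates):
        # Choose which markers fail in this replicate, by index.
        failed = set(rng.sample(range(len(tags)), n_fail))
        surviving = [t for i, t in enumerate(tags) if i not in failed]
        total += coverage_fn(surviving)
    return total / replicates
```

In practice `coverage_fn` would wrap a coverage calculation over a reference set; any function that maps a list of surviving markers to a coverage value will do.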
URLs. Wellcome Trust Case Control Consortium: http://www.wtccc.org.uk/.
Complete SNP lists for all evaluated products are available at http://www.
well.ox.ac.uk/~jcbarret/gwas/. Illumina: http://www.illumina.com/. Affymetrix:
Note: Supplementary information is available on the Nature Genetics website.
We wish to thank M. Daly, I. Pe’er, L. Palmer, M. Barnes and the WTCCC
analysis group, particularly D. Clayton and P. Donnelly, for discussions on many
of these topics. We thank D. Evans for comments on the manuscript. We also
thank the investigators and participants in the International HapMap project
for generating the unique data set and making it available to the scientific
community. The authors are supported by the Wellcome Trust, the US National
Institutes of Health and a grant from the European Union (MolPAGE).
COMPETING INTERESTS STATEMENT
The authors declare that they have no competing financial interests.
Published online at http://www.nature.com/naturegenetics
Reprints and permissions information is available online at http://npg.nature.com/
1. Altshuler, D. et al. A haplotype map of the human genome. Nature 437, 1299–1320
2. Hinds, D.A. et al. Whole-genome patterns of common DNA variation in three human
populations. Science 307, 1072–1079 (2005).
3. Wang, W.Y., Barratt, B.J., Clayton, D.G. & Todd, J.A. Genome-wide association studies:
theoretical and practical concerns. Nat. Rev. Genet. 6, 109–118 (2005).
4. Hirschhorn, J.N. & Daly, M.J. Genome-wide association studies for common diseases
and complex traits. Nat. Rev. Genet. 6, 95–108 (2005).
5. Palmer, L.J. & Cardon, L.R. Shaking the tree: mapping complex disease genes with
linkage disequilibrium. Lancet 366, 1223–1234 (2005).
6. Botstein, D. & Risch, N. Discovering genotypes underlying human phenotypes: past
successes for mendelian disease, future approaches for complex disease. Nat. Genet.
33 (Suppl.), 228–237 (2003).
7. Neale, B.M. & Sham, P.C. The future of association studies: gene-based analysis and
replication. Am. J. Hum. Genet. 75, 353–362 (2004).
8. Clayton, D.G. et al. Population structure, differential bias and genomic control
in a large-scale, case-control association study. Nat. Genet. 37, 1243–1246
9. de Bakker, P.I. et al. Efficiency and power in genetic association studies. Nat. Genet.
37, 1217–1223 (2005).
10. Dong, S. et al. Flexible use of high-density oligonucleotide arrays for single-
nucleotide polymorphism discovery and validation. Genome Res. 11, 1418–1424
11. Carlson, C.S. et al. Selecting a maximally informative set of single-nucleotide poly-
morphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet.
74, 106–120 (2004).
12. Ke, X. et al. A comparison of tagging methods and their tagging space. Hum. Mol.
Genet. 14, 2757–2767 (2005).
13. Pritchard, J.K. & Przeworski, M. Linkage disequilibrium in humans: models and data.
Am. J. Hum. Genet. 69, 1–14 (2001).
14. Jorgenson, E. & Witte, J.S. Coverage and power in genomewide association studies.
Am. J. Hum. Genet. 78, 884–888 (2006).
15. Daly, M. et al. Evaluating and improving power in whole-genome association studies
using fixed marker sets. Nat. Genet. advance online publication 21 May 2006
16. Klein, R.J. et al. Complement factor H polymorphism in age-related macular degen-
eration. Science 308, 385–389 (2005).
17. Rieder, M.J. et al. Effect of VKORC1 haplotypes on transcriptional regulation and
warfarin dose. N. Engl. J. Med. 352, 2285–2293 (2005).
18. Hardenbol, P. et al. Highly multiplexed molecular inversion probe genotyping: over
10,000 targeted SNPs genotyped in a single tube assay. Genome Res. 15, 269–275
19. Barrett, J.C., Fry, B., Maller, J. & Daly, M.J. Haploview: analysis and visualization of LD
and haplotype maps. Bioinformatics 21, 263–265 (2005).
[Figure 4 plot: number of nsSNPs (y axis) against CEU minor allele frequency.]
Figure 4 CEU minor allele frequencies for nsSNPs polymorphic in
Phase II HapMap.