A second generation human haplotype
map of over 3.1 million SNPs
The International HapMap Consortium*
We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs)
the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of
between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide
genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in
data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10–30% of pairs of individuals within
a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all
common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination
rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased
efficacy of natural selection between populations.
Advances made possible by the Phase I haplotype map
The International HapMap Project was launched in 2002 with the
aim of providing a public resource to accelerate medical genetic
research. The objective was to genotype at least one common SNP
every 5 kilobases (kb) across the euchromatic portion of the genome
mother–father–adult child trios from the Yoruba in Ibadan, Nigeria
(abbreviated YRI); 30 trios of northern and western European ancestry
living in Utah from the Centre d’Etude du Polymorphisme
Humain (CEPH) collection (CEU); 45 unrelated Han Chinese individuals
in Beijing, China (CHB); and 45 unrelated Japanese individuals
in Tokyo, Japan (JPT). The YRI samples and the CEU samples
each form an analysis panel; the CHB and JPT samples together form
Phase I of the project, and a description of this resource was pub-
lished in 2005 (ref. 3).
The initial HapMap Project data had a central role in the develop-
ment of methods for the design and analysis of genome-wide asso-
ciation studies. These advances, alongside the release of commercial
platforms for performing economically viable genome-wide geno-
typing, have led to a new phase in human medical genetics. Already,
large-scale studies have identified novel loci involved in multiple
complex diseases4,5. In addition, the HapMap data have led to novel
insights into the distribution and causes of recombination hotspots3,6,
the prevalence of structural variation7,8 and the identity of
genes that have experienced recent adaptive evolution3,9. Because the
to integrate their own experimental data with the genome-wide SNP
data to gain new insight into copy-number variation10, the relation-
ship between classical human leukocyte antigen (HLA) types and
SNP variation11, and heritable influences on gene expression12–14.
The ability to combine genome-wide data on such diverse aspects
of genetic variation with molecular phenotypes collected in the same
samples provides a powerful framework to study the connection of
DNA sequence to function.
*Lists of participants and affiliations appear at the end of the paper.
In Phase II of the HapMap Project, a further 2.1 million SNPs
were successfully genotyped on the same individuals. The resulting
HapMap has an SNP density of approximately one per kilobase
and is estimated to contain approximately 25–35% of all the 9–10
assembled human genome (that is, excluding gaps in the reference
sequence alignment; see Supplementary Text 1), although this num-
ber shows extensive local variation. This paper describes the Phase II
resource, its implications for genome-wide association studies and
additional insights into the fine-scale structure of linkage disequilib-
rium, recombination and natural selection.
Construction of the Phase II HapMap
Most of the additional genotype data for the Phase II HapMap were
obtained using the Perlegen amplicon-based platform15. Briefly, this
platform uses custom oligonucleotide arrays to type SNPs in DNA
segmentally amplified via long-range polymerase chain reaction
(PCR). Genotyping was attempted at 4,373,926 distinct SNPs, which
corresponds, with exceptions (see Methods), to nearly all SNPs in
dbSNP release 122 for which an assay could be designed. Additional
submissions were included from the Affymetrix GeneChip Mapping
Array 500K set, the Illumina HumanHap100 and HumanHap300
SNP assays, a set of ~11,000 non-synonymous SNPs genotyped by
Affymetrix (ParAllele) and a set of ~4,500 SNPs within the extended
major histocompatibility complex (MHC)11. Genotype submissions
were subjected to the same quality control (QC) filters as described
previously (see Methods) and mapped to NCBI build 35 (University
re-mapping of SNPs from Phase I of the project identified 21,177
SNPs that had an ambiguous position or some other feature indi-
cative of low reliability; these are not included in the filtered Phase II
data release. All genotype data are available from the HapMap Data
Coordination Center (http://www.hapmap.org) and dbSNP (http://
www.ncbi.nlm.nih.gov/SNP); analyses described in this paper refer
to release 21a. Three data sets are available: ‘redundant unfiltered’
Vol 449|18 October 2007|doi:10.1038/nature06258
contains all genotype submissions, ‘redundant filtered’ contains all
submissions that pass QC, and ‘non-redundant filtered’ contains a
single QC+ submission for each SNP in each analysis panel.
The QC filters remove SNPs showing gross errors. However, it is
also important to understand the magnitude and structure of more
subtle genotyping errors among SNPs that pass QC. We therefore
PCR amplicon structure on genotyping error, the concordance rates
between genotype calls from different genotyping platforms and
between those platforms and re-sequencing assays, as well as the rates
of false monomorphism and mis-mapping of SNPs (see Supplementary
Text 2, Supplementary Figs 1–3 and Supplementary Tables 1–4).
We estimate that the average per genotype accuracy is at least 99.5%.
However, there are higher rates of missing data and genotype discrepancies
at non-reference alleles, with some clustering of errors resulting
from the amplicon design and a few incorrectly mapped SNPs.
Table 1 shows the numbers of SNPs attempted and converted to
QC+ SNPs in each analysis panel (Supplementary Table 5 shows a
were estimated for each analysis panel separately using both trio
information and statistical methods based on the coalescent model
(see Methods). To enable cross-population comparisons, a consensus
data set was created consisting of 3,107,620 SNPs that were
QC+ in all analysis panels and polymorphic in at least one analysis
panel. The equivalent figure from Phase I was 931,340 SNPs. Unless
stated otherwise, all analyses have been carried out on the consensus
the consensus where a putative ancestral state could be assigned by
comparison of the human alleles to the orthologous position in the
chimpanzee and rhesus macaque genomes.
in Fig. 1. On average there are 1.14 genotyped polymorphic SNPs per
kilobase (average spacing is 875 base pairs (bp)) and 98.6% of the
assembled genome is within 5 kb of the nearest polymorphic SNP.
Still, there is heterogeneity in genotyped SNP density at both broad
(Fig. 1a) and fine (Fig. 1b) scales. Furthermore, there are systematic
changes in genotyped SNP density around genomic features includ-
ing genes (Fig. 1c).
The Phase II HapMap differs from the Phase I HapMap not only
in SNP spacing, but also in minor allele frequency distribution and
patterns of linkage disequilibrium (Supplementary Fig. 4). Because
the criteria for choosing additional SNPs did not include considera-
tion of SNP spacing or preferential selection for high MAF, the SNPs
added in Phase II are, on average, more clustered and have lower
MAF than the Phase I SNPs. Because MAF predictably influences the
distribution of linkage disequilibrium statistics, the average r² at a
given physical distance is typically lower in Phase II than in Phase I;
notable consequence is that the Phase II HapMap includes a better
representation of rare variation than the Phase I HapMap.
The increased resolution provided by Phase II of the project is
illustrated in Fig. 2. Broadly, an additional SNP added to a region
of haplotype structure (for example, a group of chromosomes with
identical local haplotypes in Phase I can be shown in Phase II to carry
Table 1 | Summary of Phase II HapMap data (release 21)
Phase  SNP categories  Analysis panel
YRI  CEU  CHB+JPT
I  Assays submitted
Did not pass QC
>1 duplicate inconsistent
>1 mendelian error
<0.001 Hardy–Weinberg P-value
Did not pass QC
>1 duplicate inconsistent
>1 mendelian error
<0.001 Hardy–Weinberg P-value
Did not pass QC
>1 duplicate inconsistent
>1 mendelian error
<0.001 Hardy–Weinberg P-value
Non-redundant (unique) SNPs
SNP categories  All analysis panels
Unique QC-passed SNPs
Passed in one analysis panel
Passed in two analysis panels
Passed in three analysis panels (QC+3)
QC+3 and monomorphic across three analysis panels
QC+3 and polymorphic in at least one analysis panel
QC+3 and polymorphic in all three analysis panels
QC+3 and MAF ≥ 0.05 in at least one of three analysis panels
NATURE|Vol 449|18 October 2007
SNPs) may reveal previously missed recombinant haplotypes. The
extent to which each type of event occurs varies among populations
identifying new recombinant haplotypes and haplotype groupings,
occur in YRI. Consequently, the Phase II HapMap provides increased
resolution in the estimated fine-scale genetic map and improved
power to detect and localize recombination hotspots (Fig. 2b).
The use of the Phase II HapMap in association studies
The increased SNP density of the Phase II HapMap has already been
extensively exploited in genome-wide studies of disease association.
In this section, we quantify the gain in resolution and outline how
the HapMap data can be used to improve the power of association
Improved coverage of common variation. We previously predicted
that the vast majority of common SNPs would be correlated to Phase
II HapMap SNPs by extrapolation from the ten HapMap ENCODE
regions3. Using the actual Phase II marker spacing and frequency
distributions (Table 2), we repeated the simulations and estimate
that Phase II HapMap marker sets capture the overwhelming majority
of all common variants at high r². For common variants
(MAF ≥ 0.05) the mean maximum r² of any SNP to a typed one is
0.90 in YRI, 0.96 in CEU and 0.95 in CHB+JPT. The impact of the
[Figure 1 axis labels: Position NCBI build 35 (kb); SNP density (per kb); dbSNP polymorphic SNP density (per kb)]
Figure 1 | SNP density in the Phase II HapMap. a, SNP density across the
genome. Colours indicate the number of polymorphic SNPs per kb in the
consensus data set. Gaps in the assembly are shown as white. b, Example of
the fine-scale structure of SNP density for a 100-kb region on chromosome
17 showing Perlegen amplicons (black bars), polymorphic Phase I SNPs in
the consensus data set (red triangles) and polymorphic Phase II SNPs in the
I SNPs. c, The distribution of polymorphic SNPs in the consensus Phase II
HapMap data (blue line and left-hand axis) around coding regions. Also
shown is the density of SNPs in dbSNP release 125 around genes (red line
and right-hand axis). Values were calculated separately 5′ from the coding
start site (the left dotted line) and 3′ from the coding end site (right dotted
line) and were joined at the median midpoint position of the coding unit
(central dotted line).
[Figure 2 axis label: Position (NCBI build 35); genes labelled: OR51V1, HBB, HBE, HBD]
Figure 2 | Haplotype structure and recombination rate estimates from the
Phase II HapMap. a, Haplotypes from YRI in a 100 kb region around the
β-globin (HBB) gene. SNPs typed in Phase I are shown in dark blue.
for which the derived allele can be unambiguously identified by parsimony
(by comparison with an outgroup sequence) are shown (89% of SNPs in the
region); the derived allele is shown in colour. b, Recombination rates (lines)
and the location of hotspots (horizontal blue bars) estimated for the same
region from the Phase I (dark blue) and Phase II HapMap (light blue) data.
Also shown are the location of genes within the region (grey bars) and the
location of the experimentally verified recombination hotspot57,58 at the 5′
end of the HBB gene (black bar).
increased density of the Phase II HapMap is most notable in YRI (in
are found if a threshold of r² ≥ 0.8 is used to determine whether an
SNP is captured (Table 2). As expected, very common SNPs with
MAF > 0.25 are captured extremely well (mean maximum r² of 0.93
in YRI to 0.97 in CEU), whereas rarer SNPs with MAF < 0.05 are less
well covered (mean maximum r² of 0.74 in CHB+JPT to 0.76 in
YRI). The latter figure is probably an overestimate because it is based
on lower frequency SNPs discovered via re-sequencing 48 HapMap
individuals, and does not include a much larger number of very rare
SNPs. We also assessed the increase in coverage provided by using
two-SNP haplotypes as proxies for SNPs that are poorly captured by
single SNPs16 (Table 2). These two-SNP haplotypes lead to a modest
cies. However, in some regions, particularly where marker density is
low, gains from multi-marker and imputation approaches in prac-
tical situations can be substantial (see below).
able resource for selecting tag SNPs genome-wide. Using a simple
pairwise tagging approach, we find that 1.09 million SNPs are
required to capture all common Phase II SNPs with r² ≥ 0.8 in
YRI, with slightly more than 500,000 required in CEU and
CHB+JPT (Table 3). These numbers are approximately twice those
required to capture SNPs in the Phase I HapMap (which has one-third
as many SNPs). The number of SNPs required to achieve perfect
tagging (r² = 1.0) in each analysis panel is almost double that
required to achieve the r² ≥ 0.8 threshold. It becomes increasingly
expensive to improve the coverage afforded by tags from the Phase I
and, now, the Phase II HapMap, because additional tag SNPs are
unlikely to capture large groups of additional SNPs.
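
As an illustration of what pairwise tagging involves, the sketch below computes r² between biallelic SNPs from phased haplotypes and then picks tags greedily. The greedy selection rule, the function names and the toy data are assumptions made for illustration; the text says only that a simple pairwise tagging approach was used, so this should not be read as the procedure behind Table 3.

```python
def r_squared(a, b):
    """r^2 between two biallelic SNPs, each a list of 0/1 alleles per haplotype."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # a monomorphic SNP has undefined r^2; treated as 0 here
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb  # linkage disequilibrium coefficient D
    return d * d / denom


def greedy_tags(snps, threshold=0.8):
    """Pick tag SNPs so that every SNP has r^2 >= threshold to some tag.

    Hypothetical greedy rule: repeatedly choose the SNP that captures the
    most SNPs that are still untagged.
    """
    n = len(snps)
    captures = [
        {j for j in range(n) if j == i or r_squared(snps[i], snps[j]) >= threshold}
        for i in range(n)
    ]
    untagged = set(range(n))
    tags = []
    while untagged:
        best = max(range(n), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Toy data: 8 haplotypes at 4 SNPs (illustrative values only).
toy_snps = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],  # identical to SNP 0
    [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly anti-correlated with SNP 0
    [0, 1, 0, 1, 0, 1, 0, 1],  # uncorrelated with the others
]
toy_tags = greedy_tags(toy_snps)
```

In this toy case the first three SNPs are mutually redundant and the fourth needs its own tag, which mirrors the observation above that each additional tag captures fewer new SNPs as coverage approaches completeness.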
Phase II HapMap and genome-wide association studies. Although
most disease studies the tag SNPs genotyped will be primarily deter-
Using Phase II data, we estimated the coverage of several available
products on which genome-wide association studies are already
underway (Table 4). Similar to earlier estimates17,18, these products
typically perform well in CEU and CHB+JPT, and some also per-
capture 68–88% (depending on selection method) of all HapMap
the Phase II HapMap will be covered more poorly because most
genotyping products were designed using HapMap data.
HapMap data have several additional roles in the analysis of dis-
ease-association studies using fixed marker sets. For example, the
high-quality haplotype information within the Phase II HapMap
can be used to aid the phasing of genotype data from new samples
least one haplotype in the Phase II data. By a similar argument,
missing genotypes can potentially be inferred through comparison
to the Phase II haplotypes. Genotypes may be missing either because
of genotyping failure or because the SNP was not assayed within
the experiment. Therefore, the HapMap haplotypes provide a way
of in silico genotyping Phase II SNPs that were not included in the
Although there is no clear consensus yet about the role of SNP
imputation in the analysis of genome-wide association studies, high
imputation accuracy can be achieved using model-based methods19–23
and can lead to an increase in power23,24. To illustrate the
possibilities, in the 500-kb HapMap ENCODE region on 8q24.11
(Supplementary Fig. 5) we evaluated imputation of Phase II SNPs
from the Affymetrix GeneChip 500K array. To do this, we used a
Table 2 | Estimated coverage of the Phase II HapMap in the ten HapMap ENCODE regions
Panel  MAF bin  Phase I HapMap3
Phase II HapMap
Pairwise linkage disequilibrium  Additional 2-SNP tests
r² ≥ 0.8 (%)  Mean maximum r²
r² ≥ 0.8 (%)  Mean maximum r²
r² ≥ 0.8 (%)  Mean maximum r²
2-SNP tests, linkage disequilibrium to haplotypes formed from two nearby SNPs.
Table 4 | Estimated coverage of commercially available fixed marker arrays
r² ≥ 0.8 (%)  Mean maximum r²
r² ≥ 0.8 (%)  Mean maximum r²
r² ≥ 0.8 (%)  Mean maximum r²
Affymetrix GeneChip 500K
Affymetrix SNP Array 6.0
*Assuming all SNPs on the product are informative and pass QC; in practice these numbers are overestimates.
Table 3 | Number of tag SNPs required to capture common (MAF ≥ 0.05)
Phase II SNPs
leave-one-out procedure to assess the accuracy of genotype prediction
in the YRI. For SNPs with MAF ≥ 0.2, the average maximum r²
prediction r² of 0.86. Furthermore, whereas 44% of such SNPs in the
region have no single-marker proxy with r² ≥ 0.5, fewer than 6% of
the SNPs have a genotype imputation accuracy of r² < 0.5, establishing
that accurate imputation can be achieved even in the population
where linkage disequilibrium is the weakest.
New insights into linkage disequilibrium structure
The paradigm underlying association studies is that linkage disequi-
librium can be used to capture associations between markers and
nearby untyped SNPs. However, the Phase II HapMap has revealed
several properties of linkage disequilibrium that illustrate the full
complexity of empirical patterns of genetic variation. Two striking
features are the long-range similarity among haplotypes, and SNPs
that show almost no linkage disequilibrium with any other SNP.
The extent of recent common ancestry and segmental sharing. A
simplified view of linkage disequilibrium is that genetic variation is
organized in relatively short stretches of strong linkage disequilib-
rium (haplotype blocks), each containing only a few common hap-
lotypes and separated by recombination hotspots across which little
mosomes share a recent common ancestor then similarity between
in the four populations surveyed here has not been characterized
of chromosomes, both within and across individuals, reflecting autozygosity
and identity-by-descent (IBD) (Fig. 3a). After first checking
none was found for YRI, CEU and JPT, and only small stratification
was found for CHB), we calculated genome-wide probabilities of
sharing 0, 1 or 2 chromosomes identical by descent for each pair of
individuals (see Supplementary Text 4). In addition to identifying a
that, on average, any two individuals from the same population share
approximately 0.5% of their genome through recent IBD (Table 5).
Using a hidden Markov model approach27 (see Supplementary Text
5), we searched for such shared segments over 1 megabase (Mb) long
and containing at least 50 SNPs, after first pruning the list of SNPs to
remove local linkage disequilibrium. We find that 10–30% of pairs
in each analysis panel share regions of extended identity resulting
from sharing a common ancestor within 10–100 generations. These
regions typically span hundreds of SNPs and can extend over tens of
megabases (Table 5).
Similarly, extended stretches of homozygosity are indicative of
recent inbreeding within populations28,29. Although short runs of
homozygosity are commonplace, covering up to one-third of the
genome and showing population differences reflective of ancient
linkage disequilibrium patterns (Table 5 and Fig. 3b), very long
homozygous runs exist that are clearly distinct from this process.
Including two JPT individuals who have unusually high levels of
homozygosity (NA18987 and NA18992) and one CEU individual
(NA12874), we identified 79 homozygous regions over 3 Mb in 51
tary Tables 7 and 8). Segments intersecting with suspected deletions
were first removed from the analysis (Supplementary Text 6).
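
The idea of scanning for long homozygous runs can be sketched as follows. This is a deliberately naive scan, assuming genotypes coded as minor-allele counts (0, 1 or 2, with 1 meaning heterozygous) at sorted base-pair positions; the function name and toy data are hypothetical, and the sketch ignores genotyping error, missing data and the deletion filtering described above, so it is not the method of Supplementary Text 6.

```python
def homozygous_runs(positions, genotypes, min_length_bp=3_000_000):
    """Return (start, end) pairs for maximal runs of homozygous calls.

    positions: sorted base-pair coordinates of the SNPs.
    genotypes: minor-allele counts per SNP; a value of 1 is heterozygous
               and ends the current run.
    Only runs spanning at least min_length_bp are reported.
    """
    runs = []
    start = None  # position of the first SNP in the current run
    last = None   # position of the most recent homozygous SNP
    for pos, g in zip(positions, genotypes):
        if g != 1:
            if start is None:
                start = pos
            last = pos
        else:
            if start is not None and last - start >= min_length_bp:
                runs.append((start, last))
            start = None
    if start is not None and last - start >= min_length_bp:
        runs.append((start, last))
    return runs


# Toy example (illustrative coordinates only).
toy_positions = [0, 1_000_000, 2_000_000, 3_500_000, 4_000_000, 5_000_000, 9_000_000]
toy_genotypes = [0, 2, 0, 2, 1, 0, 0]
toy_runs = homozygous_runs(toy_positions, toy_genotypes)
```

With the default 3 Mb threshold the toy data yield two runs, one on each side of the single heterozygous call; a real analysis would need to tolerate occasional miscalled heterozygotes rather than breaking a run at every one.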
ing surrounding recent mutations, usually with a frequency of much
less than 1%, has been exploited to great advantage through homozygosity
mapping30,31 and haplotype sharing32 methods. In studies of
tially offers a route for identifying rare variants (MAF in the range of
1–5%) of high penetrance33,34, which tend to be poorly captured
through single-marker association with genome-wide arrays. To
illustrate the idea, we identified SNPs where only two copies of the
minor allele are present (referred to as ‘2-SNPs’), which have minor
allele frequencies of 1–2%. We find that these are enriched approxi-
mately sevenfold (Table 5) among regions of IBD identified by the
hidden Markov model approach. Notably, identification of IBD re-
gions can be performed with the same genome-wide SNP data being
[Figure 3 axis labels: total length (Mb); chromosomes 1–22 and X.
Pair NA19130 and NA19192 (YRI): PIBD1 = 0.48; 52 segments, 1,330.8 Mb.
Pair NA06994 and NA12892 (CEU): PIBD1 = 0.06; 12 segments, 152.1 Mb.
Pair NA12006 and NA12155 (CEU): PIBD1 = 0.01; 1 segment, 7.6 Mb.]
Figure 3 | The extent of recent co-ancestry among HapMap individuals.
sharing illustrate the continuum between very close and very distant
relatedness and its relation to segmental sharing. The three pairs are: high
sharing (NA19130 and NA19192 from YRI; previously identified as second-
degree relatives3), moderate sharing (NA06994 and NA12892 from CEU)
and low sharing (NA12006 and NA12155 from CEU). Along each
chromosome, the probability of sharing at least one chromosome IBD is
genome in segments is similar to each pair’s estimated global relatedness.
b, The extent of homozygosity on each chromosome for each individual in
each analysis panel. Excludes segments <106 kb and chromosome X in
blue; JPT, magenta.
Table 5 | Relatedness, extended segmental sharing and homozygosity
Number of pairs included
Mean identity by state (IBS) (%)
Mean identity by descent (IBD) (%)
Number of pairs with >1% IBD (%)
Number of pairs with one or more
Total number of segments
Total distance spanned (Mb)
Mean segment length (Mb)
Maximum segment length (Mb)
Maximum segment length (Mb)
(including close relatives)
Total number of 2-SNPs
Number of 2-SNPs in segments
2-SNP fold increase
Number of homozygous segments
SNPs in homozygous segments (×10^5)
Total length of homozygous segments
2-SNP, SNPs where only two copies of the minor allele are present.
collected in large-scale association studies, making haplotype-
sharing approaches an attractive and complementary analysis to
standard SNP association tests, with the potential to identify rare
variants associated with complex disease.
The distribution and causes of untaggable SNPs. Despite the SNP
density of the Phase II HapMap, there are high-frequency SNPs
for which no tag can be identified. Among high-frequency SNPs
(MAF ≥ 0.2), we marked as untaggable SNPs to which no other
SNP within 100 kb has an r² value of at least 0.2. In Phase II, approximately
0.5–1.0% of all high-frequency SNPs are untaggable and the
proportion in YRI is approximately twice as high as in the other
panels. Similar proportions are observed across the ten HapMap
To identify factors influencing the location of untaggable SNPs
we considered their distribution relative to segmental duplications,
repeat sequence, CpG dinucleotide density, regions of low SNP den-
sity, unusual allele frequency distribution, linkage disequilibrium
patterns and recombination hotspots. We find no evidence for an
enrichment of untaggable SNPs in segmental duplications or repeat
sequence, as would be expected from mis-mapping of SNPs (2% and
35% of common SNPs lie in segmental duplications and repeat
sequence, respectively, compared to 1.8% and 29%, respectively, of
untaggable SNPs). Untaggable SNPs are slightly enriched in CpG
islands (0.37% of common SNPs are in CpG islands compared to
1.4% of untaggable SNPs) and have slightly reduced MAF (Fig. 4).
Most notably, untaggable SNPs are strongly enriched in regions of
low linkage disequilibrium, particularly in recombination hotspots.
the identification of recombination hotspots, we eliminated them
from 100 randomly chosen recombination hotspots and reassessed
evidence for a considerable increase in local recombination rate.
Over 50% of all untaggable SNPs lie within 1 kb of the centre of a
detected recombination hotspot and over 90% are within 5 kb.
Because only 3–4% of all SNPs lie within 1 kb from the centre of a
detected recombination hotspot (16% are within 5 kb), this constitutes
a marked enrichment and implies that at least 10% of all SNPs
within 1 kb of hotspots are untaggable. The implication for association
mapping is that when a region of interest contains a known
hotspot it may be prudent to perform additional sequencing within
the hotspot. Many of the variants identified in this manner will be
untaggable SNPs that should be genotyped directly in association
studies. From a biological perspective, the proximity of untaggable
SNPs to the centre of hotspots suggests that they may lie within gene
conversion tracts associated with the repair of double-strand breaks.
Double-strand breaks are thought to resolve as crossover events only
5–25% of the time35. Consequently, SNPs lying near the centre of a
hotspot are liable to be included within gene conversion tracts and
will experience much higher effective recombination rates than pre-
dicted from crossover rates alone.
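
The enrichment argument above is an application of Bayes' rule, and a back-of-envelope version can be written out using only the percentages quoted in this section. The sketch assumes that the quoted proportions share a common denominator, which the text does not fully specify, and it treats "over 50%" as exactly 50%; the function and variable names are illustrative.

```python
def frac_untaggable_near_hotspot(p_untaggable, p_near_given_untaggable, p_near):
    """P(untaggable | near hotspot centre) by Bayes' rule.

    p_untaggable: overall fraction of SNPs that are untaggable.
    p_near_given_untaggable: fraction of untaggable SNPs within 1 kb of a
        hotspot centre.
    p_near: fraction of all SNPs within 1 kb of a hotspot centre.
    """
    return p_near_given_untaggable * p_untaggable / p_near


# Ends and midpoints of the quoted ranges: 0.5-1.0% untaggable,
# 50% of untaggable SNPs near a centre, 3-4% of all SNPs near a centre.
low = frac_untaggable_near_hotspot(0.005, 0.5, 0.04)
mid = frac_untaggable_near_hotspot(0.0075, 0.5, 0.035)
high = frac_untaggable_near_hotspot(0.010, 0.5, 0.03)
```

Under these assumptions the estimate runs from roughly 6% to 17%, with the midpoints of the quoted ranges giving about 11%. Because more than half of untaggable SNPs lie within 1 kb, the true values would sit somewhat higher than this simplified calculation suggests.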
The distribution of recombination
In the Phase II HapMap we identified 32,996 recombination hotspots3,6,36
(an increase of over 50% from Phase I) of which 68%
localized to a region of ≤5 kb. The median map distance induced
by a hotspot is 0.043 cM (or one crossover per 2,300 meioses) and
the hottest identified, on chromosome 20, is 1.2 cM (one crossover
per 80 meioses). Hotspots account for approximately 60% of recombination
in the human genome and about 6% of sequence
(Supplementary Fig. 6). We do not find marked differences among
chromosomes in the concentration of recombination in hotspots,
which implies that obligate differences in recombination among
chromosomes of different size result from differences in hotspot
density and intensity6.
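
The conversion between map distance and crossover frequency quoted for hotspots follows from the definition of the centimorgan: over short intervals, 1 cM corresponds to an expected 0.01 crossovers per meiosis. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def meioses_per_crossover(map_distance_cm):
    """Expected number of meioses per crossover for a short interval.

    Uses the small-distance approximation that 1 cM corresponds to a
    crossover probability of 0.01 per meiosis.
    """
    if map_distance_cm <= 0:
        raise ValueError("map distance must be positive")
    return 100.0 / map_distance_cm


median_hotspot = meioses_per_crossover(0.043)  # median hotspot, 0.043 cM
hottest_hotspot = meioses_per_crossover(1.2)   # hottest hotspot, 1.2 cM
```

The two results, about 2,326 and about 83 meioses, agree with the rounded figures of 2,300 and 80 given in the text.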
stand better the influence of genomic features on the distribution of
recombination. Previous work identified specific DNA motifs that
influence hotspot location6,37 as well as additional influences of local
sequence context including the location of genes6 and base composi-
influences. Figure 5a shows the distribution of recombination, hot-
spot motifs and base composition around genes. Within the tran-
scribed region of genes there is a marked decrease in the estimated
recombination rate. However, 5′ of the transcription start site is a
density of hotspot motifs. This region also shows a marked increase
in G+C content, reflecting the presence of CpG islands in promoter
regions. There is also an asymmetry in recombination rate across
genes, with recombination rates 3′ of transcribed regions being ele-
of genes. Studies in yeast have previously suggested an association
and recombination in humans. Nevertheless, the vast majority of
hotspots in the human genome are not in gene promoters. The asso-
chromatin and crossover activity.
Systematic differences in recombination rate by gene class.
Previous work has demonstrated differences in the magnitude of
linkage disequilibrium, as measured at a megabase scale, among
etic map estimated from the Phase II HapMap data we can quantify
local increases in recombination rate associated with genes of differ-
ent function using the Panther gene ontology annotation41. Average
recombination rates vary more than sixfold among such gene
est rates (1.9 cM Mb⁻¹) and chaperones showing the lowest rates
(0.3 cM Mb⁻¹). Gene functions associated with cell surfaces and
ity, cell adhesion, extracellular matrix, ion channels, signalling)
whereas those with lower recombination rates are typically internal
tematic differences between gene classes in base composition and
gene clustering, the differences between groups remain significant.
[Figure 4 panel and axis labels: SNP density (SNPs per kb); Allele frequency; Linkage disequilibrium; Recombination rate; Hotspots (hotspots per kb)]
Figure 4 | Properties of untaggable SNPs. a–e, Properties of the genomic
regions surrounding untaggable SNPs in terms of: a, the density of
polymorphic SNPs within the consensus data set; b, mean minor allele
Phase II data; d, the density of estimated recombination hotspots (defined
from hotspot centres); and e, the estimated mean recombination rate. YRI,
green; CEU, orange; CHB+JPT, purple.
systematically among gene classes and that variation in motif density
explains over 50% of the variance in recombination rate among gene
functions (Supplementary Fig. 7).
These results pose interesting evolutionary questions. Because
hotspots may be selected against in some highly conserved parts of
the genome. In regions exposed to recurrent selection (for example,
from changes in environment or pathogen pressure) it is plausible
that recombination may be selected for. However, because the fine-scale
structure of recombination seems to evolve rapidly42,43 it will be
important to learn whether patterns of recombination rate hetero-
geneity among molecular functions are conserved between species.
that show evidence for the influence of adaptive evolution3,9, prim-
arily through extended haplotype structure indicative of recent posi-
tive selection. Using two established approaches9,44, we identified
approximately 200 regions with evidence of recent positive selection
from the Phase II HapMap (Supplementary Table 9). These regions
include many established cases of selection, such as the genes HBB
and LCT, the HLA region, and an inversion on chromosome 17.
Many other regions have been previously identified in HapMap
Phase I including LARGE, SYT1 and SULT1C2 (previously called
SULT1C1). A detailed description of the findings from the Phase II
HapMap is published elsewhere45.
The Phase II HapMap also provides new insights into the forces
acting on SNPs in coding regions. Effort was made to genotype as
many known or putative non-synonymous SNPs as possible. Of the
56,789 non-synonymous SNPs identified in dbSNP release 125,
attempts were made to genotype 36,777, which resulted in 17,427
that are QC+ in all three analysis panels and polymorphic. We
selected only those SNPs for which ancestral allele information was
available (approximately 90%). For comparison, we used patterns
of variation at synonymous SNPs. As previously reported46,47, non-
a slight decrease of common variants compared to synonymous
SNPs, compatible with widespread purifying selection against non-
synonymous mutations (Fig. 6a). In contrast, we find no excess of
high-frequency derived non-synonymous mutations, as might be
expected if positive selection were widespread.
cies differ between populations, not only through local selective pressures
that drive alleles to different frequencies48,49, but also through
distribution of population differentiation (as measured by FST, the
proportion of total variation in allele frequency that is due to differ-
ous SNPs matched for allele frequency (Fig. 6b). We find a systematic
bias for non-synonymous SNPs to show stronger differentiation than
synonymous SNPs. Among SNPs showing high levels of differentiation
there is a strong tendency for the derived allele to be at higher
frequency in non-YRI populations. Among SNPs with FST > 0.5
between CEU and YRI, in 79% and 75% of non-synonymous and
in CEU. Although this difference between non-synonymous and
synonymous SNPs is not significant, among the eight exonic SNPs
with FST > 0.95, all are non-synonymous. We see no such bias
towards increased MAF in CEU at high-differentiation SNPs, indi-
cating that SNP ascertainment is unlikely to explain the difference.
Rather, this effect can largely be explained by more genetic drift in
the non-African populations, as confirmed by simulations (data
not shown). In addition, reduced selection against deleterious muta-
tions and local adaptation within non-African populations will both
act to increase the frequency of derived variants in non-African
To assess the evidence for widespread local adaptation influencing
non-synonymous mutations we considered the distribution of
integrated extended haplotype homozygosity (iEHH) statistics9,44
(Fig. 6c). We find no evidence for systematic differences between
non-synonymous and synonymous SNPs, suggesting that local
adaptation does not explain their higher differentiation. Although
hitch-hiking effects will tend to obscure differences between selected
[Figure 5 graphic. Panel a: motif density axis in motifs per kb. Panel b axes: mean recombination rate within category (cM Mb–1) and significance level (P). Categories shown in panel b (number of genes): Defence/immunity protein (269); Cell adhesion molecule (268); Extracellular matrix (262); Ion channel (264); Signalling molecule (627); Select calcium-binding protein (190); Cell junction protein (77); Cytoskeletal protein (547); Miscellaneous function (591); Transfer/carrier protein (230); Transcription factor (1,322); Select regulatory molecule (821); Membrane traffic protein (249); Nucleic acid binding (1,567); Synthase and synthetase (170).]
Figure 5 | Recombination rates around genes. a, The recombination rate,
density of recombination-hotspot-associated motifs (all motifs with up to
1 bp different from the consensus CCTCCCTNNCCAC) and G+C content
around genes. The blue line indicates the mean. For the recombination rate,
grey lines indicate the quartiles of the distribution. Values were calculated
separately 5′ from the transcription start site (the first dotted line) and 3′
from the transcription end site (third dotted line) and were joined at the
median midpoint position of the transcription unit (central dotted line).
G+C content and the hotspot-associated motif, suggesting that additional
factors influence recombination rates around genes. b, Recombination rates
within genes of different molecular function41. The chart shows the increase
estimated by permutation of category; numbers of genes are shown in parentheses.
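The motif density in panel a counts sequences within 1 bp of the consensus CCTCCCTNNCCAC. A minimal Python sketch of such a count is given here for illustration. It assumes that N matches any base and that only the given strand is scanned; the function names and the test sequence are hypothetical, and this is not the code used to produce the figure.

```python
CONSENSUS = "CCTCCCTNNCCAC"


def mismatches(window, consensus=CONSENSUS):
    """Count positions where window differs from consensus (N matches anything)."""
    return sum(1 for w, c in zip(window, consensus) if c != "N" and w != c)


def motif_positions(seq, consensus=CONSENSUS, max_mismatch=1):
    """Start positions of windows within max_mismatch of the consensus."""
    seq = seq.upper()
    k = len(consensus)
    return [
        i
        for i in range(len(seq) - k + 1)
        if mismatches(seq[i:i + k], consensus) <= max_mismatch
    ]


def motifs_per_kb(seq, consensus=CONSENSUS, max_mismatch=1):
    """Motif density in matches per 1,000 bases of the scanned sequence."""
    return 1000.0 * len(motif_positions(seq, consensus, max_mismatch)) / len(seq)


# Hypothetical sequence: one exact match, then one match with a single mismatch.
toy_seq = "AAAA" + "CCTCCCTAGCCAC" + "TTTT" + "CCTCCCTAGCCAT" + "GGGG"
hits = motif_positions(toy_seq)
```

A genome-scale analysis would normally also scan the reverse complement and average the density in windows relative to the transcription start and end sites.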
NATURE|Vol 449|18 October 2007
the higher differentiation of non-synonymous SNPs is primarily driven
by a reduction in the strength or efficacy of purifying selection in
Discussion and prospects
The International HapMap Project has been instrumental in making
well-powered, large-scale, genome-wide association studies a reality.
and analysis of disease association studies in populations across the
world50–53. Furthermore, the decreasing costs and increasing SNP
density of standard genotyping panels mean that the focus of atten-
tion in disease association studies is shifting from candidate gene
approaches towards genome-wide analyses. Alongside developments
in technology, new statistical methodologies aimed at improving
aspects of analysis, such as genotype calling21,54, the identification
of and correction for population stratification and relatedness55,56,
and imputation of untyped variants21–23, are increasing the accuracy
and reliability of genome-wide association studies.
Within this context, it is important to consider the future of the
HapMap Project. Currently, additional samples from the popula-
tions used to develop the initial HapMap, as well as samples from
seven additional populations (Luhya in Webuye, Kenya; Maasai in
USA; Denver (Colorado) metropolitan Chinese community; people
of Mexican origin in Los Angeles, California, USA; and people with
African ancestry in the southwestern United States;
http://ccr.coriell.org/Sections/Collections/NHGRI/?SsId=11) will be sequenced and
on rarer variants and helping to enable genome-wide association
studies in additional populations. There are also ongoing efforts by
many groups to characterize additional forms of genetic variation,
such as structural variation, and molecular phenotypes in the
HapMap samples. Finally, in the future, whole-genome sequencing
will provide a natural convergence of technologies to type both SNP
and structural variation. Nevertheless, until that point, and even
after, the HapMap Project data will provide an invaluable resource
for understanding the structure of human genetic variation and its
link to phenotype.
Of approximately 6.9 million SNPs in dbSNP release 122 approximately 4.7
reasons (see Methods). Perlegen performed genotyping using custom high-
density oligonucleotide arrays as previously described15. Additional genotype
submissions are described in the text. QC filters were applied as previously
the lowest missing data rate was chosen for inclusion in the non-redundant
filtered data set. Haplotypes were estimated from genotype data as described
previously3. Ancestral states at SNPs were inferred by parsimony by comparison
to orthologous bases in the chimpanzee (panTro2) and rhesus macaque
(rheMac2) assemblies. Recombination rates and the location of recombination
hotspots were estimated as described previously3. Additional details can be
found in the Methods section and the Supplementary Information. The data
described in this paper are in release 21 of the International HapMap Project.
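The parsimony step for ancestral states can be sketched as follows. This Python example is illustrative only: the decision rule (accept a human allele as ancestral when it is the only one supported by the available outgroup bases, otherwise report no call) is an assumption, as are the function name and inputs; the exact rule applied to the panTro2 and rheMac2 alignments is not given in this section.

```python
BASES = {"A", "C", "G", "T"}


def infer_ancestral_allele(human_alleles, chimp_base, macaque_base):
    """Return the human allele supported by the outgroups, or None if ambiguous.

    An outgroup base is ignored when it is missing (None), is not A/C/G/T,
    or matches neither human allele.
    """
    alleles = {a.upper() for a in human_alleles}
    supported = set()
    for base in (chimp_base, macaque_base):
        if base is None:
            continue
        base = base.upper()
        if base in BASES and base in alleles:
            supported.add(base)
    # Exactly one supported allele: the parsimonious ancestral state.
    if len(supported) == 1:
        return supported.pop()
    # No support, or the two outgroups support different alleles.
    return None


# Hypothetical examples.
agree = infer_ancestral_allele(("A", "G"), "A", "A")
conflict = infer_ancestral_allele(("A", "G"), "G", "A")
chimp_only = infer_ancestral_allele(("A", "G"), "a", None)
no_support = infer_ancestral_allele(("A", "G"), "T", "C")
```

SNPs that return no call under such a rule would be excluded from analyses that need ancestral allele information.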
Full Methods and any associated references are available in the online version of
the paper at www.nature.com/nature.
Received 12 April; accepted 18 September 2007.
1. The International HapMap Consortium. Integrating ethics and science in the
International HapMap Project. Nature Rev. Genet. 5, 467–475 (2004).
2. The International HapMap Consortium. The International HapMap Project.
Nature 426, 789–796 (2003).
3. The International HapMap Consortium. A haplotype map of the human genome.
Nature 437, 1299–1320 (2005).
4. Bowcock, A. M. Genomics: guilt by association. Nature 447, 645–646 (2007).
5. Altshuler, D. & Daly, M. Guilt beyond a reasonable doubt. Nature Genet. 39,
6. Myers, S., Bottolo, L., Freeman, C., McVean, G. & Donnelly, P. A fine-scale map of
recombination rates and hotspots across the human genome. Science 310,
7. McCarroll, S. A. et al. Common deletion polymorphisms in the human genome.
Nature Genet. 38, 86–92 (2006).
8. Conrad, D. F., Andrews, T. D., Carter, N. P., Hurles, M. E. & Pritchard, J. K. A high-
resolution survey of deletion polymorphism in the human genome. Nature Genet.
38, 75–81 (2006).
9. Voight, B. F., Kudaravalli, S., Wen, X. & Pritchard, J. K. A map of recent positive
selection in the human genome. PLoS Biol. 4, e72 (2006).
11. de Bakker, P. I. et al. A high-resolution HLA and SNP haplotype map for disease
association studies in the extended human MHC. Nature Genet. 38, 1166–1172
12. Pastinen, T. et al. Mapping common regulatory variants to human haplotypes.
Hum. Mol. Genet. 14, 3963–3971 (2005).
13. Stranger, B. E. et al. Genome-wide associations of gene expression variation in
humans. PLoS Genet. 1, e78 (2005).
14. Cheung, V. G. et al. Mapping determinants of human gene expression by regional
and genome-wide association. Nature 437, 1365–1369 (2005).
15. Hinds, D. A. et al. Whole-genome patterns of common DNA variation in three
human populations. Science 307, 1072–1079 (2005).
16. de Bakker, P. I. et al. Efficiency and power in genetic association studies. Nature
Genet. 37, 1217–1223 (2005).
17. Pe’er, I. et al. Evaluating and improving power in whole-genome association
studies using fixed marker sets. Nature Genet. 38, 663–667 (2006).
18. Barrett, J. C. & Cardon, L. R. Evaluating coverage of genome-wide association
studies. Nature Genet. 38, 659–662 (2006).
19. Burdick, J. T., Chen, W. M., Abecasis, G. R. & Cheung, V. G. In silico
method for inferring genotypes in pedigrees. Nature Genet. 38, 1002–1004
20. Servin, B. R. & Stephens, M. Imputation-based analysis of association studies:
candidate regions and quantitative traits. PLoS Genet. 3, e114 (2007).
[Figure 6 graphic. Panel axes: average DAF across panels; proportion of SNPs with FST > 0.5; proportion of top 5% iEHH.]
Figure 6 | Properties of non-synonymous and synonymous SNPs. a, The
derived allele frequency (DAF) spectrum in each analysis panel for all SNPs
(black), synonymous SNPs (green) and non-synonymous SNPs (red). Note
the excess of rare variants for coding sequence SNPs but no excess of high-
genic SNPs showing high differentiation. For each of ten classes of derived
allele frequency (averaged across analysis panels) the fraction of non-
synonymous (red) and synonymous (green) variants in that class that show
FST > 0.5 is shown. Note the strong enrichment of non-synonymous SNPs
among SNPs of moderate to high derived-allele frequency (asterisk,
P < 0.05; double asterisk, P < 0.01). c, Lack of enrichment of non-
synonymous SNPs among those showing long-range haplotype structure.
The integrated extended haplotype homozygosity (iEHH) statistic9 was
calculated for non-synonymous and synonymous SNPs in each analysis
allele frequency classes, the proportion of non-synonymous SNPs among
those showing the 5% most extreme statistics (within the allele frequency
class) is shown (points). Also shown is the proportion of non-synonymous
SNPs among SNPs in the coding sequence for each frequency class (dotted
lines). Differences between synonymous and non-synonymous SNPs are
tested for using a contingency table test.
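The quantity plotted in panel b, the fraction of SNPs with FST > 0.5, can be illustrated with a short Python sketch. It uses the simple two-population definition of FST as the proportion of total heterozygosity due to between-population allele frequency differences, with equal population weights. This is an assumption made for the example and is not necessarily the estimator used for the figure; the function names and toy frequencies are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1, p2):
    """FST = (H_total - H_within) / H_total for two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = heterozygosity(p_bar)
    if h_total == 0.0:
        return 0.0  # monomorphic in the pooled sample: no differentiation
    h_within = (heterozygosity(p1) + heterozygosity(p2)) / 2.0
    return (h_total - h_within) / h_total


def fraction_highly_differentiated(freq_pairs, threshold=0.5):
    """Fraction of SNPs whose FST exceeds the threshold."""
    if not freq_pairs:
        return 0.0
    high = sum(1 for p1, p2 in freq_pairs if fst_two_populations(p1, p2) > threshold)
    return high / len(freq_pairs)


# Hypothetical derived-allele frequencies in two analysis panels.
toy_snps = [(0.9, 0.1), (0.5, 0.5), (1.0, 0.0), (0.6, 0.4)]
fraction_high = fraction_highly_differentiated(toy_snps)
```

Computing this fraction separately for non-synonymous and synonymous SNPs within each derived-allele-frequency class gives the two series compared in the panel.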
21. The Wellcome Trust Case Control Consortium. Genome-wide association study
of 14,000 cases of seven common diseases and 3,000 shared controls. Nature
447, 661–668 (2007).
22. Scott, L. J. et al. A genome-wide association study of type 2 diabetes in Finns
detects multiple susceptibility variants. Science 316, 1341–1345 (2007).
23. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint
Genet. 39, 906–913 (2007).
24. Chapman, J. M., Cooper, J. D., Todd, J. A. & Clayton, D. G. Detecting disease
associations due to linkage disequilibrium using haplotype tags: a class of tests
and the determinants of statistical power. Hum. Hered. 56, 18–31 (2003).
25. Paabo, S. The mosaic that is our genome. Nature 421, 409–412 (2003).
26. McVean, G., Spencer, C. C. & Chaix, R. Perspectives on human genetic variation
from the HapMap Project. PLoS Genet. 1, e54 (2005).
27. Purcell, S. et al. PLINK: a toolset for whole-genome association and population-
based linkage analysis. Am. J. Hum. Genet. 81, 559–575 (2007).
28. Broman, K. W. & Weber, J. L. Long homozygous chromosomal segments in
reference families from the centre d’Etude du polymorphisme humain. Am. J.
Hum. Genet. 65, 1493–1500 (1999).
29. Gibson, J., Morton, N. E. & Collins, A. Extended tracts of homozygosity in outbred
human populations. Hum. Mol. Genet. 15, 789–795 (2006).
30. Lander, E. S. & Botstein, D. Homozygosity mapping: a way to map human
recessive traits with the DNA of inbred children. Science 236, 1567–1570 (1987).
31. Leutenegger, A. L. et al. Using genomic inbreeding coefficient estimates for
homozygosity mapping of rare recessive traits: application to Taybi-Linder
syndrome. Am. J. Hum. Genet. 79, 62–66 (2006).
32. Te Meerman, G. J., Van der Meulen, M. A. & Sandkuijl, L. A. Perspectives of
identity by descent (IBD) mapping in founder populations. Clin. Exp. Allergy 25
(Suppl 2), 97–102 (1995).
33. Houwen, R. H. et al. Genome screening by searching for shared segments:
mapping a gene for benign recurrent intrahepatic cholestasis. Nature Genet. 8,
34. Durham, L. K. & Feingold, E. Genome scanning for segments shared identical by
descent among distant relatives in isolated populations. Am. J. Hum. Genet. 61,
35. Jeffreys, A. J. & May, C. A. Intense and highly localized gene conversion activity in
human meiotic crossover hot spots. Nature Genet. 36, 151–156 (2004).
human genome. Science 304, 581–584 (2004).
genome. Biochem. Soc. Trans. 34, 526–530 (2006).
38. Spencer, C. C. et al. The influence of recombination on human genetic diversity.
PLoS Genet. 2, e148 (2006).
39. Petes, T. D. Meiotic recombination hot spots and cold spots. Nature Rev. Genet. 2,
40. Smith, A. V., Thomas, D. J., Munro, H. M. & Abecasis, G. R. Sequence features in
regions of weak and strong linkage disequilibrium. Genome Res. 15, 1519–1534
41. Thomas, P. D. et al. PANTHER: a library of protein families and subfamilies
indexed by function. Genome Res. 13, 2129–2141 (2003).
42. Winckler, W. et al. Comparison of fine-scale recombination rates in humans and
chimpanzees. Science 308, 107–111 (2005).
43. Ptak, S. E. et al. Fine-scale recombination patterns differ between chimpanzees
and humans. Nature Genet. 37, 429–434 (2005).
44. Sabeti, P. C. et al. Detecting recent positive selection in the human genome from
haplotype structure. Nature 419, 832–837 (2002).
45. Sabeti, P. C. et al. Genome-wide detection and characterization of positive
selection in human populations. Nature doi:10.1038/nature06250 (this issue).
46. Bustamante, C. D. et al. Natural selection on protein-coding genes in the human
genome. Nature 437, 1153–1157 (2005).
47. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding
regions of human genes. Nature Genet. 22, 231–238 (1999).
48. Akey, J. M., Zhang, G., Zhang, K., Jin, L. & Shriver, M. D. Interrogating a high-
density SNP map for signatures of natural selection. Genome Res. 12, 1805–1814
49. Sabeti, P. C. et al. Positive natural selection in the human lineage. Science 312,
50. de Bakker, P. I. et al. Transferability of tag SNPs in genetic association studies in
multiple populations. Nature Genet. 38, 1298–1303 (2006).
51. Conrad, D. F. et al. A worldwide survey of haplotype variation and linkage
disequilibrium in the human genome. Nature Genet. 38, 1251–1260 (2006).
52. Service, S., Sabatti, C. & Freimer, N. Tag SNPs chosen from HapMap perform well
in several population isolates. Genet. Epidemiol. 31, 189–194 (2007).
53. Lim, J. et al. Comparative study of the linkage disequilibrium of an ENCODE
region, chromosome 7p15, in Korean, Japanese, and Han Chinese samples.
Genomics 87, 392–398 (2006).
54. Rabbee, N. & Speed, T. P. A genotype calling algorithm for Affymetrix SNP arrays.
Bioinformatics 22, 7–12 (2006).
55. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-
based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
56. Price, A. L. et al. Principal components analysis corrects for stratification in
genome-wide association studies. Nature Genet. 38, 904–909 (2006).
57. Smith, R. A., Ho, P. J., Clegg, J. B., Kidd, J. R. & Thein, S. L. Recombination
breakpoints in the human β-globin gene cluster. Blood 92, 4415–4421
58. Holloway, K., Lawson, V. E. & Jeffreys, A. J. Allelic recombination and de novo
deletions in sperm in the human β-globin gene region. Hum. Mol. Genet. 15,
59. Weir, B. S. & Cockerham, C. C. Estimating F-statistics for the analysis of
population structure. Evolution 38, 1358–1370 (1984).
Supplementary Information is linked to the online version of the paper at
Acknowledgements We thank many people who contributed to this project: all
members of the genotyping laboratory and the sample, primer, bioinformatics,
data quality and IT groups at Perlegen Sciences for technical and infrastructural
support; J. Beck, C. Beiswanger, D. Coppock, A. Leach, J. Mintzer and L. Toji for
transforming the Yoruba, Japanese and Han Chinese samples, distributing the
DNA and cell lines, storing the samples for use in future research, and producing
the community newsletters and reports; J. Greenberg and R. Anderson for
providing funding and support for cell line transformation and storage in the
NIGMS Human Genetic Cell Repository at the Coriell Institute; T. Dibling,
and A. Moghadam for technical support in genotyping and all members of the
computing resources; H. Chen, W. Chen, L. Deng, Y. Dong, C. Fu, L. Gao, H. Geng,
Z. Wang, C. Ye and X. Yu for help with genotyping and sample collection; X. Feng,
Y. Li, J. Ren and X. Zhou for help with sample collection; J. Fan, W. Gu, W. Guan,
S. Hu, H. Jiang, R. Lei, Y. Lin, Z. Niu, B. Wang, L. Yang, W. Yang, Y. Wang, Z. Wang,
with genotyping; P. Fong, C. Lai, C. Lau, T. Leung, L. Luk and W. Tong for help with
genotyping; C. Pang for help with genotyping; K. Ding, B. Qiang, J. Zhang, X. Zhang
and K. Zhou for help with genotyping; Q. Fu, S. Ghose, X. Lu, D. Nelson, A. Perez,
S. Poole, R. Vega and H. Yonath for help with genotyping; C. Bruckner, T. Brundage,
S. Chow, O. Iartchouk, M. Jain, M. Moorhead and K. Tran for help with genotyping;
design; R. Donaldson and S. Duan for help with genotyping, and J. Rice and
N. Saccone for help with experimental design; J. Wigginton for help with
implementing and testing QA/QC software; A. Clark, B. Keats, R. Myers,
D. Nickerson and A. Williamson for providing advice to NIH; C. Juenger, C. Bennet,
C. Bird, J. Melone, P. Nailer, M. Weiss, J. Witonsky and E. DeHaut-Combs for help
for help with figures; the Yoruba people of Ibadan, Nigeria, the people of Tokyo,
Japan, and the community at Beijing Normal University, who participated in public
consultations and community engagements; the people in these communities who
donated their blood samples; and the people in the Utah CEPH community who
allowed the samples they donated earlier to be used for the Project. This work was
supported by the Japanese Ministry of Education, Culture, Sports, Science and
Technology, the Wellcome Trust, Nuffield Trust, Wolfson Foundation, UK EPSRC,
Genome Canada, Génome Québec, the Chinese Academy of Sciences, the Ministry
of Science and Technology of the People’s Republic of China, the National Natural
Science Foundation of China, the Hong Kong Innovation and Technology
Commission, the University Grants Committee of Hong Kong, the SNP
Consortium, the US National Institutes of Health (FIC, NCI, NCRR, NEI, NHGRI,
NIA, NIAAA, NIAID, NIAMS, NIBIB, NIDA, NIDCD, NIDCR, NIDDK, NIEHS,
NIGMS, NIMH, NINDS, NLM, OD), the W.M. Keck Foundation, and the Delores
from dbSNP (http://www.ncbi.nlm.nih.gov/SNP); all genotype information is
available from dbSNP and the HapMap website (http://www.hapmap.org).
Author Information Reprints and permissions information is available at
www.nature.com/reprints. The authors declare competing financial interests:
details accompany the full-text HTML version of the paper at www.nature.com/
nature. Correspondence and requests for materials should be addressed to G.M.
(firstname.lastname@example.org) or M.D. (email@example.com).
The International HapMap Consortium (Participants are arranged by institution and
then alphabetically within institutions except for Principal Investigators and Project
Leaders, as indicated.)
Genotyping centres: Perlegen Sciences Kelly A. Frazer (Principal Investigator)1,
Belmont3, Andrew Boudreau4, Paul Hardenbol5, Suzanne M. Leal3, Shiran Pasternak6,
Yang (Principal Investigator)8, Changqing Zeng (Principal Investigator)8, Yang Gao8,
Jun Zhou8; Broad Institute of Harvard and Massachusetts Institute of Technology
Stacey B. Gabriel (Project Leader)7, Rachel Barry7, Brendan Blumenstiel7, Amy
Moore7, Huy Nguyen7, Robert C. Onofrio7, Melissa Parkin7, Jessica Roy7, Erich Stahl7,
National Human Genome Center at Beijing Yan Shen (Principal Investigator)10,
Zhijian Yao10; Chinese National Human Genome Center at Shanghai Wei Huang
Weiwei Sun11, Haifeng Wang11, Yi Wang11, Ying Wang11, Xiaoyan Xiong11, Liang Xu11;
K. W. Tsui13; Hong Kong University of Science and Technology Hong Xue (Principal
Investigator)14, J. Tze-Fei Wong14; Illumina Luana M. Galver (Project Leader)15,
Jian-Bing Fan15, Kevin Gunderson15, Sarah S. Murray1, Arnold R. Oliphant16, Mark S.
Chee (Principal Investigator)17; McGill University and Génome Québec Innovation
Centre Alexandre Montpetit (Project Leader)18, Fanny Chagnon18, Vincent Ferretti18,
Martin Leboeuf18, Jean-François Olivier4, Michael S. Phillips18, Stéphanie Roumy15,
Clémentine Sallée19, Andrei Verner18, Thomas J. Hudson (Principal Investigator)20;
University of California at San Francisco and Washington University Pui-Yan Kwok
(Principal Investigator)21, Dongmei Cai21, Daniel C. Koboldt22, Raymond D. Miller22,
Ludmila Pawlikowska21, Patricia Taillon-Miller22, Ming Xiao21; University of Hong
K. H. Tam23; University of Tokyo and RIKEN Yusuke Nakamura (Principal
Investigator)24,25, Takahisa Kawaguchi25, Takuya Kitamoto25, Takashi Morizono25,
Atsushi Nagashima25, Yozo Ohnishi25, Akihiro Sekine25, Toshihiro Tanaka25,
Tatsuhiko Tsunoda25; Wellcome Trust Sanger Institute Panos Deloukas (Project
Leader)26, Christine P. Bird26, Marcos Delgado26, Emmanouil T. Dermitzakis26, Rhian
Gwilliam26, Sarah Hunt26, Jonathan Morrison27, Don Powell26, Barbara E. Stranger26,
Pamela Whittaker26, David R. Bentley (Principal Investigator)28
Analysis groups: Broad Institute Mark J. Daly (Project Leader)7,9, Paul I. W. de
Bakker7,9, Jeff Barrett7,9, Yves R. Chretien7, Julian Maller7,9, Steve McCarroll7,9, Nick
(Principal Investigator)7,9; Cold Spring Harbor Laboratory Lincoln D. Stein (Principal
Investigator)6, Lalitha Krishnan6, Albert Vernon Smith6, Marcela K. Tello-Ruiz6,
Gudmundur A. Thorisson30; Johns Hopkins University School of Medicine Aravinda
Chakravarti (Principal Investigator)31, Peter E. Chen31, David J. Cutler31, Carl S.
Kashuk31, Shin Lin31; University of Michigan Gonçalo R. Abecasis (Principal
Investigator)32, Weihua Guan32, Yun Li32, Heather M. Munro33, Zhaohui Steve Qin32,
Daryl J. Thomas34; University of Oxford Gilean McVean (Project Leader)35, Adam
Auton35, Leonardo Bottolo35, Niall Cardin35, Susana Eyheramendy35, Colin Freeman35,
Jonathan Marchini35, Simon Myers35, Chris Spencer7, Matthew Stephens36, Peter
Donnelly (Principal Investigator)35; University of Oxford, Wellcome Trust Centre for
Human Genetics Lon R. Cardon (Principal Investigator)37, Geraldine Clarke38, David
Investigator)25, Todd A. Johnson25; US National Institutes of Health James C.
Mullikin40; US National Institutes of Health National Center for Biotechnology
Information Stephen T. Sherry41, Michael Feolo41, Andrew Skol42
Community engagement/public consultation and sample collection groups: Beijing
Normal University and Beijing Genomics Institute Houcan Zhang43, Changqing
Zeng8, Hui Zhao8; Health Sciences University of Hokkaido, Eubios Ethics Institute,
and Shinshu University Ichiro Matsuda (Principal Investigator)44, Yoshimitsu
Fukushima45, Darryl R. Macer46, Eiko Suda47; Howard University and University of
Ibadan Charles N. Rotimi (Principal Investigator)48, Clement A. Adebamowo49, Ike
Ajayi49, Toyin Aniagwu49, Patricia A. Marshall50, Chibuzor Nkwodimmah49,
Charmaine D. M. Royal48; University of Utah Mark F. Leppert (Principal
Investigator)51, Missy Dixon51, Andy Peiffer51
Ethical, legal and social issues: Chinese Academy of Social Sciences Renzong Qiu52;
Genetic Interest Group Alastair Kent53; Kyoto University Kazuto Kato54; Nagasaki
University Norio Niikawa55; University of Ibadan School of Medicine Isaac F.
Adewole49; University of Montréal Bartha M. Knoppers19; University of Oklahoma
Morris W. Foster56; Vanderbilt University Ellen Wright Clayton57; Wellcome Trust
SNP discovery: Baylor College of Medicine Richard A. Gibbs (Principal Investigator)3,
John W. Belmont3, Donna Muzny3, Lynne Nazareth3, Erica Sodergren3, George M.
Weinstock3, David A. Wheeler3, Imtaz Yakub3; Broad Institute of Harvard and
Massachusetts Institute of Technology Stacey B. Gabriel (Project Leader)7, Robert C.
Onofrio7, Daniel J. Richter7, Liuda Ziaugra7, Bruce W. Birren7, Mark J. Daly7,9, David
Altshuler (Principal Investigator)7,9; Washington University Richard K. Wilson
(Principal Investigator)59, Lucinda L. Fulton59; Wellcome Trust Sanger Institute Jane
Rogers (Principal Investigator)26, John Burton26, Nigel P. Carter26, Christopher M.
Clee26, Mark Griffiths26, Matthew C. Jones26, Kirsten McLay26, Robert W. Plumb26,
Mark T. Ross26, Sarah K. Sims26, David L. Willey26
Scientific management: Chinese Academy of Sciences Zhu Chen60, Hua Han60, Le
Kang60; Genome Canada Martin Godbout61, John C. Wallenburg62; Génome Québec
Paul L’Archevêque63, Guy Bellemare63; Japanese Ministry of Education, Culture,
Sports, Science and Technology Koji Saeki64; Ministry of Science and Technology of
the People’s Republic of China Hongguang Wang65, Daochang An65, Hongbo Fu65,
Qing Li65, Zhen Wang65; The Human Genetic Resource Administration of China
Renwu Wang66; The SNP Consortium Arthur L. Holden15; US National Institutes of
Health Lisa D. Brooks67, Jean E. McEwen67, Mark S. Guyer67, Vivian Ota Wang67,68,
Jane L. Peterson67, Michael Shi69, Jack Spiegel70, Lawrence M. Sung71, Lynn F.
Zacharia67, Francis S. Collins72; Wellcome Trust Karen Kennedy61, Ruth Jamieson58,
1The Scripps Research Institute, 10550 North Torrey Pines Road MEM275, La Jolla,
California 92037, USA.2Perlegen Sciences, Inc., 2021 Stierlin Court, Mountain View,
California 94043, USA.3Baylor College of Medicine, Human Genome Sequencing
Center, Department of Molecular and Human Genetics, 1 Baylor Plaza, Houston, Texas
77030, USA.4Affymetrix, Inc., 3420 Central Expressway, Santa Clara, California 95051,
USA.5Pacific Biosciences, 1505 Adams Drive, Menlo Park, California 94025, USA.6Cold
Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA.
7The Broad Institute of Harvard and Massachusetts Institute of Technology, 1 Kendall
Square, Cambridge, Massachusetts 02139, USA.8Beijing Genomics Institute, Chinese
Academy of Sciences, Beijing 100300, China.9Massachusetts General Hospital and
Harvard Medical School, Simches Research Center, 185 Cambridge Street, Boston,
201203, China.12Fudan University and CAS-MPG Partner Institute for Computational
Biology, School of Life Sciences, SIBS, CAS, Shanghai 201203, China.13The Chinese
University of Hong Kong, Department of Biochemistry, The Croucher Laboratory for
Human Genetics, 6/F Mong Man Wai Building, Shatin, Hong Kong.14Hong Kong
University of Science and Technology, Department of Biochemistry and Applied
Genomics Center, Clear Water Bay, Kowloon, Hong Kong.15Illumina, 9885 Towne
Centre Drive, San Diego, California 92121, USA.16Complete Genomics, Inc., 658 North
Pastoria Avenue, Sunnyvale, California 94085, USA.17Prognosys Biosciences, Inc., 4215
Sorrento Valley Boulevard, Suite 105, San Diego, California 92121, USA.18McGill
University and Génome Québec Innovation Centre, 740 Dr. Penfield Avenue, Montréal,
Québec H3A 1A4, Canada.19University of Montréal, The Public Law Research Centre
(CRDP), PO Box 6128, Downtown Station, Montréal, Québec H3C 3J7, Canada.
20Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street,
Suite 500, Toronto, Ontario M5G 1L7, Canada.21University of California, San Francisco,
Cardiovascular Research Institute, 513 Parnassus Avenue, Box 0793, San Francisco,
California 94143, USA.22Washington University School of Medicine, Department of
Genetics, 660 South Euclid Avenue, Box 8232, St Louis, Missouri 63110, USA.
23University of Hong Kong, Genome Research Centre, 6/F, Laboratory Block, 21 Sassoon
Road, Pokfulam, Hong Kong.24University of Tokyo, Institute of Medical Science, 4-6-1
Sirokanedai, Minato-ku, Tokyo 108-8639, Japan.25RIKEN SNP Research Center, 1-7-22
Suehiro-cho, Tsurumi-ku Yokohama, Kanagawa 230-0045, Japan.26Wellcome Trust
Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA,
UK.27University of Cambridge, Department of Oncology, Cambridge CB1 8RN, UK.
28Solexa Ltd, Chesterford Research Park, Little Chesterford, Nr Saffron Walden, Essex
CB10 1XL, UK.29Columbia University, 500 West 120th Street, New York, New York
10027, USA.30University of Leicester, Department of Genetics, Leicester LE1 7RH, UK.
31Johns Hopkins University School of Medicine, McKusick-Nathans Institute of Genetic
Medicine, Broadway Research Building, Suite 579, 733 North Broadway, Baltimore,
Maryland 21205, USA.32University of Michigan, Center for Statistical Genetics,
Department of Biostatistics, 1420 Washington Heights, Ann Arbor, Michigan 48109,
USA.33International Epidemiology Institute, 1455 Research Boulevard, Suite 550,
Rockville, Maryland 20850, USA.34Center for Biomolecular Science and Engineering,
Engineering 2, Suite 501, Mail Stop CBSE/ITI, UC Santa Cruz, Santa Cruz, California
95064, USA.35University of Oxford, Department of Statistics, 1 South Parks Road,
Oxford OX1 3TG, UK.36University of Chicago, Department of Statistics, 5734 South
University Avenue, Eckhart Hall, Room 126, Chicago, Illinois 60637, USA.37Fred
Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington
98109, USA.38University of Oxford/Wellcome Trust Centre for Human Genetics,
Roosevelt Drive, Oxford OX3 7BN, UK.39University of Washington Department of
Biostatistics, Box 357232, Seattle, Washington 98195, USA.40US National Institutes of
Health, National Human Genome Research Institute, 50 South Drive, Bethesda,
Maryland 20892, USA.41US National Institutes of Health, National Library of Medicine,
20894, USA.42University of Chicago, Department of Medicine, Section of Genetic
Medicine, 5801 South Ellis, Chicago, Illinois 60637, USA.43Beijing Normal University, 19
Xinjiekouwai Street, Beijing 100875, China.44Health Sciences University of Hokkaido,
Ishikari Tobetsu Machi 1757, Hokkaido 061-0293, Japan.45Shinshu University School of
Medicine, Department of Medical Genetics, Matsumoto 390-8621, Japan.46United
Nations Educational, Scientific and Cultural Organization (UNESCO Bangkok), 920
Sukhumwit Road, Prakanong, Bangkok 10110, Thailand.47University of Tsukuba, Eubios
Ethics Institute, PO Box 125, Tsukuba Science City 305-8691, Japan.48Howard
University, National Human Genome Center, 2216 6th Street, NW, Washington DC
20059, USA.49University of Ibadan College of Medicine, Ibadan, Oyo State, Nigeria.
Euclid Avenue, Cleveland, Ohio 44106, USA.51University of Utah, Eccles Institute of
Human Genetics, Department of Human Genetics, 15 North 2030 East, Salt Lake City,
Utah 84112, USA.52Chinese Academy of Social Sciences, Institute of Philosophy/Center
for Applied Ethics, 2121, Building 9, Caoqiao Xinyuan 3 Qu, Beijing 100067, China.
53Genetic Interest Group, 4D Leroy House, 436 Essex Road, London N130P, UK.54Kyoto
University, Institute for Research in Humanities and Graduate School of Biostudies,
Ushinomiya-cho, Sakyo-ku, Kyoto 606-8501, Japan.55Nagasaki University Graduate
School of Biomedical Sciences, Department of Human Genetics, Sakamoto 1-12-4,
West Lindsey Street, Norman, Oklahoma 73019, USA.57Vanderbilt University, Center
for Genetics and Health Policy, 507 Light Hall, Nashville, Tennessee 37232, USA.
58Wellcome Trust, 215 Euston Road, London NW1 2BE, UK.59Washington University
Louis, Missouri 63108, USA.60Chinese Academy of Sciences, 52 Sanlihe Road, Beijing
1P1, Canada.62McGill University, Office of Technology Transfer, 3550 University Street,
Montréal, Québec H3A 2A7, Canada.63Génome Québec, 630, boulevard
René-Lévesque Ouest, Montréal, Québec H3B 1S6, Canada.64Ministry of Education,
Culture, Sports, Science, and Technology, 3-2-2 Kasumigaseki, Chiyodaku, Tokyo
100-8959, Japan.65Ministry of Science and Technology of the People’s Republic of
China, 15 B. Fuxing Road, Beijing 100862, China.66The Human Genetic Resource
Administration of China, b7, Zaojunmiao, Haidian District, Beijing 100081, China.67US
National Institutes of Health, National Human Genome Research Institute, 5635 Fishers
Lane, Bethesda, Maryland 20892, USA.68US National Institutes of Health, Office of
Behavioral and Social Science Research, 31 Center Drive, Bethesda, Maryland 20892,
USA.69Novartis Pharmaceuticals Corporation, Biomarker Development, One Health
Plaza, East Hanover, New Jersey 07936, USA.70US National Institutes of Health, Office
of Technology Transfer, 6011 Executive Boulevard, Rockville, Maryland 20852, USA.
21201, USA.72US National Institutes of Health, National Human Genome Research
Institute, 31 Center Drive, Bethesda, Maryland 20892, USA.
METHODS
SNP selection and genotyping. All SNPs in dbSNP release 122 were considered
for genotyping by Perlegen. Among these the following were excluded: SNPs for
which no assay could be designed (primarily through location in repeat-rich
regions; approximately 2.5 million); SNPs shown previously in samples from
related populations15 to be most probably in perfect association (r² = 1) with a
Phase I SNP (approximately 122,000); all but one of SNPs shown previously15 to
be most probably in perfect association (r² = 1) with each other but not with a
Phase I SNP (approximately 62,000); and SNPs shown previously15 to have
MAF < 0.05 (approximately 119,000). In addition, a few SNPs were excluded
for efficiency (for example, if an amplicon contained a single SNP).
Approximately 30,000 SNPs that had been typed in Phase I were deliberately
retyped in Phase II to allow detailed comparisons of data quality, and an addi-
tional 15,000 SNPs that showed discrepancies between multiple genotyping
attempts in Phase I were re-typed in Phase II. A further 2,000 SNPs identified
by the Mammalian Gene Collection were also typed.
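
The two statistics used in the exclusion criteria above, minor allele frequency (MAF) and pairwise r², can be illustrated with a minimal sketch. The function names and the 0/1 allele coding are assumptions made for this example; they are not taken from the Perlegen or HapMap pipelines, and the sketch assumes phased, biallelic data with no missing calls.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the less common allele at a biallelic SNP.

    `alleles` is a list of allele codes, one per chromosome (hypothetical input format).
    """
    counts = Counter(alleles)
    if len(counts) > 2:
        raise ValueError("expected a biallelic SNP")
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / sum(counts.values())


def r_squared(haplotypes):
    """Pairwise r^2 between two SNPs from phased haplotypes.

    `haplotypes` is a list of (a, b) pairs, each allele coded 0 or 1.
    r^2 = D^2 / (pA (1 - pA) pB (1 - pB)), with D = pAB - pA * pB.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom
```

Under these assumptions, a pair of SNPs whose alleles always co-occur gives r² = 1 (perfect association), and a SNP would fall under the frequency exclusion when `minor_allele_frequency` returns a value below 0.05.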
Perlegen performed genotyping using custom high-density oligonucleotide
arrays as previously described15. Initially, a pilot phase was carried out on chro-
mosome 2p to optimize experimental workflow and data handling. Details
of amplicons used in the experiment and PCR primers can be found at
http://genome.perlegen.com/pcr/ and also on the HapMap website. The arrays
SNP. These consisted of four sets of features, corresponding to forward and
Within a feature set, the position of the SNP within the oligonucleotide varied
from position 11 to position 15. Mismatch probes were used to measure back-
or absence of a specific PCR product. The 40-feature and 24-feature tilings both
provided 10 perfect-match features for each SNP allele and differed only in the
number of mismatch probes.
Genotypes were scored by clustering intensity measurements as previously
described15. In addition, quality scores similar to Phred scores were computed
for each genotype call, based on a combination of experimental metrics corre-
lated to data quality. Assays with overall call rates less than 80% or with poor
average quality scores were flagged as failed. About 38% of the tiled assays failed
these basic criteria, and the remainder were processed using the more rigorous
HapMap Project data quality control filters. For analysis of the whole genome,
type the samples by plates as had been done for the Phase I genotyping, instead
each plate were not included as a component of QC for this genotyping. In the
Phase I HapMap a single JPT sample had been excluded because of technical
problems. Perlegen typed a replacement sample (from the original JPT collec-
SNPs, although a substantial fraction of these was typed in Phase II.
Additional genotype submissions came from the Affymetrix GeneChip
Human Mapping 500K array called with the BRLMM algorithm. In release
21a additional genotype submissions were incorporated from the MHC haplo-
type consortium11, the Illumina HumanHap300 BeadChip, the Illumina
Human-1 Genotyping BeadChip and the 10K non-synonymous SNP set from
Details of primer design, DNA amplification, DNA labelling and hybridiza-
tion and signal detection for the Perlegen platform can be found in Supple-
mentary Text 7.
QC analyses. Genotype submissions were assessed for mendelian errors (where
possible), missing data rates and Hardy–Weinberg proportions. QC filters were
met the QC criteria the submission with the lowest missing data rate was chosen
for inclusion in the non-redundant filtered data set. Comparison of the Phase II
where the reported minor allele is discrepant (referred to as ‘allele-flipping’).
Over the entire data set, we expect that 500–2,000 SNPs have this problem and
the vast majority will occur in SNPs from Phase I of the project. The Data
Coordination Center (DCC) is working to resolve as many of these as possible.
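
The three kinds of check named in this section (mendelian errors, missing data rates and Hardy–Weinberg proportions) can be sketched as follows. This is an illustrative sketch only: the function names and genotype encoding are hypothetical, the thresholds applied by the project are not stated in this section and are therefore not encoded here, and a simple chi-square statistic is used as a stand-in for whatever Hardy–Weinberg test the project applied.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def is_mendelian_consistent(father, mother, child):
    """True if the child could have inherited one allele from each parent.

    Genotypes are 2-tuples of allele codes, e.g. ("A", "G").
    """
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing genotype counts with Hardy-Weinberg expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele a
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

A larger statistic indicates a stronger departure from Hardy–Weinberg proportions; for example, a sample with no heterozygotes at a SNP with two equally common alleles gives a large value.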
Analyses of data quality. See Supplementary Text 2.
Analyses of population stratification, relatedness and homozygosity. See
Supplementary Texts 3–6.
Analysis of recombination rate and gene ontology. We used the Panther
Database41 to obtain details of the gene molecular function and biological pro-
level biological process groups, with each gene allowed to exist in more than one
RefSeq Annotation for which we could obtain recombination rates. Of these,
9,735 had at least one assigned molecular function and 9,432 had at least one
assigned biological process. Genes without a molecular function or biological
we estimated the mean recombination rate over a 20-kb region centred on the
mid-point of each gene transcription region.
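
The per-gene estimate described above can be sketched as a length-weighted average of a piecewise-constant recombination map over a 20-kb window centred on the transcript mid-point. The map format (sorted `(position_bp, rate)` pairs, each rate applying up to the next position) and the function names are assumptions made for this example, not the format used by the authors.

```python
def mean_rate_in_window(map_points, start, end):
    """Length-weighted mean recombination rate over [start, end).

    `map_points` is a sorted list of (position_bp, rate) pairs; each rate
    applies from its position up to the next position. The rate attached to
    the final point is unused, since that point only marks the end of the map.
    """
    total = 0.0
    covered = 0
    for (pos, rate), (next_pos, _) in zip(map_points, map_points[1:]):
        lo = max(pos, start)
        hi = min(next_pos, end)
        if hi > lo:
            total += rate * (hi - lo)
            covered += hi - lo
    if covered == 0:
        raise ValueError("window is not covered by the map")
    return total / covered


def gene_window_rate(map_points, tx_start, tx_end, window=20_000):
    """Mean rate in a window centred on the mid-point of the transcribed region."""
    mid = (tx_start + tx_end) // 2
    return mean_rate_in_window(map_points, mid - window // 2, mid + window // 2)
```

For instance, with a hypothetical map `[(0, 1.0), (50_000, 3.0), (100_000, 0.0)]` and a gene transcribed from 40,000 to 60,000 bp, half of the window lies in each interval and the estimate is 2.0.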
Genes were grouped based on molecular function and biological process. A
mean recombination rate was calculated for each group. The significance of
the result from each group was calculated via a permutation test involving 10^5
random groupings of genes. No correction was made for multiple testing. To
account for the effect of G+C content on recombination, we performed a
linear regression between the G+C content and recombination rate of all
genes in each sample. Using the estimated regression parameters, the propor-
tion of recombination explained by G+C content was subtracted from each
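
The group-level permutation test and the G+C adjustment described in this paragraph can be sketched as below. Several details are assumptions made for illustration rather than the authors' procedure: the test is two-sided, the P value uses an add-one correction, the adjustment is taken to be the residual from an ordinary least-squares fit, and all function names are hypothetical.

```python
import random


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def gc_adjusted_rates(gc, rates):
    """Subtract the part of each rate predicted by G+C content (regression residuals)."""
    slope, intercept = linear_fit(gc, rates)
    return [r - (intercept + slope * g) for g, r in zip(gc, rates)]


def permutation_p_value(values, group_indices, n_perm=1000, seed=0):
    """Two-sided permutation P value for the mean of one group of genes.

    Random groupings of the same size are drawn without replacement from all
    genes, and the deviation of each random group mean from the overall mean
    is compared with the observed deviation.
    """
    rng = random.Random(seed)
    k = len(group_indices)
    overall = sum(values) / len(values)
    observed = abs(sum(values[i] for i in group_indices) / k - overall)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(values, k)
        if abs(sum(sample) / k - overall) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

In this sketch `n_perm` defaults to a small value for speed; the number of random groupings is a parameter and can be raised to match the text.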
annotations from dbSNP release 125 we identified 17,427 polymorphic non-
synonymous SNPs in release 21 and 15,976 polymorphic synonymous SNPs. Of
these, 15,583 non-synonymous and 14,324 synonymous SNPs were autosomal
and could have ancestral allele status unambiguously assigned by parsimony
through comparison to the chimpanzee and macaque genomes. We used the
phased haplotypes for analysis in which missing data had been imputed. FST was
calculated using the method of Weir and Cockerham59.
To detect recent partial selective sweeps we used the long-range haplotype
(LRH) test44,49 and the integrated haplotype score (iHS) test9. On simulated
data45, we found that the tests have similar power to detect recent selection
but the iHS test has slightly lower power at low haplotype frequency and the
LRH test has slightly lower power at high frequency. This can be seen in applica-
tions to HapMap Phase I data3,9, where the iHS test misses the well-known cases
of HBB and CD36 and the LRH test misses the SULT1C2 region. Although both
tests are based on the concept of EHH44, we observed that the false positives
produced by the two tests tend not to overlap and thus that signals detected by
both tests have a very low false-positive rate.
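
Both tests build on extended haplotype homozygosity (EHH): among chromosomes carrying a given core allele, the probability that two chromosomes chosen at random are identical across all SNPs out to a given distance. The following minimal sketch computes EHH in one direction from phased haplotypes; the function name, the string encoding of haplotypes and the use of a single core SNP are assumptions made for illustration, and neither the LRH nor the iHS statistic itself is implemented here.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, extent):
    """EHH for carriers of `core_allele`, extended `extent` SNPs to the right.

    `haplotypes` is a list of equal-length strings, one character per SNP.
    EHH = sum_i C(n_i, 2) / C(n, 2), where n_i counts carriers sharing each
    distinct extended haplotype and n is the total number of carriers.
    """
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        raise ValueError("need at least two chromosomes carrying the core allele")
    if core_index + extent >= len(carriers[0]):
        raise ValueError("extent runs past the end of the haplotypes")
    segments = Counter(h[core_index:core_index + extent + 1] for h in carriers)
    return sum(comb(c, 2) for c in segments.values()) / comb(n, 2)
```

EHH starts at 1 at the core and decays as the extension crosses sites at which carriers differ; a core allele at high frequency whose EHH decays unusually slowly is the kind of signal that tests of this type look for.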