Ancestry Informative Marker Sets for Determining
Continental Origin and Admixture Proportions
in Common Populations in America
Roman Kosoy,1 Rami Nassir,1 Chao Tian,1 Phoebe A. White,2 Lesley M. Butler,3 Gabriel Silva,4 Rick Kittles,5
Marta E. Alarcon-Riquelme,6 Peter K. Gregersen,7 John W. Belmont,8 Francisco M. De La Vega,2
and Michael F. Seldin1*
1Rowe Program in Human Genetics, Departments of Biochemistry and Medicine, University of California Davis, Davis, California
2Applied Biosystems, Foster City, California
3Department of Public Health Sciences, University of California Davis, Davis, California
4Obras Sociales del Hermano Pedro, Antigua, Guatemala
5Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois
6Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, Uppsala, Sweden
7The Robert S. Boas Center for Genomics and Human Genetics, Feinstein Institute for Medical Research, North Shore LIJ Health System,
Manhasset, New York
8Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
Communicated by Pui-Yan Kwok
Received 15 February 2008; accepted revised manuscript 21 April 2008.
Published online 6 August 2008 in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/humu.20822
ABSTRACT: To provide a resource for assessing continental
ancestry in a wide variety of genetic studies, we
identified, validated, and characterized a set of 128
ancestry informative markers (AIMs). The markers were
chosen for informativeness, genome-wide distribution,
and genotype reproducibility on two platforms (TaqMan®
assays and Illumina arrays). We analyzed genotyping
data from 825 subjects with diverse ancestry,
including European, East Asian, Amerindian, African,
South Asian, Mexican, and Puerto Rican. A comprehensive
set of 128 AIMs and subsets as small as 24 AIMs are
shown to be useful tools for ascertaining the origin of
subjects from particular continents and for correcting for
population stratification in admixed population sample
sets. Our findings provide general guidelines for the
application of specific AIM subsets as a resource for wide
application. We conclude that investigators can use
TaqMan assays for the selected AIMs as a simple and
cost-efficient tool to control for differences in continental
ancestry when conducting association studies in ethnically
diverse populations.
Hum Mutat 30, 69–78, 2009.
© 2008 Wiley-Liss, Inc.
KEY WORDS: population structure; continental ancestry;
population stratification; ancestry informative markers
Analyses of population genetic structure have shown that
continental population groups can be identified by examining
differences in allele frequencies [Rosenberg et al., 2002, 2005].
Over the last several years, studies have demonstrated that
thousands of individual single nucleotide polymorphisms (SNPs)
distributed throughout the genome have very large differences in
allele frequencies between two or more continental populations
[Mao et al., 2007; Price et al., 2007; Smith et al., 2004; Tian et al.,
2006, 2007]. These studies have set the framework for both
admixture mapping and adjusting for population genetic
structure in association testing. The latter is particularly
important since differences in population genetic structure
between cases and controls can confound SNP-disease associa-
tions, leading to false-positive or false-negative findings [Campbell
et al., 2005; Clayton et al., 2005; Freedman et al., 2004; Helgason
et al., 2005; Marchini et al., 2004]. Methods to measure, and
therefore address, differences in population structure in association
testing have been developed [Epstein et al., 2007; Hoggart et al.,
2003; Price et al., 2006; Pritchard et al., 2000b; Purcell et al., 2007;
Satten et al., 2001]. In the context of whole genome association
(WGA) scans, these methods can be readily applied. However, for
follow-up association studies to further define critical candidate
regions in larger population sets, or for analyses of additional
populations, a small set of ancestry informative markers (AIMs) is
While differences within continental populations, and popula-
tion substructure, must also be considered [Bauchet et al., 2007;
Price et al., 2008; Seldin et al., 2006; Tian et al., 2008], the larger
difference in allele frequencies between continental populations
potentially creates the greatest confounding problem in interpret-
ing such association studies. At this point a large number of WGA
studies have been conducted in populations of primarily or
exclusively European ancestry. Thus, the issue of confounding by
population stratification will become particularly evident as more
genetic associations are conducted among multiethnic, and
therefore substantially admixed, populations to evaluate ethnic
disparities in disease risk. Addressing these differences in
population structure is particularly relevant for extending genetic
associations to underserved minority groups that include
substantial admixture between continents.
Additional Supporting Information may be found in the online version of this article.
*Correspondence to: Michael F. Seldin, Rowe Program in Genetics, UC Davis, Tupper Hall Room 4453, Davis, CA 95616. E-mail: email@example.com
Contract grant sponsor: Applied Biosystems; Swedish Research Council; Christine Landgraf Memorial Research Fund (LMB); National Institutes of Health (NIH); Grant numbers: AR050267; DK071185; and P30 CA093373.
The current study was undertaken to provide a resource for
determining and quantifying differences in continental popula-
tions using the smallest numbers of SNPs possible as a cost and
time efficient strategy. Previous studies by both our group and
others have shown that AIM sets of 200 markers or fewer have the
ability to discern continental structure [Parra et al., 2004; Salari
et al., 2005; Yang et al., 2005]. However, the use of such markers
has been sporadic, the validation of many of the markers
incomplete, and in some cases their use has been limited to specific
platforms that cannot be readily and inexpensively used by
multiple laboratories. The current study, utilizing the widely
used TaqMan® (Applied Biosystems, Foster City, CA) platform,
provides a set of AIMs that distinguish continental groups and can
be widely applied to genetic studies. In addition, the application of
AIMs depends in part on availability of genotypes. Our study also
provides genotypes of continental populations as a research
community resource. Most importantly, the current study shows
both the value and limitations of using smaller subsets of AIMs by
providing guidance in practical application.
Materials and Methods
DNA samples or genotypes used for population structure
analyses were from 825 individuals that included: 128 European
Americans (NYCPEA), 60 CEPH Europeans (CEU), 56 Yoruban
African (YRI), 19 Bini West African, 23 Kanuri West African, 50
Mayan Amerindians, 26 Quechuan Amerindians, 29 Nahua
Amerindians, 40 Mexican Americans (MAM), 26 Mexican
(MXN), 28 Puerto Rican American (PRA), 43 Chinese (CHB),
43 Chinese American (CHAH), 43 Japanese (JPT), eight
Vietnamese American (VAH), one Korean American (KAH), 45
Filipino American (NYCPFA), two unspecified East Asian
Americans (OEAS), three Japanese American (JAH), and 64
South Asian Indian Americans (SAS).
These populations were based on self-identified ethnic affiliation.
The NYCPEA, NYCPFA, and PRA were from New York City
and were collected as part of the New York Cancer Project
[Mitchell et al., 2004]. The Mayan samples were collected from
two villages, Bola De Oro and Cienega Grande, in Chimaltenango,
Guatemala (provided by G.S. and J.B.); the Quechuan
individuals were from Peru (provided by J.B.); the Nahua were
from central Mexico (provided by M.E.A.-R.); the MXN were from
Mexico City (provided by M.E.A.-R.); the MAM and AFA were
from California; and the CHAH, VAH, KAH, and SAS were from
Houston (provided by J.B.). Of the West African samples, the
Bini are a Niger-Congo group of Bantu speakers from Edo State
and the Kanuri are a group of Nilo-Saharan speakers from the Lake
Chad region of northern Nigeria (provided by R.K.). The CEU
and YRI were HapMap panel genotypes [Altshuler et al., 2005]
and the JPT and CHB were from the iControlDB (Illumina, San
Diego, CA; www.illumina.com/iControlDB).
Additional genotypes used in modeling studies
included: 1) EURNIHLN genotypes (254 subjects) that were
available from the NIH Laboratory of Neurogenetics at the Coriell
Queue website; 2) East Asian genotypes from the iControlDB (198
subjects); 3) East Asian samples (85) genotyped at North Shore;
and 4) African American genotypes (1,847 subjects) from the
iControlDB. For the modeling studies we limited the genotypes to
autosomal SNPs that were typed in >95% of each of the included
subjects and that were in Hardy-Weinberg equilibrium (HWE)
(P > 0.001) within a given self-identified group and in combined
samples from a given continent.
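The completeness filter described above can be illustrated with a minimal Python sketch. It assumes that the >95% criterion is applied per SNP across subjects and that a failed call is stored as None; the function names and the data layout are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Optional

Genotype = Optional[str]  # e.g. "AG"; None marks a failed call (assumed encoding)


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of subjects with a non-missing genotype call."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def snps_passing_call_rate(data: Dict[str, List[Genotype]],
                           min_rate: float = 0.95) -> List[str]:
    """Return identifiers of SNPs whose call rate exceeds min_rate."""
    return [snp for snp, calls in data.items() if call_rate(calls) > min_rate]
```

A SNP typed in 18 of 20 subjects (call rate 0.90) would be dropped under this sketch, whereas a fully typed SNP would be retained.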
The subjects studied were all healthy and not first-degree
relatives of each other based on self-reporting. All DNA and blood
samples were obtained according to protocols and informed-
consent procedures approved by institutional review boards, and
were labeled with an anonymous code number.
The value of the genetic variance variable, Fst, was determined
using Genetix software [Belkhir et al., 2001], which applies the
Weir and Cockerham algorithm [Weir and Cockerham, 1984]. This
algorithm defines Fst as (MSP − MSG)/[MSP + (n_c − 1)MSG],
where MSP denotes the observed mean square errors for loci
between populations and MSG denotes the mean square errors for
loci within populations. The pairwise Fst values thus provide a
measurement of interpopulation genetic variance in comparison
to intrapopulation genetic variance. HWE was examined using an
exact test implemented in the FINETTI software, which can be
bin/hw/hwa1.pl). Population admixture proportions were deter-
mined using the Bayesian clustering algorithms developed by
Pritchard and implemented in the program STRUCTURE v2.1
[Falush et al., 2003; Pritchard et al., 2000a]. Informativeness
between multiple population groups was determined using the In
algorithm [Rosenberg et al., 2003].
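As an illustration of the Fst expression given above, the following Python sketch evaluates (MSP − MSG)/[MSP + (n_c − 1)MSG] for a single biallelic locus. It is a simplified stand-in and not the Genetix implementation: it assumes each population is summarized by the number of sampled alleles n_i and an allele frequency p_i, and the specific forms used for MSP, MSG, and n_c are assumptions taken from the standard method-of-moments treatment of allele counts.

```python
from typing import Sequence, Tuple


def fst_moment(pops: Sequence[Tuple[int, float]]) -> float:
    """Single-locus moment estimate of Fst.

    pops holds (n_i, p_i): number of sampled alleles and allele
    frequency in each population (assumed input layout).
    """
    r = len(pops)
    if r < 2:
        raise ValueError("need at least two populations")
    n_total = sum(n for n, _ in pops)
    p_bar = sum(n * p for n, p in pops) / n_total
    # Mean square between populations (MSP)
    msp = sum(n * (p - p_bar) ** 2 for n, p in pops) / (r - 1)
    # Mean square within populations (MSG)
    msg = sum(n * p * (1 - p) for n, p in pops) / sum(n - 1 for n, _ in pops)
    # Sample-size term n_c, which corrects for unequal sample sizes
    n_c = (n_total - sum(n * n for n, _ in pops) / n_total) / (r - 1)
    denom = msp + (n_c - 1) * msg
    return (msp - msg) / denom if denom > 0 else 0.0
```

With two populations of 100 sampled alleles each and frequencies of 0.9 and 0.1, this sketch gives a value near 0.78; identical frequencies give a value slightly below zero, as expected for a moment estimator.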
For STRUCTURE, unless otherwise noted in the results, each
analysis was performed without any prior population assignment
and was performed at least three times with similar results using
>10,000 replicates and 5,000 burn-in cycles under the admixture
model. For analyses using smaller marker sets (24 and 48 markers),
longer runs were necessary to achieve similar results on multiple
run comparisons. For 24 and 48 marker sets, 50,000 replicates and
10,000 burn-in cycles were used, with the exception of 24 markers
selected using In4 (four-population informativeness). For these
analyses, 100,000 replicates and 20,000 burn-in cycles were
necessary. For all analyses reported we used the "infer α" option
with a separate α estimated for each population (where α is the
Dirichlet parameter for degree of admixture). Runs were
performed under the λ = 1 option, where λ parameterizes the
prior allele frequency and is based on the Dirichlet distribution of
Fst, In, and allele frequencies were determined using sets of 80
subjects representing European (EURA), West African (AFR),
Amerindian (AMI), and East Asian (EAS) ancestry. These
included the following distribution of subjects: 1) EURA, CEPH
(17 subjects), NYCPEA (63 subjects); 2) AFR, YRI (45 subjects),
Bini (17 subjects), and Kanuri (18 subjects); 3) AMI, Mayan (38
subjects), Nahua (23 subjects), and Quechuan (19 subjects); and
4) EAS, HCB (15 subjects), Filipino (16 subjects), diverse
ethnic Chinese American (25 subjects), JPT (15 subjects), Japanese
American (one subject), Korean American (one subject), and
Vietnamese Americans (seven subjects).
For modeling studies, association tests were performed using
the EIGENSTRAT statistical package [Price et al., 2006]. False
discovery rate statistics [Devlin and Roeder, 1999] were determined
using HelixTree 5.0.2 software (Golden Helix, Bozeman,
TaqMan® SNP genotyping assays were developed for each of
the SNPs used in the current study (Supplementary Table S1;
available online at http://www.interscience.wiley.com/jpages/1059-
7794/suppmat) and are commercially available (Applied Biosys-
tems, Foster City, CA; www.allsnps.com). Assays were performed
with the TaqMan Genotyping Master Mix, using conditions
recommended by the manufacturer, on an ABI 7900 Sequence
Detection System (Applied Biosystems).
For the current studies the deCODE [Kong et al., 2002] genetic
map was used. The position of each SNP was determined by
interpolation using markers that were both on the genetic map
and for which an unambiguous physical map position was
available in NCBI build 35. Any markers that were not in the same
relative order in both the genetic and physical maps were omitted
as anchors for the interpolation of the genetic positions of the
The SNPs chosen for inclusion were based on two large sets of
previous genotyping results in our laboratory [Tian et al., 2006,
2007] and were limited to those SNPs that overlapped with the 300K
genome-wide Illumina SNP array. A total of 250 SNPs were
chosen by selecting the best SNP in each 10-cM deCODE bin that
met the criteria of large allele frequency differences (>45%)
between EURA and AMI groups and small allele frequency
differences (<5%) between two disparate AMI groups (Pima and
Mayan). Similarly, 250 SNPs with large frequency differences
(>45%) between African and European groups were selected.
From these 500 SNPs we reduced the number for testing to 184
based on the following criteria: 1) in silico design criteria for
TaqMan assays; 2) genome-wide distribution pattern (minimum
intermarker distance = 8 cM on deCODE map); and 3) EAS
differences based on HapMap results in JPT and CHB.
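The bin-based selection and spacing criteria can be sketched in Python as follows. This is only an illustration under stated assumptions: the "best" SNP in a bin is taken to be the one with the largest allele frequency difference, the spacing rule is applied greedily along each chromosome, and the Snp record and function names are hypothetical. The additional requirement of small differences between two Amerindian groups is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Snp:
    rsid: str
    chrom: str
    cm: float      # genetic map position in cM
    freq_a: float  # allele frequency in the first group (e.g. EURA)
    freq_b: float  # allele frequency in the second group (e.g. AMI)


def delta(s: Snp) -> float:
    """Absolute allele frequency difference between the two groups."""
    return abs(s.freq_a - s.freq_b)


def best_snp_per_bin(snps: List[Snp], bin_cm: float = 10.0,
                     min_delta: float = 0.45) -> List[Snp]:
    """Keep the largest-delta SNP in each genetic bin, if it exceeds min_delta."""
    best: Dict[Tuple[str, int], Snp] = {}
    for s in snps:
        if delta(s) <= min_delta:
            continue
        key = (s.chrom, int(s.cm // bin_cm))
        if key not in best or delta(s) > delta(best[key]):
            best[key] = s
    return sorted(best.values(), key=lambda s: (s.chrom, s.cm))


def enforce_min_spacing(snps: List[Snp], min_cm: float = 8.0) -> List[Snp]:
    """Greedily drop SNPs closer than min_cm to the previously kept SNP."""
    kept: List[Snp] = []
    for s in sorted(snps, key=lambda s: (s.chrom, s.cm)):
        if kept and kept[-1].chrom == s.chrom and s.cm - kept[-1].cm < min_cm:
            continue
        kept.append(s)
    return kept
```

A greedy pass keeps the first marker encountered on each chromosome; a real selection might instead weigh informativeness when two neighboring markers conflict.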
TaqMan® SNP genotyping assays were designed for the 184 SNPs
and tested using DNA panels. Of these, 128 SNPs passed our
quality filters, demonstrating reproducible genotyping results in
population samples of diverse origin and >90% complete typing
results in each population, and were in HWE (P > 0.01) in the
EURA group. A small number of SNPs were not in HWE in
specific populations (two SNPs in AFR, three SNPs in AMI, and
three SNPs in EAS). These SNPs did not overlap between these
groups and only two SNPs showed HWE P < 0.005. Thus, these
SNPs were not excluded, because recent admixture in these self-
identified ethnic groups could result in departure from HWE.
Summary information for the final set of 128 SNPs is provided in
Supplementary Table S1.
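For readers who wish to reproduce an HWE filter of this kind, the following Python sketch computes an exact HWE P value for one biallelic SNP from its genotype counts. It is not the FINETTI code; it assumes the commonly used recurrence over possible heterozygote counts and sums the probabilities of all configurations no more likely than the observed one. The function name is hypothetical.

```python
def hwe_exact_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Exact Hardy-Weinberg P value from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    n_rare = 2 * min(n_aa, n_bb) + n_ab  # copies of the rarer allele
    n_common = 2 * n - n_rare
    # Start near the expected heterozygote count, matching the parity of n_rare.
    mid = n_rare * n_common // (2 * n)
    if mid % 2 != n_rare % 2:
        mid += 1
    probs = {mid: 1.0}  # unnormalized probabilities keyed by heterozygote count
    # Walk down to fewer heterozygotes.
    het = mid
    hom_r = (n_rare - het) // 2
    hom_c = n - het - hom_r
    while het >= 2:
        probs[het - 2] = probs[het] * het * (het - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        het, hom_r, hom_c = het - 2, hom_r + 1, hom_c + 1
    # Walk up to more heterozygotes.
    het = mid
    hom_r = (n_rare - het) // 2
    hom_c = n - het - hom_r
    while het <= n_rare - 2:
        probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2) * (het + 1))
        het, hom_r, hom_c = het + 2, hom_r - 1, hom_c - 1
    total = sum(probs.values())
    p_obs = probs[n_ab]
    # Sum all outcomes that are no more probable than the observed one.
    tail = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail / total)
```

A SNP would then be retained under the threshold used here when, for example, hwe_exact_p(n_aa, n_ab, n_bb) > 0.01 in the EURA group.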
Identifying Subsets of AIMs
Subsets of the 128 marker set were chosen using the In
algorithm [Rosenberg et al., 2003] with the goal of finding the
most informative markers distinguishing one or more of the
following: 1) four continental populations (EURA, AFR, AMI, and
EAS); 2) three continental populations (EURA, AFR, and AMI);
or 3) two continental populations (EURA and AFR or EURA and
AMI). Each subset was determined using 80 subjects from each
ethnic group (described in Statistical Methods) and marker
selection was based on the most informative set for each analysis
(provided in Supplementary Table S2).
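The ranking step can be sketched in Python as below. The sketch assumes the informativeness-for-assignment statistic for K equally weighted populations, computed with natural logarithms, and represents each biallelic marker by the frequency of one allele in each population; the function names and the panel layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def informativeness(freqs: Sequence[float]) -> float:
    """I_n of a biallelic marker; freqs holds one allele's frequency per population."""
    k = len(freqs)
    total = 0.0
    # Loop over the two alleles: the given one and its complement.
    for allele_freqs in (list(freqs), [1.0 - f for f in freqs]):
        p_mean = sum(allele_freqs) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        total += sum(p * math.log(p) for p in allele_freqs if p > 0) / k
    return total


def top_markers(panel: Dict[str, Sequence[float]], n: int) -> List[str]:
    """Return the n marker identifiers with the highest informativeness."""
    ranked = sorted(panel, key=lambda m: informativeness(panel[m]), reverse=True)
    return ranked[:n]
```

A marker fixed for alternative alleles in two populations reaches the maximum of log 2 under this definition, and a marker with identical frequencies everywhere scores zero.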
To test whether a limited number of AIMs can correct for false-
positive results observed in case-control studies due to population
stratification, we modeled three population specific loci as disease
phenotypes. The modeling was done in the following step-wise
manner independently for each surrogate phenotype: 1) surrogate
cases and controls (with available SNP genotypes on Illumina
300K platform) were chosen on the basis of genotypes for a
population-specific marker; 2) 200K SNPs that passed quality
control filters in the surrogate case-control sample sets were tested
for association using the HelixTree software package; 3)
significantly associated markers (by Armitage χ² test, χ² ≥ 26.6;
P < 0.05 with Bonferroni correction for 200,000 tests) in or near
the locus designating the surrogate phenotype are defined as true-
positive signals, while significantly associated SNPs outside the
locus are defined as false positives; 4) six to 10 SNPs with the
strongest false-positive associations and a similar number of true-
positive associations with χ² values comparable to the false-
positives were selected for further analysis; 5) the genotypes for
the chosen true-positively- and false-positively-associated markers
were combined with genotypes for the markers in the selected sets
(all 200K SNP markers, 128 In4, 96 In4, 64 In4, 48 In4, and 24 In4),
and were tested for association, correcting for substructure
by principal component analysis using EIGENSTRAT [Price et al.,
2006]; 6) the positively associated markers were reanalyzed for
association using correction for population stratification with an
appropriate number of principal components (PC1 or PC2
depending on the study population, determined by the plateau
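The significance threshold quoted in step 3 can be checked with a short Python calculation. The sketch assumes a test statistic with one degree of freedom, for which the χ² critical value is the square of the corresponding two-sided normal quantile; the function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_chi2_threshold(alpha: float, n_tests: int) -> float:
    """Critical value of a 1-df chi-square test after Bonferroni correction."""
    per_test = alpha / n_tests
    # A 1-df chi-square variable is the square of a standard normal variable,
    # so the critical value is the squared two-sided normal quantile.
    z = -NormalDist().inv_cdf(per_test / 2.0)
    return z * z
```

For alpha = 0.05 and 200,000 tests the per-test level is 2.5 × 10⁻⁷, and the resulting critical value is close to the 26.6 cited in the text.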
The surrogate phenotypes were assigned based on SNPs selected
from haplotype analyses of three regions that contained genes with
strong ancestry association. The models chosen were for the
SLC24A5, lactase gene (LCT), and ADH1B. SLC24A5, coding for
a K-gated Na/Ca exchanger, is located on chromosome 15, and
plays a role in human skin pigmentation [Lamason et al., 2005].
This study provided evidence that a nonsynonymous genetic
substitution (rs1426654, A/G 111) is under strong positive
selection in Europeans, with allele A nearly fixed in various
European populations (98.7–100%), whereas allele G is present at
97% to 100% frequency in African and East Asian HapMap
populations [Lamason et al., 2005]. Since genotypes for rs1426654
were not available in our dataset, individuals homozygous for allele
A of rs2675348, in complete linkage disequilibrium (LD) with
allele A of rs1426654 (r² = 1.00 in HapMap CEU samples), were
designated as surrogate cases, while individuals with A/G and G/G
genotypes were designated as surrogate controls (the frequency of allele A is 1.0 in
CEU, 0.5 in CHB, 0.589 in JPT, and 0.25 in YRI).
The second locus chosen for modeling a population-specific
phenotype is LCT, located on chromosome 2. A variant within the
LCT gene, rs4988235 (C/T –13910), is associated with lactase
persistence, leading to ability to digest milk in adults, and has been
demonstrated to be under strong positive selection in Europeans
[Bersaglieri et al., 2004; Hamblin and Di Rienzo, 2000; Tishkoff
et al., 2007]. Allele A is found at 0.75 frequency in HapMap
European samples, but is absent in HapMap YRI, CHB, and JPT
samples. Since rs4988235 genotypes were not available for our
sample set, allele A of rs1446585, a nearby SNP in strong LD
with allele T of rs4988235 (r² = 0.73 in HapMap CEU samples),
was used for modeling. Individuals homozygous for allele A for
rs1446585 were designated as surrogate cases, while individuals
with A/G and G/G genotypes were designated as surrogate
controls (allele A is 0.792 in CEU, and 0.00 in CHB, JPT, and YRI).
The third locus is for the alcohol dehydrogenase ADH1B gene,
where a nonsynonymous coding genetic variant rs1229984
(Arg47His) is reported to be under positive selection in East Asia
[Han et al., 2007]. Allele A is found at 0.77 frequency in HapMap
CHB and JPT, but is absent in CEU and YRI samples. Since
genotypes for rs1229984 were not available for our sample set,
allele A of rs10008281, a nearby SNP in strong LD with allele T
of rs1229984 (r² = 0.53 in HapMap CEU samples), was used for
modeling. (Note: since the trait is modeled on the proxy SNP, the
performance of AIM sets should be unaffected by the r².)
Individuals homozygous for allele A for rs10008281 were
designated as surrogate cases, while individuals with A/G and G/
G genotypes were designated as surrogate controls (allele A is 0.82
in CHB and 0.83 in JPT, and 0.28 in CEU and YRI).
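The surrogate phenotype assignment used for all three loci follows the same rule, which can be sketched in Python as below. The sketch assumes genotypes are stored as two-letter strings with heterozygotes written as "AG", and that untyped subjects are stored as None and left out; the function name and data layout are hypothetical.

```python
from typing import Dict, Optional


def surrogate_status(genotypes: Dict[str, Optional[str]],
                     case_allele: str = "A") -> Dict[str, str]:
    """Label homozygotes for case_allele as cases and all other typed subjects as controls."""
    status: Dict[str, str] = {}
    for subject, genotype in genotypes.items():
        if genotype is None:
            continue  # untyped subjects receive no surrogate phenotype
        is_case = genotype == case_allele * 2
        status[subject] = "case" if is_case else "control"
    return status
```

For example, subjects typed A/A, A/G, and G/G at the proxy SNP would be labeled case, control, and control, respectively.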
Small AIM Sets Distinguish Major Population Groups
A set of 128 SNPs selected on the basis of informativeness (In)
between four continental groups (European, Amerindian, West
African, and East Asian) passed our initial quality filters (see
Materials and Methods). Analysis of genotypes using this
informative marker set (designated 128 In4) was first evaluated
using Fst as a general measure of the ability to separate continental
population groups. The markers showed large Fst differences
between the continental populations and relatively small differ-
ences within large groups of disparate individuals within these
continental groups (Table 1). The South-Asian group, not used in
the marker selection, showed substantial differences with the
European group consistent with previous observations that this
subcontinental group is distinct [Yang et al., 2005]. In addition,
there was a larger intracontinental difference among the
Amerindian groups as previously observed [Price et al., 2007;
Tian et al., 2007].
Population structure analyses using a Bayesian cluster analysis
(STRUCTURE) showed a clear distinction between the continental
population groups when the number of clusters was defined at 4
(K = 4). The 128 In4 set consistently identified diverse individuals
corresponding to European, West African, Amerindian, and East
Asian population groups (Fig. 1a; Table 2). Adding an additional
cluster (K = 5) also allowed the identification of individuals from
another genetically distinguishable population, corresponding
to a South Asian subcontinental group (Fig. 1b).
The ability of smaller sets of In4 markers (96, 64, 48, and 24) to
discern population genetic structure was also examined. Here, the
smaller sets were in each case the highest ranking In4 SNPs
(Supplementary Table S1; see Supplementary Table S2 for
additional summary information). The individual estimation of
continental ancestry was nearly identical when 128, 96, or 64 In4
markers were used (e.g., compare Fig. 1c with Fig. 1a). A summary
of all the results shows that as few as 24 In4 markers could in fact identify
the same general population clusters (Table 2). Specifically, for both
West African and European ancestry the results are very consistent
with similar proportion of population measurements seen even
when comparing 128 In4 with 24 In4 results. For the Amerindian
and East Asian continental population groups there is a modest fall-
off in the concordance with self-identification as the numbers of
markers decrease, for example, the cluster membership that
Table 1. Summary of Fst Values Between and Within Ancestry
a Populations are European American (EURA), West African (AFR), Amerindian
(AMI), East Asian (EAS), and South Asian Indian Americans (SAS).
b Fst values, as determined by the Weir and Cockerham algorithm using the
genotypes for the 128 AIMs described in the text. The intrapopulation Fst was
determined using two or three populations for the different continental populations.
The population groups were: EURA (CEU and NYCP); AFR (YRI, Kanuri, and Bini);
AMI (Mayan, Nahua, and Quechuan); and East Asia (CHB, JPT, and NYCPF). See
Materials and Methods for further definition of population groups.
Figure 1. Analysis of population genetic structure using In4 AIMs.
Each vertical line represents an individual subject. Each self-identified
population group is shown along the abscissa. The population groups
include European American (EURA, 188 subjects), West African (AFR,
98 subjects), Amerindian (AMI, 88 subjects), East Asian (105 subjects),
South Asian (SAS, 64 subjects), African American (88 subjects),
Puerto Rican American (PRA, 28 subjects), Mexican American (MAM,
40 subjects), and Mexican (MXN, 26 subjects). Analyses were
performed without any prior population assignment. Analyses for
the 128 In4 marker set are shown for four population groups (K = 4) in (A)
and for K = 5 in (B). Analyses for 64 In4 are shown for K = 5 in (C) and for K = 3 (without
East or South Asian samples) in (D). [Color figure can be viewed in the
online issue, which is available at www.interscience.wiley.com]
corresponds best to self-identified Amerindian ancestry (pop 4)
decreased from 0.94 (128 In4) to 0.88 (24 In4) (Table 2). However,
the difference is more pronounced for the estimated contribution
from pop5 (corresponding to South Asian background) in the
South Asian population (0.75/0.68/0.70/0.59/0.55). The increased
uncertainty for South Asian contribution may be explained by the
relatively low Fst values between South Asian and European/East
Asian populations observed for the In4 markers (Table 1) that in
turn reflects the selection criteria (see Materials and Methods).
The population structure analyses of different population groups
are also influenced by which subjects are included. When the subject
set is limited to only those individuals of particular self-identified
backgrounds the results show more distinct cluster assignments.
This is illustrated in Figure 1d when East Asian and South Asian
subjects are excluded from the analyses and the number of assumed
population groups is defined as three (K = 3). In addition, small
numbers of markers chosen using other criteria may provide good
distinction between two or three population groups but provide
inaccurate information on other nonincluded population groups.
The performance of subsets of markers selected using either
European/West African informativeness or European/Amerindian
informativeness is provided in Supplementary Table S3.
Table 2. Summary of Population Structure Results Using Markers Selected by Informativeness Between Four Continental
[Column headings: Pop1, Pop2, Pop3, Pop4, Pop5; Pop1, Pop2, Pop3, Pop4. First row label: 128, EURA (188)c]
a The fraction of membership in each population group determined by STRUCTURE analyses is shown for different numbers of AIMs that were selected using In4 (see
Materials and Methods). The number of AIMs is shown in the first column for each section of the table.
b The number of population groups (K) specified in the analysis.
c The number of subjects in each self-identified group is provided for European American (EURA), West African (AFR), Amerindian (AMI), East Asian (EAS), South Asian
(SAS), African American (AFA), Puerto Rican (PRA), and a combined group of Mexican (MXN) and Mexican American (MAM) populations. See Materials and Methods for
additional self-identified population information.
d Bold numbers indicate the largest population component for each self-identified continental group.
Ability to Exclude Subjects of Disparate Ancestry for
One practical aspect of utilizing continental AIMs is to identify
sets of individuals corresponding to a particular continental
group. The ability of In4 sets to exclude subjects from the different
self-identified groups is summarized in Table 3 using the
predominant population group cluster membership as the
standard for each continent. Two criteria, 10% nonmembership
and 15% nonmembership are shown. In general, the 128 In4 AIMs
and smaller sets showed nearly complete exclusion of individuals
with other self-identified ancestries when considering any of the
continental groups. However, for European ancestry, there was a large
decrease in the performance of smaller marker sets (<64 markers)
with respect to exclusion of South Asian subjects.
For both Amerindians and East Asians the exclusion criteria
used in these analyses also would result in excluding a relatively
large number of subjects for these specific ancestries. For example,
10% non-Amerindian exclusion would result in excluding 17% of
the Amerindian subjects using 128 In4. While this result is
Table 3. Comparison of the Ability of AIMs to Distinguish Different Continental Populations*

                 Number of AIMs (4 Pop In)a
Group            128   96    64    48    24      128   96    64    48    24

                 >10% non-EURA ancestryb         >15% non-EURA ancestryb
EURA (188)       0.04  0.05  0.03  0.09  0.09    0.01  0.01  0.02  0.02  0.05
AMI (105)        1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00
EAS (188)        1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00
SAS (64)         0.98  0.95  0.94  0.83  0.64    0.97  0.97  0.83  0.73  0.47
AFA (88)         0.99  0.99  0.99  0.99  0.99    0.99  0.99  0.99  0.99  0.99
PRA (28)         1.00  1.00  0.96  0.96  0.89    1.00  0.96  0.93  0.86  0.79
MAM/MXN (66)     1.00  1.00  1.00  1.00  0.97    1.00  1.00  0.98  0.98  0.97

                 >10% non-AFR ancestryb          >15% non-AFR ancestryb
EURA             1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00
AFR              0.04  0.04  0.02  0.09  0.07    0.03  0.03  0.01  0.02  0.03

                 >10% non-AMI ancestryb          >15% non-AMI ancestryb
EURA             1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00
EAS              1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00
SAS              1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00
AFA              1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00
PRA              1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00
MAM/MXN          0.95  0.97  0.95  0.97  0.95    0.95  0.95  0.92  0.91  0.83

                 >10% non-EAS ancestryb          >15% non-EAS ancestryb
AFA              1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00

* The ability to exclude subjects based on different numbers of AIMs is shown using both 10% and 15% non-ancestry group membership.
a The top of each column indicates the number of AIMs in each set. The AIMs were selected based on informativeness in four population groups (see Materials and Methods).
Analyses were performed using K = 4.
b For each set the criterion for exclusion is shown. The fraction of subjects that would be excluded for each criterion is indicated for each subject group.
probably partially due to European admixture, there also is some
difficulty in fully resolving AMI and EAS ancestry at this level.
This issue is less severe when the criterion is set at 15%
nonmembership but is much more problematic when smaller
In4 marker sets are used (Table 3). Nevertheless, investigators can
use these criteria to improve analyses by excluding most subjects
of disparate ancestry, regardless of whether they are the result of
mistaken self-identification and/or mislabeling of samples.
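The exclusion criteria discussed above amount to a simple threshold on each subject's estimated cluster membership, as in the following Python sketch. It assumes that each subject is represented by the fraction of membership in the predominant cluster for the target continental group (for example, taken from STRUCTURE output); the function name is hypothetical.

```python
from typing import Sequence


def fraction_excluded(memberships: Sequence[float],
                      max_non_membership: float = 0.10) -> float:
    """Fraction of subjects whose non-membership in the target cluster exceeds the cutoff."""
    if not memberships:
        raise ValueError("no subjects supplied")
    excluded = sum(1 for q in memberships if (1.0 - q) > max_non_membership)
    return excluded / len(memberships)
```

Calling the function with max_non_membership set to 0.10 or 0.15 corresponds to the two criteria reported in Table 3.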
Use of AIMs for Admixture Studies
Another major use of continental AIMs is in admixture studies.
The differences in admixture proportions estimated using the 128
In4 AIMs are illustrated in Figure 1 and summarized in Table 2 for
African Americans (AFA), Mexican Americans (MAM), Mexican
(MXN), and Puerto Rican (PRA) population groups. These results
using STRUCTURE, similar to those with continental popula-
tions, are robust and yield consistent admixture proportions in
multiple runs using appropriate analysis parameters (see Materials
and Methods). The results also show that the overall admixture
proportions of these groups, AFA, MAM, MXN, and PRA can be
ascertained with small numbers of In4 AIMs.
To further evaluate how consistently different subsets of markers
can estimate individual admixture, we examined the correlation of
ancestry assignments. Using the 128 In4 results as the standard, we
compared the estimated contribution of one of the ancestral
parental populations contributing to each of three different
admixed populations. These include West African contribution in
AFA, European contribution in PRA, and Amerindian contribu-
tion in MAM and MXN. The latter two groups (MAM and MXN)
were combined since the admixture proportions are similar.
Marker sets chosen for their optimum ability to discriminate
between four ancestral populations (In4 sets), and two ancestral
populations (In2 sets) were examined (Fig. 2). The correlation
values (r²) for West African contribution in AFA are high, ranging
between 0.988 for 96 In4 and 0.835 for 24 In4, suggesting that
a small number of markers is sufficient to identify West African
contribution. Similar results in AFA were also observed using the
marker sets selected specifically to distinguish European and West
African (e.g., 0.976 for 48 In2 European/West African). As
anticipated, the markers chosen for European/Amerindian
differences did not accurately distinguish European/African
For Amerindian contribution in MAM and MXN the correla-
tion values using In4 markers was also strong but did show a
discernable decrease when 48 or 24 In4 markers were examined.
For the In2 AIMs optimized for European/Amerindian differences,
the results showed stronger correlations (e.g., 0.798 for 48 In2
European/Amerindian vs. 0.733 for 48 In4). Similar results are also
shown for the European contribution in PRA; however, the
correlations were markedly lower. The correlations for European
contribution in PRA population were 0.877, 0.587, 0.560, and
0.519 for 96 In4, 64 In4, 48 In4 and 24 In4, respectively.
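The comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the inputs are assumed to be per-individual ancestry proportions already extracted from STRUCTURE output, and the 0.8 cutoff is the threshold for high correlation used in this study.

```python
import numpy as np

def r_squared(reference, estimate):
    """Squared Pearson correlation between two sets of per-individual
    ancestry estimates (e.g., 128 In4 vs. a smaller AIM subset)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    if reference.shape != estimate.shape:
        raise ValueError("both vectors must cover the same individuals")
    r = np.corrcoef(reference, estimate)[0, 1]
    return float(r ** 2)

def sets_meeting_threshold(r2_by_set, threshold=0.8):
    """Return the names of the AIM sets whose r2 exceeds the threshold."""
    return [name for name, r2 in r2_by_set.items() if r2 > threshold]

# r2 values reported in the text for European contribution in PRA.
reported_pra = {"96 In4": 0.877, "64 In4": 0.587,
                "48 In4": 0.560, "24 In4": 0.519}
print(sets_meeting_threshold(reported_pra))
```

Applied to the reported PRA values, only the 96 In4 subset exceeds the 0.8 threshold.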
The low correlation between estimates for European contribu-
tion in PRA may be explained by the fact that three ancestral
populations, Europeans, Amerindians, and West Africans, have
substantial contributions in the PRA population. This is unlike
AFA and MAM/MXN, in which there are two main contributing
ancestral populations, West African and Europeans, and Amerindian
and Europeans, respectively. Using r² > 0.8 as a threshold for high
correlation, any of the In4 sets should be acceptable to estimate
West African contribution in AFA; 128 In4, 96 In4, and 64 In4 are
sufficient for Amerindian contribution in Mexican and Mexican
American populations; and 128 In4 and 96 In4 sets should provide
sufficiently accurate information for European contribution
To further measure the precision of the ancestry estimation of
individual subjects in admixed populations, we examined the 90%
confidence intervals. For each individual the 90% Bayesian
confidence interval was measured (STRUCTURE output). For
each set of AIMs, the average size of this confidence interval was
then calculated (Table 4). Comparison of these results shows the
decrease in individual confidence intervals based on the number
of markers and the dependency on the admixed population being
analyzed. These confidence limits show that in studies of AFA,
smaller sets can still provide good precision in individual
admixture measurement. However, for MAM/MXN, relatively
larger numbers of AIMs are required. The confidence limits are
smaller when In2 marker sets optimized for the particular admixed
population are used. However, the 96 In4 and 128 In4 sets appear
to perform very well in each of the admixed groups.
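As a rough sketch of the summary statistic used here, the average interval size for one AIM set can be computed from the per-individual interval bounds. The function name is hypothetical, and the bounds are assumed to have been parsed from the STRUCTURE output beforehand (the parsing step is not shown).

```python
import numpy as np

def mean_interval_width(lower, upper):
    """Average width of per-individual 90% intervals for one AIM set.

    lower, upper: interval bounds for the ancestry proportion of each
    individual, given as values between 0 and 1.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if lower.shape != upper.shape:
        raise ValueError("lower and upper must have the same length")
    if np.any(upper < lower):
        raise ValueError("each upper bound must be >= its lower bound")
    return float(np.mean(upper - lower))
```

Computing this value for each AIM set and each admixed group gives one cell of a summary such as Table 4; a smaller average width indicates more precise individual estimates.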
The ability to exclude subjects of other continental ancestry in
admixed populations was also examined (Supplementary Table
S4). For AFA, nearly all individuals of non-West African or
European ancestry could be excluded at the 15% exclusion criterion
while maintaining nearly all of the subjects of self-identified AFA
ancestry using 64 or more In4 AIMs. However, for the MAM/
MXN subjects much looser criteria (>30% non-Amerindian or
European ancestry) were necessary to include >90% of self-
identified MAM/MXN even with 96 In4 AIMs. This is probably
due to the small West African contribution present in the MAM/
MXN populations, requiring a larger number of AIMs to enable
good definition of this admixture component.
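One way to express such an exclusion rule is sketched below. It assumes, as an interpretation of the criteria above, that a subject is excluded when the summed ancestry from clusters other than the two expected parental populations exceeds the cutoff; the function name and the column layout of the ancestry matrix are hypothetical.

```python
import numpy as np

def keep_subjects(q, expected_columns, max_other=0.15):
    """Boolean mask of subjects retained under an exclusion criterion.

    q: (individuals x K) matrix of estimated ancestry proportions.
    expected_columns: indices of the parental clusters expected for the
        admixed group (e.g., West African and European for AFA).
    max_other: largest tolerated summed contribution from all other
        clusters (0.15 corresponds to a 15% criterion).
    """
    q = np.asarray(q, dtype=float)
    other = np.delete(q, expected_columns, axis=1).sum(axis=1)
    return other <= max_other

# Two illustrative subjects; columns 0 and 1 are the expected clusters.
q = [[0.80, 0.15, 0.05, 0.00],
     [0.50, 0.20, 0.30, 0.00]]
mask = keep_subjects(q, expected_columns=[0, 1])
```

Loosening `max_other` retains more self-identified subjects at the cost of admitting more subjects of other continental ancestry, which is the trade-off described for the MAM/MXN group.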
Performance of AIM Sets in Association Studies
As another assessment of the performance of the AIM sets, we
examined whether these AIMs could correct for false-positive
association results in models for population-specific disease
susceptibility loci. Using 200K genotypes from the I-control
database and additional genotypes available from other ongoing
studies (see Materials and Methods), we designated specific
genotypes as disease surrogates and identified true (located in a
close genetic position to the modeled SNP) and false (unlinked)
associated SNPs. These population sets included genotypes for
each of the 128 In4 AIMs since each is included within the
Illumina 300K array. Three disease gene models were specified
using the surrogate phenotypes defined by SNPs in strong LD
with: 1) a nonsynonymous genetic substitution in SLC24A5 on
chromosome 15 under strong positive selection in Europeans; 2)
the lactase persistence phenotype on chromosome 2 that is under strong
positive selection in northern European populations; and 3) a
nonsynonymous coding variant in ADH1B under positive
selection in East Asian populations (see Materials and Methods
for additional details).
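The surrogate-phenotype approach can be illustrated with a short sketch. It assumes genotypes coded as counts (0, 1, 2) of the modeled allele, defines surrogate cases by homozygosity for that allele, and uses the standard identity that the Armitage trend statistic with additive coding equals N times the squared genotype-phenotype correlation. All function names are hypothetical, and this is not the software used in the study.

```python
import math
import numpy as np

def surrogate_cases(allele_counts):
    """Case/control labels: 1 if homozygous for the modeled allele."""
    return (np.asarray(allele_counts) == 2).astype(int)

def armitage_trend_chi2(genotype, phenotype):
    """Armitage trend statistic (1 df) with additive 0/1/2 coding,
    computed as N * r^2 between genotype and case status."""
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    if g.std() == 0 or y.std() == 0:
        return 0.0  # monomorphic SNP or a single phenotype class
    r = np.corrcoef(g, y)[0, 1]
    return float(g.size * r ** 2)

def chi2_1df_pvalue(chi2):
    """Upper-tail P value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def bonferroni(p, n_tests):
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p * n_tests)
```

A SNP would then be flagged as associated when `bonferroni(chi2_1df_pvalue(stat), n_tests)` falls below the chosen significance level, with `n_tests` set to the number of SNPs tested.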
The surrogate phenotypes were specified in a sample set of 865
individuals primarily from three disparate continental popula-
tions, European (254 subjects), East Asia (283 subjects), and
Africa (as represented by 328 African American subjects). In
addition, the phenotype defined by SLC24A5 was examined in
1,847 African American subjects. For each of the phenotypes
examined, putative true positives (SNPs located close to the
chromosomal position of the modeled genotype) and false
positives (unlinked SNPs) were found with strong association
(P < 0.01 after Bonferroni correction) (Fig. 3; Supplementary
HUMAN MUTATION, Vol. 30, No. 1, 69–78, 2009
As expected, principal components analysis (PCA) using the
entire 200K SNP set was effective in correcting the false-positive
associations for each of the three surrogate phenotypes
examined in mixed population sets (Fig. 3a–c; Supplementary
Table S5). The 128 In4 and 96 In4 AIM sets were nearly as effective
in correcting the false-positive associations. Smaller In4 sets also
corrected most of the false positive results; however, these sets
failed on some of the analyses, e.g., the false association for
rs4871195 in the LCT model remained significant for 64 In4 and
smaller sets. For the admixed AFA population group, similar
results were observed (Fig. 3d; Supplementary Table S5). Here, the
smallest set (24 In4) showed incomplete correction. Together,
these analyses show that relatively small numbers of AIMs can
correct for false-positive results in these Mendelian models.
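The kind of correction evaluated here can be sketched as follows, in the spirit of the PCA-based adjustment of Price et al. [2006]. This is a simplified illustration rather than the EIGENSTRAT implementation: principal components are taken from a standardized AIM genotype matrix, both the tested genotype and the phenotype are residualized on those components, and the statistic is (N - K - 1) times the squared correlation of the residuals. Function names and the standardization by sample standard deviation are assumptions of this sketch.

```python
import numpy as np

def top_pcs(aim_genotypes, k):
    """Top-k principal components (one score per individual) of an
    individuals x markers matrix of AIM genotypes coded 0/1/2."""
    x = np.asarray(aim_genotypes, dtype=float)
    x = x - x.mean(axis=0)
    sd = x.std(axis=0)
    x = x[:, sd > 0] / sd[sd > 0]  # drop monomorphic markers, scale the rest
    u, _, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k]

def residualize(values, pcs):
    """Remove the part of `values` explained by an intercept and the PCs."""
    values = np.asarray(values, dtype=float)
    design = np.column_stack([np.ones(values.size), pcs])
    beta, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ beta

def adjusted_chi2(genotype, phenotype, pcs):
    """Ancestry-adjusted association statistic (approximately 1 df)."""
    g = residualize(genotype, pcs)
    y = residualize(phenotype, pcs)
    denom = (g @ g) * (y @ y)
    if denom <= 1e-12:
        return 0.0  # nothing left to test after adjustment
    n, k = pcs.shape
    return float((n - k - 1) * (g @ y) ** 2 / denom)
```

In this formulation, an unlinked SNP whose allele frequency differs between continental groups shows a large unadjusted statistic when the phenotype also differs between groups, and a much smaller one once the AIM-derived components are regressed out.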
The current study was undertaken to provide researchers with a
set of validated AIMs for distinguishing continental populations.
We believe that the results provide strong confidence that these
128 In4 AIMs and subsets of these SNPs can be used for
characterizing sample sets from diverse population groups. These
markers can be applied either to identify those individuals from a
particular study that are members of one continental population
group, or alternatively used to adjust for population stratification
due to differences in continental population frequency in cases
and controls. The former will reduce population heterogeneity
that may also correspond to reducing genetic heterogeneity for
specific traits. The latter can, as shown in our modeling studies,
allow the reduction or elimination of false-positive results.
Figure 2. Correlation between the estimations of genetic contribution using different AIM sets and 128 In4 AIMs. The abscissa shows the 128
In4 result and the ordinate the result using the color coded AIM set. The individual estimates for African contribution in African Americans (A,B), European
contribution in Puerto Ricans (C,D), and Amerindian in Mexicans and Mexican Americans (E,F) are shown based on STRUCTURE analyses.
Table 4. Summary of Confidence Intervals Using Different AIMs
[Table values not recovered; column groups: AFA; MAM/MXN; AFA, PRA, MAM/MXN]
(a) The average of the individual subject 90% Bayesian confidence intervals (CI) was
determined using the different AIM sets. For the AFA and MAM/MXN subject
groups the CI were determined using K = 2. For the combined admixed group (AFA,
PRA, MAM/MXN) the CI was determined using K = 3.
(b) For the 2PopIn marker sets, the CI was determined using the EURA/AFR for AFA or
EURA/AMI for the MAM/MXN subject groups.
Our analyses provide guidelines for application, especially with
regard to using the program STRUCTURE [Falush et al., 2003;
Pritchard et al., 2000a]. Other computational programs, including
ADMIXMAP [Hoggart et al., 2004] can also be applied with very
similar results (data not shown). In general, as indicated in the
Materials and Methods section, the performance of smaller AIM
SNP sets in STRUCTURE analyses is only consistently reprodu-
cible when very large numbers of iterations are used. This is not a
major limitation because the computational time is not a major
problem when small sets of markers are used, even with large
sample sets; several thousand samples will require <24 hr for
100,000 replicates using STRUCTURE and 48 markers. However,
smaller marker sets (especially those <64) provide a poorer
ability to exclude subjects of disparate continental ancestry and
will provide less precision in the individual ancestry assignment.
For larger studies (sample sizes of several thousand) the precision
of individual assignments will be less consequential than for
smaller studies in which the investigation will be more dependent
on the accurate assessment of ancestry of each individual. Thus,
the choice of the number of SNP AIMs depends on the populations
being studied as well as practical aspects of genotyping. However,
as shown in our study, the 96 In4 SNP AIMs perform well for each
of the potential applications with only a very modest reduction of
potential information compared with the 128 In4 set. Even smaller
numbers perform adequately in particular situations but may
require additional confidence in the prior information; i.e.,
confidence in self-identification of population membership.
A major application of SNP AIMs is to reduce false positives in
association studies. For traits associated with continental ancestry,
our modeling studies found that relatively small numbers of
SNP AIMs (64 or more) could adequately adjust for differences in
ancestry stratification between cases and controls. It is notable
that without the use of AIMs we observed many false
positives even when the loci used in the surrogate models were not in
complete LD with the true ancestry-associated trait (i.e., r² = 0.73 for
model 2 and r² = 0.53 for model 3). This suggests that
it is necessary to adjust for population structure for traits that are
only partially associated with continental ancestry and underscores
the importance of the application of these or similar methods when
subjects of mixed ancestry are studied. Our modeling studies also
examined the use of AIMs in association tests for an admixed
population (African Americans). Similar to the subject sets
containing individuals from multiple continents, these studies
showed that relatively small numbers of highly informative SNP
AIMs (64 or more) can adequately adjust for population
substructure and eliminate false-positive results. Additional studies
will be necessary to determine the efficacy of these AIMs in more
complex sample sets and other population groups.
The identification of the ancestry groups using nonhierarchical
clustering algorithms, or for that matter PCA, is enhanced by the
inclusion of representatives of the parental population groups. In
the analyses performed in the current studies there were
representatives of the different continental groups. The inclusion
of these groups is particularly important when admixed popula-
tions are being examined. The inclusion of these groups, even
without specifying population membership, allows more accurate
cluster separation. In general, and specifically for the studies
reported herein, we did not specify population membership, an
available option in the STRUCTURE program. (Similar results are
obtained using this option but with larger confidence intervals
[data not shown]). To facilitate the appropriate application of the
AIMs described in this study, the genotypes of continental
population groups are provided as a resource to the scientific
community (Supplementary Table S6).
Finally, for each of the SNP AIMs used in the current
study a TaqMan SNP genotyping assay is readily available
(Supplementary Table S2). We also note that each of the SNPs
is also part of the Illumina 300K array, which should enable
inspection and utilization of genotypes that are provided in the
I-control database. A summary of the information for each SNP is
provided in Supplementary Tables S2 and S6. In addition, since
many researchers may wish to use a smaller AIMs set, we have
optimized a panel of 96 SNPs for which robust TaqMan assays are
available as a cost-effective format (see Supplementary Table S1,
The Swedish Research Council provided support (to M.E.A.-R.). P.W. and
F.D.L.V. declare competing financial interests.
Figure 3. Correction of population stratification in association
tests using different AIM sets. Three population-specific alleles were
used to model phenotypes prevalent in a particular population. The
ordinate shows the χ² value with the first value showing the Armitage
test result. The correction for false-positive association tests
(EIGENSTRAT analyses) using either 200K SNP markers or the
selected AIM sets is shown along the abscissa. The surrogate
cases are defined by homozygosity for: (A,D) allele A for rs2675348 in
the SLC24A5 locus; (B) allele A for rs1446585 in the LCT locus; and (C)
allele A for rs100008281 in the ADH1B locus. The surrogate cases are
chosen in 865 samples from EURA, AFR, and EAS populations in (A),
(B), and (C), respectively; and from 1,847 African American samples in
(D). The dashed bold line represents the nominal significance level
(P = 0.05) corrected for 200K tests
(P = 2.5e–7). The marker shade/color indicates the location of the marker relative
to the locus chosen to define the surrogate phenotype. The dark
markers are located on chromosomes that do not contain the locus
defining the surrogate phenotype while the lighter markers are
located near the locus.
References
Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P. 2005. A
haplotype map of the human genome. Nature 437:1299–1320.
Bauchet M, McEvoy B, Pearson LN, Quillen EE, Sarkisian T, Hovhannesyan K, Deka
R, Bradley DG, Shriver MD. 2007. Measuring European population stratification
with microarray genotype data. Am J Hum Genet 80:948–956.
Belkhir K, Borsa P, Chikhi L, Raufaste N, Bonhomme F. 2001. GENETIX, software
under Windows™ for the genetics of populations. Version 4.02. Montpellier,
France: Laboratory Genome, Populations, Interactions CNRS UMR 5000,
University of Montpellier II.
Bersaglieri T, Sabeti PC, Patterson N, Vanderploeg T, Schaffner SF, Drake JA, Rhodes
M, Reich DE, Hirschhorn JN. 2004. Genetic signatures of strong recent positive
selection at the lactase gene. Am J Hum Genet 74:1111–1120.
Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC,
Altshuler D, Ardlie KG, Hirschhorn JN. 2005. Demonstrating stratification in a
European American population. Nat Genet 37:868–872.
Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD, Maier LM, Smink LJ, Lam
AC, Ovington NR, Stevens HE, Nutland S, Howson JM, Faham M, Moorhead
M, Jones HB, Falkowski M, Hardenbol P, Willis TD, Todd JA. 2005. Population
structure, differential bias and genomic control in a large-scale, case-control
association study. Nat Genet 37:1243–1246.
Devlin B, Roeder K. 1999. Genomic control for association studies. Biometrics
Epstein MP, Allen AS, Satten GA. 2007. A simple and improved correction for
population stratification in case-control studies. Am J Hum Genet 80:
Falush D, Stephens M, Pritchard JK. 2003. Inference of population structure using
multilocus genotype data: linked loci and correlated allele frequencies. Genetics
Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N,
Gabriel SB, Topol EJ, Smoller JW, Pato CN, Pato MT, Petryshen TL, Kolonel LN,
Lander ES, Sklar P, Henderson B, Hirschhorn JN, Altshuler D. 2004. Assessing
the impact of population stratification on genetic association studies. Nat Genet
Hamblin MT, Di Rienzo A. 2000. Detection of the signature of natural selection in
humans: evidence from the Duffy blood group locus. Am J Hum Genet
Han Y, Gu S, Oota H, Osier MV, Pakstis AJ, Speed WC, Kidd JR, Kidd KK. 2007.
Evidence of positive selection on a class I ADH locus. Am J Hum Genet
Helgason A, Yngvadottir B, Hrafnkelsson B, Gulcher J, Stefansson K. 2005. An
Icelandic example of the impact of population structure on association studies.
Nat Genet 37:90–95.
Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM.
2003. Control of confounding of genetic associations in stratified populations.
Am J Hum Genet 72:1492–1504.
Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM. 2004. Design and
analysis of admixture mapping studies. Am J Hum Genet 74:965–978.
Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B,
Sigurdardottir S, Barnard J, Hallbeck B, Masson G, Shlien A, Palsson ST,
Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K. 2002. A high-resolution
recombination map of the human genome. Nat Genet 31:241–247.
Lamason RL, Mohideen MA, Mest JR, Wong AC, Norton HL, Aros MC, Jurynec MJ,
Mao X, Humphreville VR, Humbert JE, Sinha S, Moore JL, Jagadeeswaran P,
Zhao W, Ning G, Makalowska I, McKeigue PM, O’Donnell D, Kittles R, Parra EJ,
Mangini NJ, Grunwald DJ, Shriver MD, Canfield VA, Cheng KC. 2005.
SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and
humans. Science 310:1782–1786.
Mao X, Bigham AW, Mei R, Gutierrez G, Weiss KM, Brutsaert TD, Leon-Velarde F,
Moore LG, Vargas E, McKeigue PM, Shriver MD, Parra EJ. 2007. A genomewide
admixture mapping panel for Hispanic/Latino populations. Am J Hum Genet
Marchini J, Cardon LR, Phillips MS, Donnelly P. 2004. The effects of human
population structure on large genetic association studies. Nat Genet 36:512–517.
Mitchell MK, Gregersen PK, Johnson S, Parsons R, Vlahov D. 2004. The New York
Cancer project: rationale, organization, design, and baseline characteristics.
J Urban Health 81:301–310.
Parra EJ, Kittles RA, Shriver MD. 2004. Implications of correlations between skin
color and genetic ancestry for biomedical research. Nat Genet 36(11
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. 2006.
Principal components analysis corrects for stratification in genome-wide
association studies. Nat Genet 38:904–909.
Price AL, Patterson N, Yu F, Cox DR, Waliszewska A, McDonald GJ, Tandon A,
Schirmer C, Neubauer J, Bedoya G, Duque C, Villegas A, Bortolini MC, Salzano
FM, Gallo C, Mazzotti G, Tello-Ruiz M, Riba L, Aguilar-Salinas CA, Canizales-
Quinteros S, Menjivar M, Klitz W, Henderson B, Haiman CA, Winkler C, Tusie-
Luna T, Ruiz-Linares A, Reich D. 2007. A genomewide admixture map for
Latino populations. Am J Hum Genet 80:1024–1036.
Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares A,
Groop L, Saetta AA, Korkolopoulou P, Seligsohn U, Waliszewska A, Schirmer C,
Ardlie K, Ramos A, Nemesh J, Arbeitman L, Goldstein DB, Reich D, Hirschhorn
JN. 2008. Discerning the ancestry of European Americans in genetic association
studies. PLoS Genet 4:e236.
Pritchard JK, Stephens M, Donnelly P. 2000a. Inference of population structure using
multilocus genotype data. Genetics 155:945–959.
Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. 2000b. Association mapping in
structured populations. Am J Hum Genet 67:170–181.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar
P, de Bakker PI, Daly MJ, Sham PC. 2007. PLINK: a tool set for whole-genome
association and population-based linkage analyses. Am J Hum Genet
Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA,
Feldman MW. 2002. Genetic structure of human populations. Science
Rosenberg NA, Li LM, Ward R, Pritchard JK. 2003. Informativeness of genetic
markers for inference of ancestry. Am J Hum Genet 73:1402–1422.
Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, Feldman MW.
2005. Clines, clusters, and the effect of study design on the inference of human
population structure. PLoS Genet 1:e70.
Salari K, Choudhry S, Tang H, Naqvi M, Lind D, Avila PC, Coyle NE, Ung N, Nazario
S, Casal J, Torres-Palacios A, Clark S, Phong A, Gomez I, Metallana H, Perez-
Stable EJ, Shriver MD, Kwok PY, Sheppard D, Rodriguez-Cintron W, Risch NJ,
Burchard EG, Ziv E. 2005. Genetic admixture and asthma-related phenotypes in
Mexican American and Puerto Rican asthmatics. Genet Epidemiol 29:76–86.
Satten GA, Flanders WD, Yang Q. 2001. Accounting for unmeasured population
substructure in case-control studies of genetic association using a novel latent-
class model. Am J Hum Genet 68:466–477.
Seldin MF, Shigeta R, Villoslada P, Selmi C, Tuomilehto J, Silva G, Belmont JW,
Klareskog L, Gregersen PK. 2006. European population substructure: clustering
of northern and southern populations. PLoS Genet 2:1339–1351.
Smith MW, Patterson N, Lautenberger JA, Truelove AL, McDonald GJ, Waliszewska
A, Kessing BD, Malasky MJ, Scafe C, Le E, De Jager PL, Mignault AA, Yi Z, De
The G, Essex M, Sankale JL, Moore JH, Poku K, Phair JP, Goedert JJ, Vlahov D,
Williams SM, Tishkoff SA, Winkler CA, De La Vega FM, Woodage T, Sninsky JJ,
Hafler DA, Altshuler D, Gilbert DA, O’Brien SJ, Reich D. 2004. A high-density
admixture map for disease gene discovery in African Americans. Am J Hum
Tian C, Hinds DA, Shigeta R, Kittles R, Ballinger DG, Seldin MF. 2006. A
genomewide single-nucleotide-polymorphism panel with high ancestry infor-
mation for African American admixture mapping. Am J Hum Genet
Tian C, Hinds DA, Shigeta R, Adler SG, Lee A, Pahl MV, Silva G, Belmont JW,
Hanson RL, Knowler WC, Gregersen PK, Ballinger DG, Seldin MF. 2007. A
genomewide single-nucleotide-polymorphism panel for Mexican American
admixture mapping. Am J Hum Genet 80:1014–1023.
Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, Pulver AE,
Qi L, Gregersen PK, Seldin MF. 2008. Analysis and application of European
genetic substructure using 300K SNP information. PLoS Genet 4:e4.
Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K,
Mortensen HM, Hirbo JB, Osman M, Ibrahim M, Omar SA, Lema G, Nyambo
TB, Ghori J, Bumpstead S, Pritchard JK, Wray GA, Deloukas P. 2007.
Convergent adaptation of human lactase persistence in Africa and Europe.
Nat Genet 39:31–40.
Weir B, Cockerham C. 1984. Estimating F-statistics for the analysis of population
structure. Evolution 38:1358–1370.
Yang N, Li H, Criswell LA, Gregersen PK, Alarcon-Riquelme ME, Kittles R, Shigeta R,
Silva G, Patel PI, Belmont JW, Seldin MF. 2005. Examination of ancestry and
ethnic affiliation using highly informative diallelic DNA markers: application to
diverse and admixed populations and implications for clinical epidemiology and
forensic medicine. Hum Genet 118:382–392.