ARTICLE
The Kalash Genetic Isolate: Ancient Divergence, Drift, and Selection
Qasim Ayub,¹,⁷,* Massimo Mezzavilla,¹,²,⁷ Luca Pagani,¹,³ Marc Haber,¹ Aisha Mohyuddin,⁴ Shagufta Khaliq,⁵ Syed Qasim Mehdi,⁶ and Chris Tyler-Smith¹
The Kalash represent an enigmatic isolated population of Indo-European speakers who have been living for centuries in the Hindu Kush mountain ranges of present-day Pakistan. Previous Y chromosome and mitochondrial DNA markers provided no support for their claimed Greek descent following Alexander III of Macedon’s invasion of this region, and analysis of autosomal loci provided evidence of a strong genetic bottleneck. To understand their origins and demography further, we genotyped 23 unrelated Kalash samples on the Illumina HumanOmni2.5M-8 BeadChip and sequenced one male individual at high coverage on an Illumina HiSeq 2000. Comparison with published data from ancient hunter-gatherers and European farmers showed that the Kalash share genetic drift with the Paleolithic Siberian hunter-gatherers and might represent an extremely drifted ancient northern Eurasian population that also contributed to European and Near Eastern ancestry. Since the split from other South Asian populations, the Kalash have maintained a low long-term effective population size (2,319–2,603) and experienced no detectable gene flow from their geographic neighbors in Pakistan or from other extant Eurasian populations. The mean time of divergence between the Kalash and other populations currently residing in this region was estimated to be 11,800 (95% confidence interval = 10,600–12,600) years ago, and thus they represent present-day descendants of some of the earliest migrants into the Indian sub-continent from West Asia.
Introduction
Human populations show subtle allele-frequency differences that lead to geographical structure, and available methods thus allow individuals to be clustered according to genetic information into groups that correspond to geographical regions. In an early worldwide survey of this kind, division into five clusters unsurprisingly identified (1) Africans, (2) a widespread group including Europeans, Middle Easterners, and South Asians, (3) East Asians, (4) Oceanians, and (5) Native Americans. However, division into six groups led to a more surprising finding: the sixth group consisted of a single population, the Kalash.¹ The Kalash are an isolated South Asian population of Indo-European speakers residing in the Hindu Kush mountain valleys in northwest Pakistan, near the Afghan frontier. With a reported census size of 5,000 individuals, they represent a religious minority with unique and rich cultural traditions. DNA samples from the Kalash have been distributed as part of the cell-line panel from the Foundation Jean Dausset’s Human Genome Diversity Project and Centre d’Etude du Polymorphisme Humain (HGDP-CEPH) for over a decade and have formed part of several genetic analyses.² Analyses of uni-parental (Y chromosome and mitochondrial) DNA markers characterized the Kalash as a small population that had undergone a population bottleneck during their recent migration to their present-day abode.³,⁴ This was confirmed by the study of genome-wide autosomal SNPs, which highlighted a strong pattern of genetic drift in this population.⁵ A recent exploration of admixture at fine scales suggested that a major admixture event between the Kalash and present-day western Eurasians occurred between 990 and 210 BCE and related this to Alexander’s invasion of the Indian sub-continent in 327–326 BCE,⁶ although no evidence of such admixture was detected by an analysis of Y chromosome and autosomal short tandem repeat (STR) variation in the Kalash.⁷,⁸
To further investigate the Kalash population’s demographic history and origins, we genotyped additional unrelated Kalash samples on the Illumina bead chip and sequenced one male individual at high coverage. Our aim was to assess whether the Kalash were a recent or an ancient isolate and categorize the extent of genetic isolation and admixture, if any, with extant or archaic humans and thus better understand the reasons for their unique position in worldwide comparisons.
¹Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK; ²Institute for Maternal and Child Health, IRCCS Burlo Garofolo, University of Trieste, 34137 Trieste, Italy; ³Division of Biological Anthropology, University of Cambridge, Cambridge CB2 1QH, UK; ⁴Section of Biochemistry, Shifa College of Medicine, Shifa Tameer-e-Millat University, Sector H-8/4, Islamabad 44000, Pakistan; ⁵Department of Human Genetics & Molecular Biology, University of Health Sciences, Lahore 54000, Pakistan; ⁶Centre for Human Genetics and Molecular Medicine, Sindh Institute of Urology and Transplantation, Karachi, 74200, Pakistan
⁷These authors contributed equally to this work
*Correspondence: qa1@sanger.ac.uk
http://dx.doi.org/10.1016/j.ajhg.2015.03.012. © 2015 The Authors
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
The American Journal of Human Genetics 96, 1–9, May 7, 2015
Please cite this article in press as: Ayub et al., The Kalash Genetic Isolate: Ancient Divergence, Drift, and Selection, The American Journal of Human Genetics (2015), http://dx.doi.org/10.1016/j.ajhg.2015.03.012
Material and Methods
DNA Samples and Genotyping
The Kalash samples were collected from three valleys in the Hindu Kush mountain ranges in northwest Pakistan (Figure 1A). In accordance with the Declaration of Helsinki, the samples were collected after informed consent was obtained, and the study was approved by all relevant institutional ethics committees. Lymphoblastoid cell lines were established from all collected blood samples, and some (n = 25) were deposited with CEPH; these latter samples form part of the South and Central Asian collection of the HGDP-CEPH cell-line panel. We used 10 of these and an additional 13 samples that are not in the collection for our analysis. All of these unrelated (n = 23) Kalash males were genotyped on the Illumina HumanOmni2.5M-8 BeadChip with 200 ng of DNA (26 ng/μl) prepared from these lymphoblastoid cell lines.² Genotyping calls and quality control (QC) were performed with GenoSNP⁹ by the Sanger Institute’s core genotyping facility. Genotypes were called only for samples passing Sequenom genetic fingerprinting and gender concordance. These were run through the standard QC pipeline. All 23 samples passed a call-rate threshold of 95% and were used in the downstream analysis. Genotyping quality was assessed by comparison of 178,072 SNPs that overlapped the Illumina 650K SNP chip.⁵ Ten of the Kalash samples analyzed in this study had also been genotyped on this platform, and the sample genotype concordance was 99.999%. Comparative data were obtained from 35 populations representing Africa, Europe, Caucasus, and West, Central, East, and South Asia (Table S1).⁵,¹⁰–¹²
DNA Sequencing
High-coverage (30×) 100-bp paired-end sequencing of one of the genotyped male samples was carried out on an Illumina HiSeq 2000 with 5 μg of lymphoblastoid cell line DNA, standard library preparation, and analysis pipelines developed for the 1000 Genomes Project.¹³ The sequenced reads were mapped to the human GRCh37 reference sequence (UCSC Human Genome Browser hg19) used by the project (human_g1k_v37.fasta.gz). Variant annotations were performed with the R package NCBI2R and Ensembl’s Variant Effect Predictor.¹⁴ There was high concordance (99.9%) between the variant calls from the high-coverage data and the same sample’s SNP-chip genotypes.
Figure 1. Population Structure and Isolation of the Kalash
(A) Geographic location of the three Pakistani villages where the Kalash samples were collected.
(B) Principal-component analysis (PCA) of Eurasian populations shows the first two components superimposed with the spatial kriging
interpolation of the admixture coefficient of the Kalash genetic cluster. The proportion of admixture is indicated by color: orange
represents the maximum level of admixture, and black represents the lowest. There is no gradient into the proportion of admixture
with the Kalash cluster, suggesting a low level of gene flow between nearby populations and a high degree of isolation.
(C) Admixture analysis in which the lowest cross-validation error (k = 7) shows the unique Kalash cluster (dark green).
Data Analysis
The data were merged with reference-population data from African and non-African sources covering Eurasia (Table S1).⁵,¹⁰–¹²,¹⁵,¹⁶ The merged dataset was pruned to remove variants in high ($r^2 > 0.4$) linkage disequilibrium (LD) and individuals with high identity by descent (IBD) (PLINK IBD score > 0.6). The high IBD threshold was chosen to account for the increased inbreeding levels introduced by the strong genetic drift experienced by the Kalash population. Principal-component analysis (PCA) was performed with EIGENSOFT v.5.01, and ancestry analysis was performed with ADMIXTURE v.1.22.¹⁷ Spatial kriging was used to quantify the spatial genetic heterogeneity by interpolating ancestry values (obtained from ADMIXTURE analysis) for each cluster; the principal-component eigenvectors were used as coordinates for each individual.¹⁸ Genetic structure and gene flow between populations were investigated via three different approaches: ALDER,¹⁹ three-population (f3) statistics,²⁰ and TreeMix.²¹
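The LD-pruning step described above can be illustrated with a minimal sketch. It assumes genotypes are coded as dosages (0, 1, or 2 copies of an allele) and uses a simple greedy all-pairs rule; the function names `r_squared` and `ld_prune` are hypothetical, and the windowing that PLINK applies in practice is deliberately left out. It is an illustration of the idea, not the pipeline used in this study.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype-dosage vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as unlinked
    return cov * cov / (var_x * var_y)


def ld_prune(genotypes, threshold=0.4):
    """Greedy pruning: keep a SNP only if its r^2 with every kept SNP is <= threshold.

    genotypes is a list of SNPs, each a list of per-individual dosages.
    Returns the indices of the SNPs that are kept.
    """
    kept = []
    for idx, snp in enumerate(genotypes):
        if all(r_squared(snp, genotypes[k]) <= threshold for k in kept):
            kept.append(idx)
    return kept


# Toy data (made up): SNP 1 duplicates SNP 0, SNP 2 is only weakly correlated.
toy_snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 0, 1, 1, 0, 2],
]
kept_indices = ld_prune(toy_snps, threshold=0.4)
```

With the toy data, the duplicate SNP is dropped and the weakly correlated one is retained, which is the behavior the $r^2 > 0.4$ threshold is meant to produce.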
We applied pairwise sequentially Markovian coalescent (PSMC) analysis to draw inferences about the long-term effective population sizes and times of divergence from ten high-coverage genomes, including the Kalash.²² The nine other genomes were sequenced by Complete Genomics and included three unrelated African populations (Yoruba in Ibadan, Nigeria [YRI]; Luhya in Webuye, Kenya [LWK]; Maasai in Kinyawa, Kenya [MKK]) and six non-African genomes from East Asia (Han Chinese in Beijing, China [CHB]; Japanese in Tokyo, Japan [JPT]), Europe (Utah residents with Northern and Western European ancestry from the CEPH collection [CEU]; Toscani in Italy [TSI]), South Asia (Gujarati Indian in Houston, Texas [GIH]), and America (Mexican Ancestry in Los Angeles, California [MXL]). Phasing was carried out with SHAPEIT2²³ and the 1000 Genomes Project reference panel.²² We also estimated effective population size and time of divergence between the Kalash and other populations by analyzing LD patterns in SNP-chip data with the NeON R package.²⁴
We assessed the genetic relatedness of ancient genomes to modern populations by computing outgroup f3 statistics.²⁰ In the absence of admixture with the outgroup, the expected value of f3 (outgroup; A, B) is a function of the shared genetic history of A and B. We used the YRI as an outgroup to non-African populations and computed f3 statistic (YRI; ancient, X) to investigate the shared history of an ancient genome and a set of 32 worldwide populations, including the Kalash (X), and f3 statistic (YRI; Kalash, Y) to investigate the Kalash and a set of 32 worldwide populations and ancient genomes (Y). Ancient genomes used in the analysis included the Mal’ta boy (MA-1), a Paleolithic Siberian hunter-gatherer;²⁵ La Braña 1, a Mesolithic European hunter-gatherer from Iberia;²⁶ and Ötzi, the Tyrolean Iceman and a representative European Neolithic farmer.²⁷ BAM files of the ancient genomes were downloaded from the respective references and managed with Picard v.1.112, and the genotypes were called with the Genome Analysis Toolkit v.3.3.²⁸ Data on DNA polymorphisms were stored in VCF files and managed with VCFtools.²⁹
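The outgroup f3 statistic described above can be sketched in a few lines. The sketch assumes per-SNP allele frequencies are already available for the outgroup and the two test populations, and it computes the plain average of (o − a)(o − b) without the finite-sample correction that dedicated software applies. The function name `outgroup_f3` and all frequencies are hypothetical and serve only to show why a larger value indicates more shared drift.

```python
def outgroup_f3(freq_out, freq_a, freq_b):
    """Average of (o - a) * (o - b) across SNPs.

    Each argument is a list of allele frequencies for the same SNPs, in the
    same order. Larger values mean populations A and B share more drift
    relative to the outgroup.
    """
    if not freq_out or not (len(freq_out) == len(freq_a) == len(freq_b)):
        raise ValueError("frequency lists must be non-empty and of equal length")
    total = sum((o - a) * (o - b) for o, a, b in zip(freq_out, freq_a, freq_b))
    return total / len(freq_out)


# Made-up frequencies for four SNPs.
outgroup = [0.0, 1.0, 0.0, 0.5]
pop_a = [0.5, 0.5, 0.2, 0.5]
pop_b = [0.4, 0.6, 0.0, 0.9]  # has drifted away from the outgroup together with A
pop_c = [0.0, 1.0, 0.1, 0.5]  # has stayed close to the outgroup

f3_ab = outgroup_f3(outgroup, pop_a, pop_b)
f3_ac = outgroup_f3(outgroup, pop_a, pop_c)
```

In this toy case `f3_ab` is larger than `f3_ac`, which mirrors how the statistic ranks populations by the drift they share with a given ancient genome.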
Selection
Selection in the Kalash population was estimated as described by Yi et al.³⁰ In brief, we obtained estimates of the population time of divergence ($T$) from $F_{ST}$ values and corrected for the effective population size of the population considered ($N_{ep}$), as shown in
$$T = \frac{\log(1 - F_{ST})}{\log\left(1 - \frac{1}{2 \times N_{ep}}\right)}.$$
For each marker, we calculated $T$ between the Kalash and CHB ($T_{KC}$), between the Kalash and Balochi ($T_{KB}$), who were considered representative of the other Pakistani populations, and between the CHB and Balochi ($T_{CB}$). The population branch statistic (PBS), which defines the length of the branch leading to the Kalash since the split from East Asians, is equal to
$$\mathrm{PBS} = \frac{T_{KC} + T_{KB} - T_{CB}}{2}.$$
$N_{ep}$ estimates were obtained from the linkage data (2,471 [95% confidence interval (CI) = 2,319–2,603]) with the NeON R package. The variants within the 99th percentile of the distribution of our PBS values were annotated, and a list of genes associated with these variants was used for Ingenuity Pathway Analysis (IPA).
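The two formulas in this section translate directly into code. The sketch below assumes per-marker $F_{ST}$ values are already computed and strictly below 1; the function names `divergence_time` and `pbs` are hypothetical, and the default $N_{ep}$ of 2,471 is the point estimate quoted above. Reading $T$ in generations follows from the standard drift model behind the formula and is an interpretation, not a statement from the text.

```python
import math


def divergence_time(fst, n_ep=2471):
    """T = log(1 - F_ST) / log(1 - 1 / (2 * N_ep)); requires 0 <= fst < 1."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("fst must lie in [0, 1)")
    return math.log(1.0 - fst) / math.log(1.0 - 1.0 / (2.0 * n_ep))


def pbs(t_kc, t_kb, t_cb):
    """Population branch statistic: (T_KC + T_KB - T_CB) / 2."""
    return (t_kc + t_kb - t_cb) / 2.0


# Consistency check under the drift model F_ST = 1 - (1 - 1/(2N))^T:
# an F_ST generated from T = 400 should map back to 400.
fst_400 = 1.0 - (1.0 - 1.0 / (2.0 * 2471)) ** 400
recovered_t = divergence_time(fst_400)

# PBS for one marker with made-up pairwise F_ST values.
example_pbs = pbs(
    divergence_time(0.30),  # Kalash vs. CHB
    divergence_time(0.20),  # Kalash vs. Balochi
    divergence_time(0.10),  # CHB vs. Balochi
)
```

A marker gets a large PBS when both Kalash comparisons show high differentiation while the CHB and Balochi comparison does not, which places the extra divergence on the Kalash branch.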
Simulation
We modeled the effect of drift and selection on specific variants by using the simuPOP library.³¹ We used the effective population size estimated from this study in the simulations and obtained the initial allele frequency for each marker from the observed data. We recorded the allele frequency every 50 generations for 500 generations (which roughly corresponds to 12,500 or 14,000 years ago if we assume a generation time of 25 or 28 years, respectively). Each scenario was replicated 1,000 times.
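The logic of such a forward-time simulation can be sketched without simuPOP by using a plain Wright-Fisher model: each generation, an optional selection step shifts the allele frequency, and binomial sampling of 2N gene copies adds drift. Everything below is an assumption made for illustration: the function names, the genic (haploid) selection model, and the use of the standard library instead of simuPOP, whose actual configuration in this study is not described.

```python
import random


def select(p, s):
    """Deterministic frequency change for an allele with genic advantage s."""
    return p * (1.0 + s) / (1.0 + p * s)


def simulate(p0, n_e, generations, s=0.0, record_every=50, rng=None):
    """One Wright-Fisher replicate; returns frequencies at recorded generations."""
    rng = rng or random.Random()
    copies = 2 * n_e
    p = p0
    recorded = []
    for gen in range(1, generations + 1):
        p = select(p, s)
        # Drift: draw 2 * n_e gene copies, each carrying the allele with probability p.
        p = sum(rng.random() < p for _ in range(copies)) / copies
        if gen % record_every == 0:
            recorded.append(p)
    return recorded


def fraction_reaching(target, p0, n_e, generations, replicates, s=0.0, seed=0):
    """Share of replicates whose final frequency is at least target (generations >= 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        final = simulate(p0, n_e, generations, s, generations, rng)[-1]
        hits += final >= target
    return hits / replicates


# Small toy run; the parameters are chosen to finish quickly, not to match the study.
trajectory = simulate(0.47, n_e=50, generations=100, s=0.01,
                      record_every=50, rng=random.Random(1))
```

Running this at the scale described above (an effective size near 2,471, 500 generations, 1,000 replicates) would need roughly 2.5 billion random draws in pure Python, so a vectorized binomial sampler or a dedicated library is the practical choice. Comparing `fraction_reaching` with `s=0.0` and with `s=0.01` is the kind of contrast between drift-only and selection scenarios that the simulations are designed to make.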
Results
The Kalash Are a Genetic Isolate
PCA using only Eurasian and South Asian populations separated the populations from Europe, Caucasus, and West Asia from East Asians in the first component and from South and Central Asians in the second component; Central Asians lay closer to the Sherpa from Nepal and CHB from East Asia. The Kalash samples clustered together as an outlier population to the other South Asian samples from India and Pakistan (Figure 1B). The Kalash genetic isolation was also supported by the ADMIXTURE plot (Figure 1C), in which the lowest cross-validation error was achieved with seven ancestry components. In this analysis, the Kalash were characterized mainly by a unique genetic component (dark green), although many samples shared a proportion of their ancestry with their neighbors in Pakistan (light orange and light blue). This light-blue component was also shared among many diverse populations from West, Central, and South Asia.
The patterns of runs of homozygosity (Figures S1A and S1B) and LD decay (Figure S2) reveal the highest average level of homozygosity and the most extensive LD in the Kalash, possibly reflecting a high level of isolation and low effective population size. The f3 test statistic (Table S3), ALDER (Table S4), and TreeMix analysis (Figure S3) also test for admixture, but they showed no evidence of gene flow into the Kalash and thus provide further support for their genetic isolation. The TreeMix⁵ analysis supports most strongly an un-rooted tree that has 11 migration edges and shows extreme genetic drift in the Kalash but no migration events affecting them (Figure S3).
The Kalash Are an Ancient Genetic Isolate
PSMC analysis applied to the high-coverage Kalash genome, three African genomes (YRI, LWK, and MKK), and six non-African genomes showed that the Kalash, like other non-Africans, experienced a severe bottleneck 50,000–70,000 years ago. The Kalash recovered slightly after the bottleneck but never achieved an effective population size above 20,000, as observed in the GIH (the other South Asian genome) and other non-African genomes, except the MXL (Figure 2A). The Kalash have maintained a low effective size below 10,000 for more than 20,000 years before the present. This pattern of unusually small effective population size in the Kalash is also supported by the estimate from the decay of LD, which was significantly lower ($p < 2 \times 10^{-14}$) than that of neighboring populations from Pakistan (Figure 2B), although the estimated absolute sizes differed between the two approaches.
To examine the time of divergence between the Kalash and other genomes, we used multiple sequentially Markovian coalescent (MSMC) analysis on phased high-coverage genomes. The estimates based on pairs of genomes showed that the Kalash split first from Africans (LWK, MKK, and YRI) and then from East Asians (CHB and JPT). The split from Europeans (CEU and TSI) and South Asians (GIH) appears to have happened around the same time (Figure 2C), approximately 8,000 years ago. Examination based on LD decay in genotyping data also showed that the Kalash were the first population to split from the Central and South Asian cluster around 11,800 (95% CI = 10,600–12,600) years ago (Figure 2D). This estimate was obtained by UPGMA (unweighted pair group method with arithmetic mean) phylogenetic analysis comparing the structure of the tree in the Kalash and other South and Central Asian populations. This split time remained constant even after the addition of the YRI population. We also estimated these split times by using different subsets of non-African populations. The resulting UPGMA trees were not strongly affected by different subsets of European or South Asian populations (Figure S5), and the split times between the Kalash and other populations ranged from 9,600 to 12,600 years ago.
Figure 2. Kalash Demographic History
(A) PSMC analysis shows a low effective population size for the Kalash.
(B) Kalash effective population size estimated from LD analysis.
(C) MSMC analysis of the time of the split between the Kalash and African genomes (YRI, LWK, and MKK) and non-African genomes from East Asia (CHB and JPT), Europe (CEU and TSI), South Asia (GIH), and America (MXL).
(D) A UPGMA (unweighted pair group method with arithmetic mean) dendrogram shows the LD-estimated time of divergence between populations. The mean time of divergence between the Kalash and other populations from the Indian sub-continent is estimated to be 11,800 years ago (dashed red line).
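The UPGMA step used for the LD-based split times can be sketched as follows, assuming a matrix of pairwise divergence times is already available. Because the inputs are times rather than generic distances, each merge is reported at the average pairwise time itself (a textbook UPGMA on distances would halve it to obtain branch lengths). The function name `upgma`, the population labels, and all numbers are hypothetical toy values, not estimates from this study.

```python
from itertools import combinations


def upgma(labels, pairwise):
    """Cluster populations by average pairwise divergence time.

    labels: list of population names.
    pairwise: dict mapping frozenset({a, b}) to the divergence time of a and b.
    Returns a list of merges (cluster_1, cluster_2, mean_pairwise_time),
    ordered from the most recent split to the oldest.
    """
    sizes = {name: 1 for name in labels}
    d = {frozenset(p): pairwise[frozenset(p)] for p in combinations(labels, 2)}
    merges = []
    while len(sizes) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair, key=str)      # deterministic ordering of the pair
        new = (a, b)
        merges.append((a, b, float(d[pair])))
        for c in sizes:
            if c in pair:
                continue
            # Size-weighted mean equals the mean over all original pairs.
            d[frozenset((new, c))] = (
                sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            ) / (sizes[a] + sizes[b])
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        sizes[new] = sizes.pop(a) + sizes.pop(b)
    return merges


# Toy divergence times in years (made up): K is the outlier population.
toy_times = {
    frozenset(("P1", "P2")): 4000,
    frozenset(("K", "P1")): 9000,
    frozenset(("K", "P2")): 11000,
}
toy_merges = upgma(["K", "P1", "P2"], toy_times)
```

In the toy example, P1 and P2 join first, and K joins the pair at the mean of its two pairwise times. This is the sense in which a "mean time of divergence" between one population and a cluster is read off a UPGMA dendrogram.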
The Kalash Share Genetic Drift with Paleolithic
Siberian Hunter-Gatherers
We assessed the genetic relatedness between three ancient genomes and modern human populations, including the Kalash, by computing outgroup f3 statistics. The measure circumvents potential bias from classical genetic relatedness tests, such as PCA (which has sample-size bias) and $F_{ST}$ (which is sensitive to genetic drift that has occurred since divergence of the test populations), when using ancient genomes. According to outgroup f3 statistics, the Kalash share a high level of genetic drift with MA-1, a Paleolithic Siberian hunter-gatherer skeleton dated to ~24,000 years ago, but not a very high level with La Braña 1, the Mesolithic European hunter-gatherer (skeletal remains dated to ~7,000 years ago) or the European farmers represented by Ötzi, the Tyrolean Iceman dated to ~5,300 years ago (Figure 3). Similar to Native Americans, the Kalash share a high proportion of genetic drift with MA-1. In comparison with other populations from Pakistan and India, the Kalash also share a higher proportion of genetic drift with La Braña 1 and Ötzi. The level of drift shared with La Braña 1 and Ötzi is comparable to that of other North European populations (Figure 3B). We also used TreeMix to estimate the proportion of Neandertal ancestry from the high-coverage archaic Altai Neandertal who lived ~50,000 years ago. The jackknife estimate of Neandertal-to-Kalash gene flow was 2.4% ± 0.48%.
Consequences of Ancient Isolation
We also examined the effect of the inferred long-term isolation on genetic drift and natural selection in the Kalash. Our PBS analysis showed evidence of possible positive selection on 1,709 SNPs, of which 762 lie within 548 genes, including RYR2 (MIM: 180902) and ACTN3 (MIM: 102574) (Table S5). IPA showed an enrichment of selection signals in 28 genes associated with cardiovascular physiology and disease pathways (Fisher’s exact test p value = $4.61 \times 10^{-9}$).
Figure 3. Shared Genetic Drift with Ancient Genomes
(A) Proportion of shared genetic drift (measured as f3 statistics) between extant world-wide HGDP-CEPH populations (including the Kalash) and the ancient Siberian hunter-gatherer (MA-1). The magnitude of the computed f3 statistics is represented by the graded heat key. The proportion of genetic drift shared between the Kalash and MA-1 is comparable to that shared between MA-1 and the Yakut, Native Americans, and northern European populations.
(B) Ternary plot of shared genetic drift with three ancient genomes: MA-1 (left), La Braña 1 (middle), and Ötzi, the Tyrolean Iceman (right). The high proportion of genetic drift shared between the Kalash and MA-1 is comparable to that shared between MA-1 and Native Americans. In comparison with other populations from South Asia, the Kalash also share a higher proportion of genetic drift with La Braña 1 and Ötzi.
Two variants that were highly differentiated between the Kalash and the neighboring Pakistani populations stood out. One variant, rs4988235 (c.−13910C>T), which influences lactase (LCT [MIM: 603202]) expression and confers lactose tolerance, is fixed for the ancestral lactose-intolerant allele in the Kalash. The derived allele, however, is reported to be present at a moderate frequency (average 29%) in Pakistan.³² Forward-time simulations demonstrated that the observed pattern cannot easily be explained by recent genetic drift, given that only 0.1% of the 1,000 simulations achieved fixation for the ancestral allele after 500 generations (Figure 4B).
The second variant, rs1815739 (c.1729C>T [p.Arg577Ter]), is a natural knockout variant in ACTN3 and has been associated with elite athletic performance.³³ The derived T allele is present at a very high frequency (93%) in the Kalash. The average frequency in the remaining Pakistani populations is 47%. Using this (47%) frequency as a starting point for the forward-time simulations, we found that the very high frequency for this variant in the Kalash cannot be explained by genetic drift alone, even after 500 generations (Figure 4A). A selection signal (selection coefficient s = 0.01) achieved the observed frequency in 80% of the simulations after 500 generations. Both of these results support the long-standing isolation of the Kalash.
Figure 4. Consequences of Drift and Selection in the Kalash
(A) A nonsense variant in ACTN3 (rs1815739) is present at a higher frequency (left) in the Kalash than in their neighbors in Pakistan. Forward-time simulations (right) show that such a high frequency of the derived allele in the Kalash (dashed blue line) is only observed in a scenario that considers positive selection acting on the variant. The lower line represents the observed mean frequency of the derived allele in the Pakistani population, the orange lines represent the simulated allele frequency of the derived allele in each replicate in the scenario without selection, and the dark red lines represent each replicate in the scenario with positive selection. The observed frequency of the derived allele in the Kalash population is reached only in the scenario with selection and only after 400 generations of drift (~10,000 or 11,200 years ago if we assume a generation time of 25 or 28 years, respectively), suggesting that the observed pattern for this stop gain on ACTN3 can best be explained by selection acting in ancient times and not by any recent population split.
(B) The Kalash are fixed for the ancestral allele of the MCM6 intronic variant (rs4988235) that is associated with lactose intolerance. The derived allele that is associated with lactase persistence is present at moderate frequency in populations from Pakistan (left panel and upper dashed line in the right panel). Forward-time simulations (right panel) suggest that recent isolation and genetic drift cannot explain the observed pattern for this functional polymorphism in the Kalash population. Only 1/1,000 replicates (represented by orange lines) reach fixation after 500 generations of drift (~12,500 years ago if we assume a generation time of 25 years).
Discussion
The present study sheds light on the origins of the enigmatic Kalash population from Pakistan. We propose that the population represents an ancient genetic isolate rather than a recently split population showing extreme genetic drift, as suggested by earlier studies.¹,⁶ The outlier status of these South Asians is corroborated by the fact that we found no evidence of recent admixture in the Kalash by using a variety of analyses, including TreeMix, f3, and linkage-based statistics. The fact that researchers also genotyped ten of these samples earlier by using the HGDP-CEPH panel and that these cluster with the samples genotyped in this study rules out the possibility of confounding results due to population sub-structure within the Kalash.
The ancient separation of the Kalash from a common Eurasian ancestor is supported by PSMC and MSMC analyses, which estimated that the Kalash split from East Asians (CHB and JPT as proxy) prior to splitting from Europeans and other South Asian populations. The split from Europeans (CEU and TSI) and South Asians (represented here by GIH) appears to have occurred during the Neolithic period, which is also supported by the decay of LD. LD decay showed that the Kalash were the first population to split from the other Central and South Asian cluster around 11,800 (95% CI = 10,600–12,600) years ago. This estimate remained constant even after the addition of an African (YRI) population or when the Kalash were compared with different subsets of non-African populations. The pairwise times of divergence with other Pakistani populations ranged from 8,800 years ago with the Burusho to 12,200 years ago with the Hazara. Although migration and undetected admixture in reference populations could bias our estimate of the time of divergence, using different subsets of population revealed no strong bias in the split between the Kalash and South Asians, which occurred after the split between Europeans and South Asian populations.
Since this split, the Kalash have maintained a low $N_e$ of around 2,500 (95% CI = 2,300–2,600), estimated from LD decay with no evidence of admixture. These $N_e$ estimates are lower than those obtained from PSMC analysis because the latter method gives a single estimate of the cross-coalescence rate from the present to 24,000 years ago, whereas the linkage-based method gives us several estimates over the past 10,000 years. It is likely that PSMC analysis could not detect that the Kalash population suffered a continuous decline in effective population size. Taking into account the expected differences in $N_e$ between autosomes and the Y chromosome, this is in agreement with the reported $N_e$ of 237–1,124, which was estimated with observed and evolutionary mutation rates for Y chromosomal STRs.³⁴
The Kalash represent a unique branch in the South Asian population tree and appear to be the earliest population to split from the ancestral Pakistani and Indian populations, indicating a complex scenario for population origins in the sub-continent rather than just the ancestral northern and southern Indian components identified previously.³⁵ These Indo-European speakers were possibly the first migrants to arrive in the Indian sub-continent from northern or western Asia. This is supported by the higher level of shared genetic drift between the Kalash and the Paleolithic Siberian hunter-gatherer skeleton (MA-1) than between MA-1 and the other South Asian populations.
Whereas the Kalash have recently been reported to have
European admixture, postulated to be related to Alexander’s
invasion of South Asia [6], our results show no evidence
of admixture. Although several oral traditions claim that
the Kalash are descendants of Alexander’s soldiers, this
was not supported by Y chromosomal analysis in which
the Kalash had a high proportion of Y haplogroup L3a
lineages, which are characterized by having the derived
allele for the PK3 Y-SNP and are not found elsewhere [7].
They also have predominantly western Eurasian mitochondrial
lineages and no genetic affiliation with East Asians [4].
We observed that the Kalash share a substantial proportion
of drift with a Paleolithic ancient Siberian hunter-gatherer,
who has been suggested to represent a third
northern Eurasian genetic ancestry component for
present-day Europeans [36,37]. This is also supported by the
shared drift observed between the Kalash and the Yamnaya,
an ancient (2,000–1,800 BCE) Neolithic pastoralist
culture that lived in the lower Volga and Don steppe
lands of Russia and also shared ancestry with MA-1 [36,37].
Thus, the Kalash could be considered a genetically drifted
ancient northern Eurasian population, and this shared
ancient component was probably misattributed to recent
admixture with western Europeans.
We also looked at how this long-term separation, isolation,
and low effective population size affected the
patterns of genetic variation in the Kalash. One striking
example is the frequency of the derived allele for
rs4988235, which has been linked to lactose tolerance.
The Kalash, like MA-1, are fixed for the ancestral allele
for this variant, whereas their neighbors in Pakistan have
been observed to have moderate frequencies of the derived
allele. Although this supports their long-term isolation, it
is surprising in other ways because the Kalash have no
reported lactose intolerance and indeed celebrate a “milk
day” during their annual spring rituals [38]. This suggests
that there might be additional derived lactase-persistence
alleles in the LCT-MCM6 (MIM: 601806) region in this
population.
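A back-of-the-envelope calculation gives a feel for how much drift alone could be expected to do over this period. The sketch below uses the standard expectation that heterozygosity decays by a factor of 1 − 1/(2Ne) per generation in an idealized population. The 25-year generation time and the function names are assumptions made for illustration; the Ne and the split time are the Kalash point estimates reported in this paper.

```python
def generations_elapsed(years: float, generation_time_years: float) -> float:
    """Convert a time span in years into generations."""
    return years / generation_time_years


def heterozygosity_retained(ne: float, generations: float) -> float:
    """Expected fraction of the initial heterozygosity left after drift
    in an idealized population of constant effective size ne."""
    return (1.0 - 1.0 / (2.0 * ne)) ** generations


if __name__ == "__main__":
    kalash_ne = 2471            # long-term Ne from Table S2
    split_years = 11800         # estimated divergence time
    generation_time = 25        # assumption: years per generation

    t = generations_elapsed(split_years, generation_time)
    retained = heterozygosity_retained(kalash_ne, t)
    print(f"{t:.0f} generations, fraction of heterozygosity retained: {retained:.3f}")
```

With these inputs roughly 91% of the initial heterozygosity is expected to remain, so drift at this population size is slow. This is an average expectation across loci and does not by itself rule out the loss of an allele at any single locus.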
Another example is the extremely high frequency (93%)
of the stop-gain ACTN3 variant (rs1815739) associated
with normal variation in human muscle strength and
speed [39]. This variant was picked up as an outlier in the PBS
test for selection in the Kalash. Simulations indicated that
such a high frequency of the derived allele in the Kalash
can only be obtained under a scenario that includes
positive selection. The variant might be relevant in
cardiovascular conditioning and muscle strength related to
climbing up and down high mountain passes. Although
ACTN3 has not been associated with adaptation to high
altitude, RYR2, another gene with an intronic outlier
variant (rs2992644) in PBS, has [40].
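The kind of forward-time simulation referred to above can be illustrated with a minimal Wright-Fisher model. The sketch below is not the simulation setup used in this study: the population size, starting frequency, selection coefficient, and function names are hypothetical values chosen so that the example runs quickly. Each generation applies genic selection to the derived allele and then resamples 2Ne gene copies to model drift.

```python
import random


def frequency_after_selection(p: float, s: float) -> float:
    """Derived-allele frequency after selection, with relative fitness 1 + s
    for the derived allele and 1 for the ancestral allele."""
    return p * (1.0 + s) / (1.0 + p * s)


def simulate_trajectory(p0: float, ne: int, generations: int,
                        s: float = 0.0, rng: random.Random = None) -> list:
    """Return the derived-allele frequency in each generation.
    The run stops early once the allele is lost or fixed."""
    rng = rng or random.Random()
    copies = 2 * ne
    p = p0
    trajectory = [p]
    for _ in range(generations):
        p_selected = frequency_after_selection(p, s)
        # Drift: binomial sampling of 2Ne gene copies for the next generation.
        derived = sum(1 for _ in range(copies) if rng.random() < p_selected)
        p = derived / copies
        trajectory.append(p)
        if p == 0.0 or p == 1.0:
            break
    return trajectory


if __name__ == "__main__":
    rng = random.Random(2015)
    replicates = 20
    for label, s in (("drift only", 0.0), ("with selection", 0.02)):
        finals = [simulate_trajectory(0.5, ne=100, generations=200, s=s, rng=rng)[-1]
                  for _ in range(replicates)]
        print(f"{label}: mean final frequency {sum(finals) / replicates:.2f}")
```

Comparing the spread of final frequencies with and without selection against an observed frequency is the general logic of such a test; a realistic analysis would use the estimated Ne, an empirically motivated starting frequency, and many more replicates.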
It has been postulated that South Asia, which is now a
densely occupied land, was encountered by the first
populations of modern humans that ventured out of Africa
more than 50,000 years ago. The exact route taken by these
earliest settlers is not known, although it has been
suggested that they traveled via a southern coastal route [41,42].
The genetically isolated Kalash might be seen as descendants
of the earliest migrants that took a route into
Afghanistan and Pakistan and are most likely present-day
genetically drifted representatives of these ancient
northern Eurasians. A larger survey that includes populations
from their ancestral homeland in Nuristan, Afghanistan,
would provide more insights into their unique genetic
structure and origins and help explain the complex history
of the peopling of South Asia.
Accession Numbers
The high-coverage Kalash sequence reported in this paper has
been deposited in the European Nucleotide Archive (ENA) under
accession number ENA: ERS233567.
Supplemental Data
Supplemental Data include five figures and five tables and can be
found with this article online at http://dx.doi.org/10.1016/j.ajhg.2015.03.012.
Acknowledgments
This work was supported by the Wellcome Trust (098051). We
thank Anna di Rienzo at the University of Chicago and Cynthia
Beall at Case Western Reserve University for access to the Sherpa
genotypes.
Received: February 9, 2015
Accepted: March 26, 2015
Published: April 30, 2015
Web Resources
The URLs for data presented herein are as follows:
1000 Genomes, http://www.1000genomes.org
Ensembl, http://www.ensembl.org/index.html
European Nucleotide Archive, http://www.ebi.ac.uk/ena
HUGO Gene Nomenclature Committee, http://www.genenames.org/
Ingenuity Pathway Analysis, http://www.ingenuity.com
NCBI2R, http://cran.r-project.org/src/contrib/Archive/NCBI2R/
OMIM, http://omim.org
Picard 1.112, http://broadinstitute.github.io/picard/
UCSC Human Genome Browser, http://genome.ucsc.edu/cgi-bin/hgGateway
VCFtools, http://vcftools.sourceforge.net/
References
1. Rosenberg, N.A., Pritchard, J.K., Weber, J.L., Cann, H.M., Kidd,
K.K., Zhivotovsky, L.A., and Feldman, M.W. (2002). Genetic
structure of human populations. Science 298, 2381–2385.
2. Cann, H.M., de Toma, C., Cazes, L., Legrand, M.F., Morel, V.,
Piouffre, L., Bodmer, J., Bodmer, W.F., Bonne-Tamir, B., Cam-
bon-Thomsen, A., et al. (2002). A human genome diversity
cell line panel. Science 296, 261–262.
3. Qamar, R., Ayub, Q., Mohyuddin, A., Helgason, A., Mazhar, K.,
Mansoor, A., Zerjal, T., Tyler-Smith, C., and Mehdi, S.Q.
(2002). Y-chromosomal DNA variation in Pakistan. Am. J.
Hum. Genet. 70, 1107–1124.
4. Quintana-Murci, L., Chaix, R., Wells, R.S., Behar, D.M., Sayar,
H., Scozzari, R., Rengo, C., Al-Zahery, N., Semino, O., Santa-
chiara-Benerecetti, A.S., et al. (2004). Where west meets east:
the complex mtDNA landscape of the southwest and Central
Asian corridor. Am. J. Hum. Genet. 74, 827–845.
5. Li, J.Z., Absher, D.M., Tang, H., Southwick, A.M., Casto, A.M.,
Ramachandran, S., Cann, H.M., Barsh, G.S., Feldman, M., Cav-
alli-Sforza, L.L., and Myers, R.M. (2008). Worldwide human re-
lationships inferred from genome-wide patterns of variation.
Science 319, 1100–1104.
6. Hellenthal, G., Busby, G.B., Band, G., Wilson, J.F., Capelli, C.,
Falush, D., and Myers, S. (2014). A genetic atlas of human
admixture history. Science 343, 747–751.
7. Firasat, S., Khaliq, S., Mohyuddin, A., Papaioannou, M., Tyler-
Smith, C., Underhill, P.A., and Ayub, Q. (2007). Y-chromo-
somal evidence for a limited Greek contribution to the Pathan
population of Pakistan. Eur. J. Hum. Genet. 15, 121–126.
8. Mansoor, A., Mazhar, K., Khaliq, S., Hameed, A., Rehman, S.,
Siddiqi, S., Papaioannou, M., Cavalli-Sforza, L.-L., Mehdi, S.Q., and
Ayub, Q. (2004). Investigation of the Greek ancestry of
populations from northern Pakistan. Hum. Genet. 114, 484–490.
9. Giannoulatou, E., Yau, C., Colella, S., Ragoussis, J., and
Holmes, C.C. (2008). GenoSNP: a variational Bayes within-
sample SNP genotyping algorithm that does not require a
reference population. Bioinformatics 24, 2209–2214.
10. Jeong, C., Alkorta-Aranburu, G., Basnyat, B., Neupane, M., Wi-
tonsky, D.B., Pritchard, J.K., Beall, C.M., and Di Rienzo, A.
(2014). Admixture facilitates genetic adaptations to high alti-
tude in Tibet. Nat. Commun. 5, 3281.
11. Yunusbayev, B., Metspalu, M., Järve, M., Kutuev, I., Rootsi, S.,
Metspalu, E., Behar, D.M., Varendi, K., Sahakyan, H.,
Khusainova, R., et al. (2012). The Caucasus as an asymmetric
semipermeable barrier to ancient human migrations. Mol. Biol.
Evol. 29, 359–365.
12. Behar, D.M., Yunusbayev, B., Metspalu, M., Metspalu, E., Ros-
set, S., Parik, J., Rootsi, S., Chaubey, G., Kutuev, I., Yudkovsky,
G., et al. (2010). The genome-wide structure of the Jewish peo-
ple. Nature 466, 238–242.
13. Abecasis, G.R., Auton, A., Brooks, L.D., DePristo, M.A., Dur-
bin, R.M., Handsaker, R.E., Kang, H.M., Marth, G.T., and
McVean, G.A.; 1000 Genomes Project Consortium (2012).
An integrated map of genetic variation from 1,092 human ge-
nomes. Nature 491, 56–65.
14. McLaren, W., Pritchard, B., Rios, D., Chen, Y., Flicek, P., and
Cunningham, F. (2010). Deriving the consequences of
genomic variants with the Ensembl API and SNP Effect Predic-
tor. Bioinformatics 26, 2069–2070.
15. Metspalu, M., Romero, I.G., Yunusbayev, B., Chaubey, G.,
Mallick, C.B., Hudjashov, G., Nelis, M., Mägi, R., Metspalu,
E., Remm, M., et al. (2011). Shared and unique components
of human population structure and genome-wide signals of
positive selection in South Asia. Am. J. Hum. Genet. 89,
731–744.
16. Altshuler, D.M., Gibbs, R.A., Peltonen, L., Altshuler, D.M.,
Gibbs, R.A., Peltonen, L., Dermitzakis, E., Schaffner, S.F., Yu,
F., Peltonen, L., et al.; International HapMap 3 Consortium
(2010). Integrating common and rare genetic variation in
diverse human populations. Nature 467, 52–58.
17. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E.,
Shadick, N.A., and Reich, D. (2006). Principal components
analysis corrects for stratification in genome-wide association
studies. Nat. Genet. 38, 904–909.
18. Storfer, A., Murphy, M.A., Evans, J.S., Goldberg, C.S., Robin-
son, S., Spear, S.F., Dezzani, R., Delmelle, E., Vierling, L., and
Waits, L.P. (2007). Putting the ‘‘landscape’’ in landscape ge-
netics. Heredity (Edinb) 98, 128–142.
19. Loh, P.-R., Lipson, M., Patterson, N., Moorjani, P., Pickrell, J.K.,
Reich, D., and Berger, B. (2013). Inferring admixture histories
of human populations using linkage disequilibrium. Genetics
193, 1233–1254.
20. Reich, D., Thangaraj, K., Patterson, N., Price, A.L., and Singh,
L. (2009). Reconstructing Indian population history. Nature
461, 489–494.
21. Pickrell, J.K., and Pritchard, J.K. (2012). Inference of popula-
tion splits and mixtures from genome-wide allele frequency
data. PLoS Genet. 8, e1002967.
22. Schiffels, S., and Durbin, R. (2014). Inferring human popula-
tion size and separation history from multiple genome se-
quences. Nat. Genet. 46, 919–925.
23. O’Connell, J., Gurdasani, D., Delaneau, O., Pirastu, N., Ulivi,
S., Cocca, M., Traglia, M., Huang, J., Huffman, J.E., Rudan,
I., et al. (2014). A general approach for haplotype phasing
across the full spectrum of relatedness. PLoS Genet. 10,
e1004234.
24. Mezzavilla, M., and Ghirotto, S. (2015). Neon: An R package to
estimate human effective population size and divergence time
from patterns of linkage disequilibrium between SNPs.
J. Comput. Sci. Syst. Biol. 8, 37–44.
25. Raghavan, M., Skoglund, P., Graf, K.E., Metspalu, M.,
Albrechtsen, A., Moltke, I., Rasmussen, S., Stafford, T.W., Jr.,
Orlando, L., Metspalu, E., et al. (2014). Upper Palaeolithic
Siberian genome reveals dual ancestry of Native Americans.
Nature 505, 87–91.
26. Olalde, I., Allentoft, M.E., Sánchez-Quinto, F., Santpere, G.,
Chiang, C.W., DeGiorgio, M., Prado-Martinez, J., Rodríguez,
J.A., Rasmussen, S., Quilez, J., et al. (2014). Derived immune
and ancestral pigmentation alleles in a 7,000-year-old
Mesolithic European. Nature 507, 225–228.
27. Keller, A., Graefen, A., Ball, M., Matzas, M., Boisguerin, V.,
Maixner, F., Leidinger, P., Backes, C., Khairat, R., Forster, M.,
et al. (2012). New insights into the Tyrolean Iceman’s origin
and phenotype as inferred by whole-genome sequencing.
Nat. Commun. 3, 698.
28. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis,
K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly,
M., and DePristo, M.A. (2010). The Genome Analysis Toolkit:
a MapReduce framework for analyzing next-generation DNA
sequencing data. Genome Res. 20, 1297–1303.
29. Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E.,
DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T.,
Sherry, S.T., et al.; 1000 Genomes Project Analysis Group
(2011). The variant call format and VCFtools. Bioinformatics
27, 2156–2158.
30. Yi, X., Liang, Y., Huerta-Sanchez, E., Jin, X., Cuo, Z.X.P., Pool,
J.E., Xu, X., Jiang, H., Vinckenbosch, N., Korneliussen, T.S.,
et al. (2010). Sequencing of 50 human exomes reveals adapta-
tion to high altitude. Science 329, 75–78.
31. Peng, B., and Kimmel, M. (2005). simuPOP: a forward-time
population genetics simulation environment. Bioinformatics
21, 3686–3687.
32. Enattah, N.S., Trudeau, A., Pimenoff, V., Maiuri, L., Auricchio,
S., Greco, L., Rossi, M., Lentze, M., Seo, J.K., Rahgozar, S., et al.
(2007). Evidence of still-ongoing convergence evolution of
the lactase persistence T-13910 alleles in humans. Am. J.
Hum. Genet. 81, 615–625.
33. Ma, F., Yang, Y., Li, X., Zhou, F., Gao, C., Li, M., and Gao, L.
(2013). The association of sport performance with ACE and
ACTN3 genetic polymorphisms: a systematic review and
meta-analysis. PLoS ONE 8, e54685.
34. Shi, W., Ayub, Q., Vermeulen, M., Shao, R.G., Zuniga, S., van
der Gaag, K., de Knijff, P., Kayser, M., Xue, Y., and Tyler-Smith,
C. (2010). A worldwide survey of human male demographic
history based on Y-SNP and Y-STR data from the HGDP-
CEPH populations. Mol. Biol. Evol. 27, 385–393.
35. Moorjani, P., Thangaraj, K., Patterson, N., Lipson, M., Loh,
P.-R., Govindaraj, P., Berger, B., Reich, D., and Singh, L.
(2013). Genetic evidence for recent population mixture in
India. Am. J. Hum. Genet. 93, 422–438.
36. Lazaridis, I., Patterson, N., Mittnik, A., Renaud, G., Mallick, S.,
Kirsanow, K., Sudmant, P.H., Schraiber, J.G., Castellano, S.,
Lipson, M., et al. (2014). Ancient human genomes suggest
three ancestral populations for present-day Europeans. Nature
513, 409–413.
37. Haak, W., Lazaridis, I., Patterson, N., Rohland, N., Mallick, S.,
Llamas, B., Brandt, G., Nordenfelt, S., Harney, E., Stewardson,
K., et al. (2015). Massive migration from the steppe was a
source for Indo-European languages in Europe. Nature. Pub-
lished online March 2, 2015.
38. Lines, M. (1999). The Kalasha people of North-Western
Pakistan (Peshawar: Emjay Books International).
39. MacArthur, D.G., and North, K.N. (2007). ACTN3: A genetic
influence on muscle function and athletic performance. Ex-
erc. Sport Sci. Rev. 35, 30–34.
40. Huerta-Sánchez, E., Degiorgio, M., Pagani, L., Tarekegn, A.,
Ekong, R., Antao, T., Cardona, A., Montgomery, H.E.,
Cavalleri, G.L., Robbins, P.A., et al. (2013). Genetic signatures reveal
high-altitude adaptation in a set of Ethiopian populations.
Mol. Biol. Evol. 30, 1877–1888.
41. Ayub, Q., and Tyler-Smith, C. (2009). Genetic variation in
South Asia: assessing the influences of geography, language
and ethnicity for understanding history and disease risk. Brief.
Funct. Genomics Proteomics 8, 395–404.
42. Macaulay, V., Hill, C., Achilli, A., Rengo, C., Clarke, D., Mee-
han, W., Blackburn, J., Semino, O., Scozzari, R., Cruciani, F.,
et al. (2005). Single, rapid coastal settlement of Asia revealed
by analysis of complete mitochondrial genomes. Science
308, 1034–1036.
The American Journal of Human Genetics 96, 1–9, May 7, 2015 9
Please cite this article in press as: Ayub et al., The Kalash Genetic Isolate: Ancient Divergence, Drift, and Selection, The American Journal of
Human Genetics (2015), http://dx.doi.org/10.1016/j.ajhg.2015.03.012
The American Journal of Human Genetics
Supplemental Data
The Kalash Genetic Isolate:
Ancient Divergence, Drift, and Selection
Qasim Ayub, Massimo Mezzavilla, Luca Pagani, Marc Haber, Aisha Mohyuddin,
Shagufta Khaliq, Syed Qasim Mehdi, and Chris Tyler-Smith
Figure S1. Pattern of runs of homozygosity.
Number of homozygous segments (NSEG) and total level of homozygosity (MB)
measured in Megabases.
Figure S2. Decay of linkage disequilibrium (LD) in the Kalash.
Kalash LD decay is comparable to the Sherpa and higher than any other Pakistani
population.
Figure S3. TreeMix shows no gene flow in the Kalash.
TreeMix analysis showing that the Kalash lie on a long branch among the other
samples from Pakistan, suggestive of a high level of genetic drift with no evidence
for gene flow.
Figure S4. Per-SNP population branch statistics.
Results for the analyses of the Kalash population. Black dotted line refers to the 99th
percentile of the distribution. Red points refer to the SNPs inside the genes reported
in Table S5.
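The population branch statistic (PBS) summarized in Figure S4 is computed per SNP from three pairwise F_ST values. The sketch below follows the usual definition, in which each F_ST is converted to a branch length T = −ln(1 − F_ST) and the focal branch is (T_AB + T_AC − T_BC)/2. The function names, the comparison populations B and C, and the toy F_ST values are hypothetical and are not taken from this study. The sketch also shows one simple (nearest-rank) way to obtain a 99th-percentile cutoff like the dotted line in the figure.

```python
import math


def branch_length(fst: float) -> float:
    """Convert a pairwise F_ST into a divergence-time-like branch length."""
    # Clip to guard against negative F_ST estimates and log(0) at F_ST = 1.
    clipped = min(max(fst, 0.0), 0.999999)
    return -math.log(1.0 - clipped)


def pbs(fst_focal_b: float, fst_focal_c: float, fst_b_c: float) -> float:
    """Branch length specific to the focal population A, given F_ST for
    the pairs A-B, A-C, and B-C at one SNP."""
    return (branch_length(fst_focal_b) + branch_length(fst_focal_c)
            - branch_length(fst_b_c)) / 2.0


def percentile_cutoff(values, percent: int = 99) -> float:
    """Nearest-rank percentile of a collection of PBS values."""
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical per-SNP F_ST values: (focal vs B, focal vs C, B vs C).
    toy_fst = {
        "snp_1": (0.02, 0.05, 0.04),
        "snp_2": (0.30, 0.35, 0.03),
        "snp_3": (0.01, 0.02, 0.02),
    }
    scores = {snp: pbs(*values) for snp, values in toy_fst.items()}
    cutoff = percentile_cutoff(scores.values())
    for snp, score in scores.items():
        flag = "outlier" if score >= cutoff else ""
        print(f"{snp}: PBS = {score:.3f} {flag}")
```

A SNP with high F_ST against both comparison populations but low F_ST between them receives a long focal branch, which is the pattern an outlier scan looks for.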
Figure S5. UPGMA tree using East Asian (CHB), Europeans (French, TSI) and
South Asian (GIH) reference populations.
CHB are Han Chinese from Beijing, China, TSI are Tuscans in Italy and GIH are
Gujarati Indian in Houston, Texas, a representative population from South Asia.
[Figure S5 image: cluster dendrogram (hclust, "complete", on as.dist(a1) * 1000) with leaves CHB, French, TSI, Kalash, and GIH; height axis from 0 to 40,000 ya.]
Table S1. Populations examined and their sample sizes.

Population      Sample Size   Region         Reference
Adygei          17            Caucasus       Behar et al., 2010 [12]
Armenia         19            Caucasus       Behar et al., 2010 [12]
Balochi         25            South Asia     Li et al., 2008 [5]
Brahui          25            South Asia     Li et al., 2008 [5]
Burusho         25            South Asia     Li et al., 2008 [5]
Chamar          10            South Asia     Metspalu et al., 2011 [15]
CHB             21            East Asia      The International HapMap 3 Consortium, 2010 [16]
Chechens        20            Caucasus       Yunusbayev et al., 2012 [11]
Dharkars        12            South Asia     Metspalu et al., 2011 [15]
Dusadh          10            South Asia     Metspalu et al., 2011 [15]
French          25            Europe         Li et al., 2008 [5]
GIH             24            South Asia     The International HapMap 3 Consortium, 2010 [16]
Hazara          20            South Asia     Li et al., 2008 [5]
Hungarians      20            Europe         Behar et al., 2010 [12]
Iranians        20            West Asia      Behar et al., 2010 [12]
Kalash          23            South Asia     This study
Kalash          14            South Asia     Li et al., 2008 [5]
Kol             17            South Asia     Metspalu et al., 2011 [15]
Makrani         25            South Asia     Li et al., 2008 [5]
Palestinians    25            West Asia      Li et al., 2008 [5]
Pathan          23            South Asia     Li et al., 2008 [5]
Saudi Arabian   20            West Asia      Behar et al., 2010 [12]
Sherpa          68            South Asia     Jeong et al., 2014 [10]
Sindhi          25            South Asia     Li et al., 2008 [5]
Tajiks          15            Central Asia   Yunusbayev et al., 2012 [11]
TSI             20            Europe         The International HapMap 3 Consortium, 2010 [16]
Turkmen         15            Central Asia   Yunusbayev et al., 2012 [11]
Turks           19            West Asia      Behar et al., 2010 [12]
Uzbeks          15            Central Asia   Behar et al., 2010 [12]
Velamas         10            South Asia     Metspalu et al., 2011 [15]
Yemenese        9             West Asia      Behar et al., 2010 [12]
Yoruba          21            Africa         Li et al., 2008 [5]
San             5             Africa         Li et al., 2008 [5]
Bantu           19            Africa         Li et al., 2008 [5]
Biaka Pygmies   22            Africa         Li et al., 2008 [5]
Mbuti Pygmies   13            Africa         Li et al., 2008 [5]
Mandenka        22            Africa         Li et al., 2008 [5]
Table S2. Long term effective population sizes estimated from linkage
disequilibrium patterns.

Population      Long term Ne   95% CI
Adygei          6168           (5820-6570)
Armenia         6876           (6335-7373)
Balochi         6660           (6219-7414)
Brahui          6343           (5979-6898)
Burusho         6220           (5818-6958)
Chamar          5420           (5032-5899)
CHB             6918           (6360-7526)
Chechens        5990           (5345-6399)
Dharkars        4505           (3786-5020)
Dusadh          3031           (2690-3348)
French          6190           (5864-6722)
GIH             7369           (7070-7826)
Hazara          5823           (5395-6580)
Hungarians      6293           (5780-6911)
Iranians        7161           (6570-7746)
Kalash          2471           (2319-2603)
Kol             6809           (6013-7205)
Makrani         7022           (6510-7582)
Palestinians    6463           (5902-6856)
Pathan          7542           (6965-7948)
Saudis          6520           (5879-7007)
Sherpa          3394           (3184-3697)
Sindhi          7343           (7009-8094)
Tajiks          6738           (6396-7284)
TSI             6900           (6419-7367)
Turkmen         4713           (4223-5215)
Turks           7325           (6874-8119)
Uzbeks          6697           (5943-7305)
Velamas         4655           (4153-5342)
Yemenese        3409           (2951-3656)
Yoruba          10805          (10190-11200)
... The major ethnic groups include the Punjabis, Pathans, Sindhi, Saraiki, Muhajir, Balochi, Kalashi, and Makrani (Rakha et al. 2011). The Kalasha or Kalash people are a group of Indo-European Indo-Iranian speaking people living in the Chitral district of Khyber-Pakhtunkhwa province of Pakistan (Denker 1981;Ayub et al. 2015). This unique tribe amongst Indo-Aryan peoples of Pakistan comes from a Dardic family. ...
... This unique tribe amongst Indo-Aryan peoples of Pakistan comes from a Dardic family. Census' outcome reports its population size to be 5000 individuals which shows its religious minority accompanied by rich cultural attributes (Ayub et al. 2015). Generally, the Kalash people by dint of their legends and mythos are associated to ancient Greece, but traditionally they are much nearer to Vedic and pre-Zoroastrians (Mela-Athanasopoulou, 2011) . ...
... Mitochondrial DNA (mtDNA) analysis on Kalash population also does not provide much insight on their evolutionary history because of low cohort size (44 individuals) studied in them (Quintana-Murci et al., 2004). Further, in contrast to previous mtDNA-based Kalash studies where haplogroup assignment is done on the basis of high-resolution RFLP analysis (Ayub et al. 2015), in this study Kalash characterization on the basis of maternal inheritance is done by sequencing $1122 bp long entire control region of mtDNA. So, this study with the highest number of Kalash samples (111) so far is aimed to analyze the mtDNA control region of the genome to identify the haplogroup composition of Kalash. ...
Article
Full-text available
The mitochondrial DNA (mtDNA) complete control region coverage of 111 individuals from Kalash population of Pakistan has been presented for forensic applications and to infer their genetic parameters. We detected in total 14 different haplotypes with only five unique and nine shared by more than one individual. This population has come up with quite lower haplotype diversity (0.8393) and very higher random match probability (0.1682), and ultimately lower power of discrimination (0.832). Additionally, haplogroup distribution reveals the genetic ancestry of Kalash, mainly from West Eurasia (98.8%) and very little from South Asia (0.9%). Neither African lineages nor East Asian genetic segments were detected among these Kalash. This study will contribute to the database development for forensic applications as well as to track the evolutionary highlights of this ethnic group.
... Previous studies have indicated that this is a reasonable PBS threshold. [57][58][59] To correct for the effects of linkage disequilibrium (LD), we selected the eQTL with the top PBS score in each 100 kb genomic window. To increase rigor, analyses were repeated using a cutoff of the top 0.1% eQTL PBS scores (also LD pruned). ...
Article
Full-text available
Large numbers of expression quantitative trait loci (eQTLs) have recently been identified in humans, and many of these regulatory variants have large allele frequency differences between populations. Here, we conducted genome-wide scans of selection to identify adaptive eQTLs (i.e., eQTLs with large population branch statistics). We then tested whether tissue pleiotropy affects whether eQTLs are more or less likely to be adaptive and identified tissues that have been key targets of positive selection during the last 100,000 years. Top adaptive eQTL outliers include rs1043809, rs66899053, and rs2814778 (a SNP that is associated with malaria resistance). We found that effect sizes of eQTLs were negatively correlated with population branch statistics, and that adaptive eQTLs affect two-thirds as many tissues as non-adaptive eQTLs. Because the tissue breadth of an eQTL can be viewed as a measure of pleiotropy, these results imply that pleiotropy inhibits adaptation. The proportion of eQTLs that are adaptive varies by tissue, and we found that eQTLs that regulate expression in testis, thyroid, blood, or sun-exposed skin are enriched for signatures of positive selection. By contrast, eQTLs that regulate expression in the cerebrum or female-specific tissues have a relative lack of adaptive outliers. Scans of selection also reveal that many adaptive eQTLs are closely linked to disease-associated loci. Taken together, our results indicate that eQTLs have played an important role in recent human evolution.
... Therefore, a mixed pattern of clusters does not always represent actual admixture, e.g., individuals with a history of admixture with an unknown ghost population. Moreover, assigning a group of individuals to a single cluster does not indicate that individuals have not undergone admixture, e.g., if a sister population (Kalash) is way more privately drifted due to a recent substantial population bottleneck (Ayub et al., 2015a). Consequently, STRUCTURE or ADMIXTURE results must be corroborated by other methods that follow varied modeling assumptions such as TreeMix (Pickrell and Pritchard, 2012), fine-STRUCTURE (Lawson et al., 2012), f 3 and D statistics , etc., to conclude patterns of population mixing and demographic histories decisively. ...
Thesis
This PhD thesis, prepared in Tartu University, addresses genetics of population history of the South Asian peoples. Inhabited considerably before the Last Glacial Maximum, the region harbors by now about 1.8 billion humans – almost a quarter of the global population. Therefore, understanding of present-day variation of the latter, in particular outside sub-Saharan Africa, is not possible without deeper knowledge about genetics of South Asian populations. This thesis is based on four published papers. The first one is focused on selected populations inhabiting northeastern Indus Valley, bearing, in particular, in mind ancient Indus Valley civilization and following it Vedic period. The second and the third paper address historically somewhat better known migrations, bringing to India religiously distinct Parsi and Jewish peoples. The fourth paper analyses the genetic variation of a populous Tharu tribe, living predominantly in Nepal, but also in northern provinces of India. Perhaps the most interesting finding of the first paper is that the presumably identified already in Vedic texts, Ror population exhibits significant genetic affinity with northern Steppe and West European peoples, testifying about prehistoric north to south migration(s). The arrival of Parsis to South Asia in 7th century was a consequence of the Islamization of Iran. Comparing Parsi genomes in their historic contexts, we observed their extensive admixture with South Asians, in particular, asymmetrically in paternal and maternal lineages. Nearly the same can be said about different Indian communities that preserved Judaist traditions: their genomes show affinities to peoples living in the Near and Middle East. As far as the genetically highly diverse Tharu tribe is concerned, a clearly distinct East Asian contribution can be seen, admixed with South Asian genetic heritage. It seems justified to identify the Tharu as cultural, rather than demic phenomenon.
... Probably they are present-day genetically drifted representative of the ancient northern Eurasians. (Ayub et al., 2015). It is claimed that Kalash tribe migrated from Afghanistan while some historians claim that they are the descendants of Alexendar army and have the Greek origin (Shah, 2008). ...
Article
The Kalash is an isolated population famous for their cultural and religious practices residing in the Northern areas of Pakistan. Many theories have been reported by historians about the origin of the Kalash people but very little genetic evidence been reported to date. In the current study, we investigated the mitochondrial DNA of unrelated individuals representing the Kalash population. The mitochondrial control region of 76 individuals was elucidated by high throughput Sanger sequencing. A total of 31 (23 unique) different haplotypes were observed. High genetic diversity (GD) was observed was 0.9012 and there is high discriminatory power (DP = 0.8894). The most common haplogroups among 76 individuals found were H2a2a (West Eurasian) (23.68%) followed by H4a1a and J1d3a (10.5%) each and H2a3 (9.21%). The results obtained were then compared with other world populations reported and it is concluded that the Kalash are a diverse population. This study provides an important contribution toward the establishment of a mitochondrial DNA repository of Pakistan.
... Distinct Indo-Aryan dialect complexes, natively spoken in this region and known as "Dardic," were broadly assumed to be the independent surviving remnants of the ancestors of Indo-Iranian speakers [17]. Anthropological and genetic studies have suggested that settlers in northern Pakistan, such as the Kalash and Kho, were historically and culturally isolated from their urbanized surroundings in South Asia and other extant Eurasian populations [18,19], thus probably representing early offshoots of the Vedic Aryans [20]. Moreover, the other major languages in this range, i.e., the eastern Iranians, can be traced back to Avestan scriptures [21], implying antiquity and long-term occupation of these IE language bearers along the northwestern region of South Asia. ...
Article
To elucidate whether Bronze Age population dispersals from the Eurasian Steppe to South Asia contributed to the gene pool of Indo-Iranian-speaking groups, we analyzed 19,568 mitochondrial DNA (mtDNA) sequences from northern Pakistani and surrounding populations, including 213 newly generated mitochondrial genomes (mitogenomes) from Iranian and Dardic groups, both speakers from the ancient Indo-Iranian branch in northern Pakistan. Our results showed that 23% of mtDNA lineages with west Eurasian origin arose in situ in northern Pakistan since ~5000 years ago (kya), a time depth very close to the documented Indo-European dispersals into South Asia during the Bronze Age. Together with ancient mitogenomes from western Eurasia since the Neolithic, we identified five haplogroups (~8.4% of maternal gene pool) with roots in the Steppe region and subbranches arising (age ~5-2 kya old) in northern Pakistan as genetic legacies of Indo-Iranian speakers. Some of these haplogroups, such as W3a1b that have been found in the ancient samples from the late Bronze Age to the Iron Age period individuals of Swat Valley northern Pakistan, even have sub-lineages (age ~4 kya old) in the southern subcontinent, consistent with the southward spread of Indo-Iranian languages. By showing that substantial genetic components of Indo-Iranian speakers in northern Pakistan can be traced to Bronze Age in the Steppe region, our study suggests a demographic link with the spread of Indo-Iranian languages, and further highlights the corridor role of northern Pakistan in the southward dispersal of Indo-Iranian-speaking groups.
... More than 14 languages are spoken in the region (Khan & Uddin, 2013). Due to its location, it has a cultural resemblance to Greek and Iranian (Ayub et al., 2015;Cann et al., 2002). The region has a beautiful tribe known as the Kalash is also part of this district. ...
Article
Ancient human DNA has various applications in molecular evolution and studies the genetic relationship between the archaic human population and the modern human population. Many ancient human remains are stored in the archeological museum and can be used for DNA sequence analysis. The current study was ever first attempt in Pakistan to use old biological specimens for molecular characterization to trace the population of Chitral district in KP Pakistan. Due to the low quantity and quality of ancient DNA, it is challenging to isolate DNA profiles from ancient human samples. A protocol was optimized for the extraction of degraded DNA from the ancient human bone's specimens. Different Bioinformatics analyses like online servers, Mitomastar and James lick, Phylogenetic Tree, and Genetic Diversity were used for the molecular characterization of the Chitrali population. Our results show that the Ancient Chitrali population has admixture with Europeans and Neolithic European populations.
... Saudi Arabia, along with other Middle Eastern populations, showed intermediate levels of genetic diversity. Overall diversity was highest among the Biaka Pygmies, reflecting the great genetic diversity retained in Africa, while lower values were seen among the Japanese as previously noted [43], and particularly in the Kalash which show signs of an earlier genetic bottleneck [44] and may be further influenced by their unusual marriage practices which allow women freedom to divorce and remarry [45]. ...
Article
Massively parallel sequencing (MPS) of forensic STRs has the potential to reveal additional allele diversity compared to conventional capillary electrophoresis (CE) typing strategies, but population studies are currently relatively few in number. The Verogen ForenSeq™ DNA Signature Prep Kit includes both Y-STRs and X-STRs among its targeted loci, and here we report the sequences of these loci, analysed using Verogen’s ForenSeq™ Universal Analysis Software (UAS) v1.3 and STRait Razor v3.0, in a representative sample of 89 Saudi Arabian males. We identified 56 length variants (equivalent to CE alleles) and 75 repeat sequence sub-variants across the six X-STRs analysed; equivalent figures for the set of 24 Y-STRs were 147 and 192 respectively. We also observed two flanking sequence variants for the X-, and six for the Y-STRs. Recovery of sequence data and concordance with CE data (where available) across the tested loci was good, though rare flanking variation affected interpretation and allele calling at DYF387S1 and DXS7132. Examination of flanking sequences of the Y-STRs revealed five SNPs (L255, M4790, BY7692, Z16708 and S17543) previously shown to define specific haplogroups by Y-chromosome sequencing. These define Y-haplogroups in 62 % of our sample, a proportion that increases to 91 % when haplogroup-associated repeat-sequence motifs are also considered. A population-level comparison of the Saudi Arabian X-STRs with a global sample showed our dataset to be part of a large cluster of populations of West Eurasian and Middle Eastern origin.
... • Panel "2240K": Genotypes for 404 whole-genome sequenced modern individuals [31, 75-80], at 2,043,687 autosomal SNPs targeted for in-solution capture in previously published ancient DNA panels [81-83]. For both panels, pseudo-haploid genotypes for ancient individuals were generated by randomly sampling an allele passing filters (mapping quality ≥ 30 and base quality ≥ 30) at the reference panel SNP positions. ...
Article
Full-text available
Anatomically modern humans reached East Asia more than 40,000 years ago. However, key questions still remain unanswered with regard to the route(s) and the number of wave(s) in the dispersal into East Eurasia. Ancient genomes at the edge of the region may elucidate a more detailed picture of the peopling of East Eurasia. Here, we analyze the whole-genome sequence of a 2,500-year-old individual (IK002) from the main-island of Japan that is characterized with a typical Jomon culture. The phylogenetic analyses support multiple waves of migration, with IK002 forming a basal lineage to the East and Northeast Asian genomes examined, likely representing some of the earliest-wave migrants who went north from Southeast Asia to East Asia. Furthermore, IK002 shows strong genetic affinity with the indigenous Taiwan aborigines, which may support a coastal route of the Jomon-ancestry migration. This study highlights the power of ancient genomics to provide new insights into the complex history of human migration into East Eurasia. Takashi Gakuhari, Shigeki Nakagome et al. report the genomic analysis on a 2.5 kya individual from the ancient Jomon culture in present-day Japan. Phylogenetic analysis with comparison to other Eurasian sequences suggests early migration patterns in Asia and provides insight into the genetic affinities between peoples of the region.
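The pseudo-haploid genotyping step quoted in the citing excerpt for this article (randomly sampling one allele that passes mapping quality ≥ 30 and base quality ≥ 30 at each reference-panel SNP) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Observation` record, the `pseudo_haploid_call` function, and the assumption that per-read observations at a site have already been collected from a pileup are all hypothetical names and choices introduced here for illustration.

```python
import random
from typing import NamedTuple, Optional, Sequence


class Observation(NamedTuple):
    """One read's evidence at a SNP position (hypothetical record type)."""
    base: str    # allele observed on the read, e.g. "A"
    mapq: int    # mapping quality of the read
    baseq: int   # base quality at this position


def pseudo_haploid_call(
    observations: Sequence[Observation],
    min_mapq: int = 30,
    min_baseq: int = 30,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one allele at random among the observations that pass both filters.

    Returns None when no observation passes, i.e. the site is left missing.
    The thresholds default to the values quoted in the excerpt.
    """
    rng = rng or random.Random()
    passing = [
        obs.base
        for obs in observations
        if obs.mapq >= min_mapq and obs.baseq >= min_baseq
    ]
    if not passing:
        return None
    return rng.choice(passing)


if __name__ == "__main__":
    site = [
        Observation("A", 60, 35),
        Observation("G", 60, 12),   # fails the base-quality filter
        Observation("A", 10, 40),   # fails the mapping-quality filter
        Observation("G", 42, 30),
    ]
    print(pseudo_haploid_call(site, rng=random.Random(1)))
```

Sampling a single observation per site, rather than calling a diploid genotype, is generally used for low-coverage ancient genomes because it avoids biasing calls toward the reference or toward homozygotes when only a few reads cover a position.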
Article
Full-text available
Calabrian Greeks are an enigmatic population that has preserved and evolved a unique variety of language, Greco, which survives in the isolated Aspromonte mountain area of Southern Italy. To understand their genetic ancestry and explore possible effects of geographic and cultural isolation, we genome-wide genotyped a large set of South Italian samples including both communities that still speak Greco nowadays and those that lost the use of this language earlier in time. Comparisons with modern and ancient populations highlighted ancient, long-lasting genetic links with Eastern Mediterranean and Caucasian/Near-Eastern groups as ancestral sources of Southern Italians. Our results suggest that the Aspromonte communities might be interpreted as genetically drifted remnants that departed from such ancient genetic background as a consequence of long-term isolation. Specific patterns of population structuring and higher levels of genetic drift were indeed observed in these populations, reflecting geographic isolation amplified by cultural differences in the groups that still conserve the Greco language. Isolation and drift also affected the current genetic differentiation at specific gene pathways, prompting future genome-wide association studies aimed at exploring trait-related loci that have drifted up in frequency in these isolated groups.
Article
Full-text available
We generated genome-wide data from 69 Europeans who lived between 8,000-3,000 years ago by enriching ancient DNA libraries for a target set of almost four hundred thousand polymorphisms. Enrichment of these positions decreases the sequencing required for genome-wide ancient DNA analysis by a median of around 250-fold, allowing us to study an order of magnitude more individuals than previous studies and to obtain new insights about the past. We show that the populations of western and far eastern Europe followed opposite trajectories between 8,000-5,000 years ago. At the beginning of the Neolithic period in Europe, ~8,000-7,000 years ago, closely related groups of early farmers appeared in Germany, Hungary, and Spain, different from indigenous hunter-gatherers, whereas Russia was inhabited by a distinctive population of hunter-gatherers with high affinity to a ~24,000 year old Siberian. By ~6,000-5,000 years ago, a resurgence of hunter-gatherer ancestry had occurred throughout much of Europe, but in Russia, the Yamnaya steppe herders of this time were descended not only from the preceding eastern European hunter-gatherers, but from a population of Near Eastern ancestry. Western and Eastern Europe came into contact ~4,500 years ago, as the Late Neolithic Corded Ware people from Germany traced ~3/4 of their ancestry to the Yamnaya, documenting a massive migration into the heartland of Europe from its eastern periphery. This steppe ancestry persisted in all sampled central Europeans until at least ~3,000 years ago, and is ubiquitous in present-day Europeans. These results provide support for the theory of a steppe origin of at least some of the Indo-European languages of Europe.
Article
Full-text available
Objective: Estimating the effective population size (Ne) is crucial to understanding how populations evolved, expanded or shrank. One possible approach is to compare DNA diversity, so as to obtain an average Ne over many past generations; however, as population sizes change over time, another possibility is to describe this change. Linkage disequilibrium (LD) patterns contain information about these changes and, whenever a large number of densely linked markers are available, can be used to monitor fluctuating population size through time. Here, we present a new R package, NeON, that has been designed to explore a population's LD patterns to reconstruct two key parameters of human evolution: the effective population size and the divergence time between populations. Methods: NeON starts with binary or pairwise-LD PLINK files, and allows the user (a) to assign a genetic map position using HapMap (NCBI release 36 or 37), (b) to calculate the effective population size over time, exploiting the relationship between Ne and the average squared correlation coefficient of LD (r²) within predefined recombination distance categories, and (c) to calculate the confidence interval around Ne based on the observed variation of the estimator across chromosomes; the outputs of the functions are both numerical and graphical. This package also offers the possibility to estimate the divergence time between populations given the Ne values calculated from the within-population LD data and a matrix of between-population FST values. These routines can be adapted to any species whenever genetic map positions are available. Results and Conclusion: The functions contained in the R package NeON provide reliable estimates of effective population sizes of human chromosomes from LD patterns of genome-wide SNP data, as is shown here for the populations contained in the CEPH panel. The NeON package can accommodate variable numbers of individuals, populations and genetic markers, so that they can be analyzed on standard personal computers.
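The relationship between Ne and the average r² within a recombination-distance category, mentioned in the abstract above, can be sketched in Python. The snippet assumes the widely used approximation E[r²] ≈ 1/(4·Ne·c + 2), with c the recombination distance in Morgans, and the common rule of thumb that LD at distance c reflects Ne roughly 1/(2c) generations ago. Both formulas are assumptions brought in for illustration; the abstract does not state which estimator NeON implements, and the function names are hypothetical.

```python
def ne_from_r2(mean_r2: float, c_morgans: float) -> float:
    """Solve E[r^2] ~ 1 / (4 * Ne * c + 2) for Ne.

    mean_r2   : average squared LD correlation within one distance bin
    c_morgans : recombination distance of that bin, in Morgans
    """
    if not 0.0 < mean_r2 < 0.5:
        # r^2 >= 0.5 would give Ne <= 0 under this approximation
        raise ValueError("mean_r2 must lie strictly between 0 and 0.5")
    if c_morgans <= 0.0:
        raise ValueError("c_morgans must be positive")
    return (1.0 / mean_r2 - 2.0) / (4.0 * c_morgans)


def generations_ago(c_morgans: float) -> float:
    """Approximate age (in generations) of the Ne reflected at distance c."""
    if c_morgans <= 0.0:
        raise ValueError("c_morgans must be positive")
    return 1.0 / (2.0 * c_morgans)


if __name__ == "__main__":
    # Illustrative values only, not taken from any dataset.
    for c, r2 in [(0.001, 0.20), (0.005, 0.08), (0.01, 0.05)]:
        print(f"c={c} M: Ne ~ {ne_from_r2(r2, c):.0f}, "
              f"~{generations_ago(c):.0f} generations ago")
```

Evaluating the estimator in several distance bins gives a coarse trajectory of Ne through time, since tighter bins (smaller c) correspond to older generations under the 1/(2c) approximation.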
Article
Full-text available
Many existing cohorts contain a range of relatedness between genotyped individuals, either by design or by chance. Haplotype estimation in such cohorts is a central step in many downstream analyses. Using genotypes from six cohorts from isolated populations and two cohorts from non-isolated populations, we have investigated the performance of different phasing methods designed for nominally 'unrelated' individuals. We find that SHAPEIT2 produces much lower switch error rates in all cohorts compared to other methods, including those designed specifically for isolated populations. In particular, when large amounts of IBD sharing are present, SHAPEIT2 infers close to perfect haplotypes. Based on these results we have developed a general strategy for phasing cohorts with any level of implicit or explicit relatedness between individuals. First, SHAPEIT2 is run ignoring all explicit family information. We then apply a novel HMM method (duoHMM) to combine the SHAPEIT2 haplotypes with any family information to infer the inheritance pattern of each meiosis at all sites across each chromosome. This allows the correction of switch errors and the detection of recombination events and genotyping errors. We show that the method detects numbers of recombination events that align very well with expectations based on genetic maps, and that it infers far fewer spurious recombination events than Merlin. The method can also detect genotyping errors and infer recombination events in otherwise uninformative families, such as trios and duos. The detected recombination events can be used in association scans for recombination phenotypes. The method provides a simple and unified approach to haplotype estimation that will be of interest to researchers in the fields of human, animal and plant genetics.
Article
Full-text available
Modern genetic data combined with appropriate statistical methods have the potential to contribute substantially to our understanding of human history. We have developed an approach that exploits the genomic structure of admixed populations to date and characterize historical mixture events at fine scales. We used this to produce an atlas of worldwide human admixture history, constructed by using genetic data alone and encompassing over 100 events occurring over the past 4000 years. We identified events whose dates and participants suggest they describe genetic impacts of the Mongol empire, Arab slave trade, Bantu expansion, first millennium CE migrations in Eastern Europe, and European colonialism, as well as unrecorded events, revealing admixture to be an almost universal force shaping human populations.
Article
Full-text available
By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.
Article
Full-text available
Ancient genomic sequences have started to reveal the origin and the demographic impact of farmers from the Neolithic period spreading into Europe. The adoption of farming, stock breeding and sedentary societies during the Neolithic may have resulted in adaptive changes in genes associated with immunity and diet. However, the limited data available from earlier hunter-gatherers preclude an understanding of the selective processes associated with this crucial transition to agriculture in recent human evolution. Here we sequence an approximately 7,000-year-old Mesolithic skeleton discovered at the La Braña-Arintero site in León, Spain, to retrieve a complete pre-agricultural European human genome. Analysis of this genome in the context of other ancient samples suggests the existence of a common ancient genomic signature across western and central Eurasia from the Upper Paleolithic to the Mesolithic. The La Braña individual carries ancestral alleles in several skin pigmentation genes, suggesting that the light skin of modern Europeans was not yet ubiquitous in Mesolithic times. Moreover, we provide evidence that a significant number of derived, putatively adaptive variants associated with pathogen resistance in modern Europeans were already present in this hunter-gatherer.
Article
Full-text available
We sequenced the genomes of a ~7,000 year old farmer from Germany and eight ~8,000 year old hunter-gatherers from Luxembourg and Sweden. We analyzed these and other ancient genomes with 2,345 contemporary humans to show that most present Europeans derive from at least three highly differentiated populations: West European Hunter-Gatherers (WHG), who contributed ancestry to all Europeans but not to Near Easterners; Ancient North Eurasians (ANE) related to Upper Paleolithic Siberians, who contributed to both Europeans and Near Easterners; and Early European Farmers (EEF), who were mainly of Near Eastern origin but also harbored WHG-related ancestry. We model these populations’ deep relationships and show that EEF had ~44% ancestry from a “Basal Eurasian” population that split prior to the diversification of other non-African lineages.
Article
The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model ancestral relationships under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20,000-30,000 years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The multiple sequentially Markovian coalescent (MSMC) analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago, including the bottleneck in the peopling of the Americas and separations within Africa, East Asia and Europe.
Article
Admixture is recognized as a widespread feature of human populations, renewing interest in the possibility that genetic exchange can facilitate adaptations to new environments. Studies of Tibetans revealed candidates for high-altitude adaptations in the EGLN1 and EPAS1 genes, associated with lower haemoglobin concentration. However, the history of these variants or that of Tibetans remains poorly understood. Here we analyse genotype data for the Nepalese Sherpa, and find that Tibetans are a mixture of ancestral populations related to the Sherpa and Han Chinese. EGLN1 and EPAS1 genes show a striking enrichment of high-altitude ancestry in the Tibetan genome, indicating that migrants from low altitude acquired adaptive alleles from the highlanders. Accordingly, the Sherpa and Tibetans share adaptive haemoglobin traits. This admixture-mediated adaptation shares important features with adaptive introgression. Therefore, we identify a novel mechanism, beyond selection on new mutations or on standing variation, through which populations can adapt to local environments.