
Southern African ancient genomes estimate modern human divergence to 350,000-260,000 years ago

Abstract

Southern Africa is consistently placed as a potential region for the evolution of Homo sapiens. We present genome sequences, up to 13x coverage, from seven ancient individuals from KwaZulu-Natal, South Africa. Three Stone Age hunter-gatherers (~2,000 years old) were genetically similar to current-day southern San groups, while four Iron Age farmers (300-500 years old) were genetically similar to present-day Bantu-speakers. We estimate that all modern-day Khoe-San groups have been influenced by 9-30% genetic admixture from East Africans/Eurasians. Using traditional and new approaches, we estimate the first modern human population divergence time to between 350,000 and 260,000 years ago. This estimate increases the deepest divergence amongst modern humans, coinciding with anatomical developments of archaic humans into modern humans as represented in the local fossil record.
Cite as: C. M. Schlebusch et al., Science 10.1126/science.aao6266 (2017).
First release: 28 September 2017 www.sciencemag.org (Page numbers not final at time of first release) 1
Downloaded from http://science.sciencemag.org/ on September 28, 2017
Table 1. Samples [see (13)].
Sample | Calibrated date BP (2σ) | Genomic DNA coverage | Mitochondrial DNA coverage | Biological sex determination | Mitochondrial haplogroup | Y chromosome haplogroup | Morphological sex determination
Ballito Bay A | 1986-1831* | 12.94 | 1035 | XY | L0d2c1 | A1b1b2 | Juvenile
Ballito Bay B | 2149-1932 | 1.25 | 84 | XY | L0d2a1 | A1b1b2 | Male
Doonside | 2296-1910* | 0.01 | 2.6 | | L0d2 | |
Champagne Castle | 448-282 | 0.36 | 186 | XX | L0d2a1a | | Female
Eland Cave | 533-453 | 13.23 | 7597 | XX | L3e3b1 | | Female
Mfongosi | 448-308 | 6.94 | 562 | XX | L3e1b2 | | Female
Newcastle | 508-327 | 10.65 | 616 | XX | L3e2b1a2 | | Female
*Ribot et al. (2010) (15)
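As a reading aid for Table 1, the following sketch groups the seven individuals by the midpoint of their calibrated date range. The dates and coverages are copied from the table; the 1,000-year cutoff and the names `SAMPLES`, `midpoint_bp` and `group_by_period` are illustrative assumptions and not part of the original analysis.

```python
# Calibrated 2-sigma date ranges (cal BP) and genomic coverage, from Table 1.
SAMPLES = {
    "Ballito Bay A": ((1986, 1831), 12.94),
    "Ballito Bay B": ((2149, 1932), 1.25),
    "Doonside": ((2296, 1910), 0.01),
    "Champagne Castle": ((448, 282), 0.36),
    "Eland Cave": ((533, 453), 13.23),
    "Mfongosi": ((448, 308), 6.94),
    "Newcastle": ((508, 327), 10.65),
}


def midpoint_bp(date_range):
    """Return the midpoint of an (older, younger) cal BP range."""
    older, younger = date_range
    return (older + younger) / 2


def group_by_period(samples, cutoff_bp=1000):
    """Split samples into those older and younger than a cutoff.

    The cutoff is a hypothetical value chosen only to separate the two
    clusters of dates visible in the table.
    """
    groups = {"older": [], "younger": []}
    for name, (date_range, _coverage) in samples.items():
        key = "older" if midpoint_bp(date_range) > cutoff_bp else "younger"
        groups[key].append(name)
    return groups


if __name__ == "__main__":
    for period, names in group_by_period(SAMPLES).items():
        print(period, sorted(names))
```

Under this cutoff the three coastal individuals fall in the older group and the remaining four in the younger group, matching the three ~2,000-year-old and four 300-500-year-old individuals described in the abstract.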
Southern African ancient genomes estimate modern human divergence to 350,000 to 260,000 years ago
Carina M. Schlebusch, Helena Malmström, Torsten Günther, Per Sjödin, Alexandra Coutinho, Hanna Edlund, Arielle R. Munters, Mário Vicente, Maryna Steyn, Himla Soodyall, Marlize Lombard and Mattias Jakobsson
published online September 28, 2017
ARTICLE TOOLS http://science.sciencemag.org/content/early/2017/09/27/science.aao6266
SUPPLEMENTARY MATERIALS http://science.sciencemag.org/content/suppl/2017/09/27/science.aao6266.DC1
REFERENCES http://science.sciencemag.org/content/early/2017/09/27/science.aao6266#BIBL
This article cites 99 articles, 21 of which you can access for free
PERMISSIONS http://www.sciencemag.org/help/reprints-and-permissions
Use of this article is subject to the Terms of Service
Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. 2017 © The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. The title Science is a registered trademark of AAAS.
Supplementary Materials for
Southern African Ancient Genomes Estimate Modern Human Divergence To 350,000-
260,000 Years Ago
Carina M. Schlebusch1,4*, Helena Malmström1,4*, Torsten Günther1, Per Sjödin1, Alexandra Coutinho1,
Hanna Edlund1, Arielle R. Munters1, Mário Vicente1, Maryna Steyn2, Himla Soodyall3, Marlize
Lombard4,5#, Mattias Jakobsson1,4,6#
correspondence to: Mattias Jakobsson (Mattias.jakobsson@ebc.uu.se) and Marlize Lombard
(mlombard@uj.ac.za)
This PDF file includes:
Materials and Methods
Supplementary Text
Figs. S1 to S34
Tables S1 to S25
Contents
Materials and Methods
1. Samples
2. Permission and permits for sampling and export and sampling procedure
3. aDNA laboratory procedures
4. aDNA data processing
4.1. Initial data processing
4.2. Authentication of DNA sequence data and estimation of mitochondrial contamination
Supplementary Text
5. Uniparental markers
5.1. Y-chromosomes
5.2. Mitochondrial DNA
6. Population Structure and Admixture
6.1. Comparative Data
6.2. Principal Component Analysis
6.3. Cluster analysis
6.4. Formal tests of admixture and fractions of admixture
6.5. Admixture graphs
6.6. Admixture dating
6.7. Presence of archaic admixture in current-day Khoe-San
7. Diversity estimates and demographic inferences
7.1. Diversity estimates - Heterozygosity
7.2. Diversity estimates - Runs of Homozygosity
7.3. Demographic inferences - MSMC
8. Dating of population split times (G-PhoCS)
8.1. Data
8.2. Split times (Tau) in modern humans
8.3. Ne (Theta) in humans
8.4. Split times to Neandertals
9. Estimations based on sample configuration frequencies
9.1. Inference under a split model with pairwise sampling – the TT method
9.2. Split time estimates
9.3. Branch specific drift
9.4. Other measures of drift (FST and outgroup f3)
9.5. Testing drift in admixture-masked Ju|’hoansi
10. Genomic regions of interest and selection
10.1. Variants of specific phenotypic interest
Materials and Methods
1. Samples
Overview and geography: We investigate the skeletal material from seven individuals excavated in the
KwaZulu-Natal Province along the east coast of South Africa (Figure S1). The Ballito Bay A, Ballito Bay B
and Doonside individuals were retrieved from the shoreline near the towns of Ballito Bay and Doonside. The
skeletal material from Eland Cave and Champagne Castle were excavated from caves in the uKhahlamba-
Drakensberg mountain range, and the Mfongosi and Newcastle individuals are from inland KwaZulu-Natal.
Five of the individuals were AMS dated in this study and three were dated previously (Ballito Bay B dated
twice, see text below) (15). All conventional radiocarbon dates were modelled using OxCal v.4.2 and
SHCal13 calibration curves (26, 27).
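The calibrated ages in this section are given in cal BP, where "present" is fixed at 1950 CE by convention. As a minimal sketch of that convention only (it does not reproduce the OxCal/SHCal13 calibration itself), the hypothetical helper `cal_bp_to_calendar` below converts a cal BP value to a calendar-year label, taking into account that there is no year zero between 1 BCE and 1 CE.

```python
def cal_bp_to_calendar(bp):
    """Convert a calibrated age in cal BP (present = 1950 CE) to a label.

    Illustrative helper, not part of the original study.
    """
    year = 1950 - bp  # astronomical year numbering; year 0 is 1 BCE
    if year > 0:
        return f"{year} CE"
    return f"{1 - year} BCE"


if __name__ == "__main__":
    # The 2-sigma range reported for Ballito Bay A.
    print(cal_bp_to_calendar(1986), "to", cal_bp_to_calendar(1831))
```

For example, the Ballito Bay A range of 1986-1831 cal BP corresponds to 37 BCE to 119 CE, and the Eland Cave range of 533-453 cal BP to 1417-1497 CE.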
Ballito Bay A (BBayA): The skeletal remains of Ballito Bay A belong to a juvenile individual dated
to 1986-1831 cal BP (95% probability) (1980+/-20 BP, Pta-5796) (15). The remains were excavated by
Schoute-Vanneck and Walsh during the 1960s (28), first curated at the Durban Museum, and then transferred
to the KwaZulu-Natal Museum where they are now curated (accession no. 2009/007). The site from which they were
retrieved is said to have been a mound formed by a shell midden overlooking the beach, about 46 m from the
high water mark. The skeletal material cannot be directly associated with archaeological material from the
site, as clear stratigraphic context is unknown, but the unpublished archaeology includes Early Iron Age
pottery (28). Ribot et al. (15) performed stable isotope analyses and AMS radiocarbon dating. They placed
this specimen, together with other individuals, in the cultural context of Later Stone Age populations, with δ13C
and δ15N values (-14.4‰ and 11.8‰ respectively) that indicate a diet with a sizable seafood intake
(15).
Ballito Bay B (BBayB): The well-preserved Ballito Bay B remains belong to an adult male and were
retrieved from a shell midden context about 46 m from the shore near Ballito Bay. The remains, which were
discovered by Schoute-Vanneck and Walsh during the 1960s (29), were first curated at the Durban Museum,
and then transferred to the KwaZulu-Natal Museum where they are now curated (accession no. 2009/008.001 and
2009/008.002). The skeletal material cannot be directly associated with the archaeological material at the site
as clear stratigraphic context is unknown. However, the unpublished archaeology includes some stone
artefacts and Early Iron Age pottery. Ribot et al. (15) performed craniometric analyses, stable isotope
analyses and AMS radiocarbon dating. They placed this individual in the cultural context of Later Stone Age
populations because of morphological similarities to pre-2 kya foragers and to modern groups with Khoe-San
ancestry, and because of his large seafood intake (δ13C -13.6‰ and δ15N 13.3‰) (15). More detailed
analyses of the well-preserved skull showed that this individual had a non-lethal cranial injury inflicted by a
sharp stone flake (30). We AMS dated a humerus from this individual to 2149-1932 cal BP (95% probability)
(2110+/-30 BP, Beta-398217) and obtained a stable carbon isotope value (δ13C -13.7‰) similar to that of Ribot et
al. (2010). However, a previous radiocarbon date of 3209-2880 cal BP (95% probability)
(2940 ± 50 BP, Pta-5803) is also available (15). As the genomic data we have produced indicate that the bones and teeth that
we have analyzed belong to the same individual (see section 4.1), we use the radiocarbon date obtained by
us.
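To illustrate how far apart the two conventional radiocarbon determinations for Ballito Bay B are, the sketch below applies a standard chi-squared consistency test for radiocarbon determinations (in the style of Ward and Wilson's pooled-mean test). This is an assumption-labelled illustration and not a procedure the authors report using; the function name `pooled_mean_and_t` and the constant `CHI2_CRIT_DF1` are hypothetical, and the test is applied to the uncalibrated ages quoted above.

```python
# 5% critical value of the chi-squared distribution with 1 degree of freedom.
CHI2_CRIT_DF1 = 3.841


def pooled_mean_and_t(dates):
    """Return the error-weighted mean age and the test statistic T.

    dates: list of (conventional_age_bp, one_sigma_error) tuples.
    T is compared against a chi-squared distribution with
    len(dates) - 1 degrees of freedom.
    """
    weights = [1.0 / sigma**2 for _age, sigma in dates]
    mean = sum(w * age for w, (age, _s) in zip(weights, dates)) / sum(weights)
    t_stat = sum((age - mean) ** 2 / sigma**2 for age, sigma in dates)
    return mean, t_stat


if __name__ == "__main__":
    # Beta-398217 (this study) and Pta-5803 (previous date).
    bbayb_dates = [(2110, 30), (2940, 50)]
    mean, t_stat = pooled_mean_and_t(bbayb_dates)
    consistent = t_stat <= CHI2_CRIT_DF1
    print(f"pooled mean = {mean:.1f} BP, T = {t_stat:.1f}, consistent = {consistent}")
```

With these two determinations T is roughly 203, far above the critical value of 3.841, so the dates cannot be treated as measurements of the same age. That is why a choice between them is needed rather than an average.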
Doonside (DOO): The Doonside individual was morphologically determined to be an adult female (the
genetic data were insufficient to call the biological sex), is dated to 2296-1910 cal BP (95% probability)
(2110+/-50 BP, Pta-5800), and, based on isotope values, consumed much seafood (δ13C -13.8‰ and δ15N
14.2‰) (15). Cranio-morphologically, the individual shows the extreme flatness and shortness of face
characteristic of Khoe-San populations; it falls outside the 90% confidence ellipse of Khoe-San
variation but remains closest to Khoe-San according to squared Euclidean distances (15). The
remains were initially curated at the Durban Museum, then transferred to the KwaZulu-Natal Museum where
they are now curated (accession no. 2009/010). Apart from having been found near Doonside on the KwaZulu-Natal
coast, there is no additional information about their discovery, and no associated archaeology was
recorded.
Eland Cave (ELA): The remains of the Eland Cave individual comprise a complete left first rib, one
right rib that had been fractured postmortem, left first metatarsal and the distal half of the left tibia. Its sex
could not be determined morphologically due to the absence of the cranium, pelvis and long bones, but she
was determined to be female using genetics (this study). The distal epiphysis of the tibia was completely
closed, suggesting that she was older than 20 years when she died. Other morphological criteria are consistent
with an age estimate of 50+ years. The individual was discovered by Lombard (the then landowner) in 1926
at Eland Cave (previously known as Lombard Cave) in the uKhahlamba-Drakensberg Mountains together
with what appeared to be hunter-gatherer artefacts, which were acquired and curated by the KwaZulu-Natal
Museum (accession no. 1925/037). The skeletal material cannot be directly associated with archaeological
material from the site, as its stratigraphic context is unknown, but the published (31) and unpublished (32)
archaeology include a complete hunter-gatherer bow-and-arrow kit, a bone/ivory arm ring, stone artefacts
associated with the Smithfield variation of the final Later Stone Age, and a few pottery sherds associated
with Iron Age farmers. The site is also associated with extensive rock paintings, which today form part of the
uKhahlamba/Drakensberg UNESCO World Heritage Site (33, 34). We have directly AMS dated the
specimen to 533-453 cal BP (95% probability) (480+/-30 BP, Beta-398219) using a tibia, and obtained a
δ13C value of -11.7‰.
Mfongosi (MFO): The human remains from the Mfongosi individual consist of the mandible, frontal
bone, left and right parietal bones and disarticulated occipital bone of the cranium. The left humerus, left
femur and left tibia were also present, but the proximal and distal epiphyses were damaged postmortem.
Morphological features indicate a female individual, which is consistent with the DNA results. No age
estimate could be made based on morphology, but the long bones appeared to be adult size and all teeth
present were permanent with moderate to advanced dental wear, suggesting that this was not a young adult
individual. The remains were discovered in the Tugela River valley, and excavated by Jones (the then
landowner) from a grave in which the body was buried in a flexed position. They were presented to the KwaZulu-
Natal Museum in 1932, where they are now curated (accession no. 1925/036.002). Material associated with the
grave included some pottery sherds, the tops of two carved and perforated bone pendants and twelve bone
bead fragments (35), which is consistent with Iron Age farmer material culture. We have directly AMS dated
the individual to 448-308 cal BP (95% probability) (360+/-30 BP, Beta-398220) using a femur and obtained
a δ13C value of -9.4‰.
Newcastle (NEW): The human remains from the Newcastle individual comprise a fragmented right
parietal bone, left temporal bone, left inferior-lateral orbital rim fragment, right superior-lateral orbital rim
fragment, both claviculae, both scapulae, left and right os coxae, both patellae, right fibula, four cervical
vertebrae, four thoracic vertebrae, five lumbar vertebrae, manubrium, six right ribs, seven left ribs, one rib
fragment, 10 foot bones along with 10 metatarsals and 10 foot phalanges, nine metacarpals and eight hand
phalanges, as well as three hand bones. Morphological features suggest a female individual, which was
confirmed by the DNA analysis. The presence of a well-developed preauricular sulcus suggests that this
woman had borne at least one child. Vertebral osteophytes were noticed on the lumbar, thoracic and cervical
vertebrae, with the lumbars most severely affected, and together with other features, the remains seem to
indicate a middle-aged female (probably around 40 to 60 years) of short stature. The remains were discovered
by employees of the Drakensville Berg Resort at Oliviershoek near Newcastle in a disturbed grave, from
which the skull and most of the long bones were removed. Van de Venter and Van Heerden (36) excavated
the remaining bones in 2002, and they are now curated at the KwaZulu-Natal Museum (accession no.
2007/006.001). Associated archaeological material includes two burial stones, one of which is a later Iron
Age lower grinding stone. Other archaeological findings close by, but not directly associated with the grave,
include stone walling, a stone cairn, some late white rock art, hunter-gatherer rock paintings and some Middle
and Later Stone Age artefacts (36, 37). We have directly AMS dated the individual using the petrous portion
of a temporal bone fragment to 508-327 cal BP (95% probability) (430+/-30 BP, Beta-398221), and obtained
a δ13C value of -7‰.
Champagne Castle (CHA): The Champagne Castle remains include a complete cranium and mandible,
right humerus, right ulna, right radius, left pubic symphysis, right os coxa (damaged postmortem), one
cervical vertebra and a few small fractured bone pieces. The presence of a preauricular sulcus and very wide
greater sciatic notch indicates a female, which is confirmed by the DNA analysis. The presence of this
preauricular sulcus indicates that this individual had most probably borne at least one child during her lifetime.
Based on a range of morphological criteria, it is concluded that the remains are most likely those of a young
adult female (age estimated to be 20 to 30 years) with an estimated stature of about 154 cm. She has a fracture
of her right parietal bone which most probably occurred around the time of death, as is evidenced by a green
bone response. This fracture possibly reflects an episode of interpersonal violence, but it cannot be
determined if this traumatic injury was the actual cause of death in this case. The skeleton was excavated by
Albino in 1945 from Champagne Castle in the uKhahlamba-Drakensberg Mountains and is curated in the
KwaZulu-Natal Museum (accession no. 2009/023). The archaeology of the site includes two occupation
layers: an upper layer associated with the Iron Age, and a lower layer associated with the Later Stone Age
with stone artefacts ascribed to both the Smithfield and Wilton Industries (38). We obtained a direct AMS
date from a femur of 448-282 cal BP (95% probability) (310+/-30 BP, Beta-398218) for the specimen, and a
δ13C value of -11.7‰.
2. Permission and permits for sampling and export and sampling procedure
Permission was obtained for the sampling of the specimens under curatorial supervision from the
Council of the KwaZulu-Natal Museum in a letter from the Assistant Director, Human Sciences, Dr. Carolyn
Thorpe. A sampling permit (no 0014/06) was issued to Marlize Lombard under the KwaZulu-Natal Heritage
Act No. 4 of 2008 and Section 38 (1) of the National Heritage Resources Act No. 25 of 1999. Also under the
latter legislation, permits were issued by the South African Heritage Resources Agency (SAHRA) for the
destructive sampling and ancient DNA analyses at Uppsala University, Sweden (permit no 1939), and for
sending samples for radiocarbon dating to Beta Analytic, England (permit no 1940). Final reports on the
sampling and dating have been submitted to both heritage agencies.
The skeletal material for this paper is curated at the KwaZulu-Natal Museum in Pietermaritzburg, South
Africa. The material was provided by the Museum research technician Mudzunga Munzhedzi and the
sampling strategy for each specimen was discussed with Dr. Carolyn Thorpe and Dr. Gavin Whitelaw prior
to sampling. The sampling for DNA and radiocarbon dating was done by HM and AC on location in October
2014 and a portable ancient DNA laboratory was set up in a separate room at the Museum. Three samples
were taken from each individual (or museum accession number), the majority of which were from different
bone elements for ancient DNA analyses. The bone elements were UV irradiated (254 nm) for 30 minutes to
one hour per side and stored in plastic zip-lock bags until sampled. Further handling of the specimens was
done in a bleach-decontaminated (DNA-away, ThermoScientific) enclosed sampling tent with adherent
gloves (Captair Pyramide portable isolation enclosure, Erlab). Teeth were wiped with 0.5% bleach (NaOCl)
and UV-irradiated sterile water (HPLC grade, Sigma-Aldrich). The outer surface was removed by drilling at
low speed using a portable Dremel 8100, and between 60 and 200 mg of bone powder was sampled for DNA
analyses from the interior of the bones and teeth. All plastics and equipment used had been decontaminated
with DNA-away and/or UV irradiation prior to their use. The researchers wore full-zip suits with caps, face-
masks with visors and double latex gloves and the tent was frequently cleaned with DNA-away during
sampling. Five of the individuals were sampled for AMS radiocarbon dating either through cutting off a small
piece of bone (1.8-4 cm) or through drilling out bone powder (600-750 mg). The sampled bone elements
were directly returned to the Museum and the ancient DNA samples and radiocarbon samples were
transported to the Ancient DNA Laboratory at Uppsala University, Sweden. The radiocarbon samples were
sent to Beta Analytic for AMS dating.
3. aDNA laboratory procedures
The 1.5 ml tubes containing the bone powder samples were thoroughly wiped with DNA-away before
they were taken into the dedicated ancient DNA clean room facility at Uppsala University. The laboratory is
equipped with, among other things, an air-lock between the lab and corridor, positive air pressure, UV lamps
in the ceiling (254 nm) and HEPA-filtered laminar flow hoods. The laboratory is frequently cleaned with
bleach (NaOCl) and UV irradiation, and all equipment and non-biological reagents are regularly
decontaminated with bleach and/or DNA-away (ThermoScientific) and UV irradiation.
DNA was extracted from between 60 and 190 mg of bone powder using silica-based protocols, either
as in Yang et al. (39) with modifications as in Malmström et al. (40) or as in Dabney et al. (41), and was
eluted in 50-110 μl Elution Buffer (Qiagen) (Table S1). The collected bone powder was in some cases
subdivided to enable more than one extraction from each original tube. Between 3 and 6 DNA extracts were
made for each individual (or accession number) and one negative extraction control was processed for every
4 to 7 samples extracted. Indexed DNA libraries were prepared from 20 μl of extract using either a blunt-end
protocol and P5 and P7 adapters as in Meyer and Kircher (42) and Günther et al. (43) with the shearing step
omitted or with a “damage-repair” protocol that repairs post-mortem deaminated sites using
Uracil-DNA-glycosylase (UDG) and endonuclease VIII (endo VIII) (44) (Table S1). Between one and five libraries were
prepared from each DNA extract and one negative library control was processed for every 6-8 ancient DNA
libraries.
The optimal number of PCR cycles to use for each library was determined using quantitative PCR
(qPCR) in order to see at what cycle a library reached the plateau (where it is saturated) and then deducting
three cycles from that value. The 25 µl qPCR reactions were set up in duplicates and contained 1 µl of DNA
library, 1X Maxima SYBR Green Mastermix and 200 nM each of the IS7 and IS8 primers (42), and were
amplified according to supplier instructions (ThermoFisher Scientific). Each library was then amplified in
four or eight reactions using between 12 and 21 PCR cycles. One negative PCR control was set up for every
four reactions. Blunt-end reactions were prepared and amplified as in Günther et al. (43) using IS4 and index
primers from Meyer and Kircher (42). Damage-repair reactions had a final volume of 25 μl and contained 4
μl DNA library and the following in final concentrations: 1X AccuPrime Pfx Reaction Mix, 1.25 U
AccuPrime DNA Polymerase (ThermoFisher Scientific) and 400 nM each of the IS4 primer and index primer
(42). Thermal cycling conditions were as recommended by ThermoFisher with an annealing temperature of
60 °C (42).
Because the femur from Ballito Bay B yielded low amounts of endogenous human DNA, one blunt-end
library was enriched using Mybait Human Whole Genome Capture Kit (MYcroarray) following the
manufacturer's instructions (Mybaits manual version 2.3.1) and amplified as above. For each library, four
reactions with identical indexing primers were pooled and purified using AMPure XP Beads (Agencourt).
The resulting libraries were quantified either on a TapeStation using a High Sensitivity kit (Agilent
Technologies) or using a Bioanalyzer 2100 and a High Sensitivity DNA chip (Agilent Technologies). The
negative controls processed did not yield any DNA and were therefore not sequenced. The DNA libraries
were sequenced at SciLife Sequencing Centre in Uppsala using either Illumina HiSeq 2500 with v2 paired-
end 125 bp chemistry or HiSeq XTen with paired end 150 bp chemistry. The initial strategy was to screen
the DNA extracts to evaluate the endogenous ancient human DNA content by building blunt-end libraries
and sequencing each library on either a 1/10th of a HiSeq 2500 lane or on a 1/20th of a HiSeq XTen lane.
Additional blunt-end or damage-repair libraries were then built and sequenced and high-quality libraries were
sequenced to completion (up to 97% clonality) while libraries with low endogenous contents were sequenced
to a lesser extent (average 36% clonality over all libraries).
4. aDNA data processing
4.1. Initial data processing
Adapters were trimmed and the paired-end reads of each library were merged if the two reads overlapped
by at least 11 base pairs, using the script MergeReadsFastQ_cc.py (45). Bwa aln 0.7.13 (46) was then used to
map them as single end reads to the human reference genome (hg18 and hg19). Non-default parameters for
bwa were -l 16500 -n 0.01 -o 2 (47, 48). Reads with less than 10% mismatches to the human reference
genome and longer than 35 base pairs were retained for further analysis. To determine biological sex we
implemented the method described in Skoglund et al. (49, 50). It uses reads with a mapping quality of at least
30 and calculates the ratio of reads mapping to the Y chromosome to the total number of reads mapping to
either the X or the Y chromosome.
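The read-count ratio described above can be sketched as follows. This is a minimal illustration, not the published script: the function name, the normal-approximation confidence interval and the two cutoffs (`male_min`, `female_max`) are assumptions chosen for the example and should be checked against Skoglund et al. (49, 50) before use.

```python
import math

def ry_sex(n_x, n_y, male_min=0.075, female_max=0.016, z=1.96):
    """Return (Ry, ci_low, ci_high, call) from counts of reads on X and Y.

    Ry = n_y / (n_x + n_y). The cutoffs are illustrative assumptions,
    not values taken from this document.
    """
    total = n_x + n_y
    if total == 0:
        raise ValueError("no reads mapped to X or Y")
    ry = n_y / total
    # Normal approximation to the binomial standard error of a proportion.
    se = math.sqrt(ry * (1.0 - ry) / total)
    low, high = ry - z * se, ry + z * se
    if low > male_min:
        call = "XY"
    elif high < female_max:
        call = "XX"
    else:
        call = "undetermined"
    return ry, low, high, call
```

With few reads the interval is wide and the call stays "undetermined", which is the conservative behaviour one would want for low-coverage libraries.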
Sequence data were then merged on library level to ensure maximal retention of reads using samtools
merge tool (v.0.1.19) (51) before removal of PCR duplicates. Reads with identical start and end positions
were identified as PCR duplicates and collapsed using a modified version of FilterUniqSAMCons_cc.py (45)
which ensures the random assignment of bases in a 50/50 case. Non-UDG and UDG-treated libraries were
then merged per individual. Eight individuals were processed and four of these individuals, one male and
three females, had an estimated genome coverage over 6x (Table S2). The sequence data generated from the
Ballito Bay B remains with accession nos. 2009/008.001 and 2009/008.002 were initially treated as two
separate individuals. To investigate whether these two bone fragments belonged to the same individual and/or
if they were related to the other ~2,000 year old coastal samples, we analyzed baa001, bab001, bab002 and
doo001 using READ (52). READ calculates the proportion of non-matching alleles inside non-overlapping
windows and then classifies each pair of samples as unrelated, second-degree relatives, first-degree relatives,
or identical individuals or twins. The bab002-dr and doo001-dr data did not contain enough overlapping sites to estimate kinship.
All other comparisons were classified as unrelated except bab001 and bab002, which were identified as the
same individual or identical twins (Table S3).
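The windowed comparison that READ performs can be illustrated with a short sketch. It is not the READ implementation: the function names are hypothetical, positions are assumed to lie on a single chromosome, the 1 Mbp window is an assumed default, and the normalisation value and cutoffs passed to `classify_pair` are illustrative assumptions rather than values stated in this document.

```python
def window_mismatch(calls_a, calls_b, positions, window=1_000_000):
    """Proportion of non-matching alleles per non-overlapping window.

    calls_a / calls_b hold one pseudo-haploid allele per site (None = missing).
    """
    counts = {}  # window index -> [mismatches, compared sites]
    for pos, a, b in zip(positions, calls_a, calls_b):
        if a is None or b is None:
            continue  # only sites covered in both samples are compared
        tally = counts.setdefault(pos // window, [0, 0])
        tally[0] += int(a != b)
        tally[1] += 1
    return [mism / n for mism, n in counts.values()]

def classify_pair(p0_values, unrelated_p0, cutoffs=(0.625, 0.8125, 0.90625)):
    """Classify a pair from its mean mismatch rate, normalised by the rate
    expected for an unrelated pair (unrelated_p0). Cutoffs are assumptions."""
    if not p0_values:
        raise ValueError("no overlapping windows")
    norm = (sum(p0_values) / len(p0_values)) / unrelated_p0
    if norm < cutoffs[0]:
        return "identical/twin"
    if norm < cutoffs[1]:
        return "first-degree"
    if norm < cutoffs[2]:
        return "second-degree"
    return "unrelated"
```

A pair with too few overlapping sites yields no windows at all, which mirrors the situation reported above for the low-coverage comparisons.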
For population genetic analyses, the ancient shotgun data were merged with comparative data sets (see
Section 6.1). For low coverage shotgun data, the SNPs in the ancient samples were called as follows: at each
SNP site, a random read with minimum mapping and base quality 30 was drawn and the allelic status at that
read was coded to be the hemizygous genotype of the individual (file-formats require diploid genotypes and
we use the homozygote code for the record, but the data are treated as hemizygous in all downstream analyses).
Sites showing additional alleles or indels were removed from the data. For non-UDG treated sequence data,
all transition sites were coded as missing data to avoid the effect of post-mortem damage. For the sequenced
individuals we had both UDG-treated and non-UDG treated libraries. At non-transition sites, a read from
either of the two library types was randomly sampled, and for transition sites only reads from
damage-repaired libraries were sampled.
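The random-read calling rule described in this paragraph can be sketched as below. The read representation (tuples of base, base quality, mapping quality and a UDG flag) and the function name are assumptions made for the illustration; the actual pipeline operates on BAM files.

```python
import random

TRANSITIONS = {frozenset("CT"), frozenset("AG")}

def call_site(ref, alt, reads, rng, min_q=30):
    """Return a pseudo-haploid call coded as a homozygous genotype, or None.

    reads: list of (base, base_quality, mapping_quality, udg_treated).
    """
    is_transition = frozenset((ref, alt)) in TRANSITIONS
    usable = [
        r for r in reads
        if r[1] >= min_q and r[2] >= min_q
        # at transition sites, only damage-repaired (UDG) reads are sampled
        and (r[3] or not is_transition)
    ]
    if not usable:
        return None  # coded as missing data
    if any(r[0] not in (ref, alt) for r in usable):
        return None  # site shows an additional allele and is dropped
    base = rng.choice(usable)[0]
    return base + base  # homozygote code for a hemizygous call
```

Passing an explicit `random.Random` instance (for example `random.Random(1)`) keeps the draw reproducible between runs.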
The three high coverage ancient individuals used in this study (BBayA, ELA, and NEW) as well as
high-coverage reference ancient individuals (Mota and LBK) were subjected to diploid genotype calling. We
restricted the genotype calling for BBayA to UDG-treated libraries. Base qualities of all Ts in the first five
base pairs of each read and all As in the last five base pairs were set to 2. Picard
(http://broadinstitute.github.io/picard/) was used to add read groups to the files and indel realignment was
conducted using GATK v3.5.0 (53) and the indels from phase 1 of the 1000 genomes project as references
(54). Diploid genotypes were called with GATK’s UnifiedGenotyper and the following parameters:
-stand_call_conf 50.0, -stand_emit_conf 50.0, -mbq 30, -contamination 0.02 and --output_mode
EMIT_ALL_SITES, using dbSNP version 142 as known SNPs. UnifiedGenotyper uses a Bayesian genotype
likelihood model to estimate the most likely genotypes at each position of the genome. The confidence of
each individual call depends on the allele(s) observed in the sequencing reads, how often they occur relative
to the total depth at the site, whether they match the reference genome’s allele or alleles known to dbSNP as
well as the respective base and mapping quality. Vcftools (55) was used to extract the relevant SNP positions
from the VCF if they were not flagged as low quality calls. The alleles from the non-UDG treated Mota were
set to missing data for all transition sites and the different data sets were merged using Plink v1.9 (56).
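The base-quality rescaling applied before diploid calling (Ts in the first five positions and As in the last five positions set to quality 2) can be sketched as follows. The function name is hypothetical, and the sketch assumes the read is given in its original 5' to 3' orientation; strand handling for reads stored reverse-complemented in a BAM file is not covered.

```python
def mask_damage(seq, quals, n=5, low_q=2):
    """Return a copy of quals with likely deamination positions down-weighted.

    T within the first n bases and A within the last n bases get quality low_q,
    so that a quality-aware genotype caller effectively ignores them.
    """
    out = list(quals)
    length = len(seq)
    for i, base in enumerate(seq):
        if i < n and base == "T":
            out[i] = low_q
        if i >= length - n and base == "A":
            out[i] = low_q
    return out
```

For example, `mask_damage("TACGTTACGA", [30] * 10)` lowers the qualities at indices 0, 4, 6 and 9 and leaves the T at index 5 untouched, because it lies outside the first five positions.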
4.2. Authentication of DNA sequence data and estimation of mitochondrial contamination.
Ancient DNA sequences have a high frequency of cytosine to thymine (C to T) transitions at the 5’ ends
and of guanine to adenine (G to A) transitions at the 3’ ends due to post-mortem deamination (57). Figure S2 shows these
typical ancient DNA damage patterns for the non-damage-repaired libraries.
We investigated potential mitochondrial contamination for all samples using the approach of (58) that
utilizes private or near-private consensus alleles in modern-day individuals (<5% in 311 modern mtDNAs),
and bases with mapping quality of 30 or higher, as well as a coverage of at least 10x for the ancient DNA
data. Positions with a consensus allele of either C or G and where a transition substitution was detected were
filtered out to avoid postmortem damage. To obtain a contamination estimate, the counts of consensus and
alternative alleles were added together across all sites (58). The mitochondrial contamination estimates were
less than 4.5% for all ancient individuals (Table S4).
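The pooling step described above (adding consensus and alternative allele counts across all informative sites) gives a simple point estimate, sketched here. The function name and input layout are assumptions for the illustration, and the interval estimation used in (58) is not reproduced.

```python
def contamination_estimate(site_counts):
    """Point estimate of contamination from per-site read counts.

    site_counts: list of (consensus_reads, alternative_reads), one entry per
    informative site that passed the quality, coverage and damage filters.
    """
    site_counts = list(site_counts)
    consensus = sum(c for c, _ in site_counts)
    alternative = sum(a for _, a in site_counts)
    total = consensus + alternative
    if total == 0:
        raise ValueError("no informative reads")
    # Reads carrying the non-consensus allele are attributed to contamination.
    return alternative / total
```

For three sites with counts (98, 2), (49, 1) and (100, 0), the estimate is 3 / 250 = 1.2%.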
To estimate errors in the ancient samples coming from sequencing errors, mapping errors and chemical
modifications of bases, we used ANGSD’s (59) error estimation procedure that utilizes an out-group
individual (chimpanzee mapped against hg19) and an ad hoc “error-free” individual. To generate an “error-free”
individual, sequence reads with a mapping quality higher than 35 from a 1000 Genomes (60) CEU male,
NA12342, were used. By comparing the number of derived alleles in the samples, in relation to the “error-free”
individual and the ancestral state, a relative error for each test individual can be calculated. All sequence
reads were used, but only sites where ancestral, “error-free” sample, and the target sample have a coverage
of ≥1x with a base quality higher than 30 were used for computing the error rate (Fig S3). The error rate of
Ballito Bay A (0.1%) is on par with previous good-coverage damage repaired data, such as the Loschbour
individual from Lazaridis et al. (48). The overall error rate for the ancient individuals (with UDG treated
sequence data) is around one false positive variant in a thousand called variants, about twice as large as for
modern-day DNA samples that show just over one in 2,000 called variants.
Supplementary Text
5. Uniparental markers
5.1. Y-chromosomes
Samtools v.1.3 (51) mpileup was used to call single base substitutions from Phylotree (version of
09/03/2016; (61)) from bam files mapped to hg19 (UDG treated data only). Sites with mapping quality and
base quality of at least 30 were extracted. Insertions, deletions and sites with chimeric alleles were excluded.
Transition sites and A>T and G>C SNPs were kept, to maximize the number of haplogroup defining
substitutions. All derived states in the hierarchical phylogeny, as well as all ancestral states downstream of the
last of the derived alleles within the branch, are reported to show the certainty of the haplogroup call.
Additionally, we double-checked that there were no ancestral alleles upstream (in the hierarchical phylogeny)
of the defined haplogroup that would contradict the call. The nomenclature of the International Society of
Genetic Genealogy (ISOGG) version 11.224 (http://isogg.org) was used. The definitions in the minimal
reference phylogeny of Phylotree (http://www.phylotree.org/Y/tree/) were used for sites not present in
ISOGG.
The Ballito Bay A boy belongs to Y-chromosomal haplogroup A1b1b2, as supported by 12 derived
allele states (Table S5). The further downstream subtype is, however, unclear as two additional sites
displayed derived substitution for M51 and M118 defining A1b1b2a and A1b1b2b1, respectively. We also
note that this individual is ancestral for M13 and M201, both defining haplogroup A1b1b2b. As ancient
individuals may belong to branches not found among extant populations (62), Ballito Bay A could possibly
represent an ancient hitherto unknown sub-branch of A1b1b2. Some additional sites displayed derived alleles
that do not fit within the A1 phylogeny and they were cross-checked against an updated version of ISOGG
(version 11.325, updated 19 November 2016) (the minimal reference phylogeny in Phylotree had not been
updated since our last check). These were M236 (G>C) (B1 according to ISOGG), M10072 (G>A) (S
according to ISOGG and M2 according to Phylotree), CTS4385 (A>T) and R-Y40 (C>T) (R1a according to
Phylotree, marker not present in ISOGG). However, as they were sporadically scattered over the phylogeny,
and as multiple upstream sites displayed ancestral states, they may be false positives resulting from strand
misidentifications, sequencing errors or postmortem deaminations (63).
Similar to Ballito Bay A, the Ballito Bay B male belongs to haplogroup A1b1b2, and likely even to
A1b1b2b1 (Table S6). Seven markers displaying derived alleles support the former as does A1b1b2b1-M118
for the latter. No ancestral sites were found downstream of A1b1b2b1. Two additional sites displayed derived
alleles, namely M236 and M10072. These were also observed in Ballito Bay A; see above for possible
explanations of these discrepant alleles.
Haplogroup A is the oldest Y-chromosomal lineage with an estimated age of circa 150,000 years (64).
The sub-haplogroup A1b1b2a (A-M51, previously known as A3b1), is together with A1b1a (A-M14,
previously known as A2), common among southern African Khoe-San populations while being rare in Bantu-
speaking populations (10, 65, 66). The only other ancient Y-chromosomal data available to date from Africa,
is the 4,500-year-old Mota hunter-gatherer from Ethiopia, who belonged to haplogroup E1b1 (67).
5.2. Mitochondrial DNA
Consensus sequences were generated using samtools’ mpileup and vcfutils.pl (and vcf2fq) (v0.1.19,
(51)) coupled with ANGSD (59). A minimum base quality and mapping quality score of 30 and a coverage
of at least three sequence reads were used to call the consensus sequences. Haplogroups were assigned to the
sequences using HaploFind (68) and PhyloTree mtDNA Build 17 (18 Feb 2016) (69). Doonside had low
mtDNA coverage (2.6x) and the haplogroup was called manually from PhyloTree without restrictions on
coverage. The variants are reported against the Reconstructed Sapiens Reference Sequence, RSRS (70). The
mitochondrial coverage, haplogroups, variants supporting the called haplogroup and private variants are
reported in Table S7. There were a few regions where none of the consensus sequences had any data after
filtering. The majority of these positions are situated between the ND1 and CO3 genes and have previously
been reported as regions that are difficult to map when working with short sequence reads (71). We noted
that HaploFind was not well-adjusted to L0-lineages and therefore we manually curated our variant table to
fit the phylogeny in PhyloTree (Build 17, 18 Feb 2016). HaploFind assigned haplotypes correctly, but
reported that several variants were missing from the L0d-haplotypes although some of these missing variants
defined haplotypes within L1'2'3'4'5'6 and not within L0 (e.g. 146T, 182T, 10664C, 10915T, 11914G,
13276A, 16230A). These errors have been reported to HaploFind.
The mitochondrial genome of the Ballito Bay A boy (BBayA) has 40 variants leading to L0d2c1 (Table
S7). There are five other variants associated with this haplotype. Two of them were ancestral in this sequence
(BBayA lacked a deletion at np 498 and a transition at np 8251) and for the remaining three sites (nps 4204,
4232 and 7154), there were not enough high-quality sequence data. This individual has six additional variants
(T4312C, A4732G, T7256C, T8655C, G8701A and A16129G), that are present in between one and 18 other
haplotypes in Phylotree. It is not likely that these variants are caused by post-mortem deamination as i) the
majority of the data are based on UDG-treated DNA libraries in which the majority of these types of damages
are removed, and ii) only one of the sites involved a G to A transition.
The Ballito Bay B (BBayB) male belongs to L0d2a1 and displays 33 of 42 expected variants for this
haplotype (Table S7). Three of the sites have the ancestral allele (i.e. no deletion at np 498 and no transitions
at nps 7154 and 8392), while there were not enough data for the remaining six sites (nps 4025, 4044, 4225,
4232, 5153 and 6815). BBayB displays two additional variants; T310C and T16187C. The former has not
previously been found in any L-haplotypes but the latter is recurrently found within L-lineages, including in
L0d2a1b. BBayB is, however, ancestral for C463T and T7861C (which together with T16178C defines
L0d2a1b).
Due to the low coverage, the Doonside (DOO) consensus contained several sites displaying C to T
transitions likely caused by post-mortem damage and positions lacking data. We could not use HaploFind to
call the haplotype for this individual. Instead, we manually investigated which ancestral and derived states the
DOO mt data displayed for haplogroup defining positions, first within the L0 lineage, and then following
L1’2’3’4’5’6 lineage leading to all other non-L0 lineages. We conclude that DOO belongs to haplogroup
L0d2 as only derived states were present for defining positions leading to this haplogroup. The derived states
were G263A, C1048T, C3516A, T5442C, T6185C, C9042T, A9347G, A12720G (leading to L0);
G1438A, G8251A, T12121C, G15466A, G15930A, T15941C, T16243C (leading to L0d); A3756G,
G9755A, T16278C (leading to L0d1’2); and T11854C, A15766G (leading to L0d2) (Table S7). DOO further
displayed ancestral alleles for the downstream lineages L0d2a’b’d (16212A), L0d2a (12172A), L0d2b
(1386T, 9932G, 10084T, 16069C, 16169C), L0d2d (125T, 127T, 188A, 8434C, 9254A, 9476A,
10745C, 14094T), L0d2c (294T, 4937T, 6644C, 8420A, 9230T, 9305G, 13827A, 14007A, 15346G)
and L0d3 (721T, 1243T, 2755A, 5460G, 6377C, 8459A, 9027C, 9488C, 11061C, 13359G, 15236A,
15312T, 16290C, 16300A) with the exception of a few positions with derived alleles (G16390A found in
L0d2a, C152T found in L0d2b, C150T found in L0d2d and L0d3). It is highly unlikely that DOO would
belong to a non-L0 lineage as it displayed ancestral alleles for L1’2’3’4’5’6 (146C, 182C, 10664T, 13276G),
L2’3’4’5’6 (2758A, 2885C, 8468T) and L2’3’4’6 (195T, 247A, 10688A, 13105G, 13506T, 15301G,
16129A) and only derived states at seven positions (10915T, 16230A, 152T, 8655C, 10810T, 16187C,
16189T) which may largely be due to low read coverage at the positions combined with post-mortem damage.
As the DOO consensus is uncertain, only the derived states leading to L0d2 are reported in Table S7.
The Champagne Castle (CHA) female has 36 of 43 variants leading to L0d2a1a (Table S7). This
individual displayed the ancestral state for one variant (i.e. did not have a deletion at np 498, similar to BBayA
and BBayB) and the remaining six sites lacked high-quality sequence data (np 4025, 4044, 4225, 4232, 7154
and 8392). There were three additional variants present in this mitochondrial genome. The C11881T and
G15077A transitions are present in three other haplotypes, while the T16093C transition is highly recurrent,
and was present in over 50 haplotypes dispersed over the PhyloTree mitochondrial phylogeny.
The Eland Cave (ELA) female displays all 51 of the expected variants leading to L3e3b1 (Table S7).
This individual has two additional variants, G10373A and T15071C, which are also found elsewhere in the
phylogeny (in 12 other haplotypes and one other haplotype in PhyloTree, respectively).
The Mfongosi (MFO) female belongs to L3e1b2 (Table S7). This individual has 46 of the expected
variants for this haplotype but displays the ancestral state for the remaining two sites (i.e. does not have a
deletion at np 16325 and lacks the C16327T transition). In addition, there are four private variants (T310C,
T15115C, C16239T and C16519T) that are also present in different haplotypes in PhyloTree.
The Newcastle (NEW) female displays all 42 expected variants for L3e2b1a2. NEW has two additional
transitions, G4769A and T15721C, and an additional transversion mutation, A16183C.
The three approximately 2,000-year-old individuals from the coastal region of eastern South Africa,
Ballito Bay A, Ballito Bay B and Doonside belong to L0d lineages (L0d2c, L0d2a1 and L0d2, respectively).
The deepest split in the mitochondrial phylogeny is between L0 and L1’2’3’4’5’6 (which comprise all other
haplogroups) (72, 73). This lineage is highly divergent and common in the Khoe-San populations of southern
Africa (72, 74-77). The L0d2c haplogroup, found in the Ballito Bay A individual, is most common in present-
day Nama and ≠Khomani (12%-14%) from Namibia and South Africa, but it is also found at lower
frequencies in other Khoe-San populations and in Coloured populations (75) and has recently been identified
in some Bantu-speaking populations (73). Furthermore, a 2,330-year-old forager skeleton from St. Helena
Bay on the south west coast of South Africa, displays the sub-haplogroup L0d2c1c (78). L0d2a is more
frequently observed in present-day populations than L0d2c, and this haplogroup is carried by Ballito Bay B.
The highest frequency is found in the Karretjie People (60%), ≠Khomani (33%) and Nama (21%) (75). L0d2a
is further found in Bantu-speaker populations (12%), in Coloured populations and in ‘Baster’ (73, 75). This
potential Khoe-San maternal contribution into non-Khoe-San populations can be observed in one of the
younger, 300-500-year-old, individuals (the Champagne Castle female, who carries an L0d2a1a
mitochondrial lineage and an otherwise typical Bantu-speaker genomic signature).
Three of the younger individuals (dated to ~300-500 BP), Eland Cave, Mfongosi and Newcastle, belong
to L3e-lineages (L3e3b1, L3e1b2 and L3e2b1a2, respectively). L3 lineages are frequent in modern-day
individuals in East Africa and the L3e lineage, the most frequent of the L3 lineages, is common in
Central/West African groups, and has been suggested to have reached southern Africa with the Bantu
expansion ~1,800 years ago (79-82). These lineages are generally absent in Khoe-San populations, with the
exception of the Khwe, but are common among many present-day Bantu-speaking populations (75). The
4,500-year-old ‘Mota’ hunter-gatherer from Ethiopia also carries an L3-lineage, L3x2a (67).
6. Population Structure and Admixture
6.1. Comparative Data
Comparative SNP study data were downloaded for both Illumina and Affymetrix Human Origins SNP
platforms. The SNP sets of the two platforms were kept separate for analyses to maximize the SNP overlap
(aDNA data handling for merging with SNP data is described in section S4.1). The Illumina platform
southern African datasets containing Khoe-San and Bantu-speaker groups, typed on the 2.5 Omni array (6,
83), were merged with the data from the ancient individuals. In this southern African dataset 1,989,349 SNPs
were retained from the merge, and the number of SNPs for each of the ancient individuals is indicated in
Table S8. This dataset was merged with 6 additional populations (YRI, MKK, LWK, TSI, CEU, JPT) from
the 1000 genomes project (KGP) global dataset typed on the Illumina 2.5 Omni array (60). In this global
extended dataset 1,984,902 SNPs were retained and the number of SNPs of each ancient individual is
indicated in Table S8. To expand the modern-day East African representation of the dataset, we merged the
data with 6 additional populations (AMHARA, OROMO, ARI-BLACKSMITH, GUMUZ, SUDANESE,
SOMALI) from diverse East African groups, typed on the Illumina 1M Omni array (84). For this ‘East Africa
extended’ dataset, 527,131 SNPs were retained and the number of SNPs for each of the ancient individuals
is indicated in Table S8. All dataset merges were performed using Plink v. 1.9 (56), and A/T and
C/G SNPs were removed before merging the datasets. During merging, mismatching SNPs were strand-
flipped once and remaining mismatching SNPs were excluded. Only intersecting SNPs were kept after
merging. To include more African populations, but to retain high SNP density, we also merged 16 additional
populations from the African Genome Variation Project (85) with the Global Comparative dataset to form
the AGV comparative dataset and retained 1,421,001 SNPs (Table S8).
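The allele-harmonisation logic used when merging SNP datasets (dropping A/T and C/G SNPs, flipping strand once for mismatching SNPs, and excluding those that still mismatch) can be sketched per SNP as below. This only mimics the decision logic; it is not how Plink implements the merge, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(a1, a2):
    """A/T and C/G SNPs look identical on both strands."""
    return COMPLEMENT[a1] == a2

def harmonise(alleles_ref, alleles_new):
    """Return 'keep', 'flip' or 'exclude' for one SNP present in both datasets."""
    if is_strand_ambiguous(*alleles_ref) or is_strand_ambiguous(*alleles_new):
        return "exclude"  # strand cannot be resolved from the alleles alone
    ref, new = frozenset(alleles_ref), frozenset(alleles_new)
    if ref == new:
        return "keep"
    if ref == frozenset(COMPLEMENT[a] for a in alleles_new):
        return "flip"  # alleles agree after one strand flip
    return "exclude"  # still mismatching after the flip
```

Removing the strand-ambiguous SNPs first is what makes a single flip sufficient: for the remaining SNPs, the complement of an allele pair can never equal the pair itself.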
We downloaded the Affymetrix Human Origins fully public dataset as described in (48) from
(https://reich.hms.harvard.edu/datasets). This dataset also contained Khoe-San populations from (8). We
merged the ancient individuals with this dataset to form the Human Origins comparative dataset (548,476
retained SNPs).
Comparative full genome data, consisting of bam files of 11 HGDP samples (HGDP: 1 individual from
Dinka, Mbuti, French, Papuan, Sardinian, Han, Yoruba, Karitiana, San, Mandenka, and Dai populations)
were downloaded from (http://www.cbs.dtu.dk/suppl/malta/data/Published_genomes/bams/). The data were
originally generated by Meyer et al. (86) and the re-mapping and generation of the bam files was done and
described in Raghavan et al. (87). The 11 HGDP bamfiles were used and SNPs called individually for each
bamfile using the UnifiedGenotyper of GATK v. 3.2.0 (53). SNPs and indels were called separately. A standard
call confidence threshold of 30.0 was used, all sites present in the reference genome were emitted (not just variant sites)
and vcfs were extensively annotated (SpanningDeletions, Coverage, DepthPerAlleleBySample,
QualByDepth, FisherStrand, MappingQualityRankSumTest, ReadPosRankSumTest, GCContent,
HaplotypeScore, HomopolymerRun, TandemRepeatAnnotator, VariantType). After SNP calling we applied
a hard filter with the following criteria: “QD < 3.0 || FS > 20.0 || MQ < 55.0 || MQRankSum < -3 ||
ReadPosRankSum < -4.0 || SOR > 3.0 || HaplotypeScore > 5.0”.
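The hard filter above is a logical OR over seven annotation thresholds: a site is filtered if any single clause is true. A minimal sketch of that logic follows; the function name is hypothetical, and treating a missing annotation as "does not trigger the clause" is an assumption of this sketch rather than documented pipeline behaviour. The `mq_min` and `haplotype_score_max` parameters are exposed because the filter reported in this section for the archaic genomes differs in exactly those two thresholds.

```python
def fails_hard_filter(ann, mq_min=55.0, haplotype_score_max=5.0):
    """Return True if a site with annotations `ann` (name -> float) is filtered."""
    rules = {
        "QD": lambda v: v < 3.0,
        "FS": lambda v: v > 20.0,
        "MQ": lambda v: v < mq_min,
        "MQRankSum": lambda v: v < -3.0,
        "ReadPosRankSum": lambda v: v < -4.0,
        "SOR": lambda v: v > 3.0,
        "HaplotypeScore": lambda v: v > haplotype_score_max,
    }
    # Any single failing clause is enough to filter the site.
    return any(name in ann and rule(ann[name]) for name, rule in rules.items())
```

For instance, a site with MQ = 40 fails under the default thresholds but passes with `mq_min=30.0`.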
Additionally, the Neandertal and Denisova genomes were prepared for comparative data analysis:
Denisova (Published originally in Meyer et al. (86), remapped in Raghavan et al. (87) and obtained for this
study from http://www.cbs.dtu.dk/suppl/malta/data/Published_genomes/bams/), Neandertal (Published in
Prüfer et al. (14) and obtained from http://cdna.eva.mpg.de/Neanderthal/altai/AltaiNeanderthal/bam/). The
SNPs for Neandertal and Denisova were called similarly to the HGDP bams and the following hard filter was
applied: “QD < 3.0 || FS > 20.0 || MQ < 30.0 || MQRankSum < -3 || ReadPosRankSum < -4.0 || SOR > 3.0 ||
HaplotypeScore > 10.0”.
To be able to compare directly the ancient individuals to genome sequence data from a larger diverse
set of modern-day individuals (which is not affected by ascertainment bias present in SNP array genotype
data), we downloaded the called variants from the Simons Genome project (88)
(https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/phased_data/PS3_multisample_public/). These
genotype data were merged with the Ballito Bay A diploid called sites to be used for confirming results
obtained from SNP array data (that contain many more individuals compared to the genome sequence
datasets).
6.2. Principal Component Analysis
Principal Component Analysis (PCA) was done on haploidized versions of comparative datasets
(random draw of one of the alleles at each locus). PCA was performed using EIGENSOFT (89, 90) with the
following parameters: r2 threshold of 0.9, sample size limit of 20, 10 iterations of outlier removal.
The first PC of the southern African dataset (Figure S4) separates Khoe-San from Bantu-speakers and
the second PC separates southeast Bantu-speakers from southwest Bantu-speakers. The two older samples
(BBayA and BBayB) cluster with the current-day Khoe-San groups, while the four younger samples cluster
with southeast Bantu-speakers.
When adding comparative African and non-African groups from the KGP panel (Figure S5), PC1
separates non-Africans from Africans and PC2 separates West African origin populations from southern
African Khoe-San populations. On this PC, BBayA and BBayB form one extreme and the West African
Yoruba the other extreme. It appears that compared to BBayA and BBayB, all current-day Khoe-San groups
are shifted towards the non-African and other African extremes of the PCA. The four younger samples (ELA,
NEW, MFO, CHA) are located with southern African Bantu-speakers (between the southwestern and
southeastern Bantu-speakers), thus showing evidence of Khoe-San admixture (compared to Yoruba), but not
quite as much as most of the current-day southeastern Bantu-speakers. ELA appears to be the least admixed
and CHA the most.
To further clarify the association of samples with East and West Africans and northern and southern
Khoe-San, only one representative group each from East Africa (Maasai), West Africa (Yoruba), northern San
(Ju|’hoansi) and southern San (Karretjie People) was included in the PCA. In this analysis it appears that,
compared to BBayA, Ju|’hoansi is shifted towards East Africa, while some of the Karretjie individuals are
shifted towards East Africans and some towards West Africans (Figure S6). BBayA and BBayB appear to
cluster with southern San and not with northern San. The Khoe-San admixture in the younger samples also
appears to have come from southern San (Figure S7).
6.3. Cluster analysis
Admixture fractions were estimated using ADMIXTURE (91) in order to cluster individuals based on
SNP genotypes. Default settings and random seeds were used. Between 2 and 13 clusters (K) were tested. A
total of 50 iterations of ADMIXTURE were run for each value of K. Iterations for each K were analyzed
using CLUMPP (92) and the LargeKGreedy algorithm with 1,000 repeats to identify common modes among
replicates. Pairs of replicates yielding a symmetric coefficient G’>=0.9 were considered to belong to common
modes. The most frequent common modes were selected and CLUMPP was run a second time for all values
of K containing the most frequent common mode (LargeKGreedy algorithm, 10,000 repeats). The results
were visualized using DISTRUCT (93) (Figure S8).
Admixture analyses show that the three ~2000 year old individuals (BBayA, BBayB and DOO) cluster
with present-day Khoe-San groups (Figure S8, for K≥3), and specifically southern San groups (for K≥7). At
K=13, these individuals cluster with the Karretjie People (6, 94) and the Lake Chrissie San (CHR) (83). The
four ~300-500-year-old individuals (ELA, NEW, MFO and CHA) grouped with populations of West African
origin (Figure S8, for K≥3), and more specifically southeast Bantu-speakers from South Africa (for K≥12).
They have low, but clear signals of admixture with southern Khoe-San groups (Figure S9) and the admixture
is lowest for the oldest of the four individuals (ELA - 12%) and greatest for the youngest individual (CHA -
18%). By comparison, the level of admixture in current-day southeastern Bantu-speakers is 19% on average.
This observation is consistent with continuous admixture into Bantu-speakers from San groups; however, we
note that the time-serial sample is small. CHA also had an L0d mtDNA haplogroup. Although found in its
highest frequency in Khoe-San (75, 95), L0d occurs at levels of 20% to 40% in present day southeast Bantu-
speakers from South Africa (75).
6.4. Formal tests of admixture and fractions of admixture
f3 tests of southern African Iron Age farmers vs. western Bantu-speakers:
We performed an outgroup f3 analysis of the four southern African Iron Age farmers (CHA, NEW, MFO, ELA)
against the western Bantu-speakers (wBSP) of Patin et al. (2017) (16). The chimp reference genome was
employed as an outgroup and standard errors were estimated using a weighted block jackknife procedure
with blocks of 5Mbp. The most shared drift with the ancient Iron Age farmers in the wBSP was observed for
the southern wBSP (of northern Angola) (Figure S10). This is in accordance with the findings of Patin et al
(2017) and supports the “late-split” hypothesis of the Bantu-expansion (see Patin et al 2017 for a full
discussion of the “early” vs. “late split” linguistic hypothesis of the Bantu-expansion). Alternative
explanations of a more complex migration history of Bantu-speakers and interactions between the western
and eastern Bantu-speaking migration waves are also possible (96).
f3 tests of admixture into San:
To estimate whether the Ju|’hoansi received admixture from another population, we computed the f3
statistic (97) with Ju|’hoansi as the recipient population, the diploid Ballito Bay A as one source and other
populations of the East African extended dataset as the other source population (Table S9). Negative Z scores
were observed when any non-African or East African population (except the prehistoric Mota individual) was
used as the source population alongside Ballito Bay A. This shows admixture from either East Africans, Eurasians, or a
mixture of the two into the Ju|’hoansi.
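The logic of the f3 admixture test can be illustrated with a minimal sketch. It computes the uncorrected f3(target; source1, source2) from per-SNP allele frequencies; the function name and input layout are assumptions made for illustration, and the finite-sample correction and block-jackknife Z scores applied by ADMIXTOOLS are deliberately left out.

```python
def f3(target, source1, source2):
    """Uncorrected f3(target; source1, source2).

    Each argument is a list of per-SNP allele frequencies (0..1) for the
    same allele at the same sites. A clearly negative value indicates that
    the target is intermediate between the two sources, as expected for an
    admixed population. No finite-sample correction is applied here.
    """
    products = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(products) / len(products)


# Toy example: the target sits exactly halfway between the two sources.
example = f3([0.5, 0.5], [0.0, 1.0], [1.0, 0.0])
```

In the toy example every site contributes (0.5 - 0)(0.5 - 1) = -0.25, so the statistic is negative, which is the pattern reported for the Ju|’hoansi in Table S9.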
f4 ratio statistics:
In order to estimate the degree of back-admixture (from non-Africans) into African populations, we
followed the approach by Gallego-Llorente et al. (67) and calculated a ratio of f4 statistics (97). The statistic
calculates the proportion of ancestry from a Eurasian source α for each population X by using an ancient
Eurasian (LBK in our case; (48)) and an ancient African (BBayA in our case) as sources and East Asians
(Japanese) and Europeans (French or CEU, depending on the dataset) as outgroups,
.
We calculated this statistic for the different datasets using qpF4ratio from ADMIXTOOLS (97). The
estimates were similar between datasets (Tables S10-S13). We noticed that αMota is positive in all datasets,
which suggests that Mota itself might have received some Eurasian back-admixture compared to BBayA.
As most southern African populations display admixture from a source of mixed east African and
Eurasian ancestry (see below), we also used f4 ratios to estimate this ancestry using two different admixed
east African (Oromo and Amhara) populations as source. The results of this analysis are shown in Table S14
and S15. We do not assume that the admixing source was either Amhara or Oromo; they are just the best
representatives in the dataset. Both sources lead to similar and highly correlated estimates. We note that the
estimates for non-southern Africans should be interpreted with caution as the two sources might not be a
suitable model for those populations.
6.5. Admixture graphs
Methods:
Our analyses suggest several admixture events into Khoe-San populations during the last 2000 years. In
order to disentangle the contributions from different admixture sources, we used qpGraph v5052 of
ADMIXTOOLS (97) to construct admixture graphs. qpGraph takes a user-defined graph as input and
estimates drift parameters and admixture proportions by calculating all combinations of f statistics along the
graph. Each internal node in the graph can represent a bifurcation into two child populations and/or the
recipient or source for two-way admixtures. An admixture graph is usually considered consistent with the
data if the differences between all observed (from the data) and expected (from the graph) values of the f
statistics are all smaller than 3 standard errors (|Z|<3). The standard errors are calculated using
a block-jackknife across the whole genome. qpGraph was used with the following settings: initmix: 1000,
lsqmode: YES, blgsize: 0.005 and diag: 0.0001. We used the chimpanzee reference genome sequence as an
outgroup. If the chimpanzee allele was different from the two alleles observed in the human populations, the
site was excluded from the analysis.
Basic graph for the East African extended dataset:
We started by constructing a graph for the three ancient genomes used in this analysis. We used Mota
(67) as a representative of ancient East Africa, LBK (48) to represent Eurasians and BBayA to represent
Stone Age southern Africans. A simple model where African populations split into East Africans related to
Mota and southern Africans related to BBayA, followed by a split of an out-of-Africa population which is
closer to Mota, is consistent with the data (worst |Z|<0.19). We then added Amhara as an East African
population with a substantial degree of Eurasian back-admixture (see above and (84)). Consistent with these
expectations, Amhara would fit as an admixed population between Eurasians and East Africans (worst
|Z|<1.49) with 38% contribution from East African and 62% from Eurasian populations, respectively.
As a first Khoe-San population, we added Ju|’hoansi to the model. Ju|’hoansi are considered to be the least
admixed Khoe-San population, but our data suggest that they received East African and/or Eurasian
admixture since BBayA lived. Therefore, we tested four different models for Ju|’hoansi:
(I) Ju|’hoansi as an admixed population between Stone Age southern Africans and Eurasians.
(II) Ju|’hoansi as an admixed population between Stone Age southern Africans and East Africans.
(III) Ju|’hoansi as an admixed population between Stone Age southern Africans and a population related
to Amhara.
(IV) Ju|’hoansi as an admixed population between southern Africans and a population admixed between
Eurasians and East Africans. The ancestry sources for these populations would be identical to those used
for Amhara but the admixture contributions would be different.
Models I and II were rejected (worst f4(Mota, LBK; Ju|’hoansi, Amhara), Z=3.58 and f4(baa001, Ju|’hoansi;
Mota, LBK), Z=4.58, respectively), but both models III (worst |Z|<2.16) and IV (worst |Z|<1.63) were
consistent with the data (Figure S11). As the |Z| score for model IV is slightly lower than for model III and
since we do not consider Amhara to be the actual source of admixture, we assume model IV to be more
representative of the putative population history of Ju|’hoansi, which means that the group mixing with Stone
Age southern Africans was probably admixed between Eurasians and East Africans but at different
proportions to that of the Amhara. We excluded the Amhara from all further admixture graph modeling of
Khoe-San populations.
Modeling West Africans by adding Yoruba:
The expansion of Bantu-speaking populations had major impacts on the genomic composition of
southern African populations and they contributed ancestry to some Khoe-San populations as well (6, 8).
Modern-day southern African Bantu speakers have received Khoe-San admixture themselves, which makes
it difficult to use them as a source for the Bantu-speaker component in Khoe-San. Due to the lack of West
African Bantu-speakers in our data set, we used the closely related West African Niger Kordofanian speakers
(Yoruba) as a population related to the source of Bantu-speaker admixture.
We tried to add Yoruba as a simple split off from any internal node of the model or by adding internal
nodes from which Yoruba would split off, but none of these models were consistent with the data with |Z|<3.
Furthermore, models assuming Yoruba as a two-way admixture between any of the internal nodes failed. A
model which did fit the data is where Yoruba are modeled as a two-way admixture of two additional nodes:
one Basal African node above the split between ancient East Africans and ancient southern Africans and a
second group, which is a sister group to the East African population that gave rise to the out-of-Africa groups
(worst |Z|<0.42). The drift between the Basal African node and the population that splits into Eastern and
southern Africans is small compared to the rest of the graph, but it appears to provide a better fit to the data.
We did not further investigate the population history of Yoruba as the focus of our analyses is southern
Africa.
The model including Yoruba and Ju|'hoansi as a Khoe-San population without admixture from Bantu
speakers is consistent with the data. We call this model ‘model A’ (Figure S12). In a second approach, we
model additional Bantu-speaking admixture into the Khoe-San populations, which we call ‘model B’ (Figure
S13). We note that model B is also consistent for Ju|'hoansi, (worst |Z|<0.32), but since the simpler model A
is also consistent, we conclude that Ju|'hoansi have not received significant admixture from Bantu-speaking
populations.
Applying model A and B to all KS from the main dataset:
We next tried to model all KS populations and the 68,284 overlapping transversion SNPs in the East
African extended dataset using both models A and B. Model A is consistent for Ju|'hoansi, Coloured
(Askham), ≠Khomani and Nama, indicating no (or minimal) admixture from Bantu speakers. The results of
the different admixture proportions are shown in Table S16. Notably, some values for Eurasian admixture
differ from the values obtained with f4 ratios above; this can be attributed to not accounting for East
African admixture in the two-source f4 ratio test and also to the fact that the populations used are unlikely to
represent the populations that actually mixed.
Model B seems to be a suitable representation for all Khoe-San populations. Model B, however, does
not always allow us to fully disentangle the East African, West African and Eurasian sources. Part of the
ancestry of Yoruba is coming from a population related to East Africans while East Africans also contribute
to the mixed East African/Eurasian population, which leads to two possible sources of East African ancestry
in Khoe-San when they receive significant West African admixture. Some of the resulting graphs set the East
African contribution to the mixed East African/Eurasian population to zero, which is balanced by a higher East
African component in the Bantu-speaker node. Notably, the Lake Chrissie San (CHR) is the only Khoe-San
population where the Eurasian contribution to the mixed East African/Eurasian population is set to zero. This
pattern is consistent with results in (83) and not unexpected given the probable historical locations of the
Khoekhoe populations (98). Generally, these issues make it difficult to interpret and compare the admixture
proportions estimated by model B for different populations: the West African proportion might be inflated
due to the East African contribution, and different drift parameters between the source nodes make the values
hard to compare even if all three proportions are estimated to be non-zero. The results for model B are shown
in Table S17 but we caution against over-interpreting them.
KS populations from the Human Origins dataset:
We also ran qpGraph on the 87,599 overlapping transversion SNPs with the Human Origins dataset (8,
48, 97, 99), which contains some additional Khoe-San populations. Similar to the results obtained for the
East African extended dataset, we see that model A is a consistent model for Ju|’hoansi North, Ju|’hoansi
South, ≠Khomani, Naro and Taa North. For Ju|’hoansi North and Taa North, however, the difference between
the Eurasian and East African sources could not be resolved: the drift parameter between the two populations
was estimated to be very low and all admixture comes from the node slightly closer to the OOA populations.
The results for model A are shown in Table S18. Model B is consistent with the data for all Khoe-San
populations and we observe a similar pattern of balancing of the East African contributions in some Khoe-
San (Table S19). Overall, the results are quite similar for the two data sets.
Caveats:
We note that while we consider the presented models to be good models to represent the population
history of Khoe-San populations, there are very likely other models that would fit the data as well. The
extremely large number of possible admixture graph models combined with the need of manually defined
graphs restricted us to this kind of analysis. Most applications of studying complex admixture histories with
qpGraph have been restricted to fewer than 200,000 SNP markers so far, e.g. (48, 100, 101). As the focus of
this analysis was to obtain a simple and general model of the population history of Khoe-San populations,
we did not analyze the bigger data sets SGDP and the AGV extended dataset to avoid over-fitting the models
due to the large numbers of markers and various minor admixture events between the populations used as
proxies for the admixing populations. In summary, the results from this model-fitting are overall consistent
with results from other analyses in this study.
6.6. Admixture dating
We used ADMIXTOOLS (97) and the KGP extended dataset to estimate the linkage disequilibrium
(LD) decay due to admixture, and thereby infer admixture dates. The date of admixture into current-day
Khoe-San groups with no visible Bantu-speaker admixture was estimated using the two ancient southern
Africans (BBayA and BBayB) as one parental population and the East African Maasai as the other parental
population (Table S20). Default parameters were used and the standard error was estimated with a jackknife
procedure implemented in the ROLLOFF package. Admixture dates of East Africans into other Khoe-San
groups are difficult to distinguish since the admixture with Bantu-speakers will influence signals. Indeed,
when admixture dates with East African and West African source populations were inferred for populations
with admixture from both groups, the dates were similar (Table S21).
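The principle behind LD-based admixture dating can be sketched as follows. Admixture LD is expected to decay approximately as A·exp(-n·d), where d is the genetic distance in Morgans and n is the number of generations since admixture. The sketch below recovers n with a simple log-linear least-squares fit; this is a simplification swapped in for illustration and is not the fitting procedure of ROLLOFF. The function names and the default generation time of 30 years (the value used elsewhere in this supplement) are assumptions.

```python
import math


def fit_admixture_generations(distances_morgans, ld_values):
    """Estimate generations since admixture from an LD decay curve.

    Assumes ld_values are strictly positive and follow A * exp(-n * d);
    fits log(ld) against d by ordinary least squares and returns n.
    """
    xs = list(distances_morgans)
    ys = [math.log(v) for v in ld_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


def generations_to_years(generations, generation_time=30.0):
    """Convert generations to years (generation time is an assumption)."""
    return generations * generation_time


# Synthetic curve generated with n = 50 generations.
distances = [0.005 * i for i in range(1, 11)]
curve = [0.2 * math.exp(-50.0 * d) for d in distances]
n_hat = fit_admixture_generations(distances, curve)
```

On real data the curve is noisy and may include an additive offset, in which case a nonlinear fit is more appropriate than the log-linear shortcut used here.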
6.7. Presence of archaic admixture in current-day Khoe-San
We performed D-tests for testing admixture with Neandertal and Denisovan with P1=HGDP San or
BBayA; P2 one of the other 10 HGDP individuals and P3=Neandertal or Denisovan (Figure S14 and S15).
We estimated standard deviations using the weighted block jackknife approach with 5 Mb blocks. We only
used sites for which the ancestral state was confidently called (the 3 great apes showed the same variant and
exactly one additional variant in the three individuals tested) and set P4 to the ancestral state.
We retrieved the signal for introgression of Neandertals into non-African individuals. Among African
individuals, the signal for Neandertal introgression was always around 0 but consistently lower for P1=San
than for P1=BBayA. This may reflect a larger Neandertal component (due to admixture) in the HGDP San
than in BBayA. When testing for introgression from Denisovans (P3=Denisovan), only the non-African
individuals had a mean signal more than 2 SD away from 0 (this is likely to reflect Neandertal admixture)
and the well-known strong signal in the Papuan individual was retrieved. Interestingly, the signal for the
Mandenka individual was almost as strong as for the French individual, especially for P1=BBayA.
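A minimal sketch of how a D statistic and its weighted block jackknife standard error can be computed from per-block site-pattern counts is given below. It assumes that ABBA and BABA counts have already been tallied per genomic block (for example 5 Mb blocks) with P4 fixed to the ancestral state; the function name and input layout are hypothetical, and the sign convention (ABBA minus BABA) is an assumption that differs between software packages.

```python
import math


def d_with_block_jackknife(blocks):
    """Return (D, standard error) from per-block counts.

    blocks: list of (abba, baba, n_sites) tuples, one per genomic block.
    D = (sum ABBA - sum BABA) / (sum ABBA + sum BABA).
    The standard error uses the delete-one weighted block jackknife,
    with each block weighted by its number of sites.
    """
    g = len(blocks)
    total_abba = sum(b[0] for b in blocks)
    total_baba = sum(b[1] for b in blocks)
    n = sum(b[2] for b in blocks)
    full = (total_abba - total_baba) / (total_abba + total_baba)

    # Estimate with each block left out in turn.
    leave_one_out = []
    for abba, baba, m in blocks:
        a = total_abba - abba
        b = total_baba - baba
        leave_one_out.append(((a - b) / (a + b), m))

    jack_mean = g * full - sum((1 - m / n) * est for est, m in leave_one_out)
    variance = 0.0
    for est, m in leave_one_out:
        h = n / m
        pseudo = h * full - (h - 1) * est
        variance += (pseudo - jack_mean) ** 2 / (h - 1)
    variance /= g
    return full, math.sqrt(variance)


d_value, d_se = d_with_block_jackknife(
    [(12, 6, 100), (9, 5, 80), (11, 4, 120), (10, 7, 90)]
)
```

The ratio d_value / d_se gives a Z score analogous to the deviations from 0 discussed above.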
7. Diversity estimates and demographic inferences
7.1. Diversity estimates - Heterozygosity
We compared the proportion of heterozygote sites for the HGDP individuals and BBayA for sites with
i) >1x coverage, ii) >7x coverage and iii) a coverage >13x and within the central 99.95% of the coverage
distribution (this cuts off the high and low tails of the coverage distribution and is adapted to the specific
coverage distribution of each individual in order to avoid regions with unexpected coverage). We also limited
the data to sites where the ancestral state could be confidently called (no more than 2 variants including the
three apes, no missing data and the 3 apes showing the same variant). The effect of coverage on
heterozygosity can be seen in Figure S16. We note that BBayA has levels of heterozygosity similar to most
other African groups, but that modern-day San (Ju|’hoansi) have greater heterozygosity compared to other
African groups.
7.2. Diversity estimates - Runs of Homozygosity
The distributions of Runs of Homozygosity (RoH) of the data were computed using Plink (v. 1.9) (56)
with the following parameters: sliding windows of 50 SNPs, allowing 1 heterozygote per window, with an
overlapping proportion of 0.05, final window sizes of at least 200 kb and 200 SNPs with a minimum SNP
density of 1 in 20kb and a gap of 50 kb between SNPs before the run of homozygosity is split in two. Ballito
Bay A had among the longest RoH results among Khoe-San individuals, suggesting lower diversity and lower
Ne in Ballito Bay A compared to modern-day Khoe-San groups (Figure S17).
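One possible mapping of the settings above onto PLINK 1.9 command-line options is sketched below as a small helper that assembles the argument list. The correspondence between the description and the individual flags is our reading of the text rather than a record of the command actually run, and the executable name and file prefixes are placeholders.

```python
def plink_roh_command(bfile_prefix, out_prefix, plink_executable="plink"):
    """Build a PLINK 1.9 runs-of-homozygosity command as an argument list.

    bfile_prefix and out_prefix are hypothetical file prefixes. The flag
    values mirror the parameters described in the text.
    """
    return [
        plink_executable,
        "--bfile", bfile_prefix,
        "--homozyg",
        "--homozyg-window-snp", "50",          # sliding windows of 50 SNPs
        "--homozyg-window-het", "1",           # 1 heterozygote per window
        "--homozyg-window-threshold", "0.05",  # overlapping proportion
        "--homozyg-kb", "200",                 # runs of at least 200 kb
        "--homozyg-snp", "200",                # runs of at least 200 SNPs
        "--homozyg-density", "20",             # at least 1 SNP per 20 kb
        "--homozyg-gap", "50",                 # split runs at gaps > 50 kb
        "--out", out_prefix,
    ]


command = plink_roh_command("dataset", "dataset_roh")
```

The list can be passed to `subprocess.run(command, check=True)` if PLINK is installed and the input files exist.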
7.3. Demographic inferences - MSMC
To infer past effective population sizes for BBayA and compare them to those of a number of high-coverage modern-
day genomes, we used MSMC’s implementation of PSMC’ (19). Input files were created using a set of scripts
provided with MSMC. MSMC was run with default parameters except for -r 0.88 in order to represent the
ratio of recombination and mutation rate for humans and --fixedRecombination. We plot the effective
population size for BBayA together with the HGDP individuals assuming a mutation rate of 1.25 × 10⁻⁸ per
site per generation and a generation time of 30 years (Figure S18). The curve for BBayA is shifted according
to the radiocarbon date of the individual.
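The scaling from MSMC output to calendar years and effective population size can be sketched as below. MSMC reports time scaled by the mutation rate and a scaled coalescence rate λ; the usual conversion is years = (t/µ)·g and Ne = 1/(2µλ). The function name and the `sample_age_years` argument, which shifts the curve of an ancient sample by its age, are assumptions made for illustration.

```python
MUTATION_RATE = 1.25e-8   # per site per generation, as stated in the text
GENERATION_TIME = 30.0    # years, as stated in the text


def msmc_to_years_and_ne(scaled_time, scaled_lambda, sample_age_years=0.0):
    """Convert one MSMC time boundary and coalescence rate.

    scaled_time: time boundary in units of mutations per site.
    scaled_lambda: scaled coalescence rate for the same interval.
    sample_age_years: age of the sample (0 for modern genomes); added to
    the time axis so that curves of ancient samples line up with modern ones.
    """
    years = scaled_time / MUTATION_RATE * GENERATION_TIME + sample_age_years
    ne = 1.0 / (2.0 * MUTATION_RATE * scaled_lambda)
    return years, ne


# Example with a hypothetical sample age of 2,000 years.
years, ne = msmc_to_years_and_ne(1.25e-5, 4000.0, sample_age_years=2000.0)
```

In this example the scaled time corresponds to 1,000 generations (30,000 years) before the sample lived, so it is plotted at about 32,000 years before present.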
Starting from the past, all populations start reducing their effective population size around 150 kya.
Non-African populations go through a drastic reduction in Ne, probably representing the out-of-Africa
migration bottleneck. African populations have higher population sizes during this time, but still show signs
of a weaker bottleneck, except the San. BBayA’s population size more recent than 100 kya is very similar to
West Africans (Yorubans, Mandenka) and Dinka. Notably, Mbuti and modern San have greater population
sizes during this period. This could be explained by the recent admixture into San, as admixture has been
shown to inflate estimates of effective population size at times other than when the admixture actually happened (20).
Around 30 kya, BBayA’s estimated effective population size starts to increase (towards more recent times),
which could be an effect of residual deamination and/or mapping/sequencing errors in the ancient sample,
which has the lowest coverage of all individuals in this analysis (13x) and shows a slightly increased per-
base error rate compared to modern individuals (Figure S3).
8. Dating of population split times (G-PhoCS)
8.1. Data
We used the diploid genotypes of Ballito Bay A together with the 11 HGDP genomes to estimate
pairwise population split times and effective population sizes, through coalescence based analyses using G-
PhoCS (7). The sequence data for coalescence analysis were prepared according to the guidelines outlined in
Gronau et al. (7). Over 30,000 short sequence fragments were sampled from random positions across the
autosomes. The length of the fragments was set to 1kb, which is a good length for human genomes, as it
represents the optimal trade-off between minimizing the impact of recombination and maximizing
information for coalescence analysis (7). For filtering the fragments, we followed the guidelines and
recommendations of (7). Five filters were downloaded from the UCSC genome annotation database for hg19
(http://hgdownload.soe.ucsc.edu/goldenPath/ hg19/database/), which targets known genic regions (refGene,
knownGene), simple and complex repeat regions (simpleRepeat, genomicSuperDups) and CpG islands
(cpgIslandExt). In addition, we also compiled a filter from our own called INDEL regions in the dataset.
Positions were set to missing using the 6 different filters and thereafter 1kb fragments containing more than
10% missing data were filtered out. The pipeline thus contained the following steps: random sampling of 1kb
fragments from the autosomes, marking positions present in filters as missing, filtering out of fragments
containing over 10% missing data and converting the data to the right input format for G-PhoCS. This
pipeline was run until over 30,000 fragments were obtained. The exact number of fragments used in the
G-PhoCS run was 32,569.
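The sampling and filtering loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: masked positions are held in per-chromosome sets (a hypothetical representation, real filters would be interval-based), overlapping fragments are not excluded, and the conversion to G-PhoCS input format is omitted.

```python
import random

FRAGMENT_LENGTH = 1000        # 1 kb fragments
MAX_MISSING_FRACTION = 0.10   # discard fragments with more than 10% missing


def sample_fragments(chrom_lengths, masked, n_target, rng, max_attempts=1_000_000):
    """Randomly sample fragments until n_target pass the missing-data filter.

    chrom_lengths: dict mapping chromosome name to its length.
    masked: dict mapping chromosome name to a set of 0-based positions
            flagged by any filter (set to missing).
    Returns a list of (chromosome, start, end) tuples, end exclusive.
    """
    chroms = list(chrom_lengths)
    weights = [chrom_lengths[c] for c in chroms]
    fragments = []
    for _ in range(max_attempts):
        if len(fragments) >= n_target:
            break
        chrom = rng.choices(chroms, weights=weights)[0]
        start = rng.randrange(chrom_lengths[chrom] - FRAGMENT_LENGTH + 1)
        chrom_mask = masked.get(chrom, ())
        n_missing = sum(1 for pos in range(start, start + FRAGMENT_LENGTH)
                        if pos in chrom_mask)
        if n_missing / FRAGMENT_LENGTH <= MAX_MISSING_FRACTION:
            fragments.append((chrom, start, start + FRAGMENT_LENGTH))
    return fragments


# Toy genome: one 5 kb chromosome whose first 2 kb are masked.
toy = sample_fragments({"chr1": 5000}, {"chr1": set(range(2000))},
                       n_target=20, rng=random.Random(1))
```

Chromosomes are drawn in proportion to their length so that start positions are approximately uniform across the autosomes.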
8.2. Split times (Tau) in modern humans
G-PhoCS was run for all pairwise combinations of the individuals from the HGDP dataset and Ballito
Bay A. Default input parameters were used except that the data were logged every 20 steps instead of every
10. No migration bands were added. The MCMC was run for 200,000 iterations and the first 50,000 were
discarded as burn in. The visualization of the trace files showed that both the inferred split time (Tau) and
population size (Theta) had already stabilized before reaching the burn-in cut-off.
Mean and median split times (Tau) were calculated for the 150,000 remaining logs of Tau after the burn-
in was removed and are visualized as mean split times together with standard deviations as bar plots (Figure
S19). To convert Tau to calendar years, a mutation rate of 1.25 × 10⁻⁸ per site per generation was used and a
generation time of 30 years was assumed. Pairwise split times were grouped according to hierarchical split
times (Table S22) and visualized as violin plots (Figure S20) using the vioplot package in R.
8.3. Ne (Theta) in humans
For each pairwise comparison, G-PhoCS also estimates Ne (Theta) for each population and the ancestral
population of the pair. Ne was calculated from Theta for the 150,000 remaining logs of Theta after the burn-
in was removed. To convert Theta to Ne, a mutation rate of 1.25 × 10⁻⁸ per site per generation and a generation time of 30 years were
assumed. Mean Ne and standard deviation of focus populations (Fig S21) and their ancestral populations are
visualized as bar plots in Figure S22.
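The post-processing of the MCMC traces can be sketched as below, assuming Tau and Theta are expressed in units of mutations per site (any output scaling factor applied by G-PhoCS would need to be removed first) and that Theta equals 4·Ne·µ. The function names are hypothetical.

```python
from statistics import mean, stdev

MUTATION_RATE = 1.25e-8   # per site per generation
GENERATION_TIME = 30.0    # years


def summarize_trace(samples, n_burn_in):
    """Mean and standard deviation of a trace after discarding burn-in."""
    kept = samples[n_burn_in:]
    return mean(kept), stdev(kept)


def tau_to_years(tau):
    """Split time in years from a mutation-scaled split time."""
    return tau / MUTATION_RATE * GENERATION_TIME


def theta_to_ne(theta):
    """Effective population size, assuming theta = 4 * Ne * mu."""
    return theta / (4.0 * MUTATION_RATE)


# Toy trace: the first two logged values are treated as burn-in.
tau_mean, tau_sd = summarize_trace([9e-4, 8e-4, 1.0e-4, 1.25e-4, 1.5e-4], 2)
split_years = tau_to_years(tau_mean)
```

Note that the generation time enters only the conversion of Tau to years; the conversion of Theta to Ne depends on the mutation rate alone.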
8.4. Split times to Neandertals
We also analyzed the Altai Neandertal (14) and compared it to the diploid call-set of Ballito Bay A and
the 11 HGDP genomes, with G-PhoCS. Neandertal SNP calling and filtering is described in section S8.
Additional G-PhoCS specific filtering and run settings were the same as described in Section 8.1-8.2. The
estimated split times and standard deviations are shown in Figure S23 and listed in Table S23. Neandertal
split times were in general older for comparisons with Africans compared to non-Africans, likely due to
archaic admixture in non-Africans (14). The split with Ballito Bay A is the oldest and is around 10,000 years
older than the comparison with HGDP San. The split time estimates against Neandertal are in general
younger than dates estimated with the TT method (see section S9.1) and slightly younger than dates reported
previously (i.e. 553,000-589,000 years ago (14)).
9. Estimations based on sample configuration frequencies
9.1. Inference under a split model with pairwise sampling – the TT method
We developed an approach for estimating population split times that involves samples of two gene copies
from each of two populations (denoted the TT method - from Two plus Two). The approach builds on, and
extends the ‘concordance’ approach in Schlebusch et al. (6), Skoglund et al. (102) and Wakeley (103) that
estimates model parameters under a pure split model using single individual samples. In contrast to the
‘concordance’ approach, the TT method utilizes 2 gene copies from each of a pair of populations and relies
on the frequencies of all possible sample configurations (there are 9 such sample configurations but only 7
that are variable) in order to estimate model parameters. The assumptions of the model are: an infinite number
of sites/small mutation rate per site, independence between sites, a pure split model (no migration between
populations) and a constant population size for the panmictic ancestral population predating the split. The TT
approach does not rely on assumptions about i) the population size dynamics in the two daughter populations
(more recent than the split event), ii) the mutation rate, or iii) the number of generations since the split time
in either of the two daughter populations. The modeled and estimated parameters are: the number of
generations from population 1 to the population split, T1 (scaled by a per site and per generation mutation
rate), the number of generations from population 2 to the population split, T2 (scaled by a per site and per
generation mutation rate), the probability of two gene copies not coalescing before the split in population 1,
α1, the probability of 2 gene copies not coalescing before the split in population 2, α2, and the size of the
ancestral population, θA (scaled by a per site and generation mutation rate). Each population branch in
calendar years, t1 and t2, can be estimated by dividing T1 and T2 by the per site and per generation mutation
rate and multiplying by an assumed generation time in years. The ancestral population size can be estimated
by dividing θA by the per site and per generation mutation rate. The expected number of generations to
coalesce, given that the two lineages (from a specific population) coalesce before the population split, is the
only additional quantity that affects the probability of the different sample configurations under this
model. These expected coalescence times are denoted V1 (the value for population 1 multiplied by the mutation rate per site
and per generation) and V2 (the value for population 2 multiplied by the mutation rate per site and per
generation). It is possible to derive closed formulas for the probabilities of all the possible sample
configurations in terms of α1, α2, T1, T2, θA, V1 and V2. Assuming two sampled gene copies from each of the
two populations, we denote the possible sample configurations of derived variants as:
Configuration   Number derived in sample 1   Number derived in sample 2
O0              0                            0
O1              1                            0
O2              0                            1
O3              2                            0
O4              0                            2
O5              1                            1
O6              2                            1
O7              1                            2
O8              2                            2
The probability for each of these sample configurations can be derived by considering the probability of
the configuration conditional on either i) the lineages within each sample coalescing before reaching the split in their branch
(this is an event with probability (1−α1)(1−α2)), ii) the lineages in sample 1 coalescing before T1, but the
lineages in sample 2 remaining as separate lineages at T2 (an event with probability (1−α1)α2), iii) the lineages
in sample 2 coalescing before T2, but the lineages in sample 1 remaining as separate lineages at T1 (an event
with probability α1(1−α2)), or iv) both samples remaining as separate lineages until the split in each branch (an event
with probability α1α2). We can then derive the following probabilities:
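$$P(O_1) = 2T_1 - 2(1-\alpha_1)(T_1 - V_1) + \frac{\theta_A}{3}\,\alpha_1(4-\alpha_2),$$

$$P(O_2) = 2T_2 - 2(1-\alpha_2)(T_2 - V_2) + \frac{\theta_A}{3}\,\alpha_2(4-\alpha_1),$$

$$P(O_3) = (1-\alpha_1)(T_1 - V_1) + \theta_A - \frac{\theta_A}{6}\,(4\alpha_1 + 2\alpha_2 - \alpha_1\alpha_2),$$

$$P(O_4) = (1-\alpha_2)(T_2 - V_2) + \theta_A - \frac{\theta_A}{6}\,(2\alpha_1 + 4\alpha_2 - \alpha_1\alpha_2),$$

$$P(O_5) = \frac{2\theta_A}{3}\,\alpha_1\alpha_2,$$

$$P(O_6) = \frac{\theta_A}{3}\,\alpha_2(2-\alpha_1),$$

$$P(O_7) = \frac{\theta_A}{3}\,\alpha_1(2-\alpha_2),$$

$$P(O_0 \cup O_8) = 1 - \sum_{i=1}^{7} P(O_i).$$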
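As a sketch, the probabilities of the seven variable configurations can be written as a small function of the model parameters. The expressions below follow the conditioning argument described above (events i to iv) and should be read as our reconstruction under the stated model assumptions; the function and key names are hypothetical.

```python
def tt_config_probabilities(alpha1, alpha2, T1, T2, theta_A, V1, V2):
    """Per-site probabilities of sample configurations O1..O7 and of O0 or O8.

    All time-like parameters (T1, T2, theta_A, V1, V2) are scaled by the
    per site and per generation mutation rate.
    """
    p = {
        "O1": 2 * T1 - 2 * (1 - alpha1) * (T1 - V1)
              + theta_A / 3 * alpha1 * (4 - alpha2),
        "O2": 2 * T2 - 2 * (1 - alpha2) * (T2 - V2)
              + theta_A / 3 * alpha2 * (4 - alpha1),
        "O3": (1 - alpha1) * (T1 - V1) + theta_A
              - theta_A / 6 * (4 * alpha1 + 2 * alpha2 - alpha1 * alpha2),
        "O4": (1 - alpha2) * (T2 - V2) + theta_A
              - theta_A / 6 * (2 * alpha1 + 4 * alpha2 - alpha1 * alpha2),
        "O5": 2 * theta_A / 3 * alpha1 * alpha2,
        "O6": theta_A / 3 * alpha2 * (2 - alpha1),
        "O7": theta_A / 3 * alpha1 * (2 - alpha2),
    }
    # Invariable sites (no derived copies, or all four copies derived).
    p["O0_or_O8"] = 1 - sum(p.values())
    return p


probs = tt_config_probabilities(0.6, 0.7, 2e-4, 3e-4, 4e-4, 5e-5, 6e-5)
```

A useful check is that 2·P(O6) + P(O5) equals 4·θA·α2/3, which does not depend on α1; this identity is what makes the ratio 2·m5/(2·m6 + m5) an estimator of α1.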
We denote the number of sites that display the specific sample configuration by:
m0: number of sites that are O0,
m1: number of sites that are O1,
m2: number of sites that are O2,
m3: number of sites that are O3,
m4: number of sites that are O4,
m5: number of sites that are O5,
m6: number of sites that are O6,
m7: number of sites that are O7,
m8: number of sites that are O8.
The total number of sites is then M (M=m0+m1+m2+m3+m4+m5+m6+m7+m8). We find the following
estimators (the ^ of the estimators have been omitted for simplicity):
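$$\alpha_1 = \frac{2m_5}{2m_6 + m_5},$$

$$\alpha_2 = \frac{2m_5}{2m_7 + m_5},$$

$$\theta_A = \frac{3}{M}\,\frac{(2m_6 + m_5)(2m_7 + m_5)}{8m_5},$$

$$T_1 = \frac{1}{M}\left(\frac{m_1}{2} + m_3 - \frac{(2m_6 + m_5)(6m_7 + m_5)}{8m_5}\right),$$

$$T_2 = \frac{1}{M}\left(\frac{m_2}{2} + m_4 - \frac{(6m_6 + m_5)(2m_7 + m_5)}{8m_5}\right),$$

so that

$$N_A = \frac{3}{\mu M}\,\frac{(2m_6 + m_5)(2m_7 + m_5)}{8m_5},$$

$$t_1 = \frac{g}{\mu M}\left(\frac{m_1}{2} + m_3 - \frac{(2m_6 + m_5)(6m_7 + m_5)}{8m_5}\right),$$

$$t_2 = \frac{g}{\mu M}\left(\frac{m_2}{2} + m_4 - \frac{(6m_6 + m_5)(2m_7 + m_5)}{8m_5}\right),$$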
where µ is the per site and per generation mutation rate and g is the number of years per generation. Since

$$\alpha_1 = \exp\left(-\frac{T_1}{\theta_1}\right) \quad\text{and}\quad \alpha_2 = \exp\left(-\frac{T_2}{\theta_2}\right),$$

where θ1 and θ2 are the branch specific effective population sizes for population 1 and 2 (multiplied by the
per site and per generation mutation rate), θ1 and θ2 can be estimated as

$$\theta_1 = -\frac{T_1}{\ln(\alpha_1)} \quad\text{and}\quad \theta_2 = -\frac{T_2}{\ln(\alpha_2)}.$$
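The estimators can be collected in a short sketch that maps the nine configuration counts to the model parameters and, given a mutation rate and generation time, to calendar units. The function names and dictionary keys are hypothetical, and the expressions are our reading of the estimators described above (including the negative sign in the branch-specific sizes, which follows from α being a probability smaller than 1).

```python
import math


def tt_estimates(m):
    """TT estimates from counts m = [m0, m1, ..., m8].

    Returns mutation-scaled quantities: alpha1, alpha2, theta_A, T1, T2,
    and the branch-specific sizes theta1 and theta2.
    """
    M = sum(m)
    m1, m2, m3, m4, m5, m6, m7 = m[1:8]
    alpha1 = 2 * m5 / (2 * m6 + m5)
    alpha2 = 2 * m5 / (2 * m7 + m5)
    theta_A = 3 * (2 * m6 + m5) * (2 * m7 + m5) / (8 * M * m5)
    T1 = (m1 / 2 + m3 - (2 * m6 + m5) * (6 * m7 + m5) / (8 * m5)) / M
    T2 = (m2 / 2 + m4 - (6 * m6 + m5) * (2 * m7 + m5) / (8 * m5)) / M
    return {
        "alpha1": alpha1, "alpha2": alpha2, "theta_A": theta_A,
        "T1": T1, "T2": T2,
        "theta1": -T1 / math.log(alpha1),
        "theta2": -T2 / math.log(alpha2),
    }


def tt_calendar_units(est, mutation_rate, generation_time):
    """Convert scaled estimates to years (t1, t2) and to N_A."""
    return {
        "t1_years": generation_time * est["T1"] / mutation_rate,
        "t2_years": generation_time * est["T2"] / mutation_rate,
        "N_A": est["theta_A"] / mutation_rate,
    }


# Hypothetical counts, for illustration only.
example_counts = [999000000, 300000, 320000, 150000, 160000,
                  20000, 25000, 25000, 0]
example = tt_calendar_units(tt_estimates(example_counts), 1.25e-8, 30.0)
```

Because only the total M enters the formulas, the split between m0 and m8 does not affect the estimates. Confidence intervals would come from recomputing these estimates on block-jackknife replicates of the counts.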
9.2. Split time estimates
We utilized a weighted block jackknife procedure with 5 Mb blocks to estimate the confidence intervals
of the parameters. We applied this method to pairwise comparisons of Ballito Bay A and the 11 individuals
from the HGDP panel. These 12 individuals are assumed to each represent a population. The genome data
for all 12 individuals were filtered with the same criteria, including only retaining sites for which the 3 great
apes displayed the same variant. Furthermore, we noticed an effect of genome coverage on the split time
estimates, and restrict analyses to positions where both individuals (in a pairwise comparison) passed a
coverage filter (≥13x and within 99.95% of the coverage distributions), as described above. The SNP-calling
was conducted for each individual separately.
For a comparison between individual A and B, there are branch specific estimates of split time, drift and
effective population size. We therefore refer to the estimates of these parameters in the branch leading to
individual A (in this particular comparison) as the split time with individual A being ‘focal’ and individual
B being ‘reference’ (and vice versa when B is focal and A is reference).
In view of the results in section 6, we tried to model the modern-day San individual’s genome (from the
HGDP panel) as a combination of Ballito Bay A, Dinka and Sardinian genomes. Assuming that Ballito Bay
A contributed 86%, Dinka contributed 9.66% and Sardinian 4.34%, we randomly sampled, for each position,
both alleles from the Ballito Bay A genome with probability 0.8614², one allele from the Ballito Bay A genome
and one allele from the Dinka genome with probability 2×0.8614×0.0966 and so forth to construct all possible
combinations of genotypes (the probabilities to sample an allele from a particular genome correspond to the
admixture proportions estimated in section 6). Here, sampling was done without replacement so that if both
alleles were drawn from the same source then the site in the modeled genome would be heterozygous if the
source genome was heterozygous at this position. This random sampling was repeated independently for
each individual to which this ‘modeled artificially admixed modern-day San’ (‘AS’ in the figures) was compared.
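The per-site sampling scheme can be sketched as follows. For each site, a source genome is drawn independently for each of the two alleles according to the admixture proportions; if both draws land on the same source, its genotype is copied unchanged (sampling without replacement), otherwise one random allele is taken from each source. The function name and data layout are assumptions, and the proportions in the example are the percentages quoted in the text.

```python
import random


def sample_admixed_genotype(genotypes, proportions, rng):
    """Draw one diploid genotype for the artificially admixed genome.

    genotypes: dict mapping source name to its (allele, allele) at a site.
    proportions: dict mapping source name to its admixture fraction.
    """
    sources = list(proportions)
    weights = [proportions[s] for s in sources]
    first, second = rng.choices(sources, weights=weights, k=2)
    if first == second:
        # Both alleles from one source: keep its genotype as is, so a
        # heterozygous source site stays heterozygous.
        return tuple(genotypes[first])
    return (rng.choice(genotypes[first]), rng.choice(genotypes[second]))


proportions = {"BallitoBayA": 0.86, "Dinka": 0.0966, "Sardinian": 0.0434}
site = {"BallitoBayA": (0, 1), "Dinka": (1, 1), "Sardinian": (0, 0)}
genotype = sample_admixed_genotype(site, proportions, random.Random(42))
```

Drawing the two source labels independently reproduces the stated genotype-source probabilities, for example p² for both alleles from one source and 2pq for one allele from each of two sources.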
The split time estimates based on all comparisons associated with a split between a Khoe-San individual
and an individual with a non-Khoe-San origin are shown in Figure S24 and Table S24. The estimated
population split times do decrease for the modeled modern-day San individual (AS), but relatively little, and
the estimated drift parameter recaptures the estimated genetic drift for the San population relatively well (see
below), suggesting that we capture some features of the modern-day San genomes by mixing genomic
material from Ballito Bay A, Dinka and Sardinian. On the other hand, the estimated split times are lower for
the AS genome compared to the modern-day San, suggesting that not all features of the mixed modern-day San
genome were recreated by simply mixing genetic material from Ballito Bay A, Dinka and Sardinian. This
observation is hardly surprising since Ballito Bay A is likely an ancestor to southern Khoe-San populations,
not to northern Khoe-San populations such as the Ju|’hoansi (the HGDP San is a Ju|’hoansi individual), and
Dinka is likely not a perfect representative of the East African source population. From this investigation of
an ‘artificially’ admixed individual, we note that: i) small amounts of admixture have limited influence on the
estimates of split times, and ii) we qualitatively recapitulate the change in estimated genetic drift and
population split times using an artificially admixed individual of genomic material from Ballito Bay A, Dinka
and Sardinian.
We also estimated the split between the Altai Neandertal individual and the 12 non-archaic individuals
(the 11 HGDP individuals and Ballito Bay A) as shown in Figure S25. The estimates are older than those
estimated by G-PhoCS above, and also less affected by the small proportion of Neandertal admixture in non-
African individuals, but overall on par with past estimates (2, 14).
The estimates of the deepest population split among humans (Khoe-San vs non-Khoe-San) using the
Ballito Bay A individual consistently produced longer population branches from the Ballito Bay A genome
compared to the estimates from the non-Khoe-San branch. Although this effect was mitigated by filtering out
low coverage sites, it was not completely removed. It is consistent with the slightly increased error rate seen
in ancient genomes (Figure S3). This effect is possibly due to additional errors due to the nature of ancient DNA,
including mapping errors due to short reads, lower and more variable coverage compared to modern-day
genome sequences, and possibly, residual deamination not fully repaired by the UDG treatment of the
libraries. This (relatively small) effect of aDNA properties will affect the split time estimates in the
Neandertal branch as well; however, such effects will also be counteracted by the age of the remains. The
Ballito Bay A individual dates to ~2,000 years ago while the Altai Neandertal individual has an age estimate
of around 50,000 years, likely concealing some of the effects of aDNA errors. However, the TT method
provides a novel way to overcome the issues with residual aDNA errors by estimating the population branch
of a modern-day individual (as a focal group) in a pairwise comparison with an ancient individual.
9.3. Branch specific drift
Using the TT-approach, we estimate the branch specific drift (Figure S26) as well as branch specific
effective population size (given estimates of branch specific drift and split time, the branch specific effective
population size is the split time divided by the drift, Figure S27).
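As a worked illustration of the relation stated above (effective size equals split time divided by drift), the following sketch uses made-up numbers. The function name and the `scale` argument are assumptions: the text does not state the units of the drift parameter, so `scale` is left at 1 to follow the sentence literally and can be changed if the drift is expressed in other units.

```python
def branch_effective_size(split_time_generations, drift, scale=1.0):
    """Effective population size as split time divided by drift.

    scale is a hypothetical unit factor (1.0 follows the text literally).
    """
    if drift <= 0:
        raise ValueError("drift must be positive")
    return split_time_generations / (scale * drift)


if __name__ == "__main__":
    # Illustrative values only, not estimates from the study.
    print(branch_effective_size(10_000, 0.25))
```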
In order to identify the modern population most closely related to Ballito Bay A, and to compare the TT-
method to other approaches, we calculated the genetic drift in the Ballito Bay A individual compared to
several different modern-day Khoe-San individuals using the TT approach. We first compared him to the 6
Khoe-San individuals in (88) together with the HGDP San (in total, 2 ≠Khomani and 5 Ju|’hoansi
individuals). Here, only variable sites were required in order to calculate the drift parameter. Branch specific
drift on the Ballito Bay A individual/branch is shown in Figure S28.
We also calculated genetic drift on the Ballito Bay A branch/individual when comparing to the
Schlebusch et al. SNP-genotype data (6). Here, because the TT-method explicitly models the mutation
process, and SNP-genotype data are heavily ascertained, the TT-method is not suitable. Instead, if there is no
admixture and the SNPs have been ascertained in non-Khoe-San populations, then all SNPs that are variable
in Khoe-San populations must have been present before the split between Khoe-San and other groups and
the ‘concordance’ method described in Schlebusch et al. (6) is more suitable than the TT method. Estimated
drift on the BBayA branch is shown in Figure S29.
Both these analyses (supported by the outgroup-f3 analysis below) suggest that the Ballito Bay A
individual shows greatest genetic affinity to southern Khoe-San groups of today (see also sections 6-8). The
Ballito Bay A boy appears to be particularly closely related to the Karretjie People.
9.4. Other measures of drift (FST and outgroup f3)
For reference, we estimated pairwise FST between Ballito Bay A and the 11 HGDP individuals with the
same filtered data as for the TT analyses. We also estimated pairwise FST between Ballito Bay A and the
Schlebusch et al. SNP-genotype data (6). See Figures S30 and S31.
Finally, we estimated shared drift as measured by outgroup f3 values between Ballito Bay A and the
southern African dataset (6) (Figure S32).
The FST and outgroup f3 analyses comparing Ballito Bay A to the individuals in the Schlebusch et al.
(2012) data both suggest that the Ballito Bay A individual is more closely related to modern-day southern
Khoe-San individuals than he is to modern-day northern Khoe-San individuals. Moreover, the fact that FST
between non-Khoe-San individuals and Ju|’hoansi (from the HGDP, Figure S30) is lower than FST between the
non-Khoe-San individuals and Ballito Bay A is consistent with admixture into Ju|’hoansi (from non-Khoe-San
individuals), an admixture that is not present in the Ballito Bay A boy.
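To make the notion of shared drift concrete, the outgroup f3 statistic can be sketched as the average product of allele-frequency differences from the outgroup. This is a minimal sketch and not the pipeline used here: the function name and toy frequencies are assumptions, and the finite-sample bias correction and the block-jackknife standard errors used in practice are omitted.

```python
def outgroup_f3(outgroup, pop_a, pop_b):
    """Mean of (o - a) * (o - b) over sites with data in all three groups.

    Each argument is a sequence of derived-allele frequencies per site;
    None marks missing data.
    """
    total = 0.0
    n_sites = 0
    for o, a, b in zip(outgroup, pop_a, pop_b):
        if o is None or a is None or b is None:
            continue  # skip sites missing in any group
        total += (o - a) * (o - b)
        n_sites += 1
    if n_sites == 0:
        raise ValueError("no overlapping sites")
    return total / n_sites


if __name__ == "__main__":
    # Hypothetical toy data: outgroup fixed for the ancestral allele.
    outgroup = [0.0, 0.0, 0.0, 0.0]
    target = [1.0, 0.0, 1.0, 0.0]
    close = [1.0, 0.0, 0.0, 1.0]
    distant = [0.0, 1.0, 0.0, 1.0]
    print(outgroup_f3(outgroup, target, close))
    print(outgroup_f3(outgroup, target, distant))
```

A larger value indicates more drift shared by the two populations since their split from the outgroup, which is how the statistic is read in the comparison above.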
9.5. Testing drift in admixture-masked Ju|’hoansi
We used the RFMix software (104) to identify specific ancestries of genomic fragments for Ju|’hoansi
individuals. A combined dataset of selected Khoe-San and neighboring Bantu-speaking populations from
Schlebusch et al. (6) was merged with YRI, CEU, TSI, CHB and JPT individuals from the 1000 Genomes
Project (60), and with the East African Amhara and Oromo and the West African Mandinka from the African
Genome Variation Project (85). The masking process was performed on a dataset comprising 1,507,271 SNPs. The
dataset was imputed and phased with fastPHASE v.1.4.0 (105). The number of haplotype clusters was set to
25 and we used 25 runs of the EM algorithm to generate the “best” haplotype guess.
From the ADMIXTURE analyses at K=5, we selected individuals with the “Khoe-San” component
higher than 95%, independently of their ethnographic label. 30 Khoisan-speakers were selected to represent
the “Khoe-San” parental source. To have a similar parental source sample-size, we evenly selected random
individuals from Yoruba and Mandinka (to represent West Africa), Amhara and Oromo (East Africa), Central
Europeans and Tuscans (Europe), and the Han Chinese and Japanese (Asia) for the RFMix analyses. We ran
RFMix analyses with two extra iterations to account for admixture in the source populations and minimize
assignment errors. We set a minimum of 3 reference haplotypes per tree node and a window size of 0.02 cM. We
used the HapMap II genetic map as the recombination map.
In order to validate that all modern Khoe-San populations (including Ju|’hoansi) share admixture from
Eastern African/non-African sources post-dating the BBayA individual, we contrasted estimations on a
masked and a non-masked Ju|’hoansi individual (randomly picked from the Ju|’hoansi population from
Schlebusch et al. 2012).
We estimated outgroup-f3 and drift on the Khoe-San branch compared to Yoruba (YRI, n=15), CEU
(n=15) and Europeans from Tuscany (TSI, n=15). Under the hypothesis that admixture explains the larger
estimates of Ne on the HGDP-San compared to BBayA, a similar pattern should be apparent in the unmasked
compared to the masked Ju|’hoansi individual. Likewise, if admixture induces deeper split-time estimates
using BBayA instead of HGDP-San, then this is potentially visible as more shared drift (as measured by
outgroup-f3) between the unmasked Ju|’hoansi and non-Khoe-San individuals (Yoruba, CEU and Tuscans in
this case) compared to shared drift between the masked Ju|’hoansi and non-Khoe-San individuals. This is
also what we observe (Figures S33 and S34).
10. Genomic regions of interest and selection
10.1. Variants of specific phenotypic interest
In order to investigate SNP variants associated with particular traits, we scanned the literature for
specific sites and determined the alleles at these sites in the ancient samples using the samtools mpileup
function (v1.3) (51). Genes coding for traits of particular interest in African populations were analyzed (106),
including the following genes/regions: i) the MCM6 gene, which contains regulatory elements for the physically
nearby LCT gene that encodes lactase and is strongly associated with lactase persistence in adulthood (106),
ii) DARC, HBB, G6PD, ATP2B4, and APOL1 genes for resistance to malaria and African sleeping sickness
(107-110), and iii) the SLC24A5/A2, HERC2 and OCA2 pigmentation genes (111-113). Either the OMIM or
NCBI SNP directory was used to obtain the rs number, chromosome position and reference allele for each
associated site in (or nearby) the genes. Chromosome positions according to the hg19 reference sequence
were used in the analysis.
All ancient southern African individuals (that had enough data) exhibited the reference SNP call for all
lactase persistence genes (Table S25), and none of the samples displayed any variants that were linked to
lactase persistence.
For malaria resistance, the alternative variants were found in Eland Cave (possibly heterozygous C/T),
Mfongosi (possibly heterozygous C/T), and Newcastle (homozygous C) for the malarial resistance Duffy null
allele (Table S25). Interestingly, all three ~300-500-year-old individuals (that have enough data) carry at
least one Duffy null allele that has a strong protective effect against malaria (110), while the older samples
do not carry the Duffy null allele. For the Duffy FY*A/B locus, the FY*B alleles were found in Champagne
Castle (at least one allele is FY*B), ELA (possibly homozygous FY*B), MFO (homozygous FY*B), and
NEW (homozygous FY*B). One of the ~2,000-year-old individuals that had enough data displayed the FY*A
allele. The FY*A allele potentially has some protective effect against malaria compared to the FY*B allele
(108), but this locus likely has less impact on malaria resistance than the variants at the Duffy null locus
(110). The Duffy null allele is usually found on a FY*B background and therefore the high frequency of
FY*B among the more recent individuals is not surprising. For the ATP2B4 gene variant, both alleles appear
in both the old (~2,000) and the young (300-500) set of individuals. Taken together, these observations point
to strong malaria protective variants existing in migrant Iron Age farmers (of West African origin) in contrast
to southern African Stone Age hunter-gatherers.
Having at least one G allele for the APOL1 gene SNP rs73885319 confers resistance to African sleeping
sickness (109). Eland Cave is heterozygous for this polymorphism and Newcastle is homozygous for the
alternative variant (Table S25). This suggests that the protective variant was present in moderate frequency
among southern African Iron Age farmers.
The SLC24A5 G allele is near fixation in African populations (112) and all individuals with enough data
exhibit the alternative G variant for the SLC24A5 gene SNP rs1426654, which codes for darker skin color
(112). All individuals with enough data exhibit the ancestral C variant for the SLC45A2 gene SNP
rs16891982, which is also associated with darker skin pigmentation. All individuals (with enough data)
present the ancestral allele for the OCA2 and HERC2 genes associated with eye color (Table S25), and the
individuals were likely brown eyed (112, 113).
Fig. S1.
Site map of southern Africa. The map shows elevation and the geographic locations of
the archaeological sites associated with the investigated ancient individuals, and
comparative Khoe-San and Bantu-speaking populations from Schlebusch et al. (6).
Fig. S2
Cytosine deamination patterns for non-damage repaired libraries.
Fig. S3
Estimated error rates using an outgroup and an “error-free” individual for specific base
changes. The average error rate is given in the figure legend.
Fig. S4
Principal Component analysis of the Southern African dataset, showing first four PCs.
Fig. S5
Principal Component analysis of the KGP comparative dataset.
Fig. S6
Principal Component analysis with comparative East and West Africans (Maasai and
Yoruba) and southern and northern Khoe-San (Karretjie People and Ju|’hoansi), excluding
Champagne Castle and Doonside to maximize the number of retained SNPs. (A) Ju|’hoansi
is shifted towards Maasai (East Africans) compared to Ballito Bay A for PC 1 and 2, while
some of the Karretjie individuals are shifted towards Maasai and some towards Yoruba
(PC 1 and 2). (B) Ballito Bay A and Ballito Bay B cluster with Karretjie People (southern
San) and not with Ju|’hoansi (northern San) for higher PCs. The Khoe-San admixture in
the Iron Age farmers also clusters with the Karretjie People. (C) Bar plot displaying the
amount of genetic variation explained by each PC.
Fig. S7
Principal Component analysis with comparative East and West Africans (Maasai and
Yoruba) and southern and northern Khoe-San (Karretjie and Ju|’hoansi) (maximum Bantu-
speaker SNPs, excluding BBayB and DOO).
Fig. S8
Admixture analysis with Illumina datasets for K=2 until K=13.
Fig. S9
Admixture analysis. Zoom-in on Bantu-speaker aDNA
Fig. S10
Outgroup f3 analysis of the four southern African Iron Age farmers (CHA, NEW, MFO,
ELA) against the western Bantu-speakers (wBSP) of Patin et al., 2017 (16). The most
shared drift with the ancient Iron Age farmers was observed for the southern wBSP
(Angola) and not the northern wBSP. This is in accordance with the findings of Patin et al. (16)
and supports the “late-split” hypothesis of the Bantu expansion.
Fig. S11
Admixture Map Analysis. Model III and IV.
Fig. S12
Admixture Map Analysis. Model A for Ju|'hoansi. A population and admixture graph
model of Ju|’hoansi as an admixed population between southern Africans and an admixed
(Eurasian/East African) population is consistent with the data. The numbers next to edges
represent the amount of drift between the nodes (multiplied by 1000). The model
includes Yoruba as a potential source of Bantu-speaking ancestry. Ju|'hoansi as a Khoe-
San population without admixture from Bantu-speakers is also consistent with the data.
Alternative tested models constructed in a hierarchical way are discussed in SI section 6.5.
The most likely model for the Ju|'hoansi is shown in this figure.
Fig. S13
Admixture Map Analysis. Model B for Ju|'hoansi.
Fig. S14
D-tests with P1=San or BBayA; P2 one of the 10 HGDP individuals and P3=Neandertal.
Fig. S15
D-tests for testing admixture with P1=San or BBayA; P2 one of the 10 HGDP individuals
and P3=Denisova.
Fig. S16
Heterozygosity estimates based on sites of different qualities. Red points indicate
heterozygosity based on all sites with coverage >7x, black circles on sites with coverage
>1x, and blue points have additional quality filters applied, including >13x. Points are
sorted in ascending order based on the more stringently filtered sites (blue points).
Fig. S17
RoH of the KGP extended dataset. The cumulative length of RoH (x-axis) plotted against
the number of RoH fragments (Y-axis) for the shortest RoH class (200-500Kb). Left:
Including non-Africans and the LBK Neolithic European. Right: Zoom-in on African
samples with BBayA (black dot) showing among the greatest cumulative lengths of RoH
among African samples.
Fig. S18
MSMC plot of 11 high-coverage HGDP genomes together with the diploid full genome of
Ballito Bay A (BBayA). The increased recent population size of BBayA could be due to a
slightly higher error rate while the increased population size of modern San more than 30
kya could be due to admixture (20).
Fig. S19
Means (dots) and standard deviations (bars) of G-PhoCS pairwise population split times,
sorted in descending order. Colors are according to the hierarchical split times: Ballito Bay
A (BAA) vs. all non-San (Black); San vs. all non-San (Red); Mbuti vs. all non-San
(Turquoise); Ballito Bay A vs. San (Gray), West Africans (Mandenka and Yoruba) vs. non-
Africans and East Africans (Green); East Africans vs. non-Africans (Blue); pairwise non-
Africans (Pink).
Fig. S20
Violin plot of G-PhoCS pairwise population split times. Y-axis: Tau (Time in generations
= Tau / (10,000 × mutation rate)). X-Axis: Populations. West Africans (WA), East Africans
(EA), and non-Africans (NA). Means and standard deviations of these grouped split times
are summarized in Table S22.
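The conversion given in this caption (time in generations = Tau / (10,000 × mutation rate)) can be written out as a short sketch. The function names are illustrative assumptions. The mutation rate of 1.25×10⁻⁸ per site and generation and the 30-year generation time are the values quoted in the caption of Fig. S24 for the TT method; applying them here is an assumption made only for the example, and the Tau value is made up.

```python
def tau_to_generations(tau, mutation_rate):
    """Convert Tau (as plotted on the y-axis) to a time in generations."""
    return tau / (10_000 * mutation_rate)


def generations_to_years(generations, generation_time=30.0):
    """Convert generations to calendar years for a given generation time."""
    return generations * generation_time


if __name__ == "__main__":
    mu = 1.25e-8   # per site and generation (value quoted for Fig. S24)
    tau = 1.0      # hypothetical value, for illustration only
    generations = tau_to_generations(tau, mu)
    print(generations, generations_to_years(generations))
```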
Fig. S21
Means (dots) and standard deviations (bars) of G-PhoCS estimated effective population
sizes for the populations of different pairwise population comparisons, sorted descending.
Colors correspond to the specific split, see Figure S19.
Fig. S22
Means (dots) and standard deviations (bars) of G-PhoCS estimated effective population
sizes for ancestral populations of different pairs of populations, sorted descending. Colors
correspond to specific splits, see Figure S19.
Fig. S23
Mean and standard deviation of estimated split times of Ballito Bay A (BAA) and 11
individuals from the HGDP panel against Altai Neandertal.
Fig. S24
Estimates of split time between pairs of individuals using the TT method. The populations
displayed on top and in larger font are focal populations while the populations below in
smaller font are the contrasting populations. We assume a mutation rate of 1.25×10⁻⁸ per
site and generation, and a generation time of 30 years to translate the estimated parameter
T to time in calendar years. In the figure, ‘BBayA’ refers to Ballito Bay A and ‘AS’ to the
modeled admixed modern-day San.
Fig. S25
Estimation of split time between Altai Neandertal and other populations. The populations
above and in larger font are focal while the populations below in smaller font are the
contrasting populations. We assume a mutation rate of 1.25×10⁻⁸ per site and generation,
and a generation time of 30 years to translate the estimated parameter T to time in calendar
years. In the figure, ‘BBayA’ refers to Ballito Bay A.
Fig. S26
Estimation of branch specific drift until the split between Khoe-San populations and other
populations. The populations above and in larger font are focal while the populations below
in smaller font are the contrasting populations. In the figure, ‘BBayA’ refers to Ballito Bay
A and ‘AS’ to the modeled admixed modern-day San individual.
Fig. S27
Estimation of effective size up until the split between Khoe-San populations and other
populations. The populations above and in larger font are focal while the populations below
in smaller font are the contrasting populations. We assume a mutation rate of 1.25×10⁻⁸ per
site and generation, and a generation time of 30 years to translate the estimated parameter
θ into a diploid effective population size. In the figure, ‘BBayA’ refers to Ballito Bay A
and ‘AS’ to the modeled admixed modern-day San individual.
Fig. S28
Genetic drift specific to Ballito Bay A compared to whole genome sequenced Khoe-San
individuals (‘San’ from HGDP in red, ≠Khomani and Ju|’hoan from (88)).
Fig. S29
Genetic drift specific to Ballito Bay A when comparing to Khoe-San and Bantu speakers
from (6), based on SNP-genotype data and the ‘concordance’ method of Schlebusch et al.,
2012, (6) to estimate branch-specific genetic drift.
Fig. S30
Pairwise FST between Ballito Bay A and the 11 HGDP genome-sequenced individuals.
Fig. S31
Pairwise FST between Ballito Bay A (BBayA) and the individuals in (6).
Fig. S32
Outgroup-f3 between Ballito Bay A (BBayA) and Khoe-San and Bantu-speaker individuals
from Schlebusch et al., 2012, (6). Ancestral sites are inferred from the genomes of three
great apes and are used as outgroup.
Fig. S33
Outgroup-f3 on the Khoe-San branch compared to Yoruba (YRI), CEU and Europeans from
Tuscany (TSI). f3 in Ju|’hoansi vs. admixture-masked Ju|’hoansi is shown.
Fig. S34
Drift parameter on the Khoe-San branch compared to Yoruba (YRI), CEU and Europeans
from Tuscany (TSI). Drift in Ju|’hoansi vs. admixture-masked Ju|’hoansi is shown.
Table S1.
The number and types of extractions and libraries for each individual.
| Individual | Accession no | Bone element | Extract (Yang) (39) | Extract (Dabney) (41) | Library Blunt-end | Library Damage-repair | Library Mybaits capture |
|---|---|---|---|---|---|---|---|
| Ballito Bay A | 2009/007 | Petrous, left | 1 | - | 1 | 4 | - |
| | 2009/007 | Petrous, right | 1 | - | - | 2 | - |
| | 2009/007 | Premolar, upper left | 1 | - | 1 | - | - |
| Ballito Bay B | 2009/008.001 | Premolar, lower left | 1 | - | - | 2 | - |
| | 2009/008.001 | Premolar, lower right | 1 | - | - | 1 | - |
| | 2009/008.001 | Petrous, left | 1 | - | 1 | 4 | - |
| | 2009/008.002 | Femur | 2 | 2 | 6 | 5 | 1 |
| Doonside | 2009/010 | Humerus | 1 | 1 | 2 | 1 | - |
| | 2009/010 | Femur | - | 2 | 3 | 1 | - |
| | 2009/010 | Foot/handbone | - | 2 | 3 | 1 | - |
| Champagne Castle | 2009/023 | Molar, lower left | 1 | 1 | 4 | - | - |
| | 2009/023 | Canine, lower left | 1 | - | 2 | - | - |
| | 2009/023 | Femur | - | 1 | - | 1 | - |
| Newcastle | 2007/006.001 | Incisor | 1 | - | - | 5 | - |
| | 2007/006.001 | Premolar | 1 | 1 | 1 | 1 | - |
| | 2007/006.001 | Foot/handbone | 3 | - | 1 | 11 | - |
| Mfongosi | 1925/036.002 | Molar | 1 | - | 1 | 4 | - |
| | 1925/036.002 | Incisor | 1 | - | 1 | 4 | - |
| | 1925/036.002 | Femur | 1 | 1 | - | 7 | - |
| Eland Cave | 1925/037 | Tibia | 3 | - | 1 | 12 | - |
| | 1925/037 | Foot/handbone | 1 | 1 | 2 | 4 | - |
Table S2.
Individual and library information.
| Individual | Library | Avg. proportion human | Avg. read length | Genome cov | MT cov | Biological Sex |
|---|---|---|---|---|---|---|
| Ballito Bay A | baa001-dr | 0.162 | 58.4562 | 11.3586 | 908.332 | XY |
| Ballito Bay B | bab001-dr | 0.024 | 62.1155 | 0.877068 | 59.715 | XY |
| Ballito Bay B | bab002-dr | 0.003 | 61.7779 | 0.0031745 | 0.323315 | consistent with XY but not XX |
| Champagne Castle | cha001-dr | 0.008 | 65.9827 | 0.302505 | 156.993 | XX |
| Doonside | doo001-dr | 0.001 | 52.6248 | 0.000833784 | 0.207255 | consistent with XY but not XX |
| Eland Cave | ela001-dr | 0.119 | 60.9033 | 9.33034 | 5621.84 | XX |
| Mfongosi | mfo001-dr | 0.085 | 65.358 | 6.1 | 482.422 | XX |
| Newcastle | new001-dr | 0.072 | 55.4451 | 9.73943 | 514.755 | XX |
| Ballito Bay A | baa001 | 0.226 | 66.3037 | 1.5861 | 127.061 | XY |
| Ballito Bay B | bab001-b3e1l1 | 0.051 | 74.7823 | 0.360675 | 22.6492 | XY |
| Ballito Bay B | bab002 | 0.005 | 69.3384 | 0.00631018 | 1.2588 | XY |
| Champagne Castle | cha001-b1e1l1 | 0.019 | 71.9981 | 0.0589717 | 29469 | XX |
| Doonside | doo001 | 0.003 | 57.7276 | 0.0120243 | 2.38717 | consistent with XY but not XX |
| Eland Cave | ela001 | 0.203 | 68.5227 | 3.89613 | 1975.24 | XX |
| Mfongosi | mfo001-b1e1l1 | 0.108 | 71.0634 | 0.84215 | 79.3689 | XX |
| Newcastle | new001 | 0.122 | 59.2867 | 0.913927 | 101.383 | XX |
Table S3.
READ results for four samples from the ~2,000-year-old remains. The DNA libraries
bab001-dr and bab002-dr indicate that these two bone elements with different museum
accession numbers originate from the same individual: Ballito Bay B. The overlapping
coverage was not sufficient to calculate kinship for the bab002-dr and doo001-dr samples.
| Ind/Sample 1 | Ind/Sample 2 | Relationship | Z upper | Z lower |
|---|---|---|---|---|
| baa001-dr | bab001-dr | Unrelated | NA | -18.4991343501 |
| baa001-dr | bab002-dr | Unrelated | NA | -3.86762077351 |
| baa001-dr | doo001-dr | Unrelated | NA | -2.48933227105 |
| bab001-dr | bab002-dr | IdenticalTwins/SameIndividual | 4.08507125493 | NA |
| bab001-dr | doo001-dr | Unrelated | NA | -1.78533153114 |
| bab002-dr | doo001-dr | - | - | - |
Table S4.
Investigating potential mitochondrial contamination.
| Sample | Library | Point estimate (%) | Informative sites | Consensus alleles | Total alleles | Lower C.I. | Higher C.I. |
|---|---|---|---|---|---|---|---|
| Ballito Bay A | baa001-dr | 1.093491124 | 26 | 20894 | 21125 | 0.9532492265 | 1.233733022 |
| Ballito Bay B | bab001-dr | 3.29847144 | 21 | 1202 | 1243 | 2.305598806 | 4.291344074 |
| Ballito Bay B | bab002-dr | - | | | | | |
| Champagne Castle | cha001-dr | 1.657940663 | 25 | 3381 | 3438 | 1.231108364 | 2.084772963 |
| Doonside | doo001-dr | - | | | | | |
| Eland Cave | ela001-dr | 0.146710594 | 12 | 38795 | 38852 | 0.1086512493 | 0.1847699388 |
| Mfongosi | mfo001-dr | 4.42556996 | 5 | 2138 | 2237 | 3.573297327 | 5.277845923 |
| Newcastle | new001-dr | 0.7432432432 | 5 | 2938 | 2960 | 0.4338180001 | 1.052668486 |
| Ballito Bay A | baa001 | 0.8980866849 | 25 | 2538 | 2561 | 0.5327018244 | 1.263472287 |
| Ballito Bay B | bab001-b3e1l1 | 1.049868766 | 17 | 377 | 381 | 0.02641252551 | 2.073325007 |
| Ballito Bay B | bab002 | - | | | | | |
| Champagne Castle | cha001-b1e1l1 | 1.15384616 | 17 | 514 | 520 | 0.2359189506 | 2.071773357 |
| Doonside | doo001 | - | | | | | |
| Eland Cave | ela001 | 0.2278913718 | 10 | 15761 | 15797 | 0.1535317317 | 0.3022510119 |
| Mfongosi | mfo001-b1e1l1 | 0.6153846154 | 5 | 323 | 325 | 0 | 1.465635885 |
| Newcastle | new001 | 0 | 5 | 526 | 526 | 0 | 0.5679120981 |
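The point estimates in Table S4 correspond to the share of non-consensus alleles among all alleles at the informative sites. The sketch below reproduces that arithmetic. The interval uses a normal approximation with z = 1.96, which is an assumption inferred from the tabulated values rather than a method stated in the text; rows with zero non-consensus alleles report a non-zero upper bound that this sketch does not reproduce. The function name is hypothetical.

```python
import math


def contamination_estimate(consensus_alleles, total_alleles, z=1.96):
    """Return (point estimate, lower bound, upper bound) in percent."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    # Fraction of alleles that disagree with the consensus sequence.
    p = (total_alleles - consensus_alleles) / total_alleles
    # Normal-approximation half-width (assumed, see lead-in).
    half_width = z * math.sqrt(p * (1.0 - p) / total_alleles)
    lower = max(0.0, p - half_width)
    upper = p + half_width
    return 100.0 * p, 100.0 * lower, 100.0 * upper


if __name__ == "__main__":
    # Counts for library baa001-dr as listed in Table S4.
    print(contamination_estimate(20894, 21125))
```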
Table S5.
The Y-chromosome haplogroup support for Ballito Bay A including markers, their position
in hg19 and information about the mutations.
| Hg ISOGG | SNP/marker | RefSNP ID | Position hg19 | Mutation | Obs. allele | No. reads | Allele state |
|---|---|---|---|---|---|---|---|
| A0-T | L1085 | - | 2790726 | T>C | C | 13 | derived |
| A0-T | L1130 | - | 16661010 | T>G | G | 14 | derived |
| A (Investigation) | PK1 | rs373116908 | 22583507 | C>A | A | 7 | derived |
| A1 | V168 | rs191505182 | 17947672 | G>A | A | 2 | derived |
| A1 | V171 | rs2524861 | 4898665 | C>G | G | 5 | derived |
| A1b | P108 | - | 15426248 | C>T | T | 3 | derived |
| A1b | V221 | rs188292317 | 7589303 | G>T | T | 11 | derived |
| A1b1 | L419 | rs111762602 | 15204887 | G>A | A | 5 | derived |
| A1b1b | M32 | - | 21740436 | T>C | C | 5 | derived |
| A1b1b2 | M144 | rs2032619 | 21925500 | T>C | C | 4 | derived |
| A1b1b2 | M190 | rs2032603 | 14968527 | A>G | G | 8 | derived |
| A1b1b2 | P289 | rs372246020 | 8467082 | C>G | G | 2 | derived |
| A1b1b2a | M51 | rs34078768 | 21868863 | G>A | A | 7 | derived |
| A1b1b2b1 | M118 | - | 21763965 | A>T | T | 9 | derived |
| A1b1b2b | M13 | rs3904 | 21722098 | G>C | G | 6 | ancestral |
| A1b1b2b | M202 | rs2032649 | 15029492 | T>G | T | 3 | ancestral |
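The "Allele state" column of Table S5 follows from comparing the observed allele with the two sides of the "Mutation" column (ancestral>derived). A minimal sketch of that comparison is given below; the function name and the "unknown" label for an allele matching neither side are assumptions made for the example.

```python
def allele_state(mutation, observed_allele):
    """Classify an observed allele given a mutation written as 'anc>der'."""
    ancestral, derived = mutation.split(">")
    if observed_allele == derived:
        return "derived"
    if observed_allele == ancestral:
        return "ancestral"
    return "unknown"  # allele matches neither state (assumed label)


if __name__ == "__main__":
    # (marker, mutation, observed allele) taken from rows of Table S5.
    rows = [("L1085", "T>C", "C"), ("M118", "A>T", "T"), ("M13", "G>C", "G")]
    for marker, mutation, observed in rows:
        print(marker, allele_state(mutation, observed))
```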
Table S6.
The Y-chromosome haplogroup support for Ballito Bay B including markers, their position
in hg19 and information about the mutations.
| Hg ISOGG | SNP/marker | RefSNP ID | Position hg19 | Mutation | Obs. allele | No. reads | Allele state |
|---|---|---|---|---|---|---|---|
| A0-T | L1085 | - | 2790726 | T>C | C | 1 | derived |
| A0-T | L1130 | - | 16661010 | T>G | G | 1 | derived |
| A1 | V171 | rs2524861 | 4898665 | C>G | G | 1 | derived |
| A1b1b | M32 | - | 21740436 | T>C | C | 1 | derived |
| A1b1b2 | M144 | rs2032619 | 21925500 | T>C | C | 1 | derived |
| A1b1b2 | M190 | rs2032603 | 14968527 | A>G | G | 1 | derived |
| A1b1b2 | P289 | rs372246020 | 8467082 | C>G | G | 1 | derived |
| A1b1b2b1 | M118 | - | 21763965 | A>T | T | 1 | derived |
Table S7.
Mitochondrial coverage, haplogroup assignment, polymorphisms supporting the assigned
haplogroup, variants associated with assigned haplogroup that either display the ancestral
allele (sites/ancestral) or for which no data are available (sites/no data), and private variants
in the ancient African consensus sequences are shown.
Individual
Mt
coverage
Mt hg
Polymorphisms for called hg (against RSRS)*
sites/
ancestral
sites/ no
data
private
variant
Ballito Bay A
1035
L0d2c1
263A 294A 1048T 1438A 3516A 3756G 3981G
4025T 4038G 4044G 4937C 5442C 6185C 6249A
6644T 6815C 8113A 8152A 8284T 8420G 9042T
9230C 9305A 9347G 9755A 10589A 11854C
11974G 12007A 12121C 12720G 13827G 14007G
15346A 15466A 15766G 15930A 15941C 16243C
16278C
498del
8251A
4204C
4232C
7154G
4312C 4732G
7256C 8655C
8701A
16129G
Ballito Bay B
84
L0d2a1
198T 263A 597T 1048T 1438A 3516A 3756G
3981G 5442C 6185C 8113A 8152A 8251A 9042T
9347G 9755A 10589A 11854C 12007A 12121C
12172G 12234G 12720G 12810G 14221C 15466A
15766G 15930A 15941C 16212G 16243C 16278C
16390A
498del
7154G
G8392A
4025T
4044G
4225G
4232C
5153G
6815C
310C 16187C
Doonside
2.6
L0d2
263A 1048T 1438A 3516A 3756G 5442C 6185C
8251A 9042T 9347G 9755A 11854C 12121C
12720G 15466A 15766G 15930A 15941C 16243C
16278C
Champagne
Castle
186
L0d2a1a
198T 263A 597T 1048T 1438A 3516A 3756G
3981G 5153G 5442C 6185C 6815C 8113A 8152A
8251A 8545A 9042T 9347G 9755A 10589A
11854C 12007A 12121C 12172G 12234G 12720G
12810G 14221C 15466A 15766G 15930A 15941C
16212G 16243C 16278C 16390A
498del
4025T
4044G
4225G
4232C
7154G
8392A
11881T
15077A
16093C
Eland Cave
7597
L3e3b1
146T 150T 152T 247G 750A 769G 825T 1018G
2000T 2352C 2758G 2885T 3594C 4104A 4312C
4655A 5262A 6261A 6524C 7146A 7256C 7521G
8468C 8655C 9554A
10664C 10667C 10688G
10810T 10816G 10819G 10915T 11914G 12248G
13101C 13105A 13197T 13276A 13506C 13650C
13651G 14212C 15301A 15812A 16129G 16187C
16189T 16230A 16265T 16278C 16311T
10373A
15071C
Mfongosi
562
L3e1b2
146T 150T 152T 185A 189G 195T 247G 769G
825T 1018G 2352C 2758G 2885T 3594C 4104A
4312C 6587T 7146A 7256C 7521G 8468C 8577G
8655C 10664C 10688G 10810T 10819G 10915T
11914G 12192A 13105A 13276A 13506C 13650C
14152G 14212C 14926G 15301A 15670C 15942C
16129G 16187C 16189T 16230A 16278C 16311T
16325del
16327T
310C 15115C
16239T
16519T
Newcastle
616
L3e2b1a2
146T 150T 152T 247G 769G 825T 1018G 2352C
2483C 2758G 2885T 3277A 3594C 4104A 4312C
7146A 7256C 7521G 8468C 8655C 9377G
10664C 10688G 10810T 10819G 10915T 11914G
12406A 13105A 13276A 13506C
13650C 14212C
14905A 15301A 16129G 16172C 16187C 16230A
16278C 16311T 16320T
4769A
15721C
16183C
Table S8.
Comparative dataset with the number of SNPs present in ancient individuals from this
study
| | Southern African dataset | Global Extended Dataset | East African Extended Dataset | AGV Extended Dataset | Human Origins dataset | Simons Genome Variant Sites |
|---|---|---|---|---|---|---|
| Full merged dataset | 1,989,349 | 1,984,902 | 527,131 | 1,421,001 | 548,476 | 28,622,172 |
| BBayA | 1,962,247 | 1,957,905 | 526,465 | 1,402,541 | 548,153 | 24,671,536 |
| BBayB1 | 1,268,240 | 1,265,495 | 341,135 | 908,376 | 363,171 | na |
| BBayB2 | 9,854 | 9,831 | 2,564 | 7,040 | 2,849 | na |
| Champagne Castle | 479,547 | 478,510 | 127,778 | 344,026 | 136,456 | na |
| Doonside | 7,261 | 7,249 | 1,859 | 5,272 | 1,987 | na |
| Eland Cave | 1,961,630 | 1,957,282 | 526,467 | 1,402,052 | 548,172 | na |
| Mfongosi | 1,957,298 | 1,952,973 | 525,283 | 1,399,297 | 547,083 | na |
| Newcastle | 1,961,140 | 1,956,800 | 526,280 | 1,401,827 | 548,009 | na |
Table S9.
Admixture into Ju|’hoansi inferred using f3 statistics (ancient individuals are marked in
italic).
Source
Source
Recipient
f3
Std. error
Z-score
BBayA
Northwestern
Europeans
Ju|'hoansi
-0.007914
0.000931
-8.502
BBayA
Tuscans (TSI)
Ju|'hoansi
-0.007569
0.000922
-8.213
BBayA
Neolithic European
Ju|'hoansi
-0.00723
0.001201
-6.02
BBayA
Japanese (JPT)
Ju|'hoansi
-0.005671
0.000968
-5.858
BBayA
Amhara
Ju|'hoansi
-0.005213
0.000782
-6.67
BBayA
Somali
Ju|'hoansi
-0.004683
0.000792
-5.91
BBayA
Oromo
Ju|'hoansi
-0.004112
0.000765
-5.372
BBayA
Maasai (MKK)
Ju|'hoansi
-0.002504
0.000714
-3.507
BBayA
Gumuz
Ju|'hoansi
-0.001467
0.000751
-1.952
BBayA
Sudanese
Ju|'hoansi
-0.001212
0.000726
-1.67
BBayA
Ari Blacksmith
Ju|'hoansi
-0.000863
0.000747
-1.154
BBayA
Luhya (LWK)
Ju|'hoansi