Mol Ecol Resour. 2021;00:1–23. wileyonlinelibrary.com/journal/men | © 2021 John Wiley & Sons Ltd

Received: 28 May 2021 | Revised: 16 September 2021 | Accepted: 8 November 2021
DOI: 10.1111/1755-0998.13557

RESOURCE ARTICLE

f-Statistics estimation and admixture graph construction with Pool-Seq or allele count data using the R package poolfstat

Mathieu Gautier1 | Renaud Vitalis1 | Laurence Flori2 | Arnaud Estoup1

1CBGP, INRAE, CIRAD, IRD, Montpellier SupAgro, Univ Montpellier, Montpellier, France
2SelMet, INRAE, CIRAD, Montpellier SupAgro, Montpellier, France

Correspondence
Mathieu Gautier, INRAE-CBGP (Centre de Biologie pour la Gestion des Populations), 755 avenue du campus Agropolis, CS30016, 34988 Montferrier sur lez cedex, France.
Email: mathieu.gautier@inrae.fr

Funding information
Agence Nationale de la Recherche, Grant/Award Number: ANR-16-CE02-0015-01 and ANR-20-CE02-0018

Abstract
By capturing various patterns of the structuring of genetic variation across populations, f-statistics have proved highly effective for the inference of demographic history. Such statistics are defined as covariances of SNP allele frequency differences among sets of populations without requiring haplotype information and are hence particularly relevant for the analysis of pooled sequencing (Pool-Seq) data. We here propose a reinterpretation of the F (and D) parameters in terms of probability of gene identity and derive from this unified definition unbiased estimators for both Pool-Seq data and standard allele count data obtained from individual genotypes. We implemented these estimators in a new version of the R package poolfstat, which now includes a wide range of inference methods: (i) three-population test of admixture; (ii) four-population test of treeness; (iii) F4-ratio estimation of admixture rates; and (iv) fitting, visualization and (semi-automatic) construction of admixture graphs. A comprehensive evaluation of the methods implemented in poolfstat on both simulated Pool-Seq (with various sequencing coverages and error rates) and allele count data confirmed the accuracy of these approaches, even for the most cost-effective Pool-Seq design involving relatively low sequencing coverages. We further analysed a real Pool-Seq data set made of 14 populations of the invasive species Drosophila suzukii, which allowed refining both the demographic history of native populations and the invasion routes followed by this emblematic pest. Our new package poolfstat provides the community with a user-friendly and efficient all-in-one tool to unravel complex population genetic histories from large-size Pool-Seq or allele count SNP data.

KEYWORDS
admixture graph, demographic inference, Drosophila suzukii, f-statistics, Pool-Seq

1 | INTRODUCTION
In their seminal paper, Reich et al. (2009) introduced a new population genetics framework to decipher the history of Indian human populations. This inference approach relied on a set of so-called f-statistics that are aimed at capturing various patterns of the structuring of genetic diversity across populations based on single nucleotide polymorphisms (SNPs) assayed on a genome-wide scale (see also Patterson et al., 2012). The parameters underlying these statistics, denoted F, are defined as covariances in allele frequency differences among sets of two (F2), three (F3) or four (F4) populations and were demonstrated to be highly informative about population demographic history when modelled as admixture graphs corresponding to population trees possibly including admixture events (Patterson et al., 2012). Hence, formal tests of admixture, called three-population tests, between a target population and two source population surrogates can be derived from estimates of F3. Conversely, via the so-called four-population test, estimating F4 among quadruplets of populations allows testing for their treeness, that is, whether their joint history can be modelled as a simple (unrooted) bifurcating tree. Under certain restrictive assumptions about the underlying phylogeny, accurate estimates of the relative contributions of the ancestral sources of an admixed population may be obtained from ratios of F4 involving some of its related sampled populations. A normalized version of the F4 parameter, called Patterson's D, was also developed by Green et al. (2010) and has become quite popular to characterize introgression in phylogenies of closely related species (Durand et al., 2011). Finally, f-statistics can directly be used to fit admixture graphs (i.e. estimate branch lengths and/or admixture proportions) and to rigorously assess their support (Lipson, 2020; Lipson et al., 2013; Patterson et al., 2012).
A critical advantage of F and D parameters is that they only depend on population allele frequencies and their estimation does not require haplotype information. The non-independence of neighbouring SNPs (linkage disequilibrium, LD) can be accurately accounted for with block-jackknife statistical techniques (Busing et al., 1999; Kunsch, 1989; Patterson et al., 2012; Reich et al., 2009) when computing standard errors of the estimated f-statistics, which are notably required for three- and four-population tests and also to assess the residuals of fitted admixture graphs. These characteristics make the f-statistics-based inference framework particularly attractive for the analysis of Pool-Seq data that result from the massive sequencing of pools of individual DNA and have become quite popular, most particularly in non-model species (Schlötterer et al., 2014). Indeed, although LD information is generally lost in Pool-Seq experiments (but see Long et al., 2011; or Feder et al., 2012), they lead to accurate and cost-effective assessment of allele frequencies across populations on a whole-genome basis (Gautier et al., 2013; Schlötterer et al., 2014). Although the derivation of unbiased estimates of allele frequencies from Pool-Seq data is straightforward, estimation of more elaborate population genetics parameters characterizing the structuring of genetic diversity within or across populations is more challenging (Ferretti et al., 2013; Gautier et al., 2013; Hivert et al., 2018). As the individual origin of the sequencing reads is not identifiable within pools, it is not possible to assess whether reads are identical because they are sequenced copies of the same individual chromosome or because they are copies of different chromosomes carrying the same allele. The resulting additional level of variation thus needs to be accounted for in the estimation of population genetics parameters (Ferretti et al., 2013; Gautier et al., 2013; Hivert et al., 2018).
In the present paper, we first propose a (re)interpretation of the F (and D) parameters in terms of probability of Identity In State (IIS, or AIS for Alike-In-State) of pairs of genes sampled either within the same population (Q1) or between two different populations (Q2), extending results we introduced in some earlier studies (Hivert et al., 2018; Leblois et al., 2018; Collin et al., 2021). This unified definition simplified the derivation of the unbiased estimators for both allele count and Pool-Seq read count data that we implemented in a new version of our R package poolfstat (Hivert et al., 2018) together with methods that rely on the estimated f-statistics for historical and demographic inference. These methods include (i) three-population test of admixture; (ii) four-population test of treeness; (iii) F4-ratio estimation of admixture proportion; and (iv) fitting, visualization and (semi-automatic) construction of admixture graphs. For completeness, we briefly present the underlying methods as implemented in the package. We then carried out a comprehensive evaluation of the whole package on both simulated allele count and Pool-Seq read count data, considering for the latter various sequencing coverages and the presence or absence of sequencing errors. Finally, we illustrate the power and limitations of poolfstat by analysing real Pool-Seq data available from a previous study (Olazcuaga et al., 2020) for 14 populations of the invasive species Drosophila suzukii. This example illustrates how f-statistics-based inference and admixture graph construction may confirm previous inferences and provide new insights into both the history of populations from the native area and the invasion routes followed by an emblematic invasive species. We provide as Supplementary Materials a first vignette (Appendix S2: Vignette V1) designed as a detailed hands-on manual to outline the main functionalities of poolfstat. This vignette may also be viewed as a practical introduction to f-statistics-based demographic inference, although we strongly encourage unfamiliar readers (and also others) to carefully read the original methodological papers (in particular, Lipson et al., 2013; Patterson et al., 2012; Peter, 2016) and the recent review by Lipson (2020). Similarly, we provide a second vignette (Appendix S2: Vignette V2) that details the analysis of the D. suzukii data, making it fully reproducible. Both vignettes further include practical recommendations for an efficient use and interpretation of f-statistics-based methods and outputs.
2 | MATERIAL AND METHODS
2.1 | Definition-, estimation- and f-statistics-based inference methods
2.1.1 | A unified definition of the F2, F3 and F4 parameters and their scaled versions FST, F3* and D in terms of Q1 and Q2 probabilities
Let $p_A$, $p_B$, $p_C$ and $p_D$ be the frequencies of an arbitrarily chosen allele at a random SNP segregating in populations $A$, $B$, $C$ and $D$, respectively. The parameters $F_2$, $F_3$ and $F_4$ were originally defined in terms of covariances of allele frequency differences among different sets of populations as follows (Patterson et al., 2012; Reich et al., 2009):

$$
\begin{aligned}
F_2(A;B) &\equiv \mathbb{E}\left[(p_A-p_B)^2\right]\\
F_3(A;B,C) &\equiv \mathbb{E}\left[(p_A-p_B)(p_A-p_C)\right]=\frac{1}{2}\left(F_2(A;B)+F_2(A;C)-F_2(B;C)\right)\\
F_4(A,B;C,D) &\equiv \mathbb{E}\left[(p_A-p_B)(p_C-p_D)\right]=\frac{1}{2}\left(F_2(A;D)+F_2(B;C)-F_2(A;C)-F_2(B;D)\right)
\end{aligned}
\tag{1}
$$

In total, with $n$ populations, there are $\binom{n}{2}=\frac{1}{2}n(n-1)$ possible $F_2$; $3\binom{n}{3}=\frac{1}{2}n(n-1)(n-2)$ possible $F_3$; and $3\binom{n}{4}=\frac{1}{8}n(n-1)(n-2)(n-3)$ possible $F_4$ (excluding equivalent configurations obtained by permuting populations within pairs; see section 4.1 of Appendix S2: Vignette V1). Due to the linear dependency of all these parameters (Equation 1), the $\frac{1}{8}n(n-1)(n^2-n+2)$ $F$ parameters actually span a vector space of dimension $\frac{1}{2}n(n-1)$, a basis of which may be specified by the set of all the $\binom{n}{2}$ possible $F_2$ or, given a reference population $i$ (randomly chosen among the $n$ ones), by the set of all the $n-1$ $F_2$ of the form $F_2(i;j)$ (with $j\neq i$) and all the $\binom{n-1}{2}$ $F_3$ of the form $F_3(i;j,k)$ (with $j\neq i$, $k\neq i$ and $j\neq k$) (Lipson, 2020; Patterson et al., 2012). It is important to notice that these definitions are invariant to the choice of the reference SNP allele (Patterson et al., 2012), leading to:

$$
\begin{aligned}
F_2(A;B) &\equiv \mathbb{E}\left[(p_A-p_B)^2\right]=\mathbb{E}\left[\left((1-p_A)-(1-p_B)\right)^2\right]\\
&=\frac{1}{2}\left(\mathbb{E}\left[(p_A-p_B)^2\right]+\mathbb{E}\left[\left((1-p_A)-(1-p_B)\right)^2\right]\right)\\
&=\frac{1}{2}\left(\mathbb{E}\left[p_A^2\right]+\mathbb{E}\left[(1-p_A)^2\right]\right)+\frac{1}{2}\left(\mathbb{E}\left[p_B^2\right]+\mathbb{E}\left[(1-p_B)^2\right]\right)-\left(\mathbb{E}\left[p_Ap_B\right]+\mathbb{E}\left[(1-p_A)(1-p_B)\right]\right)\\
&=\frac{1}{2}\left(Q_1^A+Q_1^B\right)-Q_2^{A,B}
\end{aligned}
\tag{2}
$$

where $Q_1^A$ (resp. $Q_1^B$) is the probability of sampling two genes (or alleles) identical in state (IIS) within population $A$ (resp. $B$) and $Q_2^{A,B}$ is the probability of sampling two IIS genes from $A$ and $B$. Similar expressions for $F_3(A;B,C)$ and $F_4(A,B;C,D)$ directly follow from Equations 1 and 2 (see Equation 3 below).

The $Q_1$ and $Q_2$ probabilities, and hence the $F_2$, $F_3$ and $F_4$ parameters, depend on both demographic parameters (i.e. population sizes, divergence times and other historical events) and marker polymorphism (i.e. their mutation rates and ascertainment process). For instance, under a simple pure-drift model with no mutation, if $p_r$ denotes the allele frequency in the ancestral population $R$ of two isolated populations $A$ and $B$, then $1-Q_2^{A,B}=\mathbb{E}\left[p_A(1-p_B)+p_B(1-p_A)\mid p_r\right]=2p_r(1-p_r)$, which is the heterozygosity in $R$. Similarly, $1-Q_1^A=2p_r(1-p_r)e^{-\tau_A}$ (resp., $1-Q_1^B=2p_r(1-p_r)e^{-\tau_B}$), where $\tau_A$ (resp., $\tau_B$) is the divergence time separating $R$ and $A$ (resp., $B$) on a diffusion timescale (i.e. in drift units of $\frac{1}{2N_e}$, where $N_e$ is the effective population size along the branch). As a consequence, the resulting estimates of $F_2$, $F_3$ and $F_4$ strongly depend on the underlying set of genetic markers and may not be compared across different data sets, even from the same populations. Various scaling procedures may actually help in reducing this dependence. Scaling $F_2$ with respect to the across-population heterozygosity $1-Q_2$ leads to the standard definition of pairwise-population $F_{ST}$ in terms of IIS probabilities (Hivert et al., 2018; Rousset, 2007), which is also concordant with the original definition of $F_2$ as the numerator of $F_{ST}$ (Peter, 2016; Reich et al., 2009):

$$
F_{ST}(A;B)\equiv\frac{Q_1-Q_2^{A,B}}{1-Q_2^{A,B}}=\frac{F_2(A;B)}{1-Q_2^{A,B}}
$$

where $Q_1=\frac{1}{2}\left(Q_1^A+Q_1^B\right)$ is the overall probability of sampling two IIS genes within the same population (i.e. averaged over populations $A$ and $B$). Similarly, the scaled versions of the $F_3$ and $F_4$ parameters, named $F_3^*$ and $D$, respectively (Durand et al., 2011; Green et al., 2010; Patterson et al., 2012), can also be expressed as a function of $Q_1$ and $Q_2$ probabilities, leading to the following expressions for all the $F$ (and $D$) parameters:

$$
\begin{aligned}
F_2(A;B) &\equiv \frac{Q_1^A+Q_1^B}{2}-Q_2^{A,B} &\text{and}\quad F_{ST}(A;B) &\equiv \frac{F_2(A;B)}{1-Q_2^{A,B}}=\frac{Q_1^A+Q_1^B-2Q_2^{A,B}}{2\left(1-Q_2^{A,B}\right)}\\
F_3(A;B,C) &\equiv \frac{Q_1^A+Q_2^{B,C}-Q_2^{A,B}-Q_2^{A,C}}{2} &\text{and}\quad F_3^*(A;B,C) &\equiv \frac{F_3(A;B,C)}{1-Q_1^A}=\frac{Q_1^A+Q_2^{B,C}-Q_2^{A,B}-Q_2^{A,C}}{2\left(1-Q_1^A\right)}\\
F_4(A,B;C,D) &\equiv \frac{Q_2^{A,C}+Q_2^{B,D}-Q_2^{A,D}-Q_2^{B,C}}{2} &\text{and}\quad D(A,B;C,D) &\equiv \frac{F_4(A,B;C,D)}{\left(1-Q_2^{A,B}\right)\left(1-Q_2^{C,D}\right)}=\frac{Q_2^{A,C}+Q_2^{B,D}-Q_2^{A,D}-Q_2^{B,C}}{2\left(1-Q_2^{A,B}\right)\left(1-Q_2^{C,D}\right)}
\end{aligned}
\tag{3}
$$

2.1.2 | Unbiased parameter estimators from Pool-Seq read count and standard allele count data

Let $y_{ij}$ be the allele count for an arbitrarily chosen reference allele and $n_{ij}$ the total number of sampled alleles (e.g. twice the number of genotyped individuals for a diploid species) at SNP $i$ in population $j$. For Pool-Seq read count data, the $y_{ij}$s are not observed, and for a given pool $j$, it is assumed that $n_{ij}=n_j$ (the haploid sample size) for each and every SNP. We further define $r_{ij}$ as the read count for the reference allele and $c_{ij}$ as the overall coverage observed at SNP $i$ in population $j$. If allele count data are directly observed, unbiased estimators of the IIS probability within population $j$ ($\hat{Q}_{1,i}^{j}$) and between a pair of populations $j$ and $k$ ($\hat{Q}_{2,i}^{j,k}$) for a given SNP $i$ are:

$$
\hat{Q}_{1,i}^{j}=1-\frac{2y_{ij}\left(n_{ij}-y_{ij}\right)}{n_{ij}\left(n_{ij}-1\right)}\quad\text{and}\quad\hat{Q}_{2,i}^{j,k}=\frac{y_{ij}y_{ik}+\left(n_{ij}-y_{ij}\right)\left(n_{ik}-y_{ik}\right)}{n_{ij}n_{ik}}
\tag{4}
$$

Likewise, for Pool-Seq read counts, unbiased estimators of $Q_{1,i}$ and $Q_{2,i}$ are defined as (Hivert et al., 2018, eqns A37 and A40):

$$
\hat{Q}_{1,i}^{j}=1-\frac{2n_j}{n_j-1}\,\frac{r_{ij}\left(c_{ij}-r_{ij}\right)}{c_{ij}\left(c_{ij}-1\right)}\quad\text{and}\quad\hat{Q}_{2,i}^{j,k}=\frac{r_{ij}r_{ik}+\left(c_{ij}-r_{ij}\right)\left(c_{ik}-r_{ik}\right)}{c_{ij}c_{ik}}
\tag{5}
$$

Genome-wide estimates of all the $F$ (and $D$) parameters (Equation 3) are then simply obtained from averages of these unbiased estimators of IIS probabilities over all the $I$ SNPs. Importantly, for the three scaled parameters ($F_{ST}$, $F_3^*$ and $D$), multi-locus estimators consist of ratios of the numerator and denominator averages and not averages of ratios (Bhatia et al., 2013; Hivert et al., 2018; Patterson et al., 2012; Rousset, 2007; Weir & Goudet, 2017), as is the case, for example, for $\hat{f}_3^*$ and $\hat{D}$. The resulting pairwise $F_{ST}$ estimator is similar to the one described in Rousset (2007) for allele count data and identical to the 'PID' estimator described in Hivert et al. (2018) for Pool-Seq read count data ('Identity' method of the poolfstat function computeFST). Finally, the within-population heterozygosity ($h_j\equiv1-Q_1^j$) of a given population $j$ is simply estimated as one minus the average of the $\hat{Q}_{1,i}^{j}$ over the $I$ SNPs.
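As a minimal sketch of how the estimators of Equations 4 and 5 combine into genome-wide $F_2$ and $F_{ST}$ estimates (Equation 3), the following standard-library Python code may help. It is not the poolfstat implementation; the function names and the list-based data layout are illustrative assumptions.

```python
def q1_counts(y, n):
    """Equation 4: within-population IIS estimator from allele counts."""
    return 1.0 - 2.0 * y * (n - y) / (n * (n - 1))


def q1_pool(r, c, pool_size):
    """Equation 5: same quantity from read counts (pool_size = haploid size n_j)."""
    correction = pool_size / (pool_size - 1.0)
    return 1.0 - 2.0 * correction * r * (c - r) / (c * (c - 1))


def q2(ref_j, tot_j, ref_k, tot_k):
    """Between-population IIS estimator; same form for allele and read counts."""
    return (ref_j * ref_k + (tot_j - ref_j) * (tot_k - ref_k)) / (tot_j * tot_k)


def f2_and_fst(ref_a, tot_a, ref_b, tot_b, pool_sizes=None):
    """Genome-wide F2 and FST for two populations (per-SNP lists of counts).

    pool_sizes=None means allele counts (Equation 4); otherwise a pair of
    haploid pool sizes (n_A, n_B) and the counts are read counts (Equation 5).
    FST is a ratio of averages, not an average of per-SNP ratios.
    """
    if pool_sizes is None:
        q1a = [q1_counts(y, n) for y, n in zip(ref_a, tot_a)]
        q1b = [q1_counts(y, n) for y, n in zip(ref_b, tot_b)]
    else:
        q1a = [q1_pool(r, c, pool_sizes[0]) for r, c in zip(ref_a, tot_a)]
        q1b = [q1_pool(r, c, pool_sizes[1]) for r, c in zip(ref_b, tot_b)]
    q2ab = [q2(a, ta, b, tb) for a, ta, b, tb in zip(ref_a, tot_a, ref_b, tot_b)]

    def mean(values):
        return sum(values) / len(values)

    f2 = 0.5 * (mean(q1a) + mean(q1b)) - mean(q2ab)
    fst = f2 / (1.0 - mean(q2ab))
    return f2, fst
```

The only difference between the two data types is the within-population term, which carries the extra $n_j/(n_j-1)$ factor for pools; the between-population term has the same form in both cases.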
2.1.3 | Block-jackknife estimation of standard errors

Following Reich et al. (2009), standard errors of genome-wide estimates of the different statistics are computed using block-jackknife (Busing et al., 1999; Kunsch, 1989), which consists of dividing the genome into contiguous chunks of a predefined number of SNPs and then removing each block in turn to quantify the variability of the estimator. For a given parameter $F$, if $n_b$ blocks are available and $\hat{F}_b$ is the estimated statistic when removing all SNPs belonging to block $b$, the standard error $\hat{\sigma}_{\hat{F}}$ of the genome-wide estimator $\hat{F}$ is computed from the squared deviations of the $\hat{F}_b$ from their mean $\hat{\mu}_{\hat{F}}=\frac{1}{n_b}\sum_{b=1}^{n_b}\hat{F}_b$, which may be slightly different from the estimator obtained with all the $I$ markers since the latter may include SNPs that are not eligible for block-jackknife sampling (e.g. those at the chromosome or scaffold boundaries). Finally, block-jackknife sampling may also be used to obtain estimates of the error covariance between two estimates $\hat{F}^u$ and $\hat{F}^v$.

For convenience, we here chose to specify the same number of SNPs for each block instead of a block size in genetic distance (Patterson et al., 2012; Reich et al., 2009). We therefore do not resort to a weighted block-jackknife (Busing et al., 1999). In practice, this has little impact provided the distribution of markers is homogeneous along the genome and the amount of missing data is negligible.
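The leave-one-block-out procedure can be sketched as follows for any statistic written as a ratio of sums over SNPs (an unscaled statistic simply uses a denominator of 1 per SNP). This is an illustrative sketch with hypothetical names, assuming a single chromosome; it computes the standard error as $\sqrt{\frac{n_b-1}{n_b}\sum_b(\hat{F}_b-\hat{\mu}_{\hat{F}})^2}$ with equal-sized blocks, and is not the poolfstat code.

```python
import math


def block_jackknife(num, den, block_size):
    """Return (mean of leave-one-block-out estimates, jackknife standard error).

    num, den: per-SNP numerator and denominator terms of the statistic.
    block_size: number of consecutive SNPs per block (unweighted jackknife).
    Trailing SNPs that do not fill a complete block are ignored, which is why
    the returned mean can differ slightly from the all-SNP estimate.
    """
    n_blocks = len(num) // block_size
    if n_blocks < 2:
        raise ValueError("at least two complete blocks are required")
    used = n_blocks * block_size
    total_num = sum(num[:used])
    total_den = sum(den[:used])

    leave_one_out = []
    for b in range(n_blocks):
        lo, hi = b * block_size, (b + 1) * block_size
        # Estimate recomputed after removing every SNP of block b.
        leave_one_out.append(
            (total_num - sum(num[lo:hi])) / (total_den - sum(den[lo:hi]))
        )

    mu = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((x - mu) ** 2 for x in leave_one_out)
    return mu, math.sqrt(variance)
```

With blocks of a single SNP and a unit denominator, this reduces to the usual standard error of the mean, which gives a quick sanity check.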
2.1.4 | Admixture graph fitting

The approach implemented in the new version of poolfstat to fit admixture graphs from f-statistics is directly inspired by the one proposed by Patterson et al. (2012) and implemented in the qpGraph software (see also Lipson, 2020). Briefly, let $\hat{f}$ be the vector (of length $\frac{n_l(n_l-1)}{2}$, where $n_l$ is the number of graph leaves) of the estimated $f_2$ and $f_3$ statistics forming the basis of all the f-statistics (see Section 2.1.1). Similarly, let $g(e;a)=X(a)\times e$ be the vector of their expected values given the vector $e$ of graph edge lengths and an incidence matrix $X(a)$, which summarizes the structure of the graph given the vector $a$ of proportions of all admixture events (for a tree topology, $X(a)$ only consists of 0 or 1). In poolfstat, $X(a)$ is derived using simple operations from another $n_l$ by $n_e$ matrix (where $n_e$ is the number of graph edges) that specifies the weights of each edge along all the paths connecting the graph leaves to the root. It should be noticed that an admixture event is modelled as an instantaneous mixing of two populations $S_1$ and $S_2$ into a population $S$ directly ancestral to a child population $A$, while assuming a null length for the edges connecting $S_1$ and $S_2$ to the rest of the graph (Appendix S2: Vignette V1, section 5.1). We further define $\hat{Q}$ as the $\frac{n_l(n_l-1)}{2}$ by $\frac{n_l(n_l-1)}{2}$ covariance matrix of the basis f-statistics estimated by block-jackknife. Graph fitting then consists of finding the graph parameter values ($\hat{e}$ and $\hat{a}$) that minimize a cost (score of the model) $S(e;a)$ defined in Equation 6, where $\Gamma$ results from the Cholesky decomposition of $\hat{Q}^{-1}$ (i.e. $\hat{Q}^{-1}=\Gamma'\Gamma$). Given admixture rates $a$, $S(e;a)$ is quadratic in the edge lengths $e$ (Patterson et al., 2012), leading us to rely on the Lawson-Hanson non-negative linear least squares algorithm implemented in the R package nnls (Lawson & Hanson, 1995) to estimate the vector $\hat{e}$ that minimizes $S(e;a)$ (subject to the constraint of positive edge lengths). Full minimization of $S(e;a)$ is thus reduced to the identification of the admixture rates $a$, which is performed using the L-BFGS-B algorithm implemented in the optim function of the R package stats (Nocedal & Wright, 1999).
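To make the cost function concrete, the sketch below evaluates $S(e;a)=(\hat{f}-X(a)e)'\hat{Q}^{-1}(\hat{f}-X(a)e)$ for a fixed incidence matrix. The three-leaf star tree, the choice of $A$ as reference population and all names are illustrative assumptions; the non-negative least squares and L-BFGS-B steps described above are deliberately not reimplemented here.

```python
def mat_vec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def score(f_hat, x, e, q_inv):
    """Score of Equation 6 for given edge lengths e and fixed admixture rates.

    f_hat: estimated basis f-statistics; x: incidence matrix X(a);
    q_inv: inverse of the block-jackknife covariance matrix of f_hat.
    """
    residuals = [f - g for f, g in zip(f_hat, mat_vec(x, e))]
    return sum(r * w for r, w in zip(residuals, mat_vec(q_inv, residuals)))


# Star tree with leaves A, B, C joined at one internal node; edges (eA, eB, eC).
# Basis with A as reference population:
#   F2(A;B) = eA + eB,  F2(A;C) = eA + eC,  F3(A;B,C) = eA
X_STAR = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
```

For a tree, every entry of the incidence matrix is 0 or 1, as noted above; admixture events would replace some entries by functions of the admixture proportions.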
Multi-locus estimators of the scaled parameters $F_3^*$ and $D$, computed as ratios of averages (Section 2.1.2):

$$
\hat{f}_3^*(A;B,C)=\frac{\sum_{i=1}^{I}\left(\hat{Q}_{1,i}^{A}+\hat{Q}_{2,i}^{B,C}-\hat{Q}_{2,i}^{A,B}-\hat{Q}_{2,i}^{A,C}\right)}{2\left(I-\sum_{i=1}^{I}\hat{Q}_{1,i}^{A}\right)}\quad\text{and}\quad\hat{D}(A,B;C,D)=\frac{\sum_{i=1}^{I}\left(\hat{Q}_{2,i}^{A,C}+\hat{Q}_{2,i}^{B,D}-\hat{Q}_{2,i}^{A,D}-\hat{Q}_{2,i}^{B,C}\right)}{2\sum_{i=1}^{I}\left(1-\hat{Q}_{2,i}^{A,B}\right)\left(1-\hat{Q}_{2,i}^{C,D}\right)}
$$

Estimator of the within-population heterozygosity (Section 2.1.2):

$$
\hat{h}_j=1-\frac{1}{I}\sum_{i=1}^{I}\hat{Q}_{1,i}^{j}
$$

Block-jackknife standard error and error covariance (Section 2.1.3):

$$
\hat{\sigma}_{\hat{F}}=\sqrt{\frac{n_b-1}{n_b}\sum_{b=1}^{n_b}\left(\hat{F}_b-\hat{\mu}_{\hat{F}}\right)^2}
$$

$$
\widehat{\mathrm{Cov}}\left(\hat{F}^u;\hat{F}^v\right)=\frac{n_b-1}{n_b}\sum_{b=1}^{n_b}\left(\hat{F}^u_b-\hat{\mu}_{\hat{F}^u}\right)\left(\hat{F}^v_b-\hat{\mu}_{\hat{F}^v}\right)
$$

Score minimized for admixture graph fitting (Section 2.1.4):

$$
S(e;a)=\left(\hat{f}-g(e;a)\right)'\hat{Q}^{-1}\left(\hat{f}-g(e;a)\right)=\left(\Gamma\hat{f}-\Gamma X(a)e\right)'\left(\Gamma\hat{f}-\Gamma X(a)e\right)
\tag{6}
$$

2.1.5 | Confidence intervals and model fit assessment

Using the above notations and assuming $\hat{f}\sim N\left(g(e;a),\hat{Q}\right)$, the optimized score $S(\hat{e};\hat{a})$ verifies $S(\hat{e};\hat{a})=-2\log(L)-K$, where $L$ is the likelihood of the fitted graph and the constant $K=n\log(2\pi)+\log(|\hat{Q}|)$. As detailed in Appendix S2: Vignette V1 (section 5.2), a BIC (Bayesian Information Criterion) may then be derived from $S(\hat{e};\hat{a})$ to compare alternative admixture graph topologies, at least when specified with the same number of parameters (Lipson, 2020, section 3.3). Also, the likelihood interpretation of the optimized score allows constructing confidence intervals (CI) for the fitted parameters of a given graph (i.e. elements of the $\hat{e}$ and $\hat{a}$ vectors). Indeed, for a given parameter $\nu$ (either an edge length or an admixture rate), the difference $S_\nu(x)-S(\hat{e};\hat{a})$ (where $S_\nu(x)$ is the score when $\nu=x$ and all the other parameters are set to their best fitted values) can be interpreted as a likelihood-ratio test statistic following a $\chi^2$ distribution with one degree of freedom. Lower and upper boundaries $\nu_{min}$ and $\nu_{max}$ of the 95% CI (such that $S_\nu(x)-S(\hat{e};\hat{a})<3.84$ for all $\nu_{min}<x<\nu_{max}$) may then simply be computed using a bisection method, as implemented in poolfstat.

Finally, a straightforward (but highly informative and recommended) approach to assess the fit of an admixture graph is to evaluate to which extent the f-statistics derived from the fitted admixture graph parameters ($g(\hat{e};\hat{a})$) depart from the estimated ones (Lipson, 2020; Patterson et al., 2012). This can be summarized via a Z-score of residuals computed as $Z=\frac{\hat{f}-\hat{G}}{\hat{\sigma}_{\hat{f}}}$, where $\hat{G}$ is a given fitted f-statistic; $\hat{f}$ is its corresponding estimated value; and $\hat{\sigma}_{\hat{f}}$ the block-jackknife standard error. The presence of outlying Z-scores for at least one f-statistic (e.g. $|Z|>1.96$ at a 95% significance threshold) may suggest poor model fit while also providing insights into the leaves or graph edges that are the most problematic (Lipson, 2020).
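The bisection search for a CI boundary and the residual Z-score can be sketched as below. The function names, the bracketing value `far` and the tolerance are illustrative assumptions, and the sketch assumes that the score excess increases monotonically between the best-fit value and `far`; it is not the poolfstat implementation.

```python
def ci_bound(score_at, best, s_min, far, threshold=3.84, tol=1e-10):
    """Locate by bisection the value between `best` and `far` where the score
    excess S_nu(x) - S_min reaches `threshold` (3.84 for a 95% CI).

    score_at: function returning S_nu(x), other parameters held at best fit.
    best: best-fit value of the parameter; s_min: optimized score.
    far: a value known to lie outside the CI (below or above `best`).
    """
    if score_at(far) - s_min < threshold:
        raise ValueError("`far` lies inside the confidence interval")
    inside, outside = best, far
    while abs(outside - inside) > tol:
        mid = 0.5 * (inside + outside)
        if score_at(mid) - s_min < threshold:
            inside = mid
        else:
            outside = mid
    return 0.5 * (inside + outside)


def z_score(f_estimated, f_fitted, standard_error):
    """Residual Z-score of one f-statistic: (estimate - fitted value) / SE."""
    return (f_estimated - f_fitted) / standard_error
```

Calling `ci_bound` once with `far` below the best-fit value and once with `far` above it gives the lower and upper boundaries. For a quadratic score profile, the interval half-width equals $\sqrt{3.84}\approx1.96$ standard errors, matching the usual normal approximation.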
2.1.6 | Scaling of branch lengths in drift units

Admixture graph fitting results in estimated edge lengths on the same scale as $F_2$, which limits their interpretation, because they depend both on the overall level of SNP polymorphism and on their distance to the root (Patterson et al., 2012). Lipson et al. (2013) proposed an empirical approach to rescale edge lengths on a diffusion timescale using estimates of overall marker heterozygosities within (i.e. $1-Q_1$) or across (i.e. $1-Q_2$) populations. The argument echoes the aforementioned interpretation of pairwise $F_{ST}$ as a scaled $F_2$. If $p_C$ and $p_P$ are the reference allele frequencies in a child node $C$ and its direct parent node $P$ and their divergence time (on a diffusion timescale) is $\tau_{C,P}=\frac{t}{N_e}$ (where $t$ is the branch length in generations), then conditional on $p_C$, $F_2(C;P)=\left(1-e^{-\tau_{C,P}}\right)p_C\left(1-p_C\right)$ and $Q_2^{C,P}=1-2p_C\left(1-p_C\right)$, leading to $F_{ST}^{C,P}=\frac{F_2(C;P)}{1-Q_2^{C,P}}=\frac{1}{2}\left(1-e^{-\tau_{C,P}}\right)$ (i.e. $F_{ST}^{C,P}\simeq\frac{t}{2N_e}$ when $\tau_{C,P}\ll1$). Hence, the estimated graph edge lengths $\hat{F}_2(C;P)=\hat{e}_{P\leftrightarrow C}$ are scaled in units of drift by a factor equal to $\frac{2}{\hat{h}_P}$, where $\hat{h}_P$ is the estimated heterozygosity (i.e. $1-\hat{Q}_1^P$) in the (parent) node $P$. Rearranging Equation 2 and using $Q_2^{C,P}=Q_1^P$ (conditional on $p_P$) shows that $h_P=F_2(C;P)+h_C$, where $h_C=\left(1-Q_1^C\right)$ is the heterozygosity of the child node $C$. Hence, all the node heterozygosities can be inferred iteratively from the leaves to the root along the admixture graph using the leaf heterozygosities (directly estimated from the data) and the fitted edge lengths (Lipson et al., 2013).
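The relation $F_{ST}^{C,P}=\frac{1}{2}\left(1-e^{-\tau_{C,P}}\right)$ and the resulting scaling factor can be illustrated numerically. The sketch below only restates these two expressions from the text, with hypothetical function names; the iterative reconstruction of node heterozygosities along the graph is not covered.

```python
import math


def fst_from_tau(tau):
    """F_ST between a child and its parent node after a drift time tau."""
    return 0.5 * (1.0 - math.exp(-tau))


def tau_from_fst(fst):
    """Exact inverse of fst_from_tau (requires fst < 0.5)."""
    return -math.log(1.0 - 2.0 * fst)


def scale_edge(edge_length_f2, h_parent):
    """Edge length on the F2 scale multiplied by 2 / h_P.

    Since F_ST = F2 / h_P is close to tau / 2 for short branches, the scaled
    value approximates tau when tau is much smaller than 1.
    """
    return 2.0 * edge_length_f2 / h_parent
```

For a short branch, `scale_edge` and `tau_from_fst` agree closely, while for long branches the simple scaling underestimates the drift time relative to the exact inverse.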
2.1.7 | Admixture graph construction

Comprehensive exploration of the space of possible admixture graphs rapidly becomes impossible even for a moderate number of populations. We implemented in poolfstat different heuristics to facilitate admixture graph construction based on a supervised approach (see Appendix S2: Vignette V1 for details). First, the add.leaf function allows exploring all the possible connections of a new population to an existing admixture graph. If $n_e$ is the number of edges of the admixture graph, $n_e+1$ possible graphs connecting the new leaf with a non-admixed edge (i.e. including a new rooting with the candidate leaf as an outgroup) and $\frac{1}{2}n_e(n_e-1)-1$ graphs connecting the new leaf with a two-way admixture event are then tested. Note that an admixture between the two root edges is excluded from the exploration since it results in a singular model. More generally, the different possible graphs are always checked for singularity by empirically verifying that the rank of the model incidence matrix $X(a)$ is equal to the number of edges to fit. The different fitted graphs can then be ranked according to their BIC, the graph with the lowest BIC having the strongest support. The graph.builder function allows a larger exploration of the graph space by successively adding several leaves in a given order to an existing admixture graph. At each step of the process, a heap stores the best resulting graph together with some intermediary suboptimal graphs based on their BIC. After initializing the heap with some graph (or a list of graphs), the add.leaf function is called to evaluate, for each candidate leaf in turn, all its possible connections (with non-admixed or admixed edges) to all the graphs stored in the heap. Among the obtained graphs, the one with the lowest BIC together with those with a BIC within a given $\Delta BIC$ ($\Delta BIC=6$ by default) are included in a newly generated heap. If the resulting heap contains more than a predefined number of graphs $n_g^{max}$ ($n_g^{max}=25$ by default), only the $n_g^{max}$ graphs with the lowest BIC are finally kept in the heap of graphs to be used for the addition of the next leaf. Although helpful, such heuristics should be used cautiously and we recommend only trying to add a small number of populations (i.e. $\leq5$) to an existing graph. One also needs to evaluate different orders of population inclusion (Appendix S2: Vignette V1 and Appendix S3: Vignette V2). More generally, exploring all possible leaf addition orders may become impractical even with moderately complex admixture graphs (e.g. >10 populations and/or more than 3 admixture events), and other approaches are needed, including promising ones in the currently developed Admixtools 2.0 package (https://uqrmaie1.github.io/admixtools/articles/graphs.html).

It is also critical to start these supervised procedures with graphs that are representative of the whole history of the populations under study and not too unbalanced with respect to the candidate leaves. In particular, starting with a small tree of closely related populations that are distantly related to the candidate leaves must be avoided. When prior knowledge about the history of the investigated populations is limited (which is usually the case), Lipson et al. (2013) proposed to start admixture graph construction with a scaffold tree of populations displaying no evidence of admixture. As detailed in Appendix S2: Vignette V1 (section 6.1), we implemented in poolfstat two functions that allow (i) identifying candidate sets of unadmixed populations among all the genotyped ones (find.tree.popset); and (ii) building rooted neighbour-joining trees (rooted.njtree.builder).
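The bookkeeping behind this heuristic can be sketched as follows: the number of candidate graphs tested when adding one leaf, and the pruning of the heap by $\Delta BIC$ and $n_g^{max}$. Function names and the dictionary layout are illustrative assumptions, as is the use of an inclusive comparison for the $\Delta BIC$ window; graph fitting itself is not part of this sketch.

```python
def n_candidate_graphs(n_edges):
    """Numbers of graphs tested when adding one leaf to a graph with n_edges.

    Returns (non-admixed connections, two-way admixture connections), i.e.
    n_e + 1 and n_e * (n_e - 1) / 2 - 1 as given in the text.
    """
    non_admixed = n_edges + 1
    admixed = n_edges * (n_edges - 1) // 2 - 1
    return non_admixed, admixed


def prune_heap(bic_by_graph, delta_bic=6.0, n_max=25):
    """Keep the graphs within delta_bic of the lowest BIC, at most n_max of them.

    bic_by_graph: mapping from a graph identifier to its BIC.
    Returns the retained identifiers sorted by increasing BIC.
    """
    best = min(bic_by_graph.values())
    kept = sorted(
        (bic, name) for name, bic in bic_by_graph.items() if bic - best <= delta_bic
    )
    return [name for _, name in kept[:n_max]]
```

Because the number of admixed connections grows quadratically with the number of edges while each retained graph is expanded again at the next leaf, the two pruning parameters directly control the cost of the exploration.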
2.2 | Overview of the new poolfstat package

Tables 1 and 2 describe the main objects and functions implemented in our new version (v2.0.0) of the R package poolfstat, publicly available from the CRAN repository (https://cran.r-project.org/web/packages/poolfstat/index.html). In-depth analyses of two Pool-Seq and allele count simulated data sets (see below) are described for illustration purposes in the package vignette provided as Appendix S2: Vignette V1. Detailed documentation pages of the different objects and functions can also be directly accessed from an R terminal with poolfstat loaded using the help function (or the ? operator).

The package includes several functions to parse allele count (e.g. genotreemix2countdata) or Pool-Seq (e.g. vcf2pooldata) input data stored in various formats commonly used in population genomics studies (Table 2). These functions clearly distinguish these two different types of data by producing objects of either the so-called countdata (for allele count) or pooldata (for Pool-Seq data) classes (Table 1). This step is critical to further rely on the appropriate unbiased estimators for the F and D parameters. Some functions allow subsequent manipulation of the input data, for instance to only consider some of the populations or to remove SNPs according to various criteria (Table 2).

The three functions computeFST, compute.pairwiseFST and compute.fstats implement the unbiased estimators of the different f-statistics, D-statistics and within-population heterozygosities (based on allele IIS probabilities within and between pairs of populations) together with block-jackknife estimation of their standard errors. Importantly, these three functions automatically select the appropriate estimators given the type of data (either allele or Pool-Seq read counts) according to the input object class (either countdata or pooldata). For the estimation of FST, the computeFST and compute.pairwiseFST functions also implement (by default) estimators based on an Analysis of Variance framework that correspond to those developed by Weir (1996) for allele count data and by Hivert et al. (2018) for Pool-Seq data.

The fit.graph function implements the approach described above to estimate the parameters (i.e. edge lengths and admixture rates) of an admixture graph that is stored in a graph.params object (Table 1). Such objects can be generated with the generate.graph.params function (Table 2) to include the target basis f-statistics and the error covariance matrix (denoted above f and Q, respectively) estimated with compute.fstats (stored in an fstats object) and to specify the topology and the parameters of the admixture graph. Note that the graph.params2symbolic.fstats function allows exploring in detail the properties of an admixture graph specified by a graph.params object by deriving a symbolic representation of all the F2, F3, F4 and the model equations (see above) using functionalities of the Ryacas package for symbolic computation (Andersen & Højsgaard, 2019).

The fit.graph function then produces an object of class fitted.graph that includes the estimated edge lengths (in F2 and also optionally in drift units) and admixture proportions together with (optionally) their 95% CI. For model fit assessment purposes, fitted.graph objects also include the BIC and Z-scores of the residuals of the fitted basis f-statistics. Such a comparison can (and should) be
TABLE 1 Description of the main S4 objects of the poolfstat package

S4 object | Description
countdata | Standard allele count data (i.e. obtained from individual genotyping or sequencing data)
pooldata^a | Pool-Seq read count data
pairwiseFST | Store pairwise FST estimates. This object is generated by the compute.pairwiseFST function. Estimates can be conveniently visualized with the heatmap or plot functions, the latter interfacing the plot_fstats function of poolfstat
fstats | Store F2, pairwise FST, F3, F3*, F4 and D estimation results. This object is generated by the compute.fstats function. Estimates can be conveniently visualized with the heatmap or plot functions, the latter interfacing the plot_fstats function of poolfstat
graph.params | Represent a population tree or an admixture graph and its parameters. This object is generated by the generate.graph.params function. The graph can be visualized with the plot function that interfaces the grViz function from the DiagrammeR package (Iannone, 2020)
fitted.graph | Represent a population tree or an admixture graph and its underlying fitted parameters as obtained from the fit.graph or other fitting functions. The graph can be visualized with the plot function that interfaces the grViz function from the DiagrammeR package (Iannone, 2020)

^a Object already existing in the first poolfstat version.
TABLE 2 Description of the main poolfstat functions
Function | Type | Detail
genotreemix2countdata | data import | Generate a countdata object (Table 1) from allele count data stored in TreeMix format (Pickrell & Pritchard, 2012)
genobaypass2countdata | data import | Generate a countdata object (Table 1) from allele count data stored in BayPass format (Gautier, 2015)
vcf2pooldata^a | data import | Generate a pooldata object (Table 1) from Pool-Seq read count data stored in a vcf file generated by commonly used calling software such as VarScan (Koboldt et al., 2012), bcftools (Li et al., 2009), GATK (McKenna et al., 2010) or FreeBayes (Garrison & Marth, 2012). Parsing of vcf files uses C++ routines inspired by the vcfR package (Knaus & Grünwald, 2017)
popsync2pooldata^b | data import | Generate a pooldata object (Table 1) from Pool-Seq read count data stored in the sync Popoolation format (Kofler et al., 2011)
genobaypass2pooldata^b | data import | Generate a pooldata object (Table 1) from Pool-Seq read count data stored in BayPass format (Gautier, 2015)
countdata.subset | data handling | Subset a countdata object (Table 1) according to various criteria (e.g. population or SNP indexes, marker polymorphism, call rate)
pooldata.subset^b | data handling | Subset a pooldata object (Table 1) according to various criteria (e.g. population or SNP indexes, marker polymorphism, pool coverage)
pooldata2genobaypass^b | data handling | Export Pool-Seq read count data stored in a pooldata object in BayPass format (Gautier, 2015)
computeFST^a | estimation | Estimate genome-wide $F_{ST}$ over all the populations
compute.pairwiseFST^a | estimation | Estimate pairwise-population $F_{ST}$. The function generates a pairwiseFST object (Table 1)
compute.fstats | estimation | Estimate f-statistics ($f_2$, pairwise $F_{ST}$, $f_3$, $F_3^*$ and $f_4$), D-statistics and within-population heterozygosities together with their standard errors via block-jackknife. The function generates an fstats object (Table 1)
compute.f4ratio | estimation | Estimate admixture rates and their standard errors using ratios of $f_4$-statistics from an fstats object (Table 1)
generate.graph.params | graph fitting | Generate a graph.params object synthesizing the structure and properties of an admixture graph (Table 1)
fit.graph | graph fitting | Estimate parameters (edge lengths and admixture rates) of an admixture graph. The function generates a fitted.graph object (Table 1)
compare.fitted.fstats | graph fitting | Compare all the fitted $f_2$, $f_3$ and $f_4$ derived from a fitted graph and stored in a fitted.graph object (Table 1) to the estimated ones (stored in an fstats object). This function allows graph fitting assessment
find.tree.popset | graph building | Find sets of populations that may be used as scaffold tree based on the estimated $f_3$ and $f_4$ stored in an fstats object (Table 1)
rooted.njtree.builder | graph building | Construct and root a Neighbour-Joining tree of presumably unadmixed scaffold populations. This function generates a fitted.graph object (Table 1)
add.leaf | graph building | Evaluate all possible connections of a new leaf (i.e. genotyped population) to an existing graph (stored in either a graph.params or fitted.graph object) with both non-admixed and admixed edges. This function generates a list of fitted.graph objects including other information (e.g. BIC of all the possible graphs, index of the best fitted graph)
graph.builder | graph building | Implement a graph builder heuristic by successively adding leaves (i.e. genotyped populations) to an existing graph or tree (stored in either a graph.params or fitted.graph object) by successively calling the add.leaf function and keeping sub-optimal graphs in a heap at each step. This function generates a list of fitted.graph objects including other information (e.g. BIC of all the possible graphs, index of the best fitted graph)
plot_fstats | visualization | Plot f-statistics with their confidence intervals (can be called directly using plot)
graph.params2symbolic.fstats | other utilities | Derive a symbolic representation of $F_2$, $F_3$ and $F_4$ (and the graph model system of equations) as a function of admixture graph parameters specified in a graph.params object (Table 1) using functionalities of the Ryacas package (Andersen & Højsgaard, 2019)
Note: ^a,^b Objects existing or significantly improved since the first poolfstat version, respectively.
generalized to all the $f_2$, $f_3$ and $f_4$ statistics (not just the ones forming the basis)
using compare.fitted.fstats jointly applied to a fitted.graph and an fstats object. Note that,
for comparison purposes, we developed a function named graph.params2qpGraphFiles to export
admixture graph specifications and their underlying estimated basis f-statistics (both stored in
a graph.params object) into qpGraph format (Patterson et al., 2012), allowing independent
fitting based on the same estimated statistics to be carried out with this latter program.
The poolfstat package includes several functions to assist the construction of admixture
graphs. As mentioned in the previous section, the find.tree.popset and rooted.njtree.builder
functions allow identifying and building rooted tree(s) of scaffold (presumably) unadmixed
populations that may be used as starting graph(s). In addition, the add.leaf and graph.builder
functions implement the above-described heuristic to extend an existing graph (or tree) by
adding one or several leaves (i.e. genotyped populations). These functions generate a list of
fitted.graph objects together with other information that may be helpful for graph comparison
(e.g. BIC of all the graphs or index of the best fitted graph).
Finally, as detailed and exemplified in the Appendix S2: Vignette V1, fitted graphs (stored in
fitted.graph objects) and non-fitted graphs (stored in graph.params objects) can be directly
and conveniently plotted with the plot function, which internally interfaces the grViz function
from the DiagrammeR package (Iannone, 2020).
2.3 | Data analyses
2.3.1 | Simulation study
Genetic data for a total of 150 diploid individuals belonging to six
different populations (n = 25 individuals per population) related by
the demographic scenario depicted in Figure 1 were simulated using
the msprime coalescent simulator (Kelleher et al., 2016) with the
following command:
mspms 300 20 -t 4000 -I 6 50 50 50 50 50 50 0 -r
4000 100000000 -p 8 -es 0.0125 6 0.25 -ej 0.0125 6 2
-ej 0.0125 7 3 -ej 0.025 2 1 -ej 0.05 3 1 -ej 0.075
5 4 -ej 0.1 4 1
Each genome thus consisted of 20 independent chromosomes of $L = 100$ Mb assuming a scaled
chromosome-wide recombination rate of $\rho = 4LN_er = 4{,}000$, as expected for instance in a
population of constant diploid effective size of $N_e = 10^3$ when the per-base and
per-generation recombination rate is $r = 10^{-8}$ (i.e. one cM per Mb). The scaled
chromosome-wide mutation rate was set to $\theta = 4LN_e\mu = 4{,}000$, which is also the
expected nucleotide diversity in a population with $N_e = 10^3$ at mutation–drift equilibrium
when the per-base mutation rate is $\mu = 10^{-8}$. A total of 250 independent genotyping data
sets were simulated and each was subsequently processed to generate 32 different types of data
sets corresponding to:
• Two standard allele count data sets (namely $\mathrm{AC}_{m>1\%}$ and $\mathrm{AC}_{m>5\%}$)
obtained by simple counting of the simulated individual (haploid)
genotypes for each population (i.e. assuming Hardy–Weinberg
equilibrium within population) and removing SNPs with a Minor
Allele Frequency (MAF) computed over all the individuals lower
than 1% (for $\mathrm{AC}_{m>1\%}$ data sets) or 5% (for $\mathrm{AC}_{m>5\%}$ data sets)
• Thirty Pool-Seq data sets (coded as $\mathrm{PS}^{\lambda;\,\varepsilon=\varepsilon}_{m>m_t\%}$)
for (i) five different average sequencing coverages $\lambda$ (equal to 30, 50, 75, 100 or
200 reads; a 30X Pool-Seq coverage representing a lower limit for
population genomics studies); (ii) two different MAF thresholds $m_t$
of 1% and 5% (MAF being estimated on the read counts over all
FIGURE 1 Simulated scenario relating six sampled populations. The population $P6$ derived
from a population $S$, which is admixed between two ancestral sources ($S1$ and $S2$) directly
related to populations $P2$ and $P3$ and contributing to $\alpha = 25\%$ and
$1-\alpha = 75\%$ of its genome, respectively. The branch lengths are in a diffusion timescale,
that is with $\tau \simeq \frac{t}{2N_e}$ under a pure-drift model of divergence (where $t$ is
the number of non-overlapping generations and $N_e$ the average diploid effective population
size along the branch). The names of the internal node populations (not sampled) are
represented in grey
the pools); and (iii) three different sequencing error rates $\varepsilon$ of 0 (no
error), 1‰ and 2.5‰, the two latter being representative of Illumina
sequencers (Glenn, 2011).
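The design described in the two items above can be checked by enumeration. The following is only an illustrative sketch in Python (not part of poolfstat); all variable names are hypothetical, and the values are those listed in the text.

```python
from itertools import product

# Design factors as listed in the text (names are illustrative only).
coverages = [30, 50, 75, 100, 200]     # mean read coverage (lambda)
maf_thresholds = [0.01, 0.05]          # overall MAF thresholds (m_t)
error_rates = [0.0, 0.001, 0.0025]     # sequencing error rates (epsilon)

# Allele count data sets: one type per MAF threshold.
ac_types = [("AC", m) for m in maf_thresholds]

# Pool-Seq data sets: one type per coverage x MAF threshold x error rate.
ps_types = [("PS", lam, m, eps)
            for lam, m, eps in product(coverages, maf_thresholds, error_rates)]

n_types = len(ac_types) + len(ps_types)
n_replicates = 250                     # independent simulated genotyping data sets
n_total = n_types * n_replicates
print(n_types, n_total)                # prints: 32 8000
```

The enumeration recovers the 32 types of data sets and, with 250 replicates, the 8,000 simulated data sets mentioned in the Results.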
Pool-Seq data sets were simulated from the $\mathrm{AC}_{m>1\%}$ allele count data sets
following a procedure similar to that described in Hivert et al. (2018). Briefly, the vector
$r_{ij} = \{r_{ijk}\}$ of read counts at SNP position $i$ in population $j$ for the nucleotide
$k$ (where by convention $k=1$ and $k=2$ for the derived and ancestral alleles, respectively)
was sampled from a Multinomial distribution parameterized as:
where $y_{ij}$ is the derived allele count for SNP $i$ in population $j$ (from the
corresponding $\mathrm{AC}_{m>1\%}$ data set); $n_j$ is the haploid sample size of population
$j$ (here $n_j = 50$ for all $j$); and $c_{ij} = \sum_{k=1}^{4} r_{ijk}$ is the overall read
coverage. To introduce variation in read coverages across pools and SNPs, each $c_{ij}$ was
sampled from a Poisson distribution with parameter $\lambda$ (the target Pool-Seq mean
coverage). When $\varepsilon = 0$, only reads for the derived ($k=1$) or ancestral ($k=2$)
alleles can be generated and the above Multinomial sampling actually reduces to a Binomial
sampling following $r_{ij1} \sim \mathrm{Bin}\left(\frac{y_{ij}}{n_j}; c_{ij}\right)$ (and
$r_{ij2} = c_{ij} - r_{ij1}$). However, when $\varepsilon > 0$, sequencing errors might lead to
non-null read counts for the two other alleles, leading to tri- or tetra-allelic SNPs.
Moreover, sequencing errors may also introduce spurious additional variation by generating
false SNPs at monomorphic sites. To account for the latter, read count vectors $r_{i'j}$ for
all the $2\times10^{9} - I$ monomorphic positions $i'$ (where $I$ is the number of SNPs
observed in the considered $\mathrm{AC}_{m>1\%}$ data set) were sampled as
$r_{i'j} \sim \mathrm{Multin}\left(\left\{1-\frac{\varepsilon}{3}; \frac{\varepsilon}{3}; \frac{\varepsilon}{3}; \frac{\varepsilon}{3}\right\}; c_{i'j}\right)$
with coverages $c_{i'j}$ sampled from a Poisson distribution (as $c_{ij}$ for polymorphic
positions). Yet, as usually done with empirical data sets, we applied a minimum read count
filtering step consisting of disregarding all the alleles with fewer than 2 observed reads
(over all the populations). Only bi-allelic SNPs passing the overall MAF threshold $m_t$ were
finally retained in the final $\mathrm{PS}^{\lambda;\,\varepsilon=\varepsilon}_{m>m_t\%}$ data sets.
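For the error-free case ($\varepsilon = 0$), the sampling scheme described above (Poisson coverage, then Binomial sampling of derived reads) can be sketched as follows. This is a minimal Python illustration using only the standard library; it is not the authors' C++/Rcpp implementation, the function names are hypothetical, and the Poisson and Binomial samplers are simple textbook choices made for self-containment.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_binomial(n, p, rng):
    """Draw one Binomial(n, p) variate as a sum of n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)

def simulate_pool_reads(y, n, lam, rng):
    """Return (derived, ancestral) read counts for one SNP in one pool (epsilon = 0)."""
    coverage = sample_poisson(lam, rng)               # c_ij ~ Poisson(lambda)
    derived = sample_binomial(coverage, y / n, rng)   # r_ij1 ~ Bin(y_ij / n_j; c_ij)
    return derived, coverage - derived                # r_ij2 = c_ij - r_ij1

# Toy usage: y_ij = 10 derived alleles among n_j = 50 haploid genomes, lambda = 30.
rng = random.Random(1)
reads = [simulate_pool_reads(y=10, n=50, lam=30, rng=rng) for _ in range(2000)]
total_reads = sum(d + a for d, a in reads)
mean_cov = total_reads / len(reads)
derived_freq = sum(d for d, _ in reads) / total_reads
print(round(mean_cov, 1), round(derived_freq, 3))
```

Over many draws, the mean coverage should approach $\lambda$ and the derived read frequency should approach $y_{ij}/n_j$, which is the behaviour the text relies on. The case $\varepsilon > 0$ is deliberately left out of this sketch.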
Analyses of the simulated data were carried out with poolfstat (Appendix S2: Vignette V1).
Briefly, each mspms simulated data set was converted into an $\mathrm{AC}_{m>1\%}$ data set in
TreeMix format (Pickrell & Pritchard, 2012), further imported into a countdata object with
genotreemix2countdata (Tables 1 and 2) and used to generate each corresponding
$\mathrm{AC}_{m>5\%}$ data set using countdata.subset. To improve computational efficiency, the
different $\mathrm{PS}^{\lambda;\,\varepsilon=\varepsilon}_{m>m_t\%}$ Pool-Seq data sets were
generated from the $\mathrm{AC}_{m>1\%}$ countdata objects in the form of pooldata objects
using custom functions (not included in the package) coded in C++ and integrated within R
using Rcpp (Eddelbuettel, 2013). In addition, to evaluate the impact of the (bad) practice
consisting of analysing Pool-Seq data as if they were allele count data (i.e. overlooking the
sampling of reads from individual genes of the pool), we also created 'fake' countdata objects
from the different pooldata objects. We then used default options (unless otherwise stated) of
(i) computeFST to estimate genome-wide $F_{ST}$ over all the populations; (ii) compute.fstats
to estimate all the f- and D-statistics; (iii) compute.f4ratio to estimate admixture
proportions; and (iv) fit.graph to estimate the admixture graph parameters (Table 2). As the
number of SNPs was variable across the different simulated data sets, we adjusted the number
of successive SNPs defining a block for block-jackknife estimation of standard errors by
dividing the total number of available SNPs by 500. This resulted on average in 490 blocks of
4.1 Mb over the genome for all the analysed data sets (the simulated genomes consisting of 20
chromosomes). Note that the parameter estimates were always taken as the block-jackknife mean
values rather than estimates over all SNPs (i.e. including those in the chromosome ends). In
practice, the differences between the two are negligible (e.g. Appendix S2: Vignette V1).
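The reason why a target of 500 blocks yields slightly fewer blocks in practice can be illustrated with a short sketch. It assumes (this is an assumption of the example, consistent with the remark about chromosome ends but not a description of the poolfstat internals) that blocks are made of consecutive SNPs within a chromosome and that the incomplete block at each chromosome end is dropped. The per-chromosome SNP counts are made-up toy values.

```python
def count_jackknife_blocks(snps_per_chromosome, n_blocks_target=500):
    """Return (block size in SNPs, number of complete blocks, SNPs kept in blocks)."""
    total_snps = sum(snps_per_chromosome)
    block_size = total_snps // n_blocks_target
    # Blocks never span two chromosomes: leftover SNPs at each end are not used.
    n_blocks = sum(n // block_size for n in snps_per_chromosome)
    return block_size, n_blocks, n_blocks * block_size

# Hypothetical SNP counts for 20 chromosomes (toy values, not simulation output).
toy_counts = [23000 + 137 * i for i in range(20)]
block_size, n_blocks, used_snps = count_jackknife_blocks(toy_counts)
print(block_size, n_blocks, used_snps, sum(toy_counts))
```

With 20 chromosomes, up to 20 incomplete blocks are lost, so the number of blocks actually used falls a little below the target of 500.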
For validation purposes, we also analysed the 250 $\mathrm{AC}_{m>1\%}$ data sets with
programs from the AdmixTools suite (Patterson et al., 2012) after conversion to the
appropriate input format using custom awk scripts. More specifically, we ran qpfstats (v. 200)
to estimate the 15 basis f-statistics, that is, taking $P1$ as the reference population, the
five $f_2$ of the form (P1, Px) and the ten $f_3$ of the form (P1; Px, Py) (where
$x = 2, \ldots, 6$ and $y = 3, \ldots, 6$ with $y > x$) and their corresponding error
covariance matrix. Default options were considered except for the disabling of the scaling of
estimated values (using option -l 1) to facilitate their comparison with poolfstat estimates.
We also ran with default options qp3Pop (v. 650) to estimate $f_3^*$ for all the 60 possible
triplet configurations and qpDstat (v. 970) to estimate the D-statistics for all the 45
possible quadruplet configurations together with their associated Z-scores. By default, these
three programs define blocks of 5 cM to implement the (weighted) block-jackknife procedure. As
we here converted the simulated SNP positions from Mb to cM assuming one cM per Mb (see
above), the sizes of the 400 blocks were thus about 20% larger than for poolfstat analyses.
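The numbers of statistics quoted above (five $f_2$ and ten $f_3$ in the basis, 60 triplets and 45 quadruplets) follow from simple combinatorics on six populations. The sketch below, written in plain Python with illustrative variable names, enumerates them.

```python
from itertools import combinations

pops = ["P1", "P2", "P3", "P4", "P5", "P6"]
ref, others = pops[0], pops[1:]

# Basis f-statistics with P1 as the reference population.
basis_f2 = [(ref, x) for x in others]                          # (P1, Px)
basis_f3 = [(ref, x, y) for x, y in combinations(others, 2)]   # (P1; Px, Py), y > x

# f3 triplets: one target population and an unordered pair of source surrogates.
triplets = [(t, a, b)
            for t in pops
            for a, b in combinations([p for p in pops if p != t], 2)]

# f4/D quadruplets: three distinct pairings for each set of four populations.
quadruplets = []
for a, b, c, d in combinations(pops, 4):
    quadruplets += [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

print(len(basis_f2), len(basis_f3), len(triplets), len(quadruplets))
# prints: 5 10 60 45
```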
$$r_{ij} \sim \mathrm{Multin}\left(\left\{\frac{y_{ij}}{n_j}\left(1-\frac{\varepsilon}{3}\right)+\left(1-\frac{y_{ij}}{n_j}\right)\frac{\varepsilon}{3};\ \left(1-\frac{y_{ij}}{n_j}\right)\left(1-\frac{\varepsilon}{3}\right)+\frac{y_{ij}}{n_j}\frac{\varepsilon}{3};\ \frac{\varepsilon}{3};\ \frac{\varepsilon}{3}\right\};\ c_{ij}\right)$$
2.3.2 | Analysis of a real Drosophila suzukii Pool-Seq data
The spotted wing drosophila, D. suzukii, represents an attractive model to study biological
invasion and hence recent demographic history. Native to South East Asia, this pest species
was first observed outside its native range in Hawaii in 1980, and later rapidly invaded
America and Europe simultaneously between 2008 and 2013 (Fraimout et al., 2017). Using DNA
sequences and microsatellite markers, Adrion et al. (2014) and Fraimout et al. (2017)
deciphered the routes taken by D. suzukii during its worldwide invasion. Both studies showed
that American and European populations globally represent separate invasion routes with
different native source populations. Olazcuaga et al. (2020) recently generated Pool-Seq
genomic data from 22 worldwide population samples to detect genetic variants associated with
the historical status (i.e. invasive versus native) of the sampled populations. We here
focused our illustration on 14 Pool-Seq data sets from this study (with 50–100 diploid
individuals per pool) for populations representative of the Asian native area (six
populations), Hawaii (one population) and the invaded continental America (seven populations),
where the species was first observed in 2008 on the Western coast of the USA (around
Watsonville, CA; Figure 2a). Besides native populations, we have restricted our analysis to
the American continent because the invasion of this area is characterized by multiple
admixture events between different source populations (Fraimout et al., 2017), which makes it
an appealing situation to evaluate the power and the limitations of poolfstat analyses.
Moreover, 13 of our 14 population samples consist of individuals originating from the same
sites (albeit sometimes collected at different dates for some pools; Table 2 in Appendix S3:
Vignette V2) as those genotyped at 25 microsatellite markers and analysed with an Approximate
Bayesian Computation Random Forest (ABC-RF) approach to infer the routes of invasion on a
worldwide scale by Fraimout et al. (2017).
To allow for complete reproduction (and exploration) of our analyses, all the command lines
used to analyse the D. suzukii Pool-Seq data set are described in the Appendix S3: Vignette
V2. Briefly, we combined the 14 (bam) files, obtained by Olazcuaga et al. (2020) after
aligning the 14 Pool-Seq data sets onto the latest D. suzukii assembly (Paris et al., 2020),
into an mpileup file using SAMtools 1.9 with options -q 20 -Q 20 (Li et al., 2009). Variant
calling was then performed using VarScan mpileup2snp v2.3.4 (Koboldt et al., 2012) run with
options --min-coverage 10 --min-avg-qual 25 --min-var-freq 0.005 --p-value 0.5 (i.e. with very
loose criteria). After discarding positions mapping to non-autosomal contigs, the resulting
vcf file was parsed with the vcf2pooldata function of poolfstat with default options except
for (i) the overall MAF threshold (computed from read counts) that was set to 5%; and (ii) the
minimal read coverage for each pool that was set to 50. The resulting pooldata object was
further filtered with pooldata.subset to discard (i) all positions with a coverage higher than
the 99th coverage percentile within at least one pool; and (ii) all SNPs with MAF < 5% over
all the populations from the native area to favour ancestral SNPs. The final data set then
consisted of read counts for 1,588,569 bi-allelic SNPs with a median read coverage varying
from 64 (US-Sok) to 95 (CN-Bei and US-Haw) among the 14 pools (Table 2 in the Appendix S3:
Vignette V2). We defined blocks of 10,000 consecutive SNPs for block-jackknife estimation of
standard errors, leading to a total of 145 blocks of 698 kb on average (varying from 414 kb to
2.03 Mb). Hence, most analyses actually relied on 1,450,000 SNPs that mapped to the 15 largest
contigs of the assembly (totalling 116 Mb). In other words, SNPs mapping to the smallest (and
less reliable) contigs were discarded in addition to the few mapping to the ends of the 15
retained contigs.
3 | RESULTS
3.1 | Evaluation of poolfstat on simulated data
Historical and demographic inference based on f- and D-statistics has
already been extensively evaluated in previous studies (Lipson et al.,
2013; Patterson et al., 2012; Peter, 2016). Therefore, the purpose
of our simulation study was essentially threefold: (i) to validate the
estimators implemented in poolfstat by comparing, for allele
count data, with those obtained with the reference AdmixTools
suite (Patterson et al., 2012); (ii) to evaluate the performance of the
estimators for Pool-Seq data as a function of read coverage and
sequencing errors; and (iii) to provide example data sets with known
ground truth for illustration purposes.
3.1.1 | Description of the simulated data sets
We simulated 250 genetic data sets for six populations (named $P1$ to $P6$) each consisting of
25 diploid individuals and that were historically related by the admixture graph represented
in Figure 1 (see 2.3.1). Each of these data sets was further used as a template to generate
two allele count data sets (applying a 1% or 5% threshold on the overall MAF for
$\mathrm{AC}_{m>1\%}$ and $\mathrm{AC}_{m>5\%}$ data sets, respectively) and to simulate 30
Pool-Seq data sets with five different mean read coverages
($\lambda \in \{30; 50; 75; 100; 200\}$); three sequencing error rates
($\varepsilon \in \{0; 10^{-3}; 2.5\times10^{-3}\}$) and two MAF (computed over all read
counts) thresholds (referred to as $\mathrm{PS}^{\lambda;\,\varepsilon=\varepsilon}_{m>1\%}$ and
$\mathrm{PS}^{\lambda;\,\varepsilon=\varepsilon}_{m>5\%}$ for 1% and 5% MAF thresholds,
respectively). This thus leads to a total of 8,000 simulated data sets. The average number of
available SNPs and false SNPs (for $\mathrm{PS}^{\lambda;\,\varepsilon=1\text{‰}}_{m>m_t\%}$ and
$\mathrm{PS}^{\lambda;\,\varepsilon=2.5\text{‰}}_{m>m_t\%}$ data sets) is given in Appendix S1:
Table S1 for each of the 32 different types of data sets and represented as a function of the
mean coverages $\lambda$ and MAF thresholds in Appendix S1: Figure S1.
Overall, 471,919 SNPs and 240,369 SNPs were available on average for allele count data sets at
the 1% ($\mathrm{AC}_{m>1\%}$) and 5% ($\mathrm{AC}_{m>5\%}$) MAF thresholds, respectively,
consistent with the L-shaped distribution of allele frequencies (Appendix S1: Figure S2A). As
expected from binomial sampling (Appendix S1: Figure S2B), for Pool-Seq data sets generated
with no sequencing error, the number of SNPs always remained lower than in the
$\mathrm{AC}_{m>1\%}$ data sets at the 1% MAF threshold, although increasing with coverage
(from 13.8% fewer SNPs for $\mathrm{PS}^{30;\,\varepsilon=0}_{m>1\%}$ to 2.01% fewer for
$\mathrm{PS}^{200;\,\varepsilon=0}_{m>1\%}$ data sets; see Appendix S1: Table S1 legend for
details). Conversely, at the 5% MAF threshold, the number of SNPs was slightly higher than in
the $\mathrm{AC}_{m>5\%}$ data sets (from 2.58% for $\mathrm{PS}^{30;\,\varepsilon=0}_{m>5\%}$
to 1.51% for $\mathrm{PS}^{200;\,\varepsilon=0}_{m>5\%}$), which is related to (i) the shape of
the allele frequency spectrum (stochastic variation in read sampling leading to the inclusion
of more SNPs with $0.01 < \mathrm{MAF} < 0.05$ than the exclusion of SNPs with
$\mathrm{MAF} > 0.05$ from the simulated genotyping data, because the former are more
numerous); and (ii) variation in the simulated read coverages that explains the decreasing
trend with $\lambda$.
With sequencing errors, our filtering steps proved efficient to remove false SNPs except at
the 1% MAF threshold when $\varepsilon = 2.5$‰ or when $\varepsilon = 1$‰ at the lowest
coverages ($\lambda = 30$ and $\lambda = 50$). These configurations displayed substantial to
very high proportions of false SNPs (up to 93.8% for
$\mathrm{PS}^{50;\,\varepsilon=2.5\text{‰}}_{m>1\%}$) although decreasing with coverage
(Appendix S1: Figure S1B). A 5% MAF threshold always resulted in
the complete removal of all the false SNPs for all the investigated
scenarios (Appendix S1: Table S1). Note that for the highest coverages,
sequencing errors lead to a relative reduction of the number
FIGURE 2 Historical and demographic inferences about native and invasive Drosophila suzukii
populations from Pool-Seq data based on f-statistics. (a) Geographic location of the 14
population samples (Olazcuaga et al., 2020). Names are coloured according to their area of
origin. The (invasive) Hawaiian population, which was considered as intermediate between the
Asian native and the continental America invasive area, was first observed in 1980, that is
ca. 300 generations before the invasion of the American continent assuming 10 generations per
year. Solid points indicate the 13 population sampling sites in common with Fraimout et al.
(2017). (b) Best fitting admixture graph connecting five populations of the native areas and
the Hawaiian population with two inferred admixture events. (c) Best fitting admixture graph
connecting three invasive populations from continental America with populations from the
native area (and Hawaii). In (b) and (c), estimates of branch lengths ($\times 10^3$, in drift
units of $\frac{t}{2N_e}$) and admixture rates (and their 95% CI in brackets) are indicated
next to the corresponding edges. The worst fitted f-statistic is written in red for each of
the two graphs.
Panel (a) sample labels: CN-Bei, CN-Lia, CN-Nin, CN-Shi, JP-Sap, JP-Tok, BR-Pal, US-Col, US-Haw, US-Nca, US-Sdi, US-Sok, US-Wat, US-Wis. Legend: Native; Hawaiian (>1980); American (>2008).
of SNPs due to the generation of spurious tri- or tetra-allelic SNPs
from the simulated bi-allelic SNPs (compare, e.g.,
$\mathrm{PS}^{200;\,\varepsilon=2.5\text{‰}}_{m>5\%}$ and
$\mathrm{PS}^{100;\,\varepsilon=2.5\text{‰}}_{m>5\%}$ in Appendix S1: Figure S1A).
3.1.2 | Comparison of poolfstat and AdmixTools estimates for allele count data
We first analysed the 250 simulated $\mathrm{AC}_{m>1\%}$ data sets to estimate with both
poolfstat and AdmixTools programs (i) the 15 basis f-statistics (taking $P1$ as the reference
population) consisting of five $f_2$ and ten $f_3$ (Figure 3a) and their corresponding error
covariance matrix (Figure 3b); (ii) the 60 $f_3^*$ (Figure 3c) and their associated Z-scores
(Figure 3d); and (iii) the 45 D-statistics (Figure 3e) and their associated Z-scores
(Figure 3f). The estimates were all found in almost perfect agreement between the two
implementations, with Mean Absolute Differences (MAD) negligible when compared to the range of
variation of the underlying values. For f- and D-statistics, slight differences were mostly
due to the fact that the plotted poolfstat estimates corresponded to block-jackknife means
(i.e. excluding SNPs outside blocks, such as those from chromosome ends). Using poolfstat
estimates based on all the SNPs indeed resulted in almost null MAD (MAD' in Figure 3a, c and
e), up to rounding errors due to lower decimal precision in the printed output of the
AdmixTools programs. Note that the differences in block-jackknife implementation between the
two programs (Material and Methods) had very minor impact on the estimation of error variance
and covariance of the estimates (Figure 3b). Accordingly, the MAD computed on Z-scores
remained very small (although inflated for higher values) and Z-score based decisions for the
underlying three-population admixture (Figure 3d) or four-population treeness tests
(Figure 3f) were highly consistent (with a proportion $\beta = 97.7\%$ and $\beta = 98.0\%$,
respectively, of Z-scores significant with the two programs among the ones significant with at
least one program).
3.1.3 | Performance of $f_3$- and $f_3^*$-based tests of admixture and $f_4$- and D-based tests of treeness for allele count and Pool-Seq data
We ran the compute.fstats function on all the simulated allele
count and Pool-Seq data sets to estimate all f- and D-statistics. To
further evaluate the impact of (improperly) treating read counts as
allele counts when analysing Pool-Seq data, we also analysed the
simulated Pool-Seq data sets (focusing only on
$\mathrm{PS}^{\lambda;\,\varepsilon=0}_{m>m_t\%}$ data sets,
i.e. simulated without sequencing error) as if they were allele count
data. Overall, 42 different configurations were thus investigated
each originating from the 250 allele count data sets simulated under
the demographic scenario represented in Figure 1, leading to a total
of 42 × 250 = 10,500 analyses.
Table 3 and Appendix S1: Table S2 provide the estimated power (true positive rate, TPR) and
false positive rate (FPR) of the $f_3$- and $f_3^*$-based tests of admixture for each
configuration. As P6 was the only admixed population, each TPR was estimated as the proportion
of $f_3$ ($f_3^*$, respectively) with an associated Z-score < −1.65 (95% significance
threshold) for the (P6;P2,P3) population triplet (i.e. among 250 estimates). Conversely, the
FPR was computed as the proportion of $f_3$ ($f_3^*$, respectively) with an associated
Z-score < −1.65 among all the 50 population triplets that do not involve P6 as a target (i.e.
among 12,500 = 250 × 50 estimates). Consistent with Patterson et al. (2012), the performance
of $f_3$- and $f_3^*$-based tests of admixture was virtually the same for all the
configurations. When the same MAF threshold was applied, the performance of Pool-Seq data
generated with no sequencing error was very close to that obtained with allele count data
although the power tended to slightly decrease with decreasing sequencing coverage.
Interestingly, increasing the MAF threshold from 1% to 5% increased the power by more than 10%
and in all cases, no false positive signal of admixture was detected. Surprisingly, sequencing
errors in Pool-Seq data also tended to increase the power from a negligible amount (less than
or close to 1%) at the 5% MAF threshold to a quite substantial amount at the 1% MAF threshold
(decreasing with coverage and increasing with sequencing error rate). At the extreme, a power
of 100% was even observed when $\lambda \leq 50$ and $\varepsilon \geq 1$‰. This trend was
actually directly related to the proportion of false SNPs introduced by sequencing errors
(Appendix S1: Figure S1B) that resulted in a downward bias of $f_3$ and $f_3^*$ estimates,
although the underlying tests remained robust as all the estimated FPR were null except for
$\mathrm{PS}^{50;\,\varepsilon=2.5\text{‰}}_{m>1\%}$ data (FPR = 6.47%), which displayed the
highest proportion of false SNPs (>90%, Appendix S1: Figure S1B). However, this apparent
robustness of the three-population tests to false SNPs should be interpreted cautiously since
it may rather result from the moderate to high expected $f_3$ and $f_3^*$ values in our
simulated scenario for the population triplets that do not involve P6 as a target. Overall,
applying a 5% MAF threshold on Pool-Seq data (even with $\varepsilon = 2.5$‰) to remove false
SNPs (see above) allowed recovering performances similar to those obtained when analysing data
sets with no sequencing error. Finally, it is worth stressing that analysing Pool-Seq data as
allele counts, whatever the coverage or MAF threshold considered, led to no power in detecting
admixture events with $f_3$- or $f_3^*$-based tests due to a strong upward estimation bias.
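The way TPR and FPR are derived from Z-scores for the three-population test can be sketched as below. This is an illustration only: the function name is hypothetical and the Z-scores are made-up toy values, not results from the study. Only the one-sided threshold (−1.65) and the counts of tests (250 replicates, 50 triplets not targeting P6) come from the text.

```python
def rejection_rate(z_scores, threshold=-1.65):
    """Proportion of Z-scores below the one-sided threshold (significant admixture test)."""
    return sum(z < threshold for z in z_scores) / len(z_scores)

# Made-up toy Z-scores, for illustration only.
z_admixed_triplet = [-3.2, -1.9, -0.4, -2.5]    # (P6; P2, P3): the truly admixed target
z_other_triplets = [4.1, 12.0, 0.3, 7.7, -0.2]  # triplets that do not have P6 as target

tpr = rejection_rate(z_admixed_triplet)   # power: significant tests where admixture is real
fpr = rejection_rate(z_other_triplets)    # false positives: significant tests elsewhere

# Number of estimates entering the FPR for one configuration in the study design.
n_fpr_tests = 250 * 50
print(tpr, fpr, n_fpr_tests)
```

In the toy input, three of the four Z-scores for the admixed triplet fall below −1.65 and none of the others do.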
Table 4 and Appendix S1: Table S3 similarly provide the estimated power (TPR) and FPR of the
$f_4$- and D-based tests of treeness for the 42 configurations investigated in the simulation
study. Given the simulated scenario, eight of the 45 different population quadruplets (namely
(P1,P2;P3,P4); (P1,P2;P3,P5); (P1,P2;P4,P5); (P1,P3;P4,P5); (P1,P6;P4,P5); (P2,P3;P4,P5);
(P2,P6;P4,P5); and (P3,P6;P4,P5)) have a null expected $F_4$ (and D) value. Note that this may
easily be shown with the symbolic calculus derivation implemented in
graph.params2symbolic.fstats (Table 2). For each configuration, the TPR of the treeness test
was then estimated as the proportion of $f_4$ (D, respectively) with an associated absolute
Z-score < 1.96 (95% significance threshold) for these eight population quadruplets over all
the 250 different underlying analyses (i.e. among 2,000 = 250 × 8 estimates). Conversely, the
FPR was estimated as the proportion of $f_4$ (D, respectively) with an associated absolute
Z-score < 1.96 among all
the 37 remaining population quadruplets (i.e. among 9,250 = 250 × 37 estimates). The power of
both the $f_4$- and D-based tests was remarkably consistent across all the different
configurations. In addition, the tests were all found almost perfectly calibrated since the
estimated power was close to 95%, the probability of not rejecting a true null hypothesis of
treeness at the chosen 95% significance threshold for Z-scores. Likewise, all FPR remained low
($\leq 0.15\%$), although increasing with MAF thresholds (more than twice as high for a given
type of data when increasing the MAF threshold from 1% to 5%). Overall, sequencing errors and
coverage had no impact on the performance of the $f_4$- and D-based tests of
FIGURE 3 Comparison of poolfstat and AdmixTools estimates across 250 simulated allele count
data sets ($\mathrm{AC}_{m>1\%}$). (a) All estimates of the 15 basis f-statistics taking $P1$
as the reference population and corresponding to the 5 $f_2$ of the form ($P1$, $Px$) and the
10 $f_3$ of the form ($P1$; $Px$, $Py$) (with $x = 2, \ldots, 6$; $y = 3, \ldots, 6$ and
$y > x$). (b) All block-jackknife estimates of the covariance matrix $Q$ of the 15 basis
f-statistics (15 error variances and 105 error covariances). (c) All estimates of the 60
$f_3^*$ (scaled $f_3$) and (d) their associated Z-scores. (e) All estimates of the 45
D-statistics (scaled $f_4$) and (f) their associated Z-scores. For each comparison, the mean
absolute difference (MAD) between the parameter estimates of the two programs is given in the
upper left corner of the plots. In (a), (c) and (e), poolfstat estimates correspond to
block-jackknife means (i.e. they only include SNPs eligible for block-jackknife). The given
MAD' value is the MAD between AdmixTools and poolfstat estimates that include all SNPs (see
documentation for the compute.fstats function). In (d), a consistency score $\beta$ is also
given and was computed as the proportion of Z-scores < −1.65 (i.e. significant
three-population test of admixture at a 5% threshold) with both programs among the $n = 216$
ones significant in at least one of the two programs. Similarly, in (f), the given consistency
score $\beta$ is computed as the proportion of absolute Z-scores < 1.96 (i.e. passing the
four-population treeness test at a 5% threshold) with both programs among the $n = 1{,}912$
ones with an absolute Z-score < 1.96 in at least one of the two programs
14
|
G AUTIER ET Al.
treeness. As expected, analysing read counts as allele count data did
not affect the performance of these tests (see Discussion).
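The calibration check described for the treeness test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`passes_treeness`, `rate`) and the Z-score values are hypothetical and are not part of poolfstat or of the simulation study; only the |Z| < 1.96 decision rule comes from the text.

```python
def passes_treeness(z, threshold=1.96):
    """Four-population test: treeness is not rejected when |Z| < threshold."""
    return abs(z) < threshold


def rate(z_scores, decision):
    """Proportion (in %) of Z-scores for which `decision` returns True."""
    z_scores = list(z_scores)
    return 100.0 * sum(decision(z) for z in z_scores) / len(z_scores)


# Hypothetical Z-scores (illustrative values only, not taken from the study).
z_null = [0.3, -1.2, 2.4, 0.9, -0.1, 1.5, -2.1, 0.7]  # quadruplets with expected F4 = 0
z_non_null = [5.8, -7.3, 1.1, 12.0, -4.4]             # quadruplets with expected F4 != 0

tpr = rate(z_null, passes_treeness)      # tree-like quadruplets passing the test
fpr = rate(z_non_null, passes_treeness)  # non-tree-like quadruplets passing the test
print(f"TPR = {tpr:.1f}%, FPR = {fpr:.1f}%")
```

With a well-calibrated test and many replicates, the first proportion is expected to approach 95% at this threshold, which is the pattern reported above.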
3.1.4 | Precision of the F4-ratio-based estimation of the admixture rate α
Given the simulated scenario, two different ratios of $f_4$ estimates could be used to estimate the admixture proportion $\alpha = 0.25$ (Figure 1), namely $\hat{\alpha}_1 = \frac{f_4(P1,P4;P3,P6)}{f_4(P1,P4;P2,P3)}$ and $\hat{\alpha}_2 = \frac{f_4(P1,P5;P3,P6)}{f_4(P1,P5;P2,P3)}$ (Patterson et al., 2012). The graph.params2symbolic.fstats function (Table 2) may also prove useful to identify appropriate quadruplets (Appendix S3: Vignette V2). We used the compute.f4ratio function to obtain these two estimates from all the simulated data sets together with their 95% CI (defined as $\hat{\alpha} \pm 1.96\,\sigma_{\hat{\alpha}}$, where $\sigma_{\hat{\alpha}}$ is the block-jackknife standard-error estimate). Table 5 and Appendix S1: Table S4 provide the mean of the estimated $\hat{\alpha}_1$ and $\hat{\alpha}_2$, respectively, over the 250 analysed data sets for each of the 42 investigated configurations. As expected from the above evaluation of treeness tests, estimates of $\alpha$ were highly consistent among all the investigated configurations and similar for the two considered $f_4$-ratios, with a mean value varying between 0.245 and 0.248. Yet, a slight downward bias (always <2%) could be noticed, but the estimated 95% CIs were almost always optimal (or close to optimal) since they contained the true simulated value ($\alpha = 0.25$) from 90.0% to 95.2% of the time (Table 5 and Appendix S1: Table S4).
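The confidence interval construction described above (a ratio of two $f_4$ estimates with a block-jackknife standard error) can be illustrated with the following Python sketch. It is not the compute.f4ratio implementation: it assumes a simplified, unweighted delete-one-block jackknife over equally sized blocks of consecutive SNPs, works on allele frequencies rather than counts, and uses toy frequencies built so that the expected ratio is 0.25. All function and variable names are hypothetical.

```python
import math
import random


def f4_terms(pa, pb, pc, pd):
    """Per-SNP contributions (pA - pB) * (pC - pD) to an f4 estimate."""
    return [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]


def f4_ratio_jackknife(num_terms, den_terms, n_blocks=20):
    """Ratio of two f4 estimates with a delete-one-block jackknife s.e.

    Simplifying assumption: blocks are runs of consecutive SNPs of (nearly)
    equal size and the jackknife is unweighted.
    """
    n = len(num_terms)
    total_num, total_den = sum(num_terms), sum(den_terms)
    estimate = total_num / total_den
    bounds = [round(i * n / n_blocks) for i in range(n_blocks + 1)]
    leave_one_out = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        num = total_num - sum(num_terms[start:end])
        den = total_den - sum(den_terms[start:end])
        leave_one_out.append(num / den)
    mean_loo = sum(leave_one_out) / n_blocks
    var = (n_blocks - 1) / n_blocks * sum((x - mean_loo) ** 2 for x in leave_one_out)
    se = math.sqrt(var)
    return estimate, se, (estimate - 1.96 * se, estimate + 1.96 * se)


# Toy frequencies (not a demographic simulation): the C-D contrast is built as
# 0.25 times the shift shared by the A-B and E-F contrasts, plus small noise.
rng = random.Random(1)
alpha_toy = 0.25
pa, pb, pc, pd, pe, pf = [], [], [], [], [], []
for _ in range(2000):
    shift = rng.uniform(-0.1, 0.1)
    noise = rng.uniform(-0.02, 0.02)
    x, y, z = (rng.uniform(0.2, 0.7) for _ in range(3))
    pa.append(x + shift)
    pb.append(x)
    pe.append(y + shift)
    pf.append(y)
    pc.append(z)
    pd.append(z - alpha_toy * shift - noise)

est, se, ci = f4_ratio_jackknife(f4_terms(pa, pb, pc, pd), f4_terms(pa, pb, pe, pf))
print(f"alpha = {est:.3f} (s.e. {se:.4f}); 95% CI [{ci[0]:.3f}; {ci[1]:.3f}]")
```

Repeating such an analysis over many independent data sets and counting how often the interval contains the true value gives the coverage proportions of the kind reported in Table 5.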
3.1.5 | Evaluation of graph fitting
We further estimated, for all the simulated data sets, branch lengths in drift units and the admixture proportion $\alpha$ with their 95% CIs by fitting the simulated graph with fit.graph. As for the $f_4$-ratio-based estimation, estimates of $\alpha$ were virtually unbiased and consistent across all the 42 investigated configurations (Appendix S1: Figure S3). Nevertheless, the 95% CIs were always too narrow since they contained the actual value ($\alpha = 0.25$) only 40.8% to 74.4% of the time (Appendix S1: Table S5), as expected from the $\chi^2$ approximation of the LRT underlying the computation of these CIs. Figure 4 and Appendix S1: Figure S4 plot the distributions of the estimated lengths of the ten branches of the simulated admixture graph (over the 250 estimates per configuration) when applying a 5% and a 1% MAF threshold, respectively. The corresponding mean estimates and proportions of 95% CIs including the true value are provided in Appendix S1: Tables S6–S15. Note that the branches $P8 \leftrightarrow R$ and $P9 \leftrightarrow R$ that are connected to the root $R$ (Figure 1) can only be estimated jointly (as $\tau_{P8 \leftrightarrow P9} = \tau_{P8 \leftrightarrow R} + \tau_{P9 \leftrightarrow R}$, $R$ being arbitrarily set in its middle).
At the 5% MAF threshold, very similar performance was obtained for the allele count and the different Pool-Seq data sets whatever the simulated read coverage or sequencing error rate (Figure 4a–c). Mostly unbiased branch lengths were estimated for the four leaves (terminal branches) $\tau_{P1 \leftrightarrow P7}$, $\tau_{P2 \leftrightarrow S1}$, $\tau_{P6 \leftrightarrow S}$ and $\tau_{P3 \leftrightarrow S2}$. As previously observed with $\alpha$, the estimated 95% CIs remained too narrow, particularly for $\tau_{P2 \leftrightarrow S1}$, for which <50% of the CIs contained the true value (Appendix S1: Table S7) compared to more than 80% for $\tau_{P1 \leftrightarrow P7}$ (Appendix S1: Table S6). As expected from the drift-scaling approximation, the estimated branch lengths tended to be slightly downwardly biased (ca. 2%) for the two other leaves ($\tau_{P4 \leftrightarrow P9}$ and $\tau_{P5 \leftrightarrow P9}$), but the estimated 95% CIs displayed similar characteristics since from 48.0% to 87.6% contained the true values, the proportion increasing in Pool-Seq data sets when coverage and sequencing error decreased (Appendix S1: Tables S10 and S11). Conversely, the internal branch lengths tended to be upwardly biased, from a slight amount (ca. 2%) for $\tau_{S1 \leftrightarrow P7}$, $\tau_{P7 \leftrightarrow P8}$ and $\tau_{S2 \leftrightarrow P8}$ (Appendix S1: Tables S12 to S14) to a moderate amount (ca. 20%) for the root-including branch $\tau_{P8 \leftrightarrow P9}$, the true value being then always outside the estimated 95% CIs (Appendix S1: Table S15). Yet, when analysing data with a lower MAF threshold of 1%, this bias almost completely vanished (Appendix S1: Figure S4 and Table S15).

TABLE 3 Comparison of the performance of $f_3$-based tests of admixture for different types of data simulated under the Figure 1 scenario processing poolfstat analyses

| MAF threshold | Seq. error ε | Pool-Seq (read counts) λ=30 | λ=50 | λ=75 | λ=100 | λ=200 | Allele count data |
|---|---|---|---|---|---|---|---|
| >1% | 0 | 82.0 (0.00) | 84.4 (0.00) | 86.0 (0.00) | 86.0 (0.00) | 85.2 (0.00) | 85.6 (0.00) |
| >1% | 1 | 100 (0.00) | 100 (0.00) | 86.8 (0.00) | 87.2 (0.00) | 86.4 (0.00) | |
| >1% | 2.5 | 100 (0.00) | 100 (6.47) | 99.6 (0.00) | 92.8 (0.00) | 88.4 (0.00) | |
| >1% | 0 | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | |
| >5% | 0 | 93.6 (0.00) | 95.2 (0.00) | 96.4 (0.00) | 96.0 (0.00) | 96.0 (0.00) | 96.8 (0.00) |
| >5% | 1 | 94.0 (0.00) | 96.8 (0.00) | 96.4 (0.00) | 97.2 (0.00) | 96.8 (0.00) | |
| >5% | 2.5 | 94.0 (0.00) | 96.0 (0.00) | 96.0 (0.00) | 97.2 (0.00) | 96.8 (0.00) | |
| >5% | 0 | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | |

Note: For each MAF threshold (MAF > 1% or MAF > 5%), the table gives true and false (in parentheses) positive rates (in %) for 21 different types of analyses relying on (i) allele count data; (ii) 15 different Pool-Seq read count data (five mean coverages $\lambda$ and three sequencing error rates $\varepsilon$); and (iii) Pool-Seq read count data simulated with $\varepsilon = 0$ treated as allele counts (corresponding results of this bad practice are highlighted in italics and *). Each TPR was computed from the analysis of 250 independent data sets (generated from the data simulated under the Figure 1 demographic scenario) as the proportion of $f_3$ with an associated Z-score < −1.65 (95% significance threshold) for the (P6; P2, P3) population triplet (n = 250 estimates). The FPR was similarly computed as the proportion of $f_3$ with an associated Z-score < −1.65 among all the 50 population triplets that do not involve P6 as target population (n = 250 × 50 = 12,500 estimates).
For the other branches, the estimates had similar characteristics (yet with a slightly decreased performance for the $\tau_{P4 \leftrightarrow P9}$ and $\tau_{P5 \leftrightarrow P9}$ leaves) for allele count data or Pool-Seq data simulated without sequencing error (Appendix S1: Figure S4A). In agreement with previous observations, at the 1% MAF threshold, sequencing errors led to a strong downward bias at the lowest simulated coverages, that is, when the percentage of false SNPs became non-negligible (Appendix S1: Figures S4B and S4C). Finally, whatever the chosen MAF threshold, improperly analysing read counts as allele count data always led to a substantial upward bias of the lengths of all the leaves (Figure 4d). Notice, however, that this had no or limited impact on the estimation of internal branch lengths.
TABLE 4 Comparison of the performance of $f_4$-based tests of treeness for different types of data simulated under the Figure 1 scenario processing poolfstat analyses

| MAF threshold | Seq. error ε | Pool-Seq (read counts) λ=30 | λ=50 | λ=75 | λ=100 | λ=200 | Allele count data |
|---|---|---|---|---|---|---|---|
| >1% | 0 | 94.0 (0.05) | 94.4 (0.06) | 94.1 (0.04) | 94.5 (0.05) | 94.3 (0.02) | 94.2 (0.02) |
| >1% | 1 | 94.3 (0.04) | 94.2 (0.03) | 94.3 (0.03) | 94.3 (0.03) | 94.1 (0.05) | |
| >1% | 2.5 | 94.8 (0.06) | 94.5 (0.05) | 94.8 (0.03) | 94.5 (0.06) | 94.3 (0.04) | |
| >1% | 0 | 94.0 (0.05)* | 94.4 (0.06)* | 94.1 (0.04)* | 94.5 (0.05)* | 94.3 (0.02)* | |
| >5% | 0 | 94.5 (0.14) | 94.3 (0.11) | 94.1 (0.14) | 94.9 (0.09) | 94.3 (0.08) | 94.3 (0.11) |
| >5% | 1 | 94.5 (0.09) | 94.5 (0.11) | 94.5 (0.13) | 94.2 (0.09) | 94.1 (0.15) | |
| >5% | 2.5 | 95.2 (0.13) | 93.8 (0.11) | 94.2 (0.12) | 94.5 (0.11) | 94.3 (0.13) | |
| >5% | 0 | 94.5 (0.14)* | 94.3 (0.11)* | 94.1 (0.14)* | 94.9 (0.09)* | 94.3 (0.08)* | |

Note: For each MAF threshold (MAF > 1% or MAF > 5%), the table gives true and false (in parentheses) positive rates (in %) for 21 different types of analyses relying on (i) allele count data; (ii) 15 different Pool-Seq read count data (five mean coverages $\lambda$ and three sequencing error rates $\varepsilon$); and (iii) Pool-Seq read count data simulated with $\varepsilon = 0$ treated as allele counts (corresponding results of this bad practice are highlighted in italics and *). Each TPR was computed from the analysis of 250 independent data sets (generated from the data simulated under the Figure 1 demographic scenario) as the proportion of $f_4$ with an associated absolute Z-score < 1.96 (95% significance threshold) among all the eight population quadruplets ((P1,P2;P3,P4); (P1,P2;P3,P5); (P1,P2;P4,P5); (P1,P3;P4,P5); (P1,P6;P4,P5); (P2,P3;P4,P5); (P2,P6;P4,P5); (P3,P6;P4,P5)) with a null expected $F_4$ (n = 250 × 8 = 2000 estimates). The FPR was similarly computed as the proportion of $f_4$ with an associated absolute Z-score < 1.96 among all the 37 remaining population quadruplets (n = 250 × 37 = 9250 estimates).

TABLE 5 Comparison of $F_4$-ratio-based estimation of the simulated admixture proportion $\alpha$ in the Figure 1 scenario for different types of data processing poolfstat analyses

| MAF threshold | Seq. error ε | Pool-Seq (read counts) λ=30 | λ=50 | λ=75 | λ=100 | λ=200 | Allele count data |
|---|---|---|---|---|---|---|---|
| >1% | 0 | 0.247 (92.4) | 0.247 (92.8) | 0.247 (94.4) | 0.247 (93.6) | 0.247 (92.0) | 0.247 (92.8) |
| >1% | 1 | 0.248 (91.6) | 0.247 (91.6) | 0.247 (92.4) | 0.247 (93.2) | 0.247 (92.0) | |
| >1% | 2.5 | 0.247 (91.6) | 0.246 (94.0) | 0.248 (93.2) | 0.247 (91.2) | 0.248 (92.0) | |
| >1% | 0 | 0.247 (92.4)* | 0.247 (92.8)* | 0.247 (94.4)* | 0.247 (93.6)* | 0.247 (92.0)* | |
| >5% | 0 | 0.247 (92.4) | 0.248 (93.2) | 0.247 (93.6) | 0.247 (93.2) | 0.247 (92.4) | 0.247 (92.4) |
| >5% | 1 | 0.248 (93.2) | 0.247 (91.6) | 0.247 (91.6) | 0.247 (92.8) | 0.247 (91.2) | |
| >5% | 2.5 | 0.247 (91.6) | 0.246 (95.2) | 0.248 (93.2) | 0.247 (90.0) | 0.248 (93.2) | |
| >5% | 0 | 0.247 (92.4)* | 0.248 (93.2)* | 0.247 (93.6)* | 0.247 (93.2)* | 0.247 (92.4)* | |

Note: For each MAF threshold (MAF > 1% or MAF > 5%), the table gives the mean of the estimated $\hat{\alpha} = \frac{f_4(P1,P4;P3,P6)}{f_4(P1,P4;P2,P3)}$ (across 250 independent simulated data sets) for 21 different types of analyses relying on (i) allele count data; (ii) 15 different Pool-Seq read count data (five mean coverages $\lambda$ and three sequencing error rates $\varepsilon$); and (iii) Pool-Seq read count data simulated with $\varepsilon = 0$ treated as allele counts (corresponding results of this bad practice are highlighted in italics and *). The proportion (in %) of the 250 estimated 95% CIs that contain the true simulated value ($\alpha = 0.25$) is given in parentheses.

FIGURE 4 Distribution of the estimated drift-scaled lengths for all the branches in the Figure 1 simulated scenario using admixture graph fitting (as implemented in the fit.graph function of poolfstat) for different types of data with a 5% threshold on the overall SNP MAF. Each box plot summarizes the distribution of the 250 estimated lengths of each of the ten branches obtained from the analysis of either the allele count data set ('Counts') or one of the five different simulated Pool-Seq read count data sets ('PSλX') with different mean coverages ($\lambda = 30; 50; 75; 100$ and 200) as generated from the genotyping data simulated under the scenario depicted in Figure 1. Pool-Seq read count data were generated with no sequencing errors ($\varepsilon = 0$) in (a) and (d) and with a sequencing error rate of $\varepsilon = 1$ and $\varepsilon = 2.5$ in panels (b) and (c), respectively (Appendix S1: Table S1). In (d), the read count data were analysed as allele counts, which corresponds to a bad practice. Note that the two branches coming from the root are combined since the position of the root is not identifiable by the model (i.e. $\tau_{P8 \leftrightarrow P9} = \tau_{P8 \leftrightarrow R} + \tau_{P9 \leftrightarrow R}$). Note that the box plots obtained from the analysis of count data are replicated in each panel for comparison purposes. For each branch, a red dotted line indicates the underlying simulated value. For Pool-Seq data, the overall MAF was estimated from read counts.

3.1.6 | Evaluation of graph construction
To provide insights into the reliability of graph construction, we evaluated the performance of the add.leaf function in positioning the admixed population P6 on the underlying (((P1,P2),P3),(P4,P5)) tree (Figure 1) for the different types of simulated data sets. Appendix S1: Table S16 gives the proportion of correctly inferred admixture graphs (i.e. corresponding to the simulated scenario) with a $\Delta\mathrm{BIC} > 6$ support over all other tested graphs, across the 250 analysed data sets for each of the 42 investigated configurations. As the reference tree with rooted topology (((P1,P2),P3),(P4,P5)) consists of eight branches, P6 may be connected with either (i) nine non-admixed edges (connection to either one of the eight branches or as an outgroup) or (ii) $\binom{8}{2} - 1 = 27$ admixed edges from two-way admixture events. Except for the $\mathrm{PS}^{50,\,\varepsilon=2.5}_{m>1\%}$ data set (the one with the highest percentage of false SNPs), the correct admixture graph was always retrieved with a fairly high support ($\Delta\mathrm{BIC} > 15$).
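The size of the graph space explored when positioning P6, and the ΔBIC support rule, can be sketched as follows. This is a hedged illustration rather than the add.leaf implementation: the counts simply follow the arithmetic given in the text, the `best_supported` helper and the BIC values are hypothetical, and the sketch assumes that a lower BIC indicates a better graph.

```python
from math import comb

n_branches = 8                     # branches of the rooted five-population tree
non_admixed = n_branches + 1       # one per branch, plus P6 as an outgroup
admixed = comb(n_branches, 2) - 1  # pairs of branches minus one, as in the text
total = non_admixed + admixed


def best_supported(bic_by_graph, min_delta=6.0):
    """Return the lowest-BIC graph if it beats the runner-up by > min_delta."""
    ranked = sorted(bic_by_graph.items(), key=lambda item: item[1])
    (best, best_bic), (_, second_bic) = ranked[0], ranked[1]
    return best if second_bic - best_bic > min_delta else None


# Hypothetical BIC values for three candidate graphs (illustration only).
weak = best_supported({"g1": 100.0, "g2": 120.0, "g3": 103.0})
strong = best_supported({"g1": 100.0, "g2": 120.0})
print(non_admixed, admixed, total, weak, strong)
```

Under these assumptions, 36 candidate graphs are compared for each data set, and a graph is only counted as retrieved when its BIC advantage over every alternative exceeds the chosen threshold.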
3.2 | Analysis of real Drosophila suzukii Pool-Seq data
We here sketch the main findings from the poolfstat analyses of a subset of the Pool-Seq data previously generated by Olazcuaga et al. (2020), focusing on 14 population samples of the invasive species D. suzukii. For more details, we encourage readers to consult Appendix S3: Vignette V2.
3.2.1 | Structuring of genetic diversity across the 14
populations
Overall, the estimated global $F_{ST}$ across the 14 populations was 7.03% (95% CI [6.90%; 7.32%]). Estimates of all the pairwise-population $F_{ST}$ confirmed that populations tended to cluster according to their geographic area of origin (i.e. Asia, America and Hawaii; Figure 2a), with some geographically close populations showing a low level of differentiation. For instance, in the American invaded area, the US-Wis, US-Col and US-Nca populations all displayed pairwise $F_{ST}$ significantly lower than 1%. Likewise, in the native area, the three populations CN-Bei, CN-Nin and CN-Lia originating from North-Western China were all found to be very closely related (all pairwise $F_{ST}$ being lower than or very close to 1%). Conversely, the Hawaiian sample (US-Haw) was found to be the most highly differentiated population, all pairwise $F_{ST}$ including US-Haw ranging from 11.7% (with US-Sok) to 17.0% (with US-Col), suggesting strong drift in this population as confirmed by its lowest estimated heterozygosity (Appendix S3: Vignette V2).
3.2.2 | f3- based tests of admixture suggest
pervasive admixture in the invaded area
Out of the 14 sampled populations, two (CN-Lia and JP-Tok) from the Asian native area and four (US-Col, US-Nca, US-Wat and US-Wis) from the continental American invaded area showed at least one significantly negative $f_3$ at the 95% significance threshold (i.e. Z-score < −1.65). Table 6 summarizes, for each of these six populations, the number of significantly negative $f_3$ together with the triplet with the lowest Z-score, giving insights into the pair of populations that branch the closest to the two original sources (assuming a two-way admixture event). Except for CN-Lia, all the detected signals were significant at a far more stringent threshold (e.g. Z-score < −2.33 at the 99% significance threshold). The $f_3$ and $f_3^*$ statistics gave almost exactly the same results (Appendix S3: Vignette V2).
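Why a negative $f_3$ signals admixture can be seen on a toy example. The Python sketch below is an illustration under strong simplifying assumptions: it uses known population allele frequencies (so no sampling or Pool-Seq bias correction is involved), a target that is an exact mixture of the two sources with no subsequent drift, and made-up frequency values; the `f3` helper is hypothetical and is not a poolfstat function.

```python
def f3(p_target, p_src1, p_src2):
    """Mean over SNPs of (pT - pS1) * (pT - pS2), using true frequencies."""
    terms = [(t - a) * (t - b) for t, a, b in zip(p_target, p_src1, p_src2)]
    return sum(terms) / len(terms)


# Made-up allele frequencies at four SNPs in two source populations.
p_a = [0.10, 0.40, 0.75, 0.55]
p_b = [0.30, 0.20, 0.60, 0.90]

# Target formed as an exact 50/50 mixture of the two sources.
alpha = 0.5
p_mix = [alpha * a + (1 - alpha) * b for a, b in zip(p_a, p_b)]

f3_admixed = f3(p_mix, p_a, p_b)    # admixed population used as the target
f3_unadmixed = f3(p_a, p_mix, p_b)  # a source population used as the target
print(f"{f3_admixed:.6f} {f3_unadmixed:.6f}")
```

In this idealized setting each per-SNP term for the admixed target equals $-\alpha(1-\alpha)(p_A - p_B)^2$, which cannot be positive, whereas drift accumulated in the target after admixture adds a positive term and can mask the signal.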
In the native area, JP-Tok showed clear evidence of admixture with 11 significant tests that all involved JP-Sap (from Northern Japan) as a source proxy. The three lowest $f_3$ values were obtained with three Chinese populations (CN-Nin, CN-Bei and CN-Shi in increasing order of $f_3$). Assuming an admixture-graph-like history, this suggests that the two populations branching the closest to the two sources of JP-Tok were JP-Sap and CN-Nin. The remaining Chinese population, CN-Lia, showed some weak evidence for admixture with only one test barely significant at the 95% threshold for the triplet involving CN-Shi and JP-Sap as source proxies (Table 6).
Out of the seven invasive populations from continental America, the four populations US-Col, US-Wis, US-Nca and US-Wat showed strong evidence of admixture. Although it has up to now been considered as the closest to the first invading population of continental America (based on historical records), the Western American US-Wat population displayed the strongest signals with 11 (strongly) significant tests. Interestingly, the three signals supported by the lowest, and hence most significant, Z-scores all involved pairs of source population proxies originating from the continental American invaded area, namely, in order of increasing Z-score (i.e. decreasing evidence), the (US-Sdi,US-Sok), (BR-Pal,US-Sok) and (US-Col,US-Sok) pairs. As the underlying $f_3$ CIs did not overlap with those of the other triplet configurations, these three pairs of populations may be considered as the closest (among the sampled populations) to the original US-Wat source populations. It is worth noting that the Western American US-Sok population was involved in nine of the 11 significantly negative $f_3$ statistics with US-Wat as a target. The three other populations, US-Col, US-Wis and US-Nca, only had a moderate number of significant tests (compared to the others). Such tests always involved at least one of the two other populations and overlapping $f_3$ CIs. This suggests complex patterns of recurrent admixture events among US-Col, US-Wis and US-Nca, a feature consistent with their low level of differentiation and close geographic origins.
TABLE 6 Results of the $f_3$-based tests of admixture on populations from the D. suzukii invasive species

| Population | Origin | Nb. of signif. tests ($f_3$ Z < −1.65) | Triplet with the lowest $f_3$ Z-score |
|---|---|---|---|
| CN-Lia | Native | 1 | CN-Lia;CN-Shi,JP-Sap (Z = −1.66) |
| JP-Tok | Native | 11 | JP-Tok;CN-Nin,JP-Sap (Z = −7.11) |
| US-Col | Invasive (AM) | 2 | US-Col;BR-Pal,US-Wis (Z = −3.31) |
| US-Nca | Invasive (AM) | 6 | US-Nca;JP-Sap,US-Col (Z = −3.89) |
| US-Wat | Invasive (AM) | 13 | US-Wat;US-Sdi,US-Sok (Z = −23.6) |
| US-Wis | Invasive (AM) | 4 | US-Wis;JP-Sap,US-Col (Z = −5.02) |

Note: For all the populations displaying at least one significant signal of admixture at the 95% significance threshold ($f_3$ Z < −1.65), the table gives the number of significant tests (out of the $\binom{13}{2} = 78$ performed per population) and the triplet displaying the lowest Z-score (i.e. most significant test).
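The number of tests quoted in the table note follows from simple enumeration, which the Python sketch below reproduces. The helper names are hypothetical (they are not poolfstat functions); the population labels are those used in the text, and the six-population case corresponds to the simulated scenario.

```python
from itertools import combinations


def f3_triplets(populations):
    """All (target; source1, source2) configurations, sources unordered."""
    return [(target, s1, s2)
            for target in populations
            for s1, s2 in combinations([p for p in populations if p != target], 2)]


def f4_quadruplets(populations):
    """All (A, B; C, D) configurations that differ beyond sign or ordering."""
    quads = []
    for a, b, c, d in combinations(populations, 4):
        quads += [(a, b, c, d), (a, c, b, d), (a, d, b, c)]
    return quads


suzukii = ["CN-Bei", "CN-Nin", "CN-Lia", "CN-Shi", "JP-Sap", "JP-Tok", "US-Haw",
           "US-Sok", "US-Col", "US-Nca", "US-Wat", "US-Wis", "US-Sdi", "BR-Pal"]
simulated = ["P1", "P2", "P3", "P4", "P5", "P6"]

per_target = sum(1 for t in f3_triplets(suzukii) if t[0] == "US-Wat")
print(per_target, len(f3_triplets(simulated)), len(f4_quadruplets(simulated)))
```

The same enumeration gives the 60 $f_3$ and 45 $f_4$ configurations of the six simulated populations, of which 50 triplets do not have P6 as a target.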
3.2.3 | Exploring invasion scenarios with admixture
graph construction and fitting
To provide further insights into the relationships of the surveyed populations and the probable scenarios of invasion of D. suzukii in the American area, we relied on admixture graph construction. Our purpose was not to build a comprehensive admixture graph for the 14 populations, which may be elusive given the close relationships of some populations and the pervasiveness of recent and presumably recurrent admixture events among the different populations, but rather to identify key regional events that occurred early in the invasion history of the species. From our extensive analyses (Appendix S3: Vignette V2), we were able to build and estimate the parameters of two admixture graphs represented in Figure 2b,c. The first admixture graph described the somewhat complex and so far non-investigated relationships among the populations of the native area (including the early invasive population established in Hawaii since 1980) with a very good fit since the Z-score of the residuals for the worst fitted f-statistic was 1.06 (Figure 2b). In agreement with previous findings (and geographic proximity), the Hawaiian population was found to be more closely related to the Japanese population JP-Sap than to the Chinese populations, but it experienced a strong differentiation from their common ancestor (named JP in Figure 2b) with an estimated branch length of 0.255 drift units ($t/2N_e$). Yet, it was not possible from our data to definitively conclude that US-Haw originates from a Japanese population since we have no element to claim that the (ancestral) node population JP was located in Japan. To that end, additional sampling of Japanese populations would be required. The inferred graph also confirmed the above $f_3$-based test results of an admixed origin of JP-Tok between a population closely related to JP-Sap (the main contributor) and a second source likely of Chinese origin, although the same caution as for JP is needed regarding the geographic origin of this internal node population. Similarly, CN-Lia was found to be admixed with a largely predominant contributing source of Chinese ancestry related to CN-Shi ($\hat{\alpha}_C = 96.0\%$; 95% CI [95.7; 96.3]) and a second minor contributing source of presumably Japanese origin (related to JP-Sap). This may explain why the corresponding $f_3$-based test was only barely significant (Table 6). Interestingly, the graph topology also allowed estimating the Chinese ancestry of CN-Lia based on an $F_4$-ratio, resulting in a consistent estimate but a larger 95% CI ($\hat{\alpha}_C = 95.6\%$; 95% CI [94.4; 96.8]), as expected from the above simulation study. CN-Nin, the remaining population from the native area, could not be positioned with reasonable accuracy onto the admixture graph of Figure 2b, the worst fitted f-statistic associated with the best fitting graph having a Z-score = 3.43. However, both its genetic proximity with CN-Lia and the best fitting admixture graph resulting from its positioning onto the scaffold tree including US-Haw, JP-Sap, CN-Bei and CN-Shi suggested a small amount of Japanese introgression (see Appendix S3: Vignette V2 for more details).
The second admixture graph represented in Figure 2c provided insights into the history of introduction of D. suzukii into the American continent. It related the three continental American populations, US-Sok, US-Wat and BR-Pal, to a scaffold including the four unadmixed populations US-Haw, JP-Sap, CN-Shi and CN-Bei with a good fit (the worst fitted f-statistic had a Z-score = −1.83). The underlying scenario suggested that continental American populations originated from at least two major and successive admixture events. The first admixture event led to the internal node population named Am1 and occurred in balanced proportions between two sources, a Japanese one closely related to JP-Sap and a Hawaiian one relatively distantly related to US-Haw (according to the estimated branch lengths). The US-Sok population was the sampled continental American population the closest to Am1 and may thus be assumed to be the most closely related to the first invading population (in agreement with $f_3$-based test results). Yet, US-Sok remained separated by about 0.0816 drift units from Am1, which may explain why no significantly negative $f_3$ was found for triplets with US-Sok as a target.
The second major admixture event occurred between an internal node population named Am2 and a Chinese population closely related to the common ancestor of CN-Bei and CN-Shi, with CN-Shi contributing slightly more (58.5% against 41.5% for the other Am1-related ancestor). Interestingly, the closest Am2 representatives among the sampled populations were BR-Pal and US-Sdi (also in agreement with $f_3$-based tests), suggesting a more Southern geographical origin for Am2. We found that some additional ancestry from a ghost population or recurrent admixture events (e.g. related to Hawaiian populations) may also have contributed to US-Sdi, but this led to a poor fit (worst fitted f-statistic Z-score = −5.87 for the best fitting graph resulting from the positioning of US-Sdi onto the graph, see Appendix S3: Vignette V2). Therefore, US-Sdi is not represented in Figure 2c. Although geographically distant, the Brazilian population BR-Pal thus appeared as the best proxy for Am2, thereby suggesting a rapid spread of D. suzukii in South America from this population without any subsequent admixture events. Additional (and preferably ancient) samples from South American populations would help refine this scenario. Finally, according to the inferred graph, US-Wat was found to originate from a recent admixture between a population very closely related to US-Sok (and thus Am1) and a population deriving from Am2, with similar contributions of both.
In agreement with $f_3$-based admixture tests that suggested complex admixture histories among the closely related US-Col, US-Wis and US-Nca populations, no satisfactory admixture graph could be found when trying to position each of these onto the Figure 2c graph. Nevertheless, their resulting best fitted graphs all suggested a high contribution of the Am2 admixed source, a second contributing source being related to Japanese populations (Appendix S3: Vignette V2).
4 | DISCUSSION
4.1 | A new version of poolfstat for f-statistics estimation and associated inference from both Pool-Seq and allele count data
The R package poolfstat was originally developed by Hivert et al. (2018) to implement an unbiased estimator of $F_{ST}$ for Pool-Seq data and provide utilities to facilitate manipulation of such data. We here proposed a substantially improved version that implements unbiased estimators of the $F_2$, $F_3$ and $F_4$ parameters together with their scaled versions (i.e. pairwise $F_{ST}$, $F_3^*$ and $D$, respectively). Although we primarily focused on the analysis of Pool-Seq data, we extended the package to analyse standard allele count data (as obtained from individual genotyping or sequencing data) and to implement unbiased estimators equivalent to those available in the Admixtools suite (Patterson et al., 2012), allowing us in turn to validate our estimation procedure. Recently, the admixr package was developed to interface most of the Admixtools programs with R for the estimation of f-statistics (only from allele count data), with the noticeable exception of the admixture graph fitting program qpGraph (Petr et al., 2019). We implemented in poolfstat our own functions for fitting, building, visualizing and performing quality assessment of admixture graphs based on the estimated f-statistics. The underlying procedures share strong similarities with those implemented in the qpGraph program (Patterson et al., 2012), resulting in the same fit on some examples (e.g. Appendix S2: Vignette V1), and in the MixMapper program (Lipson et al., 2013, 2014). As recognized by the developers themselves, the latter program, specifically developed for admixture inference from allele count data, was written in C++ and MATLAB, making it 'cumbersome to use' for users who, like ourselves, have no MATLAB license. Moreover, to facilitate local exploration of the admixture graph space, we also implemented in poolfstat efficient semi-automated building utilities (the add.leaf and graph.builder functions). It should be noticed that, although it does not include functions for the estimation of f-statistics, the admixturegraph R package (Leppälä et al., 2017) also provides several valuable alternative utilities for the fitting (based on a slightly different approach), the manipulation and the visualization of admixture graphs, together with utilities for the plotting of the statistics with their confidence intervals or the symbolic derivation of f-statistics (as in poolfstat). Overall, our effort to develop with poolfstat a self-contained, efficient and user-friendly R package capable of performing the entire workflow for f-statistics-based demographic inference from both standard allele count and Pool-Seq read count data will hopefully make such a powerful framework accessible to a wider range of researchers and biological models.
4.2 | A unified definition of the F parameters in
terms of probability of gene identities
To derive our unbiased estimators, we proposed to recapitulate and unify the different definitions of the $F$ and $D$ parameters in terms of probability of gene identity within population ($Q_1$) or between pairs of populations ($Q_2$) as summarized in Equation (1). This formulation offers a complementary perspective to the original description of these parameters in terms of covariance of allele frequencies (Patterson et al., 2012) although, in practice, a little algebra shows that the unbiased estimators derived from these two alternative formulations for allele count data are strictly equivalent. Formally, the $Q_1$ and $Q_2$ probabilities can be viewed as expected identity (in state) of genes across independent replicates of the (stochastic) evolutionary process (Rousset, 2007) that may be expressed as a function of other demo-genetic population parameters. Hence, the obtained expressions for $F_2$, $F_3$ and $F_4$ in terms of $Q_1$ and $Q_2$ probabilities can be directly related to those by Peter (2016) in terms of coalescent times, which allowed him to provide an in-depth exploration of their theoretical properties under a wide range of demographic models other than admixture graphs (see, e.g., Figure 7 in Peter, 2016). More precisely, under an infinite-site mutation model with constant per-generation mutation rate $\mu$, the probability that two genes are identical in state (IIS) is $Q = \sum_{t=1}^{\infty} C_t (1-\mu)^{2t} = 1 - 2\mu\,\mathbb{E}[T] + O(\mu^2)$, where $C_t$ is the probability that the two genes coalesced $t$ generations in the past and $\mathbb{E}[T] \equiv \sum_{t=1}^{\infty} t\,C_t$ is the expected coalescence time of two genes (see Slatkin, 1991; Rousset, 2007, pp. 58–59). Using $Q_1^{(1)} = 1 - 2\mu\,\mathbb{E}[T_{11}]$ and $Q_1^{(2)} = 1 - 2\mu\,\mathbb{E}[T_{22}]$ as the IIS probabilities within populations 1 and 2, respectively, and $Q_2 = 1 - 2\mu\,\mathbb{E}[T_{12}]$ as the IIS probability between populations 1 and 2 allows recovering equations 16 (after fixing a typo in it), 20c and 24 by Peter (2016) for $F_2$, $F_3$ and $F_4$, respectively. Likewise, the estimators derived from (unbiased) estimators of $Q_1$ and $Q_2$ are equivalent to those expressed in terms of average pairwise differences between and within populations, which are natural estimators for the $2\mu\,\mathbb{E}[T]$ terms, as proposed by Peter (2016, eq. 17) for the $F_2$ estimator based on allele count data (e.g. noting that $\pi_{11} = 1 - Q_1^{(1)}$, $\pi_{22} = 1 - Q_1^{(2)}$ and $\pi_{12} = 1 - Q_2$ following his notations). For Pool-Seq data, replacing the latter estimators of nucleotide diversities by the unbiased estimators described in Ferretti et al. (2013, eqs. 3 and 10) would also result in the same estimator for $F_2$ (and other parameters) as those we derived from $Q_1$ and $Q_2$ estimators.
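As a worked illustration of this correspondence, the following sketch assumes that $F_2$ is written as $F_2 = \frac{1}{2}\left(Q_1^{(1)} + Q_1^{(2)}\right) - Q_2$, which is the identity-in-state form of $\mathbb{E}[(p_1 - p_2)^2]$ for a biallelic marker; whether this matches the exact layout of Equation (1) is an assumption. Substituting the first-order approximations of the IIS probabilities and neglecting the $O(\mu^2)$ terms gives:

```latex
\begin{align*}
F_2 &= \tfrac{1}{2}\left(Q_1^{(1)} + Q_1^{(2)}\right) - Q_2 \\
    &\approx \tfrac{1}{2}\left(2 - 2\mu\,\mathbb{E}[T_{11}] - 2\mu\,\mathbb{E}[T_{22}]\right)
       - \left(1 - 2\mu\,\mathbb{E}[T_{12}]\right) \\
    &= \mu\left(2\,\mathbb{E}[T_{12}] - \mathbb{E}[T_{11}] - \mathbb{E}[T_{22}]\right)
\end{align*}
```

The same substitution with $\pi_{11} = 1 - Q_1^{(1)}$, $\pi_{22} = 1 - Q_1^{(2)}$ and $\pi_{12} = 1 - Q_2$ gives $F_2 = \pi_{12} - \frac{1}{2}(\pi_{11} + \pi_{22})$, that is, the excess of between-population over mean within-population pairwise differences.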
In practice, estimators are obtained by averaging over a (high) number of SNPs, which amounts to assuming that each represents an independent outcome of a common demographic process that shaped the genome-wide patterns of genetic diversity. This generally provides accurate estimates, and LD between markers (i.e. violation of the marker independence assumption) can be accounted for with block-jackknife estimation of standard errors (Patterson et al., 2012). Importantly, as originally noticed by Patterson et al. (2012), expressions of $F_2$, $F_3$ and $F_4$ in terms of coalescent times (Peter, 2016) show explicitly that they depend on both the demography (via $\mathbb{E}[T]$) and the marker mutation rate ($\mu$). In the scaled versions of $F_2$ and $F_3$ ($F_{ST}$ and $F_3^*$, respectively), the parameter $\mu$ cancels out, making them presumably more comparable across different data sets. It should however be noticed that, for demographic inference purposes, scaling of the f-statistics is not needed. Indeed, the three-population test of admixture is informed by the sign of $f_3$, which is not affected by the denominator of $F_3^*$. Similarly, the four-population test evaluates departure of $f_4$ (i.e. the numerator of $D$) from a null value expected under the hypothesis of treeness. Patterson et al. (2012) also showed, both analytically and using simulations, that $F_3$ and $F_4$ estimates remained mostly robust to various realistic SNP ascertainment schemes. It is finally worth stressing that admixture graph inference only requires additivity of $F_2$ (Patterson et al., 2012), a feature not fulfilled by $F_{ST}$ (or $F_3^*$ and $D$).
4.3 | Estimation of f-statistics and inference from Pool-Seq data
Our simulation study showed that accurate estimates of the F and D parameters could be obtained from Pool-Seq data with the unbiased estimators we developed, thereby extending our findings for the Pool-Seq FST estimator (Hivert et al., 2018). With no sequencing error, this remained true even at a read coverage as low as 30×, which was here lower than our simulated haploid sample size of 50. Increasing the coverage only provided marginal gain. When introducing sequencing errors, the performance of the estimators tended to decrease for the lowest investigated read coverages (up to 50×) and MAF filtering threshold. This was however essentially due to the presence of spurious SNPs that were not completely filtered out when considering too loose criteria. As a result, simply increasing the threshold on the overall MAF (computed from read counts over all the pool samples) to 5% removed all the spurious SNPs and recovered accurate estimates of the parameters at the lowest read coverages. In agreement with original observations made for allele count data (Patterson et al., 2012), all the f-statistics-based analyses (i.e. three-population test of admixture, four-population test of treeness, F4-ratio estimation of admixture proportions or admixture graph fitting) remained remarkably robust to a MAF-based ascertainment process. In our simulation study, discarding weakly polymorphic SNPs was only found to increase the bias of the drift-scaled length estimates of internal branches in admixture graphs. In practice, cost-effective designs consisting of sequencing pools of 30 to 50 individuals at a 50–100× coverage and applying a MAF threshold of 5% to filter the called SNPs are expected to provide good performance for all the different f-statistics-based inference methods we presented here.
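The overall MAF filter described above can be illustrated with a short sketch. It assumes biallelic SNPs stored as per-pool reference read counts and per-pool coverages; the dictionary layout, the function names `overall_maf` and `filter_snps` and the toy counts are hypothetical and are not part of poolfstat.

```python
def overall_maf(ref_reads, total_reads):
    """Overall minor allele frequency from read counts summed over all pools.

    Assumes the summed coverage is greater than zero.
    """
    ref = sum(ref_reads)
    cov = sum(total_reads)
    freq = ref / cov
    return min(freq, 1.0 - freq)

def filter_snps(snps, maf_threshold=0.05):
    """Keep SNPs whose overall MAF exceeds the threshold."""
    return [s for s in snps if overall_maf(s["ref"], s["cov"]) > maf_threshold]

# Three hypothetical SNPs sequenced in three pools at 50 reads each.
snps = [
    {"id": "snp1", "ref": [20, 25, 18], "cov": [50, 50, 50]},  # common variant
    {"id": "snp2", "ref": [49, 50, 50], "cov": [50, 50, 50]},  # 1 alternate read in 150
    {"id": "snp3", "ref": [3, 4, 2], "cov": [50, 50, 50]},     # MAF = 9/150 = 6%
]
kept = [s["id"] for s in filter_snps(snps)]
```

A variant supported by a single read out of 150, as a sequencing error typically would be, is discarded at the 5% threshold, while genuinely polymorphic sites are retained.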
For Pool-Seq data, all the above conclusions were nevertheless only valid for the analyses based on the unbiased estimators that account for the additional level of variation introduced by the sampling of the DNA of pooled (nonidentifiable) individuals at the sequencing step. We found that improperly analysing Pool-Seq read counts as standard allele counts had highly detrimental consequences on the estimation of all the F parameters that involve Q1 probabilities (within-population probability of identity) in their definition, that is F2, FST (as previously observed by Hivert et al., 2018, see also Figure S5), F3 and F3*, leading to a complete loss of power of the associated three-population test in our simulations. When fitting admixture graphs, this also resulted in a strong upward bias in the estimation of branch lengths, including the external ones that were accurately estimated when relying on unbiased estimators. Loosely speaking, not accounting for the extra variance introduced by the sampling of reads has the same effect as adding a (substantial) amount of extra drift, which explains the two aforementioned observations. Although not investigated here (and of little interest since we should definitely rely on unbiased estimators), the amount of extra variance may be inversely proportional to the pool haploid sample size (i.e. bias may decrease with increasing pool sample size). Conversely, analysing Pool-Seq read counts as standard allele counts did not affect the performance of the f4 (and D)-based test of treeness or the estimation of admixture proportions from F4-ratio. This was expected from the properties of the underlying parameters, which only depend on the Q2 probabilities across the different pairs of populations involved in the quadruplet of interest (Eq. 3), resulting in the same estimators (see Eqs. 4 and 5) for both allele count and Pool-Seq data. More generally, analysing Pool-Seq read count data with popular programs that were developed for standard allele count data, such as those from the Admixtools (Patterson et al., 2012) or TreeMix (Pickrell & Pritchard, 2012) suites, should be avoided and, if not, results should be carefully interpreted.
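The direction of the bias discussed above can be made concrete with a small sketch. It relies on a simplifying assumption that is ours rather than a statement of the exact estimator implemented in poolfstat: reads are drawn independently and with replacement from the n haploid genomes of a pool, so that two distinct reads copy the same chromosome with probability 1/n. Under that assumption the read-based identity probability has expectation 1/n + (1 - 1/n) Q1, which can be inverted to recover Q1. The function names and the numbers are hypothetical.

```python
def q1_from_counts(count_a, total):
    """Probability that two distinct draws (without replacement) are identical in state."""
    count_b = total - count_a
    return (count_a * (count_a - 1) + count_b * (count_b - 1)) / (total * (total - 1))

def q1_poolseq(read_a, coverage, haploid_size):
    """Read-based identity probability corrected for the sampling of reads
    among the `haploid_size` chromosomes of the pool (assumed model, see lead-in)."""
    q1_reads = q1_from_counts(read_a, coverage)
    return 1.0 - haploid_size / (haploid_size - 1) * (1.0 - q1_reads)

# 15 reads of allele A out of 30, for a pool of 25 diploid individuals (n = 50).
naive = q1_from_counts(15, 30)                   # reads wrongly treated as allele counts
corrected = q1_poolseq(15, 30, haploid_size=50)  # accounts for the pooling step
```

The naive value is larger than the corrected one: treating reads as alleles inflates within-population identity, which mimics extra drift, and the inflation term 1/n shrinks as the haploid pool size grows.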
4.4 | Insights into the history of the invasive species D. suzukii from Pool-Seq data analysis
To illustrate both the power and limitations of f-statistics-based methods for historical and demographic inference as implemented in poolfstat, we analysed Pool-Seq data available for 14 populations of the invasive species D. suzukii (Olazcuaga et al., 2020). These population samples were representative of both the Asian native area and the recently invaded American area. Most of them consisted of individuals originating from the same sites as those genotyped in Fraimout et al. (2017) at microsatellite markers and analysed under an ABC-RF framework. The results remained consistent between the two studies, both of them pointing to complex invasion pathways including multiple introductions leading to admixed origins of the continental American populations. The main source contributions were from Hawaii, where D. suzukii was described about 30 years earlier, and the native area (China and Japan). However, some inferred scenarios appeared somewhat conflicting. First, for the Hawaiian population that played a key role in the American invasion route, both poolfstat and Fraimout et al. (2017) suggested a Japanese origin. However, we here found that the sample closest to the source (internal node population JP in Figure 2) was JP-Sap (sampled in Sapporo) while Fraimout et al. (2017) rather concluded it was JP-Tok (sampled in Tokyo), which was not found to be directly contributing to US-Haw in poolfstat analyses and was even found to be admixed between native populations from Japan and China. In the ABC-RF treatments by Fraimout et al. (2017), all populations from the native area were assumed to be non-admixed and no ‘ghost’ (i.e. unsampled) populations were included in the model, whereas such populations are present in admixture graphs through internal nodes. Moreover, the samples from Hawaii and Tokyo both differed in their exact location and collection date (2013 and 2016 for Hawaii, 2014 and 2016 for Tokyo) between the two studies, which may further explain the observed discrepancies and more generally calls for the sequencing of additional samples in this area to better resolve the origin of the Hawaiian population(s).
Interestingly, poolfstat results challenged the initial view about the pioneering origin of the Californian population US-Wat in the invasion of continental America (as suggested by historical records), suggesting it rather originates from an admixture between two already established but unsampled continental American populations, one presumably Northern (related to Am1 and here represented by US-Sok) and the other presumably more Southern (related to Am2 and here represented by BR-Pal from Brazil and US-Sdi from South-California). This discrepancy between Fraimout et al. (2017) and poolfstat results points to three key issues. First, too strong a reliance on the reported dates of first observation of the species in the invaded area when formalizing the scenarios to be compared in ABC modelling may actually mislead inference, especially since D. suzukii was first observed at very close dates in the US-Wat, US-Sdi and US-Sok sampled locations (i.e. 2008, 2009 and 2009, respectively). As a matter of fact, Fraimout et al. (2017) only considered scenarios in which US-Wat was the first population introduced in continental America. Second, in ABC, scenarios are defined by hand, justifying the use of dates of first observation to minimize their number (Estoup & Guillemaud, 2010). The functions implemented in poolfstat circumvent this constraint by facilitating a quick and automatic exploration of the admixture graph space to identify key historical events relating the populations of interest. Third, our finding reinforces the concern that the formalization of invasion scenarios including the possibility of unsampled populations is crucial. This possibility is by construction included in admixture graph construction but is also possible in ABC modelling (Guillemaud et al., 2010). Similarly, Fraimout et al. (2017) argued for an admixed origin of the Brazilian BR-Pal population (first observed in 2013) between undefined North-Western and North-Eastern American sources, while we here found that this population was the best proxy for the ancestral ‘ghost’ American population Am2 (Figure 2c), which may be viewed as one of the main contributors to all the sampled North-American populations (but US-Sok). Again, these results underline the advantages of not relying on historical dates, as in poolfstat analyses, and call for the sequencing of additional samples in South and North-Western American areas to more thoroughly decipher the invasion routes followed by continental American populations.
If Pool-Seq data analysed with poolfstat allowed us to refine historical and demographic scenarios in both the native and invaded areas, the D. suzukii Pool-Seq data analysis also illustrated some inherent constraints imposed on the modelled demographic history when fitting admixture graph models. In particular, more complex histories involving recurrent admixture events turned out to be difficult or even impossible to fit unless a number of key source samples are included, as observed here for the North-Eastern American populations. In real-life applications involving a large number of invasive populations characterized by numerous and recurrent introduction events, summarizing precisely and with a good fit the history of all surveyed populations with a comprehensive admixture graph may remain elusive. However, as previously underlined (Lipson, 2020; Lipson & Reich, 2017; Patterson et al., 2012), in addition to providing robust formal tests of admixture or treeness, a decisive advantage of f-based inference methods is to allow straightforward assessment of the fitted admixture graph by carefully inspecting and reporting the Z-scores of the residuals of the fitted statistics, an option not available in other related methods such as TreeMix (Pickrell & Pritchard, 2012). Beyond modelling the history of populations as admixture graphs (via formal tests of admixture or treeness, or graph fitting), Peter (2016) provided valuable theoretical insights to interpret the estimated f-statistics under alternative demographic models such as island, stepping-stone or serial founder models. This suggests in turn that these statistics should be informative to estimate the parameters of demographic scenarios more complex than admixture graphs (e.g. under an ABC framework as in Collin et al., 2021).
ACKNOWLEDGEMENTS
This work was supported by the French National Research Agency (ANR) for the projects SWING (ANR-16-CE02-0015-01) and GANDHI (ANR-20-CE02-0018).
DATA AVAILABILITY STATEMENT
The vcf file generated for the D. suzukii Pool-Seq data is publicly available from the Zenodo repository (http://doi.org/10.5281/zenodo.4709080). Further details are provided in Appendix S2: Vignette V1 and Appendix S3: Vignette V2.
ORCID
Mathieu Gautier https://orcid.org/0000-0001-7257-5880
Renaud Vitalis https://orcid.org/0000-0001-7096-3089
Laurence Flori https://orcid.org/0000-0002-7529-8521
Arnaud Estoup https://orcid.org/0000-0002-4357-6144
REFERENCES
Adrion, J. R., Kousathanas, A., Pascual, M., Burrack, H. J., Haddad, N. M., Bergland, A. O., Machado, H., Sackton, T. B., Schlenke, T. A., Watada, M., Wegmann, D., & Singh, N. D. (2014). Drosophila suzukii: The genetic footprint of a recent, worldwide invasion. Molecular Biology and Evolution, 31, 3148–3163. https://doi.org/10.1093/molbev/msu246
Andersen, M. M., & Højsgaard, S. (2019). ryacas: A computer algebra system in R. Journal of Open Source Software, 4, 1763. https://doi.org/10.21105/joss.01763
Bhatia, G., Patterson, N., Sankararaman, S., & Price, A. L. (2013). Estimating and interpreting FST: The impact of rare variants. Genome Research, 9, 1514–1521.
Busing, F. M. T. A., Meijer, E., & Leeden, R. V. D. (1999). Delete-m jackknife for unequal m. Statistics and Computing, 9, 3–8.
Collin, F. D., Durif, G., Raynal, L., Lombaert, E., Gautier, M., Vitalis, R., Marin, J.-M., & Estoup, A. (2021). Extending approximate bayesian computation with supervised machine learning to infer demographic history from genetic polymorphisms using diyabc random forest. Molecular Ecology Resources, 21(8), 2598–2613.
Durand, E. Y., Patterson, N., Reich, D., & Slatkin, M. (2011). Testing for ancient admixture between closely related populations. Molecular Biology and Evolution, 28, 2239–2252. https://doi.org/10.1093/molbev/msr048
Eddelbuettel, D. (2013). Seamless R and C++ Integration with rcpp. Springer.
Estoup, A., & Guillemaud, T. (2010). Reconstructing routes of invasion using genetic data: Why, how and so what? Molecular Ecology, 19, 4113–4130. https://doi.org/10.1111/j.1365-294X.2010.04773.x
Feder, A. F., Petrov, D. A., & Bergland, A. O. (2012). LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data. PLoS One, 7, e48588. https://doi.org/10.1371/journal.pone.0048588
Ferretti, L., Ramos-Onsins, S. E., & Pérez-Enciso, M. (2013). Population genomics from pool sequencing. Molecular Ecology, 22, 5561–5576. https://doi.org/10.1111/mec.12522
Fraimout, A., Debat, V., Fellous, S., Hufbauer, R. A., Foucaud, J., Pudlo, P., Marin, J.-M., Price, D. K., Cattel, J., Chen, X., Deprá, M., Duyck, P. F., Guedot, C., Kenis, M., Kimura, M. T., Loeb, G., Loiseau, A., Martinez-Sañudo, I., Pascual, M., … Estoup, A. (2017). Deciphering the routes of invasion of Drosophila suzukii by means of abc random forest. Molecular Biology and Evolution, 34, 980–996.
Garrison, E., & Marth, G. (2012). Haplotype-based Variant Detection from short-read Sequencing. arXiv:1207.3907.
Gautier, M. (2015). Genome-wide scan for adaptive divergence and association with population-specific covariates. Genetics, 201, 1555–1579. https://doi.org/10.1534/genetics.115.181453
Gautier, M., Foucaud, J., Gharbi, K., Cézard, T., Galan, M., Loiseau, A., Thomson, M., Pudlo, P., Kerdelhué, C., & Estoup, A. (2013). Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping. Molecular Ecology, 22, 3766–3779. https://doi.org/10.1111/mec.12360
Glenn, T. C. (2011). Field guide to next-generation DNA sequencers. Molecular Ecology Resources, 11, 759–769. https://doi.org/10.1111/j.1755-0998.2011.03024.x
Green, R. E., Krause, J., Briggs, A. W., Maricic, T., Stenzel, U., Kircher, M., Patterson, N., Li, H., Zhai, W., Fritz, M. H.-Y., Hansen, N. F., Durand, E. Y., Malaspinas, A.-S., Jensen, J. D., Marques-Bonet, T., Alkan, C., Prüfer, K., Meyer, M., Burbano, H. A., … Pääbo, S. (2010). A draft sequence of the neandertal genome. Science (New York, N.Y.), 328, 710–722.
Guillemaud, T., Beaumont, M. A., Ciosi, M., Cornuet, J. M., & Estoup, A. (2010). Inferring introduction routes of invasive species using approximate Bayesian computation on microsatellite data. Heredity, 104, 88–99. https://doi.org/10.1038/hdy.2009.92
Hivert, V., Leblois, R., Petit, E. J., Gautier, M., & Vitalis, R. (2018). Measuring genetic differentiation from pool-seq data. Genetics, 210, 315–330. https://doi.org/10.1534/genetics.118.300900
Iannone, R. (2020). diagrammer: Graph/Network Visualization. R package version 1.0.6.1.
Kelleher, J., Etheridge, A. M., & McVean, G. (2016). Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Computational Biology, 12, e1004842. https://doi.org/10.1371/journal.pcbi.1004842
Knaus, B. J., & Grünwald, N. J. (2017). vcfr: a package to manipulate and visualize variant call format data in R. Molecular Ecology Resources, 17, 44–53.
Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., Miller, C. A., Mardis, E. R., Ding, L. I., & Wilson, R. K. (2012). varscan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, 22, 568–576. https://doi.org/10.1101/gr.129684.111
Kofler, R., Orozco-terWengel, P., De Maio, N., Pandey, R. V., Nolte, V., Futschik, A., Kosiol, C., & Schlötterer, C. (2011). popoolation: A toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS One, 6, e15925. https://doi.org/10.1371/journal.pone.0015925
Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. Annals of Statistics, 17, 1217–1241. https://doi.org/10.1214/aos/1176347265
Lawson, C. L., & Hanson, R. J. (1995). Solving least squares problems. No. 15 in Classics in applied mathematics, 1st edn. Society for Industrial and Applied Mathematics.
Leblois, R., Gautier, M., Rohfritsch, A., Foucaud, J., Burban, C., Galan, M., Loiseau, A., Sauné, L., Branco, M., Gharbi, K., Vitalis, R., & Kerdelhué, C. (2018). Deciphering the demographic history of allochronic differentiation in the pine processionary moth Thaumetopoea pityocampa. Molecular Ecology, 27, 264–278.
Leppälä, K., Nielsen, S. V., & Mailund, T. (2017). admixturegraph: an r package for admixture graph manipulation and fitting. Bioinformatics, 33, 1738–1740. https://doi.org/10.1093/bioinformatics/btx048
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 856(25), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
Lipson, M. (2020). Applying f-statistics and admixture graphs: Theory and examples. Molecular Ecology Resources, 858(20), 1658–1667.
Lipson, M., Loh, P. R., Levin, A., Reich, D., Patterson, N., & Berger, B. (2013). Efficient moment-based inference of admixture parameters and sources of gene flow. Molecular Biology and Evolution, 30, 1788–1802. https://doi.org/10.1093/molbev/mst099
Lipson, M., Loh, P.-R., Patterson, N., Moorjani, P., Ko, Y.-C., Stoneking, M., Berger, B., & Reich, D. (2014). Reconstructing Austronesian population history in island southeast Asia. Nature Communications, 5, 4689. https://doi.org/10.1038/ncomms5689
Lipson, M., & Reich, D. (2017). A working model of the deep relationships of diverse modern human genetic lineages outside of Africa. Molecular Biology and Evolution, 34, 889–902. https://doi.org/10.1093/molbev/msw293
Long, Q., Jeffares, D. C., Zhang, Q., Ye, K., Nizhynska, V., Ning, Z., Tyler-Smith, C., & Nordborg, M. (2011). poolhap: Inferring haplotype frequencies from pooled samples by next generation sequencing. PLoS One, 6, e15292. https://doi.org/10.1371/journal.pone.0015292
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., & DePristo, M. A. (2010). The genome analysis toolkit: A mapreduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20, 1297–1303. https://doi.org/10.1101/gr.107524.110
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. Springer series in operations research. Springer.
Olazcuaga, L., Loiseau, A., Parrinello, H., Paris, M., Fraimout, A., Guedot, C., Diepenbrock, L. M., Kenis, M., Zhang, J., Chen, X., Borowiec, N., Facon, B., Vogt, H., Price, D. K., Vogel, H., Prud’homme, B., Estoup, A., & Gautier, M. (2020). A whole-genome scan for association with invasion success in the fruit fly Drosophila suzukii using contrasts of allele frequencies corrected for population structure. Molecular Biology and Evolution, 37, 2369–2385. https://doi.org/10.1093/molbev/msaa098
Paris, M., Boyer, R., Jaenichen, R., Wolf, J., Karageorgi, M., Green, J., Cagnon, M., Parinello, H., Estoup, A., Gautier, M., Gompel, N., & Prud’homme, B. (2020). Near-chromosome level genome assembly of the fruit pest Drosophila suzukii using long-read sequencing. Scientific Reports, 10, 11227. https://doi.org/10.1038/s41598-020-67373-z
Patterson, N., Moorjani, P., Luo, Y., Mallick, S., Rohland, N., Zhan, Y., Genschoreck, T., Webster, T., & Reich, D. (2012). Ancient admixture in human history. Genetics, 192, 1065–1093. https://doi.org/10.1534/genetics.112.145037
Peter, B. M. (2016). Admixture, population structure, and f-statistics. Genetics, 202, 1485–1501.
Petr, M., Vernot, B., & Kelso, J. (2019). admixr - r package for reproducible analyses using admixtools. Bioinformatics, 35, 3194–3195. https://doi.org/10.1093/bioinformatics/btz030
Pickrell, J. K., & Pritchard, J. K. (2012). Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics, 8, e1002967.
Reich, D., Thangaraj, K., Patterson, N., Price, A. L., & Singh, L. (2009). Reconstructing Indian population history. Nature, 461, 489–494. https://doi.org/10.1038/nature08365
Rousset, F. (2007). Inferences from spatial population genetics. In D. J. Balding, M. Bishop, & C. Cannings (Eds.), Handbook of Statistical Genetics, 3rd ed (pp. 945–979). John Wiley and Sons Ltd.
Schlötterer, C., Tobler, R., Kofler, R., & Nolte, V. (2014). Sequencing pools of individuals - mining genome-wide polymorphism data without big funding. Nature Reviews Genetics, 15, 749–763. https://doi.org/10.1038/nrg3803
Slatkin, M. (1991). Inbreeding coefficients and coalescence times. Genetical Research, 58, 167–175. https://doi.org/10.1017/S0016672300029827
Weir, B. S. (1996). Genetic data analysis II: methods for discrete population genetic data. Sinauer Associates.
Weir, B. S., & Goudet, J. (2017). A unified characterization of population structure and relatedness. Genetics, 206, 2085–2103. https://doi.org/10.1534/genetics.116.198424
SUPPORTING INFORMATION
Additional supporting information may be found in the online version of the article at the publisher’s website.
How to cite this article: Gautier, M., Vitalis, R., Flori, L., & Estoup, A. (2021). f-Statistics estimation and admixture graph construction with Pool-Seq or allele count data using the R package poolfstat. Molecular Ecology Resources, 00, 1–23. https://doi.org/10.1111/1755-0998.13557
| Code | Count data type | MAF threshold | Coverage (λ) | Seq. error (ε) | Nb. of SNPs (mean ± s.d.) | Nb. of false SNPs (mean ± s.d.) |
|---|---|---|---|---|---|---|
| AC_{m>1%} | Alleles | 1% | - | - | 471,919 ± 1,474 | - |
| AC_{m>5%} | Alleles | 5% | - | - | 240,369 ± 1,118 | - |
| PS^{30;ε=0}_{m>1%} | Reads | 1%* | 30 | 0 | 406,742 ± 1,362 | - |
| PS^{50;ε=0}_{m>1%} | Reads | 1% | 50 | 0 | 442,426 ± 1,419 | - |
| PS^{75;ε=0}_{m>1%} | Reads | 1% | 75 | 0 | 449,663 ± 1,433 | - |
| PS^{100;ε=0}_{m>1%} | Reads | 1% | 100 | 0 | 454,056 ± 1,434 | - |
| PS^{200;ε=0}_{m>1%} | Reads | 1% | 200 | 0 | 462,429 ± 1,466 | - |
| PS^{30;ε=0}_{m>5%} | Reads | 5% | 30 | 0 | 246,560 ± 1,111 | - |
| PS^{50;ε=0}_{m>5%} | Reads | 5% | 50 | 0 | 245,296 ± 1,108 | - |
| PS^{75;ε=0}_{m>5%} | Reads | 5% | 75 | 0 | 244,706 ± 1,126 | - |
| PS^{100;ε=0}_{m>5%} | Reads | 5% | 100 | 0 | 244,432 ± 1,140 | - |
| PS^{200;ε=0}_{m>5%} | Reads | 5% | 200 | 0 | 243,993 ± 1,131 | - |
| PS^{30;ε=1‰}_{m>1%} | Reads | 1% | 30 | 1‰ | 613,255 ± 1,471 | 206,051 ± 443 |
| PS^{50;ε=1‰}_{m>1%} | Reads | 1% | 50 | 1‰ | 871,953 ± 1,605 | 432,038 ± 654 |
| PS^{75;ε=1‰}_{m>1%} | Reads | 1% | 75 | 1‰ | 445,922 ± 1,417 | 3,866 ± 58 |
| PS^{100;ε=1‰}_{m>1%} | Reads | 1% | 100 | 1‰ | 440,012 ± 1,399 | 196 ± 14 |
| PS^{200;ε=1‰}_{m>1%} | Reads | 1% | 200 | 1‰ | 408,432 ± 1,300 | 0 ± 0 |
| PS^{30;ε=1‰}_{m>5%} | Reads | 5% | 30 | 1‰ | 246,442 ± 1,123 | 0 ± 0 |
| PS^{50;ε=1‰}_{m>5%} | Reads | 5% | 50 | 1‰ | 243,699 ± 1,113 | 0 ± 0 |
| PS^{75;ε=1‰}_{m>5%} | Reads | 5% | 75 | 1‰ | 240,422 ± 1,111 | 0 ± 0 |
| PS^{100;ε=1‰}_{m>5%} | Reads | 5% | 100 | 1‰ | 236,566 ± 1,089 | 0 ± 0 |
| PS^{200;ε=1‰}_{m>5%} | Reads | 5% | 200 | 1‰ | 215,471 ± 1,007 | 0 ± 0 |
| PS^{30;ε=2.5‰}_{m>1%} | Reads | 1% | 30 | 2.5‰ | 3,402,213 ± 2,147 | 2,999,100 ± 1,773 |
| PS^{50;ε=2.5‰}_{m>1%} | Reads | 1% | 50 | 2.5‰ | 6,803,455 ± 2,847 | 6,380,412 ± 2,512 |
| PS^{75;ε=2.5‰}_{m>1%} | Reads | 1% | 75 | 2.5‰ | 686,967 ± 1,464 | 282,076 ± 530 |
| PS^{100;ε=2.5‰}_{m>1%} | Reads | 1% | 100 | 2.5‰ | 417,951 ± 1,243 | 39,043 ± 189 |
| PS^{200;ε=2.5‰}_{m>1%} | Reads | 1% | 200 | 2.5‰ | 251,968 ± 863 | 2 ± 1 |
| PS^{30;ε=2.5‰}_{m>5%} | Reads | 5% | 30 | 2.5‰ | 243,414 ± 1,088 | 0 ± 0 |
| PS^{50;ε=2.5‰}_{m>5%} | Reads | 5% | 50 | 2.5‰ | 234,158 ± 1,100 | 0 ± 0 |
| PS^{75;ε=2.5‰}_{m>5%} | Reads | 5% | 75 | 2.5‰ | 220,095 ± 1,050 | 0 ± 0 |
| PS^{100;ε=2.5‰}_{m>5%} | Reads | 5% | 100 | 2.5‰ | 203,702 ± 967 | 0 ± 0 |
| PS^{200;ε=2.5‰}_{m>5%} | Reads | 5% | 200 | 2.5‰ | 132,980 ± 669 | 0 ± 0 |

Table S1. Description of the simulated datasets. A total of 250 independent genotyping datasets were simulated according to the demographic scenario represented in Figure 1. For each simulated genotyping dataset, we generated two allele count datasets by applying a 1% or 5% threshold (column 3) on the overall MAF (computed from the allele counts); and simulated fifteen Pool-Seq read count datasets with five different mean read coverages λ (column 4) and three different sequencing error rates ε (column 5), applying a 1% or 5% threshold (column 3) on the overall MAF (computed from read counts) to each of them (resulting in a total of 30 Pool-Seq read count datasets per simulated genotyping dataset). As highlighted by the asterisk (*), for Pool-Seq data with λ=30, the actual MAF threshold was slightly higher than indicated due to the additional filtering criterion on the minimal read count (MRC>2, see Material and Methods). Indeed, when λ=30, the overall coverage (which follows a Poisson distribution with parameter 6×30=180) is ≤200 for an expected proportion of 93.5% of the SNPs, resulting in a more stringent MAF threshold of 2/(6×30)=1.11% when requiring MRC>2. The last two columns give the average number (and standard deviation) of SNPs and false SNPs (for read count data simulated with ε>0) computed over the 250 independent simulated datasets available for each of the 32 configurations.
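The two figures quoted in the Table S1 caption for λ=30 can be checked numerically. The sketch below assumes, as stated in the caption, that the overall coverage follows a Poisson distribution with parameter 6×30; the helper `poisson_cdf` and the variable names are ours.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), with terms computed in log space for stability."""
    return sum(
        math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
        for i in range(k + 1)
    )

n_pools, coverage = 6, 30                  # the factor 6 and λ = 30 from the caption
mean_total = n_pools * coverage            # Poisson parameter: 180
prop_low = poisson_cdf(200, mean_total)    # share of SNPs with overall coverage <= 200
effective_maf = 2 / mean_total             # 2 reads out of 180 on average
```

With these inputs `prop_low` is close to 0.935 and `effective_maf` is about 0.0111, in line with the 93.5% and 1.11% reported in the caption.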
| MAF threshold | Seq. error (ε) | Pool-Seq λ=30 | Pool-Seq λ=50 | Pool-Seq λ=75 | Pool-Seq λ=100 | Pool-Seq λ=200 | Allele count data |
|---|---|---|---|---|---|---|---|
| >1% | 0 | 82.0 (0.00) | 84.4 (0.00) | 86.0 (0.00) | 86.0 (0.00) | 85.2 (0.00) | 85.6 (0.00) |
| >1% | 1‰ | 100 (0.00) | 100 (0.00) | 86.8 (0.00) | 87.2 (0.00) | 86.4 (0.00) | |
| >1% | 2.5‰ | 100 (0.00) | 100 (6.47) | 99.6 (0.00) | 92.8 (0.00) | 88.4 (0.00) | |
| >1% | 0 (as allele counts) | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | |
| >5% | 0 | 93.6 (0.00) | 95.2 (0.00) | 96.4 (0.00) | 96.0 (0.00) | 96.0 (0.00) | 96.8 (0.00) |
| >5% | 1‰ | 94.0 (0.00) | 96.8 (0.00) | 96.4 (0.00) | 97.2 (0.00) | 96.8 (0.00) | |
| >5% | 2.5‰ | 94.0 (0.00) | 96.0 (0.00) | 96.0 (0.00) | 97.2 (0.00) | 96.8 (0.00) | |
| >5% | 0 (as allele counts) | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | 0.00 (0.00)* | |

Table S2. Comparison of the performance of f3*-based tests of admixture for different types of data simulated under the Figure 1 scenario and analysed with poolfstat. For each MAF threshold (MAF>1% or MAF>5%), the table gives True and False (in parentheses) Positive Rates (in %) for 21 different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding results of this bad practice are marked with *). Each TPR was computed from the analysis of 250 independent datasets (generated from the data simulated under the Figure 1 demographic scenario) as the proportion of f3* with an associated Z-score < −1.65 (95% significance threshold) for the (P6;P2,P3) population triplet (n=250 estimates). The FPR was similarly computed as the proportion of f3* with an associated Z-score < −1.65 among all the 50 population triplets that do not involve P6 as target population (n=250×50=12,500 estimates).
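The triplet bookkeeping and the one-sided decision rule described in the Table S2 caption can be sketched as follows. The population labels come from the caption; the helper names `is_admixed` and `positive_rate` are hypothetical, and the test is applied to a generic (estimate, standard error) pair rather than to real output.

```python
from itertools import combinations

pops = ["P1", "P2", "P3", "P4", "P5", "P6"]

# A triplet is (target; source1, source2), the two sources being unordered.
triplets = [
    (target, a, b)
    for target in pops
    for a, b in combinations([p for p in pops if p != target], 2)
]
# Triplets that do not involve P6 as target: used for the false positive rate.
null_triplets = [t for t in triplets if t[0] != "P6"]

def is_admixed(f3_estimate, std_error, z_threshold=-1.65):
    """One-sided three-population test: significantly negative f3."""
    return f3_estimate / std_error < z_threshold

def positive_rate(results):
    """Percentage of (estimate, s.e.) pairs that pass the test."""
    return 100.0 * sum(is_admixed(f, se) for f, se in results) / len(results)
```

Six populations give 60 triplets in total, 10 of which have P6 as target, which leaves the 50 triplets used for the false positive rate.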
| MAF threshold | Seq. error (ε) | Pool-Seq λ=30 | Pool-Seq λ=50 | Pool-Seq λ=75 | Pool-Seq λ=100 | Pool-Seq λ=200 | Allele count data |
|---|---|---|---|---|---|---|---|
| >1% | 0 | 94.0 (0.05) | 94.4 (0.06) | 94.1 (0.04) | 94.5 (0.05) | 94.3 (0.02) | 94.2 (0.02) |
| >1% | 1‰ | 94.3 (0.04) | 94.2 (0.03) | 94.3 (0.03) | 94.3 (0.03) | 94.2 (0.05) | |
| >1% | 2.5‰ | 94.8 (0.06) | 94.5 (0.05) | 94.8 (0.03) | 94.5 (0.06) | 94.3 (0.04) | |
| >1% | 0 (as allele counts) | 94.0 (0.05)* | 94.4 (0.06)* | 94.1 (0.04)* | 94.5 (0.05)* | 94.3 (0.02)* | |
| >5% | 0 | 94.5 (0.14) | 94.3 (0.11) | 94.1 (0.14) | 94.8 (0.09) | 94.3 (0.08) | 94.3 (0.11) |
| >5% | 1‰ | 94.5 (0.09) | 94.5 (0.11) | 94.5 (0.13) | 94.2 (0.09) | 94.0 (0.15) | |
| >5% | 2.5‰ | 95.2 (0.13) | 93.8 (0.11) | 94.2 (0.12) | 94.5 (0.11) | 94.3 (0.13) | |
| >5% | 0 (as allele counts) | 94.5 (0.14)* | 94.3 (0.11)* | 94.1 (0.14)* | 94.9 (0.09)* | 94.3 (0.08)* | |

Table S3. Comparison of the performance of D-based test of treeness for different types of data simulated under the Figure 1 scenario and analysed with poolfstat. For each MAF threshold (MAF>1% or MAF>5%), the table gives True and False (in parentheses) Positive Rates (in %) for 21 different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding results of this bad practice are marked with *). Each TPR was computed from the analysis of 250 independent datasets (generated from the data simulated under the Figure 1 demographic scenario) as the proportion of D-statistics with an associated absolute Z-score <1.96 (95% significance threshold) among all the eight population quadruplets ((P1,P2;P3,P4); (P1,P2;P3,P5); (P1,P2;P4,P5); (P1,P3;P4,P5); (P1,P6;P4,P5); (P2,P3;P4,P5); (P2,P6;P4,P5); (P3,P6;P4,P5)) with a null expected F4 (n=250×8=2,000 estimates). The FPR was similarly computed as the proportion of D-statistics with an associated absolute Z-score <1.96 among all the 37 remaining population quadruplets (n=250×37=9,250 estimates).
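The quadruplet counts used in the Table S3 caption can be recovered by enumeration. In this sketch a quadruplet is represented as two unordered pairs of populations, which is our own convention; the list of eight quadruplets with a null expected F4 is copied from the caption.

```python
from itertools import combinations

pops = ["P1", "P2", "P3", "P4", "P5", "P6"]

def configurations(four):
    """The three ways of splitting four populations into two unordered pairs."""
    a, b, c, d = four
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

quadruplets = [cfg for four in combinations(pops, 4) for cfg in configurations(four)]

# Quadruplets with a null expected F4 under the simulated scenario (from the caption).
tree_like = [
    (("P1", "P2"), ("P3", "P4")), (("P1", "P2"), ("P3", "P5")),
    (("P1", "P2"), ("P4", "P5")), (("P1", "P3"), ("P4", "P5")),
    (("P1", "P6"), ("P4", "P5")), (("P2", "P3"), ("P4", "P5")),
    (("P2", "P6"), ("P4", "P5")), (("P3", "P6"), ("P4", "P5")),
]
others = [q for q in quadruplets if q not in tree_like]

def consistent_with_treeness(z_score, threshold=1.96):
    """Two-sided test: treeness is not rejected when |Z| < threshold."""
    return abs(z_score) < threshold
```

Six populations yield 15 sets of four populations and three configurations each, that is 45 quadruplets, of which 37 remain once the eight tree-like ones are set aside.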
| MAF threshold | Seq. error (ε) | Pool-Seq λ=30 | Pool-Seq λ=50 | Pool-Seq λ=75 | Pool-Seq λ=100 | Pool-Seq λ=200 | Allele count data |
|---|---|---|---|---|---|---|---|
| >1% | 0 | 0.246 (94.0) | 0.246 (93.2) | 0.246 (92.0) | 0.245 (93.6) | 0.246 (93.2) | 0.246 (92.8) |
| >1% | 1‰ | 0.246 (92.0) | 0.247 (96.0) | 0.246 (92.8) | 0.246 (93.2) | 0.245 (91.6) | |
| >1% | 2.5‰ | 0.245 (91.2) | 0.245 (92.8) | 0.246 (92.0) | 0.246 (92.4) | 0.246 (94.0) | |
| >1% | 0 (as allele counts) | 0.246 (94.0)* | 0.246 (93.2)* | 0.246 (92.0)* | 0.245 (93.6)* | 0.246 (93.2)* | |
| >5% | 0 | 0.246 (93.2) | 0.246 (93.2) | 0.246 (92.4) | 0.245 (94.0) | 0.246 (93.6) | 0.246 (93.2) |
| >5% | 1‰ | 0.246 (91.6) | 0.247 (94.8) | 0.246 (92.4) | 0.246 (93.2) | 0.246 (92.0) | |
| >5% | 2.5‰ | 0.245 (91.2) | 0.245 (93.2) | 0.246 (91.2) | 0.246 (92.0) | 0.246 (92.8) | |
| >5% | 0 (as allele counts) | 0.246 (93.2)* | 0.246 (93.2)* | 0.246 (92.4)* | 0.245 (94.0)* | 0.246 (93.6)* | |

Table S4. Comparison of F4-ratio based estimation of the simulated admixture proportion α in Figure 1 scenario for different types of data analysed with poolfstat. For each MAF threshold (MAF>1% or MAF>5%), the table gives the mean of the estimated α̂ = f4(P1,P5;P3,P6)/f4(P1,P5;P2,P3) (across 250 independent simulated datasets) for 21 different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding results of this bad practice are marked with *). The proportion (in %) of the 250 estimated 95% confidence intervals that contain the true simulated value (α=0.25) is given in parentheses.
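The coverage percentages given in parentheses can be computed as in the sketch below. It assumes that each 95% confidence interval is built as the estimate ± 1.96 standard errors, which is our assumption for illustration; the estimates and standard errors are hypothetical values, not results from the simulations.

```python
def ci_coverage(estimates, std_errors, true_value, z=1.96):
    """Percentage of intervals estimate ± z * s.e. that contain true_value."""
    hits = sum(
        est - z * se <= true_value <= est + z * se
        for est, se in zip(estimates, std_errors)
    )
    return 100.0 * hits / len(estimates)

alpha_hat = [0.246, 0.251, 0.239, 0.262]  # hypothetical admixture proportion estimates
alpha_se = [0.004, 0.003, 0.005, 0.005]   # hypothetical standard errors
coverage = ci_coverage(alpha_hat, alpha_se, true_value=0.25)
```

A coverage close to 95% indicates that both the point estimates and their standard errors are well calibrated, whereas a much lower value points to bias or to underestimated standard errors.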
MAF seq. error Pool-Seq (read counts) data allele count
threshold λ =30 λ=50 λ=75 λ=100 λ=200 data
>1%
0 0.247 (44.8) 0.247 (44.0) 0.247 (43.6) 0.246 (42.0) 0.247 (42.4)
0.246 (42.8)
1h0.247 (41.6) 0.247 (43.2) 0.246 (44.4) 0.247 (43.2) 0.246 (44.4)
2.5h0.251 (74.4) 0.252 (56.8) 0.247 (42.8) 0.246 (40.4) 0.247 (42.0)
0 0.247 (44.8)?0.247 (44.0)?0.247 (43.6)?0.246 (42.0)?0.247 (42.4)?
>5%
0 0.247 (45.6) 0.247 (43.2) 0.247 (43.2) 0.246 (41.6) 0.247 (42.0)
0.246 (42.8)
1h0.247 (42.8) 0.247 (40.8) 0.246 (44.0) 0.247 (44.4) 0.247 (42.8)
2.5h0.246 (44.0) 0.246 (43.6) 0.247 (42.4) 0.247 (43.2) 0.247 (41.6)
0 0.247 (45.2)?0.247 (43.2)?0.247 (43.2)?0.246 (41.6)?0.247 (41.6)?
Table S5. Comparison of the estimation of the simulated admixture proportion α in the Figure 1 scenario using
admixture graph fitting (as implemented in the fit.graph function of poolfstat) for different types of
data processing poolfstat analyses. For each MAF threshold (MAF>1% or MAF>5%), the table gives the mean of
the estimated α̂ (across 250 independent simulated datasets) for 21 different types of analyses relying on i) allele count data; ii)
15 different Pool-Seq read count data (five mean coverages λ and three sequencing error rates ε); and iii) Pool-Seq read count
data simulated with ε=0 treated as allele counts (corresponding results of this bad practice are highlighted in italics and ⋆).
The proportion (in %) of the 250 estimated 95% confidence intervals that contain the true simulated value (α=0.25) is also
given in parentheses.
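The two summaries reported in each cell of these tables (mean estimate and percentage of 95% confidence intervals containing the simulated value) can be illustrated with the minimal R sketch below. It is only a toy example: the estimates and standard errors are drawn at random (they are not the values of the tables), and the confidence intervals are built with a normal approximation, which is an assumption of this sketch rather than a description of what fit.graph does.

```r
# Toy illustration only: hypothetical estimates, NOT the values of the tables
set.seed(1)
alpha.true <- 0.25                    # simulated admixture proportion
n.rep      <- 250                     # number of independent simulated data sets
se         <- rep(0.01, n.rep)        # hypothetical standard errors
alpha.hat  <- rnorm(n.rep, mean = alpha.true, sd = se)  # hypothetical estimates

# Normal-approximation 95% confidence intervals (assumption of this sketch)
ci.low  <- alpha.hat - qnorm(0.975) * se
ci.high <- alpha.hat + qnorm(0.975) * se

# The two quantities reported in each table cell
mean.estimate <- mean(alpha.hat)
coverage.pct  <- 100 * mean(ci.low <= alpha.true & alpha.true <= ci.high)
```

A coverage close to 95% is expected when both the estimator and its standard error are well calibrated; markedly lower values point to a bias in the estimate, an underestimated standard error, or both.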
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           5.1×10^-2 (86.0)   5.1×10^-2 (91.2)   5.1×10^-2 (91.2)   5.1×10^-2 (92.0)   5.1×10^-2 (92.4)   5.1×10^-2 (91.6)
           1‰          4.4×10^-2 (0.40)   4.1×10^-2 (0.00)   5.1×10^-2 (94.4)   5.1×10^-2 (94.0)   5.1×10^-2 (95.6)
           2.5‰        6.0×10^-3 (0.00)   1.1×10^-3 (0.00)   4.3×10^-2 (0.80)   4.9×10^-2 (90.0)   5.0×10^-2 (97.2)
           0           7.2×10^-2 (0.00)⋆  7.2×10^-2 (0.00)⋆  7.2×10^-2 (0.00)⋆  7.2×10^-2 (0.00)⋆  7.2×10^-2 (0.00)⋆
>5%        0           5.1×10^-2 (81.6)   5.1×10^-2 (85.6)   5.1×10^-2 (89.6)   5.1×10^-2 (89.2)   5.1×10^-2 (91.2)   5.0×10^-2 (90.4)
           1‰          5.1×10^-2 (84.0)   5.1×10^-2 (88.8)   5.1×10^-2 (89.6)   5.1×10^-2 (91.6)   5.0×10^-2 (92.0)
           2.5‰        5.1×10^-2 (87.2)   5.0×10^-2 (90.0)   5.0×10^-2 (92.4)   5.0×10^-2 (90.0)   5.0×10^-2 (90.8)
           0           7.2×10^-2 (0.00)⋆  7.2×10^-2 (0.00)⋆  7.2×10^-2 (0.00)⋆  7.1×10^-2 (0.00)⋆  7.1×10^-2 (0.00)⋆
Table S6. Comparison of the estimation of the simulated length (in drift units) for the P7↔P1 branch
(τ_P7↔P1 = 5×10^-2) in the Figure 1 scenario using admixture graph fitting (as implemented in the fit.graph
function of poolfstat) for different types of data processing poolfstat analyses. For each MAF threshold
(MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_P7↔P1 (across 250 independent simulated datasets) for 21
different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and
three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding
results of this bad practice are highlighted in italics and ⋆). The proportion (in %) of the 250 estimated 95% confidence intervals
that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           2.5×10^-2 (44.8)   2.5×10^-2 (49.2)   2.5×10^-2 (51.2)   2.5×10^-2 (50.8)   2.5×10^-2 (52.0)   2.5×10^-2 (50.8)
           1‰          2.0×10^-2 (32.4)   1.9×10^-2 (12.0)   2.4×10^-2 (48.8)   2.5×10^-2 (47.6)   2.4×10^-2 (51.6)
           2.5‰        0.00 (0.00)        0.00 (0.00)        2.0×10^-2 (25.6)   2.4×10^-2 (49.6)   2.4×10^-2 (48.8)
           0           4.5×10^-2 (0.40)⋆  4.4×10^-2 (0.40)⋆  4.4×10^-2 (0.00)⋆  4.4×10^-2 (0.40)⋆  4.4×10^-2 (0.40)⋆
>5%        0           2.5×10^-2 (47.2)   2.5×10^-2 (48.8)   2.4×10^-2 (46.4)   2.4×10^-2 (45.6)   2.4×10^-2 (46.8)   2.4×10^-2 (49.2)
           1‰          2.5×10^-2 (44.0)   2.5×10^-2 (46.4)   2.4×10^-2 (47.2)   2.4×10^-2 (47.2)   2.4×10^-2 (48.4)
           2.5‰        2.4×10^-2 (44.8)   2.4×10^-2 (47.6)   2.4×10^-2 (46.0)   2.4×10^-2 (46.8)   2.4×10^-2 (46.4)
           0           4.4×10^-2 (0.80)⋆  4.4×10^-2 (0.40)⋆  4.4×10^-2 (0.00)⋆  4.4×10^-2 (0.00)⋆  4.4×10^-2 (0.00)⋆
Table S7. Comparison of the estimation of the simulated length (in drift units) for the P2↔S1 branch
(τ_P2↔S1 = 2.5×10^-2) in the Figure 1 scenario using admixture graph fitting (as implemented in the fit.graph
function of poolfstat) for different types of data processing poolfstat analyses. For each MAF threshold
(MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_P2↔S1 (across 250 independent simulated datasets) for 21
different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and
three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding
results of this bad practice are highlighted in italics and ⋆). The proportion (in %) of the 250 estimated 95% confidence intervals
that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           2.5×10^-2 (65.6)   2.5×10^-2 (70.8)   2.5×10^-2 (70.8)   2.5×10^-2 (68.8)   2.5×10^-2 (68.8)   2.5×10^-2 (70.0)
           0.10%       2.1×10^-2 (4.80)   1.9×10^-2 (0.00)   2.5×10^-2 (70.0)   2.5×10^-2 (72.8)   2.5×10^-2 (70.0)
           0.25%       0.0 (0.00)         0.0 (0.00)         2.0×10^-2 (5.6)    2.4×10^-2 (57.2)   2.5×10^-2 (71.2)
           0           4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆
>5%        0           2.5×10^-2 (62.0)   2.5×10^-2 (62.4)   2.5×10^-2 (62.0)   2.5×10^-2 (62.4)   2.5×10^-2 (62.8)   2.5×10^-2 (60.8)
           0.10%       2.5×10^-2 (66.4)   2.5×10^-2 (61.2)   2.5×10^-2 (61.2)   2.5×10^-2 (59.6)   2.5×10^-2 (61.6)
           0.25%       2.5×10^-2 (66.4)   2.5×10^-2 (63.6)   2.5×10^-2 (61.6)   2.5×10^-2 (61.2)   2.5×10^-2 (61.2)
           0           4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.4×10^-2 (0.00)⋆  4.4×10^-2 (0.00)⋆
Table S8. Comparison of the estimation of the simulated length (in drift units) for the P6↔S branch
(τ_P6↔S = 2.5×10^-2) in the Figure 1 scenario using admixture graph fitting (as implemented in the fit.graph
function of poolfstat) for different types of data processing poolfstat analyses. For each MAF threshold
(MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_P6↔S (across 250 independent simulated datasets) for 21
different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and
three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding
results of this bad practice are highlighted in italics and ⋆). The proportion (in %) of the 250 estimated 95% confidence intervals
that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           2.6×10^-2 (54.4)   2.6×10^-2 (56.0)   2.6×10^-2 (59.6)   2.6×10^-2 (59.2)   2.6×10^-2 (57.6)   2.6×10^-2 (56.4)
           1‰          2.1×10^-2 (28.8)   1.9×10^-2 (10.0)   2.6×10^-2 (57.6)   2.6×10^-2 (61.6)   2.6×10^-2 (59.6)
           2.5‰        0.00 (0.00)        0.00 (0.00)        2.1×10^-2 (25.2)   2.5×10^-2 (55.2)   2.5×10^-2 (58.8)
           0           4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆
>5%        0           2.6×10^-2 (52.0)   2.6×10^-2 (52.8)   2.5×10^-2 (54.0)   2.6×10^-2 (56.4)   2.5×10^-2 (54.0)   2.5×10^-2 (54.8)
           1‰          2.6×10^-2 (53.2)   2.5×10^-2 (51.2)   2.5×10^-2 (54.0)   2.5×10^-2 (54.4)   2.5×10^-2 (54.4)
           2.5‰        2.6×10^-2 (56.8)   2.5×10^-2 (54.4)   2.5×10^-2 (54.0)   2.5×10^-2 (50.8)   2.5×10^-2 (53.6)
           0           4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆  4.5×10^-2 (0.00)⋆
Table S9. Comparison of the estimation of the simulated length (in drift units) for the P3↔S2 branch
(τ_P3↔S2 = 2.5×10^-2) in the Figure 1 scenario using admixture graph fitting (as implemented in the fit.graph
function of poolfstat) for different types of data processing poolfstat analyses. For each MAF threshold
(MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_P3↔S2 (across 250 independent simulated datasets) for 21
different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and
three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding
results of this bad practice are highlighted in italics and ⋆). The proportion (in %) of the 250 estimated 95% confidence intervals
that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           0.143 (20.0)       0.141 (6.00)       0.141 (5.60)       0.141 (4.80)       0.141 (4.00)       0.140 (3.20)
           1‰          0.126 (0.00)       0.120 (0.00)       0.140 (2.40)       0.140 (3.20)       0.140 (2.00)
           2.5‰        4.2×10^-2 (0.00)   3.0×10^-2 (0.00)   0.125 (0.00)       0.137 (0.00)       0.139 (0.00)
           0           0.160 (2.80)⋆      0.158 (10.0)⋆      0.158 (9.20)⋆      0.158 (10.0)⋆      0.158 (12.4)⋆
>5%        0           0.148 (85.6)       0.147 (77.6)       0.146 (74.0)       0.146 (70.4)       0.146 (65.6)       0.145 (60.0)
           1‰          0.147 (80.4)       0.146 (73.6)       0.146 (69.2)       0.146 (67.2)       0.145 (62.4)
           2.5‰        0.147 (77.6)       0.146 (67.6)       0.145 (62.4)       0.145 (59.6)       0.145 (56.4)
           0           0.165 (0.00)⋆      0.164 (0.00)⋆      0.163 (0.00)⋆      0.163 (0.80)⋆      0.163 (0.80)⋆
Table S10. Comparison of the estimation of the simulated length (in drift units) for the P4↔P9 branch
(τ_P4↔P9 = 0.150) in the Figure 1 scenario using admixture graph fitting (as implemented in the fit.graph
function of poolfstat) for different types of data processing poolfstat analyses. For each MAF threshold
(MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_P4↔P9 (across 250 independent simulated datasets) for 21
different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and
three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding
results of this bad practice are highlighted in italics and ⋆). The proportion (in %) of the 250 estimated 95% confidence intervals
that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           0.142 (12.0)       0.141 (2.80)       0.140 (1.60)       0.140 (1.20)       0.140 (0.80)       0.140 (0.40)
           1‰          0.126 (0.00)       0.120 (0.00)       0.140 (0.40)       0.140 (0.40)       0.139 (0.40)
           2.5‰        4.2×10^-2 (0.00)   3.0×10^-2 (0.00)   0.125 (0.00)       0.137 (0.00)       0.139 (0.00)
           0           0.159 (1.60)⋆      0.158 (7.20)⋆      0.158 (8.00)⋆      0.158 (9.60)⋆      0.157 (11.2)⋆
>5%        0           0.147 (87.6)       0.146 (77.6)       0.146 (70.0)       0.146 (64.4)       0.145 (59.2)       0.145 (53.2)
           1‰          0.147 (82.0)       0.146 (72.0)       0.145 (60.4)       0.145 (58.4)       0.145 (54.0)
           2.5‰        0.146 (75.6)       0.145 (61.6)       0.145 (53.6)       0.145 (52.8)       0.144 (48.0)
           0           0.164 (0.00)⋆      0.163 (0.00)⋆      0.163 (0.00)⋆      0.163 (0.00)⋆      0.162 (0.00)⋆
Table S11. Comparison of the estimation of the simulated length (in drift units) for the P5↔P9 branch
(τ_P5↔P9 = 0.150) in the Figure 1 scenario using admixture graph fitting (as implemented in the fit.graph
function of poolfstat) for different types of data processing poolfstat analyses. For each MAF threshold
(MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_P5↔P9 (across 250 independent simulated datasets) for 21
different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and
three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding
results of this bad practice are highlighted in italics and ⋆). The proportion (in %) of the 250 estimated 95% confidence intervals
that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           2.7×10^-2 (48.0)   2.6×10^-2 (54.0)   2.6×10^-2 (50.0)   2.6×10^-2 (50.4)   2.6×10^-2 (50.0)   2.6×10^-2 (52.8)
           1‰          2.4×10^-2 (42.0)   2.3×10^-2 (36.4)   2.6×10^-2 (48.8)   2.6×10^-2 (51.6)   2.6×10^-2 (50.8)
           2.5‰        5.8×10^-3 (0.00)   8.9×10^-4 (0.00)   2.3×10^-2 (51.2)   2.6×10^-2 (51.2)   2.6×10^-2 (49.2)
           0           2.7×10^-2 (46.4)⋆  2.7×10^-2 (54.0)⋆  2.7×10^-2 (50.4)⋆  2.7×10^-2 (50.0)⋆  2.7×10^-2 (52.0)⋆
>5%        0           2.7×10^-2 (44.0)   2.7×10^-2 (49.2)   2.7×10^-2 (44.8)   2.7×10^-2 (48.0)   2.7×10^-2 (46.8)   2.7×10^-2 (50.8)
           1‰          2.7×10^-2 (46.4)   2.7×10^-2 (52.8)   2.7×10^-2 (47.6)   2.7×10^-2 (50.0)   2.7×10^-2 (46.8)
           2.5‰        2.7×10^-2 (46.4)   2.7×10^-2 (44.4)   2.6×10^-2 (47.2)   2.6×10^-2 (48.8)   2.6×10^-2 (46.8)
           0           2.8×10^-2 (45.2)⋆  2.7×10^-2 (49.2)⋆  2.7×10^-2 (46.0)⋆  2.7×10^-2 (48.8)⋆  2.7×10^-2 (46.8)⋆
Table S12. Comparison of the estimation of the simulated length (in drift units) for the S1↔P7 branch
(τ_S1↔P7 = 2.5×10^-2) in the Figure 1 scenario using admixture graph fitting (as implemented in the fit.graph
function of poolfstat) for different types of data processing poolfstat analyses. For each MAF threshold
(MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_S1↔P7 (across 250 independent simulated datasets) for 21
different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and
three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding
results of this bad practice are highlighted in italics and ⋆). The proportion (in %) of the 250 estimated 95% confidence intervals
that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           5.0×10^-2 (95.2)   4.9×10^-2 (93.2)   4.9×10^-2 (94.0)   4.9×10^-2 (94.4)   4.9×10^-2 (94.4)   4.9×10^-2 (94.4)
           1‰          4.4×10^-2 (7.60)   4.3×10^-2 (0.80)   4.9×10^-2 (92.8)   4.9×10^-2 (94.8)   4.9×10^-2 (93.2)
           2.5‰        1.9×10^-2 (0.00)   1.5×10^-2 (0.00)   4.4×10^-2 (3.60)   4.8×10^-2 (81.6)   4.9×10^-2 (92.0)
           0           5.1×10^-2 (93.2)⋆  5.0×10^-2 (94.8)⋆  5.0×10^-2 (94.4)⋆  5.0×10^-2 (94.4)⋆  5.0×10^-2 (95.2)⋆
>5%        0           5.3×10^-2 (67.6)   5.2×10^-2 (72.4)   5.2×10^-2 (72.8)   5.2×10^-2 (72.0)   5.2×10^-2 (74.0)   5.2×10^-2 (74.8)
           1‰          5.2×10^-2 (72.8)   5.2×10^-2 (73.6)   5.2×10^-2 (73.6)   5.2×10^-2 (73.6)   5.2×10^-2 (76.0)
           2.5‰        5.2×10^-2 (72.8)   5.2×10^-2 (76.4)   5.2×10^-2 (76.4)   5.2×10^-2 (76.4)   5.2×10^-2 (80.0)
           0           5.4×10^-2 (50.4)⋆  5.3×10^-2 (54.0)⋆  5.3×10^-2 (55.2)⋆  5.3×10^-2 (56.0)⋆  5.3×10^-2 (58.0)⋆
Table S13. Comparison of the estimation of the simulated length (in drift units) for the P7↔P8
branch (τ_P7↔P8 = 5.0×10^-2) in the Figure 1 scenario using admixture graph fitting (as implemented in the
fit.graph function of poolfstat) for different types of data processing poolfstat analyses. For each MAF
threshold (MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_P7↔P8 (across 250 independent simulated
datasets) for 21 different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean
coverages λ and three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts
(corresponding results of this bad practice are highlighted in italics and ⋆). The proportion (in %) of the 250 estimated 95%
confidence intervals that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           7.4×10^-2 (71.6)   7.4×10^-2 (68.0)   7.4×10^-2 (69.6)   7.4×10^-2 (68.8)   7.4×10^-2 (69.2)   7.4×10^-2 (67.6)
           1‰          6.6×10^-2 (5.60)   6.4×10^-2 (0.00)   7.3×10^-2 (66.0)   7.3×10^-2 (68.0)   7.3×10^-2 (66.8)
           2.5‰        2.5×10^-2 (0.00)   1.8×10^-2 (0.00)   6.6×10^-2 (4.80)   7.2×10^-2 (54.0)   7.3×10^-2 (62.0)
           0           7.6×10^-2 (76.0)⋆  7.5×10^-2 (77.2)⋆  7.5×10^-2 (77.6)⋆  7.5×10^-2 (76.0)⋆  7.5×10^-2 (76.4)⋆
>5%        0           7.7×10^-2 (65.6)   7.7×10^-2 (67.6)   7.7×10^-2 (69.6)   7.7×10^-2 (70.4)   7.7×10^-2 (71.6)   7.7×10^-2 (71.6)
           1‰          7.7×10^-2 (67.6)   7.7×10^-2 (69.2)   7.7×10^-2 (72.8)   7.7×10^-2 (71.6)   7.6×10^-2 (72.0)
           2.5‰        7.7×10^-2 (70.8)   7.6×10^-2 (69.6)   7.6×10^-2 (73.6)   7.6×10^-2 (70.0)   7.6×10^-2 (74.4)
           0           7.9×10^-2 (52.4)⋆  7.9×10^-2 (54.4)⋆  7.8×10^-2 (58.4)⋆  7.8×10^-2 (60.8)⋆  7.8×10^-2 (60.8)⋆
Table S14. Comparison of the estimation of the simulated length (in drift units) for the S2↔P8 branch
(τ_S2↔P8 = 7.5×10^-2) in the Figure 1 scenario using admixture graph fitting (as implemented in the fit.graph
function of poolfstat) for different types of data processing poolfstat analyses. For each MAF threshold
(MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_S2↔P8 (across 250 independent simulated datasets) for 21
different types of analyses relying on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and
three sequencing error rates ε); and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding
results of this bad practice are highlighted in italics and ⋆). The proportion (in %) of the 250 estimated 95% confidence intervals
that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           0.155 (59.2)       0.153 (81.2)       0.153 (84.4)       0.153 (84.0)       0.153 (87.2)       0.152 (90.0)
           1‰          0.138 (0.40)       0.132 (0.00)       0.152 (88.4)       0.152 (88.4)       0.152 (92.0)
           2.5‰        5.62×10^-2 (0.00)  4.49×10^-2 (0.00)  0.137 (0.00)       0.149 (94.8)       0.151 (96.4)
           0           0.158 (19.2)⋆      0.156 (43.6)⋆      0.156 (45.6)⋆      0.156 (47.6)⋆      0.156 (54.8)⋆
>5%        0           0.177 (0.00)       0.177 (0.00)       0.177 (0.00)       0.177 (0.00)       0.177 (0.00)       0.178 (0.00)
           1‰          0.176 (0.00)       0.176 (0.00)       0.177 (0.00)       0.177 (0.00)       0.177 (0.00)
           2.5‰        0.175 (0.00)       0.176 (0.00)       0.176 (0.00)       0.176 (0.00)       0.176 (0.00)
           0           0.18 (0.00)⋆       0.18 (0.00)⋆       0.18 (0.00)⋆       0.181 (0.00)⋆      0.181 (0.00)⋆
Table S15. Comparison of the estimation of the simulated length (in drift units) for the P8↔P9 branch
(τ_P8↔P9 = 0.150) that combines the two branches from the root R in the Figure 1 scenario using admixture
graph fitting (as implemented in the fit.graph function of poolfstat) for different types of data processing
poolfstat analyses. For each MAF threshold (MAF>1% or MAF>5%), the table gives the mean of the estimated τ̂_P8↔P9
(across 250 independent simulated datasets) for 21 different types of analyses relying on i) allele count data; ii) 15 different
Pool-Seq read count data (five mean coverages λ and three sequencing error rates ε); and iii) Pool-Seq read count data simulated
with ε=0 treated as allele counts (corresponding results of this bad practice are highlighted in italics and ⋆). The proportion
(in %) of the 250 estimated 95% confidence intervals that contain the true simulated value is given in parentheses.
MAF        seq. error  Pool-Seq (read counts) data                                                                    allele count
threshold  rate ε      λ=30               λ=50               λ=75               λ=100              λ=200              data
>1%        0           100 (23.5)         100 (28.0)         100 (29.7)         100 (28.8)         100 (33.8)         100 (33.2)
           1‰          100 (28.5)         100 (31.4)         100 (31.7)         100 (31.9)         100 (33.9)
           2.5‰        100 (29.9)         42.4 (0.00)        100 (31.2)         100 (26.9)         100 (26.4)
           0           100 (23.5)⋆        100 (28.0)⋆        100 (29.7)⋆        100 (28.8)⋆        100 (33.8)⋆
>5%        0           100 (17.3)         100 (19.6)         100 (24.5)         100 (19.7)         100 (23.3)         100 (23.1)
           1‰          100 (22.7)         100 (24.8)         100 (23.2)         100 (22.4)         100 (23.7)
           2.5‰        100 (19.8)         100 (20.5)         100 (23.2)         100 (18.7)         100 (19.3)
           0           100 (17.3)⋆        100 (19.6)⋆        100 (24.5)⋆        100 (19.7)⋆        100 (23.3)⋆
Table S16. Performance of the add.leaf function in positioning the simulated population P6 on the
underlying (((P1,P2),P3),(P4,P5)) tree (Figure 1) for different types of simulated data processing poolfstat
analyses. For each MAF threshold (MAF>1% or MAF>5%), the table gives the proportion (in %) of correctly inferred admixture
graphs (i.e., positioning of the P6 population as deriving from an admixture event between two populations directly ancestral
to P2 and P3 with a ∆BIC >6 support) across 250 independent simulated datasets for 21 different types of analyses relying
on i) allele count data; ii) 15 different Pool-Seq read count data (five mean coverages λ and three sequencing error rates ε);
and iii) Pool-Seq read count data simulated with ε=0 treated as allele counts (corresponding results of this bad practice are
highlighted in italics and ⋆). Note that a total of 36 different positionings of P6 on the (((P1,P2),P3),(P4,P5)) rooted tree are
evaluated for each call of the add.leaf function. Indeed, as the reference tree consists of eight branches, P6 may be connected
with i) 9 non-admixed edges (connection to either one of the 8 branches or as an outgroup) or ii) 27 admixed edges from
two-way admixture events (27 = C(8,2) − 1, an admixture event between the two branches from the root not being
identifiable). The lowest ∆BIC between the true graph and the 35 other possible graphs over the 250 different datasets is
given in parentheses for each type of analysis.
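The count of 36 candidate positions can be reproduced with a few lines of R. This is only an arithmetic check of the reasoning given in the caption (the variable names are ours), not a call to add.leaf.

```r
# Arithmetic check of the number of candidate positions for the new leaf P6
n.branches  <- 8                          # branches of the (((P1,P2),P3),(P4,P5)) tree
n.nonadmix  <- n.branches + 1             # one per branch, plus the outgroup position
n.admix     <- choose(n.branches, 2) - 1  # pairs of branches, minus the pair of root branches
n.positions <- n.nonadmix + n.admix       # total number of graphs that are compared
```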
Figure S1. Average number of SNPs (A) and percentage of false SNPs (B) in the simulated datasets as a
function of read coverage and MAF filtering thresholds (see Table S1). As highlighted with parentheses around the
points in A and B, for Pool-Seq data with λ=30, the actual MAF threshold is slightly higher than indicated due to the additional
filtering criterion imposed on the minimal read count (MRC>2; see Material and Methods). This is because for λ=30, the
overall coverage (that follows a Poisson distribution with parameter 6×30=180) is ≤200 for an expected proportion of
93.5% of the SNPs, resulting in a more stringent MAF threshold of 2/(6×30)=1.11% when requiring MRC>2. This feature also
explains the observed increase in the percentage of false SNPs from 30X to 50X coverage for the red (MAF>1% and
ε=10^-3) and green (MAF>1% and ε=2.5×10^-3) solid lines in B). Note also that the red (MAF>5% and ε=10^-3) and
green (MAF>5% and ε=2.5×10^-3) dashed lines are perfectly superposed in B) since no false SNPs are included in any of the
corresponding configurations.
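The two numbers quoted in this caption can be checked with base R. The sketch below only assumes, as stated in the caption, that the overall coverage over the six pools follows a Poisson distribution with mean 6×30; the variable names are ours.

```r
# Check of the effective MAF threshold for the 30X Pool-Seq data
n.pools  <- 6
lambda   <- 30
mean.cov <- n.pools * lambda            # mean overall coverage over the six pools (180)

# Expected proportion of SNPs with an overall coverage <= 200
# (about 93.5% according to the caption)
p.cov.le.200 <- ppois(200, lambda = mean.cov)

# MAF implied by the MRC>2 criterion at the mean overall coverage
# (about 1.11% according to the caption)
maf.effective <- 2 / mean.cov
```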
Figure S2. Allele Frequency Spectrum of the simulated genotyping data (A) and read binomial sampling
properties (B). A) Combined distribution of the overall SNP minor allele frequencies over all the 250 simulated genotyping
datasets. B) Expected proportion of SNPs passing MAF filtering steps after binomial sampling of the minor read count r (i.e.,
with r/c > MAF, where c is the overall read coverage) as a function of the SNP allele frequency y/n (where y is the minor allele
count and n is the haploid number of individuals) for two different coverages (c=30×6=180 and c=200×6=1200,
corresponding to the two extremes of the simulated mean read coverage) and MAF thresholds (1% and 5%). The 5% MAF
value is represented by a vertical dotted line in the two graphs.
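The quantity plotted in panel B can be sketched in base R as a binomial tail probability. This is a simplified illustration under the assumptions stated in the caption (r follows a binomial distribution with c trials and success probability y/n, and a SNP passes when r/c > MAF); the function name prob.pass is ours, and the sketch ignores sequencing errors and any other filter.

```r
# Probability that a SNP with allele frequency p = y/n passes the MAF filter,
# i.e. P(r/c > maf.thr) with r ~ Binomial(cov, p)
prob.pass <- function(p, cov, maf.thr) {
  # r/c > maf.thr is equivalent to r > floor(maf.thr * cov) for integer r
  1 - pbinom(floor(maf.thr * cov), size = cov, prob = p)
}

# The two extreme overall coverages and the two MAF thresholds of the figure
p.grid <- seq(0, 0.2, by = 0.005)
pass.low.cov.maf1  <- prob.pass(p.grid, cov = 180,  maf.thr = 0.01)
pass.high.cov.maf5 <- prob.pass(p.grid, cov = 1200, maf.thr = 0.05)
```

With such a function, SNPs whose true frequency is close to the threshold pass the filter only part of the time, and the transition becomes sharper as the coverage increases.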
Figure S3. Distribution of the estimates of the admixture proportion (simulated value α=0.25) in the Figure 1
simulated scenario using admixture graph fitting (as implemented in the fit.graph function of poolfstat) for
different types of data processing poolfstat analyses. Each box plot summarizes the distribution of the 250 estimated values obtained
from the analysis of either the allele count dataset (the leftmost box named “Counts” of each group) or one of the five different
simulated Pool-Seq read count datasets (“PSλX”) with different mean coverages (λ=30; 50; 75; 100; and 200) as generated
from the genotyping data simulated under the scenario depicted in Figure 1. Pool-Seq read count data were generated with
no sequencing errors (ε=0) in A) and D) and with a sequencing error rate of ε=1‰ and ε=2.5‰ in panels B) and C),
respectively (Table S1). In D), the read count data were analyzed as allele counts, which corresponds to a bad practice. In each
of the four panels, analyses performed after discarding SNPs with an overall MAF (estimated from read counts in Pool-Seq
data) ≤1% and ≤5% are grouped in the left-hand and right-hand sides, respectively. Note that the two box plots obtained from
the analysis of count data (MAF>1% and MAF>5%) are replicated in each panel for comparison purposes. The red dotted line
indicates the simulated value of α.
Figure S4. Distribution of the estimated drift-scaled lengths for all the branches in the Figure 1 simulated
scenario using admixture graph fitting (as implemented in the fit.graph function of poolfstat) for different
types of data with a 1% threshold on the overall SNP MAF. Each box plot summarizes the distribution of
the 250 estimated lengths of each of the ten branches obtained from the analysis of either the allele count dataset
(“Counts”) or one of the five different simulated Pool-Seq read count datasets (“PSλX”) with different mean
coverages (λ=30; 50; 75; 100; and 200) as generated from the genotyping data simulated under the scenario depicted
in Figure 1. Pool-Seq read count data were generated with no sequencing errors (ε=0) in A) and D) and with
a sequencing error rate of ε=1‰ and ε=2.5‰ in panels B) and C), respectively (Table S1). In D), the read
count data were analyzed as allele counts, which corresponds to a bad practice. The two branches coming from the
root are combined since the position of the root is not identifiable by the model (i.e., τ_P8↔P9 = τ_P8↔R + τ_P9↔R).
The box plots obtained from the analysis of count data are replicated in each panel for comparison purposes. For
each branch, a red dotted line indicates the underlying simulated value. For Pool-Seq data, the overall MAF was
estimated from read counts.
Figure S5. Distribution of the estimated global F_ST for the data simulated under the Figure 1
scenario (using the computeFST function of poolfstat run with default options) for different types of data.
Each box plot summarizes the distribution of the 250 estimated values obtained from the analysis of either the allele count dataset
(“Counts”) or one of the five different simulated Pool-Seq read count datasets (“PSλX”) with different mean coverages (λ=
30; 50; 75; 100; and 200) as generated from the genotyping data simulated under the scenario depicted in Figure 1. Pool-Seq
read count data were generated with no sequencing errors (ε=0) in A) and D) and with a sequencing error rate of ε=1‰
and ε=2.5‰ in panels B) and C), respectively (Table S1). In D), the read count data were analyzed as allele counts, which
corresponds to a bad practice. In each of the four panels, analyses performed after discarding SNPs with an overall MAF
(estimated from read counts in Pool-Seq data) ≤1% (respectively ≤5%) are grouped in the left-hand (respectively right-hand)
side. The two box plots obtained from the analysis of count data (MAF>1% and MAF>5%) are replicated in each panel for
comparison purposes and the mean estimated value for the allele count data with MAF>1% is indicated by a dotted line.
Vignette for the package poolfstat (version 2+)
Mathieu Gautier
2021-09-14
Contents
1 Preamble: presentation of the working example data set
2 Reading and manipulating input data
2.1 Generating a countdata object for allele count data
2.2 Generating a pooldata object for Pool-Seq read count data
2.3 Manipulating countdata and pooldata objects
3 Estimating FST
3.1 Estimating genome-wide FST across all the populations
3.2 Estimating and visualizing pairwise-population FST
4 Estimating and visualizing f-statistics (f2, f3, f3⋆, f4 and D)
4.1 The compute.fstats function and fstats objects
4.2 The plot_fstats function for visualization of f2 (and pairwise FST), f3 (and f3⋆) and f4 (and D) estimates and their confidence intervals
4.3 Estimating admixture proportions with f4-ratios
5 Using f-statistics to estimate parameters of admixture graphs
5.1 Generating a graph.params object with the generate.graph.params function
5.2 Fitting a graph with the fit.graph function
5.3 Adding a new leaf to an existing graph
6 Building admixture graph from scratch
6.1 Building scaffold trees of unadmixed populations
6.2 Extending an initial tree (or graph) with the graph.builder function
7 Other utilities
7.1 Symbolic representation of the F parameters, admixture graph equations and the scaled covariance matrix Ω with graph.params2symbolic.fstats
7.2 Generating files for the qpGraph software with graph.params2qpGraphFiles
8 References
A Appendix
A.1 Block-Jackknife estimation of standard errors
This vignette describes how the R package poolfstat can be used to compute various f- and D-statistics
(estimation of FST, Patterson’s F2, F3, F3⋆, F4 and D parameters [1]) in population genomics studies from
allele count or Pool-Seq read count data. The package also includes functions to fit and construct admixture
graphs to infer the demographic history of populations based on the estimated f-statistics, together with
their visualization. This document is conceived as a hands-on tutorial providing users with an overview
of the package functions with working examples analyzing the Pool-Seq and allele count simulated data
sets described in section 1 and publicly available for download from the Zenodo repository [2]. Details and
(numerous) references for all the underlying methods are available in Gautier et al. (2021).
The poolfstat package is currently available for most platforms (Linux, MS Windows and MacOSX) from the
CRAN repository (http://cran.r-project.org/) and may be installed using a standard procedure. Once the
package has been successfully installed on your system, it can be loaded by:
library(poolfstat)
1 Preamble: presentation of the working example data set
Genetic data were simulated using the coalescent simulator msprime (Kelleher et al. 2016) for a total of 150
diploid individuals belonging to 6 different populations (n=25 per population) that were historically related
by the admixture graph represented in Figure 1. Each genome consisted of 20 independent chromosomes of
L = 100 Mb assuming a scaled chromosome-wide recombination rate of ρ = 4 L Ne r = 4,000 [3]. Similarly, the
scaled chromosome-wide mutation rate was θ = 4 L Ne µ = 4,000 [4]. More precisely, the following msprime command
was used:
mspms 300 20 -t 4000 -I 6 50 50 50 50 50 50 0 -es 0.0125 6 0.25 -ej 0.0125 6 2 -ej 0.0125
7 3 -ej 0.025 2 1 -ej 0.05 3 1 -ej 0.075 5 4 -ej 0.1 4 1 -r 4000 100000000 -p 8
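The scaled rates passed to the -t and -r options, and the event times of the command, can be related to the quantities of Figure 1 with a few lines of R. The sketch below is only illustrative: it uses Ne = 10^3 and r = µ = 10^-8 (the example values given in the footnotes), and it assumes that event times are expressed in units of 4Ne generations (the ms convention) and that drift-scaled lengths correspond to t/(2Ne) with t in generations. The variable names and event labels are ours.

```r
# Illustrative check of the scaled parameters (example values from the footnotes)
L  <- 1e8    # chromosome length in bp (100 Mb)
Ne <- 1e3    # diploid effective population size
r  <- 1e-8   # per-base and per-generation recombination rate
mu <- 1e-8   # per-base mutation rate

rho   <- 4 * L * Ne * r    # value passed to -r
theta <- 4 * L * Ne * mu   # value passed to -t

# Event times of the -es/-ej options (labels are ours), assumed to be in units
# of 4*Ne generations
t.scaled <- c(admixture = 0.0125, join.2.1 = 0.025, join.3.1 = 0.05,
              join.5.4 = 0.075, root = 0.1)
t.generations <- t.scaled * 4 * Ne
tau <- t.generations / (2 * Ne)   # drift-scaled times, assuming tau = t / (2 * Ne)
```

Under these assumptions the root is placed at τ = 0.2, which agrees with the range of the τ axis of Figure 1.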
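[Figure 1: admixture graph; vertical axis τ from 0.00 to 0.20; leaf populations P1, P2, P6, P3, P4 and P5; internal nodes S1, S2, S, P7, P8, P9 and root R; admixture proportions α=25% and 1−α=75%]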
Figure 1: Simulated scenario relating the 6 populations of the working example. Names of the internal node
populations for which no data is available are written in grey.
[1] Following Patterson et al. (2012), we use F to refer to the parameter and f to the statistics estimated from the data.
[2] See http://doi.org/10.5281/zenodo.4709728
[3] For instance, ρ = 4,000 if the per-base and per-generation recombination rate is r = 10^-8 (i.e., if the cM per Mb ratio
is equal to 1) in a population of constant diploid effective size Ne = 10^3.
[4] For instance, a nucleotide diversity of θ = 4,000 is expected at mutation-drift equilibrium in a population of constant
diploid effective size Ne = 10^3 if the per-base mutation rate is µ = 10^-8.
The simulation output was further parsed to remove all variants displaying a Minor Allele Frequency (MAF) [5]
lower than 1%, which led to a total of 472,410 remaining SNPs (from 23,246 to 24,237 per chromosome) with
position information stored in the file sim6p.snpinfo.gz [2]. From the resulting genotyping data, allele counts
for both the ancestral and the derived alleles (the latter being taken as reference) were easily obtained by simple
counting [6]. A Pool-Seq data set with no sequencing error was subsequently simulated from the allele count
data as described in Gautier et al. (2021). Briefly, for each SNP i in each population j, we sampled a read
count r_ij for the reference allele from a Binomial distribution following r_ij ∼ Bin(y_ij/n_j; c_ij), where y_ij is the
derived allele count for SNP i in population j, n_j is the (haploid) sample size of population j (here n_j = 50
for all j) and c_ij is the overall read coverage at the position of SNP i [7]. Varying overall read coverages across pools
and SNPs were simulated by sampling the c_ij’s from a Poisson distribution with mean λ = 30, i.e., assuming
a read coverage of 30X for the different pools [8]. Two files representative of “real-life” data formats were finally
generated to store the simulated allele count and read count data:
• a file named sim6p.genotreemix.gz [2] containing allele count data for each SNP and each population in
the same format as the one used in the population program Treemix (Pickrell and Pritchard 2012)
• a file named sim6p.poolseq30X.vcf.gz [2] containing read count data for each SNP and each population in
a vcf format similar to the one obtained with the software VarScan (Koboldt et al. 2012)
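The read-count sampling scheme described above can be sketched in base R as follows. This is a toy illustration, not the script used to produce the example files: the derived allele counts are drawn at random instead of being read from the simulated genotypes, the number of SNPs is arbitrary, and all object names are ours.

```r
# Toy sketch of the Pool-Seq read count simulation (hypothetical allele counts)
set.seed(42)
n.snp  <- 1000             # arbitrary number of SNPs for this illustration
n.pop  <- 6
n.hap  <- rep(50, n.pop)   # haploid sample size n_j of each population
lambda <- 30               # mean read coverage per pool

# Hypothetical derived allele counts y_ij (one row per SNP, one column per pool)
y <- matrix(rbinom(n.snp * n.pop, size = 50, prob = 0.3), nrow = n.snp)

# Overall read coverages c_ij ~ Poisson(lambda)
cov <- matrix(rpois(n.snp * n.pop, lambda), nrow = n.snp)

# Read counts for the reference (derived) allele: r_ij ~ Bin(y_ij / n_j; c_ij)
freq  <- sweep(y, 2, n.hap, "/")
r.ref <- matrix(rbinom(n.snp * n.pop, size = cov, prob = freq), nrow = n.snp)

# Read counts for the alternate allele are c_ij - r_ij
r.alt <- cov - r.ref
```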
2 Reading and manipulating input data
The package poolfstat includes several utilities to parse allele count or Pool-Seq read count input data stored
in various standard formats. It is worth stressing that distinguishing standard allele count data [9] from
Pool-Seq read count data is critical to rely on the appropriate f-statistics estimator (Gautier et al.
2021; Hivert et al. 2018). Hence, in the poolfstat package, two different S4 object classes are defined to
store the data:
• the countdata S4 class for allele count data, with elements (slots) detailed in the documentation page
that can be accessed with the help function (or the ? operator) as follows:
help(countdata)
• the pooldata S4 class for Pool-Seq read count data, with elements (slots) detailed in the
documentation page that can be accessed with the help function (or the ? operator) as follows:
help(pooldata)
These classes characterize the type and origin of the data and are automatically detected by the computeFST
(section 3.1.1), compute.pairwiseFST (section 3.2) and compute.fstats (section 4) functions that implement
the different unbiased estimators, thereby ensuring that the appropriate estimation procedure is used.
2.1 Generating a countdata object for allele count data
A countdata object can be generated from allele count data stored in two different input file formats:
• The input file format required by the popular Treemix program (Pickrell and Pritchard 2012)
using the function genotreemix2countdata
• The input file format required by the program BayPass (Gautier 2015) for allele count data using the
function genobaypass2countdata
The following example shows how to generate a countdata object (here named sim6p.allelecount) from the
sim6p.genotreemix.gz example file that contains the allele count data (in a Treemix file format) for the
simulated example data (section 1):
5MAF was estimated over all the 300 haploid individuals
6The simulated data being haploid this implicitly assumes Hardy-Weinberg equilibrium for the different populations
7Note that the read count for the alternate allele is simply cij −rij
8Such a coverage is actually in the lower limit of what is usually recommended for Pool-Seq experiment
9i.e., obtained from individual genotyping data
sim6p.allelecount<-genotreemix2countdata(genotreemix.file = "sim6p.genotreemix.gz")
Information about marker positions (chromosome or scaffold of origin of the markers and position on the
chromosome) may be provided using the snp.pos argument of the genotreemix2countdata and genobaypass2countdata functions. For instance, to include the mapping information stored in the sim6p.snpinfo.gz
example file (section 1), one may use the following commands (see footnote 10):
positions<-read.table("sim6p.snpinfo.gz",header=TRUE,row.names=1,stringsAsFactors = FALSE)
sim6p.allelecount<-genotreemix2countdata(genotreemix.file = "sim6p.genotreemix.gz",
snp.pos=positions,verbose=FALSE)
sim6p.allelecount #display a summary of the object
* * * Countdata Object * * *
* Number of SNPs = 472410
* Number of Pops = 6
* Pop Names :
P1; P2; P3; P4; P5; P6
**************
Notice
For operations requiring marker position information, i.e., block-jackknife estimation of standard
errors (Appendix A.1) and estimation of multi-locus FST on sliding windows over the genome (section
3.1.3), markers are always assumed to be ordered along the genome according to their relative position in the input
files. If no map information is provided, a default position is given to all the SNPs and they are all
assumed to map to the same chromosome. In this case, some windows or blocks may then extend
over two consecutive contigs (see note a).
a Provided that the reference assembly used to order the markers is not too fragmented, the block-jackknife estimates of
standard errors may only be marginally affected, since the blocks are defined by a number of consecutive markers (and
not by a physical length)
Additional arguments allow filtering the data for low marker polymorphism levels (min.maf) or
genotyping call rate (min.indgeno.per.pop). More details are available in the documentation pages
of the genotreemix2countdata and genobaypass2countdata functions accessible with the commands
?genotreemix2countdata and ?genobaypass2countdata.
2.2 Generating a pooldata object for Pool-Seq read count data
A pooldata object can be generated from Pool-Seq read count data stored in any of the following four
formats:
• vcf files generated by most SNP calling software commonly used to analyze Pool-Seq data including
VarScan (Koboldt et al. 2012), bcftools/SAMtools (Li et al. 2009), GATK (McKenna et al. 2010) or
FreeBayes (Garrison and Marth 2012) using the vcf2pooldata function (see footnote 11)
• sync files generated by the PoPoolation software (Kofler et al. 2011) using the popsync2pooldata
function
• The two input files (pool read count and pool haploid sizes) required by the program BayPass (Gautier
2015) to analyze Pool-Seq data using the genobaypass2pooldata function
• The input file required by the SelEstim program (Vitalis et al. 2014) to analyze Pool-Seq data using
the genoselestim2pooldata function
10 If the number of markers in the snp.pos object is not consistent with that of the allele count file, or if the given matrix does not have
2 columns, default values for marker positions are given and a warning message is printed in the console
11 The format of the vcf file is automatically detected from the genotype format, which includes i) both AD and RD fields for
VarScan vcf files; or ii) only an AD field (with comma-separated read counts for the different alleles) for vcf files other than VarScan.
Parsing of vcf files has been substantially improved since poolfstat version 1.2 with computationally intensive text manipulation
now implemented in C++ routines inspired by those of the vcfR package (Knaus and Grünwald 2017)
The following example shows how to generate a pooldata object for the simulated Pool-Seq data contained in
the sim6p.poolseq30X.vcf.gz vcf file (section 1).
sim6p.readcount30X <-vcf2pooldata(vcf.file="sim6p.poolseq30X.vcf.gz",poolsizes=rep(50,6))
Reading Header lines
Parsing allele counts
VarScan like format detected for allele count data: the AD field contains allele depth
for the alternate allele and RD field for the reference allele
(N.B., positions with more than one alternate allele will be ignored)
472410 lines processed in 0 h 0 m 3 s : 472410 SNPs found
Data consists of 472410 SNPs for 6 Pools
sim6p.readcount30X #display a summary of the resulting pooldata object
* * * PoolData Object * * *
* Number of SNPs = 472410
* Number of Pools = 6
* Pool Names :
Pool1; Pool2; Pool3; Pool4; Pool5; Pool6
**************
Additional arguments allow filtering the data according to marker polymorphism or to the read coverage of the pools (e.g., min.maf,
min.rc, min.cov.per.pool or max.cov.per.pool). More details are available in the documentation
pages of these different functions accessible with the commands ?vcf2pooldata, ?popsync2pooldata,
?genobaypass2pooldata and ?genoselestim2pooldata respectively.
2.3 Manipulating countdata and pooldata objects
The function pooldata.subset (respectively countdata.subset) allows subsetting pooldata (resp. countdata) objects to
retrieve just parts (e.g., some SNPs and/or population samples) of the original data sets according to various
criteria (see ?pooldata.subset or ?countdata.subset for more details). In the example below, a pooldata
object containing only information for populations P2, P3 and P6 for SNPs with a MAF>0.05 is created
from the sim6p.readcount30X object previously created:
sim6p.readcount30X.subset<-pooldata.subset(sim6p.readcount30X,pool.index=c(2,3,6),
min.maf=0.05,verbose=FALSE)
sim6p.readcount30X.subset #display a summary of the resulting pooldata object
* * * PoolData Object * * *
* Number of SNPs = 241280
* Number of Pools = 3
* Pool Names :
Pool2; Pool3; Pool6
**************
Note that the indexes of the retained SNPs from the original data set may be obtained by setting the return.snp.idx
argument to TRUE in the functions pooldata.subset (or countdata.subset). In this case the rows of the matrix
stored in the snp.info slot of the pooldata (or countdata) output object are named “rs”snp.idx (where snp.idx
is the SNP index in the original object), which makes it straightforward to obtain the indexes of the selected
SNPs as shown below:
sim6p.readcount30X.subset<-pooldata.subset(sim6p.readcount30X,pool.index=c(2,3,6),
min.maf=0.05,return.snp.idx=TRUE,verbose=FALSE)
selected.snps.idx <- as.numeric(sub("rs","",rownames(sim6p.readcount30X.subset@snp.info)))
head(selected.snps.idx)
[1] 1 3 4 5 6 8
For pooldata objects, the functions pooldata2genobaypass and pooldata2selestim may also be used to generate
input files for the aforementioned BayPass and SelEstim programs, including the generation of sub-samples
of the original data (see ?pooldata2genobaypass and ?pooldata2selestim for more details).
3 Estimating FST
The FST parameter is commonly used to quantify the structuring of the genetic diversity among populations
(see e.g. Hivert et al. 2018 and references therein). It may be defined as:
$$F_{ST} \equiv \frac{Q_1 - Q_2}{1 - Q_2}$$
where Q1 is the Identity In State (IIS) probability for genes sampled within populations (or pools), and Q2 is
the IIS probability for genes sampled between populations (or pools).
The computeFST and compute.pairwiseFST (for all pairs of populations) functions implement two different
FST estimators relying on:
• a decomposition of the total variance of allele or read count frequencies in an analysis-of-variance
framework (Weir and Cockerham 1984), which is the default procedure of the functions (as specified
by the argument method=“Anova”). The implemented estimators are derived in Weir (1996) (eq. 5.2)
(see also Akey et al. 2002) for allele count data (i.e., countdata objects, see 2.1); and in Hivert et al.
(2018) (eq. 9) for (Pool-Seq) read count data (i.e., pooldata objects, see 2.2).
• unbiased estimators $\hat{Q}_1$ and $\hat{Q}_2$ of the IIS probabilities Q1 and Q2 (as specified by the argument
method=“Identity”). For allele count data (i.e., countdata objects, see 2.1) this estimator actually
corresponds to the one used by Karlsson et al. (2007). For Pool-Seq read count data (i.e., pooldata
objects, see 2.2), the $\hat{Q}_1$ and $\hat{Q}_2$ estimators are described in equations A39 and A43 respectively in
Hivert et al. (2018) Supplementary Materials.
Note that multi-locus estimates (i.e., genome-wide estimates or sliding windows estimates) are derived as the
sum of locus-specific numerators over the sum of locus-specific denominators of the different quantities (see,
e.g., Hivert et al. 2018 and references therein).
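The ratio-of-sums rule for multi-locus estimates can be illustrated with a minimal base R sketch. It is not the poolfstat implementation: it assumes the textbook unbiased IIS estimators for allele counts sampled without replacement, a simple beta model to simulate the counts, and hypothetical object names (y, n, q1, q2).

```r
# Hypothetical allele counts: 1000 SNPs, 2 populations, 50 haploid genes each
set.seed(1)
nsnp <- 1000
n <- c(50, 50)                               # haploid sample sizes
p.anc <- runif(nsnp, 0.1, 0.9)               # ancestral allele frequencies
fst.sim <- 0.1                               # simulated differentiation (assumed beta model)
a <- p.anc * (1 - fst.sim) / fst.sim
b <- (1 - p.anc) * (1 - fst.sim) / fst.sim
# one column of reference allele counts per population
y <- sapply(n, function(ni) rbinom(nsnp, ni, rbeta(nsnp, a, b)))

# Per-SNP unbiased IIS estimates within each population, then averaged (Q1)
q1.pop <- sapply(1:2, function(k) {
  (y[, k] * (y[, k] - 1) + (n[k] - y[, k]) * (n[k] - y[, k] - 1)) / (n[k] * (n[k] - 1))
})
q1 <- rowMeans(q1.pop)
# Per-SNP unbiased IIS estimate between the two populations (Q2)
q2 <- (y[, 1] * y[, 2] + (n[1] - y[, 1]) * (n[2] - y[, 2])) / (n[1] * n[2])

num <- q1 - q2                               # locus-specific numerators
den <- 1 - q2                                # locus-specific denominators
fst.multilocus <- sum(num) / sum(den)        # sum of numerators over sum of denominators
fst.mean.of.ratios <- mean(num / den, na.rm = TRUE)  # shown only for contrast
```

Summing numerators and denominators before taking the ratio avoids undefined locus-specific ratios (e.g., when both samples are fixed for the same allele) and gives more weight to the most informative loci.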
3.1 Estimating genome-wide FST across all the populations
3.1.1 The computeFST function
The computeFST function automatically uses the appropriate estimator given the input object class (either
allele count for countdata objects or Pool-Seq read count data for pooldata objects). For instance with the
simulated example data, we obtain the following estimates of FST with:
•allele count data:
sim6p.allelecount.fst<-computeFST(sim6p.allelecount)
sim6p.allelecount.fst$FST #genome-wide Fst over all populations
[1] 0.132319
•Pool-Seq read count data:
sim6p.readcount30X.fst<-computeFST(sim6p.readcount30X)
sim6p.readcount30X.fst$FST #genome-wide Fst over all populations
[1] 0.1324199
Note that the computeFST function uses the Anova method by default; this may be changed
with the method argument of the function (see section 3).
3.1.2 Block-Jackknife estimation of $\hat{F}_{ST}$ standard-error and confidence intervals
The standard error of the FST estimates can be estimated using a block-jackknife sampling approach (see Appendix
A.1) by specifying the number of consecutive SNPs defining a block with the argument nsnp.per.bjack.block
(by default nsnp.per.bjack.block=0, i.e., no block-jackknife is carried out) as illustrated below for:
•allele count data:
sim6p.allelecount.fst<-computeFST(sim6p.allelecount,nsnp.per.bjack.block = 1000)
Starting Block-Jackknife sampling
462 Jackknife blocks identified with 462000 SNPs (out of 472410 ).
SNPs map to 20 different chrom/scaffolds
Average (min-max) Block Sizes: 4.232 ( 3.515 - 4.975 ) Mb
The resulting genome-wide FST estimated as the mean over block-jackknife samples is stored in the mean.fst
element of the output list. Note that mean.fst may slightly differ from the default genome-wide estimate of
FST (stored in the FST element of the output list, as above when no block-jackknife sampling is carried out)
as it is only computed from the SNPs eligible for block-jackknife (see Appendix A.1):
sim6p.allelecount.fst$FST #genome-wide Fst over all populations
[1] 0.132319
sim6p.allelecount.fst$mean.fst #block-jackknife mean of the genome-wide Fst
[1] 0.1324889
The standard error of the genome-wide FST estimate is stored in the se.fst element of the output list and
may be used to construct 95% confidence intervals (CI) of the estimated values:
sim6p.allelecount.fst$se.fst #s.e. of the genome-wide Fst estimate
[1] 0.0007603234
#95% c.i. of the estimated genome-wide Fst
sim6p.allelecount.fst$mean.fst+c(-1.96,1.96)*sim6p.allelecount.fst$se.fst
[1] 0.1309987 0.1339791
•Pool-Seq read count data:
sim6p.readcount30X.fst<-computeFST(sim6p.readcount30X,nsnp.per.bjack.block = 1000,
verbose=FALSE)
sim6p.readcount30X.fst$FST #genome-wide Fst over all populations
[1] 0.1324199
sim6p.readcount30X.fst$mean.fst #block-jackknife mean of the genome-wide Fst
[1] 0.1326089
sim6p.readcount30X.fst$se.fst #s.e. of the genome-wide Fst estimate
[1] 0.0007620463
#95% c.i. of the estimated genome-wide Fst
sim6p.readcount30X.fst$mean.fst+c(-1.96,1.96)*sim6p.readcount30X.fst$se.fst
[1] 0.1311153 0.1341025
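For readers unfamiliar with the approach, the following base R sketch shows a generic delete-one block-jackknife for a ratio-of-sums estimator. It is only an illustration under simplifying assumptions: the per-SNP numerators and denominators are simulated, chromosome boundaries are ignored, and the exact procedure of Appendix A.1 may differ in its details. All object names are hypothetical.

```r
set.seed(2)
nsnp <- 10500
num <- rnorm(nsnp, mean = 0.02, sd = 0.05)  # hypothetical per-SNP numerators (Q1 - Q2)
den <- runif(nsnp, 0.1, 0.3)                # hypothetical per-SNP denominators (1 - Q2)

block.size <- 1000                          # plays the role of nsnp.per.bjack.block
nblocks <- nsnp %/% block.size              # trailing SNPs not filling a block are dropped
idx <- seq_len(nblocks * block.size)        # SNPs eligible for the block-jackknife
block <- rep(seq_len(nblocks), each = block.size)

# Block-wise sums, then estimates obtained after removing each block in turn
num.b <- tapply(num[idx], block, sum)
den.b <- tapply(den[idx], block, sum)
theta.loo <- (sum(num.b) - num.b) / (sum(den.b) - den.b)

mean.fst <- mean(theta.loo)                 # mean over block-jackknife samples
se.fst <- sqrt((nblocks - 1) / nblocks * sum((theta.loo - mean.fst)^2))
ci95 <- mean.fst + c(-1.96, 1.96) * se.fst  # normal-approximation 95% CI
```

Because blocks are made of consecutive SNPs, the resulting standard error accounts for the correlation among neighbouring markers, which a SNP-by-SNP jackknife would ignore.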
3.1.3 Computing multi-locus FST to scan the genome over sliding-windows of SNPs
The sliding.window.size argument allows computing multi-locus FST for sliding windows over the different
chromosomes (or scaffolds/contigs), e.g., to carry out genome-scans for adaptive differentiation. Each sliding
window includes a number of consecutive SNPs specified by the sliding.window.size argument. This is
illustrated below for the Pool-Seq read count example data (similar results would be obtained with allele
count data):
sim6p.readcount30X.fst<-computeFST(sim6p.readcount30X,sliding.window.size=50)
Start sliding-window scan
20 chromosomes scanned (with more than 50 SNPs)
Average (min-max) Window Sizes 207.4 ( 90.1 - 453.2 ) kb
plot(sim6p.readcount30X.fst$sliding.windows.fst$CumulatedPosition/1e6,
sim6p.readcount30X.fst$sliding.windows.fst$MultiLocusFst,
xlab="Cumulated Position (in Mb)",ylab="Multi-locus Fst",
col=as.numeric(sim6p.readcount30X.fst$sliding.windows.fst$Chr),pch=16)
abline(h=sim6p.readcount30X.fst$FST,lty=2)
Figure 2: Manhattan plot of the multi-locus FST computed over sliding-windows of 50 SNPs on the Pool-Seq example data. The dashed line indicates the estimated overall genome-wide FST. The 20 simulated
chromosomes are represented by alternate colors.
As expected (since the data set was simulated under neutrality), no clear outlier signal of adaptive differentiation (e.g., a tower of overly differentiated windows) shows up (Figure 2).
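The principle of the window-based scan can be sketched in base R as follows. This is a simplified illustration on simulated per-SNP quantities for a single chromosome; the half-window step and all object names are assumptions made for the example and do not necessarily match the choices made in computeFST.

```r
set.seed(3)
nsnp <- 2000
num <- rnorm(nsnp, 0.02, 0.05)       # hypothetical per-SNP numerators (Q1 - Q2)
den <- runif(nsnp, 0.1, 0.3)         # hypothetical per-SNP denominators (1 - Q2)
pos <- sort(sample.int(5e6, nsnp))   # hypothetical positions (bp) on one chromosome

win.size <- 50                       # plays the role of sliding.window.size
step <- win.size / 2                 # assumed step of half a window (illustrative choice)
starts <- seq(1, nsnp - win.size + 1, by = step)

# Multi-locus FST of each window: ratio of sums over its consecutive SNPs
win.fst <- sapply(starts, function(s) {
  w <- s:(s + win.size - 1)
  sum(num[w]) / sum(den[w])
})
# Window position taken as the midpoint between its first and last SNP
win.mid <- sapply(starts, function(s) mean(pos[c(s, s + win.size - 1)]))
```

Defining windows by a number of SNPs rather than by a physical length keeps the amount of information per window roughly constant, at the cost of windows of variable physical size.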
3.2 Estimating and visualizing pairwise-population FST
3.2.1 The compute.pairwiseFST and the heatmap functions
The compute.pairwiseFST function allows estimating the genome-wide FST for all the $n_{pop}(n_{pop}-1)/2$ pairs of
populations from data stored in either a countdata or a pooldata object. As for the computeFST function
(section 3.1.1), the compute.pairwiseFST function automatically uses the appropriate estimation procedure for
the type of the input data (either allele count for countdata objects or Pool-Seq read count data for pooldata
objects). The function returns an S4 object of class pairwisefst whose elements (slots) are detailed in the
documentation page that can be accessed with the following command (or with the ? operator):
help(pairwisefst)
The pairwise-population FST may then be visualized using the generic heatmap function directly applied on
the obtained pairwisefst object as illustrated below for Pool-Seq example results (similar results are obtained
with the allele count data):
sim6p.pairwisefst<-compute.pairwiseFST(sim6p.readcount30X,verbose=FALSE)
Overall Analysis Time: 0 h 0 m 3 s
heatmap(sim6p.pairwisefst)
Figure 3: Heatmap representing the pairwise-population FST matrix of the six populations of the 30X
Pool-Seq example data set (rows and columns ordered as Pool5, Pool4, Pool6, Pool3, Pool2, Pool1)
The resulting heat map (Figure 3) is consistent with the simulated scenario (Figure 1). Note that the
population P3 is the closest to the admixed population P6 (leading to their early clustering in the binary
tree representation) as expected from the high contribution of the P3 ancestor (1 − α = 75%) to the admixed
ancestor of P6 and the short timing of admixture (τ = t/(2Ne) = 0.0125).
3.2.2 Block-Jackknife estimation of $\hat{F}_{ST}$ standard-error and visualisation of confidence intervals
As with the computeFST function, the standard errors of the pairwise-population FST estimates may be estimated
using a block-jackknife sampling approach (see Appendix A.1) by specifying the number of consecutive
SNPs forming each block with the argument nsnp.per.bjack.block (by default nsnp.per.bjack.block=0, i.e.,
no block-jackknife is carried out). The resulting estimated standard errors may directly be used to derive
confidence intervals (see above) that can also be plotted with the plot_fstats function (or directly using the
plot command that calls plot_fstats for pairwisefst objects). This is illustrated below with the allele count
example data (similar results are obtained with the Pool-Seq read count data):
sim6p.pairwisefst<-compute.pairwiseFST(sim6p.allelecount,
nsnp.per.bjack.block = 1000,verbose=FALSE)
Overall Analysis Time: 0 h 0 m 7 s
#Estimated pairwise Fst are stored in the slot values:
#first 6 estimated pairwise Fst
head(sim6p.pairwisefst@values)
Fst Estimate Fst bjack mean Fst bjack s.e. Q2 Estimate Q2 bjack mean Q2 bjack s.e. Nsnp
P1;P2 0.04946710 0.04936227 0.0007453263 0.8323244 0.8323799 0.0005895541 472410
P1;P3 0.09478083 0.09499692 0.0010947422 0.8243978 0.8244107 0.0005931841 472410
P1;P4 0.17816331 0.17843286 0.0015418662 0.8095896 0.8096142 0.0006251644 472410
P1;P5 0.17716282 0.17716008 0.0015559078 0.8094495 0.8096229 0.0006214764 472410
P1;P6 0.07039455 0.07055328 0.0009233910 0.8265288 0.8265487 0.0005869646 472410
P2;P3 0.09543866 0.09560477 0.0011336466 0.8236410 0.8236723 0.0005878641 472410
plot(sim6p.pairwisefst)
Figure 4: Estimated pairwise-population FST with their 95% confidence intervals for the allele count example
data set (x-axis: Fst, with ticks at 0.05, 0.10 and 0.15; population pairs listed as P3;P4, P1;P4, P2;P4, P3;P5, P1;P5, P2;P5, P4;P6, P5;P6, P4;P5, P2;P3, P1;P3, P1;P6, P2;P6, P1;P2, P3;P6)
The resulting estimated pairwise-population FST displayed in Figure 4 are consistent with the simulated
scenario (Figure 1). The lowest level of differentiation is observed for the P3 and P6 population pair as
expected from the high contribution of the P3 ancestor (1 − α = 75%) to the admixed ancestor of P6 and
the short timing of admixture (τ = t/(2Ne) = 0.0125).
4 Estimating and visualizing f-statistics (f2, f3, f3*, f4 and D)
The f2, f3 and f4 statistics were introduced in a seminal paper by Reich and co-workers (Reich et al. 2009)
retracing the history of Indian human populations and form the core components of a general framework for
demographic history inference detailed in Patterson et al. (2012; see also Lipson et al. 2013; Peter 2016;
Lipson 2020). These statistics measure (expected) covariance in allele frequencies among sets of two (F2),
three (F3) or four (F4) populations and are formally defined as follows (denoting $p_i$ the SNP reference allele
frequency in population i):
• $F_2(A;B) \equiv E\left[(p_A - p_B)^2\right]$
• $F_3(A;B,C) \equiv E\left[(p_A - p_B)(p_A - p_C)\right] \equiv \frac{1}{2}\left(F_2(A;B) + F_2(A;C) - F_2(B;C)\right)$
• $F_4(A,B;C,D) \equiv E\left[(p_A - p_B)(p_C - p_D)\right] \equiv \frac{1}{2}\left(F_2(A;D) + F_2(B;C) - F_2(A;C) - F_2(B;D)\right)$
The definitions of the F parameters do not depend on the choice of the reference allele since
$((1-p_A)-(1-p_B))^2 = (p_B-p_A)^2 = (p_A-p_B)^2$. As a consequence, F2 and all the other F parameters may also be defined in terms
of IIS within and between pairs of populations as $F_2(A;B) = Q_1 - Q_2$ (see section 3), which allows deriving
unbiased estimators for both Pool-Seq read count and standard allele count data (Gautier et al. 2021).
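The relationships between the F parameters can be checked numerically with a short base R sketch. It works on hypothetical, randomly drawn allele frequencies and replaces expectations by averages over SNPs; it therefore illustrates the definitions only, not the unbiased estimators implemented in the package.

```r
set.seed(4)
nsnp <- 5000
# Hypothetical reference allele frequencies for four populations A, B, C and D
p <- matrix(runif(4 * nsnp), ncol = 4,
            dimnames = list(NULL, c("A", "B", "C", "D")))

# Expectations are replaced by averages over SNPs
F2 <- function(p, a, b) mean((p[, a] - p[, b])^2)
F3 <- function(p, a, b, c) mean((p[, a] - p[, b]) * (p[, a] - p[, c]))
F4 <- function(p, a, b, c, d) mean((p[, a] - p[, b]) * (p[, c] - p[, d]))

# F3 and F4 expressed as linear combinations of F2
f3.from.f2 <- (F2(p, "A", "B") + F2(p, "A", "C") - F2(p, "B", "C")) / 2
f4.from.f2 <- (F2(p, "A", "D") + F2(p, "B", "C") -
               F2(p, "A", "C") - F2(p, "B", "D")) / 2

# Same frequencies expressed for the alternate allele
q <- 1 - p

all.equal(F3(p, "A", "B", "C"), f3.from.f2)                       # F3 identity
all.equal(F4(p, "A", "B", "C", "D"), f4.from.f2)                  # F4 identity
all.equal(F2(q, "A", "B"), F2(p, "A", "B"))                       # reference allele invariance
all.equal(F4(p, "B", "A", "C", "D"), -F4(p, "A", "B", "C", "D"))  # sign change when swapping a pair
```

The identities hold SNP by SNP, so they are exact for any set of frequencies and not only in expectation.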
Notice
With I populations, there are $\binom{I}{2} = \frac{1}{2}I(I-1)$ possible F2 (i.e., 15 with I = 6 populations); $3\binom{I}{3} = \frac{1}{2}I(I-1)(I-2)$ possible F3 (i.e., 60 with I = 6 populations); and $3\binom{I}{4} = \frac{1}{8}I(I-1)(I-2)(I-3)$ possible F4 (i.e., 45 with I = 6 populations). However, due to their underlying linear dependency
(see the above definitions), these $\frac{1}{8}I(I-1)(I^2-I+2)$ F-statistics form a vector space of dimension
$\frac{1}{2}I(I-1)$, the basis of which may be specified by the set of all the $\binom{I}{2}$ possible F2 statistics or, given
a reference population i (randomly chosen among the I ones), the set of all the I − 1 F2 statistics of
the form F2(i;j) (with j ≠ i) and the $\binom{I-1}{2}$ F3 statistics of the form F3(i;j,k) (with j ≠ i; k ≠ i and
j ≠ k) (Patterson et al. 2012; Lipson 2020). The resulting basis is informative about population
history and may be used to fit admixture graphs (see section 5).
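These counts are easy to verify with a few lines of base R. The helper below (count.fstats, a hypothetical name introduced for this illustration) simply evaluates the binomial expressions given above.

```r
# Number of F2, F3 and F4 configurations for I populations, their total,
# and the size of the basis made of F2(i;j) and F3(i;j,k) for a reference population i
count.fstats <- function(I) {
  nF2 <- choose(I, 2)
  nF3 <- 3 * choose(I, 3)
  nF4 <- 3 * choose(I, 4)
  c(F2 = nF2, F3 = nF3, F4 = nF4,
    total = nF2 + nF3 + nF4,
    basis = (I - 1) + choose(I - 1, 2))
}

count.fstats(6)
```

With I = 6 this returns 15 F2, 60 F3 and 45 F4 (120 statistics in total), while the basis contains 5 + 10 = 15 elements, i.e., as many as the number of F2 statistics.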
Moreover, although f2 statistics are difficult to interpret or to compare across pairs of populations (see the
notice below), formal tests of population admixture (“3-populations” test) and tests of treeness of population
quadruplets (“4-populations” test) can directly be performed using f3 (see e.g., section 4.1.2) and f4 (see e.g.,
section 4.1.3) statistics respectively (Patterson et al. 2012). Indeed, if f3(A;B,C) < 0, we can conclude
that population A originates from a population that is admixed between two source populations related to
populations B and C respectively (although the signal may vanish if population A has drifted too much
since admixture or if the admixture rates are too close to 0 or 1). Conversely, if f4(A,B;C,D) = 0, the
populations A, B, C and D are related by a bifurcating tree with the unrooted topology (A,B;C,D), although
some may be admixed (if the paths connecting the (A,B) and (C,D) population pairs are not overlapping, see
section 4.1.3 for an example). Finally, under certain circumstances, proportions of ancestry that contributed
to a given admixed population can be estimated with ratios of f4 statistics (see e.g., 4.3) for sets of carefully
chosen related populations (Patterson et al. 2012).
Notice
The parameters F2, F3 and F4 are not scaled with respect to the distribution of marker information
content (i.e., heterozygosities). As a consequence, their resulting estimates may strongly depend on
the chosen set of genetic markers (Patterson et al. 2012). The well-known FST parameter and the
two parameters F3* and D introduced by Patterson et al. (2012) correspond to scaled versions of
F2, F3 and F4 expected to be less sensitive to the SNP ascertainment and thus more comparable across
data sets. As shown in Gautier et al. (2021), these can be defined in terms of IIS probabilities as:
• $F_{ST}(A;B) \equiv \frac{F_2(A;B)}{1-Q_2^{A,B}} = \frac{Q_1^A + Q_1^B - 2Q_2^{A,B}}{2\left(1-Q_2^{A,B}\right)}$
• $F_3^\star(A;B,C) \equiv \frac{F_3(A;B,C)}{1-Q_1^A} = \frac{Q_1^A + Q_2^{B,C} - Q_2^{A,B} - Q_2^{A,C}}{2\left(1-Q_1^A\right)}$
• $D(A,B;C,D) \equiv \frac{F_4(A,B;C,D)}{\left(1-Q_2^{A,B}\right)\left(1-Q_2^{C,D}\right)} = \frac{Q_2^{A,C} + Q_2^{B,D} - Q_2^{A,D} - Q_2^{B,C}}{2\left(1-Q_2^{A,B}\right)\left(1-Q_2^{C,D}\right)}$
Three-population and Four-population tests naturally extend to F3* and D statistics respectively. An
advantage of D over F4 is that it is constrained to the [−1, 1] interval and may thus be interpreted
as the magnitude of deviation from treeness of the tested quadruplet (Patterson et al. 2012).
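As a numerical illustration of these IIS-based definitions, the base R sketch below computes the F parameters and their scaled versions from a set of IIS probabilities. The Q1 and Q2 values are arbitrary numbers chosen for the example (they are not estimates from the simulated data), and the function names are hypothetical.

```r
# Hypothetical IIS probabilities for four populations (illustrative values only)
pops <- c("A", "B", "C", "D")
Q1 <- c(A = 0.84, B = 0.85, C = 0.83, D = 0.86)       # within-population IIS
Q2 <- matrix(0, 4, 4, dimnames = list(pops, pops))    # between-population IIS
Q2["A", "B"] <- 0.82;  Q2["A", "C"] <- 0.80; Q2["A", "D"] <- 0.79
Q2["B", "C"] <- 0.805; Q2["B", "D"] <- 0.80; Q2["C", "D"] <- 0.81
Q2 <- Q2 + t(Q2)                                      # make symmetric (diagonal unused)

# Unscaled parameters written in terms of IIS probabilities
F2 <- function(a, b) (Q1[[a]] + Q1[[b]]) / 2 - Q2[a, b]
F3 <- function(a, b, c) (Q1[[a]] + Q2[b, c] - Q2[a, b] - Q2[a, c]) / 2
F4 <- function(a, b, c, d) (Q2[a, c] + Q2[b, d] - Q2[a, d] - Q2[b, c]) / 2

# Scaled versions
Fst    <- function(a, b) F2(a, b) / (1 - Q2[a, b])
F3star <- function(a, b, c) F3(a, b, c) / (1 - Q1[[a]])
Dstat  <- function(a, b, c, d) F4(a, b, c, d) / ((1 - Q2[a, b]) * (1 - Q2[c, d]))

Fst("A", "B"); F3star("A", "B", "C"); Dstat("A", "B", "C", "D")
```

Note that the within-population terms cancel out in F4, which is why only between-population IIS probabilities appear in its numerator.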
4.1 The compute.fstats function and fstats objects
The compute.fstats function implements unbiased estimators of the parameters F2 (and FST), F3 (and F3*), F4 and D defined above for allele count data (stored in countdata objects, see section 2.1) or Pool-Seq read
count data (stored in pooldata objects, see section 2.2) as described in Gautier et al. (2021) (see footnote 12). The function
also allows estimating within-population heterozygosities (defined as 1 − Q1), which are needed to scale branch
lengths of admixture graphs in drift units (see section 5) or for rooting neighbor-joining trees of unadmixed
populations (see section 6.1). Block-jackknife estimation of the standard errors of the different estimators (needed
for “Three-populations” and “Four-populations” tests, see below; and to fit admixture graphs, see section 5)
and of their covariance (needed to fit admixture graphs, see section 5) may also be performed.
As for the computeFST (section 3.1.1) and compute.pairwiseFST (section 3.2) functions, the compute.fstats function
automatically detects which estimator to use according to the class of the input object (either
countdata or pooldata). The function estimates by default all the within-population heterozygosities, the
F2 (and its scaled version FST, see footnote 13), F3 (and its scaled version F3*) and F4 statistics (see footnote 14).
Computation of D statistics (i.e., scaled F4) is not carried out by default (as specified with the computeDstat
argument set to FALSE by default) since this may add some non-negligible computation time for
data with a large number of populations due to the extra computation of the F4 scaling factor (see footnote 15), although in
the example below (with only 6 populations) the difference in running time is negligible.
The results are then stored in an object of class fstats whose elements (slots) are detailed in the documentation
page accessible with the following command (or the ? operator):
help(fstats)
The underlying f-statistics may then be easily accessed or visualized with the plot_fstats function (or directly
using the plot command that calls plot_fstats for fstats objects) as illustrated below for the allele count
(sim6p.allelecount) and Pool-Seq read count (sim6p.readcount30X) example data (see footnote 16).
##Estimation of f-statistics on count data
sim6p.allelecount.fstats<-compute.fstats(sim6p.allelecount,nsnp.per.bjack.block = 1000,
computeDstat = TRUE)
Estimating Q1
Estimating Q2
Estimating within-population heterozygosities
Estimating F2
Estimating F3
Estimating F4
Computing Dstat
Starting Block-Jackknife sampling
462 Jackknife blocks identified with 462000 SNPs (out of 472410 ).
SNPs map to 20 different chrom/scaffolds
Average (min-max) Block Sizes: 4.232 ( 3.515 - 4.975 ) Mb
computing Q1 averages per blocks
computing Q2 averages per blocks
computing F2 averages per blocks
Starting computation of estimators s.e.
within-pop heterozygosity s.e. estimation done
F2 s.e. estimation done
F3 and F3* s.e. estimation done
estimating F4 and Dstat s.e. (may be long since require denominator averages per blocks)
F4 and D s.e. estimation done
Overall Analysis Time: 0 h 0 m 3 s
12 Although not defined in the same way, the estimators for allele count data are strictly equivalent to those of Patterson et al.
(2012)
13 The estimator is then actually exactly the same as the one implemented in the compute.pairwiseFST or computeFST
functions when the argument method=“Identity”
14 The compute.fstats function is optimized in such a way that the computational cost for the estimation of pairwise FST, F3, F3* and F4 from the F2-statistics and Q2 estimates is negligible
15 For instance on a tested allele count real data set consisting of 640,000 SNPs genotyped on 24 populations, compute.fstats
ran in 2 m 56 s (6 m 50 s if nsnp.per.bjack.block = 5000) with computeDstat=TRUE and only 7 s (16 s) if computeDstat=FALSE
16 Note that estimates of the different statistics are highly similar between the allele count and the Pool-Seq read count data.
sim6p.allelecount.fstats
* * * fstats Object * * *
Example of useful visualization functions are plot.fstats
##Estimation of f-statistics on Pool-Seq data (without computation of Dstat)
sim6p.readcount30X.fstats<-compute.fstats(sim6p.readcount30X,nsnp.per.bjack.block = 1000,
verbose=FALSE)
##Estimation of f-statistics on Pool-Seq data (with computation of Dstat)
sim6p.readcount30X.fstats<-compute.fstats(sim6p.readcount30X,nsnp.per.bjack.block = 1000,
computeDstat = TRUE,verbose=FALSE)
4.1.1 f2 and FST estimates (f2.values and fst.values slots of the fstats object)
# count data (3 first f2)
head(sim6p.allelecount.fstats@f2.values,3)
Estimate bjack mean bjack s.e.
P1,P2 0.008294424 0.008274108 0.0001298376
P1,P3 0.016643720 0.016680443 0.0002108642
P1,P4 0.033924154 0.033971079 0.0003456714
# 30X Pool-Seq data (3 first f2)
head(sim6p.readcount30X.fstats@f2.values,3)
Estimate bjack mean bjack s.e.
Pool1,Pool2 0.00829429 0.008275487 0.0001351787
Pool1,Pool3 0.01665818 0.016696323 0.0002159221
Pool1,Pool4 0.03398383 0.034045968 0.0003525136
# count data (3 first Fst)
head(sim6p.allelecount.fstats@fst.values,3)
Estimate bjack mean bjack s.e.
P1,P2 0.04946710 0.04936227 0.0007453263
P1,P3 0.09478083 0.09499692 0.0010947422
P1,P4 0.17816331 0.17843286 0.0015418662
# 30X Pool-Seq data (3 first Fst)
head(sim6p.readcount30X.fstats@fst.values,3)
Estimate bjack mean bjack s.e.
Pool1,Pool2 0.04947320 0.04937253 0.000774369
Pool1,Pool3 0.09487048 0.09508626 0.001119595
Pool1,Pool4 0.17839940 0.17872795 0.001572758
Notice that the pairwise FST estimates are the same as those obtained with the compute.pairwiseFST function
(see section 3.2) run with method=“Identity” (which are here actually equal to the estimates obtained by the
default Anova method because there is no variation in the total allele counts across SNPs).
Notice
By construction F2(A;B) = F2(B;A) (and FST(A;B) = FST(B;A)). If iP is the index of population P
in the popnames or poolnames slots of the countdata or pooldata objects (i.e., the column order in
the corresponding allele or read count data matrices) used to obtain the fstats object, the F2(A;B)
(resp. FST(A;B)) configurations reported in the slot f2.values (resp. fst.values) satisfy iA < iB.
4.1.2 f3 and f3* estimates (f3.values and f3star.values slots of the fstats object) and 3-Population tests
# count data (3 first f3)
head(sim6p.allelecount.fstats@f3.values,3)
Estimate bjack mean bjack s.e. Z-score
P1;P2,P3 0.004053338 0.004048392 0.0001176972 34.39668
P1;P2,P4 0.004089298 0.004087474 0.0001326605 30.81155
P1;P2,P5 0.004155216 0.004149524 0.0001341405 30.93416
# 30X Pool-Seq data (3 first f3)
head(sim6p.readcount30X.fstats@f3.values,3)
Estimate bjack mean bjack s.e. Z-score
Pool1;Pool2,Pool3 0.004057570 0.004053794 0.0001247034 32.50750
Pool1;Pool2,Pool4 0.004114142 0.004114670 0.0001365234 30.13893
Pool1;Pool2,Pool5 0.004157049 0.004154896 0.0001411765 29.43051
# count data (3 first f3*)
head(sim6p.allelecount.fstats@f3star.values,3)
Estimate bjack mean bjack s.e. Z-score
P1;P2,P3 0.02552286 0.02549608 0.0007467871 34.14102
P1;P2,P4 0.02574929 0.02574221 0.0008480235 30.35554
P1;P2,P5 0.02616436 0.02613299 0.0008581050 30.45430
# 30X Pool-Seq data (3 first f3*)
head(sim6p.readcount30X.fstats@f3star.values,3)
Estimate bjack mean bjack s.e. Z-score
Pool1;Pool2,Pool3 0.02555404 0.02553297 0.0007888059 32.36915
Pool1;Pool2,Pool4 0.02591032 0.02591640 0.0008722134 29.71337
Pool1;Pool2,Pool5 0.02618054 0.02616977 0.0009009973 29.04533
Notice
By construction F3(A;B,C) = F3(A;C,B) (and F3*(A;B,C) = F3*(A;C,B)). If iP is the index of
population P in the popnames or poolnames slots of the countdata or pooldata objects (i.e., the
column order in the corresponding allele or read count data matrices) used to obtain the fstats object,
the F3(A;B,C) (and F3*(A;B,C)) configurations reported in the slot f3.values (resp. f3star.values) satisfy iB < iC.
As shown in the above example, activation of block-jackknife estimation of standard errors (i.e., argument
nsnp.per.bjack.block>0) results in the computation of Z-scores (i.e., the ratio of the block-jackknife estimated
mean to the standard error), which quantify the deviation of the estimated f3-statistics from 0 (in units of s.e.).
This gives a simple decision criterion for three-population tests of admixture (i.e., negative f3 or negative f3*).
For instance a Z-score < −1.65 provides evidence for admixture (i.e., significantly negative f3) at the 95%
significance threshold:
# count data (F3-based 3-pop test)
tst.sel <- sim6p.allelecount.fstats@f3.values$`Z-score` < -1.65
sim6p.allelecount.fstats@f3.values[tst.sel,]
Estimate bjack mean bjack s.e. Z-score
P6;P2,P3 -0.0002671143 -0.0002623528 8.638581e-05 -3.036989
# 30X Pool-Seq data (F3-based 3-pop test)
tst.sel <- sim6p.readcount30X.fstats@f3.values$`Z-score` < -1.65
sim6p.readcount30X.fstats@f3.values[tst.sel,]
Estimate bjack mean bjack s.e. Z-score
Pool6;Pool2,Pool3 -0.0002417509 -0.0002304611 8.851744e-05 -2.603567
# count data (F3*-based 3-pop test)
tst.sel <- sim6p.allelecount.fstats@f3star.values$`Z-score` < -1.65
sim6p.allelecount.fstats@f3star.values[tst.sel,]
Estimate bjack mean bjack s.e. Z-score
P6;P2,P3 -0.001631657 -0.001603205 0.0005274779 -3.039379
# 30X Pool-Seq data (F3*-based 3-pop test)
tst.sel <- sim6p.readcount30X.fstats@f3star.values$`Z-score` < -1.65
sim6p.readcount30X.fstats@f3star.values[tst.sel,]
Estimate bjack mean bjack s.e. Z-score
Pool6;Pool2,Pool3 -0.001475866 -0.001407435 0.0005402774 -2.605024
In agreement with the simulated scenario (Figure 1), both the allele count and Pool-Seq read count data
support an admixed origin for population P6 with ancestral sources related to P2 and P3.
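The decision rule of the three-population test can be reproduced by hand from the reported block-jackknife mean and standard error. The sketch below uses the values printed above for the P6;P2,P3 configuration (allele count data); interpreting the Z-score through a one-sided standard normal P-value is an assumption made here for illustration.

```r
# Values copied from the f3.values output for P6;P2,P3 (allele count data)
bjack.mean <- -0.0002623528
bjack.se   <- 8.638581e-05

z <- bjack.mean / bjack.se       # Z-score: deviation from 0 in units of s.e.
p.one.sided <- pnorm(z)          # one-sided P-value (standard normal approximation, assumed)
admixture.signal <- z < -1.65    # the space matters: `z<-1.65` would assign 1.65 to z
```

The Z-score obtained this way matches the one reported by compute.fstats up to the rounding of the printed values.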
4.1.3 f4 and D estimates (f4.values and Dstat.values slots of the fstats object) and 4-Population tests
# count data (3 first f4)
head(sim6p.allelecount.fstats@f4.values,3)
Estimate bjack mean bjack s.e. Z-score
P1,P2;P3,P4 3.595944e-05 3.908225e-05 1.154193e-04 0.3386111
P1,P2;P3,P5 1.018776e-04 1.011325e-04 1.197917e-04 0.8442359
P1,P2;P3,P6 5.594310e-04 5.622762e-04 6.138658e-05 9.1595941
# 30X Pool-Seq data (3 first f4)
head(sim6p.readcount30X.fstats@f4.values,3)
Estimate bjack mean bjack s.e. Z-score
Pool1,Pool2;Pool3,Pool4 5.657166e-05 6.087576e-05 1.194004e-04 0.5098456
Pool1,Pool2;Pool3,Pool5 9.947887e-05 1.011014e-04 1.253135e-04 0.8067878
Pool1,Pool2;Pool3,Pool6 5.653400e-04 5.621051e-04 6.409453e-05 8.7699380
# count data (3 first D)
head(sim6p.allelecount.fstats@Dstat.values,3)
Estimate bjack mean bjack s.e. Z-score
P1,P2;P3,P4 0.0006651386 0.0007230774 0.002135688 0.3385688
P1,P2;P3,P5 0.0018826436 0.0018705234 0.002216171 0.8440338
P1,P2;P3,P6 0.0107508025 0.0108152267 0.001182842 9.1434253
# 30X Pool-Seq data (3 first D)
head(sim6p.readcount30X.fstats@Dstat.values,3)
Estimate bjack mean bjack s.e. Z-score
Pool1,Pool2;Pool3,Pool4 0.001045743 0.001125524 0.002208046 0.5097373
Pool1,Pool2;Pool3,Pool5 0.001837431 0.001868847 0.002316956 0.8065961
Pool1,Pool2;Pool3,Pool6 0.010853673 0.010799873 0.001235136 8.7438717
Notice
When comparing two pairs of populations $(A,B)$ and $(C,D)$, the $f_4$ statistics for the 8 quadruplets (A,B;C,D); (B,A;C,D); (A,B;D,C); (B,A;D,C); (C,D;A,B); (C,D;B,A); (D,C;A,B) and (D,C;B,A) have the same absolute value by definition of the $F_4$ parameter:
$F_4(A,B;C,D) = F_4(B,A;D,C) = F_4(C,D;A,B) = F_4(D,C;B,A)$
$-F_4(A,B;C,D) = F_4(B,A;C,D) = F_4(A,B;D,C) = F_4(C,D;B,A) = F_4(D,C;A,B)$
and similarly
$D(A,B;C,D) = D(B,A;D,C) = D(C,D;A,B) = D(D,C;B,A)$
$-D(A,B;C,D) = D(B,A;C,D) = D(A,B;D,C) = D(C,D;B,A) = D(D,C;A,B)$
If $i_P$ is the index of population $P$ in the popnames or poolnames slots of the countdata or pooldata objects (i.e., the column order in the corresponding allele or read count data matrices) used to obtain the fstats object, the $F_4(A,B;C,D)$ (and $D(A,B;C,D)$) configurations reported in the slot f4.values (and Dstat.values) satisfy $i_A < i_B$; $i_A < i_C$ and $i_C < i_D$.
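The sign relationships among the 8 equivalent quadruplets follow directly from writing $F_4(A,B;C,D)$ as the covariance of the allele frequency differences $(p_A-p_B)$ and $(p_C-p_D)$. The short base R sketch below checks them numerically on simulated allele frequencies. It uses a plain plug-in mean for illustration only (not the unbiased estimators implemented in poolfstat), and the function name f4 is hypothetical.

```r
# Simulated allele frequencies for four populations (illustrative data only)
set.seed(1)
p <- matrix(runif(4 * 1000), ncol = 4,
            dimnames = list(NULL, c("A", "B", "C", "D")))

# Plug-in F4(i,j;k,l): mean over SNPs of (p_i - p_j) * (p_k - p_l)
f4 <- function(i, j, k, l) mean((p[, i] - p[, j]) * (p[, k] - p[, l]))

ref <- f4("A", "B", "C", "D")
# Configurations with the same sign as F4(A,B;C,D)
same <- c(f4("B", "A", "D", "C"), f4("C", "D", "A", "B"), f4("D", "C", "B", "A"))
# Configurations with the opposite sign
opposite <- c(f4("B", "A", "C", "D"), f4("A", "B", "D", "C"),
              f4("C", "D", "B", "A"), f4("D", "C", "A", "B"))

all.equal(same, rep(ref, 3))
all.equal(opposite, rep(-ref, 4))
```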
As for $f_3$ and $f_3^\star$ (section 4.1.2), activating block-jackknife estimation of standard errors (i.e., the argument nsnp.per.bjack.block>0) results in the computation of Z-scores (i.e., ratio of the block-jackknife estimated mean and standard error) which quantify the deviation of the estimated $f_4$-statistics from 0 (in units of s.e.). This gives a simple decision criterion for four-population tests of treeness (treeness being rejected when $F_4$ or $D$ is significantly non-null). For instance, a Z-score lower than 1.96 in absolute value provides no evidence against the null hypothesis of treeness for the tested population configuration at the 95% significance threshold:
# count data
tst.sel<-abs(sim6p.allelecount.fstats@f4.values$`Z-score`)<1.96
sim6p.allelecount.fstats@f4.values[tst.sel,]
Estimate bjack mean bjack s.e. Z-score
P1,P2;P3,P4 3.595944e-05 3.908225e-05 1.154193e-04 0.3386111
P1,P2;P3,P5 1.018776e-04 1.011325e-04 1.197917e-04 0.8442359
P1,P2;P4,P5 6.591816e-05 6.205022e-05 9.313675e-05 0.6662270
P1,P3;P4,P5 1.309202e-05 1.464589e-05 1.268061e-04 0.1154983
P1,P6;P4,P5 4.031413e-05 4.458874e-05 1.058348e-04 0.4213050
P2,P3;P4,P5 -5.282615e-05 -4.740433e-05 1.302898e-04 -0.3638376
P2,P6;P4,P5 -2.560403e-05 -1.746147e-05 1.085089e-04 -0.1609221
P3,P6;P4,P5 2.722212e-05 2.994286e-05 7.650662e-05 0.3913760
# 30X Pool-Seq data
tst.sel<-abs(sim6p.readcount30X.fstats@f4.values$`Z-score`)<1.96
as.data.frame(sim6p.readcount30X.fstats@f4.values)[tst.sel,]
Estimate bjack mean bjack s.e. Z-score
Pool1,Pool2;Pool3,Pool4 5.657166e-05 6.087576e-05 1.194004e-04 0.50984558
Pool1,Pool2;Pool3,Pool5 9.947887e-05 1.011014e-04 1.253135e-04 0.80678780
Pool1,Pool2;Pool4,Pool5 4.290721e-05 4.022566e-05 9.502690e-05 0.42330817
Pool1,Pool3;Pool4,Pool5 3.769747e-07 -3.302134e-06 1.277902e-04 -0.02584028
Pool1,Pool6;Pool4,Pool5 4.073843e-05 4.734750e-05 1.053345e-04 0.44949661
Pool2,Pool3;Pool4,Pool5 -4.253023e-05 -4.352780e-05 1.336022e-04 -0.32580156
Pool2,Pool6;Pool4,Pool5 -2.168779e-06 7.121841e-06 1.102031e-04 0.06462467
Pool3,Pool6;Pool4,Pool5 4.036145e-05 5.064964e-05 8.000836e-05 0.63305432
In other words, for both the allele count and Pool-Seq read count data, the only quadruplets providing no evidence against the null hypothesis of treeness at the 95% threshold are those involving i) non-admixed populations only (P1, P2, P3, P4 and P5), for configurations consistent with the simulated scenario; and ii) the admixed population P6, for configurations of the form (P6,X;P4,P5) where P4 and P5 are the two outgroup populations and X=P1, P2 or P3. This is actually expected since, for these latter quadruplets, the path connecting P4 and P5 does not overlap with any of the paths connecting P6 to P1, P2 or P3 in the simulated graph (Figure 1).
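For readers unfamiliar with the block-jackknife, the following base R sketch shows a generic delete-one-block jackknife for a statistic defined as a mean over SNPs, together with the resulting Z-score. It is a simplified illustration under stated assumptions (equal-sized blocks of consecutive SNPs, a hypothetical vector x of per-SNP contributions); the actual poolfstat implementation may differ in how blocks are delimited and how estimates are combined. The function name block_jackknife is hypothetical.

```r
# Generic delete-one-block jackknife for a statistic that is a mean over SNPs
block_jackknife <- function(x, nsnp.per.block) {
  nblocks <- length(x) %/% nsnp.per.block
  x <- x[seq_len(nblocks * nsnp.per.block)]             # drop trailing SNPs
  block <- rep(seq_len(nblocks), each = nsnp.per.block) # block label of each SNP
  # statistic recomputed after removing each block in turn
  loo <- sapply(seq_len(nblocks), function(b) mean(x[block != b]))
  bj.mean <- mean(loo)
  bj.se <- sqrt((nblocks - 1) / nblocks * sum((loo - bj.mean)^2))
  c(estimate = mean(x), bjack.mean = bj.mean, bjack.se = bj.se,
    z.score = bj.mean / bj.se)
}

# Hypothetical per-SNP contributions to an f-statistic
set.seed(42)
x <- rnorm(5000, mean = 1e-4, sd = 5e-3)
bj <- block_jackknife(x, nsnp.per.block = 100)
bj
```

In this simple setting (equal block sizes, statistic equal to an overall mean), the jackknife standard error reduces to the standard deviation of the block means divided by the square root of the number of blocks.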
4.1.4 Population heterozygosity estimates (heterozygosities slot of the fstats object)
# count data (3 first populations)
head(sim6p.allelecount.fstats@heterozygosities,3)
Estimate bjack mean bjack s.e.
P1 0.1588121 0.1587849 0.0005972556
P2 0.1599502 0.1599070 0.0006057583
P3 0.1591048 0.1590328 0.0005900926
# 30X Pool-Seq data (3 first populations)
head(sim6p.readcount30X.fstats@heterozygosities,3)
Estimate bjack mean bjack s.e.
Pool1 0.1587839 0.1587670 0.0006012812
Pool2 0.1599319 0.1599083 0.0006096762
Pool3 0.1590771 0.1590230 0.0006024449
4.2 The plot_fstats function for visualization of f2 (and pairwise FST), f3 (and f3*) and f4 (and D) estimates and their confidence intervals
The plot_fstats function (that may be called directly using plot on fstats objects) allows plotting all or only some (using the pop.sel, pop.f3.target or value.range arguments) of the estimated $f_2$, $F_{ST}$, $f_3$, $f_3^\star$, $f_4$ or $D$ statistics from an fstats object. In addition, for $f_3$, $f_3^\star$, $f_4$ and $D$ statistics, the highlight.signif argument allows highlighting in red significant (as defined with the ci.perc argument) three-population or four-population tests. Some example plots are shown in Figures 5, 6 and 7, which were generated with the following code:
4.2.1 Example of f2 statistics plot (Figure 5)
layout(matrix(1:2,2,1,byrow=T))
plot(sim6p.allelecount.fstats,main="F2 (Allele Count)")
plot(sim6p.readcount30X.fstats,main="F2 (30X Pool-Seq)")
Similar plots may be obtained for the scaled $f_2$ (i.e., pairwise $F_{ST}$) by specifying stat.name="Fst" (see also the compute.pairwiseFST function described in section 3.2).
4.2.2 Example of f3 statistics plot (Figure 6)
layout(matrix(1:4,2,2,byrow=T))
plot(sim6p.allelecount.fstats,stat.name="F3",main="F3 (Allele Count)")
plot(sim6p.readcount30X.fstats,stat.name="F3",main="F3 (30X Pool-Seq)")
plot(sim6p.readcount30X.fstats,stat.name="F3",pop.f3.target=c("Pool6","Pool1"),
main="30X Pool-Seq (only F3 with P6 or P1 as target pops)")
plot(sim6p.readcount30X.fstats,stat.name="F3",value.range=c(NA,5e-3),
main="30X Pool-Seq (only F3 < 5e-3)")
Similar plots may be obtained for the scaled $f_3$ (i.e., $f_3^\star$) by specifying stat.name="F3star".
4.2.3 Example of f4 and D statistics plot (Figure 7)
layout(matrix(1:6,3,2,byrow=T))
plot(sim6p.allelecount.fstats,stat.name="Dstat",main="D (Allele Count)")
plot(sim6p.readcount30X.fstats,stat.name="Dstat",main="D (30X Pool-Seq)")
plot(sim6p.allelecount.fstats,stat.name="F4",main="F4 (Allele Count)")
plot(sim6p.readcount30X.fstats,stat.name="F4",main="F4 (30X Pool-Seq)")
plot(sim6p.readcount30X.fstats,stat.name="F4",pop.sel=c("Pool1","Pool2"),
main="30X Pool-Seq (only F4 with both P1 and P2)")
plot(sim6p.readcount30X.fstats,stat.name="F4",value.range=c(-2e-3,2e-3),
main="30X Pool-Seq (only -2e-3 < F4 < 2e-3)")
Figure 5: Estimated $f_2$ statistics with their 95% confidence intervals for the allele count (panel "F2 (Allele Count)") and 30X Pool-Seq (panel "F2 (30X Pool-Seq)") data sets
4.3 Estimating admixture proportions with f4-ratios
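Given an admixture graph (assumed to be correct), ratios of $f_4$-statistics (Patterson et al. 2012) may provide estimates of the relative contributions of the ancestral sources of a (two-way) admixed population (P6 in our example) if outgroups (P1 and P4 or P5 in our example) for the two source population proxies (e.g., P2 and P3 in our example) have been sampled. For instance, with $F_4(A,B;C,D)$ defined as the covariance of the allele frequency differences $(p_A-p_B)$ and $(p_C-p_D)$, the proportion $\alpha$ of P2-related ancestry in population P6 (Figure 1) is equal to:
$$\alpha = \frac{F_4(P1,P4;P6,P3)}{F_4(P1,P4;P2,P3)} = \frac{F_4(P1,P5;P6,P3)}{F_4(P1,P5;P2,P3)}$$
Equivalently, the same ratios written with the numerator quadruplets (P1,P4;P3,P6) and (P1,P5;P3,P6) are equal to $-\alpha$, since swapping the two populations of a pair changes the sign of $F_4$ (section 4.1.3).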
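The ratio can be motivated by a short argument, sketched here under the assumption that the graph of Figure 1 is the one specified in section 5.1.1, i.e., that P6 derives a fraction $\alpha$ of its ancestry from the internal node s1 (on the P2 side) and a fraction $1-\alpha$ from the internal node s2 (on the P3 side). Because the drift accumulated below s1, s2 and S is independent of the allele frequency difference between P1 and P4:

```latex
\begin{aligned}
F_4(P1,P4;P6,P3) &= \alpha\,F_4(P1,P4;s1,P3) + (1-\alpha)\,F_4(P1,P4;s2,P3) \\
F_4(P1,P4;s1,P3) &= F_4(P1,P4;P2,P3) \\
F_4(P1,P4;s2,P3) &= 0 \\
\Rightarrow\quad \alpha &= \frac{F_4(P1,P4;P6,P3)}{F_4(P1,P4;P2,P3)}
\qquad\text{and}\qquad
\frac{F_4(P1,P4;P3,P6)}{F_4(P1,P4;P2,P3)} = -\alpha
\end{aligned}
```

The second equality holds because P2 descends from s1 without sharing any further drift with P1 or P4, and the third because the path from s2 to P3 does not overlap the path from P1 to P4. The same reasoning applies with P5 in place of P4.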
Figure 6: Estimated $f_3$ statistics with their 95% confidence intervals for the allele count and 30X Pool-Seq data sets. Panels: "F3 (Allele Count)"; "F3 (30X Pool-Seq)"; "30X Pool-Seq (only F3 with P6 or P1 as target pops)"; "30X Pool-Seq (only F3 < 5e-3)".
The compute.f4ratio function implements $f_4$-ratio based estimators of admixture proportion from an fstats object. It requires specifying both the numerator (num.quadruplet argument) and the denominator (den.quadruplet argument) $F_4$ quadruplets[17]. The following example illustrates how to use the function to estimate the admixture proportion $\alpha$ ($\alpha_{simulated} = 0.25$) from the P2-related source that contributed to the P6 ancestral population (Figure 1):
[17] In practice, the required $f_4$ may not be directly available from the f4.values slot of the fstats object since, as mentioned in section 4.1.3, only one out of the 8 equivalent quadruplet configurations is reported (the chosen one only depending on the ordering of the populations in the countdata or pooldata objects used for estimation). However, the function automatically computes the $f_4$ for the numerator and denominator configurations specified by the user from the estimates available in the input fstats object.
Figure 7: Estimated $f_4$ and $D$ statistics with their 95% confidence intervals for the allele count and 30X Pool-Seq data sets. Panels: "D (Allele Count)"; "D (30X Pool-Seq)"; "F4 (Allele Count)"; "F4 (30X Pool-Seq)"; "30X Pool-Seq (only F4 with both P1 and P2)"; "30X Pool-Seq (only -2e-3 < F4 < 2e-3)".
# count data (two possible estimates)
compute.f4ratio(sim6p.allelecount.fstats,num.quadruplet = "P1,P4;P3,P6",
den.quadruplet="P1,P4;P2,P3")
[1] -0.2497209
compute.f4ratio(sim6p.allelecount.fstats,num.quadruplet = "P1,P5;P3,P6",
den.quadruplet="P1,P5;P2,P3")
[1] -0.2462923
# 30X Pool-Seq data (two possible estimates)
compute.f4ratio(sim6p.readcount30X.fstats,num.quadruplet = "Pool1,Pool4;Pool3,Pool6",
den.quadruplet="Pool1,Pool4;Pool2,Pool3")
[1] -0.25285
compute.f4ratio(sim6p.readcount30X.fstats,num.quadruplet = "Pool1,Pool5;Pool3,Pool6",
den.quadruplet="Pool1,Pool5;Pool2,Pool3")
[1] -0.24559
Standard errors of the estimated admixture proportions are automatically computed if the input fstats object was obtained by running the compute.fstats function with return.F4.blockjackknife.samples = TRUE (and nsnp.per.bjack.block>0 to activate block-jackknife estimation of standard errors), as illustrated below for the Pool-Seq read count data:
sim6p.readcount30X.fstats<-compute.fstats(sim6p.readcount30X,nsnp.per.bjack.block = 1000,
verbose=FALSE,return.F4.blockjackknife.samples = TRUE)
alpha.est=compute.f4ratio(sim6p.readcount30X.fstats,num.quadruplet = "Pool1,Pool4;Pool3,Pool6",
den.quadruplet="Pool1,Pool4;Pool2,Pool3")
#95% c.i. of alpha derived from the se
alpha.est[2]+c(-1.96,1.96)*alpha.est[3]
[1] -0.3013463 -0.2100506
#prediction error in units of s.e.
as.numeric(abs(0.25-alpha.est[2])/alpha.est[3])
[1] 21.71337
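Note that the numerator quadruplets used above are written as (P1,P4;P3,P6) and (P1,P5;P3,P6), i.e., with P3 and P6 swapped relative to the configuration (P1,P4;P6,P3) whose ratio to $F_4(P1,P4;P2,P3)$ equals $\alpha$. Since swapping the two populations of a pair changes the sign of $F_4$ (section 4.1.3), the values returned above estimate $-\alpha$, which explains their negative sign and the large value (21.7) of the prediction error computed against 0.25. After changing the sign of the estimate (or, equivalently, specifying P6 before P3 in the numerator quadruplet), the 95% confidence interval becomes [0.210; 0.301]: it contains the simulated value ($\alpha = 0.25$), which lies less than 0.25 standard error away from the $F_4$-ratio estimate.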
5 Using f-statistics to estimate parameters of admixture graphs
The f-statistics can be used to estimate the parameters (branch lengths and/or admixture proportions) of trees or admixture graphs (i.e., trees including admixture edges) that summarize the demographic history of the surveyed populations. The approach implemented in poolfstat to fit admixture graphs from f-statistics is directly inspired by (and actually highly similar to) the one used in the qpGraph software originally described by Patterson et al. (2012; see also Lipson 2020). The core functions used for admixture graph fitting consist of:
• the generate.graph.params function to define graph parameters specifying the candidate graph to fit and the underlying f-statistics provided as an fstats object (section 4)
• the fit.graph function to estimate graph parameters using an optimization algorithm
• the compare.fitted.fstats function to assess model fit by comparing estimated and fitted f-statistics (Patterson et al. 2012; Lipson 2020)
• the add.leave function to evaluate all the possible admixture graphs (or trees) resulting from the addition of a new leaf to an existing graph (connected with either non-admixed or admixed edges).
5.1 Generating a graph.params object with the generate.graph.params function
5.1.1 Specifying the structure of the admixture graph in a graph.params object
Admixture graph specifications, including the structure of the graph (i.e., its topology consisting of edges and, if any, admixture proportions), are defined in an object of class graph.params detailed in the documentation page accessible with the following command (or the ? operator):
help(graph.params)
The graph.params objects may be constructed with the generate.graph.params function from a user-defined (character) matrix specifying the structure of the admixture graph (or tree if no admixture edges are included) and consisting of three columns defining for each edge (whether admixed or not) i) the child node; ii) the parent node; iii) the admixture proportion (blank for non-admixed edges). As a result, in the input matrix, each admixture event is specified by two rows for the two admixture edges corresponding to the same admixed child node and two different source nodes as the parent node (i.e., source populations). Their third column elements contain the two underlying admixture proportions coded as a and (1-a) (the parentheses are mandatory and an error message is printed if absent) where a is the name of the admixture proportion (names of admixture proportions should not include spaces). The example below shows the construction of a graph.params object specifying the admixture graph for the scenario used to simulate the example data (Figure 1):
sim.graph<-rbind(c("P1","P7",""),c("P2","s1",""),c("P3","s2",""),c("P6","S",""),
c("S","s1","a"),c("S","s2","(1-a)"),c("s2","P8",""),c("s1","P7",""),
c("P4","P9",""),c("P5","P9",""),c("P7","P8",""),
c("P8","R",""),c("P9","R",""))
sim.graph.params<-generate.graph.params(sim.graph)
sim.graph.params
* * * graph.params Object * * *
Example of useful functions are:
plot() to visualize the graph (interface for grViz() from the DiagrammeR package)
fit.graph() to estimate graph parameter values
****************
The root is automatically identified by the generate.graph.params function (as the node only present in the
parent node column) and several checks are made within the function. It is however recommended to check
the graph by plotting using the plot function. The following code shows how to plot the graph stored in the
example sim.graph.params object that was generated above (Figure 8):
plot(sim.graph.params)
Figure 8: Plot of the admixture graph specifying the simulated scenario
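Before plotting, the three-column edge matrix can also be sanity-checked by hand. The base R sketch below mimics the kind of checks described in the text (root identified as the only node never appearing as a child, leaves as nodes never appearing as a parent, two incoming edges per admixed node). It is an illustration only and does not reproduce the internal code of generate.graph.params.

```r
# Edge matrix of the simulated scenario: child node, parent node, admixture proportion
sim.graph <- rbind(c("P1","P7",""), c("P2","s1",""), c("P3","s2",""), c("P6","S",""),
                   c("S","s1","a"), c("S","s2","(1-a)"), c("s2","P8",""), c("s1","P7",""),
                   c("P4","P9",""), c("P5","P9",""), c("P7","P8",""),
                   c("P8","R",""), c("P9","R",""))
child  <- sim.graph[, 1]
parent <- sim.graph[, 2]
adm    <- sim.graph[, 3]

root    <- setdiff(parent, child)      # node only present in the parent column
leaves  <- setdiff(child, parent)      # nodes only present in the child column
admixed <- unique(child[adm != ""])    # nodes reached through admixture edges
n.in    <- table(child)[admixed]       # number of incoming edges per admixed node

root
leaves
admixed
```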
Notice
The plot function applied to graph.params objects internally calls the grViz function from the DiagrammeR package (Iannone 2020), which actually generates an object of class htmlwidget that will print itself into HTML in a browser. Also, if run within the RStudio IDE, the graph will be plotted in the Viewer pane[a]. Hence, graph plots may not be easily exported from R into a pdf (or other) device although the following trick may be useful to that end:
# load the required packages one by one (require() only accepts a single package)
library(webshot) ; library(htmlwidgets) ; library(imager)
tmp<-plot(sim.graph.params) #plot the graph
saveWidget(tmp,"tmp.html") ; webshot("tmp.html","tmp.png")
load.image("tmp.png")%>% autocrop() %>% plot(axes=F)
To insert a graph plot into an external document or edit it (e.g., to change node or edge colors, node names, etc.), one may also directly rely on its dot coded definition[b] stored in the dot.graph slot of the graph.params (or fitted.graph, see section 5.2) object outside R. A graph dot file may also be generated by specifying an output file name prefix with the outfileprefix argument of the generate.graph.params function[c].
[a] Regularly clearing the Viewer items using the broom icon is recommended.
[b] From the open source graph visualization software graphviz (https://graphviz.org/). For instance, dot files can be converted into png files using the command dot -Tpng inputgraph.dot -o outputgraph.png in a Linux terminal. Several online user-friendly implementations also allow very convenient manipulation of dot files from a web browser, see e.g., https://dreampuf.github.io/GraphvizOnline
[c] Alternatively, one may also use the command writeLines(x@dot.graph,con=outfile) where x is the graph.params object and outfile is the desired name of the dot output file (e.g., "out.dot").
The graph.params object produced by the generate.graph.params function includes a symbolic representation of the graph incidence matrix (slot graph.matrix), which consists of an $n_l$ leaves by $n_e$ edges matrix containing the edge weights for the paths from each leaf to the root[18]. Examples for the sim.graph.params object that specifies the simulation scenario are given below:
#names of the edges (automatically given)
sim.graph.params@edges.names
[1] "P7<->P1" "s1<->P2" "s2<->P3" "S<->P6" "P8<->s2" "P7<->s1" "P9<->P4" "P9<->P5" "P8<->P7" "R<->P8" "R<->P9"
#names of the admixture proportions (automatically given)
sim.graph.params@adm.params.names
[1] "a"
#graph incidence matrix
sim.graph.params@graph.matrix
P7<->P1 s1<->P2 s2<->P3 S<->P6 P8<->s2 P7<->s1 P9<->P4 P9<->P5 P8<->P7 R<->P8 R<->P9
P1 "1" "0" "0" "0" "0" "0" "0" "0" "1" "1" "0"
P2 "0" "1" "0" "0" "0" "1" "0" "0" "1" "1" "0"
P3 "0" "0" "1" "0" "1" "0" "0" "0" "0" "1" "0"
P6 "0" "0" "0" "1" "1-a" "a" "0" "0" "a" "1" "0"
P4 "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "1"
P5 "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "1"
[18] In the symbolic representation, the names for the graph edges and admixture proportions correspond to those stored in the edges.names and adm.params.names slots of the graph.params object, respectively.
Notice
As mentioned by Pickrell and Pritchard (2012), Patterson et al. (2012) and Lipson (2020), the three branch lengths surrounding an admixture event (e.g., edges $s1 \leftrightarrow S$, $s2 \leftrightarrow S$ and $S \leftrightarrow P6$ connecting s1 to S; s2 to S; and S to P6 respectively for the admixture event in Figure 8) are not identifiable and can only be estimated jointly in a single compound parameter (e.g., $\zeta = a^2 \times e_{s1 \leftrightarrow S} + (1-a)^2 \times e_{s2 \leftrightarrow S} + e_{S \leftrightarrow P6}$ in Figure 8) unless samples from the source populations (s1, s2 and/or S) or samples from different populations deriving from the same admixed source are available (Lipson 2020). Following Patterson et al. (2012) and Lipson (2020), this identifiability issue is solved by nullifying the length of admixture edges (i.e., setting $e_{s1 \leftrightarrow S} = e_{s2 \leftrightarrow S} = 0$ in the above example), which may lead to some overestimation of the divergence (branch length) of the admixed population (here P6) from its direct admixed ancestor (here S) if the two source populations (here s1 and s2) have experienced strong divergence since their separation from the graph connecting the other populations (which is not the case in the simulated example, in which both $e_{s1 \leftrightarrow S}$ and $e_{s2 \leftrightarrow S}$ are equal to 0). This differs from the choice made by Pickrell and Pritchard (2012) in the Treemix model, which consists of setting $e_{S \leftrightarrow P6} = 0$ and $e_{s2 \leftrightarrow S} = 0$ (if $a > 0.5$) or $e_{s1 \leftrightarrow S} = 0$ (if $a < 0.5$).
The graph incidence matrix plays a pivotal role in graph fitting since the fit.graph function (see section 5.2) uses it to build the model equations. A (simplified) symbolic representation of these model equations, together with expressions for the parameters $F_2$, $F_3$ and $F_4$, can also be generated from a graph.params object with the graph.params2symbolic.fstats function (see section 7.1).
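To make the role of the incidence matrix concrete, the base R sketch below evaluates it numerically for a given admixture rate and derives expected $F_3$ and $F_4$ values from a vector of edge lengths. It assumes the usual additivity of drift along independent edges, so that $F_4(A,B;C,D)$ is the sum over edges of the product of the weight differences times the edge length. The edge lengths are arbitrary hypothetical values, the helper names (incidence, F4, F3) are not poolfstat functions, and this is not the package's internal code.

```r
edges <- c("P7<->P1", "s1<->P2", "s2<->P3", "S<->P6", "P8<->s2", "P7<->s1",
           "P9<->P4", "P9<->P5", "P8<->P7", "R<->P8", "R<->P9")

# Numeric version of the symbolic incidence matrix shown above
incidence <- function(a) {
  X <- rbind(P1 = c(1, 0, 0, 0, 0,     0, 0, 0, 1, 1, 0),
             P2 = c(0, 1, 0, 0, 0,     1, 0, 0, 1, 1, 0),
             P3 = c(0, 0, 1, 0, 1,     0, 0, 0, 0, 1, 0),
             P6 = c(0, 0, 0, 1, 1 - a, a, 0, 0, a, 1, 0),
             P4 = c(0, 0, 0, 0, 0,     0, 1, 0, 0, 0, 1),
             P5 = c(0, 0, 0, 0, 0,     0, 0, 1, 0, 0, 1))
  colnames(X) <- edges
  X
}

# Hypothetical edge lengths (arbitrary values chosen for illustration)
e <- setNames(rep(0.01, length(edges)), edges)
e["S<->P6"] <- 0.002

# Expected F4(i,j;k,l) and F3(i;j,k) under drift additivity
F4 <- function(X, e, i, j, k, l) sum((X[i, ] - X[j, ]) * (X[k, ] - X[l, ]) * e)
F3 <- function(X, e, i, j, k) F4(X, e, i, j, i, k)

X <- incidence(a = 0.25)
alpha         <- F4(X, e, "P1", "P4", "P6", "P3") / F4(X, e, "P1", "P4", "P2", "P3")
alpha.swapped <- F4(X, e, "P1", "P4", "P3", "P6") / F4(X, e, "P1", "P4", "P2", "P3")
f3.P6         <- F3(X, e, "P6", "P2", "P3")
c(alpha = alpha, alpha.swapped = alpha.swapped, f3.P6 = f3.P6)
```

With these inputs, the ratio with (P1,P4;P6,P3) as numerator recovers the admixture rate, the ratio with P3 and P6 swapped gives its opposite, and the expected $F_3$(P6;P2,P3) is negative, as expected for an admixed target population with a short private branch.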
5.1.2 Generating a graph.params object with f-statistics estimates for admixture graph fitting
The f-statistics estimates ($f_2$ and $f_3$) need to be included within the graph.params object to allow further fitting of the admixture graph (i.e., estimation of its edge lengths and, if any, admixture proportions) with the fit.graph function (see section 5.2). This can be done by providing the generate.graph.params function with an fstats object including all the $f_2$ and $f_3$ statistics involving the graph leaves (which is usually the case) and block-jackknife estimates of their standard errors and error covariance matrix (i.e., the compute.fstats function used to generate the fstats object must have been run with nsnp.per.bjack.block>0), otherwise an error message is returned[19]. If $n_l$ is the number of leaves of the admixture graph and $A$ a "reference" population among the $n_l$ ones[20], the generate.graph.params function selects the estimates (block-jackknife means) for the $n_l-1$ $f_2$ statistics of the form $f_2(A;B)$ (with $B$ a leaf population other than $A$) and the $\binom{n_l-1}{2}$ $f_3$ statistics of the form $f_3(A;B,C)$ (with $B$ and $C$ two leaf populations other than $A$), which form the basis of the f-statistics vector (see section 4). The basis f-statistics are stored in the f2.target and f3.target slots of the resulting graph.params object (with their corresponding names available in the f2.target.pops and f3.target.pops slots respectively). In addition, as required to further fit the admixture graph (see section 5.2), the $\frac{n_l(n_l-1)}{2}$ by $\frac{n_l(n_l-1)}{2}$ covariance matrix of the basis f-statistics is stored in the f.Qmat slot of the graph.params object. Optionally, if available in the fstats object, estimates of the leaf heterozygosities (needed to scale fitted branch lengths in drift units, see 5.2.1) are stored in the Het slot of the graph.params object. The following code shows how to generate a graph.params object for the example (allele count) data:
The following code shows how to generate a graph.params object for the example (allele count) data:
sim.graph.params<-generate.graph.params(sim.graph,fstats = sim6p.allelecount.fstats)
Total Number of Parameters: 11 (10 edges lengths + 1 adm. coeff.)
Total Number of Statistics: 15 (5 F2 and 10 F3)
As shown above, the function returns the number of parameters $n_{par} = 2n_l + 2n_a - 3$ of the admixture graph, where $n_a$ and $n_l$ are the number of admixture events and the number of leaves, respectively (Lipson 2020). Note that the plotting properties of the graph.params object remain the same whether fstats information is included or not (i.e., the plot function may be used as above to generate the representation displayed in Figure 8).
[19] Conversely, the fstats object may include f-statistics involving populations other than the graph leaves, the generate.graph.params function selecting only the f-statistics relevant for the fitting of the input graph.
[20] By default, $A$ is the first population in the vector of leaves but, although of limited interest, another reference population may be specified with the popref argument.
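The counts printed by generate.graph.params can be recovered with a few lines of base R. The sketch below enumerates the basis statistics for the six leaves of the example graph, taking the first leaf as reference population, and evaluates the parameter count formula. The name formatting (e.g., "P1;P2,P3") is only illustrative and may differ from the content of the f2.target.pops and f3.target.pops slots.

```r
leaves <- c("P1", "P2", "P3", "P6", "P4", "P5")  # graph leaves of the example
n.adm  <- 1                                      # one admixture event
nl     <- length(leaves)

ref    <- leaves[1]                              # reference population A
others <- setdiff(leaves, ref)

# nl - 1 basis f2 of the form (A,B)
f2.basis <- paste0(ref, ",", others)
# choose(nl - 1, 2) basis f3 of the form (A;B,C)
f3.basis <- apply(combn(others, 2), 2,
                  function(bc) paste0(ref, ";", bc[1], ",", bc[2]))

n.stats <- length(f2.basis) + length(f3.basis)   # equals nl * (nl - 1) / 2
n.par   <- 2 * nl + 2 * n.adm - 3                # edge lengths + admixture rates
c(n.f2 = length(f2.basis), n.f3 = length(f3.basis), n.stats = n.stats, n.par = n.par)
```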
5.2 Fitting a graph with the fit.graph function
The fit.graph function provides estimates of the parameters (i.e., edge lengths and admixture proportions) of an admixture graph stored in a graph.params object, as detailed in Gautier et al. (2021) and directly inspired by Patterson et al. (2012). Briefly, let $\hat{f}$ represent the vector of length $\frac{n_l(n_l-1)}{2}$ (where $n_l$ is the number of graph leaves) of the estimated $f_2$ and $f_3$ basis f-statistics [21], and let $g(e;a) = X(a) \times e$ be the vector of the expected values of these basis f-statistics given the vector of graph edge lengths $e$ and the incidence matrix $X(a)$, which depends on the structure of the graph and the admixture rates $a$ [22]. Let further $Q$ represent the $\frac{n_l(n_l-1)}{2}$ by $\frac{n_l(n_l-1)}{2}$ covariance matrix of the basis f-statistics estimates, estimated by block-jackknife and stored in the slot f.Qmat of the input graph.params object (see 5.1.2) [23]. The function attempts to find the graph parameter values ($\hat{e}$ and $\hat{a}$) that minimize a cost (score of the model) defined as

$$S(e;a) = \left(\hat{f} - g(e;a)\right)' Q^{-1} \left(\hat{f} - g(e;a)\right)$$

As mentioned by Patterson et al. (2012), given admixture rates $a$, $S(e;a)$ is actually quadratic in the edge lengths $e$, allowing the fit.graph function to rely on the Lawson-Hanson non-negative linear least squares algorithm implemented in the nnls function (nnls package) to estimate the vector $\hat{e}$ that minimizes $S(e;a)$ (subject to the constraint of positive edge lengths). Full minimization of $S(e;a)$ is thus reduced to the identification of the admixture rates $a$, which is performed using the L-BFGS-B method [24] implemented in the optim function (stats package). The eps.admix.prop argument allows setting the lower and upper bounds of the admixture rates to eps.admix.prop and 1-eps.admix.prop, respectively. In addition, assuming $\hat{f} \sim N(g(\hat{e};\hat{a}), Q)$ [25], $S(\hat{e};\hat{a}) = -2\log(L) - K$ where $L$ is the likelihood of the fitted graph and $K = n\log(2\pi) + \log(|Q|)$ (with $n$ the number of basis f-statistics), which allows deriving almost directly a BIC (Bayesian Information Criterion) of the fitted graph from the optimized score $S(\hat{e};\hat{a})$ [26]. The BIC may be used to compare different admixture graphs (see section 5.3) provided they were all fitted based on the same vector of f-statistics (i.e., they include the same set of populations). Indeed, when comparing two graphs $G_1$ and $G_2$ with BIC equal to $BIC_1$ and $BIC_2$, $\Delta_{12} = BIC_2 - BIC_1 \simeq 2\log(BF_{12})$ where $BF_{12}$ is the Bayes Factor associated with the comparison of $G_1$ and $G_2$ (eq. 9, Kass and Raftery 1995). The (slightly) modified Jeffreys' rule proposed by Kass and Raftery (1995) might further be used to evaluate to which extent the data support $G_1$ or $G_2$ with, e.g., $\Delta_{12} > 6$ (respectively $\Delta_{12} > 10$) providing "strong" (respectively "very strong") evidence in favor of $G_1$ [27].
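The nested structure of this optimization (edge lengths solved for a fixed admixture rate, then a bounded search over the admixture rate) can be illustrated with a minimal base R sketch. Everything in it is hypothetical: the three-statistic toy graph, the helper names (X_of_a, score, fit_edges, profile_score) and the noise-free data are assumptions made for illustration, and unconstrained generalized least squares is used as a stand-in for the non-negative least squares step, so this is not the fit.graph implementation.

```r
# Toy graph (hypothetical): two edges e1, e2 and one admixture rate a.
# Expected basis statistics: g(e; a) = X(a) %*% e
X_of_a <- function(a) rbind(c(1, 0), c(0, 1), c(a, 1 - a))

# Score S(e; a) = (f - g)' Q^{-1} (f - g)
score <- function(e, a, f_hat, Q) {
  r <- f_hat - as.vector(X_of_a(a) %*% e)
  as.numeric(t(r) %*% solve(Q, r))
}

# For a fixed a, the score is quadratic in e. Unconstrained generalized
# least squares is used here as a stand-in for the NNLS step.
fit_edges <- function(a, f_hat, Q) {
  X <- X_of_a(a)
  Qi <- solve(Q)
  as.vector(solve(t(X) %*% Qi %*% X, t(X) %*% Qi %*% f_hat))
}

# Profile score: only the admixture rate is left to optimise
profile_score <- function(a, f_hat, Q) {
  score(fit_edges(a, f_hat, Q), a, f_hat, Q)
}

# Noise-free toy data generated with e = (0.01, 0.02) and a = 0.3
f_hat <- as.vector(X_of_a(0.3) %*% c(0.01, 0.02))
Q <- diag(1e-8, 3)

eps <- 1e-6  # plays the role of eps.admix.prop in this sketch
fit <- optim(par = 0.5, fn = profile_score, f_hat = f_hat, Q = Q,
             method = "L-BFGS-B", lower = eps, upper = 1 - eps)
a_hat <- fit$par
e_hat <- fit_edges(a_hat, f_hat, Q)
```

With noise-free data the profile score reaches zero at the admixture rate used to generate the toy statistics, which makes it easy to check that the bounded search behaves as expected before moving to real estimates.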
The fit.graph function returns an object of class fitted.graph detailed in the documentation page accessible
with the following command (or the ? operator):
help(fitted.graph)
The following code shows how to fit the example graph stored in the graph.params object sim.graph.params
generated above and some features of the resulting fitted.graph object:
sim.fittedgraph<-fit.graph(sim.graph.params)
Starting estimation of admixture rates (LBFGS score optimisation)
Initial score= 316.0469
Estimation ended in 0 m 0 s
Final Score: 1.004106
BIC: 276.6086
[21] stored in the f2.target and f3.target slots of the input graph.params object, see section 5.1.2
[22] If there is no admixture in the graph, $X(a)$ contains only 0 or 1
[23] The argument Q.lambda of the fit.graph function may be used to add a small constant (e.g., $10^{-4}$) to all the diagonal elements of $Q$ (i.e., the variance of the basis f-statistics estimates) as proposed by Patterson et al. (2012; see also Lipson 2020) to avoid numerical problems. Note that Qmat.diag.adjust is always set to 0 in the final estimate of the score $S$ or the BIC
[24] Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints
[25] i.e., the observed vector of f-statistics is multivariate normal distributed around the expected $g(e;a)$ vector specified by the admixture graph and the covariance structure empirically estimated
[26] $BIC = S + n_{par}\log\left(\frac{1}{2}n_l(n_l-1)\right) - \frac{1}{2}n_l(n_l-1)\log(2\pi) - \log(|Q|)$
[27] The two thresholds of 6 and 10 on $\Delta_{BIC}$ correspond to thresholds of 13 and 21 deciban (units of $10\log_{10}$ scale) on BF. The original Jeffreys' rule considered $BF_{12}$ thresholds of 10, 15 and 20 decibans (corresponding to $\Delta_{BIC}$ thresholds of 4.6, 6.9 and 9.2) as "strong", "very strong" and "decisive" evidence in favor of model 1.
#Estimated edge lengths
sim.fittedgraph@edges.length
P7<->P1 s1<->P2 s2<->P3 S<->P6 P8<->s2 P7<->s1 P9<->P4 P9<->P5 P8<->P7 R<->P8 R<->P9
0.004079551 0.001970516 0.002148837 0.002111818 0.006334944 0.002228311 0.012858490 0.012529775 0.004160468 0.006445943 0.006445943
#Estimated admixture proportion
sim.fittedgraph@admix.prop
[1] 0.2478947
#Final Score
sim.fittedgraph@score
[1] 1.004106
#BIC
sim.fittedgraph@bic
[1] 276.6086
The fitted.graph object also stores the output results from the optim optimization function in the slot optim.results. For instance, in the example below, convergence was reached without any issue (convergence=0) [28]:
#optim function results (list)
sim.fittedgraph@optim.results
$par
[1] 0.2478947
$value
[1] 1.004106
$counts
function gradient
8 8
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
As for graph.params objects (see section 5.1.1), the plot function may be directly applied to fitted.graph objects to plot the admixture graph with estimated parameter values [29]. The admix.fact and edge.fact arguments of the fit.graph function allow applying a multiplying factor to the printed branch lengths and admixture proportions (by default admix.fact=100, i.e., admixture proportions are printed in %; and edge.fact=1000, i.e., edge lengths are printed in ‰). Figure 9 shows the fitted example graph and was generated with the following command:
plot(sim.fittedgraph)
[28] In case of a convergence problem (i.e., convergence not equal to 0), a message detailing the execution error in the optimization algorithm is stored in the optim.results component named message. For more details, see the documentation for the optim function with the command ?optim, provided the stats package is loaded
[29] The graph is also coded in dot format with a character vector stored in the dot.graph slot of the resulting fitted.graph object. As for generate.graph.params, a dot file may also be printed out by specifying an output file name prefix with the outfileprefix argument of the fit.graph function (or using the command writeLines(x@dot.graph,con=outfile) where x is the fitted.graph object and outfile is the desired name of the dot output file). See 5.1.1 for more details on how to externally export and customize dot files
Figure 9: Plot of the admixture graph specifying the simulated scenario with fitted parameter values (x1000 for edge lengths)
Notice
The two edges from the root node of the graph are not identifiable and only their joint lengths can
be estimated. The root position is arbitrarily set in the mid position (i.e., the two root edges have
the same length by construction as shown in Figure 9).
5.2.1 Scaling of branch lengths in drift units
The estimated edge lengths are on the same scale as the other f-statistics, which limits their interpretation since they strongly depend on the SNP ascertainment process (see section 4). Lipson et al. (2013) showed however that the lengths may be approximately scaled in genetic drift units (i.e., in units of $\tau = \frac{t}{2N_e}$ where $t$ is a number of generations and $N_e$ is the diploid effective population size along the branch) using estimates of overall marker heterozygosities within (i.e., $1-Q_1$) or across (i.e., $1-Q_2$) populations (Gautier et al. 2021). Briefly, for a given edge $P \leftrightarrow C$ relating a child node $C$ to its parent node $P$ with an (unscaled) estimated branch length $\hat{e}_{P\leftrightarrow C}$, $\hat{\tau}_{P\leftrightarrow C} = 2\frac{\hat{e}_{P\leftrightarrow C}}{\hat{h}_P}$ where $\hat{\tau}_{P\leftrightarrow C}$ is the estimated branch length scaled in drift units and $\hat{h}_P$ is the estimated heterozygosity in the (parent) node $P$. The parent node heterozygosities can be estimated from leaves to root by using the property $h_P = h_C + 2e_{P\leftrightarrow C}$ that relates the child $C$ and parent $P$ node heterozygosities ($h_C$ and $h_P$ respectively) and $e_{P\leftrightarrow C}$ (Lipson et al. 2013). When the drift.scaling argument is set to TRUE [30], the fit.graph function returns the edge lengths scaled in drift units in a slot named edges.length.scaled together with the estimated node heterozygosities (nodes.het slot) as shown below with the example data:
sim.fittedgraph.scaled<-fit.graph(sim.graph.params,drift.scaling = TRUE,verbose=FALSE)
Note that the obtained results are the same as above with no drift scaling since the latter is a post-processing
step independent of admixture graph parameters estimation
#Estimated edge lengths
sim.fittedgraph.scaled@edges.length.scaled
P7<->P1 s1<->P2 s2<->P3 S<->P6 P8<->s2 P7<->s1 P9<->P4 P9<->P5 P8<->P7 R<->P8 R<->P9
0.04964054 0.02405296 0.02631275 0.02516072 0.07378969 0.02711439 0.14321071 0.13954967 0.04846131 0.07700936 0.07700936
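The scaling rule of this section can be written as two one-line helpers. The sketch below is only an illustration in base R: the function names (parent_het, drift_scaled_length) and the numerical values are hypothetical and are not taken from the example data or from the package internals.

```r
# Parent heterozygosity from the child one: h_P = h_C + 2 * e
parent_het <- function(h_child, e) h_child + 2 * e

# Edge length in drift units: tau = 2 * e / h_P
drift_scaled_length <- function(e, h_parent) 2 * e / h_parent

# Hypothetical values for one edge P <-> C
e_PC <- 0.004   # unscaled edge length
h_C  <- 0.156   # heterozygosity of the child node
h_P  <- parent_het(h_C, e_PC)
tau_PC <- drift_scaled_length(e_PC, h_P)
round(c(h_P = h_P, tau = tau_PC), 4)
```

Applying parent_het repeatedly from the leaves toward the root gives the heterozygosity needed to scale each edge in turn.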
When plotting the resulting fitted.graph objects, branch lengths are displayed in drift scaled units as shown in Figure 10 below, obtained with the following command:
plot(sim.fittedgraph.scaled)
Figure 10: Plot of the admixture graph specifying the simulated scenario with fitted edge lengths scaled in drift units (x1,000)
[30] provided the input graph.params object includes estimates of within-population heterozygosities (see section 5.1.2)
The estimated branch lengths are close to the simulated ones (Figure 1) as represented below (see Gautier
et al. 2021 for a more in-depth exploration of the accuracy of the estimates):
P7<->P1 s1<->P2 s2<->P3 S<->P6 P8<->s2 P7<->s1 P9<->P4 P9<->P5 P8<->P7 R<->P8+R<->P9
Estimated 0.04964054 0.02405296 0.02631275 0.02516072 0.07378969 0.02711439 0.1432107 0.1395497 0.04846131 0.1540187
Simulated 0.05000000 0.02500000 0.02500000 0.02500000 0.07500000 0.02500000 0.1500000 0.1500000 0.05000000 0.1500000
5.2.2 Estimating 95% confidence intervals of the estimated parameters values
Calling the fit.graph function with the compute.ci argument set to TRUE allows deriving (rough) 95% confidence intervals for the admixture graph parameter estimates. The procedure considers each parameter in turn (the other parameters being set to their estimated values) and consists of interpreting the score difference $S_\nu - S^\star$ (where $S^\star$ is the fitted graph score associated with the estimated parameter value $\nu^\star$ and $S_\nu$ is the score associated with a parameter value $\nu \neq \nu^\star$) as a likelihood-ratio test statistic following a $\chi^2$ distribution with one degree of freedom (see Gautier et al. 2021 for details). The lower and upper boundaries $\nu_{min}$ and $\nu_{max}$ of the 95% CI (such that $S_\nu - S^\star < 3.84$ for all $\nu_{min} < \nu < \nu_{max}$) are then simply computed using a bisection method (with a $10^{-4}$ precision threshold).
sim.fittedgraph.with.ci<-fit.graph(sim.graph.params,compute.ci=TRUE,
drift.scaling = TRUE,verbose = FALSE)
#95% CI for the admixture proportion
sim.fittedgraph.with.ci@admix.prop.ci
95% Inf. 95% Sup.
a 0.2354879 0.2604726
#95% CI for edge length (including drift scaled as drift.scaling=TRUE)
sim.fittedgraph.with.ci@edges.length.ci
95% Inf. 95% Sup. 95% Inf. (drift scaled) 95% Sup. (drift scaled)
P7<->P1 0.003888322 0.004322695 0.04731364 0.05259916
s1<->P2 0.001785780 0.002153260 0.02179800 0.02628362
s2<->P3 0.001947384 0.002392453 0.02384593 0.02929586
S<->P6 0.001913835 0.002294537 0.02280190 0.02733768
P8<->s2 0.006037994 0.006638187 0.07033080 0.07732187
P7<->s1 0.002019406 0.002471907 0.02457242 0.03007851
P9<->P4 0.012406433 0.013340492 0.13817596 0.14857899
P9<->P5 0.012040331 0.013011938 0.13409851 0.14491974
P8<->P7 0.003835432 0.004464375 0.04467527 0.05200122
R<->P8 0.006219328 0.006686937 0.07430200 0.07988849
R<->P9 0.006219328 0.006686937 0.07430200 0.07988849
The simulated values are all within the 95% CI of the estimated values, except for the long $e_{P9\leftrightarrow P4}$ and $e_{P9\leftrightarrow P5}$ branches, whose simulated values are slightly higher than the upper boundary. Note also that although the 95% CI for the admixture proportion $a$ contains the simulated value of 0.25, it is narrower than the one that may be derived from the block-jackknife estimate of the $F_4$-ratio standard error (see section 4.3 and Gautier et al. 2021 for a more thorough evaluation of the estimated 95% CI).
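The bisection search described in this section can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the helper names (bisect_bound, profile_ci), the quadratic toy score and the use of the tolerance as an absolute precision on the parameter are all hypothetical choices, not the package's own code.

```r
# 95% threshold of a chi-square distribution with one degree of freedom
thr <- qchisq(0.95, df = 1)

# Bisection between a point inside the interval and a point outside of it
bisect_bound <- function(score, s_star, inside, outside, tol = 1e-4) {
  while (abs(outside - inside) > tol) {
    mid <- (inside + outside) / 2
    if (score(mid) - s_star < thr) inside <- mid else outside <- mid
  }
  (inside + outside) / 2
}

# Lower and upper CI bounds for one parameter, the others being held fixed
profile_ci <- function(score, nu_hat, lower, upper, tol = 1e-4) {
  s_star <- score(nu_hat)
  c(inf = bisect_bound(score, s_star, nu_hat, lower, tol),
    sup = bisect_bound(score, s_star, nu_hat, upper, tol))
}

# Hypothetical quadratic score around an estimate of 0.25
toy_score <- function(nu) 1 + ((nu - 0.25) / 0.006)^2
ci <- profile_ci(toy_score, nu_hat = 0.25, lower = 0, upper = 1)
```

The sketch assumes that the score difference exceeds the threshold at both search limits (lower and upper); otherwise the corresponding bound simply converges to that limit.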
5.2.3 Assessing the fit of the graph with the compare.fitted.fstats function
As outlined by Patterson et al. (2012) and Lipson (2020), a straightforward but highly informative approach to assess the fit of the graph is to compare the f-statistics derived from the fitted admixture graph parameters to the estimated ones, i.e., to evaluate to which extent the fitted f-statistics lie within the confidence intervals of the estimated ones. This may be summarized by computing a Z-score of the residuals for each f-statistic as $Z = \frac{f-g}{\sigma_g}$ where $f$ and $g$ stand for the fitted and estimated values respectively and $\sigma_g$ is the standard error of $g$. This information is available for the basis f-statistics in the fitted.outstats slot of the fitted.graph object generated by the fit.graph function, as illustrated below:
#Fitted basis F-stats
sim.fittedgraph.scaled@fitted.outstats
Stat. value Fitted Value Z-score
P1,P2 0.008274108 0.008278377 0.03287705
P1,P3 0.016680443 0.016723801 0.20561876
P1,P4 0.033971079 0.033990395 0.05587989
P1,P5 0.033727224 0.033661681 -0.18890298
P1,P6 0.012237561 0.012265163 0.16197672
P1;P2,P3 0.004048392 0.004079551 0.26473869
P1;P2,P4 0.004087474 0.004079551 -0.05972586
P1;P2,P5 0.004149524 0.004079551 -0.52164304
P1;P2,P6 0.004610668 0.004631937 0.19288766
P1;P3,P4 0.008230291 0.008240019 0.05210483
P1;P3,P5 0.008244937 0.008240019 -0.02672913
P1;P3,P6 0.011937638 0.011973206 0.20637614
P1;P4,P5 0.021148713 0.021131906 -0.06164584
P1;P4,P6 0.007189623 0.007208661 0.11673856
P1;P5,P6 0.007234212 0.007208661 -0.16250690
However, the fit should be evaluated for all the f-statistics (not only those forming the f-statistics vector-space basis used to fit the admixture graph) with the compare.fitted.fstats function. This may in turn provide insights into those populations (or graph edges) leading to poor fit (Lipson 2020). As shown below, the function requires the original fstats object (that may contain f-statistics for additional populations not represented in the admixture graph) and the fitted.graph object. It then produces a matrix with information on all the fitted statistics, and the n.worst.stats (by default n.worst.stats=5) f-statistics with the worst fit, i.e., with the highest absolute Z-scores, are directly printed in the console:
sim.fitted.fstats.comp<-compare.fitted.fstats(sim6p.allelecount.fstats,sim.fittedgraph)
5 Worst fit for:
Estimated Fitted Z–score
P1,P2;P3,P5 1.011325e-04 0.000000e+00 -0.8442359
P1,P2;P5,P6 4.611437e-04 5.523863e-04 0.7459512
P1,P2;P4,P5 6.205022e-05 3.469447e-18 -0.6662270
P2;P1,P5 4.124584e-03 4.198826e-03 0.5401737
P1;P2,P5 4.149524e-03 4.079551e-03 -0.5216430
#Information on the last six fitted F-statistics
tail(sim.fitted.fstats.comp)
Estimated Fitted Z–score
P2,P6;P3,P4 -4.787097e-03 -4.764545e-03 0.16513037
P2,P6;P3,P5 -4.804558e-03 -4.764545e-03 0.27257790
P2,P6;P4,P5 -1.746147e-05 0.000000e+00 0.16092207
P3,P4;P5,P6 -1.765179e-02 -1.765643e-02 -0.01671976
P3,P5;P4,P6 -1.762185e-02 -1.765643e-02 -0.12418834
P3,P6;P4,P5 2.994286e-05 6.938894e-18 -0.39137603
As shown above, no outlying fitted f-statistic (e.g., with $|Z| > 2$) is observed in the example, providing strong support for the fitted admixture graph.
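The residual Z-score check described in this section amounts to a few lines of base R. In the sketch below the helper name fit_zscore, the statistic labels and all the values (including the standard errors) are hypothetical and chosen only to show one well fitted and one poorly fitted statistic; they do not come from the example data.

```r
# Z-score of the residual: Z = (fitted - estimated) / s.e.(estimated)
fit_zscore <- function(fitted, estimated, se) (fitted - estimated) / se

# Hypothetical statistics (names and values are illustrative only)
estimated <- c("A;B,C" = 0.00415, "A,B;C,D" = 0.00010, "A,C" = 0.01668)
fitted    <- c("A;B,C" = 0.00408, "A,B;C,D" = 0.00000, "A,C" = 0.01700)
se        <- c("A;B,C" = 0.00013, "A,B;C,D" = 0.00012, "A,C" = 0.00010)

z <- fit_zscore(fitted, estimated, se)
# Rank from worst to best fit and flag outliers with |Z| > 2
worst_first <- z[order(abs(z), decreasing = TRUE)]
outliers <- names(z)[abs(z) > 2]
```

Sorting by decreasing absolute Z-score mirrors the idea of listing the worst fitted statistics first, so that the populations involved in them can be inspected.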
5.3 Adding a new leaf to an existing graph
The add.leaf function performs iterative calls to the fit.graph function in order to evaluate all possible connections of a given leaf (population) to an existing graph with non-admixed and/or admixed edges. Three input arguments are required:
• a graph specified within a graph.params object (obtained with the generate.graph.params function, see section 5.1.1) or a fitted.graph object (obtained with the fit.graph, add.leaf or graph.builder functions, see sections 5.2 and 6.2 below, allowing more convenient exploration of the admixture graph space via recursive calls)
• the name of the leaf to add (leaf.to.add argument)
• an fstats object (see 4) containing a minima estimates of all the $f_2$ and $f_3$ statistics (and their standard errors) involving the leaves of the input graph and the leaf to add
By default the function tests all the possible positions of the candidate leaf (leaf.to.add) on the graph with non-admixed (including a new rooting with the candidate leaf as an outgroup) or admixed edges. If $n_e$ is the number of non-admixed edges in the original graph, the number of tested graphs equals $n_e + 1$ for non-admixed candidate edges [31] and $\frac{1}{2}n_e(n_e-1)$ for admixed candidate edges [32]. Optional arguments allow disabling the evaluation of admixed candidate edges (by setting the only.test.non.admixed.edges argument to TRUE) or of non-admixed candidate edges (by setting the only.test.admixed.edges argument to TRUE).
The object returned by the add.leaf function is a list consisting of:
• an element named n.graphs corresponding to the number of tested graphs
• an element named fitted.graphs.list which consists of a list of fitted.graph objects (indexed from 1 to n.graphs and in the same order as the list “graphs”) containing the fit.graph function results for each candidate graph
• an element named best.fitted.graph which is the fitted.graph object associated with the candidate graph with minimal BIC (see 5.2) among all the n.graphs graphs within fitted.graphs.list
• an element named bic which is a vector consisting of the n.graphs BIC (indexed from 1 to n.graphs and in the same order as the fitted.graphs.list list)
Use of the add.leaf function is illustrated below on the example data by evaluating the connection of the
P6 population on the graph (actually tree) connecting the five other populations (P1,P2,P3,P4 and P5 )
specified according to the simulated topology in a graph.params object (5.1.1) named sim5p.tree.params and
plotted in Figure 11:
sim5p.tree<-sim.graph<-rbind(c("P1","P7",""),c("P2","P7",""),c("P3","P8",""),
c("P7","P8",""),c("P4","P9",""),c("P5","P9",""),
c("P8","R",""),c("P9","R",""))
sim5p.tree.params<-generate.graph.params(sim5p.tree)
#Note: fstats object is not necessary at this stage
plot(sim5p.tree.params)
[31] The newly added node is named “N-”leaf.to.add
[32] The three added nodes are named “S-”leaf.to.add, “S1-”leaf.to.add and “S2-”leaf.to.add and the admixture proportions are named with a letter (A to Z depending on the number of admixed nodes already present in the graph)
Figure 11: Plot of the five-population graph on which to add the P6 population with the add.leaf function
All the possible positions of the P6 population on the sim5p.tree graph (here using the f-statistics estimated
on the simulated allele count data, see section 4), are tested as follows:
add.P6<-add.leaf(sim5p.tree.params,leaf.to.add="P6",
fstats=sim6p.allelecount.fstats,
verbose=FALSE,drift.scaling=TRUE)
Note that setting the verbose option to FALSE disables the printing of the progress and timing of each analysis (which may be useful in practice), and that setting the drift.scaling option to TRUE passes it to each fit.graph call to obtain estimates of branch lengths in drift units (see section 5.2).
The graph with the lowest BIC among all the tested graphs may then be plotted by calling the plot function
on the corresponding fitted.graph object stored in the best.fitted.graph element of the add.P6 output list with
the following command that generates Figure 12:
plot(add.P6$best.fitted.graph)
Figure 12: Plot of the graph with the lowest BIC among all the possible graphs connecting P6 to the five-population tree tested by the add.leaf function. The fitted edge lengths are in drift units (x1000) since the drift.scaling argument was set to TRUE when calling add.leaf.
The best fitting graph based on the BIC criterion (stored in the best.fitted.graph element of the output list) is in perfect agreement with the simulated scenario (Figure 1). It actually corresponds to the one directly fitted using the simulated scenario, which was represented in Figure 10 above (only the names of the nodes involved in the admixture events differ since they are automatically given by the add.leaf function). In addition, comparisons of the BIC of the different graphs provide strong support in favor of this best fitting graph, the second lowest BIC being more than 86 units larger, i.e., far above the threshold of 10 for very strong evidence (see section 5.2):
#D_BIC w.r.t. best fitted BIC
D_BIC=add.P6$bic-add.P6$best.fitted.graph@bic
#5 First lowest DeltaBIC (the first value of zero corresponding to the best fitted graph)
head(sort(D_BIC))
[1] 0.00000 86.31675 86.31675 279.48967 284.90577 284.90577
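To relate such BIC differences to the evidence scale discussed in section 5.2, the approximation $\Delta_{12} \simeq 2\log(BF_{12})$ can be inverted directly. The helpers below (dbic_to_bf, dbic_to_deciban, deciban_to_dbic) are hypothetical names introduced for this base R sketch; they only restate the conversion between a BIC difference, an approximate Bayes factor and decibans.

```r
# Approximate Bayes factor from a BIC difference: Delta = 2 * log(BF)
dbic_to_bf <- function(dbic) exp(dbic / 2)

# Same evidence expressed in decibans (10 * log10 scale)
dbic_to_deciban <- function(dbic) 10 * log10(dbic_to_bf(dbic))

# Reverse conversion: decibans back to a BIC difference
deciban_to_dbic <- function(db) 2 * log(10^(db / 10))

# Thresholds of 6 and 10 on the BIC difference, in decibans
deciban <- dbic_to_deciban(c(6, 10))

# Thresholds of 10, 15 and 20 decibans, as BIC differences
dbic <- deciban_to_dbic(c(10, 15, 20))
```

The same functions can be applied to a whole vector of BIC differences, such as the sorted D_BIC values, to express each candidate graph's support relative to the best one.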
6 Building admixture graph from scratch
Lipson et al. (2013) proposed a two-step approach (implemented in the MixMapper software [33]) to build admixture graphs when prior knowledge about the history and relationships of the investigated populations is limited (which is usually the case). It consists of first building a scaffold tree of unadmixed populations and then adding the remaining populations successively on the graph. Such a supervised approach nevertheless requires carefully assessing the graph fit at each step and possibly trying different orderings in the inclusion of populations (or removing some populations). The poolfstat package provides functions to help build scaffold trees that may further be used as input trees for the add.leaf function described above (5.3) to implement the Lipson et al. (2013) two-step approach.
6.1 Building scaffold trees of unadmixed populations
In the absence of admixture, the $f_2$ statistics among all pairs of populations are expected to be additive along the paths of the (binary) tree summarizing the history of the populations (Lipson et al. 2013). As a result, the (unrooted) tree topology and branch lengths connecting unadmixed populations may be inferred with a neighbor-joining algorithm to derive a scaffold tree for further admixture graph construction. Based on the estimated f-statistics stored in an fstats object, the functions described below allow to i) identify candidate sets of unadmixed populations among the genotyped ones (find.tree.popset function); ii) infer a neighbor-joining scaffold tree from a candidate set of unadmixed populations (rooted.njtree.builder function); and iii) infer the root position based on the consistency of within-population heterozygosities between the two resulting partitions of rooted trees (see p1799 in Lipson et al. 2013).
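The additivity property, and its link with the $f_4$-based treeness test used in the next section, can be checked on a toy tree. The base R sketch below is purely illustrative: the four-leaf tree, its branch lengths and the helper names (same_side, f2, f4) are hypothetical, and $f_4$ is written with the standard identity $f_4(A,B;C,D) = \frac{1}{2}\left(f_2(A,D) + f_2(B,C) - f_2(A,C) - f_2(B,D)\right)$.

```r
# Hypothetical unrooted tree ((A,B),(C,D)): four terminal branches and
# one internal branch of length m
term <- c(A = 0.010, B = 0.020, C = 0.015, D = 0.030)
m <- 0.040

# TRUE when both leaves sit on the same side of the internal branch
same_side <- function(x, y) {
  all(c(x, y) %in% c("A", "B")) || all(c(x, y) %in% c("C", "D"))
}

# Additive f2: sum of the branch lengths along the path between two leaves
f2 <- function(x, y) {
  path <- term[[x]] + term[[y]]
  if (same_side(x, y)) path else path + m
}

# f4 written as a combination of f2 statistics
f4 <- function(p1, p2, p3, p4) {
  (f2(p1, p4) + f2(p2, p3) - f2(p1, p3) - f2(p2, p4)) / 2
}

f4_tree  <- f4("A", "B", "C", "D")  # configuration matching the tree
f4_other <- f4("A", "C", "B", "D")  # configuration crossing the internal branch
```

Under additivity, the configuration matching the tree gives a null $f_4$ while the other configuration recovers the internal branch length, which is the pattern the treeness test looks for.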
6.1.1 The find.tree.popset function to identify sets of candidate scaffold populations
The find.tree.popset function selects maximal sets of unadmixed populations from an fstats object [34]. The procedure involves i) discarding all the populations showing a significantly negative $f_3$ at a significance threshold specified with the f3.zcore.threshold argument (equal to −1.65 by default, i.e., 95% significance threshold, see section 4.1.2); and ii) keeping only sets of populations with all possible quadruplets passing the $f_4$-based test of treeness, i.e., with an absolute $f_4$ Z-score lower than a threshold specified with the f4.zcore.threshold argument (equal to 2 by default, i.e., 95% significance threshold, see section 4.1.3). The latter step is implemented via a greedy algorithm (that may be run in parallel by specifying a number of threads with the nthreads argument) consisting of trying to extend the size of the population sets from all sets of four populations after adding candidate populations one at a time. If the number of populations is large, this algorithm may take some time. Note that increasing (respectively decreasing) f3.zcore.threshold toward values closer to 0 may allow decreasing (respectively increasing) the number of initial candidate populations to be tested for inclusion in a set. Similarly, increasing (respectively decreasing) f4.zcore.threshold may allow increasing (respectively decreasing) the size of the sets. When applied to the example allele count and read count data, a single set of 5 unadmixed populations (P1, P2, P3, P4 and P5) is retrieved as expected from the simulated scenario (Figure 1):
# count data
scaf.pops<-find.tree.popset(sim6p.allelecount.fstats,verbose=FALSE)
scaf.pops$pop.sets
[,1] [,2] [,3] [,4] [,5]
PopSet1 "P1" "P2" "P3" "P4" "P5"
# 30X Pool-Seq data
scaf.pops<-find.tree.popset(sim6p.readcount30X.fstats,verbose=FALSE)
scaf.pops$pop.sets
[,1] [,2] [,3] [,4] [,5]
PopSet1 "Pool1" "Pool2" "Pool3" "Pool4" "Pool5"
[33] http://cb.csail.mit.edu/cb/mixmapper/
[34] provided that Z-scores were estimated for $f_3$ and $f_4$ statistics, i.e., that block-jackknife estimates of s.e. were carried out, see section 4
As previously mentioned (section 4), for a given set consisting of $n$ populations, a total of $3\binom{n}{4} = \frac{1}{8}n(n-1)(n-2)(n-3)$ quadruplets can be formed. In other words, a given set of four populations A, B, C and D is actually represented by only three quadruplets representative of the three possible unrooted tree topologies [35]: i) (A,B;C,D); ii) (A,C;B,D); and iii) (A,D;B,C). Among these, only a single quadruplet is expected to pass the treeness test (i.e., if the correct unrooted tree topology is (A,C;B,D), then the absolute value of the Z-scores associated with f4(A,B;C,D) and f4(A,D;B,C) or their equivalent will be high). For each of the identified sets of presumably unadmixed populations, the list of the $\binom{n}{4}$ quadruplets passing the treeness test is given in the passing.quadruplets element of the output list as illustrated below:
# list of the 5 quadruplets passing the treeness test for the identified set
scaf.pops$passing.quaduplets
[,1] [,2] [,3]
PopSet1 "Pool1,Pool3;Pool4,Pool5" "Pool2,Pool3;Pool4,Pool5" "Pool1,Pool2;Pool4,Pool5"
[,4] [,5]
PopSet1 "Pool1,Pool2;Pool3,Pool4" "Pool1,Pool2;Pool3,Pool5"
In addition, for each of the identified sets, the range of variation of the absolute Z-scores of the passing quadruplets is given in the Z_f4.range element of the output list:
scaf.pops$Z_f4.range
Min. |Zscore| Max. |Zscore|
PopSet1 0.02584028 0.8067878
When several sets are identified, this information may be useful to prioritize the sets of unadmixed populations
(e.g., via a minimax criterion consisting of choosing the set of populations that has the lowest maximal
absolute Z-score for its underlying quadruplets that pass the treeness test).
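The quadruplet counts used in this section are easy to verify by enumeration. The following base R sketch introduces hypothetical helpers (n_quadruplets, quadruplet_configs, all_quadruplets) that list, for every subset of four populations, the three configurations corresponding to the three unrooted topologies; it is an illustration of the counting argument, not of the greedy search itself.

```r
# Total number of quadruplets for n populations: 3 * choose(n, 4)
n_quadruplets <- function(n) 3 * choose(n, 4)

# The three configurations (one per unrooted topology) for four populations
quadruplet_configs <- function(p) {
  stopifnot(length(p) == 4)
  c(sprintf("%s,%s;%s,%s", p[1], p[2], p[3], p[4]),
    sprintf("%s,%s;%s,%s", p[1], p[3], p[2], p[4]),
    sprintf("%s,%s;%s,%s", p[1], p[4], p[2], p[3]))
}

# All configurations for a set of population names
all_quadruplets <- function(pops) {
  sets <- combn(pops, 4, simplify = FALSE)
  unlist(lapply(sets, quadruplet_configs))
}

quads <- all_quadruplets(paste0("P", 1:5))
```

For a set of unadmixed populations, exactly one configuration out of each group of three is expected to pass the treeness test, which gives choose(n, 4) passing quadruplets in total.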
6.1.2 The rooted.njtree.builder function to build (and root) a tree of candidate scaffold populations
The rooted.njtree.builder function first builds a Neighbor-Joining tree [36] of a set of presumably unadmixed populations (as obtained, e.g., from the find.tree.popset function), given as a vector of population names in the pop.sel argument, based on the matrix of their pairwise $f_2$ stored in the provided fstats object (fstats argument). The resulting (unrooted) tree is then rooted by relying on the property that the root $R$ heterozygosity $h_R = 1 - Q_2^{A,B}$, estimated from all the possible pairs of populations $A$ and $B$ that are only connected through $R$ in the tree (i.e., $A$ and $B$ each belong to one of the two tree partitions defined by $R$), should be consistent (Lipson et al. 2013). In other words, the most likely rooted tree among the $(2n_l - 3)$ possible ones should be the one displaying the narrowest range of variation of the $h_R$ estimates. Note that the root is always placed in the mid-position of the candidate branch.
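The rooting criterion can be illustrated with a toy matrix of across-population heterozygosities. In the base R sketch below, the matrix H, the helper root_het_spread and the candidate partitions are all hypothetical; only three of the five possible root positions of a four-leaf tree are listed to keep the example short.

```r
# Hypothetical matrix of across-population heterozygosities (1 - Q2)
pops <- c("A", "B", "C", "D")
H <- matrix(0.19, 4, 4, dimnames = list(pops, pops))
H["A", "B"] <- H["B", "A"] <- 0.17   # A and B share more recent ancestry
H["C", "D"] <- H["D", "C"] <- 0.18   # so do C and D
diag(H) <- NA

# Spread of the h_R estimates over all pairs spanning a candidate root
root_het_spread <- function(H, left, right) {
  h <- as.vector(H[left, right, drop = FALSE])
  c(mean = mean(h), range = diff(range(h)),
    sd = if (length(h) > 1) sd(h) else NA_real_, ncomps = length(h))
}

# Candidate root positions, each defined by the two leaf partitions
candidates <- list(
  "AB|CD" = list(c("A", "B"), c("C", "D")),
  "A|BCD" = list("A", c("B", "C", "D")),
  "D|ABC" = list("D", c("A", "B", "C"))
)
spread <- t(sapply(candidates, function(p) root_het_spread(H, p[[1]], p[[2]])))
best <- rownames(spread)[which.min(spread[, "range"])]
```

The candidate whose cross-partition pairs all give the same heterozygosity has the smallest spread, which is the rationale for preferring the rooting with the most consistent $h_R$ estimates.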
The object returned by the rooted.njtree.builder function is a list consisting of:
• an element named n.rooted.trees corresponding to the number of possible rooted binary trees that were evaluated
• an element named fitted.rooted.trees.list which consists of a list of fitted.graph objects (indexed from 1 to n.rooted.trees)
• an element named best.rooted.tree which corresponds to the fitted.graph object associated with the most likely rooted tree (among all the fitted.rooted.trees.list ones) identified as the one displaying the minimal standard deviation over the $h_R$ estimates
• an element named root.het.est.var consisting of a matrix of n.rooted.trees rows and 4 columns with i) the average estimated root heterozygosity $h_R$ across all the pairs of leaves that are relevant for estimation (see above); ii) the size of the range of variation; iii) the standard deviation of the $h_R$ estimates; and iv) the number of population pairs relevant for estimation
• if n.edges>3, an element named nj.tree.eval that gives, for each evaluated rooted tree, the five f-statistics configurations displaying the worst fit, i.e., with the five highest absolute Z-scores for the predicted value (obtained by internally calling the compare.fitted.fstats function). Note that these are independent of the rooting (so cannot be used to infer the root position).
[35] for each of these quadruplets, seven other equivalent combinations formed by permuting populations within each pair can be defined as mentioned in the notice p16
[36] relying on the nj function from the popular package ape (Paradis et al. 2004)
Use of the rooted.njtree.builder function is illustrated below to build the scaffold tree (using Pool-Seq read
count data) based on the set of unadmixed populations (identified with find.tree.popset) and plotted with
the plot function applied to the fitted.graph object of the resulting list (best.rooted.tree element) to generate
Figure 13:
scaf.tree<-rooted.njtree.builder(fstats=sim6p.readcount30X.fstats,
pop.sel=scaf.pops$pop.sets[1,],plot.nj=FALSE)
Score of the NJ tree: 0.6884542
Construction of all the 7 possible rooted tree from the NJ tree
(stored as graph in the rooted.graph object of the output list)
plot(scaf.tree$best.rooted.tree)
Figure 13: Plot of the rooted scaffold tree of unadmixed populations inferred by rooted.njtree.builder. The fitted edge lengths are in a drift scale (x1000).
The fit of the Neighbor-Joining tree can be checked by inspecting the nj.tree.eval element:
scaf.tree$nj.tree.eval
Estimated Fitted Z–score
Pool1,Pool2;Pool3,Pool5 1.011014e-04 0.000000000 -0.8067878
Pool1,Pool2;Pool3,Pool4 6.087576e-05 0.000000000 -0.5098456
Pool2;Pool1,Pool5 4.120591e-03 0.004181198 0.4330143
Pool1;Pool2,Pool5 4.154896e-03 0.004094289 -0.4293005
Pool1,Pool2;Pool4,Pool5 4.022566e-05 0.000000000 -0.4233082
The following ranges of variation of the $h_R$ estimates were obtained for the different possible root positions:
scaf.tree$root.het.est.var
Mean Range sd ncomps
Tree1 0.1814754 0.0234940763 0.0115371318 4
Tree2 0.1810025 0.0228772762 0.0113327886 4
Tree3 0.1857808 0.0155159278 0.0076153034 6
Tree4 0.1907099 0.0007923822 0.0002825614 6
Tree5 0.1880369 0.0113894828 0.0055518382 4
Tree6 0.1878868 0.0111334494 0.0054505348 4
Tree7 0.1833542 0.0152408946 0.0085429545 4
The “best” inferred rooted scaffold tree (i.e., the fourth one, with both the lowest range and the lowest standard deviation of the $h_R$ estimates) is consistent with the simulated scenario. It may further be used as a reference graph to construct the complete admixture graph after adding the Pool6 population using the add.leaf function (see 5.3). The obtained graph, plotted in Figure 14 with the commands below, is very close to the one previously inferred with allele count data (Figure 12) and very strongly supported by the data:
add.pool6<-add.leaf(scaf.tree$best.rooted.tree,leaf.to.add="Pool6",
fstats=sim6p.readcount30X.fstats,verbose=FALSE,drift.scaling=TRUE)
plot(add.pool6$best.fitted.graph)
Figure 14: Plot of the graph with the lowest BIC among all the possible graphs connecting Pool6 to the scaffold tree of unadmixed populations tested by the add.leaf function. The fitted edge lengths are in drift units (x1000) since the drift.scaling argument was set to TRUE when calling add.leaf.
#D_BIC w.r.t. best fitted BIC
D_BIC=add.pool6$bic-add.pool6$best.fitted.graph@bic
#5 First lowest DeltaBIC (the first value of zero corresponding to the best fitted graph)
head(sort(D_BIC))
[1] 0.00000 78.75611 78.75611 254.27229 259.68839 259.68839
Notice
In practice, the rooted.njtree.builder function should be used with caution since both the Neighbor-
Joining tree construction and the heterozygosity-based rooting of the tree may be sensitive to
long-branch attraction (most particularly if some highly diverged populations are included). The
inferred topology may even violate treeness test for some of the quadruplets (see e.g., the empirical
example detailed in the Supplementary Vignette V2 by Gautier et al. 2021).
6.2 Extending an initial tree (or graph) with the graph.builder function
The graph.builder function implements a heuristic that explores the space of possible graphs more widely
(although usually still not exhaustively) by jointly adding several populations (leaves), given
as an input vector (leaves.to.add argument), to an existing graph (as generated using the rooted.njtree.builder
function described above or included in a graph.params or fitted.graph object) or to a list of graphs. The
algorithm adds the leaves in the order of the input vector to each of the graphs stored in a heap
via successive calls to the add.leaf function (section 5.3). More precisely, the heap first consists of the initial
input graph (or list of graphs) and, at each iteration, the add.leaf function is used to evaluate all the possible
connections of the candidate leaf (with non-admixed or admixed edges) to all the graphs of the heap. For
each of the latter, the newly fitted graphs displaying a BIC less than heap.dbic units (set to 6 by default)
away from the best fitting graph (i.e., the one with the lowest BIC) are included in the new heap. Once all
the graphs have been evaluated, if the heap contains more than max.heap.size graphs (set to 25 by default),
only the max.heap.size graphs with the lowest BIC are kept in the heap. Finally, after testing the
last candidate leaf, only the graphs with a BIC less than heap.dbic units away from the graph with the lowest
BIC in the heap are retained in the final list of graphs.
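The pruning rule described above can be sketched on a plain vector of BIC values. This is only an illustration under simplifying assumptions: prune.heap is a hypothetical helper (it is not a poolfstat function), the toy BIC values are made up, and a single pruning step is shown rather than the full iteration over candidate leaves.

```r
# Hypothetical helper (not part of poolfstat): prune a heap of candidate graphs
# that are represented here only by their BIC values.
prune.heap <- function(bic, heap.dbic = 6, max.heap.size = 25) {
  # keep the graphs whose BIC is less than heap.dbic units above the lowest BIC
  keep <- which(bic - min(bic) < heap.dbic)
  # order the retained graphs by increasing BIC
  keep <- keep[order(bic[keep])]
  # cap the heap size to max.heap.size graphs
  if (length(keep) > max.heap.size) keep <- keep[seq_len(max.heap.size)]
  keep
}

toy.bic <- c(281.3, 275.8, 290.1, 280.5, 276.2)  # made-up BIC values
kept <- prune.heap(toy.bic)
print(kept)                          # indices of the retained graphs, best first
print(toy.bic[kept] - min(toy.bic))  # their DeltaBIC
```

With the default thresholds, the third toy graph (DeltaBIC above 6) is the only one dropped; lowering max.heap.size further truncates the list to the best-supported graphs.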
The object returned by the function is a list consisting of:
• an element named n.graphs corresponding to the final number of graphs
• an element named fitted.graphs.list which consists of a list of fitted.graph objects (indexed from 1 to
n.graphs) containing the fit.graph function results for each graph
• an element named best.fitted.graph which is the fitted.graph object associated with the graph with the
lowest BIC among all the n.graphs graphs included in fitted.graphs.list
• an element named bic which is a vector containing the BIC of the n.graphs graphs (indexed from 1 to
n.graphs and in the same order as fitted.graphs.list).
Use of the graph.builder function is illustrated below on the PoolSeq example data by starting from an initial
rooted tree constructed with the rooted.njtree.builder function for the four populations Pool1, Pool3, Pool4 and Pool5.
This tree is extended by successively adding the two remaining populations Pool2 and Pool6:
#build an initial four-population tree with "Pool1","Pool3","Pool4" and "Pool5"
init.tree<-rooted.njtree.builder(fstats=sim6p.readcount30X.fstats,
pop.sel=c("Pool1","Pool3","Pool4","Pool5"),plot.nj=FALSE)
Score of the NJ tree: 0.0006790273
Construction of all the 5 possible rooted tree from the NJ tree
(stored as graph in the rooted.graph object of the output list)
#adding the two remaining pops
final.graphs<-graph.builder(x=init.tree$best.rooted.tree,leaves.to.add=c("Pool2","Pool6"),
fstats=sim6p.readcount30X.fstats)
####################
Adding: Pool2
21 graphs evaluated in 0 h 0 m 1 s
6 graphs stored in the heap
####################
Adding: Pool6
261 graphs evaluated in 0 h 0 m 5 s
7 graphs stored in the heap
Final Number of graphs: 7
(min. BIC= 275.8408 )
Overall Analysis Time: 0 h 0 m 6 s ( 282 graphs evaluated)
#D_BIC w.r.t. to the "true" graph as identified previously (object add.pool6$best.fitted.graph)
D_BIC=final.graphs$bic-add.pool6$best.fitted.graph@bic
#lowest DeltaBIC values (the first value, zero, corresponds to the best fitted graph)
head(sort(D_BIC))
[1] 0.000000 4.667688 4.923348 4.923348 4.923348 5.416100
Among the 7 final graphs, the one with the lowest BIC is exactly the same as the one plotted in Figure 14 and
corresponds to the simulated scenario. It should however be noted that other alternative graphs are also
identified with good support. Moreover, starting with other population trees (e.g., a three-population tree
consisting of Pool1, Pool2 and Pool5) could result in several graphs with the same support (i.e., ΔBIC = 0)
but with a different positioning of the root (not shown). In practice, it may be important to start with
scaffold trees that are as large as possible and representative of the structuring of diversity of the represented
populations (i.e., not too unbalanced with respect to the leaves to be added). Some prior knowledge about
the relationships of some of the populations may also be helpful in that respect. As exemplified in the
Supplementary Vignette V2 of Gautier et al. (2021), it is also highly recommended to test different orders
of inclusion (possibly all) of the leaves (as specified in the vector leaves.to.add).
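Testing several orders of inclusion requires enumerating the orderings of the leaves.to.add vector. A minimal base-R sketch is given below; all.orders is a hypothetical helper (not a poolfstat function), and the graph.builder call is left as a comment because it needs the fitted objects of the example above.

```r
# Hypothetical helper: list all orderings of a vector (base R only)
all.orders <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    # put element i first, followed by every ordering of the remaining elements
    for (p in all.orders(x[-i])) out[[length(out) + 1]] <- c(x[i], p)
  }
  out
}

leaf.orders <- all.orders(c("Pool2", "Pool6"))
print(leaf.orders)

# Each ordering could then be passed to graph.builder, reusing the call pattern
# of the example above (not run here):
# res <- lapply(leaf.orders, function(o)
#   graph.builder(x = init.tree$best.rooted.tree, leaves.to.add = o,
#                 fstats = sim6p.readcount30X.fstats))
# best.bic <- sapply(res, function(r) min(r$bic))
```

The number of orderings grows factorially with the number of leaves, so an exhaustive exploration is probably only practical for a handful of leaves.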
7 Other utilities
7.1 Symbolic representation of the F parameters, admixture graph equations
and the scaled covariance matrix Ω with graph.params2symbolic.fstats
Given a graph topology relating the populations stored in a graph.params object, the graph.params2symbolic.fstats
function provides a symbolic representation of the model equations used to fit the underlying admixture
graph and of all the $F_2$, $F_3$ and $F_4$ parameters, together with the scaled covariance matrix of population allele
frequencies called $\Omega$ after Gautier (2015). Such a representation may be useful for a closer examination of
graph properties (or for educational purposes). The output object consists of a list with the following elements:
• a character matrix named model.matrix consisting of the matrix $M$ relating the parameters underlying
the basis f-statistics and graph edge lengths in the model equations defined as $f = Mb$, where $f$ is the
vector of the basis f-statistics (row names of the model.matrix $M$) and $b$ is the vector of graph edges
(column names of model.matrix $M$).
• a character matrix named omega consisting of the scaled covariance matrix of population allele
frequencies $\Omega$ (see e.g., Gautier 2015).
• a character vector F2.equations consisting of the symbolic representations of all the $\frac{1}{2}n_l(n_l-1)$ $F_2$
parameters (with edge and admixture proportion parameter names as defined in the graph.params
object)
• a character vector F3.equations consisting of the symbolic representations of all the $\frac{1}{2}n_l(n_l-1)(n_l-2)$ $F_3$
parameters (with edge and admixture proportion parameter names as defined in the graph.params
object)
• a character vector F4.equations consisting of the symbolic representations of all the
$\frac{1}{8}n_l(n_l-1)(n_l-2)(n_l-3)$ $F_4$ parameters (with edge and admixture proportion parameter names as defined in the
graph.params object)
These different equations can also be printed in an output text file (with name specified by the outfile
argument). The following example shows results obtained using the graph.params object sim.graph.params
generated in 5.1.1 (Figure 8) that specifies the simulation scenario (Figure 1):
sim.fstats.sym<-graph.params2symbolic.fstats(sim.graph.params,outfile = "Fstats_equations")
#Model equations matrix
sim.fstats.sym$model.matrix
P7<->P1 s1<->P2 s2<->P3 S<->P6 P8<->s2 P7<->s1 P9<->P4 P9<->P5 P8<->P7 R<->P8 R<->P9
F2(P1,P2) "1" "1" "0" "0" "0" "1" "0" "0" "0" "0" "0"
F2(P1,P3) "1" "0" "1" "0" "1" "0" "0" "0" "1" "0" "0"
F2(P1,P4) "1" "0" "0" "0" "0" "0" "1" "0" "1" "1" "1"
F2(P1,P5) "1" "0" "0" "0" "0" "0" "0" "1" "1" "1" "1"
F2(P1,P6) "1" "0" "0" "1" "a^2-2*a+1" "a^2" "0" "0" "a^2-2*a+1" "0" "0"
F3(P1;P2,P3) "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
F3(P1;P2,P4) "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
F3(P1;P2,P5) "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
F3(P1;P2,P6) "1" "0" "0" "0" "0" "a" "0" "0" "0" "0" "0"
F3(P1;P3,P4) "1" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0"
F3(P1;P3,P5) "1" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0"
F3(P1;P3,P6) "1" "0" "0" "0" "1-a" "0" "0" "0" "1-a" "0" "0"
F3(P1;P4,P5) "1" "0" "0" "0" "0" "0" "0" "0" "1" "1" "1"
F3(P1;P4,P6) "1" "0" "0" "0" "0" "0" "0" "0" "1-a" "0" "0"
F3(P1;P5,P6) "1" "0" "0" "0" "0" "0" "0" "0" "1-a" "0" "0"
#scaled covariance matrix of allele frequencies (Omega)
sim.fstats.sym$omega
P1 P2 P3
P1 "P7<->P1+P8<->P7+R<->P8" "P8<->P7+R<->P8" "R<->P8"
P2 "P8<->P7+R<->P8" "s1<->P2+P7<->s1+P8<->P7+R<->P8" "R<->P8"
P3 "R<->P8" "R<->P8" "s2<->P3+P8<->s2+R<->P8"
P6 "P8<->P7*a+R<->P8" "(P7<->s1+P8<->P7)*a+R<->P8" "P8<->s2+R<->P8-P8<->s2*a"
P4 "0" "0" "0"
P5 "0" "0" "0"
P6 P4 P5
P1 "P8<->P7*a+R<->P8" "0" "0"
P2 "(P7<->s1+P8<->P7)*a+R<->P8" "0" "0"
P3 "P8<->s2+R<->P8-P8<->s2*a" "0" "0"
P6 "S<->P6+(P8<->s2+P7<->s1+P8<->P7)*a^2-2*P8<->s2*a+P8<->s2+R<->P8" "0" "0"
P4 "0" "P9<->P4+R<->P9" "R<->P9"
P5 "0" "R<->P9" "P9<->P5+R<->P9"
#F2 statistics (first six)
head(sim.fstats.sym$F2.equations)
[1] "F2(P1,P2) = P7<->P1+s1<->P2+P7<->s1"
[2] "F2(P1,P3) = P7<->P1+P8<->P7+s2<->P3+P8<->s2"
[3] "F2(P1,P6) = P7<->P1+(a^2-2*a+1)*P8<->P7+S<->P6+(a^2-2*a+1)*P8<->s2+a^2*P7<->s1"
[4] "F2(P1,P4) = P7<->P1+P8<->P7+R<->P8+P9<->P4+R<->P9"
[5] "F2(P1,P5) = P7<->P1+P8<->P7+R<->P8+P9<->P5+R<->P9"
[6] "F2(P2,P3) = s1<->P2+P7<->s1+P8<->P7+s2<->P3+P8<->s2"
#F3 statistics (first six)
head(sim.fstats.sym$F3.equations)
[1] "F3(P1;P2,P3) = P7<->P1"
[2] "F3(P1;P2,P6) = P7<->P1+a*P7<->s1"
[3] "F3(P1;P2,P4) = P7<->P1"
[4] "F3(P1;P2,P5) = P7<->P1"
[5] "F3(P1;P3,P6) = P7<->P1+(1-a)*P8<->P7+(1-a)*P8<->s2"
[6] "F3(P1;P3,P4) = P7<->P1+P8<->P7"
#F4 statistics (first six)
head(sim.fstats.sym$F4.equations)
[1] "F4(P1,P2;P3,P6) = P7<->s1*a"
[2] "F4(P1,P2;P3,P4) = 0"
[3] "F4(P1,P2;P3,P5) = 0"
[4] "F4(P1,P2;P6,P4) = -P7<->s1*a"
[5] "F4(P1,P2;P6,P5) = -P7<->s1*a"
[6] "F4(P1,P2;P4,P5) = 0"
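The symbolic equations printed above can be checked numerically. The sketch below plugs arbitrary (made-up) edge lengths into three of the printed $F_2$ equations and recovers $F_3(P1;P2,P3)$ through the standard identity $F_3(A;B,C) = \frac{1}{2}\left(F_2(A,B)+F_2(A,C)-F_2(B,C)\right)$, which is not restated in this section. The edge lengths and the value of a are assumptions chosen for illustration only.

```r
# Arbitrary (hypothetical) edge lengths, named as in the symbolic output
e <- c("P7<->P1" = 0.02, "s1<->P2" = 0.01, "P7<->s1" = 0.015,
       "P8<->P7" = 0.03, "s2<->P3" = 0.025, "P8<->s2" = 0.005)

# F2 equations copied from the printed F2.equations element
F2.P1P2 <- e["P7<->P1"] + e["s1<->P2"] + e["P7<->s1"]
F2.P1P3 <- e["P7<->P1"] + e["P8<->P7"] + e["s2<->P3"] + e["P8<->s2"]
F2.P2P3 <- e["s1<->P2"] + e["P7<->s1"] + e["P8<->P7"] + e["s2<->P3"] + e["P8<->s2"]

# F3(P1;P2,P3) recovered from the three F2 values
F3.P1.P2P3 <- unname((F2.P1P2 + F2.P1P3 - F2.P2P3) / 2)
print(F3.P1.P2P3)  # should match the P7<->P1 edge length, as in F3.equations

# the coefficient a^2-2*a+1 appearing for the admixed leaf P6 is simply (1-a)^2
a <- 0.25          # arbitrary admixture proportion
print(all.equal(a^2 - 2 * a + 1, (1 - a)^2))
```

All edges shared by the paths from P1 to P2 and from P1 to P3 cancel except the external branch leading to P1, which is why the printed equation reduces to a single edge.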
7.2 Generating files for the qpGraph software with graph.params2qpGraphFiles
The graph.params2qpGraphFiles function generates the files required by the qpGraph software
(Patterson et al. 2012) from a graph.params object that includes estimates of f-statistics (see section 5.1.2).
If f is the prefix specified with the outfileprefix argument of the function (by default f=out), these
are:
• a file named “f.graph” that specifies the graph in qpGraph format
• a file named “f.fstats” with estimates of f-statistics (and their covariance) included in the input
graph.params object
• a parameter file named “f.parqpGraph” to run qpGraph (this file may be edited by hand if other options
are needed).
The qpGraph software (v7365 and above to allow f-statistics estimates to be provided as input) may then be
run on a terminal using the following options:
qpGraph -p f.parqpGraph -g f.graph -o out.ggg -d out.dot.input
The “f.fstats” f-statistics file must be in the same directory, or its path should be explicitly specified by
editing the “f.parqpGraph” parameter file. The following example runs qpGraph (provided the software is
properly installed) on the sim.graph.params object generated in 5.2 (see Figure 9 representing the fitted
graph obtained with the fit.graph function):
graph.params2qpGraphFiles(sim.graph.params,outfileprefix = "sim.graph")
Fstats input file for qpGraph written in sim.graph.fstats
Graph input file for qpGraph written in sim.graph.graph
Parameter File for qpGraph with some default parameters written in sim.graph.parqpGraph
#running qpGraph (installed locally) outside R
system("qpGraph -p sim.graph.parqpGraph -g sim.graph.graph -o sim.graph.g -d sim.graph.dot")
#plotting the dot file generated by qpGraph with grViz (as done internally by poolfstat)
require(DiagrammeR)
grViz("sim.graph.dot")
[Figure 15 plot: admixture graph fitted by qpGraph, with leaves P1 to P6, root R, internal nodes P7, P8, P9, s1, s2 and S, and admixture proportions of 75% and 25%]
Figure 15: Fitting results obtained by qpGraph on the same data as the one used to generate Figure 9.
Comparison of Figures 15 and 9 shows that the same results are obtained with the two fitting methods (note
that edge lengths are not scaled in drift units on the two figures).
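The system call used in the example above fails if qpGraph is not installed or if the input files are missing. A small guard such as the following may help; it is only a sketch that assumes the executable is named qpGraph and is reachable through the PATH, and it reuses the file names of the example.

```r
# Sketch: build the qpGraph command from the file prefix and run it only if the
# executable and the parameter file are found (assumption: binary named "qpGraph")
prefix <- "sim.graph"
cmd <- sprintf("qpGraph -p %s.parqpGraph -g %s.graph -o %s.g -d %s.dot",
               prefix, prefix, prefix, prefix)
if (nzchar(Sys.which("qpGraph")) && file.exists(paste0(prefix, ".parqpGraph"))) {
  system(cmd)
} else {
  message("qpGraph or its input files not found; command not run:\n", cmd)
}
```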
8 References
Akey J. M., Zhang G., Zhang K., Jin L., Shriver M. D., 2002 Interrogating a high-density snp map for
signatures of natural selection. Genome Research 12: 1805–1814.
Garrison E., Marth G., 2012 Haplotype-based variant detection from short-read sequencing. arXiv:
1207.3907.
Gautier M., Vitalis R., Flori L., Estoup A., 2021 f-statistics estimation and admixture graph construction
with pool-seq or allele count data using the R package poolfstat. submitted.
Gautier M., 2015 Genome-wide scan for adaptive divergence and association with population-specific
covariates. Genetics 201: 1555–1579.
Hivert V., Leblois R., Petit E. J., Gautier M., Vitalis R., 2018 Measuring genetic differentiation from
pool-seq data. Genetics 210: 315–330.
Iannone R., 2020 DiagrammeR: Graph/network visualization.
Karlsson E. K., Baranowska I., Wade C. M., Salmon Hillbertz N. H. C., Zody M. C., Anderson
N., Biagi T. M., Patterson N., Pielberg G. R., Kulbokas E. J., Comstock K. E., Keller E. T.,
Mesirov J. P., Euler H. von, Kämpe O., Hedhammar A., Lander E. S., Andersson G., Andersson
L., Lindblad-Toh K., 2007 Efficient mapping of mendelian traits in dogs through genome-wide association.
Nature Genetics 39: 1321–1328.
Kass R. E., Raftery A. E., 1995 Bayes factors. Journal of the American Statistical Association 90: 773–795.
Kelleher J., Etheridge A. M., McVean G., 2016 Efficient coalescent simulation and genealogical analysis
for large sample sizes. PLoS Computational Biology 12: e1004842.
Knaus B. J., Grünwald N. J., 2017 Vcfr: A package to manipulate and visualize variant call format data
in R. Molecular Ecology Resources 17: 44–53.
Koboldt D. C., Zhang Q., Larson D. E., Shen D., McLellan M. D., Lin L., Miller C. A., Mardis E.
R., Ding L., Wilson R. K., 2012 VarScan 2: Somatic mutation and copy number alteration discovery in
cancer by exome sequencing. Genome Research 22: 568–576.
Kofler R., Orozco-terWengel P., De Maio N., Pandey R. V., Nolte V., Futschik A., Kosiol
C., Schlötterer C., 2011 PoPoolation: A toolbox for population genetic analysis of next generation
sequencing data from pooled individuals. PloS One 6: e15925.
Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin
R., 1000 Genome Project Data Processing Subgroup, 2009 The sequence alignment/map format and SAMtools.
Bioinformatics 25: 2078–2079.
Lipson M., Loh P.-R., Levin A., Reich D., Patterson N., Berger B., 2013 Efficient moment-based
inference of admixture parameters and sources of gene flow. Molecular Biology and Evolution 30: 1788–1802.
Lipson M., 2020 Applying f-statistics and admixture graphs: Theory and examples. Molecular Ecology
Resources 20: 1658–1667.
McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K.,
Altshuler D., Gabriel S., Daly M., DePristo M. A., 2010 The genome analysis toolkit: A mapreduce
framework for analyzing next-generation dna sequencing data. Genome Research 20: 1297–1303.
Paradis E., Claude J., Strimmer K., 2004 APE: Analyses of phylogenetics and evolution in r language.
Bioinformatics 20: 289–290.
Patterson N., Moorjani P., Luo Y., Mallick S., Rohland N., Zhan Y., Genschoreck T., Webster
T., Reich D., 2012 Ancient admixture in human history. Genetics 192: 1065–1093.
Peter B. M., 2016 Admixture, population structure, and f-statistics. Genetics 202: 1485–1501.
Pickrell J. K., Pritchard J. K., 2012 Inference of population splits and mixtures from genome-wide allele
frequency data. PLoS Genetics 8: e1002967.
Reich D., Thangaraj K., Patterson N., Price A. L., Singh L., 2009 Reconstructing indian population
history. Nature 461: 489–494.
Vitalis R., Gautier M., Dawson K. J., Beaumont M. A., 2014 Detecting and measuring selection from
gene frequency data. Genetics 196: 799–817.
Weir B. S., 1996 Genetic data analysis II : Methods for discrete population genetic data. Sinauer Associates,
Sunderland, Mass.
Weir B. S., Cockerham C. C., 1984 Estimating F-statistics for the analysis of population structure.
Evolution 38: 1358–1370.
A Appendix
A.1 Block-Jackknife estimation of standard errors
The standard error of genome-wide estimates of $F_{ST}$ and other f-statistics can be estimated using a block-jackknife
sampling approach (see Patterson et al. 2012 and references therein). The algorithm implemented in
poolfstat consists of dividing the genome into contiguous chunks of a pre-defined number of SNPs (specified
by the argument nsnp.per.bjack.block of the computeFST, compute.pairwiseFST or compute.fstats functions,
see sections 3.1.1, 3.2 and 4 respectively) and then removing each block in turn to quantify the variability of
the estimates. If $n_b$ blocks are available and $\hat{f}_i$ is the estimate of the statistic when removing all the SNPs
belonging to block $i$, the standard error $\hat{\sigma}_f$ of the estimator $\hat{f}$ of the statistic of interest is computed as:
$$\hat{\sigma}_f = \sqrt{\frac{n_b-1}{n_b}\sum_{i=1}^{n_b}\left(\hat{f}_i-\hat{\mu}_f\right)^2}$$
where $\hat{\mu}_f = \frac{1}{n_b}\sum_{i=1}^{n_b}\hat{f}_i$, which may be slightly different from the estimator obtained with all the markers since
the latter may include SNPs that are not eligible for block-jackknife sampling (e.g., those at the
chromosome/scaffold boundaries or those belonging to chromosomes/scaffolds with fewer than nsnp.per.bjack.block
SNPs). Finally, block-jackknife sampling may also be used to obtain estimates of the covariance between the
estimates $\hat{f}_a$ and $\hat{f}_b$ as (using similar notations):
$$\widehat{\mathrm{Cov}}\left(\hat{f}_a;\hat{f}_b\right) = \frac{n_b-1}{n_b}\sum_{i=1}^{n_b}\left(\hat{f}_{a,i}-\hat{\mu}_{f_a}\right)\left(\hat{f}_{b,i}-\hat{\mu}_{f_b}\right)$$
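The two block-jackknife formulas of this appendix translate directly into a few lines of R. The sketch below is not the poolfstat implementation: bjack.se and bjack.cov are hypothetical helpers operating on vectors of leave-one-block-out estimates, and the input values are simulated rather than real estimates.

```r
# Hypothetical helpers implementing the block-jackknife standard error and
# covariance from vectors of leave-one-block-out estimates
bjack.se <- function(f.i) {
  nb <- length(f.i)
  sqrt((nb - 1) / nb * sum((f.i - mean(f.i))^2))
}
bjack.cov <- function(fa.i, fb.i) {
  stopifnot(length(fa.i) == length(fb.i))
  nb <- length(fa.i)
  (nb - 1) / nb * sum((fa.i - mean(fa.i)) * (fb.i - mean(fb.i)))
}

set.seed(1)
fa.i <- rnorm(145, mean = 0.07, sd = 1e-4)  # simulated values, not real estimates
fb.i <- fa.i + rnorm(145, sd = 1e-4)
print(bjack.se(fa.i))
print(bjack.cov(fa.i, fb.i))
```

As a consistency check, the covariance of an estimator with itself equals its squared standard error.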
Analysis of a D. suzukii Pool-Seq data set with the R package
poolfstat
2021-05-27
Contents
1 Data preparation and SNP filtering
2 Estimation of global FST and all f- and D-statistics
3 Overview of the within-population genetic diversity from estimates of heterozygosities
4 Overview of the structuring of genetic diversity across populations from pairwise-population FST estimates
5 Insights from f3-based admixture tests
6 Exploring invasion scenarios with admixture graph construction
6.1 Scaffold trees built (naively) from scratch are unreliable
6.2 Investigating the relationships among the populations from the native area (and the Hawaiian population)
6.3 Investigating the origins of the populations from the American invasive area using admixture graphs
References
This vignette details the analysis with the R package poolfstat (see footnote 1) of the Pool-Seq data previously generated
by Olazcuaga et al. (2020) for 14 population samples of the invasive species Drosophila suzukii (Figure
1 and Table 1). The objective of this analysis was to illustrate the main poolfstat package
functionalities on real data while providing insights into the history of both native populations and the
recent invasion of this species into America (Fraimout et al. 2017). As illustrated in Figure 1, the pool
samples represented 6 populations from the Asian native area, one from Hawaii (see Figure 1 legend) and 7
populations from the American continent. Out of the 14 population samples, all but CN-Bei originate from
the same site as in Fraimout et al. (2017), who inferred the routes of invasion on a worldwide scale under
an Approximate Bayesian Computation Random Forest (ABC-RF) approach using a data set consisting of
685 individuals belonging to 23 populations (from 15 to 44 individuals per sample) and genotyped at 25
autosomal microsatellite loci. As detailed in Table 1, for nine (CN-Lia, CN-Nin, CN-Shi, JP-Sap, BR-Pal,
US-Col, US-Sdi, US-Sok and US-Wat) of the 14 population samples, all or most (30 out of 50 and 40 out of 50
for CN-Nin and CN-Shi, respectively, according to Table 1) of the individuals included in the sequenced pool were in common
between the two studies. This makes the outcome of the Pool-Seq data analysis based on poolfstat directly
comparable with the analysis performed by Fraimout et al. (2017).
[Figure 1 map: sampling locations of the 14 population samples (CN-Bei, CN-Lia, CN-Nin, CN-Shi, JP-Sap, JP-Tok, BR-Pal, US-Col, US-Haw, US-Nca, US-Sdi, US-Sok, US-Wat, US-Wis); legend: Native, Hawaiian (>1980), American (>2008)]
Figure 1: Origin of the 14 population samples of D. suzukii. Population names are colored according to their
area of origin, i.e., whether they originate from the Asian native area, Hawaii or continental America, where the
species was first observed in 2008. As done in previous studies, we distinguished the Hawaiian population
from the American invasive area because the presence of D. suzukii was first described in 1980 in Hawaii.
Hence, this Pacific Island population may be considered as intermediate between the Asian native and most
recently invaded areas (approximately 300 generations later assuming 10 generations per year). The 13
population samples originating from the same site as in Fraimout et al. (2017) are indicated by solid points.
1 Data preparation and SNP filtering
The read alignment (bam) files from Olazcuaga et al. (2020), obtained after mapping Pool-Seq data
for each of the 14 population samples onto the latest near-chromosome scale assembly of the Drosophila
suzukii genome (Paris et al. 2020), were combined into an mpileup file using SAMtools 1.9 with options -q 20
-Q 20 (Li et al. 2009). Variant calling was then performed on the resulting file using VarScan mpileup2snp
v2.3.4 (Koboldt et al. 2012) run with options --min-coverage 10 --min-avg-qual 25 --min-var-freq 0.005
--p-value 0.5 --output-vcf 1, thereby applying very loose criteria to identify SNPs, further SNP filtering being delayed
to the parsing of the vcf file. In particular, we set the p-value threshold to 0.5, leading to the identification
of SNPs with only one read supporting the non-reference allele in a single pool whatever the coverage (see footnote 2).

Footnote 1: All the analyses were carried out on a laptop computer equipped with an Intel Xeon Quad-Core 3.0 GHz
processor and 32 GB RAM.

| Sample name | Status | Sample Size | Sampling site | Sampling date | SRA ID |
|---|---|---|---|---|---|
| CN-Bei (-) | Native | 50 (0) | Beijing, China | 2014 | SRR10260017 |
| CN-Lia (CN-Lia) | Native | 50 (50) | Liaoyuan, China | 2014 | SRR10260016 |
| CN-Nin (CN-Nin) | Native | 50 (30) | Ningbo, China | 2014/2016 | SRR10260027 |
| CN-Shi (CN-Shi) | Native | 50 (40) | Shiping county, China | 2014/2016 | SRR10260024 |
| JP-Sap (JP-Sap) | Native | 50 (50) | Sapporo, Japan | 2014 | SRR10260023 |
| JP-Tok (JP-Tok) | Native | 50 (0) | Tokyo, Japan | 2016 | SRR10260022 |
| BR-Pal (BR-PA) | Invasive (AM) | 50 (50) | Porto Alegre, Brazil | 2014 | SRR10260033 |
| US-Col (US-Col) | Invasive (AM) | 50 (50) | Fort Collins, USA | 2015 | SRR10260032 |
| US-Haw (US-Haw) | Invasive (PA) | 50 (0) | Hawaii (Hilo), USA | 2016 | SRR10260031 |
| US-Nca (US-NC) | Invasive (AM) | 100 (0) | Raleigh, USA | 2016 | SRR10260030 |
| US-Sdi (US-SD) | Invasive (AM) | 50 (50) | San-Diego, USA | 2014 | SRR10260029 |
| US-Sok (US-Sok) | Invasive (AM) | 75 (75) | Dayton, USA | 2014 | SRR10260028 |
| US-Wat (US-Wat) | Invasive (AM) | 50 (50) | Watsonville, USA | 2014 | SRR10260026 |
| US-Wis (US-Wis) | Invasive (AM) | 75 (0) | Barneveld, USA | 2016 | SRR10260025 |

Table 1: Pool sample description (adapted from Table S2 in Olazcuaga et al., 2020). Population names
and the number of individuals included in the pools in common with Fraimout et al. (2017) are given in
parentheses in the first and third columns, respectively. Only the CN-Bei population sample is absent from
the microsatellite data set of Fraimout et al. (2017). As for the population origins (column Status), we
distinguished i) the Asian native area (6 population samples); ii) the Pacific islands (PA) invasive area (the
Hawaiian sample); and iii) the continental American (AM) invasive area (7 population samples).

Positions mapping to non-autosomal contigs (Paris et al. 2020) were subsequently discarded from the vcf
file to obtain the file named dsu.auto.vcf.gz, which can be downloaded from the public Zenodo repository
(http://doi.org/10.5281/zenodo.4709080) together with a file named dsu.auto.info that contains the names and
haploid sample sizes of each pool. As shown below, we then used the vcf2pooldata function to parse this vcf
file relying on default options except for the overall MAF threshold (computed from read counts), which was set
to 5% (min.maf argument), and the minimal read coverage for each pool, which was set to 50 (min.cov.per.pool
argument). The resulting pooldata object was further filtered with the pooldata.subset function to discard i)
all positions with a coverage higher than the 99% quantile coverage in at least one pool (cov.qthres.per.pool
argument); and ii) all SNPs with MAF<5% over all the populations from the native area, to ensure
that mostly ancestral SNPs were included in the data as implicitly assumed by f-statistics based approaches
(see the main text of the manuscript and associated references).
#loading information file containing:
# i) pool names (in the same order as in the vcf file)
# ii) haploid pool size
dsu.info=read.table("dsu.auto.info",stringsAsFactors = F)
#parsing of the vcf file for autosomal contigs
dsu.dat=vcf2pooldata(vcf.file="dsu.auto.vcf.gz",
poolsizes =dsu.info[,2],poolnames = dsu.info[,1],
min.maf = 0.05,min.cov.per.pool = 50,remove.indels = T,
nlines.per.readblock = 1e7)
Reading Header lines
Parsing allele counts
VarScan like format detected for allele count data: the AD field contains allele depth
for the alternate allele and RD field for the reference allele
(N.B., positions with more than one alternate allele will be ignored)
1e+07 lines processed in 0 h 5 m 54 s : 773642 SNPs found
Footnote 2: The Fisher exact test used for variant calling in VarScan gives
$p = \binom{c}{r}\binom{c}{0}/\binom{2c}{r} = \binom{c}{r}/\binom{2c}{r}$, where $c$ is the coverage and $r$ is the number
of reads supporting the non-reference allele. Hence, $p = 0.5$ if $r = 1$.
2e+07 lines processed in 0 h 10 m 50 s : 1546200 SNPs found
3e+07 lines processed in 0 h 15 m 27 s : 2136045 SNPs found
30922056 lines processed in 0 h 15 m 50 s : 2161633 SNPs found
Data consists of 2161633 SNPs for 14 Pools
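The p-value formula given in footnote 2 can be evaluated directly to see why a threshold of 0.5 admits SNPs supported by a single non-reference read. In the sketch below, varscan.p is a hypothetical helper that only codes that formula (it does not call VarScan), and the coverage values are arbitrary.

```r
# Hypothetical helper coding the footnote 2 formula: choose(c, r) / choose(2c, r)
varscan.p <- function(r, cov) choose(cov, r) / choose(2 * cov, r)

print(varscan.p(r = 1, cov = c(10, 50, 200)))  # one supporting read, any coverage
print(varscan.p(r = 2, cov = 50))              # two supporting reads
# the same quantity written as a hypergeometric probability
print(all.equal(varscan.p(2, 50), dhyper(2, 50, 50, 2)))
```

With a single supporting read the ratio reduces to c/(2c), so the value does not depend on the coverage.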
#Further filtering of SNPs keeping only those with MAF>0.05 on the native populations
#and discarding positions with a coverage exceeding the 99% pool-specific quantile in at least one pop
pop.nat.names=c("CN-Bei","CN-Lia","CN-Nin","CN-Shi","JP-Tok","JP-Sap")
pop.nat.idx=which(dsu.dat@poolnames %in% pop.nat.names)
tmp.f.nat=rowSums(dsu.dat@refallele.readcount[,pop.nat.idx])/
rowSums(dsu.dat@readcoverage[,pop.nat.idx])
maf.native=0.5-abs(0.5-tmp.f.nat)
dsu.dat=pooldata.subset(dsu.dat,snp.index = which(maf.native>0.05),
cov.qthres.per.pool = c(0,0.99))
Data consists of 1588569 SNPs for 14 Pools
The final data set thus consists of 1,588,569 SNPs, with sequencing coverage statistics among the 14 different
pools detailed in Table 2.
| Pool | Mean Coverage | Median Coverage | Coverage Range |
|---|---|---|---|
| BR-Pal | 76.72 | 76 | 50-135 |
| CN-Bei | 95.93 | 95 | 50-160 |
| CN-Lia | 69.68 | 69 | 50-127 |
| CN-Nin | 66.7 | 65 | 50-122 |
| CN-Shi | 68.41 | 67 | 50-138 |
| JP-Sap | 84.53 | 84 | 50-174 |
| JP-Tok | 68.05 | 67 | 50-116 |
| US-Col | 79.35 | 79 | 50-144 |
| US-Haw | 95.41 | 95 | 50-160 |
| US-Nca | 74.24 | 73 | 50-121 |
| US-Sdi | 89.64 | 89 | 50-143 |
| US-Sok | 65.29 | 64 | 50-130 |
| US-Wat | 73.34 | 72 | 50-126 |
| US-Wis | 76.7 | 76 | 50-143 |

Table 2: Summary of the sequencing coverage of the 14 Pool-Seq samples over the 1,588,569 selected SNPs
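The native-area MAF filter applied during data preparation folds the pooled reference allele frequency into a minor allele frequency. The toy example below reproduces that computation on made-up read counts (3 SNPs, 2 pools); the matrices stand in for the refallele.readcount and readcoverage slots and are assumptions for illustration only.

```r
# Toy read counts for 3 SNPs (rows) in 2 native pools (columns); made-up values
ref.count <- matrix(c(60, 55,
                       3,  1,
                      70, 69), ncol = 2, byrow = TRUE)
coverage  <- matrix(c(70, 60,
                      65, 70,
                      72, 70), ncol = 2, byrow = TRUE)

# reference allele frequency over the native pools, folded into a MAF
f.ref <- rowSums(ref.count) / rowSums(coverage)
maf   <- 0.5 - abs(0.5 - f.ref)
print(round(maf, 3))
print(which(maf > 0.05))  # SNPs that would pass the native-area MAF filter
```

Only the first toy SNP passes: the other two are nearly fixed for the alternate and the reference allele, respectively, and folding treats both cases symmetrically.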
2 Estimation of global FST and all f- and D-statistics
To characterize the overall structuring of genetic diversity, we estimated the $F_{ST}$ over all the 14 populations
using the estimator previously described by Hivert et al. (2018) and implemented in the computeFST
function. We estimated the standard error of the resulting estimate using a block-jackknife approach specifying
blocks of 10,000 SNPs. As shown below (information printed by the computeFST function), this resulted
in 145 blocks of 700 kb on average. Note that the number of SNPs eligible for block-jackknife (and thus
used for estimation) was reduced to 1,450,000 SNPs (i.e., about 91% of the 1,588,569) due to the removal of
SNPs mapping to contigs represented by fewer than 10,000 SNPs (i.e., small contigs, which may actually be less
reliable as they correspond to the most fragmented part of the assembly) and, to a lesser extent, SNPs mapping
to contig boundaries.
dsu.global.fst=computeFST(dsu.dat,nsnp.per.bjack.block = 10000)
Starting Block-Jackknife sampling
145 Jackknife blocks identified with 1450000 SNPs (out of 1588569 ).
SNPs map to 15 different chrom/scaffolds
Average (min-max) Block Sizes: 0.698 ( 0.414 - 2.306 ) Mb
#Global Fst estimate
dsu.global.fst$mean.fst
[1] 0.07107105
#95% CI of the Fst estimate (based on the block-jackknife s.e.)
dsu.global.fst$mean.fst+c(-1.96,1.96)*dsu.global.fst$se.fst
[1] 0.06897769 0.07316441
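The figures printed above can be cross-checked with simple arithmetic. The sketch below only reuses numbers shown in the output; the standard error is back-computed from the printed confidence interval and is therefore an approximation, not a value returned by computeFST.

```r
# Number of SNPs used for block-jackknife: 145 blocks of 10,000 SNPs
n.blocks <- 145; nsnp.per.block <- 10000; n.snp.total <- 1588569
n.snp.used <- n.blocks * nsnp.per.block
print(n.snp.used)
print(round(100 * n.snp.used / n.snp.total, 1))  # percentage of SNPs retained

# 95% CI from an estimate and its standard error (normal approximation)
ci95 <- function(est, se) est + c(-1.96, 1.96) * se
se.approx <- (0.07316441 - 0.06897769) / (2 * 1.96)  # back-computed from the CI
print(ci95(0.07107105, se.approx))
```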
We next computed all the f- and D-statistics together with their block-jackknife standard errors (with
blocks of 10,000 SNPs as for the global $F_{ST}$) using the compute.fstats function. These included:
• $f_2$ for all the 91 pairs of populations together with their scaled version, which corresponds to the pairwise
$F_{ST}$ estimator based on the IIS probabilities (i.e., similar to the one implemented by the computeFST
or compute.pairwiseFST functions when specifying method=“Identity”)
• $f_3$ (and their scaled version $f_3^\star$) for all the 1,092 possible triplets of populations
• $f_4$ (and their scaled version D) for all the 3,003 possible quadruplets of populations
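The numbers of pairs, triplets and quadruplets listed above follow from simple combinatorics. A short sketch, in which n.fstats is a hypothetical helper (not a poolfstat function):

```r
# Hypothetical helper: number of distinct f2, f3 and f4 for n populations
n.fstats <- function(n) {
  c(F2 = choose(n, 2),          # unordered pairs
    F3 = n * choose(n - 1, 2),  # one target and an unordered pair of sources
    F4 = 3 * choose(n, 4))      # three distinct pairings per set of four
}
print(n.fstats(14))  # the 14 D. suzukii population samples
print(n.fstats(6))   # e.g., a six-population data set
```

These expressions are equivalent to n(n-1)/2, n(n-1)(n-2)/2 and n(n-1)(n-2)(n-3)/8, respectively.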
dsu.fstats=compute.fstats(dsu.dat,computeDstat = TRUE,
return.F4.blockjackknife.samples = TRUE,
nsnp.per.bjack.block = 10000)
Estimating Q1
Estimating Q2
Estimating within-population heterozygosities
Estimating F2
Estimating F3
Estimating F4
Computing Dstat
Starting Block-Jackknife sampling
145 Jackknife blocks identified with 1450000 SNPs (out of 1588569 ).
SNPs map to 15 different chrom/scaffolds
Average (min-max) Block Sizes: 0.698 ( 0.414 - 2.306 ) Mb
computing Q1 averages per blocks
computing Q2 averages per blocks
computing F2 averages per blocks
Starting computation of estimators s.e.
within-pop heterozygosity s.e. estimation done
F2 s.e. estimation done
F3 and F3* s.e. estimation done
estimating F4 and Dstat s.e. (may be long since require denominator averages per blocks)
F4 and D s.e. estimation done
Overall Analysis Time: 0 h 1 m 37 s
Note that the return.F4.blockjackknife.samples argument was set to TRUE to allow for estimation of some
admixture rates using $f_4$ ratios (see below). As shown below, most of the computation time is
spent on estimating D-statistics, which are not needed for most subsequent analysis steps (e.g., treeness
tests may be based on $f_4$). In addition, owing to internal optimization of the code, block-jackknife
estimation of the standard errors is performed at very limited computational cost (except for D-statistics):
#computing all fstats without D-statistics
#with verbose=FALSE and hence use of sys.time to estimate computation time
tb=Sys.time()
tmp<-compute.fstats(dsu.dat,computeDstat = FALSE,
return.F4.blockjackknife.samples = TRUE,
nsnp.per.bjack.block = 10000,verbose=FALSE)
cat("Analysis (without D-statistics estimation) took",
round(difftime(Sys.time(),tb,units="secs"),1),"s")
Analysis (without D-statistics estimation) took 12.6 s
#computing all fstats without D-statistics
# and without blockjacknife estimates of s.e. (little interest)
tb=Sys.time()
tmp<-compute.fstats(dsu.dat,computeDstat = FALSE,verbose=FALSE)
cat("Analysis (without D-statistics and block-jackknife s.e. estimation)
took",round(difftime(Sys.time(),tb,units="secs"),1),"s")
Analysis (without D-statistics and block-jackknife s.e. estimation)
took 6.3 s
3 Overview of the within-population genetic diversity from estimates of heterozygosities
Estimates of within-population heterozygosities (as $1-\hat{Q}_1$) provide a rough assessment of the genetic diversity
of the different populations. It should be recalled here that the SNP ascertainment process (see section 1) is
expected to favor SNPs that are polymorphic in the native area. The estimates and their 95% CI can be plotted as
follows:
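#colors per population according to area of origin (tmp.col is not defined in the code
#shown in this vignette: the definition below is an assumed reconstruction based on the
#legend colors and on the pop.nat.names vector defined in section 1)
tmp.col=rep("#E45A5A",dsu.dat@npools)
names(tmp.col)=dsu.dat@poolnames
tmp.col[pop.nat.names]="#1A1237"
tmp.col["US-Haw"]="#847CA3"
plot(dsu.fstats@heterozygosities$`bjack mean`,ylab="Heterozygosity",xaxt="n",xlab="",
pch="",las=3,ylim=c(0.21,0.29))
axis(1,1:dsu.dat@npools,rownames(dsu.fstats@heterozygosities),las=3)
#plotting CI from block-jackknife estimate of s.e.
ci=cbind(dsu.fstats@heterozygosities$`bjack mean`-1.96*dsu.fstats@heterozygosities$`bjack s.e.`,
dsu.fstats@heterozygosities$`bjack mean`+1.96*dsu.fstats@heterozygosities$`bjack s.e.`)
tmp<-sapply(1:dsu.dat@npools,f<-function(z){
abline(v=z,lty=3,col="grey")
arrows(z,ci[z,1],z,ci[z,2],angle = 90,code=3,length=0.05,lwd=1.5,
col=tmp.col[rownames(dsu.fstats@heterozygosities)[z]])
})
legend("bottomleft",c("Native","Hawaiian (>1980)","American (>2008)"),
fill=c("#1A1237","#847CA3","#E45A5A"))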
As shown in Figure 2, US-Haw was clearly depleted in diversity, as previously observed based on microsatellite
data (Fraimout et al. 2017), which is also consistent with its island origin. The populations from the native
area tended to display more diversity than those of the American invasive area, with the notable exceptions
of CN-Shi (which displayed significantly lower heterozygosity than the other populations from the native area)
and US-Wat (which displayed the highest level of diversity among the populations from the invasive American
area, similar to those of the native area). Note that filtering of SNPs during data preparation (see section 1) may
have upwardly biased the relative amount of diversity in the native area.
4 Overview of the structuring of genetic diversity across populations from pairwise-population FST estimates
Figure 3 plots estimates of $F_{ST}$ with their 95% CI for all pairs of populations using the plot function applied
directly to the dsu.fstats fstats object computed above.
[Figure 2 plot: heterozygosity (y-axis, ticks from 0.22 to 0.28) for the 14 population samples BR-Pal, CN-Bei, CN-Lia, CN-Nin, CN-Shi, JP-Sap, JP-Tok, US-Col, US-Haw, US-Nca, US-Sdi, US-Sok, US-Wat and US-Wis (x-axis); legend: Native, Hawaiian (>1980), American (>2008)]
Figure 2: Estimates of within-population heterozygosities with their 95% CI from the ascertained SNPs
(MAF>0.05 in the native area).
layout(matrix(1:2,1,2))
plot(dsu.fstats,stat.name="Fst",main="Fst<=5%",cex.main=1.5,value.range=c(0,0.05))
abline(v=0.01,lty=2,col="blue")
#plot(dsu.fstats,stat.name="Fst",main="5%<Fst<10%",cex.main=1.5,value.range=c(0.05,0.1))
plot(dsu.fstats,stat.name="Fst",main="Fst>5%",cex.main=1.5,value.range=c(0.05,NA))
An alternative (and complementary) view is provided in Figure 4. It is based on the compute.pairwiseFST function
(which by default computes the FST estimator derived under an analysis-of-variance framework) and represents the
matrix of pairwise FST stored in the resulting pairwisefst object as a heatmap (using the heatmap function).
dsu.pairwise.fst=compute.pairwiseFST(dsu.dat)
Computation of the 91 pairwise Fst
Overall Analysis Time: 0 h 0 m 38 s
heatmap(dsu.pairwise.fst)
Note that the two different estimates of pairwise-population FST remain very similar (Figure 5).
plot(dsu.pairwise.fst@values$`Fst Estimate`,dsu.fstats@fst.values$`bjack mean`,
xlab="compute.pairwiseFST (method=Anova)",ylab="compute.fstats")
abline(a=0,b=1,lty=2)
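The 91 pairwise comparisons reported above correspond to all unordered pairs among the 14 populations. The following base-R sketch enumerates them and shows one way to arrange pairwise values into a symmetric matrix of the kind a heatmap needs. The values in fst.toy are random placeholders, not estimates from the data, and the object names are hypothetical.

```r
pops <- c("BR-Pal", "CN-Bei", "CN-Lia", "CN-Nin", "CN-Shi", "JP-Sap", "JP-Tok",
          "US-Col", "US-Haw", "US-Nca", "US-Sdi", "US-Sok", "US-Wat", "US-Wis")

# all unordered population pairs, one per row
pop.pairs <- t(combn(pops, 2))
nrow(pop.pairs)   # choose(14, 2) pairs

# placeholder values standing in for pairwise estimates (NOT real results)
set.seed(2)
fst.toy <- runif(nrow(pop.pairs), 0, 0.15)

# fill a symmetric matrix by indexing with the matrix of population names
fst.mat <- matrix(0, length(pops), length(pops), dimnames = list(pops, pops))
fst.mat[pop.pairs] <- fst.toy
fst.mat[pop.pairs[, 2:1]] <- fst.toy
```

A matrix built this way can be passed to stats::heatmap, for instance with symm = TRUE.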
As expected from their recent history and as already observed with other analyses (Olazcuaga et al. 2020),
populations tended to cluster according to their area of origin (Figure 4), with some geographically close
populations showing very low levels of differentiation. For instance, in continental America, US-Nca,
US-Col and US-Wis all displayed pairwise FST < 0.01 with each other (Figure 3), and in the Asian native
area, the three populations CN-Bei, CN-Nin and CN-Lia originating from North-West China were all found to be
closely related. Conversely, the Hawaiian sample (US-Haw) was the most highly differentiated from all
the other populations, with pairwise FST ranging from 11.7% (with US-Sok) to 17.0% (with US-Col). Note
that this explains why US-Haw behaves as an outgroup in the heatmap of Figure 4, thereby illustrating that
such a representation may be misleading if interpreted in terms of demographic history (Hawaii obviously not
being the area of origin of the species).
Figure 3: Estimates of pairwise FST for the population pairs of the D. suzukii data set. For ease of reading,
plots have been arranged according to the level of differentiation into small (<0.05; left panel, "Fst<=5%")
and moderate to high levels of differentiation (>0.05; right panel, "Fst>5%"). In the leftmost plot, the blue
dotted line represents the 0.01 FST threshold (i.e., very weak levels of differentiation).
Figure 4: Heatmap representation of the matrix of pairwise-population FST of 14 D. suzukii populations.
[Row and column order: US-Haw, JP-Tok, JP-Sap, CN-Shi, CN-Lia, CN-Bei, CN-Nin, US-Sok, US-Wat, US-Sdi,
BR-Pal, US-Nca, US-Wis, US-Col.]
5 Insights from f3-based admixture tests
Six of the 14 populations showed at least one significantly negative f3 or f3* statistic (the two statistics
being almost identical) at the 95% significance threshold (i.e., with an associated Z-score < -1.65). Table 3
summarizes, for each of these six target populations, the number of significantly negative f3 (and f3*)
together with the triplet displaying the lowest Z-score (out of the (13×12)/2 = 78 tests per population),
as obtained with the following script:
f3.signif=dsu.fstats@f3.values$`Z-score`< -1.65
f3s.signif=dsu.fstats@f3star.values$`Z-score`< -1.65
nf3.signif.per.pop=table(dsu.fstats@comparisons$F3[f3.signif,1])
nf3s.signif.per.pop=table(dsu.fstats@comparisons$F3star[f3s.signif,1])
##custom function to retrieve the triplet with lowest Z-score for a given target pop.
tmp.f<-function(x,stat="F3"){
if(stat=="F3"){
x.f3=dsu.fstats@f3.values[dsu.fstats@comparisons$F3[,1]==x,]
}else{
x.f3=dsu.fstats@f3star.values[dsu.fstats@comparisons$F3star[,1]==x,]
}
idx=which.min(x.f3$`Z-score`)
return(paste0(rownames(x.f3)[idx]," (Z=",
round(x.f3$`Z-score`[idx],2),")"))
}
###
pop.signif=unique(c(names(nf3.signif.per.pop),names(nf3s.signif.per.pop)))
f3.mostsignif=sapply(pop.signif,tmp.f)
f3s.mostsignif=sapply(pop.signif,tmp.f,stat="F3s")
Figure 5: Comparison of the pairwise-population FST estimates computed by compute.pairwiseFST
(method="Anova"; x-axis) and compute.fstats (equivalent to method="Identity" in compute.pairwiseFST; y-axis).
Note that for the latter, the estimates were taken as the block-jackknife means.
Pop.    Origin         #f3<0  #f3*<0  Most sign. triplet (f3)           Most sign. triplet (f3*)
CN-Lia  Native         1      1       CN-Lia;CN-Shi,JP-Sap (Z=-1.66)    CN-Lia;CN-Shi,JP-Sap (Z=-1.65)
JP-Tok  Native         11     11      JP-Tok;CN-Nin,JP-Sap (Z=-7.11)    JP-Tok;CN-Nin,JP-Sap (Z=-7.09)
US-Col  Invasive (AM)  2      2       US-Col;BR-Pal,US-Wis (Z=-3.31)    US-Col;BR-Pal,US-Wis (Z=-3.32)
US-Nca  Invasive (AM)  6      6       US-Nca;JP-Sap,US-Col (Z=-3.89)    US-Nca;JP-Sap,US-Col (Z=-3.9)
US-Wat  Invasive (AM)  13     13      US-Wat;US-Sdi,US-Sok (Z=-23.64)   US-Wat;US-Sdi,US-Sok (Z=-23.81)
US-Wis  Invasive (AM)  4      4       US-Wis;JP-Sap,US-Col (Z=-5.02)    US-Wis;JP-Sap,US-Col (Z=-5.05)
Table 3: Summary of the f3 and f3* based tests for all the populations displaying at least one significant test
at the 95% threshold (Z< -1.65)
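The count of 78 tests per target population follows from choosing 2 source proxies among the 13 other populations. The base-R sketch below enumerates all target;source1,source2 configurations for the 14 population names. It is only an illustration of the combinatorics and does not reproduce how poolfstat stores its comparisons; the object names are hypothetical.

```r
pops <- c("BR-Pal", "CN-Bei", "CN-Lia", "CN-Nin", "CN-Shi", "JP-Sap", "JP-Tok",
          "US-Col", "US-Haw", "US-Nca", "US-Sdi", "US-Sok", "US-Wat", "US-Wis")

# number of source pairs available for any given target
n.per.target <- choose(length(pops) - 1, 2)

# enumerate every configuration: one target and an unordered pair of sources
triplets <- do.call(rbind, lapply(pops, function(tg) {
  src <- t(combn(setdiff(pops, tg), 2))
  data.frame(target = tg, source1 = src[, 1], source2 = src[, 2])
}))

n.per.target                          # tests per target population
nrow(triplets)                        # total number of three-population tests
table(triplets$target)[["JP-Tok"]]    # tests with JP-Tok as target
```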
Figure 6 plots the lowest f3 (including the significantly negative ones, highlighted in red) for the six
populations displaying at least one significant three-population test.
layout(matrix(1:6,3,2))
for(i in pop.signif){
plot(dsu.fstats,stat.name="F3",main=i,cex.main=1.5,value.range=c(NA,1e-3),
pop.f3.target=i,ci.perc=90)
#ci.perc was here set to 90% to obtain the same significant
#tests (highlighted in red) as with Z<-1.65 (since the test is one-sided)
}
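The comment in the loop above relies on the equivalence between a one-sided test at the 5% level and a two-sided 90% interval. A small sketch with made-up estimates and standard errors (est and se are hypothetical values) shows that both rules flag the same tests under a normal approximation:

```r
# one-sided 5% critical value of the standard normal distribution
z.thr <- qnorm(0.05)
z.thr

# hypothetical f3 estimates and standard errors (toy values)
est <- c(-4e-4, -1e-4, 2e-4)
se  <- c( 1e-4,  1e-4, 1e-4)

# rule 1: Z-score below the one-sided threshold
z <- est / se
signif.z <- z < z.thr

# rule 2: upper bound of the two-sided 90% CI below zero
upper90 <- est + qnorm(0.95) * se
signif.ci <- upper90 < 0

identical(signif.z, signif.ci)
```

The threshold of -1.65 used in this document is the rounded value of qnorm(0.05).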
Figure 6: Estimates of the lowest f3 statistics with their 95% Confidence Intervals for the 6 populations
displaying at least one significant three-population test at the 95% threshold (shown in red). [Six panels,
one per target population: CN-Lia, JP-Tok, US-Col, US-Nca, US-Wat and US-Wis; each row of a panel is a
target;source1,source2 triplet.]
In the native area, JP-Tok showed clear evidence of admixture, with 11 significant tests that all involved
JP-Sap as a source proxy. The three lowest f3 were obtained with three Chinese populations (CN-Nin,
CN-Bei and CN-Shi, in increasing order of f3) as the other source proxy. Assuming an admixture-graph-like
history, this suggests that the two populations branching closest to the two sources of JP-Tok were JP-Sap
and CN-Nin. The remaining Chinese population, CN-Lia, showed only weak evidence of admixture, with a single
test barely significant at the 95% threshold, for the triplet involving CN-Shi and JP-Sap as source proxies.
Out of the 7 populations from the American continent (excluding Hawaii), 4 showed evidence of admixture,
namely US-Col, US-Wis, US-Nca and US-Wat. The strongest evidence (13 significant tests) was found for
US-Wat, which has up to now been considered the closest to the first invader of continental America
based on historical records (Fraimout et al. 2017). Moreover, the three signals with the lowest Z-scores
all involved two source populations originating from the (continental) American invasive area. As their
underlying f3 CIs did not overlap with those of the other tests, these three pairs of populations may be
considered the closest (among the sampled populations) to the original US-Wat source populations. Hence, the
three most significant signals all involved US-Sok (Northern US) as a source proxy, while the other source
proxies were, in order of increasing Z-score (i.e., decreasing evidence, as Z-scores are negative here), BR-Pal,
US-Sdi and US-Col, respectively. Challenging the initial view, this suggests that US-Wat derived from
an admixture between two populations already established in continental America, one Northern (here
represented by US-Sok) and the other Southern (here represented by BR-Pal, US-Sdi and, to a lesser extent,
US-Col, which may actually derive from US-Sdi considering historical records), as will be investigated below
via admixture graph construction. For this Southern "ghost" population, we may further speculate that
BR-Pal, although geographically distant, could be the closest proxy as a result of a rapid spread of
D. suzukii in Southern America, although the absence of additional samples from South America makes this
hypothesis difficult to test. The three other American populations with at least one significantly negative
f3 (US-Col, US-Wis and US-Nca) displayed only a moderate number of significant tests (compared to other
populations) and were all very closely related (section 4). For each of these, the significantly negative f3
had highly overlapping CIs and always involved one or both of the other two populations (see footnote 3).
This suggests complex patterns of recurrent admixture events among these three populations (consistent with
their low level of differentiation and close geographic origins).
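The logic of the three-population test used throughout this section can be illustrated with a toy simulation. The sketch below is a simplification under stated assumptions: it works on population allele frequencies (no sampling noise, hence no bias correction), models drift as Gaussian noise, and ignores drift in the target after admixture. All names and parameter values are hypothetical, and this naive average is not the estimator implemented in poolfstat.

```r
set.seed(42)
n.snp <- 20000
p.anc <- runif(n.snp, 0.3, 0.7)     # ancestral allele frequencies (toy values)

# crude drift model: add Gaussian noise to the parental frequencies
drift <- function(p, sd = 0.05) p + rnorm(length(p), 0, sd)

pA <- drift(p.anc)                  # source proxy A
pB <- drift(p.anc)                  # source proxy B
pC.admixed <- 0.4 * pA + 0.6 * pB   # target formed by admixture of A and B
pC.tree <- drift(p.anc)             # target that simply diverged (no admixture)

# naive f3(target; source1, source2): mean product of frequency differences
f3 <- function(pt, p1, p2) mean((pt - p1) * (pt - p2))

f3.admixed <- f3(pC.admixed, pA, pB)
f3.tree <- f3(pC.tree, pA, pB)
c(admixed = f3.admixed, tree = f3.tree)
```

In this setting f3 is negative for the admixed target and positive for the tree-like one. With real data, drift specific to the target after admixture adds a positive term, so a non-negative f3 does not rule out admixture.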
6 Exploring invasion scenarios with admixture graph construction
To provide further insights into the relationships among the surveyed populations and the probable scenarios
of invasion of the species in the American and European areas, we relied on admixture graph construction.
The purpose here was not to build a comprehensive admixture graph, which may be elusive given the
close relationships among the populations and the pervasiveness of recent admixture events among them,
but rather to identify key regional events that happened in the early stages of the invasion. Importantly,
we paid critical attention to the support of the different graphs to assess the validity of the proposed
scenarios, making extensive use of the compare.fitted.fstats function.
6.1 Scaffold trees built (naively) from scratch are unreliable
We first evaluated a naive approach consisting of building a scaffold tree of presumably unadmixed populations
using the find.tree.popset and rooted.njtree.builder functions. Although two candidate sets of 5 populations
could be identified, the bifurcating trees obtained from them provided a poor fit to the data, making them
clearly unreliable for further admixture graph construction. More precisely, we inferred for each of these
two sets the best rooted bifurcating tree using the rooted.njtree.builder function and evaluated its fit with
the compare.fitted.fstats function (note that the fitted f-statistics are invariant to root positioning for such
bifurcating trees). In addition, we also used the branch length estimates of the best rooted tree obtained with
the fit.graph function (instead of those relying on the neighbor-joining algorithm used by rooted.njtree.builder),
but this led to similar conclusions.
[Footnote 3] More precisely, the four significant tests with US-Wis as a target all involved US-Col as a source
proxy and one also involved US-Nca (although with the lowest f3); out of the six significant tests with US-Nca
as a target, three involved US-Col as a source proxy and the three others involved US-Wis; and the two
significant tests with US-Col as a target involved US-Wis as a source proxy.
#Identification of candidate (unadmixed) scaffold populations from scratch
scaf.pop=find.tree.popset(dsu.fstats)
Number of sets: 24 of Npops= 4 each
Number of sets: 2 of Npops= 5 each
scaf.pop$pop.sets
[,1] [,2] [,3] [,4] [,5]
PopSet1 "BR-Pal" "CN-Bei" "CN-Shi" "JP-Sap" "US-Sok"
PopSet2 "CN-Bei" "CN-Shi" "JP-Sap" "US-Sdi" "US-Sok"
#construction of scaffold trees for each set
scaf.pop.trees=list()
for(i in 1:scaf.pop$n.sets){
scaf.pop.trees[[i]]=list()
tmp.njtree<-rooted.njtree.builder(pop.sel = scaf.pop$pop.sets[i,],
fstats = dsu.fstats,verbose=FALSE)
scaf.pop.trees[[i]][["njtree"]]<-tmp.njtree$best.rooted.tree
tmp.graph=generate.graph.params(tmp.njtree$best.rooted.tree@graph,dsu.fstats,verbose=F)
scaf.pop.trees[[i]][["fitgraph"]]<-fit.graph(tmp.graph,verbose=FALSE)
rm(tmp.njtree,tmp.graph)
}
#Evaluation of scaffold trees using compare.fitted.fstats
for(i in 1:scaf.pop$n.sets){
cat("\nScaffold set:",i,"\n")
cat("\tRooted NJ tree\n")
tmp.comp<-compare.fitted.fstats(dsu.fstats,
fitted.graph = scaf.pop.trees[[i]][["njtree"]],
n.worst.stats = 1)
cat("\tRooted fit graph tree\n")
tmp.comp<-compare.fitted.fstats(dsu.fstats,
fitted.graph = scaf.pop.trees[[i]][["fitgraph"]],
n.worst.stats = 1)
}
Scaffold set: 1
Rooted NJ tree
1 Worst fit for:
Estimated Fitted Z–score
BR-Pal,CN-Shi;JP-Sap,US-Sok -0.005313611 -1.734723e-18 31.46101
Rooted fit graph tree
1 Worst fit for:
Estimated Fitted Z–score
BR-Pal,CN-Shi;JP-Sap,US-Sok -0.005313611 0 31.46101
Scaffold set: 2
Rooted NJ tree
1 Worst fit for:
Estimated Fitted Z–score
CN-Shi,US-Sdi;JP-Sap,US-Sok 0.005093941 0.002936964 -11.15586
Rooted fit graph tree
1 Worst fit for:
Estimated Fitted Z–score
CN-Shi,JP-Sap;US-Sdi,US-Sok 0.00433532 0 -6.78807
Hence, the bifurcating scaffold trees inferred for the first and second sets of candidate scaffold
populations led to fitted f-statistics departing from the estimated f-statistics by up to 31.5 and 11.2
standard errors, respectively (as measured by the printed Z-score), when considering neighbor-joining
estimates of branch lengths (and by up to 31.5 and 6.8 standard errors with the fit.graph estimates).
It should be noted that for both sets, the resulting NJ tree led to quadruplets that violate the treeness
test, i.e., to quadruplet configurations that disagree with the one retained by the
find.tree.popset function. For instance, in the first set of candidate populations, the obtained NJ tree topology
resulted in the BR-Pal,CN-Shi;JP-Sap,US-Sok quadruplet, which did not pass the f4-based treeness test, whereas
the quadruplet configuration passing the treeness test for these four populations is BR-Pal,JP-Sap;CN-Shi,US-Sok:
tmp.sel=apply(dsu.fstats@comparisons$F4,1,
ff<-function(x){sum(x %in% c("BR-Pal","CN-Shi","JP-Sap","US-Sok"))==4})
dsu.fstats@f4.values[tmp.sel,]
Estimate bjack mean bjack s.e. Z-score
BR-Pal,CN-Shi;JP-Sap,US-Sok -0.0051982044 -0.0053136113 0.0001688951 -31.461012
BR-Pal,JP-Sap;CN-Shi,US-Sok 0.0005166386 0.0002464223 0.0009666008 0.254937
BR-Pal,US-Sok;CN-Shi,JP-Sap 0.0057148430 0.0055600336 0.0008822927 6.301802
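The numbers printed above can be related to each other with simple arithmetic. The sketch below copies the values for the BR-Pal,CN-Shi;JP-Sap,US-Sok quadruplet from the output and recomputes the two Z-scores. The formulas (block-jackknife mean divided by its s.e. for the f4 Z-score, and fitted minus estimated value divided by the s.e. for the fit Z-score) are inferred from the printed numbers rather than taken from the package documentation, so they should be read as an assumption.

```r
# values copied from the printed f4 table (first quadruplet)
est.bjack <- -0.0053136113   # block-jackknife mean of the f4 estimate
se.bjack  <-  0.0001688951   # block-jackknife standard error
fitted    <-  0              # f4 value implied by the fitted scaffold tree

# Z-score of the estimate itself (treeness test: is f4 different from 0?)
z.est <- est.bjack / se.bjack

# Z-score of the departure between fitted and estimated values
z.fit <- (fitted - est.bjack) / se.bjack

round(c(z.est = z.est, z.fit = z.fit), 3)
```

Both values have the same magnitude here because the tree predicts f4 = 0 for this configuration, so the misfit of the tree and the failure of the treeness test are two views of the same signal.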
6.2 Investigating the relationships among the populations from the native area (and the Hawaiian population)
We thus adopted an alternative and more historically informed strategy to build the starting scaffold graph,
relying on the known Asian origin of the species (which may safely be assumed to be correct). We first
sought to describe the historical relationships among the 6 populations from the native area together with the
Hawaiian population, which was first observed in 1980, i.e., about 30 years (ca. 300 generations for
D. suzukii) earlier than the other populations from the invasive areas (Fraimout et al. 2017).
6.2.1 Building a scaffold tree of native and the Hawaiian populations
We thus used the same approach as in the above section 6.1 to build a scaffold tree but focusing only on the
six native and the Hawaiian populations:
#Identification of candidate (unadmixed) scaffold populations
pop.am.continent=c("BR-Pal","US-Col","US-Nca","US-Sdi","US-Sok","US-Wat","US-Wis")
scaf.pop.nat=find.tree.popset(dsu.fstats,excluded.pops = pop.am.continent)
Number of sets: 2 of Npops= 4 each
scaf.pop.nat$pop.sets
P1 P2 P3 P4
PopSet1 "CN-Bei" "CN-Nin" "CN-Shi" "US-Haw"
PopSet2 "CN-Bei" "CN-Shi" "JP-Sap" "US-Haw"
#f4 associated Z-score range
#(here the range min and max are the same since only four populations in the sets)
scaf.pop.nat$Z_f4.range
Min. |Zscore| Max. |Zscore|
PopSet1 1.086422 1.086422
PopSet2 1.060292 1.060292
Here, two sets of 4 candidate populations could be proposed. We chose to focus on PopSet 2
(CN-Bei,CN-Shi ; JP-Sap,US-Haw) because it displayed the smallest f4-associated Z-score (|Z|=1.06) and
CN-Bei was more distantly related to CN-Shi than CN-Nin from both a geographical (Figure 1) and a genetic
point of view (Figure 3). As this selected set only contained 4 populations, the (best) unrooted topology was
directly available from the quadruplet passing the f4 treeness test, i.e. ((CN-Bei,CN-Shi),(JP-Sap,US-Haw)).
Rooting the tree based on the heterozygosities of leaf populations (as implemented in the rooted.njtree.builder
function) would actually be quite sensitive to long-branch attraction by the highly diverged US-Haw (Figure
3), resulting in the positioning of US-Haw as the outgroup. Yet, based on historical knowledge of the introduced
status of this population, we chose here to set the root of the tree on the branch connecting the common ancestor
of the Chinese populations (CN-Bei and CN-Shi) and the common ancestor of JP-Sap and US-Haw. Such a
positioning of the root is consistent with a (recent) Japanese origin of the Hawaiian population (Fraimout
et al. 2017). We finally obtained the scaffold tree represented in Figure 7a, fitted as follows:
#define (manually) the rooted scaffold tree
scaf.tree=rbind(c("CN","R",""),
c("JP","R",""),
c("CN-Bei","CN",""),
c("CN-Shi","CN",""),
c("JP-Sap","JP",""),
c("US-Haw","JP",""))
#fit the scaffold tree (with edges lengths drift units and 95% CI estimated)
scaf.tree.params<-generate.graph.params(scaf.tree,fstats = dsu.fstats)
Total Number of Parameters: 5 (5 edges lengths + 0 adm. coeff.)
Total Number of Statistics: 6 (3 F2 and 3 F3)
scaf.tree.fit<-fit.graph(scaf.tree.params,drift.scaling = TRUE,compute.ci = TRUE)
Estimation started (direct algebraic solution)
Estimation ended in 0 m 0 s
Final Score: 1.124219
BIC: 97.95432
#plot the tree
plot(scaf.tree.fit)
#CI of edges lengths
scaf.tree.fit@edges.length.ci
95% Inf. 95% Sup. 95% Inf. (drift scaled) 95% Sup. (drift scaled)
R<->CN 0.0025699799 0.003649279 0.022374993 0.031771684
R<->JP 0.0025699799 0.003649279 0.022374993 0.031771684
CN<->CN-Bei 0.0009098188 0.001161736 0.006498229 0.008297507
CN<->CN-Shi 0.0038994202 0.005559202 0.027850956 0.039705673
JP<->JP-Sap 0.0052402047 0.007662313 0.034940631 0.051090761
JP<->US-Haw 0.0365762281 0.039862096 0.243882932 0.265792437
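The f4-based treeness test used above to select the scaffold quadruplet can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: population allele frequencies with Gaussian drift, independent SNPs, and a naive standard error instead of a block-jackknife. Names and parameter values are hypothetical and the code is not the estimator implemented in poolfstat.

```r
set.seed(7)
n.snp <- 20000
drift <- function(p, sd = 0.05) p + rnorm(length(p), 0, sd)

# simulate the tree ((A,B),(C,D)) from root allele frequencies
p.root <- runif(n.snp, 0.3, 0.7)
p.X <- drift(p.root)                 # ancestor of A and B
p.Y <- drift(p.root)                 # ancestor of C and D
pA <- drift(p.X); pB <- drift(p.X)
pC <- drift(p.Y); pD <- drift(p.Y)

# naive f4(P1,P2;P3,P4) with a Z-score assuming independent SNPs
f4 <- function(p1, p2, p3, p4) {
  x <- (p1 - p2) * (p3 - p4)
  c(estimate = mean(x), z = mean(x) / (sd(x) / sqrt(length(x))))
}

f4.AB.CD <- f4(pA, pB, pC, pD)   # configuration matching the true tree
f4.AC.BD <- f4(pA, pC, pB, pD)   # configuration contradicting the tree
rbind(f4.AB.CD, f4.AC.BD)
```

Only the configuration that pairs sister populations yields an f4 compatible with zero. The other one captures the drift on the internal branch, which is what makes f4 informative about topology.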
6.2.2 Positioning each remaining native population separately on the scaffold tree
We then positioned the three other native populations (CN-Nin, CN-Lia and JP-Tok) on the scaffold tree as
follows:
for(p in c("CN-Nin","CN-Lia","JP-Tok")){
add.p=add.leaf(scaf.tree.fit,leaf.to.add = p,dsu.fstats,verbose=FALSE,
drift.scaling=TRUE,compute.ci=TRUE)
add.p.delta.bic=format(sort(add.p$bic-min(add.p$bic))[2],digits = 3)
add.p.comp.fit=compare.fitted.fstats(dsu.fstats,add.p$best.fitted.graph,
n.worst.stats = 0)
add.p.worst.fit=add.p.comp.fit[which.max(abs(add.p.comp.fit$`Z--score`)),]
plot(add.p$best.fitted.graph)
}
As shown in Figure 7, all three populations could be unambiguously placed onto the scaffold tree as
admixed populations, with a good fit of each resulting graph. Indeed, for each set of five populations (the four
scaffold populations and the added one), the BIC of the best fitting graph was always far lower than that of all
the other possible graphs (∆BIC > 10). Note that for JP-Tok, ∆BIC = 0 in Figure 7d because the graph
with the second lowest BIC corresponds to the positioning of the admixed source S1-JP-Tok on the other
alternative branch, which is strictly equivalent to the graph displayed (as the two branches coming from R are
not identifiable). These two equivalent graphs are highly supported, since ∆BIC = 20.4 between either of
them and the graph with the third lowest BIC.
Figure 7: Inference of admixture graphs connecting the populations from the native area (extended to
Hawaii). The scaffold tree is displayed in (a), together with the best inferred admixture graphs obtained by
the (independent) positioning of the three other populations from the native area, CN-Nin (b), CN-Lia
(c) and JP-Tok (d), as obtained with the add.leaf function. For each of the three latter graphs, the worst
fitted f-statistic and its associated Z-score are given, and the difference in BIC between the graph and the
graph displaying the second lowest BIC is provided (∆BIC) as a measure of support. For all graphs, the fitted
edge lengths are in drift units (x1000) since the drift.scaling argument was set to TRUE.
[Panel annotations: (b) worst fit CN-Bei,CN-Shi;JP-Sap,US-Haw (Z=1.06), ∆BIC=12.1; (c) worst fit
CN-Bei,CN-Shi;JP-Sap,US-Haw (Z=1.06), ∆BIC=28.2; (d) worst fit CN-Bei,CN-Shi;JP-Sap,US-Haw (Z=1.06), ∆BIC=0.]
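The ∆BIC support measure discussed above can be sketched with a few lines of base R. The BIC values below are hypothetical (chosen so that two graphs tie, as in the JP-Tok case) and the graph names are placeholders; only the sort-and-subtract logic mirrors the add.leaf loop shown earlier.

```r
# hypothetical BIC values for four candidate graphs (toy numbers)
bic <- c(graphA = 310.3, graphB = 289.9, graphC = 289.9, graphD = 325.0)

# BIC differences relative to the best graph, in increasing order
delta <- sort(bic - min(bic))
delta

# support of the best graph against the runner-up
delta[[2]]   # 0 here: the two best graphs are equivalent

# support against the next distinct alternative
delta[[3]]
```

When the runner-up is merely an equivalent parameterization of the best graph, the difference with the third graph is the more informative quantity.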
6.2.3 Searching for the most comprehensive admixture graph relating all native populations
We next relied on the graph.builder function to try to jointly include all three remaining populations
(CN-Lia, CN-Nin and JP-Tok) on the graph. As a first step, we considered simply adding two of these
populations simultaneously, leading to 6 different runs of the function (3 pairs of populations and 2 orders of
inclusion for each pair). Importantly, we carefully evaluated the fit of each of the best fitting graphs
identified using the compare.fitted.fstats function:
#Creating the six ordered pairs of leaves to add with graph.builder
#note: the %>% pipe used below is not part of base R; it requires a package
#such as magrittr (or dplyr) to be attached
#a custom function to remove sets (generated by expand.grid) with non-unique populations
cleangrid<-function(x){n=ncol(x);x[apply(x,1,f<-function(y){length(unique(y))==n}),n:1]}
pops.to.include<-expand.grid(c("CN-Nin","CN-Lia","JP-Tok"),
c("CN-Nin","CN-Lia","JP-Tok")) %>% as.matrix() %>% cleangrid()
#exploring and evaluating the graphs
for(i in 1:nrow(pops.to.include)){
cat("##########################################\n")
cat("Inclusion Order:",pops.to.include[i,],"\n")
tmp.graph<-graph.builder(scaf.tree.fit,leaves.to.add = pops.to.include[i,],
fstats=dsu.fstats,verbose=F)
#number of graphs within Dbic<6 (default)
cat("\tNumber Of Graphs:",tmp.graph$n.graphs,"\n")
#min BIC
cat("\tMin BIC:",min(tmp.graph$bic),"\n")
#worst fitted fstats (best fitting graph)
cat("\tWorst fstats fit:\n")
tmp.comp.fit=compare.fitted.fstats(dsu.fstats,tmp.graph$best.fitted.graph,
n.worst.stats = 0)
print(tmp.comp.fit[which.max(abs(tmp.comp.fit$`Z--score`)),])
}
##########################################
Inclusion Order: CN-Nin CN-Lia
Number Of Graphs: 2
Min BIC: 332.6374
Worst fstats fit:
Estimated Fitted Z–score
CN-Bei,CN-Lia;JP-Sap,US-Haw -0.0002110636 0 4.995489
##########################################
Inclusion Order: CN-Nin JP-Tok
Number Of Graphs: 2
Min BIC: 308.1103
Worst fstats fit:
Estimated Fitted Z–score
CN-Shi,US-Haw;JP-Sap,JP-Tok -0.0005025053 -0.00067728 -1.530277
##########################################
Inclusion Order: CN-Lia CN-Nin
Number Of Graphs: 2
Min BIC: 316.4586
Worst fstats fit:
Estimated Fitted Z–score
CN-Bei,CN-Nin;JP-Sap,US-Haw -0.0001769363 3.469447e-18 3.62954
##########################################
Inclusion Order: CN-Lia JP-Tok
Number Of Graphs: 6
Min BIC: 299.4749
Worst fstats fit:
Estimated Fitted Z–score
CN-Bei,CN-Shi;JP-Sap,US-Haw -5.079243e-05 -3.469447e-18 1.060292
##########################################
Inclusion Order: JP-Tok CN-Nin
Number Of Graphs: 2
Min BIC: 368.2099
Worst fstats fit:
Estimated Fitted Z–score
CN-Bei,US-Haw;CN-Shi,JP-Tok 0.005775447 6.938894e-18 -7.207447
##########################################
Inclusion Order: JP-Tok CN-Lia
Number Of Graphs: 6
Min BIC: 299.4749
Worst fstats fit:
Estimated Fitted Z–score
CN-Bei,CN-Shi;JP-Sap,US-Haw -5.079243e-05 0 1.060292
Note the benefits of exploring all the possible orders of inclusion with graph.builder for a given set of
populations. For instance, the best fitted graph obtained when including JP-Tok and CN-Nin displayed a
much lower BIC when CN-Nin was included first (308.1 versus 368.2). Conversely, in some cases, the same
best fitting graph may be obtained with the different possible inclusion orders (e.g., see the results obtained
with the CN-Lia and JP-Tok pair).
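The six inclusion orders explored above are simply the ordered pairs of the three remaining populations. As a base-R alternative to the expand.grid-based helper (a sketch with hypothetical object names, requiring no additional package), they can be listed as follows:

```r
pops.to.add <- c("CN-Nin", "CN-Lia", "JP-Tok")

# the three unordered pairs, one per row
pairs <- t(combn(pops.to.add, 2))

# each pair in both inclusion orders
ordered.pairs <- rbind(pairs, pairs[, 2:1])
ordered.pairs

# number of runs needed for pairs, and for all three populations at once
nrow(ordered.pairs)
factorial(length(pops.to.add))
```

Each row of ordered.pairs has the form expected for a vector of leaves to add, with the first column giving the population included first.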
Interestingly, this exploration suggests that CN-Lia, CN-Nin and JP-Tok cannot be jointly added to
the scaffold tree to obtain a comprehensive admixture graph for the seven native populations. Indeed,
although CN-Nin and CN-Lia are closely related, the graphs obtained when adding both of them
to the scaffold tree displayed poorly fitted f-statistics (some with associated |Z|>3.95). This was confirmed
when trying to add the three populations simultaneously (exploring all 6 possible orders of inclusion)
with graph.builder, as shown below:
###a custom function to list all possible orders of inclusion (permutations)
#adapted from https://www.r-bloggers.com/2019/06/learning-r-permutations-and-combinations-with-base-r/
permutations <- function(x) {
n<- length(x)
if (n == 1){out<-x}else{
out <- NULL
for (i in 1:n){out <- rbind(out, cbind(x[i], permutations(x[-i])))}
}
return(out)
}
##
pops.to.include<-permutations(c("CN-Nin","CN-Lia","JP-Tok"))
#exploring and evaluating the graphs
for(i in 1:nrow(pops.to.include)){
cat("##########################################\n")
cat("Inclusion Order:",pops.to.include[i,],"\n")
tmp.graph<-graph.builder(scaf.tree.fit,leaves.to.add = pops.to.include[i,],
fstats=dsu.fstats,verbose=F)
#number of graphs within Dbic<6 (default)
cat("\tNumber Of Graphs:",tmp.graph$n.graphs,"\n")
#min BIC
cat("\tMin BIC:",min(tmp.graph$bic),"\n")
#worst fitted fstats (best fitting graph)
cat("\tWorst fstats fit:\n")
tmp.comp.fit=compare.fitted.fstats(dsu.fstats,tmp.graph$best.fitted.graph,
n.worst.stats = 0)
print(tmp.comp.fit[which.max(abs(tmp.comp.fit$`Z--score`)),])
}
##########################################
Inclusion Order: CN-Nin CN-Lia JP-Tok
Number Of Graphs: 2
Min BIC: 480.5329
Worst fstats fit:
Estimated Fitted Z–score
CN-Bei,CN-Lia;JP-Sap,US-Haw -0.0002110636 0 4.995489
##########################################
Inclusion Order: CN-Nin JP-Tok CN-Lia
Number Of Graphs: 5
Min BIC: 480.5329
Worst fstats fit:
Estimated Fitted Z–score
CN-Bei,CN-Lia;JP-Sap,US-Haw -0.0002110636 -3.469447e-18 4.995489
##########################################
Inclusion Order: CN-Lia CN-Nin JP-Tok
Number Of Graphs: 2
Min BIC: 463.636
Worst fstats fit:
Estimated Fitted Z–score
CN-Lia,JP-Tok;CN-Nin,CN-Shi -0.0007377375 -0.0004944342 3.733745
##########################################
Inclusion Order: CN-Lia JP-Tok CN-Nin
Number Of Graphs: 11
Min BIC: 513.4301
Worst fstats fit:
Estimated Fitted Z–score
CN-Lia,US-Haw;CN-Shi,JP-Tok 0.005572856 0.000100214 -7.206377
##########################################
Inclusion Order: JP-Tok CN-Nin CN-Lia
Number Of Graphs: 8
Min BIC: 518.0997
Worst fstats fit:
Estimated Fitted Z–score
CN-Lia,JP-Tok;CN-Nin,US-Haw 0.004911067 -0.000220809 -7.205941
##########################################
Inclusion Order: JP-Tok CN-Lia CN-Nin
Number Of Graphs: 8
Min BIC: 518.0997
Worst fstats fit:
Estimated Fitted Z–score
CN-Lia,JP-Tok;CN-Nin,US-Haw 0.004911067 -0.000220809 -7.205941
Overall, only six out of the seven populations from the native area (extended to Hawaii) could be related
by an admixture graph with good fitting properties, i.e., by adding to the scaffold tree (consisting of
US-Haw, JP-Sap, CN-Bei and CN-Shi) the population JP-Tok and either CN-Lia or CN-Nin. Note that the
two resulting best fitting graphs cannot be compared based on the BIC criterion, as the two sets of populations
differ. A closer inspection of the different graphs obtained (6 within ∆BIC < 6 of the best fitting graph for
the CN-Lia, JP-Tok pair and only one for the CN-Nin, JP-Tok pair) led us to choose as the most
relevant graph the one displayed in Figure 8. Indeed, this graph had the lowest absolute
Z-score for the worst fitted f-statistic (which was actually the same as the one for the scaffold tree) and it
was also quite consistent with the graphs obtained when adding each population individually to the scaffold
tree (Figures 7c and 7d).
natgraph<-graph.builder(scaf.tree.fit,leaves.to.add = c("CN-Lia","JP-Tok"),dsu.fstats,
verbose=F,drift.scaling=T,compute.ci=TRUE)
plot(natgraph$best.fitted.graph)
Figure 8: Best fitting admixture graph connecting six out of the seven populations representative of the native
area (extended to Hawaii). The fitted edge lengths are in drift units (x1000) since the drift.scaling argument
was set to TRUE. [Graph leaves: JP-Tok, JP-Sap, CN-Lia, CN-Shi, CN-Bei and US-Haw.]
The difficulties encountered in jointly adding CN-Lia and CN-Nin may be related to the fact that they are only
very slightly differentiated (Figures 3 and 4), making it difficult to summarize their joint history with simple
two-way admixture graphs (likely due to a high amount of gene flow between them). It should also be noted
that the CN-Nin sample may be slightly heterogeneous, as it mixes individuals from very close locations but
sampled at two different times (Table 1).
6.2.4 Estimates of the parameters of the admixture graph for the native and Hawaiian populations
and their confidence intervals
The CIs for branch lengths (in both unscaled and drift-scaled units) of the admixture graph connecting the
6 populations (Figure 8), together with those of the two admixture proportions, are directly available from the
natgraph object (as it was generated with options drift.scaling=TRUE and compute.ci=TRUE):
#branch lengths
natgraph$best.fitted.graph@edges.length.ci
95% Inf. 95% Sup. 95% Inf. (drift scaled) 95% Sup. (drift scaled)
S-JP-Tok<->JP-Tok 6.315686e-04 0.0007827838 0.0044362902 0.005498462
S2-CN-Lia<->S1-JP-Tok 8.258283e-04 0.0018089613 0.0058431246 0.012799255
R<->S2-JP-Tok 4.876279e-04 0.0009225970 0.0043358045 0.008203386
S1-JP-Tok<->JP-Sap 3.915283e-04 0.0005084688 0.0027874828 0.003620039
S2-JP-Tok<->CN 4.100608e-03 0.0054398001 0.0287396430 0.038125541
S-CN-Lia<->CN-Lia 2.491030e-04 0.0003931522 0.0017960161 0.002834601
CN<->S1-CN-Lia 7.342082e-05 0.0002078678 0.0005283615 0.001495888
JP<->S2-CN-Lia 4.088231e-03 0.0052797637 0.0272544120 0.035197830
S1-CN-Lia<->CN-Shi 3.821499e-03 0.0051975302 0.0273294661 0.037170156
R<->JP 4.876279e-04 0.0009225970 0.0043358045 0.008203386
CN<->CN-Bei 9.842052e-04 0.0011107899 0.0070826785 0.007993626
JP<->US-Haw 3.675243e-02 0.0397726523 0.2450120628 0.265146533
#admixture proportions
#The names were automatically given by graph.builder as A (and 1-A) for ancestry proportion
#contributing to the first (CN-Lia) and B for ancestry proportion contributing to
#the second (JP-Tok) included populations
natgraph$best.fitted.graph@admix.prop.ci
95% Inf. 95% Sup.
B 0.6312323 0.6576909
A 0.9566978 0.9633939
Alternatively, given the topology of the graph, it is also possible to estimate the proportion of Chinese
ancestry in the CN-Lia population using ratios of f4 estimates as:
#Prop of Chinese ancestry in CN-Lia (A parameter defined above)
#The ratio F4(JP-Sap,CN-Lia;CN-Bei,US-Haw)/F4(JP-Sap,CN-Shi;CN-Bei,US-Haw)
#estimates A as can be shown with symbolic computations of F4
natgraph.par<-generate.graph.params(natgraph$best.fitted.graph@graph,verbose=FALSE)
natgraph.symbolicfstats<-graph.params2symbolic.fstats(natgraph.par)
natgraph.symbolicfstats$F4.equations[
grep("F4\\(JP-Sap,CN-Shi;CN-Bei,US-Haw\\)",natgraph.symbolicfstats$F4.equations)]
[1] "F4(JP-Sap,CN-Shi;CN-Bei,US-Haw) = -(S2-JP-Tok<->CN+R<->S2-JP-Tok+R<->JP)"
natgraph.symbolicfstats$F4.equations[
grep("F4\\(JP-Sap,CN-Lia;CN-Bei,US-Haw\\)",natgraph.symbolicfstats$F4.equations)]
[1] "F4(JP-Sap,CN-Lia;CN-Bei,US-Haw) = -(R<->JP+S2-JP-Tok<->CN+R<->S2-JP-Tok)*A"
#estimation via ratio of F4
alpha_lia=compute.f4ratio(dsu.fstats,num.quadruplet = c("CN-Bei","US-Haw","CN-Lia","JP-Sap"),
den.quadruplet = c("CN-Bei","US-Haw","CN-Shi","JP-Sap"))
alpha_lia[2]
bjack mean
0.9559757
#95% CI
alpha_lia[2] + c(-1.96,1.96)*alpha_lia[3]
[1] 0.9435186 0.9684329
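The reason this ratio estimates A can be read off the two symbolic expressions printed above. Writing X for the
sum of the three branch lengths they share (X and the \ell notation are shorthands introduced here, not
poolfstat names):

```latex
\begin{aligned}
X &= \ell_{\text{S2-JP-Tok}\leftrightarrow\text{CN}}
   + \ell_{\text{R}\leftrightarrow\text{S2-JP-Tok}}
   + \ell_{\text{R}\leftrightarrow\text{JP}},\\
F_4(\text{JP-Sap},\text{CN-Shi};\text{CN-Bei},\text{US-Haw}) &= -X,\\
F_4(\text{JP-Sap},\text{CN-Lia};\text{CN-Bei},\text{US-Haw}) &= -A\,X,\\
\frac{F_4(\text{JP-Sap},\text{CN-Lia};\text{CN-Bei},\text{US-Haw})}
     {F_4(\text{JP-Sap},\text{CN-Shi};\text{CN-Bei},\text{US-Haw})}
  &= \frac{-A\,X}{-X} = A \qquad (X \neq 0).
\end{aligned}
```

The quadruplets passed to compute.f4ratio list the same populations in a different order. Assuming the usual
symmetries of F4, the resulting sign change is shared by the numerator and the denominator and cancels in the
ratio.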
The resulting estimates of Chinese ancestry in the CN-Lia population are thus consistent. However, as
expected from the simulation study (see the main text of the manuscript), the confidence intervals estimated
with the graph fitting procedure are narrower (and presumably too optimistic) than those estimated with
the f4-ratio, which may be considered as more reliable.
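To make the difference between the two intervals concrete, the short base R sketch below compares their widths.
The bounds are copied from the outputs above; the variable names are illustrative only and not part of poolfstat.

```r
# 95% CI bounds copied from the outputs printed above
ci.graph   <- c(lower = 0.9566978, upper = 0.9633939)  # parameter A, graph fitting
ci.f4ratio <- c(lower = 0.9435186, upper = 0.9684329)  # f4-ratio, block-jackknife

# Interval widths and their ratio
width.graph   <- unname(diff(ci.graph))
width.f4ratio <- unname(diff(ci.f4ratio))
width.ratio   <- width.f4ratio / width.graph

# Is the graph-based interval entirely contained in the f4-ratio one?
nested <- ci.f4ratio[["lower"]] <= ci.graph[["lower"]] &&
  ci.graph[["upper"]] <= ci.f4ratio[["upper"]]

print(c(graph = width.graph, f4ratio = width.f4ratio, ratio = width.ratio))
print(nested)
```

With these values the f4-ratio interval is between three and four times wider than the graph-based one and fully
contains it.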
Interestingly, a similar proportion of Chinese ancestry could be obtained using the f4-ratio for CN-Nin as for
CN-Lia (assuming CN-Nin and CN-Lia have a similar ancestry with respect to the other populations). It was
also consistent with the one obtained in Figure 7b:
#Prop of Chinese ancestry in CN-Nin
alpha_nin=compute.f4ratio(dsu.fstats,num.quadruplet = c("CN-Bei","US-Haw","CN-Nin","JP-Sap"),
den.quadruplet = c("CN-Bei","US-Haw","CN-Shi","JP-Sap"))
alpha_nin[2]
bjack mean
0.9469928
#95% CI
alpha_nin[2] + c(-1.96,1.96)*alpha_nin[3]
[1] 0.9342755 0.9597100
Note that no such f4-ratio based estimation of admixture proportions could be carried out for the JP-Tok
population as no outgroup was available for one of its ancestries (i.e., one source connects directly to the root).
6.3 Investigating the origins of the populations from the American invasive
area using admixture graphs
The f3-based tests of admixture (section 5) showed that four (US-Col, US-Nca, US-Wat and US-Wis) of the
seven American populations displayed admixture signals. In addition, the most significant tests
always involved a source related to a population from the invasive area, suggesting that these populations were
not directly connected to the first invader. We will thus first attempt to position the remaining American
populations (BR-Pal, US-Sdi and US-Sok) with respect to the native and Hawaiian populations.
6.3.1 Positioning BR-Pal, US-Sdi and US-Sok with respect to the native populations
We investigated the relationships of the three American populations, BR-Pal, US-Sdi and US-Sok, with the
native populations by identifying with the add.leaf function the best positioning of each population onto
either the admixture graph (Figure 8) or the scaffold tree (Figure 7a) obtained above using the following
command (output not shown):
for(p in c("BR-Pal","US-Sdi","US-Sok")){
#positioning onto the admixture graph of native pops
add.p=add.leaf(natgraph$best.fitted.graph,leaf.to.add = p,dsu.fstats,verbose=FALSE,
drift.scaling=TRUE,compute.ci=TRUE)
plot(add.p$best.fitted.graph)
add.p.comp=compare.fitted.fstats(dsu.fstats,add.p$best.fitted.graph,n.worst.stats = 1)
#positioning onto the scaffold tree of native pops
add.p=add.leaf(scaf.tree.fit,leaf.to.add = p,dsu.fstats,verbose=FALSE,
drift.scaling=TRUE,compute.ci=TRUE)
plot(add.p$best.fitted.graph)
add.p.comp=compare.fitted.fstats(dsu.fstats,add.p$best.fitted.graph,n.worst.stats = 1)
}
As summarized in Figure 9, the three populations could be placed with a good resulting fit onto each graph of
native populations. The Z-score of the worst fitted f-statistics was always < 2 in absolute value except for the
graph resulting from the placing of US-Sok on the admixture graph of native populations (Z=-4.14; Figure
9E). In addition, for each population, the positioning onto the two graphs was concordant and suggested
that they were all admixed although this was not detected by the f3-based test of admixture (Table 3).
This is likely due to a too strong subsequent divergence from their ancestral admixed source. All three
populations shared a Hawaiian (ancestral to US-Haw) source, the other source being of presumably Chinese
origin for both US-Sdi and BR-Pal and of Japanese origin (ancestral to JP-Sap) for US-Sok.
[Figure 9 panels]
A) BR-Pal onto nat. pops. graph: ∆BIC=17.6; worst fit: BR-Pal,US-Haw;CN-Bei,CN-Lia (Z=1.57)
B) BR-Pal onto nat. pop. scaf. tree: ∆BIC=0.468; worst fit: CN-Bei,CN-Shi;JP-Sap,US-Haw (Z=1.06)
C) US-Sdi onto nat. pops. graph: ∆BIC=6.44; worst fit: CN-Bei,CN-Shi;JP-Sap,US-Haw (Z=1.06)
D) US-Sdi onto nat. pop. scaf. tree: ∆BIC=5.04; worst fit: CN-Bei,CN-Shi;JP-Sap,US-Haw (Z=1.06)
E) US-Sok onto nat. pops. graph: ∆BIC=7.84e-06; worst fit: CN-Bei,JP-Tok;US-Haw,US-Sok (Z=-4.14)
F) US-Sok onto nat. pop. scaf. tree: ∆BIC=19.1; worst fit: CN-Bei,CN-Shi;US-Haw,US-Sok (Z=-1.16)
Figure 9: Positioning of the BR-Pal, US-Sdi and US-Sok populations from the American invasive areas onto
the admixture graph connecting the six native and Hawaiian populations (Figure 8) and the scaffold tree
of native (and Hawaiian) populations (Figure 7a). The best fitting graphs (as obtained with the function
add.leaf) are displayed together with i) the worst fitted f-statistics and their associated Z-score; and ii) the
difference of their BIC with respect to the graphs displaying the second lowest BIC (∆BIC) as a measure of
support. The target populations are highlighted in yellow. For all the graphs, the fitted edge lengths are in
drift units (x1,000) since the drift.scaling argument was set to TRUE.
For BR-Pal, it should be noticed that the support for the positioning onto the scaffold tree (Figure 9B)
remained loose (∆BIC < 6) and a closer inspection of the corresponding add.leaf results showed that three
alternative graphs resulted in a BIC very similar to that of this best fitting graph:
addPal=add.leaf(scaf.tree.fit,leaf.to.add = "BR-Pal",dsu.fstats,verbose=FALSE,
drift.scaling=T,compute.ci=T)
#Difference of BIC wrt the best fitting graph
addPal.dbic=addPal$bic-min(addPal$bic)
format(addPal.dbic,digits = 3)
[1] "111.581" "111.581" "246.719" "246.647" "169.198" " 27.333" "111.581" "116.186" "115.099"
[10] "116.186" " 0.468" "116.186" "115.099" "116.186" " 0.468" "251.252" "116.186" " 0.468"
[19] "115.099" " 0.000" " 31.939"
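The three alternative graphs mentioned above (those within 6 BIC units of the best one) can be picked out from
the printed ∆BIC vector with base R. The values below are copied from the output; dbic, best.idx and close.idx
are illustrative names, and the indices refer to positions in the list of fitted graphs returned by add.leaf.

```r
# Delta BIC values copied from the printed output (21 candidate graphs)
dbic <- c(111.581, 111.581, 246.719, 246.647, 169.198,  27.333, 111.581,
          116.186, 115.099, 116.186,   0.468, 116.186, 115.099, 116.186,
            0.468, 251.252, 116.186,   0.468, 115.099,   0.000,  31.939)

# Index of the best fitting graph (lowest BIC)
best.idx <- which.min(dbic)

# Indices of the other graphs within 6 BIC units of the best one
close.idx <- setdiff(which(dbic < 6), best.idx)

print(best.idx)
print(close.idx)
print(dbic[close.idx])
```

The 6-unit cutoff is the one used throughout this section; a different threshold can be substituted without
changing the logic.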
As shown in Figure 10, these corresponded to the branching of the S1-BR-Pal source on i) the CN<->CN-Bei
branch (Figure 10D) with a null distance from CN, which is actually the same as in Figures 9B and 10A (the
estimated admixture proportions being also the same); or ii) the R<->CN branch (Figure 10B) or, strictly
equivalently, the R<->JP branch (Figure 10C). The two latter did not lead to any null branch (which is
usually preferable) and the branching of S1-BR-Pal was always closer to CN (10 × 10⁻³ and 11 × 10⁻³ drift
units respectively, recalling that the positioning of the root R is arbitrarily set in the middle of the JP<->CN
unrooted branch) than to JP, in agreement with a Chinese origin. This is concordant with Figure 9A, in which
the presence of JP-Tok and its presumably Chinese ancestral source (automatically named S2-JP-Tok) allows
ruling out this latter positioning of S1-BR-Pal with a high support (∆BIC = 17.6). Note also that this
resulted in either case in a higher estimated Chinese ancestry for BR-Pal.
For US-Sdi, the same three types of branching as for BR-Pal had a BIC 5.04 units higher than that of the best
fitting graph (i.e., positioning of the S1-US-Sdi source with respect to CN in Figure 9D). The support for these
alternative graphs was thus clearly lower according to the BIC criterion (and the absolute Z-score for the worst
fitted f-statistics was also higher), as shown with the following code:
addSdi=add.leaf(scaf.tree.fit,leaf.to.add = "US-Sdi",dsu.fstats,verbose=FALSE,
drift.scaling=T,compute.ci=T)
#Difference of BIC wrt the best fitting graph
addSdi.dbic=addSdi$bic-min(addSdi$bic)
format(addSdi.dbic,digits = 3)
[1] "185.91" "185.91" "541.90" "541.30" "206.66" " 13.04" "185.91" "190.52" "187.77" "190.52"
[11] " 5.04" "190.52" "187.77" "190.52" " 5.04" "545.90" "190.52" " 5.04" "187.77" " 0.00"
[21] " 17.65"
#worst fitted fstats for the three graphs within 6 BIC units from the best one
sel.idx=which(addSdi.dbic<6)
sel.idx=sel.idx[order(addSdi.dbic[sel.idx])]
sel.idx=sel.idx[-1]#remove the best fitting graph
for(i in 1:length(sel.idx)){
tmp.comp=compare.fitted.fstats(dsu.fstats,addSdi$fitted.graphs.list[[sel.idx[i]]],
n.worst.stats = 1)
}
1 Worst fit for:
Estimated Fitted Z–score
CN-Bei,CN-Shi;US-Haw,US-Sdi 9.460063e-05 3.469447e-18 -2.410857
1 Worst fit for:
Estimated Fitted Z–score
CN-Bei,CN-Shi;US-Haw,US-Sdi 9.460063e-05 3.469447e-18 -2.410857
1 Worst fit for:
Estimated Fitted Z–score
CN-Bei,CN-Shi;US-Haw,US-Sdi 9.460063e-05 3.469447e-18 -2.410857
[Figure 10 panels]
A) worst fit: CN-Bei,CN-Shi;JP-Sap,US-Haw (Z=1.06)
B) worst fit: BR-Pal,US-Haw;CN-Bei,CN-Shi (Z=1.19); ∆BIC=0.468
C) worst fit: BR-Pal,US-Haw;CN-Bei,CN-Shi (Z=1.19); ∆BIC=0.468
D) worst fit: BR-Pal,US-Haw;CN-Bei,CN-Shi (Z=1.19); ∆BIC=0.468
Figure 10: Admixture graphs resulting from the positioning of BR-Pal onto the scaffold tree of native and
Hawaiian populations (Figure 7a) with a BIC less than 6 units higher than the BIC of the best fitting graph
(within red box and represented in Figure 9B). Each graph (as obtained with the function
add.leaf) is displayed together with i) the worst fitted f-statistic and its associated Z-score; and ii) the
difference between its BIC and that of the best fitting graph (∆BIC) as a measure of
support. For all the graphs, the fitted edge lengths are in drift units (x1,000) since the drift.scaling argument
was set to TRUE.
In addition, the positions of US-Sdi on the extended graph and the scaffold tree of native populations were
concordant (Figures 9C and 9D), making the position of the Chinese source of US-Sdi on the CN<->CN-Shi
branch more likely. The distance of S1-US-Sdi from CN was also non-null (as opposed to the distance separating
S1-BR-Pal from CN in Figure 9B).
Finally the position of US-Sok on the scaffold tree of native populations was highly supported (Figure 9F).
However, its positioning onto the extended graph of native populations was less clear (Figure 9E) with two
other alternative graphs with almost the same BIC as the best fitted graph:
addSok=add.leaf(natgraph$best.fitted.graph,leaf.to.add = "US-Sok",dsu.fstats,verbose=FALSE,
drift.scaling=T,compute.ci=T)
#Difference of BIC wrt the best fitting graph
addSok.dbic=addSok$bic-min(addSok$bic)
#worst fitted fstats for the graphs within 6 BIC units from the best one
sel.idx=which(addSok.dbic<6)
sel.idx=sel.idx[order(addSok.dbic[sel.idx])]
sel.idx=sel.idx[-1]#remove the best fitting graph
#Delta_BIC (wrt the best fitted graph)
format(addSok.dbic[sel.idx],digits = 3)
[1] "7.84e-06" "9.73e-06"
for(i in 1:length(sel.idx)){
tmp.comp=compare.fitted.fstats(dsu.fstats,addSok$fitted.graphs.list[[sel.idx[i]]],
n.worst.stats = 1)
}
1 Worst fit for:
Estimated Fitted Z–score
CN-Bei,JP-Tok;US-Haw,US-Sok 0.002670761 0.0002218372 -4.139588
1 Worst fit for:
Estimated Fitted Z–score
CN-Bei,JP-Tok;US-Haw,US-Sok 0.002670761 0.0002218628 -4.139545
These actually only differed by the position of the Japanese source named S1-US-Sok in Figure 9E with
respect to the Japanese sources of CN-Lia (S2-CN-Lia) and JP-Tok (S1-JP-Tok), the estimated distances
between these three sources S1-US-Sok, S2-CN-Lia and S1-JP-Tok being null. Hence, simplifying the graph by
combining the three sources into a single one led to a similar fit (but a lower BIC as the number of parameters
is reduced):
tst.SokGraph=rbind(c("S-JP-Tok","S1-JP-Tok","B"),c("S-JP-Tok","S2-JP-Tok","(1-B)"),
c("JP-Tok","S-JP-Tok",""), c("S1-JP-Tok" ,"JP",""),
c("S2-JP-Tok","R",""),c("JP-Sap","S1-JP-Tok",""),
c("CN","S2-JP-Tok",""),c("S-CN-Lia","S1-CN-Lia","A"),
c("S-CN-Lia","S1-JP-Tok","(1-A)"),c("CN-Lia","S-CN-Lia",""),
c("S1-CN-Lia","CN",""),c("CN-Shi","S1-CN-Lia",""),c("JP","R",""),
c("CN-Bei","CN",""),c("US-Haw","S1-US-Sok",""),
c("US-Sok","S-US-Sok",""),c("S-US-Sok","S1-JP-Tok","C"),
c("S-US-Sok","S1-US-Sok","(1-C)"),c("S1-US-Sok","JP",""))
tst.SokGraph.par=generate.graph.params(tst.SokGraph,dsu.fstats)
Total Number of Parameters: 15 (12 edges lengths + 3 adm. coeff.)
Total Number of Statistics: 21 (6 F2 and 15 F3)
tst.SokGraph.fit=fit.graph(tst.SokGraph.par,drift.scaling = T,verbose=FALSE)
tst.SokGraph.fit@bic
[1] 435.3341
tmp.comp=compare.fitted.fstats(dsu.fstats,tst.SokGraph.fit,n.worst.stats = 1)
1 Worst fit for:
Estimated Fitted Z–score
CN-Bei,JP-Tok;US-Haw,US-Sok 0.002670761 0.000221864 -4.139543
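As a rough guide to reading these tables, the printed Z-scores appear to follow the sign convention
(fitted - estimated)/s.e.: in every row shown in this section the Z-score is negative when the estimate exceeds
the fitted value and positive otherwise. This is an inference from the printed outputs, not a statement about
the package internals. Under that assumption, the standard error implied by the row above can be backed out
in base R (variable names are illustrative):

```r
# Values copied from the "Worst fit" row printed above
estimated <- 0.002670761
fitted    <- 0.000221864
z.score   <- -4.139543

# Assumed convention (inferred from the outputs): Z = (fitted - estimated) / s.e.
implied.se <- (fitted - estimated) / z.score

# Residual expressed in the same units as the f-statistic
residual <- fitted - estimated

print(c(residual = residual, implied.se = implied.se))
```

A standard error that comes out positive is at least consistent with the assumed convention for this row.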
Yet this simplified version of the Figure 9E graph should be interpreted with caution as i) the absolute Z-score
for the worst fitted f-statistics remains high (Z=-4.14); ii) the estimated distance (in drift units) of JP-Tok
from its source S-JP-Tok is null; and iii) it somewhat contradicts the original graph displayed in Figure
8 (e.g., in which the two Japanese sources of CN-Lia and JP-Tok are clearly differentiated). Overall this
suggests more complex relationships among the Japanese sources of US-Sok and CN-Lia and JP-Tok from the
native area that cannot be captured by the sample at hand (or even modeled as a simple admixture graph).
To simplify the modeling of the demographic history of the American invasive populations, we further used
the scaffold tree of native populations to represent their ancestry. Hence, we discarded the two admixed
native populations CN-Lia and JP-Tok. Indeed, no evidence for a direct contribution of CN-Lia or JP-Tok to
the ancestry of BR-Pal, US-Sdi or US-Sok could be found from the above analyses.
6.3.2 Searching for the most comprehensive admixture graph including American invasive
populations
We relied on the graph.builder function to try to jointly include American invasive populations, first focusing
on BR-Pal, US-Sdi and US-Sok for the same reasons as mentioned above, and evaluated all the possible
inclusion orders by pair (as previously when analyzing the native populations). Below, we simplified the output
by only printing the analyses that resulted in a best fitted graph with a worst fitted f-statistic Z-score lower
than 2 (in absolute value):
cleangrid<-function(x){n=ncol(x);x[apply(x,1,f<-function(y){length(unique(y))==n}),n:1]}
pops.to.include<-expand.grid(c("BR-Pal","US-Sdi","US-Sok"),
c("BR-Pal","US-Sdi","US-Sok")) %>% as.matrix() %>% cleangrid()
#exploring and evaluating the graphs
for(i in 1:nrow(pops.to.include)){
tmp.graph<-graph.builder(scaf.tree.fit,leaves.to.add = pops.to.include[i,],
fstats=dsu.fstats,verbose=F)
tmp.comp.fit=compare.fitted.fstats(dsu.fstats,tmp.graph$best.fitted.graph,
n.worst.stats = 0)
if(max(abs(tmp.comp.fit$`Z--score`))<2){
cat("##########################################\n")
cat("Inclusion Order:",pops.to.include[i,],"\n")
#number of graphs within Dbic<6 (default)
cat("\tNumber Of Graphs:",tmp.graph$n.graphs,"\n")
#min BIC
cat("\tMin BIC:",min(tmp.graph$bic),"\n")
#worst fitted fstats (best fitting graph)
cat("\tWorst fstats fit:\n")
print(tmp.comp.fit[which.max(abs(tmp.comp.fit$`Z--score`)),])
}
}
##########################################
Inclusion Order: US-Sok BR-Pal
Number Of Graphs: 4
Min BIC: 271.6173
Worst fstats fit:
Estimated Fitted Z–score
CN-Bei,CN-Shi;US-Haw,US-Sok 4.971676e-05 -1.040834e-17 -1.160098
As shown, out of the six tested (ordered) pairs of included populations, including first US-Sok and then BR-Pal
resulted in the best fitted graph (worst fitted f-statistics associated Z-score equal to -1.16). As before, three
alternative but equivalent graphs were obtained, since they simply differ according to the positioning of the
Chinese-related source of BR-Pal, its distance from the CN node being always null (Figure 11).
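For reference, the six ordered pairs evaluated above can also be enumerated without the pipe operator. The
sketch below is a base R alternative that yields the same set of pairs as the cleangrid helper (column order
aside); pops, grid and ordered.pairs are illustrative names.

```r
# Candidate populations to add, two at a time
pops <- c("BR-Pal", "US-Sdi", "US-Sok")

# All combinations of a first and a second population (9 rows)
grid <- as.matrix(expand.grid(first = pops, second = pops,
                              stringsAsFactors = FALSE))

# Keep only the rows where the two populations differ (ordered pairs)
ordered.pairs <- grid[grid[, "first"] != grid[, "second"], , drop = FALSE]
n.pairs <- nrow(ordered.pairs)

print(ordered.pairs)
print(n.pairs)
```

Each row can then be passed as the leaves.to.add argument, as in the loop above. With five candidate
populations the same construction gives 20 ordered pairs.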
addSokPal=graph.builder(scaf.tree.fit,leaves.to.add = c("US-Sok","BR-Pal"),dsu.fstats,
verbose=FALSE,drift.scaling=T,compute.ci=T)
#Difference of BIC wrt the best fitting graph
addSokPal.dbic=addSokPal$bic-min(addSokPal$bic)
format(addSokPal.dbic,digits = 3)
[1] "0.388" "0.388" "0.388" "0.000"
We thus simplified the graph by defining CN (the ancestor of the two Chinese populations CN-Bei and
CN-Shi) as the Chinese source of BR-Pal (thereby removing one branch length parameter) and redefined
some node names to clarify the corresponding inferred scenario. Indeed all graphs suggested that American
invasive populations originated from two independent admixture events among the populations from the
native area (including US-Haw):
• A first (older) admixture event between a Japanese-related and a Hawaiian-related source (hereafter called
J-Am1 and H-Am1 respectively) that led to the internal node population called Am1, which is directly
ancestral to US-Sok.
• A second (more recent) admixture event between a source related to Am1 (hereafter called S-Am2)
and a Chinese source closely related to the ancestor of CN-Shi and CN-Bei. We hereafter call this second
admixed population, directly related to BR-Pal, Am2.
The corresponding graph, equivalent to the ones on Figure 11, was defined (and fitted) as follows and plotted
on Figure 12 (graph framed in red):
am.scaf.graph<-rbind(c('Am2','S-Am2','B'),c('Am2','CN','(1-B)'),c('BR-Pal','Am2',''),
c('S-Am2','Am1',''),c('US-Sok','S-Am2',''),c('CN-Shi','CN',''),
c('Am1','J-Am1','A'),c('Am1','H-Am1','(1-A)'),c('J-Am1','JP',''),
c('JP-Sap','J-Am1',''),c('US-Haw','H-Am1',''),c('H-Am1','JP',''),
c('CN','R',''),c('JP','R',''),c('CN-Bei','CN',''))
am.scaf.par=generate.graph.params(am.scaf.graph,dsu.fstats)
Total Number of Parameters: 12 (10 edges lengths + 2 adm. coeff.)
Total Number of Statistics: 15 (5 F2 and 10 F3)
am.scaf.fit=fit.graph(am.scaf.par,drift.scaling = T,verbose=FALSE)
am.scaf.fit@bic
[1] 269.2973
tmp.comp=compare.fitted.fstats(dsu.fstats,am.scaf.fit,n.worst.stats = 1)
1 Worst fit for:
Estimated Fitted Z–score
BR-Pal,US-Haw;CN-Bei,CN-Shi -5.09979e-05 0 1.187469
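The parameter and statistic counts reported by generate.graph.params can be cross-checked directly from the
edge matrix. The sketch below uses a counting rule inferred from the printed outputs (it is not taken from the
package documentation): one coefficient per two-way admixed node, one length per non-admixture edge with the
two root edges counted as a single unrooted branch, and n-1 F2 plus choose(n-1,2) F3 statistics for n leaves.
All variable names are illustrative.

```r
# Edge matrix copied from the am.scaf.graph definition above: child, parent, admixture weight
edges <- rbind(c('Am2','S-Am2','B'), c('Am2','CN','(1-B)'), c('BR-Pal','Am2',''),
               c('S-Am2','Am1',''), c('US-Sok','S-Am2',''), c('CN-Shi','CN',''),
               c('Am1','J-Am1','A'), c('Am1','H-Am1','(1-A)'), c('J-Am1','JP',''),
               c('JP-Sap','J-Am1',''), c('US-Haw','H-Am1',''), c('H-Am1','JP',''),
               c('CN','R',''), c('JP','R',''), c('CN-Bei','CN',''))

# Admixture edges carry a non-empty weight in the third column
is.adm <- edges[, 3] != ""

# One coefficient per admixed node (two-way admixture assumed)
n.adm.coeff <- length(unique(edges[is.adm, 1]))

# Non-admixture edges, with the two root edges merged into one unrooted branch
n.edge.lengths <- sum(!is.adm) - 1

# Leaves are nodes that appear as a child but never as a parent
leaves   <- setdiff(edges[, 1], edges[, 2])
n.leaves <- length(leaves)

# Assumed basis of f-statistics for n leaves
n.f2 <- n.leaves - 1
n.f3 <- choose(n.leaves - 1, 2)

print(c(edges = n.edge.lengths, adm = n.adm.coeff, F2 = n.f2, F3 = n.f3))
```

For this graph the rule gives 10 edge lengths, 2 admixture coefficients, 5 F2 and 10 F3, matching the printed
summary.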
6.3.3 Positioning US-Col, US-Nca, US-Sdi, US-Wat and US-Wis onto the scaffold graph
extended with US-Sok and BR-Pal
We then explored the position of the five remaining American populations (US-Col, US-Nca, US-Sdi, US-Wat
and US-Wis) on the scaffold graph am.scaf that included BR-Pal and US-Sok using the add.leaf function.
As shown on Figure 12, only US-Wat could be placed without ambiguity on the graph (∆BIC > 8 and worst
fitted f-statistic absolute Z-score < 2) and could be interpreted as a recently admixed population between
a source related to a (presumably) Northern American population Am1 and the admixed (presumably)
Southern Am2 population. The positioning of US-Sdi also resulted in a best fitting graph with a high support
(∆BIC > 33) and a rather good fit, the worst fitted f-statistic Z-score being equal to -2.41. The resulting
scenario suggested that US-Sdi derived from a population admixed between a source deriving from Am2 (and
actually closely related to BR-Pal) and a source related to an (another?) Hawaiian population distantly related
to H-Am1, the Hawaiian source of Am1 (that also contributed to Am2).
[Figure 11 panels]
A) worst fit: CN-Bei,CN-Shi;US-Haw,US-Sok (Z=-1.16)
B) worst fit: BR-Pal,US-Haw;CN-Bei,CN-Shi (Z=1.19); ∆BIC=0.388
C) worst fit: BR-Pal,US-Haw;CN-Bei,CN-Shi (Z=1.19); ∆BIC=0.388
D) worst fit: BR-Pal,US-Haw;CN-Bei,CN-Shi (Z=1.19); ∆BIC=0.388
Figure 11: Admixture graphs resulting from the joint positioning of US-Sok and BR-Pal onto the scaffold
tree of native populations (Figure 7a) with a BIC less than 6 units higher than the BIC of the best fitting
graph (within red box and represented in Figure 9B). Each graph (as obtained with the
function add.leaf) is displayed together with i) the worst fitted f-statistic and its associated Z-score; and ii)
the difference between its BIC and that of the best fitting graph (∆BIC) as a measure of
support. For all the graphs, the fitted edge lengths are in drift units (x1,000) since the drift.scaling argument
was set to TRUE.
This suggested that US-Sdi resulted from multiple admixture events between populations closely related
to Am2 and Hawaii that are rather difficult to fit with a simple admixture graph (worst fitted f-statistics
associated Z-score slightly higher than 2 in absolute value). Accordingly, when jointly adding US-Wat and
US-Sdi (whatever the order of inclusion) to the am.scaf graph, the best fitted graph was consistent with the
ones obtained above when adding US-Sdi and US-Wat separately, the Am2 related source being common to the
two populations (the inferred distance between S1-US-Wat and S1-US-Sdi being null) and the Hawaiian related
source of US-Sdi being closely related to H-Am1 (Figure 13A). The inferred distances were also more consistent
with the geographical and historical origins of the populations and the high contribution of the
S1-US-Sdi/S1-US-Wat source was also consistent with the observed f3 values (see section 5). Nevertheless, the
fit remained poor as the worst fitted f-statistics were clearly off (absolute Z-score > 6) and a likely alternative
scenario would be that the second US-Sdi source is a ghost population deriving from H-Am1, as represented in
Figure 13B. However, as indicated by the error message generated by the fit.graph function, such a scenario
is not fittable because the resulting system of equations is singular.
#Joint positioning of US-Wat and US-Sdi on the am.scaf graph
am.scafWatSdi=graph.builder(am.scaf.fit,leaves.to.add = c("US-Wat","US-Sdi"),dsu.fstats,
drift.scaling=TRUE,verbose = FALSE)
#An alternative (not fittable) admixture graph scenario
am.scafWatSdi.sc<-rbind(c('Am2','S-Am2','B'),c('Am2','CN','(1-B)'),
c('BR-Pal','Cal',''),c("Cal","Am2",""),
c("CN","R",""),c('S-Am2','Am1',''),
c('US-Sok','W',''),c('CN-Shi','CN',''),
c("W","S-Am2",""),c('Am1','J-Am1','A'),
c('Am1','H-Am1','(1-A)'),c('J-Am1','JP',''),
c('JP-Sap','J-Am1',''),c('US-Haw','H-Am1',''),
c('H-Am1','JP',''),c('JP','R',''),c('CN-Bei','CN',''),
c("US-Wat","SW",""),c("SW","W","D"),c("SW","Cal","(1-D)"),
c("US-Sdi","SSdi",""),c("SSdi","Cal","E"),
c("SSdi","S-SSdi","(1-E)"),c("S-SSdi","H-Am1",""))
am.scafWatSdi.sc.par=generate.graph.params(am.scafWatSdi.sc,dsu.fstats)
Total Number of Parameters: 19 (15 edges lengths + 4 adm. coeff.)
Total Number of Statistics: 28 (7 F2 and 21 F3)
#The graph can be plotted using plot(am.scafWatSdi.sc.par) but
#This scenario is not fittable as indicated by the error message from the fit.graph function
am.scafWatSdi.sc.fit=fit.graph(am.scafWatSdi.sc.par,drift.scaling = T)
The system is singular: the rank of the incidence matrix is lower than n.edges-1 (e.g., some branches are not identifiable) and/or the number of parameters is larger than the number of basis f-statistics
It is not possible to properly fit this admixture graph.
In agreement with the analysis of f3-based admixture test results that suggested complex admixture histories
among these closely related populations, no proper admixture graph could be found when trying to position
US-Col, US-Wis or US-Nca onto the am.scaf scaffold graph (Figure 12). Nevertheless, for each of these
populations, all the resulting best fitted graphs suggested a high contribution of the Am2 admixed source
(BR-Pal being its best proxy) that may be closely related to the common source of US-Sdi and US-Wat
mentioned above, a second contributing source being related to Japanese populations.
[Figure 12 panels]
Scaffold graph
US-Col: worst fit: BR-Pal,US-Col;CN-Shi,US-Haw (Z=6.15); ∆BIC=0
US-Nca: worst fit: BR-Pal,US-Nca;US-Haw,US-Sok (Z=-5.86); ∆BIC=0
US-Sdi: worst fit: CN-Bei,CN-Shi;US-Haw,US-Sdi (Z=-2.41); ∆BIC=33.841
US-Wat: worst fit: CN-Bei,CN-Shi;US-Haw,US-Wat (Z=-1.83); ∆BIC=8.365
US-Wis: worst fit: BR-Pal,US-Wis;CN-Shi,US-Haw (Z=4.37); ∆BIC=0
Figure 12: Admixture graphs resulting from the positioning of US-Col, US-Nca, US-Sdi, US-Wat and US-Wis
onto the am.scaf scaffold graph that relates the scaffold tree of native populations and BR-Pal and US-Sok
(red frame). The best fitting graphs (as obtained with the function add.leaf) are displayed together with i)
the worst fitted f-statistics and their associated Z-score; and ii) the difference of their BIC with respect to
the graphs displaying the second lowest BIC (∆BIC) as a measure of support. The target populations are
highlighted in yellow. For all the graphs, the fitted edge lengths are in drift units (x1,000) since the
drift.scaling argument was set to TRUE.
[Figure 13 panels]
A) worst fit: BR-Pal,US-Sok;CN-Bei,US-Haw (Z=-5.87)
B)
Figure 13: Connecting US-Wat and US-Sdi to the am.scaf scaffold graph. A) Best fitted admixture graph
obtained with the graph.builder function (the fitted edge lengths are in drift units (x1000) since the
drift.scaling argument was set to TRUE). B) Possible (but not fittable) scenario connecting US-Wat and US-Sdi
to the am.scaf scaffold graph.
Finally, no proper admixture graph could be obtained when trying to jointly include each possible pair of
the five populations:
cleangrid<-function(x){n=ncol(x);x[apply(x,1,f<-function(y){length(unique(y))==n}),n:1]}
pops.to.include<-expand.grid(c("US-Col","US-Nca","US-Sdi","US-Wat","US-Wis"),
c("US-Col","US-Nca","US-Sdi","US-Wat","US-Wis")) %>%
as.matrix() %>% cleangrid()
#exploring and evaluating the graphs
for(i in 1:nrow(pops.to.include)){
tmp.graph<-graph.builder(am.scaf.fit,leaves.to.add = pops.to.include[i,],
fstats=dsu.fstats,verbose=F)
tmp.comp.fit=compare.fitted.fstats(dsu.fstats,tmp.graph$best.fitted.graph,
n.worst.stats = 0)
if(max(abs(tmp.comp.fit$`Z--score`))<3){
cat("##########################################\n")
cat("Inclusion Order:",pops.to.include[i,],"\n")
#number of graphs within Dbic<6 (default)
cat("\tNumber Of Graphs:",tmp.graph$n.graphs,"\n")
#min BIC
cat("\tMin BIC:",min(tmp.graph$bic),"\n")
#worst fitted fstats (best fitting graph)
cat("\tWorst fstats fit:\n")
print(tmp.comp.fit[which.max(abs(tmp.comp.fit$`Z--score`)),])
}
}
References
Fraimout A., Debat V., Fellous S., Hufbauer R. A., Foucaud J., Pudlo P., Marin J.-M., Price D.
K., Cattel J., Chen X., Deprá M., François Duyck P., Guedot C., Kenis M., Kimura M. T.,
Loeb G., Loiseau A., Martinez-Sañudo I., Pascual M., Polihronakis Richmond M., Shearer P.,
Singh N., Tamura K., Xuéreb A., Zhang J., Estoup A., 2017 Deciphering the routes of invasion of
Drosophila suzukii by means of ABC random forest. Molecular Biology and Evolution 34: 980–996.
Hivert V., Leblois R., Petit E. J., Gautier M., Vitalis R., 2018 Measuring genetic differentiation from
Pool-seq data. Genetics 210: 315–330.
Koboldt D. C., Zhang Q., Larson D. E., Shen D., McLellan M. D., Lin L., Miller C. A., Mardis E.
R., Ding L., Wilson R. K., 2012 VarScan 2: Somatic mutation and copy number alteration discovery in
cancer by exome sequencing. Genome Research 22: 568–576.
Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin
R., 1000 Genome Project Data Processing Subgroup, 2009 The Sequence Alignment/Map format and
SAMtools. Bioinformatics 25: 2078–2079.
Olazcuaga L., Loiseau A., Parrinello H., Paris M., Fraimout A., Guedot C., Diepenbrock L.
M., Kenis M., Zhang J., Chen X., Borowiec N., Facon B., Vogt H., Price D. K., Vogel H.,
Prud’homme B., Estoup A., Gautier M., 2020 A whole-genome scan for association with invasion
success in the fruit fly Drosophila suzukii using contrasts of allele frequencies corrected for population
structure. Molecular Biology and Evolution 37: 2369–2385.
Paris M., Boyer R., Jaenichen R., Wolf J., Karageorgi M., Green J., Cagnon M., Parrinello H.,
Estoup A., Gautier M., Gompel N., Prud’homme B., 2020 Near-chromosome level genome assembly
of the fruit pest Drosophila suzukii using long-read sequencing. Scientific Reports 10: 11227.