Measuring Genetic Differentiation from Pool-seq Data

The advent of high throughput sequencing and genotyping technologies enables the comparison of patterns of polymorphisms at a very large number of markers. While the characterization of genetic structure from individual sequencing data remains expensive for many non-model species, it has been shown that sequencing pools of individual DNAs (Pool-seq) represents an attractive and cost-effective alternative. However, analyzing sequence read counts from a DNA pool instead of individual genotypes raises statistical challenges in deriving correct estimates of genetic differentiation. In this article, we provide a method-of-moments estimator of FST for Pool-seq data, based on an analysis-of-variance framework. We show, by means of simulations, that this new estimator is unbiased, and outperforms previously proposed estimators. We evaluate the robustness of our estimator to model misspecification, such as sequencing errors and uneven contributions of individual DNAs to the pools. Finally, by reanalyzing published Pool-seq data of different ecotypes of the prickly sculpin Cottus asper, we show how the use of an unbiased FST estimator may question the interpretation of population structure inferred from previous analyses.
Valentin Hivert,*,† Raphaël Leblois,*,† Eric J. Petit,‡ Mathieu Gautier,*,†,1 and Renaud Vitalis*,†,1,2
*CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ Montpellier, 34988 Montferrier-sur-Lez Cedex, France, †Institut de Biologie Computationnelle, Univ Montpellier, 34095 Montpellier Cedex, France, and ‡ESE, Ecology and Ecosystem Health, INRA, Agrocampus Ouest, 35042 Rennes, Cedex, France
ORCID IDs: 0000-0002-5144-6956 (V.H.); 0000-0002-3051-4497 (R.L.); 0000-0001-5058-5826 (E.J.P.); 0000-0001-7257-5880 (M.G.);
0000-0001-7096-3089 (R.V.)
KEYWORDS FST; genetic differentiation; pool sequencing; population genomics
It has long been recognized that the subdivision of species
into subpopulations, social groups, and families fosters ge-
netic differentiation (Wahlund 1928; Wright 1931). Charac-
terizing genetic differentiation as a means to infer unknown
population structure is therefore fundamental to population
genetics and finds applications in multiple domains, including
conservation biology, invasion biology, association mapping, and
forensics, among many others. In the late 1940s
and early 1950s, Malécot (1948) and Wright (1951) intro-
duced F-statistics to partition genetic variation within and
between groups of individuals (Holsinger and Weir 2009;
Bhatia et al. 2013). Since then, the estimation of F-statistics
has become standard practice (see, e.g., Weir 1996, 2012;
Weir and Hill 2002) and the most commonly used estimators
of FST have been developed in an analysis-of-variance frame-
work (Cockerham 1969, 1973; Weir and Cockerham 1984),
which can be recast in terms of probabilities of identity of
pairs of homologous genes (Cockerham and Weir 1987;
Rousset 2007; Weir and Goudet 2017).
Assuming that molecular markers are neutral, estimates of
FST are typically used to quantify genetic structure in natural
populations, which is then interpreted as the result of demo-
graphic history (Holsinger and Weir 2009): large FST values
are expected for small populations among which dispersal
is limited (Wright 1951), or between populations that have
long diverged in isolation from each other (Reynolds et al.
1983). When dispersal is spatially restricted, a positive re-
lationship between FST and the geographical distance for pairs
of populations generally holds (Slatkin 1993; Rousset 1997). It
has also been proposed to characterize the heterogeneity of
FST estimates across markers for identifying loci that are targeted
by selection (Cavalli-Sforza 1966; Lewontin and Krakauer
1973; Beaumont and Nichols 1996; Vitalis et al. 2001; Akey
et al. 2002; Beaumont 2005; Weir et al. 2005; Lotterhos and
Whitlock 2014, 2015; Whitlock and Lotterhos 2015).
Copyright © 2018 by the Genetics Society of America
doi: https://doi.org/10.1534/genetics.118.300900
Manuscript received March 9, 2018; accepted for publication July 21, 2018; published Early Online July 25, 2018.
Supplemental material available at Figshare: https://doi.org/10.25386/genetics.6856781.
1 These authors are joint senior authors on this work.
2 Corresponding author: Centre de Biologie pour la Gestion des Populations, Campus International de Baillarguet, CS 30016, 34988 Montferrier-sur-Lez Cedex, France. E-mail: renaud.vitalis@inra.fr
Genetics, Vol. 210, 315–330 September 2018
Next-generation sequencing (NGS) technologies provide
unprecedented amounts of polymorphism data in both model
and nonmodel species (Ellegren 2014). Although the se-
quencing strategy initially involved individually tagged sam-
ples in humans (The International HapMap Consortium
2005), whole-genome sequencing of pools of individuals
(Pool-seq) is being increasingly used for population genomic
studies (Schlötterer et al. 2014). Because it consists of se-
quencing libraries of pooled DNA samples and does not re-
quire individual tagging of sequences, Pool-seq provides
genome-wide polymorphism data at considerably lower cost
than sequencing of individuals (Schlötterer et al. 2014).
However, non-equimolar amounts of DNA from all individuals
in a pool and stochastic variation in the amplification
efficiency of individual DNAs have raised concerns with respect
to the accuracy of the resulting allele frequency estimates,
particularly at low sequencing depth and with small
pool sizes (Cutler and Jensen 2010; Anderson et al. 2014;
Ellegren 2014). Nonetheless, it has been shown that, at equal
sequencing efforts, Pool-seq provides similar, if not more
accurate, allele frequency estimates than individual-based
analyses (Futschik and Schlötterer 2010; Gautier et al.
2013). The problem is different for diversity and differentiation
parameters, which depend on second moments of allele
frequencies or, equivalently, on pairwise measures of
genetic identity: with Pool-seq data, it is indeed impossi-
ble to distinguish pairs of reads that are identical because
they were sequenced from a single gene from pairs of reads
that are identical because they were sequenced from two
distinct genes that are identical in state (IIS) (Ferretti et al.
2013).
Appropriate estimators of diversity and differentiation
parameters must therefore be sought to account for both
the sampling of individual genes from the pool and the
sampling of reads from these genes. There have been several attempts to define estimators of the parameter FST for Pool-seq data (Kofler et al. 2011; Ferretti et al. 2013), from ratios of heterozygosities (or from probabilities of genetic identity between pairs of reads) within and between pools. In the following, we will argue that these estimators are biased (i.e., they do not converge toward the expected value of the parameter) and that some of them have undesired statistical properties (i.e., the bias depends on sample size and coverage). Here, following Cockerham (1969, 1973), Weir and Cockerham (1984), Weir (1996), Weir and Hill (2002), and Rousset (2007), we define a method-of-moments estimator of the parameter FST using an analysis-of-variance framework. We then evaluate the accuracy and precision of this estimator, based on the analysis of simulated data sets, and compare it to the estimators defined in the software package PoPoolation2 (Kofler et al. 2011) and in Ferretti et al. (2013). Furthermore, we test the robustness of our estimators to model misspecifications (including unequal contributions of individuals in pools and sequencing errors). Finally, we reanalyze the prickly sculpin (Cottus asper) Pool-seq data (published by Dennenmoser et al. 2017), and show how the use of biased FST estimators in previous analyses may challenge the interpretation of population structure.
Note that throughout this article, we use the term "gene" to designate a segregating genetic unit (in the sense of the "Mendelian gene" from Orgogozo et al. 2016). We further use the term "read" in a narrow sense, as a sequenced copy of a gene. For the sake of simplicity, we will use the term "Ind-seq" to refer to analyses based on individual data, for which we further assume that individual genotypes are called without error.
Model
F-statistics may be described as intraclass correlations for the IIS probability of pairs of genes (Cockerham and Weir 1987; Rousset 1996, 2007). FST is best defined as:

$$F_{ST} \equiv \frac{Q_1 - Q_2}{1 - Q_2}, \tag{1}$$

where $Q_1$ is the IIS probability for genes sampled within subpopulations, and $Q_2$ is the IIS probability for genes sampled between subpopulations. In the following, we develop an estimator of FST for Pool-seq data by decomposing the total variance of read frequencies in an analysis-of-variance framework. A complete derivation of the model is provided in the Supplemental Material, File S1.
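As a quick numerical check of Equation 1, the sketch below evaluates FST from two IIS probabilities. The function name `fst_from_iis` and the example values are illustrative only and do not come from the article.

```python
def fst_from_iis(q1, q2):
    """Equation 1: F_ST = (Q1 - Q2) / (1 - Q2)."""
    if q2 >= 1.0:
        raise ValueError("Q2 must be < 1 for F_ST to be defined")
    return (q1 - q2) / (1.0 - q2)


# Hypothetical values: genes are IIS with probability 0.6 within
# subpopulations and 0.5 between subpopulations.
example_fst = fst_from_iis(0.6, 0.5)
```

With these values the result is 0.2 (up to floating-point rounding); when the two IIS probabilities are equal, the function returns 0.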
For the sake of clarity, the notation used throughout this article is given in Table 1. We first derive our model for a single locus and eventually provide a multilocus estimator of FST. Consider a sample of $n_d$ subpopulations, each of which is made of $n_i$ genes ($i = 1, \ldots, n_d$) sequenced in pools (hence $n_i$ is the haploid sample size of the $i$th pool). We define $c_{ij}$ as the number of reads sequenced from gene $j$ ($j = 1, \ldots, n_i$) in subpopulation $i$ at the locus considered. Note that $c_{ij}$ is a latent variable that cannot be directly observed from the data. Let $X_{ijr:k}$ be an indicator variable for read $r$ ($r = 1, \ldots, c_{ij}$) from gene $j$ in subpopulation $i$, such that $X_{ijr:k} = 1$ if the $r$th read from the $j$th gene in the $i$th deme is of type $k$, and $X_{ijr:k} = 0$ otherwise. In the following, we use standard dot notation for sample averages, i.e.: $X_{ij\cdot:k} \equiv \sum_r X_{ijr:k} / c_{ij}$, $X_{i\cdot\cdot:k} \equiv \sum_j \sum_r X_{ijr:k} / \sum_j c_{ij}$, and $X_{\cdot\cdot\cdot:k} \equiv \sum_i \sum_j \sum_r X_{ijr:k} / \sum_i \sum_j c_{ij}$.
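The analysis-of-variance is based on the computation of sums of squares, as follows:

$$
\begin{aligned}
\sum_i^{n_d} \sum_j^{n_i} \sum_r^{c_{ij}} \left(X_{ijr:k} - X_{\cdot\cdot\cdot:k}\right)^2
&= \sum_i^{n_d} \sum_j^{n_i} \sum_r^{c_{ij}} \left(X_{ijr:k} - X_{ij\cdot:k}\right)^2 \\
&\quad + \sum_i^{n_d} \sum_j^{n_i} \sum_r^{c_{ij}} \left(X_{ij\cdot:k} - X_{i\cdot\cdot:k}\right)^2 \\
&\quad + \sum_i^{n_d} \sum_j^{n_i} \sum_r^{c_{ij}} \left(X_{i\cdot\cdot:k} - X_{\cdot\cdot\cdot:k}\right)^2 \\
&\equiv \mathrm{SSR}_{:k} + \mathrm{SSI}_{:k} + \mathrm{SSP}_{:k}.
\end{aligned} \tag{2}
$$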
As is shown in File S1, the expected sums of squares depend on the expectation of the allele frequency $p_k$ over all replicate populations sharing the same evolutionary history, as well as on the IIS probability $Q_{1:k}$ that two genes in the same pool are both of type $k$, and the IIS probability $Q_{2:k}$ that two genes from different pools are both of type $k$. Taking expectations (see the detailed computations in File S1), one has:

$$E(\mathrm{SSR}_{:k}) = 0 \tag{3}$$

for reads within individual genes, since we assume that there is no sequencing error, i.e., all the reads sequenced from a single gene are identical and $X_{ijr:k} = X_{ij\cdot:k}$ for all $r$. For reads between genes within pools, we get:

$$E(\mathrm{SSI}_{:k}) = (C_1 - D_2)(p_k - Q_{1:k}), \tag{4}$$

where $C_1 \equiv \sum_i \sum_j c_{ij} = \sum_i C_{1i}$ is the total number of reads in the full sample (total coverage), $C_{1i}$ is the coverage of the $i$th pool, and $D_2 \equiv \sum_i (C_{1i} + n_i - 1)/n_i$. $D_2$ arises from the assumption that the distribution of the read counts $c_{ij}$ is multinomial (i.e., that all genes contribute equally to the pool of reads; see Equation A15 in File S1). For reads between genes from different pools, we have:

$$E(\mathrm{SSP}_{:k}) = \left(C_1 - \frac{C_2}{C_1}\right)(Q_{1:k} - Q_{2:k}) + (D_2 - D_2^{*})(p_k - Q_{1:k}), \tag{5}$$

where $C_2 \equiv \sum_i C_{1i}^2$ and $D_2^{*} \equiv \left[\sum_i C_{1i}(C_{1i} + n_i - 1)/n_i\right]/C_1$ (see Equation A16 in File S1). Rearranging Equation 4 and Equation 5 and summing over alleles, we get:

$$Q_1 - Q_2 = \frac{(C_1 - D_2)\,E(\mathrm{SSP}) - (D_2 - D_2^{*})\,E(\mathrm{SSI})}{(C_1 - D_2)(C_1 - C_2/C_1)} \tag{6}$$

and

$$1 - Q_2 = \frac{(C_1 - D_2)\,E(\mathrm{SSP}) + (n_c - 1)(D_2 - D_2^{*})\,E(\mathrm{SSI})}{(C_1 - D_2)(C_1 - C_2/C_1)}, \tag{7}$$

where $n_c \equiv (C_1 - C_2/C_1)/(D_2 - D_2^{*})$. Let $\mathrm{MSI} \equiv \mathrm{SSI}/(C_1 - D_2)$ and $\mathrm{MSP} \equiv \mathrm{SSP}/(D_2 - D_2^{*})$. Then, using the definition of FST from Equation 1, we have:

$$F_{ST} \equiv \frac{Q_1 - Q_2}{1 - Q_2} = \frac{E(\mathrm{MSP}) - E(\mathrm{MSI})}{E(\mathrm{MSP}) + (n_c - 1)\,E(\mathrm{MSI})}, \tag{8}$$
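which yields the method-of-moments estimator

$$\hat{F}_{ST}^{\mathrm{pool}} = \frac{\mathrm{MSP} - \mathrm{MSI}}{\mathrm{MSP} + (n_c - 1)\,\mathrm{MSI}}, \tag{9}$$

where

$$\mathrm{MSI} = \frac{1}{C_1 - D_2} \sum_k \sum_i^{n_d} C_{1i}\,\hat{p}_{i:k}\left(1 - \hat{p}_{i:k}\right) \tag{10}$$

and

$$\mathrm{MSP} = \frac{1}{D_2 - D_2^{*}} \sum_k \sum_i^{n_d} C_{1i}\left(\hat{p}_{i:k} - \hat{p}_k\right)^2 \tag{11}$$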
(see Equations A25 and A26 in File S1). In Equation 10 and Equation 11, $\hat{p}_{i:k} \equiv X_{i\cdot\cdot:k}$ is the average frequency of reads of type $k$ within the $i$th pool, and $\hat{p}_k \equiv X_{\cdot\cdot\cdot:k}$ is the average frequency of reads of type $k$ in the full sample. Note that from the definition of $X_{\cdot\cdot\cdot:k}$, $\hat{p}_k \equiv \sum_i \sum_j \sum_r X_{ijr:k} / \sum_i \sum_j c_{ij} = \sum_i C_{1i}\,\hat{p}_{i:k} / \sum_i C_{1i}$ is the weighted average of the sample frequencies with weights equal to the pool coverage. This is equivalent to the weighted analysis-of-variance in Cockerham (1973) (see also Weir and Cockerham 1984; Weir 1996; Weir and Hill 2002; Rousset 2007; Weir and Goudet 2017). Finally, substituting Equation 10 and Equation 11 into Equation 9 gives the full expression of $\hat{F}_{ST}^{\mathrm{pool}}$ in terms of sample frequencies (Equation 12).

Table 1 Summary of main notations used
| Notation | Parameter definition |
|---|---|
| $X_{ijr:k}$ | Indicator variable: $X_{ijr:k} = 1$ if the $r$th read from the $j$th individual in the $i$th pool is of type $k$, and $X_{ijr:k} = 0$ otherwise |
| $r_{i:k} = \sum_j \sum_r X_{ijr:k}$ | Number of reads of type $k$ in the $i$th pool |
| $c_{ij}$ | Number of reads sequenced from individual $j$ in subpopulation $i$ (unobserved individual coverage) |
| $C_{1i} \equiv \sum_j c_{ij}$ | Total number of reads in the $i$th pool (pool coverage) |
| $C_1 \equiv \sum_i C_{1i}$ | Total number of reads in the full sample (total coverage) |
| $C_2 \equiv \sum_i C_{1i}^2$ | Sum over pools of the squared number of reads |
| $n_i$ | Total number of genes in the $i$th pool (haploid pool size) |
| $y_{i:k}$ | (Unobserved) number of genes of type $k$ in the $i$th pool |
| $p_k \equiv E(X_{ijr:k})$ | Expected frequency of reads of type $k$ in the full sample |
| $\hat{p}_{ij:k} \equiv X_{ij\cdot:k}$ | (Unobserved) average frequency of reads of type $k$ for individual $j$ in the $i$th pool |
| $\hat{p}_{i:k} \equiv X_{i\cdot\cdot:k}$ | Average frequency of reads of type $k$ in the $i$th pool |
| $\hat{p}_k \equiv X_{\cdot\cdot\cdot:k}$ | Average frequency of reads of type $k$ in the full sample |
| $Q_1$ (respectively $Q_2$) | IIS probability for two genes sampled within (respectively between) pools |
| $Q_1^r$ (respectively $Q_2^r$) | IIS probability for two reads sampled within (respectively between) pools |
| $\hat{Q}_1^{\mathrm{pool}}$ (respectively $\hat{Q}_2^{\mathrm{pool}}$) | Unbiased estimator of the IIS probability for genes sampled within (respectively between) pools |
If we take the limit case where each gene is sequenced exactly once, we recover the Ind-seq model: assuming $c_{ij} = 1$ for all $(i, j)$, then $C_1 = \sum_i^{n_d} n_i$, $C_2 = \sum_i^{n_d} n_i^2$, $D_2 = n_d$, and $D_2^{*} = 1$. Therefore, $n_c = (C_1 - C_2/C_1)/(n_d - 1)$, and Equation 9 reduces exactly to the estimator of FST for haploids: see Weir (1996), p. 182, and Rousset (2007), p. 977.
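As a worked illustration of this limit case (the numbers are chosen here for illustration and are not taken from the article), suppose that every gene is sequenced exactly once and that all $n_d$ samples have the same haploid size $n$. Then:

$$C_1 = n_d\,n, \qquad C_2 = n_d\,n^2, \qquad \frac{C_2}{C_1} = n, \qquad n_c = \frac{n_d\,n - n}{n_d - 1} = n.$$

For example, with $n_d = 8$ and $n = 100$, one gets $C_1 = 800$, $C_2 = 80{,}000$, and $n_c = (800 - 100)/7 = 100$, so that $n_c$ is simply the common sample size, as in a balanced analysis of variance.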
As in Reynolds et al. (1983), Weir and Cockerham (1984), Weir (1996), and Rousset (2007), a multilocus estimate is derived as the sum of locus-specific numerators over the sum of locus-specific denominators:

$$\hat{F}_{ST} = \frac{\sum_l \left(\mathrm{MSP}_l - \mathrm{MSI}_l\right)}{\sum_l \left[\mathrm{MSP}_l + (n_c - 1)\,\mathrm{MSI}_l\right]}, \tag{13}$$

where MSI and MSP are subscripted with $l$ to denote the $l$th locus. For Ind-seq data, Bhatia et al. (2013) refer to this multilocus estimate as a "ratio of averages" as opposed to an "average of ratios," which would consist of averaging single-locus FST over loci. This approach is justified in the appendix of Weir and Cockerham (1984) and in Bhatia et al. (2013), who analyzed both estimates by means of coalescent simulations. Note that Equation 13 assumes that the pool size is equal across loci. Also note that the construction of the estimator in Equation 13 is different from Weir and Cockerham's (1984). These authors defined their multilocus estimator as a ratio of sums of components of variance ($a$, $b$, and $c$ in their notation) over loci, which gives the same weight to all loci whatever the number of sampled genes at each locus. Equation 13 follows GENEPOP's rationale (Rousset 2008) instead, which gives more weight to loci that are more intensively covered.
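The estimator of Equations 9 to 11 and its multilocus form (Equation 13) can be sketched in a few lines of Python. This is a minimal illustration written from the equations above, not the poolfstat implementation; the function names `locus_terms` and `pool_fst`, the nested-list input layout, and the toy read counts are all assumptions made for the example. Here $n_c$ is recomputed for each locus from that locus's coverage.

```python
def locus_terms(read_counts, pool_sizes):
    """Return the numerator and denominator of Equation 9 for one locus.

    read_counts[i][k] is the number of reads of allele k in pool i;
    pool_sizes[i] is the haploid pool size n_i.
    """
    cov = [sum(counts) for counts in read_counts]              # C_1i
    c1 = sum(cov)                                              # C_1
    c2 = sum(c * c for c in cov)                               # C_2
    d2 = sum((c + n - 1) / n for c, n in zip(cov, pool_sizes))
    d2_star = sum(c * (c + n - 1) / n
                  for c, n in zip(cov, pool_sizes)) / c1
    n_c = (c1 - c2 / c1) / (d2 - d2_star)

    ssi = ssp = 0.0
    for k in range(len(read_counts[0])):
        # Coverage-weighted read frequency of allele k in the full sample
        p_k = sum(counts[k] for counts in read_counts) / c1
        for counts, c in zip(read_counts, cov):
            p_ik = counts[k] / c
            ssi += c * p_ik * (1.0 - p_ik)
            ssp += c * (p_ik - p_k) ** 2
    msi = ssi / (c1 - d2)            # Equation 10
    msp = ssp / (d2 - d2_star)       # Equation 11
    return msp - msi, msp + (n_c - 1.0) * msi


def pool_fst(loci, pool_sizes):
    """Multilocus estimate (Equation 13): ratio of sums over loci."""
    num = den = 0.0
    for read_counts in loci:
        a, b = locus_terms(read_counts, pool_sizes)
        num += a
        den += b
    return num / den


# Toy data (hypothetical): two pools of 10 haploids, 20 reads each.
fixed = [[20, 0], [0, 20]]      # pools fixed for different alleles
shared = [[10, 10], [10, 10]]   # identical read frequencies in both pools
sizes = [10, 10]
```

With these toy data, `pool_fst([fixed], sizes)` gives an estimate of 1, whereas `pool_fst([shared], sizes)` is slightly negative, which can happen with method-of-moments estimators when the between-pool mean square falls below the within-pool mean square.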
Materials and Methods
Simulation study
Generating individual genotypes: We first generated individual genotypes using ms (Hudson 2002), assuming an island model of population structure (Wright 1931). For each simulated scenario, we considered eight demes, each made of $N = 5000$ haploid individuals. The migration rate ($m$) was fixed to achieve the desired value of FST (0.05 or 0.2), using equation 6 in Rousset (1996) leading to, e.g., $M \equiv 2Nm = 16.569$ for $F_{ST} = 0.05$ and $M = 3.489$ for $F_{ST} = 0.20$. The mutation rate was set at $\mu = 10^{-6}$, giving $\theta \equiv 2N\mu = 0.01$. We considered either fixed or variable sample sizes across demes. In the latter case, the haploid sample size $n$ was drawn independently for each deme from a Gaussian distribution with mean 100 and SD 30; this number was rounded up to the nearest integer, with a minimum of 20 and maximum of 300 haploids per deme. We generated a very large number of sequences for each scenario and sampled independent single nucleotide polymorphisms (SNPs) from sequences with a single segregating site. Each scenario was replicated 50 times (500 times for Figure 3 and Figure S2).
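The rule for variable sample sizes can be written out as follows. This is only a sketch of the procedure as described in the text (Gaussian draw, rounding up, then bounding); the function name `draw_sample_size` and the seed are hypothetical, and the authors' own simulation code may differ in detail.

```python
import math
import random


def draw_sample_size(rng, mean=100.0, sd=30.0, minimum=20, maximum=300):
    """Gaussian draw, rounded up, then clamped to [minimum, maximum]."""
    n = math.ceil(rng.gauss(mean, sd))
    return max(minimum, min(maximum, n))


rng = random.Random(2018)   # arbitrary seed, for reproducibility only
sample_sizes = [draw_sample_size(rng) for _ in range(8)]   # one per deme
```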
Pool sequencing: For each ms simulated data set, we generated Pool-seq data by drawing reads from a binomial distribution (Gautier et al. 2013). More precisely, we assume that for each SNP, the number $r_{i:k}$ of reads of allelic type $k$ in pool $i$ follows:

$$r_{i:k} \sim \mathrm{Bin}\left(\frac{y_{i:k}}{n_i},\, d_i\right), \tag{14}$$

where $y_{i:k}$ is the number of genes of type $k$ in the $i$th pool, $n_i$ is the total number of genes in pool $i$ (haploid pool size), and $d_i$ is the simulated total coverage for pool $i$. In the following, we either consider a fixed coverage, with $d_i = D$ for all pools and loci, or a varying coverage across pools and loci, with $d_i \sim \mathrm{Pois}(D)$.
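The read-sampling step of Equation 14 can be sketched with the Python standard library alone. The helper names (`poisson`, `binomial`, `simulate_pool_reads`) are hypothetical, the Poisson sampler is Knuth's simple multiplication method (adequate for coverages of a few hundred, not a general-purpose sampler), and the sketch handles one biallelic SNP at a time.

```python
import math
import random


def poisson(rng, lam):
    """Poisson draw by Knuth's multiplication method (moderate lam only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def binomial(rng, n_trials, prob):
    """Binomial draw as a sum of independent Bernoulli trials."""
    return sum(rng.random() < prob for _ in range(n_trials))


def simulate_pool_reads(rng, allele_counts, pool_sizes, mean_cov, variable=False):
    """Draw (reads of the focal allele, other reads) for each pool.

    allele_counts[i] is the number of genes of the focal allele in pool i,
    pool_sizes[i] the haploid pool size, and mean_cov the target coverage.
    """
    reads = []
    for y, n in zip(allele_counts, pool_sizes):
        d = poisson(rng, mean_cov) if variable else mean_cov
        r = binomial(rng, d, y / n)
        reads.append((r, d - r))
    return reads
```

For instance, `simulate_pool_reads(random.Random(0), [0, 50, 100], [100, 100, 100], 50)` returns one pair of read counts per pool, each pair summing to 50 under fixed coverage.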
Sequencing error: We simulated sequencing errors occurring at rate $\mu_e = 0.001$, which is typical of Illumina sequencers (Glenn 2011; Ross et al. 2013). We assumed that each sequencing error modifies the allelic type of a read to one of three other possible states with equal probability (there are therefore four allelic types in total, corresponding to four nucleotides). Note that only biallelic markers are retained in the final data sets. Also note that, since we initiated this procedure with polymorphic markers only, we neglect sequencing errors that would create spurious SNPs from monomorphic sites. However, such SNPs should be rare in real data sets, since markers with a low minimum read count (MRC) are generally filtered out.
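The error model described above can be illustrated as follows. This is a sketch under the stated assumptions (each read is miscalled independently at the given rate, and an error picks one of the three other nucleotides uniformly); the function name `add_sequencing_errors` and the representation of reads as a list of single-letter strings are choices made for the example.

```python
import random

NUCLEOTIDES = "ACGT"


def add_sequencing_errors(rng, reads, error_rate=0.001):
    """Return a copy of `reads` in which each base is miscalled with
    probability `error_rate`, to one of the three other nucleotides."""
    observed = []
    for base in reads:
        if rng.random() < error_rate:
            base = rng.choice([b for b in NUCLEOTIDES if b != base])
        observed.append(base)
    return observed
```

Sites that end up with more than two observed nucleotides after this step would then be discarded, since only biallelic markers are retained.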
Equation 12 gives the full expression of $\hat{F}_{ST}^{\mathrm{pool}}$ in terms of sample frequencies:

$$\hat{F}_{ST}^{\mathrm{pool}} = \frac{\sum_k \left[ (C_1 - D_2) \sum_i^{n_d} C_{1i} \left(\hat{p}_{i:k} - \hat{p}_k\right)^2 - (D_2 - D_2^{*}) \sum_i^{n_d} C_{1i}\,\hat{p}_{i:k}\left(1 - \hat{p}_{i:k}\right) \right]}{\sum_k \left[ (C_1 - D_2) \sum_i^{n_d} C_{1i} \left(\hat{p}_{i:k} - \hat{p}_k\right)^2 + (n_c - 1)(D_2 - D_2^{*}) \sum_i^{n_d} C_{1i}\,\hat{p}_{i:k}\left(1 - \hat{p}_{i:k}\right) \right]}. \tag{12}$$

Experimental error: Nonequimolar amounts of DNA from all individuals in a pool and stochastic variation in the amplification efficiency of individual DNAs are sources of experimental errors in Pool-seq. To simulate experimental errors, we used the model derived by Gautier et al. (2013). In this model, it is assumed that the contribution $h_{ij} = c_{ij}/C_{1i}$ of each gene $j$ to the total coverage of the $i$th pool ($C_{1i}$) follows a Dirichlet distribution:

$$\{h_{ij}\}_{1 \le j \le n_i} \sim \mathrm{Dir}\left(\frac{\rho}{n_i}\right), \tag{15}$$

where the parameter $\rho$ controls the dispersion of gene contributions around the value $h_{ij} = 1/n_i$, which is expected if all genes contributed equally to the pool of reads. For convenience, we define the experimental error $e$ as the coefficient of variation of $h_{ij}$, i.e., $e \equiv \sqrt{V(h_{ij})}/E(h_{ij}) = \sqrt{(n_i - 1)/(\rho + 1)}$ (see Gautier et al. 2013). When $e$ tends toward 0 (or equivalently, when $\rho$ tends to infinity), all individuals contribute equally to the pool and there is no experimental error. We tested the robustness of our estimates to values of $e$ between 0.05 and 0.5. The case $e = 0.5$ could correspond, for example, to a situation where (for $n_i = 10$) five individuals contribute 2.8× more reads than the other five individuals.
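The relation between the experimental error and the Dirichlet dispersion parameter can be inverted to choose the parameter for a target error, and the contributions can then be drawn as normalized gamma variates, a standard way of sampling a Dirichlet vector. The sketch below assumes the symmetric Dirichlet of Equation 15; the function names `rho_from_error` and `draw_contributions`, and the variable name `rho` for the dispersion parameter, are hypothetical.

```python
import random


def rho_from_error(e, n):
    """Invert e = sqrt((n - 1) / (rho + 1)) to obtain rho."""
    return (n - 1) / e ** 2 - 1


def draw_contributions(rng, n, rho):
    """Draw (h_1, ..., h_n) from a symmetric Dirichlet with parameters
    rho / n, using normalized gamma variates."""
    gammas = [rng.gammavariate(rho / n, 1.0) for _ in range(n)]
    total = sum(gammas)
    return [g / total for g in gammas]


rho = rho_from_error(0.5, 10)                      # experimental error 0.5, n_i = 10
contributions = draw_contributions(random.Random(7), 10, rho)
```

For an experimental error of 0.5 and a pool of 10 genes, the inversion gives a dispersion parameter of 35; the drawn contributions sum to 1 and would then serve as multinomial probabilities for assigning reads to genes.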
Other estimators
For the sake of clarity, a summary of the notation of the FST estimators used throughout this article is given in Table 2.
PP2d: This estimator of FST is implemented by default in the software package PoPoolation2 (Kofler et al. 2011). It is based on a definition of the parameter FST as the overall reduction in average heterozygosity relative to the total combined population (see, e.g., Nei and Chesser 1983):

$$\mathrm{PP2}_d \equiv \frac{\hat{H}_T - \hat{H}_S}{\hat{H}_T}, \tag{16}$$

where $\hat{H}_S$ is the average heterozygosity within subpopulations, and $\hat{H}_T$ is the average heterozygosity in the total population (obtained by pooling together all subpopulations to form a single virtual unit). In PoPoolation2, $\hat{H}_S$ is the unweighted average of within-subpopulation heterozygosities:

$$\hat{H}_S = \frac{1}{n_d} \sum_i^{n_d} \left(\frac{n_i}{n_i - 1}\right) \left(\frac{C_{1i}}{C_{1i} - 1}\right) \left(1 - \sum_k \hat{p}_{i:k}^2\right) \tag{17}$$

(using the notation from Table 1). Note that in PoPoolation2, PP2d is restricted to the case of two subpopulations only ($n_d = 2$). The two ratios in the right-hand side of Equation 17 are presumably borrowed from Nei (1978) to provide an unbiased estimate, although we found no formal justification for the expression in Equation 17 for Pool-seq data. The total heterozygosity is computed as (using the notation from Table 1):

$$\hat{H}_T = \left(\frac{\min_i(n_i)}{\min_i(n_i) - 1}\right) \left(\frac{\min_i(C_{1i})}{\min_i(C_{1i}) - 1}\right) \left(1 - \sum_k \hat{p}_k^2\right). \tag{18}$$
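Equations 16 to 18 translate directly into code. The sketch below is a transcription of those three equations as written in this article, not a port of the PoPoolation2 source; the function name `pp2d` and the input layout (one list of per-allele read counts per pool) are assumptions made for the example. It requires every pool to have a size and a coverage greater than 1, and a polymorphic site overall.

```python
def pp2d(read_counts, pool_sizes):
    """PP2d as defined by Equations 16-18.

    read_counts[i][k] is the number of reads of allele k in pool i;
    pool_sizes[i] is the haploid pool size n_i.
    """
    cov = [sum(counts) for counts in read_counts]      # C_1i
    total = sum(cov)                                   # C_1

    # Equation 17: unweighted mean of corrected within-pool heterozygosities
    h_s = 0.0
    for counts, c, n in zip(read_counts, cov, pool_sizes):
        homozygosity = sum((x / c) ** 2 for x in counts)
        h_s += (n / (n - 1)) * (c / (c - 1)) * (1.0 - homozygosity)
    h_s /= len(read_counts)

    # Equation 18: total heterozygosity from coverage-weighted frequencies
    n_alleles = len(read_counts[0])
    p = [sum(counts[k] for counts in read_counts) / total
         for k in range(n_alleles)]
    n_min, c_min = min(pool_sizes), min(cov)
    h_t = ((n_min / (n_min - 1)) * (c_min / (c_min - 1))
           * (1.0 - sum(x * x for x in p)))

    return (h_t - h_s) / h_t                           # Equation 16
```

Note that the correction factors in Equation 18 use the smallest pool size and the smallest coverage, so the value returned depends on both, consistent with the dependence of the bias on sample size and coverage discussed in the article.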
PP2a: This is the alternative estimator of FST provided in the software package PoPoolation2. It is based on an interpretation by Kofler et al. (2011) of Karlsson et al.'s (2007) estimator of FST, as:

$$\mathrm{PP2}_a \equiv \frac{\hat{Q}_1^r - \hat{Q}_2^r}{1 - \hat{Q}_2^r}, \tag{19}$$

where $\hat{Q}_1^r$ and $\hat{Q}_2^r$ are the frequencies of identical pairs of reads within and between pools, respectively, computed by simple counting of IIS pairs. These are estimates of $Q_1^r$, the IIS probability for two reads in the same pool (whether they are sequenced from the same gene or not), and $Q_2^r$, the IIS probability for two reads in different pools. Note that the IIS probability $Q_1^r$ is different from $Q_1$ in Equation 1, which, from our definition, represents the IIS probability between distinct genes in the same pool. This approach therefore confounds pairs of reads within pools that are identical because they were sequenced from a single gene with pairs of reads that are identical because they were sequenced from distinct, yet IIS genes.
FRP13: This estimator of FST was developed by Ferretti et al. (2013) (see their equations 3, 10, 11, 12, and 13). Ferretti et al. (2013) use the same definition of FST as in Equation 16 above, although they estimate heterozygosities within and between pools as average pairwise nucleotide diversities, which, from their definitions, are formally equivalent to IIS probabilities. In particular, they estimate the average heterozygosity within pools as (using the notation from Table 1):

$$\hat{H}_S = \frac{1}{n_d} \sum_i^{n_d} \left(\frac{n_i}{n_i - 1}\right) \left(1 - \hat{Q}_{1i}^r\right) \tag{20}$$

and the total heterozygosity among the $n_d$ populations as:

$$\hat{H}_T = \frac{1}{n_d^2} \left[ \sum_i^{n_d} \left(\frac{n_i}{n_i - 1}\right) \left(1 - \hat{Q}_{1i}^r\right) + \sum_{i \neq i'}^{n_d} \left(1 - \hat{Q}_{2ii'}^r\right) \right]. \tag{21}$$
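The read-based IIS frequencies used by PP2a (Equation 19) and FRP13 (Equations 16, 20, and 21) can be obtained by counting identical read pairs. The sketch below makes several assumptions that the article does not spell out: within-pool pairs are counted without replacement, the overall within-pool and between-pool frequencies in `pp2a` are unweighted means over pools and over pairs of pools, and the sum over $i \neq i'$ in Equation 21 runs over ordered pairs. All function names are hypothetical, and the original software may differ in these details.

```python
from itertools import combinations


def iis_within(counts):
    """Fraction of identical pairs among all pairs of distinct reads in a pool."""
    c = sum(counts)
    return sum(x * (x - 1) for x in counts) / (c * (c - 1))


def iis_between(counts_a, counts_b):
    """Fraction of identical pairs with one read taken from each pool."""
    ca, cb = sum(counts_a), sum(counts_b)
    return sum(x * y for x, y in zip(counts_a, counts_b)) / (ca * cb)


def pp2a(read_counts):
    """Equation 19, with unweighted means over pools and pairs of pools."""
    q1 = sum(iis_within(c) for c in read_counts) / len(read_counts)
    pairs = list(combinations(read_counts, 2))
    q2 = sum(iis_between(a, b) for a, b in pairs) / len(pairs)
    return (q1 - q2) / (1.0 - q2)


def frp13(read_counts, pool_sizes):
    """Equation 16 with the heterozygosities of Equations 20 and 21."""
    n_d = len(read_counts)
    within = sum(n / (n - 1) * (1.0 - iis_within(c))
                 for c, n in zip(read_counts, pool_sizes))
    between = sum(1.0 - iis_between(read_counts[i], read_counts[j])
                  for i in range(n_d) for j in range(n_d) if i != j)
    h_s = within / n_d                      # Equation 20
    h_t = (within + between) / n_d ** 2     # Equation 21
    return (h_t - h_s) / h_t
```

Because `iis_within` counts pairs of reads rather than pairs of genes, two reads copied from the same gene are indistinguishable from two reads copied from distinct IIS genes, which is the confounding described in the text for PP2a.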
Analyses of Ind-seq data
For the comparison of Ind-seq and Pool-seq data sets, we computed FST on subsamples of 5000 loci. These subsamples were defined so that only those loci that were polymorphic in all coverage conditions were retained, and the same loci were used for the analysis of the corresponding Ind-seq data. For the latter, we used either Nei and Chesser's (1983) estimator based on a ratio of heterozygosity (see Equation 16 above), hereafter denoted by NC83, or the analysis-of-variance estimator developed by Weir and Cockerham (1984), hereafter denoted by WC84.
All the estimators were computed using custom functions in the R software environment for statistical computing, version 3.3.1 (R Core Team 2017). All of these functions were carefully checked against available software packages to ensure that they provided strictly identical estimates.

Table 2 Definition of the FST estimators used in the text
| Notation | Definition |
|---|---|
| $\hat{F}_{ST}^{\mathrm{pool}}$ | Equation 12 |
| FRP13 | Ferretti et al. (2013) and Equation 16, Equation 20, and Equation 21 |
| NC83 | Nei and Chesser (1983) |
| PP2d | Kofler et al. (2011) and Equation 16, Equation 17, and Equation 18 |
| PP2a | Kofler et al. (2011) and Equation 19 |
| WC84 | Weir and Cockerham (1984) |
Application example: C. asper
Dennenmoser et al. (2017) investigated the genomic basis of adaptation to osmotic conditions in the prickly sculpin (C. asper), an abundant euryhaline fish in northwestern North America. To do so, they sequenced the whole genome of pools of individuals from two estuarine populations (Capilano River Estuary, CR; Fraser River Estuary, FE) and two freshwater populations (Pitt Lake, PI; Hatzic Lake, HZ) in southern British Columbia (Canada). We downloaded the four corresponding BAM files from the Dryad Digital Repository (http://dx.doi.org/10.5061/dryad.2qg01) and combined them into a single mpileup file using SAMtools version 0.1.19 (Li et al. 2009) with default options, except the maximum depth per BAM that was set to 5000 reads. The resulting file was further processed using a custom awk script to call SNPs and compute read counts, after discarding bases with a base alignment quality (BAQ) score <25. A position was then considered a SNP if: (1) only two different nucleotides with a read count >1 were observed (nucleotides with ≤1 read being considered as a sequencing error); (2) the coverage was between 10 and 300 in each of the four alignment files; (3) the minor allele frequency, as computed from read counts, was ≥0.01 in the four populations. The final data set consisted of 608,879 SNPs.
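The three SNP criteria can be expressed as a small filter on per-pool nucleotide counts. The authors used a custom awk script that is not shown in the article, so the Python sketch below is only one possible reading of the criteria: it assumes that the read-count threshold in criterion 1 and the minor allele frequency in criterion 3 are evaluated on read counts summed over the four pools, and that coverage in criterion 2 counts reads of the two retained nucleotides only. The function name `is_snp` and the dictionary input are hypothetical.

```python
def is_snp(pool_counts, min_cov=10, max_cov=300, min_maf=0.01):
    """pool_counts: one dict {nucleotide: read count} per pool."""
    # Read counts summed over pools (assumption: criteria 1 and 3 use these)
    totals = {}
    for counts in pool_counts:
        for base, x in counts.items():
            totals[base] = totals.get(base, 0) + x

    # Criterion 1: exactly two nucleotides with more than one read;
    # nucleotides with a single read are treated as sequencing errors.
    alleles = [base for base, x in totals.items() if x > 1]
    if len(alleles) != 2:
        return False

    # Criterion 2: coverage within bounds in every pool
    for counts in pool_counts:
        cov = sum(counts.get(base, 0) for base in alleles)
        if not (min_cov <= cov <= max_cov):
            return False

    # Criterion 3: minor allele frequency computed from read counts
    total = sum(totals[base] for base in alleles)
    maf = min(totals[base] for base in alleles) / total
    return maf >= min_maf
```

A stricter reading of criterion 3, requiring the frequency threshold to hold within each pool separately, would only need the last three lines to be moved inside the loop over pools.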
Our aim here was to compare the population structure inferred from pairwise estimates of FST using the estimator $\hat{F}_{ST}^{\mathrm{pool}}$ (Equation 12) with that of PP2d. To determine which of the two estimators performs better, we then compared the population structure inferred from $\hat{F}_{ST}^{\mathrm{pool}}$ and PP2d to that inferred from the Bayesian hierarchical model implemented in the software package BayPass (Gautier 2015). BayPass allows the robust estimation of the scaled covariance matrix of allele frequencies across populations for Pool-seq data, which is known to be informative about population history (Pickrell and Pritchard 2012). The elements of the estimated matrix can be interpreted as pairwise and population-specific estimates of differentiation (Coop et al. 2010) and therefore provide a comprehensive description of population structure that makes full use of the available data.
Data availability
An R package called poolfstat, which implements FST estimates for Pool-seq data, is available at the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/poolfstat/index.html.
The authors state that all data necessary for confirming the conclusions presented in this article are fully represented within the article, figures, and tables. Supplemental material (including Figures S1–S4, Tables S1–S3, and a complete derivation of the model in File S1) available at Figshare: https://doi.org/10.25386/genetics.6856781.
Results
Comparing Ind-seq and Pool-seq estimates of FST
Single-locus estimates of $\hat{F}_{ST}^{\mathrm{pool}}$ are highly correlated with the classical estimates of WC84 (Weir and Cockerham 1984) computed on the individual data that were used to generate the pools in our simulations (see Figure 1). The variance of $\hat{F}_{ST}^{\mathrm{pool}}$ across independent replicates decreases as the coverage increases. The correlation between $\hat{F}_{ST}^{\mathrm{pool}}$ and WC84 is stronger for multilocus estimates (see Figure S1A).
Comparing Pool-seq estimators of FST
We found that our estimator $\hat{F}_{ST}^{\mathrm{pool}}$ has extremely low bias (<0.5% over all scenarios tested: see Table 3 and Tables S1–S3). In other words, the average estimates across multiple loci and replicates closely equal the expected value of the FST parameter, as given by equation 6 in Rousset (1996), which is based on the computation of IIS probabilities in an island model of population structure. In all the situations examined, the bias does not depend on the sample size (i.e., the size of each pool) or on the coverage (see Figure 2). Only the variance of the estimator across independent replicates decreases as the sample size increases and/or as the coverage increases. At high coverage, the mean and root mean squared error (RMSE) of $\hat{F}_{ST}^{\mathrm{pool}}$ over independent replicates are virtually indistinguishable from those of the WC84 estimator (see Table S1).
Figure 1 Single-locus estimates of FST. We compared single-locus estimates of FST based on allele count data inferred from individual genotypes (Ind-seq), using the WC84 estimator, to $\hat{F}_{ST}^{\mathrm{pool}}$ estimates from Pool-seq data. We simulated 5000 SNPs using ms in an island model with $n_d = 8$ demes. We used two migration rates corresponding to (A) $F_{ST} = 0.05$ and (B) $F_{ST} = 0.20$. The size of each pool was fixed to 100. We show the results for different coverages (20×, 50×, and 100×). In each graph, the cross indicates the simulated value of FST.
Figure 3 shows the RMSE of FST estimates for a wide range of pool sizes and coverages. The RMSE decreases as the pool size and/or the coverage increases. The FST estimates are more precise and accurate when differentiation is low. Figure 3 provides some clues to evaluate the pool size and the coverage that are necessary to achieve the same RMSE as for Ind-seq data. Consider, for example, the case of samples of $n = 20$ haploids. For $F_{ST} \le 0.05$ (in the conditions of our simulations), the RMSE of FST estimates based on Pool-seq data tends to the RMSE of FST estimates based on Ind-seq data either by sequencing pools of 200 haploids at 20×, or by sequencing pools of 20 haploids at 200×. However, the same precision and accuracy are achieved by sequencing 50 haploids at 50×.
Conversely, we found that PP2d (the default estimator of FST implemented in the software package PoPoolation2) is biased when compared to the expected value of the parameter. We observed that the bias depends on both the sample size and the coverage (see Figure 2). We note that, as the coverage and the sample size increase, PP2d converges to the estimator NC83 (Nei and Chesser 1983) computed from individual data (see Figure S1B). This argument was used by Kofler et al. (2011) to validate their approach, even though the estimates of PP2d depart from the true value of the parameter (Figure S1, B and C).
The second of the two estimators of FST implemented in PoPoolation2, which we refer to as PP2a, is also biased (see Figure 2). We note that the bias decreases as the sample size increases. However, the bias does not depend on the coverage (only the variance over independent replicates depends on coverage). The estimator developed by Ferretti et al. (2013), which we refer to as FRP13, is also biased (see Figure 2). However, the bias does not depend on the pool size or on the coverage (only the variance over independent replicates depends on coverage). FRP13 converges to the estimator NC83, computed from individual data (see Figure 2). At high coverage, the mean and RMSE over independent replicates are virtually indistinguishable from those of the NC83 estimator.
Lastly, we stress that our estimator $\hat{F}_{ST}^{\mathrm{pool}}$ provides estimates for multiple populations and is therefore not restricted to pairwise analyses, contrary to PoPoolation2's estimators. We show that, even at low sample size and low coverage, Pool-seq estimates of differentiation are virtually indistinguishable from classical estimates for Ind-seq data (see Table 3).
Robustness to unbalanced pool sizes and variable
sequencing coverage
We evaluated the accuracy and the precision of the estimator
$\hat{F}_{ST}^{\mathrm{pool}}$ when sample sizes differ across pools and when the
coverage varies across pools and loci (see Figure 4). We
found that, at low coverage, unequal sampling or variable
coverage causes a negligible departure from the median of
WC84 estimates computed on individual data, which vanishes
as the coverage increases. At 100× coverage, the distribution
of $\hat{F}_{ST}^{\mathrm{pool}}$ estimates is almost indistinguishable from that of
WC84 (see Figure 4 and Tables S2 and S3).
Robustness to sequencing and experimental errors
Figure 5 shows that sequencing errors cause a negligible negative
bias for $\hat{F}_{ST}^{\mathrm{pool}}$ estimates. Filtering (using an MRC of 4)
improves estimation slightly, but only at high coverage (Figure 6B).
It must be noted, however, that filtering increases
the bias in the absence of sequencing error, especially at low
coverage (Figure 6A). With experimental error, i.e., when
individuals do not contribute evenly to the final set of reads,
we observed a positive bias for $\hat{F}_{ST}^{\mathrm{pool}}$ estimates (Figure 5). We
note that the bias decreases as the size of the pools increases.
Figure S2 shows the RMSE of FST estimates for a wider range
of pool sizes, coverage, and experimental error rate (ε). For
ε ≥ 0.25, increasing the coverage cannot improve the quality
of the inference if the pool size is too small. When Pool-seq
experiments are prone to large experimental error rates, increasing
the size of pools is the only way to improve the
estimation of FST. Filtering (using an MRC of 4) does not
improve estimation (Figure 6C).
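
The filtering step discussed above can be sketched in a few lines of Python. This is only an illustration: it assumes that the minimum read count (MRC) is applied to the minor-allele read count summed over all pools, and the names `passes_mrc` and `filter_loci` are hypothetical, not taken from any software package.

```python
from typing import List, Sequence, Tuple

# One biallelic locus: (reference-allele reads per pool, alternate-allele reads per pool)
Locus = Tuple[Sequence[int], Sequence[int]]


def passes_mrc(ref_reads: Sequence[int], alt_reads: Sequence[int], mrc: int = 4) -> bool:
    """Return True if the minor allele has at least `mrc` reads summed over pools.

    Assumption: the MRC threshold is applied to the pooled minor-allele count.
    """
    minor_count = min(sum(ref_reads), sum(alt_reads))
    return minor_count >= mrc


def filter_loci(loci: List[Locus], mrc: int = 4) -> List[Locus]:
    """Keep only the loci that pass the MRC threshold."""
    return [locus for locus in loci if passes_mrc(locus[0], locus[1], mrc)]


example_loci = [
    ([18, 20], [2, 0]),   # minor allele seen twice: removed with MRC = 4
    ([10, 15], [10, 5]),  # minor allele seen 15 times: kept
    ([20, 19], [0, 1]),   # singleton read, possibly a sequencing error: removed
]
kept = filter_loci(example_loci, mrc=4)
```

Such a filter cannot tell a sequencing error from a true low-frequency variant. This is consistent with the observation that filtering increases the bias when there is no sequencing error, since only real polymorphisms are then discarded.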
Application example
The reanalysis of the prickly sculpin data revealed larger
pairwise estimates of multilocus FST using the PP2d estimator,
as compared to $\hat{F}_{ST}^{\mathrm{pool}}$ (see Figure 7A). Furthermore, we found
that $\hat{F}_{ST}^{\mathrm{pool}}$ estimates are smaller for within-ecotype pairwise
comparisons as compared to between-ecotype comparisons.
Therefore, the inferred relationships between samples
based on pairwise $\hat{F}_{ST}^{\mathrm{pool}}$ estimates show a clear-cut structure,
separating the two estuarine samples from the freshwater
ones (see Figure 7C). We did not recover the same structure
using PP2d estimates (see Figure 7B). Additionally, the
scaled covariance matrix of allele frequencies across samples
is consistent with the structure inferred from $\hat{F}_{ST}^{\mathrm{pool}}$ estimates
(see Figure 7D).

Table 3 Overall FST estimates from multiple pools

                         Pool-seq                          Ind-seq
FST    n     Coverage    $\hat{F}_{ST}^{\mathrm{pool}}$    WC84
0.05   10    20×         0.050 (0.002)
0.05   10    50×         0.051 (0.002)                     0.050 (0.002)
0.05   10    100×        0.050 (0.002)
0.05   100   20×         0.050 (0.001)
0.05   100   50×         0.050 (0.001)                     0.051 (0.001)
0.05   100   100×        0.050 (0.001)
0.20   10    20×         0.200 (0.002)
0.20   10    50×         0.201 (0.002)                     0.201 (0.002)
0.20   10    100×        0.201 (0.002)
0.20   100   20×         0.201 (0.003)
0.20   100   50×         0.202 (0.003)                     0.203 (0.003)
0.20   100   100×        0.203 (0.003)

Multilocus $\hat{F}_{ST}^{\mathrm{pool}}$ estimates were computed for various conditions of expected FST,
pool size (n), and coverage in an island model with $n_d = 8$ subpopulations (pools).
The mean (RMSE) is over 50 independent simulated data sets, each made of
5000 loci. For comparison, we computed multilocus WC84 estimates from individual
genotypes (Ind-seq).
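
The summary statistics reported in Table 3, the mean and the RMSE over replicates, can be computed as in the following minimal Python sketch. The function names and the five example values are hypothetical; they are not taken from the simulations.

```python
import math
from statistics import mean
from typing import Sequence


def bias(estimates: Sequence[float], true_value: float) -> float:
    """Mean departure of the estimates from the simulated (true) value."""
    return mean(estimates) - true_value


def rmse(estimates: Sequence[float], true_value: float) -> float:
    """Root mean squared error of the estimates around the true value."""
    return math.sqrt(mean([(x - true_value) ** 2 for x in estimates]))


# Made-up multilocus estimates from five replicates, simulated FST = 0.05
replicate_estimates = [0.048, 0.052, 0.050, 0.051, 0.049]
example_bias = bias(replicate_estimates, 0.05)
example_rmse = rmse(replicate_estimates, 0.05)
```

The RMSE combines bias and variance (squared RMSE equals squared bias plus the variance over replicates), which is why an estimator can have a small variance and still a large RMSE when it is biased.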
Discussion
Whole-genome sequencing of pools of individuals is increasingly
popular for population genomic research on both
model and nonmodel species (Schlötterer et al. 2014). The
development of dedicated software packages (reviewed in
Schlötterer et al. 2014) undoubtedly has something to do
with the breadth of research questions that have been tackled
using Pool-seq. However, the analysis of population structure
from Pool-seq data is complicated by the double sampling
process of genes from the pool and sequence reads from those
genes (Ferretti et al. 2013).

Figure 2 Precision and accuracy of pairwise estimators of FST. We considered two estimators based on allele count data inferred from individual
genotypes (Ind-seq): WC84 and NC83. For Pool-seq data, we computed the two estimators implemented in the software package PoPoolation2, which
we refer to as PP2d and PP2a, as well as the FRP13 estimator and our estimator $\hat{F}_{ST}^{\mathrm{pool}}$. Each boxplot represents the distribution of multilocus FST estimates
across all pairwise comparisons in an island model with $n_d = 8$ demes and across 50 independent replicates of the ms simulations. We used two
migration rates, corresponding to (A and B) FST = 0.05 and (C and D) FST = 0.20. The size of each pool was either fixed to (A and C) 10 or to (B and D)
100. For Pool-seq data, we show the results for different coverages (20×, 50×, and 100×). In each graph, the dashed line indicates the simulated value
of FST and the dotted line indicates the median of the distribution of NC83 estimates.

Figure 3 (A–F) Precision and accuracy of our estimator $\hat{F}_{ST}^{\mathrm{pool}}$ as a function of pool size and coverage for simulated FST values ranging from 0.005 to 0.2.
Each density plot, which represents the RMSE of the estimator $\hat{F}_{ST}^{\mathrm{pool}}$, was obtained using simple linear interpolation from a set of 44 × 44 pairs of pool
size and coverage values. For each pool size and coverage, 500 replicates of 5000 markers were simulated from an island model with $n_d = 8$ demes.
White isolines represent the RMSE of the WC84 estimator computed from Ind-seq data for various sample sizes (n = 5, 10, 20, and 50). Each isoline was
fitted using a thin plate spline regression with smoothing parameter λ = 0.005, implemented in the fields package for R (Nychka et al. 2017).
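
The double sampling process can be illustrated with a short Python sketch. It rests on simplifying assumptions that are ours: n genes are drawn binomially from a population with allele frequency p, and c reads are then drawn binomially from the pool. Under these assumptions the variance of the read-based allele frequency is p(1 - p)[1/n + (1 - 1/n)/c]. The function names are hypothetical, and the "effective sample size" below concerns a single allele-frequency estimate, not FST itself.

```python
import random


def read_frequency_variance(p: float, n: int, c: int) -> float:
    """Variance of the allele frequency among c reads drawn from a pool of n genes."""
    return p * (1.0 - p) * (1.0 / n + (1.0 - 1.0 / n) / c)


def effective_sample_size(n: int, c: int) -> float:
    """Number of directly observed genes giving the same sampling variance."""
    return n * c / (n + c - 1.0)


def simulate_read_frequency(p: float, n: int, c: int, rng: random.Random) -> float:
    """One realization of the two sampling steps (genes, then reads)."""
    genes = sum(rng.random() < p for _ in range(n))      # genes carrying the allele
    pool_freq = genes / n                                 # allele frequency in the pool
    reads = sum(rng.random() < pool_freq for _ in range(c))
    return reads / c


# Compare the simulated variance with the closed-form expression
rng = random.Random(1)
draws = [simulate_read_frequency(0.5, 10, 20, rng) for _ in range(2000)]
mean_draw = sum(draws) / len(draws)
empirical_variance = sum((x - mean_draw) ** 2 for x in draws) / (len(draws) - 1)
expected_variance = read_frequency_variance(0.5, 10, 20)
```

Under this simple model the effective sample size is symmetric in pool size and coverage, so 200 haploids at 20× and 20 haploids at 200× are equivalent (about 18 genes), whereas 50 haploids at 50× yields a larger value (about 25 genes) for a smaller total number of reads.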
The naive approach that consists of computing FST from
read counts as if they were allele counts (e.g., as in Chen et al.
2016) ignores the extra variance brought by the random
sampling of reads from the gene pool during Pool-seq experiments.
Furthermore, such computation fails to consider the
actual number of lineages in the pool (haploid pool size).
Altogether, these limits may result in severely biased estimates
of differentiation when the pool size is low (see Figure
S3). A possible alternative is to compute FST from allele counts
imputed from read counts using a maximum-likelihood
approach conditional on the haploid size of the pools
(e.g., as in Smadja et al. 2012; Leblois et al. 2018), or from
allele frequencies estimated using a model-based method
which accounts for the sampling effects and the sequencing
error probabilities inherent to pooled NGS experiments
(see Fariello et al. 2017). However, these latter approaches
may only be accurate in situations where the coverage is
much larger than pool size, allowing for a reduction of the
sampling variance of reads (see Figure S3). We therefore
developed a new estimator of the parameter FST for Pool-seq
data in an analysis-of-variance framework (Cockerham
1969, 1973). The accuracy of this estimator is barely distinguishable
from that of Weir and Cockerham's (1984)
estimator for individual data. Furthermore, it does not depend
on the pool size or on the coverage, and it is robust to unequal
pool sizes and varying coverage across demes and loci.
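
As an illustration of what an analysis-of-variance estimator for read counts can look like, the Python sketch below computes mean squares within and between pools, with correction terms that depend on both the coverage and the haploid pool size. The exact expressions are defined in the Methods, which are not reproduced in this section, so the formulas for `d2`, `d2_star`, and `nc` below are our reconstruction and should be checked against the original equations before any use. The function names are hypothetical. Only biallelic loci are handled, using the reference allele, because summing over both alleles doubles the numerator and the denominator alike.

```python
from typing import Iterable, Sequence, Tuple

Locus = Tuple[Sequence[int], Sequence[int], Sequence[int]]  # (ref reads, coverages, pool sizes)


def anova_components(ref_reads: Sequence[int], coverages: Sequence[int],
                     pool_sizes: Sequence[int]) -> Tuple[float, float]:
    """Return (MSP - MSI, MSP + (nc - 1) * MSI) for one biallelic locus.

    Assumed form of the correction terms (to be checked against the Methods):
    D2 = sum_i (C_i + n_i - 1) / n_i and D2* = sum_i C_i (C_i + n_i - 1) / n_i / C_1.
    """
    c1 = float(sum(coverages))
    c2 = float(sum(c * c for c in coverages))
    d2 = sum((c + n - 1.0) / n for c, n in zip(coverages, pool_sizes))
    d2_star = sum(c * (c + n - 1.0) / n for c, n in zip(coverages, pool_sizes)) / c1
    freqs = [r / c for r, c in zip(ref_reads, coverages)]   # read frequency per pool
    overall = sum(ref_reads) / c1                            # coverage-weighted average
    ssi = sum(c * f * (1.0 - f) for c, f in zip(coverages, freqs))
    ssp = sum(c * (f - overall) ** 2 for c, f in zip(coverages, freqs))
    msi = ssi / (c1 - d2)                                    # mean square within pools
    msp = ssp / (d2 - d2_star)                               # mean square between pools
    nc = (c1 - c2 / c1) / (d2 - d2_star)
    return msp - msi, msp + (nc - 1.0) * msi


def fst_pool(loci: Iterable[Locus]) -> float:
    """Multilocus estimate as a ratio of sums over polymorphic loci."""
    numerator = denominator = 0.0
    for ref_reads, coverages, pool_sizes in loci:
        num, den = anova_components(ref_reads, coverages, pool_sizes)
        numerator += num
        denominator += den
    return numerator / denominator


fixed_difference = fst_pool([([20, 0], [20, 20], [100, 100])])
identical_pools = fst_pool([([10, 10], [20, 20], [50, 50])])
```

When the pool sizes become very large, each term of `d2` tends to 1 and the expressions reduce to a classical haploid analysis of variance in which reads play the role of genes. Loci that are monomorphic in all pools give a zero denominator and should be excluded beforehand.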
In our analysis, the frequency of reads within pools is a
weighted average of the sample frequencies, with weights
equal to the pool coverage. Therefore, our approach follows
that of Cockerham (1973), which he referred to as a weighted
analysis-of-variance (see also Weir and Cockerham 1984;
Weir 1996; Weir and Hill 2002; Weir and Goudet 2017).
With unequal pool sizes, weighted and unweighted analyses
differ. As discussed recently in Weir and Goudet (2017), the
unweighted approach seems appropriate when the between
component exceeds the within component, i.e., when FST is
large (Tukey 1957). It turns out that optimal weighting
depends upon the parameter to be estimated (Cockerham
1973) and is only efficient at lower levels of differentiation
(Robertson 1962). In a likelihood analysis of the island
model, Rousset (2007) derived asymptotically efficient weights
that are proportional to $n_i^2$ for the sum of squares of different
samples (see also Robertson 1962). To the best of our
knowledge, such optimal weighting has never been considered
in the literature.

Figure 4 Precision and accuracy of FST estimates with varying pool size or varying
coverage. Our estimator $\hat{F}_{ST}^{\mathrm{pool}}$ was calculated from Pool-seq data over all
demes and loci and compared to the estimator WC84, computed from Ind-seq
data. Each boxplot represents the distribution of multilocus FST estimates across
50 independent replicates of the ms simulations. We used two migration rates,
corresponding to (A and C) FST = 0.05 and (B and D) FST = 0.20. (A and B) The
pool size was variable across demes, with haploid sample size n drawn independently
for each deme from a Gaussian distribution with mean 100 and SD 30;
n was rounded up to the nearest integer, with a minimum of 20 and a maximum
of 300 haploids per deme. (C and D) The pool size was fixed (n = 100) and
the coverage ($\delta_i$) was varying across demes and loci, with $\delta_i \sim \mathrm{Pois}(\Delta)$ where
$\Delta \in \{20, 50, 100\}$. For Pool-seq data, we show the results for different coverages
(20×, 50×, and 100×). In each graph, the dashed line indicates the simulated
value of FST and the dotted line indicates the median of the distribution of WC84
estimates. Var., variable.
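
The unbalanced designs described in the legend of Figure 4 can be sketched as follows in Python. Two points are assumptions on our part: out-of-range pool sizes are clamped to the bounds (they could equally have been redrawn), and Poisson deviates are generated with Knuth's multiplication method because the standard library has no Poisson sampler. The function names are hypothetical.

```python
import math
import random
from typing import List


def draw_pool_size(rng: random.Random, mean: float = 100.0, sd: float = 30.0,
                   lower: int = 20, upper: int = 300) -> int:
    """Gaussian haploid pool size, rounded up and clamped to [lower, upper]."""
    n = math.ceil(rng.gauss(mean, sd))
    return max(lower, min(upper, n))


def draw_poisson(rng: random.Random, lam: float) -> int:
    """Poisson deviate by Knuth's method (adequate for means up to a few hundred)."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(42)
n_demes, n_loci = 8, 5

# Panels A and B: variable pool size across demes
pool_sizes: List[int] = [draw_pool_size(rng) for _ in range(n_demes)]

# Panels C and D: fixed pool size, coverage varying across demes and loci
coverages: List[List[int]] = [
    [draw_poisson(rng, 50.0) for _ in range(n_demes)] for _ in range(n_loci)
]
```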
Analysis-of-variance and probabilities of identity
In the analysis-of-variance framework, FST is defined in Equation 1
as an intraclass correlation for the probability of IIS
(Cockerham and Weir 1987; Rousset 1996). Extensive statistical
literature is available on estimators of intraclass correlations.
Besides analysis-of-variance estimators, introduced in
population genetics by Cockerham (1969, 1973), estimators
based on the computation of probabilities of identical response
within and between groups have been proposed
(see, e.g., Fleiss 1971; Fleiss and Cuzick 1979; Mak 1988;
Ridout et al. 1999; Wu et al. 2012), which were originally
referred to as kappa-type statistics (Fleiss 1971; Landis and
Koch 1977). These estimators have later been endorsed
in population genetics, where the probability of identical
response was then interpreted as the frequency with
which the genes are alike (Cockerham 1973; Cockerham
and Weir 1987; Weir 1996; Rousset 2007; Weir and Goudet
2017).
This suggests that, with Pool-seq data, another strategy
could consist of computing FST from IIS probabilities between
(unobserved) pairs of genes, which requires that unbiased
estimates of such quantities are derived from read count data.
We have done this in the second section of File S1 and we
provide alternative estimators of FST for Pool-seq data (see
Equations A44 and A48 in File S1). These estimators
(denoted by $\hat{F}_{ST}^{\mathrm{pool-PID}}$ and $\tilde{F}_{ST}^{\mathrm{pool-PID}}$) have exactly the same
form as the analysis-of-variance estimator if the pools all have
the same size and if the number of reads per pool is constant
(Equation A33 in File S1). This echoes the derivations by
Rousset (2007) for Ind-seq data, who showed that the
analysis-of-variance approach (Weir and Cockerham 1984) and
the simple strategy of estimating IIS probabilities by counting
identical pairs of genes provide identical estimates when
sample sizes are equal (see Equation A28 in File S1 and also
Cockerham and Weir 1987; Weir 1996; Karlsson et al. 2007).
With unbalanced samples, we found that analysis-of-variance
estimates have better precision and accuracy than IIS-based
estimates, particularly for low levels of differentiation (see
Figure S4). Interestingly, we found that IIS-based estimates
of FST for Pool-seq data have generally lower bias and variance
if the overall estimates of IIS probabilities within and
between pools are computed as unweighted averages of
population-specific or pairwise estimates (see Equations A39
and A43 in File S1), as compared to weighted averages (Equations
A46 and A47 in File S1). Equation A28 in File S1 further
shows that our estimator may be rewritten as a function close
to $(\hat{Q}_1 - \hat{Q}_2)/(1 - \hat{Q}_2)$, except that it also depends on the sum
$\sum_i (\hat{Q}_{1i} - \hat{Q}_1)$ in both the numerator and the denominator. This
suggests that if the $Q_{1i}$'s differ among subpopulations, then our
estimator provides an estimate of an average of population-specific
FST (Weir and Hill 2002; Weir and Goudet 2017).

Figure 5 Precision and accuracy of FST estimates with sequencing and experimental
errors. Our estimator $\hat{F}_{ST}^{\mathrm{pool}}$ was computed from Pool-seq data over all
demes and loci without error, with sequencing error (occurring at rate
$\mu_e = 0.001$), and with experimental error (ε = 0.5). Each boxplot represents the
distribution of multilocus FST estimates across 50 independent replicates of the
ms simulations. We used two migration rates, corresponding to (A and B)
FST = 0.05 or (C and D) FST = 0.20. The size of each pool was either fixed to (A
and C) 10 or to (B and D) 100. For Pool-seq data, we show the results for different
coverages (20×, 50×, and 100×). In each graph, the dashed line indicates the
simulated value of FST. Exp., experimental; Seq., sequencing.
It follows from the derivations in File S1 that the estimator
PP2a (Equation 19) is biased because the IIS probability between
pairs of reads within a pool $(\hat{Q}_1^{r})$ is a biased estimator
of the IIS probability between pairs of distinct genes in that
pool (see Equations A34–A36 in File S1). This is the case
because the former confounds pairs of reads that are identical
because they were sequenced from a single gene with pairs of
reads that are identical because they were sequenced from
distinct, yet IIS genes.
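
The confounding described above can be made concrete with a short Python sketch. It assumes that every read is drawn independently, with replacement, from n equally contributing genes, so that two distinct reads copy the same gene with probability 1/n. The expected IIS probability between reads is then 1/n + (1 - 1/n) Q1, which can be inverted to recover Q1, the IIS probability between distinct genes. The function names are hypothetical and the exact estimators are those of File S1, not this sketch.

```python
from typing import Sequence


def iis_between_reads(read_counts: Sequence[int]) -> float:
    """Proportion of identical pairs among all pairs of distinct reads in a pool."""
    coverage = sum(read_counts)
    identical_pairs = sum(r * (r - 1) for r in read_counts)
    return identical_pairs / (coverage * (coverage - 1))


def iis_between_genes(read_counts: Sequence[int], pool_size: int) -> float:
    """Remove the share of read pairs that were sequenced from a single gene."""
    q_reads = iis_between_reads(read_counts)
    return (pool_size * q_reads - 1.0) / (pool_size - 1.0)


# A pool of 10 haploid genes sequenced at 20x, with 10 reads for each allele
q_reads = iis_between_reads([10, 10])
q_genes = iis_between_genes([10, 10], pool_size=10)
```

The corrected value is always lower than the read-based one when the locus is polymorphic, and the gap widens as the pool size decreases, in line with the bias of PP2a shrinking as the sample size increases.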
A more justified estimator of FST has been proposed by
Ferretti et al. (2013), based on previous developments by
Futschik and Schlötterer (2010). Note that, although they
defined FST as a ratio of functions of heterozygosities, they
actually worked with IIS probabilities (see Equation 20 and
Equation 21). However, although Equation 20 is strictly identical
to Equation A39 in File S1, we note that they computed
the total heterozygosity by integrating over pairs of genes
sampled both within and between subpopulations (compare
Equation 21 with Equation A43 in File S1), which may explain
the observed bias (see Figure 2).
Comparison with alternative estimators
An alternative framework to Weir and Cockerham's (1984)
analysis-of-variance has been developed by Masatoshi Nei
and coworkers to estimate FST from gene diversities (Nei
1973, 1977, 1986; Nei and Chesser 1983). The estimator
PP2d (see Equation 16, Equation 17, and Equation 18) implemented
in the software package PoPoolation2 (Kofler et al.
2011) follows this logic. However, it has long been recognized
that both frameworks are fundamentally different in
that the analysis-of-variance approach considers both statistical
and genetic (or evolutionary) sampling, whereas Nei
and coworkers' approach does not (Weir and Cockerham
1984; Excoffier 2007; Holsinger and Weir 2009). Furthermore,
the expectation of Nei and coworkers' estimators depends
on the number of sampled populations, with a larger
bias for lower numbers of sampled populations (Goudet
1993; Excoffier 2007; Weir and Goudet 2017). This is the
case because the computation of the total diversity in Equation 18
and Equation 21 includes the comparison of pairs of
genes from the same subpopulation, whereas the computation
of IIS probabilities between subpopulations does not (see,
e.g., Excoffier 2007). Therefore, we do not recommend using
the estimator PP2d implemented in the software package
PoPoolation2 (Kofler et al. 2011).
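
The dependence on the number of sampled populations can be illustrated at the level of parameters, leaving aside all sampling corrections. In the Python sketch below, which is our own illustration rather than any of the estimators discussed here, `q1` and `q2` are the IIS probabilities within and between subpopulations. When the total identity is computed over all pairs of genes, a fraction 1/d of those pairs comes from the same subpopulation, where d is the number of subpopulations. The function names are hypothetical.

```python
def fst_from_iis(q1: float, q2: float) -> float:
    """Intraclass correlation for IIS probabilities: (Q1 - Q2) / (1 - Q2)."""
    return (q1 - q2) / (1.0 - q2)


def gst_total_with_within_pairs(q1: float, q2: float, n_pops: int) -> float:
    """Diversity-based ratio when the total includes within-subpopulation pairs."""
    q_total = q1 / n_pops + (1.0 - 1.0 / n_pops) * q2
    return (q1 - q_total) / (1.0 - q_total)


q1, q2 = 0.6, 0.5
reference = fst_from_iis(q1, q2)                      # does not depend on d
two_pops = gst_total_with_within_pairs(q1, q2, 2)     # strongest downward departure
eight_pops = gst_total_with_within_pairs(q1, q2, 8)   # closer to the reference
```

With these values the diversity-based ratio approaches the intraclass correlation from below as the number of subpopulations increases, which matches the larger bias reported for lower numbers of sampled populations.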
Applications in evolutionary ecology studies
Pool-seq is being increasingly used in many application domains
(Schlötterer et al. 2014), such as conservation genetics
(see, e.g., Fuentes-Pardo and Ruzzente 2017), invasion biology
(see, e.g., Dexter et al. 2018), and evolutionary biology
in a broader sense (see, e.g., Collet et al. 2016). These studies
use a large range of methods, which aim at characterizing
fine-scaled population structure (see, e.g., Fischer et al.
2017), reconstructing past demography (see, e.g., Chen et al.
2016; Leblois et al. 2018), or identifying footprints of natural
or artificial selection (see, e.g., Chen et al. 2016; Fariello et al.
2017; Leblois et al. 2018).

Figure 6 Precision and accuracy of FST estimates with and without filtering. Our estimator $\hat{F}_{ST}^{\mathrm{pool}}$ was computed from Pool-seq data over all demes and
loci (A) without error, (B) with sequencing error, and (C) with experimental error (see the legend of Figure 5 for further details). For each case, we
computed FST without filtering (no MRC) and with filtering (using an MRC = 4). Each boxplot represents the distribution of multilocus FST estimates across
50 independent replicates of the ms simulations. We used a migration rate corresponding to FST = 0.20 and pool size n = 10. We show the results for
different coverages (20×, 50×, and 100×). In each graph, the dashed line indicates the simulated value of FST.
Here, we reanalyzed the Pool-seq data produced by
Dennenmoser et al. (2017), who investigated the adaptive
genomic divergence between freshwater and brackish-water
ecotypes of the prickly sculpin C. asper, an abundant euryhaline
fish in northwestern North America. Measuring pairwise
genetic differentiation between samples using $\hat{F}_{ST}^{\mathrm{pool}}$, we found
a clear-cut structure separating the freshwater from the
brackish-water ecotypes. Such genetic structure supports the
hypothesis that populations are locally adapted to osmotic
conditions in these two contrasted habitats, as discussed in
Dennenmoser et al. (2017). This structure, which is at odds
with that inferred from PP2d estimates, is not only supported
by the scaled covariance matrix of allele frequencies, but also
by previous microsatellite-based studies, which showed that
populations were genetically more differentiated between ecotypes
than within ecotypes (Dennenmoser et al. 2014, 2015).
Limits of the model and perspectives
We have shown that the strongest source of bias for the $\hat{F}_{ST}^{\mathrm{pool}}$
estimate is unequal contributions of individuals in pools. This
is because we assume in our model that the read counts are
multinomially distributed, which supposes that all genes contribute
equally to the pool of reads (Gautier et al. 2013), i.e.,
that there is no variation in DNA yield across individuals and
that all genes have equal sequencing coverage (Rode et al.
2018). Because the effect of unequal contribution is expected
to be stronger with small pool sizes, it has been recommended
to use Pool-seq with at least 50 diploid individuals
per pool (Lynch et al. 2014; Schlötterer et al. 2014). However,
this limit may be overly conservative for allele frequency
estimates (Rode et al. 2018) and we have shown here that
we can achieve very good precision and accuracy of FST estimates
with smaller pool sizes. Furthermore, because genotypic
information is lost during Pool-seq experiments, we
assume in our derivations that pools are haploid (and therefore
that FIS is nil). Analyzing nonrandom mating populations
(e.g., in selfing species) is therefore problematic.

Figure 7 Reanalysis of the prickly sculpin (C. asper) Pool-seq data. (A) We compare the pairwise FST estimates PP2d and $\hat{F}_{ST}^{\mathrm{pool}}$ for all pairs of populations
from the estuarine (CR and FE) and freshwater samples (PI and HZ). Within-ecotype and between-ecotype comparisons are depicted with different symbols.
(B and C) We show hierarchical cluster analyses based on (B) PP2d and (C) $\hat{F}_{ST}^{\mathrm{pool}}$ pairwise estimates using unweighted pair group method with arithmetic
mean (UPGMA). (D) We show a heatmap representation of the scaled covariance matrix among the four C. asper populations, inferred from the Bayesian
hierarchical model implemented in the software package BayPass.
Finally, our model, as in Weir and Cockerham (1984),
formally assumes that all populations provide independent
replicates of some evolutionary process (Excoffier 2007;
Holsinger and Weir 2009). This may be unrealistic in many
natural populations, which motivated Weir and Hill (2002)
to derive a population-specific estimator of FST for Ind-seq
data (see also Vitalis et al. 2001). Even though the use of
Weir and Hill's (2002) estimator is still scarce in the literature
(but see Weir et al. 2005; Vitalis 2012), Weir and Goudet
(2017) recently proposed a reinterpretation of population-specific
estimates of FST in terms of allelic matching proportions,
which are strictly equivalent to IIS probabilities
between pairs of genes. It is therefore straightforward to
extend Weir and Goudet's (2017) estimator of population-specific
FST for the analysis of Pool-seq data, using the unbiased
estimates of IIS probabilities provided in File S1.
Acknowledgments
We thank Alexandre Dehne-Garcia for his assistance in using
computer farms. We thank two anonymous reviewers for
their positive comments and suggestions. Analyses were
performed on the GenoToul bioinformatics platform Toulouse
Midi-Pyrénées (http://bioinfo.genotoul.fr) and the
High Performance Computational platform of the Centre
de Biologie pour la Gestion des Populations. This work is
part of V.H.'s Ph.D.; V.H. was supported by a grant from
the Institut National de la Recherche Agronomique's Plant
Health and Environment (SPE) Division and by the BiodivERsA
project EXOTIC (ANR-13-EBID-0001). Part of this
work was supported by the project SWING (ANR-16-CE02-0015)
of the French National Research Agency, and by the
CORBAM project of the French region Hauts-de-France.
Literature Cited
Akey, J. M., G. Zhang, L. Jin, and M. D. Shriver, 2002 Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12: 1805–1814. https://doi.org/10.1101/gr.631202
Anderson, E. C., H. J. Skaug, and D. J. Barshis, 2014 Next-generation sequencing for molecular ecology: a caveat regarding pooled samples. Mol. Ecol. 23: 502–512. https://doi.org/10.1111/mec.12609
Beaumont, M. A., 2005 Adaptation and speciation: what can FST tell us? Trends Ecol. Evol. 20: 435–440. https://doi.org/10.1016/j.tree.2005.05.017
Beaumont, M. A., and R. A. Nichols, 1996 Evaluating loci for use in the genetic analysis of population structure. Proc. Biol. Sci. 263: 1619–1626. https://doi.org/10.1098/rspb.1996.0237
Bhatia, G., N. Patterson, S. Sankararaman, and A. L. Price, 2013 Estimating and interpreting FST: the impact of rare variants. Genome Res. 23: 1514–1521. https://doi.org/10.1101/gr.154831.113
Cavalli-Sforza, L., 1966 Population structure and human evolution. Proc. R. Soc. Lond. B Biol. Sci. 164: 362–379. https://doi.org/10.1098/rspb.1966.0038
Chen, J., T. Källman, X.-F. Ma, G. Zaina, M. Morgante et al., 2016 Identifying genetic signatures of natural selection using pooled populations sequencing in Picea abies. G3 (Bethesda) 6: 1979–1989. https://doi.org/10.1534/g3.116.028753
Cockerham, C. C., 1969 Variance of gene frequencies. Evolution 23: 72–84. https://doi.org/10.1111/j.1558-5646.1969.tb03496.x
Cockerham, C. C., 1973 Analyses of gene frequencies. Genetics 74: 679–700.
Cockerham, C. C., and B. S. Weir, 1987 Correlations, descent measures: drift with migration and mutation. Proc. Natl. Acad. Sci. USA 84: 8512–8514. https://doi.org/10.1073/pnas.84.23.8512
Collet, J. M., S. Fuentes, J. Hesketh, M. S. Hill, P. Innocenti et al., 2016 Rapid evolution of the intersexual genetic correlation for fitness in Drosophila melanogaster. Evolution 70: 781–795. https://doi.org/10.1111/evo.12892
Coop, G., D. Witonsky, A. Di Rienzo, and J. K. Pritchard, 2010 Using environmental correlations to identify loci underlying local adaptation. Genetics 185: 1411–1423. https://doi.org/10.1534/genetics.110.114819
Cutler, D. J., and J. D. Jensen, 2010 To pool, or not to pool? Genetics 186: 41–43. https://doi.org/10.1534/genetics.110.121012
Dennenmoser, S., S. M. Rogers, and S. M. Vamosi, 2014 Genetic population structure in prickly sculpin (Cottus asper) reflects isolation-by-environment between two life-history ecotypes. Biol. J. Linn. Soc. Lond. 113: 943–957. https://doi.org/10.1111/bij.12384
Dennenmoser, S., A. W. Nolte, S. M. Vamosi, and S. M. Rogers, 2015 Phylogeography of the prickly sculpin (Cottus asper) in north-western North America reveals parallel phenotypic evolution across multiple coastal-inland colonizations. J. Biogeogr. 42: 1626–1638. https://doi.org/10.1111/jbi.12527
Dennenmoser, S., S. M. Vamosi, S. W. Nolte, and S. M. Rogers, 2017 Adaptive genomic divergence under high gene flow between freshwater and brackish-water ecotypes of prickly sculpin (Cottus asper) revealed by Pool-Seq. Mol. Ecol. 26: 25–42. https://doi.org/10.1111/mec.13805
Dexter, E., S. M. Bollens, J. Cordell, H. Y. Soh, G. Rollwagen-Bollens et al., 2018 A genetic reconstruction of the invasion of the calanoid copepod Pseudodiaptomus inopinus across the North American Pacific Coast. Biol. Invasions 20: 1577–1595. https://doi.org/10.1007/s10530-017-1649-0
Ellegren, H., 2014 Genome sequencing and population genomics in non-model organisms. Trends Ecol. Evol. 29: 51–63. https://doi.org/10.1016/j.tree.2013.09.008
Excoffier, L., 2007 Analysis of population subdivision, pp. 980–1020 in Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop, and C. Cannings. John Wiley & Sons, Chichester, United Kingdom.
Fariello, M. I., S. Boitard, S. Mercier, D. Robelin, T. Faraut et al., 2017 Accounting for linkage disequilibrium in genome scans for selection without individual genotypes: the local score approach. Mol. Ecol. 26: 3700–3714. https://doi.org/10.1111/mec.14141
Ferretti, L., S. Ramos Onsins, and M. Pérez-Enciso, 2013 Population genomics from pool sequencing. Mol. Ecol. 22: 5561–5576. https://doi.org/10.1111/mec.12522
Fischer, M. C., C. Rellstab, M. Leuzinger, M. Roumet, F. Gugerli
et al., 2017 Estimating genomic diversity and population dif-
ferentiation an empirical comparison of microsatellite and SNP
variation in Arabidopsis halleri. BMC Genomics 18: 69. https://
doi.org/10.1186/s12864-016-3459-7
Fleiss, J. L., 1971 Measuring nominal scale agreement among
many raters. Psychol. Bull. 76: 378382. https://doi.org/
10.1037/h0031619
Fleiss, J. L., and J. Cuzick, 1979 The reliability of dichotomous judge-
ments: unequal numbers of judges per subject. Appl. Psychol. Meas.
3: 537542. https://doi.org/10.1177/014662167900300410
Fuentes-Pardo, A. P., and D. E. Ruzzente, 2017 Whole-genome
sequencing approaches for conservation biology: advantages,
limitations and practical recommendations. Mol. Ecol. 26: 5369
5406. https://doi.org/10.1111/mec.14264
Futschik, A., and C. Schlötterer, 2010 The next generation of mo-
lecular markers from massively parallel sequencing of pooled
DNA samples. Genetics 186: 207218. https://doi.org/10.1534/
genetics.110.114397
Gautier, M., 2015 Genome-wide scan for adaptive divergence and
association with population-specic covariates. Genetics 201:
15551579. https://doi.org/10.1534/genetics.115.181453
Gautier, M., K. Gharbi, T. Cezaerd, M. Galan, A. Loiseau et al.,
2013 Estimation of population allele frequencies from next-
generation sequencing data: pool-versus individual-based genotyp-
ing. Mol. Ecol. 22: 37663779. https://doi.org/10.1111/mec.12360
Glenn, T. C., 2011 Field guide to next-generation DNA se-
quencers. Mol. Ecol. Resour. 11: 759769. https://doi.org/10.
1111/j.1755-0998.2011.03024.x
Goudet, J., 1993 The genetics of geographically structured pop-
ulations. Ph.D. Thesis, University of Wales, Bangor, Wales.
Holsinger, K. S., and B. S. Weir, 2009 Genetics in geographically
structured populations: dening, estimating and interpreting
F
ST
. Nat. Rev. Genet. 10: 639650. https://doi.org/10.1038/
nrg2611
Hudson, R. R., 2002 Generating samples under a Wright-Fisher
neutral model of genetic variation. Bioinformatics 18: 337338.
https://doi.org/10.1093/bioinformatics/18.2.337
Karlsson,E.K.,I.Baranowska,C.M.Wade,N.H.C.Salmon
Hillbertz, M. C. Zody et al., 2007 Efcient mapping of Men-
delian traits in dogs through genome-wide association. Nat.
Genet. 39: 13211328. https://doi.org/10.1038/ng.2007.10
Koer, R., R. V. Pandey, and C. Schlötterer, 2011 PoPoolation2:
identifying differentiation between populations using sequenc-
ing of pooled DNA samples (Pool-Seq). Bioinformatics 27:
34353436. https://doi.org/10.1093/bioinformatics/btr589
Landis, J. R., and G. G. Koch, 1977 A one-way components of
variance model for categorical data. Biometrics 33: 671679.
https://doi.org/10.2307/2529465
Leblois, R., M. Gautier, A. Rohfritsch, J. Foucaud, C. Burban et al.,
2018 Deciphering the demographic history of allochronic dif-
ferentiation in the pine processionary moth Thaumetopoea pity-
ocampa. Mol. Ecol. 27: 264278. https://doi.org/10.1111/
mec.14411
Lewontin, R. C., and J. Krakauer, 1973 Distribution of gene fre-
quency as a test of the theory of the selective neutrality of poly-
morphism. Genetics 74: 175195.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan et al.,
2009 The sequence alignment/map format and SAMtools.
Bioinformatics 25: 20782079. https://doi.org/10.1093/
bioinformatics/btp352
Lotterhos, K. E., and M. C. Whitlock, 2014 Evaluation of demo-
graphic history and neutral parameterization on the perfor-
mance of F
ST
outlier tests. Mol. Ecol. 23: 21782192. https://
doi.org/10.1111/mec.12725
Lotterhos, K. E., and M. C. Whitlock, 2015 The relative power of
genome scans to detect local adaptation depends on sampling
design and statistical method. Mol. Ecol. 24: 10311046. https://
doi.org/10.1111/mec.13100
Lynch, M., D. Bost, S. Wilson, T. Maruki, and S. Harrison,
2014 Population-genetic inference from pooled-sequencing data.
Genome Biol. Evol. 6: 12101218. https://doi.org/10.1093/gbe/
evu085
Mak, T. K., 1988 Analysing intraclass correlation for dichotomous
variables. J. R. Stat. Soc. Ser. C Appl. Stat. 37: 344352.
Malécot, G., 1948 Les Mathématiques de lHérédité. Masson, Paris.
Nei, M., 1973 Analysis of gene diversity in subdivided popula-
tions. Proc. Natl. Acad. Sci. USA 70: 33213323. https://doi.
org/10.1073/pnas.70.12.3321
Nei, M., 1977 F-statistics and analysis of gene diversity in subdi-
vided populations. Ann. Hum. Genet. 41: 225233. https://doi.
org/10.1111/j.1469-1809.1977.tb01918.x
Nei, M., 1978 Estimation of average heterozygosity and genetic
distance from a small number of individuals. Genetics 89: 583
590.
Nei, M., 1986 Denition and estimation of xation indices. Evo-
lution 40: 643645. https://doi.org/10.1111/j.1558-5646.1986.
tb00516.x
Nei, M., and R. K. Chesser, 1983 Estimation of xation indices and
gene diversities. Ann. Hum. Genet. 47: 253259. https://doi.
org/10.1111/j.1469-1809.1983.tb00993.x
Nychka, D., R. Furrer, J. Paige, and S. Sain, 2017 elds: tools for
spatial data. R package version 9.6. University Corporation for
Atmospheric Research, Boulder, CO. DOI: 10.5065/D6W957CT
Orgogozo, V., A. E. Peluffo, and B. Morizot, 2016 The mendelian
geneand the molecular gene: two relevant concepts of ge-
netic units, pp. 126 in Genes and Evolution. Current Topics in
Developmental Biology, Vol. 119, edited by V. Orgogozo. Aca-
demic Press, New York.
Pickrell, J. K., and J. K. Pritchard, 2012 Inference of population
splits and mixtures from genome-wide allele frequency data.
PLoS Genet. 8: e1002967. https://doi.org/10.1371/journal.
pgen.1002967
R Core Team, 2017 R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna.
Reynolds, J., B. S. Weir, and C. C. Cockerham, 1983 Estimation of
the coancestry coefcient: basis for a short-term genetic dis-
tance. Genetics 105: 767779.
Ridout, M. S., C. G. B. Demktrio, and D. Firth, 1999 Estimating
intra-class correlation for binary data. Biometrics 55: 137148.
https://doi.org/10.1111/j.0006-341X.1999.00137.x
Robertson, A., 1962 Weighting in the estimation of variance com-
ponents in the unbalanced single classication. Biometrics 18:
413417. https://doi.org/10.2307/2527485
Rode, N. O., Y. Holtz, K. Loridon, S. Santoni, J. Ronfort et al.,
2018 How to optimize the precision of allele and haplotype
frequency estimates using pooled-sequencing data. Mol. Ecol.
Resour. 18: 194–203. https://doi.org/10.1111/1755-0998.12723
Ross, M. G., C. Russ, M. Costello, A. Hollinger, N. J. Lennon et al.,
2013 Characterizing and measuring bias in sequence data.
Genome Biol. 14: R51. https://doi.org/10.1186/gb-2013-14-5-r51
Rousset, F., 1996 Equilibrium values of measures of population
subdivision for stepwise mutation processes. Genetics 142: 1357–1362.
Rousset, F., 1997 Genetic differentiation and estimation of gene
flow from F-statistics under isolation by distance. Genetics 145:
1219–1228.
Rousset, F., 2007 Inferences from spatial population genetics, pp.
945–979 in Handbook of Statistical Genetics, edited by D. J.
Balding, M. Bishop, and C. Cannings. John Wiley & Sons, Ltd.,
Chichester, England.
Rousset, F., 2008 genepop'007: a complete re-implementation of
the genepop software for Windows and Linux. Mol. Ecol.
Resour. 8: 103–106.
https://doi.org/10.1111/j.1471-8286.2007.01931.x
Schlötterer, C., R. Tobler, R. Kofler, and V. Nolte, 2014 Sequencing
pools of individuals – mining genome-wide polymorphism data
without big funding. Nat. Rev. Genet. 15: 749–763.
https://doi.org/10.1038/nrg3803
Slatkin, M., 1993 Isolation by distance in equilibrium and
non-equilibrium populations. Evolution 47: 264–279.
https://doi.org/10.1111/j.1558-5646.1993.tb01215.x
Smadja, C. M., B. Canbäck, R. Vitalis, M. Gautier, J. Ferrari et al.,
2012 Large-scale candidate gene scan reveals the role of
chemoreceptor genes in host plant specialization and speciation in
the pea aphid. Evolution 66: 2723–2738.
https://doi.org/10.1111/j.1558-5646.2012.01612.x
The International HapMap Consortium, 2005 A haplotype map of
the human genome. Nature 437: 1299–1320.
https://doi.org/10.1038/nature04226
Tukey, J. W., 1957 Variances of variance components: II. The
unbalanced single classification. Ann. Math. Stat. 28: 43–56.
https://doi.org/10.1214/aoms/1177707036
Vitalis, R., 2012 DetSel: an R-Package to detect marker loci
responding to selection, pp. 277–293 in Data Production and
Analysis in Population Genomics: Methods and Protocols. Methods
in Molecular Biology, Vol. 888, edited by F. Pompanon, and
A. Bonin. Humana Press, New York.
Vitalis, R., P. Boursot, and K. Dawson, 2001 Interpretation of variation
across marker loci as evidence of selection. Genetics 158: 1811–1823.
Wahlund, S., 1928 Zusammensetzung von populationen und
korrelationserscheinungen vom standpunkt der vererbungslehre
aus betrachtet. Hereditas 11: 65–106.
https://doi.org/10.1111/j.1601-5223.1928.tb02483.x
Weir, B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Inc.,
Sunderland, MA.
Weir, B. S., 2012 Estimating F-statistics: a historical view. Philos.
Sci. 79: 637–643. https://doi.org/10.1086/667904
Weir, B. S., and C. C. Cockerham, 1984 Estimating F-statistics for
the analysis of population structure. Evolution 38: 1358–1370.
https://doi.org/10.1111/j.1558-5646.1984.tb05657.x
Weir, B. S., and J. Goudet, 2017 A unified characterization of
population structure and relatedness. Genetics 206: 2085–2103.
https://doi.org/10.1534/genetics.116.198424
Weir, B. S., and W. G. Hill, 2002 Estimating F-statistics. Annu.
Rev. Genet. 36: 721–750.
https://doi.org/10.1146/annurev.genet.36.050802.093940
Weir, B. S., L. R. Cardon, A. D. Anderson, D. M. Nielsen, and W. G.
Hill, 2005 Measures of human population structure show
heterogeneity among genomic regions. Genome Res. 15: 1468–1476.
https://doi.org/10.1101/gr.4398405
Whitlock, M. C., and K. E. Lotterhos, 2015 Reliable detection of
loci responsible for local adaptation: inference of a null model
through trimming the distribution of FST. Am. Nat. 186: S24–S36.
https://doi.org/10.1086/682949
Wright, S., 1931 Evolution in Mendelian populations. Genetics
16: 97–159.
Wright, S., 1951 The genetical structure of populations. Ann.
Eugen. 15: 323–354.
https://doi.org/10.1111/j.1469-1809.1949.tb02451.x
Wu, S., C. M. Crespi, and W. K. Wong, 2012 Comparison of
methods for estimating the intraclass correlation coefficient for
binary responses in cancer prevention cluster randomized trials.
Contemp. Clin. Trials 33: 869–880.
https://doi.org/10.1016/j.cct.2012.05.004
Communicating editor: M. Beaumont
... We assessed population structure with pairwise pool-FST and principal components analysis (PCA). For all population pairs, we calculated pool-FST (F^pool_ST) and its 95% confidence interval (CI) using the R package poolfstat (Hivert et al., 2018). This pairwise pool-FST statistic is equivalent to Weir & Cockerham's FST (Weir & Cockerham, 1984) and accounts for random chromosome sampling in pool-seq. ...
Understanding how populations adapt to their environment is increasingly important to prevent biodiversity loss due to overexploitation and climate change. Here we studied the population structure and genetic basis of local adaptation of Atlantic horse mackerel, a commercially and ecologically important marine fish that has one of the widest distributions in the eastern Atlantic. We analyzed whole‐genome sequencing and environmental data of samples collected from the North Sea to North Africa and the western Mediterranean Sea. Our genomic approach indicated low population structure with a major split between the Mediterranean Sea and the Atlantic Ocean and between locations north and south of mid‐Portugal. Populations from the North Sea are the most genetically distinct in the Atlantic. We discovered that most population structure patterns are driven by a few highly differentiated putatively adaptive loci. Seven loci discriminate the North Sea, two the Mediterranean Sea, and a large putative inversion (9.9 Mb) on chromosome 21 underlines the north–south divide and distinguishes North Africa. A genome–environment association analysis indicates that mean seawater temperature and temperature range, or factors correlated to them, are likely the main environmental drivers of local adaptation. Our genomic data broadly support the current stock divisions, but highlight areas of potential mixing, which require further investigation. Moreover, we demonstrate that as few as 17 highly informative SNPs can genetically discriminate the North Sea and North African samples from neighboring populations. Our study highlights the importance of both life history and climate‐related selective pressures in shaping population structure patterns in marine fish. It also supports the view that chromosomal rearrangements play a key role in local adaptation with gene flow.
This study provides the basis for more accurate delineation of the horse mackerel stocks and paves the way for improving stock assessments.
... 2.1.1) (Hivert et al. 2018; Gautier et al. 2022) was then used to filter the data further (a minimum coverage of 20 and maximal coverage of 200 per pool, a minimal allele frequency of 0.05, and no indels) and convert the sync file to a pooldata object in RStudio (v. 2022.7 ...
Determining cryptic species and diversity in at-risk species is necessary for the understanding and conservation of biodiversity. The endangered Banff Springs Snail, Physella johnsoni, inhabits seven highly specialized thermal springs in Banff National Park, Alberta, Canada. However, it has been difficult to reconcile its species status to the much more common Physella gyrina using ecology, morphology and genetics. Here we used pooled whole-genome sequencing to characterize genomic variation and structure among five populations of P. johnsoni and three geographically proximate P. gyrina populations. By comparing over two million single nucleotide polymorphisms, we detected substantial genetic distance (pairwise FST of 0.27 to 0.44) between P. johnsoni and P. gyrina, indicative of unique gene pools. Genetic clusters among populations were found for both species, with up to 10% for P. johnsoni and 30% for P. gyrina of genetic variation being explained by population structure. P. johnsoni was found to have lower genetic diversity compared to P. gyrina; however, no patterns were observed between genetic diversity and population minimums. Our results confirm that designation of P. johnsoni as an endangered species is warranted and that both P. johnsoni and P. gyrina exhibit microgeographic population genomic structure suggestive of rapid local adaptation and/or genetic drift within environments. This study showcases the utility of genomics to resolve patterns of cryptic species and diversity for effective conservation management. Future studies on the functional genomic diversity of P. johnsoni populations are needed to test for the possible role of selection within this thermal spring environment.
... The mpileup file was then converted to sync format by PoPoolation2 version 1201 (Kofler et al., 2011). 8.03 million (M) SNPs were detected in this sync file using the R package poolfstat v2.0.0 (Hivert et al., 2018) with the following parameters: coverage per pool between 10 and 50. In parallel, nucleotide diversity (π) was computed using PoPoolation2 Variancesliding.pl ...