All content in this area was uploaded by Eric Petit on Feb 13, 2023

Content may be subject to copyright.

INVESTIGATION

Measuring Genetic Differentiation from Pool-seq Data

Valentin Hivert,*,† Raphaël Leblois,*,† Eric J. Petit,‡ Mathieu Gautier,*,†,1 and Renaud Vitalis*,†,1,2

*CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ Montpellier, 34988 Montferrier-sur-Lez Cedex, France, †Institut de Biologie

Computationnelle, Univ Montpellier, 34095 Montpellier Cedex, France, and ‡ESE, Ecology and Ecosystem Health, INRA,

Agrocampus Ouest, 35042 Rennes, Cedex, France

ORCID IDs: 0000-0002-5144-6956 (V.H.); 0000-0002-3051-4497 (R.L.); 0000-0001-5058-5826 (E.J.P.); 0000-0001-7257-5880 (M.G.);

0000-0001-7096-3089 (R.V.)

ABSTRACT The advent of high throughput sequencing and genotyping technologies enables the comparison of patterns of poly-

morphisms at a very large number of markers. While the characterization of genetic structure from individual sequencing data remains

expensive for many nonmodel species, it has been shown that sequencing pools of individual DNAs (Pool-seq) represents an attractive

and cost-effective alternative. However, analyzing sequence read counts from a DNA pool instead of individual genotypes raises

statistical challenges in deriving correct estimates of genetic differentiation. In this article, we provide a method-of-moments estimator

of FST for Pool-seq data, based on an analysis-of-variance framework. We show, by means of simulations, that this new estimator is

unbiased and outperforms previously proposed estimators. We evaluate the robustness of our estimator to model misspeciﬁcation,

such as sequencing errors and uneven contributions of individual DNAs to the pools. Finally, by reanalyzing published Pool-seq data of

different ecotypes of the prickly sculpin Cottus asper, we show how the use of an unbiased FST estimator may question the in-

terpretation of population structure inferred from previous analyses.

KEYWORDS FST; genetic differentiation; pool sequencing; population genomics

IT has long been recognized that the subdivision of species

into subpopulations, social groups, and families fosters ge-

netic differentiation (Wahlund 1928; Wright 1931). Charac-

terizing genetic differentiation as a means to infer unknown

population structure is therefore fundamental to population

genetics and ﬁnds applications in multiple domains, includ-

ing conservation biology, invasion biology, association map-

ping, and forensics, among many others. In the late 1940s

and early 1950s, Malécot (1948) and Wright (1951) intro-

duced F-statistics to partition genetic variation within and

between groups of individuals (Holsinger and Weir 2009;

Bhatia et al. 2013). Since then, the estimation of F-statistics

has become standard practice (see, e.g., Weir 1996, 2012;

Weir and Hill 2002) and the most commonly used estimators

of FST have been developed in an analysis-of-variance frame-

work (Cockerham 1969, 1973; Weir and Cockerham 1984),

which can be recast in terms of probabilities of identity of

pairs of homologous genes (Cockerham and Weir 1987;

Rousset 2007; Weir and Goudet 2017).

Assuming that molecular markers are neutral, estimates of

FST are typically used to quantify genetic structure in natural

populations, which is then interpreted as the result of demo-

graphic history (Holsinger and Weir 2009): large FST values

are expected for small populations among which dispersal

is limited (Wright 1951), or between populations that have

long diverged in isolation from each other (Reynolds et al.

1983). When dispersal is spatially restricted, a positive re-

lationship between FST and the geographical distance for pairs

of populations generally holds (Slatkin 1993; Rousset 1997). It

has also been proposed to characterize the heterogeneity of

FST estimates across markers for identifying loci that are tar-

geted by selection (Cavalli-Sforza 1966; Lewontin and Krakauer

1973; Beaumont and Nichols 1996; Vitalis et al. 2001; Akey

et al. 2002; Beaumont 2005; Weir et al. 2005; Lotterhos and

Whitlock 2014, 2015; Whitlock and Lotterhos 2015).

Next-generation sequencing (NGS) technologies provide

unprecedented amounts of polymorphism data in both model

Copyright © 2018 by the Genetics Society of America

doi: https://doi.org/10.1534/genetics.118.300900

Manuscript received March 9, 2018; accepted for publication July 21, 2018; published

Early Online July 25, 2018.

Supplemental material available at Figshare: https://doi.org/10.25386/genetics.6856781.

1 These authors are joint senior authors on this work.

2 Corresponding author: Centre de Biologie pour la Gestion des Populations, Campus International de Baillarguet, CS 30016, 34988 Montferrier-sur-Lez Cedex, France. E-mail: renaud.vitalis@inra.fr

Genetics, Vol. 210, 315–330 September 2018

and nonmodel species (Ellegren 2014). Although the se-

quencing strategy initially involved individually tagged sam-

ples in humans (The International HapMap Consortium

2005), whole-genome sequencing of pools of individuals

(Pool-seq) is being increasingly used for population genomic

studies (Schlötterer et al. 2014). Because it consists of se-

quencing libraries of pooled DNA samples and does not re-

quire individual tagging of sequences, Pool-seq provides

genome-wide polymorphism data at considerably lower cost

than sequencing of individuals (Schlötterer et al. 2014).

However, non-equimolar amounts of DNA from all individu-

als in a pool and stochastic variation in the ampliﬁcation

efﬁciency of individual DNAs have raised concerns with re-

spect to the accuracy of the so-obtained allele frequency es-

timates, particularly at low sequencing depth and with small

pool sizes (Cutler and Jensen 2010; Anderson et al. 2014;

Ellegren 2014). Nonetheless, it has been shown that, at equal

sequencing efforts, Pool-seq provides similar, if not more

accurate, allele frequency estimates than individual-based

analyses (Futschik and Schlötterer 2010; Gautier et al.

2013). The problem is different for diversity and differentiation parameters, which depend on second moments of allele frequencies or, equivalently, on pairwise measures of

genetic identity: with Pool-seq data, it is indeed impossi-

ble to distinguish pairs of reads that are identical because

they were sequenced from a single gene from pairs of reads

that are identical because they were sequenced from two

distinct genes that are identical in state (IIS) (Ferretti et al.

2013).

Appropriate estimators of diversity and differentiation

parameters must therefore be sought to account for both

the sampling of individual genes from the pool and the

sampling of reads from these genes. There have been several

attempts to deﬁne estimators for the parameter FST for Pool-

seq data (Koﬂer et al. 2011; Ferretti et al. 2013), from ratios

of heterozygosities (or from probabilities of genetic identity

between pairs of reads) within and between pools. In the

following, we will argue that these estimators are biased

(i.e., they do not converge toward the expected value of the

parameter) and that some of them have undesired statistical

properties (i.e., the bias depends on sample size and cover-

age). Here, following Cockerham (1969, 1973), Weir and

Cockerham (1984), Weir (1996), Weir and Hill (2002),

and Rousset (2007), we deﬁne a method-of-moments esti-

mator of the parameter FST using an analysis-of-variance

framework. We then evaluate the accuracy and precision of

this estimator, based on the analysis of simulated data sets,

and compare it to estimates deﬁned in the software package

PoPoolation2 (Koﬂer et al. 2011) and in Ferretti et al. (2013).

Furthermore, we test the robustness of our estimators to

model misspeciﬁcations (including unequal contributions of

individuals in pools and sequencing errors). Finally, we rean-

alyze the prickly sculpin (Cottus asper) Pool-seq data (pub-

lished by Dennenmoser et al. 2017), and show how the use of

biased FST estimators in previous analyses may challenge the

interpretation of population structure.

Note that throughout this article, we use the term “gene” to designate a segregating genetic unit (in the sense of the “Mendelian gene” from Orgogozo et al. 2016). We further use the term “read” in a narrow sense, as a sequenced copy of a gene. For the sake of simplicity, we will use the term “Ind-seq” to refer to analyses based on individual data, for which

we further assume that individual genotypes are called with-

out error.

Model

F-statistics may be described as intraclass correlations

for the IIS probability of pairs of genes (Cockerham

and Weir 1987; Rousset 1996, 2007). FST is best deﬁned

as:

$$F_{ST} \equiv \frac{Q_1 - Q_2}{1 - Q_2}, \qquad (1)$$

where $Q_1$ is the IIS probability for genes sampled within subpopulations, and $Q_2$ is the IIS probability for genes sampled between subpopulations. In the following, we develop

an estimator of FST for Pool-seq data by decomposing the

total variance of read frequencies in an analysis-of-variance

framework. A complete derivation of the model is provided in

the Supplemental Material, File S1.
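As a small numerical illustration of Equation 1, the sketch below maps a pair of IIS probabilities to FST. The function name and the input values are hypothetical and serve only to show the arithmetic.

```python
def fst_from_iis(q1: float, q2: float) -> float:
    """Equation 1: F_ST = (Q1 - Q2) / (1 - Q2)."""
    if not q2 < 1.0:
        raise ValueError("Q2 must be below 1, otherwise F_ST is undefined")
    return (q1 - q2) / (1.0 - q2)


# Hypothetical IIS probabilities within (Q1) and between (Q2) subpopulations.
print(fst_from_iis(q1=0.62, q2=0.60))
```

When genes within subpopulations are no more alike than genes between subpopulations (Q1 = Q2), the function returns zero.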

For the sake of clarity, the notation used throughout this

article is given in Table 1. We ﬁrst derive our model for a

single locus and eventually provide a multilocus estimator of $F_{ST}$. Consider a sample of $n_d$ subpopulations, each of which is made of $n_i$ genes ($i = 1, \ldots, n_d$) sequenced in pools (hence $n_i$ is the haploid sample size of the $i$th pool). We define $c_{ij}$ as the number of reads sequenced from gene $j$ ($j = 1, \ldots, n_i$) in subpopulation $i$ at the locus considered. Note that $c_{ij}$ is a latent variable that cannot be directly observed from the data. Let $X_{ijr:k}$ be an indicator variable for read $r$ ($r = 1, \ldots, c_{ij}$) from gene $j$ in subpopulation $i$, such that $X_{ijr:k} = 1$ if the $r$th read from the $j$th gene in the $i$th deme is of type $k$, and $X_{ijr:k} = 0$ otherwise. In the following, we use standard dot notation for sample averages, i.e., $X_{ij\cdot:k} \equiv \sum_r X_{ijr:k} / c_{ij}$, $X_{i\cdot\cdot:k} \equiv \sum_j \sum_r X_{ijr:k} / \sum_j c_{ij}$, and $X_{\cdot\cdot\cdot:k} \equiv \sum_i \sum_j \sum_r X_{ijr:k} / \sum_i \sum_j c_{ij}$.

The analysis-of-variance is based on the computation of

sums of squares, as follows:

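$$\begin{aligned}
\sum_i^{n_d} \sum_j^{n_i} \sum_r^{c_{ij}} \left(X_{ijr:k} - X_{\cdot\cdot\cdot:k}\right)^2
={}& \sum_i^{n_d} \sum_j^{n_i} \sum_r^{c_{ij}} \left(X_{ijr:k} - X_{ij\cdot:k}\right)^2 \\
&+ \sum_i^{n_d} \sum_j^{n_i} \sum_r^{c_{ij}} \left(X_{ij\cdot:k} - X_{i\cdot\cdot:k}\right)^2 \\
&+ \sum_i^{n_d} \sum_j^{n_i} \sum_r^{c_{ij}} \left(X_{i\cdot\cdot:k} - X_{\cdot\cdot\cdot:k}\right)^2 \\
\equiv{}& SSR_{:k} + SSI_{:k} + SSP_{:k}.
\end{aligned} \qquad (2)$$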
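The decomposition in Equation 2 can be checked numerically on a toy data set. The sketch below is only an illustration: the nested-list layout (`reads[i][j]` holding the 0/1 indicators of the reads from gene j in pool i) and the function name are assumptions made here, and in real Pool-seq data the per-gene read counts are not observable.

```python
def sums_of_squares(reads):
    """Return (SS_total, SSR, SSI, SSP) for one allele k.

    reads[i][j] is the list of indicators X_ijr:k (1 if the read is of
    type k, 0 otherwise) for gene j of pool i. Hypothetical layout.
    """
    all_reads = [x for pool in reads for gene in pool for x in gene]
    grand_mean = sum(all_reads) / len(all_reads)           # X_...:k
    ss_total = sum((x - grand_mean) ** 2 for x in all_reads)
    ssr = ssi = ssp = 0.0
    for pool in reads:
        pool_reads = [x for gene in pool for x in gene]
        pool_mean = sum(pool_reads) / len(pool_reads)      # X_i..:k
        for gene in pool:
            if not gene:                                   # gene with no read
                continue
            gene_mean = sum(gene) / len(gene)              # X_ij.:k
            ssr += sum((x - gene_mean) ** 2 for x in gene)
            ssi += len(gene) * (gene_mean - pool_mean) ** 2
            ssp += len(gene) * (pool_mean - grand_mean) ** 2
    return ss_total, ssr, ssi, ssp


# Two pools of three genes each; all reads from a gene are identical,
# i.e., there is no sequencing error.
toy = [
    [[1, 1, 1], [0, 0], [1]],
    [[0], [0, 0, 0], [1, 1]],
]
ss_total, ssr, ssi, ssp = sums_of_squares(toy)
print(ss_total, ssr + ssi + ssp)
```

Because every gene in the toy data yields identical reads, the within-gene term SSR is zero and the total sum of squares is split between SSI and SSP only.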


As is shown in File S1, the expected sums of squares depend on the expectation of the allele frequency $\pi_k$ over all replicate populations sharing the same evolutionary history, as well as on the IIS probability $Q_{1:k}$ that two genes in the same pool are both of type $k$, and the IIS probability $Q_{2:k}$ that two genes from different pools are both of type $k$. Taking expectations (see the detailed computations in File S1), one has:

$$E(SSR_{:k}) = 0 \qquad (3)$$

for reads within individual genes, since we assume that there is no sequencing error, i.e., all the reads sequenced from a single gene are identical and $X_{ijr:k} = X_{ij\cdot:k}$ for all $r$. For reads between genes within pools, we get:

$$E(SSI_{:k}) = (C_1 - D_2)(\pi_k - Q_{1:k}), \qquad (4)$$

where $C_1 \equiv \sum_i \sum_j c_{ij} = \sum_i C_{1i}$ is the total number of reads in the full sample (total coverage), $C_{1i}$ is the coverage of the $i$th pool, and $D_2 \equiv \sum_i (C_{1i} + n_i - 1)/n_i$. $D_2$ arises from the assumption that the distribution of the read counts $c_{ij}$ is multinomial (i.e., that all genes contribute equally to the pool of reads; see Equation A15 in File S1). For reads between genes from different pools, we have:

$$E(SSP_{:k}) = \left(C_1 - \frac{C_2}{C_1}\right)(Q_{1:k} - Q_{2:k}) + (D_2 - D_2^\star)(\pi_k - Q_{1:k}), \qquad (5)$$

where $C_2 \equiv \sum_i C_{1i}^2$ and $D_2^\star \equiv \left[\sum_i C_{1i}(C_{1i} + n_i - 1)/n_i\right] / C_1$ (see Equation A16 in File S1). Rearranging Equation 4 and Equation 5 and summing over alleles, we get:

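$$Q_1 - Q_2 = \frac{(C_1 - D_2)\,E(SSP) - (D_2 - D_2^\star)\,E(SSI)}{(C_1 - D_2)(C_1 - C_2/C_1)} \qquad (6)$$

and

$$1 - Q_2 = \frac{(C_1 - D_2)\,E(SSP) + (n_c - 1)(D_2 - D_2^\star)\,E(SSI)}{(C_1 - D_2)(C_1 - C_2/C_1)}, \qquad (7)$$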

where $n_c \equiv (C_1 - C_2/C_1)/(D_2 - D_2^\star)$. Let $MSI \equiv SSI/(C_1 - D_2)$ and $MSP \equiv SSP/(D_2 - D_2^\star)$. Then, using the definition of $F_{ST}$ from Equation 1, we have:

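$$F_{ST} \equiv \frac{Q_1 - Q_2}{1 - Q_2} = \frac{E(MSP) - E(MSI)}{E(MSP) + (n_c - 1)\,E(MSI)}, \qquad (8)$$

which yields the method-of-moments estimator

$$\hat{F}_{ST}^{pool} = \frac{MSP - MSI}{MSP + (n_c - 1)\,MSI}, \qquad (9)$$

where

$$MSI = \frac{1}{C_1 - D_2} \sum_k \sum_i^{n_d} C_{1i}\,\hat{\pi}_{i:k}(1 - \hat{\pi}_{i:k}) \qquad (10)$$

and

$$MSP = \frac{1}{D_2 - D_2^\star} \sum_k \sum_i^{n_d} C_{1i}\,(\hat{\pi}_{i:k} - \hat{\pi}_k)^2 \qquad (11)$$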

(see Equations A25 and A26 in File S1). In Equation 10 and Equation 11, $\hat{\pi}_{i:k} \equiv X_{i\cdot\cdot:k}$ is the average frequency of reads of type $k$ within the $i$th pool, and $\hat{\pi}_k \equiv X_{\cdot\cdot\cdot:k}$ is the average frequency of reads of type $k$ in the full sample. Note that from the definition of $X_{\cdot\cdot\cdot:k}$, $\hat{\pi}_k \equiv \sum_i \sum_j \sum_r X_{ijr:k} / \sum_i \sum_j c_{ij} = \sum_i C_{1i}\hat{\pi}_{i:k} / \sum_i C_{1i}$ is the weighted average of the sample frequencies with weights equal to the pool coverage.

This is equivalent to the weighted analysis-of-variance in

Cockerham (1973) (see also Weir and Cockerham 1984;

Weir 1996; Weir and Hill 2002; Rousset 2007; Weir and

Table 1 Summary of main notations used

| Notation | Parameter definition |
| --- | --- |
| $X_{ijr:k}$ | Indicator variable: $X_{ijr:k} = 1$ if the $r$th read from the $j$th individual in the $i$th pool is of type $k$, and $X_{ijr:k} = 0$ otherwise |
| $r_{i:k} = \sum_j \sum_r X_{ijr:k}$ | Number of reads of type $k$ in the $i$th pool |
| $c_{ij}$ | Number of reads sequenced from individual $j$ in subpopulation $i$ (unobserved individual coverage) |
| $C_{1i} \equiv \sum_j c_{ij}$ | Total number of reads in the $i$th pool (pool coverage) |
| $C_1 \equiv \sum_i C_{1i}$ | Total number of reads in the full sample (total coverage) |
| $C_2 \equiv \sum_i C_{1i}^2$ | Squared number of reads in the full sample |
| $n_i$ | Total number of genes in the $i$th pool (haploid pool size) |
| $y_{i:k}$ | (Unobserved) number of genes of type $k$ in the $i$th pool |
| $\pi_k \equiv E(X_{ijr:k})$ | Expected frequency of reads of type $k$ in the full sample |
| $\hat{\pi}_{ij:k} \equiv X_{ij\cdot:k}$ | (Unobserved) average frequency of reads of type $k$ for individual $j$ in the $i$th pool |
| $\hat{\pi}_{i:k} \equiv X_{i\cdot\cdot:k}$ | Average frequency of reads of type $k$ in the $i$th pool |
| $\hat{\pi}_k \equiv X_{\cdot\cdot\cdot:k}$ | Average frequency of reads of type $k$ in the full sample |
| $Q_1$ (respectively $Q_2$) | IIS probability for two genes sampled within (respectively between) pools |
| $Q_1^r$ (respectively $Q_2^r$) | IIS probability for two reads sampled within (respectively between) pools |
| $\hat{Q}_1^{pool}$ (respectively $\hat{Q}_2^{pool}$) | Unbiased estimator of the IIS probability for genes sampled within (respectively between) pools |


Goudet 2017). Finally, the full expression of $\hat{F}_{ST}^{pool}$ in terms of sample frequencies develops as given in Equation 12.

If we take the limit case where each gene is sequenced exactly once, we recover the Ind-seq model: assuming $c_{ij} = 1$ for all $(i, j)$, then $C_1 = \sum_i^{n_d} n_i$, $C_2 = \sum_i^{n_d} n_i^2$, $D_2 = n_d$, and $D_2^\star = 1$. Therefore, $n_c = (C_1 - C_2/C_1)/(n_d - 1)$, and Equation 9 reduces exactly to the estimator of $F_{ST}$ for haploids: see Weir (1996), p. 182, and Rousset (2007), p. 977.
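The single-locus estimator of Equations 9 to 11 can be sketched in a few lines. This is an illustrative reading of the formulas, not the code of the poolfstat package; the function name and the input layout (`read_counts[i][k]` for the number of reads of allele k in pool i, `pool_sizes[i]` for the haploid pool size) are assumptions made here.

```python
def fst_pool(read_counts, pool_sizes):
    """Single-locus method-of-moments estimate (Equations 9 to 11)."""
    if len(read_counts) < 2 or len(read_counts) != len(pool_sizes):
        raise ValueError("need at least two pools, one size per pool")
    c1i = [sum(r) for r in read_counts]                    # pool coverages
    c1 = sum(c1i)                                          # total coverage
    c2 = sum(c * c for c in c1i)
    d2 = sum((c + n - 1) / n for c, n in zip(c1i, pool_sizes))
    d2_star = sum(c * (c + n - 1) / n
                  for c, n in zip(c1i, pool_sizes)) / c1
    nc = (c1 - c2 / c1) / (d2 - d2_star)
    ssi = ssp = 0.0
    for k in range(len(read_counts[0])):
        pi_k = sum(r[k] for r in read_counts) / c1         # coverage-weighted
        for r, c in zip(read_counts, c1i):
            pi_ik = r[k] / c
            ssi += c * pi_ik * (1.0 - pi_ik)
            ssp += c * (pi_ik - pi_k) ** 2
    msi = ssi / (c1 - d2)                                  # Equation 10
    msp = ssp / (d2 - d2_star)                             # Equation 11
    return (msp - msi) / (msp + (nc - 1.0) * msi)          # Equation 9


# Hypothetical read counts for two alleles in two pools of 20 haploids.
print(fst_pool([[30, 0], [0, 30]], [20, 20]))   # pools fixed for different alleles
print(fst_pool([[15, 15], [20, 20]], [20, 20])) # identical read frequencies
```

Like other analysis-of-variance estimators, the estimate can be negative when the pools are more similar than expected by chance, as in the second call. The function is undefined at a monomorphic site, where MSI and MSP are both zero.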

As in Reynolds et al. (1983), Weir and Cockerham (1984),

Weir (1996), and Rousset (2007), a multilocus estimate is

derived as the sum of locus-speciﬁc numerators over the sum

of locus-speciﬁc denominators:

$$\hat{F}_{ST} = \frac{\sum_l \left(MSP_l - MSI_l\right)}{\sum_l \left[MSP_l + (n_c - 1)\,MSI_l\right]}, \qquad (13)$$

where MSI and MSP are subscripted with $l$ to denote the $l$th locus. For Ind-seq data, Bhatia et al. (2013) refer to this multilocus estimate as a “ratio of averages” as opposed to an “average of ratios,” which would consist of averaging single-locus FST over loci. This approach is justified in the appendix

of Weir and Cockerham (1984) and in Bhatia et al. (2013),

who analyzed both estimates by means of coalescent simula-

tions. Note that Equation 13 assumes that the pool size is

equal across loci. Also note that the construction of the esti-

mator in Equation 13 is different from Weir and Cockerham’s

(1984). These authors deﬁned their multilocus estimator as a

ratio of sums of components of variance (a, b, and c in their

notation) over loci, which give the same weight to all loci

whatever the number of sampled genes at each locus. Equa-

tion 13 follows GENEPOP’s rationale (Rousset 2008) instead,

which gives more weight to loci that are more intensively

covered.
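The difference between the “ratio of averages” of Equation 13 and an “average of ratios” can be made concrete with a short sketch. The per-locus mean squares below are hypothetical numbers, and a single value of n_c is used for all loci, as Equation 13 is written; both are assumptions made for illustration.

```python
def ratio_of_averages(ms_per_locus, nc):
    """Equation 13: sum of numerators over sum of denominators."""
    num = sum(msp - msi for msp, msi in ms_per_locus)
    den = sum(msp + (nc - 1.0) * msi for msp, msi in ms_per_locus)
    return num / den


def average_of_ratios(ms_per_locus, nc):
    """Mean of the single-locus estimates (shown only for contrast)."""
    ratios = [(msp - msi) / (msp + (nc - 1.0) * msi)
              for msp, msi in ms_per_locus]
    return sum(ratios) / len(ratios)


# Hypothetical (MSP_l, MSI_l) pairs for three loci.
loci = [(0.30, 0.20), (0.02, 0.02), (0.10, 0.05)]
print(ratio_of_averages(loci, nc=50.0))
print(average_of_ratios(loci, nc=50.0))
```

The two summaries generally differ, because the ratio of averages lets loci with larger mean squares weigh more in the final estimate.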

Materials and Methods

Simulation study

Generating individual genotypes: We ﬁrst generated indi-

vidual genotypes using ms (Hudson 2002), assuming an

island model of population structure (Wright 1931). For

each simulated scenario, we considered eight demes, each made of $N = 5000$ haploid individuals. The migration rate ($m$) was fixed to achieve the desired value of FST (0.05 or 0.2), using equation 6 in Rousset (1996) leading to, e.g., $M \equiv 2Nm = 16.569$ for $F_{ST} = 0.05$ and $M = 3.489$ for $F_{ST} = 0.20$. The mutation rate was set at $\mu = 10^{-6}$, giving $\theta \equiv 2N\mu = 0.01$. We considered either fixed or variable sample sizes across demes. In the latter case, the haploid sample size $n$ was drawn independently for each deme from a Gaussian distribution with mean 100 and SD 30; this number was rounded up to the nearest integer, with a minimum of 20 and

maximum of 300 haploids per deme. We generated a very

large number of sequences for each scenario and sampled

independent single nucleotide polymorphisms (SNPs) from

sequences with a single segregating site. Each scenario was

replicated 50 times (500 times for Figure 3 and Figure S2).
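The draw of variable sample sizes described above could look like the following sketch. The text does not say whether out-of-range values were clipped or redrawn; clipping is assumed here, and the function names are hypothetical.

```python
import math
import random


def clip_sample_size(value, lower=20, upper=300):
    """Round up to the next integer, then clip to [lower, upper].

    Clipping (rather than redrawing) is an assumption of this sketch.
    """
    return max(lower, min(upper, math.ceil(value)))


def draw_haploid_sample_size(rng, mean=100.0, sd=30.0):
    """Draw one haploid sample size from a Gaussian with the given mean and SD."""
    return clip_sample_size(rng.gauss(mean, sd))


rng = random.Random(2018)  # arbitrary seed, for reproducibility only
sizes = [draw_haploid_sample_size(rng) for _ in range(8)]  # one size per deme
print(sizes)
```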

Pool sequencing: For each ms simulated data set, we gener-

ated Pool-seq data by drawing reads from a binomial distri-

bution (Gautier et al. 2013). More precisely, we assume that

for each SNP, the number $r_{i:k}$ of reads of allelic type $k$ in pool $i$ follows:

$$r_{i:k} \sim \mathrm{Bin}\left(\frac{y_{i:k}}{n_i},\, d_i\right), \qquad (14)$$

where $y_{i:k}$ is the number of genes of type $k$ in the $i$th pool, $n_i$ is the total number of genes in pool $i$ (haploid pool size), and $d_i$ is the simulated total coverage for pool $i$. In the following, we either consider a fixed coverage, with $d_i = D$ for all pools and loci, or a varying coverage across pools and loci, with $d_i \sim \mathrm{Pois}(D)$.
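The read-sampling step of Equation 14 can be sketched with the Python standard library alone. The Poisson sampler below uses Knuth's multiplication method, which is adequate for the coverages considered here; the function names and the toy inputs are assumptions of this sketch, not part of the original simulation code.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with mean lam (Knuth's method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_pool_reads(rng, y, n, coverage, poisson_coverage=False):
    """Return (r, d): reads of the focal allele and total depth for one pool.

    y is the number of genes carrying the allele among the n genes pooled;
    the depth d is either fixed or Poisson distributed around `coverage`.
    """
    d = poisson(rng, coverage) if poisson_coverage else coverage
    freq = y / n
    r = sum(1 for _ in range(d) if rng.random() < freq)    # Bin(y / n, d)
    return r, d


rng = random.Random(14)  # arbitrary seed
print(simulate_pool_reads(rng, y=30, n=100, coverage=50))
print(simulate_pool_reads(rng, y=30, n=100, coverage=50, poisson_coverage=True))
```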

Sequencing error: We simulated sequencing errors occurring

at rate $\mu_e = 0.001$, which is typical of Illumina sequencers

(Glenn 2011; Ross et al. 2013). We assumed that each se-

quencing error modiﬁes the allelic type of a read to one of

three other possible states with equal probability (there are

therefore four allelic types in total, corresponding to four

nucleotides). Note that only biallelic markers are retained

in the ﬁnal data sets. Also note that, since we initiated this

procedure with polymorphic markers only, we neglect se-

quencing errors that would create spurious SNPs from mono-

morphic sites. However, such SNPs should be rare in real data

sets, since markers with a low minimum read count (MRC)

are generally ﬁltered out.
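The error model described above, in which an erroneous read is changed to one of the three other nucleotides with equal probability, can be sketched as follows. Representing reads as a list of nucleotide characters is an assumption made for this illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def add_sequencing_errors(rng, reads, error_rate=0.001):
    """Return a copy of `reads` in which each base is miscalled with
    probability `error_rate`, uniformly among the three other bases."""
    noisy = []
    for base in reads:
        if rng.random() < error_rate:
            base = rng.choice([b for b in NUCLEOTIDES if b != base])
        noisy.append(base)
    return noisy


rng = random.Random(7)  # arbitrary seed
reads = ["A"] * 60 + ["G"] * 40  # hypothetical reads at a biallelic SNP
noisy = add_sequencing_errors(rng, reads, error_rate=0.001)
print(sum(a != b for a, b in zip(reads, noisy)), "miscalled reads")
```

Sites where errors produce a third or fourth allelic type would then be discarded, since only biallelic markers are retained.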

$$\hat{F}_{ST}^{pool} = \frac{\sum_k \left[(C_1 - D_2) \sum_i^{n_d} C_{1i}(\hat{\pi}_{i:k} - \hat{\pi}_k)^2 - (D_2 - D_2^\star) \sum_i^{n_d} C_{1i}\hat{\pi}_{i:k}(1 - \hat{\pi}_{i:k})\right]}{\sum_k \left[(C_1 - D_2) \sum_i^{n_d} C_{1i}(\hat{\pi}_{i:k} - \hat{\pi}_k)^2 + (n_c - 1)(D_2 - D_2^\star) \sum_i^{n_d} C_{1i}\hat{\pi}_{i:k}(1 - \hat{\pi}_{i:k})\right]}. \qquad (12)$$

Experimental error: Nonequimolar amounts of DNA from all individuals in a pool and stochastic variation in the amplification efficiency of individual DNAs are sources of experimental errors in Pool-seq. To simulate experimental errors, we used the model derived by Gautier et al. (2013). In this model, it is assumed that the contribution $\eta_{ij} = c_{ij}/C_{1i}$ of each gene $j$ to the total coverage of the $i$th pool ($C_{1i}$) follows a Dirichlet distribution:

$$\{\eta_{ij}\}_{1 \le j \le n_i} \sim \mathrm{Dir}\left(\frac{\rho}{n_i}\right), \qquad (15)$$

where the parameter $\rho$ controls the dispersion of gene contributions around the value $\eta_{ij} = 1/n_i$, which is expected if all genes contributed equally to the pool of reads. For convenience, we define the experimental error $\varepsilon$ as the coefficient of variation of $\eta_{ij}$, i.e., $\varepsilon \equiv \sqrt{V(\eta_{ij})}/E(\eta_{ij}) = \sqrt{(n_i - 1)/(\rho + 1)}$ (see Gautier et al. 2013). When $\varepsilon$ tends toward 0 (or equivalently, when $\rho$ tends to infinity), all individuals contribute equally to the pool and there is no experimental error. We tested the robustness of our estimates to values of $\varepsilon$ between 0.05 and 0.5. The case $\varepsilon = 0.5$ could correspond, for example, to a situation where (for $n_i = 10$) five individuals contribute 2.8× more reads than the other five individuals.
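The experimental-error model can be sketched by inverting the coefficient-of-variation formula to obtain the Dirichlet parameter from a target error level, and by drawing the gene contributions as normalized gamma variates, which is a standard way to sample from a Dirichlet distribution. The function names are hypothetical, and the allocation of reads to genes with `random.choices` is an assumption of this sketch.

```python
import random


def rho_from_error(n, error):
    """Invert error = sqrt((n - 1) / (rho + 1)) for the Dirichlet parameter."""
    rho = (n - 1) / error ** 2 - 1.0
    if rho <= 0:
        raise ValueError("experimental error too large for this pool size")
    return rho


def draw_contributions(rng, n, error):
    """Draw the n gene contributions from a symmetric Dirichlet(rho / n)."""
    rho = rho_from_error(n, error)
    gammas = [rng.gammavariate(rho / n, 1.0) for _ in range(n)]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_gene_read_counts(rng, contributions, coverage):
    """Allocate `coverage` reads to genes in proportion to their contributions."""
    counts = [0] * len(contributions)
    for j in rng.choices(range(len(contributions)),
                         weights=contributions, k=coverage):
        counts[j] += 1
    return counts


rng = random.Random(15)  # arbitrary seed
eta = draw_contributions(rng, n=10, error=0.5)
print(draw_gene_read_counts(rng, eta, coverage=100))
```

For a pool of 10 genes and an experimental error of 0.5, the Dirichlet parameter equals 35, so each gene receives a gamma shape of 3.5.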

Other estimators

For the sake of clarity, a summary of the notation of

the FST estimators used throughout this article is given

in Table 2.

PP2d: This estimator of FST is implemented by default in the

software package PoPoolation2 (Koﬂer et al. 2011). It is

based on a deﬁnition of the parameter FST as the overall re-

duction in average heterozygosity relative to the total com-

bined population (see, e.g., Nei and Chesser 1983):

$$\mathrm{PP2}_d \equiv \frac{\hat{H}_T - \hat{H}_S}{\hat{H}_T}, \qquad (16)$$

where $\hat{H}_S$ is the average heterozygosity within subpopulations, and $\hat{H}_T$ is the average heterozygosity in the total population (obtained by pooling together all subpopulations to form a single virtual unit). In PoPoolation2, $\hat{H}_S$ is the unweighted average of within-subpopulation heterozygosities:

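$$\hat{H}_S = \frac{1}{n_d} \sum_i^{n_d} \left(\frac{n_i}{n_i - 1}\right) \left(\frac{C_{1i}}{C_{1i} - 1}\right) \left(1 - \sum_k \hat{\pi}_{i:k}^2\right) \qquad (17)$$

(using the notation from Table 1). Note that in PoPoolation2, PP2d is restricted to the case of two subpopulations only ($n_d = 2$). The two ratios in the right-hand side of Equation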

17 are presumably borrowed from Nei (1978) to provide an

unbiased estimate, although we found no formal justiﬁcation

for the expression in Equation 17 for Pool-seq data. The total

heterozygosity is computed as (using the notation from

Table 1):

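$$\hat{H}_T = \left(\frac{\min_i(n_i)}{\min_i(n_i) - 1}\right) \left(\frac{\min_i(C_{1i})}{\min_i(C_{1i}) - 1}\right) \left(1 - \sum_k \hat{\pi}_k^2\right). \qquad (18)$$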
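Equations 16 to 18 translate into the short sketch below. It follows the formulas as written in this article and is not the PoPoolation2 source code; the function name and the input layout (`read_counts[i][k]`, `pool_sizes[i]`) are assumptions.

```python
def pp2d(read_counts, pool_sizes):
    """Heterozygosity-based estimate following Equations 16 to 18."""
    c1i = [sum(r) for r in read_counts]                    # pool coverages
    c1 = sum(c1i)
    if min(c1i) < 2 or min(pool_sizes) < 2:
        raise ValueError("each pool needs at least 2 reads and 2 genes")
    # Equation 17: unweighted mean of corrected within-pool heterozygosities.
    hs = 0.0
    for r, c, n in zip(read_counts, c1i, pool_sizes):
        homozygosity = sum((x / c) ** 2 for x in r)
        hs += (n / (n - 1)) * (c / (c - 1)) * (1.0 - homozygosity)
    hs /= len(read_counts)
    # Equation 18: total heterozygosity from coverage-weighted frequencies.
    n_alleles = len(read_counts[0])
    pi_k = [sum(r[k] for r in read_counts) / c1 for k in range(n_alleles)]
    n_min, c_min = min(pool_sizes), min(c1i)
    ht = (n_min / (n_min - 1)) * (c_min / (c_min - 1)) \
        * (1.0 - sum(p * p for p in pi_k))
    return (ht - hs) / ht                                  # Equation 16


# Hypothetical read counts for two alleles in two pools of 20 haploids.
print(pp2d([[50, 0], [0, 50]], [20, 20]))
print(pp2d([[40, 10], [15, 35]], [20, 20]))
```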

PP2a: This is the alternative estimator of FST provided in the

software package PoPoolation2. It is based on an interpreta-

tion by Koﬂer et al. (2011) of Karlsson et al.’s (2007) estima-

tor of FST, as:

$$\mathrm{PP2}_a \equiv \frac{\hat{Q}_1^r - \hat{Q}_2^r}{1 - \hat{Q}_2^r}, \qquad (19)$$

where $\hat{Q}_1^r$ and $\hat{Q}_2^r$ are the frequencies of identical pairs of reads within and between pools, respectively, computed by simple counting of IIS pairs. These are estimates of $Q_1^r$, the IIS probability for two reads in the same pool (whether they are sequenced from the same gene or not), and $Q_2^r$, the IIS probability for two reads in different pools. Note that the IIS probability $Q_1^r$ is different from $Q_1$ in Equation 1, which, from our definition, represents the IIS probability between distinct genes in the same pool. This approach therefore confounds pairs of reads within pools that are identical because they were sequenced from a single gene with pairs of reads that are identical because they were sequenced from distinct, yet IIS, genes.
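The “simple counting of IIS pairs” can be illustrated as follows for two pools. The counting formulas (pairs of distinct reads within a pool, all cross pairs between pools) and the unweighted average over the two pools are assumptions of this sketch; they are one plausible reading of Equation 19 and not necessarily what PoPoolation2 computes.

```python
def read_identity_within(reads):
    """Fraction of pairs of distinct reads in one pool that are identical."""
    c = sum(reads)
    return sum(r * (r - 1) for r in reads) / (c * (c - 1))


def read_identity_between(reads_a, reads_b):
    """Fraction of read pairs, one read per pool, that are identical."""
    identical = sum(a * b for a, b in zip(reads_a, reads_b))
    return identical / (sum(reads_a) * sum(reads_b))


def pp2a(reads_a, reads_b):
    """Equation 19 for two pools, averaging the within-pool term (assumption)."""
    q1 = 0.5 * (read_identity_within(reads_a) + read_identity_within(reads_b))
    q2 = read_identity_between(reads_a, reads_b)
    return (q1 - q2) / (1.0 - q2)


# Hypothetical read counts per allele in two pools.
print(pp2a([40, 10], [15, 35]))
```

Nothing in these counts depends on the number of genes in the pools, which illustrates why reads drawn from the same gene cannot be told apart from reads drawn from distinct IIS genes.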

FRP13: This estimator of FST was developed by Ferretti et al.

(2013) (see their equations 3, 10, 11, 12, and 13). Ferretti

et al. (2013) use the same deﬁnition of FST as in Equation 16

above, although they estimate heterozygosities within and

between pools as “average pairwise nucleotide diversities,”

which, from their deﬁnitions, are formally equivalent to IIS

probabilities. In particular, they estimate the average hetero-

zygosity within pools as (using the notation from Table 1):

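$$\hat{H}_S = \frac{1}{n_d} \sum_i^{n_d} \left(\frac{n_i}{n_i - 1}\right) \left(1 - \hat{Q}_{1i}^r\right) \qquad (20)$$

and the total heterozygosity among the $n_d$ populations as:

$$\hat{H}_T = \frac{1}{n_d^2} \left[\sum_i^{n_d} \left(\frac{n_i}{n_i - 1}\right) \left(1 - \hat{Q}_{1i}^r\right) + \sum_{i \ne i'}^{n_d} \left(1 - \hat{Q}_{2ii'}^r\right)\right]. \qquad (21)$$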
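Equations 16, 20, and 21 can be combined into the sketch below. The read-pair identities are obtained by direct counting, which is an assumption of this illustration, as are the function name and the input layout; the code is not taken from Ferretti et al. (2013).

```python
def frp13(read_counts, pool_sizes):
    """Estimate following Equations 16, 20, and 21 for n_d pools."""
    nd = len(read_counts)
    cov = [sum(r) for r in read_counts]

    def q1r(i):
        # Identity between pairs of distinct reads within pool i.
        return sum(x * (x - 1) for x in read_counts[i]) / (cov[i] * (cov[i] - 1))

    def q2r(i, j):
        # Identity between one read from pool i and one read from pool j.
        pairs = sum(a * b for a, b in zip(read_counts[i], read_counts[j]))
        return pairs / (cov[i] * cov[j])

    within = [(n / (n - 1)) * (1.0 - q1r(i)) for i, n in enumerate(pool_sizes)]
    between = sum(1.0 - q2r(i, j)
                  for i in range(nd) for j in range(nd) if i != j)
    hs = sum(within) / nd                                  # Equation 20
    ht = (sum(within) + between) / nd ** 2                 # Equation 21
    return (ht - hs) / ht                                  # Equation 16


# Hypothetical read counts for two alleles in two pools of 20 haploids.
print(frp13([[40, 10], [15, 35]], [20, 20]))
```

The double sum over ordered pairs of distinct pools contributes n_d(n_d - 1) terms, which together with the n_d within-pool terms matches the 1/n_d² normalization of Equation 21.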

Analyses of Ind-seq data

For the comparison of Ind-seq and Pool-seq data sets, we

computed FST on subsamples of 5000 loci. These subsamples

were deﬁned so that only those loci that were polymorphic in

all coverage conditions were retained, and the same loci were

Table 2 Definition of the FST estimators used in the text

| Notation | Definition |
| --- | --- |
| $\hat{F}_{ST}^{pool}$ | Equation 12 |
| FRP13 | Ferretti et al. (2013) and Equation 16, Equation 20, and Equation 21 |
| NC83 | Nei and Chesser (1983) |
| PP2d | Kofler et al. (2011) and Equation 16, Equation 17, and Equation 18 |
| PP2a | Kofler et al. (2011) and Equation 19 |
| WC84 | Weir and Cockerham (1984) |


used for the analysis of the corresponding Ind-seq data. For

the latter, we used either the Nei and Chesser’s (1983) esti-

mator based on a ratio of heterozygosity (see Equation 16

above), hereafter denoted by NC83; or the analysis-of-variance estimator developed by Weir and Cockerham (1984), hereafter denoted by WC84.

All the estimators were computed using custom functions

in the R software environment for statistical computing,

version 3.3.1 (R Core Team 2017). All of these functions

were carefully checked against available software packages

to ensure that they provided strictly identical estimates.

Application example: C. asper

Dennenmoser et al. (2017) investigated the genomic basis

of adaption to osmotic conditions in the prickly sculpin (C.

asper), an abundant euryhaline ﬁsh in northwestern North

America. To do so, they sequenced the whole genome of

pools of individuals from two estuarine populations (Capi-

lano River Estuary, CR; Fraser River Estuary, FE) and two

freshwater populations (Pitt Lake, PI; Hatzic Lake, HZ) in

southern British Columbia (Canada). We downloaded the

four corresponding BAM ﬁles from the Dryad Digital

Repository (http://dx.doi.org/10.5061/dryad.2qg01) and

combined them into a single mpileup ﬁle using SAMtools

version 0.1.19 (Li et al. 2009) with default options, except

the maximum depth per BAM that was set to 5000 reads. The

resulting ﬁle was further processed using a custom awk script

to call SNPs and compute read counts, after discarding bases

with a base alignment quality (BAQ) score <25. A position was then considered a SNP if: (1) only two different nucleotides with a read count >1 were observed (nucleotides with ≤1 read being considered as a sequencing error); (2) the coverage was between 10 and 300 in each of the four alignment files; (3) the minor allele frequency, as computed from read counts, was ≥0.01 in the four populations. The final

data set consisted of 608,879 SNPs.
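The three SNP-calling criteria can be sketched as a filter on per-pool nucleotide counts. This is not the awk script used for the analysis: the text leaves several details open, so the sketch assumes that nucleotide counts are summed over pools for criterion 1, that coverage is counted over the two retained nucleotides for criterion 2, and that the minor allele frequency is computed on the pooled read counts for criterion 3.

```python
def is_snp(pools, min_cov=10, max_cov=300, min_maf=0.01):
    """Apply the three filtering criteria to one genomic position.

    pools is a list with one dict per pool, mapping nucleotide to read
    count (hypothetical layout).
    """
    totals = {}
    for counts in pools:
        for base, count in counts.items():
            totals[base] = totals.get(base, 0) + count
    # (1) exactly two nucleotides supported by more than one read.
    alleles = [base for base, count in totals.items() if count > 1]
    if len(alleles) != 2:
        return False
    # (2) coverage within bounds in every pool.
    for counts in pools:
        coverage = sum(counts.get(base, 0) for base in alleles)
        if not min_cov <= coverage <= max_cov:
            return False
    # (3) minor allele frequency computed from read counts.
    first, second = (totals[base] for base in alleles)
    return min(first, second) / (first + second) >= min_maf


# Hypothetical counts in four pools; the single T read is treated as an error.
position = [{"A": 30, "G": 10}, {"A": 25, "G": 15, "T": 1},
            {"A": 40}, {"A": 12, "G": 8}]
print(is_snp(position))
```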

Our aim here was to compare the population structure

inferred from pairwise estimates of FST using the estimator $\hat{F}_{ST}^{pool}$ (Equation 12) with that of PP2d. To determine which of the two estimators performs better, we then compared the population structure inferred from $\hat{F}_{ST}^{pool}$ and PP2d to that

inferred from the Bayesian hierarchical model implemented

in the software package BayPass (Gautier 2015). BayPass

allows the robust estimation of the scaled covariance matrix

of allele frequencies across populations for Pool-seq data,

which is known to be informative about population history

(Pickrell and Pritchard 2012). The elements of the estimated

matrix can be interpreted as pairwise and population-speciﬁc

estimates of differentiation (Coop et al. 2010) and therefore

provide a comprehensive description of population structure

that makes full use of the available data.

Data availability

An R package called poolfstat, which implements FST esti-

mates for Pool-seq data, is available at the Comprehensive

R Archive Network (CRAN): https://cran.r-project.org/web/packages/poolfstat/index.html.

The authors state that all data necessary for conﬁrming the

conclusions presented in this article are fully represented

within the article, ﬁgures, and tables. Supplemental material

(including Figures S1–S4, Tables S1–S3, and a complete derivation of the model in File S1) is available at Figshare: https://doi.org/10.25386/genetics.6856781.

Results

Comparing Ind-seq and Pool-seq estimates of FST

Single-locus estimates of $\hat{F}_{ST}^{pool}$ are highly correlated with the classical estimates of WC84 (Weir and Cockerham 1984) computed on the individual data that were used to generate the pools in our simulations (see Figure 1). The variance of $\hat{F}_{ST}^{pool}$ across independent replicates decreases as the coverage increases. The correlation between $\hat{F}_{ST}^{pool}$ and WC84 is stronger for multilocus estimates (see Figure S1A).

Comparing Pool-seq estimators of FST

We found that our estimator $\hat{F}_{ST}^{pool}$ has extremely low bias (<0.5% over all scenarios tested: see Table 3 and Tables S1–S3). In other words, the average estimates across multiple

Figure 1 Single-locus estimates of FST. We compared single-locus estimates of FST based on allele count data inferred from individual genotypes (Ind-seq), using the WC84 estimator, to $\hat{F}_{ST}^{pool}$ estimates from Pool-seq data. We simulated 5000 SNPs using ms in an island model with $n_d = 8$ demes. We used two migration rates corresponding to (A) FST = 0.05 and (B) FST = 0.20. The size of each pool was fixed to 100. We show the results for different coverages (20×, 50×, and 100×). In each graph, the cross indicates the simulated value of FST.


loci and replicates closely equal the expected value of the FST

parameter, as given by equation 6 in Rousset (1996), which is

based on the computation of IIS probabilities in an island

model of population structure. In all the situations examined,

the bias does not depend on the sample size (i.e., the size of each pool) or on the coverage (see Figure 2). Only the variance of the estimator across independent replicates decreases as the sample size increases and/or as the coverage increases. At high coverage, the mean and root mean squared error (RMSE) of $\hat{F}_{ST}^{pool}$ over independent replicates are virtually indistinguishable from those of the WC84 estimator (see Table S1).

Figure 3 shows the RMSE of FST estimates for a wide range

of pool sizes and coverages. The RMSE decreases as the pool

size and/or the coverage increases. The FST estimates are

more precise and accurate when differentiation is low. Figure

3 provides some clues to evaluate the pool size and the coverage that is necessary to achieve the same RMSE as for Ind-seq data. Consider, for example, the case of samples of $n = 20$ haploids. For $F_{ST} \le 0.05$ (in the conditions of our simulations), the RMSE of FST estimates based on Pool-seq data tends to the RMSE of FST estimates based on Ind-seq data either by sequencing pools of 200 haploids at 20×, or by sequencing pools of 20 haploids at 200×. However, the same precision and accuracy are achieved by sequencing 50 haploids at 50×.

Conversely, we found that PP2d (the default estimator of FST implemented in the software package PoPoolation2) is biased when compared to the expected value of the parameter. We observed that the bias depends on both the sample size and the coverage (see Figure 2). We note that, as the coverage and the sample size increase, PP2d converges to the estimator NC83 (Nei and Chesser 1983) computed from individual data (see Figure S1B). This argument was used by Kofler et al. (2011) to validate their approach, even though the estimates of PP2d depart from the true value of the parameter (Figure S1, B and C).

The second of the two estimators of FST implemented in

PoPoolation2, which we refer to as PP2a, is also biased (see Figure 2). We note that the bias decreases as the sample size increases. However, the bias does not depend on the coverage (only the variance over independent replicates depends on coverage). The estimator developed by Ferretti et al. (2013), which we refer to as FRP13, is also biased (see Figure 2). However, the bias does not depend on the pool size or on the coverage (only the variance over independent replicates depends on coverage). FRP13 converges to the estimator NC83, computed from individual data (see Figure 2). At high coverage, the mean and RMSE over independent replicates are virtually indistinguishable from those of the NC83 estimator.

Lastly, we stress that our estimator $\hat{F}_{ST}^{pool}$ provides estimates

for multiple populations and is therefore not restricted to

pairwise analyses, contrary to PoPoolation2’s estimators.

We show that, even at low sample size and low coverage,

Pool-seq estimates of differentiation are virtually indistin-

guishable from classical estimates for Ind-seq data (see

Table 3).

Robustness to unbalanced pool sizes and variable

sequencing coverage

We evaluated the accuracy and the precision of the estimator

^

Fpool

ST when sample sizes differ across pools and when the

coverage varies across pools and loci (see Figure 4). We

found that, at low coverage, unequal sampling or variable

coverage causes a negligible departure from the median of

WC84 estimates computed on individual data, which vanishes

as the coverage increases. At 1003coverage, the distribution

of ^

Fpool

ST estimates is almost indistinguishable from that of

WC84 (see Figure 4 and Tables S2 and S3).
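The sampling designs described in the legend of Figure 4 can be sketched in a few lines of Python. This is only an illustration of the stated distributions (Gaussian pool sizes rounded up and bounded between 20 and 300 haploids, Poisson-distributed coverage); the function names, argument names, and seed are hypothetical and are not taken from the article's simulation code.

```python
import numpy as np


def draw_pool_sizes(n_demes, rng, mean=100.0, sd=30.0, lower=20, upper=300):
    """Haploid pool sizes: Gaussian draws, rounded up, bounded to [lower, upper]."""
    raw = rng.normal(mean, sd, size=n_demes)
    return np.clip(np.ceil(raw), lower, upper).astype(int)


def draw_coverage(n_loci, n_demes, mean_coverage, rng):
    """Read coverage per locus and per deme, drawn from a Poisson distribution."""
    return rng.poisson(mean_coverage, size=(n_loci, n_demes))


if __name__ == "__main__":
    rng = np.random.default_rng(2018)  # arbitrary seed, for reproducibility only
    # Scenario 1: variable pool sizes across 8 demes
    print(draw_pool_sizes(8, rng))
    # Scenario 2: fixed pool size, variable coverage with mean 50
    print(draw_coverage(5, 8, 50, rng))
```

In the article the two sources of variation are studied separately: pool sizes vary in one scenario, and coverage varies (with pool size fixed at 100) in the other.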

Robustness to sequencing and experimental errors

Figure 5 shows that sequencing errors cause a negligible negative bias for $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ estimates. Filtering (using an MRC of 4)

improves estimation slightly, but only at high coverage (Fig-

ure 6B). It must be noted, however, that ﬁltering increases

the bias in the absence of sequencing error, especially at low

coverage (Figure 6A). With experimental error, i.e., when

individuals do not contribute evenly to the ﬁnal set of reads,

we observed a positive bias for $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ estimates (Figure 5). We

note that the bias decreases as the size of the pools increases.

Figure S2 shows the RMSE of FST estimates for a wider range

of pool sizes, coverage, and experimental error rate ($\varepsilon$). For $\varepsilon \geq 0.25$, increasing the coverage cannot improve the quality of the inference if the pool size is too small. When Pool-seq experiments are prone to large experimental error rates, increasing the size of pools is the only way to improve the estimation of FST. Filtering (using an MRC of 4) does not

improve estimation (Figure 6C).

Table 3 Overall FST estimates from multiple pools

| FST | n | Coverage | Pool-seq: $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ | Ind-seq: WC84 |
|------|-----|------|---------------|---------------|
| 0.05 | 10 | 20× | 0.050 (0.002) | |
| 0.05 | 10 | 50× | 0.051 (0.002) | 0.050 (0.002) |
| 0.05 | 10 | 100× | 0.050 (0.002) | |
| 0.05 | 100 | 20× | 0.050 (0.001) | |
| 0.05 | 100 | 50× | 0.050 (0.001) | 0.051 (0.001) |
| 0.05 | 100 | 100× | 0.050 (0.001) | |
| 0.20 | 10 | 20× | 0.200 (0.002) | |
| 0.20 | 10 | 50× | 0.201 (0.002) | 0.201 (0.002) |
| 0.20 | 10 | 100× | 0.201 (0.002) | |
| 0.20 | 100 | 20× | 0.201 (0.003) | |
| 0.20 | 100 | 50× | 0.202 (0.003) | 0.203 (0.003) |
| 0.20 | 100 | 100× | 0.203 (0.003) | |

Multilocus $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ estimates were computed for various conditions of expected FST, pool size (n), and coverage in an island model with $n_d = 8$ subpopulations (pools). The mean (RMSE) is over 50 independent simulated data sets, each made of 5000 loci. For comparison, we computed multilocus WC84 estimates from individual genotypes (Ind-seq); a single WC84 value is reported for each combination of FST and n.

Application example

The reanalysis of the prickly sculpin data revealed larger pairwise estimates of multilocus FST using the PP2d estimator, as compared to $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ (see Figure 7A). Furthermore, we found that $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ estimates are smaller for within-ecotype pairwise comparisons as compared to between-ecotype comparisons. Therefore, the inferred relationships between samples based on pairwise $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ estimates show a clear-cut structure, separating the two estuarine samples from the freshwater ones (see Figure 7C). We did not recover the same structure using PP2d estimates (see Figure 7B). Additionally, the scaled covariance matrix of allele frequencies across samples is consistent with the structure inferred from $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ estimates (see Figure 7D).

Figure 2 Precision and accuracy of pairwise estimators of FST. We considered two estimators based on allele count data inferred from individual genotypes (Ind-seq): WC84 and NC83. For Pool-seq data, we computed the two estimators implemented in the software package PoPoolation2, which we refer to as PP2d and PP2a, as well as the FRP13 estimator and our estimator $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$. Each boxplot represents the distribution of multilocus FST estimates across all pairwise comparisons in an island model with $n_d = 8$ demes and across 50 independent replicates of the ms simulations. We used two migration rates, corresponding to (A and B) FST = 0.05 and (C and D) FST = 0.20. The size of each pool was either fixed to (A and C) 10 or to (B and D) 100. For Pool-seq data, we show the results for different coverages (20×, 50×, and 100×). In each graph, the dashed line indicates the simulated value of FST and the dotted line indicates the median of the distribution of NC83 estimates.

Figure 3 (A–F) Precision and accuracy of our estimator $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ as a function of pool size and coverage for simulated FST values ranging from 0.005 to 0.2. Each density plot, which represents the RMSE of the estimator $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$, was obtained using simple linear interpolation from a set of 44 × 44 pairs of pool size and coverage values. For each pool size and coverage, 500 replicates of 5000 markers were simulated from an island model with $n_d = 8$ demes. White isolines represent the RMSE of the WC84 estimator computed from Ind-seq data for various sample sizes (n = 5, 10, 20, and 50). Each isoline was fitted using a thin plate spline regression with smoothing parameter $\lambda = 0.005$, implemented in the fields package for R (Nychka et al. 2017).

Discussion

Whole-genome sequencing of pools of individuals is increasingly popular for population genomic research on both model and nonmodel species (Schlötterer et al. 2014). The development of dedicated software packages (reviewed in Schlötterer et al. 2014) undoubtedly has something to do with the breadth of research questions that have been tackled using Pool-seq. However, the analysis of population structure from Pool-seq data is complicated by the double sampling process of genes from the pool and sequence reads from those genes (Ferretti et al. 2013).
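The effect of this double sampling on the variance of allele frequencies can be illustrated with a short simulation. The sketch below is not taken from the article: it assumes a single biallelic locus with population allele frequency `p`, binomial sampling of `pool_size` genes into the pool, and binomial sampling of `coverage` reads from that pool. All function and variable names are hypothetical. Under these assumptions the variance of the read frequency is p(1 − p)[1/n + (1 − 1/n)/c], which exceeds the variance p(1 − p)/n of the gene frequency in the pool.

```python
import numpy as np


def simulate_frequencies(p, pool_size, coverage, n_rep, seed=0):
    """Two-step sampling: genes into the pool, then reads from the pool."""
    rng = np.random.default_rng(seed)
    # Step 1: sample pool_size genes from a population with allele frequency p
    allele_counts = rng.binomial(pool_size, p, size=n_rep)
    pool_freq = allele_counts / pool_size
    # Step 2: sample reads from the genes present in each pool
    read_counts = rng.binomial(coverage, pool_freq)
    read_freq = read_counts / coverage
    return pool_freq, read_freq


def expected_variances(p, pool_size, coverage):
    """Expected variance of the gene frequency and of the read frequency."""
    var_genes = p * (1 - p) / pool_size
    var_reads = p * (1 - p) * (1 / pool_size + (1 - 1 / pool_size) / coverage)
    return var_genes, var_reads


if __name__ == "__main__":
    pool_freq, read_freq = simulate_frequencies(0.3, 20, 50, 200_000)
    print("simulated:", pool_freq.var(), read_freq.var())
    print("expected: ", expected_variances(0.3, 20, 50))
```

Treating read counts as if they were allele counts amounts to ignoring the second term of the read-frequency variance, which is why the discrepancy is largest when the pool size is small relative to the coverage.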

The naive approach that consists of computing FST from

read counts as if they were allele counts (e.g., as in Chen et al.

2016) ignores the extra variance brought by the random

sampling of reads from the gene pool during Pool-seq exper-

iments. Furthermore, such computation fails to consider the

actual number of lineages in the pool (haploid pool size).

Altogether, these limits may result in severely biased esti-

mates of differentiation when the pool size is low (see Figure

S3). A possible alternative is to compute FST from allele counts

imputed from read counts using a maximum-likelihood

approach conditional on the haploid size of the pools

(e.g., as in Smadja et al. 2012; Leblois et al. 2018), or from

allele frequencies estimated using a model-based method

which accounts for the sampling effects and the sequenc-

ing error probabilities inherent to pooled NGS experiments

(see Fariello et al. 2017). However, these latter approaches

may only be accurate in situations where the coverage is

much larger than pool size, allowing for a reduction of the

sampling variance of reads (see Figure S3). We therefore

developed a new estimator of the parameter FST for Pool-seq data in an analysis-of-variance framework (Cockerham 1969, 1973). The accuracy of this estimator is barely distinguishable from that of Weir and Cockerham’s (1984) estimator for individual data. Furthermore, it does not depend

on the pool size or on the coverage, and it is robust to unequal

pool sizes and varying coverage across demes and loci.
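To make the idea of a method-of-moments estimator built from read counts more concrete, here is a minimal Python sketch for biallelic loci. It is a reconstruction under stated assumptions, not a copy of the article's equations, which are not reproduced in this section. The assumptions are that reads are drawn with replacement and with equal probability from the n_i genes of pool i, that the mean squares are scaled so that their expectations are 1 − Q1 (within pools) and (1 − Q1) + n_c(Q1 − Q2) (between pools), and that the multilocus estimate is a ratio of sums over loci. The function name `pool_fst_anova` and its arguments are hypothetical.

```python
import numpy as np


def pool_fst_anova(ref_reads, coverage, pool_sizes):
    """Multilocus FST from read counts (sketch).

    ref_reads, coverage: arrays of shape (n_loci, n_pools), coverage > 1.
    pool_sizes: haploid pool sizes, shape (n_pools,), all > 1.
    """
    x = np.asarray(ref_reads, dtype=float)
    c = np.asarray(coverage, dtype=float)
    n = np.asarray(pool_sizes, dtype=float)

    c1 = c.sum(axis=1)                                  # total coverage per locus
    c2 = (c ** 2).sum(axis=1)
    d2 = ((c + n - 1) / n).sum(axis=1)                  # correction for reads sharing a gene
    d2_star = (c * (c + n - 1) / n).sum(axis=1) / c1

    pi = x / c                                          # read frequency per pool
    pi_bar = x.sum(axis=1) / c1                         # coverage-weighted mean frequency

    # Factor 2: both alleles of a biallelic locus contribute the same term
    ss_within = 2 * (c * pi * (1 - pi)).sum(axis=1)
    ss_between = 2 * (c * (pi - pi_bar[:, None]) ** 2).sum(axis=1)

    ms_within = ss_within / (c1 - d2)
    ms_between = ss_between / (d2 - d2_star)
    n_c = (c1 - c2 / c1) / (d2 - d2_star)

    numerator = ms_between - ms_within
    denominator = ms_between + (n_c - 1) * ms_within
    return numerator.sum() / denominator.sum()          # ratio of sums over loci
```

With equal coverage and very large pools, `d2` tends to the number of pools and `d2_star` to 1, so the degrees of freedom reduce to those of a classical one-way analysis of variance on read counts; the pool-size terms are what distinguish this sketch from the naive computation.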

In our analysis, the frequency of reads within pools is a

weighted average of the sample frequencies, with weights

equal to the pool coverage. Therefore, our approach follows that of Cockerham (1973), which he referred to as a weighted

analysis-of-variance (see also Weir and Cockerham 1984;

Weir 1996; Weir and Hill 2002; Weir and Goudet 2017).

With unequal pool sizes, weighted and unweighted analyses

differ. As discussed recently in Weir and Goudet (2017), the

unweighted approach seems appropriate when the between

component exceeds the within component, i.e., when FST is

large (Tukey 1957). It turns out that optimal weighting

depends upon the parameter to be estimated (Cockerham

1973) and is only efficient at lower levels of differentiation (Robertson 1962). In a likelihood analysis of the island model, Rousset (2007) derived asymptotically efficient weights that are proportional to $n_i^2$ for the sum of squares of different samples (see also Robertson 1962). To the best of our knowledge, such optimal weighting has never been considered in the literature.

Figure 4 Precision and accuracy of FST estimates with varying pool size or varying coverage. Our estimator $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ was calculated from Pool-seq data over all demes and loci and compared to the estimator WC84, computed from Ind-seq data. Each boxplot represents the distribution of multilocus FST estimates across 50 independent replicates of the ms simulations. We used two migration rates, corresponding to (A and C) FST = 0.05 and (B and D) FST = 0.20. (A and B) The pool size was variable across demes, with haploid sample size n drawn independently for each deme from a Gaussian distribution with mean 100 and SD 30; n was rounded up to the nearest integer, with a minimum of 20 and a maximum of 300 haploids per deme. (C and D) The pool size was fixed (n = 100) and the coverage ($\delta_i$) was varying across demes and loci, with $\delta_i \sim \mathrm{Pois}(\Delta)$ where $\Delta \in \{20, 50, 100\}$. For Pool-seq data, we show the results for different coverages (20×, 50×, and 100×). In each graph, the dashed line indicates the simulated value of FST and the dotted line indicates the median of the distribution of WC84 estimates. Var., variable.

Analysis-of-variance and probabilities of identity

In the analysis-of-variance framework, FST is deﬁned in Equa-

tion 1 as an intraclass correlation for the probability of IIS

(Cockerham and Weir 1987; Rousset 1996). Extensive statistical literature is available on estimators of intraclass correlations. Besides analysis-of-variance estimators, introduced in population genetics by Cockerham (1969, 1973), estimators based on the computation of probabilities of identical response within and between groups have been proposed

(see, e.g., Fleiss 1971; Fleiss and Cuzick 1979; Mak 1988;

Ridout et al. 1999; Wu et al. 2012), which were originally

referred to as kappa-type statistics (Fleiss 1971; Landis and

Koch 1977). These estimators have later been endorsed

in population genetics, where the “probability of identical

response” was then interpreted as the frequency with

which the genes are alike (Cockerham 1973; Cockerham

and Weir 1987; Weir 1996; Rousset 2007; Weir and Goudet

2017).

This suggests that, with Pool-seq data, another strategy

could consist of computing FST from IIS probabilities between

(unobserved) pairs of genes, which requires that unbiased

estimates of such quantities are derived from read count data.

We have done this in the second section of File S1 and we

provide alternative estimators of FST for Pool-seq data (see

Equations A44 and A48 in File S1). These estimators (denoted by $\hat{F}_{\mathrm{ST}}^{\text{pool-PID}}$ and $\tilde{F}_{\mathrm{ST}}^{\text{pool-PID}}$) have exactly the same

form as the analysis-of-variance estimator if the pools all have

the same size and if the number of reads per pool is constant

(Equation A33 in File S1). This echoes the derivations by

Rousset (2007) for Ind-seq data, who showed that the

analysis-of-variance approach (Weir and Cockerham 1984) and

the simple strategy of estimating IIS probabilities by counting

identical pairs of genes provide identical estimates when

sample sizes are equal (see Equation A28 in File S1 and also

Cockerham and Weir 1987; Weir 1996; Karlsson et al. 2007).

With unbalanced samples, we found that analysis-of-variance estimates have better precision and accuracy than IIS-based estimates, particularly for low levels of differentiation (see Figure S4). Interestingly, we found that IIS-based estimates of FST for Pool-seq data have generally lower bias and variance if the overall estimates of IIS probabilities within and between pools are computed as unweighted averages of population-specific or pairwise estimates (see Equations A39 and A43 in File S1), as compared to weighted averages (Equations A46 and A47 in File S1). Equation A28 in File S1 further shows that our estimator may be rewritten as a function close to $(\hat{Q}_1 - \hat{Q}_2)/(1 - \hat{Q}_2)$, except that it also depends on the sum $\sum_i (\hat{Q}_{1i} - \hat{Q}_1)$ in both the numerator and the denominator. This suggests that if the $Q_{1i}$’s differ among subpopulations, then our estimator provides an estimate of an average of population-specific FST (Weir and Hill 2002; Weir and Goudet 2017).

Figure 5 Precision and accuracy of FST estimates with sequencing and experimental errors. Our estimator $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ was computed from Pool-seq data over all demes and loci without error, with sequencing error (occurring at rate $\mu_e = 0.001$), and with experimental error ($\varepsilon = 0.5$). Each boxplot represents the distribution of multilocus FST estimates across 50 independent replicates of the ms simulations. We used two migration rates, corresponding to (A and B) FST = 0.05 or (C and D) FST = 0.20. The size of each pool was either fixed to (A and C) 10 or to (B and D) 100. For Pool-seq data, we show the results for different coverages (20×, 50×, and 100×). In each graph, the dashed line indicates the simulated value of FST. Exp., experimental; Seq., sequencing.

It follows from the derivations in File S1 that the estimator

PP2a (Equation 19) is biased because the IIS probability between pairs of reads within a pool $(\hat{Q}_1^{r})$ is a biased estimator

of the IIS probability between pairs of distinct genes in that

pool (see Equations A34–A36 in File S1). This is the case

because the former confounds pairs of reads that are identical

because they were sequenced from a single gene with pairs of

reads that are identical because they were sequenced from

distinct, yet IIS genes.
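The confounding described above can be made explicit with a small sketch. It assumes, as a simplification that is not taken from File S1, that two distinct reads from a pool of n genes were sequenced from the same gene with probability 1/n, so that the read-pair IIS probability equals 1/n + (1 − 1/n)Q1. Inverting this relation gives a corrected estimate of Q1. The function names are hypothetical, and the code handles a biallelic locus only.

```python
import numpy as np


def read_pair_identity(read_counts, coverage):
    """Unbiased estimate of the IIS probability between two distinct reads."""
    x = np.asarray(read_counts, dtype=float)
    c = np.asarray(coverage, dtype=float)
    pi = x / c
    sum_sq = pi ** 2 + (1 - pi) ** 2
    # Remove the contribution of a read compared with itself
    return 1 - c / (c - 1) * (1 - sum_sq)


def gene_pair_identity(read_counts, coverage, pool_size):
    """Estimate of the IIS probability between two distinct genes of the pool."""
    q1_reads = read_pair_identity(read_counts, coverage)
    n = float(pool_size)
    # Remove pairs of reads that were sequenced from the same gene
    return 1 - n / (n - 1) * (1 - q1_reads)


if __name__ == "__main__":
    rng = np.random.default_rng(7)  # arbitrary seed
    p, n, c = 0.3, 10, 50
    genes = rng.binomial(n, p, size=200_000)
    reads = rng.binomial(c, genes / n)
    print("read pairs:", read_pair_identity(reads, c).mean())
    print("gene pairs:", gene_pair_identity(reads, c, n).mean())
    print("target Q1: ", p ** 2 + (1 - p) ** 2)
```

In this toy setting the read-pair estimate overshoots Q1 by (1 − Q1)/n on average, so the upward bias shrinks as the pool size grows but is unaffected by coverage.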

A more justiﬁed estimator of FST has been proposed by

Ferretti et al. (2013), based on previous developments by

Futschik and Schlötterer (2010). Note that, although they

deﬁned FST as a ratio of functions of heterozygosities, they

actually worked with IIS probabilities (see Equation 20 and

Equation 21). However, although Equation 20 is strictly iden-

tical to Equation A39 in File S1, we note that they computed

the total heterozygosity by integrating over pairs of genes

sampled both within and between subpopulations (compare

Equation 21 with Equation A43 in File S1), which may ex-

plain the observed bias (see Figure 2).

Comparison with alternative estimators

An alternative framework to Weir and Cockerham’s (1984)

analysis-of-variance has been developed by Masatoshi Nei

and coworkers to estimate FST from gene diversities (Nei

1973, 1977, 1986; Nei and Chesser 1983). The estimator

PP2d (see Equation 16, Equation 17, and Equation 18) implemented in the software package PoPoolation2 (Kofler et al.

2011) follows this logic. However, it has long been recog-

nized that both frameworks are fundamentally different in

that the analysis-of-variance approach considers both statis-

tical and genetic (or evolutionary) sampling, whereas Nei

and coworkers’ approach does not (Weir and Cockerham 1984; Excoffier 2007; Holsinger and Weir 2009). Furthermore, the expectation of Nei and coworkers’ estimators depends on the number of sampled populations, with a larger

bias for lower numbers of sampled populations (Goudet

1993; Excofﬁer 2007; Weir and Goudet 2017). This is the

case because the computation of the total diversity in Equa-

tion 18 and Equation 21 includes the comparison of pairs of

genes from the same subpopulation, whereas the computation of IIS probabilities between subpopulations does not (see,

e.g., Excofﬁer 2007). Therefore, we do not recommend using

the estimator PP2d implemented in the software package

PoPoolation2 (Koﬂer et al. 2011).
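The dependence on the number of sampled populations can be seen from a short calculation on parametric IIS probabilities. The derivation below is a simplified sketch that ignores sampling corrections and assumes $n_d$ equally weighted subpopulations, with $Q_1$ and $Q_2$ the IIS probabilities within and between subpopulations; the symbol $Q_T$ for the total IIS probability is introduced here for illustration only.

```latex
% Assumptions: n_d equally weighted subpopulations; Q_1 and Q_2 are the
% parametric IIS probabilities within and between subpopulations.
% A pair of genes drawn from the total population comes from the same
% subpopulation with probability 1/n_d.
\begin{align}
Q_T &= \frac{1}{n_d}\, Q_1 + \left(1 - \frac{1}{n_d}\right) Q_2, \\
\frac{Q_1 - Q_T}{1 - Q_T}
  &= \frac{\left(1 - \frac{1}{n_d}\right)(Q_1 - Q_2)}
          {(1 - Q_2) - \frac{1}{n_d}(Q_1 - Q_2)}
   = \frac{(n_d - 1)\, F_{\mathrm{ST}}}{n_d - F_{\mathrm{ST}}},
\qquad
F_{\mathrm{ST}} = \frac{Q_1 - Q_2}{1 - Q_2}.
\end{align}
```

With $n_d = 2$ the ratio equals $F_{\mathrm{ST}}/(2 - F_{\mathrm{ST}})$, i.e., about 0.11 when $F_{\mathrm{ST}} = 0.20$, and it approaches $F_{\mathrm{ST}}$ only as $n_d$ becomes large, which is why pairwise analyses are the most affected.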

Figure 6 Precision and accuracy of FST estimates with and without filtering. Our estimator $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ was computed from Pool-seq data over all demes and loci (A) without error, (B) with sequencing error, and (C) with experimental error (see the legend of Figure 5 for further details). For each case, we computed FST without filtering (no MRC) and with filtering (using an MRC of 4). Each boxplot represents the distribution of multilocus FST estimates across 50 independent replicates of the ms simulations. We used a migration rate corresponding to FST = 0.20 and pool size n = 10. We show the results for different coverages (20×, 50×, and 100×). In each graph, the dashed line indicates the simulated value of FST.

Applications in evolutionary ecology studies

Pool-seq is being increasingly used in many application domains (Schlötterer et al. 2014), such as conservation genetics (see, e.g., Fuentes-Pardo and Ruzzente 2017), invasion biology (see, e.g., Dexter et al. 2018), and evolutionary biology in a broader sense (see, e.g., Collet et al. 2016). These studies use a large range of methods, which aim at characterizing fine-scaled population structure (see, e.g., Fischer et al. 2017), reconstructing past demography (see, e.g., Chen et al. 2016; Leblois et al. 2018), or identifying footprints of natural or artificial selection (see, e.g., Chen et al. 2016; Fariello et al. 2017; Leblois et al. 2018).

Here, we reanalyzed the Pool-seq data produced by

Dennenmoser et al. (2017), who investigated the adaptive

genomic divergence between freshwater and brackish-water

ecotypes of the prickly sculpin C. asper, an abundant euryhaline fish in northwestern North America. Measuring pairwise genetic differentiation between samples using $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$, we found

a clear-cut structure separating the freshwater from the

brackish-water ecotypes. Such genetic structure supports the

hypothesis that populations are locally adapted to osmotic

conditions in these two contrasted habitats, as discussed in

Dennenmoser et al. (2017). This structure, which is at odds

with that inferred from PP2d estimates, is not only supported

by the scaled covariance matrix of allele frequencies, but also

by previous microsatellite-based studies, which showed that

populations were genetically more differentiated between eco-

types than within ecotypes (Dennenmoser et al. 2014, 2015).

Figure 7 Reanalysis of the prickly sculpin (C. asper) Pool-seq data. (A) We compare the pairwise FST estimates PP2d and $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ for all pairs of populations from the estuarine (CR and FE) and freshwater samples (PI and HZ). Within-ecotype comparisons are depicted as • and between-ecotype comparisons as :. (B and C) We show hierarchical cluster analyses based on (B) PP2d and (C) $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ pairwise estimates using unweighted pair group method with arithmetic mean (UPGMA). (D) We show a heatmap representation of the scaled covariance matrix among the four C. asper populations, inferred from the Bayesian hierarchical model implemented in the software package BayPass.

Limits of the model and perspectives

We have shown that the strongest source of bias for the $\hat{F}_{\mathrm{ST}}^{\mathrm{pool}}$ estimate is unequal contributions of individuals in pools. This is because we assume in our model that the read counts are multinomially distributed, which supposes that all genes contribute equally to the pool of reads (Gautier et al. 2013), i.e., that there is no variation in DNA yield across individuals and that all genes have equal sequencing coverage (Rode et al. 2018). Because the effect of unequal contribution is expected to be stronger with small pool sizes, it has been recommended to use Pool-seq with at least 50 diploid individuals per pool (Lynch et al. 2014; Schlötterer et al. 2014). However, this limit may be overly conservative for allele frequency estimates (Rode et al. 2018) and we have shown here that we can achieve very good precision and accuracy of FST estimates with smaller pool sizes. Furthermore, because genotypic information is lost during Pool-seq experiments, we assume in our derivations that pools are haploid (and therefore that FIS is nil). Analyzing nonrandom mating populations (e.g., in selfing species) is therefore problematic.

Finally, our model, as in Weir and Cockerham (1984),

formally assumes that all populations provide independent

replicates of some evolutionary process (Excofﬁer 2007;

Holsinger and Weir 2009). This may be unrealistic in many

natural populations, which motivated Weir and Hill (2002)

to derive a population-speciﬁc estimator of FST for Ind-seq

data (see also Vitalis et al. 2001). Even though the use of

Weir and Hill’s (2002) estimator is still scarce in the literature

(but see Weir et al. 2005; Vitalis 2012), Weir and Goudet

(2017) recently proposed a reinterpretation of population-specific estimates of FST in terms of allelic matching proportions, which are strictly equivalent to IIS probabilities between pairs of genes. It is therefore straightforward to extend Weir and Goudet’s (2017) estimator of population-specific FST for the analysis of Pool-seq data, using the unbiased estimates of IIS probabilities provided in File S1.

Acknowledgments

We thank Alexandre Dehne-Garcia for his assistance in using

computer farms. We thank two anonymous reviewers for

their positive comments and suggestions. Analyses were

performed on the GenoToul bioinformatics platform Toulouse Midi-Pyrénées (http://bioinfo.genotoul.fr) and the

High Performance Computational platform of the Centre

de Biologie pour la Gestion des Populations. This work is

part of V.H.’s Ph.D.; V.H. was supported by a grant from

the Institut National de la Recherche Agronomique’s Plant

Health and Environment (SPE) Division and by the Biodi-

vERsA project EXOTIC (ANR-13-EBID-0001). Part of this

work was supported by the project SWING (ANR-16-CE02-

0015) of the French National Research Agency, and by the

CORBAM project of the French region Hauts-de-France.

Literature Cited

Akey, J. M., G. Zhang, L. Jin, and M. D. Shriver, 2002 Interrogating

a high-density SNP map for signatures of natural selection. Ge-

nome Res. 12: 1805–1814. https://doi.org/10.1101/gr.631202

Anderson, E. C., H. J. Skaug, and D. J. Barshis, 2014 Next-

generation sequencing for molecular ecology: a caveat regard-

ing pooled samples. Mol. Ecol. 23: 502–512. https://doi.org/

10.1111/mec.12609

Beaumont, M. A., 2005 Adaptation and speciation: what can FST tell us? Trends Ecol. Evol. 20: 435–440. https://doi.org/10.1016/j.tree.2005.05.017

Beaumont, M. A., and R. A. Nichols, 1996 Evaluating loci for use

in the genetic analysis of population structure. Proc. Biol. Sci.

263: 1619–1626. https://doi.org/10.1098/rspb.1996.0237

Bhatia, G., N. Patterson, S. Sankararaman, and A. L. Price, 2013 Estimating and interpreting FST: the impact of rare variants. Genome Res. 23: 1514–1521. https://doi.org/10.1101/gr.154831.113

Cavalli-Sforza, L., 1966 Population structure and human evolu-

tion. Proc. R. Soc. Lond. B Biol. Sci. 164: 362–379. https://doi.

org/10.1098/rspb.1966.0038

Chen, J., T. Källman, X.-F. Ma, G. Zaina, M. Morgante et al.,

2016 Identifying genetic signatures of natural selection using

pooled populations sequencing in Picea abies. G3 (Bethesda) 6:

1979–1989. https://doi.org/10.1534/g3.116.028753

Cockerham, C. C., 1969 Variance of gene frequencies. Evolution

23: 72–84. https://doi.org/10.1111/j.1558-5646.1969.tb03496.x

Cockerham, C. C., 1973 Analyses of gene frequencies. Genetics

74: 679–700.

Cockerham, C. C., and B. S. Weir, 1987 Correlations, descent

measures: drift with migration and mutation. Proc. Natl.

Acad. Sci. USA 84: 8512–8514. https://doi.org/10.1073/pnas.

84.23.8512

Collet, J. M., S. Fuentes, J. Hesketh, M. S. Hill, P. Innocenti et al.,

2016 Rapid evolution of the intersexual genetic correlation for

ﬁtness in Drosophila melanogaster. Evolution 70: 781–795. https://

doi.org/10.1111/evo.12892

Coop, G., D. Witonsky, A. Di Rienzo, and J. K. Pritchard,

2010 Using environmental correlations to identify loci under-

lying local adaptation. Genetics 185: 1411–1423. https://doi.

org/10.1534/genetics.110.114819

Cutler, D. J., and J. D. Jensen, 2010 To pool, or not to pool? Ge-

netics 186: 41–43. https://doi.org/10.1534/genetics.110.121012

Dennenmoser, S., S. M. Rogers, and S. M. Vamosi, 2014 Genetic population structure in prickly sculpin (Cottus asper) reflects isolation-by-environment between two life-history ecotypes. Biol. J. Linn. Soc. Lond. 113: 943–957. https://doi.org/10.1111/bij.12384

Dennenmoser, S., A. W. Nolte, S. M. Vamosi, and S. M. Rogers, 2015 Phylogeography of the prickly sculpin (Cottus asper) in north-western North America reveals parallel phenotypic evolution across multiple coastal-inland colonizations. J. Biogeogr. 42: 1626–1638. https://doi.org/10.1111/jbi.12527

Dennenmoser, S., S. M. Vamosi, S. W. Nolte, and S. M. Rogers,

2017 Adaptive genomic divergence under high gene ﬂow be-

tween freshwater and brackish-water ecotypes of prickly sculpin

(Cottus asper) revealed by Pool-Seq. Mol. Ecol. 26: 25–42.

https://doi.org/10.1111/mec.13805

Dexter, E., S. M. Bollens, J. Cordell, H. Y. Soh, G. Rollwagen-Bollens

et al., 2018 A genetic reconstruction of the invasion of the

calanoid copepod Pseudodiaptomus inopinus across the North

American Paciﬁc Coast. Biol. Invasions 20: 1577–1595. https://

doi.org/10.1007/s10530-017-1649-0

Ellegren, H., 2014 Genome sequencing and population genomics

in non-model organisms. Trends Ecol. Evol. 29: 51–63. https://

doi.org/10.1016/j.tree.2013.09.008

Excofﬁer, L., 2007 Analysis of population subdivision, pp. 980–

1020 in Handbook of Statistical Genetics, edited by D. J. Balding,

M. Bishop, and C. Cannings. John Wiley & Sons, Chichester,

United Kingdom.

Fariello, M. I., S. Boitard, S. Mercier, D. Robelin, T. Faraut et al.,

2017 Accounting for linkage disequilibrium in genome scans

for selection without individual genotypes: the local score ap-

proach. Mol. Ecol. 26: 3700–3714. https://doi.org/10.1111/

mec.14141

Ferretti, L., S. Ramos Onsins, and M. Pérez-Enciso, 2013 Population

genomics from pool sequencing. Mol. Ecol. 22: 5561–5576.

https://doi.org/10.1111/mec.12522


Fischer, M. C., C. Rellstab, M. Leuzinger, M. Roumet, F. Gugerli et al., 2017 Estimating genomic diversity and population differentiation – an empirical comparison of microsatellite and SNP variation in Arabidopsis halleri. BMC Genomics 18: 69. https://doi.org/10.1186/s12864-016-3459-7

Fleiss, J. L., 1971 Measuring nominal scale agreement among

many raters. Psychol. Bull. 76: 378–382. https://doi.org/

10.1037/h0031619

Fleiss, J. L., and J. Cuzick, 1979 The reliability of dichotomous judge-

ments: unequal numbers of judges per subject. Appl. Psychol. Meas.

3: 537–542. https://doi.org/10.1177/014662167900300410

Fuentes-Pardo, A. P., and D. E. Ruzzente, 2017 Whole-genome

sequencing approaches for conservation biology: advantages,

limitations and practical recommendations. Mol. Ecol. 26: 5369–

5406. https://doi.org/10.1111/mec.14264

Futschik, A., and C. Schlötterer, 2010 The next generation of mo-

lecular markers from massively parallel sequencing of pooled

DNA samples. Genetics 186: 207–218. https://doi.org/10.1534/

genetics.110.114397

Gautier, M., 2015 Genome-wide scan for adaptive divergence and

association with population-speciﬁc covariates. Genetics 201:

1555–1579. https://doi.org/10.1534/genetics.115.181453

Gautier, M., K. Gharbi, T. Cezaerd, M. Galan, A. Loiseau et al.,

2013 Estimation of population allele frequencies from next-

generation sequencing data: pool-versus individual-based genotyp-

ing. Mol. Ecol. 22: 3766–3779. https://doi.org/10.1111/mec.12360

Glenn, T. C., 2011 Field guide to next-generation DNA se-

quencers. Mol. Ecol. Resour. 11: 759–769. https://doi.org/10.

1111/j.1755-0998.2011.03024.x

Goudet, J., 1993 The genetics of geographically structured pop-

ulations. Ph.D. Thesis, University of Wales, Bangor, Wales.

Holsinger, K. S., and B. S. Weir, 2009 Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat. Rev. Genet. 10: 639–650. https://doi.org/10.1038/nrg2611

Hudson, R. R., 2002 Generating samples under a Wright-Fisher

neutral model of genetic variation. Bioinformatics 18: 337–338.

https://doi.org/10.1093/bioinformatics/18.2.337

Karlsson, E. K., I. Baranowska, C. M. Wade, N. H. C. Salmon Hillbertz, M. C. Zody et al., 2007 Efficient mapping of Mendelian traits in dogs through genome-wide association. Nat. Genet. 39: 1321–1328. https://doi.org/10.1038/ng.2007.10

Koﬂer, R., R. V. Pandey, and C. Schlötterer, 2011 PoPoolation2:

identifying differentiation between populations using sequenc-

ing of pooled DNA samples (Pool-Seq). Bioinformatics 27:

3435–3436. https://doi.org/10.1093/bioinformatics/btr589

Landis, J. R., and G. G. Koch, 1977 A one-way components of

variance model for categorical data. Biometrics 33: 671–679.

https://doi.org/10.2307/2529465

Leblois, R., M. Gautier, A. Rohfritsch, J. Foucaud, C. Burban et al.,

2018 Deciphering the demographic history of allochronic dif-

ferentiation in the pine processionary moth Thaumetopoea pity-

ocampa. Mol. Ecol. 27: 264–278. https://doi.org/10.1111/

mec.14411

Lewontin, R. C., and J. Krakauer, 1973 Distribution of gene fre-

quency as a test of the theory of the selective neutrality of poly-

morphism. Genetics 74: 175–195.

Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan et al.,

2009 The sequence alignment/map format and SAMtools.

Bioinformatics 25: 2078–2079. https://doi.org/10.1093/

bioinformatics/btp352

Lotterhos, K. E., and M. C. Whitlock, 2014 Evaluation of demographic history and neutral parameterization on the performance of FST outlier tests. Mol. Ecol. 23: 2178–2192. https://doi.org/10.1111/mec.12725

Lotterhos, K. E., and M. C. Whitlock, 2015 The relative power of

genome scans to detect local adaptation depends on sampling

design and statistical method. Mol. Ecol. 24: 1031–1046. https://

doi.org/10.1111/mec.13100

Lynch, M., D. Bost, S. Wilson, T. Maruki, and S. Harrison,

2014 Population-genetic inference from pooled-sequencing data.

Genome Biol. Evol. 6: 1210–1218. https://doi.org/10.1093/gbe/

evu085

Mak, T. K., 1988 Analysing intraclass correlation for dichotomous

variables. J. R. Stat. Soc. Ser. C Appl. Stat. 37: 344–352.

Malécot, G., 1948 Les Mathématiques de l’Hérédité. Masson, Paris.

Nei, M., 1973 Analysis of gene diversity in subdivided popula-

tions. Proc. Natl. Acad. Sci. USA 70: 3321–3323. https://doi.

org/10.1073/pnas.70.12.3321

Nei, M., 1977 F-statistics and analysis of gene diversity in subdi-

vided populations. Ann. Hum. Genet. 41: 225–233. https://doi.

org/10.1111/j.1469-1809.1977.tb01918.x

Nei, M., 1978 Estimation of average heterozygosity and genetic

distance from a small number of individuals. Genetics 89: 583–

590.

Nei, M., 1986 Deﬁnition and estimation of ﬁxation indices. Evo-

lution 40: 643–645. https://doi.org/10.1111/j.1558-5646.1986.

tb00516.x

Nei, M., and R. K. Chesser, 1983 Estimation of ﬁxation indices and

gene diversities. Ann. Hum. Genet. 47: 253–259. https://doi.

org/10.1111/j.1469-1809.1983.tb00993.x

Nychka, D., R. Furrer, J. Paige, and S. Sain, 2017 ﬁelds: tools for

spatial data. R package version 9.6. University Corporation for

Atmospheric Research, Boulder, CO. DOI: 10.5065/D6W957CT

Orgogozo, V., A. E. Peluffo, and B. Morizot, 2016 The “mendelian gene” and the “molecular gene”: two relevant concepts of genetic units, pp. 1–26 in Genes and Evolution. Current Topics in Developmental Biology, Vol. 119, edited by V. Orgogozo. Academic Press, New York.

Pickrell, J. K., and J. K. Pritchard, 2012 Inference of population

splits and mixtures from genome-wide allele frequency data.

PLoS Genet. 8: e1002967. https://doi.org/10.1371/journal.

pgen.1002967

R Core Team, 2017 R: A Language and Environment for Statistical

Computing. R Foundation for Statistical Computing, Vienna.

Reynolds, J., B. S. Weir, and C. C. Cockerham, 1983 Estimation of

the coancestry coefﬁcient: basis for a short-term genetic dis-

tance. Genetics 105: 767–779.

Ridout, M. S., C. G. B. Demétrio, and D. Firth, 1999 Estimating

intra-class correlation for binary data. Biometrics 55: 137–148.

https://doi.org/10.1111/j.0006-341X.1999.00137.x

Robertson, A., 1962 Weighting in the estimation of variance components in the unbalanced single classification. Biometrics 18: 413–417. https://doi.org/10.2307/2527485

Rode, N. O., Y. Holtz, K. Loridon, S. Santoni, J. Ronfort et al., 2018 How to optimize the precision of allele and haplotype frequency estimates using pooled-sequencing data. Mol. Ecol. Resour. 18: 194–203. https://doi.org/10.1111/1755-0998.12723

Ross, M. G., C. Russ, M. Costello, A. Hollinger, N. J. Lennon et al., 2013 Characterizing and measuring bias in sequence data. Genome Biol. 14: R51. https://doi.org/10.1186/gb-2013-14-5-r51

Rousset, F., 1996 Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics 142: 1357–1362.

Rousset, F., 1997 Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics 145: 1219–1228.

Rousset, F., 2007 Inferences from spatial population genetics, pp. 945–979 in Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop, and C. Cannings. John Wiley & Sons, Ltd., Chichester, England.

Rousset, F., 2008 genepop’007: a complete re-implementation of the genepop software for Windows and Linux. Mol. Ecol. Resour. 8: 103–106. https://doi.org/10.1111/j.1471-8286.2007.01931.x

Schlötterer, C., R. Tobler, R. Kofler, and V. Nolte, 2014 Sequencing pools of individuals – mining genome-wide polymorphism data without big funding. Nat. Rev. Genet. 15: 749–763. https://doi.org/10.1038/nrg3803

Slatkin, M., 1993 Isolation by distance in equilibrium and nonequilibrium populations. Evolution 47: 264–279. https://doi.org/10.1111/j.1558-5646.1993.tb01215.x

Smadja, C. M., B. Canbäck, R. Vitalis, M. Gautier, J. Ferrari et al., 2012 Large-scale candidate gene scan reveals the role of chemoreceptor genes in host plant specialization and speciation in the pea aphid. Evolution 66: 2723–2738. https://doi.org/10.1111/j.1558-5646.2012.01612.x

The International HapMap Consortium, 2005 A haplotype map of the human genome. Nature 437: 1299–1320. https://doi.org/10.1038/nature04226

Tukey, J. W., 1957 Variances of variance components: II. The unbalanced single classification. Ann. Math. Stat. 28: 43–56. https://doi.org/10.1214/aoms/1177707036

Vitalis, R., 2012 DetSel: an R-Package to detect marker loci responding to selection, pp. 277–293 in Data Production and Analysis in Population Genomics: Methods and Protocols. Methods in Molecular Biology, Vol. 888, edited by F. Pompanon, and A. Bonin. Humana Press, New York.

Vitalis, R., P. Boursot, and K. Dawson, 2001 Interpretation of variation across marker loci as evidence of selection. Genetics 158: 1811–1823.

Wahlund, S., 1928 Zusammensetzung von populationen und korrelationserscheinungen vom standpunkt der vererbungslehre aus betrachtet. Hereditas 11: 65–106. https://doi.org/10.1111/j.1601-5223.1928.tb02483.x

Weir, B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Inc., Sunderland, MA.

Weir, B. S., 2012 Estimating F-statistics: a historical view. Philos. Sci. 79: 637–643. https://doi.org/10.1086/667904

Weir, B. S., and C. C. Cockerham, 1984 Estimating F-statistics for the analysis of population structure. Evolution 38: 1358–1370. https://doi.org/10.1111/j.1558-5646.1984.tb05657.x

Weir, B. S., and J. Goudet, 2017 A unified characterization of population structure and relatedness. Genetics 206: 2085–2103. https://doi.org/10.1534/genetics.116.198424

Weir, B. S., and W. G. Hill, 2002 Estimating F-statistics. Annu. Rev. Genet. 36: 721–750. https://doi.org/10.1146/annurev.genet.36.050802.093940

Weir, B. S., L. R. Cardon, A. D. Anderson, D. M. Nielsen, and W. G. Hill, 2005 Measures of human population structure show heterogeneity among genomic regions. Genome Res. 15: 1468–1476. https://doi.org/10.1101/gr.4398405

Whitlock, M. C., and K. E. Lotterhos, 2015 Reliable detection of loci responsible for local adaptation: inference of a null model through trimming the distribution of FST. Am. Nat. 186: S24–S36. https://doi.org/10.1086/682949

Wright, S., 1931 Evolution in Mendelian populations. Genetics 16: 97–159.

Wright, S., 1951 The genetical structure of populations. Ann. Eugen. 15: 323–354. https://doi.org/10.1111/j.1469-1809.1949.tb02451.x

Wu, S., C. M. Crespi, and W. K. Wong, 2012 Comparison of methods for estimating the intraclass correlation coefficient for binary responses in cancer prevention cluster randomized trials. Contemp. Clin. Trials 33: 869–880. https://doi.org/10.1016/j.cct.2012.05.004

Communicating editor: M. Beaumont