
Copyright © 2007 by the Genetics Society of America

DOI: 10.1534/genetics.107.072371

A Markov Chain Monte Carlo Approach for Joint Inference of Population

Structure and Inbreeding Rates From Multilocus Genotype Data

Hong Gao, Scott Williamson and Carlos D. Bustamante1

Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853

Manuscript received February 19, 2007

Accepted for publication April 22, 2007

ABSTRACT

Nonrandom mating induces correlations in allelic states within and among loci that can be exploited to

understand the genetic structure of natural populations (Wright 1965). For many species, it is of con-

siderable interest to quantify the contribution of two forms of nonrandom mating to patterns of standing

genetic variation: inbreeding (mating among relatives) and population substructure (limited dispersal of

gametes). Here, we extend the popular Bayesian clustering approach STRUCTURE (Pritchard et al. 2000)

for simultaneous inference of inbreeding or selfing rates and population-of-origin classification using

multilocus genetic markers. This is accomplished by eliminating the assumption of Hardy–Weinberg

equilibrium within clusters and, instead, calculating expected genotype frequencies on the basis of

inbreeding or selfing rates. We demonstrate the need for such an extension by showing that selfing leads

to spurious signals of population substructure using the standard STRUCTURE algorithm with a bias

toward spurious signals of admixture. We gauge the performance of our method using extensive coa-

lescent simulations and demonstrate that our approach can correct for this bias. We also apply our ap-

proach to understanding the population structure of the wild relative of domesticated rice, Oryza rufipogon,

an important partially selfing grass species. Using a sample of n = 16 individuals sequenced at 111 random

loci, we find strong evidence for existence of two subpopulations, which correlates well with geographic

location of sampling, and estimate selfing rates for both groups that are consistent with estimates from

experimental data (s ≈ 0.48–0.70).

UNDERSTANDING the mating structure of natural populations is a major goal of population biology. Here we consider the problem of using genotype data from a sample of individuals to distinguish between two forms of nonrandom mating: inbreeding or mating among relatives and population subdivision or limited dispersal of gametes. As Sewall Wright demonstrated, both of these evolutionary forces induce a correlation in allelic state among uniting gametes (i.e., autozygosity) (Wright 1931, 1965). Specifically, writing $\{A_i, A_j\}$ to denote the outcome of inheriting alleles i and j at a particular locus of interest, Wright thought about the problem in terms of the correlation in state:

$$\mathrm{corr}(A_i, A_j) = \frac{\mathrm{Cov}(A_i, A_j)}{\sqrt{\mathrm{Var}(A_i)\,\mathrm{Var}(A_j)}} = \frac{p_{ij} - p_i p_j}{\sqrt{p_i (1 - p_i)\, p_j (1 - p_j)}}.$$

In a randomly mating population, the probability of

inheriting a combination of alleles $\{A_i, A_j\}$ is, by definition, given by the product of their marginal probabilities (i.e., $p_{ij} = p_i p_j$). Therefore, under random mating

there is no correlation in allelic state among the genes

inherited from the two parents.

In a subdivided population with inbreeding, however, the correlation in allelic state, $F_{IT}$, may be nonzero and is given by Wright’s famous equation

$$F_{IT} = 1 - (1 - F_{IS})(1 - F_{ST}), \tag{1}$$

where $F_{IS}$ is equivalent to the correlation in state conditional on subpopulation of origin, and $F_{ST}$ is the correlation in state among randomly sampled alleles within subpopulations. The first is a measure of inbreeding and the second is a measure of population substructure. This equation demonstrates that the relative contributions of the two forces to deviations from random mating are of comparable magnitude and depend critically on the particular values of the parameters.
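As a small numerical illustration of Wright's relation between the three correlations, the sketch below evaluates the total correlation for a few parameter values. The function name and the input values are illustrative assumptions, not quantities estimated in this article.

```python
def f_it(f_is: float, f_st: float) -> float:
    """Total correlation F_IT from Wright's relation 1 - F_IT = (1 - F_IS)(1 - F_ST)."""
    return 1.0 - (1.0 - f_is) * (1.0 - f_st)


# Illustrative values only (not estimates from the article).
for f_is, f_st in [(0.2, 0.0), (0.0, 0.2), (0.2, 0.2)]:
    print(f"F_IS={f_is:.1f}  F_ST={f_st:.1f}  F_IT={f_it(f_is, f_st):.2f}")
```

With equal inputs, inbreeding alone and substructure alone give the same total correlation, which is why the two forces cannot be separated from a single summary of heterozygote deficit.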

Although this phenomenon is appreciated by many population geneticists, many modern statistical approaches for analyzing genotype data ignore one of these two components. For example, methods for identifying population structure among a sample of individuals assume random mating within subpopulations (Pritchard et al. 2000; Dawson and Belkhir 2001; Corander et al. 2003;

Falush et al. 2003). Likewise, methods for estimating

self-fertilization rates from genotype data assume indi-

viduals are sampled from a single population (Ayres

and Balding 1998; Enjalbert and David 2000) or


1Corresponding author: 101 Biotechnology Bldg., Cornell University,

Ithaca, NY 14853.E-mail: cdb28@cornell.edu

Genetics 176: 1635–1651 (July 2007)


require labor-intensive approaches such as progeny

arrays (direct genotyping of offspring–mother pairs)

(Ritland 2002). Therefore, considerable interest exists

in the development of an approach that can reliably

estimate the degree of population subdivision and inbreeding rates from a sample of genotyped individuals of unknown relatedness.

Our starting point in this study is the widely used

program STRUCTURE (Pritchard et al. 2000; Falush

et al. 2003), which implements a Bayesian clustering al-

gorithm that simultaneously estimates locus allele fre-

quencies and probabilistically assigns individuals to one

of K subpopulations. STRUCTURE works by exploiting

a key concept in population genetics: undetected population substructure leads to a genomewide deficit of heterozygotes in a sample as compared to the predictions of the Hardy–Weinberg equilibrium (HWE) (Wahlund

1928; Hartl and Clark 1997). Informally, by assigning

individuals probabilistically across a fixed number of K

subpopulations, the algorithm minimizes deviations from

HWE across the whole sample by maximizing within-

subpopulation HWE as well as linkage equilibrium

among unlinked loci. It is important to note, however,

that various genetic and evolutionary forces can also

lead to a genomewide deficiency of heterozygotes in a

sample. In hermaphroditic populations, for example,

partial self-fertilization reduces heterozygosity by a factor $(1 - s)/(1 - (s/2))$, where s is the proportion of progeny produced by self-fertilization (Haldane 1924). Since

STRUCTURE assumes that individuals in the sample

are either fully outcrossing or haploid, application of

the algorithm to partially selfing populations may result

in spurious inference of population structure and/or

admixture as pointed out in Falush et al. (2003). (It is

important to note that under the extreme case of com-

plete self-fertilization, one can sidestep this issue by

treating each diploid individual as haploid.)
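To make the size of this heterozygote deficit concrete, the following sketch evaluates the reduction factor for several selfing rates and checks it against the standard equilibrium inbreeding coefficient under partial selfing, F = s/(2 - s). The function names are illustrative assumptions.

```python
def inbreeding_coefficient(s: float) -> float:
    """Equilibrium inbreeding coefficient under partial selfing: F = s / (2 - s)."""
    return s / (2.0 - s)


def heterozygosity_factor(s: float) -> float:
    """Factor (1 - s) / (1 - s/2) multiplying the Hardy-Weinberg heterozygosity."""
    return (1.0 - s) / (1.0 - s / 2.0)


for s in (0.0, 0.25, 0.5, 0.75):
    # The two columns agree because the factor equals 1 - F.
    print(s, round(heterozygosity_factor(s), 4), round(1.0 - inbreeding_coefficient(s), 4))
```

At s = 0.5, for instance, one-third of the expected heterozygosity is lost, a deficit that a clustering model assuming HWE can only explain by invoking extra structure.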

To investigate spurious evidence for admixture in

the presence of partial self-fertilization, we modified

Hudson’s implementation of the standard coalescent

algorithm (Hudson 1997) to accommodate partial self-

ing (Nordborg and Donnelly 1997) and generated a

sample of 100 individuals drawn from a population with

selfing rate s = 0.5 genotyped at 100 loci. We then ran the standard STRUCTURE 1.0 algorithm assuming two clusters (K = 2) on this data set (see the Simulations

section for details). We expect STRUCTURE to assign

all individuals to one of the two clusters shown in Figure

1c, since we have simulated data from a single unstruc-

tured population. Figure 1, a and b, generated by the

Distruct program (Rosenberg et al. 2002), summarizes

the posterior assignment probabilities. For this data set

drawn from a single population, STRUCTURE classified

all individuals as ‘‘admixed’’ with 50% of their genome

coming from cluster 1 (green) and 50% coming from

cluster 2 (purple). This result holds regardless of whether

one considers the correlated (i.e., F model) or uncorre-

lated allele frequency models and suggests that applica-

tion of STRUCTURE to data from a partially selfing

population may lead to spurious signals of population

substructure as initially suggested by Falush et al. (2003).

To quantify this effect further, we repeated the procedure above for 100 data sets simulated for each of six levels of selfing and ran STRUCTURE under both K = 1 and K = 2. To gauge the improvement in fit between the K = 1 and K = 2 models, we compared the difference in average log-likelihood score across retained draws from Markov chain Monte Carlo (MCMC):

$$\Delta \log L = E\,\log L(K = 2, \theta \mid \mathrm{Data}) - E\,\log L(K = 1, \theta \mid \mathrm{Data}). \tag{2}$$

The distribution of $\Delta \log L$ for different values of s is plotted in Figure 1d(A). We note that when s = 0.0, the population is completely outcrossing and the distribution of $\Delta \log L$ provides the null distribution of the test statistic under the hypothesis of no selfing and no population structure. Figure 1d(A) shows that as the selfing rate increases, the distribution of the log-likelihood difference between K = 2 and K = 1 shifts toward larger values, leading to increased rejection of the null hypothesis. When the selfing rate is > 0.5, the whole of the distribution of $\Delta \log L$ exceeds the critical value, resulting in a 100% false positive rate.
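The test statistic described above is simply a difference of two Monte Carlo averages. A minimal sketch, assuming the retained log-likelihood values of the K = 2 and K = 1 runs are available as two Python lists (the function name and the toy numbers are hypothetical):

```python
from statistics import fmean


def delta_log_l(loglik_k2, loglik_k1):
    """Average log-likelihood over retained K = 2 draws minus that over K = 1 draws."""
    return fmean(loglik_k2) - fmean(loglik_k1)


# Toy values for illustration only.
print(delta_log_l([-100.0, -102.0], [-110.0, -108.0]))
```

A positive value favors the two-cluster model; the null distribution of the statistic would come from applying the same function to replicate data sets simulated with s = 0.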

Therefore, we concluded that a modification to the

basic model of STRUCTURE is essential when wanting

to infer population structure for partially selfing species

or those with a recurrent pattern of inbreeding. This

article presents and validates such an approach, which

we term ‘‘InStruct.’’ When InStruct is applied to the

data sets above, it both reduces the false positive rate

dramatically (see Figure 1d) and corrects for spurious

admixture completely (see Figure 1c).

The new algorithm we present here extends the

STRUCTURE 1.0 framework by incorporating the pos-

sibility of inbreeding among individuals in the sample.

Much of this article is focused on self-fertilization, but

the program has been written generally so as to estimate

inbreeding coefficients as well. We consider two general

scenarios: a population-specific process by which all in-

dividuals within one subpopulation share the same self-

ing potential (which may reflect a shared environment,

for example) as well as a model where selfing probabil-

ities vary among individuals in the whole sample. This

model is particularly useful for modeling population

substructure when some samples have been artificially

propagated in the lab (or the field) through enforced

selfing. For this scenario, we use a Bayesian density es-

timation algorithm called the Dirichlet process mixture

model (DPMM), which offers great flexibility in estimating the distribution of latent (or unobserved) variables in the probabilistic model. It has recently been used to estimate the distribution of ω = dN/dS along a protein-coding sequence (Huelsenbeck et al. 2006). We quantify the power, robustness, and accuracy of the approach


using data simulated under a myriad of scenarios, varying

both the degree of selfing and population substructure.

A major motivation for our research was the desire to

understand population structure in the wild ancestor of

domesticated Asian rice (Oryza rufipogon), in an effort to

identify wild germplasm for improvement of this impor-

tant crop species. Therefore, to illustrate the application

of our method and to investigate the role of inbreeding

and population substructure in O. rufipogon, we apply

InStruct to multilocus data from a sample of 16 individuals collected from various localities across Southeast

Asia. We find strong evidence of population subdivision

in O. rufipogon, as well as evidence for geographic varia-

tion in the rates of self-fertilization. Potentially the most

important feature of InStruct is that it allows the identification of variation in mating system in either structured

Figure 1. Population assignments for a single data set of 100 individuals simulated under partial selfing (s = 50%) and no population substructure and analyzed assuming K = 2. (a and b) The Distruct graph from STRUCTURE using (a) the correlated alleles model and (b) the uncorrelated alleles model. (c) The Distruct graph from InStruct of the same data set. (d) Distribution of log-likelihood difference between the K = 2 and the K = 1 model under six levels of population selfing rates as estimated by STRUCTURE using the F model (A)/InStruct (B). Each colored line represents the density of average log-likelihood difference with 100 replicate data sets simulated without population structure and under a specific selfing rate, indicated in the inset.


or unstructured populations, which in turn opens the

door to using molecular population genetic approaches

to investigate the evolution of mating systems.

THEORY

A myriad of factors influence selfing rates in natural

populations, including genetic and developmental fac-

tors (such as presence/absence of self-incompatibility

loci, flower shape, deleterious mutation rate, etc.) as

well as abiotic and biotic environmental factors (such as

availability of animal pollinators, local population density, rainfall variation, etc.). Furthermore, plants obtained

from intensively managed populations (such as seed

centers that propagate varieties of food crops) are often

the result of artificial selfing (i.e., purification) and dif-

ferent lines may have been propagated for different

numbers of generations via self-fertilization.

Our model is not explicit as to which of these factors

(if any) is influencing selfing rate, but rather, we start

from the premise that each individual in the sample has

a constant but unknown selfing potential that we wish

to estimate from the available genetic data. The selfing

potential of an individual is defined as the probability

that the individual reproduces via self-fertilization (see

below). We consider two models for how selfing varies

among individuals in the sample: a ‘‘population-specific’’

model and an ‘‘individual’’ model.

Under the population-specific model, the selfing po-

tentials are equal for individuals assigned to the same

population and equivalent to the proportion of off-

spring produced via self-fertilization each generation.

This is a reasonable model if local environmental factors

are the chief determinants of selfing rate. Under the

individual model, we use a form of Bayesian probability

density estimation to estimate the selfing rate for each

individual in the sample, potentially combining indi-

viduals with statistically similar rates and splitting up

individuals with statistically different rates. This is a par-

ticularly useful model for analyzing genetic material

from seed centers where different lines may have been

the result of propagation by self-fertilization and the

number of generations of propagation differs among

lines (and is often unknown).

Parameter notation: We borrow much of our notation from Pritchard et al. (2000). Probability densities are denoted by calligraphic fonts: $\mathcal{U}$ represents the uniform distribution, $\mathcal{G}$ the geometric distribution, and $\mathcal{D}$ the Dirichlet distribution. Uppercase italic letters (e.g., P, G, X) are vectors or matrices of random variables

and lowercase italic letters (e.g., p, g, x) represent in-

stantiations of the random variables. Letters in boldface

type represent constants (e.g., K, D) and every effort is

made to retain the same notation as in the original

STRUCTURE articles.

Assume a sample of N individuals genotyped at L loci

are to be classified into K populations with ploidy D.

(Throughout this article we consider the diploid case

D = 2). We incorporate the possibility of admixture into

the model by allowing an individual’s genotype at a

locus to be composed of alleles from distinct popula-

tions. This is true even for selfing individuals since their

genomes can be mosaics of haplotypes recently derived

from selfing of an admixed parent.

As in Pritchard et al. (2000), denote marker allele frequencies by $P = \{p_{klj}: k = 1, 2, \ldots, K;\ l = 1, 2, \ldots, L;\ j = 1, 2, \ldots, J_l\}$ such that $p_{klj}$ is the allele frequency of the jth allele type at the lth locus in the kth population, where $J_l$ is the number of distinct alleles at the lth locus. For each individual i, let $X = \{x_{ild}: i = 1, 2, \ldots, N;\ l = 1, 2, \ldots, L;\ d = 1, 2, \ldots, D\}$, where $x_{ild}$ is the allele carried at locus l for the dth copy. In accordance with Pritchard et al. (2000), let $Z = \{z_{ild}: i = 1, 2, \ldots, N;\ l = 1, 2, \ldots, L;\ d = 1, 2, \ldots, D\}$ represent the matrix of $z_{ild}$, the population of origin of the dth allele copy at the lth locus in the ith individual, and let $Q = \{q_{ik}: i = 1, 2, \ldots, N;\ k = 1, 2, \ldots, K\}$ be the matrix of $q_{ik}$, the proportion of the ith individual’s genome originating from population k.

Write $S = \{s_k: k = 1, 2, \ldots, K\}$ to denote the selfing rates for the K subpopulations and $G = \{g_i: i = 1, 2, \ldots, N\}$ to denote the vector containing the number of generations until each individual experiences an outcrossing event in the past. Furthermore, let $\Theta = \{\theta_i: i = 1, 2, \ldots, N\}$ be the vector of individual selfing potentials, where $\theta_i$ is the probability that individual i reproduces via self-fertilization in a given generation. We assume that this parameter is constant in time for a given individual. Under the population-specific model, we further assume that all individuals from a given population have the same value of $\theta_i$ and that this quantity is equivalent to $s_k$, the percentage of offspring produced via selfing in subpopulation k. To estimate selfing rates for individuals of admixed ancestry, we need to make some mathematical assumptions as to how to combine selfing potentials. The model we employ in InStruct is a weighted average of population-specific selfing rates. In particular, if an individual cannot be classified unambiguously into one of K subpopulations, we model the individual’s selfing potential as the weighted average of the K population selfing rates with weighting constants equal to the $q_{ik}$, the proportion of individual i’s genome that we estimate to originate from population k (see Equation 7 below).

We use a superscript to track parameters within MCMC iterations such that $S_k^{(m)}$ is the value of the selfing rate for population k at iteration m of an MCMC chain. When available, we use conjugate priors since these make the MCMC much more efficient by often enabling Gibbs sampling. These priors can also easily accommodate previous information about population structure and self-fertilization rates.


Modeling selfing: We model the number of generations $g_i$ until an outcrossing event for the ith individual as a geometric random variable with probability of success $1 - \theta_i$, where $\theta_i$ is the selfing rate for individual i:

$$P(g_i = g \mid \theta_i) = \theta_i^{\,g-1} (1 - \theta_i). \tag{3}$$

This amounts to assuming that whether an individual selfs or not is independent from generation to generation and constant in time. Thus $g_i^{(m)} = 1$ indicates that at step m in our MCMC, the ith individual is generated by an outcrossing event in the previous generation, whereas $g_i^{(m)} > 1$ implies individual i was produced via selfing that extends $g_i^{(m)} - 1$ generations into the past.

The reason for conditioning on G is that the likelihood of the data given parameters P, G, and Z does not depend on S or $\Theta$, greatly simplifying our calculations (see Equations 5 and 6). Specifically, we write the likelihood of the genotype data given allele frequencies, population assignments, and number of generations back until an outcrossing event as

$$L(X \mid P, G, Z) = \prod_{i=1}^{N} \prod_{l=1}^{L} P(x_{il\cdot} \mid g_i, z_{il\cdot}, p_{\cdot l \cdot}), \tag{4}$$

where $P(x_{il\cdot} \mid g_i, z_{il\cdot}, p_{\cdot l \cdot})$ is the genotype frequency of individual i at locus l. If the two alleles for this genotype are from different subpopulations (i.e., $z_{il1} \neq z_{il2}$), we assume the genotype frequency is the product of the population allele frequencies (amounting to random mating among populations). If the population assignment is the same, our probabilities follow directly from basic population genetic theory. If individual i is the result of $g_i - 1$ generations of selfing, then the probability of homozygosity for the A allele is

$$P(x_{il\cdot} = AA \mid g_i, z_{il\cdot}, p_{\cdot l \cdot}) = p_A^2 + 2 p_A (1 - p_A) \times \sum_{g'=1}^{g_i - 1} 0.5^{\,g'+1}, \tag{5}$$

where $p_A$ is the allele frequency of A in its assigned subpopulation (in the $g'$th generation of selfing, a fraction $0.5^{\,g'-1}$ of the original heterozygotes remain heterozygous and one-quarter of their offspring are AA). If individual i is heterozygous at locus l (suppose the genotype is Aa at that locus), the genotype probability is

$$P(x_{il\cdot} = Aa \mid g_i, z_{il\cdot}, p_{\cdot l \cdot}) = 2 p_A p_a \times 0.5^{(g_i - 1)}. \tag{6}$$
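The selfing model above can be sketched in a few lines: a geometric probability for the number of generations back to the last outcrossing event, and genotype probabilities for the case where both allele copies are assigned to the same population. The sketch is written so that the genotype probabilities sum to one: heterozygosity decays as 0.5^(g-1), and the lost heterozygotes are split evenly between the two corresponding homozygotes. All function names and the example allele frequencies are illustrative assumptions, not InStruct code.

```python
from itertools import combinations_with_replacement


def geometric_pmf(g: int, theta: float) -> float:
    """P(g_i = g | theta_i) = theta^(g-1) * (1 - theta) for g = 1, 2, ..."""
    if g < 1:
        return 0.0
    return theta ** (g - 1) * (1.0 - theta)


def genotype_prob(allele_1, allele_2, freqs, g: int) -> float:
    """Probability of an unordered genotype when both copies come from one population.

    freqs maps allele -> frequency; g is the number of generations back to the
    last outcrossing event, so the individual results from g - 1 selfings.
    """
    het_left = 0.5 ** (g - 1)  # fraction of heterozygosity surviving g - 1 selfings
    p1 = freqs[allele_1]
    if allele_1 != allele_2:
        return 2.0 * p1 * freqs[allele_2] * het_left
    # Homozygote: outcrossed homozygotes plus heterozygotes fixed for this allele.
    return p1 * p1 + p1 * (1.0 - p1) * (1.0 - het_left)


freqs = {"A": 0.6, "a": 0.3, "b": 0.1}  # hypothetical allele frequencies
for g in (1, 2, 5):
    total = sum(genotype_prob(x, y, freqs, g)
                for x, y in combinations_with_replacement(freqs, 2))
    print(g, round(genotype_prob("A", "a", freqs, g), 4), round(total, 6))
```

With g = 1 the function returns the Hardy–Weinberg proportions, and as g grows the heterozygote probability shrinks toward zero while the total stays at one.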

In modeling inbreeding more generally, we can replace the above equations by their usual analogs in Wright’s formulation conditional on the inbreeding coefficient F (see appendix). For simplicity, we remain for the rest of this article focused on selfing, but note that InStruct has an option for modeling inbreeding as well. Next we turn to models for how selfing rates vary among individuals and populations.

Population-specific model: For the population-specific model, we define the selfing potential $\theta_i$ conditional on the population assignments of individual i as

$$\theta_i = \sum_{k=1}^{K} P(\text{individual } i \text{ is the product of selfing in the previous generation given it is from population } k) \times P(\text{individual } i \text{ comes from population } k).$$

If we assume that the probability that individual i comes from population k equals the proportion of individual i’s genome that originates from population k that has selfing rate $s_k$, we obtain

$$\theta_i = \sum_{k=1}^{K} s_k q_{ik}. \tag{7}$$
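The weighted average that defines an admixed individual's selfing potential is a single dot product. A minimal sketch with hypothetical names and values:

```python
def selfing_potential(q_i, s):
    """theta_i = sum_k s_k * q_ik for one individual.

    q_i: ancestry proportions of individual i over the K populations.
    s:   selfing rates of the K populations.
    """
    if len(q_i) != len(s):
        raise ValueError("q_i and s must both have length K")
    return sum(q * rate for q, rate in zip(q_i, s))


# Hypothetical individual with 25% of its genome from a population with
# selfing rate 0.2 and 75% from a population with selfing rate 0.6.
print(selfing_potential([0.25, 0.75], [0.2, 0.6]))
```

An individual assigned entirely to one population simply inherits that population's selfing rate.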

Individual variation in selfing model: A clear limitation of the population-specific model is that it does not allow for selfing rate variation among individuals within subpopulations, which may be an important feature of the data. To relax this assumption, we employ the DPMM. The rationale behind this approach is not biological, but statistical. Instead of assuming a distribution for selfing rates among individuals and estimating parameters of the model (e.g., beta distribution, logit, probit, etc.), we use a Bayesian version of nonparametric density estimation to ‘‘learn’’ the selfing rates from the data. Informally, it is equivalent to smoothing a histogram of individually estimated selfing rates and taking our uncertainty in the smoothing function into account. Smoothing occurs via collapsing and expanding sets of individuals that have been assigned the same selfing rate (a class) and updating the selfing rate assigned to each class. The parameter governing the smoothing function, $\alpha$, works mathematically by influencing the prior distribution on the number of classes. In essence, the DPMM generates partitions of selfing rates where within a partition all individuals have the same selfing rate. Formally, we think of the Dirichlet process mixture model as a finite mixture model where the number of mixture components is a random variable. We treat each individual’s selfing rate as arising from the same distribution family with different parameters for each component. The joint prior distribution of all selfing rates in the DPMM corresponds to a generalized Polya urn scheme. The hierarchical structure of the Dirichlet process mixture model is

$$\begin{aligned} F &\sim \mathrm{DP}(\alpha, F_0(\theta)) \\ \theta_i \mid F &\sim F(\theta) \\ g_i \mid \theta_i &\sim \mathcal{G}(1 - \theta_i), \end{aligned}$$

where $\mathrm{DP}(\alpha, F_0(\theta))$ is the Dirichlet process with base distribution $F_0$ and scaling parameter $\alpha > 0$, and F is


a random distribution drawn from the DP, with the graphical model representation shown in supplemental Figure 1 at http://www.genetics.org/supplemental/. In words, the above is saying that the distribution F from which the selfing rate for individual i is drawn follows a Dirichlet process. Conditional on the parameters governing F, the selfing rate $\theta_i$ is drawn. Conditional on the selfing rate $\theta_i$, the number of generations until outcrossing $g_i$ is geometrically distributed. The Bayesian framework treats the probability distribution F as an infinite-dimensional parameter, whose prior distribution is a Dirichlet process and whose posterior is a mixture of Dirichlet processes (MacEachern and Müller 1998; McAuliffe et al. 2004). In our case $F_0$ is assumed to be the uniform distribution on [0, 1]. In practice, this amounts to modeling the selfing rate for individual i as either sampled from the uniform distribution or identical to one of the existing selfing rates according to the following probabilities:

$$P(\theta_i = s \mid \theta_1, \theta_2, \ldots, \theta_{i-1}, \alpha, F_0) = \begin{cases} \dfrac{\alpha}{\alpha + i - 1}, & \forall j < i,\ \theta_j \neq s \\[2ex] \dfrac{1}{\alpha + i - 1} \displaystyle\sum_{j=1}^{i-1} I_{\{\theta_j = s\}}, & \exists j < i \ \text{s.t.}\ \theta_j = s. \end{cases} \tag{8}$$

To update $\theta_i$ under the individual selfing rate model, we use iterative Gibbs sampling. That is, we sample $\theta_i$ from its posterior distribution conditional on all other selfing rates in the sample $\theta_{(-i)}$ and G,

$$P(\theta_i = s \mid \theta_{(-i)}, G) = \begin{cases} \alpha b q_0 h(\theta_i \mid g_i), & \forall j,\ \theta_j \neq s \\[1ex] b \displaystyle\sum_{j=1, j \neq i}^{n} f(g_i \mid \theta_j)\, I_{\{\theta_j = s\}}, & \exists j \ \text{s.t.}\ \theta_j = s, \end{cases} \tag{9}$$

where $f(g_i \mid \theta_j)$ is the density function for the geometric distribution and b is a normalizing constant: $b = \left(\alpha q_0 + \sum_{j=1, j \neq i}^{n} f(g_i \mid \theta_j)\right)^{-1}$. Here, $q_0$ is the probability of the number of generations until outcrossing $g_i$, $q_0 = \int_0^1 F_0(s') f(g_i \mid s')\, ds' = \int_0^1 f(g_i \mid s')\, ds'$, since $F_0(s) = 1$ for $s \in [0, 1]$. And $h(\theta_i \mid g_i)$ is the posterior distribution on $\theta_i$ (the selfing rate for individual i), given $g_i$; i.e., $h(\theta_i \mid g_i) = F_0(\theta_i) f(g_i \mid \theta_i)/q_0 = f(g_i \mid \theta_i)/q_0$. In words, the equation above states: assign individual i a unique selfing rate drawn from the posterior distribution $h(\theta_i \mid g_i)$ with probability $\alpha b q_0$; otherwise, assign individual i to an existing selfing rate s with probability proportional to the sum of likelihood of generations of individuals that already carry selfing rate s multiplied by the normalizing term b. The number of classes of selfing rates is randomly determined by the Polya urn model, which is governed by the scaling parameter $\alpha$. It is interesting to note that the prior distribution on the number of classes is identical to the Ewens sampling distribution for a panmictic neutrally evolving Wright–Fisher population as has been pointed out by several authors (e.g., Tavaré and Ewens 1998).
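The Gibbs update for one individual's selfing rate under the Dirichlet process prior can be sketched as follows. Two closed forms are used, both of which follow from combining the uniform base distribution with the geometric likelihood: the marginal probability of g generations is the integral of s^(g-1)(1 - s) over [0, 1], which equals 1/(g(g + 1)), and the corresponding posterior for a new selfing rate is a Beta(g, 2) distribution. The function names and the toy state are illustrative assumptions, not InStruct's implementation.

```python
import random


def geometric_pmf(g: int, theta: float) -> float:
    """f(g | theta) = theta^(g-1) * (1 - theta)."""
    return theta ** (g - 1) * (1.0 - theta)


def update_selfing_rate(i, thetas, gens, alpha, rng):
    """Draw a new theta_i given all other selfing rates and g_i (uniform base F0).

    With weight alpha * q0 a fresh rate is drawn from the posterior h(theta | g_i);
    otherwise individual i copies the rate of individual j with weight f(g_i | theta_j).
    rng.choices normalizes the weights, which plays the role of the constant b.
    """
    g_i = gens[i]
    q0 = 1.0 / (g_i * (g_i + 1))  # integral of s^(g-1) * (1 - s) over [0, 1]
    others = [t for j, t in enumerate(thetas) if j != i]
    weights = [alpha * q0] + [geometric_pmf(g_i, t) for t in others]
    choice = rng.choices(range(len(weights)), weights=weights)[0]
    if choice == 0:
        return rng.betavariate(g_i, 2)  # new class: posterior under the uniform prior
    return others[choice - 1]           # join the class of an existing individual


rng = random.Random(1)
thetas = [0.2, 0.2, 0.8]   # hypothetical current selfing rates
gens = [1, 1, 4]           # hypothetical generations back to outcrossing
print(update_selfing_rate(2, thetas, gens, alpha=1.0, rng=rng))
```

Listing every other individual separately, rather than every distinct rate once, gives each existing class a total weight equal to the sum over its members, which is the second case of the update.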

Markov chain Monte Carlo procedure: To sample

from the posterior distribution of all parameters in our

model, we use a single-component Metropolis algorithm

with blockwise updating. The sampling scheme consists

of five updating steps. For the mth iteration, the se-

quence of parameter updating is

1. Update allele frequencies $P^{(m)}$ via the Gibbs sampler.

2. Update selfing rates $S^{(m)}$ at either population or individual levels. Under the population-specific model, selfing rates are updated using the back-reflection sampler (BRS) or the ‘‘adaptive independence sampler’’ (AIS) (see appendix for more information). Selfing rates under the individual model are produced from the Dirichlet process mixture model.

3. Update the number of generations until outcrossing events $G^{(m)}$ via an independent Metropolis–Hastings step.

4. Update the population assignments $Z^{(m)}$ via the Gibbs sampler.

5. Update the proportion of genome assignments $Q^{(m)}$ via the Gibbs sampler.

The mathematical details are provided in the appendix. The above algorithm has been implemented in an ANSI C computer program, InStruct (Inbreeding and Substructure), available from bustamantelab.cb.bscb.cornell.edu/software.shtml. A web interface for InStruct is also available through cbsuapps.tc.cornell.edu/InStruct.aspx.

Inference: The selfing rate of each population (or individual) is estimated as the sample average over M retained MCMC draws:

$$E(s_k \mid X) \approx \frac{1}{M} \sum_{m=1}^{M} s_k^{(m)}.$$

Posterior credibility intervals are constructed using the symmetric percentage method [i.e., the $\alpha/2$ and $(1 - (\alpha/2))$ empirical quantiles of the MCMC draws for an $\alpha$-level credibility interval] since we have found that

the posterior mean is often very close to the posterior

median, implying symmetric posterior distribution of

population selfing rates. We also consider the posterior

median as a point estimator of individual selfing rates

since the posterior distribution of selfing rates is often

quite skewed. Inference for the rest of the parameters is

done in a similar manner as in Pritchard et al. (2000).
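These point and interval summaries can be computed directly from the retained draws. A minimal sketch, assuming the draws for one selfing rate are held in a list; the quantile rule (linear interpolation between order statistics) and the function name are assumptions of this example rather than details given in the text.

```python
from statistics import fmean, median


def summarize_draws(draws, alpha=0.05):
    """Posterior mean, median and equal-tailed interval from retained MCMC draws."""
    ordered = sorted(draws)
    n = len(ordered)

    def quantile(p):
        # Linear interpolation between neighboring order statistics.
        pos = p * (n - 1)
        low = int(pos)
        high = min(low + 1, n - 1)
        return ordered[low] + (pos - low) * (ordered[high] - ordered[low])

    return {
        "mean": fmean(ordered),
        "median": median(ordered),
        "interval": (quantile(alpha / 2), quantile(1 - alpha / 2)),
    }


# Evenly spaced toy "draws" on [0, 1], for illustration only.
print(summarize_draws([i / 100 for i in range(101)]))
```

Comparing the mean and median returned here is a quick way to see whether a posterior is roughly symmetric or skewed.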

Assessing convergence: To assess convergence of our

MCMC scheme, we use the Gelman–Rubin statistics that

are based on the one-way analysis of variance (ANOVA)

and compare the within-chain variance to the between-chain variance (Gelman and Rubin 1992). At stationarity, these should be equal. We use the Gelman–Rubin statistics

to check the convergence of log-likelihood and selfing

rates across different chains after applying the following

identifiability constraint to the retained MCMC draws:

As in other Bayesian mixture settings, we are faced with the label-switching problem across chains [i.e., for different chains the algorithm may switch the labels of which population is 1, 2, etc., without affecting the likelihood (Jasra et al. 2005)]. We apply a simple identifiability constraint on the parameter space to break

the symmetry in the likelihood; namely, the posterior

mean selfing rate of each population along the MCMC

is calculated and sorted in ascending order and the

population with lowest average selfing rate is labeled

1; thus only one permutation of population labeling

is obtained. This constraint is obviously effective

only when the selfing rates differ substantially among

subpopulations.
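Both the convergence diagnostic and the relabeling rule are easy to sketch. The first function below computes one common form of the Gelman–Rubin statistic (the square root of the pooled variance estimate over the mean within-chain variance) for a scalar quantity; the second orders populations by ascending mean selfing rate. Names and toy inputs are illustrative assumptions, and the exact variant of the statistic used by InStruct is not specified here.

```python
from math import sqrt
from statistics import fmean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one scalar."""
    n = len(chains[0])
    within = fmean([variance(c) for c in chains])        # mean within-chain variance
    between_over_n = variance([fmean(c) for c in chains])  # variance of chain means
    pooled = (n - 1) / n * within + between_over_n
    return sqrt(pooled / within)


def relabel_by_selfing(draws_by_population):
    """Population indices sorted by ascending posterior mean selfing rate."""
    means = [fmean(d) for d in draws_by_population]
    return sorted(range(len(means)), key=means.__getitem__)


print(gelman_rubin([[0.10, 0.20, 0.30], [0.70, 0.80, 0.90]]))  # chains disagree
print(relabel_by_selfing([[0.7, 0.8], [0.1, 0.2]]))
```

Applying the relabeling to each chain before computing the diagnostic keeps chains that differ only by a permutation of labels from looking non-converged.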

Simulations: To assess the power and robustness of this approach under different selfing scenarios, we simulate data using standard coalescent theory with selfing and population structure. We treat each diploid individual as a deme of two chromosomes and use a separation-of-timescales approach to draw samples under selfing (Nordborg and Donnelly 1997; Nordborg 2000; Wakeley 2000). The simulation was a two-step process:

Step 1. Calculate for each locus the number of lineages $n'_l$ that make it through the scattering phase:

1. Sample the number of generations $G = \{g_i: i = 1, 2, \ldots, N\}$ until an outcrossing event in the past for each individual from the geometric distribution $\mathcal{G}(1 - \theta_i)$. (This random variable is a constant across all the loci for a given individual and will strongly influence whether lineages for a given individual coalesce due to selfing or scatter through outcrossing.)

2. If an individual is the product of outcrossing in the previous generation (i.e., $g_i = 1$), then for all loci the pair of chromosomes do not coalesce within individual i. Therefore, the probability that the two chromosomes coalesce in the past, denoted as $r_i$, is 0. If an individual is a product of selfing in the previous generation ($g_i = 2$), then $r_i$ is simply 1/2, and if an individual is generated via multiple generations of selfing (i.e., $g_i > 2$), then $r_i$ is $1 - 0.5^{(g_i - 1)}$.

3. For each locus l, draw $U_{il}$, an independent uniform(0, 1) random variable, for i = 1, ..., N. If $U_{il} < 1 - r_i$, set the number of lineages $n'_{il}$ that make it out of the scattering phase to 2 for individual i; otherwise, set it to 1.

4. Sum up among individuals to obtain the number of lineages at locus l that make it out of the scattering phase: $n'_l = \sum_i n'_{il}$.

Step 2. Given $n'_l$, simulate allelic history at locus l via the standard coalescent software ‘‘ms’’ (Hudson 2002). For all loci where individual i has $n'_{il} = 1$, store the individual as homozygous due to selfing.
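Step 1 of the simulation scheme can be sketched as follows. The code draws the number of generations back to outcrossing for each individual, converts it to a within-individual coalescence probability, and then decides locus by locus whether one or two lineages leave the scattering phase. Step 2 (the call to a coalescent simulator) is not shown, and all names are illustrative assumptions.

```python
import random


def coalescence_prob(g: int) -> float:
    """r_i = 1 - 0.5^(g-1): 0 for g = 1, 1/2 for g = 2, and so on."""
    return 1.0 - 0.5 ** (g - 1)


def scattering_phase(thetas, n_loci, rng):
    """Return (gens, lineages, totals) for individuals with selfing rates thetas (< 1).

    gens[i]        generations back to the last outcrossing event for individual i
    lineages[l][i] number of lineages (1 or 2) of individual i surviving at locus l
    totals[l]      number of lineages at locus l entering the standard coalescent
    """
    gens = []
    for theta in thetas:
        g = 1
        while rng.random() < theta:  # selfed in this generation, go one further back
            g += 1
        gens.append(g)
    lineages = [
        [2 if rng.random() < 1.0 - coalescence_prob(g) else 1 for g in gens]
        for _ in range(n_loci)
    ]
    totals = [sum(row) for row in lineages]
    return gens, lineages, totals


gens, lineages, totals = scattering_phase([0.5] * 10, n_loci=5, rng=random.Random(42))
print(gens)
print(totals)
```

Because each individual's g is drawn once and reused at every locus, highly selfed individuals tend to be homozygous across most of their loci, which is the genomewide signal the inference method exploits.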

Using this procedure, we consider several substruc-

ture and selfing models assuming equal and constant

subpopulation sizes, no migration among subpopulations, and a divergence time t of 0.5 measured in standard units of 2N generations. We use ‘‘model k’’ to identify the simulated population models, where k represents the number of subpopulations in the sample; in our cases, k ∈ {1, 2, 3, 6}.

We also consider several "individual"-based models for how selfing varies among individuals in the sample:

Model Ident: A single population with identical selfing rates across individuals.

Model Norm: A single population with variable selfing rates across individuals, where the logit-transformed selfing rates follow the normal distribution with mean 0 and standard deviation s; i.e., $\log(u_i/(1 - u_i)) \sim N(0, s)$.

Model Beta: A single population with variable selfing rates across individuals, which follow the beta distribution with different combinations of scale and shape parameters a and b; i.e., $u_i \sim B(a, b)$.
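As a rough illustration of the three individual-based models, the sketch below draws per-individual selfing rates with Python's standard library. The function name `draw_selfing_rates` and its argument names are hypothetical, and the second parameter of the normal distribution is treated as a standard deviation, as stated in the model description.

```python
import math
import random


def draw_selfing_rates(model, n, rng, s=0.3, sd=1.0, a=9.0, b=3.0):
    """Draw n individual selfing rates under model 'ident', 'norm', or 'beta'."""
    if model == "ident":
        # Every individual shares the same selfing rate s.
        return [s] * n
    if model == "norm":
        # logit(u_i) ~ Normal(0, sd), mapped back to (0, 1) by the inverse logit.
        return [1.0 / (1.0 + math.exp(-rng.gauss(0.0, sd))) for _ in range(n)]
    if model == "beta":
        # u_i ~ Beta(a, b).
        return [rng.betavariate(a, b) for _ in range(n)]
    raise ValueError("unknown model: " + model)
```

For example, `draw_selfing_rates("beta", 100, random.Random(1), a=10, b=25)` gives 100 rates scattered around 10/35, one of the two beta settings listed in Table 1.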

RESULTS

Application to simulated data: Using the simulation scheme outlined above, we generated 100 data sets per parameter combination per population model and one representative data set per parameter combination per individual model. Detailed information regarding choice of parameters is provided in Table 1. For each data set, InStruct was run for five independent chains, each chain with 1,000,000 iterations in total, 500,000 burn-in iterations, and a thinning interval of 10 iterations between retained draws. For all the simulated runs, the reported diagnostic Gelman–Rubin statistic is <1.10, indicating good convergence in both log-likelihood and selfing rates. We also used the direct plotting method to show the convergence of five MCMC chains with distinct initial starting conditions. Diagnostic graphs of convergence of selfing rates are provided in supplemental Figure 2 at http://www.genetics.org/supplemental/, showing the first 2000 iterations of two randomly chosen data sets under model 1 with selfing rates 0.3 and 0.7. The values of the selfing rates converge quickly, normally entering the stationary distribution within a few hundred iterations. The convergence of population structure is slower than that of selfing rates, but it is usually on the same order as STRUCTURE. We observed that as the complexity of population structure increased (i.e., as k increased), so did the number of iterations of the MCMC algorithm required to ensure convergence (data not shown).
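For readers unfamiliar with the convergence diagnostic, the following sketch computes the Gelman–Rubin potential scale reduction factor in its common textbook form for a single scalar quantity (for example, a selfing rate) tracked in several chains. It is an assumption that this matches the variant InStruct reports; the function name `gelman_rubin` is hypothetical, and the input is taken to be the retained (post-burn-in, thinned) draws of equal length per chain.

```python
import math


def gelman_rubin(chains):
    """Potential scale reduction factor for m chains of n retained draws each."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    within = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1) for c, mu in zip(chains, means)
    ) / m
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)
```

Values close to 1 indicate that the chains agree; chains stuck in different modes inflate the between-chain term and push the statistic well above the 1.10 threshold used here.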

Inference of selfing rates for population-specific models: Our inference goals are twofold. First, we are concerned with the accuracy of selfing rate estimation under each of the simulation scenarios described above. Second, we wish to assess the accuracy of population assignments once selfing rates have been estimated.

Under model 1, each sample contains partially selfing individuals and no population substructure. In Figure 2, we report the distribution of estimated posterior mean selfing rates among replicate data sets for varying levels of s. With partial self-fertilization (i.e., s > 0), we see that the distribution of the posterior mean estimates of selfing rates falls mostly within the range containing the true selfing rates ± 0.1. For example, for data simulated under s = 0.5 the vast majority of the estimated rates across the 100 replicate data sets lie within [0.4, 0.6]. It is also interesting to note that the modes of the distributions of posterior mean estimates are the true selfing rates (Figure 2, dashed lines).

Model 2 assumes two subpopulations with equal or distinct selfing rates split from a common ancestral population in the recent past (t = 0.5 in units of $2N_e$ generations). In Figure 3, we report the distribution of the posterior estimates of the selfing rates for the two subpopulations under varying levels of outcrossing. In comparison to model 1, the variance in estimated selfing rates among replicate data sets increased (Figure 3). Population assignment worked extremely well for this model with nearly 100% correct assignment probabilities for all individuals in all replicate data sets.

Figures 4 and 5 illustrate the accuracy of our selfing rate estimation under a more sophisticated population structure model. By comparing Figure 4 (model 3, where the sample is drawn from three populations) vs. Figure 2 (model 1) and Figure 3 (model 2) we can assess how population structure affects our inference regarding selfing. We note that the width of the distribution of the posterior mean of population selfing rates increases, implying that the variance of the estimator becomes larger and estimation becomes slightly upwardly biased, potentially due to population misidentification for some individuals, especially when K = 6 subpopulations are simulated (Figure 5). It is also important to note that for the case of a large variance among populations in selfing rates, a small fraction of replicate data sets converged to a point with high selfing and low population structure (i.e., high "bump" near 0.90 in Figure 4D). In summary, InStruct has high accuracy in estimating selfing rates under a myriad of selfing rate combinations for K = 1, 2, 3, and 6 populations.

Another interesting result from Figures 2–5 is that regardless of K when the selfing rates are near 0 or 1, the estimator has a lower variance than when the selfing rate is near 50%. That is, when a population is nearly

Figure 2.—The posterior distribution of selfing rates esti-

mated from simulations without population structure under

six levels of population selfing rates. Each colored line repre-

sents the density of the posterior mean of selfing rates of 100

simulation runs under a specific selfing rate in the key.

TABLE 1

Parameters used for data simulated under each model

Model  Data set no.  Subpop. no.  Subpop. size  Sample size  Loci no.  Combinations or distributions of selfing rates
1      100           1            100           100          100       0, 0.1, 0.3, 0.5, 0.7, 0.9
2      100           2            50            100          100       (0, 0.3), (0, 0.9), (0.3, 0.3), (0.3, 0.6), (0.3, 0.9), (0.9, 0.9)
3      100           3            50            150          100       (0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.4, 0.5, 0.6), (0.1, 0.5, 0.9), (0.25, 0.6, 0.85), (0.05, 0.45, 0.75)
6      50            6            50            300          100       (0.05, 0.3, 0.45, 0.55, 0.75, 0.95)
Ident  1             1            100           100          100       s = 0.3 or s = 0.7
Norm   1             1            100           100          100       logit(s) ~ N(0, 1) or ~ N(0, 10)
Beta   1             1            100           100          100       B(9, 3) or B(10, 25)

Data set number indicates the number of replications to be simulated under a specific model. Subpop. num-

ber indicates the number of subpopulations assumed in the simulation. Subpop. size is the number of individ-

uals belonging to each subpopulation. Sample size means the total number of individuals. Loci number is the

number of unlinked loci genotyped in each individual. Combinations of selfing rates are the different selfing

levels used in the simulation; e.g., (0.3, 0.6) means two subpopulations with selfing rates 0.3 and 0.6, respectively.


completely selfing or completely outcrossing, the mating system strongly affects patterns of genetic variation, which makes it easy to detect and estimate selfing. In contrast, when selfing rates are moderate and the population is substructured, the precision of our estimator decreases as evidenced by the appearance of multimodal or flat posterior distributions for $s_k$.

We expect the accuracy of our selfing rate estimation to be influenced by several facets of the data, including sample size and number of loci. To address this question, we compared the coverage of 90% credibility intervals for $s_k$ under different combinations for the total number of individuals sampled and the number of loci genotyped (see Table 2, 100 data sets per combination). Several interesting patterns emerged from this analysis. First, when there is a single population (model 1), the Bayesian credibility intervals are conservative since almost all entries in the table are significantly >90% and

Figure 3.—The posterior distribution of selfing rates estimated from simulations under model 2 with six combinations of selfing rates: (A) s = {0.0, 0.3}, (B) s = {0.0, 0.9}, (C) s = {0.3, 0.3}, (D) s = {0.3, 0.6}, (E) s = {0.3, 0.9}, and (F) s = {0.9, 0.9}. Each colored line represents the density of the posterior mean of a subpopulation selfing rate from 100 simulation runs under a specific combination of selfing rates in the key.

Figure 4.—The posterior distribution of selfing rates estimated from simulations under model 3 with six combinations of selfing rates: (A) S = {0.4, 0.5, 0.6}, (B) S = {0.1, 0.5, 0.9}, (C) S = {0.1, 0.1, 0.1}, (D) S = {0.25, 0.6, 0.85}, (E) S = {0.05, 0.45, 0.75}, and (F) S = {0.9, 0.9, 0.9}. Each colored line represents the density of the posterior mean of a subpopulation selfing rate from 100 data sets simulated under a specific selfing rate combination in the key.


none has an observed coverage statistically <90%. Second, when we sampled n = 50 individuals per subpopulation and L = 100 loci (first line of all comparisons in the table), the coverage of the credibility intervals was well behaved across different population structure scenarios except those with extreme differences in $s_k$ among subpopulations. That is, model 1, model 2, and many combinations in model 3 had excellent coverage. One exception was model 3 with $s_k \in \{0.05, 0.45, 0.75\}$, where the realized coverage is closer to 82% than to 90%. Likewise, in model 6 the average coverage among the five subpopulations with selfing rates below s = 0.95 was only 84% (for the s = 0.95 subpopulation the coverage was conservative). The third interesting pattern that emerges from Table 2 is that reducing both sample size per subpopulation and number of loci per genotype tended to decrease the coverage of the credibility intervals, but not systematically. That is, in all models investigated, both the treatment with n = 10 individuals per subpopulation and L = 100 loci and the treatment with n = 50 individuals per subpopulation and L = 20 loci tended to have worse coverage than the standard of n = 50 individuals and L = 100 loci. There are exceptions, however, where the smaller n treatment had better (or more conservative) coverage than the large n treatment. This is probably due to a larger variance of the selfing rate estimator.
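The coverage figures discussed above can be computed from replicate posterior samples along the following lines. This is a minimal sketch under stated assumptions: the interval is taken to be equal-tailed (the text does not say whether equal-tailed or highest-density intervals were used), and the names `credible_interval` and `coverage` are hypothetical.

```python
def credible_interval(samples, level=0.90):
    """Equal-tailed credible interval read off the sorted posterior draws."""
    xs = sorted(samples)
    n = len(xs)
    lo_idx = round((1.0 - level) / 2.0 * (n - 1))
    hi_idx = n - 1 - lo_idx
    return xs[lo_idx], xs[hi_idx]


def coverage(replicate_samples, true_value, level=0.90):
    """Fraction of replicate data sets whose interval contains the true value."""
    hits = 0
    for samples in replicate_samples:
        lo, hi = credible_interval(samples, level)
        if lo <= true_value <= hi:
            hits += 1
    return hits / len(replicate_samples)
```

With well-calibrated intervals the returned fraction should sit near the nominal level; values well above it correspond to the "conservative" entries in Table 2, and values well below it to undercoverage.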

Inference of selfing rates—individual variation models: Figure 6 shows the results of the DPMM method on a single typical data set under various models for how u varies among individuals. We observe that for all the cases considered, DPMM estimation of the distribution of selfing rates across 100 individuals approximates the true distribution well. That is, the mean, the median, and the mode are mostly centered at their true values, especially when selfing rates follow a beta distribution (Figure 6, C and F). It is important to note that the peaky and multimodal shape of the posterior distribution is an inherent property of the DPMM model, as DPMM generates finite discrete classes within which individuals share the same selfing rate, and once a large class is formed, the potential that an individual value belongs to this class is greatly increased.

A key part of the DPMM method is a choice for the a-parameter that governs the prior distribution on the number of classes of selfing rates. Figure 6 summarizes simulations with various values of a. According to McAuliffe et al. (2004), for n observations the prior expected number of classes in the data is approximately a log n. We chose values of a within the range [1/log n, n/log n], corresponding to one class for all the observations and one class per observation, respectively. Smaller values of a lead to a "peaky" distribution with many values clustered in one class. When a is large, the proportion of values sampled from the base distribution increases, resulting in smoother density estimation. Intermediate values of a tend to classify a reasonable number of values into each class, generally resulting in a better approximation to the true distribution.
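To give a feel for how the scaling parameter controls the number of classes, the sketch below compares the exact prior expectation under a Dirichlet process, the sum over i = 0, ..., n - 1 of a/(a + i), with the a log n approximation quoted above. The exact sum is a standard property of the Dirichlet process rather than something stated in this article, the function names are hypothetical, and natural logarithms are assumed.

```python
import math


def expected_classes(a, n):
    """Exact prior expected number of classes for n observations under a DP with scale a."""
    return sum(a / (a + i) for i in range(n))


def log_approximation(a, n):
    """The a * log(n) approximation quoted in the text."""
    return a * math.log(n)


if __name__ == "__main__":
    n = 100
    for a in (1, 5, 10, 20):
        print(a, round(expected_classes(a, n), 2), round(log_approximation(a, n), 2))
```

For n = 100 the range [1/log n, n/log n] is roughly [0.22, 21.7], so the four values a = 1, 5, 10, and 20 used in Figure 6 all fall inside it. The approximation is closest for small a and overshoots as a grows, since the expected number of classes can never exceed n.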

When evaluating the performance of DPMM in estimating the distribution of selfing rates among individuals, a key issue should be considered: each $u_i$ parameter is effectively estimated from one single data point. That is, the most information one can have in our model about selfing rate $u_i$ is the number of generations until an outcrossing event, $g_i$. Even if $g_i$ were known without error, there would still be high uncertainty in $u_i$ since one has observed only a single geometric random variable. Therefore, allowing selfing rates to vary among individuals in the sample when one has little information about a particular $u_i$ may produce density estimation that is wildly different from the true distribution. That is, the inherent uncertainty due to sampling variation coupled with overshrinkage of parameters (see discussion below) may lead to shape estimation quite different from the true density. To address this issue, in supplemental Figure 3 (http://www.genetics.org/supplemental/) we plot the distribution of the difference between the estimated selfing rate and its true value for all the individuals in the simulations of the three individual selfing rate models assuming a = 5. Most of them appear to follow a nearly normal distribution, with mean 0 and standard deviation <0.15 for almost all the parametric simulations conducted. We also report the estimated densities for 20 data sets simulated under a beta distribution for selfing rates, using two parameter combinations, in supplemental Figure 4 (http://www.genetics.org/supplemental/). It appears that the distributions of estimated selfing rates are similar in shape to the underlying true beta distribution with considerable among-sample variation.
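The remark that a single geometric observation carries little information about a selfing rate can be made concrete with a small calculation. The sketch below assumes, for illustration only, a uniform prior on u and the likelihood P(g | u) = u^(g-1)(1 - u), which gives a Beta(g, 2) posterior with mean g/(g + 2) and distribution function (g + 1)u^g - g u^(g+1). The function names are hypothetical, and this is not the DPMM posterior that InStruct actually samples.

```python
def posterior_mean(g):
    """Mean of the Beta(g, 2) posterior for u given g generations to outcrossing."""
    return g / (g + 2)


def posterior_cdf(u, g):
    """Distribution function of Beta(g, 2): (g + 1) u^g - g u^(g + 1)."""
    return (g + 1) * u ** g - g * u ** (g + 1)


def posterior_interval(g, level=0.90):
    """Equal-tailed credible interval for u, found by bisection on the CDF."""
    def quantile(p):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if posterior_cdf(mid, g) < p:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)
```

Under these assumptions, an individual with g = 1 has posterior mean 1/3 but a 90% interval of roughly (0.03, 0.78), which illustrates how wide individual-level intervals remain even when g is known exactly.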

Figure 5.—The posterior distribution of selfing rates esti-

mated from simulations with six subpopulations of unequal

selfing rates. Each colored line represents the density of

the posterior mean of a subpopulation selfing rate from 50

simulation runs under a specific selfing rate in the key.


Inference of population assignment for simulated data: Our accuracy in classifying individuals into populations is comparable to that of STRUCTURE with the original model when no self-fertilization exists. For the 100-data-set replications under model 2 and model 3 at various levels of selfing, each individual is separated into one of the major groups appropriately with frequency 0.99. The accuracy of classification decreases slightly for model 6 (the assignment proportion is approximately 0.95) as might be expected with a more complex demographic scenario. One disadvantage of InStruct is the tendency of merging subpopulations with similar allele frequencies and similar selfing rates when the data do not provide sufficient evidence of differentiation. This phenomenon, which has also been observed in the STRUCTURE-like algorithm BAPS (Corander et al. 2003) and the Bayesian clustering algorithm with hidden Markov random field (Francois et al. 2006), mainly occurs when assuming more subpopulations than are represented in the real data or when sample size per true subpopulation is very small.

Application to rice data: To gauge the performance of our algorithm on real data, we applied InStruct to 111 single-nucleotide polymorphisms (SNPs) discovered via direct sequencing across 111 unlinked loci of n = 16 individuals of O. rufipogon, a wild ancestor of the cultivated rice species (A. L. Caicedo, S. H. Williamson, A. Fledel-Alon, T. L. York, N. Polato, K. M. Olsen, R. Nielsen, S. McCouch, C. D. Bustamante, and M. D. Purugganan, unpublished results). Each SNP has two

TABLE 2

Coverage of 90% credible intervals of selfing rates under models 1, 2, 3, and 6 with respect to specific population size and locus number based on 100 data sets per selfing rate combination (50 data sets for model 6)

Model 1
Sample size  Locus no.  0.0    0.1    0.3    0.5    0.7    0.9
100          100        1.00   0.93   0.93   0.912  0.95   0.958
20           100        0.988  0.99   0.92   0.888  0.93   0.92
100          20         0.99   0.958  0.932  0.94   0.924  0.96

Model 2: combinations (0.0, 0.3), (0.0, 0.9), (0.3, 0.3)
Sample size  Locus no.  0.0    0.3    0.0    0.9    0.3    0.3
100          100        0.976  0.878  0.96   0.94   0.882  0.914
20           100        0.732  0.892  0.734  0.938  0.93   0.91
100          20         0.772  0.99   0.742  0.97   0.95   0.91

Model 2: combinations (0.3, 0.6), (0.3, 0.9), (0.9, 0.9)
Sample size  Locus no.  0.3    0.6    0.3    0.9    0.9    0.9
100          100        0.91   0.948  0.968  0.924  0.902  0.99
20           100        0.948  0.94   0.88   0.926  0.88   0.98
100          20         0.898  0.9    0.928  0.924  0.894  1.00

Model 3: combinations (0.4, 0.5, 0.6), (0.1, 0.5, 0.9)
Sample size  Locus no.  0.4    0.5    0.6    0.1    0.5    0.9
150          100        0.948  0.958  0.948  0.832  0.92   0.97
30           100        0.962  0.976  0.916  0.856  0.932  0.86
150          20         0.964  0.97   0.964  0.792  0.868  0.954

Model 3: combinations (0.25, 0.6, 0.85), (0.05, 0.45, 0.75)
Sample size  Locus no.  0.25   0.6    0.85   0.05   0.45   0.75
150          100        0.89   0.924  0.97   0.816  0.818  0.836
30           100        0.852  0.884  0.896  0.788  0.91   0.892
150          20         0.86   0.97   0.978  0.766  0.972  0.968

Model 6
Sample size  Locus no.  0.05   0.30   0.45   0.55   0.75   0.95
300          100        0.800  0.900  0.840  0.800  0.860  1.000

Each data set was run for five independent MCMCs, with 1,000,000 iterations, 500,000 burn-in iterations, and a thinning interval of 10 iterations (for model 6 one chain per data set). The proposal method for selfing rate here is the AIS.


alleles and only one SNP per locus was used in our analysis. The individuals in the sample were collected from the wild with 9 sampled from China, 5 from Nepal, 1 from India, and 1 from Laos. We focus on a subset of the data [n = 91 (78.4%) SNPs] that contains no missing data. We ran InStruct and STRUCTURE on these data for five independent chains, each chain with 200,000 iteration steps, 100,000 burn-in, and a thinning interval of 10 steps, assuming different starting points. Graphical representations of population assignments from STRUCTURE and InStruct were produced from the program Distruct (Rosenberg et al. 2002).

When two subpopulations are assumed, the estimation of selfing rates and substructure converged very well among the five independent chains. The classification of individuals is consistent with geographical separation in that all the individuals from China formed one major cluster and the other cluster mainly contains Nepalese individuals. The fact that the Indian individual is clustered with Nepal is quite reasonable as India is nearer to Nepal than China geographically and the Himalayan mountains likely reduce pollen flow to and from China. The Laos individual falls in between the two clusters with a larger part of its alleles (91.14%) as likely of Nepalese origin and approximately 8.86% of Chinese origin. This classification is almost the same as that of STRUCTURE, although the proportion of the genome that originates in each population is slightly different for several individuals, which might be due to our accounting for self-fertilization (Figure 7a). One critical difference is the classification of a Chinese individual that STRUCTURE predicts as admixed with nearly equal ancestry in the two clusters. Using InStruct, this same individual is now classified with high posterior probability 0.999 [90% C.I.: (0.996, 1.000)] in the "Chinese" cluster. The lack of overlap in credibility intervals implies there is significant discrepancy in classification of this individual, as was observed in the simulated data presented in Figure 1.

When we ran InStruct assuming three subpopulations, the convergence rate was poor with some runs converging on all individuals assigned to only two clusters, leaving the third cluster empty. This is due to the tendency of the Bayesian clustering algorithm to merge subpopulations with similar allele frequencies. A likely reason for this in our case is the small sample size of just 16 individuals, and the optimal classification is to assume K = 2.

The posterior means of selfing rates for the Chinese and Nepalese subpopulations under the population model are 0.697 and 0.484 with 90% confidence intervals (0.553, 0.826) and (0.260, 0.699), respectively. While the confidence intervals overlap, this is suggestive of potential regional differences in selfing rate for O. rufipogon. This result should be interpreted with caution, however, since the Nepalese material was collected recently from the wild while the Chinese individuals come mainly from an existing germplasm collection and may have undergone purification as part of standard germplasm propagation (S. McCouch, personal communication). In Figure 7b, we present the results of running the individual-based model of InStruct that uses DPMM for density estimation. We note that the majority of individuals have posterior means for u, the selfing rate parameter, between 0.5 and 0.7, which is consistent with previous estimates based on pollen count (Oka 1988). It is important to note that confidence intervals for u are much wider under the individual-based

Figure 6.—The distributions of posterior medians of selfing rates of 100 individuals drawn from the Dirichlet process mixture model. The magenta dashed lines represent the true distribution of selfing rates in the simulation. The red, green, blue, and yellow solid lines are the estimated densities from the Dirichlet process mixture model with scaling parameters a = 1, a = 5, a = 10, and a = 20, respectively. The individual selfing rates were simulated under three different scenarios in three columns: (1) model ident (A) S = 0.3 and (D) S = 0.7, (2) model norm (B) logit(S) ~ N(0, 1) and (E) logit(S) ~ N(0, 10), and (3) model beta (C) S ~ beta(9, 3) and (F) S ~ beta(10, 25).


model as compared to the population-based estimate of

selfing rates.

DISCUSSION

In this article, we present a modification of the popular Bayesian clustering program STRUCTURE (Pritchard et al. 2000) for inferring population substructure and self-fertilization simultaneously. Using extensive simulations with four distinct demographic models (K = 1, 2, 3, 6), we demonstrate that our method can accurately estimate selfing rates in the presence of population structure in the data. Additionally, it can classify individuals into their appropriate subpopulations without the assumption of Hardy–Weinberg equilibrium within subpopulations.

It is important to note that the accuracy of selfing rate estimation is influenced by multiple factors, including sample size and number of loci, with decreased precision when they are small, as is illustrated in Table 2. Likewise, we find that the complexity of the true demographic history underlying data (e.g., the number of subpopulations derived from a common ancestral population) also influences accuracy. In general, more complicated models lead to decreased precision in selfing rate estimation. For example, when we simulated six subpopulations split from one ancestral population, the coverages of 90% credible intervals of selfing rates are near 85%.

As with other methods for inference of population structure, InStruct explores a complex multimodal likelihood surface using a stochastic search algorithm. This means that the program may "get stuck" in suboptimal parts of the parameter space. We, therefore, encourage users to run several chains and compare the expected log-likelihood as with other MCMC schemes. In practice, we have observed that InStruct infrequently merges subpopulations, especially ones with correlated allele frequencies, which can result in "empty" clusters and poor convergence in population assignments and selfing rate estimation. This phenomenon has been described previously for other STRUCTURE-like algorithms such as BAPS (Corander et al. 2003) and the Bayesian clustering algorithm with hidden Markov random field (Francois et al. 2006). One idea we have explored is to use simulated annealing to "heat and cool chains" so as to allow movement among local maxima. We have also investigated stopping MCMC chains with "empty clusters," where an empty cluster contains fewer than one expected individual after sufficient burn-in. While this suggestion is ad hoc and in a sense does not solve the poor convergence problem, we have found that it tends to control against merging populations into an extreme pathological case of K = 1 with high selfing for data simulated under K > 1.
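The "empty cluster" criterion described above can be checked with a few lines of code. The sketch below is an assumption about one reasonable way to apply it, not InStruct's implementation: it takes a matrix of posterior membership proportions (one row per individual, one column per cluster), sums each column to get the expected number of individuals per cluster, and flags clusters whose expected size is below one. The name `empty_clusters` and the `threshold` argument are hypothetical.

```python
def empty_clusters(membership, threshold=1.0):
    """Return indices of clusters with fewer than `threshold` expected individuals.

    membership[i][k] is the estimated proportion of individual i assigned to cluster k.
    """
    n_clusters = len(membership[0])
    expected_sizes = [sum(row[k] for row in membership) for k in range(n_clusters)]
    return [k for k, size in enumerate(expected_sizes) if size < threshold]
```

A chain for which this returns a non-empty list after burn-in would be a candidate for stopping and restarting from a different starting point.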

We employ the Dirichlet process mixture model to estimate how individual selfing rates vary among individuals in the sample. Instead of assuming a distribution for selfing rates among individuals and estimating parameters of the model, we use a Bayesian version of nonparametric density estimation to "learn" the selfing rates from the data. We anticipate that the individual-specific model will facilitate plant breeding by providing a fairly accurate estimate of individual selfing rates divorced from the consequences of population structure. There are a few statistical caveats, however, that we raise.

In many statistical inference problems, the number of parameters to be estimated is much smaller than the sample size. Therefore, "large-sample" estimators such as maximum likelihood or method-of-moments have good statistical properties (e.g., unbiased, consistent, efficient, etc.). In our case, we wish to estimate a selfing rate parameter for each individual in the sample based on a single (unobserved) data point, namely, G, the number

Figure 7.—(a) The Distruct plot of population assignment for n = 16 rice accessions assuming K = 2 from STRUCTURE and InStruct. The two clusters are represented by pink and light blue. For InStruct, the corresponding selfing rates of subpopulations are indicated at the top. (b) Estimated selfing rates under the individual model using the Dirichlet process prior model. The points represent the posterior mean of individual selfing rates and their different shapes indicate the countries where that individual was collected: squares with x's inside represent China, diamonds represent Nepal, circles represent India, and triangles indicate Laos. The x-axis represents the index of 16 individuals collected from the wild. The red lines across the points represent the 90% posterior confidence intervals of individual selfing rates.


of generations of selfing in the genealogy of the individual until an outcrossing event looking back in time. For this type of inference problem, standard large-sample statistical approaches are not accurate and approaches that "share" information across related parameters (so-called "shrinkage" estimators) often have better performance. That is, when estimating the selfing rate of a given individual i we use information regarding selfing rates for all other individuals in the sample and iterate this procedure. Shrinkage methods reduce (or shrink) the variance of estimated parameters by drawing outliers nearer to the mean value. The drawback to such an approach is that we may sometimes "overshrink" and downwardly or upwardly bias the estimation for some individuals with selfing rates in the tails of the distribution.

We find that the distribution of estimated selfing rates minus the corresponding true values has the shape of a normal distribution with mean zero and standard deviation approximately 0.15 under various simulated individual models, as shown in supplemental Figure 3 (http://www.genetics.org/supplemental/). Estimation is more accurate when no substructure exists or subpopulations have similar selfing rates, compared to subpopulations with very distinct selfing rates, as the Dirichlet process mixture model tends to find a local maximum and thus cluster individual data points into big categories of selfing rates. When DPMM is applied to data sets simulated with two subpopulations and two distinct selfing rates, it sometimes peaks at two true selfing rates (supplemental Figure 5D at http://www.genetics.org/supplemental/) or peaks at a value in the middle of the two true selfing rates and clusters all individual values into that class (supplemental Figure 5, A–C). It is important to note that the DPMM model is a nonparametric method of density estimation, which is less efficient than the parametric estimation approach and thus takes longer to reach stationary states.

Due to the structure of the likelihood function under the individual model and the limitation of data available, confidence intervals for individual selfing rates will likely be large unless the posterior mean or median is close to complete selfing ($u_i = 1$). The reason for this is that the most information one can have in our model regarding $u_i$ is the true number of generations until outcrossing, $g_i$. Depending on the magnitude of $g_i$, many possible values of $u_i$ may be consistent with the observed data. For example, if there has been only one generation since an outcrossing event ($g_i = 1$), this observation is consistent with nearly the whole of the interval [0, 1) and the posterior mean for $u_i \mid g_i = 1$ is $1/3$ under a uniform prior for $u_i$.

Another practical issue for our approach is how to choose the appropriate scaling parameter and base distribution for inference under the individual selfing rate model (Figure 6). If the scaling parameter is small, then the expected number of selfing rate classes is small, leading to the peaky distribution of selfing rates. If the scaling parameter is large, then one class contains only one data point, which adds much uncertainty to estimation, leading to biased estimation of the underlying distribution. According to McAuliffe et al. (2004), the nonparametric estimation method of the scaling parameter and base distribution can be incorporated into the MCMC scheme, which may facilitate estimation, or a hierarchical uninformative prior distribution can be placed on the scaling parameter and base distribution to integrate out the uncertainty of estimation on these nuisance parameters.

Although the estimation accuracy is dependent on multiple factors, we expect that this model will have wide applications in many aspects of sequence analysis as it has great flexibility for analyzing multilocus marker data. However, several points need to be addressed with respect to improving the basic model presented here.

First, InStruct assumes loci are unlinked and conditionally independent given model parameters. It is known that pairwise linkage disequilibrium increases with selfing and can extend very far in highly selfed organisms (Nordborg 2000). The flip side of this is that selfing may leave a strong linkage disequilibrium (LD) signal that may be exploited for further refinement of our inference of individual selfing rates. Therefore, linkage disequilibrium should be incorporated into this model as in a new version of STRUCTURE (Falush et al. 2003). One approach might be to include a linkage map for the markers explicitly in the model with predictions from population genetic theory regarding how selfing affects LD among loci conditional on known recombination rates. A second limitation of our model is that it is applicable only to diploid individuals. It would be more practical, particularly for inference in plant populations, to extend the model to polyploid individuals. Two complications on this front are that the number of genotypes at a polyploid locus exponentially increases with the ploidy of the genome and two types of polyploid exist, autopolyploid and allopolyploid, which increase the complexity of calculating genotype frequencies for each locus.

The application of InStruct to data from the partially selfing wild relative of domesticated rice O. rufipogon gives results consistent with geographic sampling and with the program STRUCTURE. Our estimates of the selfing rates for each subpopulation overlap, suggesting an outcrossing rate for wild rice near 50%. Partial outcrossing has several potential evolutionary advantages in regard to either complete outcrossing or complete selfing. For example, advantageous mutations can be fixed in the population at a faster rate as compared to outcrossing. Likewise, when mates are rare (e.g., in an adverse environment), selfing ensures the likely survival of the lineage. Last, partial outcrossing can purge the population of deleterious mutations without inducing a high genetic load. We hope the development of InStruct will allow estimation of selfing rates among natural plant populations, enabling the community to test hypotheses regarding the evolutionary and ecological context for selfing rate evolution.

We are grateful to Susan McCouch and John Kelly for many thoughtful comments on an early version of the manuscript. Two anonymous reviewers greatly helped the exposition of this work. This work is funded by National Science Foundation award 0319553 to Michael D. Purugganan, Susan McCouch, Carlos D. Bustamante, and Rasmus Nielsen.

LITERATURE CITED

Ayres, K. L., and D. J. Balding, 1998 Measuring departures from Hardy-Weinberg: a Markov chain Monte Carlo method for estimating the inbreeding coefficient. Heredity 80(6): 769–777.

Corander, J., P. Waldmann and M. Sillanpaa, 2003 Bayesian analysis of genetic differentiation between populations. Genetics 163: 367–374.

Dawson, K. J., and K. Belkhir, 2001 A Bayesian approach to the identification of panmictic populations and the assignment of individuals. Genet. Res. 78: 59–77.

Enjalbert, J., and J. L. David, 2000 Inferring recent outcrossing rates using multilocus individual heterozygosity: application to evolving wheat populations. Genetics 156: 1973–1982.

Falush, D., M. Stephens and J. K. Pritchard, 2003 Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587.

Francois, O., S. Ancelet and G. Guillot, 2006 Bayesian clustering using hidden Markov random fields in spatial population genetics. Genetics 174: 805–816.

Gelman, A., and D. B. Rubin, 1992 Inference from iterative simulation using multiple sequences (with discussion). Stat. Sci. 7: 457–511.

Haldane, J. B. S., 1924 A mathematical theory of natural and artificial selection. II. The influence of partial self-fertilisation, inbreeding, assortative mating, and selective fertilisation on the composition of Mendelian populations, and on natural selection. Proc. Camb. Philos. Soc. Biol. Sci. 1: 158–163.

Hartl, D., and A. Clark, 1997 Principles of Population Genetics. Sinauer Associates, Sunderland, MA.

Hudson, R. R., 2002 Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337–338.

Huelsenbeck, J. P., S. Jain, S. W. D. Frost and S. L. K. Pond, 2006 A Dirichlet process model for detecting positive selection in protein-coding DNA sequences. Proc. Natl. Acad. Sci. USA 103: 6263–6268.

Jasra, A., C. C. Holmes and D. A. Stephens, 2005 Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Stat. Sci. 20: 50–67.

MacEachern, S. N., and P. Muller, 1998 Estimating mixture of Dirichlet process models. J. Comput. Graph. Stat. 7: 223–238.

McAuliffe, J. D., D. M. Blei and M. I. Jordan, 2004 Nonparametric empirical Bayes for the Dirichlet process mixture model. Technical Report 675. University of California, Berkeley, CA.

Nordborg, M., 2000 Linkage disequilibrium, gene trees and selfing: an ancestral recombination graph with partial self-fertilization. Genetics 154: 923–929.

Nordborg, M., and P. Donnelly, 1997 The coalescent process with selfing. Genetics 146: 1185–1195.

Oka, H. I., 1988 Origin of Cultivated Rice. Japan Scientific Societies Press, Tokyo; Elsevier, Amsterdam/New York.

Pritchard, J. K., M. Stephens and P. Donnelly, 2000 Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Ritland, K., 2002 Extensions of models for the estimation of mating systems using n independent loci. Heredity 88: 221–228.

Rosenberg, N., J. K. Pritchard, J. L. Weber, H. Cann, K. Kidd et al., 2002 Genetic structure of human populations. Science 298: 2381–2385.

Tavare, S., and W. J. Ewens, 1998 The Ewens sampling formula, pp. 230–234 in Encyclopedia of Statistical Sciences Update, Vol. 2. Wiley, New York.

Wahlund, S., 1928 Composition of populations from the perspective of the theory of heredity. Hereditas 11: 65–105.

Wakeley, J., 2000 The effects of subdivision on the genetic divergence of populations and species. Evol. Int. J. Org. Evol. 54: 1092–1101.

Wright, S., 1931 Evolution in Mendelian populations. Genetics 16: 97–159.

Wright, S., 1965 The interpretation of population structure by f-statistics with special regard to systems of mating. Evolution 19: 395–420.

Communicating editor: N. Takahata

APPENDIX: DETAILS OF THE MARKOV CHAIN MONTE CARLO ALGORITHM

Initiation of MCMC: Under the population-specific model, the initial states of the population selfing rate parameters $s_k$ are generated from the uniform distribution $U[0, 1]$. The initial number of generations until an outcrossing event, $g_i$, for each individual is drawn independently by sampling from the geometric distribution with unique uniform random probabilities of success. Under the individual selfing model, the $u_i$'s are first drawn from the Dirichlet process prior and then the $g_i$'s are sampled from the geometric distribution with a probability of success $1 - u_i$. Initiation of Z and Q follows Pritchard et al. (2000).

Updating of MCMC: In the blockwise updating scheme of MCMC, the update of P, Z, and Q follows Pritchard

et al. (2000). The rest of the parameters are updated with the single-component Metropolis–Hastings algorithm as

detailed below:

a. Update S:

i. At the population level, selfing rates are proposed with either the BRS or the AIS. For the BRS, we update the selfing rate vector $S^{(m)}$ by using Metropolis sampling with a K-dimensional uniform proposal distribution centered on the current vector of population selfing rates. That is, a proposed selfing rate $s_k^*$ for population k is drawn from $U(s_k^{(m-1)} - d,\ s_k^{(m-1)} + d)$ with back reflection in $[0, 1]$, where d is a tuning parameter.

For the AIS, we assume three classes of states for the selfing rate parameter: $s_0$, equivalent to complete outcrossing; $s_{(0,1)}$, which denotes the case of partial outcrossing ($s \in (0, 1)$); and $s_1$, which represents complete selfing. Let $p_0$ represent the probability of proposing a jump to state $s_0$ on the basis of the current value of s, $p_{(0,1)}$ be the probability of proposing a jump to state $s_{(0,1)}$ on the basis of the current s, and $p_1$ be the probability of proposing a jump to state $s_1$ on the basis of the current s. In our model, we use the probabilities in the table below to calculate the proposal density $q(s, s^*)$, where the first column in the table shows three starting states for selfing rates and the first row represents three ending states,

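$$q(s, s^*) = p_0\,\delta_0(s^*) + U(0, 1) \times p_{(0,1)}\,(1 - \delta_0(s^*))(1 - \delta_1(s^*)) + p_1\,\delta_1(s^*), \tag{A1}$$

where $\delta_i(j)$ is a Kronecker delta function defined by

$$\delta_i(j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$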

Since the prior on S is uniform and the proposal of the BRS is symmetric, the Metropolis acceptance probability r depends only on the ratio of the likelihood function at the two points, the proposed $s_k^*$ and the current $s_k$:

$$r = \min\left(1, \frac{L(G \mid s_k^*, s_{(-k)}, Q)}{L(G \mid S, Q)}\right).$$

The allele frequencies P and population assignments Z are omitted from the above formula because the relevant likelihood does not depend on them conditional on G and Q.

For the AIS, the Metropolis–Hastings ratio must also be multiplied by a proposal term:

$$r = \min\left(1, \frac{L(G \mid s_k^*, s_{(-k)}, Q)\, q(s_k, s_k^*)}{L(G \mid S, Q)\, q(s_k^*, s_k)}\right).$$

Since we assume individuals are independently sampled and use formula (3), the likelihood is

$$L(G \mid S, Q) = \prod_{i=1}^{N} P(g_i \mid u_i) = \prod_{i=1}^{N} (1 - u_i)\, u_i^{g_i - 1},$$

where $u_i$ is calculated as the expected selfing rate for individual i using Equation 7.

The rationale for needing two samplers is that when the selfing rate value of our MCMC is near the boundaries, one needs to be able to jump in and out of the states for complete selfing ($s = 1$) or complete outcrossing ($s = 0$). As we illustrate below, the AIS is not as efficient as the BRS, so when the MCMC chain is not near $s_k = 0$ or $s_k = 1$, the BRS is recommended.

ii. Updating of individual selfing rates is described in the Modeling selfing section.
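The BRS update of a population selfing rate described in step a can be sketched in a few lines of Python. This is only an illustrative sketch, not the InStruct implementation: all function names are hypothetical, and because Equation 7 is not reproduced in this appendix, the helper `expected_selfing_rates` assumes that $u_i$ is the membership-weighted average $\sum_k q_{ik} s_k$.

```python
import math
import random


def reflect_into_unit_interval(x):
    """Fold x back into [0, 1] by reflecting at whichever boundary it crossed."""
    while x < 0.0 or x > 1.0:
        x = -x if x < 0.0 else 2.0 - x
    return x


def brs_propose(s_current, d, rng=random):
    """Draw s* from U(s - d, s + d), then apply back reflection into [0, 1]."""
    return reflect_into_unit_interval(rng.uniform(s_current - d, s_current + d))


def expected_selfing_rates(S, Q):
    """ASSUMPTION (stand-in for Equation 7): u_i = sum_k q_ik * s_k."""
    return [sum(q_ik * s_k for q_ik, s_k in zip(q_i, S)) for q_i in Q]


def log_likelihood_G(g, u):
    """log L(G | S, Q) = sum_i [log(1 - u_i) + (g_i - 1) * log(u_i)]."""
    total = 0.0
    for g_i, u_i in zip(g, u):
        if u_i >= 1.0:
            return float("-inf")      # an outcrossing event would be impossible
        if u_i <= 0.0:
            if g_i != 1:
                return float("-inf")  # without selfing, g_i must equal 1
            continue                  # this individual contributes log(1) = 0
        total += math.log1p(-u_i) + (g_i - 1) * math.log(u_i)
    return total


def brs_metropolis_step(k, S, Q, g, d, rng=random):
    """One Metropolis update of s_k; returns (new S, accepted flag)."""
    proposal = list(S)
    proposal[k] = brs_propose(S[k], d, rng)
    log_r = (log_likelihood_G(g, expected_selfing_rates(proposal, Q))
             - log_likelihood_G(g, expected_selfing_rates(S, Q)))
    # Accept with probability min(1, exp(log_r)); the symmetric proposal and
    # uniform prior cancel, leaving only the likelihood ratio.
    if log_r >= 0.0 or rng.random() < math.exp(log_r):
        return proposal, True
    return list(S), False
```

Working on the log scale avoids underflow when the product runs over many individuals; the tuning parameter d would in practice be adjusted to give a reasonable acceptance rate.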

b. Update G: We choose an independence sampler to update each component of G. Specifically, the proposed update $g_i^*$ is drawn from a geometric distribution independently for each individual, $g_i^* \sim G(1 - u_i)$, where $u_i$ at the current iteration, $u_i^{(m)}$, is calculated using formula (7). An upper bound of 50 is placed on $g_i^*$ to facilitate the computation, as a value of $g_i > 50$ does not affect the likelihood calculation much compared to the value of 50. Since the proposal distribution we employ is an independence sampler and the likelihood does not depend on the current values of S or Q, the Metropolis–Hastings ratio is thus

$$r = \min\left(1, \frac{L(X \mid g_i^*, g_{(-i)}, Z, P)}{L(X \mid G, Z, P)}\right),$$

where $L(X \mid G, Z, P)$ is the likelihood in Equation 4.
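A minimal Python sketch of the update of G may help make the independence sampler concrete. It is written under stated assumptions rather than taken from InStruct: the function names are hypothetical, the upper bound is applied by simple truncation at 50, and the data likelihood of Equation 4 (not reproduced here) is passed in as a caller-supplied function `log_lik_x` that returns its logarithm.

```python
import math
import random


def draw_generations(u_i, cap=50, rng=random):
    """Draw g* ~ Geometric(success probability 1 - u_i) on {1, 2, ...}, capped."""
    if u_i <= 0.0:
        return 1          # complete outcrossing: the first event is an outcross
    if u_i >= 1.0:
        return cap        # complete selfing: truncate at the upper bound
    x = 1.0 - rng.random()                      # uniform on (0, 1]
    g = int(math.log(x) / math.log(u_i)) + 1    # inverse-CDF geometric draw
    return min(g, cap)


def update_generations(g, u, log_lik_x, cap=50, rng=random):
    """One pass over individuals, accepting each g_i* by the likelihood ratio.

    log_lik_x(g) is a HYPOTHETICAL stand-in for log L(X | G, Z, P) (Equation 4).
    """
    g = list(g)
    for i, u_i in enumerate(u):
        proposal = list(g)
        proposal[i] = draw_generations(u_i, cap, rng)
        log_r = log_lik_x(proposal) - log_lik_x(g)
        # Accept with probability min(1, exp(log_r)).
        if log_r >= 0.0 or rng.random() < math.exp(log_r):
            g = proposal
    return g
```

For example, `update_generations([1, 1, 1], [0.2, 0.5, 0.9], lambda g: 0.0)` uses a flat toy likelihood, so every proposal is accepted and the result is simply a set of capped geometric draws.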

Proposal probabilities for the AIS (rows, starting state of s; columns, probability of proposing each ending state):

s             p_0     p_(0,1)   p_1
s = 0         0.50    0.50      0.0
s ∈ (0, 1)    0.05    0.90      0.05
s = 1         0.0     0.50      0.50

Joint inference of inbreeding coefficients and substructure: Estimating inbreeding coefficients while accounting for population structure is done in a similar manner to inference of selfing rates, except that there is no "G" component and the likelihood of the data is calculated using Wright's formula. This likelihood now depends on the inbreeding coefficients F, the allele frequencies P, and the assignment of alleles Z,

$$L(X \mid P, F, Z) = \prod_{i=1}^{N} \prod_{l=1}^{L} P(x_{il\cdot} \mid F, z_{il\cdot}, p_{\cdot l \cdot}), \tag{A2}$$

where $P(x_{il\cdot} \mid F, z_{il\cdot}, p_{\cdot l \cdot})$ is the genotype frequency of individual i at locus l. If the two alleles for this genotype are from different subpopulations (i.e., $z_{il1} \neq z_{il2}$), we assume the genotype frequency is the product of the population allele frequencies (amounting to random mating among populations). If the population assignment is the same, our probabilities follow directly from basic population genetic theory. The probability of homozygosity for the A allele is a function of the general inbreeding coefficient in the population assigned to individual i at position l ($f_{z_{il\cdot}}$),

$$P(x_{il\cdot} = AA \mid f_{z_{il\cdot}}, z_{il\cdot}, p_{\cdot l \cdot}) = p_A^2 \times (1 - f_{z_{il\cdot}}) + p_A f_{z_{il\cdot}}, \tag{A3}$$

where $p_A$ is the allele frequency of A in its assigned subpopulation. If individual i is heterozygous at locus l (suppose the genotype is Aa at that locus), the genotype probability is

$$P(x_{il\cdot} = Aa \mid f_{z_{il\cdot}}, z_{il\cdot}, p_{\cdot l \cdot}) = 2 p_A p_a (1 - f_{z_{il\cdot}}). \tag{A4}$$
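Equations A3 and A4 translate directly into code. The short Python sketch below covers only the case in which both alleles are assigned to the same subpopulation; the function name and the dictionary of allele frequencies are illustrative assumptions, not part of InStruct.

```python
def genotype_probability(allele_1, allele_2, allele_freqs, f):
    """Genotype probability within one subpopulation with inbreeding coefficient f.

    allele_freqs maps each allele label to its frequency in that subpopulation.
    """
    p_1 = allele_freqs[allele_1]
    if allele_1 == allele_2:
        # Homozygote, Equation A3: p^2 (1 - f) + p f
        return p_1 * p_1 * (1.0 - f) + p_1 * f
    p_2 = allele_freqs[allele_2]
    # Heterozygote, Equation A4: 2 p_A p_a (1 - f)
    return 2.0 * p_1 * p_2 * (1.0 - f)


# Usage: the three genotype probabilities at a biallelic locus sum to one.
freqs = {"A": 0.7, "a": 0.3}
total = sum(genotype_probability(x, y, freqs, 0.5)
            for x, y in [("A", "A"), ("A", "a"), ("a", "a")])
```

Setting f = 0 recovers the Hardy–Weinberg proportions, and f = 1 removes heterozygotes entirely.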

We use the BRS and AIS to propose inbreeding coefficients and then accept or reject each proposal with the Metropolis–Hastings algorithm.

We find that the BRS is very efficient and easily tunable, but has the disadvantage that it can never attain the boundary values of complete outcrossing (0.0) or complete selfing (1.0). The AIS can generate proposal draws for any value in the interval [0, 1], but, as implemented, the rejection rate for AIS is high. One can observe from the convergence graphs (see supplemental Figure 2 at http://www.genetics.org/supplemental/) that the patterns of selfing rate updating are remarkably different between the two methods. This is likely because a fraction of new proposed selfing rates by AIS are randomly sampled from the uniform distribution on [0, 1], which have low a priori probability of explaining the data. The AIS sampler can easily get stuck in one value for several iterations while BRS tends to reject new proposed jumps much less often (interestingly, the convergence efficiency of AIS is similar to that of BRS). The importance of using AIS near the boundaries is illustrated in supplemental Figure 6 at http://www.genetics.org/supplemental/, where we note that the BRS density for zero selfing rate is strongly right shifted as compared to AIS. In actual application of InStruct to real data, the selfing rate proposal density should be chosen according to context and necessity.
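The AIS proposal can likewise be sketched in Python from the table of proposal probabilities and Equation A1. This is a hedged illustration rather than the InStruct code: the names `AIS_PROPOSAL_TABLE`, `state_of`, `ais_propose`, and `ais_density` are hypothetical, and testing for the boundary states by exact floating-point equality is an assumption of the sketch.

```python
import random

# Rows: current state of s. Entries: (p_0, p_(0,1), p_1) from the appendix table.
AIS_PROPOSAL_TABLE = {
    "zero":     (0.50, 0.50, 0.00),
    "interior": (0.05, 0.90, 0.05),
    "one":      (0.00, 0.50, 0.50),
}


def state_of(s):
    """Classify s as complete outcrossing, partial outcrossing, or complete selfing."""
    if s == 0.0:
        return "zero"
    if s == 1.0:
        return "one"
    return "interior"


def ais_propose(s, rng=random):
    """Propose s*: jump to 0, to 1, or to a uniform draw on the open interval (0, 1)."""
    p_0, p_mid, _p_1 = AIS_PROPOSAL_TABLE[state_of(s)]
    x = rng.random()
    if x < p_0:
        return 0.0
    if x < p_0 + p_mid:
        while True:                 # redraw in the rare case of an exact 0.0
            draw = rng.random()
            if draw > 0.0:
                return draw
    return 1.0


def ais_density(s, s_star):
    """Proposal density q(s, s*) of Equation A1 for a move from s to s*."""
    p_0, p_mid, p_1 = AIS_PROPOSAL_TABLE[state_of(s)]
    target = state_of(s_star)
    if target == "zero":
        return p_0
    if target == "one":
        return p_1
    return p_mid * 1.0              # U(0, 1) has density 1 on the interior
```

The zero entries in the table mean that a chain at complete outcrossing is never proposed a direct jump to complete selfing, and vice versa; `ais_density` supplies the proposal term that enters the Metropolis–Hastings ratio for the AIS.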
