
Copyright © 2007 by the Genetics Society of America

DOI: 10.1534/genetics.107.072371

A Markov Chain Monte Carlo Approach for Joint Inference of Population

Structure and Inbreeding Rates From Multilocus Genotype Data

Hong Gao, Scott Williamson and Carlos D. Bustamante1

Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853

Manuscript received February 19, 2007

Accepted for publication April 22, 2007

ABSTRACT

Nonrandom mating induces correlations in allelic states within and among loci that can be exploited to understand the genetic structure of natural populations (Wright 1965). For many species, it is of considerable interest to quantify the contribution of two forms of nonrandom mating to patterns of standing genetic variation: inbreeding (mating among relatives) and population substructure (limited dispersal of gametes). Here, we extend the popular Bayesian clustering approach STRUCTURE (Pritchard et al. 2000) for simultaneous inference of inbreeding or selfing rates and population-of-origin classification using multilocus genetic markers. This is accomplished by eliminating the assumption of Hardy–Weinberg equilibrium within clusters and, instead, calculating expected genotype frequencies on the basis of inbreeding or selfing rates. We demonstrate the need for such an extension by showing that selfing leads to spurious signals of population substructure using the standard STRUCTURE algorithm with a bias toward spurious signals of admixture. We gauge the performance of our method using extensive coalescent simulations and demonstrate that our approach can correct for this bias. We also apply our approach to understanding the population structure of the wild relative of domesticated rice, Oryza rufipogon, an important partially selfing grass species. Using a sample of n = 16 individuals sequenced at 111 random loci, we find strong evidence for existence of two subpopulations, which correlates well with geographic location of sampling, and estimate selfing rates for both groups that are consistent with estimates from experimental data (s ~ 0.48–0.70).

Understanding the mating structure of natural populations is a major goal of population biology. Here we consider the problem of using genotype data from a sample of individuals to distinguish between two forms of nonrandom mating: inbreeding or mating among relatives and population subdivision or limited dispersal of gametes. As Sewall Wright demonstrated, both of these evolutionary forces induce a correlation in allelic state among uniting gametes (i.e., autozygosity) (Wright 1931, 1965). Specifically, writing $\{A_i, A_j\}$ to denote the outcome of inheriting alleles i and j at a particular locus of interest, Wright thought about the problem in terms of the correlation in state:

$$\mathrm{corr}(A_i, A_j) = \frac{\mathrm{Cov}(A_i, A_j)}{\sqrt{\mathrm{Var}(A_i)\,\mathrm{Var}(A_j)}} = \frac{p_{ij} - p_i p_j}{\sqrt{p_i(1 - p_i)\,p_j(1 - p_j)}}.$$

In a randomly mating population, the probability of inheriting a combination of alleles $\{A_i, A_j\}$ is, by definition, given by the product of their marginal probabilities (i.e., $p_{ij} = p_i p_j$). Therefore, under random mating there is no correlation in allelic state among the genes inherited from the two parents.

In a subdivided population with inbreeding, however, the correlation in allelic state, $F_{IT}$, may be nonzero and is given by Wright’s famous equation

$$F_{IT} = 1 - (1 - F_{IS})(1 - F_{ST}), \tag{1}$$

where $F_{IS}$ is equivalent to the correlation in state conditional on subpopulation of origin, and $F_{ST}$ is the correlation in state among randomly sampled alleles within subpopulations. The first is a measure of inbreeding and the second is a measure of population substructure. This equation demonstrates that the relative contributions of the two forces to deviations from random mating are of comparable magnitude and depend critically on the particular values of the parameters.
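As a small numerical illustration of Wright’s relation above, the following sketch evaluates $F_{IT}$ for a few combinations of $F_{IS}$ and $F_{ST}$. The function name `total_fixation_index` and the input values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
def total_fixation_index(f_is: float, f_st: float) -> float:
    """Wright's relation F_IT = 1 - (1 - F_IS)(1 - F_ST)."""
    return 1.0 - (1.0 - f_is) * (1.0 - f_st)


# Hypothetical parameter values, used only to show how the two forces combine.
for f_is, f_st in [(0.0, 0.1), (0.3, 0.0), (0.3, 0.1)]:
    f_it = total_fixation_index(f_is, f_st)
    print(f"F_IS={f_is:.2f}  F_ST={f_st:.2f}  F_IT={f_it:.3f}")
```

When either coefficient is zero, $F_{IT}$ reduces to the other one, so a deficit of heterozygotes alone cannot tell the two forces apart.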

Although this phenomenon is appreciated by many population geneticists, many modern statistical approaches for analyzing genotype data ignore one of these two components. For example, methods for identifying population structure among a sample of individuals assume random mating within subpopulations (Pritchard et al. 2000; Dawson and Belkhir 2001; Corander et al. 2003; Falush et al. 2003). Likewise, methods for estimating self-fertilization rates from genotype data assume individuals are sampled from a single population (Ayres and Balding 1998; Enjalbert and David 2000) or


1Corresponding author: 101 Biotechnology Bldg., Cornell University, Ithaca, NY 14853. E-mail: cdb28@cornell.edu

Genetics 176: 1635–1651 (July 2007)


require labor-intensive approaches such as progeny arrays (direct genotyping of offspring–mother pairs) (Ritland 2002). Therefore, considerable interest exists in the development of an approach that can reliably estimate the degree of population subdivision and inbreeding rates from a sample of genotyped individuals of unknown relatedness.

Our starting point in this study is the widely used program STRUCTURE (Pritchard et al. 2000; Falush et al. 2003), which implements a Bayesian clustering algorithm that simultaneously estimates locus allele frequencies and probabilistically assigns individuals to one of K subpopulations. STRUCTURE works by exploiting a key concept in population genetics: undetected population substructure leads to a genomewide deficit of heterozygotes in a sample as compared to the predictions of Hardy–Weinberg equilibrium (HWE) (Wahlund 1928; Hartl and Clark 1997). Informally, by assigning individuals probabilistically across a fixed number K of subpopulations, the algorithm minimizes deviations from HWE across the whole sample by maximizing within-subpopulation HWE as well as linkage equilibrium among unlinked loci. It is important to note, however, that various genetic and evolutionary forces can also lead to a genomewide deficiency of heterozygotes in a sample. In hermaphroditic populations, for example, partial self-fertilization reduces heterozygosity by a factor $(1 - s)/(1 - (s/2))$, where s is the proportion of progeny produced by self-fertilization (Haldane 1924). Since STRUCTURE assumes that individuals in the sample are either fully outcrossing or haploid, application of the algorithm to partially selfing populations may result in spurious inference of population structure and/or admixture as pointed out in Falush et al. (2003). (It is important to note that under the extreme case of complete self-fertilization, one can sidestep this issue by treating each diploid individual as haploid.)
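To give a sense of the size of this effect, the sketch below evaluates the heterozygosity reduction factor $(1 - s)/(1 - s/2)$ for several selfing rates, together with the equivalent equilibrium inbreeding coefficient $F = s/(2 - s)$, which is simply one minus that factor. The function names and the grid of selfing rates are illustrative assumptions.

```python
def heterozygosity_factor(s: float) -> float:
    """Equilibrium heterozygosity relative to HWE: (1 - s) / (1 - s/2)."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("selfing rate must lie in [0, 1]")
    return (1.0 - s) / (1.0 - s / 2.0)


def equilibrium_inbreeding(s: float) -> float:
    """Equivalent inbreeding coefficient, F = 1 - factor = s / (2 - s)."""
    return 1.0 - heterozygosity_factor(s)


# Illustrative grid of selfing rates (not the values simulated in the article).
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"s={s:.2f}  H/H_HWE={heterozygosity_factor(s):.3f}  "
          f"F={equilibrium_inbreeding(s):.3f}")
```

At s = 0.5 a third of the expected heterozygosity is lost, a deficit that a clustering method assuming HWE within clusters can only explain by invoking structure or admixture.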

To investigate spurious evidence for admixture in the presence of partial self-fertilization, we modified Hudson’s implementation of the standard coalescent algorithm (Hudson 1997) to accommodate partial selfing (Nordborg and Donnelly 1997) and generated a sample of 100 individuals drawn from a population with selfing rate s = 0.5 genotyped at 100 loci. We then ran the standard STRUCTURE 1.0 algorithm assuming two clusters (K = 2) on this data set (see the Simulations section for details). We expect STRUCTURE to assign all individuals to one of the two clusters shown in Figure 1c, since we have simulated data from a single unstructured population. Figure 1, a and b, generated by the Distruct program (Rosenberg et al. 2002), summarizes the posterior assignment probabilities. For this data set drawn from a single population, STRUCTURE classified all individuals as “admixed” with 50% of their genome coming from cluster 1 (green) and 50% coming from cluster 2 (purple). This result holds regardless of whether one considers the correlated (i.e., F model) or uncorrelated allele frequency models and suggests that application of STRUCTURE to data from a partially selfing population may lead to spurious signals of population substructure as initially suggested by Falush et al. (2003).

To quantify this effect further, we repeated the procedure above for 100 data sets simulated for each of six levels of selfing and ran STRUCTURE under both K = 1 and K = 2. To gauge the improvement in fit between the K = 1 and K = 2 models, we compared the difference in average log-likelihood score across retained draws from Markov chain Monte Carlo (MCMC):

$$\Delta \log L = E\,\log L(K = 2, \theta \mid \mathrm{Data}) - E\,\log L(K = 1, \theta \mid \mathrm{Data}). \tag{2}$$

The distribution of $\Delta \log L$ for different values of s is plotted in Figure 1d(A). We note that when s = 0.0, the population is completely outcrossing and the distribution of $\Delta \log L$ provides the null distribution of the test statistic under the hypothesis of no selfing and no population structure. Figure 1d(A) shows that as the selfing rate increases, the distribution of the log-likelihood difference between K = 2 and K = 1 shifts upward, leading to increased rejection of the null hypothesis. When the selfing rate is >0.5, the whole of the distribution of $\Delta \log L$ exceeds the critical value, resulting in a 100% false positive rate. Therefore, we concluded that a modification to the basic model of STRUCTURE is essential when wanting to infer population structure for partially selfing species or those with a recurrent pattern of inbreeding. This article presents and validates such an approach, which we term “InStruct.” When InStruct is applied to the data sets above, it both reduces the false positive rate dramatically (see Figure 1d) and corrects for spurious admixture completely (see Figure 1c).
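The comparison described above can be sketched in a few lines. The code below is a minimal illustration, not the procedure used to produce Figure 1d. It assumes that the retained log-likelihood draws for each model are available as plain lists and that the critical value is the upper 5% empirical quantile of the statistics obtained at s = 0. This section does not state how the critical value was chosen, so that rule, all function names, and all numbers are assumptions.

```python
from statistics import mean


def delta_log_likelihood(loglik_k2, loglik_k1):
    """Difference in average log-likelihood between the K = 2 and K = 1 runs."""
    return mean(loglik_k2) - mean(loglik_k1)


def empirical_critical_value(null_stats, alpha=0.05):
    """Upper-alpha empirical quantile of the null (s = 0) statistics.

    ASSUMPTION: a one-sided test at level alpha; hypothetical rule.
    """
    ordered = sorted(null_stats)
    index = min(len(ordered) - 1, int((1.0 - alpha) * len(ordered)))
    return ordered[index]


def rejection_rate(stats, critical_value):
    """Fraction of replicate data sets whose statistic exceeds the threshold."""
    return sum(x > critical_value for x in stats) / len(stats)


# Made-up statistics for three replicates per selfing level (illustration only).
null_stats = [-1.2, 0.4, 0.9]    # hypothetical values for s = 0
selfing_stats = [5.1, 7.3, 0.2]  # hypothetical values for some s > 0
threshold = empirical_critical_value(null_stats)
print(rejection_rate(selfing_stats, threshold))
```

Because the simulated data contain no population structure, every rejection counted by `rejection_rate` is a false positive.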

The new algorithm we present here extends the STRUCTURE 1.0 framework by incorporating the possibility of inbreeding among individuals in the sample. Much of this article is focused on self-fertilization, but the program has been written generally so as to estimate inbreeding coefficients as well. We consider two general scenarios: a population-specific process by which all individuals within one subpopulation share the same selfing potential (which may reflect a shared environment, for example) as well as a model where selfing probabilities vary among individuals in the whole sample. The latter model is particularly useful for modeling population substructure when some samples have been artificially propagated in the lab (or the field) through enforced selfing. For this scenario, we use a Bayesian density estimation algorithm called the Dirichlet process mixture model (DPMM), which offers great flexibility in estimating the distribution of latent (or unobserved) variables in the probabilistic model. It has recently been used to estimate the distribution of $\omega = d_N/d_S$ along a protein-coding sequence (Huelsenbeck et al. 2006). We quantify the power, robustness, and accuracy of the approach


1636 H. Gao, S. Williamson and C. D. Bustamente


using data simulated under a myriad of scenarios, varying both the degree of selfing and population substructure.

A major motivation for our research was the desire to understand population structure in the wild ancestor of domesticated Asian rice (Oryza rufipogon), in an effort to identify wild germplasm for improvement of this important crop species. Therefore, to illustrate the application of our method and to investigate the role of inbreeding and population substructure in O. rufipogon, we apply InStruct to multilocus data from a sample of 16 individuals collected from various localities across Southeast Asia. We find strong evidence of population subdivision in O. rufipogon, as well as evidence for geographic variation in the rates of self-fertilization. Potentially the most important feature of InStruct is that it allows the identification of variation in mating system in either structured

Figure 1. Population assignments for a single data set of 100 individuals simulated under partial selfing (s = 50%) and no population substructure and analyzed assuming K = 2. (a and b) The Distruct graph from STRUCTURE using (a) the correlated alleles model and (b) the uncorrelated alleles model. (c) The Distruct graph from InStruct of the same data set. (d) Distribution of log-likelihood difference between the K = 2 and the K = 1 model under six levels of population selfing rates as estimated by STRUCTURE using the F model (A)/InStruct (B). Each colored line represents the density of average log-likelihood difference with 100 replicate data sets simulated without population structure and under a specific selfing rate, indicated in the inset.

Inference of Inbreeding and Population Structure1637


or unstructured populations, which in turn opens the

door to using molecular population genetic approaches

to investigate the evolution of mating systems.

THEORY

A myriad of factors influence selfing rates in natural populations, including genetic and developmental factors (such as presence/absence of self-incompatibility loci, flower shape, deleterious mutation rate, etc.) as well as abiotic and biotic environmental factors (such as availability of animal pollinators, local population density, rainfall variation, etc.). Furthermore, plants obtained from intensively managed populations (such as seed centers that propagate varieties of food crops) are often the result of artificial selfing (i.e., purification) and different lines may have been propagated for different numbers of generations via self-fertilization.

Our model is not explicit as to which of these factors (if any) is influencing selfing rate, but rather, we start from the premise that each individual in the sample has a constant but unknown selfing potential that we wish to estimate from the available genetic data. The selfing potential of an individual is defined as the probability that the individual reproduces via self-fertilization (see below). We consider two models for how selfing varies among individuals in the sample: a “population-specific” model and an “individual” model.

Under the population-specific model, the selfing potentials are equal for individuals assigned to the same population and equivalent to the proportion of offspring produced via self-fertilization each generation. This is a reasonable model if local environmental factors are the chief determinants of selfing rate. Under the individual model, we use a form of Bayesian probability density estimation to estimate the selfing rate for each individual in the sample, potentially combining individuals with statistically similar rates and splitting up individuals with statistically different rates. This is a particularly useful model for analyzing genetic material from seed centers where different lines may have been the result of propagation by self-fertilization and the number of generations of propagation differs among lines (and is often unknown).

Parameter notation: We borrow much of our notation from Pritchard et al. (2000). Probability densities are denoted by calligraphy fonts: $\mathcal{U}$ represents the uniform distribution, $\mathcal{G}$ the geometric distribution, and $\mathcal{D}$ the Dirichlet distribution. Uppercase italic letters (e.g., P, G, X) are vectors or matrices of random variables and lowercase italic letters (e.g., p, g, x) represent instantiations of the random variables. Letters in boldface type represent constants (e.g., K, D) and every effort is made to retain the same notation as in the original STRUCTURE articles.

Assume a sample of N individuals genotyped at L loci are to be classified into K populations with ploidy D. (Throughout this article we consider the diploid case D = 2.) We incorporate the possibility of admixture into the model by allowing an individual’s genotype at a locus to be composed of alleles from distinct populations. This is true even for selfing individuals since their genomes can be mosaics of haplotypes recently derived from selfing of an admixed parent.

As in Pritchard et al. (2000), denote marker allele frequencies by $P = \{p_{klj}: k = 1, 2, \ldots, K;\ l = 1, 2, \ldots, L;\ j = 1, 2, \ldots, J_l\}$ such that $p_{klj}$ is the allele frequency of the jth allele type at the lth locus in the kth population, where $J_l$ is the number of distinct alleles at the lth locus. For each individual i, let $X = \{x_{ild}: i = 1, 2, \ldots, N;\ l = 1, 2, \ldots, L;\ d = 1, 2, \ldots, D\}$, where $x_{ild}$ is the allele carried at locus l for the dth copy. In accordance with Pritchard et al. (2000), let $Z = \{z_{ild}: i = 1, 2, \ldots, N;\ l = 1, 2, \ldots, L;\ d = 1, 2, \ldots, D\}$ represent the matrix of $z_{ild}$, the population of origin of the dth allele copy at the lth locus in the ith individual, and let $Q = \{q_{ik}: i = 1, 2, \ldots, N;\ k = 1, 2, \ldots, K\}$ be the matrix of $q_{ik}$, the proportion of the ith individual’s genome originating from population k.

Write $S = \{s_k: k = 1, 2, \ldots, K\}$ to denote the selfing rates for the K subpopulations and $G = \{g_i: i = 1, 2, \ldots, N\}$ to denote the vector containing, for each individual, the number of generations back in time until its lineage experiences an outcrossing event. Furthermore, let $\Theta = \{\theta_i: i = 1, 2, \ldots, N\}$ be the vector of individual selfing potentials, where $\theta_i$ is the probability that individual i reproduces via self-fertilization in a given generation. We assume that this parameter is constant in time for a given individual. Under the population-specific model, we further assume that all individuals from a given population have the same value of $\theta_i$ and that this quantity is equivalent to $s_k$, the proportion of offspring produced via selfing in subpopulation k. To estimate selfing rates for individuals of admixed ancestry, we need to make some mathematical assumptions as to how to combine selfing potentials. The model we employ in InStruct is a weighted average of population-specific selfing rates. In particular, if an individual cannot be classified unambiguously into one of K subpopulations, we model the individual’s selfing potential as the weighted average of the K population selfing rates with weighting constants equal to the $q_{ik}$, the proportion of individual i’s genome that we estimate to originate from population k (see Equation 7 below).

We use a superscript to track parameters within MCMC iterations such that $S_k^{(m)}$ is the value of the selfing rate for population k at iteration m of an MCMC chain. When available, we use conjugate priors since these make the MCMC much more efficient by often enabling Gibbs sampling. These priors can also easily accommodate previous information about population structure and self-fertilization rates.


Modeling selfing: We model the number of generations $g_i$ until an outcrossing event for the ith individual as a geometric random variable with probability of success $1 - \theta_i$, where $\theta_i$ is the selfing rate for individual i:

$$P(g_i = g \mid \theta_i) = \theta_i^{\,g-1}(1 - \theta_i). \tag{3}$$

This amounts to assuming that whether an individual selfs or not is independent from generation to generation and constant in time. Thus $g_i^{(m)} = 1$ indicates that at step m in our MCMC, the ith individual is generated by an outcrossing event in the previous generation, whereas $g_i^{(m)} > 1$ implies individual i was produced via selfing that extends $g_i^{(m)} - 1$ generations into the past.

The reason for conditioning on G is that the likelihood of the data given parameters P, G, and Z does not depend on S or $\Theta$, greatly simplifying our calculations (see Equations 5 and 6). Specifically, we write the likelihood of the genotype data given allele frequencies, population assignments, and number of generations back until an outcrossing event as

$$L(X \mid P, G, Z) = \prod_{i=1}^{N} \prod_{l=1}^{L} P(x_{il\cdot} \mid g_i, z_{il\cdot}, p_{\cdot l \cdot}), \tag{4}$$

where $P(x_{il\cdot} \mid g_i, z_{il\cdot}, p_{\cdot l \cdot})$ is the genotype frequency of individual i at locus l. If the two alleles for this genotype are from different subpopulations (i.e., $z_{il1} \neq z_{il2}$), we assume the genotype frequency is the product of the population allele frequencies (amounting to random mating among populations). If the population assignment is the same, our probabilities follow directly from basic population genetic theory. If individual i is the result of $g_i - 1$ generations of selfing, then the probability of homozygosity for the A allele is

$$P(x_{il\cdot} = AA \mid g_i, z_{il\cdot}, p_{\cdot l \cdot}) = p_A^2 + 2p_A(1 - p_A) \times \sum_{g'=1}^{g_i - 1} 0.5^{g'}, \tag{5}$$

where $p_A$ is the allele frequency of A in its assigned subpopulation. If individual i is heterozygous at locus l (suppose the genotype is Aa at that locus), the genotype probability is

$$P(x_{il\cdot} = Aa \mid g_i, z_{il\cdot}, p_{\cdot l \cdot}) = 2p_A p_a \times 0.5^{(g_i - 1)}. \tag{6}$$
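The link between the geometric model for $g_i$ and the heterozygote probability can be checked numerically. The sketch below implements the geometric probability of Equation 3 and the heterozygote probability of Equation 6, then averages the latter over the former. By the geometric series, $\sum_{g \ge 1} \theta^{g-1}(1 - \theta)\,0.5^{g-1} = (1 - \theta)/(1 - \theta/2)$, which is the heterozygosity reduction factor quoted earlier from Haldane (1924) with $\theta$ in place of s. Function names, the truncation point `max_g`, and the allele frequencies are illustrative assumptions.

```python
def geometric_pmf(g: int, theta: float) -> float:
    """Equation 3: P(g_i = g | theta_i) = theta_i**(g - 1) * (1 - theta_i)."""
    if g < 1:
        raise ValueError("g counts generations and must be at least 1")
    return theta ** (g - 1) * (1.0 - theta)


def heterozygote_prob(p_a: float, p_b: float, g: int) -> float:
    """Equation 6: P(Aa | g_i) = 2 * p_A * p_a * 0.5**(g_i - 1)."""
    return 2.0 * p_a * p_b * 0.5 ** (g - 1)


def expected_heterozygote_prob(p_a, p_b, theta, max_g=200):
    """Average Equation 6 over Equation 3, truncating the sum at max_g."""
    return sum(
        geometric_pmf(g, theta) * heterozygote_prob(p_a, p_b, g)
        for g in range(1, max_g + 1)
    )


# Hypothetical allele frequencies and selfing potential.
theta = 0.5
hwe_heterozygosity = 2.0 * 0.3 * 0.7
ratio = expected_heterozygote_prob(0.3, 0.7, theta) / hwe_heterozygosity
print(ratio, (1.0 - theta) / (1.0 - theta / 2.0))
```

With $g_i = 1$ the heterozygote probability reduces to the Hardy–Weinberg value $2p_A p_a$, and each additional generation of selfing halves it.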

In modeling inbreeding more generally, we can replace the above equations by their usual analogs in Wright’s formulation conditional on the inbreeding coefficient F (see appendix). For simplicity, we remain for the rest of this article focused on selfing, but note that InStruct has an option for modeling inbreeding as well. Next we turn to models for how selfing rates vary among individuals and populations.

Population-specific model: For the population-specific model, we define the selfing potential $\theta_i$ conditional on the population assignments of individual i as

$$\theta_i = \sum_{k=1}^{K} P(\text{individual } i \text{ is the product of selfing in the previous generation} \mid \text{it is from population } k) \times P(\text{individual } i \text{ comes from population } k).$$

If we assume that the probability that individual i comes from population k equals the proportion of individual i’s genome that originates from population k that has selfing rate $s_k$, we obtain

$$\theta_i = \sum_{k=1}^{K} s_k q_{ik}. \tag{7}$$
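Equation 7 is a simple weighted average, which the following sketch makes concrete. The function name `selfing_potential`, the input checks, and the example values are illustrative assumptions rather than part of InStruct.

```python
def selfing_potential(q_i, s):
    """Equation 7: theta_i = sum over k of s_k * q_ik.

    q_i : ancestry proportions of individual i across the K populations
    s   : selfing rates of the K populations
    """
    if len(q_i) != len(s):
        raise ValueError("q_i and s must both have length K")
    if abs(sum(q_i) - 1.0) > 1e-9:
        raise ValueError("ancestry proportions must sum to 1")
    return sum(s_k * q_ik for s_k, q_ik in zip(s, q_i))


# Hypothetical example with K = 2 populations.
s = [0.2, 0.8]
print(selfing_potential([1.0, 0.0], s))    # unadmixed individual from population 1
print(selfing_potential([0.75, 0.25], s))  # admixed individual
```

An unadmixed individual simply inherits the selfing rate of its population, while an admixed individual receives a value between the population rates.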

Individual variation in selfing model: A clear limitation of the population-specific model is that it does not allow for selfing rate variation among individuals within subpopulations, which may be an important feature of the data. To relax this assumption, we employ the DPMM. The rationale behind this approach is not biological, but statistical. Instead of assuming a distribution for selfing rates among individuals and estimating parameters of the model (e.g., beta distribution, logit, probit, etc.), we use a Bayesian version of nonparametric density estimation to “learn” the selfing rates from the data. Informally, it is equivalent to smoothing a histogram of individually estimated selfing rates and taking our uncertainty in the smoothing function into account. Smoothing occurs via collapsing and expanding sets of individuals that have been assigned the same selfing rate (a class) and updating the selfing rate assigned to each class. The parameter governing the smoothing function, $\alpha$, works mathematically by influencing the prior distribution on the number of classes.

In essence, the DPMM generates partitions of selfing rates where within a partition all individuals have the same selfing rate. Formally, we think of the Dirichlet process mixture model as a finite mixture model where the number of mixture components is a random variable. We treat each individual’s selfing rate as arising from the same distribution family with different parameters for each component. The joint prior distribution of all selfing rates in the DPMM corresponds to a generalized Polya urn scheme. The hierarchical structure of the Dirichlet process mixture model is

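$$\begin{aligned} F &\sim DP(\alpha, F_0(\theta)) \\ \theta_i \mid F &\sim F(\theta) \\ g_i \mid \theta_i &\sim \mathcal{G}(1 - \theta_i), \end{aligned}$$

where $DP(\alpha, F_0(\theta))$ is the Dirichlet process with base distribution $F_0$ and scaling parameter $\alpha > 0$, and F is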
