
Copyright © 2009 by the Genetics Society of America

DOI: 10.1534/genetics.108.098095

Inference of Historical Changes in Migration Rate From the

Lengths of Migrant Tracts

John E. Pool*,1 and Rasmus Nielsen*,†

*Department of Integrative Biology and †Department of Statistics, University of California, Berkeley, California 94720

Manuscript received October 27, 2008

Accepted for publication December 13, 2008

ABSTRACT

After migrant chromosomes enter a population, they are progressively sliced into smaller pieces by

recombination. Therefore, the length distribution of ‘‘migrant tracts’’ (chromosome segments with recent

migrant ancestry) contains information about historical patterns of migration. Here we introduce a

theoretical framework describing the migrant tract length distribution and propose a likelihood inference

method to test demographic hypotheses and estimate parameters related to a historical change in migration

rate. Applying this method to data from the hybridizing subspecies Mus musculus domesticus and M. m.

musculus, we find evidence for an increase in the rate of hybridization. Our findings could indicate an

evolutionary trajectory toward fusion rather than speciation in these taxa.

An accurate understanding of population history is essential for such diverse applications as the search for recent signatures of positive selection in population genetic data (e.g., Jensen et al. 2005), the study of admixed human populations to identify disease-associated genetic variants (e.g., Montana and Pritchard 2004; Patterson et al. 2004), and the definition of management units in conservation (Pearse and Crandall 2004). Patterns of genetic variation contain information about past changes in population size (e.g., Cornuet and Luikart 1996; Marth et al. 2004), the timing of population splitting events (e.g., Nielsen and Wakeley 2001), and levels of migration between populations (e.g., Beerli and Felsenstein 2001).

Since the advent of molecular markers, researchers have sought to gauge the genetic differentiation of populations and to draw conclusions about the level of migration between them. Wright’s $F_{ST}$ (Wright 1952) has served as the classic metric of population differentiation, and, under ideal conditions, the population migration rate can be estimated by $N_e m = (1 - F_{ST})/(4F_{ST})$, where $N_e$ is the effective population size, m is the per-generation probability of being a migrant, and $N_e m$ is thus equal to the number of migrants exchanged each generation. However, this relationship relies on several assumptions that may not be valid for most natural populations (reviewed in Whitlock and McCauley 1999), including that of a constant rate of migration. A given $F_{ST}$ value between two populations could be produced by a constant level of migration over a long period of time, or by genetic drift following a relatively recent split between the two populations, or by recent admixture between historically isolated populations, or by any

number of more complex scenarios. The isolation-

migration (IM) inference framework (e.g., Nielsen and

Wakeley 2001; Hey 2005) offers a way to differentiate

ongoing migration between populations from lineage

sorting in isolated populations, while estimating relevant

demographic parameters.

As in the case of IM, most population genetic methods

that estimate demographic parameters assume that all

sites or markers under study are either completely linked

(no recombination) or completely unlinked (free re-

combination) (although see Becquet and Przeworski

2007). And correspondingly, most population genetic

data have been collected with these criteria in mind.

Assuming either full linkage among sites or else in-

dependence among loci can greatly simplify the task of

modeling the histories of molecular markers. However,

the bulk of the genome in most organisms consists of

DNA that is subject to recombination, and, furthermore,

the pattern of recombination events within a sample of

chromosomes may hold valuable information concern-

ing population history. For example, we know that

haplotype statistics (Depaulis et al. 2003) and linkage

disequilibrium (Wall et al. 2002) across short loci are

quite sensitive to the effects of population bottlenecks.

The recent availability of genome-scale polymorphism

data should facilitate investigation of longer-range

linkage patterns, which may shed new light on the

recent histories of populations.

Patterns of diversity at partially linked markers may be

especially informative concerning the historical pattern

of migration between populations. Once a migrant

chromosome enters a new population, recombination

will break it down into progressively shorter segments.

1 Corresponding author: Department of Integrative Biology, University of California, 3060 Valley Life Sciences Bldg. No. 3140, Berkeley, CA 94720-3140. E-mail: jpool@berkeley.edu

Genetics 181: 711–719 (February 2009)


The lengths of these ‘‘migrant tracts’’, or admixture ‘‘chunks’’ (Falush et al. 2003), therefore contain information about how long ago migration occurred. This logic has been utilized to estimate the timing of recent admixture events (e.g., Patterson et al. 2004; Hoggart et al. 2004; Koopman et al. 2007), but its applicability

should extend beyond such cases. We suggest that

migrant tract lengths are expected to have a certain

equilibrium distribution under a constant migration

rate model. An excess of long migrant tracts would

indicate a recent increase in migration rate, while the

opposite pattern would suggest recently reduced gene

flow. We use theoretical predictions and simulations to

explore the migrant tract length distribution under a

variety of demographic scenarios, and we assess the

potential of this approach for inferring demographic

parameters related to migration rate changes.

MODELS AND METHODS

Constant migration rate: A large set of different pop-

ulation genetic models converges to the same coales-

cence process as the population size becomes large

($N \to \infty$; Kingman 1982a,b). In two-island models (Wright 1931), an ancestral process arises (e.g., Hudson 1983), which can be described by a Markov pure jump process $\{X(t), t \ge 0\}$ with state space on $\{0, \ldots, n_1\} \times \{0, \ldots, n_2\} \setminus (0, 0)$, initial state $(n_1, n_2)$, absorbing states (0, 1), (1, 0), and transition rates

$$
\begin{aligned}
q((i,j) \to (i-1,j)) &= \binom{i}{2} && \text{if } i \ge 2 \\
q((i,j) \to (i,j-1)) &= \binom{j}{2}\frac{N_1}{N_2} && \text{if } j \ge 2 \\
q((i,j) \to (i-1,j+1)) &= N_1 m_{21}\, i && \text{if } i \ge 1 \\
q((i,j) \to (i+1,j-1)) &= N_2 m_{12}\, j && \text{if } j \ge 1,
\end{aligned}
\tag{1}
$$

where $n_1$ and $n_2$ are the sample sizes from populations 1 and 2, respectively, and $N_1$ and $N_2$ are the population sizes. Migration occurs from population 2 to 1, and from 1 to 2, at rates $m_{21}$ and $m_{12}$, respectively. Time is measured in units of $N_1$ generations, and $N_j m_{ij}$ can be interpreted as the proportion of individuals in population j that are replaced with individuals from population i in each generation.

Consider the ancestry of a single lineage from popu-

lation 1. The waiting time in number of generations

until the last migration event for this lineage is exponentially distributed with mean 1/m (letting $m = m_{21}$ here and in the following to simplify the notation). We now introduce recombination and measure distances in the genome as genetic distances. By using genetic distances, we may assume that recombination in each generation occurs according to a Poisson process with rate 1 along the chromosome. We assume that migrant tracts do not recombine together, we disallow back-migration events (i.e., assume $m_{12} = 0$), and we ignore the effect of the ends of the chromosome (but later we evaluate violations of these assumptions). Then, after t generations, the distribution of tract lengths follows an exponential distribution with mean 1/t:

$$f(x; t) = t e^{-tx}. \tag{2}$$

Because we can reliably infer migrant tracts only over a certain length, we are interested in the distribution of tracts and the expected proportion of a chromosome in tracts larger than a certain threshold, C. The proportion of a migrant chromosome from time t that is in tracts of a size > C, $p_C$, can be found from the convolution of two independent and identically distributed exponential random variables with parameter t:

$$E[p_C \mid t] = 1 - \int_0^C t e^{-ty}\left(1 - e^{-t(C - y)}\right) dy = e^{-tC}(1 + Ct). \tag{3}$$

These two variables represent, respectively, the distance

to the left and right on the chromosome from the point

of inspection to the nearest recombination event.
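As a quick numerical check of Equation 3, the sketch below compares the closed form with a seeded Monte Carlo estimate. It assumes lengths are expressed in Morgans, so that the distances from an inspected point to the nearest breakpoint on each side are independent exponential variables with rate t. The function names are illustrative and are not taken from the authors' software.

```python
import math
import random


def prop_above_threshold(t, C):
    """Equation 3: expected proportion of a migrant chromosome of age t
    (generations) that lies in tracts longer than C (Morgans)."""
    return math.exp(-t * C) * (1.0 + C * t)


def simulate_prop_above_threshold(t, C, n=100_000, seed=1):
    """Monte Carlo counterpart: the tract covering a random point has length
    left + right, each an independent Exp(t) distance to a breakpoint."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        length = rng.expovariate(t) + rng.expovariate(t)
        if length > C:
            hits += 1
    return hits / n
```

For example, with t = 100 generations and C = 0.005 (0.5 cM), the two functions should agree to within Monte Carlo error.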

Integrating over t, we find

$$E[p_C] = \int_0^\infty m e^{-mt} e^{-tC}(1 + Ct)\,dt = \frac{m(2C + m)}{(C + m)^2}. \tag{4}$$

The expected number of fragments in the population of a migrant chromosome of length L is

$$E[k(t)] = 1 + Lt \tag{5}$$

after t generations; i.e., the contribution of migrant tracts from generation t to the population is proportional to $m e^{-mt}(1 + Lt)$. Again ignoring recombination among migrant tracts, the density of tract lengths will be formed as a mixture distribution of tracts from different times,

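$$f(x) = \frac{\int_0^\infty t e^{-tx}(1 + Lt)\, m e^{-mt}\,dt}{\int_0^\infty (1 + Lt)\, m e^{-mt}\,dt} = \frac{m^2(2L + m + x)}{(L + m)(m + x)^3}. \tag{6}$$

The conditional tract length distribution of tracts of a length larger than C is then

$$f(x \mid x > C) = \frac{f(x)}{\int_C^\infty f(x)\,dx} = \frac{(C + m)^2(2L + m + x)}{(C + L + m)(m + x)^3}. \tag{7}$$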
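The constant-rate densities in Equations 6 and 7 are simple enough to evaluate directly. The following is a minimal sketch, assuming x, L, and C are all in Morgans and m is a per-generation rate. The survival function is not given in the text; it is derived here by integrating Equation 7 from y to infinity and is included only as a convenient check. All function names are hypothetical.

```python
def tract_density(x, m, L):
    """Equation 6: unconditional density of tract length x."""
    return m * m * (2 * L + m + x) / ((L + m) * (m + x) ** 3)


def tract_density_above(x, m, L, C):
    """Equation 7: density of tract length x, given that x exceeds C."""
    if x <= C:
        return 0.0
    return (C + m) ** 2 * (2 * L + m + x) / ((C + L + m) * (m + x) ** 3)


def tract_survival_above(y, m, L, C):
    """P(X > y | X > C), from integrating Equation 7 over (y, infinity).
    This closed form is derived for this sketch, not quoted from the text."""
    if y <= C:
        return 1.0
    return (C + m) ** 2 * (L + m + y) / ((C + L + m) * (m + y) ** 2)
```

Because Equation 7 is Equation 6 rescaled, the ratio of the two densities is the same constant for every x > C, which gives a cheap sanity check on any implementation.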

These expressions do allow for genetic drift. However, they assume that recombination events between descendants of the same or different migration events contribute to the breakdown of chromosomes into smaller distinguishable tracts. In practice, we cannot distinguish between nonrecombinants and recombinants between copies of the same allele. The approximations we derive here are, therefore, expected to break down when t becomes so large compared to $N_1$ that migrant alleles may have drifted to appreciably high

allele frequencies, thereby allowing for recombination

between migrant tracts. However, this is not a funda-

mental problem as we can infer only relatively large

tracts that, with high probability, are descendants of

recent migrants. If C is sufficiently large, it is highly

probable that only fragments for which t is small have

been sampled. The chance that a migrant allele of size

> C has drifted to high frequencies is small if $C \gg 1/N$

(since recombination will break down tracts below this

threshold before drift can substantially elevate them in

frequency). Problems identifying recombinants be-

tween migrant alleles are, therefore, avoidable if C is

sufficiently large. For the same reason, for large C,

inferences based on Equation 7 should be relatively

robust to violations of the assumption of no back

migration; i.e., $m_{12} = 0$.

Changes in the migration rate: We now extend these results to the case where there has been a discrete change in the rate of migration. Again, we consider only migration into population 1, and assume that the current migration rate is $m_1$ and that, before T generations ago, it was $m_2$. We then have

$$
\begin{aligned}
E_2[p_C] &= \int_0^T m_1 e^{-m_1 t} e^{-tC}(1 + Ct)\,dt + e^{-m_1 T}\int_0^\infty m_2 e^{-m_2 t} e^{-(t + T)C}\bigl(1 + C(t + T)\bigr)\,dt \\
&= \frac{m_1(2C + m_1)}{(C + m_1)^2} - \frac{C^2 e^{-(C + m_1)T}(m_1 - m_2)\bigl(2C + m_1 + m_2 + (C + m_1)(C + m_2)T\bigr)}{(C + m_1)^2 (C + m_2)^2}.
\end{aligned}
\tag{8}
$$

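Likewise, setting

$$
f_2(x) = \frac{\int_0^T t e^{-tx}(1 + Lt)\, m_1 e^{-m_1 t}\,dt + e^{-m_1 T}\int_T^\infty t e^{-tx}(1 + Lt)\, m_2 e^{-m_2(t - T)}\,dt}{\int_0^T (1 + Lt)\, m_1 e^{-m_1 t}\,dt + e^{-m_1 T}\int_T^\infty (1 + Lt)\, m_2 e^{-m_2(t - T)}\,dt}
\tag{9}
$$

and conditioning as in Equation 7, we find

$$
f_2(x \mid x > C) = \frac{f_2(x)}{\int_C^\infty f_2(x)\,dx} = e^{T(C - x)}(C + m_1)^2(C + m_2)^2 \times \frac{a - b}{c},
\tag{10}
$$

where

$$
a = \frac{m_2\Bigl(m_2 + x + T(m_2 + x)^2 + L\bigl(2 + T(m_2 + x)(2 + T(m_2 + x))\bigr)\Bigr)}{(m_2 + x)^3}
$$

$$
b = \frac{m_1\Bigl((m_1 + x)\bigl(1 - e^{T(m_1 + x)} + T(m_1 + x)\bigr) + L\bigl(2 - 2e^{T(m_1 + x)} + T(m_1 + x)(2 + T(m_1 + x))\bigr)\Bigr)}{(m_1 + x)^3}
$$

$$
\begin{aligned}
c ={}& e^{(C + m_1)T} m_1 (C + L + m_1)(C + m_2)^2 - (m_1 - m_2) \\
&\times \bigl(-L m_1 m_2 + C^3(1 + LT) + C m_1 m_2 (1 + LT) + C^2(L + m_1 + m_2 + L(m_1 + m_2)T)\bigr).
\end{aligned}
$$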

Inference: We wish to estimate the parameters, m1,

m2, and T from an observed tract length distribution. As

only large tracts can be easily identified, we have to base

inferences on Equations 8 and 10 and not on Equation

9. We define a composite-likelihood function by taking

the product of Equation 10 among all tracts in the data

above a prespecified threshold (C). The reason why we

consider this a composite-likelihood function and not a

true-likelihood function is that the same tract can be

counted twice. However, for real data with C large, this

will rarely happen and the estimation function is essen-

tially a true-likelihood function.
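A composite log-likelihood of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: lengths are in Morgans, and Equation 10 is evaluated in an algebraically equivalent form (the numerator of Equation 9 divided by its integral from C to infinity), which avoids the large exponentials $e^{T(m_1 + x)}$ that appear in the printed expression. The helper names are hypothetical.

```python
import math


def _numerator(x, m1, m2, T, L):
    """Numerator of Equation 9: t*exp(-t*x) averaged over tract ages t,
    weighted by (1 + L*t) and the two-epoch migration-time density."""
    al, be = m1 + x, m2 + x
    E = math.exp(-al * T)
    recent = m1 * (al * (1 - E * (1 + al * T))
                   + L * (2 - E * (2 + al * T * (2 + al * T)))) / al ** 3
    old = E * m2 * (be + be ** 2 * T
                    + L * (2 + be * T * (2 + be * T))) / be ** 3
    return recent + old


def _normalizer(C, m1, m2, T, L):
    """Integral of the numerator over x from C to infinity."""
    a, b = C + m1, C + m2
    E = math.exp(-a * T)
    recent = m1 * ((a + L) - E * (a + L + L * T * a)) / a ** 2
    old = E * m2 * (b + L + L * T * b) / b ** 2
    return recent + old


def two_epoch_density_above(x, m1, m2, T, L, C):
    """Equation 10: density of tract length x given x > C."""
    if x <= C:
        return 0.0
    return _numerator(x, m1, m2, T, L) / _normalizer(C, m1, m2, T, L)


def composite_log_likelihood(tracts, m1, m2, T, L, C):
    """Sum of log densities over all observed tracts longer than C."""
    return sum(math.log(two_epoch_density_above(x, m1, m2, T, L, C))
               for x in tracts if x > C)
```

Setting m1 equal to m2 should reproduce the constant-rate density of Equation 7 for any T, which is a useful regression check. The sketch does not guard against both rates being zero or against densities that underflow to zero for very long tracts.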

Equation 10 contains only very little information about the overall amount of population subdivision, because we look only at the relative abundance of tracts with length greater than C. However, much of the information regarding the overall level of population subdivision is captured by our estimate of $p_C$ (Equation 8). We therefore do a constrained optimization of the likelihood function subject to the constraint

$$E_2[p_C] = \hat{p}_C, \tag{11}$$

where $\hat{p}_C$ is the observed proportion of the genome in tracts larger than C. Specifically, we rearrange Equation 8 to express T as a function of C, $m_1$, $m_2$, and $E_2[p_C]$, and we then substitute $\hat{p}_C$ for $E_2[p_C]$. We then perform a two-dimensional optimization for $m_1$ and $m_2$ while constraining T to take on the value given by the aforementioned equation. This approach reduces the number of parameters from Equation 10 to be estimated (from three to two) and adds information concerning the total proportion of migrant DNA observed (from $\hat{p}_C$). Constrained models with one of the two migration rates set to zero are evaluated similarly, via a one-dimensional optimization of the other migration rate. For the constant migration rate model, m can be estimated simply by setting $m_1 = m_2$ in Equation 8, and thus using $\hat{p}_C$ to solve for m.
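The constraint step can be illustrated with a short sketch. Two points are assumptions of this example rather than statements about the authors' code. First, with $m_1 = m_2 = m$, Equation 8 reduces to Equation 4, which can be written as $1 - C^2/(C + m)^2$ and inverted in closed form. Second, instead of rearranging Equation 8 analytically for T, the sketch finds T numerically by bisection, relying on the fact that $E_2[p_C]$ moves monotonically from its value under $m_2$ (at T = 0) to its value under $m_1$ (as T grows). Function names and the `T_max` bound are hypothetical.

```python
import math


def expected_prop_two_epoch(m1, m2, T, C):
    """Equation 8: expected proportion of the genome in tracts longer than C."""
    a, b = C + m1, C + m2
    base = m1 * (2 * C + m1) / a ** 2
    correction = (C ** 2 * math.exp(-a * T) * (m1 - m2)
                  * (2 * C + m1 + m2 + a * b * T)) / (a ** 2 * b ** 2)
    return base - correction


def constant_rate_from_prop(p_hat, C):
    """Constant-rate estimate of m: invert p = 1 - C^2 / (C + m)^2."""
    return C * (1.0 / math.sqrt(1.0 - p_hat) - 1.0)


def change_time_from_prop(p_hat, m1, m2, C, T_max=1e6):
    """Solve E_2[p_C] = p_hat for T by bisection on [0, T_max].
    Returns None when p_hat is not attainable for these two rates."""
    def gap(T):
        return expected_prop_two_epoch(m1, m2, T, C) - p_hat

    lo, hi = 0.0, T_max
    g_lo, g_hi = gap(lo), gap(hi)
    if (g_lo > 0) == (g_hi > 0):
        return None
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (gap(mid) > 0) == (g_lo > 0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In a full analysis, `change_time_from_prop` would be called inside the two-dimensional search over $m_1$ and $m_2$, so that each candidate pair of rates is scored at the T implied by the observed $\hat{p}_C$.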

Comparison of likelihood scores from different models allows the testing of demographic hypotheses. Test 1 compares the maximum-log-likelihood score from the migration rate change model (with $m_1$ and $m_2$ allowed to vary) against the null hypothesis of a single, constant migration rate (with m inferred from $\hat{p}_C$). Test 2 compares the maximum-log-likelihood score of the migration rate change model against a model where either (A) $m_1$ is constrained to be zero or (B) $m_2$ is constrained to be zero. Generally, test 2A is performed when $m_1 < m_2$, and test 2B is performed if $m_1 > m_2$.

Because the distributions of likelihood ratios are not well modeled by standard asymptotic theory for any of these tests, critical values are obtained using data simulated under the null hypothesis. For computational reasons, we obtain critical values using $N_e m = 0.1$ for test 1, the true values of $m_2$ and T for test 2A, and true $m_1$ and T for test 2B (rather than using the estimated parameter values for each simulated replicate). In the analysis of empirical data, we use the estimated null model parameter values instead.
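Obtaining critical values from null simulations amounts to taking an empirical quantile of the simulated likelihood-ratio statistics. The sketch below shows one common way to do this. The quantile convention and the "+1" correction in the p-value are assumptions of this example; the text does not specify which conventions were used.

```python
import math


def critical_value(null_stats, alpha=0.05):
    """Empirical (1 - alpha) quantile of likelihood-ratio statistics
    obtained from data sets simulated under the null model."""
    ordered = sorted(null_stats)
    k = math.ceil((1.0 - alpha) * len(ordered)) - 1
    k = min(max(k, 0), len(ordered) - 1)
    return ordered[k]


def empirical_p_value(observed, null_stats):
    """Share of null replicates at least as large as the observed statistic,
    with a +1 correction so the estimate is never exactly zero."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)
```

An observed likelihood ratio would then be declared significant at level alpha when it exceeds `critical_value(null_stats, alpha)`.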

Simulation: A forward simulation program was writ-

ten to allow the generation of migrant tract data. This

program simulates each chromosome present in two

populations and models the processes of genetic drift,

migration, and recombination under a Wright–Fisher

model (Fisher 1930; Wright 1931). It does not

generate polymorphism data; instead it directly mon-

itors migrant tract status along chromosomes. When an

individual migrates, all previously nonmigrant chromo-

some sections become migrant tracts, and any previous

migrant tracts become nonmigrant. Tracts are ‘‘forgot-

ten’’ when recombination breaks them down to a size

below the threshold length. The program initializes

with no migrant tracts present, but goes through a

‘‘burn-in’’ period with migration at rate m2. For the

analyses shown here, using a threshold tract length of C = 0.5 cM, the burn-in time was 2000 generations (results and theory indicated this was more than enough time to reach an equilibrium migrant tract length distribution) and $N_e$ was 10,000. At the end of the

burn-in, the migration rate switches to m1 and the

program records all migrant tracts present in each

population at a series of time points (T) after this

change. An extension to this program allows migrant

tracts to be sampled from a specific number of individ-

uals. In testing the performance of the likelihood

method, we simulated ‘‘genomes’’ containing 35 chro-

mosomes, each 100 cM in length (3500 cM is close to

the genetic map size of humans and many other mam-

mals; Kong et al. 2002), and we sampled 100 haploid

individuals from one population.

Application to empirical data: The likelihood method was applied to genomewide single-nucleotide polymorphism (SNP) data from two hybridizing subspecies of the house mouse, Mus musculus domesticus and M. m. musculus. These data were produced by the Wellcome Trust Center for Human Genetics and are available at http://www.well.ox.ac.uk/mouse/INBREDS/. The strains examined here consist of seven from M. m. domesticus and eight from M. m. musculus, with varying geographic origins (see Harr 2006 for a summary). The data come from wild-derived, inbred mouse strains and are effectively haploid. The few apparently heterozygous sites were recoded as missing data, and invariant SNPs were removed. Since the X chromosome is expected to have a different history, all of the 9935 SNPs analyzed here were autosomal. The vast majority of

these SNPs have inferred genetic map positions

(Jensen-Seaman et al. 2004), and all analyses were done

in terms of genetic distance, rather than physical

position. These SNPs had been ascertained in labora-

tory lines of mixed origin and could be biased in terms

of diversity levels and allele frequencies (Boursot and

Belkhir 2006), but we do not expect a particular bias

for the inference and analysis of migrant tracts.

In general, our likelihood inference method allows

the user to decide how migrant tracts should be defined.

The sample sizes of the mouse SNP data set seemed too

small for published methods for identifying ancestry

along recombining chromosomes (e.g., Falush et al.

2003). However, the task of tract identification is

simplified by the high level of genetic differentiation

between the two subspecies, which diverged perhaps

1 million generations ago and show very high levels of

genetic differentiation (Baines and Harr 2007; Salcedo et al. 2007). We were therefore able to use a very simple set of criteria for defining migrant tracts in these data. Given the small sample sizes, an individual’s SNP allele was deemed to provide evidence for a migrant tract only if it was otherwise absent from the individual’s subspecies, but present in the other subspecies (we call this a ‘‘positive SNP’’). If an individual’s SNP allele is otherwise present in both subspecies, this is a ‘‘neutral SNP’’ neither

favoring nor opposing migrant tract status. And if an

individual’s SNP allele is not present in the other sub-

species, it is taken as evidence against migrant tract

status (a ‘‘negative SNP’’). Migrant tracts consisted of

two or more positive SNPs with no negative SNPs be-

tween them. The minimum tract length was considered

to be the genetic distance spanning only the positive

SNPs at each end of the tract. The maximum tract

length included all sites up to the first negative SNPs

flanking the tract.
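The tract-calling criteria above can be expressed as a small procedure. This is a minimal sketch, not the authors' code: the data layout (one label per SNP for a single individual, with sorted genetic map positions) is assumed, missing data are not handled, and where no flanking negative SNP exists the first or last marker position is used as the bound, which is an assumption of this example.

```python
def classify_snp(allele, own_others, other_subspecies):
    """Label one individual's allele at one SNP.
    own_others: alleles seen in the other individuals of the same subspecies.
    other_subspecies: alleles seen in the other subspecies."""
    if allele not in other_subspecies:
        return "negative"
    if allele in own_others:
        return "neutral"
    return "positive"


def call_tracts(positions, labels):
    """Return (min_start, min_end, max_start, max_end) for each tract:
    two or more positive SNPs with no negative SNP between them."""
    tracts = []
    n = len(positions)
    i = 0
    while i < n:
        if labels[i] != "positive":
            i += 1
            continue
        first = last = i
        count = 1
        j = i + 1
        while j < n and labels[j] != "negative":
            if labels[j] == "positive":
                last = j
                count += 1
            j += 1
        if count >= 2:
            left = first - 1
            while left >= 0 and labels[left] != "negative":
                left -= 1
            # Maximum extent runs to the first flanking negative SNPs.
            max_start = positions[left] if left >= 0 else positions[0]
            max_end = positions[j] if j < n else positions[-1]
            tracts.append((positions[first], positions[last], max_start, max_end))
        i = j
    return tracts
```

The minimum length of a tract is then `min_end - min_start` and the maximum length is `max_end - max_start`.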

Given the minimum and maximum length of a mi-

grant tract, we were interested in estimating how far

beyond the positive SNPs this tract is expected to ex-

tend. To do this we assume that the length of a tract is

exponentially distributed with parameter $\lambda$. If marker $M_i$ is in a tract, the probability that the next marker, $M_{i+1}$, is also in the same tract is $e^{-\lambda D_{i,i+1}}$, where $D_{i,i+1}$ is the genetic distance between markers $M_i$ and $M_{i+1}$. A likelihood function for $\lambda$ is then given by

$$
L(\lambda) = \prod_{j:\, M_j \in Z,\, M_{j+1} \in Z} e^{-\lambda D_{j,j+1}} \prod_{j:\, M_j \in Z,\, M_{j+1} \notin Z} \left(1 - e^{-\lambda D_{j,j+1}}\right),
\tag{12}
$$

where Z is the set of all markers in a migration tract. By entering the lengths of all SNP intervals ($D_{i,i+1}$) where we remain in a migrant tract or leave one and then maximizing this function, we obtain a maximum-likelihood estimate of $\lambda$. Now the expected distance to add to a tract on the right side is

$$
E[d_{j,j+1} \mid M_j \in Z, M_{j+1} \notin Z] = \int_0^{D_{j,j+1}} \frac{t \lambda e^{-\lambda t}}{1 - e^{-\lambda D_{j,j+1}}}\,dt = \frac{D_{j,j+1}}{1 - e^{\lambda D_{j,j+1}}} + \frac{1}{\lambda}
\tag{13}
$$

and we similarly add

$$
E[d_{j-1,j} \mid M_j \in Z, M_{j-1} \notin Z] = \frac{D_{j-1,j}}{1 - e^{\lambda D_{j-1,j}}} + \frac{1}{\lambda}
\tag{14}
$$

to the left side.
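Equations 12 through 14 can be put to work in a few lines. The sketch below maximizes the logarithm of Equation 12 by finding the root of its derivative with bisection on a log scale, then evaluates the expected extension. The optimizer, the search bounds, and the function names are assumptions of this example; the text does not say how the maximization was performed. The sketch also assumes at least one "stay" interval and one "leave" interval, since otherwise no finite maximum exists.

```python
import math


def lambda_mle(stay, leave, lo=1e-9, hi=1e9):
    """Maximum-likelihood estimate of lambda for Equation 12.
    stay:  interval lengths where both flanking markers are in a tract.
    leave: interval lengths where the tract ends within the interval."""
    total_stay = sum(stay)

    def score(lam):
        # Derivative of the log of Equation 12 with respect to lambda,
        # written with exp(-z) so that large lam*d cannot overflow.
        s = -total_stay
        for d in leave:
            z = lam * d
            s += d * math.exp(-z) / (-math.expm1(-z))
        return s

    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def expected_extension(D, lam):
    """Equations 13 and 14: expected distance a tract extends into a
    flanking interval of length D, given that it ends inside it."""
    z = lam * D
    return 1.0 / lam - D * math.exp(-z) / (-math.expm1(-z))
```

With a single stay interval s and a single leave interval d, the estimate has the closed form ln(1 + d/s)/d, which makes a convenient test case. As lambda times D approaches zero the expected extension tends to D/2, and for large lambda times D it tends to 1/lambda.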

Applying this method to the mouse SNP data, the

resulting tract lengths were then used in the likelihood

inference method described above. To ensure that

undetected tracts did not lead to spurious rejection of

the null model, migrant tracts from simulated data were

subjected to the constraints of the mouse SNP data set.

The probability that any given SNP allele is informative

concerning migrant ancestry was estimated by replacing

each SNP allele in one subspecies with each possible

SNP allele from the other subspecies and monitoring

the proportion of transplanted alleles that yielded

positive evidence for migrant history under the criteria

detailed above. Average SNP informativeness was esti-

mated in this way for each subspecies separately. Tract

lengths from constant migration rate simulations were

randomly placed on the mouse SNP map, and each tract

was detected only if two or more informative SNPs fell

within it. This process was repeated until the number of

tracts observed in the empirical data was matched.
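The detection filter described above can be sketched as follows. Several details are assumptions of this example rather than facts from the text: tract start points are drawn uniformly along a single chromosome, each marker inside the tract is treated as informative independently with the estimated average probability, and the stopping rule (repeat until the empirical tract count is matched) is left to the caller. All names are hypothetical.

```python
import bisect
import random


def tract_is_detected(start, length, marker_pos, informative_prob, rng):
    """True if two or more informative SNPs fall inside the tract.
    marker_pos must be sorted genetic map positions."""
    lo = bisect.bisect_left(marker_pos, start)
    hi = bisect.bisect_right(marker_pos, start + length)
    informative = sum(1 for _ in range(lo, hi)
                      if rng.random() < informative_prob)
    return informative >= 2


def thin_tracts(lengths, marker_pos, chrom_len, informative_prob, seed=0):
    """Place each simulated tract at a uniform random position and keep
    only those that would have been detected on this SNP map."""
    rng = random.Random(seed)
    kept = []
    for length in lengths:
        start = rng.uniform(0.0, max(0.0, chrom_len - length))
        if tract_is_detected(start, length, marker_pos, informative_prob, rng):
            kept.append(length)
    return kept
```

Applying the same filter to tracts simulated under the null model keeps the comparison fair, because short tracts in marker-poor regions are missed in simulated and empirical data alike.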

RESULTS

Above, we described a theoretical framework for the

distribution of migrant tract lengths and a forward

whole-population simulation tool to generate migrant

tract data. The simulation program enables several

assumptions of the theory to be violated: by allowing

back migration, recombinational joining of migrant

tracts, and effects of the ends of chromosomes. In all

cases examined, including those shown in Figure 1,

simulated data closely matched theoretical predictions.

Figure 1 depicts the migrant tract length distributions

generated by a constant migration rate model and by

admixture beginning 100, 200, or 300 generations ago.

The contrasting migrant tract lengths generated by

these histories suggested that such data could be

informative for demographic inference. But Figure 1

is based on a large number of simulated replicates, and

we were interested in testing whether individual data

sets would contain enough information for demo-

graphic hypothesis testing and parameter inference.

Large, genome-scale data sets were generated for

population samples under various demographic histo-

ries, using the migrant tract simulation method de-

scribed above. Genomes 3500 cM in size were generated

for a sample size of 100, and a minimum tract length of

0.5 cM was used. Likelihood optimization was per-

formed for each simulated data set under the migration

rate change model, yielding estimates of m1, m2, and

T. The highest log-likelihood value obtained for this

model was compared against the log-likelihood score

for the constant migration rate model, and the signif-

icance of likelihood ratios was assessed via comparison

with data sets simulated under the constant rate model.

Results are presented in Figure 2, A and B.

The method was found to have high power to reject a constant rate model for a range of histories. The highest power often occurred within the first few hundred generations after a migration rate change. This is not surprising, as only tracts > 0.5 cM are considered here, and recombination will typically break down migrant chromosomes to this size within ~200 generations. In some cases, particularly for strong decreases in migration rate, the method’s power lasted well beyond this expectation. Even for the most subtle migration rate changes considered (from $N_e m = 0.1$ to $N_e m = 0.04$ and vice versa), power was fairly high, particularly around the T = 200 to T = 500 time window.

For histories involving a migration rate decrease, a

similar procedure was applied to test whether a model

with no current migration ($m_1 = 0$) could be rejected

(test 2A). Here, power was often a bit lower than for test

1, but generally still quite high (Figure 2C). Conversely,

Figure 1. The distribution of migrant tract lengths after the advent of admixture. Models where previously isolated populations begin exchanging migrants at rate $N_e m = 0.1$ 100, 200, or 300 generations ago are compared against the case in which populations exchange migrants at a constant rate $N_e m = 0.1$ with no prior isolation (the single migration rate, ‘‘equilibrium’’ model). Depicted here is the relative abundance of migrant tracts for 0.01-cM histogram bins between 0.5 (the minimum/threshold tract length) and 5 cM. Also shown is the agreement between theoretical predictions (lines) and tracts from 1000 simulated replicates with $N_e$ = 10,000 (shapes).
