
Copyright © 2009 by the Genetics Society of America

DOI: 10.1534/genetics.108.098095

Inference of Historical Changes in Migration Rate From the

Lengths of Migrant Tracts

John E. Pool*,1 and Rasmus Nielsen*,†

*Department of Integrative Biology and †Department of Statistics, University of California, Berkeley, California 94720

Manuscript received October 27, 2008

Accepted for publication December 13, 2008

ABSTRACT

After migrant chromosomes enter a population, they are progressively sliced into smaller pieces by

recombination. Therefore, the length distribution of ‘‘migrant tracts’’ (chromosome segments with recent

migrant ancestry) contains information about historical patterns of migration. Here we introduce a

theoretical framework describing the migrant tract length distribution and propose a likelihood inference

method to test demographic hypotheses and estimate parameters related to a historical change in migration

rate. Applying this method to data from the hybridizing subspecies Mus musculus domesticus and M. m.

musculus, we find evidence for an increase in the rate of hybridization. Our findings could indicate an

evolutionary trajectory toward fusion rather than speciation in these taxa.

An accurate understanding of population history is essential for such diverse applications as the search for recent signatures of positive selection in population genetic data (e.g., Jensen et al. 2005), the study of admixed human populations to identify disease-associated genetic variants (e.g., Montana and Pritchard 2004; Patterson et al. 2004), and the definition of management units in conservation (Pearse and Crandall 2004). Patterns of genetic variation contain information about past changes in population size (e.g., Cornuet and Luikart 1996; Marth et al. 2004), the timing of population splitting events (e.g., Nielsen and Wakeley 2001), and levels of migration between populations (e.g., Beerli and Felsenstein 2001).

Since the advent of molecular markers, researchers have sought to gauge the genetic differentiation of populations and to draw conclusions about the level of migration between them. Wright’s $F_{ST}$ (Wright 1952) has served as the classic metric of population differentiation, and, under ideal conditions, the population migration rate can be estimated by $N_e m = (1 - F_{ST})/(4F_{ST})$, where $N_e$ is the effective population size, m is the per-generation probability of being a migrant, and $N_e m$ is thus equal to the number of migrants exchanged each generation. However, this relationship relies on several assumptions that may not be valid for most natural populations (reviewed in Whitlock and McCauley 1999), including that of a constant rate of migration. A given $F_{ST}$ value between two populations could be produced by a constant level of migration over a long period of time, or by genetic drift following a relatively recent split between the two populations, or by recent admixture between historically isolated populations, or by any number of more complex scenarios. The isolation-migration (IM) inference framework (e.g., Nielsen and Wakeley 2001; Hey 2005) offers a way to differentiate ongoing migration between populations from lineage sorting in isolated populations, while estimating relevant demographic parameters.
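As a small illustration of the $F_{ST}$ relationship quoted above, the following Python sketch converts an $F_{ST}$ value into the implied number of migrants per generation. It assumes the idealized conditions under which that relationship holds; the function name `nem_from_fst` is hypothetical and chosen only for this example.

```python
def nem_from_fst(fst: float) -> float:
    """Return N_e*m = (1 - F_ST) / (4 * F_ST) under idealized assumptions."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("F_ST must lie in (0, 1]")
    return (1.0 - fst) / (4.0 * fst)


if __name__ == "__main__":
    # Lower differentiation implies more migrants per generation.
    for fst in (0.05, 0.2, 0.5):
        print(f"F_ST = {fst}: N_e*m = {nem_from_fst(fst):.3f}")
```

The same $F_{ST}$ value can arise from very different histories, so such a number should be read as a summary under the constant-migration assumption rather than as an estimate of recent gene flow.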

As in the case of IM, most population genetic methods

that estimate demographic parameters assume that all

sites or markers under study are either completely linked

(no recombination) or completely unlinked (free re-

combination) (although see Becquet and Przeworski

2007). And correspondingly, most population genetic

data have been collected with these criteria in mind.

Assuming either full linkage among sites or else in-

dependence among loci can greatly simplify the task of

modeling the histories of molecular markers. However,

the bulk of the genome in most organisms consists of

DNA that is subject to recombination, and, furthermore,

the pattern of recombination events within a sample of

chromosomes may hold valuable information concern-

ing population history. For example, we know that

haplotype statistics (Depaulis et al. 2003) and linkage

disequilibrium (Wall et al. 2002) across short loci are

quite sensitive to the effects of population bottlenecks.

The recent availability of genome-scale polymorphism

data should facilitate investigation of longer-range

linkage patterns, which may shed new light on the

recent histories of populations.

Patterns of diversity at partially linked markers may be

especially informative concerning the historical pattern

of migration between populations. Once a migrant

chromosome enters a new population, recombination

will break it down into progressively shorter segments.

1Corresponding author: Department of Integrative Biology, University of

California, 3060 Valley Life Sciences Bldg. No. 3140, Berkeley, CA 94720-

3140. E-mail: jpool@berkeley.edu

Genetics 181: 711–719 (February 2009)


The lengths of these “migrant tracts” (or admixture “chunks”; Falush et al. 2003) therefore contain information about how long ago migration occurred. This logic has been utilized to estimate the timing of recent admixture events (e.g., Patterson et al. 2004; Hoggart

et al. 2004; Koopman et al. 2007), but its applicability

should extend beyond such cases. We suggest that

migrant tract lengths are expected to have a certain

equilibrium distribution under a constant migration

rate model. An excess of long migrant tracts would

indicate a recent increase in migration rate, while the

opposite pattern would suggest recently reduced gene

flow. We use theoretical predictions and simulations to

explore the migrant tract length distribution under a

variety of demographic scenarios, and we assess the

potential of this approach for inferring demographic

parameters related to migration rate changes.

MODELS AND METHODS

Constant migration rate: A large set of different population genetic models converges to the same coalescence process as the population size becomes large ($N \to \infty$; Kingman 1982a,b). In two-island models (Wright 1931), an ancestral process arises (e.g., Hudson 1983), which can be described by a Markov pure jump process $\{X(t), t \ge 0\}$ with state space on $\{0, \ldots, n_1\} \times \{0, \ldots, n_2\} \setminus (0, 0)$, initial state $(n_1, n_2)$, absorbing states (0, 1), (1, 0), and transition rates

$$
\begin{aligned}
q((i,j) \to (i-1,j)) &= \binom{i}{2} && \text{if } i \ge 2 \\
q((i,j) \to (i,j-1)) &= \binom{j}{2}\frac{N_1}{N_2} && \text{if } j \ge 2 \\
q((i,j) \to (i-1,j+1)) &= N_1 m_{21} i && \text{if } i \ge 1 \\
q((i,j) \to (i+1,j-1)) &= N_2 m_{12} j && \text{if } j \ge 1,
\end{aligned}
\tag{1}
$$

where $n_1$ and $n_2$ are the sample sizes from populations 1 and 2, respectively, and $N_1$ and $N_2$ are the population sizes. Migration occurs from population 2 to 1, and from 1 to 2, at rates $m_{21}$ and $m_{12}$, respectively. Time is measured in units of $N_1$ generations, and $N_j m_{ij}$ can be interpreted as the proportion of individuals in population j that are replaced with individuals from population i in each generation.
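The transition rates of Equation 1 can be written down as a small lookup function. The sketch below is only an illustration of how the four kinds of jump are enumerated: it assumes that the coalescence rate in population 2 carries the factor $N_1/N_2$ (as read from the typeset equation), and the function name `transition_rates` is hypothetical.

```python
import math


def transition_rates(i, j, N1, N2, m21, m12):
    """Return {next_state: rate} for the jump process in state (i, j).

    i and j are the numbers of ancestral lineages in populations 1 and 2.
    Rates follow Equation 1, with time measured in units of N1 generations.
    """
    rates = {}
    if i >= 2:
        rates[(i - 1, j)] = math.comb(i, 2)            # coalescence in population 1
    if j >= 2:
        rates[(i, j - 1)] = math.comb(j, 2) * N1 / N2  # coalescence in population 2
    if i >= 1:
        rates[(i - 1, j + 1)] = N1 * m21 * i           # lineage moves from 1 to 2 (backward in time)
    if j >= 1:
        rates[(i + 1, j - 1)] = N2 * m12 * j           # lineage moves from 2 to 1 (backward in time)
    return rates


if __name__ == "__main__":
    print(transition_rates(3, 2, N1=10_000, N2=10_000, m21=1e-5, m12=0.0))
```

Setting `m12=0.0` reproduces the no-back-migration assumption used in the derivations that follow.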

Consider the ancestry of a single lineage from popu-

lation 1. The waiting time in number of generations

until the last migration event for this lineage is exponentially distributed with mean 1/m (letting $m = m_{21}$ here and in the following to simplify the notation). We now introduce recombination and measure distances in the genome as genetic distances. By using genetic distances, we may assume that recombination in each generation occurs according to a Poisson process with rate 1 along the chromosome. We assume that migrant tracts do not recombine together, we disallow back-migration events (i.e., assume $m_{12} = 0$), and we ignore the effect of the ends of the chromosome (but later we evaluate violations of these assumptions). Then, after t generations, the distribution of tract lengths follows an exponential distribution with mean 1/t:

$$f(x; t) = t e^{-tx}. \tag{2}$$

Because we can reliably infer migrant tracts only over a certain length, we are interested in the distribution of tracts and the expected proportion of a chromosome in tracts larger than a certain threshold, C. The proportion of a migrant chromosome from time t that is in tracts of size > C, $p_C$, can be found from the convolution of two independent and identically distributed exponential random variables with parameter t:

$$E[p_C \mid t] = 1 - \int_0^C t e^{-ty}\left(1 - e^{-t(C-y)}\right) dy = e^{-tC}(1 + Ct). \tag{3}$$

These two variables represent, respectively, the distance

to the left and right on the chromosome from the point

of inspection to the nearest recombination event.
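The closed form in Equation 3 can be checked numerically. The following sketch evaluates the integral with a simple midpoint rule and compares it with $e^{-tC}(1 + Ct)$. It assumes that genetic distances are expressed in Morgans (so that 0.5 cM corresponds to C = 0.005); the function names are hypothetical.

```python
import math


def p_c_given_t(C, t):
    """Closed form of Equation 3: exp(-t*C) * (1 + C*t)."""
    return math.exp(-t * C) * (1.0 + C * t)


def p_c_given_t_numeric(C, t, steps=20000):
    """Midpoint-rule value of 1 - int_0^C t*exp(-t*y) * (1 - exp(-t*(C - y))) dy."""
    h = C / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        total += t * math.exp(-t * y) * (1.0 - math.exp(-t * (C - y)))
    return 1.0 - total * h


if __name__ == "__main__":
    C = 0.005  # 0.5 cM expressed in Morgans (an assumption of this sketch)
    for t in (50.0, 200.0, 1000.0):
        print(t, p_c_given_t(C, t), p_c_given_t_numeric(C, t))
```

The proportion falls from 1 at t = 0 toward 0 as t grows, which is why tracts above a fixed threshold mostly reflect recent migration.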

Integrating over t, we find

$$E[p_C] = \int_0^\infty m e^{-mt} e^{-tC}(1 + Ct)\, dt = \frac{m(2C + m)}{(C + m)^2}. \tag{4}$$

The expected number of fragments in the population of a migrant chromosome of length L is

$$E[k(t)] = 1 + Lt \tag{5}$$

after t generations; i.e., the contribution of migrant tracts from generation t to the population is proportional to $m e^{-mt}(1 + Lt)$. Again ignoring recombination among migrant tracts, the density of tract lengths will be formed as a mixture distribution of tracts from different times,

$$f(x) = \frac{\int_0^\infty t e^{-tx}(1 + Lt)\, m e^{-mt}\, dt}{\int_0^\infty (1 + Lt)\, m e^{-mt}\, dt} = \frac{m^2(2L + m + x)}{(L + m)(m + x)^3}. \tag{6}$$

The conditional tract length distribution of tracts of a length larger than C is then

$$f(x \mid x > C) = \frac{f(x)}{\int_C^\infty f(x)\, dx} = \frac{(C + m)^2(2L + m + x)}{(C + L + m)(m + x)^3}. \tag{7}$$
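The constant-rate results (Equations 4 and 7) are simple enough to code directly. The sketch below does so, again assuming distances in Morgans. The helper `tail_probability` is not given in the text: it is obtained here by integrating Equation 7 from a point x0 to infinity, and is included only as a consistency check.

```python
import math


def p_c_constant(m, C):
    """Equation 4: expected proportion of the genome in migrant tracts longer than C."""
    return m * (2.0 * C + m) / (C + m) ** 2


def tract_density_given_longer(x, m, C, L):
    """Equation 7: density of tract length x, conditional on x > C."""
    if x <= C:
        return 0.0
    return (C + m) ** 2 * (2.0 * L + m + x) / ((C + L + m) * (m + x) ** 3)


def tail_probability(x0, m, C, L):
    """P(X > x0 | X > C) for x0 >= C, derived by integrating Equation 7 (illustrative)."""
    return (C + m) ** 2 * (m + x0 + L) / ((C + L + m) * (m + x0) ** 2)


if __name__ == "__main__":
    m, C, L = 1e-5, 0.005, 1.0  # illustrative values; L is chromosome length in Morgans
    print("E[p_C] =", p_c_constant(m, C))
    for x in (0.006, 0.01, 0.05):
        print(x, tract_density_given_longer(x, m, C, L), tail_probability(x, m, C, L))
```

When m is much smaller than C, the conditional density depends only weakly on m, which is consistent with the later remark that the conditional distribution carries little information about the overall level of subdivision.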

These expressions do allow for genetic drift. However,

they assume that recombination events between de-

scendants of the same or different migration events

contribute to the breakdown of chromosomes into

smaller distinguishable tracts. In practice, we cannot

distinguish between nonrecombinants and recombi-


nants between copies of the same allele. The approx-

imations we derive here are, therefore, expected to

break down when t becomes so large compared to N1

that migrant alleles may have drifted to appreciably high

allele frequencies, thereby allowing for recombination

between migrant tracts. However, this is not a funda-

mental problem as we can infer only relatively large

tracts that, with high probability, are descendants of

recent migrants. If C is sufficiently large, it is highly

probable that only fragments for which t is small have

been sampled. The chance that a migrant allele of size

>C has drifted to high frequencies is small if C ≫ 1/N

(since recombination will break down tracts below this

threshold before drift can substantially elevate them in

frequency). Problems identifying recombinants be-

tween migrant alleles are, therefore, avoidable if C is

sufficiently large. For the same reason, for large C,

inferences based on Equation 7 should be relatively

robust to violations of the assumption of no back

migration; i.e., $m_{12} = 0$.

Changes in the migration rate: We now extend these results to the case where there has been a discrete change in the rate of migration. Again, we consider only migration into population 1, and assume that the current migration rate is $m_1$ and that, until T generations ago, it was $m_2$. We then have

$$
\begin{aligned}
E_2[p_C] &= \int_0^T m_1 e^{-m_1 t} e^{-tC}(1 + Ct)\, dt + e^{-m_1 T} \int_0^\infty m_2 e^{-m_2 t} e^{-(t+T)C}\left(1 + C(t + T)\right) dt \\
&= \frac{m_1(2C + m_1)}{(C + m_1)^2} - \frac{C^2 e^{-(C + m_1)T}(m_1 - m_2)\left(2C + m_1 + m_2 + (C + m_1)(C + m_2)T\right)}{(C + m_1)^2 (C + m_2)^2}.
\end{aligned}
\tag{8}
$$
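A direct transcription of the closed form in Equation 8 makes its limiting behavior easy to inspect. The sketch below assumes distances in Morgans; the function name `p_c_rate_change` is hypothetical. At T = 0 the expression reduces to the constant-rate value for $m_2$, for very large T it approaches the constant-rate value for $m_1$, and with $m_1 = m_2$ the correction term vanishes.

```python
import math


def p_c_rate_change(C, m1, m2, T):
    """Equation 8: expected proportion in tracts longer than C when the
    migration rate changed from m2 to m1 exactly T generations ago."""
    a = C + m1
    b = C + m2
    equilibrium = m1 * (2.0 * C + m1) / a ** 2
    correction = (C ** 2 * math.exp(-a * T) * (m1 - m2)
                  * (2.0 * C + m1 + m2 + a * b * T)) / (a ** 2 * b ** 2)
    return equilibrium - correction


if __name__ == "__main__":
    C, m1, m2 = 0.005, 1e-5, 4e-6  # illustrative values only
    for T in (0.0, 100.0, 200.0, 500.0, 5000.0):
        print(T, p_c_rate_change(C, m1, m2, T))
```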

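Likewise, setting

$$f_2(x) = \frac{\int_0^T t e^{-tx}(1 + Lt)\, m_1 e^{-m_1 t}\, dt + e^{-m_1 T} \int_T^\infty t e^{-tx}(1 + Lt)\, m_2 e^{-m_2 (t - T)}\, dt}{\int_0^T (1 + Lt)\, m_1 e^{-m_1 t}\, dt + e^{-m_1 T} \int_T^\infty (1 + Lt)\, m_2 e^{-m_2 (t - T)}\, dt} \tag{9}$$

and conditioning as in Equation 7, we find

$$f_2(x \mid x > C) = \frac{f_2(x)}{\int_C^\infty f_2(x)\, dx} = e^{T(C - x)} (C + m_1)^2 (C + m_2)^2 \times \frac{a - b}{c}, \tag{10}$$

where

$$
\begin{aligned}
a &= \frac{m_2\left(m_2 + x + T(m_2 + x)^2 + L\left(2 + T(m_2 + x)(2 + T(m_2 + x))\right)\right)}{(m_2 + x)^3}, \\
b &= \frac{m_1\left((m_1 + x)\left(1 - e^{T(m_1 + x)} + T(m_1 + x)\right) + L\left(2 - 2e^{T(m_1 + x)} + T(m_1 + x)(2 + T(m_1 + x))\right)\right)}{(m_1 + x)^3}, \\
c &= e^{(C + m_1)T} m_1 (C + L + m_1)(C + m_2)^2 - (m_1 - m_2) \\
&\quad \times \left(-L m_1 m_2 + C^3(1 + LT) + C m_1 m_2 (1 + LT) + C^2\left(L + m_1 + m_2 + L(m_1 + m_2)T\right)\right).
\end{aligned}
$$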
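Equation 10 and its auxiliary terms a, b, and c can be transcribed into code and compared against the constant-rate density of Equation 7, to which it should reduce when $m_1 = m_2$. The sketch below is such a transcription under the assumption that distances are in Morgans; the function names are hypothetical, and the exponential terms can overflow for very large values of T(m1 + x), which this sketch does not guard against.

```python
import math


def constant_rate_density(x, C, L, m):
    """Equation 7: conditional density of tract length x > C at constant rate m."""
    return (C + m) ** 2 * (2.0 * L + m + x) / ((C + L + m) * (m + x) ** 3)


def rate_change_density(x, C, L, m1, m2, T):
    """Equation 10: conditional density of tract length x > C when the
    migration rate changed from m2 to m1 exactly T generations ago."""
    if x <= C:
        return 0.0
    u = m1 + x
    v = m2 + x
    a = m2 * (v + T * v ** 2 + L * (2.0 + T * v * (2.0 + T * v))) / v ** 3
    e = math.exp(T * u)
    b = m1 * (u * (1.0 - e + T * u)
              + L * (2.0 - 2.0 * e + T * u * (2.0 + T * u))) / u ** 3
    c = (math.exp((C + m1) * T) * m1 * (C + L + m1) * (C + m2) ** 2
         - (m1 - m2) * (-L * m1 * m2 + C ** 3 * (1.0 + L * T)
                        + C * m1 * m2 * (1.0 + L * T)
                        + C ** 2 * (L + m1 + m2 + L * (m1 + m2) * T)))
    return math.exp(T * (C - x)) * (C + m1) ** 2 * (C + m2) ** 2 * (a - b) / c


if __name__ == "__main__":
    C, L = 0.005, 1.0  # illustrative values only
    for x in (0.006, 0.01, 0.03):
        admix = rate_change_density(x, C, L, m1=1e-5, m2=0.0, T=100.0)
        const = constant_rate_density(x, C, L, m=1e-5)
        print(x, admix, const, admix / const)
```

In this example (admixture beginning 100 generations ago), the ratio of the two densities grows with x, which matches the expectation that a recent increase in migration produces an excess of long tracts relative to the equilibrium model.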

Inference: We wish to estimate the parameters, m1,

m2, and T from an observed tract length distribution. As

only large tracts can be easily identified, we have to base

inferences on Equations 8 and 10 and not on Equation

9. We define a composite-likelihood function by taking

the product of Equation 10 among all tracts in the data

above a prespecified threshold (C). The reason why we

consider this a composite-likelihood function and not a

true-likelihood function is that the same tract can be

counted twice. However, for real data with C large, this

will rarely happen and the estimation function is essen-

tially a true-likelihood function.

Equation 10 contains very little information about the overall amount of population subdivision, because we look only at the relative abundance of tracts with length greater than C. However, much of the information regarding the overall level of population subdivision is captured by our estimate of $p_C$ (Equation 8). We therefore do a constrained optimization of the likelihood function subject to the constraint

$$E_2[p_C] = \hat{p}_C, \tag{11}$$

where $\hat{p}_C$ is the observed proportion of the genome in tracts larger than C. Specifically, we rearrange Equation 8 to express T as a function of C, $m_1$, $m_2$, and $E_2[p_C]$, and we then substitute $\hat{p}_C$ for $E_2[p_C]$. We then perform a two-dimensional optimization for $m_1$ and $m_2$ while constraining T to take on the value given by the aforementioned equation. This approach reduces the number of parameters from Equation 10 to be estimated (from three to two) and adds information concerning the total proportion of migrant DNA observed (from $\hat{p}_C$). Constrained models with one of the two migration rates set to zero are evaluated similarly, via a one-dimensional optimization of the other migration rate. For the constant migration rate model, m can be estimated simply by setting $m_1 = m_2$ in Equation 8, and thus using $\hat{p}_C$ to solve for m.
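The constraint step can be sketched in code. The text describes rearranging Equation 8 to express T in terms of the other quantities; the sketch below instead solves the same equation for T numerically by bisection, which is a substitute technique chosen here for simplicity and not necessarily the authors' procedure. It relies on the fact that the Equation 8 expression changes monotonically in T between its values at T = 0 and at large T. The names `solve_T` and `composite_log_likelihood`, and the bound `T_max`, are hypothetical.

```python
import math


def p_c_rate_change(C, m1, m2, T):
    """Equation 8 (closed form)."""
    a, b = C + m1, C + m2
    return (m1 * (2.0 * C + m1) / a ** 2
            - C ** 2 * math.exp(-a * T) * (m1 - m2)
            * (2.0 * C + m1 + m2 + a * b * T) / (a ** 2 * b ** 2))


def solve_T(p_hat, C, m1, m2, T_max=1e6, tol=1e-9):
    """Return T with E2[p_C] = p_hat, or None if p_hat is not attainable."""
    lo, hi = 0.0, T_max
    f_lo = p_c_rate_change(C, m1, m2, lo) - p_hat
    f_hi = p_c_rate_change(C, m1, m2, hi) - p_hat
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if (f_lo > 0.0) == (f_hi > 0.0):
        return None  # p_hat lies outside the range spanned by m1 and m2
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        f_mid = p_c_rate_change(C, m1, m2, mid) - p_hat
        if f_mid == 0.0:
            return mid
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def composite_log_likelihood(tract_lengths, density):
    """Sum of log densities over all observed tracts longer than the threshold."""
    return sum(math.log(density(x)) for x in tract_lengths)
```

In a full analysis, an optimizer would propose values of m1 and m2, `solve_T` would supply the implied T, and the composite log-likelihood would be evaluated with the Equation 10 density at those three values.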

Comparison of likelihood scores from different models allows the testing of demographic hypotheses. Test 1 compares the maximum-log-likelihood score from the migration rate change model (with $m_1$ and $m_2$ allowed to vary) against the null hypothesis of a single, constant migration rate (with m inferred from $\hat{p}_C$). Test 2 compares the maximum-log-likelihood score of the migration rate change model against a model where either (A) $m_1$ is constrained to be zero or (B) $m_2$ is constrained to be zero. Generally, test 2A is performed when $m_1 < m_2$, and test 2B is performed if $m_1 > m_2$. Because the distributions of likelihood ratios are not well modeled by standard asymptotic theory for any of these tests, critical values are obtained using data simulated under the null hypothesis. For computational reasons, we obtain critical values using $N_e m = 0.1$ for test 1, the true values of $m_2$ and T for test 2A, and true $m_1$ and T for test 2B (rather than using the estimated parameter values for each simulated replicate). In the analysis of empirical data, we use the estimated null model parameter values instead.
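Obtaining critical values from null simulations amounts to taking an empirical quantile of the simulated test statistics. The sketch below shows one common way to do this; the quantile convention and the add-one correction in the P-value are assumptions of this example and are not specified in the text, and both function names are hypothetical.

```python
import math


def critical_value(null_stats, alpha=0.05):
    """Empirical (1 - alpha) quantile of statistics simulated under the null model."""
    ordered = sorted(null_stats)
    k = math.ceil((1.0 - alpha) * len(ordered)) - 1
    k = min(max(k, 0), len(ordered) - 1)
    return ordered[k]


def empirical_p_value(observed, null_stats):
    """Share of null statistics at least as large as the observed one,
    with an add-one correction so that the P-value is never exactly zero."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)
```

Here `null_stats` would hold the likelihood-ratio statistics (differences in maximized log-likelihood between the two models) computed on data sets simulated under the null model.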

Simulation: A forward simulation program was writ-

ten to allow the generation of migrant tract data. This

program simulates each chromosome present in two

populations and models the processes of genetic drift,

migration, and recombination under a Wright–Fisher

model (Fisher 1930; Wright 1931). It does not

generate polymorphism data; instead it directly mon-

itors migrant tract status along chromosomes. When an

individual migrates, all previously nonmigrant chromo-

some sections become migrant tracts, and any previous

migrant tracts become nonmigrant. Tracts are ‘‘forgot-

ten’’ when recombination breaks them down to a size

below the threshold length. The program initializes

with no migrant tracts present, but goes through a

‘‘burn-in’’ period with migration at rate m2. For the

analyses shown here, using a threshold tract length of

C = 0.5 cM, the burn-in time was 2000 generations (results and theory indicated this was more than enough time to reach an equilibrium migrant tract length distribution) and $N_e$ was 10,000. At the end of the

burn-in, the migration rate switches to m1 and the

program records all migrant tracts present in each

population at a series of time points (T) after this

change. An extension to this program allows migrant

tracts to be sampled from a specific number of individ-

uals. In testing the performance of the likelihood

method, we simulated ‘‘genomes’’ containing 35 chro-

mosomes, each 100 cM in length (3500 cM is close to

the genetic map size of humans and many other mam-

mals; Kong et al. 2002), and we sampled 100 haploid

individuals from one population.

Application to empirical data: The likelihood method was applied to genomewide single-nucleotide polymorphism (SNP) data from two hybridizing subspecies of the house mouse, Mus musculus domesticus and M. m. musculus. These data were produced by the Wellcome Trust Center for Human Genetics and are available at http://www.well.ox.ac.uk/mouse/INBREDS/. The strains examined here consist of seven from M. m. domesticus and eight from M. m. musculus, with varying geographic origins (see Harr 2006 for a summary). The data come from wild-derived, inbred mouse strains and are effectively haploid. The few apparently heterozygous sites were recoded as missing data, and invariant SNPs were removed. Since the X chromosome is expected to have a different history, all of the 9935 SNPs analyzed here were autosomal. The vast majority of

these SNPs have inferred genetic map positions

(Jensen-Seaman et al. 2004), and all analyses were done

in terms of genetic distance, rather than physical

position. These SNPs had been ascertained in labora-

tory lines of mixed origin and could be biased in terms

of diversity levels and allele frequencies (Boursot and

Belkhir 2006), but we do not expect a particular bias

for the inference and analysis of migrant tracts.

In general, our likelihood inference method allows

the user to decide how migrant tracts should be defined.

The sample sizes of the mouse SNP data set seemed too

small for published methods for identifying ancestry

along recombining chromosomes (e.g., Falush et al.

2003). However, the task of tract identification is

simplified by the high level of genetic differentiation

between the two subspecies, which diverged perhaps

1 million generations ago and show very high levels of

genetic differentiation (Baines and Harr 2007; Salcedo et al. 2007). We were therefore able to use a very simple set of criteria for defining migrant tracts in these data. Given the small sample sizes, an individual’s SNP allele was deemed to provide evidence for a migrant tract only if it was otherwise absent from the individual’s subspecies, but present in the other subspecies (we call this a “positive SNP”). If an individual’s SNP allele is otherwise present in both subspecies, this is a “neutral SNP,” neither favoring nor opposing migrant tract status. And if an

individual’s SNP allele is not present in the other sub-

species, it is taken as evidence against migrant tract

status (a ‘‘negative SNP’’). Migrant tracts consisted of

two or more positive SNPs with no negative SNPs be-

tween them. The minimum tract length was considered

to be the genetic distance spanning only the positive

SNPs at each end of the tract. The maximum tract

length included all sites up to the first negative SNPs

flanking the tract.
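The tract-calling criteria above translate into a short scan along a chromosome. The sketch below is one possible reading of those rules, not the authors' code: SNP alleles are first labeled positive ('+'), neutral ('0'), or negative ('-'), and tracts are then runs of two or more positive SNPs uninterrupted by a negative SNP. When no negative SNP flanks a tract on one side, this sketch falls back to the first or last marker on the chromosome, which is an assumption. Function names and the tuple layout are hypothetical.

```python
def classify_allele(allele, own_subspecies_others, other_subspecies):
    """Label one individual's allele at a SNP.

    own_subspecies_others: alleles carried by the other sampled individuals
    of the same subspecies; other_subspecies: alleles seen in the other subspecies.
    """
    if allele not in other_subspecies:
        return '-'   # negative SNP: evidence against migrant ancestry
    if allele in own_subspecies_others:
        return '0'   # neutral SNP: present in both subspecies
    return '+'       # positive SNP: only seen in the other subspecies


def call_tracts(markers):
    """markers: list of (genetic_position, label) sorted by position.

    Returns a list of (first_positive, last_positive, min_length, max_length).
    """
    tracts = []
    if not markers:
        return tracts
    positives = []                 # positive SNPs since the last negative SNP
    left_bound = markers[0][0]     # last negative SNP (or chromosome start)
    for pos, label in markers:
        if label == '+':
            positives.append(pos)
        elif label == '-':
            if len(positives) >= 2:
                tracts.append((positives[0], positives[-1],
                               positives[-1] - positives[0], pos - left_bound))
            positives = []
            left_bound = pos
    if len(positives) >= 2:        # tract running to the end of the marker map
        tracts.append((positives[0], positives[-1],
                       positives[-1] - positives[0], markers[-1][0] - left_bound))
    return tracts
```

Missing genotypes are assumed to have been dropped before the scan, so that they neither support nor interrupt a tract.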

Given the minimum and maximum length of a mi-

grant tract, we were interested in estimating how far

beyond the positive SNPs this tract is expected to extend. To do this we assume that the length of a tract is exponentially distributed with parameter λ. If marker $M_i$ is in a tract, the probability that the next marker, $M_{i+1}$, is also in the same tract is $e^{-\lambda D_{i,i+1}}$, where $D_{i,i+1}$ is the genetic distance between markers $M_i$ and $M_{i+1}$. A likelihood function for λ is then given by

$$L(\lambda) = \prod_{j:\, M_j \in Z,\, M_{j+1} \in Z} e^{-\lambda D_{j,j+1}} \prod_{j:\, M_j \in Z,\, M_{j+1} \notin Z} \left(1 - e^{-\lambda D_{j,j+1}}\right), \tag{12}$$

where Z is the set of all markers in a migrant tract. By entering the lengths of all SNP intervals ($D_{i,i+1}$) where we remain in a migrant tract or leave one and then maximizing this function, we obtain a maximum-likelihood estimate of λ. Now the expected distance to add to a tract on the right side is

$$E[d_{j,j+1} \mid M_j \in Z, M_{j+1} \notin Z] = \int_0^{D_{j,j+1}} \frac{t \lambda e^{-\lambda t}}{1 - e^{-\lambda D_{j,j+1}}}\, dt = \frac{D_{j,j+1}}{1 - e^{\lambda D_{j,j+1}}} + \frac{1}{\lambda} \tag{13}$$

and we similarly add

$$E[d_{j-1,j} \mid M_j \in Z, M_{j-1} \notin Z] = \frac{D_{j-1,j}}{1 - e^{\lambda D_{j-1,j}}} + \frac{1}{\lambda} \tag{14}$$

to the left side.
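The estimation of λ and the expected tract extension can be sketched as follows. The log of the Equation 12 likelihood has a derivative that decreases monotonically in λ, so this example finds its root by bisection; that choice of optimizer, the upper bracket n/S (number of leave intervals over the summed stay intervals), and the function names are assumptions of this sketch.

```python
import math


def lambda_mle(stay_intervals, leave_intervals, iterations=200):
    """Maximum-likelihood estimate of lambda for Equation 12.

    stay_intervals: distances between adjacent markers that are both in a tract.
    leave_intervals: distances between a marker in a tract and the next one outside it.
    """
    total_stay = sum(stay_intervals)
    if not leave_intervals or total_stay <= 0.0:
        raise ValueError("need at least one leave interval and a positive stay length")

    def score(lam):
        # Derivative of the log-likelihood with respect to lambda.
        s = -total_stay
        for d in leave_intervals:
            z = lam * d
            if z < 700.0:          # beyond this the term is negligible
                s += d / math.expm1(z)
        return s

    lo, hi = 0.0, len(leave_intervals) / total_stay  # score(hi) <= 0 is guaranteed
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def expected_extension(D, lam):
    """Equations 13 and 14: expected distance a tract extends into a flanking
    interval of length D, given that the tract ends somewhere in that interval."""
    if D <= 0.0:
        return 0.0
    z = lam * D
    if z > 700.0:
        return 1.0 / lam
    return D / (-math.expm1(z)) + 1.0 / lam
```

The expected extension lies between 0 and D/2: it approaches D/2 when λD is small (the end point is nearly uniform within the interval) and 1/λ when λD is large.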

Applying this method to the mouse SNP data, the

resulting tract lengths were then used in the likelihood

inference method described above. To ensure that

undetected tracts did not lead to spurious rejection of

the null model, migrant tracts from simulated data were

subjected to the constraints of the mouse SNP data set.

The probability that any given SNP allele is informative

concerning migrant ancestry was estimated by replacing

each SNP allele in one subspecies with each possible

SNP allele from the other subspecies and monitoring

the proportion of transplanted alleles that yielded

positive evidence for migrant history under the criteria

detailed above. Average SNP informativeness was esti-

mated in this way for each subspecies separately. Tract

lengths from constant migration rate simulations were

randomly placed on the mouse SNP map, and each tract

was detected only if two or more informative SNPs fell

within it. This process was repeated until the number of

tracts observed in the empirical data was matched.

RESULTS

Above, we described a theoretical framework for the

distribution of migrant tract lengths and a forward

whole-population simulation tool to generate migrant

tract data. The simulation program enables several

assumptions of the theory to be violated: by allowing

back migration, recombinational joining of migrant

tracts, and effects of the ends of chromosomes. In all

cases examined, including those shown in Figure 1,

simulated data closely matched theoretical predictions.

Figure 1 depicts the migrant tract length distributions

generated by a constant migration rate model and by

admixture beginning 100, 200, or 300 generations ago.

The contrasting migrant tract lengths generated by

these histories suggested that such data could be

informative for demographic inference. But Figure 1

is based on a large number of simulated replicates, and

we were interested in testing whether individual data

sets would contain enough information for demo-

graphic hypothesis testing and parameter inference.

Large, genome-scale data sets were generated for

population samples under various demographic histo-

ries, using the migrant tract simulation method de-

scribed above. Genomes 3500 cM in size were generated

for a sample size of 100, and a minimum tract length of

0.5 cM was used. Likelihood optimization was per-

formed for each simulated data set under the migration

rate change model, yielding estimates of m1, m2, and

T. The highest log-likelihood value obtained for this

model was compared against the log-likelihood score

for the constant migration rate model, and the signif-

icance of likelihood ratios was assessed via comparison

with data sets simulated under the constant rate model.

Results are presented in Figure 2, A and B.

The method was found to have high power to reject a

constant rate model for a range of histories. The highest

power often occurred within the first few hundred

generations after a migration rate change—this is not

surprising, as only tracts >0.5 cM are considered here,

and recombination will typically break down migrant

chromosomes to this size within ~200 generations. In

some cases, particularly for strong decreases in migra-

tion rate, the method’s power lasted well beyond this

expectation. Even for the most subtle migration rate

changes considered (from $N_e m = 0.1$ to $N_e m = 0.04$ and

vice versa), power was fairly high, particularly around

the T = 200 to T = 500 time window.

For histories involving a migration rate decrease, a

similar procedure was applied to test whether a model

with no current migration ($m_1 = 0$) could be rejected

(test 2A). Here, power was often a bit lower than for test

1, but generally still quite high (Figure 2C). Conversely,

Figure 1. The distribution of migrant tract lengths after the advent of admixture. Models where previously isolated populations begin exchanging migrants at rate $N_e m = 0.1$ 100, 200, or 300 generations ago are compared against the case in which populations exchange migrants at a constant rate $N_e m = 0.1$ with no prior isolation (the single migration rate, “equilibrium” model). Depicted here is the relative abundance of migrant tracts for 0.01-cM histogram bins between 0.5 (the minimum/threshold tract length) and 5 cM. Also shown is the agreement between theoretical predictions (lines) and tracts from 1000 simulated replicates with $N_e$ = 10,000 (shapes).


for histories involving a migration rate increase, we

tested whether a model with no migration before the

rate change ($m_2 = 0$) could be rejected (test 2B). Power

for this test was high for very recent migration rate

changes (i.e., 100–200 generations ago), but declined

quickly from there (Figure 2D).

Accuracy of parameter estimation under the migra-

tion rate change model is shown in Figure 3. For a

variety of demographic histories involving isolation,

migration rate decreases, migration rate increases, and

admixture, estimates of $m_1$ and $m_2$ were often quite precise. Although the method cannot always distinguish low migration rates from zero, higher migration rates of 1E−5 ($N_e m = 0.1$) were estimated quite accurately, often with 95% confidence intervals extending only ~30%

above and below the true value. A similar degree of

accuracy was observed for T, with confidence intervals

spanning a factor of 2 or considerably less. Parameter

estimates for migration rate changes beyond 500 gen-

erations ago typically became less precise (data not

shown), which makes sense as these data sets become

less informative, with few tracts >0.5 cM having arisen

before the migration rate change.

Given the generally favorable performance of the

likelihood inference method on simulated migrant

tract data, we then sought to apply it to empirical data.

Because a prerequisite for this method is a set of migrant tracts inferred with reasonable confidence, it is most applicable to populations or taxa that show a high degree of genetic differentiation. One such case is represented by the hybridizing house mouse subspecies M. m. domesticus and M. m. musculus in Europe. We used a simple set of criteria to define migrant (hybrid) tracts in genomewide SNP data from both subspecies and then applied the likelihood inference method. Due to the limited size of the data set, relatively small numbers of migrant tracts were found: 75 in M. m. domesticus and 60 in M. m. musculus. However, the length distributions seemed to contain an excess of long tracts relative to equilibrium expectations

(Figure 4), and the inference method detected a signal

for an increase in the rate of introgression for both

subspecies (Table 1).

In spite of having a larger likelihood-ratio statistic

against the constant rate model than for M. m. musculus,

test 1 was only marginally significant for M. m. domesticus,

while being significant for M. m. musculus and for a

combined analysis of tracts from both subspecies. The

weaker result for M. m. domesticus is due to a lower level

of SNP informativeness in this subspecies: only 18% of

M. m. musculus alleles would be detected as migrant in

M. m. domesticus, compared to 38% in the opposite

direction. Therefore, smaller tracts more frequently went

undetected in the simulations used to assess signifi-

cance in M. m. domesticus (see models and methods for

details), and likelihood ratios from these simulations

were higher. We also confirmed that the combined tract

length data set for both subspecies showed the same

signal for increased hybridization when each tract was

required to have a minimum of three SNPs favoring

migrant ancestry, rather than two (P < 0.01; results not

shown).

M. m. domesticus had an estimate of zero for m2, while

the estimate for M. m. musculus was nonzero (Table 1),

but in neither case were the data sufficient to differen-

tiate between no hybridization vs. low hybridization

prior to the inferred rate change. The estimated timing

Figure 2. Power to test demographic hypotheses. Shown here first are tests comparing the migration rate change model to the null model of a constant migration rate, for histories involving decreasing (A) or increasing (B) migration rates. For histories involving decreasing migration rates, power to reject a model with $m_1$ constrained to be zero is shown (C). For histories involving increasing migration rates, power to reject a model with $m_2$ constrained to be zero is shown (D). Significance was gauged by comparing the difference in log-likelihood scores between models to data simulated under the null model. Each data set consisted of 100 simulated haploid genomes, and a threshold tract length of 0.5 cM was used.


of the rate change was similar in both taxa (202 and 234

generations ago) and in the combined analysis (206

generations ago). The two subspecies’ estimates of m1

differ by only about a factor of 2 (Table 1), and both

suggest a high contemporary population migration rate

between these subspecies.

DISCUSSION

The specific demographic parameter estimates obtained from the house mouse SNP data should be interpreted with caution in light of limitations in the data set.

Sample sizes are small—seven and eight haploid ge-

nomes. Samples originate from various geographic

locations (Harr 2006), and our quantitative estimates

might depend on the proximity of samples to the hybrid

zone. Thus, it would be worthwhile to confirm our

conclusions and refine parameter estimates using full-

genome sequence data from reasonably large popula-

tion samples of both subspecies.

Still, it is interesting that both taxa yielded estimates

of ~200 generations for the time since an increase in

hybridization rate. Particularly when using a minimum

tract length as large as 0.5 cM (which is necessitated by

the density of SNPs in this data set), the time scope of

our inference method is limited to fairly recent events

(Figure 2). Thus, the time of ~200 generations may not

represent the first contact between these subspecies in

Europe, and, indeed, archaeological evidence suggests

a more ancient date for this event (reviewed in Boursot

et al. 1993). However, this timing may still represent an

increase in the rate of hybridization. If these mice have

~2 generations per year, the inference method suggests

that hybridization increased ~100 years ago, which

seems generally coincident with an increased potential

for human-mediated transport in Europe.

The evolutionary trajectories of these hybridizing

house mouse subspecies will depend on a variety of

factors, but one potential predictor is the current rate of hybridization in terms of $N_e m_1$. The true values of current $N_e$ for European populations of M. m. domesticus and M. m. musculus are unknown, but long-term effective sizes on the order of 1 million have been inferred for ancestral range populations of both subspecies (Baines and Harr 2007). Given the successful relationship of these mice with humans, it seems very plausible that the current $N_e$ is at least this large. If we therefore take 1 million as an estimate for $N_e$ in both taxa, the $m_1$ estimates obtained here imply that M. m. domesticus is currently receiving ~61 immigrants from M. m. musculus each generation, while M. m. musculus is receiving ~33 immigrants per generation from M. m. domesticus (on the basis of the estimate of $N_e m_1$ for each subspecies). Since both of these estimates give $4N_e m_1 \gg 1$, these results could indicate that M. m. domesticus and M. m. musculus are currently on a path toward fusion rather than speciation. However, the presence of partial incompatibilities between these taxa, particularly on the X

Figure 3.—Distribution of demographic parameter esti-

mates. Results from the analysis of simulated migrant tract

data are shown, including median estimates (diamonds)

and 95% confidence intervals (the 2.5 and 97.5 percentiles

of the distribution of estimates) for (A) m1, (B) m2, and

(C) T. The order of parameter sets is the same in each panel

(i.e., the far left estimates are for true values of m1 = 0, m2 =

1E-5, and T = 100).

Figure 4.—Migrant tract lengths found in M. m. domesticus

and M. m. musculus, compared to constant migration rate ex-

pectations.


chromosome (e.g., Good et al. 2008), suggests that certain

portions of the genome may resist homogenization.
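The arithmetic behind the immigrant counts can be reproduced with a short sketch. It assumes, as in the text, an effective size of 1 million for both subspecies and takes the m1 estimates from Table 1; the function and variable names are hypothetical.

```python
# Effective population size assumed in the text for both subspecies.
NE = 1_000_000

# Estimates of the current migration rate m1, as reported in Table 1.
M1_ESTIMATES = {"domesticus": 6.08e-5, "musculus": 3.29e-5}


def immigrants_per_generation(ne, m):
    """Expected number of immigrants per generation, Ne * m."""
    return ne * m


def scaled_migration_rate(ne, m):
    """Scaled migration rate 4 * Ne * m."""
    return 4 * ne * m


for subspecies, m1 in M1_ESTIMATES.items():
    n_imm = immigrants_per_generation(NE, m1)
    scaled = scaled_migration_rate(NE, m1)
    print(f"{subspecies}: Ne*m1 = {n_imm:.1f}, 4*Ne*m1 = {scaled:.1f}")
```

Both scaled rates come out in the hundreds, which is the sense in which 4Ne m1 greatly exceeds 1 here. The conclusion depends directly on the assumed value of Ne, which is not known for the current European populations.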

Our analysis of simulated data showed that, given the

lengths of migrant tracts from a population sample of

genomes, the likelihood inference method presented

here has high power to detect historical changes in

migration (Figure 2), even for rather subtle shifts in

migration rate (i.e., 2.5-fold changes), and should be

useful in testing hypotheses and estimating parameters

related to migration rate changes. This approach is con-

ceptually related to methods that estimate the timing of

recent admixture events (Hoggart et al. 2004; Patterson

et al. 2004), but it allows for a greater variety of historical

scenarios. In terms of its temporal scale, our method falls in

between methods that identify very recent migration

events (e.g., Rannala and Mountain 1997) and those

that estimate long-term migration rates (e.g., Beerli and

Felsenstein 2001). Although the results presented here

suggest that our method is most relevant for detecting

migration rate changes within the past 1000 generations,

in many cases it may be possible to use a lower threshold

tract length (C) than the 0.5 cM used in this study, and the

temporal scope should expand with the inverse of C. The

main assumption of the method is that recombination

will break down tracts to below the threshold length before

genetic drift can lift them to high frequency. Values of C

that are <1/Ne are therefore recommended, but the

choice of C will also depend on the level of diversity, the

degree of population differentiation, and the density of

markers (all of which constrain the inference of short

migrant tracts). For M. musculus, a smaller threshold tract

length would be a viable option with a denser SNP data set.
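The inverse relationship between the threshold C and the temporal scope can be illustrated with a rough sketch. It rests on a simplifying assumption that is not taken from the equations of this article: a tract that entered the population t generations ago is treated as having an exponentially distributed length with mean 1/t Morgans. The function names and the 1% cutoff are hypothetical choices made only for illustration.

```python
import math

# Illustrative assumption (not the full tract length distribution of this
# article): tract length after t generations is exponential with rate t
# per Morgan, so P(length > C) = exp(-C * t).


def fraction_exceeding(threshold_morgans, generations):
    """Fraction of tracts still longer than the threshold after t generations."""
    return math.exp(-threshold_morgans * generations)


def generations_for_fraction(threshold_morgans, fraction):
    """Age at which only the given fraction of tracts still exceeds the threshold."""
    return -math.log(fraction) / threshold_morgans


# Thresholds in cM, converted to Morgans; 0.5 cM is the value used in this study.
for c_cm in (0.5, 0.25, 0.1):
    c = c_cm / 100.0
    horizon = generations_for_fraction(c, 0.01)  # arbitrary 1% cutoff
    print(f"C = {c_cm} cM: 1% of tracts remain above C after ~{horizon:.0f} generations")
```

Under this simplification, halving C doubles the age at which a given fraction of tracts remains detectable, which is the scaling described above.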

Our method does not address the inference of pop-

ulation ancestry along a recombining chromosome and

requires that migrant tracts be identified beforehand.

Published methods exist for this purpose (e.g., Falush

et al. 2003) and the optimal method may depend on the

data set being analyzed. Tract length data obtained from

such methods can be used as the input for our analysis,

and for methods that allow sampling from a posterior

distribution of tract lengths, uncertainty in the tract

length inference can be directly incorporated in the

likelihood method. Without the use of such methods,

the need for confident identification of migrant tracts

would make this approach difficult to apply to weakly

differentiated populations, but for more strongly differentiated

populations or hybridizing subspecies, this method

should be very useful in its current form.

To derive the tract length distributions, a number of

assumptions were needed. The most troublesome of

these, the lack of recombination among migrant tracts,

would be very difficult to relax in the current frame-

work. A full treatment of the problem would require

analysis of an ancestral recombination graph in a

subdivided population for whole-genome data. Another

simplifying assumption is made by ignoring the ends of

the chromosome. This assumption is much easier to

relax and can be done by considering the conditional

distribution in Equation 2. However, as this leads to a

considerably less tractable algebraic representation, and

since the current approximation performs very well for

realistic chromosome lengths, we have chosen not to

pursue this further.

The inference method described here may be appli-

cable in a number of biological contexts. As demon-

strated by our analysis of the M. musculus SNP data, the

migrant tract approach may be especially relevant in

testing hypotheses about historical trends of gene flow

across hybrid zones, perhaps shedding light on the

evolutionary trajectories of hybridizing taxa. The infer-

ences enabled by this method may also find particular

relevance in conservation: to test the effect of a new

barrier (such as a highway) on the dispersal of an

organism with a short generation time or to infer the

rate of migration over relatively recent timescales

(rather than over the past 4Ne generations) to guide

management strategies for species with fragmented

habitats. In this context it is important to note that

inferences are done at a timescale more relevant to

conservation genetics and that estimates of time in

number of generations are obtained directly and do not

rely on inferences of effective population sizes.

For optimal power, this method requires reasonably

dense, genomewide polymorphism data from moderate

to large sample sizes. It also requires information about

the genetic map position of each marker, which can be

estimated by genotyping related individuals such as

parent–offspring trios. In light of rapidly improving

TABLE 1

Parameter inference and hypothesis testing for house mouse data

Subspecies   No. tracts   P         m1         m2         T     Test 1     Test 2B
domesticus   75           0.01095   6.08E-05   0          202   P = 0.08   NA
musculus     60           0.00684   3.29E-05   1.00E-06   234   P = 0.03   NS
Combined     135          0.00876   4.71E-05   7.50E-07   206   P = 0.02   NS

Listed for each subspecies (and for the combined analysis) is the number of tracts >0.5 cM; the proportion of

the genome that included migrant tracts (P); parameter estimates for m1, m2, and T; and results of hypothesis

tests. NA, test 2B is not applicable when the estimate of m2 is zero; NS, P-values not approaching significance.


DNA sequencing technology, we are optimistic that the

inferences described here will be possible for both

model and nonmodel organisms in the near future.

This research was supported by a National Institutes of Health

(NIH) Kirschstein–National Research Service Award Postdoctoral

Fellowship (F32 HG004182) to J.E.P. and an NIH research grant

(UO1HL084706) to R.N.

LITERATURE CITED

Baines, J. F., and B. Harr, 2007 Reduced X-linked diversity in derived populations of house mice. Genetics 175: 1911–1921.

Becquet, C., and M. Przeworski, 2007 A new approach to estimate parameters of speciation models with application to apes. Genome Res. 17: 1505–1519.

Beerli, P., and J. Felsenstein, 2001 Maximum likelihood estimation of migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl. Acad. Sci. USA 98: 4563–4568.

Boursot, P., and K. Belkhir, 2006 Mouse SNPs for evolutionary biology: beware of ascertainment biases. Genome Res. 16: 1191–1192.

Boursot, P., J.-C. Auffray, J. Britton-Davidian and F. Bonhomme, 1993 The evolution of house mice. Annu. Rev. Ecol. Syst. 24: 119–152.

Cornuet, J. M., and G. Luikart, 1996 Description and power analysis of two tests for detecting population bottlenecks from allele frequency data. Genetics 144: 2001–2014.

Depaulis, F., S. Mousset and M. Veuille, 2003 Power of neutrality tests to detect bottlenecks and hitchhiking. J. Mol. Evol. 57: S190–S200.

Falush, D., M. Stephens and J. K. Pritchard, 2003 Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587.

Fisher, R. A., 1930 The Genetical Theory of Natural Selection. Clarendon Press, Oxford.

Good, J. M., M. D. Dean and M. W. Nachman, 2008 A complex genetic basis to X-linked hybrid male sterility between two species of house mice. Genetics 179: 2213–2228.

Harr, B., 2006 Genomic islands of differentiation between house mouse subspecies. Genome Res. 16: 730–737.

Hey, J., 2005 On the number of New World founders: a population genetic portrait of the peopling of the Americas. PLoS Biol. 3: e193.

Hoggart, C. J., M. D. Shriver, R. A. Kittles, D. G. Clayton and P. M. McKeigue, 2004 Design and analysis of admixture mapping studies. Am. J. Hum. Genet. 74: 965–978.

Hudson, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23: 183–201.

Jensen, J. D., Y. Kim, V. Bauer DuMont, C. F. Aquadro and C. D. Bustamante, 2005 Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics 170: 1401–1410.

Jensen-Seaman, M. I., T. S. Furey, B. A. Payseur, Y. Lu, K. M. Roskin et al., 2004 Comparative recombination rates in the rat, mouse, and human genomes. Genome Res. 14: 528–538.

Kingman, J. F. C., 1982a The coalescent. Stoch. Proc. Appl. 13: 235–248.

Kingman, J. F. C., 1982b On the genealogy of large populations. J. Appl. Probab. 19A: 27–43.

Kong, A., D. F. Gudbjartsson, J. Sainz, G. M. Jonsdottir, S. A. Gudjonsson et al., 2002 A high-resolution map of the human genome. Nature 31: 241–247.

Koopman, W. J., Y. Li, E. Coart, W. E. Van De Weg, B. Vosman et al., 2007 Linked vs. unlinked markers: multilocus microsatellite haplotype-sharing as a tool to estimate gene flow and introgression. Mol. Ecol. 16: 243–256.

Marth, G. T., E. Czabarka, J. Murvai and S. T. Sherry, 2004 The allele frequency spectrum in genomewide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351–372.

Montana, G., and J. K. Pritchard, 2004 Statistical tests for admixture mapping with case-control and cases-only data. Am. J. Hum. Genet. 75: 771–789.

Nielsen, R., and J. Wakeley, 2001 Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158: 885–896.

Patterson, N., N. Hattangadi, B. Lane, K. E. Lohmueller, D. A. Hafler et al., 2004 Methods for high-density admixture mapping of disease genes. Am. J. Hum. Genet. 74: 979–1000.

Pearse, D. E., and K. A. Crandall, 2004 Beyond FST: analysis of population genetic data for conservation. Conserv. Genet. 5: 585–602.

Rannala, B., and J. L. Mountain, 1997 Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94: 9197–9201.

Salcedo, T., A. Geraldes and M. W. Nachman, 2007 Nucleotide variation in wild and inbred mice. Genetics 177: 2277–2291.

Wall, J. D., P. Andolfatto and M. Przeworski, 2002 Testing models of selection and demography in Drosophila simulans. Genetics 162: 203–216.

Whitlock, M. C., and D. E. McCauley, 1999 Indirect measures of gene flow and migration: FST ≠ 1/(4Nm + 1). Heredity 82: 117–125.

Wright, S., 1931 Evolution in Mendelian populations. Genetics 16: 97–159.

Wright, S., 1952 The theoretical variance within and among subdivisions of a population that is in a steady state. Genetics 37: 312–323.

Communicating editor: N. Takahata
