# msBayes: pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation.

**ABSTRACT** Although testing for simultaneous divergence (vicariance) across different population-pairs that span the same barrier to gene flow is of central importance to evolutionary biology, researchers often equate the gene tree with the population/species tree, thereby ignoring stochastic coalescent variance in their conclusions of temporal incongruence. In contrast to other available phylogeographic software packages, msBayes is the only one that analyses data from multiple species/population pairs under a hierarchical model.

msBayes employs approximate Bayesian computation (ABC) under a hierarchical coalescent model to test for simultaneous divergence (TSD) in multiple co-distributed population-pairs. Simultaneous isolation is tested by estimating three hyper-parameters that characterize the degree of variability in divergence times across co-distributed population pairs while allowing for variation in various within population-pair demographic parameters (sub-parameters) that can affect the coalescent. msBayes is a software package consisting of several C and R programs that are run with a Perl "front-end".

The method reasonably distinguishes simultaneous isolation from temporal incongruence in the divergence of co-distributed population pairs, even with sparse sampling of individuals. Because the estimation step is decoupled from the simulation step, one can rapidly evaluate different ABC acceptance/rejection conditions and the choice of summary statistics. Given the complex and idiosyncratic nature of testing multi-species biogeographic hypotheses, we envision msBayes as a powerful and flexible tool for tackling a wide array of difficult research questions that use population genetic data from multiple co-distributed species. The msBayes pipeline is available for download at http://msbayes.sourceforge.net/ under an open source license (GNU Public License). The pipeline runs on Linux, Mac OS X, and most POSIX systems. Although the current implementation is for a single locus per species-pair, future implementations will allow analysis of multi-locus data per species pair.


Michael J Hickerson*¹, Eli Stahl² and Naoki Takebayashi³

Address: ¹Biology Department, Queens College, CUNY, 65-30 Kissena Blvd, Flushing, NY 11367-1597, USA, ²Department of Biology, University of Massachusetts Dartmouth, 285 Old Westport Rd, North Dartmouth, MA 02747, USA and ³Institute of Arctic Biology and Department of Biology and Wildlife, 311 Irving I Bldg, University of Alaska, Fairbanks, AK 99775, USA

Email: Michael J Hickerson* - michael.hickerson@qc.cuny.edu; Eli Stahl - estahl@umassd.edu; Naoki Takebayashi - ffnt@uaf.edu

* Corresponding author


Published: 26 July 2007

BMC Bioinformatics 2007, 8:268 doi:10.1186/1471-2105-8-268

Received: 6 April 2007

Accepted: 26 July 2007

This article is available from: http://www.biomedcentral.com/1471-2105/8/268

© 2007 Hickerson et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


## Background

Testing for simultaneous divergence (vicariance) across different population-pairs that span the same historical barrier to gene flow is of central importance to evolutionary biology, biogeography and community ecology [1-4]. Such inferences inform processes underlying speciation, community composition, range delineation, and the ecological consequences of climatic changes. Estimating a population divergence time with an appropriate statistical model [5] can be accomplished in a variety of ways [6-8], yet analyzing comparative phylogeographic data with multiple co-occurring species pairs that vary with respect to demographic parameters and pairwise coalescent times is less straightforward.

Instead of conducting an independent analysis on every population-pair and testing the hypothesis of temporal concordance based on this set of independent parameter estimates of divergence time, the hierarchical model employed by msBayes follows the suggestion of [9] by concurrently estimating three hyper-parameters that characterize the mean, variability and number of different divergence events across a set of population-pairs. The model employed in msBayes allows estimation of these hyper-parameters across a multi-species data set while explicitly incorporating uncertainty and variation in the sub-parameters that independently describe each population-pair's demographic history (divergence time; current, ancestral and founding effective population sizes), post-divergence migration rate and recombination rate. The msBayes software pipeline is based on the approximate Bayesian computation (ABC) method introduced in [10] for sampling from the hyper-posterior distribution to test for simultaneous divergence. We review the important features here. Although the current implementation is for a single locus per species-pair, future implementations will allow analysis of multi-locus data per species/population pair.

In contrast to previous ABC-like models [11-15], our TSD is accomplished by implementing a hierarchical Bayesian model in which the sub-parameters (Φ; within population-pair parameters) are conditional on "hyper-parameters" (ϕ) that describe the variability of Φ among the Y population-pairs. For example, divergence times (Φ) can vary across a set of population pairs conditional on the set of hyper-parameters (ϕ) that varies according to their hyper-prior distribution. Instead of explicitly calculating the likelihood expression P(Data | ϕ,Φ) to get a posterior distribution, we sample from the posterior distribution P((ϕ,Φ) | Data) by simulating the data K times under the coalescent model using candidate parameters drawn from the prior distribution P(ϕ,Φ). A summary statistic vector D for each simulated dataset is then compared to the observed summary statistic vector in order to generate random observations from the joint posterior distribution f(ϕi,Φi|Di) by way of a rejection/acceptance algorithm [16] followed by an optional weighted local regression step [15]. Loosely speaking, hyper-parameter values are accepted and used to construct the posterior distribution with probabilities proportional to the similarity between the summary statistic vector from the observed data and the summary statistic vector calculated from simulated data.
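The rejection/acceptance idea described above can be illustrated with a minimal Python sketch. It is not the msBayes implementation: the toy model (normal data with an unknown mean), the single summary statistic, and the names `abc_rejection`, `draw_prior` and `simulate` are all assumptions made for illustration. The sketch keeps a fixed fraction of the prior draws whose simulated statistic lies closest to the observed one.

```python
import random
import statistics

def abc_rejection(observed_stat, simulate, draw_prior, n_draws, tolerance_fraction, rng):
    """Keep the prior draws whose simulated statistic is closest to the observed one."""
    draws = []
    for _ in range(n_draws):
        theta = draw_prior(rng)              # candidate parameter drawn from the prior
        stat = simulate(theta, rng)          # summary statistic of one simulated data set
        draws.append((abs(stat - observed_stat), theta))
    draws.sort(key=lambda pair: pair[0])     # smallest distance first
    n_keep = max(1, int(n_draws * tolerance_fraction))
    return [theta for _, theta in draws[:n_keep]]

# Hypothetical toy model: 20 normal deviates with unknown mean theta;
# the summary statistic is their sample mean.
def draw_prior(rng):
    return rng.uniform(0.0, 10.0)

def simulate(theta, rng):
    return statistics.fmean(rng.gauss(theta, 1.0) for _ in range(20))

rng = random.Random(1)
posterior_sample = abc_rejection(3.0, simulate, draw_prior, 20000, 0.01, rng)
posterior_mean = statistics.fmean(posterior_sample)
```

In a hierarchical setting the accepted values would be hyper-parameter draws rather than a single mean, and the distance would be taken over a vector of summary statistics instead of one number.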

The hierarchical model consists of ancestral populations that split at divergence times TY = {τ1, ..., τY} in the past. The hyper-parameter set, ϕ, quantifies the degree of variability in these Y divergence times across the Y ancestral populations and their Y descendent population pairs: (1) Ψ, the number of possible divergence times (1 ≤ Ψ ≤ Y); (2) E(τ), the mean divergence time; and (3) Ω, the ratio of variance to the mean in these Y divergence times, Var(τ)/E(τ). The sub-parameters for the i-th population-pair (Φi) are allowed to vary independently across Y population pairs and include divergence time (τi), current population sizes, ancestral population sizes, post-divergence founding population sizes, durations of post-divergence population growth, recombination rates, and post-divergence migration rates. The multiple population-pair splitting model is depicted in Figure 1. Each divergence time parameter τ is scaled by 2NAVE generations, where NAVE is the parametric expectation of N (effective population size) across Y population pairs given the prior distribution.
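As a small worked example of the three hyper-parameters, the following Python sketch computes Ψ, E(τ) and Ω from a list of per-pair divergence times. The function name `hyper_parameters` is hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption; the text does not say which variance estimator is used.

```python
import statistics

def hyper_parameters(taus):
    """Return (psi, mean_tau, omega) for a list of per-pair divergence times."""
    psi = len(set(taus))                      # number of distinct divergence times
    mean_tau = statistics.fmean(taus)         # E(tau)
    # Assumption: sample variance (n - 1 denominator); 0 for a single pair.
    var_tau = statistics.variance(taus) if len(taus) > 1 else 0.0
    omega = var_tau / mean_tau                # Var(tau) / E(tau)
    return psi, mean_tau, omega

simultaneous = [1.8] * 10                     # one shared divergence time
two_events = [1.0] * 2 + [2.0] * 8            # two divergence events

psi_1, mean_1, omega_1 = hyper_parameters(simultaneous)
psi_2, mean_2, omega_2 = hyper_parameters(two_events)
```

With ten pairs that all split at τ = 1.8, Ω is zero. With two pairs at τ = 1.0 and eight at τ = 2.0, the mean is again 1.8 and Ω comes out near 0.099 under the sample variance (near 0.089 under the population variance), so either convention rounds to 0.1.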

The summary statistic vector D employed in msBayes currently consists of up to six summary statistics collected from each of the Y population pairs (π, θW, Var(π - θW), πnet, πb, and πw). These include π, the average number of pairwise differences among all sequences within each population pair; θW, the number of segregating sites within each population pair normalized for sample size [17]; Var(π - θW) in each population pair; and πnet, Nei and Li's net nucleotide divergence between each pair of populations [18]. This last summary statistic is the difference (πb - πw), where πb is the average number of pairwise differences between the two populations of a pair and πw is the average number of pairwise differences within a sister pair of descendent populations. The default setting includes the first four aforementioned summary statistics because they were found to be a least correlated subset of a larger group [19]; however, future versions of msBayes will allow users to choose other summary statistics.
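To make the definitions concrete, here is a hedged Python sketch that computes π, θW, πb, πw and πnet for one population pair from aligned sequences. It is an illustration, not the msBayes C code. Values are per sequence rather than per site, πw is taken as the unweighted mean of the two within-population values (an assumption), a population with a single sequence contributes 0 instead of an undefined value, and Var(π - θW) is omitted.

```python
from itertools import combinations, product

def n_diff(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_or_zero(values):
    """Arithmetic mean, or 0.0 when there are no values (e.g. a single sequence)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def summary_stats(pop_a, pop_b):
    """pi, theta_W, pi_b, pi_w and pi_net for one population pair."""
    pooled = pop_a + pop_b
    pi = mean_or_zero(n_diff(s, t) for s, t in combinations(pooled, 2))
    seg_sites = sum(len(set(column)) > 1 for column in zip(*pooled))
    harmonic = sum(1.0 / i for i in range(1, len(pooled)))   # a_n = sum_{i=1}^{n-1} 1/i
    theta_w = seg_sites / harmonic
    pi_b = mean_or_zero(n_diff(s, t) for s, t in product(pop_a, pop_b))
    within_a = mean_or_zero(n_diff(s, t) for s, t in combinations(pop_a, 2))
    within_b = mean_or_zero(n_diff(s, t) for s, t in combinations(pop_b, 2))
    pi_w = (within_a + within_b) / 2.0       # assumption: unweighted mean of both populations
    return {"pi": pi, "theta_W": theta_w, "pi_b": pi_b, "pi_w": pi_w, "pi_net": pi_b - pi_w}

pop_a = ["AAAA", "AAAT"]                     # toy alignment, population a
pop_b = ["GGAA", "GGAT"]                     # toy alignment, population b
stats = summary_stats(pop_a, pop_b)
```

For this toy alignment there are three segregating sites among four sequences, so θW = 3 / (1 + 1/2 + 1/3), while πb = 2.5 and πw = 1.0 give πnet = 1.5.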

An extensive simulation study was conducted in [10] to evaluate the performance of our hierarchical ABC model. Because comparative phylogeographic studies are often conducted on multi-species data sets that include rare taxa from which it is difficult to obtain samples from many individuals, we extend the previous evaluation to explore the effectiveness of msBayes in conducting a TSD given small sample sizes (≤ 5 individuals per population pair).

## Implementation

After preparation of a sample size file and the input files from DNA sequence data, running msBayes is a three-step process that includes: (A) calculating the observed summary statistic vector from the DNA sequence input files and the sample size file; (B) running coalescent simulations of the DNA sequence data using parameters drawn from the hyper-prior (ϕ) and prior (Φ); and (C) sampling from the posterior distribution and obtaining posterior estimates of Ψ, E(τ), and Ω across the Y population pairs (Figure 2).

Step A is accomplished by a command-line Perl program (obsSumStats.pl) which uses two C programs to calculate the observed summary statistic vector file. Specifically, the user runs obsSumStats.pl after collecting separate aligned DNA sequence files from each population-pair in FASTA format, and constructing an additional text file that describes the sample sizes, lengths of genes and transition/transversion rate ratios.
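Before running Step A it can help to check that each FASTA file really contains aligned sequences of equal length. The Python sketch below is a generic FASTA reader written for illustration; it is not part of msBayes, and the function names `read_fasta` and `check_alignment` as well as the example headers are hypothetical. The format of the additional sample size file is not specified here, so it is not modeled.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                          # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []     # start a new record
        else:
            chunks.append(line.upper())       # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def check_alignment(records):
    """Return the common sequence length, or raise if lengths differ."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return lengths.pop()

example = """>popA_ind1
ACGTAC
GTTA
>popB_ind1
ACGTACGTTG
""".splitlines()
records = read_fasta(example)
aligned_length = check_alignment(records)
```

To read from disk, the same function can be given an open file handle, since it only iterates over lines.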

Step B iteratively simulates data sets under the hierarchical model by: (1) randomly drawing hyper-parameters and sub-parameters from the hyper-prior and sub-prior distributions; (2) using these hyper-parameters and sub-parameters to simulate finite-sites DNA sequence data from Y population-pairs; and (3) calculating a summary statistic vector D from the simulated data set of Y population-pairs. This is accomplished with several C programs that are run with a Perl "front-end" (msbayes.pl) that either prompts the user for the upper bounds of various priors and the number of iterations to simulate or, alternatively, uses a batch configuration file with equivalent information. The first C program draws hyper-parameters and sub-parameters from their hyper-prior and sub-prior distributions. These parameters are then fed into several C programs that simulate finite-sites DNA sequence data using msarbpopQH, a modified version of Hudson's classic coalescent simulator ms [20], which incorporates finite-sites mutation and arbitrary population structure and dynamics. Another set of C programs calculates a summary statistic vector (D) for every simulated data set of Y population pairs.
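The first part of Step B, drawing divergence times under the hierarchical prior, can be sketched in Python as follows. This is an illustrative reading of the model, not the msBayes C code: the discrete uniform prior on Ψ, the uniform prior on event times up to a user-supplied bound `tau_max`, and the rule for assigning pairs to events are all assumptions, and `draw_divergence_times` is a hypothetical name.

```python
import random

def draw_divergence_times(n_pairs, tau_max, rng):
    """Draw Psi, then Psi event times, then assign one event time to each pair."""
    psi = rng.randint(1, n_pairs)                     # number of divergence events, 1..Y
    event_times = [rng.uniform(0.0, tau_max) for _ in range(psi)]
    # Every event is used at least once; the remaining pairs pick an event at random.
    taus = event_times + [rng.choice(event_times) for _ in range(n_pairs - psi)]
    rng.shuffle(taus)                                 # pair order carries no meaning
    return psi, taus

rng = random.Random(42)
psi, taus = draw_divergence_times(10, 10.0, rng)
```

Each returned list of Y divergence times would then be passed, together with the other sub-parameters, to the coalescent simulator for one iteration.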

Figure 2. Flowchart describing operation of msBayes. Data preparation: prepare the sample size file and a FASTA file for each population-pair. Step A: run obsSumStats.pl to obtain the observed summary statistic vector. Step B: run msbayes.pl to simulate the prior (K simulations from the hyper-prior). Step C: run acceptRej.pl to sample from the posterior (determined by the number of accepted draws from the prior distribution of K draws).

Figure 1. Depiction of the multiple population-pair divergence model used for the ABC estimates of Ψ, E(τ), and Ω. (A): The white lines depict a gene tree with TMRCA being the time to the gene sample's most recent common ancestor, and the black tree containing the gene tree is the population/species tree. (B): Parameters in the multiple population-pair divergence model. The population mutation parameter, θ, is 2Nµ where 2N is the summed haploid effective female population size of each pair of daughter populations (µ is the per gene per generation mutation rate). The time since isolation of each population pair is denoted by τ (in units of 2NAVE generations, where NAVE is the parametric expectation of N across Y population pairs given the prior distribution). Population mutation parameters for daughter populations a and b are θa and θb, whereas θ'a and θ'b are the population mutation parameters for the sizes of daughter populations a and b at the time of divergence until τ' (length of bottleneck). The daughter populations θ'a and θ'b then grow exponentially to sizes θa and θb. The population mutation parameter for each ancestral population is depicted as θA. The migration rate between each pair of daughter populations is depicted as M (number of effective migrants per generation). (C): Example of four population-pairs (Y = 4, with four divergence times) where parameters in (B) are drawn from uniform priors.


Step C is accomplished by our command-line user-interface program (acceptRej.pl). This Perl program internally uses R for the calculation. The algorithm is a simple extension of the original R scripts kindly provided by M. Beaumont [15]. This step does the rejection/acceptance sampling and local regression to produce the approximate sample of the posterior distribution. This third step uses the output of step B as the input and produces an output file that contains multiple graphical depictions of the posterior distributions and a text output file with various summaries of the posterior distributions (estimates of Ψ, E(τ), and Ω across the Y population pairs). The user can choose which summary statistics to include within D (the summary statistic vector), choose the proportion of accepted draws from the prior, and optionally choose to perform simple rejection sampling without the additional local regression step.
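The local regression step can be sketched for the simplest case of one parameter and one summary statistic. The Python code below is an illustration in the spirit of the regression adjustment of [15], not the R code shipped with msBayes: accepted draws are weighted with an Epanechnikov kernel, a weighted least-squares line is fitted, and each draw is shifted along that line to the observed statistic. The function name and the toy numbers are assumptions.

```python
def regression_adjust(params, stats, observed_stat, bandwidth):
    """Shift accepted parameter draws to the observed statistic along a weighted fit."""
    # Epanechnikov kernel: 1 - (d / bandwidth)^2 inside the window, 0 outside.
    weights = [max(0.0, 1.0 - ((s - observed_stat) / bandwidth) ** 2) for s in stats]
    total = sum(weights)
    mean_s = sum(w * s for w, s in zip(weights, stats)) / total
    mean_p = sum(w * p for w, p in zip(weights, params)) / total
    cov = sum(w * (s - mean_s) * (p - mean_p) for w, s, p in zip(weights, stats, params))
    var = sum(w * (s - mean_s) ** 2 for w, s in zip(weights, stats))
    slope = cov / var                                  # weighted least-squares slope
    adjusted = [p - slope * (s - observed_stat) for p, s in zip(params, stats)]
    return adjusted, weights

stats = [0.8, 0.9, 1.0, 1.1, 1.2]                      # simulated summary statistics
params = [2.0 * s + 1.0 for s in stats]                # exactly linear toy relation
adjusted, weights = regression_adjust(params, stats, observed_stat=1.0, bandwidth=0.5)
```

Because the toy relation is exactly linear, every adjusted value collapses to the value of the line at the observed statistic. With a vector of summary statistics the same idea requires a multivariate weighted regression, and the weights are also used when summarizing the adjusted sample.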

We distribute msBayes as C source code and pre-compiled binaries that run on Linux or Mac OS X operating systems. The msBayes package also includes the R functions and Perl scripts, as well as installation/running instructions.

## Results and Discussion

### Performance of estimator with small sample sizes

At the present time, there are no other available coalescent-based tools for analyzing multiple population pairs simultaneously to yield hyper-parameter estimates. Although IM and IMa are most similar to msBayes [8,21] because they estimate divergence times and population sizes from single pairs of populations under a coalescent model, these do not employ a hierarchical model and therefore can only do so one pair at a time. The program MCMCcoal can estimate divergence times of a known phylogeny under a coalescent model, but can only use the separate divergence time estimates to test for phylogeographic congruence [7]. The program BEST [6] infers a species phylogeny and demographic parameters (e.g. divergence times and population sizes) using a population coalescent model, but likewise can only use the individual divergence time estimates to test for phylogeographic congruence across a multi-species dataset. On the other hand, the hierarchical model employed in msBayes not only can estimate hyper-parameters but also comes with the benefit of additional information gained from "borrowing strength" across datasets [22-24]. In this case, the resulting emergent multi-species hyper-estimates use more of the information than the sum of their parts (within species-pair estimates).

Although the hierarchical ABC model employed in msBayes was extensively evaluated in [10], the behavior of the ABC estimator given minimal sampling of individuals was not examined. Because comparative phylogeographic studies are often conducted on multi-species data sets that include rare taxa from which it is difficult to obtain samples from many individuals, we evaluate how low sample sizes can affect inference. To this end, we explored the performance in scenarios where ≤ 5 individuals per population pair were sampled from each of 10 population pairs. We created 1,000 simulated data sets under each of two different histories: (1) simultaneous divergence history and (2) variable divergence history among population pairs. In the simultaneous divergence history (true Ω = Var(τ)/E(τ) = 0), all ten population pairs arose from ancestral populations at τ = 1.8 before the present. In the variable divergence history (true Ω = 0.1), two population pairs arose at τ = 1.0 and eight population pairs arose at τ = 2.0 before the present. We simulated these two histories with small sample sizes (2–5 individuals per population pair) and with larger sample sizes (20 individuals per population pair; 10 per descendent population). The simulated data sets consisted of haploid mtDNA samples from ten population pairs that were 550–600 base pairs in length. From each of the four sets of 1,000 simulated data sets, we used msBayes to obtain 1,000 ABC estimates of the hyper-parameter, Ω, with the goal of assessing the effects of sample sizes on the root mean square error (RMSE) of the ABC Ω estimator. Each estimate of Ω was obtained from the mode of 1,000 accepted draws (after the local regression step) from 500,000 random draws from the hyper-prior, as these conditions were found to be optimal in [10]. For the larger sample sizes we use four classes of summary statistics (π, θW, Var(π - θW) and πnet), while for the smaller sample sizes we only use πb to avoid null or NaN (not a number) values that are yielded when only one sample is collected from a descendent population.
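The two quantities used in this evaluation, a point estimate taken as the mode of the accepted draws and the RMSE of many such estimates around a known true value, can be sketched in Python. The histogram-based mode below is a crude stand-in chosen for brevity (the text does not say how the mode is computed), and the function names and numbers are illustrative assumptions.

```python
import math

def rmse(estimates, true_value):
    """Root mean square error of estimates around the known true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))

def histogram_mode(values, bin_width):
    """Midpoint of the most populated bin; ties go to the lowest bin."""
    counts = {}
    for v in values:
        index = math.floor(v / bin_width)
        counts[index] = counts.get(index, 0) + 1
    best = max(counts, key=lambda index: (counts[index], -index))
    return (best + 0.5) * bin_width

accepted_draws = [0.01, 0.02, 0.03, 0.26, 0.51]       # made-up accepted Omega draws
point_estimate = histogram_mode(accepted_draws, 0.05)

estimates = [0.0, 0.1, 0.2, 0.1]                       # made-up estimates, true Omega = 0.1
error = rmse(estimates, 0.1)
```

In the study design described above, `histogram_mode` would be applied once per simulated data set and `rmse` once per set of 1,000 resulting estimates.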

The simulation analysis demonstrated that msBayes can usually distinguish simultaneous divergence from temporal incongruence in divergence, even with sparse sampling of individuals. The estimates of Ω were not markedly improved by sampling 20 individuals per population pair (10 from each population) when compared to sampling 2–5 individuals per population pair (1–3 from each population; Figure 3). However, Ω is overestimated under both sample sizes, and this upward bias is stronger with larger sample sizes when true Ω = 0.1. Therefore, simultaneous divergence is easier to correctly reject with larger sample sizes. Root mean square error (RMSE) for estimating Ω was < 0.12 when the true history was simultaneous divergence (Ω = 0), and RMSE was < 0.18 when the true history involved 2 different divergence events across 10 population pairs (Ω = 0.1). It is encouraging that one can obtain fair estimates with so few samples per population pair and that two samples per population pair can be analyzed by msBayes.

An attractive benefit of an ABC method such as msBayes is that one can perform this estimator evaluation relatively quickly. Simulating data from parameters drawn from the prior is only done once per set of conditions (sample size/history) and can be done in approximately 5 hours per population pair on a 2 GHz Linux computer. The computational time can be further reduced because the simulations can be run in parallel on multiple processors. Because the acceptance/rejection step is decoupled from simulating the prior, multiple estimates from a series of simulated datasets can be accomplished without re-simulating the prior each time. The acceptance/rejection step for a single estimate is accomplished in one second to well under a minute, such that 1,000 estimates can be obtained from 1,000 data sets simulated from fixed known parameter values in under an hour to within 24 hours on a single processor.
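The quoted range for 1,000 estimates follows from simple arithmetic, sketched here in Python. The per-estimate times of 1 and 60 seconds are the two ends of the interval given in the text; the function name `total_hours` is hypothetical.

```python
def total_hours(n_estimates, seconds_per_estimate):
    """Wall-clock hours for sequential estimates on a single processor."""
    return n_estimates * seconds_per_estimate / 3600.0

fast = total_hours(1000, 1)      # one second per estimate
slow = total_hours(1000, 60)     # one minute per estimate
```

At one second per estimate the total is roughly 0.28 hours (about 17 minutes), and at one minute per estimate it is roughly 16.7 hours, which brackets the "under an hour to within 24 hours" range.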

### General use and future development

The most important aspect of msBayes is that its flexible and modular nature will allow us and others to add new features. This characteristic is essential for a phylogeographic software tool because phylogeographic studies are highly idiosyncratic. Using population genetic data to test how climate and/or geological changes result in biogeographic shifts, speciation, extinction and consequent changes in ecological interactions can involve a wide array of hypotheses and models that conform to no generality with regard to model complexity, parameterization and sampling. We therefore anticipate making several extensions to msBayes, and will encourage other bioinformaticians to make versions that suit particularly difficult research questions. Furthermore, phylogeographic studies are most powerful when combined with independent evidence (or hypotheses) about past habitat distributions that are generated from other types of historic data and ecological distribution models [25]. Particular historical hypotheses can then be directly parameterized by paleo-distribution models and tested with genetic data within the msBayes framework using Bayes factors [26].
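One common way to approximate a Bayes factor from ABC output is to divide the posterior odds of a hypothesis, estimated from the accepted draws, by its prior odds. The Python sketch below shows this for the hypothesis Ψ = 1 (a single divergence event); it is an assumption about how such a test could be set up, not a description of msBayes output, and all numbers are invented for illustration.

```python
def bayes_factor(posterior_draws, prior_draws, in_hypothesis):
    """Posterior odds divided by prior odds for the event defined by in_hypothesis."""
    def odds(draws):
        hits = sum(1 for d in draws if in_hypothesis(d))
        return hits / (len(draws) - hits)      # fails if every draw is a hit
    return odds(posterior_draws) / odds(prior_draws)

prior_psi = list(range(1, 11)) * 100                      # Psi uniform on 1..10
posterior_psi = [1] * 600 + [2] * 300 + [3] * 100         # made-up accepted draws
bf = bayes_factor(posterior_psi, prior_psi, lambda psi: psi == 1)
```

Here the prior odds of Ψ = 1 are 1 to 9 and the posterior odds are 600 to 400, so the Bayes factor in favor of a single divergence event is 13.5.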

One feature we plan to include in future versions of msBayes is an option to simulate from the prior after constraining the number of divergence events per Y population pairs to one fixed number. This will then allow estimation of when these different isolation events took place, as well as of which population pairs originated at each of the particular divergence events. Other upcoming features include: (1) multiple loci per population pair, by expanding the summary statistic vector and adding hyper-parameters controlling mutation rate variation across loci; (2) more available summary statistics; (3) analysis of only one population pair at a time; (4) testing of multi-species colonization hypotheses; (5) models with three or more populations (as opposed to two-population models); (6) microsatellite data; and (7) automation of the simulation testing procedure used to obtain estimator bias.

Migration could hinder the ability of this method to correctly infer simultaneous divergence. Moderate migration in a subset of species/population pairs could cause the method to incorrectly support temporal discordance in divergence when the true history involved temporal congruence, because migration can erase the genetic signal of isolation [27,28]. Although the Bayesian support for temporal concordance in divergence times would likely weaken if this happens in a subset of species/population pairs, we will explore using the summary statistic Var(π) as a means to tease apart migration from isolation, as suggested in [29,30].

BMC Bioinformatics 2007, 8:268 http://www.biomedcentral.com/1471-2105/8/268

Figure 3

Performance of estimator. Panels A through D each depict frequency histograms of 1,000 Ω estimates given 1,000 datasets simulated under either of two constrained histories. The simulated histories in panels A and C involve simultaneous divergence across ten population pairs (Ω = 0.0; all τ = 1.8), whereas panels B and D are from histories involving two different divergence events across the ten population pairs (Ω = 0.1; two splitting at τ = 1.0 and eight splitting at τ = 2.0). Panels A and B use small sample sizes (≤ 5 individuals per population pair), whereas panels C and D use samples of 10 individuals per population pair. The actual sample sizes used for panels A and B are species pair 1: 1, 2; pair 2: 3, 2; pair 3: 1, 1; pair 4: 2, 2; pair 5: 2, 3; pair 6: 2, 1; pair 7: 1, 1; pair 8: 1, 3; pair 9: 3, 1; pair 10: 2, 1.

[Figure 3 image not reproduced. Four histogram panels (A to D); x-axis: estimates of Ω (0.0 to 2.0); y-axis: frequency; panel titles give the true Ω (0.0 or 0.1). Values printed on the panels: RMSE = 0.12, 0.11, 0.16, 0.18; mean estimates = 0.04, 0.17, 0.04, 0.21; median estimates = 0.01, 0.14, 0.00, 0.18. The assignment of these values to panels is not recoverable from the extracted text.]
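The Ω values quoted in the Figure 3 caption can be checked with a few lines of arithmetic. The sketch below assumes that Ω is the dispersion index of divergence times, Var(τ)/E(τ), and that the sample (n - 1) variance is used; both are assumptions of this example rather than statements about the msBayes source code, and `dispersion_index` is a hypothetical helper name.

```python
from statistics import mean, variance

def dispersion_index(taus):
    """Return Var(tau) / E(tau), using the sample (n - 1) variance."""
    return variance(taus) / mean(taus)

# History of panels A and C: ten pairs diverging simultaneously at tau = 1.8.
simultaneous = [1.8] * 10
# History of panels B and D: two pairs at tau = 1.0 and eight pairs at tau = 2.0.
two_events = [1.0] * 2 + [2.0] * 8

omega_simultaneous = dispersion_index(simultaneous)
omega_two_events = dispersion_index(two_events)
print(round(omega_simultaneous, 2), round(omega_two_events, 2))
```

Both histories have a mean divergence time of 1.8, so they differ only in their dispersion. Under the population (n) variance the second history would give roughly 0.089 instead of roughly 0.099; either value rounds to the Ω = 0.1 stated in the caption.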

Conclusion

The msBayes software pipeline will become an increasingly important tool as the field of comparative phylogeography matures into a more rigorous and statistical enterprise [5]. The program obtains hyper-parameter estimates under hierarchical models in a reasonable amount of time and avoids the convergence and mixing problems associated with Markov chain Monte Carlo (MCMC) methods. Because the estimation step is decoupled from the simulation step, one can quickly evaluate different ABC acceptance/rejection conditions and choices of summary statistics. The method can reasonably distinguish biogeographic congruence from temporal incongruence, even with sparse sampling of individuals. Given the complex and idiosyncratic nature of testing multi-species biogeographic hypotheses, we envision msBayes as a powerful and flexible tool that is open to modification when faced with particularly difficult research questions. Finally, owing to its flexible and modular design, msBayes will be well suited to the heterogeneous data sets that are emerging and being combined to test complex historical hypotheses.
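The practical benefit of decoupling simulation from estimation can be illustrated with a toy rejection sampler. The sketch below is not the msBayes implementation: the uniform prior, the Gaussian noise standing in for a coalescent simulation, and the function names `simulate_reference_table` and `abc_reject` are all assumptions made for illustration. The point is only that the expensive simulation step runs once, after which the acceptance step can be repeated cheaply under different tolerances.

```python
import random

def simulate_reference_table(n_draws, seed=1):
    """Simulation step: draw a parameter from the prior and simulate a summary statistic."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_draws):
        omega = rng.uniform(0.0, 2.0)          # toy prior on a hyper-parameter
        summary = omega + rng.gauss(0.0, 0.2)  # toy stand-in for a coalescent simulation
        table.append((omega, summary))
    return table

def abc_reject(table, observed_summary, tolerance):
    """Estimation step: keep the proportion `tolerance` of draws closest to the data."""
    ranked = sorted(table, key=lambda row: abs(row[1] - observed_summary))
    n_keep = max(1, round(len(ranked) * tolerance))
    return [omega for omega, _ in ranked[:n_keep]]

# The table is simulated once and reused for every acceptance condition.
table = simulate_reference_table(20000)
for tol in (0.01, 0.001):
    accepted = abc_reject(table, observed_summary=0.5, tolerance=tol)
    print(tol, len(accepted), round(sum(accepted) / len(accepted), 2))
```

Swapping in a different distance function or a different subset of summary statistics would likewise require only a change to `abc_reject`, not a new round of simulations.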

Availability and requirements

The installation instructions, documentation, source code and precompiled binary for msBayes are all available for download at http://msbayes.sourceforge.net/ under an open source license (GNU Public License). The msBayes pipeline comprises several C and R programs that are run with a Perl "front-end"; it runs on Linux, Mac OS X, and most POSIX systems.

List of Abbreviations used

ABC: Approximate Bayesian Computation

TSD: Test of simultaneous divergence

mtDNA: Mitochondrial DNA

Authors' contributions

MJH developed the idea of using ABC within a hierarchical model to analyze multiple population pairs simultaneously. ES developed the finite sites version of D. Hudson's classic coalescent simulator (ms). MJH and NT developed C, R, and Perl routines and modified pre-existing R and C routines to build an ABC algorithm that makes use of a hierarchical model. NT extensively developed the C and Perl routines that make up the user version of msBayes now available. NT and MJH maintain the msBayes website, and NT developed the installation configurations and precompiled binaries. All authors read and approved the final version of the manuscript.

Acknowledgements

We thank A. Lancaster and three anonymous reviewers for making helpful suggestions and E. Andersson for handling the manuscript. We thank M. Beaumont for kindly providing R scripts and critically useful discussions. We thank D. Hudson for permission to use E. Stahl's finite sites version of his ms coalescent simulator under GNU Public License. We thank J. McGuire and C. Moritz for use of the Linux parallel computing cluster housed at the Museum of Vertebrate Zoology (University of California, Berkeley). We thank MBI (Mathematical Biosciences Institute) for hosting the workshop on Phylogenetics and Phylogeography. Support for M. J. Hickerson was provided by an NSF post-doctoral fellowship in interdisciplinary informatics. N. Takebayashi was supported by NIH Grant Number 2P20RR16466 from the INBRE program of the National Center for Research Resources and NSF DEB-0640520. E. Stahl was supported by Sloan/DOE Fellowship award DE-FG02-00ER62993.

References

1. Avise JC: Phylogeography: The history and formation of species. Cambridge: Harvard University Press; 2000.
2. Wen J: Origin and the evolution of the eastern North American distinct distributions of flowering plants. Annu Rev Ecol Syst 1999, 30:421-455.
3. Lessios HA: The first stage of speciation as seen in organisms separated by the Isthmus of Panama. In Endless Forms: species and speciation. Edited by: Howard DJ, Berlocher S. Oxford, U.K.: Oxford University Press; 1998:186-201.
4. Barraclough TG, Nee S: Phylogenetics and speciation. Trends Ecol Evol 2001, 16(7):391-399.
5. Knowles LL, Maddison WP: Statistical phylogeography. Mol Ecol 2002, 11(12):2623-2635.
6. Edwards SV, Liu L, Pearl DK: High-resolution species trees without concatenation. Proc Natl Acad Sci USA 2007, 104:5936-5941.
7. Rannala B, Yang ZH: Bayes Estimation of Species Divergence Times and Ancestral Population Sizes Using DNA Sequences From Multiple Loci. Genetics 2003, 164:1645-1656.
8. Hey J, Nielsen R: Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc Natl Acad Sci USA 2007, 104:2785-2790.
9. Edwards SV, Beerli P: Perspective: Gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution 2000, 54(6):1839-1854.
10. Hickerson MJ, Stahl E, Lessios HA: Test for simultaneous divergence using approximate Bayesian computation. Evolution 2006, 60:2435-2453.
11. Thornton K, Andolfatto P: Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics 2005, 172:1607-1619.
12. Excoffier L, Estoup A, Cornuet J-M: Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 2005, 169:1727-1738.
13. Estoup A, Beaumont MA, Sennedot F, Moritz C, Cornuet J-M: Genetic analysis of complex demographic scenarios: spatially expanding populations of the cane toad, Bufo marinus. Evolution 2004, 58:2021-2036.
14. Tallmon DA, Luikart G, Beaumont MA: Comparative evaluation of a new effective population size estimator based on approximate Bayesian computation. Genetics 2004, 167:977-988.
15. Beaumont MA, Zhang W, Balding DJ: Approximate Bayesian computation in population genetics. Genetics 2002, 162:2025-2035.
16. Weiss G, von Haeseler A: Inference of population history using a likelihood approach. Genetics 1998, 149:1539-1546.
17. Watterson GA: On the number of segregating sites in genetic models without recombination. Theor Popul Biol 1975, 7:256-276.
18. Nei M, Li W: Mathematical model for studying variation in terms of restriction endonucleases. Proc Natl Acad Sci USA 1979, 76:5269-5273.
19. Hickerson MJ, Dolman G, Moritz C: Comparative phylogeographic summary statistics for testing simultaneous vicariance across taxon-pairs. Mol Ecol 2006, 15:209-224.


20. Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18:337-338.
21. Hey J, Nielsen R: Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics 2004, 167:747-760.
22. Beaumont MA, Rannala B: The Bayesian revolution in genetics. Nat Rev Genet 2004, 5:251-261.
23. James W, Stein C: Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability: 1960. Berkeley, CA: University of California Press; 1960.
24. Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. London, UK: Chapman and Hall/CRC; 1995.
25. Hugall A, Moritz C, Moussalli A, Stanisic J: Reconciling paleodistribution models and comparative phylogeography in the Wet Tropics rainforest land snail Gnarosophia bellendenkerensis (Brazier 1875). Proc Natl Acad Sci USA 2002, 99(9):6112-6117.
26. Kass RE, Raftery A: Bayes factors. Journal of the American Statistical Association 1995, 90:773-795.
27. Kalinowski ST: Evolutionary and statistical properties of three genetic distances. Mol Ecol 2002, 11(8):1263-1273.
28. Slatkin M: Gene flow in natural populations. Ann Rev Ecol Syst 1985, 16:393-430.
29. Wakeley J: The variance of pairwise nucleotide differences in two populations with migration. Theor Popul Biol 1996, 49:39-57.
30. Wakeley J: Distinguishing migration from isolation using the variance of pairwise differences. Theor Popul Biol 1996, 49:369-386.