msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation

Article in BMC Bioinformatics 8(1):268 · February 2007
DOI: 10.1186/1471-2105-8-268 · Source: PubMed
BMC Bioinformatics
Open Access
Software
Michael J Hickerson*(1), Eli Stahl(2) and Naoki Takebayashi(3)
Address: (1) Biology Department, Queens College, CUNY, 65-30 Kissena Blvd, Flushing, NY 11367-1597, USA, (2) Department of Biology, University of Massachusetts Dartmouth, 285 Old Westport Rd, North Dartmouth, MA 02747, USA and (3) Institute of Arctic Biology and Department of Biology and Wildlife, 311 Irving I Bldg, University of Alaska, Fairbanks, AK 99775, USA
Email: Michael J Hickerson* - michael.hickerson@qc.cuny.edu; Eli Stahl - estahl@umassd.edu; Naoki Takebayashi - ffnt@uaf.edu
* Corresponding author
Abstract
Background: Although testing for simultaneous divergence (vicariance) across different
population-pairs that span the same barrier to gene flow is of central importance to evolutionary
biology, researchers often equate the gene tree and population/species tree thereby ignoring
stochastic coalescent variance in their conclusions of temporal incongruence. In contrast to other
available phylogeographic software packages, msBayes is the only one that analyses data from
multiple species/population pairs under a hierarchical model.
Results: msBayes employs approximate Bayesian computation (ABC) under a hierarchical
coalescent model to test for simultaneous divergence (TSD) in multiple co-distributed population-
pairs. Simultaneous isolation is tested by estimating three hyper-parameters that characterize the
degree of variability in divergence times across co-distributed population pairs while allowing for
variation in various within population-pair demographic parameters (sub-parameters) that can
affect the coalescent. msBayes is a software package consisting of several C and R programs that
are run with a Perl "front-end".
Conclusion: The method reasonably distinguishes simultaneous isolation from temporal
incongruence in the divergence of co-distributed population pairs, even with sparse sampling of
individuals. Because the estimate step is decoupled from the simulation step, one can rapidly
evaluate different ABC acceptance/rejection conditions and the choice of summary statistics. Given
the complex and idiosyncratic nature of testing multi-species biogeographic hypotheses, we
envision msBayes as a powerful and flexible tool for tackling a wide array of difficult research
questions that use population genetic data from multiple co-distributed species. The msBayes
pipeline is available for download at http://msbayes.sourceforge.net/ under an open source license
(GNU Public License). The msBayes pipeline comprises several C and R programs that are run
with a Perl "front-end" and runs on Linux, Mac OS-X, and most POSIX systems. Although the
current implementation is for a single locus per species-pair, future implementations will allow
analysis of multi-locus data per species pair.
Published: 26 July 2007
BMC Bioinformatics 2007, 8:268 doi:10.1186/1471-2105-8-268
Received: 6 April 2007
Accepted: 26 July 2007
This article is available from: http://www.biomedcentral.com/1471-2105/8/268
© 2007 Hickerson et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Testing for simultaneous divergence (vicariance) across different population-pairs that span the same historical barrier to gene flow is of central importance to evolutionary biology, biogeography and community ecology [1-4]. Such inferences inform processes underlying speciation, community composition, range delineation, and the ecological consequences of climatic changes. Estimating a population divergence time with an appropriate statistical model [5] can be accomplished in a variety of ways [6-8], yet analyzing comparative phylogeographic data with multiple co-occurring species pairs that vary with respect to demographic parameters and pairwise coalescent times is less straightforward.
Instead of conducting an independent analysis on every population-pair and testing the hypothesis of temporal concordance based on this set of independent parameter estimates of divergence time, the hierarchical model employed by msBayes follows the suggestion of [9] by concurrently estimating three hyper-parameters that characterize the mean, variability and number of different divergence events across a set of population-pairs. The model employed in msBayes allows estimation of these hyper-parameters across a multi-species data set while explicitly incorporating uncertainty and variation in the sub-parameters that independently describe each population-pair's demographic history (divergence time, current, ancestral and founding effective population sizes), post-divergence migration rate and recombination rate. The msBayes software pipeline is based on the introduction of the approximate Bayesian computation (ABC) method for sampling from the hyper-posterior distribution for testing for simultaneous divergence [10]. We review the important features here. Although the current implementation is for a single locus per species-pair, future implementations will allow analysis of multi-locus data per species/population pair.
In contrast to previous ABC-like models [11-15], our TSD is accomplished by implementing a hierarchical Bayesian model in which the sub-parameters (Φ; within population-pair parameters) are conditional on "hyper-parameters" (ϕ) that describe the variability of Φ among the Y population-pairs. For example, divergence times (Φ) can vary across a set of population pairs conditional on the set of hyper-parameters (ϕ) that varies according to their hyper-prior distribution. Instead of explicitly calculating the likelihood expression P(Data | ϕ, Φ) to get a posterior distribution, we sample from the posterior distribution P((ϕ, Φ) | Data) by simulating the data K times under the coalescent model using candidate parameters drawn from the prior distribution P(ϕ, Φ). A summary statistic vector D for each simulated dataset is then compared to the observed summary statistic vector in order to generate random observations from the joint posterior distribution f(ϕ_i, Φ_i | D_i) by way of a rejection/acceptance algorithm [16] followed by an optional weighted local regression step [15]. Loosely speaking, hyper-parameter values are accepted and used to construct the posterior distribution with probabilities proportional to the similarity between the summary statistic vector from the observed data and the summary statistic vector calculated from simulated data.
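The rejection/acceptance idea can be illustrated with a toy example. What follows is a minimal sketch, not the msBayes implementation: the simulator is a stand-in that averages Gaussian draws instead of running a coalescent simulation, there is a single parameter and a single summary statistic, and the names `simulate_summary`, `abc_rejection` and the uniform prior bound are assumptions introduced only for illustration.

```python
import random

def simulate_summary(theta, n_obs, rng):
    """Toy stand-in for a coalescent simulation: the mean of n_obs Gaussian
    draws centred on theta (hypothetical, not the msBayes model)."""
    return sum(rng.gauss(theta, 1.0) for _ in range(n_obs)) / n_obs

def abc_rejection(s_obs, k, accept_fraction, prior_upper, n_obs=50, seed=1):
    """Draw k candidates from a Uniform(0, prior_upper) prior, simulate one
    summary statistic per candidate, and keep the fraction closest to s_obs."""
    rng = random.Random(seed)
    draws = []
    for _ in range(k):
        theta = rng.uniform(0.0, prior_upper)      # candidate from the prior
        s_sim = simulate_summary(theta, n_obs, rng)
        draws.append((abs(s_sim - s_obs), theta))  # distance to observed value
    draws.sort(key=lambda pair: pair[0])
    n_accept = max(1, int(k * accept_fraction))
    return [theta for _, theta in draws[:n_accept]]

posterior_sample = abc_rejection(s_obs=2.0, k=5000, accept_fraction=0.02,
                                 prior_upper=10.0)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
```

The accepted values form an approximate posterior sample. Keeping a fixed fraction of the closest draws, instead of a fixed distance threshold, is one common convention and is assumed here.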
The hierarchical model consists of ancestral populations that split at divergence times T_Y = {τ_1 ... τ_Y} in the past. The hyper-parameter set, ϕ, quantifies the degree of variability in these Y divergence times across the Y ancestral populations and their Y descendent population pairs: (1) Ψ, the number of possible divergence times (1 ≤ Ψ ≤ Y); (2) E(τ), the mean divergence time; and (3) Ω, the ratio of variance to the mean in these Y divergence times, Var(τ)/E(τ). The sub-parameters for the i-th population-pair (Φ_i) are allowed to vary independently across Y population pairs and include divergence time (τ_i), current population sizes, ancestral population sizes, post-divergence founding population sizes, durations of post-divergence population growth, recombination rates, and post-divergence migration rates. The multiple population-pair splitting model is depicted in Figure 1. Each divergence time parameter τ is scaled by 2N_AVE generations, where N_AVE is the parametric expectation of N (effective population size) across Y population pairs given the prior distribution.
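As a worked illustration of the three hyper-parameters, the sketch below computes Ψ, E(τ) and Ω from a list of divergence times. The function name `hyper_parameters` is hypothetical, and the use of the sample (n - 1) variance is an assumption; the two example inputs are the simultaneous and variable divergence histories used in the Results section.

```python
from statistics import mean, variance

def hyper_parameters(taus):
    """Summarise a list of Y divergence times as (Psi, E(tau), Omega)."""
    psi = len(set(taus))                 # number of distinct divergence times
    e_tau = mean(taus)                   # mean divergence time
    # Assumption: sample (n - 1) variance; a single pair has no variance.
    var_tau = variance(taus) if len(taus) > 1 else 0.0
    omega = var_tau / e_tau if e_tau > 0 else 0.0
    return psi, e_tau, omega

simultaneous = [1.8] * 10                # all ten pairs split at tau = 1.8
variable = [1.0] * 2 + [2.0] * 8         # two pairs at 1.0, eight at 2.0
psi_s, e_s, omega_s = hyper_parameters(simultaneous)
psi_v, e_v, omega_v = hyper_parameters(variable)
```

Under this assumption the variable history gives Ψ = 2, E(τ) = 1.8 and Ω of roughly 0.099, in line with the value of 0.1 quoted for that history, while the simultaneous history gives Ψ = 1 and Ω = 0.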
The summary statistic vector D employed in msBayes currently consists of up to six summary statistics collected from each of the Y population pairs (π, θ_W, Var(π - θ_W), π_net, π_b, and π_w). This includes π, the average number of pairwise differences among all sequences within each population pair; θ_W, the number of segregating sites within each population pair normalized for sample size [17]; Var(π - θ_W) in each population pair; and π_net, Nei and Li's net nucleotide divergence between each pair of populations [18]. This last summary statistic is the difference (π_b - π_w), where π_b is the average number of pairwise differences between the two populations of a pair and π_w is the average number of pairwise differences within a sister pair of descendent populations. The default setting includes the first four aforementioned summary statistics because they were found to be a least correlated subset of a larger group [19]; however, future versions of msBayes will allow users to choose other summary statistics.
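The definitions above can be made concrete with a small sketch that computes π, θ_W, π_b, π_w and π_net for one population pair from aligned sequences. This is an illustration under stated assumptions and not the msBayes C code: statistics are per locus (not per site), π_w is taken as the mean of the two within-population values, Var(π - θ_W) is omitted, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations, product

def n_diff(seq1, seq2):
    """Number of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))

def mean_pairwise(pairs):
    """Average number of differences over an iterable of sequence pairs."""
    diffs = [n_diff(x, y) for x, y in pairs]
    return sum(diffs) / len(diffs) if diffs else float("nan")

def summary_stats(pop_a, pop_b):
    """Return pi, theta_W, pi_b, pi_w and pi_net for one population pair."""
    pooled = pop_a + pop_b
    pi = mean_pairwise(combinations(pooled, 2))
    # Watterson's estimator: segregating sites S divided by a_n = sum(1/i).
    seg_sites = sum(len(set(column)) > 1 for column in zip(*pooled))
    a_n = sum(1.0 / i for i in range(1, len(pooled)))
    theta_w = seg_sites / a_n
    pi_b = mean_pairwise(product(pop_a, pop_b))
    # Assumption: pi_w is the mean of the two within-population values.
    pi_w = (mean_pairwise(combinations(pop_a, 2)) +
            mean_pairwise(combinations(pop_b, 2))) / 2
    return {"pi": pi, "theta_W": theta_w, "pi_b": pi_b,
            "pi_w": pi_w, "pi_net": pi_b - pi_w}

stats = summary_stats(["AAAA", "AAAT"], ["TTAA", "TTAT"])
```

In this sketch a population with a single sequence has no within-population pairs, so `pi_w` and `pi_net` come out as NaN while `pi_b` remains defined.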
An extensive simulation study was conducted in [10] to evaluate the performance of our hierarchical ABC model. Because comparative phylogeographic studies are often conducted on multi-species data sets that include rare taxa from which it is difficult to obtain samples from many individuals, we extend the previous evaluation to explore the effectiveness of msBayes in conducting a TSD given small sample sizes (≤ 5 individuals per population pair).
Implementation
After preparation of a sample size file and the input files from DNA sequence data, running msBayes is a three step process that includes: (A) calculating the observed summary statistic vector from the DNA sequence input files and the sample size file; (B) running coalescent simulations of the DNA sequence data using parameters drawn from the hyper-prior (ϕ) and prior (Φ); and (C) sampling from the posterior distribution and obtaining posterior estimates of Ψ, E(τ), and Ω across the Y population pairs (Figure 2).
Step A is accomplished by a command-line Perl program (obsSumStats.pl) which uses two C programs to calculate the observed summary statistic vector file. Specifically, the user runs obsSumStats.pl after collecting separate aligned DNA sequence files from each population-pair in FASTA format, and constructing an additional text file that describes the sample sizes, length of genes and transition/transversion rate ratios.
Step B iteratively simulates data sets under the hierarchical model by: 1.) randomly drawing hyper-parameters and sub-parameters from the hyper-prior and sub-prior distributions; 2.) using these hyper-parameters and sub-parameters to simulate finite sites DNA sequence data from Y population-pairs; and 3.) calculating a summary statistic vector D from the simulated data set of Y population-pairs. This is accomplished with several C programs that are run with a Perl "front-end" (msbayes.pl) that either prompts the user for the upper-bounds of various priors and the number of iterations to simulate or alternatively uses a batch configuration file with equivalent information. The first C program draws hyper-parameters and sub-parameters from their hyper-prior and sub-prior distributions. These parameters are then fed into several C programs that simulate finite-sites DNA sequence data using msarbpopQH, a modified version of Hudson's classic coalescent simulator ms [20], which incorporates finite sites mutation and arbitrary population structure and dynamics. Another set of C programs calculates a summary statistic vector (D) for every simulated data set of Y population pairs.
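The first part of Step B, drawing divergence times hierarchically, can be sketched as follows. This is only an illustration under assumptions that the text does not spell out: Ψ is drawn from a discrete uniform prior on 1..Y, divergence times from a Uniform(0, tau_upper) prior, each of the Ψ times is used by at least one pair, and Ω uses the sample variance. The function names and the value of `tau_upper` are hypothetical, and the remaining sub-parameters, the coalescent simulation and the summary statistics are left out.

```python
import random
from statistics import mean, variance

def draw_divergence_times(n_pairs, tau_upper, rng):
    """Draw Psi, then Psi divergence times, then assign one to each pair."""
    psi = rng.randint(1, n_pairs)                 # hyper-prior: 1 <= Psi <= Y
    times = [rng.uniform(0.0, tau_upper) for _ in range(psi)]
    # Every divergence time is used at least once; the remaining pairs pick
    # one of the Psi times at random.
    taus = times + [rng.choice(times) for _ in range(n_pairs - psi)]
    rng.shuffle(taus)
    return psi, taus

def simulate_prior(n_draws, n_pairs, tau_upper, seed=42):
    """Return a list of (Psi, E(tau), Omega, taus) tuples drawn from the prior."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_draws):
        psi, taus = draw_divergence_times(n_pairs, tau_upper, rng)
        e_tau = mean(taus)
        omega = variance(taus) / e_tau if n_pairs > 1 and e_tau > 0 else 0.0
        table.append((psi, e_tau, omega, taus))
    return table

prior_table = simulate_prior(n_draws=1000, n_pairs=10, tau_upper=10.0)
```

Each row of `prior_table` pairs one draw of the hyper-parameters with the per-pair divergence times that a simulator would then use to generate sequence data.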
Figure 2. Flowchart describing operation of msBayes. Data preparation: prepare the sample size file and a FASTA file for each population-pair. Step A: run obsSumStats.pl to obtain the observed summary statistic vector. Step B: run msbayes.pl to simulate the prior (K simulations from the hyper-prior). Step C: run acceptRej.pl to sample from the posterior (determined by the number of accepted draws out of the K draws from the prior distribution).
Figure 1. Depiction of the multiple population-pair divergence model used for the ABC estimates of Ψ, E(τ), and Ω. (A): The white lines depict a gene tree with TMRCA being the time to the gene sample's most recent common ancestor, and the black tree containing the gene tree is the population/species tree. (B): Parameters in the multiple population-pair divergence model. The population mutation parameter, θ, is 2Nµ where 2N is the summed haploid effective female population size of each pair of daughter populations (µ is the per gene per generation mutation rate). The time since isolation of each population pair is denoted by τ (in units of 2N_AVE generations, where N_AVE is the parametric expectation of N across Y population pairs given the prior distribution). Population mutation parameters for daughter populations a and b are θ_a and θ_b, whereas θ′_a and θ′_b are the population mutation parameters for the sizes of daughter populations a and b at the time of divergence until τ′ (length of bottleneck). The daughter populations θ′_a and θ′_b then grow exponentially to sizes θ_a and θ_b. The population mutation parameter for each ancestral population is depicted as θ_A. The migration rate between each pair of daughter populations is depicted as M (number of effective migrants per generation). (C): Example of four population-pairs where parameters in (B) are drawn from uniform priors. Labels in panel C: Y = 4 taxon-pairs; Ψ = 4 divergence times; Population-pairs 1 to 4.
Step C is accomplished by our command-line user-interface program (acceptRej.pl). This Perl program internally uses R for the calculation. The algorithm is a simple extension of the original R scripts which were kindly provided by M. Beaumont [15]. This step does the rejection/acceptance sampling and local regression to produce the approximate sample of the posterior distribution. This third step uses the output of step B as the input and produces an output file that contains multiple graphical depictions of the posterior distributions and a text output file with various summaries of the posterior distributions (estimates of Ψ, E(τ), and Ω across the Y population pairs). The user can choose which summary statistics to include within D (the summary statistic vector), choose the proportion of accepted draws from the prior, and can optionally choose to perform simple rejection sampling without the additional local regression step.
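The optional local regression step can be sketched for the simplest case of one parameter and one summary statistic. This is a rough illustration in the spirit of the weighted local-linear adjustment of [15], not a port of the R scripts used by acceptRej.pl: the function name `regression_adjust`, the single-statistic restriction and the toy inputs are assumptions made for the example.

```python
def regression_adjust(thetas, stats, s_obs, delta):
    """Weighted local-linear adjustment of accepted draws (one statistic).

    thetas: accepted parameter values; stats: their simulated summary
    statistics; delta: tolerance, so every |s - s_obs| is at most delta.
    """
    x = [s - s_obs for s in stats]
    # Epanechnikov-type weights: largest when the simulation matches s_obs.
    w = [max(0.0, 1.0 - (xi / delta) ** 2) for xi in x]
    w_sum = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    y_bar = sum(wi * yi for wi, yi in zip(w, thetas)) / w_sum
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar)
               for wi, xi, yi in zip(w, x, thetas))
    slope = s_xy / s_xx if s_xx > 0 else 0.0
    # Project each accepted draw to where it would sit if s had equalled s_obs.
    adjusted = [yi - slope * xi for xi, yi in zip(x, thetas)]
    return adjusted, w

sim_stats = [0.8, 0.9, 1.0, 1.1, 1.2]
accepted = [2.0 * s for s in sim_stats]          # toy linear relationship
adjusted, weights = regression_adjust(accepted, sim_stats, s_obs=1.0, delta=0.3)
```

In the toy input the parameter is exactly twice the statistic, so every adjusted value collapses to the value implied by `s_obs`. With real simulations the adjusted draws, together with their weights, would be summarised to give the posterior estimates.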
We distribute msBayes as C source code and pre-compiled
binaries that run on Linux or Mac OS X operating systems.
The msBayes package also includes the R functions, and
Perl scripts, as well as installation/running instructions.
Results and Discussion
Performance of estimator with small sample sizes
At the present time, there are no other available coalescent-based tools for analyzing multiple population pairs simultaneously to yield hyper-parameter estimates. Although IM and IMa are most similar to msBayes [8,21] because they estimate divergence times and population sizes from single pairs of populations under a coalescent model, these do not employ a hierarchical model and therefore can only do so one pair at a time. The program MCMCcoal can estimate divergence times of a known phylogeny under a coalescent model, but can only use the separate divergence time estimates to test for phylogeographic congruence [7]. The program BEST [6] infers a species phylogeny and demographic parameters (e.g. divergence times and population sizes) using a population coalescent model, but likewise can only use the individual divergence time estimates to test for phylogeographic congruence across a multi-species dataset. On the other hand, the hierarchical model employed in msBayes not only can estimate hyper-parameters but also comes with the benefit of additional information gained from "borrowing strength" across datasets [22-24]. In this case, the resulting emergent multi-species hyper-estimates use more of the information than the sum of their parts (within species-pair estimates).
Although the hierarchical ABC model employed in msBayes was extensively evaluated in [10], the behavior of the ABC estimator given minimal sampling of individuals was not examined. Because comparative phylogeographic studies are often conducted on multi-species data sets that include rare taxa from which it is difficult to obtain samples from many individuals, we evaluate how low sample sizes can affect inference. To this end, we explored the performance in scenarios where ≤ 5 individuals per population pair were sampled from each of 10 population pairs. We created 1,000 simulated data sets under each of two different histories: (1) simultaneous divergence history and (2) variable divergence history among population pairs. In the simultaneous divergence history (true Ω = Var(τ)/E(τ) = 0), all ten population pairs arose from ancestral populations at τ = 1.8 before the present. In the variable divergence history (true Ω = 0.1), two population pairs arose at τ = 1.0 and eight population pairs arose at τ = 2.0 before the present. We simulated these two histories with small sample sizes (2–5 individuals per population-pair) and with larger sample sizes (20 individuals per population pair; 10 per descendent population). The simulated data sets consisted of haploid mtDNA samples from ten population pairs that were 550–600 base pairs in length. From each of the four sets of 1,000 simulated data sets, we used msBayes to obtain 1,000 ABC estimates of the hyper-parameter, Ω, with the goal of assessing the effects of sample sizes on the root mean square error (RMSE) of the ABC estimator. Each estimate of Ω was obtained from the mode of 1,000 accepted draws (after the local regression step) from 500,000 random draws from the hyper-prior, as these conditions were found to be optimal in [10]. For the larger sample sizes we use four classes of summary statistics (π, θ_W, Var(π - θ_W) and π_net), while for the smaller sample sizes we only use π_b to avoid null or NaN (not a number) values that are yielded when only one sample is collected from a descendent population.
The simulation analysis demonstrated that msBayes can usually distinguish simultaneous divergence from temporal incongruence in divergence, even with sparse sampling of individuals. The estimates of Ω were not markedly improved by sampling 20 individuals per population pair (10 each population) when compared to sampling 2–5 individuals per population pair (1–3 each population; Figure 3). However, Ω is being overestimated under both sample sizes and this upward bias is stronger with larger sample sizes when true Ω = 1. Therefore, simultaneous divergence is easier to correctly reject with larger sample sizes. Root mean square error (RMSE) for estimating Ω was < 0.12 when the true history was simultaneous divergence (Ω = 0), and RMSE was < 0.18 when the true history involved 2 different divergence events across 10 population pairs (Ω = 0.1). It is encouraging that one can obtain fair estimates with so few samples per population pair and that two samples per population pair can be analyzed by msBayes.
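For reference, the RMSE and bias used to judge the estimator can be computed as in the short sketch below. The estimates in the example are hypothetical values chosen for easy arithmetic, not results from the study, and the function names are illustrative.

```python
import math

def rmse(estimates, true_value):
    """Root mean square error of a list of estimates around true_value."""
    squared = [(est - true_value) ** 2 for est in estimates]
    return math.sqrt(sum(squared) / len(squared))

def bias(estimates, true_value):
    """Mean estimate minus the true value (positive means overestimation)."""
    return sum(estimates) / len(estimates) - true_value

toy_estimates = [0.0, 0.1, 0.2, 0.3]          # hypothetical Omega estimates
toy_rmse = rmse(toy_estimates, true_value=0.1)
toy_bias = bias(toy_estimates, true_value=0.1)
```

A positive bias together with a modest RMSE corresponds to the pattern described above, in which Ω tends to be overestimated while remaining close to the true value.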
An attractive benefit of an ABC method such as msBayes is that one can perform this estimator evaluation relatively quickly. Simulating data from parameters drawn from the prior is only done once per set of conditions (sample size/history) and can be done in approximately 5 hours per population pair on a 2 GHz Linux computer. The computational time can be further reduced as the simulations can be run in parallel on multiple processors. Because the acceptance/rejection step is decoupled from simulating the prior, multiple estimates from a series of simulated datasets can be accomplished without re-simulating the prior each time. The acceptance/rejection step for a single estimate is accomplished in one second to well under a minute such that 1,000 estimates can be obtained from 1,000 data sets simulated from fixed known parameter values in under an hour to within 24 hours on a single processor.
General use and future development
The most important aspect of msBayes is that its flexible and modular nature will allow us and others to add in new features. This characteristic is essential for a phylogeographic software tool because phylogeographic studies are highly idiosyncratic. Using population genetic data to test how climate and/or geological changes result in biogeographic shifts, speciation, extinction and consequent changes in ecological interactions can involve a wide array of hypotheses and models that conform to no generality with regards to model complexity, parameterization and sampling. We therefore anticipate making several extensions to msBayes, and will encourage other bioinformaticians to make versions that suit particular difficult research questions. Furthermore, phylogeographic studies are most powerful when combined with independent evidence (or hypotheses) about past habitat distributions that are generated from other types of historic data and ecological distribution models [25]. Particular historical hypotheses can then be directly parameterized by paleo-distribution models and tested with genetic data within the msBayes framework using Bayes factors [26].
One feature we plan to include in future versions of msBayes is an option to simulate from the prior after constraining the number of divergence events per Y population pairs to one fixed number. This will then allow getting estimates for when these different isolation events took place as well as estimating which population pairs originated at either of the particular divergence events. Other upcoming features to be included are: 1.) multiple loci per population pair by expanding the summary statistic vector and adding additional hyper-parameters controlling mutation rate variation across loci; 2.) having more summary statistics available; 3.) allowing analysis of only one population pair at a time; 4.) testing multi-species colonization hypotheses; 5.) three or more population models (as opposed to two population models); 6.) allowing microsatellite data and 7.) automating the simulation testing procedure used to obtain estimator bias.
It should be noted that migration could hinder the ability of this method to correctly infer simultaneous divergence. Moderate migration in a subset of species/population pairs could cause the method to incorrectly support temporal discordance in divergence when the true history involved temporal congruence because migration can erase the genetic signal of isolation [27,28]. Although the Bayesian support for temporal concordance in divergence times would likely weaken if this happens in a subset of species/population pairs, we will explore using the summary statistic Var(π) as a means to tease apart migration from isolation as suggested in [29,30].
Figure 3. Performance of estimator. Panels A through D each depict frequency histograms of 1,000 Ω estimates given 1,000 datasets simulated under either of two constrained histories. The simulated histories in panels A and C involve simultaneous divergence across ten population pairs (Ω = 0.0; all τ = 1.8), whereas panels B and D are from histories involving two different divergence events across the 10 population pairs (Ω = 0.1; two splitting at τ = 1.0 and eight splitting at τ = 2.0). Panels A and B are using small sample sizes (≤ 5 individuals per population pair), whereas panels C and D are using samples of 10 individuals per population pair. The actual sample sizes used for panels A and B are species pair 1: 1, 2; pair 2: 3, 2; pair 3: 1, 1; pair 4: 2, 2; pair 5: 2, 3; pair 6: 2, 1; pair 7: 1, 1; pair 8: 1, 3; pair 9: 3, 1; pair 10: 2, 1. Axes: estimates of Ω (0.0 to 2.0) against frequency. Values annotated in the panels: RMSE = 0.12, 0.11, 0.16 and 0.18; mean estimates = 0.04, 0.17, 0.04 and 0.21; median estimates = 0.01, 0.14, 0.00 and 0.18.
Conclusion
The msBayes software pipeline will increasingly become an important tool as the field of comparative phylogeography progresses to become a more rigorous and statistical enterprise [5]. The program can obtain hyper-parameter estimates using hierarchical models in a reasonable amount of time without having the problems associated with convergence and mixing found in MCMC methods (Markov chain Monte Carlo). Because the estimation step is decoupled from the simulation step, one can quickly evaluate different ABC acceptance/rejection conditions and the choice of summary statistics. The method can reasonably distinguish biogeographic congruence from temporal incongruence, even with sparse sampling of individuals. Given the complex and idiosyncratic nature of testing multi-species biogeographic hypotheses, we envision msBayes as a powerful and flexible tool that is open for modification when faced with particularly difficult research questions. Finally, due to its flexible and modular design, msBayes will be a well-suited tool for the heterogeneous data sets that are emerging and being combined to test complex historical hypotheses.
Availability and requirements
The installation instructions, documentation, source code and precompiled binary for msBayes are all available for download at http://msbayes.sourceforge.net/ under an open source license (GNU Public License). The msBayes pipeline comprises several C and R programs that are run with a Perl "front-end" and runs on Linux, Mac OS-X, and most POSIX systems.
List of Abbreviations used
ABC: Approximate Bayesian Computation
TSD: Test of simultaneous divergence
mtDNA: Mitochondrial DNA
Authors' contributions
MJH developed the idea for using ABC within a hierarchical model to analyze multiple population pairs simultaneously. ES developed the finite sites version of D. Hudson's classic coalescent simulator (ms). MJH and NT developed C, R, and Perl routines and modified pre-existing R and C routines to comprise an ABC algorithm that makes use of a hierarchical model. NT extensively developed the C and Perl routines that comprise the user version of msBayes now available. NT and MJH maintain the msBayes website and NT developed the installation configurations and precompiled binaries. All authors read and approved the final version of the manuscript.
Acknowledgements
We thank A. Lancaster and three anonymous reviewers for making helpful
suggestions and E. Andersson for handling the manuscript. We thank M.
Beaumont for kindly providing R scripts and critically useful discussions. We
thank D. Hudson for permission to use E. Stahl's finite sites version of his ms coalescent simulator under GNU Public License. We thank J. McGuire and C. Moritz for use of the Linux parallel computing cluster housed at the Museum of Vertebrate Zoology (University of California, Berkeley). We thank MBI (Mathematical Biosciences Institute) for hosting the workshop on Phylogenetics and Phylogeography. Support for M. J. Hickerson was provided by a NSF post-doctoral fellowship in interdisciplinary informatics. N. Takebayashi was supported by NIH Grant Number 2P20RR16466 from the INBRE program of the National Center for Research Resources and NSF DEB-0640520. E. Stahl was supported by Sloan/DOE Fellowship award DE-FG02-00ER62993.
References
1. Avise JC: Phylogeography: The history and formation of spe-
cies. Cambridge: Harvard University Press; 2000.
2. Wen J: Origin and the evolution of the eastern North Ameri-
can distinct distributions of flowering plants. Annu Rev Ecol Syst
1999, 30:421-455.
3. Lessios HA: The first stage of speciation as seen in organisms separated by the Isthmus of Panama. In Endless Forms: species and speciation Edited by: Howard DJ, Berlocher S. Oxford, U.K.: Oxford University Press; 1998:186-201.
4. Barraclough TG, Nee S: Phylogenetics and speciation. Trends
Ecol Evol 2001, 16(7):391-399.
5. Knowles LL, Maddison WP: Statistical phylogeography. Mol Ecol
2002, 11(12):2623-2635.
6. Edwards SV, Liu L, Pearl DK: High-resolution species trees without concatenation. Proc Natl Acad Sci USA 2007, 104:5936-5941.
7. Rannala B, Yang ZH: Bayes Estimation of Species Divergence
Times and Ancestral Population Sizes Using DNA
Sequences From Multiple Loci. Genetics 2003, 164:1645-1656.
8. Hey J, Nielsen R: Integration within the Felsenstein equation
for improved Markov chain Monte Carlo methods in popula-
tion genetics. Proc Natl Acad Sci USA 2007, 104:2785-2790.
9. Edwards SV, Beerli P: Perspective: Gene divergence, population
divergence, and the variance in coalescence time in phyloge-
ographic studies. Evolution 2000, 54(6):1839-1854.
10. Hickerson MJ, Stahl E, Lessios HA: Test for simultaneous diver-
gence using approximate Bayesian computation. Evolution
2006, 60:2435-2453.
11. Thornton K, Andolfatto P: Approximate Bayesian inference
reveals evidence for a recent, severe bottleneck in a Netherlands
population of Drosophila melanogaster. Genetics 2006,
172:1607-1619.
12. Excoffier L, Estoup A, Cornuet J-M: Bayesian analysis of an
admixture model with mutations and arbitrarily linked
markers. Genetics 2005, 169:1727-1738.
13. Estoup A, Beaumont MA, Sennedot F, Moritz C, Cornuet J-M:
Genetic analysis of complex demographic scenarios: spatially
expanding populations of the cane toad, Bufo marinus.
Evolution 2004, 58:2021-2036.
14. Tallmon DA, Luikart G, Beaumont MA: Comparative evaluation of
a new effective population size estimator based on approximate
Bayesian computation. Genetics 2004, 167:977-988.
15. Beaumont MA, Zhang W, Balding DJ: Approximate Bayesian
computation in population genetics. Genetics 2002,
162:2025-2035.
16. Weiss G, von Haeseler A: Inference of population history using
a likelihood approach. Genetics 1998, 149:1539-1546.
17. Watterson GA: On the number of segregating sites in genetic
models without recombination. Theor Popul Biol 1975,
7:256-276.
18. Nei M, Li W: Mathematical model for studying variation in
terms of restriction endonucleases. Proc Natl Acad Sci USA 1979,
76:5269-5273.
19. Hickerson MJ, Dolman G, Moritz C: Comparative phylogeo-
graphic summary statistics for testing simultaneous vicari-
ance across taxon-pairs. Mol Ecol 2006, 15:209-224.
20. Hudson RR: Generating samples under a Wright-Fisher neu-
tral model of genetic variation. Bioinformatics 2002, 18:337-338.
21. Hey J, Nielsen R: Multilocus methods for estimating population
sizes, migration rates and divergence time, with applications
to the divergence of Drosophila pseudoobscura and D. persimi-
lis. Genetics 2004, 167:747-760.
22. Beaumont MA, Rannala B: The Bayesian revolution in genetics.
Nat Rev Genet 2004, 5:251-261.
23. James W, Stein C: Estimation with quadratic loss. In Proceedings
of the Fourth Berkeley Symposium on Mathematical Statistics and Probability:
1960 Berkeley, CA: University of California Press; 1960.
24. Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis.
London, UK: Chapman and Hall/CRC; 1995.
25. Hugall A, Moritz C, Moussalli A, Stanisic J: Reconciling paleodistri-
bution models and comparative phylogeography in the Wet
Tropics rainforest land snail Gnarosophia bellendenkerensis
(Brazier 1875). Proc Natl Acad Sci USA 2002, 99(9):6112-6117.
26. Kass RE, Raftery A: Bayes factors. Journal of the American
Statistical Association 1995, 90:773-795.
27. Kalinowski ST: Evolutionary and statistical properties of three
genetic distances. Mol Ecol 2002, 11(8):1263-1273.
28. Slatkin M: Gene flow in natural populations. Ann Rev Ecol Syst
1985, 16:393-430.
29. Wakeley J: The variance of pairwise nucleotide differences in
two populations with migration. Theor Popul Biol 1996, 49:39-57.
30. Wakeley J: Distinguishing migration from isolation using the
variance of pairwise differences. Theor Popul Biol 1996,
49:369-386.