We investigated the coevolutionary history of seabirds (orders Procellariiformes and Sphenisciformes) and their lice (order Phthiraptera). Independent trees were produced for the seabirds (tree derived from 12S ribosomal RNA, isoenzyme, and behavioral data) and their lice (trees derived from 12S rRNA data). Brooks parsimony analysis (BPA) supported a general history of cospeciation (consistency index = 0.84, retention index = 0.81). We inferred that the homoplasy in the BPA was caused by one intrahost speciation, one potential host-switching, and eight or nine sorting events. Using reconciliation analysis, we quantified the cost of fitting the louse tree onto the seabird tree. The reconciled trees postulated one host-switching, nine cospeciation, three or four intrahost speciation, and 11 to 14 sorting events. The number of cospeciation events was significantly more than would be expected from chance alone (P < 0.01). The sequence data were used to test for rate heterogeneity for both seabirds and lice. Neither data set displayed significant rate heterogeneity. An examination of the codivergent nodes revealed that seabirds and lice have cospeciated synchronously and that lice have evolved at approximately 5.5 times the rate of seabirds. The degree of sequence divergence supported some of the postulated intrahost speciation events (e.g., Halipeurus predated the evolution of their present hosts). The sequence data also supported some of the postulated host-switching events. These results demonstrate the value of sequence data and reconciliation analyses in unraveling complex histories between hosts and their parasites.
Polyploidy, the genome wide duplication of chromosome number, is a key feature in eukaryote evolution. Polyploidy exists in diverse groups including animals, fungi, and invertebrates but is especially prevalent in plants with most, if not all, plant species having descended from a polyploidization event. Polyploids often differ markedly from their diploid progenitors in morphological, physiological, and life history characteristics as well as rates of adaptation. The altered characteristics displayed by polyploids may contribute to their success in novel ecological habitats. Clearly, a better understanding of the processes underlying changes in the number of chromosomes within genomes is a key goal in our understanding of speciation and adaptation for a wide range of families and genera. Despite the fundamental role of chromosome number change in eukaryotic evolution, probabilistic models describing the evolution of chromosome number along a phylogeny have not yet been formulated. We present a series of likelihood models, each representing a different hypothesis regarding the evolution of chromosome number along a given phylogeny. These models allow us to reconstruct ancestral chromosome numbers and to estimate the expected number of polyploidization events and single chromosome changes (dysploidy) that occurred along a phylogeny. We test, using simulations, the accuracy of this approach and its dependence on the number of taxa and tree length. We then demonstrate the application of the method for the study of chromosome number evolution in 4 plant genera: Aristolochia, Carex, Passiflora, and Helianthus. Considering the depth of the available cytological and phylogenetic data, formal models of chromosome number evolution are expected to advance significantly our understanding of the importance of polyploidy and dysploidy across different taxonomic groups.
Even when the maximum likelihood (ML) tree is a better estimate of the true phylogenetic tree than those produced by other methods, the result of a poor ML search may be no better than that of a more thorough search under some faster criterion. The ability to find the globally optimal ML tree is therefore important. Here, I compare a range of heuristic search strategies (and their associated computer programs) in terms of their success at locating the ML tree for 20 empirical data sets with 14 to 158 sequences and 411 to 120,762 aligned nucleotides. Three distinct topics are discussed: the success of the search strategies in relation to certain features of the data, the generation of starting trees for the search, and the exploration of multiple islands of trees. As a starting tree, there was little difference among the neighbor-joining tree based on absolute differences (including the BioNJ tree), the stepwise-addition parsimony tree (with or without nearest-neighbor-interchange (NNI) branch swapping), and the stepwise-addition ML tree. The latter produced the best ML score on average but was orders of magnitude slower than the alternatives. The BioNJ tree was second best on average. As search strategies, star decomposition and quartet puzzling were the slowest and produced the worst ML scores. The DPRml, IQPNNI, MultiPhyl, PhyML, PhyNav, and TreeFinder programs with default options produced qualitatively similar results, each locating a single tree that tended to be in an NNI suboptimum (rather than the global optimum) when the data set had low phylogenetic information. For such data sets, there were multiple tree islands with very similar ML scores. The likelihood surface only became relatively simple for data sets that contained approximately 500 aligned nucleotides for 50 sequences and 3,000 nucleotides for 100 sequences. The RAxML and GARLI programs allowed multiple islands to be explored easily, but both programs also tended to find NNI suboptima. 
A newly developed version of the likelihood ratchet using PAUP* successfully found the peaks of multiple islands, but its speed needs to be improved.
Phylogenetic analyses using genome-scale data sets must confront incongruence among gene trees, which in plants is exacerbated by frequent gene duplications and losses. Gene tree parsimony (GTP) is a phylogenetic optimization criterion in which a species tree that minimizes the number of gene duplications induced among a set of gene trees is selected. The run time performance of previous implementations has limited its use on large-scale data sets. We used new software that incorporates recent algorithmic advances to examine the performance of GTP on a plant data set consisting of 18,896 gene trees containing 510,922 protein sequences from 136 plant taxa (giving a combined alignment length of >2.9 million characters). The relationships inferred from the GTP analysis were largely consistent with previous large-scale studies of backbone plant phylogeny and resolved some controversial nodes. The placement of taxa that were present in few gene trees generally varied the most among GTP bootstrap replicates. Excluding these taxa either before or after the GTP analysis revealed high levels of phylogenetic support across plants. The analyses supported magnoliids sister to a eudicot + monocot clade and did not support the eurosid I and II clades. This study presents a nuclear genomic perspective on the broad-scale phylogenic relationships among plants, and it demonstrates that nuclear genes with a history of duplication and loss can be phylogenetically informative for resolving the plant tree of life.
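The gene tree parsimony criterion described above can be illustrated with a minimal sketch. It is not the software used in the study. It assumes rooted trees written as nested Python tuples with species names at the leaves, and it counts a gene-tree node as a duplication when that node maps to the same species-tree node as one of its children (the usual least-common-ancestor mapping). The function names and the toy trees are hypothetical.

```python
# Trees are nested tuples; leaves are species names (strings).
# Gene-tree leaves are labeled by the species the gene copy was sampled from.

def leaf_set(tree):
    """Set of species names found at the leaves of a (sub)tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def species_clades(species_tree):
    """Every clade of the species tree, each as a frozenset of species."""
    clades = [leaf_set(species_tree)]
    if not isinstance(species_tree, str):
        for child in species_tree:
            clades.extend(species_clades(child))
    return clades

def lca_clade(clades, species):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in clades if species <= c), key=len)

def count_duplications(gene_tree, clades):
    """Number of gene-tree nodes mapped to the same clade as one of their children."""
    if isinstance(gene_tree, str):
        return 0
    here = lca_clade(clades, leaf_set(gene_tree))
    dups = sum(count_duplications(child, clades) for child in gene_tree)
    if any(lca_clade(clades, leaf_set(child)) == here for child in gene_tree):
        dups += 1
    return dups

def gtp_score(species_tree, gene_trees):
    """Total duplications a candidate species tree implies across all gene trees."""
    clades = species_clades(species_tree)
    return sum(count_duplications(g, clades) for g in gene_trees)

# Toy data: two candidate species trees scored against three gene trees.
gene_trees = [
    (("A", "B"), "C"),
    (("A", "C"), ("B", "C")),
    (("A", "B"), "C"),
]
candidates = [(("A", "B"), "C"), (("A", "C"), "B")]
best = min(candidates, key=lambda s: gtp_score(s, gene_trees))
print(best, gtp_score(best, gene_trees))
```

A real analysis would search a very large space of species trees rather than compare two candidates; the sketch shows only how a single candidate is scored.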
Butterflies in the large Palearctic genus Agrodiaetus (Lepidoptera: Lycaenidae) are extremely uniform and exhibit few distinguishing morphological characters. However, these insects are distinctive in one respect: as a group they possess among the greatest interspecific karyotype diversity in the animal kingdom, with chromosome numbers (n) ranging from 10 to 125. The monophyly of Agrodiaetus and its systematic position relative to other groups within the section Polyommatus have been controversial. Characters from the mitochondrial genes for cytochrome oxidases I and II and from the nuclear gene for elongation factor 1α were used to reconstruct the phylogeny of Agrodiaetus using maximum parsimony and Bayesian phylogenetic methods. Ninety-one individuals, encompassing most of the taxonomic diversity of Agrodiaetus, and representatives of 14 related genera were included in this analysis. Our data indicate that Agrodiaetus is monophyletic. Representatives of the genus Polyommatus (sensu stricto) are the closest relatives. The sequences of the Agrodiaetus taxa in this analysis are tentatively arranged into 12 clades, only 1 of which corresponds to a species group traditionally recognized in Agrodiaetus. Heterogeneous substitution rates across a recovered topology were homogenized with a nonparametric rate-smoothing algorithm before the application of a molecular clock. Two published estimates of substitution rates dated the origin of Agrodiaetus between 2.51 and 3.85 million years ago. During this time, there was heterogeneity in the rate and direction of karyotype evolution among lineages within the genus. Karyotype instability has evolved independently three times in the section Polyommatus, within the lineages Agrodiaetus, Lysandra, and Plebicula. Rapid karyotype diversification may have played a significant role in the radiation of the genus Agrodiaetus.
Notoriously slow rates of molecular evolution and convergent evolution among some morphological characters have limited phylogenetic resolution for the palm family (Arecaceae). This study adds nuclear DNA (18S SSU rRNA) and chloroplast DNA (cpDNA; atpB and rbcL) sequence data for 65 genera of palms and characterizes molecular variation for each molecule. Phylogenetic relationships were estimated with maximum likelihood and maximum parsimony techniques for the new data and for previously published molecular data for 45 palm genera. Maximum parsimony analysis was also used to compare molecular and morphological data for 33 palm genera. Incongruence among datasets was detected between cpDNA and 18S data and between molecular and morphological data. Most conflict between nuclear and cpDNA data was associated with the genus Nypa. Several taxa showed relatively long branches with 18S data, but phylogenetic resolution of these taxa was essentially the same for 18S and cpDNA data. Base composition bias for 18S that contributed to erroneous phylogenetic resolution in other taxa did not seem to be present in Palmae. Morphological data were incongruent with all molecular data due to apparent morphological homoplasy for Caryoteae, Ceroxyloideae, Iriarteae, and Thrinacinae. Both cpDNA and nuclear 18S data firmly resolved Caryoteae with Borasseae of Coryphoideae, suggesting that at least some morphological characters used to place Caryoteae in Arecoideae are homoplastic. In this study, increased character sampling seems to be more important than increased taxon sampling; a comparison of the full (65-taxon) and reduced (45- and 33-taxon) datasets suggests little difference in core topology but considerably more nodal support with the increased character sample sizes. These results indicate a general trend toward a stable estimate of phylogenetic relationships for the Palmae. 
Although the 33-taxon topologies are even better resolved, they lack several critical taxa and are affected by incongruence between molecular and morphological data. As such, a comparison of results from the 45- and 33-taxon trees offers the best available reference for phylogenetic inference on palms.
The nuclear small subunit rRNA (18S) has played a dominant role in the estimation of relationships among insect orders from molecular data. In previous studies, 18S sequences have been aligned by unadjusted automated approaches (computer alignments that are not manually readjusted), most recently with direct optimization (simultaneous alignment and tree building using a program called "POY"). Parsimony has been the principal optimality criterion. Given the problems associated with the alignment of rRNA, and the recent availability of the doublet model for the analysis of covarying sites using Bayesian MCMC analysis, a different approach is called for in the analysis of these data. In this paper, nucleotide sequence data from the 18S small subunit rRNA gene of insects are aligned manually with reference to secondary structure, and analyzed under Bayesian phylogenetic methods with both GTR+I+G and doublet models in MrBayes. A credible phylogeny of Insecta is recovered that is independent of the morphological data and (unlike many other analyses of 18S in insects) not contradictory to traditional ideas of insect ordinal relationships based on morphology. Hexapoda, including Collembola, are monophyletic. Paraneoptera are the sister taxon to a monophyletic Holometabola but weakly supported. Ephemeroptera are supported as the sister taxon of Neoptera, and this result is interpreted with respect to the evolution of direct sperm transfer and the evolution of flight. Many other relationships are well-supported but several taxa remain problematic, e.g., there is virtually no support for relationships among orthopteroid orders. A website is made available that provides aligned 18S data in formats that include structural symbols and Nexus formats.
Triploblastic relationships were examined in the light of molecular and morphological evidence. Representatives for all triploblastic "phyla" (except Loricifera) were represented by both sources of phylogenetic data. The 18S ribosomal (rDNA) sequence data for 145 terminal taxa and 276 morphological characters coded for 36 supraspecific taxa were combined in a total evidence regime to determine the most consistent picture of triploblastic relationships for these data. Only triploblastic taxa are used to avoid rooting with distant outgroups, which seems to happen because of the extreme distance that separates diploblastic from triploblastic taxa according to the 18S rDNA data. Multiple phylogenetic analyses performed with variable analysis parameters yield largely inconsistent results for certain groups such as Chaetognatha, Acoela, and Nemertodermatida. A normalized incongruence length metric is used to assay the relative merit of the multiple analyses. The combined analysis having the least character incongruence yields the following scheme of relationships of four main clades: (1) Deuterostomia [((Echinodermata + Enteropneusta) (Cephalochordata (Urochordata + Vertebrata)))]; (2) Ecdysozoa [(((Priapulida + Kinorhyncha) (Nematoda + Nematomorpha)) ((Onychophora + Tardigrada) Arthropoda))]; (3) Trochozoa [((Phoronida + Brachiopoda) (Entoprocta (Nemertea (Sipuncula (Mollusca (Pogonophora (Echiura + Annelida)))))))]; and (4) Platyzoa [((Gnathostomulida (Cycliophora + Syndermata)) (Gastrotricha + Plathelminthes))]. Chaetognatha, Nemertodermatida, and Bryozoa cannot be assigned to any one of these four groups. For the first time, a data analysis recognizes a clade of acoelomates, the Platyzoa (sensu Cavalier-Smith, Biol. Rev. 73:203-266, 1998). 
Other relationships that corroborate some morphological analyses are the existence of a clade that groups Gnathostomulida + Syndermata (= Gnathifera), which is expanded to include the enigmatic phylum Cycliophora, as sister group to Syndermata.
Previous phylogenetic analyses of tetrapod 18S ribosomal RNA (rRNA) sequences support the grouping of birds with mammals, whereas other molecular data, and morphological and paleontological data favor the grouping of birds with crocodiles. The 18S rRNA gene has consequently been considered odd, serving as "definitive evidence of different genes providing significantly different estimates of phylogeny in higher organisms" (p. 156; Huelsenbeck et al., 1996, Trends Ecol. Evol. 11:152-158). Our research indicates that the previous discrepancy of phylogenetic results between the 18S rRNA gene and other genes is caused mainly by (1) the misalignment of the sequences, (2) the inappropriate use of the frequency parameters, and (3) poor sequence quality. When the sequences are aligned with the aid of the secondary structure of the 18S rRNA molecule and when the frequency parameters are estimated either from all sites or from the variable domains where substitutions have occurred, the 18S rRNA sequences no longer support the grouping of the avian species with the mammalian species.
The molecular clock, i.e., constancy of the rate of evolution over time, is commonly assumed in estimating divergence dates. However, this assumption is often violated and has drastic effects on date estimation. Recently, a number of attempts have been made to relax the clock assumption. One approach is to use maximum likelihood, which assigns rates to branches and allows the estimation of both rates and times. An alternative is the Bayes approach, which models the change of the rate over time. A number of models of rate change have been proposed. We have extended and evaluated models of rate evolution, i.e., the lognormal and its recent variant, along with the gamma, the exponential, and the Ornstein-Uhlenbeck processes. These models were first applied to a small hominoid data set, where an empirical Bayes approach was used to estimate the hyperparameters that measure the amount of rate variation. Estimation of divergence times was sensitive to these hyperparameters, especially when the assumed model is close to the clock assumption. The rate and date estimates varied little from model to model, although the posterior Bayes factor indicated the Ornstein-Uhlenbeck process outperformed the other models. To demonstrate the importance of allowing for rate change across lineages, this general approach was used to analyze a larger data set consisting of the 18S ribosomal RNA gene of 39 metazoan species. We obtained date estimates consistent with paleontological records, the deepest split within the group being about 560 million years ago. Estimates of the rates were in accordance with the Cambrian explosion hypothesis and suggested some more recent lineage-specific bursts of evolution.
Current hypotheses regarding family relationships in the suborder Adephaga (Coleoptera) are conflicting. Here we report full-length 18S ribosomal RNA sequences of 39 adephagans and 13 outgroup taxa. Data analysis focused on the impact of sequence alignment on tree topology, using two principally different approaches. Tree alignments, which seek to minimize indels and substitutions on the tree in a single step, as implemented in an approximate procedure by the computer program POY, were contrasted with a more traditional procedure based on alignments followed by phylogenetic inference based on parsimony, likelihood, and distance analyses. Despite substantial differences between the procedures, phylogenetic conclusions regarding basal relationships within Adephaga and relationships between the four suborders of Coleoptera were broadly similar. The analysis weakly supports monophyly of Adephaga, with Polyphaga usually as its sister, and the two small suborders Myxophaga and Archostemata basal to them. In some analyses, however, Polyphaga was reconstructed as having arisen from within Hydradephaga. Adephaga generally split into two monophyletic groups, corresponding to the terrestrial Geadephaga and the aquatic Hydradephaga, as initially proposed by Crowson in 1955, consistent with a single colonization of the aquatic environment by adephagan ancestors and contradicting the recent proposition of three independent invasions. A monophyletic Hydradephaga is consistently, though not strongly, supported under most analyses, and a parametric bootstrapping test significantly rejects an hypothesis of nonmonophyly. The enigmatic Trachypachidae, which exhibit many similarities to aquatic forms but whose species are entirely terrestrial, were usually recovered as a basal lineage within Geadephaga. Strong evidence opposes the view that terrestrial trachypachids are related to the dytiscoid water beetles.
NINE PROBLEMS WITH HEADS (1998)
1. Disjunct Distributions Generate Rather than Distinguish Hypotheses
2. Disjunct Taxa Used as Examples Are Mainly Conspecifics
3. Selective Sampling of Taxa
4. Ecological Constraint
5. Incomplete Sampling of Taxa
6. Dispersal Is Invoked Only When It Fits the Hypothesis
7. Current-Day Evidence Supports Glacial Extirpation
8. The Alpine Fault Is Only Recently Alpine
9. Land Surfaces Have Not Been in Continuous Existence
We introduce molecularevolution.org, a publicly available gateway for high-throughput, maximum-likelihood phylogenetic analysis powered by grid computing. The gateway features a GARLI 2.0 web service that enables a user to quickly and easily submit thousands of maximum likelihood tree searches or bootstrap searches that are executed in parallel on distributed computing resources. The GARLI web service allows one to easily specify partitioned substitution models using a graphical interface, and it performs sophisticated post-processing of phylogenetic results. Although the GARLI web service has been used by the research community for over three years, here we formally announce the availability of the service, describe its capabilities, highlight new features and recent improvements, and provide details about how the grid system efficiently delivers high-quality phylogenetic results. [GARLI, gateway, grid computing, maximum likelihood, molecular evolution portal, phylogenetics, web service.]
This special issue of Systematic Biology contains review articles, contributed by keynote speakers after the fifth edition of the “Mathematical and Computational Evolutionary Biology” conference (MCEB; see web sites below), held in 2013. We started in 2003 (under a slightly different name: “Mathematics of Evolution and Phylogeny”) at the Mathematics Research Center Henri Poincare, Paris, with well-known speakers like Walter Fitch and Joe Felsenstein. We had the feeling that the considerable efforts of the keynote speakers to synthesize and present their research in an extensive but
In the early 1990s, a comprehensive set of missions and goals for the discipline was articulated by a global community of systematists; these were presented as Systematics Agenda 2000 (1994). Abbreviated here as SA2K, this agenda spurred awareness of the field and initiated discussions about the role of systematics within biology (e.g., Blackmore and Cutler 1996; Cracraft 2002; Halanych and Goertzen 2009), in education (e.g., Krishtalka and Humphrey 2000; Thanukos 2010), and public policy (e.g., Prance 1995). After nearly 20 years of achievement and growth in systematic biology, a series of four US National Science Foundation-sponsored workshops on “Future Directions in Biodiversity and Systematics Research” was held during 2009–2010 to evaluate progress in the field and identify new directions and opportunities. Workshop participants reviewed SA2K as a way to rapidly achieve common ground and to jump-start our discussions. We did not plan to undertake a formal revision of SA2K, but our discussions led to consensus on a number of relevant points. We share these here with the intention of generating further reflection and discussion toward advancing our field and its missions.
We examined the efficiencies of ordination methods in the treatment of gene frequency data at the intraspecific level, using metric and nonmetric distance measures (Nei's and Rogers' genetic distances, chi-square distance). We assessed initial processes responsible for the geographical distribution of the Mediterranean land snail Helix aspersa. Seventeen enzyme loci from 30 North African snail populations were considered in the present analysis. Five combinations of distance/multivariate analysis were compared: correspondence analysis (CA), nonmetric multidimensional scaling (NMDS) on Nei's, Rogers', and chi-square distances, and principal coordinates analysis on Rogers' distances. Configuration of the objects resulting from ordination was projected onto three-dimensional graphics with the minimum spanning tree or the relative neighborhood graph superimposed. Pre- and postordination or clustering distance matrices were compared by means of correlation methods. As expected, all combinations led to a clear west versus east pattern of variation. However, the intraregional relationships and degree of connectivity between pairs of operational taxonomic units were not necessarily constant from one method to another. Ordination methods applied with Nei's and Rogers' distances provided the best fit to the original distances (r = 0.98), compared with UPGMA clustering (r ≈ 0.75). The Nei/NMDS combination seems to be a good compromise (distortion index dt = 10%) between Rogers/NMDS, which produces a more confusing pattern of differentiation (dt = 24%), and chi-square/CA, which tends to distort large distances (dt = 31%). NMDS provides a powerful method to summarize relationships between populations when neither hierarchical structure nor phylogenetic inference is required.
These findings prompt discussion of the good performance of NMDS, the appropriate distances to be used, and the potential application of this method to other types of allelic data (such as microsatellite loci) or to nucleotide sequence data.
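As a minimal sketch of two of the distance measures named above, the following computes Nei's standard genetic distance and Rogers' distance from allele frequencies. The formulas are the commonly cited definitions, and the abstract does not say which variants or corrections the authors used, so treat the choice as an assumption. Function names and the toy frequencies are hypothetical.

```python
import math

# A population is a list of loci; each locus is a list of allele frequencies
# that sum to 1. Both populations must list the same alleles in the same order.

def nei_distance(pop_x, pop_y):
    """Nei's standard genetic distance: D = -ln(Jxy / sqrt(Jx * Jy))."""
    jx = sum(sum(p * p for p in locus) for locus in pop_x)
    jy = sum(sum(q * q for q in locus) for locus in pop_y)
    jxy = sum(
        sum(p * q for p, q in zip(locus_x, locus_y))
        for locus_x, locus_y in zip(pop_x, pop_y)
    )
    # Undefined (log of zero) when the populations share no alleles at any locus.
    return -math.log(jxy / math.sqrt(jx * jy))

def rogers_distance(pop_x, pop_y):
    """Rogers' distance: mean over loci of sqrt(0.5 * sum((x_i - y_i)^2))."""
    per_locus = [
        math.sqrt(0.5 * sum((p - q) ** 2 for p, q in zip(locus_x, locus_y)))
        for locus_x, locus_y in zip(pop_x, pop_y)
    ]
    return sum(per_locus) / len(per_locus)

# Toy example: identical frequencies at locus 1, fixed differences at locus 2.
pop_a = [[0.5, 0.5], [1.0, 0.0]]
pop_b = [[0.5, 0.5], [0.0, 1.0]]
print(nei_distance(pop_a, pop_b), rogers_distance(pop_a, pop_b))
```

A pairwise matrix of either distance over all populations is the kind of input an ordination method such as NMDS would then summarize.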
It is now well known that incomplete lineage sorting can cause serious difficulties for phylogenetic inference, but little attention has been paid to methods that attempt to overcome these difficulties by explicitly considering the processes that produce them. Here we explore approaches to phylogenetic inference designed to consider retention and sorting of ancestral polymorphism. We examine how the reconstructability of a species (or population) phylogeny is affected by (a) the number of loci used to estimate the phylogeny and (b) the number of individuals sampled per species. Even in difficult cases with considerable incomplete lineage sorting (times between divergences less than 1 N(e) generations), we found the reconstructed species trees matched the "true" species trees in at least three out of five partitions, as long as a reasonable number of individuals per species were sampled. We also studied the tradeoff between sampling more loci versus more individuals. Although increasing the number of loci gives more accurate trees for a given sampling effort with deeper species trees (e.g., total depth of 10 N(e) generations), sampling more individuals often gives better results than sampling more loci with shallower species trees (e.g., depth = 1 N(e)). Taken together, these results demonstrate that gene sequences retain enough signal to achieve an accurate estimate of phylogeny despite widespread incomplete lineage sorting. Continued improvement in our methods to reconstruct phylogeny near the species level will require a shift to a compound model that considers not only nucleotide or character state substitutions, but also the population genetics processes of lineage sorting. [Coalescence; divergence; population; speciation.].
Although genetic methods of species identification, especially DNA barcoding, are strongly debated, tests of these methods have been restricted to a few empirical cases for pragmatic reasons. Here we use simulation to test the performance of methods based on sequence comparison (BLAST and genetic distance) and tree topology over a wide range of evolutionary scenarios. Sequences were simulated on a range of gene trees spanning almost three orders of magnitude in tree depth and in coalescent depth; that is, deep or shallow trees with deep or shallow coalescences. When the query's conspecific sequences were included in the reference alignment, the rate of positive identification was related to the degree to which different species were genetically differentiated. The BLAST, distance, and liberal tree-based methods returned higher rates of correct identification than did the strict tree-based requirement that the query was within, but not sister to, a single-species clade. Under this more conservative approach, ambiguous outcomes occurred in inverse proportion to the number of reference sequences per species. When the query's conspecific sequences were not in the reference alignment, only the strict tree-based approach was relatively immune to making false-positive identifications. Thresholds affected the rates at which false-positive identifications were made when the query's species was unrepresented in the reference alignment but did not otherwise influence outcomes. A conservative approach using the strict tree-based method should be used initially in large-scale identification systems, with effort made to maximize sequence sampling within species. Once the genetic variation within a taxonomic group is well characterized and the taxonomy resolved, then the choice of method used should be dictated by considerations of computational efficiency. The requirement for extensive genetic sampling may render these techniques inappropriate in some circumstances.
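A distance-based identification with a threshold, one of the method families compared above, might look like the following sketch. It assumes pre-aligned sequences of equal length and an uncorrected p-distance; the study's actual distance measure, thresholds, and decision rules are not given in the abstract. All names and sequences are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def identify(query, reference, threshold):
    """Assign the query to the nearest reference species.

    reference maps species name -> list of aligned sequences.
    Returns (label, distance), where label is the species name,
    "ambiguous" if two or more species tie for nearest, or None if
    the nearest reference sequence is farther than the threshold.
    """
    nearest_per_species = {
        species: min(p_distance(query, seq) for seq in seqs)
        for species, seqs in reference.items()
    }
    nearest = min(nearest_per_species.values())
    if nearest > threshold:
        return None, nearest
    winners = [sp for sp, d in nearest_per_species.items() if d == nearest]
    if len(winners) > 1:
        return "ambiguous", nearest
    return winners[0], nearest

reference = {
    "sp1": ["ACGTACGT", "ACGTACGA"],
    "sp2": ["TCGAACTT"],
}
# Query close to sp1: identified.
print(identify("ACGTACGG", reference, threshold=0.2))
# Query from an unrepresented species: rejected by a strict threshold ...
print(identify("GGGGGGGG", reference, threshold=0.2))
# ... but falsely assigned to sp1 when the threshold is too permissive.
print(identify("GGGGGGGG", reference, threshold=1.0))
```

The last two calls mirror the point made above: the threshold matters mainly when the query's species is missing from the reference set.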
A new method, ParaFit, has been developed to test the significance of a global hypothesis of coevolution between parasites and their hosts. Individual host-parasite association links can also be tested. The test statistics are functions of the host and parasite phylogenetic trees and of the set of host-parasite association links. Numerical simulations are used to show that the method has correct rate of type I error and good power except under extreme error conditions. An application to real data (pocket gophers and chewing lice) is presented.
Nuclear DNA is widely used to estimate phylogenetic and phylogeographic relationships. Nuclear gene variants may be present in an individual's genome, and these result in Intra-Individual Site Polymorphisms (2ISP; pronounced 'twisp') in direct-PCR or individual-consensus sequences based on a clone sample. 2ISPs can occur fairly often, especially within, but not restricted to, high-copy-number regions such as the widely used internal transcribed spacers of the nuclear ribosomal cistron. Dealing with 2ISPs has been problematic as phylogeny reconstruction optimality criteria generally do not take account of this variation. Here we test whether an approach that treats 2ISPs as additional (termed 'informative'), rather than ambiguous, characters offers improved support in three common criteria used for phylogenetic inference: Minimum Evolution (via Neighbour Joining), Maximum Parsimony and Maximum Likelihood. We demonstrate significant improvements using the 2ISP-informative treatment with simulated, real-world and case study datasets. We envisage that this 2ISP-informative approach will greatly aid phylogenetic inference using any nuclear DNA regions that contain polymorphisms within individuals (including consensus sequences generated from next generation sequencing), especially at the intrageneric or intraspecific level.
PhyML is a phylogeny software based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges to improve a reasonable starting tree topology. Since the original publication (Guindon S., Gascuel O. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704), PhyML has been widely used (>2500 citations in ISI Web of Science) because of its simplicity and a fair compromise between accuracy and speed. In the meantime, research around PhyML has continued, and this article describes the new algorithms and methods implemented in the program. First, we introduce a new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves. The parsimony criterion is used here to filter out the least promising topology modifications with respect to the likelihood function. The analysis of a large collection of real nucleotide and amino acid data sets of various sizes demonstrates the good performance of this method. Second, we describe a new test to assess the support of the data for internal branches of a phylogeny. This approach extends the recently proposed approximate likelihood-ratio test and relies on a nonparametric, Shimodaira-Hasegawa-like procedure. A detailed analysis of real alignments sheds light on the links between this new approach and the more classical nonparametric bootstrap method. Overall, our tests show that the last version (3.0) of PhyML is fast, accurate, stable, and ready to use. A Web server and binary files are available from http://www.atgc-montpellier.fr/phyml/.
Since its introduction in 2001, MrBayes has grown in popularity as a software package for Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) methods. With this note, we announce the release of version 3.2, a major upgrade to the latest official release presented in 2003. The new version provides convergence diagnostics and allows multiple analyses to be run in parallel with convergence progress monitored on the fly. The introduction of new proposals and automatic optimization of tuning parameters has improved convergence for many problems. The new version also sports significantly faster likelihood calculations through streaming single-instruction-multiple-data extensions (SSE) and support of the BEAGLE library, allowing likelihood calculations to be delegated to graphics processing units (GPUs) on compatible hardware. Speedup factors range from around 2 with SSE code to more than 50 with BEAGLE for codon problems. Checkpointing across all models allows long runs to be completed even when an analysis is prematurely terminated. New models include relaxed clocks, dating, model averaging across time-reversible substitution models, and support for hard, negative, and partial (backbone) tree constraints. Inference of species trees from gene trees is supported by full incorporation of the Bayesian estimation of species trees (BEST) algorithms. Marginal model likelihoods for Bayes factor tests can be estimated accurately across the entire model space using the stepping stone method. The new version provides more output options than previously, including samples of ancestral states, site rates, site dN/dS ratios, branch rates, and node dates. A wide range of statistics on tree parameters can also be output for visualization in FigTree and compatible software.
Many published phylogenies are based on methods that assume equal nucleotide composition among taxa. Studies have shown, however, that this assumption is often not accurate, particularly in divergent lineages. Nonstationary sequence evolution, when taxa in different lineages evolve in different ways, can lead to unequal nucleotide composition. This can cause inference methods to fail and phylogenies to be inaccurate. Recent advancements in phylogenetic theory have proposed new models of nonstationary sequence evolution; these models often outperform equivalent stationary models. A variety of new phylogenetic software implementing such models has been developed, but the studies employing the new methodology are still few. We discovered convergence of nucleotide composition within mitochondrial genomes of the insect order Coleoptera (beetles). We found variation in base content both among species and among genes in the genome. To this data set, we have applied a broad range of phylogenetic methods, including some traditional stationary models of evolution and all the more recent nonstationary models. We compare 8 inference methods applied to the same data set. Although the more commonly used methods universally fail to recover established clades, we find that some of the newer software packages are more appropriate for data of this nature. The software packages p4, PHASE, and nhPhyML were able to overcome the systematic bias in our data set, but parsimony, MrBayes, NJ, LogDet, and PhyloBayes were not.
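A common first check for unequal nucleotide composition among taxa is a chi-square test of homogeneity on base counts. The Python sketch below computes only the test statistic; it is an illustration under stated assumptions, not the procedure used in the study, and the function name and toy sequences are hypothetical. Note that this simple test treats sites and taxa as independent, which phylogenetically related sequences are not, so it is best read as a rough screen.

```python
from collections import Counter

BASES = "ACGT"


def composition_chi2(sequences):
    """Chi-square statistic for homogeneity of base composition among taxa.

    sequences: dict mapping taxon name -> nucleotide string.
    Characters other than A, C, G and T are ignored.
    Degrees of freedom are (number of taxa - 1) * (number of bases - 1).
    """
    counts = {
        name: Counter(base for base in seq.upper() if base in BASES)
        for name, seq in sequences.items()
    }
    column_totals = {b: sum(c[b] for c in counts.values()) for b in BASES}
    grand_total = sum(column_totals.values())
    chi2 = 0.0
    for taxon_counts in counts.values():
        row_total = sum(taxon_counts[b] for b in BASES)
        for b in BASES:
            expected = row_total * column_totals[b] / grand_total
            if expected > 0:  # skip bases absent from the whole data set
                chi2 += (taxon_counts[b] - expected) ** 2 / expected
    return chi2


# Hypothetical toy data: identical composition versus completely skewed composition.
print(composition_chi2({"taxon1": "ACGT", "taxon2": "TGCA"}))
print(composition_chi2({"taxon1": "AAAA", "taxon2": "TTTT"}))
```

A large statistic relative to the chi-square distribution with the stated degrees of freedom would suggest that a stationary model is a poor description of the data.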
Young polyploid events are easily diagnosed by various methods, but older polyploid events become increasingly difficult to identify as chromosomal rearrangements, tandem gene or partial chromosome duplications, changes in substitution rates among duplicated genes, pseudogenization or locus loss, and interlocus interactions complicate the means of inferring past genetic events. Genomic data have provided valuable information about the polyploid history of numerous species, but on their own fail to show whether related species, each with a polyploid past, share a particular polyploid event. A phylogenetic approach provides a powerful method to determine this but many processes may mislead investigators. These processes can affect individual gene trees, but most likely will not affect all genes, and almost certainly will not affect all genes in the same way. Thus, a multigene approach, which combines the large-scale aspect of genomics with the resolution of phylogenetics, has the power to overcome these difficulties and allow us to infer genomic events further into the past than would otherwise be possible. Previous work using synonymous distances among gene pairs within species has shown evidence for large-scale duplications in the legumes Glycine max and Medicago truncatula. We present a case study using 39 gene families, each with three or four members in G. max and the putative orthologues in M. truncatula, rooted using Arabidopsis thaliana. We tested whether the gene duplications in these legumes occurred separately in each lineage after their divergence (Hypothesis 1), or whether they share a round of gene duplications (Hypothesis 2). Many more gene family topologies supported Hypothesis 2 over Hypothesis 1 (11 and 2, respectively), even after synonymous distance analysis revealed that some topologies were providing misleading results. Only ca. 33% of genes examined support either hypothesis, which strongly suggests that single gene family approaches may be insufficient when studying ancient events with nuclear DNA. Our results suggest that G. max and M. truncatula, along with approximately 7000 other legume species from the same clade, share an ancient round of gene duplications, either due to polyploidy or to some other process.
The interaction between yuccas and yucca moths has been central to understanding the origin and loss of obligate mutualism and mutualism reversal. Previous systematic research using mtDNA sequence data and characters associated with genitalic morphology revealed that a widespread pollinator species in the genus Tegeticula was in fact a complex of pollinator species that differed in host use and the placement of eggs into yucca flowers. Within this mutualistic clade two nonpollinating "cheater" species evolved. Cheaters feed on yucca seeds but lack the tentacular mouthparts necessary for yucca pollination. Previous work suggested that the species complex formed via a rapid radiation within the last several million years. In this study, we use an expanded mtDNA sequence data set and AFLP markers to examine the phylogenetic relationships among this rapidly diverging clade of moths and compare these relationships to patterns in genitalic morphology. Topologies obtained from analyses of the mtDNA and AFLP data differed significantly. Both data sets, however, corroborated the hypothesis of a rapid species radiation and suggested that there were likely two independent species radiations. Morphological analyses based on oviposition habit produced species groupings more similar to the AFLP topology than the mtDNA topology and suggested the two radiations coincided with differences in oviposition habit. The evolution of cheating was reaffirmed to have evolved twice and the closest pollinating relative for one cheater species was identified by both mtDNA and AFLP markers. For the other cheater species, however, the closest pollinating relative remains ambiguous, and mtDNA, AFLP, and morphological data suggest this cheater species may be diverged based on host use. Much of the divergence in the species complex can be explained by geographic isolation associated with the evolution of two oviposition habits.
This study attempts to resolve relationships among and within the four basal arthropod lineages (Pancrustacea, Myriapoda, Euchelicerata, Pycnogonida) and to assess the widespread expectation that remaining phylogenetic problems will yield to increasing amounts of sequence data. Sixty-eight regions of 62 protein-coding nuclear genes (approximately 41 kilobases (kb)/taxon) were sequenced for 12 taxonomically diverse arthropod taxa and a tardigrade outgroup. Parsimony, likelihood, and Bayesian analyses of total nucleotide data generally strongly supported the monophyly of each of the basal lineages represented by more than one species. Other relationships within the Arthropoda were also supported, with support levels depending on method of analysis and inclusion/exclusion of synonymous changes. Removing third codon positions, where the assumption of base compositional homogeneity was rejected, altered the results. Removing the final class of synonymous mutations--first codon positions encoding leucine and arginine, which were also compositionally heterogeneous--yielded a data set that was consistent with a hypothesis of base compositional homogeneity. Furthermore, under such a data-exclusion regime, all 68 gene regions individually were consistent with base compositional homogeneity. Restricting likelihood analyses to nonsynonymous change recovered trees with strong support for the basal lineages but not for other groups that were variably supported with more inclusive data sets. In a further effort to increase phylogenetic signal, three types of data exploration were undertaken. (1) Individual genes were ranked by their average rate of nonsynonymous change, and three rate categories were assigned--fast, intermediate, and slow. Then, bootstrap analysis of each gene was performed separately to see which taxonomic groups received strong support. 
Five taxonomic groups were strongly supported independently by two or more genes, and these genes mostly belonged to the slow or intermediate categories, whereas groups supported only by a single gene region tended to be from genes of the fast category, arguing that fast genes provide a less consistent signal. (2) A sensitivity analysis was performed in which increasing numbers of genes were excluded, beginning with the fastest. The number of strongly supported nodes increased up to a point and then decreased slightly. Recovery of Hexapoda required removal of fast genes. Support for Mandibulata (Pancrustacea + Myriapoda) also increased, at times to "strong" levels, with removal of the fastest genes. (3) Concordance selection was evaluated by clustering genes according to their ability to recover Pancrustacea, Euchelicerata, or Myriapoda and analyzing the three clusters separately. All clusters of genes recovered the three concordance clades but were at times inconsistent in the relationships recovered among and within these clades, a result that indicates that the a priori concordance criteria may bias phylogenetic signal in unexpected ways. In a further attempt to increase support of taxonomic relationships, sequence data from 49 additional taxa for three slow genes (i.e., EF-1 alpha, EF-2, and Pol II) were combined with the various 13-taxon data sets. The 62-taxon analyses supported the results of the 13-taxon analyses and provided increased support for additional pancrustacean clades found in an earlier analysis including only EF-1 alpha, EF-2, and Pol II.
The microbial way of life spans at least 3.8 billion years of evolution. Microbial organisms are pervasive, ubiquitous, and essential components of all ecosystems. The geochemical composition of Earth's biosphere has been molded largely by microbial activities. Yet, despite the predominance of microbes during the course of life's history, general principles and theory of microbial evolution and ecology are not well developed. Until recently, investigators had no idea how accurately cultivated microorganisms represented overall microbial diversity. The development of molecular phylogenetics has recently enabled characterization of naturally occurring microbial biota without cultivation. Free from the biases of culture-based studies, molecular phylogenetic surveys have revealed a vast array of new microbial groups. Many of these new microbes are widespread and abundant among contemporary microbiota and fall within novel divisions that branch deep within the tree of life. The breadth and extent of extant microbial diversity has become much clearer. A remaining challenge for microbial biologists is to better characterize the biological properties of these newly described microbial taxa. This more comprehensive picture will provide much better perspective on the natural history, ecology, and evolution of extant microbial life.
Tertiary macrofossils of the flowering plant family Leguminosae (legumes) were used as time constraints to estimate ages of the earliest branching clades identified in separate plastid matK and rbcL gene phylogenies. Penalized likelihood rate smoothing was performed on sets of Bayesian likelihood trees generated with the AIC-selected GTR+Γ+I substitution model. Unequivocal legume fossils dating from the Recent continuously back to about 56 million years ago were used to fix the family stem clade at 60 million years (Ma), and at 1-Ma intervals back to 70 Ma. Specific fossils that showed distinctive combinations of apomorphic traits were used to constrain the minimum age of 12 specific internal nodes. These constraints were placed on stem rather than respective crown clades in order to bias for younger age estimates. Regardless, the mean age of the legume crown clade differs by only 1.0 to 2.5 Ma from the fixed age of the legume stem clade. Additionally, the oldest caesalpinioid, mimosoid, and papilionoid crown clades show approximately the same age range of 39 to 59 Ma. These findings all point to a rapid family-wide diversification, and predict few if any legume fossils prior to the Cenozoic. The range of the matK substitution rate, 2.1-24.6 × 10^-10 substitutions per site per year, is higher than that of rbcL, 1.6-8.6 × 10^-10, and is accompanied by more uniform rate variation among codon positions. The matK and rbcL substitution rates are highly correlated across the legume family. For example, both loci have the slowest substitution rates among the mimosoids and the fastest rates among the millettioid legumes. This explains why groups such as the millettioids are amenable to species-level phylogenetic analysis with these loci, whereas other legume groups are not.
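The arithmetic behind a rate expressed in substitutions per site per year is simply divergence divided by elapsed time. The Python sketch below shows that conversion only; it does not reproduce penalized likelihood rate smoothing. The function name is an assumption, and the branch length of 0.0126 substitutions per site over 60 Ma is a hypothetical value chosen so that the result falls near the lower end of the matK range quoted above.

```python
def substitution_rate(subs_per_site, age_ma):
    """Average rate in substitutions per site per year.

    subs_per_site: path length from node to tip, in substitutions per site.
    age_ma: age of the node in millions of years (Ma).
    """
    if age_ma <= 0:
        raise ValueError("age must be positive")
    return subs_per_site / (age_ma * 1e6)


# Hypothetical inputs, not values from the study.
rate = substitution_rate(0.0126, 60.0)
print(f"{rate:.2e} substitutions per site per year")
```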
Multigene families have provided opportunities for evolutionary biologists to assess molecular evolution processes and phylogenetic reconstructions at deep and shallow systematic levels. However, the use of these markers is not free of technical and analytical challenges. Many evolutionary studies that used the nuclear 5S rDNA gene family rarely used contiguous 5S coding sequences due to the routine use of head-to-tail PCR primers that are anchored to the coding region. Moreover, the 5S coding sequences have been concatenated with independent, adjacent gene units in many studies, creating simulated chimeric genes as the raw data for evolutionary analysis. This practice is based on the tacitly assumed, but rarely tested, hypothesis that strict intra-locus concerted evolution processes are operating in 5S rDNA genes, without any empirical evidence as to whether it holds for the recovered data. The potential pitfalls of analysing the patterns of molecular evolution and reconstructing phylogenies based on these chimeric genes have not been assessed to date. Here, we compared the sequence integrity and phylogenetic behaviour of entire versus concatenated 5S coding regions from a real data set obtained from closely related plant species (Medicago, Fabaceae). Our results suggest that within arrays sequence homogenization is partially operating in the 5S coding region, which is traditionally assumed to be highly conserved. Consequently, concatenating 5S genes increases haplotype diversity, generating novel chimeric genotypes that most likely do not exist within the genome. In addition, the patterns of gene evolution are distorted, leading to incorrect haplotype relationships in some evolutionary reconstructions.
We collected ∼29 kb of sequence data using Roche 454 pyrosequencing in order to estimate the timing and pattern of diversification in the carnivorous pitcher plant Sarracenia alata. Utilizing modified protocols for reduced representation library construction, we generated sequence data from 86 individuals across 10 populations from throughout the range of the species. We identified 76 high-quality and high-coverage loci (containing over 500 SNPs) using the bioinformatics pipeline PRGmatic. Results from a Bayesian clustering analysis indicate that populations are highly structured, and are similar in pattern to the topology of a population tree estimated using *BEAST. The pattern of diversification within Sarracenia alata implies that riverine barriers are the primary factor promoting population diversification, with divergence across the Mississippi River occurring more than 60,000 generations before present. Further, significant patterns of niche divergence and the identification of several outlier loci suggest that selection may contribute to population divergence. Our results demonstrate the feasibility of using next-generation sequencing to investigate intraspecific genetic variation in nonmodel species.
Despite the recent surge of interest in studying the evolution of development, surprisingly little work has been done to investigate the phylogenetic signal in developmental characters. Yet, both the potential usefulness of developmental characters in phylogenetic reconstruction and the validity of inferences on the evolution of developmental characters depend on the presence of such a phylogenetic signal and on the ability of our coding scheme to capture it. In a recent study, we showed, using simulations, that a new method (called the continuous analysis) using standardized time or ontogenetic sequence data and squared-change parsimony outperformed event pairing and event cracking in analyzing developmental data on a reference phylogeny. Using the same simulated data, we demonstrate that all these coding methods (event pairing and standardized time or ontogenetic sequence data) can be used to produce phylogenetically informative data. Despite some dependence between characters (the position of an event in an ontogenetic sequence is not independent of the position of other events in the same sequence), parsimony analysis of such characters converges on the correct phylogeny as the amount of data increases. In this context, the new coding method (developed for the continuous analysis) outperforms event pairing; it recovers a lower proportion of incorrect clades. This study thus validates the use of ontogenetic data in phylogenetic inference and presents a simple coding scheme that can extract a reliable phylogenetic signal from these data.
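The two families of coding schemes compared here can be sketched in Python. This is a simplified illustration, not the authors' code: the function names, the three-state pair coding (0 = earlier, 1 = simultaneous, 2 = later), the rescaling of ranks to the interval 0 to 1, and the toy ontogenetic sequence are all assumptions.

```python
from itertools import combinations


def event_pairs(ranks):
    """Event-pair coding: one discrete character per pair of events.

    ranks: dict mapping event name -> position in the ontogenetic sequence.
    State 0 = first event earlier, 1 = simultaneous, 2 = first event later.
    """
    coded = {}
    for a, b in combinations(sorted(ranks), 2):
        if ranks[a] < ranks[b]:
            coded[(a, b)] = 0
        elif ranks[a] == ranks[b]:
            coded[(a, b)] = 1
        else:
            coded[(a, b)] = 2
    return coded


def standardized_positions(ranks):
    """Continuous coding: rescale each event's position to the range 0 to 1."""
    lowest, highest = min(ranks.values()), max(ranks.values())
    if highest == lowest:  # every event is simultaneous
        return {event: 0.0 for event in ranks}
    return {event: (r - lowest) / (highest - lowest) for event, r in ranks.items()}


# Hypothetical sequence for one taxon: the eye forms first, heart and limb together.
taxon = {"eye": 1, "heart": 2, "limb": 2}
print(event_pairs(taxon))
print(standardized_positions(taxon))
```

The pair coding produces n(n-1)/2 characters for n events, which makes the non-independence among characters noted above easy to see, whereas the continuous coding yields one character per event.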
Complex organs such as eyes are commonly lost during evolution, but the timescale on which lost phenotypes could be reactivated is a matter of long-standing debate, with important implications for the molecular mechanisms of trait loss. Two phylogenetic approaches have been used to test whether regain of traits has occurred. One way is by comparison of nested, continuous-time Markov models of trait evolution, approaches that we term tree-based tests. A second way to demonstrate statistical support for trait regain is through use of node-based tests that employ explicit estimation of ancestral node states. Here, we estimate new molecular and morphological phylogenies and use them to examine the possibility of eye regain and dispersal between abyssal and shallow seas during the history of cylindroleberidid ostracods, a family of about 200 species, comprising both eyeless and sighted species. First, we confirmed that eye presence/absence is correlated with habitat depth. Parameter estimates from a phylogenetic model indicate that speciation is more rapid in deep-sea eyeless clades compared with shallow-water sighted clades. In addition, we found that tree-based statistical tests usually indicated reversals, including both transitions from deep to shallow seas and regain of eyes. In contrast, node-based statistical tests usually failed to show significant support for reversals. These results also hold for simulated phylogenies, indicating that they are not unique to the current data set. We recommend that both tree-based and node-based tests should be examined before making conclusions about character reversal and that ideally, alternative character histories should be tested using additional data, besides just the phylogenetic distribution of presence/absence of the characters.
More than a decade of phylogenetic research has yielded a well-sampled, strongly supported hypothesis of relationships within the large (> 4,000 species) plant family Acanthaceae. This hypothesis points to intriguing biogeographic patterns and asymmetries in sister clade diversity but, absent a time-calibrated estimate for this evolutionary history, these patterns have remained unexplored. Here, we reconstruct divergence times within Acanthaceae using fossils as calibration points and experimenting with both fossil selection and effects of invoking a maximum age prior related to the origin of Eudicots. Contrary to earlier reports of a paucity of fossils of Lamiales (an order of ~23,000 species that includes Acanthaceae) and to the expectation that a largely herbaceous to soft-wooded and tropical lineage would have few fossils, we recovered 51 reports of fossil Acanthaceae. Rigorous evaluation of these for accurate identification, quality of age assessment, and utility in dating yielded eight fossils judged to merit inclusion in analyses. With nearly 10 kilobases of DNA sequence data, we used two sets of fossils as constraints to reconstruct divergence times. We demonstrate differences in age estimates depending on fossil selection and that enforcement of maximum age priors substantially alters estimated clade ages, especially in analyses that utilize a smaller rather than larger set of fossils. Our results suggest that long-distance dispersal events explain present-day distributions better than do Gondwanan or northern land bridge hypotheses. This biogeographical conclusion is for the most part robust to alternative calibration schemes. Our data support a minimum of 13 Old World to New World dispersal events but, intriguingly, only one in the reverse direction. Eleven of these 13 were among Acanthaceae s.s., which comprises > 90% of species diversity in the family. 
Remarkably, if minimum age estimates approximate true history, these 11 events occurred within the last ~20 million years even though Acanthaceae s.s. is over three times as old. A simulation study confirmed that these dispersal events were significantly skewed towards the present and not simply a chance occurrence. Finally, we review reports of fossils that have been assigned to Acanthaceae that are substantially older than the lower Cretaceous estimate for Angiosperms as a whole (i.e., the general consensus that has resulted from several recent dating and fossil-based studies in plants). This is the first study to reconstruct divergence times among clades of Acanthaceae and sets the stage for comparative evolutionary research in this and related families that have until now been thought to have extremely poor fossil resources.
Idiosyncratic markers are features of genes and genomes that are so unusual that it is unlikely that they evolved more than once in a lineage of organisms. Here we explore further the potential of idiosyncratic markers and changes to typically conserved tRNA sequences for phylogenetic inference. Hard ticks were chosen as the model group because their phylogeny has been studied extensively. Fifty-eight candidate markers from hard ticks (family Ixodidae) and 22 markers from the subfamily Rhipicephalinae sensu lato were mapped onto phylogenies of these groups. Two of the most interesting markers, features of the secondary structure of two different tRNAs, gave strong support to the hypothesis that species of the Prostriata (Ixodes spp.) are monophyletic. Previous analyses of genes and morphology did not strongly support this relationship, instead suggesting that the Prostriata is paraphyletic with respect to the Metastriata (the rest of the hard ticks). Parallel or convergent evolution was not found in the arrangements of mitochondrial genes in ticks nor were there any reversals to the ancestral arthropod character state. Many of the markers identified were phylogenetically informative, whereas others should be informative with study of additional taxa. Idiosyncratic markers and changes to typically conserved nucleotides in tRNAs that are phylogenetically informative were common in this data set, and thus these types of markers might be found in other organisms.
The existence of multiple likelihood maxima necessitates algorithms that explore a large part of the tree space. However, because of computational constraints, stepwise addition-based tree-searching methods do not allow for this exploration in reasonable time. Here, I present an algorithm that increases the speed at which the likelihood landscape can be explored. The iterative algorithm combines the computational speed of distance-based tree construction methods to arrive at approximations of the global optimum with the accuracy of optimality criterion based branch-swapping methods to improve on the result of the starting tree. The algorithm moves between local optima by iteratively perturbing the tree landscape through a process of reweighting randomly drawn samples of the underlying sequence data set. Tests on simulated and real data sets demonstrated that the optimal solution obtained using stepwise addition-based heuristic searches was found faster using the algorithm presented here. Tests on a previously published data set that established the presence of tree islands under maximum likelihood demonstrated that the algorithm identifies the same tree islands in a shorter amount of time than that needed using stepwise addition. The algorithm can be readily applied using standard software for phylogenetic inference.
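The perturbation step, reweighting a random sample of alignment sites so that the search is pushed off its current local optimum, can be sketched in Python as follows. Only the weight-drawing step is shown; tree building and branch swapping would be delegated to standard phylogenetic software. The function name, the sampled fraction of 25%, and the doubled weight are assumptions for illustration, not parameters taken from the article.

```python
import random


def perturbed_site_weights(n_sites, fraction=0.25, boost=2, rng=None):
    """Return one weight per alignment site for a single perturbation round.

    A random sample of sites (fraction of the total) receives weight `boost`;
    every other site keeps weight 1. Pass a seeded random.Random for
    reproducible draws.
    """
    rng = rng or random.Random()
    n_chosen = round(n_sites * fraction)
    chosen = set(rng.sample(range(n_sites), n_chosen))
    return [boost if site in chosen else 1 for site in range(n_sites)]


# Hypothetical 20-site alignment; each iteration would search under new weights,
# then return to equal weights before the next perturbation.
weights = perturbed_site_weights(20, rng=random.Random(1))
print(weights)
```

Because the weights change the optimality surface only temporarily, a tree found under perturbed weights serves as a new starting point for a search under the original, equally weighted data.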
Molecular evolutionary rate heterogeneity (the violation of a molecular clock) is a prominent feature of many phylogenetic datasets. It has particular importance to systematists not only because of its biological implications, but also for its practical effects on our ability to infer and date evolutionary events. Here we show, using both maximum likelihood and Bayesian approaches, that a remarkably strong increase in substitution rate in the vittarioid ferns is consistent across the nuclear and plastid genomes. Contrary to some expectations, this rate increase is not due to selective forces acting at the protein level on our focal loci. The vittarioids bear no signature of the change in the relative strengths of selection and drift that one would expect if the rate increase was caused by altered post-mutation fixation rates. Instead, the substitution rate increase appears to stem from an elevated supply of mutations, perhaps limited to the vittarioid ancestral branch. This generalized rate increase is accompanied by extensive fine-scale heterogeneity in rates across loci, genomes, and taxa. Our analyses demonstrate the effectiveness and flexibility of trait-free investigations of rate heterogeneity within a model selection framework, emphasize the importance of explicit tests for signatures of selection prior to invoking selection-related or demography-based explanations for patterns of rate variation, and illustrate some unexpected nuances in the behavior of relaxed clock methods for modeling rate heterogeneity, with implications for our ability to confidently date divergence events. In addition, our data provide strong support for the monophyly of Adiantum, and for the position of Calciphilopteris in the cheilanthoid ferns, two relationships for which convincing support was previously lacking.
High-throughput DNA sequencing has the potential to accelerate species discovery if it is able to recognize evolutionary entities from sequence data that are comparable to species. The general mixed Yule-coalescent (GMYC) model estimates the species boundary from DNA surveys by identifying independently evolving lineages as a transition from coalescent to speciation branching patterns on a phylogenetic tree. Applied here to 12 families from 4 orders of insects in Madagascar, we used the model to delineate 370 putative species from mitochondrial DNA sequence variation among 1614 individuals. These were compared with data from the nuclear genome and morphological identification and found to be highly congruent (98% and 94%). We developed a modified GMYC that allows for a variable transition from coalescent to speciation among lineages. This revised model increased the congruence with morphology (97%), suggesting that a variable threshold better reflects the clustering of sequence data into biological species. Local endemism was pronounced in all 5 insect groups. Most species (60-91%) and haplotypes (88-99%) were found at only 1 of the 5 study sites (40-1000 km apart). This pronounced endemism resulted in a 37% increase in species numbers using diagnostic nucleotides in a population aggregation analysis. Sample sizes between 7 and 10 individuals represented a threshold above which there was minimal increase in genetic diversity, broadly agreeing with coalescent theory and other empirical studies. Our results from > 1.4 Mb of empirical data suggest that the GMYC model captures species boundaries comparable to those from traditional methods without the need for prior hypotheses of population coherence. This provides a method of species discovery and biodiversity assessment using single-locus data from mixed or environmental samples while building a globally available taxonomic database for future identifications.
Single-access keys are a major tool for biologists who need to identify specimens. The construction process of these keys is particularly complex (especially if the input data set is large) so having an automatic single-access key generation tool is essential. As part of the European project ViBRANT, our aim was to develop such a tool as a web service, thus allowing end-users to integrate it directly into their workflow.
IKey+ generates single-access keys on demand, for single users or research institutions. It receives user input data (using the standard SDD format), accepts several key-generation parameters (affecting the key topology and representation), and supports several output formats.
IKey+ is freely available (sources and binary packages) at www.identificationkey.fr. Furthermore, it is deployed on our server and can be queried (for testing purposes) through a simple web client also available at www.identificationkey.fr (last accessed 13 August 2012). Finally, a client plugin will be integrated into the Scratchpads biodiversity networking tool (scratchpads.eu).
Amino acid substitution models are essential to most methods to infer phylogenies from protein data. These models represent the ways in which proteins evolve and substitutions accumulate along the course of time. It is widely accepted that the substitution processes vary depending on the structural configuration of the protein residues. However, this information is very rarely used in phylogenetic studies, though the 3-dimensional structure of tens of thousands of proteins has been elucidated. Here, we reinvestigate the question in order to fill this gap. We use an improved estimation methodology and a very large database comprising 1471 nonredundant globular protein alignments with structural annotations to estimate new amino acid substitution models accounting for the secondary structure and solvent accessibility of the residues. These models incorporate a confidence coefficient that is estimated from the data and reflects the reliability and usefulness of structural annotations in the analyzed sequences. Our results with 300 independent test alignments show an impressive likelihood gain compared with standard models such as JTT or WAG. Moreover, the use of these models induces significant topological changes in the inferred trees, which should be of primary interest to phylogeneticists. Our data, models, and software are available for download from http://atgc.lirmm.fr/phyml-structure/.
Identifying and dating historical biological events is a fundamental goal of evolutionary biology, and recent analytical advances permit the modeling of factors known to affect both the accuracy and the precision of molecular date estimates. As the use of multilocus data sets becomes increasingly routine, it becomes more important to evaluate the potentially confounding effects of rate heterogeneity both within (e.g., codon positions) and among loci when estimating divergence times. Here, using Plestiodon lizards as a test case, we examine the effects of accommodating rate heterogeneity among data partitions on divergence time estimation. Plestiodon inhabits both East Asia and North America, yet both the geographic origin of the genus and the timing of dispersal between the continents have been debated. For each of the eight independently evolving loci and a combined data set, we conduct single-model and partitioned analyses. We found that extreme saturation has obscured the underlying rate of evolution in the mitochondrial DNA (mtDNA), resulting in severe underestimation of the rate in this locus. As a result, the age of the crown Plestiodon clade was overestimated by 15-17 Myr by the unpartitioned analysis of the combined loci data. However, the application of partition-specific models to the combined data resulted in ages that were fully congruent with those inferred by the individual nuclear loci. Although partitioning improved divergence date estimates of the mtDNA-only analysis, the ages were nonetheless overestimated, thus indicating an inadequacy of our current models to capture the complex nature of mtDNA evolution over large time scales. Finally, the statistically incongruent age distributions inferred by the partitioned and unpartitioned analyses of the combined data support mutually exclusive hypotheses of the timing of intercontinental dispersal of Plestiodon from Asia to North America. Analyses that best capture the rate of evolution in the combined data set infer that this exchange occurred via Beringia ∼18.0-30 Ma.
Substitution rates are one of the most fundamental parameters in a phylogenetic analysis and are represented in phylogenetic models as the branch lengths on a tree. Variation in substitution rates across an alignment of molecular sequences is well established and likely caused by variation in functional constraint across the genes encoded in the sequences. Rate variation across alignment sites is important to accommodate in a phylogenetic analysis; failure to account for across-site rate variation can cause biased estimates of phylogeny or other model parameters. Traditionally, rate variation across sites has been modeled by treating the rate for a site as a random variable drawn from some probability distribution (such as the gamma probability distribution) or by partitioning sites into different rate classes and estimating the rate for each class independently. We consider a different approach, related to site-specific models in which sites are partitioned into rate classes. However, instead of treating the partitioning scheme in which sites are assigned to rate classes as a fixed assumption of the analysis, we treat the rate partitioning as a random variable under a Dirichlet process prior. We find that the Dirichlet process prior model for across-site rate variation fits alignments of DNA sequence data better than commonly used models of across-site rate variation. The method appears to identify the underlying codon structure of protein-coding genes; rate partitions that were sampled by the Markov chain Monte Carlo procedure were closer to a partition in which sites are assigned to rate classes by codon position than to randomly permuted partitions but still allowed for additional variability across sites.
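To make the idea of a random rate partition concrete, the following minimal sketch draws one assignment of sites to rate classes from a Dirichlet process prior, using the Chinese restaurant process construction. It illustrates the prior only and is not the authors' implementation: the function names, the concentration parameter `alpha`, and the gamma base distribution for class rates (with its `shape` and `scale` values) are all assumptions made for illustration.

```python
import random


def sample_crp_partition(n_sites, alpha, rng):
    """Assign each site to a rate class via the Chinese restaurant process.

    alpha is the (assumed) concentration parameter; larger values tend to
    produce more rate classes.
    """
    assignments = []   # assignments[i] = rate class of site i
    class_sizes = []   # class_sizes[k] = number of sites already in class k
    for _ in range(n_sites):
        # An existing class k is chosen with weight class_sizes[k];
        # a brand-new class is opened with weight alpha.
        weights = class_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(class_sizes):
            class_sizes.append(1)
        else:
            class_sizes[k] += 1
        assignments.append(k)
    return assignments


def sample_class_rates(n_classes, shape, scale, rng):
    """Draw one rate per class from a gamma base distribution (an assumption)."""
    return [rng.gammavariate(shape, scale) for _ in range(n_classes)]


def codon_position_partition(n_sites):
    """Fixed reference partition: sites grouped by codon position (0, 1, 2)."""
    return [i % 3 for i in range(n_sites)]


if __name__ == "__main__":
    rng = random.Random(42)
    partition = sample_crp_partition(30, alpha=1.0, rng=rng)
    rates = sample_class_rates(max(partition) + 1, shape=2.0, scale=0.5, rng=rng)
    site_rates = [rates[k] for k in partition]
    print(partition)
    print(codon_position_partition(30))
```

In a full analysis the partition would be updated by Markov chain Monte Carlo together with the tree and the class rates; here a single prior draw is shown next to the fixed codon-position partition only to contrast a random partitioning scheme with a fixed one.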