PartitionFinder 2: New Methods for Selecting Partitioned Models of Evolution for Molecular and Morphological Phylogenetic Analyses

Abstract

PartitionFinder 2 is a program for automatically selecting best-fit partitioning schemes and models of evolution for phylogenetic analyses. PartitionFinder 2 is substantially faster and more efficient than version 1, and incorporates many new methods and features. These include the ability to analyze morphological datasets, new methods to analyze genome-scale datasets, new output formats to facilitate interoperability with downstream software, and many new models of molecular evolution. PartitionFinder 2 is freely available under an open source license and works on Windows, OSX, and Linux operating systems. It can be downloaded from www.robertlanfear.com/partitionfinder. The source code is available at https://github.com/brettc/partitionfinder.
Robert Lanfear,*,1,2 Paul B. Frandsen,3 April M. Wright,4 Tereza Senfeld,2 and Brett Calcott5
1 Research School of Biology, Australian National University, Canberra, ACT, Australia
2 Department of Biological Sciences, Macquarie University, Sydney, Australia
3 Office of Research Information Services, Office of the Chief Information Officer, Smithsonian Institution, Washington, DC
4 Ecology, Evolution and Organismal Biology, Iowa State University, Ames, IA
5 Department of Philosophy, University of Sydney, Sydney, NSW, Australia
* Corresponding author: E-mail: rob.lanfear@anu.edu.au.
Associate editor: Michael S. Rosenberg
Key words: partitioning, AIC, BIC, AICc, model selection, molecular evolution.
Main Text
In phylogenetic analyses it is important to account for variation in rates and patterns of evolution among sites (Yang 1996; Kumar et al. 2012). Partitioning attempts to achieve this by estimating independent models of molecular evolution for subsets of sites that are deemed to have evolved in similar ways. It can be challenging to choose a good partitioning scheme, because the number of possible schemes can be extremely large.
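The scale of the problem can be made concrete. If a partitioning scheme is taken to be any grouping of N data blocks into non-overlapping subsets, the number of possible schemes is the Bell number B(N). The Python sketch below computes it with the Bell triangle; the function name is illustrative and is not part of PartitionFinder.

```python
def bell_number(n: int) -> int:
    """Count the ways to split n data blocks into non-empty subsets."""
    if n < 0:
        raise ValueError("n must be non-negative")
    row = [1]  # first row of the Bell triangle, so B(0) = 1
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the neighbour from the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


if __name__ == "__main__":
    for blocks in (3, 6, 12):
        print(blocks, "data blocks:", bell_number(blocks), "possible schemes")
```

Three data blocks (for example, the three codon positions of one gene) allow only 5 schemes, but 12 data blocks already allow 4,213,597, which is why exhaustive evaluation quickly becomes impractical.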
The original version of PartitionFinder (Lanfear et al. 2012) proposed algorithms to automate the selection of a partitioning scheme given a set of user-defined data blocks as input. By combining these algorithms with the selection of models of molecular evolution, PartitionFinder improved and simplified phylogenetic analyses for many users. However, PartitionFinder was written before the advent of phylogenomic datasets such as those produced by sequencing whole genomes (e.g., Jarvis et al. 2014) and transcriptomes (e.g., Misof et al. 2014), and remains too slow to be practical for use with these datasets. Because of this, we designed new features and re-wrote all of the methods and routines in PartitionFinder, which we present as PartitionFinder 2.
PartitionFinder 2 includes a number of new features. First, we wrote faster versions of the k-means, relaxed-clustering, and greedy algorithms (Lanfear et al. 2014; Frandsen et al. 2015), although we urge caution with relying on purely data-driven approaches to partitioning such as k-means, because we still lack evidence that they perform appropriately under a wide range of simulation conditions (Frandsen et al. 2015). Second, we included a range of new models of evolution, including important recent advances such as the LG4X and LG4M mixture models (Le et al. 2012). Third, we implemented Maximum-Likelihood (ML) starting trees for all analyses, motivated by our observation that model selection methods can be biased by the choice of starting tree (Frandsen et al. 2015). Fourth, we implemented the ability to analyze morphological datasets. Finally, we included a variety of new output formats to improve interoperability with downstream software.
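As a rough illustration of how a greedy search over partitioning schemes can work, the Python sketch below starts with every data block in its own subset and repeatedly applies the single pairwise merge that most improves an information criterion, stopping when no merge helps. It is a toy under stated assumptions and not the PartitionFinder implementation: the data blocks, their rates, `toy_log_likelihood`, and the parameter count per subset are all invented for illustration. Only the BIC formula, k ln(n) - 2 ln(L), is standard.

```python
import math
from itertools import combinations

# Hypothetical data blocks: name -> (number of sites, toy evolutionary rate).
BLOCKS = {
    "gene1_pos1": (300, 0.20),
    "gene1_pos2": (300, 0.22),
    "gene1_pos3": (300, 1.50),
    "gene2_pos3": (300, 1.45),
}
PARAMS_PER_SUBSET = 10  # assumed parameter cost of one substitution model


def toy_log_likelihood(subset):
    """Stand-in for a real likelihood: penalise mixing blocks whose rates differ."""
    sites = sum(BLOCKS[b][0] for b in subset)
    mean_rate = sum(BLOCKS[b][0] * BLOCKS[b][1] for b in subset) / sites
    return -sum(BLOCKS[b][0] * (BLOCKS[b][1] - mean_rate) ** 2 for b in subset)


def bic(scheme):
    """BIC = k ln(n) - 2 ln(L); lower is better."""
    n_sites = sum(sites for sites, _ in BLOCKS.values())
    k = PARAMS_PER_SUBSET * len(scheme)
    log_l = sum(toy_log_likelihood(s) for s in scheme)
    return k * math.log(n_sites) - 2 * log_l


def greedy_search(blocks):
    """Merge the best-scoring pair of subsets until no merge lowers the BIC."""
    scheme = [frozenset([b]) for b in blocks]
    best = bic(scheme)
    while len(scheme) > 1:
        candidates = []
        for a, b in combinations(scheme, 2):
            merged = [s for s in scheme if s not in (a, b)] + [a | b]
            candidates.append((bic(merged), merged))
        score, merged = min(candidates, key=lambda c: c[0])
        if score >= best:
            break  # no pairwise merge improves the criterion
        best, scheme = score, merged
    return scheme, best


if __name__ == "__main__":
    final_scheme, final_bic = greedy_search(BLOCKS)
    for subset in final_scheme:
        print(sorted(subset))
    print(f"BIC = {final_bic:.2f}")
```

With these invented rates the search groups the two slowly evolving blocks together and the two third-codon-position blocks together, then stops because merging the two groups would cost far more in likelihood than it saves in parameters. In a real analysis each candidate scheme would be scored by fitting substitution models to the alignment, which is where nearly all of the computation goes.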
In addition to new features, we also implemented a number of improvements that enable the efficient analysis of genome-scale datasets. These include: a new alignment parser; more efficient use of multiple processors; a dramatic reduction in the number of files that are written and read; and many improvements in internal and external data storage and processing. These improvements streamline analyses and help to make the best use of the available computational resources.
Brief Communication
© The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Mol. Biol. Evol. 34(3):772–773 doi:10.1093/molbev/msw260 Advance Access publication December 24, 2016
The net result of the new features and improvements is that PartitionFinder 2 can be dramatically faster than its predecessor, particularly for very large datasets analyzed on computers with many processors. To illustrate this, we compared the performance of version 2.0.0 to version 1.1.1 on two datasets: an insect dataset comprising 2,868 protein domains (each specified as a separate data block) and 595,033 sites from 144 taxa (Misof et al. 2014); and a vertebrate dataset of 56 genes (split into 168 codon-position data blocks) and 25,919 sites from 110 taxa (Fong et al. 2012). We used Maximum Parsimony starting trees in all analyses to enable direct comparisons of execution times. We analyzed the insect dataset on a server with fifty-six 2.6 GHz processors, using the new fast relaxed clustering (rclusterf) algorithm in version 2.0.0, and the original relaxed clustering algorithm in version 1.1.1, both with default settings. Version 2.0.0 was more than 100 times faster than version 1.1.1: it completed the analysis in 35 h, while version 1.1.1 finished less than 1% of the analysis in the same time. We analyzed the vertebrate dataset on a desktop Macintosh computer with eight 4 GHz processors, using the greedy algorithm with precisely the same settings in versions 1.1.1 and 2.0.0. Version 2.0.0 was five times faster than version 1.1.1: it completed the analysis in 108 min compared with 534 min for version 1.1.1.
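The reported speed-ups follow directly from the run times quoted above, as this small Python check shows. The insect figure is a lower bound: it assumes that version 1.1.1 had completed at most 1% of the analysis after 35 h and would have continued at the same average pace, which is an extrapolation and not a measured run time.

```python
# Vertebrate dataset: wall-clock minutes reported for each version.
v1_minutes, v2_minutes = 534, 108
vertebrate_speedup = v1_minutes / v2_minutes  # roughly five-fold

# Insect dataset: version 2.0.0 finished in 35 h, while version 1.1.1
# had completed less than 1% of the analysis in the same time.
v1_fraction_done = 0.01  # upper bound on the progress of version 1.1.1
insect_speedup_lower_bound = 1 / v1_fraction_done

print(f"vertebrate: {vertebrate_speedup:.1f}x")
print(f"insect: more than {insect_speedup_lower_bound:.0f}x")
```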
PartitionFinder 2 can be installed by downloading it from the website above, or installing it via GitHub. No other programs need to be compiled, but it does require the installation of Python and a small number of dependencies. These can be managed via a point-and-click installer, following the details outlined in the manual. We hope that PartitionFinder 2 will be useful to the phylogenetics community.
Acknowledgments
RML was supported by the Australian Research Council. AMW was supported by NSF DEB-1256993. This work was supported by the Macquarie University Genes to Geoscience center.
References
Frandsen PB, Calcott B, Mayer C, Lanfear R. 2015. Automatic selection of partitioning schemes for phylogenetic analyses using iterative k-means clustering of site rates. BMC Evol Biol. 15:13.
Fong JJ, Brown JM, Fujita MK, Boussau B. 2012. A phylogenomic approach to vertebrate phylogeny supports a turtle-archosaur affinity and a possible paraphyletic lissamphibia. PLoS One 7(11):e48990.
Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho SYW, Faircloth BC, Nabholz B, Howard JT, et al. 2014. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346:1320–1331.
Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K. 2012. Statistics and truth in phylogenomics. Mol Biol Evol. 29:457–472.
Lanfear R, Calcott B, Kainer D, Mayer C, Stamatakis A. 2014. Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evol Biol. 14:82.
Lanfear R, Calcott B, Ho SYW, Guindon S. 2012. Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol Biol Evol. 29:1695–1701.
Le SQ, Dang CC, Gascuel O. 2012. Modeling protein evolution with several amino acid replacement matrices depending on site rates. Mol Biol Evol. 29:2921–2936.
Misof B, Liu S, Meusemann K, Peters RS, Donath A, Mayer C, Frandsen PB, Ware J, Flouri T, Beutel RG, et al. 2014. Phylogenomics resolves the timing and pattern of insect evolution. Science 346:763–767.
Yang Z. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol Evol. 11:367–372.
... Sequences were aligned using MAFFT v.7.455 (Katoh & Standley, 2013) and the L-INS-i algorithm. For each region, the best codon partitioning scheme and evolutionary model available in BEAST were selected using the Bayesian Information Criterion (BIC) in Parti-tionFinder v.2.1.1 (Lanfear et al., 2016), using the greedy algorithm (Lanfear et al., 2012). ...
Article
1. The evolution of host specialization and feeding habits in non-pollinating fig wasps remains poorly understood. 2. This study investigates these dynamics within the wasp genus Idarnes, which exhibits diverse life strategies and degrees of host specificity. 3. We reconstructed the phylogeny of 57 Idarnes species using four genetic markers-two mitochondrial, one nuclear and one ribosomal-and performed ancestral state reconstructions to elucidate the evolution of host specialization. 4. Our analyses reveal that host monophagy is ancestral in Idarnes, with oligophagy evolving independently in several lineages. Oligophagous species are more prevalent in kleptoparasitic compared to gall-inducing lineages. Early gallers-which are wasps that colonize figs during the pre-receptive stage of development-are predominantly monophagous, while some oligophagous species occur in the galler line-age that uses flower ovaries at the fig receptive phase. 5. The relative ovipositor length correlates with the number of host species visited by kleptoparasite species, suggesting morphological adaptations play a role in host specialization. In the gall-inducing lineages, the use of multiple host species seems not primarily constrained by the ovipositor size but possibly by the capacity of the wasp to induce galls in a non-typical host species. 6. This study provides valuable insights into the mechanisms driving host specialization in fig wasps, enhancing our understanding of this intricate insect-plant system. K E Y W O R D S ancestral state reconstruction, evolutionary ecology, Ficus, phylogenetic comparative methods, Pteromalidae
... The DNA sequences generated in this study were aligned with those obtained from GenBank ( [32] to combine multiple sequences in a specified order, generating a new complete sequence. PartitionFinder v2.0 (Australian National University, Canberra, Australia) [33] was then used to determine optimal nucleotide substitution models for each data partition. The best model for both ITS and the LSU is TRNEF+G. ...
Article
Full-text available
Naematelia aurantialba and its allies are important edible and medicinal mushrooms in China. They are usually called Jiner (金耳) and have been cultivated on a commercial scale. However, due to the lack of DNA sequences from the holotype of Naematelia aurantialba, the taxonomic issues of the species complex are unresolved. In this study, the authors successfully generated DNA sequences from the holotype of N. aurantialba by a genome skimming approach and additional allied species by Sanger sequencing. Based on morphological characteristics, molecular phylogenetic data, and geographic distribution patterns, four species, including three new ones, in the complex in southwestern China were uncovered. Naematelia aurantialba occurs at high altitudes (over 3000 m above sea level), with subalpine dead plants as its substrates, and has larger basidiospores, while the commonly cultivated species, described as N. sinensis in this work, is distributed in subtropical areas at altitudes between 1800 m and 2600 m on the dead wood of subtropical plants and has smaller basidiospores. The third species, namely N. nodulosa, has habitats similar to those of N. sinensis but differs from the latter in its basidiomata with an uneven nodulose surface, a loose context with small internal cavities, and numerous conidia. The fourth species, N. pedicellata, is easily distinguished from the others by its basidia, with long basal stalks and broadly ellipsoid basidiospores measuring 10.5–12.5 × 8.0–10.0 μm. All these species are parasitic on Stereum species. This study provides a solid basis for future guidance for the selection of new strains and cultivation practices of these valuable fungi.
... The best partitioning scheme and evolutionary models were selected using PartitionFinder2 v2.1.1, with the greedy algorithm under the corrected Akaike information (AICc) criterion [28]. Afterward, a phylogenetic tree was constructed using IQ-TREE v.2.2.0 software based on the maximum likelihood method with bootstrapping values defined from 5000 repetitions [29]. ...
Article
Full-text available
Phlebotomus sichuanensis, considered a potential vector for visceral leishmaniasis (VL), is distributed in the southern Gansu and northern Sichuan regions in China. However, the high similarity in the morphology of P. sichuanensis and P. chinensis s.s. poses unresolved taxonomic challenges. In this study, phlebotomine sand flies were collected from three locations in the southern Gansu and northern Sichuan regions (SCB group) and three locations that are the dominant distribution areas of P. chinensis s.s. (ZHB group). Their whole mitochondrial genomes were sequenced and analyzed. The differential analysis revealed that there were 339 fixed differential sites in the mitochondrial genome-coding region of P. chinensis s.s. and P. sichuanensis, among which the COI gene had the most differential sites (57), followed by ND5 (46), ND4 (38), and CYTB (37), while ATP8 had the least differential sites (4). The molecular genetic p-distance was calculated based on 13 protein-coding regions, and the genetic distance ranged from 0.001 to 0.018 in the ZHB group and from 0.001 to 0.006 in the SCB group, while the interspecies molecular genetic distance was 0.464–0.466 between the two groups. A phylogenetic maximum likelihood tree was constructed from 16 samples via tandem sequence of 13 protein-coding regions, and the topology showed that the ZHB and SCB groups formed separate clusters. A real-time PCR method was established based on the differences in the COI fragment, which can identify P. sichuanensis from P. chinensis s.s. effectively. This study presents objective evidence of the genetic differentiation between P. sichuanensis and P. chinensis s.s., and provides a method for identifying these two morphologically highly similar VL-transmitting sandflies.
... Partitioned models are proposed to describe the site heterogeneity and improve the accuracy of phylogenetic inference. In partitioned models, sites are divided into several disjoint subsets that apply different models, each of which may have different substitution rates, base frequencies, and branch lengths (Lanfear et al. 2017). The importance of partitioned models has been demonstrated in phylogenomic analysis (Chernomor et al. 2016), as they can affect tree branches, topology, bootstrap support, and divergence date (Brandley et al. 2005;Ho and Lanfear 2010;Leavitt et al. 2013;Poux et al. 2008; Rota and Wahlberg 2012;Le Kim and Le Sy 2020). ...
Article
Full-text available
Phylogenetics has been widely used in molecular biology to infer the evolutionary relationships among species. With the rapid development of sequencing technology, genomic data with thousands of sites become increasingly common in phylogenetic analysis, while heterogeneity among sites arises as one of the major challenges. A single homogeneous model is not sufficient to describe the evolution of all sites and partitioned models are often employed to model the evolution of heterogeneous sites by partitioning them into distinct groups and utilizing distinct evolutionary models for each group. It is crucial to determine the best partitioning, which greatly affects the reconstruction correctness of phylogeny. However, the best partitioning is usually intractable to obtain in practice. Traditional partitioning methods rely on heuristic algorithms or greedy search to determine the best ones in their solution space, are usually time consuming, and with no guarantee of optimality. In this study, we propose a novel partitioning approach, termed PsiPartition, based on the parameterized sorting indices of sites and Bayesian optimization. We apply our method to empirical datasets, and it performs significantly better compared to existing methods, in terms of Bayesian information criterion (BIC) and the corrected Akaike information criterion (AICc). We test PsiPartition on the simulated datasets with different site heterogeneity, alignment lengths, and number of loci. It is demonstrated that PsiPartition evidently and stably outperforms other methods in terms of the Robinson–Foulds (RF) distance between the true simulated trees and the reconstructed trees, especially on the data with more site heterogeneity. More importantly, our proposed Bayesian optimization-based method, for the first time, provides a new general framework to efficiently determine the optimal number of partitions. 
The corresponding reproducible source code and data are available at http://github.com/xu-shi-jie/PsiPartition.
... A maximum-likelihood (ML) tree was constructed based on nine obtained COI genes and seventeen Prochas COI genes downloaded from BOLD, with species Cryptophion inaequalipes as an outgroup. Using PartitionFinder ver 2.1.1 [31], the best fitted model was identified, and then cluster analysis was carried out through the maximum-likelihood (ML) method using MEGA 11. Bootstrap analysis was performed based on 1000 resampling. ...
Article
Full-text available
DNA barcoding is an effective modern tool in taxonomy, evolutionary biology, and biodiversity research. Many new species have been discovered and described with DNA barcodes as part of their diagnostic features. We combined morphological examination and molecular species delimitation of the mitochondrial cytochrome c oxidase 1 (COI) gene using the automatic barcode gap discovery (ABGD) to investigate species boundaries. The genus Prochas Walkley (Hymenoptera, Ichneumonidae, Campopleginae) was first reported from China and is new for the Oriental and Eastern Palearctic regions. Using an integrative taxonomy method, two new species P. rugipunctata sp. nov. and P. striata sp. nov. are hereby described and illustrated. A key to the world species and a distribution map are provided.
... We used the default cut-off score of 0.93 in all single gene alignments. Partition-Finder2 was used to select the best partition for our data and substitution models (Lanfear et al. 2016). A single substitution model was selected for each region (GTR+G for ITS1 and ITS2, TRNEF+I+G for 5.8S and LSU, JC for SSU, F81+I+G for the first and second codon position of TEF1, TVM+G for the third codon position of TEF1, TIM+I for the first codon position of RPB1, K81UF+I the second codon position of RPB1 and HKY+I+G for the third codon position of RPB1) under a greedy search algorithm and the Akaike Information Criterion (AIC) (Lanfear et al. 2012). ...
Article
The Neotropics, particularly mountain cloud forests, are characterized by a high diversity of microfungi that inhabit lichens. However, based on our field studies, many of these microfungi remain undescribed, and their phylogenetic relationships are poorly understood. This study focuses on Bolivian lichenicolous Trichonectria inhabiting the genus Usnea (Parmeliaceae), a common lichen host in the tropical Andean forests. Here, we present 14 species of which eight are described as new to science: Trichonectria abortispora sp. nov., T. biglobospora sp. nov., T. boliviana sp. nov., T. citrispora sp. nov., T. cylindrospora sp. nov., T. gigaspora sp. nov., T. microsporusneae sp. nov. and T. toensbergiana sp. nov. The five-locus phylogenetic analyses show that the anamorphic genus Cylindromonium and the teleomorphic genus Trichonectria cluster together in a well-supported clade within the order Hypocreales, but we have not yet reached a taxonomic conclusion. Phylogenetic placements of five lichenicolous species of the genus Trichonectria are reported here for the first time, including T. vinosa comb. nov.
... The final concatenated alignment was divided into five partitions (three codons each for COI, 16S rRNA and 28S rRNA). The best-fit substitution model for each partition was determined using PartitionFinder2 v.2.3.4 [55] with the corrected Akaike Information Criterion (AICc). The best-fit model was identified as GTR+I for the first and second codons of COI, GTR+G for the third codon of COI, and GTR+I+G for 16S rRNA and 28S rRNA genes. ...
Article
Full-text available
Stingless beekeeping, also known as meliponiculture, has gained increasing popularity in many tropical and subtropical countries for its use in commercial pollination and high-value honey and propolis production. However, this rising interest in stingless beekeeping has led to significant geographical displacements of bee colonies by beekeepers, occasionally surpassing their native ranges. Consequently, this affects local bee populations by disrupting gene flow across unnaturally large geographic scales. For Heterotrigona itama, one of the most common stingless bee species in Southeast Asian countries, including Thailand, there is concern that large-scale artificial propagation by beekeepers utilizing a limited number of bee colonies will lead to inbreeding. This practice leads to increased inbreeding within managed populations and introgression into wild populations. These concerns highlight the need for careful management practices in stingless beekeeping to mitigate potential adverse effects. To assess the genetic structure of H. itama in Thailand, 70 colonies were sampled, and partially sequenced cytochrome c oxidase subunit 1 (COI) gene, large ribosomal subunit rRNA gene (16S rRNA), and 28S large ribosomal subunit rDNA gene (28S rRNA) were analyzed. Our results showed slightly lower nuclear genetic variability, but higher mitochondrial genetic variability, which can be attributed to gene flow, colony transport, and nest division. We suggest that increasing the number of colonies maintained through nest division does not negatively affect genetic variability, as it is maintained by small-scale male dispersal and human-mediated nest transport. However, caution should be exercised when transporting nests from distant localities, considering the high genetic differentiation observed between samples from Narathiwat and those from Krabi and Nakhon Si Thammarat provinces, which might indicate local adaptation.
Article
Full-text available
The phylogenetic studies of the tribe Alsineae (Caryophyllaceae) have revealed a clearer boundary between the genus Stellaria and related genera, primarily relying on the morphological characteristics of style 3, stamens 10 and petals deeply bifid. However, the newly-published species in China, which have 5 styles or ten or more lobes per petal, challenge this boundary and necessitate further studies. In this paper, we reviewed six newly-published Chinese species of Stellaria , utilising both molecular phylogenetic evidence from nuclear ribosomal internal transcribed spacer (ITS) and four plastid regions ( trnL-F , matK , rbcL , rps16 ) and morphological evidence. Our results demonstrated that the five new species ( Stellaria abaensis , S. multipartita , S. pentastyla , S. procumbens and S. zhuxiensis ) were nested within the genus Stellaria , but Stellaria motuoensis was sister to the genus Schizotechium . Herein, we accepted four new Stellaria species and proposed a new combination in Schizotechium and a new synonym in Stellaria . Additionally, we described a new species Stellaria longipedicellata from Sichuan Province, China, which was distinguished from the closely-related species Stellaria decumbens by its glabrous body, linear-lanceolate leaves, long pedicellate flowers, prostrate growth habit and flowers nearly equal to or slightly shorter than sepals. Both molecular and morphological evidence support the treatment of S. longipedicellata as a new species of the genus Stellaria .
Article
Full-text available
The extant colobine monkeys are a large primate radiation represented by two geographic subtribes, the African Colobina Blyth, 1875 and the Asian Presbytina Gray, 1825. The phylogenetic relationships of the colobinans are well resolved, but uncertainty persists among presbytinans. This study combines a large molecular dataset with a novel morphological matrix to 1) reassess relationships of the presbytinans using a total evidence phylogenetic approach, and 2) revisit the comparative morphology of the colobines within the context of an updated phylogenetic hypothesis. Previously supported relationships of colobinans are replicated here. Among presbytinans, Presbytis Eschscholtz, 1821 is the sister to the other presbytinans, Semnopithecus Desmarest, 1822 and Trachypithecus Reichenbach, 1862 form a clade, and Rhinopithecus Milne-Edwards, 1872 is sister to the other odd-nosed colobines (Pygathrix E. Geoffroy, 1812, Simias Miller, 1903, and Nasalis E. Geoffroy, 1812). Several features diagnostic of the subtribes and clades within each subtribe are identified. The skeletal diversity of the presbytinans and the presence of few features that unite the subtribe may be attributable to the recent and rapid nature of their diversification and to substantial historical introgression among lineages. Finally, this work provides a foundation for future studies including the fossil colobines, whose relationships remain largely unresolved.
Article
Full-text available
Model selection is a vital part of most phylogenetic analyses, and accounting for the heterogeneity in evolutionary patterns across sites is particularly important. Mixture models and partitioning are commonly used to account for this variation, and partitioning is the most popular approach. Most current partitioning methods require some a priori partitioning scheme to be defined, typically guided by known structural features of the sequences, such as gene boundaries or codon positions. Recent evidence suggests that these a priori boundaries often fail to adequately account for variation in rates and patterns of evolution among sites. Furthermore, new phylogenomic datasets such as those assembled from ultra-conserved elements lack obvious structural features on which to define a priori partitioning schemes. The upshot is that, for many phylogenetic datasets, partitioned models of molecular evolution may be inadequate, thus limiting the accuracy of downstream phylogenetic analyses. We present a new algorithm that automatically selects a partitioning scheme via the iterative division of the alignment into subsets of similar sites based on their rates of evolution. We compare this method to existing approaches using a wide range of empirical datasets, and show that it consistently leads to large increases in the fit of partitioned models of molecular evolution when measured using AICc and BIC scores. In doing so, we demonstrate that some related approaches to solving this problem may have been associated with a small but important bias. Our method provides an alternative to traditional approaches to partitioning, such as dividing alignments by gene and codon position. Because our method is data-driven, it can be used to estimate partitioned models for all types of alignments, including those that are not amenable to traditional approaches to partitioning.
Article
Full-text available
To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago. Copyright © 2014, American Association for the Advancement of Science.
Article
Full-text available
Insects are the most speciose group of animals, but the phylogenetic relationships of many major lineages remain unresolved. We inferred the phylogeny of insects from 1478 protein-coding genes. Phylogenomic analyses of nucleotide and amino acid sequences, with site-specific nucleotide or domain-specific amino acid substitution models, produced statistically robust and congruent results resolving previously controversial phylogenetic relations hips. We dated the origin of insects to the Early Ordovician [~479 million years ago (Ma)], of insect flight to the Early Devonian (~406 Ma), of major extant lineages to the Mississippian (~345 Ma), and the major diversification of holometabolous insects to the Early Cretaceous. Our phylogenomic study provides a comprehensive reliable scaffold for future comparative analyses of evolutionary innovations among insects.
Article
Full-text available
Partitioning involves estimating independent models of molecular evolution for different subsets of sites in a sequence alignment, and has been shown to improve phylogenetic inference. Current methods for estimating best-fit partitioning schemes, however, are only computationally feasible with datasets of less than 100 loci. This is a problem because datasets with thousands of loci are increasingly common in phylogenetics. We develop two novel methods for estimating best-fit partitioning schemes on large phylogenomic datasets: strict and relaxed hierarchical clustering. These methods use information from the underlying data to cluster together similar subsets of sites in an alignment, and build on clustering approaches that have been proposed elsewhere. We compare the performance of our methods to each other, and to existing methods for selecting partitioning schemes. We demonstrate that while strict hierarchical clustering has the best computational efficiency on very large datasets, relaxed hierarchical clustering provides scalable efficiency and returns dramatically better partitioning schemes as assessed by common criteria such as AICc and BIC scores. These two methods provide the best current approaches to inferring partitioning schemes for very large datasets. We provide free open-source implementations of the methods in the PartitionFinder software. We hope that the use of these methods will help to improve the inferences made from large phylogenomic datasets.
In resolving the vertebrate tree of life, two fundamental questions remain: 1) what is the phylogenetic position of turtles within amniotes, and 2) what are the relationships between the three major lissamphibian (extant amphibian) groups? These relationships have historically been difficult to resolve, with five different hypotheses proposed for turtle placement, and four proposed branching patterns within Lissamphibia. We compiled a large cDNA/EST dataset for vertebrates (75 genes for 129 taxa) to address these outstanding questions. Gene-specific phylogenetic analyses revealed a great deal of variation in preferred topology, resulting in topologically ambiguous conclusions from the combined dataset. Due to consistent preferences for the same divergent topologies across genes, we suspected systematic phylogenetic error as a cause of some variation. Accordingly, we developed and tested a novel statistical method that identifies sites that have a high probability of containing biased signal for a specific phylogenetic relationship. After removing putatively biased sites, support emerged for a sister relationship between turtles and either crocodilians or archosaurs, as well as for a caecilian-salamander sister relationship within Lissamphibia, with Lissamphibia potentially paraphyletic.
Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from http://www.atgc-montpellier.fr/models/lg4x.
Phylogenomics refers to the inference of historical relationships among species using genome-scale sequence data and to the use of phylogenetic analysis to infer protein function in multigene families. With rapidly decreasing sequencing costs, phylogenomics is becoming synonymous with evolutionary analysis of genome-scale and taxonomically densely sampled data sets. In phylogenetic inference applications, this translates into very large data sets that yield evolutionary and functional inferences with extremely small variances and high statistical confidence (P value). However, reports of highly significant P values are increasing even for contrasting phylogenetic hypotheses depending on the evolutionary model and inference method used, making it difficult to establish true relationships. We argue that the assessment of the robustness of results to biological factors, that may systematically mislead (bias) the outcomes of statistical estimation, will be a key to avoiding incorrect phylogenomic inferences. In fact, there is a need for increased emphasis on the magnitude of differences (effect sizes) in addition to the P values of the statistical test of the null hypothesis. On the other hand, the amount of sequence data available will likely always remain inadequate for some phylogenomic applications, for example, those involving episodic positive selection at individual codon positions and in specific lineages. Again, a focus on effect size and biological relevance, rather than the P value, may be warranted. Here, we present a theoretical overview and discuss practical aspects of the interplay between effect sizes, bias, and P values as it relates to the statistical inference of evolutionary truth in phylogenomics.
Although several decades of study have revealed the ubiquity of variation of evolutionary rates among sites, reliable methods for studying rate variation were not developed until very recently. Early methods fit theoretical distributions to the numbers of changes at sites inferred by parsimony and substantially underestimate the rate variation. Recent analyses show that failure to account for rate variation can have drastic effects, leading to biased dating of speciation events, biased estimation of the transition:transversion rate ratio, and incorrect reconstruction of phylogenies.
In phylogenetic analyses of molecular sequence data, partitioning involves estimating independent models of molecular evolution for different sets of sites in a sequence alignment. Choosing an appropriate partitioning scheme is an important step in most analyses because it can affect the accuracy of phylogenetic reconstruction. Despite this, partitioning schemes are often chosen without explicit statistical justification. Here, we describe two new objective methods for the combined selection of best-fit partitioning schemes and nucleotide substitution models. These methods allow millions of partitioning schemes to be compared in realistic time frames and so permit the objective selection of partitioning schemes even for large multilocus DNA data sets. We demonstrate that these methods significantly outperform previous approaches, including both the ad hoc selection of partitioning schemes (e.g., partitioning by gene or codon position) and a recently proposed hierarchical clustering method. We have implemented these methods in an open-source program, PartitionFinder. This program allows users to select partitioning schemes and substitution models using a range of information-theoretic metrics (e.g., the Bayesian information criterion, Akaike information criterion [AIC], and corrected AIC). We hope that PartitionFinder will encourage the objective selection of partitioning schemes and thus lead to improvements in phylogenetic analyses. PartitionFinder is written in Python and runs under Mac OSX 10.4 and above. The program, source code, and a detailed manual are freely available from www.robertlanfear.com/partitionfinder.
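The information-theoretic metrics named in this abstract can be illustrated with a minimal sketch using the standard textbook definitions of AIC, AICc, and BIC. This is not PartitionFinder's own code: the function names, the scheme labels, and the numbers are hypothetical, and taking the sample size to be the number of alignment sites is a common convention assumed here rather than something stated in the abstract.

```python
import math


def information_criteria(log_likelihood, num_params, num_sites):
    """Return AIC, AICc, and BIC for one fitted partitioning scheme.

    Assumption: the sample size n is the number of alignment sites.
    """
    k, n = num_params, num_sites
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n - k - 1 <= 0")
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction
    bic = k * math.log(n) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


def best_scheme(schemes, num_sites, criterion="BIC"):
    """Pick the scheme name with the lowest score under the chosen criterion.

    `schemes` maps a (hypothetical) scheme name to (log_likelihood, num_params).
    """
    scores = {
        name: information_criteria(lnl, k, num_sites)[criterion]
        for name, (lnl, k) in schemes.items()
    }
    return min(scores, key=scores.get)


# Hypothetical values for illustration only; not results from any real dataset.
example_schemes = {
    "by_gene": (-1000.0, 10),
    "by_codon_position": (-990.0, 30),
}
chosen = best_scheme(example_schemes, num_sites=500, criterion="BIC")
```

Lower scores are better under all three criteria. In this made-up example the more heavily parameterized scheme has a slightly higher likelihood, but the penalty terms outweigh the gain, which is the trade-off these criteria are meant to capture.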