Systematic Biology

Published by Oxford University Press (OUP)
Online ISSN: 1076-836X
Various events may result in the absence of hypothesized parasite 2 from host B. (a) Sequential colonization of hosts (letters) by parasites (numbers) that coincides with the phylogeny of their hosts, but in which parasite 2 never colonized host B. (b) Extinction of parasite 2 among the ancestors of host B. (c) MTB: The founder population of host B lacked parasites. (d) Parasite 2 is present on host B but has not been detected because of low density or variable distribution.
Incongruent host (a) and parasite (b) cladograms and the reconciliation tree (c: spread form; d: stacked form) of the two. The parasite tree indicates that parasite species 1 and 2 are sister taxa, implying that their hosts, A and C, are most closely related. This conflicts with the host tree, which indicates that species A and B are sister taxa. Reconciliation analysis reconstructs the evolutionary events (cospeciation, intrahost speciation, and sorting events) necessary to produce the observed host and parasite cladograms. Reconciling the host and parasite trees requires one intrahost speciation (open circle) and three sorting events (-). The presence of extant parasites is indicated by solid lines. The shaded branches in (d) represent the host phylogeny and the thin and thick solid lines represent the two parasite lineages. Spread and stacked trees are merely different ways of representing the reconciliation tree and contain the same information.
(a) Pruned seabird subtree from the total evidence tree of Paterson et al. (1995a; their Fig. 6a). See Table 1 for common names. (b–d) Three maximum likelihood trees generated from louse 12S sequences. Each branch is numbered for BPA. The percentage of times that a branch appears in 10,000 bootstrap replications is recorded in parentheses along each branch (only values > 50% are recorded).
Stacked TreeMap reconciliation of seabird (Fig. 3a) and louse phylogenies (a = Fig. 3b; b = Fig. 3c; c = Fig. 3d). Louse relationships are mapped onto the seabird phylogeny and postulated evolutionary events are indicated: cospeciation (A–I), sorting events (short branches), intrahost speciation or duplications (1–4). The thick line represents the seabird phylogeny, and the louse genera are represented by thin lines. Potential host-switching events are indicated with arrows.
We investigated the coevolutionary history of seabirds (orders Procellariiformes and Sphenisciformes) and their lice (order Phthiraptera). Independent trees were produced for the seabirds (tree derived from 12S ribosomal RNA, isoenzyme, and behavioral data) and their lice (trees derived from 12S rRNA data). Brooks parsimony analysis (BPA) supported a general history of cospeciation (consistency index = 0.84, retention index = 0.81). We inferred that the homoplasy in the BPA was caused by one intrahost speciation, one potential host-switching, and eight or nine sorting events. Using reconciliation analysis, we quantified the cost of fitting the louse tree onto the seabird tree. The reconciled trees postulated one host-switching, nine cospeciation, three or four intrahost speciation, and 11 to 14 sorting events. The number of cospeciation events was significantly more than would be expected from chance alone (P < 0.01). The sequence data were used to test for rate heterogeneity for both seabirds and lice. Neither data set displayed significant rate heterogeneity. An examination of the codivergent nodes revealed that seabirds and lice have cospeciated synchronously and that lice have evolved at approximately 5.5 times the rate of seabirds. The degree of sequence divergence supported some of the postulated intrahost speciation events (e.g., Halipeurus predated the evolution of their present hosts). The sequence data also supported some of the postulated host-switching events. These results demonstrate the value of sequence data and reconciliation analyses in unraveling complex histories between hosts and their parasites.
Polyploidy, the genome wide duplication of chromosome number, is a key feature in eukaryote evolution. Polyploidy exists in diverse groups including animals, fungi, and invertebrates but is especially prevalent in plants with most, if not all, plant species having descended from a polyploidization event. Polyploids often differ markedly from their diploid progenitors in morphological, physiological, and life history characteristics as well as rates of adaptation. The altered characteristics displayed by polyploids may contribute to their success in novel ecological habitats. Clearly, a better understanding of the processes underlying changes in the number of chromosomes within genomes is a key goal in our understanding of speciation and adaptation for a wide range of families and genera. Despite the fundamental role of chromosome number change in eukaryotic evolution, probabilistic models describing the evolution of chromosome number along a phylogeny have not yet been formulated. We present a series of likelihood models, each representing a different hypothesis regarding the evolution of chromosome number along a given phylogeny. These models allow us to reconstruct ancestral chromosome numbers and to estimate the expected number of polyploidization events and single chromosome changes (dysploidy) that occurred along a phylogeny. We test, using simulations, the accuracy of this approach and its dependence on the number of taxa and tree length. We then demonstrate the application of the method for the study of chromosome number evolution in 4 plant genera: Aristolochia, Carex, Passiflora, and Helianthus. Considering the depth of the available cytological and phylogenetic data, formal models of chromosome number evolution are expected to advance significantly our understanding of the importance of polyploidy and dysploidy across different taxonomic groups.
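One way to picture the simplest model of this kind is as a continuous-time Markov chain over chromosome counts. The sketch below builds a rate matrix for such a chain, assuming three constant rates (single-chromosome gain, single-chromosome loss, and whole-genome duplication) and an upper bound on the count. The function name, parameter names, and the truncation at max_n are illustrative assumptions, not details given in the abstract.

```python
def rate_matrix(max_n, gain, loss, dupl):
    """Build a rate matrix over chromosome counts 1..max_n (hypothetical sketch).

    Row i-1 holds the instantaneous rates of leaving count i:
      i -> i + 1  at rate `gain`  (ascending dysploidy)
      i -> i - 1  at rate `loss`  (descending dysploidy)
      i -> 2 * i  at rate `dupl`  (polyploidization)
    Transitions that would leave the range 1..max_n are dropped (an assumption).
    """
    q = [[0.0] * max_n for _ in range(max_n)]
    for i in range(1, max_n + 1):
        row = q[i - 1]
        if i + 1 <= max_n:
            row[i] += gain
        if i - 1 >= 1:
            row[i - 2] += loss
        if 2 * i <= max_n:
            # For i = 1, doubling and gain both lead to 2, so the rates add.
            row[2 * i - 1] += dupl
        # Diagonal entry makes each row sum to zero, as required for a rate matrix.
        row[i - 1] = -sum(row)
    return q


q = rate_matrix(max_n=6, gain=0.1, loss=0.2, dupl=0.05)
```

Exponentiating such a matrix over a branch length would give the transition probabilities needed for a likelihood calculation along a tree; that step is omitted here.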
Comparison of the log-likelihood scores of the trees found by the various analyses of the NAD4 data set, as calculated by PhyML (allowing the substitution-model parameters to be optimized individually for each tree) and PAUP* (with the substitution-model parameters fixed at the values for the optimal tree). The dashed line represents identical scores. The filled symbols represent the trees from the three PhyML tree-searches. The tree from the Tree-Puzzle search cannot be shown at this scale.  
Nonmetric multidimensional scaling ordinations of the Robinson-Foulds distance between (a) 37 trees with −log-likelihood >2085.0 for the NAD4 data set, (b) 338 trees with −log-likelihood >18,503.6 for the Isospora data set, and (c) 54 trees with −log-likelihood >235,808.9 for the HIV data set. Each symbol represents a single tree, and the distance between symbols represents the Robinson-Foulds distance between those trees. The filled symbols represent the trees at the various SPR-island peaks, labeled with their −log-likelihood score. Note that not all of the solutions found by the phylogenetic analyses are shown here, due to the arbitrary cut-off value chosen for the "water level" defining the islands.
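The Robinson-Foulds distance used in these ordinations counts the bipartitions (splits) of the taxa that appear in one tree but not the other. A minimal sketch follows, assuming trees are written as nested tuples of leaf names and compared as unrooted trees; the representation and function names are illustrative and are not taken from any of the programs named here.

```python
def _collect_clades(tree, out):
    """Return the leaf set of `tree`, appending the leaf set of every internal node to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(_collect_clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Return the nontrivial bipartitions of an unrooted tree.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference leaf, so the result does not depend on where the tree is rooted.
    """
    clades = []
    all_leaves = _collect_clades(tree, clades)
    ref = min(all_leaves)
    result = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
distance = robinson_foulds(t1, t2)
```

A pairwise matrix of such distances is the kind of input an ordination method would take; both trees must contain the same set of leaves for the count to be meaningful.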
Maximum likelihood scores for analyses of the three main data sets via the Ratchet (Nixon) strategy.
Results of analyzing the MURP data set, showing either the single tree found or the range of trees found. The log-likelihood was determined by PAUP*, with the parameters of the nucleotide-substitution model fixed at the values determined for the optimal tree. The RF distance is the topological distance to the maximum-likelihood tree.
Even when the maximum likelihood (ML) tree is a better estimate of the true phylogenetic tree than those produced by other methods, the result of a poor ML search may be no better than that of a more thorough search under some faster criterion. The ability to find the globally optimal ML tree is therefore important. Here, I compare a range of heuristic search strategies (and their associated computer programs) in terms of their success at locating the ML tree for 20 empirical data sets with 14 to 158 sequences and 411 to 120,762 aligned nucleotides. Three distinct topics are discussed: the success of the search strategies in relation to certain features of the data, the generation of starting trees for the search, and the exploration of multiple islands of trees. As a starting tree, there was little difference among the neighbor-joining tree based on absolute differences (including the BioNJ tree), the stepwise-addition parsimony tree (with or without nearest-neighbor-interchange (NNI) branch swapping), and the stepwise-addition ML tree. The latter produced the best ML score on average but was orders of magnitude slower than the alternatives. The BioNJ tree was second best on average. As search strategies, star decomposition and quartet puzzling were the slowest and produced the worst ML scores. The DPRml, IQPNNI, MultiPhyl, PhyML, PhyNav, and TreeFinder programs with default options produced qualitatively similar results, each locating a single tree that tended to be in an NNI suboptimum (rather than the global optimum) when the data set had low phylogenetic information. For such data sets, there were multiple tree islands with very similar ML scores. The likelihood surface only became relatively simple for data sets that contained approximately 500 aligned nucleotides for 50 sequences and 3,000 nucleotides for 100 sequences. The RAxML and GARLI programs allowed multiple islands to be explored easily, but both programs also tended to find NNI suboptima. 
A newly developed version of the likelihood ratchet using PAUP* successfully found the peaks of multiple islands, but its speed needs to be improved.
Summary of supertree bootstrap support from the GTP analysis 
Phylogenetic analyses using genome-scale data sets must confront incongruence among gene trees, which in plants is exacerbated by frequent gene duplications and losses. Gene tree parsimony (GTP) is a phylogenetic optimization criterion in which a species tree that minimizes the number of gene duplications induced among a set of gene trees is selected. The run time performance of previous implementations has limited its use on large-scale data sets. We used new software that incorporates recent algorithmic advances to examine the performance of GTP on a plant data set consisting of 18,896 gene trees containing 510,922 protein sequences from 136 plant taxa (giving a combined alignment length of >2.9 million characters). The relationships inferred from the GTP analysis were largely consistent with previous large-scale studies of backbone plant phylogeny and resolved some controversial nodes. The placement of taxa that were present in few gene trees generally varied the most among GTP bootstrap replicates. Excluding these taxa either before or after the GTP analysis revealed high levels of phylogenetic support across plants. The analyses supported magnoliids sister to a eudicot + monocot clade and did not support the eurosid I and II clades. This study presents a nuclear genomic perspective on the broad-scale phylogenic relationships among plants, and it demonstrates that nuclear genes with a history of duplication and loss can be phylogenetically informative for resolving the plant tree of life.
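The quantity that GTP minimizes can be illustrated on a toy example. The sketch below scores one candidate species tree by mapping each gene-tree node to the smallest species-tree clade containing its species (the usual last-common-ancestor mapping) and counting a duplication whenever a node maps to the same clade as one of its children. It assumes nested-tuple trees whose gene-tree leaves are labeled with species names; all function names are hypothetical, and this is not the software used in the study.

```python
def species_clades(tree, out):
    """Append the leaf set of every node (leaves included) of the species tree to `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(species_clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def lca_clade(clades, species):
    """Smallest species-tree clade that contains every species in `species`."""
    return min((c for c in clades if species <= c), key=len)


def count_duplications(gene_tree, clades):
    """Count gene-tree nodes that map to the same species clade as one of their children."""
    def visit(node):
        if isinstance(node, str):
            species = frozenset([node])
            return species, lca_clade(clades, species), 0
        results = [visit(child) for child in node]
        species = frozenset().union(*(r[0] for r in results))
        mapped = lca_clade(clades, species)
        dups = sum(r[2] for r in results)
        if any(r[1] == mapped for r in results):
            dups += 1
        return species, mapped, dups
    return visit(gene_tree)[2]


def gtp_cost(species_tree, gene_trees):
    """Total duplications implied by a species tree across a set of gene trees."""
    clades = []
    species_clades(species_tree, clades)
    return sum(count_duplications(g, clades) for g in gene_trees)


species_tree = (("A", "B"), "C")
gene_trees = [(("A", "B"), "C"), (("A", "C"), "B"), (("A", "A"), "B")]
cost = gtp_cost(species_tree, gene_trees)
```

A GTP search would evaluate this cost for many candidate species trees and keep the one with the lowest total; the search itself is the expensive part and is not shown.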
Maximum parsimony (MP) and Bayesian inference (BI) ingroup trees of Agrodiaetus inferred from 113 sequences of COI and COII. The strict consensus tree (MP) was constructed from 7,433 MP trees: Total length = 2,508; consistency index = 0.356; retention index = 0.719. Bootstrap values >50% and Bremer support are shown above and below recovered branches, respectively. The 70% majority consensus tree was recovered from Bayesian trees sampled during four independent Bayesian analyses under the GTR+I+Γ model for DNA substitution: mean negative log likelihood = 16147.25 ± 12.1. The posterior probability is shown above every branch on the BI tree. Recognized Agrodiaetus species groups are mapped on the inferred topologies. Recovered clades are numbered with successive Roman numerals. Clade VII is monophyletic on the MP tree but paraphyletic on the BI tree. Haploid chromosome numbers are shown to the right of the names of specimens (for details, see Appendix 1).
Maximum parsimony (MP) and Bayesian inference (BI) outgroup trees of Agrodiaetus inferred from 113 sequences of COI and COII. The strict consensus tree (MP) was constructed from 7,433 MP trees: TL = 2,508; CI = 0.356; RI = 0.719. Values for bt > 50% and Br are shown above and below recovered branches, respectively. The 70% majority consensus tree was recovered from Bayesian trees sampled during four independent Bayesian analyses under the GTR+I+Γ model for DNA substitution: mean −lnL = 16147.25 ± 12.1. The pP is shown above every branch on the BI tree.
Character statistics for different data sets used in the study. 
Separate and combined analyses of COI + COII and EF1-α genes. (a) Comparison between MP trees inferred from the separate analyses of COI + COII and EF1-α genes. The strict consensus tree for the COI + COII genes was constructed from 22 MP trees: TL = 1,470; CI = 0.485; RI = 0.437. The strict consensus tree for the EF1-α gene was constructed from 6,629 MP trees: TL = 354; CI = 0.678; RI = 0.562. Values for bt > 50% and Br are shown above and below recovered branches, respectively. (b) MP and BI trees inferred from the combined analyses of COI + COII + EF1-α genes. The strict consensus tree was constructed from 12 MP trees: TL = 1,852; CI = 0.515; RI = 0.440. Values of bt > 50% and Br are shown above and below recovered branches, respectively. The 70% majority consensus tree was recovered from Bayesian trees sampled during four independent Bayesian analyses under the GTR+I+Γ model for DNA substitution: mean −lnL = 14098.60 ± 6.6. The pP is shown above every branch on the BI tree. The shaded blocks highlight sampled Agrodiaetus species. The Roman numerals correspond to Agrodiaetus clades (for details, see Fig. 1).
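The TL, CI, and RI values quoted in these captions are related by the standard ensemble definitions: CI is the minimum conceivable number of steps divided by the observed tree length, and RI compares the observed length with the maximum and minimum conceivable lengths. The sketch below simply encodes those two ratios; the input numbers are made up for illustration and are not taken from the data sets above.

```python
def consistency_index(min_steps, observed_steps):
    """Ensemble CI: minimum conceivable steps over observed tree length."""
    return min_steps / observed_steps


def retention_index(min_steps, max_steps, observed_steps):
    """Ensemble RI: (max - observed) / (max - min), summed over all characters."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Hypothetical character matrix: 10 steps at minimum, 30 at maximum, 20 observed.
ci = consistency_index(10, 20)
ri = retention_index(10, 30, 20)
```

With no homoplasy the observed length equals the minimum and both indices are 1; more homoplasy lowers both.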
Butterflies in the large Palearctic genus Agrodiaetus (Lepidoptera: Lycaenidae) are extremely uniform and exhibit few distinguishing morphological characters. However, these insects are distinctive in one respect: as a group they possess among the greatest interspecific karyotype diversity in the animal kingdom, with chromosome numbers (n) ranging from 10 to 125. The monophyly of Agrodiaetus and its systematic position relative to other groups within the section Polyommatus have been controversial. Characters from the mitochondrial genes for cytochrome oxidases I and II and from the nuclear gene for elongation factor 1α were used to reconstruct the phylogeny of Agrodiaetus using maximum parsimony and Bayesian phylogenetic methods. Ninety-one individuals, encompassing most of the taxonomic diversity of Agrodiaetus, and representatives of 14 related genera were included in this analysis. Our data indicate that Agrodiaetus is monophyletic. Representatives of the genus Polyommatus (sensu stricto) are the closest relatives. The sequences of the Agrodiaetus taxa in this analysis are tentatively arranged into 12 clades, only 1 of which corresponds to a species group traditionally recognized in Agrodiaetus. Heterogeneous substitution rates across a recovered topology were homogenized with a nonparametric rate-smoothing algorithm before the application of a molecular clock. Two published estimates of substitution rates dated the origin of Agrodiaetus between 2.51 and 3.85 million years ago. During this time, there was heterogeneity in the rate and direction of karyotype evolution among lineages within the genus. Karyotype instability has evolved independently three times in the section Polyommatus, within the lineages Agrodiaetus, Lysandra, and Plebicula. Rapid karyotype diversification may have played a significant role in the radiation of the genus Agrodiaetus.
Phylogenetic relationships resulting from a direct optimization approach, using uniform gap-cost-to-change ratios and equal weighting for transitions and transversions, modified from Figure 13 of Wheeler et al. (2001). Some of the taxa that most systematists would consider misplaced are in bold.
The nuclear small subunit rRNA (18S) has played a dominant role in the estimation of relationships among insect orders from molecular data. In previous studies, 18S sequences have been aligned by unadjusted automated approaches (computer alignments that are not manually readjusted), most recently with direct optimization (simultaneous alignment and tree building using a program called "POY"). Parsimony has been the principal optimality criterion. Given the problems associated with the alignment of rRNA, and the recent availability of the doublet model for the analysis of covarying sites using Bayesian MCMC analysis, a different approach is called for in the analysis of these data. In this paper, nucleotide sequence data from the 18S small subunit rRNA gene of insects are aligned manually with reference to secondary structure, and analyzed under Bayesian phylogenetic methods with both GTR+I+G and doublet models in MrBayes. A credible phylogeny of Insecta is recovered that is independent of the morphological data and (unlike many other analyses of 18S in insects) not contradictory to traditional ideas of insect ordinal relationships based on morphology. Hexapoda, including Collembola, are monophyletic. Paraneoptera are the sister taxon to a monophyletic Holometabola but weakly supported. Ephemeroptera are supported as the sister taxon of Neoptera, and this result is interpreted with respect to the evolution of direct sperm transfer and the evolution of flight. Many other relationships are well-supported but several taxa remain problematic, e.g., there is virtually no support for relationships among orthopteroid orders. A website is made available that provides aligned 18S data in formats that include structural symbols and Nexus formats.
Previous phylogenetic analyses of tetrapod 18S ribosomal RNA (rRNA) sequences support the grouping of birds with mammals, whereas other molecular data, and morphological and paleontological data favor the grouping of birds with crocodiles. The 18S rRNA gene has consequently been considered odd, serving as "definitive evidence of different genes providing significantly different estimates of phylogeny in higher organisms" (p. 156; Huelsenbeck et al., 1996, Trends Ecol. Evol. 11:152-158). Our research indicates that the previous discrepancy of phylogenetic results between the 18S rRNA gene and other genes is caused mainly by (1) the misalignment of the sequences, (2) the inappropriate use of the frequency parameters, and (3) poor sequence quality. When the sequences are aligned with the aid of the secondary structure of the 18S rRNA molecule and when the frequency parameters are estimated either from all sites or from the variable domains where substitutions have occurred, the 18S rRNA sequences no longer support the grouping of the avian species with the mammalian species.
Phylogenetic hypotheses for different platyzoan taxa. (a) Lorenzen (1985). (b) Wallace et al. (1995, 1996). (c) Neuhaus et al. (1996). (d) Ahlrichs (1995, 1997). (e) Nielsen (1995); Nielsen et al. (1996). (f) Haszprunar (1996).
18S rDNA trees for parameter set 111 (gap = change; tv = ts), when the complete gene sequence is used (a), or when the most heterogeneous regions are removed (b). The deuterostomes have been collapsed. Colors represent major protostome groups: Ecdysozoa (green), Platyzoa (red), and Trochozoa (blue). Branches of the platyzoan taxa are represented in red.  
Strict consensus of 16 trees of 455 steps (consistency index = 0.466; retention index = 0.701) based on the morphological data of Zrzavý et al. (1998). Bremer support values > 1 are indicated.
Summary tree of Figure 4 with the terminal taxa as coded for the morphology. The dashed line in Macrostomida indicates nonmonophyly. Numbers on branches indicate Bremer support values.  
Triploblastic relationships were examined in the light of molecular and morphological evidence. Representatives for all triploblastic "phyla" (except Loricifera) were represented by both sources of phylogenetic data. The 18S ribosomal (rDNA) sequence data for 145 terminal taxa and 276 morphological characters coded for 36 supraspecific taxa were combined in a total evidence regime to determine the most consistent picture of triploblastic relationships for these data. Only triploblastic taxa are used to avoid rooting with distant outgroups, which seems to happen because of the extreme distance that separates diploblastic from triploblastic taxa according to the 18S rDNA data. Multiple phylogenetic analyses performed with variable analysis parameters yield largely inconsistent results for certain groups such as Chaetognatha, Acoela, and Nemertodermatida. A normalized incongruence length metric is used to assay the relative merit of the multiple analyses. The combined analysis having the least character incongruence yields the following scheme of relationships of four main clades: (1) Deuterostomia [((Echinodermata + Enteropneusta) (Cephalochordata (Urochordata + Vertebrata)))]; (2) Ecdysozoa [(((Priapulida + Kinorhyncha) (Nematoda + Nematomorpha)) ((Onychophora + Tardigrada) Arthropoda))]; (3) Trochozoa [((Phoronida + Brachiopoda) (Entoprocta (Nemertea (Sipuncula (Mollusca (Pogonophora (Echiura + Annelida)))))))]; and (4) Platyzoa [((Gnathostomulida (Cycliophora + Syndermata)) (Gastrotricha + Plathelminthes))]. Chaetognatha, Nemertodermatida, and Bryozoa cannot be assigned to any one of these four groups. For the first time, a data analysis recognizes a clade of acoelomates, the Platyzoa (sensu Cavalier-Smith, Biol. Rev. 73:203-266, 1998). 
Other relationships that corroborate some morphological analyses are the existence of a clade that groups Gnathostomulida + Syndermata (= Gnathifera), which is expanded to include the enigmatic phylum Cycliophora, as sister group to Syndermata.
Table 1. Bayes estimates (posterior medians ± SE) of the divergence times in clocklike and nonclocklike analyses.
ML tree for six species of hominoids. The branch lengths of the unrooted tree were estimated under the HKY85 + Γ model of nucleotide substitution. The root of the tree is placed on the siamang branch.
Posterior medians of evolutionary rates for branches 5 and 7 in Figure 1 under different models of rate change: SLD, OUP, GD, and ED (each plotted with its own symbol). Rates are measured by the expected number of substitutions per site per 10^9 years. The hyperparameter β of OUP is set to 100.
Fit of different models of rate change to the metazoan 18S rRNA sequences. 
The molecular clock, i.e., constancy of the rate of evolution over time, is commonly assumed in estimating divergence dates. However, this assumption is often violated and has drastic effects on date estimation. Recently, a number of attempts have been made to relax the clock assumption. One approach is to use maximum likelihood, which assigns rates to branches and allows the estimation of both rates and times. An alternative is the Bayes approach, which models the change of the rate over time. A number of models of rate change have been proposed. We have extended and evaluated models of rate evolution, i.e., the lognormal and its recent variant, along with the gamma, the exponential, and the Ornstein-Uhlenbeck processes. These models were first applied to a small hominoid data set, where an empirical Bayes approach was used to estimate the hyperparameters that measure the amount of rate variation. Estimation of divergence times was sensitive to these hyperparameters, especially when the assumed model is close to the clock assumption. The rate and date estimates varied little from model to model, although the posterior Bayes factor indicated the Ornstein-Uhlenbeck process outperformed the other models. To demonstrate the importance of allowing for rate change across lineages, this general approach was used to analyze a larger data set consisting of the 18S ribosomal RNA gene of 39 metazoan species. We obtained date estimates consistent with paleontological records, the deepest split within the group being about 560 million years ago. Estimates of the rates were in accordance with the Cambrian explosion hypothesis and suggested some more recent lineage-specific bursts of evolution.
Summary of sampled taxa.
Most-parsimonious tree obtained from a POY analysis from the conserved regions (regions 1, 3, 5, and 7) with a gap cost and a change cost of one. The cost of the alignment is 1,074. Numbers on branches represent Bremer support values. The circled numbers refer to particularly relevant nodes in Tables 2 and 3. Shading of branches as in Figure 1.  
One of the three most-parsimonious trees obtained from a POY analysis of all regions except the central portion of the hypervariable regions with a gap cost and a change cost of one. The cost of this tree is 2,138 (also see Table 2). Decay index values are listed below relevant nodes, and suprageneric taxa are listed to the right.  
Printout from tree searches conducted by POY to illustrate the operation of the program. Tree searches were performed on the full data set of 24 segments of the gene, excluding (a) and including (b) the central portion of regions V2, V4, and V6. The segment of the peripheral V4 region shown represents the correspondences of individual nucleotide positions inferred in the tree alignment. Note the differences between the two matrices as the result of the exclusion/inclusion of data external to this DNA segment and the changes in the implied homology between bases in either analysis. This printout was obtained by using the impliedalignment command.
Current hypotheses regarding family relationships in the suborder Adephaga (Coleoptera) are conflicting. Here we report full-length 18S ribosomal RNA sequences of 39 adephagans and 13 outgroup taxa. Data analysis focused on the impact of sequence alignment on tree topology, using two principally different approaches. Tree alignments, which seek to minimize indels and substitutions on the tree in a single step, as implemented in an approximate procedure by the computer program POY, were contrasted with a more traditional procedure based on alignments followed by phylogenetic inference based on parsimony, likelihood, and distance analyses. Despite substantial differences between the procedures, phylogenetic conclusions regarding basal relationships within Adephaga and relationships between the four suborders of Coleoptera were broadly similar. The analysis weakly supports monophyly of Adephaga, with Polyphaga usually as its sister, and the two small suborders Myxophaga and Archostemata basal to them. In some analyses, however, Polyphaga was reconstructed as having arisen from within Hydradephaga. Adephaga generally split into two monophyletic groups, corresponding to the terrestrial Geadephaga and the aquatic Hydradephaga, as initially proposed by Crowson in 1955, consistent with a single colonization of the aquatic environment by adephagan ancestors and contradicting the recent proposition of three independent invasions. A monophyletic Hydradephaga is consistently, though not strongly, supported under most analyses, and a parametric bootstrapping test significantly rejects an hypothesis of nonmonophyly. The enigmatic Trachypachidae, which exhibit many similarities to aquatic forms but whose species are entirely terrestrial, were usually recovered as a basal lineage within Geadephaga. Strong evidence opposes the view that terrestrial trachypachids are related to the dytiscoid water beetles.
Maximum likelihood reconstruction for atpB + rbcL data for 65 genera of palms plus three outgroups (GTR + I + Γ model, −log likelihood = 9,439.44; AC = 2.034, AG = 4.618, AT = 0.425, CG = 0.981, CT = 5.309, GT = 1, I = 0.637, Γ = 0.518). Parameter values were estimated with outgroups. Bootstrap proportions >60% are listed on the branches. Higher taxa are labeled as in Figure 1.
Maximum likelihood reconstruction for the combined cpDNA sequence data (atpB, rbcL, rps16, and trnL-trnF) dataset for 45 genera of palms (GTR + I + Γ model, −log likelihood = 13,321.03; AC = 1.575, AG = 3.297, AT = 0.556, CG = 0.975, CT = 3.506, GT = 1, I = 0.528, Γ = 0.678). Parameter values were estimated with outgroups. Bootstrap proportions >60% are listed on the branches. Higher taxa are labeled as in Figure 1.
Strict consensus of 32 MP trees for the combined molecular sequence data (atpB, rbcL, rps16, trnL-trnF, and 18S) for 64 taxa of palms. Bootstrap proportions >60% are labeled on the branches. Higher taxa are labeled as in Figure 1.  
Strict consensus of MP trees for the reduced 33-taxon analyses. Bootstrap proportions >50% are labeled on branches. On the left, consensus of 29 trees (L = 1,245, CI = 0.682, RI = 0.651) from the combined molecular data analyses. On the right, consensus of two trees (L = 1,375, CI = 0.655, RI = 0.647) from analysis of the combined morphological and molecular data. The four major lineages of palms are identified as Ca = Calamoideae, N = Nypoideae, C + C = Coryphoideae + Caryoteae, and A = the Arecoid Line.
Notoriously slow rates of molecular evolution and convergent evolution among some morphological characters have limited phylogenetic resolution for the palm family (Arecaceae). This study adds nuclear DNA (18S SSU rRNA) and chloroplast DNA (cpDNA; atpB and rbcL) sequence data for 65 genera of palms and characterizes molecular variation for each molecule. Phylogenetic relationships were estimated with maximum likelihood and maximum parsimony techniques for the new data and for previously published molecular data for 45 palm genera. Maximum parsimony analysis was also used to compare molecular and morphological data for 33 palm genera. Incongruence among datasets was detected between cpDNA and 18S data and between molecular and morphological data. Most conflict between nuclear and cpDNA data was associated with the genus Nypa. Several taxa showed relatively long branches with 18S data, but phylogenetic resolution of these taxa was essentially the same for 18S and cpDNA data. Base composition bias for 18S that contributed to erroneous phylogenetic resolution in other taxa did not seem to be present in Palmae. Morphological data were incongruent with all molecular data due to apparent morphological homoplasy for Caryoteae, Ceroxyloideae, Iriarteae, and Thrinacinae. Both cpDNA and nuclear 18S data firmly resolved Caryoteae with Borasseae of Coryphoideae, suggesting that at least some morphological characters used to place Caryoteae in Arecoideae are homoplastic. In this study, increased character sampling seems to be more important than increased taxon sampling; a comparison of the full (65-taxon) and reduced (45- and 33-taxon) datasets suggests little difference in core topology but considerably more nodal support with the increased character sample sizes. These results indicate a general trend toward a stable estimate of phylogenetic relationships for the Palmae. 
Although the 33-taxon topologies are even better resolved, they lack several critical taxa and are affected by incongruence between molecular and morphological data. As such, a comparison of results from the 45- and 33-taxon trees offers the best available reference for phylogenetic inference on palms.
Map of New Zealand, showing movement of South Island landmasses caused by displacement along the Alpine Fault over a 25-MY period (after Kamp, 1992). The approximate position of the current coastline is shown on all maps. The black area represents the distribution of a hypothetical taxon fractured by this movement.
Map of South Island, New Zealand, showing maximal extent of Pleistocene glaciations (black) and river gravel aggradation (gray) (after Suggate et al., 1978).
NINE PROBLEMS WITH HEADS (1998)
1. Disjunct Distributions Generate Rather than Distinguish Hypotheses
2. Disjunct Taxa Used as Examples Are Mainly Conspecifics
3. Selective Sampling of Taxa
4. Ecological Constraint
5. Incomplete Sampling of Taxa
6. Dispersal Is Invoked Only When It Fits the Hypothesis
7. Current-Day Evidence Supports Glacial Extirpation
8. The Alpine Fault Is Only Recently Alpine
9. Land Surfaces Have Not Been in Continuous Existence
Properties of trees from multiple search replicates for a representative analysis using garli. a) The distribution of likelihood scores. b) The distribution of symmetric tree distances (as a fraction of the maximum possible value for the data set). Both measures are given as frequency and proportion.
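The symmetric tree distance in panel (b) can be illustrated with a minimal sketch. It assumes fully resolved, unrooted trees on the same taxa, encoded here as nested tuples; the encoding and the function names are illustrative assumptions and are not part of garli. The distance counts the internal bipartitions found in only one of the two trees and divides by the maximum possible value, 2(n - 3) for n taxa.

```python
def leaf_set(tree):
    """Collect the taxon labels of a tree written as nested tuples."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree, taxa):
    """Return the internal bipartitions of the tree, ignoring where it is rooted."""
    reference = min(taxa)  # each split is stored as the side lacking this taxon
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        side = taxa - leaves if reference in leaves else leaves
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(taxa) - 2:
            found.add(side)
        return leaves

    walk(tree)
    return found


def normalized_symmetric_distance(tree_a, tree_b):
    """Symmetric-difference distance as a fraction of its maximum, 2(n - 3)."""
    taxa = leaf_set(tree_a)
    if taxa != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    if len(taxa) < 4:
        raise ValueError("at least four taxa are needed")
    differing = nontrivial_splits(tree_a, taxa) ^ nontrivial_splits(tree_b, taxa)
    return len(differing) / (2 * (len(taxa) - 3))


tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "C"), ("B", ("D", "E")))
print(normalized_symmetric_distance(tree_1, tree_2))
```

In this toy case the two trees share the split separating D and E from the rest and differ in the other, so half of the maximum distance is reported.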
Confidence intervals associated with the bootstrap probabilities observed in the majority rule consensus tree computed from 500 garli bootstrap replicates. Confidence intervals are given for three probabilities (0.90, 0.95, and 0.99).
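The caption does not say how these intervals were computed. As one plausible approach, the sketch below treats each bootstrap replicate as an independent binomial trial and uses the Wilson score interval; both the choice of interval and the function name are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def wilson_interval(count, replicates, level=0.95):
    """Wilson score interval for a bootstrap proportion count / replicates."""
    if replicates <= 0 or not 0 <= count <= replicates:
        raise ValueError("count must lie between 0 and replicates")
    if not 0 < level < 1:
        raise ValueError("level must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    p = count / replicates
    denom = 1 + z * z / replicates
    centre = (p + z * z / (2 * replicates)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / replicates + z * z / (4 * replicates ** 2)
    )
    return max(0.0, centre - half), min(1.0, centre + half)


# A branch recovered in 350 of 500 replicates, at the three levels in the caption.
for level in (0.90, 0.95, 0.99):
    print(level, wilson_interval(350, 500, level))
```

The interval widens as the requested probability increases, which is the pattern the figure describes.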
Relationship between the number of search replicates (out of 100) returning the same topology as that of the best tree found and the estimated number of search replicates necessary to guarantee a particular probability of recovering that topology. Estimates are given at three probabilities (0.90, 0.95, 0.99).
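One way to obtain such an estimate is sketched below. It assumes that search replicates are independent and that each recovers the best topology with probability p = k/100, where k is the number of replicates that returned it; under those assumptions the smallest n satisfying 1 - (1 - p)^n >= P is ceil(ln(1 - P) / ln(1 - p)). Whether the service uses exactly this formula is not stated in the caption, and the function name is hypothetical.

```python
import math


def replicates_needed(hits, total, target):
    """Smallest n with 1 - (1 - hits/total)**n >= target."""
    if not 0 < hits <= total:
        raise ValueError("hits must be between 1 and total")
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    p = hits / total
    if p == 1.0:
        return 1  # every replicate already finds the best topology
    return math.ceil(math.log(1 - target) / math.log(1 - p))


# Best topology found in 10 of 100 replicates, at the three caption probabilities.
for target in (0.90, 0.95, 0.99):
    print(target, replicates_needed(10, 100, target))
```

With 10 of 100 replicates recovering the best topology, this gives 22, 29, and 44 replicates for probabilities of 0.90, 0.95, and 0.99, respectively.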
Completion times of 719 analyses submitted to the garli web service for a recent six month period (2013-07-23 to 2014-01-23). Despite great variation in analysis parameters (e.g., data matrix size, substitution model used, number of replicates requested), ≈97% of analyses were completed in less than 24 hours.
We introduce a publicly available gateway for high-throughput, maximum-likelihood phylogenetic analysis powered by grid computing. The gateway features a garli 2.0 web service that enables a user to quickly and easily submit thousands of maximum likelihood tree searches or bootstrap searches that are executed in parallel on distributed computing resources. The garli web service allows one to easily specify partitioned substitution models using a graphical interface, and it performs sophisticated post-processing of phylogenetic results. Although the garli web service has been used by the research community for over three years, here we formally announce the availability of the service, describe its capabilities, highlight new features and recent improvements, and provide details about how the grid system efficiently delivers high-quality phylogenetic results. [garli, gateway, grid computing, maximum likelihood, molecular evolution portal, phylogenetics, web service.]
This special issue of Systematic Biology contains review articles contributed by keynote speakers after the fifth edition of the “Mathematical and Computational Evolutionary Biology” conference (MCEB; see web sites below), held in 2013. We started in 2003 (under a slightly different name: “Mathematics of Evolution and Phylogeny”) at the Mathematics Research Center Henri Poincare, Paris, with well-known speakers like Walter Fitch and Joe Felsenstein. We had the feeling that the considerable efforts of the keynote speakers to synthesize and present their research in an extensive but
In the early 1990s, a comprehensive set of missions and goals for the discipline was articulated by a global community of systematists; these were presented as Systematics Agenda 2000 (1994). Abbreviated here as SA2K, this agenda spurred awareness of the field and initiated discussions about the role of systematics within biology (e.g., Blackmore and Cutler 1996; Cracraft 2002; Halanych and Goertzen 2009), in education (e.g., Krishtalka and Humphrey 2000; Thanukos 2010), and public policy (e.g., Prance 1995). After nearly 20 years of achievement and growth in systematic biology, a series of four US National Science Foundation-sponsored workshops on “Future Directions in Biodiversity and Systematics Research” was held during 2009–2010 to evaluate progress in the field and identify new directions and opportunities. Workshop participants reviewed SA2K as a way to rapidly achieve common ground and to jump-start our discussions. We did not plan to undertake a formal revision of SA2K, but our discussions led to consensus on a number of relevant points. We share these here with the intention of generating further reflection and discussion toward advancing our field and its missions.
FIGURE 6. Scatterplot showing correlation matrix between χ²/CA and Nei/NMDS.
FIGURE 1. Collection localities for Maghreb populations of Helix aspersa. Two samples collected near Algiers (A6 and A7) were removed from analyses because of their low sample size (11 and 5, respectively).
FIGURE 2. Data analysis steps from allele frequencies to ordination plot: (1) computing matrix of pairwise initial genetic (Nei, 1978; Rogers, 1972) and χ² distances based upon allele frequencies for 17 enzyme loci, (2) ordination (NMDS, PCoA or CA), (3) computing matrix of pairwise Euclidean distances obtained after ordination, (4) computing edges (connections) between nearest populations using the MST and the RNG procedures, (5) matrix comparisons to compare (i) pre- and postordination (or clustering) distance matrices (correlations within, rW) and (ii) two ordination (or clustering) methods (correlations between, rB), and (6) graphical representation of ordination with MST/RNG and inertia ellipsoids superimposed. These ellipsoids were obtained with a classical hierarchical clustering analysis using the Ward method (Ward, 1963) performed from ordination scores.
FIGURE 5. Diagrammatic representation of cophenetic (rW) and (rB) correlations: (a) Nei distances, (b) Rogers distances, and (c) χ² distances.
We examined the efficiencies of ordination methods in the treatment of gene frequency data at the intraspecific level, using metric and nonmetric distance measures (Nei's and Rogers' genetic distances, χ² distance). We assessed initial processes responsible for the geographical distribution of the Mediterranean land snail Helix aspersa. Seventeen enzyme loci from 30 North African snail populations were considered in the present analysis. Five combinations of distance/multivariate analysis were compared: correspondence analysis (CA), nonmetric multidimensional scaling (NMDS) on Nei's, Rogers', and χ² distances, and principal coordinates analysis on Rogers' distances. Configuration of the objects resulting from ordination was projected onto three-dimensional graphics with the minimum spanning tree or the relative neighborhood graph superimposed. Pre- and postordination or clustering distance matrices were compared by means of correlation methods. As expected, all combinations led to a clear west versus east pattern of variation. However, the intraregional relationships and degree of connectivity between pairs of operational taxonomic units were not necessarily constant from one method to another. Ordination methods applied with Nei's and Rogers' distances provided the best fit with the original distances (r = 0.98), compared with UPGMA clustering (r ≈ 0.75). The Nei/NMDS combination seems to be a good compromise (distortion index dt = 10%) between Rogers/NMDS, which produces a more confusing pattern of differentiation (dt = 24%), and χ²/CA, which tends to distort large distances (dt = 31%). NMDS provides a powerful method to summarize relationships between populations when neither hierarchical structure nor phylogenetic inference is required.
These findings lead to a discussion of the good performance of NMDS, the appropriate distances to be used, and the potential application of this method to other types of allelic data (such as microsatellite loci) or data on nucleotide sequences of genes.
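To make the distance measures concrete, here is a minimal sketch of two of them computed from allele frequencies. It uses Nei's standard genetic distance (the 1972 form, not the sample-size-corrected 1978 estimator, which would also need sample sizes) and Rogers' (1972) distance. The data layout, one list of allele frequencies per locus for each population, and the function names are assumptions made for illustration.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's standard distance D = -ln(Jxy / sqrt(Jx * Jy))."""
    jx = jy = jxy = 0.0
    for freqs_x, freqs_y in zip(pop_x, pop_y):
        jx += sum(a * a for a in freqs_x)
        jy += sum(b * b for b in freqs_y)
        jxy += sum(a * b for a, b in zip(freqs_x, freqs_y))
    if jxy == 0.0:
        return math.inf  # no alleles shared at any locus
    return -math.log(jxy / math.sqrt(jx * jy))


def rogers_distance(pop_x, pop_y):
    """Rogers' distance: mean over loci of sqrt(0.5 * sum((x - y)**2))."""
    per_locus = [
        math.sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(freqs_x, freqs_y)))
        for freqs_x, freqs_y in zip(pop_x, pop_y)
    ]
    return sum(per_locus) / len(per_locus)


# Two hypothetical populations scored at two loci with two alleles each.
west = [[1.0, 0.0], [0.5, 0.5]]
east = [[0.0, 1.0], [0.5, 0.5]]
print(nei_standard_distance(west, east), rogers_distance(west, east))
```

Rogers' distance is bounded between 0 and 1, whereas Nei's distance is unbounded, which is one reason the two can behave differently in an ordination.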
Species tree accuracy with different methods of reconstruction and different accuracy measures, for species trees of depth 1 N_e. Lines join points with equal numbers of total sequences, with numbers of loci (Loc) and sampled individuals (Ind) indicated. Accuracy measured as average accuracy between true and inferred tree over the 500 simulated species trees. Reconstruction methods and accuracy measures yield similar results except for the Minimize Deep Coalescences for rooted accuracy, which is notably lower.
Average amount of sequence divergence and incomplete lineage sorting observed for shallow (recent divergences) and deep (older divergences) species trees. Calculated for 10 replicates of each case; standard errors are shown in parentheses. Sequence divergences are average raw uncorrected percent pairwise differences, presented to confirm that the simulations generated divergences typical of empirical studies (values for 9 individuals not calculated, presumed to be bracketed by results from 1, 3, and 27). Incomplete lineage sorting is measured as the minimal number of deep coalescences required (Maddison, 1997).
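The divergence measure in this caption can be sketched as follows. The example assumes aligned sequences of equal length and counts every mismatching position, gaps included; how the original study handled gaps or ambiguity codes is not stated, and the function name is illustrative.

```python
from itertools import combinations


def mean_percent_difference(sequences):
    """Average uncorrected percent difference over all pairs of aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are needed")
    length = len(sequences[0])
    if length == 0 or any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned and non-empty")
    total = 0.0
    pairs = 0
    for seq_a, seq_b in combinations(sequences, 2):
        mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
        total += 100.0 * mismatches / length
        pairs += 1
    return total / pairs


print(mean_percent_difference(["ACGT", "ACGA", "TCGA"]))
```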
It is now well known that incomplete lineage sorting can cause serious difficulties for phylogenetic inference, but little attention has been paid to methods that attempt to overcome these difficulties by explicitly considering the processes that produce them. Here we explore approaches to phylogenetic inference designed to consider retention and sorting of ancestral polymorphism. We examine how the reconstructability of a species (or population) phylogeny is affected by (a) the number of loci used to estimate the phylogeny and (b) the number of individuals sampled per species. Even in difficult cases with considerable incomplete lineage sorting (times between divergences less than 1 N_e generations), we found the reconstructed species trees matched the "true" species trees in at least three out of five partitions, as long as a reasonable number of individuals per species were sampled. We also studied the tradeoff between sampling more loci versus more individuals. Although increasing the number of loci gives more accurate trees for a given sampling effort with deeper species trees (e.g., total depth of 10 N_e generations), sampling more individuals often gives better results than sampling more loci with shallower species trees (e.g., depth = 1 N_e). Taken together, these results demonstrate that gene sequences retain enough signal to achieve an accurate estimate of phylogeny despite widespread incomplete lineage sorting. Continued improvement in our methods to reconstruct phylogeny near the species level will require a shift to a compound model that considers not only nucleotide or character state substitutions, but also the population genetics processes of lineage sorting. [Coalescence; divergence; population; speciation.]
Although genetic methods of species identification, especially DNA barcoding, are strongly debated, tests of these methods have been restricted to a few empirical cases for pragmatic reasons. Here we use simulation to test the performance of methods based on sequence comparison (BLAST and genetic distance) and tree topology over a wide range of evolutionary scenarios. Sequences were simulated on a range of gene trees spanning almost three orders of magnitude in tree depth and in coalescent depth; that is, deep or shallow trees with deep or shallow coalescences. When the query's conspecific sequences were included in the reference alignment, the rate of positive identification was related to the degree to which different species were genetically differentiated. The BLAST, distance, and liberal tree-based methods returned higher rates of correct identification than did the strict tree-based requirement that the query was within, but not sister to, a single-species clade. Under this more conservative approach, ambiguous outcomes occurred in inverse proportion to the number of reference sequences per species. When the query's conspecific sequences were not in the reference alignment, only the strict tree-based approach was relatively immune to making false-positive identifications. Thresholds affected the rates at which false-positive identifications were made when the query's species was unrepresented in the reference alignment but did not otherwise influence outcomes. A conservative approach using the strict tree-based method should be used initially in large-scale identification systems, with effort made to maximize sequence sampling within species. Once the genetic variation within a taxonomic group is well characterized and the taxonomy resolved, then the choice of method used should be dictated by considerations of computational efficiency. The requirement for extensive genetic sampling may render these techniques inappropriate in some circumstances.
The three elements of the H-P coevolution problem can be translated into rectangular data matrices A, B, and C. See text.
Given the information in matrices A, B, and C, the problem is to estimate the parameters in the fourth-corner matrix D that crosses the principal coordinates of the hosts with those of the parasites.  
Pocket gophers and chewing lice phylogenetic trees and H-P links. Significant H-P links are represented by full lines, nonsignificant links by dashed lines.
Pruned trees: The trees are now identical and display perfect coevolution for a subset of the animals.  
A new method, ParaFit, has been developed to test the significance of a global hypothesis of coevolution between parasites and their hosts. Individual host-parasite association links can also be tested. The test statistics are functions of the host and parasite phylogenetic trees and of the set of host-parasite association links. Numerical simulations are used to show that the method has the correct rate of type I error and good power except under extreme error conditions. An application to real data (pocket gophers and chewing lice) is presented.
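The global test can be sketched under several stated assumptions. The association matrix A is taken here with parasites as rows and hosts as columns. The principal coordinates of the two trees are supplied as ready-made inputs rather than computed. The fourth-corner matrix is formed as D = C'A'B, with the rows of B indexed by parasites and the rows of C by hosts, and the statistic is the sum of squared entries of D. The null hypothesis is simulated by permuting each row of A independently. These orientations, the permutation scheme, and the function names are illustrative assumptions and should be checked against the published description before use.

```python
import random


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    columns = transpose(right)
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]


def global_statistic(assoc, parasite_pc, host_pc):
    """Sum of squared entries of D = host_pc' * assoc' * parasite_pc."""
    d = matmul(matmul(transpose(host_pc), transpose(assoc)), parasite_pc)
    return sum(value * value for row in d for value in row)


def permutation_p_value(assoc, parasite_pc, host_pc, n_perm=999, seed=0):
    """One-tailed permutation test of the global statistic (assumed scheme)."""
    rng = random.Random(seed)
    observed = global_statistic(assoc, parasite_pc, host_pc)
    at_least = 1  # the observed value counts as one outcome under the null
    for _ in range(n_perm):
        # Reassign hosts at random, separately for each parasite (row).
        shuffled = [rng.sample(row, len(row)) for row in assoc]
        if global_statistic(shuffled, parasite_pc, host_pc) >= observed - 1e-12:
            at_least += 1
    return at_least / (n_perm + 1)


# Toy input: two parasites, two hosts, one principal coordinate per tree.
links = [[1, 0], [0, 1]]
parasite_axes = [[1.0], [-1.0]]
host_axes = [[1.0], [-1.0]]
print(global_statistic(links, parasite_axes, host_axes))
print(permutation_p_value(links, parasite_axes, host_axes, n_perm=99))
```

A test of an individual link would compare the statistic with and without that link; it is omitted here to keep the sketch short.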
A comprehensive phylogeny of papilionoid legumes was inferred from sequences of 2228 taxa in GenBank release 147. A semiautomated analysis pipeline was constructed to download, parse, assemble, align, combine, and build trees from a pool of 11,881 sequences. Initial steps included all-against-all BLAST similarity searches coupled with assembly, using a novel strategy for building length-homogeneous primary sequence clusters. This was followed by a combination of global and local alignment protocols to build larger secondary clusters of locally aligned sequences, thus taking into account the dramatic differences in length of the heterogeneous coding and noncoding sequence data present in GenBank. Next, clusters were checked for the presence of duplicate genes and other potentially misleading sequences and examined for combinability with other clusters on the basis of taxon overlap. Finally, two supermatrices were constructed: a "sparse" matrix based on the primary clusters alone (1794 taxa x 53,977 characters), and a somewhat more "dense" matrix based on the secondary clusters (2228 taxa x 33,168 characters). Both matrices were very sparse, with 95% of their cells containing gaps or question marks. These were subjected to extensive heuristic parsimony analyses using deterministic and stochastic heuristics, including bootstrap analyses. A "reduced consensus" bootstrap analysis was also performed to detect cryptic signal in a subtree of the data set corresponding to a "backbone" phylogeny proposed in previous studies. Overall, the dense supermatrix appeared to provide much more satisfying results, indicated by better resolution of the bootstrap tree, excellent agreement with the backbone papilionoid tree in the reduced bootstrap consensus analysis, few problematic large polytomies in the strict consensus, and less fragmentation of conventionally recognized genera. Nevertheless, at lower taxonomic levels several problems were identified and diagnosed. 
A large number of methodological issues in supermatrix construction at this scale are discussed, including detection of annotation errors in GenBank sequences; the shortage of effective algorithms and software for local multiple sequence alignment; the difficulty of overcoming effects of fragmentation of data into nearly disjoint blocks in sparse supermatrices; and the lack of informative tools to assess confidence limits in very large trees.
Nuclear DNA is widely used to estimate phylogenetic and phylogeographic relationships. Nuclear gene variants may be present in an individual's genome, and these result in Intra-Individual Site Polymorphisms (2ISP; pronounced 'twisp') in direct-PCR or individual-consensus sequences based on a clone sample. 2ISPs can occur fairly often, especially within, but not restricted to, high-copy-number regions such as the widely used internal transcribed spacers of the nuclear ribosomal cistron. Dealing with 2ISPs has been problematic as phylogeny reconstruction optimality criteria generally do not take account of this variation. Here we test whether an approach that treats 2ISPs as additional (termed 'informative'), rather than ambiguous, characters offers improved support in three common criteria used for phylogenetic inference: Minimum Evolution (via Neighbour Joining), Maximum Parsimony and Maximum Likelihood. We demonstrate significant improvements using the 2ISP-informative treatment with simulated, real-world and case study datasets. We envisage that this 2ISP-informative approach will greatly aid phylogenetic inference using any nuclear DNA regions that contain polymorphisms within individuals (including consensus sequences generated from next generation sequencing), especially at the intrageneric or intraspecific level.
Performance of the parsimony filter 
Comparison of branch supports with simulated data. These graphics show the distribution of supports (vertical axis) using box-and-whisker plots with bounds provided on the right of the corresponding panel. GTR + Γ4: both data generation and analysis (tree inference and branch testing) are performed with the same model. JC69: data are generated with GTR + Γ4, but the analysis is performed using a simple JC69 model; this mimics real data analyses in which the standard substitution models used for estimation inevitably simplify the true evolutionary processes. BP, bootstrap supports; KI2, aLRT with chi-square–based branch supports; SH, aLRT with SH-like branch supports.
Comparison of log-likelihoods on 50 DNA and 50 protein medium-size data sets
Bootstrap and aLRT-SH agreement as a function of the phylogenetic signal. M1499 and M2588 are the two data sets shown in Figure 3. The phylogenetic signal is measured by the number of sites (with less than 10% gaps or missing values) times the median of internal branch lengths. This roughly corresponds to the expected number of substitutions supporting any given internal branch. Branch support agreement equals the proportion of branches with both SH-like support >0.90 and bootstrap support >0.75. See text for further details and explanations.
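The two quantities defined in this caption can be written out as a short sketch. It assumes the alignment is a list of equal-length strings in which "-" and "?" mark gaps or missing values, and that the two support values are given per branch in matching order; these input conventions and the function names are assumptions, not PhyML features.

```python
from statistics import median


def phylogenetic_signal(alignment, internal_branch_lengths,
                        max_missing=0.10, missing="-?"):
    """Number of sites with less than 10% gaps times the median internal branch length."""
    n_sequences = len(alignment)
    kept_sites = 0
    for column in zip(*alignment):
        fraction_missing = sum(ch in missing for ch in column) / n_sequences
        if fraction_missing < max_missing:
            kept_sites += 1
    return kept_sites * median(internal_branch_lengths)


def support_agreement(sh_supports, bootstrap_supports, sh_cut=0.90, bp_cut=0.75):
    """Proportion of branches with SH-like support > 0.90 and bootstrap > 0.75."""
    if len(sh_supports) != len(bootstrap_supports) or not sh_supports:
        raise ValueError("need one support value of each kind per branch")
    both = sum(sh > sh_cut and bp > bp_cut
               for sh, bp in zip(sh_supports, bootstrap_supports))
    return both / len(sh_supports)


alignment = ["ACGT", "AC-T", "ACGT"]
print(phylogenetic_signal(alignment, [0.1, 0.2, 0.3]))
print(support_agreement([0.95, 0.95, 0.50, 0.92], [0.80, 0.60, 0.90, 0.76]))
```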
PhyML is a phylogeny software package based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges to improve a reasonable starting tree topology. Since the original publication (Guindon S., Gascuel O. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704), PhyML has been widely used (>2500 citations in ISI Web of Science) because of its simplicity and a fair compromise between accuracy and speed. In the meantime, research around PhyML has continued, and this article describes the new algorithms and methods implemented in the program. First, we introduce a new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves. The parsimony criterion is used here to filter out the least promising topology modifications with respect to the likelihood function. The analysis of a large collection of real nucleotide and amino acid data sets of various sizes demonstrates the good performance of this method. Second, we describe a new test to assess the support of the data for internal branches of a phylogeny. This approach extends the recently proposed approximate likelihood-ratio test and relies on a nonparametric, Shimodaira-Hasegawa-like procedure. A detailed analysis of real alignments sheds light on the links between this new approach and the more classical nonparametric bootstrap method. Overall, our tests show that the latest version (3.0) of PhyML is fast, accurate, stable, and ready to use. A Web server and binary files are available from
Since its introduction in 2001, MrBayes has grown in popularity as a software package for Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) methods. With this note, we announce the release of version 3.2, a major upgrade to the latest official release presented in 2003. The new version provides convergence diagnostics and allows multiple analyses to be run in parallel with convergence progress monitored on the fly. The introduction of new proposals and automatic optimization of tuning parameters has improved convergence for many problems. The new version also sports significantly faster likelihood calculations through streaming single-instruction-multiple-data extensions (SSE) and support of the BEAGLE library, allowing likelihood calculations to be delegated to graphics processing units (GPUs) on compatible hardware. Speedup factors range from around 2 with SSE code to more than 50 with BEAGLE for codon problems. Checkpointing across all models allows long runs to be completed even when an analysis is prematurely terminated. New models include relaxed clocks, dating, model averaging across time-reversible substitution models, and support for hard, negative, and partial (backbone) tree constraints. Inference of species trees from gene trees is supported by full incorporation of the Bayesian estimation of species trees (BEST) algorithms. Marginal model likelihoods for Bayes factor tests can be estimated accurately across the entire model space using the stepping stone method. The new version provides more output options than previously, including samples of ancestral states, site rates, site dN/dS ratios, branch rates, and node dates. A wide range of statistics on tree parameters can also be output for visualization in FigTree and compatible software.
Many published phylogenies are based on methods that assume equal nucleotide composition among taxa. Studies have shown, however, that this assumption is often not accurate, particularly in divergent lineages. Nonstationary sequence evolution, when taxa in different lineages evolve in different ways, can lead to unequal nucleotide composition. This can cause inference methods to fail and phylogenies to be inaccurate. Recent advancements in phylogenetic theory have proposed new models of nonstationary sequence evolution; these models often outperform equivalent stationary models. A variety of new phylogenetic software implementing such models has been developed, but the studies employing the new methodology are still few. We discovered convergence of nucleotide composition within mitochondrial genomes of the insect order Coleoptera (beetles). We found variation in base content both among species and among genes in the genome. To this data set, we have applied a broad range of phylogenetic methods, including some traditional stationary models of evolution and all the more recent nonstationary models. We compare 8 inference methods applied to the same data set. Although the more commonly used methods universally fail to recover established clades, we find that some of the newer software packages are more appropriate for data of this nature. The software packages p4, PHASE, and nhPhyML were able to overcome the systematic bias in our data set, but parsimony, MrBayes, NJ, LogDet, and PhyloBayes were not.
Young polyploid events are easily diagnosed by various methods, but older polyploid events become increasingly difficult to identify as chromosomal rearrangements, tandem gene or partial chromosome duplications, changes in substitution rates among duplicated genes, pseudogenization or locus loss, and interlocus interactions complicate the means of inferring past genetic events. Genomic data have provided valuable information about the polyploid history of numerous species, but on their own fail to show whether related species, each with a polyploid past, share a particular polyploid event. A phylogenetic approach provides a powerful method to determine this but many processes may mislead investigators. These processes can affect individual gene trees, but most likely will not affect all genes, and almost certainly will not affect all genes in the same way. Thus, a multigene approach, which combines the large-scale aspect of genomics with the resolution of phylogenetics, has the power to overcome these difficulties and allow us to infer genomic events further into the past than would otherwise be possible. Previous work using synonymous distances among gene pairs within species has shown evidence for large-scale duplications in the legumes Glycine max and Medicago truncatula. We present a case study using 39 gene families, each with three or four members in G. max and the putative orthologues in M. truncatula, rooted using Arabidopsis thaliana. We tested whether the gene duplications in these legumes occurred separately in each lineage after their divergence (Hypothesis 1), or whether they share a round of gene duplications (Hypothesis 2). Many more gene family topologies supported Hypothesis 2 over Hypothesis 1 (11 and 2, respectively), even after synonymous distance analysis revealed that some topologies were providing misleading results. Only ca. 33% of genes examined support either hypothesis, which strongly suggests that single gene family approaches may be insufficient when studying ancient events with nuclear DNA. Our results suggest that G. max and M. truncatula, along with approximately 7000 other legume species from the same clade, share an ancient round of gene duplications, either due to polyploidy or to some other process.
Maps showing the geographic distribution of the 10 Tegeticula species in the burst of speciation in the T. yuccasella species complex. The distributions outlined by dashed lines are for the two cheater species that oviposit into fruit. (a) Distribution of the six locule-ovipositing species (eggs laid next to developing Yucca ovules in flowers, except T. corruptrix). (b) Distribution of the four superficially ovipositing species (eggs laid in the ovary wall of Yucca flowers, except T. intermedia).
Null distribution to test for goodness of fit of model of evolution for mtDNA sequence data. Model was not rejected (P = 0.87). The value 1236.16 is the difference in the unconstrained and constrained likelihood values for the original data set.  
Phylogram of Tegeticula species based on maximum likelihood analyses of mtDNA sequence data. Bootstrap values above 50 are shown above the branches. Names in dark grey boxes denote locule-ovipositing species and those in light grey boxes superficially ovipositing species. Boxes with borders denote cheater species. Single letters after T. corruptrix denote capsular-fruited yucca feeders (c) and fleshy-fruited yucca feeders (f). T. mojavella was an additional outgroup.  
Phylogram of Tegeticula species based on minimum evolution analyses of Nei-Li distances calculated from 352 AFLP markers. Bootstrap values above 50 are shown above the branches. Names in light grey boxes denote superficially ovipositing species and those in dark grey boxes locule-ovipositing species. Boxes with borders denote cheater species. T. mojavella was an additional outgroup. Single letters after T. corruptrix denote capsular-fruited yucca feeders (c) and fleshy-fruited yucca feeders (f). To the right of the phylogeny are pictures of ovipositors for each species (all to the same scale). Species differences in oviposition habit corresponded with the AFLP topology. Locule-ovipositing species have longer, curved ovipositors in contrast to the shorter, straight ovipositors of superficially ovipositing species.  
Principal component analyses of morphology for (a) female and (b) male Tegeticula species. Superficially ovipositing species: cas = cassandra; e = elatella; i = intermedia; s = superficiella. Locule-ovipositing species: a = altiplanella; b = baccatella; cal = 'california'; y = yuccasella; cor-c = corruptrix from capsular-fruited yuccas; cor-f = corruptrix from fleshy-fruited yuccas.
The interaction between yuccas and yucca moths has been central to understanding the origin and loss of obligate mutualism and mutualism reversal. Previous systematic research using mtDNA sequence data and characters associated with genitalic morphology revealed that a widespread pollinator species in the genus Tegeticula was in fact a complex of pollinator species that differed in host use and the placement of eggs into yucca flowers. Within this mutualistic clade two nonpollinating "cheater" species evolved. Cheaters feed on yucca seeds but lack the tentacular mouthparts necessary for yucca pollination. Previous work suggested that the species complex formed via a rapid radiation within the last several million years. In this study, we use an expanded mtDNA sequence data set and AFLP markers to examine the phylogenetic relationships among this rapidly diverging clade of moths and compare these relationships to patterns in genitalic morphology. Topologies obtained from analyses of the mtDNA and AFLP data differed significantly. Both data sets, however, corroborated the hypothesis of a rapid species radiation and suggested that there were likely two independent species radiations. Morphological analyses based on oviposition habit produced species groupings more similar to the AFLP topology than the mtDNA topology and suggested the two radiations coincided with differences in oviposition habit. Cheating was reaffirmed to have evolved twice, and the closest pollinating relative for one cheater species was identified by both mtDNA and AFLP markers. For the other cheater species, however, the closest pollinating relative remains ambiguous, and mtDNA, AFLP, and morphological data suggest this cheater species may have diverged based on host use. Much of the divergence in the species complex can be explained by geographic isolation associated with the evolution of two oviposition habits.
This study attempts to resolve relationships among and within the four basal arthropod lineages (Pancrustacea, Myriapoda, Euchelicerata, Pycnogonida) and to assess the widespread expectation that remaining phylogenetic problems will yield to increasing amounts of sequence data. Sixty-eight regions of 62 protein-coding nuclear genes (approximately 41 kilobases (kb)/taxon) were sequenced for 12 taxonomically diverse arthropod taxa and a tardigrade outgroup. Parsimony, likelihood, and Bayesian analyses of total nucleotide data generally strongly supported the monophyly of each of the basal lineages represented by more than one species. Other relationships within the Arthropoda were also supported, with support levels depending on method of analysis and inclusion/exclusion of synonymous changes. Removing third codon positions, where the assumption of base compositional homogeneity was rejected, altered the results. Removing the final class of synonymous mutations--first codon positions encoding leucine and arginine, which were also compositionally heterogeneous--yielded a data set that was consistent with a hypothesis of base compositional homogeneity. Furthermore, under such a data-exclusion regime, all 68 gene regions individually were consistent with base compositional homogeneity. Restricting likelihood analyses to nonsynonymous change recovered trees with strong support for the basal lineages but not for other groups that were variably supported with more inclusive data sets. In a further effort to increase phylogenetic signal, three types of data exploration were undertaken. (1) Individual genes were ranked by their average rate of nonsynonymous change, and three rate categories were assigned--fast, intermediate, and slow. Then, bootstrap analysis of each gene was performed separately to see which taxonomic groups received strong support. 
Five taxonomic groups were strongly supported independently by two or more genes, and these genes mostly belonged to the slow or intermediate categories, whereas groups supported only by a single gene region tended to be from genes of the fast category, arguing that fast genes provide a less consistent signal. (2) A sensitivity analysis was performed in which increasing numbers of genes were excluded, beginning with the fastest. The number of strongly supported nodes increased up to a point and then decreased slightly. Recovery of Hexapoda required removal of fast genes. Support for Mandibulata (Pancrustacea + Myriapoda) also increased, at times to "strong" levels, with removal of the fastest genes. (3) Concordance selection was evaluated by clustering genes according to their ability to recover Pancrustacea, Euchelicerata, or Myriapoda and analyzing the three clusters separately. All clusters of genes recovered the three concordance clades but were at times inconsistent in the relationships recovered among and within these clades, a result that indicates that the a priori concordance criteria may bias phylogenetic signal in unexpected ways. In a further attempt to increase support of taxonomic relationships, sequence data from 49 additional taxa for three slow genes (i.e., EF-1 alpha, EF-2, and Pol II) were combined with the various 13-taxon data sets. The 62-taxon analyses supported the results of the 13-taxon analyses and provided increased support for additional pancrustacean clades found in an earlier analysis including only EF-1 alpha, EF-2, and Pol II.
The microbial way of life spans at least 3.8 billion years of evolution. Microbial organisms are pervasive, ubiquitous, and essential components of all ecosystems. The geochemical composition of Earth's biosphere has been molded largely by microbial activities. Yet, despite the predominance of microbes during the course of life's history, general principles and theory of microbial evolution and ecology are not well developed. Until recently, investigators had no idea how accurately cultivated microorganisms represented overall microbial diversity. The development of molecular phylogenetics has recently enabled characterization of naturally occurring microbial biota without cultivation. Free from the biases of culture-based studies, molecular phylogenetic surveys have revealed a vast array of new microbial groups. Many of these new microbes are widespread and abundant among contemporary microbiota and fall within novel divisions that branch deep within the tree of life. The breadth and extent of extant microbial diversity has become much clearer. A remaining challenge for microbial biologists is to better characterize the biological properties of these newly described microbial taxa. This more comprehensive picture will provide much better perspective on the natural history, ecology, and evolution of extant microbial life.
Continuation of the Bayesian consensus matK phylogeny from Figure 2 showing the indigoferoid, millettioid, and Hologalegina crown clades. Estimated substitution parameters for this Bayesian analysis are given in Table 1. Detailed age and rate estimates for all nodes labeled with letters or numbers are presented in Table 2. Age estimates are reported for the older crown clades. Ma = million years. The nodes labeled with diamonds are those with fixed (node A) or minimum fossil constraints (nodes B to M).
Descriptive statistical comparisons of mean rate estimates (see Tables 2 and 3) derived from the matK and rbcL data sets. (a) Rate estimates derived from matK using penalized likelihood (PL) for the 94 nodes (B to M, 1 to 82) listed in Table 1 compared to PL estimates for the same nodes derived from not invoking the 12 fossil-calibrated minimum age constraints; subs/site/Ma = substitution per site per million years. (b) Rate estimates derived from matK using PL with the 12 minimum age constraints imposed and compared to the rate constant Langley-Fitch (LF) and rate variable nonparametric rate smoothing (NPRS) methods. The optimum substitution rate estimated with LF using minimal age constraints is indicated by the line with an intercept of 0.00081 substitutions/site/Ma (diamonds). Without minimum age constraints, the rate constant estimate is 0.00144 substitutions/site/Ma. (c) Comparison of rate estimates derived from each of the matK and rbcL Bayesian consensus phylogenies for the 28 comparable nodes listed in Table 3 (not including root node A) using PL rate smoothing (see Figs. 1 and 3 for the identification of the labeled nodes).
Phylogenies derived from the matK data set. (a) Bayesian consensus phylogram. Scale bar equals 0.005 substitutions per site. (b) Penalized likelihood rate-smoothed chronogram with the 12 minimum age constraints imposed. Scale bar equals 10 Ma. The vertical lines (b) divide the 60-Ma duration of legume evolution into 15-Ma segments. The oldest and taxonomically large crown clades are alternately shaded black and gray. The shapes of both phylogenies depict generally short internal branch lengths indicative of a rapid family-wide diversification.
Phylogenies derived from the rbcL data set. (a) Bayesian consensus phylogram. Scale bar equals 0.005 substitutions per site. (b) Penalized likelihood rate-smoothed chronogram with the 12 minimum time constraints enforced. Scale bar equals 10 Ma. The vertical lines (b) divide the 60-Ma duration of legume evolution into 15-Ma segments. The shapes of both phylogenies depict generally short internal branch lengths indicative of a rapid family-wide diversification. The oldest and taxonomically large crown clades are alternately shaded black and gray, with the top gray shaded lines representing the outgroups and the following paraphyletic grade of black lines representing the caesalpinioid legumes. Otherwise, the groups are labeled. The arrows indicate the legume stem clade.
Descriptive statistical comparisons of mean age estimates when the fixed age of the legume stem clade is scaled between 60 and 70 Ma. (a) Age estimates derived from the matK data set for the 94 nodes identified in Table 2 (B to M, 1 to 82; not including the root node). (b) Age estimates derived from the rbcL data set for the 28 nodes identified in Table 3 that are shared with the matK phylogeny (not including root node). (c) Age estimates derived from the matK data set or rbcL data set (where noted) for selected old crown clades.
Tertiary macrofossils of the flowering plant family Leguminosae (legumes) were used as time constraints to estimate ages of the earliest branching clades identified in separate plastid matK and rbcL gene phylogenies. Penalized likelihood rate smoothing was performed on sets of Bayesian likelihood trees generated with the AIC-selected GTR+Gamma+I substitution model. Unequivocal legume fossils dating from the Recent continuously back to about 56 million years ago were used to fix the family stem clade at 60 million years (Ma), and at 1-Ma intervals back to 70 Ma. Specific fossils that showed distinctive combinations of apomorphic traits were used to constrain the minimum age of 12 specific internal nodes. These constraints were placed on stem rather than respective crown clades in order to bias for younger age estimates. Regardless, the mean age of the legume crown clade differs by only 1.0 to 2.5 Ma from the fixed age of the legume stem clade. Additionally, the oldest caesalpinioid, mimosoid, and papilionoid crown clades show approximately the same age range of 39 to 59 Ma. These findings all point to a rapid family-wide diversification, and predict few if any legume fossils prior to the Cenozoic. The range of the matK substitution rate, 2.1-24.6 × 10^-10 substitutions per site per year, is higher than that of rbcL, 1.6-8.6 × 10^-10, and is accompanied by more uniform rate variation among codon positions. The matK and rbcL substitution rates are highly correlated across the legume family. For example, both loci have the slowest substitution rates among the mimosoids and the fastest rates among the millettioid legumes. This explains why groups such as the millettioids are amenable to species-level phylogenetic analysis with these loci, whereas other legume groups are not.
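The abstract reports rates per site per year, whereas the accompanying figure captions report substitutions per site per Ma. A minimal sketch of the conversion follows, assuming only that 1 Ma = 10^6 years; the function names are illustrative and do not come from the study. It checks that the Langley-Fitch matK rate quoted in the rate-comparison caption (0.00081 substitutions/site/Ma) lies inside the matK range given here.

```python
YEARS_PER_MA = 1_000_000  # 1 Ma = one million years


def per_ma_to_per_year(rate_per_ma):
    """Convert substitutions/site/Ma to substitutions/site/year."""
    return rate_per_ma / YEARS_PER_MA


def per_year_to_per_ma(rate_per_year):
    """Convert substitutions/site/year to substitutions/site/Ma."""
    return rate_per_year * YEARS_PER_MA


# matK range from the abstract, in substitutions/site/year
MATK_RANGE = (2.1e-10, 24.6e-10)

# Langley-Fitch estimate from the figure caption, converted to per-year units
lf_rate_per_year = per_ma_to_per_year(0.00081)
lf_within_matk_range = MATK_RANGE[0] <= lf_rate_per_year <= MATK_RANGE[1]
```

Expressed per year, the clock-like estimate is about 8.1 × 10^-10, which falls within the reported matK range.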
Schematic representation of the 5S amplicons obtained in this study, using the PCR 3 approach shown in Figure 1 and the 5SF and 5SR primers. The obtained monomers (PCR 3a; not analysed), dimers (PCR 3b), trimers (PCR 3c), and tetramers (PCR 3d) are shown. 
Phylogenetic relationships using ML analyses between 5S rDNA haplotypes from direct (a), chimeric (b), and combined (c) sequences. Direct sequences (circles) also present in chimeric sequences (squares) are shown as composed circles inside squares. The best nucleotide substitution model was Kimura 2-parameter (a, c), and Jukes-Cantor evolutionary distances. ML BS values were based on 100 replicates and are shown at interior nodes. Each species is represented by acronyms: M. arborea (ARB), M. citrina (CIT), and M. strasseri (STR). 
Multigene families have provided opportunities for evolutionary biologists to assess molecular evolution processes and phylogenetic reconstructions at deep and shallow systematic levels. However, the use of these markers is not free of technical and analytical challenges. Evolutionary studies that used the nuclear 5S rDNA gene family have rarely used contiguous 5S coding sequences, owing to the routine use of head-to-tail PCR primers that are anchored to the coding region. Moreover, the 5S coding sequences have been concatenated with independent, adjacent gene units in many studies, creating simulated chimeric genes as the raw data for evolutionary analysis. This practice is based on the tacitly assumed, but rarely tested, hypothesis that strict intra-locus concerted evolution processes are operating in 5S rDNA genes, without any empirical evidence as to whether it holds for the recovered data. The potential pitfalls of analysing the patterns of molecular evolution and reconstructing phylogenies based on these chimeric genes have not been assessed to date. Here, we compared the sequence integrity and phylogenetic behaviour of entire versus concatenated 5S coding regions from a real data set obtained from closely related plant species (Medicago, Fabaceae). Our results suggest that within-array sequence homogenization is only partially operating in the 5S coding region, which is traditionally assumed to be highly conserved. Consequently, concatenating 5S genes increases haplotype diversity, generating novel chimeric genotypes that most likely do not exist within the genome. In addition, the patterns of gene evolution are distorted, leading to incorrect haplotype relationships in some evolutionary reconstructions.
Distribution map of Sarracenia alata in the southern US. Dashed lines show the approximate range of the species. Western sampling localities from this study are marked with filled circles and include the following populations: Sundew (S), Pitcher Trail (P), Bouton Lake (B), Cooter's Bog (C), and Kisatchie (K). Eastern populations are marked with open circles and include the following populations: Abita Springs (A), Talisheek (T), Lake Ramsey (L), Franklin Creek (F), and DeSoto (D). Minor rivers (light grey) and major rivers and water bodies (dark gray and thick black lines) are shown.  
Population clustering analyses suggest significant population structure among S. alata populations. A) Likelihood scores for each value of k genetic clusters from Structure (Pritchard et al. 2000). B) Δk scores for each value of k genetic clusters following Evanno et al. (2005). Figures were generated using Structure Harvester (Earl et al. 2011).
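The Evanno et al. (2005) statistic referred to in panel B can be sketched as follows. This is a minimal illustration, not the Structure Harvester code: it assumes the common formulation in which the absolute second-order difference of the mean log-likelihood across replicate runs is divided by the standard deviation of the log-likelihoods at that k. The input values in the usage note are hypothetical.

```python
from statistics import mean, stdev


def delta_k(lnp_runs):
    """Compute the Evanno statistic for each interior value of k.

    lnp_runs maps k -> list of ln P(D | k) values from replicate runs.
    The statistic is undefined for the smallest and largest k, which lack
    a neighbour on one side, so those values are omitted from the result.
    """
    means = {k: mean(values) for k, values in lnp_runs.items()}
    result = {}
    for k in sorted(lnp_runs):
        if k - 1 in means and k + 1 in means:
            # Absolute second-order rate of change of the mean log-likelihood
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            # Scale by the spread among replicate runs at this k
            result[k] = second_diff / stdev(lnp_runs[k])
    return result
```

For example, with hypothetical replicate log-likelihoods `{1: [-5000, -5002, -4998], 2: [-4500, -4502, -4498], 3: [-4480, -4484, -4476], 4: [-4470, -4475, -4465]}`, the statistic peaks at k = 2. Replicates with identical log-likelihoods at some k give a zero standard deviation and would need separate handling.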
Maximum Clade Credibility tree for the 10 sampled populations generated using *BEAST (Drummond and Rambaut 2007; Heled and Drummond 2010). The population phylogeny is shown at top, with posterior probabilities of each node shown. Scale bar to left of phylogeny corresponds to 5.0 × 10^-5 substitutions/site/generation. The bottom of the figure shows histograms generated by Structure (Pritchard et al. 2000) for the sampling localities (i.e., k = 10).
Environmental niche models and environmental variation for eastern (blue) and western (red) populations of S. alata. a) Divergence in niches across the Mississippi River. Predictions were calculated using MAXENT v 3.3.3e (Phillips et al. 2006), with darker colors showing greater prediction scores. The thick black line denotes the Mississippi River. b) Principal components axes 1 and 6 show significant niche divergence, c) whereas axes 2 and 3 show significant niche conservatism. WS: western sampled points; WB: western background points; ES: eastern sampled points; EB: eastern background points.  
We collected ∼29 kb of sequence data using Roche 454 pyrosequencing in order to estimate the timing and pattern of diversification in the carnivorous pitcher plant Sarracenia alata. Utilizing modified protocols for reduced representation library construction, we generated sequence data from 86 individuals across 10 populations from throughout the range of the species. We identified 76 high-quality and high-coverage loci (containing over 500 SNPs) using the bioinformatics pipeline PRGmatic. Results from a Bayesian clustering analysis indicate that populations are highly structured, and are similar in pattern to the topology of a population tree estimated using *BEAST. The pattern of diversification within Sarracenia alata implies that riverine barriers are the primary factor promoting population diversification, with divergence across the Mississippi River occurring more than 60,000 generations before present. Further, significant patterns of niche divergence and the identification of several outlier loci suggest that selection may contribute to population divergence. Our results demonstrate the feasibility of using next-generation sequencing to investigate intraspecific genetic variation in nonmodel species.
Despite the recent surge of interest in studying the evolution of development, surprisingly little work has been done to investigate the phylogenetic signal in developmental characters. Yet, both the potential usefulness of developmental characters in phylogenetic reconstruction and the validity of inferences on the evolution of developmental characters depend on the presence of such a phylogenetic signal and on the ability of our coding scheme to capture it. In a recent study, we showed, using simulations, that a new method (called the continuous analysis) using standardized time or ontogenetic sequence data and squared-change parsimony outperformed event pairing and event cracking in analyzing developmental data on a reference phylogeny. Using the same simulated data, we demonstrate that all these coding methods (event pairing and standardized time or ontogenetic sequence data) can be used to produce phylogenetically informative data. Despite some dependence between characters (the position of an event in an ontogenetic sequence is not independent of the position of other events in the same sequence), parsimony analysis of such characters converges on the correct phylogeny as the amount of data increases. In this context, the new coding method (developed for the continuous analysis) outperforms event pairing; it recovers a lower proportion of incorrect clades. This study thus validates the use of ontogenetic data in phylogenetic inference and presents a simple coding scheme that can extract a reliable phylogenetic signal from these data.
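The two families of coding schemes compared above can be illustrated with a short sketch. It assumes the conventional event-pair coding (0 = first event earlier, 1 = simultaneous, 2 = first event later) and a simple min-max rescaling of sequence position to the interval [0, 1]; the exact standardization used in the study may differ, and the function names are illustrative.

```python
from itertools import combinations


def event_pairs(ranks):
    """Code every pair of events from one ontogenetic sequence.

    ranks maps event name -> position in the sequence (ties allowed).
    Returns {(a, b): code}: 0 if a precedes b, 1 if simultaneous, 2 if a follows b.
    """
    codes = {}
    for a, b in combinations(sorted(ranks), 2):
        if ranks[a] < ranks[b]:
            codes[(a, b)] = 0
        elif ranks[a] == ranks[b]:
            codes[(a, b)] = 1
        else:
            codes[(a, b)] = 2
    return codes


def standardized_positions(ranks):
    """Rescale sequence positions to [0, 1] (assumed min-max standardization)."""
    lowest, highest = min(ranks.values()), max(ranks.values())
    if highest == lowest:
        return {event: 0.0 for event in ranks}
    return {event: (r - lowest) / (highest - lowest) for event, r in ranks.items()}
```

For a hypothetical sequence `{"A": 1, "B": 2, "C": 2, "D": 3}`, event pairing yields n(n-1)/2 = 6 discrete characters, whereas the standardized coding yields one continuous character per event (A = 0.0, B = C = 0.5, D = 1.0). The contrast shows why event-pair characters are not independent of one another.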
Complex organs such as eyes are commonly lost during evolution, but the timescale on which lost phenotypes could be reactivated is a matter of long-standing debate, with important implications for the molecular mechanisms of trait loss. Two phylogenetic approaches have been used to test whether regain of traits has occurred. One way is by comparison of nested, continuous-time Markov models of trait evolution, approaches that we term tree-based tests. A second way to demonstrate statistical support for trait regain is through use of node-based tests that employ explicit estimation of ancestral node states. Here, we estimate new molecular and morphological phylogenies and use them to examine the possibility of eye regain and dispersal between abyssal and shallow seas during the history of cylindroleberidid ostracods, a family of about 200 species, comprising both eyeless and sighted species. First, we confirmed that eye presence/absence is correlated with habitat depth. Parameter estimates from a phylogenetic model indicate that speciation is more rapid in deep-sea eyeless clades compared with shallow-water sighted clades. In addition, we found that tree-based statistical tests usually indicated reversals, including both transitions from deep to shallow seas and regain of eyes. In contrast, node-based statistical tests usually failed to show significant support for reversals. These results also hold for simulated phylogenies, indicating that they are not unique to the current data set. We recommend that both tree-based and node-based tests should be examined before making conclusions about character reversal and that ideally, alternative character histories should be tested using additional data, besides just the phylogenetic distribution of presence/absence of the characters.
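One common way to compare two nested Markov models, such as a model that forbids regain against one that allows it, is a likelihood ratio test. The sketch below assumes that approach with one extra free parameter and a chi-square reference distribution with 1 degree of freedom. The abstract does not specify which comparison statistic was used, so this is an illustration rather than the authors' procedure, and the log-likelihoods in the usage note are hypothetical.

```python
import math


def lrt_df1(lnl_constrained, lnl_full):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (statistic, p_value). For 1 degree of freedom the chi-square
    survival function equals erfc(sqrt(x / 2)).
    """
    # Guard against tiny negative values caused by optimizer noise
    statistic = max(2.0 * (lnl_full - lnl_constrained), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

With hypothetical log-likelihoods of -52.0 for the irreversible model and -50.0 for the reversible model, the statistic is 4.0 and the p-value is about 0.046. When the constrained model fixes a rate at zero, the parameter sits on the boundary of its range, so the plain chi-square reference is conservative.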
Pollen diversity in Acanthaceae, illustrating a small fraction of variation found among lineages (see Appendix 1 for further details). Top row depicts pollen types relevant to fossils utilized in this study: a) Acantheae: Neriacanthus grandiflorus (Daniel et al. 8152), with colpate (simple) apertures. b) Justicieae: Justicia tenuistachys (Colque and Tapia 276), showing tricolporate (compound) apertures, with characteristic "insulae" on apertural face. c) Ruellieae: Phaulopsis betonica (Love and Congclon 3157), showing tricolporate (compound) apertures surrounded by characteristic "sexine lips" and bands of pseudocolpi between apertures. d) Ruellieae: Ruelliopsis setosa (Smith 3107), showing same features as (c). e) Ruellieae: Trichosanchezia chrysothrix (Diaz et al. 6954), showing bicolporate (compound) apertures surrounded by "sexine lips" and bands of pseudocolpi that are arranged in opposing 90° orientations. f) Ruellieae: Sanchezia decora (Foster 8790), showing same features as (e). Bottom row illustrates additional variation found in the family. g) Ruellieae: Ruellia geayi (Daniel 11048). h) Andrographideae: Phlogacanthus thyrsiflorus (Lindburg 200). i) Justicieae: Trichaulax mwasumbii (Mwasumbi 14238). j) Barlerieae: Lasiocladus sp. (Daniel et al. 11058). k) Whitfieldieae: Chlamydacanthus euphorbioides (Capuron 24734P). l) Justicieae: Mirandea sylvatica (Wendt et al. 4104). m) Ruellieae: Petalidium ramulosum (Volk 57). Images reproduced from SEM micrographs from earlier studies (a: McDade et al. 2005; c-g: Tripp et al. 2013; h, j, k: McDade et al. 2008; i, l: Daniel et al. 2008) except (b) (courtesy of C. Kiel) and (m) (Tripp, unpublished).
Fossil constraints and priors used in Analysis 1
Best estimates of divergence times, major dispersal events within Acanthaceae based on BEAST Analyses 1A, and correlation with climatic and geological episodes in Earth history since late Cretaceous. Terminal taxa serve as phylogenetic place holders for more diverse clades for which we have more extensive phylogenetic information (see text: OW to NW dispersal events are inferred from other studies with more taxa per clade sampled). Circled plus signs denote an OW to NW dispersal event (n = 13) and 95% HPD intervals are depicted on those branches with gray bars. Circled minus sign denotes the sole example of a NW to OW dispersal event (n = 1; the lineage inclusive of Elytraria in Nelsonioideae). Dashed line represents the onset of OW to NW dispersals (n = 11) in Acanthaceae s.s.; all occurred within the last ∼20 Ma (significantly skewed toward the present) despite the fact that the lineage to which they belong is over three times as old. Small gray boxes with numerals identify nodes calibrated with fossils in Analysis 1 only, Analysis 2 only, or in both analyses (Tables 2 and 3). Major clades of Acanthaceae s.l. labeled at far right; Acanthaceae s.s. encompass Acantheae through Ruellieae in this figure. Temporal range of Gondwana and Atlantic and Pacific land bridges indicated by colored boxes overlaying phylogeny (key in upper right). Reconstruction of deep ocean temperatures (as a proxy for global temperature) is derived from oxygen isotopes corrected for variation in global ice volume (from Working Group I, 2007 Intergovernmental Panel on Climate Change report; see Fig. 6.1 therein). Approximate delimitation of climatic events presented is based on work of other authors (Wolfe 1975; Nilsen 1978; Mathews 1979; McKenna 1983; Tiffney 1985; see Burbrink and Lawson 2007 for partial summary). All branches have >99% posterior probabilities.
More than a decade of phylogenetic research has yielded a well-sampled, strongly supported hypothesis of relationships within the large (> 4,000 species) plant family Acanthaceae. This hypothesis points to intriguing biogeographic patterns and asymmetries in sister clade diversity but, absent a time-calibrated estimate for this evolutionary history, these patterns have remained unexplored. Here, we reconstruct divergence times within Acanthaceae using fossils as calibration points and experimenting with both fossil selection and effects of invoking a maximum age prior related to the origin of Eudicots. Contrary to earlier reports of a paucity of fossils of Lamiales (an order of ~23,000 species that includes Acanthaceae) and to the expectation that a largely herbaceous to soft-wooded and tropical lineage would have few fossils, we recovered 51 reports of fossil Acanthaceae. Rigorous evaluation of these for accurate identification, quality of age assessment, and utility in dating yielded eight fossils judged to merit inclusion in analyses. With nearly 10 kilobases of DNA sequence data, we used two sets of fossils as constraints to reconstruct divergence times. We demonstrate differences in age estimates depending on fossil selection and that enforcement of maximum age priors substantially alters estimated clade ages, especially in analyses that utilize a smaller rather than larger set of fossils. Our results suggest that long-distance dispersal events explain present-day distributions better than do Gondwanan or northern land bridge hypotheses. This biogeographical conclusion is for the most part robust to alternative calibration schemes. Our data support a minimum of 13 Old World to New World dispersal events but, intriguingly, only one in the reverse direction. Eleven of these 13 were among Acanthaceae s.s., which comprises > 90% of species diversity in the family. 
Remarkably, if minimum age estimates approximate true history, these 11 events occurred within the last ~20 million years even though Acanthaceae s.s. is over three times as old. A simulation study confirmed that these dispersal events were significantly skewed towards the present and not simply a chance occurrence. Finally, we review reports of fossils that have been assigned to Acanthaceae that are substantially older than the lower Cretaceous estimate for Angiosperms as a whole (i.e., the general consensus that has resulted from several recent dating and fossil-based studies in plants). This is the first study to reconstruct divergence times among clades of Acanthaceae and sets the stage for comparative evolutionary research in this and related families that have until now been thought to have extremely poor fossil resources.
Phylogeny of the major lineages of hard ticks and other chelicerate arthropods (outgroups) (after Klompen et al., 2000) with markers from data matrix A (Table 2; Appendix 1) mapped onto it. Bold lines indicate relationships that had >70% bootstrap support in the Klompen et al. (2000) study (some tip branches are bold because those branches represent multiple
Phylogeny of the major lineages of hard ticks of the subfamily Rhipicephalinae, after Murrell et al. (2001b), with markers from data matrix B mapped onto it (Table 2; Appendix 1). Bold lines indicate relationships that had >70% bootstrap support in the Murrell et al. (2001b) study (some tip branches are bold because those branches represent multiple taxa). Tree length = 15; consistency index = 0.86; retention index = 0.91; rescaled consistency index = 0.78. Vertical line labeled Rp indicates the genus Rhipicephalus s.l. (=Rhipicephalus s.s. + Boophilus; as recommended by Murrell et al., 2001b). Numbers above branches and taxa identify those branches and taxa in text and tables. Because of missing data (see Appendix 2), in some cases it is unclear to which of two or more branches a marker mapped. Thus, the possible branches that a marker mapped to are indicated with arrows.  
(A) Consensus of 15 shortest maximum parsimony trees (length = 18; CI = 1; RI = 1; RCI = 1) for the Ixodida inferred from the characters in Table 2 (matrix in Appendix 1). Numbers above branches are the number of changes. (B) Consensus of five shortest maximum parsimony trees (length = 5; CI = 1; RI = 1; RCI = 1) for the Rhipicephalinae inferred with the characters from Table 3 (matrix in Appendix 2). Numbers above branches are the number of changes.  
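The tree statistics quoted in these captions are related by standard definitions: CI = m/s, RI = (g - s)/(g - m), and RCI = CI × RI, where s is the observed tree length and m and g are the minimum and maximum possible numbers of steps. The sketch below applies those definitions as a consistency check. The value of g in the usage note is hypothetical, since the captions do not report it.

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s: minimum possible steps over observed tree length."""
    return min_steps / observed_steps


def retention_index(min_steps, observed_steps, max_steps):
    """RI = (g - s) / (g - m), defined only when g > m."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


def rescaled_consistency_index(ci, ri):
    """RCI is the product of CI and RI."""
    return ci * ri


# Values reported for the Rhipicephalinae tree: CI = 0.86, RI = 0.91
rci_rhipicephalinae = rescaled_consistency_index(0.86, 0.91)
```

The product for the Rhipicephalinae tree rounds to 0.78, matching the reported RCI. For the consensus trees with CI = 1, the observed length equals the minimum (for example s = m = 18), so RI and RCI are also 1 for any g greater than m.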
Idiosyncratic markers are features of genes and genomes that are so unusual that it is unlikely that they evolved more than once in a lineage of organisms. Here we explore further the potential of idiosyncratic markers and changes to typically conserved tRNA sequences for phylogenetic inference. Hard ticks were chosen as the model group because their phylogeny has been studied extensively. Fifty-eight candidate markers from hard ticks (family Ixodidae) and 22 markers from the subfamily Rhipicephalinae sensu lato were mapped onto phylogenies of these groups. Two of the most interesting markers, features of the secondary structure of two different tRNAs, gave strong support to the hypothesis that species of the Prostriata (Ixodes spp.) are monophyletic. Previous analyses of genes and morphology did not strongly support this relationship, instead suggesting that the Prostriata is paraphyletic with respect to the Metastriata (the rest of the hard ticks). Parallel or convergent evolution was not found in the arrangements of mitochondrial genes in ticks nor were there any reversals to the ancestral arthropod character state. Many of the markers identified were phylogenetically informative, whereas others should be informative with study of additional taxa. Idiosyncratic markers and changes to typically conserved nucleotides in tRNAs that are phylogenetically informative were common in this data set, and thus these types of markers might be found in other organisms.
High-throughput DNA sequencing has the potential to accelerate species discovery if it is able to recognize evolutionary entities from sequence data that are comparable to species. The general mixed Yule-coalescent (GMYC) model estimates the species boundary from DNA surveys by identifying independently evolving lineages as a transition from coalescent to speciation branching patterns on a phylogenetic tree. Applied here to 12 families from 4 orders of insects in Madagascar, we used the model to delineate 370 putative species from mitochondrial DNA sequence variation among 1614 individuals. These were compared with data from the nuclear genome and morphological identification and found to be highly congruent (98% and 94%). We developed a modified GMYC that allows for a variable transition from coalescent to speciation among lineages. This revised model increased the congruence with morphology (97%), suggesting that a variable threshold better reflects the clustering of sequence data into biological species. Local endemism was pronounced in all 5 insect groups. Most species (60-91%) and haplotypes (88-99%) were found at only 1 of the 5 study sites (40-1000 km apart). This pronounced endemism resulted in a 37% increase in species numbers using diagnostic nucleotides in a population aggregation analysis. Sample sizes between 7 and 10 individuals represented a threshold above which there was minimal increase in genetic diversity, broadly agreeing with coalescent theory and other empirical studies. Our results from > 1.4 Mb of empirical data suggest that the GMYC model captures species boundaries comparable to those from traditional methods without the need for prior hypotheses of population coherence. This provides a method of species discovery and biodiversity assessment using single-locus data from mixed or environmental samples while building a globally available taxonomic database for future identifications.
Flowchart of the likelihood ratchet strategy. ML = maximum likelihood. 
The existence of multiple likelihood maxima necessitates algorithms that explore a large part of the tree space. However, because of computational constraints, stepwise addition-based tree-searching methods do not allow for this exploration in reasonable time. Here, I present an algorithm that increases the speed at which the likelihood landscape can be explored. The iterative algorithm combines the computational speed of distance-based tree construction methods to arrive at approximations of the global optimum with the accuracy of optimality criterion based branch-swapping methods to improve on the result of the starting tree. The algorithm moves between local optima by iteratively perturbing the tree landscape through a process of reweighting randomly drawn samples of the underlying sequence data set. Tests on simulated and real data sets demonstrated that the optimal solution obtained using stepwise addition-based heuristic searches was found faster using the algorithm presented here. Tests on a previously published data set that established the presence of tree islands under maximum likelihood demonstrated that the algorithm identifies the same tree islands in a shorter amount of time than that needed using stepwise addition. The algorithm can be readily applied using standard software for phylogenetic inference.
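The perturbation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `search` and `score` are hypothetical callbacks standing in for a branch-swapping search under given site weights and a likelihood evaluation on the unweighted data, and the default fraction of sites reweighted (0.25) and the weight applied (2) are arbitrary choices.

```python
import random


def ratchet_weights(n_sites, fraction=0.25, upweight=2, rng=None):
    """Return per-site weights with a random subset of sites upweighted."""
    rng = rng or random.Random()
    n_chosen = round(n_sites * fraction)
    chosen = set(rng.sample(range(n_sites), n_chosen))
    return [upweight if i in chosen else 1 for i in range(n_sites)]


def ratchet(start_tree, n_sites, search, score, iterations=10, seed=0):
    """Alternate perturbed and unperturbed searches, keeping the best tree.

    search(tree, weights) -> tree improved under the given site weights
    score(tree)           -> log-likelihood on the original data
    Both callbacks are placeholders supplied by the caller.
    """
    rng = random.Random(seed)
    equal = [1] * n_sites
    # Start from a quick tree (for example a distance-based one) and refine it
    best = search(start_tree, equal)
    for _ in range(iterations):
        # Reweight a random sample of sites to reshape the landscape
        perturbed = search(best, ratchet_weights(n_sites, rng=rng))
        # Return to the original weights and search from the perturbed tree
        candidate = search(perturbed, equal)
        if score(candidate) > score(best):
            best = candidate
    return best
```

In practice the callbacks would wrap calls to phylogenetic software. Keeping them as parameters reflects the point that the strategy can be layered on top of standard inference programs.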
Molecular evolutionary rate heterogeneity (the violation of a molecular clock) is a prominent feature of many phylogenetic datasets. It has particular importance to systematists not only because of its biological implications, but also for its practical effects on our ability to infer and date evolutionary events. Here we show, using both maximum likelihood and Bayesian approaches, that a remarkably strong increase in substitution rate in the vittarioid ferns is consistent across the nuclear and plastid genomes. Contrary to some expectations, this rate increase is not due to selective forces acting at the protein level on our focal loci. The vittarioids bear no signature of the change in the relative strengths of selection and drift that one would expect if the rate increase was caused by altered post-mutation fixation rates. Instead, the substitution rate increase appears to stem from an elevated supply of mutations, perhaps limited to the vittarioid ancestral branch. This generalized rate increase is accompanied by extensive fine-scale heterogeneity in rates across loci, genomes, and taxa. Our analyses demonstrate the effectiveness and flexibility of trait-free investigations of rate heterogeneity within a model selection framework, emphasize the importance of explicit tests for signatures of selection prior to invoking selection-related or demography-based explanations for patterns of rate variation, and illustrate some unexpected nuances in the behavior of relaxed clock methods for modeling rate heterogeneity, with implications for our ability to confidently date divergence events. In addition, our data provide strong support for the monophyly of Adiantum, and for the position of Calciphilopteris in the cheilanthoid ferns, two relationships for which convincing support was previously lacking.
Single-access keys are a major tool for biologists who need to identify specimens. Constructing these keys is particularly complex (especially if the input data set is large), so an automatic single-access key generation tool is essential. As part of the European project ViBRANT, our aim was to develop such a tool as a web service, thus allowing end-users to integrate it directly into their workflow. IKey+ generates single-access keys on demand, for single users or research institutions. It receives user input data (using the standard SDD format), accepts several key-generation parameters (affecting the key topology and representation), and supports several output formats. IKey+ is freely available (sources and binary packages). Furthermore, it is deployed on our server and can be queried (for testing purposes) through a simple web client (last accessed 13 August 2012). Finally, a client plugin will be integrated into the Scratchpads biodiversity networking tool. [Systematics; taxonomy; single-access key; web service; biodiversity informatics.]
Topological impact 
AIC gain per site compared to LG (and WAG and JTT)  
Number of alignments with better/worse likelihood values than LG  
Comparison of CONF/MIX, PART and MIX  
Distribution of the confidence coefficient χ depending on the site partition  
Amino acid substitution models are essential to most methods for inferring phylogenies from protein data. These models represent the ways in which proteins evolve and substitutions accumulate over time. It is widely accepted that substitution processes vary depending on the structural configuration of the protein residues. However, this information is very rarely used in phylogenetic studies, even though the 3-dimensional structure of tens of thousands of proteins has been elucidated. Here, we reinvestigate the question in order to fill this gap. We use an improved estimation methodology and a very large database comprising 1471 nonredundant globular protein alignments with structural annotations to estimate new amino acid substitution models accounting for the secondary structure and solvent accessibility of the residues. These models incorporate a confidence coefficient that is estimated from the data and reflects the reliability and usefulness of structural annotations in the analyzed sequences. Our results with 300 independent test alignments show an impressive likelihood gain compared with standard models such as JTT or WAG. Moreover, the use of these models induces significant topological changes in the inferred trees, which should be of primary interest to phylogeneticists. Our data, models, and software are available for download.
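The per-site AIC gain used to compare models against a reference such as LG can be illustrated with the standard definition AIC = 2k - 2 ln L. The sketch below is an assumption about how such a figure is computed: the function names are hypothetical, and the abstract does not say how free parameters were counted.

```python
def aic(log_likelihood, n_free_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_free_params - 2 * log_likelihood


def aic_gain_per_site(lnl_ref, k_ref, lnl_new, k_new, n_sites):
    """AIC of the reference model minus AIC of the new model, per alignment site.

    A positive value means the new model is preferred over the reference.
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return (aic(lnl_ref, k_ref) - aic(lnl_new, k_new)) / n_sites
```

For instance, with made-up values, a reference model with ln L = -1000 and 10 free parameters compared with a new model with ln L = -950 and 12 free parameters on a 200-site alignment gives a gain of 0.48 AIC units per site.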
The 50% majority rule consensus tree of the 30,000 trees sampled in the MCMC analyses. Individuals from Africa are indicated by the vertical bars.  
Correlation between the posterior probabilities of individual clades, f(τ_i | X), obtained from the separate Markov chains.  
Many biogeographic problems are tested on phylogenetic trees. Typically, the uncertainty in the phylogeny is not accommodated when investigating the biogeography of the organisms. Here we present a method that accommodates uncertainty in the phylogenetic trees. Moreover, we describe a simple method for examining the support for competing biogeographic scenarios. We illustrate the method using mitochondrial DNA sequences sampled from modern humans. The geographic origin of modern human mtDNA is inferred to be in Africa, although support for this hypothesis was ambiguous for data from an early paper.
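One common way to accommodate phylogenetic uncertainty is to evaluate a biogeographic scenario on every tree sampled by an MCMC analysis and report the fraction of samples supporting it. The sketch below assumes each sampled tree has already been classified into a scenario label; the function name and the labels are hypothetical, and this is not necessarily the procedure used in the paper.

```python
from collections import Counter


def scenario_support(sampled_scenarios):
    """Estimate the posterior support of each scenario from MCMC samples.

    `sampled_scenarios` holds one label per sampled tree, for example the
    inferred geographic origin on that tree (labels are illustrative).
    """
    if not sampled_scenarios:
        raise ValueError("at least one sampled tree is required")
    counts = Counter(sampled_scenarios)
    total = len(sampled_scenarios)
    return {scenario: n / total for scenario, n in counts.items()}
```

Because every sampled tree contributes in proportion to how often it was visited, the resulting frequencies average over the tree uncertainty instead of conditioning on a single point estimate.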