-
ABSTRACT: The parsimony score of a character on a tree equals the number of state changes required to fit that character onto the tree. We show that for unordered, reversible characters this score equals the number of tree rearrangements required to fit the tree onto the character. We discuss implications of this connection for the debate over the use of consensus trees or total evidence and show how it provides a link between incongruence of characters and recombination.
Systematic Biology 05/2008; 57(2):251-6. · 10.23 Impact Factor
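The parsimony score described in this abstract can be illustrated with Fitch's small-parsimony algorithm for a single unordered character. The sketch below is not taken from the paper; it assumes a rooted binary tree written as nested tuples, and the names `fitch_score`, `tree`, and `states` are hypothetical.

```python
def fitch_score(tree, states):
    """Return the minimum number of state changes for one unordered character.

    tree:   rooted binary tree as nested 2-tuples; leaves are taxon names
            (an assumed, illustrative representation).
    states: dict mapping each taxon name to its observed character state.
    """
    def visit(node):
        # A leaf contributes its observed state and no changes.
        if not isinstance(node, tuple):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # The children can agree on a state: no extra change needed here.
            return common, left_cost + right_cost
        # The children disagree: take the union and count one change.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


tree = (("A", "B"), ("C", "D"))
# A character that fits the tree needs a single change.
print(fitch_score(tree, {"A": 0, "B": 0, "C": 1, "D": 1}))  # 1
# A character that conflicts with the tree needs two changes.
print(fitch_score(tree, {"A": 0, "B": 1, "C": 0, "D": 1}))  # 2
```

The second call shows an incongruent character: it needs one more change than the minimum possible for a two-state character, which is the kind of extra cost the abstract relates to tree rearrangements.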
-
ABSTRACT: Determining an optimal phylogenetic tree using maximum parsimony, also referred to as the Steiner tree problem in phylogenetics, is NP-hard. Here we provide a new formulation for this problem which leads to an analytical and linear time solution when the dimensionality (sequence length, or number of characters) is at most two. This new formulation of the problem provides a direct link between the maximum parsimony problem and the maximum compatibility problem via the intersection graph. The solution for the “two character case” has numerous practical applications in phylogenetics, some of which are discussed.
Annals of Combinatorics 03/2008; 12(1):45-51. · 0.32 Impact Factor
-
ABSTRACT: Several models have been proposed to relax the molecular clock in order to estimate divergence times. However, it is unclear which model has the best fit to real data and should therefore be used to perform molecular dating. In particular, we do not know whether rate autocorrelation should be considered or which prior on divergence times should be used. In this work, we propose a general benchmark of alternative relaxed clock models. We have reimplemented most of the already existing models, including the popular lognormal model, as well as various prior choices for divergence times (birth-death, Dirichlet, uniform), in a common Bayesian statistical framework. We also propose a new autocorrelated model, called the "CIR" process, with well-defined stationary properties. We assess the relative fitness of these models and priors, when applied to 3 different protein data sets from eukaryotes, vertebrates, and mammals, by computing Bayes factors using a numerical method called thermodynamic integration. We find that the 2 autocorrelated models, CIR and lognormal, have a similar fit and clearly outperform uncorrelated models on all 3 data sets. In contrast, the optimal choice for the divergence time prior is more dependent on the data investigated. Altogether, our results provide useful guidelines for model choice in the field of molecular dating while opening the way to more extensive model comparisons.
Molecular Biology and Evolution 01/2008; 24(12):2669-80. · 5.55 Impact Factor
-
ABSTRACT: Traditionally, phylogenetic analyses over many genes combine data into a contiguous block. Under this concatenated model, all genes are assumed to evolve at the same rate. However, it is clear that genes evolve at very different rates and that accounting for this rate heterogeneity is important if we are to accurately infer phylogenies from heterogeneous multigene data sets. There remain open questions regarding how best to incorporate gene rate parameters into phylogenetic models and which properties of real data correlate with improved fit over the concatenated model. In this study, two methods of accounting for gene rate heterogeneity are compared: the n-parameter method, which allows for each of the n gene partitions to have a gene rate parameter, and the alpha-parameter method, which fits a distribution to the gene rates. Results demonstrate that the n-parameter method is both computationally faster and in general provides a better fit over the concatenated model than the alpha-parameter method. Furthermore, improved model fit over the concatenated model is highly correlated with the presence of a gene with a slow relative rate of evolution.
Systematic Biology 05/2007; 56(2):194-205. · 10.23 Impact Factor
-
ABSTRACT: BACKGROUND: Neighbor-Net is a novel method for phylogenetic analysis that is currently being widely used in areas such as virology, bacteriology, and plant evolution. Given an input distance matrix, Neighbor-Net produces a phylogenetic network, a generalization of an evolutionary or phylogenetic tree which allows the graphical representation of conflicting phylogenetic signals. RESULTS: In general, any network construction method should not depict more conflict than is found in the data, and, when the data is fitted well by a tree, the method should return a network that is close to this tree. In this paper we provide a formal proof that Neighbor-Net satisfies both of these requirements so that, in particular, Neighbor-Net is statistically consistent on circular distances.
Algorithms for Molecular Biology 02/2007; 2:8. · 1.35 Impact Factor
-
ABSTRACT: Recombination is a powerful evolutionary force that merges historically distinct genotypes. But the extent of recombination within many organisms is unknown, and even determining its presence within a set of homologous sequences is a difficult question. Here we develop a new statistic, phi(w), that can be used to test for recombination. We show through simulation that our test can discriminate effectively between the presence and absence of recombination, even in diverse situations such as exponential growth (star-like topologies) and patterns of substitution rate correlation. A number of other tests, Max chi2, NSS, a coalescent-based likelihood permutation test (from LDHat), and correlation of linkage disequilibrium (both r^2 and |D'|) with distance, all tend to underestimate the presence of recombination under strong population growth. Moreover, both Max chi2 and NSS falsely infer the presence of recombination under a simple model of mutation rate correlation. Results on empirical data show that our test can be used to detect recombination between closely as well as distantly related samples, regardless of the suspected rate of recombination. The results suggest that phi(w) is one of the best approaches to distinguish recurrent mutation from recombination in a wide variety of circumstances.
Genetics 05/2006; 172(4):2665-81. · 4.01 Impact Factor
-
ABSTRACT: We propose a continuous model for variation in the evolutionary rate across sites and over the phylogenetic tree. We derive exact transition probabilities of substitutions under this model. Changes in rate are modelled using the CIR process, a diffusion widely used in financial applications. The model directly extends the standard gamma distributed rates across site model, with one additional parameter governing changes in rate down the tree. The parameters of the model can be estimated directly from two well-known statistics: the index of dispersion and the gamma shape parameter of the rates across sites model. The CIR model can be readily incorporated into probabilistic models for sequence evolution. We provide here an exact formula for the likelihood of a three-taxon tree. The likelihoods of larger trees can be evaluated using Monte-Carlo methods.
Mathematical Biosciences 03/2006; 199(2):216-33. · 1.54 Impact Factor
-
ABSTRACT: In phylogenetic analyses with combined multigene or multiprotein data sets, accounting for differing evolutionary dynamics at different loci is essential for accurate tree prediction. Existing maximum likelihood (ML) and Bayesian approaches are computationally intensive. We present an alternative approach that is orders of magnitude faster. The method, Distance Rates (DistR), estimates rates based upon distances derived from gene/protein sequence data. Simulation studies indicate that this technique is accurate compared with other methods and robust to missing sequence data. The DistR method was applied to a fungal mitochondrial data set, and the rate estimates compared well to those obtained using existing ML and Bayesian approaches. Inclusion of the protein rates estimated from the DistR method into the ML calculation of trees as a branch length multiplier resulted in a significantly improved fit as measured by the Akaike Information Criterion (AIC). Furthermore, bootstrap support for the ML topology was significantly greater when protein rates were used, and some evident errors in the concatenated ML tree topology (i.e., without protein rates) were corrected. [Bayesian credible intervals; DistR method; multigene phylogeny; PHYML; rate heterogeneity.].
Systematic Biology 01/2006; 54(6):900-15. · 10.23 Impact Factor
-
ABSTRACT: Standard likelihood-based frameworks in phylogenetics consider the process of evolution of a sequence site by site. Assuming that sites evolve independently greatly simplifies the required calculations. However, this simplification is known to be incorrect in many cases. Here, a computational method that allows for general dependence between sites of a sequence is investigated. Using this method, measures acting as sequence fitness proxies can be considered over a phylogenetic tree. In this work, a set of statistically derived amino acid pairwise potentials, developed in the context of protein threading, is used to account for what we call the structural fitness of a sequence. We describe a model combining statistical potentials with an empirical amino acid substitution matrix. We propose such a combination as a useful way of capturing the complexity of protein evolution. Finally, we outline features of the model using three datasets and show the approach's sensitivity to different tree topologies.
Gene 04/2005; 347(2):207-17. · 2.34 Impact Factor
-
ABSTRACT: Although most often used to represent phylogenetic uncertainty, network methods are also potentially useful for describing the phylogenetic complexity expected to characterize recent species radiations. One network method with particular advantages in this context is split decomposition. However, in its standard implementation this approach is limited by a conservative criterion for branch length estimation. Here we extend the utility of split decomposition by introducing a least squares optimization technique for correcting branch lengths that may be underestimated by the standard implementation. This optimization of branch lengths is generally expected to improve divergence time estimates calculated from splits graphs. We illustrate the effect of least squares optimization on such estimates using the Australasian Myosotis and the Hawaiian silversword alliance as examples. We also discuss the biogeographic interpretation and limitations of splits graphs.
Systematic Biology 03/2005; 54(1):56-65. · 10.23 Impact Factor
-
ABSTRACT: We present Neighbor-Net, a distance-based method for constructing phylogenetic networks that is based on the Neighbor-Joining (NJ) algorithm of Saitou and Nei. Neighbor-Net provides a snapshot of the data that can guide more detailed analysis. Unlike split decomposition, Neighbor-Net scales well and can quickly produce detailed and informative networks for several hundred taxa. We illustrate the method by reanalyzing three published data sets: a collection of 110 highly recombinant Salmonella multi-locus sequence typing sequences, the 135 "African Eve" human mitochondrial sequences published by Vigilant et al., and a collection of 12 Archaeal chaperonin sequences demonstrating strong evidence for gene conversion. Neighbor-Net is available as part of the SplitsTree4 software package.
Molecular Biology and Evolution 03/2004; 21(2):255-65. · 5.55 Impact Factor
-
Systematic Biology 01/2004; 52(6):865-8. · 10.23 Impact Factor
-
Algorithms in Bioinformatics, Second International Workshop, WABI 2002, Rome, Italy, September 17-21, 2002, Proceedings; 01/2002
-
01/2000
-
Journal of Computational Biology. 01/2000; 7:521-535.