Article

Identifying the Rooted Species Tree from the Distribution of Unrooted Gene Trees under the Coalescent

Department of Mathematics and Statistics, University of Alaska Fairbanks, PO Box 756660, Fairbanks, AK 99775, USA.
Journal of Mathematical Biology (Impact Factor: 1.85). 06/2011; 62(6):833-62. DOI: 10.1007/s00285-010-0355-7
Source: PubMed

ABSTRACT

Gene trees are evolutionary trees representing the ancestry of genes sampled from multiple populations. Species trees represent populations of individuals, each with many genes, splitting into new populations or species. The coalescent process, which models ancestry of gene copies within populations, is often used to model the probability distribution of gene trees given a fixed species tree. This multispecies coalescent model provides a framework for phylogeneticists to infer species trees from gene trees using maximum likelihood or Bayesian approaches. Because the coalescent models a branching process over time, all trees are typically assumed to be rooted in this setting. Often, however, gene trees inferred by traditional phylogenetic methods are unrooted. We investigate probabilities of unrooted gene trees under the multispecies coalescent model. We show that when there are four species with one gene sampled per species, the distribution of unrooted gene tree topologies identifies the unrooted species tree topology and some, but not all, information in the species tree edges (branch lengths). The location of the root on the species tree is not identifiable in this situation. However, for five or more species with one gene sampled per species, we show that the distribution of unrooted gene tree topologies identifies the rooted species tree topology and all its internal branch lengths. The length of any pendant branch leading to a leaf of the species tree is also identifiable for any species from which more than one gene is sampled.
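
The four-species claim can be illustrated numerically. The sketch below is not taken from the paper's text; it uses the standard exchangeability argument for one sampled lineage per species, with branch lengths in coalescent units. For a rooted caterpillar tree (((a,b):x,c):y,d), lineages a and b either coalesce on the branch of length x or enter the next population exchangeable with c, so the unrooted gene tree ab|cd has probability 1 - (2/3)exp(-x) regardless of y. For a balanced tree ((a,b):x,(c,d):y), the same reasoning gives 1 - (2/3)exp(-(x+y)). The function names are illustrative assumptions.

```python
import math


def quartet_probs(t):
    """Unrooted quartet distribution for an effective internal length t.

    Returns (P(ab|cd), P(ac|bd), P(ad|bc)), where ab|cd matches the
    species tree. Derived from the exchangeability argument described
    in the lead-in; branch lengths are in coalescent units.
    """
    minor = math.exp(-t) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)


def caterpillar(x, y):
    """Rooted species tree (((a,b):x,c):y,d).

    The upper internal branch y does not affect the unrooted
    gene tree distribution, so it is deliberately unused.
    """
    return quartet_probs(x)


def balanced(x, y):
    """Rooted species tree ((a,b):x,(c,d):y).

    Only the sum x + y enters the unrooted distribution.
    """
    return quartet_probs(x + y)


if __name__ == "__main__":
    # Two differently rooted species trees, same unrooted distribution.
    print(caterpillar(1.0, 0.3))
    print(balanced(0.4, 0.6))
```

Under these assumptions, a caterpillar tree with x = 1.0 and a balanced tree with x + y = 1.0 yield the same three probabilities, which is one way to see why the root position, and part of the branch-length information, cannot be recovered from unrooted four-taxon gene tree frequencies alone.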

Available from: James H Degnan
    • "They have appeared in empirical investigations of the gene tree topologies likely to be produced along the branches of a given species tree (Rosenberg and Tao, 2008). They are a component of mathematical proofs that concern properties of evolutionary models of gene trees conditional on species trees (Allman et al., 2011; Than and Rosenberg, 2011). Coalescent histories also arise in studying state spaces for "
    ABSTRACT: Coalescent histories are combinatorial structures that describe for a given gene tree and species tree the possible lists of branches of the species tree on which the gene tree coalescences take place. Properties of the number of coalescent histories for gene trees and species trees affect a variety of probabilistic calculations in mathematical phylogenetics. Exact and asymptotic evaluations of the number of coalescent histories, however, are known only in a limited number of cases. Here we introduce a particular family of species trees, the \emph{lodgepole} species trees $(\lambda_n)_{n\geq 0}$, in which tree $\lambda_n$ has $m=2n+1$ taxa. We determine the number of coalescent histories for the lodgepole species trees, in the case that the gene tree matches the species tree, showing that this number grows with $m!!$ in the number of taxa $m$. This computation demonstrates the existence of tree families in which the growth in the number of coalescent histories is faster than exponential. Further, it provides a substantial improvement on the lower bound for the ratio of the largest number of matching coalescent histories to the smallest number of matching coalescent histories for trees with $m$ taxa, increasing a previous bound of $(\sqrt{\pi} / 32)[(5m-12)/(4m-6)] m \sqrt{m}$ to $[ \sqrt{m-1}/(4 \sqrt{e}) ]^{m}$. We discuss the implications of our enumerative results for phylogenetic computations.
    Article · Mar 2015 · Journal of Computational Biology: a journal of computational molecular cell biology
    • "For this reason, numerous methods based on the coalescent process have recently been proposed for estimation of the phylogenetic species tree given multi-locus DNA sequence data (e.g., BEST [25], *BEAST [17], STEM [22], MP-EST [26], SNAPP [8]). Use of these methods assumes that the phylogenetic species tree can be identified from DNA sequence data at the leaves of the tree, but this has not formally been established (note, however, that Allman et al. (2011) [2] have established identifiability given a collection of gene tree topologies and Allman et al. (2011) [7] have considered identifiability given clade probabilities). Here, we prove that the unrooted topology of the phylogenetic species tree is identifiable given observed data at the leaves of the tree that are assumed to have arisen from the coalescent process. "
    ABSTRACT: The inference of the evolutionary history of a collection of organisms is a problem of fundamental importance in evolutionary biology. The abundance of DNA sequence data arising from genome sequencing projects has led to significant challenges in the inference of these phylogenetic relationships. Among these challenges is the inference of the evolutionary history of a collection of species based on sequence information from several distinct genes sampled throughout the genome. It is widely accepted that each individual gene has its own phylogeny, which may not agree with the species tree. Many possible causes of this gene tree incongruence are known. The best studied is incomplete lineage sorting, which is commonly modeled by the coalescent process. Numerous methods based on the coalescent process have been proposed for estimation of the phylogenetic species tree given multi-locus DNA sequence data. However, use of these methods assumes that the phylogenetic species tree can be identified from DNA sequence data at the leaves of the tree, although this has not been formally established. We prove that the unrooted topology of the $n$-leaf phylogenetic species tree is generically identifiable given observed data at the leaves of the tree that are assumed to have arisen from the coalescent process with time-reversible substitution.
    Article · Jun 2014 · Journal of Theoretical Biology
    • "The CF ranges from 0.0 to 1.0. BUCKy implements a consensus method based on unrooted quartets and which consistently identifies the species tree [62]. We ran BUCKy at several levels of α to evaluate how much effect choice of this parameter value would have on the results. "
    ABSTRACT: The controversy surrounding the potential impact of birds in spirochete transmission dynamics and their capacity to serve as a reservoir has existed for a long time. The majority of analyzed bird species are able to infect larval ticks with Borrelia. Dispersal of infected ticks due to bird migration is a key to the establishment of new foci of Lyme borreliosis. The dynamics of infection in birds supports the mixing of different species, the horizontal exchange of genetic information, and appearance of recombinant genotypes. Four Borrelia burgdorferi sensu lato strains were cultured from Ixodes minor larvae and four strains were isolated from Ixodes minor nymphs collected from a single Carolina Wren (Thryothorus ludovicianus). A multilocus sequence analysis that included 16S rRNA, a 5S-23S intergenic spacer region, a 16S-23S internal transcribed spacer, flagellin, p66, and ospC separated 8 strains into 3 distinct groups. Additional multilocus sequence typing of 8 housekeeping genes, clpA, clpX, nifS, pepX, pyrG, recG, rplB, and uvrA was used to resolve the taxonomic status of bird-associated strains. Results of analysis of 14 genes confirmed that the level of divergence among strains is significantly higher than what would be expected for strains within a single species. The presence of cross-species recombination was revealed: Borrelia burgdorferi sensu stricto housekeeping gene nifS was incorporated into homologous locus of strain, previously assigned to B. americana. Genetically diverse Borrelia strains are often found within the same tick or same vertebrate host, presenting a wide opportunity for genetic exchange. We report the cross-species recombination that led to incorporation of a housekeeping gene from the B. burgdorferi sensu stricto strain into a homologous locus of another bird-associated strain. 
    Our results support the hypothesis that recombination maintains a majority of sequence polymorphism within Borrelia populations because of the re-assortment of pre-existing sequence variants. Even if our findings of broad genetic diversity among 8 strains cultured from ticks that fed on a single bird could be the exception rather than the rule, they support the theory that the diversity and evolution of LB spirochetes is driven mainly by the host.
    Article · Jan 2014 · Parasites & Vectors