Article

Squirrel: Reconstructing Semi-directed Phylogenetic Level-1 Networks from Four-Leaved Networks or Sequence Alignments


Abstract

With the increasing availability of genomic data, biologists aim to find more accurate descriptions of evolutionary histories influenced by secondary contact, where diverging lineages reconnect before diverging again. Such reticulate evolutionary events can be more accurately represented in phylogenetic networks than in phylogenetic trees. Since the root location of phylogenetic networks cannot be inferred from biological data under several evolutionary models, we consider semi-directed (phylogenetic) networks: partially directed graphs without a root in which the directed edges represent reticulate evolutionary events. By specifying a known outgroup, the rooted topology can be recovered from such networks. We introduce the algorithm Squirrel (Semi-directed Quarnet-based Inference to Reconstruct Level-1 Networks) which constructs a semi-directed level-1 network from a full set of quarnets (four-leaf semi-directed networks). Our method also includes a heuristic to construct such a quarnet set directly from sequence alignments. We demonstrate Squirrel's performance through simulations and on real sequence data sets, the largest of which contains 29 aligned sequences close to 1.7 Mbp long. The resulting networks are obtained on a standard laptop within a few minutes. Lastly, we prove that Squirrel is combinatorially consistent: given a full set of quarnets coming from a triangle-free semi-directed level-1 network, it is guaranteed to reconstruct the original network. Squirrel is implemented in Python, has an easy-to-use graphical user interface that takes sequence alignments or quarnets as input, and is freely available at https://github.com/nholtgrefe/squirrel
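To give a sense of the input size, the sketch below counts the four-leaf subsets of a taxon set, assuming that a "full set of quarnets" contains exactly one quarnet per four-leaf subset. The helper name `four_leaf_subsets` is hypothetical and is not part of Squirrel's interface.

```python
from itertools import combinations
from math import comb

def four_leaf_subsets(taxa):
    """Enumerate every four-leaf subset of the taxa (hypothetical helper).

    Assumption: a full quarnet set holds one quarnet for each such subset.
    """
    return list(combinations(sorted(taxa), 4))

# Number of four-leaf subsets for the largest data set mentioned (29 sequences).
n_subsets_29 = comb(29, 4)

# Small explicit enumeration for five taxa.
small = four_leaf_subsets(["a", "b", "c", "d", "e"])
print(n_subsets_29, len(small))
```

Under this assumption the number of quarnets grows as a quartic in the number of taxa, which is one reason a fast per-quarnet heuristic matters.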


... Our upper-bound results suggest that the networks generated by tools like SNAQ [22], PhyNEST [13], and Squirrel [9] with a constant network level can be efficiently decomposed into trees with small treewidth. As the level of generated networks grows [18], we may still expect the treewidth to stay low and treewidth-parametrized algorithms to be efficient on higher-level networks. ...
Preprint
Phylogenetic networks are directed acyclic graphs that depict the genomic evolution of related taxa. Reticulation nodes in such networks (nodes with more than one parent) represent reticulate evolutionary events, such as recombination, reassortment, hybridization, or horizontal gene transfer. Typically, the complexity of a phylogenetic network is expressed in terms of its level, i.e., the maximum number of edges that are required to be removed from each biconnected component of the phylogenetic network to turn it into a tree. Here, we study the relationship between the level of a phylogenetic network and another popular graph complexity parameter: treewidth. We show a $\frac{k+3}{2}$ upper bound on the treewidth of level-k phylogenetic networks and an improved $(1/3 + \delta)k$ upper bound for large k. These bounds imply that many computational problems on phylogenetic networks, such as the small parsimony problem or some variants of phylogenetic diversity maximization, are polynomial-time solvable on level-k networks with constant k. Our first bound is applicable to any k, and it allows us to construct an explicit tree decomposition of width $\frac{k+3}{2}$ that can be used to analyze phylogenetic networks generated by tools like SNAQ that guarantee bounded network level. Finally, we show a k/13 lower bound on the maximum treewidth among level-k phylogenetic networks for large enough k based on expander graphs.
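The definition of level given above can be made concrete with a minimal sketch. It ignores edge directions, assumes a simple graph (no parallel edges), and uses the fact that a connected block with E edges and V vertices needs E - V + 1 edge removals to become a tree. The function names are illustrative and not taken from any of the tools mentioned.

```python
from collections import defaultdict

def biconnected_components(edges):
    """Return the blocks of an undirected simple graph as lists of edges."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    depth, low, stack, blocks = {}, {}, [], []

    def dfs(u, parent):
        low[u] = depth[u]
        for v in adj[u]:
            if v == parent:
                continue  # safe only because parallel edges are excluded
            if v not in depth:
                depth[v] = depth[u] + 1
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= depth[u]:
                    # u separates the subtree of v: pop one complete block
                    block = []
                    while True:
                        edge = stack.pop()
                        block.append(edge)
                        if edge == (u, v):
                            break
                    blocks.append(block)
            elif depth[v] < depth[u]:
                # back edge to an ancestor
                stack.append((u, v))
                low[u] = min(low[u], depth[v])

    for start in list(adj):
        if start not in depth:
            depth[start] = 0
            dfs(start, None)
    return blocks

def network_level(edges):
    """Maximum, over all blocks, of the edges to remove to obtain a tree."""
    level = 0
    for block in biconnected_components(edges):
        vertices = {x for edge in block for x in edge}
        level = max(level, len(block) - len(vertices) + 1)
    return level

# One 4-cycle with a leaf hanging off each cycle vertex.
one_cycle = [(1, 2), (2, 3), (3, 4), (4, 1),
             (1, "a"), (2, "b"), (3, "c"), (4, "d")]
# Two cycles sharing an edge.
two_cycles = [(1, 2), (2, 3), (3, 1), (2, 4), (3, 4)]
print(network_level(one_cycle), network_level(two_cycles))
```

The recursive search is adequate for networks of the sizes discussed here; very large inputs would call for an iterative traversal.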
Preprint
We consider the fundamental question of which evolutionary histories can potentially be reconstructed from sufficiently long DNA sequences, by studying the identifiability of phylogenetic networks from data generated under Markov models of DNA evolution. This topic has previously been studied for phylogenetic trees and for phylogenetic networks that are level-1, which means that reticulate evolutionary events were restricted to be independent in the sense that the corresponding cycles in the network are non-overlapping. In this paper, we study the identifiability of phylogenetic networks from DNA sequence data under Markov models of DNA evolution for more general classes of networks that may contain pairs of tangled reticulations. Our main result is generic identifiability, under the Jukes-Cantor model, of binary semi-directed level-2 phylogenetic networks that satisfy two additional conditions called triangle-free and strongly tree-child. We also consider level-1 networks and show stronger identifiability results for this class than what was known previously. In particular, we show that the number of reticulations in a level-1 network is identifiable under the Jukes-Cantor model. Moreover, we prove general identifiability results that do not restrict the network level at all and hold for the Jukes-Cantor as well as for the Kimura-2-Parameter model. We show that any two binary semi-directed phylogenetic networks are distinguishable if they do not display exactly the same 4-leaf subtrees, called quartets. This has direct consequences regarding the blobs of a network, which are its reticulated components. We show that the tree-of-blobs of a network, the global branching structure of the network, is always identifiable, as well as the circular ordering of the subnetworks around each blob, for networks in which edges do not cross and taxa are on the outside.
Article
In the evolution landscape of HIV, the coexistence of multiple subtypes has led to new, complex recombinants, posing public health challenges. CRF55_01B, first identified among MSM in Shenzhen, China, has spread rapidly across China. In this study, 47 plasma samples from newly diagnosed HIV-1 CRF55_01B patients in Shenzhen, of which the genotype was only identified by the routine HIV drug resistance test, were collected. Multiple gene regions were acquired using Sanger and next-generation sequencing methods, followed by the phylogenetic reconstruction, recombination breakpoint scanning, Bayesian molecular clock, and the prediction of coreceptors. From 47 samples, we found seven new unique recombinants formed by CRF55_01B and CRF07_BC, which shared similar breakpoints in certain gene regions and primarily utilized CCR5 receptors. All of the most recent common ancestors of subregions for these recombinants were estimated to be later than CRF55_01B and CRF07_BC, potentially suggesting they are the third-generation recombinants formed by CRF55_01B and CRF07_BC as parents. The continuous emergence of new recombinants highlights the increasing complexity of circulating strains in Shenzhen, and also suggests that subtype analysis using partial pol gene may lead to an overestimation of the major subtype strains and an underestimation of new complex HIV recombinants. Consequently, to effectively address and mitigate the complex HIV epidemic, there is an urgent need for expanded monitoring and the optimization of testing methodologies.
Article
The tree of blobs of a species network shows only the tree-like aspects of relationships of taxa on a network, omitting information on network substructures where hybridization or other types of lateral transfer of genetic information occur. By isolating such regions of a network, inference of the tree of blobs can serve as a starting point for a more detailed investigation, or indicate the limit of what may be inferrable without additional assumptions. Building on our theoretical work on the identifiability of the tree of blobs from gene quartet distributions under the Network Multispecies Coalescent model, we develop an algorithm, TINNiK, for statistically consistent tree of blobs inference. We provide examples of its application to both simulated and empirical datasets, utilizing an implementation in the MSCquartets 2.0 R package.
Preprint
Inference of a species network from genomic data remains a difficult problem, with recent progress mostly limited to the level-1 case. However, inference of the Tree of Blobs of a network, showing only the network's cut edges, can be performed for any network by TINNiK, suggesting a divide-and-conquer approach to network inference where the tree's multifurcations are individually resolved to give more detailed structure. Here we develop a method, NANUQ⁺, to quickly perform such a level-1 resolution. Viewed as part of the NANUQ pipeline for fast level-1 inference, this gives tools for both understanding when the level-1 assumption is likely to be met and for exploring all highly-supported resolutions to cycles.
Article
Hybridization has been recognized to play important roles in evolution; however, studies of its genetic consequences in vertebrates still lag behind due to the lack of appropriate experimental systems. Fish of the genus Xiphophorus are proposed to have evolved with multiple ancient and ongoing hybridization events. They have served as an informative research model in evolutionary biology and in biomedical research on human disease for more than a century. Here, we provide the complete genomic resource including annotations for all 26 described Xiphophorus species and three undescribed taxa and resolve all uncertain phylogenetic relationships. We investigate the molecular evolution of genes related to cancers such as melanoma and for the genetic control of puberty timing, focusing on genes that are predicted to be involved in pre- and postzygotic isolation and thus affect hybridization. We discovered dramatic size-variation of some gene families. These persisted despite reticulate evolution, rapid speciation and short divergence time. Finally, we clarify the hybridization history in the entire genus, settling the disputed hybridization history of two Southern swordtails. Our comparative genomic analyses revealed hybridization ancestries that are manifested in the mosaic fused genomes and show that hybridization often preceded speciation.
Article
Motivation: The abundance of gene flow in the Tree of Life challenges the notion that evolution can be represented with a fully bifurcating process which cannot capture important biological realities like hybridization, introgression, or horizontal gene transfer. Coalescent-based network methods are increasingly popular, yet not scalable for big data, because they need to perform a heuristic search in the space of networks as well as numerical optimization that can be NP-hard. Here, we introduce a novel method to reconstruct phylogenetic networks based on algebraic invariants. While there is a long tradition of using algebraic invariants in phylogenetics, our work is the first to define phylogenetic invariants on concordance factors (frequencies of four-taxon splits in the input gene trees) to identify level-1 phylogenetic networks under the multispecies coalescent model. Results: Our novel hybrid detection methodology is optimization-free as it only requires the evaluation of polynomial equations, and as such, it bypasses the traversal of network space, yielding a computational speed at least 10 times faster than the fastest-to-date network methods. We illustrate our method’s performance on simulated and real data from the genus Canis. Availability and implementation: We present an open-source publicly available Julia package PhyloDiamond.jl available at https://github.com/solislemuslab/PhyloDiamond.jl with broad applicability within the evolutionary community.
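As a small illustration of the concordance factors described above, the sketch below computes, for a single set of four taxa, the frequency of each four-taxon split among gene trees. The input encoding (one split string per gene tree) and the function name are assumptions made for this example and do not reflect the PhyloDiamond.jl interface.

```python
from collections import Counter

def concordance_factors(gene_quartets):
    """Relative frequency of each four-taxon split among the gene trees.

    gene_quartets: one split label per gene tree, e.g. "ab|cd"
    (an illustrative encoding, not a format used by any specific tool).
    """
    counts = Counter(gene_quartets)
    total = sum(counts.values())
    return {topology: count / total for topology, count in counts.items()}

# Ten hypothetical gene trees restricted to taxa a, b, c, d.
observed = ["ab|cd"] * 6 + ["ac|bd"] * 3 + ["ad|bc"] * 1
cf = concordance_factors(observed)
print(cf)
```

The three frequencies for a four-taxon set always sum to one, so each set contributes two free numbers to any downstream analysis.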
Preprint
A core goal of phylogenomics is to determine the evolutionary history of a set of species from biological sequence data. Phylogenetic networks are able to describe more complex evolutionary phenomena than phylogenetic trees but are more difficult to accurately reconstruct. Recently, there has been growing interest in developing methods to infer semi-directed phylogenetic networks. As computing such networks can be computationally intensive, one approach to building such networks is to puzzle together smaller networks. Thus, it is essential to have robust methods for inferring semi-directed phylogenetic networks on small numbers of taxa. In this paper, we investigate an algebraic method for performing phylogenetic network inference from nucleotide sequence data on 4-leaved semi-directed phylogenetic networks by analysing the distribution of leaf-pattern probabilities. On simulated data, we found that we can identify semi-directed networks with high accuracy as sequences approach 10 Mbp in length, and that we are able to use our approach to identify tree-like evolution and determine the underlying tree. We also applied our approach to published transcriptome data from swordtail fish to compare its performance with a pseudolikelihood method for inferring semi-directed networks.
Article
Genome-scale data and the development of novel statistical phylogenetic approaches have greatly aided the reconstruction of a broad sketch of the tree of life and resolved many of its branches. However, incongruence - the inference of conflicting evolutionary histories - remains pervasive in phylogenomic data, hampering our ability to reconstruct and interpret the tree of life. Biological factors, such as incomplete lineage sorting, horizontal gene transfer, hybridization, introgression, recombination and convergent molecular evolution, can lead to gene phylogenies that differ from the species tree. In addition, analytical factors, including stochastic, systematic and treatment errors, can drive incongruence. Here, we review these factors, discuss methodological advances to identify and handle incongruence, and highlight avenues for future research.
Article
Phylogenetic networks extend phylogenetic trees to model non-vertical inheritance, by which a lineage inherits material from multiple parents. The computational complexity of estimating phylogenetic networks from genome-wide data with likelihood-based methods limits the size of networks that can be handled. Methods based on pairwise distances could offer faster alternatives. We study here the information that average pairwise distances contain on the underlying phylogenetic network, by characterizing local and global features that can or cannot be identified. For general networks, we clarify that the root and edge lengths adjacent to reticulations are not identifiable, and then focus on the class of zipped-up semidirected networks. We provide a criterion to swap subgraphs locally, such as 3-cycles, resulting in indistinguishable networks. We propose the “distance split tree”, which can be constructed from pairwise distances, and prove that it is a refinement of the network’s tree of blobs, capturing the tree-like features of the network. For level-1 networks, this distance split tree is equal to the tree of blobs refined to separate polytomies from blobs, and we prove that the mixed representation of the network is identifiable. The information loss is localized around 4-cycles, for which the placement of the reticulation is unidentifiable. The mixed representation combines split edges for 4-cycles, regular tree and hybrid edges from the semidirected network, and edge parameters that encode all information identifiable from average pairwise distances.
Article
Background: Although originally thought to evolve clonally, studies have revealed that most bacteria exchange DNA. However, it remains unclear to what extent gene flow shapes the evolution of bacterial genomes and maintains the cohesion of species. Results: Here, we analyze the patterns of gene flow within and between >2600 bacterial species. Our results show that fewer than 10% of bacterial species are truly clonal, indicating that purely asexual species are rare in nature. We further demonstrate that the taxonomic criterion of ~95% genome sequence identity routinely used to define bacterial species does not accurately represent a level of divergence that imposes an effective barrier to gene flow across bacterial species. Interruption of gene flow can occur at various sequence identities across lineages, generally from 90 to 98% genome identity. This likely explains why a ~95% genome sequence identity threshold has empirically been judged as a good approximation to define bacterial species. Our results support a universal mechanism where the availability of identical genomic DNA segments required to initiate homologous recombination is the primary determinant of gene flow and species boundaries in bacteria. We show that these barriers of gene flow remain porous since many distinct species maintain some level of gene flow, similar to introgression in sexual organisms. Conclusions: Overall, bacterial evolution and speciation are likely shaped by similar forces driving the evolution of sexual organisms. Our findings support a model where the interruption of gene flow—although not necessarily the initial cause of speciation—leads to the establishment of permanent and irreversible species borders.
Article
Phylogenomic analyses routinely estimate species trees using methods that account for gene tree discordance. However, the most scalable species tree inference methods, which summarize independently inferred gene trees to obtain a species tree, are sensitive to hard-to-avoid errors introduced in the gene tree estimation step. This dilemma has created much debate on the merits of concatenation versus summary methods and practical obstacles to using summary methods more widely and to the exclusion of concatenation. The most successful attempt at making summary methods resilient to noisy gene trees has been contracting low support branches from the gene trees. Unfortunately, this approach requires arbitrary thresholds and poses new challenges. Here, we introduce threshold-free weighting schemes for the quartet-based species tree inference, the metric used in the popular method ASTRAL. By reducing the impact of quartets with low support or long terminal branches (or both), weighting provides stronger theoretical guarantees and better empirical performance than the unweighted ASTRAL. Our simulations show that weighting improves accuracy across many conditions and reduces the gap with concatenation in conditions with low gene tree discordance and high noise. On empirical data, weighting improves congruence with concatenation and increases support. Together, our results show that weighting, enabled by a new optimization algorithm we introduce, improves the utility of summary methods and can reduce the incongruence often observed across analytical pipelines.
Article
Phylogenetic networks can represent evolutionary events that cannot be described by phylogenetic trees. These networks are able to incorporate reticulate evolutionary events such as hybridization, introgression, and lateral gene transfer. Recently, network-based Markov models of DNA sequence evolution have been introduced along with model-based methods for reconstructing phylogenetic networks. For these methods to be consistent, the network parameter needs to be identifiable from data generated under the model. Here, we show that the semi-directed network parameter of a triangle-free, level-1 network model with any fixed number of reticulation vertices is generically identifiable under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints.
Article
Backtracking a pandemic: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) may have had a history of abortive human infections before a variant established a productive enough infection to create a transmission chain with pandemic potential. Therefore, the Wuhan cluster of infections identified in late December of 2019 may not have represented the initiating event. Pekar et al. used genome data collected from the early cases of the COVID-19 pandemic combined with molecular clock inference and epidemiological simulation to estimate when the most successful variant gained a foothold in humans. This analysis pushes human-to-human transmission back to mid-October to mid-November of 2019 in Hubei Province, China, with a likely short interval before epidemic transmission was initiated. Science, this issue p. 412
Article
Our understanding of the evolutionary history of primates is undergoing continual revision due to ongoing genome sequencing efforts. Bolstered by growing fossil evidence, these data have led to increased acceptance of once controversial hypotheses regarding phylogenetic relationships, hybridization and introgression, and the biogeographical history of primate groups. Among these findings is a pattern of recent introgression between species within all major primate groups examined to date, though little is known about introgression deeper in time. To address this and other phylogenetic questions, here, we present new reference genome assemblies for 3 Old World monkey (OWM) species: Colobus angolensis ssp. palliatus (the black and white colobus), Macaca nemestrina (southern pig-tailed macaque), and Mandrillus leucophaeus (the drill). We combine these data with 23 additional primate genomes to estimate both the species tree and individual gene trees using thousands of loci. While our species tree is largely consistent with previous phylogenetic hypotheses, the gene trees reveal high levels of genealogical discordance associated with multiple primate radiations. We use strongly asymmetric patterns of gene tree discordance around specific branches to identify multiple instances of introgression between ancestral primate lineages. In addition, we exploit recent fossil evidence to perform fossil-calibrated molecular dating analyses across the tree. Taken together, our genome-wide data help to resolve multiple contentious sets of relationships among primates, while also providing insight into the biological processes and technical artifacts that led to the disagreements in the first place.
Article
Species networks generalize the notion of species trees to allow for hybridization or other lateral gene transfer. Under the network multispecies coalescent model, individual gene trees arising from a network can have any topology, but arise with frequencies dependent on the network structure and numerical parameters. We propose a new algorithm for statistical inference of a level-1 species network under this model, from data consisting of gene tree topologies, and provide the theoretical justification for it. The algorithm is based on an analysis of quartets displayed on gene trees, combining several statistical hypothesis tests with combinatorial ideas such as a quartet-based intertaxon distance appropriate to networks, the NeighborNet algorithm for circular split systems, and the Circular Network algorithm for constructing a splits graph.
Article
The process of adaptive radiation was classically hypothesized to require isolation of a lineage from its source (no gene flow) and from related species (no competition). Alternatively, hybridization between species may generate genetic variation that facilitates adaptive radiation. Here we study haplochromine cichlid assemblages in two African Great Lakes to test these hypotheses. Greater biotic isolation (fewer lineages) predicts fewer constraints by competition and hence more ecological opportunity in Lake Bangweulu, whereas opportunity for hybridization predicts increased genetic potential in Lake Mweru. In Lake Bangweulu, we find no evidence for hybridization but also no adaptive radiation. We show that the Bangweulu lineages also colonized Lake Mweru, where they hybridized with Congolese lineages and then underwent multiple adaptive radiations that are strikingly complementary in ecology and morphology. Our data suggest that the presence of several related lineages does not necessarily prevent adaptive radiation, although it constrains the trajectories of morphological diversification. It might instead facilitate adaptive radiation when hybridization generates genetic variation, without which radiation may start much later, progress more slowly or never occur.
Article
Background: Evolutionary histories can be discordant across the genome, and such discordances need to be considered in reconstructing the species phylogeny. ASTRAL is one of the leading methods for inferring species trees from gene trees while accounting for gene tree discordance. ASTRAL uses dynamic programming to search for the tree that shares the maximum number of quartet topologies with input gene trees, restricting itself to a predefined set of bipartitions. Results: We introduce ASTRAL-III, which substantially improves the running time of ASTRAL-II and guarantees polynomial running time as a function of both the number of species (n) and the number of genes (k). ASTRAL-III limits the bipartition constraint set (X) to grow at most linearly with n and k. Moreover, it handles polytomies more efficiently than ASTRAL-II, exploits similarities between gene trees better, and uses several techniques to avoid searching parts of the search space that are mathematically guaranteed not to include the optimal tree. The asymptotic running time of ASTRAL-III in the presence of polytomies is [Formula: see text] where D=O(nk) is the sum of degrees of all unique nodes in input trees. The running time improvements enable us to test whether contracting low support branches in gene trees improves the accuracy by reducing noise. In extensive simulations, we show that removing branches with very low support (e.g., below 10%) improves accuracy while overly aggressive filtering is harmful. We observe on a biological avian phylogenomic dataset of 14K genes that contracting low support branches greatly improves results. Conclusions: ASTRAL-III is a faster version of the ASTRAL method for phylogenetic reconstruction and can scale up to 10,000 species. With ASTRAL-III, low support branches can be removed, resulting in improved accuracy.
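The optimization criterion mentioned above, the number of quartet topologies a candidate tree shares with the input gene trees, can be sketched by brute force. This is not ASTRAL's dynamic programming; it only evaluates the score of one given tree. Trees are encoded here as lists of splits (the taxa on one side of each internal edge), which is an encoding chosen for this example, and the function names are illustrative.

```python
from itertools import combinations

def quartets(splits, taxa):
    """Resolved quartet topologies displayed by an unrooted tree.

    splits: iterable of taxon sets, one side of each internal edge.
    A quartet xy|zw is displayed if some split separates {x, y} from {z, w}.
    """
    taxa = frozenset(taxa)
    found = set()
    for a, b, c, d in combinations(sorted(taxa), 4):
        pairings = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
        for left, right in pairings:
            for side in splits:
                side = frozenset(side)
                other = taxa - side
                if (set(left) <= side and set(right) <= other) or \
                   (set(left) <= other and set(right) <= side):
                    found.add(frozenset((frozenset(left), frozenset(right))))
                    break
    return found

def quartet_score(candidate, gene_trees, taxa):
    """Total number of quartet topologies shared with the gene trees."""
    displayed = quartets(candidate, taxa)
    return sum(len(displayed & quartets(g, taxa)) for g in gene_trees)

taxa = "abcde"
candidate = [{"a", "b"}, {"d", "e"}]   # ((a,b),c,(d,e))
gene_tree_1 = [{"a", "b"}, {"d", "e"}]  # same topology as the candidate
gene_tree_2 = [{"a", "c"}, {"d", "e"}]  # ((a,c),b,(d,e))
score = quartet_score(candidate, [gene_tree_1, gene_tree_2], taxa)
print(score)
```

Enumerating all four-taxon subsets is quartic in the number of taxa, so this sketch is only practical for small examples.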
Article
Reticulate species evolution, such as hybridization or introgression, is relatively common in nature. In the presence of reticulation, species relationships can be captured by a rooted phylogenetic network, and orthologous gene evolution can be modeled as bifurcating gene trees embedded in the species network. We present a Bayesian approach to jointly infer species networks and gene trees from multilocus sequence data. A novel birth-hybridization process is used as the prior for the species network, and we assume a multispecies network coalescent (MSNC) prior for the embedded gene trees. We verify the ability of our method to correctly sample from the posterior distribution, and thus to infer a species network, through simulations. To quantify the power of our method, we reanalyze two large datasets of genes from spruces and yeasts. For the three closely related spruces, we verify the previously suggested homoploid hybridization event in this clade; for the yeast data, we find extensive hybridization events. Our method is available within the BEAST 2 add-on SpeciesNetwork, and thus provides an extensible framework for Bayesian inference of reticulate evolution.
Article
Background: Although hybridization is thought to be relatively rare in animals, the raw genetic material introduced via introgression may play an important role in fueling adaptation and adaptive radiation. The butterfly genus Heliconius is an excellent system to study hybridization and introgression but most studies have focused on closely related species such as H. cydno and H. melpomene. Here we characterize genome-wide patterns of introgression between H. besckei, the only species with a red and yellow banded 'postman' wing pattern in the tiger-striped silvaniform clade, and co-mimetic H. melpomene nanna. Results: We find a pronounced signature of putative introgression from H. melpomene into H. besckei in the genomic region upstream of the gene optix, known to control red wing patterning, suggesting adaptive introgression of wing pattern mimicry between these two distantly related species. At least 39 additional genomic regions show signals of introgression as strong or stronger than this mimicry locus. Gene flow has been on-going, with evidence of gene exchange at multiple time points, and bidirectional, moving from the melpomene to the silvaniform clade and vice versa. The history of gene exchange has also been complex, with contributions from multiple silvaniform species in addition to H. besckei. We also detect a signature of ancient introgression of the entire Z chromosome between the silvaniform and melpomene/cydno clades. Conclusions: Our study provides a genome-wide portrait of introgression between distantly related butterfly species. We further propose a comprehensive and efficient workflow for gene flow identification in genomic data sets.
Article
Several phylogenomic analyses have recently demonstrated the need to account simultaneously for incomplete lineage sorting (ILS) and hybridization when inferring a species phylogeny. A maximum likelihood approach was introduced recently for inferring species phylogenies in the presence of both processes, and showed very good results. However, computing the likelihood of a model in this case is computationally infeasible except for very small data sets. Inspired by recent work on the pseudo-likelihood of species trees based on rooted triples, we introduce the pseudo-likelihood of a phylogenetic network, which, when combined with a search heuristic, provides a statistical method for phylogenetic network inference in the presence of ILS. Unlike trees, networks are not always uniquely encoded by a set of rooted triples. Therefore, even when given sufficient data, the method might converge to a network that is equivalent under rooted triples to the true one, but not the true one itself. The method is computationally efficient and has produced very good results on the data sets we analyzed. The method is implemented in PhyloNet, which is publicly available in open source. Maximum pseudo-likelihood allows for inferring species phylogenies in the presence of hybridization and ILS, while scaling to much larger data sets than is currently feasible under full maximum likelihood. The nonuniqueness of phylogenetic networks encoded by a system of rooted triples notwithstanding, the proposed method infers the correct network under certain scenarios, and provides candidates for further exploration under other criteria and/or data in other scenarios.
Article
Phylogenetic networks are necessary to represent the tree of life expanded by edges to represent events such as horizontal gene transfers, hybridizations or gene flow. Not all species follow the paradigm of vertical inheritance of their genetic material. While a great deal of research has flourished into the inference of phylogenetic trees, statistical methods to infer phylogenetic networks are still limited and under development. The main disadvantage of existing methods is a lack of scalability. Here, we present a statistical method to infer phylogenetic networks from multi-locus genetic data in a pseudolikelihood framework. Our model accounts for incomplete lineage sorting through the coalescent model, and for horizontal inheritance of genes through reticulation nodes in the network. Computation of the pseudolikelihood is fast and simple, and it avoids the burdensome calculation of the full likelihood which can be intractable with many species. Moreover, estimation at the quartet-level has the added computational benefit that it is easily parallelizable. Simulation studies comparing our method to a full likelihood approach show that our pseudolikelihood approach is much faster without compromising accuracy. We applied our method to reconstruct the evolutionary relationships among swordtails and platyfishes (Xiphophorus: Poeciliidae), which is characterized by widespread hybridizations.
Article
Motivation: Increasing attention has been devoted to estimation of species-level phylogenetic relationships under the coalescent model. However, existing methods either use summary statistics (gene trees) to carry out estimation, ignoring an important source of variability in the estimates, or involve computationally intensive Bayesian Markov chain Monte Carlo algorithms that do not scale well to whole-genome datasets. Results: We develop a method to infer relationships among quartets of taxa under the coalescent model using techniques from algebraic statistics. Uncertainty in the estimated relationships is quantified using the nonparametric bootstrap. The performance of our method is assessed with simulated data. We then describe how our method could be used for species tree inference in larger taxon samples, and demonstrate its utility using datasets for Sistrurus rattlesnakes and for soybeans. Availability and implementation: The method to infer the phylogenetic relationship among quartets is implemented in the software SVDquartets, available at www.stat.osu.edu/~lkubatko/software/SVDquartets.
Article
Background Males in some species of the genus Xiphophorus, small freshwater fishes from Meso-America, have an extended caudal fin, or sword – hence their common name “swordtails”. Longer swords are preferred by females from both sworded and – surprisingly also, non-sworded (platyfish) species that belong to the same genus. Swordtails have been studied widely as models in research on sexual selection. Specifically, the pre-existing bias hypothesis was interpreted to best explain the observed bias of females in presumed ancestral lineages of swordless species that show a preference for assumed derived males with swords over their conspecific swordless males. However, many of the phylogenetic relationships within this genus still remained unresolved. Here we construct a comprehensive molecular phylogeny of all 26 known Xiphophorus species, including the four recently described species (X. kallmani, X. mayae, X. mixei and X. monticolus). We use two mitochondrial and six new nuclear markers in an effort to increase the understanding of the evolutionary relationships among the species in this genus. Based on the phylogeny, the evolutionary history and character state evolution of the sword was reconstructed and found to have originated in the common ancestral lineage of the genus Xiphophorus and that it was lost again secondarily. Results We estimated the evolutionary relationships among all known species of the genus Xiphophorus based on the largest set of DNA markers so far. The phylogeny indicates that one of the newly described swordtail species, Xiphophorus monticolus, is likely to have arisen through hybridization since it is placed with the southern platyfish in the mitochondrial phylogeny, but with the southern swordtails in the nuclear phylogeny. Such discordance between these two types of markers is a strong indication for a hybrid origin. 
Additionally, by using a maximum likelihood approach the possession of the sexually selected sword trait is shown to be the most likely ancestral state for the genus Xiphophorus. Further, we provide a well supported estimation of the phylogenetic relationships between the previously unresolved northern swordtail groups. Conclusions This comprehensive molecular phylogeny of the entire genus Xiphophorus provides evidence that a second swordtail species, X. monticolus, arose through hybridization. Previously, we demonstrated that X. clemenciae, another southern swordtail species, arose via hybridization. These findings highlight the potential key role of hybridization in the evolution of this genus and suggest the need for further investigations into how hybridization contributes to speciation more generally.
Article
Phylogenetic relationships, divergence times, and patterns of biogeographic descent among primate species are both complex and contentious. Here, we generate a robust molecular phylogeny for 70 primate genera and 367 primate species based on a concatenation of 69 nuclear gene segments and ten mitochondrial gene sequences, most of which were extracted from GenBank. Relaxed clock analyses of divergence times with 14 fossil-calibrated nodes suggest that living Primates last shared a common ancestor 71-63 Ma, and that divergences within both Strepsirrhini and Haplorhini are entirely post-Cretaceous. These results are consistent with the hypothesis that the Cretaceous-Paleogene mass extinction of non-avian dinosaurs played an important role in the diversification of placental mammals. Previous queries into primate historical biogeography have suggested Africa, Asia, Europe, or North America as the ancestral area of crown primates, but were based on methods that were coopted from phylogeny reconstruction. By contrast, we analyzed our molecular phylogeny with two methods that were developed explicitly for ancestral area reconstruction, and find support for the hypothesis that the most recent common ancestor of living Primates resided in Asia. Analyses of primate macroevolutionary dynamics provide support for a diversification rate increase in the late Miocene, possibly in response to elevated global mean temperatures, and are consistent with the fossil record. By contrast, diversification analyses failed to detect evidence for rate-shift changes near the Eocene-Oligocene boundary even though the fossil record provides clear evidence for a major turnover event ("Grande Coupure") at this time. Our results highlight the power and limitations of inferring diversification dynamics from molecular phylogenies, as well as the sensitivity of diversification analyses to different species concepts.
Article
Comparative genomic analyses of primates offer considerable potential to define and understand the processes that mold, shape, and transform the human genome. However, primate taxonomy is both complex and controversial, with marginal unifying consensus of the evolutionary hierarchy of extant primate species. Here we provide new genomic sequence (~8 Mb) from 186 primates representing 61 (~90%) of the described genera, and we include outgroup species from Dermoptera, Scandentia, and Lagomorpha. The resultant phylogeny is exceptionally robust and illuminates events in primate evolution from ancient to recent, clarifying numerous taxonomic controversies and providing new data on human evolution. Ongoing speciation, reticulate evolution, ancient relic lineages, unequal rates of evolution, and disparate distributions of insertions/deletions among the reconstructed primate lineages are uncovered. Our resolution of the primate phylogeny provides an essential evolutionary framework with far-reaching applications including: human selection and adaptation, global emergence of zoonotic diseases, mammalian comparative genomics, primate taxonomy, and conservation of endangered species.
Article
Neandertals, the closest evolutionary relatives of present-day humans, lived in large parts of Europe and western Asia before disappearing 30,000 years ago. We present a draft sequence of the Neandertal genome composed of more than 4 billion nucleotides from three individuals. Comparisons of the Neandertal genome to the genomes of five present-day humans from different parts of the world identify a number of genomic regions that may have been affected by positive selection in ancestral modern humans, including genes involved in metabolism and in cognitive and skeletal development. We show that Neandertals shared more genetic variants with present-day humans in Eurasia than with present-day humans in sub-Saharan Africa, suggesting that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.
Article
Phylogenetic trees resulting from molecular phylogenetic analysis are available in Newick format from specialized databases, but when it comes to phylogenetic networks, which provide an explicit representation of reticulate evolutionary events such as recombination, hybridization or lateral gene transfer, the lack of a standard format for their representation has hindered the publication of explicit phylogenetic networks in the specialized literature and their incorporation in specialized databases. Two different proposals to represent phylogenetic networks exist: as a single Newick string (where each hybrid node is split once for each parent) or as a set of Newick strings (one for each hybrid node plus another one for the phylogenetic network). The standard we advocate as extended Newick format describes a whole phylogenetic network with k hybrid nodes as a single Newick string with k repeated nodes, and this representation is unique once the phylogenetic network is drawn or the ordering among children nodes is fixed. The extended Newick format facilitates phylogenetic data sharing and exchange, and also allows for the practical use of phylogenetic networks in computer programs and scripts. This standard has been recently agreed upon by a number of computational biologists, is already supported by several phylogenetic tools, and avoids the different drawbacks of using an a priori unknown number of Newick strings without any additional mark-up to represent a phylogenetic network. The adoption of the extended Newick format as a standard for the representation of phylogenetic networks is an important step towards the publication of explicit phylogenetic networks in peer-reviewed journals and their incorporation in a future database of published phylogenetic networks.
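The "k repeated nodes" idea can be illustrated with a minimal Python sketch. In an extended Newick string, a hybrid node appears once per parent under a shared tag such as `#H1`. The example string, the function name `hybrid_node_counts`, and the regular expression are illustrative assumptions rather than anything taken from the paper, and quoted labels containing `#` are not handled.

```python
import re
from collections import Counter


def hybrid_node_counts(enewick: str) -> Counter:
    """Count how often each hybrid tag (e.g. '#H1') occurs in the string.

    Assumption: tags consist of '#', optional letters, and an integer index.
    """
    return Counter(re.findall(r"#[A-Za-z]*\d+", enewick))


# Hypothetical network with one hybrid node whose child is leaf B.
example = "((A,(B)#H1),(#H1,C));"
counts = hybrid_node_counts(example)

# One distinct tag means one hybrid node; it is written twice, once per parent.
num_hybrid_nodes = len(counts)
```

A tag that occurs only once would suggest a malformed string, so a count check like this can serve as a cheap sanity test before handing the string to a real parser.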
Article
Phylogenies, i.e., the evolutionary histories of groups of taxa, play a major role in representing the interrelationships among biological entities. Many software tools for reconstructing and evaluating such phylogenies have been proposed, almost all of which assume the underlying evolutionary history to be a tree. While trees give a satisfactory first-order approximation for many families of organisms, other families exhibit evolutionary mechanisms that cannot be represented by trees. Processes such as horizontal gene transfer (HGT), hybrid speciation, and interspecific recombination, collectively referred to as reticulate evolutionary events, result in networks, rather than trees, of relationships. Various software tools have been recently developed to analyze reticulate evolutionary relationships, which include SplitsTree4, LatTrans, EEEP, HorizStory, and T-REX. In this paper, we report on the PhyloNet software package, which is a suite of tools for analyzing reticulate evolutionary relationships, or evolutionary networks, which are rooted, directed, acyclic graphs, leaf-labeled by a set of taxa. These tools can be classified into four categories: (1) evolutionary network representation: reading/writing evolutionary networks in a newly devised compact form; (2) evolutionary network characterization: analyzing evolutionary networks in terms of three basic building blocks - trees, clusters, and tripartitions; (3) evolutionary network comparison: comparing two evolutionary networks in terms of topological dissimilarities, as well as fitness to sequence evolution under a maximum parsimony criterion; and (4) evolutionary network reconstruction: reconstructing an evolutionary network from a species tree and a set of gene trees. The software package, PhyloNet, offers an array of utilities to allow for efficient and accurate analysis of evolutionary networks. 
The software package will help significantly in analyzing large data sets, as well as in studying the performance of evolutionary network reconstruction methods. Further, the software package supports the proposed eNewick format for compact representation of evolutionary networks, a feature that allows for efficient interoperability of evolutionary network software tools. Currently, all utilities in PhyloNet are invoked on the command line.
Article
Semidirected networks have received interest in evolutionary biology as the appropriate generalization of unrooted trees to networks, in which some but not all edges are directed. Yet these networks lack proper theoretical study. We define here a general class of semidirected phylogenetic networks, with a stable set of leaves, tree nodes and hybrid nodes. We prove that for these networks, if we locally choose the direction of one edge, then globally the set of directed paths starting by this edge is stable across all choices to root the network. We define an edge-based representation of semidirected phylogenetic networks and use it to define a dissimilarity between networks, which can be efficiently computed in near-quadratic time. Our dissimilarity extends the widely-used Robinson-Foulds distance on both rooted trees and unrooted trees. After generalizing the notion of tree-child networks to semidirected networks, we prove that our edge-based dissimilarity is in fact a distance on the space of tree-child semidirected phylogenetic networks.
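For context, the classical Robinson-Foulds distance that this dissimilarity extends can be sketched in a few lines of Python for unrooted trees. This is the standard tree version only, not the paper's edge-based network dissimilarity. The representation of a tree as a list of nontrivial bipartitions and the names `canonical_split` and `rf_distance` are assumptions made for illustration.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    side = frozenset(side)
    ref = min(taxa)  # any fixed taxon works; min() makes the choice deterministic
    return frozenset(taxa) - side if ref in side else side


def rf_distance(splits_a, splits_b, taxa):
    """Number of nontrivial bipartitions found in exactly one of the two trees."""
    a = {canonical_split(s, taxa) for s in splits_a}
    b = {canonical_split(s, taxa) for s in splits_b}
    return len(a ^ b)


taxa = {"A", "B", "C", "D", "E"}
tree1 = [{"A", "B"}, {"A", "B", "C"}]  # unrooted tree ((A,B),C,(D,E))
tree2 = [{"A", "C"}, {"A", "B", "C"}]  # unrooted tree ((A,C),B,(D,E))
d = rf_distance(tree1, tree2, taxa)
```

Here the two trees share the split ABC|DE and differ in one split each, so the symmetric difference has size two.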
Article
We address the problem of how to estimate a phylogenetic network when given single-nucleotide polymorphisms (i.e., SNPs, or bi-allelic markers that have evolved under the infinite sites assumption). We focus on level-1 phylogenetic networks (i.e., networks where the cycles are node-disjoint), since more complex networks are unidentifiable. We provide a polynomial time quartet-based method that we prove correct for reconstructing the semi-directed level-1 phylogenetic network N, if we are given a set of SNPs that covers all the bipartitions of N, even if the ancestral state is not known, provided that the cycles are of length at least 5; we also prove that an algorithm developed by Dan Gusfield in the Journal of Computer and System Sciences in 2005 correctly recovers semi-directed level-1 phylogenetic networks in polynomial time in this case. We present a stochastic model for DNA evolution, and we prove that the two methods (our quartet-based method and Gusfield's method) are statistically consistent estimators of the semi-directed level-1 phylogenetic network. For the case of multi-state homoplasy-free characters, we prove that our quartet-based method correctly constructs semi-directed level-1 networks under the required conditions (all cycles of length at least five), while Gusfield's algorithm cannot be used in that case. These results assume that we have access to an oracle for indicating which sites in the DNA alignment are homoplasy-free, and we show that the methods are robust, under some conditions, to oracle errors.
Article
While phylogenies have been essential in understanding how species evolve, they do not adequately describe some evolutionary processes. For instance, hybridization, a common phenomenon where interbreeding between two species leads to formation of a new species, must be depicted by a phylogenetic network, a structure that modifies a phylogenetic tree by allowing two branches to merge into one, resulting in reticulation. However, existing methods for estimating networks become computationally expensive as the dataset size and/or topological complexity increase. The lack of methods for scalable inference hampers phylogenetic networks from being widely used in practice, despite accumulating evidence that hybridization occurs frequently in nature. Here, we propose a novel method, PhyNEST (Phylogenetic Network Estimation using SiTe patterns), that estimates binary, level-1 phylogenetic networks with a fixed, user-specified number of reticulations directly from sequence data. By using the composite likelihood as the basis for inference, PhyNEST is able to use the full genomic data in a computationally tractable manner, eliminating the need to summarize the data as a set of gene trees prior to network estimation. To search network space, PhyNEST implements both hill climbing and simulated annealing algorithms. PhyNEST assumes that the data are composed of coalescent independent sites that evolve according to the Jukes-Cantor substitution model and that the network has a constant effective population size. Simulation studies demonstrate that PhyNEST is often more accurate than two existing composite likelihood summary methods (SNaQ and PhyloNet) and that it is robust to at least one form of model misspecification (assuming a less complex nucleotide substitution model than the true generating model). 
We applied PhyNEST to reconstruct the evolutionary relationships among Heliconius butterflies and Papionini primates, characterized by hybrid speciation and widespread introgression, respectively. PhyNEST is implemented in an open-source Julia package and is publicly available at https://github.com/sungsik-kong/PhyNEST.jl.
Article
Hybridization is an evolutionary phenomenon that has fascinated biologists for centuries. Prior to the advent of whole-genome sequencing, it was clear that hybridization had played a role in the evolutionary history of many extant taxa, particularly plants. The extent to which hybridization has contributed to the evolution of Earth’s biodiversity has, however, been the topic of much debate. Analyses of whole genomes are providing further insight into this evolutionary problem. Recent studies have documented ancient hybridization in a diverse array of taxa including mammals, birds, fish, fungi, and insects. Evidence for adaptive introgression is being documented in an increasing number of systems, though demonstrating the adaptive function of introgressed genomic regions remains difficult. And finally, several new homoploid hybrid speciation events have been reported. Here we review the current state of the field and specifically evaluate the additional insights gained from having access to whole-genome data and the challenges that remain with respect to understanding the evolutionary relevance and frequency of ancient hybridization, adaptive introgression, and hybrid speciation in nature.
Article
We show that many topological features of level-1 species networks are identifiable from the distribution of the gene tree quartets under the network multi-species coalescent model. In particular, every cycle of size at least 4 and every hybrid node in a cycle of size at least 5 is identifiable. This is a step toward justifying the inference of such networks which was recently implemented by Solís-Lemus and Ané. We show additionally how to compute quartet concordance factors for a network in terms of simpler networks, and explore some circumstances in which cycles of size 3 and hybrid nodes in 4-cycles can be detected.
Article
PhyloNetworks is a Julia package for the inference, manipulation, visualization and use of phylogenetic networks in an interactive environment. Inference of phylogenetic networks is done with maximum pseudolikelihood from gene trees or multi-locus sequences (SNaQ), with possible bootstrap analysis. PhyloNetworks is the first software providing tools to summarize a set of networks (from a bootstrap or posterior sample) with measures of tree edge support, hybrid edge support, and hybrid node support. Networks can be used for phylogenetic comparative analysis of continuous traits, to estimate ancestral states or do a phylogenetic regression. The software is available in open source and with documentation at https://github.com/crsl4/PhyloNetworks.jl.
Article
Genome-wide data on genetic variation are now available for multiple primate species and populations, facilitating analyses of evolutionary history within and across taxa. One emerging theme from these studies involves the central role of admixture. Genomic data sets indicate that both ancient gene flow following initial taxonomic divergence and ongoing gene flow at current species boundaries are common. These findings are of particular interest given evidence for a complex history of admixture in our own lineage, including examples of ecologically driven adaptive introgression. Like other aspects of human biology, studies of nonhuman primates thus provide both comparative context and a living model for understanding admixture dynamics in hominins. We highlight several open questions that could be addressed in future work.
Article
Phylogenetic networks are a generalization of evolutionary trees that can be used to represent reticulate processes such as hybridization and recombination. Here, we introduce a new approach called TriLoNet (Trinet Level-one Network algorithm) to construct such networks directly from sequence alignments which works by piecing together smaller phylogenetic networks. More specifically, using a bottom up approach similar to Neighbor-Joining, TriLoNet constructs level-1 networks (networks that are somewhat more general than trees) from smaller level-1 networks on three taxa. In simulations, we show that TriLoNet compares well with Lev1athan, a method for reconstructing level-1 networks from three-leaved trees. In particular, in simulations we find that Lev1athan tends to generate networks that overestimate the number of reticulate events as compared with those generated by TriLoNet. We also illustrate TriLoNet’s applicability using simulated and real sequence data involving recombination, demonstrating that it has the potential to reconstruct informative reticulate evolutionary histories. TriLoNet has been implemented in JAVA and is freely available at https://www.uea.ac.uk/computing/TriLoNet.
Article
Horizontal gene transfer (HGT) is the sharing of genetic material between organisms that are not in a parent-offspring relationship. HGT is a widely recognized mechanism for adaptation in bacteria and archaea. Microbial antibiotic resistance and pathogenicity are often associated with HGT, but the scope of HGT extends far beyond disease-causing organisms. In this Review, we describe how HGT has shaped the web of life using examples of HGT among prokaryotes, between prokaryotes and eukaryotes, and even between multicellular eukaryotes. We discuss replacement and additive HGT, the proposed mechanisms of HGT, selective forces that influence HGT, and the evolutionary impact of HGT on ancestral populations and existing populations such as the human microbiome.
Article
It has recently been concluded that phylogenomic data from 310 nuclear genes support the clade of (Amborellales, Nymphaeales) as sister to the remaining angiosperms and that shortcut coalescent phylogenetic methods outperformed concatenation for these data. We falsify both of those conclusions here by demonstrating that discrepant results between the coalescent and concatenation analyses are primarily caused by the coalescent methods applied (MP-EST and STAR) not being robust to the highly divergent and often mis-rooted gene trees that were used. This result reinforces the expectation that low amounts of phylogenetic signal and methodological artifacts in gene-tree reconstruction can be more problematic for shortcut coalescent methods than is the assumption of a single hierarchy for all genes by concatenation methods when these approaches are applied to ancient divergences in empirical studies. We also demonstrate that a third coalescent method, ASTRAL, is more robust to mis-rooted gene trees than MP-EST or STAR, and that both Observed Variability (OV) and Tree Independent Generation of Evolutionary Rates (TIGER), which are two character subsampling procedures, are biased in favor of characters with highly asymmetrical distributions of character states when applied to this dataset. We conclude that enthusiastic application of novel tools is not a substitute for rigorous application of first principles, and that trending methods (e.g., shortcut coalescent methods applied to ancient divergences, tree-independent character subsampling), may be novel sources of previously under-appreciated, systematic errors.
Article
Hybridization is increasingly being recognized as a widespread process, even between ecologically and behaviorally divergent animal species. Determining phylogenetic relationships in the presence of hybridization remains a major challenge for evolutionary biologists, but advances in sequencing technology and phylogenetic techniques are beginning to address these challenges. Here we reconstruct evolutionary relationships among swordtails and platyfishes (Xiphophorus: Poeciliidae), a group of species characterized by remarkable morphological diversity and behavioral barriers to interspecific mating. Past attempts to reconstruct phylogenetic relationships within Xiphophorus have produced conflicting results. Because many of the 26 species in the genus are interfertile, these conflicts are likely due to hybridization. Using genomic data, we resolve a high-confidence species tree of Xiphophorus that accounts for both incomplete lineage sorting and hybridization. Our results allow us to reexamine a long-standing controversy about the evolution of the sexually selected sword in Xiphophorus, and demonstrate that hybridization has been strikingly widespread in the evolutionary history of this genus.
Article
This paper will continue certain investigations into the geometric nature of the well-known traveling salesman problem: that of determining the extreme Hamiltonian circuits (H-circuits) of a graph.
Article
The evolution of sexual signaling systems is influenced by natural and sexual selection acting on complex interactions among traits. Natural hybrid zones are excellent systems for assessing fitness effects on sexual phenotypes. Most documented hybrid zones, however, show little variation in sexual signals. A hybrid zone between the swordtails Xiphophorus birchmanni and Xiphophorus malinche is characterized by numerous recombinants for male sexual traits. Analyses of geographic variation in morphological and isozyme traits in the Rio Calnali, Hidalgo, Mexico, reveal an upstream-to-downstream gradient from X. malinche- to X. birchmanni-type traits. A second hybrid zone, likely isolated from the R. Calnali, occurs in the nearby Arroyo Pochutla. Although the presumed female preference for swords predicts the introgression of swords from X. malinche-like populations into hybrid populations, the opposite pattern was observed. Swords are reduced in populations otherwise characterized by X. malinche traits. Sexually dimorphic traits were poorly correlated within individuals, indicating that sexual selection does not act against recombinant phenotypes. Hybrid males also exhibit trait values outside the range of parental variation. These patterns are consistent with predictions that females are permissive, preferring generally conspicuous males without attending to specific features.
Article
Once thought rare in animal taxa, hybridization has been increasingly recognized as an important and common force in animal evolution. In the past decade, a number of studies have suggested that hybridization has driven speciation in some animal groups. We investigate the signature of hybridization in the genome of a putative hybrid species, Xiphophorus clemenciae, through whole genome sequencing of this species and its hypothesized progenitors. Based on analysis of this data, we find that X. clemenciae is unlikely to have been derived from admixture between its proposed parental species. However, we find significant evidence for recent gene flow between Xiphophorus species. Although we detect genetic exchange in two pairs of species analyzed, the proportion of genomic regions that can be attributed to hybrid origin is small, suggesting that strong behavioral premating isolation prevents frequent hybridization in Xiphophorus. The direction of gene flow between species is potentially consistent with a role for sexual selection in mediating hybridization.
Article
The genetic divergence time between two species varies substantially across the genome, conveying important information about the timing and process of speciation. Here we develop a framework for studying this variation and apply it to about 20 million base pairs of aligned sequence from humans, chimpanzees, gorillas and more distantly related primates. Human–chimpanzee genetic divergence varies from less than 84% to more than 147% of the average, a range of more than 4 million years. Our analysis also shows that human–chimpanzee speciation occurred less than 6.3 million years ago and probably more recently, conflicting with some interpretations of ancient fossils. Most strikingly, chromosome X shows an extremely young genetic divergence time, close to the genome minimum along nearly its entire length. These unexpected features would be explained if the human and chimpanzee lineages initially diverged, then later exchanged genes before separating permanently.
Article
We consider the problem of inferring the evolutionary tree of a set of n species. We propose a quartet reconstruction method which specifically produces trees whose edges have strong combinatorial evidence. For this purpose we use the Q* relation [3], defined as the maximum subset of resolved quartets which is equivalent to a tree. We further investigate the properties of this variation of the NP-hard quartet consistency problem, first providing a polynomial time, O(n^4), algorithm. Moreover, we show that the convergence rate of the method is polynomial for realistic conditions, under the Cavender-Farris model of evolution.
Article
Recently, much attention has been devoted to the construction of phylogenetic networks which generalize phylogenetic trees in order to accommodate complex evolutionary processes. Here, we present an efficient, practical algorithm for reconstructing level-1 phylogenetic networks--a type of network slightly more general than a phylogenetic tree--from triplets. Our algorithm has been made publicly available as the program LEV1ATHAN. It combines ideas from several known theoretical algorithms for phylogenetic tree and network reconstruction with two novel subroutines. Namely, an exponential-time exact and a greedy algorithm both of which are of independent theoretical interest. Most importantly, LEV1ATHAN runs in polynomial time and always constructs a level-1 network. If the data are consistent with a phylogenetic tree, then the algorithm constructs such a tree. Moreover, if the input triplet set is dense and, in addition, is fully consistent with some level-1 network, it will find such a network. The potential of LEV1ATHAN is explored by means of an extensive simulation study and a biological data set. One of our conclusions is that LEV1ATHAN is able to construct networks consistent with a high percentage of input triplets, even when these input triplets are affected by a low to moderate level of noise.
Article
Natural hybrid zones provide opportunities to study a range of evolutionary phenomena from speciation to the genetic basis of fitness-related traits. We show that widespread hybridization has occurred between two neo-tropical stream fishes with partial reproductive isolation. Phylogenetic analyses of mitochondrial sequence data showed that the swordtail fish Xiphophorus birchmanni is monophyletic and that X. malinche is part of an independent monophyletic clade with other species. Using informative single nucleotide polymorphisms in one mitochondrial and three nuclear intron loci, we genotyped 776 specimens collected from twenty-three sites along seven separate stream reaches. Hybrid zones occurred in replicated fashion in all stream reaches along a gradient from high to low elevation. Genotyping revealed substantial variation in parental and hybrid frequencies among localities. Tests of F(IS) and linkage disequilibrium (LD) revealed generally low F(IS) and LD except in five populations where both parental species and hybrids were found suggesting incomplete reproductive isolation. In these locations, heterozygote deficiency and LD were high, which suggests either selection against early generation hybrids or assortative mating. These data lay the foundation to study the adaptive basis of the replicated hybrid zone structure and for future integration of behaviour and genetics to determine the processes that lead to the population genetic patterns observed in these hybrid zones.
Article
Phylogenetic networks are a generalization of evolutionary or phylogenetic trees that allow the representation of conflicting signals or alternative evolutionary histories in a single diagram. Recently the Quartet-Net or “QNet” method was introduced, a method for computing a special kind of phylogenetic network called a split network from a collection of weighted quartet trees (i.e. phylogenetic trees with 4 leaves). This can be viewed as a quartet analogue of the distance-based Neighbor-Net (NNet) method for constructing outer-labeled planar split networks. In this paper, we prove that QNet is a consistent method, that is, we prove that if QNet is applied to a collection of weighted quartets arising from a circular split weight function, then it will return precisely this function. This key property of QNet not only ensures that it is guaranteed to produce a tree if the input corresponds to a tree, and an outer-labeled planar split network if the input corresponds to such a network, but also provides the main guiding principle for the design of the method.
Article
Human immunodeficiency virus type 1 (HIV-1) sequences that pre-date the recognition of AIDS are critical to defining the time of origin and the timescale of virus evolution. A viral sequence from 1959 (ZR59) is the oldest known HIV-1 infection. Other historically documented sequences, important calibration points to convert evolutionary distance into time, are lacking, however; ZR59 is the only one sampled before 1976. Here we report the amplification and characterization of viral sequences from a Bouin's-fixed paraffin-embedded lymph node biopsy specimen obtained in 1960 from an adult female in Léopoldville, Belgian Congo (now Kinshasa, Democratic Republic of the Congo (DRC)), and we use them to conduct the first comparative evolutionary genetic study of early pre-AIDS epidemic HIV-1 group M viruses. Phylogenetic analyses position this viral sequence (DRC60) closest to the ancestral node of subtype A (excluding A2). Relaxed molecular clock analyses incorporating DRC60 and ZR59 date the most recent common ancestor of the M group to near the beginning of the twentieth century. The sizeable genetic distance between DRC60 and ZR59 directly demonstrates that diversification of HIV-1 in west-central Africa occurred long before the recognized AIDS pandemic. The recovery of viral gene sequences from decades-old paraffin-embedded tissues opens the door to a detailed palaeovirological investigation of the evolutionary history of HIV-1 that is not accessible by other methods.