
Quartet Puzzling: A Quartet Maximum-Likelihood Method for Reconstructing Tree Topologies

Abstract

A versatile method, quartet puzzling, is introduced to reconstruct the topology (branching pattern) of a phylogenetic tree based on DNA or amino acid sequence data. This method applies maximum-likelihood tree reconstruction to all possible quartets that can be formed from n sequences. The quartet trees serve as starting points to reconstruct a set of optimal n-taxon trees. The majority rule consensus of these trees defines the quartet puzzling tree and shows groupings that are well supported. Computer simulations show that the performance of quartet puzzling to reconstruct the true tree is always equal to or better than that of neighbor joining. For some cases with high transition/transversion bias, quartet puzzling outperforms neighbor joining by a factor of 10. The application of quartet puzzling to mitochondrial RNA and tRNA(Val) sequences from amniotes demonstrates the power of the approach. A PHYLIP-compatible ANSI C program, PUZZLE, for analyzing nucleotide or amino acid sequence data is available.
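The combinatorial starting point of the method, forming every quartet from n sequences and listing the three possible unrooted topologies of each, can be sketched as follows. This is a minimal illustration only: the function names `all_quartets` and `quartet_topologies` are hypothetical, and the maximum-likelihood evaluation of each topology and the subsequent puzzling and consensus steps are not shown.

```python
from itertools import combinations
from math import comb


def all_quartets(taxa):
    """Return every unordered set of four taxa as a tuple."""
    return list(combinations(taxa, 4))


def quartet_topologies(quartet):
    """Return the three unrooted topologies of a quartet.

    Each topology is written as a split: the two pairs of taxa
    separated by the single internal branch.
    """
    a, b, c, d = quartet
    return [
        ((a, b), (c, d)),
        ((a, c), (b, d)),
        ((a, d), (b, c)),
    ]


# Illustrative taxon labels (assumed, not from the paper).
taxa = ["A", "B", "C", "D", "E"]
quartets = all_quartets(taxa)

# The number of quartets grows as n choose 4.
assert len(quartets) == comb(len(taxa), 4)

for q in quartets:
    for left, right in quartet_topologies(q):
        print(f"{left[0]}{left[1]}|{right[0]}{right[1]}")
```

Because the number of quartets is n choose 4, the cost of evaluating all of them rises roughly with the fourth power of the number of sequences.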
... In particular, both the reduced number of specimens available on GenBank for which sequences of Control Region and 16s were present, and the low phylogenetic signal obtained for 16s (see Figure S1b) prevented us from applying those analyses to a reliable concatenated dataset in terms of a phylogenetic signal (see Figure S1c). Indeed, the test of the Likelihood Map disassembled the dataset into quartets, which represent the smallest set of taxa for which more than one unrooted tree topology exists [65]. The quartet puzzling works on groups of four sequences, in order to obtain a map that allows for understanding whether data are reliable for phylogenetic and taxonomic inferences. ...
... The network analysis performed on the Control Region dataset, including all the sequences available for the Mediterranean freshwater blennies (see Figure 4 and Table S2), showed the presence of three main clusters: cluster A, cluster B, and cluster C. ...
Article
The genus Salariopsis (Blenniidae) comprises freshwater blenny fish that inhabit the Mediterranean Sea, Black Sea, and north-east Atlantic areas. Three species have been formally described to date: Salariopsis fluviatilis, S. economidisi, and S. atlantica. In this study, 103 individuals were collected from different Italian regions (Sardinia, Liguria, Piedmont, Lombardy) and analyzed using the mtDNA Control Region and the ribosomal 16s gene. We aimed (i) to depict the phylogeographic patterns of S. fluviatilis in northern Italy and Sardinia and (ii) to compare the genetic structure of Italian samples with those from other Mediterranean regions. Results obtained showed the presence of a well-supported genetic structuring among Italian S. fluviatilis populations, shedding new light on the phylogeographic patterns of northern Italian populations of S. fluviatilis sensu stricto across the Ligurian Alpine ridge and the Sardinia Island-mainland dispersal patterns. Furthermore, our species delimitation analysis was consistent in supporting results of previous research about the presence of genetic differentiation among S. fluviatilis, evidencing: (i) a large group of S. fluviatilis sensu stricto that includes two sub-groups (Occidental and Oriental), (ii) one group comprising populations from the Middle East of a taxonomic entity corresponding to Salariopsis cf. fluviatilis, and (iii) one group of Iberian individuals from the Guadiana River.
... Moreover, using results from [37], [43], and [55], one can modify this method to return probabilities for each class rather than a classification. This may be useful for developing methods for constructing larger networks from quarnets using a weighted quarnet scheme similar to those for trees [34,45]. ...
Preprint
Phylogenetic networks provide a means of describing the evolutionary history of sets of species believed to have undergone hybridization or gene flow during their evolution. The mutation process for a set of such species can be modeled as a Markov process on a phylogenetic network. Previous work has shown that site-pattern probability distributions from a Jukes-Cantor phylogenetic network model must satisfy certain algebraic invariants. As a corollary, aspects of the phylogenetic network are theoretically identifiable from site-pattern frequencies. In practice, because of the probabilistic nature of sequence evolution, the phylogenetic network invariants will rarely be satisfied, even for data generated under the model. Thus, using network invariants for inferring phylogenetic networks requires some means of interpreting the residuals, or deviations from zero, when observed site-pattern frequencies are substituted into the invariants. In this work, we propose a method of utilizing invariant residuals and support vector machines to infer 4-leaf level-one phylogenetic networks, from which larger networks can be reconstructed. Given data for a set of species, the support vector machine is first trained on model data to learn the patterns of residuals corresponding to different network structures to classify the network that produced the data. We demonstrate the performance of our method on simulated data from the specified model and primate data.
... In order to distinguish ambiguous internal phylogenetic relationships among four clades from the northern East Indian Ocean, southern East Indian Ocean, South China Sea, and Central-Southeast Pacific, a likelihood mapping analysis was performed in TREE-PUZZLE version 5.2 (Schmidt et al., 2002). Likelihood mapping triangle diagrams were used to compare the competing relationship between the four clusters and to assess the support values of internal branches (Strimmer and von Haeseler, 1996; Strimmer and von Haeseler, 1997). ...
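The idea behind a likelihood mapping triangle can be sketched in a few lines: the likelihoods of the three topologies of a quartet are normalised to weights that sum to one, and those weights are used as barycentric coordinates inside an equilateral triangle. The sketch below is an illustration under stated assumptions, not TREE-PUZZLE's implementation: the function names, the choice of triangle vertices, and the example log-likelihood values are all hypothetical.

```python
import math


def posterior_weights(log_likelihoods):
    """Normalise three log-likelihoods to weights that sum to 1.

    The maximum is subtracted before exponentiating so that very
    negative log-likelihoods do not underflow to zero.
    """
    if len(log_likelihoods) != 3:
        raise ValueError("expected one log-likelihood per quartet topology")
    highest = max(log_likelihoods)
    scaled = [math.exp(value - highest) for value in log_likelihoods]
    total = sum(scaled)
    return [value / total for value in scaled]


def triangle_point(weights):
    """Map three weights to a point in an equilateral triangle.

    Vertex placement is an arbitrary choice made for this sketch:
    topology 1 at (0, 0), topology 2 at (1, 0), topology 3 at the apex.
    """
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x = sum(w * vx for w, (vx, _) in zip(weights, vertices))
    y = sum(w * vy for w, (_, vy) in zip(weights, vertices))
    return x, y


# Made-up log-likelihoods for the three topologies of one quartet.
weights = posterior_weights([-1234.5, -1236.0, -1240.2])
print(weights, triangle_point(weights))
```

A quartet whose point falls near a corner favours one topology clearly, while a point near the centre means the three topologies are about equally likely and the quartet carries little phylogenetic signal.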
Article
As a biodiversity hotspot, the East Indies (Coral) Triangle possesses the highest biodiversity on Earth. However, evolutionary hypotheses around this area remain controversial; e.g., center of origin, center of accumulation, and center of overlap have been supported by different species. This study aims to answer the evolutionary influence of the Indonesian Seaway on the biodiversity of the Coral Triangle by recovering the evolutionary origins of a wide-ranging ommastrephid squid (Sthenoteuthis oualaniensis) based on integrated molecular and oceanographic clues from the Indo-Pacific. Three new clades were revealed; viz., clade I from the South China Sea, clade II from the northern East Indian Ocean, and clade III from the southern East Indian Ocean. These two Indian Ocean clades formed a monophyly closely related to clade IV from the Central-Southeast Pacific. Clade VI from the central Equatorial Pacific and clade V from the northern Eastern Pacific sit in basal positions of phylogenetic trees. Ancestral Sthenoteuthis was inferred to have originated from the Atlantic Ocean and sequentially dispersed to the northern East Pacific, central Equatorial Pacific, and West Pacific through the open Panama Seaway and being transported by westward North Equatorial Current. The East Indian Ocean was likely colonized by an ancestral population of clade IV from the Southeast Pacific. Westward South Equatorial Circulation could have promoted transoceanic migration of S. oualaniensis through the wide paleo-Indonesian Seaway. Sea level regression since the Miocene and the closure of the Indonesian Seaway at 4–3 Ma were responsible for the population genetic differentiation of S. oualaniensis in the Indo-Pacific. Therefore, the Indonesian Gateway played an important role in influencing marine organisms’ migration and population differentiation through controlling and reorganizing circulations in the Indo-Pacific.
... Another approach is to infer individual quartets (with or without weights), and then amalgamate them into a single coherent species tree (Avni et al, 2015;Chifman and Kubatko, 2014;Mahbub et al, 2021;Ranwez and Gascuel, 2001;Reaz et al, 2014;Schmidt et al, 2002;Snir and Rao, 2010;Strimmer and von Haeseler, 1996). wQFM (Mahbub et al, 2021) and wQMC (Avni et al, 2015) represent the latter category of quartet-based methods. ...
Article
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present Quartet based Gene tree Imputation using Deep Learning (QT-GILD), an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing, which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical datasets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data.
... In total, about 1400 bp of nucleotide sequence for ITS2 and 2000 bp for D-loop were aligned. The analyses were carried out using neighbor-joining (NJ) and maximum parsimony (MP) methods within PHYLIP 3.5c (Felsenstein, 1993) and maximum likelihood (DNAML) from the PUZZLE (Strimmer and von Haeseler, 1996) quartet-puzzling approach. The reliability of both analyses was tested using bootstrapping (Felsenstein, 1985) on the basis of 1000 replications of the data. ...
Preprint
We study the approximability of a broad class of computational problems -- originally motivated in evolutionary biology and phylogenetic reconstruction -- concerning the aggregation of potentially inconsistent (local) information about $n$ items of interest, and we present optimal hardness of approximation results under the Unique Games Conjecture. The class of problems studied here can be described as Constraint Satisfaction Problems (CSPs) over infinite domains, where instead of values $\{0,1\}$ or a fixed-size domain, the variables can be mapped to any of the $n$ leaves of a phylogenetic tree. The topology of the tree then determines whether a given constraint on the variables is satisfied or not, and the resulting CSPs are called Phylogenetic CSPs. Prominent examples of Phylogenetic CSPs with a long history and applications in various disciplines include: Triplet Reconstruction, Quartet Reconstruction, Subtree Aggregation (Forbidden or Desired). For example, in Triplet Reconstruction, we are given $m$ triplets of the form $ij|k$ (indicating that ``items $i,j$ are more similar to each other than to $k$'') and we want to construct a hierarchical clustering on the $n$ items, that respects the constraints as much as possible. Despite more than four decades of research, the basic question of maximizing the number of satisfied constraints is not well-understood. The current best approximation is achieved by outputting a random tree (for triplets, this achieves a 1/3 approximation). Our main result is that every Phylogenetic CSP is approximation resistant, i.e., there is no polynomial-time algorithm that does asymptotically better than a (biased) random assignment. This is a generalization of the results in Guruswami, Hastad, Manokaran, Raghavendra, and Charikar (2011), who showed that ordering CSPs are approximation resistant (e.g., Max Acyclic Subgraph, Betweenness).
Preprint
Gene tree discordance due to incomplete lineage sorting or introgression has been described in numerous genomic datasets. Among distantly related taxa, however, it is difficult to differentiate these biological sources of discordance from discordance due to errors in gene tree reconstruction, even when supervised machine learning techniques are used to infer individual gene trees. Here, rather than applying machine learning to the problem of inferring single tree topologies, we develop a model to infer important properties of a particular internal branch of the species tree via genome-scale summary statistics extracted from individual alignments and inferred gene trees. We show that our model can effectively predict the presence/absence of discordance, estimate the probability of discordance, and infer the correct species tree topology in the presence of multiple, common sources of error. While gene tree topology counts are the most salient predictors of discordance at short time scales, other genomic features become relevant for distantly related species. We validate our approach through simulation, and apply it to data from the deepest splits among metazoans. Our results suggest that the base of Metazoa experienced significant gene tree discordance, implying that discordant traits among current taxa can be explained without invoking homoplasy. In addition, we find support for Porifera as the sister clade to the rest of Metazoa. Overall, these results demonstrate how machine learning can be used to answer important phylogenetic questions, while marginalizing over individual gene tree - and even species tree - topologies.
Chapter
In 1998, two species of minke whales were recognized based on the review of the morphological and genetic information available at that time: the Antarctic minke whale (Balaenoptera bonaerensis), which is restricted to the Southern Hemisphere, and the cosmopolitan common minke whale (Balaenoptera acutorostrata). Furthermore, three sub-species of the common minke whale were recognized: the North Atlantic (B. a. acutorostrata), North Pacific (B. a. scammoni) and Southern Hemisphere (B. a. subsp.). This chapter reviews the genetic studies on minke whales conducted after 1998. The review is organized by topic, e.g., those studies focused on phylogeny and other matters most relevant for taxonomy, and those focused on population genetic structure within oceanic basins most relevant for conservation and management. On the former topic, the new genetic information, whilst strongly supporting the minke whale taxonomic classification recognized in 1998, also reveals substantial genetic differentiation within the Southern Hemisphere common minke whales, with subsequent taxonomic implications. On the latter topic, results from different analytical procedures have provided information on population identification and structure in the Indo-Pacific sector of the Antarctic and western North Pacific, but they have failed to identify unequivocally any population within the North Atlantic common minke whales.
Article
The discovery of a new species of the genus Canthocamptus, C. waldemarschneideri sp. nov., in northern Siberia prompted a taxonomic analysis of this genus. In this work, on the basis of cladistic analysis, we show that the genus is not monophyletic. Based on differences in the structure of the endopods on the second pair of male swimming legs, fifth legs of males and females, and caudal rami, we conclude that the Canthocamptus mirabilis species group is a separate genus, Kikuchicamptus gen. nov. Additionally, two species are transferred to the genus Attheyella, and one species, Canthocamptus gibba, is synonymized. The subgenera Canthocamptus (Baikalocamptus) and Canthocamptus (Canthocamptus) are also synonymized. The new species, Canthocamptus waldemarschneideri sp. nov., is most closely related to the American Canthocamptus assimilis Kiefer, 1931 and differs from it in the ornamentation of the abdominal somites and the shape of the caudal setae.
Article
The efficiencies of distance-matrix methods for correct tree reconstruction under a variety of substitution rates, transition-transversion biases, and different model trees were studied. If substitution rates are high and the ratio of transitions and transversions is large, even a Kimura two-parameter correction fails very often to reconstruct the model tree. We show that a combination of combinatorial weighting by Williams and Fitch and the Jukes-Cantor correction significantly increases the efficiency of tree-reconstruction methods, for a large fraction of evolutionary parameters. We explain why this approach is superior to any other weighting/correction scheme tested, as long as sequences are sufficiently long or substitution rates are sufficiently large. An approximate threshold for switching to a different weighting scheme is given.