Robert Lanfear’s research while affiliated with Australian National University and other places

Publications (173)


Figure 3: Relative risks for different proportions of removed sites for the Short, Medium and Long branch length simulations. The relative risk is the ratio of the probability of removing sites given that they were mislabelled to the probability of removing sites given that they were not mislabelled.
Robust Phylogenetics
  • Preprint

April 2025

·

90 Reads

Qin Liu

·

·

Robert Lanfear

·

[...]

·

Barbara R. Holland

We present a robust phylogenetic inference method, called the trimmed log-likelihood method, which effectively identifies fast-evolving, saturated, or erroneous sites in both simulated and empirical multiple sequence alignments. This method avoids circularity by dynamically identifying and removing sites without relying on an initial tree, allowing the specific sites removed to change as tree topology and branch lengths are estimated. Our analyses demonstrate that this method outperforms existing approaches, such as the Slow-Fast method, Tree Independent Generation of Evolutionary Rate approach, and Le Quesne Probability statistics, by removing fewer sites while still inferring phylogenies with comparable or greater accuracy. Implemented in IQ-TREE2, the trimmed log-likelihood method is user-friendly with a simple command-line interface. However, challenges remain in addressing heterogeneous evolutionary processes including compositional biases, such as GC bias. Despite these challenges, our approach offers a practical solution for improving phylogenetic inference by automatically and dynamically identifying sites to down-weight during phylogenetic analyses. We recommend that researchers compare trees inferred by varying the proportion of down-weighted sites to monitor changes in tree topology and to identify a set of candidate tree topologies for further consideration.
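The abstract does not spell out the objective function, so the following is only a minimal sketch of the general trimmed-likelihood idea from robust statistics: score a tree by summing per-site log-likelihoods after dropping the worst-fitting fraction of sites. The function name, the choice to trim the lowest-scoring sites, and the example numbers are all assumptions for illustration, not the IQ-TREE2 implementation.

```python
import math

def trimmed_log_likelihood(site_lnl, trim_fraction):
    """Sum per-site log-likelihoods after dropping the lowest-scoring fraction.

    Hypothetical helper: illustrates the generic trimmed-likelihood idea only.
    """
    if not 0.0 <= trim_fraction < 1.0:
        raise ValueError("trim_fraction must be in [0, 1)")
    n_drop = math.floor(trim_fraction * len(site_lnl))
    # Rank site indices from worst to best fit under the current tree.
    order = sorted(range(len(site_lnl)), key=lambda i: site_lnl[i])
    dropped = sorted(order[:n_drop])
    kept_total = sum(site_lnl[i] for i in order[n_drop:])
    return kept_total, dropped

# Made-up per-site log-likelihoods for a 10-site alignment under one tree.
site_lnl = [-2.1, -9.8, -1.7, -3.0, -12.4, -2.5, -1.9, -2.2, -2.8, -2.0]
total, dropped = trimmed_log_likelihood(site_lnl, 0.2)
print(total, dropped)
```

Because the ranking is recomputed from the per-site values each time the function is called, the set of dropped sites can differ from one candidate tree to the next, which matches the abstract's point that no initial tree fixes the sites in advance.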


Fig. 1. MixtureFinder algorithm design. MixtureFinder starts by fitting a 1-class model, then adds classes one at a time until the criterion no longer supports adding another class. The optimal mixture model, RHAS model, and tree are reported as the output of the algorithm.
Fig. 2. a) The proportion of partitions in which the 1- or 2-class model is better according to BIC. The exact number of partitions in which the 1- or 2-class model is better is shown on the top of each bar. b) The nRF distance between the commonly accepted tree and each of the 2- and 1-class trees (for every partition). The difference of the average nRF (1-class minus 2-class trees) is shown on the top of each pair of boxes. c) The tree length of the 2- and 1-class trees (for every partition). The relative tree length difference of 2- to 1-class trees (i.e. mean length of 2-class trees divided by mean length of 1-class trees) is shown on the top of each pair of boxes. The "*" marks represent the significance of the paired t-test between 1- and 2-class models in (b) and (c) ("****" 0.0001 "***" 0.001 "**" 0.01 "*" 0.05).
Fig. 5. The nRF distance and relative difference in tree length between estimated model trees and true trees. The mean value and the corresponding standard error of the mean (error bar) of each experimental condition are shown with different lines. The nRF distance from the mixture model trees to the simulated trees is slightly smaller than that of the nonmixture models only for the 10,000-site alignments of simulation scheme 1. In both simulation schemes, the mixture models recover tree lengths that are closer to the lengths of the simulated trees, while the nonmixture models underestimate the tree lengths.
Fig. 6. Mixture models change the conclusions from a concatenated phylogenomic dataset. a) The three-, four-, five-, six-, and seven-class trees all recover a tree with Turtles as the sister group of Archosaurs with bootstrap supports of 90, 94, 99, 98, and 100, respectively. b) The one- and two-class trees both recover Turtles as the sister group of Crocodiles, with bootstrap supports of 100 and 55, respectively. c) Bootstrap supports of three key nodes in the tree (Turtles-Archosaurs), (Turtles-Crocodiles), and (Python-Anolis) change almost monotonically as classes are added to the mixture model. Bootstrap supports of the rest of the nodes in the trees are 100%.
Empirical datasets used in the test
MixtureFinder: Estimating DNA Mixture Models for Phylogenetic Analyses

December 2024

·

66 Reads

Molecular Biology and Evolution

In phylogenetic studies, both partitioned models and mixture models are used to account for heterogeneity in molecular evolution among the sites of DNA sequence alignments. Partitioned models require the user to specify the grouping of sites into subsets, and then assume that each subset of sites can be modelled by a single common process. Mixture models do not require users to pre-specify subsets of sites, and instead calculate the likelihood of every site under every model, while co-estimating the model weights and parameters. While much research has gone into the optimisation of partitioned models by merging user-specified subsets, there has been less attention paid to the optimisation of mixture models for DNA sequence alignments. In this study, we first ask whether a key assumption of partitioned models – that each user-specified subset can be modelled by a single common process – is supported by the data. Having shown that this is not the case, we then design, implement, test, and apply an algorithm, MixtureFinder, to select the optimum number of classes for a mixture model of Q matrices for the standard models of DNA sequence evolution. We show this algorithm performs well on simulated and empirical datasets and suggest that it may be useful for future empirical studies. MixtureFinder is available in IQ-TREE2, and a tutorial for using MixtureFinder can be found here: http://www.iqtree.org/doc/Complex-Models#mixture-models.
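The class-selection step described above can be sketched as a forward search that stops when an extra class no longer improves the criterion. The sketch below assumes BIC as the criterion and takes already-fitted log-likelihoods and parameter counts as input; `select_num_classes` and all the numbers are hypothetical, and the actual model fitting (done inside IQ-TREE2) is not reproduced here.

```python
import math

def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: lower values indicate a better trade-off."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood

def select_num_classes(fits, n_sites):
    """Pick the number of mixture classes by forward selection on BIC.

    fits[k-1] is a (log_likelihood, n_params) pair for the best k-class model.
    The search stops at the first k whose BIC fails to improve.
    """
    best_k = 1
    best_bic = bic(*fits[0], n_sites)
    for k in range(2, len(fits) + 1):
        candidate = bic(*fits[k - 1], n_sites)
        if candidate >= best_bic:
            break  # the criterion no longer supports adding a class
        best_k, best_bic = k, candidate
    return best_k, best_bic

# Illustrative (made-up) fits for 1 to 4 classes on a 1,000-site alignment.
fits = [(-5200.0, 10), (-5100.0, 20), (-5080.0, 30), (-5075.0, 40)]
best_k, best_bic = select_num_classes(fits, n_sites=1000)
print(best_k, round(best_bic, 2))
```

In this toy input the third class raises the log-likelihood by 20 units but costs 10 extra parameters, which is not enough to lower the BIC, so the search settles on two classes.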


Fig. 1. Concordance and discordance. a) A species tree of four clades of organisms, A, B, C, and D. Clade D is the outgroup. b) Three possible gene trees were derived from the species tree in a). Gene tree 1 is concordant with the species tree, while gene trees 2 and 3 are discordant. c) The gCV, ψ, describes the number of gene trees, or sometimes the proportion of gene trees, that fall into each of four categories. ψ 1 is the number of gene trees that are concordant with the species tree (commonly called the "CF"). ψ 2 and ψ 3 are the number of gene trees that match discordant trees 2 and 3 in b), ordered with the largest count first. ψ 4 is the number of gene trees that are discordant with the species tree but match neither gene tree 2 nor gene tree 3 (for example, any gene tree in which clades A, B, or C are not monophyletic).
Fig. 2. Phylogenetic statistical support compared with measures of phylogenetic statistical variation as dataset size increases. In two example datasets (Cannon et al. 2016; Ran et al. 2018), we calculated the mean values of two measures of statistical support: UltraFast Bootstraps (Hoang et al. 2017) and ASTRAL local posterior probabilities (Sayyari and Mirarab 2016), as well as four measures of CFs: gCF (Minh et al. 2020), qCF (Mirarab et al. 2014), and the sCF calculated with parsimony (Minh et al. 2020) and likelihood (Mo et al. 2023), assuming that the tree calculated from the 200-locus dataset was correct. The figure shows that as more loci are used to calculate the statistics, the average of the measures of statistical support (grayscale lines, square points) tends to increase toward 100%, but the average of the measures of concordance (colored lines, circular points) tends to stabilize after being estimated inaccurately with small numbers of loci. Even with just 200 loci, most bootstrap and posterior probability values become very high (86.7% of bootstrap values and 69.3% of posterior probability values were >95% for the metazoa dataset; 94.3% of bootstrap values and 91.4% of posterior probability values were >95% for the plants dataset, respectively).
Fig. 4. Concordance vectors for six major clades of birds show dramatic differences in concordance despite maximal statistical support. a) The phylogeny shows six major clades of birds identified in a recent phylogenomic study (Stiller et al. 2024). Each clade is named and shown in a color that matches those used in the study by Stiller et al. 2024. The stem branch length of each clade (measured in coalescent units) and the posterior probability of each branch are shown. The right-most extent of each clade corresponds to the maximum root-to-tip distance of the tips in a clade (using coalescent branch lengths from ASTRAL). The inset shows a cladogram that clarifies the topology of the inferred species tree. b) Concordance vectors for each clade reveal dramatic differences in the CF and DF of each clade. Opposite each clade name is a matrix of the gene, site, and quartet concordance vectors. The numbers in each cell are percentages, with higher percentages colored in darker shades of red.
The Meaning and Measure of Concordance Factors in Phylogenomics

October 2024

·

100 Reads

·

18 Citations

Molecular Biology and Evolution

As phylogenomic datasets have grown in size, researchers have developed new ways to measure biological variation and to assess statistical support for specific branches. Larger datasets have more sites and loci, and therefore less sampling variance. While we can more accurately measure the mean signal in these datasets, lower sampling variance is often reflected in uniformly high measures of branch support—such as the bootstrap and posterior probability—limiting their utility. Larger datasets have also revealed substantial biological variation in the topologies found across individual loci, such that the single species tree inferred by most phylogenetic methods represents a limited summary of the data for many purposes. In contrast to measures of statistical support, the degree of underlying topological variation among loci should be approximately constant regardless of the size of the dataset. “Concordance factors” and similar statistics have therefore become increasingly important tools in phylogenetics. In this review, we explain why concordance factors should be thought of as descriptors of topological variation rather than as measures of statistical support, and argue that they provide important information about the predictive power of the species tree not contained in measures of support. We review a growing suite of statistics for measuring concordance, compare them in a common framework that reveals their interrelationships, and demonstrate how to calculate them using an example from birds. We also discuss how measures of topological variation might change in the future as we move beyond estimating a single “tree of life” towards estimating the myriad evolutionary histories underlying genomic variation.
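As a rough illustration of the gene concordance vector ψ defined in the Fig. 1 caption above, the sketch below counts gene trees in each of the four categories for a single branch. It assumes each gene tree has already been reduced to the pair of ingroup clades it places as sisters (or `None` when clades A, B, or C are not monophyletic); that encoding, the function name, and the example counts are assumptions made for this example.

```python
from collections import Counter

def concordance_vector(gene_tree_pairs, species_pair=frozenset("AB")):
    """Return [psi1, psi2, psi3, psi4] as counts of gene trees.

    gene_tree_pairs: one entry per gene tree, e.g. "AB", "AC", "BC", or None.
    species_pair: the sister pair implied by the species tree (assumed A+B).
    """
    counts = Counter(frozenset(p) if p else None for p in gene_tree_pairs)
    psi1 = counts.pop(species_pair, 0)   # concordant with the species tree
    psi4 = counts.pop(None, 0)           # matches none of the three quartet trees
    if len(counts) > 2:
        raise ValueError("expected at most two discordant resolutions")
    # psi2 and psi3 are ordered with the largest count first.
    discordant = sorted(counts.values(), reverse=True)
    discordant += [0] * (2 - len(discordant))
    return [psi1, discordant[0], discordant[1], psi4]

# Ten made-up gene trees for one branch of the species tree.
trees = ["AB"] * 6 + ["BC"] * 2 + ["AC"] + [None]
psi = concordance_vector(trees)
proportions = [count / len(trees) for count in psi]
print(psi, proportions)
```

Dividing by the number of gene trees gives the proportion form of the vector; the first element then corresponds to what is commonly reported as the concordance factor.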


GTRpmix: A Linked General Time-Reversible Model for Profile Mixture Models

August 2024

·

37 Reads

·

7 Citations

Molecular Biology and Evolution

Profile mixture models capture distinct biochemical constraints on the amino acid substitution process at different sites in proteins. These models feature a mixture of time-reversible models with a common matrix of exchangeabilities and distinct sets of equilibrium amino acid frequencies known as profiles. Combining the exchangeability matrix with each profile generates the matrix of instantaneous rates of amino acid exchange for that profile. Currently, empirically estimated exchangeability matrices (e.g., the LG matrix) are widely used for phylogenetic inference under profile mixture models. However, these were estimated using a single profile and are unlikely optimal for profile mixture models. Here, we describe the GTRpmix model that allows maximum likelihood estimation of a common exchangeability matrix under any profile mixture model. We show that exchangeability matrices estimated under profile mixture models differ from the LG matrix, dramatically improving model fit and topological estimation accuracy for empirical test cases. Because the GTRpmix model is computationally expensive, we provide two exchangeability matrices estimated from large concatenated phylogenomic-supermatrices to be used for phylogenetic analyses. One, called Eukaryotic Linked Mixture (ELM), is designed for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes, and the other, Eukaryotic and Archaeal Linked mixture (EAL), for reconstructing relationships between eukaryotes and Archaea. These matrices, combined with profile mixture models, fit data better and have improved topology estimation relative to the LG matrix combined with the same mixture models. Starting with version 2.3.1, IQ-TREE2 allows users to estimate linked exchangeabilities (i.e. amino acid exchange rates) under profile mixture models.
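The combination of one shared exchangeability matrix with several profiles can be sketched with the standard time-reversible construction, in which each off-diagonal rate is the exchangeability multiplied by the target state's equilibrium frequency. The example below uses a made-up 3-state alphabet instead of 20 amino acids and leaves the matrices unscaled; the function name and all values are illustrative assumptions, not values from the paper.

```python
def rate_matrix(exchangeabilities, profile):
    """Build an unscaled instantaneous rate matrix Q for one profile.

    Off-diagonal entries: Q[i][j] = R[i][j] * pi[j], with R symmetric.
    Diagonal entries are set so that each row sums to zero.
    """
    n = len(profile)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchangeabilities[i][j] * profile[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this sums the off-diagonals
    return q

# One shared (linked) exchangeability matrix, toy 3-state example.
R = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 3.0],
     [2.0, 3.0, 0.0]]

# Two made-up profiles (equilibrium frequencies), one per mixture class.
profiles = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
q_matrices = [rate_matrix(R, pi) for pi in profiles]
for q in q_matrices:
    print(q)
```

The same `R` yields a different rate matrix for each profile, which is why exchangeabilities estimated under a single profile need not be the best choice once several profiles share them.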


GTRpmix: A linked general-time reversible model for profile mixture models

March 2024

·

79 Reads

Profile mixture models capture distinct biochemical constraints on the amino acid substitution process at different sites in proteins. These models feature a mixture of time-reversible models with a common set of amino acid exchange rates (a matrix of exchangeabilities) and distinct sets of equilibrium amino acid frequencies known as profiles. Combining the exchangeability matrix with each profile generates the matrix of instantaneous rates of amino acid exchange for that profile. Currently, empirically estimated exchangeability matrices (e.g., the LG or WAG matrices) are widely used for phylogenetic inference under profile mixture models. However, such matrices were originally estimated using site-homogeneous models with a single set of equilibrium amino acid frequencies; they are therefore unlikely to be optimal for site-heterogeneous profile mixture models. Here we describe the GTRpmix model, implemented in IQ-TREE2, that allows maximum likelihood estimation of a common set of exchangeabilities for all site classes under any profile mixture model. We show that exchangeability matrices estimated in the presence of a site-heterogeneous profile mixture model differ markedly from the widely used LG matrix and dramatically improve model fit and topological estimation accuracy for empirical test cases. Because the GTRpmix model is computationally expensive, we provide two exchangeability matrices estimated from large concatenated phylogenomic supermatrices under the C60 profile mixture model that can be used as fixed matrices for phylogenetic analyses. One of these, called Eukaryotic Linked Mixture (ELM), is designed for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes, and the other, Eukaryotic and Archaeal Linked mixture (EAL), for reconstructing relationships between eukaryotes and Archaea. These matrices, when combined with profile mixture models, fit data much better and have improved topology estimation relative to the empirical LG matrix combined with the same underlying mixture models. Version v2.3.1 of IQ-TREE2 implementing these models is available at www.iqtree.org.


Figure 1. MixtureFinder algorithm design. MixtureFinder starts by fitting a 1-class model, then adds classes one at a time until the criterion no longer supports adding another class. Finally, the algorithm
Figure 4. Integrated Squared Error (ISE) between parameters in the estimated model and the simulated model. The mixture models (green and blue bars) have smaller ISEs to the simulation than the non-mixture models (red bars).
Figure 6. Turtle trees. (a) The three-, four-, five- and six-class model trees (Turtle-Archosaur). (b) The one- and two-class model trees (Turtle-Crocodile). (*) Bootstrap supports of this node under the three-, four-, five- and six-class trees are 62, 97, 97 and 100, respectively. (**) Bootstrap supports of this node under the one- and two-class trees are 72 and 55, respectively. Bootstrap supports of other nodes are 100, except node (·).
The number of partitions in which the 2-class model is better than the 1-class model according to the likelihood ratio test (LRT) or the Bayesian Information Criterion (BIC).
Empirical datasets used in the test
MixtureFinder: Estimating DNA mixture models for phylogenetic analyses

March 2024

·

156 Reads

·

1 Citation

In phylogenetic studies, both partitioned models and mixture models are used to account for heterogeneity in molecular evolution among the sites of DNA sequence alignments. Partitioned models require the user to specify the grouping of sites into subsets, and then assume that each subset of sites can be modelled by a single common process. Mixture models do not require users to pre-specify subsets of sites, and instead calculate the likelihood of every site under every model, while co-estimating the model weights. While much research has gone into the optimisation of partitioned models by merging user-specified subsets, there has been less attention paid to the optimisation of mixture models for DNA sequence alignments. In this study, we first ask whether a key assumption of partitioned models – that each user-specified subset can be modelled by a single common process – is supported by the data. Having shown that this is not the case, we then design, implement, test, and apply an algorithm, MixtureFinder, to select the optimum number of classes for a mixture model of Q matrices for the standard models of DNA sequence evolution. We show this algorithm performs well on simulated and empirical datasets and suggest that it may be useful for future empirical studies. MixtureFinder is available in IQ-TREE2, and a tutorial for using MixtureFinder can be found here: http://www.iqtree.org/doc/Complex-Models#mixture-models .


MAST: Phylogenetic Inference with Mixtures Across Sites and Trees

February 2024

·

24 Reads

·

8 Citations

Systematic Biology

Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting, introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call MAST. This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of incomplete lineage sorting in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of four Platyrrhine species for which standard concatenated maximum likelihood and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e. the largest proportion of sites) to the tree also supported by gene tree approaches. 
These results suggest that the MAST model is able to analyse a concatenated alignment using maximum likelihood, while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.
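To make the role of the tree weights concrete, the sketch below evaluates a mixture across trees for one alignment: each site's likelihood is the weighted sum of its likelihoods under the individual trees, and the same terms give the posterior probability that a site belongs to each tree. The per-site likelihoods and weights are made-up inputs; in a real analysis they would come from each tree's own branch lengths and substitution model, and the function name is hypothetical.

```python
import math

def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a weighted mixture of trees.

    site_likelihoods[s][t] is the likelihood of site s under tree t.
    Returns the total log-likelihood and per-site posterior tree probabilities.
    """
    total = 0.0
    posteriors = []
    for per_tree in site_likelihoods:
        joint = [w * lik for w, lik in zip(weights, per_tree)]
        site_lik = sum(joint)  # weighted sum over trees for this site
        total += math.log(site_lik)
        posteriors.append([j / site_lik for j in joint])
    return total, posteriors

# Two trees, two sites: site 0 fits tree 0 better, site 1 fits tree 1 better.
weights = [0.7, 0.3]
site_likelihoods = [[0.02, 0.001],
                    [0.001, 0.03]]
total, posteriors = mixture_log_likelihood(site_likelihoods, weights)
print(total, posteriors)
```

This toy version multiplies raw likelihoods for readability; with realistic alignments the per-site values are tiny, so a practical implementation would work with log-likelihoods throughout.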


Figure 1. The comparison of different implementations on the 8000 (A) and 64 000 (B) sequence subsets of COVID19, LSU_NR99, and SSU_NR99 datasets. BIONJ and Quicktree do not support multithreading.
DecentTree: Scalable Neighbour-Joining for the Genomic Era

August 2023

·

65 Reads

·

8 Citations

Bioinformatics

Motivation: Neighbour-Joining is one of the most widely used distance-based phylogenetic inference methods. However, current implementations do not scale well for datasets with more than 10,000 sequences. Given the increasing pace of generating new sequence data, particularly in outbreaks of emerging diseases, and the already enormous existing databases of sequence data for which NJ is a useful approach, new implementations of existing methods are warranted. Results: Here we present DecentTree, which provides highly optimised and parallel implementations of Neighbour-Joining and several of its variants. DecentTree is designed as a stand-alone application and a header-only library easily integrated with other phylogenetic software (e.g., it is integral in the popular IQ-TREE software). We show that DecentTree achieves similar or improved performance over existing software (BIONJ, Quicktree, FastME, and RapidNJ), especially for handling very large alignments. For example, DecentTree is up to 6-fold faster than the fastest existing Neighbour-Joining software (e.g., RapidNJ) when generating a tree of 64,000 SARS-CoV-2 genomes. Availability: DecentTree is open source and freely available at https://github.com/iqtree/decenttree. All code and data used in this analysis are available on GitHub (https://github.com/asdcid/Comparison-of-neighbour-joining-software). Supplementary information: Supplementary data are available at Bioinformatics online.


Online Phylogenetics with matOptimize Produces Equivalent Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Implementations

May 2023

·

51 Reads

·

13 Citations

Systematic Biology

Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 datasets do not fit this mold. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger datasets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. 
MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established maximum likelihood implementations for large SARS-CoV-2 phylogenies and could be successfully applied to other similar datasets with particularly dense sampling and short branch lengths.


Testing for Phylogenetic Signal in Single-Cell RNA-Seq Data

December 2022

·

53 Reads

·

6 Citations

Journal of Computational Biology: a Journal of Computational Molecular Cell Biology

Phylogenetic methods are emerging as a useful tool to understand cancer evolutionary dynamics, including tumor structure, heterogeneity, and progression. Most currently used approaches utilize either bulk whole genome sequencing or single-cell DNA sequencing and are based on calling copy number alterations and single nucleotide variants (SNVs). Single-cell RNA sequencing (scRNA-seq) is commonly applied to explore differential gene expression of cancer cells throughout tumor progression. The method exacerbates the single-cell sequencing problem of low yield per cell with uneven expression levels. This accounts for low and uneven sequencing coverage and makes SNV detection and phylogenetic analysis challenging. In this article, we demonstrate for the first time that scRNA-seq data contain sufficient evolutionary signal and can also be utilized in phylogenetic analyses. We explore and compare results of such analyses based on both expression levels and SNVs called from scRNA-seq data. Both techniques are shown to be useful for reconstructing phylogenetic relationships between cells, reflecting the clonal composition of a tumor. Both standardized expression values and SNVs appear to be equally capable of reconstructing a similar pattern of phylogenetic relationship. This pattern is stable even when phylogenetic uncertainty is taken into account. Our results open up a new direction of somatic phylogenetics based on scRNA-seq data. Further research is required to refine and improve these approaches to capture the full picture of somatic evolutionary dynamics in cancer.


Citations (69)


... Gene concordance factors (gCF) and site concordance factors (sCF) were computed on the species trees of both the 75p and 50p UCE datasets using IQ-TREE [35? ]. By employing a quartet replicate approach, gCF indicates the percentage of gene trees that contain a particular clade on the species tree, whereas sCF indicates the percentage of sites supporting a particular clade on the species tree [36]. Concordance factors below 33% are generally considered uninformative, and support of 50% or greater is considered strong support for a node [36]. ...

Reference:

Quills of Confusion! Genomic insight into the evolution and morphology of sea pens (Octocorallia: Scleralcyonacea: Pennatuloidea)
The Meaning and Measure of Concordance Factors in Phylogenomics

Molecular Biology and Evolution

... For protein data, in addition to profile mixtures (Wang et al. 2008) and structure-based mixture models available in version 2, IQ-TREE 3 allows users to estimate amino-acid exchangeability matrices under complex profile mixture models, the so-called GTRpmix model (Baños et al. 2024). Notably, IQ-TREE 3 can also relax the assumption of a single tree common to many phylogenetic analyses by using the Mixtures Across Sites and Trees (MAST) model, which generalizes the GHOST (General Heterogeneous evolution On a Single Topology) mixture model of branch lengths (Crotty et al. 2020). ...

GTRpmix: A Linked General Time-Reversible Model for Profile Mixture Models
  • Citing Article
  • August 2024

Molecular Biology and Evolution

... The resulting vcf file was then filtered in TASSEL v.5.0 (Bradbury et al. 2007), keeping sites represented in at least 22/44 accessions. We applied a best-fit mixture model with MixtureFinder (Ren et al. 2024) in IQTree2, with up to 10 mixture-rate classes, chosen via the Bayesian Information Criterion. Alignments for all datasets can be found at Zenodo. ...

MixtureFinder: Estimating DNA mixture models for phylogenetic analyses

... Sources of confusion whose impact we have yet to explore are those of branch-compositional heterogeneity and of loci with distinct evolutionary histories. In recent years there has, however, been a proliferation of models developed to model branch-composition heterogeneity [e.g., NDCH2 (43); GHOST (44)], site-branch-composition heterogeneity [GFmix (45)] and multi-tree histories [MAST (46)], which if combined with our jackknifing and site-by-site focused approach might shed further light on the deuterostome monophyly problem. ...

MAST: Phylogenetic Inference with Mixtures Across Sites and Trees
  • Citing Article
  • February 2024

Systematic Biology

... TINNiK's next step is to use a quartet-based intertaxon distance formula [6] to convert B- and T-quartet information to a distance approximately fitting the topological tree of blobs. Then an inferred tree of blobs can be obtained by any of a number of well-known tree-building algorithms such as Neighbor-Joining [30], DescentTree [31], or FastME [32]. If the quality of the input data is unknown, or its fit to the NMSC model is doubted, we recommend the use of the Neighbor-Net algorithm [33] to confirm the distance reflects a strong tree signal before tree building. ...

DecentTree: Scalable Neighbour-Joining for the Genomic Era

Bioinformatics

... Such an exact reformulation is always possible, but should yield especially large efficiencies when mutations are sparse. Outbreak sequences often differ from their closest related sequence at as few as 0 to 2 sites 23,29,30 , and so clearly exhibit such sparsity. ...

Online Phylogenetics with matOptimize Produces Equivalent Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Implementations

Systematic Biology

... (iii) the branching process, it was difficult to interpret the history and divergence of groupings of the cancer cell subclones as evolutionary events. One approach to address these deficiencies used phylogenies inferred from single nucleotide variants seen in the scRNA-Seq data 18 , but encountered difficulties due to the low coverage and high drop-out rates seen in scRNA-Seq data 19 . As a consequence, few cells were studied, cancer cells were difficult to distinguish from normal cells, and distinct clades of cancer cells were hard to identify. ...

Testing for Phylogenetic Signal in Single-Cell RNA-Seq Data

Journal of Computational Biology: a Journal of Computational Molecular Cell Biology

... Also, for the three alignments, we inferred coalescent-based phylogenies using the coalescent gene tree-species tree reconciliation program ASTRAL v.5.7.8 (Zhang et al. 2018). For the UCEs + Sanger loci alignment, we also inferred maximum likelihood phylogenies that incorporated support value tests of gene and site concordance factors (gCF and sCF, respectively; Minh et al. 2020a, Mo et al. 2023). ...

Updated site concordance factors minimize effects of homoplasy and taxon sampling

Bioinformatics

... Positive values indicated a stronger fit with the first evolutionary history, and negative values, the second history. We verified the robustness of these assignments using an alternative method implemented in IQ-TREE, which is conceptually related to a phylogenetic HMM but instead computes the likelihood as a mixture model across prespecified trees (31). ...

MAST: Phylogenetic Inference with Mixtures Across Sites and Trees

... Here, we used next-generation sequencing to uncover the diversity and genetic evolution of SARS-CoV-2 through analysis of iSNVs and recombination events related to vaccination status in a cohort within the Kenyan population. Globally, recombination events have been reported in areas with high genomic surveillance, such as the UK, USA, and Denmark (17,18,30,52,53), and it is estimated that 5% of circulating US and UK SARS-CoV-2 viruses are recombinant (16,17). Specifically, the genomic surveillance of SARS-CoV-2 and the tracking of both intervariant and intrahost recombination events are proven crucial in obtaining a better picture of the virus' genetic evolution that may be driven by multiple variant infection, immune pressure, and vaccine efficacy (54). ...

Pandemic-Scale Phylogenomics Reveals The SARS-CoV-2 Recombination Landscape

Nature