David L. Swofford’s research while affiliated with Florida Museum of Natural History and other places


Publications (76)


Inference of Phylogenetic Networks From Sequence Data Using Composite Likelihood
  • Article

October 2024 · 29 Reads · 5 Citations · Systematic Biology · David L Swofford · Laura S Kubatko

While phylogenies have been essential in understanding how species evolve, they do not adequately describe some evolutionary processes. For instance, hybridization, a common phenomenon where interbreeding between two species leads to formation of a new species, must be depicted by a phylogenetic network, a structure that modifies a phylogenetic tree by allowing two branches to merge into one, resulting in reticulation. However, existing methods for estimating networks become computationally expensive as the dataset size and/or topological complexity increase. The lack of methods for scalable inference hampers phylogenetic networks from being widely used in practice, despite accumulating evidence that hybridization occurs frequently in nature. Here, we propose a novel method, PhyNEST (Phylogenetic Network Estimation using SiTe patterns), that estimates binary, level-1 phylogenetic networks with a fixed, user-specified number of reticulations directly from sequence data. By using the composite likelihood as the basis for inference, PhyNEST is able to use the full genomic data in a computationally tractable manner, eliminating the need to summarize the data as a set of gene trees prior to network estimation. To search network space, PhyNEST implements both hill climbing and simulated annealing algorithms. PhyNEST assumes that the data are composed of coalescent independent sites that evolve according to the Jukes-Cantor substitution model and that the network has a constant effective population size. Simulation studies demonstrate that PhyNEST is often more accurate than two existing composite likelihood summary methods (SNaQ and PhyloNet) and that it is robust to at least one form of model misspecification (assuming a less complex nucleotide substitution model than the true generating model). 
We applied PhyNEST to reconstruct the evolutionary relationships among Heliconius butterflies and Papionini primates, characterized by hybrid speciation and widespread introgression, respectively. PhyNEST is implemented in an open-source Julia package and is publicly available at https://github.com/sungsik-kong/PhyNEST.jl.
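The abstract states that PhyNEST assumes sites evolve under the Jukes-Cantor substitution model. As a rough illustration of what that model implies, the sketch below computes the standard Jukes-Cantor transition probabilities for a branch of length t, measured in expected substitutions per site. It is not taken from PhyNEST; the function names are hypothetical and chosen only for this example.

```python
import math


def jc69_probability(t: float, same: bool) -> float:
    """Jukes-Cantor probability of ending in the same nucleotide (same=True)
    or in one specific different nucleotide (same=False) after a branch of
    length t, in expected substitutions per site."""
    if t < 0:
        raise ValueError("branch length must be non-negative")
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def jc69_matrix(t: float) -> list[list[float]]:
    """4 x 4 transition matrix over the nucleotides (A, C, G, T)."""
    return [[jc69_probability(t, i == j) for j in range(4)] for i in range(4)]


if __name__ == "__main__":
    # Hypothetical branch length, used only to show the call.
    for row in jc69_matrix(0.1):
        print(["%.4f" % p for p in row])
```

Each row of the matrix sums to 1, the matrix is the identity at t = 0, and every entry approaches 0.25 as t grows, which is why the model has no free parameters beyond branch length.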



Comparative analysis of the primate X-inactivation center region and reconstruction of the ancestral primate XIST locus
  • Article
  • Full-text available

March 2023 · 457 Reads · 1 Citation · Christina B Sheedy · [...] · Huntington F Willard

Here we provide a detailed comparative analysis across the candidate X-Inactivation Center (XIC) region and the XIST locus in the genomes of six primates and three mammalian outgroup species. Since lemurs and other strepsirrhine primates represent the sister lineage to all other primates, this analysis focuses on lemurs to reconstruct the ancestral primate sequences and to gain insight into the evolution of this region and the genes within it. This comparative evolutionary genomics approach reveals significant expansion in genomic size across the XIC region in higher primates, with minimal size alterations across the XIST locus itself. Reconstructed primate ancestral XIC sequences show that the most dramatic changes during the past 80 million years occurred between the ancestral primate and the lineage leading to Old World monkeys. In contrast, the XIST locus compared between human and the primate ancestor does not indicate any dramatic changes to exons or XIST-specific repeats; rather, evolution of this locus reflects small incremental changes in overall sequence identity and short repeat insertions. While this comparative analysis reinforces that the region around XIST has been subject to significant genomic change, even among primates, our data suggest that evolution of the XIST sequences themselves represents only small lineage-specific changes across the past 80 million years.



Figure 3: (a) Each box plot summarizes the number of steps taken at the point of termination of the 20 independent runs for each combination of α and c during network search using the simulated annealing algorithm. We selected α, c ∈ {0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99}, hence 121 combinations were evaluated here. The left y-axis represents the number of steps, and ranges from 0 to 1000 (threshold), the x-axis represents tested values of α, and the value in the gray boxes on the right y-axis represents c. The red solid horizontal line in each pane indicates 500 steps. (b) Line graphs showing the computed F score using Equation (17) for the 121 combinations of α and c evaluated. The y-axis on the left represents the scores, the x-axis represents tested values of α, and the values in the gray boxes on the y-axis on the right represent tested values of c. The horizontal red line for each pane indicates the score of 0.05. The highest score was observed when (α, c)=(0.8,0.9) followed by (0.9,0.9). We selected (α, c)=(0.8,0.9) as the default values for PhyNEST when using the simulated annealing algorithm.
Figure 11: Phylogenetic networks of seven species in the Papionini group (plus an outgroup Callithrix jacchus) inferred with PhyNEST from the sequence alignment of 1,761,114 base pairs published in Vanderpool et al. (2020) using (a) h_max = 1 and (b) h_max = 2. The value beside each reticulation edge (light and dark blue) represents inheritance probability for the edge with the same color.
Inference of Phylogenetic Networks from Sequence Data using Composite Likelihood

November 2022 · 126 Reads · 6 Citations

While phylogenies have been essential in understanding how species evolve, they do not adequately describe some evolutionary processes. For instance, hybridization, a common phenomenon where interbreeding between two species leads to formation of a new species, must be depicted by a phylogenetic network, a structure that modifies a phylogeny by allowing two branches to merge into one, resulting in reticulation. However, existing methods for estimating networks are computationally expensive as the dataset size and/or topological complexity increase. The lack of methods for scalable inference hampers phylogenetic networks from being widely used in practice, despite accumulating evidence that hybridization occurs frequently in nature. Here, we propose a novel method, PhyNEST (Phylogenetic Network Estimation using SiTe patterns), that estimates phylogenetic networks directly from sequence data. PhyNEST achieves computational efficiency by using composite likelihood as well as accuracy by using the full genomic data to incorporate all sources of variability, rather than first summarizing the data by estimating a set of gene trees, as is required by most of the existing methods. To efficiently search network space, we implement both hill-climbing and simulated annealing algorithms. Simulation studies show that PhyNEST can accurately estimate parameters given the true network topology and that it has comparable accuracy to two popular methods that use composite likelihood and a set of gene trees as input, implemented in SNaQ and PhyloNet. For datasets with a large number of loci, PhyNEST is more efficient than SNaQ and PhyloNet when considering the time required for gene tree estimation. We applied PhyNEST to reconstruct the evolutionary relationships among Heliconius butterflies and Papionini primates, characterized by hybrid speciation and widespread introgression, respectively. 
PhyNEST is implemented in an open-source Julia package and publicly available at https://github.com/sungsik-kong/PhyNEST.jl


Estimation of Speciation Times Under the Multispecies Coalescent

October 2022 · 15 Reads · 13 Citations · Bioinformatics

Motivation: The multispecies coalescent model is now widely accepted as an effective model for incorporating variation in the evolutionary histories of individual genes into methods for phylogenetic inference from genome-scale data. However, because model-based analysis under the coalescent can be computationally expensive for large data sets, a variety of inferential frameworks and corresponding algorithms have been proposed for estimation of species-level phylogenies and associated parameters, including speciation times and effective population sizes. Results: We consider the problem of estimating the timing of speciation events along a phylogeny in a coalescent framework. We propose a maximum a posteriori estimator based on composite likelihood (MAPCL) for inferring these speciation times under a model of DNA sequence evolution for which exact site pattern probabilities can be computed under the assumption of a constant θ throughout the species tree. We demonstrate that the MAPCL estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. We also provide a more computationally efficient estimator of the asymptotic variance based on the nonparametric bootstrap. We evaluate the performance of our method using simulation and by application to an empirical dataset for gibbons. Availability and implementation: The method has been implemented in the PAUP* program, freely available at https://paup.phylosolutions.com for Macintosh, Windows, and Linux operating systems. Supplementary information: Supplementary data are available at Bioinformatics online.
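The abstract mentions a variance estimator based on the nonparametric bootstrap. The sketch below shows the general idea only: resample alignment sites with replacement, recompute an estimate on each resample, and take the sample variance of those estimates. It does not reproduce the MAPCL estimator or anything in PAUP*; `proportion_variable` is a hypothetical stand-in statistic, and all names and defaults are assumptions made for illustration.

```python
import random
import statistics
from typing import Callable, Sequence


def bootstrap_variance(
    sites: Sequence[str],
    estimator: Callable[[Sequence[str]], float],
    n_replicates: int = 200,
    seed: int = 0,
) -> float:
    """Nonparametric bootstrap estimate of the variance of `estimator`.

    `sites` holds one string per alignment column (one character per taxon).
    """
    rng = random.Random(seed)
    n = len(sites)
    estimates = []
    for _ in range(n_replicates):
        # Draw n sites with replacement from the observed sites.
        resample = [sites[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.variance(estimates)


def proportion_variable(sites: Sequence[str]) -> float:
    """Stand-in statistic (hypothetical): fraction of sites that are variable."""
    return sum(1 for s in sites if len(set(s)) > 1) / len(sites)


if __name__ == "__main__":
    # Toy data: 50 constant columns and 50 variable columns for four taxa.
    toy_sites = ["AAAA"] * 50 + ["AACA"] * 50
    print(bootstrap_variance(toy_sites, proportion_variable))
```

In a real analysis the stand-in statistic would be replaced by the estimator of interest, which is usually the expensive step, so the number of replicates trades precision against run time.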


FIGURE 1. Relative performance gains (fold-speedup) for a challenging highly partitioned nucleotide-model analysis using various combinations of implementations and versions of the BEAGLE library, and hardware resources, with BEAST and with MrBayes. We report fold-speedup on the log-scale, relative to the total run time when using the native double-precision likelihood calculator on the slowest system (denoted with an asterisk) for each program.
FIGURE 5. Absolute (throughput in billions of partial likelihood calculations per second) and relative (fold-speedup relative to the slowest performance observed at any number of unique site patterns) performance scaling with problem size for implementations of the BEAGLE library Version 3.1.2 and the Phylogenetics Likelihood library Version 2 on nodes of the Comet Supercomputer available via CIPRES. The data are simulated nucleotide sequences for a tree of 128 OTUs.
BEAGLE 3: Improved Performance, Scaling, and Usability for a High-Performance Computing Library for Statistical Phylogenetics

November 2019 · 606 Reads · 194 Citations · Systematic Biology

BEAGLE is a high-performance likelihood-calculation library for phylogenetic inference. The BEAGLE library defines a simple, but flexible, application programming interface (API), and includes a collection of efficient implementations for calculation under a variety of evolutionary models on different hardware devices. The library has been integrated into recent versions of popular phylogenetics software packages including BEAST and MrBayes and has been widely used across a diverse range of evolutionary studies. Here, we present BEAGLE 3 with new parallel implementations, increased performance for challenging data sets, improved scalability, and better usability. We have added new OpenCL and central processing unit-threaded implementations to the library, allowing the effective utilization of a wider range of modern hardware. Further, we have extended the API and library to support concurrent computation of independent partial likelihood arrays, for increased performance of nucleotide-model analyses with greater flexibility of data partitioning. For better scalability and usability, we have improved how phylogenetic software packages use BEAGLE in multi-GPU (graphics processing unit) and cluster environments, and introduced an automated method to select the fastest device given the data set, evolutionary model, and hardware. For application developers who wish to integrate the library, we also have developed an online tutorial. To evaluate the effect of the improvements, we ran a variety of benchmarks on state-of-the-art hardware. For a partitioned exemplar analysis, we observe run-time performance improvements as high as 5.9-fold over our previous GPU implementation. BEAGLE 3 is free, open-source software licensed under the Lesser GPL and available at https://beagle-dev.github.io.
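The benchmark results in this entry are reported as fold-speedup relative to a baseline run, plotted on a log scale. A minimal sketch of that arithmetic follows; the timings are made-up placeholders rather than values from the paper, and the function name is hypothetical.

```python
import math


def fold_speedup(baseline_seconds: float, seconds: float) -> float:
    """Fold-speedup of a run relative to a baseline run (baseline / run time)."""
    if baseline_seconds <= 0 or seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / seconds


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only.
    baseline = 590.0
    for label, run_time in [("configuration A", 295.0), ("configuration B", 100.0)]:
        s = fold_speedup(baseline, run_time)
        print(f"{label}: {s:.1f}-fold (log10 = {math.log10(s):.2f})")
```

On a log scale, equal vertical distances correspond to equal ratios, so a 2-fold and a 4-fold gain are separated by the same distance as a 4-fold and an 8-fold gain.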


Figure 1: The 5-leaf species tree (left) can be split into 5 different 4-leaf subtrees (right), shown with speciation times marked.
Figure 3: The species tree for 5 gibbon species and 1 outgroup (human)
Figure 4: Histogram of the 100 pseudolikelihood estimates for branch length τ_1 using 100,000 unlinked CIS from the 5-leaf model trees with a single lineage per tip. The red line in each histogram is the sample mean of the 100 pseudolikelihood estimates. The true values are given in the figure titles.
Figure 5: Plots of the 100 variance estimates for branch length τ_1 using 100,000 unlinked CIS from the 5-leaf model trees with a single lineage per tip. Points denoted by * are computed by the asymptotic variance formula in (12), while points denoted by o are obtained by bootstrapping. The x-axis is an index for the simulated samples. The red line in the plots is the sample variance of the 1000 pseudolikelihood estimates.
Estimation of Speciation Times Under the Multispecies Coalescent

June 2019 · 104 Reads · 2 Citations

Motivation: The coalescent model is now widely accepted as a necessary component for phylogenetic inference from genome-scale data. However, because model-based analysis under the coalescent is computationally prohibitive, a variety of inferential frameworks and corresponding algorithms have been proposed for estimation of species-level phylogenies and the associated parameters, including the speciation times and effective population sizes. Results: We consider the problem of estimating the timing of speciation events along a phylogeny in a coalescent framework. We propose a pseudolikelihood method for estimation of these speciation times under a model of DNA sequence evolution for which exact site pattern probabilities can be computed. We demonstrate that the pseudolikelihood estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. We also provide a more computationally efficient estimator of the asymptotic variance based on the nonparametric bootstrap. We evaluate the performance of our method using simulation and by application to an empirical dataset on gibbons.
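The Figure 1 caption for this entry notes that a 5-leaf species tree can be split into 5 different 4-leaf subtrees. That count is simply the number of ways to choose 4 leaves out of 5, which the short sketch below enumerates. The leaf labels are placeholders and the function name is hypothetical; how the per-subtree quantities are then combined into a pseudolikelihood is described in the paper and is not reproduced here.

```python
from itertools import combinations
from typing import Sequence


def four_leaf_subsets(leaves: Sequence[str]) -> list[tuple[str, ...]]:
    """Return every 4-leaf subset of the given leaf labels."""
    return list(combinations(leaves, 4))


if __name__ == "__main__":
    leaves = ["A", "B", "C", "D", "E"]  # placeholder labels
    for quartet in four_leaf_subsets(leaves):
        print(quartet)
```

The number of subsets grows as n choose 4, so 6 leaves already give 15 four-leaf subsets.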




Citations (51)


... Our upper-bound results suggest that the networks generated by tools like SNAQ [22], PhyNEST [13], and Squirrel [9] with a constant network level can be efficiently decomposed into trees with small treewidth. As the level of generated networks grows [18], we may still expect the treewidth to stay low and treewidth-parametrized algorithms to be efficient on higher-level networks. ...

Reference:

Bounds on the Treewidth of Level-k Rooted Phylogenetic Networks
Inference of Phylogenetic Networks From Sequence Data Using Composite Likelihood
  • Citing Article
  • October 2024

Systematic Biology

... Younger, species-specific repeats (e.g. Alu in primates and SINEs B1 and B2 in rodents) inserted in Xist more recently and contributed to the species-specific evolution of the gene [23,54]. ...

Comparative analysis of the primate X-inactivation center region and reconstruction of the ancestral primate XIST locus

... These applications ask about the underlying species tree topology, as well as deviations from this history due to introgression. The most widely used method employing site-based quartets to infer a species tree is SVDquartets (Chifman and Kubatko 2014;Swofford and Kubatko 2023). SVDquartets considers all quartet site patterns-i.e. ...

Species Tree Estimation Using Site Pattern Frequencies
  • Citing Chapter
  • March 2023

... Phylogenetic networks can thus be used to model processes such as hybridization, horizontal gene transfer, gene duplication and loss, and recombination (Linder & Rieseberg, 2004;Nakhleh, 2010). Despite the development of innovative inference algorithms (e.g., Kong et al., 2024;Solís-Lemus & Ané, 2016;Than et al., 2008), estimating a phylogenetic network is a challenging task because the methods currently available often scale poorly and are presently limited to the analysis of relatively small data sets (Hejase & Liu, 2016). An alternative approach is to employ methods that detect hybrids among a large number of genomic sequences, rather than attempting to estimate the phylogenetic network directly. ...

Inference of Phylogenetic Networks from Sequence Data using Composite Likelihood

... [45]. Besides, the BEAGLE library [46] was used to expedite the computational process of our phylodynamic models. We used the substitution model selected by IQ-tree and assessed commonly used three parametric and one non-parametric node-age priors (i.e., tree models). ...

BEAGLE 3: Improved Performance, Scaling, and Usability for a High-Performance Computing Library for Statistical Phylogenetics

Systematic Biology

... For example, when the multispecies coalescent is used as the basis for inference of a phylogenetic network from multilocus DNA sequence data, an inheritance parameter γ e is used to represent the proportional contribution of the parental lineage represented by arc e to the lineage that descends from the reticulation vertex. As an alternative to use of algorithms for constrained optimization, transformations of parameters may be applied to allow application of faster, better-performing numerical algorithms for unconstrained parameter estimation (see, e.g., the approach of Peng et al. (2021)). ...

Estimation of Speciation Times Under the Multispecies Coalescent

... Relative likelihoods support multiple gains of xyloglucanase genes: in the ancestor of Agaricomycetes, in the ancestor of Pezizomycotina, and in various lineages in Chytridiomycota, implying the c. 750 Ma LCA of Dikarya and Chytridiomycota lacked xyloglucanases. Nodes corresponding to the common ancestor of Agaricomycetes and the common ancestor of Pezizomycotina are dated at c. 310 and 330 Ma, respectively (Chang et al., 2021), or 299 and 489 Ma (Lutzoni et al., 2018). Within the chytrids, the earliest inferred presence of xyloglucanases is in the common ancestor of Cladochytriales, corresponding to c. 410 Ma (Amses et al., 2022). ...

Contemporaneous radiations of fungi and plants linked to symbiosis

... Using this method, we infer that the genus diverged from its sister lineage, the genus Mirza, about 2. Table 7). Such a temporal framework (< 2 Ma ago) is supported by other MSC studies 36,37,58,59 and suggests that the diversification of the genus Microcebus fits a model of allopatric speciation in response to climatic fluctuations (that is, glacial-interglacial cycles). This interpretation agrees with studies that have posited that closed-canopy ecosystems converted to open vegetation during the Pleistocene in different areas of the island 60,61 , forcing lineages to track forest habitats that shifted in elevation or to retreat to humid refugia 62,63 . ...

Geogenetic patterns in mouse lemurs (genus Microcebus ) reveal the ghosts of Madagascar's forests past

Proceedings of the National Academy of Sciences

... Taxon sampling has also been shown to affect the estimation of the rate heterogeneity parameters of maximum likelihood models (Sullivan et al., 1999 ... We would like to point out that different taxon samplings lead to different results and data interpretations. The best solution is to include more data from Juncaceae in our analyses, for example, the best 474 species known up to now; however, this goal is unrealistic at this moment, as many species have not yet been collected. ...

The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models
  • Citing Article
  • October 1999

Molecular Biology and Evolution