... These parameters are difficult to estimate in meiofauna where baseline knowledge is poor compared with well-studied groups. UML methods were not specifically developed for the problem of species delimitation but can reveal underlying structures in the data that could correspond to species [25,26]. ...
... Data in matrix M4 were clustered with principal component analysis (PCA), random forest (RF) and t-distributed stochastic neighbour embedding (tSNE) [25,72–81]. Variational autoencoder (VAE) analysis [82] was run using matrices M3 and M5-M9. ...
... RF+Gap Statistic found L. maerski Pyrenees Snap1 in its own cluster and RF+Hierarchical Clustering found nine clusters. Hierarchical Clustering underperformed in previous analyses and recovered more clusters than other methods [25]. VAE was congruent across runs and supported two species corresponding to Northern Hemisphere and Crozet. ...
Article
Full-text available
The latest animal phylum to be discovered, Micrognathozoa, constitutes a rare group of limnic meiofauna. These microscopic ‘jaw animals’ are among the smallest metazoans yet possess highly complex jaw structures. The single species of Micrognathozoa, Limnognathia maerski Kristensen and Funch, 2000, was first described from Greenland, later reported from a remote Subantarctic island and more recently discovered in the Pyrenees on the European continent. Successful collections of these three known populations facilitated investigations of the intraphylum relationships and species limits within Limnognathia for the first time. Through detailed anatomical comparisons, we substantiate the lack of morphological differences between the three geographically disjunct populations. With transcriptomic data from single specimens, we conducted the first intraphylum phylogenetic analyses and extensively tested species hypotheses using standard approaches and novel machine learning methods. Analyses clearly delimited the Subantarctic population, here described as Limnognathia desmeti sp. nov., the second species of Micrognathozoa, but did not definitively split the Greenland and Pyrenees populations as separate species. Divergence dating analysis suggests the disjunct distribution of Micrognathozoa is not human mediated but the result of long-distance dispersal raising questions about their dispersal capabilities and potential undiscovered populations.
... This greatly diminishes the utility of morphological and ecological species delimitation criteria in this group. Additionally, in Opiliones it can be difficult or impossible to directly test for reproductive isolation because sympatry between congeners is rare; they tend to show "nested" allopatry, where closely related species, and closely related populations within those species, are completely allopatric in distribution (Derkarabetian et al., 2011, 2019a). Given this combination of characteristics, it is unsurprising that many Opiliones are short-range endemics (Harvey, 2002), known from one or a small number of sites (e.g., Emata and Hedin, 2016). ...
... Given this combination of characteristics, it is unsurprising that many Opiliones are short-range endemics (Harvey, 2002), known from one or a small number of sites (e.g., Emata and Hedin, 2016). Unfortunately, the low gene flow present in these taxa can make species delimitation based entirely on genetic data difficult, as analyses may not be able to differentiate between high population structure within species and species-level divergences, resulting in the overestimation of species counts (Derkarabetian et al., 2019a; Fernández and Giribet, 2014). Recently, the use of unsupervised (Derkarabetian et al., 2019a) and supervised (Derkarabetian et al., 2022b) machine learning for species delimitation has shown promise as a robust approach to delimiting cryptic species in these challenging taxa. ...
... Unfortunately, the low gene flow present in these taxa can make species delimitation based entirely on genetic data difficult, as analyses may not be able to differentiate between high population structure within species and species-level divergences, resulting in the overestimation of species counts (Derkarabetian et al., 2019a; Fernández and Giribet, 2014). Recently, the use of unsupervised (Derkarabetian et al., 2019a) and supervised (Derkarabetian et al., 2022b) machine learning for species delimitation has shown promise as a robust approach to delimiting cryptic species in these challenging taxa. One such harvester taxon for which species delimitation has been problematic is the Aoraki denticulata species complex. ...
... First, we used two standard genetic clustering techniques: discriminant analysis of principal components (DAPC; Jombart et al. 2010) and STRUCTURE (Pritchard et al. 2000). Second, we used an unsupervised machine learning (UML) approach: t-distributed stochastic neighbor embedding (t-SNE; van der Maaten and Hinton 2008, Derkarabetian et al. 2019). Finally, we used the Bayes factor delimitation approach (BFD*; Leaché et al. 2014) with SNAPP in BEAST 2.0 to test multiple species hypotheses. ...
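As a rough illustration of how competing species hypotheses can be compared once a marginal likelihood has been estimated for each, the sketch below computes the standard 2 ln(Bayes factor) statistic against a null model and ranks the hypotheses. The function names and the log marginal likelihood values are hypothetical and are not taken from any of the studies quoted here; estimating the marginal likelihoods themselves (for example by path sampling) is outside the scope of the sketch.

```python
def two_ln_bf(lnml_null, lnml_alt):
    """2 ln(Bayes factor) for the alternative model relative to the null."""
    return 2.0 * (lnml_alt - lnml_null)


def rank_hypotheses(lnml, null):
    """Rank models by 2 ln BF against `null`, best-supported first."""
    scores = {name: two_ln_bf(lnml[null], value) for name, value in lnml.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log marginal likelihoods (illustrative values only).
lnml = {
    "one_species": -5210.4,
    "two_species": -5172.9,
    "three_species": -5175.1,
}
ranking = rank_hypotheses(lnml, null="one_species")
for name, score in ranking:
    print(f"{name}: 2lnBF = {score:.1f}")
```

Under the commonly used Kass and Raftery scale, values of 2 ln BF above 10 are read as decisive support for the alternative over the null.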
... t-SNE is a non-linear dimensionality reduction algorithm that aims to preserve probability distributions of distances among samples in a cluster while also repelling samples that are in a different cluster (Derkarabetian et al. 2019). t-SNE was executed using the R package tsne (Donaldson 2016), using the results of the initial PCA as input, following recommendations for large datasets (Pedregosa et al. 2011). ...
... Two sets of clustering analyses were conducted on the t-SNE outputs: (1) partitioning around medoids (PAM) clustering with optimal K determined via the gap statistic using k-means clustering and the 'factoextra' package (Kassambara and Mundt 2017); and (2) optimal K and clustering determined via hierarchical clustering with the 'mclust' R package (Scrucca et al. 2017) using the broken stick method of the 'PCDimension' package (Coombes and Wang 2018). The t-SNE and clustering analyses were run using modified code from Derkarabetian et al. (2019) ...
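To make the PAM step concrete, here is a minimal pure-Python sketch of partitioning around medoids on a handful of two-dimensional points standing in for t-SNE coordinates. It assumes K is already fixed (the gap-statistic selection of K is not shown), uses a greedy BUILD phase followed by a first-improvement SWAP phase, and is not the R code used in the quoted study; the function names and the toy coordinates are made up for illustration.

```python
import math
from itertools import product


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Partition around medoids: greedy BUILD, then SWAP until no improvement."""
    medoids = []
    # BUILD: repeatedly add the point that lowers the total cost the most.
    while len(medoids) < k:
        candidates = [i for i in range(len(points)) if i not in medoids]
        best = min(candidates, key=lambda i: total_cost(points, medoids + [i]))
        medoids.append(best)
    # SWAP: try replacing each medoid with each non-medoid point.
    improved = True
    while improved:
        improved = False
        current = total_cost(points, medoids)
        for pos, cand in product(range(k), range(len(points))):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            cost = total_cost(points, trial)
            if cost < current - 1e-12:
                medoids, current, improved = trial, cost, True
    # Assign every point to the index of its nearest medoid.
    labels = [
        min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels


# Toy 2-D embedding with two well-separated groups (hypothetical values).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
medoids, labels = pam(points, 2)
print(medoids, labels)
```

Because medoids are actual data points, each cluster is summarised by a real sample rather than an averaged centroid.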
Article
Full-text available
Molecular phylogenetics has fundamentally altered our understanding of the taxonomy, systematics and biogeography of corals. Recently developed phylogenomic techniques have started to resolve species-level relationships in the diverse and ecologically important genus Acropora, providing a path to resolve the taxonomy of this notoriously problematic group. We used a targeted capture dataset (2032 loci) to investigate systematic relationships within an Acropora clade containing the putatively widespread species Acropora tenuis and its relatives. Using maximum likelihood phylogenies and genetic clustering of single nucleotide polymorphisms from specimens, including topotypes, collected across the Indo-Pacific, we show ≥ 11 distinct lineages in the clade, only four of which correspond to currently accepted species. Based on molecular, morphological and geographical evidence, we describe two new species; Acropora rongoi n. sp. and Acropora tenuissima n. sp. and remove five additional nominal species from synonymy. Systematic relationships revealed by our molecular phylogeny are incongruent with traditional morphological taxonomy and demonstrate that characters traditionally used to delineate species boundaries and infer evolutionary history are homoplasies. Furthermore, we show that species within this clade have much smaller geographical ranges and, consequently, population sizes than currently thought, a finding with profound implications for conservation and management of reef corals.
... Data for ultraconserved elements were produced following Glenn et al. (2019; BadDNA@UGA) with slight modifications on a few steps for A. isabella (for details see Derkarabetian et al., 2019). Libraries were hybridized at 60°C for 24 h to the Spider probe set (Kulkarni et al., 2020) following the version 4 chemistry protocol (Arbor Biosciences). ...
... All measurements were recorded in millimeters and were quantified with a Leica M165C stereomicroscope using the Leica Application Suite software and a digital camera. Measurements were trans- ...
... data as input and compresses this high-dimensional data through several encoding layers into two-dimensional latent variables, which is subsequently reconstructed by uncompressing the latent variables through several decoding layers (Derkarabetian et al., 2019). ...
... Our VAE analysis with the lower locus completeness dataset (50p) showed obvious separation between all three of the A. icenoglei lineages, whereas our higher locus completeness datasets (75p and 90p) retained only enough signal to maintain the North lineage as a separate cluster but not for Central or South lineages (Figure 4). VAE relies on the inherent structure present in the data (Derkarabetian et al., 2019), and previous studies have shown that VAE analyses have been heavily influenced by the filtering parameters for the SNP datasets (Martin et al., 2021;Newton et al., 2020). Specifically, if a lower threshold for locus completeness is allowed in a dataset the more likely it is to "over-split", whereas more stringent filtering (i.e., a high threshold for locus completeness) can remove potentially important signal and "under-split" the amount of diversity. ...
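The effect of the locus completeness threshold described above can be sketched with a small filter over a genotype matrix. The sketch assumes that a label such as "50p" means a locus is kept when it is genotyped in at least 50% of samples; that reading, the function name and the toy matrix are assumptions made for illustration, not details taken from the quoted study.

```python
def filter_by_completeness(genotypes, min_fraction):
    """Keep locus columns genotyped in at least `min_fraction` of samples.

    `genotypes` is a list of per-sample rows; None marks missing data.
    Returns the retained column indices and the filtered matrix.
    """
    n_samples = len(genotypes)
    n_loci = len(genotypes[0])
    keep = [
        j
        for j in range(n_loci)
        if sum(row[j] is not None for row in genotypes) / n_samples >= min_fraction
    ]
    return keep, [[row[j] for j in keep] for row in genotypes]


# Toy matrix: 4 samples x 4 loci, None = missing genotype (hypothetical data).
genotypes = [
    [0, 1, None, 2],
    [0, None, None, 1],
    [1, 1, None, None],
    [0, 2, 1, None],
]

# Retained locus indices at three thresholds (50p, 75p, 90p under the assumption above).
kept = {t: filter_by_completeness(genotypes, t)[0] for t in (0.50, 0.75, 0.90)}
print(kept)
```

Stricter thresholds retain fewer loci, which is the trade-off between missing data and lost signal that the passage describes.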
Article
Full-text available
Species delimitation is an imperative first step toward understanding Earth's biodiversity, yet what constitutes a species and the relative importance of the various processes by which new species arise continue to be debatable. Species delimitation in spiders has traditionally used morphological characters; however, certain mygalomorph spiders exhibit morphological homogeneity despite long periods of population-level isolation, absence of gene flow, and consequent high degrees of molecular divergence. Studies have shown strong geographic structuring and significant genetic divergence among several species complexes within the trapdoor spider genus Aptostichus, most of which are restricted to the California Floristic Province (CAFP) biodiversity hotspot. Specifically, the Aptostichus icenoglei complex, which comprises the three sibling species, A. barackobamai, A. isabella, and A. icenoglei, exhibits evidence of cryptic mitochondrial DNA diversity throughout their ranges in Northern, Central, and Southern California. Our study aimed to explicitly test species hypotheses within this assemblage by implementing a cohesion species-based approach. We used genomic-scale data (ultraconserved elements, UCEs) to first evaluate genetic exchangeability and then assessed ecological interchangeability of genetic lineages. Biogeographical analysis was used to assess the likelihood of dispersal versus vicariance events that may have influenced speciation pattern and process across the CAFP's complex geologic and topographic landscape. Considering the lack of congruence across data types and analyses, we take a more conservative approach by retaining species boundaries within A. icenoglei.
... The profusion of genomic and natural-history data sets and powerful statistical (Jackson et al. 2017; Leaché et al. 2019) and machine-learning (ML; Derkarabetian et al. 2019; Smith and Carstens 2020) methods make the identification of clusters trivial, even for high-dimensional, nonlinear interactions between multiple traits and loci. From the recognition of these species-level evolutionary entities, one can then construct and test speciation hypotheses (Singhal et al. 2018; Bamberger et al. 2022). ...
... We introduce a new approach for unguided species delimitation using ML that is conceptually distinct from most previous efforts (Derkarabetian et al. 2019; Smith and Carstens 2020). This individual-based method is based on self-organizing (or "Kohonen") maps (Kohonen 1998; Wehrens and Buydens 2007), which find a two-dimensional configuration of multidimensional data that maximizes the similarity between the distance matrix of the input and output features (Brentan et al. 2018). ...
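For readers unfamiliar with self-organizing maps, the following is a minimal sketch of the textbook online Kohonen update: each sample is matched to its best-matching unit on a small grid, and that unit and its grid neighbours are pulled toward the sample with a decaying learning rate and radius. It is not the implementation used in the quoted study; the function names, grid size, decay schedule and toy data are all assumptions chosen for brevity.

```python
import math
import random


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols self-organizing map with the online Kohonen rule."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [(r, c) for r in range(rows) for c in range(cols)]
    # Random initial weight vector for every grid unit.
    weights = {u: [rng.random() for _ in range(dim)] for u in units}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)    # neighbourhood shrinks over time
        for x in rng.sample(data, len(data)):        # visit samples in random order
            bmu = min(units, key=lambda u: math.dist(weights[u], x))
            for u in units:
                grid_d = math.dist(u, bmu)
                if grid_d <= radius:
                    h = math.exp(-(grid_d ** 2) / (2 * radius ** 2))
                    weights[u] = [
                        w + lr * h * (xi - w) for w, xi in zip(weights[u], x)
                    ]
    return weights


def best_unit(weights, x):
    """Grid coordinates of the unit whose weights are closest to sample x."""
    return min(weights, key=lambda u: math.dist(weights[u], x))


# Toy data: four individuals described by three features scaled to [0, 1].
data = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [1.0, 1.0, 0.9], [0.9, 1.0, 1.0]]
som = train_som(data, rows=2, cols=2)
assignments = [best_unit(som, x) for x in data]
print(assignments)
```

Individuals that land on the same or adjacent grid units are candidates for the same cluster; turning unit assignments into species hypotheses requires a further clustering step that is not shown here.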
... Consequently, repeated dimensional reduction of individuals to genotypes and proportional assignment to clusters approximates individual ancestry coefficients (sensu Frichot et al. 2014). At least initially, we anticipate concordance between this nonbiological clustering method and other biological algorithms such as snapclust (Beugin et al. 2018), though further study is needed to interrogate the performance of UML methods (Derkarabetian et al. 2019). ...
Article
Significant advances have been made in species delimitation and numerous methods can test precisely defined models of speciation, though the synthesis of phylogeography and taxonomy is still sometimes incomplete. Emerging consensus treats distinct genealogical clusters in genome-scale data as strong initial evidence of speciation in most cases; a hypothesis that must therefore be falsified under an explicit evolutionary model. We can now test speciation hypotheses linking trait differentiation to specific mechanisms of divergence with increasingly large datasets. Integrative taxonomy can therefore reflect an understanding of how each axis of variation relates to underlying speciation processes, with nomenclature for distinct evolutionary lineages. We illustrate this approach here with Seal Salamanders (Desmognathus monticola) and introduce a new unsupervised machine-learning approach for species delimitation. Plethodontid salamanders are renowned for their morphological conservatism despite extensive phylogeographic divergence. We discover two geographic genetic clusters, for which demographic and spatial models of ecology and gene flow provide robust support for ecogeographic speciation despite limited phenotypic divergence. These data are integrated under evolutionary mechanisms (e.g., spatially localized gene flow with reduced migration) and reflected in emergent properties expected under models of reinforcement (e.g., ethological isolation and selection against hybrids). Their genetic divergence is prima facie evidence for species-level distinctiveness, supported by speciation models and divergence along axes such as behavior, geography, and climate that suggest an ecological basis with subsequent reinforcement through prezygotic isolation. 
As datasets grow more comprehensive, species delimitation models can be tested, rejected, or corroborated as explicit speciation hypotheses, providing for reciprocal illumination of evolutionary processes and integrative taxonomies.
... Often, the first step in diversification studies is assigning samples to operational taxonomic units (OTUs) for subsequent reconstruction of species trees and tests for gene flow. To assess the appropriate taxonomic units, a "species discovery" approach has been proposed (Derkarabetian et al. 2019), which deploys unsupervised machine learning algorithms to delimit species accurately under reasonable conditions (Derkarabetian et al. 2019; Newton et al. 2020; Moles et al. 2021) and avoid the over-splitting attributed to many coalescent-based species delimitation approaches (Sukumaran and Knowles 2017; Bamberger et al. 2022). Instead of a single delimitation scheme, this approach produces a suite of possible OTUs, which reveals uncertainty and serves as the basis for further hypothesis testing. ...
... As input, we sequenced thousands of genome-wide restriction enzyme-associated DNA (RAD) loci from the entire geographic distribution of Scrub-Jays across North America. We implemented a clustering-based species discovery pipeline outlined in Derkarabetian et al. (2019), which compares species delimitation schemes across a handful of analytically unique dimensionality reduction and clustering approaches. We then employed multiple species tree methods to account for discordant histories among gene trees generated by both ILS and gene flow during the speciation process (Degnan 1993; Maddison 1997), a problem exacerbated by recent and rapid speciation (Degnan and Rosenberg 2006; Kubatko and Degnan 2007; McCormack et al. 2009; Giarla and Esselstyn 2015). ...
Article
Full-text available
Complex speciation, involving rapid divergence and multiple bouts of post-divergence gene flow, can obfuscate phylogenetic relationships and species limits. In North America, cases of complex speciation are common, due at least in part to the cyclical Pleistocene glacial history of the continent. Scrub-jays in the genus Aphelocoma provide a useful case study in complex speciation because their range throughout North America is structured by phylogeographic barriers with multiple cases of secondary contact between divergent lineages. Here, we show that a comprehensive approach to genomic reconstruction of evolutionary history, i.e., synthesizing results from species delimitation, species tree reconstruction, demographic model testing, and tests for gene flow, is capable of clarifying evolutionary history despite complex speciation. We find concordant evidence across all statistical approaches for the distinctiveness of an endemic southern Mexico lineage (A. w. sumichrasti), culminating in support for the species status of this lineage under any commonly applied species concept. We also find novel genomic evidence for the species status of a Texas endemic lineage A. w. texana, for which equivocal species delimitation results were clarified by demographic modeling and spatially explicit models of gene flow. Finally, we find that complex signatures of both ancient and modern gene flow between the non-sister California Scrub-Jay (A. californica) and Woodhouse's Scrub-Jay (A. woodhouseii), result in discordant gene trees throughout the species' genomes despite clear support for their overall isolation and species status. In sum, we find that a multi-faceted approach to genomic analysis can increase our understanding of complex speciation histories, even in well-studied groups. 
Given the emerging recognition that complex speciation is relatively commonplace, the comprehensive framework that we demonstrate for interrogation of species limits and evolutionary history using genomic data can provide a necessary roadmap for disentangling the impacts of gene flow and incomplete lineage sorting to better understand the systematics of other groups with similarly complex evolutionary histories.
... Here, we use unsupervised machine learning (UML) methods (Derkarabetian et al., 2019) incorporating allelic, spatial, and ecological data (Pyron, 2023). This is the first step to demonstrate that lineage structure is real and discoverable across organismal dimensions. ...
... Most approaches for genetic-based species delimitation rely solely on molecular data (Leaché et al., 2019; Yang & Rannala, 2010), or are limited to a few traits under a restrictive parametric model (Solís-Lemus et al., 2015). Even recent UML methods have typically been limited only to allelic data (Derkarabetian et al., 2019, 2022). ...
Article
Full-text available
The outcomes of speciation across organismal dimensions (e.g., ecological, genetic, phenotypic) are often assessed using phylogeographic methods. At one extreme, reproductively isolated lineages represent easily delimitable species differing in many or all dimensions, and at the other, geographically distinct genetic segments introgress across broad environmental gradients with limited phenotypic disparity. In the ambiguous gray zone of speciation, where lineages are genetically delimitable but still interacting ecologically, it is expected that these lineages represent species in the context of ontology and the evolutionary species concept when they are maintained over time with geographically well‐defined hybrid zones, particularly at the intersection of distinct environments. As a result, genetic structure is correlated with environmental differences and not space alone, and a subset of genes fail to introgress across these zones as underlying genomic differences accumulate. We present a set of tests that synthesize species delimitation with the speciation process. We can thereby assess historical demographics and diversification processes while understanding how lineages are maintained through space and time by exploring spatial and genome clines, genotype‐environment interactions, and genome scans for selected loci. Employing these tests in eight lineage‐pairs of snakes in North America, we show that six pairs represent 12 “good” species and that two pairs represent local adaptation and regional population structure. The distinct species pairs all have the signature of divergence before or near the mid‐Pleistocene, often with low migration, stable hybrid zones of varying size, and a subset of loci showing selection on alleles at the hybrid zone corresponding to transitions between distinct ecoregions. Locally adapted populations are younger, exhibit higher migration, and less ecological differentiation. 
Our results demonstrate that interacting lineages can be delimited using phylogeographic and population genetic methods that properly integrate spatial, temporal, and environmental data.
... Though UCEs present an ideal tool for reconstructing the evolutionary history of insects, the mechanisms of molecular evolution among and within populations and diversifying lineages are complex (e.g., incomplete lineage sorting (ILS) and gene flow; Maddison, 1997; Edwards & Beerli, 2000; Kubatko & Degnan, 2007; Edwards, 2009; Sukumaran & Knowles, 2017). While these mechanisms may influence the evolutionary history of organisms at all timescales (Hime et al., 2021; Oliver, 2013), parsing patterns and testing for processes of diversification of closely related lineages often require a nuanced blend of population genomics and phylogenomics that runs the gamut from assignment of ancestry proportions in individual samples to large-scale reconstructions of the evolutionary history of the recovered lineages (Derkarabetian et al., 2019; Derkarabetian et al., 2022). As the N. traili species group presents a case of hierarchical population genetic structure that blurs the lines between species and populations, it is desirable to examine patterns of evolution with an added population genetic approach, as has been successfully done in other systems using UCEs (McCormack et al., 2016; Newton et al., 2023). ...
... Studies using similar SNP calling protocols have shown the effectiveness of SNP datasets extracted from UCE-enriched reads (Decicco et al., 2023; DeRaad et al., 2023; McCormack et al., 2016; McCormack et al., 2023), including in comparisons with other reduced capture methods such as RAD-Seq (Harvey et al., 2016; Manthey et al., 2016). Studies in arthropods have commonly applied UCEs and other target capture methods for resolving deeper timescales (Blaimer et al., 2016; Branstetter, Longino, et al., 2017; Starrett et al., 2017; Derkarabetian et al., 2023; Homziak et al., 2023; see also Zhang et al., 2018; Gustafson et al., 2020 for comprehensive reviews), with relatively few used in shallow-scale reconstructions (Branstetter & Longino, 2019; Branstetter & Longino, 2022) and fewer using SNP-based analyses (Derkarabetian et al., 2019; Newton et al., 2023). A future goal of this research will be to increase the sampling within Clade 3, to better understand the genetic structure within these populations. ...
Article
Full-text available
The Notomicrus traili species group (Coleoptera: Noteridae) is a lineage of aquatic beetles distributed throughout South America and extends into Mexico and the West Indies. Previous research has revealed a species complex within this group, with multiple distinct clades sharing overlapping distributions and lineages attributed to N. traili and the closely related Notomicrus gracilipes recovered as polyphyletic. Here, we perform targeted capture of ultraconserved elements (UCEs) to examine relationships and patterns of evolution within the N. traili group. First, we use short‐read whole‐genome sequencing of four noterid genera to design a noterid‐specific UCE probe set (Noteridae 3.4Kv1) targeting over 3400 unique loci. Using this probe set, we capture UCE data from population‐level sampling of 44 traili group specimens from across the Neotropics, with an emphasis on the Guiana Shield where distributions of several putative N. traili group populations overlap. We subject the resulting data matrix to various trimming and data completeness treatments and reconstruct the phylogeny with both concatenated maximum likelihood and coalescent congruent methods. We recover robust phylogenetic estimates that identify several phylogenetically distinct clades within the traili group that share overlapping distributions. To test for the genetic distinctiveness of populations, we extract single nucleotide polymorphism (SNP) data from UCE alignments using a chimeric reference method to map UCE‐enriched reads and examine patterns of genetic clustering using principal component analyses (PCAs) and STRUCTURE. Population genetic results are highly concordant with recovered phylogenetic structure, revealing a high degree of co‐ancestry shared within identified clades, contrasting with limited ancestry sharing between clades. 
We recover a pattern consistent with repeated diversification and dispersal of the traili group in the Neotropics, highlighting the efficacy of a tailored UCE approach for facilitating shallow‐scale phylogenetic reconstructions and population genetic analyses, which can reveal novel aspects of coleopteran phylogeography.
... Traditionally, the most commonly applied delimitation criterion is morphology, which has been occasionally supplemented by other datatypes (Figure 1A; [5,52]). For many morphospecies outside TCGs, the classification is still valid and could be confirmed by modern methods [19,54,55]. However, there are several challenges when using morphology-based delimitation and also identification. ...
... but their applicability suffers from high computational effort for locus- or species-rich datasets and mentioned biological limitations of the MSC. The first promising, supervised (e.g., delimitR or CLADES) and unsupervised (e.g., RF or t-SNE) attempts have been made using predominantly classical ML building on animal genetic data, which were partly supplemented by phylogenetics and morphology [55,69,105,106]. However, these methods are often not suitable for integrative taxonomy or TCGs; for example, due to disregard of gene flow, implementation for few-locus-based, diploid animal genetic data only, and/or lack of strategies for dataset fusion. ...
Article
Full-text available
Although species are central units for biological research, recent findings in genomics are raising awareness that what we call species can be ill-founded entities due to solely morphology-based, regional species descriptions. This particularly applies to groups characterized by intricate evolutionary processes such as hybridization, polyploidy, or asexuality. Here, challenges of current integrative taxonomy (genetics/genomics + morphology + ecology, etc.) become apparent: different favored species concepts, lack of universal characters/markers, missing appropriate analytical tools for intricate evolutionary processes, and highly subjective ranking and fusion of datasets. Now, integrative taxonomy combined with artificial intelligence under a unified species concept can enable automated feature learning and data integration, and thus reduce subjectivity in species delimitation. This approach will likely accelerate revising and unraveling eukaryotic biodiversity.
... A pipeline was developed to analyze the mt genome sequences obtained by PacBio HiFi sequencing incorporating a machine-learning method. In particular, the pipeline integrates (1) custom Python scripts, (2) the multiple sequence alignment program MAFFT [47], (3) a modified variational autoencoder (VAE) [48], and (4) a clustering method using the DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) for data analysis and pattern recognition [49]. The pipeline, including all described scripts, is available on GitHub at https:// github. ...
... git (Fig. 1). Variational autoencoders (VAEs) are generative machine-learning models that discover hidden patterns, such as putative groups of haemosporidian mt lineages/species [48]. The input data for the VAEs is an alignment converted into a binary matrix. ...
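One plausible way to turn an alignment into a binary matrix is one-hot encoding, sketched below with four columns per alignment site (A, C, G, T). The snippet above does not specify the encoding actually used by the pipeline, so the scheme, the function name, and the choice to encode gaps and ambiguous bases as all zeros are assumptions made for illustration.

```python
ALPHABET = "ACGT"


def one_hot_alignment(sequences):
    """Encode aligned sequences as 0/1 rows, four columns per site (A, C, G, T).

    Gaps and ambiguous characters become four zeros (an assumption of this sketch).
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same aligned length")
    matrix = []
    for seq in sequences:
        row = []
        for base in seq.upper():
            row.extend(1 if base == letter else 0 for letter in ALPHABET)
        matrix.append(row)
    return matrix


# Toy alignment of three reads over four sites (hypothetical data).
alignment = ["ACGT", "AC-T", "ttga"]
binary = one_hot_alignment(alignment)
for row in binary:
    print(row)
```

Each row then has a fixed length of four times the alignment length, which is the kind of fixed-width numeric input an autoencoder expects.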
Article
Full-text available
Background: Studies on haemosporidian diversity, including origin of human malaria parasites, malaria's zoonotic dynamic, and regional biodiversity patterns, have used target gene approaches. However, current methods have a trade-off between scalability and data quality. Here, a long-read Next-Generation Sequencing protocol using PacBio HiFi is presented. The data processing is supported by a pipeline that uses machine-learning for analysing the reads.
Methods: A set of primers was designed to target approximately 6 kb, almost the entire length of the haemosporidian mitochondrial genome. Amplicons from different samples were multiplexed in an SMRTbell® library preparation. A pipeline (HmtG-PacBio Pipeline) to process the reads is also provided; it integrates multiple sequence alignments, a machine-learning algorithm that uses modified variational autoencoders, and a clustering method to identify the mitochondrial haplotypes/species in a sample. Although 192 specimens could be studied simultaneously, a pilot experiment with 15 specimens is presented, including in silico experiments where multiple data combinations were tested.
Results: The primers amplified various haemosporidian parasite genomes and yielded high-quality mt genome sequences. This new protocol allowed the detection and characterization of mixed infections and co-infections in the samples. The machine-learning approach converged into reproducible haplotypes with a low error rate, averaging 0.2% per read (minimum of 0.03% and maximum of 0.46%). The minimum recommended coverage per haplotype is 30X based on the detected error rates. The pipeline facilitates inspecting the data, including a local blast against a file of provided mitochondrial sequences that the researcher can customize.
Conclusions: This is not a diagnostic approach but a high-throughput method to study haemosporidian sequence assemblages and perform genotyping by targeting the mitochondrial genome. Accordingly, the methodology allowed for examining specimens with multiple infections and co-infections of different haemosporidian parasites. The pipeline enables data quality assessment and comparison of the haplotypes obtained to those from previous studies. Although a single locus approach, whole mitochondrial data provide high-quality information to characterize species pools of haemosporidian parasites.
... Though UCEs present an ideal tool for reconstructing the evolutionary history of insects, the mechanisms of molecular evolution among and within populations and diversifying lineages are complex (e.g., incomplete lineage sorting (ILS) and gene flow; Maddison, 1997; Edwards & Beerli, 2000; Kubatko & Degnan, 2007; Edwards, 2009; Sukumaran & Knowles, 2017). While these mechanisms may influence the evolutionary history of organisms at all timescales (Hime et al., 2021; Oliver, 2013), parsing patterns and testing for processes of diversification of closely related lineages often require a nuanced blend of population genomics and phylogenomics that runs the gamut from assignment of ancestry proportions in individual samples to large-scale reconstructions of the evolutionary history of the recovered lineages (Derkarabetian et al., 2019; Derkarabetian et al., 2022). As the N. traili species group presents a case of hierarchical population genetic structure that blurs the lines between species and populations, it is desirable to examine patterns of evolution with an added population genetic approach, as has been successfully done in other systems using UCEs (McCormack et al., 2016; Newton et al., 2023). ...
... Studies using similar SNP calling protocols have shown the effectiveness of SNP datasets extracted from UCE-enriched reads (Decicco et al., 2023; DeRaad et al., 2023; McCormack et al., 2016; McCormack et al., 2023), including in comparisons with other reduced capture methods such as RAD-Seq (Harvey et al., 2016; Manthey et al., 2016). Studies in arthropods have commonly applied UCEs and other target capture methods for resolving deeper timescales (Blaimer et al., 2016; Branstetter, Longino, et al., 2017; Starrett et al., 2017; Derkarabetian et al., 2023; Homziak et al., 2023; see also Zhang et al., 2018; Gustafson et al., 2020 for comprehensive reviews), with relatively few used in shallow-scale reconstructions (Branstetter & Longino, 2019; Branstetter & Longino, 2022) and fewer using SNP-based analyses (Derkarabetian et al., 2019; Newton et al., 2023). A future goal of this research will be to increase the sampling within Clade 3, to better understand the genetic structure within these populations. ...
Preprint
Full-text available
The Notomicrus traili species group (Coleoptera: Noteridae) is a lineage of aquatic beetles distributed throughout South America and extends into Mexico and the West Indies. Previous research has revealed a species complex within this group, with multiple distinct clades sharing overlapping distributions and lineages attributed to N. traili and the closely related N. gracilipes recovered as polyphyletic. Here, we perform targeted capture of ultraconserved elements (UCEs) to examine relationships and patterns of evolution within the N. traili group. First, we use short-read whole genome sequencing of four noterid genera to design a noterid-specific UCE probe set (Noteridae 3.4Kv1) targeting over 3,400 unique loci. Using this probe set, we capture UCE data from population-level sampling of 44 traili group specimens from across the Neotropics, with an emphasis on the Guiana Shield where distributions of several putative N. traili group populations overlap. We subject the resulting data matrix to various trimming and data completeness treatments and reconstruct the phylogeny with both concatenated maximum likelihood and coalescent congruent methods. We recover robust phylogenetic estimates that identify several phylogenetically distinct clades within the traili group that share overlapping distributions. To test for the genetic distinctiveness of populations, we extract single nucleotide polymorphism (SNP) data from UCE alignments and examine patterns of genetic clustering using principal component analyses (PCAs) and STRUCTURE. Population genetic results are highly concordant with recovered phylogenetic structure, revealing a high degree of co-ancestry shared within identified clades, contrasting with limited ancestry sharing between clades. 
We recover a pattern consistent with repeated diversification and dispersal of the traili group in the Neotropics, highlighting the efficacy of a tailored UCE approach for facilitating shallow-scale phylogenetic reconstructions and population genetic analyses, which can reveal novel aspects of coleopteran phylogeography.
... There are over 30 different species concepts (Zachos, 2016), with new additions appearing regularly in the literature (Hill, 2017; Shanker et al., 2017; Hong, 2020; Seifert, 2020). One of the major challenges to species delimitation is distinguishing between species- and population-level divergences (Derkarabetian et al., 2019). In other words, determining when genetically distinct groups are different species, and when they are intraspecific populations. ...
... A prominent caveat in species delimitation is the difficulty in differentiating between interspecific and intraspecific variation (e.g. due to recent speciation events or intraspecific geographic sub-structuring) (Sukumaran and Knowles, 2017; Derkarabetian et al., 2019). In a study of species delimitation in the Cataglyphis bicolor Fab. ...
... In contrast to that, unsupervised learning refers to tasks where no examples are supplied, and the algorithms optimize some general loss function (e.g. genomic species delimitation, see Derkarabetian et al., 2019). Finally, in reinforcement learning, the ML algorithm is trained by interacting with a (virtual) environment. ...
... For clustering and ordination tasks, which have a long tradition in ecology and ML algorithms for unsupervised learning tasks (Box 2), classical ML algorithms such as k-means or t-distributed stochastic neighbour embedding algorithms are and will remain important, for example for species delimitation (Derkarabetian et al., 2019), outlier detection, identification of eco-provinces (Sonnewald et al., 2020) or operational taxonomic units (OTUs) in metabarcoding (Deiner et al., 2017). DL-based approaches (e.g. based on [variational] autoencoders), on the other hand, are gaining popularity in certain data-dependent tasks, such as image-based tasks in remote sensing (Zerrouki et al., 2021) or (genomic) sequences (Wang & Gu, 2018). ...
Article
Full-text available
The popularity of machine learning (ML), deep learning (DL) and artificial intelligence (AI) has risen sharply in recent years. Despite this spike in popularity, the inner workings of ML and DL algorithms are often perceived as opaque, and their relationship to classical data analysis tools remains debated. Although it is often assumed that ML and DL excel primarily at making predictions, ML and DL can also be used for analytical tasks traditionally addressed with statistical models. Moreover, most recent discussions and reviews on ML focus mainly on DL, failing to synthesise the wealth of ML algorithms with different advantages and general principles. Here, we provide a comprehensive overview of the field of ML and DL, starting by summarizing its historical developments, existing algorithm families, differences to traditional statistical tools, and universal ML principles. We then discuss why and when ML and DL models excel at prediction tasks and where they could offer alternatives to traditional statistical methods for inference, highlighting current and emerging applications for ecological problems. Finally, we summarize emerging trends such as scientific and causal ML, explainable AI, and responsible AI that may significantly impact ecological data analysis in the future. We conclude that ML and DL are powerful new tools for predictive modelling and data analysis. The superior performance of ML and DL algorithms compared to statistical models can be explained by their higher flexibility and automatic data‐dependent complexity optimization. However, their use for causal inference is still disputed as the focus of ML and DL methods on predictions creates challenges for the interpretation of these models. Nevertheless, we expect ML and DL to become an indispensable tool in ecology and evolution, comparable to other traditional statistical tools.
... Combining appropriate geographical sampling with other sources of evidence such as morphological, ecological or phenological data, provides a robust framework for species delimitation (Carstens et al., 2013; Chambers and Hillis, 2020). Furthermore, more recent implementations using unsupervised machine learning algorithms can avoid the issue of over-splitting attributed to MSC methods (Chambers and Hillis, 2020; Sukumaran and Knowles, 2017) and have demonstrated accurate species delimitation in several organisms (DeRaad et al., 2022; Derkarabetian et al., 2019; Newton et al., 2020). ...
... We performed clustering-based species discovery analyses as outlined by Derkarabetian et al. (2019), which include both traditional clustering approaches and novel applications of unsupervised machine-learning (UML) algorithms. First, we used the R package adegenet to perform discriminant analysis of principal components (DAPC) (Jombart et al., 2010). ...
Article
Full-text available
Speciation is a continuous and complex process shaped by the interaction of numerous evolutionary forces. Despite the continuous nature of the speciation process, the implementation of conservation policies relies on the delimitation of species and evolutionarily significant units (ESUs). Puffinus shearwaters are globally distributed and threatened pelagic seabirds. Due to remarkable morphological stasis, the group has been under intense taxonomic debate for the past three decades. Here, we use double digest Restriction-Site Associated DNA sequencing (ddRAD-Seq) to genotype species and subspecies of North Atlantic and Mediterranean Puffinus shearwaters across their entire geographical range. We assess the phylogenetic relationships and population structure among and within the group, evaluate species boundaries, and characterise the genomic landscape of divergence. We find that current taxonomies are not supported by genomic data and propose a more accurate taxonomy by integrating genomic information with other sources of evidence. Our results show that several taxon pairs are at different stages of a speciation continuum. Our study emphasises the potential of genomic data to resolve taxonomic uncertainties, which can help to focus management actions on relevant taxa, even if they do not necessarily coincide with the taxonomic rank of species.
... Identification and classification undertakings can also be led by classical ML unsupervised learning algorithms. For instance, K-means groups individuals based on similarities in their characteristics, whereas t-SNE visualizes the similarities and differences between ecological samples, all of which ends up being helpful for tasks such as species delimitation – identifying potential new species or differentiating between closely related species (Derkarabetian et al., 2019) – or operational taxonomic units (OTUs) in metabarcoding (Deiner et al., 2017). Similarly, Kohonen's self-organizing maps have been used in terrestrial ecology to classify vegetation types based on environmental variables, helping to identify patterns and ecological gradients (Adamczyk et al., 2012). ...
Article
Full-text available
The integration of artificial intelligence (AI) algorithms in ecological research is revolutionizing how we monitor, predict, and manage natural systems, enabling more advanced data analysis, pattern recognition, and predictive modelling. This review critically analyzes and synthesizes the application of machine learning and deep learning in terrestrial ecology, providing a comprehensive overview of their paradigms – unsupervised, supervised, and reinforcement learning – and semi-supervised learning, along with their respective algorithm families, strengths, and limitations. We examine both current and emerging applications in terrestrial ecological dynamics and modelling, ecosystem management and conservation, identification and classification tasks, such as trait and behavior recognition. Despite these advancements, we summarize several issues hindering the extensive adoption of AI algorithms in ecology, such as inconsistencies or limitations in datasets, algorithm complexity and interpretability affecting transparency and reliability, high computational demands raising environmental sustainability concerns, and difficulties with model generalization. To address these barriers, we identify key areas for future research, namely optimizing data collection, using transfer learning and data augmentation, refining model transparency through explainable AI (XAI) and ethical considerations, and integrating causal inference into AI models. We conclude that AI algorithms hold great promise for delivering more accurate, scalable, and timely data, advancing real-time monitoring and near-instantaneous predictions – e.g., seasonal forecasting – for more dynamic responses to environmental changes. The need for continued methodological innovation and multi- and trans-disciplinary collaboration is emphasized to ensure these technologies are effective, sustainable, and equitable in supporting ecosystem conservation and restoration efforts addressing global ecological crises.
... There are over 30 different species concepts (Zachos 2016), with new additions appearing regularly in the literature (Hill 2017; Hong 2020; Seifert 2020; Shanker et al. 2017). One of the major challenges to species delimitation is distinguishing between species- and population-level divergences (Derkarabetian et al. 2019). Species numbers may be overestimated or underestimated due to a number of factors, including the sensitivity of delimitation algorithms to population genetic structure (e.g., due to geographical sub-structuring) (Sukumaran & Knowles 2017), variation in sample sizes, the choice of genetic marker, and the violation of one or more of the models' statistical assumptions (Carstens et al. 2013). ...
Article
The genus Tetramesa Walker (Hymenoptera: Eurytomidae) comprises over 200 species of herbivorous wasps that feed exclusively on grasses. Recent field surveys in South Africa for grass biological control programs have uncovered a large diversity of potential Tetramesa on African grasses. Here, mitochondrial (cytochrome c oxidase I [COI]) and nuclear (28S) genetic sequences were used to compare the outputs of seven popular species delimitation methods and to guide the generation of consensus species boundaries for putative Tetramesa taxa and close relatives. Additionally, the nuclear region was used to run a dated analysis that applied a molecular clock rate. Consensus species delimitation results found 35 molecular operational taxonomic units (MOTUs) in the COI data and 21 MOTUs in the 28S data. Of the 35 COI MOTUs, there were 17 putative Tetramesa taxa (16 novel southern African taxa and 1 described Northern Hemisphere species, Tetramesa romana), 13 of which showed evidence of specialisation to a single host plant. Comparatively, of the twenty‐one 28S MOTUs, there were 5 putative Tetramesa taxa (4 novel southern African taxa and 1 T. romana), all of which showed evidence of host specificity. The dated analysis suggested that the genus Tetramesa originated ~67.1 mya. There was evidence of rapid diversification in the Southern Hemisphere clades between 5 and 15 mya, which coincides with grassland expansions and climatic fluctuations in Africa at the time that may have driven host specialisation. The present results provide valuable insights into the diversity and broader scale evolutionary patterns in this Southern Hemisphere microhymenopteran group.
... We also used dimension reduction to uncover underlying group structure/clustering. We implemented the t-distributed stochastic neighbor embedding algorithm (t-SNE) that can capture local structure of high-dimensional data within 2-3 dimensions (van der Maaten and Hinton 2008; Li et al. 2017; Derkarabetian et al. 2019). This analysis was performed with the R package Rtsne (Krijthe 2015) using the following settings: perplexity = 10, max iterations = 100,000, theta = 0.0. ...
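The perplexity setting quoted in the excerpt above can be read as a target for the effective number of neighbours each sample is compared against. As a rough illustration only (this is not the Rtsne code, and all function names here are hypothetical), the following Python sketch computes the perplexity of a Gaussian neighbour distribution and uses bisection to find the bandwidth that matches a target such as 10. The squared distances are made-up values.

```python
import math

def conditional_probs(sq_dists, sigma):
    """Gaussian neighbour probabilities for one point, given squared distances."""
    d_min = min(sq_dists)  # shift by the minimum so the largest weight is exactly 1
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perplexity is 2 raised to the Shannon entropy (in bits) of the distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Bisection on sigma; perplexity grows as the Gaussian kernel widens."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Illustrative squared distances from one sample to 20 others (assumed values).
sq_dists = [float(k) for k in range(1, 21)]
sigma = sigma_for_perplexity(sq_dists, target=10.0)
print(round(perplexity(conditional_probs(sq_dists, sigma)), 3))
```

A uniform distribution over n neighbours has perplexity n, which is why the setting is usually chosen well below the number of samples.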
Article
Full-text available
Hybridization plays a major role in the evolutionary history of many taxa and can generate confounding patterns affecting many downstream applications. In this study, we empirically demonstrate how hybridization obfuscates phylogenetic inference (via the artefactual branch effect), species boundaries, and taxonomy in an adaptive radiation of frogs. Philippine narrow-mouthed frogs of the genus Kaloula exhibit a wide range of phenotypic and ecological adaptations but their evolutionary history and taxonomy remain poorly understood. In particular, the Kaloula conjuncta complex contains numerous subspecies with unresolved taxonomic boundaries and unclear evolutionary relationships. Within this complex, Kaloula conjuncta stickeli was, until now, considered a rare, enigmatic, and phenotypically distinct subspecies that had not been encountered since its original description nearly 80 years ago. Here, we show that K. c. stickeli shares alleles with K. conjuncta meridionalis and another species outside the conjuncta group, K. picta. Using target-capture sequencing and a robust analytical framework, we show that despite having a unique phenotype, K. c. stickeli is likely an inviable F1 hybrid between K. c. meridionalis and K. picta and thus, does not warrant taxonomic recognition. Our results show how industry-standard approaches in systematic inference and integrative taxonomy (morphological, phylogenomic, clustering, and distance-based methods) can generate misleading results for identifying and understanding affinities of hybrids. In contrast, we demonstrate how network multispecies coalescent and population genetic approaches are more effective at accurately inferring reticulated evolutionary history. We also propose a rare phenomenon of deforestation-induced hybridization, which could have important consequences in light of large-scale Southeast Asian forest destruction.
... We also utilised the Bayesian information criterion (BIC) to assess the number of genetic clusters using a discriminant analysis of principal components (DAPC) analysis implemented in the R package Adegenet (Jombart and Collins 2015). Furthermore, we conducted an unsupervised machine-learning approach, variational autoencoders (VAE), for species delimitation, following the method tested by Derkarabetian et al. (2019) as a third clustering method. The use of artificial intelligence methods like the VAE approach may help delimit species using automated features learned from unlabelled data (Karbstein et al. 2024). ...
Article
Full-text available
Adaptive introgression involves the acquisition of advantageous genetic variants through hybridisation, which are subsequently favoured by natural selection due to their association with beneficial traits. Here, we analysed speciation patterns of the kleptoparasitic spider, Argyrodes lanyuensis, through genomic analyses and tested for possible genetic evidence of adaptive introgression at the Taiwan–Philippines transition zone. Our study used highly polymorphic SNPs to demonstrate that speciation occurred when the Hualien (on Taiwan Island + Green Island) and Orchid Island + Philippine lineages separated during the early to mid‐Pleistocene. The best colonisation model suggested by approximate Bayesian computation and random forests and biogeographical analyses supported an inference of a bottleneck during speciation, an interpretation reinforced by observation of lower FST values and reduced genetic diversity of the Orchid Island + Philippines lineage. We also found the highest support for the occurrence of introgression on the youngest island (Green Island) of the Taiwan–Philippines transition zone based on the ABBA‐BABA test. Our study highlights the inference of two noteworthy species (Hualien + Green Island and Orchid Island + Philippines) based on our species delimitation tests, with gene flow between Green Island and Orchid Island that indicates introgression. The potential adaptive alleles in the Green Island population, which are under balancing selection, provide initial evidence of a possible rare case of adaptive introgression. This could represent an evolutionary response to a newly formed niche (or novel geographical context) lying between the tropical climate of the Philippines and the subtropical climate of Hualien, Taiwan.
... Species delimitation results implied the presence of a higher number of putative species within the complex, with greatest support for a model considering each population a separate species, and an approximate plateau in likelihood at five species. However, multispecies coalescent approaches such as SNAPP have been demonstrated to over-split lineages (Chambers and Hillis 2020), particularly when populations are highly structured and divergent (Derkarabetian et al. 2019). Thus, we instead suggest that the genomic data unequivocally indicate that at least three species are present in the clade corresponding to the lineages N. pygmaea, N. vittata [A] and N. vittata [B], following similar nomenclature in Unmack et al. (2011). ...
Article
Full-text available
Anthropogenic climate change is forecast to drive regional climate disruption and instability across the globe. These impacts are likely to be exacerbated within biodiversity hotspots, both due to the greater potential for species loss but also to the possibility that endemic lineages might not have experienced significant climatic variation in the past, limiting their evolutionary potential to respond to rapid climate change. We assessed the role of climatic stability on the accumulation and persistence of lineages in an obligate freshwater fish group endemic to the southwest Western Australia (SWWA) biodiversity hotspot. Using 19,426 genomic (ddRAD-seq) markers and species distribution modelling, we explored the phylogeographic history of western (Nannoperca vittata) and little (Nannoperca pygmaea) pygmy perches, assessing population divergence and phylogenetic relationships, delimiting species and estimating changes in species distributions from the Pliocene to 2100. We identified two deep phylogroups comprising three divergent clusters, which showed no historical connectivity since the Pliocene. We conservatively suggest these represent three isolated species with additional intraspecific structure within one widespread species. All lineages showed long-term patterns of isolation and persistence owing to climatic stability but with significant range contractions likely under future climate change. Our results highlighted the role of climatic stability in allowing the persistence of isolated lineages in the SWWA. This biodiversity hotspot is under compounding threat from ongoing climate change and habitat modification, which may further threaten previously undetected cryptic diversity across the region.
... Furthermore, the Bayes factor delimitation (BFD*) method (Grummer et al., 2014) was utilized to objectively compare and validate the alternative species delimitation models against the current taxonomy of S. chamaejasme in Beast2 v2.6.3. Additionally, genomic clustering was performed using UML approaches, specifically RF and t-SNE (Derkarabetian et al., 2019). ...
Article
Full-text available
The mountains of Southwest China comprise a significant large mountain range and biodiversity hotspot imperiled by global climate change. The high species diversity in this mountain system has long been attributed to a complex set of factors, and recent large‐scale macroevolutionary investigations have placed a broad timeline on plant diversification that stretches from 10 million years ago (Mya) to the present. Despite our increasing understanding of the temporal mode of speciation, finer‐scale population‐level investigations are lacking to better refine these temporal trends and illuminate the abiotic and biotic influences of cryptic speciation. This is largely due to the dearth of organismal sampling among closely related species and populations, spanning the incredible size and topological heterogeneity of this region. Our study dives into these evolutionary dynamics of speciation using genomic and eco‐morphological data of Stellera chamaejasme L. We identified four previously unrecognized cryptic species having indistinct morphological traits and large metapopulation of evolving lineages, suggesting a more recent diversification (~2.67–0.90 Mya), largely influenced by Pleistocene glaciation and biotic factors. These factors likely influenced allopatric speciation and advocated cyclical warming–cooling episodes along elevational gradients during the Pleistocene. The study refines the evolutionary timeline to be much younger than previously implicated and raises the concern that projected future warming may influence the alpine species diversity, necessitating increased conservation efforts.
... Inconsistencies in clustering could be due to SNPs having different genetic histories due to recombination, stochasticity in levels of information content in SNP sets, poor model fit of nucleotide substitutions at SNP sites, or a combination of these factors. Although machine learning clustering methods using SNP data have been shown to have utility in determining population-level evolution as well as in delimiting population-species boundaries (Battey et al., 2021; Derkarabetian et al., 2019; Newton et al., 2020, 2023), analysis of a single random SNP set for species delimitation could produce misleading results, particularly if the SNP sample size is limited. ...
Article
Full-text available
The recognition and delineation of cryptic species remains a perplexing problem in systematics, evolution, and species delimitation. Once recognized as such, cryptic species complexes provide fertile ground for studying genetic divergence within the context of phenotypic and ecological divergence (or lack thereof). Herein we document the discovery of a new cryptic species of trapdoor spider, Promyrmekiaphila korematsui sp. nov. Using subgenomic data obtained via target enrichment, we document the phylogeography of the California endemic genus Promyrmekiaphila and its constituent species, which also includes P. clathrata and P. winnemem. Based on these data we show a pattern of strong geographic structuring among populations but cannot entirely discount recent gene flow among populations that are parapatric, particularly for deeply diverged lineages within P. clathrata. The genetic data, in addition to revealing a new undescribed species, also allude to a pattern of potential phenotypic differentiation where species likely come into close contact. Alternatively, phenotypic cohesion among genetically divergent P. clathrata lineages suggests that some level of gene flow is ongoing or occurred in the recent past. Despite considerable field collection efforts over many years, additional sampling in potential zones of contact for both species and lineages is needed to completely resolve the dynamics of divergence in Promyrmekiaphila at the population–species interface.
... The use of deep-learning in ecology covers various fields, such as population dynamics, landscape ecology, functioning of ecosystems or conservation biology (Christin et al., 2019; Borowiec et al., 2022). Their scale ranges from the individual level to global, including applications such as molecular data (Derkarabetian et al., 2019) or the classification and analysis of massive bio-acoustic (Mac Aodha et al., 2018) and image datasets (Hansen et al., 2020). In the context of the scientific monitoring of the French fisheries of the Southern Ocean (Martin et al., 2021), we have ...
Article
Full-text available
We applied a deep-learning approach in order to develop a neural network able to detect and identify macro-invertebrate organisms within images of benthos bycatch collected in the Southern Ocean. We used the Faster RCNN architecture and fine-tuning approach. To perform the transfer-learning, we used an annotated dataset of 59,756 images of organisms identified within 1,845 images of lots, covering eleven taxa: Echinodermata, Asteroidea, Arthropoda, Annelida, Chordata, Hemichordata, Cnidaria, Porifera, Bryozoa, Brachiopoda and Mollusca. The resulting network, not yet efficient enough to obtain precise identifications, is able to provide detection and classification of organisms with a good level of accuracy considering the limited quality of the images used for training. We present this study as a proof of concept for teams involved in the management of collections of macro-invertebrate images.
... The Variational Autoencoder (VAE; Derkarabetian et al., 2019) was also used to further visualize sample clustering. This unsupervised machine learning method is derived from Bayesian probability theory and recently has been shown to be effective for species delimitation (Derkarabetian et al., 2019; Newton et al., 2020). The pipeline consists of an encoder that takes in the one-hot encoded SNPs as unlabelled input data and performs a training phase in which the system learns patterns in the data. ...
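The excerpt above mentions one-hot SNP input to the encoder without giving the encoding itself. As a minimal sketch, assuming (hypothetically) that each biallelic SNP genotype is coded as 0, 1 or 2 alternate-allele copies with -1 for missing data, one-hot encoding could look like the following Python. The function name and genotype coding are illustrative assumptions, not the cited pipeline's format.

```python
def one_hot_genotypes(genotypes, states=(0, 1, 2)):
    """Expand each genotype into one indicator per state.

    Assumed coding: 0/1/2 = copies of the alternate allele, anything else
    (e.g. -1) = missing, which becomes an all-zero block.
    """
    encoded = []
    for g in genotypes:
        encoded.extend(1 if g == s else 0 for s in states)
    return encoded

# One hypothetical sample genotyped at four SNPs, the third one missing.
sample = [0, 2, -1, 1]
print(one_hot_genotypes(sample))
# prints [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
```

Each sample then becomes a fixed-length binary vector (three positions per SNP under this coding), which is the kind of unlabelled input an autoencoder can be trained on.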
Article
Full-text available
Aim: We assessed the population genetic structure of the kleptoparasitic spider Argyrodes bonadea across the Southwestern Pacific islands. Our aim is to evaluate the impact of overseas distances and, in particular, the Kerama gap, as potential drivers of genetic differentiation. If no relationship exists, then we assume dispersal following adaptive change as an alternative non‐vicariant mechanism that generates divergence. Location: Southwestern Pacific Islands. Taxon: Argyrodes bonadea. Methods: We used mitochondrial Cytochrome Oxidase 1 (CO1) gene sequences and Restriction Site‐associated DNA Sequencing (RAD‐seq) for our analyses. Results: Two strongly supported lineages, an Amami‐Okinawa Lineage (AOL) and an Austral‐Asia Lineage (AAL), correspond to two separate clades, roughly divided by the Kerama Gap, in phylogenetic trees estimated here. However, species delimitation led to the interpretation of only a single species present. The AOL exhibits complex, geographically structured host web spider species specificity, wherein the Amami population utilizes Cyrtophora, but AOL samples in Okinawa associate exclusively with Nephila—and yet all broadly distributed AAL populations show no evidence of host web spider species specificity. Main Conclusion: The population boundary between AOL and AAL likely results from local adaptation to novel hosts—instead of isolation by the Kerama Gap—following long‐distance dispersal and range expansion. Our results suggest kleptoparasitic spiders have the capacity to overcome permanent deep‐sea barriers and colonize distant landmasses. Whereas peripheral populations (AOL) demonstrate the capacity for specialization to a single host, which may have contributed to genetic differentiation and isolation, the broadly distributed AAL persists and has successfully expanded its geographical range as a host generalist, which may contribute to ongoing gene flow inferred in this study.
... But regardless of algorithmic details, it will rarely if ever be the case that species can be delimited robustly without mechanistic hypotheses regarding the speciation process itself; independent evidence that they are "separately evolving" (Padial & De la Riva, 2021). Rather, there may be a meaningful distinction between population "structure" and actively diverging or collapsing "species" providing a more explicit mechanistic and quantifiable test in at least some instances (Derkarabetian et al., 2019; Sukumaran & Knowles, 2017), with operational criteria stemming from theoretical models (Kelly et al., 2010). ...
Article
Full-text available
Numerous mechanisms can drive speciation, including isolation by adaptation, distance, and environment. These forces can promote genetic and phenotypic differentiation of local populations, the formation of phylogeographic lineages, and ultimately, completed speciation. However, conceptually similar mechanisms may also result in stabilizing rather than diversifying selection, leading to lineage integration and the long‐term persistence of population structure within genetically cohesive species. Processes that drive the formation and maintenance of geographic genetic diversity while facilitating high rates of migration and limiting phenotypic differentiation may thereby result in population genetic structure that is not accompanied by reproductive isolation. We suggest that this framework can be applied more broadly to address the classic dilemma of “structure” versus “species” when evaluating phylogeographic diversity, unifying population genetics, species delimitation, and the underlying study of speciation. We demonstrate one such instance in the Seepage Salamander (Desmognathus aeneus) from the southeastern United States. Recent studies estimated up to 6.3% mitochondrial divergence and four phylogenomic lineages with broad admixture across geographic hybrid zones, which could potentially represent distinct species supported by our species‐delimitation analyses. However, while limited dispersal promotes substantial isolation by distance, microhabitat specificity appears to yield stabilizing selection on a single, uniform, ecologically mediated phenotype. As a result, climatic cycles promote recurrent contact between lineages and repeated instances of high migration through time. Subsequent hybridization is apparently not counteracted by adaptive differentiation limiting introgression, leaving a single unified species with deeply divergent phylogeographic lineages that nonetheless do not appear to represent incipient species.
... Barrow et al. (2021) highlighted the potential of machine learning algorithms, particularly the random forest model, in analyzing intraspecific diversity among Nearctic amphibians, integrating over 42 000 gene sequences across 299 species. Similarly, Derkarabetian et al. (2019) applied genomics to study species classification within arachnid taxa, such as Metanonychus, known for a high degree of population genetic structuring. Using three unified machine learning methods, namely, random forest, variational autoencoders, and t-distributed Stochastic Neighbor Embedding (t-SNE), they constructed a phylogenetic tree conducive to species delimitation, demonstrating effectiveness across diverse natural systems and taxa with different biological characteristics. ...
Article
Full-text available
Since the late 2010s, Artificial Intelligence (AI) including machine learning, boosted through deep learning, has boomed as a vital tool to leverage computer vision, natural language processing and speech recognition in revolutionizing zoological research. This review provides an overview of the primary tasks, core models, datasets, and applications of AI in zoological research, including animal classification, resource conservation, behavior, development, genetics and evolution, breeding and health, disease models, and paleontology. Additionally, we explore the challenges and future directions of integrating AI into this field. Based on numerous case studies, this review outlines various avenues for incorporating AI into zoological research and underscores its potential to enhance our understanding of the intricate relationships that exist within the animal kingdom. As we build a bridge between beast and byte realms, this review serves as a resource for envisioning novel AI applications in zoological research that have not yet been explored.
... In V. tristis, they were able to distinguish between population structure and species-level divergence, recovering fewer entities than sNMF. A similar behavior has been reported for other unsupervised machine learning approaches used in species delimitation (Derkarabetian et al. 2019). While these methods are not based on biological models, there are other approaches that rely on data simulated under a variety of evolutionary scenarios for training (e.g., Pei et al. 2018; Smith and Carstens 2020). ...
Preprint
Full-text available
The accurate characterization of species diversity is a vital prerequisite for ecological and evolutionary research, as well as conservation. Thus, it is necessary to generate robust hypotheses of species limits based on the inference of evolutionary processes. Integrative species delimitation, the inference of species limits based on multiple sources of evidence, can provide unique insight into species diversity and the processes behind it. However, the application of integrative approaches in non-model organisms is often limited by the amount of data that is available. Here, we show how data relevant for species delimitation can be bolstered by incorporating information from tissue collections, museum specimens, and observations made by the wider community. We show how to integrate these data under a hypothesis-driven, integrative framework by identifying the processes generating genetic and phenotypic variation in Varanus tristis, a widespread and variable complex of Australian monitor lizards. Using genomic, morphometric (linear and geometric), coloration, spatial, and environmental data we show that disparity in this complex is inconsistent with intraspecific variation and instead suggests that speciation has occurred. Based on our results, we identify the environmental factors that may have been responsible for the geographic sorting of variation. Our workflow provides a guideline for the integrative analysis of several types of data to identify the occurrence and causes of speciation. Furthermore, our study highlights how community science and machine learning, two tools used here, can be used to accelerate taxonomic research.
... Group structure was also inferred by implementing the t-distributed stochastic neighbor embedding algorithm (t-SNE) that is effective at capturing local structure of high-dimensional data within 2-3 dimensions (van der Maaten and Hinton 2008; Li et al. 2017; Derkarabetian et al. 2019). ...
Article
Full-text available
Mangrove pit vipers of the Trimeresurus purpureomaculatus-erythrurus complex are the only species of viper known to inhabit mangroves. Despite serving integral ecological functions in mangrove ecosystems, the evolutionary history, distribution, and species boundaries of mangrove pit vipers remain poorly understood, partly due to overlapping distributions, confusing phenotypic variations, and the lack of focused studies. Here, we present the first genomic study on mangrove pit vipers and introduce a robust hypothesis-driven species delimitation framework that considers gene flow and phylogenetic uncertainty in conjunction with a novel application of a new class of speciation-based delimitation model implemented through the program Delineate. Our results showed that gene flow produced phylogenetic conflict in our focal species and substantiate the artefactual branch effect where highly admixed populations appear as divergent non-monophyletic lineages arranged in a stepwise manner at the basal position of clades. Despite the confounding effects of gene flow, we were able to obtain unequivocal support for the recognition of a new species based on the intersection and congruence of multiple lines of evidence. This study demonstrates that an integrative hypothesis-driven approach predicated on the consideration of multiple plausible evolutionary histories, population structure/differentiation, gene flow, and the implementation of a speciation-based delimitation model can effectively delimit species in the presence of gene flow and phylogenetic conflict.
... The fourth objective was direct comparisons of specimens hybridized and sequenced using both the Arachnida and Opiliones probe sets, since the number of Opiliones samples hybridized with the Arachnida probe set is quite large (Derkarabetian et al., 2018; Derkarabetian et al., 2019a; Derkarabetian et al., 2019b; Derkarabetian et al., 2021; Derkarabetian et al., 2022a; Derkarabetian et al., 2022b; Giribet et al., 2022). Sixteen (eight fresh and eight historical) specimens were hybridized using both probe sets. ...
Article
Sequence capture of ultraconserved elements (UCEs) has transformed molecular systematics across many taxa, with arachnids being no exception. The probe set available for Arachnida has been repeatedly used across multiple arachnid lineages and taxonomic levels; however, more specific probe sets for spiders have demonstrated that more UCEs can be recovered with higher probe specificity. In this study, we develop an Opiliones-specific UCE probe set targeting 1915 UCEs using a combination of probes designed from genomes and transcriptomes, as well as the most useful probes from the Arachnida probe set. We demonstrate the effectiveness of this probe set across Opiliones with the most complete family-level phylogeny made to date, including representatives from 61 of 63 currently described families. We also test UCE recovery from historical specimens with degraded DNA, examine population-level data sets, and assess "backwards compatibility" with samples hybridized with the Arachnida probe set. The resulting phylogenies - which include specimens hybridized using both the Opiliones and Arachnida probe sets, historical specimens, and transcriptomes - are largely congruent with previous multi-locus and phylogenomic analyses. The probe set is also "backwards compatible", increasing the number of loci obtained in samples previously hybridized with the Arachnida probe set, and shows high utility down to shallow population-level divergences. This probe set has the potential to further transform Opiliones molecular systematics, resolving many long-standing taxonomic issues plaguing this lineage.
... One common implementation of unsupervised deep learning is through the use of variational autoencoders (Kingma and Welling 2014) (VAEs). VAEs have become increasingly popular in recent years (Doersch 2016) and have been developed in biology for problems such as predicting effects of mutations (Riesselman et al. 2018), visualizing population structure (Battey et al. 2021), and species delimitation (Derkarabetian et al. 2019). A VAE consists of two neural networks, an encoder and a decoder. ...
Article
Full-text available
Interpreting protein function from sequence data is a fundamental goal of bioinformatics. However, our current understanding of protein diversity is bottlenecked by the fact that most proteins have only been functionally validated in model organisms, limiting our understanding of how function varies with gene sequence diversity. Thus, accuracy of inferences in clades without model representatives is questionable. Unsupervised learning may help to ameliorate this bias by identifying highly complex patterns and structure from large datasets without external labels. Here we present DeepSeqProt, an unsupervised deep learning program for exploring large protein sequence datasets. DeepSeqProt is a clustering tool capable of distinguishing between broad classes of proteins while learning local and global structure of functional space. DeepSeqProt is capable of learning salient biological features from unaligned, unannotated sequences. DeepSeqProt is more likely to capture complete protein families and statistically significant shared ontologies within proteomes than other clustering methods. We hope this framework will prove of use to researchers and provide a preliminary step in further developing unsupervised deep learning in molecular biology.
... Membership probabilities were calculated for each individual, as well as the proportion of successful reassignment to the web colony. Clustering was also performed with the Variational Autoencoder (VAE) unsupervised machine learning approach described in Derkarabetian et al. (2019). The unlinked SNP dataset was converted to one-hot format, and missing data were treated as masked. ...
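The encoding step mentioned in this excerpt can be illustrated with a minimal Python sketch. It assumes genotypes are coded as 0, 1, 2 (count of alternate alleles) with -1 for a missing call; this coding, the function name `one_hot_genotypes`, and the all-zero representation of masked calls are illustrative assumptions, not details taken from the cited study.

```python
from typing import List

# Hypothetical coding: 0 = homozygous reference, 1 = heterozygous,
# 2 = homozygous alternate, -1 = missing call.
STATES = (0, 1, 2)
MISSING = -1


def one_hot_genotypes(genotypes: List[int]) -> List[List[int]]:
    """One-hot encode one individual's SNP genotypes.

    A missing call becomes an all-zero row, so it is "masked": it
    contributes no signal to any genotype state.
    """
    encoded = []
    for g in genotypes:
        if g == MISSING:
            encoded.append([0] * len(STATES))
        elif g in STATES:
            encoded.append([1 if g == s else 0 for s in STATES])
        else:
            raise ValueError(f"unexpected genotype code: {g}")
    return encoded


if __name__ == "__main__":
    for row in one_hot_genotypes([0, 2, -1, 1]):
        print(row)
```

Each SNP becomes a row of three indicator values, so an individual with n SNPs is represented by an n-by-3 matrix that can be flattened before being passed to a clustering model.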
Article
Spiders are notoriously solitary and cannibalistic, with instances of colonial or social lifestyles in only about 50-60, or ~0.1% of 50,000 described species. Population analyses indicate that most colonies consist of multiple cohorts formed by close relatives. Territorial social spiders facultatively form colonies by interlinking individual webs, but further cooperation is infrequent, and only among juveniles or (rarely) females. In spiders, therefore, aggregations of males outside of the male-male competition context have been unknown. Here, we report on a discovery of a kite spider from Madagascar that exhibits unique colonies. We found colonies of the newly described araneid Isoxya manangona n. sp. formed by up to 41 interconnected, single-cohort adult female webs with up to 38 adult males aggregating on a central, single, nonsticky line. With males resting tightly together, we found no evidence for male-male aggression. Genetic analyses from RAD sequencing suggest that most colonies consist of unrelated individuals. Furthermore, genetic variability of males was somewhat less than that of females. Single cohort colonies made up purely of adults, and peaceful male aggregations, have not previously been observed in spiders. Although direct behavioral observations are preliminary, we speculate based on the available evidence that these colonies may represent a novel and first case of lekking in spiders.
... A VAE is a machine learning method that learns structure in high-dimensional data by encoding it into a low-dimensional space and subsequently generating simulated data from the low-dimensional encodings (Kingma and Welling, 2013). VAEs have previously been used for species delineation in spiders (Derkarabetian et al., 2019) and visualisation of population structure in Anopheles and humans (Battey et al., 2021). Both these studies used sequence alignments containing much more genomic sequence than the amplicon panel provides. ...
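For reference, the standard training objective behind this encode-then-generate description is the evidence lower bound of Kingma and Welling. The notation below ($x$ for an observed data vector, $z$ for its low-dimensional encoding, $q_\phi$ for the encoder, $p_\theta$ for the decoder, $p(z)$ for the prior) is the conventional one and is an assumption here, since the excerpt defines no symbols.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

The first term rewards accurate reconstruction of the data from its encoding; the second keeps the encodings close to the prior, which is typically a standard normal distribution.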
Article
Full-text available
The ANOSPP amplicon panel is a genus-wide targeted sequencing panel to facilitate large-scale monitoring of Anopheles species diversity. Combining information from the 62 nuclear amplicons present in the ANOSPP panel allows for a more nuanced species assignment than single gene (e.g. COI) barcoding, which is desirable in the light of permeable species boundaries. Here, we present NNoVAE, a method using Nearest Neighbours (NN) and Variational Autoencoders (VAE), which we apply to k-mers resulting from the ANOSPP amplicon sequences in order to hierarchically assign species identity. The NN step assigns a sample to a species-group by comparing the k-mers arising from each haplotype's amplicon sequence to a reference database. The VAE step is required to distinguish between closely related species, and also has sufficient resolution to reveal population structure within species. In tests on independent samples with over 80% amplicon coverage, NNoVAE correctly classifies to species level 98% of samples within the An. gambiae complex and 89% of samples outside the complex. We apply NNoVAE to over two thousand new samples from Burkina Faso and Gabon, identifying unexpected species in Gabon. NNoVAE presents an approach that may be of value to other targeted sequencing panels, and is a method that will be used to survey Anopheles species diversity and Plasmodium transmission patterns through space and time on a large scale, with plans to analyse half a million mosquitoes in the next five years.
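As a rough illustration of the nearest-neighbour idea described in this abstract, the Python sketch below breaks sequences into k-mers and assigns a query to the reference group with the highest k-mer overlap. The Jaccard similarity, the value k = 4, the toy sequences, and the names `kmers`, `jaccard`, and `nearest_group` are all assumptions made for illustration; the abstract does not specify NNoVAE's distance measure, its value of k, or the contents of its reference database.

```python
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared k-mers divided by all distinct k-mers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_group(query: str, references: Dict[str, str], k: int = 4) -> Tuple[str, float]:
    """Assign the query to the reference label with the most similar k-mer set."""
    query_kmers = kmers(query, k)
    scores = {
        label: jaccard(query_kmers, kmers(seq, k))
        for label, seq in references.items()
    }
    best = max(scores, key=lambda label: scores[label])
    return best, scores[best]


if __name__ == "__main__":
    # Toy reference sequences, invented for this example.
    references = {
        "group_A": "ACGTACGTGGCA",
        "group_B": "TTGACCATTGAC",
    }
    label, score = nearest_group("ACGTACGTGGCT", references)
    print(label, round(score, 2))
```

A set-based comparison like this ignores k-mer counts and positions; it is only meant to show why k-mers allow sequences to be compared against a reference without an alignment.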
... Unsupervised learning covers tasks where the response is unknown (e.g. genomic species delimitation, see Derkarabetian et al., 2019). Finally, in Reinforcement learning, the ML algorithm is trained by interacting with a (virtual) environment. ...
Preprint
Full-text available
The popularity of Machine learning (ML), Deep learning (DL), and Artificial intelligence (AI) has sharply risen in recent years. Despite their spike in popularity, the inner workings of ML and DL algorithms are perceived as opaque, and their relationship to classical data analysis tools remains debated. It is often assumed that ML and DL excel primarily at making predictions. Recently, however, they have been increasingly used for classical analytical tasks traditionally covered by statistical models. Moreover, recent reviews on ML have focused exclusively on DL, missing out on synthesizing the wealth of ML algorithms with different advantages and general principles. Here, we provide a comprehensive overview of ML and DL, starting with their historical developments, their algorithm families, their differences from traditional statistical tools, and universal ML principles. We then discuss why and when ML and DL excel at prediction tasks, and where they could offer alternatives to traditional statistical methods for inference, highlighting current and emerging applications for ecological problems. Finally, we summarize emerging trends, particularly scientific and causal ML, explainable AI, and responsible AI that may significantly impact ecological data analysis in the future.
... A variational autoencoder is a machine learning method that learns structure in high-dimensional data by encoding it into a low-dimensional space and subsequently generating simulated data from the low-dimensional encodings (Kingma and Welling 2013). VAEs have previously been used for species delineation in spiders (Derkarabetian et al. 2019) and visualisation of population structure in Anopheles and humans (Battey, Coffing, and Kern 2021). Both these studies used sequence alignments containing much more genomic sequence than the amplicon panel provides. ...
Preprint
Full-text available
The ANOSPP amplicon panel is a genus-wide targeted sequencing panel to facilitate large-scale monitoring of Anopheles species diversity. Combining information from the 62 nuclear amplicons present in the ANOSPP panel allows for a more nuanced species assignment than single gene (e.g. COI) barcoding, which is desirable in the light of permeable species boundaries. Here, we present NNoVAE, a method using Nearest Neighbours (NN) and Variational Autoencoders (VAE), which we apply to k-mers resulting from the ANOSPP amplicon sequences in order to hierarchically assign species identity. The NN step assigns a sample to a species-group by comparing the k-mers arising from each haplotype's amplicon sequence to a reference database. The VAE step is required to distinguish between closely related species, and also has sufficient resolution to reveal population structure within species. In tests on independent samples with over 80% amplicon coverage, NNoVAE correctly classifies to species level 98% of samples within the An. gambiae complex and 89% of samples outside the complex. We apply NNoVAE to over two thousand new samples from Burkina Faso and Gabon, identifying unexpected species in Gabon. NNoVAE presents an approach that may be of value to other targeted sequencing panels, and is a method that will be used to survey Anopheles species diversity and Plasmodium transmission patterns through space and time on a large scale, with plans to analyse half a million mosquitoes in the next five years.
Article
Aim The archipelago of Aotearoa displays both high biodiversity and a dynamic geologic history, shaped by constantly shifting coastlines and the dramatic effects of glacial cycling on forest cover across the islands. This geographic history has important implications for the evolution of dispersal‐limited forest‐dwelling arthropods, such as Opiliones, which can help us reconstruct key past biogeographic events. In this study, we shed light on the evolutionary history of the triaenonychid genus Algidia Hogg, 1920. Location The archipelago of Aotearoa|New Zealand. Time Period Late Cretaceous to the present‐day, with particular focus on events in the Oligocene onwards. Major Taxa Studied Algidia, Triaenonychidae, Opiliones, Arachnida. Methods We utilise an integrative phylobiogeographic approach, incorporating target enrichment sequence capture of ultraconserved elements, divergence dating, species delimitation and ecological niche modeling. Results Our genomic data in conjunction with divergence dating find evidence of high geographic structure and the influence of multiple key geologic events in the natural history of Aotearoa, including the origination and continuation of the Alpine Fault, marine transgression during the Oligocene and cycles of glaciation and orogeny that characterised the Pliocene and Pleistocene on the islands. Our results recover 10 putative species, including four that are undescribed. Paleoclimate modelling reflects geographic changes to Aotearoa's coastline which potentially underpin the modern distributions of Algidia, including land bridges in place of the current marine straits Raukawa Moana|Cook Strait and Te Ara‐a‐Kiwa|Foveaux Strait. Main Conclusions The phylogeny of Algidia indicates consistent northwards expansion, with the earliest diverging clade, A. homerica, located in Rakiura and southern Te Waipounamu, and subsequently diverging clades moving steadily northwards in their geographic distributions.
Diversification of Algidia predates the Oligocene Marine Transgression, lending support to the now well‐established hypothesis that Aotearoa was not fully submerged during the Oligocene. The Alpine Fault seems to be an important feature explaining cladogenesis and diverging populations, including for species found across Raukawa Moana. However, other phenomena, including glaciation, orogeny or continental shifting, are also important explanatory factors in species distributions across Aotearoa.
Article
Full-text available
Synopsis Artificial intelligence (AI) is poised to revolutionize many aspects of science, including the study of evolutionary morphology. While classical AI methods such as principal component analysis and cluster analysis have been commonplace in the study of evolutionary morphology for decades, recent years have seen increasing application of deep learning to ecology and evolutionary biology. As digitized specimen databases become increasingly prevalent and openly available, AI is offering vast new potential to circumvent long-standing barriers to rapid, big data analysis of phenotypes. Here, we review the current state of AI methods available for the study of evolutionary morphology, which are most developed in the area of data acquisition and processing. We introduce the main available AI techniques, categorizing them into 3 stages based on their order of appearance: (1) machine learning, (2) deep learning, and (3) the most recent advancements in large-scale models and multimodal learning. Next, we present case studies of existing approaches using AI for evolutionary morphology, including image capture and segmentation, feature recognition, morphometrics, and phylogenetics. We then discuss the prospectus for near-term advances in specific areas of inquiry within this field, including the potential of new AI methods that have not yet been applied to the study of morphological evolution. In particular, we note key areas where AI remains underutilized and could be used to enhance studies of evolutionary morphology. This combination of current methods and potential developments has the capacity to transform the evolutionary analysis of the organismal phenotype into evolutionary phenomics, leading to an era of “big data” that aligns the study of phenotypes with genomics and other areas of bioinformatics.
Article
Full-text available
Simple Summary We evaluated linear morphometry of male genitalia as a diagnostic method to distinguish the genera and species of Monomorium and Syllophopsis (Hymenoptera: Formicidae). We measured 10 morphometric characters on the male genitalia from 10 species of Monomorium and 5 species of Syllophopsis. We used three datasets, raw data, ratio data, and RAV data, and analyzed them using multivariate methods: hierarchical clustering (Ward’s method), Principal Component Analysis (PCA), Non-Metric Multidimensional Scaling analyses (NMDS), Linear Discriminant Analysis (LDA), and Conditional Inference Trees (CITs). The ratio data were most effective in separating the two genera, while the raw data were more effective at species-level delimitation. The findings highlighted the potential for a broader application of genitalia-based morphometric analyses in ant systematics. Abstract Morphometric analyses of male genitalia are routinely used to distinguish genera and species in beetles, butterflies, and flies, but are rarely used in ants, where most morphometric analyses focus on the external morphology of the worker caste. In this work, we performed linear morphometric analysis of the male genitalia to distinguish Monomorium and Syllophopsis in Madagascar. For 80 specimens, we measured 10 morphometric characters, especially on the paramere, volsella, and penisvalvae. Three datasets were made from linear measurements: mean (raw data), the ratios of characters (ratio data), and the Removal of Allometric Variance (RAV data). The following quantitative methods were applied to these datasets: hierarchical clustering (Ward’s method), unconstrained ordination methods including Principal Component Analysis (PCA), Non-Metric Multidimensional Scaling analyses (NMDS), Linear Discriminant Analysis (LDA), and Conditional Inference Trees (CITs). The results from statistical analysis show that the ratios proved to be the most effective approach for genus-level differentiation. 
However, the RAV method exhibited overlap between the genera. Meanwhile, the raw data facilitated more nuanced distinctions at the species level compared with the ratios and RAV approaches. The CITs revealed that the ratios of denticle length of the valviceps (SeL) to the paramere height (PaH) effectively distinguished between genera and identified key variables for species-level differentiation. Overall, this study shows that linear morphometric analysis of male genitalia is a useful data source for taxonomic delimitation.
Article
Full-text available
The biota of cave habitats faces heightened conservation risks, due to geographic isolation and high levels of endemism. Molecular datasets, in tandem with ecological surveys, have the potential to precisely delimit the nature of cave endemism and identify conservation priorities for microendemic species. Here, we sequenced ultraconserved elements of Tegenaria within, and at the entrances of, 25 cave sites to test phylogenetic relationships, combined with an unsupervised machine learning approach for detecting species. Our analyses identified clear and well-supported genetic breaks in the dataset that accorded closely with morphologically diagnosable units. Through these analyses, we also detected some previously unidentified, potential cryptic morphospecies. We then performed conservation assessments for seven troglobitic Israeli species of this genus and determined five of these to be critically endangered.
Preprint
Artificial intelligence (AI) is poised to transform many aspects of society, and the study of evolutionary morphology is no exception. Machine learning-grade methods of AI such as Principal Component Analysis (PCA) and Cluster Analysis have been commonplace in evolutionary morphology for decades, but the last decade has seen increasing application of Deep Learning to ecology and evolutionary biology, opening up the potential to circumvent longstanding barriers to rapid, big data analysis of phenotype. Here we review the current state of AI methods available for the study of evolutionary morphology and discuss the prospectus for near-term advances in specific subfields of this research area, including the potential of new AI methods that have not yet been applied to the study of morphological evolution. We introduce the main available AI techniques, categorising them into three stages based on their order of appearance: (i) Machine Learning, (ii) Deep Learning with neural networks and (iii) the most recent advancements in large-scale models and multimodal learning. Next, we present existing AI approaches and case studies using AI for evolutionary morphology, including image capture and segmentation, feature recognition, morphometrics, phylogenetics, and biomechanics. Finally, we discuss areas where there is potential, but no current application of AI to key areas in evolutionary morphology. Combined, these advancements and potential developments have the capacity to transform the evolutionary analysis of organismal phenotype into evolutionary phenomics, launch it fully in the “Big Data” sphere, and align it with genomics and other areas of bioinformatics.
Preprint
Full-text available
Species delimitation is the process of distinguishing between populations of the same species and distinct species of a particular group of organisms. Various methods exist for inferring species limits, with most of them being rooted in Coalescent Theory. Their primary goal is to identify independently evolving lineages that should represent separate species. Coalescent models have improved species delimitation by enabling explicit testing of hypotheses regarding evolutionary independence among lineages. However, they have some limitations, especially regarding complex evolutionary scenarios, large datasets, and varying genetic data types. In this context, machine learning (ML) can be considered as a promising analytical tool, and clearly provides an effective way to explore dataset structures when species-level divergences are hypothesised. In this review, we examine the use of ML in species delimitation and provide an overview and critical appraisal of existing workflows. We also provide simple explanations on how the main types of ML approaches operate, which should help researchers and students interested in the field. While current ML methods designed to infer species limits are analytically powerful, they also present specific limitations and should not be considered as definitive alternatives to traditional coalescent methods for species delimitation. For instance, there are clear limitations regarding the utilisation of simulated data, especially in supervised and deep learning approaches, and the type of data representation used by each ML approach. We then discuss the strengths and weaknesses of existing pipelines, propose best practices for the use of ML methods in species delimitation, and offer insights into potential future applications. Generative adversarial networks and domain adaptation techniques, for instance, could be used to partially address the misspecification issue related to simulating genetic data. 
Besides, integrating ML methods into the hypothesis testing process, alongside available coalescent-based methods, could enable a more comprehensive exploration of evolutionary models and parameters, improving the accuracy and biological interpretability of species delimitation analyses. Additionally, we suggest guidelines for enhancing the accessibility, effectiveness, and objectivity of ML in species delimitation processes, aiming to offer a transformative perspective on this subject.
Preprint
Aim We assessed the population genetic structure of the kleptoparasitic spider Argyrodes bonadea across the Southwestern Pacific islands. Our focus is on assessing the impact of overseas distances and, in particular, the Kerama Gap, as potential drivers of genetic differentiation. We found that the spider kleptoparasite’s switch to a specific host species is associated with significant genetic variation at fine scales, whereas the same species’ adoption of a generalist host strategy has likely facilitated its broad dispersal, colonization, and recent range expansion across the southwestern Pacific, and is associated with a lack of geographically structured genetic variation in these latter, subsequently-colonized landmasses. Location Southwestern Pacific Islands. Taxon Argyrodes bonadea. Methods We used mitochondrial Cytochrome Oxidase 1 (CO1) gene sequences and Restriction Site-associated DNA Sequencing (RAD-seq) for our analyses. Results Two strongly supported lineages, an Amami-Okinawa Lineage (AOL) and an Austral-Asia Lineage (AAL), correspond to two separate clades, roughly divided by the Kerama Gap, in phylogenetic trees estimated here. However, species delimitation led to the interpretation of only a single species present. The AOL exhibits complex, geographically-structured host web spider species specificity, wherein the Amami population utilizes Cyrtophora, but AOL samples in Okinawa associate exclusively with Nephila, and yet all broadly distributed AAL populations show no evidence of host web spider species specificity. Main conclusion The population boundary between AOL and AAL likely results from local adaptation to novel hosts, instead of isolation by the Kerama Gap, following long-distance dispersal and range expansion. Our results suggest kleptoparasitic spiders have the capacity to overcome permanent deep-sea barriers and colonize distant landmasses.
Whereas peripheral populations (AOL) demonstrate the capacity for specialization to a single host, which may have contributed to genetic differentiation and isolation, the broadly-distributed AAL persists and has successfully expanded its geographical range as a host generalist, which may contribute to ongoing gene flow inferred in this study.
Article
Defining species boundaries, or delimiting species, is a complex and often difficult task. Indeed, when such studies incorporate approaches that consider evolutionary mechanisms, there is much to be learned about species diversity and how the processes that play critical roles in speciation can impact species delineation. In 2021, a virtual workshop on species delimitation was held at the Smithsonian Institution National Museum of Natural History to train natural history scientists and taxonomists on the appropriate analytical tools that can be used to help delimit species when using molecular data. This perspective highlights some of the main themes discussed during that workshop while detailing three processes that can challenge any species delimitation study. Specifically, we discuss incomplete lineage sorting, gene flow, and population structure when delimiting species boundaries using molecular data. We highlight empirical studies and methodological approaches that have successfully met these challenges under various scenarios. Finally, we provide recommendations and considerations for undertaking species delimitation studies in a variety of taxa. To this end, we recommend that taxonomists fully embrace process-based species delimitation, which can provide important insights into speciation in their study systems. For those developing analytical approaches, we hope they consider incorporating less well-known taxa, such as marine invertebrates, into method testing. Marine invertebrates encompass many dark taxa across the tree of life yet represent the majority of animal phyla, many of which are vulnerable to extinction due to global ocean change. Thus, advancing species delimitation to address taxonomic revisions in these organisms will support conservation decisions on keystone ecosystems. 
Furthermore, the diversity of their life history strategies, the lack of obvious barriers to gene flow in the ocean environment, and their occurrence in isolated habitat patches can better inform our knowledge of speciation and the evolutionary processes that play a role in generating diversity in nature.
Article
Economically and agriculturally important fungal species exhibit various lifestyles, and they can switch their life modes depending on the habitat, host tolerance, and resource availability. Traditionally, fungal lifestyles have been determined based on observation at a particular host or habitat. Therefore, potential fungal pathogens have been neglected until they cause devastating impacts on human health, food security, and ecosystem stability. This study focused on the class Sordariomycetes to explore the genomic traits that could be used to determine the lifestyles of fungi and the possibility of predicting fungal lifestyles using machine learning algorithms. A total of 638 representative genomes encompassing 5 subclasses, 17 orders, and 50 families were selected and annotated. Through an extensive literature survey, the lifestyles of 553 genomes were determined, including plant pathogens, saprotrophs, entomopathogens, mycoparasites, endophytes, human pathogens and nematophagous fungi. We first examined the relationship between fungal lifestyles and transposable elements. We unexpectedly discovered that second-generation sequencing technologies tend to result in reduced size of transposable elements while having no discernible impact on the content of protein-coding genes. Then, we constructed three numerical matrices: 1) a basic genomic feature matrix including 25 features; 2) a functional protein matrix including 24 features; and 3) a combined matrix. Meanwhile, we reconstructed a genome-scale phylogeny, across which comprehensive comparative analyses were conducted. The results indicated that basic genomic features reflected phylogeny more than lifestyle, but the abundance of functional proteins exhibited relatively high discrimination not only in differentiating taxonomic groups at the higher levels but also in differentiating lifestyles.
Among the lifestyles examined (plant pathogens, saprotrophs, entomopathogens, mycoparasites, endophytes, and human pathogens), plant pathogens exhibited the largest secretomes, while entomopathogens had the smallest secretomes. The abundance of secretomes served as a valuable indicator for differentiating plant pathogens from mycoparasites, saprotrophs, and entomopathogens, as well as for discriminating endophytes from entomopathogens. Effectors have long been considered disease determinants, and indeed, we observed a higher presence of effectors in plant pathogens than in saprotrophs and entomopathogens. However, surprisingly, endophytes also exhibited a similar abundance of effectors, challenging their role as a reliable indicator for pathogenic fungi. A single functional protein group could not differentiate all lifestyles, but their combinations resulted in accurate differentiation for most lifestyles. Furthermore, models of six machine learning algorithms were trained, optimized, and evaluated based on the labeled genomes. The best-performing model was used to predict the lifestyle of 83 unlabeled genomes. Despite insufficient genome sampling for several lifestyles and inaccurate lifestyle assignments for some genomes, the predictive model still obtained a high degree of accuracy in differentiating plant pathogens. The predictive model can be further optimized with more sequenced genomes in the future and provide a more reliable prediction. It can serve as an early warning system, enabling the identification of potentially devastating fungi and facilitating the implementation of appropriate measures to prevent their spread. (Mycosphere 14(1): 1530-1563 (2023), www.mycosphere.org, ISSN 2077 7019, Doi 10.5943/mycosphere/14/1/17)
Article
The systematization of Maesa, a genus of almost 200 species, has haunted taxonomists for more than a century due to its lack of distinct qualitative characters or discontinuities in quantitative characters for species delimitation. The clarification of phylogenetic relationships in a problematic genus such as Maesa is essential to aid infrageneric classification and species delimitation. Here, a species‐level phylogenetic tree of Maesa is reconstructed. Leaf materials were sampled mainly from herbarium specimens, which cover 60% of the species across the entire distribution range of the genus. Targeted sequence capture with the Angiosperms353 probe set was used to acquire sequences for downstream bioinformatic analyses. We obtained a species tree inferred from 310 gene trees that divides Maesa into an African clade and an Asian‐Pacific clade. The African clade is further divided into two subclades, while the Asian‐Pacific clade is divided into three subclades; all subclades are well supported. Hence, we propose five subgenera of Maesa, namely M. subg. Maesa, subg. Indicae, subg. Monotaxis, subg. Papuanae and subg. Ramentaceae. In addition, we scrutinize some species complexes within the genus; however, given the lack of phylogenetic signal at shallow levels, we are unable to conclusively resolve all species boundaries in these complexes. This study provides the phylogenomic framework to untangle taxonomic problems in the genus Maesa and lays the foundation for further detailed studies in biogeography, trait evolution and population genetics.
Article
Nearly all lineages of land plants have experienced at least one whole-genome duplication (WGD) in their history. The legacy of these ancient WGDs is still observable in the diploidized genomes of extant plants. Genes originating from WGD (paleologs) can be maintained in diploidized genomes for millions of years. These paleologs have the potential to shape plant evolution through sub- and neofunctionalization, increased genetic diversity, and reciprocal gene loss among lineages. Current methods for classifying paleologs often rely on only a subset of potential genomic features, have varying levels of accuracy, and often require significant data and/or computational time. Here, we developed a supervised machine learning approach to classify paleologs from a target WGD in diploidized genomes across a broad range of different duplication histories. We collected empirical data on syntenic block sizes and other genomic features from 27 plant species each with a different history of paleopolyploidy. Features from these genomes were used to develop simulations of syntenic blocks and paleologs to train a gradient boosted decision tree. Using this approach, Frackify (Fractionation Classify), we were able to accurately identify and classify paleologs across a broad range of parameter space, including cases with multiple overlapping WGDs. We then compared Frackify with other paleolog inference approaches in six species with paleotetraploid and paleohexaploid ancestries. Frackify provides a way to combine multiple genomic features to quickly classify paleologs while providing a high degree of consistency with existing approaches.
Poster
In macrobiology, including taxonomy, ecology, and evolutionary biology, we frequently confront missing values in both phenotype and genotype data, which makes the treatment of missing data an important issue. In particular, it is common for either the genotype or the phenotype data of an individual sample to be absent due to practical limitations in the data collection procedure. In this study, we explore the idea of reciprocally imputing missing genotypes and phenotypes by linking information from these two domains using machine learning techniques for dimension reduction. Specifically, we independently reduced the dimensions of the genotype and phenotype data using principal component analysis (PCA) and an autoencoder, respectively, and created a map between the reduced genotype and phenotype data using projection Procrustes analysis. We then passed the transformed genotype coordinates to the decoder trained on phenotypes to reconstruct the missing phenotype data. For testing, we applied this statistical framework to two datasets: mtDNA COI (cytochrome oxidase I) genotypes paired with 65 morphometric phenotypes of Kaolinonychus, Korean harvestmen, and whole-genome sequencing genotypes paired with behavioral experimental phenotypes of Drosophila melanogaster from the DGRP2 dataset. In both cases, the error rates were significantly lower than those of the commonly used mean and median imputations. The performance of our method seems to depend greatly on the similarity of the high-dimensional distributions of the genotype and phenotype data. We expect this method to help alleviate missing data issues in macrobiology. Video: https://www.youtube.com/watch?v=V8cog--jOiQ&t=83s
Article
Deep learning is driving recent advances behind many everyday technologies, including speech and image recognition, natural language processing and autonomous driving. It is also gaining popularity in biology, where it has been used for automated species identification, environmental monitoring, ecological modelling, behavioural studies, DNA sequencing and population genetics and phylogenetics, among other applications. Deep learning relies on artificial neural networks for predictive modelling and excels at recognizing complex patterns. In this review we synthesize 818 studies using deep learning in the context of ecology and evolution to give a discipline‐wide perspective necessary to promote a rethinking of inference approaches in the field. We provide an introduction to machine learning and contrast it with mechanistic inference, followed by a gentle primer on deep learning. We review the applications of deep learning in ecology and evolution and discuss its limitations and efforts to overcome them. We also provide a practical primer for biologists interested in including deep learning in their toolkit and identify its possible future applications. We find that deep learning is being rapidly adopted in ecology and evolution, with 589 studies (64%) published since the beginning of 2019. Most use convolutional neural networks (496 studies) and supervised learning for image identification but also for tasks using molecular data, sounds, environmental data or video as input. More sophisticated uses of deep learning in biology are also beginning to appear. Operating within the machine learning paradigm, deep learning can be viewed as an alternative to mechanistic modelling. It has desirable properties of good performance and scaling with increasing complexity, while posing unique challenges such as sensitivity to bias in input data. 
We expect that rapid adoption of deep learning in ecology and evolution will continue, especially in automation of biodiversity monitoring and discovery and inference from genetic data. Increased use of unsupervised learning for discovery and visualization of clusters and gaps, simplification of multi‐step analysis pipelines, and integration of machine learning into graduate and postgraduate training are all likely in the near future.
Article
Many recent species delimitation studies rely exclusively on limited analyses of genetic data analyzed under the multispecies coalescent (MSC) model, and results from these studies often are regarded as conclusive support for taxonomic changes. However, most MSC-based species delimitation methods have well-known and often unmet assumptions. Uncritical application of these genetic-based approaches (without due consideration of sampling design, the effects of a priori group designations, isolation by distance, cytoplasmic-nuclear mismatch, and population structure) can lead to over-splitting of species. Here, we argue that in many common biological scenarios, researchers must be particularly cautious regarding these limitations, especially in cases of well-studied, geographically variable, and parapatrically-distributed species complexes. We consider these points with respect to a historically controversial species group, the American milksnakes (Lampropeltis triangulum complex), using genetic data from a recent analysis (Ruane et al. 2014; Syst. Biol. 63:231-250). We show that over-reliance on the program BPP, without adequate consideration of its assumptions and of sampling limitations, resulted in over-splitting of species in this study. Several of the hypothesized species of milksnakes instead appear to represent arbitrary slices of continuous geographic clines. We conclude that the best available evidence supports three, rather than seven, species within this complex. More generally, we recommend that coalescent-based species delimitation studies incorporate thorough analyses of geographic variation and carefully examine putative contact zones among delimited species before making taxonomic changes.
Article
We discuss the fauna of New Caledonia in the context of the prolonged submergence of Grande Terre until its re‐emergence around 37 million years ago and whether the resulting fauna can be entirely explained by over‐water dispersal. The current literature discussing the predominant neoendemism in New Caledonia is reviewed, questioning some of the discourse about how the fact that most animal and plant lineages are neoendemics should weigh in to disregard the fewer cases of paleoendemism (clades that have persisted and diversified in New Caledonia for over 37 million years). We argue that many of the examples used in the literature, selected for other purposes, were not chosen to test this particular hypothesis, but several old lineages of non‐vagile animals show that a non‐trivial number of clades have a history that predates the supposed emergence of New Caledonia. We conclude by posing the question of how much additional evidence should be needed to demonstrate a discordance between the geological history of the archipelago and the evolutionary history of its biota.
Article
The atypoid mygalomorphs include spiders from three described families that build a diverse array of entrance web constructs, including funnel-and-sheet webs, purse webs, trapdoors, turrets and silken collars. Molecular phylogenetic analyses have generally supported the monophyly of Atypoidea, but prior studies have not sampled all relevant taxa. Here we generated a dataset of ultraconserved element loci for all described atypoid genera, including taxa (Mecicobothrium and Hexurella) key to understanding familial monophyly, divergence times, and patterns of entrance web evolution. We show that the conserved regions of the arachnid UCE probe set target exons, such that it should be possible to combine UCE and transcriptome datasets in arachnids. We also show that different UCE probes sometimes target the same protein, and under the matching parameters used here show that UCE alignments sometimes include non-orthologs. Using multiple curated phylogenomic matrices we recover a monophyletic Atypoidea, and reveal that the family Mecicobothriidae comprises four separate and divergent lineages. Fossil-calibrated divergence time analyses suggest ancient Triassic (or older) origins for several relictual atypoid lineages, with late Cretaceous/early Tertiary divergences within some genera indicating a high potential for cryptic species diversity. The ancestral entrance web construct for atypoids, and all mygalomorphs, is reconstructed as a funnel-and-sheet web.
Article
Rapid and reliable identification of insects is important in many contexts, from the detection of disease vectors and invasive species to the sorting of material from biodiversity inventories. Because of the shortage of adequate expertise, there has long been an interest in developing automated systems for this task. Previous attempts have been based on laborious and complex handcrafted extraction of image features, but in recent years it has been shown that sophisticated convolutional neural networks (CNNs) can learn to extract relevant features automatically, without human intervention. Unfortunately, reaching expert-level accuracy in CNN identifications requires substantial computational power and huge training datasets, which are often not available for taxonomic tasks. This can be addressed using feature transfer: a CNN that has been pretrained on a generic image classification task is exposed to the taxonomic images of interest, and information about its perception of those images is used in training a simpler, dedicated identification system. Here, we develop an effective method of CNN feature transfer, which achieves expert-level accuracy in taxonomic identification of insects with training sets of 100 images or less per category, depending on the nature of dataset. Specifically, we extract rich representations of intermediate to high-level image features from the CNN architecture VGG16 pretrained on the ImageNet dataset. This information is submitted to a linear support vector machine classifier, which is trained on the target problem. We tested the performance of our approach on two types of challenging taxonomic tasks: (1) identifying insects to higher groups when they are likely to belong to subgroups that have not been seen previously; and (2) identifying visually similar species that are difficult to separate even for experts. 
For the first task, our approach reached > 92 % accuracy on one dataset (884 face images of 11 families of Diptera, all specimens representing unique species), and > 96 % accuracy on another (2936 dorsal habitus images of 14 families of Coleoptera, over 90 % of specimens belonging to unique species). For the second task, our approach outperformed a leading taxonomic expert on one dataset (339 images of three species of the Coleoptera genus Oxythyrea; 97 % accuracy), and both humans and traditional automated identification systems on another dataset (3845 images of nine species of Plecoptera larvae; 98.6 % accuracy). Reanalyzing several biological image identification tasks studied in the recent literature, we show that our approach is broadly applicable and provides significant improvements over previous methods, whether based on dedicated CNNs, CNN feature transfer, or more traditional techniques. Thus, our method, which is easy to apply, can be highly successful in developing automated taxonomic identification systems even when training datasets are small and computational budgets limited. We conclude by briefly discussing some promising CNN-based research directions in morphological systematics opened up by the success of these techniques in providing accurate diagnostic tools.
Article
Population-scale genomic data sets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date, most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g., only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here, we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods and offer a new path forward in cases where no likelihood approach exists.
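The idea of treating an alignment as an image can be illustrated with a minimal sketch. The abstract does not say how sequences are encoded, so the scheme below is an assumption made purely for illustration: each site is coded 0 where a sequence carries the most common base at that site and 1 otherwise, giving a rows-by-sites binary matrix that a CNN could take as a single-channel image. The function name `alignment_to_matrix` is hypothetical.

```python
from collections import Counter


def alignment_to_matrix(seqs):
    """Encode an alignment as a binary matrix (list of rows).

    Assumed encoding, not taken from the paper: 0 = major (most common)
    base at that site, 1 = any other base.
    """
    if not seqs:
        raise ValueError("alignment is empty")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")

    # Transpose the alignment so each tuple holds one site (column).
    columns = list(zip(*seqs))
    # Most common base per site.
    majors = [Counter(col).most_common(1)[0][0] for col in columns]

    return [
        [0 if base == majors[j] else 1 for j, base in enumerate(seq)]
        for seq in seqs
    ]


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "TCGT"]
    for row in alignment_to_matrix(toy):
        print(row)
```

A matrix like this keeps every site and every sequence, which is the point made in the abstract about using the whole alignment rather than a few summary statistics.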
Article
Advances in single-cell technologies have enabled high-resolution dissection of tissue composition. Several tools for dimensionality reduction are available to analyze the large number of parameters generated in single-cell studies. Recently, a nonlinear dimensionality-reduction technique, uniform manifold approximation and projection (UMAP), was developed for the analysis of any type of high-dimensional data. Here we apply it to biological data, using three well-characterized mass cytometry and single-cell RNA sequencing datasets. Comparing the performance of UMAP with five other tools, we find that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters. The work highlights the use of UMAP for improved visualization and interpretation of single-cell data.
Preprint
The well-documented, species-rich, and diverse group of ants (Formicidae) are important ecological bioindicators for species richness, ecosystem health, and biodiversity, but ant species identification is complex and requires specific knowledge. In the past few years, insect identification from images has seen increasing interest and success, with processing speed improving and costs lowering. Here we propose deep learning (in the form of a convolutional neural network (CNN)) to classify ants at species level using AntWeb images. We used an Inception-ResNet-V2-based CNN to classify ant images, and three shot types with 10,204 images for 97 species, in addition to a multi-view approach, for training and testing the CNN while also testing a worker-only set and an AntWeb protocol-deviant test set. Top 1 accuracy reached 62% - 81%, top 3 accuracy 80% - 92%, and genus accuracy 79% - 95% on species classification for different shot type approaches. The head shot type outperformed other shot type approaches. Genus accuracy was broadly similar to top 3 accuracy. Removing reproductives from the test data improved accuracy only slightly. Accuracy on AntWeb protocol-deviant data was very low. In addition, we make recommendations for future work concerning image threshold, distribution, and quality, multi-view approaches, metadata, and on protocols; potentially leading to higher accuracy with less computational effort.
Article
Recent simulation studies examining the performance of Bayesian species delimitation as implemented in the BPP program have suggested that BPP may detect population splits but not species divergences and that it tends to over-split when data of many loci are analyzed. Here we confirm these results and provide the mathematical justifications. We point out that the distinction between population and species splits made in the protracted speciation model has no influence on the generation of gene trees and sequence data, which explains why no method can use such data to distinguish between population splits and speciation. We suggest that the protracted speciation model is unrealistic as its mechanism for assigning species status assumes instantaneous speciation, contradicting prevailing taxonomic practice. We confirm the suggestion, based on simulation, that in the case of speciation with gene flow, Bayesian model selection as implemented in BPP tends to detect population splits when the amount of data (the number of loci) increases. We discuss the use of a recently proposed empirical genealogical divergence index (gdi) for species delimitation and illustrate that parameter estimates produced by a full likelihood analysis as implemented in BPP provide much more reliable inference under the gdi than the approximate method phrapl. We distinguish between Bayesian model selection and parameter estimation, and suggest that the model selection approach is useful for identifying sympatric cryptic species while the parameter estimation approach may be used to implement empirical criteria for determining species status among allopatric populations.
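As a small worked illustration of the gdi mentioned above: the abstract does not give the formula, so the sketch below assumes the form commonly stated in the species delimitation literature, gdi = 1 - exp(-2τ/θ), where τ is the divergence time and θ the population size parameter (both in expected substitutions per site). The cut-offs of 0.2 and 0.7 are the commonly quoted rule-of-thumb thresholds and are likewise an assumption here. The function names are hypothetical.

```python
import math


def gdi(tau, theta):
    """Genealogical divergence index under the assumed form 1 - exp(-2*tau/theta)."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return 1.0 - math.exp(-2.0 * tau / theta)


def interpret_gdi(value, low=0.2, high=0.7):
    """Apply rule-of-thumb thresholds (assumed, not from the abstract)."""
    if value < low:
        return "single species"
    if value > high:
        return "distinct species"
    return "ambiguous"


if __name__ == "__main__":
    # Hypothetical parameter estimates, e.g. posterior means from a coalescent analysis.
    for tau, theta in [(0.001, 0.01), (0.003, 0.01), (0.01, 0.01)]:
        value = gdi(tau, theta)
        print(f"tau={tau}, theta={theta}: gdi={value:.3f} -> {interpret_gdi(value)}")
```

Because the index depends only on the ratio τ/θ, its reliability rests on how well those two parameters are estimated, which is why the choice of estimation method matters in the comparison the abstract describes.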
Article
Morphological, mitochondrial, and nuclear phylogenomic data were combined to address phylogenetic and species delimitation questions in cave-limited Cicurina spiders from central Texas. Special effort was focused on specimens and cave locations in the San Antonio region (Bexar County), home to four eyeless species listed as US Federally Endangered. Sequence capture experiments resulted in the recovery of ~200–400 homologous ultra-conserved element (UCE) nuclear loci across taxa, and nearly complete COI mitochondrial DNA sequences from the same set of individuals. Some of these nuclear and mitochondrial sequences were recovered from “standard” museum specimens without special preservation of DNA material, including museum specimens preserved in the 1990s. Multiple phylogenetic analyses of the UCE data agree in the recovery of two major lineages of eyeless Cicurina in Texas. These lineages also differ in mitochondrial clade membership, female genitalic morphology, degree of troglomorphy (as measured by relative leg length), and are mostly allopatric across much of Texas. Rare sympatry was confirmed in Bexar County, where members of the two major clades sometimes co-exist in the same karst feature. Both nuclear phylogenomic and mitochondrial data indicate the existence of undescribed species from the San Antonio region, although further sampling and collection of adult specimens is needed to explicitly test these hypotheses. Our data support the two following species synonymies (Cicurina venii Gertsch, 1992 = Cicurina madla Gertsch, 1992; Cicurina loftini Cokendolpher, 2004 = Cicurina vespera Gertsch, 1992), formally proposed here. Overall, our taxonomy-focused research has many important conservation implications, and again highlights the fundamental importance of robust taxonomy in conservation research.
Article
Molecular phylogenetics has transitioned into the phylogenomic era, with data derived from next-generation sequencing technologies allowing unprecedented phylogenetic resolution in all animal groups, including understudied invertebrate taxa. Within the most diverse harvestmen suborder, Laniatores, most relationships at all taxonomic levels have yet to be explored from a phylogenomics perspective. Travunioidea is an early-diverging lineage of laniatorean harvestmen with a Laurasian distribution, with species distributed in eastern Asia, eastern and western North America, and south-central Europe. This clade has had a challenging taxonomic history, but the current classification consists of ~77 species in three families, the Travuniidae, Paranonychidae, and Nippononychidae. Travunioidea classification has traditionally been based on structure of the tarsal claws of the hind legs. However, it is now clear that tarsal claw structure is a poor taxonomic character due to homoplasy at all taxonomic levels. Here, we utilize DNA sequences derived from capture of ultraconserved elements (UCEs) to reconstruct travunioid relationships. Data matrices consisting of 317–677 loci were used in maximum likelihood, Bayesian, and species tree analyses. Resulting phylogenies recover four consistent and highly supported clades; the phylogenetic position and taxonomic status of the enigmatic genus Yuria is less certain. Based on the resulting phylogenies, a revision of Travunioidea is proposed, now consisting of the Travuniidae, Cladonychiidae, Paranonychidae (Nippononychidae is synonymized), and the new family Cryptomastridae Derkarabetian & Hedin, fam. n., diagnosed here. The phylogenetic utility and diagnostic features of the intestinal complex and male genitalia are discussed in light of phylogenomic results, and the inappropriateness of the tarsal claw in diagnosing higher-level taxa is further corroborated.
Article
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP as described has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.
Article
As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.
Article
Full-text available
Species delimitation has been divided between two approaches: DNA barcoding that focuses on standardization of the genetic marker and multilocus methods that place a premium on genomic coverage and conceptual rigor in modeling the divergence process. Most multilocus methods fail as barcodes, however, because few assay the same marker set and are therefore not readily comparable across studies and databases. We introduce ultraconserved elements (UCEs) as potential genomic barcodes that provide a bridge to DNA barcoding databases, allowing both rigorous species delimitation and standardized identification of delimited taxa. UCEs query thousands of loci across the nuclear genome in a way that is replicable across broad taxonomic groups (i.e., vertebrates). We apply UCEs to species delimitation in a species complex of frogs found in the Mexican Highlands. Sarcohyla contains 24 described species, many of which are critically endangered and known only from their type localities. Eviden
Article
Full-text available
Phylogeographic datasets have grown from tens to thousands of loci in recent years, but extant statistical methods do not take full advantage of these large datasets. For example, Approximate Bayesian Computation (ABC) is a commonly used method for the explicit comparison of alternate demographic histories, but it is limited by the 'curse of dimensionality' and issues related to the simulation and summarization of data when applied to next-generation sequencing (NGS) datasets. We implement here several improvements to overcome these difficulties. We use a Random Forest (RF) classifier for model selection to circumvent the curse of dimensionality and apply a binned representation of the multidimensional site frequency spectrum (mSFS) to address issues related to the simulation and summarization of large SNP datasets. We evaluate the performance of these improvements using simulation and find low overall error rates (~ 7%). We then apply the approach to data from Haplotrema vancouverense, a land snail endemic to the Pacific Northwest of North America. Fifteen demographic models were compared, and our results support a model of recent dispersal from coastal to inland rainforests. Our results demonstrate that binning is an effective strategy for the construction of a mSFS and imply that the statistical power of RF when applied to demographic model selection is at least comparable to traditional ABC algorithms. Importantly, by combining these strategies, large sets of models with differing numbers of populations can be evaluated. This article is protected by copyright. All rights reserved.
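The binning idea in the abstract above can be illustrated with a minimal sketch. It assumes two populations, derived-allele counts per SNP as input, and roughly equal-width bins; the function names, the toy data, and the binning rule are illustrative assumptions, not the scheme used in the cited study.

```python
from collections import Counter


def bin_index(count, n_chromosomes, n_bins):
    """Map an allele count in 0..n_chromosomes to one of n_bins roughly equal-width bins."""
    return min(count * n_bins // (n_chromosomes + 1), n_bins - 1)


def binned_joint_sfs(counts_pop1, counts_pop2, n1, n2, n_bins):
    """Tally SNPs by the pair (bin of count in pop 1, bin of count in pop 2)."""
    binned = Counter()
    for c1, c2 in zip(counts_pop1, counts_pop2):
        binned[(bin_index(c1, n1, n_bins), bin_index(c2, n2, n_bins))] += 1
    return binned


# Hypothetical data: derived-allele counts at six SNPs,
# with 10 chromosomes sampled per population.
pop1 = [0, 1, 5, 10, 3, 9]
pop2 = [2, 0, 5, 10, 8, 9]
sfs = binned_joint_sfs(pop1, pop2, n1=10, n2=10, n_bins=3)
```

With 10 chromosomes per population the unbinned joint spectrum has 11 × 11 = 121 cells, whereas three bins per population leave at most 9. That smaller feature vector is what would be handed to a classifier.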
Article
Full-text available
Motivation: Genome sequencing projects sometimes uncover more organisms than expected, especially for complex and/or non-model organisms. It is therefore useful to develop software to identify mixtures of organisms in genome sequence assemblies. Here we present PhylOligo, a new package including tools to explore, identify and extract organism-specific sequences in a genome assembly using the analysis of their DNA compositional characteristics. Availability: The tools are written in Python3 and R under the GPLv3 Licence and can be found at https://github.com/itsmeludo/Phyloligo/
Article
Full-text available
Identifying units of biological diversity is a major goal of organismal biology. An increasing literature has focused on the importance of cryptic diversity, defined as the presence of deeply diverged lineages within a single species. While most discoveries of cryptic lineages proceed on a taxon-by-taxon basis, rapid assessments of biodiversity are needed to inform conservation policy and decision-making. Here, we introduce a predictive framework for phylogeography that allows rapid identification of cryptic diversity. Our approach proceeds by collecting environmental, taxonomic and genetic data from codistributed taxa with known phylogeographic histories. We define these taxa as a reference set, and categorize them as either harbouring or lacking cryptic diversity. We then build a random forest classifier that allows us to predict which other taxa endemic to the same biome are likely to contain cryptic diversity. We apply this framework to data from two sets of disjunct ecosystems known to harbour taxa with cryptic diversity: the mesic temperate forests of the Pacific Northwest of North America and the arid lands of Southwestern North America. The predictive approach presented here is accurate, with prediction accuracies between 65% and 98.79% depending on the ecosystem. This seems to indicate that our method can be successfully used to address ecosystem-level questions about cryptic diversity. Further, our application for the prediction of the cryptic/non-cryptic nature of unknown species is easily applicable and provides results that agree with recent discoveries from those systems. Our results demonstrate that the transition of phylogeography from a descriptive to a predictive discipline is possible and effective. © 2016 The Author(s) Published by the Royal Society. All rights reserved.
Article
Full-text available
Finite mixture models are being used increasingly to model a wide variety of random phenomena for clustering, classification and density estimation. mclust is a powerful and popular package which allows modelling of data as a Gaussian finite mixture with different covariance structures and different numbers of mixture components, for a variety of purposes of analysis. Recently, version 5 of the package has been made available on CRAN. This updated version adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.
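To make the EM fitting mentioned above concrete, here is a minimal sketch of expectation-maximisation for a one-dimensional mixture of two Gaussians. It is not the mclust implementation or its API: the function names, the initialisation from the data extremes, the variance floor, and the toy data are all assumptions made for illustration. It also omits the covariance structures and model selection that the abstract describes.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_mixture(data, n_iter=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    means = [min(data), max(data)]  # crude initialisation (assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, min_var)  # floor avoids a degenerate component
    return weights, means, variances


# Hypothetical, well-separated toy data.
data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
weights, means, variances = fit_two_component_mixture(data)
```

On data this cleanly separated the two component means settle on the two group averages. Real data would call for several starting points and a criterion such as BIC to choose the number of components.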
Article
Full-text available
Aim: Animals' phylogeographical patterns are frequently explained by Pleistocene glacial fluctuations and topographical environments. However, species‐specific biological traits are thought to have profound impacts on distribution patterns, particularly in aphids. We hypothesize that the phylogeographical patterns and/or population dynamics of two sympatric aphids may be different due to their different reproductive modes and feeding sites, even though they share the same hosts and environmental conditions. Location: China. Methods: We explored our hypothesis in Chaitophorus saliniger and Tuberolachnus salignus, two aphids that share the same host plants (genus Salix) but differ biologically. Chaitophorus saliniger is characterized by alternating sexual and asexual reproduction and only feeds on willow leaves, whereas T. salignus has obligate asexual reproduction and feeds on trunks and branches. The genetic diversity, population structure and demographic history of the aphids were analysed based on both mitochondrial DNA (cytochrome c oxidase subunit I and cytochrome b) and nuclear DNA (translation elongation factor 1 alpha). Ecological niche models (ENMs) were used to explore historical changes in distribution. The chief environmental variables that discriminate the different haplogroups were identified through multivariate statistical analysis. Results: There were striking differences in the phylogeographical patterns between the species. The sexual C. saliniger exhibited higher genetic diversity and population variations than the asexual T. salignus. According to genetic analyses and ENMs, both species experienced glacial contraction and post‐glacial expansion. Multivariate statistical analysis revealed that the climatic differences between the divergent haplogroups were explained by principal components mainly loaded with temperature and elevation.
Main conclusions: Our results suggest that species‐specific biological traits and historical climate fluctuations have both shaped the current phylogeographical patterns of both aphid species. Their distinct genetic diversity and population structures highlight the importance of intrinsic biological features in driving phylogeographical patterns.
Article
Full-text available
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production; we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
Article
Full-text available
Detecting the targets of adaptive natural selection from whole genome sequencing data is a central problem for population genetics. However, to date most methods have shown sub-optimal performance under realistic demographic scenarios. Moreover, over the past decade there has been a renewed interest in determining the importance of selection from standing variation in adaptation of natural populations, yet very few methods for inferring this model of adaptation at the genome scale have been introduced. Here we introduce a new method, S/HIC, which uses supervised machine learning to precisely infer the location of both hard and soft selective sweeps. We show that S/HIC has unrivaled accuracy for detecting sweeps under demographic histories that are relevant to human populations, and distinguishing sweeps from linked as well as neutrally evolving regions. Moreover, we show that S/HIC is uniquely robust among its competitors to model misspecification. Thus, even if the true demographic model of a population differs catastrophically from that specified by the user, S/HIC still retains impressive discriminatory power. Finally, we apply S/HIC to the case of resequencing data from human chromosome 18 in a European population sample, and demonstrate that we can reliably recover selective sweeps that have been identified earlier using less specific and sensitive methods.
Article
Full-text available
Motivation: Approximate Bayesian computation (ABC) methods provide an elaborate approach to Bayesian inference on complex models, including model choice. Both theoretical arguments and simulation experiments indicate, however, that model posterior probabilities may be poorly evaluated by standard ABC techniques. Results: We propose a novel approach based on a machine learning tool named random forests to conduct selection among the highly complex models covered by ABC algorithms. We thus modify the way Bayesian model selection is both understood and operated, in that we rephrase the inferential goal as a classification problem, first predicting the model that best fits the data with random forests and postponing the approximation of the posterior probability of the predicted MAP for a second stage also relying on random forests. Compared with earlier implementations of ABC model choice, the ABC random forest approach offers several potential improvements: (i) it often has a larger discriminative power among the competing models, (ii) it is more robust against the number and choice of statistics summarizing the data, (iii) the computing effort is drastically reduced (with a gain in computation efficiency of at least fifty), and (iv) it includes an approximation of the posterior probability of the selected model. The call to random forests will undoubtedly extend the range of size of datasets and complexity of models that ABC can handle. We illustrate the power of this novel methodology by analyzing controlled experiments as well as genuine population genetics datasets. Availability: The proposed methodologies are implemented in the R package abcrf available on the CRAN.
Article
Full-text available
Background: The human gastrointestinal tract harbors a diverse microbial community, in which metabolic phenotypes play important roles for the human host. Recent developments in meta-omics attempt to unravel metabolic roles of microbes by linking genotypic and phenotypic characteristics. This connection, however, still remains poorly understood with respect to its evolutionary and ecological context. Results: We generated automatically refined draft genome-scale metabolic models of 301 representative intestinal microbes in silico. We applied a combination of unsupervised machine-learning and systems biology techniques to study individual and global differences in genomic content and inferred metabolic capabilities. Based on the global metabolic differences, we found that energy metabolism and membrane synthesis play important roles in delineating different taxonomic groups. Furthermore, we found an exponential relationship between phylogeny and the reaction composition, meaning that closely related microbes of the same genus can exhibit pronounced differences with respect to their metabolic capabilities while at the family level only marginal metabolic differences can be observed. This finding was further substantiated by the metabolic divergence within different genera. In particular, we could distinguish three sub-type clusters based on membrane and energy metabolism within the Lactobacilli as well as two clusters within the Bifidobacteria and Bacteroides. Conclusions: We demonstrate that phenotypic differentiation within closely related species could be explained by their metabolic repertoire rather than their phylogenetic relationships. These results have important implications in our understanding of the ecological and evolutionary complexity of the human gastrointestinal microbiome. Electronic supplementary material: The online version of this article (doi:10.1186/s40168-015-0121-6) contains supplementary material, which is available to authorized users.
Article
Full-text available
Theory and empirical evidence clearly indicate that phylogenies (trees) of different genes (loci) should not display precisely matched topologies. The main reason for such phylogenetic incongruence is the reticulated evolutionary history of most species due to meiotic sexual recombination in eukaryotes, or horizontal transfers of genetic materials in prokaryotes. Nevertheless, most genes should display topologically related phylogenies, and should group into one or more (for genetic hybrids) clusters in the "tree space." In this paper we propose to apply the normalized-cut (Ncut) clustering algorithm to the set of gene trees with the geodesic distance between trees over the Billera-Holmes-Vogtmann (BHV) tree space. We first show by simulated data sets that the Ncut algorithm accurately clusters the set of gene trees given a species tree under the coalescent process, and show that the Ncut algorithm works better on the gene trees reconstructed via the neighbor-joining method than on those reconstructed via the maximum likelihood estimator under the evolutionary models. Moreover, we apply the methods to a genome-wide data set (1290 genes encoding 690,838 amino acid residues) on coelacanths, lungfishes, and tetrapods. The result suggests that there are two clusters in the data set. Finally we reconstruct the consensus trees from these two clusters; the consensus tree constructed from one cluster has the tree topology that coelacanths are most closely related to the tetrapods, and the consensus tree from the other includes an irresolvable trichotomy over the coelacanth, lungfish, and tetrapod lineages, suggesting divergence within a very short time interval.
Article
Full-text available
Microhexura montivaga is a miniature tarantula-like spider endemic to the highest peaks of the southern Appalachian mountains, and is known only from six allopatric, highly disjunct montane populations. Because of severe declines in Spruce-fir forest in the late 20th century, M. montivaga was formally listed as a US Federally Endangered species in 1995. Using DNA sequence data from one mitochondrial and seven nuclear genes, patterns of multigenic genetic divergence were assessed both within and among six montane populations. Independent mitochondrial and nuclear discovery analyses reveal obvious genetic fragmentation both within and among montane populations, with five to seven primary genetic lineages recovered. Multispecies coalescent validation analyses (guide-tree and unguided Bayesian Phylogenetics and Phylogeography (BPP), Bayes factor delimitation (BFD)) using nuclear-only data congruently recover six or seven distinct lineages; BFD analyses using combined nuclear plus mitochondrial data favor seven or eight lineages. In stark contrast to this clear genetic fragmentation, a survey of secondary sexual features for available males indicates morphological conservatism across montane populations. While it is certainly possible that morphologically cryptic speciation has occurred in this taxon, this system may alternatively represent a case where extreme population genetic structuring (but not speciation) leads to an over-splitting of lineage diversity by multispecies coalescent methods. Our results have clear conservation implications for this Federally endangered taxon, and illustrate a methodological issue expected to become more common as genomic-scale datasets are gathered for taxa found in naturally fragmented habitats.
Article
Growing evidence for lineage diversification that occurs without strong ecological divergence (i.e., nonadaptive radiation) challenges assumptions about the buildup and maintenance of species in evolutionary radiations, particularly when ecologically similar and thus potentially competing species co-occur. Understanding nonadaptive radiations involves identifying conditions conducive to both the nonecological generation of species and the maintenance of co-occurring ecologically similar species. To borrow MacArthur's [1] (Challenging Biological Problems 1972;253–259) form of inquiry, the ecology of nonadaptive radiations can be understood as follows: for species of type A, in environments of type B, nonadaptive radiations may emerge. We review purported cases of nonadaptive radiation and suggest properties of organisms, resources, and landscapes that might be conducive to their origin and maintenance. These properties include poor dispersal ability and the ephemerality and patchiness of resources.
Article
Biodiversity monitoring is the standard for environmental impact assessment of anthropogenic activities. Several recent studies showed that high‐throughput amplicon sequencing of environmental DNA (eDNA metabarcoding) could overcome many limitations of the traditional morphotaxonomy‐based bioassessment. Recently, we demonstrated that supervised machine learning (SML) can be used to predict accurate biotic index values from eDNA metabarcoding data, regardless of the taxonomic affiliation of the sequences. However, it is unknown to what extent the accuracy of such models depends on the taxonomic resolution of molecular markers or how SML compares with metabarcoding approaches targeting well‐established bioindicator species. In this study, we address these issues by training predictive models upon five different ribosomal bacterial and eukaryotic markers and measuring their performance to assess the environmental impact of marine aquaculture on independent datasets. Our results show that all tested markers yield accurate predictive models, and that they all outperform the assessment relying solely on taxonomically assigned sequences. Remarkably, we did not find any significant difference in the performance of the models built using universal eukaryotic or prokaryotic markers. Using any molecular marker with a taxonomic range broad enough to comprise different potential bioindicator taxa, the SML approach can overcome the limits of taxonomy‐based eDNA bioassessment.
Preprint
Historically, investigations into the processes driving speciation have largely been isolated from systematic investigations into species limits. Recent advances in sequencing technology have led to a rapid increase in the availability of genomic data, and this, in turn, has led to the introduction of many novel methods for species delimitation. However, these methods have been limited to divergence-only scenarios and have not attempted to evaluate complex modes of speciation, such as those that include gene flow during early stages of divergence (sympatric speciation) or population size changes (founder effect speciation). To address this shortcoming, we introduce delimitR, an approach that enables biologists to infer species boundaries and evaluate the demographic processes that may have led to speciation. delimitR uses the binned multidimensional Site Frequency Spectrum and a machine-learning algorithm (Random Forests) to compare speciation models. We use simulations to evaluate the accuracy of delimitR. When comparing models that include lineage divergence and gene flow for three populations, error rates are near zero with recent divergence times (<100,000 generations) and a modest number of Single Nucleotide Polymorphisms (SNPs; 1,500). When applied to a more complex model set (including divergence, gene flow, and population size changes), error rates are moderate (~0.15 with 10,000 SNPs), and misclassifications are generally between highly similar models. We also evaluate the utility of delimitR using three previously published datasets and find results that corroborate previous findings. Our analyses indicate that delimitR can serve as an important conceptual bridge uniting various investigations into the process of speciation.
Article
Species are considered to be the basic unit of ecological and evolutionary studies. Since multi‐locus genomic data are increasingly available, there has been considerable interest in the use of DNA sequence data to delimit species. In this paper, we show that machine learning can be used for species delimitation. Our method treats the species delimitation problem as a classification problem for identifying the category of a new observation on the basis of training data. Extensive simulation is first conducted over a broad range of evolutionary parameters for training purposes. Each pair of known populations is combined to form training samples with a label of “same species” or “different species”. We use a Support Vector Machine (SVM) to train a classifier using a set of summary statistics computed from training samples as features. The trained classifier can classify a test sample into one of two outcomes: “same species” or “different species”. Given multi‐locus genomic data of multiple related organisms or populations, our method (called CLADES) performs species delimitation by first classifying pairs of populations. CLADES then delimits species by maximizing the likelihood of species assignment for multiple populations. CLADES is evaluated through extensive simulation and also tested on real genetic data. We show that CLADES is both accurate and efficient for species delimitation when compared with existing methods. CLADES can be useful especially when existing methods have difficulty in delimitation, e.g. with short species divergence time and gene flow.
Article
Capturing conserved genomic elements to shed light on deep evolutionary history is becoming the new gold standard for phylogenomic research. Ultraconserved elements are shared among distantly related organisms, allowing the capture of unprecedented amounts of genomic data of non‐model taxa. An underappreciated consequence of hybrid enrichment methods is the potential of introducing undetected DNA sequences from organisms outside the lineage of interest, facilitated through the high degree of conservation of the target regions. In this in silico study, we quantify ultraconserved loci using a data set of 400 published genomes. We utilized six newly designed UCE bait sets, tailored to various arthropod groups, and screened for shared conserved elements in all 242 currently published arthropod genomes. Additionally, we included a diverse set of other potential contaminating organisms, such as various species of fungi and bacteria. Our results show that specific UCE bait sets can capture genomic elements from vastly divergent lineages, including human DNA. Nonetheless, our in silico modeling demonstrates that sufficiently strict bioinformatic processing parameters effectively filter out unintentionally targeted DNA from taxa other than the focus group. Lastly, we characterize all of the 100 most widely shared UCE loci as highly conserved exonic regions. We give practical recommendations to address contamination in data sets generated through targeted‐enrichment.
Article
Harvestmen penises are surprisingly complex and diverse intromittent structures, making it difficult to imagine that the sole function of the penis is sperm transfer. Knowledge of penis morphology in harvestmen is largely derived from taxonomic works. Therefore, we have a considerable amount of information regarding genital structures, but relatively little is known about their movements and functions. In the case of female genitalia, there is an even deeper gap in the understanding of the morphology and its functional features. Female genitalia are commonly neglected due to the assumption that they are not a valuable source of information for taxonomy. In this scenario of fragmented morphological knowledge, complete and accurate descriptions of male and female genitalia are highly desirable, and are essential to understanding the functional morphology of copulatory organs in Phalangida. Our study improves upon the original genitalia descriptions (ovipositor and penis) for two Chilean Triaenonychidae species (Triaenonychoides cekalovici and Triaenonychoides breviops) and provides corrections to several original mistakes and misinterpretations. In addition, we report here that T. cekalovici was described based on the morphology of an everted penis, illustrate and compare penises in the resting and everted state, and describe for the first time the mechanical eversion of the penis' glans. Through a mechanical system, the muscular penis type can execute a remarkably complicated movement resulting in eversion of the glans. The contraction of the muscle connected to the rounded basis of the ventral plate (via the tendon) is responsible for the ventral scrolling of the ventral plate and this in turn triggers the eversion of the capsula interna. We predict that this elongation is necessary during genital coupling, for positioning the stylus closer to the seminal receptacles area.
Depositing the sperm closer to the seminal receptacles can be a decisive advantage, considering that in harvestmen the spermatozoa are immobile. We also suggest that this process of glans' muscular mechanical eversion could be widespread among Triaenonychidae and that the knowledge of functional morphology is relevant for taxonomic purposes, as this is probably not the only case of a taxonomic description based on everted genitalia.
Article
The relative roles of ecological niche conservatism versus niche divergence in promoting montane speciation remain an important topic in biogeography. Here, our aim was to test whether lineage diversification in a species complex of trapdoor spiders corresponds with riverine barriers or with an ecological gradient associated with elevational tiering. Aliatypus janus was sampled from throughout its range, with emphasis on populations in the southern Sierra Nevada Mountains of California. We collected multi-locus genetic data to generate a species tree for A. janus and its close relatives. Coalescent based hypothesis tests were conducted to determine if genetic breaks within A. janus conform to riverine barriers. Ecological niche models (ENM) under current and Last Glacial Maximum (LGM) conditions were generated and hypothesis tests of niche conservatism and divergence were performed. Coalescent analyses reveal deeply divergent genetic lineages within A. janus, likely corresponding to cryptic species. Two primary lineages meet along an elevational gradient on the western slopes of the southern Sierra Nevada Mountains. ENMs under both current and LGM conditions indicate that these groups occupy largely non-overlapping niches. ENM hypothesis testing rejected niche identity between the two groups, and supported a sharp ecological gradient occurring where the groups meet. However, the niche similarity test indicated that the two groups may not inhabit different niches from their backgrounds. The Sierra Nevada Mountains provide a natural laboratory for simultaneously testing ecological niche divergence and conservatism and their role in speciation across a diverse range of taxa. Aliatypus janus represents a species complex with cryptic lineages that may have diverged due to parapatric speciation along an ecological gradient, or been maintained by the evolution of ecological niche differences following allopatric speciation.
Conference Paper
How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contribution is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
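The reparameterization mentioned in this abstract can be sketched for the common case of a one-dimensional Gaussian latent variable. The abstract does not fix the distribution, so the Gaussian choice, the function names, and the example parameter values are assumptions made for illustration. The sample is written as a deterministic function of the parameters plus independent noise, so gradients could pass through the parameters. The second function is the closed-form KL term that typically appears in the lower bound under that Gaussian assumption.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a one-dimensional Gaussian."""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)


# Hypothetical parameters: mean 2.0 and variance 0.25 (standard deviation 0.5).
rng = random.Random(0)
samples = [reparameterize(2.0, math.log(0.25), rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

The randomness sits entirely in `eps`, so `mu` and `log_var` enter the sample only through ordinary arithmetic. This is what allows standard stochastic gradient methods to be applied.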
Article
Targeted enrichment of conserved genomic regions is a popular method for collecting large amounts of sequence data from non‐model taxa for phylogenetic, phylogeographic and population genetic studies. For example, two available bait sets each allow enrichment of thousands of orthologous loci from >20 000 species (Faircloth et al. Systematic Biology, 61, 717–726, 2012; Molecular Ecology Resources, 15, 489–501, 2015). Unfortunately, few open‐source workflows are available to identify conserved genomic elements shared among divergent taxa and to design enrichment baits targeting these regions. Those that do exist require extensive bioinformatics expertise and significant amounts of time to use. These shortcomings limit the application of targeted enrichment methods to additional organismal groups. Here, I describe a universal workflow for identifying conserved genomic regions in available genomic data and for designing targeted enrichment baits to collect data from these conserved regions. These methods require less expertise and less time, and make better use of commonly available information to identify conserved loci and design baits to capture them. I apply this computational approach to the understudied arthropod groups Arachnida, Coleoptera, Diptera, Hemiptera and Lepidoptera to identify thousands of conserved loci in each group and design target enrichment baits to capture these loci. I then use in silico analyses to demonstrate that targeted enrichment of the conserved loci can be used to reconstruct the accepted relationships among genome sequences from the focal arthropod orders. The software workflow I created allowed me to identify thousands of conserved loci in five diverse arthropod groups and design sequence capture baits to target them. This suite of capture bait designs should enable collection of phylogenomic data from >900 000 arthropod species.
Although the examples in this manuscript focus on understudied arthropod groups, the approach I describe is applicable to all organismal groups having some form of pre‐existing genomic information (e.g. other invertebrates, plants, fungi and microbes). Finally, the documentation, design steps, software code and bait sets developed here are available under an open‐source license for restriction‐free testing, use, and additional modification by any research group.
Article
We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
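The assignment step described in this abstract can be sketched under strong simplifying assumptions. The code below is not the published software: it assumes the allele frequencies of each population are already known (the published method infers them jointly with the assignments), ignores admixture, and uses a uniform prior over populations. It keeps the abstract's assumption of unlinked loci and additionally assumes Hardy-Weinberg proportions within populations, which is a common simplification rather than something stated in the abstract. All function and variable names are hypothetical.

```python
import math


def genotype_log_likelihood(genotype, locus_freqs):
    """Log-likelihood of a diploid multilocus genotype in one population.

    genotype: list of (allele1, allele2) tuples, one per locus.
    locus_freqs: list of dicts mapping allele -> frequency, one per locus.
    Assumes Hardy-Weinberg proportions and independent (unlinked) loci.
    Frequencies must be non-zero for the observed alleles.
    """
    log_lik = 0.0
    for (a1, a2), freqs in zip(genotype, locus_freqs):
        prob = freqs[a1] * freqs[a2]
        if a1 != a2:
            prob *= 2.0  # two orderings for a heterozygote
        log_lik += math.log(prob)
    return log_lik


def assignment_posterior(genotype, population_freqs):
    """Posterior probability of each population of origin (uniform prior)."""
    logs = [genotype_log_likelihood(genotype, f) for f in population_freqs]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]  # stabilised softmax
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Made-up frequencies for two populations at two loci.
    pop_a = [{"A": 0.9, "a": 0.1}, {"B": 0.8, "b": 0.2}]
    pop_b = [{"A": 0.2, "a": 0.8}, {"B": 0.3, "b": 0.7}]
    individual = [("A", "A"), ("B", "b")]
    print(assignment_posterior(individual, [pop_a, pop_b]))
```

With these made-up frequencies the individual is assigned to the first population with probability of roughly 0.94, since its genotype likelihood there (0.2592) is much higher than in the second population (0.0168).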
Article
Significance: Despite its widespread application to the species delimitation problem, our study demonstrates that what the multispecies coalescent actually delimits is structure. The current implementations of species delimitation under the multispecies coalescent do not provide any way of distinguishing between structure due to population-level processes and that due to species boundaries. The overinflation of species due to the misidentification of general genetic structure as species boundaries has profound implications for our understanding of the generation and dynamics of biodiversity, because any ecological or evolutionary studies that rely on species as their fundamental units will be impacted, as well as for the very existence of this biodiversity, because conservation planning is undermined when isolated populations are incorrectly treated as distinct species.
Article
Arachnida is an ancient, diverse, and ecologically important animal group that contains a number of species of interest for medical, agricultural, and engineering applications. Despite their importance, many aspects of the arachnid tree of life remain unresolved, hindering comparative approaches to arachnid biology. Biologists have made considerable efforts to resolve the arachnid phylogeny; yet, limited and challenging morphological characters, as well as a dearth of genetic resources, have hindered progress. Here, we present a genomic toolkit for arachnids featuring hundreds of conserved DNA regions (ultraconserved elements or UCEs) that allow targeted sequencing of any species in the arachnid tree of life. We used recently developed capture probes designed from conserved regions of available arachnid genomes to enrich a sample of loci from 32 diverse arachnids. Sequence capture returned an average of 487 UCE loci for all species, with a range from 170 to 722. Phylogenetic analysis of these UCEs produced a highly resolved arachnid tree with relationships largely consistent with recent transcriptome-based phylogenies. We also tested the phylogenetic informativeness of UCE probes within the spider, scorpion, and harvestman orders, demonstrating the utility of these markers at shallower taxonomic scales, and suggesting that these loci will be useful for species-level differences. This probe set will open the door to phylogenomic and population genomic studies across the arachnid tree of life, enabling systematics, species delimitation, species discovery, and conservation of these diverse arthropods.
Article
Deterministic processes may uniquely affect co-distributed species' phylogeographic patterns such that discordant genetic variation among taxa is predicted. Yet, explicitly testing expectations of genomic discordance in a statistical framework remains challenging. Here, we construct spatially and temporally dynamic models to investigate the hypothesized effect of microhabitat preferences on the permeability of glaciated regions to gene flow in two closely related montane species. Utilizing environmental niche models from the Last Glacial Maximum and the present to inform demographic models of changes in habitat suitability over time, we evaluate the relative probabilities of two alternative models using approximate Bayesian computation (ABC) in which glaciated regions are either (i) permeable or (ii) a barrier to gene flow. Results based on the fit of the empirical data to datasets simulated using a spatially explicit coalescent under alternative models indicate that genomic data are consistent with predictions about the hypothesized role of microhabitat in generating discordant patterns of genetic variation among the taxa. Specifically, a model in which glaciated areas acted as a barrier was much more probable based on patterns of genomic variation in Carex nova, a wet-adapted species. However, in the dry-adapted C. chalciolepis, the permeable model was more probable, although the difference in the support of the models was small. This work highlights how statistical inferences can be used to distinguish deterministic processes that are expected to result in discordant genomic patterns among species, including species-specific responses to climate change.
Article
Comparative phylogeographic investigations have identified congruent phylogeographic breaks in co-distributed species in nearly every region of the world. The qualitative assessments of phylogeographic patterns traditionally used to identify such breaks, however, are limited because they rely on identifying monophyletic groups across species and do not account for coalescent stochasticity. Only long-standing phylogeographic breaks are likely to be obvious; many species could have had a concerted response to more recent landscape events, yet possess subtle signs of phylogeographic congruence because ancestral polymorphism has not completely sorted. Here we introduce Phylogeographic Concordance Factors (PCFs), a novel method for quantifying phylogeographic congruence across species. We apply this method to the Sarracenia alata pitcher plant system, a carnivorous plant with a diverse array of commensal organisms. We explore whether a group of ecologically associated arthropods have co-diversified with the host pitcher plant, and identify if there is a positive correlation between ecological interaction and PCFs. Results demonstrate that multiple arthropods share congruent phylogeographic breaks with S. alata, and provide evidence that the level of ecological association can be used to predict the degree of similarity in the phylogeographic pattern. This study outlines an approach for quantifying phylogeographic congruence, a central concept in biogeographic research.
Article
Current statistical biogeographical analysis methods are limited in the ways ecology can be related to the processes of diversification and geographical range evolution, requiring conflation of geography and ecology, and/or assuming ecologies that are uniform across all lineages and invariant in time. This precludes the possibility of studying a broad class of macroevolutionary biogeographical theories that relate geographical and species histories through lineage-specific ecological and evolutionary dynamics, such as taxon cycle theory. Here we present a new model that generates phylogenies under a complex of superpositioned geographical range evolution, trait evolution, and diversification processes that can communicate with each other. We present a likelihood-free method of inference under our model using discriminant analysis of principal components of summary statistics calculated on phylogenies, with the discriminant functions trained on data generated by simulations under our model. This approach of model selection by classification of empirical data with respect to data generated under training models is shown to be efficient and robust, and to perform well over a broad range of parameter space defined by the relative rates of dispersal, trait evolution, and diversification processes. We apply our method to a case study of the taxon cycle, i.e. testing for habitat and trophic level constraints in the dispersal regimes of the Wallacean avifaunal radiation.
Article
Availability and implementation: PHYLUCE is written for Python 2.7. PHYLUCE is supported on OSX and Linux (RedHat/CentOS) operating systems. PHYLUCE source code is distributed under a BSD-style license from https://www.github.com/faircloth-lab/phyluce/. PHYLUCE is also available as a package (https://binstar.org/faircloth-lab/phyluce) for the Anaconda Python distribution that installs all dependencies, and users can request a PHYLUCE instance on iPlant Atmosphere (tag: phyluce). The software manual and a tutorial are available from http://phyluce.readthedocs.org/en/latest/ and test data are available from doi: 10.6084/m9.figshare.1284521. Contact: brant@faircloth-lab.org. Supplementary information: Supplementary Figure 1.
Article
We use mitochondrial and multi-locus nuclear DNA sequence data to infer both species boundaries and species relationships within California nemesiid spiders. Higher-level phylogenetic data show that the California radiation is monophyletic, and distantly related to European members of the genus Brachythele. As such, we consider all California nemesiid taxa to belong to the genus Calisoga Chamberlin, 1937. Rather than find support for one or two taxa as previously hypothesized, genetic data reveal Calisoga to be a species-rich radiation of spiders, including perhaps dozens of species. This conclusion is supported by multiple mitochondrial barcoding analyses, and also independent analyses of nuclear data that reveal general genealogical congruence. We discovered three instances of sympatry, and genetic data indicate reproductive isolation when in sympatry. An examination of female reproductive morphology does not reveal species-specific characters, and observed male morphological differences for a subset of putative species are subtle. Our coalescent species tree analysis of putative species lays the groundwork for future research on the taxonomy and biogeographic history of this remarkable endemic radiation. Copyright © 2015 Elsevier Inc. All rights reserved.