Alexandros Stamatakis

Alexandros Stamatakis
Heidelberger Institut für Theoretische Studien | HITS · Scientific Computing Group

Prof. Dr.

About

373
Publications
134,981
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
85,015
Citations
Citations since 2017
118 Research Items
52163 Citations
201720182019202020212022202302,0004,0006,0008,000
201720182019202020212022202302,0004,0006,0008,000
201720182019202020212022202302,0004,0006,0008,000
201720182019202020212022202302,0004,0006,0008,000
Additional affiliations
October 2001 - December 2004
Technische Universität München
Position
  • PhD Student

Publications

Publications (373)
Article
Full-text available
Computing ancestral ranges via the Dispersion Extinction and Cladogensis (DEC) model of biogeography is characterized by an exponential number of states relative to the number of regions considered. This is because the DEC model requires computing a large matrix exponential, which typically accounts for up to 80% of overall runtime. Therefore, the...
Article
Full-text available
Motivation: Missing data and incomplete lineage sorting are two major obstacles to accurate species tree inference. Gene tree summary methods such as ASTRAL and ASTRID have been developed to account for incomplete lineage sorting. However, they can be severely affected by high levels of missing data. Results: We present Asteroid, a novel algorit...
Article
Full-text available
We present a spatiotemporal picture of human genetic diversity in Anatolia, Iran, Levant, South Caucasus, and the Aegean, a broad region that experienced the earliest Neolithic transition and the emergence of complex hierarchical societies. Combining 35 new ancient shotgun genomes with 382 ancient and 23 present-day published genomes, we found that...
Article
Full-text available
Phylogenetic analyses under the Maximum Likelihood model are time and resource intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguis...
Preprint
Full-text available
Taxonomic assignment of OTUs is an important bioinformatics step in analyzing environmental sequencing data. Pairwise-alignment and phylogenetic-placement methods represent two alternative approaches to taxonomic assignments, but their results can differ. Here we used available colpodean ciliate OTUs from forest soils to compare the taxonomic assig...
Preprint
Full-text available
The evaluation of phylogenetic inference tools is commonly conducted on simulated and empirical sequence data alignments. An open question is how representative these alignments are with respect to those, commonly analyzed by users. Based upon the RAxMLGrove database, it is now possible to simulate DNA sequences based on more than 70, 000 represent...
Preprint
Full-text available
The prediction of knockout tournaments represents an area of large public interest and active academic as well as industrial research. Here, we show how one can leverage the computational analogies between calculating the phylogenetic likelihood score used in the area of molecular evolution to efficiently calculate , instead of approximate via simu...
Preprint
Full-text available
Motivation Missing data and incomplete lineage sorting are two major obstacles to accurate species tree inference. Gene tree methods such as ASTRAL and ASTRID have been developed to account for incomplete lineage sorting. However, they can be severely affected by high levels of missing data. Results We present Asteroid, a novel supertree method th...
Preprint
Motivation Maximum Likelihood (ML) is a widely used model for inferring phylogenies. The respective ML implementations heavily rely on numerical optimization routines that use internal numerical thresholds to determine convergence. We systematically analyze the impact of these threshold settings on the log-likelihood (LnL scores) and runtimes for M...
Article
Full-text available
Motivation: In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require usin...
Preprint
Full-text available
Phylogenetic analyses under the Maximum Likelihood model are time and resource intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguis...
Article
Full-text available
Motivation Phylogenetic networks can represent non-treelike evolutionary scenarios. Current, actively developed approaches for phylogenetic network inference jointly account for non-treelike evolution and incomplete lineage sorting. Unfortunately, this induces a very high computational complexity and current tools can only analyze small data sets....
Article
Full-text available
Phylogenetic placement refers to a family of tools and methods to analyze, visualize, and interpret the tsunami of metagenomic sequencing data generated by high-throughput sequencing. Compared to alternative (e. g., similarity-based) methods, it puts metabarcoding sequences into a phylogenetic context using a set of known reference sequences and ta...
Preprint
Full-text available
A bstract Computing ancestral ranges via the Dispersion Extinction and Cladogensis (DEC) model of biogeography is characterized by an exponential number of states relative to the number of regions considered. This is because the DEC model requires computing a large matrix exponential, which typically accounts for up to 80% of overall runtime. There...
Preprint
Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and that the non-...
Preprint
Full-text available
Phylogenetic placement refers to a family of tools and methods to analyze, visualize, and interpret the tsunami of metagenomic sequencing data generated by high-throughput sequencing. Compared to alternative (e. g., similarity-based) methods, it puts the metagenomic sequences into a phylogenetic context using a set of known reference sequences and...
Article
Full-text available
We introduce CellPhy, a maximum likelihood framework for inferring phylogenetic trees from somatic single-cell single-nucleotide variants. CellPhy leverages a finite-site Markov genotype model with 16 diploid states and considers amplification error and allelic dropout. We implement CellPhy into RAxML-NG, a widely used phylogenetic inference packag...
Article
Full-text available
Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist...
Article
Full-text available
The assessment of novel phylogenetic models and inference methods is routinely being conducted via experiments on simulated as well as empirical data. When generating synthetic data it is often unclear how to set simulation parameters for the models and generate trees that appropriately reflect empirical model parameter distributions and tree shape...
Preprint
Full-text available
The assessment of novel phylogenetic models and inference methods is routinely being conducted via experiments on simulated as well as empirical data. When generating synthetic data it is often unclear how to set simulation parameters for the models and generate trees that appropriately reflect empirical model parameter distributions and tree shape...
Preprint
Full-text available
Phylogenetic networks are used to represent non-treelike evolutionary scenarios. Current, actively developed approaches for phylogenetic network inference jointly account for non-treelike evolution and incomplete lineage sorting (ILS). Unfortunately, this induces a very high computational complexity. Hence, current tools can only analyze small data...
Article
Full-text available
Motivation: Previously we presented swarm, an open-source amplicon clustering program that produces fine-scale molecular operational taxonomic units (OTUs) that are free of arbitrary global clustering thresholds. Here we present swarm v3 to address issues of contemporary datasets that are growing towards tera-byte sizes. Results: When compared t...
Article
A wide range of data types can be used to delimit species and various computer‐based tools dedicated to this task are now available. Although these formalized approaches have significantly contributed to increase the objectivity of species delimitation (SD) under different assumptions, they are not routinely used by alpha‐taxonomists. One obvious s...
Preprint
Full-text available
The prediction of knock-out tournaments represents an area of large public interest and active academic as well as industrial research. Here, we leverage the computational analogies between calculating the so-called phylogenetic likelihood score used in the area of molecular evolution and efficiently calculating, instead of approximating via simula...
Article
Full-text available
: Phylogenetic trees are now routinely inferred on large scale HPC systems with thousands of cores as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, w...
Article
Full-text available
Scientific software from all areas of scientific research is pivotal to obtaining novel insights. Yet the coding standards adherence of scientific software is rarely assessed, even though it might lead to incorrect scientific results in the worst case. Therefore, we have developed an open source tool and benchmark called SoftWipe, that provides a r...
Article
Full-text available
Background In phylogenetic analysis, it is common to infer unrooted trees. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as molecular clock analysis (including midpoint rooting) or rooting the tree using an outgroup. Non-reversible Markov models can al...
Preprint
Full-text available
Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist...
Preprint
Full-text available
A wide range of data types can be used to delimit species and various computer-based tools dedicated to this task are now available. Although these formalized approaches have significantly contributed to increase the objectivity of SD under different assumptions, they are not routinely used by alpha-taxonomists. One obvious shortcoming is the lack...
Article
Tools, rules and incentives are buckling under the flood of coronavirus genome sequences — to help control the pandemic, researchers need new approaches. Tools, rules and incentives are buckling under the flood of coronavirus genome sequences — to help control the pandemic, researchers need new approaches.
Article
Wall lizards of the genus Podarcis (Sauria, Lacertidae) are the predominant reptile group in southern Europe, including 24 recognized species. Mitochondrial DNA data have shown that, with the exception of P. muralis, the Podarcis species distributed in the Balkan peninsula form a species group that is further sub-divided into two subgroups: the one...
Preprint
Full-text available
Phylogenetic trees are now routinely inferred on large scale HPC systems with thousands of cores as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we e...
Article
The prospect of reduced vaccine potency from fast-spreading SARS-CoV-2 variants has spurred a global rush to increase genomic surveillance for the coronavirus. This is crucial for quickly identifying and tracking emergent strains. It can also pin down how transmission occurs between individuals more definitively than typical contact tracing can. As...
Article
Full-text available
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8, 736 out of all 16, 453 virus sequences availab...
Article
Full-text available
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8, 736 out of all 16, 453 virus sequences availab...
Article
Full-text available
Anthozoan corals are an ecologically important group of cnidarians, which power the productivity of reef ecosystems. They are sessile, inhabit shallow, tropical oceans and are highly dependent on sun- and moonlight to regulate sexual reproduction, phototaxis and photosymbiosis. However, their exposure to high levels of sunlight also imposes an incr...
Preprint
Full-text available
Scientific software from all areas of scientific research is pivotal to obtaining novel insights. Yet the quality of scientific software is rarely assessed, even though it might lead to incorrect scientific results in the worst case. Therefore, we have developed an open source tool and benchmark called SoftWipe , that provides a relative software q...
Preprint
Full-text available
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising all virus sequences available on May 5, 2020 from gisaid.org. We find that it...
Article
Full-text available
Microbial ecology research is currently driven by the continuously decreasing cost of DNA sequencing and the improving accuracy of data analysis methods. One such analysis method is phylogenetic placement, which establishes the phylogenetic identity of the anonymous environmental sequences in a sample by means of a given phylogenetic reference tree...
Article
Motivation Gene and species tree reconciliation methods are used to interpret gene trees, root them and correct uncertainties that are due to scarcity of signal in multiple sequence alignments. So far, reconciliation tools have not been integrated in standard phylogenetic software and they either lack performance on certain functions, or usability...
Preprint
Full-text available
We have developed a maximum likelihood framework called CellPhy for inferring phylogenetic trees from single-cell DNA sequencing (scDNA-seq) data, that can be directly applied to somatic cells and clones. CellPhy is based on a finite-site Markov nucleotide substitution model with 10 diploid states, akin to those typically used in statistical phylog...
Article
Full-text available
Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges species tree-aware methods also leverage information from a pu...
Preprint
Full-text available
Microbial ecology research is currently driven by the continuously decreasing cost of DNA sequencing and the improving accuracy of data analysis methods. One such analysis method is phylogenetic placement, which establishes the phylogenetic identity of the anonymous environmental sequences in a sample by means of a given phylogenetic reference tree...
Preprint
Full-text available
In phylogenetic analysis, it is common to infer unrooted trees. Thus, it is unknown which node is the most recent common ancestor of all the taxa in the phylogeny. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as midpoint rooting or rooting the tree at...
Article
Full-text available
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands-or even millions-of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges...
Article
Full-text available
We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies, and other relevant data types, offer high-level simplicity as well as low-level customizability, and are...
Article
Full-text available
Motivation: Recently, Lemoine et al. suggested the Transfer Bootstrap Expectation (TBE) branch support metric as an alternative to classical phylogenetic bootstrap support for taxon-rich datasets. However, the original TBE implementation in the booster tool is compute- and memory-intensive. Results: We developed a fast and memory-efficient TBE i...
Article
High‐throughput DNA metabarcoding of amplicon sizes below 500 bp has revolutionized the analysis of environmental microbial diversity. However, these short regions contain limited phylogenetic signal, which makes it impractical to use environmental DNA in full phylogenetic inferences. This lesser phylogenetic resolution of short amplicons may be ov...
Article
Full-text available
Background The classification of hepatitis viruses still predominantly relies on ad hoc criteria, i.e., phenotypic traits and arbitrary genetic distance thresholds. Given the subjectivity of such practices coupled with the constant sequencing of samples and discovery of new strains, this manual approach to virus classification becomes cumbersome an...
Preprint
Full-text available
A bstract Terraces in phylogenetic tree space are, among other things, important for the design of tree space search strategies. While the phenomenon of phylogenetic terraces is already known for unlinked partition models on partitioned phylogenomic data sets, it has not yet been studied if an analogous structure is present under linked and scaled...
Preprint
Full-text available
Motivation Gene and species tree reconciliation methods are used to interpret gene trees, root them and correct uncertainties that are due to scarcity of signal in multiple sequence alignments. So far, reconciliation tools have not been integrated in standard phylogenetic software and they either lack performance on certain functions, or usability...
Preprint
Full-text available
Inferring gene trees is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges species tree-aware methods seek to use information from a putative species tree. However, there are few method...
Article
Incongruence, or topological conflict, is prevalent in genome-scale data sets. Internode Certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internal branch among a set of phylogenetic trees and complement regular branch support measures (e.g., bootstrap, posterior probability) th...
Preprint
The recent upswing of microfluidics and combinatorial indexing strategies, further enhanced by very low sequencing costs, have turned single cell sequencing into an empowering technology; analyzing thousands—or even millions—of cells per experimental run is becoming a routine assignment in laboratories worldwide. As a consequence, we are witnessing...
Preprint
Full-text available
Recently, Lemoine et al. suggested the Transfer Bootstrap Expectation (TBE) branch support metric as an alternative to classical phylogenetic bootstrap support metric on taxon-rich datasets. However, the original TBE implementation in the booster tool is compute- and memory-intensive. Therefore, we developed a fast and memory-efficient TBE implemen...
Article
Full-text available
ModelTest-NG is a re-implementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively. ModelTest-NG is one to two orders of magnitude faster than jModelTest and ProtTest but equally accurate, and introduces several new features, such as ascertainment bia...
Preprint
The recent upswing of microfluidics and combinatorial indexing strategies, further enhanced by very low sequencing costs, have turned single cell sequencing into an empowering technology; analyzing thousands—or even millions—of cells per experimental run is becoming a routine assignment in laboratories worldwide. As a consequence, we are witnessing...
Preprint
Full-text available
The recent upswing of microfluidics and combinatorial indexing strategies, further enhanced by very low sequencing costs, have turned single cell sequencing into an empowering technology; analyzing thousands—or even millions—of cells per experimental run is becoming a routine assignment in laboratories worldwide. As a consequence, we are witnessing...
Preprint
The recent upswing of microfluidics and combinatorial indexing strategies, further enhanced by very low sequencing costs, have turned single cell sequencing into an empowering technology; analyzing thousands—or even millions—of cells per experimental run is becoming a routine assignment in laboratories worldwide. As a consequence, we are witnessing...
Article
Full-text available
Background: The exponential decrease in molecular sequencing cost generates unprecedented amounts of data. Hence, scalable methods to analyze these data are required. Phylogenetic (or Evolutionary) Placement methods identify the evolutionary provenance of anonymous sequences with respect to a given reference phylogeny. This increasingly popular me...
Preprint
Full-text available
We present GENESIS, a library for working with phylogenetic data, and GAPPA, an accompanying command line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies, and other relevant data types, offer high-level simplicity as well as low-level customizability, and are...
Article
Full-text available
Few models of sequence evolution incorporate parameters describing protein structure, despite its high conservation, essential functional role and increasing availability. We present a structurally-aware empirical substitution model for amino acid sequence evolution in which proteins are expressed using an expanded alphabet that relays both amino a...
Article
Full-text available
Motivation: Phylogenies are important for fundamental biological research, but also have numerous applications in biotechnology, agriculture, and medicine. Finding the optimal tree under the popular maximum likelihood (ML) criterion is known to be NP-hard. Thus, highly optimized and scalable codes are needed to analyze constantly growing empirical...