Prof. Dr. Alexandros Stamatakis
Heidelberger Institut für Theoretische Studien | HITS · Scientific Computing Group
Publications: 398
Reads: 171,919
Citations: 98,314
Additional affiliations
October 2001 - December 2004
Publications (398)
A common problem when analyzing ancient DNA (aDNA) data is to identify the species that corresponds to the recovered aDNA sequence(s). The standard approach is to deploy sequence-similarity-based tools such as BLAST. However, as aDNA reads may frequently stem from taxa that are unsampled due to extinction, it is likely that there is no exact match i...
In the Battle of Crete during the World War II occupation of Greece, the German forces faced substantial civilian resistance. In retribution for the numerous German losses, a series of mass executions took place in many places in Crete, a common practice reported from Greece and elsewhere. In Adele, a village in the regional unit of Rethymnon, 18 mal...
The field of population genetics attempts to advance our understanding of evolutionary processes. It has applications, for example, in medical research, wildlife conservation, and – in conjunction with recent advances in ancient DNA sequencing technology – studying human migration patterns over the past few thousand years. The basic toolbox of popu...
Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to the diversity of species sampled, the phylogenetic method and the choice of genomic regions [1–3]. Here we address these issues by analysing the genomes of 363 bird species [4] (218 taxon...
Motivation
Genomes are a rich source of information on the pattern and process of evolution across biological scales. How best to make use of that information is an active area of research in phylogenetics. Ideally, phylogenetic methods should not only model substitutions along gene trees, which explain differences between homologous gene sequences...
Motivation
Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual’s origin or membership in a cultural group, dimensionality reduction techniques are routinely deployed. However, inherent (technical...
Estimating the statistical robustness of the inferred tree(s) constitutes an integral part of most phylogenetic analyses. Commonly, one computes and assigns a branch support value to each inner branch of the inferred phylogeny. The most widely used method for calculating branch support on trees inferred under Maximum Likelihood (ML) is the Standard...
Motivation
Simulating Multiple Sequence Alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools, and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simu...
Phylogenetic inferences under the Maximum-Likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score diffe...
Maximum likelihood (ML) is a widely used phylogenetic inference method. ML implementations heavily rely on numerical optimization routines that use internal numerical thresholds to determine convergence. We systematically analyze the impact of these threshold settings on the log-likelihood and runtimes for ML tree inferences with RAxML-NG, IQ-TREE,...
ALE and GeneRax are tools for probabilistic gene tree-species tree reconciliation. Based on a common underlying statistical model of how gene trees evolve along species trees, these methods rely on gene versus species tree discordance to infer gene duplication, transfer and loss events, map gene family origins and root species trees. Published anal...
Taxonomic assignment of operational taxonomic units (OTUs) is an important bioinformatics step in analyzing environmental sequencing data. Pairwise alignment and phylogenetic-placement methods represent two alternative approaches to taxonomic assignments, but their results can differ. Here we used available colpodean ciliate OTUs from forest soils...
Motivation: Simulating sequence evolution plays an important role in the development and evaluation of phylogenetic inference tools. Naturally, the simulated data needs to be as realistic as possible to be indicative of the performance of the developed tools on empirical data. Over the years, numerous phylogenetic sequence simulators, employing var...
Motivation
Phylogenetic inferences under the Maximum-Likelihood (ML) criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likeli...
The prediction of knockout tournaments represents an area of large public interest and active academic as well as industrial research. Here, we show how one can leverage the computational analogies between calculating the phylogenetic likelihood score used in the area of molecular evolution to efficiently calculate, instead of approxim...
Species tree-aware phylogenetic methods model how gene trees are generated along the species tree by a series of evolutionary events, including the duplication, transfer and loss of genes. Over the past ten years these methods have emerged as a powerful tool for inferring and rooting gene and species trees, inferring ancestral gene repertoires, and...
Computing ancestral ranges via the Dispersion Extinction and Cladogenesis (DEC) model of biogeography is characterized by an exponential number of states relative to the number of regions considered. This is because the DEC model requires computing a large matrix exponential, which typically accounts for up to 80% of overall runtime. Therefore, the...
This is a preprint. The paper is currently under review.
You may download it freely from here: 10.2139/ssrn.4578618
Motivation:
Missing data and incomplete lineage sorting are two major obstacles to accurate species tree inference. Gene tree summary methods such as ASTRAL and ASTRID have been developed to account for incomplete lineage sorting. However, they can be severely affected by high levels of missing data.
Results:
We present Asteroid, a novel algorit...
We present a spatiotemporal picture of human genetic diversity in Anatolia, Iran, Levant, South Caucasus, and the Aegean, a broad region that experienced the earliest Neolithic transition and the emergence of complex hierarchical societies. Combining 35 new ancient shotgun genomes with 382 ancient and 23 present-day published genomes, we found that...
Phylogenetic analyses under the Maximum Likelihood model are time and resource intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguis...
Taxonomic assignment of OTUs is an important bioinformatics step in analyzing environmental sequencing data. Pairwise-alignment and phylogenetic-placement methods represent two alternative approaches to taxonomic assignments, but their results can differ. Here we used available colpodean ciliate OTUs from forest soils to compare the taxonomic assig...
The evaluation of phylogenetic inference tools is commonly conducted on simulated and empirical sequence data alignments. An open question is how representative these alignments are with respect to those commonly analyzed by users. Based upon the RAxMLGrove database, it is now possible to simulate DNA sequences based on more than 70,000 represent...
The prediction of knockout tournaments represents an area of large public interest and active academic as well as industrial research. Here, we show how one can leverage the computational analogies between calculating the phylogenetic likelihood score used in the area of molecular evolution to efficiently calculate, instead of approximate via simu...
Motivation
Missing data and incomplete lineage sorting are two major obstacles to accurate species tree inference. Gene tree methods such as ASTRAL and ASTRID have been developed to account for incomplete lineage sorting. However, they can be severely affected by high levels of missing data.
Results
We present Asteroid, a novel supertree method th...
Motivation
Maximum Likelihood (ML) is a widely used model for inferring phylogenies. The respective ML implementations heavily rely on numerical optimization routines that use internal numerical thresholds to determine convergence. We systematically analyze the impact of these threshold settings on the log-likelihood (LnL scores) and runtimes for M...
Motivation:
In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require usin...
Motivation
Phylogenetic networks can represent non-treelike evolutionary scenarios. Current, actively developed approaches for phylogenetic network inference jointly account for non-treelike evolution and incomplete lineage sorting. Unfortunately, this induces a very high computational complexity and current tools can only analyze small data sets....
Phylogenetic placement refers to a family of tools and methods to analyze, visualize, and interpret the tsunami of metagenomic sequencing data generated by high-throughput sequencing. Compared to alternative (e.g., similarity-based) methods, it puts metabarcoding sequences into a phylogenetic context using a set of known reference sequences and ta...
Abstract
Computing ancestral ranges via the Dispersion Extinction and Cladogenesis (DEC) model of biogeography is characterized by an exponential number of states relative to the number of regions considered. This is because the DEC model requires computing a large matrix exponential, which typically accounts for up to 80% of overall runtime. There...
Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and that the non-...
Phylogenetic placement refers to a family of tools and methods to analyze, visualize, and interpret the tsunami of metagenomic sequencing data generated by high-throughput sequencing. Compared to alternative (e.g., similarity-based) methods, it puts the metagenomic sequences into a phylogenetic context using a set of known reference sequences and...
We introduce CellPhy, a maximum likelihood framework for inferring phylogenetic trees from somatic single-cell single-nucleotide variants. CellPhy leverages a finite-site Markov genotype model with 16 diploid states and considers amplification error and allelic dropout. We implement CellPhy into RAxML-NG, a widely used phylogenetic inference packag...
Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist...
The assessment of novel phylogenetic models and inference methods is routinely being conducted via experiments on simulated as well as empirical data. When generating synthetic data it is often unclear how to set simulation parameters for the models and generate trees that appropriately reflect empirical model parameter distributions and tree shape...
Phylogenetic networks are used to represent non-treelike evolutionary scenarios. Current, actively developed approaches for phylogenetic network inference jointly account for non-treelike evolution and incomplete lineage sorting (ILS). Unfortunately, this induces a very high computational complexity. Hence, current tools can only analyze small data...
Motivation:
Previously we presented swarm, an open-source amplicon clustering program that produces fine-scale molecular operational taxonomic units (OTUs) that are free of arbitrary global clustering thresholds. Here we present swarm v3 to address issues of contemporary datasets that are growing towards terabyte sizes.
Results:
When compared t...
A wide range of data types can be used to delimit species and various computer‐based tools dedicated to this task are now available. Although these formalized approaches have significantly contributed to increasing the objectivity of species delimitation (SD) under different assumptions, they are not routinely used by alpha‐taxonomists. One obvious s...
The prediction of knock-out tournaments represents an area of large public interest and active academic as well as industrial research. Here, we leverage the computational analogies between calculating the so-called phylogenetic likelihood score used in the area of molecular evolution and efficiently calculating, instead of approximating via simula...
Phylogenetic trees are now routinely inferred on large scale HPC systems with thousands of cores as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, w...
Scientific software from all areas of scientific research is pivotal to obtaining novel insights. Yet the coding standards adherence of scientific software is rarely assessed, even though it might lead to incorrect scientific results in the worst case. Therefore, we have developed an open source tool and benchmark called SoftWipe, that provides a r...
Background
In phylogenetic analysis, it is common to infer unrooted trees. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as molecular clock analysis (including midpoint rooting) or rooting the tree using an outgroup. Non-reversible Markov models can al...
A wide range of data types can be used to delimit species and various computer-based tools dedicated to this task are now available. Although these formalized approaches have significantly contributed to increasing the objectivity of species delimitation (SD) under different assumptions, they are not routinely used by alpha-taxonomists. One obvious shortcoming is the lack...
Tools, rules and incentives are buckling under the flood of coronavirus genome sequences. To help control the pandemic, researchers need new approaches.
Wall lizards of the genus Podarcis (Sauria, Lacertidae) are the predominant reptile group in southern Europe, including 24 recognized species. Mitochondrial DNA data have shown that, with the exception of P. muralis, the Podarcis species distributed in the Balkan peninsula form a species group that is further sub-divided into two subgroups: the one...
Phylogenetic trees are now routinely inferred on large scale HPC systems with thousands of cores as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we e...
The prospect of reduced vaccine potency from fast-spreading SARS-CoV-2 variants has spurred a global rush to increase genomic surveillance for the coronavirus. This is crucial for quickly identifying and tracking emergent strains. It can also pin down how transmission occurs between individuals more definitively than typical contact tracing can. As...
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences availab...
Anthozoan corals are an ecologically important group of cnidarians, which power the productivity of reef ecosystems. They are sessile, inhabit shallow, tropical oceans and are highly dependent on sun- and moonlight to regulate sexual reproduction, phototaxis and photosymbiosis. However, their exposure to high levels of sunlight also imposes an incr...
Scientific software from all areas of scientific research is pivotal to obtaining novel insights. Yet the quality of scientific software is rarely assessed, even though it might lead to incorrect scientific results in the worst case. Therefore, we have developed an open source tool and benchmark called SoftWipe, that provides a relative software q...
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising all virus sequences available on May 5, 2020 from gisaid.org. We find that it...
Microbial ecology research is currently driven by the continuously decreasing cost of DNA sequencing and the improving accuracy of data analysis methods. One such analysis method is phylogenetic placement, which establishes the phylogenetic identity of the anonymous environmental sequences in a sample by means of a given phylogenetic reference tree...
Motivation
Gene and species tree reconciliation methods are used to interpret gene trees, root them and correct uncertainties that are due to scarcity of signal in multiple sequence alignments. So far, reconciliation tools have not been integrated in standard phylogenetic software and they either lack performance on certain functions, or usability...
We have developed a maximum likelihood framework called CellPhy for inferring phylogenetic trees from single-cell DNA sequencing (scDNA-seq) data, that can be directly applied to somatic cells and clones. CellPhy is based on a finite-site Markov nucleotide substitution model with 10 diploid states, akin to those typically used in statistical phylog...
Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges species tree-aware methods also leverage information from a pu...
In phylogenetic analysis, it is common to infer unrooted trees. Thus, it is unknown which node is the most recent common ancestor of all the taxa in the phylogeny. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as midpoint rooting or rooting the tree at...
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges...
We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies, and other relevant data types, offer high-level simplicity as well as low-level customizability, and are...