Julia Haag’s research while affiliated with Heidelberg Institute for Theoretical Studies and other places


Publications (15)


Figure 1: Visualization of σ_{m,100} and σ_m for m = 3, ..., 99, averaged across 1000 MSAs. The dashed horizontal line indicates σ_100 and the dashed vertical line the intersection of σ_m with σ_100 at 23.
Figure 3: Average prediction error per difficulty range. The figure shows the error for Pythia 0.0 and Pythia 2.0 on their respective training datasets. We compute the prediction error as predicted difficulty − ground-truth difficulty.
Figure 4: Average absolute prediction error of Pythia 2.0 per data type.
Pythia 2.0: New Data, New Prediction Model, New Features
Preprint · March 2025 · 20 Reads · Julia Haag

Maximum Likelihood (ML) based phylogenetic inference is time- and resource-intensive, especially when initiating multiple independent inferences from distinct comprehensive tree topologies. Performing multiple independent inferences is often required to (sufficiently) explore the vast search space of possible unrooted binary tree topologies. Yet, these independent inferences do not necessarily converge to a single phylogeny or at least topologically highly similar trees. While one is likely to obtain a single, conclusive phylogeny for easy-to-analyze multiple sequence alignments (MSAs), difficult-to-analyze MSAs yield topologically highly distinct, yet statistically indistinguishable tree topologies. In 2022, we proposed a compute-intensive approach to quantify the inherent difficulty of a phylogenetic analysis for a specific, given MSA, and also trained a machine-learning based prediction model called Pythia to substantially reduce the computational cost of determining the difficulty. Pythia can predict the difficulty for a given MSA with high accuracy, while being substantially faster than even a single ML tree inference. Pythia predicts the difficulty on a scale from 0 (easy) to 1 (difficult). Here, we present all improvements to Pythia that we have introduced since our initial publication in 2022. We trained a new prediction model using approximately three times more MSAs and a new type of machine learning model. We improved the runtime of two feature computations, and we also introduced two additional prediction features. Our latest version, Pythia 2.0, is slightly more accurate than our initial version and is also approximately twice as fast. Finally, we also present and make available the novel and easy-to-use command line tool PyDLG, which allows users to compute the ground-truth difficulty seamlessly for a given MSA. This ground-truth difficulty can be used, for instance, as a prediction target for training a new Pythia model.
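
The Figure 3 caption listed with this preprint defines the prediction error as predicted difficulty minus ground-truth difficulty and reports it per difficulty range. A minimal sketch of that bookkeeping follows. It is not taken from Pythia; the function name `mean_error_per_range`, the equal-width binning, and the choice to bin by ground-truth difficulty are assumptions made for illustration.

```python
from collections import defaultdict


def mean_error_per_range(predicted, ground_truth, n_bins=10):
    """Average signed error (predicted - ground truth) per difficulty range.

    Assumptions (hypothetical, not from the preprint): difficulties lie in
    [0, 1], ranges are equal-width bins, and each MSA is assigned to a bin
    by its ground-truth difficulty.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")

    sums = defaultdict(float)
    counts = defaultdict(int)
    for pred, truth in zip(predicted, ground_truth):
        # A ground truth of exactly 1.0 is clamped into the last bin.
        idx = min(int(truth * n_bins), n_bins - 1)
        sums[idx] += pred - truth
        counts[idx] += 1

    # Keys are (lower, upper) bounds of each non-empty range.
    return {
        (i / n_bins, (i + 1) / n_bins): sums[i] / counts[i]
        for i in sorted(counts)
    }
```

A positive value for a range means the model overestimates the difficulty there on average; taking `abs(pred - truth)` instead would give the average absolute error shown in Figure 4.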


Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data

March 2025 · 16 Reads · 1 Citation · Bioinformatics Advances

Motivation Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual’s origin or membership to a population, dimensionality reduction techniques are routinely deployed. However, inherent (technical) difficulties such as missing or noisy data need to be accounted for when analyzing a lower dimensional representation of genotype data, and the intrinsic uncertainty of such analyses should be reported in all studies. However, to date, there exists no stability assessment technique for genotype data that can estimate this uncertainty. Results Here, we present Pandora, a stability estimation framework for genotype data based on bootstrapping. Pandora computes an overall score to quantify the stability of the entire embedding, infers per-individual support values, and also deploys a k-means clustering approach to assess the uncertainty of assignments to potential cultural groups. Using published empirical and simulated datasets, we demonstrate the usage and utility of Pandora for studies that rely on dimensionality reduction techniques. Availability and Implementation Pandora is available on GitHub https://github.com/tschuelia/Pandora. Supplementary information Supplementary data are available online.


Predicting Phylogenetic Bootstrap Values via Machine Learning

October 2024 · 21 Reads · 2 Citations · Molecular Biology and Evolution

Estimating the statistical robustness of the inferred tree(s) constitutes an integral part of most phylogenetic analyses. Commonly, one computes and assigns a branch support value to each inner branch of the inferred phylogeny. The still most widely used method for calculating branch support on trees inferred under maximum likelihood (ML) is the Standard, nonparametric Felsenstein bootstrap support (SBS). Due to the high computational cost of the SBS, a plethora of methods has been developed to approximate it, for instance, via the rapid bootstrap (RB) algorithm. There have also been attempts to devise faster, alternative support measures, such as the SH-aLRT (Shimodaira–Hasegawa-like approximate likelihood ratio test) or the UltraFast bootstrap 2 (UFBoot2) method. Those faster alternatives exhibit some limitations, such as the need to assess model violations (UFBoot2) or unstable behavior in the low support interval range (SH-aLRT). Here, we present the educated bootstrap guesser (EBG), a machine learning-based tool that predicts SBS branch support values for a given input phylogeny. EBG is on average 9.4 (σ=5.5) times faster than UFBoot2. EBG-based SBS estimates exhibit a median absolute error of 5 when predicting SBS values between 0 and 100. Furthermore, EBG also provides uncertainty measures for all per-branch SBS predictions and thereby allows for a more rigorous and careful interpretation. EBG can, for instance, predict SBS support values on a phylogeny comprising 1,654 SARS-CoV2 genome sequences within 3 h on a mid-class laptop. EBG is available under GNU GPL3.


Complexity of avian evolution revealed by family-level genomes

April 2024 · 1,462 Reads · 94 Citations · Nature · Al-Aabid Chowdhury · [...]

Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method and the choice of genomic regions¹⁻³. Here we address these issues by analysing the genomes of 363 bird species⁴ (218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a marked degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous–Palaeogene boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that are a challenge to model due to either extreme DNA composition, variable substitution rates, incomplete lineage sorting or complex evolutionary events such as ancient hybridization. Assessment of the effects of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates and relative brain size following the Cretaceous–Palaeogene extinction event, supporting the hypothesis that emerging ecological opportunities catalysed the diversification of modern birds. The resulting phylogenetic estimate offers fresh insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.


Figure 4: Visualization of the first two principal components of the unbootstrapped HO-WE dataset for one Pandora analysis without shrinking (left figure) and one Pandora analysis with shrinking (right figure). Both figures show the computed embedding as gray dots, and individuals with a P SV ≤ 0.62 are highlighted. Colors indicate populations, as stated in the legends.
Figure 7: PSV deviations and speedups for both analyzed convergence tolerance settings. The box plots show the data for all empirical datasets using PCA analyses.
Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data

March 2024 · 76 Reads

Motivation Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual’s origin or membership to a cultural group, dimensionality reduction techniques are routinely deployed. However, inherent (technical) difficulties such as missing or noisy data need to be accounted for when analyzing a lower dimensional representation of genotype data, and the uncertainty of such an analysis should be reported in all studies. However, to date, there exists no stability estimation technique for genotype data that can estimate this uncertainty. Results Here, we present Pandora, a stability estimation framework for genotype data based on bootstrapping. Pandora computes an overall score to quantify the stability of the entire embedding, computes per-individual support values, and deploys a k-means clustering approach to assess the uncertainty of assignments to potential cultural groups. In addition to this bootstrap-based stability estimation, Pandora offers a sliding-window stability estimation for whole-genome data. Using published empirical and simulated datasets, we demonstrate the usage and utility of Pandora for studies that rely on dimensionality reduction techniques. Data and Code Availability Pandora is available on GitHub at https://github.com/tschuelia/Pandora. All Python scripts and data to reproduce our results are available on GitHub at https://github.com/tschuelia/PandoraPaper. Contact: julia.haag@h-its.org


Predicting Phylogenetic Bootstrap Values via Machine Learning

March 2024 · 187 Reads

Estimating the statistical robustness of the inferred tree(s) constitutes an integral part of most phylogenetic analyses. Commonly, one computes and assigns a branch support value to each inner branch of the inferred phylogeny. The most widely used method for calculating branch support on trees inferred under Maximum Likelihood (ML) is the Standard, non-parametric Felsenstein Bootstrap Support (SBS). Due to the high computational cost of the SBS, a plethora of methods has been developed to approximate it, for instance, via the Rapid Bootstrap (RB) algorithm. There have also been attempts to devise faster, alternative support measures, such as the SH-aLRT (Shimodaira–Hasegawa-like approximate Likelihood Ratio Test) or the UltraFast Bootstrap 2 (UFBoot2) method. Those faster alternatives exhibit some limitations, such as the need to assess model violations (UFBoot2) or meaningless low branch support intervals (SH-aLRT). Here, we present the Educated Bootstrap Guesser (EBG), a machine learning-based tool that predicts SBS branch support values for a given input phylogeny. EBG is on average 9.4 (σ = 5.5) times faster than UFBoot2. EBG-based SBS estimates exhibit a median absolute error of 5 when predicting SBS values between 0 and 100. Furthermore, EBG also provides uncertainty measures for all per-branch SBS predictions and thereby allows for a more rigorous and careful interpretation. EBG can predict SBS support values on a phylogeny comprising 1654 SARS-CoV2 genome sequences within 3 hours on a mid-class laptop. EBG is available under GNU GPL3. Data and Code Availability: github.com/wiegertj/EBG and github.com/wiegertj/EBG-train. Contact: julius-wiegert@web.de


Simulations of Sequence Evolution: How (Un)realistic They Are and Why

December 2023 · 123 Reads · 11 Citations · Molecular Biology and Evolution

Motivation Simulating Multiple Sequence Alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools, and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical. Results Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition. Data and Code Availability All simulated and empirical MSAs, as well as all analysis results, are available at https://cme.h-its.org/exelixis/material/simulation_study.tar.gz. All scripts required to reproduce our results are available at https://github.com/tschuelia/SimulationStudy and https://github.com/JohannaTrost/seqsharp.


Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty

October 2023 · 162 Reads · 14 Citations · Molecular Biology and Evolution

Phylogenetic inferences under the Maximum-Likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy MSAs exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, over-analyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10x. 
Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).


Figure 3. Influence of simultaneously changing both likelihood epsilon settings on the LnL scores and runtime of the RAxML-NG tree inference. (a) Influence of simultaneously changing both likelihood epsilon settings on the LnL scores of RAxML-NG. The highlighted box indicates the default combination. The y-axis shows the LnL score degradation per inferred tree in percent relative to the LnL score of the best-known tree. Higher percentages indicate worse LnL scores. (b) Influence of simultaneously changing both likelihood epsilon settings on the RAxML-NG tree inference runtimes. The highlighted box indicates the default combination. The y-axis shows the speedup relative to the average runtime under the default combination.
Figure 4. Influence of the LnL setting on the LnL scores and runtimes of IQ-TREE tree inferences. (a) Influence of the LnL setting on the LnL scores of IQ-TREE. The highlighted box indicates the default setting. The y-axis shows the LnL score degradation per inferred tree in percent relative to the LnL score of the best-known tree. Higher percentages indicate worse LnL scores. (b) Influence of the LnL setting on IQ-TREE tree inference runtimes. The highlighted box indicates the default setting. The y-axis shows the speedup relative to the average runtime under the default setting.
Figure 5. Influence of simultaneously changing both likelihood epsilon settings on the bootstrap support values and runtime of the RAxML-NG bootstrap. (a) Influence of simultaneously changing both likelihood epsilon settings on the bootstrap support values. The highlighted box indicates the default combination. The y-axis shows the Pearson correlation coefficients between support values for all ML trees across all analyzed datasets. (b) Influence of simultaneously changing both likelihood epsilon settings on the RAxML-NG bootstrapping runtimes. The highlighted box indicates the default combination. The y-axis shows the speedup relative to the runtime under the default combination. This figure shows all MSAs (no outlier filtering).
Numerical thresholds we varied, including the analyzed settings and respective inference tools where they are applicable.
The Free Lunch is not over yet—systematic exploration of numerical thresholds in maximum likelihood phylogenetic inference

September 2023 · 39 Reads · 4 Citations · Bioinformatics Advances

Maximum likelihood (ML) is a widely used phylogenetic inference method. ML implementations heavily rely on numerical optimization routines that use internal numerical thresholds to determine convergence. We systematically analyze the impact of these threshold settings on the log-likelihood and runtimes for ML tree inferences with RAxML-NG, IQ-TREE, and FastTree on empirical datasets. We provide empirical evidence that we can substantially accelerate tree inferences with RAxML-NG and IQ-TREE by changing the default values of two such numerical thresholds. At the same time, altering these settings does not significantly impact the quality of the inferred trees. We further show that increasing both thresholds accelerates the RAxML-NG bootstrap without influencing the resulting support values. For RAxML-NG, increasing the likelihood thresholds ϵ_LnL and ϵ_brlen to 10 and 10³, respectively, results in an average tree inference speedup of 1.9 ± 0.6 on Data collection 1, 1.8 ± 1.1 on Data collection 2, and 1.9 ± 0.8 on Data collection 2 for the RAxML-NG bootstrap compared to the runtime under the current default setting. Increasing the likelihood threshold ϵ_LnL to 10 in IQ-TREE results in an average tree inference speedup of 1.3 ± 0.4 on Data collection 1 and 1.3 ± 0.9 on Data collection 2. Availability and implementation All MSAs we used for our analyses, as well as all results, are available for download at https://cme.h-its.org/exelixis/material/freeLunch_data.tar.gz. Our data generation scripts are available at https://github.com/tschuelia/ml-numerical-analysis.


Figure 2: Visualized substitution rates for an anecdotal (specifically selected to highlight the issue) gapless empirical DNA MSA (left), and gapless simulated MSA (right) generated based on the inferred tree and estimated evolutionary model parameters of the left MSA under the GTR model. The x-axis denotes the alignment site index. A brighter color denotes a higher number of substitutions.
Figure 5: Performance of logistic regression on MSA compositions and CNN on site-wise compositions. For each evolutionary model the BACC of each fold is represented as well as the mean and standard error.
Average of the BACC on empirical and simulated data collections across 10 folds for the GBT and CNN classifiers. Parameter configurations of simulations listed in the first column are sorted with increasing complexity from top to bottom for both DNA and protein data. For both, the last row(s)
Simulations of sequence evolution: how (un)realistic they really are and why

July 2023 · 114 Reads · 3 Citations

Motivation: Simulating sequence evolution plays an important role in the development and evaluation of phylogenetic inference tools. Naturally, the simulated data needs to be as realistic as possible to be indicative of the performance of the developed tools on empirical data. Over the years, numerous phylogenetic sequence simulators, employing various models of evolution, have been published with the goal to simulate such empirical-like data. In this study, we simulated DNA and protein Multiple Sequence Alignments (MSAs) under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how well supervised learning methods are able to predict whether a given MSA is simulated or empirical. Results: Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate the process of evolution.


Citations (9)


... However, progress is rapid in this field, and the even more rapid development of machine learning methods offers new opportunities to accelerate phylogenetic and reconciliation analyses without compromising accuracy. New methods that use machine learning to rapidly select the best-fit phylogenetic model [110,111], to optimise analysis settings [112,113], to efficiently search tree space [114], and to inexpensively predict bootstrap support values [115] will all benefit phylogenetic inference. One reason for the efficiency of these new methods is that machine learning algorithms can predict the results of computationally expensive likelihood calculations using cheap-to-compute input features, such as parsimony (gene) trees, which exhibit high feature importance in recent studies (70%-80%; [112,115]). ...

Phylogenetic reconciliation: making the most of genomes to understand microbial ecology and evolution
Predicting Phylogenetic Bootstrap Values via Machine Learning

Molecular Biology and Evolution

... Meanwhile, long-read sequencing approaches, recently coined "method of the year" by Nature Methods [18], are the method of choice and allow for much better genome assemblies up to chromosome-scale scaffolds [19]. Comparative genomic studies are performed across all the branches of the tree of life [20][21][22]. Advances in sequencing technology and assembly techniques promoted large consortium efforts to produce datasets that allow for broad scale comparative genomics approaches and led to the final goal to sequence representative genomes from all eukaryotic species (www. ...

Complexity of avian evolution revealed by family-level genomes

Nature

... It is important to keep in mind, though, that conclusions from simulation studies are based on synthetic data, which may lack important aspects of real-world data.¹¹² Because of the hierarchical structure of Bayesian models, the level of complexity can also be adjusted by making assumptions about parameter values through prior point or range estimates of parameters like the substitution rate or even the phylogenetic tree.¹¹³ Other data types can also be integrated into the analysis via prior probability distributions, bringing together different lines of evidence. ...

Simulations of Sequence Evolution: How (Un)realistic They Are and Why

Molecular Biology and Evolution

... This does not only allow for an informed decision on the required analysis setup (for instance, how many trees to infer and what type of post-analyses to conduct), but might also prevent potentially time- and resource-intense analyses that are unlikely to yield a conclusive phylogeny. Building on Pythia, and a thorough performance comparisons of various ML tree inference heuristics by Hoehler et al. [6], Togkousidis et al. [16] implemented an adaptive ML tree inference procedure for the popular RAxML-NG (adaptive RAxML-NG) ML tree inference tool. Adaptive RAxML-NG predicts the difficulty of an MSA using Pythia and subsequently categorizes the MSA into easy (difficulty ranging from 0.0 to 0.3), intermediate (0.3 to 0.7), and difficult (> 0.7). ...

Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty

Molecular Biology and Evolution
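
The excerpt above gives the difficulty ranges that adaptive RAxML-NG reportedly uses to categorize an MSA. The sketch below only illustrates that threshold logic. It is not RAxML-NG or Pythia code, the function name `categorize_difficulty` is hypothetical, and assigning the shared boundary 0.3 to "easy" is an assumption because the excerpt leaves it open.

```python
def categorize_difficulty(difficulty: float) -> str:
    """Map a predicted difficulty in [0, 1] to a coarse category.

    Thresholds follow the quoted excerpt: easy (0.0 to 0.3),
    intermediate (0.3 to 0.7), difficult (> 0.7).
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError(f"difficulty must lie in [0, 1], got {difficulty}")
    # Assumption: the shared boundary 0.3 counts as "easy".
    if difficulty <= 0.3:
        return "easy"
    # The excerpt states "difficult (> 0.7)", so 0.7 itself is intermediate.
    if difficulty <= 0.7:
        return "intermediate"
    return "difficult"
```

For example, `categorize_difficulty(0.5)` falls into the intermediate range, for which the adaptive strategy described in the abstract on this page justifies more extensive searches.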

... The process is divided into two stages. During the first stage, Fast SPR rounds alternate with NNI rounds, until either the RF distance between two consecutive tree topologies is zero or the likelihood score improvement is below a user-defined threshold (Haag et al. 2022b). In the second stage, Slow SPR rounds alternate with NNI rounds until the likelihood score improvement threshold is reached again. ...

The Free Lunch is not over yet—systematic exploration of numerical thresholds in maximum likelihood phylogenetic inference

Bioinformatics Advances

... Trost et al. [32] demonstrate, that machine learning algorithms can easily distinguish between simulated and empirical MSAs with high accuracy and conclude that sequence simulations do not fully capture all characteristics of empirical MSAs. Consequently, we exclusively use empirical MSAs to train EBG. ...

Simulations of sequence evolution: how (un)realistic they really are and why

... In the following, we provide detailed explanations of all changes, as well as respective performance analyses and comparisons. Finally, we also present a novel and easy-to-use command-line tool that allows users to compute the "true" difficulty of an MSA based on our quantification presented in Haag et al. [5]. This might serve for training Pythia on novel data types, such as data from linguistics or potential alignments of protein structure alphabets. ...

From Easy to Hopeless - Predicting the Difficulty of Phylogenetic Analyses

Molecular Biology and Evolution

... Authors using simulated data in their publications typically set simulation parameters according to attributes (e.g., MSA lengths, or proportions of gaps) of empirical reference MSAs (see e.g., Price et al. [39]). Some also attempt to extract or sample simulation parameters from Maximum Likelihood estimates in large scale empirical databases, such as TreeBASE [37], in the hope that thereby simulated data will better resemble empirical data [1,22]. Despite the effort, there still exist performance or program behavior differences on simulated versus empirical data. ...

A representative Performance Assessment of Maximum Likelihood based Phylogenetic Inference Tools

... For data collections simulating indel events, we also used the proportion of gaps as feature (% gaps). Further, we quantified the signal in the MSA using the Shannon entropy [43] as average over all column entropies (Entropy), two metrics based on the number and frequency of patterns in the MSA (Bollback multinomial [8]; Pattern entropy), and the difficulty of the respective phylogenetic analysis as predicted by Pythia [20] (difficulty). In order to assess downstream effects on tree inferences using simulated and empirical data, we inferred 100 trees based on the fast-to-compute maximum parsimony criterion [14,15] and a single Maximum Likelihood (ML) tree using RAxML-NG [28]. ...

From Easy to Hopeless - Predicting the Difficulty of Phylogenetic Analyses
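
The last excerpt above describes quantifying the signal in an MSA as the Shannon entropy averaged over all column entropies. A minimal sketch of that feature follows. It is not the cited study's implementation; the function names, the use of log base 2, and the decision to ignore gap characters within a column are assumptions.

```python
import math
from collections import Counter


def column_entropy(column: str, gap_chars: str = "-") -> float:
    """Shannon entropy (in bits) of one alignment column.

    Assumption: gap characters are ignored; an all-gap column scores 0.
    """
    residues = [c for c in column.upper() if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def average_column_entropy(msa: list[str]) -> float:
    """Mean of the per-column entropies of an MSA given as aligned strings."""
    if not msa or not msa[0]:
        raise ValueError("MSA must contain at least one non-empty sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences in an MSA must have the same length")
    # Transpose the alignment: one string per site.
    columns = ("".join(seq[i] for seq in msa) for i in range(length))
    return sum(column_entropy(col) for col in columns) / length
```

A fully conserved alignment yields 0, while columns with an even mix of states push the average up, for instance `average_column_entropy(["AC", "CC"])` averages one 1-bit column and one 0-bit column.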