Lukas Kammerer’s research while affiliated with Johannes Kepler University of Linz and other places


Publications (17)


[Article figures: Fig. 7, relation between median bias and median model size over all problems; distribution tables of the average RMSE, model size, bias, and variance with corresponding ranks per problem, broken down by noise level σ (median and interquartile range per cell; cf. Figures 3-6).]
Bias and Variance Analysis of Contemporary Symbolic Regression Methods
  • Article
  • Full-text available

November 2024 · 32 Reads · Lukas Kammerer · Gabriel Kronberger

Symbolic regression is commonly used in domains where both high accuracy and interpretability of models are required. While symbolic regression is capable of producing highly accurate models, small changes in the training data might cause highly dissimilar solutions. The implications in practice are huge, as interpretability, a key selling feature, degrades when minor changes in data cause substantially different model behavior. We analyse those perturbations caused by changes in training data for ten contemporary symbolic regression algorithms. We analyse existing machine learning models from the SRBench benchmark suite, a benchmark that compares the accuracy of several symbolic regression algorithms. We measure the bias and variance of algorithms and show how algorithms like Operon and GP-GOMEA return highly accurate models with similar behavior despite changes in training data. Our results highlight that larger model sizes do not imply different behavior when training data change. On the contrary, larger models effectively prevent systematic errors. We also show how other algorithms like ITEA or AIFeynman, with the declared goal of producing consistent results, live up to the expectation of small and similar models.
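
The bias and variance figures reported here follow the usual decomposition of expected squared error into a systematic component (bias) and a data-sensitivity component (variance). A minimal sketch of how such estimates can be obtained for any regressor by retraining on resampled data (the bootstrap scheme and the scikit-learn estimator below are illustrative stand-ins, not the paper's exact protocol):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bias_variance(make_model, X, y, X_test, y_true, n_repeats=30, seed=0):
    # Retrain the algorithm on bootstrap resamples and aggregate predictions.
    rng = np.random.default_rng(seed)
    preds = np.empty((n_repeats, len(X_test)))
    for i in range(n_repeats):
        idx = rng.integers(0, len(X), len(X))          # bootstrap sample
        preds[i] = make_model().fit(X[idx], y[idx]).predict(X_test)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - y_true) ** 2)          # systematic error
    variance = np.mean(preds.var(axis=0))               # spread across retrainings
    return bias2, variance

# Toy usage on a noisy one-dimensional target:
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (200, 1)); y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 200)
X_test = np.linspace(-3, 3, 100).reshape(-1, 1); y_true = np.sin(X_test[:, 0])
print(bias_variance(lambda: DecisionTreeRegressor(max_depth=4), X, y, X_test, y_true))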



Fig. 1. Pareto front of solutions obtained using operon when fitting σ_8/(10⁹ A_s) as a function of Ω_b, Ω_m, h and n_s. We plot the root mean squared error as a function of model length for the training and validation sets separately. The model in Eq. (4) has a model length of 27.
Fig. 2. Linear matter power spectrum (upper), the residuals Eq. (5) from the Eisenstein & Hu fit without baryons (middle), and the fractional residuals on P(k) compared to the truth for the Planck 2018 (Planck Collaboration VI 2020) cosmology. In all panels we plot the truth computed with camb with solid red lines, and the analytic fit Eq. (6) obtained in this paper with dashed blue lines. We see that the fit is accurate within 0.3% across all k considered.
Fig. 3. Pareto front of solutions obtained using operon when fitting the linear matter power spectrum as a function of σ_8, Ω_b, Ω_m, h and n_s. We plot the root mean squared error on log F as a function of model length for the training and validation sets separately. The model given in Eq. (6) has a model length of 77, as indicated by the dotted line.
Fig. 4. Distribution of fractional errors as a function of k on the linear matter power spectrum across all cosmologies in the training and validation sets, as compared to the predictions of camb. The bands give the 1 and 2σ values. The dotted line corresponds to a 1% error, and we see that our expression achieves this for all cosmologies and values of k considered, with a root mean squared fractional error of 0.2%.
Fig. 5. Contributions to log F from our emulator as a function of k for the Planck 2018 cosmology. The line numbers indicated in the legend correspond to the lines in Eq. (6). One sees that the first term provides an overall offset, the second and fourth capture the BAO signal, and the third term contains a broad oscillation and then matches on to the decaying residual at high k.
A precise symbolic emulator of the linear matter power spectrum

April 2024 · 14 Reads · 11 Citations · Astronomy and Astrophysics

Context. Computing the matter power spectrum, P(k), as a function of cosmological parameters can be prohibitively slow in cosmological analyses, hence emulating this calculation is desirable. Previous analytic approximations are insufficiently accurate for modern applications, so black-box, uninterpretable emulators are often used. Aims. We aim to construct an efficient, differentiable, interpretable, symbolic emulator for the redshift zero linear matter power spectrum which achieves sub-percent level accuracy. We also wish to obtain a simple analytic expression to convert A_s to σ_8 given the other cosmological parameters. Methods. We utilise an efficient genetic programming based symbolic regression framework to explore the space of potential mathematical expressions which can approximate the power spectrum and σ_8. We learn the ratio between an existing low-accuracy fitting function for P(k) and that obtained by solving the Boltzmann equations and thus still incorporate the physics which motivated this earlier approximation. Results. We obtain an analytic approximation to the linear power spectrum with a root mean squared fractional error of 0.2% between k = 9 × 10⁻³ and 9 h Mpc⁻¹ and across a wide range of cosmological parameters, and we provide physical interpretations for various terms in the expression. Our analytic approximation is 950 times faster to evaluate than CAMB and 36 times faster than the neural network based matter power spectrum emulator BACCO. We also provide a simple analytic approximation for σ_8 with a similar accuracy, with a root mean squared fractional error of just 0.1% when evaluated across the same range of cosmologies. This function is easily invertible to obtain A_s as a function of σ_8 and the other cosmological parameters, if preferred. Conclusions. It is possible to obtain symbolic approximations to a seemingly complex function at a precision required for current and future cosmological analyses without resorting to deep-learning techniques, thus avoiding their black-box nature and large number of parameters. Our emulator will be usable long after the codes on which numerical approximations are built become outdated.
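
The central trick here, learning the ratio between the exact calculation and an existing physics-motivated fit, is easy to set up for any baseline. A minimal sketch of that construction (camb_pk and eh_pk are placeholder callables standing in for the Boltzmann-solver output and the Eisenstein & Hu fit; the operon-based symbolic search itself is not shown):

import numpy as np

def build_training_set(k_grid, cosmologies, camb_pk, eh_pk):
    # Regression target is log F = log(P_true / P_fit), so the symbolic model
    # only has to learn what the physical baseline misses (e.g. BAO wiggles).
    X, y = [], []
    for c in cosmologies:                  # c = (sigma8, Omega_b, Omega_m, h, n_s)
        log_F = np.log(camb_pk(k_grid, c) / eh_pk(k_grid, c))
        for k, f in zip(k_grid, log_F):
            X.append([k, *c]); y.append(f)
    return np.array(X), np.array(y)

def emulated_pk(k, c, eh_pk, symbolic_log_F):
    # Final emulator: analytic baseline times the exponentiated learned correction.
    return eh_pk(k, c) * np.exp(symbolic_log_F(k, c))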


Symbolic Regression with Fast Function Extraction and Nonlinear Least Squares Optimization

February 2023 · 5 Reads · 2 Citations · Lecture Notes in Computer Science

Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squares optimization using a variable projection algorithm. Both FFX and our new algorithm are applied to the PennML benchmark suite. We show that the proposed extensions of FFX lead to higher accuracy while providing models of similar length and with only a small increase in runtime on the given data. Our results are compared to a large set of regression methods that were already published for the given benchmark suite.
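
Variable projection exploits the model's separable structure: for any fixed set of nonlinear inner parameters, the optimal linear coefficients have a closed-form least-squares solution, so only the nonlinear parameters need iterative optimization. A minimal sketch of that inner/outer split (the single exponential basis function and the Nelder-Mead outer optimizer are illustrative choices, not the exact FFX setup):

import numpy as np
from scipy.optimize import minimize

def basis(x, theta):
    # Columns are basis functions; theta parameterizes the nonlinear argument.
    return np.column_stack([np.ones_like(x), np.exp(-theta[0] * x)])

def varpro_objective(theta, x, y):
    # Inner problem: optimal linear coefficients for fixed theta, in closed form.
    Phi = basis(x, theta)
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    r = y - Phi @ coef
    return 0.5 * r @ r

# Outer problem: iterate only over the nonlinear parameter(s).
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 200)
y = 2.0 + 3.0 * np.exp(-0.7 * x) + rng.normal(0, 0.05, 200)
res = minimize(varpro_objective, x0=[0.1], args=(x, y), method="Nelder-Mead")
print(res.x)   # recovered nonlinear parameter, close to 0.7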


Symbolic Regression with Fast Function Extraction and Nonlinear Least Squares Optimization

September 2022 · 3 Reads

Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squares optimization using a variable projection algorithm. Both FFX and our new algorithm are applied to the PennML benchmark suite. We show that the proposed extensions of FFX lead to higher accuracy while providing models of similar length and with only a small increase in runtime on the given data. Our results are compared to a large set of regression methods that were already published for the given benchmark suite.


Cluster Analysis of a Symbolic Regression Search Space

September 2021 · 16 Reads

In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently of the target function. For our analysis, we use a restricted grammar for univariate symbolic regression models and generate all possible models up to a fixed length limit. We identify unique models and cluster them based on phenotypic as well as genotypic similarity. We find that phenotypic similarity leads to well-defined clusters, while genotypic similarity does not produce a clear clustering. By mapping solution candidates visited by GP to the enumerated search space, we find that GP initially explores the whole search space and later converges to the subspace of the highest-quality expressions during a run on a simple benchmark problem.
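
Phenotypic similarity in this setting can be measured by evaluating every expression on a fixed input sample and comparing the resulting output vectors. A minimal sketch of that clustering step (the candidate expressions, correlation distance, and hierarchical clustering below are illustrative, not the chapter's exact grammar or method):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-ins for enumerated univariate models:
models = [lambda x: x, lambda x: x + 0.01 * x**2, lambda x: np.sin(x),
          lambda x: np.sin(x) + 1e-3 * x, lambda x: np.exp(x)]

xs = np.linspace(-2, 2, 64)                     # fixed evaluation sample
semantics = np.array([m(xs) for m in models])   # phenotype = output vector

# Correlation distance between outputs, then average-linkage clustering.
dist = np.clip(1 - np.corrcoef(semantics), 0, None)
condensed = dist[np.triu_indices(len(models), 1)]
labels = fcluster(linkage(condensed, method="average"), t=0.01, criterion="distance")
print(labels)   # phenotypically similar models share a cluster label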


Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication

September 2021 · 30 Reads

Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness and plausibility that are not easily achievable using standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions. The enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming on many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.
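
Semantic deduplication of enumerated candidates can be approximated by fingerprinting behavior: evaluate each expression on a fixed input sample, round, and hash the output vector, so syntactically different but semantically equivalent expressions collapse onto one cache key. A minimal sketch (output-vector hashing as shown is one plausible caching mechanism, not necessarily the chapter's exact one):

import numpy as np

def semantic_key(expr, xs, decimals=10):
    # Hashable fingerprint of a model's behavior on a fixed input sample.
    with np.errstate(all="ignore"):
        out = expr(xs)
    return np.round(out, decimals).tobytes()

xs = np.linspace(0.1, 4, 32)
seen, unique = set(), []
# x*(x+1) and x**2 + x are syntactically different but semantically equal.
for expr in (lambda x: x * (x + 1), lambda x: x**2 + x, lambda x: np.log(x)):
    key = semantic_key(expr, xs)
    if key not in seen:
        seen.add(key)
        unique.append(expr)
print(len(unique))   # 2: the first two candidates collapse into one entry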


Data Aggregation for Reducing Training Data in Symbolic Regression

August 2021 · 5 Reads

The growing volume of data makes the use of computationally intensive machine learning techniques such as symbolic regression with genetic programming more and more impractical. This work discusses methods to reduce the training data and thereby also the runtime of genetic programming. The data is aggregated in a preprocessing step before running the actual machine learning algorithm. K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method. We analyze the achieved speed-up in training and the effects on the trained models' test accuracy for every method on four real-world data sets. The performance of genetic programming is compared with random forests and linear regression. It is shown that k-means and random sampling lead to a very small loss in test accuracy when the data is reduced down to only 30% of the original data, while the speed-up is proportional to the size of the data set. Binning, on the contrary, leads to models with a very high test error.
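
A minimal sketch of the k-means variant of this aggregation, where the training set is replaced by cluster centroids with per-cluster mean targets (scikit-learn based; details such as how targets are aggregated are assumptions here, not necessarily the paper's exact setup):

import numpy as np
from sklearn.cluster import KMeans

def aggregate_kmeans(X, y, fraction=0.3, seed=0):
    # Replace the data with k-means centroids; each centroid's target is the
    # mean target of the points assigned to its cluster.
    k = max(1, int(fraction * len(X)))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    y_red = np.array([y[km.labels_ == j].mean() for j in range(k)])
    return km.cluster_centers_, y_red

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(0, 0.1, 1000)
X_red, y_red = aggregate_kmeans(X, y)    # keeps 30% of the original rows
print(X_red.shape, y_red.shape)          # (300, 3) (300,)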


Hash-Based Tree Similarity and Simplification in Genetic Programming for Symbolic Regression

July 2021 · 36 Reads

We introduce in this paper a runtime-efficient tree hashing algorithm for the identification of isomorphic subtrees, with two important applications in genetic programming for symbolic regression: fast, online calculation of population diversity and algebraic simplification of symbolic expression trees. Based on this hashing approach, we propose a simple diversity-preservation mechanism with promising results on a collection of symbolic regression benchmark problems.
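
The core of such a hashing scheme is Merkle-style: a node's hash is computed from its symbol and its children's hashes, with child hashes sorted under commutative operators so that isomorphic subtrees collide by construction. A minimal sketch (the tuple-based tree layout and SHA-1 are illustrative choices, not the paper's exact algorithm):

import hashlib

COMMUTATIVE = {"+", "*"}

def tree_hash(node):
    # Bottom-up hash over (symbol, children) tuples; equal hashes identify
    # isomorphic subtrees, up to argument order for commutative operators.
    label, children = node
    child_hashes = [tree_hash(c) for c in children]
    if label in COMMUTATIVE:
        child_hashes.sort()              # order-insensitive for + and *
    data = label.encode() + b"".join(h.encode() for h in child_hashes)
    return hashlib.sha1(data).hexdigest()

# (x + y) and (y + x) hash identically; (x - y) does not.
t1 = ("+", [("x", []), ("y", [])])
t2 = ("+", [("y", []), ("x", [])])
t3 = ("-", [("x", []), ("y", [])])
print(tree_hash(t1) == tree_hash(t2), tree_hash(t1) == tree_hash(t3))   # True False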



Citations (9)


... Both operations were performed six times, and each candidate model produced was evaluated for consistency at a sample of domain locations. We did not investigate if there was some optimum number with respect to overall run time or if the preferential search strategy made a significant impact on an appropriate measure of success rate, as discussed in Kronberger et al. (2024). Although Garbrecht et al. (2021b) showed that the preferential search strategy can improve overall run time when the evaluation and selection steps performed in crossover and mutation are faster than the main evaluation and selection step, assuming it beneficially guides the algorithm's search. ...

Reference:

Modeling plasticity-mediated void growth at the single crystal scale: A physics-informed machine learning approach
The Inefficiency of Genetic Programming for Symbolic Regression
  • Citing Chapter
  • September 2024

... After the first seminal papers on this topic [5][6][7], several emulators have been produced in the literature, emulating the output of Boltzmann solvers such as CAMB [8] or CLASS [9], with applications ranging from the Cosmic Microwave Background (CMB) [10][11][12][13][14], the linear matter power spectrum [11,[15][16][17][18][19], galaxy power spectrum multipoles [17,[19][20][21][22], and the galaxy survey angular power spectrum [23][24][25][26][27][28][29]. ...

A precise symbolic emulator of the linear matter power spectrum

Astronomy and Astrophysics

... However, this approach is practically viable only if the number of selected features remains low, such as below ten, as the search algorithm is still being performed with GP. Alternative methods include the fast function extraction [32,33] and differentiable GP approach [34], which aim to address the scalability challenges of SR but are not NN-based. ...

Symbolic Regression with Fast Function Extraction and Nonlinear Least Squares Optimization
  • Citing Chapter
  • February 2023

Lecture Notes in Computer Science

... They calculated bias and variance measurements but only for a few data sets and only for ensemble bagging of standard GP-based SR. Kammerer et al. [11] analyzed the variance for two different GP-based SR variants and compared them with Random Forest regression [12] and linear regression on a few data sets. Highly related to bias and variance is the work by de Franca et al. [13], who provided a successor of SRBench. ...

Empirical analysis of variance for genetic programming based symbolic regression
  • Citing Conference Paper
  • July 2021

... The motivation behind employing genetic algorithms lies in their ability to efficiently explore large solution spaces, enabling the discovery of diverse and effective solutions that may be elusive through traditional optimization methods [21,22,23]. In particular, GAs have been adapted to the realm of ML to find a well-performing ML pipeline [24], solve a symbolic regression task [25,26,27], or even as part of an ML ensemble-based model [28,29]. ...

Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication
  • Citing Chapter
  • May 2020

... Significant progress has been made in modeling dynamics using methods such as sparse regression, genetic programming, and deep learning [23][24][25][26][27], however selecting an appropriate set of state variables still remains a fundamental challenge [28]. The difficulty stems from the non-uniqueness of state variables, even in simple systems. ...

Identification of Dynamical Systems Using Symbolic Regression
  • Citing Chapter
  • April 2020

Lecture Notes in Computer Science

... When to simplify: every generation [2], [117]-[122]; final generation [2], [123], [124]; at certain intervals [119], [125]-[128]. Which individuals: all individuals in the population [2], [119], [122], [125]-[127]; the best individuals in the population [2], [121], [123], [124], [128]; randomly with some probability [117]; the parents for breeding [118], [120]. How: genotypic (structural) [2], [117]-[119], [122], [123], [125]-[127], [129]-[131]; phenotypic (behavioural) [120], [121], [124], [126]-[128], [132]-[135]. Regarding "when" to simplify, the commonly used strategies are based on frequency (i.e., after every k generations). If k = 1, then the simplification is conducted after every generation. ...

Hash-Based Tree Similarity and Simplification in Genetic Programming for Symbolic Regression
  • Citing Chapter
  • April 2020

Lecture Notes in Computer Science

... The algorithm sequentially generates solutions from the grammar and keeps track of the most accurate one. For very small problems, it is even feasible to iterate the whole search space [19]. However, our goal in larger problems is to find accurate and concise solutions early during the search and to stop the algorithm after a reasonable time. ...

Cluster Analysis of a Symbolic Regression Search Space
  • Citing Chapter
  • January 2019

... However, to achieve a strong generalisation ability, the learners in an ensemble should be accurate and diverse. To obtain a good ensemble, many methods have been developed, including bagging and boosting methods [8,13,29]. However, in most existing ensemble methods, the selection of the base learners and the combination of them are often manually determined. ...

Confidence-based ensemble modeling in medical data mining
  • Citing Conference Paper
  • July 2018