Conference Paper

Reducing Overparameterization of Symbolic Regression Models with Equality Saturation

... To make matters worse, the usual way of encoding mathematical expressions as symbolic expression trees allows GP to visit different but equivalent expressions [21] that evaluate to the same values. These equivalent expressions may be unnecessarily large and contain redundant parameters, which reduces the probability of finding their optimal values [9,19]. Even for the simple expression θ1·x1 we can produce an infinite number of equivalent expressions, considering that the θi are fitting parameters; for example, (θ1·x1) + (θ2·x1), x1/θ1, θ2·x1/(θ1·x1), etc. are all different parameterizations of the same expression. ...
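The non-identifiability described above is easy to see numerically. The following minimal sketch (our own illustration, with made-up names and values) uses a hypothetical over-parameterized model f(x) = a·x + b·x, where only the sum a + b is identifiable, so distinct parameter vectors produce identical predictions:

```python
# Hypothetical over-parameterized model f(x) = a*x + b*x:
# only the sum a + b is identifiable from data, so any two
# parameter pairs with the same sum predict (numerically) the same.
def predict(x, a, b):
    return a * x + b * x

xs = [0.5, 1.0, 2.0, 3.0]
p1 = [predict(x, 0.7, 1.3) for x in xs]  # a + b = 2.0
p2 = [predict(x, 1.9, 0.1) for x in xs]  # a + b = 2.0

# the two parameterizations are indistinguishable on the data
assert all(abs(u - v) < 1e-12 for u, v in zip(p1, p2))
```

A least-squares fit of such a model has a flat direction in parameter space, which is exactly why redundant parameters reduce the probability of reaching an optimal fit.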
... GP cannot differentiate between equivalent forms of a given expression, and simplification heuristics are often insufficient, as seen in [9]. Some authors [16,26] argue that redundancy is necessary to allow the algorithm to navigate the search space: since equivalent expressions are guaranteed to have the same accuracy, the search can keep multiple genetically different variations of solution candidates in the hope of finding a better solution. ...
... The main idea is that upon saturation, the graph contains all equivalent forms of the original program and the optimal form can be extracted from the e-graph using a heuristic. This technique was previously used in the context of SR in [9,19] to investigate the problem of overparameterization that can negatively affect the fitting of numerical parameters. The e-graph has another interesting feature that can be exploited by SR algorithms: it contains a database of all visited patterns and their equivalent forms that can be easily matched against new candidate expressions. ...
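A toy illustration of the e-graph idea (this sketch is ours, not the paper's implementation or the egg library): hash-consing assigns each distinct node an e-class id, a union-find merges classes that a rewrite proves equal, and congruence then makes expressions built on either member land in the same class. Production e-graphs also "rebuild" congruence for pre-existing parent nodes after a union, which this sketch omits.

```python
# Toy e-graph: nodes are (op, child e-class ids); equal nodes share a class.
class EGraph:
    def __init__(self):
        self.parent = []   # union-find over e-class ids
        self.memo = {}     # hashcons: canonical node -> e-class id

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def add(self, op, *children):
        node = (op, tuple(self.find(c) for c in children))
        if node in self.memo:
            return self.find(self.memo[node])
        i = len(self.parent)
        self.parent.append(i)
        self.memo[node] = i
        return i

    def union(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
        return a

eg = EGraph()
x = eg.add('x')
one = eg.add('1')
mul = eg.add('*', x, one)
eg.union(mul, x)                   # rewrite: x * 1 -> x
f1 = eg.add('+', mul, one)         # (x*1) + 1
f2 = eg.add('+', x, one)           # x + 1
assert eg.find(f1) == eg.find(f2)  # recognized as the same e-class
```

The final assertion is the "database of visited patterns" feature in miniature: once x*1 and x are merged, any candidate built from either form hashes to the same class and is detected as already visited.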
Preprint
Full-text available
The search for symbolic regression models with genetic programming (GP) has a tendency to revisit expressions in their original or equivalent forms. Repeatedly evaluating equivalent expressions is inefficient, as it does not immediately lead to better solutions. However, evolutionary algorithms require diversity and should allow the accumulation of inactive building blocks that can play an important role at a later point. The equality graph is a data structure capable of compactly storing expressions and their equivalent forms, allowing an efficient verification of whether an expression has been visited in any of its stored equivalent forms. We exploit the e-graph to adapt the subtree operators to reduce the chances of revisiting expressions. Our adaptation, called eggp, stores every visited expression in the e-graph, allowing us to filter out from the available selection of subtrees all the combinations that would create already visited expressions. Results show that, for small expressions, this approach improves the performance of a simple GP algorithm to compete with PySR and Operon without increasing computational cost. As a highlight, eggp was capable of reliably delivering short and at the same time accurate models for a selected set of benchmarks from SRBench and a set of real-world datasets.
... This is possible with the use of e-graph data structure [43] created for the equality saturation algorithm, a technique used to alleviate the phase ordering problem in the optimization of computer programs during the compilation process. This technique was previously used in the context of symbolic regression in [11,24] to investigate the problem of unwanted overparameterization that increases the chance of a suboptimal fitting of the numerical parameters. The generated e-graph has another interesting feature that can be exploited by symbolic regression algorithms: it contains a database of patterns and equivalent expressions that can be easily matched against a new candidate expression. ...
... Equality saturation has been used in the context of symbolic regression as a support tool to study the behavior of the search. de Franca and Kronberger [11,24] show that many state-of-the-art SR algorithms have a bias towards creating expressions with redundant numerical parameters. This redundancy can increase the chance of failing to correctly optimize such parameters, leading to sub-optimal solutions. ...
Preprint
Full-text available
Regression analysis is used for prediction and to understand the effect of independent variables on dependent variables. Symbolic regression (SR) automates the search for non-linear regression models, delivering a set of hypotheses that balances accuracy with the possibility to understand the phenomena. Many SR implementations return a Pareto front allowing the choice of the best trade-off. However, this hides alternatives that are close to non-domination, limiting these choices. Equality graphs (e-graphs) make it possible to represent large sets of expressions compactly by efficiently handling duplicated parts occurring in multiple expressions. E-graphs can store and query all SR solution candidates visited in one or multiple GP runs efficiently and open the possibility to analyse much larger sets of SR solution candidates. We introduce rEGGression, a tool using e-graphs to enable the exploration of a large set of symbolic expressions, which provides querying, filtering, and pattern matching features, creating an interactive experience to gain insights about SR models. The main highlight is its focus on the exploration of the building blocks found during the search, which can help experts to find insights about the studied phenomena. This is possible by exploiting the pattern matching capability of the e-graph data structure.
... Algorithmic aspects of SR amplify these effects: First, the stochastic nature of a Genetic Programming (GP)-based [5] SR algorithm results in differences in models between multiple SR runs even when the training data do not change at all. Second, since there is no guarantee for optimality in an SR search space [6], models trained in different SR runs might even provide very similar accuracy despite being completely different mathematical expressions due to, e.g., bloat [5] or over-parameterization [7] that increase the size of a model without affecting its accuracy. ...
... While there is a clear relationship between bias, variance and test error of algorithms, the relation between model size as a notion of the inverse of parsimony, bias, and variance is not clear. Effects like bloat, over-parameterization [7] as well as different sets of used mathematical functions distort clear connections between those properties. The relation between the median values of model size and bias/variance over all problems is outlined in Figures 7 and 8. FFX is excluded from both figures as its huge model size values would distort the axis scale. ...
Article
Full-text available
Symbolic regression is commonly used in domains where both high accuracy and interpretability of models are required. While symbolic regression is capable of producing highly accurate models, small changes in the training data might cause highly dissimilar solutions. The implications in practice are huge, as interpretability as a key selling feature degrades when minor changes in data cause substantially different behavior of models. We analyse those perturbations caused by changes in training data for ten contemporary symbolic regression algorithms. We analyse existing machine learning models from the SRBench benchmark suite, a benchmark that compares the accuracy of several symbolic regression algorithms. We measure the bias and variance of algorithms and show how algorithms like Operon and GP-GOMEA return highly accurate models with similar behavior despite changes in training data. Our results highlight that larger model sizes do not imply different behavior when training data change. On the contrary, larger models effectively prevent systematic errors. We also show how other algorithms like ITEA or AIFeynman, with the declared goal of producing consistent results, live up to their expectation of small and similar models.
... If we change both 0.19x2 and 0.21x2 to 0.2x2, we get an expression that, after some algebraic manipulation, becomes the true expression (Table I). This reveals some problems related to the internal optimization of the numerical parameters that can hinder the search for the correct expression: i) the expression may be ill-conditioned [31]; ii) the optimization may not have reached the local optimum (a computational budget and accuracy trade-off); iii) it may have converged to a bad local optimum; iv) it can bias the search toward overparameterized expressions [32]. This limits the usefulness of algebraic simplification in alleviating this problem, as seen in the previous example. ...
Article
Full-text available
Symbolic regression searches for analytic expressions that accurately describe studied phenomena. The main promise of this approach is that it may return an interpretable model that can be insightful to users, while maintaining high accuracy. The current standard for benchmarking these algorithms is SRBench, which evaluates methods on hundreds of datasets that are a mix of real-world and simulated processes spanning multiple domains. At present, the ability of SRBench to evaluate interpretability is limited to measuring the size of expressions on real-world data, and the exactness of model forms on synthetic data. In practice, model size is only one of many factors used by subject experts to determine how interpretable a model truly is. Furthermore, SRBench does not characterize algorithm performance on specific, challenging sub-tasks of regression such as feature selection and evasion of local minima. In this work, we propose and evaluate an approach to benchmarking SR algorithms that addresses these limitations of SRBench by 1) incorporating expert evaluations of interpretability on a domain-specific task, and 2) evaluating algorithms over distinct properties of data science tasks. We evaluate 12 modern symbolic regression algorithms on these benchmarks and present an in-depth analysis of the results, discuss current challenges of symbolic regression algorithms and highlight possible improvements for the benchmark itself.
Article
Full-text available
Context. Computing the matter power spectrum, P(k), as a function of cosmological parameters can be prohibitively slow in cosmological analyses, hence emulating this calculation is desirable. Previous analytic approximations are insufficiently accurate for modern applications, so black-box, uninterpretable emulators are often used. Aims. We aim to construct an efficient, differentiable, interpretable, symbolic emulator for the redshift-zero linear matter power spectrum which achieves sub-percent level accuracy. We also wish to obtain a simple analytic expression to convert As to σ8 given the other cosmological parameters. Methods. We utilise an efficient genetic programming based symbolic regression framework to explore the space of potential mathematical expressions which can approximate the power spectrum and σ8. We learn the ratio between an existing low-accuracy fitting function for P(k) and that obtained by solving the Boltzmann equations, and thus still incorporate the physics which motivated this earlier approximation. Results. We obtain an analytic approximation to the linear power spectrum with a root mean squared fractional error of 0.2% between k = 9 × 10⁻³ and 9 h Mpc⁻¹ and across a wide range of cosmological parameters, and we provide physical interpretations for various terms in the expression. Our analytic approximation is 950 times faster to evaluate than CAMB and 36 times faster than the neural network based matter power spectrum emulator BACCO. We also provide a simple analytic approximation for σ8 with a similar accuracy, with a root mean squared fractional error of just 0.1% when evaluated across the same range of cosmologies. This function is easily invertible to obtain As as a function of σ8 and the other cosmological parameters, if preferred. Conclusions.
It is possible to obtain symbolic approximations to a seemingly complex function at a precision required for current and future cosmological analyses without resorting to deep-learning techniques, thus avoiding their black-box nature and large number of parameters. Our emulator will be usable long after the codes on which numerical approximations are built become outdated.
Article
Full-text available
A core challenge for both physics and artificial intelligence (AI) is symbolic regression: finding a symbolic expression that matches data from an unknown function. Although this problem is likely to be NP-hard in principle, functions of practical interest often exhibit symmetries, separability, compositionality, and other simplifying properties. In this spirit, we develop a recursive multidimensional symbolic regression algorithm that combines neural network fitting with a suite of physics-inspired techniques. We apply it to 100 equations from the Feynman Lectures on Physics, and it discovers all of them, while previous publicly available software cracks only 71; for a more difficult physics-based test set, we improve the state-of-the-art success rate from 15% to 90%.
Article
Full-text available
In this paper we analyze the effects of using nonlinear least squares for parameter identification of symbolic regression models and integrate it as a local search mechanism in tree-based genetic programming. We employ the Levenberg–Marquardt algorithm for parameter optimization and calculate gradients via automatic differentiation. We provide examples where the parameter identification succeeds and fails and highlight its computational overhead. Using an extensive suite of symbolic regression benchmark problems we demonstrate the increased performance when incorporating nonlinear least squares within genetic programming. Our results are compared with recently published results obtained by several genetic programming variants and state-of-the-art machine learning algorithms. Genetic programming with nonlinear least squares performs among the best on the defined benchmark suite, and the local search can be easily integrated in different genetic programming algorithms as long as only differentiable functions are used within the models.
Article
Full-text available
SymPy is an open source computer algebra system written in pure Python. It is built with a focus on extensibility and ease of use, through both interactive and programmatic applications. These characteristics have led SymPy to become a popular symbolic library for the scientific Python ecosystem. This paper presents the architecture of SymPy, a description of its features, and a discussion of select submodules. The supplementary material provides additional examples and further outlines details of the architecture and features of SymPy.
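In the context of this paper's theme, a computer algebra system like SymPy can expose some (though, as the citing text notes, not all) redundant parameterizations through algebraic simplification. A small example (assumes SymPy is installed):

```python
import sympy as sp

a, b, x = sp.symbols('a b x')

# (b*x)/(a*x) is just b/a: one parameter is redundant
assert sp.cancel((b * x) / (a * x)) == b / a

# a*x + b*x carries two parameters but only one effective one
assert sp.simplify(a * x + b * x - (a + b) * x) == 0
```

Simplification heuristics like these operate on one expression at a time; the e-graph approach instead keeps all equivalent forms simultaneously and extracts the preferred one.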
Conference Paper
Full-text available
We propose a new means of executing a genetic program which improves its output quality. Our approach, called Multiple Regression Genetic Programming (MRGP), decouples and linearly combines a program's subexpressions via multiple regression on the target variable. The regression yields an alternate output: the prediction of the resulting multiple regression model. It is this output, over many fitness cases, that we assess for fitness, rather than the program's execution output. MRGP can be used to improve the fitness of a final evolved solution. On our experimental suite, MRGP consistently generated solutions fitter than the result of competent GP or multiple regression. When integrated into GP, inline MRGP, on the basis of equivalent computational budget, outperforms competent GP while also besting post-hoc MRGP. Thus MRGP's output method is shown to be superior to the output of program execution, and it represents a practical, cost-neutral improvement to GP.
Article
Full-text available
It is approximately 50 years since the first computational experiments were conducted in what has become known today as the field of Genetic Programming (GP), twenty years since John Koza named and popularised the method, and ten years since the first issue appeared of the Genetic Programming & Evolvable Machines journal. In particular, during the past two decades there has been a significant range and volume of development in the theory and application of GP, and in recent years the field has become increasingly applied. There remain a number of significant open issues despite the successful application of GP to a number of challenging real-world problem domains and progress in the development of a theory explaining the behavior and dynamics of GP. These issues must be addressed for GP to realise its full potential and to become a trusted mainstream member of the computational problem solving toolkit. In this paper we outline some of the challenges and open issues that face researchers and practitioners of GP. We hope this overview will stimulate debate, focus the direction of future research to deepen our understanding of GP, and further the development of more powerful problem solving algorithms. Keywords: Open issues; Genetic programming
Article
Full-text available
Most evolutionary optimization models incorporate a fitness evaluation that is based on a predefined static set of test cases or problems. In the natural evolutionary process, selection is of course not based on a static fitness evaluation. Organisms do not have to combat every existing disease during their lifespan; organisms of one species may live in different or changing environments; different species coevolve. This leads to the question of how information is integrated over many generations. This study focuses on the effects of different fitness evaluation schemes on the types of genotypes and phenotypes that evolve. The evolutionary target is a simple numerical function. The genetic representation is in the form of a program (i.e., a functional representation, as in genetic programming). Many different programs can code for the same numerical function. In other words, there is a many-to-one mapping between “genotypes” (the programs) and “phenotypes”. We compare fitness evaluation based on a large static set of problems and fitness evaluation based on small coevolving sets of problems. In the latter model very little information is presented to the evolving programs regarding the evolutionary target per evolutionary time step. In other words, the fitness evaluation is very sparse. Nevertheless the model produces correct solutions to the complete evolutionary target in about half of the simulations. The complete evaluation model, on the other hand, does not find correct solutions to the target in any of the simulations. More important, we find that sparse evaluated programs are better generalizable compared to the complete evaluated programs when they are evaluated on a much denser set of problems. In addition, the two evaluation schemes lead to programs that differ with respect to mutational stability; sparse evaluated programs are less stable than complete evaluated programs.
Article
We present a new approach to e-matching based on relational join; in particular, we apply recent database query execution techniques to guarantee worst-case optimal run time. Compared to the conventional backtracking approach that always searches the e-graph "top down", our new relational e-matching approach can better exploit pattern structure by searching the e-graph according to an optimized query plan. We also establish the first data complexity result for e-matching, bounding run time as a function of the e-graph size and output size. We prototyped and evaluated our technique in the state-of-the-art egg e-graph framework. Compared to a conventional baseline, relational e-matching is simpler to implement and orders of magnitude faster in practice.
Article
An e-graph efficiently represents a congruence relation over many expressions. Although they were originally developed in the late 1970s for use in automated theorem provers, a more recent technique known as equality saturation repurposes e-graphs to implement state-of-the-art, rewrite-driven compiler optimizations and program synthesizers. However, e-graphs remain unspecialized for this newer use case. Equality saturation workloads exhibit distinct characteristics and often require ad-hoc e-graph extensions to incorporate transformations beyond purely syntactic rewrites. This work contributes two techniques that make e-graphs fast and extensible, specializing them to equality saturation. A new amortized invariant restoration technique called rebuilding takes advantage of equality saturation's distinct workload, providing asymptotic speedups over current techniques in practice. A general mechanism called e-class analyses integrates domain-specific analyses into the e-graph, reducing the need for ad hoc manipulation. We implemented these techniques in a new open-source library called egg. Our case studies on three previously published applications of equality saturation highlight how egg's performance and flexibility enable state-of-the-art results across diverse domains.
Book
Most textbooks on regression focus on theory and the simplest of examples. Real statistical problems, however, are complex and subtle. This is not a book about the theory of regression. It is about using regression to solve real problems of comparison, estimation, prediction, and causal inference. Unlike other books, it focuses on practical issues such as sample size and missing data and a wide range of goals and techniques. It jumps right in to methods and computer code you can use immediately. Real examples, real stories from the authors' experience demonstrate what regression can do and its limitations, with practical advice for understanding assumptions and implementing methods for experiments and observational studies. They make a smooth transition to logistic regression and GLM. The emphasis is on computation in R and Stan rather than derivations, with code available online. Graphics and presentation aid understanding of the models and model fitting.
Conference Paper
The balance between approximation error and model complexity is an important trade-off for Symbolic Regression algorithms. This trade-off is achieved by means of specific operators for bloat control, modified operators, limits to the size of the generated expressions, and multi-objective optimization. Recently, the Interaction-Transformation representation was introduced with the goal of limiting the search space to simpler expressions, thus avoiding bloat. This representation was used in the context of an Evolutionary Algorithm in order to find concise expressions resulting in small approximation errors competitive with the literature. Particular to this algorithm, two parameters control the complexity of the generated expression. This paper investigates the influence of those parameters on goodness-of-fit. Through extensive experiments, we find that the maximum number of terms is more important for controlling goodness-of-fit, but also that there is a limit beyond which increasing its value yields no further benefit. Second, the limits on the minimum and maximum values of the exponent have a smaller influence on the results and can be set to a default value without impacting the final results.
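For reference, an Interaction-Transformation expression is a sum of weighted transformation functions applied to interaction terms, f(x) = Σᵢ wᵢ · gᵢ(Πⱼ xⱼ^eᵢⱼ). A minimal evaluator for this representation (our own sketch, with illustrative term values):

```python
import math

# Evaluate an IT expression: each term is (weight, transformation g,
# integer exponents for the interaction of the input variables).
def it_eval(x, terms):
    total = 0.0
    for w, g, exps in terms:
        inter = 1.0
        for xj, e in zip(x, exps):
            inter *= xj ** e
        total += w * g(inter)
    return total

# Example: f(x1, x2) = 1.5*log(x1 * x2^2) + 0.3*x1
terms = [(1.5, math.log, (1, 2)),
         (0.3, lambda v: v, (1, 0))]
y = it_eval((2.0, 3.0), terms)
```

The two complexity controls studied above map directly onto this structure: the number of (w, g, exps) terms, and the allowed range of each exponent e.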
Preprint
Multidimensional genetic programming represents candidate solutions as sets of programs, and thereby provides an interesting framework for exploiting building block identification. Towards this goal, we investigate the use of machine learning as a way to bias which components of programs are promoted, and propose two semantic operators to choose where useful building blocks are placed during crossover. A forward stagewise crossover operator we propose leads to significant improvements on a set of regression problems, and produces state-of-the-art results in a large benchmark study. We discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. Finally, we look at the collinearity and complexity of the data representations that result from these architectures, with a view towards disentangling factors of variation in application.
Conference Paper
The Gene-pool Optimal Mixing Evolutionary Algorithm (GOMEA) is a recently introduced model-based EA that has been shown to be capable of outperforming state-of-the-art alternative EAs in terms of scalability when solving discrete optimization problems. One of the key aspects of GOMEA's success is a variation operator that is designed to extensively exploit linkage models by effectively combining partial solutions. Here, we bring the strengths of GOMEA to Genetic Programming (GP), introducing GP-GOMEA. Under the hypothesis of having little problem-specific knowledge, and in an effort to design easy-to-use EAs, GP-GOMEA requires no parameter specification. On a set of well-known benchmark problems we find that GP-GOMEA outperforms standard GP while being on par with more recently introduced, state-of-the-art EAs. We furthermore introduce Input-space Entropy-based Building-block Learning (IEBL), a novel approach to identifying and encapsulating relevant building blocks (subroutines) into new terminals and functions. On problems with an inherent degree of modularity, IEBL can contribute to compact solution representations, providing a large potential for knock-on effects in performance. On the difficult, but highly modular Even Parity problem, GP-GOMEA+IEBL obtains excellent scalability, solving the 14-bit instance in less than 1 hour.
Article
This paper provides a preliminary report on a new research project that aims to construct a code generator that uses an automatic theorem prover to produce very high-quality (in fact, nearly mathematically optimal) machine code for modern architectures. The code generator is not intended for use in an ordinary compiler, but is intended to be used for inner loops and critical subroutines in those cases where peak performance is required, no available compiler generates adequately efficient code, and where current engineering practice is to use hand-coded machine language. The paper describes the design of the superoptimizer, and presents some encouraging preliminary results.
Chapter
Symbolic regression via genetic programming (hereafter, referred to simply as symbolic regression) has proven to be a very important tool for industrial empirical modeling (Kotanchek et al., 2003). Two of the primary problems with industrial use of symbolic regression are (1) the relatively large computational demands in comparison with other nonlinear empirical modeling techniques such as neural networks and (2) the difficulty in making the trade-off between expression accuracy and complexity. The latter issue is significant since, in general, we prefer parsimonious (simple) expressions with the expectation that they are more robust with respect to changes over time in the underlying system or extrapolation outside the range of the data used as the reference in evolving the symbolic regression. In this chapter, we present a genetic programming variant, ParetoGP, which exploits the Pareto front to dramatically speed the symbolic regression solution evolution as well as explicitly exploit the complexity-performance trade-off. In addition to the improvement in evolution efficiency, the Pareto front perspective allows the user to choose appropriate models for further analysis or deployment. The Pareto front avoids the need to a priori specify a trade-off between competing objectives (e.g. complexity and performance) by identifying the curve (or surface or hyper-surface) which characterizes, for example, the best performance for a given expression complexity.
Regression modeling strategies
  • Frank E Harrell
PySR: Fast & Parallelized Symbolic Regression in Python/Julia
  • Miles Cranmer
Contemporary Symbolic Regression Methods and their Relative Performance
  • William La Cava
  • Patryk Orzechowski
  • Bogdan Burlacu
  • Fabricio Olivetti De Franca
  • Marco Virgolin
  • Ying Jin
  • Michael Kommenda
  • Jason H Moore
Learning concise representations for regression by evolving networks of trees
  • William La Cava
  • Tilak Raj Singh
  • James Taggart
  • Srinivas Suri
  • Jason H Moore