Preprint

Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication

Abstract

Symbolic regression is a powerful system identification technique for industrial scenarios in which no prior knowledge of model structure is available. Such scenarios often demand specific model properties, such as interpretability, robustness, trustworthiness, and plausibility, that are not easily achieved with standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm designed specifically to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism that detects semantically equivalent solutions. The enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming on many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.
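To make the enumeration idea concrete, the following minimal Python sketch (not the authors' implementation) enumerates expressions from a tiny illustrative grammar up to a size limit and hashes outputs on probe points to cache and skip semantically equivalent solutions; the grammar, probe points, and rounding are assumptions, and the nonlinear least squares parameter fitting of the actual algorithm is omitted.

import itertools
import numpy as np

np.random.seed(0)
X = np.random.uniform(-3, 3, 64)              # probe points for semantic hashing

def semantic_hash(f, digits=6):
    # hash of rounded outputs: syntactically different but semantically
    # equivalent expressions collide and are enumerated only once
    with np.errstate(all="ignore"):
        y = f(X)
    return hash(np.round(y, digits).tobytes())

def enumerate_exprs(size):
    # grammar (illustrative): Expr -> x | sin(Expr) | (Expr+Expr) | (Expr*Expr)
    if size == 1:
        yield ("x", lambda x: x)
    if size >= 2:
        for s, g in enumerate_exprs(size - 1):
            yield (f"sin({s})", lambda x, g=g: np.sin(g(x)))
        for left in range(1, size - 1):
            for (sl, gl), (sr, gr) in itertools.product(
                    list(enumerate_exprs(left)),
                    list(enumerate_exprs(size - 1 - left))):
                yield (f"({sl}+{sr})", lambda x, gl=gl, gr=gr: gl(x) + gr(x))
                yield (f"({sl}*{sr})", lambda x, gl=gl, gr=gr: gl(x) * gr(x))

seen, unique = set(), []
for size in range(1, 7):
    for s, f in enumerate_exprs(size):
        h = semantic_hash(f)
        if h not in seen:                     # cache skips semantic duplicates
            seen.add(h)
            unique.append(s)
print(len(unique), "semantically distinct expressions")

Even in this toy setting, large numbers of syntactically distinct trees (x+x, (x+x)+x, ...) collapse onto far fewer semantic classes, which is what makes exhaustive search tractable.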


References

Chapter
As symbolic regression (SR) has advanced into the early stages of commercial exploitation, the poor accuracy of SR, still plaguing even the most advanced commercial packages, has become an issue for early adopters. Users expect to have the correct formula returned, especially in cases with zero noise and only one basis function with minimally complex grammar depth. At a minimum, users expect the response surface of the SR tool to be easily understood, so that the user knows a priori on which classes of problems to expect excellent, average, or poor accuracy. Poor or unknown accuracy is a hindrance to greater academic and industrial acceptance of SR tools. In a previous paper, we published a complex algorithm for modern symbolic regression which is extremely accurate for a large class of symbolic regression problems. The class of problems on which SR is extremely accurate was described in detail. This algorithm was extremely accurate, on a single processor, for up to 25 features (columns); a cloud configuration was used to extend the extreme accuracy to as many as 100 features. While the previous algorithm's extreme accuracy for deep problems with a small number of features (25-100) was an impressive advance, there are many very important academic and industrial SR problems requiring from 100 to 1000 features. In this chapter we extend the previous algorithm such that high accuracy is achieved on a wide range of problems, from 25 to 3000 features, using only a single processor. The class of problems on which the enhanced algorithm is highly accurate is described in detail. A definition of extreme accuracy is provided, and an informal argument for the enhanced algorithm's high accuracy is outlined in this chapter. The new enhanced algorithm is tested on a set of representative problems and shown to be robust, performing well even on testing data containing up to 3000 features.
Chapter
Although recent advances in symbolic regression (SR) have promoted the field into the early stages of commercial exploitation, the poor accuracy of SR is still plaguing even the most advanced commercial packages today. Users expect to have the correct formula returned, especially in cases with zero noise and only one basis function with minimally complex grammar depth. Poor accuracy is a hindrance to greater academic and industrial acceptance of SR tools. In a previous paper, the poor accuracy of symbolic regression was explored, several classes of test formulas which prove intractable for SR were examined, and an understanding of why these test problems prove intractable was developed. In another paper, a baseline symbolic regression algorithm was developed with specific techniques for optimizing embedded real number constants. These previous steps have placed us in a position to attempt to vanquish the SR accuracy problem. In this chapter we develop a complex algorithm for modern symbolic regression which is extremely accurate for a large class of symbolic regression problems. The class of problems on which SR is extremely accurate is described in detail. A definition of extreme accuracy is provided, and an informal argument for extreme SR accuracy is outlined in this chapter. Given the critical importance of accuracy in SR, it is our suspicion that in the future all commercial symbolic regression packages will use this algorithm or a substitute for it.
Article
We present the results of a community survey regarding genetic programming (GP) benchmark practices. Analysis shows broad consensus that improvement is needed in problem selection and experimental rigor. While views expressed in the survey dissuade us from proposing a large-scale benchmark suite, we find community support for creating a blacklist of "toy problems." We provide a set of alternative problems named GPBench2012 to replace the blacklisted ones, a discussion on improving experimental rigor, and a listing of challenging problems in the hope of improving GP research.
Conference Paper
We propose Locally Geometric Crossover (LGX) for genetic programming. For a pair of homologous loci in the parent solutions, LGX finds a semantically intermediate procedure from a previously prepared library and uses it as replacement code. Experiments involving six symbolic regression problems show a significant increase in search performance when compared to standard subtree-swapping crossover and other control methods. This suggests that semantically geometric manipulations of subprograms propagate to entire programs and improve their fitness.
Chapter
Today numerous variants of heuristic optimization algorithms are used to solve different kinds of optimization problems. This huge variety makes it very difficult to reuse already implemented algorithms or problems. In this paper the authors describe a generic, extensible, and paradigm-independent optimization environment that strongly abstracts the process of heuristic optimization. By providing a well-organized and strictly separated class structure, and by introducing a generic operator concept for the interaction between algorithms and problems, HeuristicLab makes it possible to reuse an algorithm implementation for attacking many different kinds of problems and vice versa. Consequently, HeuristicLab is very well suited for rapid prototyping of new algorithms and is also useful for educational support due to its state-of-the-art user interface, its self-explanatory API, and its use of modern programming concepts.
Chapter
Symbolic regression is a common application for genetic programming (GP). This paper presents a new non-evolutionary technique for symbolic regression that, compared to competent GP approaches on real-world problems, is orders of magnitude faster (taking just seconds), returns simpler models, has comparable or better prediction on unseen data, and converges reliably and deterministically. I dub the approach FFX, for Fast Function Extraction. FFX uses a recently developed machine learning technique, pathwise regularized learning, to rapidly prune a huge set of candidate basis functions down to compact models. FFX is verified on a broad set of real-world problems having 13 to 1468 input variables, outperforming GP as well as several state-of-the-art regression techniques. Keywords: technology, symbolic regression, genetic programming, pathwise regularization, real-world problems, machine learning, lasso, ridge regression, elastic net, integrated circuits
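As a rough illustration of the pathwise-regularization idea behind FFX (a sketch under assumptions, not the original tool), one can expand the inputs into a library of candidate basis functions and sweep an elastic-net regularization path, reading off progressively sparser linear models:

import numpy as np
from sklearn.linear_model import enet_path

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 4.0, size=(200, 2))
y = 2.5 * x[:, 0] ** 2 + np.log(x[:, 1])            # hidden target (illustrative)

# candidate basis library: x_i, x_i^2, log(x_i), and a pairwise product
bases, names = [], []
for i in range(x.shape[1]):
    for f, n in [(lambda v: v, "x%d"), (np.square, "x%d^2"), (np.log, "log(x%d)")]:
        bases.append(f(x[:, i])); names.append(n % i)
bases.append(x[:, 0] * x[:, 1]); names.append("x0*x1")
B = np.column_stack(bases)

# sweep the regularization path; stronger alpha -> sparser (simpler) models
alphas, coefs, _ = enet_path(B, y, l1_ratio=0.9, n_alphas=20)
for a, c in zip(alphas[::5], coefs.T[::5]):
    terms = [f"{w:+.2f}*{n}" for w, n in zip(c, names) if abs(w) > 1e-3]
    print(f"alpha={a:.3f}: {' '.join(terms) or '0'}")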
Article
We investigate the effects of semantically-based crossover operators in genetic programming, applied to real-valued symbolic regression problems. We propose two new relations derived from the semantic distance between subtrees, known as semantic equivalence and semantic similarity. These relations are used to guide variants of the crossover operator, resulting in two new crossover operators: semantics aware crossover (SAC) and semantic similarity-based crossover (SSC). SAC, which was introduced and studied previously, is included here for the purpose of comparison and analysis. SSC extends SAC by more closely controlling the semantic distance between subtrees to which crossover may be applied. The new operators were tested on several real-valued symbolic regression problems and compared with standard crossover (SC), context aware crossover (CAC), Soft Brood Selection (SBS), and No Same Mate (NSM) selection. The experimental results show that, on the problems examined and with computational effort measured by the number of function node evaluations, only SSC and SBS were significantly better than SC, and SSC was often better than SBS. Further experiments were conducted to analyse the performance sensitivity to the parameter settings of SSC. This analysis leads to the conclusion that SSC is more constructive and has higher locality than SAC, NSM, and SC; we believe these are the main reasons for the improved performance of SSC. Keywords: genetic programming, semantics, crossover, symbolic regression, locality
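A minimal sketch of the underlying notion of sampling-based subtree semantics; the probe points and threshold values are illustrative assumptions, not those of the paper:

import numpy as np

probes = np.linspace(-1, 1, 20)

def semantic_distance(f, g):
    # semantics of a subtree = its outputs on fixed probe points
    return float(np.mean(np.abs(f(probes) - g(probes))))

def equivalent(f, g, eps=1e-9):
    return semantic_distance(f, g) < eps

def similar(f, g, lower=1e-3, upper=0.4):   # SSC-style band (values assumed)
    return lower <= semantic_distance(f, g) <= upper

f = lambda x: x * x
g = lambda x: x * x + 0.01 * np.sin(x)
print(semantic_distance(f, g), equivalent(f, g), similar(f, g))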
Chapter
Trust is a major issue with deploying empirical models in the real world, since changes in the underlying system or use of the model in new regions of parameter space can produce (potentially dangerous) incorrect predictions. The trepidation involved with model usage can be mitigated by assembling ensembles of diverse models and using their consensus as a trust metric, since these models will be constrained to agree in the data region used for model development and constrained to disagree outside that region. The problem is to define an appropriate model complexity (since the ensemble should consist of models of similar complexity), as well as to identify diverse models from the candidate model set. In this chapter we discuss strategies for the development and selection of robust models and model ensembles and demonstrate those strategies on industrial data sets. An important benefit of this approach is that all available data may be used for model development rather than partitioned into training, test, and validation subsets. The result is that the constituent models are more accurate without risk of over-fitting, the ensemble predictions are more accurate, and the ensemble predictions come with a meaningful trust metric.
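A minimal sketch of the consensus idea, with hand-picked stand-in models: the ensemble members agree where the data constrained them and diverge elsewhere, so the prediction spread serves as an inverse trust metric.

import numpy as np

# illustrative models that fit equally well on data in [-1, 1]
models = [lambda x: x**2,
          lambda x: x**2 + 0.05 * x**3,
          lambda x: x**2 - 0.04 * x**4]

def predict_with_trust(x):
    preds = np.array([m(x) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)   # spread = inverse trust

for x in (0.5, 1.0, 3.0):                          # trust degrades outside [-1, 1]
    mean, spread = predict_with_trust(np.array([x]))
    print(f"x={x}: prediction={mean[0]:.2f}, disagreement={spread[0]:.2f}")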
Article
This paper presents a novel approach to generate data-driven regression models that not only give reliable prediction of the observed data but also have smoother response surfaces and extra generalization capabilities with respect to extrapolation. These models are obtained as solutions of a genetic programming (GP) process, where selection is guided by a tradeoff between two competing objectives: numerical accuracy and the order of nonlinearity. The latter is a novel complexity measure that adopts the notion of the minimal degree of the best-fit polynomial approximating an analytical function with a certain precision. Using nine regression problems, this paper presents and illustrates two different strategies for the use of the order of nonlinearity in symbolic regression via GP. The combination of optimization of the order of nonlinearity together with the numerical accuracy strongly outperforms "conventional" optimization of a size-related expressional complexity and the accuracy with respect to extrapolative capabilities of solutions on all nine test problems. In addition to exploiting the new complexity measure, this paper also introduces a novel heuristic of alternating several optimization objectives in a 2-D optimization framework. Alternating the objectives at each generation in such a way allows us to exploit the effectiveness of 2-D optimization when more than two objectives are of interest (in this paper: accuracy, expressional complexity, and the order of nonlinearity). Results of the experiments on all test problems suggest that alternating the order of nonlinearity of GP individuals with their structural complexity produces solutions that are both compact and have smoother response surfaces, and hence contributes to better interpretability and understanding.
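The order of nonlinearity can be sketched numerically as the smallest degree whose best-fit polynomial reproduces a model's response within a tolerance; the sample range, tolerance, and max-norm criterion below are illustrative assumptions.

import numpy as np

def order_of_nonlinearity(f, lo=-1.0, hi=1.0, tol=1e-3, max_deg=15):
    x = np.linspace(lo, hi, 200)
    y = f(x)
    scale = np.max(np.abs(y)) or 1.0
    for deg in range(max_deg + 1):
        coeffs = np.polyfit(x, y, deg)          # best-fit polynomial of this degree
        if np.max(np.abs(np.polyval(coeffs, x) - y)) / scale < tol:
            return deg                           # lowest adequate degree
    return max_deg + 1                           # more nonlinear than any tested degree

print(order_of_nonlinearity(lambda x: 3 * x**2 + x))   # -> 2
print(order_of_nonlinearity(np.exp))                   # -> a higher degree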
Conference Paper
A new digital signature based only on a conventional encryption function (such as DES) is described which is as secure as the underlying encryption function -- the security does not depend on the difficulty of factoring and the high computational costs of modular arithmetic are avoided. The signature system can sign an unlimited number of messages, and the signature size increases logarithmically as a function of the number of messages signed. Signature size in a ‘typical’ system might range from a few hundred bytes to a few kilobytes, and generation of a signature might require a few hundred to a few thousand computations of the underlying conventional encryption function.
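For illustration, here is a minimal Lamport-style one-time signature, the kind of building block that Merkle's construction extends to many messages; SHA-256 stands in for the underlying conventional one-way function (an assumption for this sketch, not the paper's construction).

import hashlib, os

H = lambda b: hashlib.sha256(b).digest()

def keygen(bits=256):
    # secret key: one random pair per message-digest bit; public key: their hashes
    sk = [(os.urandom(32), os.urandom(32)) for _ in range(bits)]
    pk = [(H(a), H(b)) for a, b in sk]
    return sk, pk

def bits_of(msg):
    d = H(msg)
    return [(d[i // 8] >> (i % 8)) & 1 for i in range(256)]

def sign(msg, sk):
    # reveal one secret of each pair, chosen by the corresponding digest bit
    return [pair[bit] for pair, bit in zip(sk, bits_of(msg))]

def verify(msg, sig, pk):
    return all(H(s) == pair[bit] for s, pair, bit in zip(sig, pk, bits_of(msg)))

sk, pk = keygen()
sig = sign(b"hello", sk)
print(verify(b"hello", sig, pk), verify(b"tampered", sig, pk))  # True False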
Article
This paper describes the use of genetic programming to automate the discovery of numerical approximation formulae. The authors present results involving rediscovery of known approximations for Harmonic numbers and discovery of rational polynomial approximations for functions of one or more variables, the latter of which are compared to Padé approximations obtained through a symbolic mathematics package. For functions of a single variable, it is shown that evolved solutions can be considered superior to Padé approximations, which represent a powerful technique from numerical analysis, given certain tradeoffs between approximation cost and accuracy, while for functions of more than one variable, we are able to evolve rational polynomial approximations where no Padé approximation can be computed. Further, it is shown that evolved approximations can be refined through the evolution of approximations to their error function. Based on these results, we consider genetic programming to be a powerful and effective technique for the automated discovery of numerical approximation formulae.
Conference Paper
Abstract Expression Grammars have the potential to integrate Genetic Algorithms, Genetic Programming, Swarm Intelligence, and Differential Evolution into a seamlessly unified array of tools for use in symbolic regression. The features of abstract expression grammars are explored, examples of implementations are provided, and the beneficial effects of abstract expression grammars are tested with several published nonlinear regression problems.
Article
Most evolutionary optimization models incorporate a fitness evaluation that is based on a predefined static set of test cases or problems. In the natural evolutionary process, selection is of course not based on a static fitness evaluation. Organisms do not have to combat every existing disease during their lifespan; organisms of one species may live in different or changing environments; different species coevolve. This leads to the question of how information is integrated over many generations. This study focuses on the effects of different fitness evaluation schemes on the types of genotypes and phenotypes that evolve. The evolutionary target is a simple numerical function. The genetic representation is in the form of a program (i.e., a functional representation, as in genetic programming). Many different programs can code for the same numerical function; in other words, there is a many-to-one mapping between "genotypes" (the programs) and "phenotypes". We compare fitness evaluation based on a large static set of problems and fitness evaluation based on small coevolving sets of problems. In the latter model very little information is presented to the evolving programs regarding the evolutionary target per evolutionary time step; in other words, the fitness evaluation is very sparse. Nevertheless the model produces correct solutions to the complete evolutionary target in about half of the simulations. The complete evaluation model, on the other hand, does not find correct solutions to the target in any of the simulations. More importantly, we find that sparsely evaluated programs generalize better than completely evaluated programs when they are evaluated on a much denser set of problems. In addition, the two evaluation schemes lead to programs that differ with respect to mutational stability; sparsely evaluated programs are less stable than completely evaluated programs.
Chapter
In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently from the target function. For our analysis, we use a restricted grammar for univariate symbolic regression models and generate all possible models up to a fixed length limit. We identify unique models and cluster them based on phenotypic as well as genotypic similarity. We find that phenotypic similarity leads to well-defined clusters, while genotypic similarity does not produce a clear clustering. By mapping solution candidates visited by GP to the enumerated search space, we find that GP initially explores the whole search space and later converges to the subspace of highest-quality expressions during a run on a simple benchmark problem.
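A small sketch of phenotypic deduplication of enumerated models: the phenotype is the output vector on a fixed input grid, so genotypically different trees with identical semantics fall into the same cluster (the grid and rounding are assumptions).

import numpy as np
from collections import defaultdict

grid = np.linspace(-2, 2, 50)
models = {
    "x+x":    lambda x: x + x,
    "2*x":    lambda x: 2 * x,            # different genotype, same phenotype
    "x*x":    lambda x: x * x,
    "sin(x)": lambda x: np.sin(x),
}

clusters = defaultdict(list)
for name, f in models.items():
    key = np.round(f(grid), 8).tobytes()  # phenotype signature
    clusters[key].append(name)

for members in clusters.values():
    print(members)                        # ['x+x', '2*x'] collapse into one cluster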
Article
Data-driven modeling plays an increasingly important role in different areas of engineering. For most existing methods, such as genetic programming (GP), the convergence speed can be too slow for large-scale problems with many variables, which has become the bottleneck of GP for practical applications. Fortunately, in many applications the target models are separable in some sense. In this paper, we analyze different types of separability of some real-world engineering equations and establish a mathematical model of the generalized separable system (GS system). To obtain the structure of the GS system, a multilevel block building (MBB) algorithm is proposed, in which the target model is decomposed into a number of blocks, and further into minimal blocks and factors. Compared to conventional GP, MBB can greatly reduce the search space, which makes it capable of modeling complex systems. The minimal blocks and factors are optimized and assembled with a global optimization search engine, low dimensional simplex evolution (LDSE). An extensive comparison between the proposed MBB and a state-of-the-art data-driven fitting tool, Eureqa, is presented on several synthetic problems as well as some real-world problems. Test results indicate that the proposed method is more effective and efficient in all the investigated cases.
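The kind of separability such methods exploit can be sketched with a simple numerical test: for an additively separable f(x1, x2) = g(x1) + h(x2), the mixed partial derivative vanishes, which a finite-difference check can detect (the step size and probe points below are assumptions).

import numpy as np

def mixed_partial(f, x1, x2, h=1e-4):
    # central finite-difference estimate of d2f / dx1 dx2
    return (f(x1 + h, x2 + h) - f(x1 + h, x2 - h)
            - f(x1 - h, x2 + h) + f(x1 - h, x2 - h)) / (4 * h * h)

rng = np.random.default_rng(0)
pts = rng.uniform(-2, 2, size=(100, 2))

for name, f in [("x1^2 + sin(x2)", lambda a, b: a**2 + np.sin(b)),   # separable
                ("x1 * x2",        lambda a, b: a * b)]:             # not separable
    worst = max(abs(mixed_partial(f, a, b)) for a, b in pts)
    print(f"{name}: max mixed partial ~ {worst:.4f}")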
Chapter
In this chapter we review a number of real-world applications where symbolic regression was used recently and with great success. Industrial-scale symbolic regression, armed with the power to select the right variables and variable combinations, build robust and trustworthy predictions, and guide experimentation, has undoubtedly earned its place in industrial process optimization, business forecasting, product design, and now complex systems modeling and policy making.
Chapter
From a real-world perspective, "good enough" has been achieved in the core representations and evolutionary strategies of genetic programming, assuming state-of-the-art algorithms and implementations are being used. What is needed for industrial symbolic regression are tools to (a) explore and refine the data, (b) explore the developed model space and extract insight and guidance from the available sample of the infinite possibilities of model forms, and (c) identify appropriate models for deployment as predictors, emulators, etc. This chapter focuses on the approaches used in DataModeler to address the modeling life cycle. A special focus in this chapter is the identification of driving variables and metavariables. Exploiting the diversity of search paths followed during independent evolutions, and then looking at the distributions of variable and metavariable usage, also provides an opportunity to gather key insights. The goal in this framework, however, is not to replace the modeler but, rather, to augment the inclusion of context and collection of insight by removing mechanistic requirements and facilitating the ability to think. We believe that the net result is higher quality and more robust models.
Conference Paper
In this publication a constant optimization approach for symbolic regression is introduced to separate the task of finding the correct model structure from the necessity to evolve the correct numerical constants. A gradient-based nonlinear least squares optimization algorithm, the Levenberg-Marquardt (LM) algorithm, is used for adjusting constant values in symbolic expression trees during their evolution. The LM algorithm depends on gradient information consisting of the partial derivatives of the trees, which are obtained by automatic differentiation. The presented constant optimization approach is tested on several benchmark problems and compared to a standard genetic programming algorithm to show its effectiveness. Although the constant optimization involves an overhead in execution time, the achieved accuracy increases significantly, as does the ability of genetic programming to learn from the provided data. As an example, the Pagie-1 problem could be solved in 37 out of 50 test runs, whereas without constant optimization it was solved in only 10 runs. Furthermore, different configurations of the constant optimization approach (number of iterations, probability of applying constant optimization) are evaluated and their impact is detailed in the results section.
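A hedged sketch of the same idea using SciPy's Levenberg-Marquardt least-squares routine in place of the paper's implementation: the model structure is held fixed and only the numeric constants are optimized (the expression skeleton and data are illustrative).

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(42)
x = np.linspace(0, 4, 100)
y = 1.5 * np.exp(0.5 * x) + 0.3 + rng.normal(0, 0.01, x.size)

def residuals(c):
    # fixed structure c0*exp(c1*x) + c2; only the constants c are adjusted
    return c[0] * np.exp(c[1] * x) + c[2] - y

fit = least_squares(residuals, x0=np.array([1.0, 0.3, 0.0]), method="lm")
print(np.round(fit.x, 3))      # approximately [1.5, 0.5, 0.3]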
Conference Paper
We introduce Prioritized Grammar Enumeration (PGE), a deterministic Symbolic Regression (SR) algorithm using dynamic programming techniques. PGE maintains the tree-based representation and Pareto non-dominated sorting from Genetic Programming (GP), but replaces genetic operators and random number use with grammar production rules and systematic choices. PGE uses non-linear regression and abstract parameters to fit the coefficients of an equation, effectively separating the exploration for form, from the optimization of a form. Memoization enables PGE to evaluate each point of the search space only once, and a Pareto Priority Queue provides direction to the search. Sorting and simplification algorithms are used to transform candidate expressions into a canonical form, reducing the size of the search space. Our results show that PGE performs well on 22 benchmarks from the SR literature, returning exact formulas in many cases. As a deterministic algorithm, PGE offers reliability and reproducibility of results, a key aspect to any system used by scientists at large. We believe PGE is a capable SR implementation, following an alternative perspective we hope leads the community to new ideas.
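A minimal sketch of PGE-style search control (the grammar, canonicalization, and scoring below are illustrative stand-ins, not the paper's): a priority queue orders candidate forms by a score combining error and parsimony, and memoization of canonical strings ensures each form is expanded at most once.

import heapq
from itertools import count
import numpy as np

x = np.linspace(-1, 1, 50)
target = x**2 + x

def canonical(expr):                 # stand-in for PGE's sort/simplify rules
    return expr.replace(" ", "")

def error(f):
    return float(np.mean((f(x) - target) ** 2))

def expand(s, f):                    # grammar productions (illustrative subset)
    return [(f"({s}+x)", lambda v, f=f: f(v) + v),
            (f"({s}*x)", lambda v, f=f: f(v) * v)]

tie = count()                        # tiebreaker so heap never compares lambdas
pq = [(error(lambda v: v), next(tie), "x", lambda v: v)]
visited = set()
while pq:
    score, _, s, f = heapq.heappop(pq)
    if canonical(s) in visited:      # memoization: each form expanded once
        continue
    visited.add(canonical(s))
    if error(f) < 1e-12:
        print("found:", s)
        break
    for s2, f2 in expand(s, f):
        # priority = accuracy + parsimony pressure (a Pareto surrogate, assumed)
        heapq.heappush(pq, (error(f2) + 0.01 * len(s2), next(tie), s2, f2))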
Article
We introduce an estimation of distribution algorithm that co-evolves fitness predictors in order to reduce the computational cost of evolution. Fitness predictors are light objects which, given an evolving individual, heuristically approximate its true fitness. The predictors are trained by their ability to correctly differentiate between good and bad solutions using reduced computation. We apply co-evolution of fitness predictors to symbolic regression and measure its impact. Our results show that a small computational investment in co-evolving fitness predictors greatly enhances both speed and convergence of individual solutions while reducing the computational effort overall. Finally we apply fitness prediction to interactive evolution of pen stroke drawings. These results show that fitness prediction is extremely effective at modeling user preference while minimizing the sampling on the user to fewer than ten prompts.
Chapter
Symbolic regression via genetic programming (hereafter, referred to simply as symbolic regression) has proven to be a very important tool for industrial empirical modeling (Kotanchek et al., 2003). Two of the primary problems with industrial use of symbolic regression are (1) the relatively large computational demands in comparison with other nonlinear empirical modeling techniques such as neural networks and (2) the difficulty in making the trade-off between expression accuracy and complexity. The latter issue is significant since, in general, we prefer parsimonious (simple) expressions with the expectation that they are more robust with respect to changes over time in the underlying system or extrapolation outside the range of the data used as the reference in evolving the symbolic regression. In this chapter, we present a genetic programming variant, ParetoGP, which exploits the Pareto front to dramatically speed the symbolic regression solution evolution as well as explicitly exploit the complexity-performance trade-off. In addition to the improvement in evolution efficiency, the Pareto front perspective allows the user to choose appropriate models for further analysis or deployment. The Pareto front avoids the need to a priori specify a trade-off between competing objectives (e.g. complexity and performance) by identifying the curve (or surface or hyper-surface) which characterizes, for example, the best performance for a given expression complexity.
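A small sketch of the complexity/accuracy Pareto-front extraction that this model-selection perspective relies on, with an illustrative population of (complexity, error) pairs; both objectives are minimized.

def pareto_front(models):
    """models: list of (complexity, error, name); both objectives minimized."""
    front = []
    for c, e, name in sorted(models):            # by complexity, then error
        if not front or e < front[-1][1]:        # strictly better accuracy
            front.append((c, e, name))
    return front

population = [(3, 0.90, "x"), (5, 0.40, "x^2"), (5, 0.55, "x+c"),
              (9, 0.41, "x^2+c"), (11, 0.05, "x^2+x"), (15, 0.05, "bloated")]
for c, e, name in pareto_front(population):
    print(f"complexity={c:2d}  error={e:.2f}  {name}")

Reading models off this front lets the user trade accuracy against parsimony after the run, instead of fixing the trade-off a priori.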
Article
Probabilistic incremental program evolution (PIPE) is a novel technique for automatic program synthesis. We combine probability vector coding of program instructions, population-based incremental learning, and tree-coded programs like those used in some variants of genetic programming (GP). PIPE iteratively generates successive populations of functional programs according to an adaptive probability distribution over all possible programs. Each iteration, it uses the best program to refine the distribution. Thus, it stochastically generates better and better programs. Since distribution refinements depend only on the best program of the current population, PIPE can evaluate program populations efficiently when the goal is to discover a program with minimal runtime. We compare PIPE to GP on a function regression problem and the 6-bit parity problem. We also use PIPE to solve tasks in partially observable mazes, where the best programs have minimal runtime.
Article
Although the problem of determining the minimum cost path through a graph arises naturally in a number of interesting applications, there has been no underlying theory to guide the development of efficient search procedures. Moreover, there is no adequate conceptual framework within which the various ad hoc search strategies proposed to date can be compared. This paper describes how heuristic information from the problem domain can be incorporated into a formal mathematical theory of graph searching and demonstrates an optimality property of a class of search strategies.
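The classic result of this line of work is the A* algorithm: best-first search on f(n) = g(n) + h(n) returns a minimum-cost path whenever the heuristic h never overestimates the true remaining cost. A minimal sketch on an illustrative graph:

import heapq

graph = {"A": [("B", 1), ("C", 4)],
         "B": [("C", 1), ("D", 5)],
         "C": [("D", 1)],
         "D": []}
h = {"A": 2, "B": 2, "C": 1, "D": 0}      # admissible: never overestimates

def a_star(start, goal):
    pq = [(h[start], 0, start, [start])]  # (f = g + h, g, node, path)
    best_g = {start: 0}
    while pq:
        f, g, node, path = heapq.heappop(pq)
        if node == goal:
            return g, path
        for nxt, w in graph[node]:
            if g + w < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + w
                heapq.heappush(pq, (g + w + h[nxt], g + w, nxt, path + [nxt]))
    return None

print(a_star("A", "D"))                   # (3, ['A', 'B', 'C', 'D'])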
Angeline, P.J., Pollack, J.: Evolutionary module acquisition. In: Proceedings of the Second Annual Conference on Evolutionary Programming, pp. 154-163. La Jolla, CA, USA (1993)
Burlacu, B., Kammerer, L., Affenzeller, M., Kronberger, G.: Hash-based Tree Similarity and Simplification in Genetic Programming for Symbolic Regression. In: Computer Aided Systems Theory, EUROCAST 2019 (2019)
Keijzer, M.: Improving symbolic regression with interval arithmetic and linear scaling. In: Genetic Programming, Proceedings of EuroGP'2003, LNCS, vol. 2610, pp. 70-82. Springer-Verlag, Essex (2003)
Keijzer, M., Babovic, V.: Genetic programming, ensemble methods and the bias/variance tradeoff - introductory investigations. In: Genetic Programming, Proceedings of EuroGP'2000, LNCS, vol. 1802, pp. 76-90. Springer-Verlag, Edinburgh (2000)
Keijzer, M., Ryan, C., Murphy, G., Cattolico, M.: Undirected training of run transferable libraries. In: Proceedings of the 8th European Conference on Genetic Programming, Lecture Notes in Computer Science, vol. 3447, pp. 361-370. Springer, Lausanne, Switzerland (2005)
Korns, M.F.: Abstract expression grammar symbolic regression. In: Genetic Programming Theory and Practice VIII, Genetic and Evolutionary Computation, vol. 8, chap. 7, pp. 109-128. Springer, Ann Arbor, USA (2010)
Korns, M.F.: Highly accurate symbolic regression with noisy training data. In: Genetic Programming Theory and Practice XIII, Genetic and Evolutionary Computation, pp. 91-115. Springer, Ann Arbor, USA (2015)
Krawiec, K., Swan, J., O'Reilly, U.M.: Behavioral program synthesis: Insights and prospects. In: Genetic Programming Theory and Practice XIII, Genetic and Evolutionary Computation, pp. 169-183. Springer, Ann Arbor, USA (2015)
Poli, R.: A simple but theoretically-motivated method to control bloat in genetic programming. In: Genetic Programming, Proceedings of EuroGP'2003, LNCS, vol. 2610, pp. 204-217. Springer-Verlag, Essex (2003)
Schmidt, M., Lipson, H.: Symbolic regression of implicit equations. In: Genetic Programming Theory and Practice VII, Genetic and Evolutionary Computation, chap. 5, pp. 73-85. Springer, Ann Arbor (2009)
Schmidt, M., Lipson, H.: Age-fitness pareto optimization. In: Genetic Programming Theory and Practice VIII, Genetic and Evolutionary Computation, vol. 8, chap. 8, pp. 129-146. Springer, Ann Arbor, USA (2010)
Topchy, A., Punch, W.F.: Faster genetic programming based on local gradient search of numeric leaf values. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pp. 155-162. Morgan Kaufmann, San Francisco, California, USA (2001)