Chapter

Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication


Abstract

Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge of model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness, and plausibility that are not easily achieved using standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions. The enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming on many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.
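
To make the core idea concrete, here is a minimal sketch of grammar-driven enumeration with semantic deduplication, assuming a toy grammar (Expr → x | Expr + x | Expr − x) and a hashing-by-outputs cache; both are illustrative stand-ins for the chapter's actual grammar and caching mechanism.

```python
import numpy as np

def semantic_hash(f, xs, decimals=6):
    # Hash a model by its rounded outputs on a fixed probe sample, so that
    # algebraically equivalent structures (e.g. ((x-x)+x) vs x) collide.
    return hash(tuple(np.round(f(xs), decimals)))

def enumerate_models(max_size):
    # Breadth-first enumeration of expressions up to max_size nodes,
    # pruning semantic duplicates via the hash cache.
    xs = np.linspace(1.0, 2.0, 16)                    # probe inputs
    seen, frontier = set(), [("x", lambda x: x, 1)]
    while frontier:
        expr, f, size = frontier.pop(0)
        h = semantic_hash(f, xs)
        if h in seen:                                 # semantic duplicate
            continue
        seen.add(h)
        yield expr
        if size + 2 <= max_size:                      # apply production rules
            frontier.append((f"({expr}+x)", lambda x, f=f: f(x) + x, size + 2))
            frontier.append((f"({expr}-x)", lambda x, f=f: f(x) - x, size + 2))

# -> ['x', '(x+x)', '(x-x)', '((x+x)+x)', '((x-x)-x)']; e.g. ((x+x)-x)
# is pruned because it is semantically identical to x.
print(list(enumerate_models(5)))
```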

... Individuals with high scores have a higher probability of being selected for the next iteration of crossover, mutation, and reproduction. Ideas for reducing the search space of genetic algorithms have been proposed in [10,11]. Nicolau and McDermott [13] use prior information about the values of the dependent variable. ...
... The quality of solutions is not stable, however, because genetic algorithms are stochastic: they can generate different solutions for the same input data and the same settings. Kammerer et al. [10] remark that "it might produce highly dissimilar solutions even for the same input data." ...
... (k1, k2) ← (k2, k2 + 2)  // Change the neighbor set. if k2 > k_max then ...
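
The fragment above appears to be the neighborhood-change step of a variable-neighborhood-search-style loop in the citing paper; purely as a hedged reading of what it computes, a sketch of such a step might look like this (the restart values after exceeding k_max are assumed):

```python
def change_neighborhood(k1, k2, k_max):
    # Advance the pair of neighborhood sizes as in the fragment: (k1,k2) <- (k2,k2+2).
    k1, k2 = k2, k2 + 2
    if k2 > k_max:        # assumed restart once the largest neighborhood is passed
        k1, k2 = 1, 3
    return k1, k2
```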
Preprint
Full-text available
In this paper we consider the problem of learning a regression function without assuming its functional form. This problem is referred to as symbolic regression. An expression tree is typically used to represent a solution function, which is determined by assigning operators and operands to the nodes. The symbolic regression problem can be formulated as a nonconvex mixed-integer nonlinear program (MINLP), where binary variables are used to assign operators and nonlinear expressions are used to propagate data values through nonlinear operators such as square, square root, and exponential. We extend this formulation by adding new cuts that improve the solution of this challenging MINLP. We also propose a heuristic that iteratively builds an expression tree by solving a restricted MINLP. We perform computational experiments and compare our approach with a mixed-integer program-based method and a neural-network-based method from the literature.
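
As an illustration of the encoding described in this abstract (a simplified sketch, not the paper's MINLP), the snippet below shows how one integer choice per internal node selects an operator and how data values propagate upward through the tree; in the MINLP these choices are binary decision variables handled by a solver.

```python
import numpy as np

# Candidate operators that the assignment variables choose among.
OPS = {0: np.add, 1: np.multiply, 2: np.subtract}

def eval_tree(assign, leaves):
    # Depth-2 full binary tree: node 0 is the root with children 1 and 2;
    # assign[k] picks the operator at node k, leaves hold data columns.
    left = OPS[assign[1]](leaves[0], leaves[1])
    right = OPS[assign[2]](leaves[2], leaves[3])
    return OPS[assign[0]](left, right)

x = np.random.rand(10, 4)
yhat = eval_tree({0: 0, 1: 1, 2: 2}, [x[:, j] for j in range(4)])  # (x0*x1) + (x2-x3)
```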
... In both variants ADFs inherit the same disadvantages and problems of GP, such as bloat, redundancy, and lack of phenotypic robustness (small changes in the tree leading to big changes in model behaviour), and they introduce additional complexity (in particular when evaluating trees with ADFs). Fast Function Extraction (FFX) algorithms [5,6] do not suffer from these disadvantages, because they are deterministic and provide simpler models with better generalization ability. In this paper we introduce an approach that uses the deterministic algorithm [6] to generate symbolic regression formulas and to use them as building blocks for GP. ...
... The main scheme of the algorithm and a description of its main steps are presented below. Initially (step 1), a set of symbolic regression models is created using a deterministic approach such as [6], to be subsequently used as leaves in GP trees. Afterward (step 2), the model set is simplified by deleting duplicate models according to a given distance measure. ...
Article
Full-text available
The main idea of this paper is to add model-set pre-processing to Genetic Programming based Evolvement of Models of Models. Simple symbolic formulas generated offline with the help of the deterministic function extraction algorithm are used as building blocks for Genetic Programming. In this work, pre-processing of the model set is performed by means of clustering. All distances between models are calculated offline using a special metric. The effectiveness of this approach was evaluated on benchmarks as well as on real-world problems.
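
A minimal sketch of the duplicate-deletion step, assuming models are compared by the distance between their output vectors on a shared sample (a hypothetical choice; the paper's special metric may differ):

```python
import numpy as np

def dedup_models(models, xs, tol=1e-3):
    # Keep one representative per group of near-identical models, measured
    # by RMSE between their output vectors on the probe sample xs.
    kept, outputs = [], []
    for m in models:
        y = m(xs)
        if all(np.sqrt(np.mean((y - yk) ** 2)) > tol for yk in outputs):
            kept.append(m)
            outputs.append(y)
    return kept

xs = np.linspace(-1, 1, 50)
models = [lambda x: 2 * x, lambda x: x + x, lambda x: x ** 2]
print(len(dedup_models(models, xs)))   # -> 2: the first two are semantically equal
```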
... Furthermore, these systems tend to overfit given large and noisy data [41], which is the case for typical empirical results in physics. Two main methods to overcome the computational expense are [42,43], where a brute-force approach is applied on a reduced search space rather than an incomplete search of the entire search space. In both methods, the search space is reduced by removing algebraically equivalent expressions, either through the recursive application of the grammar production rules [42] or by preventing semantic duplicates using grammar restrictions and semantic hashing [43]. ...
Article
Full-text available
Discovering a meaningful symbolic expression that explains experimental data is a fundamental challenge in many scientific fields. We present a novel, open-source computational framework called Scientist-Machine Equation Detector (SciMED), which integrates scientific discipline wisdom in a scientist-in-the-loop approach with state-of-the-art symbolic regression (SR) methods. SciMED combines a wrapper selection method, based on a genetic algorithm, with automatic machine learning and two levels of SR methods. We test SciMED on five configurations of a settling sphere, with and without aerodynamic non-linear drag force, and with excessive noise in the measurements. We show that SciMED is sufficiently robust to discover the correct physically meaningful symbolic expressions from the data, and demonstrate how the integration of domain knowledge enhances its performance. Our results indicate better performance on these tasks than the state-of-the-art SR software packages, even in cases where no knowledge is integrated. Moreover, we demonstrate how SciMED can alert the user about possible missing features, unlike the majority of current SR systems.
... The motivation behind employing genetic algorithms lies in their ability to efficiently explore large solution spaces, enabling the discovery of diverse and effective solutions that may be elusive through traditional optimization methods [21,22,23]. In particular, GAs have been adapted to the realm of ML to find a well-performing ML pipeline [24], solve a symbolic regression task [25,26,27], or even serve as part of an ML ensemble-based model [28,29]. ...
Preprint
Full-text available
Data-driven models, in general, and machine learning (ML) models, in particular, have gained popularity over recent years with the increased usage of such models across the scientific and engineering domains. When using ML models in realistic and dynamic environments, users often need to handle the challenge of concept drift (CD). In this study, we explore the application of genetic algorithms (GAs) to address the challenges posed by CD in such settings. We propose a novel two-level ensemble ML model, which combines a global ML model with a CD detector, operating as an aggregator for a population of ML pipeline models, each with its own adjusted CD detector responsible for re-training its ML model. In addition, we show that one can further improve the proposed model by utilizing off-the-shelf automatic ML methods. Through extensive synthetic dataset analysis, we show that the proposed model outperforms a single ML pipeline with a CD algorithm, particularly in scenarios with unknown CD characteristics. Overall, this study highlights the potential of ensemble ML and CD models obtained through a heuristic and adaptive optimization process such as the GA to handle complex CD events.
... Random Search, i.e., the brute-force approach, has gained more attention recently [19] as it can rival GP if used with a restrictive grammar [25] or sophisticated search strategy [47]. Our random-search approach is as follows: ...
Preprint
Full-text available
We develop faultless, fixed-depth, string-based, prefix and postfix symbolic regression grammars, capable of producing any expression from a set of operands, unary operators and/or binary operators. Using these grammars, we outline simplified forms of 5 popular heuristic search strategies: Brute Force Search, Monte Carlo Tree Search, Particle Swarm Optimization, Genetic Programming, and Simulated Annealing. For each algorithm, we compare the relative performance of prefix vs postfix for ten ground-truth expressions implemented entirely within a common C++/Eigen framework. Our experiments show a comparatively strong correlation between the average number of nodes per layer of the ground truth expression tree and the relative performance of prefix vs postfix. The fixed-depth grammars developed herein can enhance scientific discovery by increasing the efficiency of symbolic regression, enabling faster identification of accurate mathematical models across various disciplines.
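
As a hedged sketch of what sampling from such a fixed-depth prefix grammar can look like (in Python rather than the authors' C++/Eigen framework; the operator sets and branching probabilities are arbitrary choices):

```python
import random, math

BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
UNARY = {"sin": math.sin, "cos": math.cos}

def random_prefix(depth):
    # At depth 0 only operands may be emitted, which is what keeps every
    # generated string a valid ("faultless") expression.
    if depth == 0 or random.random() < 0.3:
        return ["x"]
    if random.random() < 0.5:
        return [random.choice(list(UNARY))] + random_prefix(depth - 1)
    op = random.choice(list(BINARY))
    return [op] + random_prefix(depth - 1) + random_prefix(depth - 1)

def eval_prefix(tokens, x):
    # Recursive-descent evaluation of a prefix token stream (consumes tokens).
    tok = tokens.pop(0)
    if tok == "x":
        return x
    if tok in UNARY:
        return UNARY[tok](eval_prefix(tokens, x))
    a = eval_prefix(tokens, x)
    return BINARY[tok](a, eval_prefix(tokens, x))

expr = random_prefix(3)
print(expr, eval_prefix(list(expr), 0.5))
```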
... A Convolutional Neural Network (CNN) is a deep learning algorithm, used here to distinguish whether an input image is male or female. Genetic Algorithms are optimization techniques which are not guaranteed to produce better results than the search methodology they rely on, in this case an exhaustive search, but commonly outperform naive search methods [10]. Heuristics employ assumptions about the structure of the underlying data to attempt to shortcut the underlying search methods they are based on. ...
Article
Full-text available
Human gender plays an imperative role as a social construct and an essential part of an individual's personality. Gender recognition is a fundamental task for human beings. It is highly reflected in social communication, forensic science, surveillance, and target marketing. Gender recognition previously depended only on standard face images. The term "standard" means that the image was taken in standard light without any background variations and without any cropped parts. But this type of image is not found in the real world. Such images are called non-standard images, as they have many variations, like illumination and head pose. The image may also contain several faces, some of which may wear sunglasses or other accessories. Using this type of image affects the accuracy of gender recognition approaches. Nowadays, selfie images are common, and they are unconstrained: people take selfie images of themselves, and these images are very complex, as some parts may be cropped or damaged. This paper proposes a CNNGA technique for gender recognition from selfie images by merging a deep learning approach with genetic algorithms. The proposed technique achieves 90.2% accuracy in recognizing gender from the selfie dataset. The experiments use various challenging datasets which are widely adopted in the scientific community, like LFW, Data Hub, FERET, and Caltech-web Faces.
... Quite recently, it has been proven that SR is an NP-hard problem, since it is not always possible to find the best-fitting mathematical expression for a given data set in polynomial time [39]. Even though SR can be addressed by other algorithms (such as Monte Carlo tree search [5,34], enumeration algorithms [13], greedy algorithms [7], and mixed-integer nonlinear programming [6]), Genetic Programming (GP) remains a popular choice. ...
... Some methods have extended GP with extra steps to reduce the size of the search space by removing algebraically isomorphic expressions and limiting the complexity of expressions. Exhaustive Symbolic Regression [2] and Grammar Enumeration [17] rely on a heuristics-guided exhaustive search of the function space to find all possible structures, followed by evaluation on a specified cost function to find the best fit. ...
Preprint
Full-text available
Symbolic regression is a machine learning method with the goal of producing interpretable results. Unlike other machine learning methods, such as random forests or neural networks, which are opaque, symbolic regression aims to model and map data in a way that can be understood by scientists. Recent advancements have attempted to bridge the gap between these two fields; new methodologies attempt to fuse the mapping power of neural networks and deep learning techniques with the explanatory power of symbolic regression. In this paper, we examine these new emerging systems and test the performance of an end-to-end transformer model for symbolic regression against the reigning traditional methods based on genetic programming that have spearheaded symbolic regression throughout the years. We compare these systems on novel datasets to avoid bias toward older methods that were improved on well-known benchmark datasets. Our results show that traditional GP methods, as implemented e.g. by Operon, still remain superior to two recently published symbolic regression methods.
... The greater the representativeness and diversity of the data, the higher the likelihood of capturing the underlying patterns and relationships. Inadequate or noisy data can impede the discovery process and result in inaccurate or incomplete outcomes [77][78][79]. ...
... Finally, the study did not account for other factors that may influence family satisfaction, such as income, education, and cultural background. In a complementary manner, future work could include using non-linear but explainable data-driven methods, such as symbolic regression [40,41,42], to better understand the connection found in this work. ...
Preprint
Full-text available
Individuals' satisfaction with their nuclear and extended family plays a critical role in their everyday lives. Thus, a better understanding of the features that determine one's satisfaction with one's family can open the door to the design of better sociological policies. To this end, this study examines the relationship between the family tree graph and family members' satisfaction with their nuclear and extended family. We collected data from 486 families, which included a family tree graph and family members' satisfaction with each other. We obtain a model that is able to explain 75% of the family members' satisfaction with one another. We found three indicators of more satisfied families. First, larger families, on average, have more satisfied members. Moreover, families with kids from the same parents (i.e., without step-siblings) also express more satisfaction with both their siblings and parents when the children are already adults. Lastly, the average satisfaction of the family's oldest living generation has a positive linear and non-linear correlation with the satisfaction of the entire extended family.
... Symbolic regression (SR) is a method for building models from data without a presumption of the structure of the model [1][2][3][4][5][6][7][8][9][10][11]. Techniques such as genetic programming (GP) [12][13][14][15][16][17] are suitable for solving SR problems since they can learn the structure of the model as expression trees that evolve during the execution of the algorithm. ...
Article
Full-text available
This article proposes an algorithm to construct multi-output symbolic regression models in a single execution. The algorithm extends the single-output Kaizen programming (KP) to a multi-output KP. KP is a hybrid evolutionary algorithm used to solve symbolic regression problems, without making any prior assumptions on the structure of the models. The extension to multi-output KP is made through an island model (MOKPIM). The idea behind MOKPIM is to find common terms among the outputs by balancing the solution obtained by each island working independently on a different output, with their cooperation due to the periodic exchange of migrants. In a previous effort, we followed a different approach for extending KP to multi-output scenarios based on using a multi-output linear regression in the linear regression step of the algorithm (MOKPMLR). A comparative analysis of the performance of MOKPIM with the classical single-output KP, our previous multi-output approach for KP, a multi-output Gaussian Process, and a multi-output decision tree regressor was conducted. The evaluation of algorithms used four different schemes of term sharing; five classical benchmark functions and a chemical process case study were considered for each scheme. The numerical results show that MOKPIM is the best-performing algorithm regarding both the independent analysis of each output and the global analysis of all the outputs together. The proposed algorithm MOKPIM outperformed the other multi-output symbolic regression methods tested in this work. It also obtained competitive results with state-of-the-art methods when the outputs were considered independently.
... In Bayesian econometrics, this is done by deriving the posterior distribution from a prior distribution, and in genetic programming, the algorithm modifies the set of potentially optimal solutions. Unfortunately, both methods, despite their advantages, suffer from computational issues in many real-life applications [26][27][28][29]. For example, [30] and [31] provided extensive reviews about these issues. ...
Article
Full-text available
In this study, the crude oil spot price is forecast using Bayesian symbolic regression (BSR). In particular, the initial parameters specification of BSR is analysed. Contrary to the conventional approach to symbolic regression, which is based on genetic programming methods, BSR applies Bayesian algorithms to evolve the set of expressions (functions). This econometric method is able to deal with variable uncertainty (feature selection) issues in oil price forecasting. Secondly, this research seems to be the first application of BSR to oil price forecasting. Monthly data between January 1986 and April 2021 are analysed. As well as BSR, several other methods (also able to deal with variable uncertainty) are used as benchmark models, such as LASSO and ridge regressions, dynamic model averaging, and Bayesian model averaging. The more common ARIMA and naïve methods are also used, together with several time-varying parameter regressions. As a result, this research not only presents a novel and original application of the BSR method but also provides a concise and uniform comparison of the application of several popular forecasting methods for the crude oil spot price. Robustness checks are also performed to strengthen the obtained conclusions. It is found that the suitable selection of functions and operators for BSR initialization is an important, but not trivial, task. Unfortunately, BSR does not result in forecasts that are statistically significantly more accurate than the benchmark models. However, BSR is computationally faster than the genetic programming-based symbolic regression.
... However, disadvantages of GP are its stochasticity, its long algorithm runtime for nontrivial problems and its complex hyperparameter settings. These characteristics led to the development of non-evolutionary algorithms which produce more restricted models but have advantages in determinism, runtime or complexity of hyperparameters [3,6,9]. ...
Preprint
Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squares optimization using a variable projection algorithm. Both FFX and our new algorithm are applied to the PennML benchmark suite. We show that the proposed extensions of FFX lead to higher accuracy while providing models of similar length and with only a small increase in runtime on the given data. Our results are compared to a large set of regression methods that were already published for the given benchmark suite.
... To demonstrate the calculation of likelihood profiles and prediction intervals, we use three SR implementations: HeuristicLab (HL), Transformation-Interaction-Rational (TIR), and Deterministic Symbolic Regression using Grammar Enumeration (DSR), each of which implements a different algorithm for SR. HeuristicLab uses tree-based GP very similar to Koza-style GP [24], TIR uses a restricted model structure and an evolutionary algorithm in combination with ordinary least squares [22], and DSR uses a tree search algorithm with a formal grammar that restricts model structures [25]. Coefficients are optimized using the Levenberg-Marquardt algorithm. ...
Preprint
Full-text available
Symbolic regression is a nonlinear regression method which is commonly performed by an evolutionary computation method such as genetic programming. Quantification of the uncertainty of regression models is important for the interpretation of models and for decision making. The linear approximation and so-called likelihood profiles are well-known possibilities for the calculation of confidence and prediction intervals for nonlinear regression models. These simple and effective techniques have been completely ignored so far in the genetic programming literature. In this work we describe the calculation of likelihood profiles in detail and also provide some illustrative examples with models created with three different symbolic regression algorithms on two different datasets. The examples highlight the importance of likelihood profiles for understanding the limitations of symbolic regression models and for helping the user take an informed post-prediction decision.
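
As a generic illustration of the linear-approximation intervals mentioned above (a sketch, not the paper's code), scipy's curve_fit already returns the approximate parameter covariance from which symmetric confidence intervals follow:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t

def model(x, a, b):
    return a * np.exp(b * x)

x = np.linspace(0, 1, 30)
y = model(x, 2.0, 1.5) + 0.05 * np.random.randn(30)

theta, cov = curve_fit(model, x, y, p0=(1.0, 1.0))  # cov from the Jacobian at the optimum
se = np.sqrt(np.diag(cov))
q = t.ppf(0.975, df=len(x) - len(theta))
for name, th, s in zip("ab", theta, se):
    print(f"{name} = {th:.3f} +/- {q * s:.3f}")      # 95% linear-approximation CI
```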
... SR has been addressed with many other types of algorithms than genetic ones, often in order to obtain deterministic behavior. Worm and Chiu (2013) and Kammerer et al. (2020) proposed enumeration algorithms which make SR tractable (for sufficiently small instances of SR) by restricting the space of possible models and including dynamic programming and pruning strategies. Cozad (2014) and Cozad and Sahinidis (2018) showed how SR can be addressed with mixed-integer nonlinear programming. ...
Preprint
Symbolic regression (SR) is the task of learning a model of data in the form of a mathematical expression. By their nature, SR models have the potential to be accurate and human-interpretable at the same time. Unfortunately, finding such models, i.e., performing SR, appears to be a computationally intensive task. Historically, SR has been tackled with heuristics such as greedy or genetic algorithms and, while some works have hinted at the possible hardness of SR, no proof has yet been given that SR is, in fact, NP-hard. This begs the question: Is there an exact polynomial-time algorithm to compute SR models? We provide evidence suggesting that the answer is probably negative by showing that SR is NP-hard.
... Symbolic regression (SR) is the task of discovering a governing mathematical equation that underlies the given data [19]. Algorithms implementing SR can be of wildly different nature, including exhaustive or greedy search strategies [8,15,23,31], genetic programming (GP) and other evolutionary approaches [16,19,34], deep neural networks [3,7,29], as well as hybrids [6] and pipelines [38]. ...
Preprint
Currently, the genetic programming version of the gene-pool optimal mixing evolutionary algorithm (GP-GOMEA) is among the top-performing algorithms for symbolic regression (SR). A key strength of GP-GOMEA is its way of performing variation, which dynamically adapts to the emergence of patterns in the population. However, GP-GOMEA lacks a mechanism to optimize coefficients. In this paper, we study how fairly simple approaches for optimizing coefficients can be integrated into GP-GOMEA. In particular, we considered two variants of Gaussian coefficient mutation. We performed experiments using different settings on 23 benchmark problems, and used machine learning to estimate what aspects of coefficient mutation matter most. We find that the most important aspect is that the number of coefficient mutation attempts needs to be commensurate with the number of mixing operations that GP-GOMEA performs. We applied GP-GOMEA with the best-performing coefficient mutation approach to the data sets of SRBench, a large SR benchmark, for which a ground-truth underlying equation is known. We find that coefficient mutation can substantially help in re-discovering the underlying equation, but only when no noise is added to the target variable. In the presence of noise, GP-GOMEA with coefficient mutation discovers alternative but similarly accurate equations.
... However, several control parameters must be tuned in the EBR method. In Kammerer et al. (2020), a deterministic symbolic regression algorithm is proposed using a context-free grammar. It utilizes nonlinear least squares local optimization to parameterize the symbolic regression models. ...
Article
Full-text available
Incompleteness is one of the problematic data quality challenges in real-world machine learning tasks. A large number of studies have been conducted for addressing this challenge. However, most of the existing studies focus on the classification task and only a limited number of studies for symbolic regression with missing values exist. In this work, a new imputation method for symbolic regression with incomplete data is proposed. The method aims to improve both the effectiveness and efficiency of imputing missing values for symbolic regression. This method is based on genetic programming (GP) and weighted K-nearest neighbors (KNN). It constructs GP-based models using other available features to predict the missing values of incomplete features. The instances used for constructing such models are selected using weighted KNN. The experimental results on real-world data sets show that the proposed method outperforms a number of state-of-the-art methods with respect to the imputation accuracy, the symbolic regression performance, and the imputation time.
Article
Symbolic regression is commonly considered in wide-ranging applications due to its inherent capability for learning both the structure and the weighting parameters of an interpretable model. However, for scenarios that require fitting multiple expressions (MEs) synchronously, existing symbolic regression algorithms need to run multiple times, step by step and asynchronously, to fit such a group of expressions. Because they lack mechanisms to explicitly capture and leverage the relationships between these expressions, the coupling information among MEs is lost in this approach. A multiexpression symbolic regression algorithm (ME-SR) is proposed in this article to address this issue in learning MEs. Additionally, a methodology for extracting the approximate maximum common subexpression (aMCSE) among these MEs is suggested to disclose the relationships. A new metric is formulated to measure the quality of an aMCSE in ME-SR by imitating the concept of intersection over union. Furthermore, an adaptive cross matrix is incorporated into the algorithm to balance the search efforts between intertask and intratask domains. The proposed ME-SR demonstrates superior performance when compared to its single-expression symbolic regression counterparts on the designed test set. Finally, the efficacy of the method is verified on a real-world circuit design case.
Article
In this paper, we consider the problem of learning a regression function without assuming its functional form. This problem is referred to as symbolic regression. An expression tree is typically used to represent a solution function, which is determined by assigning operators and operands to the nodes. Cozad and Sahinidis propose a nonconvex mixed-integer nonlinear program (MINLP), in which binary variables are used to assign operators and nonlinear expressions are used to propagate data values through nonlinear operators, such as square, square root, and exponential. We extend this formulation by adding new cuts that improve the solution of this challenging MINLP. We also propose a heuristic that iteratively builds an expression tree by solving a restricted MINLP. We perform computational experiments and compare our approach with a mixed-integer program-based method and a neural network-based method from the literature.
Article
Symbolic Regression (SR) algorithms attempt to learn analytic expressions which fit data accurately and in a highly interpretable manner. Conventional SR suffers from two fundamental issues which we address here. First, these methods search the space stochastically (typically using genetic programming) and hence do not necessarily find the best function. Second, the criteria used to select the equation optimally balancing accuracy with simplicity have been variable and subjective. To address these issues we introduce Exhaustive Symbolic Regression (ESR), which systematically and efficiently considers all possible equations (made with a given basis set of operators and up to a specified maximum complexity) and is therefore guaranteed to find the true optimum (if parameters are perfectly optimised) and a complete function ranking subject to these constraints. We implement the minimum description length principle as a rigorous method for combining these preferences into a single objective. To illustrate the power of ESR we apply it to a catalogue of cosmic chronometers and the Pantheon+ sample of supernovae to learn the Hubble rate as a function of redshift, finding 40 functions (out of 5.2 million trial functions) that fit the data more economically than the Friedmann equation. These low-redshift data therefore do not uniquely prefer the expansion history of the standard model of cosmology. We make our code and full equation sets publicly available.
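
To illustrate the minimum-description-length principle named here (only the principle; ESR's actual codelength terms differ), a score can combine the Gaussian negative log-likelihood of the residuals with codelengths for the tree structure and the fitted parameters:

```python
import numpy as np

def mdl_score(residuals, n_params, n_nodes, n_choices):
    # Data term: negative log-likelihood under Gaussian residuals with
    # the maximum-likelihood variance.
    n = len(residuals)
    nll = 0.5 * n * (np.log(2 * np.pi * np.mean(residuals ** 2)) + 1)
    structure = n_nodes * np.log(n_choices)   # log(#operators/operands) per node
    params = 0.5 * n_params * np.log(n)       # BIC-like cost per fitted constant
    return nll + structure + params           # lower is better

# Usage: compare, e.g., a 5-node model against a 9-node model on their residuals.
```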
Chapter
Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squares optimization using a variable projection algorithm. Both FFX and our new algorithm are applied to the PennML benchmark suite. We show that the proposed extensions of FFX lead to higher accuracy while providing models of similar length and with only a small increase in runtime on the given data. Our results are compared to a large set of regression methods that were already published for the given benchmark suite.
Article
In this paper, a novel operator, the semantic cluster operator, was developed to overcome the low convergence performance of genetic programming in symbolic regression. The main strategy for steep convergence was to narrow the search space and scrutinize the narrowed search space using a semantic cluster library. To demonstrate the success of this idea, the computation time and offspring fitness of the developed operator were compared with those of exhaustive search. The computation time of the operator was approximately 6% of that of the exhaustive search, and its offspring fitness was in the top 0.5% among all offspring derived from the exhaustive search. In two application problems, models derived from an algorithm using the operator showed high prediction accuracy comparable to an artificial neural network, random forest, and support vector machine despite their simplicity.
Chapter
Full-text available
As symbolic regression (SR) has advanced into the early stages of commercial exploitation, the poor accuracy of SR, still plaguing even the most advanced commercial packages, has become an issue for early adopters. Users expect to have the correct formula returned, especially in cases with zero noise and only one basis function with minimally complex grammar depth. At a minimum, users expect the response surface of the SR tool to be easily understood, so that the user can know a priori on what classes of problems to expect excellent, average, or poor accuracy. Poor or unknown accuracy is a hindrance to greater academic and industrial acceptance of SR tools. In two previous papers, we published a complex algorithm for modern symbolic regression which is extremely accurate for a large class of symbolic regression problems. The class of problems on which SR is extremely accurate is described in detail in these two previous papers. This algorithm is extremely accurate, in reasonable time on a single processor, for from 25 up to 3000 features (columns). Extensive, statistically correct, out-of-sample training and testing demonstrated the extreme-accuracy algorithm's advantages over a previously published baseline Pareto algorithm in the case where the training and testing data contained zero noise. While the algorithm's extreme accuracy for deep problems with a large number of features on noiseless training data is an impressive advance, there are many very important academic and industrial SR problems where the training data is very noisy. In this chapter we test the extreme-accuracy algorithm and compare the results with the previously published baseline Pareto algorithm. Both algorithms' performance is compared on a set of complex representative problems (from 25 to 3000 features), on noiseless training data, on noisy training data, and on noisy training data with range-shifted testing data. The enhanced algorithm is shown to be robust, with definite advantages over the baseline Pareto algorithm, performing well even in the face of noisy training data and range-shifted testing data.
Chapter
Full-text available
As symbolic regression (SR) has advanced into the early stages of commercial exploitation, the poor accuracy of SR, still plaguing even the most advanced commercial packages, has become an issue for early adopters. Users expect to have the correct formula returned, especially in cases with zero noise and only one basis function with minimally complex grammar depth. At a minimum, users expect the response surface of the SR tool to be easily understood, so that the user can know a priori on what classes of problems to expect excellent, average, or poor accuracy. Poor or unknown accuracy is a hindrance to greater academic and industrial acceptance of SR tools. In a previous paper, we published a complex algorithm for modern symbolic regression which is extremely accurate for a large class of symbolic regression problems. The class of problems on which SR is extremely accurate was described in detail. This algorithm was extremely accurate, on a single processor, for up to 25 features (columns); a cloud configuration was used to extend the extreme accuracy up to as many as 100 features. While the previous algorithm's extreme accuracy for deep problems with a small number of features (25-100) was an impressive advance, there are many very important academic and industrial SR problems requiring from 100 to 1000 features. In this chapter we extend the previous algorithm such that high accuracy is achieved on a wide range of problems, from 25 to 3000 features, using only a single processor. The class of problems on which the enhanced algorithm is highly accurate is described in detail. A definition of extreme accuracy is provided, and an informal argument for high SR accuracy is outlined in this chapter. The new enhanced algorithm is tested on a set of representative problems. The enhanced algorithm is shown to be robust, performing well even in the face of testing data containing up to 3000 features.
Chapter
Full-text available
Although recent advances in symbolic regression (SR) have promoted the field into the early stages of commercial exploitation, the poor accuracy of SR is still plaguing even the most advanced commercial packages today. Users expect to have the correct formula returned, especially in cases with zero noise and only one basis function with minimally complex grammar depth. Poor accuracy is a hindrance to greater academic and industrial acceptance of SR tools. In a previous paper, the poor accuracy of symbolic regression was explored, and several classes of test formulas which prove intractable for SR were examined. An understanding of why these test problems prove intractable was developed. In another paper a baseline symbolic regression algorithm was developed with specific techniques for optimizing embedded real-number constants. These previous steps have placed us in a position to make an attempt at vanquishing the SR accuracy problem. In this chapter we develop a complex algorithm for modern symbolic regression which is extremely accurate for a large class of symbolic regression problems. The class of problems on which SR is extremely accurate is described in detail. A definition of extreme accuracy is provided, and an informal argument for extreme SR accuracy is outlined in this chapter. Given the critical importance of accuracy in SR, it is our suspicion that in the future all commercial symbolic regression packages will use this algorithm or a substitute for it.
Article
Full-text available
We present the results of a community survey regarding genetic programming (GP) benchmark practices. Analysis shows broad consensus that improvement is needed in problem selection and experimental rigor. While views expressed in the survey dissuade us from proposing a large-scale benchmark suite, we find community support for creating a blacklist of "toy problems." We provide a set of alternative problems named GPBench2012 to replace the blacklisted ones, a discussion on improving experimental rigor, and a listing of challenging problems in the hope of improving GP research.
Conference Paper
Full-text available
We propose Locally Geometric Crossover (LGX) for genetic programming. For a pair of homologous loci in the parent solutions, LGX finds a semantically intermediate procedure from a previously prepared library and uses it as replacement code. The experiments involving six symbolic regression problems show a significant increase in search performance when compared to standard subtree-swapping crossover and other control methods. This suggests that semantically geometric manipulations of subprograms propagate to entire programs and improve their fitness.
Article
Full-text available
We examine the effectiveness of gradient search optimization of numeric leaf values for Genetic Programming. Genetic search for tree-like programs at the population level is complemented by the optimization of terminal values at the individual level. Local adaptation of individuals is made easier by algorithmic differentiation. We show how conventional random constants are tuned by gradient descent with minimal overhead. Several experiments with symbolic regression problems are performed to demonstrate the approach's effectiveness. Effects of local learning are clearly manifest in both improved approximation accuracy and selection changes when periods of local and global search are interleaved. Special attention is paid to the low overhead of the local gradient descent. Finally, the inductive bias of local learning is quantified.
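
A tiny sketch of the idea under stated assumptions: the tree structure is frozen and only the numeric leaves are tuned by gradient descent on the MSE; the derivatives here are written by hand for a fixed tree c0*x + c1, whereas the paper obtains them by algorithmic differentiation.

```python
import numpy as np

def fit_constants(x, y, steps=200, lr=0.05):
    c = np.array([0.1, 0.1])           # the numeric leaf values to tune
    for _ in range(steps):
        r = c[0] * x + c[1] - y        # residuals of the fixed tree c0*x + c1
        grad = np.array([np.mean(2 * r * x), np.mean(2 * r)])  # dMSE/dc0, dMSE/dc1
        c -= lr * grad
    return c

x = np.linspace(-1, 1, 50)
y = 3.0 * x + 0.5
print(fit_constants(x, y))             # -> approximately [3.0, 0.5]
```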
Chapter
Full-text available
Today numerous variants of heuristic optimization algorithms are used to solve different kinds of optimization problems. This huge variety makes it very difficult to reuse already implemented algorithms or problems. In this paper the authors describe a generic, extensible, and paradigm-independent optimization environment that strongly abstracts the process of heuristic optimization. By providing a well-organized and strictly separated class structure, and by introducing a generic operator concept for the interaction between algorithms and problems, HeuristicLab makes it possible to reuse an algorithm implementation for attacking many different kinds of problems and vice versa. Consequently HeuristicLab is very well suited for rapid prototyping of new algorithms and is also useful for educational support due to its state-of-the-art user interface, its self-explanatory API, and its use of modern programming concepts.
Chapter
Full-text available
This chapter examines the use of Abstract Expression Grammars to perform the entire symbolic regression process without the use of Genetic Programming per se. The techniques explored produce a symbolic regression engine which has absolutely no bloat, allows total user control of the search space and output formulas, and is faster and more accurate than the engines produced in our previous papers using Genetic Programming. The genome is an all-vector structure with four chromosomes plus additional epigenetic and constraint vectors, allowing total user control of the search space and the final output formulas. A combination of specialized compiler techniques, genetic algorithms, particle swarm, age-layered populations, plus discrete and continuous differential evolution is used to produce an improved symbolic regression system. Nine base test cases from the literature are used to test the improvement in speed and accuracy. The improved results indicate that these techniques move us a big step closer toward future industrial-strength symbolic regression systems. Keywords: abstract expression grammars, differential evolution, grammar template genetic programming, genetic algorithms, particle swarm, symbolic regression
Chapter
Full-text available
Symbolic regression is a common application for genetic programming (GP). This paper presents a new non-evolutionary technique for symbolic regression that, compared to competent GP approaches on real-world problems, is orders of magnitude faster (taking just seconds), returns simpler models, has comparable or better prediction on unseen data, and converges reliably and deterministically. I dub the approach FFX, for Fast Function Extraction. FFX uses a recently developed machine learning technique, pathwise regularized learning, to rapidly prune a huge set of candidate basis functions down to compact models. FFX is verified on a broad set of real-world problems having 13 to 1468 input variables, outperforming GP as well as several state-of-the-art regression techniques. Keywords: technology, symbolic regression, genetic programming, pathwise regularization, real-world problems, machine learning, lasso, ridge regression, elastic net, integrated circuits
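
A rough sketch of the recipe, under assumptions: a small hypothetical set of candidate basis functions is enumerated, and sklearn's LassoCV stands in for the pathwise regularized (elastic net) fit that prunes them to a compact model.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def ffx_like(x, y):
    # Candidate bases: a hypothetical set; FFX enumerates far more.
    bases = {"x": x, "x^2": x ** 2, "abs(x)": np.abs(x), "log1p|x|": np.log1p(np.abs(x))}
    names = list(bases)
    X = np.column_stack([bases[n] for n in names])
    fit = LassoCV(cv=5).fit(X, y)
    # Keep only the bases the regularized path left with nonzero weight.
    return {n: w for n, w in zip(names, fit.coef_) if abs(w) > 1e-6}

x = np.random.uniform(-2, 2, 200)
y = 1.5 * x ** 2 + 0.5 * x + np.random.normal(0, 0.1, 200)
print(ffx_like(x, y))   # typically keeps x and x^2, pruning the rest
```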
Article
Full-text available
We investigate the effects of semantically-based crossover operators in genetic programming applied to real-valued symbolic regression problems. We propose two new relations derived from the semantic distance between subtrees, known as semantic equivalence and semantic similarity. These relations are used to guide variants of the crossover operator, resulting in two new crossover operators: semantics aware crossover (SAC) and semantic similarity-based crossover (SSC). SAC, which was introduced and studied previously, is added here for the purpose of comparison and analysis. SSC extends SAC by more closely controlling the semantic distance between subtrees to which crossover may be applied. The new operators were tested on some real-valued symbolic regression problems and compared with standard crossover (SC), context aware crossover (CAC), Soft Brood Selection (SBS), and No Same Mate (NSM) selection. The experimental results show, on the problems examined, that with computational effort measured by the number of function node evaluations, only SSC and SBS were significantly better than SC, and SSC was often better than SBS. Further experiments were also conducted to analyse the performance sensitivity to the parameter settings for SSC. This analysis leads to the conclusion that SSC is more constructive and has higher locality than SAC, NSM and SC; we believe these are the main reasons for the improved performance of SSC. Keywords: genetic programming, semantics, crossover, symbolic regression, locality
Chapter
Full-text available
Trust is a major issue with deploying empirical models in the real world since changes in the underlying system or use of the model in new regions of parameter space can produce (potentially dangerous) incorrect predictions. The trepidation involved with model usage can be mitigated by assembling ensembles of diverse models and using their consensus as a trust metric, since these models will be constrained to agree in the data region used for model development and also constrained to disagree outside that region. The problem is to define an appropriate model complexity (since the ensemble should consist of models of similar complexity), as well as to identify diverse models from the candidate model set. In this chapter we discuss strategies for the development and selection of robust models and model ensembles and demonstrate those strategies against industrial data sets. An important benefit of this approach is that all available data may be used in the model development rather than a partition into training, test and validation subsets. The result is constituent models are more accurate without risk of over-fitting, the ensemble predictions are more accurate and the ensemble predictions have a meaningful trust metric.
Article
Full-text available
This paper presents a novel approach to generate data-driven regression models that not only give reliable prediction of the observed data but also have smoother response surfaces and extra generalization capabilities with respect to extrapolation. These models are obtained as solutions of a genetic programming (GP) process, where selection is guided by a tradeoff between two competing objectives: numerical accuracy and the order of nonlinearity. The latter is a novel complexity measure that adopts the notion of the minimal degree of the best-fit polynomial approximating an analytical function with a certain precision. Using nine regression problems, this paper presents and illustrates two different strategies for the use of the order of nonlinearity in symbolic regression via GP. The combination of optimization of the order of nonlinearity together with the numerical accuracy strongly outperforms "conventional" optimization of a size-related expressional complexity and the accuracy with respect to the extrapolative capabilities of solutions on all nine test problems. In addition to exploiting the new complexity measure, this paper also introduces a novel heuristic of alternating several optimization objectives in a 2-D optimization framework. Alternating the objectives at each generation in such a way allows us to exploit the effectiveness of 2-D optimization when more than two objectives are of interest (in this paper, these are accuracy, expressional complexity, and the order of nonlinearity). Results of the experiments on all test problems suggest that alternating the order of nonlinearity of GP individuals with their structural complexity produces solutions that are both compact and have smoother response surfaces, and hence contributes to better interpretability and understanding.
Conference Paper
Full-text available
A new digital signature based only on a conventional encryption function (such as DES) is described which is as secure as the underlying encryption function: the security does not depend on the difficulty of factoring, and the high computational costs of modular arithmetic are avoided. The signature system can sign an unlimited number of messages, and the signature size increases logarithmically as a function of the number of messages signed. Signature size in a 'typical' system might range from a few hundred bytes to a few kilobytes, and generation of a signature might require a few hundred to a few thousand computations of the underlying conventional encryption function.
Conference Paper
Full-text available
This paper presents a simple method to control bloat which is based on the idea of strategically and dynamically creating fitness "holes" in the fitness landscape which repel the population. In particular, we create holes by zeroing the fitness of a certain proportion of the offspring that have above-average length. Unlike other methods, where all individuals are penalised when length constraints are violated, here we randomly penalise only a fixed proportion of all the constraint-violating offspring. The paper describes the theoretical foundation for this method and reports the results of its empirical validation with two relatively hard test problems, which has confirmed the effectiveness of the approach.
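
The mechanism is simple enough to sketch directly; the hole rate and the fitness representation below are assumptions, not the paper's parameters.

```python
import random

def apply_fitness_holes(population, hole_rate=0.3):
    # population: list of dicts with 'length' and 'fitness' (higher is better).
    avg_len = sum(p["length"] for p in population) / len(population)
    for p in population:
        # Zero the fitness of a random fraction of above-average-length offspring.
        if p["length"] > avg_len and random.random() < hole_rate:
            p["fitness"] = 0.0
    return population
```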
Book
Full-text available
Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications discusses algorithmic developments in the context of genetic algorithms (GAs) and genetic programming (GP). It applies the algorithms to significant combinatorial optimization problems and describes structure identification using HeuristicLab as a platform for algorithm development. The book focuses on both theoretical and empirical aspects. The theoretical sections explore the important and characteristic properties of the basic GA as well as main characteristics of the selected algorithmic extensions developed by the authors. In the empirical parts of the text, the authors apply GAs to two combinatorial optimization problems: the traveling salesman and capacitated vehicle routing problems. To highlight the properties of the algorithmic measures in the field of GP, they analyze GP-based nonlinear structure identification applied to time series and classification problems. Written by core members of the HeuristicLab team, this book provides a better understanding of the basic workflow of GAs and GP, encouraging readers to establish new bionic, problem-independent theoretical concepts. By comparing the results of standard GA and GP implementation with several algorithmic extensions, it also shows how to substantially increase achievable solution quality.
Article
Full-text available
This paper describes the use of genetic programming to automate the discovery of numerical approximation formulae. The authors present results involving rediscovery of known approximations for Harmonic numbers and discovery of rational polynomial approximations for functions of one or more variables, the latter of which are compared to Padé approximations obtained through a symbolic mathematics package. For functions of a single variable, it is shown that evolved solutions can be considered superior to Padé approximations, which represent a powerful technique from numerical analysis, given certain tradeoffs between approximation cost and accuracy, while for functions of more than one variable, we are able to evolve rational polynomial approximations where no Padé approximation can be computed. Further, it is shown that evolved approximations can be refined through the evolution of approximations to their error function. Based on these results, we consider genetic programming to be a powerful and effective technique for the automated discovery of numerical approximation formulae.
Conference Paper
Full-text available
Abstract Expression Grammars have the potential to integrate Genetic Algorithms, Genetic Programming, Swarm Intelligence, and Differential Evolution into a seamlessly unified array of tools for use in symbolic regression. The features of abstract expression grammars are explored, examples of implementations are provided, and the beneficial effects of abstract expression grammars are tested with several published nonlinear regression problems.
Article
Full-text available
Most evolutionary optimization models incorporate a fitness evaluation that is based on a predefined static set of test cases or problems. In the natural evolutionary process, selection is of course not based on a static fitness evaluation. Organisms do not have to combat every existing disease during their lifespan; organisms of one species may live in different or changing environments; different species coevolve. This leads to the question of how information is integrated over many generations. This study focuses on the effects of different fitness evaluation schemes on the types of genotypes and phenotypes that evolve. The evolutionary target is a simple numerical function. The genetic representation is in the form of a program (i.e., a functional representation, as in genetic programming). Many different programs can code for the same numerical function. In other words, there is a many-to-one mapping between “genotypes” (the programs) and “phenotypes”. We compare fitness evaluation based on a large static set of problems and fitness evaluation based on small coevolving sets of problems. In the latter model very little information is presented to the evolving programs regarding the evolutionary target per evolutionary time step. In other words, the fitness evaluation is very sparse. Nevertheless the model produces correct solutions to the complete evolutionary target in about half of the simulations. The complete evaluation model, on the other hand, does not find correct solutions to the target in any of the simulations. More important, we find that sparse evaluated programs are better generalizable compared to the complete evaluated programs when they are evaluated on a much denser set of problems. In addition, the two evaluation schemes lead to programs that differ with respect to mutational stability; sparse evaluated programs are less stable than complete evaluated programs.
Article
Evolutionary programming and genetic algorithms share many features, not the least of which is a reliance on an analogy to natural selection over a population as a means of implementing search. With their commonalities come shared problems whose solutions can be investigated at a higher level and applied to both. One such problem is the manipulation of solution parameters whose values encode a desirable sub-solution. In this paper, we define a superset of evolutionary programming and genetic algorithms, called evolutionary algorithms, and demonstrate a method of automatic modularization that protects promising partial solutions and speeds acquisition time. 1. Introduction Evolutionary programming (EP) (Fogel 1992; Fogel et al. 1966) and genetic algorithms (GAs) (Holland 1966; Goldberg 1989) have borrowed little from each other. But there are many levels at which EP and GAs are similar. For instance, both employ an analogy to natural selection over a population to search through a sp...
Chapter
We introduce in this paper a runtime-efficient tree hashing algorithm for the identification of isomorphic subtrees, with two important applications in genetic programming for symbolic regression: fast, online calculation of population diversity and algebraic simplification of symbolic expression trees. Based on this hashing approach, we propose a simple diversity-preservation mechanism with promising results on a collection of symbolic regression benchmark problems.
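As a rough illustration of the idea (a minimal sketch, not the authors' exact algorithm), subtree hashes can be computed bottom-up so that isomorphic subtrees collide; sorting the child hashes of commutative operators makes the hash order-invariant. The Node layout and operator set below are hypothetical:

    # Bottom-up hashing of expression trees: isomorphic subtrees (identical
    # up to argument order of commutative operators) get the same hash.
    COMMUTATIVE = {"+", "*"}

    class Node:
        def __init__(self, symbol, children=()):
            self.symbol = symbol            # operator, variable or constant
            self.children = list(children)

    def subtree_hash(node):
        child_hashes = [subtree_hash(c) for c in node.children]
        if node.symbol in COMMUTATIVE:
            child_hashes.sort()             # order-invariant for + and *
        return hash((node.symbol, tuple(child_hashes)))

    x, y = Node("x"), Node("y")
    assert subtree_hash(Node("*", [x, y])) == subtree_hash(Node("*", [y, x]))

Hashing every subtree of every individual then allows duplicate structures to be counted cheaply (a diversity measure) or replaced by a cached canonical form (algebraic simplification).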
Chapter
In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently from the target function. For our analysis, we use a restricted grammar for univariate symbolic regression models and generate all possible models up to a fixed length limit. We identify unique models and cluster them based on phenotypic as well as genotypic similarity. We find that phenotypic similarity leads to well-defined clusters, while genotypic similarity does not produce a clear clustering. By mapping solution candidates visited by GP to the enumerated search space, we find that GP initially explores the whole search space and later converges to the subspace of highest-quality expressions during a run on a simple benchmark problem.
Article
Data-driven modeling plays an increasingly important role in different areas of engineering. For most existing methods, such as genetic programming (GP), the convergence speed can be too slow for large-scale problems with a large number of variables; this has become the bottleneck of GP for practical applications. Fortunately, in many applications the target models are separable in some sense. In this paper, we analyze different types of separability of some real-world engineering equations and establish a mathematical model of the generalized separable system (GS system). In order to obtain the structure of the GS system, a multilevel block building (MBB) algorithm is proposed, in which the target model is decomposed into a number of blocks, and further into minimal blocks and factors. Compared to conventional GP, MBB can greatly reduce the search space, which makes it capable of modeling complex systems. The minimal blocks and factors are optimized and assembled with a global optimization search engine, low dimensional simplex evolution (LDSE). An extensive comparison between the proposed MBB and a state-of-the-art data-driven fitting tool, Eureqa, is presented on several synthetic problems as well as some real-world problems. Test results indicate that the proposed method is more effective and efficient in all the investigated cases.
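The notion of separability can be made concrete with a numeric probe (a sketch covering only additive separability; the paper's GS systems and the MBB decomposition are more general): for f(x1, x2) = g(x1) + h(x2), every mixed second difference vanishes.

    # Numeric check for additive separability: f(a,c) - f(a,d) - f(b,c) + f(b,d)
    # is (near) zero for all probe pairs iff no x1/x2 interaction is observed.
    import itertools
    import math

    def additively_separable(f, xs1, xs2, tol=1e-8):
        for (a, b), (c, d) in itertools.product(
                itertools.combinations(xs1, 2), itertools.combinations(xs2, 2)):
            if abs(f(a, c) - f(a, d) - f(b, c) + f(b, d)) > tol:
                return False
        return True

    print(additively_separable(lambda u, v: math.sin(u) + v**2,
                               [0.1, 0.5, 1.3], [0.2, 0.7, 2.0]))   # True
    print(additively_separable(lambda u, v: math.sin(u * v),
                               [0.1, 0.5, 1.3], [0.2, 0.7, 2.0]))   # False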
Chapter
In this chapter we review a number of real-world applications where symbolic regression was used recently and with great success. Industrial-scale symbolic regression, armed with the power to select the right variables and variable combinations, build robust and trustworthy predictions, and guide experimentation, has undoubtedly earned its place in industrial process optimization, business forecasting, product design, and now complex systems modeling and policy making.
Chapter
Genetic programming (GP) is a stochastic, iterative generate-and-test approach to synthesizing programs from tests, i.e. examples of the desired input-output mapping. The number of passed tests, or the total error in continuous domains, is a natural objective measure of a program’s performance and a common yardstick when experimentally comparing algorithms. In GP, it is also by default used to guide the evolutionary search process. The assumption that an objective function should also be an efficient ‘search driver’ is common to all metaheuristics, including the evolutionary algorithms of which GP is a member. Programs are complex combinatorial structures that exhibit even more complex input-output behavior, and in this chapter we discuss why this complexity cannot be effectively reflected by a single scalar objective. In consequence, GP algorithms are systemically ‘underinformed’ about the characteristics of the programs they operate on, and pay for this with unsatisfactory performance and limited scalability. This chapter advocates behavioral program synthesis, where programs are characterized by informative execution traces that enable multifaceted evaluation and substantially change the roles of components in an evolutionary infrastructure. We provide a unified perspective on past work in this area, discuss the consequences of the behavioral viewpoint, and outline future avenues for program synthesis and the wider application areas that lie beyond.
Chapter
From a real-world perspective, good enough has been achieved in the core representations and evolutionary strategies of genetic programming, assuming state-of-the-art algorithms and implementations are being used. What is needed for industrial symbolic regression are tools to (a) explore and refine the data, (b) explore the developed model space and extract insight and guidance from the available sample of the infinite possibilities of model forms, and (c) identify appropriate models for deployment as predictors, emulators, etc. This chapter focuses on the approaches used in DataModeler to address the modeling life cycle. A special focus in this chapter is the identification of driving variables and metavariables. Exploiting the diversity of search paths followed during independent evolutions and then looking at the distributions of variable and metavariable usage also provides an opportunity to gather key insights. The goal in this framework, however, is not to replace the modeler but rather to augment the inclusion of context and the collection of insight by removing mechanistic requirements and facilitating the ability to think. We believe that the net result is higher-quality and more robust models.
Conference Paper
In this publication a constant optimization approach for symbolic regression is introduced to separate the task of finding the correct model structure from the necessity to evolve the correct numerical constants. A gradient-based nonlinear least squares optimization algorithm, the Levenberg-Marquardt (LM) algorithm, is used for adjusting constant values in symbolic expression trees during their evolution. The LM algorithm depends on gradient information consisting of the partial derivatives of the trees, which are obtained by automatic differentiation. The presented constant optimization approach is tested on several benchmark problems and compared to a standard genetic programming algorithm to show its effectiveness. Although the constant optimization involves an overhead in execution time, both the achieved accuracy and the ability of genetic programming to learn from the provided data increase significantly. As an example, the Pagie-1 problem could be solved in 37 out of 50 test runs, whereas without constant optimization it was solved in only 10 runs. Furthermore, different configurations of the constant optimization approach (number of iterations, probability of applying constant optimization) are evaluated and their impact is detailed in the results section.
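A minimal sketch of the constant-fitting step, using SciPy's Levenberg-Marquardt implementation on a hypothetical fixed model structure (the paper obtains gradients via automatic differentiation of the expression trees; SciPy falls back to finite differences here):

    # Fit the numerical constants of a fixed symbolic structure with LM,
    # leaving the search for the structure itself to the outer algorithm.
    import numpy as np
    from scipy.optimize import least_squares

    def model(theta, x):                    # hypothetical fixed structure
        c0, c1, c2 = theta
        return c0 * np.sin(c1 * x) + c2

    def residuals(theta, x, y):
        return model(theta, x) - y

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 4.0, 100)
    y = 2.0 * np.sin(1.5 * x) + 0.5 + rng.normal(0.0, 0.01, x.size)

    fit = least_squares(residuals, x0=[1.0, 1.0, 0.0], args=(x, y), method="lm")
    print(fit.x)   # approximately [2.0, 1.5, 0.5]; may hit a local optimum
                   # for poor starting values, as with any local optimizer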
Conference Paper
We introduce Prioritized Grammar Enumeration (PGE), a deterministic Symbolic Regression (SR) algorithm using dynamic programming techniques. PGE maintains the tree-based representation and Pareto non-dominated sorting from Genetic Programming (GP), but replaces genetic operators and random number use with grammar production rules and systematic choices. PGE uses non-linear regression and abstract parameters to fit the coefficients of an equation, effectively separating the exploration for form from the optimization of a form. Memoization enables PGE to evaluate each point of the search space only once, and a Pareto Priority Queue provides direction to the search. Sorting and simplification algorithms are used to transform candidate expressions into a canonical form, reducing the size of the search space. Our results show that PGE performs well on 22 benchmarks from the SR literature, returning exact formulas in many cases. As a deterministic algorithm, PGE offers reliability and reproducibility of results, a key aspect of any system used by scientists at large. We believe PGE is a capable SR implementation, offering an alternative perspective that we hope leads the community to new ideas.
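The two mechanisms named in the abstract, a priority queue directing the search and memoization on canonical forms, can be sketched as follows (a simplification: expressions are plain strings, and expand, score, and canonical are caller-supplied stand-ins for the grammar rules, fitting step, and simplifier):

    # Best-first enumeration with memoization: each canonical form is
    # evaluated at most once; the queue pops the most promising form next.
    import heapq

    def search(start_exprs, expand, score, canonical, budget=100):
        seen = {canonical(e) for e in start_exprs}
        queue = [(score(e), e) for e in start_exprs]
        heapq.heapify(queue)
        best = min(queue)
        while queue and budget > 0:
            _, expr = heapq.heappop(queue)
            for child in expand(expr):      # apply grammar production rules
                key = canonical(child)
                if key in seen:             # memoization: skip visited forms
                    continue
                seen.add(key)
                budget -= 1
                entry = (score(child), child)
                best = min(best, entry)
                heapq.heappush(queue, entry)
        return best

    # Toy usage: grow strings toward "xxy"; score counts mismatches.
    score = lambda e: sum(a != b for a, b in zip(e, "xxy")) + abs(len(e) - 3)
    expand = lambda e: [e + s for s in "xy"] if len(e) < 3 else []
    print(search(["x", "y"], expand, score, canonical=lambda e: e))  # (0, 'xxy')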
Article
We introduce an estimation of distribution algorithm that co-evolves fitness predictors in order to reduce the computational cost of evolution. Fitness predictors are light objects which, given an evolving individual, heuristically approximate its true fitness. The predictors are trained by their ability to correctly differentiate between good and bad solutions using reduced computation. We apply co-evolution of fitness predictors to symbolic regression and measure its impact. Our results show that a small computational investment in co-evolving fitness predictors greatly enhances both speed and convergence of individual solutions while reducing the computational effort overall. Finally we apply fitness prediction to interactive evolution of pen stroke drawings. These results show that fitness prediction is extremely effective at modeling user preference while minimizing the sampling on the user to fewer than ten prompts.
Article
We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p≫n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
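For reference, the (naive) elastic net estimate augments the least-squares objective with both penalties, so that the L1 weight controls sparsity and the L2 weight the ridge-style shrinkage and grouping:

    \hat{\beta} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1

Setting \lambda_2 = 0 recovers the lasso, and \lambda_1 = 0 recovers ridge regression.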
Chapter
Fitness models are used to reduce evaluation frequency and cost. There are three fundamental challenges faced when using fitness models: (1) the model learning investment, (2) the model level of approximation, and (3) the lack of convergence to optima. We propose a coevolutionary algorithm to resolve these problems automatically during evolution. We discuss applications of this approach and measure its impact in symbolic regression. Results show coevolution yields significant improvement in performance over other algorithms and different fitness modeling approaches. Finally we apply coevolution to interactive evolution of pen stroke drawings where no true fitness function is known. These results demonstrate coevolution’s ability to infer a fitness landscape of a user’s preference while minimizing user interaction.
Chapter
Symbolic regression via genetic programming (hereafter referred to simply as symbolic regression) has proven to be a very important tool for industrial empirical modeling (Kotanchek et al., 2003). Two of the primary problems with industrial use of symbolic regression are (1) the relatively large computational demands in comparison with other nonlinear empirical modeling techniques such as neural networks and (2) the difficulty of making the trade-off between expression accuracy and complexity. The latter issue is significant since, in general, we prefer parsimonious (simple) expressions with the expectation that they are more robust with respect to changes over time in the underlying system or extrapolation outside the range of the data used as the reference in evolving the symbolic regression. In this chapter, we present a genetic programming variant, ParetoGP, which exploits the Pareto front to dramatically speed up the evolution of symbolic regression solutions as well as explicitly exploit the complexity-performance trade-off. In addition to the improvement in evolution efficiency, the Pareto front perspective allows the user to choose appropriate models for further analysis or deployment. The Pareto front avoids the need to specify a priori a trade-off between competing objectives (e.g. complexity and performance) by identifying the curve (or surface or hyper-surface) which characterizes, for example, the best performance for a given expression complexity.
Article
Most algorithms for the least-squares estimation of non-linear parameters have centered about either of two approaches. On the one hand, the model may be expanded as a Taylor series and corrections to the several parameters calculated at each iteration on the assumption of local linearity. On the other hand, various modifications of the method of steepest-descent have been used. Both methods not infrequently run aground, the Taylor series method because of divergence of the successive iterates, the steepest-descent (or gradient) methods because of agonizingly slow convergence after the first few iterations. In this paper a maximum neighborhood method is developed which, in effect, performs an optimum interpolation between the Taylor series method and the gradient method, the interpolation being based upon the maximum neighborhood in which the truncated Taylor series gives an adequate representation of the nonlinear model. The results are extended to the problem of solving a set of nonlinear algebraic equations.
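In standard textbook notation (not quoted from the paper), the resulting iteration solves

    (J^{\top} J + \lambda \,\mathrm{diag}(J^{\top} J))\, \delta = J^{\top} r, \qquad r = y - f(\beta),

where J is the Jacobian of the model with respect to the parameters \beta; as \lambda \to 0 the step approaches the Taylor series (Gauss-Newton) step, while large \lambda gives a short, scaled steepest-descent step.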
Conference Paper
This paper investigates the robustness of Run Transferable Libraries (RTLs) on scaled problems. RTLs provide GP with a library of functions which replace the usual primitive functions provided when approaching a problem. The RTL evolves from run to run using feedback based on function usage, and has been shown to outperform GP by an order of magnitude on a variety of scalable problems. RTLs can, however, also be applied across a domain of related problems, as well as across a range of scaled instances of a single problem. To do this successfully, an RTL needs to balance a range of functions. We introduce a problem that can deceive the system into converging to a sub-optimal set of functions, and demonstrate that this is a consequence of the greediness of the library update algorithm. We demonstrate that a much simpler, truly evolutionary, update strategy does not suffer from this problem and exhibits far better optimization properties than the original strategy.
Conference Paper
The decomposition of regression error into bias and variance terms provides insight into the generalization capability of modeling methods. The paper offers an introduction to the bias/variance decomposition of mean squared error, as well as experimental results from its application to genetic programming. Finally, ensemble methods such as bagging and boosting, which can reduce the generalization error in genetic programming, are discussed.
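For a target y = f(x) + \varepsilon with noise variance \sigma^2, the decomposition referred to here reads

    E\big[(y - \hat{f}(x))^2\big] = \big(E[\hat{f}(x)] - f(x)\big)^2 + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big] + \sigma^2,

with the first term the squared bias, the second the variance over training sets, and \sigma^2 the irreducible error; bagging primarily attacks the variance term.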
Conference Paper
The use of protected operators and squared error measures is standard in symbolic regression. It is shown that two relatively minor modifications of a symbolic regression system can result in greatly improved predictive performance and reliability of the induced expressions. To achieve this, interval arithmetic and linear scaling are used. An experimental section demonstrates the improvements on 15 symbolic regression problems.
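Linear scaling has a closed-form solution: for raw model outputs p and targets y, the slope and intercept minimizing the squared error of a + b*p can be computed directly, so the search only has to find the shape of the function (a minimal sketch; variable names are ours):

    # Optimal affine rescaling of a model's raw outputs (linear scaling).
    import numpy as np

    def linear_scaling(p, y):
        b = np.cov(p, y, bias=True)[0, 1] / np.var(p)   # slope (needs var(p) > 0)
        a = y.mean() - b * p.mean()                     # intercept
        return a, b

    p = np.array([0.0, 1.0, 2.0, 3.0])                  # raw model outputs
    y = 2.0 * p + 5.0                                   # targets
    print(linear_scaling(p, y))                         # (5.0, 2.0)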
Conference Paper
We propose a multi-objective method for avoiding premature convergence in evolutionary algorithms, and demonstrate a three-fold performance improvement over comparable methods. Previous research has shown that partitioning an evolving population into age groups can greatly improve the ability to identify global optima and avoid converging to local optima. Here, we propose that treating age as an explicit optimization criterion can increase performance even further, with fewer algorithm implementation parameters. The proposed method evolves a population on the two-dimensional Pareto front comprising (a) how long the genotype has been in the population (age) and (b) its performance (fitness). We compare this approach with previous approaches on the symbolic regression problem, sweeping the problem difficulty over a range of solution complexities and numbers of variables. Our results indicate that the multi-objective approach identifies the exact target solution more often than the age-layered population and standard population methods. The multi-objective method also performs better on higher-complexity problems and higher-dimensional datasets, finding global optima with less computational effort.
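A minimal sketch of the selection core (the representation is hypothetical): individuals are (age, error) pairs, both minimized, and survivors are the non-dominated set.

    # Age-fitness Pareto survivor selection: keep individuals not dominated
    # by any other (lower-or-equal age AND lower-or-equal error, not identical).
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b

    def pareto_front(population):
        return [p for p in population
                if not any(dominates(q, p) for q in population)]

    pop = [(1, 0.9), (3, 0.2), (5, 0.1), (4, 0.5)]      # (age, error)
    print(pareto_front(pop))            # [(1, 0.9), (3, 0.2), (5, 0.1)]

Young but poor individuals thus survive alongside old but accurate ones, which is what keeps fresh genetic material in the population.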
Chapter
Traditional Symbolic Regression applications are a form of supervised learning, where a label y is provided for every input vector x and an explicit symbolic relationship of the form y = f(x) is sought. This chapter explores the use of symbolic regression to perform unsupervised learning by searching for implicit relationships of the form f(x, y) = 0. Implicit relationships are more general and more expressive than explicit equations in that they can also represent closed surfaces, as well as continuous and discontinuous multi-dimensional manifolds. However, searching these types of equations is particularly challenging because an error metric is difficult to define. We studied several direct and indirect techniques, and present a successful method based on implicit derivatives. Our experiments identified implicit relationships found in a variety of datasets, such as equations of circles, elliptic curves, spheres, equations of motion, and energy manifolds.
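The key trick (shown here schematically; the paper's exact error measure differs in detail) is that a candidate f(x, y) = 0 predicts local slopes via implicit differentiation,

    \frac{dy}{dx} = -\,\frac{\partial f / \partial x}{\partial f / \partial y},

and these predicted slopes can be compared with slopes \Delta y / \Delta x estimated from neighboring data points, giving a usable error metric even though no explicit target output exists.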
Article
Probabilistic incremental program evolution (PIPE) is a novel technique for automatic program synthesis. We combine probability vector coding of program instructions, population-based incremental learning, and tree-coded programs like those used in some variants of genetic programming (GP). PIPE iteratively generates successive populations of functional programs according to an adaptive probability distribution over all possible programs. In each iteration, it uses the best program to refine the distribution. Thus, it stochastically generates better and better programs. Since distribution refinements depend only on the best program of the current population, PIPE can evaluate program populations efficiently when the goal is to discover a program with minimal runtime. We compare PIPE to GP on a function regression problem and the 6-bit parity problem. We also use PIPE to solve tasks in partially observable mazes, where the best programs have minimal runtime.
Article
Although the problem of determining the minimum cost path through a graph arises naturally in a number of interesting applications, there has been no underlying theory to guide the development of efficient search procedures. Moreover, there is no adequate conceptual framework within which the various ad hoc search strategies proposed to date can be compared. This paper describes how heuristic information from the problem domain can be incorporated into a formal mathematical theory of graph searching and demonstrates an optimality property of a class of search strategies.
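The strategy formalized in the paper is what is now known as A*: expand the open node with the smallest f(n) = g(n) + h(n), where g is the cost accumulated so far and h a heuristic lower bound on the remaining cost. A minimal sketch (the graph encoding is ours; with an admissible h, the first goal popped is optimal):

    # A* search: nodes are popped in order of g(n) + h(n).
    import heapq

    def a_star(graph, start, goal, h):
        frontier = [(h(start), 0, start, [start])]      # (f, g, node, path)
        best_g = {start: 0}
        while frontier:
            f, g, node, path = heapq.heappop(frontier)
            if node == goal:
                return g, path
            for nbr, cost in graph.get(node, []):
                g2 = g + cost
                if g2 < best_g.get(nbr, float("inf")):
                    best_g[nbr] = g2
                    heapq.heappush(frontier, (g2 + h(nbr), g2, nbr, path + [nbr]))
        return None

    graph = {"S": [("A", 1), ("B", 4)], "A": [("G", 5)], "B": [("G", 1)]}
    print(a_star(graph, "S", "G", h=lambda n: 0))       # (5, ['S', 'B', 'G'])

With h identically zero the procedure reduces to Dijkstra's algorithm; a tighter admissible h prunes more of the search while preserving optimality.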
• M. F. Korns. Genetic Programming Theory and Practice XIII, Genetic and Evolutionary Computation.
• H. Zou, T. Hastie. Journal of the Royal Statistical Society: Series B (Statistical Methodology).