Chapter

Hash-Based Tree Similarity and Simplification in Genetic Programming for Symbolic Regression


Abstract

We introduce in this paper a runtime-efficient tree hashing algorithm for the identification of isomorphic subtrees, with two important applications in genetic programming for symbolic regression: fast, online calculation of population diversity and algebraic simplification of symbolic expression trees. Based on this hashing approach, we propose a simple diversity-preservation mechanism with promising results on a collection of symbolic regression benchmark problems.
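The abstract does not spell out the hashing scheme, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes expression trees are nested `(label, children)` tuples, that `+` and `*` are the only commutative symbols, and that SHA-1 over a string payload is an acceptable hash. The names `hash_tree`, `subtree_hashes`, and `similarity` are hypothetical. Each subtree is hashed bottom-up from its label and its children's hashes, with child hashes sorted under commutative symbols so that isomorphic subtrees receive the same value. A Sørensen-Dice style overlap of the two hash multisets then gives one possible similarity measure.

```python
import hashlib
from collections import Counter

# Assumption: only these symbols are treated as commutative.
COMMUTATIVE = {"+", "*"}


def hash_tree(node, out):
    """Hash the subtree rooted at node, bottom-up.

    node is a (label, children) tuple. The hash of every subtree is
    appended to out; the hash of the root is returned.
    """
    label, children = node
    child_hashes = [hash_tree(child, out) for child in children]
    if label in COMMUTATIVE:
        # Argument order is irrelevant, so canonicalise it by sorting.
        child_hashes.sort()
    payload = label + "(" + ",".join(child_hashes) + ")"
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    out.append(digest)
    return digest


def subtree_hashes(tree):
    """Return the multiset of subtree hashes of tree."""
    out = []
    hash_tree(tree, out)
    return Counter(out)


def similarity(a, b):
    """Dice-style overlap of subtree hashes, in [0, 1]."""
    ha, hb = subtree_hashes(a), subtree_hashes(b)
    shared = sum((ha & hb).values())  # multiset intersection
    return 2 * shared / (sum(ha.values()) + sum(hb.values()))


# x + (y * 2) and (2 * y) + x are isomorphic under commutativity.
t1 = ("+", [("x", []), ("*", [("y", []), ("2", [])])])
t2 = ("+", [("*", [("2", []), ("y", [])]), ("x", [])])
t3 = ("-", [("x", []), ("y", [])])
```

Under these assumptions, `similarity(t1, t2)` is 1.0 because every subtree of one tree has a matching hash in the other, while `t1` and `t3` share only the leaves `x` and `y`. Averaging such pairwise values over a population would be one way to obtain a diversity estimate.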


...
When
  • Every generation: [2], [117]-[122]
  • Final generation: [2], [123], [124]
  • At certain interval: [119], [125]-[128]
Which individuals
  • All individuals in the population: [2], [119], [122], [125]-[127]
  • The best individuals in the population: [2], [121], [123], [124], [128]
  • Randomly with some probability: [117]
  • The parents for breeding: [118], [120]
How
  • Genotypic (Structural): [2], [117]-[119], [122], [123], [125]-[127], [129]-[131]
  • Phenotypic (Behavioural): [120], [121], [124], [126]-[128], [132]-[135]
Regarding "when" to simplify, the commonly used strategies are based on frequency (i.e., after every k generations). If k = 1, then the simplification is conducted after every generation. ...
... If domain knowledge is available, we can develop more simplification rules, e.g., max{x, x − y} → x if the variable y is always positive [122]. Hashing has been employed to speed up the simplification [119], [131]. ...
Article
Explainable artificial intelligence has received great interest in the recent decade, due to its importance in critical application domains such as self-driving cars, law and healthcare. Genetic programming is a powerful evolutionary algorithm for machine learning. Compared with other standard machine learning models such as neural networks, the models evolved by GP tend to be more interpretable due to their model structure with symbolic components. However, interpretability has not been explicitly considered in genetic programming until recently, following the surge in popularity of explainable artificial intelligence. This paper provides a comprehensive review of the studies on genetic programming that can potentially improve the model interpretability, both explicitly and implicitly, as a byproduct. We group the existing studies related to explainable artificial intelligence by genetic programming into two categories. The first category considers the intrinsic interpretability, aiming to directly evolve more interpretable (and effective) models by genetic programming. The second category focuses on post-hoc interpretability, which uses genetic programming to explain other black-box machine learning models, or explain the models evolved by genetic programming by simpler models such as linear models. This comprehensive survey demonstrates the strong potential of genetic programming for improving the interpretability of machine learning models and balancing the complex trade-off between model accuracy and interpretability.
Article
Symbolic Regression (SR) algorithms attempt to learn analytic expressions which fit data accurately and in a highly interpretable manner. Conventional SR suffers from two fundamental issues which we address here. First, these methods search the space stochastically (typically using genetic programming) and hence do not necessarily find the best function. Second, the criteria used to select the equation optimally balancing accuracy with simplicity have been variable and subjective. To address these issues we introduce Exhaustive Symbolic Regression (ESR), which systematically and efficiently considers all possible equations (made with a given basis set of operators and up to a specified maximum complexity) and is therefore guaranteed to find the true optimum (if parameters are perfectly optimised) and a complete function ranking subject to these constraints. We implement the minimum description length principle as a rigorous method for combining these preferences into a single objective. To illustrate the power of ESR we apply it to a catalogue of cosmic chronometers and the Pantheon+ sample of supernovae to learn the Hubble rate as a function of redshift, finding 40 functions (out of 5.2 million trial functions) that fit the data more economically than the Friedmann equation. These low-redshift data therefore do not uniquely prefer the expansion history of the standard model of cosmology. We make our code and full equation sets publicly available.
Article
Genetic programming (GP), a widely used Evolutionary Computing technique, suffers from bloat -- the problem of excessive growth in individuals' sizes. As a result, its ability to efficiently explore complex search spaces reduces. The resulting solutions are less robust and generalisable. Moreover, it is difficult to understand and explain models which contain bloat. This phenomenon is well researched, primarily from the angle of controlling bloat: instead, our focus in this paper is to review the literature from an explainability point of view, by looking at how simplification can make GP models more explainable by reducing their sizes. Simplification is a code editing technique whose primary purpose is to make GP models more explainable. However, it can offer bloat control as an additional benefit when implemented and applied with caution. Researchers have proposed several simplification techniques and adopted various strategies to implement them. We organise the literature along multiple axes to identify the relative strengths and weaknesses of simplification techniques and to identify emerging trends and areas for future exploration. We highlight design and integration challenges and propose several avenues for research. One of them is to consider simplification as a standalone operator, rather than an extension of the standard crossover or mutation operators. Its role is then more clearly complementary to other GP operators, and it can be integrated as an optional feature into an existing GP setup. Another proposed avenue is to explore the lack of utilisation of complexity measures in simplification. So far, size is the most discussed measure, with only two pieces of prior work pointing out the benefits of using time as a measure when controlling bloat.
Book
These contributions, written by the foremost international researchers and practitioners of Genetic Programming (GP), explore the synergy between theoretical and empirical results on real-world problems, producing a comprehensive view of the state of the art in GP. In this year’s edition, the topics covered include many of the most important issues and research questions in the field, such as: opportune application domains for GP-based methods, game playing and co-evolutionary search, symbolic regression and efficient learning strategies, encodings and representations for GP, schema theorems, and new selection mechanisms. The volume includes several chapters on best practices and lessons learned from hands-on experience. Readers will discover large-scale, real-world applications of GP to a variety of problem domains via in-depth presentations of the latest and most significant results.
Chapter
Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness and plausibility, that are not easily achievable using standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions. Enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming in many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.
Article
Many diversity techniques have been developed for addressing premature convergence, which is a serious problem that stifles the search effectiveness of evolutionary algorithms. However, approaches that aim to avoid premature convergence can often take longer to discover a solution. The Genetic Marker Diversity algorithm is a new technique that has been shown to find solutions significantly faster than other approaches while maintaining diversity in genetic programming. This study provides a more in-depth analysis of the search behavior of this technique compared to other state-of-the-art methods, as well as a comparison of the performance of these techniques on a larger and more modern set of test problems.
Article
“Exploration and exploitation are the two cornerstones of problem solving by search.” More than a decade on, Eiben and Schippers' [1998] call to balance these two antagonistic cornerstones still greatly influences the research directions of evolutionary algorithms (EAs). This article revisits nearly 100 existing works and surveys how such works have answered that call. The article introduces a fresh treatment that classifies and discusses existing work within three rational aspects: (1) what and how EA components contribute to exploration and exploitation; (2) when and how exploration and exploitation are controlled; and (3) how balance between exploration and exploitation is achieved. With a more comprehensive and systematic understanding of exploration and exploitation, more research in this direction may be motivated and refined.
Article
In the field of Genetic Programming (GP), there has been a growing interest in the effects of loss of genetic diversity, which causes the whole population to converge prematurely to local optima. Improving diversity of the population is always an implicit goal of almost any basic genetic programming system. Most research in this area suggests a diversity measurement and controls this quantitative metric to maintain genetically diverse populations. This paper briefly overviews the measures used in Genetic Programming for diversity maintenance and promotion.
Article
This paper presents a novel approach to generate data-driven regression models that not only give reliable prediction of the observed data but also have smoother response surfaces and extra generalization capabilities with respect to extrapolation. These models are obtained as solutions of a genetic programming (GP) process, where selection is guided by a tradeoff between two competing objectives - numerical accuracy and the order of nonlinearity. The latter is a novel complexity measure that adopts the notion of the minimal degree of the best-fit polynomial, approximating an analytical function with a certain precision. Using nine regression problems, this paper presents and illustrates two different strategies for the use of the order of nonlinearity in symbolic regression via GP. The combination of optimization of the order of nonlinearity together with the numerical accuracy strongly outperforms “conventional” optimization of a size-related expressional complexity and the accuracy with respect to extrapolative capabilities of solutions on all nine test problems. In addition to exploiting the new complexity measure, this paper also introduces a novel heuristic of alternating several optimization objectives in a 2-D optimization framework. Alternating the objectives at each generation in such a way allows us to exploit the effectiveness of 2-D optimization when more than two objectives are of interest (in this paper, these are accuracy, expressional complexity, and the order of nonlinearity). Results of the experiments on all test problems suggest that alternating the order of nonlinearity of GP individuals with their structural complexity produces solutions that are both compact and have smoother response surfaces, and, hence, contributes to better interpretability and understanding.
Conference Paper
A new digital signature based only on a conventional encryption function (such as DES) is described which is as secure as the underlying encryption function -- the security does not depend on the difficulty of factoring and the high computational costs of modular arithmetic are avoided. The signature system can sign an unlimited number of messages, and the signature size increases logarithmically as a function of the number of messages signed. Signature size in a ‘typical’ system might range from a few hundred bytes to a few kilobytes, and generation of a signature might require a few hundred to a few thousand computations of the underlying conventional encryption function.
Conference Paper
This paper presents a simple method to control bloat which is based on the idea of strategically and dynamically creating fitness “holes” in the fitness landscape which repel the population. In particular we create holes by zeroing the fitness of a certain proportion of the offspring that have above average length. Unlike other methods where all individuals are penalised when length constraints are violated, here we randomly penalise only a fixed proportion of all the constraint-violating offspring. The paper describes the theoretical foundation for this method and reports the results of its empirical validation with two relatively hard test problems, which has confirmed the effectiveness of the approach.
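The penalty described in this abstract can be sketched in a few lines. This is an illustrative reading only, under several assumptions: fitness is maximised with 0 as the worst value, "length" means tree size, and the function name `create_fitness_holes` and the default `proportion=0.3` are hypothetical rather than taken from the paper.

```python
import random


def create_fitness_holes(sizes, fitnesses, proportion=0.3, rng=None):
    """Zero the fitness of a random share of above-average-size offspring.

    sizes and fitnesses are parallel lists, one entry per offspring.
    proportion is the probability that a larger-than-average offspring
    is penalised (an assumed default, not a value from the paper).
    """
    rng = rng or random.Random()
    mean_size = sum(sizes) / len(sizes)
    adjusted = []
    for size, fitness in zip(sizes, fitnesses):
        if size > mean_size and rng.random() < proportion:
            adjusted.append(0.0)  # offspring falls into a fitness "hole"
        else:
            adjusted.append(fitness)
    return adjusted


# Mean size is 17, so only the last two offspring can be penalised.
sizes = [3, 5, 20, 40]
fitnesses = [1.0, 2.0, 3.0, 4.0]
all_penalised = create_fitness_holes(sizes, fitnesses, proportion=1.0)
none_penalised = create_fitness_holes(sizes, fitnesses, proportion=0.0)
```

Because only a random share of the oversized offspring is penalised, large individuals can still survive when they are fit, which is the contrast the abstract draws with methods that penalise every violator.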
Article
Most evolutionary optimization models incorporate a fitness evaluation that is based on a predefined static set of test cases or problems. In the natural evolutionary process, selection is of course not based on a static fitness evaluation. Organisms do not have to combat every existing disease during their lifespan; organisms of one species may live in different or changing environments; different species coevolve. This leads to the question of how information is integrated over many generations. This study focuses on the effects of different fitness evaluation schemes on the types of genotypes and phenotypes that evolve. The evolutionary target is a simple numerical function. The genetic representation is in the form of a program (i.e., a functional representation, as in genetic programming). Many different programs can code for the same numerical function. In other words, there is a many-to-one mapping between “genotypes” (the programs) and “phenotypes”. We compare fitness evaluation based on a large static set of problems and fitness evaluation based on small coevolving sets of problems. In the latter model very little information is presented to the evolving programs regarding the evolutionary target per evolutionary time step. In other words, the fitness evaluation is very sparse. Nevertheless the model produces correct solutions to the complete evolutionary target in about half of the simulations. The complete evaluation model, on the other hand, does not find correct solutions to the target in any of the simulations. More important, we find that sparse evaluated programs are better generalizable compared to the complete evaluated programs when they are evaluated on a much denser set of problems. In addition, the two evaluation schemes lead to programs that differ with respect to mutational stability; sparse evaluated programs are less stable than complete evaluated programs.
Article
In this paper we address the problem of defining a measure of diversity for a population of individuals whose genome can be subjected to major reorganizations during the evolutionary process. To this end, we introduce a measure of diversity for populations of strings of variable length defined on a finite alphabet, and from this measure we derive a semi-metric distance between pairs of strings. The definitions are based on counting the number of substrings of the strings, considered first separately and then collectively. This approach is related to the concept of linguistic complexity, whose definition we generalize from single strings to populations. Using the substring count approach we also define a new kind of Tanimoto distance between strings. We show how to extend the approach to representations that are not based on strings and, in particular, to the tree-based representations used in the field of genetic programming. We describe how suffix trees can allow these measures and distances to be implemented with a computational cost that is linear in both space and time relative to the length of the strings and the size of the population. The definitions were devised to assess the diversity of populations having genomes of variable length and variable structure during evolutionary computation runs, but applications in quantitative genomics, proteomics, and pattern recognition can be also envisaged.
Article
Examines measures of diversity in genetic programming. The goal is to understand the importance of such measures and their relationship with fitness. Diversity methods and measures from the literature are surveyed and a selected set of measures are applied to common standard problem instances in an experimental study. Results show the varying definitions and behaviors of diversity and the varying correlation between diversity and fitness during different stages of the evolutionary process. Populations in the genetic programming algorithm are shown to become structurally similar while maintaining a high amount of behavioral differences. Conclusions describe what measures are likely to be important for understanding and improving the search process and why diversity might have different meaning for different problem domains.
Article
Much effort has been put into understanding the artificial evolutionary dynamics within genetic programming (GP). However, it remains unclear which elements make GP so powerful. This paper presents an attempt to study the evolution of a population of computer programs using HeuristicLab. A newly developed methodology for recording heredity information, based on a general conceptual framework of evolution, is employed for the analysis of algorithm behavior on a symbolic regression benchmark problem. In our example, we find the complex interplay between selection and crossover to be the cause for size increase in the population, as the average amount of genetic information transmitted from parents to offspring remains constant and independent of run constraints (i.e., tree size and depth limits). Empirical results reveal many interesting details and confirm the validity and generality of our approach, as a tool for understanding the complex aspects of GP.
Conference Paper
Many studies emphasize the importance of genetic diversity and the need for an appropriate tuning of selection pressure in genetic programming. Additional important aspects are the performance and effects of the genetic operators (crossover and mutation) on the transfer and stabilization of inherited information blocks during the run of the algorithm. In this context, different ideas about the usage of lineage and genealogical information for improving genetic programming have taken shape in the last decade. Our work builds on those ideas by introducing an evolution tracking framework for assembling genealogical and inheritance graphs of populations. The proposed approach allows detailed investigation of phenomena related to building blocks, size evolution, ancestry and diversity. We introduce the notion of genetic fragments to represent subtrees that are affected by reproductive operators (mutation and crossover) and present a methodology for tracking such fragments using flexible similarity measures. A fragment matching algorithm was designed to work on both structural and semantic levels, allowing us to gain insight into the exploratory and exploitative behavior of the evolutionary process. The visualization part which is the subject of this paper integrates with the framework and provides an easy way of exploring the population history. The paper focuses on a case study in which we investigate the evolution of a solution to a symbolic regression benchmark problem.
Article
A new method is presented for flexible regression modeling of high dimensional data. The model takes the form of an expansion in product spline basis functions, where the number of basis functions as well as the parameters associated with each one (product degree and knot locations) are automatically determined by the data. This procedure is motivated by the recursive partitioning approach to regression and shares its attractive properties. Unlike recursive partitioning, however, this method produces continuous models with continuous derivatives. It has more power and flexibility to model relationships that are nearly additive or involve interactions in at most a few variables. In addition, the model can be represented in a form that separately identifies the additive contributions and those associated with the different multivariable interactions.
Conference Paper
A new bottom-up distance measure for labeled trees, which is based on the largest common forest of the trees and has the threefold advantage of independence of particular edit costs, low complexity, and coverage of ordered and unordered trees, is introduced and related in this paper with other distance measures published in the literature. Algorithms for computing the bottom-up distance in time linear in the number of nodes are given in full detail. Key words: design and analysis of algorithms, combinatorial problems, graph algorithms, pattern matching, tree pattern matching, tree isomorphism, subtree isomorphism, edit distance, metric space, largest common forest.
Article
Multi-objective evolutionary algorithms (MOEAs) that use non-dominated sorting and sharing have been criticized mainly for: (1) their O(MN^3) computational complexity (where M is the number of objectives and N is the population size); (2) their non-elitism approach; and (3) the need to specify a sharing parameter. In this paper, we suggest a non-dominated sorting-based MOEA, called NSGA-II (Non-dominated Sorting Genetic Algorithm II), which alleviates all of the above three difficulties. Specifically, a fast non-dominated sorting approach with O(MN^2) computational complexity is presented. Also, a selection operator is presented that creates a mating pool by combining the parent and offspring populations and selecting the best N solutions (with respect to fitness and spread). Simulation results on difficult test problems show that NSGA-II is able, for most problems, to find a much better spread of solutions and better convergence near the true Pareto-optimal front compared to the Pareto-archived evolution strategy and the strength-Pareto evolutionary algorithm - two other elitist MOEAs that pay special attention to creating a diverse Pareto-optimal front. Moreover, we modify the definition of dominance in order to solve constrained multi-objective problems efficiently. Simulation results of the constrained NSGA-II on a number of test problems, including a five-objective, seven-constraint nonlinear problem, are compared with another constrained multi-objective optimizer, and the much better performance of NSGA-II is observed.
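As a rough illustration of the sorting step mentioned in this abstract, the sketch below ranks objective vectors into successive non-dominated fronts using pairwise dominance checks, which costs on the order of M·N² comparisons. It assumes all objectives are minimised and that solutions are plain tuples. The function names are hypothetical, and crowding distance and the selection operator are not shown.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: every objective is minimised.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def fast_non_dominated_sort(points):
    """Return a list of fronts, each a list of indices into points."""
    n = len(points)
    dominated = [[] for _ in range(n)]  # indices that point p dominates
    counts = [0] * n                    # how many points dominate p
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if dominates(points[p], points[q]):
                dominated[p].append(q)
            elif dominates(points[q], points[p]):
                counts[p] += 1
        if counts[p] == 0:
            fronts[0].append(p)         # nothing dominates p: first front
    i = 0
    while fronts[i]:
        next_front = []
        for p in fronts[i]:
            for q in dominated[p]:
                counts[q] -= 1
                if counts[q] == 0:      # all dominators already ranked
                    next_front.append(q)
        fronts.append(next_front)
        i += 1
    return fronts[:-1]                  # drop the trailing empty front


points = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
fronts = fast_non_dominated_sort(points)
```

For the five points above, the first three are mutually non-dominated, `(3, 3)` is dominated only by `(2, 2)`, and `(4, 4)` is dominated by both `(2, 2)` and `(3, 3)`, so three fronts result.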
Article
Genetic programming is an evolutionary optimization method that produces functional programs to solve a given task. These programs commonly take the form of trees representing LISP s-expressions, and a typical evolutionary run produces a great many of these trees. For this reason, a good tree-generation algorithm is very important to genetic programming. This paper presents two new tree-generation algorithms for genetic programming and for “strongly typed” genetic programming, a common variant. These algorithms are fast, allow the user to request specific tree sizes, and guarantee probabilities of certain nodes appearing in trees. The paper analyzes these two algorithms, and compares them with traditional and recently proposed approaches.
Ekárt, A., Németh, S.Z.: Maintaining the diversity of genetic programs. In Foster, J.A., Lutton, E., Miller, J., Ryan, C., Tettamanzi, A.G.B., eds.: Genetic Programming, Proceedings of the 5th European Conference, EuroGP 2002. Volume 2278 of LNCS, Kinsale, Ireland, Springer-Verlag (2002) 162-171
Jackson, D.: Phenotypic diversity in initial genetic programming populations. In Esparcia-Alcazar, A.I., Ekárt, A., Silva, S., Dignum, S., Uyar, A.S., eds.: Proceedings of the 13th European Conference on Genetic Programming, EuroGP 2010. Volume 6021 of LNCS, Istanbul, Springer (2010) 98-109
Dignum, S., Poli, R.: Crossover, sampling, bloat and the harmful effects of size limits. In O'Neill, M., Vanneschi, L., Gustafson, S., Esparcia Alcazar, A.I., De Falco, I., Della Cioppa, A., Tarantino, E., eds.: Proceedings of the 11th European Conference on Genetic Programming, EuroGP 2008. Volume 4971 of LNCS, Naples, Springer (2008) 158-169