arXiv:1804.09331v1 [cs.NE] 25 Apr 2018
Where are we now?
A large benchmark study of recent symbolic regression methods
PATRYK ORZECHOWSKI, University of Pennsylvania
WILLIAM LA CAVA, University of Pennsylvania
JASON H. MOORE, University of Pennsylvania
In this paper we provide a broad benchmarking of recent genetic programming approaches to symbolic regression in the context of
state of the art machine learning approaches. We use a set of nearly 100 regression benchmark problems culled from open source
repositories across the web. We conduct a rigorous benchmarking of four recent symbolic regression approaches as well as nine
machine learning approaches from scikit-learn. The results suggest that symbolic regression performs strongly compared to state-of-
the-art gradient boosting algorithms, although in terms of running times is among the slowest of the available methodologies. We
discuss the results in detail and point to future research directions that may allow symbolic regression to gain wider adoption in the
machine learning community.
CCS Concepts: • Computing methodologies → Classification and regression trees; Genetic programming; Ensemble methods; Cross-validation;
Additional Key Words and Phrases: symbolic regression, benchmarking, machine learning, genetic programming
ACM Reference Format:
Patryk Orzechowski, William La Cava, and Jason H. Moore. 2018. Where are we now? A large benchmark study of recent symbolic
regression methods. 1, 1 (April 2018), 12 pages.
Since the beginning of the field, the genetic programming (GP) community has considered the task of symbolic re-
gression (SR) as a basis for methodology research and as a primary application area. GP-based SR (GPSR) has pro-
duced a number of notable results in real-world regression applications, for example dynamical system modeling in
physics [28], biology [30], industrial wind turbines [19], fluid dynamics [18], robotics [2], climate change forecast-
ing [33], and financial trading [17], among others. However, the most prevalent use of GPSR is in the experimental
analysis of new methods, for which SR provides a convenient platform for benchmarking. Despite this persistent use,
several shortcomings of SR benchmarking are notable. First, the GP community lacks a unified standard for SR bench-
mark datasets, as noted previously [21]. Several SR benchmarks have been proposed [17, 23, 36], critiqued [6, 21], and
black-listed [21], leading to inconsistencies in the experimental design of papers. In addition to a lack of consensus for
Authors’ addresses: Patryk Orzechowski, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA, patryk.orzechowski@gmail.com;
William La Cava, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA; Jason H. Moore, University
of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104.
© 2018 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
Manuscript submitted to ACM
benchmark datasets, there is not a standard set of benchmark algorithms against which new methods are compared. As
a result, it is typical for researchers to design or choose their own set of algorithms to compare to proposed methods,
and it is up to reviewers and readers to assess the validity of the comparison. Experiments typically consider single val-
ues for GP hyperparameters such as population size or crossover rate, which increases the uncertainty of results even
further. These practices make it nearly impossible to judge a new method outside the narrow scope of the experimental
Of course, there are shortcomings to focusing on benchmarks as well, as noted by others [9, 38]. Putting too much
focus on benchmarking may stifle innovation or lead to a lack of generalization to new tasks. However, the evidence
suggests that the GP community is far from being overly focused on benchmarking. A 2012 survey of GP papers in
EuroGP and GECCO from 2009 - 2011 reported the average number of SR problems per paper to be 2.4 [21]; 26.2% of
papers relied on the quartic polynomial problem, which has since been black-listed for being too trivial [38]. We con-
tend that the lack of focus in the GP community on rigorous benchmarking makes it hard to know how GPSR methods
fit into the broader machine learning (ML) community. This lack of clarity also impedes the adoption of advancements
to traditional GP techniques, and leaves researchers unsure about which advancements will have meaningful impacts.
There have been a few efforts to conduct broad benchmarking of GP methods in the past. For example, a recent
study looked at five SR methods on a set of five synthetic and four real-world datasets [37]. Outside of GP, the efforts to
benchmark ML approaches across many problems are more frequent, although most focus on the task of classification.
Previous studies have looked at hundreds of classification methodologies [11] and up to 165 datasets [24]. Collaborative
online tools such as Kaggle and OpenML [35] have also driven ML benchmarking and adoption of new methods. These
larger benchmark studies have, for the most part, ignored GP-based methods. As a result, the GPSR field lacks a general
sense of where it stands in relation to the broader ML field in terms of expected performance.
Our goal in this study is to present initial results in our efforts to assess the performance of recent GPSR methods in
the broad context of ML regression. We benchmark the performance of four recent SR algorithms and ten established
ML approaches on a collection of 94 different real-world regression problems. For each problem we consider hyperpa-
rameter tuning via cross-validation and assess each method in terms of training error, test error, and wall-clock time.
Finally, we provide the code for the analysis in order to allow researchers to benchmark their own methods in this
framework and reproduce the results shown here.
We compare four recent GPSR methods in this benchmark and ten well-established ML regression methods. In this
section we briefly present the selected methods and describe the design of the experiment.
2.1 GP methods
A number of factors impacted our choice of these methods. Two key elements were open-source implementations and
ease of use. In addition, we wished to test different research thrusts in GP literature. The four methods encompass
different innovations to standard GPSR, including incorporation of constant optimization, semantic search drivers, and
Pareto optimization. Each method is described briefly below.
Multiple regression genetic programming (MRGP). [1] MRGP combines Lasso regression with the tree search afforded
by GP. A weight is attached to each node in each program. These weights are adapted by applying Lasso regression to
the entire program trace. MRGP uses point mutation and sub-tree crossover for variation and NSGA-II for selection.
We use the version implemented in FlexGP.
ϵ-Lexicase selection (EPLEX). [20] ϵ-lexicase selection adapts the lexicase selection method [32] for regression. Rather
than aggregating performance on the training set into a single fitness score, EPLEX selects parents by filtering the
population through randomized orderings of training samples and removing individuals that are not within ϵ of the
best performance in the pool. We use the EPLEX method implemented in ellyn. Ellyn is a stack-based GP system
written in C++ with a Python interface for use with scikit-learn. It uses point mutation and subtree crossover. Weights
in the programs are trained each generation via stochastic hill climbing. A Pareto archive of trade-offs between mean
squared error and complexity is kept during each run, and a small internal validation fold is used to select the final
model returned by the search process.
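The parent selection step described above can be sketched as follows. This is a minimal illustration and not the ellyn implementation: the function name `epsilon_lexicase_select`, the fixed `epsilon` argument, and the error-matrix layout are assumptions made for the example.

```python
import random


def epsilon_lexicase_select(errors, epsilon, rng=random):
    """Pick one parent index.

    errors[i][j] is the absolute error of individual i on training case j.
    epsilon is a fixed tolerance here (an assumption for this sketch).
    """
    pool = list(range(len(errors)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)  # randomized ordering of training samples
    for case in cases:
        best = min(errors[i][case] for i in pool)
        # keep only individuals within epsilon of the best error on this case
        pool = [i for i in pool if errors[i][case] <= best + epsilon]
        if len(pool) == 1:
            break
    return rng.choice(pool)


# Hypothetical error matrix: 3 individuals, 2 training cases.
errors = [
    [0.00, 5.0],  # best on case 0, poor on case 1
    [0.05, 0.0],  # within epsilon on case 0, best on case 1
    [3.00, 3.0],  # mediocre on both cases
]
parent = epsilon_lexicase_select(errors, epsilon=0.1, rng=random.Random(0))
```

In this toy population the second individual survives the filter under either case ordering, even though it is not the single best performer on case 0. That is the behavior a nonzero ϵ is meant to allow.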
Age-fitness Pareto Optimization (AFP). [29] AFP is a selection scheme based on the concept of age-layered
populations introduced by Hornby et al. [15]. AFP introduces a new individual each generation with an age of 0. An
individual’s age is updated each generation to reflect the number of generations since its oldest node (gene) entered
the population. Parent selection is random and Pareto tournaments are used for survival on the basis of age and fitness.
We use the version of AFP implemented in ellyn, with the same settings described above.
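A Pareto tournament on age and fitness can be sketched as below. The two-way tournament size, the `(age, error)` tuple layout, and the function names are assumptions for illustration, not details taken from ellyn.

```python
def dominates(a, b):
    """True if a is no worse than b in both age and error, and better in one.

    Individuals are (age, error) tuples; lower is better for both.
    """
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_tournament(a, b):
    """Return the survivors of a two-way tournament."""
    if dominates(a, b):
        return [a]
    if dominates(b, a):
        return [b]
    return [a, b]  # neither dominates, so both survive


young_poor = (0, 4.0)   # freshly introduced individual, age 0
old_good = (12, 0.5)    # older lineage with low error
old_poor = (12, 4.5)    # older lineage with high error

survivors = pareto_tournament(young_poor, old_good)
```

A young individual with a high error is not eliminated by an older, fitter one, because it wins on the age objective. This is how new genetic material is protected from immediate removal.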
Geometric Semantic Genetic Programming (GSGP). [22] GSGP is a recent method that has shown many promising
results for SR and other tasks. The main concept behind GSGP is the use of semantic variation operators that produce
offspring whose semantics lie on the vector between the semantics of the parent and the target semantics (i.e. target
labels). Use of these variation operators has the advantage of creating a unimodal fitness landscape. On the downside,
the variation operators result in exponential growth of programs. We use the version of GSGP implemented in C++ by
Castelli et al. [4], which is optimized to minimize memory usage. It is available from SourceForge.
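To make the idea of a semantic variation operator concrete, the sketch below applies one commonly described form of geometric semantic crossover directly to output vectors (semantics). It is an illustration under assumptions, not the C++ implementation used in the study: the scalar mixing weight `r`, the helper names, and the toy vectors are all hypothetical.

```python
import math
import random


def semantic_crossover(sem_a, sem_b, r):
    """Convex combination of two parents' output vectors, with 0 <= r <= 1."""
    return [r * a + (1.0 - r) * b for a, b in zip(sem_a, sem_b)]


def distance(sem, target):
    """Euclidean distance between a semantics vector and the target labels."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(sem, target)))


target = [1.0, 2.0, 3.0]      # target labels
parent_a = [0.0, 2.5, 5.0]    # outputs of parent A on the training samples
parent_b = [2.0, 1.0, 3.5]    # outputs of parent B on the training samples

r = random.Random(0).random()
child = semantic_crossover(parent_a, parent_b, r)
worst_parent = max(distance(parent_a, target), distance(parent_b, target))
```

Because the child lies on the segment between its parents in semantic space, it is never farther from the target than the worse parent. In program space the child expression embeds both parent programs, which is the source of the program growth noted above.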
2.2 ML methods
We use scikit-learn [26] implementations of the following methods in this study:
Linear Regression. Linear Regression is a simple model of regression that minimizes the sum of the squared errors
of a linear model of inputs. The model is defined by $\hat{y} = b + w^T x$, where $y$ is a dependent variable (target), $x$ are
explanatory variables, $b$ and $w$ are intercept and slope variables, and the minimized function is equal to (1).
$$C_{LR}(w) = \frac{1}{2} \sum_i (y_i - \hat{y}_i)^2 \qquad (1)$$
Kernel Ridge. Kernel Ridge [27] performs Ridge regression using a linear function in the space of the respective
kernel. Least squares with l2-norm regularization is applied in order to prevent overfitting. The minimized function is
equal to (2), where $\phi$ is a kernel function and $\lambda$ is the regularization parameter.
$$C_{KR}(w) = \frac{1}{2} \sum_i (y_i - w^T \phi(x_i))^2 + \frac{1}{2} \lambda \|w\|^2 \qquad (2)$$
Least-angle regression with Lasso. Lasso (Least absolute shrinkage and selection operator) is a popular method of
regression that applies both feature selection and regularization [34]. Similarly to Kernel Ridge, high values of w are
penalized. The use of the l1-norm on w in the minimization function (see (3)) improves the ability to push individual
weights to zero, effectively performing feature selection.
Least-angle regression with Lasso, a.k.a. Lars [10], is an efficient algorithm for producing a family of Lasso solutions.
It is able to compute the exact values of λ for new variables entering the model.
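The way an l1 penalty pushes weights to exactly zero can be seen in the one-dimensional case. The sketch below is not the Lars algorithm. It only contrasts the closed-form minimizers of a single-weight problem under an l1 and an l2 penalty, and the function name and the sample weights are assumptions for illustration.

```python
def soft_threshold(z, lam):
    """Minimizer over w of 0.5 * (w - z)**2 + lam * abs(w)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small weights are set exactly to zero


lam = 0.5
unpenalized = [2.0, -0.3, 0.05, -1.5]  # hypothetical least-squares weights

# l1 penalty: shrink by lam and clip at zero
lasso_like = [soft_threshold(z, lam) for z in unpenalized]

# l2 penalty: minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2 is z / (1 + lam)
ridge_like = [z / (1.0 + lam) for z in unpenalized]
```

With the l1 penalty the two small weights become exactly zero, which is the feature selection effect. The l2 penalty shrinks every weight but leaves all of them nonzero.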
Linear SVR. Linear Support Vector Regression extends the concept of Support Vector Classifiers (SVC) to the task of
regression, i.e. to predict real values instead of classes. Its objective is to minimize an ϵ-insensitive loss function with
a regularization penalty ($\frac{1}{2}\|w\|^2$) in order to improve generalization [31].
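For reference, the ϵ-insensitive loss and its squared variant (both appear in the LinearSVR grid in Table 1) can be written per sample as follows. The function names are chosen for this sketch.

```python
def epsilon_insensitive(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def squared_epsilon_insensitive(y_true, y_pred, epsilon):
    """Square of the epsilon-insensitive loss."""
    return epsilon_insensitive(y_true, y_pred, epsilon) ** 2


inside_tube = epsilon_insensitive(1.0, 1.05, epsilon=0.1)
outside_tube = epsilon_insensitive(1.0, 2.0, epsilon=0.5)
outside_tube_squared = squared_epsilon_insensitive(1.0, 2.0, epsilon=0.5)
```

Errors smaller than ϵ cost nothing, so only samples outside the tube influence the fitted weights.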
SGD Regression. SGD Regression implements stochastic gradient descent and is especially well suited for larger
problems with over 10,000 instances [26]. We include it nonetheless, to assess its performance
on smaller datasets.
MLP Regressor. Neural networks have been applied to regression problems for almost three decades [14]. We include
multilayer perceptrons (MLPs) as one of the benchmarked algorithms. We decided to benchmark a neural network with
a single hidden layer with a fixed number of neurons (100) and compare different activation functions, learning rate
schedules, and solvers, including the novel adam solver [16].
AdaBoost regression. Adaptive Boosting, also called AdaBoost [8, 12], is a flexible technique for combining a set of
weak learners into a single stronger regressor. By changing the distribution (i.e. weights) of instances in the data,
instances that were previously predicted poorly are favored in consecutive iterations. The final prediction is obtained by a weighted sum
or weighted majority voting. As a result, the final regressor has smaller prediction errors. The method is considered
sensitive to outliers.
Random Forest regression. Random Forests [3] are a very popular ensemble method based on combining multiple
decision trees into a single stronger predictor. Each tree is trained independently with a randomly selected subset of the
instances, in a process known as bootstrap-aggregating or bagging. The resulting prediction is an average of multiple
predictions. Random forests try to reduce variance by not allowing decision trees to grow large, making them harder
to overfit.
Gradient Boosting regression. Gradient Boosting [13] is an ensemble method that is based on regression trees. It
shares the AdaBoost concept of iteratively improving the system performance on its weakest points. In contrast to
AdaBoost, the distribution of the samples remains the same. Instead, consecutively created trees correct the errors of
the previous ones. Gradient Boosting minimizes bias (rather than variance, as in Random Forests). In comparison to Random
Forests, Gradient Boosting is sequential (thus slower), more difficult to train, but is reported to perform better than
Random Forest [24].
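The statement that consecutively created trees correct the errors of the previous ones can be illustrated with a very small boosting loop. This is a sketch under strong simplifying assumptions (one input feature, depth-one trees, squared loss, hand-picked toy data) and is not the scikit-learn implementation.

```python
def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Return the training MSE after each boosting round."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean prediction
    history = [mse(ys, preds)]
    for _ in range(n_estimators):
        # residuals are the negative gradient of the squared loss
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return history


xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]  # toy target
history = boost(xs, ys)
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error never increases from one round to the next. The sample weights are never changed, which is the contrast with AdaBoost drawn above.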
Extreme Gradient Boosting. Extreme Gradient Boosting, also known as XGBoost [5], incorporates regularization into
the Gradient Boosting algorithm in order to control overfitting. Its objective function combines the optimization of
training loss with model complexity. This brings the predictor closer to the underlying distribution of the data, while
encouraging simple models, which have smaller variance. Extreme gradient boosting is considered a state-of-the-art
method in ML.
2.3 Datasets
We pulled the benchmark datasets from the Penn Machine Learning Benchmark (PMLB) [25] repository, which
contains a large collection of standardized datasets for classification and regression problems. This repository overlaps
heavily with datasets from UCI, OpenML, and Kaggle. In this paper we considered regression problems only, of which
there are 120 total. For our experiment, we removed datasets with 3000 instances or more (22 datasets) and two others
for which at least one of the methods failed to provide the required number of results in feasible time (i.e. 72 hours
on Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz). This left a collection of 94 datasets in total. The distribution of the
number of instances and the number of features in the collection of the datasets is presented in Fig. 1.
Fig. 1. Basic characteristics of the datasets used in the study (axes: number of instances, number of features).
2.4 Experiment design
In order to benchmark different regression methods, an effort was made to measure performance of each of the methods
in as similar an environment as possible. First, we decided to treat each of the GP methods as a classical ML approach
and used the scikit-learn library [26] for cross validation and hyperparameter optimization. This required some source
code modifications to allow GSGP and MRGP to communicate with the wrapper. Second, instead of reimplementing the
algorithms, we relied on the original implementations with as few modifications as possible. Wrapping each method
allowed us to keep a common benchmarking framework based on the scikit-learn functions.
The datasets were divided in the following way: 75% of samples in each of the datasets were used for training,
whereas the remaining 25% were used for testing. We used grid search to tune the hyperparameters of each method.
Each method was run with a preset grid of input parameters, detailed in Table 1. The optimal setting of the parameters
was determined based on 5-fold cross-validation performed on training data only. The setting with the best $R^2$ score
across all folds was used for training the algorithms on the whole training data. The performance of the methods was
Table 1. Analyzed algorithms with their parameter settings. The parameters in quotations refer to their names in scikit-learn.

Algorithm name             Parameter                         Values
eplex, afp, mrgp           pop size / generations            {100/1000, 1000/100}
                           max program length / max depth    {64 / 6}
                           crossover rate                    {0.2, 0.5, 0.8}
                           mutation rate                     1 - crossover rate
gsgp                       pop size / generations            {100/1000, 200/500, 1000/100}
                           initial depth                     {6}
                           crossover rate                    {0.0, 0.1, 0.2}
                           mutation rate                     1 - crossover rate
eplex_1M                   pop size / generations            {500/2000, 1000/1000, 2000/500}
                           max program length                {100}
                           crossover rate                    {0.2, 0.5, 0.8}
                           mutation rate                     1 - crossover rate
AdaBoostRegressor          ‘n_estimators’                    {10, 100, 1000}
                           ‘learning_rate’                   {0.01, 0.1, 1, 10}
GradientBoostingRegressor  ‘n_estimators’                    {10, 100, 1000}
                           ‘min_weight_fraction_leaf’        {0.0, 0.25, 0.5}
                           ‘max_features’                    {‘sqrt’, ‘log2’, None}
KernelRidge                ‘kernel’                          {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}
                           ‘alpha’                           {1e-4, 1e-2, 0.1, 1}
                           ‘gamma’                           {0.01, 0.1, 1, 10}
LassoLARS                  ‘alpha’                           {1e-04, 0.001, 0.01, 0.1, 1}
LinearRegression           default                           default
MLPRegressor               ‘activation’                      {‘logistic’, ‘tanh’, ‘relu’}
                           ‘solver’                          {‘lbfgs’, ‘adam’, ‘sgd’}
                           ‘learning_rate’                   {‘constant’, ‘invscaling’, ‘adaptive’}
RandomForestRegressor      ‘n_estimators’                    {10, 100, 1000}
                           ‘min_weight_fraction_leaf’        {0.0, 0.25, 0.5}
                           ‘max_features’                    {‘sqrt’, ‘log2’, None}
SGDRegressor               ‘alpha’                           {1e-06, 1e-04, 0.01, 1}
                           ‘penalty’                         {‘l2’, ‘l1’, ‘elasticnet’}
LinearSVR                  ‘C’                               {1e-06, 1e-04, 0.1, 1}
                           ‘loss’                            {‘epsilon_insensitive’, ‘squared_epsilon_insensitive’}
XGBoost                    ‘n_estimators’                    {10, 50, 100, 250, 500, 1000}
                           ‘learning_rate’                   {1e-4, 0.01, 0.05, 0.1, 0.2}
                           ‘gamma’                           {0, 0.1, 0.2, 0.3, 0.4}
                           ‘max_depth’                       {6}
                           ‘subsample’                       {0.5, 0.75, 1}
measured on both training and testing datasets on the best model obtained during cross-validation. We repeated the
entire experiment 10 times for each method and dataset.
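The protocol described in this section (75/25 split, grid search with 5-fold cross-validation on the training part scored by $R^2$, refit on the whole training set, MSE on both parts) can be sketched with scikit-learn as follows. This is an illustration and not the authors' benchmark code: the synthetic dataset, the choice of LassoLars as the example regressor, the random seeds, and fitting the scaler on the training part only are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLars
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one benchmark dataset (assumption for this sketch).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 75% of the samples for training, the remaining 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=0
)

# Scale the inputs; fitting the scaler on training data only is an assumption.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Grid taken from the LassoLARS row of Table 1.
param_grid = {"alpha": [1e-4, 0.001, 0.01, 0.1, 1]}

# 5-fold CV on training data only, best setting chosen by R^2.
# With the default refit=True the best setting is retrained on all training data.
search = GridSearchCV(LassoLars(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, search.predict(X_train))
test_mse = mean_squared_error(y_test, search.predict(X_test))
```

Repeating this block with different random splits and collecting `train_mse` and `test_mse` per repetition gives the kind of per-dataset scores that are later summarized by their median.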
Because of time constraints, we decided to run each of the GP-based methods for 100,000 evaluations (population
size x number of generations). Additionally, we generated results for 1 million evaluations using EPLEX (referred to
as EPLEX_1M) in order to assess how much a more thorough training of a GP-based regressor would improve its
Data preprocessing. We decided to feed benchmarked algorithms with scaled data using the StandardScaler function
from scikit-learn. The reason for this is our effort to keep the format of the input data consistent across multiple
algorithms for the purpose of benchmarking. The choice of the optimal preprocessing method for the particular regressor
is outside the scope of this paper.
Fig. 2. Ranking of the performance of the algorithms based on the MSE score on training datasets (plot title: median ranking for training set).
Initialization of the algorithms. We initially considered starting each of the methods with the same random seed,
but eventually decided to make all data splits randomly. We believe both approaches have disadvantages: the results
will be biased either by the choice of the random seed or by using different splits for different methods. By taking the
median of the scores we reduce the dependence of the results on the initial split of the data.
Wrappers for the GP methods. Some modifications had to be made to each of the GP methods. For EPLEX and
AFP, ellyn provides an existing Python wrapper, which we used. For the other methods we implemented a class derived
from the scikit-learn BaseEstimator, which implemented two methods: fit(), used for training the regressor, and predict(),
used for testing the performance of the regressor. The source code of MRGP and GSGP had to be modified so that the
algorithms could communicate with the wrapper.
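A wrapper of the kind described here might look like the sketch below. The class name `ExternalGPRegressor` and its hyperparameter names are hypothetical, and the body of fit() is a stand-in: a real wrapper would pass the data and settings to the external GP program and read back the evolved model, whereas this sketch stores an ordinary least-squares fit so that it runs on its own.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class ExternalGPRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical scikit-learn wrapper around an external GP program."""

    def __init__(self, popsize=100, g=1000, rt_cross=0.5):
        # Hyperparameters are stored unchanged so that get_params(),
        # set_params() and grid search work as scikit-learn expects.
        self.popsize = popsize
        self.g = g
        self.rt_cross = rt_cross

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Placeholder for the external run (assumption): a real wrapper would
        # launch the GP executable here. We fit a linear model instead.
        design = np.column_stack([np.ones(len(X)), X])
        self.coef_, *_ = np.linalg.lstsq(design, y, rcond=None)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        design = np.column_stack([np.ones(len(X)), X])
        return design @ self.coef_


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1]
model = ExternalGPRegressor(rt_cross=0.8).fit(X_demo, y_demo)
predictions = model.predict(X_demo)
```

Because the class follows the estimator conventions, it can be passed to scikit-learn's cross-validation and grid-search utilities in the same way as the built-in regressors.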
Parameters for the algorithms. The settings of the input parameters for the algorithms were determined based on the
available recommendations for the given method, as well as previous experience of the authors. For GP-based methods
we applied from 6 to 9 different settings (mainly: population size x number of generations and crossover and mutation
rates). For the ML algorithms the number of settings was method dependent. The largest grid of parameters was
used for the XGBoost method. The exact parameters for the methods used in this study can be found in Table 1.
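The number of settings per method follows directly from the grids in Table 1. The short check below enumerates them; the dictionary layout and key names are assumptions made for the example, while the values are copied from the table.

```python
from itertools import product

gp_grids = {
    "eplex, afp, mrgp": {
        "pop_gen": [(100, 1000), (1000, 100)],
        "crossover": [0.2, 0.5, 0.8],
    },
    "gsgp": {
        "pop_gen": [(100, 1000), (200, 500), (1000, 100)],
        "crossover": [0.0, 0.1, 0.2],
    },
    "eplex_1M": {
        "pop_gen": [(500, 2000), (1000, 1000), (2000, 500)],
        "crossover": [0.2, 0.5, 0.8],
    },
}

xgboost_grid = {
    "n_estimators": [10, 50, 100, 250, 500, 1000],
    "learning_rate": [1e-4, 0.01, 0.05, 0.1, 0.2],
    "gamma": [0, 0.1, 0.2, 0.3, 0.4],
    "max_depth": [6],
    "subsample": [0.5, 0.75, 1],
}


def n_settings(grid):
    """Number of combinations in a full grid."""
    return len(list(product(*grid.values())))


gp_counts = {name: n_settings(grid) for name, grid in gp_grids.items()}
# Evaluation budget of each setting: population size times generations.
budgets = {
    name: {pop * gen for pop, gen in grid["pop_gen"]}
    for name, grid in gp_grids.items()
}
xgboost_count = n_settings(xgboost_grid)
```

The GP grids contain 6 or 9 combinations each, every setting except those of eplex_1M corresponds to 100,000 evaluations, and the XGBoost grid has 450 combinations.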
We present aggregated results of the benchmarked algorithms on the collection of 94 regression datasets in Figures 2-3.
The relative performance of the algorithms was determined as the ability to make the best predictions on the training
and testing data using mean squared error (MSE) of the samples. The performance on the testing dataset is of primary
importance, as it shows how well the methods can generalize to previously unseen data [7]. However, we include the
training comparisons as a way to assess the predilection for overfitting among methods.
We first analyze the results for each of the regression tasks on the training data. The relative rankings of each
method in terms of MSE are presented in Fig. 2. The best training performance was obtained with gradient boosting,
which finished in the top two for the vast majority of the benchmarked datasets. The second best method across all the
Fig. 3. Ranking of the performance of the algorithms based on the MSE score on testing datasets (plot title: median ranking for testing set).
datasets was XGBoost. The top-performing GP method across all the datasets was MRGP, which held third place on
average across training sets.
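The ranking used in Figures 2 and 3 can be computed as sketched below: methods are ranked by MSE within each dataset, and the ranks are then summarized per method by their median. The method names and MSE values are made up for illustration and are not results from this study. Breaking ties by method name is also an assumption.

```python
from statistics import median


def rank_methods(mse_by_method):
    """Rank 1 is the lowest MSE on this dataset (ties broken by name)."""
    ordered = sorted(mse_by_method, key=lambda m: (mse_by_method[m], m))
    return {method: rank for rank, method in enumerate(ordered, start=1)}


def median_ranks(results):
    """results is a list with one {method: mse} dictionary per dataset."""
    per_dataset = [rank_methods(r) for r in results]
    return {m: median(r[m] for r in per_dataset) for m in results[0]}


# Hypothetical MSE values for three methods on three datasets.
results = [
    {"method_a": 0.1, "method_b": 0.2, "method_c": 0.3},
    {"method_a": 0.5, "method_b": 0.4, "method_c": 0.6},
    {"method_a": 1.0, "method_b": 2.0, "method_c": 1.5},
]
summary = median_ranks(results)
```

Ranks rather than raw MSE values are compared because the scale of the error differs from dataset to dataset.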
The test data results allow us to assess how well the algorithms handle generalization as well as their level of
overfitting on training data. The relative performance of the methods changed noticeably when previously unseen
data was used for evaluation. The results are presented in Fig. 3. The best performing method on average was EPLEX-
1M. This GPSR method slightly outperformed XGBoost, which ended as the second best generalizing method across
datasets. Gradient boosting was the third best method, and MLP finished in fourth place.
Several of the methods exhibit overfitting by changing ranking between the training and testing evaluations. Gra-
dient boosting, for example, moves from first to third place. The performance of MRGP, which was one of the best
regressors on the training data, also exhibits overfitting, resulting in a drop of its average ranking from 4th to 6th.
MRGP’s results also contained the highest variance in performance on test sets. GSGP exhibits the highest level of
overfitting in terms of rank changes, dropping from 8th to 13th. Conversely, several methods appear to generalize well,
including EPLEX-1M (moving from a median ranking of 5 to 3) and Lasso (13 to 11).
We used the test set MSE scores to check for significant differences between methods across all datasets according
to a Friedman test, which produces a p-value less than 2e-16, indicating significant differences. Post-hoc pairwise tests
are then conducted and reported in Table 2. The large number of datasets provides higher statistical power than smaller
experimental studies, leading to many p-values below 0.05. EPLEX-1M statistically outperforms the highest number of
other methods (11), followed by XGBoost (9) and gradient boosting (7). We find that none of the comparisons between
EPLEX-1M, XGBoost, gradient boosting and MLP are significantly different.
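For readers unfamiliar with the Friedman test, the textbook form of its statistic can be computed from per-dataset ranks as in the sketch below. This is not the procedure used to produce Table 2, whose caption names an asymptotic general symmetry test. The function name, the no-ties assumption, and the toy scores are all assumptions for illustration.

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic.

    scores[d][m] is the MSE of method m on dataset d.
    Assumes no ties within a dataset.
    """
    n = len(scores)      # number of datasets
    k = len(scores[0])   # number of methods
    rank_sums = [0] * k
    for row in scores:
        order = sorted(range(k), key=lambda m: row[m])
        for rank, m in enumerate(order, start=1):
            rank_sums[m] += rank
    return (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Toy case 1: the same method ordering on every dataset.
consistent = [[0.1, 0.2, 0.3]] * 4
# Toy case 2: every method takes each rank exactly once.
balanced = [[0.1, 0.2, 0.3], [0.3, 0.1, 0.2], [0.2, 0.3, 0.1]]

stat_consistent = friedman_statistic(consistent)
stat_balanced = friedman_statistic(balanced)
```

Under the null hypothesis of no difference between methods the statistic is approximately chi-square distributed with k - 1 degrees of freedom, so large values such as the one from the consistent ordering indicate significant differences, and a value near zero indicates none.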
We now analyze the GP methods given equivalent numbers of fitness evaluations (AFP, MRGP, EPLEX, and GSGP).
The results between MRGP and EPLEX show no significant difference. The only noted difference is that EPLEX sig-
nificantly outperforms AFP, whereas MRGP does not. The three methods AFP, MRGP, and EPLEX all significantly
outperform GSGP. Given more fitness evaluations, EPLEX-1M significantly outperforms all the other GP experiments,
including EPLEX.
Table 2. Friedman Asymptotic General Symmetry Test. Bold indicates p<0.05.

                   eplex-1m  xgboost  gradboost  mlp    rf      eplex  mrgp    kernel-ridge  adaboost  afp     lasso-lars  linear-svr  linear-regression  sgd-regression
xgboost            1         -        -          -      -       -      -       -             -         -       -           -           -                  -
gradboost          0.9       1        -          -      -       -      -       -             -         -       -           -           -                  -
mlp                0.2       0.6      1          -      -       -      -       -             -         -       -           -           -                  -
rf                 0.003     0.05     0.6        1      -       -      -       -             -         -       -           -           -                  -
eplex              0.001     0.02     0.4        1      1       -      -       -             -         -       -           -           -                  -
mrgp               3e-07     2e-05    0.005      0.2    0.9     1      -       -             -         -       -           -           -                  -
kernel-ridge       0.0007    0.02     0.4        1      1       1      1       -             -         -       -           -           -                  -
adaboost           1e-07     4e-06    0.002      0.1    0.8     0.9    1       0.9           -         -       -           -           -                  -
afp                3e-16     7e-14    4e-10      9e-06  0.0008  0.002  0.3     0.004         0.5       -       -           -           -                  -
lasso-lars         0         0        2e-15      1e-11  1e-07   5e-07  0.002   6e-07         0.006     1       -           -           -                  -
linear-svr         0         0        0          3e-13  2e-09   3e-08  0.0002  7e-08         0.0008    0.8     1           -           -                  -
linear-regression  0         0        0          1e-14  7e-11   6e-10  5e-05   1e-09         0.0001    0.5     1           1           -                  -
sgd-regression     0         0        0          0      1e-13   4e-12  1e-07   1e-12         5e-07     0.07    0.9         1           1                  -
gsgp               0         0        0          0      0       0      2e-12   0             2e-11     0.0004  0.1         0.4         0.7                1
The comparison of running times per training task is presented in Fig. 4. Three important considerations should be
made when assessing these results. First, the experiment was conducted in a cluster environment. Second, each algo-
rithm was run on a single thread for each dataset. Thus the easily parallelized algorithms (i.e., all GP-based methods
and some ensemble tree methods) would likely show better relative performance in a multicore setting. Third, bench-
marked algorithms were implemented using different programming languages. Thus, comparison of running times
does not exclusively reflect the complexity of the methods.
Despite these considerations, it is worth noting how much additional computation time is required by the GP meth-
ods, which are one to three orders of magnitude slower than the nearest comparison. In terms of GP methods, MRGP
runs the slowest, which may be partially due to its Java implementation (the other four GP methods use C++). EPLEX-1M
is able to complete 10 times as many fitness evaluations in approximately the same time. The other three GP
methods (GSGP, AFP, and EPLEX) show similar computation times. Among other ML methods, the ensemble tree
methods and MLP are the slowest, and the linear methods are fastest, as expected.
Fig. 4. Median running time of each of the algorithms (in seconds).
The most frequent settings of the parameters picked for the best model across all trials are presented in Table 3. We
purposefully do not include Linear Regression in the table (run with the default values) or Kernel Ridge regression, for
which multiple settings of input parameters performed comparably. Note that each GP-based method besides
GSGP tended to favor large population sizes over larger numbers of generations. The optimal setting for crossover and
mutation rates varied between methods.
Table 3. Most frequently chosen parameter settings based on 5-fold cross validation across all datasets.

Algorithm name     Frequently best parameter settings
gsgp               (‘g’=500, ‘max_len’=6, ‘popsize’=200, ‘rt_cross’=0.2, ‘rt_mut’=0.8)
afp                (‘g’=100, ‘max_len’=64, ‘popsize’=1000, ‘rt_cross’=0.8, ‘rt_mut’=0.2)
mrgp               ({‘g’=100, ‘pop_size’=1000} or the opposite; ‘rt_cross’=0.2, ‘rt_mut’=0.8)
eplex              (‘g’=100, ‘max_len’=64, ‘popsize’=1000, ‘rt_cross’=0.8, ‘rt_mut’=0.2)
eplex-1m           (‘g’=500, ‘max_len’=100, ‘popsize’=2000, ‘rt_cross’=0.8, ‘rt_mut’=0.2)
xgboost            (‘gamma’=0, ‘learning_rate’=0.01, ‘max_depth’=6, ‘n_estimators’=1000, ‘subsample’=0.5)
gradboost          (‘max_features’=None, ‘min_weight_fraction_leaf’=0.0, ‘n_estimators’=1000)
mlp                (‘activation’=‘logistic’, ‘learning_rate’=‘constant’, ‘solver’=‘lbfgs’)
rf                 (‘max_features’=None, ‘min_weight_fraction_leaf’=0.0, ‘n_estimators’=1000)
adaboost           (‘learning_rate’=1.0, ‘n_estimators’=1000)
lasso-lars         (‘alpha’=0.001)
linear-svr         (‘C’=0.1, ‘loss’=‘squared_epsilon_insensitive’)
sgd-regression     (‘alpha’=0.01, ‘penalty’=‘l1’)
linear-regression  (‘fit_intercept’=True)
In this paper we evaluated four recent GPSR methods in comparison to ten state-of-the-art ML methods on a set of
94 real-world regression problems. We consider hyper-parameter optimization for each method using nested cross-
validation, and compare the methods in terms of the MSE they produce on training and testing sets, and their runtime.
The analysis includes some interesting results. The most noteworthy finding is that a GPSR method (ϵ-lexicase selection
implemented in ellyn), given 1 million fitness evaluations, achieves the best test set MSE ranking across all datasets
and methods. Two of the GP-based methods, namely EPLEX and MRGP, produce competitive results compared to
state-of-the-art ML regression approaches. The downside of the GP-based methods is their computational complexity
when run on a single thread, which contributes to much higher runtimes. Parallelism is likely to be a key factor in
allowing GP-based approaches to become competitive with leading ML methods with respect to running times.
We also should note some shortcomings of this study that motivate further analysis. First, a guiding motivation for
the use of GPSR is often its ability to produce legible symbolic models. Our analysis did not attempt to quantify the
complexity of the models produced by any of the methods. An extension of this work could establish a standardized
way of assessing this complexity, for example using the polynomial complexity method proposed by Vladislavleva et
al. [36]. Ultimately the relative value of explainability versus predictive power will depend on the application domain.
Second, we have considered real world datasets for the source of our benchmarks. Simulation studies could also be
used, and have the advantage of providing ground truth about the underlying process, as well as the ability to scale
complexity or difficulty. It should also be noted that the datasets used for this study were of relatively small sizes (up to
1000 instances). Future work should consider larger dataset sizes, although these will come with a larger computational burden.
We have also limited our initial analysis to looking at bulk performance of algorithms over many datasets. Further
analysis of these results should provide insight into the properties of datasets that make them amenable to, or difficult
for, GP-based regression. Such an analysis can provide suggestions for new problem sub-types that may be of interest
to the GP community.
We hope this study will provide the ML community with a data-driven sense of how state-of-the-art SR methods
compare broadly to other popular ML approaches to regression.
Source code for our experiment can be found at the following url: benchmark.
This work is supported by NIH grants LM010098 and AI116794.
[1] Ignacio Arnaldo, Krzysztof Krawiec, and Una-May O’Reilly. 2014. Multiple regression genetic programming. In Proceedings of the 2014 Annual
Conference on Genetic and Evolutionary Computation. ACM, 879–886.
[2] J.C. Bongard and H. Lipson. 2005. Nonlinear System Identification Using Coevolution of Models and Tests. IEEE Transactions on Evolutionary
Computation 9, 4 (Aug. 2005), 361–384. DOI:
[3] Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.
[4] Mauro Castelli, Sara Silva, and Leonardo Vanneschi. 2015. A C++ framework for geometric semantic genetic programming. Genetic Programming
and Evolvable Machines 16, 1 (March 2015), 73–81. DOI: 9218-0
[5] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, 785–794.
[6] Grant Dick, Aysha P. Rimoni, and Peter A. Whigham. 2015. A Re-Examination of the Use of Genetic Programming on the Oral Bioavailability
Problem. ACM Press, 1015–1022. DOI:
[7] Pedro Domingos. 2012. A few useful things to know about machine learning. Commun. ACM 55, 10 (2012), 78–87.
[8] Harris Drucker. 1997. Improving regressors using boosting techniques. In ICML, Vol. 97. 107–115.
[9] Chris Drummond and Nathalie Japkowicz. 2010. Warning: statistical benchmarking is addictive. Kicking the habit in machine learning. Journal of
Experimental & Theoretical Artificial Intelligence 22, 1 (March 2010), 67–80. DOI:
[10] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, and others. 2004. Least angle regression. The Annals of statistics 32, 2 (2004),
[11] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do we need hundreds of classifiers to solve real world
classification problems? J. Mach. Learn. Res 15, 1 (2014), 3133–3181.
[12] Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of
computer and system sciences 55, 1 (1997), 119–139.
[13] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.
[14] Geoffrey E Hinton. 1989. Connectionist Learning Procedures. Artificial Intelligence 40 (1989), 185–234.
[15] Gregory S Hornby. 2006. ALPS: the age-layered population structure for reducing the problem of premature convergence. In Proceedings of the 8th
annual conference on Genetic and evolutionary computation. ACM, 815–822. DOI:
[16] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[17] Michael F. Korns. 2011. Accuracy in symbolic regression. In Genetic Programming Theory and Practice IX. Springer, 129–151. 5_8
[18] William La Cava, Kourosh Danai, and Lee Spector. 2016. Inference of compact nonlinear dynamic models by epigenetic local search. Engineering
Applications of Artificial Intelligence 55 (Oct. 2016), 292–306. DOI:
[19] William La Cava, Kourosh Danai, Lee Spector, Paul Fleming, Alan Wright, and Matthew Lackner. 2016. Automatic identification
of wind turbine models using evolutionary multiobjective optimization. Renewable Energy 87, Part 2 (March 2016), 892–902. DOI:
[20] William La Cava, Lee Spector, and Kourosh Danai. 2016. Epsilon-Lexicase Selection for Regression. In Proceedings of the Genetic and Evolutionary
Computation Conference 2016 (GECCO ’16). ACM, New York, NY, USA, 741–748. DOI:
[21] James McDermott, David R. White, Sean Luke, Luca Manzoni, Mauro Castelli, Leonardo Vanneschi, Wojciech Jaskowski, Krzysztof Krawiec, Robin
Harper, and Kenneth De Jong. 2012. Genetic programming needs better benchmarks. In Proceedings of the fourteenth international conference on
Genetic and evolutionary computation conference. ACM, 791–798.
[22] Alberto Moraglio, Krzysztof Krawiec, and Colin G. Johnson. 2012. Geometric semantic genetic programming. In Parallel Problem Solving from
Nature-PPSN XII. Springer, 21–31. 3-642-32937-1_3
[23] Quang Uy Nguyen, Tuan Anh Pham, Xuan Hoai Nguyen, and James McDermott. 2015. Subtree semantic geometric crossover for genetic
programming. Genetic Programming and Evolvable Machines (Oct. 2015), 1–29. DOI: 015-9253-5
[24] Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore. 2017. Data-driven Advice for Applying Machine Learning
to Bioinformatics Problems. In Pacific Symposium on Biocomputing (PSB). arXiv: 1708.05070.
[25] Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. 2017. PMLB: A Large Benchmark Suite for
Machine Learning Evaluation and Comparison. BioData Mining (2017). arXiv preprint arXiv:1703.00512.
[26] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer,
Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011),
[27] Christian Robert. 2014. Machine learning, a probabilistic perspective. (2014).
[28] Michael Schmidt and Hod Lipson. 2009. Distilling free-form natural laws from experimental data. Science 324, 5923 (2009), 81–85.
[29] Michael Schmidt and Hod Lipson. 2011. Age-fitness pareto optimization. In Genetic Programming Theory and Practice VIII. Springer, 129–146. 2_8
[30] Michael D Schmidt, Ravishankar R Vallabhajosyula, Jerry W Jenkins, Jonathan E Hood, Abhishek S Soni, John P Wikswo, and Hod Lipson.
2011. Automated refinement and inference of analytical models for metabolic networks. Physical Biology 8, 5 (Oct. 2011), 055011. DOI:
[31] Alex J Smola and Bernhard Schölkopf. 2004. A tutorial on support vector regression. Statistics and computing 14, 3 (2004), 199–222.
[32] Lee Spector. 2012. Assessment of problem modality by differential performance of lexicase selection in genetic programming: a
preliminary report. In Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference companion. 401–408.
[33] Karolina Stanislawska, Krzysztof Krawiec, and Zbigniew W. Kundzewicz. 2012. Modeling global temperature changes with genetic programming.
Computers & Mathematics with Applications 64, 12 (Dec. 2012), 3717–3728. DOI:
[34] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996),
[35] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2014. OpenML: Networked Science in Machine Learning. SIGKDD Explor. Newsl.
15, 2 (June 2014), 49–60. DOI:
[36] E.J. Vladislavleva, G.F. Smits, and D. den Hertog. 2009. Order of Nonlinearity as a Complexity Measure for Models Generated by
Symbolic Regression via Pareto Genetic Programming. IEEE Transactions on Evolutionary Computation 13, 2 (2009), 333–349. DOI:
[37] Jan Žegklitz and Petr Pošík. 2017. Symbolic Regression Algorithms with Built-in Linear Regression. arXiv:1701.03641 [cs] (Jan. 2017). arXiv: 1701.03641.
[38] David R. White, James McDermott, Mauro Castelli, Luca Manzoni, Brian W. Goldman, Gabriel Kronberger, Wojciech Jaśkowski, Una-May O’Reilly,
and Sean Luke. 2012. Better GP benchmarks: community survey results and proposals. Genetic Programming and Evolvable Machines 14, 1 (Dec.
2012), 3–29. DOI: 012-9177-2