

Where are we now?

A large benchmark study of recent symbolic regression methods

PATRYK ORZECHOWSKI, University of Pennsylvania

WILLIAM LA CAVA∗, University of Pennsylvania

JASON H. MOORE, University of Pennsylvania

In this paper we provide a broad benchmarking of recent genetic programming approaches to symbolic regression in the context of

state of the art machine learning approaches. We use a set of nearly 100 regression benchmark problems culled from open source

repositories across the web. We conduct a rigorous benchmarking of four recent symbolic regression approaches as well as nine

machine learning approaches from scikit-learn. The results suggest that symbolic regression performs strongly compared to state-of-the-art gradient boosting algorithms, although it is among the slowest of the available methodologies in terms of running time. We

discuss the results in detail and point to future research directions that may allow symbolic regression to gain wider adoption in the

machine learning community.

CCS Concepts: • Computing methodologies → Classification and regression trees; Genetic programming; Ensemble methods; Cross-validation;

Additional Key Words and Phrases: symbolic regression, benchmarking, machine learning, genetic programming

ACM Reference Format:

Patryk Orzechowski, William La Cava, and Jason H. Moore. 2018. Where are we now? A large benchmark study of recent symbolic

regression methods. 1, 1 (April 2018), 12 pages. https://doi.org/10.1145/3205455.3205539

1 INTRODUCTION

Since the beginning of the field, the genetic programming (GP) community has considered the task of symbolic regression (SR) as a basis for methodology research and as a primary application area. GP-based SR (GPSR) has produced a number of notable results in real-world regression applications, for example dynamical system modeling in physics [28], biology [30], industrial wind turbines [19], fluid dynamics [18], robotics [2], climate change forecasting [33], and financial trading [17], among others. However, the most prevalent use of GPSR is in the experimental analysis of new methods, for which SR provides a convenient platform for benchmarking. Despite this persistent use, several shortcomings of SR benchmarking are notable. First, the GP community lacks a unified standard for SR benchmark datasets, as noted previously [21]. Several SR benchmarks have been proposed [17, 23, 36], critiqued [6, 21], and black-listed [21], leading to inconsistencies in the experimental design of papers. In addition to a lack of consensus for

∗corresponding author

Authors’ addresses: Patryk Orzechowski, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA, patryk.orzechowski@gmail.com; William La Cava, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA, lacava@upenn.edu; Jason H. Moore, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, jhmoore@upenn.edu.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not

made or distributed for proﬁt or commercial advantage and that copies bear this notice and the full citation on the ﬁrst page. Copyrights for components

of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on

servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

© 2018 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.


benchmark datasets, there is not a standard set of benchmark algorithms against which new methods are compared. As

a result, it is typical for researchers to design or choose their own set of algorithms to compare to proposed methods,

and it is up to reviewers and readers to assess the validity of the comparison. Experiments typically consider single values for GP hyperparameters such as population size or crossover rate, which increases the uncertainty of results even

further. These practices make it nearly impossible to judge a new method outside the narrow scope of the experimental

results.

Of course, there are shortcomings to focusing on benchmarks as well, as noted by others [9, 38]. Putting too much

focus on benchmarking may stiﬂe innovation or lead to a lack of generalization to new tasks. However, the evidence

suggests that the GP community is far from being overly focused on benchmarking. A 2012 survey of GP papers in

EuroGP and GECCO from 2009 - 2011 reported the average number of SR problems per paper to be 2.4 [21]; 26.2% of

papers relied on the quartic polynomial problem, which has since been black-listed for being too trivial [38]. We contend that the lack of focus in the GP community on rigorous benchmarking makes it hard to know how GPSR methods fit into the broader machine learning (ML) community. This lack of clarity also impedes the adoption of advancements

to traditional GP techniques, and leaves researchers unsure about which advancements will have meaningful impacts.

There have been a few efforts to conduct broad benchmarking of GP methods in the past. For example, a recent study looked at five SR methods on a set of five synthetic and four real world datasets [37]. Outside of GP, efforts to benchmark ML approaches across many problems are more frequent, although most focus on the task of classification. Previous studies have looked at hundreds of classification methodologies [11] and up to 165 datasets [24]. Collaborative

online tools such as Kaggle and OpenML [35] have also driven ML benchmarking and adoption of new methods. These

larger benchmark studies have, for the most part, ignored GP-based methods. As a result, the GPSR ﬁeld lacks a general

sense of where it stands in relation to the broader ML ﬁeld in terms of expected performance.

Our goal in this study is to present initial results in our eﬀorts to assess the performance of recent GPSR methods in

the broad context of ML regression. We benchmark the performance of four recent SR algorithms and ten established

ML approaches on a collection of 94 different real-world regression problems. For each problem we consider hyperparameter tuning via cross-validation and assess each method in terms of training error, test error, and wall-clock time.

Finally, we provide the code for the analysis in order to allow researchers to benchmark their own methods in this

framework and reproduce the results shown here.

2 METHODS

We compare four recent GPSR methods in this benchmark and ten well-established ML regression methods. In this

section we brieﬂy present the selected methods and describe the design of the experiment.

2.1 GP methods

A number of factors impacted our choice of these methods. Two key elements were open-source implementations and

ease of use. In addition, we wished to test different research thrusts in the GP literature. The four methods encompass different innovations to standard GPSR, including the incorporation of constant optimization, semantic search drivers, and Pareto optimization. Each method is described briefly below.

Multiple regression genetic programming (MRGP). [1] MRGP combines Lasso regression with the tree search afforded

by GP. A weight is attached to each node in each program. These weights are adapted by applying Lasso regression to


the entire program trace. MRGP uses point mutation and sub-tree crossover for variation and NSGA-II for selection.

We use the version implemented in FlexGP 1.
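As a rough illustration of this idea (a sketch, not the FlexGP implementation), the example below treats every node output of a toy program as one column of its trace and fits the node weights with scikit-learn's Lasso; the program and all names are hypothetical.

    import numpy as np
    from sklearn.linear_model import Lasso

    def program_trace(x):
        # toy program f(x) = (x0 + x1) * x0; the trace collects every node's output
        n0 = x[:, 0]            # leaf x0
        n1 = x[:, 1]            # leaf x1
        n2 = n0 + n1            # internal node x0 + x1
        n3 = n2 * n0            # root (x0 + x1) * x0
        return np.column_stack([n0, n1, n2, n3])

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = 1.5 * (X[:, 0] + X[:, 1]) * X[:, 0] + rng.normal(scale=0.1, size=200)

    trace = program_trace(X)                  # one column per node in the tree
    model = Lasso(alpha=0.01).fit(trace, y)   # a weight is attached to every node
    print(model.coef_, model.intercept_)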

ϵ-Lexicase selection (EPLEX). [20] ϵ-lexicase selection adapts the lexicase selection method [32] for regression. Rather than aggregating performance on the training set into a single fitness score, EPLEX selects parents by filtering the population through randomized orderings of training samples and removing individuals that are not within ϵ of the

best performance in the pool. We use the EPLEX method implemented in ellyn2. Ellyn is a stack-based GP system

written in C++ with a Python interface for use with scikit-learn. It uses point mutation and subtree crossover. Weights

in the programs are trained each generation via stochastic hill climbing. A Pareto archive of trade-oﬀs between mean

squared error and complexity is kept during each run, and a small internal validation fold is used to select the ﬁnal

model returned by the search process.
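A minimal sketch of the selection step is given below; it assumes a precomputed matrix of per-case errors and takes ϵ as the median absolute deviation of the errors on each case (a common choice, assumed here rather than taken from the ellyn source).

    import numpy as np

    def epsilon_lexicase_select(error_matrix, rng):
        """Pick one parent index. error_matrix[i, j] = error of individual i on case j."""
        n_individuals, n_cases = error_matrix.shape
        pool = np.arange(n_individuals)
        for case in rng.permutation(n_cases):          # randomized ordering of training samples
            errors = error_matrix[pool, case]
            eps = np.median(np.abs(errors - np.median(errors)))
            pool = pool[errors <= errors.min() + eps]  # keep those within eps of the best in the pool
            if len(pool) == 1:
                break
        return rng.choice(pool)

    rng = np.random.default_rng(42)
    errors = np.abs(rng.normal(size=(50, 30)))         # 50 individuals, 30 training cases
    parent = epsilon_lexicase_select(errors, rng)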

Age-fitness Pareto Optimization (AFP). [29] AFP is a selection scheme based on the concept of age-layered populations introduced by Hornby et al. [15]. AFP introduces a new individual each generation with an age of 0. An

individual’s age is updated each generation to reﬂect the number of generations since its oldest node (gene) entered

the population. Parent selection is random and Pareto tournaments are used for survival on the basis of age and ﬁtness.

We use the version of AFP implemented in ellyn, with the same settings described above.
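A minimal sketch of the survival step on (age, fitness) pairs is shown below; the tournament size, data layout, and tie handling are illustrative assumptions, not the ellyn implementation.

    import random

    def dominates(a, b):
        """True if a Pareto-dominates b on (age, error); lower is better in both objectives."""
        return (a["age"] <= b["age"] and a["error"] <= b["error"]
                and (a["age"] < b["age"] or a["error"] < b["error"]))

    def afp_survival(population, n_survivors, tourn_size=2):
        """Pareto tournament survival: draw small tournaments and keep the non-dominated members."""
        survivors = []
        while len(survivors) < n_survivors:
            group = random.sample(population, tourn_size)
            survivors += [ind for ind in group
                          if not any(dominates(other, ind) for other in group)]
        return survivors[:n_survivors]

    population = [{"age": random.randint(0, 20), "error": random.random()} for _ in range(100)]
    next_generation = afp_survival(population, n_survivors=50)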

Geometric Semantic Genetic Programming (GSGP). [22] GSGP is a recent method that has shown many promising

results for SR and other tasks. The main concept behind GSGP is the use of semantic variation operators that produce

oﬀspring whose semantics lie on the vector between the semantics of the parent and the target semantics (i.e. target

labels). Use of these variation operators has the advantage of creating a unimodal ﬁtness landscape. On the downside,

the variation operators result in exponential growth of programs. We use the version of GSGP implemented in C++ by Castelli et al. [4], which is optimized to minimize memory usage. It is available from SourceForge3.
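To make the geometric idea concrete, the sketch below applies geometric semantic crossover directly to semantic vectors (the programs' outputs on the training cases); using a single random ratio instead of a random function is a simplification, and none of this reflects the Castelli et al. implementation.

    import numpy as np

    def gs_crossover_semantics(sem_p1, sem_p2, rng):
        # Offspring semantics are a convex combination of the parents' semantics, so they
        # lie on the segment joining the two parents in semantic space. In tree space the
        # offspring embeds both parents, which drives the program growth mentioned above.
        r = rng.uniform(0.0, 1.0)   # simplification: one ratio instead of a random function
        return r * sem_p1 + (1.0 - r) * sem_p2

    rng = np.random.default_rng(0)
    sem_parent1 = rng.normal(size=20)   # outputs of parent 1 on 20 training cases
    sem_parent2 = rng.normal(size=20)   # outputs of parent 2 on 20 training cases
    sem_child = gs_crossover_semantics(sem_parent1, sem_parent2, rng)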

2.2 ML methods

We use scikit-learn [26] implementations of the following methods in this study:

Linear Regression. Linear Regression is a simple model of regression that minimizes the sum of the squared errors of a linear model of the inputs. The model is defined by $\hat{y} = b + \mathbf{w}^T \mathbf{x}$, where $y$ is the dependent variable (target), $\mathbf{x}$ are the explanatory variables, $b$ and $\mathbf{w}$ are the intercept and slope variables, and the minimized function is equal to (1).

C_{LR}(\mathbf{w}) = \frac{1}{2} \sum_i \left( y_i - \mathbf{w}^T \mathbf{x}_i \right)^2 \qquad (1)

Kernel Ridge. Kernel Ridge [27] performs Ridge regression using a linear function in the space of the respective kernel. Least squares with l2-norm regularization is applied in order to prevent overfitting. The minimized function is equal to (2), where $\phi$ is a kernel function and $\lambda$ is the regularization parameter.

C_{KR}(\mathbf{w}) = \frac{1}{2} \sum_i \left( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \right)^2 + \frac{1}{2} \lambda \|\mathbf{w}\|^2 \qquad (2)

Least-angle regression with Lasso. Lasso (Least absolute shrinkage and selection operator) is a popular method of regression that applies both feature selection and regularization [34]. Similarly to Kernel Ridge, high values of $\mathbf{w}$ are

1https://ﬂexgp.github.io/gp-learners/

2https://epistasislab.github.io/ellyn/

3http://gsgp.sourceforge.net/


penalized. The use of the l1-norm on $\mathbf{w}$ in the minimization function (see (3)) improves the ability to push individual weights to zero, effectively performing feature selection.

C_{L}(\mathbf{w}) = \frac{1}{2} \sum_i \left( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \right)^2 + \lambda \|\mathbf{w}\|_1 \qquad (3)

Least-angle regression with Lasso, a.k.a. Lars [10], is an efficient algorithm for producing a family of Lasso solutions. It is able to compute the exact values of $\lambda$ for new variables entering the model.

Linear SVR. Linear Support Vector Regression extends the concept of Support Vector Classifiers (SVC) to the task of regression, i.e. to predict real values instead of classes. Its objective is to minimize an ϵ-insensitive loss function with a regularization penalty $\frac{1}{2}\|\mathbf{w}\|^2$ in order to improve generalization [31].
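For reference, the standard form of the ϵ-insensitive loss, as given e.g. in the support vector regression tutorial [31], penalizes only residuals larger than ϵ:

L_\epsilon\bigl(y, f(\mathbf{x})\bigr) = \max\bigl(0,\; |y - f(\mathbf{x})| - \epsilon\bigr)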

SGD Regression. SGD Regression implements stochastic gradient descent and is especially well suited for larger problems with over 10,000 instances [26]. We include this method of regression regardless, in order to compare its performance on smaller datasets.

MLP Regressor. Neural networks have been applied to regression problems for almost three decades [14]. We include

multilayer perceptrons (MLPs) as one of the benchmarked algorithms. We decided to benchmark a neural network with a single hidden layer with a fixed number of neurons (100) and compare different activation functions, learning functions

and solvers, including the novel adam solver [16].

AdaBoost regression. Adaptive Boosting, also called AdaBoost [8, 12], is a flexible technique for combining a set of weak learners into a single stronger regressor. By changing the distribution (i.e. weights) of instances in the data, previously misclassified instances are favored in consecutive iterations. The final prediction is obtained by a weighted sum or weighted majority voting. As a result, the final regressor has smaller prediction errors. The method is considered

sensitive to outliers.

Random Forest regression. Random Forests [3] are a very popular ensemble method based on combining multiple

decision trees into a single stronger predictor. Each tree is trained independently with a randomly selected subset of the

instances, in a process known as bootstrap-aggregating or bagging. The resulting prediction is an average of multiple

predictions. Random forests try to reduce variance by not allowing decision trees to grow large, making them harder

to overﬁt.

Gradient Boosting regression. Gradient Boosting [13] is an ensemble method that is based on regression trees. It

shares the AdaBoost concept of iteratively improving the system performance on its weakest points. In contrast to AdaBoost, the distribution of the samples remains the same. Instead, consecutively created trees correct the errors of

the previous ones. Gradient Boosting minimizes bias (not variance like in Random Forests). In comparison to Random

Forests, Gradient Boosting is sequential (thus slower), more diﬃcult to train, but is reported to perform better than

Random Forest [24].

Extreme Gradient Boosting. Extreme Gradient Boosting, also known as XGBoost [5], incorporates regularization into

the Gradient Boosting algorithm in order to control overﬁtting. Its objective function combines the optimization of

training loss with model complexity. This brings the predictor closer to the underlying distribution of the data, while

encouraging simple models, which have smaller variance. Extreme gradient boosting is considered a state-of-the-art

method in ML.
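As a sketch of that objective (following the formulation in the XGBoost paper [5], not our notation elsewhere), the minimized function combines a differentiable training loss $l$ with a complexity penalty $\Omega$ over the $K$ regression trees $f_k$:

\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2,

where $T$ is the number of leaves of a tree and $w$ are its leaf weights.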


2.3 Datasets

We pulled the benchmark datasets from the Penn Machine Learning Benchmark (PMLB) [25] repository4, which contains a large collection of standardized datasets for classification and regression problems. This repository overlaps

heavily with datasets from UCI, OpenML, and Kaggle. In this paper we considered regression problems only, of which

there are 120 total. For our experiment, we removed datasets with 3000 instances or more (22 datasets) and two others

for which at least one of the methods failed to provide the required number of results in feasible time (i.e. 72 hours

on an Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz). This gave a collection of 94 datasets in total. The distribution of the number of instances and the number of features in the collection of datasets is presented in Fig. 1.

[Figure 1: number of instances (x-axis, 0–1000) versus number of features (y-axis, 0–100) for the benchmark datasets.]

Fig. 1. Basic characteristics of the datasets used in the study

2.4 Experiment design

In order to benchmark diﬀerent regression methods, an eﬀort was made to measure performance of each of the methods

in as similar an environment as possible. First, we decided to treat each of the GP methods as a classical ML approach

and used the scikit-learn library [26] for cross validation and hyperparameter optimization. This required some source code modifications to allow GSGP and MRGP to communicate with the wrapper. Second, instead of reimplementing the

algorithms, we relied on the original implementations with as few modiﬁcations as possible. Wrapping each method

allowed us to keep a common benchmarking framework based on the scikit-learn functions.

The datasets were divided in the following way: 75% of samples in each of the datasets were used for training,

whereas the remaining 25% were used for testing. We used grid search to tune the hyperparameters of each method.

Each method was run with a preset grid of input parameters, detailed in Table 1. The optimal setting of the parameters

was determined based on 5-fold cross-validation performed on training data only. The setting with the best R2 score

across all folds was used for training the algorithms on the whole training data. The performance of the methods was

4https://github.com/EpistasisLab/penn-ml-benchmarks


Table 1. Analyzed algorithms with their parameter settings. The parameters in quotations refer to their names in the scikit-learn implementations.

Algorithm name              Parameter                        Values
eplex, afp, mrgp            pop size / generations           {100/1000, 1000/100}
                            max program length / max depth   {64 / 6}
                            crossover rate                   {0.2, 0.5, 0.8}
                            mutation rate                    1 - crossover rate
gsgp                        pop size / generations           {100/1000, 200/500, 1000/100}
                            initial depth                    {6}
                            crossover rate                   {0.0, 0.1, 0.2}
                            mutation rate                    1 - crossover rate
eplex_1M                    pop size / generations           {500/2000, 1000/1000, 2000/500}
                            max program length               {100}
                            crossover rate                   {0.2, 0.5, 0.8}
                            mutation rate                    1 - crossover rate
AdaBoostRegressor           'n_estimators'                   {10, 100, 1000}
                            'learning_rate'                  {0.01, 0.1, 1, 10}
GradientBoostingRegressor   'n_estimators'                   {10, 100, 1000}
                            'min_weight_fraction_leaf'       {0.0, 0.25, 0.5}
                            'max_features'                   {'sqrt', 'log2', None}
KernelRidge                 'kernel'                         {'linear', 'poly', 'rbf', 'sigmoid'}
                            'alpha'                          {1e-4, 1e-2, 0.1, 1}
                            'gamma'                          {0.01, 0.1, 1, 10}
LassoLARS                   'alpha'                          {1e-04, 0.001, 0.01, 0.1, 1}
LinearRegression            default                          default
MLPRegressor                'activation'                     {'logistic', 'tanh', 'relu'}
                            'solver'                         {'lbfgs', 'adam', 'sgd'}
                            'learning_rate'                  {'constant', 'invscaling', 'adaptive'}
RandomForestRegressor       'n_estimators'                   {10, 100, 1000}
                            'min_weight_fraction_leaf'       {0.0, 0.25, 0.5}
                            'max_features'                   {'sqrt', 'log2', None}
SGDRegressor                'alpha'                          {1e-06, 1e-04, 0.01, 1}
                            'penalty'                        {'l2', 'l1', 'elasticnet'}
LinearSVR                   'C'                              {1e-06, 1e-04, 0.1, 1}
                            'loss'                           {'epsilon_insensitive', 'squared_epsilon_insensitive'}
XGBoost                     'n_estimators'                   {10, 50, 100, 250, 500, 1000}
                            'learning_rate'                  {1e-4, 0.01, 0.05, 0.1, 0.2}
                            'gamma'                          {0, 0.1, 0.2, 0.3, 0.4}
                            'max_depth'                      {6}
                            'subsample'                      {0.5, 0.75, 1}

measured on both training and testing datasets on the best model obtained during cross-validation. We repeated the

entire experiment 10 times for each method and dataset.
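A minimal sketch of this protocol for a single scikit-learn regressor and a single dataset is shown below; the synthetic dataset and the reduced parameter grid are illustrative stand-ins for the PMLB problems and the grids in Table 1.

    from sklearn.datasets import make_regression          # stand-in for loading a PMLB dataset
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

    # 75% / 25% train/test split; in the study this split was redrawn for each of the 10 trials
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)

    # scale the inputs, then grid-search the hyperparameters with 5-fold CV on the training data
    pipeline = make_pipeline(StandardScaler(), GradientBoostingRegressor())
    param_grid = {"gradientboostingregressor__n_estimators": [10, 100, 1000],
                  "gradientboostingregressor__max_features": ["sqrt", "log2", None]}
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")  # best R2 across folds
    search.fit(X_train, y_train)          # refits the best setting on the whole training split

    # report train and test MSE of the selected model
    best_model = search.best_estimator_
    mse_train = mean_squared_error(y_train, best_model.predict(X_train))
    mse_test = mean_squared_error(y_test, best_model.predict(X_test))
    print(search.best_params_, mse_train, mse_test)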

Because of time constraints, we decided to run each of the GP-based methods for 100,000 evaluations (population

size x number of generations). Additionally, we generated results for 1 million evaluations using EPLEX (referred to

as EPLEX_1M) in order to assess how much a more thorough training of a GP-based regressor would improve its

performance.

Data preprocessing. We decided to feed the benchmarked algorithms with scaled data, using the StandardScaler function from scikit-learn. The reason for this is our effort to keep the format of the input data consistent across multiple algorithms for the purpose of benchmarking. The choice of the optimal preprocessing method for a particular regressor is beyond the scope of this paper.


[Figure 2: median MSE ranking on the training sets (y-axis: median ranking for training set, 2–14) for each algorithm (x-axis): gsgp, afp, mrgp, eplex, eplex-1m, xgboost, gradboost, mlp, rf, kernel-ridge, adaboost, lasso-lars, linear-svr, linear-regression, sgd-regression.]

Fig. 2. Ranking of the performance of the algorithms based on the MSE score on training datasets.

Initialization of the algorithms. We initially considered starting each of the methods with the same random seed, but eventually decided to make all data splits randomly. In our view, both approaches have disadvantages: the results will either be biased by the choice of the random seed, or by using different splits for different methods. By taking the median of the scores we reduce the dependence on the initial split of the data.

Wrappers for the GP methods. Some modifications had to be made to each of the GP methods. For EPLEX and AFP, we used the existing Python wrapper provided by ellyn. For the other methods we implemented a class derived from the scikit-learn BaseEstimator, which implements two methods: fit(), used for training the regressor, and predict(),

used for testing performance of the regressor. The source code of MRGP and GSGP had to be modiﬁed, so that the

algorithms could communicate with the wrapper.
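A minimal sketch of such a wrapper is given below; the external command, its flags, and its file formats are hypothetical placeholders rather than the actual MRGP or GSGP interfaces.

    import subprocess
    import tempfile

    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin

    class ExternalGPRegressor(BaseEstimator, RegressorMixin):
        """Adapter exposing an external GP binary through fit()/predict().
        The binary name, flags, and file formats below are illustrative assumptions."""

        def __init__(self, generations=100, pop_size=1000):
            self.generations = generations
            self.pop_size = pop_size

        def fit(self, X, y):
            # write the training data to a temporary CSV file for the external solver
            train_file = tempfile.NamedTemporaryFile(suffix=".csv", delete=False).name
            np.savetxt(train_file, np.column_stack([X, y]), delimiter=",")
            # hypothetical command line; in the study the MRGP/GSGP sources were modified instead
            subprocess.run(["gp_solver", "--train", train_file,
                            "--gens", str(self.generations),
                            "--popsize", str(self.pop_size)], check=True)
            self.model_file_ = "best_model.txt"   # wherever the solver stores its best program
            return self

        def predict(self, X):
            test_file = tempfile.NamedTemporaryFile(suffix=".csv", delete=False).name
            np.savetxt(test_file, X, delimiter=",")
            # hypothetical: ask the solver to evaluate the stored model on new data
            result = subprocess.run(["gp_solver", "--predict", test_file,
                                     "--model", self.model_file_],
                                    capture_output=True, text=True, check=True)
            return np.array([float(v) for v in result.stdout.split()])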

Parameters for the algorithms. The settings of the input parameters for the algorithms were determined based on the

available recommendations for the given method, as well as previous experience of the authors. For GP-based methods

we applied from 6 to 9 diﬀerent settings (mainly: population size x number of generations and crossover and mutation

rates). For the ML algorithms the number of settings was method dependent. The largest grid of parameters was used for the XGBoost method. The exact parameters for the methods used in this study can be found in Table 1.

3 RESULTS

We present aggregated results of the benchmarked algorithms on the collection of 94 regression datasets in Figures 2-3.

The relative performance of the algorithms was determined by their ability to make the best predictions on the training and testing data, measured using the mean squared error (MSE) of the samples. The performance on the testing dataset is of primary importance, as it shows how well the methods can generalize to previously unseen data [7]. However, we include the training comparisons as a way to assess the predilection for overfitting among methods.
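Given a table of per-dataset scores, the median rankings reported in Figures 2 and 3 can be computed along the following lines (a sketch with toy values, assuming a pandas DataFrame with one row per dataset and one column per algorithm):

    import pandas as pd

    # mse: one row per dataset, one column per algorithm (toy values standing in for the results)
    mse = pd.DataFrame({"eplex-1m":   [0.10, 0.30, 0.21],
                        "xgboost":    [0.12, 0.28, 0.25],
                        "lasso-lars": [0.40, 0.55, 0.60]},
                       index=["dataset_a", "dataset_b", "dataset_c"])

    ranks = mse.rank(axis=1, method="average")   # rank the algorithms within each dataset (1 = best)
    median_rank = ranks.median(axis=0)           # median ranking per algorithm, as in Figs. 2 and 3
    print(median_rank.sort_values())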

We first analyze the results for each of the regression tasks on the training data. The relative rankings of each method in terms of MSE are presented in Fig. 2. The best training performance was obtained with gradient boosting, which finished in the top two for the vast majority of the benchmarked datasets. The second best method across all the


[Figure 3: median MSE ranking on the testing sets (y-axis: median ranking for testing set, 2–14) for each algorithm (x-axis): gsgp, afp, mrgp, eplex, eplex-1m, xgboost, gradboost, mlp, rf, kernel-ridge, adaboost, lasso-lars, linear-svr, linear-regression, sgd-regression.]

Fig. 3. Ranking of the performance of the algorithms based on the MSE score on testing datasets.

datasets was XGBoost. The top-performing GP method across all the datasets was MRGP, which held third place on average across the training sets.

The test data results allow us to assess how well the algorithms handle generalization as well as their level of

overﬁtting on training data. The relative performance of the methods changed noticeably when previously unseen

data was used for evaluation. The results are presented in Fig. 3. The best performing method on average was EPLEX-1M. This GPSR method slightly outperformed XGBoost, which ended as the second best generalizing method across

datasets. Gradient boosting was the third best method, and MLP ﬁnished in fourth place.

Several of the methods exhibit overfitting by changing ranking between the training and testing evaluations. Gradient boosting, for example, moves from first to third place. The performance of MRGP, which was one of the best regressors on the training data, also exhibits overfitting, resulting in a drop of its average ranking from 4th to 6th. MRGP's results also contained the highest variance in performance on test sets. GSGP exhibits the highest level of overfitting in terms of rank changes, dropping from 8th to 13th. Conversely, several methods appear to generalize well,

including EPLEX-1M (moving from a median ranking of 5 to 3) and Lasso (13 to 11).

We used the test set MSE scores to check for signiﬁcant diﬀerences between methods across all datasets according

to a Friedman test, which produces a p-value less than 2e-16, indicating signiﬁcant diﬀerences. Post-hoc pairwise tests

are then conducted and reported in Table 2. The large number of datasets provides higher statistical power than smaller

experimental studies, leading to many p-values below 0.05. EPLEX-1M statistically outperforms the highest number of

other methods (11), followed by XGBoost (9) and gradient boosting (7). We ﬁnd that none of the comparisons between

EPLEX-1M, XGBoost, gradient boosting and MLP are signiﬁcantly diﬀerent.
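The test reported here is an asymptotic general symmetry (Friedman-type) test; a rough Python analogue of the overall test and of pairwise post-hoc comparisons is sketched below, with toy data standing in for the real MSE table and a Wilcoxon signed-rank test used as an assumed stand-in for the post-hoc procedure, not the authors' implementation.

    from itertools import combinations

    import numpy as np
    from scipy.stats import friedmanchisquare, wilcoxon

    rng = np.random.default_rng(0)
    scores = rng.random((94, 15))                  # toy stand-in: 94 datasets x 15 algorithms
    algorithms = [f"alg{i}" for i in range(scores.shape[1])]

    # overall test for differences among algorithms across datasets
    stat, p_overall = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])

    # post-hoc pairwise comparisons, here via Wilcoxon signed-rank tests on per-dataset scores
    pairwise = {(algorithms[i], algorithms[j]): wilcoxon(scores[:, i], scores[:, j]).pvalue
                for i, j in combinations(range(scores.shape[1]), 2)}
    print(p_overall, min(pairwise.values()))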

We now analyze the GP methods given equivalent numbers of ﬁtness evaluations (AFP, MRGP, EPLEX, and GSGP).

The results between MRGP and EPLEX show no significant difference. The only noted difference is that EPLEX significantly outperforms AFP, whereas MRGP does not. The three methods AFP, MRGP, and EPLEX all significantly

outperform GSGP. Given more ﬁtness evaluations, EPLEX-1M signiﬁcantly outperforms all the other GP experiments,

including EPLEX.


Table 2. Friedman Asymptotic General Symmetry Test. Bold indicates p < 0.05.

                   eplex-1m  xgboost  gradboost  mlp     rf      eplex   mrgp    kernel-ridge  adaboost  afp     lasso-lars  linear-svr  linear-regression  sgd-regression
xgboost            1
gradboost          0.9       1
mlp                0.2       0.6      1
rf                 0.003     0.05     0.6        1
eplex              0.001     0.02     0.4        1       1
mrgp               3e-07     2e-05    0.005      0.2     0.9     1
kernel-ridge       0.0007    0.02     0.4        1       1       1       1
adaboost           1e-07     4e-06    0.002      0.1     0.8     0.9     1       0.9
afp                3e-16     7e-14    4e-10      9e-06   0.0008  0.002   0.3     0.004         0.5
lasso-lars         0         0        2e-15      1e-11   1e-07   5e-07   0.002   6e-07         0.006     1
linear-svr         0         0        0          3e-13   2e-09   3e-08   0.0002  7e-08         0.0008    0.8     1
linear-regression  0         0        0          1e-14   7e-11   6e-10   5e-05   1e-09         0.0001    0.5     1           1
sgd-regression     0         0        0          0       1e-13   4e-12   1e-07   1e-12         5e-07     0.07    0.9         1           1
gsgp               0         0        0          0       0       0       2e-12   0             2e-11     0.0004  0.1         0.4         0.7                1

The comparison of running times per training task is presented in Fig. 4. Three important considerations should be

made when assessing these results. First, the experiment was conducted in a cluster environment. Second, each algorithm was run on a single thread for each dataset. Thus the easily parallelized algorithms (i.e., all GP-based methods and some ensemble tree methods) would likely show better relative performance in a multicore setting. Third, the benchmarked algorithms were implemented using different programming languages. Thus, a comparison of running times does not exclusively reflect the complexity of the methods.

Despite these considerations, it is worth noting how much additional computation time is required by the GP methods, which are one to three orders of magnitude slower than the nearest comparison. Among the GP methods, MRGP runs the slowest, which may be partially due to its Java implementation (the other four GP methods use C++). EPLEX-1M is able to complete 10 times as many fitness evaluations in approximately the same time. The other three GP methods (GSGP, AFP, and EPLEX) show similar computation times. Among the other ML methods, the ensemble tree methods and MLP are the slowest, and the linear methods are the fastest, as expected.

[Figure 4: runtime of the algorithms in seconds (y-axis, log scale from roughly 10^-2 to 10^3) for each algorithm (x-axis): gsgp, afp, mrgp, eplex, eplex-1m, xgboost, gradboost, mlp, rf, kernel-ridge, adaboost, lasso-lars, linear-svr, linear-regression, sgd-regression.]

Fig. 4. Median running time of each of the algorithms (in seconds).


The most frequent settings of the parameters picked for the best model across all trials are presented in Table 3. We

purposefully do not include Linear Regression in the table (run with the default values) or Kernel Ridge regression, for

which multiple settings of input parameters performed comparably. It may be noted that each GP-based method besides GSGP tended to favor large population sizes over larger numbers of generations. The optimal settings for crossover and mutation rates varied between methods.

Table 3. Most frequently chosen parameter settings based on 5-fold cross validation across all datasets.

Algorithm name      Frequently best parameter settings
gsgp                ('g'=500, 'max_len'=6, 'popsize'=200, 'rt_cross'=0.2, 'rt_mut'=0.8)
afp                 ('g'=100, 'max_len'=64, 'popsize'=1000, 'rt_cross'=0.8, 'rt_mut'=0.2)
mrgp                ({'g'=100, 'pop_size'=1000} or the opposite; 'rt_cross'=0.2, 'rt_mut'=0.8)
eplex               ('g'=100, 'max_len'=64, 'popsize'=1000, 'rt_cross'=0.8, 'rt_mut'=0.2)
eplex-1m            ('g'=500, 'max_len'=100, 'popsize'=2000, 'rt_cross'=0.8, 'rt_mut'=0.2)
xgboost             ('gamma'=0, 'learning_rate'=0.01, 'max_depth'=6, 'n_estimators'=1000, 'subsample'=0.5)
gradboost           ('max_features'=None, 'min_weight_fraction_leaf'=0.0, 'n_estimators'=1000)
mlp                 ('activation'='logistic', 'learning_rate'='constant', 'solver'='lbfgs')
rf                  ('max_features'=None, 'min_weight_fraction_leaf'=0.0, 'n_estimators'=1000)
adaboost            ('learning_rate'=1.0, 'n_estimators'=1000)
lasso-lars          ('alpha'=0.001)
linear-svr          ('C'=0.1, 'loss'='squared_epsilon_insensitive')
sgd-regression      ('alpha'=0.01, 'penalty'='l1')
linear-regression   ('fit_intercept'=True)

4 CONCLUSIONS

In this paper we evaluated four recent GPSR methods in comparison to ten state-of-the-art ML methods on a set of

94 real-world regression problems. We considered hyperparameter optimization for each method using nested cross-validation, and compared the methods in terms of the MSE they produce on training and testing sets, and their runtime.

The analysis includes some interesting results. The most noteworthy ﬁnding is that a GPSR method (ϵ-lexicase selection

implemented in ellyn), given 1 million ﬁtness evaluations, achieves the best test set MSE ranking across all datasets

and methods. Two of the GP-based methods, namely EPLEX and MRGP, produce competitive results compared to state-of-the-art ML regression approaches. The downside of the GP-based methods is their computational complexity

when run on a single thread, which contributes to much higher runtimes. Parallelism is likely to be a key factor in

allowing GP-based approaches to become competitive with leading ML methods with respect to running times.

We should also note some shortcomings of this study that motivate further analysis. First, a guiding motivation for

the use of GPSR is often its ability to produce legible symbolic models. Our analysis did not attempt to quantify the

complexity of the models produced by any of the methods. An extension of this work could establish a standardized

way of assessing this complexity, for example using the polynomial complexity method proposed by Vladislavleva et al. [36]. Ultimately, the relative value of explainability versus predictive power will depend on the application domain.

Second, we have considered real-world datasets as the source of our benchmarks. Simulation studies could also be

used, and have the advantage of providing ground truth about the underlying process, as well as the ability to scale


complexity or difficulty. It should also be noted that the datasets used for this study were of relatively small sizes (up to 1000 instances). Future work should consider larger dataset sizes, though this will come with a larger computational burden.

We have also limited our initial analysis to looking at bulk performance of algorithms over many datasets. Further

analysis of these results should provide insight into the properties of datasets that make them amenable to, or diﬃcult

for, GP-based regression. Such an analysis can provide suggestions for new problem sub-types that may be of interest

to the GP community.

We hope this study will provide the ML community with a data-driven sense of how state-of-the-art SR methods

compare broadly to other popular ML approaches to regression.

SUPPLEMENTARY MATERIALS

Source code for our experiment can be found at the following URL: https://github.com/EpistasisLab/regression-benchmark.

ACKNOWLEDGMENTS

This work is supported by NIH grants LM010098 and AI116794.

REFERENCES

[1] Ignacio Arnaldo, Krzysztof Krawiec, and Una-May O’Reilly. 2014. Multiple regression genetic programming. In Proceedings of the 2014 Annual

Conference on Genetic and Evolutionary Computation. ACM, 879–886.

[2] J.C. Bongard and H. Lipson. 2005. Nonlinear System Identiﬁcation Using Coevolution of Models and Tests. IEEE Transactions on Evolutionary

Computation 9, 4 (Aug. 2005), 361–384. DOI:http://dx.doi.org/10.1109/TEVC.2005.850293

[3] Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.

[4] Mauro Castelli, Sara Silva, and Leonardo Vanneschi. 2015. A C++ framework for geometric semantic genetic programming. Genetic Programming

and Evolvable Machines 16, 1 (March 2015), 73–81. DOI:http://dx.doi.org/10.1007/s10710-014-9218-0

[5] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on

knowledge discovery and data mining. ACM, 785–794.

[6] Grant Dick, Aysha P. Rimoni, and Peter A. Whigham. 2015. A Re-Examination of the Use of Genetic Programming on the Oral Bioavailability

Problem. ACM Press, 1015–1022. DOI:http://dx.doi.org/10.1145/2739480.2754771

[7] Pedro Domingos. 2012. A few useful things to know about machine learning. Commun. ACM 55, 10 (2012), 78–87.

[8] Harris Drucker. 1997. Improving regressors using boosting techniques. In ICML, Vol. 97. 107–115.

[9] Chris Drummond and Nathalie Japkowicz. 2010. Warning: statistical benchmarking is addictive. Kicking the habit in machine learning. Journal of

Experimental & Theoretical Artiﬁcial Intelligence 22, 1 (March 2010), 67–80. DOI:http://dx.doi.org/10.1080/09528130903010295

[10] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, and others. 2004. Least angle regression. The Annals of statistics 32, 2 (2004),

407–499.

[11] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res 15, 1 (2014), 3133–3181.

[12] Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of

computer and system sciences 55, 1 (1997), 119–139.

[13] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.

[14] Geoﬀrey E Hinton. 1989. Connectionist Learning Procedures. Artiﬁcial Intelligence 40 (1989), 185–234.

[15] Gregory S Hornby. 2006. ALPS: the age-layered population structure for reducing the problem of premature convergence. In Proceedings of the 8th

annual conference on Genetic and evolutionary computation. ACM, 815–822. DOI:http://dx.doi.org/10.1145/1143997.1144142

[16] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[17] Michael F. Korns. 2011. Accuracy in symbolic regression. In Genetic Programming Theory and Practice IX. Springer, 129–151.

http://link.springer.com/chapter/10.1007/978-1-4614-1770-5_8

[18] William La Cava, Kourosh Danai, and Lee Spector. 2016. Inference of compact nonlinear dynamic models by epigenetic local search. Engineering

Applications of Artiﬁcial Intelligence 55 (Oct. 2016), 292–306. DOI:http://dx.doi.org/10.1016/j.engappai.2016.07.004

[19] William La Cava, Kourosh Danai, Lee Spector, Paul Fleming, Alan Wright, and Matthew Lackner. 2016. Automatic identiﬁcation

of wind turbine models using evolutionary multiobjective optimization. Renewable Energy 87, Part 2 (March 2016), 892–902. DOI:

http://dx.doi.org/10.1016/j.renene.2015.09.068


[20] William La Cava, Lee Spector, and Kourosh Danai. 2016. Epsilon-Lexicase Selection for Regression. In Proceedings of the Genetic and Evolutionary

Computation Conference 2016 (GECCO ’16). ACM, New York, NY, USA, 741–748. DOI:http://dx.doi.org/10.1145/2908812.2908898

[21] James McDermott, David R. White, Sean Luke, Luca Manzoni, Mauro Castelli, Leonardo Vanneschi, Wojciech Jaskowski, Krzysztof Krawiec, Robin

Harper, and Kenneth De Jong. 2012. Genetic programming needs better benchmarks. In Proceedings of the fourteenth international conference on

Genetic and evolutionary computation conference. ACM, 791–798. http://dl.acm.org/citation.cfm?id=2330273

[22] Alberto Moraglio, Krzysztof Krawiec, and Colin G. Johnson. 2012. Geometric semantic genetic programming. In Parallel Problem Solving from

Nature-PPSN XII. Springer, 21–31. http://link.springer.com/chapter/10.1007/978-3-642-32937-1_3

[23] Quang Uy Nguyen, Tuan Anh Pham, Xuan Hoai Nguyen, and James McDermott. 2015. Subtree semantic geometric crossover for genetic programming. Genetic Programming and Evolvable Machines (Oct. 2015), 1–29. DOI:http://dx.doi.org/10.1007/s10710-015-9253-5

[24] Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore. 2017. Data-driven Advice for Applying Machine Learning

to Bioinformatics Problems. In Paciﬁc Symposium on Biocomputing (PSB). http://arxiv.org/abs/1708.05070 arXiv: 1708.05070.

[25] Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. 2017. PMLB: A Large Benchmark Suite for

Machine Learning Evaluation and Comparison. BioData Mining (2017). https://arxiv.org/abs/1703.00512 arXiv preprint arXiv:1703.00512.

[26] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer,

Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011),

2825–2830.

[27] Christian Robert. 2014. Machine learning, a probabilistic perspective. (2014).

[28] Michael Schmidt and Hod Lipson. 2009. Distilling free-form natural laws from experimental data. Science 324, 5923 (2009), 81–85.

http://www.sciencemag.org/content/324/5923/81.short

[29] Michael Schmidt and Hod Lipson. 2011. Age-ﬁtness pareto optimization. In Genetic Programming Theory and Practice VIII. Springer, 129–146.

http://link.springer.com/chapter/10.1007/978-1-4419-7747-2_8

[30] Michael D Schmidt, Ravishankar R Vallabhajosyula, Jerry W Jenkins, Jonathan E Hood, Abhishek S Soni, John P Wikswo, and Hod Lipson.

2011. Automated reﬁnement and inference of analytical models for metabolic networks. Physical Biology 8, 5 (Oct. 2011), 055011. DOI:

http://dx.doi.org/10.1088/1478-3975/8/5/055011

[31] Alex J Smola and Bernhard Schölkopf. 2004. A tutorial on support vector regression. Statistics and computing 14, 3 (2004), 199–222.

[32] Lee Spector. 2012. Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference companion. 401–408.

http://dl.acm.org/citation.cfm?id=2330846

[33] Karolina Stanislawska, Krzysztof Krawiec, and Zbigniew W. Kundzewicz. 2012. Modeling global temperature changes with genetic programming.

Computers & Mathematics with Applications 64, 12 (Dec. 2012), 3717–3728. DOI:http://dx.doi.org/10.1016/j.camwa.2012.02.049

[34] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996),

267–288.

[35] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2014. OpenML: Networked Science in Machine Learning. SIGKDD Explor. Newsl.

15, 2 (June 2014), 49–60. DOI:http://dx.doi.org/10.1145/2641190.2641198

[36] E.J. Vladislavleva, G.F. Smits, and D. den Hertog. 2009. Order of Nonlinearity as a Complexity Measure for Models Generated by

Symbolic Regression via Pareto Genetic Programming. IEEE Transactions on Evolutionary Computation 13, 2 (2009), 333–349. DOI:

http://dx.doi.org/10.1109/TEVC.2008.926486

[37] Jan Žegklitz and Petr Pošík. 2017. Symbolic Regression Algorithms with Built-in Linear Regression. arXiv:1701.03641 [cs] (Jan. 2017).

http://arxiv.org/abs/1701.03641 arXiv: 1701.03641.

[38] David R. White, James McDermott, Mauro Castelli, Luca Manzoni, Brian W. Goldman, Gabriel Kronberger, Wojciech Jaśkowski, Una-May O’Reilly, and Sean Luke. 2012. Better GP benchmarks: community survey results and proposals. Genetic Programming and Evolvable Machines 14, 1 (Dec. 2012), 3–29. DOI:http://dx.doi.org/10.1007/s10710-012-9177-2
