Citation: Kammerer, L.; Kronberger, G.; Winkler, S. Bias and Variance Analysis of Contemporary Symbolic Regression Methods. Appl. Sci. 2024, 14, 11061. https://doi.org/10.3390/app142311061

Academic Editor: Vincent A. Cicirello

Received: 20 September 2024; Revised: 15 November 2024; Accepted: 22 November 2024; Published: 28 November 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article

Bias and Variance Analysis of Contemporary Symbolic Regression Methods

Lukas Kammerer 1,2, Gabriel Kronberger 1,*, and Stephan Winkler 1,2

1 Heuristic and Evolutionary Algorithms Laboratory, University of Applied Sciences Upper Austria, 4232 Hagenberg, Austria; lukas.kammerer@fh-hagenberg.at (L.K.); stephan.winkler@fh-hagenberg.at (S.W.)
2 Department of Computer Science, Johannes Kepler University, 4040 Linz, Austria
* Correspondence: gabriel.kronberger@fh-hagenberg.at
Featured Application: Scientific machine learning, white-box modelling.
Abstract: Symbolic regression is commonly used in domains where both high accuracy and interpretability of models are required. While symbolic regression is capable of producing highly accurate models, small changes in the training data might lead to highly dissimilar solutions. The implications in practice are huge, as interpretability, a key selling feature, degrades when minor changes in the data cause substantially different model behavior. We analyze these perturbations caused by changes in training data for ten contemporary symbolic regression algorithms. We analyze existing machine learning models from the SRBench benchmark suite, a benchmark that compares the accuracy of several symbolic regression algorithms. We measure the bias and variance of the algorithms and show how algorithms like Operon and GP-GOMEA return highly accurate models with similar behavior despite changes in training data. Our results highlight that larger model sizes do not imply different behavior when training data change. On the contrary, larger models effectively prevent systematic errors. We also show how other algorithms, like ITEA or AIFeynman, with the declared goal of producing consistent results, live up to their expectation of small and similar models.
Keywords: Genetic Programming; symbolic regression; bias/variance
1. Introduction
Symbolic regression (SR) is a task where we aim to find a closed-form mathematical equation that describes linear and nonlinear dependencies in data without making prior assumptions. The goal is to make predictions for unseen data by training a mathematical expression on a finite set of existing observations. While there is a huge variety of methods for solving regression problems, ranging from linear models over decision trees [1] to neural networks [2], SR methods deliver nonlinear and human-readable closed-form expressions as models with smooth and differentiable outputs. Therefore, SR is commonly applied in situations where we cannot rely on black-box models like neural networks, because these situations require comprehensible and traceable results for thorough model verification.
Bias and variance are properties of a machine learning algorithm that affect the interpretability of its models. The variance of an algorithm describes how the outputs of its models change when there are differences in the used training data. SR algorithms are considered high-variance algorithms, which means that even slightly different training data can lead to very dissimilar models [3]. However, high variance of an algorithm implies that it is capable of fitting a model, to a certain extent, perfectly to the training data. It is often necessary to limit the variance in order to perform well on unseen data, without restricting it too much, which would prevent highly accurate models and cause so-called bias. Balancing both properties is called the bias/variance trade-off [1].
From a practitioner's point of view, perturbations caused by variance do not inspire trust, as we would expect small changes in training data to have little effect on the overall results. Therefore, many algorithms use, e.g., statistical tools to gracefully deal with bias and variance, or they ignore these properties because their primary focus is accuracy regardless of the model structure [4]. However, bias and variance have specific implications for SR. SR is known for its human-readable and potentially interpretable white-box model structure, but high variance limits this feature. On the other hand, the complexity of models is limited for the sake of interpretability, which causes bias. Algorithmic aspects of SR amplify these effects: First, the stochastic nature of a Genetic Programming (GP)-based [5] SR algorithm results in differences in models between multiple SR runs even when the training data do not change at all. Second, since there is no guarantee for optimality in an SR search space [6], models trained in different SR runs might even provide very similar accuracy despite being completely different mathematical expressions due to, e.g., bloat [5] or over-parameterization [7], which increase the size of a model without affecting its accuracy.
1.1. Research Question
Controlling the variance while still achieving high accuracy is a central goal in SR research and is both directly and indirectly targeted by new algorithms. In this work, we examine the bias and the variance of contemporary SR algorithms. We show which algorithms perform most consistently despite perturbations in the training data. The results are put into perspective with their achieved accuracy and parsimony. We analyze which algorithms in recent SR research have been the most promising regarding robustness and reliability by mitigating variance while still providing high accuracy. This work complements and builds upon the existing results of the SRBench benchmark [8], which is a solid benchmark suite for SR algorithms. SRBench compares the accuracy of several SR algorithms on problems from the PMLB benchmark suite [9]. The results, which are available at cavalab.org/srbench (accessed on 21 November 2024), also include the trained models for each algorithm and data set. We reuse these published results and analyze the bias and variance of the already published models.
1.2. Related Work
Although most algorithmic advancements in SR algorithms affect their bias and variance, actual measurements and analyses of these properties are sparse. One analysis of bias and variance of SR algorithms was conducted by Keijzer and Babovic [10]. They calculated bias and variance measurements, but only for a few data sets and only for ensemble bagging of standard GP-based SR. Kammerer et al. [11] analyzed the variance of two different GP-based SR variants and compared them with Random Forest regression [12] and linear regression on a few data sets. Highly related to bias and variance is the work by de Franca et al. [13], who provided a successor of SRBench. In this benchmark, they tested specific properties of algorithms on a few specific data sets, instead of only the accuracy and model size on a wide range of different data sets. Those tests included, e.g., whether algorithms identified the ground truth on a symbolic level instead of just approximating it with an expression of any structure. In our work, we use the first version of SRBench, as we focus on the semantics of models on a broad range of data sets. More research about bias and variance was performed on other algorithms, most recently for neural networks by Neal et al. [4] and Yang et al. [14]. They evaluated the relation of bias/variance to model structure and the number of model parameters. Belkin et al. [15] questioned the classical understanding of the relationship between bias and variance and their dependence on the dimensionality of the problem for very large black-box models.
2. Bias/Variance Decomposition
In supervised machine learning tasks, we want to identify a function $\hat{f}(x) = y$ which approximates the unknown ground truth $f(x)$ for any $x$ drawn from a problem-specific distribution $P$. $x$ is a vector of so-called features and $y$ the scalar target. The target values contain randomly distributed noise $\epsilon$. In this work, we define $\epsilon \sim \mathcal{N}(0, \sigma)$ with $\sigma$ being a problem-specific standard deviation, as is done in SRBench [8]. Therefore, our data-generating function is $y = f(x) + \epsilon$. To find an approximation $\hat{f}(x)$, we use a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $x_i \sim P$ and their corresponding $y$ values. The goal of machine learning is to learn a prediction model $\hat{y} = \hat{f}(x, D)$ which minimizes a predefined loss function $L(y, \hat{y})$ on a training set $D$. This means that the output of a machine learning model depends on the used algorithm, the features $x$, and the set of training samples $D$.
Bias and variance of an algorithm arise from changes in the training data $D$ and have a direct effect on the error and loss of models. We use the mean squared error as the loss function. For this loss function, Hastie et al. [1] describe that a model's expected error on previously unseen data consists of bias, variance, and irreducible noise $\epsilon$. However, Domingos [16] describes how the decomposition of the error into bias and variance is also possible for other loss functions.
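For the mean squared error used here, this decomposition at a fixed point $x$ takes the following standard form [1,16], where $\sigma^2$ denotes the variance of the irreducible noise $\epsilon$ (restated here for reference):

$$E_{D,\epsilon}\!\left[\left(y - \hat{f}(x, D)\right)^2\right] = \mathrm{Bias}_D(\hat{f}, x)^2 + \mathrm{Var}_D(\hat{f}, x) + \sigma^2$$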
The distribution of training sets $D$ that contain values drawn from $P$ also leads to a distribution of model outputs $\hat{y}_0 = \hat{f}(x_0, D)$ for a single fixed feature vector $x_0$. The bias is the difference between the ground truth value $y_0 = f(x_0)$ and the expected value over $D$ of the model outputs $E_D[\hat{f}(x_0, D)]$, as depicted in Equation (1). The bias describes how far our estimation is "off" on average from the truth.

$$\mathrm{Bias}_D(\hat{f}, x) = E_D\!\left[\hat{f}(x, D)\right] - f(x) \tag{1}$$

$$\mathrm{Var}_D(\hat{f}, x) = E_D\!\left[\left(E_D\!\left[\hat{f}(x, D)\right] - \hat{f}(x, D)\right)^2\right] \tag{2}$$
Variance describes how far the outputs of estimators spread at a specific point $x_0$ when they were trained on different data sets. It is independent of the ground truth. It is defined in Equation (2) as the expected squared difference between the output of models and the average output of those models at a specific point $x_0$. Both properties are not mutually exclusive. However, algorithms with high variance tend to have lower bias and vice versa, so practitioners need to find an algorithm setting that minimizes both properties. This is called the bias/variance trade-off [1].
An example inspired by Geman et al. [17] of high bias, high variance, and an optimal trade-off between both is given in Figures 1 and 2, where we want to approximate an oscillating function with polynomial regression. We use polynomial regression because the results translate nicely to SR, as its search space is a subset of that of SR methods with an arithmetic function set. We add normally distributed noise to each target value $y$, so $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma)$ and $\sigma = 0.2$. Each set of samples $D$ consists of ten randomly sampled observations $D = \{x_1, \ldots, x_{10}\}$ with $x_i \sim \mathcal{N}(\pi, 2)$. Figure 1 shows the ground truth and one training set of samples. We get a different model with every different set of samples $D$. Depending on the algorithm settings, those models behave more or less similarly, resulting in different bias and variance measures. We expect the error of a perfect model to be distributed like the irreducible noise $\epsilon$.
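The following minimal sketch illustrates this experiment in Python (our own illustration rather than the code used for the figures; the function names and the NumPy polynomial fitting routine are our choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # oscillating ground truth from Figure 1
    return 0.1 * x + np.sin(x)

def fit_once(degree, n=10, sigma=0.2):
    """Draw one training set D and fit a polynomial of the given degree."""
    x = rng.normal(np.pi, 2, size=n)           # x_i ~ N(pi, 2)
    y = f(x) + rng.normal(0, sigma, size=n)    # y = f(x) + eps
    return np.polynomial.Polynomial.fit(x, y, deg=degree)

def bias_variance_at(x0, degree, repetitions=1000):
    """Monte Carlo estimate of Equations (1) and (2) at a single point x0."""
    preds = np.array([fit_once(degree)(x0) for _ in range(repetitions)])
    return preds.mean() - f(x0), preds.var()   # bias, variance

for degree in (2, 9, 4):
    bias, variance = bias_variance_at(4.0, degree)
    print(f"degree {degree}: bias {bias:+.3f}, variance {variance:.3f}")
```

Running this should reproduce the qualitative pattern of Figure 2: degree 2 yields a large systematic bias, degree 9 a large variance, and degree 4 a balanced trade-off.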
Figure 1. A simple example of an oscillating ground truth $f(x) = 0.1x + \sin(x)$ with randomly sampled points $y_D \in D$. Depending on the used machine learning algorithm, different training samples of $D$ will result in different models and therefore cause, to a certain extent, bias and variance.
Figure 2. Polynomial regression performed on the data from Figure 1 with different polynomial degrees as hyperparameter. The left column shows in each plot 20 different polynomials that were trained on different training sets. The right column shows the distribution and histogram of the error of 1000 different polynomials at $x = 4$ and reveals the bias and variance of the algorithm and its hyperparameters. (a) Fitted polynomials $\hat{f}(x, D_i)$ of degree 2. (b) High bias for degree 2. (c) Fitted polynomials $\hat{f}(x, D_i)$ of degree 9. (d) High variance for degree 9. (e) Fitted polynomials $\hat{f}(x, D_i)$ of degree 4. (f) Suitable fit for degree 4.
In Figure 2, we draw 1000 sets of samples $D$ and learn one polynomial for each training set for three algorithm settings. Twenty exemplary polynomials are shown for each setting in Figure 2a,c,e. The distribution of the difference $d$ between the ground truth and the model output at $x = 4$ is shown as a histogram in Figure 2b,d,f. These plots also compare the probability density function of $\epsilon$ with a normal distribution that has the mean and standard deviation (sd) of $d$ as parameters. Figure 2a,b show polynomials of degree 2, which cannot capture the oscillations in the data. The bias is high, as the average error $d$ clearly differs from zero, i.e., the average model output deviates systematically from the ground truth $f(x = 4)$. Figure 2c,d show polynomials of degree 9, with the highest variance and the smallest bias in this example. The error of single predictions at that point is often very high, which makes a single model unusable. However, their average fits the ground truth well, as there is nearly no bias. Figure 2e,f show polynomials of degree 4, which appears to be an appropriate setting with low bias and a variance close to the error variance.
As described before, the bias and variance of an algorithm are directly linked to the generalization capabilities of its models, as the prediction error on previously unseen data can be decomposed into the square of its bias and its variance [16]. Therefore, we expect algorithms whose models generalize well to be low in both bias and variance. Of course, not every algorithm is equally capable of producing high-quality results, as shown in SRBench. While the error of models is well measured, it is often unclear whether it is primarily caused by variance or bias. As described in Section 1, GP-based algorithms exhibit variance even when there is no change in the training data. On the other hand, deterministic algorithms like FFX and AIFeynman produce equal results on equal inputs. Therefore, we expect more similar models and less variance for those algorithms.
3. Experiments
Our analysis builds upon the data sets and models of the SRBench benchmark [8]. SRBench provides an in-depth analysis of the performance and model size of contemporary SR methods on many data sets. The methodology and the results of this benchmark fit perfectly for our purpose, and we can rely on a reviewed setting of algorithms and benchmark data.
SRBench uses the data sets of the PMLB benchmark suite [9,18], the Feynman Symbolic Regression Database [19], and the ODE-Strogatz repository [20]. In total, it contains both real-world data and generated data, where the ground truth and noise distribution are known. In this work, we use SRBench's results on the 116 data sets from the Feynman Symbolic Regression Database because the ground truth and data distribution are known, so we can arbitrarily generate new data. We refer to those problems as Feynman problems in the following. The Feynman problems are mostly low-dimensional, nonlinear expressions that were taken from physics textbooks [21].
In SRBench, four different noise levels $\sigma$ are used for the Feynman problems. Given a problem-specific ground truth $f(x)$, training data $y$ for one problem are generated with $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma)$ and $\sigma \in \{0, 0.001, 0.01, 0.1\}$, with $\mathcal{N}(0, 0)$ meaning no noise. This results in 464 combinations of data sets and noise levels. For each of those 464 problems, ten models were trained in the SRBench benchmark per algorithm. A different training set is sampled for each of the ten models [8]. All ten models were trained with the same hyperparameters, which is suitable for our study, because bias and variance are then only caused by the search procedure and not by differences in hyperparameters.
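To make this protocol concrete, the following sketch generates the four noise-level variants of one Feynman-style problem (a sketch under our own assumptions: the stand-in ground truth and the uniform feature range are illustrative, not SRBench's exact sampling code):

```python
import numpy as np

rng = np.random.default_rng(0)

def ground_truth(theta):
    # stand-in for one Feynman equation, e.g. I.6.20a: f = exp(-theta**2/2) / sqrt(2*pi)
    return np.exp(-theta**2 / 2) / np.sqrt(2 * np.pi)

datasets = {}
for sigma in (0.0, 0.001, 0.01, 0.1):
    x = rng.uniform(1, 3, size=10_000)                   # features drawn from P
    y = ground_truth(x) + rng.normal(0, sigma, x.shape)  # y = f(x) + eps, eps ~ N(0, sigma)
    datasets[sigma] = (x, y)                             # one data set per noise level
```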
We compare the accuracy, the model size, the bias, and the variance. We use the root mean squared error (RMSE) to measure accuracy, since it is on the same scale as the applied noise and therefore as the bias and variance. The model size is the number of symbols in the model, as defined in SRBench [13]. The model size describes the inverse of parsimony and is used as a notion of the simplicity of a model.
3.1. Bias/Variance Calculation
To compare the bias and variance between problems and algorithms, we use the expected value of the bias and variance over $x$. Given are a feature value distribution $P$, a ground truth $f(x)$, and a noise level $\sigma$ with a data-generating function $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma)$, and a distribution of training sets with $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, $x_i \sim P$, $m = 10{,}000$. $P$, $f(x)$, and $\sigma \in \{0, 0.001, 0.01, 0.1\}$ are problem-specific and defined in SRBench.
We reuse the ten models that were trained in SRBench for each problem. Every model was trained on a different training set drawn from $D$. We generate a new set of data $X = \{x_1, \ldots, x_n\}$, $x_i \sim P$, $n = 10{,}000$ to estimate the expected value of bias and variance for those ten models. We define the estimators as the average of bias and variance over the ten models for all values in $X$, as defined in Equation (3) for bias and Equation (4) for variance. To prevent sample-specific outliers, we use the same set of samples $X$ per problem for all algorithms.

$$\mathrm{Bias}(\hat{f}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \mathrm{Bias}_D\!\left(\hat{f}, x_i\right)^2}, \quad x_i \in X \tag{3}$$
The bias for one value $x_i$ defined in Equation (1) can become both negative and positive. However, we are not interested in the sign of the bias but only in its absolute value. To be consistent with the bias/variance decomposition by Hastie et al. [1], which decomposes the error into the sum of the variance, the square of the bias, and an irreducible error, we also use the square of the bias $\mathrm{Bias}_D(\hat{f}, x = x_0)$ from Equation (1) for the calculation of the bias of the overall problem and algorithm in Equation (3). We take the square root of the average of the squared bias to be on the same scale as the error of a model.
$$\mathrm{Var}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}_D\!\left(\hat{f}, x_i\right), \quad x_i \in X \tag{4}$$
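A compact sketch of both estimators (our own illustration; the trained models are represented as plain Python callables, which is an assumed interface):

```python
import numpy as np

def bias_variance_estimators(models, f, X):
    """Estimate Equations (3) and (4) for one problem and algorithm.

    models: list of callables hat_f(x), the ten models trained on different sets D
    f:      ground truth, vectorized over X
    X:      n feature vectors sampled from the problem distribution P
    """
    preds = np.stack([model(X) for model in models])         # one row per model
    mean_pred = preds.mean(axis=0)                           # E_D[hat_f(x_i, D)]
    pointwise_bias = mean_pred - f(X)                        # Equation (1)
    pointwise_var = ((mean_pred - preds) ** 2).mean(axis=0)  # Equation (2)
    bias = np.sqrt((pointwise_bias ** 2).mean())             # Equation (3)
    variance = pointwise_var.mean()                          # Equation (4)
    return bias, variance
```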
3.2. SRBench Algorithms and Models
The SR algorithms tested in SRBench take different approaches to limit the size and/or structure of their produced models to counteract over-fitting and therefore reduce the algorithm's variance. Therefore, we expect clear differences. For example, GP implementations such as GP-GOMEA [22], Operon [23], AFP [24], and AFP-FE [25] adapt their search procedure. GP-GOMEA identifies building blocks that cover essential dependencies. Operon, AFP, and AFP-FE use a multi-objective search procedure to incorporate both accuracy and parsimony in their objective function [23–25]. EPLEX, AFP, and AFP-FE use $\epsilon$-lexicase selection in their GP procedure [26], in which, instead of one aggregated error value, multiple tests are performed over different regions of the training data. ITEA [27] deliberately restricts the structure of its models. gplearn (gplearn.readthedocs.io, accessed on 21 November 2024) provides a GP implementation that is close to the very first ideas about GP-based SR by Koza [5]. DSR [28] is a non-GP-based algorithm that considers SR as a reinforcement learning problem and uses a neural network, another high-variance method, to produce a distribution of small symbolic models. FFX [29] and AIFeynman 2.0 [19] also do not build upon GP but run deterministic search strategies in restricted search spaces. We expect that both determinism and restrictions of the search space result in higher bias and therefore lower variance [1].
We reuse the string representation of all models from the published SRBench GitHub repository. The strings are parsed in the same way as in SRBench [8] with the SymPy [30] Python framework. However, certain algorithms, such as Operon, lack precision in their string representation, which printed only three decimal places for real-valued parameters in the model. We re-tuned those parameters for models that did not achieve the accuracy reported in SRBench when the model was evaluated with the reported parameter values. We re-tune the parameters with the L-BFGS-B algorithm [31,32] as implemented in the SciPy library [33]. We constrain the optimization so that we only optimize the decimal places of the reported parameters; e.g., the optimization of the reported value 5.5 is constrained to the interval $[5, 6]$. The goal is to change the original model as little as possible and prevent further distortion of the overall results, while still being able to reuse the results from SRBench.
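A minimal sketch of this re-tuning step (our own reconstruction; the model is assumed to be available as a callable with a parameter vector, and the floor/ceiling rule mirrors the interval constraint described above):

```python
import numpy as np
from scipy.optimize import minimize

def retune(predict, params0, X, y):
    """Re-tune truncated model parameters with bound-constrained L-BFGS-B.

    predict: callable predict(params, X) -> model output for the parsed expression
    params0: parameter values as printed in the model string (three decimal places)
    """
    def mse(params):
        residuals = predict(params, X) - y
        return np.mean(residuals ** 2)

    # only the decimal places may change, e.g. the printed value 5.5 stays in [5, 6]
    bounds = [(np.floor(p), np.ceil(p)) for p in params0]
    return minimize(mse, params0, method="L-BFGS-B", bounds=bounds).x
```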
We skipped algorithms for which we could not reproduce the reported accuracy with the corresponding reported string representation of their models. We excluded Bayesian Symbolic Regression (BSR) [34], the Feature Engineering Automation Tool (FEAT) [35], Multiple Regression GP (MRGP) [36], and Semantic Backpropagation GP (SBGP) [37].
4. Results
As bias and variance are directly linked to the error of models, we first analyze the generalization error of all models. In SRBench, the models were trained on data with different noise levels; however, the test data are noise-free in order to analyze how close the models are to the ground truth [8]. Given that the error can be decomposed into bias, variance, and an irreducible error, the absence of an irreducible error allows us to deduce whether the error of an algorithm on a data set is caused by bias or variance. The root mean squared error of all models as well as their size are taken from SRBench and shown in Figure 3. The algorithms in all plots are sorted by their median test error. While SRBench only analyzed which functions provided a close enough approximation to the ground truth using a specific threshold, we show the error values.
4.1. Error and Model Size
SRBench [8] shows that GP-based methods provide the most accurate results on the test partition for the given Feynman data sets. To determine which algorithms perform better than others, we rank the algorithms by their average accuracy on each problem. We assign rank one to the algorithm with the lowest error for a specific problem and noise level combination, rank two to the second best, and so on. The same procedure is done for the model size. The distributions for accuracy and model size over all noise levels are shown in Figure 3. The median and the interquartile range of all error values and their corresponding ranks, broken down by noise level, are shown in Table 1. Table 2 shows the same statistics for the size of all models. The box of FFX in Figure 3b is cut off to prevent distortion of the axis, but the distribution is described in detail in Table 2.
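The ranking procedure can be sketched as follows (an illustration with hypothetical numbers; the column names and the tie-handling rule are our assumptions):

```python
import pandas as pd

# one row per (problem, noise level, algorithm) with the average RMSE and
# model size of the ten models trained for that combination
results = pd.DataFrame({
    "problem":   ["I.6.20a"] * 3 + ["I.12.1"] * 3,
    "sigma":     [0.01] * 6,
    "algorithm": ["Operon", "GP-GOMEA", "FFX"] * 2,
    "rmse":      [0.007, 0.009, 0.20, 0.004, 0.006, 0.18],
    "size":      [88, 43, 250, 85, 40, 230],
})

# rank one goes to the lowest error (or the smallest model) per problem/noise combination
results["rank_rmse"] = results.groupby(["problem", "sigma"])["rmse"].rank(method="min")
results["rank_size"] = results.groupby(["problem", "sigma"])["size"].rank(method="min")
```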
Table 1. Distribution of the average RMSE values and corresponding ranks per problem, broken down by noise level σ. The left value in each cell is the median, the right value the interquartile range. The right column contains the values outlined in Figures 3 and 4.

| | | σ = 0.000 | σ = 0.001 | σ = 0.010 | σ = 0.100 | All Noise Levels |
|---|---|---|---|---|---|---|
| AIFeynman | RMSE | 9.7×10^−15 / 1.1×10^−2 | 4.3×10^−4 / 1.6×10^−2 | 6.0×10^−3 / 3.0×10^−2 | 4.1×10^0 / 1.5×10^1 | 3.2×10^−3 / 6.6×10^−2 |
| | Rank | 2/3 | 3/4 | 4/4 | 10/4 | 4/7 |
| Operon | RMSE | 4.3×10^−4 / 8.5×10^−3 | 7.4×10^−4 / 1.0×10^−2 | 6.9×10^−3 / 2.9×10^−2 | 4.3×10^−2 / 1.5×10^−1 | 4.6×10^−3 / 4.1×10^−2 |
| | Rank | 2/3 | 2/3 | 2/3 | 3/5 | 2/3 |
| GP-GOMEA | RMSE | 3.5×10^−3 / 8.2×10^−2 | 3.0×10^−3 / 5.5×10^−2 | 9.1×10^−3 / 6.0×10^−2 | 4.0×10^−2 / 1.3×10^−1 | 1.2×10^−2 / 9.5×10^−2 |
| | Rank | 3/2 | 3/2 | 3/2 | 2/3 | 3/2 |
| AFP-FE | RMSE | 3.4×10^−3 / 1.6×10^−1 | 7.9×10^−3 / 1.7×10^−1 | 9.3×10^−3 / 1.8×10^−1 | 3.8×10^−2 / 2.0×10^−1 | 1.3×10^−2 / 1.8×10^−1 |
| | Rank | 5/1 | 5/1 | 4/2 | 4/3 | 4/1 |
| EPLEX | RMSE | 3.4×10^−2 / 2.6×10^−1 | 3.6×10^−2 / 2.1×10^−1 | 2.1×10^−2 / 1.7×10^−1 | 4.4×10^−2 / 1.9×10^−1 | 3.4×10^−2 / 2.0×10^−1 |
| | Rank | 5/3 | 5/2 | 5/3 | 4/2 | 5/3 |
| AFP | RMSE | 2.6×10^−2 / 3.8×10^−1 | 3.0×10^−2 / 3.4×10^−1 | 3.8×10^−2 / 3.5×10^−1 | 6.5×10^−2 / 4.1×10^−1 | 4.0×10^−2 / 3.7×10^−1 |
| | Rank | 6/2 | 7/2 | 6/2 | 6/2 | 6/2 |
| gplearn | RMSE | 8.3×10^−2 / 4.0×10^−1 | 8.1×10^−2 / 3.9×10^−1 | 8.8×10^−2 / 4.0×10^−1 | 1.0×10^−1 / 4.0×10^−1 | 8.9×10^−2 / 4.0×10^−1 |
| | Rank | 7/3 | 7/3 | 8/3 | 7/3 | 7/3 |
| ITEA | RMSE | 1.0×10^−1 / 1.6×10^0 | 1.0×10^−1 / 1.6×10^0 | 1.0×10^−1 / 1.6×10^0 | 1.3×10^−1 / 1.6×10^0 | 1.1×10^−1 / 1.6×10^0 |
| | Rank | 8/4 | 8/4 | 8/4 | 7/3 | 8/4 |
| DSR | RMSE | 1.8×10^−1 / 1.2×10^0 | 1.7×10^−1 / 1.1×10^0 | 1.7×10^−1 / 1.1×10^0 | 1.7×10^−1 / 1.2×10^0 | 1.7×10^−1 / 1.1×10^0 |
| | Rank | 9/3 | 9/4 | 9/4 | 8/2 | 9/3 |
| FFX | RMSE | 2.1×10^−1 / 1.5×10^0 | 2.1×10^−1 / 1.5×10^0 | 2.0×10^−1 / 1.5×10^0 | 2.6×10^−1 / 1.5×10^0 | 2.2×10^−1 / 1.5×10^0 |
| | Rank | 8/3 | 8/3 | 8/3 | 7/4 | 8/3 |
Table 2. Distribution of the average model size and corresponding ranks per problem, broken down by noise level σ. The left value in each cell is the median, the right value the interquartile range. The right column contains the values outlined in Figures 3 and 4.

| | | σ = 0.000 | σ = 0.001 | σ = 0.010 | σ = 0.100 | All Noise Levels |
|---|---|---|---|---|---|---|
| AIFeynman | Size | 12/9 | 17/8 | 18/9 | 14/8 | 16/10 |
| | Rank | 2/3 | 3/2 | 3/4 | 3/4 | 3/4 |
| Operon | Size | 80/39 | 80/39 | 88/8 | 90/6 | 86/15 |
| | Rank | 9/1 | 8/1 | 9/1 | 9/1 | 9/1 |
| GP-GOMEA | Size | 34/23 | 42/18 | 43/17 | 46/17 | 42/19 |
| | Rank | 5/2 | 5/2 | 5/3 | 6/2 | 5/3 |
| AFP-FE | Size | 48/48 | 58/38 | 58/35 | 58/26 | 57/40 |
| | Rank | 6/3 | 6/2 | 6/2 | 7/1 | 6/2 |
| EPLEX | Size | 62/8 | 62/7 | 61/11 | 53/28 | 60/14 |
| | Rank | 8/1 | 8/1 | 7/1 | 6/2 | 7/2 |
| AFP | Size | 35/38 | 41/35 | 44/34 | 44/32 | 42/35 |
| | Rank | 5/2 | 5/2 | 5/2 | 5/2 | 5/2 |
| gplearn | Size | 18/53 | 18/50 | 18/47 | 15/35 | 17/46 |
| | Rank | 4/6 | 4/6 | 3/5 | 3/5 | 3/5 |
| ITEA | Size | 20/10 | 20/10 | 20/10 | 20/9 | 20/10 |
| | Rank | 3/2 | 3/1 | 2/1 | 3/1 | 3/1 |
| DSR | Size | 12/12 | 13/15 | 13/16 | 13/13 | 13/13 |
| | Rank | 2/2 | 1/2 | 1/2 | 2/2 | 2/2 |
| FFX | Size | 236/301 | 243/306 | 250/307 | 295/270 | 259/307 |
| | Rank | 10/1 | 10/1 | 10/1 | 10/1 | 10/1 |
Table 1 shows that most algorithms perform similarly across all noise levels. One exception is AIFeynman [19]. As also described in the original paper by Udrescu and Tegmark [19], AIFeynman outperforms the other algorithms on problems without noise but struggles with low noise levels of $\sigma \in [0.001, 0.01]$ and performs worst at $\sigma = 0.1$. Operon and GP-GOMEA provided the most accurate models across all noise levels. The difference between those two algorithms is not significant, as the pairwise p-values of the error rank distributions in Table 3 show. Algorithms with restricted search spaces, such as AIFeynman, ITEA, DSR, and FFX, performed worst and (mostly) without significant differences between each other. An outlier among the algorithms with restricted search spaces is AIFeynman on problems without noise.
We are also interested in the distribution of model sizes in Figures 3a and 4a. Larger and more complex solutions are commonly linked to higher variance [17]. However, recent work by Neal et al. [4] describes that bias and variance in neural networks do not necessarily grow with the number of model parameters when the structure of the neural network is adapted. Table 2 shows that the noise level affects the size of models very little, with more noise leading to slightly larger models.
The large size of Operon's models is evident in Figures 3a and 4a. Except for FFX, Operon tends to produce the largest models of all algorithms. On the other hand, GP-GOMEA provides similar accuracy but produces smaller models that rank in the middle compared to the other algorithms. ITEA and DSR provided the most inaccurate results among the non-deterministic methods but achieved their self-declared algorithmic goal of finding concise models [27,28].
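The significance testing behind Table 3 (and Tables 6 and 7 below) can be sketched as follows (our illustration of the procedure proposed by Demšar [38]; variable names are our own):

```python
from itertools import combinations
from scipy.stats import wilcoxon

def pairwise_tests(ranks, alpha=0.05):
    """Pairwise Wilcoxon signed-rank tests on problem-wise ranks.

    ranks: dict mapping algorithm name -> list of ranks, one entry per problem
    """
    algorithms = list(ranks)
    n_tests = len(algorithms) * (len(algorithms) - 1) // 2  # 45 pairs for 10 algorithms
    corrected_alpha = alpha / n_tests                       # Bonferroni: 0.05/45 = 1.1e-3
    for a, b in combinations(algorithms, 2):
        _, p = wilcoxon(ranks[a], ranks[b])
        significant = p < corrected_alpha
        print(f"{a} vs {b}: p = {p:.1e}{'  (significant)' if significant else ''}")
```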
Table 3. p-values of pairwise Wilcoxon signed-rank tests on the error ranks in Figure 4a with Bonferroni correction of the significance level α, as proposed by Demšar [38]. The significance level is α = 1.1×10^−3; significant values smaller than α are highlighted in bold. Values below 1×10^−10 are rounded to zero.

| | AIFeynman | Operon | GP-GOMEA | AFP-FE | EPLEX | AFP | gplearn | ITEA | DSR | FFX |
|---|---|---|---|---|---|---|---|---|---|---|
| AIFeynman | | 0 | 0 | 2.4×10^−3 | 5.0×10^−1 | **1.1×10^−6** | 0 | 0 | 0 | 0 |
| Operon | 0 | | 3.2×10^−1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| GP-GOMEA | 0 | 3.2×10^−1 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AFP-FE | 2.4×10^−3 | 0 | 0 | | **7.1×10^−4** | 0 | 0 | 0 | 0 | 0 |
| EPLEX | 5.0×10^−1 | 0 | 0 | **7.1×10^−4** | | 0 | 0 | 0 | 0 | 0 |
| AFP | **1.1×10^−6** | 0 | 0 | 0 | 0 | | 0 | 2.6×10^−1 | 0 | 0 |
| gplearn | 0 | 0 | 0 | 0 | 0 | 0 | | 7.1×10^−1 | **1.3×10^−5** | 3.6×10^−1 |
| ITEA | 0 | 0 | 0 | 0 | 0 | 2.6×10^−1 | 7.1×10^−1 | | 1.6×10^−3 | 3.0×10^−1 |
| DSR | 0 | 0 | 0 | 0 | 0 | 0 | **1.3×10^−5** | 1.6×10^−3 | | 7.9×10^−2 |
| FFX | 0 | 0 | 0 | 0 | 0 | 0 | 3.6×10^−1 | 3.0×10^−1 | 7.9×10^−2 | |
Figure 3. For each problem and algorithm, ten models were trained in SRBench [8]. The box plots show the distribution of the per-problem means of the ten models, with the median as blue lines and the interquartile range as boxes. The values are also listed in Tables 1 and 2. (a) Root mean squared error on the test partition. (b) Model size.
Figure 4. Distribution of problem-wise ranks for test error and model size. We compare the average test error and average model size per problem and algorithm from Figure 3 and assign ranks from one to ten for each problem. The box plots show the median as blue lines and the interquartile range as boxes of the algorithms' ranks over all problems. The values are also listed in Tables 1 and 2. (a) Rank distribution of the test error. Rank one is assigned to the model with the smallest error. (b) Rank distribution of the model size. Rank one is assigned to the smallest model.
Table 4. Distribution of the average bias and corresponding ranks per problem, broken down by noise level σ. The left value in each cell is the median, the right value the interquartile range. The right column contains the values outlined in Figures 5 and 6.

| | | σ = 0.000 | σ = 0.001 | σ = 0.010 | σ = 0.100 | All Noise Levels |
|---|---|---|---|---|---|---|
| AIFeynman | Bias | 3.0×10^−13 / 4.1×10^−2 | 5.1×10^−4 / 3.8×10^−2 | 7.0×10^−3 / 6.1×10^−2 | 5.3×10^0 / 1.5×10^1 | 8.9×10^−3 / 7.6×10^−1 |
| | Rank | 2/3 | 3/4 | 4/4 | 10/4 | 4/6 |
| Operon | Bias | 9.4×10^−4 / 1.6×10^−2 | 1.2×10^−3 / 1.9×10^−2 | 5.8×10^−3 / 3.8×10^−2 | 2.7×10^−2 / 1.3×10^−1 | 5.9×10^−3 / 4.9×10^−2 |
| | Rank | 3/4 | 2/3 | 3/4 | 3/5 | 3/4 |
| GP-GOMEA | Bias | 2.8×10^−3 / 5.6×10^−2 | 2.3×10^−3 / 4.2×10^−2 | 4.3×10^−3 / 4.4×10^−2 | 2.0×10^−2 / 8.0×10^−2 | 6.0×10^−3 / 5.8×10^−2 |
| | Rank | 3/2 | 3/3 | 3/2 | 2/2 | 3/2 |
| AFP-FE | Bias | 1.4×10^−2 / 1.8×10^−1 | 2.3×10^−2 / 2.3×10^−1 | 2.1×10^−2 / 1.9×10^−1 | 3.7×10^−2 / 2.3×10^−1 | 2.9×10^−2 / 2.1×10^−1 |
| | Rank | 4/1 | 5/2 | 5/1 | 4/2 | 4/1 |
| EPLEX | Bias | 6.2×10^−2 / 2.1×10^−1 | 4.5×10^−2 / 2.0×10^−1 | 3.7×10^−2 / 1.5×10^−1 | 4.6×10^−2 / 1.6×10^−1 | 4.5×10^−2 / 1.8×10^−1 |
| | Rank | 5/2 | 5/2 | 4/3 | 4/2 | 4/3 |
| AFP | Bias | 6.3×10^−2 / 4.2×10^−1 | 5.4×10^−2 / 3.8×10^−1 | 5.5×10^−2 / 3.8×10^−1 | 7.2×10^−2 / 3.9×10^−1 | 6.1×10^−2 / 4.0×10^−1 |
| | Rank | 6/2 | 6/2 | 6/2 | 5/1 | 6/2 |
| gplearn | Bias | 1.4×10^−1 / 4.3×10^−1 | 1.2×10^−1 / 3.6×10^−1 | 1.1×10^−1 / 3.4×10^−1 | 1.2×10^−1 / 3.6×10^−1 | 1.2×10^−1 / 3.7×10^−1 |
| | Rank | 7/3 | 7/3 | 7/3 | 7/3 | 7/4 |
| ITEA | Bias | 1.1×10^−1 / 1.5×10^0 | 1.1×10^−1 / 1.5×10^0 | 1.1×10^−1 / 1.5×10^0 | 1.3×10^−1 / 1.5×10^0 | 1.1×10^−1 / 1.6×10^0 |
| | Rank | 8/4 | 8/4 | 8/5 | 7/4 | 8/4 |
| DSR | Bias | 1.5×10^−1 / 8.9×10^−1 | 1.6×10^−1 / 9.6×10^−1 | 1.6×10^−1 / 8.5×10^−1 | 1.6×10^−1 / 9.5×10^−1 | 1.6×10^−1 / 9.3×10^−1 |
| | Rank | 8/4 | 8/4 | 8/4 | 8/3 | 8/3 |
| FFX | Bias | 2.2×10^−1 / 1.5×10^0 | 2.4×10^−1 / 1.5×10^0 | 2.2×10^−1 / 1.5×10^0 | 2.4×10^−1 / 1.5×10^0 | 2.4×10^−1 / 1.5×10^0 |
| | Rank | 8/2 | 8/4 | 8/3 | 7/3 | 8/3 |
Figure 5. Distribution of bias and variance values as defined in Equations (3) and (4) over all problems. The blue line shows the median, the boxes show the interquartile range. (a) Bias values according to Equation (3). (b) Variance values according to Equation (4).
Figure 6. Distribution of problem-wise ranks for bias and variance. We compare the bias and variance per problem and algorithm from Figure 5 and assign ranks from one to ten for each problem. The box plots show the median as blue lines and the interquartile range as boxes over all rank values. (a) Rank distribution of bias. Rank one is assigned to the model with the smallest bias. (b) Rank distribution of variance. Rank one is assigned to the model with the smallest variance.
4.2. Bias and Variance
Given that there is no irreducible error on the target variable in SRBench's test partitions [8], the error can be decomposed into bias and variance. The distributions of the bias and variance values as defined in Equations (3) and (4) for each data set over all noise levels are shown in Figure 5. As expected, algorithms with more accurate models tend to be low in both bias and variance. However, the two properties do not behave the same across all algorithms. The bias of AIFeynman increases from no bias at all to the highest bias of all algorithms with increasing noise level, as Table 4 shows. For the other algorithms, the median bias as well as the median variance increase with the median error. Exceptions regarding variance are ITEA and, to some extent, FFX, which suffer primarily from high bias and not from high variance. Considering the restricted search space of both algorithms, this appears plausible.
Figure 6 and Tables 4 and 5 show a rank-wise comparison of bias and variance. Operon and GP-GOMEA provide the smallest bias, without significant differences according to the pairwise p-values and significance levels in Table 6. Table 4 shows that the distribution of bias values for AIFeynman is distorted by its very high bias at high noise levels. Multiple algorithms are on the same level regarding variance. Tables 6 and 7 show that not only Operon and GP-GOMEA but also the rank-wise less accurate AIFeynman and ITEA have similar variance rank distributions without statistically significant differences. This shows that ITEA, FFX, and AIFeynman suffer primarily from high bias and comparably low variance.
Table 5. Distribution of the average variance and corresponding ranks per problem, broken down by noise level σ. The left value in each cell is the median, the right value the interquartile range. The right column contains the values outlined in Figures 5 and 6.

| | | σ = 0.000 | σ = 0.001 | σ = 0.010 | σ = 0.100 | All Noise Levels |
|---|---|---|---|---|---|---|
| AIFeynman | Variance | 3.6×10^−15 / 2.4×10^−2 | 7.5×10^−4 / 2.1×10^−2 | 8.3×10^−3 / 2.2×10^−2 | 5.8×10^−2 / 3.4×10^−1 | 3.6×10^−3 / 3.8×10^−2 |
| | Rank | 2/3 | 3/4 | 3/4 | 6/5 | 3/5 |
| Operon | Variance | 1.5×10^−3 / 2.4×10^−2 | 2.5×10^−3 / 2.3×10^−2 | 1.3×10^−2 / 5.1×10^−2 | 6.4×10^−2 / 2.0×10^−1 | 1.2×10^−2 / 7.7×10^−2 |
| | Rank | 3/3 | 3/2 | 3/3 | 4/4 | 3/3 |
| GP-GOMEA | Variance | 2.9×10^−3 / 6.1×10^−2 | 4.8×10^−3 / 4.3×10^−2 | 1.1×10^−2 / 5.6×10^−2 | 4.0×10^−2 / 1.4×10^−1 | 1.1×10^−2 / 8.4×10^−2 |
| | Rank | 3/2 | 3/3 | 3/3 | 3/2 | 3/3 |
| AFP-FE | Variance | 3.2×10^−2 / 1.9×10^−1 | 4.5×10^−2 / 1.8×10^−1 | 3.6×10^−2 / 2.4×10^−1 | 6.2×10^−2 / 2.7×10^−1 | 4.5×10^−2 / 2.2×10^−1 |
| | Rank | 6/2 | 6/2 | 6/2 | 5/2 | 6/3 |
| EPLEX | Variance | 1.1×10^−1 / 3.6×10^−1 | 8.5×10^−2 / 3.4×10^−1 | 8.1×10^−2 / 2.4×10^−1 | 8.1×10^−2 / 3.1×10^−1 | 9.0×10^−2 / 3.4×10^−1 |
| | Rank | 6/3 | 6/3 | 6/2 | 6/3 | 6/2 |
| AFP | Variance | 8.5×10^−2 / 4.5×10^−1 | 7.4×10^−2 / 5.3×10^−1 | 9.2×10^−2 / 5.2×10^−1 | 1.0×10^−1 / 6.3×10^−1 | 9.4×10^−2 / 5.4×10^−1 |
| | Rank | 8/3 | 8/2 | 8/3 | 8/3 | 8/3 |
| gplearn | Variance | 1.2×10^−1 / 5.3×10^−1 | 1.5×10^−1 / 5.4×10^−1 | 1.7×10^−1 / 5.1×10^−1 | 1.6×10^−1 / 5.0×10^−1 | 1.5×10^−1 / 5.2×10^−1 |
| | Rank | 9/2 | 8/2 | 8/3 | 8/3 | 8/3 |
| ITEA | Variance | 6.8×10^−3 / 8.6×10^−2 | 8.2×10^−3 / 9.3×10^−2 | 8.7×10^−3 / 8.1×10^−2 | 3.5×10^−2 / 1.2×10^−1 | 1.2×10^−2 / 1.0×10^−1 |
| | Rank | 4/3 | 4/4 | 3/4 | 3/4 | 3/4 |
| DSR | Variance | 9.7×10^−2 / 6.9×10^−1 | 1.1×10^−1 / 5.8×10^−1 | 1.1×10^−1 / 7.0×10^−1 | 9.4×10^−2 / 7.5×10^−1 | 1.0×10^−1 / 6.9×10^−1 |
| | Rank | 9/7 | 9/5 | 9/5 | 9/5 | 9/6 |
| FFX | Variance | 6.2×10^−2 / 2.6×10^−1 | 6.3×10^−2 / 2.8×10^−1 | 5.9×10^−2 / 2.6×10^−1 | 7.2×10^−2 / 3.0×10^−1 | 6.2×10^−2 / 2.9×10^−1 |
| | Rank | 6/2 | 6/3 | 6/3 | 5/3 | 6/3 |
Table 6. p-values of pairwise Wilcoxon signed-rank tests on the bias ranks in Figure 6a with Bonferroni correction of the significance level α, as proposed by Demšar [38]. The significance level is α = 1.1×10^−3; significant values smaller than α are highlighted in bold. Values below 1×10^−10 are rounded to zero.

| | AIFeynman | Operon | GP-GOMEA | AFP-FE | EPLEX | AFP | gplearn | ITEA | DSR | FFX |
|---|---|---|---|---|---|---|---|---|---|---|
| AIFeynman | | 9.8×10^−1 | 0 | 7.7×10^−2 | 4.2×10^−1 | **6.3×10^−6** | 0 | 0 | 0 | 0 |
| Operon | 9.8×10^−1 | | 1.3×10^−1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| GP-GOMEA | 0 | 1.3×10^−1 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AFP-FE | 7.7×10^−2 | 0 | 0 | | 2.2×10^−1 | 0 | 0 | 0 | 0 | 0 |
| EPLEX | 4.2×10^−1 | 0 | 0 | 2.2×10^−1 | | 0 | 0 | 0 | 0 | 0 |
| AFP | **6.3×10^−6** | 0 | 0 | 0 | 0 | | 0 | 0 | 0 | 0 |
| gplearn | 0 | 0 | 0 | 0 | 0 | 0 | | 3.4×10^−1 | **6.5×10^−6** | **2.3×10^−4** |
| ITEA | 0 | 0 | 0 | 0 | 0 | 0 | 3.4×10^−1 | | 1.7×10^−2 | 2.3×10^−2 |
| DSR | 0 | 0 | 0 | 0 | 0 | 0 | **6.5×10^−6** | 1.7×10^−2 | | 9.0×10^−1 |
| FFX | 0 | 0 | 0 | 0 | 0 | 0 | **2.3×10^−4** | 2.3×10^−2 | 9.0×10^−1 | |
Table 7. p-values of pairwise Wilcoxon signed-rank tests on the variance ranks in Figure 6b with Bonferroni correction of the significance level α, as proposed by Demšar [38]. The significance level is α = 1.1×10^−3; significant values smaller than α are highlighted in bold. Values below 1×10^−10 are rounded to zero.

| | AIFeynman | Operon | GP-GOMEA | AFP-FE | EPLEX | AFP | gplearn | ITEA | DSR | FFX |
|---|---|---|---|---|---|---|---|---|---|---|
| AIFeynman | | 1.8×10^−2 | 1.9×10^−1 | 0 | 0 | 0 | 0 | 9.8×10^−2 | 0 | 0 |
| Operon | 1.8×10^−2 | | 2.3×10^−2 | 0 | 0 | 0 | 0 | 9.1×10^−1 | 0 | 0 |
| GP-GOMEA | 1.9×10^−1 | 2.3×10^−2 | | 0 | 0 | 0 | 0 | 2.8×10^−2 | 0 | 0 |
| AFP-FE | 0 | 0 | 0 | | **1.8×10^−4** | 0 | 0 | 0 | 0 | 6.8×10^−1 |
| EPLEX | 0 | 0 | 0 | **1.8×10^−4** | | 0 | 0 | 0 | **7.4×10^−6** | **7.8×10^−6** |
| AFP | 0 | 0 | 0 | 0 | 0 | | 1.7×10^−2 | 0 | 4.1×10^−2 | 0 |
| gplearn | 0 | 0 | 0 | 0 | 0 | 1.7×10^−2 | | 0 | 2.8×10^−1 | 0 |
| ITEA | 9.8×10^−2 | 9.1×10^−1 | 2.8×10^−2 | 0 | 0 | 0 | 0 | | 0 | 0 |
| DSR | 0 | 0 | 0 | 0 | **7.4×10^−6** | 4.1×10^−2 | 2.8×10^−1 | 0 | | **1.5×10^−8** |
| FFX | 0 | 0 | 0 | 6.8×10^−1 | **7.8×10^−6** | 0 | 0 | 0 | **1.5×10^−8** | |
4.3. Relation Between Parsimony, Bias, and Variance
While there is a clear relationship between the bias, variance, and test error of the algorithms, the relation between model size, as a notion of the inverse of parsimony, and bias and variance is less clear. Effects like bloat and over-parameterization [7], as well as the different sets of used mathematical functions, distort clear connections between those properties. The relation between the median values of model size and bias/variance over all problems is outlined in Figures 7 and 8. FFX is excluded from both figures, as its huge model size values would distort the axis scale.
Figure 7 shows that algorithms with larger models tend toward lower bias, which is expected. An outlier is again AIFeynman, with low bias only at low noise levels, which results in an overall low median bias, as shown in Table 4. In contrast to the clear connection for bias, Figure 8 does not show a clear connection between model size and variance across all algorithms. The relation between variance and size is distorted by the different accuracy levels achieved by each algorithm. Operon provides both the largest models and the smallest variance. Given that Operon is one of the most accurate algorithms, its small variance is expected, as the error can be decomposed into bias and variance. GP-GOMEA, on the other hand, has similar accuracy and variance but provides much smaller models. Other algorithms like gplearn or EPLEX have a higher error and therefore higher variance, but create models of very different sizes.
Figure 7. Relation between median bias and median model size over all problems.
Figure 8. Relation between median variance and median model size per algorithm over all problems.
5. Conclusions and Outlook
In this study, we analyzed the bias and variance of ten contemporary SR methods. We show how small differences in the training data affect the behavior of the models beyond differences in error metrics. We use the models that were trained in the SRBench benchmark [8], as they provide a well-established setting for a fair algorithm comparison.

We show that both bias and variance increase with the test error of models for most algorithms. Exceptions are the algorithms AIFeynman, ITEA, and FFX, whose error is primarily caused by bias. This is expected, as stronger restrictions of the search space should lead to more similar outputs despite changes in training data, but also to a consistent, systematic error in all models. Another possible explanation for the high bias of AIFeynman and FFX is their non-evolutionary heuristic search, which induces bias.

Our experiments confirm our expectation that larger models tend to have smaller bias, except for FFX. However, despite the common assumption that larger models are susceptible to high variance, we could not observe a clear connection between those two properties in our analysis. The connection between variance and model size is distorted by the different median accuracy of the methods. Given that high accuracy is achieved both by algorithms with small models and by algorithms with large models, small variance also occurs in algorithms with models of any size. For example, Operon and GP-GOMEA provide the smallest error and variance across all noise levels; however, the sizes of their models differ clearly. While GP-GOMEA's average model size ranks in the middle of the other algorithms, Operon found the largest models of all GP-based methods. This implies that even though Operon's models are large and might look very different between multiple algorithm runs, their behavior on the Feynman data sets is very consistent.

This work is a first step towards the analysis of bias and variance in the symbolic regression domain. Although bias and variance belong to the basic knowledge in machine learning, recent studies of other machine learning methods challenge the common understanding of this topic. Therefore, we also suggest further research in this direction for symbolic regression. The most obvious extension of our work would be an analysis on real-world problems, as this work was restricted to generated benchmark data of problems with limited dimensionality. While the presence of actually unknown noise, an unknown ground truth, and a usually too small number of observations in such scenarios makes a fair comparison hard, it would give even further insight into the practicality of SR methods. Moreover, our analysis, especially regarding variance, was limited by the high accuracy of multiple algorithms on the given synthetic problems. Further research regarding the relationship between variance and model size would benefit from harder problems, where all algorithms yield a certain level of error. This would allow a more in-depth analysis of variance and highlight more differences between algorithms, especially between GP-GOMEA and Operon. Another SR-specific aspect for further research is the analysis of the symbolic structure of SR models. While this work only focuses on the behavior and output, and therefore the semantics, of models, another important aspect for practitioners is whether the formulas found in SR are similar from a syntactic perspective.
Author Contributions: Conceptualization, L.K. and S.W.; methodology, L.K.; software, L.K.; valida-
tion, L.K. and G.K.; data curation, L.K.; writing—original draft preparation, L.K.; writing—review and
editing, G.K. and S.W.; visualization, L.K.; supervision, G.K. and S.W.; project administration, G.K.;
funding acquisition, G.K. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Austrian Federal Ministry for Climate Action, Environ-
ment, Energy, Mobility, Innovation and Technology, the Federal Ministry for Labour and Economy,
and the regional government of Upper Austria within the COMET project ProMetHeus (904919)
supported by the Austrian Research Promotion Agency (FFG).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data and the models that are analyzed in this study were taken from the SRBench benchmark suite [8] and are publicly available at cavalab.org/srbench (accessed on 21 November 2024).
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2.
2. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366. [CrossRef]
3. Korns, M.F. Accuracy in symbolic regression. In Genetic Programming Theory and Practice IX; Springer: New York, NY, USA, 2011; pp. 129–151.
4. Neal, B.; Mittal, S.; Baratin, A.; Tantia, V.; Scicluna, M.; Lacoste-Julien, S.; Mitliagkas, I. A Modern Take on the Bias-Variance Tradeoff in Neural Networks. In Proceedings of the ICML 2019 Workshop on Identifying and Understanding Deep Learning Phenomena, Long Beach, CA, USA, 15 June 2019.
5. Koza, J. Genetic Programming: On the Programming of Computers by Means of Natural Selection; MIT Press: Cambridge, MA, USA, 1992.
6. Virgolin, M.; Pissis, S.P. Symbolic Regression is NP-Hard. In Transactions on Machine Learning Research; CWI: Amsterdam, The Netherlands, 2022.
7. de Franca, F.O.; Kronberger, G. Reducing Overparameterization of Symbolic Regression Models with Equality Saturation. In Proceedings of the Genetic and Evolutionary Computation Conference, Lisbon, Portugal, 15–19 July 2023; pp. 1064–1072.
8. La Cava, W.; Burlacu, B.; Virgolin, M.; Kommenda, M.; Orzechowski, P.; de França, F.O.; Jin, Y.; Moore, J.H. Contemporary symbolic regression methods and their relative performance. Adv. Neural Inf. Process. Syst. 2021, 2021, 1. [PubMed]
9. Olson, R.S.; La Cava, W.; Orzechowski, P.; Urbanowicz, R.J.; Moore, J.H. PMLB: A large benchmark suite for machine learning evaluation and comparison. BioData Min. 2017, 10, 36. [CrossRef] [PubMed]
10. Keijzer, M.; Babovic, V. Genetic programming, ensemble methods and the bias/variance tradeoff–introductory investigations. In Genetic Programming, Proceedings of the European Conference, EuroGP 2000, Edinburgh, Scotland, UK, 15–16 April 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 76–90.
11. Kammerer, L.; Kronberger, G.; Winkler, S. Empirical analysis of variance for genetic programming based symbolic regression. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, Lille, France, 10–14 July 2021; pp. 251–252.
12. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
13. de Franca, F.; Virgolin, M.; Kommenda, M.; Majumder, M.; Cranmer, M.; Espada, G.; Ingelse, L.; Fonseca, A.; Landajuela, M.; Petersen, B.; et al. SRBench++: Principled benchmarking of symbolic regression with domain-expert interpretation. IEEE Trans. Evol. Comput. 2024, early access. [CrossRef]
14. Yang, Z.; Yu, Y.; You, C.; Steinhardt, J.; Ma, Y. Rethinking bias-variance trade-off for generalization of neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 10767–10777.
15. Belkin, M.; Hsu, D.; Ma, S.; Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. USA 2019, 116, 15849–15854. [CrossRef] [PubMed]
16. Domingos, P. A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; Morgan Kaufmann: Stanford, CA, USA, 2000; pp. 231–238.
17. Geman, S.; Bienenstock, E.; Doursat, R. Neural networks and the bias/variance dilemma. Neural Comput. 1992, 4, 1–58. [CrossRef]
18. Romano, J.D.; Le, T.T.; La Cava, W.; Gregg, J.T.; Goldberg, D.J.; Chakraborty, P.; Ray, N.L.; Himmelstein, D.; Fu, W.; Moore, J.H. PMLB v1.0: An open-source dataset collection for benchmarking machine learning methods. Bioinformatics 2022, 38, 878–880. [CrossRef] [PubMed]
19. Udrescu, S.M.; Tegmark, M. AI Feynman: A physics-inspired method for symbolic regression. Sci. Adv. 2020, 6, eaay2631. [CrossRef] [PubMed]
20. Strogatz, S.H. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering; CRC Press: Boca Raton, FL, USA, 2018.
21. Feynman, R.P.; Leighton, R.B.; Sands, M. The Feynman Lectures on Physics, Vol. I: The New Millennium Edition: Mainly Mechanics, Radiation, and Heat; Basic Books: New York, NY, USA, 2015; Volume 1.
22. Virgolin, M.; Alderliesten, T.; Witteveen, C.; Bosman, P.A. Scalable genetic programming by gene-pool optimal mixing and input-space entropy-based building-block learning. In Proceedings of the Genetic and Evolutionary Computation Conference, Berlin, Germany, 15–19 July 2017; pp. 1041–1048.
23. Burlacu, B.; Kronberger, G.; Kommenda, M. Operon C++: An efficient genetic programming framework for symbolic regression. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, Cancun, Mexico, 8–12 July 2020; pp. 1562–1570.
24. Schmidt, M.D.; Lipson, H. Age-fitness pareto optimization. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, Portland, OR, USA, 7–11 July 2010; pp. 543–544.
25. Schmidt, M.D.; Lipson, H. Coevolution of fitness predictors. IEEE Trans. Evol. Comput. 2008, 12, 736–749. [CrossRef]
26. La Cava, W.; Spector, L.; Danai, K. Epsilon-lexicase selection for regression. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, Denver, CO, USA, 20–24 July 2016; pp. 741–748.
27. de Franca, F.O.; Aldeia, G.S.I. Interaction–transformation evolutionary algorithm for symbolic regression. Evol. Comput. 2021, 29, 367–390. [CrossRef] [PubMed]
28. Petersen, B.K.; Larma, M.L.; Mundhenk, T.N.; Santiago, C.P.; Kim, S.K.; Kim, J.T. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020.
29. McConaghy, T. FFX: Fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice IX; Springer: New York, NY, USA, 2011; pp. 235–260.
30. Meurer, A.; Smith, C.P.; Paprocki, M.; Čertík, O.; Kirpichev, S.B.; Rocklin, M.; Kumar, A.; Ivanov, S.; Moore, J.K.; Singh, S.; et al. SymPy: Symbolic computing in Python. PeerJ Comput. Sci. 2017, 3, e103. [CrossRef]
31. Byrd, R.H.; Lu, P.; Nocedal, J.; Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 1995, 16, 1190–1208. [CrossRef]
32. Zhu, C.; Byrd, R.H.; Lu, P.; Nocedal, J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 1997, 23, 550–560. [CrossRef]
33. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272. [CrossRef] [PubMed]
34. Jin, Y.; Fu, W.; Kang, J.; Guo, J.; Guo, J. Bayesian symbolic regression. arXiv 2019, arXiv:1910.08892.
35. La Cava, W.; Singh, T.R.; Taggart, J.; Suri, S.; Moore, J.H. Learning concise representations for regression by evolving networks of trees. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
36. Arnaldo, I.; Krawiec, K.; O'Reilly, U.M. Multiple regression genetic programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, Vancouver, BC, Canada, 12–16 June 2014; pp. 879–886.
37. Virgolin, M.; Alderliesten, T.; Bosman, P.A. Linear scaling with and within semantic backpropagation-based genetic programming for symbolic regression. In Proceedings of the Genetic and Evolutionary Computation Conference, Prague, Czech Republic, 13–17 July 2019; pp. 1084–1092.
38. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.