PARSING THE LANGUAGE OF EXPRESSION: ENHANCING SYMBOLIC REGRESSION WITH DOMAIN-AWARE SYMBOLIC PRIORS
Sikai Huang
Department of Mathematics
Purdue University
West Lafayette, IN, USA
huan1580@purdue.edu
Yixin Berry Wen
Department of Geography
University of Florida
Gainesville, FL, USA
yixin.wen@ufl.edu
Tara Adusumilli, Kusum Choudhary
Department of Computer Science
University of Maryland College Park
College Park, MD, USA
{tadusumi, kusumcho}@umd.edu
Haizhao Yang
Department of Mathematics, Department of Computer Science
University of Maryland College Park
College Park, MD, USA
hzyang@umd.edu
ABSTRACT
Symbolic regression is essential for deriving interpretable expressions that elucidate complex phenomena by exposing the underlying mathematical and physical relationships in data. In this paper, we present an advanced symbolic regression method that integrates symbol priors from diverse scientific domains, including physics, biology, chemistry, and engineering, into the regression process. By systematically analyzing domain-specific expressions, we derive probability distributions of symbols to guide expression generation. We propose novel tree-structured recurrent neural networks (RNNs) that leverage these symbol priors, enabling domain knowledge to steer the learning process. Additionally, we introduce a hierarchical tree structure for representing expressions, where unary and binary operators are organized to facilitate more efficient learning. To further accelerate training, we compile characteristic expression blocks from each domain and include them in the operator dictionary, providing relevant building blocks. Experimental results demonstrate that leveraging symbol priors significantly enhances the performance of symbolic regression, resulting in faster convergence and higher accuracy.
Keywords Symbolic Regression · Reinforcement Learning · Recurrent Neural Network · Domain Knowledge Prior
1 Introduction
1.1 Problem Statement
Symbolic regression is a powerful technique that searches the space of mathematical expressions to identify equations
that best fit a dataset. Unlike traditional regression models that rely on predefined structures, symbolic regression
discovers interpretable relationships between variables, offering deeper insights into data dynamics. This capability is
particularly valuable in fields with complex, poorly understood relationships, e.g., physical sciences [1, 2, 3], materials science [4, 5], chemistry [3, 6, 7], climate science, ecology [8, 9, 10], and finance [11, 12]. These diverse applications underscore the versatility of symbolic regression as a powerful tool for scientific discovery and analysis.
Symbolic regression methods are generally categorized into two main approaches. The first approach involves an
optimization process to identify suitable expressions. This can be a two-step process: first, generating a "skeleton"
of the equation using a parametric function built from a predefined set of operators according to physical knowledge,
such as basic arithmetic operations and elementary functions, to define its overall structure. The second step then
solves a regression problem to estimate parameters within this skeleton. Alternatively, both the function skeleton and
parameters can be learned simultaneously through mixed optimization. This problem is typically addressed using
genetic algorithms [13, 14, 15, 16, 17] or, more recently, reinforcement learning (RL) [18, 17, 19, 20].
Inspired by recent advances in language models, a second approach to symbolic regression has emerged as an end-to-end
solver, often referred to as Neural Symbolic Regression (NSR). This approach frames symbolic regression as a natural
language processing (NLP) task, leveraging large-scale pre-trained models to map data directly to expressions in an
end-to-end manner, akin to machine translation [21, 22, 23, 24, 25, 26]. These neural models are trained end-to-end,
taking sampled data points as input and generating symbolic representations of mathematical expressions that best fit
the data.
1.2 Symbolic Prior for Symbolic Regression
Symbolic regression is extensively employed to derive interpretable expressions that characterize dynamical
systems in a wide range of scientific domains, including physics, biology, and chemistry. However, the fre-
quency and combination of symbols and operators differ markedly across these fields, reflecting their underlying
principles and commonly adopted mathematical formulations. For example, trigonometric functions (e.g., sine
and cosine) often appear in physics to capture oscillatory behaviors, whereas exponential and logistic functions
are more prevalent in biology to model population growth and decay. These observations naturally lead to two questions:
How can we systematically extract these symbol priors? Moreover, how can we efficiently incorporate such prior
knowledge to enhance existing symbolic regression methods?
The main contributions of this paper, addressing the above questions, are summarized below.
Novel Tree Representation of Expressions:
This work proposes a general (multi-branch) tree representation
for mathematical expressions, effectively capturing hierarchical structures, particularly in consecutive additions.
By treating linked unary operators as equivalent nodes, the method preserves local structures that binary trees
and linear sequences often fail to maintain due to increased depth and imbalance. The output of a leaf node is
expressed as a linear combination of variables applied element-wise to the same unary operator, reducing tree
depth and yielding a more compact expression.
Collection and Categorization of Domain-Specific Expressions:
This study systematically gathers mathematical expressions from relevant arXiv papers and organizes them by scientific domain. Leveraging our
general tree-structure representation, we analyze domain-specific symbol relationships and operator combi-
nations to extract priors and refine the operator set, thereby improving training efficiency. We classify these
priors into two categories: horizontal priors, which capture relationships among sibling unary operators under
the same parent, and vertical priors, which characterize hierarchical dependencies between a node and its
ancestors. Conditional categorical distributions encode these intrinsic horizontal and vertical features, thus
providing a domain-aware understanding of expression patterns.
Tree-Structured RNN Policy Optimized with KL Regularization:
As illustrated in Figure 1, we employ a tree-structured recurrent neural network (RNN) to represent the policy within a reinforcement learning (RL)
framework for generating mathematical expressions. This hierarchical design leverages the nested structure
of expressions, thereby reducing the number of RNN modules required for capturing complex operator
interactions. To incorporate domain-specific symbol priors, we introduce a Kullback-Leibler (KL) divergence
regularization term into the reward function, effectively minimizing the discrepancy between the policy’s
learned distribution and the prior distribution. The policy is trained via policy gradient methods, exploring the
space of symbolic expressions and maintaining a pool of high-scoring “skeletons.” As depicted in Figure 2,
these candidate expressions are iteratively refined to converge toward the target equation.
By incorporating domain-specific symbol priors into the hierarchical RNN’s training procedure, the model leverages
relevant prior knowledge to enhance both the efficiency and accuracy of symbolic regression. Empirical evaluations
indicate that this strategy not only accelerates convergence but also yields more accurate and interpretable models
across diverse scientific domains.
1.3 Related work
Numerous strategies have been proposed to incorporate prior knowledge into symbolic regression. One prevalent
approach involves balancing interpretability and data fidelity through complexity-based penalties, such as the Bayesian Information Criterion (BIC) [27]. Another line of research assigns priors to tree structures and operators via uniform distributions or user-defined preferences, yet lacks a systematic way to derive domain-specific priors [12]. In contrast, the method introduced here employs domain-specific symbol priors to improve both accuracy and interpretability. By adding a Kullback–Leibler (KL) divergence term to the reward function, it aligns the learned categorical distribution with domain knowledge, thereby achieving more robust control and faster convergence relative to existing techniques.

Figure 1: A tree-structured RNN-based reinforcement learning framework for generating symbolic expressions. Domain-specific priors (top) are incorporated as soft constraints (KL divergence) and hard constraints (rule-based masking), guiding the controller to propose expressive yet valid "skeletons." Sampled expressions are then refined by optimizing their parameters, yielding interpretable mathematical models aligned with target data.
To further constrain the search space, several works have imposed structural constraints on expressions. For instance,
monotonicity, convexity, and symmetry constraints have been incorporated to guide the search, thereby improving
efficiency [28, 29, 30]. Although effective, these constraints can be difficult to infer from data alone and may not always
be applicable. Similarly, research on embedding fundamental physical laws has used conservation principles to enforce
physical validity [31, 32], but such methods typically require detailed a priori knowledge about the system.
Beyond structural constraints, some approaches leverage optimization algorithms, notably genetic algorithms, to validate
candidate solutions under predefined conditions such as symmetry, monotonicity, convexity, and boundary constraints
[33]. However, these strategies often rely on data quality for verifying constraints, which can be challenging in practice.
Alternatively, context-sensitive filtering methods selectively prune unlikely token sequences by examining the structure
of the expression tree, thereby reducing invalid symbol combinations and enhancing efficiency [18, 34].
A further avenue for integrating prior knowledge involves domain-specific constraints on symbolic representations.
For example, dimensional consistency has been enforced by masking the categorical distribution in RNNs, ensuring
physically meaningful expressions [35]. Building on this idea, our approach introduces "hard constraints" that rule out symbol combinations not observed in specific domains, supplemented by probabilistic biases on token combinations to further refine the search. This combination restricts the solution space to expressions that are both more plausible and interpretable within the given domain.

Figure 2: Parameter optimization and candidate selection loop. The tree-structured RNN controller generates candidate expressions, which are evaluated against the data to produce a reward signal. High-scoring candidates are selected for further parameter tuning, where the weights $\theta = \{\alpha, \beta, \gamma\}$ are refined to better fit the target data, ultimately yielding more accurate symbolic expressions.
2 Symbol Priors
In this section, we present a strategy for incorporating symbol priors into symbolic regression. We begin by introducing
a tree-structured representation of mathematical expressions that systematically gathers and utilizes symbol priors,
enabling a more compact and structured encoding of expressions for cross-domain analysis. We then describe a
method for extracting symbol priors from domain-specific mathematical expressions in arXiv papers. By capturing
the characteristic symbol distributions and operator preferences of each discipline, this approach leverages structured
domain knowledge to enhance both the accuracy and interpretability of symbolic regression.
2.1 Representation Method
Our proposed representation addresses the shortcomings of conventional binary expression trees by permitting a single
binary operator to connect multiple sequences of unary operators. This multi-branch design yields a more flexible
and expressive hierarchical structure. Empirical analysis of real-world expressions, particularly those describing
dynamical systems, shows that most physically meaningful forms can be effectively captured within a two-level tree
structure—some even reduce to a single layer. By aligning closely with these practical mathematical constructs, the
representation improves interpretability and boosts the efficiency of symbolic regression.
For instance, in the case of consecutive additions, one addition operator can connect multiple child nodes that are treated
equivalently, thus eliminating strict hierarchical dependencies often found in binary trees. By contrast, traditional binary
representations impose tiered parent–child relationships, leading to deeper and more inflexible structures.
To formalize our representation method, we define the following sets:
• Unary Operator Set: Let $U = \{\sin, \exp, \log, \mathrm{Id}, (\cdot)^2, \dots\}$, which encompasses a range of elementary functions (e.g., polynomials, trigonometric functions). Here, $\mathrm{Id}$ denotes the identity function.
• Binary Operator Set: Let $B = \{+, \times, \div\}$ represent the set of binary operators within the tree structure.
• Variable Set: Let $V = \{f, x_1, \dots, x_n, f_{x_i}, f_{x_i x_j} \mid 1 \le i, j \le n\}$, where $f$ is the primary function, $x_1, \dots, x_n$ are variables, $f_{x_i} = \partial f / \partial x_i$ denotes the first-order partial derivative of $f$, and $f_{x_i x_j} = \partial^2 f / (\partial x_i \partial x_j)$ denotes the second-order partial derivative. Higher-order derivatives are typically uncommon in most physical systems.
4
DOMAIN-AWARE SYMBOLIC PRIORS FOR SYMBOLIC REGRESSION
Figure 3: The top panel illustrates the fundamental structure of our representation method, while the bottom panel
presents two example expressions represented using this structure.
With these sets established, our representation method—illustrated in Figure 3—integrates unary operators (and their
compositions) with binary operators in a hierarchical framework. Specifically:
• Root Node ($U_R$): The root node is a unary operator chosen from $U$. It applies its operation to the output of its subtrees, which are connected by a binary operator from $B$.
• Root-Node Binary Operator Connection ($B$): A binary operator from $B = \{+, \times, \div\}$ links multiple sequences of unary operators. For example, $B^1$ connects the first-level sequences $S^1_1$ and $S^1_2$ to the root node $U_R$, merging sub-expressions without enforcing the strict parent–child hierarchy typical of standard binary trees.
• Sequences of Unary Operators ($S^j_i$): Each $S^j_i$ is a composition of unary operators drawn from $U$ associated with $B^j_{\lfloor (i+1)/2 \rfloor}$. In particular, $S^1_i$ denotes the first-level sequences associated with $B^1_1$, while $S^2_i$ denotes the second-level sequences linked by $B^2_i$. By allowing multiple unary operators to be chained, this structure captures a broader range of real-world expressions.
• Leaf Nodes ($U_i$): The leaf nodes, denoted by $I_i$, take inputs from the variable set $V$, which includes the function $f$, its partial derivatives, and the variables $x_1, \dots, x_n$. Each leaf applies a unary operator element-wise to these inputs and outputs a linear combination of the resulting transformed variables:
$$O = \gamma_1 \mu(v_1) + \gamma_2 \mu(v_2) + \dots + \gamma_n \mu(v_n), \qquad v_i \in V,$$
where $\mu(\cdot)$ is the chosen unary operator and $\gamma_i$ are learned coefficients.
• Linear Transformation in Non-Leaf Unary Operators: Each non-leaf unary operator $\mu$ in $U_R$ or in any $S^j_i$ undergoes a linear transformation:
$$O = \alpha\,\mu(I) + \beta,$$
where $\alpha$ is a scaling factor and $\beta$ is a bias term. This additional affine component expands the representational flexibility for modeling complex functional relationships.
Remark.
The coefficients associated with variables in each leaf node’s linear combination can be optimized as part
of the model-fitting process. This strategy implicitly performs feature selection, allowing the discovery of the most
pertinent variables and revealing core relationships governing the system.
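To make the representation concrete, the following minimal Python sketch implements a multi-branch tree of this form; the class names, the small operator dictionary, and the example expression are illustrative assumptions rather than the implementation used in this paper.

```python
import numpy as np

# Minimal sketch of the multi-branch expression tree described above.
UNARY = {"Id": lambda z: z, "sin": np.sin, "exp": np.exp,
         "log": np.log, "square": lambda z: z ** 2}
BINARY = {"+": np.add, "*": np.multiply, "/": np.divide}

class LeafNode:
    """Applies one unary operator element-wise and returns the linear
    combination O = sum_i gamma_i * mu(v_i) of the selected variables."""
    def __init__(self, unary, gammas):
        self.mu, self.gammas = UNARY[unary], gammas   # gammas: one weight per variable

    def evaluate(self, variables):                    # variables: list of arrays
        return sum(g * self.mu(v) for g, v in zip(self.gammas, variables))

class InternalNode:
    """Non-leaf unary operator with affine output O = alpha * mu(I) + beta,
    where I combines all children with a single binary operator."""
    def __init__(self, unary, binary, children, alpha=1.0, beta=0.0):
        self.mu, self.op = UNARY[unary], BINARY[binary]
        self.children, self.alpha, self.beta = children, alpha, beta

    def evaluate(self, variables):
        combined = self.children[0].evaluate(variables)
        for child in self.children[1:]:               # one binary op links all siblings
            combined = self.op(combined, child.evaluate(variables))
        return self.alpha * self.mu(combined) + self.beta

# Example: exp(x1^2) + sin(x2), a tree of width 2.
x1, x2 = np.linspace(0.1, 1.0, 5), np.linspace(0.1, 1.0, 5)
tree = InternalNode("Id", "+", [
    InternalNode("exp", "+", [LeafNode("square", [1.0, 0.0])]),
    InternalNode("sin", "+", [LeafNode("Id", [0.0, 1.0])]),
])
print(tree.evaluate([x1, x2]))
```

In this sketch the leaf coefficients play the role of the learned weights mentioned in the remark above: zeroing a coefficient removes the corresponding variable from the expression.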
Within our proposed framework, we introduce three fundamental concepts pertaining to each expression: subsequences,
width, and depth. Formal definitions and illustrative examples are given as follows.
Let $T$ be an $h$-level expression tree with a unique root unary operator $U_R$ and leaf nodes $\{I_k\}_{k=1}^{n}$, where each leaf node $I_k$ is associated with a unary operator $U_k$.
Definition 1 (Subsequence of an Expression). A subsequence for a particular leaf node $I_k$ is defined as the ordered list of unary and binary operators encountered along the unique path from the root node $U_R$ down to $U_k$. If at each level $\ell \in \{1, \dots, h\}$ the path may include a binary operator $B^\ell$ and an associated sequence of unary operators $S^\ell$, then formally:
$$\mathrm{Subsequence}(I_k) = U_R, B^1, S^1, B^2, S^2, \dots, B^h, S^h, U_k, \qquad k \in \{1, \dots, n\},$$
where $U_R$ is the root unary operator, $B^1, \dots, B^h$ are the binary operators encountered along the path, $S^1, \dots, S^h$ are the corresponding unary-operator sequences, and $U_k$ is the final unary operator at leaf node $I_k$. In cases where fewer than $h$ levels are traversed, the omitted operators are simply excluded from the subsequence.
Definition 2 (Width of an Expression). The width of an expression is defined as the total number of first-level sequences. Formally, given a tree $T$ with outermost unary operator sequences $\{S^1_1, S^1_2, \dots, S^1_m\}$, the width of $T$ is
$$\mathrm{Width}(T) = m,$$
where $m$ denotes the number of unary-operator sequences directly connected to the root node.
Definition 3 (Depth of an Expression). The depth of an expression is defined as the length of the longest subsequence in the expression tree $T$, i.e., the greatest number of operators (unary and binary) encountered on any path from the root node $U_R$ to a leaf node $I_k$. Formally,
$$\mathrm{Depth}(T) = \max_{1 \le k \le n} \bigl(\text{Length of } \mathrm{Subsequence}(I_k)\bigr),$$
where the length of a subsequence is the total number of operators encountered between $U_R$ and $I_k$, and $n$ is the total number of leaf nodes. Figure 4 presents two illustrative examples, explained below:
• Left example: The subsequences are $\{\tan, +, \exp, (\cdot)^2, \mathrm{Id}\}$, $\{\tan, +, \exp, +, (\cdot)^2, \mathrm{Id}\}$, and $\{\tan, +, \exp, +, \exp, (\cdot)^2, \mathrm{Id}\}$. Since there are two first-level sequences connected directly to the root node, the width of the tree is 2. The depth is 7, reflecting the length of the longest path from the root to a leaf node.
• Right example: This tree contains three subsequences of the form $\{\sqrt{\cdot}, +, (\cdot)^2, \mathrm{Id}\}$. Because three first-level sequences directly connect to the root node, the width of the tree is 3. Its depth is 4, corresponding to the length of the longest path from a leaf node to the root.
Figure 4: Examples of two expression trees, illustrating their subsequences, width, and depth.
To maintain consistency in our representation, the identity operator $\mathrm{Id}$ is allowed in the root node, leaf nodes, and any sequence $S$. However, within a given sequence $S$, $\mathrm{Id}$ is permitted only if it is the sole unary operator. Superfluous occurrences of $\mathrm{Id}$ increase the overall expression length and hinder subsequent symbol-prior extraction. Moreover, restricting $\mathrm{Id}$ at the leaf nodes helps preserve a concise and interpretable representation.
2.2 Hierarchical Symbol Priors Extraction
We systematically gathered mathematical expressions from arXiv, targeting specific scientific domains. For each domain,
10,000 highly relevant papers were selected, and their embedded expressions were extracted to examine structural
patterns critical to our approach. Each expression was then converted into a general tree structure, capturing elements
such as subsequences, root nodes, root-level binary operators, leaf nodes, as well as the tree’s width and depth. This
structured format enables a detailed analysis of symbol relationships and domain-specific usage. Equipped with this
extensive repository of expressions and subsequences, we proceed to extract the following core information.
Vertical Tree Node Analysis:
The goal is to capture vertical compositional operator relationships from root to leaf
nodes. By examining each path, we estimate conditional categorical distributions of symbols at multiple levels, thus
deriving vertical symbol priors that encapsulate domain-specific operator patterns.
Notably, numerous symbol combinations occur with zero probability. In many cases, these combinations violate the General Formulation Rules [18], which impose restrictions such as limiting trigonometric nesting to two levels (e.g., disallowing $\cos(x + \sin(y + \tan(\cdot)))$), restricting self-nesting of exponential and logarithmic functions (e.g., $\exp(\exp(\cdot))$, $\log(\log(\cdot))$), and forbidding direct succession of inverse unary operations (e.g., $\exp(\log(\cdot))$, $\log(\exp(\cdot))$).
Horizontal Tree Node Analysis: From a horizontal perspective, we investigate sibling nodes connected by a common binary operator at each tree level. Specifically, for each level $h$, we aggregate all child nodes linked via the same operator $B$ and estimate categorical distributions of the observed symbol combinations. This procedure reveals how operators co-occur laterally at the same level and enables more precise modeling of domain-specific patterns.
Domain-Specific Component Analysis: Note that certain substructures, composed of both unary and binary operators, appear frequently within specific domains. In engineering, particularly in signal processing, combinations such as $\cos(\cdot) + \sin(\cdot)$ often model waveforms, whereas in chemistry, exponential constructs like $\exp(\cdot/\cdot)$ frequently arise in reaction-rate formulations (e.g., the Arrhenius equation for temperature-dependent reaction rates). Incorporating these domain-specific components in the expression search can significantly improve the computational efficiency of symbolic regression.
Other Priors:
Beyond specific operator combinations, we also extract prior information from each expression tree,
including distributions of root nodes, leaf nodes, and structural characteristics such as depth and width. The root
node determines the overarching form of the expression, while leaf nodes serve as anchors for variables or constants.
Examining these distributions uncovers domain-specific preferences for certain functions and operations. Furthermore,
structural priors involving depth and width help control the complexity of candidate expressions, preventing solutions
that are either overly simplistic or excessively convoluted.
Remark: The total number of variables within an expression affects the raw frequency counts of operator combinations, potentially causing statistical bias. For instance, consider the expression $\sum_{i=1}^{n} \cos(x_i)$. As the number of variables $n$ increases, the raw count of the unary operator $\cos(\cdot)$ increases proportionally, artificially inflating its frequency relative to other operators. This bias necessitates an appropriate normalization method to accurately reflect the true operator distributions across diverse expressions.
Definition 4 (Normalization by Variable Count). Let $\mathcal{E}$ denote the corpus of collected expressions. To address the bias introduced by varying numbers of variables across different expressions, we define a normalized frequency measure for operator combinations across the corpus. Specifically, for an operator combination $s$, the normalized count within a single expression $E \in \mathcal{E}$ is computed as:
$$\text{Normalized Count}_E(s) = \frac{\text{Raw count of operator combination } s \text{ in expression } E}{\text{Total number of variables in expression } E}.$$
The overall normalized frequency for the operator combination $s$ is then defined by averaging these normalized counts over all expressions in the corpus $\mathcal{E}$:
$$\text{Normalized Count}(s) = \frac{1}{|\mathcal{E}|} \sum_{E \in \mathcal{E}} \text{Normalized Count}_E(s).$$
This approach yields a balanced and representative probability distribution of operator combinations across diverse expressions.
Definition 5 (Prior Probability for Node Symbols). Let $s_i$ be a node in an expression tree with parent node $s_p(h)$ at level $h$. Denote by $S_i$ the set of sibling symbols of $s_i$. We define the prior probability of $s_i$ given its siblings $S_i$, parent $s_p$, and level $h$ as follows:
• If the number of siblings $|S_i| \le 3$, we explicitly consider all siblings: $S^*_i = (s_{i-1}, s_{i-2}, s_{i-3})$, with fewer symbols included if fewer siblings exist.
• If the number of siblings $|S_i| > 3$, we consider only the three most frequently occurring siblings $s_{(1)}, s_{(2)}, s_{(3)}$: $S^*_i = (s_{(1)}, s_{(2)}, s_{(3)})$.
Using normalized counts as defined earlier, the prior probability is rigorously estimated by aggregating normalized frequencies across a corpus $\mathcal{E}$ of multiple expressions:
$$P^*\bigl(s_i \mid S_i, s_p(h)\bigr) = \frac{\sum_{E \in \mathcal{E}} \text{Normalized Count}_E\bigl(s_i, S^*_i, s_p(h)\bigr)}{\sum_{E \in \mathcal{E}} \text{Normalized Count}_E\bigl(S^*_i, s_p(h)\bigr)},$$
where $\text{Normalized Count}_E(\cdot)$ denotes the normalized count computed within an individual expression $E$.
Remark (Rationale for Restricting Sibling Context).
Restricting the sibling context to the three most frequent
siblings provides a practical balance between capturing essential statistical information and mitigating combinatorial
complexity. Specifically, considering all sibling nodes would result in exponential growth of the combinatorial search
space, severely exacerbating data sparsity and estimation instability. Empirical evidence from symbolic regression
and related probabilistic modeling tasks suggests that the majority of the predictive information (measured in terms
of conditional mutual information) is typically concentrated in a limited subset of frequently co-occurring sibling
nodes. Thus, incorporating more than three sibling symbols typically yields diminishing marginal returns in information
content, while greatly increasing complexity and reducing statistical robustness.
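As a small illustration of how Definitions 4 and 5 translate into code, the sketch below estimates normalized counts and the resulting conditional prior; the input format, with each expression summarized by its variable count and the (symbol, sibling context, parent) triples observed in its tree, is an assumption made for this example, not the paper's data schema.

```python
from collections import Counter, defaultdict

# Toy corpus: each expression records its variable count and observed triples.
expressions = [
    {"n_vars": 3, "triples": [("exp", ("Id",), "+"), ("Id", ("exp",), "+")]},
    {"n_vars": 2, "triples": [("exp", ("Id",), "+"), ("log", ("exp",), "+")]},
]

def normalized_counts(expressions):
    """Average, over the corpus, of per-expression counts divided by the
    expression's variable count (Definition 4)."""
    totals = defaultdict(float)
    for expr in expressions:
        per_expr = Counter(expr["triples"])
        for key, raw in per_expr.items():
            totals[key] += raw / expr["n_vars"]
    return {k: v / len(expressions) for k, v in totals.items()}

def prior_probability(symbol, sibling_context, parent, counts):
    """P*(s_i | S_i*, s_p): normalized count of the full triple divided by the
    normalized count of the sibling/parent context (Definition 5)."""
    context_total = sum(v for (s, sib, par), v in counts.items()
                        if sib == sibling_context and par == parent)
    return counts.get((symbol, sibling_context, parent), 0.0) / max(context_total, 1e-12)

counts = normalized_counts(expressions)
print(prior_probability("exp", ("Id",), "+", counts))
```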
2.3 Case Study
This section systematically compares the horizontal and vertical priors previously outlined, along with distributions
of root and leaf nodes and fundamental structural attributes. By examining both individual and combined impacts of
these priors, we illustrate how incorporating domain-specific knowledge significantly enhances symbolic regression
performance and informs optimal learning framework configurations.
Across all four domains—physics, chemistry, biology, and engineering—expressions with substantial depth are relatively
uncommon. The introduction of identity operators, whenever no unary operator exists between consecutive binary
operators, increases the effective depth, thus often resulting in actual depths exceeding initial assessments.
Expressions in physics and engineering typically exhibit greater depth due to the nested functions and layered operations
necessary for modeling complex phenomena. Specifically, physics expressions frequently involve nested trigonometric
or exponential functions, differential equations, and integrals, while engineering models commonly integrate multiple
layers of system dynamics and control mechanisms.
The width of expressions generally clusters around moderate values in all domains. While expressions structured as $\sum_{i=1}^{n}$ superficially suggest potentially unbounded width due to the variable $n$, the normalized count definition mitigates this issue by proportionally scaling symbol combinations relative to the total variable count. Thus, normalized widths remain statistically manageable despite variations in $n$. Broad top-level structures occur frequently across domains, reflecting parallel interactions inherent within these systems: chemical equations summing reactants and products, biological models aggregating multiple genetic or environmental factors, engineering calculations combining parallel impedances, and physical expressions summing over multiple states or particles. Nonetheless, the normalized count method ensures consistent and meaningful statistical comparisons of expression widths.
Figure 5 illustrates the distribution of the binary operator $B^1_1$ conditioned on various root nodes across the physics, biology, chemistry, and engineering domains. This vertical analysis reveals how selecting specific root nodes, such as $\mathrm{Id}$, $\log$, $\exp$, $\sin$, $\cos$, $\tan$, $\sqrt{\cdot}$, or $(\cdot)^2$, influences subsequent binary operator distributions. By examining these hierarchical dependencies, we gain deeper insight into domain-specific conventions governing the construction of mathematical expressions.
Moreover, the analysis identifies subsequence patterns rarely encountered within these scientific domains, highlighting the practical importance of adhering to established formulation rules. For instance, combinations like $\sqrt{\log(\tan(\cdot))}$ or $\sqrt{\tan(\log(\cdot))}$ are infrequently used, as they typically lack clear interpretability or physical relevance, and thus are generally avoided in standard scientific modeling.
Our horizontal analysis focuses on sibling nodes sharing a common binary operator $B$, examining the conditional distributions of sibling nodes given their parent. Formally, consider a binary parent node $B$ with sibling child nodes $s_1$ and $s_2$. Figure 6 illustrates empirical conditional distributions of sibling operators across various scientific domains.
Common operand pairs, such as $\exp + \mathrm{Id}$ and $\exp + \exp$, frequently arise in all examined fields, representing fundamental models of exponential growth, decay, and the summation of exponentials common in differential equations. In contrast, combinations like $\exp + \tan$ or $\exp + (\cdot)^2$ rarely appear due to their limited physical interpretability and potential numerical instability. Furthermore, domain-specific preferences are evident: physics often employs combinations of exponential and trigonometric functions (e.g., $\exp + \sin$ or $\exp + \cos$) to model oscillatory phenomena; biology typically utilizes simpler forms such as $\exp + \mathrm{Id}$ or $\exp + \log$, reflecting fundamental growth dynamics; chemistry frequently combines exponentials with logarithmic functions (e.g., $\exp + \log$) due to their role in reaction kinetics; and engineering integrates varied combinations, including $\exp + \sqrt{\cdot}$ and $\exp + (\cdot)^2$, indicative of broader modeling requirements. Incorporating these domain-specific horizontal dependencies significantly improves the interpretability and practical relevance of symbolic regression outcomes.
Figure 7 displays domain-specific distributions of binary operators conditioned on different root node symbols across physics, biology, chemistry, and engineering. The results reveal distinct and consistent operator preferences for each domain. Across all fields, the addition operator ($+$) predominates, reflecting a universal tendency to combine terms directly without transformations. Physics and engineering demonstrate substantial use of multiplication ($\times$), indicative of complex interactions and layered dynamics commonly modeled in these disciplines. Biology and chemistry show relatively balanced usage of addition and multiplication, while division ($\div$) consistently exhibits the lowest frequency, likely due to its numerical instability and less frequent natural occurrence in models across disciplines. These clear patterns underline the domain-specific structures inherent in mathematical expressions, supporting the integration of tailored prior knowledge into the symbolic regression framework to enhance both model interpretability and predictive accuracy.
Figure 5: Statistical distributions of expression depth, width, and root nodes across Physics, Biology, Chemistry, and
Engineering.
Figure 6: Probability distributions of the second sibling unary operator given the parent binary operator "+" and the first sibling "exp" in various fields.
Figure 7: Across Physics, Biology, Chemistry, and Engineering, various root unary operators (Id, log, exp, trig, sqrt, square) predominantly connect with addition ($+$) and multiplication ($\times$), underscoring their essential roles in aggregating and scaling expressions. However, the specific proportions of these binary operators vary among disciplines, reflecting each field's unique mathematical modeling requirements.
3 Methods
In this section, we present a reinforcement learning-based approach to identify the skeleton of mathematical expressions and subsequently optimize the associated coefficients to finalize the expression learning. Although the training concept is similar to the FEX [19] approach, the proposed method is innovative, featuring substantially different elements such as novel tree structures, optimization regularization, and domain-specific prior knowledge. Given a fixed tree structure $T$ with $n_T$ nodes, FEX aims to identify an expression with finitely many operators to fit given data $\{X, y\}$ by solving $\min_{e,\theta} L(g(X; T, e, \theta))$, where $L$ is a functional quantifying how well an expression $g(X; T, e, \theta)$ fits the data, $e$ is the sequence of operators that forms $g$, and $\theta = \{\alpha, \beta, \gamma\}$ represents the learnable parameters in $T$ used to form $g$. The expression $g(X; T, e, \theta)$ is formed by the chosen operators and parameters within the tree structure. This problem is addressed by alternating between optimizing $e$ using reinforcement learning (e.g., policy gradients) and optimizing $\theta$ using gradient-based methods (e.g., Adam, BFGS).
3.1 Agent
In this section, we introduce a novel tree-structured recurrent neural network (RNN) designed to function as our agent. As illustrated in Figure 8, this structure enables efficient exploration and representation of complex expressions by capturing hierarchical relationships within the expression tree. In this tree-structured RNN, each output $y_i$ represents a categorical probability distribution, indicating the likelihood of selecting various operators for the $i$-th node. The operators $x_i$ are then sampled based on the probabilities provided by $y_{i-1}$, the output from the preceding node. The activations $a_i$ propagate through the structure, passing from parent nodes to all child nodes, or horizontally between sibling nodes. This setup allows the model to capture and learn hierarchical dependencies among nodes, reflecting the structured relationships inherent in mathematical expressions.

Figure 8: Tree-structured RNNs for symbolic regression.

The key advantages of this structure are:
Preservation of Structured Information:
This tree-structured RNN is designed to maintain the hierarchical
relationships inherent in mathematical expressions. By allowing activations to flow from parent nodes to child
nodes and horizontally between sibling nodes, the model preserves the natural structure of expressions. Each
node not only receives information from its parent but also shares information with its siblings, enabling the
RNN to capture dependencies at multiple levels. This structure aligns closely with the nested and layered
nature of mathematical expressions, ensuring that important contextual relationships are retained throughout
the network.
12
DOMAIN-AWARE SYMBOLIC PRIORS FOR SYMBOLIC REGRESSION
Efficient Information Flow:
The hidden layer output of a parent node is propagated to all its child nodes,
reducing the number of RNN blocks required compared to traditional binary tree methods.
Remark. In the tree-structured RNN architecture, certain nodes (blocks) may connect simultaneously to multiple subsequent nodes, each corresponding to sibling nodes in the expression tree. Initially, the distribution emitted by an RNN node, denoted as $\Pr(\text{child} \mid \text{parent})$, is assumed uniform across all connected child nodes. However, due to sequential sampling of sibling symbols, the selection of an earlier sibling influences the conditional probability distribution of the symbols for subsequent siblings. Formally, this sequential sampling induces dependencies of the form
$$\Pr\bigl(i\text{-th child} \mid (i-1)\text{-th child}, \text{parent}\bigr),$$
which allows the tree-structured RNN to explicitly model structured sibling dependencies and thus yields a more accurate and coherent representation of expression trees.
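The following is a minimal sketch of one block of such a tree-structured controller and of the sequential sibling sampling described in the remark; the layer sizes, the GRU cell, and the placeholder token for "no sibling sampled yet" are illustrative choices, not the authors' architecture.

```python
import torch, torch.nn as nn

class TreeRNNCell(nn.Module):
    """One RNN block of the tree-structured controller: it consumes the parent's
    hidden state plus the previously sampled sibling symbol and emits a hidden
    state together with a categorical distribution over the symbol dictionary."""
    def __init__(self, n_symbols, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_symbols + 1, hidden)   # +1 for "no sibling yet"
        self.cell = nn.GRUCell(hidden, hidden)
        self.head = nn.Linear(hidden, n_symbols)

    def forward(self, parent_hidden, prev_sibling_symbol):
        h = self.cell(self.embed(prev_sibling_symbol), parent_hidden)
        probs = torch.softmax(self.head(h), dim=-1)         # y_i: categorical over symbols
        return h, probs

# Sampling two sibling children of one parent node: the first sibling's choice
# conditions the distribution used for the second, Pr(2nd child | 1st child, parent).
n_symbols, cell = 6, TreeRNNCell(6)
parent_h = torch.zeros(1, 64)
no_sibling = torch.tensor([n_symbols])                       # placeholder token
h1, p1 = cell(parent_h, no_sibling)
s1 = torch.multinomial(p1, 1).squeeze(1)
h2, p2 = cell(parent_h, s1)                                  # sibling-conditioned
s2 = torch.multinomial(p2, 1).squeeze(1)
print(s1.item(), s2.item())
```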
3.2 KL-divergence: Soft constraint
In our tree-structured recurrent neural network (RNN) architecture, each node $s_i$ outputs a categorical distribution $y_i$ over the set of possible symbols $S$, which includes unary operators, binary operators, and variables. To ensure that the learned distributions $y_i$ align with our predefined priors $P^*(s_i \mid S_i, s_p, h)$, we compute the Kullback-Leibler (KL) divergence between the RNN-generated distribution $y_i$ and the prior distribution $P^*(s_i \mid S_i, s_p, h)$ for each node $s_i$.
The KL divergence for node $s_i$ is defined as:
$$\mathrm{KL}\bigl(P^*(s_i \mid S_i, s_p, h) \,\|\, y_i\bigr) = \sum_{s \in S} P^*(s \mid S_i, s_p, h) \log \frac{P^*(s \mid S_i, s_p, h)}{y_i(s)}.$$
To aggregate the KL divergences computed for each node within the expression tree, we calculate the average KL divergence over all nodes:
$$\mathrm{KL}_{\mathrm{avg}} = \frac{1}{N} \sum_{i=0}^{N-1} \mathrm{KL}\bigl(P^*(s_i \mid S_i, s_p, h) \,\|\, y_i\bigr),$$
where $N$ is the total number of nodes in the expression tree.
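A minimal sketch of the per-node KL term and its average over a tree, assuming the prior and the controller output are both available as probability vectors over the symbol set:

```python
import torch

def kl_to_prior(prior_probs: torch.Tensor, policy_probs: torch.Tensor) -> torch.Tensor:
    """KL(P* || y_i) for one node; both inputs are probability vectors over
    the symbol set S. A small epsilon guards against log(0)."""
    eps = 1e-12
    return torch.sum(prior_probs * torch.log((prior_probs + eps) / (policy_probs + eps)))

def average_kl(prior_list, policy_list):
    """KL_avg: mean of the per-node KL divergences over all N nodes of a tree."""
    terms = [kl_to_prior(p, y) for p, y in zip(prior_list, policy_list)]
    return torch.stack(terms).mean()

# Toy example with a 4-symbol dictionary and a 2-node tree.
prior = [torch.tensor([0.6, 0.2, 0.1, 0.1]), torch.tensor([0.25, 0.25, 0.25, 0.25])]
policy = [torch.softmax(torch.randn(4), dim=0) for _ in range(2)]
print(average_kl(prior, policy))
```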
3.3 Formula Rule: Hard constraint
We define a set of operator combinations that are prohibited from appearing along the same path within an expression
tree. As discussed in Section 2, we observe that many operator combinations are absent from the collected subsequences.
This absence may result from various factors: these combinations might violate established symbol rules [18], lead to
numerical instability, or simply be uncommon in the specific domain or due to insufficient data.
We formalize this set as $\mathrm{HConstraint} = \{HC_1, HC_2\}$:
$HC_1$: Represents combinations that violate symbolic rules or result in numerical instability, as identified in prior research. These combinations are strictly prohibited and are excluded from the sampling process.
$HC_2$: Represents combinations that rarely occur. Although they are not commonly observed, we assign them a very small probability $\epsilon$ and include them in the set of soft constraints. This design promotes model exploration, enabling the potential discovery of novel physical laws.
For the operator combinations in the hard-constraint set, we follow the method in [18] and simply zero out their probabilities during sampling.
By categorizing constraints in this way, we ensure that our model adheres to known rules while still allowing flexibility
for exploration. This approach balances enforcing known constraints with maintaining a level of uncertainty, enabling
the model to explore new combinations that might reveal novel insights.
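In code, this amounts to a masking step applied to the categorical distribution before sampling; the sketch below is illustrative, with toy indices standing in for the HC1 and HC2 combination sets.

```python
import torch

def apply_constraints(logits, forbidden_idx, rare_idx, eps=1e-6):
    """Zero out probabilities of HC1 combinations and assign a small
    probability eps to HC2 combinations, then renormalize."""
    probs = torch.softmax(logits, dim=-1)
    probs[forbidden_idx] = 0.0          # HC1: strictly prohibited (masked out)
    probs[rare_idx] = eps               # HC2: rarely observed, kept explorable
    return probs / probs.sum()

# Toy dictionary: [sin, exp, log, Id]; suppose "log" is forbidden at this node
# (e.g., it would follow "exp") and "sin" is rare in this domain -- illustrative indices only.
logits = torch.randn(4)
masked = apply_constraints(logits, forbidden_idx=[2], rare_idx=[0])
sample = torch.multinomial(masked, 1)
print(masked, sample)
```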
3.4 Reward
The reward for an operator sequence $e = \{s_0, s_1, \dots, s_{N-1}\}$, denoted as $R(e)$, is defined as:
$$R(e) := \frac{1}{1 + L(e)}, \qquad L(e) = \min_{\theta} \mathrm{NRMSE}.$$
This reward $R(e)$ ranges between 0 and 1, where lower values of $L(e)$ result in rewards closer to 1, indicating a better fit to the target equation. Conversely, higher $L(e)$ values lead to lower rewards.
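A small sketch of the reward computation; the NRMSE here is taken as the RMSE divided by the standard deviation of the targets, which is one common convention and is assumed rather than specified by the paper.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """RMSE normalized by the standard deviation of y_true (assumed convention)."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (np.std(y_true) + 1e-12)

def reward(y_true, y_pred):
    """R(e) = 1 / (1 + L(e)) with L(e) the (parameter-optimized) NRMSE."""
    return 1.0 / (1.0 + nrmse(y_true, y_pred))

y = np.sin(np.linspace(0, 3, 50))
print(reward(y, y), reward(y, y + 0.5))   # perfect fit -> 1.0, worse fit -> smaller
```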
3.5 Agent update
The agent is updated using a basic policy gradient method with a KL-divergence regularization term to regulate the exploration. This regularization controls the distance between the learned policy and the domain-specific prior distribution. Detailed implementation procedures, including algorithmic steps and optimization strategies, are provided below.

The agent $A_\Psi$ is implemented as a recurrent neural network (RNN) with parameters $\Psi$. The KL-regularized training objective of the agent trades off maximizing returns with staying close to the sequences associated with our symbolic prior. This objective is formulated as:
$$J(\Psi) = \mathbb{E}_{e \sim A_\Psi}\Bigl[R(e) - \ell \, \frac{1}{N} \sum_{i=0}^{N-1} \mathrm{KL}\bigl(P^*(s_i \mid S_i, s_p, h) \,\|\, y_i\bigr)\Bigr],$$
where $y_i$ is the $i$-th output of $A_\Psi$ and $\ell$ is a hyperparameter.

To optimize the controller $A_\Psi$, we employ a policy gradient-based updating method in reinforcement learning (RL). In practice, we compute an approximation of this gradient using a batch of $k$ sampled operator sequences $e^{(1)}, e^{(2)}, \dots, e^{(k)}$ as follows:
$$\nabla_\Psi J(\Psi) \approx \frac{1}{k} \sum_{n=1}^{k} R\bigl(e^{(n)}\bigr) \sum_{i=0}^{N-1} \Bigl[\nabla_\Psi \log\bigl(y_i^{(n)}\bigr) - \frac{\ell}{N} \nabla_\Psi \mathrm{KL}\bigl(P^*(s_i \mid S_i, s_p, h) \,\|\, y_i\bigr)\Bigr].$$
To update the parameters $\Psi$ of the agent, we use gradient ascent with a learning rate $\eta$:
$$\Psi \leftarrow \Psi + \eta \, \nabla_\Psi J(\Psi).$$

The goal of the objective function $J(\Psi)$ is to improve the average reward of the sampled operator sequences. To enhance the probability of obtaining the best equation expression, we modify the objective function using the risk-seeking policy gradient approach:
$$J(\Psi) = \mathbb{E}_{e \sim A_\Psi}\bigl[R(e) \cdot \mathbb{I}\bigl(R(e) \ge R_\Psi\bigr)\bigr],$$
where $R_\Psi$ represents the $(1-\alpha)$-quantile of the reward distribution generated by $A_\Psi$, and $\alpha \in [0, 1]$. The gradient computation is updated as:
$$\nabla_\Psi J(\Psi) \approx \frac{1}{k} \sum_{n=1}^{k} \bigl(R(e^{(n)}) - \hat{R}_\alpha\bigr)\, \mathbb{I}\bigl(R(e^{(n)}) \ge \hat{R}_\alpha\bigr) \sum_{i=0}^{N-1} \nabla_\Psi \log\bigl(y_i^{(n)}\bigr),$$
where $\hat{R}_\alpha$ is an estimate of the $(1-\alpha)$-quantile $R_\Psi$ based on the sampled operator sequences. This adjustment improves the convergence of the controller $A_\Psi$ by focusing on higher-reward sequences. To obtain the final symbolic expression generated by our tree-structured RNN, we employ a FEX-based algorithm.
Algorithm 1 Regularized FEX with tree-structured RNNs
Input: Data $X$, a tree structure $T$, search loop iterations $T$, coarse-tune iterations $T_1$ (using Adam) and $T_2$ (using BFGS), fine-tune iterations $T_3$, pool size $K$, and batch size $N$.
Output: The expression $(T^*, \theta^*)$
1: Initialize an agent $A_\Psi$ for the tree $T$ and an empty pool $P$
2: for $t$ from 1 to $T$ do
3:   Sample $N$ sequences $\{e^{(1)}, \dots, e^{(N)}\}$ from the agent.
4:   for $n$ from 1 to $N$ do
5:     Optimize the NRMSE using the coarse-tune iterations $T_1 + T_2$
6:     Compute the reward for each sequence.
7:     Compute the KL divergence.
8:     if $e^{(n)}$ belongs to the top-$K$ then
9:       $P$.append($e^{(n)}$)
10:    end if
11:  end for
12:  $g \leftarrow \frac{1}{N} \sum_{n=1}^{N} \bigl(R(e^{(n)}) - \hat{R}_\alpha\bigr)\, \mathbb{I}\bigl(R(e^{(n)}) \ge \hat{R}_\alpha\bigr) \sum_{i=0}^{|T|-1} \nabla_\Psi \log\bigl(y_i^{(n)}\bigr)$
13:  $g_{\mathrm{KL}} \leftarrow -\frac{\ell}{N} \sum_{n=1}^{N} \sum_{i=0}^{|T|-1} \frac{1}{|T|} \nabla_\Psi \mathrm{KL}\bigl(P^*(s_i \mid S_i, s_p, h) \,\|\, y_i\bigr)$
14:  $\Psi \leftarrow \Psi + \eta\,(g + g_{\mathrm{KL}})$
15: end for
16: for $e$ in $P$ do
17:   Fine-tune the NRMSE using $T_3$ iterations
18: end for
19: return the expression with the smallest fine-tuning error
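The gradient step in lines 12–14 of Algorithm 1 can be sketched as follows. This is a simplified, self-contained illustration: the controller is a toy linear layer rather than the tree-structured RNN, and the batch, rewards, and prior are placeholders.

```python
import torch

def update_step(log_probs, rewards, kl_terms, params, alpha=0.05, ell=0.5, eta=1e-3):
    """One risk-seeking, KL-regularized policy-gradient step (cf. Algorithm 1, lines 12-14).
    log_probs: (batch, nodes) log-probabilities of the sampled symbols,
    rewards:   (batch,) rewards R(e),
    kl_terms:  (batch, nodes) per-node KL(P* || y_i) values."""
    r_hat = torch.quantile(rewards, 1.0 - alpha)          # estimate of the (1 - alpha)-quantile
    keep = (rewards >= r_hat).float()                     # indicator I(R(e) >= R_hat)
    n_nodes = log_probs.shape[1]
    surrogate = ((rewards - r_hat) * keep) * log_probs.sum(dim=1)   # risk-seeking term
    kl_penalty = (ell / n_nodes) * kl_terms.sum(dim=1)              # soft prior constraint
    objective = (surrogate - kl_penalty).mean()
    grads = torch.autograd.grad(objective, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p += eta * g                                  # gradient ascent: Psi <- Psi + eta * g
    return float(objective)

# Toy usage: a linear "controller" scoring 4 symbols for each of 3 tree nodes.
controller = torch.nn.Linear(3, 4)
features = torch.randn(8, 3)                              # batch of 8 sampled trees
dist = torch.distributions.Categorical(logits=controller(features).unsqueeze(1).expand(8, 3, 4))
actions = dist.sample()                                   # (batch, nodes) sampled symbols
prior = torch.full((4,), 0.25)                            # toy prior P*: uniform over symbols
kl_terms = (prior * (prior.log() - dist.logits)).sum(-1)  # KL(P* || y_i) per node
print(update_step(dist.log_prob(actions), torch.rand(8), kl_terms, list(controller.parameters())))
```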
3.6 Dynamic Scheduling of KL Divergence Regularization to Balance Prior Influence
A potential concern with the inclusion of priors is that they may introduce bias, forcing the model to remain overly close to the prior categorical distribution throughout training. To address this, we employ a dynamic scheduling strategy for the KL divergence regularization term. Specifically, we adapt the hyperparameter $\ell$ to gradually decay as training progresses.

In the early stages of training, a larger $\ell$ ensures that the model leverages prior knowledge to accelerate convergence and stabilize the search process. As training continues, $\ell$ is gradually reduced, allowing the model to rely more on the observed data and explore solutions beyond the initial priors. This balance mitigates the risk of excessive prior influence while preserving the benefits of guided exploration.

The decay of $\ell$ can be implemented using an exponential schedule:
$$\ell(t) = \ell_0 \cdot \exp(-\lambda_d t),$$
where $\ell_0$ is the initial regularization weight, $\lambda_d$ is the decay rate, and $t$ represents the training epoch. This dynamic adjustment ensures that the model transitions smoothly from a prior-dominated phase to a data-driven optimization phase, ultimately improving flexibility and generalization. The initial value $\ell_0$ of the KL divergence regularization weight and the decay rate $\lambda_d$ for the dynamic scheduling were determined using grid search over a predefined range, ensuring an optimal trade-off between prior alignment and model flexibility.
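A one-line sketch of this schedule (the numeric values are placeholders, not the tuned settings from the grid search):

```python
import math

def kl_weight(t, ell0=0.5, lambda_d=0.01):
    """ell(t) = ell0 * exp(-lambda_d * t): KL regularization weight at training epoch t."""
    return ell0 * math.exp(-lambda_d * t)

print([round(kl_weight(t), 3) for t in (0, 50, 100, 200)])  # decays from ell0 toward 0
```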
4 Experiments
In this section, we conduct a series of experiments to evaluate the effectiveness, performance, and generalization
capabilities of our proposed method. The experiments are designed to comprehensively assess the method across
multiple domains, compare it with existing approaches, and investigate the impact of incorporating domain-aware
symbolic priors. We begin with a Benchmark Test to validate our method on standardized symbolic expressions,
followed by a Comparative Analysis Across Domains to demonstrate its robustness and versatility.
4.1 Benchmark Test
In this section, we evaluate the performance and generalization capabilities of our proposed method on a standardized
set of benchmark expressions. We construct the benchmark test to include symbolic expressions across various domains
and complexities. This ensures a comprehensive comparison between our method and existing approaches. We employ
two well-established benchmarks to evaluate the performance of our proposed method across distinct domains:
•Feynman Benchmark for Physics
: A collection of symbolic expressions derived from the Feynman Lectures
on Physics, covering fundamental laws across mechanics, electromagnetism, thermodynamics, and quantum
mechanics. These expressions are known for their dimensional consistency and real-world applicability,
making them a rigorous test for symbolic regression methods in the physics domain [36].
•ODEbase for Biology: The ODEbase database serves as a benchmark for biological expressions. It consists
of mathematical models, primarily in the form of ordinary differential equations (ODEs), that are widely used
to describe biological processes, such as population dynamics, biochemical reactions, and cellular behaviors.
From ODEbase, we extract representative expressions to evaluate the ability of our method to build accurate
models for biological systems [37].
We closely followed the evaluation protocol established in SRBench by La Cava et al. (2021) [36] for evaluating
the performance of our method across the two selected benchmarks: Feynman Benchmark for Physics and ODEbase for
Biology.
•
For the Feynman Benchmark, our algorithm was tasked with identifying symbolic expressions that fit
10,000
data points corresponding to each Feynman benchmark equation. Exact symbolic recovery was determined
by verifying that the difference between the generated expression and the target expression simplified to a
constant.
•
For the ODEbase Benchmark, we extracted
200 representative expressions
describing biological systems
modeled by ordinary differential equations (ODEs). The recovery criteria and symbolic simplification checks
were identical to those applied in the Feynman Benchmark.
To ensure robustness, each experiment was repeated
10 times
with different random seeds, and recovery rates were
averaged across these trials.
Noise Levels: In alignment with benchmark practices, experiments were conducted across the following noise levels: 0, 0.01, 0.05, 0.07, and 0.1.
Data Usage: Each benchmark problem was defined by:
• A ground truth expression for symbolic recovery validation.
• A training dataset used to compute the reward for candidate expressions during optimization.
• A test dataset used to evaluate the final candidate expression at the end of training.
To ensure the robustness and reliability of the results, each experiment was repeated using 100 different random seeds
for every benchmark expression. The recovery rate was determined as the proportion of runs in which the algorithm
successfully identified the target expression, following the exact symbolic recovery criteria outlined earlier. By applying
this rigorous evaluation protocol, we ensured a fair and comprehensive assessment of our method’s performance across
the physics and biology benchmarks.
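The exact-recovery criterion (the candidate minus the target simplifying to a constant) can be checked symbolically; below is a small SymPy sketch of such a check, written for illustration rather than taken from the benchmark code.

```python
import sympy as sp

def exact_recovery(candidate, target, variables):
    """True if candidate - target simplifies to an expression containing
    none of the input variables, i.e. a constant offset."""
    diff = sp.simplify(candidate - target)
    return not any(v in diff.free_symbols for v in variables)

x, y = sp.symbols("x y")
print(exact_recovery(sp.sin(x) ** 2 + 3, 1 - sp.cos(x) ** 2, [x, y]))  # True: differs by a constant
print(exact_recovery(sp.sin(x) + y, sp.sin(x) * y, [x, y]))            # False
```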
The results presented in Figure 9 highlight the significant impact of incorporating prior knowledge and utilizing
hierarchical structures in symbolic regression tasks under varying noise conditions.
First, methods that integrate prior knowledge (RNN + prior, Tree-RNN + prior, and FEX + prior) demonstrate
consistently higher recovery rates across all noise levels compared to their counterparts without priors. This trend is
particularly evident in both benchmarks, where the inclusion of priors effectively mitigates the performance degradation
caused by increasing noise. Prior knowledge serves as a regularization mechanism, constraining the solution space and
enabling the models to generalize more robustly in noisy environments.
Second, among the tested methods, Tree-RNN and Tree-RNN + prior exhibit superior performance, particularly at
moderate and high noise levels. The hierarchical representation in Tree-RNN allows for efficient modeling of complex
expressions, reducing search depth and improving recovery accuracy. Notably, the Tree-RNN + prior method achieves
the highest recovery rates across all conditions, underscoring the complementary benefits of combining structural
efficiency with domain-specific priors.
In contrast, traditional methods such as RNN and FEX show a sharp decline in performance as noise levels increase,
indicating their sensitivity to noisy data. The results confirm that the integration of priors and hierarchical tree structures
not only enhances recovery robustness but also improves the scalability of symbolic regression methods to real-world
noisy datasets.
Figure 9: Comparison of recovery rates across varying noise levels for Biology and Physics benchmarks. Results are
shown for six methods: RNN, RNN + prior, Tree-RNN, Tree-RNN + prior, FEX, and FEX + prior.
4.2 Comparative Analysis Across Domains
In this section, we choose four expressions from four distinct domains to conduct a comparative analysis of the
following methods: FEX, FEX with priors, RL + RNN, RL + RNN with priors, RL + tree-structured RNN, and RL
+ tree-structured RNN with priors. Detailed descriptions of the four expressions utilized in this experiment are provided in Appendix A.
Learning parameters for the first two problems are: learning rate 0.003, batch size 1000, risk factor 0.05, and KL divergence parameter 0.5. For the other two: learning rate 0.001, batch size 1000, risk factor 0.05, and KL divergence parameter 0.35.
Based on experiments, we can draw several important conclusions about the effectiveness of using prior knowledge and
tree-structured RNNs:
•Effectiveness of Priors and Tree-Structured RNNs:
The incorporation of domain-specific priors and
tree-structured RNNs significantly enhances learning efficiency. Both "Tree-RNN + prior" and "RNN + prior" converge more quickly to optimal policies compared to other methods, demonstrating the advantage
of leveraging prior knowledge and hierarchical architectures. For instance, the Hamiltonian expression,
characterized by numerous additive terms, poses challenges for traditional RNNs. The tree-structured RNN
reduces the required network depth for such additive structures, while the prior categorical distribution equips
the model with domain-specific insights. This combination accelerates convergence and enables efficient
identification of high-quality solutions.
•Variability in Prior Impact:
The Fluid Dynamics panel of Figure 10 reveals a potential drawback of using priors. Here, the methods incorporating prior knowledge ("Tree-RNN + prior" and "RNN + prior") do not outperform the other methods by a large margin. This suggests that priors can introduce biases that may not always align well with certain complex expressions, thereby limiting their effectiveness.
In general, our results demonstrate that combining domain-specific priors with a tree-structured RNN agent can
significantly enhance the learning of complex functions. However, as illustrated in the Fluid Dynamics panel, the incorporation
of priors may sometimes introduce biases, leading to suboptimal performance in certain cases. This variability in
effectiveness highlights the need for careful consideration and selection of priors to match the characteristics of the
problem domain.
5 Conclusion
We find that combining domain-specific priors with our tree-structured RNN agent quickly results in an effective policy.
Learning from expressions across various fields has provided valuable insights for future research. However, our
approach is sensitive to the prior categorical distribution, making bias a challenge despite careful data collection.
Figure 10: Results across four different scenarios: (Top-left) Fluid Dynamics, (Top-right) Evolution of cell populations, (Bottom-left) Reaction rate equation, and (Bottom-right) Hamiltonian expression. Each subfigure shows results for six methods: RNN, RNN + prior, Tree-RNN, Tree-RNN + prior, FEX, and FEX + prior.
The prior for each domain consists of two parts: a "behavior prior" shared across all fields, and a domain-specific component. This is similar to the multitask problem in reinforcement learning. In future work, we plan to optimize both the domain-specific and behavior priors during training, aiming to uncover further interesting results.
Acknowledgement
Haizhao Yang was partially supported by the US National Science Foundation under awards DMS-2244988 and DMS-2206333, the Office of Naval Research Award N00014-23-1-2007, and DARPA award D24AP00325.
References
[1]
D. Angelis, F. Sofos, T. E. Karakasidis, Artificial intelligence in physical sciences: Symbolic regression trends
and perspectives, Archives of Computational Methods in Engineering 30 (6) (2023) 3845–3865.
[2]
C. Miles, M. R. Carbone, E. J. Sturm, D. Lu, A. Weichselbaum, K. Barros, R. M. Konik, Machine learning
of kondo physics using variational autoencoders and symbolic regression, Physical Review B 104 (23) (2021)
235111.
[3]
P. Neumann, L. Cao, D. Russo, V. S. Vassiliadis, A. A. Lapkin, A new formulation for symbolic regression to
identify physico-chemical laws from experimental data, Chemical Engineering Journal 387 (2020) 123412.
[4]
Y. Wang, N. Wagner, J. M. Rondinelli, Symbolic regression in materials science, MRS Communications 9 (3)
(2019) 793–805.
[5]
C. Wang, Y. Zhang, C. Wen, M. Yang, T. Lookman, Y. Su, T.-Y. Zhang, Symbolic regression in materials science
via dimension-synchronous-computation, Journal of Materials Science & Technology 122 (2022) 77–83.
[6]
J. Xie, L. Zhang, Machine learning and symbolic regression for adsorption of atmospheric molecules on low-
dimensional TiO2, Applied Surface Science 597 (2022) 153728.
[7]
W. Hu, L. Zhang, First-principles, machine learning and symbolic regression modelling for organic molecule
adsorption on two-dimensional CaO surface, Journal of Molecular Graphics and Modelling 124 (2023) 108530.
[8]
Y. Chen, M. T. Angulo, Y.-Y. Liu, Revealing complex ecological dynamics via symbolic regression, BioEssays
41 (12) (2019) 1900069.
[9]
B. T. Martin, S. B. Munch, A. M. Hein, Reverse-engineering ecological theory from data, Proceedings of the
Royal Society B: Biological Sciences 285 (1878) (2018) 20180422.
[10]
P. Cardoso, V. V. Branco, P. A. Borges, J. C. Carvalho, F. Rigal, R. Gabriel, S. Mammola, J. Cascalho, L. Correia,
Automated discovery of relationships, models, and principles in ecology, Frontiers in Ecology and Evolution 8
(2020) 530135.
[11]
J. Duffy, J. Engle-Warnick, Using symbolic regression to infer strategies from experimental data, in: Evolutionary
computation in Economics and Finance, Springer, 2002, pp. 61–82.
[12] Y. Jin, W. Fu, J. Kang, J. Guo, J. Guo, Bayesian symbolic regression, arXiv preprint arXiv:1910.08892.
[13]
I. Błądek, K. Krawiec, Solving symbolic regression problems with formal constraints, in: Proceedings of the
Genetic and Evolutionary Computation Conference, 2019, pp. 977–984.
[14]
M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data, science 324 (5923) (2009)
81–85.
[15]
S. Mirjalili, S. Mirjalili, Genetic algorithm, Evolutionary algorithms and neural networks: theory and applications
(2019) 43–55.
[16] W. B. Langdon, R. Poli, Foundations of genetic programming, Springer Science & Business Media, 2013.
[17]
T. N. Mundhenk, M. Landajuela, R. Glatt, C. P. Santiago, D. M. Faissol, B. K. Petersen, Symbolic regression via
neural-guided genetic programming population seeding, arXiv preprint arXiv:2111.00053.
[18]
B. K. Petersen, M. Landajuela, T. N. Mundhenk, C. P. Santiago, S. K. Kim, J. T. Kim, Deep symbolic re-
gression: Recovering mathematical expressions from data via risk-seeking policy gradients, arXiv preprint
arXiv:1912.04871.
[19]
S. Liang, H. Yang, Finite expression method for solving high-dimensional partial differential equations, arXiv
preprint arXiv:2206.10121.
[20]
F. Sun, Y. Liu, J.-X. Wang, H. Sun, Symbolic physics learner: Discovering governing equations via monte carlo
tree search, arXiv preprint arXiv:2205.13134.
[21]
T. Bendinelli, L. Biggio, P.-A. Kamienny, Controllable neural symbolic regression, in: International Conference
on Machine Learning, PMLR, 2023, pp. 2063–2077.
[22]
P.-A. Kamienny, S. d’Ascoli, G. Lample, F. Charton, End-to-end symbolic regression with transformers, Advances
in Neural Information Processing Systems 35 (2022) 10269–10281.
[23]
M. Vastl, J. Kulhánek, J. Kubalík, E. Derner, R. Babuška, Symformer: End-to-end symbolic regression using
transformer-based architecture, IEEE Access.
[24]
P. Shojaee, K. Meidani, A. Barati Farimani, C. Reddy, Transformer-based planning for symbolic regression,
Advances in Neural Information Processing Systems 36.
[25]
W. Li, W. Li, L. Sun, M. Wu, L. Yu, J. Liu, Y. Li, S. Tian, Transformer-based model for symbolic regression via
joint supervised learning, The Eleventh International Conference on Learning Representations.
[26]
M. Merler, K. Haitsiukevich, N. Dainese, P. Marttinen, In-context symbolic regression: Leveraging large language
models for function discovery, in: Proceedings of the 62nd Annual Meeting of the Association for Computational
Linguistics (Volume 4: Student Research Workshop), 2024, pp. 589–606.
[27]
Z. Bastiani, R. M. Kirby, J. Hochhalter, S. Zhe, Complexity-aware deep symbolic regression with robust risk-
seeking policy gradients, arXiv preprint arXiv:2406.06751.
[28]
M. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, A. Van Esbroeck,
Monotonic calibrated interpolated look-up tables, Journal of Machine Learning Research 17 (109) (2016) 1–47.
[29]
L. C. Bezerra, M. López-Ibáñez, T. Stützle, Archiver effects on the performance of state-of-the-art multi-and many-
objective evolutionary algorithms, in: Proceedings of the Genetic and Evolutionary Computation Conference,
2019, pp. 620–628.
[30]
G. Kronberger, F. O. de França, B. Burlacu, C. Haider, M. Kommenda, Shape-constrained symbolic regres-
sion—improving extrapolation with prior knowledge, Evolutionary Computation 30 (1) (2022) 75–98.
[31]
D. Ashok, J. Scott, S. J. Wetzel, M. Panju, V. Ganesh, Logic guided genetic algorithms (student abstract), in:
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 15753–15754.
[32]
J. Kubalik, E. Derner, R. Babuška, Multi-objective symbolic regression for physics-aware dynamic modeling,
Expert Systems with Applications 182 (2021) 115210.
[33]
I. Błądek, K. Krawiec, Counterexample-driven genetic programming for symbolic regression with formal
constraints, IEEE Transactions on Evolutionary Computation 27 (5) (2022) 1327–1339.
[34]
B. K. Petersen, C. P. Santiago, M. Landajuela, Incorporating domain knowledge into neural-guided search via in
situ priors and constraints, Tech. rep., Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States)
(2021).
[35]
W. Tenachi, R. Ibata, F. I. Diakogiannis, Deep symbolic regression for physics guided by units constraints: toward
the automated discovery of physical laws, The Astrophysical Journal 959 (2) (2023) 99.
[36]
W. La Cava, B. Burlacu, M. Virgolin, M. Kommenda, P. Orzechowski, F. O. de França, Y. Jin, J. H. Moore,
Contemporary symbolic regression methods and their relative performance, Advances in neural information
processing systems 2021 (DB1) (2021) 1.
[37]
C. Lüders, T. Sturm, O. Radulescu, ODEbase: A repository of ODE systems for systems biology, Bioinformatics Advances 2 (1) (2022) vbac027. doi:10.1093/bioadv/vbac027.
A Descriptions of Expressions
In physics, we compare different SR methods on recovering a Hamiltonian expression. The Hamiltonian $H$ for a nuclear system with a simplified model involving three momentum variables $p_1, p_2, p_3$ is given by:
$$H = \frac{\hat{A}-1}{\hat{A}} \sum_{i=1}^{3} \frac{p_i^2}{2 m_N} - \frac{1}{m_N \hat{A}} \sum_{i<j}^{3} p_i \cdot p_j + \sum_{i<j}^{3} V_{ij}.$$
Here $m_N$ (Nucleon Mass) represents the average mass of a nucleon (either a proton or a neutron) in the nuclear system. It is used in kinetic energy calculations; the average nucleon mass simplifies computations, as the system contains multiple nucleons. $\hat{A}$ (Particle-Number Operator) is an operator representing the total number of nucleons (particles) in the system. In the given context, $\hat{A}$ can be treated as the scalar number of nucleons, often denoted by $A$; the operator form is used in many-body physics to handle systems with varying particle numbers. $p_i$ (Momentum) represents the momentum of the $i$-th nucleon; in this simplified model, only three momentum variables ($p_1, p_2, p_3$) are considered. $V_{ij}$ (Two-Body Potential) represents the interaction energy between nucleons $i$ and $j$. This term accounts for forces between pairs of nucleons and can take various forms depending on the nature of the interaction. We use a simplified form, such as $V_{ij} = g / r_{ij}$, where $g$ is a constant. Given these variables and terms, the simplified Hamiltonian expression for the system involving three momentum variables ($p_1, p_2, p_3$) is:
$$H = \frac{\hat{A}-1}{\hat{A}} \sum_{i=1}^{3} \frac{p_i^2}{2 m_N} - \frac{1}{m_N \hat{A}} \sum_{i<j}^{3} p_i \cdot p_j + \sum_{i<j}^{3} \frac{g}{r_{ij}}.$$
We set $\hat{A} = 2.0$, $m_N = 1.5$, $g = 0.8$.
In biology, we describe the evolution of four distinct cell populations within a tumor microenvironment during the course of treatment. These populations include two sub-populations of tumor cells and two types of interacting cells (CAR T-cells and bystander cells). The model uses a system of differential equations to capture the dynamics of these populations. The simplified form of Equation (4) is:
$$\frac{dB}{dt} = b - \gamma_B B - \mu_B \log\frac{B+C}{K_2} + \frac{d_B + s\,B\,T_s^2}{k + d_B + s\,B\,T_s^2}\,B - \omega_B B\,(T_s + T_r),$$
where $T_s$ and $T_r$ are variables representing the tumor sub-populations, $B$ is the bystander cell population, and $C$ is the CAR T-cell population. We set $b = 0.5$, $\gamma_B = 0.1$, $\mu_B = 0.3$, $K_2 = 1.0$, $d_B = 0.05$, $s = 2.0$, $k = 0.8$, $\omega_B = 0.2$.
In chemistry: Reaction rate equation for $n = 3$. Given three substrates ($S_1, S_2, S_3$) and an inhibitor $I$, the equation can be written as:
$$v = \frac{V_{\max} \cdot [S_1] \cdot [S_2] \cdot [S_3]}{\bigl(K_m + [S_1] + [S_2] + [S_3]\bigr)\Bigl(1 + \frac{[I]}{K_i}\Bigr)}.$$
We keep $V_{\max} = 1.0$, $K_m = 0.5$, and $K_i = 0.3$. You can modify these parameters as needed.
Random Concentrations: We generate random concentrations for the three substrates ($S_1, S_2, S_3$) and one inhibitor ($I$) within specified ranges.
Reaction Rate Calculation: The reaction rate is computed using the updated equation that involves three substrates.
In engineering, a deep function can be a composition of multiple nested unary and binary operators, often found in fields like control systems, fluid dynamics, signal processing, or structural engineering. The more nested, or "deep," the operations, the more challenging the function becomes for symbolic regression to approximate.
Here is an example of a complicated "deep" function inspired by fluid dynamics and turbulence modeling. This function includes multiple layers of unary operations such as logarithms, trigonometric functions, and nested square roots:
$$f(x) = \log\bigl(\alpha \sqrt{x} + \beta \sin(\gamma x + \delta)\bigr) + \epsilon \cos\bigl(\eta \sqrt{x} + \theta \log(x)\bigr) + \zeta \exp\bigl(-\lambda x^2\bigr),$$
where $\alpha, \beta, \gamma, \delta, \epsilon, \eta, \theta, \zeta, \lambda$ are coefficients that control the function's shape and behavior. We set the coefficients $\alpha = 1.2$, $\beta = 0.8$, $\gamma = 2.0$, $\delta = 0.5$, $\epsilon = 0.1$, $\eta = 1.5$, $\theta = 0.3$, $\zeta = 0.05$, $\lambda = 0.01$.
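For reference, a short sketch that evaluates this benchmark function with the stated coefficients to generate regression data; the sampling range is an arbitrary choice made for illustration.

```python
import numpy as np

def deep_engineering_f(x, a=1.2, b=0.8, g=2.0, d=0.5, eps=0.1,
                       eta=1.5, theta=0.3, zeta=0.05, lam=0.01):
    """f(x) = log(a*sqrt(x) + b*sin(g*x + d)) + eps*cos(eta*sqrt(x) + theta*log(x))
              + zeta*exp(-lam*x^2), with the coefficient values stated above."""
    return (np.log(a * np.sqrt(x) + b * np.sin(g * x + d))
            + eps * np.cos(eta * np.sqrt(x) + theta * np.log(x))
            + zeta * np.exp(-lam * x ** 2))

x = np.random.uniform(0.5, 5.0, size=1000)   # sampling range chosen for illustration
y = deep_engineering_f(x)
```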