Neurosymbolic Programming for Science
Jennifer J. Sun
Caltech
Megan Tjandrasuwita
MIT CSAIL
Atharva Sehgal
UT Austin
Armando Solar-Lezama
MIT CSAIL
Swarat Chaudhuri
UT Austin
Yisong Yue
Caltech
Omar Costilla-Reyes
MIT CSAIL
Abstract
Neurosymbolic Programming (NP) techniques have the potential to accelerate
scientific discovery across fields. These models combine neural and symbolic com-
ponents to learn complex patterns and representations from data, using high-level
concepts or known constraints. As a result, NP techniques can interface with sym-
bolic domain knowledge from scientists, such as prior knowledge and experimental
context, to produce interpretable outputs. Here, we identify opportunities and
challenges between current NP models and scientific workflows, with real-world
examples from behavior analysis in science. We define concrete next steps to move
the NP for science field forward, to enable its use broadly for workflows across the
natural and social sciences.
1 Introduction
One of the grand challenges in the artificial intelligence and scientific communities is to find an AI
scientist: an artificial agent that can automatically design, test, and infer scientific hypotheses from
data. This application poses several distinct challenges for existing machine learning techniques
because of the need to ensure that any new theories that are formulated from data are consistent with
our prior scientific knowledge, as well as to enable scientists to reason about the implications of new
hypotheses and experimental designs.
The distinct requirements of scientific discovery have pushed the community to explore
techniques such as symbolic regression [Cranmer, 2020], a variety of approaches to interpretable
machine learning [Ustun and Rudin, 2017, Doshi-Velez and Kim, 2017, Kleinberg et al., 2018,
McGrath et al., 2021], and the use of program synthesis to understand biological systems
[Koksal et al., 2013] and language morphology [Ellis et al., 2022]. These techniques have helped the
community make significant progress in a number of applications, but we are still far from solving
the grand challenge.
We focus on the opportunities behind an important class of learning techniques based on Neurosym-
bolic Programming (NP) [Chaudhuri et al., 2021]. These combine neural and symbolic reasoning to
build models that incorporate prior expert knowledge and strong constraints on model behavior and
structure. NP is capable of producing symbolic representations of theories that can be analyzed and
manipulated to answer rich counterfactuals.
NP has an important role to play in automating scientific discovery in a way that builds on existing
knowledge and leads to finding new scientific insights. NP allows for a new line of attack on the grand
AI scientist challenge: represent scientific hypotheses as programs in a Domain Specific Language
(DSL) and use neurosymbolic program synthesis to automatically discover these programs.

Equal contribution
Corresponding author: costilla@mit.edu
Preprint. Under review.
arXiv:2210.05050v1 [cs.AI] 10 Oct 2022

[Figure 1 diagram: the scientific process (choose research question, generate hypothesis,
experimental setup / data collection, analysis, interpretation, report to community) shown
alongside the iterative NP lifecycle and its challenges: data curation (Sec 3.4), domain knowledge
(Sec 3.2), NP model training (Sec 3.1, 3.3), evaluation (Sec 3.5, 3.7), and deployment (Sec 3.6),
with inputs to and outputs from the NP models.]

Figure 1: Synergy between the scientific and neurosymbolic programming workflow.

Users can incorporate complex forms of prior knowledge (for example, known features and constraints)
into the design of the DSL. The NP learning algorithms can then follow principles of reasoning
found in classical science. Also, models learned this way are often similar to the code that human
domain experts write during manual scientific modeling. Collectively, these characteristics enable an
interactive and iterative process through which an artificial system and a human expert collaborate on
evidence-based reasoning and the discovery of new scientific facts.
In this paper, we use behavior analysis as a concrete example to demonstrate the challenges and
opportunities in this space. We start with an introduction to NP in Section 2. While NP has the
potential to improve the scientific process end-to-end (Figure 1), there are challenges that exist at the
intersection of neurosymbolic learning and science, which we outline in Section 3, with a focus on
evaluation and benchmarking in Section 3.7.
Behavior analysis as running example.
Many of the underlying challenges in behavior analysis
are shared with other scientific domains, from monitoring vital signs to modeling physical systems
to the dynamics of chemical reactions. Fundamentally, behavioral data is a type of spatiotemporal
data, a data type common across the sciences because many phenomena are spatiotemporal in
nature. Behavior analysis illustrates common challenges with scientific data analysis. Scientists
often study rare behaviors, which leads to imbalanced data. Datasets vary across labs, organisms,
and experimental setups, and are often noisy and imperfect. Finally, automatic behavior quantification
is becoming increasingly crucial in a range of scientific fields, such as neuroscience, ecology, biology,
and healthcare monitoring. As automatic behavior analysis and neurosymbolic learning are both
developing research fields, there are many exciting opportunities to explore at their intersection.
Background on behavior analysis.
An important objective of behavior analysis is to quantify
behavior from video using continuous or discrete representations. We focus on the case of animal
behavior analysis in science [Anderson and Perona, 2014, Datta et al., 2019], where there are diverse
organisms and naturalistic behaviors. In this setting, a common analysis method is first to perform
animal pose tracking from video [Mathis et al., 2018, Pereira et al., 2022], then categorize behaviors of
interest from animal pose [Segalin et al., 2021]. We need methods that facilitate scientists’ extraction
of biological insights from pose trajectories or directly from recorded videos of animals.
Existing challenges in behavior analysis.
Similar to other fields of science, data collection and
annotation are expensive for behavioral experiments, and raw data is often imperfect and noisy. The
process of data analysis is time-consuming and expensive since specialized domain expertise is
required for identifying behaviors of interest and extracting knowledge. Models need to interface
efficiently with scientists and data at both the input and output ends of the scientific process (Fig-
ure 1). For NP models, leveraging domain expertise in the form of behavioral attributes has been
demonstrated to improve data efficiency [Sun et al., 2021] and interpretability [Tjandrasuwita et al.,
2021].
Beyond existing works, there is a variety of domain expertise that requires new algorithmic designs to
integrate into the NP workflow, such as experimental context, ethograms, and scientific spatiotemporal
constraints. Incorporating such domain knowledge has the potential to enable NP models to be more
robust to noisy and imperfect data. Furthermore, when black-box models are used for studying
behavior, it is difficult to diagnose errors and interpret the model output. NP models have the
potential to produce symbolic descriptions of behavior (Figure 3), which enables experts to connect
model interpretations with other parts of the behavior analysis workflow, for example, when
describing behavioral differences across different strains of mice. Finally, to enable the use of NP
models in real-world science workflows, these models must be scalable and produce robustly
reproducible interpretations, which is a shared challenge across domains in science.

[Figure 2 diagram: four models along the neural-to-symbolic spectrum: a 1D convolutional network
(black-box, many parameters), a neurosymbolic encoder (interpretable via visualizations; few
symbolic and many neural parameters), a differentiable program (interpretable via visualizations,
few parameters), and a human-written program (interpretable, no parameters), the latter
illustrated by the following snippet:]

def is_attacking(fly, tgt):
    f2t_angle = atan2(tgt.y - fly.y, tgt.x - fly.x)  # angle from fly to target
    rel_angle = abs(fly.abs_angle - f2t_angle)       # heading offset from target
    return fly.speed > 2 and rel_angle < 0.1

Figure 2: Space of neurosymbolic programming models in behavior analysis.
2 Neurosymbolic Programming techniques
In the NP framework, the objective is to learn neurosymbolic programs, which incorporate latent
representations computed by neural networks and symbols that explicitly capture pre-existing human
knowledge, and connect these elements using rich architectures that mirror classical models of
computation. The programs, assumed to belong to a DSL, are learned using a combination of
gradient-based optimization, probabilistic methods, and symbolic techniques.
NP methods vary in the degree to which they use neural vs. symbolic reasoning
(Figure 2). The two ends of the spectrum correspond to purely neural (a 1D convolutional network)
and purely symbolic techniques (a human-written program used for weak supervision [Tseng et al.,
2022]), respectively, whereas the techniques close to the center are both neurosymbolic. Specifically,
the model in the center-left is a neurosymbolic encoder [Zhan et al., 2021], while the model in the
center-right is a program with differentiable parameters for behavior analysis [Tjandrasuwita et al.,
2021].
To illustrate the strengths and weaknesses of each model, suppose we have a scientific
hypothesis to test on a dataset. On the fully symbolic side, an expert writes a program that encodes
the hypothesis in a general programming language. This program requires no learnable parameters,
is fully interpretable, and, if needed, can be iteratively improved. However, this method is also
brittle: the program must be engineered to handle all the dynamics of the dataset, which is
intractable when those dynamics are complex. On the purely neural side, the model fits the
hypothesis directly from the dataset. Such models generalize well to the dataset but offer limited
interpretability and control over the generated hypothesis.
Differentiable programs.
While neural network models are powerful choices when given sufficient
high-quality data, they are generally prone to overfitting and difficult to interpret. To address some
of these weaknesses, NP methods have been applied in a variety of machine learning paradigms to
leverage the advantages of neural networks and programs. For example, in a supervised behavior
classification setting, [Shah et al., 2020] introduce a Neural Admissible Relaxation (NEAR)
technique. Their goal is to synthesize differentiable programs, a powerful way of engineering
systems with both discrete architectures and continuous parameters.
At a high level, a differentiable program α_θ for a programming language L is defined as a
composition of functions σ ∈ L whose parameters θ ∈ R^d are differentiable. A differentiable
program follows the syntax defined by a DSL, which can consist of parametric functions (multi-
layered perceptrons, linear transformations), algebraic functions (add, multiply), and programming-
language higher-order functions (map, fold). Furthermore, the composition of these differentiable
functions is itself differentiable through the chain rule. This property enables the resulting programs
to be fully differentiable. [Shah et al., 2020] claim that programs synthesized by NEAR compete in
performance with a black-box recurrent neural network while being much more human-interpretable.
Figure 3 is an example program discovered by NEAR for behavior analysis.

Figure 3: A program generated by a neurosymbolic programming framework [Shah et al., 2020].
The program in Figure 3 classifies the "sniff" action between two mice. An interpretation is that if
the distance between two mice is small, they are doing a "sniff"; otherwise, they are only doing a
"sniff" if the accelerations are small.
Though NEAR and follow-up work [Cui and Zhu, 2021] demonstrate great promise in terms of
accuracy on behavior classification tasks, their programs are still written in a fairly generic DSL,
and the results may be difficult for domain experts to interpret. [Tjandrasuwita et al., 2021] directly
address this challenge in explaining the differences between sets of behavior expert annotations.
Their primary contribution is to replace generic higher-order functions over recursive data
structures, e.g. map and fold, with a differentiable temporal filter operation, the Morlet filter.
The filter models temporal information in a highly data-efficient manner and can be interpreted as a
human's impulse response to a given behavioral feature for classification.
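As a rough sketch of the idea (the exact parameterization in [Tjandrasuwita et al., 2021] may differ), a Morlet-style filter pairs a Gaussian envelope with a cosine carrier and is convolved with a behavioral feature trace; its width and frequency are the learnable parameters:

```python
import numpy as np

def morlet_kernel(width, omega, length=21):
    # Gaussian envelope times a cosine carrier (Morlet-style wavelet).
    t = np.linspace(-1.0, 1.0, length)
    return np.exp(-0.5 * (t / width) ** 2) * np.cos(omega * t)

def morlet_filter(feature_trace, width, omega):
    # Differentiable temporal filter over a 1-D behavioral feature;
    # width and omega are the two interpretable, learnable parameters.
    kernel = morlet_kernel(width, omega)
    kernel = kernel / np.sum(np.abs(kernel))  # normalize the response
    return np.convolve(feature_trace, kernel, mode="same")
```

With only two parameters per filter, the learned temporal response can be plotted and read off directly, unlike the stacked kernels of a deep temporal convolutional network.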
3 Opportunities and Challenges at the Intersection of Neurosymbolic
Learning and Science
The techniques outlined above demonstrate some benefits of the NP framework. However, there
are still major gaps between current NP approaches and practical use cases in science. Challenges
exist across the life-cycle of NP algorithm development, deployment, and its synergy with the
scientific workflow: from dealing with noisy input data, to scalable model training, to evaluation,
tool building, and deployment (see Figure 1). We draw attention to these challenges to encourage the
research community to collaborate on new NP methods that increase the synergy with scientific
workflows and accelerate science.
3.1 Scalability challenge
Inductive synthesis.
A large body of work on program synthesis has focused on inductive synthesis,
or synthesizing programs from examples [Lau and Weld, 1998, Gulwani, 2011, Devlin et al., 2017].
While this goal is on the surface similar to performing machine learning (ML) with programs
as models, a key difference is that ML approaches depend on defining a clear space of models (i.e.
neural networks, support vector machines, decision trees) and generalizing to unseen data. In contrast,
much work in inductive synthesis considers an arbitrary space of programs and spends significant
effort engineering the examples, treating them as noiseless specifications. As a result, inductive
synthesis scales poorly with increases in program length and the number of examples.
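A toy illustration of inductive synthesis and its scaling problem: enumerate compositions of primitives until one matches every input/output example. The three-primitive DSL here is invented for illustration; real systems search far richer, but still exponential, spaces:

```python
import itertools

# Hypothetical three-primitive DSL over integers.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
}

def synthesize(examples, max_len=4):
    """Return the shortest primitive sequence consistent with all
    examples, trying |PRIMITIVES|**length candidates per length."""
    for length in range(1, max_len + 1):
        for names in itertools.product(PRIMITIVES, repeat=length):
            def run(x, names=names):
                for name in names:
                    x = PRIMITIVES[name](x)
                return x
            # Examples are treated as a noiseless specification:
            # a single mislabeled pair makes the search fail outright.
            if all(run(i) == o for i, o in examples):
                return names
    return None
```

For instance, synthesize([(1, 4), (2, 6)]) recovers ("inc", "dbl"). The candidate count grows exponentially with program length, and one noisy example rules out the correct program, which is exactly the brittleness that distinguishes this setting from statistical ML.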
Scaling NP in science.
To tackle scalability in science, models need to handle large and potentially
noisy datasets, high-dimensional input spaces, and a variety of analysis tasks. Recent NP research
[Shah et al., 2020, Cui and Zhu, 2021] proposes frameworks that scale to large datasets given an
expressive DSL. These works are instantiated in behavior analysis: learning programs on temporal
trajectory data to reproduce expert annotations of behavior that contain noisy labels, similar to other
scientific data. These works tackle the challenge of discovering programs with parameters, which can
be directly optimized through popular gradient optimization techniques. While NP methods provide
a means of scaling inductive synthesis to scientific datasets, these techniques often involve combining
a discrete search over an exponential space of programs with continuous optimization.
Challenges for enabling scalability. The differentiable program synthesis library (dPads) introduced
by [Cui and Zhu, 2021] addresses the computational efficiency challenge of having to synthesize
discrete program architectures and perform gradient optimization at runtime for NP models. To do
so, dPads trades off computation with the cost of storing many more parameters in memory, with
heuristics to mitigate memory usage. However, training fully neural models on a GPU is often more
efficient than training NP models, which requires searching through an exponentially large space of
symbolic architectures on a CPU. Furthermore, scalability has not been broadly explored for different
types of scientific data, such as video recordings, which are much higher dimensional than trajectory
data. Finally, the effectiveness of program synthesis may still be limited by the expressivity of a DSL,
which requires experts to spend time encoding domain knowledge, such as expert-designed behavior
attributes [Sun et al., 2021] and temporal filters [Tjandrasuwita et al., 2021] (further discussed in
Section 3.2).
Scalability challenges also arise in symbolic regression [Cranmer et al., 2020], where the goal is
to learn exact mathematical relationships between variables by searching a space of mathematical
expressions. In this case, the regression scales poorly with the number of interactions between
variables.
3.2 Encoding and Learning Domain Knowledge
The success of NP techniques often depends on how a DSL is defined. However, it may be not only
time-consuming but also unclear how to handcraft domain-specific components that work
best in a scientific context. Library learning proposes algorithms that consolidate common patterns
in successful programs generated during program synthesis and add them iteratively to the current
DSL, enabling the program search to discover high-performing programs with little effort.
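The core loop can be caricatured in a few lines: treat programs as syntax trees, count repeated subtrees across successful programs, and promote the most common one to a new named primitive. Real systems such as DreamCoder use compression-based scoring and refactoring rather than this hypothetical raw count:

```python
from collections import Counter

def subtrees(prog):
    # Programs are nested tuples: (operator, arg1, arg2, ...).
    yield prog
    if isinstance(prog, tuple):
        for child in prog[1:]:
            yield from subtrees(child)

def best_abstraction(programs):
    """Most frequent non-leaf subtree across successful programs:
    a candidate to add to the DSL as a new primitive."""
    counts = Counter(
        s for p in programs for s in subtrees(p) if isinstance(s, tuple)
    )
    pattern, _ = counts.most_common(1)[0]
    return pattern
```

Adding the returned pattern to the DSL shortens every program that used it, which in turn makes the search for further high-performing programs cheaper.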
Library learning for science. In behavior analysis, humans are capable of writing short programs
that can improve model learning, such as by designing features and heuristics [Segalin et al., 2021,
Tseng et al., 2022, Eyjolfsdottir et al., 2014]. However, these programs are greatly limited by their
simplicity and may not capture complex behavior. Library learning has the potential to augment
human feature design, by synthesizing interpretable programs and inducing high-level DSLs, given
low-level, generic primitives. A state-of-the-art library learning system, DreamCoder [Ellis et al.,
2021], has been successful in physics equation discovery, and library learning has also been
applied to generative modeling in molecular chemistry [Guo et al., 2021]. The proposed library
learning method is capable of handling data-limited settings, often found in science due to the cost
of obtaining high-quality labels from experts. Library learning methods also have the advantage
of enabling greater possibilities for including domain-specific knowledge, as all components in the
library are written in terms of human-interpretable primitives.
Challenges of library learning for science. While DreamCoder and the molecular generation work
have shown promise in their respective domains, it is unclear how library learning can scale to more
complex real-world scientific domains, such as behavior analysis, which often involves
thousands of video frames with noisy data. In addition, it is highly expensive to collect behavior
annotations across many behaviors, which is needed to perform traditional library learning. In
contrast, current library learning methods have been applied to contexts where each task consists of a
few examples, not exceeding hundreds of data points. In addition, the labels are noiseless, as opposed
to real-world situations found in behavior analysis [Leng et al., 2020, Segalin et al., 2021].
Another challenge is that domain experts still need to interpret solutions generated by the NP library
learning system. One promising approach leverages natural language to impose a stronger prior on the
program search and the library learning [Wong et al., 2021], resulting in a more human-interpretable
DSL. In addition to the challenge of scaling library learning to real behavior analysis settings, building
a smooth interface between expertise in science, program synthesis, neural networks and probabilistic
library learning methods, found in NP, would likely require significant engineering and research
efforts (Section 3.6).
3.3 Challenges of optimization of discrete and continuous space in neurosymbolic programs
NP relies on techniques from symbolic program synthesis to facilitate interpretable and verifiable
searches over the scientific hypothesis space. However, programs are inherently symbolic, owing
to their roots in mathematical logic. This makes modeling phenomena in the continuous domain
challenging without modifying the way we interpret programs.
For instance, consider a simple program modeled by an if-then-else statement (if condition
do expr1 else do expr2). The possible behaviors of condition are partitioned into two sets,
True (1) and False (0), which evaluate to expr1 or expr2 respectively. However, an NP
approach requires reasoning over a gradient of possibilities in order to be differentiable. Discrete
programs are inaccurate models for these applications. Specifically, in behavior classification,
modeling the "attack" action using a symbolic if-then-else expression would partition the mouse's
aggression into a binary set: either always attacking or not attacking at all.
Continuous relaxations. We approach the problem of continuous program optimization over a
symbolic domain by changing the semantics of the programming language. Specifically, work on
Smooth Interpretation [Chaudhuri and Solar-Lezama, 2010] rewrites discrete functions using their
closest smooth mathematical approximations. For example, an if-then-else statement can be
rewritten as a shifted, temperature-scaled sigmoid function. This smoothing is not restricted to
a one-dimensional input space or specialized functions; in higher dimensions, we can use
Gaussian smoothing to smooth out discontinuities. Such relaxations, in conjunction with other
program analysis tools, allow gradient descent-based optimizers to converge to optimal
programmatic models.
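A minimal sketch of the sigmoid rewrite, with an assumed sharpness parameter beta: the discrete branch provides no usable gradient through its condition, while the smoothed version blends both branches and recovers the discrete semantics as beta grows:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hard_if(cond_value, expr1, expr2):
    # Discrete semantics: a step function of the condition.
    return expr1 if cond_value > 0 else expr2

def smooth_if(cond_value, expr1, expr2, beta=10.0):
    # Smooth Interpretation-style relaxation: a shifted sigmoid gate
    # blends the branches; as beta -> infinity it approaches hard_if.
    gate = sigmoid(beta * cond_value)
    return gate * expr1 + (1.0 - gate) * expr2
```

The choice of beta embodies the precision/trainability tradeoff: a large beta tracks the discrete program closely but makes the optimization landscape nearly discontinuous again.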
Continuous relaxations enable an approximate interface between neural networks and programming
languages, which are essential in the NP framework. For example, in Houdini [Valkov et al., 2018],
continuous relaxations enabled the construction of a functional programming language that admits
neural networks and higher-order functions. This construction facilitated high-level transfer of learned
concepts across tasks in a life-long learning setting. In NEAR [Shah et al., 2020], the interface
between neural networks and differentiable programs allowed measuring the performance of partial
programs. This proved to be an admissible heuristic for synthesizing differentiable programs in
the behavior analysis setting.
Smooth Interpretation allows us to posit a differentiable approximation of a non-differentiable
program. The approximation error introduces a tradeoff between output precision and the
trainability of the model. Specifically, under-approximating the non-differentiable components
might increase the precision of the differentiable program at the cost of retaining discontinuities
in the optimization landscape and converging to a suboptimal model, and vice versa.
3.4 Dealing with raw, noisy, and imperfect data
Data found in scientific domains provides an opportunity to study NP models with imperfect data
in real-world conditions, such as missing data, experiment noise, and distribution shifts. Because
NP models incorporate prior knowledge and known constraints, they have the potential to
perform well in the presence of imperfect data. For example, in behavior analysis, neurosymbolic
models can be trained on a small number of annotated examples and then applied to generate
weak supervision for a full dataset [Tseng et al., 2022].
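One minimal version of this idea (the single-feature threshold rule is an illustrative assumption, not the program class of [Tseng et al., 2022]): fit an interpretable rule on the few labeled examples, then use it to weakly label the rest:

```python
import numpy as np

def fit_threshold_rule(features, labels):
    """Pick the threshold that best reproduces the expert labels on a
    small annotated set; the learned rule is 'feature < threshold'."""
    best_t = max(
        np.unique(features),
        key=lambda t: np.mean((features < t) == labels),
    )
    return lambda x: x < best_t

def weak_labels(rule, unlabeled_features):
    # Apply the learned symbolic rule to weakly label the full dataset.
    return [bool(rule(x)) for x in unlabeled_features]
```

The rule itself stays inspectable (a single threshold on a named feature), so an expert can sanity-check the weak labels before they are used for downstream training.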
These types of imperfect data exist throughout science: missing data in neural recordings due to
hardware issues, noise in pose estimators for tracking animal movements, and distribution shifts. An
additional source of noise in data is the considerable variability that exists in the labeling generation
process, such as annotator subjectivity and ambiguity in category definitions. Furthermore, scientists
are often interested in studying rare categories, such as behaviors that may occur in less than 1%
of a dataset. NP research [Shah et al., 2020, Cui and Zhu, 2021] leverages the flexibility of neural
networks with symbolic domain knowledge; however, there remain challenges in improving model
scalability that we have outlined in this section.
Structural discovery. In many scientific workflows, meaningful categories and structures in raw
data may not be clear ahead of time, requiring unsupervised or self-supervised learning from
data. For example, there are many tools for discovering new behavior categories from data without
expert supervision [Pereira et al., 2020]. [Zhan et al., 2021] demonstrated that integrating domain
knowledge in an NP workflow results in more meaningful discovered categories compared to fully
neural methods. In addition to the algorithmic challenges discussed in previous sections, future
research work needs to be robust to variations in experimental noise and produce interpretations of
discovered structures in the data that are useful in the context of science.
Figure 4: Functionalities of MARS and Bento [Segalin et al., 2021] in the behavior analysis pipeline.
Distribution Shifts.
Distribution shifts are common in real-world scientific applications [Koh et al.,
2021]. For typical black-box machine learning models, it is difficult to diagnose and address these
errors. NP approaches generally learn interpretable and modular programs, which hold promise for
tackling this challenge. For example, in behavior analysis, when the physical behavioral area changes
in size or shape, the relative size of mice in the recording also changes. This causes errors in behavior
classifiers trained on a previously known area, but NP programs can be adjusted to the new setting
accordingly.
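As a concrete, hypothetical sketch of that adjustment: if a program's distance threshold is expressed relative to arena size, the single interpretable parameter can be re-derived for a new arena instead of retraining a black-box classifier. The 5% fraction and the rule itself are illustrative assumptions:

```python
def make_close_interaction_rule(arena_width_cm, frac=0.05):
    """Hypothetical programmatic primitive: two animals are 'close'
    when their distance is below a fixed fraction of arena width."""
    threshold_cm = frac * arena_width_cm
    def is_close(dist_cm):
        return dist_cm < threshold_cm
    return is_close

# The same program transfers across arenas by re-deriving one parameter.
small_arena = make_close_interaction_rule(arena_width_cm=50.0)
large_arena = make_close_interaction_rule(arena_width_cm=100.0)
```

A purely neural classifier offers no analogous knob: correcting the same shift would require collecting and annotating new data in the changed arena.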
3.5 Evaluation challenge
Similar to other ML models, evaluation in NP is crucial for measuring progress in model development
and understanding gaps in model performance. This is also important in real-world deployment in
scientific discovery pipelines. As an important next step in reducing the gap between NP models and
science, we discuss the evaluation of NP models and focus on interpretability.
The main goal of model interpretability is to obtain model insights that are understandable and
actionable to humans, assisting scientists in their analysis workflows. The following are commonly
described properties of explanations in the machine learning literature [Murphy, 2023] that have
the potential to improve the interpretability and evaluation of NP workflows:
Compactness or sparsity: sparsity generally corresponds to some notion of smallness (a few
features or a few parameters). Completeness: whether the explanation includes all the relevant
elements and higher-level concepts needed. Stability: the extent to which similar inputs yield
similar explanations. Actionability: whether the user can focus on only those aspects of the model
that they are able to intervene on. Modularity: whether the explanation can be broken down into
understandable parts. To study the interpretability of NP models for science, we need datasets and
benchmarks that can quantify these different dimensions in a range of scientific contexts,
which is still an open problem.
3.6 Cross-Domain Analysis Science Tools
Importance of tools in science.
User-friendly tools are important for facilitating the integration of
ML models in real-world science workflows but have not been well-explored for NP approaches. For
example, numerous tools based on statistical analysis and ML have been developed to interface
with scientists and facilitate behavior analysis from videos [Pereira et al., 2020]. These tools assist
with much of the computational pipeline for behavior classification, as outlined in Figure 4, and
often provide visual interfaces that display relevant raw data, such as video, alongside model outputs,
such as pose data and behavior labels [Segalin et al., 2021]. Enabling similar tools for NP approaches
has the potential to benefit existing scientific workflows. For instance, integrating NEAR into a visual
interface could provide scientists with a user-friendly way of generating differentiable programs and
a means of understanding the programs that NP pipelines generate. The parameters associated with
programmatic primitives are likely to be much more human-interpretable [Tjandrasuwita et al.,
2021] than those found in black-box neural networks.
Challenges of building NP tools.
Domain expertise in science varies in structure, from behavioral
attributes to visual or textual descriptions, to known dynamics of movement, to knowledge graphs,
and generally differs across labs and domains. Furthermore, to measure progress, user evaluations are
needed that could offer quantitative or qualitative evidence for NP workflows. Taking the first steps
to realize and evaluate the effectiveness of NP algorithms through a human-computer interaction
approach may not only improve the scientific pipeline but also yield new algorithmic directions on
combining NP with more traditional human-in-the-loop methods, such as active learning.
3.7 Benchmarking
Much of the scientific advancement with ML workflows has focused on single domains, for example
behavioral neuroscience [Sun et al., 2021], chemistry [Guo et al., 2021], or physics [Cranmer et al.,
2020]. While many individual fields of science have seen successes through NP, consolidating
underlying insights that are generalizable across domains in science remains another significant open
challenge for the scientific and machine learning communities. To encourage the development of
new methods that work across a diverse set of scientific problems and enable comparisons between
them, we propose to build benchmarks around low-dimensional spatiotemporal data, a setting where
NP methods have been demonstrated to be successful [Shah et al., 2020, Verma et al., 2018]. We
believe that there are several benefits to developing an NP benchmark for the ML and
scientific communities: (1) systematic improvements across broad scientific use cases, (2) model
improvements across shared evaluation metrics instead of domain-specific dimensions, and (3)
increased awareness of important scientific applications that have not received as much attention
from the ML community.
Challenges of benchmarking NP for science.
Interpreting programmatic structures requires expert
domain knowledge, which can be expensive and time-consuming to obtain. In behavior analysis,
evaluating learned programmatic structures requires interactions with experts in the behavioral science
community to interpret these structures. This imposes a major bottleneck on evaluating outputs. A
standardized benchmark will make it easier for the community to convene and interact with a panel
of experts. We believe that developing a benchmark for NP pipelines is integral to moving the NP
field forward.
The space of NP models is broad. Each algorithm presents a unique methodology for encoding
expert knowledge into the NP lifecycle. This requires comparing models on multiple evaluation
metrics. However, not all NP algorithms can be systematically evaluated on the same set of metrics.
For instance, certain classes of models use stochastic search to discover the programmatic structure
and the programs found by such an approach may not be reproducible. Additionally, NP models
might exhibit desirable properties for which no concrete evaluation metric exists. For instance, some
classes of NP algorithms exhibit robust reproducibility: the model's outputs remain stable under
small perturbations of the input data. However, to the best of our knowledge, defining such a metric
quantitatively and objectively remains an open challenge.
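One hypothetical starting point, sketched below, is to re-run a synthesizer on perturbed copies of the data and measure how often it recovers the same program. The `synthesize` function here is a toy stand-in that picks a threshold rule from one-dimensional data; the names and the metric itself are illustrative, not an established definition.

```python
# Illustrative (not established) reproducibility metric: the fraction of
# perturbed re-runs on which a synthesizer recovers the same program.
import random
from typing import List, Tuple

Dataset = List[Tuple[float, int]]  # (feature, label) pairs

def synthesize(data: Dataset) -> float:
    """Toy 'program synthesis': pick the midpoint threshold between classes."""
    lo = max(x for x, y in data if y == 0)
    hi = min(x for x, y in data if y == 1)
    return round((lo + hi) / 2, 1)  # rounding mimics a discrete program space

def reproducibility(data: Dataset, noise: float, trials: int,
                    seed: int = 0) -> float:
    """Fraction of perturbed re-runs that recover the unperturbed program."""
    rng = random.Random(seed)
    reference = synthesize(data)
    same = 0
    for _ in range(trials):
        perturbed = [(x + rng.uniform(-noise, noise), y) for x, y in data]
        if synthesize(perturbed) == reference:
            same += 1
    return same / trials

data = [(0.0, 0), (0.2, 0), (0.8, 1), (1.0, 1)]
score = reproducibility(data, noise=0.05, trials=100)
print(score)  # close to 1.0 for small noise
```

Even this simple sketch exposes the open questions noted above: the choice of perturbation model, of program-equality criterion, and of noise scale all materially change the score.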
The hardware requirements for learning neural representations and symbolic functions differ substantially:
neural network training is GPU-intensive, while program synthesis is CPU-intensive. This increases
the cost of computation and imposes a barrier to entry for aspiring NP researchers. NP benchmarks
need to take the efficiency and performance of training and inference into account.
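The interplay between the two compute profiles can be illustrated with a toy loop that alternates CPU-bound enumeration over a small grammar with gradient-style parameter fitting. In a real NP system the fitting step would be GPU-based neural training; here it is a stdlib stand-in so the sketch stays self-contained, and all names are illustrative.

```python
# Toy NP loop: CPU-bound enumeration of program skeletons, each fitted by a
# gradient-style inner loop (stand-in for GPU-based neural training).
from typing import Callable, List, Tuple

Data = List[Tuple[float, float]]  # (input, target) pairs

# Symbolic search space: candidate program skeletons over one parameter.
SKELETONS = {
    "linear": lambda theta: (lambda x: theta * x),
    "quadratic": lambda theta: (lambda x: theta * x * x),
}

def fit_theta(make: Callable, data: Data,
              steps: int = 200, lr: float = 0.01) -> float:
    """Stand-in for the neural phase: fit theta by finite-difference descent."""
    theta, eps = 0.0, 1e-4
    loss = lambda t: sum((make(t)(x) - y) ** 2 for x, y in data)
    for _ in range(steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

def synthesize_program(data: Data) -> Tuple[str, float]:
    """Pick the skeleton whose fitted instance has the lowest squared error."""
    best = None
    for name, make in SKELETONS.items():   # CPU-bound symbolic enumeration
        theta = fit_theta(make, data)      # gradient-style fitting
        err = sum((make(theta)(x) - y) ** 2 for x, y in data)
        if best is None or err < best[0]:
            best = (err, name, theta)
    return best[1], best[2]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x
name, theta = synthesize_program(data)
print(name, round(theta, 2))  # linear 2.0
```

The outer enumeration parallelizes naturally across CPU cores, while each inner fit wants accelerator throughput, which is precisely the mixed hardware profile a benchmark's cost accounting would need to capture.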
4 Conclusion
Neurosymbolic programming offers clear promise to accelerate scientific discovery. Its benefits
stem from its ability to incorporate prior knowledge and from the symbolic nature of its solutions. This
approach is compatible with scientific workflows and thus has the potential to optimize scientific
discovery end-to-end. However, challenges remain in the scalability and optimization stability of
these approaches, in comprehensive evaluation, and in deployment in the form of tools. In this paper, we
have demonstrated the opportunities and challenges of neurosymbolic programming in a concrete
scientific application, behavior analysis. A key promise of neurosymbolic programming is to provide
a set of unifying principles that connect and generalize solutions in interpretable machine learning
and prior scientific literature. We invite the science and computer science communities to adopt these
methods in their scientific workflow and to contribute to the research to advance NP techniques for
science.
Funding
This project was supported by the National Science Foundation under Grant No. 1918839,
"Understanding the World Through Code" (http://www.neurosymbolic.org/).
References
David J Anderson and Pietro Perona. Toward a science of computational ethology. Neuron, 84(1):
18–31, 2014.
Swarat Chaudhuri and Armando Solar-Lezama. Smooth interpretation. ACM Sigplan Notices, 45(6):
279–291, 2010.
Swarat Chaudhuri, Kevin Ellis, Oleksandr Polozov, Rishabh Singh, Armando Solar-Lezama, Yisong
Yue, et al. Neurosymbolic programming. Foundations and Trends® in Programming Languages, 7
(3):158–243, 2021.
Miles Cranmer. PySR: Fast & parallelized symbolic regression in Python/Julia, 2020.
Miles Cranmer, Alvaro Sanchez Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel,
and Shirley Ho. Discovering Symbolic Models from Deep Learning with Inductive Biases. In
Advances in Neural Information Processing Systems, volume 33, pages 17429–17442. Curran
Associates, Inc., 2020.
Guofeng Cui and He Zhu. Differentiable synthesis of program architectures. Advances in Neural
Information Processing Systems, 34:11123–11135, 2021.
Sandeep Robert Datta, David J Anderson, Kristin Branson, Pietro Perona, and Andrew Leifer.
Computational neuroethology: a call to action. Neuron, 104(1):11–24, 2019.
Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and
Pushmeet Kohli. RobustFill: Neural Program Learning under Noisy I/O. In Proceedings of the
34th International Conference on Machine Learning, pages 990–998. PMLR, July 2017. ISSN:
2640-3498.
Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning.
arXiv preprint arXiv:1702.08608, 2017.
Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc
Cary, Armando Solar-Lezama, and Joshua B. Tenenbaum. DreamCoder: bootstrapping inductive
program synthesis with wake-sleep library learning. In Proceedings of the 42nd ACM SIGPLAN
International Conference on Programming Language Design and Implementation, PLDI 2021,
pages 835–850, New York, NY, USA, June 2021. Association for Computing Machinery. ISBN
978-1-4503-8391-2.
Kevin Ellis, Adam Albright, Armando Solar-Lezama, Joshua B Tenenbaum, and Timothy J O’Donnell.
Synthesizing theories of human language with bayesian program induction. Nature communica-
tions, 13(1):1–13, 2022.
Eyrun Eyjolfsdottir, Steve Branson, Xavier P. Burgos-Artizzu, Eric D. Hoopfer, Jonathan Schor,
David J. Anderson, and Pietro Perona. Detecting Social Actions of Fruit Flies. In David Fleet,
Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, Lecture
Notes in Computer Science, pages 772–787, Cham, 2014. Springer International Publishing. ISBN
978-3-319-10605-2. doi: 10.1007/978-3-319-10605-2_50.
Sumit Gulwani. Automating String Processing in Spreadsheets using Input-Output Examples. In
POPL ’11, January 26–28, 2011, Austin, Texas, USA, January 2011.
Minghao Guo, Veronika Thost, Beichen Li, Payel Das, Jie Chen, and Wojciech Matusik. Data-
efficient graph grammar learning for molecular generation. In International Conference on
Learning Representations, 2021.
Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan.
Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293,
2018.
Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Bal-
subramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A
benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning,
pages 5637–5664. PMLR, 2021.
Ali Sinan Koksal, Yewen Pu, Saurabh Srivastava, Rastislav Bodik, Jasmin Fisher, and Nir Piterman.
Synthesis of biological models from mutation experiments. In Proceedings of the 40th annual
ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 469–482,
2013.
Tessa A Lau and Daniel S Weld. Programming by demonstration: An inductive learning formulation.
In Proceedings of the 4th international conference on Intelligent user interfaces, pages 145–152,
1998.
Xubo Leng, Margot Wohl, Kenichi Ishii, Pavan Nayak, and Kenta Asahina. Quantifying influence
of human choice on the automated detection of Drosophila behavior by a supervised machine
learning algorithm. PLoS ONE, 15(12):e0241696, December 2020. ISSN 1932-6203. doi:
10.1371/journal.pone.0241696.
Alexander Mathis, Pranav Mamidanna, Kevin M. Cury, Taiga Abe, Venkatesh N. Murthy, Macken-
zie W. Mathis, and Matthias Bethge. Deeplabcut: markerless pose estimation of user-defined body
parts with deep learning. Nature Neuroscience, 2018.
Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Demis Hassabis, Been Kim,
Ulrich Paquet, and Vladimir Kramnik. Acquisition of chess knowledge in alphazero. arXiv preprint
arXiv:2111.09259, 2021.
Kevin P. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023.
Talmo D Pereira, Joshua W Shaevitz, and Mala Murthy. Quantifying behavior to understand the
brain. Nature neuroscience, 23(12):1537–1549, 2020.
Talmo D Pereira, Nathaniel Tabris, Arie Matsliah, David M Turner, Junyu Li, Shruthi Ravindranath,
Eleni S Papadoyannis, Edna Normand, David S Deutsch, Z. Yan Wang, Grace C McKenzie-Smith,
Catalin C Mitelut, Marielisa Diez Castro, John D’Uva, Mikhail Kislin, Dan H Sanes, Sarah D
Kocher, Samuel S.-H. Wang, Annegret L Falkner, Joshua W Shaevitz, and Mala Murthy. SLEAP: A deep
learning system for multi-animal pose tracking. Nature Methods, 19(4), 2022.
Cristina Segalin, Jalani Williams, Tomomi Karigo, May Hui, Moriel Zelikowsky, Jennifer J Sun,
Pietro Perona, David J Anderson, and Ann Kennedy. The mouse action recognition system (mars)
software pipeline for automated analysis of social behaviors in mice. Elife, 10:e63720, 2021.
Ameesh Shah, Eric Zhan, Jennifer Sun, Abhinav Verma, Yisong Yue, and Swarat Chaudhuri. Learning
Differentiable Programs with Admissible Neural Heuristics. In Advances in Neural Information
Processing Systems, volume 33, pages 4940–4952. Curran Associates, Inc., 2020.
Jennifer J. Sun, Ann Kennedy, Eric Zhan, David J. Anderson, Yisong Yue, and Pietro Perona. Task
Programming: Learning Data Efficient Behavior Representations. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 2876–2885, 2021.
Megan Tjandrasuwita, Jennifer J Sun, Ann Kennedy, Swarat Chaudhuri, and Yisong Yue. Interpreting
expert annotation differences in animal behavior. CV4Animals Workshop at CVPR, 2021.
Albert Tseng, Jennifer J Sun, and Yisong Yue. Automatic synthesis of diverse weak supervision
sources for behavior analysis. Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2022.
Berk Ustun and Cynthia Rudin. Optimized risk scores. In Proceedings of the 23rd ACM SIGKDD
international conference on knowledge discovery and data mining, pages 1125–1134, 2017.
Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, and Swarat Chaudhuri. Houdini:
Lifelong learning as program synthesis. Advances in Neural Information Processing Systems, 31,
2018.
Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri.
Programmatically interpretable reinforcement learning. In International Conference on Machine
Learning, pages 5045–5054. PMLR, 2018.
Catherine Wong, Kevin M Ellis, Joshua Tenenbaum, and Jacob Andreas. Leveraging language
to learn program abstractions and search heuristics. In International Conference on Machine
Learning, pages 11193–11204. PMLR, 2021.
Eric Zhan, Jennifer J Sun, Ann Kennedy, Yisong Yue, and Swarat Chaudhuri. Unsupervised learning
of neurosymbolic encoders. arXiv preprint arXiv:2107.13132, 2021.