SIMULATION INTELLIGENCE: TOWARDS A NEW GENERATION
OF SCIENTIFIC METHODS
Alexander Lavin (Institute for Simulation Intelligence), Hector Zenil (Alan Turing Institute), Brooks Paige (Alan Turing Institute), David Krakauer (Santa Fe Institute), Justin Gottschlich (Intel Labs), Tim Mattson (Intel), Anima Anandkumar (Nvidia), Sanjay Choudry (Nvidia), Kamil Rocki (Neuralink), Atılım Güneş Baydin (University of Oxford), Carina Prunkl (University of Oxford), Olexandr Isayev (Carnegie Mellon University), Erik Peterson (Carnegie Mellon University), Peter L. McMahon (Cornell University), Jakob H. Macke (University of Tübingen), Kyle Cranmer (New York University), Jiaxin Zhang (Oak Ridge National Lab), Haruko Wainwright (Lawrence Berkeley National Lab), Adi Hanuka (SLAC National Accelerator Lab), Samuel Assefa (US Bank AI Innovation), Stephan Zheng (Salesforce Research), Manuela Veloso (JPM AI Research), Avi Pfeffer (Charles River Analytics)
ABSTRACT
The original “Seven Motifs” set forth a roadmap of essential methods for the field of scientific computing, where a motif is an algorithmic method that captures a pattern of computation and data movement.¹
We present the Nine Motifs of Simulation Intelligence, a roadmap for the development
and integration of the essential algorithms necessary for a merger of scientific computing, scientific
simulation, and artificial intelligence. We call this merger simulation intelligence (SI), for short. We
argue the motifs of simulation intelligence are interconnected and interdependent, much like the
components within the layers of an operating system. Using this metaphor, we explore the nature of
each layer of the simulation intelligence “operating system” stack (SI-stack) and the motifs therein:
1. Multi-physics and multi-scale modeling
2. Surrogate modeling and emulation
3. Simulation-based inference
4. Causal modeling and inference
5. Agent-based modeling
6. Probabilistic programming
7. Differentiable programming
8. Open-ended optimization
9. Machine programming
We believe coordinated efforts between motifs offer immense opportunity to accelerate scientific
discovery, from solving inverse problems in synthetic biology and climate science, to directing nuclear
energy experiments and predicting emergent behavior in socioeconomic settings. We elaborate on each
layer of the SI-stack, detailing the state-of-art methods, presenting examples to highlight challenges
and opportunities, and advocating for specific ways to advance the motifs and the synergies from
their combinations. Advancing and integrating these technologies can enable a robust and efficient
hypothesis–simulation–analysis type of scientific method, which we introduce with several use-cases
for human-machine teaming and automated science.
Keywords: Simulation; Artificial Intelligence; Machine Learning; Scientific Computing; Physics-infused ML; Inverse Design; Human-Machine Teaming; Optimization; Causality; Complexity; Open-endedness
lavin@simulation.science (ISI & Pasteur Labs)
¹We eschew the original term “dwarf” for the more appropriate “motif” in this paper, and encourage the field to follow suit.
Preprint. Under review.
arXiv:2112.03235v1 [cs.AI] 6 Dec 2021
Contents

Introduction
Simulation Intelligence Motifs
The Modules
1. MULTI-PHYSICS & MULTI-SCALE MODELING
2. SURROGATE MODELING & EMULATION
3. SIMULATION-BASED INFERENCE
4. CAUSAL REASONING
5. AGENT-BASED MODELING
The Engine
6. PROBABILISTIC PROGRAMMING
7. DIFFERENTIABLE PROGRAMMING
The Frontier
8. OPEN-ENDED OPTIMIZATION
9. MACHINE PROGRAMMING
Simulation Intelligence Themes
INVERSE-PROBLEM SOLVING
UNCERTAINTY REASONING
INTEGRATIONS
HUMAN-MACHINE TEAMING
Simulation Intelligence in Practice
Data-intensive science and computing
Accelerated computing
Domains for use-inspired research
Discussion
Honorable mention motifs
Conclusion
Introduction
Simulation has become an indispensable tool for researchers across the sciences to explore the behavior of complex, dynamic systems under varying conditions [1], including hypothetical or extreme conditions, and increasingly tipping points in environments such as climate [2, 3, 4], biology [5, 6], sociopolitics [7, 8], and others with significant
consequences. Yet there are challenges that limit the utility of simulators (and modeling tools broadly) in many settings.
First, despite advances in hardware to enable simulations to model increasingly complex systems, computational costs
severely limit the level of geometric details, complexity of physics, and the number of simulator runs. This can lead to
simplifying assumptions, which often render the results unusable for hypothesis testing and practical decision-making.
In addition, simulators are inherently biased as they simulate only what they are programmed to simulate; sensitivity
and uncertainty analyses are often impractical for expensive simulators; simulation code is composed of low-level
mechanistic components that are typically non-differentiable and lead to intractable likelihoods; and simulators can
rarely integrate with real-world data streams, let alone run online with live data updates.
Recent progress with artificial intelligence (AI) and machine learning (ML) in the sciences has advanced methods
towards several key objectives for AI/ML to be useful in sciences (beyond discovering patterns in high-dimensional
data). These advances allow us to import priors or domain knowledge into ML models and export knowledge from
learned models back to the scientific domain; leverage ML for numerically intractable simulation and optimization
problems, as well as maximize the utility of real-world data; generate myriads of synthetic data; quantify and reason
about uncertainties in models and data; and infer causal relationships in the data.
It is at the intersection of AI and simulation sciences where we can expect significant strides in scientific experimentation
and discovery, in essentially all domains. For instance, the use of neural networks to accelerate simulation software for
climate science [9], or multi-agent reinforcement learning and game theory towards economic policy simulations [10].
Yet this area is relatively nascent and disparate, and a unifying holistic perspective is needed to advance the intersection
of AI and simulation sciences.
This paper explores this perspective. We lay out the methodologies required to make significant strides in simulation and
AI for science, and how they must be fruitfully combined. The field of scientific computing was at a similar inflection
point when Phillip Colella in 2004 presented to DARPA the “Seven Dwarfs” for Scientific Computing, where each of
the seven represents an algorithmic method that captures a pattern of computation and data movement [
11
,
12
,
13
].
ii
For the remainder of this paper, we choose to replace a potentially insensitive term with “motif”, a change we suggest
for the field going forward.
The motifs nomenclature has proved useful for reasoning at a high level of abstraction about the behavior and requirements of these methods across a broad range of applications, while decoupling these from specific implementations. Even more, it is an understandable vocabulary for talking across disciplinary boundaries. Motifs also provide “anti-benchmarks”: not tied to narrow performance or code artifacts, thus encouraging innovation in algorithms, programming languages, data structures, and hardware [12]. Therefore the motifs of scientific computing provided an explicit roadmap for R&D efforts in numerical methods (and eventually parallel computing) in the sciences.
In this paper, we similarly define the Nine Motifs of Simulation Intelligence: classes of complementary algorithmic methods that represent the foundation for synergistic simulation and AI technologies to advance the sciences. Simulation intelligence (SI) describes a field that merges scientific computing, scientific simulation, and artificial intelligence towards studying processes and systems in silico to better understand and discover in situ phenomena. Each of the SI motifs has momentum from the scientific computing and AI communities, yet they must be pursued in concert and integrated in order to overcome the shortcomings of scientific simulators and enable new scientific workflows.

Unlike the older seven motifs of scientific computing, our SI motifs are not necessarily independent; many of these are interconnected and interdependent, much like the components within the layers of an operating system. The individual modules can be combined and interact in multiple ways, gaining from this combination. Using this metaphor, we explore the nature of each layer of the “SI stack”, the motifs within each layer, and the combinatorial possibilities available when they are brought together; the layers are illustrated in Fig. 1.
We begin by describing the core layers of the SI stack, detailing each motif within: the concepts, challenges, state-of-
the-art methods, future directions, ethical considerations, and many motivating examples. As we traverse the SI stack,
encountering the numerous modules and scientific workflows, we will ultimately be able to lay out how these advances
will benefit the many users of simulation and scientific endeavors. Our discussion continues to cover important SI
themes such as inverse problem solving and human-machine teaming, and essential infrastructure areas such as data
engineering and accelerated computing.
ⁱⁱThe 2004 “Seven Motifs for Scientific Computing”: Dense Linear Algebra, Sparse Linear Algebra, Computations on Structured Grids, Computations on Unstructured Grids, Spectral Methods, Particle Methods, and Monte Carlo [11].
Figure 1: An operating system (OS) diagram, elucidating the relationships of the nine Simulation Intelligence motifs, their close relatives, such as domain-specific languages (DSL) [14] and working memory [15], and subsequent SI-based science workflows. The red, purple, orange, and green plates represent the hardware, OS, application, and user layers, respectively (from bottom to top). In the applications layer (orange plate) are some main SI workflows; notice some entries have a cyan outline, signifying machine-science classes that contain multiple workflows in themselves (see text for details), and entries with a pink outline denote statistical and ML methods that have been used for decades, but here in unique SI ways. In the text we generically refer to this composition as the “SI stack”, although the mapping to an OS is a more precise representation of the SI layers and their interactions. We could have shown a simpler diagram with only the nine SI motifs, but the context of the broader stack that is enabled by their integration better elucidates the improved and new methods that can arise.
By pursuing research in each of these SI motifs, as well as ways of combining them together in generalizable software
towards specific applications in science and intelligence, the recent trend of decelerating progress may be reversed
[16, 17]. This paper aims to motivate the field and provide a roadmap for those who wish to work in AI and simulation
in pursuit of new scientific methods and frontiers.
Simulation Intelligence Motifs
Although we define nine concrete motifs, they are not necessarily independent; many of the motifs are synergistic in function and utility, and some may be building blocks underlying others. We start by describing the “module” motifs, as
in Fig. 1, followed by the underlying “engine” motifs: probabilistic and differentiable programming. We then describe
the motifs that aim to push the frontier of intelligent machines: open-endedness and machine programming.
The Modules
Above the “engine” in the proverbial SI stack (Fig. 1) are “modules”, each of which can be built in probabilistic and
differentiable programming frameworks, and make use of the accelerated computing building blocks in the hardware
layer. We start this section with several motifs that closely relate to the physics-informed learning topics most recently
discussed above, and then proceed through the SI stack, describing how and why the module motifs complement one another in myriad, synergistic ways.
1. MULTI-PHYSICS & MULTI-SCALE MODELING
Simulations are pervasive in every domain of science and engineering, yet are often done in isolation: climate simulations
of coastal erosion do not model human-driven effects such as urbanization and mining, and even so are only consistent
within a constrained region and timescale. Natural systems involve various types of physical phenomena operating at
different spatial and temporal scales. For simulations to be accurate and useful they must support multiple physics and
multiple scales (spatial and temporal). The same goes for AI & ML, which can be powerful for modeling multi-modality,
multi-fidelity scientific data, but machine learning alone based on data-driven relationships ignores the fundamental
laws of physics and can result in ill-posed problems or non-physical solutions. For ML (and AI-driven simulation) to be
accurate and reliable in the sciences, methods must integrate multi-scale, multi-physics data and uncover mechanisms
that explain the emergence of function.
Multi-physics
Complex real-world problems require solutions that span a multitude of physical phenomena, which
often can only be solved using simulation techniques that cross several engineering disciplines. Almost all practical
problems in fluid dynamics involve the interaction between a gas and/or liquid with a solid object, and include a range
of associated physics including heat transfer, particle transport, erosion, deposition, flow-induced-stress, combustion
and chemical reaction. A multi-physics environment is defined by coupled processes or systems involving more than one simultaneously occurring physical field or phenomenon. Such an environment is typically described by multiple partial differential equations (PDEs) that are tightly coupled, such that solving them presents significant challenges with nonlinearities and time-stepping. In general, the more interacting physics in a simulation, the more costly the
computation.
Multi-scale
Ubiquitous in science and engineering, cascades-of-scales involve more than two scales with long-range
spatio-temporal interactions (that often lack self-similarity and proper closure relations). In the context of biological
and behavioral sciences, for instance, multi-scale modeling applications range from the molecular, cellular, tissue, and
organ levels all the way to the population level; multi-scale modeling can enable researchers to probe biologically
relevant phenomena at smaller scales and seamlessly embed the relevant mechanisms at larger scales to predict the
emergent dynamics of the overall system [20]. Domains such as energy and synthetic biology require engineering
materials at the nanoscale, optimizing multi-scale processes and systems at the macroscale, and even the discovery
of new governing physico-chemical laws across scales. These scientific drivers call for a deeper, broader, and more
integrated understanding of common multi-scale phenomena and scaling cascades. In practice, multi-scale modeling
is burdened by computational inefficiency, especially with increasing complexity and scales; it is not uncommon to
encounter “hidden” or unknown physics of interfaces, inhomogeneities, symmetry-breaking and other singularities.
Methods for utilizing information across scales (and physics that vary across space and time) are often needed in
real-world settings, and with data from various sources: Multi-fidelity modeling aims to synergistically combine
abundant, inexpensive, low-fidelity data and sparse, expensive, high-fidelity data from experiments and simulations.
Multi-fidelity modeling is often useful in building efficient and robust surrogate models (which we detail in the surrogate
motif section later) [21]; some examples include simulating the mixed convection flow past a cylinder [22] and cardiac electrophysiology [23].
In computational fluid dynamics (CFD), classical methods for multi-physics and multi-scale simulation (such as finite elements and pseudo-spectral methods) are only accurate if flow features are all smooth, and thus meshes must resolve the smallest features. Consequently, direct numerical simulation for real-world systems such as climate and jet physics is impossible. It is common to use smoothed versions of the Navier–Stokes equations to allow coarser meshes while sacrificing accuracy. Although successful in the design of engines and turbo-machinery, there are severe limits to what can be accurately and reliably simulated; the resolution-efficiency tradeoff imposes a significant bottleneck. Methods for AI-driven acceleration and surrogate modeling could accelerate CFD and multi-physics multi-scale simulation by orders of magnitude. We discuss specific methods and examples below.
Physics-informed ML
The newer class of physics-informed machine learning methods integrate mathematical physics
models with data-driven learning. More specifically, making an ML method physics-informed amounts to introducing appropriate observational, inductive, or learning biases that can steer or constrain the learning process to physically consistent solutions [24]:

1. Observational biases can be introduced via data that allows an ML system to learn functions, vector fields, and operators that reflect the physical structure of the data.

2. Inductive biases are encoded as model-based structure that imposes prior assumptions or physical laws, making sure the physical constraints are strictly satisfied.

3. Learning biases force the training of an ML system to converge on solutions that adhere to the underlying physics, implemented by specific choices in loss functions, constraints, and inference algorithms.

Figure 2: Diagram of the various physics and scales to simulate for turbulent flow within a nuclear reactor core; top-to-bottom zooming in reveals finer and finer scales, each with different physics to model. AI-driven computational fluid dynamics (CFD) can radically improve the resolution and efficiency of such simulations [18, 19].
A common problem template involves extrapolating from an initial condition obtained from noisy experimental data, where a governing equation is known for describing at least some of the physics. An ML method would aim to predict the latent solution u(t, x) of a system at later times t > 0 and propagate the uncertainty due to noise in the initial data. A common use-case in scientific computing is reconstructing a flow field from scattered measurements (e.g., particle image velocimetry data), and using the governing Navier–Stokes equations to extrapolate this initial condition in time.
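To make the learning-bias route concrete, below is a minimal PINN sketch in JAX. This is our own illustration (not code from any cited work), using the 1D viscous Burgers equation u_t + u·u_x = ν·u_xx as an assumed example PDE: the network fits noisy initial-condition data while an autodiff-built residual penalty steers training toward physically consistent solutions, the same loss structure sketched in Fig. 3 below.

```python
import jax
import jax.numpy as jnp

def init_mlp(key, sizes=(2, 32, 32, 1)):
    """Initialize a small fully-connected network with inputs (t, x)."""
    params = []
    for m, n in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (m, n)) / jnp.sqrt(m), jnp.zeros(n)))
    return params

def u(params, t, x):
    """Scalar network approximation u_theta(t, x) of the latent solution."""
    h = jnp.array([t, x])
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b)[0]

nu = 0.01  # assumed viscosity for this toy problem

def residual(params, t, x):
    """PDE residual u_t + u*u_x - nu*u_xx, built entirely with autodiff."""
    u_t = jax.grad(u, argnums=1)(params, t, x)
    u_x = jax.grad(u, argnums=2)(params, t, x)
    u_xx = jax.grad(jax.grad(u, argnums=2), argnums=2)(params, t, x)
    return u_t + u(params, t, x) * u_x - nu * u_xx

def loss(params, t0, x0, u0, tc, xc):
    """Data misfit on noisy initial measurements + physics residual penalty."""
    data = jnp.mean((jax.vmap(u, (None, 0, 0))(params, t0, x0) - u0) ** 2)
    physics = jnp.mean(jax.vmap(residual, (None, 0, 0))(params, tc, xc) ** 2)
    return data + physics  # in practice the two terms carry tuned weights

grad_loss = jax.jit(jax.grad(loss))  # gradients for any first-order optimizer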
Figure 3: Diagram of a physics-informed neural network (PINN), where a fully-connected neural network (red), with time and space coordinates (t, x) as inputs, is used to approximate the multi-physics solutions û = [u, v, p, φ]. The derivatives of û with respect to the inputs are calculated using automatic differentiation (purple; autodiff is discussed later in the SI engine section) and then used to formulate the residuals of the governing equations in the loss function (green), which is generally composed of multiple terms weighted by different coefficients. By minimizing the physics-informed loss function (i.e., the right-to-left process) we simultaneously learn the parameters of the neural network θ and the unknown PDE parameters λ. (Figure reproduced from Ref. [25])
For fluid flow problems and many others, Gaussian processes (GPs) present a useful approach for capturing the physics of dynamical systems. A GP is a Bayesian nonparametric machine learning technique that provides a flexible prior distribution over functions, enjoys analytical tractability, defines kernels for encoding domain structure, and has a fully probabilistic workflow for principled uncertainty reasoning [26, 27]. For these reasons GPs are used widely in scientific modeling, with several recent methods more directly encoding physics into GP models: numerical GPs have covariance functions resulting from temporal discretization of time-dependent partial differential equations (PDEs) which describe the physics [28, 29], modified Matérn GPs can be defined to represent the solution to stochastic partial differential equations [30] and extend to Riemannian manifolds to fit more complex geometries [31], and the physics-informed basis-function GP derives a GP kernel directly from the physical model [32]; the latter method we elucidate in an experiment optimization example in the surrogate modeling motif.
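As a reference point for how a GP encodes structure through its kernel and yields calibrated uncertainty in closed form, here is a minimal GP regression sketch; this is our illustration, with a generic squared-exponential kernel standing in for the physics-derived kernels described above.

```python
import jax.numpy as jnp

def sqexp_kernel(X1, X2, length=0.5, amp=1.0):
    """Squared-exponential kernel on 1D inputs; physics-informed variants
    would replace this with a kernel derived from the governing equations."""
    d = X1[:, None] - X2[None, :]
    return amp**2 * jnp.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X, y, Xstar, noise=1e-2):
    """Exact GP posterior mean and pointwise variance at test inputs Xstar."""
    K = sqexp_kernel(X, X) + noise * jnp.eye(X.shape[0])
    L = jnp.linalg.cholesky(K)
    alpha = jnp.linalg.solve(L.T, jnp.linalg.solve(L, y))
    Ks = sqexp_kernel(X, Xstar)
    mean = Ks.T @ alpha
    v = jnp.linalg.solve(L, Ks)
    var = jnp.diag(sqexp_kernel(Xstar, Xstar)) - jnp.sum(v * v, axis=0)
    return mean, var
```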
With the recent wave of deep learning progress, research has developed methods for building the three physics-informed ML biases into nonlinear regression-based physics-informed networks. More specifically, there is increasing interest in physics-informed neural networks (PINNs): deep neural nets for building surrogates of physics-based models described by partial differential equations (PDEs) [33, 34]. Despite recent successes (with some examples explored below) [35, 36, 37, 38, 39, 40], PINN approaches are currently limited to tasks that are characterized by relatively simple and well-defined physics, and often require domain-expert craftsmanship. Even so, the characteristics of many physical systems are often poorly understood or hard to implicitly encode in a neural network architecture.
Probabilistic graphical models (PGMs) [41], on the other hand, are useful for encoding a priori structure, such as the dependencies among model variables, in order to maintain physically sound distributions. This is the intuition behind the promising direction of graph-informed neural networks (GINNs) for multi-scale physics [42]. There are two main components of this approach (shown in Fig. 7): first, embedding a PGM into the physics-based representation to encode complex dependencies among model variables that arise from domain-specific information and to enable the generation of physically sound distributions; second, identifying computational bottlenecks intrinsic to the underlying physics-based model from the embedded PGM, and replacing them with an efficient NN surrogate. The hybrid model thus encodes a domain-aware physics-based model that synthesizes stochastic and multi-scale modeling, while computational bottlenecks are replaced by a fast surrogate NN whose supervised learning and prediction are further informed by the PGM (e.g., through structured priors). With significant computational advantages from surrogate modeling within GINN, we further explore this approach in the surrogate modeling motif later.
Modeling and simulation of complex nonlinear multi-scale and multi-physics systems requires the inclusion and characterization of uncertainties and errors that enter at various stages of the computational workflow. Typically we are concerned with two main classes of uncertainties in ML: aleatoric and epistemic. The former quantifies system stochasticity such as observation and process noise, and the latter is model-based or subjective uncertainty due to limited data. In environments of multiple scales and multiple physics, one should consider an additional type of uncertainty due to the randomness of parameters of stochastic physical systems (often described by stochastic partial or ordinary differential equations (SPDEs, SODEs)). One can view this type of uncertainty as arising from the computation of a sufficiently well-posed deterministic problem, in contrast to the notion of epistemic or aleatoric uncertainty quantification. The field of probabilistic numerics [43] makes a similar distinction, where probabilistic modeling is used to reason about uncertainties that arise strictly from the lack of information inherent in the solution of intractable problems such as quadrature methods and other integration procedures. In general the probabilistic numerics viewpoint provides a principled way to manage the parameters of numerical procedures. We discuss more on probabilistic numerics and uncertainty reasoning later in the SI themes section. The GINN can quantify uncertainties with a high degree of statistical confidence, while Bayesian analogs of PINNs are a work in progress [44]; one cannot simply plug in MC-dropout or other deep learning uncertainty estimation methods. The various uncertainties may be quantified and mitigated with methods that can utilize data-driven learning to inform the original systems of differential equations. We define this class of physics-infused machine learning later in this section and in the next motif.
The synergies of mechanistic physics models and data-driven learning are brought to bear when physics-informed ML
is built with differentiable programming (one of the engine motifs), which we explore in the first example below.
Examples
Accelerated CFD via physics-informed surrogates and differentiable programming
Kochkov et al. [18] look to bring the advantages of semi-mechanistic modeling and differentiable programming to the challenge of complex computational fluid dynamics (CFD). The Navier–Stokes (NS) equations describe fluid dynamics well, yet in cases of multiple physics and complex dynamics, solving the equations at scale is severely limited by the computational cost of resolving the smallest spatiotemporal features. Approximation methods can alleviate this burden, but at the cost of accuracy. In Kochkov et al., the components of traditional fluids solvers most affected by the loss of resolution are replaced with better performing machine-learned alternatives (as presented in the semi-mechanistic modeling section later). This AI-driven solver algorithm is represented as a differentiable program with the neural networks and the numerical methods written in the JAX framework [45]. JAX is a leading framework for differentiable programming, with reverse-mode automatic differentiation that allows for end-to-end gradient-based optimization of the entire programmed algorithm. In this CFD use-case, the result is an algorithm that maintains accuracy while using 10x coarser resolution in each dimension, yielding an 80-fold improvement in computation time with respect to an advanced numerical method of similar accuracy.
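The core mechanic can be sketched as follows; this is our own toy illustration of the idea (not the authors' code), with an assumed diffusion-only physics term and a two-layer correction net. A coarse finite-difference step is augmented with a learned closure, and jax.grad differentiates through the entire unrolled rollout so the correction can be trained against a reference trajectory.

```python
import jax
import jax.numpy as jnp

dx, dt, nu = 0.1, 1e-3, 0.1  # assumed grid spacing, time step, viscosity

def step(params, u):
    """One coarse solver step: explicit diffusion plus a learned closure."""
    lap = (jnp.roll(u, -1) - 2.0 * u + jnp.roll(u, 1)) / dx**2  # periodic BCs
    correction = jnp.tanh(u @ params["W"]) @ params["V"]        # learned term
    return u + dt * (nu * lap + correction)

def rollout_loss(params, u0, reference):
    """Mean squared error of the full rollout against a reference trajectory."""
    def body(u, ref):
        u_next = step(params, u)
        return u_next, jnp.mean((u_next - ref) ** 2)
    _, errs = jax.lax.scan(body, u0, reference)
    return jnp.mean(errs)

# Gradients flow through every time step of the solver, end to end.
grad_fn = jax.jit(jax.grad(rollout_loss))
```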
Related approaches implement PINNs without the use of differentiable programming, such as TF-Net for modeling turbulent flows with several specially designed U-Net deep learning architectures [46]. A promising direction in this and other spatiotemporal use-cases is neural operator learning: using NNs to learn mesh-independent, resolution-invariant solution operators for PDEs. To achieve this, Li et al. [47] use a Fourier layer that implements a Fourier transform, then a linear transform, and an inverse Fourier transform for a convolution-like operation in a NN. In a Bayesian inverse experiment, the Fourier neural operator acting as a surrogate can draw MCMC samples from the posterior of initial NS vorticity given sparse, noisy observations in 2.5 minutes, compared to 18 hours for the traditional solver.
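A single such Fourier layer can be sketched as below; this is our paraphrase of the published formulation, restricted to 1D for brevity, and the parameter names ("R", "W") are ours.

```python
import jax
import jax.numpy as jnp

def fourier_layer(params, v):
    """One FNO-style layer: FFT -> learned mixing of low modes -> inverse FFT,
    plus a pointwise linear path. v has shape (n_grid, channels); params["R"]
    is a complex (n_modes, channels, channels) tensor and params["W"] a real
    (channels, channels) matrix."""
    v_hat = jnp.fft.rfft(v, axis=0)                    # to frequency space
    k = params["R"].shape[0]                           # retained low modes
    mixed = jnp.einsum("kio,ki->ko", params["R"], v_hat[:k])
    v_hat = v_hat.at[:k].set(mixed).at[k:].set(0.0)    # truncate high modes
    spectral = jnp.fft.irfft(v_hat, n=v.shape[0], axis=0)
    return jax.nn.gelu(spectral + v @ params["W"])     # pointwise path + nonlinearity
```

Because the learned weights act on Fourier modes rather than grid points, the same parameters can be applied at different discretizations, which is what makes the operator mesh-independent and resolution-invariant.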
Multi-physics and HPC simulation of blood flow in an intracranial aneurysm
SimNet is an AI-driven multi-physics simulation framework based on neural network solvers; more specifically, it approximates the solution to a PDE by a neural network [19]. SimNet improves on previous NN solvers to take on the challenge of gradients and discontinuities introduced by complex geometries or physics. The main novelties are the use of signed distance functions for loss weighting, and integral continuity planes for flow simulation. An intriguing real-world use case is simulating the flow inside a patient-specific geometry of an aneurysm, as shown in Fig. 4. It is particularly challenging to get the flow field to develop correctly, especially inside the aneurysm sac. Building SimNet to support multi-GPU and multi-node scaling provides the computational efficiency necessary for such complex geometries. There is also optimization over repetitive trainings, such as training for surrogate-based design optimization or uncertainty quantification, where transfer learning reduces the time to convergence for neural network solvers. Once a model is trained for a single geometry, the trained model parameters are transferred to solve a different geometry, without having to train on the new geometry from scratch. As shown in Fig. 4 (right), transfer learning accelerates the patient-specific intracranial aneurysm simulations.
Figure 4: SimNet simulation results for the aneurysm problem [19]. Left: patient-specific geometry of an intracranial aneurysm. Center: streamlines showing accurate flow field simulation inside the aneurysm sac. Right: transfer learning within the NN-based simulator accelerates the computation of patient-specific geometries.

Raissi et al. [48] similarly approach the problem of 3D physiologic blood flow in a patient-specific intracranial aneurysm, but implement a physics-informed NN technique: the Hidden Fluid Mechanics approach, which uses autodiff (i.e., within differentiable programming) to simultaneously exploit the information from the Navier–Stokes equations of fluid dynamics and the information from flow visualization snapshots. Continuing this work has high potential for robust and data-efficient simulation in physical and biomedical applications. Raissi et al. effectively solve this as an inverse problem using blood flow data, while SimNet approaches it as a forward problem without data. Nonetheless SimNet has potential for solving inverse problems as well (within the inverse design workflow of Fig. 33, for example).
Figure 5: A cascade of spatial and temporal scales governing key biophysical mechanisms in brain blood flow requires effective multi-scale modeling and simulation techniques (inspired by [49]). Listed are the standard modeling approaches for each spatial and temporal regime, where an increase in computational demands is inevitable as one pursues an integrated resolution of interactions at finer and finer scales. SimNet [19], JAX MD [50], and related works with the motifs will help mitigate these computational demands and more seamlessly model dynamics across scales.
Learning quantum chemistry
Using modern algorithms and supercomputers, systems containing thousands of interacting ions and electrons can now be described using approximations to the physical laws that govern the world on the atomic scale, namely the Schrödinger equation [51]. Chemical simulation can allow for the properties of a compound to be anticipated (with reasonable accuracy) before synthesizing it in the laboratory. These computational chemistry applications range from catalyst development for greenhouse gas conversion, to materials discovery for energy harvesting and storage, to computer-assisted drug design [52]. The potential energy surface is the central quantity of interest in the modeling of molecules and materials, computed by approximations to the time-independent Schrödinger equation. Yet there is a computational bottleneck: high-level wave function methods have high accuracy, but are often too slow for use in many areas of chemical research. The standard approach today is density functional theory (DFT) [53], which can yield results close to chemical accuracy, often on the scale of minutes to hours of computational time. DFT has enabled the development of extensive databases that cover the calculated properties of known and hypothetical systems, including organic and inorganic crystals, single molecules, and metal alloys [54, 55, 56]. In high-throughput applications, commonly used methods include force-field (FF) and semi-empirical quantum mechanics (SEQM), with runtimes on the order of fractions of a second, but at the cost of reliability of the predictions [57]. ML methods can potentially provide accelerated solvers without loss of accuracy, but to be reliable in the chemistry space the model must capture the underlying physics, and well-curated training data covering the relevant chemical problems must be available; Butler et al. [51] provide a list of publicly accessible structure and property databases for molecules and solids. Symmetries in geometries or physics, i.e., equivariance, defined as the property of being independent of the choice of reference frame, can be exploited to this end. For instance, a message passing neural network [58] called OrbNet encodes a molecular system in graphs based on features from a low-cost quantum calculation, implementing symmetry-adapted atomic orbitals [57, 59]. The authors demonstrate the effectiveness of their equivariant approach on several organic and biological chemistry benchmarks, with accuracies on par with those of modern DFT functionals, providing a far more efficient drop-in replacement for DFT energy predictions. Additional approaches that constrain NNs to be symmetric under these geometric operations have been successfully applied in molecular design [60] and quantum chemistry [61].
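The message-passing pattern such models build on can be sketched generically as below; this is our illustration of the general scheme in [58], not OrbNet itself, and the parameter names are ours. Summing messages over neighbors makes the update invariant to the arbitrary ordering of the atoms, one of the symmetries discussed above.

```python
import jax.numpy as jnp

def mpnn_layer(params, h, adj):
    """One generic message-passing update. h: (n_atoms, d) node features;
    adj: (n_atoms, n_atoms) connectivity matrix."""
    messages = jnp.tanh(h @ params["Wm"])                    # per-atom messages
    aggregated = adj @ messages                              # sum from neighbors
    return jnp.tanh(jnp.concatenate([h, aggregated], axis=-1) @ params["Wu"])
```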
Future directions
In general, machine-learned models are ignorant of the fundamental laws of physics, and can result in ill-posed problems or non-physical solutions. This effect is exacerbated when modeling the interplay of multiple physics or cascades of scales. Despite the recent successes mentioned above, there is much work to be done for multi-scale and multi-physics problems. For instance, PINNs can struggle with high-frequency domains and frequency bias [62], which is particularly problematic in multi-scale problems [63]. Improved methods for learning multiple physics simultaneously are also needed, as the training can be prohibitively expensive; for example, a good approach with current tooling is training a model for each field separately and subsequently learning the coupled solutions through either a parallel or a serial architecture using supervised learning based on additional data for a specific multi-physics problem.
In order to approach these challenges in a collective and reproducible way, there is need to create open benchmarks for physics-informed ML, much like in other areas of the ML community such as computer vision and natural language processing. Yet producing quality benchmarks tailored for physics-informed ML can be more challenging:

1. To benchmark physics-informed ML methods, we additionally need the proper parameterized physical models to be explicitly included in the databases.

2. Many applications in physics and chemistry require full-field data, which cannot be obtained experimentally and/or call for significant compute.

3. Often different, problem-specific, physics-based evaluation methods are necessary, for example the metrics proposed in [64] for scoring physical consistency and in [65] for scoring spatiotemporal predictions.
An overarching benchmarking challenge, but also an advantageous constraint, is the multidisciplinary nature of physics-informed ML: there must be multiple benchmarks in multiple domains, rather than one benchmark to rule them all, which is a development bias that ImageNet has put on the computer vision field over the past decade. And because we have domain knowledge and numerical methods for the underlying data-generating mechanisms and processes in the physical and life sciences, we have an opportunity to robustly quantify the characteristics of datasets, to better ground the performances of various models and algorithms. This is in contrast to the common practice of naïve data gathering to compile massive benchmark datasets for deep learning, for example scraping the internet for videos to compose a benchmark dataset for human action recognition (deepmind.com/research/open-source/kinetics), where not only are the underlying statistics and causal factors a priori unknown, but the target variables and class labels are non-trivial to define and can lead to significant ethical issues, such as dataset biases that lead to model biases, which in some cases can propagate harmful assumptions and stereotypes.
Further, we propose to broaden the class of methods beyond physics-informed, to physics-infused machine learning. The former is unidirectional (physics providing constraints or other information to direct ML methods), whereas the latter is bidirectional, including approaches that can better synergize the two computational fields. For instance, for systems with partial information, methods in physics-infused ML can potentially complement known physical models to learn missing or misunderstood components of the systems. We specifically highlight one approach, named Universal Differential Equations [66], in the surrogate modeling motif next.
Physics-infused ML can enable many new simulation tools and use-cases because of this ability to integrate physical
models and data within a differentiable software paradigm. JAX and SimNet are nice examples, each enabling physics-
informed ML methods for many science problems and workflows. Consider, for instance, the JAX fluid dynamics
example above: another use-case in the same DP framework is JAX MD [50] for performing differentiable physics simulations with a focus on molecular dynamics.
Another exciting area of future multi-physics multi-scale development is inverse problem solving. We already mentioned the immense acceleration in the CFD example above with Fourier Neural Operators [47]. Beyond efficiency gains, physics-infused ML can take on applications with inverse and ill-posed problems which are either difficult or impossible to solve with conventional approaches, notably quantum chemistry: a recent approach called FermiNet [67] takes a dual physics-informed learning approach to solving the many-electron Schrödinger equation, where both inductive bias and learning bias are employed. The advantage of physics-infused ML here is eliminating extrapolation problems with the standard numerical approach, which is a common source of error in computational quantum chemistry. In other domains such as biology, biomedicine, and behavioral sciences, focus is shifting from solving forward problems based on sparse data towards solving inverse problems to explain large datasets [21]. The aim is to develop multi-scale simulations to infer the behavior of the system, provided access to massive amounts of observational data, while the governing equations and their parameters are not precisely known. We further detail the methods and importance of inverse problem solving with SI later in the Discussion section.
2. SURROGATE MODELING & EMULATION
A surrogate model is an approximation method that mimics the behavior of an expensive computation or process. For example, the design of an aircraft fuselage includes computationally intensive simulations with numerical optimizations that may take days to complete, making design space exploration, sensitivity analysis, and inverse modeling infeasible. In this case a computationally efficient surrogate model can be trained to represent the system, learning a mapping from simulator inputs to outputs. Earth system models (ESMs), for example, are extremely computationally expensive to run due to the large range of spatial and temporal scales and the large number of processes being modeled. ESM surrogates can be trained on a few selected samples of the full, expensive simulations using supervised machine learning tools. In this simulation context, the aim of surrogate modeling (or statistical emulation) is to replace simulator code with a machine learning model (i.e., emulator) such that running the ML model to infer the simulator outputs is more efficient than running the full simulator itself. An emulator is thus a model of a model: a statistical model of the simulator, which is itself a mechanistic model of the world.
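In its simplest form, emulation reduces to supervised regression on simulator input/output pairs. The sketch below is our illustration (expensive_simulator is a stand-in for real simulator code), training a small MLP surrogate with optax, a common JAX optimization library.

```python
import jax
import jax.numpy as jnp
import optax  # standard JAX optimization library

def expensive_simulator(theta):
    """Stand-in for costly simulator code mapping parameters to an output."""
    return jnp.sin(3.0 * theta).sum(-1, keepdims=True)

def init_mlp(key, sizes=(4, 64, 64, 1)):
    params = []
    for m, n in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (m, n)) / jnp.sqrt(m), jnp.zeros(n)))
    return params

def mlp(params, x):
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

key = jax.random.PRNGKey(0)
thetas = jax.random.uniform(key, (256, 4))   # a few hundred simulator runs
ys = expensive_simulator(thetas)             # the only expensive step

def loss(params):
    return jnp.mean((mlp(params, thetas) - ys) ** 2)

grad_loss = jax.jit(jax.grad(loss))
params = init_mlp(key)
opt = optax.adam(1e-3)
opt_state = opt.init(params)
for _ in range(2000):
    grads = grad_loss(params)
    updates, opt_state = opt.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
# The trained `mlp` now answers queries at a tiny fraction of simulator cost.
```

Active-learning variants choose the next simulator runs where the surrogate is most uncertain, which is where the GP surrogates discussed next are particularly natural.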
For surrogate modeling in the sciences, non-linear, nonparametric Gaussian processes (GPs) [26] are typically used because of their flexibility, interpretability, and accurate uncertainty estimates [70]. Although traditionally limited to smaller datasets because of the O(N³) computational cost of training (where N is the number of training data points), much work on reliable GP sparsification and approximation methods makes them viable for real-world use [71, 72, 73, 74].
Neural networks (NNs) can also be well-suited to the surrogate modeling task as function approximation machines: a feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation. The Universal Approximation Theorem demonstrates that sufficiently large NNs can approximate any nonlinear function with a finite set of parameters [75, 76]. Although sufficient to represent any function, a NN layer may be unfeasibly large such that it may fail to learn and generalize correctly [77]. Recent work has shown that a NN with an infinitely wide hidden layer converges to a GP, representing the normal distribution over the space of functions.
In Fig. 6 we show an example of how a NN surrogate can be used as an emulator that encapsulates either the entire simulator or a specific part of the simulator. In the former, training is relatively straightforward because the loss function only has neural networks, and the trained network can be used towards inverse problem solving. However, we now have a black-box simulator: there is no interpretability of the trained network, and we cannot utilize the mechanistic components (i.e., differential equations of the simulator) for scientific analyses. In the case of the partial surrogate we have several advantages: the surrogate's number of parameters is reduced and thus the network is more stable (similar logic holds for nonparametric GP surrogate models), and the simulator retains all structural knowledge and the ability to run numerical analysis. Yet the main challenge is that backpropagation of arbitrary scientific simulators is required, thus the significance of the differentiable programming motif we discussed earlier. The computational gain associated with the use of a hybrid surrogate-simulator cascades into a series of additional advantages, including the possibility of simulating more scenarios towards counterfactual reasoning and epistemic uncertainty estimates, decreasing grid sizes, or exploring finer-scale parameterizations [68, 69].

Figure 6: An example Earth system model (ESM) for the plastic cycle, with two variations of ML surrogates (purple screens). On the left, an ML surrogate learns the whole model; a variety of physics-infused ML methods can be applied, training is relatively straightforward because the loss function only has neural networks, and the trained network can be used towards inverse problem solving. However, we now have a black-box simulator: there is no interpretability of the trained network, and we cannot utilize the mechanistic components (i.e., differential equations of the simulator) for scientific analyses. On the right, two unknown portions of the model are learned by NN surrogates, while the remaining portions are represented by known mechanistic equations, possible with surrogate modeling approaches like UDE and GINN. This “partial surrogate” case has several advantages: the surrogate's number of parameters is reduced and thus the network is more stable (similar logic holds for nonparametric Gaussian process surrogate models), and the simulator retains all structural knowledge and the ability to run numerical analysis. The main challenge is that backpropagation of arbitrary scientific simulators is required, which we can address with the engine motif differentiable programming, producing learning gradients for arbitrary programs. The computational gain associated with the use of a hybrid surrogate-simulator cascades into a series of additional advantages, including the possibility of simulating more scenarios towards counterfactual reasoning and epistemic uncertainty estimates, decreasing grid sizes, or exploring finer-scale parameterizations [68, 69].
Semi-mechanistic modeling
It follows from the Universal Approximation Theorem that an NN can learn to approximate any sufficiently regular differential equation. The approach of recent neural-ODE methods [78] is to learn to approximate differential equations directly from data, but these can perform poorly when required to extrapolate [79]. More encouraging for emulation and scientific modeling, however, is to directly utilize mechanistic modeling simultaneously with NNs (or more generally, universal approximator models) in order to allow for arbitrary data-driven model extensions. The result is a semi-mechanistic approach, more specifically Universal Differential Equations (UDE) [66], where part of the differential equation contains a universal approximator model; we will generally assume an NN is used for UDE in this paper, but other options include a GP, Chebyshev expansion, or random forest.

UDE augments scientific models with machine-learnable structures for scientifically-based learning. This is very similar in motivation to the physics-informed neural nets (PINNs) we previously discussed, but the implementation in general has a key distinction: a PINN is a deep learning model that becomes physics-informed because of some added physics bias (observational, inductive, or learning), whereas a UDE is a differential equation with one or more mechanistic components replaced with a data-driven model; this distinction is why we earlier defined physics-infused ML as the bidirectional influence of physics and ML.
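A minimal UDE sketch follows; this is our illustration of the general formulation (not code from [66]), with an assumed known linear-decay term standing in for the mechanistic physics. Because the fixed-step RK4 solve is written in JAX, gradients flow through the integrator, so the unknown NN component can be trained directly from trajectory data.

```python
import jax
import jax.numpy as jnp

def rhs(params, u):
    """du/dt = known physics + NN-modeled unknown physics (a UDE)."""
    known = -0.5 * u                                                 # mechanistic term
    unknown = jnp.tanh(u @ params["W"] + params["b"]) @ params["V"]  # learned term
    return known + unknown

def rk4_step(params, u, dt=0.05):
    k1 = rhs(params, u)
    k2 = rhs(params, u + 0.5 * dt * k1)
    k3 = rhs(params, u + 0.5 * dt * k2)
    k4 = rhs(params, u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def trajectory_loss(params, u0, observed):
    """Squared error of the integrated trajectory against (T, dim) data."""
    def body(u, obs):
        u_next = rk4_step(params, u)
        return u_next, jnp.sum((u_next - obs) ** 2)
    _, errs = jax.lax.scan(body, u0, observed)
    return jnp.mean(errs)

grad_fn = jax.jit(jax.grad(trajectory_loss))  # trains only the NN component
```

In practice one would use an adaptive, differentiable ODE solver, but the fixed-step version keeps the sketch self-contained.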
Bayesian optimal experiment design
A key aspect that makes emulators useful in scientific endeavors is that they allow us to reason probabilistically about outer-loop decisions such as optimization [80] and data collection [81], and to use them to explain how uncertainty propagates in a system [82].
Numerous challenges in science and engineering can be framed as optimization tasks, including the maximization of reaction yields, the optimization of molecular and materials properties, and the fine-tuning of automated hardware protocols [83]. When we seek to optimize the parameters of a system with an expensive cost function f, we look to employ Bayesian optimization (BO) [80, 84, 85, 86] to efficiently explore the search space of solutions with a probabilistic surrogate model f̂ rather than experimenting with the real system. Gaussian process models are the most common surrogates due to their flexible, nonparametric behavior. Various strategies to explore-exploit the search space can be implemented with acquisition functions to efficiently guide the BO search by estimating the utility of evaluating f at a given point (or parameterization). Often in science and engineering settings this provides the domain experts with a few highly promising candidate solutions to then try on the real system, rather than searching the intractably large space of possibilities themselves; for example, generating novel molecules with optimized chemical properties [87, 88], materials design with expensive physics-based simulations [89], and design of aerospace engineering systems [90].
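The essential BO loop is small enough to sketch in full; this is our illustration on a 1D toy problem, where f_expensive stands in for the costly experiment and the GP hyperparameters are fixed rather than learned.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def kern(X1, X2, ell=0.2):
    return jnp.exp(-0.5 * ((X1[:, None] - X2[None, :]) / ell) ** 2)

def gp_fit_predict(X, y, Xs, noise=1e-4):
    """Posterior mean and std of a unit-variance GP surrogate."""
    L = jnp.linalg.cholesky(kern(X, X) + noise * jnp.eye(X.shape[0]))
    alpha = jnp.linalg.solve(L.T, jnp.linalg.solve(L, y))
    Ks = kern(X, Xs)
    mu = Ks.T @ alpha
    v = jnp.linalg.solve(L, Ks)
    sigma = jnp.sqrt(1.0 - jnp.sum(v * v, axis=0) + noise)
    return mu, sigma

def expected_improvement(mu, sigma, best):
    """Acquisition: expected gain over the best value observed so far."""
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def f_expensive(x):                 # placeholder for the real system
    return jnp.sin(10.0 * x) * x

X = jnp.array([0.15, 0.85])         # two initial evaluations
y = f_expensive(X)
candidates = jnp.linspace(0.0, 1.0, 256)
for _ in range(10):                 # fit surrogate, acquire, evaluate, repeat
    mu, sigma = gp_fit_predict(X, y, candidates)
    x_next = candidates[jnp.argmax(expected_improvement(mu, sigma, y.max()))]
    X = jnp.append(X, x_next)
    y = jnp.append(y, f_expensive(x_next))
# X, y now concentrate evaluations near the optimum of f_expensive.
```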
Similarly, scientists can utilize BO for designing experiments such that the outcomes will be as informative as possible about the underlying process. Bayesian optimal experiment design (BOED) is a powerful mathematical framework for tackling this problem [91, 92, 93, 94], and can be implemented across disciplines, from bioinformatics [95] to pharmacology [96] to physics [97] to psychology [98]. In addition to design, there are also control methods in experiment optimization, which we detail in the context of a particle physics example below.
Examples
Simulation-based online optimization of physical experiments
It is often necessary and challenging to design experiments such that outcomes will be as informative as possible about the underlying process, typically because experiments are costly or dangerous. Many applications such as nuclear fusion and particle acceleration call for online control and tuning of system parameters to deliver optimal performance levels, i.e., the control class of experiment design we introduced above.
In the case of particle accelerators, although physics models exist, there are often significant differences between the simulation and the real accelerator, so we must leverage real data for precise tuning. Yet we cannot rely on many runs with the real accelerator to tune the hundreds of machine parameters, and archived data does not suffice because there are often new machine configurations to try; a control or tuning algorithm must robustly find the optimum in a complex parameter space with high efficiency. With physics-infused ML, we can exploit well-verified mathematical models to learn approximate system dynamics from few data samples and thus optimize systems online and in silico. It follows that we can additionally look to optimize new systems without prior data.
For the online control of particle accelerators, Hanuka et al. [32] develop the physics-informed basis-function GP. To clarify what this model encompasses, we need to understand the several ways to build such a GP surrogate for BO of a physical system:

1. Data-informed GP, using real experimental data
2. Physics-informed GP, using simulated data
3. Basis-function GP, deriving a GP kernel directly from the physical model
4. Physics-informed basis-function GP, combining methods 2 and 3, as done in Hanuka et al.

The resulting physics-informed GP is more representative of the particle accelerator system, and performs faster in an online optimization task compared to routinely used optimizers (ML-based and otherwise). Additionally, the method presents a relatively simple way to construct the GP kernel, including correlations between devices; learning the kernel from simulated data instead of machine data is a form of kernel transfer learning, which can help with generalizability. Hanuka et al. interestingly point out that constructing the kernel from basis functions without using the likelihood function is a form of Gaussian process with likelihood-free inference, which is the regime of problems that simulation-based inference is designed for (i.e., the motif we discuss next).
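To illustrate method 3 in the list above (our reading, with an assumed quadratic basis; the actual basis in Hanuka et al. is derived from accelerator physics simulations): a set of basis functions φ induces the degenerate GP kernel k(x, x′) = φ(x)ᵀ Σ φ(x′), so the surrogate's correlation structure comes directly from the physical model rather than from a likelihood fit to machine data.

```python
import jax.numpy as jnp

def phi(X):
    """Assumed physics-derived basis on 2D inputs, e.g. a quadratic response
    around the expected optimum of two machine parameters."""
    x0, x1 = X[..., 0], X[..., 1]
    return jnp.stack([jnp.ones_like(x0), x0, x1, x0**2, x0 * x1, x1**2], axis=-1)

def basis_kernel(X1, X2, Sigma):
    """k(x, x') = phi(x)^T Sigma phi(x'): a valid GP covariance whose
    structure comes from the basis (and hence the physics), not from data."""
    return phi(X1) @ Sigma @ phi(X2).T
```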
This and similar physics-informed methods are emerging as a powerful strategy for in silico optimization of expensive scientific processes and machines, and further enable scientific discovery by means of autonomous experimentation. For instance, the recent Gemini [99] and Golem [83] molecular experiment optimization algorithms are purpose-built for automated science workflows with SI: using surrogate modeling techniques to proxy expensive chemistry experiments, their BO and uncertainty estimation methods are designed for robustness to common scientific measurement challenges such as input variability and noise, proxy measurements, and systematic biases. Similarly, Shirobokov et al. [100] propose a method for gradient-based optimization of black-box simulators using local generative surrogates that are trained in successive local neighborhoods of the parameter space during optimization, and demonstrate this technique in the optimization of the experimental design of the SHiP (Search for Hidden Particles) experiment proposed at CERN. These and other works of Alán Aspuru-Guzik et al. are good sources to follow in this area of automating science. There are potentially significant cause-effect implications to consider in these workflows, as we introduce in the causality motif later.
Multi-physics multi-scale surrogates for Earth systems emulation
The climate change situation is worsening in
accelerating fashion: the most recent decade (2010 to 2019) has been the costliest on record with the climate-driven
economic damage reaching $2.98 trillion-US, nearly double the decade 2000–2009 [
101
]. The urgency for climate
solutions motivates the need for modeling systems that are computationally efficient and reliable, lightweight for
low-resource use-cases, informative towards policy- and decision-making, and cyber-physical with varieties of sensors
and data modalities. Further, models need to be integrated with, and workflows extended to, climate-dependent domains
such as energy generation and distribution, agriculture, water and disaster management, and socioeconomics. To this
end, we and many others have been working broadly on Digital Twin Earth (DTE), a catalogue of ML and simulation
methods, datasets, pipelines, and tools for Earth systems researchers and decision-makers. In general, a digital twin
is a computer representation of a real-world process or system from large aircraft to individual organs. We define
digital twin in the more precise sense of simulating the real physics and data-generating processes of an environment or
system, with sufficient fidelity such that one can reliably run queries and experiments in silico.
Some of the main ML-related challenges for DTE include integrating simulations of multiple domains, geographies,
and fidelities; not to mention the need to integrate real and synthetic data, as well as data from multiple modalities
(such as fusing Earth observation imagery with on-the-ground sensor streams). SI methods play important roles in the
DTE catalogue, notably the power of machine learned surrogates to accelerate existing climate simulators. Here we
highlight one example for enabling lightweight, real-time simulation of coastal environments: Existing simulators for
coastal storm surge and flooding are physics-based numerical models that can be extremely computationally expensive.
Thus the simulators cannot be used for real-time predictions with high resolution, are unable to quantify uncertainties,
and require significant computational infrastructure overhead only available to top national labs. To this end, Jiang et al. [102] developed physics-infused ML surrogates to emulate several of the main coastal simulators used worldwide, NEMO [103] and CoSMoS [104]. Variations of the Fourier Neural Operator (FNO) [47] (introduced in the multi-physics motif) were implemented to produce upwards of 100x computational speedup on comparable hardware.
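For intuition on why FNO-style emulators are so fast on regular grids, here is a minimal sketch of a single 1D spectral-convolution layer, the core FNO building block; this is not the exact multi-layer architecture used in the coastal emulators, and all names are illustrative.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """One Fourier layer: FFT -> learned complex weights on low modes -> inverse FFT."""
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weights = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

    def forward(self, x):            # x: (batch, channels, grid)
        x_ft = torch.fft.rfft(x)     # to Fourier space
        out_ft = torch.zeros_like(x_ft)
        # Multiply only the lowest Fourier modes by learned weights.
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weights)
        return torch.fft.irfft(out_ft, n=x.size(-1))  # back to grid space
```

Because the layer acts in frequency space, it is resolution-independent on regular grids, which is exactly why irregular, multi-resolution grids (as below) require a tailored preprocessing pipeline.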
This use-case exemplified a particularly thorny data preprocessing challenge that is commonplace when working with spatiotemporal simulators and Digital Twin Earth: one of the coastal simulators to be emulated uses a standard grid-spaced representation for geospatial topology, which is readily computable with the DFT in the FNO model, but another coastal simulator uses highly irregular grids that differ largely in scale, and further stacks these grids at varying resolutions. A preprocessing pipeline to regrid and interpolate the data maps was developed, along with substitute Fourier transform methods. It is our experience that many applications in DTE call for tailored solutions such as this; as a community we are lacking shared standards and formats for scientific data and code.
Hybrid PGM and NN for efficient domain-aware scientific modeling
One can in general characterize probabilistic graphical models (PGM) [41] as structured models for encoding domain knowledge and constraints, contrasted with deep neural networks as data-driven function-approximators. The advantages of PGMs have been utilized widely in scientific ML [105, 106, 107, 108], as have the recent NN methods discussed throughout this paper. Graph-Informed Neural Networks (GINNs) [42] are a new approach to incorporating the best of both worlds: PGMs incorporate expert knowledge, available data, constraints, etc. with physics-based models such as systems of ODEs and PDEs, while computationally intensive nodes in this hybrid model are replaced by learned features as NN surrogates. GINNs are particularly suited to enhance the computational workflow for complex systems featuring intrinsic computational bottlenecks and intricate physical relations among variables. Hall et al. demonstrate GINNs for simulation-based decision-making in a multiscale model of electrical double-layer (EDL) supercapacitor dynamics. Downstream decision-making is afforded by robust and reliable sensitivity analysis (due to the probabilistic ML approach), while orders-of-magnitude gains in computational efficiency mean many hypotheses can be simulated and the predicted posteriors quantified.
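As a toy illustration of the GINN idea (replacing a computationally intensive node in a structured model with a learned surrogate), consider the following sketch; the `expensive_node` function is a hypothetical stand-in for, e.g., a homogenized PDE solve, and is not from Hall et al.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical expensive physics node inside a PGM, e.g. a homogenized
# PDE solve mapping two material parameters to a quantity of interest.
def expensive_node(z):
    return np.sin(z[:, 0]) * np.exp(-z[:, 1] ** 2)

# Train an NN surrogate on a modest design of simulator runs.
Z_train = rng.uniform(-2, 2, (500, 2))
surrogate = MLPRegressor((64, 64), max_iter=2000).fit(
    Z_train, expensive_node(Z_train))

# Downstream PGM query: push 10^5 prior samples through the cheap surrogate
# for Monte Carlo sensitivity analysis (infeasible with the expensive node).
Z = rng.normal(0, 1, (100_000, 2))
qoi = surrogate.predict(Z)
print(qoi.mean(), qoi.std())
```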
Auto-emulator design with neural architecture search
Kasim et al. look to recent advances in neural architecture search (NAS) to automatically design and train a NN as an efficient, high-fidelity emulator, as doing this manually can be time-consuming and require significant ML expertise. NAS methods aim to learn a network topology that can achieve the best performance on a certain task by searching over the space of possible NN architectures given a set of NN primitives; see Elsken et al. [109] for a thorough overview.
Figure 7: Graph-Informed Neural Networks (GINNs) [42] provide a computational advantage while maintaining the advantages of structured modeling with PGMs, by replacing computational bottlenecks with NN surrogates. Here a PGM encoding structured priors serves as input to both a Bayesian Network PDE (lower route) and a GINN (upper) for a homogenized model of ion diffusion in supercapacitors. A simple fully-connected NN is pictured, but in principle any architecture can work, for instance physics-informed methods that further enforce physical constraints.
The NAS results are promising for automated emulator construction: running on ten distinct scientific simulation cases, from fusion energy science [110, 111] to aerosol-climate [112] and oceanic [113] modeling, the approach reliably produces accurate NN-based emulators that run thousands to billions of times faster than the original simulations, while also outperforming other NAS-based emulation approaches as well as manual emulator design. For example, a global climate model (GCM) simulation tested normally takes about 1150 CPU-hours to run [112], yet the emulator speedup is a factor of 110 million in direct comparison, and over 2 billion with a GPU. Providing scientists with simulations on the order of seconds rather than days enables faster iteration of hypotheses and experiments, and potentially new experiments never before thought possible.
Also in this approach is a modified MC dropout method for estimating the predictive uncertainty of emulator outputs. Alternatively, we suggest pursuing Bayesian optimization-based NAS methods [114] for more principled uncertainty reasoning. For example, the former can flag when an emulator architecture is overconfident in its predictions, while the latter can do that and additionally use the uncertainty values to dynamically adjust training parameters and search strategies.
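Below is a minimal sketch of the MC dropout idea (not Kasim et al.'s exact modification): dropout is kept active at inference time, and the spread over repeated stochastic forward passes serves as a predictive uncertainty estimate. The architecture and names are illustrative.

```python
import torch
import torch.nn as nn

emulator = nn.Sequential(
    nn.Linear(4, 128), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(128, 1))

def mc_dropout_predict(model, x, n_samples=100):
    model.train()                 # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)   # predictive mean and uncertainty

x = torch.randn(1, 4)
mean, std = mc_dropout_predict(emulator, x)
# A large std can flag inputs where the emulator should defer to the full simulator.
```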
The motivations of Kasim et al. for emulator-based accelerated simulation are the same as those we declared above: enable rapid screening and idea testing, and real-time, prediction-based experimental control and optimization. The more we can optimize and automate the development and verification of emulators, the more efficiently scientists without ML expertise can iterate over simulation experiments, leading to more robust conclusions and more hypotheses to explore.
Deriving physical laws from data-driven surrogates
Simulating complex dynamical systems often relies on governing equations conventionally obtained from rigorous first principles such as conservation laws or knowledge-based phenomenological derivations. Although non-trivial to derive, these symbolic or mechanistic equations are interpretable and understandable for scientists and engineers. NN-based simulations, including surrogates, are not interpretable in this way and can thus be challenging to use, especially in the many cases where it is important that the scientist or engineer understand the causal, data-generating mechanisms.
Figure 8: Illustrating the UDE forward process (top), where mechanistic equations are used with real data to produce a
trained neural network (NN) model, followed by the inverse problem of recovering the governing equations in symbolic
form (bottom).
Recent ML-driven advances have led to approaches for sparse identification of nonlinear dynamics (SINDy) [115] to learn ODEs or PDEs from observational data. SINDy essentially selects dominant candidate functions from a high-dimensional nonlinear function space based on sparse regression to uncover ODEs that match the given data; one can think of SINDy as providing an ODE surrogate model. This exciting development has led to scientific applications from biological systems [116] to chemical processes [117] to active matter [118], as well as data-driven discovery of spatiotemporal systems governed by PDEs [119, 120]. Here we showcase two significant advances on SINDy utilizing methods described in the surrogate modeling motif and others (a minimal code sketch of the core SINDy regression follows the list):
1. Synergistic learning of a deep NN surrogate and governing PDEs from sparse and independent data. Chen et al. [121] present a novel physics-informed deep learning framework to discover governing PDEs of nonlinear spatiotemporal systems from scarce and noisy data, accounting for different initial/boundary conditions (IBCs). Their approach integrates the strengths of deep NNs for learning rich features, automatic differentiation for accurate and efficient derivative calculation, and ℓ0 sparse regression to tackle the fundamental limitation of existing methods that scale poorly with data noise and scarcity. The special network architecture design is able to account for multiple independent datasets sampled under different IBCs, shown with simple experiments that should still be validated on more complex datasets. An alternating direction optimization strategy simultaneously trains the NN on the spatiotemporal data and determines the optimal sparse coefficients of selected candidate terms for reconstructing the PDE(s): the NN provides accurate modeling of the solution and its derivatives as a basis for constructing the governing equation(s), while the sparsely represented PDE(s) in turn informs and constrains the DNN, which makes it generalizable and further enhances the discovery. The overall semi-mechanistic approach, combining bottom-up (data-driven) and top-down (physics-informed) processes, is promising for ML-driven scientific discovery.
2. Sparse identification of missing model terms via Universal Differential Equations. We earlier described several scenarios where an ML surrogate is trained for only part of the full simulator system, perhaps for the computationally inefficient or the unknown parts. This is also a use-case of the UDE: replacing parts of a simulator described by mechanistic equations with a data-driven NN surrogate model. Now consider we are at the end of the process of building a UDE, having learned and verified an approximation for part of the causal generative model (i.e. the simulator). Do we lose interpretability and analysis capabilities? With a knowledge-enhanced approach to the SINDy method we can sparse-identify the learned semi-mechanistic UDE back to mechanistic terms that are understandable and usable by domain scientists. Rackauckas et al. [66] modify the SINDy algorithm to apply to only subsets of the UDE equation in order to perform equation discovery specifically on the trained neural network components. In a sense this narrows the search space of potential governing equations by utilizing the prior mechanistic knowledge that wasn't replaced in training the UDE. Along with the UDE approach in general, this sparse identification method needs further development and validation with more complex datasets.
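As promised above, here is a minimal, self-contained sketch of the core SINDy procedure (sequentially-thresholded least squares over a candidate function library) on a toy damped oscillator; the library, threshold, and all names are illustrative, and real applications use the richer treatments in [115] or [121].

```python
import numpy as np

# Toy data: damped oscillator x'' = -0.1 x' - 2 x, as a first-order system.
def rhs(state):
    x, v = state
    return np.array([v, -0.1 * v - 2.0 * x])

dt, steps = 0.01, 5000
X = np.empty((steps, 2)); X[0] = [1.0, 0.0]
for t in range(steps - 1):                    # simple Euler integration
    X[t + 1] = X[t] + dt * rhs(X[t])
dXdt = np.gradient(X, dt, axis=0)             # numerical derivatives

# Candidate library Theta(X): [1, x, v, x^2, x*v, v^2]
x, v = X[:, 0], X[:, 1]
Theta = np.column_stack([np.ones_like(x), x, v, x**2, x*v, v**2])

# Sequentially-thresholded least squares (the core SINDy loop)
Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
for _ in range(10):
    Xi[np.abs(Xi) < 0.05] = 0                 # threshold small coefficients
    for k in range(2):                        # refit on surviving terms
        big = np.abs(Xi[:, k]) >= 0.05
        Xi[big, k] = np.linalg.lstsq(Theta[:, big], dXdt[:, k], rcond=None)[0]
print(Xi)  # expect ~[v] for dx/dt and ~[-2 x - 0.1 v] for dv/dt
```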
Future directions
The UDE approach has significant implications for use with simulators and physical modeling, where the underlying
mechanistic models are commonly differential equations. By directly utilizing mechanistic modeling simultaneously
with universal approximator models, UDE is a powerful semi-mechanistic approach allowing for arbitrary data-driven
model extensions. In the context of simulators, this means a synergistic model of domain expertise and real-world data
that more faithfully represents the true data-generating process of the system.
16
What we've described is a transformative approach for ML-augmented scientific modeling. That is:

1. The practitioner identifies the known parts of a model and builds a UDE; when using probabilistic programming (an SI engine motif), this step can be done at a high level of abstraction where the user does not need to write custom inference algorithms.
2. Train an NN (or other surrogate model such as a Gaussian process) to capture the missing mechanisms; one may look to NAS and Bayesian optimization approaches to do this in an automated, uncertainty-aware way (see the sketch after this list).
3. The missing terms can be sparse-identified into mechanistic terms; this is an active area of research and much verification of this concept is needed, as mentioned in the example above.
4. Verify the recovered mechanisms are scientifically sane; for future work, how can we better enable this with human-machine teaming?
5. Verify quantitatively: extrapolate, do asymptotic analysis, run posterior predictive checks, predict bifurcations.
6. Gather additional data to validate^iii the new terms.
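To ground steps 1 and 2, below is a minimal UDE sketch in PyTorch (our illustrative choice; the UDE work of Rackauckas et al. uses Julia's SciML stack): a known linear decay term is combined with an NN for the missing mechanism, trained by backpropagating through a simple Euler solver. The missing term (here 0.5·sin(x)) and all names are hypothetical.

```python
import torch
import torch.nn as nn

# UDE toy: dx/dt = -a*x + missing(x); a is the known mechanism, NN learns the rest.
a = 1.0
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def ude_rhs(x):
    return -a * x + net(x)

def rollout(x0, dt=0.05, steps=40):           # differentiable Euler solver
    xs, x = [x0], x0
    for _ in range(steps):
        x = x + dt * ude_rhs(x)
        xs.append(x)
    return torch.stack(xs)

# Synthetic "observations" from the true system dx/dt = -x + 0.5*sin(x)
with torch.no_grad():
    true = [torch.tensor([[2.0]])]
    for _ in range(40):
        x = true[-1]
        true.append(x + 0.05 * (-x + 0.5 * torch.sin(x)))
    target = torch.stack(true)

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for epoch in range(500):
    opt.zero_grad()
    loss = ((rollout(torch.tensor([[2.0]])) - target) ** 2).mean()
    loss.backward()                           # backprop through the solver
    opt.step()
# net now approximates the missing 0.5*sin(x) term; step 3 would sparse-identify
# it symbolically, e.g. with the SINDy sketch shown earlier.
```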
Providing the tools for this semi-mechanistic modeling workflow can be immensely valuable for enabling scientists to make the best use of domain knowledge and data, and is precisely what the SI stack can deliver, notably with the differentiable programming and probabilistic programming "engine" motifs. Even more, building in a unified framework that's purpose-built for SI provides extensibility, for instance to integrate recent graph neural network approaches from Cranmer et al. [123] that can recover the governing equations in symbolic form from learned physics-informed models, or the "AI Feynman" [124, 125] approaches based on traditional fitting techniques in coordination with neural networks that leverage physics properties such as symmetries and separability in the unknown dynamics function(s). Key features such as NN equivariance and normalizing flows (and how they may fit into the stack) are discussed later.
3. SIMULATION-BASED INFERENCE
Numerical simulators are used across many fields of science and engineering to build computational models of complex
phenomena. These simulators are typically built by incorporating scientific knowledge about the mechanisms which are
known (or assumed) to underlie the process under study. Such mechanistic models have often been extensively studied
and validated in the respective scientific domains. In complexity, they can range from extremely simple models that
have a conceptual or even pedagogical flavor (e.g. the Lotka-Volterra equations, expressed in Fig. 21, describing predator-prey interactions in ecological systems and also economic theories [126]) to extremely detailed and expensive simulations implemented on supercomputers, e.g. whole-brain simulations [127].
A common challenge across scientific disciplines and levels of model complexity is the question of how to link such simulation-based models with empirical data. Numerical simulators typically have some parameters whose exact values are not known a priori and have to be inferred from data. For reasons detailed below, classical statistical approaches cannot readily be applied to models defined by numerical simulators. The field of simulation-based inference (SBI) [1] aims to address this challenge by designing statistical inference procedures that can be applied to complex simulators. Building on foundational work from the statistics community (see [128] for an overview), SBI is starting to bring together work from multiple fields including, e.g., population genetics, neuroscience, particle physics, cosmology, and astrophysics, which are facing the same challenges and using tools from machine learning to address them. SBI can provide a unifying language, and common tools [129, 130, 131] and benchmarks [132] are being developed and generalized across different fields and applications.
Why is it so challenging to constrain numerical simulations by data? Many numerical simulators have stochastic components, which are included either to provide a verisimilar model of the system under study if it is believed to be stochastic itself, or often also pragmatically to reflect incomplete knowledge about some components of the system. Linking such stochastic models with data falls within the domain of statistics, which aims to provide methods for constraining the parameters of a model by data, approaches for selecting between different model candidates, and criteria for determining whether a hypothesis can be rejected on grounds of empirical evidence. In particular, statistical inference aims to determine which parameters and combinations of parameters are compatible with empirical data and (possibly) a priori assumptions. A key ingredient of most statistical procedures is the likelihood p(x|θ) of data x given parameters θ. For example, Bayesian inference characterizes parameters which are compatible both with data and prior by the posterior distribution p(θ|x), which is proportional to the product of likelihood and prior,
iii Note we use "verify" and "validate" deliberately, as there is an important difference between verification and validation (V&V): verification asks "are we building the solution right?" whereas validation asks "are we building the right solution?" [122].
Figure 9: The goal of simulation-based inference (SBI) is to algorithmically identify parameters of simulation-based models which are compatible with observed data and prior assumptions. SBI algorithms generally take three inputs (left): a candidate mechanistic model (e.g. a biophysical neuron model), prior knowledge or constraints on model parameters, and observational data (or summary statistics thereof). The general process shown is to (1) sample parameters from the prior followed by simulating synthetic data from these parameters; (2) learn the (probabilistic) association between data (or data features) and underlying parameters (i.e., to learn statistical inference from simulated data), for which different SBI methods (discussed in the text) such as neural density estimation [133] can be used; (3) apply the learned model to empirical data to derive the full space of parameters consistent with the data and the prior, i.e. the posterior distribution. Posterior distributions may have complex shapes (such as multiple modes), and different parameter configurations may lead to data-consistent simulations. If needed, (4) an initial estimate of the posterior can be used to adaptively generate additional informative simulations. (Illustration from [133])
p(θ|x) ∝ p(x|θ) p(θ). Frequentist inference procedures typically construct confidence regions based on hypothesis tests, often using the likelihood ratio as the test statistic.
However, for many simulation-based models, one can easily sample from the model (i.e., generate synthetic data x ∼ p(x|θ)), but evaluating the associated likelihoods can be computationally prohibitive because, for instance, the same output x could result from a very large number of internal paths through the simulator, and integrating over all of them is prohibitive. More pragmatically, it might also be the case that the simulator is implemented in a "black-box" manner which does not provide access to its internal workings or states. If likelihoods cannot be evaluated, most conventional inference approaches cannot be used. The goal of simulation-based inference is to make statistical inference possible for so-called implicit models, which allow generating simulated data but not evaluation of likelihoods.
SBI is not a new idea. Simulation-based inference approaches have been studied extensively in statistics, typically under the heading of likelihood-free inference. An influential approach has been that of Approximate Bayesian Computation (ABC) [128, 134, 135]. In its simplest form it consists of drawing parameter values from a proposal distribution, running the simulator for these parameters to generate synthetic outputs x ∼ p(x|θ), comparing these outputs against the observed data, and accepting the parameter values only if they are close to the observed data under some distance metric, ‖x − x_observed‖ < ε. After following this procedure repeatedly, the accepted samples approximately follow the posterior. A second class of methods approximates the likelihood by sampling from the simulator and estimating the density in the sample space with kernel density estimation or histograms. This approximate density can then be used in lieu of the exact likelihood in frequentist or Bayesian inference techniques [136].
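The simplest form of ABC described above fits in a few lines; below is a minimal sketch on a toy Gaussian problem where the exact posterior is known (the summary statistic, tolerance, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulator: n Gaussian samples with unknown mean theta, unit variance.
def simulator(theta, n=100):
    return rng.normal(theta, 1.0, size=n)

x_obs = simulator(2.0)                       # "observed" data (true theta = 2)
summary = lambda x: np.array([x.mean()])     # summary statistic

# Rejection ABC: keep prior draws whose simulated summaries land near the data.
eps, accepted = 0.1, []
while len(accepted) < 1000:
    theta = rng.uniform(-5, 5)               # draw from a uniform prior
    x_sim = simulator(theta)
    if np.linalg.norm(summary(x_sim) - summary(x_obs)) < eps:
        accepted.append(theta)

posterior = np.array(accepted)
# Close to the analytic posterior N(mean(x_obs), 1/100), slightly inflated by eps.
print(posterior.mean(), posterior.std())
```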
Both of these methods enable approximate inference in the likelihood-free setting, but they suffer from certain shortcomings: in the limit of a strict ABC acceptance criterion (ε → 0) or small kernel size, the inference results become exact, but the sample efficiency is reduced (the simulation has to be run many times). Relaxing the acceptance criterion or increasing the kernel size improves the sample efficiency, but reduces the quality of the inference results. The main challenge, however, is that these methods do not scale well to high-dimensional data, as the number of required simulations grows approximately exponentially with the dimension of the data x.
Figure 10: Various simulation-based inference workflows (or prototypes) are presented in Cranmer et al. [1]. Here we show four main workflows (or templates) of simulation-based inference: the left represents Approximate Bayesian Computation (ABC) approaches, and to the right are three model-based approaches for approximating likelihoods, posteriors, and density ratios, respectively. Notice that all include algorithms that use the prior distribution to propose parameters (green), as well as algorithms for sequentially adapting the proposal (purple), i.e., steps (1) and (4) shown in Fig. 9. (Figure reproduced from Ref. [132])
In both approaches, the raw data is therefore usually first reduced to low-dimensional summary statistics. These are typically designed by domain experts with the goal of retaining as much information on the parameters θ as possible. In many cases, the summary statistics are not sufficient, and this dimensionality reduction limits the quality of inference or model selection [137]. Recently, new methods for learning summary statistics in a fully [138, 139] or semi-automatic manner [140] are emerging, which might alleviate some of these limitations.
The advent of deep learning has powered a number of new simulation-based inference techniques. Many of these methods rely on the key principle of training a neural surrogate for the simulation. Such models are closely related to the emulator models discussed in the previous section, but are not geared towards efficient sampling. Instead, we need to be able to access the surrogate's likelihood [141, 142] (or the related likelihood ratio [143, 144, 145, 146, 147, 148]) or the posterior [149, 150, 151, 152]. After the surrogate has been trained, it can be used during frequentist or Bayesian inference instead of the simulation. On a high level, this approach is similar to the traditional method based on histograms or kernel density estimators [136], but modern ML models and algorithms allow it to scale to higher-dimensional and potentially structured data.
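Below is a minimal sketch of the neural posterior surrogate idea, using a conditional Gaussian density estimator on a linear-Gaussian toy problem where the true posterior is known in closed form; practical SBI implementations use more expressive conditional density estimators such as normalizing flows [133], and all names here are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Simulator: x = theta + noise; prior: theta ~ N(0, 1).
def sample_joint(n):
    theta = torch.randn(n, 1)
    x = theta + 0.5 * torch.randn(n, 1)
    return theta, x

# Conditional Gaussian density estimator q(theta | x): outputs mean and log-std.
net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    theta, x = sample_joint(256)             # fresh simulations each step
    mu, log_std = net(x).chunk(2, dim=1)
    # Negative log-likelihood of theta under q(. | x), constants dropped.
    nll = (log_std + 0.5 * ((theta - mu) / log_std.exp()) ** 2).mean()
    opt.zero_grad(); nll.backward(); opt.step()

# Amortized posterior for an observation x_o = 1.0:
mu, log_std = net(torch.tensor([[1.0]])).chunk(2, dim=1)
# Analytic check: the true posterior here is N(0.8, variance 0.2).
```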
The impressive recent progress in SBI methods does not stem from deep learning alone. Another important theme is active learning: running the simulator and inference procedure iteratively and using past results to improve the proposal distribution of parameter values for the next runs [141, 142, 150, 151, 153, 154, 155, 156, 157, 158]. This can substantially improve the sample efficiency. Finally, in some cases simulators are not just black boxes, but we have access to (part of) their latent variables and mechanisms, or probabilistic characteristics of their stack trace. In practice, such information can be made available through domain-specific knowledge or by implementing the simulation in a framework that supports differentiable or probabilistic programming, i.e., the SI engine. If it is accessible, such data can substantially improve the sample efficiency with which neural surrogate models can be trained, reducing the required compute [159, 160, 161, 162, 163]. On a high level, this represents a tighter integration of the inference engine with the simulation [164].
These components (neural surrogates for the simulator, active learning, the integration of simulation and inference) can be combined in different ways to define workflows for simulation-based inference, both in the Bayesian and frequentist settings. We show some example inference workflows in Fig. 10. The optimal choice of workflow depends on the characteristics of the problem, in particular on the dimensionality and structure of the observed data and the parameters, whether a single data point or multiple i.i.d. draws are observed, the computational complexity of the simulator, and whether the simulator admits access to its latent process.
Examples
Simulation-based inference techniques have