
SIMULATION INTELLIGENCE: TOWARDS A NEW GENERATION OF SCIENTIFIC METHODS

Alexander Lavin∗ (Institute for Simulation Intelligence), Hector Zenil (Alan Turing Institute), Brooks Paige (Alan Turing Institute), David Krakauer (Santa Fe Institute), Justin Gottschlich (Intel Labs), Tim Mattson (Intel), Anima Anandkumar (Nvidia), Sanjay Choudry (Nvidia), Kamil Rocki (Neuralink), Atılım Güneş Baydin (University of Oxford), Carina Prunkl (University of Oxford), Olexandr Isayev (Carnegie Mellon University), Erik Peterson (Carnegie Mellon University), Peter L. McMahon (Cornell University), Jakob H. Macke (University of Tübingen), Kyle Cranmer (New York University), Jiaxin Zhang (Oak Ridge National Lab), Haruko Wainwright (Lawrence Berkeley National Lab), Adi Hanuka (SLAC National Accelerator Lab), Samuel Assefa (US Bank AI Innovation), Stephan Zheng (Salesforce Research), Manuela Veloso (JPM AI Research), Avi Pfeffer (Charles River Analytics)

ABSTRACT

The original "Seven Motifs" set forth a roadmap of essential methods for the field of scientific computing, where a motif is an algorithmic method that captures a pattern of computation and data movement.¹ We present the Nine Motifs of Simulation Intelligence, a roadmap for the development and integration of the essential algorithms necessary for a merger of scientific computing, scientific simulation, and artificial intelligence. We call this merger simulation intelligence (SI), for short. We argue the motifs of simulation intelligence are interconnected and interdependent, much like the components within the layers of an operating system. Using this metaphor, we explore the nature of each layer of the simulation intelligence "operating system" stack (SI-stack) and the motifs therein:

1. Multi-physics and multi-scale modeling
2. Surrogate modeling and emulation
3. Simulation-based inference
4. Causal modeling and inference
5. Agent-based modeling
6. Probabilistic programming
7. Differentiable programming
8. Open-ended optimization
9. Machine programming

We believe coordinated efforts between motifs offer immense opportunity to accelerate scientific discovery, from solving inverse problems in synthetic biology and climate science, to directing nuclear energy experiments and predicting emergent behavior in socioeconomic settings. We elaborate on each layer of the SI-stack, detailing the state-of-the-art methods, presenting examples to highlight challenges and opportunities, and advocating for specific ways to advance the motifs and the synergies from their combinations. Advancing and integrating these technologies can enable a robust and efficient hypothesis–simulation–analysis type of scientific method, which we introduce with several use-cases for human-machine teaming and automated science.

Keywords: Simulation; Artificial Intelligence; Machine Learning; Scientific Computing; Physics-infused ML; Inverse Design; Human-Machine Teaming; Optimization; Causality; Complexity; Open-endedness

∗lavin@simulation.science (ISI & Pasteur Labs)

¹We eschew the original term "dwarf" for the more appropriate "motif" in this paper, and encourage the field to follow suit.

Preprint. Under review.

arXiv:2112.03235v1 [cs.AI] 6 Dec 2021

Contents

Introduction 3
Simulation Intelligence Motifs 4
The Modules 5
1. MULTI-PHYSICS & MULTI-SCALE MODELING 5
2. SURROGATE MODELING & EMULATION 11
3. SIMULATION-BASED INFERENCE 17
4. CAUSAL REASONING 22
5. AGENT-BASED MODELING 28
The Engine 33
6. PROBABILISTIC PROGRAMMING 33
7. DIFFERENTIABLE PROGRAMMING 40
The Frontier 45
8. OPEN-ENDED OPTIMIZATION 45
9. MACHINE PROGRAMMING 48
Simulation Intelligence Themes 51
INVERSE-PROBLEM SOLVING 51
UNCERTAINTY REASONING 54
INTEGRATIONS 55
HUMAN-MACHINE TEAMING 63
Simulation Intelligence in Practice 64
Data-intensive science and computing 65
Accelerated computing 68
Domains for use-inspired research 70
Discussion 70
Honorable mention motifs 70
Conclusion 74

Introduction

Simulation has become an indispensable tool for researchers across the sciences to explore the behavior of complex, dynamic systems under varying conditions [1], including hypothetical or extreme conditions, and increasingly tipping points in environments such as climate [2, 3, 4], biology [5, 6], sociopolitics [7, 8], and others with significant consequences. Yet there are challenges that limit the utility of simulators (and modeling tools broadly) in many settings. First, despite advances in hardware to enable simulations to model increasingly complex systems, computational costs severely limit the level of geometric details, complexity of physics, and the number of simulator runs. This can lead to simplifying assumptions, which often render the results unusable for hypothesis testing and practical decision-making. In addition, simulators are inherently biased as they simulate only what they are programmed to simulate; sensitivity and uncertainty analyses are often impractical for expensive simulators; simulation code is composed of low-level mechanistic components that are typically non-differentiable and lead to intractable likelihoods; and simulators can rarely integrate with real-world data streams, let alone run online with live data updates.

Recent progress with artificial intelligence (AI) and machine learning (ML) in the sciences has advanced methods towards several key objectives for AI/ML to be useful in sciences (beyond discovering patterns in high-dimensional data). These advances allow us to import priors or domain knowledge into ML models and export knowledge from learned models back to the scientific domain; leverage ML for numerically intractable simulation and optimization problems, as well as maximize the utility of real-world data; generate myriads of synthetic data; quantify and reason about uncertainties in models and data; and infer causal relationships in the data.

It is at the intersection of AI and simulation sciences where we can expect significant strides in scientific experimentation and discovery, in essentially all domains. For instance, the use of neural networks to accelerate simulation software for climate science [9], or multi-agent reinforcement learning and game theory towards economic policy simulations [10]. Yet this area is relatively nascent and disparate, and a unifying holistic perspective is needed to advance the intersection of AI and simulation sciences.

This paper explores this perspective. We lay out the methodologies required to make significant strides in simulation and AI for science, and how they must be fruitfully combined. The field of scientific computing was at a similar inflection point when Phillip Colella in 2004 presented to DARPA the "Seven Dwarfs" for Scientific Computing, where each of the seven represents an algorithmic method that captures a pattern of computation and data movement [11, 12, 13].ⁱⁱ For the remainder of this paper, we choose to replace a potentially insensitive term with "motif", a change we suggest for the field going forward.

The motifs nomenclature has proved useful for reasoning at a high level of abstraction about the behavior and requirements of these methods across a broad range of applications, while decoupling these from specific implementations. Even more, it is an understandable vocabulary for talking across disciplinary boundaries. Motifs also provide "anti-benchmarks": not tied to narrow performance or code artifacts, thus encouraging innovation in algorithms, programming languages, data structures, and hardware [12]. Therefore the motifs of scientific computing provided an explicit roadmap for R&D efforts in numerical methods (and eventually parallel computing) in sciences.

In this paper, we similarly define the Nine Motifs of Simulation Intelligence, classes of complementary algorithmic methods that represent the foundation for synergistic simulation and AI technologies to advance sciences; simulation intelligence (SI) describes a field that merges scientific computing, scientific simulation, and artificial intelligence towards studying processes and systems in silico to better understand and discover in situ phenomena. Each of the SI motifs has momentum from the scientific computing and AI communities, yet they must be pursued in concert and integrated in order to overcome the shortcomings of scientific simulators and enable new scientific workflows.

Unlike the older seven motifs of scientific computing, our SI motifs are not necessarily independent. Many of these are interconnected and interdependent, much like the components within the layers of an operating system. The individual modules can be combined and interact in multiple ways, gaining from this combination. Using this metaphor, we explore the nature of each layer of the "SI stack", the motifs within each layer, and the combinatorial possibilities available when they are brought together – the layers are illustrated in Fig. 1.

We begin by describing the core layers of the SI stack, detailing each motif within: the concepts, challenges, state-of-the-art methods, future directions, ethical considerations, and many motivating examples. As we traverse the SI stack, encountering the numerous modules and scientific workflows, we will ultimately be able to lay out how these advances will benefit the many users of simulation and scientific endeavors. Our discussion continues to cover important SI themes such as inverse problem solving and human-machine teaming, and essential infrastructure areas such as data engineering and accelerated computing.

ⁱⁱThe 2004 "Seven Motifs for Scientific Computing": Dense Linear Algebra, Sparse Linear Algebra, Computations on Structured Grids, Computations on Unstructured Grids, Spectral Methods, Particle Methods, and Monte Carlo [11].


Figure 1: An operating system (OS) diagram, elucidating the relationships of the nine Simulation Intelligence motifs, their close relatives, such as domain-specific languages (DSL) [14] and working memory [15], and subsequent SI-based science workflows. The red, purple, orange, and green plates represent the hardware, OS, application, and user layers, respectively (from bottom to top). In the applications layer (orange plate) are some main SI workflows – notice some entries have a cyan outline, signifying machine-science classes that contain multiple workflows in themselves (see text for details), and entries with a pink outline denote statistical and ML methods that have been used for decades, but here in unique SI ways. In the text we generically refer to this composition as the "SI stack", although the mapping to an OS is a more precise representation of the SI layers and their interactions. We could have shown a simpler diagram with only the nine SI motifs, but the context of the broader stack that is enabled by their integration better elucidates the improved and new methods that can arise.

By pursuing research in each of these SI motifs, as well as ways of combining them together in generalizable software towards specific applications in science and intelligence, the recent trend of decelerating progress may be reversed [16, 17]. This paper aims to motivate the field and provide a roadmap for those who wish to work in AI and simulation in pursuit of new scientific methods and frontiers.

Simulation Intelligence Motifs

Although we define nine concrete motifs, they are not necessarily independent – many of the motifs are synergistic in function and utility, and some may be building blocks underlying others. We start by describing the "module" motifs, as in Fig. 1, followed by the underlying "engine" motifs: probabilistic and differentiable programming. We then describe the motifs that aim to push the frontier of intelligent machines: open-endedness and machine programming.


The Modules

Above the "engine" in the proverbial SI stack (Fig. 1) are "modules", each of which can be built in probabilistic and differentiable programming frameworks, and make use of the accelerated computing building blocks in the hardware layer. We start this section with several motifs that closely relate to the physics-informed learning topics most recently discussed above, and then proceed through the SI stack, describing how and why the module motifs complement one another in myriad, synergistic ways.

1. MULTI-PHYSICS & MULTI-SCALE MODELING

Simulations are pervasive in every domain of science and engineering, yet are often done in isolation: climate simulations of coastal erosion do not model human-driven effects such as urbanization and mining, and even so are only consistent within a constrained region and timescale. Natural systems involve various types of physical phenomena operating at different spatial and temporal scales. For simulations to be accurate and useful they must support multiple physics and multiple scales (spatial and temporal). The same goes for AI & ML, which can be powerful for modeling multi-modality, multi-fidelity scientific data, but machine learning alone – based on data-driven relationships – ignores the fundamental laws of physics and can result in ill-posed problems or non-physical solutions. For ML (and AI-driven simulation) to be accurate and reliable in the sciences, methods must integrate multi-scale, multi-physics data and uncover mechanisms that explain the emergence of function.

Multi-physics

Complex real-world problems require solutions that span a multitude of physical phenomena, which often can only be solved using simulation techniques that cross several engineering disciplines. Almost all practical problems in fluid dynamics involve the interaction between a gas and/or liquid with a solid object, and include a range of associated physics including heat transfer, particle transport, erosion, deposition, flow-induced stress, combustion, and chemical reaction. A multi-physics environment is defined by coupled processes or systems involving more than one simultaneously occurring physical field or phenomenon. Such an environment is typically described by multiple partial differential equations (PDEs), tightly coupled such that solving them presents significant challenges with nonlinearities and time-stepping. In general, the more interacting physics in a simulation, the more costly the computation.
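To see why tight coupling is computationally delicate, consider a minimal partitioned (staggered) time-stepping sketch where two scalar "fields" stand in for coupled physics solvers exchanging lagged states; all coefficients and the scenario are invented for this illustration, and real multi-physics codes couple full PDE fields rather than scalars.

```python
# A minimal partitioned (staggered) time-stepping sketch for a two-field
# coupled system, a stand-in for multi-physics coupling: field u feeds field
# v and vice versa. Real multi-physics codes couple PDEs; here each "field"
# is a scalar ODE, and the explicit coupling shows how step size limits
# accuracy for tightly coupled systems. All coefficients are invented.

def simulate(dt, t_end=2.0):
    u, v, t = 1.0, 0.0, 0.0
    while t < t_end - 1e-12:
        u_new = u + dt * (-1.0 * u + 0.5 * v)  # physics A, uses lagged v
        v_new = v + dt * (-2.0 * v + 0.5 * u)  # physics B, uses lagged u
        u, v, t = u_new, v_new, t + dt
    return u, v

u_fine, _ = simulate(dt=0.001)   # near-converged reference
u_coarse, _ = simulate(dt=0.5)   # cheap but inaccurate for the coupled system
print(round(u_fine, 3), round(u_coarse, 3))
```

The gap between the fine and coarse results grows with the coupling strength, which is one reason tightly coupled multi-physics simulations force small, expensive time steps.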

Multi-scale

Ubiquitous in science and engineering, cascades of scales involve more than two scales with long-range spatio-temporal interactions (that often lack self-similarity and proper closure relations). In the context of biological and behavioral sciences, for instance, multi-scale modeling applications range from the molecular, cellular, tissue, and organ levels all the way to the population level; multi-scale modeling can enable researchers to probe biologically relevant phenomena at smaller scales and seamlessly embed the relevant mechanisms at larger scales to predict the emergent dynamics of the overall system [20]. Domains such as energy and synthetic biology require engineering materials at the nanoscale, optimizing multi-scale processes and systems at the macroscale, and even the discovery of new governing physico-chemical laws across scales. These scientific drivers call for a deeper, broader, and more integrated understanding of common multi-scale phenomena and scaling cascades. In practice, multi-scale modeling is burdened by computational inefficiency, especially with increasing complexity and scales; it is not uncommon to encounter "hidden" or unknown physics of interfaces, inhomogeneities, symmetry-breaking and other singularities.

Methods for utilizing information across scales (and physics that vary across space and time) are often needed in real-world settings, and with data from various sources: multi-fidelity modeling aims to synergistically combine abundant, inexpensive, low-fidelity data and sparse, expensive, high-fidelity data from experiments and simulations. Multi-fidelity modeling is often useful in building efficient and robust surrogate models (which we detail in the surrogate motif section later) [21] – some examples include simulating the mixed convection flow past a cylinder [22] and cardiac electrophysiology [23].
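A minimal sketch of the multi-fidelity idea, assuming a simple scale-and-shift correction fit by least squares to a few high-fidelity samples; the functions, sample locations, and correction form are invented for illustration and far simpler than the cited methods.

```python
# Minimal multi-fidelity surrogate sketch: a cheap low-fidelity model is
# corrected with a scale-and-shift term fit to a few expensive high-fidelity
# samples (a linear-correction scheme in the spirit of multi-fidelity
# modeling; the functions and sample points are made up for illustration).

def f_hi(x):          # "expensive" high-fidelity truth
    return 1.2 * (x - 0.3) ** 2 + 0.1

def f_lo(x):          # "cheap" low-fidelity approximation (biased)
    return (x - 0.3) ** 2

# fit f_mf(x) = rho * f_lo(x) + b by least squares on 3 high-fidelity runs
xs = [0.0, 0.5, 1.0]
lo = [f_lo(x) for x in xs]
hi = [f_hi(x) for x in xs]
n = len(xs)
m_lo, m_hi = sum(lo) / n, sum(hi) / n
rho = sum((l - m_lo) * (h - m_hi) for l, h in zip(lo, hi)) / \
      sum((l - m_lo) ** 2 for l in lo)
b = m_hi - rho * m_lo

def f_mf(x):          # multi-fidelity surrogate: corrected low-fidelity model
    return rho * f_lo(x) + b

err = max(abs(f_mf(x) - f_hi(x)) for x in [0.1 * i for i in range(11)])
print(rho, b, err)    # rho ≈ 1.2, b ≈ 0.1, err ≈ 0
```

Here three expensive evaluations correct the cheap model everywhere; in practice the discrepancy is itself modeled (e.g., with a GP) rather than assumed affine.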

In computational fluid dynamics (CFD), classical methods for multi-physics and multi-scale simulation (such as finite elements and pseudo-spectral methods) are only accurate if flow features are all smooth, and thus meshes must resolve the smallest features. Consequently, direct numerical simulation for real-world systems such as climate and jet physics is impossible. It is common to use smoothed versions of the Navier–Stokes equations to allow coarser meshes while sacrificing accuracy. Although successful in the design of engines and turbo-machinery, there are severe limits to what can be accurately and reliably simulated – the resolution-efficiency tradeoff imposes a significant bottleneck. Methods for AI-driven acceleration and surrogate modeling could accelerate CFD and multi-physics multi-scale simulation by orders of magnitude. We discuss specific methods and examples below.

Figure 2: Diagram of the various physics and scales to simulate for turbulent flow within a nuclear reactor core – top-to-bottom zooming in reveals finer and finer scales, each with different physics to model. AI-driven computational fluid dynamics (CFD) can radically improve the resolution and efficiency of such simulations [18, 19].

Physics-informed ML

The newer class of physics-informed machine learning methods integrates mathematical physics models with data-driven learning. More specifically, making an ML method physics-informed amounts to introducing appropriate observational, inductive, or learning biases that can steer or constrain the learning process to physically consistent solutions [24]:

1. Observational biases can be introduced via data that allows an ML system to learn functions, vector fields, and operators that reflect the physical structure of the data.

2. Inductive biases are encoded as model-based structure that imposes prior assumptions or physical laws, making sure the physical constraints are strictly satisfied.

3. Learning biases force the training of an ML system to converge on solutions that adhere to the underlying physics, implemented by specific choices in loss functions, constraints, and inference algorithms.
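The third item, a learning bias, is the easiest to sketch: add the residual of a known governing equation to the training loss so that data-fitting and physics-consistency are minimized together. The toy below (the decay ODE, noisy data, candidate model, and grid-search optimizer are all invented for this sketch, not from the paper) recovers the physically consistent rate even from sparse, noisy observations.

```python
import math

# Toy "learning bias": fit the rate lam of a candidate solution
# u(t) = exp(lam * t) to sparse noisy data, while a physics residual term
# enforces the known ODE u'(t) = -k * u(t) with k = 2.0. All names, data,
# and the grid-search optimizer are illustrative, not from the paper.

K = 2.0                                  # known decay constant in the ODE
T_DATA = [0.0, 0.5, 1.0]                 # sparse observation times
U_DATA = [1.00, 0.38, 0.13]              # noisy observations of exp(-2t)
T_COLLOC = [0.1 * i for i in range(11)]  # collocation points for the residual

def u(lam, t):
    return math.exp(lam * t)

def du_dt(lam, t, h=1e-5):               # finite-difference derivative
    return (u(lam, t + h) - u(lam, t - h)) / (2 * h)

def loss(lam, weight=1.0):
    data = sum((u(lam, t) - y) ** 2 for t, y in zip(T_DATA, U_DATA))
    # physics residual of u' + k*u = 0 at collocation points (learning bias)
    phys = sum((du_dt(lam, t) + K * u(lam, t)) ** 2 for t in T_COLLOC)
    return data + weight * phys

# crude grid search over candidate rates
grid = [-4.0 + 0.01 * i for i in range(401)]
best = min(grid, key=lambda lam: loss(lam))
print(round(best, 2))  # close to -2.0, the physically consistent rate
```

PINNs apply the same recipe with a neural network in place of the one-parameter model and autodiff in place of finite differences.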

A common problem template involves extrapolating from an initial condition obtained from noisy experimental data, where a governing equation is known for describing at least some of the physics. An ML method would aim to predict the latent solution u(t, x) of a system at later times t > 0 and propagate the uncertainty due to noise in the initial data. A common use-case in scientific computing is reconstructing a flow field from scattered measurements (e.g., particle image velocimetry data), and using the governing Navier–Stokes equations to extrapolate this initial condition in time.


Figure 3: Diagram of a physics-informed neural network (PINN), where a fully-connected neural network (red), with time and space coordinates (t, x) as inputs, is used to approximate the multi-physics solutions û = [u, v, p, φ]. The derivatives of û with respect to the inputs are calculated using automatic differentiation (purple; autodiff is discussed later in the SI engine section) and then used to formulate the residuals of the governing equations in the loss function (green), which is generally composed of multiple terms weighted by different coefficients. By minimizing the physics-informed loss function (i.e., the right-to-left process) we simultaneously learn the parameters of the neural network θ and the unknown PDE parameters λ. (Figure reproduced from Ref. [25])

For fluid flow problems and many others, Gaussian processes (GPs) present a useful approach for capturing the physics of dynamical systems. A GP is a Bayesian nonparametric machine learning technique that provides a flexible prior distribution over functions, enjoys analytical tractability, defines kernels for encoding domain structure, and has a fully probabilistic workflow for principled uncertainty reasoning [26, 27]. For these reasons GPs are used widely in scientific modeling, with several recent methods more directly encoding physics into GP models: numerical GPs have covariance functions resulting from temporal discretization of time-dependent partial differential equations (PDEs) which describe the physics [28, 29], modified Matérn GPs can be defined to represent the solution to stochastic partial differential equations [30] and extend to Riemannian manifolds to fit more complex geometries [31], and the physics-informed basis-function GP derives a GP kernel directly from the physical model [32] – the latter method we elucidate in an experiment optimization example in the surrogate modeling motif.
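To make the GP workflow concrete, here is a tiny noise-free GP regression with an RBF kernel, computing a posterior mean and variance at one test point. The kernel and data are generic illustrations; the physics-derived kernels cited above replace the RBF with structure from the governing equations.

```python
import math

# Tiny Gaussian-process regression sketch (RBF kernel, near-noise-free),
# showing the probabilistic workflow behind physics-aware GP models in
# spirit only; kernel choice and data here are illustrative.

def k(a, b, ell=1.0):
    return math.exp(-0.5 * ((a - b) / ell) ** 2)

X = [0.0, 1.0]                   # training inputs
y = [0.0, 1.0]                   # training targets
xs = 0.5                         # test input

# 2x2 kernel matrix (with jitter) and its closed-form inverse
K = [[k(X[0], X[0]) + 1e-9, k(X[0], X[1])],
     [k(X[1], X[0]), k(X[1], X[1]) + 1e-9]]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
Kinv = [[K[1][1] / det, -K[0][1] / det],
        [-K[1][0] / det, K[0][0] / det]]

kstar = [k(xs, X[0]), k(xs, X[1])]
alpha = [sum(Kinv[i][j] * y[j] for j in range(2)) for i in range(2)]
mean = sum(kstar[i] * alpha[i] for i in range(2))   # posterior mean
var = k(xs, xs) - sum(kstar[i] * sum(Kinv[i][j] * kstar[j] for j in range(2))
                      for i in range(2))            # posterior variance
print(round(mean, 3), var > 0)
```

The positive posterior variance between the two training points is the "principled uncertainty reasoning" referred to above: the model knows where it has not seen data.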

With the recent wave of deep learning progress, research has developed methods for building the three physics-informed ML biases into nonlinear regression-based physics-informed networks. More specifically, there is increasing interest in physics-informed neural networks (PINNs): deep neural nets for building surrogates of physics-based models described by partial differential equations (PDEs) [33, 34]. Despite recent successes (with some examples explored below) [35, 36, 37, 38, 39, 40], PINN approaches are currently limited to tasks that are characterized by relatively simple and well-defined physics, and often require domain-expert craftsmanship. Moreover, the characteristics of many physical systems are often poorly understood or hard to implicitly encode in a neural network architecture.

Probabilistic graphical models (PGMs) [41], on the other hand, are useful for encoding a priori structure, such as the dependencies among model variables, in order to maintain physically sound distributions. This is the intuition behind the promising direction of graph-informed neural networks (GINNs) for multi-scale physics [42]. There are two main components of this approach (shown in Fig. 7): first, embedding a PGM into the physics-based representation to encode complex dependencies among model variables that arise from domain-specific information and to enable the generation of physically sound distributions; second, identifying computational bottlenecks intrinsic to the underlying physics-based model from the embedded PGM and replacing them with an efficient NN surrogate. The hybrid model thus encodes a domain-aware physics-based model that synthesizes stochastic and multi-scale modeling, while computational bottlenecks are replaced by a fast surrogate NN whose supervised learning and prediction are further informed by the PGM (e.g., through structured priors). Given the significant computational advantages from surrogate modeling within GINN, we further explore this approach in the surrogate modeling motif later.

Modeling and simulation of complex nonlinear multi-scale and multi-physics systems requires the inclusion and characterization of uncertainties and errors that enter at various stages of the computational workflow. Typically we are concerned with two main classes of uncertainties in ML: aleatoric and epistemic. The former quantifies system stochasticity such as observation and process noise, and the latter is model-based or subjective uncertainty due to limited data. In environments of multiple scales and multiple physics, one should consider an additional type of uncertainty due to the randomness of parameters of stochastic physical systems (often described by stochastic partial or ordinary differential equations (SPDEs, SODEs)). One can view this type of uncertainty as arising from the computation of a sufficiently well-posed deterministic problem, in contrast to the notion of epistemic or aleatoric uncertainty quantification. The field of probabilistic numerics [43] makes a similar distinction, where probabilistic modeling is used to reason about uncertainties that arise strictly from the lack of information inherent in the solution of intractable problems such as quadrature methods and other integration procedures. In general the probabilistic numerics viewpoint provides a principled way to manage the parameters of numerical procedures. We discuss more on probabilistic numerics and uncertainty reasoning later in the SI themes section. The GINN can quantify uncertainties with a high degree of statistical confidence, while Bayesian analogs of PINNs are a work in progress [44] – one cannot simply plug in MC-dropout or other deep learning uncertainty estimation methods. The various uncertainties may be quantified and mitigated with methods that can utilize data-driven learning to inform the original systems of differential equations. We define this class of physics-infused machine learning later in this section and in the next motif.
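The aleatoric/epistemic split can be illustrated with the standard ensemble decomposition via the law of total variance; the predictions below are invented numbers, standing in for the per-member outputs of any probabilistic ensemble.

```python
# Illustrative decomposition of predictive uncertainty for an ensemble of
# probabilistic models: each member predicts a mean mu_i and a noise
# variance s2_i at some input. Epistemic uncertainty is the spread of the
# means across members (it shrinks with more data); aleatoric is the average
# predicted noise (it does not). The numbers are invented for illustration.
mu = [0.9, 1.1, 1.0, 1.2]      # ensemble member means
s2 = [0.04, 0.05, 0.04, 0.03]  # ensemble member noise variances

m = sum(mu) / len(mu)
epistemic = sum((x - m) ** 2 for x in mu) / len(mu)  # variance of means
aleatoric = sum(s2) / len(s2)                        # mean noise variance
total = epistemic + aleatoric                        # law of total variance
print(round(epistemic, 4), round(aleatoric, 4), round(total, 4))
```

The third uncertainty class discussed above (randomness of parameters in SPDE/SODE systems) sits outside this two-way split, which is exactly why it needs separate treatment.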

The synergies of mechanistic physics models and data-driven learning are brought to bear when physics-informed ML is built with differentiable programming (one of the engine motifs), which we explore in the first example below.

Examples

Accelerated CFD via physics-informed surrogates and differentiable programming

Kochkov et al. [18] look to bring the advantages of semi-mechanistic modeling and differentiable programming to the challenge of complex computational fluid dynamics (CFD). The Navier–Stokes (NS) equations describe fluid dynamics well, yet in cases of multiple physics and complex dynamics, solving the equations at scale is severely limited by the computational cost of resolving the smallest spatiotemporal features. Approximation methods can alleviate this burden, but at the cost of accuracy. In Kochkov et al., the components of traditional fluid solvers most affected by the loss of resolution are replaced with better-performing machine-learned alternatives (as presented in the semi-mechanistic modeling section later). This AI-driven solver algorithm is represented as a differentiable program, with the neural networks and the numerical methods written in the JAX framework [45]. JAX is a leading framework for differentiable programming, with reverse-mode automatic differentiation that allows for end-to-end gradient-based optimization of the entire programmed algorithm. In this CFD use-case, the result is an algorithm that maintains accuracy while using 10x coarser resolution in each dimension, yielding an 80-fold improvement in computation time with respect to an advanced numerical method of similar accuracy.
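The end-to-end idea can be caricatured in a few lines: a tunable component sits inside a numerical solver, and we optimize it against a reference by differentiating through the whole solve. The toy below stands in for that pattern only (finite-difference gradients replace JAX autodiff, the solver is a scalar Euler integrator, and all names and constants are illustrative, not from Kochkov et al.).

```python
import math

# Sketch of end-to-end solver optimization: a coarse integrator for
# du/dt = -u carries a learned correction coefficient c, optimized by
# gradient descent (numerical gradients stand in for autodiff) so the
# coarse solver matches a fine-grained reference. Illustrative toy only.

REF = math.exp(-1.0)            # exact solution u(1) with u(0) = 1

def coarse_solver(c, dt=0.5, t_end=1.0):
    u, t = 1.0, 0.0
    while t < t_end - 1e-12:
        u = u * (1.0 - c * dt)  # Euler step with learned correction factor c
        t += dt
    return u

def loss(c):
    return (coarse_solver(c) - REF) ** 2

c, lr, h = 1.0, 0.5, 1e-6
for _ in range(200):            # gradient descent with numerical gradient
    g = (loss(c + h) - loss(c - h)) / (2 * h)
    c -= lr * g

print(round(c, 3), abs(coarse_solver(c) - REF) < 1e-6)
```

Note the learned c is not 1: it compensates for the coarse discretization, which is the same role the learned corrections play inside the differentiable CFD solver, where autodiff makes this optimization tractable for millions of parameters.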

Related approaches implement PINNs without the use of differentiable programming, such as TF-Net for modeling turbulent flows with several specially designed U-Net deep learning architectures [46]. A promising direction in this and other spatiotemporal use-cases is neural operator learning: using NNs to learn mesh-independent, resolution-invariant solution operators for PDEs. To achieve this, Li et al. [47] use a Fourier layer that implements a Fourier transform, then a linear transform, and an inverse Fourier transform for a convolution-like operation in a NN. In a Bayesian inverse experiment, the Fourier neural operator acting as a surrogate can draw MCMC samples from the posterior of initial NS vorticity given sparse, noisy observations in 2.5 minutes, compared to 18 hours for the traditional solver.

Multi-physics and HPC simulation of blood flow in an intracranial aneurysm

SimNet is an AI-driven multi-physics simulation framework based on neural network solvers – more specifically, it approximates the solution to a PDE by a neural network [19]. SimNet improves on previous NN solvers to take on the challenge of gradients and discontinuities introduced by complex geometries or physics. The main novelties are the use of signed distance functions for loss weighting, and integral continuity planes for flow simulation. An intriguing real-world use case is simulating the flow inside a patient-specific geometry of an aneurysm, as shown in Fig. 4. It is particularly challenging to get the flow field to develop correctly, especially inside the aneurysm sac. Building SimNet to support multi-GPU and multi-node scaling provides the computational efficiency necessary for such complex geometries. SimNet also optimizes over repetitive trainings, such as training for surrogate-based design optimization or uncertainty quantification, where transfer learning reduces the time to convergence for neural network solvers: once a model is trained for a single geometry, the trained model parameters are transferred to solve a different geometry, without having to train on the new geometry from scratch. As shown in Fig. 4 (right), transfer learning accelerates the patient-specific intracranial aneurysm simulations.

Figure 4: SimNet simulation results for the aneurysm problem [19]. Left: patient-specific geometry of an intracranial aneurysm. Center: streamlines showing accurate flow field simulation inside the aneurysm sac. Right: transfer learning within the NN-based simulator accelerates the computation of patient-specific geometries.

Raissi et al. [48] similarly approach the problem of 3D physiologic blood flow in a patient-specific intracranial aneurysm, but implement a physics-informed NN technique: the Hidden Fluid Mechanics approach, which uses autodiff (i.e., within differentiable programming) to simultaneously exploit information from the Navier–Stokes equations of fluid dynamics and the information from flow visualization snapshots. Continuing this work has high potential for robust and data-efficient simulation in physical and biomedical applications. Raissi et al. effectively solve this as an inverse problem using blood flow data, while SimNet approaches it as a forward problem without data. Nonetheless SimNet has potential for solving inverse problems as well (within the inverse design workflow of Fig. 33, for example).

Figure 5: A cascade of spatial and temporal scales governing key biophysical mechanisms in brain blood flow requires effective multi-scale modeling and simulation techniques (inspired by [49]). Listed are the standard modeling approaches for each spatial and temporal regime, where an increase in computational demands is inevitable as one pursues an integrated resolution of interactions at finer and finer scales. SimNet [19], JAX MD [50], and related works with the motifs will help mitigate these computational demands and more seamlessly model dynamics across scales.


Learning quantum chemistry

Using modern algorithms and supercomputers, systems containing thousands of interacting ions and electrons can now be described using approximations to the physical laws that govern the world on the atomic scale, namely the Schrödinger equation [51]. Chemical simulation can allow the properties of a compound to be anticipated (with reasonable accuracy) before synthesizing it in the laboratory. These computational chemistry applications range from catalyst development for greenhouse gas conversion, to materials discovery for energy harvesting and storage, to computer-assisted drug design [52]. The potential energy surface is the central quantity of interest in the modeling of molecules and materials, computed by approximations to the time-independent Schrödinger equation. Yet there is a computational bottleneck: high-level wave function methods have high accuracy, but are often too slow for use in many areas of chemical research. The standard approach today is density functional theory (DFT) [53], which can yield results close to chemical accuracy, often on the scale of minutes to hours of computational time. DFT has enabled the development of extensive databases that cover the calculated properties of known and hypothetical systems, including organic and inorganic crystals, single molecules, and metal alloys [54, 55, 56]. In high-throughput applications, commonly used methods include force-field (FF) and semi-empirical quantum mechanics (SEQM), with runtimes on the order of fractions of a second, but at the cost of reliability of the predictions [57]. ML methods can potentially provide accelerated solvers without loss of accuracy, but to be reliable in the chemistry space the model must capture the underlying physics, and well-curated training data covering the relevant chemical problems must be available – Butler et al. [51] provide a list of publicly accessible structure and property databases for molecules and solids. Symmetries in geometries or physics, or equivariance, defined as the property of being independent of the choice of reference frame, can be exploited to this end. For instance, a message passing neural network [58] called OrbNet encodes a molecular system in graphs based on features from a quantum calculation that are low cost, by implementing symmetry-adapted atomic orbitals [57, 59]. The authors demonstrate the effectiveness of their equivariant approach on several organic and biological chemistry benchmarks, with accuracies on par with those of modern DFT functionals, providing a far more efficient drop-in replacement for DFT energy predictions. Additional approaches that constrain NNs to be symmetric under these geometric operations have been successfully applied in molecular design [60] and quantum chemistry [61].
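The symmetry idea can be sketched in a few lines (a toy model, not OrbNet: the two-component descriptors and the weights are invented purely for illustration). In the spirit of atom-centered ML potentials, an identical per-atom network is applied to each atom's local descriptor and the contributions are summed, so relabeling the atoms cannot change the predicted energy:

```python
import math

def atom_energy(descriptor, w=(0.7, -0.3), b=0.1):
    # Tiny per-atom model: a one-neuron "network" applied identically
    # to every atom's local descriptor (weights are illustrative only).
    z = sum(wi * di for wi, di in zip(w, descriptor)) + b
    return math.tanh(z)

def total_energy(descriptors):
    # Summing identical per-atom contributions makes the prediction
    # invariant to any permutation (relabeling) of the atoms.
    return sum(atom_energy(d) for d in descriptors)

atoms = [(0.2, 1.1), (0.9, -0.4), (1.5, 0.3)]
permuted = [atoms[2], atoms[0], atoms[1]]
assert abs(total_energy(atoms) - total_energy(permuted)) < 1e-12
```

Rotational and translational invariance are handled analogously, by computing the descriptors themselves from invariant quantities such as interatomic distances.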

Future directions

In general, machine learned models are ignorant of fundamental laws of physics, and can result in ill-posed problems or non-physical solutions. This effect is exacerbated when modeling the interplay of multiple physics or cascades of scales. Despite the recent successes mentioned above, there is much work to be done for multi-scale and multi-physics problems. For instance, PINNs can struggle with high-frequency domains and frequency bias [62], which is particularly problematic in multi-scale problems [63]. Improved methods for learning multiple physics simultaneously are also needed, as the training can be prohibitively expensive – for example, a good approach with current tooling is training a model for each field separately and subsequently learning the coupled solutions through either a parallel or a serial architecture using supervised learning based on additional data for a specific multi-physics problem.

In order to approach these challenges in a collective and reproducible way, there is a need to create open benchmarks for physics-informed ML, much like in other areas of the ML community such as computer vision and natural language processing. Yet producing quality benchmarks tailored for physics-informed ML can be more challenging:

1. To benchmark physics-informed ML methods, we additionally need the proper parameterized physical models to be explicitly included in the databases.

2. Many applications in physics and chemistry require full-field data, which cannot be obtained experimentally and/or call for significant compute.

3. Often different, problem-specific, physics-based evaluation methods are necessary, for example the metrics proposed in [64] for scoring physical consistency and [65] for scoring spatiotemporal predictions.

An overarching benchmarking challenge, but also an advantageous constraint, is the multidisciplinary nature of physics-informed ML: there must be multiple benchmarks in multiple domains, rather than one benchmark to rule them all, which is a development bias that ImageNet has put on the computer vision field for the past decade. And because we have domain knowledge and numerical methods for the underlying data generating mechanisms and processes in the physical and life sciences, we have an opportunity to robustly quantify the characteristics of datasets to better ground the performances of various models and algorithms. This is in contrast to the common practice of naïve data gathering to compile massive benchmark datasets for deep learning – for example, scraping the internet for videos to compose a benchmark dataset for human action recognition (deepmind.com/research/open-source/kinetics) – where not only are the underlying statistics and causal factors a priori unknown, the target variables and class labels are non-trivial to define and can lead to significant ethical issues such as dataset biases that lead to model biases, which in some cases can propagate harmful assumptions and stereotypes.

Further, we propose to broaden the class of methods beyond physics-informed, to physics-infused machine learning. The former is unidirectional (physics providing constraints or other information to direct ML methods), whereas the latter is bidirectional, including approaches that can better synergize the two computational fields. For instance, for systems with partial information, methods in physics-infused ML can potentially complement known physical models to learn missing or misunderstood components of the systems. We specifically highlight one approach named Universal Differential Equations [66] in the surrogate modeling motif next.

Physics-infused ML can enable many new simulation tools and use-cases because of this ability to integrate physical models and data within a differentiable software paradigm. JAX and SimNet are nice examples, each enabling physics-informed ML methods for many science problems and workflows. Consider, for instance, the JAX fluid dynamics example above: another use-case in the same DP framework is JAX MD [50] for performing differentiable physics simulations with a focus on molecular dynamics.

Another exciting area of future multi-physics multi-scale development is inverse problem solving. We already mentioned the immense acceleration in the CFD example above with Fourier Neural Operators [47]. Beyond efficiency gains, physics-infused ML can take on applications with inverse and ill-posed problems which are either difficult or impossible to solve with conventional approaches, notably quantum chemistry: a recent approach called FermiNet [67] takes a dual physics-informed learning approach to solving the many-electron Schrödinger equation, where both inductive bias and learning bias are employed. The advantage of physics-infused ML here is eliminating extrapolation problems with the standard numerical approach, which is a common source of error in computational quantum chemistry. In other domains such as biology, biomedicine, and the behavioral sciences, focus is shifting from solving forward problems based on sparse data towards solving inverse problems to explain large datasets [21]. The aim is to develop multi-scale simulations to infer the behavior of the system, provided access to massive amounts of observational data, while the governing equations and their parameters are not precisely known. We further detail the methods and importance of inverse problem solving with SI later in the Discussion section.

2. SURROGATE MODELING & EMULATION

A surrogate model is an approximation method that mimics the behavior of an expensive computation or process. For example, the design of an aircraft fuselage includes computationally intensive simulations with numerical optimizations that may take days to complete, making design space exploration, sensitivity analysis, and inverse modeling infeasible. In this case a computationally efficient surrogate model can be trained to represent the system, learning a mapping from simulator inputs to outputs. Earth system models (ESMs), for example, are extremely computationally expensive to run due to the large range of spatial and temporal scales and the large number of processes being modeled. ESM surrogates can be trained on a few selected samples of the full, expensive simulations using supervised machine learning tools. In this simulation context, the aim of surrogate modeling (or statistical emulation) is to replace simulator code with a machine learning model (i.e., an emulator) such that running the ML model to infer the simulator outputs is more efficient than running the full simulator itself. An emulator is thus a model of a model: a statistical model of the simulator, which is itself a mechanistic model of the world.

For surrogate modeling in the sciences, non-linear, nonparametric Gaussian processes (GPs) [26] are typically used because of their flexibility, interpretability, and accurate uncertainty estimates [70]. Although traditionally limited to smaller datasets because of the O(N³) computational cost of training (where N is the number of training data points), much work on reliable GP sparsification and approximation methods makes them viable for real-world use [71, 72, 73, 74].
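As a concrete sketch of GP emulation (illustrative only: a squared-exponential kernel, noiseless exact inference, hand-picked hyperparameters, and a cheap trigonometric stand-in for an expensive simulator), the following fits a GP to a handful of simulator evaluations and returns a cheap predictive mean and variance; the N×N linear solve is the source of the O(N³) training cost mentioned above:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5, variance=1.0):
    # Squared-exponential covariance between two sets of 1-D inputs.
    d = A[:, None] - B[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_fit_predict(X, y, Xstar, noise=1e-6):
    # Exact GP regression: the O(N^3) cost lives in solving K alpha = y.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xstar)
    alpha = np.linalg.solve(K, y)
    mean = Ks.T @ alpha
    var = np.diag(rbf_kernel(Xstar, Xstar) - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var

# Pretend each call to `simulator` takes hours; we only afford 6 runs.
simulator = lambda x: np.sin(x)          # stand-in for expensive code
X = np.linspace(0.0, 2.0, 6)
y = simulator(X)
mean, var = gp_fit_predict(X, y, np.array([0.5, 1.5]))
```

The predictive variance grows away from the training runs, which is exactly what downstream tasks such as Bayesian optimization and experiment design exploit.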

Neural networks (NNs) can also be well-suited to the surrogate modeling task as function approximation machines: a feedforward network defines a mapping y = f(x; θ) and learns the values of the parameters θ that result in the best function approximation. The Universal Approximation Theorem demonstrates that sufficiently large NNs can approximate any nonlinear function with a finite set of parameters [75, 76]. Although sufficient to represent any function, a NN layer may be unfeasibly large, such that it may fail to learn and generalize correctly [77]. Recent work has shown that a NN with an infinitely wide hidden layer converges to a GP, representing the normal distribution over the space of functions.

In Fig. 6 we show an example of how a NN surrogate can be used as an emulator that encapsulates either the entire simulator or a specific part of the simulator. In the former, training is relatively straightforward because the loss function only has neural networks, and the trained network can be used towards inverse problem solving. However, we now have a black-box simulator: there is no interpretability of the trained network, and we cannot utilize the mechanistic components (i.e. differential equations of the simulator) for scientific analyses. In the case of the partial surrogate we have several advantages: the surrogate's number of parameters is reduced and thus the network is more stable (similar logic holds for nonparametric GP surrogate models), and the simulator retains all structural knowledge and the ability to run numerical analysis. Yet the main challenge is that backpropagation of arbitrary scientific simulators is required – thus the significance of the differentiable programming motif we discussed earlier. The computational gain associated with the use of a hybrid surrogate-simulator cascades into a series of additional advantages, including the possibility of simulating more scenarios towards counterfactual reasoning and epistemic uncertainty estimates, decreasing grid sizes, or exploring finer-scale parameterizations [68, 69].

Figure 6: An example Earth system model (ESM) for the plastic cycle, with two variations of ML surrogates (purple screens). On the left, an ML surrogate learns the whole model. A variety of physics-infused ML methods can be applied – training is relatively straightforward because the loss function only has neural networks, and the trained network can be used towards inverse problem solving. However, we now have a black-box simulator: there is no interpretability of the trained network, and we cannot utilize the mechanistic components (i.e. differential equations of the simulator) for scientific analyses. On the right, two unknown portions of the model are learned by NN surrogates, while the remaining portions are represented by known mechanistic equations – possible with surrogate modeling approaches like UDE and GINN. This "partial surrogate" case has several advantages: the surrogate's number of parameters is reduced and thus the network is more stable (similar logic holds for nonparametric Gaussian process surrogate models), and the simulator retains all structural knowledge and the ability to run numerical analysis. The main challenge is that backpropagation of arbitrary scientific simulators is required, which we can address with the engine motif differentiable programming, producing learning gradients for arbitrary programs. The computational gain associated with the use of a hybrid surrogate-simulator cascades into a series of additional advantages including the possibility of simulating more scenarios towards counterfactual reasoning and epistemic uncertainty estimates, decreasing grid sizes, or exploring finer-scale parameterizations [68, 69].

Semi-mechanistic modeling It follows from the Universal Approximation Theorem that an NN can learn to approximate any sufficiently regular differential equation. The approach of recent neural-ODE methods [78] is to learn to approximate differential equations directly from data, but these can perform poorly when required to extrapolate [79]. More encouraging for emulation and scientific modeling, however, is to directly utilize mechanistic modeling simultaneously with NNs (or more generally, universal approximator models) in order to allow for arbitrary data-driven model extensions. The result is a semi-mechanistic approach, more specifically Universal Differential Equations (UDE) [66], where part of the differential equation contains a universal approximator model – we'll generally assume an NN is used for UDE in this paper, but other options include a GP, Chebyshev expansion, or random forest.

UDE augments scientific models with machine-learnable structures for scientifically-based learning. This is very similar in motivation to the physics-informed neural nets (PINNs) we previously discussed, but the implementation in general has a key distinction: a PINN is a deep learning model that becomes physics-informed because of some added physics bias (observational, inductive, or learning), whereas a UDE is a differential equation with one or more mechanistic components replaced with a data-driven model – this distinction is why we earlier defined physics-infused ML as the bidirectional influence of physics and ML.
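A deliberately simplified UDE sketch follows (assumptions: the system, coefficients, and basis-expansion approximator are invented for illustration, and the unknown term is fit to derivative residuals rather than trained end-to-end through an ODE solver as in the full UDE framework). The known mechanistic part −a·u is kept, and only the missing physics g(u) is learned from data:

```python
import numpy as np

# "Truth" used only to generate data: du/dt = -0.5*u + 0.3*u**2.
a_known = 0.5                              # mechanistic part, assumed known
truth_missing = lambda u: 0.3 * u ** 2     # hidden physics to be learned

# Simulate a noiseless trajectory with small Euler steps.
dt, T = 0.01, 5.0
u = np.empty(int(T / dt)); u[0] = 1.0
for k in range(1, len(u)):
    u[k] = u[k - 1] + dt * (-a_known * u[k - 1] + truth_missing(u[k - 1]))

# UDE structure: du/dt = -a_known*u + g(u), with g a learnable model.
# Here g is a tiny polynomial-feature approximator fit by least squares
# on derivative residuals (a simplification of end-to-end UDE training).
dudt = np.gradient(u, dt)
residual = dudt + a_known * u              # what the known physics misses
features = np.vstack([u, u ** 2, u ** 3]).T
coef, *_ = np.linalg.lstsq(features, residual, rcond=None)
# coef should be close to [0, 0.3, 0], i.e. g(u) ≈ 0.3*u**2.
```

Because the mechanistic term is retained, the learned component only has to capture the missing physics, which is the stability and data-efficiency argument made above for partial surrogates.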

Bayesian optimal experiment design

A key aspect that makes emulators useful in scientific endeavors is that they allow us to reason probabilistically about outer-loop decisions such as optimization [80] and data collection [81], and to use them to explain how uncertainty propagates in a system [82].

Numerous challenges in science and engineering can be framed as optimization tasks, including the maximization of reaction yields, the optimization of molecular and materials properties, and the fine-tuning of automated hardware protocols [83]. When we seek to optimize the parameters of a system with an expensive cost function f, we look to employ Bayesian optimization (BO) [80, 84, 85, 86] to efficiently explore the search space of solutions with a probabilistic surrogate model f̂ rather than experimenting with the real system. Gaussian process models are the most common surrogates due to their flexible, nonparametric behavior. Various explore-exploit strategies over the search space can be implemented with acquisition functions, which efficiently guide the BO search by estimating the utility of evaluating f at a given point (or parameterization). Often in science and engineering settings this provides the domain experts with a few highly promising candidate solutions to then try on the real system, rather than searching the intractably large space of possibilities themselves – for example, generating novel molecules with optimized chemical properties [87, 88], materials design with expensive physics-based simulations [89], and design of aerospace engineering systems [90].
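A minimal BO loop along these lines (illustrative only: a fixed-hyperparameter RBF-kernel GP surrogate, an upper-confidence-bound acquisition, and a toy quadratic standing in for an expensive experiment with optimum at x = 0.7):

```python
import numpy as np

def rbf(A, B, ls=0.25):
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Exact GP predictive mean and standard deviation on candidates Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

expensive_f = lambda x: -(x - 0.7) ** 2   # pretend each call costs hours
candidates = np.linspace(0.0, 1.0, 201)

X = np.array([0.1, 0.9])                  # two initial experiments
y = expensive_f(X)
for _ in range(8):                        # budget: 8 more experiments
    mean, std = gp_posterior(X, y, candidates)
    ucb = mean + 2.0 * std                # optimism under uncertainty
    x_next = candidates[np.argmax(ucb)]
    X = np.append(X, x_next)
    y = np.append(y, expensive_f(x_next))

best = X[np.argmax(y)]                    # should approach x = 0.7
```

The acquisition function trades off sampling where the surrogate's mean is high (exploitation) against where its uncertainty is high (exploration), so only a handful of real evaluations are spent.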

Similarly, scientists can utilize BO for designing experiments such that the outcomes will be as informative as possible about the underlying process. Bayesian optimal experiment design (BOED) is a powerful mathematical framework for tackling this problem [91, 92, 93, 94], and can be implemented across disciplines, from bioinformatics [95] to pharmacology [96] to physics [97] to psychology [98]. In addition to design, there are also control methods in experiment optimization, which we detail in the context of a particle physics example below.

Examples

Simulation-based online optimization of physical experiments

It is often necessary and challenging to design experiments such that outcomes will be as informative as possible about the underlying process, typically because experiments are costly or dangerous. Many applications such as nuclear fusion and particle acceleration call for online control and tuning of system parameters to deliver optimal performance levels – i.e., the control class of experiment design we introduced above.

In the case of particle accelerators, although physics models exist, there are often significant differences between the simulation and the real accelerator, so we must leverage real data for precise tuning. Yet we cannot rely on many runs with the real accelerator to tune the hundreds of machine parameters, and archived data does not suffice because there are often new machine configurations to try – a control or tuning algorithm must robustly find the optimum in a complex parameter space with high efficiency. With physics-infused ML, we can exploit well-verified mathematical models to learn approximate system dynamics from few data samples and thus optimize systems online and in silico. It follows that we can additionally look to optimize new systems without prior data.

For the online control of particle accelerators, Hanuka et al. [32] develop the physics-informed basis-function GP. To clarify what this model encompasses, we need to understand the several ways to build such a GP surrogate for BO of a physical system:

1. Data-informed GP, using real experimental data

2. Physics-informed GP, using simulated data

3. Basis-function GP, deriving a GP kernel directly from the physical model

4. Physics-informed basis-function GP, as a combination of the above – methods 2 and 3 were combined in Hanuka et al.
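Item 3 can be sketched as follows (not the authors' code: the basis functions and prior covariance here are invented for illustration). A Bayesian linear model over physics-derived basis functions φ(x) induces the GP kernel k(x, x′) = φ(x)ᵀ Σ φ(x′), so domain knowledge enters the surrogate directly through the kernel rather than through data:

```python
import numpy as np

def phi(x):
    # Hypothetical physics-derived basis functions of a machine setting x,
    # e.g. terms suggested by a beam-dynamics model (illustrative only).
    return np.array([1.0, x, np.sin(2 * np.pi * x)])

def basis_function_kernel(x1, x2, Sigma=np.diag([1.0, 0.5, 2.0])):
    # Kernel induced by the Bayesian linear model y = w . phi(x) with
    # prior w ~ N(0, Sigma):  k(x, x') = phi(x)^T Sigma phi(x').
    return phi(x1) @ Sigma @ phi(x2)

# The induced kernel is symmetric and positive semi-definite by
# construction, so it can be dropped into any standard GP/BO library.
xs = [0.0, 0.3, 0.8]
K = np.array([[basis_function_kernel(a, b) for b in xs] for a in xs])
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) > -1e-10)
```

A GP with this kernel only supports functions expressible in the chosen basis, which is precisely how the physical model constrains the search in the online optimization setting described here.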

The resulting physics-informed GP is more representative of the particle accelerator system, and performs faster in an online optimization task compared to routinely used optimizers (ML-based and otherwise). Additionally, the method presents a relatively simple way to construct the GP kernel, including correlations between devices – learning the kernel from simulated data instead of machine data is a form of kernel transfer learning, which can help with generalizability. Hanuka et al. interestingly point out that constructing the kernel from basis functions without using the likelihood function is a form of Gaussian process with likelihood-free inference, which is the regime of problems that simulation-based inference is designed for (i.e. the motif we discuss next).

This and similar physics-informed methods are emerging as a powerful strategy for in silico optimization of expensive scientific processes and machines, and further to enable scientific discovery by means of autonomous experimentation. For instance, the recent Gemini [99] and Golem [83] molecular experiment optimization algorithms are purpose-built for automated science workflows with SI: using surrogate modeling techniques to proxy expensive chemistry experiments, their BO and uncertainty estimation methods are designed for robustness to common scientific measurement challenges such as input variability and noise, proxy measurements, and systematic biases. Similarly, Shirobokov et al. [100] propose a method for gradient-based optimization of black-box simulators using local generative surrogates that are trained in successive local neighborhoods of the parameter space during optimization, and demonstrate this technique in the optimization of the experimental design of the SHiP (Search for Hidden Particles) experiment proposed at CERN. These and other works of Alán Aspuru-Guzik et al. are good sources to follow in this area of automating science. There are potentially significant cause-effect implications to consider in these workflows, as we introduce in the causality motif later.

Multi-physics multi-scale surrogates for Earth systems emulation

The climate change situation is worsening at an accelerating pace: the most recent decade (2010 to 2019) has been the costliest on record, with climate-driven economic damage reaching $2.98 trillion US, nearly double that of the decade 2000–2009 [101]. The urgency for climate solutions motivates the need for modeling systems that are computationally efficient and reliable, lightweight for low-resource use-cases, informative towards policy- and decision-making, and cyber-physical with varieties of sensors and data modalities. Further, models need to be integrated with, and workflows extended to, climate-dependent domains such as energy generation and distribution, agriculture, water and disaster management, and socioeconomics. To this end, we and many others have been working broadly on Digital Twin Earth (DTE), a catalogue of ML and simulation methods, datasets, pipelines, and tools for Earth systems researchers and decision-makers. In general, a digital twin is a computer representation of a real-world process or system – from large aircraft to individual organs. We define digital twin in the more precise sense of simulating the real physics and data-generating processes of an environment or system, with sufficient fidelity such that one can reliably run queries and experiments in silico.

Some of the main ML-related challenges for DTE include integrating simulations of multiple domains, geographies, and fidelities, not to mention the need to integrate real and synthetic data, as well as data from multiple modalities (such as fusing Earth observation imagery with on-the-ground sensor streams). SI methods play important roles in the DTE catalogue, notably the power of machine learned surrogates to accelerate existing climate simulators. Here we highlight one example for enabling lightweight, real-time simulation of coastal environments. Existing simulators for coastal storm surge and flooding are physics-based numerical models that can be extremely computationally expensive. Thus the simulators cannot be used for real-time predictions with high resolution, are unable to quantify uncertainties, and require significant computational infrastructure overhead only available to top national labs. To this end, Jiang et al. [102] developed physics-infused ML surrogates to emulate several of the main coastal simulators worldwide, NEMO [103] and CoSMoS [104]. Variations of the Fourier Neural Operator (FNO) [47] (introduced in the multi-physics motif) were implemented to produce upwards of 100x computational efficiency on comparable hardware.

This use-case exemplified a particularly thorny data preprocessing challenge that is commonplace when working with spatiotemporal simulators and Digital Twin Earth: one of the coastal simulators to be emulated uses a standard grid-spaced representation for geospatial topology, which is readily computable with the DFT in the FNO model, but another coastal simulator uses highly irregular grids that differ largely in scale, and further stacks these grids at varying resolutions. A preprocessing pipeline to regrid and interpolate the data maps was developed, along with substitute Fourier transform methods. It is our experience that many applications in DTE call for tailored solutions such as this – as a community we are lacking shared standards and formats for scientific data and code.

Hybrid PGM and NN for efficient domain-aware scientific modeling

One can in general characterize probabilistic graphical models (PGMs) [41] as structured models for encoding domain knowledge and constraints, contrasted with deep neural networks as data-driven function approximators. The advantages of PGMs have been utilized widely in scientific ML [105, 106, 107, 108], as have the recent NN methods discussed throughout this paper. Graph-Informed Neural Networks (GINNs) [42] are a new approach to incorporating the best of both worlds: PGMs incorporate expert knowledge, available data, constraints, etc. with physics-based models such as systems of ODEs and PDEs, while computationally intensive nodes in this hybrid model are replaced by learned features as NN surrogates. GINNs are particularly suited to enhance the computational workflow for complex systems featuring intrinsic computational bottlenecks and intricate physical relations among variables. Hall et al. demonstrate GINNs towards simulation-based decision-making in a multiscale model of electrical double-layer (EDL) supercapacitor dynamics. The ability for downstream decision-making is afforded by robust and reliable sensitivity analysis (due to the probabilistic ML approach), and orders of magnitude more computational efficiency means many hypotheses can be simulated and predicted posteriors quantified.

Auto-emulator design with neural architecture search

Kasim et al. look to recent advances in neural architecture search (NAS) to automatically design and train a NN as an efficient, high-fidelity emulator, as doing this manually can be time-consuming and require significant ML expertise. NAS methods aim to learn a network topology that can achieve the best performance on a certain task by searching over the space of possible NN architectures given a set of NN primitives – see Elsken et al. [109] for a thorough overview.

Figure 7: Graph-Informed Neural Networks (GINNs) [42] provide a computational advantage while maintaining the advantages of structured modeling with PGMs, by replacing computational bottlenecks with NN surrogates. Here a PGM encoding structured priors serves as input to both a Bayesian Network PDE (lower route) and a GINN (upper) for a homogenized model of ion diffusion in supercapacitors. A simple fully-connected NN is pictured, but in principle any architecture can work, for instance physics-informed methods that further enforce physical constraints.

The NAS results are promising for automated emulator construction: running on ten distinct scientific simulation cases, from fusion energy science [110, 111] to aerosol-climate [112] and oceanic [113] modeling, the results are reliably accurate output simulations with NN-based emulators that run thousands to billions of times faster than the originals, while also outperforming other NAS-based emulation approaches as well as manual emulator design. For example, a global climate model (GCM) simulation tested normally takes about 1150 CPU-hours to run [112], yet the emulator speedup is a factor of 110 million in direct comparison, and over 2 billion with a GPU – providing scientists with simulations on the order of seconds rather than days enables faster iteration of hypotheses and experiments, and potentially new experiments never before thought possible.

Also in this approach is a modified MC dropout method for estimating the predictive uncertainty of emulator outputs. Alternatively, we suggest pursuing Bayesian optimization-based NAS methods [114] for more principled uncertainty reasoning. For example, the former can flag when an emulator architecture is overconfident in its predictions, while the latter can do that and use the uncertainty values to dynamically adjust training parameters and search strategies.
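The MC dropout mechanics can be sketched as follows (a toy, untrained network with invented weights, purely to show the idea rather than Kasim et al.'s modified method): dropout is kept active at prediction time, many stochastic forward passes are run, and the spread of the outputs is read as predictive uncertainty.

```python
import random

def forward(x, drop_p=0.2, rng=random):
    # One hidden ReLU layer with illustrative fixed weights; dropout stays
    # ON at prediction time (the key difference from standard inference).
    hidden_w = [0.5, -1.2, 0.8, 0.3]
    out_w = [1.0, 0.4, -0.7, 0.9]
    h = [max(0.0, w * x) for w in hidden_w]               # ReLU units
    h = [v if rng.random() > drop_p else 0.0 for v in h]  # MC dropout mask
    return sum(w * v for w, v in zip(out_w, h)) / (1 - drop_p)

def mc_dropout_predict(x, n_samples=500, seed=0):
    # Repeated stochastic passes give an empirical predictive distribution.
    rng = random.Random(seed)
    ys = [forward(x, rng=rng) for _ in range(n_samples)]
    mean = sum(ys) / n_samples
    var = sum((y - mean) ** 2 for y in ys) / n_samples
    return mean, var ** 0.5

mean, std = mc_dropout_predict(1.0)
# std > 0: the random dropout masks induce a distribution over outputs,
# whose spread serves as an (approximate) predictive uncertainty.
```

A wide spread flags inputs where the emulator should not be trusted, which is the overconfidence check referred to above.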

The motivations of Kasim et al. for emulator-based accelerated simulation are the same as those we've declared above: enable rapid screening and idea testing, and real-time prediction-based experimental control and optimization. The more we can optimize and automate the development and verification of emulators, the more efficiently scientists without ML expertise can iterate over simulation experiments, leading to more robust conclusions and more hypotheses to explore.

Deriving physical laws from data-driven surrogates

Simulating complex dynamical systems often relies on governing equations conventionally obtained from rigorous first principles such as conservation laws or knowledge-based phenomenological derivations. Although non-trivial to derive, these symbolic or mechanistic equations are interpretable and understandable for scientists and engineers. NN-based simulations, including surrogates, are not interpretable in this way and can thus be challenging to use, especially in the many cases where it is important that the scientist or engineer understand the causal, data-generating mechanisms.

Figure 8: Illustrating the UDE forward process (top), where mechanistic equations are used with real data to produce a trained neural network (NN) model, followed by the inverse problem of recovering the governing equations in symbolic form (bottom).

Recent ML-driven advances have led to approaches for sparse identification of nonlinear dynamics (SINDy) [115] that learn ODEs or PDEs from observational data. SINDy essentially selects dominant candidate functions from a high-dimensional nonlinear function space based on sparse regression to uncover ODEs that match the given data; one can think of SINDy as providing an ODE surrogate model. This exciting development has led to scientific applications from biological systems [116] to chemical processes [117] to active matter [118], as well as data-driven discovery of spatiotemporal systems governed by PDEs [119, 120]. Here we showcase two significant advances on SINDy utilizing methods described in the surrogate modeling motif and others:

1. Synergistic learning of a deep NN surrogate and governing PDEs from sparse and independent data – Chen et al. [121] present a novel physics-informed deep learning framework to discover governing PDEs of nonlinear spatiotemporal systems from scarce and noisy data, accounting for different initial/boundary conditions (IBCs). Their approach integrates the strengths of deep NNs for learning rich features, automatic differentiation for accurate and efficient derivative calculation, and ℓ0 sparse regression to tackle the fundamental limitation of existing methods that scale poorly with data noise and scarcity. The special network architecture design is able to account for multiple independent datasets sampled under different IBCs, shown with simple experiments that should still be validated on more complex datasets. An alternating direction optimization strategy simultaneously trains the NN on the spatiotemporal data and determines the optimal sparse coefficients of selected candidate terms for reconstructing the PDE(s) – the NN provides accurate modeling of the solution and its derivatives as a basis for constructing the governing equation(s), while the sparsely represented PDE(s) in turn inform and constrain the DNN, which makes it generalizable and further enhances the discovery. The overall semi-mechanistic approach – bottom-up (data-driven) and top-down (physics-informed) processes – is promising for ML-driven scientific discovery.

2. Sparse identification of missing model terms via Universal Differential Equations – We earlier described several scenarios where an ML surrogate is trained for only part of the full simulator system, perhaps for the computationally inefficient or unknown parts. This is also a use case of the UDE: replacing parts of a simulator described by mechanistic equations with a data-driven NN surrogate model. Now suppose we are at the end of the process of building a UDE, having learned and verified an approximation for part of the causal generative model (i.e., a simulator). Do we lose interpretability and analysis capabilities? With a knowledge-enhanced variant of the SINDy method we can sparse-identify the learned semi-mechanistic UDE back into mechanistic terms that are understandable and usable by domain scientists. Rackauckas et al. [66] modify the SINDy algorithm to apply to only subsets of the UDE equation, in order to perform equation discovery specifically on the trained neural network components. In a sense this narrows the search space of potential governing equations by utilizing the prior mechanistic knowledge that was not replaced in training the UDE. Along with the UDE approach in general, this sparse identification method needs further development and validation on more complex datasets.
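The sparse-regression core that both of these examples build on can be illustrated in a few lines. Below is a minimal SINDy-style recovery of a one-dimensional ODE using sequentially thresholded least squares; the toy system, candidate library, and threshold are illustrative choices for this sketch, not those of the cited works.

```python
import numpy as np

def stlsq(Theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: the sparse-regression core of SINDy."""
    xi = np.linalg.lstsq(Theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold        # prune negligible candidate terms
        xi[small] = 0.0
        if (~small).any():                    # refit the surviving terms only
            xi[~small] = np.linalg.lstsq(Theta[:, ~small], dxdt, rcond=None)[0]
    return xi

# Synthetic trajectory of the toy system dx/dt = -2x (solution x = exp(-2t)).
t = np.linspace(0.0, 2.0, 200)
x = np.exp(-2.0 * t)
dxdt = np.gradient(x, t)                      # numerical derivative from the data

# Candidate function library: [1, x, x^2].
Theta = np.column_stack([np.ones_like(x), x, x**2])
xi = stlsq(Theta, dxdt)
print(xi)  # only the x-coefficient should survive, close to -2
```

The recovered coefficient vector reads off the governing equation directly; richer libraries (trigonometric terms, partial derivatives) extend the same idea to PDE discovery.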

Future directions

The UDE approach has significant implications for use with simulators and physical modeling, where the underlying mechanistic models are commonly differential equations. By directly utilizing mechanistic modeling simultaneously with universal approximator models, the UDE is a powerful semi-mechanistic approach allowing for arbitrary data-driven model extensions. In the context of simulators, this means a synergistic model of domain expertise and real-world data that more faithfully represents the true data-generating process of the system.


What we've described is a transformative approach for ML-augmented scientific modeling. That is,

1. The practitioner identifies known parts of a model and builds a UDE – when using probabilistic programming (an SI engine motif), this step can be done at a high level of abstraction, where the user does not need to write custom inference algorithms.

2. Train an NN (or other surrogate model such as a Gaussian process) to capture the missing mechanisms – one may look to NAS and Bayesian optimization approaches to do this in an automated, uncertainty-aware way.

3. The missing terms can be sparse-identified into mechanistic terms – this is an active area of research and much verification of this concept is needed, as mentioned in the example above.

4. Verify that the recovered mechanisms are scientifically sane – for future work, how can we better enable this with human-machine teaming?

5. Verify quantitatively: extrapolate, do asymptotic analysis, run posterior predictive checks, predict bifurcations.

6. Gather additional data to validate^iii the new terms.
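Steps 2 and 3 of this workflow can be shown in miniature: below we pretend only part of a simple ODE's right-hand side is known and identify the missing term from data. The system, its candidate library, and the single thresholding pass are hypothetical simplifications for illustration (a plain least-squares fit stands in for the full UDE training machinery).

```python
import numpy as np

# Ground truth: dx/dt = -x + 0.3*x**2, whose exact solution for x(0) = 1 is
# x(t) = 1 / (0.3 + 0.7*exp(t)).  We pretend only the mechanistic term -x is
# known and recover the "missing" quadratic term from data alone.
t = np.linspace(0.0, 3.0, 2000)
x = 1.0 / (0.3 + 0.7 * np.exp(t))
dxdt = np.gradient(x, t)                     # derivative estimated from data

residual = dxdt - (-x)                       # subtract the known mechanism

# Sparse-identify the residual against a library of candidate terms.
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])
xi = np.linalg.lstsq(Theta, residual, rcond=None)[0]
xi[np.abs(xi) < 0.05] = 0.0                  # one thresholding pass for the sketch
print(xi)  # ideally only the x^2 coefficient survives, near 0.3
```

Because the search is restricted to the residual of the known mechanism, the candidate space is far smaller than if the whole right-hand side had to be discovered, which is the essence of the narrowing described in the example above.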

Providing the tools for this semi-mechanistic modeling workflow can be immensely valuable for enabling scientists to make the best use of domain knowledge and data, and is precisely what the SI stack can deliver – notably with the differentiable programming and probabilistic programming "engine" motifs. Even more, building in a unified framework that is purpose-built for SI provides extensibility, for instance to integrate recent graph neural network approaches from Cranmer et al. [123] that can recover the governing equations in symbolic form from learned physics-informed models, or the "AI Feynman" [124, 125] approaches based on traditional fitting techniques in coordination with neural networks that leverage physics properties such as symmetries and separability in the unknown dynamics function(s) – key features such as NN equivariance and normalizing flows (and how they may fit into the stack) are discussed later.

3. SIMULATION-BASED INFERENCE

Numerical simulators are used across many fields of science and engineering to build computational models of complex phenomena. These simulators are typically built by incorporating scientific knowledge about the mechanisms which are known (or assumed) to underlie the process under study. Such mechanistic models have often been extensively studied and validated in the respective scientific domains. In complexity, they can range from extremely simple models that have a conceptual or even pedagogical flavor (e.g., the Lotka-Volterra equations describing predator-prey interactions in ecological systems and also economic theories [126], expressed in Fig. 21) to extremely detailed and expensive simulations run on supercomputers, e.g., whole-brain simulations [127].

A common challenge – across scientific disciplines and levels of model complexity – is the question of how to link such simulation-based models with empirical data. Numerical simulators typically have some parameters whose exact values are not known a priori and have to be inferred from data. For reasons detailed below, classical statistical approaches cannot readily be applied to models defined by numerical simulators. The field of simulation-based inference (SBI) [1] aims to address this challenge by designing statistical inference procedures that can be applied to complex simulators. Building on foundational work from the statistics community (see [128] for an overview), SBI is starting to bring together work from multiple fields – including, e.g., population genetics, neuroscience, particle physics, cosmology, and astrophysics – which face the same challenges and use tools from machine learning to address them. SBI can provide a unifying language, and common tools [129, 130, 131] and benchmarks [132] are being developed and generalized across different fields and applications.

Why is it so challenging to constrain numerical simulations by data? Many numerical simulators have stochastic components, which are included either to provide a verisimilar model of the system under study, if the system is believed to be stochastic itself, or often pragmatically to reflect incomplete knowledge about some of its components. Linking such stochastic models with data falls within the domain of statistics, which aims to provide methods for constraining the parameters of a model by data, approaches for selecting between different candidate models, and criteria for determining whether a hypothesis can be rejected on grounds of empirical evidence. In particular, statistical inference aims to determine which parameters – and combinations of parameters – are compatible with empirical data and (possibly) a priori assumptions. A key ingredient of most statistical procedures is the likelihood p(x|θ) of data x given parameters θ. For example, Bayesian inference characterizes parameters which are compatible both with data and prior by the posterior distribution p(θ|x), which is proportional to the product of likelihood and prior,

iii Note we use "verify" and "validate" specifically, as there is an important difference between verification and validation (V&V): verification asks "are we building the solution right?" whereas validation asks "are we building the right solution?" [122].


[Figure 9 graphic. Panel labels: mechanistic model; prior; data or summary data; simulated data; neural density estimator; posterior (over parameter 1, parameter 2); consistent and inconsistent samples; axes in ms, mV, and probability.]

Figure 9: The goal of simulation-based inference (SBI) is to algorithmically identify parameters of simulation-based models which are compatible with observed data and prior assumptions. SBI algorithms generally take three inputs (left): a candidate mechanistic model (e.g., a biophysical neuron model), prior knowledge or constraints on model parameters, and observational data (or summary statistics thereof). The general process shown is to (1) sample parameters from the prior, followed by simulating synthetic data from these parameters; (2) learn the (probabilistic) association between data (or data features) and the underlying parameters (i.e., to learn statistical inference from simulated data), for which different SBI methods (discussed in the text) such as neural density estimation [133] can be used; (3) apply the learned model to empirical data to derive the full space of parameters consistent with the data and the prior, i.e., the posterior distribution. Posterior distributions may have complex shapes (such as multiple modes), and different parameter configurations may lead to data-consistent simulations. If needed, (4) an initial estimate of the posterior can be used to adaptively generate additional informative simulations. (Illustration from [133])

p(θ|x) ∝ p(x|θ) p(θ). Frequentist inference procedures typically construct confidence regions based on hypothesis tests, often using the likelihood ratio as the test statistic.

However, for many simulation-based models one can easily sample from the model (i.e., generate synthetic data x ∼ p(x|θ)), but evaluating the associated likelihoods can be computationally prohibitive – because, for instance, the same output x could result from a very large number of internal paths through the simulator, and integrating over all of them is prohibitive. More pragmatically, it might also be the case that the simulator is implemented in a "black-box" manner which does not provide access to its internal workings or states. If likelihoods cannot be evaluated, most conventional inference approaches cannot be used. The goal of simulation-based inference is to make statistical inference possible for so-called implicit models, which allow generating simulated data but not evaluating likelihoods.
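A standard toy instance of such an implicit model is the g-and-k distribution, which is defined through its quantile function: drawing samples is a one-liner, yet no closed-form density is available. The parameter values below are arbitrary illustrative choices.

```python
import numpy as np

def sample_g_and_k(theta, n, rng):
    """Sample the g-and-k distribution by applying its quantile function to
    standard-normal draws.  Sampling is trivial, but the density p(x|theta)
    has no closed form, making this an implicit model."""
    A, B, g, k = theta                        # location, scale, skewness, kurtosis
    z = rng.standard_normal(n)
    return A + B * (1.0 + 0.8 * np.tanh(g * z / 2.0)) * z * (1.0 + z**2) ** k

rng = np.random.default_rng(0)
x = sample_g_and_k((3.0, 1.0, 2.0, 0.5), n=10_000, rng=rng)
print(x.mean(), x.std())  # cheap synthetic data, yet no likelihood to evaluate
```

Any inference for such a model must work from simulated samples alone, which is exactly the setting SBI addresses.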

SBI is not a new idea. Simulation-based inference approaches have been studied extensively in statistics, typically under the heading of likelihood-free inference. An influential approach has been that of Approximate Bayesian Computation (ABC) [128, 134, 135]. In its simplest form, it consists of drawing parameter values from a proposal distribution, running the simulator for these parameters to generate synthetic outputs x ∼ p(x|θ), comparing these outputs against the observed data, and accepting the parameter values only if they are close to the observed data under some distance metric, ‖x − x_observed‖ < ε. After following this procedure repeatedly, the accepted samples approximately follow the posterior. A second class of methods approximates the likelihood by sampling from the simulator and estimating the density in the sample space with kernel density estimation or histograms. This approximate density can then be used in lieu of the exact likelihood in frequentist or Bayesian inference techniques [136].
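The simplest rejection-ABC loop just described can be sketched as follows, here inferring the mean of a Gaussian with the sample mean as summary statistic. The prior, tolerance ε, and simulation budget are illustrative choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n=50):
    # Stochastic "black-box" simulator: we only draw samples from it and
    # pretend its likelihood cannot be evaluated.
    return rng.normal(theta, 1.0, size=n)

x_obs = simulator(2.0)                       # observed data, true theta = 2

eps, accepted = 0.1, []
for _ in range(20_000):
    theta = rng.uniform(-5.0, 5.0)           # draw from the (uniform) prior
    x_sim = simulator(theta)
    # Accept only if the simulated summary is epsilon-close to the observed one.
    if abs(x_sim.mean() - x_obs.mean()) < eps:
        accepted.append(theta)

posterior = np.array(accepted)
print(len(posterior), posterior.mean())      # accepted samples approximate p(theta|x)
```

Note the inefficiency: the vast majority of the 20,000 simulations are rejected, which is precisely the sample-efficiency problem discussed next.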

Both of these methods enable approximate inference in the likelihood-free setting, but they suffer from certain shortcomings: in the limit of a strict ABC acceptance criterion (ε → 0) or small kernel size, the inference results become exact, but the sample efficiency is reduced (the simulation has to be run many times). Relaxing the acceptance criterion or increasing the kernel size improves the sample efficiency, but reduces the quality of the inference results. The main challenge, however, is that these methods do not scale well to high-dimensional data, as the number of required simulations grows approximately exponentially with the dimension of the data x. In both approaches, the


Figure 10: Various simulation-based inference workflows (or prototypes) are presented in Cranmer et al. [1]. Here we show four main workflows (or templates) of simulation-based inference: the left represents Approximate Bayesian Computation (ABC) approaches, and then to the right are three model-based approaches for approximating likelihoods, posteriors, and density ratios, respectively. Notice that all include algorithms that use the prior distribution to propose parameters (green), as well as algorithms for sequentially adapting the proposal (purple) – i.e., steps (1) and (4) shown in Fig. 9. (Figure reproduced from Ref. [132])

raw data is therefore usually first reduced to low-dimensional summary statistics. These are typically designed by domain experts with the goal of retaining as much information on the parameters θ as possible. In many cases the summary statistics are not sufficient, and this dimensionality reduction limits the quality of inference or model selection [137]. Recently, new methods for learning summary statistics in a fully [138, 139] or semi-automatic manner [140] are emerging, which might alleviate some of these limitations.

The advent of deep learning has powered a number of new simulation-based inference techniques. Many of these methods rely on the key principle of training a neural surrogate for the simulation. Such models are closely related to the emulator models discussed in the previous section, but are not geared towards efficient sampling. Instead, we need to be able to access the surrogate's likelihood [141, 142] (or the related likelihood ratio [143, 144, 145, 146, 147, 148]) or the posterior [149, 150, 151, 152]. After the surrogate has been trained, it can be used during frequentist or Bayesian inference instead of the simulation. On a high level, this approach is similar to the traditional method based on histograms or kernel density estimators [136], but modern ML models and algorithms allow it to scale to higher-dimensional and potentially structured data.
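To make the idea of a trained surrogate used "instead of the simulation" concrete, the sketch below amortizes inference by regressing parameters on simulated summary statistics. A linear least-squares fit deliberately stands in for the neural density estimators used in practice, and it yields only a point estimate of the posterior mean rather than a full posterior; every modeling choice here is a toy assumption, not a method from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training set": parameters drawn from the prior, data simulated from them.
thetas = rng.uniform(-5.0, 5.0, size=5_000)
xs = rng.normal(thetas[:, None], 1.0, size=(5_000, 20))   # one dataset per theta
s = xs.mean(axis=1)                                       # summary statistic

# Fit theta as a function of the summary statistic (the "surrogate").
A = np.column_stack([np.ones_like(s), s])
w = np.linalg.lstsq(A, thetas, rcond=None)[0]

# Amortized inference: a single cheap evaluation for new observed data,
# with no further simulator calls.
x_obs = rng.normal(1.5, 1.0, size=20)                     # true parameter 1.5
theta_hat = w[0] + w[1] * x_obs.mean()
print(theta_hat)  # point estimate of the posterior mean, near 1.5
```

The structure — simulate once, train once, then infer cheaply for any new observation — is the same whether the regressor is a straight line or a normalizing flow.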

The impressive recent progress in SBI methods does not stem from deep learning alone. Another important theme is active learning: running the simulator and inference procedure iteratively and using past results to improve the proposal distribution of parameter values for the next runs [141, 142, 150, 151, 153, 154, 155, 156, 157, 158]. This can substantially improve the sample efficiency. Finally, in some cases simulators are not just black boxes: we may have access to (part of) their latent variables and mechanisms, or probabilistic characteristics of their stack trace. In practice, such information can be made available through domain-specific knowledge or by implementing the simulation in a framework that supports differentiable or probabilistic programming – i.e., the SI engine. If it is accessible, such data can substantially improve the sample efficiency with which neural surrogate models can be trained, reducing the required compute [159, 160, 161, 162, 163]. On a high level, this represents a tighter integration of the inference engine with the simulation [164].
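A minimal two-round version of such sequential proposal adaptation: fit a Gaussian proposal to the first round's accepted parameters, then rerun with a tighter tolerance. All numbers are illustrative, and the simple Gaussian fit stands in for the learned proposals of the cited methods.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulator(theta, n=50):
    return rng.normal(theta, 1.0, size=n)

x_obs_mean = 2.0                             # observed summary, fixed for the sketch

def abc_round(propose, eps, n_sims):
    # One round of rejection ABC with a pluggable proposal distribution.
    accepted = [theta for theta in (propose() for _ in range(n_sims))
                if abs(simulator(theta).mean() - x_obs_mean) < eps]
    return np.array(accepted)

# Round 1: broad prior proposal with a loose tolerance.
r1 = abc_round(lambda: rng.uniform(-5.0, 5.0), eps=1.0, n_sims=2_000)

# Round 2: adapt the proposal to round-1 acceptances and tighten the tolerance,
# so far fewer simulations are wasted on hopeless parameter values.
mu, sd = r1.mean(), r1.std()
r2 = abc_round(lambda: rng.normal(mu, sd), eps=0.1, n_sims=2_000)
print(len(r1), len(r2), r2.mean())
```

With a fixed simulation budget per round, concentrating proposals where earlier rounds found agreement is what drives the sample-efficiency gains described above.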

These components – neural surrogates for the simulator, active learning, and the integration of simulation and inference – can be combined in different ways to define workflows for simulation-based inference, in both the Bayesian and frequentist settings. We show some example inference workflows in Fig. 10. The optimal choice of workflow depends on the characteristics of the problem, in particular on the dimensionality and structure of the observed data and the parameters, whether a single data point or multiple i.i.d. draws are observed, the computational complexity of the simulator, and whether the simulator admits access to its latent process.

Examples

Simulation-based inference techniques have