Multi-Language Probabilistic Programming
SAM STITES, Northeastern University, USA
JOHN M. LI, Northeastern University, USA
STEVEN HOLTZEN, Northeastern University, USA
There are many different probabilistic programming languages that are specialized to specific kinds of
probabilistic programs. From a usability and scalability perspective, this is undesirable: today, probabilistic
programmers are forced up-front to decide which language they want to use and cannot mix-and-match
different languages for handling heterogeneous programs. To rectify this, we seek a foundation for sound
interoperability for probabilistic programming languages: just as today's Python programmers can resort
to low-level C programming for performance, we argue that probabilistic programmers should be able
to freely mix different languages for meeting the demands of heterogeneous probabilistic programming
environments. As a first step towards this goal, we introduce MultiPPL, a probabilistic multi-language that
enables programmers to interoperate between two different probabilistic programming languages: one that
leverages a high-performance exact discrete inference strategy, and one that uses approximate importance
sampling. We give a syntax and semantics for MultiPPL, prove soundness of its inference algorithm, and
provide empirical evidence that it enables programmers to perform inference on complex heterogeneous
probabilistic programs and flexibly exploit the strengths and weaknesses of two languages simultaneously.
CCS Concepts: • Mathematics of computing → Bayesian computation; • Theory of computation → Denotational semantics.
ACM Reference Format:
Sam Stites, John M. Li, and Steven Holtzen. 2025. Multi-Language Probabilistic Programming. In .ACM, New
York, NY, USA, 50 pages. https://doi.org/XXXXXXX.XXXXXXX
1 INTRODUCTION
Scalable and reliable probabilistic inference remains a significant barrier for applying and using
probabilistic programming languages (PPLs) in practice. The core of the inference challenge is that
there is no universal approach: different kinds of inference strategies are specialized for different
kinds of probabilistic programs. For example, Stan's inference strategy is highly effective on
continuous and differentiable programs such as hierarchical Bayesian models, but struggles on
programs with high-dimensional discrete structure such as graph and network reachability [9].
On the other extreme, languages like Dice and ProbLog scale well on purely-discrete problems,
but the price for this scalability is that they must forego any support whatsoever of continuous
probability distributions [16, 23]. In an ideal world, a probabilistic programmer would not have to
commit to one language or the other: they could use a Dice-like language for high-performance
scalable inference on the discrete portion of the program, a Stan-like language for the portion to
which it is well-suited, and be able to transfer data and control-flow between these two languages
for heterogeneous programs.
This raises a key question: how should we orchestrate the handoff between two probabilistic
programming languages whose underlying semantics may be radically different and seemingly
incompatible? This question of sound language interoperability has been extensively explored in
the context of non-probabilistic languages [13, 31, 36, 39, 47, 60, 61], where the goal is to prove
properties such as type-soundness and termination in a multi-language setting. As a starting point,
Matthews and Findler [31] introduced an effective model for capturing the interaction between two
languages by language embedding: the syntax and operational semantics of a Scheme-like language
and an ML-like language are unified into a multi-language, and syntactic boundary terms are added
to mediate transfer of control and data between the two languages. Using this embedding approach,
they were able to establish the type-soundness of the multi-language. Their approach relied on a
careful enforcement of soundness properties on the boundaries, for instance inserting dynamic
guards or contracts to ensure soundness when typed ML values are transferred to untyped Scheme.
We introduce the notion of sound inference interoperability: whereas sound interoperability
of traditional languages ensures that multi-language programs are type-sound, sound inference
interoperability ensures that probabilistic multi-language programs correctly represent the intended
probability distribution. Our main goal will be to establish sound inference interoperability of two
PPLs: the first is called Disc and is similar to Dice, and the second is called Cont and it makes use
of importance-sampling-based inference. These two languages are a nice pairing: Disc provides
scalable exact discrete inference at the expense of expressivity, as it does not support continuous
random variables and unbounded loops. On the other hand, Cont provides significant expressivity
(it supports continuous random variables) at the cost of relying on importance-sampling-based
approximate inference. Following Matthews and Findler [31], we embed both Disc and Cont into
a multi-language we call MultiPPL. Together, these two languages cover a broad spectrum of
interesting probabilistic programs that are difficult to handle today. We will show in Section 2
and Section 4 that examples from networking and probabilistic graphical models benefit from the
ability to flexibly use different languages and inference algorithms within a unified multi-language.
Traditional multi-language semantics establishes sound interoperability by proving type-soundness
of the combined multi-language [31]. Analogously, we establish sound inference interoperability
between Disc and Cont, guaranteeing that well-typed MultiPPL programs correctly represent
the intended probability distribution. Our contributions are as follows:
• We introduce MultiPPL, a multi-language in the style of Matthews and Findler [31] that enables
interoperation between a discrete exact probabilistic programming language Disc and a
continuous approximate probabilistic programming language Cont.
• In Section 3 we construct two models of MultiPPL by combining appropriate semantic domains
for Disc and Cont programs: a high-level model capturing the probability distribution intended
by a given MultiPPL program, and a low-level model capturing the details of our particular
implementation strategy. We then prove that these two semantics agree, establishing correctness
of the implementation (Theorem 3.6). We identify two key requirements for ensuring sound inference
interoperability between exact and approximate programs: Disc programs must additionally
enforce sample consistency for ensuring Disc values pass safely into Cont, and Cont programs
must additionally perform importance weighting for ensuring the safety of Disc conditioning.
• In Section 4 we validate the practical effectiveness of MultiPPL through our provided implementation.
We evaluate the efficacy of MultiPPL by modeling complex independence structures
through real-world problems in the domain of networking and probabilistic graphical models.
We provide insights into MultiPPL's approach to probabilistic inference and characterize the
nuanced landscape that interoperation produces.
2 OVERVIEW
We argue that it is often the case that realistic probabilistic programs consist of sub-programs
that are best handled by different probabilistic programming languages. Consider for example the
packet arrival situation visualized in Fig. 1. In this example, at each time step, network packets
are arriving according to a Poisson distribution, a fairly standard setup in discrete-time queueing
theory [32]. Then, each packet is forwarded through the network, whose topology is visualized
as a directed graph. The goal is to query for various properties about the network's behavior:
[Figure 1 graphic: a small directed router network with nodes R1, R2, R3, and R4.]
1  function step() {
2    let lambda = uniform(0, 5) in
3    let numPackets = poisson(lambda) in
4    for i in 0..numPackets {
5      forwardPacket(R1)
6    }
7  }
Fig. 1. A small network and a fragment of a probabilistic program encoding of the packet arrival problem.
Disc Expressions  𝑀, 𝑁 ::= 𝑋 | true | false | 𝑀 ∧ 𝑁 | ¬𝑀
                        | ⟨⟩ | ⟨𝑀, 𝑁⟩ | fst 𝑀 | snd 𝑀
                        | ret 𝑀 | let 𝑋 be 𝑀 in 𝑁 | if 𝑒 then 𝑀 else 𝑁
                        | flip 𝑒 | observe 𝑀 | L𝑒M𝐸
Types             𝐴, 𝐵 ::= unit | bool | 𝐴 × 𝐵
Contexts          Δ ::= 𝑋1 : 𝐴1, . . . , 𝑋𝑛 : 𝐴𝑛
Cont Expressions  𝑒 ::= 𝑥 | true | false | 𝑟 | 𝑒1 + 𝑒2 | −𝑒 | 𝑒1 · 𝑒2 | 𝑒1 ≤ 𝑒2
                      | () | (𝑒1, 𝑒2) | fst 𝑒 | snd 𝑒
                      | ret 𝑒 | let 𝑥 be 𝑒1 in 𝑒2 | if 𝑒1 then 𝑒2 else 𝑒3
                      | 𝑑 | obs(𝑒𝑜, 𝑑) | L𝑀M𝑆
Distributions     𝑑 ::= flip 𝑒 | uniform 𝑒1 𝑒2 | poisson 𝑒
Types             𝜎, 𝜏 ::= unit | bool | real | 𝜎 × 𝜏
Contexts          Γ ::= 𝑥1 : 𝜏1, . . . , 𝑥𝑛 : 𝜏𝑛
Number literals   𝑟 ∈ ℝ
Fig. 2. Syntax of MultiPPL. In Disc, we require 𝑒 ∈ [0, 1] for flip. In Cont, the syntax of distributions 𝑑 denotes probability distributions: inside obs these terms condition, otherwise they are immediately sampled.
for instance, the probability of a packet reaching the end of the network, or of a packet queue
overflowing. This example task is inspired by prior work on using probabilistic programming
languages to perform network verification [18, 53].
The situation in Fig. 1 is a small illustrative example of packet arrival, but programs like it are
extremely challenging for today's PPLs because they mix different kinds of program structure. Lines
2 and 3 manipulate continuous and countably-infinite-domain random variables, which precludes
the use of Dice. However, graph reachability and queue behavior are complex discrete distributions,
which are difficult for Stan due to their inherent non-differentiability and high-dimensional discrete
structure. In order to scale on this example, we would like to be able to use an inference algorithm
like Stan's for lines 2 and 3, and an inference algorithm like Dice's for lines 4–6.
Our approach to designing a language capable of handling situations like that described in
Fig. 1 is to enable the programmer to seamlessly transition between programming in two different
PPLs: Cont, an expressive language that supports sampling-based inference and continuous
random variables, and Disc, a restricted discrete-only language that supports scalable exact discrete
inference. Following Matthews and Findler [31], we describe a probabilistic multi-language that
embeds both languages into a single unified syntax: see Fig. 2. In Section 3.1 we discuss the
intricacies of the syntax in Fig. 2 in full detail, including typing judgments found in Fig. 7 and the
appendix; here we briefly note its high-level structure and discuss examples.
These languages delineate our two syntactic categories:
(1) Disc terms, shown in purple, that support discrete probabilistic operations such as Bernoulli
random variables and Bayesian conditioning. The syntax is standard for an ML-like functional
language with the addition of probabilistic constructs: flip 𝑒 introduces a Bernoulli random
variable that is true with probability 𝑒 ∈ [0, 1] and false otherwise; the construct observe 𝑀
conditions on 𝑀. Notably, Disc lacks introduction forms for continuous random variables or
real numbers, and so in order to define the Bernoulli-distributed random variable using flip, we
must rely on interoperation to construct our distribution.
(2) Cont terms, shown in orange, additionally support standard continuous operations and
sampling capabilities from two distributions inexpressible in Disc: a Uniform distribution
uniform 𝑒1 𝑒2 over the interval [𝑒1, 𝑒2], with 𝑒1, 𝑒2 ∈ ℝ, and a Poisson distribution poisson 𝑒 with
rate 𝑒 ∈ ℝ greater than zero. The syntax obs(𝑒𝑜, 𝑑) denotes conditioning on the event that a
sample drawn from distribution 𝑑 is equal to 𝑒𝑜.
Listing 1. TwoCoins
1  let 𝜃 be uniform 0 1 in
2  Llet 𝑋 be flip 𝜃 in
3    let 𝑌 be flip 𝜃 in
4    observe 𝑋 ∨ 𝑌 in
5    ret 𝑋M𝑆
Mediating between the Disc and Cont sublanguages are the boundaries L𝑒M𝐸 and L𝑀M𝑆: the
boundary L𝑒M𝐸 allows passing from Cont to Disc, and the boundary L𝑀M𝑆 allows passing from
Disc to Cont. This style of presentation is similar to Patterson [40].
Listing 1 shows an example program in our multi-language which passes a uniformly-sampled
real value 𝜃 from Cont into Disc and uses the resulting value as a prior for sampling two independent
Bernoulli random variables. The outer-most language is Cont. On Line 1, 𝜃 is bound to a sample
drawn from the uniform distribution on the unit interval. Then, on Lines 2–5, we begin evaluation
of a Disc program inside the boundary term L−M𝑆. We flip two coins 𝑋 and 𝑌 (Lines 2 and 3,
respectively) in the Disc sub-language, whose prior parameters are both 𝜃. On Line 4, we observe
that one of the two coins was true, taking advantage of syntactic sugar where observe is bound to
a discarded variable name. Line 5 brings us to the final line of our program, where we query for
the probability that 𝑋 is true. The next two sub-sections will explain our approach to bridging the
two languages.
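To make the query concrete, the following Python sketch (ours, not part of the MultiPPL implementation, and with the Disc fragment summed out analytically) estimates the posterior denoted by Listing 1 in the same spirit as MultiPPL's hybrid strategy: the Cont variable 𝜃 is sampled, the Disc fragment is handled exactly given 𝜃, and each sample is weighted by the probability of the observation.

```python
import random

def two_coins_estimate(n_samples=100_000):
    """Self-normalized estimate of P(X = true) for Listing 1 (TwoCoins)."""
    num, den = 0.0, 0.0
    for _ in range(n_samples):
        theta = random.uniform(0.0, 1.0)    # Cont: theta ~ uniform 0 1
        # Disc fragment, computed exactly given theta:
        p_obs = 1.0 - (1.0 - theta) ** 2    # P(X or Y | theta), the evidence weight
        p_x_and_obs = theta                 # P(X and (X or Y) | theta) = P(X | theta)
        num += p_x_and_obs
        den += p_obs
    return num / den

print(two_coins_estimate())  # converges to E[theta] / E[1 - (1 - theta)^2] = 0.5 / (2/3) = 0.75
```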
2.1 Disc and Cont inference
Before we describe the intricacies of language interoperation, we first provide some high-level
intuition for how we wish to perform inference on Disc and Cont independently. First, we give
a denotational semantics for MultiPPL, denoted J−K, which associates each MultiPPL term with
a probability distribution on MultiPPL values (see Section 3 for a formal definition of these
semantics). Here we will briefly illustrate these semantics by example: the semantics Jflip 𝑝K
produces a Bernoulli distribution that is true with probability 𝑝 ∈ [0, 1]; the semantics
Juniform 𝑒1 𝑒2K produces a uniform distribution on the interval [𝑒1, 𝑒2] ⊆ ℝ.
The goal of inference is to efficiently evaluate the denotation of a probabilistic program. While
Disc and Cont share a unified denotation, they have very different approaches to inference. The
key advantage of our multi-language approach is that we can specialize the design of Cont and Disc
to take full advantage of structural differences between their underlying inference algorithms: for
Disc we will use an exact inference strategy based on knowledge compilation similar to Dice [23],
and for Cont we will rely on approximate inference via sampling. In the next two subsections we
give a high-level overview of these standard approaches.
2.1.1 Exact inference via knowledge compilation. Here, we illustrate the principles of exact inference
in Disc via example; Section 3 provides a formal treatment of these semantics. In Fig. 3a, we
reproduce the Disc program compiled in Lines 3–5 of Listing 1, but instantiate the priors of 𝑋 and
𝑌 with numeric literals 0.4 and 0.3, respectively.
1  let 𝑋 be flip 0.4 in
2  let 𝑌 be flip 0.3 in
3  observe 𝑋 ∨ 𝑌 in
4  ret 𝑋
(a) Example Disc program.

[Figure 3b graphic: two BDDs over the Boolean variables 𝑓𝑋 and 𝑓𝑌, with leaves T and F.]
(b) BDD representations of formulas.

Fig. 3. Motivating example showing the compilation of the Disc program in 3a to BDDs in 3b. On the left of 3b is a BDD representing the distribution formula 𝜑 = 𝑓𝑋; on the right is the BDD representing the accepting formula 𝛼 = 𝑓𝑋 ∨ 𝑓𝑌. T and F represent true and false values, respectively.
Our example in Fig. 3a denotes the posterior distribution of the Bernoulli-0.4 variable 𝑋, given that at
least one of the two weighted coin flips is true; its semantics is JFig. 3aK(true) = 0.4/0.58 ≈ 0.689.
The exact inference strategy used by Disc is to perform probabilistic inference via weighted
model counting [10, 14, 46], following Holtzen et al. [23]. The key idea is to interpret the probabilistic
program as a weighted Boolean formula whose models are in one-to-one correspondence with
paths through the program, and where each path is associated with a weight that matches the
probability of that path in the program. Concretely, a weighted Boolean formula is a pair (𝜑, 𝑤)
where 𝜑 is a Boolean formula and 𝑤 is a weight function that associates literals (assignments to
variables) in 𝜑 with real-valued weights. Then, the weighted model count of a weighted Boolean
formula is the weighted sum of models:

    WMC(𝜑, 𝑤) = ∑_{𝑚 ⊨ 𝜑} ∏_{ℓ ∈ 𝑚} 𝑤(ℓ).    (1)
To perform Disc inference by reduction to weighted model counting, we associate each Disc
program with a pair of Boolean formulae in a manner similar to Holtzen et al. [23]: (1) an accepting
formula 𝛼 that encodes the paths through the program that do not violate observations; and (2)
a distribution formula 𝜑 such that WMC(𝜑 ∧ 𝛼) is the unnormalized probability of the program
returning true. For instance, we would compile Fig. 3a into distribution formula 𝜑 = 𝑓𝑋 and
accepting formula 𝛼 = 𝑓𝑋 ∨ 𝑓𝑌, where 𝑓𝑋 is a Boolean variable that represents the outcome of
flip 0.4 and 𝑓𝑌 represents the outcome of flip 0.3. Then, the weight function is 𝑤(𝑓𝑋) = 0.4,
𝑤(¬𝑓𝑋) = 0.6, 𝑤(𝑓𝑌) = 0.3, 𝑤(¬𝑓𝑌) = 0.7. Then, we can compute the semantics of Fig. 3a by
performing weighted model counting:

    JFig. 3aK(true) = WMC(𝜑 ∧ 𝛼, 𝑤) / WMC(𝛼, 𝑤) = WMC(𝑓𝑋, 𝑤) / WMC(𝑓𝑋 ∨ 𝑓𝑌, 𝑤) = 0.4 / (0.4 + 0.6 · 0.3) = 0.4 / 0.58 ≈ 0.689
The weighted model counting task is well-studied, and there is an array of high-performance
implementations for solving it [10, 23, 46]. One approach that is particularly effective is knowledge
compilation, which compiles the Boolean formula into a representation for which weighted model
counting can be performed efficiently (typically, polynomial-time in the size of the compiled
representation). A common target for this compilation process is binary decision diagrams (BDDs),
shown in Fig. 3b. A BDD is a rooted DAG whose internal nodes are labeled with Boolean variables
and whose leaves are labeled with either true or false values. A BDD is read top-down: solid edges
denote true assignments to variables, and dashed edges denote false assignments. Once a Boolean
formula is compiled to a BDD, inference can be performed in polynomial time (in the size of the
BDD) by performing a bottom-up traversal of the DAG.
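As a concrete illustration of Eq. (1) (ours, using brute-force enumeration rather than BDD compilation), the following Python sketch computes the weighted model counts for the formulas compiled from Fig. 3a:

```python
from itertools import product

# Weight function for the flips in Fig. 3a: w maps each literal to its weight.
w = {("fX", True): 0.4, ("fX", False): 0.6,
     ("fY", True): 0.3, ("fY", False): 0.7}

def wmc(phi, w, variables=("fX", "fY")):
    """Weighted model count by enumeration (Eq. 1): sum over models of phi of
    the product of literal weights."""
    total = 0.0
    for assignment in product([True, False], repeat=len(variables)):
        model = dict(zip(variables, assignment))
        if phi(model):
            weight = 1.0
            for var, val in model.items():
                weight *= w[(var, val)]
            total += weight
    return total

alpha = lambda m: m["fX"] or m["fY"]             # accepting formula: fX ∨ fY
phi_and_alpha = lambda m: m["fX"] and alpha(m)   # distribution formula ∧ accepting formula

# Posterior P(X = true | X ∨ Y) = WMC(phi ∧ alpha, w) / WMC(alpha, w)
print(wmc(phi_and_alpha, w) / wmc(alpha, w))     # 0.4 / 0.58 ≈ 0.689
```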
1  let 𝑥 be flip 0.20 in
2  Llet 𝑌 be flip 0.25 in
3    observe L𝑥M𝐸 ∨ 𝑌 in
4    ret 𝑌M𝑆
(a) Motivating example.

1  Llet 𝑌 be flip 0.25 in
2    observe true ∨ 𝑌 in
3    ret 𝑌M𝑆
(b) Sampled 𝑥 = true.

1  Llet 𝑌 be flip 0.25 in
2    observe false ∨ 𝑌 in
3    ret 𝑌M𝑆
(c) Sampled 𝑥 = false.

Fig. 4. Interpreting Cont values in Disc.
While highly effective for discrete probabilistic inference tasks with finite domains, inference via
knowledge compilation has a critical weakness: it cannot support continuous random variables or
unbounded discrete random variables, due to the requirement that each program be associated with
a (finite) Boolean formula. Hence, the design of Disc must be carefully restricted to only permit
programs that can be compiled to Boolean formulae, which is why it does not contain syntactic
support for these features.
2.1.2 Approximate inference via sampling. A powerful alternative to exact inference is approximate
inference via sampling. The engine that drives sampling-based inference is the expectation estimator.
The expectation estimator is widely used as a foundation for approximate inference strategies for
probabilistic programs [9, 11, 29, 33, 43, 55]. We will use it to give an inference algorithm for Cont.
Concretely, suppose we want to use the expectation estimator to approximate the semantics of the
Cont program Jflip 1/4K. To do this, we can draw 𝑁 = 100 samples from the program: in roughly
1/4 of these samples, the program will output true. This approach is known as direct sampling, and
is one way of utilizing the expectation estimator to design approximate inference algorithms.
Formally, let Ω be a sample space, Pr a probability density function, and let 𝑋 : Ω → ℝ be
a real-valued random variable out of the sample space. Then, the expectation of 𝑋 is defined as
E_Pr[𝑋] = ∫ Pr(𝜔) 𝑋(𝜔) 𝑑𝜔. The expectation estimator approximates the expectation of a random
variable 𝑋 by drawing 𝑁 samples from Pr:

    E_Pr[𝑋] ≈ (1/𝑁) ∑_{𝑥 ∼ Pr}^{𝑁} 𝑋(𝑥).    (2)
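For instance, a minimal Python sketch (ours) of this direct-sampling estimator applied to the Cont program flip 1/4:

```python
import random

def estimate_flip_quarter(n=100):
    """Direct-sampling estimate of E[flip 1/4] (Eq. 2): draw N samples and average."""
    draws = [1.0 if random.random() < 0.25 else 0.0 for _ in range(n)]
    return sum(draws) / n

print(estimate_flip_quarter())  # roughly 0.25; the sampling error shrinks as n grows
```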
There are many more advanced approaches to sampling-based inference beyond direct sampling,
such as Hamiltonian Monte Carlo [9, 34]; at their core, all these approximate inference algorithms
follow the same principle of drawing some number of samples from the program and using that to
estimate the semantics.
When compared with the exact inference strategy described in Section 2.1.1, sampling-based
inference has the key advantage that it only requires the ability to sample from the probabilistic
program: each time a random quantity is introduced, it can be dealt with by eagerly sampling. This
makes sampling an ideal inference algorithm for implementing flexible and expressive languages
with many features: unlike Disc, it is straightforward to add interesting features like continuous
random variables and unbounded loops to Cont without wholesale redesigning of its inference
algorithm. This gain in expressivity comes at the cost of precision: unlike Disc, Cont is only able
to provide an approximation to the final expectation.
2.2 Sound interoperation
Now we move on to our main goal, establishing sound interoperation between the underlying
inference strategies of Disc and Cont by identifying two key invariants that must be maintained
when transporting Disc and Cont values across boundaries: importance weighting and sample
consistency. At first, there appears to be a straightforward way of establishing interoperation
between these two languages: when a Cont value 𝑣𝑠 is interpreted in a Disc context, it is lifted into
a Dirac-delta distribution on 𝑣𝑠. Figure 4a gives an illustration of this scenario: first, on Line 1 we
sample a value (either true or false) for 𝑥. Then, on Line 3 we interpret 𝑥 within an exact context.
Figures 4b and 4c show the two possible liftings: the sampled value is given its straightforward
interpretation in the exact context. When a Disc value 𝑣𝑒 is interpreted in a Cont context, one can
draw a sample 𝑣𝑠 from the exact distribution denoted by 𝑣𝑒. However, we will show in the next
two sub-sections that a naive approach that fails to preserve key invariants will result in incorrect
inference results, and that one must maintain careful invariants in order to ensure soundness of
inference across boundaries.
2.2.1 Importance weighting. Let us more carefully consider the situation shown in Fig. 4a. First,
we observe that the desired semantics is JFig. 4aK(true) = 0.25/0.4 = 0.625. Suppose we were to
follow a naive multi-language inference procedure of drawing 100 samples by eagerly evaluating
values for 𝑥. Following Section 2.1.2, approximately 20 of these samples will yield the program in
Fig. 4b and approximately 80 will yield the program in Fig. 4c. Observe that JFig. 4bK(true) = 0.25
and JFig. 4cK(true) = 1. So our naive estimate of JFig. 4aK(true) would be:

    JFig. 4aK(true) ≈? (20/100) · JFig. 4bK(true) + (80/100) · JFig. 4cK(true) = 0.85    (3)
Something went wrong: we expected the result of Eq. (3) to be 0.625. This naive sampling
approach significantly over-estimated JFig. 4aK(true). The issue is that, in this naive approach, the
observation that occurs on Line 3 (Fig. 4a) is not taken into account when sampling a value for 𝑥:
samples where 𝑥 = true are under-sampled relative to their true probability, and samples where
𝑥 = false are over-sampled.
This example illustrates that a naive approach to interoperation is unsound. To fix it, one approach
is to adjust the relative importance of each sample: we will still sample 𝑥 = true roughly 20% of
the time, but we will decrease the overall importance of this sample. The key idea comes from
importance sampling, which is a refinement of the expectation estimator given in Eq. (2) but enables
estimating an expectation E𝑝[𝑋] by sampling from a proposal distribution 𝑞:

    E𝑝[𝑋] = ∫ 𝑋(𝑥) 𝑝(𝑥) 𝑑𝑥 = ∫ 𝑋(𝑥) (𝑝(𝑥)/𝑞(𝑥)) 𝑞(𝑥) 𝑑𝑥 = E𝑞[𝑋(𝑥) 𝑝(𝑥)/𝑞(𝑥)].    (4)
The above holds as long as the proposal 𝑞 supports 𝑝 (i.e., satisfies the property that, for all 𝑥,
if 𝑝(𝑥) > 0 then 𝑞(𝑥) > 0). The ratio 𝑝(𝑥)/𝑞(𝑥) is called the importance weight of the sample 𝑥:
intuitively, if 𝑥 is more likely according to the true distribution 𝑝 than the proposal 𝑞, the importance
ratio will be greater than 1; similarly, if 𝑥 is less likely according to 𝑝 than 𝑞, its weight will be less
than 1. In this instance, the proposal 𝑞 is the semantics of the program with all observe statements
deleted, and 𝑝 is JFig. 4aK.
Unfortunately, already having access to 𝑝 defeats the purpose of approximating. In addition, our
programs 𝑝 always incorporate a normalization constant 𝑍, such that

    𝑝(𝑥) = 𝑝̂(𝑥)/𝑍,    (5)

with 𝑝̂ being the unnormalized distribution. Integrating 𝑝̂ over the domain of 𝑝 yields
𝑍 = ∫ 𝑝̂(𝑥) 𝑑𝑥. Computing this normalization constant is expensive, and amounts to calculating 𝑝
directly. In our setting, calculating this normalization constant is identical to the denotation of
Line 3 in the exact setting. To avoid solving for this in our importance sampler, we can incorporate
Eq. (5) into our expectation Eq. (4) and jointly approximate our query alongside 𝑍:
    E𝑝[𝑋] = ∫ 𝑋(𝑥) 𝑝(𝑥) 𝑑𝑥 = ∫ 𝑋(𝑥) 𝑝̂(𝑥) 𝑑𝑥 / ∫ 𝑝̂(𝑥) 𝑑𝑥 = (∫ 𝑋(𝑥) (𝑝̂(𝑥)/𝑞(𝑥)) 𝑞(𝑥) 𝑑𝑥) / (∫ (𝑝̂(𝑥)/𝑞(𝑥)) 𝑞(𝑥) 𝑑𝑥) = E_{𝑥∼𝑞}[𝑋(𝑥) 𝑝̂(𝑥)/𝑞(𝑥)] / E_{𝑥∼𝑞}[𝑝̂(𝑥)/𝑞(𝑥)].    (6)
The above is called a self-normalized importance sampler [44]. Here, in the denominator, we
construct the normalizing constant for 𝑞 as the ratio of the unnormalized 𝑝̂ to 𝑞: this is Line 2
in Fig. 4b when 𝑥 = true and Line 2 in Fig. 4c when 𝑥 = false. Notice that the probabilities
of evidence encoded by the observe statements in Fig. 4b and Fig. 4c are Jtrue ∨ 𝑌K(true) = 1 and
Jfalse ∨ 𝑌K(true) = 0.25, respectively.
Sampling 100 draws of 𝑥, again, with 20 samples yielding the program in Fig. 4b and 80 samples
yielding the program in Fig. 4c, Eq. (6) now returns our expected result:

    JFig. 4aK(true) ≈ ((20/100) · 1 · JFig. 4bK(true) + (80/100) · 0.25 · JFig. 4cK(true)) / ((20/100) · 1 + (80/100) · 0.25) = (20 · 0.25 + 80 · 0.25) / (20 + 20) = 0.625
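The following Python sketch (ours, with the Disc fragments summed out exactly) carries out this self-normalized importance sampling scheme for the program in Fig. 4a: the proposal samples 𝑥 while ignoring the observation, and each trace is weighted by the probability of the evidence in its lifted program.

```python
import random

def fig4a_snis(n=100_000):
    """Self-normalized importance sampling (Eq. 6) for the program in Fig. 4a."""
    num, den = 0.0, 0.0
    for _ in range(n):
        x = random.random() < 0.20   # proposal: sample x ~ flip 0.20, ignoring the observe
        w = 1.0 if x else 0.25       # importance weight: probability of the evidence x ∨ Y
        p_y = 0.25 / w               # P(Y = true | x ∨ Y) in the lifted program (Fig. 4b or 4c)
        num += w * p_y
        den += w
    return num / den

print(fig4a_snis())  # converges to 0.25 / 0.4 = 0.625
```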
Listing 2. Sample consistency
1  let 𝑋 be flip 0.5 in
2  Llet 𝑦 be L𝑋M𝑆 in
3    let 𝑧 be L𝑋M𝑆 in
4    ret 𝑦 ∧ 𝑧M𝐸
2.2.2 Sample consistency. Importance weighting is not all that is necessary to ensure sound
interoperability: we must also ensure that Disc values are safely interpreted within a Cont context.
Consider the example in Listing 2. There are two observations to make about this program. The
first is that we embed a Cont program into a Disc context; this results in a sampler that evaluates
all Cont fragments while preserving the semantics of all Disc variables in order to produce a
sample. The next thing to notice is that a Disc program denotes a distribution; in the semantics of
Cont, when we come across a distribution, a sample is immediately drawn from it.
Again, we can propose a naive strategy for performing inference on this program: one where we
draw a new sample each time we encounter a distribution. Notice that Line 2 holds a reference 𝑋
to flip 0.5, denoting a Bernoulli distribution. When we evaluate this boundary, with probability
1/2 we sample 𝑦 = true; suppose we sample 𝑦 = true. We encounter this reference again on Line 3,
and suppose we sample 𝑧 = false. Finally, on Line 4, we evaluate the Boolean expression, resulting
in false, which is lifted into the Dirac-delta distribution in Disc. Running this program 𝑛 times, we
will see the expectation of 𝑦 ∧ 𝑧 with 𝑦 and 𝑧 as two independent draws of the fair Bernoulli
distribution. At this point, something strange has occurred: by referencing a single variable in
Disc, we have simulated two independent flips.
Intuitively, the sampled value for 𝑧 must be the same as the sampled value for 𝑦. Operationally,
to ensure this is the case, any sample drawn across the L−M𝑆 boundary additionally constrains the
Disc program's accepting criterion so that all subsequent samples remain consistent.
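Listing 2 makes the difference easy to see operationally. The Python sketch below (ours, with a deliberately simplified treatment of the boundary) contrasts a naive sampler that redraws 𝑋 at every L𝑋M𝑆 crossing with one that records the first drawn value and forces later crossings to agree with it; in the actual semantics (Section 3.4) this consistency is enforced not with a cache but by adding the event 𝑋⁻¹(𝑥) to the Disc program's conditioning set.

```python
import random

def naive_run():
    # Each boundary crossing re-samples the Disc variable X ~ flip 0.5: unsound.
    y = random.random() < 0.5
    z = random.random() < 0.5
    return y and z

def consistent_run():
    # The first crossing samples X; later crossings must agree with the recorded value.
    cache = {}
    def boundary(name, p):
        if name not in cache:
            cache[name] = random.random() < p
        return cache[name]
    y = boundary("X", 0.5)
    z = boundary("X", 0.5)
    return y and z

n = 100_000
print(sum(naive_run() for _ in range(n)) / n)       # about 0.25: two independent flips
print(sum(consistent_run() for _ in range(n)) / n)  # about 0.5: the intended semantics of X
```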
3 MULTIPPL: MULTI-LANGUAGE PROBABILISTIC PROGRAMMING
In this section we present MultiPPL, a multi-language that supports both exact and sampling-based
inference. Sections 3.1 and 3.2 describe the syntax of MultiPPL programs and MultiPPL's type
system. We then present two semantic models of MultiPPL. First, Section 3.3 presents a high-level
model HJ−K capturing the probability distribution generated by a MultiPPL program; this model
specifies the intended behavior of our implementation. Second, Section 3.4 presents a low-level
model LJ−K capturing our particular inference strategy, taking the intuition we have built up
in Section 2.2 and making precise the way in which our implementation combines knowledge
compilation with importance sampling. Finally, Section 3.5 connects these two models: we show
Judgment 𝐴 ↭ 𝜏:    unit ↭ unit    bool ↭ bool    if 𝐴 ↭ 𝜏 and 𝐵 ↭ 𝜎 then 𝐴 × 𝐵 ↭ 𝜏 × 𝜎

Fig. 5. Rules for convertibility between Disc types 𝐴 and Cont types 𝜏.
that LJ−K soundly refines HJ−K, establishing sound inference interoperability between Disc and
Cont with respect to our inference strategy.
3.1 Syntax
The syntax of MultiPPL is given in Fig. 2. MultiPPL is a union of two sublanguages, Disc and Cont,
that support exact and sampling-based inference. To streamline the presentation of the models in
Sections 3.3 and 3.4, each sublanguage is then subdivided into pure and effectful fragments.
The sampling-based sublanguage Cont is a first-order probabilistic programming language with
Booleans, tuples, and real numbers. In Cont, the pure fragment includes not only the basic operations
on Booleans and pairs, but also arithmetic operations on real numbers. The effectful fragment
additionally includes primitive operations uniform 𝑒1 𝑒2 for generating uniformly-distributed real
numbers in the interval [𝑒1, 𝑒2], poisson 𝑒 for generating Poisson-distributed integers with rate 𝑒,
and obs-expressions denoting conditioning operators for these distributions.
The exact sublanguage Disc, reminiscent of Dice [23], is a discrete first-order probabilistic
programming language with Booleans and tuples. The pure fragment of Disc includes the basic
operations on Booleans and pairs, while the effectful fragment includes constructs for sequencing
and branching, as well as the primitive operations flip 𝑒, for generating Bernoulli-distributed
Booleans with parameter 𝑒 of type real, and observe 𝑀, for conditioning on an event 𝑀.
The Disc branching construct if 𝑒 then 𝑀 else 𝑁 requires the guard 𝑒 to be a Cont term.
This is not an essential restriction, but rather is required for sound inference interoperability with
respect to the specific implementation strategy we have chosen. As sketched in Sections 2.1.1
and 2.1.2, standard sampling-based inference maintains a weight for the current trace, while exact
inference maintains a weight map and an accepting formula. In our implementation, we wanted a
language whose inference algorithm would stay as close to these traditional inference algorithms as
possible while avoiding incorrect weighting schemes. To do this while maintaining safe inference
interoperability, one must have the rather subtle invariant that if-then-else expressions in the Disc
sublanguage have then- and else-branches that importance-weight their respective traces by the
same amount. The syntactic restriction on if 𝑒 then 𝑀 else 𝑁 is a simple way of ensuring this is
always the case: probabilistic choice is removed, and only one branch need ever be considered. In
our implementation, we also permit if-then-else expressions where both branches are boundary-free
Disc programs, as exact inference for such programs can be performed just as in Holtzen et al. [23],
without touching the importance weight. These special cases could be avoided by maintaining
an auxiliary Boolean formula tracking a path condition, which encodes during inference the then-
and else-branches of if-then-else expressions taken to reach a given subterm. This would allow
arbitrary if-then-else expressions in the Disc sublanguage, at the expense of additional overhead
of maintaining this path condition during inference. In our design of MultiPPL, we have decided
to restrict the syntax of the language rather than impose a performance cost; in practice, this has
been sufficient to express all of the examples in Section 4.
Typing judgment Γ; Δ ⊢c 𝑀 : 𝐴 (effectful Disc terms):
  Δ ⊢ 𝑀 : 𝐴  ⟹  Γ; Δ ⊢c ret 𝑀 : 𝐴
  Γ; Δ ⊢c 𝑀 : 𝐴  and  Γ; Δ, 𝑋:𝐴 ⊢c 𝑁 : 𝐵  ⟹  Γ; Δ ⊢c let 𝑋 be 𝑀 in 𝑁 : 𝐵
  Γ ⊢ 𝑒 : bool  and  Γ; Δ ⊢c 𝑀 : 𝐴  and  Γ; Δ ⊢c 𝑁 : 𝐴  ⟹  Γ; Δ ⊢c if 𝑒 then 𝑀 else 𝑁 : 𝐴
  Γ ⊢ 𝑒 : real  ⟹  Γ; Δ ⊢c flip 𝑒 : bool
  Δ ⊢ 𝑀 : bool  ⟹  Γ; Δ ⊢c observe 𝑀 : unit
  Γ; Δ ⊢c 𝑒 : 𝜏  and  𝐴 ↭ 𝜏  ⟹  Γ; Δ ⊢c L𝑒M𝐸 : 𝐴

Typing judgment Γ; Δ ⊢c 𝑒 : 𝜏 (effectful Cont terms):
  Γ ⊢ 𝑒 : 𝜏  ⟹  Γ; Δ ⊢c ret 𝑒 : 𝜏
  Γ; Δ ⊢c 𝑒1 : 𝜎  and  Γ, 𝑥:𝜎; Δ ⊢c 𝑒2 : 𝜏  ⟹  Γ; Δ ⊢c let 𝑥 be 𝑒1 in 𝑒2 : 𝜏
  Γ ⊢ 𝑒1 : bool  and  Γ; Δ ⊢c 𝑒2 : 𝜏  and  Γ; Δ ⊢c 𝑒3 : 𝜏  ⟹  Γ; Δ ⊢c if 𝑒1 then 𝑒2 else 𝑒3 : 𝜏
  Γ ⊢ 𝑒 : real  ⟹  Γ; Δ ⊢c flip 𝑒 : bool
  Γ ⊢ 𝑒1 : real  and  Γ ⊢ 𝑒2 : real  ⟹  Γ; Δ ⊢c uniform 𝑒1 𝑒2 : real
  Γ ⊢ 𝑒 : real  ⟹  Γ; Δ ⊢c poisson 𝑒 : real
  Γ ⊢ 𝑒𝑜 : bool  and  Γ ⊢ 𝑒1 : real  ⟹  Γ; Δ ⊢c obs(𝑒𝑜, flip 𝑒1) : unit
  Γ ⊢ 𝑒𝑜 : real  and  Γ ⊢ 𝑒1 : real  and  Γ ⊢ 𝑒2 : real  ⟹  Γ; Δ ⊢c obs(𝑒𝑜, uniform 𝑒1 𝑒2) : unit
  Γ ⊢ 𝑒𝑜 : real  and  Γ ⊢ 𝑒 : real  ⟹  Γ; Δ ⊢c obs(𝑒𝑜, poisson 𝑒) : unit
  Γ; Δ ⊢c 𝑀 : 𝐴  and  𝐴 ↭ 𝜏  ⟹  Γ; Δ ⊢c L𝑀M𝑆 : 𝜏

Fig. 6. Typing rules for the effectful fragment of MultiPPL.
3.2 Typing
The syntax of types and typing contexts is given in Fig. 2. Disc types 𝐴 include Booleans and pairs;
Cont types 𝜏 additionally include a type of real numbers. A Disc typing context Δ is a mapping
of Disc variables to Disc types, and a Cont typing context Γ is a mapping of Cont variables to
Cont types. By convention we will denote Disc syntactic elements with capital letters and Cont
elements with lower-case Greek letters. This section is best read in color, where we use orange
monotype font for Cont terms and purple sans-serif font for Disc terms.
MultiPPL contains two sublanguages that each have a pure and effectful part, so there are
correspondingly four forms of typing judgment. For the pure fragments,
• Δ ⊢ 𝑀 : 𝐴 says the pure Disc term 𝑀 has Disc type 𝐴 in Disc context Δ.
• Γ ⊢ 𝑒 : 𝜏 says the pure Cont term 𝑒 has Cont type 𝜏 in Cont context Γ.
These judgments are standard and deferred to the appendix.
The typing judgments for effectful MultiPPL terms are parameterized by a combined context
Γ; Δ, as an effectful term may mention variables from both Disc and Cont via boundaries:
• Γ; Δ ⊢c 𝑀 : 𝐴 says the effectful Disc term 𝑀 has Disc type 𝐴 in combined context Γ; Δ.
JunitK = {★}    JboolK = {⊤, ⊥}    J𝐴 × 𝐵K = J𝐴K × J𝐵K
JunitK = {★}    JboolK = {⊤, ⊥}    JrealK = ℝ    J𝜎 × 𝜏K = J𝜎K × J𝜏K
JΔK = ∏_{𝑋 ∈ dom Δ} JΔ(𝑋)K    JΓK = ∏_{𝑥 ∈ dom Γ} JΓ(𝑥)K

Fig. 7. Interpreting types and typing contexts.
JΔ ⊢ 𝑀 : 𝐴K : JΔK → J𝐴K
  J𝑋K(𝛿) = 𝛿(𝑋)
  JtrueK(𝛿) = ⊤
  JfalseK(𝛿) = ⊥
  J𝑀 ∧ 𝑁K(𝛿) = ⊤ if J𝑀K(𝛿) = J𝑁K(𝛿) = ⊤, and ⊥ otherwise
  J¬𝑀K(𝛿) = ⊥ if J𝑀K(𝛿) = ⊤, and ⊤ otherwise
  J⟨⟩K(𝛿) = ★
  J⟨𝑀, 𝑁⟩K(𝛿) = (J𝑀K(𝛿), J𝑁K(𝛿))
  Jfst 𝑀K(𝛿) = 𝜋1(J𝑀K(𝛿))
  Jsnd 𝑀K(𝛿) = 𝜋2(J𝑀K(𝛿))

JΓ ⊢ 𝑒 : 𝜏K : JΓK → J𝜏K (measurable)
  J𝑥K(𝛾) = 𝛾(𝑥)
  JtrueK(𝛾) = ⊤
  JfalseK(𝛾) = ⊥
  J𝑟K(𝛾) = 𝑟
  J𝑒1 + 𝑒2K(𝛾) = J𝑒1K(𝛾) + J𝑒2K(𝛾)
  J−𝑒K(𝛾) = −J𝑒K(𝛾)
  J𝑒1 · 𝑒2K(𝛾) = J𝑒1K(𝛾) · J𝑒2K(𝛾)
  J𝑒1 ≤ 𝑒2K(𝛾) = ⊤ if J𝑒1K(𝛾) ≤ J𝑒2K(𝛾), and ⊥ otherwise
  J()K(𝛾) = ★
  J(𝑒1, 𝑒2)K(𝛾) = (J𝑒1K(𝛾), J𝑒2K(𝛾))
  Jfst 𝑒K(𝛾) = 𝜋1(J𝑒K(𝛾))
  Jsnd 𝑒K(𝛾) = 𝜋2(J𝑒K(𝛾))

Fig. 8. Interpreting pure terms.
• Γ; Δ ⊢c 𝑒 : 𝜏 says the effectful Cont term 𝑒 has Cont type 𝜏 in combined context Γ; Δ.
These judgments are defined in Fig. 6. Note that in the rule for flip 𝑒, the parameter 𝑒 can be
an arbitrary pure Cont term; this allows expressing the TwoCoins example from Section 2. In
principle, one could allow arbitrary effectful Cont programs 𝑒 as parameter to flip instead of
just pure ones, but we have not found this to be useful in practice. The typing judgments for the
boundaries L𝑒M𝐸 and L𝑀M𝑆 allow converting Disc terms of type 𝐴 into Cont terms of type 𝜏 and
vice versa, so long as 𝐴 and 𝜏 are convertible, written 𝐴 ↭ 𝜏. The convertibility relation is defined
in Fig. 5; it simply states that Disc types can be converted into their Cont counterparts in the
expected way, and that the Cont type real has no Disc counterpart.
3.3 High-level semantic model
This section defines a high-level model HJ−K of MultiPPL to serve as the definition of sound
inference interoperability for the MultiPPL multi-language.
Setting aside details of any particular inference strategy, a MultiPPL program •; • ⊢c 𝑒 : 𝜏 should
produce a conditional probability distribution over values of type 𝜏. Following standard techniques
for modelling probabilistic programs with conditioning [55], we interpret types and typing contexts
as measurable spaces, pure terms as measurable functions, and effectful terms via a suitable monad.
Fig. 7 gives the interpretations of types and typing contexts. Disc types denote finite discrete
measurable spaces and Cont types denote arbitrary measurable spaces. These interpretations are
then lifted to typing contexts in the usual way: a Disc context Δ denotes the measurable space
HJΓ; Δ ⊢c 𝑀 : 𝐴K : JΓK × JΔK → Distw J𝐴K
  HJret 𝑀K(𝛾, 𝛿) = ret(J𝑀K(𝛿))
  HJlet 𝑋 be 𝑀 in 𝑁K(𝛾, 𝛿) = 𝑥 ← HJ𝑀K(𝛾, 𝛿); HJ𝑁K(𝛾, 𝛿[𝑋 ↦ 𝑥])
  HJif 𝑒 then 𝑀 else 𝑁K(𝛾, 𝛿) = if J𝑒K(𝛾) then HJ𝑀K(𝛾, 𝛿) else HJ𝑁K(𝛾, 𝛿)
  HJflip 𝑒K(𝛾, 𝛿) = flip(J𝑒K(𝛾))
  HJobserve 𝑀K(𝛾, 𝛿) = score(1[J𝑀K(𝛿) = ⊤])
  HJL𝑒M𝐸K(𝛾, 𝛿) = HJ𝑒K(𝛾, 𝛿)

HJΓ; Δ ⊢c 𝑒 : 𝜏K : JΓK × JΔK → Distw J𝜏K
  HJret 𝑒K(𝛾, 𝛿) = ret(J𝑒K(𝛾))
  HJlet 𝑥 be 𝑒1 in 𝑒2K(𝛾, 𝛿) = 𝑥 ← HJ𝑒1K(𝛾, 𝛿); HJ𝑒2K(𝛾[𝑥 ↦ 𝑥], 𝛿)
  HJif 𝑒1 then 𝑒2 else 𝑒3K(𝛾, 𝛿) = if J𝑒1K(𝛾) then HJ𝑒2K(𝛾, 𝛿) else HJ𝑒3K(𝛾, 𝛿)
  HJflip 𝑒K(𝛾, 𝛿) = flip(J𝑒K(𝛾))
  HJuniform 𝑒1 𝑒2K(𝛾, 𝛿) = uniform(J𝑒1K(𝛾), J𝑒2K(𝛾))
  HJpoisson 𝑒K(𝛾, 𝛿) = poisson(J𝑒K(𝛾))
  HJobs(𝑒𝑜, flip 𝑒1)K(𝛾, 𝛿) = score(flip(J𝑒1K(𝛾))(J𝑒𝑜K(𝛾)))
  HJobs(𝑒𝑜, poisson 𝑒1)K(𝛾, 𝛿) = score(poisson(J𝑒1K(𝛾))(J𝑒𝑜K(𝛾)))
  HJobs(𝑒𝑜, uniform 𝑒1 𝑒2)K(𝛾, 𝛿) = score(uniform(J𝑒1K(𝛾), J𝑒2K(𝛾))(J𝑒𝑜K(𝛾)))
  HJL𝑀M𝑆K(𝛾, 𝛿) = HJ𝑀K(𝛾, 𝛿)

Fig. 9. Interpreting effectful terms. We use Haskell-style syntactic sugar for the usual monad operations.
of substitutions 𝛿 such that 𝛿(𝑋) ∈ JΔ(𝑋)K for all 𝑋 ∈ dom Δ, and a Cont context Γ denotes the
measurable space of substitutions 𝛾 such that 𝛾(𝑥) ∈ JΓ(𝑥)K for all 𝑥 ∈ dom Γ.
Fig. 8 gives the standard interpretations of pure terms [55]. Pure Disc terms Δ ⊢ 𝑀 : 𝐴 denote
functions J𝑀K : JΔK → J𝐴K, automatically measurable because every Disc type denotes a discrete
measurable space. Pure Cont terms Γ ⊢ 𝑒 : 𝜏 denote measurable functions J𝑒K : JΓK → J𝜏K.
Following Staton et al. [55], to interpret effectful terms we make use of the monad Distw 𝐴 =
Dist([0, 1] × 𝐴), obtained by combining the writer monad for the monoid ([0, 1], ×, 1) of weights
with the probability monad Dist [20]. Under this interpretation, a MultiPPL program •; • ⊢c 𝑒 : 𝜏
denotes a distribution over pairs (𝑤, 𝑣), where 𝑣 is a value of type 𝜏 produced by a particular run of
𝑒 and 𝑤 is the weight accumulated by both Cont and Disc observe expressions.
Fig. 9 interprets effectful MultiPPL terms using Distw. A Disc term Γ; Δ ⊢c 𝑀 : 𝐴 is interpreted as
a measurable function HJ𝑀K : JΓK × JΔK → Distw J𝐴K, and a Cont term Γ; Δ ⊢c 𝑒 : 𝜏 is interpreted
as a measurable function HJ𝑒K : JΓK × JΔK → Distw J𝜏K. To model the basic probabilistic operations,
the interpretation additionally makes use of the following primitives (a small sketch of their
guarded behavior follows the list):
• (−) : Dist(𝐴) → Distw(𝐴) lifts distributions on 𝐴 into Distw by setting 𝑤 = 1.
• score : [0, 1] → Distw{★} sends a weight 𝑤 to the Dirac distribution 𝛿(𝑤,★) centered at (𝑤, ★).
• For 𝑝 ∈ ℝ, flip(𝑝) is the Bernoulli distribution on {⊤, ⊥} with parameter 𝑝 if 𝑝 ∈ [0, 1] and the
Dirac distribution 𝛿⊥ otherwise.
• For 𝑎, 𝑏 ∈ ℝ, uniform(𝑎, 𝑏) is the uniform distribution on [𝑎, 𝑏] if 𝑎 ≤ 𝑏 and 𝛿min(𝑎,𝑏) otherwise.
• For 𝜆 ∈ ℝ, poisson(𝜆) is the Poisson distribution with rate 𝜆 if 𝜆 > 0 and 𝛿0 otherwise.
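Here is a small Python sketch (ours, not the paper's formalization) of the weighted-sample reading of Distw and of the guarded fallbacks above: a run yields a (weight, value) pair, score multiplies the weight, and out-of-range parameters fall back to Dirac distributions so the primitives stay total.

```python
import random

def flip(p):
    """Bernoulli(p) on {True, False} if p is in [0, 1], else Dirac at False."""
    if 0.0 <= p <= 1.0:
        return random.random() < p
    return False

def uniform(a, b):
    """Uniform on [a, b] if a <= b, else Dirac at min(a, b)."""
    return random.uniform(a, b) if a <= b else min(a, b)

def run_weighted(program):
    """One run of a Dist_w computation: a value together with an accumulated weight."""
    weight = [1.0]
    def score(w):
        weight[0] *= w
    value = program(score)
    return weight[0], value

# Example: a tiny weighted program that conditions with score and returns a flip.
def example(score):
    score(0.25)        # e.g., the density contributed by an obs(...) expression
    return flip(2.0)   # out-of-range parameter: falls back to Dirac at False

print(run_weighted(example))  # (0.25, False)
```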
Boundaries have no effect under this interpretation, reflecting the idea that changing one's inference
strategy should not change the inferred distribution; semantic values of Disc type are implicitly
coerced into semantic values of Cont type and vice versa, thanks to the following lemma:

Lemma 3.1 (natural embedding). If 𝐴 ↭ 𝜏 then J𝐴K = J𝜏K.

Proof. By induction on 𝐴 ↭ 𝜏. □
3.4 Low-level model
This section presents a low-level model LJ−K of MultiPPL, capturing the particular details of our
inference strategy.
The interpretations of types, typing contexts, and the pure fragment of MultiPPL are identical
to the ones given in Section 3.3. Where LJ−K differs from HJ−K is in the interpretation of effectful
terms. Key to this interpretation is the construction of a suitable semantic domain for interpreting
effectful terms in a way that faithfully reflects the details of our implementation. We construct this
semantic domain by combining models of exact and sampling-based inference.
Our model of sampling-based inference is entirely standard, making use of the monad Distw of
Section 3.3. This monad captures the fact that a sampler performs inference by drawing weighted
samples from the distribution defined by a probabilistic program [55].
Our model of exact inference, on the other hand, is novel. As explained in Section 2.1.1 and
documented in full detail in Holtzen et al. [23], exact inference via knowledge compilation performs
inference by maintaining two pieces of state: a weight map 𝑤 associating Boolean literals to
probabilities, and a Boolean formula 𝛼, called the accepting formula, that encodes the paths through
the program that do not violate observe statements. The final result of knowledge compilation is
itself a Boolean formula 𝜑; the posterior distribution can then be calculated by performing weighted
model counting on 𝜑 ∧ 𝛼 and 𝛼 with respect to the weight map 𝑤.
The defining trait of this knowledge compilation strategy is that it maintains an exact representation
of the underlying probability space throughout probabilistic program execution. At any given
moment during knowledge compilation, there is an underlying sample space: the space of models
over the collection of Boolean variables generated so far. The purpose of the weight map 𝑤 is to
represent a distribution over this sample space: the probability of a given model can be computed
by multiplying the weights of all of its literals. Together, the sample space and the weight map form
a probability space, which is statefully manipulated throughout the knowledge compilation process.
Upon each encounter of a flip command, the probability space grows: this is implemented by
generating a fresh Boolean variable to represent the result of the flip and extending 𝑤 accordingly.
The purpose of the accepting formula 𝛼 is to represent an event in this probability space: the event
consisting of those models that satisfy 𝛼. Upon each encounter of an observe command, this event
shrinks: this is implemented by conjoining the condition being observed onto 𝛼. Finally, the purpose
of the output formula 𝜑 is to represent a random variable, which is to say a Boolean-valued function
out of the sample space: the formula 𝜑 represents the random variable that takes the value ⊤ on those
models that satisfy 𝜑 and ⊥ otherwise.
What is essential about this setup is that it maintains a conditional probability space (Ω, 𝜇, 𝐸),
consisting of a sample space Ω (the space of models), a probability measure 𝜇 on it (represented by
the weight map), and an event 𝐸 denoting the result of all observe statements so far (represented
by the accepting formula), and that it produces random variables. The fact that these probability
spaces, events, and random variables are represented via weighted Boolean formulas, while crucial
for the efficiency of inference, is a detail of the implementation that is irrelevant to ensuring safe
inference interoperability. Because of this, our low-level semantics abstracts over these representation
concerns, choosing instead to work directly with probability spaces and random variables.
LJret 𝑀K(Ω)(𝛾, 𝐷) = ret(Ω, id, J𝑀K ∘ 𝐷)
LJlet 𝑋 be 𝑀 in 𝑁K(Ω)(𝛾, 𝐷) = (Ω1, 𝑓1, 𝑋) ← LJ𝑀K(Ω)(𝛾, 𝐷);
                              (Ω2, 𝑓2, 𝑌) ← LJ𝑁K(Ω1)(𝛾, (𝐷 ∘ 𝑓1)[𝑋 ↦ 𝑋]);
                              ret(Ω2, 𝑓1 ∘ 𝑓2, 𝑌)
LJif 𝑒 then 𝑀 else 𝑁K(Ω)(𝛾, 𝐷) = if J𝑒K(𝛾) then LJ𝑀K(Ω)(𝛾, 𝐷) else LJ𝑁K(Ω)(𝛾, 𝐷)
LJflip 𝑒K(Ω)(𝛾, 𝐷) = 𝑝 := if J𝑒K(𝛾) ∈ [0, 1] then J𝑒K(𝛾) else 0;
                    Ωflip := ({0, 1}, 𝜇, {0, 1}) where 𝜇(1) = 𝑝;
                    Ω′ := Ω ⊗ Ωflip;
                    𝑋 := 𝜔′ ↦ if 𝜋2(𝜔′) = 1 then ⊤ else ⊥;
                    ret(Ω′, 𝜋1, 𝑋)
LJobserve 𝑀K(Ω, 𝜇, 𝐸)(𝛾, 𝐷) = 𝐹 := (J𝑀K ∘ 𝐷)⁻¹(⊤);
                             score(𝜇|𝐸(𝐹));
                             ret((Ω, 𝜇, 𝐸 ∩ 𝐹), id, _ ↦ ★)
LJL𝑒M𝐸K(Ω)(𝛾, 𝐷) = (Ω′, 𝑓, 𝑥) ← LJ𝑒K(Ω)(𝛾, 𝐷); ret(Ω′, 𝑓, _ ↦ 𝑥)

LJret 𝑒K(Ω)(𝛾, 𝐷) = ret(Ω, id, J𝑒K(𝛾))
LJlet 𝑥 be 𝑒1 in 𝑒2K(Ω)(𝛾, 𝐷) = (Ω1, 𝑓1, 𝑥) ← LJ𝑒1K(Ω)(𝛾, 𝐷);
                               (Ω2, 𝑓2, 𝑦) ← LJ𝑒2K(Ω1)(𝛾[𝑥 ↦ 𝑥], 𝐷 ∘ 𝑓1);
                               ret(Ω2, 𝑓1 ∘ 𝑓2, 𝑦)
LJif 𝑒1 then 𝑒2 else 𝑒3K(Ω)(𝛾, 𝐷) = if J𝑒1K(𝛾) then LJ𝑒2K(Ω)(𝛾, 𝐷) else LJ𝑒3K(Ω)(𝛾, 𝐷)
LJflip 𝑒K(Ω)(𝛾, 𝐷) = 𝑥 ← flip(J𝑒K(𝛾)); ret(Ω, id, 𝑥)
LJuniform 𝑒1 𝑒2K(Ω)(𝛾, 𝐷) = 𝑥 ← uniform(J𝑒1K(𝛾), J𝑒2K(𝛾)); ret(Ω, id, 𝑥)
LJpoisson 𝑒K(Ω)(𝛾, 𝐷) = 𝑥 ← poisson(J𝑒K(𝛾)); ret(Ω, id, 𝑥)
LJobs(𝑒𝑜, flip 𝑒1)K(Ω)(𝛾, 𝐷) = score(flip(J𝑒1K(𝛾))(J𝑒𝑜K(𝛾))); ret(Ω, id, ★)
LJobs(𝑒𝑜, uniform 𝑒1 𝑒2)K(Ω)(𝛾, 𝐷) = score(uniform(J𝑒1K(𝛾), J𝑒2K(𝛾))(J𝑒𝑜K(𝛾))); ret(Ω, id, ★)
LJobs(𝑒𝑜, poisson 𝑒1)K(Ω)(𝛾, 𝐷) = score(poisson(J𝑒1K(𝛾))(J𝑒𝑜K(𝛾))); ret(Ω, id, ★)
LJL𝑀M𝑆K(Ω)(𝛾, 𝐷) = ((Ω′, 𝜇′, 𝐸′), 𝑓, 𝑋) ← LJ𝑀K(Ω)(𝛾, 𝐷);
                  𝑥 ← (𝜔′ ← 𝜇′|𝐸′; ret(𝑋(𝜔′)));
                  ret((Ω′, 𝜇′, 𝐸′ ∩ 𝑋⁻¹(𝑥)), 𝑓, 𝑥)

Fig. 10. Low-level interpretation of effectful MultiPPL terms. Parts crucial for sound inference interoperability are highlighted, appearing in the denotation of observe 𝑀 and L𝑀M𝑆. Best read in color.
Following Li et al. [27], we model Disc programs as statefully manipulating tuples (Ω, 𝜇, 𝐸) and
producing random variables 𝑋. For example, we model running the effectful Disc program

    𝑋 : bool ⊢c let 𝑌 be flip 1/2 in observe (𝑋 ∧ 𝑌) in ret 𝑋 : bool

given input probability space (Ω, 𝜇, 𝐸) and random variable 𝑋 : Ω → JboolK as follows:
• flip 1/2 expands the probability space from (Ω, 𝜇, 𝐸) to (Ω × JboolK, 𝜇 ⊗ Ber 1/2, 𝐸 × JboolK), and
produces the Boolean random variable 𝑌 : Ω × JboolK → JboolK defined by 𝑌(𝜔, 𝑏) = 𝑏. This is
implemented by generating a new Boolean variable representing 𝑌. Note that 𝑌 is defined in
terms of the new sample space Ω × JboolK. The function 𝜋1 : Ω × JboolK → Ω says how to convert
between the old sample space Ω and the new sample space Ω × JboolK: the random variable 𝑋,
defined in terms of the old space Ω, can be converted into a random variable 𝑋 ∘ 𝜋1 defined in
terms of the new space Ω × JboolK by precomposition with 𝜋1. Similarly, the conditioning set
𝐸, a subset of the old space Ω, can be converted into a conditioning set 𝜋1⁻¹(𝐸) = 𝐸 × JboolK on
the new sample space Ω × JboolK. In the implementation, these conversions are no-ops: they
amount to the fact that a Boolean formula over Boolean variables Γ can be weakened to a Boolean
formula over variables Γ, 𝑥.
• observe (𝑋 ∧ 𝑌) shrinks the new conditioning set 𝐸 × JboolK by intersecting it with the subset
𝐺 := {(𝜔, 𝑏) | 𝑋(𝜔) = 𝑌(𝜔, 𝑏) = ⊤} of Ω × JboolK on which 𝑋 and 𝑌 are both ⊤; this produces a
new conditioning set (𝐸 × JboolK) ∩ 𝐺. This is implemented by conjoining the Boolean formula
representing 𝑋 ∧ 𝑌 onto the accepting formula. (A small executable sketch of these two steps
appears below.)
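The following Python sketch (ours, over explicit finite sample spaces rather than the BDD representation, and with a hypothetical prior flip 0.6 supplying the free variable 𝑋) walks through exactly these two steps and then reads off the posterior of the returned variable:

```python
from itertools import product

# A finite conditional probability space: (Omega, mu, E), with Omega a list of outcomes,
# mu a dict of probabilities, and E a set of outcomes (the conditioning set).
# Hypothetical input: the free Disc variable X was introduced upstream by flip 0.6.
Omega = [True, False]
mu = {True: 0.6, False: 0.4}
E = set(Omega)
X = lambda w: w                       # random variable for X on the input space

# flip 1/2: expand the space to Omega x {True, False} and produce the projection Y.
Omega2 = list(product(Omega, [True, False]))
mu2 = {(a, b): mu[a] * 0.5 for (a, b) in Omega2}
E2 = {(a, b) for (a, b) in Omega2 if a in E}
pi1 = lambda w: w[0]                  # map back to the old space
Xnew = lambda w: X(pi1(w))            # X, converted to the new space by precomposition
Y = lambda w: w[1]

# observe (X and Y): shrink the conditioning set.
E3 = {w for w in E2 if Xnew(w) and Y(w)}

# Posterior of the returned variable X under the conditioned measure.
Z = sum(mu2[w] for w in E3)
print(sum(mu2[w] for w in E3 if Xnew(w)) / Z)  # 1.0: conditioning on X and Y forces X = true
```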
In general, we will interpret MultiPPL programs in a semantic domain that combines this stateful
approach to modelling exact inference with the standard Distw-based approach to modelling
sampling-based inference: a MultiPPL program Γ; Δ ⊢c 𝑀 : 𝐴 denotes a function that receives:
(1) a concrete instantiation 𝛾 ∈ JΓK for free Cont variables,
(2) a probability space Ω and a random variable 𝐷 ∈ Ω → JΔK for free Disc variables,
and uses the monad Distw to produce a weighted sample consisting of a new probability space Ω′
and a random variable 𝑋 ∈ Ω′ → J𝐴K of outputs. The old and new probability spaces are connected
by a function 𝑓 : Ω′ → Ω, which says how to convert random variables and events defined in
terms of the old space into random variables and events defined on the new one. The following
definitions make this idea precise.
Definition 3.2. A finite conditional probability space is a triple (Ω, 𝜇, 𝐸) where (1) Ω is a finite
set; (2) 𝜇 : Ω → [0, 1] is a discrete probability distribution, and (3) 𝐸 is a subset of Ω called the
conditioning set. Let FCPS be the collection of finite conditional probability spaces.
Definition 3.3. A map of finite conditional probability spaces 𝑓 : (Ω, 𝜇, 𝐸) → (Ω′, 𝜇′, 𝐸′) is a
measure-preserving map 𝑓 : (Ω, 𝜇) → (Ω′, 𝜇′) such that 𝐸 ⊆ 𝑓⁻¹(𝐸′). For two finite conditional
probability spaces (Ω, 𝜇, 𝐸) and (Ω′, 𝜇′, 𝐸′), let (Ω, 𝜇, 𝐸) →FCPS (Ω′, 𝜇′, 𝐸′) be the set of maps from
(Ω, 𝜇, 𝐸) to (Ω′, 𝜇′, 𝐸′).
Note: For readability, finite conditional probability spaces (Ω, 𝜇, 𝐸) will be written Ω unless
disambiguation is needed.
With these two definitions in hand, we can give a precise description of the semantic domains
used to construct our low-level model of effectful MultiPPL terms. Given a finite conditional
probability space Ω as input, an effectful Disc term Γ; Δ ⊢c 𝑀 : 𝐴 sends a pair of substitutions for
free Disc and Cont variables to a distribution over weighted samples consisting of a new finite
conditional probability space Ω′ and a random variable Ω′ → J𝐴K of outputs:

    LJΓ; Δ ⊢c 𝑀 : 𝐴K(Ω) : JΓK × (Ω → JΔK) → Distw( ∑_{Ω′ ∈ FCPS} (Ω′ →FCPS Ω) × (Ω′ → J𝐴K) )
The notation ∑_{Ω′ ∈ FCPS} (Ω′ →FCPS Ω) × (Ω′ → J𝐴K) denotes an indexed coproduct: an element of
this set is a tuple (Ω′, 𝑓, 𝑋) consisting of a new finite conditional probability space Ω′, a map of
finite conditional probability spaces 𝑓 : Ω′ → Ω connecting the old and new sample spaces, and a
random variable 𝑋 defined on the new sample space.
Analogously, an effectful Cont term Γ; Δ ⊢c 𝑒 : 𝜏 sends a pair of substitutions to a distribution
over weighted samples consisting of a new finite conditional probability space Ω′ and a value
𝑣 ∈ J𝜏K:

    LJΓ; Δ ⊢c 𝑒 : 𝜏K(Ω) : JΓK × (Ω → JΔK) → Distw( ∑_{Ω′ ∈ FCPS} (Ω′ →FCPS Ω) × J𝜏K )
The semantic equations defining LJΓ; Δ ⊢c 𝑀 : 𝐴K(Ω) and LJΓ; Δ ⊢c 𝑒 : 𝜏K(Ω) are given in Fig. 10.
As in Fig. 9, we continue to use Haskell-style syntactic sugar for the Distw monad operations. The
interpretation of effectful Cont programs is largely similar to the one given by HJ−K; the primary
difference is the plumbing of probability spaces Ω and maps 𝑓 throughout. The interpretation
of effectful Disc programs statefully manipulates the probability space as sketched earlier: flip 𝑒
expands the probability space from Ω to Ω ⊗ Ωflip, where Ωflip is a freshly-generated probability
space supporting a Bernoulli-distributed random variable with parameter 𝑒, and observe 𝑀 shrinks
the conditioning set from 𝐸 to 𝐸 ∩ 𝐹, where 𝐹 is the subset of the sample space on which 𝑀 is ⊤.
Maps of conditional probability spaces 𝑓 are used to convert random variables from old to new
sample spaces throughout.
The interpretation of the Cont-to-Disc boundary L𝑒M𝐸 is to draw a weighted sample 𝑥 from 𝑒
and return the constant random variable at 𝑥. Conversely, the interpretation of the Disc-to-Cont
boundary L𝑀M𝑆 is to compute the random variable 𝑋 denoted by 𝑀 and then return a sample 𝑥
drawn from the distribution of 𝑋. The parts of Fig. 10 shown in bold ensure sound inference
interoperability: in the interpretation of L𝑀M𝑆, the event 𝑋⁻¹(𝑥) is added to the conditioning set to
ensure sample consistency; in the interpretation of observe 𝑀, the statement score(𝜇|𝐸(𝐹)) performs
importance weighting, to ensure the weight of the current execution remains correct relative to
other possible executions.¹

¹Here, 𝜇|𝐸 is the distribution 𝜇 conditioned on the event 𝐸.
3.5 Soundness
This section presents our main theoretical result: the low-level model LJ−K capturing our inference
strategy soundly refines the high-level model HJ−K; that is, given a complete MultiPPL program
𝑒, weighted samples drawn from 𝑒 according to our knowledge-compilation- and importance-sampling-based
inference strategy follow the same distribution as samples drawn according to HJ−K.
To make this precise, we first define what it means to run a complete MultiPPL program, and what
it means for two distributions over weighted samples to be equivalent.
Definition 3.4. For a closed program •; • ⊢c 𝑒 : 𝜏, let evalL(𝑒) be the computation

    (_, _, 𝑥) ← LJ𝑒K(emp)(∅, ∅); ret 𝑥 : Distw J𝜏K

where ∅ is the empty substitution and emp is the unique 1-point probability space. Let evalH(𝑒) be
the computation HJ𝑒K(∅, ∅) : Distw J𝜏K.
Definition 3.5. Two computations 𝜇, 𝜈 : Distw 𝐴 are equal as importance samplers, written 𝜇 ≃ 𝜈, if
for all bounded integrable 𝑘 : 𝐴 → ℝ it holds that E_{(𝑎,𝑥)∼𝜇}[𝑎 · 𝑘(𝑥)] = E_{(𝑏,𝑦)∼𝜈}[𝑏 · 𝑘(𝑦)].
With these definitions in hand, our soundness theorem states that our inference strategy agrees
with the high-level model up to equality of importance samplers.

Theorem 3.6 (soundness). If •; • ⊢c 𝑒 : 𝜏 then evalL(𝑒) ≃ evalH(𝑒).

Theorem 3.6 is proved by induction on typing, after suitable strengthening of the theorem
statement from closed to open terms. The essence of the proof boils down to two key lemmas. The
first lemma allows swapping the order of sampling and scoring, and is crucial to the correctness of
our importance reweighting scheme in interpreting observe:
Lemma 3.7. If (Ω, 𝜇, 𝐸) ∈ FCPS then

    (𝜔 ← 𝜇; score(1[𝜔 ∈ 𝐸]); ret 𝜔)  ≃  (score(𝜇(𝐸)); 𝜔 ← 𝜇|𝐸; ret 𝜔).
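As a sanity check (ours, not from the paper), the two sides of Lemma 3.7 can be compared by exact enumeration on a small finite conditional probability space, using Definition 3.5's notion of equality as importance samplers:

```python
# A tiny finite conditional probability space and a test function k.
Omega = ["a", "b", "c"]
mu = {"a": 0.5, "b": 0.3, "c": 0.2}
E = {"a", "c"}
k = {"a": 1.0, "b": -2.0, "c": 5.0}

# Left side: sample omega ~ mu, score 1[omega in E], return omega.
# We compute E[(weight) * k(value)] by enumeration.
lhs = sum(mu[o] * (1.0 if o in E else 0.0) * k[o] for o in Omega)

# Right side: score mu(E), then sample omega ~ mu conditioned on E, return omega.
muE = sum(mu[o] for o in E)
rhs = sum((mu[o] / muE) * muE * k[o] for o in E)

print(lhs, rhs)  # both 0.5*1.0 + 0.2*5.0 = 1.5
```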
The second lemma says that sampling twice (from a marginal on 𝑋 to get a sample 𝑥, then from
the conditional distribution given 𝑋 = 𝑥) is the same as sampling once from the joint distribution,
and is crucial to ensuring sample consistency in our implementation of the boundary L𝑀M𝑆.

Lemma 3.8. If (Ω, 𝜇, 𝐸) ∈ FCPS and 𝑋 : Ω → 𝐴 with 𝐴 finite, then

    (𝑥 ← (𝜔 ← 𝜇; ret(𝑋 𝜔)); 𝜔′ ← 𝜇|𝑋⁻¹(𝑥); ret(𝑥, 𝜔′))  =  (𝜔′ ← 𝜇; ret(𝑋 𝜔′, 𝜔′)).
The full details can be found in Appendix A.7.
4 EVALUATION
In Section 3 we described the theoretical underpinnings of MultiPPL and proved it sound. In this
section we provide implementation details and empirical evidence for the utility of MultiPPL by
measuring its scalability on well-known inference tasks and comparing its performance against
existing probabilistic programming systems. We conclude with a discussion of our evaluation and
how these programs relate to the design space of MultiPPL programs.
4.1 Lightweight Extensions to MultiPPL
The semantics described in Section 3 provide a minimal model of multi-language interoperation
that is simple and correct. In our implementation we extend the semantics of Disc and Cont to
support more features, resulting in a practical and flexible language.
4.1.1 Extensions to Cont. Importance-sampling languages often include more features than those
described in Cont. The grammar for Cont, shown in Fig. 2, supports three base distributions:
Bernoulli, Uniform, and Poisson distributions. In our implementation we include many more
distributions, including Normal, Beta, and Dirichlet distributions, as well as their corresponding
observation expressions. We also extend Cont with unbounded loops and list data structures.
4.1.2 Extensions to Disc. Our implementation of Disc directly leverages the BDD library of
Dice [23] and includes support for integers as described in Holtzen et al. [23]. Integers can be
introduced into a Disc program either by embedding a Cont integer or through new syntax in
Disc representing a discrete distribution. Both terms are translated into one-hot encoded tuples of
Boolean variables: Cont integers are translated dynamically, while discrete categorical distributions
are translated by the compiler statically into the Disc grammar shown in Fig. 2.
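The paper does not spell out the translation here, but the following Python sketch (ours) shows one standard way such a one-hot encoding can be realized as a chain of flips with renormalized parameters; it is an illustration, not necessarily the exact encoding produced by the MultiPPL compiler.

```python
import random

def categorical_as_flips(probs):
    """Encode a categorical distribution as a one-hot tuple of Booleans via a chain of
    flips: flip i fires with the probability of outcome i renormalized by the mass not
    yet claimed by earlier outcomes."""
    one_hot = [False] * len(probs)
    remaining = 1.0
    for i, p in enumerate(probs[:-1]):
        if random.random() < p / remaining:   # flip (p_i / remaining mass)
            one_hot[i] = True
            return one_hot
        remaining -= p
    one_hot[-1] = True                        # the last outcome takes whatever mass is left
    return one_hot

# Example: discrete(0.2, 0.3, 0.5) encoded as three Booleans, exactly one of which is true.
counts = [0, 0, 0]
for _ in range(100_000):
    counts[categorical_as_flips([0.2, 0.3, 0.5]).index(True)] += 1
print([c / 100_000 for c in counts])   # approximately [0.2, 0.3, 0.5]
```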
4.2 Empirical evaluation
MultiPPL programs encompass a vast design space, including both Cont and Disc programs as
well as any interleaving of these two languages. To investigate the efficacy of our implementation
and characterize this landscape, we ask the following questions:
(1) Does MultiPPL capture enough expressive power to represent interesting and practical probabilistic
structure while maintaining competitive performance? We consider four benchmarks with complex
conditional independence structures to illustrate the design space of MultiPPL programs.
We draw on models in the domains of network analysis [18, 25] and Bayesian networks [5, 7].
(2) How does MultiPPL compare with contemporary PPLs using exact and approximate inference,
with respect to wall-clock time and distance from the exact distribution? To answer this question,
we benchmark against state-of-the-art PPLs which handle both discrete and continuous variables:
PSI [19], performing exact inference by compilation, and Pyro [8], using its importance sampling
infrastructure for approximate inference.
4.2.1 Experimental Setup. For exact inference, PSI is a best-in-class language that encodes both
discrete and continuous variables using its compiled symbolic approach. For approximate inference
we leverage Pyro's importance sampling infrastructure. MultiPPL is written in Rust and performs
both knowledge compilation and sampling during runtime evaluation when it encounters a Disc
or Cont program, respectively. To unify the comparison between these disparate settings, we
delineate our evaluation criteria along two metrics: sample efficiency and sample quality.
The sample efficiency of each inference strategy is defined as the wall-clock time to draw 1000
samples; it is measured in seconds and recorded in the "Time(s)" column of the following figures.
Comparing the performance of inference algorithms implemented in different languages is a general
challenge. To account for the difference in overhead, we treat Cont as our baseline in the approximate
setting. Sample quality is also important, and we compute the L1-distance (i.e., the absolute value
of the difference) between a ground-truth answer, derived for each task, and the estimated quantity
from sampling. Tasks that only evaluate exact inference always yield an L1-distance of 0: for these
tasks we only report wall-clock time, and we only draw one sample from the MultiPPL program.
Heuristically, our aim in writing MultiPPL programs is to achieve high-quality samples using Disc while maintaining reasonable wall-clock efficiency with Cont. While this guides the design of our evaluation, users must decide how this trade-off affects their models on a case-by-case basis. All benchmarks involving approximate inference are performed using a fixed budget of 1000 samples, and all statistics collected are averaged over 100 independent trials.²
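Concretely, one trial of the approximate setting is scored as in the following sketch (ours; the names are illustrative): draw a budget of weighted samples, form the self-normalized estimate, and report its wall-clock time and absolute distance from the ground truth.

    import time
    import numpy as np

    def run_trial(draw_weighted_sample, ground_truth, budget=1000):
        """Score one trial: draw_weighted_sample() is assumed to return a
        (weight, value) pair, as in the weighted-sampling semantics of Section 3;
        ground_truth is the exact posterior quantity the task queries."""
        start = time.perf_counter()
        pairs = [draw_weighted_sample() for _ in range(budget)]
        elapsed = time.perf_counter() - start                   # the "Time(s)" column
        weights = np.array([w for w, _ in pairs], dtype=float)
        values = np.array([v for _, v in pairs], dtype=float)
        estimate = np.sum(weights * values) / np.sum(weights)   # self-normalized estimate
        return abs(estimate - ground_truth), elapsed            # the "L1" column

    # Each number reported in the tables below averages 100 such trials.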
4.2.2 Estimating packet arrival. Our first evaluation comes from the motivating example of Fig. 1. For this arrival task we model packets traversing a router network and observe the presence, or absence, of packets at their destination. Our main interest is in an unobservable router that lives along the traversal path, and we query the expected number of packets which pass through this node. The router network in our evaluation has a tree-based topology that uses an equal-cost multipath (ECMP) protocol [24], as shown in Fig. 12a.
² All evaluations are run on a single thread of an AMD EPYC 7543 processor at 2.8 GHz with 500 GiB of RAM. A software artifact is available on Zenodo [56] and GitHub (https://github.com/stites/multippl).
Model            |    PSI        |     Pyro      | MultiPPL (Cont) |   MultiPPL
                 |  L1   Time(s) |  L1   Time(s) |  L1    Time(s)  |  L1   Time(s)
arrival/tree-15  |  —    —       | 0.365 12.713  | 0.355   0.247   | 0.337  0.349
arrival/tree-31  |  —    —       | 0.216 26.366  | 0.218   0.561   | 0.179  0.754
arrival/tree-63  |  —    —       | 0.118 53.946  | 0.120   1.469   | 0.093  1.912
alarm            |  t/o  t/o     | 1.290 16.851  | 1.173   0.433   | 0.364 14.444
insurance        |  t/o  t/o     | 0.149 13.724  | 0.144   1.104   | 0.099 11.406
gossip/4         |  —    —       | 0.119  6.734  | 0.119   0.720   | 0.118  0.812
gossip/10        |  —    —       | 0.533  6.786  | 0.531   1.561   | 0.524  1.373
gossip/20        |  —    —       | 0.747  7.064  | 0.745   3.565   | 0.750  2.888
Fig. 11. Empirical results of our benchmarks of the arrival, discrete Bayesian network, and gossip tasks. "MultiPPL (Cont)" shows the evaluation of a baseline Cont program with no boundary crossings into Disc; evaluations under the "MultiPPL" column perform interoperation. "t/o" indicates a timeout beyond 30 minutes, and "—" indicates that the problem is not expressible in PSI because of an unbounded loop.
(a) The arrival network topology; the packet enters at the node marked "start".

    𝑛 ∼ Poisson(𝜆 = 3)
    𝑞 ← 0
    while 𝑛 > 0 do
        𝑞 ← 𝑞 + network()
        𝑛 ← 𝑛 − 1
    end while
    return 𝑞

(b) Pseudocode describing the arrival task.
Fig. 12. Implementation-generic details for the packet-arrival task. Shown in 12a, a packet traverses the network by entering the bottom-left most node, annotated by the start arrow. We observe a successful traversal to the gray-filled node, and we query the double-circled node for its posterior distribution. The PSI, Pyro, and MultiPPL programs all follow the pseudocode shown in 12b, where network models the topology.
The ECMP protocol dictates that a packet is forwarded with uniform probability to all neighboring routers with equal distance to the goal, as shown in Fig. 12a. In this scenario, 𝑛 packets traverse the network, where 𝑛 is drawn from a Poisson distribution with a rate of 3, as described in Fig. 12b. The presence of this Poisson random variable makes this example quite challenging for many existing PPL inference strategies because the resulting loop has a statically unbounded number of iterations. We made the following additional design decisions in making this task:
(1) Evidence: We observe the gray node of the network topology depicted in Fig. 12a.
(2) Query: We query the expected probability that the packet traverses through a central node of the tree topology, depicted by the double-circled node of Fig. 12a.
(3) Boundary decisions: Cont models the Poisson distribution and outer loop. One boundary call is made to the network, defined in Disc; a sketch of the overall structure follows this list.
(4) Scaling: We scale this model to topologies of 15, 31, and 63 nodes.
(5) Ground truth: Our ground truth is defined by writing a Dice program for the network, and analytically solving for the expected number of packets.
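The following plain-Python sketch (ours) mirrors the pseudocode of Fig. 12b in a likelihood-weighting style; the three-router network() stub is a deliberately tiny stand-in for the tree topologies of Fig. 12a, and its probabilities are illustrative only.

    import math
    import random

    def poisson(lam):
        """Draw from Poisson(lam) by Knuth's method (kept dependency-free)."""
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= threshold:
                return k
            k += 1

    def network():
        """Tiny three-router stand-in for the network() call of Fig. 12b.
        The packet is forwarded uniformly at the entry router (ECMP); we record
        whether it passed the queried router and whether it reached the destination."""
        through_query = random.random() < 0.5
        reached = random.random() < (0.9 if through_query else 0.8)
        return through_query, reached

    def arrival_sample(rate=3.0):
        """One likelihood-weighted sample of the arrival pseudocode of Fig. 12b."""
        n = poisson(rate)
        q, weight = 0, 1.0
        for _ in range(n):
            through_query, reached = network()
            q += through_query
            weight *= 1.0 if reached else 0.0   # condition on the observed arrival
        return weight, q

In the MultiPPL program, only network() changes: it becomes a boundary call L·M𝑆 into a Disc program, so each traversal is computed by exact compilation instead of being sampled, which is what drives the quality improvements reported in Fig. 11.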
The rows labeled by "arrival" in Fig. 11 summarize the evaluation for this section. This table shows that MultiPPL's samples are of significantly higher quality than Pyro's and the Cont program's in this experiment. As the topology increases in size, we see that the MultiPPL program is able to produce increasingly higher-quality samples with respect to L1 distance. This is because MultiPPL is able to exactly compute the packet reachability of a single traversal in increasingly larger networks using Disc, while still being able to express sampling from the Poisson distribution in Cont. The MultiPPL program performing interoperation is an order of magnitude more efficient than Pyro; however, the Cont alternative is still the most efficient with regard to wall-clock time and yields samples of similar quality to Pyro. PSI, using its symbolic inference procedure, fails to model the unbounded loop.
4.2.3 Querying discrete Bayesian networks. Bayesian networks [41] provide a challenging and practical source of programs with widespread adoption across numerous domains, including medicine [1, 38], healthcare [5], and actuarial sciences [7]. Even in the purely-discrete setting, Bayesian networks remain a practical challenge when evaluating exact inference strategies due to the complex independence structures intrinsic to this domain.
In this task we study interoperation of our language by modeling two discrete Bayesian networks: ALARM [5] and Insurance [7]. These networks pose a scaling challenge for exact inference and form the largest models described in our evaluation: ALARM contains 509 flip primitives and the Insurance network contains 1008 flip primitives. Modeling the entirety of the network in Disc and sampling 1000 times would result in a time-out in our evaluation, so we must use interoperation to increase sample quality while keeping sample efficiency competitive in our benchmark.
The ALARM network models a system for patient monitoring, while the Insurance network estimates the expected costs for a car insurance policyholder. We summarize these tasks as follows:
(1) Evidence: In both models, we observe one or more leaf nodes.
(2) Query: We query all root nodes for both of the Bayesian networks.
(3) Boundary decisions: Variables are defined in Disc or Cont to heuristically maximize the degree of exact inference permitted while keeping the wall-clock time within 60 seconds.
(4) Scaling: ALARM contains 509 flip primitives and Insurance contains 1008 flip primitives.
(5) Ground truth: The ground truth is defined by an equivalent Dice program.
The Cont model, with sample quality similar to Pyro's, is more efficient than its MultiPPL counterpart in this evaluation. It is also significantly more efficient than Pyro and PSI (which timed out on this benchmark). As importance samplers, Cont and Pyro simply sample each distribution directly, and we see the Python overhead slowing down the Pyro model.
Our MultiPPL programs demonstrate superior sample quality to the Pyro and Cont models. We achieve this by declaring boundaries that split the ALARM and Insurance networks into sub-networks that are modeled exactly with Disc, keeping compiled BDD sizes small. However, it should be noted that the placement of boundaries tips the scales in a trade-off between quality and efficiency. Optimal interleaving between Cont and Disc is task-sensitive, and the MultiPPL programs evaluated only demonstrate a best-effort approach to modeling.
4.2.4 Network takeover with a gossip protocol. The gossip protocol is a common peer-to-peer communication method for distributed systems. In our setting each packet traverses an undirected, fully-connected network using a FIFO scheduler for transport. At each time step, indicated by a tick in the scheduler, a server will schedule two additional packets, with each destination drawn i.i.d. uniformly from its neighbors. This task initializes with a compromised node which sends two infected packets to its neighbors. When a server receives an infected packet, it becomes compromised and can only propagate infected packets for the remainder of the evaluation. Following Gehr et al. [18], we sample 𝑛 time steps from a uniform distribution, step 𝑛 times, then query the model for the expected number of compromised servers.
(a) Topology of the gossip network.

    init ← 0
    steps ∼ Uniform(4, 8)
    state ← (true, false, false, false)
    deque ← [ ]
    for 𝑛 = 1, 2 do
        s ∼ Discrete(1/3, 1/3, 1/3);
        deque ← s + 1 :: deque
    end for
    while steps > 0 do
        cur ← head(deque);
        deque ← tail(deque);
        state[cur] ← state[cur] ∨ true
        for 𝑛 = 1, 2 do
            s ∼ Discrete(1/3, 1/3, 1/3);
            ix ← if (s < cur) { s } else { s + 1 };
            deque ← deque ++ [ix]
        end for
        steps ← steps − 1
    end while
    return state

(b) Pseudocode for a gossip network task.
Fig. 13. Implementation-generic details of the gossip network task. The 4-node topology of the undirected
network is shown in 13a. Pseudocode to iterate over each time step is provided in 13b.
This evaluation poses an expressivity challenge to the Disc sub-language, which cannot define the dynamic-length FIFO queue without interoperation with Cont. To handle this requirement, we extend Cont to support lists and define all discrete random variables in Disc. At the end of each loop iteration we update our queue in Cont, collapsing any compiled BDDs.
(1) Evidence: This task defines a direct sampler and no evidence is given.
(2) Query: The model queries for the expected number of compromised servers after 𝑛 steps.
(3) Boundary decisions: Discrete variables are in Disc; the loop and FIFO queue live in Cont.
(4) Scaling: This network scales from 4 to 10 and 20 nodes.
(5) Ground truth: The PSI model from Gehr et al. [18] was used to generate the ground truth for a statically defined set of time steps. An enumerative model was also defined to count the number of states. The expectation of these models over the loop is derived analytically.
In Fig. 11, we see that all terminating evaluations have similar L1 distances, with the Pyro and Cont programs producing slightly better-quality samples. The MultiPPL model produces samples more efficiently, on average, which speaks to the minimal overhead of interoperation when knowledge compilation plays a small role in inference. There is also the possibility that BDDs are cached and reused, resulting in a small speedup for some intermediate samples drawn from Disc.
As this benchmark comes with a PSI implementation from Gehr et al. [18], we made a best-effort attempt at getting it to run, including limiting the number of time steps to make the task more tractable, but we were unable to reproduce their results within our 30-minute evaluation window.
4.2.5 Estimating network reliability. The network reliability task concerns a single packet's traversal through a sub-network that uses a probabilistic routing protocol and is embedded in a larger network. As a model only involving discrete random variables, it lets us observe how interoperation affects sample quality and efficiency by looking at programs defined in Cont, in Disc, and in an optimal interoperation configuration. Consider, again, the ECMP protocol from Section 4.2.2. In this task we modify each router with non-uniform probabilities, as a packet can traverse out of the sub-network. The sub-network itself is a directed grid, shown in Fig. 14a, with the probability of traversal being dependent on the packet's origin. Pseudocode for the model is presented in Fig. 14b. This benchmark observes a packet arriving at the final node in the sub-network, and queries the probability that this packet passes through each router in the model. As there are no continuous random variables involved, we can model this task using either exact or approximate inference.
(1) Evidence: The final node in the network topology observes a successful packet traversal.
(a) Topology of a 9-node network.

    𝑛00 ∼ Bern(1/3)
    𝑛01 ∼ if 𝑛00 then Bern(1/4) else Bern(1/5)
    𝑛10 ∼ if ¬𝑛00 then Bern(1/4) else Bern(1/5)
    𝑝 ← if 𝑛10 ∨ 𝑛01 then 1/6
        else if 𝑛10 ∨ ¬𝑛01 then 1/7
        else if ¬𝑛10 ∨ 𝑛01 then 1/8
        else 1/9
    𝑛11 ∼ Bern(𝑝)
    observe 𝑛11 = ⊤
    return (𝑛00, 𝑛01, 𝑛10, 𝑛11)

(b) Pseudocode of a 4-node model.
Fig. 14. An overview of the reliability task, with the topology of the 9-node network in 14a: a packet is observed in the node shaded gray and all nodes are queried for their posterior distribution. In 14b we show the pseudocode for a 4-node reliability task; a similar structure is used for networks with 9, 36, and 81 nodes.
# Nodes |    PSI    | MultiPPL (Disc) |     Pyro      | MultiPPL (Cont) |   MultiPPL
        |  Time(s)  |     Time(s)     |  L1   Time(s) |  L1    Time(s)  |  L1   Time(s)
9       |  546.748  |      0.001      | 0.080  3.827  | 0.079   0.067   | 0.033  0.098
36      |  t/o      |      0.089      | 1.812 14.952  | 0.309   0.277   | 0.055  1.169
81      |  t/o      |     40.728      | 7.814 33.199  | 0.680   0.887   | 0.079 81.300
Fig. 15. Exact and approximate inference results for the network reliability task.
(2) Query: The model queries for the marginal probability of all nodes in the network.
(3) Boundary decisions: MultiPPL programs (in column "MultiPPL" of Fig. 15) model the minor upper and lower triangles of the network topology in Disc and perform interoperation along the minor diagonal to break the exact inference task into two parts. This maximizes the size of compiled BDDs while providing orders-of-magnitude improvement in sample efficiency.
(4) Scaling: This network scales in the size of the grid, from 9 to 36 to 81 nodes.
(5) Ground truth: An equivalent Dice model was used as the ground truth for this model.
The first two columns of Fig. 15 show the results of exact compilation, comparing PSI to Disc programs (column "MultiPPL (Disc)"). Because of the nature of this evaluation, Disc can represent the exact posterior of the model and produce perfect samples with competitive efficiency for small programs. As the program grows in size, producing samples takes considerably longer, scaling with the size of the underlying logical formula.
The partially-collapsed and fully-sampled MultiPPL programs are compared to Pyro in the remaining columns of Fig. 15. MultiPPL programs (column "MultiPPL") are defined in Disc and model the minor diagonal of the network's grid in Cont. Programs fully defined in Cont (column "MultiPPL (Cont)") sample each node individually in the same manner as Pyro.
In this evaluation Cont is more efficient, and MultiPPL programs effectively leverage Disc's knowledge compilation to produce higher-quality samples. For smaller models, the defined MultiPPL programs have efficiency competitive with Cont. As the model scales, the overhead of knowledge compilation increases. This can be seen by noting the single-sample efficiency of Disc programs from our exact evaluation. As the MultiPPL program scales to 81 nodes the sample efficiency decreases, suggesting an alternative collapsing scheme may be preferable for larger programs.
This network reliability evaluation, alongside the prior evaluations, demonstrates that MultiPPL consistently produces higher-quality samples compared to alternatives in the approximate setting. Through these evaluations, we find that MultiPPL does capture enough expressive power to
represent interesting and practical probabilistic structure while remaining competitive with other
languages. That said, the performance of MultiPPL’s inference poses a nuanced landscape and we
leave a full characterization of this design space to future work.
5 RELATED WORK
Multi-language interoperation between probabilistic programming languages builds on a wide body of work spanning the programming languages and machine learning communities. We situate our research in five categories: heterogeneous inference, programmable inference, nested inference, multi-language semantics, and the monadic semantics of probabilistic programming languages.
5.1 Heterogeneous inference in probabilistic programming languages. There are existing probabilistic programming languages and systems that enable users to blend different kinds of inference algorithms when performing inference on a single probabilistic program. Particularly relevant are approaches that leverage Rao-Blackwellization in order to combine exact and approximate inference strategies into a single system. Within this vein, Atkinson et al. [2] introduced semi-symbolic inference, where the idea is to perform exact marginalization over distributions whose posteriors can be determined to have some closed-form solution. Other works that use variations of Rao-Blackwellization [21, 33, 37] all seek to explicitly marginalize out portions of the distribution by using closed-form exact posteriors when feasible. The main difference between these approaches to Rao-Blackwellization and our proposed approach is that these systems do not expose separate languages that correspond to different phases of the inference algorithm: they provide a single unified syntax in which the user programs. As a consequence, they all rely on (semi-)automated means of discovering which portions can be feasibly Rao-Blackwellized; this process can be difficult to control and can lead to unpredictable performance. Our multi-language approach has the following benefits: (1) predictable and interpretable performance due to the explicit choice of inference algorithm that is exposed to the user; and (2) amenability to modular formalization, since we can verify the correctness of each inference strategy and verify the correctness of their composition at the boundary. We hope to incorporate the interesting ideas of these related works into MultiPPL, in particular closed-form approaches to exact marginalization of continuous distributions.
There is a broad literature on heterogeneous inference that we hope to eventually draw on to build a richer vocabulary of sub-languages to add to MultiPPL. Friedman and Van den Broeck [17] described an approach to collapsed approximate inference that dynamically blends exact inference via knowledge compilation and approximate inference via sampling; we are curious whether this can be integrated with our system. We also look toward incorporating more stateful inference algorithms, such as Markov-chain Monte Carlo, into MultiPPL, and aim to investigate this in future work.
5.2 Programmable inference. Programmable inference (or inference (meta-)programming) provides probabilistic programmers with a meta-language for defining new inference algorithms within a single PPL by offering language primitives that give direct access to the inference runtime [29]. Cusumano-Towner et al. [12] provide a black-box interface to underlying inference algorithms alongside combinators that operate on these interfaces, while Stites et al. [57] design a domain-specific language (DSL) for inference which produces correct-by-construction importance weights.
We see programmable inference as a viable means of designing new inference algorithms which we can incorporate into a multi-language. Furthermore, a multi-language setting can offer inference programmers the ability to abstract away the nuances of the inference process, lowering the barrier to entry for this type of development. One common thread through much of the work on inference programming is a set of core primitives which encapsulate the building blocks for inference algorithms, including resample-move sequential Monte Carlo, variational inference, and many other Markov-chain Monte Carlo methods. These primitives could be designed formally as DSLs, which would be a great addition to a multi-language and something we look forward to developing in future work.
5.3 Nested inference. Nested inference enriches a probabilistic programming language with a first-class infer or normalize construct that enables the programmer to query for the probability of an event inside their probabilistic programs [3, 42, 54, 58, 63]. Nested inference is a useful programming construct that enables a variety of new applications, such as in cognitive science where one agent may wish to reason about the intent of another [58]. Nested inference is similar in spirit to our multi-language approach in that it gives the programmer control over when inference is performed on their program and what inference algorithm is used. A key difference between nested inference and our multi-language approach is that the former provides access to the inference result whereas MultiPPL's boundary forms do not. This difference is essential. In our view, there is the following analogy to non-probabilistic programming: performing nested inference is like invoking a compiler and inspecting the resulting binary, whereas performing multi-language inference is like interoperating through an FFI. In the non-probabilistic setting, these two situations require distinct semantic models (compare, for example, formal models of introspection and dynamic code generation [6, 15, 28, 30, 52] with formal models of FFI-based interoperability [22, 26, 31, 39, 45]), and we believe the same is likely true of our probabilistic setting.
In the future, it would be interesting to consider integrating nested inference within a multi-language setting and exploring the consequences of this new feature on language interoperation. It would also be quite interesting to investigate whether our multi-language inference strategy could be compiled to, or expressed in terms of, rich nested inference constructs. A preliminary analysis reveals a number of basic differences between MultiPPL's inference strategy and standard models of nested inference, so such a compilation scheme would likely require significant modifications to nested inference; for a detailed technical discussion, see Appendix B.
5.4 Multi-language semantics. Today, it is often the case that probabilistic programming languages are embedded in a host, non-probabilistic language [4]. However, these PPLs assume their host semantics will not interfere with the semantics of the PPL's inference process. This work is the first of its kind to build on top of multi-language semantics to reason about inference algorithms.
Multi-language semantics, while new to the domain of probabilistic programming, has had a large impact on the broader programming-languages community. It plays a fundamental role in reasoning about interoperation [39], gradual typing [35, 59], and compositional compiler verification [45]. There are two styles of calculi which represent the current approaches to multi-language interoperation: the multi-language approach of Matthews and Findler [31], and a more fine-grained approach by Siek and Taha [49] using a gradually typed lambda calculus.
Ye et al. [62] take a traditional programming-language approach to the gradual typing of PPLs and define a gradually typed probabilistic lambda calculus which allows a user to migrate a probabilistic program from an untyped to a typed language, a nontrivial task involving a probabilistic coupling argument for soundness. In contrast, our work centers on how multi-languages can help the interoperation of inference algorithms across semantic domains.
Baydin et al. [4] establish, informally, a common interface for PPLs to interact with scientific simulators across language boundaries. In this work, the semantics of the simulator is a black-box distribution defined in some language, which may or may not be probabilistic, and a separate PPL may interact with the simulator during the inference process. While Baydin et al. [4] work across language boundaries, they do not reason about interoperation (they only involve one inference algorithm) and they do not provide any soundness guarantees. That said, Baydin et al. [4] demonstrate a simple boundary allowing for rapid integration of many practical probabilistic programming languages, something we also strive for.
5.5 Monadic semantics of PPLs. Numerous monads have been developed for use as semantic domains that capture the various notions of computation used in probabilistic inference. The fundamental building block for each of these models is the probability monad Dist, along with its generalizations to monads of subdistributions and measures [20]. Using this probability monad to give semantics to probabilistic programs goes back at least to Ramsey and Pfeffer [43], who further build on this basic setup by introducing measure terms to efficiently answer expectation-based queries. Staton et al. [55] make use of the writer monad transformer applied to the monoid of weights to obtain a monad suitable for modelling probabilistic programs with score-based conditioning; we have made essential use of this monad to define the two semantic models of MultiPPL presented in Section 3. Ścibior [48] uses monad transformer stacks, implemented in Haskell, to obtain a variety of sampling-based inference algorithms in a compositional manner, with each layer of the stack encompassing a different component of an inference algorithm. Our semantics of MultiPPL builds on this line of work in giving monadic semantics to probabilistic computations by providing a model of exact inference via knowledge compilation in terms of stateful manipulation of finite conditional probability spaces and random variables. In future work, we intend to investigate whether this state-passing semantics can be packaged into a monad of its own, capturing the notion of computation carried out when performing knowledge compilation, by making use of recent constructions in categorical probability [50, 51].
6 CONCLUSION
Performing inference on models with a mix of continuous and discrete random variables is an important modeling challenge for practical systems, and MultiPPL offers a multi-language approach to tackling this problem. In this work, we provide a sound denotational semantics that generalizes to any exact inference algorithm and any sampling-based approximate inference algorithm that satisfy our semantic domains. We identify two requirements for establishing the correctness of the interoperation described: the exact PPL must maintain sample consistency, and the approximate sampling-based PPL must perform importance weighting. We demonstrate that our implementation of MultiPPL benefits from the expressiveness of Cont, which makes practical problems representable, and additionally provides tractable inference via Disc for complex discrete-structured probabilistic programs.
Ultimately, we hope that our multi-language perspective can lead to a clean formal unification of many probabilistic program semantics and inference strategies. For future work, we hope to extend our semantics to incorporate local-search inference strategies such as sequential and Markov-chain Monte Carlo. With enough coverage across semantics, we also gain the opportunity to look at probabilistic interoperation by inspecting a shared core calculus for inference, drawing on work from Patterson [40]. Finally, by providing a syntactic approach to inference interoperation, we also open up opportunities to use static analysis to determine when and how we might automatically insert boundaries to further specialize a model's inference algorithm.
ACKNOWLEDGEMENTS
This project was supported by the National Science Foundation under grant #2220408.
REFERENCES
[1] Steen Andreassen, Roman Hovorka, Jonathan Benn, Kristian G. Olesen, and Ewart R. Carson. 1991. A Model-Based Approach to Insulin Adjustment. In AIME 91, Vol. 44. Springer Berlin Heidelberg, Berlin, Heidelberg, 239–248. https://doi.org/10.1007/978-3-642-48650-0_19
[2] Eric Atkinson, Charles Yuan, Guillaume Baudart, Louis Mandel, and Michael Carbin. 2022. Semi-Symbolic Inference for Efficient Streaming Probabilistic Programming. Proc. ACM Program. Lang. 6, OOPSLA2 (2022), 1668–1696. https://doi.org/10.1145/3563347
[3] Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. 2009. Action understanding as inverse planning. Cognition 113, 3 (2009), 329–349. https://doi.org/10.1016/j.cognition.2009.07.005
[4] Atılım Güneş Baydin, Lei Shao, Wahid Bhimji, Lukas Heinrich, Lawrence Meadows, Jialin Liu, Andreas Munk, Saeid Naderiparizi, Bradley Gram-Hansen, Gilles Louppe, Mingfei Ma, Xiaohui Zhao, Philip Torr, Victor Lee, Kyle Cranmer, Prabhat, and Frank Wood. 2019. Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Denver, Colorado, Article 29. https://doi.org/10.1145/3295500.3356180
[5] Ingo A Beinlich, Henri Jacques Suermondt, R Martin Chavez, and Gregory F Cooper. 1989. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. In AIME 89. Springer, 247–256. https://doi.org/10.1007/978-3-642-93437-7_28
[6] Nick Benton and Chung-Kil Hur. 2010. Step-Indexing: The Good, the Bad and the Ugly. In Modelling, Controlling and Reasoning about State, Proceedings of Dagstuhl Seminar 10351. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany. https://www.microsoft.com/en-us/research/publication/step-indexing-the-good-the-bad-and-the-ugly/
[7] John Binder, Daphne Koller, Stuart Russell, and Keiji Kanazawa. [n.d.]. Adaptive Probabilistic Networks with Hidden Variables. 29, 2-3 ([n.d.]), 213–244. https://doi.org/10.1023/A:1007421730016
[8] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. 2019. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research 20, 1 (Jan. 2019), 973–978. https://dl.acm.org/doi/abs/10.5555/3322706.3322734
[9] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software 76 (2017). https://doi.org/10.18637/jss.v076.i01
[10] Mark Chavira and Adnan Darwiche. 2008. On Probabilistic Inference by Weighted Model Counting. Artificial Intelligence 172, 6 (April 2008), 772–799. https://doi.org/10.1016/j.artint.2007.11.002
[11] Ryan Culpepper and Andrew Cobb. 2017. Contextual Equivalence for Probabilistic Programs with Continuous Random Variables and Scoring. In Programming Languages and Systems (Lecture Notes in Computer Science). Springer, Berlin, Heidelberg, 368–392. https://doi.org/10.1007/978-3-662-54434-1_14
[12] Marco F. Cusumano-Towner, Feras A. Saad, Alexander K. Lew, and Vikash K. Mansinghka. 2019. Gen: A General-Purpose Probabilistic Programming System with Programmable Inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). Association for Computing Machinery, Phoenix, AZ, USA, 221–236. https://doi.org/10.1145/3314221.3314642
[13] Pierre-Évariste Dagand, Nicolas Tabareau, and Éric Tanter. 2018. Foundations of Dependent Interoperability. Journal of Functional Programming 28 (Jan. 2018), e9. https://doi.org/10.1017/S0956796818000011
[14] A. Darwiche and P. Marquis. 2002. A Knowledge Compilation Map. Journal of Artificial Intelligence Research 17 (Sept. 2002), 229–264. https://doi.org/10/ghjsq6
[15] Rowan Davies and Frank Pfenning. 2001. A Modal Analysis of Staged Computation. J. ACM 48, 3 (May 2001), 555–604. https://doi.org/10.1145/382780.382785
[16] Daan Fierens, Guy Van den Broeck, Joris Renkens, Dimitar Shterionov, Bernd Gutmann, Ingo Thon, Gerda Janssens, and Luc De Raedt. 2015. Inference and Learning in Probabilistic Logic Programs Using Weighted Boolean Formulas. Theory and Practice of Logic Programming 15, 3 (2015), 358–401. https://doi.org/10.1017/S1471068414000076
[17] Tal Friedman and Guy Van den Broeck. 2018. Approximate Knowledge Compilation by Online Collapsed Importance Sampling. In Advances in Neural Information Processing Systems, Vol. 31. Curran Associates, Inc., 15. https://dl.acm.org/doi/10.5555/3327757.3327898
[18] Timon Gehr, Sasa Misailovic, Petar Tsankov, Laurent Vanbever, Pascal Wiesmann, and Martin Vechev. 2018. Bayonet: Probabilistic Inference for Networks. In ACM SIGPLAN Notices, Vol. 53. ACM, 586–602. https://doi.org/10.1145/3296979.3192400
[19] Timon Gehr, Sasa Misailovic, and Martin Vechev. 2016. PSI: Exact Symbolic Inference for Probabilistic Programs. Proc. of ESOP/ETAPS 9779 (2016), 62–83. https://doi.org/10/gmq8ks
[20] Michele Giry. 2006. A categorical approach to probability theory. In Categorical Aspects of Topology and Analysis: Proceedings of an International Conference Held at Carleton University, Ottawa, August 11–15, 1981. Springer, 68–85. https://doi.org/10.1007/BFb0092872
[21] Maria I Gorinova, Andrew D Gordon, Charles Sutton, and Matthijs Vákár. 2021. Conditional independence by typing. ACM Transactions on Programming Languages and Systems (TOPLAS) 44, 1 (2021), 1–54. https://doi.org/10.1145/3490421
[22] Armaël Guéneau, Johannes Hostert, Simon Spies, Michael Sammler, Lars Birkedal, and Derek Dreyer. 2023. Melocoton: A Program Logic for Verified Interoperability Between OCaml and C. Proc. ACM Program. Lang. 7, OOPSLA2 (Oct. 2023), 247:716–247:744. https://doi.org/10.1145/3622823
[23] Steven Holtzen, Guy Van den Broeck, and Todd Millstein. 2020. Scaling Exact Inference for Discrete Probabilistic Programs. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–31. https://doi.org/10.1145/3428208
[24] Christian Hopps. 2000. Analysis of an Equal-Cost Multi-Path Algorithm. https://doi.org/10.17487/RFC2992
[25] Simon Knight, Hung X. Nguyen, Nickolas Falkner, Rhys Bowden, and Matthew Roughan. 2011. The Internet Topology Zoo. IEEE Journal on Selected Areas in Communications 29, 9 (Oct. 2011), 1765–1775. https://doi.org/10.1109/JSAC.2011.111002
[26] Joomy Korkut, Kathrin Stark, and Andrew W. Appel. 2025. A Verified Foreign Function Interface between Coq and C. Proc. ACM Program. Lang. 9, POPL (Jan. 2025), 24:687–24:717. https://doi.org/10.1145/3704860
[27] John M. Li, Amal Ahmed, and Steven Holtzen. 2023. Lilac: A Modal Separation Logic for Conditional Probability. Proceedings of the ACM on Programming Languages 7, PLDI (June 2023), 112:148–112:171. https://doi.org/10.1145/3591226
[28] Jacques Malenfant, Christophe Dony, and Pierre Cointe. 1996. A Semantics of Introspection in a Reflective Prototype-Based Language. LISP and Symbolic Computation 9, 2 (May 1996), 153–179. https://doi.org/10.1007/BF01806111
[29] Vikash Mansinghka, Daniel Selsam, and Yura Perov. 2014. Venture: A Higher-Order Probabilistic Programming Platform with Programmable Inference. arXiv (March 2014), 78–78. arXiv:1404.0099
[30] Jacob Matthews and Robert Bruce Findler. 2008. An Operational Semantics for Scheme. J. Funct. Program. 18, 1 (Jan. 2008), 47–86. https://doi.org/10.1017/S0956796807006478
[31] Jacob Matthews and Robert Bruce Findler. 2007. Operational Semantics for Multi-Language Programs. ACM SIGPLAN Notices 42, 1 (2007), 3–10. https://doi.org/10.1145/1190215.1190220
[32] Torben Meisling. 1958. Discrete-Time Queuing Theory. Operations Research 6, 1 (1958), 96–105. https://doi.org/10.1287/opre.6.1.96
[33] Lawrence M. Murray, Daniel Lundén, Jan Kudlicka, David Broman, and Thomas B. Schön. 2018. Delayed Sampling and Automatic Rao-Blackwellization of Probabilistic Programs. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018 (Proceedings of Machine Learning Research, Vol. 84). PMLR, 1037–1046. http://proceedings.mlr.press/v84/murray18a.html
[34] Radford M. Neal. 2011. MCMC Using Hamiltonian Dynamics. Handbook of Markov Chain Monte Carlo 54 (2011), 113–162.
[35] Max Stewart New. 2020. A Semantic Foundation for Sound Gradual Typing. Ph.D. Dissertation. Northeastern University, USA. https://dl.acm.org/doi/book/10.5555/AAI28263083
[36] Max S New, William J Bowman, and Amal Ahmed. 2016. Fully Abstract Compilation via Universal Embedding. In Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming. 103–116. https://doi.org/10.1145/2951913.2951941
[37] Fritz Obermeyer, Eli Bingham, Martin Jankowiak, Neeraj Pradhan, Justin Chiu, Alexander Rush, and Noah Goodman. 2019. Tensor Variable Elimination for Plated Factor Graphs. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97). 4871–4880. https://proceedings.mlr.press/v97/obermeyer19a.html
[38] A. Onisko, Marek J. Druzdzel, H. Wasyluk, and A. Onisko. 2005. A Probabilistic Causal Model for Diagnosis of Liver Disorders.
[39] Daniel Patterson, Noble Mushtak, Andrew Wagner, and Amal Ahmed. 2022. Semantic Soundness for Language Interoperability. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. ACM, San Diego, CA, USA, 609–624. https://doi.org/10.1145/3519939.3523703
[40] Daniel Baker Patterson. 2022. Interoperability through Realizability: Expressing High-Level Abstractions Using Low-Level Code. Ph.D. Dissertation. Northeastern University. https://doi.org/10.17760/D20467221
[41] Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[42] Tom Rainforth, Robert Cornish, Hongseok Yang, and Andrew Warrington. 2018. On Nesting Monte Carlo Estimators. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018 (Proceedings of Machine Learning Research, Vol. 80). PMLR, 4264–4273. http://proceedings.mlr.press/v80/rainforth18a.html
[43] Norman Ramsey and Avi Pfeffer. 2002. Stochastic Lambda Calculus and Monads of Probability Distributions. In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 154–165. https://doi.org/10.1145/503272.503288
[44] Christian P Robert, George Casella, and George Casella. 1999. Monte Carlo Statistical Methods. Vol. 2. Springer.
[45] Michael Sammler, Simon Spies, Youngju Song, Emanuele D'Osualdo, Robbert Krebbers, Deepak Garg, and Derek Dreyer. 2023. DimSum: A Decentralized Approach to Multi-language Semantics and Verification. Proceedings of the ACM on Programming Languages 7, POPL (Jan. 2023), 27:775–27:805. https://doi.org/10.1145/3571220
[46] Tian Sang, Paul Beame, and Henry Kautz. 2005. Performing Bayesian Inference by Weighted Model Counting. In Proceedings of the 20th National Conference on Artificial Intelligence - Volume 1 (AAAI'05). AAAI Press, Pittsburgh, Pennsylvania, 475–481. https://dl.acm.org/doi/10.5555/1619332.1619409
[47] Gabriel Scherer, Max New, Nick Rioux, and Amal Ahmed. 2018. FabULous Interoperability for ML and a Linear Language. In Foundations of Software Science and Computation Structures. Springer International Publishing, Cham, 146–162. https://doi.org/10.1007/978-3-319-89366-2_8
[48] Adam Michał Ścibior. 2018. Formally Justified and Modular Bayesian Inference for Probabilistic Programs. Ph.D. Dissertation. Apollo - University of Cambridge Repository.
[49] Jeremy G Siek and Walid Taha. 2006. Gradual Typing for Functional Languages. In Proceedings of the Scheme and Functional Programming Workshop.
[50] Alex Simpson. 2017. Probability Sheaves and the Giry Monad. In 7th Conference on Algebra and Coalgebra in Computer Science (CALCO 2017). Schloss Dagstuhl–Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.CALCO.2017.1
[51] Alex Simpson. 2018. Category-Theoretic Structure for Independence and Conditional Independence. Electronic Notes in Theoretical Computer Science 336 (2018), 281–297. https://doi.org/10.1016/j.entcs.2018.03.028
[52] Brian Cantwell Smith. 1984. Reflection and Semantics in LISP. In Proceedings of the 11th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL '84). Association for Computing Machinery, New York, NY, USA, 23–35. https://doi.org/10.1145/800017.800513
[53] Steffen Smolka, Praveen Kumar, David M Kahn, Nate Foster, Justin Hsu, Dexter Kozen, and Alexandra Silva. 2019. Scalable Verification of Probabilistic Networks. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. 190–203. https://doi.org/10.1145/3314221.3314639
[54] Sam Staton. 2017. Commutative Semantics for Probabilistic Programming. In Programming Languages and Systems (Lecture Notes in Computer Science). Springer, Berlin, Heidelberg, 855–879. https://doi.org/10.1007/978-3-662-54434-1_32
[55] Sam Staton, Hongseok Yang, Frank Wood, Chris Heunen, and Ohad Kammar. 2016. Semantics for Probabilistic Programming: Higher-order Functions, Continuous Distributions, and Soft Constraints. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science (LICS '16). ACM, New York, NY, USA, 525–534. https://doi.org/10.1145/2933575.2935313
[56] Sam Stites, John M. Li, and Steven Holtzen. 2025. Artifact: Multi-Language Probabilistic Programming. Zenodo. https://doi.org/10.5281/zenodo.14593465
[57] Sam Stites, Heiko Zimmermann, Hao Wu, Eli Sennesh, and Jan-Willem van de Meent. 2021. Learning Proposals for Probabilistic Programs with Inference Combinators. In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence. PMLR, 1056–1066. https://proceedings.mlr.press/v161/stites21a.html
[58] A. Stuhlmüller and N. D. Goodman. 2014. Reasoning about Reasoning by Nested Conditioning: Modeling Theory of Mind with Probabilistic Programs. Cognitive Systems Research 28 (June 2014), 80–99. https://doi.org/10.1016/j.cogsys.2013.07.003
[59] Sam Tobin-Hochstadt and Matthias Felleisen. 2008. The Design and Implementation of Typed Scheme. In Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '08). Association for Computing Machinery, New York, NY, USA, 395–406. https://doi.org/10.1145/1328438.1328486
[60] Sam Tobin-Hochstadt, Vincent St-Amour, Ryan Culpepper, Matthew Flatt, and Matthias Felleisen. 2011. Languages as Libraries. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. 132–141. https://doi.org/10.1145/1993498.1993514
[61] Jesse A. Tov and Riccardo Pucella. 2010. Stateful Contracts for Affine Types. In Programming Languages and Systems. Springer, Berlin, Heidelberg, 550–569. https://doi.org/10.1007/978-3-642-11957-6_29
[62] Wenjia Ye, Matías Toro, and Federico Olmedo. 2023. A Gradual Probabilistic Lambda Calculus. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (April 2023), 84:256–84:285. https://doi.org/10.1145/3586036
[63] Yizhou Zhang and Nada Amin. 2022. Reasoning about "Reasoning about Reasoning": Semantics and Contextual Equivalence for Probabilistic Programs with Nested Queries and Recursion. Proceedings of the ACM on Programming Languages 6, POPL (Jan. 2022), 16:1–16:28. https://doi.org/10/gqqdp8
A APPENDIX
A.1 Syntax
Disc Expressions 𝑀,𝑁::= 𝑋|true |false |𝑀∧𝑁|¬𝑀
|⟨⟩ |⟨𝑀, 𝑁 ⟩|fst 𝑀|snd 𝑀
|ret 𝑀|let 𝑋be 𝑀in 𝑁|if 𝑒then 𝑀else 𝑁
|flip 𝑒|observe 𝑀|L𝑒M𝐸
Types 𝐴,𝐵::= unit |bool |𝐴×𝐵
Contexts Δ::= 𝑋1:𝐴1, . . . , 𝑋𝑛:𝐴𝑛
Cont Expressions 𝑒::= 𝑥|true |false |𝑟|𝑒1+𝑒2|−𝑒|𝑒1·𝑒2|𝑒1≤𝑒2
|() |(𝑒1, 𝑒2)|fst 𝑒|snd 𝑒
|ret 𝑒|let 𝑥be 𝑒1in 𝑒2|if 𝑒1then 𝑒2else 𝑒3
|𝑑|obs(𝑒𝑜, 𝑑 )|L𝑀M𝑆
Distributions 𝑑::= flip 𝑒|uniform 𝑒1𝑒2|poisson 𝑒
Types 𝜎,𝜏::= unit |bool |real |𝜎×𝜏
Contexts Γ::= 𝑥1:𝜏1, . . . , 𝑥𝑛:𝜏𝑛
Number literals 𝑟∈R
A.2 Typing rules
A.2.1 Convertibility.
unit ↭unit bool ↭bool
𝐴↭𝜏𝐵↭𝜎
𝐴×𝐵↭𝜏×𝜎
A.2.2 Pure exact sublanguage.
Δ(𝑋)=𝐴
Δ⊢𝑋:𝐴Δ⊢true :bool Δ⊢false :bool
Δ⊢𝑀:bool Δ⊢𝑁:bool
Δ⊢𝑀∧𝑁:bool
Δ⊢𝑀:bool
Δ⊢¬𝑀:bool Δ⊢⟨⟩ :unit
Δ⊢𝑀:𝐴Δ⊢𝑁:𝐵
Δ⊢⟨𝑀, 𝑁 ⟩:𝐴×𝐵
Δ⊢𝑀:𝐴×𝐵
Δ⊢fst 𝑀:𝐴
Δ⊢𝑀:𝐴×𝐵
Δ⊢snd 𝑀:𝐵
A.2.3 Effectful exact sublanguage.
Δ⊢𝑀:𝐴
Γ;Δ⊢cret 𝑀:𝐴
Γ;Δ⊢c𝑀:𝐴Γ;Δ,𝑋:𝐴⊢c𝑀:𝐵
Γ;Δ⊢clet 𝑋be 𝑀in 𝑁:𝐵
Γ⊢𝑒:bool Γ;Δ⊢c𝑀:𝐴Γ;Δ⊢c𝑁:𝐴
Γ;Δ⊢cif 𝑒then 𝑀else 𝑁:𝐴
Γ⊢𝑒:real
Γ;Δ⊢cflip 𝑒:bool
Δ⊢𝑀:bool
Γ;Δ⊢cobserve 𝑀:unit
Γ;Δ⊢c𝑒:𝜏𝐴↭𝜏
Γ;Δ⊢cL𝑒M𝐸:𝐴
A.2.4 Pure sampling sublanguage.
Γ(𝑥)=𝜏
Γ⊢𝑥:𝜏Γ⊢true :bool Γ⊢false :bool Γ⊢𝑟:real
Γ⊢𝑒1:real Γ⊢𝑒2:real
Γ⊢𝑒1+𝑒2:real
Γ⊢𝑒:real
Γ⊢−𝑒:real
Γ⊢𝑒1:real Γ⊢𝑒2:real
Γ⊢𝑒1·𝑒2:real
Γ⊢𝑒1:real Γ⊢𝑒2:real
Γ⊢𝑒1≤𝑒2:bool Γ⊢() :unit
Γ⊢𝑒1:𝜎Γ⊢𝑒2:𝜏
Γ⊢(𝑒1, 𝑒2):𝜎×𝜏
Γ⊢𝑒:𝜎×𝜏
Γ⊢fst 𝑒:𝜏
Γ⊢𝑒:𝜎×𝜏
Γ⊢snd 𝑒:𝜏
A.2.5 Effectful sampling sublanguage.
Γ⊢𝑒:𝜏
Γ;Δ⊢cret 𝑒:𝜏
Γ;Δ⊢c𝑒1:𝜎Γ,𝑥:𝜎;Δ⊢c𝑒2:𝜏
Γ;Δ⊢clet 𝑥be 𝑒1in 𝑒2:𝜏
Γ⊢𝑒1:bool Γ;Δ⊢c𝑒2:𝜏Γ;Δ⊢c𝑒3:𝜏
Γ;Δ⊢cif 𝑒1then 𝑒2else 𝑒3:𝜏
Γ⊢𝑒:real
Γ;Δ⊢cflip 𝑒:bool
Γ⊢𝑒1:real Γ⊢𝑒2:real
Γ;Δ⊢cuniform 𝑒1𝑒2:real
Γ⊢𝑒:real
Γ;Δ⊢cpoisson 𝑒:real
Γ;Δ⊢c𝑀:𝐴 𝐴 ↭𝜏
Γ;Δ⊢cL𝑀M𝑆:𝜏
Γ⊢𝑒𝑜:bool Γ⊢𝑒1:real
Γ;Δ⊢cobs(𝑒𝑜,flip 𝑒1):unit
Γ⊢𝑒𝑜:real Γ⊢𝑒1:real Γ⊢𝑒2:real
Γ;Δ⊢cobs(𝑒𝑜,uniform 𝑒1𝑒2):unit
Γ⊢𝑒𝑜:real Γ⊢𝑒:real
Γ;Δ⊢cobs(𝑒𝑜,poisson 𝑒):unit
A.3 Semantics of types
A.3.1 Types.
J𝐴K : finite discrete measurable space
JunitK=the one-point space {★}
JboolK={⊤,⊥}
J𝐴×𝐵K=J𝐴K×J𝐵K
J𝜏K:measurable space
JunitK=the one-point space {★}
JboolK=the discrete two-point space {⊤,⊥}
JrealK=R
J𝜎×𝜏K=J𝜎K×J𝜏K
A.3.2 Contexts.
JΓK : measurable space
JΓK = ∏_{𝑥 ∈ dom Γ} JΓ(𝑥)K
JΔK : finite discrete measurable space
JΔK = ∏_{𝑋 ∈ dom Δ} JΔ(𝑋)K
A.3.3 Convertibility.
Lemma A.1. If 𝐴↭𝜏then J𝐴K=J𝜏K.
Proof. By induction on 𝐴↭𝜏.□
A.4 Semantics of pure programs
JΔ⊢𝑀:𝐴K:JΔK→J𝐴K
J𝑋K(𝛿)=𝛿(𝑋)
JtrueK(𝛿)=⊤
JfalseK(𝛿)=⊥
J𝑀∧𝑁K(𝛿) = ⊤ if J𝑀K(𝛿) = J𝑁K(𝛿) = ⊤, and ⊥ otherwise
J¬𝑀K(𝛿) = ⊥ if J𝑀K(𝛿) = ⊤, and ⊤ otherwise
J⟨⟩K(𝛿)=★
J⟨𝑀, 𝑁 ⟩K(𝛿)=(J𝑀K(𝛿),J𝑁K(𝛿))
Jfst 𝑀K(𝛿)=𝜋1(J𝑀K(𝛿))
Jsnd 𝑀K(𝛿)=𝜋2(J𝑀K(𝛿))
JΓ⊢𝑒:𝜏K : JΓK → J𝜏K (a measurable function)
J𝑥K(𝛾)=𝛾(𝑥)
JtrueK(𝛾)=⊤
JfalseK(𝛾)=⊥
J𝑟K(𝛾)=𝑟
J𝑒1+𝑒2K(𝛾)=J𝑒1K(𝛾) + J𝑒2K(𝛾)
J−𝑒K(𝛾)=−J𝑒K(𝛾)
J𝑒1·𝑒2K(𝛾)=J𝑒1K(𝛾) · J𝑒2K(𝛾)
J𝑒1≤𝑒2K(𝛾) = ⊤ if J𝑒1K(𝛾) ≤ J𝑒2K(𝛾), and ⊥ otherwise
J() K(𝛾)=★
J(𝑒1, 𝑒2)K(𝛾)=(J𝑒1K(𝛾),J𝑒2K(𝛾))
Jfst 𝑒K(𝛾)=𝜋1(J𝑒K(𝛾))
Jsnd 𝑒K(𝛾)=𝜋2(J𝑒K(𝛾))
A.5 Effectful programs: high-level model
Definition A.2. Let Dist be the distribution monad defined over measurable spaces. Let Distw be the writer monad transformer applied to Dist and the monoid ([0, 1], 1) of nonnegative reals under multiplication. Concretely, Distw sends a measurable space 𝐴 to the set Dist([0, 1] × 𝐴). Let ret 𝑥 and 𝑥 ← 𝜇; 𝑓(𝑥) denote the usual monad operations with respect to Distw, and let (·) : Dist(𝐴) → Distw(𝐴) be the usual lifting operation. In addition to the usual operations,
• Let score : [0, 1] → Distw{★} be the function that sends a weight 𝑤 to the Dirac distribution 𝛿(𝑤,★) centered at (𝑤, ★).
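The following sketch (ours, purely illustrative; it is not how the implementation represents Distw) reads Distw computationally as a weighted sampler: a computation is a thunk producing a (weight, value) pair, ret attaches weight 1, bind multiplies weights, and score emits the given weight.

    import random

    def ret(x):
        return lambda: (1.0, x)            # unit weight

    def bind(m, f):
        def run():
            w1, a = m()
            w2, b = f(a)()
            return w1 * w2, b              # weights multiply (the writer monoid)
        return run

    def score(w):
        return lambda: (w, None)           # Dirac distribution at (w, *)

    def flip(p):
        p = p if 0.0 <= p <= 1.0 else 0.0
        return lambda: (1.0, random.random() < p)

    # flip 0.3 followed by an observation that the coin came up true:
    prog = bind(flip(0.3), lambda b: bind(score(1.0 if b else 0.0), lambda _: ret(b)))
    print(prog())   # (1.0, True) with probability 0.3, otherwise (0.0, False)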
Definition A.3. For 𝑝 ∈ R, let flip(𝑝) be the Bernoulli distribution with parameter 𝑝 if 𝑝 ∈ [0, 1], and a point mass at ⊥ otherwise.

Definition A.4. For 𝑎, 𝑏 ∈ R, let uniform(𝑎, 𝑏) be the uniform distribution on the interval [𝑎, 𝑏] if 𝑎 ≤ 𝑏, and a point mass at min(𝑎, 𝑏) otherwise.

Definition A.5. For 𝜆 ∈ R, let poisson(𝜆) be the Poisson distribution with parameter 𝜆 if 𝜆 > 0, and a point mass at 0 otherwise.
A.5.1 Exact sublanguage.
HJΓ;Δ⊢c𝑀:𝐴K:JΓK×JΔK→DistwJ𝐴K
HJret 𝑀K(𝛾, 𝛿 )=ret(J𝑀K(𝛿))
HJlet 𝑋be 𝑀in 𝑁K(𝛾, 𝛿 )=𝑥← HJ𝑀K(𝛾, 𝛿 );
HJ𝑁K(𝛾, 𝛿 [𝑋↦→ 𝑥])
HJif 𝑒then 𝑀else 𝑁K(𝛾, 𝛿 )=
if J𝑒K(𝛾)
then HJ𝑀K(𝛾, 𝛿 )
else HJ𝑁K(𝛾, 𝛿 )
HJflip 𝑒K(𝛾, 𝛿) = flip(J𝑒K(𝛾))
HJobserve 𝑀K(𝛾, 𝛿 )=score(1J𝑀K(𝛿)=⊤)
HJL𝑒M𝐸K(𝛾, 𝛿 )=HJ𝑒K(𝛾, 𝛿 )
A.5.2 Sampling sublanguage.
HJΓ;Δ⊢c𝑒:𝜏K:JΓK×JΔK→DistwJ𝜏K
HJret 𝑒K(𝛾, 𝛿 )=ret(J𝑒K(𝛾))
HJlet 𝑥be 𝑒1in 𝑒2K(𝛾, 𝛿 )=𝑥← HJ𝑒1K(𝛾, 𝛿 );
HJ𝑒2K(𝛾[𝑥↦→ 𝑥], 𝛿)
HJif 𝑒1then 𝑒2else 𝑒3K(𝛾, 𝛿 )=
if J𝑒1K(𝛾)
then HJ𝑒2K(𝛾, 𝛿 )
else HJ𝑒3K(𝛾, 𝛿 )
HJflip 𝑒K(𝛾, 𝛿) = flip(J𝑒K(𝛾))
HJuniform 𝑒1𝑒2K(𝛾, 𝛿 )=uniform(J𝑒1K(𝛾),J𝑒2K(𝛾))
HJpoisson 𝑒K(𝛾, 𝛿 )=poisson(J𝑒K(𝛾))
HJL𝑀M𝑆K(𝛾, 𝛿 )=HJ𝑀K(𝛾, 𝛿 )
HJobs(𝑒𝑜,flip 𝑒1)K(𝛾, 𝛿) = score(flip(J𝑒1K(𝛾))(J𝑒𝑜K(𝛾)))
HJobs(𝑒𝑜,poisson 𝑒1)K(𝛾, 𝛿) = score(poisson(J𝑒1K(𝛾))(J𝑒𝑜K(𝛾)))
HJobs(𝑒𝑜,uniform 𝑒1𝑒2)K(𝛾, 𝛿) = score(uniform(J𝑒1K(𝛾),J𝑒2K(𝛾))(J𝑒𝑜K(𝛾)))
A.6 Effectful programs: low-level model
Definition A.6. A finite conditional probability space is a triple (Ω, 𝜇, 𝐸) where (1) Ω is a nonempty finite prefix of N, (2) 𝜇 : Ω → [0, 1] is a discrete probability distribution, and (3) 𝐸 is a subset of Ω. Let FCPS be the set of finite conditional probability spaces.

For readability, finite conditional probability spaces (Ω, 𝜇, 𝐸) will be written Ω unless disambiguation is needed.
Lemma A.7. FCPS is a measurable space.

Proof. There is an injective function 𝑖 : FCPS ↩→ N × list(R) × Pfin(N) sending a finite conditional probability space ({0, . . . , 𝑛−1}, 𝜇, 𝐸) to the triple (𝑛, [𝜇(0), . . . , 𝜇(𝑛−1)], 𝐸). The codomain of 𝑖 is a measurable space with 𝜎-algebra defined in the standard way, and the image of 𝑖 is a measurable subset of this space. The injection 𝑖 identifies the image of 𝑖 with FCPS, making FCPS a measurable space by taking preimages along 𝑖. □
Definition A.8. A map of finite conditional probability spaces 𝑓 : (Ω, 𝜇, 𝐸) → (Ω′, 𝜇′, 𝐸′) is a measure-preserving map 𝑓 : (Ω, 𝜇) → (Ω′, 𝜇′) such that 𝐸 ⊆ 𝑓−1(𝐸′). For two finite conditional probability spaces (Ω, 𝜇, 𝐸) and (Ω′, 𝜇′, 𝐸′), let (Ω, 𝜇, 𝐸) →FCPS (Ω′, 𝜇′, 𝐸′) be the set of maps from (Ω, 𝜇, 𝐸) to (Ω′, 𝜇′, 𝐸′).
Definition A.9. For every 𝐴 and 𝜏 let M𝐴 and M𝜏 be the following FCPS-indexed families of sets:

    M𝐴 : FCPS → Set
    M𝐴 Ω = Distw( Σ_{Ω′ ∈ FCPS} (Ω′ →FCPS Ω) × (Ω′ → J𝐴K) )

    M𝜏 : FCPS → Set
    M𝜏 Ω = Distw( Σ_{Ω′ ∈ FCPS} (Ω′ →FCPS Ω) × J𝜏K )

Proof. For these to be well-defined, the arguments to Distw must be measurable spaces. Elements of Σ_{Ω′ ∈ FCPS} (Ω′ →FCPS Ω) × (Ω′ → J𝐴K) are triples (Ω′, 𝑓, 𝑋) where Ω′ ∈ FCPS, 𝑓 is a map of finite conditional probability spaces, and 𝑋 : Ω′ → J𝐴K. There is a canonical injective function 𝑖 from this set to the set FCPS × (N ⇀fin N) × (N ⇀fin J𝐴K) whose elements are triples (Ω′, 𝑓, 𝑋) where Ω′ ∈ FCPS and 𝑓, 𝑋 are partial functions of type N ⇀ N and N ⇀ J𝐴K respectively with finite domain. This set is a measurable space by Lemma A.7 and by putting the discrete 𝜎-algebras on N ⇀fin N and N ⇀fin J𝐴K. The image of 𝑖 is a measurable subset of this space, making Σ_{Ω′ ∈ FCPS} (Ω′ →FCPS Ω) × (Ω′ → J𝐴K) a measurable space too. The analogous argument also makes Σ_{Ω′ ∈ FCPS} (Ω′ →FCPS Ω) × J𝜏K into a measurable space. □
Definition A.10. For two finite conditional probability spaces (Ω, 𝜇, 𝐸) and (Ω′, 𝜇′, 𝐸′), their tensor product, written (Ω, 𝜇, 𝐸) ⊗ (Ω′, 𝜇′, 𝐸′), is (𝑓(Ω × Ω′), 𝜈 ◦ 𝑓−1, 𝑓(𝐸 × 𝐸′)), where 𝜈 : Ω × Ω′ → [0, 1] is the measure 𝜈(𝜔, 𝜔′) = 𝜇(𝜔)𝜇′(𝜔′) and 𝑓 is an arbitrary isomorphism Ω × Ω′ → {1, . . . , |Ω||Ω′|} (such as 𝑓(𝜔, 𝜔′) = |Ω′|𝜔 + 𝜔′). There are canonical projection maps 𝜋1 : (Ω, 𝜇, 𝐸) ⊗ (Ω′, 𝜇′, 𝐸′) → (Ω, 𝜇, 𝐸) and 𝜋2 : (Ω, 𝜇, 𝐸) ⊗ (Ω′, 𝜇′, 𝐸′) → (Ω′, 𝜇′, 𝐸′).
The tensor product has a unit, written emp, defined as (Ωemp, 𝜇emp, 𝐸emp) where Ωemp = 𝐸emp = {0} and 𝜇emp(0) = 1.
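As an illustration (ours): if Ωflip(𝑝) denotes the space ({0, 1}, 𝜇, {0, 1}) with 𝜇(1) = 𝑝, then Ωflip(𝑝) ⊗ Ωflip(𝑞) is, up to the choice of isomorphism 𝑓, a four-point space whose measure assigns 𝑝𝑞, 𝑝(1−𝑞), (1−𝑝)𝑞, and (1−𝑝)(1−𝑞) to the images of (1,1), (1,0), (0,1), and (0,0), and whose accepting set is the whole space. This is exactly the space constructed by the clause for flip in Appendix A.6.1, which tensors the current space Ω with a fresh Ωflip.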
Definition A.11. To model sampling from finite conditional probability spaces,
• For any nonempty set Ω, let zeroΩ : Distw(Ω) be the Dirac distribution at (0, 𝜈) ∈ [0, 1] × Ω, where 𝜈 is an arbitrary element of Ω.
• For any finite conditional probability space (Ω, 𝜇, 𝐸), let 𝜇|𝐸 : Distw(Ω) be zeroΩ if 𝜇(𝐸) = 0, and the lifting of the conditional distribution of 𝜇 given 𝐸 if 𝜇(𝐸) > 0.
A.6.1 Exact sublanguage.
LJΓ;Δ⊢c𝑀:𝐴K(Ω):JΓK× (Ω→JΔK) → M𝐴(Ω)
LJret 𝑀K(Ω)(𝛾, 𝐷 )=ret(Ω,id,J𝑀K◦𝐷)
LJlet 𝑋be 𝑀in 𝑁K(Ω)(𝛾, 𝐷 )=
(Ω1, 𝑓1, 𝑋 ) ← LJ𝑀K(Ω)(𝛾, 𝐷 );
(Ω2, 𝑓2, 𝑌 ) ← LJ𝑁K(Ω1)(𝛾, (𝐷◦𝑓1) [𝑋↦→ 𝑋] );
ret(Ω2, 𝑓1◦𝑓2, 𝑌 )
LJif 𝑒then 𝑀else 𝑁K(Ω)(𝛾, 𝐷 )=
if J𝑒K(𝛾)
then LJ𝑀K(Ω)(𝛾, 𝐷 )
else LJ𝑁K(Ω)(𝛾, 𝐷 )
LJflip 𝑒K(Ω) (𝛾, 𝐷)=
𝑝:=if J𝑒K(𝛾) ∈ [0,1]then J𝑒K(𝛾)else 0;
Ωflip := ({0,1}, 𝜇, {0,1}) where 𝜇(1)=𝑝;
Ω′ := Ω ⊗ Ωflip;
𝑋:=𝜔′↦→ if 𝜋2(𝜔′)=1 then ⊤else ⊥;
ret(Ω′, 𝜋1, 𝑋 )
LJobserve 𝑀K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷)=
𝐹:=(J𝑀K(Ω)(𝐷))−1(⊤);
score(𝜇|𝐸(𝐹));
ret( (Ω, 𝜇, 𝐸 ∩𝐹),id,★)
LJL𝑒M𝐸K(Ω)(𝛾, 𝐷 )=(Ω′, 𝑓 , 𝑥 ) ← LJ𝑒K(Ω)(𝛾, 𝐷 );
ret(Ω′, 𝑓 , _↦→ 𝑥)
A.6.2 Sampling sublanguage.
LJΓ;Δ⊢c𝑒:𝜏K(Ω):JΓK× (Ω→JΔK) → M𝜏(Ω)
LJret 𝑒K(Ω)(𝛾, 𝐷 )=ret(Ω,id,J𝑒K(𝛾))
LJlet 𝑥be 𝑒1in 𝑒2K(Ω)(𝛾, 𝐷 )=
(Ω1, 𝑓1, 𝑥 ) ← LJ𝑒1K(Ω) (𝛾, 𝐷 )
(Ω2, 𝑓2, 𝑦) ← LJ𝑒2K(Ω1)(𝛾[𝑥↦→ 𝑥], 𝐷 ◦𝑓1)
ret(Ω2, 𝑓1◦𝑓2, 𝑦 )
LJif 𝑒1then 𝑒2else 𝑒3K(Ω)(𝛾, 𝐷 )=
if J𝑒1K(𝛾)
then LJ𝑒2K(Ω)(𝛾, 𝐷 )
else LJ𝑒3K(Ω)(𝛾, 𝐷 )
LJflip 𝑒K(Ω)(𝛾, 𝐷 )=𝑥←flip(J𝑒K(𝛾))
ret(Ω,id, 𝑥 )
LJuniform 𝑒1𝑒2K(Ω)(𝛾, 𝐷 )=𝑥←uniform(J𝑒1K(𝛾),J𝑒2K(𝛾))
ret(Ω,id, 𝑥 )
LJpoisson 𝑒K(Ω)(𝛾, 𝐷 )=𝑥←poisson(J𝑒K(𝛾))
ret(Ω,id, 𝑥 )
LJobs(𝑒𝑜,flip 𝑒1)K(Ω) (𝛾 , 𝐷)=
𝑤:=flip(J𝑒1K(𝛾)) (J𝑒𝑜K(𝛾))
score(𝑤);
ret(Ω,id,★)
LJobs(𝑒𝑜,uniform 𝑒1𝑒2)K(Ω) (𝛾 , 𝐷)=
𝑤:=uniform(J𝑒1K(𝛾),J𝑒2K(𝛾)) (J𝑒𝑜K(𝛾))
score(𝑤);
ret(Ω,id,★)
LJobs(𝑒𝑜,poisson 𝑒1)K(Ω) (𝛾 , 𝐷)=
𝑤:=poisson(J𝑒1K(𝛾)) (J𝑒𝑜K(𝛾))
score(𝑤);
ret(Ω,id,★)
LJL𝑀M𝑆K(Ω)(𝛾, 𝐷 )=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJ𝑀K(Ω) (𝛾, 𝐷 )
𝑥←𝜔′←𝜇′|𝐸′; ret(𝑋(𝜔′))
ret( (Ω′, 𝜇′, 𝐸′∩𝑋−1(𝑥)), 𝑓 , 𝑥 )
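To make the final clause above concrete, the following sketch (ours; finite spaces are represented naively as dictionaries rather than as BDDs, and all names are illustrative) performs the sample-and-condition step of a Disc-to-Cont boundary L𝑀M𝑆: draw 𝑥 from the distribution of 𝑋 under 𝜇′|𝐸′, then shrink the accepting set to 𝐸′ ∩ 𝑋−1(𝑥) so that later crossings remain consistent with the value already returned (cf. Lemma A.16).

    import random

    def condition(mu, E):
        """Condition the finite distribution mu (a dict outcome -> probability)
        on the event E; assumes mu(E) > 0."""
        z = sum(mu[w] for w in E)
        return {w: (mu[w] / z if w in E else 0.0) for w in mu}

    def boundary_sample(space, X):
        """One Disc-to-Cont boundary crossing: draw x from the pushforward of
        mu|E along X, then condition the accepting set on X = x so that later
        crossings agree with the value already returned."""
        omega, mu, E = space
        cond = condition(mu, E)
        r, acc, x = random.random(), 0.0, None
        for w in omega:
            acc += cond[w]
            if r <= acc:
                x = X(w)
                break
        if x is None:                      # guard against floating-point round-off
            x = X(max(E))
        return (omega, mu, {w for w in E if X(w) == x}), x

    # A fair three-point space queried for parity:
    space = ([0, 1, 2], {0: 1/3, 1: 1/3, 2: 1/3}, {0, 1, 2})
    space, parity = boundary_sample(space, lambda w: w % 2)
    print(parity, space[2])   # e.g. 0 and {0, 2}, or 1 and {1}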
A.7 Soundness
Definition A.12. Two computations 𝜇, 𝜈 : Distw𝐴 are equal as importance samplers, written 𝜇 ≃ 𝜈, if for all bounded integrable 𝑘 : 𝐴 → R it holds that E_{(𝑎,𝑥)∼𝜇}[𝑎 · 𝑘(𝑥)] = E_{(𝑏,𝑦)∼𝜈}[𝑏 · 𝑘(𝑦)].
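For example (an illustration we add): if 𝜇 is the Dirac distribution at (1/2, 𝑥) and 𝜈 returns (1, 𝑥) with probability 1/2 and (0, 𝑦) with probability 1/2, then both satisfy E[𝑎 · 𝑘] = 𝑘(𝑥)/2 for every 𝑘, so 𝜇 ≃ 𝜈 even though 𝜇 ≠ 𝜈 as elements of Distw𝐴. It is precisely this coarser notion of equality that lets an exactly-conditioned space be exchanged for a weighted rejection-style sampler, as in Lemma A.15 below.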
Lemma A.13. The equivalence relation ≃ is a congruence for the monad Distw: if 𝜇 ≃ 𝜈 : Distw𝐴 and 𝑓, 𝑔 : 𝐴 → Distw𝐵 with 𝑓(𝑥) ≃ 𝑔(𝑥) for all 𝑥 in 𝐴, then (𝑥 ← 𝜇; 𝑓(𝑥)) ≃ (𝑥 ← 𝜈; 𝑔(𝑥)).
Proof. If 𝑘 : 𝐵 → R is bounded integrable then

    E_{(𝑐,𝑦) ∼ (𝑥←𝜇; 𝑓(𝑥))}[𝑐 · 𝑘(𝑦)]
        = E_{(𝑎,𝑥)∼𝜇} E_{(𝑏,𝑦)∼𝑓(𝑥)}[𝑎 · 𝑏 · 𝑘(𝑦)]          (1)
        = E_{(𝑎,𝑥)∼𝜇} E_{(𝑏,𝑦)∼𝑔(𝑥)}[𝑎 · 𝑏 · 𝑘(𝑦)]          (2)
        = E_{(𝑎,𝑥)∼𝜈} E_{(𝑏,𝑦)∼𝑔(𝑥)}[𝑎 · 𝑏 · 𝑘(𝑦)]
        = E_{(𝑐,𝑦) ∼ (𝑥←𝜈; 𝑔(𝑥))}[𝑐 · 𝑘(𝑦)],

where (1) follows from 𝑓(𝑥) ≃ 𝑔(𝑥) and (2) from 𝜇 ≃ 𝜈, using linearity of expectation throughout as needed. □
Lemma A.14. If (Ω, 𝜇, 𝐸) ∈ FCPS and 𝐹 ⊆ Ω then 𝜇|𝐸|𝐹 = 𝜇|𝐸∩𝐹.

Proof. If 𝜇(𝐸 ∩ 𝐹) = 0 then both sides are the zero measure. Otherwise, for all 𝐺 we have (𝜇|𝐸|𝐹)(𝐺) = 𝜇|𝐸(𝐹 ∩ 𝐺)/𝜇|𝐸(𝐹) = 𝜇(𝐸 ∩ 𝐹 ∩ 𝐺)/𝜇(𝐸 ∩ 𝐹) = 𝜇|𝐸∩𝐹(𝐺). □
Lemma A.15. If (Ω, 𝜇, 𝐸) ∈ FCPS then

    (score(𝜇(𝐸)); 𝜔 ← 𝜇|𝐸; ret 𝜔)  ≃  (𝜔 ← 𝜇; score(1_{𝜔∈𝐸}); ret 𝜔).

Proof. For all 𝑘 : Ω → R we have

    E_{(𝑎,𝜔)∼LHS}[𝑎 · 𝑘(𝜔)] = E_{𝜔∼𝜇|𝐸}[𝜇(𝐸) 𝑘(𝜔)] = Σ_{𝜔∈Ω} 𝜇|𝐸(𝜔) 𝜇(𝐸) 𝑘(𝜔) = Σ_{𝜔∈Ω} 𝜇({𝜔} ∩ 𝐸) 𝑘(𝜔)
        = Σ_{𝜔∈Ω} 1_{𝜔∈𝐸} 𝜇(𝜔) 𝑘(𝜔) = E_{(𝑎,𝜔)∼RHS}[𝑎 · 𝑘(𝜔)]. □
Lemma A.16. If (Ω, 𝜇, 𝐸) ∈ FCPS and 𝑋 : Ω → 𝐴 with 𝐴 finite then

    (𝑥 ← (𝜔 ← 𝜇; ret(𝑋 𝜔)); 𝜔′ ← 𝜇|𝑋−1(𝑥); ret(𝑥, 𝜔′))  =  (𝜔′ ← 𝜇; ret(𝑋 𝜔′, 𝜔′)).

Proof. All distributions involved are discrete, so it suffices to show that LHS and RHS give the same probability to pairs (𝑎, 𝑏):

    LHS(𝑎, 𝑏) = Σ_{𝜔∈Ω} 𝜇(𝜔) Σ_{𝜔′∈Ω} 𝜇|𝑋−1(𝑋𝜔)(𝜔′) 1_{(𝑋𝜔, 𝜔′)=(𝑎,𝑏)}
              = Σ_{𝜔,𝜔′∈Ω, 𝑋𝜔=𝑎, 𝜔′=𝑏} 𝜇(𝜔) 𝜇|𝑋−1(𝑋𝜔)(𝜔′)
              = Σ_{𝜔∈Ω, 𝑋𝜔=𝑎} 𝜇(𝜔) 𝜇|𝑋−1(𝑎)(𝑏)
              = 𝜇|𝑋−1(𝑎)(𝑏) · 𝜇(𝑋−1(𝑎)) = 𝜇(𝑋−1(𝑎) ∩ {𝑏}) = RHS(𝑎, 𝑏). □
Theorem A.17. The following hold:
(1) If Γ; Δ ⊢c 𝑀 : 𝐴 then for all Ω, 𝜇, 𝐸, 𝛾, 𝐷:

    (((Ω′, 𝜇′, 𝐸′), 𝑓, 𝑋) ← LJ𝑀K(Ω, 𝜇, 𝐸)(𝛾, 𝐷); 𝜔′ ← 𝜇′|𝐸′; ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′))
        ≃  (𝜔 ← 𝜇|𝐸; 𝑥 ← HJ𝑀K(𝛾, 𝐷 𝜔); ret(𝐷 𝜔, 𝑥))

(2) If Γ; Δ ⊢c 𝑒 : 𝜏 then for all Ω, 𝜇, 𝐸, 𝛾, 𝐷:

    (((Ω′, 𝜇′, 𝐸′), 𝑓, 𝑥) ← LJ𝑒K(Ω, 𝜇, 𝐸)(𝛾, 𝐷); 𝜔′ ← 𝜇′|𝐸′; ret(𝐷(𝑓(𝜔′)), 𝑥))
        ≃  (𝜔 ← 𝜇|𝐸; 𝑥 ← HJ𝑒K(𝛾, 𝐷 𝜔); ret(𝐷 𝜔, 𝑥))
Proof. By induction on the typing rules.
(1)
(Γ;Δ⊢cret 𝑀:𝐴)
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJret 𝑀K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← ret( ( Ω, 𝜇, 𝐸),id,J𝑀K◦𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
=𝜔←𝜇|𝐸;
ret(𝐷𝜔, J𝑀K(𝐷𝜔)) =
𝜔←𝜇|𝐸;
𝑥← HJret 𝑀K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
(Γ;Δ⊢clet 𝑋be 𝑀in 𝑁:𝐵)In this case we work backwards, rearranging RHS into LHS:
𝜔←𝜇|𝐸;
𝑦← HJlet 𝑋be 𝑀in 𝑁K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑦)
=
𝜔←𝜇|𝐸;
𝑥← HJ𝑀K(𝛾, 𝐷 𝜔);
𝑦← HJ𝑁K(𝛾, 𝐷 𝜔 [𝑋↦→ 𝑥]);
ret(𝐷 𝜔, 𝑦)
=
(𝛿, 𝑥 ) ←
𝜔←𝜇|𝐸;
𝑥← HJ𝑀K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
𝑦← HJ𝑁K(𝛾, 𝛿 [𝑋↦→ 𝑥] );
ret(𝛿, 𝑦)
IH
≃
(𝛿, 𝑥 ) ←
((Ω1, 𝜇1, 𝐸1), 𝑓 , 𝑋 ) ← LJ𝑀K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷 );
𝜔1←𝜇1|𝐸1;
ret(𝐷(𝑓(𝜔1)), 𝑋 𝜔1)
𝑦← HJ𝑁K(𝛾, 𝛿 [𝑋↦→ 𝑥] );
ret(𝛿, 𝑦)
=
((Ω1, 𝜇1, 𝐸1), 𝑓1, 𝑋 ) ← LJ𝑀K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷 );
𝜔1←𝜇1|𝐸1;
𝑦← HJ𝑁K(𝛾, ( (𝐷◦𝑓1)[𝑋↦→ 𝑋])(𝜔1));
ret(𝐷(𝑓1(𝜔1)), 𝑦)
IH
≃
((Ω1, 𝜇1, 𝐸1), 𝑓1, 𝑋 ) ← LJ𝑀K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
((Ω2, 𝜇2, 𝐸2), 𝑓2, 𝑌 ) ← LJ𝑁K(Ω1, 𝜇1, 𝐸1) (𝛾, (𝐷◦𝑓1) [𝑋↦→ 𝑋]);
𝜔2←𝜇2|𝐸2;
ret(𝐷(𝑓1(𝑓2(𝜔2))), 𝑌 𝜔2)
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑍 ) ← LJlet 𝑋be 𝑀in 𝑁K(Ω, 𝜇, 𝐸)(𝛾, 𝐷 );
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑍 𝜔′)
(Γ;Δ⊢cif 𝑒then 𝑀else 𝑁:𝐴)
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJif 𝑒then 𝑀else 𝑁K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ←
if J𝑒K(𝛾)
then LJ𝑀K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷);
else LJ𝑁K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
=
if J𝑒K(𝛾)then
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJ𝑀K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
else
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJ𝑁K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
IH×2
≃
if J𝑒K(𝛾)then
𝜔←𝜇|𝐸;
𝑥← HJ𝑀K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
else
𝜔←𝜇|𝐸;
𝑥← HJ𝑁K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
=
𝜔←𝜇|𝐸;
𝑥←
if J𝑒K(𝛾)
then HJ𝑀K(𝛾, 𝐷𝜔)
else HJ𝑁K(𝛾, 𝐷𝜔)
ret(𝐷 𝜔, 𝑥 )
=
𝜔←𝜇|𝐸;
𝑥← HJif 𝑒then 𝑀else 𝑁K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
(Γ;Δ⊢cflip 𝑒:bool)Let 𝑝=(if J𝑒K(𝛾) ∈ [0,1]then J𝑒K(𝛾)else 0).
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJflip 𝑒K(Ω, 𝜇, 𝐸)(𝛾, 𝐷 );
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
=(𝜔, 𝑏) ← (𝜇⊗flip(𝑝))|𝐸×JboolK;
ret(𝐷 𝜔, 𝑏)
=
𝜔←𝜇|𝐸;
𝑏←flip(𝑝);
ret(𝐷 𝜔, 𝑏)
=
𝜔←𝜇|𝐸;
𝑥← HJflip 𝑒K(𝛾 , 𝐷𝜔);
ret(𝐷 𝜔, 𝑥 )
(Γ;Δ⊢cobserve 𝑀:unit)Let 𝐹be the subset (J𝑀K(Ω, 𝜇, 𝐸) ◦ 𝐷)−1(⊤) of Ω.
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJobserve 𝑀K(Ω, 𝜇, 𝐸) (𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
=
score(𝜇|𝐸(𝐹));
𝜔←𝜇|𝐸∩𝐹;
ret(𝐷 𝜔, ★)
Lemma A.14
=
score(𝜇|𝐸(𝐹));
𝜔←𝜇|𝐸|𝐹;
ret(𝐷 𝜔, ★)
Lemma A.15
≃
𝜔←𝜇|𝐸;
score(1𝜔∈𝐹);
ret(𝐷 𝜔, ★)
=
𝜔←𝜇|𝐸;
score(1J𝑀K(𝐷𝜔 )=⊤);
ret(𝐷 𝜔, ★)
=
𝜔←𝜇|𝐸;
𝑥← HJobserve 𝑀K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
(Γ;Δ⊢cL𝑒M𝐸:𝐴)
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJL𝑒M𝐸K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ←
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ← LJ𝑒K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
ret( (Ω′, 𝜇′, 𝐸′), 𝑓 , _↦→ 𝑥);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ← LJ𝑒K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
IH
≃
𝜔←𝜇|𝐸;
𝑥← HJ𝑒K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
=
𝜔←𝜇|𝐸;
𝑥← HJL𝑒M𝐸K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
(2)
(Γ;Δ⊢cret 𝑒:𝜏)
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ← LJret 𝑒K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
=𝜔←𝜇|𝐸;
ret(𝐷𝜔, J𝑒K(𝛾)) =
𝜔←𝜇|𝐸;
𝑥← HJret 𝑒K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
(Γ;Δ⊢clet 𝑥be 𝑒1in 𝑒2:𝜏)In this case we work backwards, rearranging RHS into LHS:
𝜔←𝜇|𝐸;
𝑥← HJlet 𝑥be 𝑒1in 𝑒2K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
=
𝜔←𝜇|𝐸;
𝑥← HJ𝑒1K(𝛾, 𝐷 𝜔);
𝑦← HJ𝑒2K(𝛾[𝑥↦→ 𝑥], 𝐷𝜔 );
ret(𝐷 𝜔, 𝑦)
=
(𝛿, 𝑥 ) ←
𝜔←𝜇|𝐸;
𝑥← HJ𝑒1K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
;
𝑦← HJ𝑒2K(𝛾[𝑥↦→ 𝑥], 𝛿 );
ret(𝛿 , 𝑦)
IH
≃
(𝛿, 𝑥 ) ←
((Ω1, 𝜇1, 𝐸1), 𝑓1, 𝑥 ) ← LJ𝑒1K(Ω, 𝜇, 𝐸)(𝛾, 𝐷 );
𝜔1←𝜇1|𝐸1;
ret(𝐷(𝑓1(𝜔1)), 𝑥)
;
𝑦← HJ𝑒2K(𝛾[𝑥↦→ 𝑥], 𝛿 );
ret(𝛿 , 𝑦)
=
((Ω1, 𝜇1, 𝐸1), 𝑓1, 𝑥 ) ← LJ𝑒1K(Ω, 𝜇, 𝐸)(𝛾, 𝐷 );
𝜔1←𝜇1|𝐸1;
𝛿:=𝐷(𝑓1(𝜔1));
𝑦← HJ𝑒2K(𝛾[𝑥↦→ 𝑥], 𝛿 );
ret(𝛿 , 𝑦)
IH
≃
((Ω1, 𝜇1, 𝐸1), 𝑓1, 𝑥 ) ← LJ𝑒1K(Ω, 𝜇, 𝐸)(𝛾, 𝐷 );
((Ω2, 𝜇2, 𝐸2), 𝑓2, 𝑦 ) ← LJ𝑒2K(Ω1, 𝜇1, 𝐸1) (𝛾[𝑥↦→ 𝑥], 𝐷 ◦𝑓1);
𝜔2←𝜇2|𝐸2;
ret(𝐷(𝑓1(𝑓2(𝜔2))), 𝑦)
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑧 ) ← LJlet 𝑥be 𝑒1in 𝑒2K(Ω, 𝜇 , 𝐸)(𝛾, 𝐷 );
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑧)
(Γ;Δ⊢cif 𝑒1then 𝑒2else 𝑒3:𝜏)
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ← LJif 𝑒1then 𝑒2else 𝑒3K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ←
if J𝑒1K(𝛾)
then LJ𝑒2K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷);
else LJ𝑒3K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
=
if J𝑒1K(𝛾)then
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ← LJ𝑒2K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
else
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ← LJ𝑒3K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
IH×2
≃
if J𝑒1K(𝛾)then
𝜔←𝜇|𝐸;
𝑥← HJ𝑒2K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
else
𝜔←𝜇|𝐸;
𝑥← HJ𝑒3K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
=
𝜔←𝜇|𝐸;
𝑥←
if J𝑒1K(𝛾)
then HJ𝑒2K(𝛾, 𝐷𝜔)
else HJ𝑒3K(𝛾, 𝐷𝜔)
ret(𝐷 𝜔, 𝑥 )
=
𝜔←𝜇|𝐸;
𝑥← HJif 𝑒1then 𝑒2else 𝑒3K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
(Γ;Δ⊢c𝑒:𝜏for 𝑒=flip 𝑒1or 𝑒=uniform 𝑒1𝑒2or 𝑒=poisson 𝑒1)
If 𝑒 = flip 𝑒1 for some 𝑒1, then let 𝜈 be the distribution flip(J𝑒1K(𝛾)).
If 𝑒 = uniform 𝑒1 𝑒2, then let 𝜈 be the distribution uniform(J𝑒1K(𝛾), J𝑒2K(𝛾)).
If 𝑒 = poisson 𝑒1 for some 𝑒1, then let 𝜈 be the distribution poisson(J𝑒1K(𝛾)).
In all cases,
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ← LJ𝑒K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ←
𝑥←𝜈;
ret( (Ω, 𝜇, 𝐸 ),id, 𝑥);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
=
𝑥←𝜈;
𝜔←𝜇|𝐸;
ret(𝐷 𝜔, 𝑥 )
=
𝜔←𝜇|𝐸;
𝑥← HJ𝑒K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
(Γ;Δ⊢cobs(𝑒𝑜,𝑒):unit for 𝑒=flip 𝑒1or 𝑒=uniform 𝑒1𝑒2or 𝑒=poisson 𝑒1)
If 𝑒 = flip 𝑒1 for some 𝑒1, then let 𝜈 be the distribution flip(J𝑒1K(𝛾)).
If 𝑒 = uniform 𝑒1 𝑒2, then let 𝜈 be the distribution uniform(J𝑒1K(𝛾), J𝑒2K(𝛾)).
If 𝑒 = poisson 𝑒1 for some 𝑒1, then let 𝜈 be the distribution poisson(J𝑒1K(𝛾)).
In all cases,
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ← LJobs(𝑒𝑜, 𝑒)K(Ω, 𝜇, 𝐸) (𝛾, 𝐷 );
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
=
score(𝜈(J𝑒𝑜K(𝛾)));
𝜔←𝜇|𝐸;
ret(𝐷 𝜔, ★)
Cont’s score does not affect the FCPS, so it commutes with 𝜔←𝜇|𝐸:
comm
=
𝜔←𝜇|𝐸;
score(𝜈(J𝑒𝑜K(𝛾)));
ret(𝐷 𝜔, ★)
=
𝜔←𝜇|𝐸;
𝑥← HJobs(𝑒𝑜, 𝑒)K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
(Γ;Δ⊢cL𝑀M𝑆:𝜏)
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ← LJL𝑀M𝑆K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷);
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑥 ) ←
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJ𝑀K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷)
𝑥←𝜔′←𝜇′|𝐸′; ret(𝑋 𝜔 ′)
ret( (Ω′, 𝜇′, 𝐸′∩𝑋−1(𝑥)), 𝑓 , 𝑥 )
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJ𝑀K(Ω, 𝜇, 𝐸 )(𝛾 , 𝐷)
𝑥←𝜔′←𝜇′|𝐸′; ret(𝑋 𝜔 ′)
𝐸′′ :=𝐸′∩𝑋−1(𝑥);
𝜔′←𝜇′|𝐸′′ ;
ret(𝐷(𝑓(𝜔′)), 𝑥 )
Lemmas A.14 and A.16
=
((Ω′, 𝜇 ′, 𝐸′), 𝑓 , 𝑋 ) ← LJ𝑀K(Ω, 𝜇, 𝐸 )(𝛾, 𝐷)
𝜔′←𝜇′|𝐸′;
ret(𝐷(𝑓(𝜔′)), 𝑋 𝜔′)
IH
≃
𝜔←𝜇|𝐸;
𝑥← HJ𝑀K(𝛾, 𝐷 𝜔);
ret(𝐷 𝜔, 𝑥 )
=
𝜔←𝜇|𝐸;
𝑥← HJL𝑀M𝑆K(𝛾, 𝐷𝜔);
ret(𝐷 𝜔, 𝑥 )□
Definition A.18. For a closed program ·; · ⊢c 𝑒 : 𝜏, let evalL(𝑒) be the computation
  (_, _, 𝑥) ← LJ𝑒K(emp)(∅, ∅); ret 𝑥 : Dist_w J𝜏K
where ∅ denotes the empty substitution. Let evalH(𝑒) be the computation HJ𝑒K(∅, ∅) : Dist_w J𝜏K.
Theorem A.19 (Cont soundness). If ·; · ⊢c 𝑒 : 𝜏 then evalL(𝑒) ≃ evalH(𝑒).
Proof. Apply Theorem A.17. □
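Theorem A.19 licenses using the weighted samples produced by evalL(𝑒) wherever one would use evalH(𝑒) to estimate posterior expectations. The following Python sketch shows the standard self-normalized estimator one would apply to such weighted samples; eval_l is a hypothetical stand-in for a procedure that draws one weighted sample from evalL(𝑒).

def posterior_expectation(eval_l, k, n=10_000):
    # Self-normalized importance sampling over weighted samples (a, x):
    # estimate the posterior expectation of k as (sum of a*k(x)) / (sum of a).
    num, den = 0.0, 0.0
    for _ in range(n):
        a, x = eval_l()             # one weighted sample from evalL(e)
        num += a * k(x)
        den += a
    return num / den

Because ≃-equal computations agree on all expectations of the form E[𝑎·𝑘(𝑥)], including the constant function 𝑘 = 1 used in the denominator, two such samplers yield estimators with the same limiting value.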
A.8 Evaluation details
In this section, we include three example programs from our evaluation to showcase the syntax of our implementation: the 9-node reachability model with grid topology, the 15-node arrival model with tree topology, and the 4-node gossip model.
We additionally provide our full evaluation results, including standard error over the 100 runs.
Listing 3. Reachability-9
exact {
  let x00 = flip 1.0 / 3.0 in
  let x01 = if x00 then flip 1.0 / 4.0 else flip 1.0 / 5.0 in
  let x10 = if x00 then flip 1.0 / 4.0 else flip 1.0 / 5.0 in
  let diag = sample {
    x02 ~ if x01 then bern(1.0 / 4.0) else bern(1.0 / 5.0);
    x20 ~ if x10 then bern(1.0 / 4.0) else bern(1.0 / 5.0);
    x11 ~ if x10 && x01 then bern(1.0 / 6.0)
          else if x10 && !x01 then bern(1.0 / 7.0)
          else if !x10 && x01 then bern(1.0 / 8.0)
          else bern(1.0 / 9.0);
    (x20, x11, x02)
  } in
  let x20 = diag[0] in
  let x11 = diag[1] in
  let x02 = diag[2] in

  let x12 = if x11 && x02 then flip 1.0 / 6.0
            else if x11 && !x02 then flip 1.0 / 7.0
            else if !x11 && x02 then flip 1.0 / 8.0
            else flip 1.0 / 9.0 in
  let x21 = if x20 && x11 then flip 1.0 / 6.0
            else if x20 && !x11 then flip 1.0 / 7.0
            else if !x20 && x11 then flip 1.0 / 8.0
            else flip 1.0 / 9.0 in
  let x22 = if x21 && x12 then flip 1.0 / 6.0
            else if x21 && !x12 then flip 1.0 / 7.0
            else if !x21 && x12 then flip 1.0 / 8.0
            else flip 1.0 / 9.0 in
  observe x22 in
  ( x00, x01, x02
  , x10, x11, x12
  , x20, x21, x22
  )
}
Fig. 16. The network reachability program for the 9-node grid topology
Listing 4. Arrival-15
exact fn network() -> Bool {
  let n30r = true in
  let n20r = if n30r then flip 1.0 / 2.0 else false in
  let n31r = if !n20r then flip 1.0 / 2.0 else false in

  let n10r = if n20r then flip 1.0 / 2.0 else false in
  let n21r = if !n10r then flip 1.0 / 2.0 else false in
  let n32r = if n21r then flip 1.0 / 2.0 else false in
  let n33r = if !n21r then flip 1.0 / 2.0 else false in

  let n0 = n10r in

  let n10l = if n0 then flip 1.0 / 2.0 else false in

  let n20l = if n10l then flip 1.0 / 2.0 else false in
  let n21l = if !n10l then flip 1.0 / 2.0 else false in

  let n30l = if n20l then flip 1.0 / 2.0 else false in
  let n31l = if !n20l then flip 1.0 / 2.0 else false in
  let n32l = if n21l then flip 1.0 / 2.0 else false in
  let n33l = if !n21l then flip 1.0 / 2.0 else false in
  observe n32l in
  n0
}

sample {
  ix ~ poisson(3.0);
  npackets <- 0;
  while ix > 0 {
    traverses <- exact( network() );
    npackets <- if traverses { npackets + 1 } else { npackets };
    ix <- ix - 1;
    true
  };
  npackets
}
Fig. 17. The packet-arrival program for the 15-node tree topology
Listing 5. Gossip-4
sample fn forward(ix: Int) -> Int {
  s ~ discrete(1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0);
  if s < ix { s } else { s + 1 }
}

exact fn node(nid: Int) -> (Int, Int) {
  let p1 = sample( forward(nid) ) in
  let p2 = sample( forward(nid) ) in
  (p1, p2)
}

exact fn network_step(
  n0: Bool, n1: Bool, n2: Bool, n3: Bool, next: Int
) -> (Bool, Bool, Bool, Bool, Int, Int) {
  let n0 = n0 || (next == 0) in
  let n1 = n1 || (next == 1) in
  let n2 = n2 || (next == 2) in
  let n3 = n3 || (next == 3) in
  let fwd = node(next) in
  (n0, n1, n2, n3, fwd[0], fwd[1])
}

sample fn as_num(b: Bool) -> Float {
  if (b) { 1.0 } else { 0.0 }
}

sample {
  p <- exact( node(0) );
  p1 <- p[0]; p2 <- p[1];
  i0 <- true; i1 <- false; i2 <- false; i3 <- false;
  q <- []; q <- push(q, p1); q <- push(q, p2);
  num_steps ~ discrete(0.25, 0.25, 0.25, 0.25);
  num_steps <- num_steps + 4;
  while (num_steps > 0) {
    nxt <- head(q);
    q <- tail(q);
    state <- exact( network_step(i0, i1, i2, i3, nxt) );
    i0 <- state[0]; i1 <- state[1]; i2 <- state[2]; i3 <- state[3];
    q <- push(q, state[4]);
    q <- push(q, state[5]);
    num_steps <- num_steps - 1;
    true
  };
  n0 <- as_num(i0);
  n1 <- as_num(i1);
  n2 <- as_num(i2);
  n3 <- as_num(i3);
  (n0 + n1 + n2 + n3)
}
Fig. 18. The gossip-protocol program for the 4-node network
Model | PSI (L1, Time(s)) | Pyro (L1, Time(s)) | MultiPPL (Cont) (L1, Time(s)) | MultiPPL (L1, Time(s))
arrival/tree-15 — — 0.365 ±0.004 12.713 ±0.025 0.355 ±0.004 0.247 ±0.001 0.337 ±0.003 0.349 ±0.002
arrival/tree-31 — — 0.216 ±0.002 26.366 ±0.054 0.218 ±0.002 0.561 ±0.004 0.179 ±0.002 0.754 ±0.002
arrival/tree-63 — — 0.118 ±0.002 53.946 ±0.086 0.120 ±0.002 1.469 ±0.003 0.093 ±0.002 1.912 ±0.004
alarm t/o t/o 1.290 ±0.056 16.851 ±0.024 1.173 ±0.049 0.433 ±0.002 0.364 ±0.015 14.444 ±0.008
insurance t/o t/o 0.149 ±0.008 13.724 ±0.020 0.144 ±0.007 1.104 ±0.012 0.099 ±0.006 11.406 ±0.015
gossip/4 – – 0.119 ±0.002 6.734 ±0.027 0.119 ±0.001 0.720 ±0.002 0.118 ±0.002 0.812 ±0.014
gossip/10 – – 0.533 ±0.003 6.786 ±0.009 0.531 ±0.003 1.561 ±0.006 0.524 ±0.003 1.373 ±0.004
gossip/20 – – 0.747 ±0.003 7.064 ±0.010 0.745 ±0.003 3.565 ±0.005 0.750 ±0.003 2.888 ±0.003
Fig. 19. Empirical results of our benchmarks on the arrival, hybrid Bayesian network, and gossip tasks. “MultiPPL (Cont)” shows the evaluation of a baseline Cont program with no boundary crossings into Disc; evaluations under the “MultiPPL” column perform interoperation. “t/o” indicates a timeout beyond 30 minutes, and “—” indicates that the problem is not expressible in PSI because of an unbounded loop.
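The L1 columns in Figs. 19 and 20 report the error of each estimated posterior against a reference answer. As a rough illustration only (the precise metric and reference answers are as described in the main text; the function below is ours), an L1 error between an estimated and a reference discrete posterior can be computed as the sum of absolute differences of probabilities:

def l1_error(estimated, reference):
    # Sum of absolute differences between two discrete distributions,
    # each represented as a dict from support values to probabilities.
    support = set(estimated) | set(reference)
    return sum(abs(estimated.get(v, 0.0) - reference.get(v, 0.0)) for v in support)

# e.g. l1_error({True: 0.62, False: 0.38}, {True: 0.60, False: 0.40}) is approximately 0.04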
# Nodes | PSI (Time(s)) | MultiPPL (Disc) (Time(s)) | Pyro (L1, Time(s)) | MultiPPL (Cont) (L1, Time(s)) | MultiPPL (L1, Time(s))
9 546.748 ±8.018 0.001 ±0.000 0.080 ±0.002 3.827 ±0.008 0.079 ±0.002 0.067 ±0.001 0.033 ±0.001 0.098 ±0.001
36 t/o 0.089 ±0.002 1.812 ±0.009 14.952 ±0.025 0.309 ±0.004 0.277 ±0.002 0.055 ±0.002 1.169 ±0.004
81 t/o 40.728 ±0.276 7.814 ±0.017 33.199 ±0.049 0.680 ±0.005 0.887 ±0.002 0.079 ±0.002 81.300 ±0.278
Fig. 20. Exact (PSI, MultiPPL (Disc)) and approximate (Pyro, MultiPPL (Cont), MultiPPL) inference results on the grid-reachability models, by number of nodes
B COMPARISON WITH NESTED INFERENCE
It is interesting to contemplate the relationship between the nested inference approach and MultiPPL. A crisp comparison — for instance, a formal expressivity result establishing that it is not possible to represent our multi-language interoperation using nested inference — is difficult, due to (1) the large variety of different approaches to nested inference, and (2) the fact that such expressivity results are very hard even for very restricted languages, let alone rich general-purpose probabilistic programming languages. It would be very interesting to investigate the relative expressivity of multi-language interoperation and nested inference, but such an investigation is beyond the scope of this paper.
At the very least, what we can say is that MultiPPL’s low-level denotational semantics, and hence also its inference strategy, is markedly different from the standard measure-theoretic semantics of nested inference, such as Staton [54]’s model of nested queries. In Staton [54], probabilistic computations denote measure-theoretic kernels. The computation normalize(𝑡) represents a nested query: it takes in a probabilistic computation 𝑡 of type 𝐴 and produces a deterministic computation that can yield one of three possible outcomes:
(1) a tuple (1, (𝑒, 𝑑)) consisting of a normalizing constant 𝑒 and a distribution 𝑑 over elements of type 𝐴,
(2) a tuple (2, ()) signalling that the normalizing constant was zero,
(3) or a tuple (3, ()) signalling that the normalizing constant was infinity.
Soundness of nested inference is then justified by the following equational reasoning principle, reproduced here from Staton [54]:

  J𝑡K = J case normalize(𝑡) of (1, (𝑒, 𝑑)) ⇒ score(𝑒); sample(𝑑)
                             | (2, ()) ⇒ score(0); 𝑡
                             | (3, ()) ⇒ 𝑡 K                        (7)
Disregarding the edge cases where the normalizing constant is zero or infinity, Eq. (7) says that running a probabilistic computation 𝑡 is the same as (1) computing a representation of the distribution of 𝑡, either by exact inference or by “freezing a simulation” and examining “a histogram that has been built” in the words of Staton [54], and then (2) resampling from this distribution and scoring by the normalizing constant.
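To make the right-hand side of Eq. (7) concrete, the following Python sketch (ours, purely illustrative and not Staton [54]’s construction) implements a normalize operator for a computation represented as a finite list of weighted outcomes, together with the reweight-and-resample reading of the first branch of Eq. (7):

import math, random

def normalize(weighted_outcomes):
    # weighted_outcomes: a list of (weight, value) pairs standing in for a
    # finite probabilistic computation t. Returns one of the three cases of
    # a Staton-style normalize: (1, (Z, dist)), (2, ()), or (3, ()).
    z = sum(w for w, _ in weighted_outcomes)
    if z == 0.0:
        return (2, ())
    if math.isinf(z):
        return (3, ())
    return (1, (z, [(w / z, v) for w, v in weighted_outcomes]))

def resample_and_score(weighted_outcomes):
    # First branch of Eq. (7): score by the normalizing constant Z,
    # then sample from the normalized distribution.
    tag, payload = normalize(weighted_outcomes)
    if tag != 1:
        raise NotImplementedError("degenerate normalizing constant")
    z, dist = payload
    r, acc = random.random(), 0.0
    for p, v in dist:
        acc += p
        if r < acc:
            return (z, v)           # a weighted sample equivalent to running t
    return (z, dist[-1][1])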
There are a number of points that prevent this model of nested inference from being applied directly to justify correctness of MultiPPL, which may help to clarify the difference between nested inference and MultiPPL’s inference interoperation:
• It is about a kernel-based model where programs take deterministic values as input, but MultiPPL’s Disc programs take random variables as input. This is an important difference: because Disc programs take random variables as input, a Disc program that simply makes use of a free variable in context produces a random variable, not a fixed deterministic value. In contrast, such a program always denotes a deterministic point mass under a kernel-based semantics.
• MultiPPL’s semantics of Disc programs is stateful, and this is necessary to model how exact inference works in our implementation. Contrastingly, the kernel-based model of Staton [54] is not stateful in this way, and so would not have been sufficient for establishing our main soundness theorem.
• Because the model of Staton [54] is not stateful, it cannot account for the stateful updates to the probability space that MultiPPL performs in order to ensure sample consistency.
• Finally, Eq. (7) suggests using importance reweighting via score to ensure sound nesting of exact inference within an approximate-inference context, by properly taking the normalizing constant produced by the nested query into account. This is quite different from how MultiPPL’s L−M𝑆 boundary form handles the nesting of Disc subterms in Cont contexts — as shown in Fig. 10, the low-level semantics of L−M𝑆 does not perform importance reweighting. Instead, importance reweighting occurs in the semantics of Disc-observe statements. Thus Eq. (7) does not explain MultiPPL’s importance reweighting scheme.
Together, these points show that multi-language inference is distinct enough from nested inference that the standard measure-theoretic model of nested queries from Staton [54] cannot be used directly to justify key aspects of MultiPPL’s inference strategy, such as the need for sample consistency and when importance reweighting is performed. Though Staton [54] is just one approach to modelling nested inference, it being a relatively well-established approach suggests that there are indeed fundamental differences between nested inference and MultiPPL’s multi-language inference.