A Scalable Approach to Probabilistic Neuro-Symbolic Verification
Vasileios Manginas1, Nikolaos Manginas1, Edward Stevinson2, Sherwin Varghese2,
Nikos Katzouris1, Georgios Paliouras1 and Alessio Lomuscio2
1National Centre for Scientific Research "Demokritos"
2Imperial College London
{vmanginas, nmanginas, nkatz, paliourg}@iit.demokritos.gr
{e.stevinson22, sherwin.varghese, a.lomuscio}@imperial.ac.uk
Abstract
Neuro-Symbolic Artificial Intelligence (NeSy AI)
has emerged as a promising direction for integrat-
ing neural learning with symbolic reasoning. In
the probabilistic variant of such systems, a neural
network first extracts a set of symbols from sub-
symbolic input, which are then used by a sym-
bolic component to reason in a probabilistic man-
ner towards answering a query. In this work, we
address the problem of formally verifying the ro-
bustness of such NeSy probabilistic reasoning sys-
tems, therefore paving the way for their safe de-
ployment in critical domains. We analyze the
complexity of solving this problem exactly, and
show that it is NP^{#P}-hard. To overcome this is-
sue, we propose the first approach for approximate,
relaxation-based verification of probabilistic NeSy
systems. We demonstrate experimentally that the
proposed method scales exponentially better than
solver-based solutions and apply our technique to
a real-world autonomous driving dataset, where we
verify a safety property under large input dimen-
sionalities and network sizes.
1 Introduction
Neuro-Symbolic Artificial Intelligence (NeSy AI) [Hitzler
and Sarker, 2022; Marra et al., 2024] aims to combine the
strengths of neural-based learning with those of symbolic rea-
soning. Such techniques have gained popularity, as they have
been shown to improve the generalization capacity and inter-
pretability of neural networks (NNs) by seamlessly combin-
ing deep learning with domain knowledge. We focus on NeSy
approaches that perform probabilistic reasoning. Such sys-
tems are typically compositional; first, a NN extracts symbols
from sub-symbolic input, which are then used for formal rea-
soning by a symbolic component. They rely on formal proba-
bilistic semantics to handle uncertainty in a principled fashion
[Marra et al., 2024], and are for this reason adopted by sev-
eral state-of-the-art NeSy systems [Manhaeve et al., 2018;
Winters et al., 2022].
In order to deploy such NeSy systems in mission-critical
applications, it is often necessary to have formal guarantees
of their reliable performance. Techniques for NN verification
are valuable to that end, since they are able to derive such
guarantees for purely neural systems. Still, verifying proper-
ties on top of hybrid systems combining neural and symbolic
components remains largely under-explored. In this work we
address this challenge, focusing on verifying the robustness
of probabilistic NeSy systems, i.e., verifying the property that
input perturbations do not affect the reasoning output. We do
so by lifting existing NN verification techniques to the NeSy
setting.
A few related verification approaches, often termed Neuro-
Symbolic, have been proposed in the literature [Akintunde
et al., 2020; Xie et al., 2022; Daggitt et al., 2024]. Such
methods go beyond neural classification robustness, by ver-
ifying more complex properties on top of a NN [Xie et al.,
2022], or by verifying the correct behaviour of hybrid systems
consisting of neural and symbolic components. In the latter
case, the symbolic component includes some form of control
logic over the neural outputs [Akintunde et al., 2020], or pro-
grams that make use of such outputs in the context of neuro-
symbolic programming [Daggitt et al., 2024]. These meth-
ods differ substantially from our proposed approach, which
targets systems that perform probabilistic reasoning, a task
that is beyond the reach of the aforementioned techniques.
Moreover, existing methods rely on solver-based verification
techniques, which translate the verification query into a sat-
isfiability modulo theories (SMT) problem [Xie et al., 2022;
Daggitt et al., 2024] or into a mixed-integer linear program-
ming (MILP) instance [Akintunde et al., 2020]. While such
techniques result in approaches that are sound and complete,
they suffer from serious scalability issues, which often ren-
ders them impractical.
These scalability issues motivate the use of relaxation-
based verification approaches, which sacrifice completeness
for efficiency. Such techniques reason over a relaxed version
of the verification problem by over-approximating the exact
bounds [Ehlers, 2017a; Xu et al., 2020]. Our proposed ap-
proach extends relaxation-based NN verification methods to
the NeSy setting, by relying on knowledge compilation (KC)
[Darwiche and Marquis, 2002]. KC is widely used (e.g. in
[Xu et al., 2018; Manhaeve et al., 2018]) to represent the
probabilistic symbolic component of the system as an alge-
braic computational graph comprised solely of addition, sub-
traction, and multiplication nodes. This can in turn be ap-
pended to the output layer of the neural component. The resulting structure, which encapsulates both the neural and the symbolic components, is amenable to verification by off-the-shelf, state-of-the-art formal NN verifiers.

arXiv:2502.03274v1 [cs.AI] 5 Feb 2025

Figure 1: A motivating example for probabilistic NeSy verification. In this autonomous driving example we want to verify two logical constraints ϕ, a safety-oriented one and a common-sense one, on top of two neural networks accepting the same dashcam image as input. The symbolic constraints are compiled into a tractable representation containing only addition, subtraction, and multiplication. During inference, this is used to reason over the NN outputs and calculate the probability that the constraints are satisfied. For verification, we exploit this structure to scalably compute how perturbations in the input affect the probabilistic output of the whole (NNs + reasoning) NeSy system.
The contributions of this work are summarized as follows:
• We introduce a scalable approach to the robustness ver-
ification of NeSy probabilistic reasoning systems. Our
solution is based on extending relaxation-based verifica-
tion techniques from the pure-neural to the NeSy setting.
• We study the complexity of solving the probabilistic
NeSy verification task exactly, showing that exact bound
propagation through the symbolic component is NP^{#P}-hard. Beyond this theoretical analysis, we experimen-
tally demonstrate that the proposed approach scales ex-
ponentially better than solver-based solutions.
• We show that our method is applicable to real-world
problems involving high-dimensional input and realis-
tic network sizes. We demonstrate this by applying our
technique to an autonomous driving dataset, where we
verify a safety property on top of an object detection net-
work and an action selector network.
2 Background
2.1 Probabilistic NeSy Systems
Probabilistic NeSy AI aims to combine perception with prob-
abilistic logical reasoning. We provide a brief overview of
the operation of such a system based on [Marconato et al.,
2024]. Given input x ∈ R^n, the system utilizes a NN, as well as symbolic knowledge K, to infer a (multi-)label output y ∈ {0,1}^m. In particular, the system computes p_θ(y | x; K), where θ refers to the trainable parameters of the NN. This is achieved in a two-step process. First, the system extracts a set of latent concepts c ∈ {0,1}^k, through the use of a parameterized neural model p_θ(c | x). These latent concept predictions are then used as input to a reasoning layer, in conjunction with knowledge K, to infer p(y | c; K).
The setting is straightforward to extend to multiple NNs.
In that case, the i-th network from a set E would predict p^i_θ(c^i | x), with ∪_{i∈E} c^i = c. Consider the running example of Figure 1, where two NNs accept the same image as
input and output two disjoint sets of latent concepts. These
are then combined to form the input to the reasoning layer in
order to output the target y.
2.2 Knowledge Compilation
Probabilistic reasoning in NeSy systems is often performed
via reduction to Weighted Model Counting (WMC), which
we briefly review next. Consider a propositional logical formula ϕ over variables V. Each boolean variable v ∈ V is assigned a weight p(v), which denotes the probability of that variable being true. The Weighted Model Count (WMC) of formula ϕ is then defined as:

    WMC(ϕ) = Σ_{ω ⊨ ϕ} ∏_{v ∈ ω} p(v) · ∏_{v ∉ ω} (1 − p(v)).    (1)
In essence, the WMC is the sum of the probabilities of all worlds ω that are models of ϕ. WMC is #P-hard, since it
generalizes a #P-complete problem, #SAT, by incorporat-
ing weights [Chavira and Darwiche, 2008].
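As a concrete illustration of Equation (1), the WMC of a small formula can be computed by brute-force enumeration of worlds (a minimal sketch; the formula and weights below are illustrative and not taken from the paper):

```python
from itertools import product

def wmc(variables, weights, phi):
    """Brute-force WMC (Eq. 1): sum, over all worlds satisfying phi,
    of the product of the weights of the true/false assignments."""
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if phi(world):
            prob = 1.0
            for v in variables:
                prob *= weights[v] if world[v] else 1.0 - weights[v]
            total += prob
    return total

# Illustrative formula phi = a OR b, with p(a) = 0.6 and p(b) = 0.5:
# three of the four worlds are models, so WMC = 1 - 0.4 * 0.5 ≈ 0.8.
print(wmc(["a", "b"], {"a": 0.6, "b": 0.5}, lambda w: w["a"] or w["b"]))
```

The exponential loop over worlds is exactly the #P bottleneck that knowledge compilation amortizes away.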
A widely-used approach for solving the WMC problem is
knowledge compilation (KC) [Darwiche and Marquis, 2002;
Chavira and Darwiche, 2008]. According to this approach,
the formula ϕis first compiled into a tractable representation,
which is used at inference time to compute a large number
of queries - in our case, instances of the WMC problem - in
polynomial time. KC techniques push most of the computa-
tional effort to the “off-line” compilation phase, resulting in
computationally cheap “on-line” query answering, a concept
termed amortized inference.
The representations obtained via KC take the form of com-
putational graphs, in which the literals, i.e., logical variables
and their negations, are found only on leaves of the graph.
The nodes are only logical AND and OR operations, and
the root represents the query. For example, consider the con-
straints ϕ from the autonomous driving example of Figure 1:

    red light ∨ car in front ⟹ brake
    accelerate ⟺ ¬brake
These dictate that (1) if there is a red light or a car in front
of the AV, then the AV should brake, and (2) that accelerat-
ing and braking should be mutually exclusive and exhaustive,
i.e., only one should take place at any given time. Figure 2a
presents the compiled form of these constraints as a boolean
circuit, namely a Sentential Decision Diagram (SDD) [Dar-
wiche, 2011]. To perform WMC, the boolean circuit is re-
accelerate
(1 - car_in_front) (1 - red_light)
×
×brake
+
0.3
0.2 0.4
0.7
0.024
0.724
accelerate
¬ car_in_front ¬ red_light
AND
AND brake
OR
(a)
accelerate
(1 - car_in_front) (1 - red_light)
×
×brake
+
0.3
0.2 0.4
0.7
0.024
0.724
accelerate
¬ car_in_front ¬ red_light
AND
AND brake
OR
(b)
Figure 2: (a) A Sentential Decision Diagram (SDD) as an example
of a computational graph obtained via knowledge compilation and
(b) the corresponding arithmetic circuit (AC) derived from the SDD
during inference by replacing the AND/OR nodes with multiplica-
tion/addition. The SDD has been minimized for conciseness.
placed by an arithmetic one, by replacing the AND nodes
of the graph with multiplication, the OR nodes with addi-
tion, and the negation of literals with subtraction (1−x).
The resulting structure, shown in Figure 2b, can compute the
WMC of ϕsimply by plugging in the literal probabilities at
the leaves and traversing the circuit bottom-up. Indeed, one
can check that assuming the probabilities:

    p(accelerate) = 0.3,  p(red light) = 0.6,
    p(brake) = 0.7,       p(car in front) = 0.8,

this computation correctly calculates the probability of ϕ by summing the probabilities of its 5 different models.
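The bottom-up evaluation just described can be sketched directly in code; the circuit structure below is hand-transcribed from Figure 2b:

```python
# Leaf probabilities from the running example.
p = {"accelerate": 0.3, "brake": 0.7, "red_light": 0.6, "car_in_front": 0.8}

# Bottom-up pass over the AC of Figure 2b:
# negated literals become (1 - x), AND nodes become *, OR nodes become +.
not_car = 1.0 - p["car_in_front"]        # ≈ 0.2
not_red = 1.0 - p["red_light"]           # ≈ 0.4
inner_and = not_car * not_red            # ≈ 0.08
outer_and = inner_and * p["accelerate"]  # ≈ 0.024
root = outer_and + p["brake"]            # ≈ 0.724, the WMC of phi
print(root)
```

A single traversal of the circuit thus answers the WMC query in time linear in the circuit size, which is what makes the compiled form tractable.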
2.3 Verification of Neural Networks
NN Robustness. Verifying the robustness of NN classifiers
amounts to proving that the network’s correct predictions
remain unchanged if the corresponding input is perturbed
within a given range ϵ [Wong et al., 2018]. Contrary to empir-
ical machine learning evaluation techniques, NN verification
methods reason over infinitely-many inputs to derive certifi-
cates for the robustness condition. For a given network f, this is formalized as follows: for all inputs x such that f(x) is a correct prediction, and for all x′ such that ∥x − x′∥ ≤ ϵ, it holds that f(x) = f(x′).

Checking if the robustness condition holds for some ϵ can be achieved by reasoning over the relations between the network's un-normalized predictions (logits) at the NN's output layer. In particular, it can be seen that if for any x′ in an ϵ-ball of x it holds that y_true − y_i > 0 for all y_i ≠ y_true, then the network is robust for ϵ [Gowal et al., 2018]. Here y_true is the logit corresponding to the correct class and y_i are the logits corresponding to all other labels. This condition can be checked by computing the minimum differences of the predictions for all points in the ϵ-ball. If that minimum is positive, the robustness condition is satisfied. However, finding that minimum is NP-hard [Katz et al., 2017].
Solver-Based Verification. Early verification approaches
include Mixed Integer Linear Programming (MILP) [Lomus-
cio and Maganti, 2017; Tjeng et al., 2019; Henriksen and
Lomuscio, 2020] and Satisfiability Modulo Theories (SMT)
[Ehlers, 2017b; Katz et al., 2017]. MILP approaches en-
code the verification problem as an optimization task over lin-
ear constraints, which can be solved by off-the-shelf MILP-
solvers. SMT-based verifiers translate the NN operations and
the verification query into an SMT formula and use SMT
solvers to check for satisfiability. Although these methods
are precise and provide exact verification results, they do
not scale to large, deep networks, due to their high compu-
tational complexity. As such, they are impractical for real-
world applications with high-dimensional inputs like images
or videos.
Relaxation-Based Verification. As the verification prob-
lem is NP-hard [Katz et al., 2017], incomplete techniques
that do not reason over an exact formulation of the verifi-
cation problem, but rather an over-approximating relaxation,
are used for efficiency. A salient method that is commonly
used is Interval Bound Propagation (IBP), a technique which
uses interval arithmetic [Sunaga, 1958] to propagate the input bounds through all the layers of a NN [Gowal et al., 2018]. As a non-exact approach, IBP is not guaranteed to return a definitive verification result for every instance. However, the approach is sound, in that if the computed lower bound is shown to be positive, the network is robust. Therefore, once the bounds of the output
layer are obtained, an instance is safe if the lower bound of
the logit corresponding to the correct class is greater than the
upper bounds of the rest of the logits, since this ensures a
correct prediction, even in the worst case.
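As an illustration, IBP through one affine layer followed by a ReLU can be sketched as follows (a pure-Python sketch with made-up weights; real verifiers such as auto LiRPA implement this over full computational graphs):

```python
def ibp_affine(l, u, W, b):
    """Propagate the box [l, u] through y = W x + b: each output bound
    picks the input's lower or upper end according to the weight's sign."""
    lo = [bi + sum(w * (li if w >= 0 else ui) for w, li, ui in zip(Wi, l, u))
          for Wi, bi in zip(W, b)]
    hi = [bi + sum(w * (ui if w >= 0 else li) for w, li, ui in zip(Wi, l, u))
          for Wi, bi in zip(W, b)]
    return lo, hi

def ibp_relu(l, u):
    """ReLU is monotone, so it is applied elementwise to both bounds."""
    return [max(x, 0.0) for x in l], [max(x, 0.0) for x in u]

# Illustrative 2x2 layer, input box [0, 1] x [0, 1].
W = [[1.0, -1.0], [2.0, 1.0]]
b = [0.0, 1.0]
l1, u1 = ibp_affine([0.0, 0.0], [1.0, 1.0], W, b)  # [-1, 1] and [1, 4]
l2, u2 = ibp_relu(l1, u1)
print(l2, u2)  # [0.0, 1.0] [1.0, 4.0]
```

Every point of the input box provably maps inside the output box, which is the soundness property the verification relies on.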
3 Probabilistic Neuro-Symbolic Verification
3.1 Problem Statement
We now formally define the aim of relaxation-based tech-
niques in the context of NeSy probabilistic reasoning sys-
tems. Given a NeSy system, as defined in Section 2.1, our
aim is to compute:

    min_{x′} p(y_i | x′)  and  max_{x′} p(y_i | x′)  s.t.  ∥x′ − x∥ ≤ ϵ,    (2)

for all y_i in y. That is, we wish to calculate the minimum
and maximum value of each of the probabilistic outputs of
the NeSy system, under input perturbations of size ϵ. As de-
scribed in Section 2.3, it is then possible to use these bounds
to assess the robustness of an instance.
Consider the NeSy system in the running example of Fig-
ure 1. The neural part of the system comprises two NNs,
accepting the same dashcam image xas input. The first is
an object detector predicting whether the image contains a
red traffic light and whether there is a car in front of the
autonomous vehicle (AV). The second is an action selector,
which outputs whether to accelerate or brake the AV, given
the image. The symbolic part of the system is the conjunc-
tion of a safety constraint and a common-sense one, as de-
scribed in Section 2.2. Given an input image x, the system
computes a single output y, denoting the probability that the
specified constraints are satisfied. An instance is robust if
min_{x′} p(y | x′) > 0.5, since this means that for all inputs in an ϵ-ball of x the probability of the constraints being satisfied is
always greater than 0.5.
3.2 Exact Solution Complexity
Let us now assume that via known techniques described in
Section 2.3 we have obtained bounds in the form of a prob-
ability range for each output of the NN. Next, we turn to the
task of propagating these bounds through the symbolic com-
ponent, in order to obtain maxima/minima on the reasoning
output, and investigate the complexity of doing so exactly.
First, we show that, in the worst case, to find the solution we
have to check all combinations of lower/upper bounds for all
NN outputs. To illustrate this, we utilize the circuit repre-
sentation of the symbolic component obtained via KC. It is
known that such circuits represent multi-linear polynomials
of the input variables [Choi et al., 2020]. For example, a sim-
ple traversal of the SDD of Figure 2a yields the polynomial:
p=1−p(car in front)×1−p(red light)
×p(accelerate) + p(brake)
Given this formulation, it is possible to obtain bounds on
the circuit root node by solving a constrained optimization
problem, in which we find the extrema (maximum and min-
imum) of the polynomial, subject to the bounded domains
of the input variables (the NN outputs). We observe that
this circuit polynomial is defined on a rectangular domain,
since all input variables are defined in a closed interval (e.g.
red light ∈ [0.3, 0.4], brake ∈ [0.6, 0.9]). It is known that in this case the extrema lie on the vertices of the domain [Laneve et al., 2010], i.e., at the extrema each variable is assigned either its lower or its upper bound, not something in between.
Thus, in the worst case, to find the extrema one needs to search the combinatorial space of 2^n possible solutions.
In order to calculate the maximum and minimum output of
the symbolic component, for each of the 2^n points we have
to solve one instance of the WMC problem. Indeed, since
each possible solution represents a probability assignment to
all input variables, we can use WMC to compute the proba-
bilistic output of the reasoning module under that weight as-
signment, and then select the maximum and minimum value
obtained over all assignments.
Given the two steps above, it can be seen that starting with
the formula, i.e., without first compiling it into a circuit, exact
bound computation is an NP^{#P}-hard problem. Intuitively, we
need to search in the combinatorial space of variable config-
urations (the NP part), while performing WMC for each con-
figuration (the #P part). In this context, performing amor-
tized inference via knowledge compilation entails that instead of solving an NP^{#P}-hard problem for every sample, we perform a single #P-hard compilation at the beginning, and are
“just” left with an NP problem per sample during runtime.
Henceforth, we only consider the latter setting by assuming
this initial compilation step.
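For a compiled circuit, the exact bounds of Section 3.2 can therefore be obtained by evaluating the circuit polynomial at all 2^n vertices of the rectangular domain. A brute-force sketch over the example polynomial, with illustrative probability intervals (the interval values below are ours, not the paper's):

```python
from itertools import product

def exact_bounds(poly, boxes):
    """Evaluate the circuit polynomial at every vertex of the box
    (all 2^n lower/upper combinations) and return the exact min/max."""
    names = list(boxes)
    vals = [poly(dict(zip(names, vertex)))
            for vertex in product(*(boxes[n] for n in names))]
    return min(vals), max(vals)

# Polynomial from Section 3.2: p = (1 - car)(1 - red) * acc + brake.
poly = lambda v: (1 - v["car"]) * (1 - v["red"]) * v["acc"] + v["brake"]
boxes = {"car": (0.7, 0.8), "red": (0.3, 0.4),
         "acc": (0.1, 0.2), "brake": (0.6, 0.9)}
lo, hi = exact_bounds(poly, boxes)
print(lo, hi)  # roughly 0.612 and 0.942
```

With n circuit inputs this loop performs 2^n circuit evaluations, which is precisely the exponential cost that the relaxation-based approach of Section 3.3 avoids.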
3.3 Relaxation-Based Approach
The NP-hardness of exact bound computation through
the compiled symbolic component motivates the use of
relaxation-based techniques. We now show how these can
be extended to the NeSy setting in order to provide a scalable
solution to Equation 2.
Compositional probabilistic NeSy systems can be viewed
as a single computational graph, by providing the outputs of
the neural network as the inputs of the symbolic probabilistic
circuit. In the case of the running example of Figure 1, the
outputs of each of the two networks are concatenated into a
single vector and used as input to the arithmetic circuit which
represents the constraints. Hence, a NeSy system can be
seen as an end-to-end differentiable algebraic computational
graph, which accepts an input, an image in this case, and out-
puts a vector of probabilities. These characteristics allow one
to construct the NeSy system as a single module comprising
an arbitrary number of neural networks and a single arith-
metic circuit. Such a module can be constructed in a machine
learning library, such as PyTorch, and subsequently exported as an Open Neural Network Exchange (ONNX) graph [developers, 2021]. Figure 3 depicts the ONNX representation
of the NeSy system of the running example.
ONNX is a widespread NN representation, and is the standard input format for NN verifiers [Brix et al., 2024].
This includes both solver-based verification tools, such as
Marabou [Katz et al., 2019], and relaxation-based ones, such
as auto LiRPA [Xu et al., 2020] and VeriNet [Henriksen and
Lomuscio, 2020]. Thus, by representing a NeSy system as an
end-to-end computational graph and exporting it to this for-
mat, it is possible to utilize state-of-the-art tools to perform
verification in an almost “out-of-the-box” fashion. While
our proposed framework is, in principle, compatible with all
the aforementioned tools, we focus on relaxation-based verifiers, in order to showcase scalable probabilistic NeSy verification. Such verifiers allow us to perturb the input and compute bounds directly on the output of the NeSy system, that is, without computing intermediate bounds on the NN outputs.
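To make the end-to-end view concrete, the sketch below chains interval propagation through a toy one-layer "network" with sigmoid outputs into interval arithmetic over the example circuit polynomial. In the actual pipeline this composition is a PyTorch module exported to ONNX and bounded by auto LiRPA; every weight and shape below is illustrative only:

```python
import math

def affine_bounds(l, u, W, b):
    """IBP through y = W x + b: pick l or u per weight sign."""
    lo = [bi + sum(w * (li if w >= 0 else ui) for w, li, ui in zip(Wi, l, u))
          for Wi, bi in zip(W, b)]
    hi = [bi + sum(w * (ui if w >= 0 else li) for w, li, ui in zip(Wi, l, u))
          for Wi, bi in zip(W, b)]
    return lo, hi

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy "NN": 2 input features -> 4 logits -> probabilities for
# (car_in_front, red_light, accelerate, brake); weights are made up.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.3, 0.1]]
b = [0.0, -0.1, 0.2, 0.4]
l, u = [0.4, 0.4], [0.6, 0.6]        # eps-ball around the input (0.5, 0.5)
lo, hi = affine_bounds(l, u, W, b)
p_lo = [sigmoid(v) for v in lo]      # sigmoid is monotone, so it maps
p_hi = [sigmoid(v) for v in hi]      # lower/upper bounds to bounds

# Interval arithmetic through p = (1 - car)(1 - red) * acc + brake;
# all factors lie in [0, 1], so products combine endpoints directly.
CAR, RED, ACC, BRK = range(4)
out_lo = (1 - p_hi[CAR]) * (1 - p_hi[RED]) * p_lo[ACC] + p_lo[BRK]
out_hi = (1 - p_lo[CAR]) * (1 - p_lo[RED]) * p_hi[ACC] + p_hi[BRK]
print(out_lo, out_hi)
```

Soundness means that the true output for any input in the box must lie inside [out_lo, out_hi]; for instance, the forward pass at the box center is guaranteed to be contained in it.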
4 Experimental Evaluation
In this section we empirically evaluate the effectiveness and
applicability of our approach. We assess the scalability of the
proposed method via a synthetic task based on MNIST addi-
tion, a standard benchmark from the NeSy literature [Man-
haeve et al., 2018]. Further, we apply our approach to a real-
world autonomous driving dataset and verify a safety driv-
ing property on top of two 6-layer convolutional NNs. In
Figure 3: Unified ONNX representation of the NeSy system of the
running example. The input image is processed by the two NNs
(left branch is action selection, right branch is object detection) and
then through the arithmetic circuit. The NNs are stripped down to
one convolutional layer (Conv + MaxPool + ReLU) and one dense
layer (Reshape + Gemm + Softmax/Sigmoid) for conciseness. The
operators in the circuit, besides Add, Sub, and Mul, are created by
Python operations, such as tensor indexing and concatenation.
this case, the scalability of our technique allows us to handle
high-dimensional input and larger networks, which are typ-
ical of real-world applications. All experiments are run on
a machine with 128 AMD EPYC 7543 32-Core processors
(3.7GHz) and 400GB of RAM. The code is available online1.
4.1 Multi-Digit MNIST Addition
In this experiment we evaluate the scalability of our approach
as the complexity of the probabilistic reasoning component
increases. Specifically, we explore how the approximate na-
ture of our method enhances scalability, while also consider-
ing the corresponding trade-off in the quality of verification
results.
1https://anonymous.4open.science/r/nesy-veri-6FBD/
To this end, we compare the following approaches:
1. End-to-end relaxation-based verification (E2E-R): An implementation of our method in auto LiRPA, a state-of-the-art relaxation-based verification tool. The input to auto LiRPA is the NeSy system under verification, which is translated internally into an ONNX graph. The verification method used is IBP, as implemented in auto LiRPA.
2. Hybrid verification (R+SLV): A hybrid approach consisting of relaxation-based verification for the neural part of the NeSy system and solver-based bound propagation through the symbolic part. The former is implemented in auto LiRPA using IBP. The latter is achieved by transforming the circuit into a polynomial (see Section 3.2), and solving a constrained optimization problem with the Gurobi solver. The purpose of comparing to this baseline is to assess the trade-off between scalability and quality of results, when using exact vs. approximate bound propagation through the symbolic component.
3. Solver-based verification (MARABOU): Exact verification using Marabou, a state-of-the-art SMT-based verification tool, also used as a backend by most NeSy verification works in the literature [Xie et al., 2022; Daggitt et al., 2024]. Marabou is unable to run on the full NeSy architecture, as the current implementation2 does not support several operators, such as Softmax and tensor indexing. To obtain an indication of Marabou's performance, we use it to verify only the neural part of the NeSy system, a subtask of NeSy verification. Specifically, we verify the classification robustness of the CNN performing MNIST digit recognition.
Dataset. We use a synthetic task, where we can controllably
increase the size of the symbolic component, while keeping
the neural part constant. In particular, we create a variant of
multi-digit MNIST addition [Manhaeve et al., 2018], where
each instance consists of multiple MNIST digit images, and
is labelled by the sum of all digits. We can then control the
number of MNIST digits per sample, e.g. for 3-digit addition, an instance comprises three digit images labelled with their sum, e.g. 13. We construct the
verification dataset from the 10K samples of the MNIST test
set, using each image only once. Thus, for a given #digits
the verification set contains 10K/#digits test instances.
Experimental setting. The NN is a convolutional neural network3 tasked to recognize single MNIST digits. The CNN
is trained in a standard supervised fashion on the MNIST
train dataset, consisting of 60K images, and achieves an ac-
curacy of 98% on the test set. The symbolic part consists of
the rules of multi-digit addition. It accepts the CNN predic-
tions for the input images and computes a probability for each
sum. As the number of summand digits increases, so does
the size of the reasoning circuit, since there are more ways
to construct a given sum using more digits (e.g. consider the
2https://github.com/NeuralNetworkVerification/Marabou
3The CNN comprises 2 convolutional layers with max pooling
and 2 linear layers, with a final softmax activation.
Verification Method | Metric            | 2 digits    | 3 digits    | 4 digits    | 5 digits
R+SLV               | Lower/Upper Bound | 0.871−0.981 | 0.815−0.972 | 0.764−0.962 | 0.731−0.928
                    | Robustness (%)    | 90.60       | 86.17       | 81.33       | 78.31
E2E-R               | Lower/Upper Bound | 0.871−0.982 | 0.815−0.974 | 0.763−0.965 | 0.716−0.958
                    | Robustness (%)    | 90.60       | 86.11       | 81.21       | 76.67

Table 1: Comparison of performance between the proposed approach and the baseline with respect to the size of the symbolic component. We report one metric for bound tightness and one metric for the robustness of the system, according to each method.
ways in which 2 and 5 digits can sum to 17). We vary the
number of digits as well as the size of L∞-norm perturba-
tions added to the input images. We consider five values for
#digits: {2, 3, 4, 5, 6}, and three values for the perturbation size ϵ: {10^−2, 10^−3, 10^−4}, resulting in 15 distinct experiments. For each experiment, i.e., combination of #digits and ϵ values, we use a timeout of 72 hours. E2E-R runs on a single thread, while the Gurobi solver in R+SLV dynamically allocates up to 1024 threads.
Scalability. Figure 4 presents a scalability comparison be-
tween the methods. The figure illustrates the time required to
verify the robustness of the NeSy system for a single sample,
averaged across the test dataset. All experiments terminate
within the timeout limit, with the exception of two configurations for R+SLV. For ⟨ϵ = 10^−2, #digits ∈ {5, 6}⟩, R+SLV was not able to verify any instance within the timeout (which is why the lines for ϵ = 10^−2 stop at 4 digits in Figure 4). For ⟨ϵ = 10^−3, #digits = 6⟩, R+SLV verifies less than 5% of the examples within the timeout. The reported values in Figure 4 are the average runtime for this subset.
Figure 4: Comparison of verification runtime between three meth-
ods, with respect to the size of the symbolic component. We report
the time required to verify the robustness of the NeSy system on a
single sample, averaged across the MNIST test dataset, and repeat
the experiment for three values of the perturbation size ϵ.
As Figure 4 illustrates, E2E-R scales exponentially better than R+SLV – note that runtimes are in log-scale. This is
due to the computational complexity of exact bound propaga-
tion through the probabilistic reasoning component, as shown
in Section 3.2. In the surrogate task of verifying the robust-
ness of the CNN only, MARABOU’s runtime is 314 seconds
per sample, averaged across 100 MNIST test images. It is
thus several orders of magnitude slower than our approach,
in performing a subtask of NeSy verification. This indicative
performance for Marabou aligns with theoretical [Zhang et
al., 2018]and empirical evidence [Brix et al., 2024]on the
poor scalability of SMT-based approaches. Our results sug-
gest that this trade-off between completeness and scalability
is favourable in the NeSy setting, where the verification task
may involve multiple NNs and complex reasoning compo-
nents.
Quality of verification results. We next investigate how
the complexity of the reasoning component affects the qual-
ity of the verification results. In Table 1 we report, for ϵ = 0.001: (a) the tightness of the output bounds, in the form of lower/upper bound intervals for the probability of the correct sum for each sample, averaged across the test set; (b) the robustness of the NeSy system, defined as the number of robust samples divided by the total number of samples in the test set.4
As expected, R+SLV outputs strictly tighter bounds than
E2E-R for all configurations. We further observe that the
quality of the bounds obtained by E2E-R degrades as the size
of the reasoning circuits increases. This is also expected,
since errors compound and accumulate over the larger net-
work. However, the differences between R+SLV and E2E-R are minimal, especially in terms of robustness.
4.2 Autonomous Driving
In this experiment we apply our proposed approach to a real-
world dataset from the autonomous driving domain. The
purpose of the experiment is to assess the robustness of a
neural autonomous driving system with respect to the safety
and common-sense properties of Figure 1, i.e., to evaluate
whether input perturbations cause the neural systems to vio-
late the constraints that they previously satisfied.
Dataset. To that end, we use the ROad event Awareness
Dataset with logical Requirements (ROAD-R) [Giunchiglia
et al., 2023]. ROAD-R consists of 22 videos of dashcam
footage from the point of view of an autonomous vehicle
(AV), and is annotated at frame-level with bounding boxes.
Each bounding box represents an agent (e.g. a pedestrian,
vehicles of different types, etc.) performing an action (e.g.
4We don’t report metrics for 6 digits since the full experiment
exceeds the timeout.
Metric                 | ϵ = 1e-5 | 5e-5  | 1e-4  | 5e-4  | 1e-3
Robustness (%)         | 96.82    | 92.68 | 82.64 | 6.21  | 0.00
Runtime per Sample (s) | 0.091    | 0.092 | 0.091 | 0.092 | 0.092

Table 2: Autonomous driving experiment results, indicating robustness and verification runtime for five values of the ϵ-perturbation.
moving towards the AV, turning, etc.) at a specific location
(e.g. right pavement, incoming lane, etc.).
Experimental Setting. We focus on a subset of the dataset
that is relevant to the symbolic constraints of Figure 1. Con-
sequently, we select a subset of frames which adhere to these
constraints. Specifically, either the AV is moving forward,
there is no red traffic light in the frame, and no car stopped in
front of the AV, or the AV is stopped, and there is either a red
traffic light or a car stopped in front. By sampling the videos
every 2 seconds, we obtain a dataset of 3143 examples, where
each example contains a 3×240×320 image, and four binary
labels: red light, car in front, stop, move forward.
The neural part of the system comprises two 6-layer
CNNs5, responsible for object detection and action selec-
tion respectively. The two networks are trained in a stan-
dard supervised fashion using an 80/20 train/test split over
the selected frames. The object detection and action selec-
tion networks achieve accuracies of 97.2% and 96.3% on
the respective test sets. We add L∞-norm perturbations to
the test input images for five values of perturbation size ϵ:
{10−5,5·10−5,10−4,5·10−4,10−3}.
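An L∞-norm perturbation of size ϵ can be represented as per-pixel interval bounds, which relaxation-based verifiers then propagate through the network. A minimal sketch, assuming pixel values normalized to [0, 1] (the value range and the function name are our assumptions, not the paper's implementation):

```python
import numpy as np

def linf_bounds(image: np.ndarray, epsilon: float):
    """Lower/upper interval bounds for the L-infinity ball of radius
    epsilon around an image, clipped to the assumed valid range [0, 1]."""
    lower = np.clip(image - epsilon, 0.0, 1.0)
    upper = np.clip(image + epsilon, 0.0, 1.0)
    return lower, upper

# C x H x W, matching the 3x240x320 images of the ROAD-R setup
image = np.random.rand(3, 240, 320)
lb, ub = linf_bounds(image, 1e-4)
# The interval width never exceeds 2*epsilon (clipping can only shrink it).
assert np.all(ub - lb <= 2e-4 + 1e-12)
```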
Table 2 presents the results. We report robustness, i.e., the
fraction of robust instances over the total number of instances
in the test set, and verification runtime for E2E-R. Since this
task consists of a small arithmetic circuit and a significantly
larger neural component, it is the latter that predominantly
affects both the computational overhead and the accumulated
errors of bound propagation. Therefore, E2E-R and R+SLV, which differ only in the symbolic component, yield nearly identical results; we thus omit the latter.
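As a hedged illustration of how the reported robustness fraction can be computed (the 0.5 decision threshold and the assumption that the unperturbed prediction satisfies the constraint are ours, not the paper's exact interface): an instance counts as robust if the propagated lower bound on the constraint probability stays above the threshold over the whole perturbation region.

```python
def is_robust(p_lower: float, p_upper: float, threshold: float = 0.5) -> bool:
    # Certified robust: for every input in the perturbation region, the
    # constraint probability stays above the decision threshold (assuming
    # the unperturbed prediction satisfied the constraint).
    return p_lower > threshold

def robustness_fraction(bounds) -> float:
    """Fraction of certified-robust instances, as reported in Table 2."""
    flags = [is_robust(lo, hi) for lo, hi in bounds]
    return sum(flags) / len(flags)

# Example (lower, upper) bounds on the constraint probability, e.g. as
# produced by bound propagation through the NN and arithmetic circuit.
bounds = [(0.91, 0.99), (0.42, 0.97), (0.76, 0.88)]
frac = robustness_fraction(bounds)  # 2 of the 3 example instances are certified
```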
As expected, robustness falls as the perturbation size increases. Regarding the verification runtime, this experiment
reinforces our results from Section 4.1, by demonstrating that
the runtime of our approach remains largely unaffected by
changes in the value of the perturbation size ϵ.
5 Related Work
Although NN verification and NeSy AI have both seen rapid growth over the last few years, their intersection remains under-explored, with only a few related works in the literature. In [Akintunde et al., 2020], the authors address the problem of verifying properties associated with the temporal dynamics
of multi-agent systems. The agents of the system combine a
neural perception module with a symbolic one, encoding ac-
tion selection mechanisms via traditional control logic. The
verification queries are specified in alternating-time temporal
logic, and the corresponding verification problem is cast as a MILP instance, delegated to a custom, Gurobi-based verification tool.

5 The CNNs have 4 convolutional layers with max pooling and 2 linear ones. The object detection network has a sigmoid activation at the output, while the action selection network has a softmax.

The work in [Xie et al., 2022] goes beyond NN
robustness, by verifying more complex properties on top of a
NN, or system of networks. The authors introduce a property
specification language based on Hoare logic, where variables
can be instantiated to NN inputs and outputs. Trained NNs,
along with the property under verification, are compiled into
an SMT problem, which is delegated to Marabou. [Daggitt et
al., 2024] follows a similar approach in order to verify neurosymbolic programs, i.e., programs containing both neural
networks and symbolic code. The authors introduce a prop-
erty specification language, which allows for NN training and
the specification of verification queries. A custom tool then
compiles the NNs, the program, and the verification query
into an SMT problem, which is again delegated to Marabou.
The aforementioned approaches cannot verify probabilis-
tic logical reasoning systems. This is because their specifi-
cation languages (logics of limited expressive power in [Akintunde et al., 2020; Xie et al., 2022] and a functional language in [Daggitt et al., 2024]) lack a general-purpose reasoning engine, as well as formal probabilistic semantics. In
contrast, our method verifies the robustness of NeSy systems
which perform general-purpose reasoning under uncertainty,
by combining NNs with probabilistic logical reasoners. Furthermore, all existing approaches rely on solver-based verification and hence cannot scale to high-dimensional inputs
and large networks, as we show with Marabou’s indicative
performance in Section 4. This is in contrast to our proposed
method, which, being the first to utilize relaxation-based ver-
ification in a NeSy setting, is able to handle large input di-
mensionality, network sizes, and knowledge complexity.
6 Conclusion
We presented a scalable technique for verifying the robust-
ness of probabilistic neuro-symbolic reasoning systems. Our
method combines relaxation-based techniques from the NN
verification domain with knowledge compilation, in order
to assess the effects of input perturbations on the proba-
bilistic logical output of the system. We motivated our ap-
proach via a theoretical analysis, and demonstrated its ef-
ficacy via experimental evaluation on synthetic and real-
world data. Future work includes extending our method to
more sophisticated neural verification techniques, such as
(Reverse) Symbolic Interval Propagation [Gehr et al., 2018;
Wang et al., 2021], towards obtaining tighter bounds. Further,
integrating certified training techniques [Müller et al., 2023; Palma et al., 2024] would substantially increase the magni-
tude of perturbations that our approach can verify, as such
training explicitly optimizes for easier verification.
References
[Akintunde et al., 2020] Michael E Akintunde, Elena Boto-
eva, Panagiotis Kouvaros, and Alessio Lomuscio. Verify-
ing strategic abilities of neural-symbolic multi-agent sys-
tems. In Proceedings of the International Conference on
Principles of Knowledge Representation and Reasoning,
volume 17, pages 22–32, 2020.
[Brix et al., 2024] Christopher Brix, Stanley Bak, Taylor T.
Johnson, and Haoze Wu. The fifth international verifi-
cation of neural networks competition (vnn-comp 2024):
Summary and results, 2024.
[Chavira and Darwiche, 2008] Mark Chavira and Adnan
Darwiche. On probabilistic inference by weighted model
counting. Artificial Intelligence, 172(6-7):772–799, 2008.
[Choi et al., 2020] Y Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: A unifying framework for tractable probabilistic models. UCLA, 2020. URL: http://starai.cs.ucla.edu/papers/ProbCirc20.pdf.
[Daggitt et al., 2024] Matthew L. Daggitt, Wen Kokke,
Robert Atkey, Natalia Slusarz, Luca Arnaboldi, and Eka-
terina Komendantskaya. Vehicle: Bridging the embedding
gap in the verification of neuro-symbolic programs, 2024.
[Darwiche and Marquis, 2002] Adnan Darwiche and Pierre
Marquis. A knowledge compilation map. Journal of Arti-
ficial Intelligence Research, 17:229–264, 2002.
[Darwiche, 2011] Adnan Darwiche. SDD: A new canoni-
cal representation of propositional knowledge bases. In
Twenty-Second International Joint Conference on Artifi-
cial Intelligence, 2011.
[developers, 2021] ONNX Runtime developers. ONNX Runtime. https://onnxruntime.ai/, 2021.
[Ehlers, 2017a] Rüdiger Ehlers. Formal verification of piecewise linear feed-forward neural networks. In Deepak
D’Souza and K. Narayan Kumar, editors, Automated Tech-
nology for Verification and Analysis - 15th International
Symposium, ATVA 2017, Pune, India, October 3-6, 2017,
Proceedings, volume 10482 of Lecture Notes in Computer
Science, pages 269–286. Springer, 2017.
[Ehlers, 2017b] Rüdiger Ehlers. Formal verification of
piece-wise linear feed-forward neural networks. CoRR,
abs/1705.01320, 2017.
[Gehr et al., 2018] Timon Gehr, Matthew Mirman, Dana
Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and
Martin Vechev. Ai2: Safety and robustness certification
of neural networks with abstract interpretation. In 2018
IEEE Symposium on Security and Privacy (SP), pages 3–
18, 2018.
[Giunchiglia et al., 2023] Eleonora Giunchiglia, Mihaela Cătălina Stoian, Salman Khan, Fabio Cuzzolin, and
Thomas Lukasiewicz. Road-r: the autonomous driving
dataset with logical requirements. Machine Learning,
112(9):3261–3291, 2023.
[Gowal et al., 2018] Sven Gowal, Krishnamurthy Dvi-
jotham, Robert Stanforth, Rudy Bunel, Chongli Qin,
Jonathan Uesato, Relja Arandjelovic, Timothy A. Mann,
and Pushmeet Kohli. On the effectiveness of interval
bound propagation for training verifiably robust models.
CoRR, abs/1810.12715, 2018.
[Henriksen and Lomuscio, 2020] Patrick Henriksen and
Alessio Lomuscio. Efficient neural network verification
via adaptive refinement and adversarial search. In ECAI
2020, pages 2513–2520. IOS Press, 2020.
[Hitzler and Sarker, 2022] Pascal Hitzler and Md Kamruzza-
man Sarker. Neuro-symbolic artificial intelligence: The
state of the art. 2022.
[Katz et al., 2017] Guy Katz, Clark W. Barrett, David L.
Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex:
An efficient SMT solver for verifying deep neural net-
works. In Rupak Majumdar and Viktor Kuncak, editors,
Computer Aided Verification - 29th International Confer-
ence, CAV 2017, Heidelberg, Germany, July 24-28, 2017,
Proceedings, Part I, volume 10426 of Lecture Notes in
Computer Science, pages 97–117. Springer, 2017.
[Katz et al., 2019] Guy Katz, Derek A Huang, Duligur Ibel-
ing, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth
Shah, Shantanu Thakoor, Haoze Wu, Aleksandar Zeljić,
et al. The marabou framework for verification and analysis
of deep neural networks. In Computer Aided Verification:
31st International Conference, CAV 2019, New York City,
NY, USA, July 15-18, 2019, Proceedings, Part I 31, pages
443–452. Springer, 2019.
[Laneve et al., 2010] Cosimo Laneve, Tudor A Lascu, and
Vania Sordoni. The interval analysis of multilinear expres-
sions. Electronic Notes in Theoretical Computer Science,
267(2):43–53, 2010.
[Lomuscio and Maganti, 2017] Alessio Lomuscio and Lalit
Maganti. An approach to reachability analysis for
feed-forward relu neural networks. arXiv preprint
arXiv:1706.07351, 2017.
[Manhaeve et al., 2018] Robin Manhaeve, Sebastijan Du-
mancic, Angelika Kimmig, Thomas Demeester, and Luc
De Raedt. Deepproblog: Neural probabilistic logic pro-
gramming. Advances in neural information processing
systems, 31, 2018.
[Marconato et al., 2024] Emanuele Marconato, Samuele
Bortolotti, Emile van Krieken, Antonio Vergari, Andrea
Passerini, and Stefano Teso. Bears make neuro-symbolic
models aware of their reasoning shortcuts. arXiv preprint
arXiv:2402.12240, 2024.
[Marra et al., 2024] Giuseppe Marra, Sebastijan Dumančić,
Robin Manhaeve, and Luc De Raedt. From statistical re-
lational to neurosymbolic artificial intelligence: A survey.
Artificial Intelligence, page 104062, 2024.
[Müller et al., 2023] Mark Niklas Müller, Franziska Eckert,
Marc Fischer, and Martin Vechev. Certified training:
Small boxes are all you need, 2023.
[Palma et al., 2024] Alessandro De Palma, Rudy Bunel, Kr-
ishnamurthy Dvijotham, M. Pawan Kumar, Robert Stan-
forth, and Alessio Lomuscio. Expressive losses for veri-
fied robustness via convex combinations, 2024.
[Sunaga, 1958] Teruo Sunaga. Theory of an interval alge-
bra and its application to numerical analysis. In Research
Association of Applied Geometry, pages 29–46, 1958.
[Tjeng et al., 2019] Vincent Tjeng, Kai Y. Xiao, and Russ
Tedrake. Evaluating robustness of neural networks with
mixed integer programming. In International Conference
on Learning Representations, 2019.
[Wang et al., 2021] Shiqi Wang, Huan Zhang, Kaidi Xu,
Xue Lin, Suman Jana, Cho-Jui Hsieh, and J. Zico Kolter.
Beta-crown: Efficient bound propagation with per-neuron
split constraints for neural network robustness verification.
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang,
and J. Wortman Vaughan, editors, Advances in Neural In-
formation Processing Systems, volume 34, pages 29909–
29921. Curran Associates, Inc., 2021.
[Winters et al., 2022] Thomas Winters, Giuseppe Marra,
Robin Manhaeve, and Luc De Raedt. Deepstochlog: Neu-
ral stochastic logic programming. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 36,
pages 10090–10100, 2022.
[Wong et al., 2018] E. Wong, F. Schmidt, J. Metzen, and
J. Kolter. Scaling provable adversarial defenses. In Pro-
ceedings of the 32nd Conference on Neural Information
Processing Systems (NeurIPS18), 2018.
[Xie et al., 2022] Xuan Xie, Kristian Kersting, and Daniel
Neider. Neuro-symbolic verification of deep neural net-
works. arXiv preprint arXiv:2203.00938, 2022.
[Xu et al., 2018] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss function for deep
learning with symbolic knowledge. In International con-
ference on machine learning, pages 5502–5511. PMLR,
2018.
[Xu et al., 2020] Kaidi Xu, Zhouxing Shi, Huan Zhang,
Minlie Huang, Kai-Wei Chang, Bhavya Kailkhura, Xue
Lin, and Cho-Jui Hsieh. Automatic perturbation analysis
on general computational graphs. CoRR, abs/2002.12920,
2020.
[Zhang et al., 2018] Huan Zhang, Tsui-Wei Weng, Pin-Yu
Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neu-
ral network robustness certification with general activation
functions, 2018.