Available via license: CC BY 4.0
TCBB-2024-05-0317
Combining Zhegalkin Polynomials and SAT Solving for
Context-specific Boolean Modeling of Biological Systems
Vincent Deman, Marine Ciantar, Laurent Naudin, Philippe Castera, and Anne-Sophie Beignon
Abstract—Large amounts of knowledge regarding biological
processes are readily available in the literature and aggregated in
diverse databases. Boolean networks are powerful tools to render
that knowledge into models that can mimic and simulate biological
phenomena at multiple scales. Yet, when a model is required to
understand or predict the behavior of a biological system
in given conditions, existing information often does not completely
match this context. Networks built from only prior knowledge can
overlook mechanisms, lack specificity, and only partially
recapitulate experimental observations. To address this limitation,
context-specific data needs to be integrated. However, the brute-
force identification of qualitative rules matching these data
becomes infeasible as the number of candidates explodes for
increasingly complex systems. Here, we used Zhegalkin
polynomials to transform this identification into a binary value
assignment for exponentially fewer variables, which we addressed
with a state-of-the-art SAT solver. We evaluated our implemented
method alongside two widely recognized tools, CellNetOptimizer
and Caspo-ts, on both artificial toy models and large-scale models
based on experimental data from the HPN-DREAM challenge.
Our approach demonstrated benchmark-leading capabilities on
networks of significant size and intricate complexity. It thus
appears promising for the in silico modeling of ever more
comprehensive biological systems.
Index Terms—Boolean network, model calibration, modeling
biological systems, SAT, systems biology, Zhegalkin polynomials
I. INTRODUCTION
BIOLOGICAL systems are complex, interconnected
networks of components that exhibit a hierarchical
organization and interactions at multiple levels, which
are responsible for regulatory mechanisms, and dynamic and
emergent properties [1]. Decades of research on those systems
have provided us with vast amounts of knowledge that recent
efforts supported by computational advances have aimed to
translate into informative biophysicochemical models [2]. The
purpose of these models is to represent and mimic the signaling,
metabolic, or other molecular processes underlying the
behavior of the entities that compose biological systems.
Among the many modeling approaches, one that has proven
This work was supported by the Region Ile-de-France through the Paris
Region PhD program and by Dassault Systèmes.
Corresponding author: Philippe Castera (philippe.castera@3ds.com).
Supplementary material is available online at > http://ieeexplore.ieee.org <
to be particularly useful is the Boolean network [3][4]. Boolean
networks simplify the dynamics of biological systems by
disregarding quantitative aspects, such as concentration levels
or reaction rates, and focusing solely on the qualitative changes
in the system’s state. They are defined by a set of nodes
representing binary variables, interconnected by a set of edges
that can be formalized as Boolean algebra functions. The state
of the system is typically updated in discrete steps, with nodes
transitioning between states based on these functions. Despite
their simplicity, they have been successfully applied to the
modeling of various complex biological processes [5][6][7].
Boolean networks can be built manually by mining and
combining biological entities and their reactions, interactions,
and relations directly from the literature or specific databases,
as in [8]. To speed up the process and improve its robustness,
computational tools have been developed to derive dynamic
networks from the structural properties of aggregated
knowledge sources like pathways and disease maps [9].
However, in both cases, the available knowledge generally does
not match the exact context for which a model is being
constructed, and it is never exhaustive. For example,
signaling pathways available in databases like KEGG [10],
WikiPathways [11], SIGNOR [12], or Reactome [13] are built
from studies with different experimental contexts,
environments, or species. Their boundaries are arbitrarily drawn to
keep their size manageable, yet their crosstalk with other pathways
and molecular mediators is well-established [14]. The resulting
networks can overlook some mechanisms, lack specificity, and
only partially recapitulate experimental observations. Having
not been informed with concrete data from the context of
interest, such networks can be considered generic or naïve.
To address this key limitation, specific data from that context
needs to be integrated [15]. Specifically, given a naïve network
structure or interaction graph, one should be able to select the
Boolean functions of the network so that its successive states
reproduce the dynamics underlying a specified dataset as much
as possible. This process is alternatively referred to as
calibration, as in [16]; synthesis, as in [17][18]; or network
dynamics inference, as in [19]. Network inference, in general,
Vincent Deman is with the Université Paris-Saclay, Inserm, CEA U1184
IMVA-HB/IDMIT, Fontenay-aux-Roses, France and with Dassault Systèmes
BIOVIA, Vélizy-Villacoublay, France. E-mail: vincent.deman@cea.fr.
Marine Ciantar, Laurent Naudin, and Philippe Castera are with Dassault
Systèmes BIOVIA, Vélizy-Villacoublay, France. E-mail:
marine.ciantar@3ds.com, laurent.naudin@3ds.com, and
philippe.castera@3ds.com.
Anne-Sophie Beignon is with the Université Paris-Saclay, Inserm, CEA
U1184 IMVA-HB/IDMIT, Fontenay-aux-Roses, France. E-mail:
anne-sophie.beignon@cea.fr.
This article has been accepted for publication in IEEE/ACM Transactions on Computational Biology and Bioinformatics. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2024.3456302
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
also encompasses the earlier determination of the network
topology, i.e., the interactions between its nodes. Numerous
methods focusing on this step have been developed over the
years and include ARACNE [20], GENIE3 [21], and TIGRESS
[22]. Other approaches rely on the dataset to infer both the
Boolean network topology and its functions [23][24]. An
alternative is to use prior knowledge, like biochemical
pathways or various molecular interactions, to define the
network topology. The resulting interaction graph is then called
a Prior Knowledge Network (PKN). Several recent methods
using this approach, including CellNetOptimizer and its
evolutions [16][25], Caspo-ts [26], RE:IN [27], and BRE:IN
[28], have shown convincing performance.
Still, the Boolean network dynamics inference, or
calibration, remains challenging since there are $2^{2^k}$ possible
Boolean functions depending on $k$ variables, which makes their
enumeration impossible when $k$ increases. The iteration over
all possible functions to decide whether to select them or not is
not a viable strategy. Therefore, in such decision-based
methods, the space of Boolean functions is usually reduced by
adding constraints like the minimality properties in Caspo-ts
[26], by specifying a template for acceptable functions as in
RE:IN [27] and BRE:IN [28], or by rounding off the functions
to a subset of their terms [29]. Alternatively, or in addition to
reducing the solution space, solutions can be searched
for heuristically rather than enumerated, as in CellNetOptimizer
[16][25], GAPORE [24], or CGA-CNI [30]. However, in such
approaches, the global optimality of the solutions is not certain.
The calibration method described in this paper adopts an
original approach by rewriting the interactions that compose an
input naïve network as Zhegalkin polynomials. This simplified
polynomial representation of Boolean functions is also named
Algebraic Normal Form. As Zhegalkin polynomials of $k$
variables are fully characterized by $2^k$ binary coefficients, the
problem changes. Instead of searching for a satisfactory
Boolean function among $2^{2^k}$ possibilities, we search for the
value (0 or 1) of these $2^k$ coefficients. We found this task to be
a Boolean satisfiability problem and assigned it to a SAT solver.
Specifically, we chose to employ a complete Weighted
MaxSAT solver to ensure the retrieval of at least one optimal
solution that best, if not entirely, matches the constraints
associated with the provided data.
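To give a sense of scale, the two counts can be compared for a few in-degrees. Here $k$ is used as a label for the number of input variables of a node; the figures are plain arithmetic rather than results from the evaluation.

```latex
\begin{array}{c|c|c}
k & \text{Boolean functions } 2^{2^k} & \text{Zhegalkin coefficients } 2^k \\ \hline
2 & 16 & 4 \\
3 & 256 & 8 \\
4 & 65\,536 & 16 \\
5 & 4\,294\,967\,296 & 32
\end{array}
```

The number of distinct coefficient assignments still equals the number of functions; what shrinks is the number of unknowns handed to the solver.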
By combining Zhegalkin-polynomial reformulation and
MaxSAT solving, our method does not require the space of
Boolean functions to be reduced and makes the search for the
one that best matches the data exhaustive. In addition, by
circumventing the aforementioned complexity issue rather than
containing it, previously intractable exhaustive calibration
tasks, due to the size of the network or its connectivity degree,
should become feasible.
To evaluate our method, we first applied it to artificially
generated toy Boolean networks and data. We then challenged
it with four large-scale networks based on real measured data
representative of the behavior of cancer cell lines under various
stimuli from the HPN-DREAM breast cancer network inference
challenge [31]. To serve as a reference, both evaluations were
concurrently performed on two widely used calibration tools:
CellNetOptimizer and Caspo-ts.
II. MATERIALS AND METHODS
A. Boolean network calibration
Boolean networks are defined by a set of nodes that represent
binary variables and a set of Boolean functions that govern the
state transitions of these variables. We can write $BN = (V, F)$,
with:
• $V = \{v_1, \dots, v_n\}$ the finite set of variables (or nodes, we
make no distinction) of the network, with a cardinality $n$;
• $F = \{f_1, \dots, f_n\}$ the respective Boolean functions
$f_i : \{0, 1\}^{k_i} \to \{0, 1\}$ associated with each node $v_i$ (with $k_i$ its
number of input parent nodes), that are responsible for
their dynamics. In the case where node $v_i$ has no parent
nodes (we say that it has an in-degree of zero or is a root
node), its initial state is preserved, and $f_i$ is the identity
function. Boolean functions can be formally expressed
using the basic operations (and corresponding operators)
of Boolean algebra, namely conjunction (AND, $\wedge$),
disjunction (OR, $\vee$), and negation (NOT, $\neg$). The
expression of such functions can also include secondary
Boolean operations like exclusive disjunction, denoted by
the XOR ($\oplus$) operator, that excludes the possibility of its
variables being equal to $1$ simultaneously.
We call state $S$ of the network the concurrent binary states
of all its nodes: $S = (s_1, \dots, s_n)$. There are $2^n$ configurations
of $n$ binary values and thus $2^n$ possible states for a Boolean network
with $n$ nodes. This state evolves incrementally depending on
the chosen update scheme, i.e. which variables see their state
transition for a given iteration, and the Boolean functions they
have been assigned. The update scheme can either be
synchronous, where all variables are considered for each update
of the network, or asynchronous, where any number of nodes
from one to all but one are updated. For the latter scheme, the
choice of nodes can be rule-based or stochastic. In this paper,
we chose the update scheme to be synchronous, which
guarantees a deterministic relationship between any state of the
network and the ones that precede and follow it. In addition,
synchronism usually offers shorter (i.e., easier to read and
study) trajectories without causing a significant loss of
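information [32]. As every incremental iteration can be denoted
by a discrete pseudo-time step $t$, we can write
$S(t+1) = F(S(t))$, where $S(t)$ is the state of the network at step $t$ and $F$ applies the function of every node at once.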
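The synchronous scheme can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the three-node rules, the function names, and the `max_steps` guard are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]

def synchronous_step(state: State,
                     functions: Dict[str, Callable[[State], int]]) -> State:
    """Update every node at once from the previous state.

    Nodes without a function (root nodes) keep their value.
    """
    return {node: functions[node](state) if node in functions else value
            for node, value in state.items()}

def trajectory(state: State,
               functions: Dict[str, Callable[[State], int]],
               max_steps: int = 100) -> Tuple[List[State], int]:
    """Iterate until a state repeats; return visited states and attractor period."""
    seen = {}
    states = []
    for step in range(max_steps):
        key = tuple(sorted(state.items()))
        if key in seen:
            # The trajectory re-entered a known state: an attractor was reached.
            return states, step - seen[key]
        seen[key] = step
        states.append(state)
        state = synchronous_step(state, functions)
    raise RuntimeError("no repeated state within max_steps")

# Hypothetical toy rules: A is a root node, B copies A, C is A AND NOT B.
rules = {
    "B": lambda s: s["A"],
    "C": lambda s: s["A"] & (1 - s["B"]),
}
states, period = trajectory({"A": 1, "B": 0, "C": 0}, rules)
```

Because the update is deterministic and the state space is finite, the loop always terminates for a sufficiently large `max_steps`; a period of 1 indicates a steady state.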
In this context, similarly to the description given in [16], we
define the calibration task as the search for the Boolean
functions compatible with a fixed topology that allow for the
successive states of a network to match the consecutive
measurements of the corresponding variables in a given dataset.
B. Combining Zhegalkin polynomials with SAT solving
Zhegalkin polynomials. Our method iterates over the network
variables and starts by converting their regulatory
dependencies, i.e., the information about their input nodes, to
Zhegalkin polynomials [33]. The Zhegalkin polynomial of any
Boolean expression is an ordinary numeric polynomial where
the Boolean False/True values are replaced by the integers
modulo 2, $\{0, 1\}$, and where coefficients are binary and the
logical conjunction (AND, $\wedge$) and exclusive disjunction (XOR,
$\oplus$) operators are used as analogs of the classical product and
sum. With variables that can take only binary values, we have
$x^2 = x$ so that the exponents become redundant and can be
removed. Such polynomials are fully characterized by their
binary coefficients. For example, a Boolean function of two
variables $x$ and $y$ can be written as a Zhegalkin polynomial
$f(x, y) = a_0 \oplus a_1 x \oplus a_2 y \oplus a_3 x y$
with the coefficients $a_0, a_1, a_2, a_3 \in \{0, 1\}$ that characterize its
Boolean expression. Notably, if the binary values for the
coefficients are unknown, i.e., if the polynomial is
un-parametrized, it only carries the information of what its
variables are, here $x$ and $y$. In the context of a Boolean
network, the regulatory dependencies of every node can
effectively be encoded into an un-parametrized Zhegalkin
polynomial. Calibrating the network, i.e., choosing a function
for each one of its nodes, then amounts to assigning binary
values to its corresponding Zhegalkin coefficients. To guide
this assignment, our method has to define calibration
constraints and translate them into Boolean expressions
involving the Zhegalkin polynomials of the nodes’ interactions.
The resulting expressions will make up the input of the solver.
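The correspondence between a Boolean function and its Zhegalkin coefficients can be made concrete with the binary Möbius transform, a standard way to compute the Algebraic Normal Form from a truth table. The sketch below is an illustration only; the function names and the bit-indexing convention (bit 0 for the first variable) are assumptions, not part of the described method.

```python
from typing import List

def zhegalkin_coefficients(truth_table: List[int]) -> List[int]:
    """Binary Moebius transform: truth table -> Zhegalkin (ANF) coefficients.

    Index i of the truth table encodes the input values as bits.
    The coefficient at index m multiplies the monomial made of the
    variables whose bits are set in m (index 0 is the constant term).
    """
    coeffs = list(truth_table)
    size = len(coeffs)
    step = 1
    while step < size:
        for i in range(size):
            if i & step:
                coeffs[i] ^= coeffs[i ^ step]
        step <<= 1
    return coeffs

def evaluate(coeffs: List[int], inputs_mask: int) -> int:
    """XOR of the coefficients whose monomial only uses variables set to 1."""
    value = 0
    for monomial, coefficient in enumerate(coeffs):
        if coefficient and (monomial & inputs_mask) == monomial:
            value ^= 1
    return value

# x OR y, with x on bit 0 and y on bit 1: x XOR y XOR xy
or_coeffs = zhegalkin_coefficients([0, 1, 1, 1])
# NOT x: 1 XOR x
not_coeffs = zhegalkin_coefficients([1, 0])
```

Reading the coefficients in index order gives the constant term first, then the monomials in $x$, $y$, and $xy$, so `or_coeffs` encodes $x \oplus y \oplus xy$.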
Illustrative example. For illustration purposes, we introduce
a simple use case. Let us consider three variables $A$, $B$, and
$C$ with the following known interactions: $A \to B$,
$A \to C$, and $B \to C$. The corresponding
interaction graph is shown in Fig. 1a. $A$ is a root node of the
network and a parent node of $B$; both are parent nodes of $C$. The
Zhegalkin polynomials corresponding to nodes A, B, and C are:
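As a sketch of their shape, assume that A is the root node, that B depends on A, and that C depends on A and B. The names $Z_A$, $Z_B$, $Z_C$ and the coefficients $b_j$, $c_j$ are chosen here for illustration and are not taken from the original notation. A root node keeps its own state, while every other node receives one binary coefficient per monomial over its parents:

```latex
\begin{aligned}
Z_A &= A \\
Z_B &= b_0 \oplus b_1 A \\
Z_C &= c_0 \oplus c_1 A \oplus c_2 B \oplus c_3 A B
\end{aligned}
```

Under this naming, calibrating the example means choosing values for six unknown binary coefficients.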
Calibration constraints. First, as the calibration process is
based on the successive transitions of our system, they need to
be explicitly defined within those expressions. The dynamics
we want our variables to match can be formalized as vectors of
$T + 1$ binary assignments for the successive states of any
number of nodes in the network, with $T$ the number of time
points after baseline. For a variable $v_i$, the assignments can be
written $(d_i^0, d_i^1, \dots, d_i^T)$. The calibrating data constraints can thus
be written as the following Boolean expression:
With
For our illustrative example, we can consider a context-specific
dataset with three time points and the following binarized
measurements:
These data translate to the expression:
Within the synchronous update scheme we adopted, the
updates of the system can be written
$S(t+1) = F(S(t))$, with $S$ the state of the network, $F$ its Boolean functions, and $t$ the sequential step corresponding
to a given state of the network. If we consider that the
successive updates of the network match the time points of the
provided data, the constrained updates of the system can be
written with .
This is, however, a strong assumption. Indeed, experimental
time points often have variable time gaps between them. In
addition, we expect some processes to take place between the
measured states of the system. For a more realistic description,
we write , with the
maximal number of updates given to the network to reach the
state matching the next experimental time point. This threshold
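applies to all time points. With $T$ the number of time points after
baseline and $m$ this maximal number of updates, a Boolean network
trajectory has at most $mT + 1$ states, meaning $(m - 1)T$ states between time
points, distributed over $T$ gaps. It follows that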
. This equality for , with ,
represents the transition system constraints for our network.
As the network only has a finite number of possible states
and deterministic dynamics, Dirichlet’s drawer principle [34]
guarantees that it will eventually come to a state it has already
been in. Its trajectory will have reached an attractor [35]. This
attractor can either be a steady state or a cycle. Once the system
has reached an attractor, we can write
$S(t + p) = S(t)$, with $p$ the period of the attractor,
$p \geq 1$, and $p$ equal to $1$ when the attractor is a steady
state. Supposing an attractor of a period at most $P$ with one of
its states matching the final time point of the calibrating data,
we write . This
equality represents the stability constraints for our network. We
note that for root nodes, this equality is always True.
In Boolean algebra, equality is an operator where $x = y$ is
True when $x$ and $y$ have the same value. Therefore, it has the
same truth table as $(x \vee \neg y) \wedge (\neg x \vee y)$. We prefer this
expression as SAT solvers take their input in Conjunctive
Normal Form (CNF), i.e., as a conjunction of disjunctive
clauses. The transition system and stability constraints can be
written as the following expressions, with the Zhegalkin
polynomials associated with the network nodes:
We suppose that the network updates in our illustrative example
match the three measured time points and that the latter
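corresponds to a steady-state attractor, i.e., that the maximal number of
updates between time points and the maximal attractor period are both equal
to $1$. The transition system constraints then translate to: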
And the stability constraints to:
The calibrating data, transition system, and stability expressions would be sufficient to perform
the calibration task. However, in practice, we add a further
constraint. Two common expression profiles for the entities in
a biological system would be an activation/inhibition of the
entity that occurs at the first time point and remains for the rest
of the time series. In the Boolean formalism, these profiles can
respectively be represented by the constant functions $f = \text{True}$ and
$f = \text{False}$. However, from an interpretation point of view,
they cause an undesired loss of mechanistic information.
Considering functions of $k$ variables, they have the following
Zhegalkin coefficients, with $a_0$ the constant term:
$(a_0, a_1, \dots, a_{2^k - 1}) = (1, 0, \dots, 0)$
and $(a_0, a_1, \dots, a_{2^k - 1}) = (0, 0, \dots, 0)$. To
prevent the choice of these functions, we add the following
True/False penalty expression:
For the variables $A$, $B$, and $C$ in our illustrative example, this
penalty translates to the expression:
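One form consistent with this description, given here as an assumption rather than as the original formula, asks every non-root node to keep at least one non-constant coefficient equal to 1. With hypothetical coefficients $b_0, b_1$ for B (parent A) and $c_0, \dots, c_3$ for C (parents A and B), and no coefficient for the root A, this reads:

```latex
b_1 \wedge (c_1 \vee c_2 \vee c_3)
```

Both constant functions set every non-constant coefficient to 0 and therefore violate the clause of their node. Each clause is a plain disjunction of literals, so their conjunction is already in CNF.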
SAT solving. The proposition made up of the conjunction of
these four expressions defines the complete calibration
constraints in our method. Ideally, it would be fed in this format
to a SAT solver that would return the calibrated Zhegalkin
coefficients. However, as SAT solvers only accept Boolean
functions as their input in CNF, the transition system and
stability expressions need to be converted to this format (in
which the calibrating data and True/False penalty constraints
already are). Rather than using the usual Quine-McCluskey
algorithm for this task [36][37], which can lead to an
exponential increase in the formula size and computation time
with the number of input nodes, we apply a Tseitin
transformation to the expressions [38]. This transformation
returns equisatisfiable expressions in CNF, whose length only
increases linearly with the number of variables.
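The idea behind the Tseitin transformation can be illustrated on a small polynomial. The sketch below is a hand-written illustration, not the authors' encoder: it uses signed integers as literals, introduces one auxiliary variable per AND or XOR gate, and encodes the hypothetical expression $c_1 x \oplus c_2 y$. Each gate contributes a constant number of clauses, which is why the CNF stays small.

```python
from itertools import product

def tseitin_and(out, a, b):
    """Clauses forcing out <-> (a AND b)."""
    return [[-out, a], [-out, b], [out, -a, -b]]

def tseitin_xor(out, a, b):
    """Clauses forcing out <-> (a XOR b)."""
    return [[-out, a, b], [-out, -a, -b], [out, -a, b], [out, a, -b]]

def satisfied(clauses, assignment):
    """assignment maps a variable index to a bool; literals are signed integers."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

# Variables 1..4 are the inputs; 5..7 are auxiliary gate outputs.
X, C1, Y, C2, G1, G2, OUT = range(1, 8)
cnf = (tseitin_and(G1, C1, X)        # G1 <-> c1 AND x
       + tseitin_and(G2, C2, Y)      # G2 <-> c2 AND y
       + tseitin_xor(OUT, G1, G2))   # OUT <-> G1 XOR G2

# Brute-force check: every input assignment extends to exactly one model.
models = []
for bits in product([False, True], repeat=7):
    assignment = dict(zip(range(1, 8), bits))
    if satisfied(cnf, assignment):
        models.append(assignment)
```

The resulting CNF is equisatisfiable with the original expression rather than equivalent to it, because it mentions the auxiliary variables; in every model, `OUT` carries the value of the encoded polynomial.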
Finally, in practice, due to the biases that can be induced by
experimental noise, the averaging and binarization of the data,
and the forced initial topology from the naïve network, not all
constraints will usually be simultaneously satisfiable. Instead of
a strict SAT solver, we use a Weighted MaxSAT solver,
RC2 [39]. Such a solver does not attempt to satisfy all the clauses: each
clause is assigned a weight, and the goal of the solver is to
satisfy the clauses in a way that maximizes their summed
weights. In that context, clauses can be defined as hard, i.e.,
mandatory to satisfy, or soft with a given weight, depending on
the importance we give them. Our method thus tries to satisfy
as many calibration constraints as possible based on these
weights while ensuring we get a model as a result. In return, we
need to establish what constraints we prioritize. We decided on
the following hierarchy:
• The transition system of the network is fixed, and
Dirichlet’s drawer principle guarantees we reach an
attractor. Therefore, the transition system and stability
constraints are set as hard clauses.
• The goal of the calibration process is to match the
dynamics of the provided data as closely as possible. Yet,
we expect not to be able to match all successive states for
all measured variables. We thus introduce a priority within
the calibrating data constraints:
o The main result usually exploited when simulating a
Boolean model of a biological system is the attractor
[35]. Therefore, we consider the final state our
priority, and constraints for the last time point are
defined as soft clauses with the largest weight;
o As we work with a synchronous, or deterministic,
update scheme, the reached attractor is determined by
the initial state of the network. Initial state constraints
are thus defined as soft clauses with the second largest
weight;
o Experimental profiles are to be mimicked as closely as
possible. Consequently, constraints regarding
intermediary time points are defined as soft clauses
with the third largest weight.
• The True/False Boolean functions should only be
discarded if they do not improve the satisfiability of all the
constraints above. The associated constraints are thus
defined as soft clauses with the smallest weight.
The Weighted MaxSAT solver returns the calibrated
Zhegalkin coefficients corresponding to the best or any
specified number of equally best networks. Our method then
translates these coefficients back into Boolean functions for
each node. In addition, the solver returns the cost of the
network or networks, which represents the total weight of
unsatisfied clauses. As this cost decreases, the output networks
more effectively adhere to the specified constraints. We define
the weights for the different constraint categories as the sum of
the weights of all the clauses of lower importance plus 1. That
way, there is no overlap between constraints, and the
unsatisfied clauses can be traced back to the category from
which they originate. Importantly, RC2 is a complete MaxSAT
solver, meaning that it is designed to converge towards
provably optimal solutions [40], even though MaxSAT is a notoriously NP-hard
problem [41]. Despite the additional computational
burden compared to alternative approximate solvers, its
efficient heuristics make it stand out as a multiple-contest-
winning solver [42][43].
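The weighting rule can be sketched as follows. This is an illustration under assumptions: the clause counts per category are invented, and the helper names are hypothetical. Each category's weight exceeds the total weight of everything ranked below it, so the cost returned by the solver can be decomposed category by category.

```python
def stratified_weights(clause_counts):
    """Weights per soft-clause category, ordered from least to most important.

    Each weight is the summed weight of all lower-ranked clauses plus 1.
    """
    weights = []
    lower_total = 0
    for count in clause_counts:
        weight = lower_total + 1
        weights.append(weight)
        lower_total += weight * count
    return weights

def decode_cost(cost, weights):
    """Number of unsatisfied clauses per category, recovered from the total cost."""
    violated = []
    for weight in reversed(weights):
        violated.append(cost // weight)
        cost %= weight
    return list(reversed(violated))

# Hypothetical counts: True/False penalty, intermediate, initial, final clauses.
weights = stratified_weights([3, 4, 2, 2])
# Two penalty clauses, one intermediate clause, and one final clause unsatisfied.
cost = 2 * weights[0] + 1 * weights[1] + 0 * weights[2] + 1 * weights[3]
breakdown = decode_cost(cost, weights)
```

With this scheme, violating a single clause of a given category always costs more than violating every clause of all lower categories together, which is what enforces the hierarchy.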
Coming back to our illustrative example, the MaxSAT solver
would be able to satisfy all constraints and return the satisfying
interpretations. One of them, illustrated in Fig. 1b, would be:
Fig. 1. Illustrative Boolean network example. (a) The interaction
graph, where the dashed edges represent influences between a source
node and a target node. (b) One calibrated network satisfying the
calibration constraints, where the edges with ← and Ⱶ endpoints
respectively illustrate activation or inhibition between a source node
and a target node, and O the conjunction (AND, $\wedge$) operator.
Overview of our method. The full iterative description of our
method is given in Algorithm 1. It takes as input the interaction
graph $G$, the context-specific binary data $D$, the maximal
number of updates of the network between data time points $m$,
the maximal period of the attractor $P$, and the desired number
of output networks $N$. $G$ defines a set of variables $V$, of
cardinality $n$, and the interactions between them. The variables
can be split into the sources $V_s$ and recipients $V_r$, respectively, on
the exerting and receiving end of at least one interaction. As
there are no unconnected variables, we have $V = V_s \cup V_r$. The
data $D$ associates a subset of the variables in $V$, called $M$ for
measured, with binary assignments.
Our method initially sets the calibrating data, transition
system, stability, and True/False penalty expressions
(respectively $\Phi_{data}$, $\Phi_{trans}$, $\Phi_{stab}$, $\Phi_{TF}$) to True, the neutral
element for conjunction. It then iterates over the variables in
$V$, adds the calibrating data constraints for the measured ones,
and the other constraints featuring their Zhegalkin polynomials
for the ones that receive at least one interaction. The proposition $\Phi$
made of the conjunction of the constraint expressions is
Tseitin-transformed into a weighted CNF Boolean formula $\Psi$.
$\Psi$ is, in turn, fed to the Weighted MaxSAT solver. The solver
returns the $N$ interpretations, i.e., the binary values for the
Zhegalkin coefficients, that allow for the satisfaction of clauses
whose respective weights account for the largest possible sum.
For each of the $N$ interpretations, the Zhegalkin coefficients are
extracted and translated into Boolean functions.
Finally, the $N$ Boolean networks composed of these
functions are returned.
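The overall pipeline can be mimicked on the three-node example with exhaustive enumeration standing in for the MaxSAT solver. Everything specific in this sketch is an assumption made for illustration: the binarized measurements are invented, the coefficient layout is hypothetical, updates are assumed to match the time points one to one, and the final state is required to be a steady state. With six unknown coefficients there are only 64 candidates, so brute force is enough here.

```python
from itertools import product

def z_b(coeffs, a):
    """Zhegalkin polynomial of B over its parent A: b0 XOR b1*A."""
    return coeffs[0] ^ (coeffs[1] & a)

def z_c(coeffs, a, b):
    """Zhegalkin polynomial of C over A and B: c0 XOR c1*A XOR c2*B XOR c3*A*B."""
    return coeffs[0] ^ (coeffs[1] & a) ^ (coeffs[2] & b) ^ (coeffs[3] & a & b)

def step(state, cb, cc):
    """Synchronous update; the root node A keeps its value."""
    a, b, _ = state
    return (a, z_b(cb, a), z_c(cc, a, b))

# Hypothetical binarized measurements of (A, B, C) at three time points.
data = [(1, 0, 0), (1, 1, 1), (1, 1, 0)]

solutions = []
for bits in product((0, 1), repeat=6):
    cb, cc = bits[:2], bits[2:]
    states = [data[0]]
    for _ in range(len(data) - 1):
        states.append(step(states[-1], cb, cc))
    matches_data = states == data                        # calibrating data
    is_steady = step(states[-1], cb, cc) == states[-1]   # stability, period 1
    non_constant = any(cb[1:]) and any(cc[1:])           # True/False penalty
    if matches_data and is_steady and non_constant:
        solutions.append((cb, cc))
```

Several coefficient assignments satisfy all constraints on this toy dataset, which mirrors the notion of equally satisfactory networks; one of them, `((0, 1), (0, 1, 0, 1))`, reads as B = A and C = A AND NOT B.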
Algorithm 1 An algorithm that combines Zhegalkin
polynomials and SAT solving for Boolean network
calibration
input interaction graph $G$, binary data $D$, maximal number of updates $m$, maximal attractor period $P$, number of output networks $N$
output $N$ equally satisfactory calibrated Boolean networks
foreach $\Phi_x \in \{\Phi_{data}, \Phi_{trans}, \Phi_{stab}, \Phi_{TF}\}$ do
    $\Phi_x \leftarrow$ True;
end foreach
for $i = 1$ to $n$ do
    if $v_i$ is measured ($v_i \in M$) then
        $\Phi_{data} \leftarrow \Phi_{data} \wedge$ (calibrating data constraints of $v_i$);
    end if
    if $v_i$ receives at least one interaction ($v_i \in V_r$) then
        conversion of the interaction information of $v_i$
        from $G$ to a Zhegalkin polynomial $Z_i$;
        $\Phi_{trans} \leftarrow \Phi_{trans} \wedge$ (transition system constraints of $v_i$);
        $\Phi_{stab} \leftarrow \Phi_{stab} \wedge$ (stability constraints of $v_i$);
        $\Phi_{TF} \leftarrow \Phi_{TF} \wedge$ (True/False penalty of $v_i$);
    end if
end for
$\Phi \leftarrow \Phi_{data} \wedge \Phi_{trans} \wedge \Phi_{stab} \wedge \Phi_{TF}$;
Tseitin transformation of $\Phi$ into a weighted CNF Boolean
formula $\Psi$;
Application of the Weighted MaxSAT solver to $\Psi$ and
generation of the $N$ interpretations with the lowest cost;
Extraction of the calibrated Zhegalkin coefficients
and conversion into Boolean functions for the $N$
interpretations;
Output of the $N$ calibrated Boolean networks;
C. Evaluation methodology
We designed our method to perform Boolean network
calibration. We defined this calibration as the process of
searching for the functions that enable the successive states of
its nodes to align with the corresponding measured variables.
The ultimate calibration would imply retrieving the exact
network that produced the measured dynamics. This is,
however, mostly unachievable in practice. First, in
most cases, one performs network calibration for reverse
engineering purposes: the exact network is unknown and
cannot serve as a reference for comparing the calibration
results. In addition, the available data usually captures only a
fraction of the possible states of the system, which a multitude
of networks can reproduce. The data alone does not allow for
further discrimination between the solutions. From there, one
can try to restrict the number of equally satisfying networks by
adding further constraints, thereby reducing the exhaustiveness
of the calibration process. Another option is to look at the
solutions as an ensemble and study them through set theory.
Finally, one can select and examine one of the networks and
consider it representative of the solution space.
In this work, we evaluate our method on artificially generated
toy models of increasing size and complexity and artificially
generated data. We then challenge it further on four real
experimental datasets from the HPN-DREAM breast cancer
network inference challenge [31].
Toy model evaluation. We first want to assess our method’s
ability to calibrate networks with a controlled and progressively
increasing size and complexity. We implemented a toy model
generator with a configurable number of nodes, stimuli, inputs
per node, and outputs per stimulus. We can compute the
trajectories of the resulting networks in an adjustable number of
experimental conditions defined as combinations of stimulus-
induced activations/inhibitions. Working with artificial data
from known toy networks enables us to evaluate the calibrated
networks by confronting them with the original ones. In this
context, we thus evaluate the ability of a method to retrieve the
original networks from their trajectories and topology, given as
an influence graph, in the shortest time possible. Regarding the
trajectory generation, we observed in our simulations that they
rarely exceeded a length of 15 states, even for networks with up
to 30 nodes and 3 inputs per node. In addition, the study in [44]
showed that most of the transcriptomics datasets in the Stanford
Microarray Database had fewer than eight time points, a
limitation that remains true with today’s more powerful
sequencing technologies [45]. To comply with these
computational and experimental realities, we limit the artificial
data to 8 pseudo-time points and allow at most one intermediate
update of the network between time points, meaning trajectories
with 15 states or less. The maximal attractor period is set to two.
HPN-DREAM evaluation. To assess the ability of our method
to extend its capabilities to real applications, we challenge it
with the time-course prediction sub-challenge of the HPN-
DREAM (Heritage Provider Network-Dialogue for Reverse
Engineering Assessment and Methods) breast cancer network
inference challenge [31]. This time, in contrast with the toy
model evaluation, the initial network is unknown. Rather than
the actual network that produced the data, we thus evaluate the
ability of a method to derive any network that predicts the
binarized abundance profiles of a set of phosphoproteins in
experimental conditions absent from the training data. In this
context, to evaluate their performance, we define a score to
quantify the closeness between simulated and experimental
trajectories. The experimental time points are first mapped onto
the updates of the calibrated network. For consistency
with the toy model evaluation, we allow for one potential
intermediary update of the network between two time points.
Then, for each time point, we quantify the discrepancy with the
corresponding calibrated network state on the subset of
variables they have in common by measuring their Hamming
distance [46], a metric that counts the number of values that
differ between two lists. The case where the simulated
trajectory we obtain is shorter than the experimental one implies
an undesired oversimplification of the inferred dynamics: for
each experimental time point that is skipped by the simulation,
we consider a maximal Hamming distance, i.e., a penalty equal
to the number of variables present in both the data and the
network. The final score is thus given by the sum of Hamming
distances measured for all time points divided by the number of
overlapping variables. This score can then be converted into a
percentage of matching states between predicted trajectories
and measured data for the variables present in both the network
and the test data.
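The scoring rule can be sketched as below. This is a minimal reading of the description, under assumptions: trajectories are given as tuples over the shared variables and are already mapped onto the experimental time points, and the conversion to a percentage of matching states is one plausible formula, not necessarily the exact one used in the evaluation. The function names and the sample trajectories are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def trajectory_score(simulated, measured):
    """Summed Hamming distances over time points, divided by the variable count.

    Experimental time points missing from a shorter simulated trajectory
    receive the maximal distance, i.e., the number of shared variables.
    """
    n_vars = len(measured[0])
    total = 0
    for t, observed in enumerate(measured):
        if t < len(simulated):
            total += hamming(simulated[t], observed)
        else:
            total += n_vars
    return total / n_vars

def matching_percentage(score, n_time_points):
    """Assumed conversion of the score into a percentage of matching states."""
    return 100.0 * (1.0 - score / n_time_points)

# Hypothetical trajectories over three shared variables.
measured = [(0, 1, 1), (1, 1, 0), (1, 0, 0)]
simulated = [(0, 1, 1), (1, 0, 0)]      # one experimental time point is skipped
score = trajectory_score(simulated, measured)
percentage = matching_percentage(score, len(measured))
```

A score of 0 corresponds to a perfect match, and the skipped time point dominates the score in this example, which reflects the penalty placed on oversimplified dynamics.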
The methods initially targeted by the HPN-DREAM
challenge were supposed to perform both the network topology
and dynamics inferences. Our method, however, focuses on the
latter. We thus built a Prior Knowledge Network (PKN)
beforehand to define the topology of the network to be
calibrated. The role of this PKN is to provide a structure for the
calibration process that fixes which nodes interact with each
other. A major source of information for influence data between
biological entities can be found in signaling pathways. Yet,
pathways in standard repositories like KEGG [10],
Wikipathways [11], SIGNOR [12], or Reactome [13] rarely
include phosphorylated proteins. We thus decided to build the
PKN based on their associated proteins. We listed the genes
corresponding to the proteins that appeared at least once in the
dataset and performed a functional characterization of the
resulting gene set through the STRING database [47]. Among
the various enrichments this database offers, we focused on
the KEGG pathways that covered most of the genes in our
list. The complete list (in Supplementary Table S1) contained
162 pathways. Among them, two pathways stood out with more
than 20 genes matching the measured proteins: the EGFR
tyrosine kinase inhibitor resistance pathway with 23 (including
four stimuli and two inhibitees) and the ErbB signaling pathway
with 20 (including two stimuli and two inhibitees). In addition,
we identified the mTOR signaling pathway with 17 matching
entities, including two stimuli and two inhibitees. We merged
these three pathways based on their shared entities and
converted the result into a Boolean network using BIOVIA
Living Map, a software on the Dassault Systèmes
3DEXPERIENCE® platform. This Boolean network can be
exported from Living Map in the standard SBML qual format
[48], which can, in turn, be reduced to a SIF-formatted
influence graph to be used as input for the calibration process.
The resulting PKN (in Supplementary Text S1) contains 114
entities, among which 31 overlap with the data from at least one
of the cell lines, including four stimuli and two inhibitees,
linked by 171 interactions. It is comparatively large with
respect to the gene regulatory networks (GRNs) surveyed in
[49], whose mean and median numbers of nodes are 41.9 and 23,
respectively. The capacity of our method to calibrate a network
of this size would thus suggest its applicability to the vast
majority of existing GRNs.
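As an illustration of the kind of input the calibration step consumes, the sketch below reads a SIF-formatted influence graph and derives the in-degree of each node. It assumes a three-column layout (source, sign, target) with signs written as 1 and -1; the function name and the three-edge fragment are hypothetical and not taken from the actual PKN.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_sif(text: str) -> Dict[str, List[Tuple[str, int]]]:
    """Map each target node to its list of (source, sign) influences."""
    parents: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source, sign, target = fields
        parents[target].append((source, int(sign)))
    return dict(parents)


# Hypothetical fragment with generic node names.
example = """A 1 B
B 1 C
D -1 C
"""
pkn = parse_sif(example)
in_degree = {node: len(sources) for node, sources in pkn.items()}
print(in_degree)
```

The in-degree computed this way is the quantity that drives the number of candidate update functions per node.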
The HPN-DREAM data [50] contains training and test
datasets of normalized Reverse Phase Protein Array (RPPA)
quantitative proteomics measurements for 45 proteins at seven
time points up to 4 hours post-stimulus. The measurements
concern four breast cancer cell lines: BT20, BT549, MCF7, and
UACC812. Each cell line has been studied in multiple
experimental conditions defined by various inhibitor/stimulus
environments. In the training data, we limit the number of
experimental conditions to five, except for the BT549 cell line
data, which contains four conditions after preprocessing. The
preprocessed test data has four experimental conditions per cell
line, except for MCF7 that has just two. In all experimental
conditions, we set the calibrated network trajectories to match
the experimental time points and consider that the system has
reached a stable state at the last time point. We adapt the data
to match the PKN and binarize it to suit the Boolean formalism.
Further details on the HPN-DREAM challenge data
preprocessing and formatting are given in Supplementary Text S2.
D. Benchmark
To ensure the feasibility of the evaluation strategy and to
validate our method’s results, we apply two reference tools,
CellNetOptimizer and Caspo-ts, to the same calibration tasks.
Compared tools. CellNetOptimizer (CNO) was presented in
2009 [16]. Its first implementation relied on an initial
compression of the input PKN, the generation of a hypergraph
of this compressed PKN, and the heuristic browsing of this
hypergraph to find the networks that minimize an objective
function with two terms: the similarity between its attractor and
the given data (corresponding to a single time point), and its
size. Since then, several new features have been added,
including CNORode, CNORprob [51], and the one we used in
this paper, CNORdt [25], which is specific to time-course data.
Whereas CellNetOptimizer, and specifically CNORdt, relies on
a genetic algorithm to browse the potential solution networks,
the second method, Caspo-ts [26], uses Answer Set
Programming (ASP). The ASP solver is fed restrictions
regarding the compatibility with the PKN and an over-
approximation of the calibration data constraints. The over-
approximation criterion guarantees that any solution to the
calibration problem will be selected along with certain
networks that do not match the given dynamics. The two types
of solutions are respectively designated as true and false
positives. To separate them from each other, a computationally
expensive model checking step, which explores all possible
system states in a brute-force manner [52], is performed with
the NuSMV tool [53]. Notably, Caspo-ts has previously been
confronted with the HPN-DREAM challenge [54].
We implemented our method in Python 3.7 and named it
ZhegAlCal: Zhegalkin polynomial-based Algorithm for
Calibration. The computational setup and parametrization
details of the three tools are given in Supplementary Text S3.
III. RESULTS AND DISCUSSION
In this section, we assess the applicability of our method to
various and increasingly complex calibration tasks, and we
evaluate its capabilities against two reference tools,
CellNetOptimizer and Caspo-ts.
A. Toy model calibration
By using toy networks and the associated artificially
generated data, we place ourselves in an ideal position where
we can evaluate the ability of the three tools to retrieve a unique
network based on the measured dynamics of its nodes. Our
tunable toy model generator allows us to control the
characteristics of this network and their effect on the calibration
process. In addition, in contrast with the usually flawed
measured biological data, using artificially generated data
allows for control and, if need be, suppression of missing data,
noise, or batch effects.
The size of the network and its complexity, here
associated with the in-degree of its nodes, both affect
the calibration process. Larger networks mean more nodes
whose dynamics need to be determined in parallel with the
others and exponentially more possible states for the whole
system ($2^n$ for a network with $n$ nodes). Moreover, having an
increasing number of inputs per node causes an explosion in the
number of potential functions of these inputs ($2^{2^k}$ functions for
a node with $k$ inputs). For the associated data generation, we
can decide how many distinct “experimental” conditions it
contains. Each additional condition enriches the truth tables of
the Boolean functions to be retrieved but adds its own
constraints to be satisfied. Therefore, more conditions should
help the three methods converge toward the desired network but
make the calibration task more computationally expensive.
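The orders of magnitude involved can be made concrete with a few lines of Python. The function names are illustrative; the counts themselves are standard: a network with n Boolean nodes has 2^n states, a node with k inputs admits 2^(2^k) Boolean functions, and a Zhegalkin polynomial over k inputs is fully described by 2^k binary coefficients.

```python
def n_states(n_nodes: int) -> int:
    """Number of global states of a Boolean network."""
    return 2 ** n_nodes


def n_boolean_functions(n_inputs: int) -> int:
    """Number of distinct Boolean functions of n_inputs variables."""
    return 2 ** (2 ** n_inputs)


def n_zhegalkin_coefficients(n_inputs: int) -> int:
    """Number of binary coefficients of a Zhegalkin polynomial."""
    return 2 ** n_inputs


for k in (2, 3, 9):
    digits = len(str(n_boolean_functions(k)))
    print(f"k={k}: {n_zhegalkin_coefficients(k)} coefficients, "
          f"candidate functions: a {digits}-digit number")
```

Choosing one function among 2^(2^k) candidates thus amounts to assigning 2^k binary values, which is what makes a SAT-style encoding attractive for nodes with many inputs.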
We built ten toy models with a number of nodes ranging
between 10 and 30. We fixed the maximum number of input
nodes per node for each network to 2 or 3. The number of
experimental conditions was set to either 5 or 10, and depending
on the size and connectivity of the calibrated network, we
searched the first 1000 or 5000 models returned by each of the
three methods for the original one. If it was not found, we
counted the number of its disjunctive clauses retrieved by each method.
Most processes have been run and timed three times to assess
reproducibility. However, due to their larger size and longer
computation times, the processes for the 30-node networks
were only run once. The calibration results of all
aforementioned toy models are summarized in Table I. As
CellNetOptimizer with its genetic algorithm and ZhegAlCal
with its MaxSAT solver rely on heuristics to perform their
search, the order of the returned outputs slightly varied, and the
displayed retrieval of the initial model (or number of its clauses)
corresponds to the best-case scenario. Where applicable, the
computation times have been averaged over the three runs.
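The clause-retrieval count used when the original model is not found can be sketched as below. This assumes that update functions are stored in disjunctive normal form and that a clause counts as retrieved when it appears identically in the candidate function of the same node; the exact matching rule used in the evaluation is not specified here, and all names and the toy networks are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

Clause = FrozenSet[str]           # conjunction of literals, e.g. {"A", "!B"}
Network = Dict[str, Set[Clause]]  # node -> disjunction of clauses


def retrieved_clauses(original: Network, candidate: Network) -> Tuple[int, int]:
    """Return (clauses of the original found in the candidate, total clauses)."""
    found = sum(len(clauses & candidate.get(node, set()))
                for node, clauses in original.items())
    total = sum(len(clauses) for clauses in original.values())
    return found, total


# Hypothetical original: C = (A AND B) OR (NOT A); D = C
original = {"C": {frozenset({"A", "B"}), frozenset({"!A"})},
            "D": {frozenset({"C"})}}
# Hypothetical candidate: C = (A AND B); D = C OR (NOT B)
candidate = {"C": {frozenset({"A", "B"})},
             "D": {frozenset({"C"}), frozenset({"!B"})}}
print(retrieved_clauses(original, candidate))
```

Such a syntactic count rewards structural similarity only; two networks with few shared clauses may still produce identical trajectories on the tested conditions.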
Starting with a small network with 10 nodes, 2 inputs per
node, and 5 experimental conditions, we looked at the first 1000
networks returned by each method. We observed that both
Caspo-ts and ZhegAlCal retrieved the initial model from the
provided data. The closest network CellNetOptimizer found
contained 6 out of the 9 disjunctive clauses making up the
update functions of the initial network. For this first application,
ZhegAlCal was an order of magnitude faster than Caspo-ts.
When the size of the network increased, only ZhegAlCal was
able to retrieve the initial network. CellNetOptimizer retrieved
two-thirds of its clauses, whereas Caspo-ts found at most a
third. Reasoning that the response of the initial model to
additional perturbations could further constrain the search, we
doubled the number of experimental conditions. We obtained
only a slight improvement in the number of clauses retrieved by
Caspo-ts. We then returned to 5 experimental conditions
and browsed a larger number of networks. All three methods then
retrieved the initial model among their first 5000 outputs.
To increase the complexity of the network, we changed the
number of inputs per node from 2 to 3. With otherwise
unchanged parameters, none of the methods was able to return
the initial model within its first 1000 output networks. Yet
ZhegAlCal got closest with 16 out of 18 clauses retrieved,
versus 12 for CellNetOptimizer and 6 for Caspo-ts. Again, we
tried to constrain the search with twice as many experimental
conditions. Unlike for the previous network, the additional data
enabled ZhegAlCal to retrieve the initial model and slightly
increased the number of clauses found by Caspo-ts. Even with
the doubled number of experimental conditions,
CellNetOptimizer and Caspo-ts were not able to yield the
desired network within their first 5000 outputs. Notably,
CellNetOptimizer’s computation times seemed less sensitive to
the increase in size and complexity of the models. We finally
doubled the number of nodes in the network up to 30, each one
depending on 3 parent nodes. We observed that computation
times increased significantly for ZhegAlCal and exploded for
Caspo-ts, for which we had to interrupt the computation after
seven days without completion. CellNetOptimizer’s
computation times remained in the same order of magnitude as
those of the smaller networks. Concerning the inference itself,
none of the three tools was able to retrieve the initial network
among its first 5000 outputs, but CellNetOptimizer managed to
recover the most clauses from the original functions.
Our method and its implementation in the ZhegAlCal tool
showed convincing abilities for retrieving a model based on its
dynamic behavior in various experimental conditions. For the
smaller models, it retrieved the initial network from limited
amounts of perturbation data faster than the two other tools. As
CellNetOptimizer uses a genetic algorithm to browse the
solution space, the search process is primarily affected by the
genetic operators, how they are implemented in the context of
Boolean networks, and the fitness function. On the other hand,
Caspo-ts employs a solver to generate a set of answers, in this
case, Boolean networks that satisfy certain constraints. These
constraints are initially an over-approximation of the exact
dynamic constraints from the data, which are then validated by
a model checking step. The shorter execution times for
ZhegAlCal on small networks might be due to the capped
length of the inferred network dynamics, which the model
checker in Caspo-ts does not allow. This cap could
be seen as a limitation to the completeness of the calibration
process. However, it seems unlikely that we could derive
networks containing numerous small-scale biological processes
from data with only a few time points. The fixed-length
trajectories for the calibrated networks, also implemented in
CellNetOptimizer, then appear as a way to control the temporal
scale of the mechanistic processes to be derived from the data.
ZhegAlCal suffers, albeit to a lesser extent, from the same
pitfall as Caspo-ts when the size and complexity of the
networks increase further. In such cases, the number of
potential networks explodes, and satisfiability-based
approaches like Caspo-ts and ZhegAlCal cannot discriminate
candidate networks efficiently enough. Yet, although
CellNetOptimizer’s heuristic search is expected to perform
better in such conditions, its chances of retrieving the actual
original model are infinitesimal. We should also keep in mind
that the number of retrieved clauses is only an indication of the
similarity between the calibrated and original models. A
network whose functions share more disjunctive clauses with
the initial one does not guarantee better matching dynamics. As
the available information decreases relative to the size and
complexity of the studied system, the inference becomes
infeasible, and the aim of the process shifts. Instead of looking
for the exact model that produced a particular set of data, we
tried to reconstruct a system that mimics its behavior as robustly
as possible. To evaluate the ability of our method, and that of
CellNetOptimizer and Caspo-ts, to perform this task, we
challenged them with an experimental dataset containing both
training and test data. We measured how well a network
calibrated with each of the three tools on the training data can
predict the dynamic evolution of the variables in the test data.

TABLE I: NETWORK CALIBRATION EVALUATION RESULTS ON TOY MODELS WITH INCREASING SIZE AND COMPLEXITY. For the three evaluated
tools, the retrieval within three attempts of the initial network or of a maximum of its clauses, among the first 1000 or 5000 output networks.
Best-case results and average execution times are shown. The “Retrieved” columns state whether the initial model was retrieved and, if not, the
number of its clauses that were. “n/a” indicates unavailable results, “> 7d” means the computations were interrupted after 7 days without
completion, and “*” means the processes were run once.

| N° nodes | N° inputs/node | N° conditions | N° networks | Retrieved: CNO | Retrieved: Caspo-ts | Retrieved: ZhegAlCal | Time (s): CNO | Time (s): Caspo-ts | Time (s): ZhegAlCal |
|---|---|---|---|---|---|---|---|---|---|
| 10 | 2 | 5 | 1000 | No (6/9) | Yes | Yes | 19.37 | 26.77 | 3.10 |
| 15 | 2 | 5 | 1000 | No (12/15) | No (13/15) | Yes | 33.70 | 89.77 | 14.57 |
| 15 | 2 | 10 | 1000 | No (12/15) | No (13/15) | Yes | 38.73 | 127.10 | 13.50 |
| 15 | 2 | 5 | 5000 | Yes | Yes | Yes | 36.37 | 266.33 | 13.57 |
| 15 | 3 | 5 | 1000 | No (12/18) | No (6/18) | No (16/18) | 48.97 | 304.05 | 155.37 |
| 15 | 3 | 10 | 1000 | No (12/18) | No (9/18) | Yes | 90.99 | 721.33 | 346.87 |
| 15 | 3 | 5 | 5000 | No (15/18) | No (7/18) | No (17/18) | 33.12 | 1106.17 | 507.30 |
| 15 | 3 | 10 | 5000 | No (14/18) | No (10/18) | Yes | 121.27 | 4359.97 | 2742.63 |
| 30* | 3 | 5 | 1000 | No (31/48) | n/a | No (23/48) | 273.30 | > 7d | 26090.0 |
| 30* | 3 | 5 | 5000 | No (37/48) | n/a | No (25/48) | 280.70 | > 7d | 22339.4 |
B. HPN-DREAM challenge calibration
The training and test datasets from the HPN-DREAM breast
cancer network inference challenge contain phosphoprotein
abundances in mutually exclusive combinations of
stimulations/inhibitions, for four cell lines. To provide the
topological information required by CellNetOptimizer, Caspo-
ts, and ZhegAlCal to perform the calibration, we first built a
representative PKN by merging three of the KEGG pathways
that contained most of the proteins whose abundance was
measured in the training data.
We had planned to calibrate this network with all three
methods and evaluate the dynamics of their respective first
output network. However, applying CellNetOptimizer to any
PKN containing nodes with more than 5 inputs systematically
caused the process to abort on our computing setup. As a
result, we had to alter the PKN by subsampling the
inputs of nodes depending on more than 5 variables. The
affected nodes were EGFR and the 107080638:TSC1 complex,
which had 9 parent nodes; MTOR, which had 8 parent nodes;
and ERBB4, GAB1, GRB2, and SHC2, which had 6 parent
nodes. In addition, CellNetOptimizer does not accept input
edges on stimuli. Consequently, we built a CellNetOptimizer-
compatible PKN to be used with all three tools. The calibration
of the original PKN was still evaluated for Caspo-ts and
ZhegAlCal to assess their ability to deal with networks of such
complexity. We also expected this PKN to provide a more
comprehensive description of the underlying biological reality.
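The two alterations described above, capping the in-degree at 5 and dropping incoming edges on stimuli, can be sketched as follows. How the retained inputs were actually selected is not stated here; uniform random sampling with a fixed seed is an assumption made purely for illustration, as are the function name and the toy graph.

```python
import random
from typing import Dict, Iterable, List


def make_cno_compatible(parents: Dict[str, List[str]],
                        stimuli: Iterable[str],
                        max_inputs: int = 5,
                        seed: int = 0) -> Dict[str, List[str]]:
    """Drop inputs of stimuli and subsample inputs of high in-degree nodes."""
    rng = random.Random(seed)  # assumed selection rule: uniform sampling
    stimuli = set(stimuli)
    reduced = {}
    for node, sources in parents.items():
        if node in stimuli:
            reduced[node] = []  # no input edges allowed on stimuli
        elif len(sources) > max_inputs:
            reduced[node] = sorted(rng.sample(sources, max_inputs))
        else:
            reduced[node] = list(sources)
    return reduced


# Hypothetical graph: X has 9 parents, Y has 2, and S is a stimulus.
parents = {
    "X": [f"p{i}" for i in range(1, 10)],
    "Y": ["p1", "p2"],
    "S": ["X"],
}
reduced = make_cno_compatible(parents, stimuli={"S"})
print({node: len(sources) for node, sources in reduced.items()})
```

Any such truncation discards prior knowledge, which is why the original PKN is evaluated separately for the tools that can handle it.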
As in the toy model evaluation, the variability due to the
heuristics implemented in CellNetOptimizer and ZhegAlCal
was accounted for by running the processes three times and
reporting the average rather than absolute scores, together with
the associated standard deviations. The computation times were
averaged over the three runs. The results are shown in Table II.
For the comparative evaluation using the CellNetOptimizer-
compatible PKN, we observed that ZhegAlCal and
CellNetOptimizer produced the best predictive performances.
ZhegAlCal obtained the best score on three of the four cancer
cell lines, whereas CellNetOptimizer got the better score for the
BT20 cell line. Apart from Caspo-ts, whose deterministic
engine guarantees the same output for every run, we also
observed that the relative standard deviation of the scores was
smaller for ZhegAlCal in all cell lines except UACC812.
Notably, the aforementioned results concerned a more complete
network in the case of ZhegAlCal. Indeed, our approach does
not reduce the size of the output network to obtain a better
match with the experimental data (details in Supplementary
Text S3). It contains all the overlapping nodes between the PKN
(truncated or not) and this data, and thus prevents any potential
loss of mechanistic information. Regarding computation times,
both tools performed the calibration task in under two minutes
for all test cases, with slightly shorter times for ZhegAlCal. On
the other hand, the model checking step implemented within
Caspo-ts significantly slowed down its computation, which
consequently exceeded three days on three of the four cell lines.
Only for the BT549 cell line, where the calibration data contains
just four distinct experimental conditions, did the Caspo-ts
calibration run in less than two minutes. The first output
network, however, turned out to be a false positive. We further
model-checked the first 100 output networks for this cell line,
all of which were false positives. The apparent inability of
all the output networks to fully satisfy the calibration
constraints could arise from the numerous approximations
underlying their definition. Indeed, real data are usually subject
to measurement uncertainties and missing information; they are
then biased through preprocessing and binarization and finally
mapped onto a network topology derived from generic and
incomplete pathways. As the other cell lines are subject to the
same limitations, we suppose that all their output networks would
also be false positives. For lack of true-positive solutions, we
decided to evaluate the first returned over-approximated
network for all other tests. The un-model-checked solutions from
Caspo-ts obtained the worst scores, i.e., the lowest percentages
of matching states, on all four cell lines.

TABLE II: NETWORK CALIBRATION EVALUATION RESULTS ON THE HPN-DREAM CHALLENGE DATA. For the three evaluated tools, the average
prediction scores, the corresponding matching states between predicted and actual data in test conditions, the relative standard deviations, and
the corresponding average execution times. Results are shown for the four cancer cell lines and their cumulative average under the label
“TOTAL”, using the CNO-compatible and original PKNs. “n/a” indicates unavailable results, and “> 3d” means the computations were
interrupted after 3 days without completion.

| PKN | Cell line | CNO score | CNO match | CNO rel. dev. | Caspo-ts score | Caspo-ts match | Caspo-ts rel. dev. | ZhegAlCal score | ZhegAlCal match | ZhegAlCal rel. dev. | CNO time (s) | Caspo-ts time (s) | ZhegAlCal time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNO-compatible | BT20 | 8.98 | 68% | 23% | 12.86 | 54% | 0% | 9.61 | 66% | 8% | 113.77 | > 3d | 70.55 |
| CNO-compatible | BT549 | 8.54 | 69% | 14% | 9.87 | 65% | 0% | 8.47 | 70% | 1% | 86.65 | 94.68 | 51.42 |
| CNO-compatible | MCF7 | 3.84 | 73% | 12% | 6.00 | 57% | 0% | 3.56 | 75% | 11% | 83.21 | > 3d | 82.41 |
| CNO-compatible | UACC812 | 9.23 | 67% | 4% | 12.09 | 57% | 0% | 8.88 | 68% | 11% | 99.99 | > 3d | 70.78 |
| CNO-compatible | TOTAL | 8.61 | 69% | 16% | 11.71 | 58% | 9% | 8.52 | 69% | 14% | 95.91 | n/a | 68.79 |
| Original | BT20 | n/a | n/a | n/a | 9.00 | 68% | 0% | 10.40 | 63% | 5% | n/a | > 3d | 40841.00 |
| Original | BT549 | n/a | n/a | n/a | 14.29 | 49% | 0% | 7.67 | 73% | 6% | n/a | > 3d | 3735.18 |
| Original | MCF7 | n/a | n/a | n/a | 7.42 | 47% | 0% | 3.37 | 76% | 5% | n/a | > 3d | 152944.34 |
| Original | UACC812 | n/a | n/a | n/a | 16.77 | 40% | 0% | 10.26 | 63% | 9% | n/a | > 3d | 92340.45 |
| Original | TOTAL | n/a | n/a | n/a | 13.73 | 52% | 21% | 8.77 | 68% | 19% | n/a | > 3d | 72465.24 |
We then moved on to evaluating Caspo-ts and ZhegAlCal on
the original PKN. We were forced to interrupt the model
checking processes within Caspo-ts after three days without
completion for all cell lines. As with the truncated PKN, we
performed the evaluation on the first returned over-
approximated networks. ZhegAlCal, on the other hand,
returned a calibrated network within an average of 20 hours.
The mean prediction scores were similar to the ones obtained
with the CellNetOptimizer-compatible PKN. Individually,
ZhegAlCal performed significantly better on 3 of the 4 cell lines
but was outperformed by Caspo-ts on the BT20 test case.
Despite the slightly varying outputs yielded by our method due
to the heuristics implemented within the RC2 MaxSAT solver,
the standard deviations for each cell line remained below 10%,
as well as below 20% for all computations considered together.
The latter cannot be said of Caspo-ts, whose overall standard
deviation reached 21% despite its non-existent variance on cell-
line-specific computations.
The application of our methodology and its implementation
to the HPN-DREAM breast cancer network inference challenge
showed strong predictive capabilities for real-life applications.
Indeed, when applied to the CellNetOptimizer-compatible
PKN, our method was able to predict, on average, around 70%
of the dynamics of the calibrated nodes within a few minutes.
CellNetOptimizer obtained comparable, if slightly lower,
scores with slightly longer execution times. When dealing with
the full size and complexity of the PKN, ZhegAlCal was the
only solution that successfully produced a calibrated network
within three days of computation on our setup. The average
dynamic prediction remained consistent with that obtained for
the truncated PKN, at 68% of matching states. This
consistency may be due to either the limited impact of the
additional provided information or the challenging nature of
predicting the behavior of the newly added nodes. As for
Caspo-ts, the time-consuming model checking step prevented
the tool from performing most of the complete calibration
processes within three days. Even when Caspo-ts was able, like
CellNetOptimizer and ZhegAlCal, to return a fully
processed network in less than two minutes, for the
truncated-PKN calibration with data from the BT549 cell line, its
first 100 outputs were false-positive solutions. We presume that the
several-day-long calibration processes of the other cell lines for
both PKNs would result in the same outcome. As a proxy, we
evaluated the over-approximated solutions. Regardless of the
PKN, the dynamics prediction scores of these
over-approximated solutions average between
50 and 60%. The other tools work around the inconsistencies in
the calibration constraints caused by the approximations
inherent to experimental data by utilizing an optimization
approach. Specifically, CellNetOptimizer employs a genetic
algorithm that aims to minimize an objective function, and
ZhegAlCal relies on a MaxSAT solver. By defining an initial
candidate solution and trying to improve it, such approaches
guarantee at least the output of a solution. They also leverage
all available constraining information where a purely decision-
based approach such as Caspo-ts either utilizes all the available
information to find a compatible solution, or disregards it
entirely if such a solution does not exist.
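The contrast between a decision-based and an optimization-based treatment of inconsistent constraints can be illustrated with a toy example. The sketch below is a brute-force stand-in written for clarity, not the RC2 MaxSAT solver used by ZhegAlCal nor the ASP machinery of Caspo-ts; the clause encoding (signed integers) and all names are assumptions.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def satisfied(clause: Clause, assignment: Dict[int, bool]) -> bool:
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def brute_force_maxsat(n_vars: int, hard: List[Clause], soft: List[Clause]
                       ) -> Optional[Tuple[int, Dict[int, bool]]]:
    """Satisfy every hard clause and as many soft clauses as possible."""
    best = None
    for values in product([False, True], repeat=n_vars):
        assignment = dict(enumerate(values, start=1))
        if not all(satisfied(c, assignment) for c in hard):
            continue
        n_soft = sum(satisfied(c, assignment) for c in soft)
        if best is None or n_soft > best[0]:
            best = (n_soft, assignment)
    return best


hard = [[1, 2]]           # structural constraint: x1 OR x2
soft = [[1], [-1], [2]]   # data constraints, two of which contradict each other

# Decision view: every constraint is mandatory, so no solution exists.
print(brute_force_maxsat(2, hard + soft, []))
# Optimization view: a solution is returned that satisfies 2 of 3 data constraints.
print(brute_force_maxsat(2, hard, soft))
```

Exhaustive enumeration is only viable for a handful of variables; dedicated MaxSAT solvers reach the same kind of optimum without visiting every assignment.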
In addition to how well the dynamic trajectories are
predicted, the low variability of the scores is an essential
indicator of reproducibility and robustness. ZhegAlCal’s
performance remained stable over multiple runs on the same
test case and when applied to other test cases. This underscores
its robustness and suitability for diverse biological contexts.
The simulated trajectories for the test data also revealed that
despite the size of the network to be calibrated, the trajectories
for the experimental-condition-dependent initial states rarely
exceeded 14 states. Model checking is commonly used to verify
the reachability properties of a system. It proceeds by
exhaustively examining its state space to determine if a state
exists where a given specification holds. It does not require the
definition of a temporal window, unlike the approaches
implemented in CellNetOptimizer and ZhegAlCal. It should be
noted that setting a fixed time scale relies on the Boolean
trajectories being deterministic, i.e., on the update scheme
being synchronous. On the other hand, the checking of
successive reachability properties, as implemented in Caspo-ts,
accommodates asynchronous dynamics. In return, it faces the
state-explosion problem arising from the prohibitively large
size of a system’s state space [55]. Therefore, if the trajectories
are synchronous and remain short, as they are here, then the
additional computation power and time required by a model
checking step can seem unjustified compared to approaches
based on explicitly defined temporal structures.
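Why a fixed temporal window is sufficient under synchronous updates can be seen from a minimal simulator: every state has exactly one successor, so a trajectory is fully determined by its initial state and can simply be unrolled for a bounded number of steps. The sketch below is illustrative only; the rule representation, the stopping criterion, and the three-node example are assumptions.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
UpdateRules = Dict[str, Callable[[State], int]]


def simulate_synchronous(rules: UpdateRules, initial: State,
                         max_steps: int) -> List[State]:
    """Unroll a deterministic trajectory until a stable state or max_steps."""
    trajectory = [dict(initial)]
    for _ in range(max_steps):
        current = trajectory[-1]
        following = dict(current)  # nodes without a rule keep their value
        following.update({node: rule(current) for node, rule in rules.items()})
        if following == current:
            break  # stable state reached
        trajectory.append(following)
    return trajectory


# Hypothetical network: A is a fixed input, B copies A, C = A AND B.
rules = {"B": lambda s: s["A"], "C": lambda s: s["A"] & s["B"]}
trajectory = simulate_synchronous(rules, {"A": 1, "B": 0, "C": 0}, max_steps=10)
for state in trajectory:
    print(state)
```

With asynchronous updates a state may have several successors, so the same unrolling no longer yields a single trajectory and reachability has to be checked over the state space instead.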
IV. CONCLUSIONS
The problem of calibrating a Boolean network, i.e., of
inferring its dynamics in a systematic and automated manner
with a set of longitudinal data, has been around since the 1990s
[56]. Yet, scalable methods that can handle increasingly large
and complex networks and yield exploitable results on a wide
variety of input data are scarce. We proposed an original
approach that converts the function selection problem into an
optimization one. It integrates the translation of a given PKN’s
interaction properties into a set of Zhegalkin polynomials and
the adjustment of their binary coefficients to align with the
provided data through MaxSAT solving.
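To make the role of the binary coefficients concrete, the sketch below converts a truth table into its Zhegalkin polynomial (algebraic normal form) with the standard binary Moebius transform and evaluates the polynomial back. It only illustrates the representation: the bitmask indexing convention and the function names are assumptions, and the way ZhegAlCal encodes these coefficients for the MaxSAT solver is not reproduced here.

```python
from typing import List


def anf_from_truth_table(truth_table: List[int]) -> List[int]:
    """Binary Moebius transform: truth table -> Zhegalkin coefficients.

    Index i encodes an input assignment (or a monomial) as a bitmask:
    bit j of i is the value (or the presence) of variable j.
    """
    coeffs = list(truth_table)
    step = 1
    while step < len(coeffs):
        for mask in range(len(coeffs)):
            if mask & step:
                coeffs[mask] ^= coeffs[mask ^ step]
        step <<= 1
    return coeffs


def evaluate_anf(coeffs: List[int], x: int) -> int:
    """XOR of all monomials with coefficient 1 whose variables are all set in x."""
    value = 0
    for mask, coefficient in enumerate(coeffs):
        if coefficient and (mask & x) == mask:
            value ^= 1
    return value


# f(A, B) = A AND NOT B, with A as bit 0 and B as bit 1.
truth_table = [0, 1, 0, 0]
coeffs = anf_from_truth_table(truth_table)
print(coeffs)  # coefficients of the monomials 1, A, B, AB
```

Here the result encodes A XOR AB, which equals A AND NOT B; each of the four coefficients is one binary unknown, in line with the 2^k coefficients needed for a node with k inputs.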
We implemented our method under the name ZhegAlCal and
evaluated it on increasingly challenging toy models and on a
large-scale network inference challenge with real experimental
data. We compared its performances and outputs to those of two
reference tools in the field, Caspo-ts and CellNetOptimizer. The
toy-model-retrieval evaluation showed the capacity of our
method to recover a network given its influence graph and some
of its trajectories, provided that its size and in-degree per
node are relatively small. For larger networks, given roughly
the same amount of information, none of the compared tools
managed to retrieve the original network, owing to the
overwhelming number of possible solutions. Providing enough
trajectories to cover the
possible states of the network seems computationally
intractable and unrealistic from an experimental point of view,
so we adopted another approach. Rather than aiming at
recovering the network that generated the data, we search for a
network capable of reproducing its dynamics and predicting
them in unseen experimental conditions. Our method was able
to complete this search for four experimental setups from the
HPN-DREAM breast cancer network inference challenge.
Unlike the reference tools, ZhegAlCal correctly predicted the
experimental states of almost 70% of the measured variables
when calibrating a 114-node PKN having up to 9 inputs per
node with data limited to five experimental conditions. It did so
in a matter of minutes to hours, suggesting the combined use of
Zhegalkin polynomials and a complete but efficient MaxSAT
solver offers the right trade-off between optimal solutions and
contained computation times.
To reduce the number of output networks and concurrently
increase their biological relevance, our method could be
further improved by adding constraints such as
canalization. Canalization was first introduced in [57] and has
been shown to govern a majority of biological phenomena [58].
It has already been formalized in terms of Zhegalkin
coefficients [59], which should ease its integration in the
Boolean formula to be MaxSAT-solved within our method.
Nevertheless, such additional constraints would reduce the
solution space and alter the current exhaustiveness of the
search. They should, therefore, be considered carefully, aiming
to strike a balance between biological relevance (in terms of
diversity and robustness) and computational efficiency.
Nonetheless, as it stands, our method offers an efficient
alternative to existing exhaustive calibration tools, with
capabilities that go beyond current network size and complexity
limitations. It opens the door to the analysis and predictive
usage of ever more comprehensive networks, given access to
the relevant interaction information and experimental data.
REFERENCES
[1] A.-L. Barabasi and Z. N. Oltvai, “Network biology: understanding the
cell's functional organization,” Nat. Rev. Genet., vol. 5, no. 2, pp. 101-
113, Feb. 2004, doi: 10.1038/nrg1272.
[2] B. B. Aldridge, J. M. Burke, D. A. Lauffenburger, and P. K. Sorger,
“Physicochemical modelling of cell signaling pathways,” Nat. Cell Biol.,
vol. 8, no. 11, pp. 1195-1203, Nov. 2006, doi: 10.1038/ncb1497.
[3] S. A. Kauffman, “Metabolic Stability and Epigenesis in Randomly
Constructed Genetic Nets,” J. Theoret. Biol., vol. 22, no. 3, pp. 437-467,
Mar. 1969, doi: 10.1016/0022-5193(69)90015-0.
[4] R. Thomas, “Boolean Formalization of Genetic Control Circuits,” J.
Theoret. Biol., vol. 42, no. 3, pp. 563-585, Dec. 1973, doi: 10.1016/0022-
5193(73)90247-6.
[5] A. Saadatpour et al., “Dynamical and Structural Analysis of a T Cell
Survival Network Identifies Novel Candidate Therapeutic Targets for
Large Granular Lymphocyte Leukemia,” PLoS Comput. Biol., vol. 7, no.
11, Nov. 2011, Art. no. e1002267, doi: 10.1371/journal.pcbi.1002267.
[6] B. L. Puniya, R. G. Todd, A. Mohammed, D. M. Brown, M. Barberis, and
T. Helikar, “A Mechanistic Computational Model Reveals That Plasticity
of CD4+ T Cell Differentiation Is a Function of Cytokine Composition
and Dosage,” Front. Physiol., vol. 9, Aug. 2018, Art. no. 878, doi:
10.3389/fphys.2018.00878.
[7] M. Dahlhaus et al., “Boolean modeling identifies Greatwall/MASTL as
an important regulator in the AURKA network of neuroblastoma,”
Cancer Letters, vol. 371, no. 1, pp. 79-89, Feb. 2016, doi:
10.1016/j.canlet.2015.11.025.
[8] B. L. Puniya, R. Moore, A. Mohammed, R. Amin, A. La Fleur, and T.
Helikar, “A comprehensive logic-based model of the human immune
system to study the dynamics responses to mono- and coinfections,”
bioRxiv, Mar. 2020, doi: 10.1101/2020.03.11.988238.
[9] S. S. Aghamiri, V. Singh, A. Naldi, T. Helikar, S. Soliman, and A.
Niarakis, “Automated inference of Boolean models from molecular
interaction maps using CaSQ,” Bioinformatics, vol. 36, no. 16, pp. 4473-
4482, Aug. 2020, doi: 10.1093/bioinformatics/btaa484.
[10] M. Kanehisa and S. Goto, “KEGG: Kyoto Encyclopedia of Genes and Genomes,”
Nucleic Acids Res., vol. 28, no. 1, pp. 27-30, Jan. 2000, doi:
10.1093/nar/28.1.27.
[11] A. R. Pico, T. Kelder, M. P. van Iersel, K. Hanspers, B. R. Conklin, and
C. Evelo, “WikiPathways: Pathway Editing for the People,” PLoS Biol.,
vol. 6, no. 7, Jul. 2008, Art. no. e184, doi: 10.1371/journal.pbio.0060184.
[12] P. Lo Surdo et al., “SIGNOR 3.0, the SIGnaling network Open Resource
3.0: 2022 update,” Nucleic Acids Res., vol. 51, no. D1, pp. D631-D637,
Oct. 2022, doi: 10.1093/nar/gkac883.
[13] M. Milacic et al., “The Reactome Pathway Knowledgebase 2024,”
Nucleic Acids Res., vol. 52, no. D1, pp. D672-D678, Jan. 2024, doi:
10.1093/nar/gkad1025.
[14] A. Oeckinghaus, M. S. Hayden, and S. Ghosh, “Crosstalk in NF-κB
signaling pathways,” Nat. Immunol., vol. 12, no. 8, pp. 695-708, Aug.
2011, doi: 10.1038/ni.2065.
[15] B. A. Hall and A. Niarakis, “Data integration in logic-based models of
biological mechanisms,” Curr. Opin. Syst. Biol., vol. 28, Dec. 2021, Art.
no. 100386, doi: 10.1016/j.coisb.2021.100386.
[16] J. Saez-Rodriguez et al., “Discrete logic modelling as a means to link
protein signalling networks with functional analysis of mammalian signal
transduction,” Mol. Syst. Biol., vol. 5, no. 1, pp. 331, Jan. 2009, doi:
10.1038/msb.2009.87.
[17] J. Fisher, N. Piterman, and R. Bodik, “Toward Synthesizing Executable
Models in Biology,” Front. Bioeng. Biotechnol., vol. 2, Dec. 2014, Art.
no. 75, doi: 10.3389/fbioe.2014.00075.
[18] S. S. Aghamiri and F. Delaplace, “TaBooN Boolean Network Synthesis
Based on Tabu Search,” IEEE/ACM Trans. Comput. Biol. and Bioinf., vol.
19, no. 4, pp. 2499-2511, Jul. 2022, doi: 10.1109/TCBB.2021.3063817.
[19] R. Bonneau, “Learning biological networks: from modules to dynamics,”
Nat. Chem. Biol., vol. 4, no. 11, pp. 658-664, Nov. 2008, doi:
10.1038/nchembio.122.
[20] A. A. Margolin et al., “ARACNE: An Algorithm for the Reconstruction
of Gene Regulatory Networks in a Mammalian Cellular Context,” BMC
Bioinformatics, vol. 7, no. S1, pp. S7, Mar. 2006, doi: 10.1186/1471-
2105-7-S1-S7.
[21] V. A. Huynh-Thu, A. Irrthum, L. Wehenkel, and P. Geurts, “Inferring
Regulatory Networks from Expression Data Using Tree-Based Methods,”
PLoS ONE, vol. 5, no. 9, Sep. 2010, Art. no. e12776, doi:
10.1371/journal.pone.0012776.
[22] A.-C. Haury, F. Mordelet, P. Vera-Licona, and J.-P. Vert, “TIGRESS:
Trustful Inference of Gene REgulation using Stability Selection,” BMC
Syst. Biol., vol. 6, Dec. 2012, Art. no. 145, doi: 10.1186/1752-0509-6-145.
[23] X. Zhang, H. Han, and W. Zhang, “Identification of Boolean Networks
Using Premined Network Topology Information,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 28, no. 2, pp. 464-469, Feb. 2017, doi:
10.1109/TNNLS.2016.2514841.
[24] X. Liu, Y. Wang, N. Shi, Z. Ji, and S. He, “GAPORE: Boolean network
inference using a genetic algorithm with novel polynomial representation
and encoding scheme,” Knowledge-Based Systems, vol. 228, Sep. 2021,
Art. no. 107277, doi: 10.1016/j.knosys.2021.107277.
[25] A. MacNamara, “CNORdt: Add-on to CellNOptR: Discretized time
treatments,” R package version 1.44.0, Oct. 2023. [Online].
doi:10.18129/B9.bioc.CNORdt.
[26] M. Ostrowski, L. Paulevé, T. Schaub, A. Siegel, and C. Guziolowski,
“Boolean network identification from perturbation time series data
combining dynamics abstraction and logic programming,” Biosystems,
vol. 149, pp. 139-153, Nov. 2016, doi: 10.1016/j.biosystems.2016.07.009.
[27] B. Yordanov, S.-J. Dunn, H. Kugler, A. Smith, G. Martello, and S.
Emmott, “A method to identify and analyze biological programs through
automated reasoning,” npj Syst. Biol. Appl., vol. 2, no. 1, Jul. 2016, Art.
no. 16010, doi: 10.1038/npjsba.2016.10.
[28] J. Goldfeder and H. Kugler, “BRE:IN - A Backend for Reasoning About
Interaction Networks with Temporal Logic,” in L. Bortolussi and G.
Sanguinetti, eds., Computational Methods in Systems Biology. CMSB
2019. Lecture Notes in Computer Science, Cham, Switzerland: Springer,
2019, doi: 10.1007/978-3-030-31304-3_15.
[29] P. Vera-Licona, A. Jarrah, L. Garcia-Puente, J. McGee, and R.
Laubenbacher, “An algebra-based method for inferring gene regulatory
networks,” BMC Syst. Biol., vol. 8, no. 1, 2014, Art. no. 37, doi:
10.1186/1752-0509-8-37.
[30] H.-C. Trinh and Y.-K. Kwon, “A novel constrained genetic algorithm-
based Boolean network inference method from steady-state gene
expression data,” Bioinformatics, vol. 37, no. S1, pp. i383-i391, Aug.
2021, doi: 10.1093/bioinformatics/btab295.
[31] S. M. Hill et al., “Inferring causal molecular networks: empirical
assessment through a community-based effort,” Nat. Methods, vol. 13, no.
4, pp. 310-318, Apr. 2016, doi: 10.1038/nmeth.3773.
[32] M. Noual and S. Sené, “Synchronism versus asynchronism in monotonic
Boolean automata networks,” Nat. Comput., vol. 17, pp. 393-402, Jan.
2017, doi: 10.1007/s11047-016-9608-8.
[33] I. Zhegalkin, “On the technique of calculating propositions in symbolic
logic,” Matematicheskii Sbornik, vol. 34, no. 1, pp. 9-28, 1927. [Online].
Available: http://mi.mathnet.ru/msb7433.
[34] B. Rittaud and A. Heefer, “The Pigeonhole Principle, Two Centuries
Before Dirichlet,” Math. Intelligencer, vol. 36, pp. 27-29, Aug. 2013, doi:
10.1007/s00283-013-9389-1.
[35] J. D. Schwab, S. D. Kuhlwein, N. Ikonomi, M. Kuhl, and H. A. Kestler,
“Concepts in Boolean network modeling: What do they all mean?,”
Comput. Struct. Biotechnol. J., vol. 18, pp. 571-582, 2020, doi:
10.1016/j.csbj.2020.03.001.
[36] W. V. Quine, “A Way to Simplify Truth Functions,” Am. Math. Mon.,
vol. 62, no. 9, pp. 627-631, 1955, doi: 10.1080/00029890.1955.11988710.
[37] E. J. McCluskey, “Minimization of Boolean functions,” Bell Labs Tech.
J., vol. 35, no. 6, pp. 1417-1444, Nov. 1956, doi: 10.1002/j.1538-
7305.1956.tb03835.x.
[38] G. S. Tseitin, “On the Complexity of Derivation in Propositional
Calculus,” in J. H. Siekmann and G. Wrightson, eds., Automation of
Reasoning. Symbolic Computation, Berlin, Germany: Springer, 1983, doi:
10.1007/978-3-642-81955-1_28.
[39] A. Ignatiev, A. Morgado, and J. Marques-Silva, “RC2: an Efficient
MaxSAT Solver,” SAT, vol. 11, no. 1, pp. 53-64, Sep. 2019, doi:
10.3233/SAT190116.
[40] S. Kochemazov, V. Kondratiev, and I. Gribanova, “Empirical Analysis of
the RC2 MaxSAT Algorithm,” in MIPRO, Opatija, Croatia, 2023, pp.
1027-1032, doi: 10.23919/MIPRO57284.2023.10159820.
[41] C. Ansótegui, M. L. Bonet, and J. Levy, “SAT-based MaxSAT
algorithms,” Artificial Intelligence, vol. 196, pp. 77-105, Mar. 2013, doi:
10.1016/j.artint.2013.01.002.
[42] R. Martins, M. Jarvisalo, and F. Bacchus, “MaxSAT Evaluation 2018,”
SAT, Oxford, UK, 2018. [Online]. Available: https://maxsat-
evaluations.github.io/2018/.
[43] R. Martins, M. Jarvisalo, and F. Bacchus, “MaxSAT Evaluation 2019,”
SAT, Lisbon, Portugal, 2019. [Online]. Available: https://maxsat-
evaluations.github.io/2019/.
[44] J. Ernst, G. J. Nau, and Z. Bar-Joseph, “Clustering short time series gene
expression data,” Bioinformatics, vol. 21, no. S1, pp. i159-i168, Jun.
2005, doi: 10.1093/bioinformatics/bti1022.
[45] B. Venn, T. Leifeld, P. Zhang, and T. Muhlhaus, “Temporal classification
of short time series data,” BMC Bioinformatics, vol. 25, Jan. 2024, Art.
no. 30, doi: 10.1186/s12859-024-05636-6.
[46] R. W. Hamming, “Error detecting and error correcting codes,” Bell Labs
Tech. J., vol. 29, no. 2, pp. 147-160, Apr. 1950, doi: 10.1002/j.1538-
7305.1950.tb00463.x.
[47] D. Szklarczyk et al., “The STRING database in 2021: customizable
protein-protein networks, and functional characterization of user-
uploaded gene/measurement sets,” Nucleic Acids Res., vol. 49, no. D1,
pp. D605-D612, Jan. 2021, doi: 10.1093/nar/gkaa1074.
[48] C. Chaouiya et al., “SBML qualitative models: a model representation
format and infrastructure to foster interactions between qualitative
modelling formalisms and tools,” BMC Syst. Biol., vol. 7, no. 1, 2013, Art.
no. 135, doi: 10.1186/1752-0509-7-135.
[49] C. Kadelka, T.-M. Butrie, E. Hilton, J. Kinseth, A. Schmidt, and H.
Serdarevic, “A meta-analysis of Boolean network models reveals design
principles of gene regulatory networks,” Sci. Adv., vol. 10, no. 2, Jan.
2024, Art. no. eadj0822, doi: 10.1126/sciadv.adj0822.
[50] L. Heiser, “HPN-DREAM breast cancer network inference
challenge,” Synapse, 2016, doi: 10.7303/syn1720047.
[51] E. Gjerga et al., “Converting networks to predictive logic models from
perturbation signalling data with CellNOpt,” Bioinformatics, vol. 36, no.
16, pp. 4523-4524, Aug. 2020, doi: 10.1093/bioinformatics/btaa561.
[52] C. Baier and J.-P. Katoen, Principles of Model Checking, Cambridge,
MA, USA: MIT Press, 2008.
[53] A. Cimatti et al., “NuSMV Version 2: An OpenSource Tool for Symbolic
Model Checking,” in CAV, Copenhagen, Denmark, Jul. 2002, pp. 359-
364, doi: 10.1007/3-540-45657-0_29.
[54] M. Razzaq, L. Paulevé, A. Siegel, J. Saez-Rodriguez, J. Bourdon, and C.
Guziolowski, “ PLoS Comput. Biol., vol. 14, no. 10, Oct. 2018, Art. no.
e1006538, doi: 10.1371/journal.pcbi.1006538.
[55] C. Daws and S. Tripakis, “Model checking of real-time reachability
properties using abstractions,” in B. Steffen, ed., Tools and Algorithms for
the Construction and Analysis of Systems, Berlin, Germany: Springer,
1998, pp. 313-329, doi: 10.1007/BFb0054180.
[56] S. Liang, S. Fuhrman, and R. Somogyi, “Reveal, a general reverse
engineering algorithm for inference of genetic network architectures,”
Pac. Symp. Biocomput., vol. 3, pp. 18-29, 1998. [Online]. Available:
https://psb.stanford.edu/psb-online/proceedings/psb98/liang.pdf.
[57] C. H. Waddington, “Canalization of Development and the
Inheritance of Acquired Characters,” Nature, vol. 150,
pp. 563-565, Nov. 1942, doi: 10.1038/150563a0.
[58] S. E. Harris, B. K. Sawhill, A. Wuensche, and S. Kauffman, “A model of
transcriptional regulatory networks based on biases in the observed
regulation rules,” Complexity, vol. 7, no. 4, pp. 23-40, Mar. 2002, doi:
10.1002/cplx.10022.
[59] S. Faisal, G. Lichtenberg, S. Trump, and S. Attinger, “Structural
properties of continuous representations of Boolean functions for gene
network modelling,” Automatica, vol. 46, no. 12, pp. 2047-2052, Dec.
2010, doi: 10.1016/j.automatica.2010.09.001.
Vincent Deman received a Master’s degree in computational
modeling and simulation from ENSTA Paris – Institut
Polytechnique de Paris in 2021. He is pursuing a PhD in
computational systems biology with the Université Paris
Saclay in collaboration with the life sciences branch of
Dassault Systèmes, BIOVIA.
Marine Ciantar holds a double Master’s degree in Biology
and Bio/Chemoinformatics (ISDD) and a doctorate in
molecular modeling and simulation of materials chemistry.
Passionate about scientific culture, her work at BIOVIA is a
testament to the power of interdisciplinary collaboration in
shaping our understanding of the natural world.
Laurent Naudin received a Master’s degree in
Bioinformatics from UVSQ. With 25 years of experience in
the pharmaceutical industry, applying bioinformatics in
discovery projects and coordinating research projects in
oncology, he is now a product manager for software dedicated
to biologists at BIOVIA, mainly focusing on systems biology.
Philippe Castera graduated from Ecole Centrale Paris and
completed a PhD in Plasma Physics at CentraleSupélec and
ONERA. After 5 years of working on the modeling of
complex systems using Graph and Category Theory, he joined
BIOVIA in 2020 to supervise the development of all graph-
related algorithms for Systems Biology and Retrosynthesis.
Anne-Sophie Beignon holds a PhD degree in cellular and
molecular biology from the Université Louis Pasteur
(Strasbourg, France). She is a vaccine immunologist, CNRS
researcher, and group leader at U1184 IMVA-HB/IDMIT, a
CEA, INSERM, and Université Paris Saclay lab.