Learning the Parameters of
Probabilistic Logic Programs from
Interpretations
Bernd Gutmann
Ingo Thon
Luc De Raedt *
Report CW 584, June 15, 2010
Department of Computer Science, K.U.Leuven
Abstract
ProbLog is a recently introduced probabilistic extension of the logic programming language Prolog, in which facts can be annotated with the probability that they hold. The advantage of this probabilistic language is that it naturally expresses a generative process over interpretations using a declarative model. Interpretations are relational descriptions or possible worlds. In this paper, a novel parameter estimation algorithm CoPrEM for learning ProbLog programs from partial interpretations is introduced. The algorithm is essentially a Soft-EM algorithm that computes binary decision diagrams for each interpretation, allowing a dynamic programming approach to be implemented. The CoPrEM algorithm has been experimentally evaluated on a number of data sets, which justify the approach and show its effectiveness.
* firstname.lastname@cs.kuleuven.be
Learning the Parameters of Probabilistic Logic
Programs from Interpretations
Bernd Gutmann, Ingo Thon, and Luc De Raedt
Department of Computer Science, Katholieke Universiteit Leuven
Celestijnenlaan 200A, PO Box 2402, 3001 Heverlee, Belgium
{firstname.lastname}@cs.kuleuven.be
1 Introduction
Statistical relational learning [9] and probabilistic logic learning [3, 5] have contributed various representations and learning schemes. Popular approaches include BLPs [13], Markov Logic [17], PRISM [20], PRMs [8], and ProbLog [4, 10]. These approaches differ not only in the underlying representations but also in the learning setting that is employed. Indeed, for learning knowledge-based model construction (KBMC) approaches, such as Markov Logic, PRMs, and BLPs, one has typically used possible worlds (that is, Herbrand interpretations or relational state descriptions) as training examples. For probabilistic programming languages, especially the probabilistic logic programming (PLP) approaches based on Sato's distribution semantics [19] such as PRISM and ProbLog, training examples are typically provided in the form of labeled facts. The labels are either the truth values of these facts or target probabilities. In computational learning theory as well as in logical and relational learning [5], the former setting is known as learning from interpretations and the latter as learning from entailment. The differences between these two settings have been well studied and characterized in the literature, and they have also been used to explain the differences between various statistical relational learning models [5, 6].
The differences between the two settings are akin to those between learning a probabilistic grammar and learning a graphical model (e.g., a Bayesian network). When learning the parameters of probabilistic grammars, the examples are typically sentences sampled from the grammar, and when learning Bayesian networks, the examples are possible worlds (that is, state descriptions). In the former setting, which corresponds to learning from entailment, one usually starts from observations for a single target predicate (or non-terminal), whereas in the latter setting, which corresponds to learning from interpretations, the observations may specify the value of all random variables in a state description.
These differences in learning settings also explain why the KBMC and PLP approaches have been applied to different kinds of data sets and applications. Indeed, for PLP, one has, for instance, learned grammars; for KBMC, one has learned models for entity resolution and link prediction. This paper aims to contribute to bridging the gap between these two types of learning approaches by studying how the parameters of ProbLog programs can be learned from partial interpretations. The key contribution of the paper is the introduction of a novel algorithm called CoPrEM for learning ProbLog programs from interpretations. The name is due to the main steps of the algorithm: Completion, Propagation, and an Expectation-Maximization procedure. It will also be shown that the resulting algorithm applies to the usual type of problems that have made MLNs and PRMs so popular.
The paper is organized as follows: In Section 2, we review logic programming
concepts as well as the probabilistic programming language ProbLog. Section 3
formalizes the problem of learning the parameters of ProbLog programs from
interpretations. Section 4 introduces CoPrEM for finding the maximum likelihood parameters. We report on some experiments in Section 5, and discuss
related work in Section 6 before concluding.
2 Probabilistic Logic Programming Concepts
Before reviewing the main concepts of the probabilistic programming language
ProbLog, we give the necessary terminology from the field of logic programming.
An atom is an expression of the form q(t_1, . . . , t_k) where q is a predicate of arity k and the t_i are terms. A term is either a variable, a constant, or a functor applied to terms. Definite clauses are universally quantified expressions of the form h :- b_1, . . . , b_n where h and the b_i are atoms. A fact is a clause whose body is empty. A substitution θ is an expression of the form {V_1/t_1, . . . , V_m/t_m} where the V_i are different variables and the t_i are terms. Applying a substitution θ to an expression e yields the instantiated expression in which all variables V_i in e have been simultaneously replaced by their corresponding terms t_i in θ. An expression is called ground if it does not contain variables. The semantics of a set of definite clauses is given by its least Herbrand model, that is, the set of all ground facts entailed by the theory (cf. [7] for more details). Prolog uses SLD-resolution as its standard proof procedure to answer queries.
A ProbLog theory (or program) T consists of a set of labeled facts F and a set of definite clauses BK that express the background knowledge. The facts p_n :: f_n in F are annotated with the probability p_n that f_nθ is true for all substitutions θ grounding f_n. The resulting facts f_nθ are called atomic choices [16] and represent the elementary random events; they are assumed to be mutually independent. Each non-ground probabilistic fact represents a kind of template for random variables. Given a finite¹ number of possible substitutions {θ_{n,1}, . . . , θ_{n,K_n}} for each probabilistic fact p_n :: f_n, a ProbLog program T = {p_1 :: f_1, . . . , p_N :: f_N} ∪ BK defines a probability distribution over total choices L (the random events), where L ⊆ L_T = {f_1θ_{1,1}, . . . , f_1θ_{1,K_1}, . . . , f_Nθ_{N,1}, . . . , f_Nθ_{N,K_N}}:

P(L \mid T) = \prod_{f_n\theta_{n,k} \in L} p_n \; \prod_{f_n\theta_{n,k} \in L_T \setminus L} (1 - p_n).
Example 1. The following ProbLog theory states that there is a burglary with probability 0.1, an earthquake with probability 0.2, and if either of them occurs the alarm will go off. If the alarm goes off, a person X will be notified and will therefore call with the probability of al(X), that is, 0.7. Figure 1 shows the Bayesian network that corresponds to this theory.

F = {0.1 :: burglary, 0.2 :: earthquake, 0.7 :: al(X)}
BK = {person(mary). , person(john). ,
      alarm :- burglary. , alarm :- earthquake. ,
      calls(X) :- person(X), alarm, al(X).}
The set of atomic choices in this program is {al(mary), al(john), burglary, earthquake} and each total choice is a subset of this set. Each total choice L can be combined with the background knowledge BK into a Prolog program. Thus, the probability distribution at the level of atomic choices also induces a probability distribution over possible definite clause programs of the form L ∪ BK. Furthermore, each such program has a unique least Herbrand interpretation, which is the set of all the ground facts that are logically entailed by the program. This Herbrand interpretation represents a possible world.

¹ Throughout the paper, we shall assume that F is finite; see [19] for the infinite case.
Fig. 1. The Bayesian network (left) equivalent to Example 1, where the evidence atoms are {person(mary), person(john), alarm, ¬calls(john)}. An equivalent relational representation (right), where each node corresponds to an atom, box nodes represent logical facts, dashed nodes represent probabilistic facts, and solid elliptic nodes represent derived predicates.
For instance, for the total choice {burglary} we obtain the interpretation I = {burglary, alarm, person(john), person(mary)}. Thus, the probability distribution at the level of total choices also induces a probability distribution at the level of possible worlds. The probability P_w(I) of this interpretation is 0.1 × (1 - 0.2) × (1 - 0.7)². We define the success probability of a query q as

P_s(q \mid T) = \sum_{L \subseteq L_T,\, L \cup BK \models q} P(L \mid T) = \sum_{L \subseteq L_T} \delta(q, BK \cup L) \cdot P(L \mid T)   (1)

where δ(q, BK ∪ L) = 1 if there exists a θ such that BK ∪ L |= qθ, and 0 otherwise. It can be shown that the success probability corresponds to the probability that the query succeeds in a randomly selected possible world (according to P_w). The success probability P_s(calls(john) | T) in Example 1 can be calculated as

P(al(john) ∧ (burglary ∨ earthquake) | T)
  = P(al(john) ∧ burglary | T) + P(al(john) ∧ earthquake | T) - P(al(john) ∧ burglary ∧ earthquake | T)
  = 0.7 · 0.1 + 0.7 · 0.2 - 0.7 · 0.1 · 0.2 = 0.196
Observe that ProbLog programs do not represent a generative model at the level of the individual facts or predicates. Indeed, it is not the case that the sum of the probabilities of the facts for a given predicate (here calls/1) must equal 1:

P_s(calls(X) | T) ≠ P_s(calls(john) | T) + P_s(calls(mary) | T) ≠ 1.

So, the predicates do not encode a probability distribution over their instances. This differs from probabilistic grammars and their extensions, such as stochastic logic programs [2], where each predicate or non-terminal defines a probability distribution over its instances, which enables these approaches to sample instances from a specific target predicate. Such approaches realize a generative process at the level of individual predicates. Samples taken from a single predicate can then be used as examples for learning the probability distribution governing the predicate. This setting is known in the literature as learning from entailment [5, 6]. Sato and Kameya's well-known learning algorithm for PRISM [20] also assumes that there is a generative process at the level of a single predicate and is therefore not applicable to learning from interpretations.
On the other hand, while the ProbLog semantics does not encode a generative process at the level of individual predicates, it does encode a generative process at the level of interpretations. This process has been described above; it basically follows from the fact that each total choice generates a unique possible world through its least Herbrand interpretation. Therefore, it is much more natural to learn from interpretations in ProbLog; this is akin to typical KBMC approaches. The key contribution of this paper is the introduction of a novel learning algorithm for learning from interpretations in ProbLog. It can learn both from fully and from partially observable interpretations.
A partial interpretation I specifies the truth value for some (but not all) atoms. We represent partial interpretations as I = (I+, I-) where I+ contains all true atoms and I- all false atoms. The probability of a partial interpretation is the sum of the probabilities of all possible worlds consistent with the known atoms. This is the success probability of the query (⋀_{a_j ∈ I+} a_j) ∧ (⋀_{a_j ∈ I-} ¬a_j). Considering Example 1, the probability of the following partial interpretation

I+ = {person(mary), person(john), burglary, alarm, al(john), calls(john)}
I- = {calls(mary), al(mary)}

is P_w((I+, I-)) = 0.1 × 0.7 × (1 - 0.7) × (0.2 + (1 - 0.2)), as there are two total choices that result in this partial interpretation.
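To make the distribution semantics concrete, the following small sketch (an illustration only, not part of CoPrEM) enumerates all total choices of the alarm theory of Example 1 by brute force, computes each least Herbrand model with a hand-written forward-chaining routine, and recovers both the success probability P_s(calls(john) | T) = 0.196 and the probability of the partial interpretation above. All function names and the data layout are ad hoc choices for this example.

from itertools import product

# Probabilistic facts of Example 1 (ground instances of the templates).
facts = {"burglary": 0.1, "earthquake": 0.2, "al(mary)": 0.7, "al(john)": 0.7}

def least_herbrand_model(choice):
    """Forward chaining for the (ground) background knowledge of Example 1."""
    world = set(choice) | {"person(mary)", "person(john)"}
    changed = True
    while changed:
        changed = False
        if ("burglary" in world or "earthquake" in world) and "alarm" not in world:
            world.add("alarm"); changed = True
        for x in ("mary", "john"):
            head = f"calls({x})"
            if {f"person({x})", "alarm", f"al({x})"} <= world and head not in world:
                world.add(head); changed = True
    return world

def world_distribution():
    """Yield (possible world, probability) for every total choice."""
    names = list(facts)
    for values in product([True, False], repeat=len(names)):
        choice = {n for n, v in zip(names, values) if v}
        prob = 1.0
        for n, v in zip(names, values):
            prob *= facts[n] if v else 1.0 - facts[n]
        yield least_herbrand_model(choice), prob

# Success probability of calls(john): sum over worlds entailing the query.
p_calls = sum(p for w, p in world_distribution() if "calls(john)" in w)
print(round(p_calls, 3))  # 0.196

# Probability of the partial interpretation (I+, I-) from the text.
I_plus = {"person(mary)", "person(john)", "burglary", "alarm", "al(john)", "calls(john)"}
I_minus = {"calls(mary)", "al(mary)"}
p_partial = sum(p for w, p in world_distribution()
                if I_plus <= w and not (I_minus & w))
print(round(p_partial, 3))  # 0.1 * 0.7 * 0.3 * 1.0 = 0.021

Such enumeration is exponential in the number of atomic choices, which is exactly why the algorithm developed below works on binary decision diagrams instead.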
3 Learning from Interpretations
Learning from (possibly partial) interpretations is a common setting in statistical relational learning that has not yet been studied in its full generality for probabilistic programming languages such as ProbLog, ICL, and PRISM. The setting can be formalized as follows:
Definition 1 (Max-Likelihood Parameter Estimation). Given a ProbLog program T(p) containing the probabilistic facts F with unknown parameters p = ⟨p_1, . . . , p_N⟩ and background knowledge BK, and a set of (possibly partial) interpretations D = {I_1, . . . , I_M}, the training examples, find the maximum likelihood probabilities p̂ = ⟨p̂_1, . . . , p̂_N⟩ such that

\hat{p} = \arg\max_{p} P(D \mid T(p)) = \arg\max_{p} \prod_{m=1}^{M} P_w(I_m \mid T(p))
Thus, we are given a ProbLog program and a set of partial interpretations, and the goal is to find the maximum likelihood parameters. Finding the maximum likelihood parameters p̂ can be done using a Soft-EM algorithm. There are two cases to consider when developing the algorithm: in the first, the interpretations are complete and everything is observable; in the second, partial interpretations are allowed and there is only partial observability.
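A minimal sketch of that Soft-EM loop is given below, assuming two hypothetical helpers: expected_count, standing in for the BDD-based computation of E[δ_{n,k}^m | I_m] summed over the groundings of a fact in one example (Section 4), and num_groundings, returning the example-dependent domain size K_n^m (Section 3.1).

def soft_em(facts, interpretations, expected_count, num_groundings,
            iterations=50):
    """Generic Soft-EM loop for ProbLog parameter learning (illustrative only).

    facts            -- dict mapping fact name -> initial probability p_n
    interpretations  -- list of (possibly partial) training interpretations
    expected_count   -- callable (fact, interpretation, probs) -> E[delta | I]
                        summed over the groundings of the fact in that example
    num_groundings   -- callable (fact, interpretation) -> K_n^m
    """
    probs = dict(facts)
    for _ in range(iterations):
        new_probs = {}
        for fact in probs:
            total = 0.0
            for interp in interpretations:
                k = num_groundings(fact, interp)
                if k > 0:
                    # E-step: expected fraction of true groundings in this example.
                    total += expected_count(fact, interp, probs) / k
            # M-step: average the expected fractions over all examples.
            new_probs[fact] = total / len(interpretations)
        probs = new_probs
    return probs

With complete interpretations, expected_count degenerates to counting true groundings, and a single iteration already yields the closed-form estimate derived next.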
3.1 Full Observability
It is clear that in the fully observable case the maximum likelihood estimators p̂_n for the probabilistic facts p_n :: f_n can be obtained by simply counting the number of true ground instances in every interpretation:

\hat{p}_n = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{K_n^m} \sum_{k=1}^{K_n^m} \delta_{n,k}^m
\qquad \text{where} \qquad
\delta_{n,k}^m := \begin{cases} 1 & \text{if } f_n\theta_{n,k}^m \in I_m \\ 0 & \text{else} \end{cases}
and θ_{n,k}^m is the k-th possible ground substitution for the fact f_n in the interpretation I_m, and K_n^m is the number of such substitutions. Before going to the partially observable case, let us consider the issue of determining the possible substitutions θ_{n,k}^m for a fact p_n :: f_n and an interpretation I_m. To resolve this, we essentially assume² that the facts f_n are typed, and that each interpretation I_m contains an explicit definition of the different types in the form of a unary predicate (that is fully observable). In the alarm example, the predicate person/1 can be regarded as the type of the (first) argument of al(X) and calls(X). This predicate can differ between interpretations. One person can have john and mary as neighbors, another one ann, bob, and eve. The maximum likelihood estimator accounts for that by allowing the domain sizes K_n^m to be example-dependent. The use of such types is standard practice in statistical relational learning and inductive logic programming.
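Under full observability the estimator is just this counting scheme. The sketch below assumes interpretations are given as sets of true ground atoms and that the typed domains are available as callables producing the ground instances of each fact template per example; both the layout and all names are illustrative.

def ml_estimate_full(groundings, interpretations):
    """Closed-form ML estimate under full observability.

    groundings      -- dict: fact template -> callable(interp) returning the
                       list of its ground instances in that interpretation
    interpretations -- list of sets of true ground atoms (complete worlds)
    """
    estimates = {}
    m_total = len(interpretations)
    for fact, ground in groundings.items():
        acc = 0.0
        for interp in interpretations:
            instances = ground(interp)          # the K_n^m ground substitutions
            if instances:
                acc += sum(g in interp for g in instances) / len(instances)
        estimates[fact] = acc / m_total
    return estimates

# Example with the alarm domain: two complete interpretations.
worlds = [
    {"person(john)", "person(mary)", "burglary", "alarm", "al(john)", "calls(john)"},
    {"person(john)", "person(mary)"},
]
groundings = {
    "burglary":   lambda w: ["burglary"],
    "earthquake": lambda w: ["earthquake"],
    "al(X)":      lambda w: [f"al({p[7:-1]})" for p in w if p.startswith("person(")],
}
print(ml_estimate_full(groundings, worlds))
# {'burglary': 0.5, 'earthquake': 0.0, 'al(X)': 0.25}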
3.2 Partial Observability
In many applications the training examples are only partially observed. In the alarm example we may receive a phone call but we may not know whether an earthquake has in fact occurred. In the partially observable case, similar to Bayesian networks, a closed-form solution for the maximum likelihood parameters is infeasible. Instead, one can replace δ_{n,k}^m in the previous formula by its conditional expectation given the interpretation under the current model, E[δ_{n,k}^m | I_m], yielding:

\hat{p}_n = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{K_n^m} \sum_{k=1}^{K_n^m} E[\delta_{n,k}^m \mid I_m]
As in the fully observable case, the domains are assumed to be given. Before describing the Soft-EM algorithm for finding p̂_n, we illustrate a crucial property using the alarm example. Assume that our partial interpretation is I+ = {person(mary), person(john), alarm} and I- = ∅. It is clear that for calculating the marginal probability of all probabilistic facts (these are the expected counts), only the atoms in {burglary, earthquake, al(john), al(mary), alarm} ∪ I+ are relevant. This is due to the fact that the remaining atoms {calls(john), calls(mary)} cannot be used in any proof for the facts observed in the interpretations. Therefore, they do not influence the probability of the partial interpretation³. This motivates the following definition.
Definition 2. Let T = F ∪ BK be a ProbLog theory and x a ground atom; then the dependency set of x is defined as:

dep_T(x) := {f ground fact | a ground SLD-proof in T for x contains f}.

Thus, dep_T(x) contains all ground atoms that appear in a possible proof of the atom x. This can be generalized to partial interpretations:

Definition 3. Let T = F ∪ BK be a ProbLog theory and I = (I+, I-) a partial interpretation; then the dependency set of the partial interpretation I is defined as dep_T(I) := ⋃_{x ∈ (I+ ∪ I-)} dep_T(x).

Before showing that it is sufficient to restrict the calculation of the probability to the dependent atoms, we need the notion of a restricted ProbLog theory.

Definition 4. Let T = F ∪ BK be a ProbLog theory and I = (I+, I-) a partial interpretation. Then we define T_r(I) = F_r(I) ∪ BK_r(I), the interpretation-restricted ProbLog theory, as follows. F_r(I) = L_T ∩ dep_T(I), and BK_r(I) is obtained by computing all ground instances of clauses in BK in which all atoms appear in dep_T(I).
² This assumption can be relaxed if the types are computable from the ProbLog program and the current interpretation.
³ Such atoms play a role similar to that of barren nodes in Bayesian networks [12].
For instance, for the partial interpretation I = ({burglary, alarm}, ∅), the restricted background theory BK_r(I) contains the single clause alarm :- burglary, and F_r(I) contains only burglary.

Theorem 1. For all ground probabilistic facts f_n and partial interpretations I:

P(f_n \mid I, T) = \begin{cases} P(f_n \mid I, T_r(I)) & \text{if } f_n \in dep_T(I) \\ p_n & \text{otherwise} \end{cases}

where T_r(I) is the interpretation-restricted ProbLog theory of T.

This theorem shows that the conditional probability of f_n given I calculated in the theory T is equivalent to the probability calculated in T_r(I). The probabilistic facts of T_r(I) are restricted to the atomic choices occurring in dep_T(I). Working with T_r(I) is often more efficient than working with T.

Proof. This can be proven using the independence of the probabilistic facts in ProbLog.
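For a ground program, dep_T can be computed by a simple backward reachability pass over the ground clauses, as in the sketch below, which reproduces the dependency computation of Fig. 2 for the ground alarm program. This is only a propositional illustration of Definitions 2-4; the actual implementation uses a tabled Prolog meta-interpreter (see Section 4.1).

def dependency_set(query_atoms, clauses):
    """Ground atoms that may appear in an SLD-proof of any query atom.

    clauses -- list of (head, [body atoms]) for the ground background theory
    """
    dep, stack = set(), list(query_atoms)
    while stack:
        atom = stack.pop()
        if atom in dep:
            continue
        dep.add(atom)
        for head, body in clauses:
            if head == atom:
                stack.extend(b for b in body if b not in dep)
    return dep

clauses = [("person(john)", []),
           ("alarm", ["burglary"]),
           ("alarm", ["earthquake"]),
           ("calls(john)", ["person(john)", "alarm", "al(john)"])]
p_facts = {"burglary", "earthquake", "al(john)"}

print(dependency_set({"alarm"}, clauses))
# {'alarm', 'burglary', 'earthquake'}
dep = dependency_set({"alarm", "calls(john)"}, clauses)   # evidence of Fig. 2
print(p_facts & dep)                                      # F_r(I)
print([c for c in clauses if {c[0], *c[1]} <= dep])       # BK_r(I)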
4 The CoPrEM Algorithm
We now develop the Soft-EM algorithm for finding the maximum likelihood parameters p̂. The algorithm starts by constructing a Binary Decision Diagram (BDD) for every training example I_m (cf. Sect. 4.1), which is then used to compute the expected counts E[δ_{n,k}^m | I_m] (cf. Sect. 4.2).
Before giving the details on how to construct and to evaluate BDDs, we review the relevant concepts. A BDD is a compact graph representation of a Boolean formula; an example is shown in Fig. 2. In our case, the Boolean formula (or, equivalently, the BDD) represents the conditions under which the partial interpretation will be generated by the ProbLog program, and the variables in the formula are the ground atoms in dep_T(I_m). Basically, any truth assignment to these facts that satisfies the Boolean formula (or the BDD) will result in the partial interpretation. Given a fixed variable ordering, a Boolean function f can be represented as a full Boolean decision tree, where each node N on the i-th level is labeled with the i-th variable and has two children called low l(N) and high h(N). Each path from the root to some leaf stands for one complete variable assignment. If variable x is assigned 0 (1), the branch to the low (high) child is taken. Each leaf is labeled with the outcome of f given the variable assignment represented by the corresponding path (where we use 1 and 0 to denote true and false). Starting from such a tree, one obtains a BDD by merging isomorphic subgraphs and deleting redundant nodes until no further reduction is possible. A node is redundant if and only if the subgraphs rooted at its children are isomorphic. In Fig. 2, dashed edges indicate 0's and lead to low children, solid ones indicate 1's and lead to high children.
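The reduction just described (merging isomorphic subgraphs, dropping redundant nodes) can be implemented with a unique table. The sketch below is a didactic stand-in that assumes the formula is given as a Python callable and the variable order as a list, and that builds the reduced diagram naively by full expansion plus hash-consing; it is not the SimpleCUDD-based machinery used in the actual implementation.

ONE, ZERO = ("1",), ("0",)  # terminal nodes

def make_bdd(formula, order):
    """Reduce `formula` (a callable over a dict of variable values) to a BDD.

    Nodes are tuples (var, low, high); hash-consing via the unique table
    merges isomorphic subgraphs, and the redundancy test drops nodes whose
    children are identical.
    """
    unique = {}

    def node(var, low, high):
        if low is high:                       # redundant test
            return low
        return unique.setdefault((var, low, high), (var, low, high))

    def build(i, assignment):
        if i == len(order):
            return ONE if formula(assignment) else ZERO
        var = order[i]
        low = build(i + 1, {**assignment, var: False})
        high = build(i + 1, {**assignment, var: True})
        return node(var, low, high)

    return build(0, {})

# The propagated evidence of Fig. 2: (burglary or earthquake) and not al(john).
bdd = make_bdd(lambda a: (a["burglary"] or a["earthquake"]) and not a["al(john)"],
               ["al(john)", "earthquake", "burglary"])
print(bdd)
# ('al(john)', ('earthquake', ('burglary', ('0',), ('1',)), ('1',)), ('0',))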
4.1 Computing the BDD for an interpretation
The different steps we go through to compute the BDD for a partial interpretation I are as follows (and are also illustrated in Fig. 2):
1. Compute dep_T(I), that is, compute the set of ground atoms that may have an influence on the truth value of the atoms with known truth value in the partial interpretation I. This is realized by applying the definition of dep_T(I) directly using a tabled meta-interpreter in Prolog. Tabling stores each considered atom in order to avoid that the same atom is proven repeatedly.
2. Compute BK_r(I), the background theory BK restricted to the interpretation I (cf. Definition 4 and Theorem 1).
1.) Calculate dependencies:
    dep_T(alarm) = {alarm, earthquake, burglary}
    dep_T(calls(john)) = {burglary, earthquake, al(john), person(john), calls(john), alarm}
2.) Restricted theory:
    0.1 :: burglary.     person(john).
    0.2 :: earthquake.   alarm :- burglary.
    0.7 :: al(john).     alarm :- earthquake.
    calls(john) :- person(john), alarm, al(john).
3.) Clark's completion:
    person(john) ↔ true
    alarm ↔ (burglary ∨ earthquake)
    calls(john) ↔ person(john) ∧ alarm ∧ al(john)
4.) Propagated evidence:
    (burglary ∨ earthquake)
    ¬al(john)
5.) Build and evaluate the BDD (the diagram itself, with nodes for al(john), earthquake, burglary, and two deterministic alarm nodes, is not reproduced here).

Fig. 2. The different steps of the CoPrEM algorithm for the training example I+ = {alarm}, I- = {calls(john)}. Normally the alarm node in the BDD is propagated away in Step 4, but it is kept here for illustrative purposes. The nodes are labeled with their probability and the up- and downward probabilities.
3. Compute Comp(BK_r(I)), where Comp(BK_r(I)) denotes Clark's completion of BK_r(I). Clark's completion of a set of clauses is computed by replacing all clauses with the same head, h :- body_1, . . . , h :- body_n, by the corresponding formula h ↔ body_1 ∨ . . . ∨ body_n. Clark's completion allows one to propagate values from the head to the bodies and vice versa. It states that the head is true if and only if one of its bodies is true, and it captures the least Herbrand model semantics of definite clause programs.
4. Simplify Comp(BK_r(I)) by propagating the known values of the atoms in I. In this step, we first eliminate the ground atoms with known truth value in I. That is, we simply fill out their value in the theory Comp(BK_r(I)), and then we propagate these values until no further propagation is possible; a small propositional sketch of Steps 3 and 4 is given after this list. This is akin to the first steps of the Davis-Putnam algorithm.
5. Construct the BDD_I compactly representing the Boolean formula consisting of the resulting set of clauses; it is this BDD_I that is employed by the algorithm outlined in the next subsection for computing the expected counts.
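The following propositional sketch illustrates Steps 3 and 4 on the restricted theory of Fig. 2: clauses are grouped into completion formulas, and the evidence {alarm = true, calls(john) = false} is then propagated, forcing person(john) to be true and al(john) to be false. The clause representation and all names are assumptions made for this illustration, not the actual YAP implementation.

def clarks_completion(clauses):
    """Group definite clauses by head: h <-> body_1 v ... v body_n."""
    completion = {}
    for head, body in clauses:
        completion.setdefault(head, []).append(body)
    return completion

def propagate(completion, evidence):
    """Derive literal values forced by the evidence (Step 4 of the algorithm)."""
    val = dict(evidence)                      # atom -> True/False when known
    changed = True
    while changed:
        changed = False

        def assign(atom, value):
            nonlocal changed
            if val.get(atom) is None:
                val[atom] = value
                changed = True

        for head, bodies in completion.items():
            # Status of each body: True, False, or None (undetermined).
            status = []
            for body in bodies:
                vs = [val.get(a) for a in body]
                status.append(False if False in vs else
                              True if all(v is True for v in vs) else None)
            if True in status:
                assign(head, True)
            elif all(s is False for s in status):
                assign(head, False)
            if val.get(head) is False:
                # Every body must be false: a body with exactly one
                # undetermined atom and no false atom forces that atom false.
                for body, s in zip(bodies, status):
                    if s is None:
                        unknown = [a for a in body if val.get(a) is None]
                        if len(unknown) == 1:
                            assign(unknown[0], False)
            if val.get(head) is True:
                # If only one body can still be true, all its atoms are true.
                open_bodies = [b for b, s in zip(bodies, status) if s is not False]
                if len(open_bodies) == 1:
                    for a in open_bodies[0]:
                        assign(a, True)
    return val

clauses = [("person(john)", []),
           ("alarm", ["burglary"]), ("alarm", ["earthquake"]),
           ("calls(john)", ["person(john)", "alarm", "al(john)"])]
print(propagate(clarks_completion(clauses), {"alarm": True, "calls(john)": False}))
# {'alarm': True, 'calls(john)': False, 'person(john)': True, 'al(john)': False}
# The undetermined residue, burglary v earthquake, is what the BDD will encode.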
In Step 4 of the algorithm, atoms f_n with known truth values v_n are removed from the formula and, in turn, from the BDD. This has to be taken into account both for calculating the probability of the interpretation and for the expected counts of these variables. The probability of the partial interpretation I given the ProbLog program T(p) can be calculated as

P_w(I \mid T(p)) = P(BDD_I) \cdot \prod_{f_n \text{ known in } I} P(f_n = v_n)

where v_n is the value of f_n in I and P(BDD_I) is the probability of the BDD as defined in the following subsection.
However, to simplify the notation in the algorithm below, we shall act as if BDD_I also includes the variables with known truth value in I. Internally, the probability calculation is implemented using the above theorem. Furthermore, for computing the expected counts, we also need to consider the nodes and atoms that have been removed from the Boolean formula when the BDD has been computed in a compressed form. Consider, for example, the probabilistic fact burglary in Fig. 2. It only occurs on the left path to the 1-terminal, but it is also true with probability 0.1 on the right path. Therefore, we treat missing atoms at a particular level as if they were there and simply go to the next node, no matter whether the missing atom has the value true or false.
4.2 Calculate Expected Counts
One can calculate the expected counts E[δ_{n,k}^m | I_m] by a dynamic programming approach on the BDD. The algorithm is similar to the forward/backward algorithm for HMMs or the inside/outside probability computation for PCFGs. We therefore talk about the upward/downward algorithm and probability. We use p_N for the probability that the node N is left via the branch to its high child, and 1 - p_N for the low child. For a node N corresponding to a probabilistic fact f_i this probability is p_N = p_i, and p_N = 1 otherwise. The indicator π_N is 1 if N represents a probabilistic fact and 0 if it represents a deterministic node. For every node N in the BDD we compute:
1. The upward probability P↑(N) represents the probability that the logical formula encoded by the sub-BDD rooted at N is true. For instance, in Fig. 2, the upward probability of the leftmost node for alarm represents the probability of the formula alarm ∧ burglary.
2. The downward probability P↓(N) represents the probability of reaching the current node N on a random walk starting at the root, where at deterministic nodes both paths are followed in parallel. If all random walkers take the same decisions at the remaining nodes, it is guaranteed that only one of them reaches the 1-terminal. This is due to the fact that the values of all deterministic nodes are fixed given the values of all probabilistic facts. For instance, in Fig. 2, the downward probability of the right alarm node equals the probability of ¬earthquake ∧ ¬al(john), which is (1 - 0.2) · (1 - 0.7).
The following invariant holds for the BDD (excluding the 0-terminal, due to the deterministic nodes):

\sum_{N \text{ node at level } n} P_\downarrow(N) \cdot P_\uparrow(N) = P(BDD)

where the variable N ranges over all nodes occurring at a particular level n in the BDD. This means that P↓(N) · P↑(N) / P(BDD) represents the probability that an assignment satisfying the BDD will pass through the node N. In particular, it holds that P↓(1) = P↑(Root) = P(BDD). The upward and downward probabilities are computed using the following formulae (cf. also Fig. 3):

P_\uparrow(0) = 0 \qquad P_\uparrow(1) = 1 \qquad P_\downarrow(Root) = 1

P_\uparrow(N) = P_\uparrow(h(N)) \cdot p_N^{\pi_N} + P_\uparrow(l(N)) \cdot (1 - p_N)^{\pi_N}

P_\downarrow(N) = \sum_{M : N = h(M)} P_\downarrow(M) \cdot p_M^{\pi_M} + \sum_{M : N = l(M)} P_\downarrow(M) \cdot (1 - p_M)^{\pi_M}

where π_N is 0 for nodes representing deterministic nodes and 1 otherwise.
The algorithm proceeds by first computing the downward probabilities from the root to the leaves and then the upward probabilities from the leaves to the root. Intermediate results are stored and reused when nodes are revisited. The algorithm is sketched in Algorithm 1. The expected counts E[δ_{n,k}^m | I_m] are then computed as

E[\delta_{n,k}^m \mid I_m] = \sum_{N \text{ represents } f_n\theta_{n,k}^m} \frac{P_\downarrow(N) \cdot p_N \cdot P_\uparrow(h(N))}{P(BDD)}.
Fig. 3. Propagation step for the upward probability (left) and for the downward probability (right). The indicator function π_N is 1 if N is a probabilistic node and zero otherwise.
Algorithm 1 Calculating the probability of a BDD

function Up(BDD node n)
    if n is the 1-terminal then return 1
    if n is the 0-terminal then return 0
    let h and l be the high and low children of n
    if n represents a probabilistic fact then
        return p_n · Up(h) + (1 - p_n) · Up(l)
    else
        return Up(h) + Up(l)

function Down(BDD node n)
    q ← priority queue sorting nodes according to the BDD order
    enqueue(q, n)
    initialize Down as an array of 0's with length height(n)
    Down[n] ← 1                     (the root has downward probability 1)
    while q not empty do
        n ← dequeue(q)
        π ← 1 if n represents a probabilistic fact, 0 otherwise
        Down[h(n)] += Down[n] · p_n^π
        Down[l(n)] += Down[n] · (1 - p_n)^π
        enqueue(q, h(n)) if not yet in q
        enqueue(q, l(n)) if not yet in q
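For concreteness, the sketch below implements the upward and downward passes on a small hand-coded BDD for the propagated formula of Fig. 2, (burglary ∨ earthquake) ∧ ¬al(john), with the deterministic alarm nodes already propagated away; it computes P(BDD) = 0.084 and the conditional probabilities of the probabilistic facts given that the formula holds. Skipped variables are re-inserted (the quasi-reduced form suggested by the remark on missing atoms above) so that the node-level formula for the expected counts is exact. The Node class, the variable order, and the level-by-level queue are assumptions made for this example, not the SimpleCUDD interface used by the actual system.

class Node:
    """BDD node: `var` is None for terminals, otherwise the tested atom."""
    def __init__(self, var=None, low=None, high=None, prob=None, terminal=None):
        self.var, self.low, self.high = var, low, high
        self.prob = prob          # p_N for probabilistic nodes, None otherwise
        self.terminal = terminal  # 0, 1, or None for inner nodes

ZERO, ONE = Node(terminal=0), Node(terminal=1)

def upward(node, cache=None):
    cache = {} if cache is None else cache
    if node.terminal is not None:
        return float(node.terminal)
    if id(node) not in cache:
        up_h, up_l = upward(node.high, cache), upward(node.low, cache)
        if node.prob is not None:                      # probabilistic node
            cache[id(node)] = node.prob * up_h + (1 - node.prob) * up_l
        else:                                          # deterministic node
            cache[id(node)] = up_h + up_l
    return cache[id(node)]

def downward(root, order):
    """Downward probabilities, processing nodes level by level (root first)."""
    down, nodes = {id(root): 1.0}, {id(root): root}
    levels = {v: i for i, v in enumerate(order)}
    queue, seen = [root], {id(root)}
    while queue:
        queue.sort(key=lambda m: levels[m.var])
        n = queue.pop(0)
        p = n.prob if n.prob is not None else 1.0
        e = 1 if n.prob is not None else 0             # pi_N
        for child, weight in ((n.high, p ** e), (n.low, (1 - p) ** e)):
            down[id(child)] = down.get(id(child), 0.0) + down[id(n)] * weight
            nodes[id(child)] = child
            if child.terminal is None and id(child) not in seen:
                seen.add(id(child))
                queue.append(child)
    return down, nodes

# Quasi-reduced BDD for (burglary or earthquake) and not al(john).
b_low = Node("burglary", low=ZERO, high=ONE, prob=0.1)
b_high = Node("burglary", low=ONE, high=ONE, prob=0.1)   # re-inserted level
e = Node("earthquake", low=b_low, high=b_high, prob=0.2)
root = Node("al(john)", low=e, high=ZERO, prob=0.7)
order = ["al(john)", "earthquake", "burglary"]

p_bdd = upward(root)                       # 0.084
down, nodes = downward(root, order)
for atom in order:
    ec = sum(down[i] * n.prob * upward(n.high)
             for i, n in nodes.items() if n.var == atom) / p_bdd
    print(atom, round(ec, 3))              # al(john) 0.0, earthquake 0.714, burglary 0.357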
5 Experiments
We implemented the algorithm in YAP Prolog4utilizing SimpleCUDD5for the
BDD operations. All experiments were run on a 2.8Ghz computer with 8Gb
memory. We performed three types of experiments. In the first we evaluated
whether the CoPrEM algorithm can be used to learn which clauses are correct
in a program,in the second one we used CoPrEM to learn a sequential ProbLog
model and compared it to CPT-L [21].
5.1 7-segment display
To evaluate our approach we use the LED Display Domain Data Set⁶ from the UCI repository and answer question Q1 as to whether CoPrEM can be used to determine which clauses are correct in a ProbLog program. The setup is akin to a Bayesian structure learning approach, and the artificial dataset allows us to control all properties of the problem.
The problem is to learn to recognize digits based on features stemming from a 7-segment display. Such a display contains seven light-emitting diodes (LEDs) arranged like an eight. By switching all LEDs to a specific configuration one can represent a number. So this problem domain consists of seven Boolean features and ten concepts.

⁴ http://www.dcc.fc.up.pt/~vsc/Yap/
⁵ http://www.cs.kuleuven.be/~theo/tools/simplecudd.html
⁶ http://archive.ics.uci.edu/ml/datasets/LED+Display+Domain
Fig. 4. The learned fact probabilities after n iterations of Soft-EM in the LED domain (the figure also depicts the 7-segment display with its segments labeled a through g). For the input corresponding to a 6, the graph shows the probability of assigning the class 5, 6, or 7. The correct rule, which predicts class 6, gets the highest probability assigned. The remaining classes also have probabilities assigned; the curves for 5 and 7 represent the upper and lower bound, respectively, for the curves that are left out.
Solving this problem is non-trivial due to noise, which flips each of the Boolean features with a certain probability. To further complicate the problem, we removed attributes, including the class attribute, at random. As model we considered the combination of all possible concepts (both correct and incorrect). The concepts are expressed in terms of the observed segments as given in Figure 4. For instance, the correct representation of the number 3 is:

0.9 :: p99.
predict(3) :- a, ¬b, c, d, ¬e, f, g, p99.

Each rule represents one concept, that is, a configuration of the segments in the display. It uses a probabilistic fact (p99 in this case) to capture the probability that this rule or concept holds. The final model contains 2⁷ × 10 = 1280 probabilistic facts, and equally many clauses.
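A model of this size is easy to generate mechanically. The sketch below writes the 2⁷ × 10 fact/rule pairs in ProbLog-style syntax; the predicate names, the initial probability of 0.5, and the use of \+ for negation are illustrative assumptions rather than the exact encoding used in the experiments.

from itertools import product

SEGMENTS = "abcdefg"

def led_model(initial_prob=0.5):
    """Generate one probabilistic fact and one clause per (configuration, digit)."""
    lines = []
    rule_id = 0
    for config in product([True, False], repeat=len(SEGMENTS)):
        body = ", ".join(seg if on else r"\+" + seg
                         for seg, on in zip(SEGMENTS, config))
        for digit in range(10):
            lines.append(f"{initial_prob} :: p{rule_id}.")
            lines.append(f"predict({digit}) :- {body}, p{rule_id}.")
            rule_id += 1
    return lines

model = led_model()
print(len(model) // 2)   # 1280 rule/fact pairs
print("\n".join(model[:2]))
# 0.5 :: p0.
# predict(0) :- a, b, c, d, e, f, g, p0.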
We generated 100 training examples. Each feature was perturbed with noise with 10% probability, and each feature had a 30% probability of being unobserved. All results are averaged over ten runs. After training we inspected manually whether the right target concept was learned (cf. Fig. 4). The results indicate that the algorithm learns the right concept even from noisy, incomplete data. The runtimes of the algorithm are as follows: the generation of the BDDs took approximately 10 sec in total, of which grounding took 4 sec and the completion 5 sec. As the BDDs do not change, this has to be done only once. Evaluating the BDDs afterward took 2.5 sec on average. These results affirmatively answer Q1.
5.2 Comparisons to CPT-L
CPT-L is a formalism based on CP-Logic. It is tailored towards sequential domains. A CPT-L model consists of a set of rules which, when applied to the current state, jointly generate the next state. A single CPT-L rule is of the form h_1 :: p_1 ∨ . . . ∨ h_n :: p_n ← b_1, . . . , b_m where the p_i are probability labels (summing to 1), the h_i are atoms, and the b_i are literals. The rule states that for all ground instances of the rule, whenever the body is true in the current state, exactly one of the heads is true. A CPT-L model essentially represents a probability distribution over sequences of relational interpretations (a kind of relational Markov model). It is possible to transform a CPT-L model into ProbLog. Furthermore, the learning setting developed for CPT-L is similar to the one introduced here, except that it is tailored towards sequential data. Thus the present ProbLog setting is more general and can be applied to the CPT-L problem. This also raises the question Q2 as to whether the more general CoPrEM setting for ProbLog can be used on realistic sequential problems designed for CPT-L.
To answer this question, we used the Massively Multiplayer Online Game dataset from the CPT-L paper [21] together with the original model and translated it to ProbLog. We report only runtime results, as all other results are identical. Using the CPT-L implementation, learning takes on average 0.2 minutes per iteration, whereas with CoPrEM an average iteration takes 0.1 minutes. This affirmatively answers Q2 and shows that CoPrEM is competitive with CPT-L even though the latter is tailored towards sequential problems.
6 Related Work
There is a rich body of related work in the literature on statistical relational learning and probabilistic logic programming. First, as already stated in the introduction, current approaches to learning probabilistic logic programs for ProbLog [4], PRISM [20], and SLPs [15] almost all learn from entailment. For ProbLog, there exists a least squares approach where the examples are ground facts together with their target probability [10]. This implements a different approach to learning that is not based on an underlying generative process, neither at the level of predicates nor at the level of interpretations, which explains why it requires the target probabilities to be specified. Hence, it essentially corresponds to a regression approach and contrasts with the generative learning at the level of interpretations that is pursued in the present paper.
For ProbLog, there is also the restricted study of [18] that learns acyclic ground (that is, propositional) ProbLog programs from interpretations using a transformation to Bayesian networks. Such a transformation is also employed in the learning approaches for CP-logic [22]. Sato and Kameya have contributed various interesting and advanced learning algorithms that have been incorporated in PRISM, but all of them learn from entailment. Ishihata et al. [11] mention the possibility of using a BDD-like approach in PRISM but have, to the best of the authors' knowledge, not yet implemented such an approach. Furthermore, implementing this would require a significant extension of PRISM to relax the exclusive explanation assumption. This assumption, made by PRISM but not by ProbLog, states that different proofs for a particular fact should be disjoint, that is, not overlap. It allows PRISM to avoid the use of BDDs and optimizes both learning and inference algorithms. They also do not specify how to obtain the BDDs from PRISM. The upward/downward procedure used in our algorithm to estimate the parameters from a set of BDDs is essentially an extension of the approaches of Ishihata et al. [11] and of Thon et al. [21], which have been developed independently. The algorithm of Ishihata et al. learns the probability of literals for arbitrary Boolean formulae from examples using a BDD, while that of Thon et al. [21] has been tailored towards learning with BDDs for the sequential logic CPT-L. One important extension in the present algorithm, as compared to [11], is that it can deal with deterministic nodes and dependencies that may occur in BDDs, such as those generated by our algorithm.
Our approach can also be related to the work on knowledge-based model construction approaches in statistical relational learning such as BLPs, PRMs, and MLNs [17]. While the setting in the present paper is the standard setting for these kinds of representations, there are significant differences at the algorithmic level, besides the representational ones. First, for BLPs, PRMs, and also CP-logic, typically a ground Bayesian network is constructed for each training example, on which a standard learning algorithm can be used. Second, for MLNs, many different learning algorithms have been developed. Even though the representation generated by Clark's completion (a weighted ALL-SAT problem) is quite close to the representation of Markov Logic, there are subtle differences. We use probabilities on single facts, while Markov Logic uses weights on clauses. This yields a clear probabilistic semantics. It is unclear whether and how BDDs could be incorporated in Markov Logic, but this seems to be a promising future research direction.
7 Conclusions
We have introduced a novel parameter learning algorithm for the probabilistic logic programming language ProbLog that learns from interpretations. This has been motivated by the differences in the learning settings and applications of typical knowledge-based model construction approaches such as MLNs, PRMs, and BLPs, and probabilistic logic programming approaches using Sato's distribution semantics such as ProbLog and PRISM. The CoPrEM algorithm tightly couples logical inference with a probabilistic EM algorithm at the level of BDDs. The approach was experimentally evaluated, and it was shown that it can essentially be applied to the typical kind of problems that have made the knowledge-based model construction approach in statistical relational learning so popular. The authors hope that this contributes towards bridging the gap between the knowledge-based model construction approach and the probabilistic logic programming approach based on the distribution semantics.
References
1. Cussens, J.: Parameter estimation in stochastic logic programs. Machine Learning 44(3), 245–271 (2001)
2. De Raedt, L., Frasconi, P., Kersting, K., Muggleton, S. (eds.): Probabilistic Inductive Logic Programming: Theory and Applications. Lecture Notes in Artificial Intelligence, vol. 4911. Springer (2008)
3. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: A probabilistic Prolog and its application in link discovery. In: Veloso, M. (ed.) Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 2462–2467 (2007)
4. De Raedt, L.: Logical and Relational Learning. Springer (2008)
5. De Raedt, L., Kersting, K.: Probabilistic inductive logic programming. In: Algorithmic Learning Theory, pp. 19–36. LNCS, vol. 3244. Springer (2004)
6. Flach, P.: Simply Logical: Intelligent Reasoning by Example. Wiley (1994)
7. Getoor, L., Friedman, N., Koller, D., Pfeffer, A.: Learning probabilistic relational models. In: Džeroski, S., Lavrač, N. (eds.) Relational Data Mining, pp. 307–335. Springer (2001)
8. Getoor, L., Taskar, B. (eds.): An Introduction to Statistical Relational Learning. MIT Press (2007)
9. Gutmann, B., Kimmig, A., De Raedt, L., Kersting, K.: Parameter learning in probabilistic databases: A least squares approach. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS, vol. 5211, pp. 473–488. Springer (2008)
10. Ishihata, M., Kameya, Y., Sato, T., Minato, S.: Propositionalizing the EM algorithm by BDDs. In: ILP (2008)
11. Jensen, F.: Bayesian Networks and Decision Graphs. Springer (2001)
12. Kersting, K., De Raedt, L.: Bayesian logic programming: theory and tool. In: Getoor, L., Taskar, B. (eds.) An Introduction to Statistical Relational Learning. MIT Press (2007)
13. Muggleton, S.: Stochastic logic programs. In: De Raedt, L. (ed.) Advances in Inductive Logic Programming. Frontiers in Artificial Intelligence and Applications, vol. 32. IOS Press (1996)
14. Poole, D.: The independent choice logic and beyond. In: De Raedt, L., Frasconi, P., Kersting, K., Muggleton, S. (eds.) Probabilistic Inductive Logic Programming. Lecture Notes in Computer Science, vol. 4911, pp. 222–243. Springer (2008)
15. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62, 107–136 (2006)
16. Riguzzi, F.: Learning ground ProbLog programs from interpretations. In: Proceedings of the 6th Workshop on Multi-Relational Data Mining (MRDM 2007) (2007)
17. Sato, T.: A statistical learning method for logic programs with distribution semantics. In: Sterling, L. (ed.) Proceedings of the 12th International Conference on Logic Programming, pp. 715–729. MIT Press (1995)
18. Sato, T., Kameya, Y.: Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research 15, 391–454 (2001)
19. Thon, I., Landwehr, N., De Raedt, L.: A simple model for sequences of relational state descriptions. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS, vol. 5211, pp. 506–521. Springer (2008)
20. Vennekens, J., Denecker, M., Bruynooghe, M.: Representing causal information about a probabilistic process. In: Fisher, M., van der Hoek, W., Konev, B., Lisitsa, A. (eds.) Proceedings of the 10th European Conference on Logics in Artificial Intelligence. LNCS, vol. 4160, pp. 452–464. Springer (2006)