Computing Posterior Probabilities of Structural Features in
Bayesian Networks
Jin Tian and Ru He
Department of Computer Science
Iowa State University
Ames, IA 50011
{jtian, rhe}@cs.iastate.edu
Abstract
We study the problem of learning Bayesian network structures from data. Koivisto and Sood (2004) and Koivisto (2006) presented algorithms that can compute the exact marginal posterior probability of a subnetwork, e.g., a single edge, in O(n 2^n) time and the posterior probabilities for all n(n − 1) potential edges in O(n 2^n) total time, assuming that the number of parents per node, or the indegree, is bounded by a constant. One main drawback of their algorithms is the requirement of a special structure prior that is non-uniform and does not respect Markov equivalence. In this paper, we develop an algorithm that can compute the exact posterior probability of a subnetwork in O(3^n) time and the posterior probabilities for all n(n − 1) potential edges in O(n 3^n) total time. Our algorithm also assumes a bounded indegree but allows general structure priors. We demonstrate the applicability of the algorithm on several data sets with up to 20 variables.
1 Introduction
Bayesian networks are being widely used for prob-
abilistic inference and causal modeling [Pearl, 2000,
Spirtes et al., 2001]. One major challenge is to learn the structures of Bayesian networks from data. In the Bayesian approach, we provide a prior probability distribution over the space of possible Bayesian networks and then compute the posterior distribution P(G|D) of the network structure G given data D. We can then compute the posterior probability of any hypothesis of interest by averaging over all possible networks. In many applications we are interested in structural features. For example, in causal discovery, we are interested in the causal relations among variables, represented by the edges in the network structure [Heckerman et al., 1999].
The number of possible network structures is super-exponential, O(n! 2^{n(n−1)/2}), in the number of variables n. For example, there are about 10^4 directed acyclic graphs (DAGs) on 5 nodes, and 10^18 DAGs on 10 nodes. As a result, it is impractical to sum over all possible structures except for very small networks (fewer than 8 variables). One solution is to compute approximate posterior probabilities. Madigan and York (1995) used a Markov chain Monte Carlo (MCMC) algorithm in the space of network structures. Friedman and Koller (2003) developed an MCMC procedure in the space of node orderings, which was shown to be more efficient than MCMC in the space of DAGs. One problem with the MCMC approach is that there is no guarantee on the quality of the approximation in finite runs.
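The super-exponential count quoted above can be checked numerically with Robinson's recursion for the number of labeled DAGs. The sketch below is an editorial illustration, not part of the paper's implementation; the function name is our own.

```python
from math import comb

def count_dags(n):
    """Number of labeled DAGs on n nodes, by Robinson's recursion:
    a(n) = sum_{k=1}^{n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k)."""
    a = [1]  # a(0) = 1: the empty graph on zero nodes
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

print(count_dags(5))             # 29281, about 10^4
print(len(str(count_dags(10))))  # 19 digits, about 10^18
```

The recursion conditions on which k nodes are roots, matching the "about 10^4 DAGs on 5 nodes, and 10^18 DAGs on 10 nodes" figures in the text.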
Recently, a dynamic programming (DP) algorithm
was developed that can compute the exact marginal
posterior probabilities of any subnetwork (e.g., an
edge) in O(n 2^n) time [Koivisto and Sood, 2004] and the exact posterior probabilities for all n(n − 1) potential edges in O(n 2^n) total time [Koivisto, 2006],
assuming that the indegree, i.e., the number of parents of each node, is bounded by a constant. One main drawback of the DP algorithm and the order MCMC algorithm is that they both require a special form of the structure prior P(G). The resulting prior P(G) is non-uniform and does not respect Markov equivalence [Friedman and Koller, 2003, Koivisto and Sood, 2004]. Therefore the computed posterior probabilities could be biased. MCMC algorithms have been developed that try to fix this structure prior problem [Eaton and Murphy, 2007, Ellis and Wong, 2008].

Inspired by the DP algorithm in [Koivisto and Sood, 2004, Koivisto, 2006], we have developed an algorithm for computing the exact posterior probabilities of structural features that does not require a special prior P(G) other than the standard structure modularity (see Eq. (4)). Assuming a bounded indegree, our algorithm can compute the exact marginal posterior probabilities of any subnetwork in O(3^n) time, and the posterior probabilities for all n(n − 1) potential edges in O(n 3^n) total time. The memory requirement of our algorithm, O(n 2^n), is about the same as that of the DP algorithm. We have demonstrated our algorithm on data sets with up to 20 variables. The main advantage of our algorithm is that it can use a very general structure prior P(G), which can simply be left uniform and can satisfy the Markov equivalence requirement. We acknowledge here that our algorithm was inspired by and uses many techniques from [Koivisto and Sood, 2004, Koivisto, 2006]. Their algorithm is based on summing over all possible total orders (leading to the bias in the prior P(G) that graphs consistent with more orders are favored). Our algorithm directly sums over all possible DAG structures by exploiting sinks, nodes that have no outgoing edges, and roots, nodes that have no parents, and as a result the computations involved are more complicated. We note that dynamic programming techniques have also been used to learn optimal Bayesian networks in [Singh and Moore, 2005, Silander and Myllymaki, 2006].
The rest of the paper is organized as follows. In Sec-
tion 2 we briefly review the Bayesian approach to learn
Bayesian networks from data. In Section 3 we present
our algorithm for computing the posterior probability
of a single edge and in Section 4 we present our al-
gorithm for computing the posterior probabilities of
all potential edges simultaneously. We empirically demonstrate the capability of our algorithm in Section 5 and discuss its potential applications in Section 6.
2 Bayesian Learning of Bayesian
Networks
A Bayesian network is a DAG G that encodes a joint probability distribution over a set X = {X_1, ..., X_n} of random variables, with each node of the graph representing a variable in X. For convenience we will typically work with the index set V = {1, ..., n} and represent a variable X_i by its index i. We use X_{Pa_i} ⊆ X to represent the set of parents of X_i in a DAG G and use Pa_i ⊆ V to represent the corresponding index set.

Assume we are given a training data set D = {x^1, x^2, ..., x^N}, where each x^l is a particular instantiation over the set of variables X. We only consider situations where the data are complete, that is, every variable in X is assigned a value. In the Bayesian approach to learning Bayesian networks from the training data D, we compute the posterior probability of a network G as

    P(G|D) = P(D|G) P(G) / P(D).    (1)
We can then compute the posterior probability of any hypothesis of interest by averaging over all possible networks. In this paper, we are interested in computing the posteriors of structural features. Let f be a structural feature represented by an indicator function such that f(G) is 1 if the feature is present in G and 0 otherwise. We have

    P(f|D) = \sum_G f(G) P(G|D).    (2)
Assuming global and local parameter independence, and parameter modularity, P(D|G) can be decomposed into a product of local marginal likelihoods, often called local scores, as [Cooper and Herskovits, 1992, Heckerman et al., 1995]

    P(D|G) = \prod_{i=1}^n P(x_i | x_{Pa_i} : D) ≡ \prod_{i=1}^n score_i(Pa_i : D),    (3)

where, with appropriate parameter priors, score_i(Pa_i : D) has a closed form solution. In this paper we will assume that these local scores can be computed efficiently from data.
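As a concrete illustration of such a closed form, the sketch below computes the log of a BDe-style local marginal likelihood for complete discrete data [Heckerman et al., 1995]. This is an editorial sketch, not the paper's code; the function name, the `ess` parameter, and the data layout are our own illustrative choices.

```python
from math import lgamma
from collections import Counter

def log_local_score(data, i, parents, arities, ess=1.0):
    """log score_i(Pa_i : D): BDe local marginal likelihood with
    Dirichlet hyperparameters alpha_{x_i, pa_i} = ess / (r_i * q_i).
    data: list of tuples of ints; arities[j]: number of states of X_j."""
    r = arities[i]                      # number of states of X_i
    q = 1                               # number of parent configurations
    for p in parents:
        q *= arities[p]
    a_ijk, a_ij = ess / (q * r), ess / q
    njk, nj = Counter(), Counter()      # sufficient statistics from the data
    for row in data:
        pc = tuple(row[p] for p in parents)
        njk[(pc, row[i])] += 1
        nj[pc] += 1
    score = 0.0
    for pc, c in nj.items():            # unseen parent configs contribute 0
        score += lgamma(a_ij) - lgamma(a_ij + c)
    for key, c in njk.items():
        score += lgamma(a_ijk + c) - lgamma(a_ijk)
    return score
```

With ess = 1 these hyperparameters match the choice α_{x_i, pa_i} = 1/(|Dm(X_i)| · |Dm(Pa_i)|) used in the experiments of Section 5.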
For the prior P(G) over all possible DAG structures we will assume structure modularity [Friedman and Koller, 2003]:

    P(G) = \prod_{i=1}^n Q_i(Pa_i).    (4)
In this paper we consider modular features:

    f(G) = \prod_{i=1}^n f_i(Pa_i),    (5)

where f_i(Pa_i) is an indicator function with values either 0 or 1. For example, an edge j → i can be represented by setting f_i(Pa_i) = 1 if and only if j ∈ Pa_i, and setting f_l(Pa_l) = 1 for all l ≠ i. In this paper, we are interested in computing the posterior P(f|D) of the feature, which can be obtained by computing the joint probability P(f, D) as
    P(f, D) = \sum_G f(G) P(D|G) P(G)    (6)
            = \sum_G \prod_{i=1}^n f_i(Pa_i) Q_i(Pa_i) score_i(Pa_i : D)
            = \sum_G \prod_{i=1}^n B_i(Pa_i),    (7)
Page 3
where for all Pa_i ⊆ V − {i} we define

    B_i(Pa_i) ≡ f_i(Pa_i) Q_i(Pa_i) score_i(Pa_i : D).    (8)

It is clear from Eq. (6) that if we set all features f_i(Pa_i) to be constant 1, then we have P(f = 1, D) = P(D). Therefore we can compute the posterior P(f|D) if we know how to compute the joint P(f, D). In the next section, we show how the summation in Eq. (7) can be done by dynamic programming in O(3^n) time.
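To make Eqs. (6)-(8) concrete before introducing the dynamic program, the following sketch enumerates all 25 DAGs over n = 3 nodes by brute force and evaluates P(f, D) for the edge feature 0 → 1 and for the constant feature. The random values standing in for Q_i(Pa_i) score_i(Pa_i : D) are illustrative; this checks the definitions, it is not the paper's algorithm.

```python
import random
from itertools import product

random.seed(0)
n = 3

# Hypothetical Q_i(Pa_i) * score_i(Pa_i : D) values, one per (node, parent set),
# with parent sets encoded as bitmasks that exclude the node itself.
qs = [{pa: random.random() for pa in range(1 << n) if not (pa >> i) & 1}
      for i in range(n)]

def acyclic(pa_sets):
    """Peel off nodes whose parents are all already placed (a topological order)."""
    placed, rem = 0, set(range(n))
    while rem:
        ready = [i for i in rem if pa_sets[i] & ~placed == 0]
        if not ready:
            return False
        placed |= 1 << ready[0]
        rem.discard(ready[0])
    return True

def joint(feature):
    """P(f, D) of Eq. (7): sum over all DAGs of prod_i B_i(Pa_i), with
    B_i(Pa_i) = f_i(Pa_i) Q_i(Pa_i) score_i(Pa_i : D) as in Eq. (8)."""
    total = 0.0
    for pa_sets in product(*(list(q) for q in qs)):
        if acyclic(pa_sets):
            w = 1.0
            for i in range(n):
                w *= feature(i, pa_sets[i]) * qs[i][pa_sets[i]]
            total += w
    return total

p_d = joint(lambda i, pa: 1.0)                                # constant feature: P(D)
p_fd = joint(lambda i, pa: 1.0 if i != 1 else float(pa & 1))  # edge 0 -> 1
print(p_fd / p_d)                                             # posterior P(0 -> 1 | D)
```

The edge feature sets f_1(Pa_1) = 1 iff 0 ∈ Pa_1 and f_l ≡ 1 for l ≠ 1, exactly as described below Eq. (5).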
3 Computing Posteriors of Features
Every DAG must have a root node, that is, a node with no parents. Let \mathcal{G} denote the set of all DAGs over V, and for any S ⊆ V let \mathcal{G}^+(S) be the set of DAGs over V such that all variables in V − S are root nodes (setting \mathcal{G}^+(V) = \mathcal{G}). Since every DAG must have a root node, we have \mathcal{G} = ∪_{j∈V} \mathcal{G}^+(V − {j}). We can compute the summation over all the possible DAGs in Eq. (7) by summing over the DAGs in each \mathcal{G}^+(V − {j}) separately. However, there are overlaps between the sets \mathcal{G}^+(V − {j}); in fact ∩_{j∈T} \mathcal{G}^+(V − {j}) = \mathcal{G}^+(V − T). We can correct for those overlaps using the inclusion-exclusion principle.
Define the following function for all S ⊆ V:

    RR(S) ≡ \sum_{G ∈ \mathcal{G}^+(S)} \prod_{i∈S} B_i(Pa_i).    (9)
We have P(f, D) = RR(V) since \mathcal{G}^+(V) = \mathcal{G}. Then by the weighted inclusion-exclusion principle, Eq. (7) becomes

    P(f, D) = RR(V) = \sum_{G ∈ \mathcal{G}} \prod_{i=1}^n B_i(Pa_i)
      = \sum_{k=1}^{|V|} (−1)^{k+1} \sum_{T⊆V, |T|=k} \sum_{G ∈ \mathcal{G}^+(V−T)} \prod_{i=1}^n B_i(Pa_i)
      = \sum_{k=1}^{|V|} (−1)^{k+1} \sum_{T⊆V, |T|=k} \prod_{j∈T} B_j(∅) \sum_{G ∈ \mathcal{G}^+(V−T)} \prod_{i∈V−T} B_i(Pa_i)
      = \sum_{k=1}^{|V|} (−1)^{k+1} \sum_{T⊆V, |T|=k} \prod_{j∈T} B_j(∅) RR(V − T),    (10)

which says that P(f, D) can be computed in terms of RR(S). Next we show that RR(S) for all S ⊆ V can be computed recursively. We define the following function for each i ∈ V and all S ⊆ V − {i}:

    A_i(S) ≡ \sum_{Pa_i ⊆ S} B_i(Pa_i),    (11)

where in particular A_i(∅) = B_i(∅). We have the following results, which roughly correspond to the backward computation in [Koivisto, 2006].
Proposition 1

    P(f, D) = RR(V),    (12)

where RR(S) can be computed recursively by

    RR(S) = \sum_{k=1}^{|S|} (−1)^{k+1} \sum_{T⊆S, |T|=k} RR(S − T) \prod_{j∈T} A_j(V − S)    (13)

with the base case RR(∅) = 1.
Proof: We will say a node j ∈ S is a source node in G ∈ \mathcal{G}^+(S) if none of j's parents are in S, that is, Pa_j ∩ S = ∅. For T ⊆ S let \mathcal{G}^+(S, T) denote the set of DAGs in \mathcal{G}^+(S) such that all the variables in T are source nodes (setting \mathcal{G}^+(S, ∅) = \mathcal{G}^+(S)). It is clear that \mathcal{G}^+(S) = ∪_{j∈S} \mathcal{G}^+(S, {j}), and that ∩_{j∈T} \mathcal{G}^+(S, {j}) = \mathcal{G}^+(S, T). The summation over the DAGs in \mathcal{G}^+(S) in Eq. (9) can be computed by summing over the DAGs in \mathcal{G}^+(S, {j}) separately and correcting for the overlapping DAGs. Define

    RF(S, T) ≡ \sum_{G ∈ \mathcal{G}^+(S,T)} \prod_{i∈S} B_i(Pa_i), for any T ⊆ S,    (14)

where RF(S, ∅) = RR(S). Then by the weighted inclusion-exclusion principle, RR(S) in Eq. (9) can be computed as

    RR(S) = \sum_{k=1}^{|S|} (−1)^{k+1} \sum_{T⊆S, |T|=k} RF(S, T).    (15)
RR(S) and RF(S, T) can be computed recursively as follows. For |T| = 1,

    RF(S, {j}) = \sum_{G ∈ \mathcal{G}^+(S,{j})} B_j(Pa_j) \prod_{i∈S−{j}} B_i(Pa_i)
      = [\sum_{Pa_j ⊆ V−S} B_j(Pa_j)] [\sum_{G ∈ \mathcal{G}^+(S−{j})} \prod_{i∈S−{j}} B_i(Pa_i)]    (see Figure 1(a))
      = A_j(V − S) RR(S − {j}).    (16)
Similarly, for any T ⊆ S,

    RF(S, T) = \sum_{G ∈ \mathcal{G}^+(S,T)} \prod_{j∈T} B_j(Pa_j) \prod_{i∈S−T} B_i(Pa_i)
      = \prod_{j∈T} [\sum_{Pa_j ⊆ V−S} B_j(Pa_j)] [\sum_{G ∈ \mathcal{G}^+(S−T)} \prod_{i∈S−T} B_i(Pa_i)]    (see Figure 1(b))
      = \prod_{j∈T} A_j(V − S) · RR(S − T).    (17)
Figure 1: Figures illustrating the proof of Proposition 1. (a) The source node j has its parents in V − S; the rest of the DAG is over S − {j}. (b) The source nodes in T have their parents in V − S; the rest of the DAG is over S − T.
Combining Eqs. (15) and (17) we obtain Eq. (13). □
Based on Proposition 1, provided that the functions A_j have been computed, P(f, D) can be computed in the manner of dynamic programming, starting from the base case RR(∅) = 1, then RR({j}) = A_j(V − {j}), and so on, until RR(V).
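The recursion of Proposition 1 can be sketched directly with bitmask subsets. The code below is a toy check, not the paper's implementation: random values stand in for the B_i terms of Eq. (8), RR(S) is filled in bottom-up via Eq. (13), and RR(V) is verified against a brute-force sum over all 25 DAGs on 3 nodes.

```python
import math
import random
from itertools import combinations, product

random.seed(1)
n = 3
FULL = (1 << n) - 1

# Hypothetical B_i(Pa_i) values (Eq. (8)), keyed by parent-set bitmask.
B = [{pa: random.random() for pa in range(FULL + 1) if not (pa >> i) & 1}
     for i in range(n)]

def A(i, S):
    """A_i(S) = sum of B_i(Pa) over Pa a subset of S (Eq. (11))."""
    pa, total = S, 0.0
    while True:
        total += B[i][pa]
        if pa == 0:
            return total
        pa = (pa - 1) & S

# Bottom-up dynamic program for RR(S), following Eq. (13).
RR = {0: 1.0}                                  # base case: RR(empty set) = 1
for S in range(1, FULL + 1):
    elems = [j for j in range(n) if (S >> j) & 1]
    acc = 0.0
    for k in range(1, len(elems) + 1):
        for T in combinations(elems, k):
            Tm = sum(1 << j for j in T)
            prod = 1.0
            for j in T:
                prod *= A(j, FULL ^ S)         # A_j(V - S)
            acc += (-1) ** (k + 1) * RR[S ^ Tm] * prod
    RR[S] = acc

# Brute-force check: sum of prod_i B_i(Pa_i) over every DAG on n nodes.
def acyclic(pa_sets):
    """Peel off nodes whose parents are all already placed (a topological order)."""
    placed, rem = 0, set(range(n))
    while rem:
        ready = [i for i in rem if pa_sets[i] & ~placed == 0]
        if not ready:
            return False
        placed |= 1 << ready[0]
        rem.discard(ready[0])
    return True

brute = sum(math.prod(B[i][ps[i]] for i in range(n))
            for ps in product(*(list(b) for b in B)) if acyclic(ps))
assert abs(RR[FULL] - brute) < 1e-9            # RR(V) = P(f, D)
```

The O(3^n) bound follows because each subset T of each S is visited once, as in the counting argument below Eq. (20).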
Given the functions B_i, the function A_i as defined in Eq. (11) can be computed using the fast Möbius transform algorithm in O(n 2^n) time (for a fixed i) [Kennes and Smets, 1990]. In the case of a fixed indegree bound k, B_i(Pa_i) is zero when Pa_i contains more than k elements. Then A_i(S) for all S ⊆ V − {i} can be computed more efficiently using the truncated Möbius transform algorithm given in [Koivisto and Sood, 2004] in O(k 2^n) time (for a fixed i).
The functions RR may be computed more efficiently if we precompute the products of the A_j. Define

    AA(S, T) ≡ \prod_{j∈T} A_j(V − S) for T ⊆ S ⊆ V.    (18)

Then using Eq. (18) for AA(S, T − {j}) we have

    AA(S, T) = A_j(V − S) AA(S, T − {j}) for any j ∈ T.    (19)

For a fixed S, AA(S, T) for all T ⊆ S can be computed in the manner of dynamic programming in O(2^{|S|}) time, starting with AA(S, {j}) = A_j(V − S). We then
compute RR(S) using

    RR(S) = \sum_{k=1}^{|S|} (−1)^{k+1} \sum_{T⊆S, |T|=k} RR(S − T) AA(S, T)    (20)

in O(2^{|S|}) time. The functions RR(S) for all S ⊆ V can then be computed in \sum_{k=0}^n \binom{n}{k} 2^k = 3^n time. In summary, we obtain the following result.

Theorem 1 Given a fixed maximum indegree k, P(f|D) can be computed in O(3^n + k n 2^n) time.
4 Computing Posterior Probabilities
for All Edges
If we want to compute the posterior probabilities of all O(n^2) potential edges, we can compute RR(V) for each edge separately and solve the problem in O(n^2 3^n) total time. However, there is a large overlap in the computations for different edges. After computing P(D) with constant features f_i(Pa_i) = 1 for all i ∈ V, the computation for an edge i → j only needs to change the function f_j, and therefore B_j and A_j. We can take advantage of this overlap and reduce the total time for computing over all edges.
Inspired by the forward-backward algorithm in [Koivisto, 2006], we developed an algorithm that can compute all edge posterior probabilities in O(n 3^n) total time. The computations of P(f, D) in Section 3 are based on exploiting root nodes and roughly correspond to the backward computation in [Koivisto, 2006]. Next we first describe a computation of P(f, D) by exploiting sink nodes (nodes that have no outgoing edges), which roughly corresponds to the computation in [Koivisto and Sood, 2004], called the forward computation in [Koivisto, 2006]. Then we describe how to combine the two computations to reduce the total computation time.
4.1 Computing P(f,D) by exploiting sinks
For any S ⊆ V, let \mathcal{G}(S) denote the set of all the possible DAGs over S, and \mathcal{G}(S, T) denote the set of all the possible DAGs over S such that all the variables in T ⊆ S are sinks. We have \mathcal{G}(V) = \mathcal{G} and \mathcal{G}(S, ∅) = \mathcal{G}(S). For any S ⊆ V define

    H(S) ≡ \sum_{G ∈ \mathcal{G}(S)} \prod_{i∈S} B_i(Pa_i).    (21)
We have P(f, D) = H(V) since \mathcal{G}(V) = \mathcal{G}. As in Section 3, we can show that H(S) can be computed recursively. For any T ⊆ S ⊆ V, define

    F(S, T) ≡ \sum_{G ∈ \mathcal{G}(S,T)} \prod_{i∈S} B_i(Pa_i),    (22)

where F(S, ∅) = H(S). We have the following results.

Proposition 2

    P(f, D) = H(V),    (23)

and H(S) can be computed recursively by

    H(S) = \sum_{k=1}^{|S|} (−1)^{k+1} \sum_{T⊆S, |T|=k} H(S − T) \prod_{j∈T} A_j(S − T)    (24)

with the base case H(∅) = 1.
Proof: Since every DAG has a sink, we have \mathcal{G}(S) = ∪_{j∈S} \mathcal{G}(S, {j}). It is clear that ∩_{j∈T} \mathcal{G}(S, {j}) = \mathcal{G}(S, T). The summation over the DAGs in \mathcal{G}(S) in Eq. (21) can be computed by summing over the DAGs in \mathcal{G}(S, {j}) separately and correcting for the overlapping DAGs. By the weighted inclusion-exclusion principle, H(S) in Eq. (21) can be computed as

    H(S) = \sum_{k=1}^{|S|} (−1)^{k+1} \sum_{T⊆S, |T|=k} F(S, T).    (25)
H(S) and F(S, T) can be computed recursively as follows. For |T| = 1, we have

    F(S, {j}) = \sum_{G ∈ \mathcal{G}(S,{j})} \prod_{i∈S} B_i(Pa_i)
      = [\sum_{Pa_j ⊆ S−{j}} B_j(Pa_j)] [\sum_{G ∈ \mathcal{G}(S−{j})} \prod_{i∈S−{j}} B_i(Pa_i)]    (see Figure 2(a))
      = A_j(S − {j}) H(S − {j}).    (26)
Similarly, for any j ∈ T ⊆ S,

    F(S, T) = \sum_{G ∈ \mathcal{G}(S,T)} \prod_{i∈S} B_i(Pa_i)
      = [\sum_{Pa_j ⊆ S−T} B_j(Pa_j)] [\sum_{G ∈ \mathcal{G}(S−{j}, T−{j})} \prod_{i∈S−{j}} B_i(Pa_i)]    (see Figure 2(b))
      = A_j(S − T) F(S − {j}, T − {j}).    (27)

Let T = {j_1, ..., j_k}. Repeatedly applying Eq. (27) and using the fact that (S − T′) − (T − T′) = S − T for any T′ ⊆ T, we obtain

    F(S, T) = A_{j_1}(S − T) A_{j_2}(S − T) F(S − {j_1, j_2}, T − {j_1, j_2})
      = ...
      = A_{j_1}(S − T) ··· A_{j_{k−1}}(S − T) F(S − {j_1, ..., j_{k−1}}, {j_k})
      = H(S − T) \prod_{j∈T} A_j(S − T),    (28)

where Eq. (26) is applied in the last step. Finally, combining Eqs. (28) and (25) leads to Eq. (24).
Figure 2: Figures illustrating the proof of Proposition 2. (a) The sink node j has its parents in S − {j}. (b) The sink node j ∈ T has its parents in S − T; the remaining sinks are T − {j}.

Based on Proposition 2, H(S) can be computed in the manner of dynamic programming. Each H(S) is computed in \sum_{k=1}^{|S|} \binom{|S|}{k} k = |S| 2^{|S|−1} time. All H(S) for S ⊆ V can be computed in \sum_{k=1}^n \binom{n}{k} k 2^{k−1} = n 3^{n−1} time. We could store all F(S, T) and compute all H(S) in O(3^n) time, but the memory requirement would then become O(4^n) instead of O(n 2^n).
We could compute the posterior of a feature using
P(f,D) = H(V ) but this is a factor of n slower than
computing RR(V ). Next we show how to reduce the
total time for computing the posterior probabilities of
all edges by combining the contributions of H(S) and
RR(S).
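The two recursions, Eq. (13) over roots and Eq. (24) over sinks, compute the same quantity P(f, D). The sketch below runs both side by side on toy random values standing in for the B_i terms (an editorial check, not the paper's implementation) and verifies RR(V) = H(V).

```python
import random
from itertools import combinations

random.seed(3)
n = 4
FULL = (1 << n) - 1
B = [{pa: random.random() for pa in range(FULL + 1) if not (pa >> i) & 1}
     for i in range(n)]

def A(i, S):
    """A_i(S) = sum of B_i(Pa) over Pa a subset of S (Eq. (11))."""
    pa, total = S, 0.0
    while True:
        total += B[i][pa]
        if pa == 0:
            return total
        pa = (pa - 1) & S

RR, H = {0: 1.0}, {0: 1.0}                     # base cases RR(empty) = H(empty) = 1
for S in range(1, FULL + 1):
    elems = [j for j in range(n) if (S >> j) & 1]
    rr = hh = 0.0
    for k in range(1, len(elems) + 1):
        for T in combinations(elems, k):
            Tm = sum(1 << j for j in T)
            sign = (-1) ** (k + 1)
            p_root = p_sink = 1.0
            for j in T:
                p_root *= A(j, FULL ^ S)       # roots version: A_j(V - S), Eq. (13)
                p_sink *= A(j, S ^ Tm)         # sinks version: A_j(S - T), Eq. (24)
            rr += sign * RR[S ^ Tm] * p_root
            hh += sign * H[S ^ Tm] * p_sink
    RR[S], H[S] = rr, hh

assert abs(RR[FULL] - H[FULL]) < 1e-9          # both equal P(f, D)
```

Note that the intermediate values differ: RR(S) sums over DAGs over V with V − S as roots, while H(S) sums over DAGs over S only; only at S = V do they coincide.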
4.2 Computing posteriors for all edges
Consider the summation over all the possible DAGs in Eq. (7). Assume that we are interested in computing the posterior probability of an edge i → v. We want to extract the contribution of B_v from the rest of the B_i. The idea is as follows. For a fixed node v, we can break a DAG into the set of ancestors U of v and the set of nonancestors V − {v} − U.¹ Roughly speaking, conditioned on U, the summation over all DAGs can be decomposed into the contributions from the summation over DAGs over U, which corresponds to the computation of H(U), and the contributions from the summation over DAGs over V − {v} − U with the variables in U ∪ {v} as root nodes, which corresponds to the computation of RR(V − {v} − U).
Define, for any v ∈ V and U ⊆ V − {v}, the following function:

    K_v(U) ≡ \sum_{T ⊆ V−{v}−U} (−1)^{|T|} RR(V − {v} − U − T) \prod_{j∈T} A_j(U).    (29)
We have the following results.
¹Or we can break a DAG into the set of nondescendants U of v and the set of descendants V − {v} − U. It can be shown that this way of breaking DAGs can also be used to derive Proposition 3, but it is not exploited in this paper.
Proposition 3

    P(f, D) = \sum_{U ⊆ V−{v}} A_v(U) H(U) K_v(U).    (30)
The proof of Proposition 3 is given in the Appendix.
Note that in Eq. (30) the computations of H(U) and K_v(U) do not rely on B_v, and all the contribution from B_v to P(f, D) is represented by A_v. After we have computed the functions A_v, H, and K_v, P(f, D) can be computed using Eq. (30) in O(2^n) time.
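Proposition 3 can be sanity-checked on toy values: compute RR and H by the recursions of Propositions 1 and 2, compute K_v directly from Eq. (29), and verify that the right-hand side of Eq. (30) reproduces RR(V). This is a verification sketch under assumed random inputs, not the paper's implementation.

```python
import random
from itertools import combinations

random.seed(4)
n = 4
FULL = (1 << n) - 1
B = [{pa: random.random() for pa in range(FULL + 1) if not (pa >> i) & 1}
     for i in range(n)]

def A(i, S):
    """A_i(S) = sum of B_i(Pa) over Pa a subset of S (Eq. (11))."""
    pa, total = S, 0.0
    while True:
        total += B[i][pa]
        if pa == 0:
            return total
        pa = (pa - 1) & S

def bits(S):
    return [j for j in range(n) if (S >> j) & 1]

RR, H = {0: 1.0}, {0: 1.0}
for S in range(1, FULL + 1):
    rr = hh = 0.0
    for k in range(1, len(bits(S)) + 1):
        for T in combinations(bits(S), k):
            Tm = sum(1 << j for j in T)
            sign = (-1) ** (k + 1)
            pr = pf = 1.0
            for j in T:
                pr *= A(j, FULL ^ S)           # Eq. (13)
                pf *= A(j, S ^ Tm)             # Eq. (24)
            rr += sign * RR[S ^ Tm] * pr
            hh += sign * H[S ^ Tm] * pf
    RR[S], H[S] = rr, hh

def K(v, U):
    """K_v(U) of Eq. (29): alternating sum over T inside W = V - {v} - U."""
    W = FULL ^ (1 << v) ^ U
    total = 0.0
    for k in range(len(bits(W)) + 1):          # k = 0 gives the T = empty term
        for T in combinations(bits(W), k):
            Tm = sum(1 << j for j in T)
            prod = 1.0
            for j in T:
                prod *= A(j, U)
            total += (-1) ** k * RR[W ^ Tm] * prod
    return total

v = 1
rhs = sum(A(v, U) * H[U] * K(v, U)
          for U in range(FULL + 1) if not (U >> v) & 1)
assert abs(rhs - RR[FULL]) < 1e-9              # Eq. (30): both sides equal P(f, D)
```

Because the identity holds for any modular B_i, the same check passes for every choice of v.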
To compute K_v(U), we can precompute the products of the A_j. Define

    η(U, T) ≡ \prod_{j∈T} A_j(U).    (31)

Then K_v(U) can be computed as

    K_v(U) = \sum_{T ⊆ V−{v}−U} (−1)^{|T|} RR(V − {v} − U − T) η(U, T),    (32)

where η(U, T) can be computed recursively as

    η(U, T) = A_j(U) η(U, T − {j}) for any j ∈ T.    (33)
For a fixed U, all η(U, T) for T ⊆ V − {v} − U can be computed in 2^{n−1−|U|} time, and then K_v(U) can be computed in 2^{n−1−|U|} time. For a fixed v, all K_v(U) for U ⊆ V − {v} can be computed in \sum_{k=0}^{n−1} \binom{n−1}{k} 2^{n−1−k} = 3^{n−1} time.
Based on Proposition 3, to compute the posterior probabilities for all possible edges, we can use the following algorithm.

1. Precomputation. Set the constant feature f(G) ≡ 1. Compute the functions B_i, A_i, RR, H, and K_i.

2. For each edge u → v:

(a) For all S ⊆ V − {v}, recompute A_v(S).

(b) Compute P(f, D) using Eq. (30). Then P(f|D) = P(f, D)/RR(V).
For a fixed maximum indegree k, Step 1 takes O(n 3^n) time as discussed before, and Step 2 takes O(n^2 (k 2^n + 2^n)) time. It takes O(n 3^n + k n^2 2^n) total time to compute the posterior probabilities for all possible edges.
The computation time of Step 2 can be further reduced by a factor of n using the techniques described in [Koivisto, 2006]. Plugging the definition of A_v(U) into Eq. (30),

    P(f, D) = \sum_{U ⊆ V−{v}} [\sum_{Pa_v ⊆ U} B_v(Pa_v)] H(U) K_v(U)
      = \sum_{Pa_v ⊆ V−{v}} B_v(Pa_v) \sum_{Pa_v ⊆ U ⊆ V−{v}} H(U) K_v(U)
      = \sum_{Pa_v ⊆ V−{v}} B_v(Pa_v) Γ_v(Pa_v),    (34)
where for all Pa_v ⊆ V − {v} we define

    Γ_v(Pa_v) ≡ \sum_{Pa_v ⊆ U ⊆ V−{v}} H(U) K_v(U).    (35)
For a fixed maximum indegree k, since we set B_v(Pa_v) to be zero for Pa_v containing more than k variables, we need to compute the function Γ_v(Pa_v) only at sets Pa_v containing at most k elements, which can be done using the k-truncated downward Möbius transform algorithm described in [Koivisto, 2006] in O(k 2^n) time for all Pa_v (for a fixed v). Then P(f, D) for an edge u → v can be computed as

    P(u → v, D) = \sum_{u ∈ Pa_v ⊆ V−{v}, |Pa_v| ≤ k} B_v(Pa_v) Γ_v(Pa_v),    (36)

which takes O(n^k) time.
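The downward transform needed for Eq. (35) is the superset-sum analogue of the subset-sum DP from Section 3. Below is an untruncated O(n 2^n) sketch with illustrative random values standing in for H(U) K_v(U), again an editorial example rather than the paper's code (for simplicity the node v is treated as lying outside the n illustrated bits).

```python
import random

random.seed(5)
n = 4
g = [random.random() for _ in range(1 << n)]   # stands in for H(U) * K_v(U)

# In-place superset-sum DP: gamma[Pa] accumulates g over all supersets of Pa.
gamma = g[:]
for j in range(n):
    for S in range(1 << n):
        if not (S >> j) & 1:
            gamma[S] += gamma[S | (1 << j)]

# gamma[Pa] now equals sum over U superset of Pa of g[U], i.e. Gamma_v(Pa) of Eq. (35).
Pa = 0b0010
direct = sum(g[U] for U in range(1 << n) if U & Pa == Pa)
assert abs(gamma[Pa] - direct) < 1e-9
```

With an indegree bound k only the outputs with |Pa| ≤ k are needed, which the k-truncated variant in [Koivisto, 2006] exploits to run in O(k 2^n).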
In summary, we propose the algorithm in Figure 3
to compute the posterior probabilities for all possible
edges. The main result of the paper is summarized in
the following theorem.
Theorem 2 For a fixed maximum indegree k, the posterior probabilities for all n(n − 1) possible edges can be computed in O(n 3^n) total time.
5 Experimental Results
We have implemented the algorithm in Figure 3 in the C++ language and run some experiments to demonstrate its capabilities. We compared our implementation with REBEL², a C++ implementation of the DP algorithm in [Koivisto, 2006].

We used the BDe score for score_i(Pa_i : D) (with the hyperparameters α_{x_i, pa_i} = 1/(|Dm(X_i)| · |Dm(Pa_i)|)) [Heckerman et al., 1995]. In the experiments our algorithm used a uniform structure prior P(G) = 1, and REBEL used the structure prior specified in [Koivisto, 2006]. All the experiments were run under Linux on an ordinary desktop PC with a 3.0 GHz Intel Pentium processor and 2.0 GB of memory.
Algorithm Computing posteriors of all edges given maximum indegree k

1. Precomputation. Set the trivial feature f(G) ≡ 1.

(a) For all i ∈ V, Pa_i ⊆ V − {i} with |Pa_i| ≤ k, compute B_i(Pa_i). Time complexity O(n^{k+1}).

(b) For all i ∈ V, S ⊆ V − {i}, compute A_i(S). Time complexity O(k n 2^n).

(c) For all S ⊆ V, compute RR(S). Time complexity O(3^n).

(d) For all S ⊆ V, compute H(S). Time complexity O(n 3^n).

(e) For all i ∈ V, S ⊆ V − {i}, compute K_i(S). Time complexity O(n 3^n).

(f) For all i ∈ V, Pa_i ⊆ V − {i} with |Pa_i| ≤ k, compute Γ_i(Pa_i). Time complexity O(k n 2^n).

2. For each edge u → v, compute P(u → v, D) using Eq. (36), and output P(u → v|D) = P(u → v, D)/RR(V). Time complexity O(n^{k+2}).

Figure 3: Algorithm for computing the posterior probabilities for all possible edges in time complexity O(n 3^n), assuming a fixed maximum indegree k.
Table 1: The speed of our algorithm (in seconds).

Name        n    m    k    Ours      REBEL     TB
Iris        5    150  4    3.5e-3    3.1e-3    2.2e-3
TicTacToe   10   958  4    6.2e-1    5.1e-1    4.7e-1
                      5    1.1       9.4e-1    9.1e-1
                      6    1.5       1.4       1.3
                      9    1.9       1.7       1.7
Zoo         17   101  4    602.3     13.4      1.4
                      5    607.0     19.2      4.4
                      6    610.6     28.7      11.5
Synthetic   20   500  4    23083     128.3     9.2
5.1 Speed Test

We tested our algorithm on several data sets from the UCI Machine Learning Repository [Asuncion and Newman, 2007]: Iris, Tic-Tac-Toe, and Zoo. All the data sets contain discrete variables (or are discretized) and have no missing values. We also tested our algorithm on a synthetic data set that comes with REBEL. For each data set, we ran our algorithm and REBEL to compute the posterior probabilities for all potential edges. The time taken under different maximum indegrees k is reported in Table 1, which also lists the number of variables n and the number of instances m for each data set. We also show the time TB for computing the local scores in Table 1, as this time also depends on the number of instances m in a data set.
The results demonstrate that our algorithm is capable
of computing the posterior probabilities for all poten-
tial edges in networks over around n = 20 variables.
The memory requirement of the algorithm is O(n 2^n),
the same as REBEL, which will limit the use of the
algorithm to about n = 25 variables. It may take our
current implementation a few months for n = 25.
5.2 Comparison of computations
For the Tic-Tac-Toe data set with n = 10, our algorithm is capable of computing the "true" exact edge posterior probabilities by setting the maximum indegree k = 9,³ although an exhaustive enumeration of DAGs with n = 10 would not be feasible. We then vary the maximum indegree k and compare the edge posterior probabilities computed by our algorithm with the true probabilities. The results are shown as scatter plots in Figure 4 (note that in these graphs most of the points are located at (0, 0) or close nearby). Each point in a scatter plot corresponds to an edge, with its x and y coordinates denoting the posteriors computed by the two compared algorithms. We see that with the increase of k the computed probabilities gradually approach the true probabilities. With k = 3 the computed probabilities already converge to the true probabilities. Studying the effects of the approximation due to the maximum indegree restriction in general needs more substantial experiments and is beyond the scope of this paper.
We also compared the exact posterior probabilities computed by REBEL (setting k = 9) with the true probabilities. The results are shown in Figure 5. We

²REBEL is available at http://www.cs.helsinki.fi/u/mkhkoivi/REBEL/.

³We will call the exact posterior probabilities computed using the uniform structure prior P(G) = 1 the "true" probabilities.
Figure 4: Scatter plots that compare the posterior probability of edges on the Tic-Tac-Toe data set (n = 10) as computed by our algorithm with k = 1, 2, and 3 (one panel each) against the true posterior, computed by our method with k = 9.
Figure 5: A scatter plot that compares the posterior probability of edges on the Tic-Tac-Toe data set (n = 10) as computed by REBEL (k = 9) against the "true" posterior computed by our method (k = 9).
Figure 6: A scatter plot that compares the posterior probability of edges on the Synthetic data set (n = 20, k = 4) as computed by REBEL and by our algorithm.
see that the exact probabilities computed by REBEL without an indegree bound sometimes still differ from the true probabilities. This is due to the highly non-uniform structure prior used by REBEL.
We compared our algorithm with REBEL over a larger network, the synthetic data set with n = 20. The results are shown in Figure 6. We see that with the same maximum indegree, the computed probabilities often differ. Again, this can be attributed to the non-uniform structure prior used by REBEL.
6 Conclusion
We have presented an algorithm that can compute the exact marginal posterior probability of a single edge in O(3^n) time and the posterior probabilities for all n(n − 1) potential edges in O(n 3^n) total time. We demonstrated its capability on data sets containing up to 20 variables.
The main advantage of our algorithm over the current state-of-the-art algorithms for computing the posterior probabilities of structural features, the DP algorithm in [Koivisto, 2006] and the order MCMC in [Friedman and Koller, 2003], is that those algorithms require a special structure prior P(G) that is highly non-uniform, while we allow a general prior P(G).

Our algorithm computes exact posterior probabilities and works in moderate-size networks (about 20 variables), which makes it a useful tool for studying several problems in learning Bayesian networks. One application is to assess the quality of the DP algorithm under the influence of its non-uniform prior P(G). Another application is to study the effects of the approximation due to the maximum indegree restriction. We have shown some initial experimental results in Section 5. Other potential applications include assessing the quality of approximate algorithms (e.g., MCMC algorithms), studying the effects of data sample size on the learning results, and studying the effects of model parameters (such as parameter priors) on the learning results.
Acknowledgments
This research was partly supported by NSF grant IIS-
0347846.
Appendix: Proof of Proposition 3
For a fixed node v, we can break a DAG uniquely into the set of ancestors S of v and the set of nonancestors V − {v} − S. For v ∉ S, let \mathcal{G}_v(S) denote the set of DAGs over S ∪ {v} such that every node in S is an ancestor of v. Then the summation over all possible DAGs in Eq. (7) can be decomposed into
    P(f, D) = \sum_{S ⊆ V−{v}} [\sum_{G ∈ \mathcal{G}_v(S)} \prod_{i∈S∪{v}} B_i(Pa_i)] · [\sum_{G ∈ \mathcal{G}^+(V−{v}−S)} \prod_{i∈V−{v}−S} B_i(Pa_i)]
      = \sum_{S ⊆ V−{v}} LL_v(S) RR(V − {v} − S),    (37)
where for any S ⊆ V − {v} we define

    LL_v(S) ≡ \sum_{G ∈ \mathcal{G}_v(S)} \prod_{i∈S∪{v}} B_i(Pa_i).    (38)
\mathcal{G}_v(S) consists of the set of DAGs over S ∪ {v} in which v is the unique sink. We have

    \mathcal{G}(S ∪ {v}, {v}) = \mathcal{G}_v(S) ∪ (∪_{j∈S} \mathcal{G}(S ∪ {v}, {v, j})),    (39)

from which, by the weighted inclusion-exclusion principle, we obtain
    LL_v(S) = \sum_{G ∈ \mathcal{G}(S∪{v},{v})} \prod_{i∈S∪{v}} B_i(Pa_i) − \sum_{k=1}^{|S|} (−1)^{k+1} \sum_{T⊆S, |T|=k} \sum_{G ∈ \mathcal{G}(S∪{v}, T∪{v})} \prod_{i∈S∪{v}} B_i(Pa_i)
      = F(S ∪ {v}, {v}) − \sum_{k=1}^{|S|} (−1)^{k+1} \sum_{T⊆S, |T|=k} F(S ∪ {v}, T ∪ {v})
      = \sum_{T⊆S} (−1)^{|T|} F(S ∪ {v}, T ∪ {v})
      = \sum_{T⊆S} (−1)^{|T|} A_v(S − T) F(S, T)
      = \sum_{U⊆S} (−1)^{|U|+|S|} A_v(U) F(S, S − U).    (40)
Plugging Eq. (40) into Eq. (37), we obtain

    P(f, D) = \sum_{S ⊆ V−{v}} \sum_{U⊆S} (−1)^{|U|+|S|} A_v(U) F(S, S − U) RR(V − {v} − S)
      = \sum_{U ⊆ V−{v}} \sum_{U ⊆ S ⊆ V−{v}} (−1)^{|U|+|S|} A_v(U) F(S, S − U) RR(V − {v} − S)
      = \sum_{U ⊆ V−{v}} A_v(U) H(U) \sum_{U ⊆ S ⊆ V−{v}} (−1)^{|U|+|S|} \prod_{j∈S−U} A_j(U) RR(V − {v} − S)
      = \sum_{U ⊆ V−{v}} A_v(U) H(U) K_v(U),    (41)

where we have used the definition of the function K_v(U) in Eq. (29).
References
[Asuncion and Newman, 2007] A. Asuncion and D.J.
Newman. UCI machine learning repository, 2007.
[Cooper and Herskovits, 1992] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[Eaton and Murphy, 2007] D. Eaton and K. Murphy.
Bayesian structure learning using dynamic program-
ming and MCMC. In Proc. of Conference on Un-
certainty in Artificial Intelligence, 2007.
[Ellis and Wong, 2008] B. Ellis and W. H. Wong.
Learning causal Bayesian network structures from
experimental data. J. Am. Stat. Assoc., 103:778–
789, 2008.
[Friedman and Koller, 2003] Nir Friedman and Daphne Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1-2):95–125, 2003.
[Heckerman et al., 1995] D. Heckerman, D. Geiger,
and D.M. Chickering. Learning Bayesian networks:
The combination of knowledge and statistical data.
Machine Learning, 20:197–243, 1995.
[Heckerman et al., 1999] D. Heckerman, C. Meek, and
G. Cooper. A Bayesian approach to causal discov-
ery. In Glymour C. and Cooper G.F., editors, Com-
putation, Causation, and Discovery, Menlo Park,
CA, 1999. AAAI Press and MIT Press.
[Kennes and Smets, 1990] R. Kennes and P. Smets.
Computational aspects of the Möbius transformation. In P. B. Bonissone, M. Henrion, L. N. Kanal,
and J. F. Lemmer, editors, Proceedings of the Con-
ference on Uncertainty in Artificial Intelligence,
pages 401–416, 1990.
[Koivisto and Sood, 2004] M. Koivisto and K. Sood.
Exact Bayesian structure discovery in Bayesian net-
works. Journal of Machine Learning Research,
5:549–573, 2004.
[Koivisto, 2006] M. Koivisto. Advances in exact Bayesian structure discovery in Bayesian networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2006.
[Madigan and York, 1995] D. Madigan and J. York.
Bayesian graphical models for discrete data. Inter-
national Statistical Review, 63:215–232, 1995.
[Pearl, 2000] J. Pearl. Causality: Models, Reasoning,
and Inference. Cambridge University Press, NY,
2000.
[Silander and Myllymaki, 2006] T. Silander and P. Myllymaki. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2006.
[Singh and Moore, 2005] Ajit P. Singh and Andrew W. Moore. Finding optimal Bayesian networks by dynamic programming. Technical report, Carnegie Mellon University, School of Computer Science, 2005.
[Spirtes et al., 2001] P. Spirtes, C. Glymour, and
R. Scheines. Causation, Prediction, and Search (2nd
Edition). MIT Press, Cambridge, MA, 2001.