
Conditioning Graphs: Practical

Structures for Inference in Bayesian

Networks

A Thesis Submitted to the

College of Graduate Studies and Research

in Partial Fulfillment of the Requirements

for the degree of Doctor of Philosophy

in the Department of Computer Science

University of Saskatchewan

Saskatoon

By

Kevin John Grant

© Kevin John Grant, January/2007. All rights reserved.


Permission to Use

In presenting this thesis in partial fulfilment of the requirements for a Postgrad-

uate degree from the University of Saskatchewan, I agree that the Libraries of this

University may make it freely available for inspection. I further agree that permission

for copying of this thesis in any manner, in whole or in part, for scholarly purposes

may be granted by the professor or professors who supervised my thesis work or, in

their absence, by the Head of the Department or the Dean of the College in which

my thesis work was done. It is understood that any copying or publication or use of

this thesis or parts thereof for financial gain shall not be allowed without my written

permission. It is also understood that due recognition shall be given to me and to the

University of Saskatchewan in any scholarly use which may be made of any material

in my thesis.

Requests for permission to copy or to make other use of material in this thesis in

whole or part should be addressed to:

Head of the Department of Computer Science

176 Thorvaldson Building

110 Science Place

University of Saskatchewan

Saskatoon, Saskatchewan

Canada

S7N 5C9



Abstract

Probability is a useful tool for reasoning when faced with uncertainty. Bayesian

networks offer a compact representation of a probabilistic problem, exploiting inde-

pendence amongst variables that allows a factorization of the joint probability into

much smaller local probability distributions.

The standard approach to probabilistic inference in Bayesian networks is to com-

pile the graph into a join-tree, and perform computation over this secondary struc-

ture. While join-trees are among the most time-efficient methods of inference in

Bayesian networks, they are not appropriate for every application. The memory requirements of a join-tree can be prohibitively large. The algorithms for computing over join-trees are also large and involved, making them difficult to port to other systems or to understand for general programmers without Bayesian network expertise.

This thesis proposes a different method for probabilistic inference in Bayesian

networks. We present a data structure called a conditioning graph, which is a run-

time representation of Bayesian network inference. The structure mitigates many of

the problems of join-tree inference. For example, conditioning graphs require much

less space to store and compute over. The algorithm for calculating probabilities

from a conditioning graph is small and basic, making it portable to virtually any

architecture. And the details of Bayesian network inference are compiled away dur-

ing the construction of the conditioning graph, leaving an intuitive structure that is

easy to understand and implement without any Bayesian network expertise.

In addition to the conditioning graph architecture, we present several improvements to the model that maintain its small and simple style while reducing the runtime required to compute over it. We present two heuristics for choosing variable orderings that result in shallower elimination trees, reducing the overall complexity of computing over conditioning graphs. We also demonstrate several compile-time and runtime extensions to the algorithm that can produce substantial speedups while adding only a small space constant to the implementation. We also

show how to cache intermediate values in conditioning graphs during probabilistic computation, which allows conditioning graphs to match the speed of standard methods by avoiding duplicate computation, at the price of more memory.

The methods presented also conform to the basic style of the original algorithm.

We demonstrate a novel technique for reducing the amount of required memory for

caching.

We demonstrate empirically the compactness, portability, and ease of use of conditioning graphs. We also show that these optimizations allow conditioning graphs to behave competitively with standard methods in many circumstances, while still preserving their small and simple style. Finally, we show that the memory required under

caching can be quite modest, meaning that conditioning graphs can be competitive

with standard methods in terms of time, using a fraction of the memory.



Acknowledgements

A graduate degree is not an individual effort. It is the collaboration of many individuals,

direct and indirect. I mention a few here, but am grateful to all who were a part of this.

First, I wish to thank my supervisor, Michael Horsch. Mike was the supervisor that every grad student hopes for. He always had time and patience to discuss ideas, both good

and bad. His knowledge of my research topics helped to keep me progressing. Most of all,

I am thankful for his friendship. His faith in me was always an encouragement. Thanks

Mike.

My graduate studies program was funded in large part by the Natural Sciences and

Engineering Research Council of Canada (NSERC). Their support is greatly appreciated.

I am grateful for the guidance of my thesis committee members, who include Eric

Neufeld, Mark Keil, Winfried Grassmann, Mik Bickis, and Eugene Santos. Thank you for

your help.

I would like to express my gratitude to the people of our department. I would like to

thank our department heads (Jim Greer and Kevin Schneider) for giving me the opportu-

nity to teach. During my study, I have called upon the expertise of many members of our

faculty; thank you all for your help. I would also like to acknowledge the quiet, tireless

efforts of our support staff and office staff, for your help and patience in making sure my

computers ran, my papers printed, and my deadlines were met. Finally, a special thanks

to Eric Neufeld. When Mike left on sabbatical, I often called upon Eric for his expertise

in academic matters, and he always made time to answer my myriad questions.

I could not have done this without my family. To Kelly, Jessie, Stephanie and Aaron,

my grandparents, and Maureen’s family (to name only a few), thanks for all your support.

I especially wish to thank my parents. Their unfailing love, support, friendship, and

encouragement is the reason for any successes that I may enjoy.

Most of all, I wish to thank Maureen and Samantha. Maureen has given so much

during my study, and taken so little in return. Postgraduate education entails sacrifices,

and the two of you have made them without question.

Saskatoon, Saskatchewan Kevin John Grant

January 9, 2007



To my family.



Contents

Permission to Use . . . . . i
Abstract . . . . . ii
Acknowledgements . . . . . iv
Contents . . . . . vi
List of Tables . . . . . viii
List of Figures . . . . . ix

1 Introduction . . . . . 1
1.1 Overview . . . . . 5
1.2 Contributions and Outline . . . . . 8

2 Background and Related Work . . . . . 11
2.1 Probability and Bayesian Networks . . . . . 12
2.2 Common Methods for Inference . . . . . 21
2.2.1 Variable Elimination . . . . . 21
2.2.2 Junction-Tree Propagation . . . . . 23
2.3 Conditioning Algorithms . . . . . 26
2.3.1 Global Calculation . . . . . 26
2.4 Divide and Conquer Conditioning . . . . . 31
2.4.1 Caching in Recursive Decompositions . . . . . 34
2.5 Offline Compilation . . . . . 41
2.5.1 Query-DAGs . . . . . 41
2.5.2 Arithmetic Circuits . . . . . 44
2.6 Summary . . . . . 47

3 Conditioning Graphs . . . . . 48
3.1 Elimination Trees . . . . . 48
3.2 Conditioning Graphs . . . . . 55
3.3 Implementation Details . . . . . 60
3.3.1 Compilation . . . . . 60
3.3.2 Implementation . . . . . 61
3.4 Discussion . . . . . 62
3.5 Summary . . . . . 66

4 General Optimizations . . . . . 68
4.1 Introduction . . . . . 68
4.2 Building Shallow Elimination Trees . . . . . 69
4.2.1 Dtrees to Elimination Trees . . . . . 69
4.2.2 Better Elimination Orderings . . . . . 74
4.2.3 Evaluation . . . . . 78
4.3 Indexing Improvements . . . . . 82
4.4 Unobserved Leaf Variables . . . . . 86
4.5 Summary . . . . . 89

5 Application-specific Optimizations . . . . . 91
5.1 Introduction . . . . . 91
5.2 Compile-time Optimizations . . . . . 92
5.2.1 Sensor Models . . . . . 92
5.2.2 Query Variables . . . . . 94
5.3 Runtime Optimization . . . . . 97
5.3.1 Hoods . . . . . 98
5.3.2 Relevant Variables . . . . . 101
5.4 Summary . . . . . 114

6 Optimization through Caching . . . . . 116
6.1 Caching . . . . . 117
6.1.1 Incorporating Caching into Conditioning Graphs . . . . . 121
6.1.2 Partial Caching . . . . . 124
6.1.3 Dead Caches . . . . . 127
6.1.4 Subcaching . . . . . 130
6.2 Caching at Runtime . . . . . 136
6.2.1 Evaluation . . . . . 144
6.3 Summary . . . . . 150

7 Conclusions . . . . . 151
7.1 Future Work . . . . . 153

A Proof of Theorem 3.1.1 . . . . . 157

B A C Implementation of Conditioning Graphs . . . . . 160
B.1 Node Representation . . . . . 160
B.2 Inference Functions . . . . . 162
B.3 Compilation . . . . . 164
B.4 Example . . . . . 168

C A MIPS Implementation of Conditioning Graphs . . . . . 171
C.1 Node Representation . . . . . 171
C.2 Inference Functions . . . . . 173
C.3 Example . . . . . 178


List of Tables

2.1 Trace of visits to node labeled with {V} in Figure 2.8. . . . . . 35
2.2 The joint probability distribution and its annotation with evidence indicators. . . . . . 44
3.1 Size requirements (in MB) of JTP, VE, and Conditioning Graph (CG) storage and computation. . . . . . 64
4.1 Heights of constructed elimination trees on repository Bayesian networks using the modified min-size heuristic for lookahead. . . . . . 79
4.2 Heights of constructed elimination trees on ISCAS '85 benchmark circuits using the modified min-size heuristic for lookahead. . . . . . 80
4.3 Heights of constructed elimination trees on repository Bayesian networks using the modified min-fill heuristic for lookahead. . . . . . 81
4.4 Heights of constructed elimination trees on ISCAS '85 benchmark circuits using the modified min-fill heuristic for lookahead. . . . . . 81
6.1 Height vs. width of elimination trees on Bayesian networks from the network repository. . . . . . 122
6.2 The amount of memory required for caching over networks from the Bayesian network repository. . . . . . 130
6.3 Trace of visits to node D in Figure 6.12. . . . . . 131
6.4 Trace of visits to node D in Figure 6.12. . . . . . 132
6.5 The amount of memory required for caching over networks from the Bayesian network repository. . . . . . 134


List of Figures

1.1 The tradeoff between flexibility and expertise in Bayesian network software. . . . . . 4
2.1 The Asia network, an example of a small Bayesian network [40]. . . . . . 14
2.2 Examples of queries and their corresponding relevant networks, where barren variables have been greyed out. . . . . . 19
2.3 The Asia network after moralization. Note the marriage between T and L, and between C and B. As well, the direction of the arcs has been dropped. . . . . . 24
2.4 The Asia network, triangulated. . . . . . 24
2.5 A junction-tree for the Asia network. Clusters are shown as rectangles with rounded corners, and separator sets are shown as rectangles with square corners. . . . . . 25
2.6 The Asia network, conditioned on S. Notice that in each case, the network is singly connected, and the beliefs can be updated using message passing. . . . . . 29
2.7 An example Bayesian network (from Darwiche [15]) and a recursive decomposition of that network. The cutsets at each node are shown in each box. . . . . . 32
2.8 The Asia network, compiled into a dtree. . . . . . 35
2.9 The Asia dtree, with cache-domains shown to the right of the internal nodes. . . . . . 36
2.10 The Asia dtree, with dead caches grayed out. . . . . . 38
2.11 A portion of the Asia network and an example Q-DAG compilation given the query P(B|L). . . . . . 42
2.12 Two instantiations of the Q-DAG from Figure 2.11. . . . . . 43
2.13 The polynomial of Equation 2.19, shown in graph form. . . . . . 46
3.1 The Fire Bayesian network (taken from Poole et al. [53]). . . . . . 49
3.2 Two decompositions of the Fire network. . . . . . 50
3.3 Pseudocode for generating an elimination tree from a Bayesian network. . . . . . 52
3.4 Elimination tree construction using the elimtree algorithm, with the elimination ordering [R,S,T,L,F,A]. . . . . . 53
3.5 Code for processing an elimination tree given a context. . . . . . 54
3.6 The Alarm CPT sorted according to different variable orderings. . . . . . 56
3.7 The conditioning graph. . . . . . 57
3.8 The Query algorithm, which takes the root of the conditioning graph, and recursively computes the probability of the current context. Note that on Line 10, we are using integer division, so the fractional part of the result is dropped. . . . . . 59
3.9 The SetEvidence algorithm, which takes a node N containing variable V, and sets V's value to i, where i ∈ {0,...,mV − 1} ∪ {⋄}. . . . . . 60
3.10 The Fire elimination tree. Number of recursive calls to each node is shown beside (or below) the node. . . . . . 65
4.1 An example Bayesian network. . . . . . 70
4.2 An elimination tree for the Bayesian network in Figure 4.1. . . . . . 70
4.3 A dtree for the Bayesian network in Figure 4.1. . . . . . 71
4.4 The dtree to elimination tree conversion process. . . . . . 72
4.5 For this Bayesian network, an elimination ordering that is optimal for inference based on junction trees is the worst case for methods based on decomposition structures. . . . . . 75
4.6 A worst case elimination tree for the Bayesian network in Figure 4.5, constructed using the min-fill heuristic. . . . . . 75
4.7 Elimination tree construction using the described heuristic. . . . . . 77
4.8 The conditioning graph, with the scalar values for each secondary link shown. . . . . . 83
4.9 Algorithm for setting evidence, given that secondary scalar values are used. . . . . . 84
4.10 Algorithm for querying, given that secondary scalar values are used. . . . . . 85
4.11 Algorithm for querying, given that leaf variable nodes are labeled. . . . . . 87
4.12 The Fire elimination tree. Number of recursive calls to each node is shown beside (or below) the node. . . . . . 88
5.1 The new conditioning graph, which removes primary arcs from the sensor variables. Note that for space consideration, we use the CPT notation, rather than listing the array of values explicitly. . . . . . 93
5.2 Optimizing the conditioning graph. . . . . . 96
5.3 The hood of the Fire example, given sensor variables Smoke and Alarm. . . . . . 99
5.4 Algorithm for setting the evidence, incorporating changes to the hood. . . . . . 100
5.5 The Fire conditioning graph of Figure 5.3. Root arcs are shown with bold dotted lines. . . . . . 103
5.6 Algorithm for setting the evidence, maintaining labeling of barren nodes. . . . . . 105
5.7 The SetRelevant algorithm, which marks the active part of the conditioning graph for processing a particular query. . . . . . 106
5.8 The Fire conditioning graph, given no evidence. Irrelevant nodes are grayed out. . . . . . 108
5.9 The Fire conditioning graph, given L = 1 and R = 0. Irrelevant nodes are grayed out. . . . . . 108
5.10 The Fire conditioning graph, given L = 1, R = 0, and the query variable F. Irrelevant nodes are grayed out. . . . . . 109
5.11 The Fire conditioning graph, given L = 1, R = 0, and the query variable F. Irrelevant nodes are grayed out, and active nodes have darkened borders. . . . . . 109
5.12 The Query algorithm, using active and relevant nodes (Lines 03 and 05). . . . . . 110
5.13 Height difference between actual and relevant conditioning graph. . . . . . 112
5.14 Difference between relevant height of conditioning graph and network width. . . . . . 113
6.1 Elimination tree of Figure 4.6, with cache-domains shown above each node. . . . . . 118
6.2 Elimination tree of Figure 4.7, with cache-domains shown beside each node. . . . . . 118
6.3 Algorithm for processing an elimination tree given a context. . . . . . 119
6.4 Elimination tree of Figure 4.6, with recursive calls shown below each node. Model assumes no caching. . . . . . 121
6.5 Elimination tree of Figure 4.6, with recursive calls shown below each node. Model assumes full caching. . . . . . 121
6.6 The Fire conditioning graph, with tertiary arcs (double arcs) added for caching. Cache-domains are shown to the left of each internal node. . . . . . 122
6.7 Algorithm for setting evidence, given that we are caching, and secondary scalar values are used. . . . . . 124
6.8 Algorithm for querying, given that we are caching, and secondary scalar values are used. Note that cache values must be reset appropriately before calling this algorithm. . . . . . 125
6.9 Algorithm for querying, given that we are caching, and secondary scalar values are used. Note that cache values must be reset appropriately before calling this algorithm. . . . . . 126
6.10 The Fire conditioning graph, with the dead caches (and corresponding tertiary arcs) grayed out. . . . . . 127
6.11 The Fire elimination tree, in non-proper format. . . . . . 128
6.12 A partial elimination tree, with caches shown to the left of the nodes. Dead caches have been grayed out. . . . . . 131
6.13 Algorithm for querying, given that we are subcaching. . . . . . 135
6.14 The MakeCache algorithm, specifying the size of caches. Note that MakeCache must be run after SetRelevant. . . . . . 138
6.15 An example of a cache becoming 'dead' because of evidence. . . . . . 139
6.16 The MakeCache algorithm, specifying the size of caches, and labeling dead caches. . . . . . 141
6.17 Algorithm for setting evidence, given that caching and secondary scalar values are used. . . . . . 142
6.18 The MakeCache algorithm, specifying the size of subcaches, and labeling dead caches. . . . . . 143
6.19 An example of an elimination tree, where the evidence creates an empty cache-domain (node A). Dead caches are grayed out. . . . . . 144
6.20 Memory requirements for caching in effective conditioning graphs vs. actual conditioning graphs. . . . . . 145
6.21 Memory requirements for caching in effective conditioning graphs vs. actual conditioning graphs (continued). . . . . . 146
6.22 Memory requirements for caching in effective conditioning graphs vs. actual conditioning graphs (continued). . . . . . 147
6.23 Memory requirements for caching in effective conditioning graphs vs. actual conditioning graphs (continued). . . . . . 148
6.24 Memory requirements for caching in effective conditioning graphs vs. actual conditioning graphs (continued). . . . . . 149


Chapter 1

Introduction

In real world applications, rational agents, whether natural or artificial, rarely

have access to full information about their environment. This information may be

fundamental to making decisions and choosing actions, the role of any agent. Many

physical and temporal obstructions exist, making information unobservable. One

way to deal with uncertainty is to employ probability over events that are not directly

observable. Probability is a popular measure of belief in instances of uncertainty.

Bayesian networks [33,50] are a knowledge representation tool used to represent

information about a problem in which there is uncertainty. Computing the prob-

ability of an event in a Bayesian network is a task often referred to as inference.

Although computing exact probabilities from a Bayesian network is NP-hard [12],

many algorithms have been designed to exploit certain properties of these networks,

allowing efficient calculation of probabilities in many cases [20,35,40,50,57,65]. Most

inference algorithms exploit the independencies of a probabilistic model to compute

probabilities efficiently. Over the last two decades, Bayesian networks have proven

themselves successful in applications requiring decision making under uncertainty,

including medical diagnosis [7], classification [29], forecasting [2], and fault diag-

nosis [41] (Neapolitan [46] gives an excellent survey on current Bayesian network

applications.)
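As a small illustration of the independence-based factorization mentioned above (our own example, not one of the networks studied in this thesis): in a three-variable network where B and C each have A as their only parent, the joint distribution factors into a product of small local distributions,

```latex
P(A, B, C) \;=\; P(A)\, P(B \mid A)\, P(C \mid A)
```

so three small conditional probability tables stand in for one table over every joint configuration, and the savings grow exponentially as more variables with sparse parent sets are added.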

Regardless of how probabilities are computed, however, the abstraction of a prob-

lem involving probability can be quite simple: given a context of the universe, com-

pute the probability of a particular event. Abstraction is a popular concept in com-

puter science. Programmers can use libraries without ever considering their details.

This approach increases programming efficiency; it relieves the programmer of hav-



ing to code the component, and the library implementation is typically a good one.

Abstraction allows a programmer to use libraries armed only with the knowledge of

what it does, without knowing how it does it.

This abstraction occurs in Bayesian network software as well. Several commercial

and academic software packages have been written that absolve programmers of

most details of inference. The programmer need only supply a Bayesian network

representing the problem, and the inference engine will calculate probabilities over

the events of interest.¹ While these software packages provide a good solution for

many users, they are not universally applicable. Some reasons that may preclude

their use include the following:

1. Most software packages provide extensive services with their product for per-

forming other kinds of inference (e.g., most probable explanation [50], sensi-

tivity analysis [10], etc). While these extra services are useful for some ap-

plications, they tend to bloat the software, which gives the program and its

available libraries a non-trivial memory footprint.

2. Most Bayesian network software libraries compile the network to a secondary

structure called a junction-tree (junction-trees will be discussed in Chapter 2).

While computing over junction-trees is a time-efficient means of calculation,

its space requirements can be prohibitive for some problems.

3. Most software packages and their libraries are written for standard operating

systems and programming languages, excluding more primitive environments

such as embedded systems. Porting these applications to another environment

would be time-consuming and error-prone.

4. The complexity of the software makes it difficult to get time and space guar-

antees. While most software can make rough predictions on time and space,

these estimates are typically not given in terms of the number of low-level operations,

¹Throughout the rest of this document, we assume the existence of the Bayesian network. The acquisition of a Bayesian network can be done in a number of ways (human expert, learning algorithms), but is unimportant to the concepts of this thesis.



or number of bytes. This could be critical for real-time or memory-limited

applications.

When an existing software package cannot be used, a user has no choice but to

implement their own inference engine (and network representation). At this point,

abstraction must be forsaken — the user must consider the details in order to

implement; such an implementation assumes at least basic expertise in the area.

Probability itself is a universal tool, and should not be considered a concept ex-

clusive to experts in probability calculus. For instance, an 80% prediction of bad

weather is typically enough to alter anyone’s picnic plans, regardless of mathemat-

ical background. A simple fault diagnosis algorithm may only report a problem if

the probability of a fault exceeds a threshold. A navigational unit may choose its

direction based on a probabilistic assessment of its location. A virtual opponent in a

computer game may choose its actions against the user based on the statistics of pre-

vious game-play. So while the use of probability is general, calculating probabilities

from a Bayesian network still seems the domain of experts, or those with sufficient

architecture and software libraries (and some basic knowledge of Bayesian network

principles).

The aforementioned problems suggest a tradeoff between flexibility and expertise

in Bayesian networks. Figure 1.1 demonstrates this visually. When a programmer

lacks knowledge of Bayesian network inference, then third-party software must be

used, with its hardware constraints and large libraries mentioned previously (Figure

1.1(a)). On the other hand, when a programmer wishes to implement a Bayesian net-

work application on a system not supported by existing software, that programmer

will have to program the inference engine, which requires expertise in that domain

(Figure 1.1(b)). Expertise is also required in situations where the user needs precise

knowledge of the amount of time/memory the application is going to take (a concept

referred to in this document as assessibility).

In this thesis, we consider a different type of abstraction than is typical of

Bayesian network software. Abstraction typically ignores the implementation de-

tails of code. This is important, as it allows many lines of code to be summarized



[Figure 1.1: The tradeoff between flexibility and expertise in Bayesian network software. Each panel plots architecture (specific hardware vs. portable, assessible) against expertise (expert vs. novice). (a) Novice users must use third-party software, which may have hardware constraints and large libraries. (b) Programmers implementing inference algorithms require some background in Bayesian network technology. (c) With conditioning graphs, the tradeoff is reduced.]



through a small, easy to understand interface. As mentioned, such an abstraction

works fine if no implementation or understanding of the code is required. How-

ever, while the concepts of Bayesian network inference are reserved to those with

expertise, programming constructs (conditional execution, control flow) are a uni-

versal language amongst all programmers. If we can compile inference into a series

of simple programming constructs, then it should be accessible to any programmer,

regardless of background. Hence, rather than presenting the user with concepts in

an abstract fashion, we present them in terms of code. Such a system would allow

programmers without Bayesian network expertise to implement software to compute

over Bayesian networks, effectively reducing the tradeoff discussed previously (Figure

1.1(c)).

1.1 Overview

As a first step in overcoming some of the aforementioned barriers to using a Bayesian

network, we compile it into a secondary structure called a conditioning graph. The

graph is based on a tree structure called an elimination tree, extended by adding secondary arcs from internal nodes to leaves. The inference algorithm computes over

this graph, rather than the original network, in a simple depth-first traversal that

should be very accessible to programmers.

Conditioning graphs abstract away the need for expertise in Bayesian networks.

Both the conditioning graph and its inference algorithm are presented using low-level

programming constructs. The graph itself is represented by primitive data: integers,

pointers, and floating point numbers. While these elements correspond to the local

distributions and context of the Bayesian network, such details are abstracted away.

The inference algorithm is so small that its memory footprint is negligible, and it is

simple enough to be implemented on any architecture (no special libraries or abstract

data types required).
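To make this concrete, the kind of representation we have in mind can be sketched with a few primitive fields per node. The field names below are illustrative assumptions of ours, not the actual structure defined in later chapters:

```python
# Hypothetical node layout for a conditioning graph, illustrating that only
# primitive data is needed: integers (variable indices), floats (CPT entries),
# and pointers (child references). Field names are illustrative only.

class CGNode:
    def __init__(self, var, cpt=None):
        self.var = var            # integer index of the node's variable
        self.value = -1           # current instantiation (-1 = unset)
        self.cpt = cpt or []      # flat list of floats (leaf nodes only)
        self.primary = []         # primary arcs: children in the elimination tree
        self.secondary = []       # secondary arcs: leaves whose CPTs mention var

# A two-node fragment: an internal node pointing at a leaf that stores a CPT.
root = CGNode(0)
leaf = CGNode(1, cpt=[0.2, 0.8])
root.primary.append(leaf)
root.secondary.append(leaf)
```

Nothing here requires abstract data types or library support, which is the point: the whole structure is a handful of arrays and references.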

The conversion of the Bayesian network to a conditioning graph is considered a

compilation step, or offline computation; such compilation steps are common when performing inference over Bayesian networks. Offline computation is not considered

part of the final program, and therefore its resource requirements can be ignored.

For junction-tree algorithms [40], the conversion of a Bayesian network to a junction-

tree is considered a compile-time step. For Variable Elimination [23,65], finding a

good elimination ordering is considered compile-time and not part of the inference

problem.² For recursive conditioning [15], construction of the dtree is compile-time.

Conditioning graphs are designed to address the problems described in the pre-

vious section. These problems, along with our proposed solution, are categorized

according to the requirement that they address:

1. Space Complexity. The space complexity of inference in Bayesian networks

using the most common inference techniques (junction-tree and variable elimination) is exponential in the treewidth of the network. Even moderately sized

networks can test the limits of standard computers, and some larger Bayesian

networks are too large to compute even on high-performance machines. In

contrast, our model uses conditioning, and the extra structure required for the

model, in addition to CPT storage, is linear in the number of variables in the

network.

2. Portability. Current software libraries are designed for common operating sys-

tems and programming languages. Support for other architectures is limited.

The conditioning graph structure is compact and consists entirely of primitive

data types. The inference algorithm is written with low-level programming in-

structions common to most languages. These details make conditioning graphs

portable to any architecture.

3. Memory Footprint. Commercial software libraries for inference are typically

large. However, the amount of memory required to perform inference over a

Bayesian network is usually much more than the space required to store the

²This is considered compile-time if the same elimination order is used for all queries. Some

authors advocate a dynamic elimination ordering based on the current context of the problem [55],

in which case, computing an elimination order is no longer a compile-time problem.


programs; hence, the space contributions of these programs are usually ignored.

However, the size of such implementations can become a factor, especially in applications

where resources are limited. Our inference algorithm requires only several lines

of code; the storage of our algorithm requires trivial amounts of space.

4. Accessibility. Inference in Bayesian networks involves specific terminology and

operations, such as marginalization, normalization, and observation, all of which require some background for understanding. This is further complicated by the

fact that most software packages compile the Bayesian network to a junction-

tree, which requires even further expertise (triangulation, message-passing, etc.). Conditioning graphs eliminate the Bayesian network-specific details, leaving the user with an easily accessible structure written in generic programming

ing the user with an easily accessible structure written in generic programming

constructs.

5. Assessability. The simple design of the conditioning graph, along with the succinct nature of its inference algorithm, allows for an accurate prediction of exactly how much memory will be required, in terms of bytes. Knowing this in advance is advantageous when space is limited. Also, the simplicity of the model makes it easy to interpret this information, even for a non-expert.

These same arguments apply to time. The number of floating point operations or recursive calls can be easily and quickly quantified by a general user. And because the algorithm is compact, its compiled form is small enough that its operations can be easily counted, allowing a very accurate time assessment.

6. Abstraction. As mentioned, abstraction relieves a user of any details that are

unnecessary to use a library. By making the details of the computation accessi-

ble to any user, we in effect remove some of this abstraction. However, in some

cases, the abstraction of detail is still necessary. While our implementation is

accessible by almost any programmer, the conditioning graph structure is in-

tended to be automatically compiled from a Bayesian network, and an interface


is provided to maintain software engineering principles of abstraction.

7. Time. Inference in Bayesian networks is NP-hard [12,14].

In addition, while conditioning algorithms require less space than junction-tree

and variable elimination, they are typically less time-efficient than those methods. While we cannot avoid exponential runtimes in all cases, we

show how our model can exploit well-known application-specific independence,

such as d-separation [50] and barren variables [59]. These optimizations can

make our model competitive with standard algorithms in many circumstances.

As well, because our model is a recursive decomposition, we can use caching [15], which offers a speedup when extra space is available. In other words,

while the space required to store conditioning graphs is much less than is used

by standard inference algorithms, it does not have to be, and we can take

advantage of extra space to reduce runtimes.

The techniques that we discuss in this thesis will allow a user of Bayesian networks

to compile a sophisticated probabilistic model into a compact and simple component;

compact enough so that the model and the inference algorithm can be implemented

in a memory-restricted environment (e.g., cameras, cell phones, appliances, etc.),

and simple enough to be accessible by most programmers. Bayesian networks can

no longer be considered impractical to use.

1.2 Contributions and Outline

The outline of the remainder of this document is as follows. Chapter 2 describes

background work upon which conditioning graphs are built. This includes a review

of inference in Bayesian networks. We focus on conditioning algorithms, specifically

recursive decomposition algorithms [11,15], as the conditioning graph structure is a

variant of other recursive decompositions. We also review junction-tree and variable

elimination, for comparison purposes. We also review low-level precompiled inference structures (Query-DAGs [18] and Arithmetic Circuits [16]), as they represent a similar method for abstracting away the details of inference.


The primary contribution of this dissertation is the conditioning graph architecture. The underlying structure of a conditioning graph, its elimination tree, is a variation of a dtree [15,44]; its specific differences are outlined in Chapters 3 and

4. The function for calculating probabilities from the structure is a variation of the

Recursive Conditioning algorithm [15], modified to compute over elimination trees.

However, ours is a low-level approach, presented in such a way as to abstract away

any Bayesian network-specific details, and disambiguate the programming of the

structure in general.

Chapter 4 presents methods for balancing elimination trees, in order to optimize

the runtime of conditioning graphs. We present a conversion process from dtrees [15]

to elimination trees, in order to take advantage of dtree balancing methods. We also

present a new set of heuristics for finding good elimination orderings, and empirically

show that these heuristics produce more efficient elimination trees than previous

approaches. We also contribute two other optimizations for the conditioning graph

architecture in Chapter 4.

Chapter 5 presents compile-time optimizations to the conditioning graph struc-

ture that exploit knowledge of evidence variables and query variables. We show how

sensor variables (variables that will always have an observed value) can be sepa-

rated from the graph, such that we can reduce computation at runtime. We also

show how to perform partial elimination in conditioning graphs to remove certain

variables from the computation (i.e., variables that will not be observed or queried),

also improving runtime performance. While other elimination/conditioning hybrids

have been proposed, they have not yet been demonstrated in elimination trees.

Chapter 5 also presents runtime optimizations to the conditioning graph. We

demonstrate a novel technique for maintaining evidence variable separation from

the elimination tree dynamically. As well, we show how to exploit the well-known

independence of d-separation [50] and barren variables [59] in conditioning graphs,

incurring only linear time and space cost.

Chapter 6 considers caching in conditioning graphs. Caching in dtrees [11,15]

allows computation of probabilities to be as fast as elimination algorithms, at the


price of exponential memory. However, partial caching methods [4] and cache prun-

ing methods [3] can reduce the memory costs of caching, while still providing speedup.

In this chapter, we will demonstrate how these methods can be applied to condition-

ing graphs, both as a separate optimization, and in combination with those from

Chapter 5. We also present a new approach for pruning the domains of the caches

that considerably reduces the amount of memory required. The resulting model al-

lows the calculation of probabilities with the same time complexity as the current

standard algorithms, but with a fraction of the space.

The contributions of this thesis, and the future directions of this research, are

summarized in Chapter 7.


Chapter 2

Background and Related Work

This chapter presents the necessary background knowledge to understand the

methods and motivations of this dissertation. It begins with a light review of Bayesian

networks, including their structure, properties, and semantics. This beginning sec-

tion also introduces the terms and notations that will be used for the remainder of

the document.

The remaining sections of the chapter are devoted to reviewing inference tech-

niques for Bayesian networks. The first section reviews the more popular methods

of inference: junction-tree propagation (JTP) and variable elimination (VE). The

methods of this thesis do not rely directly on these methods. However, we include

them both for completeness and comparison, which we feel is important given their

prevalence at the time of writing.

The second section reviews inference methods employing conditioning. Condi-

tioning is a “reasoning by cases” approach [15], and its conservative use of mem-

ory makes it an attractive option for large, highly connected networks, situations

where JTP and VE use large amounts of space. We focus primarily on recursive decompositions, as our conditioning graph is a recursive data structure, and this close relationship allows us to borrow from these previous techniques in its design.

The final section reviews some previous Bayesian network compilation methods that compile away the details of inference offline. The first, Query-DAGs (Q-DAGs) [18], represents an inference operation as an arithmetic equation, parameterized by the evidence (in graphical form). The second, Arithmetic Circuits (ACs) [16], is a similar compilation that allows computation of derivatives from which values of interest (posterior probabilities, sensitivity) can be derived in


constant time.

2.1 Probability and Bayesian Networks

We denote random variables with capital letters (e.g., X, Y, Z), and sets of variables with boldfaced capital letters X = {X_1, ..., X_n}. Each random variable V has an associated domain D(V) = {v_1, ..., v_k}. Only finite discrete variable domains are considered in this document. An instantiation of X, which is the assignment of variable X to a value x in its domain, is denoted X = x, or x for short. A context over a set of variables X = {X_1, ..., X_k} is the conjunction of an instantiation of each variable in X, and is denoted X = x or x for short. The set of all possible contexts over a set X is denoted as D(X). The size of a set X is denoted by |X|.

We will denote a distribution over a set of variables using function notation (e.g., f(X)). In cases where it is clear that the notation refers to a function, we may omit the parentheses (e.g., f). We will overload the term domain to refer to the set of variables over which a function is defined (in addition to referring to the values of a random variable).

A Bayesian network [50] is a tuple ⟨G, P⟩, where G = ⟨X, A⟩ is a directed acyclic graph (DAG), X = {X_1, ..., X_n} is a set of random variables, A (arcs) represents direct causal influences between the variables, and P is a probability distribution over X, such that for each X_i ∈ X, instantiating its parent set in the DAG (denoted Π_i) renders X_i probabilistically independent of its graph nondescendents (denoted ND_i):

∀Y ⊆ ND_i    P(X_i | Y, Π_i) = P(X_i | Π_i)    (2.1)

Since G has no directed cycles, it imposes a partial ordering on the variables of X [64]. Formally, if X_i is an ancestor of X_j in the DAG, then X_i must come before X_j in any ordering consistent with the partial ordering. Assume without loss of generality that the node ordering X_1, ..., X_n is a total ordering consistent with the partial ordering of G. By the definition of conditional probability, the joint probability P can be


rewritten in terms of conditional probabilities:

P(X_1, ..., X_n) = P(X_n | X_{n−1}, ..., X_1) P(X_{n−1}, ..., X_1)    (2.2)

which can be recursively factorized into:

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | X_{i−1}, ..., X_1).    (2.3)

Equation 2.1 can be substituted into Equation 2.3 to obtain the following:

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | Π_i).    (2.4)

Stated another way, the joint probability distribution can be represented as a prod-

uct of local probability distributions, called conditional probability tables (CPTs).

The space complexity of this factorized representation is exponential only in the size of the largest family (a variable plus its parent set), which in the worst case has the same space complexity as the entire joint probability table, but is typically much smaller.

An example of a Bayesian network is given in Figure 2.1. Its purpose is to

represent the following fictitious knowledge [40]:

Shortness-of-breath (dyspnoea) may be due to tuberculosis, lung cancer

or bronchitis, or none of them, or more than one of them. A recent visit

to Asia increases the chances of tuberculosis, while smoking is known to be

a risk factor for both lung cancer and bronchitis. The results of a single

chest X-ray do not discriminate between lung cancer and tuberculosis, as

neither does the presence or absence of dyspnoea.

Each of the 8 variables in the graph is binary. To represent the joint probability P(V, S, T, L, B, C, X, D) would require a table of 2^8 = 256 values. The factorization

of the joint according to the network is

P(V)P(S)P(T|V)P(L|S)P(B|S)P(C|T,L)P(X|C)P(D|C,B)

which requires only 36 values, or 14% of the original.
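The counting above can be checked mechanically from the parent sets. The short script below is our own illustration; all variables are binary:

```python
# Verify the storage counts for the Asia network: the full joint over 8
# binary variables versus the sum of the CPT sizes in the factorization.

parents = {
    'V': [], 'S': [], 'T': ['V'], 'L': ['S'], 'B': ['S'],
    'C': ['T', 'L'], 'X': ['C'], 'D': ['C', 'B'],
}

joint_size = 2 ** len(parents)                       # 2^8 = 256
cpt_sizes = {v: 2 ** (1 + len(ps)) for v, ps in parents.items()}
factored_size = sum(cpt_sizes.values())              # 2+2+4+4+4+8+4+8 = 36

print(joint_size, factored_size)  # 256 36
```

The largest single CPT (for C or D, each with two parents) holds only 8 values, which is the "largest family" term in the space complexity above.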

Given a Bayesian network, a common goal is to compute the posterior probability


Figure 2.1: The Asia network, an example of a small Bayesian network [40]. Nodes and arcs: V → T; S → L; S → B; T, L → C; C → X; C, B → D. The CPTs are:

P(V=yes) = 0.01    P(V=no) = 0.99
P(S=yes) = 0.5    P(S=no) = 0.5
P(T=yes|V=yes) = 0.05    P(T=no|V=yes) = 0.95
P(T=yes|V=no) = 0.01    P(T=no|V=no) = 0.99
P(L=yes|S=yes) = 0.10    P(L=no|S=yes) = 0.90
P(L=yes|S=no) = 0.01    P(L=no|S=no) = 0.99
P(B=yes|S=yes) = 0.6    P(B=no|S=yes) = 0.4
P(B=yes|S=no) = 0.3    P(B=no|S=no) = 0.7
P(C=yes|T=yes,L=yes) = 1.0    P(C=no|T=yes,L=yes) = 0.0
P(C=yes|T=yes,L=no) = 1.0    P(C=no|T=yes,L=no) = 0.0
P(C=yes|T=no,L=yes) = 1.0    P(C=no|T=no,L=yes) = 0.0
P(C=yes|T=no,L=no) = 0.0    P(C=no|T=no,L=no) = 1.0
P(X=yes|C=yes) = 0.98    P(X=no|C=yes) = 0.02
P(X=yes|C=no) = 0.05    P(X=no|C=no) = 0.95
P(D=yes|C=yes,B=yes) = 0.9    P(D=no|C=yes,B=yes) = 0.1
P(D=yes|C=yes,B=no) = 0.7    P(D=no|C=yes,B=no) = 0.3
P(D=yes|C=no,B=yes) = 0.8    P(D=no|C=no,B=yes) = 0.2
P(D=yes|C=no,B=no) = 0.1    P(D=no|C=no,B=no) = 0.9


distribution, or belief, of a variable given a context over some variables in the network

(called evidence). For example, given the Asia network in Figure 2.1, one might be

interested in the probability that a patient has lung cancer in light of a positive

X-ray result. Calculating a posterior probability distribution over a variable or set

of variables in a Bayesian network is a task often referred to as inference.

Formally, let X be the set of variables of a Bayesian network. Let E = e be a context over a subset of X. Finally, let X_i be a variable in X that is not in E. The posterior probability distribution over X_i, given that E = e, is defined as follows:

P(X_i | E = e) = α Σ_{x' ∈ D(X')} P(x', X_i, e)    (2.5)

where X' = X − (E ∪ {X_i}), and α is the normalizing constant P(E = e)^{−1}. For readability, when summing out a variable X_i from a distribution, we will often use the notation X_i in place of X_i = x_i; this allows us to write the above equation as:

P(X_i | E = e) = α Σ_{X'} P(X', X_i, e)    (2.6)

When calculating the posterior probability distribution over a variable X_i (or set of variables X), we will refer to X_i (X) as the query variable(s), or simply the query for short.

The factorization of the joint distribution according to the Bayesian network reduces the complexity of inference. Given distributions f and g, denote by dom(f) the variables over which f is defined. If X_j ∉ dom(f), then the following equality holds [37]:

Σ_{X_j} f · g = f · Σ_{X_j} g    (2.7)
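Equation 2.7 is easy to verify numerically. The sketch below is our own illustration, with factors represented as plain dictionaries over binary variables; it checks that summing out a variable after multiplication equals pushing the sum inward:

```python
from itertools import product

# Numeric check of Equation 2.7: if Xj is not in dom(f), then
# sum_Xj (f * g) == f * (sum_Xj g).  A factor is a dict mapping
# assignment tuples to floats, paired with a list of variable names.

def multiply(f, fvars, g, gvars):
    out_vars = fvars + [v for v in gvars if v not in fvars]
    out = {}
    for vals in product([0, 1], repeat=len(out_vars)):
        ctx = dict(zip(out_vars, vals))
        out[vals] = (f[tuple(ctx[v] for v in fvars)] *
                     g[tuple(ctx[v] for v in gvars)])
    return out, out_vars

def marginalize(f, fvars, xj):
    j = fvars.index(xj)
    out = {}
    for vals, p in f.items():
        key = tuple(v for i, v in enumerate(vals) if i != j)
        out[key] = out.get(key, 0.0) + p
    return out, [v for v in fvars if v != xj]

f = {(0,): 0.3, (1,): 0.7}                      # f(A); note B not in dom(f)
g = {(0, 0): 0.2, (0, 1): 0.8,                  # g(A, B)
     (1, 0): 0.6, (1, 1): 0.4}

lhs, lv = marginalize(*multiply(f, ['A'], g, ['A', 'B']), 'B')
rhs, rv = multiply(f, ['A'], *marginalize(g, ['A', 'B'], 'B'))
assert lv == rv == ['A'] and all(abs(lhs[k] - rhs[k]) < 1e-12 for k in lhs)
```

The right-hand side is cheaper in general: it never materializes a table over dom(f) ∪ dom(g), which is precisely why pushing sums inward reduces the size of intermediate distributions.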

Consider the substitution of Equation 2.4 into Equation 2.6:

P(X_i | e) = α Σ_{X'} ∏_{k=1}^{n} P(X_k | Π_k, e).    (2.8)

Let Y_j be a set that holds X_j and all variables in X for which X_j is a parent. Let Y'_j = X − Y_j. Equation 2.7 allows us to rewrite Equation 2.8 as:

P(X_i | e) = α Σ_{X' − {X_j}} ∏_{X_m ∈ Y'_j} P(X_m | Π_m, e) Σ_{X_j} ∏_{X_k ∈ Y_j} P(X_k | Π_k, e)    (2.9)

This process can be done for each variable, that is, multiply all of the distributions

defined over X_j, marginalize X_j from the distribution, and return it to the pool

of distributions. This is exactly the basis of the VE algorithm (discussed in later

sections). The complexity of the algorithm is linear in the size of the largest intermediate distribution (the distribution created as a result of marginalizing a single

variable).

As an example, suppose a user is interested in the posterior probability distribution over the variable X-ray (X), given no evidence. This can be written as follows:¹

P(X) = Σ_{X − {X}} P(V) P(S) P(T|V) P(L|S) P(B|S) P(C|T,L) P(X|C) P(D|C,B)    (2.10)

where X denotes the set of all the variables in the Asia problem. Rather than performing an entire recombination, V can first be marginalized out by combining all of the factors that include V in their definition:

P(X) = Σ_{X − {X,V}} P(D|C,B) P(X|C) P(S) P(B|S) P(L|S) P(C|T,L) Σ_V P(V) P(T|V)    (2.11)

If the variable elimination ordering {V, T, L, S, B, C, D} is used, then the above equation can be rewritten as:

P(X) = Σ_D Σ_C P(X|C) Σ_B P(D|C,B) Σ_S P(S) P(B|S) Σ_L P(L|S) Σ_T P(C|T,L) Σ_V P(V) P(T|V)    (2.12)

Expanding this out, and denoting by f_Y the intermediate distribution created by marginalizing Y, the size requirements of the intermediate distributions become clear:

¹We exclude α from these equations for space considerations, and because no normalization is required when there is no evidence.

P(X) = Σ_D Σ_C P(X|C) Σ_B P(D|C,B) Σ_S P(S) P(B|S) Σ_L P(L|S) Σ_T P(C|T,L) Σ_V P(V) P(T|V)
     = Σ_D Σ_C P(X|C) Σ_B P(D|C,B) Σ_S P(S) P(B|S) Σ_L P(L|S) Σ_T P(C|T,L) f_V(T)
     = Σ_D Σ_C P(X|C) Σ_B P(D|C,B) Σ_S P(S) P(B|S) Σ_L P(L|S) f_T(C,L)
     = Σ_D Σ_C P(X|C) Σ_B P(D|C,B) Σ_S P(S) P(B|S) f_L(S,C)
     = Σ_D Σ_C P(X|C) Σ_B P(D|C,B) f_S(B,C)
     = Σ_D Σ_C P(X|C) f_B(D,C)
     = Σ_D f_C(X,D)
     = f_D(X)

Notice that no intermediate distribution computed during this process contained more than 3 variables. The efficiency of this process depends on the order in which the variables are selected to be marginalized. For instance, suppose the following ordering was chosen: {C, D, V, T, L, S, B}. The equation becomes:

P(X) = Σ_B Σ_S P(S) P(B|S) Σ_L P(L|S) Σ_T Σ_V P(V) P(T|V) Σ_D Σ_C P(X|C) P(D|C,B) P(C|T,L)    (2.13)

The first summation takes place over a distribution of six variables. Choosing an

optimal variable ordering for this process is NP-hard [38]; however, several heuristics

exist that give good orderings in polynomial time [38,56]. We examine the effect of

variable orderings on the complexity of inference more closely in Chapter 4.

It is not always the case that we need to compute over the entire network. Cir-

cumstances exist when variables and their associated distributions do not contribute

anything to the query value. Two classes of such variables are barren variables and

d-separated variables. We consider each in turn, and show how they can affect the

size of the effective network (the subgraph of the original Bayesian network over

which we need to compute).

A barren variable [59,60] is a variable that is not part of the query or evidence,


and is either childless or has all barren descendents. Consider the Asia network, and recall our previous query, P(X). Dyspnea (D) qualifies as a barren variable, since it is not part of the query, not observed, and has no descendents. Bronchitis is also barren, as it is not a query or observed node, and its only descendent is barren. Barren variables are computationally irrelevant to probability computations in Bayesian networks, and thus can be excluded. To see this, first consider that if a node X_i is not a query or observed node, then it must be marginalized out. Furthermore, if X_i has no descendents, then it is only defined in one CPT, namely P(X_i | Π_i). Hence, marginalization of this variable gives Σ_{X_i} P(X_i | Π_i) = 1. If X_i has descendents, and they are all barren, then by marginalizing its descendents first, we end up with the same situation (X_i defined only in its own distribution).

Pruning barren variables from the network requires linear time in the number

of variables, and can sometimes lead to a substantial decrease in the size of the

network. Figure 2.2 shows the Asia network given different queries. Barren variables

often comprise a considerable portion of the Bayesian network, especially when the

observations and queries are localized to a particular section of the network, and

even more so when those observations/queries are shallow (closer to the root than

the leaves).
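The pruning rule above is a simple fixed point, and can be sketched directly. This is an illustrative implementation of ours, not code from the thesis:

```python
# Iteratively collect barren variables: nodes that are neither queried nor
# observed and whose children (if any) have all already been marked barren.

def barren_variables(children, query, evidence):
    pruned = set()
    changed = True
    while changed:
        changed = False
        for v, ch in children.items():
            if v in pruned or v in query or v in evidence:
                continue
            if all(c in pruned for c in ch):   # childless counts vacuously
                pruned.add(v)
                changed = True
    return pruned

asia_children = {
    'V': ['T'], 'S': ['L', 'B'], 'T': ['C'], 'L': ['C'],
    'B': ['D'], 'C': ['X', 'D'], 'X': [], 'D': [],
}

result = barren_variables(asia_children, query={'X'}, evidence=set())
print(sorted(result))  # ['B', 'D']
```

For the query P(X), D is pruned first (childless), and B follows because its only child is now barren, matching Figure 2.2(a).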

When a variable is part of the evidence, we say that it is observed. Observing

a variable instantiates it to a particular value, and has the effect of removing its

outgoing arcs in the network. Recall that an arc in a Bayesian network indicates

that the parent is a conditioning variable in the child’s local distribution. Hence, if

an arc exists from X_i to X_j, then X_i ∈ π_j. Observing event X_i = x_i creates a new distribution in which X_i is not a part of the domain. Since the distribution is no longer defined over X_i, the arc between X_i and X_j can be considered to be removed.

These pruned arcs become important if the network gets separated into distinct

parts. If the Bayesian network becomes disconnected into subgraphs, then no two

distributions from different subgraphs are defined over a common variable. A query

variable X_q is therefore probabilistically dependent only on the subgraph that contains it. To see this, let a Bayesian network over variables X be divided into two


Figure 2.2: Examples of queries and their corresponding relevant networks, where barren variables have been greyed out. (a) P(X); (b) P(V|C); (c) P(L|B); (d) P(V).


subgraphs: G_q containing the variable set X_q (where X_q ∈ X_q), and G_q̄ containing variables X_q̄ = X − X_q. If the above assumptions hold, then we can write the probability equation of X_q as follows:

P(X_q | e) = α Σ_{X_q − {X_q}} ∏_{X_i ∈ X_q} P(X_i | π_i, e) Σ_{X_q̄} ∏_{X_j ∈ X_q̄} P(X_j | π_j, e)    (2.14)

The second factor in the product, Σ_{X_q̄} ∏_{X_j ∈ X_q̄} P(X_j | π_j, e), reduces to a constant value, since all of its variables are marginalized. This constant term is found in the denominator of the normalization constant as well. Therefore, the two cancel out, so the second term need not be calculated.

A variable that is d-separated from the query variable exists in a subgraph that is

disconnected from the subgraph containing the query variable (once barren variables

and the arcs from observed variables are pruned). As demonstrated above, this

means that the posterior probability distribution over the query is probabilistically

independent of a variable that it is d-separated from, and thus such a variable can

be ignored during computation.

To formally define d-separation, we take the approach of Shachter², and first

define an active path in the Bayesian network [61]:

Definition 2.1.1. Let X and Y be two nodes in a DAG, and Z be a set of nodes

from the same DAG. An active path from X to Y given Z is a path such that:

(1) every node on the path with converging arrows is in Z or has a descendent in Z

(2) every other node on the path is outside Z

The formal definition of d-separation is as follows [50,61]:

Definition 2.1.2. Let X, Y, and Z represent three disjoint sets of nodes in a DAG.

Z d-separates X and Y if there is no active path between X ∈ X and Y ∈ Y given

Z.

From our example, the set {L, B} d-separates {S} from {X, D, C, T, V}. As well, the empty set {} d-separates {V, T} from {L, S, B}.
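Definitions 2.1.1 and 2.1.2 translate directly into a brute-force test. The sketch below is our own illustration; it enumerates simple paths, which is exponential in the worst case but fine for small networks like Asia (practical systems use linear-time methods such as Shachter's Bayes-ball):

```python
# Check whether z d-separates x from y by searching for an active path
# (Definition 2.1.1).  arcs is a set of directed (parent, child) pairs.

def d_separated(arcs, x, y, z):
    nodes = {n for arc in arcs for n in arc}
    children = {n: {c for p, c in arcs if p == n} for n in nodes}
    neighbours = {n: children[n] | {p for p, c in arcs if c == n}
                  for n in nodes}

    def in_z_or_descendant_in_z(n):
        stack, seen = [n], set()
        while stack:
            m = stack.pop()
            if m in z:
                return True
            if m not in seen:
                seen.add(m)
                stack.extend(children[m])
        return False

    def active(path):
        # Conditions (1) and (2) applied to every internal node of the path.
        for i in range(1, len(path) - 1):
            a, b, c = path[i - 1], path[i], path[i + 1]
            if b in children[a] and b in children[c]:   # converging arrows
                if not in_z_or_descendant_in_z(b):
                    return False
            elif b in z:                                # non-collider in Z
                return False
        return True

    def search(path):
        if path[-1] == y:
            return active(path)
        return any(search(path + [n])
                   for n in neighbours[path[-1]] if n not in path)

    return not search([x])

asia_arcs = {('V', 'T'), ('S', 'L'), ('S', 'B'), ('T', 'C'),
             ('L', 'C'), ('C', 'X'), ('C', 'D'), ('B', 'D')}

print(d_separated(asia_arcs, 'S', 'X', {'L', 'B'}))  # True
print(d_separated(asia_arcs, 'V', 'S', set()))       # True
```

Observing C reverses the second answer: the path V → T → C ← L → S becomes active because the collider C is then in Z.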

²The original d-separation definition, given by Pearl [50], defines d-separation using negation, and is less intuitive than Shachter's definition.


There have been many algorithms designed to compute posterior probability

distributions over variables in the network. Inference in Bayesian networks has been

shown to be NP-hard [12]. However, the mechanics of the algorithms allow them to

perform well in many cases, reserving exponential behaviour for a specific subset of

networks that produce worst-case complexity. These algorithms are the topic for the

remainder of this chapter.

There are two basic types of inference algorithms for Bayesian networks. First,

query-based algorithms compute the posterior probability of a query variable (or

set of variables). Hence, the algorithm must be executed once for each query, even

if the evidence does not change. Algorithms of the other type, which we will refer to as batch updating algorithms, compute the posterior probability of all variables

simultaneously. The model presented in this document (beginning in Chapter 3) is

an example of a query-based algorithm. We will present both types of algorithms in

this chapter, for comparison purposes and completeness.

2.2 Common Methods for Inference

While many techniques have been proposed for calculating probabilities from a

Bayesian network, two classes of algorithms are the most popular at the time of

writing. Junction-tree propagation methods [35,36,40] offer efficient techniques for

computing multiple queries simultaneously from the network (at the expense of space

and precompilation), while variable elimination [23,65,66] calculates only one distribution, but can exploit query-specific independence. Together, these algorithms offer

flexibility (the best algorithm can be chosen based on the type of application), as

long as the user has sufficient memory to store intermediate distributions.

2.2.1 Variable Elimination

Variable elimination (VE) is a query-based algorithm that formalizes the process

used to derive P(X) from the Asia network in the previous section. VE computes a

distribution over its query variables by marginalizing other variables from the joint


probability one by one.

Variable elimination begins by creating a pool of distributions, which initially

contains the CPTs of the Bayesian network. A variable to be marginalized is se-

lected, and all distributions defined over that variable are removed from the pool.

These distributions are multiplied into a single distribution, and the selected variable

is marginalized from the resulting distribution. This distribution is then placed in the

pool, and the process is repeated, until all non-query variables have been marginal-

ized. The remaining distributions in the pool are combined using multiplication, and

the resulting distribution is normalized, giving us the posterior probability over the

query variables.
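As an illustration (ours, not an implementation from the thesis), the procedure above can be coded in a few small routines and run on the Asia CPTs of Figure 2.1, using the elimination ordering {V, T, L, S, B, C, D} from Section 2.1:

```python
from functools import reduce
from itertools import product

# A compact variable-elimination sketch.  A factor is a (variables, table)
# pair; tables map 0/1 tuples (0 = yes, 1 = no) to floats.

def multiply(f1, f2):
    (v1, t1), (v2, t2) = f1, f2
    vs = v1 + tuple(v for v in v2 if v not in v1)
    t = {}
    for vals in product((0, 1), repeat=len(vs)):
        ctx = dict(zip(vs, vals))
        t[vals] = (t1[tuple(ctx[v] for v in v1)] *
                   t2[tuple(ctx[v] for v in v2)])
    return vs, t

def sum_out(f, x):
    vs, t = f
    i = vs.index(x)
    out = {}
    for vals, p in t.items():
        k = vals[:i] + vals[i + 1:]
        out[k] = out.get(k, 0.0) + p
    return vs[:i] + vs[i + 1:], out

def eliminate(factors, order):
    for x in order:                              # one pool pass per variable
        related = [f for f in factors if x in f[0]]
        factors = [f for f in factors if x not in f[0]]
        factors.append(sum_out(reduce(multiply, related), x))
    return reduce(multiply, factors)

cpts = [
    (('V',), {(0,): 0.01, (1,): 0.99}),
    (('S',), {(0,): 0.5, (1,): 0.5}),
    (('T', 'V'), {(0, 0): 0.05, (1, 0): 0.95, (0, 1): 0.01, (1, 1): 0.99}),
    (('L', 'S'), {(0, 0): 0.10, (1, 0): 0.90, (0, 1): 0.01, (1, 1): 0.99}),
    (('B', 'S'), {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.3, (1, 1): 0.7}),
    (('C', 'T', 'L'), {(0, 0, 0): 1.0, (1, 0, 0): 0.0, (0, 0, 1): 1.0,
                       (1, 0, 1): 0.0, (0, 1, 0): 1.0, (1, 1, 0): 0.0,
                       (0, 1, 1): 0.0, (1, 1, 1): 1.0}),
    (('X', 'C'), {(0, 0): 0.98, (1, 0): 0.02, (0, 1): 0.05, (1, 1): 0.95}),
    (('D', 'C', 'B'), {(0, 0, 0): 0.9, (1, 0, 0): 0.1, (0, 0, 1): 0.7,
                       (1, 0, 1): 0.3, (0, 1, 0): 0.8, (1, 1, 0): 0.2,
                       (0, 1, 1): 0.1, (1, 1, 1): 0.9}),
]

vs, px = eliminate(cpts, ['V', 'T', 'L', 'S', 'B', 'C', 'D'])
print(vs, round(px[(0,)], 5))  # ('X',) 0.11029
```

With this ordering, every intermediate table has at most three variables, mirroring the hand expansion in Section 2.1.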

The complexity of the algorithm is O(n exp(w_ρ)), where n is the number of variables in the Bayesian network, ρ is the ordering in which the variables are eliminated, and w_ρ is the induced width of the variable ordering, which is equivalent to the number of variables in the largest intermediate distribution. The induced width is a function of the variable ordering. Finding an optimal elimination ordering is a hard problem, but several heuristics give good approximations to the optimal [38,56].

The primary advantages of the VE algorithm are its simplicity and its dynamic

nature. The algorithm is very straightforward to implement, and no precompilation

takes place, allowing the algorithm to exploit barren and d-separated variables at

runtime. The main disadvantage of VE is that it requires k runs to compute the individual posteriors for k variables. Much of the work is repeated for each computation,

something that other methods are able to avoid (see Section 2.2.2).

There exist several variants of the VE algorithm. Bucket Elimination [20] places

the distributions into separate pools (or buckets) according to the domains of the

distributions, thus eliminating the need to search for distributions defined over a

particular variable when marginalizing. Mini-buckets [22,25] is an algorithm that

computes an approximation to the posterior of the query variables in less time and

space than VE. In the mini-bucket algorithm, the set of distributions of a bucket is

partitioned into smaller buckets, and each smaller bucket is processed the same as

a standard bucket in Bucket Elimination. This further partitioning typically creates


smaller intermediate distributions, which reduces the time and space requirements

of the algorithm, at the expense of an exact answer.

2.2.2 Junction-Tree Propagation

Junction-tree propagation (JTP) [35,36,40] is a batch update technique that pre-

compiles the Bayesian network into a junction-tree. Computing over the junction

tree allows the posterior probability of each variable to be computed simultaneously

and efficiently.

A junction-tree is an undirected, acyclic graph derived from the Bayesian network.

Each node in the junction-tree, called a cluster, is a subset of the variables from the

Bayesian network. The JTP algorithm calculates a joint probability distribution over

each cluster in the junction-tree. Once JTP completes, the posterior probability of a

variable can be obtained from any cluster containing that variable by marginalizing

out all other variables in that cluster, and normalizing the resulting distribution.
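The marginalize-and-normalize step just described is short to code. A minimal sketch, assuming a cluster's joint distribution is stored as a dictionary from value tuples to numbers (an illustrative representation, not one prescribed by the thesis):

```python
def posterior_from_cluster(cluster_vars, joint, var):
    """Extract the posterior of `var` from a cluster's joint table by
    summing out every other variable and normalizing the result.
    `joint` maps tuples of values (ordered as in cluster_vars) to numbers."""
    i = cluster_vars.index(var)
    marginal = {}
    for assignment, p in joint.items():
        v = assignment[i]
        marginal[v] = marginal.get(v, 0.0) + p
    total = sum(marginal.values())
    return {v: p / total for v, p in marginal.items()}
```

Any cluster containing the variable gives the same answer, which is what makes JTP's batch update useful.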

The clusters of a junction-tree are identified after the Bayesian network is moral-

ized and triangulated. To moralize the Bayesian network, the parents of each variable

are married (an edge is placed between any two variables that share a common child

and do not already have an edge between them), and the direction of all links is dropped (Figure 2.3). Triangulating a graph ensures that any cycle of length greater than three has a chord (Figure 2.4). Triangulating a graph is typically done through an elimination procedure (similar to the algorithms of Section

2.2.1), where one variable is eliminated from the graph, and edges are added between

the remaining neighbours of the eliminated variable. The triangulated graph is the

original graph with these new added edges. Each maximal clique in the moralized,

triangulated graph contains the variables for a cluster in the junction-tree.
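Both the moralization and the elimination-based triangulation steps above are mechanical. A sketch, using the Asia network's structure as the test case (the adjacency-set representation and the particular elimination order are illustrative choices):

```python
def moralize(parents):
    """Moral graph of a DAG: marry the parents of each variable,
    then drop arc directions.  `parents` maps each node to its parent list."""
    adj = {v: set() for v in parents}
    for child, ps in parents.items():
        for p in ps:                       # drop the direction of each arc
            adj[child].add(p); adj[p].add(child)
        for i, p in enumerate(ps):         # marry co-parents
            for q in ps[i+1:]:
                adj[p].add(q); adj[q].add(p)
    return adj

def triangulate(adj, order):
    """Triangulate by elimination: remove nodes in `order`, connecting the
    remaining neighbours of each removed node (the fill-in edges)."""
    work = {v: set(ns) for v, ns in adj.items()}
    result = {v: set(ns) for v, ns in adj.items()}  # original + fill edges
    for v in order:
        nbrs = list(work[v])
        for i, a in enumerate(nbrs):
            for b in nbrs[i+1:]:
                work[a].add(b); work[b].add(a)
                result[a].add(b); result[b].add(a)
        for n in nbrs:
            work[n].discard(v)
        del work[v]
    return result
```

With the Asia structure, moralization adds the T-L and C-B marriages, and eliminating S adds the L-B fill-in edge seen in the triangulated graph.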

Once the clusters of the graph have been identified, the junction-tree can be

constructed. A vertex is created for each cluster, and edges are added between the

vertices such that 1) the graph is connected, with no loops, and 2) the running intersection property is maintained: whenever two clusters share a common variable, all clusters along the



Figure 2.3: The Asia network after moralization. Note the marriage

between T and L, and between C and B. As well, the direction of the

arcs has been dropped.


Figure 2.4: The Asia network, triangulated.


[Junction-tree diagram: clusters VT, TLC, LBC, SLB, CX, CBD, joined through separator sets T, LC, LB, BC, C.]

Figure 2.5: A junction-tree for the Asia network. Clusters are shown

as rectangles with rounded corners, and separator sets are shown as

rectangles with square corners.

path between those two clusters contain the variable as well. Each edge in the junc-

tion tree is also labeled with a variable set, known as its separator set. The separator

set is just the intersection of the clusters that the edge connects. Figure 2.5 shows a

junction-tree for the Asia network.

Inference in a junction-tree proceeds by nodes passing messages to one another. These messages take the form of a distribution. One message is passed from

each cluster to each of its neighbours. These messages are combined into a final

distribution at each node, and the posterior probability for a variable at a cluster

can be obtained by marginalizing away all other variables in the cluster.

The complexity of inference in junction-trees is O(n exp(w)), where w is the size of the largest clique. The clique sizes depend on the triangulation of the Bayesian

network, which in turn depends on the variable ordering used to determine fill-edges.

Finding the optimal variable elimination ordering for this problem is NP-hard [38].

In fact, the problem of finding an optimal variable ordering is the same for both VE

and JTP, so the same heuristics can be applied.

The primary advantage of JTP is that it calculates the individual posterior of

each variable simultaneously. That is, after the completion of the algorithm, the

posterior of any variable is available from the distribution of any cluster containing

that variable. One disadvantage of a junction-tree is that its space requirement is exponential in the size of its largest clique. As well, because the junction-tree is a

precompiled structure, it is more difficult to take advantage of barren variables and

d-separation.


At the time of writing, junction-tree algorithms are the most popular algorithms

for inference in Bayesian networks. They are prevalent in commercial systems (Netica

[47], Hugin [6]), and have extensive research behind them for well over a decade.

Algorithms exist to optimize structure [28], handle evidence dynamically [13, 27],

and run query-driven inference (for generating beliefs over small subsets of variables;

this requires that functions only be stored over the separators) [21,34]. For a more in-depth analysis of junction-tree algorithms, including implementations, the reader is encouraged to consult Huang and Darwiche [34].

2.3 Conditioning Algorithms

While inference in Bayesian networks requires exponential time, it does not always

have to require exponential space. Conditioning algorithms provide a space-efficient

alternative to the algorithms of the last section. Conditioning algorithms trans-

form inference into smaller subproblems, and then recombine the solutions to these

subproblems into the overall solution.

As with the popular methods, conditioning methods can be classified into batch

updating and query-based algorithms. The former transforms the network into a polytree (singly connected graph), whereas the latter takes a divide-and-conquer approach to inference. The former will be the topic of this section; the latter will be considered in the next section.

2.3.1 Global Calculation

Conditioning algorithms that update all posterior probabilities (batch update) follow

the same general format: choose a cutset C whose instantiation renders the network

singly connected (no directed or undirected loops), recalling that instantiating a

variable to a value prunes its outgoing arcs. When the graph is singly connected,

the probabilities over the current context can be calculated using Pearl’s message

passing algorithm. This must be done once for each context of the cutset. The

details of this are considered later; the message passing algorithm is introduced first.


Note that the message passing algorithm is presented here because of its relationship

to conditioning methods; the algorithm itself is not a conditioning algorithm.

Message Passing

Pearl’s message passing algorithm [49,50] computes the posterior probability dis-

tribution of each variable in a Bayesian network in a single run. The algorithm

only computes correct probabilities for a singly-connected network. Hence, the al-

gorithm is typically used in conjunction with a conditioning algorithm that renders

the network singly-connected.

During the message passing algorithm, a variable in the Bayesian network be-

comes a processing unit. The variable receives messages from its neighbouring nodes.

These messages are in the form of a distribution, representing information from an-

other part of the network. A variable uses these messages to calculate the posterior

probability distribution over itself, as well as to calculate messages to send to its

neighbours. The message sent to a neighbouring variable is a summary of all in-

formation received from all other neighbours. The algorithm terminates when all

messages have been sent.

The number of messages sent during the message passing algorithm is 2e, where

e is the number of arcs in the network (since a node sends and receives one message

from each neighbour). Calculating messages to be sent to parent variables takes

O(exp(f)) time, where f is the size of the largest family (calculating a message to be sent to a child can be done in time linear in the domain size of the variable, once the posterior probability has been calculated, and therefore does not contribute to the complexity). Calculating the posterior probability of a variable from the messages also takes O(exp(f)) time. Hence, the overall time for the algorithm is O((n + e) exp(f)). The space required by the algorithm is O(n exp(f)), for CPT storage (the messages passed are linear in the domain size of the variables, and therefore do not contribute to the space complexity).
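To make the message flow concrete, here is a minimal sketch of the causal (π) and diagnostic (λ) messages on a three-variable chain A → B → C, the simplest singly-connected network, with optional evidence on C. The CPT numbers in the usage example are invented; this is a special case of Pearl's algorithm for chains, not a general implementation.

```python
def chain_beliefs(pA, pB_A, pC_B, evidence_c=None):
    """Pearl-style message passing on the chain A -> B -> C.
    pA[a], pB_A[b][a], pC_B[c][b] are the CPTs (binary variables).
    Returns normalized beliefs for A, B, and C."""
    k = range(2)
    # causal (pi) messages flow from parent to child
    piA = pA[:]
    piB = [sum(pB_A[b][a] * piA[a] for a in k) for b in k]
    piC = [sum(pC_B[c][b] * piB[b] for b in k) for c in k]
    # diagnostic (lambda) messages flow from child to parent
    lamC = [1.0, 1.0]
    if evidence_c is not None:
        lamC = [1.0 if c == evidence_c else 0.0 for c in k]
    lamB = [sum(pC_B[c][b] * lamC[c] for c in k) for b in k]
    lamA = [sum(pB_A[b][a] * lamB[b] for b in k) for a in k]
    def norm(v):
        z = sum(v)
        return [x / z for x in v]
    # belief at each node is the normalized product of pi and lambda
    return (norm([piA[a] * lamA[a] for a in k]),
            norm([piB[b] * lamB[b] for b in k]),
            norm([piC[c] * lamC[c] for c in k]))
```

Without evidence, every λ message is uniform and the beliefs reduce to the prior marginals; with evidence on C, the λ messages carry the diagnostic information back up the chain.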

The primary advantage of Pearl's algorithm is its low resource requirements: in terms of complexity, it is among the fastest and smallest inference algorithms for Bayesian


networks to date. The algorithm calculates posterior probability distributions for

each variable simultaneously, as opposed to a single distribution as in VE. Also,

because each variable processes independently, much of the computation can be

done in parallel. However, the algorithm works only for singly-connected networks, which occur infrequently in practice.

Pearl conjectured that running the message passing algorithm in a multiply-

connected network (containing undirected loops) might stabilize to an equilibrium,

even though the posteriors at equilibrium may not be representative of the real pos-

teriors. Murphy et al. [45] explored this idea on general probabilistic networks, at-

tempting to ascertain empirically if message-passing was a reasonable approximation

approach on “loopy” networks. The results showed that when convergence occurred,

the approximations were quite good, outperforming other standard approximation

methods given a similar amount of running time. However, the algorithm would

exhibit oscillatory behaviour over certain networks, and never converge. The oscillation appeared to be correlated with small prior probabilities (the authors were able to correct oscillation in some networks by increasing some of the prior probabilities).

Cutset Conditioning

Cutset conditioning [50] is an inference algorithm for Bayesian networks that uses

the message-passing algorithms as its probability calculator. The central idea be-

hind cutset conditioning is to choose a cutset, or set of variables from the Bayesian

network whose instantiation renders the network singly-connected. Recall that when a variable is instantiated, its outgoing arcs are pruned from the network. Consider

the Asia example. Message passing cannot be applied to this network, as it is not

singly-connected. However, by instantiating Smoking, we break the only loop in the

graph (shown in Figure 2.6), and we may now apply the message passing algorithm.

Given such a cutset C, instantiating the variables to C = c and applying Pearl's algorithm calculates P(Xi|c,e) for each Xi in the network. To obtain the desired posterior probability distribution P(Xi|e), we can use the law of total probability:



(a) The Asia network, conditioned on S = true.


(b) The Asia network, conditioned on S = false.

Figure 2.6: The Asia network, conditioned on S. Notice that in each

case, the network is singly connected, and the beliefs can be updated

using message passing.


P(Xi|e) = Σc∈C P(Xi|e,c)P(c|e)    (2.15)

In other words, the message-passing algorithm is run once for each instantiation of the cutset. P(c|e) is actually calculated as αP(e|c)P(c), where α is a normalizing constant, and both terms in this expression can be calculated using message passing.

Let |C| be the number of variables in the cutset. The time complexity of the algorithm is O((n + e) exp(f) exp(|C|)), since message passing is run once per cutset instantiation. Finding the optimal (smallest) cutset is NP-

hard [63]; different methods have been suggested for finding such a cutset [8,26,62,63].

The space complexity of the algorithm is O(nexp(f)), since it requires enough space

to run the message-passing algorithm.
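The combination step of Equation 2.15, weighting each conditioned result by P(c|e), can be sketched as follows. The per-instantiation inputs would come from runs of the message-passing algorithm; the numbers in the usage example are invented for illustration.

```python
def combine_over_cutset(conditioned):
    """Combine message-passing results across cutset instantiations
    (Equation 2.15).  `conditioned` is a list of pairs
    (posterior_given_c, joint_c_and_e), where posterior_given_c is
    P(Xi | c, e) as a list over Xi's values, and joint_c_and_e is
    P(c, e) = P(e|c)P(c).  Normalizing the joints by their sum P(e)
    yields the mixing weights P(c | e)."""
    z = sum(w for _, w in conditioned)          # this sum is P(e)
    k = len(conditioned[0][0])
    return [sum(post[x] * w / z for post, w in conditioned)
            for x in range(k)]
```

Each conditioned posterior is a valid distribution on its own; the weights simply mix them in proportion to how likely each cutset instantiation is given the evidence.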

The primary advantage of conditioning is its low memory requirements. Since

the largest family of the Bayesian network is almost always smaller than its induced

width, the conditioning algorithm can achieve an exponential space saving over JTP

and VE. However, the size of the cutsets in general Bayesian networks tends to give these algorithms longer runtimes than JTP and VE. The difference in runtime can

be substantial, and conditioning algorithms do not enjoy the same popularity that

JTP and VE have.

Peot and Shachter [51] improve on the original cutset conditioning algorithm by defining multiple cutsets per network, one for each knot. A knot is a subgraph of the network that cannot be disconnected by removing one edge. This allows conditioning only over relevant variables, rather than conditioning every variable over the entire cutset. In the worst case, the knot conditioning algorithm has the same complexity as standard conditioning, but it is often much better. Local conditioning [26] is a further refinement of knot-conditioning, in which conditioning is applied exclusively within each loop. Local conditioning provides a depth-first search algorithm for detecting the cutsets of each loop. In practice, knot-conditioning is at least as good as global conditioning (often much better), while local conditioning is at least as good as knot-conditioning (and often much better); the authors report linear to exponential ratios between local and knot conditioning in some examples. Dechter [21] introduced a hybrid approach that uses a version of JTP with conditioning. The result is

a time-space tradeoff: the algorithm works in space-constrained environments, and

the runtime is inversely proportional to the amount of memory available. Bounded

conditioning [32] is an algorithm that uses conditioning to approximate bounds on

posteriors. Probabilities that have not yet been calculated are replaced with the in-

terval [0,1] in Equation 2.15, giving upper and lower bounds on the final posteriors.

As the actual probability values are calculated exactly, they replace the intervals in the equation. Hence, as time progresses, the bounds become progressively tighter.

2.4 Divide and Conquer Conditioning

Divide and conquer conditioning [15,44,55] is similar to cutset conditioning, in that

it uses a cutset to condition the network over a specific context. However, rather

than this cutset being chosen to make the network singly-connected, the cutset par-

titions the network into two d-separated components. The partitions are solved in

a recursive manner, and the results are recombined to obtain the solution.

Divide and conquer conditioning begins by recursively partitioning the Bayesian

network into a structure called a dtree. Figure 2.7 shows a Bayesian network, and a

dtree compilation of the network (taken from Darwiche [15]). Each internal node in

the dtree represents a subgraph of the original Bayesian network. Each internal node

N also contains a set of variables known as its cutset. The cutset at a particular

node d-separates its subgraph into two distinct subgraphs: these subgraphs become

the children of that node. The leaves represent single variables of the network (a

single variable cannot be further partitioned). The subgraph of an internal node is implicit: each internal node stores only its cutset, while each leaf stores the CPT associated with its labeling variable.

Once construction of the dtree is complete, it is used to calculate the probability

of a context P(C = c), where C is a subset of the nodes in the Bayesian network.

Given a dtree node T, let Tl and Tr represent the left and right children of T,



(a) An example Bayesian network.

A? B?C?D?

E?

A?B? C? D?

E?

A? B? C? D?E?

D?

E?

{D}?

{B}?

{? C?}?

{A}?

P(A)?

P(B|A)?

P(?C? |B)?

P(D|?C? )? P(?E?|B,D)?

(b) A recursive decomposition of the network.

Figure 2.7: An example Bayesian network (from Darwiche [15]) and

a recursive decomposition of that network. The cutsets at each node

are shown in each box.


respectively, and let cutset(T) represent the cutset at T. The value calculated from

T given c, denoted PT(c), is as follows:

PT(c) = Σd∈D(cutset(T)) PTl(c ∧ d) PTr(c ∧ d)    (2.16)

that is, the value is calculated by recursively calculating the values at Tl and Tr for each d ∈ D(cutset(T)). When T is a leaf node, the context passed to this node contains an assignment to each variable in the domain of the CPT at T; the value returned from T is the entry in this CPT that corresponds to the context.

As mentioned, the probability calculated from a dtree is the probability of a context C = c, not a posterior probability. To calculate a posterior probability distribution P(Xi|e) from a dtree, P(xi ∧ e) is calculated for each xi ∈ D(Xi), and the resulting vector is normalized.
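Equation 2.16 translates almost directly into a recursive lookup. A sketch for binary variables (the class names and the two-variable network in the usage example are illustrative choices, not from the thesis):

```python
from itertools import product

class Leaf:
    """Dtree leaf: stores the CPT of one variable.
    `cpt` maps a tuple of values for `scope` to a probability."""
    def __init__(self, scope, cpt):
        self.scope, self.cpt = scope, cpt
    def lookup(self, ctx):
        return self.cpt[tuple(ctx[v] for v in self.scope)]

class Node:
    """Dtree internal node: stores only its cutset.  lookup() implements
    Equation 2.16, summing over every instantiation of the cutset."""
    def __init__(self, cutset, left, right):
        self.cutset, self.left, self.right = cutset, left, right
    def lookup(self, ctx):
        total = 0.0
        for vals in product([0, 1], repeat=len(self.cutset)):
            ctx2 = dict(ctx, **dict(zip(self.cutset, vals)))
            total += self.left.lookup(ctx2) * self.right.lookup(ctx2)
        return total
```

For the network A → B, a dtree with root cutset {A} and leaves P(A) and P(B|A) returns P(B = b) when queried with the context {B: b}.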

There are several variations of divide and conquer conditioning. Recursive con-

ditioning [15] and recursive decomposition [11,44] decompose the network into the

described binary tree structure (a dtree). Adaptive conditioning [55] is an adaptation of recursive conditioning that allows one to tailor a query to both time and space requirements. The algorithm differs from the other decompositions in that it does not attempt to decompose the network into single nodes, and the decomposition is not

necessarily binary. Instead, it decomposes based on memory requirements, and may

choose to run an inference algorithm on a multi-node subnetwork. A network is only

decomposed further if memory requirements do not suffice to run inference on the

current decomposition.

The time complexity of recursive conditioning is O(n exp(wc·d)), where wc is the size of the largest cutset, and d is the depth of the tree [15]. To see this, let N be a node in a dtree. Its a-cutset is defined as the union of all cutsets of its ancestors in the dtree. During the recursive conditioning algorithm, N will be called once for each instantiation of its a-cutset, which in the worst case is of size wc·d. If a dtree

is constructed using an elimination ordering, then it can be shown that the largest

cutset will be bounded from above by the induced width of the variable ordering. As


well, the tree can be balanced using rake and compress methods [43], such that it has

logarithmic height while only affecting wc by a constant factor. Hence, the resulting time complexity of the algorithm is O(n exp(w log n)), where w is the width of the

variable ordering used to construct the dtree, and n is the number of variables in the

network.

The memory requirements of recursive conditioning are much smaller than JTP

and VE. The CPT storage requires O(n exp(f)) space, where f is the size of the largest family. Excluding the CPTs, the space required to store the dtree structure is linear in the number of nodes in the network. During computation of P(E = e), the algorithm traverses the tree structure in a depth-first fashion, storing only the current recursive path, which is linear in the height of the tree. Hence, even though

the space complexity of recursive conditioning is asymptotically the same as other

conditioning algorithms, the actual amount of space required by the algorithm is

typically less. This small memory requirement is the primary advantage of recursive

conditioning.

2.4.1 Caching in Recursive Decompositions

Recursive decompositions require more time to compute over than JTP and VE. The extra time complexity is due to repeated computation, which is illustrated in the

following example. Figure 2.8 shows the Asia network compiled into a dtree. Recall

that a node will be called once for each instantiation of its a-cutset. Consider the

node labeled with the cutset {V } in the graph. This node will be called once for

each instantiation of its a-cutset, which is {L,B,T}. Table 2.1 shows the results of each of these calls to the node. Each entry in the table shows the current context

of the a-cutset, and the return value of each call to the node. Notice that the return

value is the same for all contexts when T = yes, as well as for T = no. Hence, the

algorithm recalculates these values unnecessarily.

Recomputation can be avoided by storing these values once they are calculated,

a technique known as caching [15]. If the value Σv∈V P(v)P(T = yes|v) were stored



Figure 2.8: The Asia network, compiled into a dtree.

Table 2.1: Trace of visits to node labeled with {V} in Figure 2.8.

Visit   Context                       Value
1       L = yes, B = yes, T = yes     Σv∈V P(v)P(T = yes|v)
2       L = yes, B = yes, T = no      Σv∈V P(v)P(T = no|v)
3       L = yes, B = no, T = yes      Σv∈V P(v)P(T = yes|v)
4       L = yes, B = no, T = no       Σv∈V P(v)P(T = no|v)
5       L = no, B = yes, T = yes      Σv∈V P(v)P(T = yes|v)
6       L = no, B = yes, T = no       Σv∈V P(v)P(T = no|v)
7       L = no, B = no, T = yes       Σv∈V P(v)P(T = yes|v)
8       L = no, B = no, T = no        Σv∈V P(v)P(T = no|v)



Figure 2.9: The Asia dtree, with cache-domains shown to the right

of the internal nodes.

after visit 1, and Σv∈V P(v)P(T = no|v) after visit 2, then all subsequent visits to the

node would require a constant time lookup to the cache; the value returned would

depend on the current value of T in the context of the a-cutset.

Formally, define the cache-domain of a node N, denoted CD(N), as the intersec-

tion of N’s a-cutset and the union of all variables in the CPTs of its leaf variables.4

The values returned from N will depend only on the current context of CD(N) in

N’s a-cutset. In the above example, the cache-domain of the node labeled by {V } is

{T}. The recursive conditioning algorithm can be modified as follows: when visiting

a node, it first checks the cache at that node to see if a value for the context of the

cache-domain has already been calculated. If it has, it simply returns this value. If

not, it calculates this value, stores it in cache, and returns it. Figure 2.9 shows the

Asia dtree, with its cache-domains shown to the right of each internal node.
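The check-compute-store pattern just described can be sketched by keying a cache on the cache-domain context. The closure-based node representation and the counting wrapper in the test are illustrative; the node below mirrors the {V}-labeled node of the example, whose cache-domain is {T}.

```python
from itertools import product

def leaf(scope, cpt):
    """A dtree leaf: returns the CPT entry for the current context."""
    return lambda ctx: cpt[tuple(ctx[v] for v in scope)]

def cached_node(cutset, cache_domain, left, right):
    """An internal dtree node with caching: each computed value is
    stored under the current instantiation of the cache-domain
    (binary variables assumed)."""
    cache = {}
    def lookup(ctx):
        key = tuple(ctx[v] for v in cache_domain)
        if key in cache:                     # hit: constant-time lookup
            return cache[key]
        total = 0.0
        for vals in product([0, 1], repeat=len(cutset)):
            ctx2 = dict(ctx, **dict(zip(cutset, vals)))
            total += left(ctx2) * right(ctx2)
        cache[key] = total                   # miss: compute, store, return
        return total
    return lookup
```

Two visits that differ only in variables outside the cache-domain (such as L and B in the trace) now share one computation.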

When caching is employed, the time requirements of recursive conditioning are

reduced, while the space requirements are increased. Rather than a node being called

once for each instantiation of its a-cutset, it is now called once for each instantiation

of its parent-node’s cache-domain unioned with its parent-node’s cutset. It can be

4Darwiche et al. [4,5,15] refer to this set simply as a context; we use cache-domain to avoid confusion with our previous definition of context.


shown that the cache-domain size at each node is bounded by the induced width of

the variable ordering used to construct the dtree [15]. This means that the time and space complexity of recursive conditioning with caching is O(n exp(w)), the same as JTP and VE.

Dead Caches

Caching all possible values increases the space requirements of recursive conditioning to O(n exp(w)), the same as JTP and VE. However, Allen and Darwiche [3] demonstrated that the memory requirements of caching can be reduced by identifying dead caches. Dead caches in a recursive decomposition are caches whose values would only be calculated and never queried. Dead caches are never allocated any memory, so the overall space requirements for caching are reduced.

As an example, consider the Asia dtree in Figure 2.9, in particular the node

labeled with {C}. The cache-domain at this node is {L,B,T}. The node is visited only once for each instantiation of its cache-domain; therefore, the cache is never queried.

Dead caches can be identified in dtrees as caches whose cache-domain is a superset of their parent's cache-domain. These caches can be removed from recursive decompositions with no runtime consequence. The memory savings afforded by dead cache removal can be substantial. Figure 2.10 shows the Asia example with dead caches removed (grayed out). In this example, the live caches require less than 20% of the original space required by full caching.
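The identification rule just stated can be sketched as a single traversal of the dtree. The nested-tuple tree encoding is an assumption made for illustration; in the test, a node with cache-domain {L,B,T} under a parent with cache-domain {L,B} is flagged dead, matching the {C}-node example above.

```python
def find_dead_caches(tree):
    """Collect dead caches under the rule stated above: a node's cache
    is dead if its cache-domain is a superset of its parent's
    cache-domain.  `tree` is a nested tuple (cache_domain, children)."""
    dead = []
    def walk(node, parent_cd):
        cache_domain, children = node
        if parent_cd is not None and set(cache_domain) >= set(parent_cd):
            dead.append(frozenset(cache_domain))
        for child in children:
            walk(child, cache_domain)
    walk(tree, None)
    return dead
```

Since a dead cache's entries can never be re-queried, the caller simply skips allocating those caches.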

Allen and Darwiche showed empirically that many of the caches in dtrees con-

structed from test Bayesian networks were dead caches, and their removal consider-

ably improved the space efficiency of recursive conditioning. They also showed that

the memory required was substantially less than that of JTP and VE, while the time complexity remains the same as for JTP and VE.
