
Int J Comput Vis

DOI 10.1007/s11263-010-0385-z

On Learning Conditional Random Fields for Stereo

Exploring Model Structures and Approximate Inference

Christopher J. Pal · Jerod J. Weinman · Lam C. Tran ·

Daniel Scharstein

Received: 22 March 2010 / Accepted: 15 September 2010

© Springer Science+Business Media, LLC 2010

Abstract Until recently, the lack of ground truth data has

hindered the application of discriminative structured predic-

tion techniques to the stereo problem. In this paper we use

ground truth data sets that we have recently constructed to

explore different model structures and parameter learning

techniques. To estimate parameters in Markov random fields

(MRFs) via maximum likelihood one usually needs to per-

form approximate probabilistic inference. Conditional ran-

dom fields (CRFs) are discriminative versions of traditional

MRFs. We explore a number of novel CRF model structures

including a CRF for stereo matching with an explicit oc-

clusion model. CRFs require expensive inference steps for

each iteration of optimization and inference is particularly

slow when there are many discrete states. We explore belief

propagation, variational message passing and graph cuts as

inference methods during learning and compare with learn-

ing via pseudolikelihood. To accelerate approximate infer-

ence we have developed a new method called sparse varia-

tional message passing which can reduce inference time by

an order of magnitude with negligible loss in quality. Learn-

ing using sparse variational message passing improves upon

previous approaches using graph cuts and allows efficient

learning over large data sets when energy functions violate

the constraints imposed by graph cuts.

Keywords Stereo · Learning · Structured prediction ·

Approximate inference

This research was supported in part by: a Google Research award,
Microsoft Research through awards under the eScience and Memex
funding programs, a gift from Kodak Research and an NSERC
discovery award to C.P. Support was also provided in part by NSF
grant 0413169 to D.S.

C.J. Pal (✉)
École Polytechnique de Montréal, Montréal, QC, Canada
e-mail: christopher.pal@polymtl.ca

J.J. Weinman
Dept. of Computer Science, Grinnell College, Grinnell, IA, USA
e-mail: weinman@grinnell.edu

L.C. Tran
Dept. of Electrical and Computer Engineering, University of
California San Diego, San Diego, CA, USA
e-mail: lat003@ucsd.edu

D. Scharstein
Middlebury College, Middlebury, VT, USA
e-mail: schar@middlebury.edu

1 Introduction

In recent years, machine learning methods have been suc-

cessfully applied to a large number of computer vision

problems, including recognition, super-resolution, inpaint-

ing, texture segmentation, denoising, and context labeling.

Stereo vision has remained somewhat of an exception be-

cause of the lack of sufficient training data with ground-

truth disparities. While a few data sets with known dispari-

ties are available, until recently they had mainly been

used for benchmarking of stereo methods (e.g., Scharstein

and Szeliski 2002). Our earlier work in this line of re-

search (Scharstein and Pal 2007) sought to remedy this sit-

uation by replacing the heuristic cues used in previous ap-

proaches with probabilistic models for structured prediction

derived from learning using real images and ground truth

stereo imagery. To obtain a sufficient amount of training

data, we used the structured-lighting approach of Scharstein

and Szeliski (2003) to construct a database of 30 stereo pairs


with ground-truth disparities, which we have made available

for use by other researchers.1

By addressing the need for greater quantities of ground

truth data, we are now able to take a machine learning ap-

proach using a classical structured prediction model, the

conditional random field (CRF). We derive a gradient-based

learning approach that leverages efficient graph-cut mini-

mization methods and our ground-truth database. We then

explore the characteristics and properties of a number of

different models when learning model parameters. Using

graph-cut minimization techniques for gradient-based learn-

ing in CRFs corresponds to an aggressive approximation

of the underlying probabilities needed for expectations of

key quantities. In this work we further explore the issues

of learning and its interaction with different inference tech-

niques under richer model structures.

Among the few existing learning approaches for stereo,

one of the most prominent is the work by Zhang and Seitz

(2005), who iteratively estimate the global parameters of an

MRF stereo method from the previous disparity estimates

and thus do not rely on ground-truth data. Kong and Tao

(2004) learn to categorize matching errors of local methods

using the Middlebury images. Kolmogorov et al. (2006) con-

struct MRF models for binary segmentation using locally

learned Gaussian Mixture Models (GMMs) for foreground

and background colors. Some interesting recent work has

explored learning in a hidden variable CRF-like model for

stereo (Trinh and McAllester 2009). They formulate stereo

as the problem of modeling the probability of the right im-

age given the left. Thus, they are able to construct a con-

ditional model with hidden variables for depth information.

As such, they do not use ground-truth depth information and

cast the approach as an instance of unsupervised learning.

They use monocular texture cues, define potential functions

on segments from image segmentation and construct an en-

ergy function based on a slanted-plane model. They perform

learning using a variation of hard assignment conditional ex-

pectation maximization.

Learning aside, there has been growing interest in sim-

ply creating richer models for stereo vision in which more

parameters are introduced to produce more accurate results.

In particular, recent activity has focused on explicitly ac-

counting for occlusions in stereo vision models. For exam-

ple, Kolmogorov and Zabih (2001) have directly incorpo-

rated occlusion models in an energy function and graph-cut

minimization framework. Sun et al. (2005) explored a sym-

metric stereo matching approach whereby they: (1) infer the

disparity map in one view considering the occlusion map of

the other view and (2) infer the occlusion map in one view

given the disparity map of the other view. More recently,

Yang et al. (2006) have achieved impressive results building

1http://vision.middlebury.edu/stereo/data/.

on models that estimate depth in both left and right images

and using color-weighted correlations for patch matching.

They found that this approach made match scores less sensi-

tive to occlusion boundaries, as occlusions often cause color

discontinuities. All of these methods involve creating richer

models to obtain greater disparity accuracy. Thus, we see a

growing need to learn or estimate model parameters in an

efficient and principled way.

While learning for stereo is growing in interest, much

recent progress in stereo vision has been achieved along

two other avenues. First, global optimization methods have

become practical with the emergence of powerful opti-

mization techniques. Considered too slow when first pro-

posed by Barnard (1989), global methods currently domi-

nate the top of the Middlebury stereo rankings. In particular,

MRF models for stereo have become popular since high-

quality approximate solutions can be obtained efficiently us-

ing graph cuts (Boykov et al. 2001; Kolmogorov and Zabih

2001, 2002b) and belief propagation (Sun et al. 2003, 2005;

Felzenszwalb and Huttenlocher 2006). Tappen and Freeman

(2003) have compared graph cuts and belief propagation

for stereo and Szeliski et al. (2008) have compared a larger

set of MRF energy minimization techniques, providing soft-

ware that we use in our implementation.

A second breakthrough has been the realization of the im-

portance of intensity changes as a cue for object boundaries

(i.e., disparity discontinuities). Taken to an extreme, this

translates into the assumption that disparity jumps always

coincide with color edges, which is the basis of a large num-

ber of recent segment-based stereo methods (Tao et al. 2001;

Zhang and Kambhamettu 2002; Bleyer and Gelautz 2004;

Hong and Chen 2004; Wei and Quan 2004; Zitnick et al.

2004; Sun et al. 2005). Such methods start with a color

segmentation and then assume that disparities are constant,

planar, or vary smoothly within each segment. This as-

sumption works surprisingly well if the segments are small

enough. Alternatively, color segmentations can also be em-

ployed as smoothness priors in pixel-based approaches (Sun

et al. 2003). Using color segmentations is not the only

way to utilize this monocular cue; many pixel-based global

methods also change the smoothness cost (i.e., penalty

for a disparity change) if the local intensity gradient is

high (Boykov et al. 2001; Kolmogorov and Zabih 2002a;

Scharstein and Szeliski 2002). This is the approach we take.

The relationship between intensity gradient and smoothness

cost is learned from real images.

We have focused our discussion so far on discrete for-

mulations for stereo. However, continuous formulations also

exist and a number of groups have formulated the continu-

ous depth stereo problem as an energy functional represent-

ing an underlying partial differential equation (PDE) (Al-

varez et al. 2002; Strecha et al. 2003). More recent work

along these lines has cast occlusions as unobserved hid-

den variables and used expectation maximization (Strecha et


al. 2004). This work also draws closer together PDE-based

methods and maximum a posteriori (MAP) estimation. As

noted by Yang et al. (2006), more studies are needed to un-

derstand the behavior of algorithms for optimizing parame-

ters in stereo models. This work addresses that need.

Some of the discrete stereo models discussed above have

formulated the problem directly as an energy function with-

out an explicit probabilistic model. When a probabilistic

model has been used, it has been a joint or generative ran-

dom field. However, there are well-known performance ad-

vantages to using discriminative as opposed to generative

modeling techniques (Ng and Jordan 2002). One of our con-

tributions is the development of a completely probabilistic

and discriminative discrete formulation for stereo. We ex-

plicitly model occlusions using additional states in the vari-

ables of a conditional random field (CRF). As we will show,

when traditional stereo techniques are augmented with an

occlusion model and cast in a CRF framework, learning can

be achieved via maximum (conditional) likelihood estima-

tion. However, learning becomes more challenging as the

stereo images and probabilistic models become more realis-

tic.

1.1 Conditional Random Fields, Learning and Inference

In this work, we use a lattice-structured CRF for stereo

vision. This leads to energy functions with a traditional

form—single variable terms and pairwise terms. Impor-

tantly, unlike purely energy-based formulations, since we

cast the stereo problem as a conditional probability model,

we are able to view learning as an instance of maximum con-

ditional likelihood. We can also draw from recent insights in

the machine learning community to deal with learning in in-

tractable probability models. In this light, learning also be-

comes a task closely linked to the quality of approximate

inference in the model. From this formulation we are able

to develop a probabilistic variational method in the sense of

Jordan et al. (1999). While we focus on approximate infer-

ence and learning in lattice-structured conditional random

fields applied to stereo vision, our theoretical results and

some experimental insights are applicable to CRFs, MRFs

and Bayesian networks with arbitrary structures.

The CRF approach to modeling and learning a random

field was first presented for sequence processing problems

by Lafferty et al. (2001). Sutton and McCallum (2006)

give a good review of modeling and learning techniques

for CRFs focusing on natural language processing prob-

lems and optimization methods that exploit second order

information. Lafferty et al. (2001) originally proposed the

use of a method known as improved iterative scaling (Della

Pietra et al. 1997) for learning CRFs. However, optimization

of the conditional log likelihood using classical gradient-

descent-based methods is a natural strategy for learning in

CRFs. Indeed, several authors have reported significant in-

creases in learning speed using second-order gradient-based

optimization techniques. Quasi-Newton methods such as

the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method or

limited-memory versions of the BFGS method have been

particularly successful (Sutton and McCallum 2006). More

recently, Vishwanathan et al. (2006) have reported even

faster convergence with large data sets using Stochastic

Meta-Descent, a stochastic gradient optimization method

with adaptation of a gain vector.

Model expectations are needed for gradient-based learn-

ing. To efficiently compute these in the linear chain struc-

tured models commonly used in language processing, a sub-

tle variation of the well-known forward-backward algorithm

for hidden Markov models can be used. However, approxi-

mate inference methods must be used for many graphical

models with more complex structure. The dynamic condi-

tional random fields (DCRFs) of Sutton et al. (2004) use a

factorized set of variables at each segment of a linear-chain

CRF. This leads to a shallow but dynamically-sized lattice-

structured model. Sutton et al. (2004) explore several meth-

ods for approximate inference and learning, including tree-

based reparameterization (TRP) or the tree-based message

passing schedules of Wainwright et al. (2002, 2003) and

the loopy belief propagation strategy discussed in Murphy

et al. (1999) under a random schedule. Other work such as

Weinman et al. (2004) also explores TRP methods but not

in a large lattice structured model. For random fields with

a hidden layer in the form of a Boltzmann machine, He et

al. (2004) have used sampling methods for inference based

on contrastive divergence. Contrastive divergence initializes

Markov chain Monte Carlo (MCMC) sampling using the

data and then takes a few steps of a Gibbs sampler. This ap-

proach is faster than traditional MCMC, which requires con-

vergence to equilibrium. However it can lead to crude ap-

proximations to model likelihood gradients used for learn-

ing. Kumar and Hebert (2006) optimize the parameters of

lattice-structured binary CRFs using pseudolikelihood and

perform inference using iterated conditional modes (ICM),

which is fast but well known to get caught in local minima.

Other work by Blake et al. (2004) has investigated the dis-

criminative optimization of lattice-structured joint random

field models using autoregression over the pseudolikelihood.

Pseudolikelihood-based techniques are equivalent to us-

ing spatially localized and independent probabilistic mod-

els as a substitute for the original global model of a com-

plete joint distribution. Pseudolikelihood can yield a convex

optimization problem that often leads to relatively fast op-

timization times. However, Liang and Jordan (2008) have

shown that pseudolikelihood can give poorer estimates of in-

teraction parameters in random fields when interactions are

strong. In Sect. 5.6 we compare learning with pseudolike-

lihood, graph cuts and our new inference method (sparse


mean field), and we find that pseudolikelihood indeed re-

sults in the lowest performance. We are therefore motivated

to use a learning procedure that accounts for the global view

of the underlying structured prediction problem.

1.2 Graph Cuts and Learning

The primary challenge with learning a CRF using a standard

gradient-descent-based optimization of the conditional like-

lihood is that one must compute intractable model expecta-

tions of features in the energy function (see Sect. 4). One

solution to this problem is to replace distributions needed

for the expectation with a single point estimate and compute

gradients in a manner reminiscent of the classical perceptron

algorithm. For the types of models we explore here, graph-

cut-based methods are typically the fastest choice for energy

minimization (Szeliski et al. 2008). Thus, one particularly

attractive solution to the learning problem is to take advan-

tage of the extremely fast and high-quality performance of

graph cuts. In more precise terms, the resulting minimum-energy configuration corresponds

to a most-probable-explanation (MPE) (Cowell et al. 2003)

estimate for the corresponding CRF. Although it does have

important limitations, we use this fast and effective strategy

for our initial explorations of model structures.

The maximum conditional likelihood formulation for

gradient-based learning in a CRF requires one to compute

model expectations, not MPE estimates. Furthermore, the

graph cut algorithm only works if energy functions satisfy

certain conditions. While the original energy function of a

random field can have negative weights, the secondary graph

constructed when performing graph-cut inference must have

non-negative edge weights. This transformation leads to in-

equality constraints on the original energy function. These

constraints also imply that graph-cut inference may cease to

be possible during the course of learning for some models—

something we have observed during our experiments. These

factors have motivated us to explore a second class of infer-

ence techniques, based on quickly computing approximate

marginal distributions during learning. Thus, in the second

broad area of our exploration we compare the efficiency and

quality of global inference techniques during learning.

1.3 Other Alternatives for Inference

It is well known that belief propagation (BP) (Yedidia et al.

2003) in tree structured graphs is exact. However, in stereo

and many other problems in computer vision one frequently

uses graphs defined on a 2D grid. It is possible to apply BP

on graphs with loops using loopy belief propagation (LBP),

and there have been a number of reports of success using this

strategy for applications ranging from error-correcting codes

(Frey and MacKay 1997) and inference in large Bayesian

networks (Murphy et al. 1999) to low level vision (Felzen-

szwalb and Huttenlocher 2006). Two important variations

of belief propagation consist of the sum-product algorithm

and max-product algorithm (Kschischang et al. 2001). The

sum-product algorithm is used to compute marginal distrib-

utions while the max-product algorithm is used to give the

most probable configuration or MPE under a model. In a

generative model this is also equivalent to the MAP config-

uration. The max-product variation of BP is equivalent to

the celebrated “Viterbi” algorithm used for decoding in hid-

den Markov models. In a 2D lattice, loopy variants of max-

product can be used to find configurations that correspond

to approximate energy minima. The evaluation of Szeliski

et al. (2008) includes comparisons with variations of loopy

max-product BP but typically finds superior minima using

methods based on either graph cuts or BP variants that in-

volve more sophisticated tree-based approximations (Wain-

wright et al. 2005), which can also be viewed as linear pro-

gramming relaxations.

Tree-based approximations can be used to improve both

the quality of marginal inference and the quality of MAP or

MPE estimates. A class of algorithms known as tree-based

reparameterization (TRP) (Wainwright et al. 2003) can be

used to obtain approximate marginals in a graph with cy-

cles. This class of algorithms can be formulated as a series

of reparameterization updates to the original loopy graph.

Tree-reweighted message passing (TRW) (Wainwright et al.

2005) is an approach whereby one reweights the usual mes-

sages of LBP. This family of algorithms involves reparame-

terizing a collection of tree-structured distributions in terms

of a common set of pseudo-max-marginals on the nodes and

edges of the graph with cycles. When it is possible to find

a configuration that is locally optimal with respect to every

single node and edge pseudo-max-marginal, then the upper

bound is tight, and the MAP configuration can be obtained.

Recent work (Kolmogorov 2006) has shown that more so-

phisticated algorithms, such as sequential tree-reweighted

max-product message passing (TRW-S), have the ability to

produce even better minimum energy solutions than graph

cuts.

Belief propagation (Yedidia et al. 2003) and variational

methods (Jordan et al. 1999) are both widely used tech-

niques for inference in probabilistic graphical models and

are known for being reasonably fast and easy to implement

in a memory-efficient manner. Both techniques have been

used for inference and learning in models with applications

ranging from text processing (Blei et al. 2003) to computer

vision (Frey and Jojic 2005). Winn and Bishop (2005) pro-

posed Variational Message Passing (VMP) as a way to view

many variational inference techniques, and it represents a

general purpose algorithm for approximate inference. The

approach is similar in nature to BP in that messages propa-

gate local information throughout a graph, and the message

computation is similar. However, unlike BP, VMP optimizes

a lower bound on the log probability of observed variables


in a generative model. Variational inference thus has a more

direct connection to the probability of data under a model

when an underlying graphical structure contains cycles.

Experimental and theoretical analysis of variational

methods has shown that while the asymptotic performance

of other methods such as sampling (Andrieu et al. 2003) can

be superior, frequently variational methods are faster for ap-

proximate inference (Jordan et al. 1999). However, many

real world problems require models with variables having

very large state spaces. Under these conditions, inference

with variational methods becomes very slow, diminishing

any gains. We address this by proposing sparse variational

methods. These methods also provide theoretical guarantees

that the Kullback–Leibler (KL) divergence between approx-

imate distributions and true distributions is iteratively min-

imized. Some of our previous work (Pal et al. 2006) has

explored sparse methods for approximate inference using

BP in chain-structured graphs, in loopy graphs (Weinman et

al. 2009), and 2D grids (Weinman et al. 2008).

We focus our explorations of marginal inference for

learning in this paper on approximate marginal inference us-

ing BP, variational mean field, and sparse variants of these

methods. We find that the sparse variants make most tasks

dramatically more practical, reducing training times by an

order of magnitude. We focus our theoretical analysis on

a new method we call sparse variational message pass-

ing (SVMP). The method combines the theoretical benefits

of variational methods with the time-saving advantages of

sparse messages. The state space in stereo models can be-

come quite large if one seeks to account for many possible

discretized disparity levels. Thus, we believe that the sparse

learning techniques we propose here will be an important

contribution. While we do not explore it here, we note that

sparse inference methods for tree-based approximation tech-

niques (Wainwright et al. 2005) or structured mean field

methods (Jordan et al. 1999) could be a promising direc-

tion for future research. This paper explores a broader and

richer set of model structures compared to our earlier work

(Scharstein and Pal 2007). We also expand upon our ear-

lier experimental analysis in Weinman et al. (2008, 2007)

including a comparison with pseudolikelihood.

The remainder of the paper is structured as follows. First,

in Sect. 2 we describe the new stereo data sets we use as

ground truth for our learning experiments. In Sect. 3, we de-

velop a number of different CRF architectures with different levels of complexity. We begin with a probabilistic CRF version of a canonical model for the stereo vision problem in Sect. 3.1. The canonical model is then augmented to explore:

modulation terms that are dependent upon disparity differ-

ences (Sect. 3.2), models that use patches to compute local

costs (Sect. 3.3), and models that explicitly account for oc-

clusions (Sect. 3.4). In Sect. 4, we present the key challenge

in gradient-based learning and motivate how different types

of approximate inference can be used to approximate a key

intractable expectation. In Sect. 4.2, we then review classical

mean field updates and show how sparse variational message

passing can accelerate inference. In Sect. 5, we present re-

sults using graph cuts for learning in the different model ar-

chitectures we have discussed. We then present results com-

paring sparse BP and VMP with graph cuts. Through this

we see how using variational distributions for learning im-

proves results over the point estimate given by graph cuts

and observe how sparse message passing can lead to an or-

der of magnitude reduction in inference time compared to

dense message passing. Finally, we show how learning pa-

rameters with our technique allows us to improve the quality

of occlusion predictions in more richly structured CRFs.

2 Data Sets

In order to obtain a significant amount of training data for

stereo learning approaches, we have created 30 new stereo

data sets with ground-truth disparities using an automated

version of the structured-lighting technique of Scharstein

and Szeliski (2003). Our data sets are available for use by

other researchers.2 Each data set consists of 7 rectified views

taken from equidistant points along a line, as well as ground-

truth disparity maps for viewpoints 2 and 6. The images are

about 1300 × 1100 pixels (cropped to the overlapping field

of view), with about 150 different integer disparities present.

Each set of 7 views was taken with three different exposures

and under three different lighting conditions. We thus have 9

different images of each of the 7 viewpoints. These images

exhibit significant radiometric differences and can be used to

test for robustness to violations of the brightness constancy

assumption, which are common in real-world applications.

For the work reported in this paper we only use the six

data sets shown in Fig. 1: Art, Books, Dolls, Laundry, Moe-

bius and Reindeer. As input images we use a single image

pair (views 2 and 6) taken with the same exposure and light-

ing. To make the images manageable by the graph-cut stereo

matcher, we downsample the original images to one third of

their size, resulting in images of roughly 460 × 370 pixels

with a disparity range of 80 pixels. The resulting images are

still more challenging than standard stereo benchmarks such

as the Middlebury Teddy and Cones images, due to their

larger disparity range and higher percentage of untextured

surfaces.

3 Stereo Vision and CRFs

The classical formulation of the stereo vision problem is

to estimate the disparity (horizontal displacement) at each

2http://vision.middlebury.edu/stereo/data/.


Fig. 1 The six data sets used in this paper. Shown is the left image of each pair and the corresponding ground-truth disparities (©2007 IEEE)

pixel given a rectified pair of images. It is common in MRF-

based stereo vision methods to work with energy functions

of the form

F(x, y) = Σ_i U(x_i, y) + Σ_{i∼j} V(x_i, x_j, y),    (1)

where U is a data term that measures the compatibility between a disparity x_i and observed intensities y, and V is a smoothness term between disparities at neighboring locations i ∼ j (Boykov et al. 2001).

We construct a CRF for stereo by conditionally normalizing the exponentiated F over all possible values for each x_i and for each pixel location i in the image. More formally, let X_i be a discrete random variable taking on values x_i from a finite alphabet X = {0, ..., N − 1}. The concatenation of all random variables X takes on values denoted by x. If we denote the conditioning observation in our model as y, we can then express our CRF as

P(X = x | y) = (1/Z(y)) exp(−F(x, y)),    (2)

with

Z(y) = Σ_x exp(−F(x, y)).    (3)

The normalizer Z(y) is typically referred to as the partition

function. It is useful to note that a key distinction between a

CRF and a jointly defined MRF is that the partition function

of an MRF does not depend on the observation y and nor-

malizes a joint distribution over the random variables X and

a set of random variables Y defined for y. When using our

model to create a depth map from a stereo pair, our goal is

to find an assignment to X that minimizes the negative log probability

−log P(x | y) = log Z(y) + Σ_i U(x_i, y) + Σ_{i∼j} V(x_i, x_j, y).    (4)

Note that our formulation, unlike other energy-based stereo

approaches, explicitly accounts for a data dependent parti-

tion function. Furthermore, following the typical formula-

tion of CRFs, we express cost terms U and pairwise smooth-

ness terms V using a linear combination of feature functions

f_u, f_v, which gives us

U(x_i, y) = Σ_u θ_u f_u(x_i, y),    (5)

V(x_i, x_j, y) = Σ_v θ_v f_v(x_i, x_j, y),    (6)

where θ_u, θ_v are the parameters of our model. The notation

follows the usual format for specifying the potential func-

tions of CRFs (Lafferty et al. 2001; Sutton and McCallum

2006), and the linear form allows us to derive an intuitive

gradient-based minimization procedure for parameter esti-

mation.

3.1 A Canonical Stereo Model

The CRF of (2) is a general form. Here we present the

specific CRF used for our experiments on stereo dispar-

ity estimation in Sect. 5, following the model proposed by

Scharstein and Pal (2007). The data term U is given by

U(x_i, y) = c(i, x_i, y),    (7)

where c simply measures the absolute intensity difference between the corresponding pixels of the images, as indicated by i and x_i. We use the difference measure of Birchfield and Tomasi (1998) summed over all color bands for invariance to image sampling.

The smoothness term V is a gradient-modulated Potts model (Boykov et al. 2001; Scharstein and Pal 2007) with K parameters:

V(x_i, x_j, y) =
    0,    if x_i = x_j
    θ_k,  if x_i ≠ x_j and g_ij ∈ B_k.    (8)


Here, g_ij is the color gradient or root mean square color difference between neighboring pixels i and j. The values B_k represent discretized intervals that the gradient belongs to, for the purposes of modulating the smoothness penalty. Interval breakpoints may be chosen from different sets. For example, in our initial experiments we explore subsets of {0, 2, 4, 8, 12, 16, ∞}. Let θ_v denote all the smoothness parameters.
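For concreteness, the following is a minimal sketch (written in Python for this presentation, not taken from our implementation) of how the energy (1) is assembled from the data term (7) and the gradient-modulated Potts term (8). It substitutes a plain absolute color difference for the Birchfield–Tomasi measure, and the helper names, bin breakpoints, and θ values are illustrative placeholders only.

```python
import numpy as np

def data_term(x, left, right):
    """U(x_i, y): absolute color difference between pixel i in the left image
    and the pixel it maps to in the right image under disparity x_i.
    (A stand-in for the Birchfield-Tomasi measure used in the paper.)"""
    H, W = x.shape
    cols = np.arange(W)[None, :].repeat(H, axis=0)
    rows = np.arange(H)[:, None].repeat(W, axis=1)
    matched = np.clip(cols - x, 0, W - 1)              # shift columns by disparity
    diff = np.abs(left.astype(float) - right[rows, matched].astype(float))
    return diff.sum(axis=-1)                           # sum over color bands

def potts_term(x, left, theta, breakpoints=(2.0, 4.0)):
    """V(x_i, x_j, y) of (8): theta[k] is paid whenever neighboring
    disparities differ and the color gradient g_ij falls into bin B_k."""
    total = 0.0
    for axis in (0, 1):                                # vertical and horizontal neighbors
        xi, xj = np.swapaxes(x, 0, axis)[:-1], np.swapaxes(x, 0, axis)[1:]
        li, lj = np.swapaxes(left, 0, axis)[:-1], np.swapaxes(left, 0, axis)[1:]
        g = np.sqrt(((li.astype(float) - lj.astype(float)) ** 2).mean(axis=-1))
        k = np.digitize(g, breakpoints)                # which gradient bin B_k
        total += np.where(xi != xj, theta[k], 0.0).sum()
    return total

def energy(x, left, right, theta):
    """F(x, y) = sum_i U(x_i, y) + sum_{i~j} V(x_i, x_j, y)."""
    return data_term(x, left, right).sum() + potts_term(x, left, theta)

# Tiny synthetic example: 10x10 color images, 8 disparity states, K = 3 bins.
rng = np.random.default_rng(0)
left = rng.integers(0, 255, (10, 10, 3))
right = rng.integers(0, 255, (10, 10, 3))
x = rng.integers(0, 8, (10, 10))
print(energy(x, left, right, theta=np.array([20.0, 10.0, 5.0])))
```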

3.2 Disparity Difference Dependent Modulation

Interaction potentials that take into account the difference

in disparities between pixels have been of considerable in-

terest in the past. Felzenszwalb and Huttenlocher (2006)

have explored parametric forms for this interaction such

as V(x_i, x_j, y) = c|x_i − x_j| or V(x_i, x_j, y) = c(x_i − x_j)².

However, our framework allows us to learn the functional

form of such interactions. To explore other aspects of

smoothness modulation, we shall investigate models with

interaction terms as a more general function of disparity

changes, e.g., V(x_i, x_j, y) = f(|x_i − x_j|). We are able to achieve this in a manner similar to our gradient discretization approach by discretizing the absolute disparity differences d_ij = |x_i − x_j| into bins C_l and defining feature functions that are active on the jointly discretized disparity difference bins C_l and gradient bins B_k such that

V(x_i, x_j, y) = θ_kl   if g_ij ∈ B_k and d_ij ∈ C_l.    (9)
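As a small illustration of this joint discretization (a sketch of ours; the bin edges are examples only), the pairwise cost of (9) is simply a table lookup indexed by the gradient bin and the disparity-difference bin:

```python
import numpy as np

# Illustrative bin breakpoints (the paper explores, e.g., [1,2,3,4] for both).
grad_breaks = np.array([1.0, 2.0, 3.0, 4.0])      # defines gradient bins B_k
disp_breaks = np.array([1, 2, 3, 4])              # defines disparity-difference bins C_l
theta_kl = np.zeros((5, 5))                       # one parameter per (B_k, C_l) pair

def pairwise_cost(x_i, x_j, g_ij):
    """V(x_i, x_j, y) = theta_kl  if g_ij in B_k and |x_i - x_j| in C_l (Eq. 9)."""
    k = np.digitize(g_ij, grad_breaks)            # gradient bin index
    l = np.digitize(abs(x_i - x_j), disp_breaks)  # disparity-difference bin index
    return theta_kl[k, l]

print(pairwise_cost(x_i=12, x_j=9, g_ij=2.5))     # looks up theta_kl for bins (k=2, l=3)
```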

3.3 Patch Matching

While pixel to pixel intensity matching is often an effective

strategy for stereo matching, modern cameras are usually

able to produce images at a resolution much higher than one

might need for the corresponding disparity map. We thus ex-

plore matching patches in a pair of high resolution images to

compute our local cost terms that will be used for inferring

disparity at lower resolution. The pair of high resolution im-

ages are partitioned into n × n patches. The new data term

U uses the same Birchfield and Tomasi costs (Birchfield and

Tomasi 1998) over the color channels as (7), except it must

now sum the costs over all corresponding pixels in the high

resolution image patches indicated by i and x_i.

We explore this model in cases where the smoothness

term V is defined similarly to our simple, canonical stereo

model of (8). However, the color gradient g_ij between neigh-

boring locations i and j is divided by the size of the patch.

In our experiment, we used the full size color images of

roughly 1380 × 1110 pixels and the one-third size ground

truth disparity maps of roughly 460 × 370 to train and test

our model. Thus, the resolution we have selected for the rest

of our experiments allows us to use a patch size of 3×3.

3.4 Occlusion Modeling

To account for occlusion, we create a model with an ex-

plicit occlusion state for the random variable associated with

each pixel in the image. In this extended model we use

x_i ∈ {0, ..., N − 1} ∪ {"occluded"}. The local data term U in the extended model has the form:

U(x_i, y) =
    c(i, x_i, y),  if x_i ≠ "occluded"
    θ_o,           if x_i = "occluded",    (10)

where c(i, x_i, y) is the Birchfield and Tomasi cost for disparity x_i at pixel i, as before. The new parameter θ_o is a local bias for predicting the pixel to be occluded.

We may also extend the gradient modulated smoothness terms to treat occluded states with a separate set of parameters such that:

V(x_i, x_j, y) =
    0,        if x_i = x_j and x_i ≠ "occluded"
    θ_k,      if x_i ≠ x_j, g_ij ∈ B_k and both x_i, x_j ≠ "occluded"
    θ_{o,o},  if x_i = x_j and x_i = "occluded"
    θ_{o,k},  if x_i ≠ x_j, g_ij ∈ B_k and x_i or x_j = "occluded".    (11)
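A compact sketch (ours, with placeholder parameter values and an arbitrary integer code for the "occluded" label) of how the extra state enters the local terms (10) and (11):

```python
import numpy as np

OCC = -1                       # placeholder code for the "occluded" label

def data_term(x_i, match_cost, theta_o):
    """U(x_i, y) of Eq. (10): matching cost for a disparity, bias theta_o for occlusion."""
    return theta_o if x_i == OCC else match_cost[x_i]

def smooth_term(x_i, x_j, k, theta_k, theta_oo, theta_ok):
    """V(x_i, x_j, y) of Eq. (11); k indexes the gradient bin B_k containing g_ij."""
    occ_i, occ_j = (x_i == OCC), (x_j == OCC)
    if not occ_i and not occ_j:
        return 0.0 if x_i == x_j else theta_k[k]    # ordinary Potts case
    if occ_i and occ_j:
        return theta_oo                             # both pixels occluded
    return theta_ok[k]                              # exactly one pixel occluded

match_cost = np.array([7.0, 3.0, 9.0])              # toy costs for 3 disparities
print(data_term(OCC, match_cost, theta_o=5.0))
print(smooth_term(1, OCC, k=0, theta_k=[4.0, 2.0], theta_oo=1.0, theta_ok=[6.0, 3.0]))
```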

4 Parameter Learning

The energy function F(x,y) in our models is parameterized

by θ = (θ_u, θ_v), where θ_u denotes the data term parameters and θ_v denotes the smoothness term parameters. These

parameters may be learned in a maximum conditional like-

lihood framework with labeled training pairs. The objective

function and gradient for one training pair (x,y) can be ex-

pressed as the minimization of

O(θ) = −log P(x | y; θ)    (12)
     = F(x, y; θ) + log Z(y),    (13)

with

∇O(θ) = ∇F(x, y; θ) − ⟨∇F(x, y; θ)⟩_P(X|y;θ),    (14)

where ⟨·⟩_P(X|y;θ) denotes an expectation under the model's conditional distribution over X. It is known that the CRF loss function is convex for fully observed states (Lafferty et al. 2001). However, in 2D grid lattices such as the ones we consider here we have a critical, but intractable, expectation in (14). But the particular factorization of F(x, y) in (1) allows the expectation in (14) to be decomposed into a sum of expectations over gradients of each term U(x_i, y) and V(x_i, x_j, y) using the corresponding single node and pairwise marginals P(X_i | y; θ) and P(X_i, X_j | y; θ), respectively. In this context we can view the main computational challenge in learning as the task of computing good approximations to these marginals. Approximation of these


key expectations through computing approximate marginals

is thus the crux of our exploration here.
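To make this decomposition concrete, the sketch below (our own illustration, using ad hoc data structures) computes the gradient (14) with respect to the smoothness parameters θ_k of (8) as a difference between empirical counts and counts expected under approximate pairwise marginals:

```python
import numpy as np

def smoothness_gradient(x_true, grad_bin, pair_marginals, K):
    """dO/dtheta_k of Eq. (14) for the Potts smoothness term (8).

    x_true        : dict edge -> (x_i, x_j), ground-truth disparities
    grad_bin      : dict edge -> index k of the gradient bin B_k containing g_ij
    pair_marginals: dict edge -> (N x N) approximate P(X_i, X_j | y)
    """
    grad = np.zeros(K)
    for edge, k in grad_bin.items():
        xi, xj = x_true[edge]
        empirical = float(xi != xj)          # feature value under the training labels
        P = pair_marginals[edge]
        expected = P.sum() - np.trace(P)     # Pr(X_i != X_j) under the model
        grad[k] += empirical - expected
    return grad

# Toy example: two edges, three disparity states, K = 2 gradient bins.
P_unif = np.full((3, 3), 1.0 / 9.0)
print(smoothness_gradient(
    x_true={(0, 1): (2, 2), (1, 2): (0, 1)},
    grad_bin={(0, 1): 0, (1, 2): 1},
    pair_marginals={(0, 1): P_unif, (1, 2): P_unif},
    K=2))
```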

For our experiments, we use a simple gradient descent

approach for learning. Our (approximate) gradients can be

very noisy, which can cause problems for second order

methods such as BFGS. We also have a small number of

training examples, so learning methods such as stochastic

gradient, which creates small batches of training data to ac-

celerate learning, are not well-suited for our setting. Our

experiments will focus on comparing different approximate

inference techniques using a straightforward learning algo-

rithm, minimizing potential interactions between more so-

phisticated learning approaches and the consequences of us-

ing approximate distributions.

In previous work by Scharstein and Pal (2007), graph cuts were used to find the most likely configuration of X. This was taken as a point estimate of P(X | y; θ) and used to

approximate the gradient. Such an approach is potentially

problematic for learning when the marginals are more uni-

form or contain a number of solutions with similar proba-

bility and look unlike a single delta function. Fortunately, a

variational distribution Q(X) can provide more flexible ap-

proximate marginals that may be used to approximate the

gradient. We show in our experiments that using these mar-

ginals for learning is better than using a point estimate in

situations when there is greater uncertainty in the model.

We now derive the equations for sparse mean field infer-

ence using a variational message passing (VMP) perspective

(Winn and Bishop 2005). Sparse VMP iteratively minimizes

the KL divergence between an approximation Q and the dis-

tribution P. In the context of CRFs, the functional optimized

by sparse VMP is an upper bound on the negative log con-

ditional partition function.

4.1 Mean Field

Here we briefly review the standard mean field approxima-

tion for a conditional distribution like (2). As before we let

X_i be a discrete random variable taking on values x_i from

a finite alphabet X = {0,...,N − 1}. The concatenation of

all random variables X takes on values denoted by x, and

the conditioning observation is y. Variational techniques,

such as mean field, minimize the KL divergence between

an approximate distribution Q(X) and the true distribution

P(X | y). For the conditional distribution (2), the divergence

is

KL(Q(X) ∥ P(X | y)) = Σ_x Q(x) log [Q(x) / P(x | y)]
                    = Σ_x Q(x) log [Q(x) Z(y) / exp(−F(x, y))]
                    = ⟨F(x, y)⟩_Q(X) − H(Q(X)) + log Z(y).

The energy of a configuration x is F(x,y). We define a “free

energy” of the variational distribution to be

L(Q(X)) = ⟨F(x, y)⟩_Q(X) − H(Q(X)).    (15)

Thus, the free energy is the expected energy under the vari-

ational distribution Q(X), minus the entropy of Q(X). The

divergence then becomes

KL(Q(X) ∥ P(X | y)) = L(Q(X)) + log Z(y).    (16)

Since the KL divergence is always greater than or equal to

zero, it holds that

L(Q(X)) ≥ −log Z(y),    (17)

and the KL divergence is minimized at zero when the free

energy equals the negative log partition function. Since

logZ(y) is constant for a given observation y, minimizing

the free energy serves to minimize the KL divergence.
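The bound (17) is easy to check numerically. The following sketch (ours, with an arbitrary toy energy) enumerates a two-variable model, computes the exact −log Z(y), and verifies that the free energy (15) of a factored Q lies above it:

```python
import numpy as np
from itertools import product

# Toy conditional energy F(x, y) over two variables with 3 states each.
rng = np.random.default_rng(1)
U = rng.normal(size=(2, 3))                     # unary terms U(x_i, y)
V = rng.normal(size=(3, 3))                     # pairwise term V(x_0, x_1, y)

def F(a, b):
    return U[0, a] + U[1, b] + V[a, b]

# Exact negative log partition function by enumeration (Eq. 3).
logZ = np.log(sum(np.exp(-F(a, b)) for a, b in product(range(3), repeat=2)))

# An arbitrary factored variational distribution Q(X) = Q(X_0) Q(X_1).
Q = rng.dirichlet(np.ones(3), size=2)           # each row sums to one

expected_F = sum(Q[0, a] * Q[1, b] * F(a, b) for a, b in product(range(3), repeat=2))
entropy = -(Q * np.log(Q)).sum()                # H(Q(X)) = H(Q(X_0)) + H(Q(X_1))
free_energy = expected_F - entropy              # Eq. (15)

print(free_energy, -logZ, bool(free_energy >= -logZ))   # Eq. (17): the bound holds
```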

Mean field updates will minimize KL(Q(X) ∥ P(X | y)) for a factored distribution Q(X) = Π_i Q(X_i). Using this factored Q, we can express our objective as

L(Q(X)) = Σ_x Π_i Q(x_i) F(x, y) + Σ_i Σ_{x_i} Q(x_i) log Q(x_i)
        = Σ_{x_j} Q(x_j) ⟨F(x, y)⟩_{Π_{i:i≠j} Q(X_i)} − H(Q(X_j)) − Σ_{i:i≠j} H(Q(X_i)),    (18)

where we have factored out the approximating distribution

Q(Xj) for one variable, Xj. We form a new functional by

adding Lagrange multipliers to constrain the distribution to

sum to unity. This yields an equation for iteratively calcu-

lating an updated approximating distribution Q∗(Xj) using

the energy F and the distributions Q(X_i) for other i:

Q∗(X_j = x_j) = (1/Z_j) exp(−⟨F(x, y)⟩_{Π_{i:i≠j} Q(X_i)}),    (19)

where Z_j is a normalization constant computed for each update so that Q∗(x_j) sums to one. See Weinman et al. (2007) for the complete derivation of (19). Iteratively updating Q(X_j) in this manner for each variable X_j will

monotonically decrease the free energy L(Q(X)), thus min-

imizing the KL divergence.
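For a lattice CRF with unary terms U and pairwise terms V, the expectation in (19) only involves the neighbors of X_j, so a dense sweep of updates is straightforward. Below is a sketch of ours; the toy Potts-style V, the parameter values, and the raster sweep order are illustrative choices, not those of our experiments.

```python
import numpy as np

def mean_field_sweep(Q, unary, V):
    """One pass of the update (19) over every variable of a 4-connected grid CRF.

    Q     : (H, W, N) current factored marginals Q(X_j)
    unary : (H, W, N) data terms U(x_j, y)
    V     : (N, N) symmetric pairwise energy shared by all edges
    """
    H, W, N = Q.shape
    for r in range(H):
        for c in range(W):
            # Expected energy of each state x_j: U(x_j, y) plus, for every
            # neighbor i, sum_{x_i} Q_i(x_i) V(x_j, x_i).
            e = unary[r, c].copy()
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < H and 0 <= nc < W:
                    e += V @ Q[nr, nc]
            q = np.exp(-(e - e.min()))            # subtract the min for stability
            Q[r, c] = q / q.sum()                 # normalize (the 1/Z_j of Eq. 19)
    return Q

# Toy problem: 4x4 grid, 5 states, Potts-like interactions.
rng = np.random.default_rng(2)
unary = rng.normal(size=(4, 4, 5))
V = 2.0 * (1.0 - np.eye(5))                       # penalty for disagreeing neighbors
Q = np.full((4, 4, 5), 1.0 / 5.0)
Q = mean_field_sweep(Q, unary, V)
print(Q[0, 0])
```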


Fig. 2 Minimizing the global KL divergence via two different sparse local updates. The global divergence KL(Q(X) ∥ P) can be decomposed into a local update plus a constant: KL(Q̃(X_j) ∥ Q∗(X_j)) + const. Consequently, at each step of sparse variational message passing we may minimize different local divergences to within some ε and, when updating different local Qs, we minimize the global KL divergence

4.2 Sparse Variational Message Passing

Variational marginals can be more valuable than graph-cut-

based point estimates for accurate learning or other predic-

tions. However, when the state space of the X_j is large, cal-

culating the expectations within the mean field update (19)

can be computationally burdensome. Here we show how to

dramatically reduce the computational load of calculating

updates when many states have a very low probability under

the variational distribution. The sparse methods presented

here represent a middle way between a fully-Bayesian ap-

proach and a simple point estimate. While the former con-

siders all possibilities with their corresponding (often small)

probabilities, the latter only considers the most likely possi-

bility. Sparse updates provide a principled method for retain-

ing an arbitrary level of uncertainty in the approximation.

The idea behind the sparse variational update is to elimi-

nate certain values of x_j from consideration by making their

corresponding variational probabilities Q(x_j) equal to zero.

Such zeros make calculating the expected energy for sub-

sequent updates substantially easier, since only a few states

must be included in the expectation. The eliminated states

are those with low probabilities to begin with. Next we show

how to bound the KL divergence between the original and

sparse versions of Q(Xj).

Given (16), (18), and (19), KL(Q(X) ∥ P(X | y)) can be expressed as a function of a sparse update Q̃(X_j), the original mean field update Q∗(X_j), and the other Q(X_i)'s, where i ≠ j:

KL(Q(X) ∥ P(X | y)) = KL(Q̃(X_j) ∥ Q∗(X_j)) + log Z_j + log Z(y) − Σ_{i:i≠j} H(Q(X_i)).    (20)

Since the last three terms of (20) are constant with respect to our sparse update Q̃(X_j), KL(Q(X) ∥ P(X | y)) is minimized when Q̃(X_j) = Q∗(X_j). At each step of sparse variational message passing, we will minimize KL(Q̃(X_j) ∥ Q∗(X_j)) to within some small ε. As a result, each update to a different Q(X_j) yields further reduction of the global KL divergence. These relationships are illustrated in Fig. 2.

If each X_j is restricted to a subset of values x_j ∈ X_j ⊆ X, we may define sparse updates Q̃(X_j) in terms of the original update Q∗(X_j) and the characteristic/indicator function 1_{X_j}(x_j) for the restricted range:

Q̃(x_j) = 1_{X_j}(x_j) Q∗(x_j) / Z̃_j,    (21)

where the new normalization constant is

Z̃_j = Σ_{x_j} 1_{X_j}(x_j) Q∗(x_j) = Σ_{x_j ∈ X_j} Q∗(x_j).    (22)

Thus, the divergence between a sparse update and the original is

KL(Q̃(X_j) ∥ Q∗(X_j)) = Σ_{x_j} [1_{X_j}(x_j) Q∗(x_j) / Z̃_j] log [1_{X_j}(x_j) Q∗(x_j) / (Z̃_j Q∗(x_j))]    (23)
                      = −log Z̃_j (1/Z̃_j) Σ_{x_j ∈ X_j} Q∗(x_j) = −log Z̃_j.    (24)


As a consequence, it is straightforward and efficient to compute a maximally sparse Q̃(X_j) such that

KL(Q̃(X_j) ∥ Q∗(X_j)) ≤ ε    (25)

by sorting the Q∗(x_j) values and performing a sub-linear search to satisfy the inequality. For example, if we wish to preserve 99% of the probability mass in the sparse approximation we may set ε = −log 0.99 ≈ 0.01. We have thus created a global KL divergence minimization scheme using the local divergence analysis, (23)–(24), given for belief propagation in Pal et al. (2006).
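Operationally, the truncation of (21)–(25) amounts to sorting Q∗(x_j) and cutting off the cumulative mass; a sketch of ours, with an arbitrary example distribution:

```python
import numpy as np

def sparsify(q_star, delta=0.01):
    """Truncate a dense update Q*(X_j) to its most probable states.

    Keeps the smallest set of states whose total mass is at least 1 - delta,
    so that KL(Q_sparse || Q*) = -log(Z_sparse) <= -log(1 - delta), cf. (21)-(25).
    Returns the kept state indices and the renormalized sparse distribution."""
    order = np.argsort(q_star)[::-1]                 # most probable states first
    mass = np.cumsum(q_star[order])
    keep = order[: np.searchsorted(mass, 1.0 - delta) + 1]
    z_sparse = q_star[keep].sum()
    return keep, q_star[keep] / z_sparse             # Eq. (21)

q_star = np.array([0.90, 0.07, 0.02, 0.006, 0.004])  # a fairly peaked dense update
keep, q_sparse = sparsify(q_star, delta=0.01)        # preserve 99% of the mass
print(keep, q_sparse)
print(-np.log(q_star[keep].sum()))                   # KL(Q_sparse || Q*) = -log Z_sparse
```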

Figure 2 illustrates the way in which sparse VMP itera-

tively minimizes KL(Q(X) ∥ P(X | y)) after each itera-

tion of message passing. In short, sparse mean field updates

result in a slightly lesser reduction of the KL divergence at

each iteration. Thus it is possible that sparse updates may

require more iterations to achieve the same approximation.

However, the dramatic speedup easily recoups the subopti-

mal step size by allowing multiple iterations to be completed

very quickly. In Sect. 5 we show how using sparse messages

can yield a dramatic increase in inference speed.

Concretely speaking, for a model with N label states,

when messages are passed between variables and their

neighbors through a pairwise interaction potential function,

sparsification reduces the computation from O(N × N) to

O(K × N) for K ≪ N. Importantly, this speedup is gained

for each iteration of variational message passing and one

typically needs to perform many iterations, re-visiting each

variable multiple times. Indeed, to propagate information

across the image one needs to have as many message pass-

ing iterations for each variable as the longest path from

one variable to another in the lattice. Additionally, after the

sparse message passing phase of the algorithm we compute

parameter updates. For this step we have single node and

pairwise variable expectations which can be performed using

O(K_i) vs. O(N) operations and O(K_i × K_j) vs. O(N × N)

operations for pixels i and pairs of pixels i and j in the

image. However this savings is small compared to savings

during inference so we use a full distribution for the final

expectation and approximate marginal.


5 Experiments

In this section we present the results of a number of experi-

ments. The first batch of experiments examines learning and

generalization using a simple model. In Sect. 5.1 we first

examine the convergence when learning simple models with

only gradient modulation terms. We train models having dif-

ferent numbers of discretized bins with graph cuts for ap-

proximate marginals and all six data sets as our training set.

Then in Sect. 5.2, we use a leave-one-out approach to evalu-

ate the performance of the learned parameters on a new data

set. In Sect. 5.3 we then examine how the learned parameters

generalize to other data sets.

Our second batch of experiments in Sect. 5.4 examines

the impact of extending our simple model in a variety of

ways. These experiments explore extensions of the canon-

ical model of Sect. 3.1 with the disparity difference depen-

dent modulation terms of Sect. 3.2, the patch matching strat-

egy of Sect. 3.3, and the occlusion models developed in

Sect. 3.4.

Our third batch of experiments compares inference and

learning using different approximate inference techniques

for marginals. The first experiment of this batch in Sect. 5.5

compares sparse and traditional mean field methods for ap-

proximate inference, showing how sparse message passing

can greatly accelerate free energy minimization. The second

experiment in Sect. 5.6 compares the performance of mod-

els learned using approximate marginals from both sparse

mean field and a point estimate of the posterior marginals

from graph cuts.

For all our experiments we use a straightforward gradient-based optimization procedure: we start with a small learning rate (10⁻⁴) and increase it by a small factor unless the

norm of the gradient increases dramatically, in which case

we backtrack and decrease the learning rate.
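For reference, a sketch of this adaptation rule as described above (ours; the growth and shrink factors and the blow-up threshold below are illustrative, not the exact values used in our experiments):

```python
import numpy as np

def learn(grad_fn, theta, rate=1e-4, grow=1.1, shrink=0.5, blowup=5.0, iters=50):
    """Gradient descent with the simple rate adaptation described above:
    grow the learning rate slowly, but backtrack and shrink it whenever the
    norm of the (approximate) gradient increases dramatically."""
    g = grad_fn(theta)
    for _ in range(iters):
        candidate = theta - rate * g                 # tentative gradient step
        g_new = grad_fn(candidate)
        if np.linalg.norm(g_new) > blowup * np.linalg.norm(g):
            rate *= shrink                           # backtrack: keep the old theta
        else:
            theta, g = candidate, g_new              # accept the step
            rate *= grow                             # and increase the rate slightly
    return theta

# Toy quadratic objective 0.5 * ||theta||^2, whose gradient is theta itself.
print(learn(lambda t: t, theta=np.array([3.0, -2.0])))
```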

As training and test data we use 6 stereo pair images with

ground-truth disparities from the 2005 scenes of the Middle-

bury stereo database. These images are roughly 460 × 370

pixels and have discretized disparities with N = 80 states.

Thus, when there are more than 600,000 messages of length

N to send in any round of mean field updates for one image,

shortening these to only a few states for most messages can

dramatically reduce computation time.

5.1 Convergence

In these experiments we focus on learning the θ_v parameters of the pairwise V potentials.

It is important to account for the fact that we do not model occlusions in this simple CRF. It is well known that spurious

minimal-cost matches in occluded areas can cause artifacts

in the inferred disparity maps. We therefore use our ground-

truth data to mask out the contributions of variables in oc-

cluded regions to our gradient computation during training.

There are a number of more principled ways to address this

issue. For example, in the model we developed in Sect. 3.4

we take a more principled approach by creating an addi-

tional occlusion state in our model. Another strategy might

be to treat the occluded pixel as a hidden variable, then use

an expected gradient or expectation maximization approach

for learning with these pixels. Techniques for learning CRFs

with hidden variables are discussed in more detail in Sut-

ton and McCallum (2006). Indeed, an even better approach

might be to use both these strategies, introducing a separate


binary indicator variable for hidden vs. not hidden pixels as

well as a hidden value for such pixels.

We experiment with learning models using different

numbers of parameters θ_v, from K = 1 (i.e., a single global

smoothness weight) to K = 6 (i.e., a parameter for each of

6 gradient bins). We first demonstrate the effectiveness of

the learning by training on all six datasets. It is useful to

visualize the disparities predicted by the model over each

iteration of learning. Figures 3 and 4 show how the disparity

maps change during training. For clarity we have masked

the occluded regions in black in these plots, since our model

will assign arbitrary disparities in these areas. Table 1 shows

the discretization strategy we use for image gradients as well

as the final values of the learned parameters.

Figure 5 (top) shows the gradient norm during learning, illustrating that our optimization procedure terminates with

a near zero gradient. This indicates that the expectation of

features under the (approximate) conditional distribution of

the model is able to match the empirical expectation of the

features. If we did not have an approximate distribution and

expectation in our gradient, this would indicate a global optimum due to the convexity of the CRF objective.

Note that convergence is faster for fewer parameters.

Figure 5 (bottom) shows the disparity errors during learn-

ing. Again, models with fewer parameters converge more

quickly, thus yielding lower errors faster. However, the mod-

els with more parameters eventually outperform the simpler

models. In Fig. 5 (top) we observe that there appears to be

an initial phase (e.g., during the first 25 iterations) where the

norm of the approximate gradient monotonically decreases

during the optimization. After this point, models with larger

numbers of parameters appear to have less stability. This ef-

fect may be a result of noisy gradient approximations due

to our use of graph-cut-derived MPEs for the model expec-

tation term of our gradient.

5.2 Performance of learned parameters

We now use 5 of the 6 datasets for training, and evaluate

the disparity error of the remaining dataset using the pa-

rameters obtained during training. Figure 6 shows the re-

sults for the Moebius dataset. The top plot shows the errors during leave-one-out training.

Fig. 3 Disparity maps of the entire training set for K = 3 parameters after 0, 10, and 20 iterations. Occluded areas are masked (©2007 IEEE)

Fig. 4 Two zoomed views of the disparity maps for K = 3 parameters and learning on all six data sets after 0, 5, 10, 15, and 20 iterations. Occluded areas are masked (©2007 IEEE)

Table 1 The gradient bins for K = 1,...,6 parameters and the parameter values θ_k learned over all six datasets

Fig. 5 Gradient norm (top) and disparity errors (bottom) during learning on all 6 datasets (©2007 IEEE)

One can observe a similar trend as in Fig. 5 (bottom), namely that the errors decrease during learning, and that the more complex models eventually outperform the simpler models. For compar-

ison, the bottom plot in Fig. 6 shows the errors when us-

ing the Moebius dataset itself for training. In this case find-

ing a low-gradient solution means that we have effectively

matched the distribution of disparity changes and associ-

ated intensity gradients of the ground-truth image. Not sur-

prisingly, this results in lower errors, but not significantly

lower than in the top plot—which indicates that the parame-

ters learned from the other 5 images generalize reasonably

well.

Figure 7 shows the equivalent plots for a different dataset,

Reindeer. Again we show the errors during leave-one-out

training at the top and those during training on the dataset

itself on the bottom.

Fig. 6 Results of leave-one-out learning on the Moebius dataset. Top: Moebius disparity errors using the parameters obtained during learning from the other 5 datasets. Bottom: Moebius disparity errors using the parameters learned from the dataset itself (©2007 IEEE)

Fig. 7 Results of leave-one-out learning on the Reindeer dataset. Top: Disparity errors using the parameters obtained during learning from the other 5 datasets. Bottom: Disparity errors using the parameters learned from the dataset itself (©2007 IEEE)

Here we get slightly different results. First, the leave-one-out results no longer indicate that performance increases with the number of parameters. In fact

the model with K = 2 does best in the end. But the re-

sults in the bottom plot (where we train the parameters on

the test data itself) show that this is not necessarily a prob-

lem of insufficient generalization, but rather that learning the

best parameters (which amounts to matching the smooth-

ness properties of the ground truth) might not always yield

to lower matching errors. On the other hand, this could also

be due to noisy gradient approximations as mentioned ear-

lier.

5.3 Performance on standard benchmarks

Next, we examine how well the parameters learned from our

six datasets generalize to other stereo images. Table 2 shows

the disparity errors on the Middlebury benchmark consisting

of the Tsukuba, Venus, Teddy, and Cones images. We com-

pare these errors with those of the graph cuts (GC) method

in Scharstein and Szeliski (2002), which uses a hand-tuned

MRF model with two gradient bins. Our average results for

K = 1 and K = 2 are slightly better than those of GC, and

would result in a similar ranking as the GC method in the

Middlebury evaluation. We provide this to illustrate that we

are able to match and in fact slightly exceed the performance

of a canonical model. The fact that the errors for the more


complex models are higher may indicate that the learned pa-

rameters of those models are tuned more finely to the char-

acteristics of the training data and generalize less well to

datasets that are quite different. We also give a result on the

benchmark after learning a more complex disparity differ-

ence dependent modulation model as outlined in Sect. 3.2

and further explored in Table 3. Here again we use only the

new data for learning and test on the benchmark. In this case

we see a more dramatic gain over the canonical model.

5.4 Extending a Canonical CRF for Stereo

In Table 3 we compare the performance of different models

and feature types described in Sects. 3.1 to 3.4 when learn-


ing across all six of the new stereo image pairs we have cre-

ated. In all cases here we use graph cuts for inference dur-

ing learning. Importantly, we observed that our optimization

for the experiments with both pixel disparity difference dis-

cretization and gradient disparity difference discretization

defined in Sect. 3.2 terminates as a result of the energy func-

tion violating the constraints imposed by graph cuts. These

model configurations perform well despite running into the

limitation of graph-cut-based inference. As such, this case

serves as an illustrative example motivating our next set

of experiments where we use message passing methods for

marginal inference during learning.

In Table 4 we show the learned parameters for a model

with interaction terms dependent on both the disparity dif-

ference between pixels and the magnitude of the gradient

between pixels. This table gives us an idea of the shape of

the learned interaction potential function. We have used a

discretization strategy for our experiments here for two main

reasons: First, we have a lot of data in each image, so learn-

ing a discretized function is reasonable even with many bins.

Second, this strategy allows us to easily visualize the corre-

sponding parameters and gauge their impact. Another rea-

sonable strategy might be to use a functional basis such as

one based on polynomials, which may be an interesting av-

enue for future exploration.

Table 2 A comparison of models with different numbers of parameters K trained on our ground-truth data but evaluated on the Middlebury data set. The last two rows are the performance of the graph cut implementation of Scharstein and Szeliski (2002) and the disparity difference modulation approach of Sect. 3.2

Method        Tsukuba   Venus   Teddy   Cones   Average
K = 1         3.0       1.3     11.1    10.8    6.6
K = 2         2.2       1.6     11.3    10.7    6.5
K = 3         3.1       2.6     16.4    19.6    10.4
K = 4         3.0       2.5     17.3    21.5    11.1
K = 5         2.8       2.1     16.4    21.2    10.6
K = 6         3.1       2.7     14.5    16.8    9.3
GC            1.9       1.8     16.5    7.7     7.0
Disp. dif.    1.9       1.2     11.0    7.0     5.3

Table 3 Comparison of the training set disparity error (percentage of incorrectly predicted pixels) given: (a) for the canonical stereo model with three gradient bins, (b) with five, (c) modulation terms based on both pixel disparity difference and image gradient information, (d) a gradient-modulated occlusion model

Method                                                         Art     Laundry  Books   Dolls   Moebius  Reindeer  Average
Section 3.1: Gradient bins [2,4]                              22.66    30.88    26.17   12.19   18.41    17.92     21.37
Section 3.1: Gradient bins [1,2,3,4]                          20.53    22.66    19.03   12.09   13.01    16.06     17.23
Section 3.3: 3×3 patches and gradient bins [1,2,3,4]          14.95    24.29    19.58   10.95   12.39    15.36     16.25
Section 3.2: Disp. dif. bins and grad. bins: both [1,2,3,4]   17.39    19.43    16.89   10.89   12.83    14.10     15.26
Section 3.4: Occlusion model with gradient bins [2,4]         15.39    21.16    16.53    9.43   12.04    14.27     14.80

5.5 Approximate Inference: Speed, Energy Minimization,

and Marginals

The variational distribution Q(X) provides approximate

marginals Q(Xi) that may be used for computing an approx-

imate likelihood and gradient for training. These marginals

are also used to calculate the mean field updates during free

energy minimization. If these marginals have many states

with very low probability, discarding them will have min-

imal effect on the update. First, we examine the need for

sparse updates by evaluating the amount of uncertainty in

these marginals. Then, we show how much time is saved by

using sparse updates.
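As a small illustration of the first step, the following sketch (our own, with made-up marginals) computes the entropy of each approximate marginal in nats and bits and counts how many states are needed to cover nearly all of its mass.

import numpy as np

def marginal_entropy_nats(q, eps=1e-12):
    # Entropy of each row of q (an N x D matrix of marginals), in nats.
    q = np.clip(q, eps, 1.0)
    return -np.sum(q * np.log(q), axis=1)

def effective_support(q, mass=0.99):
    # Smallest number of states per marginal whose total probability exceeds `mass`.
    sorted_q = np.sort(q, axis=1)[:, ::-1]        # descending probabilities
    cum = np.cumsum(sorted_q, axis=1)
    return 1 + np.argmax(cum >= mass, axis=1)

# Three example pixels with 8 disparity states: nearly certain, moderately
# uncertain, and uniform.
Q = np.array([[0.97, 0.01, 0.01, 0.01, 0.0, 0.0, 0.0, 0.0],
              [0.50, 0.30, 0.10, 0.05, 0.03, 0.01, 0.005, 0.005],
              [0.125] * 8])
H = marginal_entropy_nats(Q)
print(H, H / np.log(2), effective_support(Q))     # nats, bits, states needed

A nearly certain pixel needs only a few states, while a uniform marginal needs all of them; this is the uncertainty that the histograms in Fig. 8 summarize over entire images.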

Our first set of experiments uses the simpler canonical

stereo model having the smoothness term V of (8). Figure 8

shows histograms of the marginal entropies H(Q(Xi)) dur-

ing free energy minimization with two sets of parameters,

the initial parameters, θv = 1, and the learned θv. We ini-

tialize the variational distributions Q(Xi) to uniform and

perform one round of VMP updates. Although most pixels

have very low entropy, the initial model still has several vari-

ables with 2–4 nats³ or about 3–6 bits of uncertainty. Once

the model parameters are learned, the marginal entropies af-

ter one round of mean field updates are much lower. By

the time the mean field updates converge and free energy

is minimized, only a small percentage (less than three percent) have more than a half nat (less than two bits) of uncertainty. However, if point estimates are used, the uncertainty in these marginals will not be well represented. Sparse messages will allow those variables with low entropy to use few states, even a point estimate, while the handful of pixels with larger entropy may use more states.

Table 4 The parameters of the disparity difference and gradient modulated CRF of Sect. 3.2

Disp. difference    Grad. difference interval
interval            [0, 1)   [1, 2)   [2, 3)   [3, 4)   [4, inf)
[0, 1)                2.4      1.4      5.4      7.5      0.0
[1, 2)               19.0     20.0     17.7     16.2     21.7
[2, 3)               21.2     21.3     20.9     20.5     22.3
[3, 4)               25.2     25.1     25.0     25.0     25.2
[4, inf)             32.3     32.1     31.1     30.8     32.9

³ When entropy is computed using the natural logarithm (as opposed to the base 2 logarithm for bits) the nat is the implicit and natural unit for information entropy.

The variational distribution has many states carrying low

probability, even at the outset of training. We may greatly

accelerate the update calculations by dropping these states

according to (24) and the criterion (25). Figure 9 shows the

free energy after each round of updates for both sparse and

dense mean field. In all cases, sparse mean field has nearly

reached the free energy minimum before one round of dense

mean field updates is done. Importantly, the minimum free

energy found with sparse updates is roughly the same as its

dense counterpart.
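The following sketch conveys the idea behind the sparse updates under our own simplifying assumptions; it does not reproduce (24) and (25) exactly. Each neighboring marginal is truncated to the fewest most-probable states that retain a fixed fraction of its mass, and only those states contribute to the expected interaction energy in the mean field update.

import numpy as np

def truncate(q, eps=0.01):
    # Keep the fewest most-probable states whose mass is at least (1 - eps).
    order = np.argsort(q)[::-1]
    cum = np.cumsum(q[order])
    k = int(np.searchsorted(cum, 1.0 - eps)) + 1
    idx = order[:k]
    return idx, q[idx] / q[idx].sum()

def sparse_update(unary_i, pairwise, neighbor_qs, eps=0.01):
    # One mean field update for Q(X_i):
    #   unary_i     -- (D,) local energies (data term) for pixel i
    #   pairwise    -- (D, D) interaction energies V(x_i, x_j)
    #   neighbor_qs -- list of (D,) marginals of the neighbors of i
    log_q = -unary_i.astype(float)
    for q_j in neighbor_qs:
        idx, w = truncate(q_j, eps)               # states with negligible mass are skipped
        log_q -= pairwise[:, idx] @ w             # expected interaction energy
    log_q -= log_q.max()
    q_i = np.exp(log_q)
    return q_i / q_i.sum()

# Example: 16 disparity states with a truncated linear interaction.
D = 16
V = np.minimum(np.abs(np.subtract.outer(np.arange(D), np.arange(D))), 4).astype(float)
unary = np.random.rand(D)
neighbors = [np.random.dirichlet(np.ones(D)) for _ in range(4)]
print(sparse_update(unary, V, neighbors).round(3))

With large disparity ranges the inner sum shrinks from D terms to a handful per neighbor, which is the source of the speedups summarized in Fig. 9.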

5.6 Learning with Message Passing vs. Graph Cuts

Maximizing the log likelihood (13) for learning requires

marginals on the lattice. When the model is initialized, these

marginals have higher entropy (Fig. 8), representing the un-

certainty in the model. At this stage of learning, the point

estimate resulting from an energy minimization may not be

a good approximation to the posterior marginals. In fact, at the initial parameters θv = 1, sparse mean field finds a lower free energy than using the graph-cut solution as a zero-entropy point-estimate distribution.

We compare the results of learning using two methods for

calculating the gradient: sparse mean field and graph cuts.

As demonstrated earlier, the model has the highest uncer-

tainty at the beginning of learning. It is at this point when

sparse mean field has the greatest potential for improvement

over graph cuts.

For learning, we use the same small initial step size and a

simple gradient descent algorithm with an adaptive rate. For

prediction evaluation, we use graph cuts to find the most

probable labeling, regardless of training method. We use

leave-one-out cross validation on the six images.
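A schematic of this learning loop, with hypothetical helper functions standing in for the paper's machinery, is sketched below: empirical_counts accumulates the feature counts on the ground-truth disparities and expected_counts returns their expectation under the current model (from sparse mean field marginals or, as a zero-entropy approximation, from a graph-cut labeling). The exponential-family sign convention and the specific adaptive-rate rule are our own illustrative assumptions.

import numpy as np

def learn(theta0, training_pairs, empirical_counts, expected_counts,
          step=1e-3, grow=1.2, shrink=0.5, iters=50):
    # Gradient ascent on the (approximate) log likelihood: for exponential
    # family models the gradient is empirical minus expected feature counts.
    theta = theta0.copy()
    prev_norm = np.inf
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for pair in training_pairs:
            grad += empirical_counts(pair) - expected_counts(pair, theta)
        theta += step * grad
        norm = np.linalg.norm(grad)
        # Simple adaptive rate: grow the step while the gradient keeps shrinking,
        # shrink it otherwise.
        step = step * grow if norm < prev_norm else step * shrink
        prev_norm = norm
    return theta

Prediction for evaluation is then a separate MAP step (graph cuts in our experiments), independent of which method supplied the training-time expectations.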

After just one iteration, the training and test error with

sparse mean field is markedly lower than that of the model

trained with graph cuts for inference. Figure 10 shows the corresponding depth images after one iteration.

Fig. 8 Histograms of approximate marginal entropies H(Q(Xi)) from the variational distributions for each pixel at the start (after the first round) of mean field updates and at their convergence; values using the initial and learned parameters θv of the canonical model are shown

Fig. 9 Comparison of CPU time for free energy minimization with sparse and dense mean field updates using parameters θv learned in the canonical model with three images (Art, Books, Dolls)

Fig. 10 Test images comparing prediction (using graph cuts) after one round of learning the canonical model with graph cuts (top) or sparse mean field (bottom). Occluded areas are black. Images (l–r): Laundry, Moebius, Reindeer

In Table 5, we compare the results of training using

pseudolikelihood, sparse mean field, and point estimates

provided by graph cuts. We do not present results based on

BP or dense mean field as training times are prohibitively

long. For each experiment we leave out the image indicated

and train on all the others listed. Comparing learning with

graph cuts and learning with sparse mean field, the dis-

parity error is reduced by an average of 4.70 ± 2.17%,

and a paired sign test reveals the improvement is significant

(p < 0.05).
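For reference, the paired sign test here amounts to counting, over the six held-out images, how often one training method gives the lower error and comparing that count to a Binomial(6, 0.5) null; the short sketch below (illustrative only) makes the arithmetic explicit.

from math import comb

def sign_test_p(wins, n):
    # One-sided tail probability P(#successes >= wins) under Binomial(n, 0.5).
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Sparse mean field beats graph cuts on all six held-out images in the
# canonical-model block of Table 5, giving
print(sign_test_p(6, 6))   # 1/64, approximately 0.016 < 0.05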

We also test the error of our models' occlusion predic-

tions. We use the extended smoothness term (11) to handle

the interactions between occluded states and the local terms

of (10). We show both leave-one-out training and test re-

sults as well as the result of training on all the data to serve

as a reference point for the capacity of the model. For this

last set of experiments we show root mean square (RMS)

errors for disparity predictions. Models trained using sparse

mean field give more accurate occlusion predictions than the

model trained using graph cuts. In the gradient-modulated

occlusion model our leave-one-out experiments show that

the error in predicting occluded pixels is reduced an average

of 4.94 ± 1.10% and is also significant (p < 0.05).

Figure 11 shows that sparse mean field reduces the dis-

parity error in the model more quickly than graph cuts dur-

ing learning on many images. Even when the two meth-

ods approach each other as learning progresses, sparse mean

field still converges at parameters providing lower errors on

both disparity and occlusions (Fig. 12). We also provide the

learning curves for pseudolikelihood for comparison. We

see that pseudolikelihood generally has poorer performance,

both initially and after many iterations using a similar learn-

ing strategy compared to graph cuts and sparse mean field.

Table 5 Comparison of learning with pseudolikelihood, graph cuts and sparse mean field. The disparity error (percentage of incorrectly predicted pixels) given for the canonical stereo model and the gradient-modulated occlusion model (with (10) and (11)). For the gradient-modulated occlusion model we show the occlusion prediction error (percentage). In the last block of experiments we show RMS error

Metric        Method              Art     Books   Dolls   Laundry  Moebius  Reindeer  Average

Canonical Model—leave-one-out training & testing
Disparity     Pseudo Likelihood   22.03   27.32   11.85   29.50    15.93    15.88     20.42
% error       Graph Cuts          20.83   23.64   10.69   30.04    15.80    14.13     19.17
              Sparse Mean Field   17.70   23.08   10.67   29.16    15.43    13.37     18.22

Gradient-Modulated Occlusion Model—leave-one-out training & testing
Disparity     Graph Cuts          21.82   24.10   11.94   27.54    11.08    16.74     19.30
% error       Sparse Mean Field   21.05   23.14   11.62   27.37    11.45    16.44     18.93
Occlusion     Graph Cuts          34.50   28.27   32.99   36.89    40.65    50.83     37.36
% error       Sparse Mean Field   31.19   27.84   31.51   35.37    38.68    48.39     35.50

Gradient-Modulated Occlusion Model—trained on all (for comparison)
Disparity     Graph Cuts          10.61   19.2     5.98   20.95     7.15     5.53     12.78
RMS error     Sparse Mean Field    8.29   13.41    4.72   19.22     5.11     4.76     10.15
Occlusion     Graph Cuts          16.20   10.40   24.88   29.77    27.88    32.97     21.83
% error       Sparse Mean Field   10.47    8.10   19.43   23.04    21.10    27.31     16.43

Fig. 11 Disparity error (each image held out in turn) using pseudolikelihood, graph cuts, and mean field for learning the canonical CRF stereo

model. The error before learning is omitted from the plots to better highlight performance differences

6 Conclusions

As more evaluation and training data become available for stereo, it is natural for researchers to want to build more sophisticated models. While hand-specified stereo models were once widely used, the increasing availability of ground truth data makes it likely that interest will grow in using learning to improve more complex models. By creating ground truth data with structured light we have been able to explore and evaluate a variety of different CRF models and learning techniques for stereo. We have shown advantages to using image patches computed on higher-resolution imagery, advantages to interaction potentials that depend on disparity differences, as well as advantages to formally modelling occlusions as random variables in a CRF. A natural direction for future work would be to combine these different elements into a more sophisticated CRF.

Fig. 12 Comparison of error in predicting occluded pixels using graph cuts and sparse mean field for learning in the gradient-modulated occlusion model (11)

Indeed, one of the top performing methods on the Middle-

bury benchmark (Yang et al. 2006) is a technique that incor-

porates many similar elements. More specifically, this work

by Yang et al. (2006) is based on an MRF and uses: hier-

archical belief propagation, color-weighted correlation and

occlusion handling.

As state of the art models become more complex, both a

principled underlying modelling formalism as well as prin-

cipled, efficient, stable and robust learning techniques be-

come important. We hope that the CRF formulation we have

provided in this paper can serve as a good starting point

for more sophisticated discriminative random field models

for stereo. In practical terms, graph cuts was the fastest al-

gorithm we explored. Pseudolikelihood (PL) based learn-

ing can also be fast, especially if one exploits the fact that its gradients can be computed exactly, so that fast second-order optimization can be used. However, our ex-

periments indicate that the quality of models learned via

PL is worse than those learned using GC and that sparse

variational message passing (sparse VMP) can produce the

highest quality learned models among these three alterna-

tives.

Calculating sparse updates to the approximating varia-

tional distribution can greatly reduce the time required for

inference in models with large state spaces. For high resolu-

tion imagery this reduction in time can be essential for prac-

tical inference and learning scenarios. In models where there

is more uncertainty (as in the early stages of learning), we

find that sparse mean field provides a lower free energy than

graph cuts. As such, our analysis indicates that SVMP can

be used as an effective tool for approximating the distribu-

tions necessary for accurate learning. Sparse VMP could be

seen as a method occupying a middle ground between pro-

ducing point estimates and creating fuller approximate dis-

tributions. Interestingly, sparse message passing could also

be used to speed up state of the art TRW and TRW-S in-

ference techniques. One of the most important advantages

of the sparse mean field technique is that one no longer has

strong constraints on the forms of allowable potentials that

are required for graph cuts. As such, we see sparse mes-

sage passing methods as being widely applicable for models

where the constraints on potentials imposed by graph cuts

are too restrictive.

Finally, with the insights provided in this study, we hope

to open up a number of avenues of exploration for learn-

ing in generally richer models and learning models suitable

for processing more views using the additional data sets we

have created.

Acknowledgements We would like to thank Anna Blasiak and Jeff Wehrwein for their help in creating the data sets used in this paper. Figures (©2007 IEEE) also appear in Scharstein and Pal (2007). Support for this work was provided in part by NSF grant 0413169 to D.S.

References

Alvarez, L., Deriche, R., Sánchez, J., & Weickert, J. (2002). Dense dis-

parity map estimation respecting image discontinuities: a PDE

and scale-space based approach. Journal of Visual Communica-

tion and Image Representation, 13(1–2), 3–21.

Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. (2003). An intro-

duction to MCMC for machine learning. Machine Learning, 50,

5–43.

Barnard, S. (1989). Stochastic stereo matching over scale. Interna-

tional Journal of Computer Vision, 3(1), 17–32.

Birchfield, S., & Tomasi, C. (1998). A pixel dissimilarity measure that

is insensitive to image sampling. IEEE TPAMI, 20(4), 401–406.

Blake, A., Rother, C., Brown, M., Perez, P., & Torr, P. (2004). Inter-

active image segmentation using an adaptive GMMRF model. In

Proc. ECCV (pp. 428–441)

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Bleyer, M., & Gelautz, M. (2004). A layered stereo algorithm using image segmentation and global visibility constraints. In Proc. ICIP.

Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy

minimization via graph cuts. IEEE TPAMI, 23(11), 1222–1239.

Cowell, R., Dawid, A., Lauritzen, S., & Spiegelhalter, D. (2003). Prob-

abilistic Networks and Expert Systems. Berlin: Springer.

Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing fea-

tures of random fields. IEEE TPAMI, 19, 380–393.

Felzenszwalb, P., & Huttenlocher, D. (2006). Efficient belief propaga-

tion for early vision. International Journal of Computer Vision,

70(1), 41–54.

Frey, B., & Jojic, N. (2005). A comparison of algorithms for infer-

ence and learning in probabilistic graphical models. IEEE TPAMI,

27(9), 1392–1416.

Frey, B., & MacKay, D. (1997). A revolution: Belief propagation in

graphs with cycles. In Proc. NIPS.

He, Z., Zemel, R., & Carreira-Perpinan, M. (2004). Multiscale condi-

tional random fields for image labeling. In Proc. CVPR (pp. 695–

702).

Hong, L., & Chen, G. (2004). Segment-based stereo matching using

graph cuts. In Proc. CVPR (Vol. I, pp. 74–81).

Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1999). Introduc-

tion to variational methods for graphical models. Machine Learn-

ing, 37, 183–233.

Kolmogorov, V. (2006). Convergent tree-reweighted message passing

for energy minimization. IEEE TPAMI, 28, 1568–1583.

Kolmogorov, V., & Zabih, R. (2001). Computing visual correspon-

dence with occlusions using graph cuts. In Proc. ICCV (pp. 508–

515).

Kolmogorov, V., & Zabih, R. (2002a). Multi-camera scene reconstruc-

tion via graph cuts. In Proc. ECCV (Vol. III, pp. 82–96).

Kolmogorov, V., & Zabih, R. (2002b). What energy functions can be

minimized via graph cuts? In Proc. ECCV (Vol. III, pp. 65–81).

Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., & Rother, C.

(2006). Probabilistic fusion of stereo with color and contrast for

bilayer segmentation. IEEE TPAMI, 28(9), 1480–1492.

Kong, D., & Tao, H. (2004). A method for learning matching errors in

stereo computation. In Proc. BMVC.

Kschischang, F., Frey, B., & Loeliger, H.A. (2001). Factor graphs

and the sum-product algorithm. IEEE Transactions on Information Theory,

47(2), 498–519.

Kumar, S., & Hebert, M. (2006). Discriminative random fields. Inter-

national Journal of Computer Vision, 68(2), 179–201.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random

fields: Probabilistic models for segmenting and labeling sequence

data. In Proc. ICML (pp. 282–289).

Liang, P., & Jordan, M. (2008). An asymptotic analysis of generative,

discriminative and pseudolikelihood estimators. In Proc ICML.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propaga-

tion for approximate inference: An empirical study. In Proc. UAI

(pp. 467–475).

Ng, A. Y., & Jordan, M. (2002). On discriminative vs. generative clas-

sifiers: A comparison of logistic regression and naive Bayes. In

Proc. NIPS.

Pal, C., Sutton, C., & McCallum, A. (2006). Sparse forward-backward

using minimum divergence beams for fast training of conditional

random fields. In Proc. ICASSP (pp. 581–584).

Scharstein, D., & Pal, C. (2007). Learning conditional random fields

for stereo. In Proc. CVPR.

Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of

dense two-frame stereo correspondence algorithms. International

Journal of Computer Vision, 47(1), 7–42.

Scharstein, D., & Szeliski, R. (2003). High-accuracy stereo depth maps

using structured light. In Proc. CVPR (Vol. I, pp. 195–202).

Strecha, C., Tuytelaars, T., & Van Gool, L. (2003). Dense matching of

multiple wide-baseline views. In Proc. CVPR (Vol. 2, p. 1194).

Strecha, C., Fransens, R., & Van Gool, L. (2004). Wide-baseline stereo

from multiple views: A probabilistic account. In Proc. CVPR

(Vol. 1, pp. 552–559).

Sun, J., Zheng, N., & Shum, H. (2003). Stereo matching using belief

propagation. IEEE TPAMI, 25(7), 787–800.

Sun, J., Li, Y., Kang, S. B., & Shum, H. Y. (2005). Symmetric stereo

matching for occlusion handling. In Proc. CVPR (pp. 399–406).

Sutton, C., & McCallum, A. (2006). An introduction to conditional

random fields for relational learning. In L. Getoor & B. Taskar

(Eds.), Introduction to Statistical Relational Learning. Cam-

bridge: MIT Press.

Sutton, C., Rohanimanesh, K., & McCallum, A. (2004). Dynamic con-

ditional random fields: Factorized probabilistic models for label-

ing and segmenting sequence data. In Proc. ICML.

Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V.,

Agarwala, A., Tappen, M., & Rother, C. (2008). A comparative

study of energy minimization methods for Markov random fields

with smoothness-based priors. IEEE TPAMI, 30, 1068–1080.

Tao, H., Sawhney, H., & Kumar, R. (2001). A global matching frame-

work for stereo computation. In Proc. ICCV (Vol. I, pp. 532–

539).

Tappen, M., & Freeman, W. (2003). Comparison of graph cuts with

belief propagation for stereo, using identical MRF parameters. In

Proc. ICCV (pp. 900–907).

Trinh, H., & McAllester, D. (2009). Unsupervised learning of stereo

vision with monocular cues. In Proc. BMVC.

Vishwanathan, S., Schraudolph, N., Schmidt, M., & Murphy, K.

(2006). Accelerated training of conditional random fields with

stochastic gradient methods. In Proc. ICML (pp. 969–976). New

York: ACM.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). Tree-based repa-

rameterization for approximate estimation on graphs with cycles.

In Proc. NIPS.

Wainwright, M., Jaakkola, T., & Willsky, A. (2003). Tree-based repa-

rameterization framework for analysis of sum-product and re-

lated algorithms. IEEE Transactions on Information Theory, 45(9),

1120–1146.

Wainwright, M., Jaakkola, T., & Willsky, A. (2005). MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11), 3697–3717.

Wei, Y., & Quan, L. (2004). Region-based progressive stereo matching. In Proc. CVPR (Vol. I, pp. 106–113).

Weinman, J. J., Hanson, A., & McCallum, A. (2004). Sign detection in

natural images with conditional random fields. In IEEE Int. Work-

shop on Machine Learning for Signal Processing (pp. 549–558).

Weinman, J. J., Pal, C., & Scharstein, D. (2007). Sparse message pass-

ing and efficiently learning random fields for stereo vision. Tech.

Rep. UM-CS-2007-054, Univ. of Massachusetts, Amherst.

Weinman, J. J., Tran, L., & Pal, C. (2008). Efficiently learning ran-

dom fields for stereo vision with sparse message passing. In Proc.

ECCV.

Weinman, J. J., Learned-Miller, E., & Hanson, A. (2009). Scene text

recognition using similarity and a lexicon with sparse belief prop-

agation. IEEE TPAMI, 31(10), 1733–1746.

Winn, J., & Bishop, C. (2005). Variational message passing. Journal of

Machine Learning Research, 6, 661–694.

Yang, Q., Wang, L., Yang, R., Stewenius, H., & Nister, D. (2006).

Stereo matching with color-weighted correlation, hierarchical be-

lief propagation and occlusion handling. In Proc. CVPR.

Yedidia, J., Freeman, W., & Weiss, Y. (2003). Understanding belief

propagation and its generalizations. In Exploring Artificial Intel-

ligence in the New Millennium (pp. 239–269).

Zhang, L., & Seitz, S. (2005). Parameter estimation for MRF stereo. In

Proc. CVPR (Vol. II, pp. 288–295).

Zhang, Y., & Kambhamettu, C. (2002). Stereo matching with

segmentation-based cooperation. In Proc. ECCV (Vol. II,

pp. 556–571).

Zitnick, L., Kang, S., Uyttendaele, M., Winder, S., & Szeliski, R.

(2004). High-quality video view interpolation using a layered rep-

resentation. SIGGRAPH, ACM Transactions on Graphics, 23(3),

600–608.