
Int J Comput Vis

DOI 10.1007/s11263-010-0385-z

On Learning Conditional Random Fields for Stereo

Exploring Model Structures and Approximate Inference

Christopher J. Pal · Jerod J. Weinman · Lam C. Tran ·

Daniel Scharstein

Received: 22 March 2010 / Accepted: 15 September 2010

© Springer Science+Business Media, LLC 2010

Abstract Until recently, the lack of ground truth data has

hindered the application of discriminative structured predic-

tion techniques to the stereo problem. In this paper we use

ground truth data sets that we have recently constructed to

explore different model structures and parameter learning

techniques. To estimate parameters in Markov random fields

(MRFs) via maximum likelihood one usually needs to per-

form approximate probabilistic inference. Conditional ran-

dom fields (CRFs) are discriminative versions of traditional

MRFs. We explore a number of novel CRF model structures

including a CRF for stereo matching with an explicit oc-

clusion model. CRFs require expensive inference steps for

each iteration of optimization and inference is particularly

slow when there are many discrete states. We explore belief

propagation, variational message passing and graph cuts as

This research was supported in part by: a Google Research award,

Microsoft Research through awards under the eScience and Memex

funding programs, a gift from Kodak Research and an NSERC

discovery award to C.P. Support was also provided in part by NSF

grant 0413169 to D.S.

C.J. Pal (✉)

École Polytechnique de Montréal, Montréal, QC, Canada

e-mail: christopher.pal@polymtl.ca

J.J. Weinman

Dept. of Computer Science, Grinnell College, Grinnell, IA, USA

e-mail: weinman@grinnell.edu

L.C. Tran

Dept. of Electrical and Computer Engineering, University of

California San Diego, San Diego, CA, USA

e-mail: lat003@ucsd.edu

D. Scharstein

Middlebury College, Middlebury, VT, USA

e-mail: schar@middlebury.edu

inference methods during learning and compare with learn-

ing via pseudolikelihood. To accelerate approximate infer-

ence we have developed a new method called sparse varia-

tional message passing which can reduce inference time by

an order of magnitude with negligible loss in quality. Learn-

ing using sparse variational message passing improves upon

previous approaches using graph cuts and allows efficient

learning over large data sets when energy functions violate

the constraints imposed by graph cuts.

Keywords Stereo · Learning · Structured prediction ·

Approximate inference

1 Introduction

In recent years, machine learning methods have been suc-

cessfully applied to a large number of computer vision

problems, including recognition, super-resolution, inpaint-

ing, texture segmentation, denoising, and context labeling.

Stereo vision has remained somewhat of an exception be-

cause of the lack of sufficient training data with ground-truth disparities. While a few data sets with known disparities are available, until recently they had mainly been

used for benchmarking of stereo methods (e.g., Scharstein

and Szeliski 2002). Our earlier work in this line of re-

search (Scharstein and Pal 2007) sought to remedy this sit-

uation by replacing the heuristic cues used in previous ap-

proaches with probabilistic models for structured prediction

derived from learning using real images and ground truth

stereo imagery. To obtain a sufficient amount of training

data, we used the structured-lighting approach of Scharstein

and Szeliski (2003) to construct a database of 30 stereo pairs


with ground-truth disparities, which we have made available

for use by other researchers.1

By addressing the need for greater quantities of ground

truth data, we are now able to take a machine learning ap-

proach using a classical structured prediction model, the

conditional random field (CRF). We derive a gradient-based

learning approach that leverages efficient graph-cut mini-

mization methods and our ground-truth database. We then

explore the characteristics and properties of a number of

different models when learning model parameters. Using

graph-cut minimization techniques for gradient-based learn-

ing in CRFs corresponds to an aggressive approximation

of the underlying probabilities needed for expectations of

key quantities. In this work we further explore the issues

of learning and its interaction with different inference tech-

niques under richer model structures.

Among the few existing learning approaches for stereo,

one of the most prominent is the work by Zhang and Seitz

(2005), who iteratively estimate the global parameters of an

MRF stereo method from the previous disparity estimates

and thus do not rely on ground-truth data. Kong and Tao

(2004) learn to categorize matching errors of local methods

using the Middlebury images. Kolmogorov et al. (2006) construct MRF models for binary segmentation using locally

learned Gaussian Mixture Models (GMMs) for foreground

and background colors. Some interesting recent work has

explored learning in a hidden variable CRF-like model for

stereo (Trinh and McAllester 2009). They formulate stereo

as the problem of modeling the probability of the right im-

age given the left. Thus, they are able to construct a con-

ditional model with hidden variables for depth information.

As such, they do not use ground-truth depth information and

cast the approach as an instance of unsupervised learning.

They use monocular texture cues, define potential functions

on segments from image segmentation and construct an en-

ergy function based on a slanted-plane model. They perform

learning using a variation of hard-assignment conditional expectation maximization.

Learning aside, there has been growing interest in sim-

ply creating richer models for stereo vision in which more

parameters are introduced to produce more accurate results.

In particular, recent activity has focused on explicitly ac-

counting for occlusions in stereo vision models. For exam-

ple, Kolmogorov and Zabih (2001) have directly incorpo-

rated occlusion models in an energy function and graph-cut

minimization framework. Sun et al. (2005) explored a sym-

metric stereo matching approach whereby they: (1) infer the

disparity map in one view considering the occlusion map of

the other view and (2) infer the occlusion map in one view

given the disparity map of the other view. More recently,

Yang et al. (2006) have achieved impressive results building

1. http://vision.middlebury.edu/stereo/data/

on models that estimate depth in both left and right images

and using color-weighted correlations for patch matching.

They found that this approach made match scores less sensi-

tive to occlusion boundaries, as occlusions often cause color

discontinuities. All of these methods involve creating richer

models to obtain greater disparity accuracy. Thus, we see a

growing need to learn or estimate model parameters in an

efficient and principled way.

While learning for stereo is growing in interest, much

recent progress in stereo vision has been achieved along

two other avenues. First, global optimization methods have

become practical with the emergence of powerful opti-

mization techniques. Considered too slow when first pro-

posed by Barnard (1989), global methods currently domi-

nate the top of the Middlebury stereo rankings. In particular,

MRF models for stereo have become popular since high-

quality approximate solutions can be obtained efficiently us-

ing graph cuts (Boykov et al. 2001; Kolmogorov and Zabih

2001, 2002b) and belief propagation (Sun et al. 2003, 2005;

Felzenszwalb and Huttenlocher 2006). Tappen and Freeman

(2003) have compared graph cuts and belief propagation

for stereo and Szeliski et al. (2008) have compared a larger

set of MRF energy minimization techniques, providing soft-

ware that we use in our implementation.

A second breakthrough has been the realization of the importance of intensity changes as a cue for object boundaries

(i.e., disparity discontinuities). Taken to an extreme, this

translates into the assumption that disparity jumps always

coincide with color edges, which is the basis of a large num-

ber of recent segment-based stereo methods (Tao et al. 2001;

Zhang and Kambhamettu 2002; Bleyer and Gelautz 2004;

Hong and Chen 2004; Wei and Quan 2004; Zitnick et al.

2004; Sun et al. 2005). Such methods start with a color

segmentation and then assume that disparities are constant,

planar, or vary smoothly within each segment. This as-

sumption works surprisingly well if the segments are small

enough. Alternatively, color segmentations can also be em-

ployed as smoothness priors in pixel-based approaches (Sun

et al. 2003). Using color segmentations is not the only

way to utilize this monocular cue; many pixel-based global

methods also change the smoothness cost (i.e., penalty

for a disparity change) if the local intensity gradient is

high (Boykov et al. 2001; Kolmogorov and Zabih 2002a;

Scharstein and Szeliski 2002). This is the approach we take.

The relationship between intensity gradient and smoothness

cost is learned from real images.

We have focused our discussion so far on discrete for-

mulations for stereo. However, continuous formulations also

exist and a number of groups have formulated the continu-

ous depth stereo problem as an energy functional represent-

ing an underlying partial differential equation (PDE) (Al-

varez et al. 2002; Strecha et al. 2003). More recent work

along these lines has cast occlusions as unobserved hid-

den variables and used expectation maximization (Strecha et


al. 2004). This work also draws closer together PDE-based

methods and maximum a posteriori (MAP) estimation. As

noted by Yang et al. (2006), more studies are needed to un-

derstand the behavior of algorithms for optimizing parame-

ters in stereo models. This work addresses that need.

Some of the discrete stereo models discussed above have

formulated the problem directly as an energy function with-

out an explicit probabilistic model. When a probabilistic

model has been used, it has been a joint or generative ran-

dom field. However, there are well-known performance ad-

vantages to using discriminative as opposed to generative

modeling techniques (Ng and Jordan 2002). One of our con-

tributions is the development of a completely probabilistic

and discriminative discrete formulation for stereo. We ex-

plicitly model occlusions using additional states in the vari-

ables of a conditional random field (CRF). As we will show,

when traditional stereo techniques are augmented with an

occlusion model and cast in a CRF framework, learning can

be achieved via maximum (conditional) likelihood estima-

tion. However, learning becomes more challenging as the

stereo images and probabilistic models become more realis-

tic.

1.1 Conditional Random Fields, Learning and Inference

In this work, we use a lattice-structured CRF for stereo

vision. This leads to energy functions with a traditional

form—single variable terms and pairwise terms. Impor-

tantly, unlike purely energy-based formulations, since we

cast the stereo problem as a conditional probability model,

we are able to view learning as an instance of maximum conditional likelihood. We can also draw from recent insights in

the machine learning community to deal with learning in in-

tractable probability models. In this light, learning also be-

comes a task closely linked to the quality of approximate

inference in the model. From this formulation we are able

to develop a probabilistic variational method in the sense of

Jordan et al. (1999). While we focus on approximate infer-

ence and learning in lattice-structured conditional random

fields applied to stereo vision, our theoretical results and

some experimental insights are applicable to CRFs, MRFs

and Bayesian networks with arbitrary structures.
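Concretely, a lattice CRF of this traditional form can be sketched as follows (generic notation only; the paper's specific potentials are developed in Sect. 3):

```latex
p(\mathbf{d} \mid \mathbf{x}) \,=\, \frac{1}{Z(\mathbf{x})}
\exp\!\Bigl(-\sum_{i} U_i(d_i, \mathbf{x})
\,-\, \sum_{(i,j) \in \mathcal{E}} V_{ij}(d_i, d_j, \mathbf{x})\Bigr),
```

where \(\mathbf{d}\) collects the per-pixel disparity labels, \(\mathbf{x}\) is the observed image pair, \(\mathcal{E}\) is the set of 4-connected lattice edges, \(U_i\) and \(V_{ij}\) are the single-variable and pairwise energy terms, and the partition function \(Z(\mathbf{x})\) is the intractable quantity that forces approximate inference during learning.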

The CRF approach to modeling and learning a random

field was first presented for sequence processing problems

by Lafferty et al. (2001). Sutton and McCallum (2006)

give a good review of modeling and learning techniques

for CRFs focusing on natural language processing prob-

lems and optimization methods that exploit second order

information. Lafferty et al. (2001) originally proposed the

use of a method known as improved iterative scaling (Della

Pietra et al. 1997) for learning CRFs. However, optimization

of the conditional log likelihood using classical gradient-

descent-based methods is a natural strategy for learning in

CRFs. Indeed, several authors have reported significant in-

creases in learning speed using second-order gradient-based

optimization techniques. Quasi-Newton methods such as

the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method or

limited-memory versions of the BFGS method have been

particularly successful (Sutton and McCallum 2006). More

recently, Vishwanathan et al. (2006) have reported even

faster convergence with large data sets using Stochastic

Meta-Descent, a stochastic gradient optimization method

with adaptation of a gain vector.

Model expectations are needed for gradient-based learn-

ing. To efficiently compute these in the linear chain struc-

tured models commonly used in language processing, a sub-

tle variation of the well-known forward-backward algorithm

for hidden Markov models can be used. However, approxi-

mate inference methods must be used for many graphical

models with more complex structure. The dynamic condi-

tional random fields (DCRFs) of Sutton et al. (2004) use a

factorized set of variables at each segment of a linear-chain

CRF. This leads to a shallow but dynamically-sized lattice-

structured model. Sutton et al. (2004) explore several meth-

ods for approximate inference and learning, including tree-

based reparameterization (TRP) or the tree-based message

passing schedules of Wainwright et al. (2002, 2003) and

the loopy belief propagation strategy discussed in Murphy

et al. (1999) under a random schedule. Other work such as

Weinman et al. (2004) also explores TRP methods but not

in a large lattice structured model. For random fields with

a hidden layer in the form of a Boltzmann machine, He et

al. (2004) have used sampling methods for inference based

on contrastive divergence. Contrastive divergence initializes

Markov chain Monte Carlo (MCMC) sampling using the

data and then takes a few steps of a Gibbs sampler. This approach is faster than traditional MCMC, which requires convergence to equilibrium. However, it can lead to crude approximations to the model likelihood gradients used for learning. Kumar and Hebert (2006) optimize the parameters of

lattice-structured binary CRFs using pseudolikelihood and

perform inference using iterated conditional modes (ICM), which is fast but well known to get caught in local minima.

Other work by Blake et al. (2004) has investigated the dis-

criminative optimization of lattice-structured joint random

field models using autoregression over the pseudolikelihood.

Pseudolikelihood-based techniques are equivalent to us-

ing spatially localized and independent probabilistic mod-

els as a substitute for the original global model of a com-

plete joint distribution. Pseudolikelihood can yield a convex

optimization problem that often leads to relatively fast op-

timization times. However, Liang and Jordan (2008) have

shown that pseudolikelihood can give poorer estimates of interaction parameters in random fields when interactions are

strong. In Sect. 5.6 we compare learning with pseudolike-

lihood, graph cuts and our new inference method (sparse


mean field), and we find that pseudolikelihood indeed re-

sults in the lowest performance. We are therefore motivated

to use a learning procedure that accounts for the global view

of the underlying structured prediction problem.
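To make this localization concrete, the following sketch evaluates the log-pseudolikelihood of a labeling under a generic Potts-style lattice model: each pixel contributes the log-probability of its label conditioned on its neighbors' observed labels, so only a tiny per-pixel normalizer is needed instead of the global partition function. The potentials and names here are illustrative, not the stereo models learned in this paper.

```python
import numpy as np

def log_pseudolikelihood(unary, w_smooth, labels):
    """Log-pseudolikelihood of a labeling on a 4-connected lattice.

    unary: (h, w, k) per-pixel label costs (energies); w_smooth: Potts
    smoothness weight. Each pixel's term is log p(label | neighbor labels),
    normalized only over that pixel's k states.
    """
    h, w, k = unary.shape
    total = 0.0
    for i in range(h):
        for j in range(w):
            # local conditional energy for each candidate state at (i, j)
            local = unary[i, j].copy()
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    local += w_smooth * (np.arange(k) != labels[ni, nj])
            # per-pixel log-normalizer replaces the global partition function
            total += -local[labels[i, j]] - np.log(np.exp(-local).sum())
    return total
```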

1.2 Graph Cuts and Learning

The primary challenge with learning a CRF using a standard

gradient-descent-based optimization of the conditional like-

lihood is that one must compute intractable model expecta-

tions of features in the energy function (see Sect. 4). One

solution to this problem is to replace distributions needed

for the expectation with a single point estimate and compute

gradients in a manner reminiscent of the classical perceptron

algorithm. For the types of models we explore here, graph-

cut-based methods are typically the fastest choice for energy

minimization (Szeliski et al. 2008). Thus, one particularly

attractive solution to the learning problem is to take advan-

tage of the extremely fast and high-quality performance of

graph cuts. In more precise terms, this energy corresponds

to a most-probable-explanation (MPE) (Cowell et al. 2003)

estimate for the corresponding CRF. Although it does have

important limitations, we use this fast and effective strategy

for our initial explorations of model structures.
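The idea can be sketched for a single Potts-style smoothness weight: the intractable model expectation of the feature is replaced by its count at the MPE labeling (e.g., returned by graph cuts), giving a structured-perceptron-like gradient. The energy and names below are illustrative, not the models of Sect. 3.

```python
import numpy as np

def smoothness_count(labels):
    """Number of label disagreements between 4-connected neighbors."""
    return ((labels[:, 1:] != labels[:, :-1]).sum()
            + (labels[1:, :] != labels[:-1, :]).sum())

def energy(unary, w_smooth, labels):
    """Pairwise lattice energy: per-pixel data costs plus a Potts
    smoothness term weighted by w_smooth."""
    h, w = labels.shape
    data = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    return data + w_smooth * smoothness_count(labels)

def approx_smoothness_gradient(labels_gt, labels_mpe):
    """Perceptron-style gradient for w_smooth: the feature count at the
    ground truth minus the count at the MPE labeling, the latter standing
    in for the intractable model expectation."""
    return smoothness_count(labels_gt) - smoothness_count(labels_mpe)
```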

The maximum conditional likelihood formulation for

gradient-based learning in a CRF requires one to compute

model expectations, not MPE estimates. Furthermore, the

graph cut algorithm only works if energy functions satisfy

certain conditions. While the original energy function of a

random field can have negative weights, the secondary graph

constructed when performing graph-cut inference must have

non-negative edge weights. This transformation leads to in-

equality constraints on the original energy function. These

constraints also imply that graph-cut inference may cease to

be possible during the course of learning for some models—

something we have observed during our experiments. These

factors have motivated us to explore a second class of infer-

ence techniques, based on quickly computing approximate

marginal distributions during learning. Thus, in the second

broad area of our exploration we compare the efficiency and

quality of global inference techniques during learning.

1.3 Other Alternatives for Inference

It is well known that belief propagation (BP) (Yedidia et al.

2003) in tree structured graphs is exact. However, in stereo

and many other problems in computer vision one frequently

uses graphs defined on a 2D grid. It is possible to apply BP

on graphs with loops using loopy belief propagation (LBP),

and there have been a number of reports of success using this strategy for applications ranging from error-correcting codes

(Frey and MacKay 1997) and inference in large Bayesian

networks (Murphy et al. 1999) to low level vision (Felzen-

szwalb and Huttenlocher 2006). Two important variations

of belief propagation consist of the sum-product algorithm

and max-product algorithm (Kschischang et al. 2001). The

sum-product algorithm is used to compute marginal distrib-

utions while the max-product algorithm is used to give the

most probable configuration or MPE under a model. In a

generative model this is also equivalent to the MAP config-

uration. The max-product variation of BP is equivalent to

the celebrated “Viterbi” algorithm used for decoding in hid-

den Markov models. In a 2D lattice, loopy variants of max-

product can be used to find configurations that correspond

to approximate energy minima. The evaluation of Szeliski

et al. (2008) includes comparisons with variations of loopy

max-product BP but typically finds superior minima using

methods based on either graph cuts or BP variants that in-

volve more sophisticated tree-based approximations (Wain-

wright et al. 2005), which can also be viewed as linear pro-

gramming relaxations.
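For reference, the sum-product recursions that are exact on trees take the following form on a simple chain (an illustrative sketch with one shared pairwise potential table; loopy BP iterates essentially the same message computations on the 2D lattice):

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Exact sum-product (forward-backward) marginals on a chain MRF.

    unary: (n, k) node potentials; pairwise: (k, k) shared edge potential.
    Messages are normalized at each step for numerical stability.
    """
    n, k = unary.shape
    fwd = np.ones((n, k))  # forward messages arriving at each node
    bwd = np.ones((n, k))  # backward messages arriving at each node
    for i in range(1, n):
        fwd[i] = (fwd[i - 1] * unary[i - 1]) @ pairwise
        fwd[i] /= fwd[i].sum()
    for i in range(n - 2, -1, -1):
        bwd[i] = pairwise @ (bwd[i + 1] * unary[i + 1])
        bwd[i] /= bwd[i].sum()
    marg = fwd * bwd * unary
    return marg / marg.sum(axis=1, keepdims=True)
```

With a uniform pairwise table the marginals reduce to the normalized unary potentials, a convenient sanity check; replacing the sums with maxima would give the max-product recursion instead.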

Tree-based approximations can be used to improve both

the quality of marginal inference and the quality of MAP or

MPE estimates. A class of algorithms known as tree-based

reparameterization (TRP) (Wainwright et al. 2003) can be

used to obtain approximate marginals in a graph with cy-

cles. This class of algorithms can be formulated as a series

of reparameterization updates to the original loopy graph.

Tree-reweighted message passing (TRW) (Wainwright et al.

2005) is an approach whereby one reweights the usual mes-

sages of LBP. This family of algorithms involves reparame-

terizing a collection of tree-structured distributions in terms

of a common set of pseudo-max-marginals on the nodes and

edges of the graph with cycles. When it is possible to find

a configuration that is locally optimal with respect to every

single node and edge pseudo-max-marginal, then the upper

bound is tight, and the MAP configuration can be obtained.

Recent work (Kolmogorov 2006) has shown that more so-

phisticated algorithms, such as sequential tree-reweighted

max-product message passing (TRW-S), have the ability to

produce even better minimum energy solutions than graph

cuts.

Belief propagation (Yedidia et al. 2003) and variational

methods (Jordan et al. 1999) are both widely used tech-

niques for inference in probabilistic graphical models and

are known for being reasonably fast and easy to implement

in a memory-efficient manner. Both techniques have been

used for inference and learning in models with applications

ranging from text processing (Blei et al. 2003) to computer

vision (Frey and Jojic 2005). Winn and Bishop (2005) pro-

posed Variational Message Passing (VMP) as a way to view

many variational inference techniques, and it represents a

general purpose algorithm for approximate inference. The

approach is similar in nature to BP in that messages propa-

gate local information throughout a graph, and the message

computation is similar. However, unlike BP, VMP optimizes

a lower bound on the log probability of observed variables


in a generative model. Variational inference thus has a more

direct connection to the probability of data under a model

when an underlying graphical structure contains cycles.

Experimental and theoretical analysis of variational

methods has shown that while the asymptotic performance

of other methods such as sampling (Andrieu et al. 2003) can

be superior, frequently variational methods are faster for ap-

proximate inference (Jordan et al. 1999). However, many

real world problems require models with variables having

very large state spaces. Under these conditions, inference

with variational methods becomes very slow, diminishing

any gains. We address this by proposing sparse variational

methods. These methods also provide theoretical guarantees

that the Kullback–Leibler (KL) divergence between approximate distributions and true distributions is iteratively minimized. Some of our previous work (Pal et al. 2006) has

explored sparse methods for approximate inference using

BP in chain-structured graphs, in loopy graphs (Weinman et

al. 2009), and in 2D grids (Weinman et al. 2008).

We focus our explorations of marginal inference for

learning in this paper on approximate marginal inference us-

ing BP, variational mean field, and sparse variants of these

methods. We find that the sparse variants make most tasks

dramatically more practical, reducing training times by an

order of magnitude. We focus our theoretical analysis on

a new method we call sparse variational message pass-

ing (SVMP). The method combines the theoretical benefits

of variational methods with the time-saving advantages of

sparse messages. The state space in stereo models can be-

come quite large if one seeks to account for many possible

discretized disparity levels. Thus, we believe that the sparse

learning techniques we propose here will be an important

contribution. While we do not explore it here, we note that

sparse inference methods for tree-based approximation techniques (Wainwright et al. 2005) or structured mean field

methods (Jordan et al. 1999) could be a promising direc-

tion for future research. This paper explores a broader and

richer set of model structures compared to our earlier work

(Scharstein and Pal 2007). We also expand upon our ear-

lier experimental analysis in Weinman et al. (2008, 2007)

including a comparison with pseudolikelihood.
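The core time-saving device can be illustrated independently of the full SVMP derivation: truncate each belief (or outgoing message) to the smallest set of states capturing at least 1 − ε of its probability mass and renormalize, so downstream message computations touch only the surviving states. The thresholding rule below is an illustrative sketch, not the exact procedure analyzed in Sect. 4.2.

```python
import numpy as np

def sparsify(belief, epsilon=0.01):
    """Keep the smallest set of most-probable states whose total mass
    reaches 1 - epsilon, zero out the rest, and renormalize."""
    order = np.argsort(belief)[::-1]          # states, most probable first
    cum = np.cumsum(belief[order])            # running mass in that order
    keep = np.searchsorted(cum, 1.0 - epsilon) + 1
    sparse = np.zeros_like(belief)
    sparse[order[:keep]] = belief[order[:keep]]
    return sparse / sparse.sum()
```

For stereo, where beliefs over many disparity levels are typically peaked, most states can be dropped this way, which is the source of the order-of-magnitude speedups reported later.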

The remainder of the paper is structured as follows. First,

in Sect. 2 we describe the new stereo data sets we use as

ground truth for our learning experiments. In Sect. 3, we develop a number of different CRF architectures with different levels of complexity. We begin with a probabilistic CRF version of a canonical model for the stereo vision problem in Sect. 3.1. The canonical model is then augmented to explore:

modulation terms that are dependent upon disparity differ-

ences (Sect. 3.2), models that use patches to compute local

costs (Sect. 3.3), and models that explicitly account for oc-

clusions (Sect. 3.4). In Sect. 4, we present the key challenge

in gradient-based learning and motivate how different types

of approximate inference can be used to approximate a key

intractable expectation. In Sect. 4.2, we then review classical mean field updates and show how sparse variational message

passing can accelerate inference. In Sect. 5, we present re-

sults using graph cuts for learning in the different model ar-

chitectures we have discussed. We then present results com-

paring sparse BP and VMP with graph cuts. Through this

we see how using variational distributions for learning im-

proves results over the point estimate given by graph cuts

and observe how sparse message passing can lead to an or-

der of magnitude reduction in inference time compared to

dense message passing. Finally, we show how learning pa-

rameters with our technique allows us to improve the quality

of occlusion predictions in more richly structured CRFs.

2 Data Sets

In order to obtain a significant amount of training data for

stereo learning approaches, we have created 30 new stereo

data sets with ground-truth disparities using an automated

version of the structured-lighting technique of Scharstein

and Szeliski (2003). Our data sets are available for use by

other researchers.2 Each data set consists of 7 rectified views taken from equidistant points along a line, as well as ground-truth disparity maps for viewpoints 2 and 6. The images are

about 1300 × 1100 pixels (cropped to the overlapping field

of view), with about 150 different integer disparities present.

Each set of 7 views was taken with three different exposures

and under three different lighting conditions. We thus have 9

different images from 7 different viewpoints. These images

exhibit significant radiometric differences and can be used to

test for robustness to violations of the brightness constancy

assumption, which are common in real-world applications.

For the work reported in this paper we only use the six

data sets shown in Fig. 1: Art, Books, Dolls, Laundry, Moe-

bius and Reindeer. As input images we use a single image

pair (views 2 and 6) taken with the same exposure and light-

ing. To make the images manageable by the graph-cut stereo

matcher, we downsample the original images to one third of

their size, resulting in images of roughly 460 × 370 pixels

with a disparity range of 80 pixels. The resulting images are

still more challenging than standard stereo benchmarks such

as the Middlebury Teddy and Cones images, due to their

larger disparity range and higher percentage of untextured

surfaces.

3 Stereo Vision and CRFs

The classical formulation of the stereo vision problem is

to estimate the disparity (horizontal displacement) at each

2. http://vision.middlebury.edu/stereo/data/