High-Accuracy Stereo Depth Maps Using Structured Light

Daniel Scharstein

Middlebury College

schar@middlebury.edu

Richard Szeliski

Microsoft Research

szeliski@microsoft.com

Abstract

Recent progress in stereo algorithm performance is quickly outpacing the ability of existing stereo data sets to discriminate among the best-performing algorithms, motivating the need for more challenging scenes with accurate ground truth information. This paper describes a method for acquiring high-complexity stereo image pairs with pixel-accurate correspondence information using structured light. Unlike traditional range-sensing approaches, our method does not require the calibration of the light sources and yields registered disparity maps between all pairs of cameras and illumination projectors. We present new stereo data sets acquired with our method and demonstrate their suitability for stereo algorithm evaluation. Our results are available at http://www.middlebury.edu/stereo/.

1. Introduction

The last few years have seen a resurgence of interest in the development of highly accurate stereo correspondence algorithms. Part of this interest has been spurred by fundamental breakthroughs in matching strategies and optimization algorithms, and part of the interest is due to the existence of image databases that can be used to test and compare such algorithms. Unfortunately, as algorithms have improved, the difficulty of the existing test images has not kept pace. The best-performing algorithms can now correctly match most of the pixels in data sets for which correct (ground truth) disparity information is available [21].

In this paper, we devise a method to automatically acquire high-complexity stereo image pairs with pixel-accurate correspondence information. Previous approaches have either relied on hand-labeling a small number of images consisting mostly of fronto-parallel planes [17], or setting up scenes with a small number of slanted planes that can be segmented and then matched reliably with parametric correspondence algorithms [21]. Synthetic images have also been suggested for testing stereo algorithm performance [12, 9], but they typically are either too easy to solve if noise, aliasing, etc. are not modeled, or too difficult, e.g., due to complete lack of texture in parts of the scene.

Figure 1. Experimental setup, showing the digital camera mounted on a translation stage, the video projector, and the complex scene being acquired.

In this paper, we use structured light to uniquely label each pixel in a set of acquired images, so that correspondence becomes (mostly) trivial, and dense pixel-accurate correspondences can be automatically produced to act as ground-truth data. Structured-light techniques rely on projecting one or more special light patterns onto a scene, usually in order to directly acquire a range map of the scene, typically using a single camera and a single projector [1, 2, 3, 4, 5, 7, 11, 13, 18, 19, 20, 22, 23]. Random light patterns have sometimes been used to provide artificial texture to stereo-based range sensing systems [14]. Another approach is to register range data with stereo image pairs, but the range data is usually of lower resolution than the images, and the fields of view may not correspond exactly, leading to areas of the image for which no range data is available [16].

2. Overview of our approach

The goal of our technique is to produce pairs of real-world images of complex scenes where each pixel is labeled with its correspondence in the other image. These image pairs can then be used to test the accuracy of stereo algorithms relative to the known ground-truth correspondences.

Our approach relies on using a pair of cameras and one or more light projectors that cast structured light patterns onto the scene. Each camera uses the structured light sequence to determine a unique code (label) for each pixel. Finding inter-image correspondence then trivially consists of finding the pixel in the corresponding image that has the same unique code.

The advantage of our approach, as compared to using a separate range sensor, is that the data sets are automatically registered. Furthermore, as long as each pixel is illuminated by at least one of the projectors, its correspondence in the other image (or lack of correspondence, which indicates occlusion) can be unambiguously determined.

In our current experimental setup (Figure 1), we use a single digital camera (Canon G1) translating on a linear stage, and one or two light projectors illuminating the scene from different directions. We acquire images under both structured lighting and ambient illumination conditions. The ambient illuminated images can be used as inputs to the stereo matching algorithms being evaluated.

Let us now define some terminology. We distinguish between views – the images taken by the cameras – and illuminations – the structured light patterns projected onto the scene. We model both processes using planar perspective projection and use coordinates (x,y) for views and (u,v) for illuminations.

There are two primary camera views, L (left) and R (right), between which correspondences are to be established. The illumination sources from which light patterns are projected are identified using numbers {0, 1, ...}. More than one illumination direction may be necessary to illuminate all surfaces in complex scenes with occluding objects.

Our processing pipeline consists of the following stages:

1. Acquire all desired views under all illuminations.

2. Rectify the images to obtain the usual horizontal epipolar geometry, using either a small number of corresponding features [15] or dense 2D correspondences (step 4).

3. Decode the light patterns to get (u,v) codes at each pixel in each view.

4. Use the unique codes at each pixel to compute correspondences. (If the images are rectified, 1D search can be performed, else 2D search is required.) The results of this correspondence process are the (usual) view disparities (horizontal motion).

5. Determine the projection matrices for the illumination sources from the view disparities and the code labels.

6. Reproject the code labels into the two-view geometry. This results in the illumination disparities.

7. Combine the disparities from all different sources to get a reliable and accurate final disparity map.

8. Optionally crop and downsample the disparity maps and the views taken under ambient lighting.
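The eight stages above can be sketched as a small driver that threads a shared data dictionary through a list of stage functions. Everything here is an illustrative placeholder (the stage functions do nothing), not the authors' implementation; it only shows one natural way to structure such a pipeline:

```python
# Hypothetical pipeline driver; the stage names mirror the numbered steps
# in the text, but the functions themselves are empty placeholders.

def run_pipeline(stages, data):
    """Apply each (name, fn) stage in order, recording the order in data['log']."""
    for name, fn in stages:
        data = fn(data)
        data.setdefault("log", []).append(name)
    return data

def identity(d):
    """Placeholder stage: passes the data through unchanged."""
    return d

stages = [
    ("acquire", identity),
    ("rectify", identity),
    ("decode", identity),
    ("view_disparities", identity),
    ("projection_matrices", identity),
    ("illumination_disparities", identity),
    ("combine", identity),
    ("crop_downsample", identity),
]

result = run_pipeline(stages, {})
```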

The remainder of this paper is structured as follows. The next section describes the algorithms used to determine unique codes from the structured lighting. Section 4 discusses how view disparities and illumination disparities are established and merged. Section 5 describes our experimental results, and Section 6 describes our conclusions and future work.

3. Structured light

To uniquely label each pixel, we project a series of structured light images onto the scene, and decode the set of projected intensities at each pixel to give it a unique label. The simplest kind of pattern to project is a series of single stripe images (light planes) [3, 7, 19], but these require O(n) images, where n is the width of the image in pixels.

Instead, we have tested two other kinds of structured light: binary Gray-code patterns, and series of sine waves.

3.1. Gray codes

Gray-code patterns only contain black and white (on/off) pixel values, which were the only possibilities available with the earliest LCD projectors. Using such binary images requires log2(n) patterns to distinguish among n locations. For our projector (Sony VPL-CX10) with 1024 × 768 pixels, it is sufficient to illuminate the scene with 10 vertical and 10 horizontal patterns, which together uniquely encode the (u,v) position at each pixel. Gray codes are well suited for such binary position encoding, since only one bit changes at a time, and thus small mislocalizations of 0-1 changes cannot result in large code changes [20].
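The binary-reflected Gray code that gives this one-bit-per-step property can be sketched in a few lines. The function names and pattern layout below are our own illustration; the 1024-column width and 10 patterns match the text:

```python
# Sketch: generating Gray-code stripe patterns for an n-column projector.
# Adjacent columns differ in exactly one bit, so a small mislocalization
# of a 0-1 transition cannot cause a large code error.

def gray_code(i):
    """Standard binary-reflected Gray code of integer i."""
    return i ^ (i >> 1)

def stripe_patterns(width, bits):
    """patterns[b][u] is the on/off value of column u in the b-th pattern
    (most significant bit first)."""
    return [[(gray_code(u) >> b) & 1 for u in range(width)]
            for b in reversed(range(bits))]

# For a 1024-column projector, 10 patterns encode every column.
pats = stripe_patterns(1024, 10)

# Number of patterns in which adjacent columns u and u+1 differ.
diffs = [sum(pats[b][u] != pats[b][u + 1] for b in range(10))
         for u in range(1023)]
```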

Decoding the light patterns is conceptually simple, since at each pixel we need only decide whether it is illuminated or not. We could for example take two reference images, all-white and all-black, and compare each code pixel with the average of the two. (With a gray-level projector, we could also project a reference image with 0.5 intensity.) Such reference images measure the albedo of each scene point. In practice, however, this does not work well due to interreflections in the scene and “fogging” inside the projector (adding a low-frequency average of intensities to the projected pattern), which causes increased brightness near bright areas. We have found that the only reliable way of thresholding pixels into on/off is to project both the code pattern and its inverse. We can then label each pixel according to whether the pattern or its inverse appears brighter. This avoids having to estimate the local albedo altogether. The obvious drawback is that twice as many images are required. Figure 2 shows examples of thresholded Gray-code images.
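This per-pixel decision can be sketched as follows, assuming scalar intensities for simplicity (the text compares color differences summed over bands); the function name and `min_diff` parameter are our own:

```python
# Sketch of the bit decision: compare the image taken under a code
# pattern with the image taken under its inverse, instead of comparing
# against an albedo estimate.

def decode_bit(pattern_val, inverse_val, min_diff=0):
    """Return 1 if the pattern appears brighter, 0 if the inverse does,
    or None ('unknown') if the difference is too small to call."""
    d = pattern_val - inverse_val
    if abs(d) <= min_diff:
        return None
    return 1 if d > 0 else 0
```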

Figure 2. Examples of thresholded Gray-code images. Uncertain bits are shown in gray. (Full-size versions of all images in this paper are available at http://www.middlebury.edu/stereo/.)

Unfortunately, even using patterns and their inverses may not be enough to reliably distinguish light patterns on surfaces with widely varying albedos. In our experiments, we have found it necessary to use two different exposure times (0.5 and 0.1 sec.). At each pixel, we select the exposure setting that yields the largest absolute difference between the two illuminations. If this largest difference is still below a threshold (sum of signed differences over all color bands < 32), the pixel is labeled “unknown” (gray pixels in Figure 2), since its code cannot be reliably determined. This can happen in shadowed areas or for surfaces with very low albedo, high reflectance, or at oblique angles.
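The exposure selection rule can be sketched like this, again with scalar intensities standing in for the per-band color differences of the text; the threshold value 32 is from the text, the rest is our simplification:

```python
# Sketch: per-pixel exposure selection. Each entry of `pairs` holds the
# (pattern, inverse) intensities observed under one exposure time.

def best_exposure(pairs, threshold=32):
    """Pick the exposure with the largest |pattern - inverse| difference;
    return the decoded bit (1/0), or None ('unknown') if even the best
    difference is below the threshold."""
    p, i = max(pairs, key=lambda pi: abs(pi[0] - pi[1]))
    if abs(p - i) < threshold:
        return None
    return 1 if p > i else 0
```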

The initial code values we obtain by concatenating the bits from all the thresholded binary images need to be cleaned up and potentially interpolated, since the camera resolution is typically higher than projector resolution. In our case, the projector has a 1024 × 768 resolution, and the camera has 2048 × 1536. Since the camera only sees a subset of the illuminated scene (i.e., it is zoomed in) and illumination pixels can appear larger on slanted surfaces, we get even more discrepancy in resolution. In our setup, each illumination pixel is typically 2–4 camera pixels wide. We clean up the Gray code images by filling small holes caused by unknown bit values. We then interpolate (integer) code values to get a higher resolution and avoid multiple pixels with the same code. Interpolation is done in the prominent code direction, i.e., horizontally for u and vertically for v. We currently compute a robust average over a sliding 1D window of 7 values. The results of the entire decoding process are shown in Figure 4a.
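A robust average over a sliding 7-value window could look like the sketch below. The exact robustness rule (here: keep values within a cutoff of the window median) is an assumption; the text does not specify it:

```python
# Sketch: 1D robust smoothing of decoded code values along a scanline.
# None marks pixels with unknown codes.

def robust_smooth(codes, half=3, cutoff=2):
    """Robust average over a sliding window of 2*half+1 = 7 values:
    ignore unknowns, then average only values close to the window median."""
    out = []
    for i in range(len(codes)):
        window = [c for c in codes[max(0, i - half): i + half + 1]
                  if c is not None]
        if not window:
            out.append(None)
            continue
        window.sort()
        med = window[len(window) // 2]
        inliers = [c for c in window if abs(c - med) <= cutoff]
        out.append(sum(inliers) / len(inliers))
    return out
```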

3.2. Sine waves

Binary Gray-code patterns use only two different intensity levels and require a whole series of images to uniquely determine the pixel code. Projecting a continuous function onto the scene takes advantage of the gray-level resolution available in modern LCD projectors, and can thus potentially require fewer images (or alternatively, result in greater precision for the same number of images). It can also potentially overcome discretization problems that might introduce artifacts at the boundaries of binary patterns [6].

Consider for example projecting a pure white pattern and a gray-level ramp onto the scene. In the absence of noise and non-linearities, the ratio of the two values would give us the position along the ramp of each pixel. However, this approach has limited effective spatial resolution [11, 22]. Projecting a more quickly varying pattern such as a sawtooth alleviates this, but introduces a phase ambiguity (points at the same phase in the periodic pattern cannot be distinguished), which can be resolved using a series of periodic patterns at different frequencies [13]. A sine wave pattern avoids the discontinuities of a sawtooth, but introduces a further two-way ambiguity in phase, so it is useful to project two or more waves at different phases.

Our current algorithm projects sine waves at two different frequencies and 12 different phases. The first frequency has a period equal to the whole (projector) image width or height; the second has 10 periods per screen.
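Generating such pattern images is straightforward. The sketch below parameterizes the wave as sin(2π·periods·u/W + φ), consistent with "periods per screen"; the normalization to [0,1] and the peak intensity `B` are our assumptions, not the authors' exact pattern values:

```python
import math

# Sketch: vertical sine-wave patterns at two frequencies (1 and 10
# periods per screen) and 12 phases (0, 30, ..., 330 degrees).

def sine_pattern(width, periods, phase_deg, B=1.0):
    """One row of a sine pattern, normalized to [0, 1]."""
    phase = math.radians(phase_deg)
    return [B * (math.sin(2 * math.pi * periods * u / width + phase) + 1) / 2
            for u in range(width)]

patterns = [sine_pattern(1024, f, p)
            for f in (1, 10) for p in range(0, 360, 30)]
```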

Given the images of the scene illuminated with these patterns, how do we compute the phase and hence (u,v) coordinates at each pixel? Assuming a linear image formation process, we have the following (color) image formation equation

I_kl(x,y) = A(x,y) B_kl [sin(2π f_k u + φ_l) + 1],    (1)

where A(x,y) is the (color) albedo corresponding to scene pixel (x,y), B_kl is the intensity of the (k,l)th projected pattern, f_k is its frequency, and φ_l is its phase. A similar equation can be obtained for horizontal sine wave patterns by replacing u with v.

Assume for now that we only have a single frequency f_k and let c_l = cos φ_l, s_l = sin φ_l, c_u = cos(2π f_k u), s_u = sin(2π f_k u), and C = A(x,y) B. The above equation can then be re-written (for a given pixel (x,y)) as

I_kl = C [s_u c_l + c_u s_l + 1].    (2)

We can estimate the illuminated albedo value C at each pixel by projecting a mid-tone grey image onto the scene. The above equation is therefore linear in the unknowns (c_u, s_u), which can be (optimally) recovered using linear least squares [10], given a set of images with different (known) phases φ_l. (In the least squares fitting, we ignore any color values that are saturated, i.e., greater than 240.)
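Since there are only two unknowns, the least-squares fit reduces to a closed-form 2×2 solve of the normal equations. The sketch below assumes scalar intensities and a known C; function names are our own:

```python
import math

# Sketch of the linear least-squares fit for (c_u, s_u). Dividing
# equation (2) by C and subtracting 1 gives, for each phase phi_l,
#     I_l / C - 1 = s_u * cos(phi_l) + c_u * sin(phi_l),
# which is linear in the unknowns (c_u, s_u).

def fit_cu_su(observations, C):
    """observations: list of (phase_radians, intensity). Returns (c_u, s_u)."""
    # Normal equations A^T A x = A^T b, with row (sin phi, cos phi)
    # for the unknown vector x = (c_u, s_u).
    a11 = a12 = a22 = b1 = b2 = 0.0
    for phi, I in observations:
        s, c = math.sin(phi), math.cos(phi)
        y = I / C - 1.0
        a11 += s * s; a12 += s * c; a22 += c * c
        b1 += s * y;  b2 += c * y
    det = a11 * a22 - a12 * a12
    cu = (b1 * a22 - b2 * a12) / det
    su = (b2 * a11 - b1 * a12) / det
    return cu, su
```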

An estimate of the u signal can then be recovered using

u = p_u ( (1/2π) tan⁻¹(s_u / c_u) + m ),    (3)

where p_u = W/f_u is the sine period (in pixels) and m is the (unknown) integral phase wrap count. To solve the phase wrapping problem, we first estimate the value of u using a single wave (f_1 = 1), and then repeat the estimation with f_2 = 10, using the previous result to disambiguate the phase.

Figure 3. Phase estimation from (c_u, s_u) least squares fit. The red dot is the least squares solution to the constraint lines, and the ellipse around it indicates the two-dimensional uncertainty.
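The two-frequency unwrapping step can be sketched as follows: the single-period wave gives a coarse, unambiguous position, which then selects the wrap count m for the 10-period wave. Function names and the closest-match rule for m are our own:

```python
import math

# Sketch: phase unwrapping with two frequencies.

def u_from_phase(cu, su, period):
    """Fractional position within one period, in pixels (m = 0)."""
    frac = math.atan2(su, cu) / (2 * math.pi) % 1.0
    return frac * period

def unwrap(u_coarse, cu_fine, su_fine, fine_period):
    """Choose the integer wrap count m that brings the fine-frequency
    estimate closest to the coarse estimate."""
    base = u_from_phase(cu_fine, su_fine, fine_period)
    m = round((u_coarse - base) / fine_period)
    return base + m * fine_period
```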

Since we are using least squares, we can compute a certainty for the u estimate. The normal equations for the least squares system directly give us the information matrix (inverse covariance) for the (c_u, s_u) estimate. We can convert this to a variance in u by projecting along the direction normal to the line going through the origin and (c_u, s_u) (Figure 3). Furthermore, we can use the distance of the fitted point (c_u, s_u) from the unit circle as a sanity check on the quality of our sine wave fit. Computing certainties allows us to merge estimates from different exposures. At present, we simply pick the estimate with the higher certainty.

Figure 4b shows the results of recovering the u positions using sine patterns. For these experiments, we use all 12 phases (φ = 0°, 30°, ..., 330°) and two different exposures (0.1 and 0.5 sec). In the future, we plan to study how the certainty and reliability of these estimates vary as a function of the number of phases used.

3.3. Comparison

Figure 4 shows examples of u coordinates recovered both from Gray code and sine wave patterns. The total number of light patterns used is 80 for the Gray codes (10 bit patterns and their inverses, both u and v, two exposures), and 100 for the sine waves (2 frequencies and 12 phases plus 1 reference image, both u and v, two exposures). Visual inspection shows that the Gray codes yield better (less noisy) results. The main reason is that by projecting binary patterns and their inverses, we avoid the difficult task of estimating the albedo of the scene. Although recovering the phase of sine wave patterns potentially yields higher resolution and could be done with fewer images, it is also more susceptible to non-linearities of the camera and projector and to interreflections in the scene.

In practice, the time to take the images of all structured light patterns is relatively small compared to that of setting up the scene and calibrating the cameras. We thus use the Gray code method for the results reported here.

Figure 4. Computed u coordinates (only low-order bits are shown). (a): Gray code; (b): sine wave.

4. Disparity computation

Given N illumination sources, the decoding stage described above yields a set of labels (u_ij(x,y), v_ij(x,y)), for each illumination i ∈ {0, ..., N−1} and view j ∈ {L, R}. Note that these labels not only uniquely identify each scene point, but also encode the coordinates of the illumination source. We now describe how high-accuracy disparities can be computed from such labels corresponding to one or more illumination directions.

4.1. View disparities

The first step is to establish correspondences between the two views L and R by finding matching code values. Assuming rectified views for the moment, this amounts to a simple 1D search on corresponding scanlines. While conceptually simple, several practical issues arise:

• Some pixels may be partially occluded (visible only in one view).

• Some pixels may have unknown code values in some illuminations due to shadows or reflections.

• A perfect matching code value may not exist due to aliasing or interpolation errors.

• Several perfect matching code values may exist due to the limited resolution of the illumination source.

• The correspondences computed from different illuminations may be inconsistent.

The first problem, partial occlusion, is unavoidable and will result in unmatched pixels. The number of unknown code values due to shadows in the scene can be reduced by using more than one illumination source, which allows us to establish correspondences at all points illuminated by at least one source, and also enables a consistency check at pixels illuminated by more than one source. This is advantageous since at this stage our goal is to establish only high-confidence correspondences. We thus omit all pixels whose disparity estimates under different illuminations disagree. As a final consistency check, we establish disparities d_LR and d_RL independently and cross-check for consistency. We now have high-confidence view disparities at points visible in both cameras and illuminated by at least one source (see Figures 6b and 7b).
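The scanline search plus cross-check can be sketched as follows, with integer codes standing in for the decoded (u,v) labels and None marking unknown codes; the uniqueness requirement and function names are our own:

```python
# Sketch: 1D code matching along a rectified scanline, followed by a
# left-right consistency cross-check.

def match_scanline(codes_l, codes_r):
    """d[x] = xr - x for the right pixel with the same code, or None."""
    index_r = {}
    for xr, c in enumerate(codes_r):
        if c is not None:
            index_r.setdefault(c, []).append(xr)
    disp = []
    for x, c in enumerate(codes_l):
        xs = index_r.get(c)
        # Require a unique, known match.
        disp.append(xs[0] - x if c is not None and xs and len(xs) == 1 else None)
    return disp

def cross_check(d_lr, d_rl):
    """Keep only disparities consistent in both directions."""
    out = []
    for x, d in enumerate(d_lr):
        if d is None or not (0 <= x + d < len(d_rl)) or d_rl[x + d] != -d:
            out.append(None)
        else:
            out.append(d)
    return out
```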

Before moving on, let us consider the case of unrectified views. The above method can still be used, except that a 2D search must be used to find corresponding codes. The resulting set of high-quality 2D correspondences can then be used to rectify the original images [15].

4.2. Illumination disparities

The next step in our system is to compute another set of disparities: those between the cameras and the illumination sources. Since the code values correspond to the image coordinates of the illumination patterns, each camera-illumination pair can be considered an independent source of stereo disparities (where the role of one camera is played by the illumination source). This is of course the idea behind traditional structured lighting systems [3].

The difference in our case is that we can register these illumination disparities with our rectified view disparities d_LR without the need to explicitly calibrate the illumination sources (video projectors). Since our final goal is to express all disparities in the rectified two-view geometry, we can treat the view disparities as a 3D reconstruction of the scene (i.e., projective depth), and then solve for the projection matrix of each illumination source.

Let us focus on the relationship between the left view L and illumination source 0. Each pixel whose view disparity has been established can be considered a (homogeneous) 3D scene point S = [x y d 1]^T with projective depth d = d_LR(x,y). Since the pixel's code values (u_0L, v_0L) also represent its x and y coordinates in the illumination pattern, we can write these coordinates as a homogeneous 2D point P = [u_0L v_0L 1]^T. We then have

P ≅ M_0L S,

where M_0L is the unknown 3 × 4 projection matrix of illumination source 0 with respect to the left camera. If we let m_1, m_2, m_3 denote the three rows of M_0L, this yields

u_0L (m_3 · S) = m_1 · S   and   v_0L (m_3 · S) = m_2 · S.    (4)

Since M is only defined up to a scale factor, we set m_34 = 1. Thus we have two linear equations involving the 11 unknown entries of M for each pixel whose disparity and illumination code are known, giving us a heavily overdetermined linear system of equations, which we solve using least squares [10].
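Assembling this system can be sketched as below: with m_34 = 1, each pixel contributes two rows over the 11 unknowns [m_11 ... m_14, m_21 ... m_24, m_31 m_32 m_33]. The actual least-squares solve is omitted; function names are our own:

```python
# Sketch: building the overdetermined linear system for the 11 unknown
# entries of M. Rearranging u*(m3 . S) = m1 . S with S = [x, y, d, 1]
# and m34 = 1 moves the unknowns to the left and u to the right-hand side.

def equations_for_pixel(x, y, d, u, v):
    """Two (row, rhs) pairs over the unknowns
    [m11 m12 m13 m14 m21 m22 m23 m24 m31 m32 m33]."""
    s = (x, y, d, 1.0)
    row_u = list(s) + [0.0] * 4 + [-u * x, -u * y, -u * d]
    row_v = [0.0] * 4 + list(s) + [-v * x, -v * y, -v * d]
    return [(row_u, u), (row_v, v)]

def build_system(pixels):
    """pixels: iterable of (x, y, d, u, v). Returns (rows, rhs) lists."""
    rows, rhs = [], []
    for x, y, d, u, v in pixels:
        for r, b in equations_for_pixel(x, y, d, u, v):
            rows.append(r)
            rhs.append(b)
    return rows, rhs
```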

If the underlying disparities and illumination codes are correct, this is a fast and stable method for computing M_0L. In practice, however, a small number of pixels with large disparity errors can strongly affect the least-squares fit. We therefore use a robust fit with outlier detection by iterating the above process. After each iteration, only those pixels with low residual errors are selected as input to the next iteration. We found that after 4 iterations with successively lower error thresholds we can usually obtain a very good fit.

Given the projection matrix M_0L, we can now solve Equation (4) for d at each pixel, using again a least-squares fit to combine the two estimates. This gives us the illumination disparities d_0L(x,y) (see Figures 6c and 7c). Note that these disparities are available for all points illuminated by source 0, even those that are not visible from the right camera. We thus have a new set of disparities, registered with the first set, which includes half-occluded points. The above process can be repeated for the other camera to yield disparities d_0R, as well as for all other illumination sources i = 1, ..., N−1.
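Solving equation (4) for d at a single pixel can be sketched as follows: each of the two equations is linear in d (a·d = b), and the pair is combined by a scalar least-squares fit. The function name and M layout are our own:

```python
# Sketch: per-pixel projective depth from the fitted 3x4 projection
# matrix M = (m1, m2, m3) and the pixel's illumination code (u, v).

def depth_from_code(M, x, y, u, v):
    m1, m2, m3 = M
    # u*(m3 . S) = m1 . S with S = [x, y, d, 1], rearranged to a*d = b.
    eqs = []
    for coord, row in ((u, m1), (v, m2)):
        a = coord * m3[2] - row[2]
        b = (row[0] * x + row[1] * y + row[3]) \
            - coord * (m3[0] * x + m3[1] * y + m3[3])
        eqs.append((a, b))
    # Scalar least squares over the two estimates: d = sum(a*b)/sum(a*a).
    num = sum(a * b for a, b in eqs)
    den = sum(a * a for a, b in eqs)
    return num / den
```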

4.3. Combining the disparity estimates

Our remaining task is to combine the 2N + 2 disparity maps. Note that all disparities are already registered, i.e., they describe the horizontal motion between views L and R. The first step is to create combined maps for each of L and R separately using a robust average at pixels with more than one disparity estimate. Whenever there is a majority of values within close range, we use the average of this subset of values; otherwise, the pixel is labeled unknown. In the second step, the left and right (combined) maps are checked for consistency. For unoccluded pixels, this means that

d_LR(x,y) = −d_RL(x + d_LR(x,y), y),

and vice versa. If the disparities differ slightly, they are adjusted so that the final set of disparity maps is fully consistent. Note that since we also have disparities in half-occluded regions, the above equation must be relaxed to reflect all legal visibility situations. This yields the final, consistent, and highly accurate pair of disparity maps relating the two views L and R (Figures 6d and 7d).
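The per-pixel combination rule can be sketched like this. The tolerance value and the exact majority test are assumptions consistent with the description, not the authors' parameters:

```python
# Sketch: robust average of several disparity estimates at one pixel.
# None marks estimates that are unavailable.

def combine_estimates(values, tol=1.0):
    """If a majority of the known estimates agree to within tol, return
    the average of that subset; otherwise return None ('unknown')."""
    vals = [v for v in values if v is not None]
    if not vals:
        return None
    best = []
    for v in vals:
        close = [w for w in vals if abs(w - v) <= tol]
        if len(close) > len(best):
            best = close
    if 2 * len(best) <= len(vals):   # no majority agrees
        return None
    return sum(best) / len(best)
```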

The two final steps are cropping and downsampling. Up to this point, we are still dealing with full-size (2048 × 1536) images. In our setup, disparities typically range from about 210 to about 450. We can bring the disparities closer to zero by cropping to the joint field of view, which in effect stabilizes an imaginary point just behind the farthest surface in the scene. This yields a disparity range of 0–240, and an image width of 1840. Since most current stereo implementations work with much smaller image sizes and disparity ranges, we downsample the images and disparity maps to quarter size (460 × 384). The disparity maps are downsampled using a majority filter, while the ambient images are reduced