Content uploaded by Abhranil Das

Author content

All content in this area was uploaded by Abhranil Das on Nov 28, 2022

Content may be subject to copyright.


Depth estimation from stereo image pairs

Abhranil Das

Department of Physics, The University of Texas at Austin

abhranil.das@utexas.edu

11 May 2017

Abstract

We present some theoretical results on depth estimation from stereo image pairs, then describe the simple computational method of block-matching for doing this, with Matlab code and example results.

Keywords: vision, stereo, depth, computation

1 Depth inference from a stereo point pair

1.1 Inverting a projected point

Figure 1: Projecting a point in a 3D scene onto a 2D image seen by an eye.

Fig. 1, adapted from Das (2010),[1] shows a simplified schematic of a point (x, y, z) in a 3D scene being projected onto a 2D screen that corresponds to the flat image seen by an eye at the location (x_e, y_e, z_e). Suppose the 2D screen is normal to the line of sight and a distance d in front of it. The location of the projected point in the coordinate system of the screen (with origin defined by the intersection of the line of sight with the screen) can be calculated using simple geometry. Following Das:[1]

$$x' = d\,\frac{x - x_e}{z - z_e}, \qquad y' = d\,\frac{y - y_e}{z - z_e}. \tag{1}$$

In a realistic situation we have only the projected image, and we need to infer the 3D scene. Here this corresponds to inferring the 3D coordinates of whatever produced the projected point (x′, y′). Slight algebraic manipulation of eq. 1 gives us:

$$\frac{x - x_e}{x'} = \frac{y - y_e}{y'} = \frac{z - z_e}{d}. \tag{2}$$

This is the equation of a straight line in 3D space passing through the eye, with direction ratios (x′, y′, d). It represents the general set of all points that would produce the projection (x′, y′). The actual object producing the projection may be a point on this line, multiple points, combinations of points and line segments, etc. So with just one eye, there is no way to locate a point in space from its projection alone.
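As a concrete illustration, eq. 1 is a one-liner in code. The following is a minimal Python sketch (the paper's own code is in Matlab; the function name `project` is our choice):

```python
def project(point, eye, d):
    """Project a 3D scene point onto a screen a distance d in front of the eye (eq. 1)."""
    x, y, z = point
    xe, ye, ze = eye
    return (d * (x - xe) / (z - ze), d * (y - ye) / (z - ze))

# Every point along the line of eq. 2 produces the same projection:
eye = (0.0, 0.0, 0.0)
near = (1.0, 2.0, 10.0)
far = (2.0, 4.0, 20.0)    # on the same line through the eye, twice as far
assert project(near, eye, 1.0) == project(far, eye, 1.0)
```

The final assertion is exactly the ambiguity described above: a single projection pins down only a ray, not a point.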

1.2 Inverting a stereo point pair, and the epipolar condition

Fig. 2, again adapted from Das,[1] illustrates the situation when a point in 3D space is projected to two eyes that differ only in their x-locations (x_el and x_er). We assume that there is no vergence, so that the lines of sight are parallel. The two projection screens may be thought of as a stereo image pair being presented to the eyes, and using these we need to infer the 3D scene as before.

Using eq. 1, we see that now according to the left eye, the object that produced the projection lies on:

$$\frac{x - x_{el}}{x'_l} = \frac{y - y_e}{y'_l} = \frac{z - z_e}{d}, \tag{3}$$

and according to the right eye, it lies on:

$$\frac{x - x_{er}}{x'_r} = \frac{y - y_e}{y'_r} = \frac{z - z_e}{d}. \tag{4}$$


Figure 2: Projecting a point in a 3D scene onto 2D images seen by two eyes.

The only object that could give rise to both projections must therefore be the point of intersection of these two lines. Before finding this point, we need to establish the condition for their intersection, which is:

$$\begin{vmatrix} x'_l & x'_r & x_{el} - x_{er} \\ y'_l & y'_r & 0 \\ d & d & 0 \end{vmatrix} = 0. \tag{5}$$

Expanding this determinant by the third column gives us:

$$(x_{el} - x_{er})\,(y'_l - y'_r)\,d = 0. \tag{6}$$

Since the eyes have different x-locations and the screens are at non-zero distances from them, we are left with y′_l = y′_r ≡ y′. This means that when the eyes are viewing a stereo pair, a pair of corresponding points in the two images must have the same height. This is an epipolar condition. Stereo images taken with cameras that are not correctly aligned will not satisfy this condition; they then need to be rectified before they can be used for depth estimation of the photographed scene. Given that the epipolar condition is satisfied, we can go ahead and calculate the point of intersection that produced the two projections:

$$(x, y, z) = \left( \frac{x'_l\, x_{er} - x'_r\, x_{el}}{x'_l - x'_r},\;\; y_e + \frac{x_{er} - x_{el}}{x'_l - x'_r}\, y',\;\; z_e + \frac{x_{er} - x_{el}}{x'_l - x'_r}\, d \right). \tag{7}$$

These coordinates are with respect to the external coordinate system that we have been using to locate both the eyes and the scene. The coordinates of this triangulated point relative to an eye are found by subtracting the eye coordinates. Relative to the left eye, say, they are:

$$\left( \frac{x_{er} - x_{el}}{x'_l - x'_r}\, x'_l,\;\; \frac{x_{er} - x_{el}}{x'_l - x'_r}\, y',\;\; \frac{x_{er} - x_{el}}{x'_l - x'_r}\, d \right) = \frac{x_{er} - x_{el}}{x'_l - x'_r} \times (x'_l,\, y',\, d). \tag{8}$$

Thus, the true location of the point is found by projecting the image point radially out from the eye by a factor of (x_er − x_el)/(x′_l − x′_r) (the same holds with respect to the right eye as well). This means that for a pair of eyes with a fixed interocular distance x_er − x_el, the depths of perceived stereo point pairs are inversely proportional to their disparity x′_l − x′_r. Thus, the disparity map for a stereo image pair holds the information about the depth of the scene.
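Eq. 8 translates directly into a small triangulation routine. The following is a Python sketch (not the paper's Matlab code; the function name `triangulate_left` and its argument convention are our choices), returning the point's coordinates relative to the left eye:

```python
def triangulate_left(xl, xr, yp, xel, xer, d):
    """3D location of a point relative to the left eye, from eq. 8.
    xl, xr: projected x-coordinates in the left and right images;
    yp: their common (epipolar) height; xel, xer: eye x-locations;
    d: distance from each eye to its screen."""
    scale = (xer - xel) / (xl - xr)    # interocular distance / disparity
    return (scale * xl, scale * yp, scale * d)
```

Projecting a known point with eq. 1 and feeding the two projections back in recovers that point, which is a quick sanity check on the algebra; depth falls out as the third coordinate, inversely proportional to the disparity xl − xr.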

Once we know the disparity of each point pair, it is therefore straightforward to compute the depth map. However, the challenging

part is to find which two points in the stereo pair correspond. In the following section we describe and implement a simple method for

doing this.


2 Computing the disparity map from a stereo image pair

2.1 The method of block-matching

This is a simple method to find corresponding points in a stereo image pair and compute the disparity map, following a tutorial by Chris McCormick. In its basic form it works only for rectified image pairs. Fig. 3a shows an example rectified stereo pair (from the database by Scharstein et al.[2]). We first convert both images to greyscale. Each point in either image has its corresponding point in the same row of the other image. Fig. 3b shows the basic procedure for block matching. The black square on the left shows a selected block of pixels from the left image. We call this the 'template'. We want to find the block in the right image that corresponds to this template. For this we scan the right image along the same row as the template, up to some range (yellow rectangle in the right figure), calculating the L1-difference between the template and each scanned block. Then we take the block in the scanned range with the lowest L1-difference (green square) as the corresponding block. The difference between the template position (white square) and the matched block position (green square) then gives us the disparity. In this way we can calculate the disparity for all points. For a properly rectified image pair, corresponding points in the second image will all be on the same side of the point position in the original image; e.g. all corresponding points in the right image of fig. 3a are to the left of those in the left image. This simplifies and speeds up the search.

Figure 3: a. A rectified stereo pair of images. b. The method of block matching to find corresponding point pairs (see text).

The left image of fig. 5 shows the disparity map computed by block-matching the stereo pair of fig. 3a.
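The scan-and-compare loop just described can be sketched compactly. Below is a minimal Python/NumPy illustration of the idea (not the paper's Matlab implementation; the name `block_match_row` and the default block size and search range are our choices):

```python
import numpy as np

def block_match_row(left, right, y, x, block=5, max_disp=16):
    """Disparity of the block centred at (y, x) of the left image, found by
    scanning the same row of the right image and minimizing the L1-difference
    (sum of absolute differences). Assumes a rectified pair, so the match
    lies to the left: x_right = x - disparity. Assumes the block fits in the image."""
    h = block // 2
    template = left[y - h:y + h + 1, x - h:x + h + 1]
    costs = []
    for disp in range(max_disp + 1):
        xr = x - disp
        if xr - h < 0:            # scanned block would fall off the image
            break
        candidate = right[y - h:y + h + 1, xr - h:xr + h + 1]
        costs.append(np.abs(template - candidate).sum())
    return int(np.argmin(costs))
```

A full disparity map repeats this for every pixel; practical implementations vectorize the scan rather than looping per pixel.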

2.1.1 Subpixel disparity estimation

Figure 4: Illustration of the subpixel disparity estimate (see text).

The block-matching procedure described so far will always return a disparity that is an integer number of pixels, since we calculate the L1-difference between the template and the block at 1-pixel increments. However, consider the situation illustrated in fig. 4. The grey and black dots show the L1-differences calculated between the template and block for a range of block positions around the minimum, which is at a shift of 8 pixels. With the basic method, we would choose 8 pixels to be the disparity. However, it is possible to improve on this coarse-grained estimate. We can fit a parabola to the smallest L1-difference and its two neighbours (the black points), and choose the location of the minimum of this parabola to be the disparity. This gives us a better second-order subpixel estimate of the disparity.
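The fit has a closed form: a parabola through the costs c₋, c₀, c₊ at shifts d−1, d, d+1 has its minimum at an offset of (c₋ − c₊) / (2(c₋ − 2c₀ + c₊)) from d. A Python sketch (the function name and the guard against a degenerate fit are our additions):

```python
def subpixel_disparity(d, c_minus, c_zero, c_plus):
    """Refine an integer disparity d by fitting a parabola through the L1
    costs at shifts d-1, d, d+1 and returning the shift of its minimum."""
    denom = c_minus - 2.0 * c_zero + c_plus
    if denom <= 0:                    # flat or degenerate cost curve: keep d
        return float(d)
    return d + (c_minus - c_plus) / (2.0 * denom)
```

With the symmetric costs of fig. 4 the offset is zero and the integer estimate survives; asymmetric neighbours pull the minimum fractionally toward the smaller one.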

Fig. 5 shows a comparison of the disparity maps calculated with and without the subpixel estimation. While there are many possible improvements to this block-matching method, the subpixel estimation is not one that yields much improvement, at least in this case.

Figs. 6 and 7 are two more examples of disparity maps computed from stereo pairs (with subpixel estimation).


Figure 5: Disparity maps computed by block-matching the stereo pair of fig. 3a, with and without subpixel estimation. For this example, the improvement with subpixel estimation is minimal.

Figure 6: Another example of computing a disparity map with subpixel estimation (bottom) from a rectified stereo pair (top).


Figure 7: Another example of computing a disparity map with subpixel estimation (bottom) from a rectified stereo pair (top).

2.1.2 Possible improvements to the method

The following are possible improvements to the block-matching method described:

• Instead of computing L1-differences between the template and the block, we could calculate the correlation coefficient of their pixels. This measure is invariant to shifts and scalings of the pixel values. However, I tried this and the results were worse (I don't know why).

• Instead of converting the stereo image pair to greyscale, we could use the information in all three colour channels to find a better correspondence. I tried this by calculating both the total L1-difference and the overall correlation coefficient across all three colour channels, but once again the results were much worse (I don't know why).
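The correlation measure mentioned in the first bullet can be written as a zero-mean normalized cross-correlation. A Python/NumPy sketch of that cost (the helper name `ncc` is ours, and this is our reading of "correlation coefficient", not the exact code tried above):

```python
import numpy as np

def ncc(template, block):
    """Zero-mean normalized cross-correlation of two equal-sized blocks.
    Lies in [-1, 1] and is invariant to adding a constant to, or positively
    scaling, the pixel values."""
    t = template - template.mean()
    b = block - block.mean()
    denom = np.sqrt((t * t).sum() * (b * b).sum())
    if denom == 0:
        return 0.0                    # a constant block matches nothing
    return float((t * b).sum() / denom)
```

Matching would then select the block with the highest `ncc` rather than the lowest L1-difference.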

2.2 Matlab code

Commented Matlab code for the block-matching method with subpixel estimation is available at github.com/abhranildas/depth-from-stereo.

References

[1] Abhranil Das. Perspective: the maths of seeing. Lambert Academic Publishing, Germany, 2010.

[2] Daniel Scharstein and Richard Szeliski. High-accuracy stereo depth maps using structured light. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., volume 1, pages I–I. IEEE, 2003.
