
Automated Ground-Plane Estimation for Trajectory Rectification

Ian Hales, David Hogg, Kia Ng, and Roger Boyle

University of Leeds, Woodhouse Lane, Leeds, LS2 9JT
{i.j.hales06,d.c.hogg,k.c.ng,r.d.boyle}@leeds.ac.uk

Abstract. We present a system to determine ground-plane parameters in densely crowded scenes, where the use of geometric features such as parallel lines or reliable estimates of agent dimensions is not possible. Using feature points tracked over short intervals, together with some plausible scene assumptions, we can estimate the parameters of the ground-plane accurately enough to correct usefully for perspective distortion. This paper describes feasibility studies conducted on controlled, simulated data to establish how different levels and types of noise affect the accuracy of the estimation, and a verification of the approach on live data, showing that the method can estimate ground-plane parameters and thus allow improved accuracy of trajectory analysis.

Keywords: ground-plane, trajectory, rectification, crowd-motion.

1 Introduction

In computer vision one often wishes to examine objects in terms of their size, speed or location, accurate measurement of which is hampered by several types of distortion that occur when the camera captures the scene. This is particularly relevant in applications such as behaviour analysis in crowds, where such metrics can be used to detect events or measure crowd density. By estimating the transformation undergone by the coordinate system, we can invert it to counteract the effect of such distortions and obtain a more accurate view of the world. If we can obtain the ground-plane orientation with respect to the camera, we can correct for one of the most prevalent sources of error: perspective distortion.

Two broad approaches are apparent within the literature: "formal methods", often in terms of the fundamental matrix or image homographies; and "informal methods", providing a coarse, but still usable, impression of the scene (e.g. a scale ratio from front to back). A common feature is the idea of "vanishing points", which lie on the horizon line. Having obtained the equation of the horizon, it is trivial to perform affine rectification [1]. Metric rectification can then be achieved using knowledge of known lengths or angles, or equality of angles [2]. These points can be determined using pairs of imaged parallel lines [1], and inclusion of a vanishing point in a third direction allows for metric measurement [3]. The Manhattan Assumption [4] offers a viable framework to obtain additional vanishing points from the background in man-made scenes [5]. Alternatively, known angles in the scene [6] or known reference lengths [7] can produce similar results. In pedestrian scenes, measurements (e.g. projected foot-head heights [8,9]) provide a valid reference length. Less formal methods tend to rely on the estimation of a single ground-plane. It is again common to use projected foot-head heights [10], and accuracy can be improved by tracking individuals, taking relative height measurements for each [11], which minimizes potential variations in reference length.

R. Wilson et al. (Eds.): CAIP 2013, Part II, LNCS 8048, pp. 378-385, 2013.
© Springer-Verlag Berlin Heidelberg 2013

We base our work within the context of pedestrian crowd analysis. The methods discussed above rely on assumptions unlikely to hold in a densely crowded scene, due to inter-occlusion and the inability to see geometric features in the background. Stauffer et al. [10] mention an alternative approach using the assumption of constant speed of moving objects, built upon by Bose and Grimson [12], who use a blob tracker to generate trajectories. They then obtain the vanishing line by assuming constant speed, before using inter-trajectory speed ratios to obtain metric rectification. However, maintained tracking was necessary to achieve metric rectification, making this approach infeasible in our domain.

We propose to use the local speed of tracked feature points as a calibration measurement, along with the assumption of constant speed, to reconstruct the ground-plane parameters. We do not require prolonged tracking of features, provided we observe pedestrian motion throughout the scene. The remainder of this paper describes our method, proves its validity on simulated data and assesses its accuracy on real-world benchmark data.

2 Ground-Plane Estimation

Above, we saw that it is possible to reconstruct the 3D scene using information from measuring objects of known real-world size at various positions within the image. In this section we show that it is equally possible to use measurements of object speed to reconstruct the plane upon which agents are moving. Throughout this work we assume that all objects move on a single, linear plane and that each observed object moves at constant (or near-constant) speed. We do not, however, require that all objects move at the same speed.

We first track sparse features frame-by-frame for some period using the KLT tracker [13], until we can no longer reliably continue the trajectory. Since we deal with high-density pedestrian crowds, heavy inter-occlusion is likely to result in many short trajectories. However, it takes only a few frames for us to gain valuable information from a trajectory. It is plausible that objects may have many features, or no features at all, assigned to them, but this is not a major concern provided we gather information from various scene positions.

We define a trajectory as a time series of points upon a plane, recorded at equally spaced time-steps. We observe these in the image coordinate system and wish to obtain their respective points within the camera coordinate system through perspective back-projection. We define a time series $(X^\tau_1, X^\tau_2, \ldots, X^\tau_{N_\tau})$ as the x-coordinates of trajectory $\tau$ at times 1 to $N_\tau$, and its projection into image space as $(x^\tau_1, x^\tau_2, \ldots, x^\tau_{N_\tau})$. We represent the orientation of the ground-plane in terms of its unit normal vector, $\mathbf{n}$, given by equation (1), where $\theta$ represents the angle of elevation (rotation about the x-axis) and $\psi$ the angle of yaw (rotation about the z-axis).

$$
\mathbf{n} = \begin{pmatrix} a \\ b \\ c \end{pmatrix}
           = \begin{pmatrix} \sin(\psi)\sin(\theta) \\ \cos(\psi)\sin(\theta) \\ \cos(\theta) \end{pmatrix} \qquad (1)
$$

Given a ground-plane $\mathbf{n} \cdot \mathbf{X} = d$ for some point $\mathbf{X} = (X^\tau_t, Y^\tau_t, Z^\tau_t)$ in the world and some $\alpha$, equations (2) to (4) give the back-projection of an image point $(x^\tau_t, y^\tau_t)$ onto the point $(X^\tau_t, Y^\tau_t, Z^\tau_t)$ on the ground plane at time $t$, where $\alpha$ is the negative reciprocal of the focal length $f$ and $d$ is the shortest distance between the camera and the ground-plane.

$$
X^\tau_t = \alpha x^\tau_t Z^\tau_t \qquad (2)
$$
$$
Y^\tau_t = \alpha y^\tau_t Z^\tau_t \qquad (3)
$$
$$
Z^\tau_t = \frac{d}{\alpha a x^\tau_t + \alpha b y^\tau_t + c} \qquad (4)
$$
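As a concrete illustration, the back-projection of equations (2) to (4) can be sketched as follows. This is a minimal sketch, not the authors' implementation; the function names are our own.

```python
import numpy as np

def plane_normal(theta, psi):
    """Unit normal of the ground-plane from elevation theta and yaw psi, eq. (1)."""
    return np.array([np.sin(psi) * np.sin(theta),
                     np.cos(psi) * np.sin(theta),
                     np.cos(theta)])

def back_project(x, y, theta, psi, alpha, d=1.0):
    """Back-project image points (x, y) onto the ground-plane, eqs. (2)-(4).

    alpha is the negative reciprocal of the focal length; d is the shortest
    camera-to-plane distance (later set to 1 to reconstruct up to scale).
    """
    a, b, c = plane_normal(theta, psi)
    Z = d / (alpha * a * x + alpha * b * y + c)  # eq. (4)
    X = alpha * x * Z                            # eq. (2)
    Y = alpha * y * Z                            # eq. (3)
    return X, Y, Z
```

By construction, every back-projected point satisfies $\mathbf{n} \cdot \mathbf{X} = d$, which is a convenient sanity check on the implementation.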

2.1 Speed as a Measuring Stick

An object's observed speed varies with perspective, as do its other properties such as height [10]. Given a number of partial trajectories of constant speed, we show that it is possible to obtain reasonable estimates of the ground-plane parameters. Assuming a tracked object's real-world speed is constant, we can think of the trajectory as a set of piecewise linear segments, each with length given by the 3D Euclidean distance formula. We can use this distance as the "measuring-stick" from which to estimate the ground-plane.

Firstly, we make some simplifying assumptions. The height of the camera with respect to the ground-plane, $d$, acts primarily as a scaling parameter. As we only aim to reconstruct up to scale, we set this to 1 in (4), simplifying further calculation. We also observe that in the majority of scenes the camera height is substantial compared to the height variations of the tracked feature points; in Section 3 we offer results showing that tracked feature height variation does not significantly affect the estimate of ground-plane orientation. Finally, we assume that motion variation due to the articulation of our moving agents is negligible.

Substituting (2-4) into the distance formula, we obtain (5). This relates the known 2D projected positions of a feature point at consecutive intervals $t-1$ and $t$, their 3D distance in the camera frame and the plane parameters. Hereafter, we use $L^\tau$ to represent the set of distance measurements at all time-intervals for some trajectory $\tau$. Each element of this set is defined as follows:

$$
(L^\tau_t)^2 = \alpha^2 \left( \frac{x^\tau_t}{\gamma^\tau_t} - \frac{x^\tau_{t-1}}{\gamma^\tau_{t-1}} \right)^2
             + \alpha^2 \left( \frac{y^\tau_t}{\gamma^\tau_t} - \frac{y^\tau_{t-1}}{\gamma^\tau_{t-1}} \right)^2
             + \left( \frac{1}{\gamma^\tau_t} - \frac{1}{\gamma^\tau_{t-1}} \right)^2 \qquad (5)
$$

where
$$
\gamma^\tau_t = \alpha x^\tau_t a + \alpha y^\tau_t b + c
$$

For a single trajectory $\tau$, we define the set of distances at all time intervals:
$$
L^\tau = \{ L^\tau_t \}_{t=2}^{N_\tau} \qquad (6)
$$


We denote the mean and standard deviation of this set by $\mu(L^\tau)$ and $\sigma(L^\tau)$ for short. We now have a relationship between the known image coordinates and the unknown camera coordinates, with respect to some constant distance along points in a trajectory. We use this to measure how well a given set of parameters $\theta$, $\psi$ and $\alpha$ fits the observed data. If a feature point is moving at constant speed and we have a good set of parameters, $\sigma(L^\tau)$ should be close to zero. Conversely, we would expect a poor parameterization to give a high spread in $L^\tau$.

Since we do not wish to impose the constraint that all objects must move at constant speed, we normalize $\sigma(L^\tau)$ by $\mu(L^\tau)$, giving us a speed-invariant measure of correctness. We pose this correctness measure in terms of a minimization over the sum of squared errors for each trajectory, as shown in equation (7).

$$
E_1 = \sum_{\tau \in T} \left( \frac{\sigma(L^\tau)}{\mu(L^\tau)} \right)^2 \qquad (7)
$$
$$
E_2 = \sigma\left( \{ \mu(L^\tau) \}_{\tau \in T} \right) \qquad (8)
$$

As we expect tracked feature points to have originated from a set of homogeneous objects (e.g. pedestrians), we can reasonably assume that they should all move at similar (although not identical) speeds. As such, we include an additional term to constrain the system, $E_2$ in equation (8). Here we take the standard deviation over the set of all mean speeds, which penalises a high spread of speeds, thereby preventing some of the least plausible configurations.
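The two error terms can be computed directly from the image-space trajectories via equations (2)-(5) with $d = 1$. The sketch below is illustrative rather than the authors' code; the function names are our own, and `lam` anticipates the weighting $\lambda$ used when the terms are combined in Section 2.2.

```python
import numpy as np

def segment_lengths(x, y, theta, psi, alpha):
    """3D segment lengths L_t of one image-space trajectory, eqs. (2)-(5), d = 1."""
    a = np.sin(psi) * np.sin(theta)
    b = np.cos(psi) * np.sin(theta)
    c = np.cos(theta)
    gamma = alpha * a * x + alpha * b * y + c          # gamma_t from eq. (5)
    X, Y, Z = alpha * x / gamma, alpha * y / gamma, 1.0 / gamma
    return np.sqrt(np.diff(X) ** 2 + np.diff(Y) ** 2 + np.diff(Z) ** 2)

def total_error(trajectories, theta, psi, alpha, lam=1.0):
    """E = E1 + lambda * E2 over a set of (x, y) trajectories, eqs. (7)-(8)."""
    means = []
    e1 = 0.0
    for x, y in trajectories:
        L = segment_lengths(x, y, theta, psi, alpha)
        means.append(L.mean())
        e1 += (L.std() / L.mean()) ** 2                # speed-invariant spread, eq. (7)
    e2 = np.std(means)                                 # spread of mean speeds, eq. (8)
    return e1 + lam * e2
```

For trajectories that genuinely move at constant world speed on the true plane, both terms vanish; a wrong parameterization inflates the spread of reconstructed speeds.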

2.2 Minimizing the Error

We solve for $\theta$, $\psi$ and $\alpha$ by minimizing the error $E = E_1 + \lambda E_2$ over all input trajectories. We experimented with non-linear optimization algorithms, but due to the irregularity of the problem-space away from the vicinity of the true value, gradient descent methods tend to fall into local minima. As such, we fall back on a multi-resolution global search to find the correct region. At the first level we search all combinations of $\alpha$, $\theta$ and $\psi$ on a coarse mesh: increments of $15°$ over the feasible ranges of $\theta$ and $\psi$ ($0°$ to $90°$ and $-45°$ to $45°$ respectively), and for $\alpha$ an exponential search in the range $10^{-3}$ to $10^{0}$ to find its scale.

We then take the point with minimum error from this search and produce a finer grid around it, now searching $\alpha$ linearly and reducing the step size for $\theta$ and $\psi$ to 10% of their previous value. We repeat this procedure until either the lowest error point is below a given tolerance (empirically, $10^{-5}$ is sufficient) or we reach the maximum level allowed for the search. We have observed 3 levels to be sufficient for an accurate estimation on simulated data with some noise.
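The coarse-to-fine procedure can be sketched as below. This is a plausible reading of the description, not the authors' code: `error_fn` is assumed to take `(trajectories, theta, psi, alpha)` (e.g. the combined error of Section 2.1), and the refinement window widths are our own illustrative choices.

```python
import numpy as np

def grid_search(trajectories, error_fn, levels=3, tol=1e-5):
    """Multi-resolution global search over (theta, psi, alpha), as in Sect. 2.2."""
    d_ang = np.radians(15.0)                       # coarse 15-degree mesh
    thetas = np.arange(0.0, np.radians(90.0) + 1e-9, d_ang)
    psis = np.arange(np.radians(-45.0), np.radians(45.0) + 1e-9, d_ang)
    alphas = 10.0 ** np.arange(-3.0, 0.5, 0.5)     # exponential sweep to find the scale
    best = (np.inf, 0.0, 0.0, alphas[0])
    for level in range(levels):
        for th in thetas:
            for ps in psis:
                for al in alphas:
                    e = error_fn(trajectories, th, ps, al)
                    if e < best[0]:
                        best = (e, th, ps, al)
        if best[0] < tol:                          # empirically sufficient tolerance
            break
        # Refine around the current best: the angular step drops to 10% of its
        # previous value and alpha is now searched linearly.
        _, th, ps, al = best
        d_ang *= 0.1
        thetas = th + np.arange(-5, 6) * d_ang
        psis = ps + np.arange(-5, 6) * d_ang
        alphas = al * np.linspace(0.5, 1.5, 11)
    return best[1], best[2], best[3]               # theta, psi, alpha
```

The exhaustive first level guards against the local minima that defeat gradient descent here; only once the correct basin is found do the finer grids take over.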

3 Experiments on Simulated Data

To prove the initial viability of this method, we first describe a number of experiments on simulated data. This allows us to examine how various types of noise and violations of the initial premise affect the accuracy of the estimation. Core sources of error are likely to be:

Fig. 1. An example of our simulated trajectories (a) and the results of our noise experiments (b)-(d), in terms of average angular error (dot product between ground-truth and estimated plane-normals): error against noise level for (b) inter-feature speed variation, (c) intra-feature speed variation and (d) intra-feature height variation.

1. Variation in inter-trajectory speed; that is, agents move at different speeds.
2. Variation in intra-trajectory speed; an agent varies its speed whilst moving.
3. Variation in tracked point height; i.e. some trajectories recorded on feet, some on shoulders, etc.

We generate simulated trajectories on a number of planes with parameterized noise (Fig. 1a), allowing the speed and height of trajectories to vary according to Gaussian distributions. This lets us examine the effect of the above error sources on reconstruction accuracy. We assess accuracy by rectifying the image-plane trajectories with both the ground-truth parameters and our estimates, then comparing the spread of normalized speeds of each. Given a perfect reconstruction we see zero error, and all measurements are relative to a mean speed of 1.
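A generator of this kind of simulated data might look as follows. This is our own illustrative sketch, not the authors' simulator: the plane offset, step size and agent counts are arbitrary choices, and it assumes $\theta > 0$ so that an in-plane axis can be constructed.

```python
import numpy as np

def simulate_trajectories(theta, psi, n_agents=20, n_steps=30,
                          speed_sigma=0.1, height_sigma=0.0, seed=0):
    """Constant-speed agents on the plane n.X = 1 (theta > 0), with Gaussian
    variation in per-agent speed and, optionally, tracked-point height."""
    rng = np.random.default_rng(seed)
    n = np.array([np.sin(psi) * np.sin(theta),
                  np.cos(psi) * np.sin(theta),
                  np.cos(theta)])
    u = np.array([n[1], -n[0], 0.0])
    u /= np.linalg.norm(u)                 # first in-plane axis
    v = np.cross(n, u)                     # second in-plane axis
    trajs = []
    for _ in range(n_agents):
        speed = 0.05 * rng.normal(1.0, speed_sigma)       # per-agent speed
        height = rng.normal(0.0, height_sigma)            # tracked-point height offset
        start = n + rng.uniform(-0.3, 0.3) * u + rng.uniform(-0.3, 0.3) * v
        heading = rng.uniform(0.0, 2.0 * np.pi)
        step = speed * (np.cos(heading) * u + np.sin(heading) * v)
        P = start + height * n + np.outer(np.arange(n_steps), step)
        trajs.append(P)
    return trajs
```

With the noise parameters at zero, every trajectory lies exactly on the plane and every frame-to-frame step has identical length, matching the idealised assumptions of Section 2.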

We perform three sets of experiments on a number of different planes across the feasible range, varying the potential source of error in each in terms of the standard deviation, between 0 and 1, of a Gaussian distribution with mean 1. Examining the above issues in order, we first investigate the effect of the different agents in the scene moving at different speeds. Agents' initial speeds are chosen randomly from the distribution and remain constant throughout the simulation. From Fig. 1b we observe that even in extreme cases the average error stays low: below 10% of the mean speed. We expect the effect of intra-trajectory speed variation to be more pronounced, as it is the defining metric used to recover the parameters. Here we draw the speed of a feature at each frame from the distribution. Fig. 1c shows that although we experience more noise than with inter-trajectory speed variation, it is not so pronounced as to seriously damage our result. Finally, we see that height variation has negligible effect on the accuracy of our solution; data was generated using a feasible camera height of 10m, so the difference in point height is relatively small.

Points tracked at different heights with a low-positioned camera will be affected more strongly, but in most real-world scenes the distances between the lowest and highest tracked points are negligible with respect to the camera height. Our experiments on simulated data show that, over realistic ranges, the three potential sources of error identified above have negligible detrimental effect on the accuracy of estimation. Of particular importance is the intra-trajectory speed variation, which is a violation of our assumptions and yet still does not have an especially pronounced effect on the quality of our estimation.

Fig. 2. Example stills from the PETS2009 (a) and students003 (b) video datasets

4 Experiments on Video Data

We gather trajectories using the tracker and split them at sharp spikes in speed, which are easily identifiable violations of the constant-speed assumption. We are typically left with considerably more trajectories than is necessary (or tractable) to process. Indeed, many of these are extremely short: only 2 or 3 frames. As such, we filter out trajectories shorter than 4 frames, as these provide us with no additional information towards our unknowns.
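The splitting and filtering step can be sketched as below. The 4-frame minimum is from the text; the spike threshold (a step more than a fixed multiple of the median step) is our own illustrative choice, as the paper does not specify one.

```python
import numpy as np

def split_and_filter(traj, spike_factor=3.0, min_len=4):
    """Split an image-space trajectory (N x 2 array) at sharp speed spikes,
    then drop pieces shorter than min_len frames.

    spike_factor is an assumed threshold: a frame-to-frame step more than
    spike_factor times the median step is treated as a spike.
    """
    steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)   # per-frame speeds
    cuts = np.flatnonzero(steps > spike_factor * (np.median(steps) + 1e-12)) + 1
    pieces = np.split(traj, cuts)
    return [p for p in pieces if len(p) >= min_len]
```

Splitting at the spike, rather than discarding the whole track, preserves the usable constant-speed segments on either side of a tracking failure.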

When tracking pedestrians, we observe two key differences from the trajectories produced for simulation. Firstly, we commonly track several points per person; secondly, pedestrians tend to behave in a more "human" manner than in our simulated data, travelling in groups. As such, we observe that many of our trajectories are extremely similar and their inclusion adds little information, but slows the processing down considerably. Therefore we first align trajectories to each other using the Hungarian algorithm [14], then cluster the trajectories based on their distance and shape similarity using Affinity Propagation [15], taking the resulting clusters as input to our ground-plane estimation system.

The majority of our results are given against the PETS2009 dataset, specifically videos taken from View001 (shown in Fig. 2a) and View002. Since these come with full intrinsic and extrinsic calibration, we can directly compare rotation angles. We also examine one other video dataset: "students003" (Fig. 2b), from the University of Cyprus. This is provided only with an image to ground-plane homography and so does not allow for direct rotation comparison. We therefore use the homography to determine the 2D feature coordinates on the ground-plane and compare the speed ratios for each trajectory. We compare with the method in [12] (see Section 1), although we exchanged the blob tracker for our KLT approach, as the former provided insufficient tracks for reliable estimation when applied to our data.


Table 1. Results for plane estimation on videos from the PETS2009 dataset

Dataset Subset    View  Time Index  θ Error (degrees)  ψ Error (degrees)
S0 Regular Flow   001   14-06       +7.2               -0.4
S1 L1             001   13-59       +1.1               +11.7
S1 L2             001   14-06       +7.5               -0.5
S1 L1             002   13-57       +0.1               -9.9
S1 L2             002   14-31       +1.9               -4.4

Fig. 3. Comparison of trajectory speeds rectified using the ground-truth parameters (black, dashed), our estimated parameters (red, solid) and Bose (green, dashed), plotted as normalised speed against frame number. Examples are the longest trajectories from (a) "students003" and (b) PETS2009 Regular Flow. We see that even in trajectories with some tracking error, we obtain a sensible result, generally better than Bose.

Table 1 shows the orientation error for several scenes, in terms of the direct difference for the two rotation components, θ and ψ. There is an area of high reflectance in View002 which interrupts tracking of most individuals; despite this, the majority of our estimates are within 10° of the ground-truth values, sufficient for approximate correction of trajectory speeds. Fig. 3 shows some example comparisons of normalized trajectory speeds from the "students003" and PETS2009 Regular Flow datasets, rectified using the provided homography/calibration, our estimation and that of [12]. We see that our matching is generally very close, even on trajectories with some tracking error, whereas the method of Bose and Grimson performs poorly. We put this down to the flexibility of our approach in minimising spread, rather than imposing a strict constant-speed assumption.

5 Conclusions and Further Work

This paper has considered the problem of reconstructing 3D geometry from 2D observations taken from videos of pedestrians captured with a single uncalibrated camera. Our method differs from previous techniques in that it requires no knowledge of scene geometry or of a fixed-size object, needing only the motion of individuals. We have provided evidence on simulations for the validity of our method and the assumptions held within. We have then shown results on the PETS2009 dataset which illustrate the success of the method in a number of cases, and have given a qualitative comparison for another. In continuation of this work, we plan to account for variations in trajectory height, to allow for tracking individuals on different parts of their bodies. We then intend to extend the method into the multi-planar domain, such that planes can be estimated and their boundaries drawn, to model real-world scenes more accurately and realistically.


References

1. Liebowitz, D., Zisserman, A.: Metric rectification for perspective images of planes. In: Proceedings CVPR 1998, pp. 482-488. IEEE Computer Society (1998)
2. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press (2004)
3. Criminisi, A., Reid, I.D., Zisserman, A.: Single view metrology. International Journal of Computer Vision 40(2), 123-148 (2000)
4. Coughlan, J.M., Yuille, A.L.: Manhattan world: Compass direction from a single image by Bayesian inference. In: Proceedings ICCV 1999, pp. 941-947 (1999)
5. Pflugfelder, R., Bischof, H.: Online auto-calibration in man-made worlds. In: Proceedings DICTA 2005, pp. 519-526 (2005)
6. Zhang, Z., Li, M., Huang, K., Tan, T.: Robust automated ground plane rectification based on moving vehicles for traffic scene surveillance. In: 2008 15th IEEE ICIP, pp. 1364-1367. IEEE (2008)
7. Guo, F., Chellappa, R.: Video mensuration using a stationary camera. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 164-176. Springer, Heidelberg (2006)
8. Lv, F., Zhao, T., Nevatia, R.: Self-calibration of a camera from video of a walking human. In: Proceedings Pattern Recognition, vol. 1, pp. 562-567. IEEE Computer Society, Los Alamitos (2002)
9. Micusik, B., Pajdla, T.: Simultaneous surveillance camera calibration and foot-head homology estimation from human detections. In: CVPR 2010, pp. 1562-1569 (2010)
10. Stauffer, C., Tieu, K., Lee, L.: Robust automated planar normalization of tracking data. In: Proceedings IEEE Workshop on VS-PETS (2003)
11. Krahnstoever, N., Mendonça, P.R.S.: Autocalibration from tracks of walking people. In: British Machine Vision Conference, pp. 107-116 (2006)
12. Bose, B., Grimson, E.: Ground plane rectification by tracking moving objects. In: Proceedings IEEE Workshop on VS-PETS (2003)
13. Shi, J., Tomasi, C.: Good features to track. In: Proceedings CVPR, pp. 593-600 (1994)
14. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics 2(1-2), 83-97 (1955)
15. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science (2007)