Automated Ground-Plane Estimation
for Trajectory Rectification
Ian Hales, David Hogg, Kia Ng, and Roger Boyle
University of Leeds, Woodhouse Lane, Leeds, LS2 9JT
{i.j.hales06,d.c.hogg,k.c.ng,r.d.boyle}@leeds.ac.uk
Abstract. We present a system to determine ground-plane parameters in densely crowded scenes where the use of geometric features such as parallel lines, or reliable estimates of agent dimensions, is not possible. Using feature points tracked over short intervals, together with some plausible scene assumptions, we can estimate the parameters of the ground-plane to a sufficient degree of accuracy to correct usefully for perspective distortion. This paper describes feasibility studies conducted on controlled, simulated data, to establish how different levels and types of noise affect the accuracy of the estimation, and a verification of the approach on live data, showing that the method can estimate ground-plane parameters, thus allowing improved accuracy of trajectory analysis.
Keywords: ground-plane, trajectory, rectification, crowd-motion.
1 Introduction
In computer vision one often wishes to examine objects in terms of their size,
speed or location, accurate measurement of which is hampered by several types
of distortion that occur when the camera captures the scene. This is particularly
relevant in applications such as behaviour analysis in crowds, where such metrics
can be used to detect events or measure crowd density. By estimating the trans-
formation undergone by the coordinate system, we can invert it to counteract
the effect of such distortions and obtain a more accurate view of the world. If
we can obtain the ground-plane orientation with respect to the camera, we can
correct for one of the most prevalent sources of error – perspective distortion.
Two broad approaches are apparent within the literature: “formal methods”,
often in terms of the fundamental matrix or image homographies; and “informal
methods”, providing a coarse, but still usable, impression of the scene (e.g. a scale
ratio from front to back). A common feature is the idea of “vanishing points”,
which lie on the horizon line. Having obtained the equation of the horizon, it is
trivial to perform affine rectification [1]. Metric rectification can then be achieved
using knowledge of known lengths or angles, or equality of angles [2]. These
points can be determined using pairs of imaged parallel lines [1] and inclusion
of a vanishing point in a third direction allows for metric measurement [3].
The Manhattan Assumption [4] offers a viable framework to obtain additional
vanishing points from the background in man-made scenes [5]. Alternatively,
known angles in the scene [6] or known reference lengths [7] can produce similar
results. In pedestrian scenes, measurements (e.g. projected foot-head heights
[8,9]) provide a valid reference length. Less formal methods tend to rely on the
estimation of a single ground-plane. It is again common to use projected foot-
head heights [10] and accuracy can be improved by tracking individuals, taking
relative height measurements for each [11], which minimizes potential variations
in reference length.
We base our work within the context of pedestrian crowd analysis. The meth-
ods discussed above rely on assumptions unlikely to hold in a densely crowded
scene due to inter-occlusion and the inability to see geometric features in the
background. Stauffer et al. [10] mention an alternative approach using the as-
sumption of constant speed of moving objects, built upon by Bose and Grimson
[12], who use a blob tracker to generate trajectories. They then obtain the vanish-
ing line by assuming constant speed, before using inter-trajectory speed ratios
to obtain metric rectification. However, maintained tracking was necessary to
achieve metric rectification, making this approach infeasible in our domain.
We propose to use the local speed of tracked feature points as a calibration
measurement, along with the assumption of constant speed, to reconstruct the
ground-plane parameters. We do not require prolonged tracking of features pro-
vided we observe pedestrian motion throughout the scene. The remainder of this
paper will describe our method, prove its validity on simulated data and assess
its accuracy on real-world benchmark data.
2 Ground-Plane Estimation
Above, we saw that it is possible to reconstruct the 3D scene using information
from measuring objects of known real-world size at various positions within the
image. In this section we show that it is equally possible to use measurements of
object speed to reconstruct the plane upon which agents are moving. Throughout
this work we assume that all objects move on a single, linear plane and that
each observed object is moving at constant (or near-constant) speed. We do not,
however, require that all objects move at the same speed.
We first track sparse features frame-by-frame for some period using the KLT
tracker [13], until we can no longer reliably continue the trajectory. Since we deal
with high-density pedestrian crowds, heavy inter-occlusion is likely to result in
many short trajectories. However, it only takes a few frames for us to gain
valuable information from the trajectory. It is plausible that objects may have
many features or no features at all assigned to them, but this is not a major
concern provided we gather information from various scene positions.
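For illustration, tracking of this kind can be sketched with OpenCV's Shi-Tomasi detector and pyramidal Lucas-Kanade tracker; this is a minimal sketch, and the parameter values below are our own choices rather than those used in our system:

```python
import cv2

def track_features(frames):
    """frames: iterable of grayscale images; returns short trajectories."""
    frames = iter(frames)
    prev = next(frames)
    # Shi-Tomasi "good features to track" [13]
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=500,
                                  qualityLevel=0.01, minDistance=5)
    tracks = [[tuple(p.ravel())] for p in pts]
    alive = [True] * len(tracks)
    for frame in frames:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, frame, pts, None)
        for i, (p, ok) in enumerate(zip(nxt, status.ravel())):
            if alive[i] and ok:
                tracks[i].append(tuple(p.ravel()))
            else:
                alive[i] = False   # trajectory can no longer be continued reliably
        pts, prev = nxt, frame
    return tracks
```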
We define a trajectory as a time series of points upon a plane recorded
at equally spaced time-steps. We observe these in the image-coordinate system
and wish to obtain their respective points within the camera coordinate system
through perspective back-projection. We define a time series $(X^\tau_1, X^\tau_2, \ldots, X^\tau_{N_\tau})$ as the $x$-coordinates of trajectory τ at times 1 to $N_\tau$, and its projection into image space as $(x^\tau_1, x^\tau_2, \ldots, x^\tau_{N_\tau})$. We represent the orientation of the ground-plane in terms of its unit normal vector, $\mathbf{n}$, given by equation (1), where θ represents the angle of elevation (rotation about the $x$-axis) and ψ the angle of yaw (rotation about the $z$-axis):

$$\mathbf{n} = \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} \sin(\psi)\sin(\theta) \\ \cos(\psi)\sin(\theta) \\ \cos(\theta) \end{pmatrix} \qquad (1)$$
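For concreteness, a minimal illustrative sketch of equation (1), assuming Python with NumPy (as for all code sketches in this paper; they are ours, not part of the original system):

```python
import numpy as np

def plane_normal(theta: float, psi: float) -> np.ndarray:
    """Unit ground-plane normal n = (a, b, c) of equation (1).

    theta is the elevation angle and psi the yaw angle, in radians.
    """
    return np.array([
        np.sin(psi) * np.sin(theta),   # a
        np.cos(psi) * np.sin(theta),   # b
        np.cos(theta),                 # c
    ])
```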
Given a ground-plane $\mathbf{n} \cdot \mathbf{X} = d$ for some point $\mathbf{X} = (X^\tau_t, Y^\tau_t, Z^\tau_t)$ in the world and some α, equations (2) to (4) show the back-projection of a point $(x^\tau_t, y^\tau_t)$ onto the point $(X^\tau_t, Y^\tau_t, Z^\tau_t)$ on the ground plane at time $t$, where α is the negative reciprocal of the focal length $f$ and $d$ is the shortest distance between the camera and the ground-plane.

$$X^\tau_t = \alpha x^\tau_t Z^\tau_t \qquad (2)$$

$$Y^\tau_t = \alpha y^\tau_t Z^\tau_t \qquad (3)$$

$$Z^\tau_t = \frac{d}{\alpha a x^\tau_t + \alpha b y^\tau_t + c} \qquad (4)$$
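A direct transcription of equations (2)-(4) as an illustrative sketch; the function name and the default d = 1 (the up-to-scale convention introduced in Sec. 2.1) are our own choices:

```python
import numpy as np

def back_project(x, y, n, alpha, d=1.0):
    """Back-project image point (x, y) onto the plane n . X = d.

    Implements equations (2)-(4); alpha = -1/f, and d = 1 recovers
    the scene up to scale.
    """
    a, b, c = n
    Z = d / (alpha * a * x + alpha * b * y + c)   # equation (4)
    return np.array([alpha * x * Z,               # equation (2)
                     alpha * y * Z,               # equation (3)
                     Z])
```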
2.1 Speed as a Measuring Stick
An object’s observed speed varies with perspective as do its other properties
such as height [10]. Given a number of partial trajectories of constant speed,
we show that it is possible to obtain reasonable estimates for the ground-plane
parameters. Assuming a tracked object’s real-world speed is constant, we can
think of the trajectory as a set of piecewise linear segments, each with length
given by the 3D Euclidean distance formula. We can use this distance as the
“measuring-stick” from which to gain an estimation of the ground-plane.
Firstly we make some simplifying assumptions. The height of the camera
with respect to the ground-plane, d, acts primarily as a scaling parameter. As
we only aim to reconstruct to scale, we set this to 1 in (4), thus simplifying
further calculation. We also observe that in the majority of scenes, the camera
height is substantial compared to the height variations of the tracked feature
points. In section 3 we offer results showing that tracked feature height variation
does not significantly affect the estimate of ground-plane orientation. Finally, we assume that motion variation due to the articulation of our moving agents is negligible.
Substituting (2)-(4) into the distance formula, we obtain (5). This relates the known 2D projected positions of a feature point at consecutive times $t-1$ and $t$, their 3D distance in the camera coordinate system and the plane parameters. Hereafter, we use $L^\tau$ to represent the set of distance measurements at all time-intervals for some trajectory, τ. Each element of this set is defined as follows:

$$(L^\tau_t)^2 = \alpha^2\left(\frac{x^\tau_t}{\gamma^\tau_t} - \frac{x^\tau_{t-1}}{\gamma^\tau_{t-1}}\right)^2 + \alpha^2\left(\frac{y^\tau_t}{\gamma^\tau_t} - \frac{y^\tau_{t-1}}{\gamma^\tau_{t-1}}\right)^2 + \left(\frac{1}{\gamma^\tau_t} - \frac{1}{\gamma^\tau_{t-1}}\right)^2 \qquad (5)$$

where

$$\gamma^\tau_t = \alpha x^\tau_t a + \alpha y^\tau_t b + c$$

For a single trajectory τ, we define the set of distances at all time intervals:

$$\{L^\tau_t\}_{t=2}^{N_\tau} \qquad (6)$$
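Equation (5) vectorises naturally over a whole trajectory. A sketch (illustrative; assumes d = 1 and image coordinates supplied as NumPy arrays) computing the set (6) for one trajectory:

```python
import numpy as np

def segment_lengths(xs, ys, n, alpha):
    """The set {L_t} of equation (6) for one trajectory.

    xs, ys: image coordinates of the trajectory, one entry per frame.
    n: candidate plane normal (a, b, c); alpha: candidate -1/f.
    """
    a, b, c = n
    gamma = alpha * xs * a + alpha * ys * b + c    # gamma_t for every frame
    X = alpha * xs / gamma                         # equation (2) with d = 1
    Y = alpha * ys / gamma                         # equation (3) with d = 1
    Z = 1.0 / gamma                                # equation (4) with d = 1
    # 3D Euclidean distance between consecutive back-projected points, eq. (5)
    return np.sqrt(np.diff(X) ** 2 + np.diff(Y) ** 2 + np.diff(Z) ** 2)
```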
We denote the mean and standard deviation of this set as $\mu(L^\tau)$ and $\sigma(L^\tau)$ for short. We now have a relationship between the known image coordinates and the unknown camera coordinates with respect to some constant distance along points in a trajectory. We use this to measure how well a given set of parameters θ, ψ and α fits the observed data. If a feature point is moving at constant speed and we have a good set of parameters, $\sigma(L^\tau)$ should be close to zero. Conversely, we would expect a poor parameterization to give a high spread in $L^\tau$.

Since we do not wish to impose the constraint that all objects must move at constant speed, we normalize $\sigma(L^\tau)$ by $\mu(L^\tau)$, giving us a speed-invariant measure of correctness. We pose this correctness measure in terms of a minimization over the sum of squared errors for each trajectory, as shown in equation (7):

$$E_1 = \sum_{\tau \in T} \left(\frac{\sigma(L^\tau)}{\mu(L^\tau)}\right)^2 \qquad (7)$$

As we expect tracked feature points to have originated from a set of homogeneous objects (e.g. pedestrians), we can reasonably assume that they should all move with similar (although not identical) speed. As such we include an additional term to constrain the system, $E_2$, in equation (8). Here we take the standard deviation over the set of all mean speeds, which penalises a high spread of speeds, thereby preventing some of the least plausible configurations:

$$E_2 = \sigma\bigl(\{\mu(L^\tau)\}_{\tau \in T}\bigr) \qquad (8)$$
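Both terms are cheap to evaluate. A sketch combining them into a single score, building on the segment_lengths listing above; the weight lam stands in for λ (whose value is not prescribed here), and the combination E1 + λE2 is formalised in Sec. 2.2:

```python
import numpy as np

def total_error(trajectories, n, alpha, lam=1.0):
    """E = E1 + lam * E2 over trajectories given as (xs, ys) array pairs."""
    means, E1 = [], 0.0
    for xs, ys in trajectories:
        L = segment_lengths(xs, ys, n, alpha)
        E1 += (np.std(L) / np.mean(L)) ** 2     # one summand of equation (7)
        means.append(np.mean(L))
    E2 = np.std(means)                          # equation (8)
    return E1 + lam * E2
```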
2.2 Minimizing the Error
We solve for θ, ψ and α by minimizing the error $E = E_1 + \lambda E_2$ over all input trajectories. We experimented with non-linear optimization algorithms, but due to the irregularity of the problem-space away from the vicinity of the true value, gradient-descent methods tend to fall into local minima. As such, we fall back on a multi-resolution global search to find the correct region. At the first level we search all combinations of α, θ and ψ with a coarse mesh: increments of 15° over the feasible ranges of θ and ψ (0° to 90° and −45° to 45° respectively), and for α an exponential search in the range $10^{-3}$ to $10^0$ to find its scale.

We then take the point with minimum error from this search and produce a finer grid around it, now searching α linearly and reducing the step size for θ and ψ to 10% of their previous value. We repeat this procedure until either the lowest error point is below a given tolerance (empirically $10^{-5}$ is sufficient) or we reach the maximum level allowed for search. We have observed 3 levels to be sufficient for an accurate estimation on simulated data with some noise.
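A minimal sketch of this coarse-to-fine search, using plane_normal and total_error from the earlier listings; the grid widths at the refinement levels and the negative sign on α (which follows from α = −1/f) are our own illustrative choices:

```python
import numpy as np

def grid_search(trajectories, levels=3, tol=1e-5):
    """Coarse-to-fine search over (theta, psi, alpha)."""
    step = np.deg2rad(15.0)                        # coarse 15-degree mesh
    thetas = np.arange(0.0, np.deg2rad(90.0) + 1e-9, step)
    psis = np.arange(np.deg2rad(-45.0), np.deg2rad(45.0) + 1e-9, step)
    alphas = -np.logspace(-3, 0, 7)                # exponential sweep for scale
    best = (np.inf, 0.0, 0.0, 0.0)
    for _ in range(levels):
        for th in thetas:
            for ps in psis:
                for al in alphas:
                    e = total_error(trajectories, plane_normal(th, ps), al)
                    if e < best[0]:
                        best = (e, th, ps, al)
        if best[0] < tol:                          # empirically 1e-5 suffices
            break
        _, th, ps, al = best
        step *= 0.1                                # 10% of the previous step
        thetas = th + step * np.arange(-5, 6)
        psis = ps + step * np.arange(-5, 6)
        alphas = al + 0.1 * abs(al) * np.arange(-5, 6)  # now linear in alpha
    return best                                    # (error, theta, psi, alpha)
```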
3 Experiments on Simulated Data
To prove the initial viability of this method we first describe a number of exper-
iments on simulated data. This allows us to examine how various types of noise
and violations of the initial premise affect the accuracy of the estimation. Core sources of error are likely to be:
1. Variation in inter-trajectory speed; that is, agents move at different speeds.
2. Variation in intra-trajectory speed; an agent varies its speed whilst moving.
3. Variation in tracked point height; i.e. some trajectories recorded on feet, some on shoulders, etc.

Fig. 1. An example of our simulated trajectories (a) and the results of our noise experiments (b)-(d), plotting error against noise level for inter-feature speed variation (b), intra-feature speed variation (c) and intra-feature height variation (d). Error is the average angular error (dot product between ground-truth and estimated plane-normals).
We generate simulated trajectories on a number of planes with parameterized
noise (Fig. 1a), allowing the speed and height of trajectories to vary according
to Gaussian distributions. This lets us examine the effect of the above error
sources on reconstruction accuracy. We assess accuracy by rectifying the image-plane trajectories with both the ground-truth parameters and our estimates, then comparing the spread of normalized speeds of each. Given a perfect reconstruction we see zero error, and all measurements are relative to a mean speed of 1.
We perform three sets of experiments on a number of different planes across
the feasible range, varying the potential source of error in each in terms of the
standard deviation, between 0 and 1, of a Gaussian distribution with mean 1.
Examining the above issues in order, we first investigate the effect of the different
agents in the scene moving at different speeds. Agents’ initial speeds are chosen
randomly from the distribution and remain constant throughout the simulation.
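As a concrete illustration of this setup, the sketch below generates a single constant-speed track on a known plane and projects it into the image by inverting equations (2) and (3); all parameter values are illustrative, not those used in our experiments:

```python
import numpy as np

def simulate_track(n, alpha, n_frames=30, speed_sigma=0.2, rng=None):
    """One straight, constant-speed trajectory on the plane n . X = 1."""
    rng = rng if rng is not None else np.random.default_rng()
    a, b, c = n
    speed = rng.normal(1.0, speed_sigma)          # inter-trajectory variation
    start = rng.uniform(-5.0, 5.0, size=2)
    direction = rng.normal(size=2)
    direction /= np.linalg.norm(direction)
    xs, ys = [], []
    for t in range(n_frames):
        X, Y = start + t * speed * direction      # linear world motion
        Z = (1.0 - a * X - b * Y) / c             # stay on the plane (d = 1)
        xs.append(X / (alpha * Z))                # invert equation (2)
        ys.append(Y / (alpha * Z))                # invert equation (3)
    return np.array(xs), np.array(ys)
```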
From Fig. 1b we observe that even in extreme cases the average error stays low: below 10% of the mean speed. We expect the effect of intra-trajectory speed variation to be more pronounced, as it is the defining metric used to recover the parameters. Here we draw the speed of a feature at each frame from the distribution. Fig. 1c shows that although we experience more noise than with inter-trajectory speed variation, it is not so pronounced as to seriously damage our result. We see that height variation has a negligible effect on the accuracy of our solution – data was generated using a feasible camera height of 10m, and as such the difference in point height is relatively small.
Points tracked at different heights with a low-positioned camera will be affected more strongly, but in most real-world scenes the distances between the lowest and highest tracked points are negligible with respect to the camera height. Our experiments on simulated data show that, over realistic ranges, the three potential sources of error identified above have a negligible detrimental effect on the accuracy of estimation. Of particular importance is the intra-trajectory speed variation, which violates our assumptions and yet still does not have an especially pronounced effect on the quality of our estimation.

Fig. 2. Example stills from the PETS2009 (a) and students003 (b) video datasets
4 Experiments on Video Data
We gather trajectories using the tracker and split them at sharp spikes in speed,
which are easily identifiable violations of the constant-speed assumption. We are
typically left with considerably more trajectories than are necessary (or tractable) to process. Indeed, many of these are extremely short – only 2 or 3 frames. As such we filter out trajectories shorter than 4 frames, as these provide us with no additional information towards our unknowns.
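An illustrative version of this pre-processing, assuming each trajectory is an (N, 2) NumPy array of image coordinates; the spike threshold is our own assumption:

```python
import numpy as np

def split_and_filter(track, spike_ratio=3.0, min_len=4):
    """Split a track at sharp speed spikes and drop pieces under min_len."""
    speeds = np.linalg.norm(np.diff(track, axis=0), axis=1)
    median = np.median(speeds) + 1e-9
    cuts = np.where(speeds > spike_ratio * median)[0] + 1   # spike frames
    return [p for p in np.split(track, cuts) if len(p) >= min_len]
```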
When tracking pedestrians, we observe two key differences to the trajectories
produced for simulation. Firstly, we commonly track several points per person;
secondly, pedestrians tend to behave in a more “human” manner than in our
simulated data – travelling in groups. As such we observe that many of our
trajectories are extremely similar and their inclusion adds little information, but
slows the processing down considerably. Therefore we first align trajectories to
each other using the Hungarian algorithm [14], then cluster the trajectories based
on their distance and shape similarity using Affinity Propagation [15], taking the
resulting clusters as input to our ground-plane estimation system.
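A rough sketch of this step, using SciPy's Hungarian solver (linear_sum_assignment) and scikit-learn's AffinityPropagation; the similarity measure shown (mean assigned point-to-point distance, negated) is a simplified stand-in for the full distance-and-shape similarity described above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import AffinityPropagation

def cluster_tracks(tracks):
    """Cluster similar trajectories; return one representative per cluster."""
    m = len(tracks)
    sim = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            cost = cdist(tracks[i], tracks[j])     # point-to-point distances
            r, c = linear_sum_assignment(cost)     # Hungarian alignment [14]
            sim[i, j] = sim[j, i] = -cost[r, c].mean()
    labels = AffinityPropagation(affinity="precomputed").fit(sim).labels_
    return [tracks[np.flatnonzero(labels == k)[0]] for k in np.unique(labels)]
```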
The majority of our results are given against the PETS2009 dataset, specifically videos taken from View001 (shown in Fig. 2a) and View002. Since these come with full intrinsic and extrinsic calibration, we can directly compare rotation angles. We also examine one other video dataset: “students003” (Fig. 2b), from the University of Cyprus. This is provided only with an image-to-ground-plane homography and so does not allow for direct rotation comparison. We therefore use the homography to determine the 2D feature coordinates on the ground-plane and compare the speed ratios for each trajectory.
method in [12] (see section 1) although we exchanged the blob tracker for our
KLT approach as the former provided insufficient tracks for reliable estimation
when applied to our data.
Table 1. Results for plane estimation on videos from the PETS2009 dataset

Dataset  Subset        View  Time Index  θ Error (degrees)  ψ Error (degrees)
S0       Regular Flow  001   14-06       +7.2               -0.4
S1       L1            001   13-59       +1.1               +11.7
S1       L2            001   14-06       +7.5               -0.5
S1       L1            002   13-57       +0.1               -9.9
S1       L2            002   14-31       +1.9               -4.4
Fig. 3. Comparison of trajectory speeds rectified using the ground-truth (black, dashed) and estimated parameters (red, solid), and Bose (green, dashed). Each panel plots normalised speed against frame number. Examples are the longest trajectories from (a) “students003” and (b) PETS2009 Regular Flow. We see that even in trajectories with some tracking error, we obtain a sensible result, generally better than Bose.
Table 1 shows the orientation error for several scenes in terms of the direct difference for the two rotation components, θ and ψ. There is an area of high reflectance in View002 which interrupts the tracking of most individuals; despite this, the majority of our estimates are within 10° of the ground-truth values – sufficient for approximate correction of trajectory speeds. Fig. 3 shows some example comparisons of normalized trajectory speeds from the “students003” and PETS2009 Regular Flow datasets, rectified using the provided homography/calibration, our estimation and that of [12]. We see that our matching is generally very close, even on trajectories with some tracking error, whereas the method of Bose and Grimson performs poorly. We put this down to the flexibility of our approach in minimising spread, rather than enforcing a strict constant-speed assumption.
5 Conclusions and Further Work
This paper has considered the problem of reconstructing 3D geometry from 2D observations of pedestrians taken with a single uncalibrated camera. Our method differs from previous techniques in that it requires no knowledge of scene geometry or of a fixed-size object, needing only the motion of individuals. We have provided evidence on simulations for the validity of our method and the assumptions it makes. We have then shown results on the PETS2009
dataset which illustrate the success of the method in a number of cases and have
given a qualitative comparison for another. In continuation of this work, we plan
to account for variations in trajectory height to allow for tracking individuals
on different parts of their bodies. We then intend to extend the method into the
multi-planar domain, such that planes can be estimated and their boundaries
drawn to more accurately and realistically model real-world scenes.
References
1. Liebowitz, D., Zisserman, A.: Metric rectification for perspective images of planes.
In: Proceedings CVPR 1998, pp. 482–488. IEEE Comput. Soc. (1998)
2. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn.
Cambridge University Press (2004)
3. Criminisi, A., Reid, I.D., Zisserman, A.: Single view metrology. International Jour-
nal of Computer Vision 40(2), 123–148 (2000)
4. Coughlan, J.M., Yuille, A.L.: Manhattan world: Compass direction from a single
image by bayesian inference. In: Proceedings ICCV 1999, pp. 941–947 (1999)
5. Pflugfelder, R., Bischof, H.: Online auto-calibration in man-made worlds. In: Pro-
ceedings DICTA 2005, pp. 519–526 (2005)
6. Zhang, Z., Li, M., Huang, K., Tan, T.: Robust automated ground plane rectification
based on moving vehicles for traffic scene surveillance. In: 2008 15th IEEE ICIP,
pp. 1364–1367. IEEE (2008)
7. Guo, F., Chellappa, R.: Video mensuration using a stationary camera. In: Leonardis,
A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 164–176. Springer,
Heidelberg (2006)
8. Lv, F., Zhao, T., Nevatia, R.: Self-Calibration of a Camera from Video of a Walking
Human. In: Proceedings Pattern Recognition, vol. 1, pp. 562–567. IEEE Computer
Society, Los Alamitos (2002)
9. Micusik, B., Pajdla, T.: Simultaneous surveillance camera calibration and foot-
head homology estimation from human detections. In: CVPR 2010, pp. 1562–1569
(2010)
10. Stauffer, C., Tieu, K., Lee, L.: Robust Automated Planar Normalization of Track-
ing Data. In: Proceedings IEEE Workshop on VS PETS (2003)
11. Krahnstoever, N., Mendonça, P.R.S.: Autocalibration from Tracks of Walking People. In: British Machine Vision Conference, pp. 107–116 (2006)
12. Bose, B., Grimson, E.: Ground plane rectification by tracking moving objects. In:
Proceedings IEEE Workshop on VS PETS (2003)
13. Shi, J., Tomasi, C.: Good features to track. In: Proceedings CVPR, pp. 593–600
(1994)
14. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83–97 (1955)
15. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315(5814), 972–976 (2007)