
Object Tracking in the Presence of Occlusions via a

Camera Network

Ali Ozer Ercan

Department of Electrical

Engineering

Stanford University

Stanford, CA USA 94305

aliercan@stanford.edu

Abbas El Gamal

Department of Electrical

Engineering

Stanford University

Stanford, CA USA 94305

abbas@ee.stanford.edu

Leonidas J. Guibas

Department of Computer

Science

Stanford University

Stanford, CA USA 94305

guibas@cs.stanford.edu

ABSTRACT

This paper describes a sensor network approach to tracking a sin-

gle object in the presence of static and moving occluders using a

network of cameras. To conserve communication bandwidth and

energy, each camera first performs simple local processing to re-

duce each frame to a scan line. This information is then sent to

a cluster head to track a point object. We assume the locations of

the static occluders to be known, but only prior statistics on the

positions of the moving occluders are available. A noisy perspec-

tive camera measurement model is presented, where occlusions are

captured through an occlusion indicator function. An auxiliary particle

filter that incorporates the occluder information is used to track

the object. Using simulations, we investigate (i) the dependency of

the tracker performance on the accuracy of the moving occluder

priors, (ii) the tradeoff between the number of cameras and the

occluder prior accuracy required to achieve a prescribed tracker

performance, and (iii) the importance of having occluder priors to

the tracker performance as the number of occluders increases. We

generally find that computing moving occluder priors may not be

worthwhile, unless they can be obtained cheaply and to a reasonable

accuracy. Preliminary experimental results are provided.

Categories and Subject Descriptors

G.3 [Probability and Statistics]: Probabilistic Algorithms; I.4.9

[Image Processing and Computer Vision]: Applications.

General Terms

Algorithms.

Keywords

Tracking, occlusion, auxiliary particle filter, wireless sensor net-

work, camera network, noisy perspective camera model.

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that copies

bear this notice and the full citation on the first page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior specific

permission and/or a fee.

IPSN’07, April 25-27, 2007, Cambridge, Massachusetts, USA.

Copyright 2007 ACM 978-1-59593-638-7/07/0004 ...$5.00.

1. INTRODUCTION

There is a growing need to develop low cost wireless networks of

cameras with automated detection capabilities [1]. The main chal-

lenge in building such networks is the high data rate of video cam-

eras. On the one hand sending all the data, even after performing

standard compression, is very costly in transmission energy, and on

the other, performing sophisticated vision processing at each node

to substantially reduce transmission rate requires high processing

energy. To address these challenges, a task-driven approach, in

which simple local processing is performed at each node to extract

the essential information needed for the network to collaboratively

perform the task, has been proposed and demonstrated [2,3].

In this paper, we adopt such a task-driven approach for tracking

a single object (e.g., a suspect) in a structured environment (e.g.,

an airport or a mall) in the presence of static and moving occluders

using a wireless camera network. Most previous work on tracking

with multiple cameras has focused on tracking all the objects and

does not deal directly with static occluders, which are often present

in structured environments (see brief survey in Section 2). Track-

ing all the objects clearly provides a solution to our problem, but

may be infeasible to implement in a wireless camera network due

to its high computational cost. Instead, our approach is to track

only the target object treating all other objects as occluders. We as-

sume complete knowledge of static occluder (e.g., partitions, large

pieces of furniture) locations and some prior statistics on the po-

sitions of the moving occluders (e.g., people) which are updated

in time. Simple local processing whereby each image is reduced

to a horizontal scan line is performed at each camera node. If the

camera sees the object, it provides a measurement of its position in

the scan line to the cluster head, otherwise it reports that it cannot

see the object. A noisy perspective camera measurement model is

assumed, where occlusions are captured through an occlusion indi-

cator function. Given the camera measurements and the occluder

position priors, an auxiliary particle filter (PF) [4] is used at the

cluster head to track the object. The occluder information is in-

corporated into the measurement likelihood, which is used in the

weighting of the particles.

Even if one wishes to track only one object treating other mov-

ing objects as occluders, a certain amount of information about the

positions of the occluders may be needed to achieve high tracking

accuracy. Since obtaining more accurate occluder priors would re-

quire expending more processing and/or communication energy, it

is important to understand the tradeoff between the accuracy of the

occluder information and that of tracking. Do we need any prior

occluder information? If so, how much accuracy is sufficient? A

goal of this paper is to investigate this important tradeoff.

We develop a measure of the moving occluder prior accuracy


and use simulations to explore the dependency of the tracker per-

formance on this measure. We also explore the tradeoff between

the number of cameras used, the number of occluders present, and

the amount of occluder prior information needed to achieve a pre-

scribed tracker performance. We generally find that:

• Obtaining moving occluder prior information may not be

worthwhile in practice, unless it can be obtained cheaply and

to a reasonable accuracy.

• There is a tradeoff between the number of cameras used and

the amount of occluder prior information needed. As more

cameras are used, the accuracy of the prior information needed

decreases. Having more cameras, however, means incurring

higher communications and processing cost. So, in the de-

sign of a tracking system, one needs to compare the cost of

deploying more cameras to that of obtaining more accurate

occluder priors.

• The amount of prior occluder position information needed

depends on the number of occluders present. When there are

very few moving occluders, prior information does not help

(because the object is not occluded most of the time). When

there is a moderate number of occluders, prior information

becomes more useful. However, when there are too many occluders,

prior information becomes less useful (because the

object becomes occluded most of the time).

It is important to note that these conclusions are based only on our

simulation setting, and that additional explorations by simulation

and experiments are needed to validate them.

The rest of the paper is organized as follows. A brief survey of

previous work on tracking using multiple cameras is presented in

the next section. In Section 3, we describe the setup of our track-

ing problem and introduce the camera measurement model used.

The tracker is described in Section 4. Simulation and experimental

results are presented in Sections 5 and 6, respectively.

2. PREVIOUS WORK

Tracking has been a popular topic in sensor network research

(e.g., [5–11]). Most of this work assumes low data rate range

sensors. By comparison, our work assumes cameras, which are

bearing sensors and have high data rate. The most related work

to ours is [10] and [11]. Pahalawatta et al. [10] use a camera network

to track and classify multiple objects on the ground plane.

This is done by detecting feature points on the objects and using

a Kalman Filter (KF) for tracking. By comparison, we use a PF,

which is more suitable for non-linear camera measurements and

track only a single object treating others as occluders. Funiak et

al. [11] use a Gaussian model obtained by reparametrizing the camera

coordinates together with a KF. This method is fully distributed

and requires less computational power than PF. However, because

the main goal of the system is camera calibration and not tracking,

occlusions are not considered. Also, this work requires minimal

overlap of the camera FOVs, which is not a requirement for our

work.

Tracking has also been a very popular topic in computer vision

(e.g., [12–16]). Most of the work, however, has focused on tracking

objects in a single camera video sequence [12,13]. Tracking using

multiple camera video streams has also been considered [14–16].

Individual tracking is performed for each video stream and the ob-

jects appearing in the different streams are associated. More re-

cently, there has been work on tracking multiple objects in world

coordinates using multiple cameras [17–19]. Utsumi et al. [17]

extract feature points on the objects and use a KF to track the ob-

jects. They perform camera selection to avoid occlusions. By com-

parison, in our work occlusions are treated as part of the tracker.

Otsuka et al. [18] describe a double loop filter to track multiple

objects, where objects can occlude each other. One of the loops

is a PF that updates the states of the objects in time using the ob-

ject dynamics, the likelihood of the measurements, and the occlu-

sion hypotheses. The other loop is responsible for generating these

hypotheses and testing them using the object states generated by

the first loop, the measurements, and a number of geometric con-

straints. Although this method also performs a single object track-

ing in the presence of moving occluders, the hypothesis generation

and testing is computationally prohibitive for a sensor network im-

plementation. The work also does not consider static occlusions

that could be present in structured environments. Dockstader et

al. [19] describe a method for tracking multiple people using mul-

tiple cameras. Feature points are extracted from images locally

and corrected using the 3-D estimates of the feature point positions

that are fed back from the central processor to the local processor.

These corrected features are sent to the central processor where a

Bayesian network (BN) is employed to deduce a first estimate of

the 3-D positions of these features. A KF follows the BN to main-

tain temporal continuity. This approach requires that each object is

seen by some cameras at all times. This is not required in our ap-

proach. Also, performing motion vector computation at each node

is computationally costly in a wireless sensor network.

We would like to emphasize that our work is focused on tracking

a single object in the presence of static and moving occluders in

a wireless sensor network setting. When there are no occluders,

one could adopt a less computationally intensive approach similar

to [11]. When all the objects need to be tracked simultaneously, the

above mentioned methods ([18,19]) or a filter with joint-state for

all the objects [20] can be used.

3. SETUP, MODELS, AND ASSUMPTIONS

We consider the setup illustrated in Fig. 1 in which N cameras

are aimed roughly horizontally around a room. Although an over-

head camera would have a less occluded view than a horizontally

placed one, it generally has a more limited view of the scene and

may be impractical to deploy. Additionally, targets may be easier to

identify in a horizontal view. The cameras are assumed to be fixed

and their locations and orientations are known to some accuracy to

the cluster head. The camera network’s task is to track an object

in the presence of static occlusions and other moving objects. We

assume the object to track to be a point object. This is reasonable

because the object may be distinguished from occluders by

some specific point feature. We assume there are M other moving

objects, each modeled as a cylinder of diameter D. The position of

each object is assumed to be the center of its cylinder. From now

on, we shall refer to the object to track as the “object” and the other

moving objects as “moving occluders.”

We assume the positions and the shapes of the static occluders in

the room to be completely known in advance. This is not unreason-

able since this information can be easily provided to the network.

On the other hand, only some prior statistics of the moving oc-

cluder positions are known at each time step. In Subsection 4.4, we

discuss how these priors may be obtained.

As in [2], we assume that simple background subtraction is per-

formed locally at each camera node. We assume that the camera

nodes can distinguish between the object and the occluders. This

can be done, for example, through feature detection, e.g., [21].

Since the horizontal position of the object in each camera’s image

plane is the most relevant information to 2-D tracking, the background
subtracted images are vertically summed and thresholded
to obtain a “scan line” (see Fig. 2). Only the center of the object in
the scan line is sent to the cluster head.

[Figure 1 labels: Cam 1, Cam 2, Cam i, Cam N with angles θ1, θ2, θi, θN; object position x; particles; static occluder; moving occluder of diameter D; moving occluder priors.]

Figure 1: Illustration of the setup.
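The local processing described above (background subtraction, vertical summation, thresholding, and reporting the blob center) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `scan_line_center`, the two thresholds, and the assumption that only the tracked object appears in the foreground are all hypothetical choices made for the example.

```python
def scan_line_center(frame, background, diff_threshold=30, count_threshold=2):
    """Reduce one grayscale frame to a binary scan line and return the blob center.

    frame, background: lists of rows of pixel intensities (same size).
    Returns the mean column index of the active scan-line pixels, or None
    when nothing is detected (standing in for the "NaN" report).
    Assumption: only the tracked object contributes foreground pixels.
    """
    n_cols = len(frame[0])
    # Background subtraction: mark pixels that differ enough from the background.
    foreground = [
        [abs(p - b) > diff_threshold for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]
    # Vertical summation: count foreground pixels in each column.
    column_counts = [sum(row[c] for row in foreground) for c in range(n_cols)]
    # Thresholding: a column is active if enough of its pixels are foreground.
    active = [c for c, count in enumerate(column_counts) if count >= count_threshold]
    if not active:
        return None
    return sum(active) / len(active)
```

Only the returned scalar (or the "not seen" flag) would need to be transmitted, which is the source of the bandwidth saving discussed in the text.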

3.1 Camera Measurement Model

If a camera “sees” the object, its measurement is described by a

noisy projective camera model. If the camera cannot see the object

because of occlusions or limited FOV, it reports a “NaN” (using

MATLAB syntax) to the cluster head. Mathematically, for camera

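i = 1,...,N, we define the occlusion indicator function

$$
\eta_i \triangleq \begin{cases} 1, & \text{if camera } i \text{ sees the object} \\ 0, & \text{otherwise.} \end{cases}
$$

Note that the $\eta_i$ random variables are not in general independent
from each other. The camera measurement model including occlusions
is then defined as

$$
z_i = \begin{cases} f_i \dfrac{h_i(x)}{d_i(x)} + v_i, & \text{if } \eta_i = 1 \\ \text{NaN}, & \text{otherwise,} \end{cases} \qquad (1)
$$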

where $x$ is the position of the object, $f_i$ is the focal length of camera
$i$, and $d_i(x)$ and $h_i(x)$ are defined through Figure 3. The random
variable $v_i$ models the read noise and the errors in the camera position
and angle $\theta_i$ (see Figure 1). Assuming that these noise sources
are zero mean and uncorrelated, the variance of $v_i$ is given by

$$
\sigma_{v_i}^2 = f_i^2 \left(1 + \frac{h_i^2(x)}{d_i^2(x)}\right)^2 \sigma_\theta^2 + f_i^2 \, \frac{h_i^2(x) + d_i^2(x)}{d_i^4(x)} \, \sigma_{\text{pos}}^2 + \sigma_{\text{read}}^2, \qquad (2)
$$

where $\sigma_{\text{pos}}^2$ is the variance of the camera position and $\sigma_{\text{read}}^2$ is the
variance of the read noise (see Appendix A for the derivation of this
formula). We further assume that, given $x$, the noise terms from the different
cameras $v_1, v_2, \dots, v_N$ are independent, identically distributed
Gaussian random variables. Note that the camera nodes report only
the observations $\{z_i\}$ to the cluster head, and the cluster head derives
the values of the $\eta_i$s from the $z_i$s.
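The measurement model (1) and the noise variance (2) can be illustrated with a short sketch. It assumes, as one reading of Fig. 3, that $d_i(x)$ is the distance from the camera to the object along the optical axis and $h_i(x)$ is the offset perpendicular to it. The function names, the argument layout, and the use of `None` for the "NaN" report are hypothetical.

```python
import math
import random

def project(x, cam_pos, cam_angle, focal_length):
    """Return (z_mean, d, h): noiseless scan-line position, depth, lateral offset."""
    dx, dy = x[0] - cam_pos[0], x[1] - cam_pos[1]
    # Assumed geometry: d along the optical axis, h perpendicular to it.
    d = dx * math.cos(cam_angle) + dy * math.sin(cam_angle)
    h = -dx * math.sin(cam_angle) + dy * math.cos(cam_angle)
    return focal_length * h / d, d, h

def noise_variance(d, h, f, sigma_theta, sigma_pos, sigma_read):
    """Variance of v_i following Eq. (2)."""
    angle_term = f ** 2 * (1 + h ** 2 / d ** 2) ** 2 * sigma_theta ** 2
    pos_term = f ** 2 * (h ** 2 + d ** 2) / d ** 4 * sigma_pos ** 2
    return angle_term + pos_term + sigma_read ** 2

def measure(x, cam_pos, cam_angle, f, visible, sigmas, rng=random):
    """Draw z_i following Eq. (1); None stands in for the "NaN" report."""
    if not visible:
        return None
    z_mean, d, h = project(x, cam_pos, cam_angle, f)
    var = noise_variance(d, h, f, *sigmas)
    return z_mean + rng.gauss(0.0, math.sqrt(var))
```

Note how the position-error term grows as the object approaches the camera (small $d_i$), so nearby cameras are not automatically the most informative ones.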

4. TRACKING

As the measurement model in (1) is nonlinear in the object po-

sition, using a linear filter, e.g., Kalman Filter (KF), for track-

ing would yield poor results. As discussed in [22], using an Ex-

tended Kalman Filter (EKF) with measurements from bearing sen-

sors, which are similar to cameras with the aforementioned local

processing, is not very successful. Although the use of an Unscented
Kalman Filter (UKF) is more promising, its performance
degrades quickly when the static occluders and limited FOV constraints
are considered. Because of the discreteness of the occlusions
and FOV and the fact that UKF uses only a few points from
the prior of the object state, most of these points may get discarded.
We also experimented with a Maximum A-Posteriori (MAP) estimator
combined with a KF, which is similar to the approach in [8].
This approach, however, failed at the optimization stage of the
MAP estimator, as the feasible set is highly disconnected due to
the static occluders and limited camera FOV. Given these considerations,
we decided to use a particle filter (PF) tracker [4].

[Figure 2 stages: Scene, Camera, Background subtraction, Vertical summation & thresholding, Scan line.]

Figure 2: Local processing at each camera node.

[Figure 3 labels: object position x, hi(x), di(x), focal length fi, focal plane, measurement zi.]

Figure 3: The camera measurement model.

We denote by u(t) the state of the object at time t, which in-

cludes its position x(t) and other relevant information. The posi-

tions of the moving occluders m ∈ {1,...M}, xm(t) are assumed

to be Gaussian with mean µm(t) and covariance matrix Σm(t).

These priors are available to the tracker. The state of the object

and positions of moving occluders are assumed to be mutually in-

dependent. Note that if the objects move in groups, one can still ap-

ply the following tracker formulation by defining a “super-object”

for each group and assuming that the super-objects move indepen-

dently. The tracker maintains the probability density function (pdf)

of the object state u(t), and updates it at each time step using the

new measurements. Given the measurements up to time $t-1$,
$\{Y(\tau)\}_{\tau=1}^{t-1}$, the particle filter approximates the pdf of $u(t-1)$ by
a set of $L$ weighted particles as

$$
f\big(u(t-1) \,\big|\, \{Y(\tau)\}_{\tau=1}^{t-1}\big) \approx \sum_{\ell=1}^{L} w_\ell(t-1)\, \delta\big(u(t-1) - u_\ell(t-1)\big),
$$

where $\delta(\cdot)$ is the Dirac delta function and $u_\ell$ is the state of particle $\ell$
(i.e., a sample of $u(t)$). At each time step, given these $L$ weighted

particles, the camera measurements Z(t) = {z1(t), ...,zN(t)}

and η(t) = {η1(t),...,ηN(t)}, the moving occluder priors

{µm(t),Σm(t)}, m ∈ {1,...,M}, information about the static

occluder positions and the camera FOV, the tracker incorporates
the new information obtained from the measurements at time t to
update the particles (and their associated weights).

Algorithm: ASIR
Inputs: $\{u_\ell(t-1), w_\ell(t-1)\}_{\ell=1}^{L}$; $\{\mu_m(t), \Sigma_m(t)\}_{m=1}^{M}$;
    $Z(t) = \{z_1(t),\dots,z_N(t)\}$; $\eta(t) = \{\eta_1(t),\dots,\eta_N(t)\}$;
    Shapes and positions of static occluders;
    Camera positions and orientations ($\theta_i$, $i \in \{1,\dots,N\}$);
    FOV of the cameras.
Output: $\{u_\ell(t), w_\ell(t)\}$, $\ell \in \{1,\dots,L\}$.
for $\ell = 1,\dots,L$
    $\kappa_\ell := E[u(t) \mid u_\ell(t-1)]$
    $\tilde{w}_\ell(t) \propto f(Z(t),\eta(t) \mid \kappa_\ell)\, w_\ell(t-1)$
end for
$\{w_\ell(t)\}_{\ell=1}^{L}$ = Normalize($\{\tilde{w}_\ell(t)\}_{\ell=1}^{L}$)
$\{\cdot,\cdot,i_\ell\}_{\ell=1}^{L}$ = Resample($\{\kappa_\ell, w_\ell(t)\}_{\ell=1}^{L}$)
for $\ell = 1,\dots,L$
    Draw $u_\ell(t) \sim f(u(t) \mid u_{i_\ell}(t-1))$
    $\tilde{w}_\ell(t) = f(Z(t),\eta(t) \mid u_\ell(t)) \,/\, f(Z(t),\eta(t) \mid \kappa_{i_\ell})$
end for
$\{w_\ell(t)\}_{\ell=1}^{L}$ = Normalize($\{\tilde{w}_\ell(t)\}_{\ell=1}^{L}$)

Figure 4: The auxiliary sampling importance resampling algorithm.
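One step of the ASIR update listed in Fig. 4 can be sketched generically as below. This is an illustrative sketch under stated assumptions, not the authors' code: the dynamics and the likelihood are passed in as callables (`predict_mean`, `draw_next`, `likelihood` are hypothetical names), resampling is done by simple multinomial draws, and at least one particle is assumed to have a nonzero first-stage weight.

```python
import random

def normalize(weights):
    """Scale the weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def resample_indices(weights, rng):
    """Draw L parent indices with probability proportional to the weights."""
    return rng.choices(range(len(weights)), weights=weights, k=len(weights))

def asir_step(particles, weights, predict_mean, draw_next, likelihood, rng=random):
    """One ASIR update.

    predict_mean(u)    -> kappa, the expected next state given previous state u
    draw_next(u, rng)  -> a sample of the next state given previous state u
    likelihood(u)      -> f(Z(t), eta(t) | u) for the current measurements
    """
    # First stage: weight each particle by the likelihood of its predicted mean.
    kappas = [predict_mean(u) for u in particles]
    kappa_lik = [likelihood(k) for k in kappas]
    first_stage = normalize([lk * w for lk, w in zip(kappa_lik, weights)])
    # Resample parent indices i_l according to the first-stage weights.
    parents = resample_indices(first_stage, rng)
    # Second stage: propagate the chosen parents and correct the weights.
    new_particles = [draw_next(particles[i], rng) for i in parents]
    new_weights = normalize(
        [likelihood(u) / kappa_lik[i] for u, i in zip(new_particles, parents)]
    )
    return new_particles, new_weights
```

Passing the likelihood as a callable keeps the filter independent of how the occlusion term is evaluated, so the measurement model can be swapped without touching the update itself.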

We use the auxiliary sampling importance resampling (ASIR)

filter described in [4]. The outline of one step of our implementation
of this filter is given in Fig. 4. In this figure, $E[\cdot]$ represents
the expectation operator, and the procedure
“$\{w_\ell\}_{\ell=1}^{L}$ = Normalize($\{\tilde{w}_\ell\}_{\ell=1}^{L}$)” normalizes the weights so that they sum to one.
The procedure “$\{u_\ell, w_\ell, i_\ell\}_{\ell=1}^{L}$ = Resample($\{u_\ell, w_\ell\}_{\ell=1}^{L}$)” takes
$L$ particle-weight pairs and produces $L$ equally weighted particles
($w_\ell = 1/L$), preserving the original distribution. This amounts to
particles with small initial weights being killed and the ones with
high weights reproducing. The third output of the procedure ($i_\ell$)
refers to the index of particle $\ell$’s parent. The ASIR algorithm approximates
the optimal importance density function
$f(u(t) \mid u_\ell(t-1), Z(t), \eta(t))$, which is not feasible to compute in
general [4].

In the following, we explain the implementation of the importance
density function $f(u(t) \mid u_\ell(t-1))$ and the likelihood
$f(Z(t),\eta(t) \mid u_\ell(t))$.

4.1 Importance Density Function

The particles are advanced in time by drawing a new sample
$u_\ell(t)$ from the “importance density function” $f(u(t) \mid u_\ell(t-1))$:

$$
u_\ell(t) \sim f(u(t) \mid u_\ell(t-1)), \quad \ell \in \{1,\dots,L\}.
$$

This is similar to the “time update” step in a KF. After all L new
particles are drawn, the distribution of the state is forwarded one
time step. Therefore, the dynamics of the system should be reflected
as accurately as possible in the importance density function.
In KF, a constant velocity assumption with a large variance on the
velocity is assumed to account for direction changes. Although
assuming that objects move at constant velocity is not a realistic
assumption, the linearity constraint of the KF forces this choice. In
the PF implementation, we do not have to choose linear dynamics.
We use the more realistic “random waypoints model,” where the
objects choose a target and try to move toward the target with constant
speed plus noise, until they reach the target. When they reach
it, they choose a new target.

We implemented a modified version of this model in which the
state of the particle consists of its current position $x_\ell(t)$, target
$\tau_\ell(t)$, speed $s_\ell(t)$ and regime $r_\ell(t)$. Note that the time step here is
1 and thus $s_\ell$ represents the distance travelled in a unit time. The
model is given by

$$
u_\ell^T(t) = [\, x_\ell^T(t) \;\; \tau_\ell^T(t) \;\; s_\ell(t) \;\; r_\ell(t) \,].
$$

The regime can be one of the following:

1. Move toward target (MTT): A particle in this regime tries to
move toward its target with constant speed plus noise:

$$
x_\ell(t) = x_\ell(t-1) + s_\ell(t-1)\, \frac{\tau_\ell(t-1) - x_\ell(t-1)}{\|\tau_\ell(t-1) - x_\ell(t-1)\|_2} + \nu(t),
$$

where $\nu(t)$ is zero mean Gaussian white noise with $\Sigma_\nu = \sigma_\nu^2 I$,
$I$ denotes the identity matrix and $\sigma_\nu$ is assumed to be
known. The speed of the particle is also updated according
to

$$
s_\ell(t) = (1-\phi)\, s_\ell(t-1) + \phi\, \|x_\ell(t) - x_\ell(t-1)\|_2.
$$

Updating the speed this way smooths out the variations due

to added noise. We chose φ = 0.7 for our implementation.

The target is left unchanged.

2. Change Target (CT): A particle in this regime first chooses

a new target randomly (uniformly) in the room and performs

an MTT step.

3. Wait (W): A particle in this regime does nothing.

Drawing a new particle from the importance density function involves
the following. First, each particle chooses a regime according
to its current position and its target. If a particle has reached its
target, it chooses the regime according to

$$
r_\ell(t) = \begin{cases} \text{MTT}, & \text{w.p. } \beta_1, \\ \text{CT}, & \text{w.p. } \lambda_1, \\ \text{W}, & \text{w.p. } (1 - \beta_1 - \lambda_1). \end{cases}
$$

The target is assumed “reached” when the distance to it is less than

the particle’s speed. If a particle does not reach its target, the prob-

abilities β1 and λ1 are replaced by β2 and λ2, respectively. We

chose β1 = 0.05,λ1 = 0.9,β2 = 0.9,λ2 = 0.05.
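Drawing one particle from this modified random waypoints model can be sketched as below. The constants are the values quoted in the text. The tuple layout of the state, the function names, the square room with a corner at the origin, and independent noise on each coordinate are assumptions made for this illustration.

```python
import math
import random

PHI = 0.7                     # speed smoothing factor from the text
BETA1, LAMBDA1 = 0.05, 0.9    # regime probabilities once the target is reached
BETA2, LAMBDA2 = 0.9, 0.05    # regime probabilities otherwise

def mtt_step(x, target, speed, sigma_nu, rng):
    """Move toward the target with the current speed plus Gaussian noise."""
    dist = math.hypot(target[0] - x[0], target[1] - x[1])
    # Unit vector toward the target (zero if the particle sits exactly on it).
    ux, uy = ((target[0] - x[0]) / dist, (target[1] - x[1]) / dist) if dist > 0 else (0.0, 0.0)
    new_x = (x[0] + speed * ux + rng.gauss(0.0, sigma_nu),
             x[1] + speed * uy + rng.gauss(0.0, sigma_nu))
    step = math.hypot(new_x[0] - x[0], new_x[1] - x[1])
    # Smoothed speed update with factor PHI.
    new_speed = (1 - PHI) * speed + PHI * step
    return new_x, new_speed

def draw_next(state, room_size, sigma_nu, rng=random):
    """Draw the next state (position, target, speed, regime) of one particle."""
    x, target, speed, _ = state
    # The target counts as reached when it is closer than one step.
    reached = math.hypot(target[0] - x[0], target[1] - x[1]) < speed
    beta, lam = (BETA1, LAMBDA1) if reached else (BETA2, LAMBDA2)
    r = rng.random()
    regime = "MTT" if r < beta else "CT" if r < beta + lam else "W"
    if regime == "W":
        return (x, target, speed, regime)
    if regime == "CT":
        # Assumed room: the square [0, room_size] x [0, room_size].
        target = (rng.uniform(0, room_size), rng.uniform(0, room_size))
    new_x, new_speed = mtt_step(x, target, speed, sigma_nu, rng)
    return (new_x, target, new_speed, regime)
```

With the quoted probabilities, a particle that has reached its target picks a new one 90% of the time, while a particle still under way keeps moving 90% of the time.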

4.2 Likelihood

Updating the weights in the ASIR algorithm requires the computation
of the likelihood of the measurements, $f(Z(t),\eta(t) \mid u_\ell(t))$.
For brevity, we shall drop the time index from now on. We can
use the chain rule for probabilities to decompose the likelihood and
obtain

$$
f(Z,\eta \mid u_\ell) = p(\eta \mid u_\ell)\, f(Z \mid \eta, u_\ell). \qquad (3)
$$

Now, given $x_\ell$, which is part of $u_\ell$, and $\eta$, $z_1,\dots,z_N$ become independent
Gaussian random variables and we have

$$
f(Z \mid \eta, u_\ell) = \prod_{i:\, \eta_i = 1} \mathcal{N}\left\{ z_i;\; f_i \frac{h_i(x_\ell)}{d_i(x_\ell)},\; \sigma_{v_i}^2 \right\},
$$

where $\mathcal{N}\{r; \xi, \rho^2\}$ denotes a univariate Gaussian function of $r$ with
mean $\xi$ and variance $\rho^2$, $\sigma_{v_i}^2$ is given in (2) and $d_i(x)$ and $h_i(x)$ are
defined in Fig. 3.

The first term in (3), however, cannot be expressed as a product,
as the occlusions are not independent given $u_\ell$. This can be explained
via the following simple example: Suppose 2 cameras are
close to each other. Once we know that one of these cameras cannot
see the object, it is more likely that the other one also cannot
see it. Hence, the 2 $\eta$s are dependent given $u_\ell$. Luckily, we can approximate
the first term in (3) in a computationally feasible manner
using recursion.

[Figure 5 labels: camera s, object position x, prior of object m, angle θms, occluder diameter D, rectangle As(x).]

Figure 5: Computing $q_s^m(x)$.
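The decomposition in (3) and the Gaussian product over the cameras that see the object can be sketched as follows. The occlusion term $p(\eta \mid u_\ell)$ is taken as a given number here (`p_eta`), and the function names and the use of `None` for a "NaN" report are hypothetical.

```python
import math

def gaussian_pdf(r, mean, var):
    """Univariate Gaussian density N{r; mean, var}."""
    return math.exp(-0.5 * (r - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def measurement_term(z, predicted, variances):
    """f(Z | eta, u): product over the cameras that reported a measurement.

    z[i] is None when camera i reported "NaN" (eta_i = 0);
    predicted[i] is f_i * h_i(x) / d_i(x); variances[i] is given by Eq. (2).
    """
    value = 1.0
    for z_i, mean_i, var_i in zip(z, predicted, variances):
        if z_i is not None:
            value *= gaussian_pdf(z_i, mean_i, var_i)
    return value

def likelihood(z, predicted, variances, p_eta):
    """Eq. (3): f(Z, eta | u) = p(eta | u) * f(Z | eta, u)."""
    return p_eta * measurement_term(z, predicted, variances)
```

Cameras that report "NaN" contribute nothing to the Gaussian product; their information enters only through the occlusion term.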

First, we ignore the static occluders and the limited FOV, and

only consider the effect of the moving occluders. The effects of

static occluders and limited FOV will be added in Subsection 4.3.

Define the indicator functions ηs,m for s = 1,...,N and m =

1,...,M such that ηs,m = 1 if object m does not occlude camera

s, and 0, otherwise. Thus

$$
\{\eta_s = 1\} = \bigcap_{m=1}^{M} \{\eta_{s,m} = 1\}.
$$

The probability that object m occludes camera s given u is thus
given by

$$
\begin{aligned}
P\{\eta_{s,m} = 0 \mid u\} &= \int f(x_m \mid u)\, P\{\eta_{s,m} = 0 \mid u, x_m\}\, dx_m \\
&\overset{(a)}{=} \int f(x_m)\, P\{\eta_{s,m} = 0 \mid x, x_m\}\, dx_m \\
&\triangleq q_s^m(x),
\end{aligned}
$$

where $x$ is the position part of the state vector $u$ and step (a) uses
the facts that $x_m$ is independent of $u$ and $\eta_{s,m}$ is a deterministic
function of $x$ and $x_m$. To compute $q_s^m(x)$, refer to Figure 5. Without
loss of generality, we assume that camera $s$ is placed at the
origin. We assume that the moving occluder diameter $D$ is small
compared to the occluder standard deviations. Object $m$ occludes
point $x$ at camera $s$ if its center is inside the rectangle $A_s(x)$. This
means $P\{\eta_{s,m} = 0 \mid x, x_m\} = 1$ if $x_m \in A_s(x)$ and it is zero
everywhere else:

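$$
\begin{aligned}
q_s^m(x) &= \int_{A_s(x)} \frac{1}{2\pi\sqrt{|\Sigma_m|}}\, e^{-\frac{1}{2}(x_m - \mu_m)^T \Sigma_m^{-1} (x_m - \mu_m)}\, dx_m \\
&\overset{(b)}{\approx} \frac{1}{4} \left[ \operatorname{erf}\left( \frac{\sqrt{\alpha}}{\sqrt{2}\,\sigma_m \|v_1\|} \left( \frac{D}{2} - \varphi \right) \right) + \operatorname{erf}\left( \frac{\sqrt{\alpha}}{\sqrt{2}\,\sigma_m \|v_1\|} \left( \frac{D}{2} + \varphi \right) \right) \right] \\
&\quad \times \left[ \operatorname{erf}\left( \frac{\|x\|\,\|v_1\|^2 - \mu_m^T v_2}{\sqrt{2}\,\sigma_m \|v_1\|} \right) + \operatorname{erf}\left( \frac{\mu_m^T v_2}{\sqrt{2}\,\sigma_m \|v_1\|} \right) \right], \qquad (4)
\end{aligned}
$$

where $v_1^T = [\cos(\theta_{ms}) \;\; \sqrt{\alpha}\sin(\theta_{ms})]$, $v_2^T = [\cos(\theta_{ms}) \;\; \alpha\sin(\theta_{ms})]$,
$\varphi = \mu_{my}\cos(\theta_{ms}) - \mu_{mx}\sin(\theta_{ms})$, and
$\sigma_m^2$ and $\sigma_m^2/\alpha$, $\alpha \ge 1$, are the eigenvalues of the covariance matrix
$\Sigma_m$ of the prior of occluder $m$. Step (b) follows by the assumption
of small moving occluders.

To compute $p(\eta \mid u)$, first consider the probability of all $\eta_s$ of the
cameras in subset $S$, given $u$, to be equal to 1,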

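$$
\begin{aligned}
P\left( \bigcap_{s \in S} \{\eta_s = 1\} \,\Big|\, u \right)
&= P\left( \bigcap_{s \in S} \bigcap_{m=1}^{M} \{\eta_{s,m} = 1\} \,\Big|\, u \right)
= P\left( \bigcap_{m=1}^{M} \bigcap_{s \in S} \{\eta_{s,m} = 1\} \,\Big|\, u \right) \\
&\overset{(c)}{=} \prod_{m=1}^{M} P\left( \bigcap_{s \in S} \{\eta_{s,m} = 1\} \,\Big|\, u \right)
= \prod_{m=1}^{M} \left( 1 - P\left( \bigcup_{s \in S} \{\eta_{s,m} = 0\} \,\Big|\, u \right) \right) \\
&\overset{(d)}{\approx} \prod_{m=1}^{M} \left( 1 - \sum_{s \in S} P\{\eta_{s,m} = 0 \mid u\} \right)
= \prod_{m=1}^{M} \left( 1 - \sum_{s \in S} q_s^m(x) \right) \triangleq p_S^{mv}(x), \qquad (5)
\end{aligned}
$$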

where (c) follows by the assumption that the occluder positions are

independent, and (d) follows from the assumption of small D and

the reasonable assumption that the cameras in S are not too close

so that the overlap between As(x), s ∈ S, is negligible. Note

that cameras that satisfy this condition can still be close enough,

such that their FOVs overlap and ηs are dependent. The superscript

“mv” signifies that “only moving occluders are taken into account.”
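Equations (4) and (5) can be evaluated with the standard error function, as in the sketch below. It assumes coordinates with camera $s$ at the origin and axes aligned with the principal axes of $\Sigma_m$ (major axis first), so that $\theta_{ms}$ is the angle of the camera-to-object ray measured from the major axis. This reading of Fig. 5, the function names, and the argument layout are assumptions.

```python
import math

def q_occlusion(x_norm, theta_ms, mu_m, sigma_m, alpha, D):
    """Approximate probability q_s^m(x) that occluder m blocks camera s, Eq. (4).

    x_norm: distance from camera s (at the origin) to the point x.
    mu_m: prior mean (mu_mx, mu_my); sigma_m**2 and sigma_m**2 / alpha are the
    eigenvalues of the prior covariance (alpha >= 1).
    """
    c, s = math.cos(theta_ms), math.sin(theta_ms)
    v1_norm = math.sqrt(c * c + alpha * s * s)      # ||v1||
    mu_v2 = mu_m[0] * c + alpha * mu_m[1] * s       # mu_m^T v2
    phi = mu_m[1] * c - mu_m[0] * s                 # offset of the mean from the ray
    k = math.sqrt(2.0) * sigma_m * v1_norm
    across = (math.erf(math.sqrt(alpha) * (D / 2 - phi) / k)
              + math.erf(math.sqrt(alpha) * (D / 2 + phi) / k))
    along = math.erf((x_norm * v1_norm ** 2 - mu_v2) / k) + math.erf(mu_v2 / k)
    return 0.25 * across * along

def p_mv_all_visible(q):
    """Eq. (5): probability that every camera in S sees the object.

    q[m] lists q_s^m(x) for the cameras s in S, one list per moving occluder m.
    """
    prob = 1.0
    for q_m in q:
        prob *= 1.0 - sum(q_m)
    return prob
```

For an isotropic prior centered on the ray, the first bracket reduces to the probability mass of a one-dimensional Gaussian inside a band of width $D$, which gives a quick sanity check of the formula.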

Now we can compute $p^{mv}(\eta \mid u)$ using (5) and recursion as follows.
Let $S = \{1,\dots,N\}$ (i.e., the set of all cameras). For any
$i \in S$ such that $\eta_i = 0$, define

$$
\eta^a = \{\eta_1,\dots,\eta_{i-1},\eta_{i+1},\dots,\eta_N\}, \qquad
\eta^b = \{\eta_1,\dots,\eta_{i-1},1,\eta_{i+1},\dots,\eta_N\}.
$$

Then,

$$
p^{mv}(\eta \mid u) = p^{mv}(\eta^a \mid u) - p^{mv}(\eta^b \mid u). \qquad (6)
$$

Both terms on the right-hand side of (6) are one step closer to $p_S^{mv}(u)$
(with different $S$), because one less element is zero in both $\eta^a$ and
$\eta^b$. This means that any $p^{mv}(\eta \mid u)$ can be reduced recursively to
terms consisting of $p_S^{mv}(u)$, using (6). The bad news is, the computational
load of this is exponential in the number of zeros in $\eta$.
However, this bottleneck is greatly alleviated by the limited FOV
of the cameras as will be explained in the following subsection.

4.3 Adding Static Occluders and Limited FOV

Adding the effects of the static occluders and limited camera
FOV to the procedure described above involves a geometric partitioning
of the particles into bins. Each bin is assigned a set of
cameras. After this partitioning, only the $\eta$s of the assigned cameras
are considered for the particles in that bin. This is explained
using the example in Fig. 6. In this example, we have 2 cameras
and a single static occluder. As denoted by the dashed line in the
figure, we have 2 partitions. Let $\eta_1 = 0$ and $\eta_2 = \gamma_2 \in \{0,1\}$.
Let us consider a particle belonging to the upper partition, namely
particle $\ell$. If the object is at $x_\ell$, the static occluder makes $\eta_1 = 0$,
independent of where the moving occluders are. On the other hand,
the static occluder and limited FOV do not occlude the second camera’s
view of particle $\ell$. So, only Cam 2 is assigned to this partition,
and the first term in the likelihood is given by

$$
P(\{\eta_1 = 0\} \cap \{\eta_2 = \gamma_2\} \mid u_\ell) = p^{mv}(\eta_2 \mid u_\ell).
$$

Similarly,

$$
\begin{aligned}
P(\{\eta_1 = 1\} \cap \{\eta_2 = \gamma_2\} \mid u_\ell) &= 0, \\
P(\{\eta_1 = \gamma_1\} \cap \{\eta_2 = \gamma_2\} \mid u_k) &= p^{mv}(\eta_1, \eta_2 \mid u_k),
\end{aligned}
$$

where the first line follows because if the object is at $x_\ell$, $\eta_1 = 0$,
and the second line follows because the static occluder and limited
FOV do not occlude particle $k$.

[Figure 6 labels: Cam1, Cam2, particle states $u_\ell = [x_\ell, \dots]$ and $u_k = [x_k, \dots]$.]

Figure 6: Geometric partitioning to add static occluders and
limited FOV. If $\eta_1 = 1$, the object cannot be at $x_\ell$. If $\eta_1 = 0$,
only Cam 2 needs to be considered for computing $p(\eta \mid u_\ell)$. Both
cameras need to be considered for computing $p(\eta \mid u_k)$.

Note that the number of cameras assigned to a partition is not

likely to be large, mainly due to the limited camera FOV. Since the

number of zeros in η is at most the number of cameras assigned

to a partition, the actual computational complexity of the recursion

described in Subsection 4.2 is much lower than exponential. Also,

because the camera placements, FOV and static occluder positions

are known in advance, the room can be divided into regions before-

hand, with each region assigned the cameras that can see it. The

number of such regions grows at most quadratically in the number

of cameras. During tracking, the particles can be easily divided

into partitions depending on which pre-computed region each particle

is in.
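The recursion (6), restricted to the cameras assigned to a particle's partition, can be sketched as below. The data layout is hypothetical: `eta` maps each assigned camera to its indicator, and `q[m][s]` holds $q_s^m(x)$ for occluder $m$ and camera $s$. Cameras blocked by a static occluder or outside the FOV are assumed to have been removed from `eta` beforehand.

```python
def p_mv(eta, q):
    """Probability of the occlusion pattern eta due to moving occluders only.

    eta: dict mapping camera id -> 0 or 1, for the cameras assigned to the
    particle's partition.
    q: list over occluders of dicts mapping camera id -> q_s^m(x).
    """
    def all_visible(cams):
        # Eq. (5): every camera in `cams` sees the object.
        prob = 1.0
        for q_m in q:
            prob *= 1.0 - sum(q_m[s] for s in cams)
        return prob

    def recurse(ones, zeros):
        if not zeros:
            return all_visible(ones)
        i, rest = zeros[0], zeros[1:]
        # Eq. (6): eta^a drops camera i, eta^b sets eta_i to 1.
        return recurse(ones, rest) - recurse(ones | {i}, rest)

    ones = frozenset(s for s, v in eta.items() if v == 1)
    zeros = tuple(s for s, v in eta.items() if v == 0)
    return recurse(ones, zeros)
```

Each zero in `eta` doubles the number of terms, which is why keeping the per-partition camera sets small matters in practice.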

We mentioned in Section 3 that the camera nodes can distinguish

between the object and the occluders. This may be unrealistic in

some practical settings. To address this problem, one can introduce

another random variable that indicates the event of detecting and

recognizing the object and include its probability in the likelihood.

We have not implemented this modification in this paper, however.

4.4 Obtaining Occluder Priors

Our tracker assumes the availability of priors for the moving oc-

cluder positions. In this subsection we discuss how these priors

may be obtained. In Section 5, we investigate the tradeoff between

the accuracy of such priors and that of tracking.

Clearly, one could run a separate PF for each object, and then fit

Gaussians to the resulting particle distributions. This requires solv-

ing the data association problem, which would require substantial

local and centralized processing. Instead of solving the data asso-

ciation problem, trackers that represent the states of all objects in a

joint state have been proposed (e.g. [20]). This approach, however,

is computationally prohibitive as it requires employing an exponentially

increasing number of particles in the size of the state.

Another approach to obtaining the priors is to use a hybrid sen-

sor network combining, for example, acoustic sensors in addition

to cameras. As these sensors use less energy than cameras, they
could be used to generate the priors for the moving occluders. An
example of this approach can be found in [23].

Figure 7: The visual hull is computed by back-projecting the
scan lines to the room and intersecting the resulting cones.

Yet another approach to obtaining the occluder priors involves

reasoning about occupancy using the “visual hull” (VH) as de-

scribed in [24] (see Fig. 7). To compute the VH, the entire scan

lines from the cameras are sent to the cluster head instead of only

the centers of the object blobs in the scan lines as discussed in Section 3.

This only marginally increases the communication cost. The

cluster head then computes the VH by back-projecting the blobs in

the scan lines to cones in the room. The cones from the multi-

ple cameras are intersected to compute the total VH. Since the re-

sulting polygons are larger than the occupied areas and “phantom”

polygons that do not contain any objects may be present, VH pro-

vides an upper bound on occupancy. The computation of the VH

is relatively light-weight, and does not require solving the data as-

sociation problem. The VH can then be used to compute occluder

priors by fitting ellipses to the polygons and using them as Gaus-

sian priors. Alternatively, the priors can be assumed to be uniform

distributions over these polygons. In this case the computation of q^m_s(x) in (4) would need to be modified.

Although the VH approach to computing occluder priors is quite appealing for a WSN implementation, several problems remain to be addressed. These include dealing with the object’s own blob and phantom removal [24], which is necessary because their existence can cause the killing of many good particles.
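The back-projection and intersection step behind the VH can be illustrated with a small grid-based sketch. This is not the polygon-based computation of [24]; it is a minimal approximation under stated assumptions: the room is a square, every camera can see the whole room, each blob is given as a world-frame bearing interval that does not wrap around ±π, and the names `disk_cone` and `visual_hull_mask` are hypothetical.

```python
import numpy as np

def disk_cone(cam, center, radius):
    """Bearing interval (radians) subtended at `cam` by a disk-shaped object.
    Stands in for one blob of a scan line back-projected into the room."""
    dx, dy = center[0] - cam[0], center[1] - cam[1]
    bearing = np.arctan2(dy, dx)
    half = np.arcsin(radius / np.hypot(dx, dy))  # assumes cam is outside the disk
    return (bearing - half, bearing + half)

def visual_hull_mask(cameras, cones, room=100.0, res=1.0):
    """Grid approximation of the 2-D visual hull.

    cameras: list of (cx, cy) camera positions.
    cones:   per camera, a list of (lo, hi) bearing intervals, one per blob.
    Returns a boolean array (rows index y, columns index x) that is True for
    cells whose centre lies inside some cone of every camera.
    """
    n = int(round(room / res))
    centers = (np.arange(n) + 0.5) * res
    gx, gy = np.meshgrid(centers, centers, indexing="xy")
    hull = np.ones((n, n), dtype=bool)
    for (cx, cy), intervals in zip(cameras, cones):
        bearing = np.arctan2(gy - cy, gx - cx)
        seen = np.zeros((n, n), dtype=bool)
        for lo, hi in intervals:
            seen |= (bearing >= lo) & (bearing <= hi)  # union of this camera's cones
        hull &= seen                                   # intersection across cameras
    return hull

# Two cameras on adjacent walls observing one object of radius 2 at the centre.
cameras = [(0.0, 50.0), (50.0, 0.0)]
cones = [[disk_cone(c, (50.0, 50.0), 2.0)] for c in cameras]
hull = visual_hull_mask(cameras, cones)
```

In this example the hull cells cover the object and a little more, which is the sense in which the VH gives an upper bound on occupancy. With more objects, intersections of cones from different objects would produce the “phantom” regions mentioned above.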

5. SIMULATION RESULTS

In a practical tracking setting one is given the room structure

(including information about the static occluders), the range of the

number of moving occluders and their motion model, and the re-

quired object tracking accuracy. Based on this information, one

needs to decide on the number of cameras to use in the room and

the amount of prior information about the moving occluder positions needed and how to best obtain this information. Making these decisions involves several tradeoffs, for example, between the

occluder prior accuracy and the tracker performance, between the

number of cameras used and the required occluder prior accuracy,

and between the number of occluders present and the tracking per-

formance. In this section we explore these tradeoffs using simula-

tions.

In the simulations we assume a square room of size 100 × 100

units and a maximum of 8 cameras placed around its periphery (see

Fig. 8). The black rectangle in the figure depicts a static occluder.

Note, however, that in some of the simulations we assume no static

occluders. All cameras look toward the center of the room.

Figure 8: The setup used in simulations.

The camera FOV is assumed to be 90°. The standard deviation of the

camera position error is σpos = 1 unit, that of camera angle error is

σθ = 0.01 radians and read noise standard deviation is σread = 2

pixels. The diameter of each moving occluder is assumed to be

D = 3.33 units. We assume that the objects move according to the random waypoints model. This is similar to the way we draw new

particles from the importance density function as discussed in Sub-

section 4.1 with the following differences:

• The objects are only in regimes MTT or CT. There is no W

regime.

• The objects choose their regimes deterministically, not ran-

domly. If an object reaches its target or is heading toward the

inside of a static occluder or outside the room boundaries, it

transitions to the CT regime.

• Objects go around each other instead of colliding.

The average speed of the objects is set to 1 unit per time step. The

standard deviation of the noise added to the motion each time step

is 0.33 units. Fig. 8 also shows a snapshot of the objects for M=40

occluders. In the PF tracker we use 1000 particles. In each simu-

lation, the object and the occluders move according to the random

waypoints model for 4000 time steps.
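The motion model described above can be sketched for a single object as follows. This is a simplified illustration, not the simulator used for the results: it ignores static occluders and the rule that objects go around each other, and it clips the position at the walls instead of switching to the CT regime when the object heads outside the room. The function name `simulate_random_waypoints` and the uniform choice of waypoints are assumptions.

```python
import numpy as np

def simulate_random_waypoints(steps, room=100.0, speed=1.0, noise_std=0.33, seed=0):
    """Simulate one object moving between random waypoints in a square room.

    Returns an array of shape (steps, 2) with the position at each time step.
    """
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0.0, room, size=2)
    target = rng.uniform(0.0, room, size=2)
    path = np.empty((steps, 2))
    for t in range(steps):
        to_target = target - pos
        dist = np.linalg.norm(to_target)
        if dist <= speed:
            # CT regime: the waypoint has been reached, so draw a new one
            # (assumed uniform over the room).
            target = rng.uniform(0.0, room, size=2)
            to_target = target - pos
            dist = np.linalg.norm(to_target)
        # MTT regime: step of length `speed` toward the target plus Gaussian noise.
        step = speed * to_target / dist + rng.normal(0.0, noise_std, size=2)
        pos = np.clip(pos + step, 0.0, room)  # simplification: clip at the walls
        path[t] = pos
    return path

path = simulate_random_waypoints(4000)
```

Running M + 1 such trajectories (one for the object and one per occluder) gives a rough stand-in for the simulation input, with the parameters 1 unit per time step and 0.33 units of noise taken from the text.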

To investigate tradeoffs involving moving occluder prior accu-

racy, we need a measure for the accuracy of the occluder prior. To

develop such a measure, we assume that the priors are obtained

using a KF run on virtual measurements of the moving occluder

positions of the form

$$y_m(t) = x_m(t) + \upsilon(t), \quad m = 1, 2, \ldots, M,$$

where $x_m(t)$ is the true occluder position, $\upsilon(t)$ is white Gaussian noise with covariance $\sigma_\upsilon^2 I$, and $y_m(t)$ is the measurement. We then use the average RMSE of the KF (RMSEocc) as a measure of the occluder prior accuracy. A lower RMSEocc means that higher accuracy sensors or more computation are used to obtain the priors, which results in more energy consumption in the network. At the extremes, RMSEocc = 0 (when $\sigma_\upsilon = 0$) corresponds to complete knowledge of the moving occluder positions and RMSEocc = RMSEmax (when $\sigma_\upsilon = \infty$) corresponds to no knowledge of the moving occluder positions. Note that the worst case RMSEmax is finite because when there are no measurements about the occluder positions, one can simply assume that they are located at the center of the room. This corresponds to RMSEmax = 25.0 units for the setup in Fig. 8.
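To make the role of the measurement noise concrete, the following sketch runs a Kalman filter on virtual measurements of the form y = x + υ and reports its RMSE. The KF's motion model is not specified in this section, so a random-walk model with per-axis process noise `q_std` is assumed here purely for illustration; the function name `kf_position_rmse` is also hypothetical, and the numbers it produces are not meant to reproduce the RMSEocc values quoted in the text.

```python
import numpy as np

def kf_position_rmse(truth, sigma_v, q_std=1.0, seed=0):
    """RMSE of a random-walk Kalman filter run on y = x + v, v ~ N(0, sigma_v^2 I).

    truth: array of shape (T, 2) with the true positions.
    The two axes are decoupled and share one scalar error variance.
    """
    rng = np.random.default_rng(seed)
    meas = truth + rng.normal(0.0, sigma_v, size=truth.shape)
    est = meas[0].copy()      # initialise at the first measurement
    var = sigma_v ** 2        # its per-axis error variance
    sq_err = []
    for y, x in zip(meas[1:], truth[1:]):
        var_pred = var + q_std ** 2                    # predict (assumed random walk)
        gain = var_pred / (var_pred + sigma_v ** 2)    # Kalman gain
        est = est + gain * (y - est)                   # measurement update
        var = (1.0 - gain) * var_pred
        sq_err.append(np.sum((est - x) ** 2))
    return float(np.sqrt(np.mean(sq_err)))

# Synthetic trajectory used only to exercise the filter.
_rng = np.random.default_rng(1)
truth = np.cumsum(_rng.normal(0.0, 1.0, size=(2000, 2)), axis=0)
rmse_exact = kf_position_rmse(truth, 0.0)
rmse_low = kf_position_rmse(truth, 2.0)
rmse_high = kf_position_rmse(truth, 8.0)
```

With zero measurement noise the filter follows the true positions, matching the RMSEocc = 0 extreme, and the RMSE grows as the measurement noise standard deviation increases.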

Figure 9: Average tracker RMSE (RMSEtr) versus the number of cameras used for M = 40 and 1 static occluder, with curves for RMSEocc = RMSEmax, RMSEocc = 6.67 and RMSEocc = 0. The dotted line (RMSEmax) is the worst case RMSE when no tracking is performed and the object is assumed to be at the center of the room.

To implement the tracker for these two extreme cases, we mod-

ify the p(η|u) computation as follows. We assign 0 or 1 to p(η|u)

depending on the consistency of η with our knowledge about the

occluders. For RMSEocc = 0, i.e., when we have complete infor-

mation about the moving occluder positions, the moving occluders

are treated as static occluders. On the other hand, for RMSEocc =

RMSEmax, i.e., when there is no information about the moving occluder positions, we check the consistency with only the static occluder and the limited FOV information to assign zero probabilities

to some particles. For the example in Fig. 6, we set P({η1 = 1}

∩{η2 = γ2}|u?) = 0, because if camera i can see the object, the

object cannot be at x?. Any other probability that is non-zero is set

to 1. Note that for these 2 extreme cases, we no longer need the

recursion discussed in Subsection 4.2 to compute the likelihood.

Hence, the computational complexity is considerably lighter com-

pared to using Gaussian priors.
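The 0/1 assignment for the two extreme cases can be sketched as a simple consistency check. This is an illustrative sketch only: occluders are modeled as disks (the static occluder in Fig. 8 is a rectangle), a camera is described by a position and a heading angle, and the function names are hypothetical. For the RMSEocc = 0 case the known moving occluders would simply be appended to the `occluders` list; for the RMSEocc = RMSEmax case the list holds the static occluders only.

```python
import math

def segment_hits_disk(p, q, center, radius):
    """True if the segment from p to q passes through the open disk."""
    px, py = p
    qx, qy = q
    cx, cy = center
    dx, dy = qx - px, qy - py
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(cx - px, cy - py) < radius
    # Parameter of the point on the segment closest to the disk centre.
    t = ((cx - px) * dx + (cy - py) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    nx, ny = px + t * dx, py + t * dy
    return math.hypot(cx - nx, cy - ny) < radius

def in_fov(cam_pos, cam_heading, fov, point):
    """True if `point` lies within the camera's horizontal field of view."""
    bearing = math.atan2(point[1] - cam_pos[1], point[0] - cam_pos[0])
    diff = (bearing - cam_heading + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff) <= fov / 2.0

def binary_likelihood(particle, cameras, detections, occluders, fov=math.pi / 2):
    """Return 0.0 if some camera reports a detection (eta = 1) although the
    particle cannot be visible from it; otherwise return 1.0."""
    for (pos, heading), eta in zip(cameras, detections):
        visible = in_fov(pos, heading, fov, particle) and not any(
            segment_hits_disk(pos, particle, c, r) for c, r in occluders
        )
        if eta == 1 and not visible:
            return 0.0
    return 1.0

cameras = [((0.0, 50.0), 0.0), ((50.0, 0.0), math.pi / 2)]
occluders = [((25.0, 50.0), 1.665)]   # one disk between camera 1 and the particle
particle = (50.0, 50.0)
```

Because each particle needs only one pass over the cameras and known occluders, this check avoids the recursion of Subsection 4.2, which is consistent with the lighter computational load noted above.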

First, in Fig. 9 we plot the average RMSE of the tracker (RMSEtr)

over 5 simulation runs for the two extreme cases of RMSEocc = 0

and RMSEocc = RMSEmax and for RMSEocc = 6.67 (obtained

by setting συ = 8) versus the number of cameras (the cameras con-

stitute a roughly evenly spaced subset of cameras in Fig. 8. For 2

cameras, orthogonal placement is used [2]). The dotted line repre-

sents the worst case RMSE, when there are no measurements and

the object is assumed to be in the center of the room.

We then investigate the dependency of the tracker accuracy on

the accuracy of the moving occluder priors. Fig. 10 plots the aver-

age RMSE for the tracker over 5 simulation runs versus RMSEocc

for N = 4 cameras. In order to include the effect of moving

occluder priors only, we used no static occluders in these simu-

lations. RMSEmax reduces to 21.3 units for this case. Note that

there is around a factor of 2.35× increase in RMSEtr from the

case of perfect occluder information (RMSEocc = 0) to the case

of no occluder information (RMSEocc = RMSEmax). Moreover,

it is not realistic to assume that the occluder prior accuracy would

be better than that of the tracker. With this consideration the im-

provement reduces to around 1.94× (this is obtained by noting that

RMSEtr =RMSEocc at around 3.72). These observations suggest

that obtaining prior information may not be worthwhile in practice,

unless it can be obtained cheaply and to a reasonable accuracy.

Figure 10: Dependency of the tracker average RMSE (RMSEtr) on the accuracy of the occluder prior (RMSEocc) for N = 4, M = 40 and no static occluders. The dotted line is for RMSEtr = RMSEocc.

Figure 11: Tradeoff between the number of cameras and moving occluder prior accuracy (RMSEocc) for target tracker average RMSE = 3 units for M = 40 and no static occluders.

The tradeoff between RMSEocc and the number of cameras needed to achieve average RMSEtr = 3 is plotted in Fig. 11. As expected, there is a tradeoff between the number of cameras and the accuracy

of the moving occluder priors as measured by RMSEocc. As more

cameras are used, the accuracy of the prior information needed de-

creases. The plot suggests that if a large enough number of cam-

eras is used, no prior information would be needed at all. Of course

having more cameras means more communications and processing

cost. So, in the design of a tracking system, one needs to com-

pare the cost of deploying more cameras to that of obtaining better

occluder priors.

Next we explore the question of how the needed moving occluder prior accuracy depends on the number of occluders present. To do so, in Fig. 12 we plot RMSEtr versus the number of moving occluders for the two extreme cases, RMSEocc = 0 and RMSEocc = RMSEmax. Note that the difference between the RMSEtr for the two cases is the potential improvement in the tracking performance achieved by having occluder prior information. When there are very few moving occluders, prior information does not help (because the object is not occluded most of the time). As the number of occluders increases, prior information becomes more useful. But the difference in RMSEtr between the two extreme cases decreases when too many occluders are present (because the object becomes occluded most of the time).

Figure 12: Tracker average RMSE (RMSEtr) versus the number of moving occluders for the two extreme cases RMSEocc = 0 and RMSEocc = RMSEmax. Here N = 4 and there are no static occluders.

Figure 13: Average CPU time for computing the likelihoods versus the number of cameras used, relative to that for the case of 2 cameras and no occluder prior, i.e., RMSEocc = RMSEmax. Curves are shown for Gaussian priors and for RMSEocc = RMSEmax, together with a reference line proportional to 2^N. Here M = 40 and there is 1 static occluder.

In Subsections 4.3 and 4.4, we mentioned that the complexity of

computing the likelihood given u? is exponential in the number of

cameras that cannot see the object and are assigned to the region

x? belongs to. We proposed that the limited camera FOV signif-

icantly reduces this computational complexity. To verify this, in

Fig. 13 we plot the average CPU time (per time step) used to com-

pute the likelihood relative to that of RMSEocc=RMSEmax case

for 2 cameras, versus the total number of cameras in the room.

The simulations were performed on a 3 GHz Intel Xeon processor running MATLAB R14. Note that the rate of increase of the CPU time using priors is significantly lower than 2^N, where N is

the number of cameras used, and it is close to the rate of increase

of RMSEocc=RMSEmax case. In fact, the rate of increase for this

particular example is close to linear in N.

6. EXPERIMENTAL RESULTS

We tested our tracking algorithm in an experimental setup consisting of 16 web cameras placed around a 22′ × 19′ room. The

horizontal FOV of the cameras used is 47◦. A picture of the lab

is shown in Fig. 14(a) and the relative positions and orientations of the cameras in the room are provided in Fig. 14(b). Each pair of cameras is connected to a PC via an IEEE 1394 (FireWire) interface and each can provide 8-bit 3-channel (RGB) raw video at 7.5 frames/s. The data from each camera is processed independently as described in Section 3. The scan line data is then sent to a central PC (cluster head), where further processing is performed.

Figure 14: Experimental setup. (a) View of lab (cameras are circled). (b) Relative locations of cameras and virtual static occluder. Solid line shows actual path of the object to track.

The object follows the pre-defined path (shown in Fig. 14) with no occlusions present, and 200 time steps of data are collected. The effect of static and moving occluders is simulated using 1 virtual static occluder and M = 20 virtual moving occluders: we discard the measurements from the cameras that would have been occluded, had there been real occluders. We chose to simulate the occluders because it is otherwise impossible to obtain the perfect occluder positions (RMSEocc = 0 case). The moving occluders walk according to the model explained in Section 5. D is chosen to be 12 inches for the moving occluders, and the camera noise parameters are assumed to be σpos = 6 inches, σread = 2 pixels and σθ = 0.005 radians.

Figure 15 plots the average RMSE of the tracker over 40 simula-

tion runs for the two extreme cases of RMSEocc = RMSEmax =

61.8 inches and RMSEocc = 0 and for RMSEocc = 14.2 inches

versus the number of cameras. There is a notable difference in the

performance between the three cases throughout the entire plot, but

the difference is more pronounced when the number of cameras is

small, agreeing with the tradeoffs discussed in Section 5.

Figure 15: Experimental results. Average tracker RMSE (RMSEtr) versus the number of cameras used for M = 20 and 1 static occluder, with curves for RMSEocc = RMSEmax, RMSEocc = 14.2 and RMSEocc = 0.

7. CONCLUSION

We described a sensor network approach for tracking a single object in a structured environment using multiple cameras. Instead of tracking all objects in the environment, which is computationally very costly, we track only the target object and treat others as occluders. The tracker is provided with complete information about the static occluders and some prior information about the moving occluders. One of the main contributions of this paper is developing a systematic way to incorporate this information into the tracker formulation. Using simulations we explored tradeoffs involving the occluder prior accuracy, the number of cameras used, the number of occluders present, and the accuracy of tracking with some interesting implications.

Several areas need to be explored further, including (i) running

simulations and experiments over real world environments to val-

idate our preliminary findings, (ii) developing a theoretical frame-

work for investigating the aforementioned tradeoffs, (iii) exploiting

the independence between the η_i's for cameras that are far apart to

further reduce the computational complexity of computing the like-

lihoods, and (iv) developing a cheap method for obtaining reason-

able accuracy occluder priors (perhaps based on VH).

Acknowledgments

The authors would like to thank Dr. Jack Wenstrand for helpful dis-

cussions. The authors wish to acknowledge the support of DARPA

Microsystems Technology Office Award No. N66001-02-1-8940,

NSF grants CNS-0435111, CNS-0626151, ARO grant W911NF-

06-1-0275, and the support of DoD Multidisciplinary University

Research Initiative (MURI) program administered by ONR under

Grant N00014-00-1-0637.

8. REFERENCES

[1] R. Holman and T. Ozkan-Haller, “Applying video sensor networks to nearshore environment monitoring,” IEEE Pervasive Computing, vol. 2, no. 4, pp. 14–21, 2003.

[2] A. O. Ercan, D. B.-R. Yang, A. El Gamal, and L. J. Guibas, “Optimal placement and selection of camera network nodes for target localization,” in Proceedings of DCOSS, June 2006.

[3] A. O. Ercan, A. El Gamal, and L. J. Guibas, “Camera network node selection for target localization in the presence of occlusions,” in Distributed Smart Cameras, October 2006.

[4] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter, Particle Filters for Tracking Applications. Artech House, 2004.

[5] R. R. Brooks, P. Ramanathan, and A. M. Sayeed, “Distributed target classification and tracking in sensor networks,” Proceedings of IEEE, vol. 91, no. 8, pp. 1163–1171, August 2003.

[6] F. Zhao, J. Liu, J. Liu, L. J. Guibas, and J. Reich,

“Collaborative signal and information processing: an

information-directed approach,” Proceedings of IEEE,

vol. 91, no. 8, pp. 1199–1209, August 2003.

[7] J. Aslam, Z. Butler, F. Constantin, V. Crespi, G. Cybenko,

and D. Rus, “Tracking a moving object with a binary sensor

network,” in Proceedings of SENSYS, November 2003.

[8] C. Taylor, A. Rahimi, J. Bachrach, H. Shrobe, and A. Grue,

“Simultaneous localization, calibration and tracking in an

ad-hoc sensor network,” in Proceedings of IPSN, April 2006.

[9] N. Shrivastava, R. Mudumbai, and U. Madhow, “Target

tracking with binary proximity sensors: Fundamental limits,

minimal descriptions, and algorithms,” in Proceedings of

SENSYS, November 2006.

[10] P. V. Pahalawatta, D. Depalov, T. N. Pappas, and A. K.

Katsaggelos, “Detection, classification, and collaborative

tracking of multiple targets using video sensors,” in

Proceedings of IPSN, April 2003, pp. 529–544.

[11] S. Funiak, C. Guestrin, M. Paskin, and R. Sukthankar,

“Distributed localization of networked cameras,” in

Proceedings of IPSN, April 2006.

[12] P. F. Gabriel, J. G. Verly, J. H. Piater, and A. Genon, “The

state of the art in multiple object tracking under occlusion in

video sequences,” in Proceedings of ACIVS, September 2003.

[13] A. Yilmaz, X. Li, and M. Shah, “Contour-based object

tracking with occlusion handling in video acquired using

mobile cameras,” IEEE Transactions on Pattern Analysis and

Machine Intelligence, vol. 26, no. 11, pp. 1531–1536,

November 2004.

[14] Q. Cai and J. K. Aggarwal, “Tracking human motion in

structured environments using a distributed-camera system,”

IEEE Transactions on Pattern Analysis and Machine

Intelligence, vol. 21, no. 12, pp. 1241–1247, 1999.

[15] S. Khan, O. Javed, Z. Rasheed, and M. Shah, “Human

tracking in multiple cameras,” in Proceedings of ICCV, July

2001.

[16] W. Zajdel, A. T. Cemgil, and B. J. A. Krose, “Online

multicamera tracking with a switching state-space model,” in

Proceedings of ICPR, August 2004.

[17] A. Utsumi, H. Mori, J. Ohya, and M. Yachida,

“Multiple-view-based tracking of multiple humans,” in

Proceedings of the ICPR, 1998.

[18] K. Otsuka and N. Mukawa, “Multiview occlusion analysis

for tracking densely populated objects based on 2-d visual

angles,” in Proceedings of CVPR, 2004.

[19] S. L. Dockstader and A. M. Tekalp, “Multiple camera

tracking of interacting and occluded human motion,”

Proceedings of the IEEE, vol. 89, no. 10, pp. 1441–1455,

October 2001.

[20] A. Doucet, B.-N. Vo, C. Andrieu, and M. Davy, “Particle

filtering for multi-target tracking and sensor management,”

in Proceedings of ISIF, 2002, pp. 474–481.

[21] C. Tomasi and T. Kanade, “Detection and tracking of point

features,” Carnegie Mellon University, Technical Report

CMU-CS-91-132, April 1991.

[22] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation. New York, NY: John Wiley & Sons Inc., 2001.

[23] X. Sheng and Y.-H. Hu, “Maximum likelihood multiple-source localization using acoustic energy measurements with wireless sensor networks,” IEEE Transactions on Signal Processing, vol. 53, no. 1, pp. 44–53, 2005.

[24] D. B.-R. Yang, H. Gonzales-Banos, and L. J. Guibas,

“Counting people in crowds with a real-time network of

image sensors,” in Proceedings of ICCV, October 2003.

APPENDIX

A. DERIVATION OF THE CAMERA MEASUREMENT NOISE VARIANCE

Without loss of generality, assume that the camera is at the origin, and the object is at $x = [x_1\ x_2]^T$. Then $h_i(x)$ and $d_i(x)$ in Fig. 3 are given by

\begin{align*}
h_i(x) &= \sin(\theta_i)\,x_1 - \cos(\theta_i)\,x_2, \\
d_i(x) &= -\cos(\theta_i)\,x_1 - \sin(\theta_i)\,x_2.
\end{align*}

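We take partial derivatives of $z_i$ given in (1) with respect to $\theta_i$, $x_1$ and $x_2$ to obtain

\begin{align*}
\frac{\partial z_i}{\partial \theta_i} &= -f_i \left(1 + \frac{h_i^2(x)}{d_i^2(x)}\right), \\
\frac{\partial z_i}{\partial x_1} &= f_i\,\frac{d_i(x)\sin(\theta_i) + h_i(x)\cos(\theta_i)}{d_i^2(x)}, \\
\frac{\partial z_i}{\partial x_2} &= f_i\,\frac{-d_i(x)\cos(\theta_i) + h_i(x)\sin(\theta_i)}{d_i^2(x)}.
\end{align*}

An error in the camera position translates into errors in $x_1$ and $x_2$. Assuming the errors in the two directions to be independent, zero mean and have the same standard deviation $\sigma_{pos}$, and the error in the angle to be zero mean with standard deviation $\sigma_\theta$ and independent of the position errors, we obtain

\begin{align*}
\sigma_{v_i}^2 = \sigma_{z_i|x}^2
&= \left(\frac{\partial z_i}{\partial \theta_i}\right)^2 \sigma_\theta^2
 + \left(\frac{\partial z_i}{\partial x_1}\right)^2 \sigma_{x_1}^2
 + \left(\frac{\partial z_i}{\partial x_2}\right)^2 \sigma_{x_2}^2
 + \sigma_{read}^2 \\
&= f_i^2 \left(1 + \frac{h_i^2(x)}{d_i^2(x)}\right)^2 \sigma_\theta^2
 + f_i^2\,\frac{h_i^2(x) + d_i^2(x)}{d_i^4(x)}\,\sigma_{pos}^2
 + \sigma_{read}^2.
\end{align*}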
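The closed-form variance can be evaluated directly, as in the sketch below. It assumes that the noiseless part of the measurement model (1) is $z_i = f_i h_i(x)/d_i(x)$, which is the form consistent with the partial derivatives in this appendix, and that the camera is at the origin. The function names and the focal length of 500 pixels in the example are hypothetical; the noise parameters are those of Section 5.

```python
import math

def projected_measurement(x, theta, f):
    """Noiseless scan-line coordinate z = f * h(x) / d(x) for a camera at the
    origin with orientation theta (assumed form of the measurement model)."""
    x1, x2 = x
    h = math.sin(theta) * x1 - math.cos(theta) * x2
    d = -math.cos(theta) * x1 - math.sin(theta) * x2
    return f * h / d

def measurement_noise_variance(x, theta, f, sigma_pos, sigma_theta, sigma_read):
    """Variance of the camera measurement noise from the appendix formula."""
    x1, x2 = x
    h = math.sin(theta) * x1 - math.cos(theta) * x2
    d = -math.cos(theta) * x1 - math.sin(theta) * x2
    angle_term = f ** 2 * (1.0 + h ** 2 / d ** 2) ** 2 * sigma_theta ** 2
    position_term = f ** 2 * (h ** 2 + d ** 2) / d ** 4 * sigma_pos ** 2
    return angle_term + position_term + sigma_read ** 2

# Example: theta = pi gives d = x1 and h = x2, so the object at (50, 10)
# has h = 10 and d = 50.
variance = measurement_noise_variance(
    (50.0, 10.0), math.pi, f=500.0, sigma_pos=1.0, sigma_theta=0.01, sigma_read=2.0
)
```

For these example values the angle, position and read-noise terms are approximately 27.04, 104 and 4 squared pixels, so the position error dominates. The position term scales as $(h^2 + d^2)/d^4$ and therefore grows quickly as the object approaches the camera.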
