A Theoretic Framework for Object Class Tracking

Yu Cao, Member, IEEE, Steve Read, Sachin Raka, Revanth Nandamuri

Abstract—Suppose we have a video whose first half captures the images of a sedan and whose second half records the motion of a truck. Can we use the same video tracking algorithm to follow the moving object over the entire video sequence? We define this kind of problem as the "Object Class Tracking" problem. Instead of tracking a specific object, object class tracking tracks the motion of an object class. The challenge is how to locate the image element in the next frame while handling the large intra-class variance. In this paper, we propose a theoretic framework for object class tracking based on the Kalman Filter. A part-based statistical model is employed to solve the image element localization problem. We mathematically prove the soundness of the theoretic framework. The method has the potential to be applied in many application domains.

I. INTRODUCTION

Video tracking is the problem of tracking moving objects from one frame to another in a sequence of images [1]. It has been widely used in many applications [2], such as video surveillance, robotic navigation, and industrial inspection. To build a video tracking system that is robust to clutter, occlusion, and view angle change, we need to address two important issues [1]: predicting the location of the tracked object in the next frame, and locating the image elements in the next frame.

There are many existing mathematical methods for video tracking, such as the Kalman Filter [3], the Particle Filter [4], and kernel-based approaches [5]. Among them, the Kalman Filter [3] has been used successfully in many applications. The Kalman filter is a recursive state estimator that is able to handle partially observed, non-stationary, and stochastic data sets. It gives an optimal estimate, in the least squares sense, of the actual value of a state vector from noisy observations.

Many existing tracking algorithms are used for tracking one or multiple objects while the objects being tracked do not change over the video sequence (although physical properties, such as shape and size, may vary significantly). We call this "Specific Object Tracking". Different from tracking specific objects, we are approaching a different problem, named "Object Class Tracking". The goal of object class tracking is to track an object class given a sequence of image frames. For example, given a video whose first part is composed of a set of images with a moving SUV and whose second part records the motion of a minivan, an object class tracking algorithm can follow both moving objects without changing the tracking model.

Manuscript received October 15, 2007. Y. Cao, S. Raka, and R. Nandamuri are with the Department of Computer Science, California State University at Fresno, Fresno, CA 93740 USA (phone: 559-278-4635; fax: 559-278-4197; e-mail: yucao@csufresno.edu). S. Read is with the Department of Mathematics, University of California at Santa Barbara, CA 93106 USA (e-mail: sread@math.ucsb.edu).

The biggest challenge of object class tracking is how to detect and locate the image elements (objects) in the current frame without considering the information from previous images. Different objects in the same object class may have large intra-class variance: they may have totally different colors, and they may also have totally different shapes. Recent progress in object class categorization provides possible solutions to our problem.

Object class recognition is to categorize objects into object classes. For example, instead of detecting a specific car, we want to detect different types of cars, such as sedans, SUVs, minivans, and trucks. In other words, the same object class recognition algorithm is able to detect both the sedan in one image and the truck in another image. Solving object class recognition requires a generic model that can handle the large intra-class variance. One possible solution for object class recognition is a distributed model called the "Part-based Model" [6-8]. In this model, the objects are represented as a set of parts, and the final decision is made based on both the appearance of the local parts and the spatial relationships among these parts. There are many different types of part-based representations. Earlier research focused on deterministic methods with energy minimization [6]. Recently, a joint probability model named the "Constellation Model" was proposed [7, 8]. In these methods, multiple parts are represented using distributed models, with parts distributed normally in appearance and location space. The appearance variations of object parts are modeled by Principal Component Analysis, and the spatial relationships among parts are calculated by a joint Gaussian probability distribution function. Using the Expectation-Maximization algorithm, weakly supervised learning of the model parameters has been developed. The pictorial model is another well-known part-based model [9]. In this approach, an object is modeled as a collection of parts arranged in a deformable configuration. The object is treated as a graph-like entity: the nodes represent the object parts and the edges indicate the spatial relations among parts. Originally, the pictorial model was used for object localization. In [10], the authors extended its capacity by providing efficient matching algorithms using general K-fan graphs.

In this paper, we propose a theoretic framework for video tracking that integrates the part-based model into the Kalman Filter. By employing the part-based model as the measurement method to detect and locate the image element in the current


frame, we can build a robust tracking system for object class

tracking.

The rest of the paper is organized as follows. In Section II, we introduce our proposed method for integrating a part-based model into the Kalman Filter; in the same section, we also show the soundness of our method. We draw a conclusion and point out possible application domains of the proposed method and future directions in Section III.

II. PROPOSED APPROACH

In order to build a Kalman Filter for video tracking, we need two models: a motion model and a measurement model. The motion model is used to predict where the objects will be in the next frame. The measurement model is used to detect and locate the image elements and to update the initial estimate. In our proposed approach, we use a part-based model for the measurement computation. In the part-based model, each object is represented by a set of parts with connections between certain pairs of parts. The visual appearance of each part and the spatial relationships among the parts are captured under a statistical framework, and the parameters of the model are learned from training examples. To measure the object in the next frame, we select sub-images in the search range as part candidates. Probability scores are computed for different combinations of part candidates, and the combination with the highest score is used as the measurement. In the sections below, we first introduce the detailed structure of the Kalman Filter; then we present our measurement algorithm using the part-based model.

A. Structure of the Kalman Filter

As shown in Fig. 1, we define an estimate of an unknown state as s, where s represents the object location in each frame. The value of s at time t is denoted s_t; in other words, s_t indicates the object location at image frame t. Please note that during each iteration of the state estimation (object location estimation), we use two different symbols to represent the two stages of the estimation: s_t^- represents the initial state estimation, while s_t indicates the final state estimation. The measurement made at time t is denoted m_t. In the video tracking context, m_t is the detected object location in the current frame, obtained without considering any information from previous images. The measurement m_t is combined with the initial estimation s_t^- to obtain the final estimation s_t. The variable P_t represents the uncertainty of the state estimation quantitatively. Recall that there are two models in the Kalman Filter: the motion model and the measurement model. The motion model defines how s changes with time and the measurement model is used to

update the initial estimate. Following the assumption of a linear relationship between s_t and s_{t-1}, and between m_t and s_t, we define the motion model and the measurement model as

s_t = A_{t-1} s_{t-1} + V_{t-1}    (1)

m_t = H_t s_t + W_t    (2)

where V_{t-1} is a random variable with mean 0 and covariance matrix Q_{t-1}, and W_t is another random variable with zero mean and covariance matrix R_t. After defining the two models of the Kalman filter, we need to perform three operations at each time stamp t: Initial Prediction, Kalman Gain Calculation, and Final Estimation. We introduce each step as follows:

1) Initial Prediction: This is the first step for estimating the state s at time stamp t. As we mentioned before, s indicates the object location and there are two symbols for the state: s_t^- (which represents the initial state estimation) and s_t (which is the final state estimation). In this initial prediction stage, we obtain the value of s_t^-. We also compute the value of P_t^-, which indicates the variance of the initial estimation. The equations for calculating these two variables are listed below:

s_t^- = A_{t-1} s_{t-1}    (3)

P_t^- = A_{t-1} P_{t-1} A_{t-1}^T + Q_{t-1}    (4)

where A_{t-1} is the coefficient defined in Equation (1) and Q_{t-1} is the covariance of V_{t-1}, the random variable used in Equation (1).

2) Kalman Gain Calculation: The Kalman filter uses a value called the "Kalman gain" to combine the initial prediction and the measurement. It is chosen so that the variance of the final estimate, s_t, is as small as possible. The Kalman gain K_t is defined as

K_t = P_t^- H_t^T (H_t P_t^- H_t^T + R_t)^{-1}    (5)

where H_t is the coefficient defined in Equation (2), P_t^- is the variance of the initial estimation s_t^- and is computed with Equation (4), and R_t is the variance of W_t, the random variable used in Equation (2).

3) Final Estimation: In this step, we update the state estimation to obtain the final estimation s_t and its variance P_t. The equations used for the calculation are listed as follows:

s_t = s_t^- + K_t (m_t - H_t s_t^-)    (6)

P_t = P_t^- - K_t H_t P_t^-    (7)

where s_t^- and P_t^- are the initial estimation of the state and its variance, obtained from Step 1 (Initial Prediction); K_t is the Kalman gain obtained from the second step (Kalman Gain Calculation); H_t is the coefficient used in Equation (2); and m_t is the result of the measurement. We will introduce how to compute the measurement in the following section.

Fig. 1. The structure of the Kalman Filter. The motion model produces the initial estimate of s_t; the measurement m_t updates it to the final estimate of s_t, from which the initial estimate of s_{t+1} is predicted.
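The predict-gain-update cycle of Equations (1)-(7) can be sketched for a one-dimensional state (a single location coordinate). This is an illustrative sketch rather than the paper's implementation; the scalar values of A, H, Q, and R are assumptions chosen for the example.

```python
def kalman_step(s_prev, p_prev, m, A=1.0, H=1.0, Q=0.01, R=0.1):
    """One Kalman iteration for a scalar state (object location).

    A and H are the coefficients of Equations (1)-(2); Q and R are the
    variances of V and W. The default values are illustrative assumptions.
    """
    s_pred = A * s_prev                      # Eq (3): initial prediction
    p_pred = A * p_prev * A + Q              # Eq (4): variance of the prediction
    K = p_pred * H / (H * p_pred * H + R)    # Eq (5): Kalman gain
    s = s_pred + K * (m - H * s_pred)        # Eq (6): final state estimate
    p = p_pred - K * H * p_pred              # Eq (7): variance of the estimate
    return s, p

# Track a sequence of noisy location measurements.
s, p = 0.0, 1.0
for m in [1.0, 1.9, 3.2, 4.1]:
    s, p = kalman_step(s, p, m)
```

Because the gain K lies in (0, 1) here, each final estimate falls between the prediction and the measurement, and the updated variance p is always smaller than the predicted variance p_pred.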

B. Part-based Model for Measurement

We present our method for computing the measurement m_t in this section. The result of the measurement m_t is used to update the initial estimate via Equation (6). In our object class tracking domain, the goal of the measurement is to find the possible new location in frame t without considering the previous frame t-1. This is essentially an object class detection problem. We propose a part-based model to solve it. A restricted pictorial structure model is used to encode the appearance and shape information. In this

model, the object is represented by a set of parts {p_0, p_1, p_2, ..., p_P}, where the number of parts is (P+1). We define a P-star graph G = (V, E), where V = {v_0, v_1, v_2, ..., v_P} corresponds to the (P+1) parts and an edge (v_0, v_j) ∈ E indicates a pair of connected parts; v_0 is the root node with vertex degree P, and v_j (j ∈ [1, P]) represents a leaf node with vertex degree 1. In addition, we define a configuration L = (l_0, l_1, l_2, ..., l_P) to represent an instance of the object, where l_i (i ∈ [0, P]) indicates the location of part v_i. After constructing the P-star graph and establishing the mapping between object parts and graph nodes, we consider the model parameters. Let θ_f = (A, X) be a set of parameters that define a foreground object model, where the parameters A = {a_0, a_1, a_2, ..., a_P} represent the appearance of the parts, and the parameters X = {x_{0j} | (v_0, v_j) ∈ E, j ∈ [1, P]} characterize the spatial relationship between connected parts.

From a statistical point of view, our model can best be explained as follows. Suppose we have already learned a set of parameters θ_f for the foreground object, and all non-object background images are modeled by a fixed set of parameters θ_b. Given a new image I, we can determine whether the image contains an instance of the object by considering the posterior ratio R using Bayes' theorem:

R = p(h_1|I) / p(h_0|I) = [p(I|h_1) p(h_1)] / [p(I|h_0) p(h_0)] ≈ [p(I|θ_f) p(h_1)] / [p(I|θ_b) p(h_0)]    (8)

where h_1 represents the hypothesis that I contains an instance of the object and h_0 represents the hypothesis that I contains background only. The rightmost expression in Equation (8) is an approximation, because we represent the category with an imperfect model [11]. To compute the posterior ratio R, we need to obtain the likelihood ratio and the prior ratio. The prior ratio can either be estimated from training data or be set to a constant value manually. Two likelihood probabilities, p(I|θ_f) and p(I|θ_b), are needed to compute the likelihood ratio. The denominator p(I|θ_b) is the likelihood of seeing an image with the background parameters; it can be considered a constant for a given image [11]. The numerator p(I|θ_f) indicates the likelihood of seeing an image with the foreground parameters. To obtain this value, we should sum over all possible object configurations [10]. Using conditional probability principles, we get the following equation:

p(I|θ_f) = Σ_{k=1}^{K} p(I|L_k, θ_f) p(L_k|θ_f)    (9)

where L_k represents a possible configuration and K is the total number of configurations. We define the score of the configuration L_k as the probability that this configuration occurs in the image I with the foreground parameters θ_f; we term this probability p(L_k|I, θ_f). In addition, we name the configuration with the highest probability the best configuration of this image and denote it by the symbol L. In practice, we obtain p(I|θ_f) by selecting only the best configuration instead of summing over all configurations, since background images usually contain many low-scoring configurations, which cause false positives [8, 11]. Based on this assumption and Equation (9), we get the following formula:

p(I|θ_f) = Σ_{k=1}^{K} p(I|L_k, θ_f) p(L_k|θ_f) ≈ p(I|L, θ_f) p(L|θ_f)    (10)

where L represents the best configuration. Next we apply Bayes' theorem to the probability score of the best configuration L and get the following formula:

p(L|I, θ_f) ∝ p(I|L, θ_f) p(L|θ_f)    (11)

From Equation (10) and Equation (11), we can convert the problem of computing the likelihood of seeing an image with the foreground parameters (p(I|θ_f), the left expression of Equation (10)) into the problem of computing the probability of the best configuration given the image and the foreground parameters (p(L|I, θ_f), the left expression of Equation (11)). This conversion simplifies the recognition step: given a new image, we can determine the existence of the object by considering only the probability score of the best configuration.
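In log space, the decision rule implied by Equation (8) amounts to comparing log-likelihoods plus log-priors. The sketch below assumes the two likelihoods are already available as log values (computing them is the subject of the rest of this section); the function name and the uniform prior values are hypothetical.

```python
import math

def posterior_ratio(loglik_fg, loglik_bg, prior_fg=0.5, prior_bg=0.5):
    """Approximate posterior ratio R of Equation (8).

    loglik_fg = log p(I | theta_f), loglik_bg = log p(I | theta_b).
    The uniform priors are an illustrative assumption; working with the
    log-likelihood difference avoids numerical underflow.
    """
    return math.exp(loglik_fg - loglik_bg) * (prior_fg / prior_bg)

# Declare that the image contains an instance of the object when R > 1.
contains_object = posterior_ratio(-120.5, -123.0) > 1.0
```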

The first term in Equation (11), the likelihood probability p(I|L, θ_f), represents the likelihood of seeing an image given that the object is at a particular configuration, and it depends only on the appearance of the parts. If the parts do not overlap (which is true in our case), we can assume that each part is independent. Hence, the likelihood probability can be factored as follows:

p(I|L, θ_f) = p(I|L, A) ∝ Π_{i=0}^{P} p(I|l_i, a_i)    (12)

where l_i represents the location of part v_i and a_i indicates the appearance parameters of that part. The second term in Equation (11), the prior probability p(L|θ_f), models the prior distribution over object configurations and relies only on the spatial relations among the connected parts. It can be captured by a tree-structured Markov Random Field with edge set E, which equals the joint distribution for pairs of parts connected by edges divided by the joint distribution of each part [12, 13]. Since we use a P-star graph to model the relative spatial difference between the root and the leaves, we can simplify the prior distribution with the following equation:

p(L|θ_f) = Π_{(v_0, v_j) ∈ E} p(l_0, l_j | c_{0j})    (13)

where v_0 and v_j are a connected pair, with v_0 the root node and v_j a leaf node; l_0 and l_j are the locations of parts v_0 and v_j; and c_{0j} are the parameters modeling the connection between parts v_0 and v_j. Using Equation (12) and Equation (13) to replace the two terms in the right expression of Equation (11), we get the following formula:

p(L|I, θ_f) ∝ Π_{i=0}^{P} p(I|l_i, a_i) ⋅ Π_{(v_0, v_j) ∈ E} p(l_0, l_j | c_{0j})    (14)

In order to solve Equation (14), we need to obtain the parameters of both the appearance model and the shape model by learning from training examples. For recognition purposes, we also need to assign an appearance probability score to each part candidate and a shape probability score to each pair of connected parts. In the next sections, we will introduce how we learn the model parameters from training examples and how to recognize an instance of the object in a new image by computing the probability score.
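To make the measurement step concrete, the following sketch maximizes the log of Equation (14) for a star graph, assuming 1-D part locations, Gaussian spatial terms, and precomputed appearance log-scores for each candidate; the function names, candidate lists, and variance value are hypothetical. Because the graph is a star, once a root location is fixed each leaf can be optimized independently, which is what makes the search efficient (cf. the K-fan matching of [10]).

```python
import math

def gauss_logpdf(x, mean, var):
    # Log of a 1-D Gaussian density, standing in for the spatial terms of Eq (13).
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def best_configuration(root_cands, leaf_cands, ideal_offsets, var_sp=4.0):
    """Maximize the log of Eq (14) over a P-star graph.

    root_cands: [(location, appearance_log_score), ...] for the root part.
    leaf_cands: one such candidate list per leaf part.
    ideal_offsets: mean offset of each leaf from the root, a stand-in
    for the learned connection parameters c_0j.
    """
    best_score, best_locs = float("-inf"), None
    for r_loc, r_app in root_cands:
        total, locs = r_app, [r_loc]
        for cands, off in zip(leaf_cands, ideal_offsets):
            # Given the root, each leaf's best candidate is chosen
            # independently: appearance log-score plus spatial log-score.
            score, loc = max(
                (app + gauss_logpdf(l - r_loc, off, var_sp), l)
                for l, app in cands
            )
            total += score
            locs.append(loc)
        if total > best_score:
            best_score, best_locs = total, locs
    return best_score, best_locs

# Two root candidates, one leaf part with two candidates, ideal offset 2.0.
score, locs = best_configuration(
    root_cands=[(0.0, 0.0), (5.0, -1.0)],
    leaf_cands=[[(2.0, 0.0), (10.0, 0.0)]],
    ideal_offsets=[2.0],
)
```

The locations of the best configuration found this way would serve as the measurement m_t for the Kalman update of Section II.A.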

III. CONCLUSION

In this paper, we have introduced a theoretic framework for object class tracking. The objective of object class tracking is to track an object class in a video stream instead of tracking specific objects. We build a Kalman Filter that uses a part-based model for measurement. This framework has the potential to overcome the large intra-class variance problem faced by object class tracking. Object class tracking can potentially be used in many application domains; we are now pursuing applications of this method in areas such as animal tracking in biology and anatomy tracking in medicine.

Future directions include new searching and matching methods for the part-based model to reduce the time cost, and new models to improve the accuracy of the measurement (object detection).

REFERENCES

[1] E. Trucco and K. Plakas, "Video tracking: a concise survey," IEEE Journal of Oceanic Engineering, vol. 31, pp. 520-529, 2006.

[2] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 34, pp. 334-352, 2004.

[3] P. S. Maybeck, Stochastic Models, Estimation, and Control. Burlington, MA, USA: Academic Press, 1982.

[4] M. Isard and A. Blake, "CONDENSATION -- conditional density propagation for visual tracking," International Journal of Computer Vision, vol. 29, pp. 5-28, 1998.

[5] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 564-575, 2003.

[6] M. A. Fischler and R. A. Elschlager, "The representation and matching of pictorial structures," IEEE Transactions on Computers, vol. 22, pp. 67-92, 1973.

[7] M. Weber, M. Welling, and P. Perona, "Towards automatic discovery of object categories," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, USA, 2000.

[8] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI, USA, 2003.

[9] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," International Journal of Computer Vision, vol. 61, pp. 55-79, 2005.

[10] D. Crandall, P. Felzenszwalb, and D. Huttenlocher, "Spatial priors for part-based recognition using statistical models," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 2005.

[11] R. Fergus, P. Perona, and A. Zisserman, "Weakly supervised scale-invariant learning of models for visual recognition," International Journal of Computer Vision, 2006.

[12] G. Poggi and R. P. Ragozini, "Image segmentation by tree-structured Markov random fields," IEEE Signal Processing Letters, vol. 6, pp. 155-157, 1999.

[13] C. D'Elia, G. Poggi, and G. Scarpa, "A tree-structured Markov random field model for Bayesian image segmentation," IEEE Transactions on Image Processing, vol. 12, pp. 1259-1273, 2003.