Video Analysis in Pan-Tilt-Zoom Camera Networks
[From master-slave to cooperative smart cameras]
[Christian Micheloni, Bernhard Rinner, and Gian Luca Foresti]
Pan-tilt-zoom (PTZ) cameras are able to
dynamically modify their field of view
(FOV). This functionality introduces new
capabilities to camera networks such as
increasing the resolution of moving tar-
gets and adapting the sensor coverage. On the other
hand, PTZ functionality requires solutions to new chal-
lenges such as controlling the PTZ parameters, estimat-
ing the ego motion of the cameras, and calibrating the
moving cameras.
This tutorial provides an overview of the main video processing techniques and the current trends in this active field of research. Autonomous PTZ cameras mainly aim to
detect and track targets with the largest possible resolution.
Autonomous PTZ operation is activated once the network detects and identifies an object as a target of interest and
requires accurate control of the PTZ parameters and coordi-
nation among the cameras in the network. Therefore, we
present cooperative localization and tracking methods, i.e.,
multiagent- and consensus-based approaches to jointly com-
pute the target’s properties such as ground-plane position
and velocity. Stereo vision exploiting wide baselines can be
used to derive three-dimensional (3-D) target localization.
This tutorial further presents different techniques for con-
trolling PTZ camera handoff, configuring the network to
dynamically track targets, and optimizing the network con-
figuration to increase coverage probability. It also discusses
implementation aspects for these video processing tech-
niques on embedded smart cameras, with a special focus on
data access properties.
INTRODUCTION
The progress in sensors and computing has enabled the development of novel camera networks and processing frameworks. In particular, small- to medium-size video surveillance networks have
been deployed in different environments such as airports, train
stations, public parks, and office buildings [17].
The availability of new information guided researchers in
developing new intelligent surveillance systems [38], in which
dozens of cameras cooperate to solve complex tasks such as the
tracking of moving objects. Originally, such a task had been
tackled by single cameras. The goal was to achieve a consistent
detection of the track and preservation of correct labeling in
single and multiple object scenarios, respectively.
In modern surveillance systems, one of the major chal-
lenges in multicamera tracking is the consistency of the
objects’ labeling through different FOVs. Multisensor calibra-
tion techniques for the transition between the cameras’ views
or the development of multitarget-multisensor tracking
mechanisms have been the research core for such kinds of networks [10], [5], [23].
Static camera networks, even though they allow a wide area to be covered for monitoring activities, have different limitations. Normally, camera deployment has to consider overlapping the individual FOVs. Image resolutions are usually low due to the
wide angle view adopted to increase the monitored area. When
no overlapping FOVs are present, signal processing still has to be performed on a single data source. As a consequence, occlu-
sions cannot be solved using multisensor-based techniques.
Localization in the real world is limited by ground-plane con-
straints affecting the performance when lower sections of
objects are occluded. All these problems can be reduced if any
point in the environment belongs to two or more FOVs of the
cameras. Naturally, such a constraint poses severe limitations
for networks covering large areas due to the huge number of video sensors needed.
The introduction of PTZ cameras brought new capabilities
to surveillance networks as well as new problems to be solved.
In particular, PTZ cameras can adapt their FOVs, which can be exploited to focus the video-network attention on areas of inter-
est. Hence, some problems related to nonoverlapping FOVs can
be overcome (e.g., occlusions, localization and low target reso-
lution). On the other hand, the PTZ movement has introduced
the nontrivial ego-motion estimation problem. Thus, well
known low-level techniques such as motion detection and cali-
bration needed further development to be feasible for these new
moving sensors.
This tutorial provides a comprehensive introduction of
PTZ camera networks for active surveillance. We focus on the
different signal processing methods to achieve robust object
tracking and localization by means of PTZ cameras’ coopera-
tion and reconfiguration. The tutorial is organized to show
the evolution of the signal processing techniques from single
autonomous PTZ cameras to more complex cooperative
smart networks (see Figure S1 in “The Evolution of PTZ
Camera Cooperation”). It covers low-level techniques for
autonomous object detection as well as more high-level
methods for the cooperation of different moving cameras. In
particular, the analysis of the localization performance of
static and PTZ camera networks is discussed. The dynamic configuration of the PTZ cameras is further described to propose a novel research stream focusing on the computation of the optimal configuration of the PTZ network to improve coverage, handoff, and tracking. We then present techniques for embedding such algorithms into smart cameras to realize new autonomous and distributed video networks, which finally results in a discussion on the future of smart reconfigurable sensor networks.

THE EVOLUTION OF PTZ CAMERA COOPERATION

The simplest usage of PTZ cameras is the master-slave interaction. The network composed of static cameras performs all the processing for the event analysis and computes the camera motion parameters to direct the PTZ gaze onto the object of interest. To introduce autonomous PTZ cameras, different active vision techniques can be applied to make the moving camera more intelligent. It is then possible to locally process the video stream to extract useful information about the activities occurring within the monitored scene. By exploiting active vision techniques together with common static camera networks, it is possible to realize cooperative frameworks in which the analysis is performed by processing both the streams coming from PTZ and static cameras. Moreover, the ability of PTZ cameras to modify their FOV on the basis of the system needs (e.g., higher resolution and more FOV overlap) allows ad hoc stereo vision systems, dynamic handover areas, and so on to be defined. The development of such techniques on board modern smart cameras makes it possible to consider different parameters (e.g., area coverage, power consumption, and bandwidth usage) to compute the optimal configuration of the network in terms of the number of sensors switched on and their configuration (PTZ values).

[FIGS1] Different architectures for networks exploiting PTZ cameras: master-slave, autonomous PTZ, and cooperative smart cameras.
AUTONOMOUS PTZ CAMERAS
Aloimonos was the first to propose the active vision paradigm to
describe the dynamic interaction between the observer and the
object observed to actively decide what to see [2]. In this way,
the video sensor can be controlled to keep the gaze on a selected
target. Such a simple idea brought a new set of problems that were nonexistent in the context of static cameras. A first problem is represented by the fact that, in the context of moving cameras, even pixels belonging to static objects appear to move in the camera frame. Such an effect is called ego motion, and its estimation and compensation represented the first objective of the active vision research area.
EGO-MOTION ESTIMATION
Following the active vision paradigm, Murray and Basu [34]
investigated the problem of tracking moving objects by
means of a pan-tilt camera. A first solution to the estimation of the ego motion was given: an analytic determination of the image transformation for the registration of the images was proposed. Such a transformation allows the ego motion to be compensated when the camera rotates about its optical center. In other words, the registration problem consists in determining a transformation $T$ such that, for two time-related frames $I(\mathbf{x}, t)$ and $I(\mathbf{x}+d\mathbf{x}, t+dt)$, the following equation holds:

$$I(\mathbf{x}, t) = T \cdot I(\mathbf{x}+d\mathbf{x}, t+dt).$$
When different camera motions are considered, different registration techniques, and thus different transformations $T$, have to be taken into account to model the ego-motion effects. A survey on image registration techniques [50] explains how particular transforms (translational, affine, and perspective) have to be computed for registering images.
Irani and Anandan [19] addressed the problem of detecting moving objects by using a direct method that estimates a dominant eight-parameter affine transformation. Araki et al. [3] proposed to estimate the background motion by using a set of features and an iterative process (least median of squares) to solve the overdetermined linear system. All these systems, feature based or linear, have to cope with outlier detection to filter out badly tracked features or mismatched points [16]. Micheloni and Foresti [32] devel-
oped a new feature rejection rule for deleting outliers, based
on a feature-clustering technique [15]. Once the images are
registered and ego motion compensated (i.e., static pixels
overlap), the moving objects can be extracted with frame-by-
frame differencing techniques. However, when zoom operations are needed to focus on an object of interest, such techniques do not guarantee robust results.
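To make the feature-based approach concrete, the following Python/OpenCV sketch estimates a dominant affine transform between two consecutive frames from tracked corners, warps the previous frame to compensate the ego motion, and extracts moving pixels by differencing. It is only an illustrative sketch (thresholds, feature counts, and function names are arbitrary choices), not the specific pipeline of [3], [16], or [32].

```python
import cv2
import numpy as np

def detect_moving_pixels(prev_gray, curr_gray, diff_thresh=25):
    """Estimate a dominant affine ego motion between two grayscale frames,
    compensate it, and return a binary motion mask (illustrative sketch)."""
    # Track sparse corner features from the previous to the current frame.
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400,
                                       qualityLevel=0.01, minDistance=7)
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   prev_pts, None)
    ok = status.ravel() == 1
    prev_ok, curr_ok = prev_pts[ok], curr_pts[ok]

    # Robustly fit the dominant (background) affine transform; RANSAC plays
    # the role of the outlier rejection discussed above.
    A, inliers = cv2.estimateAffine2D(prev_ok, curr_ok, method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)

    # Warp the previous frame so that static pixels overlap the current one.
    h, w = curr_gray.shape
    prev_warped = cv2.warpAffine(prev_gray, A, (w, h))

    # Frame differencing after compensation highlights the moving objects.
    diff = cv2.absdiff(curr_gray, prev_warped)
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    return A, mask
```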
FIXATION
Rather than computing and compensating the ego motion dur-
ing zoom operations, a better solution is to directly track the
target by adopting a fixation point to be kept during zoom.
Tordoff and Murray [46] solve the problem of tracking a fixation
point, mainly the center of mass of an object, by adopting an
affine projection. In particular, when the distance between the
object and the camera is much greater than the depth of the object (as it often happens in video surveillance applications), it is possible to consider the following affine projection:

$$\mathbf{x} = M\mathbf{X} + \mathbf{t}, \qquad (1)$$

where $\mathbf{x}$ is the fixation point, $M$ is the affine transform, $\mathbf{X}$ is the real position of the point, and $\mathbf{t}$ is a translation vector.
COMPUTING THE FIXATION POINT
Considering four coplanar points $A, B, C, O$ belonging to the same object and defining a vector space $\mathcal{B} = \{A-O, B-O, C-O\}$, we can determine all other points belonging to the object or to the space. Moreover, a generic point $X$ can be defined by means of the three affine coordinates $\alpha$, $\beta$, $\gamma$ as follows:

$$X = \alpha(A-O) + \beta(B-O) + \gamma(C-O) + O. \qquad (2)$$
Given two different views (e.g., two consecutive frames of a moving camera) of the four points, $(a, b, c, o)$ and $(a', b', c', o')$, it is possible to compute the affine coordinates of a fifth point $X$ by solving the overdetermined linear system

$$\begin{bmatrix} x - o \\ x' - o' \end{bmatrix} = \begin{bmatrix} a-o & b-o & c-o \\ a'-o' & b'-o' & c'-o' \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix}. \qquad (3)$$

It is worth noticing that, after having computed the affine coordinates of the point $X$, we do not need to compute them again in a new view, since the point is given by

$$x'' = \alpha(a''-o'') + \beta(b''-o'') + \gamma(c''-o'') + o''. \qquad (4)$$

Therefore, during the tracking phase, supposing that the fixation point has a linear transformation from position $g$ at time instant $t$ to position $g'$ at time instant $t+1$, we can compute its affine coordinates $[\alpha_g, \beta_g, \gamma_g]^T$ by adopting (3). On a further frame $t+2$, the four points are projected to new positions $a'', b'', c'', o''$, and (4) will give the new position of the fixation point $g''$. The main issue in adopting this technique concerns the determination of the points to use during the computation of the affine transform.
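A minimal NumPy sketch of (3) and (4), assuming the four basis points and the fixation point are available as 2-D coordinates in two views; all variable names and the synthetic example are illustrative.

```python
import numpy as np

def affine_coordinates(x, a, b, c, o, x2, a2, b2, c2, o2):
    """Solve the overdetermined system (3) for the affine coordinates
    (alpha, beta, gamma) of point x from two views (sketch)."""
    A = np.vstack([np.column_stack([a - o, b - o, c - o]),
                   np.column_stack([a2 - o2, b2 - o2, c2 - o2])])   # 4x3
    rhs = np.concatenate([x - o, x2 - o2])                          # length 4
    coeffs, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return coeffs  # (alpha, beta, gamma)

def reproject_fixation(coeffs, a3, b3, c3, o3):
    """Apply (4): predict the fixation point in a new view from the
    affine coordinates and the re-detected basis points."""
    alpha, beta, gamma = coeffs
    return alpha * (a3 - o3) + beta * (b3 - o3) + gamma * (c3 - o3) + o3

# Example with synthetic points: two views related by an affine map.
M = np.array([[1.1, 0.05], [-0.02, 0.95]])
t = np.array([3.0, -2.0])
warp = lambda p: M @ p + t
a, b, c, o, x = (np.array(v, float) for v in
                 ([10, 10], [40, 12], [22, 35], [20, 20], [25, 24]))
coeffs = affine_coordinates(x, a, b, c, o,
                            warp(x), warp(a), warp(b), warp(c), warp(o))
print(reproject_fixation(coeffs, warp(a), warp(b), warp(c), warp(o)))  # ~ warp(x)
```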
CLUSTERING FEATURE POINTS
To solve this problem, it is possible to
adopt a feature clustering technique able
to classify a tracked feature as belonging
to different moving objects or to the
background. The objective is first to iden-
tify and compensate the motion induced
by the zoom, if present, then to apply a
clustering technique to identify the fea-
tures belonging to objects with different
velocities. In this way, all the features
belonging to static objects after zoom
compensation have a null displacement.
The resulting algorithm is shown in
Algorithm 1.
[ALGORITHM 1] CLUSTERING.
repeat
Feature extraction and tracking
Clusters computation
Background cluster deletion
Computation of the center of mass for each cluster
until zooming
CLUSTER COMPUTATION
To detect the clusters, we first need to compute the affine transform for each tracked object. Such a computation is performed over all the features belonging to an object. In particular, let $\hat{A}$ be the computed affine transform, and let $\mathbf{f}_i^{t-1}$ and $\mathbf{f}_i^{t}$ be the positions of a generic feature $i$ at time instants $t-1$ and $t$, respectively. At this point, the effective displacement of feature $i$ is $\mathbf{d}_i = \mathbf{f}_i^{t} - \mathbf{f}_i^{t-1}$, while the one estimated from the effects of the affine transform is

$$\tilde{\mathbf{d}}_i = \mathbf{f}_i^{t} - \tilde{\mathbf{f}}_i^{t-1} = \mathbf{f}_i^{t} - \hat{A}\,\mathbf{f}_i^{t-1}, \qquad (5)$$

where $\tilde{\mathbf{f}}_i^{t-1}$ represents the position of feature $i$ after the compensation of the camera motion by means of the affine transform $\hat{A}$.
A graphical representation of the differences between the two displacements, effective and estimated, is shown in Figure 1(a). Let $TFS_{obj}$ be the set of features extracted from a window (i.e., fovea) centered on the object of interest. Then, the cluster computation is performed according to the following rule:

$$C_{obj}(\tilde{\mathbf{d}}) = \{ f_i \in TFS_{obj} \,:\, \|\tilde{\mathbf{d}}_i - \tilde{\mathbf{d}}\| \le r_{tol} \}, \qquad (6)$$

where $C_{obj}(\tilde{\mathbf{d}})$ is the cluster containing all the features $i$ whose displacement $\tilde{\mathbf{d}}_i$ is such that the norm between it and the vector $\tilde{\mathbf{d}}$ is lower than a predefined threshold $r_{tol}$. An example of feature clustering is shown in Figure 1(b).
BACKGROUND CLUSTER DELETION
Once the computation of all the clusters is done, the back-
ground cluster can be easily found as well as all the features
that have been erroneously extracted from the background
in the previous feature extraction step. In particular, after
applying the affine transform to the features, if these belong
to the background then they should have a null or a small
displacement. Hence, to determine the background clusters, the rule

$$C_{obj}(\tilde{\mathbf{d}}_k) \in \begin{cases} \text{background} & \text{if } \|\tilde{\mathbf{d}}_k\| \le r_{tol} \\ \text{object} & \text{otherwise} \end{cases} \qquad (7)$$

is adopted.
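The rules (5)-(7) can be sketched in a few lines of NumPy; the simple sequential grouping below is only a stand-in for the feature-clustering technique of [15], and the tolerance value is arbitrary.

```python
import numpy as np

def cluster_displacements(f_prev, f_curr, A_hat, r_tol=2.0):
    """Group features by their motion-compensated displacement (5)-(6) and
    flag near-zero clusters as background (7). Illustrative sketch, not the
    exact clustering of [15]."""
    # Homogeneous coordinates so the 2x3 affine transform applies directly.
    ones = np.ones((f_prev.shape[0], 1))
    f_prev_h = np.hstack([f_prev, ones])
    d_tilde = f_curr - f_prev_h @ A_hat.T      # compensated displacements (5)

    clusters, labels = [], -np.ones(len(d_tilde), dtype=int)
    for i, d in enumerate(d_tilde):
        for k, centre in enumerate(clusters):
            if np.linalg.norm(d - centre) <= r_tol:        # rule (6)
                labels[i] = k
                break
        else:
            clusters.append(d)
            labels[i] = len(clusters) - 1

    # A cluster whose representative displacement is (almost) null is
    # background after ego-motion compensation, as in (7).
    is_background = [np.linalg.norm(c) <= r_tol for c in clusters]
    return labels, np.array(clusters), is_background
```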
CENTER OF MASS COMPUTATION
After deleting all the features either belonging to the back-
ground or to a cluster with cardinality lower than three [the
minimum number of needed features to solve (4)], the tech-
nique proposed by Tordoff-Murray [46] can be applied. Thus the
fixation point of each cluster is computed. Let $\mathbf{g}_i$ be the fixation point we need to compute for each object $i$; then we need to solve the following equation:

$$\mathbf{g}'_{\tilde{\mathbf{d}}_k} = N\,\mathbf{g}_i + \mathbf{r}, \qquad (8)$$

where $\mathbf{g}'_{\tilde{\mathbf{d}}_k}$ is the new position of the fixation point for the set of features belonging to the cluster $C_{obj}(\tilde{\mathbf{d}}_k)$ and $\mathbf{g}_i$ is the previous fixation point of the same object. The matrices $N = \begin{pmatrix} n_1 & n_2 \\ n_3 & n_4 \end{pmatrix}$ and $\mathbf{r} = (r_1\ r_2)^T$ are computed using the singular value decomposition (SVD) on the following linear systems:

$$\begin{pmatrix} \mathbf{f}'_1 \\ \mathbf{f}'_2 \\ \vdots \\ \mathbf{f}'_n \end{pmatrix} \begin{pmatrix} n_1 \\ n_2 \\ r_1 \end{pmatrix} = \begin{pmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_n(x) \end{pmatrix} \qquad (9)$$

$$\begin{pmatrix} \mathbf{f}'_1 \\ \mathbf{f}'_2 \\ \vdots \\ \mathbf{f}'_n \end{pmatrix} \begin{pmatrix} n_3 \\ n_4 \\ r_2 \end{pmatrix} = \begin{pmatrix} f_1(y) \\ f_2(y) \\ \vdots \\ f_n(y) \end{pmatrix}, \qquad (10)$$

where $(\mathbf{f}'_1, \dots, \mathbf{f}'_n)^T$ is the vector containing all the features $\mathbf{f}'_i = (x, y, 1)$ that are considered well tracked, while $f_i(\cdot)$ is
either the $x$ or $y$ coordinate of the feature $i$ at time instant $t-1$.

[FIG1] (a) Visualization of the components of the displacement vector $\mathbf{d}_i$; $\tilde{\mathbf{d}}_i$ represents the component related to the real motion of the feature $f_i$. (b) Clustering on the displacement vectors $\tilde{\mathbf{d}}_i$. Each cluster has a tolerance radius $r_{tol}$.
Following this heuristic, a fixation point is computed for each detected cluster, which allows the center of mass of each moving object to be estimated, thus avoiding the selection of features in the background or belonging to objects that are not of interest. In Figure 2(a), a representation of the computation of
the fixation point is shown. Figure 2(b) shows a real example
of such a computation. Results for the fixation of rigid and
nonrigid objects are presented in Figure 3.
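A compact least-squares sketch of (9) and (10) with NumPy (whose lstsq routine is itself SVD based); variable names are illustrative, and the direction of the mapping follows the notation above.

```python
import numpy as np

def estimate_affine_update(src_h, dst):
    """Least-squares estimate of N (2x2) and r (2,) such that
    dst ~ N @ src + r, with src given in homogeneous form (x, y, 1),
    i.e., one linear system per output coordinate as in (9) and (10)."""
    sol_x, *_ = np.linalg.lstsq(src_h, dst[:, 0], rcond=None)  # (n1, n2, r1)
    sol_y, *_ = np.linalg.lstsq(src_h, dst[:, 1], rcond=None)  # (n3, n4, r2)
    N = np.array([sol_x[:2], sol_y[:2]])
    r = np.array([sol_x[2], sol_y[2]])
    return N, r

# Per (9)-(10), src_h holds the well-tracked homogeneous features of a
# cluster and dst their counterparts in the other frame; (8) then moves
# the cluster's fixation point:  g_new = N @ g + r.
```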
Since the fixation accuracy is important for the 3-D
localization of the objects inside the environment, it is
interesting to notice how the fixation error is dependent on
the rotation speed of the camera. As a matter of fact, in Figure 4, it can be noticed that the proposed fixation tech-
nique works better when the rotation speed increases. This
phenomenon can be explained by considering that the
method is based on a motion classification technique. In
particular, it is based on the capability of segmenting the
background from the foreground motion (i.e., background
versus foreground features).
When the camera rotation speed is low, the object is also moving slowly, and therefore some of the features are misclassified and considered outliers. On the other hand, when the camera rotates faster, the object is also moving faster. Thus, the differences between the velocities of the features belonging to the object and to the background increase. This allows more accurate clusters to be determined, yielding a better estimation of the fixation points. This aspect is really important when command and control has to be considered for moving the PTZ cameras.

[FIG4] Localization error (Euclidean distance, expressed in pixels) computed on the sequences, classified with respect to the pan and tilt rotation speeds.
[FIG2] (a) Graphical representation of the computation within
the area of interest (e.g., fovea) of the fixation point for each
identified cluster. In this way, different objects with different
velocities can be tracked even when they cross their trajectories.
(b) Computation of two different clusters and of the new
fixation point.
[FIG3] Sample frames of two test sequences used for evaluating the active tracking of nonrigid and rigid objects. The small boxes
represent the tracked feature points organized in clusters depending on their color. The red dot represents the computed fixation point
on the selected cluster (white for nonrigid object and yellow for rigid object).
COOPERATIVE CAMERA NETWORKS
Within a visual-surveillance system’s architecture, the cooper-
ation of different cameras can occur at different levels. One of
these is the tracking level. Recent techniques propose the use
of different sensors with overlapping FOVs to achieve a coop-
erative object tracking and localization. In [14], Fleuret et al.
proposed a mathematical framework that combines a robust approach for estimating the probabilities of occupancy of the ground plane at individual time steps with dynamic pro-
gramming to track people over time. Such a scheme requires
that cameras share the majority of their FOVs. While this is
affordable for small areas (indoor rooms, alleys, and halls), it
becomes infeasible when large environments are considered.
In these cases, PTZ cameras can be exploited to generate two
or more overlapping FOVs on areas of interest to solve differ-
ent problems (e.g., occlusions, tracking, and localization).
COOPERATIVE CAMERA TRACKING
When multiple PTZ cameras with different resolutions are
exploited for tracking, new questions arise [17]. How can stan-
dard projective models be extended to the case of heterogeneous
and moving cameras? How can the different object resolutions
be handled to solve the tracking problems?
In [11], Chen et al. derived two different calibration tech-
niques to determine the spatial mapping between a couple of
heterogeneous sensors. Such relations can be exploited to
improve tracking in a centralized way. Coefficients are
assigned—on the basis of the spatial mapping—to the tracking
decisions given by different cameras and the final output fol-
lows the highest score to solve ambiguity and occlusions [43].
More recently, the distributed approach to the tracking prob-
lem has been investigated by exploiting consensus and coopera-
tion in networks [35]. Such a framework requires a consensus
(i.e., an agreement regarding a quantity of interest that depends on the state of all agents) achieved with an algorithm that speci-
fies the information exchange among the neighboring agents.
In the context of cooperative camera tracking, the agents are
represented by cameras’ processes, while the neighboring rela-
tion usually represents the capability of sharing the FOV [45].
Thus, as the target moves through the monitored area, the
neighboring relation defines a dynamic network.
In particular, let a dynamic graph $G = (V, E)$ be the structure adopted to represent the neighboring nodes/cameras involved in tracking a target. $G$ is modified anytime a camera starts/stops acquiring information on the target or a camera changes its PTZ parameters. How to modify $G$ in a proper way is a high-level task: optimization strategies have to be considered to define the best graph for the required task. Instead, once a graph $G$ is defined, consensus algorithms can be applied to agree on the positions and labels of the targets.
In [45], a Kalman-consensus tracking is applied. Each camera
in the graph determines the ground plane position of each target
and computes the information vector and matrix. These are
shared together with the target status with the neighbors. A
Kalman-consensus state is computed by each camera for each tar-
get by fusing the information received from the neighbors. These distributed approaches allow a distributed estimate of the targets' positions to be reached. Hence, a centralized solution that requires a considerable bandwidth usage from each node to the central unit can be avoided. In any case, the localization accuracy and the tracking performance still depend on the ground-plane position estimation performed independently by each camera. Such an estimation is subject to occlusion-related errors.
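To give a flavor of the Kalman-consensus idea, the sketch below fuses the information vectors and matrices received from the neighborhood and then applies a consensus correction toward the neighbors' estimates. It is a simplified illustration with arbitrary models and gains, not the exact filter of [45].

```python
import numpy as np

def kalman_consensus_step(x_i, P_i, z_list, R_list, neighbor_states,
                          F=None, Q=None, eps=0.05):
    """One simplified Kalman-consensus update for camera i on a target's
    ground-plane state x = (px, py, vx, vy). z_list/R_list hold the position
    measurements and covariances shared within the neighborhood (including
    camera i itself); neighbor_states are the neighbors' current estimates."""
    n = x_i.size
    F = np.eye(n) if F is None else F          # state transition (illustrative)
    Q = 0.01 * np.eye(n) if Q is None else Q   # process noise (illustrative)
    H = np.hstack([np.eye(2), np.zeros((2, 2))])   # only positions are measured

    # 1) Fuse the information vectors/matrices shared within the neighborhood.
    U = sum(H.T @ np.linalg.inv(R) @ H for R in R_list)
    u = sum(H.T @ np.linalg.inv(R) @ z for z, R in zip(z_list, R_list))

    # 2) Information-form measurement update.
    M = np.linalg.inv(np.linalg.inv(P_i) + U)
    x_upd = x_i + M @ (u - U @ x_i)

    # 3) Consensus term pulling the estimate toward the neighbors' estimates.
    if neighbor_states:
        x_upd = x_upd + eps * sum(x_j - x_i for x_j in neighbor_states)

    # 4) Prediction for the next time step.
    x_pred = F @ x_upd
    P_pred = F @ M @ F.T + Q
    return x_pred, P_pred
```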
A solution to such a problem can be given by the 3-D localization of moving objects, exploiting the cooperation between PTZ and static cameras to improve the estimation. In particular, a stereo system of heterogeneous cooperative cameras can be defined to localize moving objects in outdoor areas.
Once a target is selected by the surveillance system, a static
camera and a PTZ camera can be selected to provide stereo
localization of such a target (see Figure S2 in “Cooperation
Among Cameras”).
Calibrated and uncalibrated camera approaches can be chosen to solve the matching and disparity computation problems. The former are able to analytically determine the calibration matrices using a spherical rectification technique [49] or defining the epipolar geometry of dual heads [18]. The latter [26], even though they require a more complex offline initialization, do not require camera calibration and are more robust when heterogeneous sensors with different resolutions are adopted. In the case of uncalibrated approaches, a partial calibration is achieved offline through a lookup table (LUT) that is interpolated online to determine the unknown parameters. The LUT contains different pairs of rectification transformations $(H_r^i, H_l^i)_{i=1,2,\dots,n}$ for $n$ different pairs of stereo images captured at arbitrary pan and tilt settings.
CONSTRUCTION OF THE LUT
The main steps to construct the LUT are as follows:
1) Select the different pan angles $p_i$, $i = 1, 2, \dots, m$, by sampling the whole pan range of the PTZ camera into $m$ equal intervals. The selection of the tilt angular values $t_j$, $j = 1, 2, \dots, n$, is determined in a similar way.
2) Compute the pairs of rectification transformations $(H_r^{ij}, H_l^{ij})$, $i = 1, 2, \dots, m$, $j = 1, 2, \dots, n$, for each pair of pan and tilt values $(p_i, t_j)$. A rectification algorithm [20] based on the pairs of matching pixels can be used. The pairs of matching pixels are obtained using SIFT matching [29].
3) Store all these $r = m \times n$ pairs of rectification transformations in an LUT (a possible implementation is sketched below).
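A possible OpenCV implementation of the LUT construction; grab_frames(pan, tilt) is a hypothetical acquisition helper, and SIFT matching, fundamental-matrix estimation, and uncalibrated rectification are used here in place of whatever specific rectification algorithm [20] prescribes.

```python
import cv2
import numpy as np

def rectification_pair(img_l, img_r):
    """Estimate the pair of rectification homographies (H_l, H_r) for one
    stereo image pair from SIFT matches (illustrative sketch)."""
    sift = cv2.SIFT_create()
    kp_l, des_l = sift.detectAndCompute(img_l, None)
    kp_r, des_r = sift.detectAndCompute(img_r, None)

    # Lowe's ratio test on brute-force matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des_l, des_r, k=2)
            if m.distance < 0.75 * n.distance]
    pts_l = np.float32([kp_l[m.queryIdx].pt for m in good])
    pts_r = np.float32([kp_r[m.trainIdx].pt for m in good])

    F, mask = cv2.findFundamentalMat(pts_l, pts_r, cv2.FM_RANSAC)
    inl = mask.ravel() == 1
    h, w = img_l.shape[:2]
    _, H_l, H_r = cv2.stereoRectifyUncalibrated(pts_l[inl], pts_r[inl], F, (w, h))
    return H_l, H_r

def build_lut(pan_angles, tilt_angles, grab_frames):
    """Store one rectification pair per sampled (pan, tilt) setting."""
    lut = {}
    for p in pan_angles:
        for t in tilt_angles:
            img_l, img_r = grab_frames(p, t)   # hypothetical acquisition helper
            lut[(p, t)] = rectification_pair(img_l, img_r)
    return lut
```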
INTERPOLATION OF THE LUT
To compute the transformation for every possible configuration
of the pan and tilt angles, the LUT values have to be interpolat-
ed. The interpolation method receives the pan and tilt angles as
input and returns the elements of the corresponding rectification transformations $(H_r, H_l)$ as output. This requires considering a nonlinear input-output mapping defined by the function $H = f(p, t)$. For a known set of input-output values, the problem is to find a function $F(\cdot)$ that approximates $f(\cdot)$ over all inputs. That is,
COOPERATION AMONG CAMERAS

Cooperation among cameras is very helpful for solving
problems typically present in tracking based on single cam-
era analysis. Cooperation requires first to identify the col-
laborating cameras and then to decide on the collaboration
methods. A centralized approach represents a naive solu-
tion in which all the video streams are processed in a dedi-
cated node that commands and controls the entire
network. To reduce bandwidth usage of the entire net-
work, distributed approaches can be taken into account.
Defining a neighboring property (e.g., cameras having a target in their FOV) allows cameras to be clustered into subnetworks. Within, and only within, each cluster, information (e.g., preprocessed data) is shared among cameras to achieve a common analysis of some target property (i.e., position and velocity). Concerning cooperative tracking, consensus algorithms appear to be effective in computing the ground-plane position of a target in a distributed way. However, since such distributed techniques depend on the computations of each node, a real 3-D localization is still subject to occlusion problems. Thus, to have a reliable 3-D localization of a target, stereo vision techniques can be exploited. The system can select the best camera [36] and its closest PTZ camera inside the cluster.
Between the two selected image streams, different techniques can be applied to solve the matching problem and the disparity estimation through rectification. For a pair of stereo images $I_l$ and $I_r$, the rectification can be expressed as

$$I_l^r = H_l \ast I_l, \qquad I_r^r = H_r \ast I_r,$$

where $(I_l^r, I_r^r)$ are the rectified images and $(H_l, H_r)$ are the rectification matrices. These rectification transformations can be obtained by minimizing

$$\sum_i \left[ (\mathbf{m}_r^i)^T H_r^T F_\infty H_l \mathbf{m}_l^i \right],$$

where $(\mathbf{m}_l^i, \mathbf{m}_r^i)$ are pairs of matching points between the images $I_l$ and $I_r$ and $F_\infty$ is the fundamental matrix for a rectified pair of images. To solve the point matching problem, direct methods can exploit scale invariant feature transform (SIFT) features [29] or edges [31] to look for correspondences within the whole frames. Due to the computational effort of such matching tasks, stereo techniques can be adopted to restrict the search area. These techniques have to take into account the different orientation of the PTZ cameras with respect to the static ones, the different resolutions, and a possible wide baseline between the two considered stereo sensors.
[FIGS2] Scheme of a cooperative stereo vision system.
$$\| F(p, t) - f(p, t) \| \le P \quad \text{for all } (p, t), \qquad (11)$$

where $P$ is a small error value. Different nonlinear regression tools, such as neural networks or support vector machines, can be adopted to solve such a problem.
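As one way to realize (11), a nonparametric regressor can be fit offline on the stored (pan, tilt) samples and queried online; the sketch below uses scikit-learn's RBF-kernel support vector regression, one regressor per homography element, which is only an illustrative choice of regression tool.

```python
import numpy as np
from sklearn.svm import SVR

class LutInterpolator:
    """Interpolate the 3x3 rectification homographies stored in the LUT:
    one SVR per matrix element, fitted on the sampled (pan, tilt) grid."""
    def __init__(self, lut):
        # lut: {(pan, tilt): (H_l, H_r)} with 3x3 numpy arrays.
        X = np.array(list(lut.keys()), dtype=float)
        H = np.array([np.hstack([hl.ravel(), hr.ravel()])
                      for hl, hr in lut.values()])        # (n_samples, 18)
        self.models = [SVR(kernel='rbf', C=10.0).fit(X, H[:, j])
                       for j in range(H.shape[1])]

    def __call__(self, pan, tilt):
        q = np.array([[pan, tilt]], dtype=float)
        h = np.array([m.predict(q)[0] for m in self.models])
        return h[:9].reshape(3, 3), h[9:].reshape(3, 3)   # (H_l, H_r)
```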
A further issue is given by image differences due to different viewpoints. Many works on rectification assume that the baseline (the distance between the two cameras) is small compared to the distance of the object from the cameras, and thus the two images acquired by the cameras are similar. This allows the detection of the matching points using standard techniques such as SIFT matching [29]. However, in a large surveillance system this assumption is no longer valid. It is not rare to have cameras deployed at distances of 50 m or more from each other. In this situation, the cooperation of two PTZ cameras can give more robust results thanks to higher magnifications and the capability of focusing on a specific target. In this context, instead of using wide-baseline matching algorithms [31], it is more accurate to use a chain of homographies to extract pairs of matching points [27].
Let $(I_l^1, I_r^1)$ be a pair of images of a 3-D scene that is far from the cameras along their optical axes. An initial homography $H_1$ is generated by using extracted pairs of matching points between $I_l^1$ and $I_r^1$ using a standard feature extractor for wide-baseline cases [31]. Let $I_l^n$ and $I_r^n$ be a pair of images from the left and the right cameras acquiring a scene/object near to the cameras along their optical axes. The problem is to autonomously extract the pairs of matching points between the images $I_l^n$ and $I_r^n$. To solve such a problem, a set of $n$ images is captured from each camera by virtually moving the cameras from the initial position (the one at which $(I_l^1, I_r^1)$ are acquired) to the current position. Let these two sets of images be $(I_l^1, I_l^2, \dots, I_l^n)$ and $(I_r^1, I_r^2, \dots, I_r^n)$. The following steps can be adopted:
1) Perform the SIFT matching between the image pairs $(I_l^1, I_l^2), (I_l^2, I_l^3), \dots, (I_l^{n-1}, I_l^n)$ and use these sets of pairs of matching points for computing their respective homography matrices $H_l^{1,2}, H_l^{2,3}, \dots, H_l^{n-1,n}$.
2) Repeat the procedure given at the above step on the sequence of images captured by the right camera and compute $H_r^{1,2}, H_r^{2,3}, \dots, H_r^{n-1,n}$.
3) Compute the homography matrices $H_l$ and $H_r$:
$$H_l = \prod_{i=0}^{n-2} H_l^{\,n-(i+1),\,n-i}, \qquad H_r = \prod_{i=0}^{n-2} H_r^{\,n-(i+1),\,n-i}.$$
4) Compute the homography matrix $H_n$ for the pairs of matching points between the current images $I_l^n$ and $I_r^n$ as
$$H_n = H_r \ast H_1 \ast (H_l)^{-1}. \qquad (12)$$
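A sketch of the homography chain with OpenCV; match_points(img_a, img_b) is an assumed helper returning corresponding point arrays (e.g., via SIFT matching), and the chained products implement step 3 and (12).

```python
import cv2
import numpy as np

def pairwise_homography(img_a, img_b, match_points):
    """Homography between two consecutive images of the same camera."""
    pts_a, pts_b = match_points(img_a, img_b)   # assumed matching helper
    H, _ = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)
    return H

def chained_homography(frames, match_points):
    """Product of the pairwise homographies from frame 1 to frame n."""
    H = np.eye(3)
    for img_a, img_b in zip(frames[:-1], frames[1:]):
        H = pairwise_homography(img_a, img_b, match_points) @ H
    return H

def current_pair_homography(H1, left_frames, right_frames, match_points):
    """Equation (12): map matches of the far scene to the current views."""
    H_l = chained_homography(left_frames, match_points)
    H_r = chained_homography(right_frames, match_points)
    return H_r @ H1 @ np.linalg.inv(H_l)
```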
The cooperation of a PTZ camera with a
static camera belonging to the network or
between two PTZ cameras results in an improved
localization of the objects in the scene without
requiring fixed overlapping FOVs. Such an improvement can be
qualitatively seen in Figure 5. For a quantitative analysis, Fig-
ure 6 represents a surface plot for the localization error com-
puted on a target with respect to ground truth data. It can be
noticed how the stereo localization keeps the error low. Instead,
using a single static or a PTZ calibrated camera, adopting the
foot-on-the-ground assumption, the error increases with the
distance of the object from the camera and the severity of the
occlusion. It is also worth noticing how this problem can be mitigated with centralized or distributed cooperative tracking but not completely eliminated.
Table 1 presents a high-level comparison between the three
main categorizations for cooperative tracking. The major advantage of centralized approaches is that they receive all the footage from the cameras, thus allowing fusion techniques to be exploited at the pixel level without relying on preprocessing results carried out by single sensors. It is clear that such a
solution demands a large bandwidth that causes problems for
large networks. On the contrary, distributed approaches can
reach a consensus by exchanging few data, mainly related to
the states of the nodes, within the neighborhood in charge of
tracking a target. In addition, the number of cooperating sen-
sors is limited. This allows the definition of ad hoc networks to
reduce the overall bandwidth usage. However, both approaches
suffer from the ground plane constraint in the computation of
the target position from single footage.
To overcome this problem, stereo approaches can be consid-
ered to reach a more precise 3-D localization. This is achieved by exploiting two sensors, thus sidestepping some of the occlusion problems. These sensors have to share a large portion of their points of view (i.e., have a limited baseline) to realize robust matching techniques. This consideration suggests activating such a technique only when a precise localization is highly required.
PTZ NETWORK CONFIGURATION
Another important issue in cooperative camera networks is how
the orientations of the cameras can influence the analysis capabil-
ity of the tracking algorithms. For such a reason, there is a novel
research stream that focuses on PTZ network reconfiguration
(see Figure S3 in “PTZ Network Configuration”).
Karuppiah et al. [22] proposed two new metrics that, based on the dynamics of the scene, allow the selection of the pair of cameras that maximizes the detection probability of a moving object. In
[TABLE 1] COMPARISON OF COOPERATIVE CAMERA TRACKING APPROACHES.

TECHNIQUE   | ADVANTAGE                      | DISADVANTAGE
CENTRALIZED | COMPLETE NETWORK INFORMATION;  | HIGH BANDWIDTH USAGE;
            | PIXEL-LEVEL FUSION             | AFFECTED BY OCCLUSIONS
DISTRIBUTED | DYNAMIC NETWORK TOPOLOGY;      | ACCURACY DEPENDING ON SINGLE-CAMERA
            | DISTRIBUTED CONSENSUS;         | ESTIMATIONS AFFECTED BY OCCLUSIONS
            | LOW BANDWIDTH USAGE;           |
            | AD HOC NETWORK COMMUNICATION   |
STEREO      | PRECISE 3-D LOCALIZATION;      | CONSTRAINED POINTS OF VIEW;
            | OCCLUSION FREE                 | BASELINE CONSTRAINED
[21], Kansal et al. proposed an optimization process for the
determination of the network configuration that maximizes the
metric. It is interesting to notice how the adopted metrics are concretely bound to real sensors, thus providing a feasible instrument for real applications. More recently, Mittal and Davis
[33] introduced a method for determining good sensor configu-
rations that would maximize performance measures for a better
system performance. In particular, the authors based the config-
uration on the presence of random occluding objects and pro-
posed two techniques to analyze the visibility of the objects.
Qureshi and Terzopoulos [40] proposed a proactive control of
multiple PTZ cameras through a solution that plans assignment
and handoff. In particular, the authors cast the problem of
controlling multiple cameras as a multibody planning problem
in which a central planner controls the actions of multiple
physical agents. In the context of person tracking, the solution considers the formulation of the relevance $r(c_i, O)$ of a PTZ camera $c_i$ to an observation task $O$. Five factors are taken into account: a) the camera-pedestrian distance $r_d$, b) the frontal viewing direction $r_g$, c) the PTZ limits $r_{abu}$, d) the observational range $r_o$, and e) the handoff success probability $r_h$.

[FIG5] Example of cooperative PTZ localization using the stereo vision paradigm. Frames from a static camera are shown in the first column, while the left and right images of a cooperative PTZ camera pair are shown in the second and third columns, respectively. Two different situations have been considered: 1) occluded person (first and fourth rows) and 2) nonoccluded person (second and third rows). In such situations, the localization computed by the stereo cooperation between the PTZs (red cross) is more robust with respect to a ground-truth position (black circle) than the one computed by a monocular camera (blue X). In particular, such an effect is more evident when occlusions occur.

[FIG6] Error in localization with respect to the severity of the occlusion (i.e., height) and the object's distance. The proposed cooperative PTZ localization keeps the error low in all situations, while monocular vision, based on the foot-on-the-ground hypothesis, worsens with occlusions and distance.

The planner computes state
sequences with the highest probability of success on the basis of a probabilistic objective function. Such a function is given by the probability of success of a state sequence $S$ over a neighborhood of cameras $N_a$ and time $t$ as

$$Q = \prod_{t \in [0, 1, \dots]} \left( \prod_{i \in [1, N_a]} r(c_i, h_j) \right), \qquad (13)$$

where

$$r(c_i, h_j) = \begin{cases} 1 & \text{if } c_i \text{ is idle} \\ r_d\, r_g\, r_{abu}\, r_o\, r_h & \text{otherwise.} \end{cases} \qquad (14)$$

The planning is therefore achieved by employing a greedy best-first search to find the optimal sequence of states.
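The relevance product (14) and a one-step greedy assignment over it can be sketched as follows; the individual factor models are illustrative placeholders (the terms in [40] are more elaborate), and the full best-first search over state sequences is omitted.

```python
import math

def relevance(cam, task):
    """Product of the five factors of (14) for a non-idle camera; all
    factor models below are illustrative placeholders."""
    r_d = math.exp(-abs(cam['distance'] - task['preferred_distance']) / 10.0)
    r_g = max(0.0, math.cos(cam['view_angle'] - task['target_heading']))
    r_abu = 1.0 if cam['pan_range'][0] <= task['required_pan'] <= cam['pan_range'][1] else 0.0
    r_o = 1.0 if cam['distance'] <= cam['max_range'] else 0.0
    r_h = cam['handoff_success_prob']
    return r_d * r_g * r_abu * r_o * r_h

def greedy_assign(cameras, tasks):
    """Greedily pick, for each observation task, the free camera with the
    highest relevance (a one-step stand-in for the best-first planner)."""
    free = set(cameras)          # cameras: {name: attribute dict}
    plan = {}
    for task in tasks:
        best = max(free, key=lambda name: relevance(cameras[name], task),
                   default=None)
        if best is not None and relevance(cameras[best], task) > 0:
            plan[task['id']] = best
            free.discard(best)
    return plan
```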
A different approach to network reconfiguration for person
tracking by means of PTZ cameras can be developed by employing game theory. Arslan et al. [4] demonstrate that the Nash equilibrium for the strategies lies in the probability distribution. From this formulation, different approaches [28], [44] propose a set of utility functions such as
■ target utility $U_{T_i}(a)$: how well a target $T_i$ is satisfied/acquired while being tracked by some camera
■ camera utility $U_{C_i}(a)$: how well a camera $C_i$ is tracking a target assigned to it, based on user-supplied criteria (e.g., size, position, and view)
■ global utility $U_g(a)$: the overall degree of satisfaction with the tracking performance.
They solve the camera assignment by maximizing the global
utility function. Different mechanisms to compute the utilities
can be provided as in [28], [44], and [45], then a bargaining pro-
cess is executed on the predictions of person utilities at each
step. Those cameras with the highest probabilities are used to
track the target thus providing a solution to the handoff prob-
lem in a video network.
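For illustration, if the global utility is taken as the sum of the per-assignment camera utilities, the maximizing camera-target assignment can be computed with the Hungarian algorithm; the referenced works [28], [44] instead rely on game-theoretic bargaining over predicted utilities, so the following is only a compact stand-in.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_cameras(utility):
    """utility[i, j]: camera utility U_Ci(a) of camera i tracking target j.
    Returns the assignment maximizing an additive global utility."""
    cam_idx, tgt_idx = linear_sum_assignment(utility, maximize=True)
    global_utility = utility[cam_idx, tgt_idx].sum()
    return list(zip(cam_idx, tgt_idx)), global_utility

# Example: three cameras, two targets.
U = np.array([[0.9, 0.2],
              [0.4, 0.8],
              [0.6, 0.5]])
print(assign_cameras(U))   # camera 0 -> target 0, camera 1 -> target 1
```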
On the other hand, when a PTZ camera is reconfigured to
track an object or switched on/off to save power, the topology of
the network is modified. As a consequence, a new configuration is
required to provide optimal coverage of the monitored environ-
ment. Song et al. [44] adopt a uniform distribution of the targets
and the coverage resolution utility to negotiate the new network
reconfiguration. Piciarelli et al. [37] propose a new strategy for
the online reconfiguration of the network based on the analysis
of the activities occurring within the covered area. To achieve the
PTZ NETWORK RECONFIGURATION

The goal of the PTZ reconfiguration is to automatically reconfigure the PTZ cameras to improve the system performance. Formally, the problem can be seen as a maximization problem, $(P, T, Z) = \arg\max_{P,T,Z} F(P, T, Z)$, to select the PTZ configuration parameters that maximize a performance function $F$ of the system. Such a function can be related to a target's property. In such a case, programming the handoff of the cameras is of primary importance to keep the best acquisition quality. If the function is related to different aspects, like target, camera, and global utilities, a PTZ reconfiguration can be applied to maximize the utility functions.
A further interesting problem is to optimally cover the moni-
tored area on the basis of the activity probability. An activity
density map can be defined following different strategies
such as optimizing tracking performance or detection per-
formance and reducing occlusions. The selected goal has
many similarities with two-dimensional data fitting prob-
lems, in which the data distribution is approximated by a
mixture of density functions (i.e., Gaussians). In our case,
there is an activity density map that should be fit by the cov-
erage areas of the PTZ cameras (projection to the ground
plane of the cones of view). One of the most popular mix-
ture-of-Gaussians data fitting algorithms is EM. In [37], a
camera projection model is proposed to project all the activi-
ties into compatible camera spaces, where the application of the EM algorithm requires fewer constraints than in the original space. The resulting ellipses determine the PTZ parame-
ters of all the cameras involved in the optimization process.
[FIGS3] Scheme for PTZ network reconfiguration.
best coverage based on the density of the activities, a constrained expectation-maximization (EM) process is introduced to produce the PTZ parameters for an optimal probabilistic coverage of the monitored area. The result is a PTZ network that can dynamically modify its configuration to better acquire the data, thus improving the video analysis performance of the network.
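A plain, unconstrained mixture-of-Gaussians fit of a ground-plane activity map, as a simplified illustration of the idea behind [37]; the actual approach applies a constrained EM in projected camera spaces, and the mapping from the resulting ellipses back to concrete pan, tilt, and zoom values is omitted here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_coverage_ellipses(activity_points, n_cameras):
    """Fit one Gaussian per available PTZ camera to ground-plane activity
    samples; each component's mean/covariance describes a coverage ellipse."""
    gmm = GaussianMixture(n_components=n_cameras, covariance_type='full')
    gmm.fit(activity_points)                     # activity_points: (N, 2)
    ellipses = []
    for mean, cov in zip(gmm.means_, gmm.covariances_):
        # Principal axes and orientation of the coverage ellipse.
        eigvals, eigvecs = np.linalg.eigh(cov)
        angle = np.degrees(np.arctan2(eigvecs[1, -1], eigvecs[0, -1]))
        ellipses.append({'center': mean,
                         'axes': 2.0 * np.sqrt(eigvals),   # ~1-sigma extents
                         'angle_deg': angle})
    return ellipses

# Example: synthetic activity concentrated around two hot spots.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal([5, 5], 1.0, size=(300, 2)),
                 rng.normal([20, 12], 2.0, size=(300, 2))])
print(fit_coverage_ellipses(pts, n_cameras=2))
```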
EMBEDDED SMART CAMERAS
Smart cameras combine image sensing, processing, and communication on embedded devices [42] (see Figure S4 in “Embedded Smart Cameras”). Recently, they have been deployed in active camera networks [1], where image processing as well as control of the cameras is executed on the embedded nodes. These embedded
camera platforms can be classified into single smart cameras,
distributed smart cameras, and smart camera nodes [41]. The
third class of platforms is of special interest, since they combine
embedded computation with wireless communication and
power awareness (e.g., [12], [24], and [47]).
Integration of image processing algorithms on embedded
platforms imposes some challenges because one has to consider
architectural issues more closely than on general-purpose com-
puters. Memory is a principal bottleneck for computer system
performance. In general-purpose computer systems, caches are
used to increase the average performance of the memory sys-
tem. However, image processing algorithms use huge amounts
of data, and often with less frequent reuse. As a result, caches
may be less effective. At a minimum, software must be carefully
optimized to make best use of the cache; at worst, the memory
system must be completely redesigned to provide adequate
memory bandwidth [48].
When we have a closer look at low-level signal processing
algorithms, we can distinguish five different data access pat-
terns [25]. How data is accessed has a great influence on the
achievable performance. Data access, and hence the underly-
ing data dependencies, are limiting the degree of paralleliza-
tion. But parallel execution is a major source of performance
gain on dedicated hardware.
a) Independent Pixel Processing: In the simplest case, a sin-
gle pass over the image is sufficient where the output pixel’s
value is only dependent on a single input pixel. The processing
of each pixel is independent of each other, enabling a full pix-
el-level data parallel implementation in hardware, even direct-
ly on the image sensor. Examples for these full pixel-parallel
algorithms include thresholding, color space transformations,
or simple logic and algorithmic operations.
b) Multipass Pixel Processing: In this class, we consider
algorithms that require multiple passes, but the output
pixel’s value is still only dependent on a single input pixel.
Thus, a full pixel-parallel implementation is possible, how-
ever synchronization between the multiple passes is neces-
sary. Examples include computation of simple image
statistics, histogram equalizations or Hough transforms.
c) Fixed-Size Block Access: Algorithms of this class compute
the output pixel’s value using multiple input pixels from a
EMBEDDED SMART CAMERAS

An embedded smart camera executes all stages of a typical
image processing pipeline onboard. The image sensor con-
verts the incident light of the optics into electrical signals.
The sensing unit captures the raw data from the image
sensor and performs some preprocessing such as white bal-
ance, color transformations and image enhancements. This
unit also controls important parameters of the sensor, e.g.,
frame rate, gain, or exposure, via a dedicated interface.
The processing unit reads the preprocessed image data and
performs the main image processing tasks and transfers
the abstracted data to the communication unit that pro-
vides various wired or wireless interfaces such as USB,
WLAN, or ZigBee.
Low-level operations process the image data at the pixel
level but typically offer a high data parallelism. Thus, low-level
processing at the sensor unit is often realized on dedicated
hardware with fast on-chip memory (SRAM). Dedicated hard-
ware for the sensing unit includes application-specific integrat-
ed circuits, FPGAs [13], or specialized processors [24]. High-level
processing at the processing unit, on the other hand, operates
on (few) features or objects, which reduces the required data
bandwidth but increases the complexity of the operations sig-
nificantly. DSPs or microprocessors are the prime choice for
these tasks. High-level image processing requires much more
data storage, thus large off-chip memory (SDRAM) is often
attached to the processing and communication units.
[FIGS4] Block diagram of an embedded smart camera.
fixed block. The data of the input block can typically be
accessed in parallel, however for computing the output pixel
data dependencies within the block and within neighboring
blocks must be considered. Prominent examples of algorithms
with such data access pattern include convolution, wavelets,
and morphological operations.
d) Data-Independent Global Access: Algorithms of this class
access multiple source pixels from all over the image to
compute the output. Although these algorithms require
global access, the access pattern is regular and/or known in
advance. Examples include warping or distortion correction.
e) Data-Dependent Random Access: The most complex class
of algorithms require access to multiple source pixels from
all over the image to compute the output. However, the
access pattern is data dependent and therefore not known a
priori. Examples of these global image algorithms include
flood filling and contour finding.
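The difference between patterns a) and c) can be made concrete with two NumPy operations: thresholding touches every pixel independently and is fully pixel parallel, while a 3 x 3 box filter needs a fixed neighborhood around each output pixel; sizes and values below are only illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

img = np.random.randint(0, 256, size=(240, 352), dtype=np.uint8)  # CIF-like frame

# a) Independent pixel processing: each output depends on one input pixel,
#    so the operation can be parallelized down to the pixel level.
binary = (img > 128).astype(np.uint8)

# c) Fixed-size block access: each output depends on a fixed 3x3 input
#    neighborhood, so neighboring blocks share data and need coordination.
smoothed = uniform_filter(img.astype(np.float32), size=3)
```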
At first glance, we would expect a linear speedup for the first
four access patterns. However, there are several factors limiting
the achievable speedup of real hardware implementations such
as the memory access times, the available processing elements
or some purely sequential control of the algorithms. As a consequence, the performance of low-level image processing algorithms may be well below the theoretical speedup. Modern
image sensors provide different scanning methods (e.g., win-
dowing, subsampling, random read, or binning) to reduce the
number of pixel reads resulting in higher frame rates, reduced
pixel noise, or reduced power consumption.
In our previous work [8], [9], we developed several embed-
ded smart cameras executing various image processing tasks.
Table 2 summarizes the performance of tracking, motion detec-
tion and adaptive background/foreground modeling imple-
mented on a TMS320C64x digital signal processor (DSP). The
achieved performance of these low-level algorithms—with data
access pattern of classes (c), (d), or (e)—on the embedded plat-
form was comparable with reference implementations on stan-
dard personal computer (PC) platforms. Baumgartner et al. [6]
conducted a detailed comparison of low-level image processing
algorithms. In most cases, the implementation on PCs and
DSPs achieved similar results; the field programmable gate
array (FPGA) implementation outperformed the PC and DSP
implementations for algorithms with high data parallelism.
Naturally, the native data width and the data path architecture (fixed point or floating point) influence the implementation on the embedded platform. There is also a tradeoff
between the required precision and the power consumption.
Fixed-point architectures are more power-efficient; an issue
that is getting more and more important.
FUTURE SMART RECONFIGURABLE SENSOR NETWORKS
Future research in the field of smart PTZ camera networks will
have to consider the available resources more intensively.
Resource-awareness is not only required for economic reasons
but also important for scalability, robustness and novel appli-
cations. One example for an innovative application is to deploy
camera sensor networks in environments with little infra-
structure, i.e., the available power supply and communication
network will be limited. To provide autonomous operation
over some period of time (days or even weeks), the sensor net-
work must use its resources economically. Components and
camera nodes will be switched off during idle times and
switched on only when required. Thus, the camera network
will be dynamically reconfigured to save resources and adapt
to some changes in the environment (e.g., switch off cameras at night, wake up nodes when events have been detected in their neighborhood, or avoid PTZ functionality of cameras). Therefore,
new optimization algorithms as well as signal processing tech-
niques for fast and computationally efficient video analysis will
be needed to support such network capabilities.
ACKNOWLEDGMENTS
This work was funded in part by the European Regional
Development Fund, Interreg IV Italia-Austria Program, and
Project SRSnet ID 4687.
AUTHORS
Christian Micheloni (christian.micheloni@uniud.it) received
the laurea (cum laude) and Ph.D. degrees in computer science
in 2002 and 2006, respectively, from the University of Udine,
Italy, where he is currently an assistant professor. His research
interests include active vision, active stereo vision, network self-
configuration, multispectral and hyperspectral imaging, neural
networks, and object recognition and pattern recognition tech-
niques for both the tuning of camera parameters for improved
image acquisition and facial detection. He is a member of the
International Association of Pattern Recognition and the IEEE.
Bernhard Rinner (bernhard.rinner@uni-klu.ac.at) received the
M.Sc. and Ph.D. degrees in telematics from Graz University of
Technology, Austria, in 1993 and 1996, respectively. He is a full
professor and chair of pervasive computing at Klagenfurt
University, Austria, where he is currently serving as vice dean of the
faculty of technical sciences. He has held research positions with
Graz University of Technology from 1993 to 2007 and with the
Department of Computer Science, University of Texas at Austin,
from 1998 to 1999. His research interests include embedded com-
puting, embedded video and computer vision, sensor networks
and pervasive computing. He is a Senior Member of the IEEE.
Gian Luca Foresti (gianluca.foresti@uniud.it) received the
laurea degree cum laude in electronic engineering and the Ph.D.
degree in computer science from University of Genoa, Italy, in
1990 and in 1994, respectively. He is a full professor of computer
[TABLE 2] EXECUTION TIMES AND FRAME RATES (CIF RESOLUTION) OF LOW-LEVEL ALGORITHMS ON OUR SMART CAMERA PLATFORM (TMS320C64X DSP AT 600 MHZ).

ALGORITHM (CIF IMAGES)                    TIME [MS]   F/S [HZ]
CAMSHIFT (154 x 154 SEARCH BLOCK) [39]    2.5         400
MOTION DETECTION (SINGLE GAUSSIAN) [30]   33          30.0
MOTION DETECTION (KALMAN) [30]            27          37.0
ADAPTIVE FG/BG MODELING [7]               104         9.6
science at the Department of Computer Science (DIMI),
University of Udine, where he is also the dean of the faculty of
education science. His main interests involve computer vision,
image processing, data fusion, and pattern recognition. He is
author of more than 200 papers published in international jour-
nals and refereed international conferences, and he is fellow
member of the International Association of Pattern Recognition.
He is a Senior Member of the IEEE.
REFERENCES
[1] I. F. Akyildiz, T. Melodia, and K. R . Chowdhury, “A survey on wireless multime-
dia sensor networks,” Comput. Netw., vol. 51, no. 4, pp. 921–960, 2007.
[2] Y. Aloimonos. Active Perception. Hillsdale, NJ: Lawrence Erlbaum
Associates, 1993.
[3] S. Araki, T. Matsuoka, N. Yokoya, a nd H. Takemura, “Real-time tracking of
multiple moving object contours in a moving camera image sequences,” IEICE
Trans. Inform. Syst., vol. E8 3-D, no. 7, pp. 1583–1591, July 2000.
[4] G. Ar slan, J. Marden, and J. Shamma, “Autonomous vehicle-target assignment:
A game-theoretical formulation,” ASME J. Dyn. S yst. Meas. Co ntr., vol. 129, no. 5,
pp. 584 –596, 2007.
[5] Y. Bar-Shalom and H. Chen, “Multisensor track-to-track association for tracks
with dependent errors,” J. Adv. Inform. F usion, vol. 1, no. 1, pp. 3–14, 2006.
[6] D. Baumgartner, P. Roessler, W. Kubinger, C. Zinner, and K. Ambrosch,
“Benchmarks of low-level vision algorithms for DSP, FPGA, and mobile PC proces-
sors,” in Embedded Computer Vision, B. Kisacanin, S. S. Bhattacharyya, and
S. Chai, Ed s. New York: Springer-Verlag, 2009, pp. 101–120.
[7] M. Bramberger, J. Brunner, B. Rinner, and H. Schwabach, “Real-time video
analysis on an embedded smart camera for traffic surveillance,” in Proc. 10th
IEEE Real-Time and Embedded Technology and Applic ations Symp. ( RTAS
2004), Toronto, Canada, May 2004, pp. 174–181.
[8] M. Bramberger, A. Doblander, A. Maier, B. Rinner, and H. Schwa bach, “Dis-
tributed embedded smart cameras for surveillance applications,” Computer, vol. 39,
no. 2, pp. 68 –75, Feb. 2006.
[9] M. Br amberger, R. Pflu gfelder, A. Maier, B. Rinner, H. Schw abach, and B. Stro bl,
“A smart camera for traffic surveillance,” in Proc. 1st Workshop on Intelligent Solu-
tions for Embedded System s (WISES’03), Vienna, Austria, June 2003, pp. 1–12.
[10] A. Cavallaro, “Special issue on multi-sensor object detection and tracking,”
Signal Image Vid eo Proces s., vol. 1, no. 2, pp. 99–100, 2007.
[11] C. H. Chen, Y. Yao, D. Page, B. Abidi, A. Koshan, and M. Abidi, “Heterogeneous fusion of omnidirectional and PTZ cameras for multiple object tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 8, pp. 1052–1063, Aug. 2008.
[12] P. Chen, P. Ahammad, C. Boyer, S.-I. Huang, L. Lin, E. Lobaton, M. Meingast, S. Oh, S. Wang, P. Yan, A. Y. Yang, C. Yeo, L.-C. Chang, J. D. Tygar, and S. S. Sastry, “CITRIC: A low-bandwidth wireless camera network platform,” in Proc. 2nd ACM/IEEE Int. Conf. Distributed Smart Cameras (ICDSC’08), Stanford, CA, Sept. 2008, pp. 1–10.
[13] F. Dias, F. Berry, J. Serot, and F. Marmoiton, “Hardware design and implementation issues on a FPGA-based smart camera,” in Proc. ACM/IEEE Int. Conf. Distributed Smart Cameras, Vienna, Austria, 2007, pp. 20–26.
[14] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multi-camera people tracking with a probabilistic occupancy map,” IEEE Trans. Pattern Anal. Machine Intell., vol. 30, no. 2, pp. 267–282, Feb. 2008.
[15] G. L. Foresti and C. Micheloni, “A robust feature tracker for active surveillance of outdoor scenes,” Electron. Lett. Comput. Vision Image Anal., vol. 1, no. 1, pp. 21–36, 2003.
[16] G. Guo, C. R. Dyer, and Z. Zang, “Linear combination representation for outlier detection in motion tracking,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, June 20–25, 2005, vol. 2, pp. 274–281.
[17] A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu, H. Markl, S. Pankanti, A. Senior, C.-F. Shu, and Y. L. Tian, “Smart video surveillance,” IEEE Signal Processing Mag., vol. 22, no. 2, pp. 38–41, 2005.
[18] J. Hart, B. Scassellati, and S. W. Zucker, “Epipolar geometry for humanoid robotic heads,” in Proc. 4th Int. Cognitive Vision Workshop, Santorini, Greece, May 12–15, 2008, pp. 24–36.
[19] M. Irani and P. Anandan, “A unified approach to moving object detection in 2-D and 3-D scenes,” IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 6, pp. 577–589, June 1998.
[20] F. Isgro and E. Trucco, “On robust rectification of uncalibrated images,” in Proc. Int. Conf. Image Analysis, 1999, pp. 297–302.
[21] A. Kansal, W. Kaiser, G. Pottie, M. Srivastava, and G. Sukhatme, “Reconfiguration methods for mobile sensor networks,” ACM Trans. Sensor Netw., vol. 3, no. 4, pp. 1–28, 2007.
[22] D. Karuppiah, R. Grupen, A. Hanson, and E. Riseman, “Smart resource reconfiguration by exploiting dynamics in perceptual tasks,” in Proc. IEEE Int. Conf. Intelligent Robots and Systems, Edmonton, Alberta, 2005, pp. 1–7.
[23] S. M. Khan and M. Shah, “Tracking multiple occluding people by localizing on multiple scene planes,” IEEE Trans. Pattern Anal. Machine Intell., vol. 31, no. 3, pp. 505–519, Mar. 2009.
[24] R. P. Kleihorst, A. A. Abbo, B. Schueler, and A. Danilin, “Camera mote with a high-performance parallel processor for real-time frame-based video processing,” in Proc. 1st ACM/IEEE Int. Conf. Distributed Smart Cameras (ICDSC’07), Sept. 25–28, 2007, pp. 109–116.
[25] M. Kölsch and S. Butner, “Hardware considerations for embedded vision systems,” in Embedded Computer Vision, B. Kisacanin, S. S. Bhattacharyya, and S. Chai, Eds. New York: Springer-Verlag, 2009, pp. 3–26.
[26] S. Kumar, C. Micheloni, and G. L. Foresti, “Stereo vision in cooperative camera networks,” in Smart Cameras, N. Belbachir, Ed., 2009.
[27] S. Kumar, C. Micheloni, and C. Piciarelli, “Stereo localization using dual PTZ cameras,” in Proc. Int. Conf. Computer Analysis of Images and Patterns, Münster, Germany, Sept. 2–4, 2009, vol. 5702, pp. 1061–1069.
[28] Y. Li and B. Bhanu, “Utility-based dynamic camera assignment and hand-off in a video network,” in Proc. IEEE/ACM Int. Conf. Distributed Smart Cameras, Stanford, USA, Sept. 7–11, 2008, pp. 1–9.
[29] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.
[30] M. Mangard, “Motion detection on embedded smart cameras,” Master’s thesis, Graz Univ. Technology, 2006.
[31] J. Meltzer and S. Soatto, “Edge descriptors for robust wide-baseline correspondence,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, Anchorage, AK, USA, June 24–26, 2008, pp. 1–8.
[32] C. Micheloni and G. L. Foresti, “Real time image processing for active monitoring of wide areas,” J. Vis. Commun. Image Represent., vol. 17, no. 3, pp. 589–604, 2006.
[33] A. Mittal and L. S. Davis, “A general method for sensor planning in multi-sensor systems: Extension to random occlusion,” Int. J. Comput. Vision, vol. 76, no. 1, pp. 31–53, 2008.
[34] D. Murray and A. Basu, “Motion tracking with an active camera,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 5, pp. 449–454, May 1994.
[35] R. Olfati-Saber, J. Fax, and R. Murray, “Consensus and cooperation in networked multi-agent systems,” Proc. IEEE, vol. 95, no. 1, pp. 215–233, Jan. 2007.
[36] J. Park, P. C. Bhat, and A. C. Kak, “A look-up table approach for solving the camera selection problem in large camera networks,” in Proc. Workshop on Distributed Smart Cameras, 2006, pp. 72–76.
[37] C. Piciarelli, C. Micheloni, and G. L. Foresti, “Video network reconfiguration by expectation maximization,” in Proc. Int. Conf. Distributed Smart Cameras, Como, Italy, 2009.
[38] K. N. Plataniotis and C. S. Regazzoni, “Special issue on visual-centric surveillance networks and services,” IEEE Signal Processing Mag., vol. 22, no. 2, pp. 12–15, 2005.
[39] M. Quaritsch, M. Kreuzthaler, B. Rinner, H. Bischof, and B. Strobl, “Autonomous multicamera tracking on embedded smart cameras,” EURASIP J. Embedded Syst., vol. 10, pp. 1–10, 2007.
[40] F. Z. Qureshi and D. Terzopoulos, “Planning ahead for PTZ camera assignment and handoff,” in Proc. ACM/IEEE Int. Conf. Distributed Smart Cameras, Como, Italy, Aug.–Sept. 2009, pp. 1–8.
[41] B. Rinner, M. Quaritsch, W. Schriebl, T. Winkler, and W. Wolf, “The evolution from single to pervasive smart cameras,” in Proc. ACM/IEEE Int. Conf. Distributed Smart Cameras (ICDSC’08), Stanford, USA, Sept. 7–11, 2008, pp. 1–10.
[42] B. Rinner and W. Wolf, “Introduction to distributed smart cameras,” Proc. IEEE, vol. 96, no. 10, pp. 1565–1575, Oct. 2008.
[43] G. Scotti, L. Marcenaro, C. Coelho, F. Selvaggi, and C. S. Regazzoni, “Dual camera intelligent sensor for high definition 360 degrees surveillance,” IEE Proc. Vision Image Signal Process., vol. 152, no. 2, pp. 250–257, Apr. 2005.
[44] B. Song, C. Soto, A. K. Roy-Chowdhury, and J. A. Farrell, “Decentralized camera network control using game theory,” in Proc. ACM/IEEE Int. Conf. Distributed Smart Cameras, Stanford, USA, 2008, pp. 1–8.
[45] C. Soto, B. Song, and A. Roy-Chowdhury, “Distributed multi-target tracking in a self-configuring camera network,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Miami, USA, June 20–25, 2009, pp. 1486–1493.
[46] B. J. Tordoff and D. W. Murray, “Reactive control of zoom while fixating using perspective and affine cameras,” IEEE Trans. Pattern Anal. Machine Intell., vol. 26, no. 1, pp. 98–112, Jan. 2004.
[47] T. Winkler and B. Rinner, “Pervasive smart camera networks exploiting heterogeneous wireless channels,” in Proc. IEEE Int. Conf. Pervasive Computing and Communications (PerCom), Mar. 2009, pp. 296–299.
[48] W. Wolf, High-Performance Embedded Computing. San Francisco, CA: Morgan Kaufmann, 2006.
[49] D. Wan and J. Zhou, “Multiresolution and wide-scope depth estimation using a dual-PTZ camera system,” IEEE Trans. Image Processing, vol. 18, pp. 677–682, 2009.
[50] B. Zitova and J. Flusser, “Image registration methods: A survey,” Image Vision Comput., vol. 21, no. 11, pp. 977–1000, 2003. [SP]