From Video Shot Clustering to Sequence Segmentation
Emmanuel Veneau, Rémi Ronfard
Institut National de l’Audiovisuel
4, avenue de l’Europe
94366 Bry-sur-Marne cedex, France
{eveneau,rronfard}@ina.fr
Patrick Bouthemy
IRISA/INRIA
Campus Universitaire de Beaulieu
35042 Rennes cedex, France
bouthemy@irisa.fr
Abstract
Automatically building high-level segments to structure information in video documents is a challenging task. This paper presents a method based on the cophenetic criterion, a distance defined between clustered shots which is used to detect breaks between sequences. It describes and compares various implemented options. Experiments have shown that the proposed criterion can be used for achieving sequence segmentation.
1 Introduction
Browsing and querying data in video documents requires structuring the information extracted from the audio and video streams. The first step in building a structured description of the data is to segment the video document. The elementary segment is the shot, which is usually defined as the smallest continuous unit of a video document. Numerous methods for shot segmentation have been proposed (e.g., see [4]). Nevertheless, shots are often not the relevant level at which to describe pertinent events, and they are too numerous to enable efficient indexing or browsing.
The grouping of shots into higher-level segments has
been investigated through various methods which can be
gathered into four main families. The first one is based
on the principle of the Scene Transition Graph (STG) [11],
which can be exploited in a continuous way [9], or accord-
ing to alternate versions [5]. The methods of the second
family [1, 3] use explicit models of video documents or
rules related to editing techniques and film theory. In the
third family [6, 10], emphasis is put on the joint use of
features extracted from audio, video and textual informa-
tion. These methods achieve shot grouping more or less
through a synthesis of the segmentation performed for each
media. The fourth family of algorithms relies on statisti-
cal techniques as Hidden MarkovModels (HMM) and other
Bayesian tools [2, 7].
In this paper, we present a method based on a cophenetic criterion, which belongs to the first family. The remainder of the paper is organized as follows. Section 2 describes our method involving an agglomerative binary hierarchy and the use of the cophenetic matrix. Section 3 specifies the various options we have implemented with respect to extracted features, distances between features, hierarchy updating, and temporal constraints. Experimental results are reported in Section 4, and Section 5 contains concluding remarks.
2 Binary hierarchy for describing shot similarity
We assume that a segmentation of the video into shots is available, where each shot is represented by one or more extracted keyframes. The information contained in a shot (except its duration) reduces to the (average) signature computed from the corresponding keyframes. We build a spatio-temporal description of shot similarity through a binary agglomerative hierarchical time-constrained clustering.
2.1 Binary agglomerative hierarchical time-constrained clustering

To build a hierarchy following standard methods [12], we require a similarity measure S between shots and a distance between shot clusters, called the index of dissimilarity, δ. The temporal constraint, as defined in [11], involves a temporal distance d_T. We introduce a temporal weighting function W in order to have a general model for the temporal constraint. The formal definitions of these functions are given in Section 3. The time-constrained distance between shots d_t is defined (assuming that similarity is normalized between 0 and 100) by:

    d_t(s_i, s_j) = W(d_T(s_i, s_j)) · (100 − S(s_i, s_j))   if d_T(s_i, s_j) ≤ T
    d_t(s_i, s_j) = +∞                                        otherwise          (1)

where s_i and s_j designate two shots and T is the maximal temporal interval for considering any interaction between shots.
At the beginning of the process, each shot forms a cluster, and the time-constrained dissimilarity index δ_t between clusters is the time-constrained distance d_t between shots. A symmetric time-constrained proximity matrix M_t, with M_t(i, j) = δ_t(C_i, C_j), can be defined [8], using d_t, as a representation of the dissimilarity between clusters. The hierarchy is built by merging the two closest clusters at each step. The matrix M_t is updated according to the index of dissimilarity δ_t to take into account the newly created cluster. This step is iterated until the proximity matrix contains only infinite values.
The resulting binary time-constrained hierarchy provides a description of the spatio-temporal proximity of shots.
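The clustering loop described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the temporal distance is taken as the difference of shot indices, the temporal weighting is constant, and the Complete Link update is used; all names (`T_MAX`, `cluster`, etc.) are illustrative.

```python
import numpy as np

T_MAX = 3        # maximal temporal interval T (here, in shot indices)
INF = np.inf

def time_constrained_distance(sim, i, j, weight=lambda d: 1.0):
    """Eq. (1): weighted dissimilarity if shots are close enough in time."""
    d_T = abs(i - j)                     # temporal distance between shots
    if d_T > T_MAX:
        return INF
    return weight(d_T) * (100.0 - sim[i, j])

def cluster(sim):
    """Complete Link hierarchy; returns the merge history as
    (members of cluster a, members of cluster b, merge level) tuples."""
    n = sim.shape[0]
    # initial proximity matrix between singleton clusters
    D = np.full((n, n), INF)
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = time_constrained_distance(sim, i, j)
    clusters = {k: [k] for k in range(n)}
    merges = []
    # iterate until the proximity matrix contains only infinite values
    while np.isfinite(D).any():
        i, j = np.unravel_index(np.argmin(D), D.shape)
        merges.append((sorted(clusters[i]), sorted(clusters[j]), D[i, j]))
        # Complete Link update: new distance is the max of the two old ones
        for k in list(clusters):
            if k not in (i, j):
                D[i, k] = D[k, i] = max(D[i, k], D[j, k])
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
        del clusters[j]
        D[j, :] = INF
        D[:, j] = INF
    return merges
```

With a toy 4-shot similarity matrix where shots 0–1 and 2–3 are similar, the merge history records the two close pairs first and the final merge last.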
2.2 Cophenetic dissimilarity criterion

In [8], another proximity matrix, called the cophenetic matrix, is proposed to capture the structure of the hierarchy. We use the time-constrained version of this matrix, C_t, to define a criterion for sequence segmentation. The cophenetic matrix can be expressed as C_t(i, j) = d_c(s_i, s_j), where d_c is a so-called clustering distance defined as:

    d_c(s_i, s_j) = δ_t(C_k, C_l), the dissimilarity level of the merge at which s_i and s_j first belong to the same cluster,

where δ_t is the index of dissimilarity constructed on d_t, and C_k and C_l are the two clusters merged at that step. Assuming that the shot indices follow a temporal order, the cophenetic matrix leads us to the definition of our criterion for segmentation, called the breaking distance, calculated between two consecutive shots as d_b(i) = C_t(i, i + 1).
2.3 Segmentation using the breaking distance

If the breaking distance between consecutive shots exceeds a given threshold τ, then a sequence boundary is inserted between these two shots.
An example is presented in Fig. 1, where two different thresholds τ1 and τ2 are evaluated to perform two different segmentations into sequences (Fig. 2).
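The thresholding step reduces to a one-line scan; `d_b` (the list of breaking distances between consecutive shots) and `tau` are illustrative names:

```python
def sequence_boundaries(d_b, tau):
    """Return the indices of the shots that start a new sequence:
    a boundary is inserted between shots i and i+1 when d_b[i] > tau."""
    return [i + 1 for i, d in enumerate(d_b) if d > tau]
```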
2.4 Comparison with the STG method
We have formally proved that our method delivers the
same segmentation into sequences as the STG method de-
scribedin[11]. The advantageofourformulationisto allow
one to visualize what the segmentation results are according
to the selected threshold value which can then be appropri-
ately tuned by the user. There is no need to rebuild the STG
whenever the threshold is changed.
Figure 1. Thresholding the breaking distance values on excerpt 1 of the Avengers movie (upper row), detected sequence boundaries for τ1 (upper middle row) and τ2 (lower middle row), and manual segmentation (lower row); the x-axis is the frame number.

Figure 2. Obtained sequence segmentation on excerpt 1 of the Avengers movie for one of the two thresholds. One detected sequence is an angle / reverse angle sequence; another is a fade out / fade in effect.
3 Description of implemented options
3.1 Signatures for shots
Three kinds of signatures are considered in practice: shot duration, color histogram, or region-color histogram. Color and region-color histograms are computed in a color space quantized with respectively 16, 4, and 4 levels, and 12 image blocks are considered for region histograms. The shot duration gives relevant information on the rhythm of the action and on the editing work.
3.2 Distances between signatures
Various distances between signatures have been tested. Comparison between histograms can be achieved using histogram intersection, the Euclidean distance, or the χ²-distance. The distance chosen between shot durations is the Manhattan distance.
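These distances might be implemented as follows. This is a sketch; the normalization conventions (e.g., scaling the intersection similarity to [0, 100] for normalized histograms, and the zero-bin handling in χ²) are assumptions, not taken from the paper:

```python
import numpy as np

def hist_intersection(h1, h2):
    """Histogram intersection, as a similarity in [0, 100] for normalized histograms."""
    return 100.0 * np.minimum(h1, h2).sum()

def euclidean(h1, h2):
    """Euclidean (L2) distance between histograms."""
    return np.linalg.norm(h1 - h2)

def chi2(h1, h2):
    """Chi-square histogram distance; bins where both histograms are zero are skipped."""
    num = (h1 - h2) ** 2
    den = h1 + h2
    mask = den > 0
    return (num[mask] / den[mask]).sum()

def manhattan(d1, d2):
    """Manhattan (L1) distance between two shot durations."""
    return abs(d1 - d2)
```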
3.3 Updating of the agglomerative binary hierarchy

In order to update the classification hierarchy, two algorithms are available [12]:

- the Complete Link method, where the index of dissimilarity between clusters is defined by δ(C_k ∪ C_l, C_m) = max(δ(C_k, C_m), δ(C_l, C_m));

- Ward's method, where the index of dissimilarity between two clusters is given by δ(C_k, C_l) = (n_k n_l / (n_k + n_l)) d(g_k, g_l)², where g_k is the gravity centre of cluster C_k, and n_k may represent either the number of shots in C_k or its duration.

In both cases, the Lance and Williams formula, given by

    δ(C_k ∪ C_l, C_m) = α_k δ(C_k, C_m) + α_l δ(C_l, C_m) + β δ(C_k, C_l) + γ |δ(C_k, C_m) − δ(C_l, C_m)|,

is used to update the proximity matrix. We have α_k = α_l = 1/2, β = 0, γ = 1/2 for the Complete Link method, and α_k = (n_k + n_m)/(n_k + n_l + n_m), α_l = (n_l + n_m)/(n_k + n_l + n_m), β = −n_m/(n_k + n_l + n_m), γ = 0 for Ward's method.
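Both update schemes fit the Lance and Williams recurrence directly; the sketch below simply plugs in the coefficients listed above (cluster sizes `n_k`, `n_l`, `n_m` may be shot counts or durations, consistent with the option sets of Section 4):

```python
def lance_williams(d_km, d_lm, d_kl, alpha_k, alpha_l, beta, gamma):
    """d(k U l, m) = a_k d(k,m) + a_l d(l,m) + b d(k,l) + g |d(k,m) - d(l,m)|"""
    return (alpha_k * d_km + alpha_l * d_lm + beta * d_kl
            + gamma * abs(d_km - d_lm))

def complete_link_update(d_km, d_lm, d_kl):
    # alpha_k = alpha_l = 1/2, beta = 0, gamma = 1/2  ->  max(d_km, d_lm)
    return lance_williams(d_km, d_lm, d_kl, 0.5, 0.5, 0.0, 0.5)

def ward_update(d_km, d_lm, d_kl, n_k, n_l, n_m):
    # Ward coefficients depend on the cluster sizes (counts or durations)
    n = n_k + n_l + n_m
    return lance_williams(d_km, d_lm, d_kl,
                          (n_k + n_m) / n, (n_l + n_m) / n, -n_m / n, 0.0)
```

With the Complete Link coefficients, the recurrence recovers exactly the maximum of the two old distances, which is the property the agglomeration in Section 2.1 relies on.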
3.4 Temporal weighting function

The temporal weighting function is used to constrain the distance and the index of dissimilarity, as introduced in equation (1). In [11], only one type of temporal weighting function was proposed, namely a rectangular function, which is not smooth. We have tested three smooth functions: linear, parabolic, and sinusoidal.
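The analytical forms of these weighting functions are not reproduced in this excerpt; the sketch below shows one plausible family of increasing weights (equal to 1 at d_T = 0 and growing with the temporal distance up to T), which should be taken as an assumption rather than as the authors' definitions:

```python
import math

T = 10.0  # maximal temporal interval (illustrative value)

def rectangular(d_T):
    """The non-smooth weighting of [11]: constant inside the window."""
    return 1.0 if d_T <= T else float("inf")

def linear(d_T):
    return 1.0 + d_T / T

def parabolic(d_T):
    return 1.0 + (d_T / T) ** 2

def sinusoidal(d_T):
    return 1.0 + math.sin(0.5 * math.pi * d_T / T)
```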
4 Experimental results
We are evaluating our method on a three-hour video corpus. For this communication, four excerpts of two minutes each were selected. Three excerpts are taken from Avengers movies to evaluate the segmentation into sequences in different contexts. The first one comprises an angle / reverse angle editing effect and a content change with a dissolve effect. The second one includes a set change, and the third one involves color and rhythm changes. The obtained segmentations can be compared with a manual segmentation acting as ground truth, which is weighted as follows: 1 for main changes, 0.5 for secondary changes. The last excerpt is extracted from a news program to test the relevance of the built hierarchy.
Among the implemented options, three sets were selected for their relevant results: histogram intersection on color histograms, rectangular temporal weighting function, and the Complete Link method; histogram intersection on color histograms, parabolic temporal weighting function, and Ward's method based on cluster durations; and Manhattan distance on shot durations, parabolic weighting function, and Ward's method based on cluster durations.
Figure 3. Breaking distance values on excerpt 2 of the Avengers movie for two of the selected options sets (upper and middle rows), and manual segmentation (lower row); the x-axis is the frame number.
Results obtained on the news program excerpt show that the clustering distance d_c provides a correct description of the similarity between shots at different levels, even if the information distribution is not homogeneous across the various levels of the hierarchy. An adaptive thresholding applied to the breaking distance values would nevertheless be necessary to avoid heterogeneous results. Tests have shown that the best description is obtained with one particular options set.
Figure 4. Breaking distance values on excerpt 3 of the Avengers movie for two of the selected options sets (upper and middle rows), and manual segmentation (lower row); the x-axis is the frame number.
In the processed excerpts, most of the sequence changes were correctly detected when the proper options were selected. In Fig. 1, using one of the selected thresholds and options sets, one can see that all changes are detected with only one false alarm and that the angle / reverse angle effect is recognized, but also that selecting the threshold value is a rather critical issue. On excerpt 2, with a relevant threshold, we can predict all the boundaries with one of the options sets, with only one false alarm (Fig. 3). Using the options set that is relevant for the hierarchy building, false alarm and miss rates increase on excerpt 2. The color and rhythm changes in excerpt 3 (Fig. 4) were better detected with one options set than with another. Consequently, how to automatically select the proper options remains an open issue.
5 Conclusion

The method described in this paper, based on the cophenetic matrix, allows us to determine and visualize the sequence boundaries corresponding to all levels in the binary agglomerative time-constrained hierarchy. We implemented several options. Selecting the most appropriate ones improved our results and gave a better description of the similarity of the shots through the hierarchy. Experiments on a larger scale will be undertaken in future work for selecting the best parameter sets and evaluating alternative thresholding strategies.
References
[1] P. Aigrain, P. Joly, and V. Longueville. Medium knowledge-based macro-segmentation of video into sequences. In M. T. Maybury, editor, Intelligent Multimedia Information Retrieval, pages 159–173. AAAI/MIT Press, 1997.
[2] J. S. Boreczky and L. D. Wilcox. A hidden Markov model framework for video segmentation using audio and image features. In Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'97), Seattle, 1997.
[3] J. Carrive, F. Pachet, and R. Ronfard. Using description logics for indexing audiovisual documents. In ITC-IRST, editor, Int. Workshop on Description Logics (DL'98), pages 116–120, Trento, 1998.
[4] A. Dailianas, R. B. Allen, and P. England. Comparison of automatic video segmentation algorithms. In SPIE Photonics West, volume 2615, pages 2–16, Philadelphia, 1995.
[5] A. Hanjalic, R. L. Lagendijk, and J. Biemond. Automatically segmenting movies into logical story units. In Third Int. Conf. on Visual Information Systems (VISUAL'99), volume LNCS 1614, pages 229–236, Amsterdam, 1999.
[6] A. G. Hauptmann and M. A. Smith. Text, speech, and vision for video segmentation: the Informedia project. In AAAI Fall Symposium on Computational Models for Integrating Language and Vision, Boston, 1995.
[7] G. Iyengar and A. Lippman. Models for automatic classification of video sequences. In Photonics West '98 (Storage and Retrieval VI), volume SPIE 3312, pages 216–227, San Jose, 1998.
[8] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[9] J. R. Kender and B.-L. Yeo. Video scene segmentation via continuous video coherence. Technical report, IBM Research Division, 1997.
[10] R. Lienhart, S. Pfeiffer, and W. Effelsberg. Scene determination based on video and audio features. Technical report, University of Mannheim, November 1998.
[11] M. Yeung, B.-L. Yeo, and B. Liu. Extracting story units from long programs for video browsing and navigation. In Proc. IEEE Int. Conf. on Multimedia Computing and Systems, Tokyo, 1996.
[12] J. Zupan. Clustering of Large Data Sets. Chemometrics Research Studies Series. John Wiley & Sons, 1982.
Acknowledgement The images from the Avengers movie, part of the AIM corpus, were reproduced thanks to INA, Department of Innovation.