From Video Shot Clustering to Sequence Segmentation
Emmanuel Veneau, Rémi Ronfard
Institut National de l’Audiovisuel
4, avenue de l’Europe
94366 Bry-sur-Marne cedex, France
{eveneau,rronfard}@ina.fr
Patrick Bouthemy
IRISA/INRIA
Campus Universitaire de Beaulieu
35042 Rennes cedex, France
bouthemy@irisa.fr
Abstract
Automatically building high-level segments to structure information in video documents is a challenging task. This paper presents a method based on the cophenetic criterion, a distance defined between clustered shots that is used to detect breaks between sequences. It describes and compares various implemented options. Experiments have proved that the proposed criterion can be used for achieving segmentation.
1 Introduction
Browsing and querying data in video documents requires structuring the information extracted from the audio and video flows. The first step in building a structured description of the data is to segment the video document. The elementary segment is the shot, which is usually defined as the smallest continuous unit of a video document. Numerous methods for shot segmentation have been proposed (e.g. see [4]). Nevertheless, shots are often not the relevant level for describing pertinent events, and they are too numerous to enable efficient indexing or browsing.
The grouping of shots into higher-level segments has been investigated through various methods, which can be gathered into four main families. The first one is based on the principle of the Scene Transition Graph (STG) [11], which can be exploited in a continuous way [9] or according to alternate versions [5]. The methods of the second family [1, 3] use explicit models of video documents or rules related to editing techniques and film theory. In the third family [6, 10], emphasis is put on the joint use of features extracted from audio, video, and textual information; these methods achieve shot grouping more or less through a synthesis of the segmentations performed for each medium. The fourth family of algorithms relies on statistical techniques such as Hidden Markov Models (HMM) and other Bayesian tools [2, 7].
In this paper, we present a method based on a cophenetic criterion, which belongs to the first family. The remainder of the paper is organized as follows. Section 2 describes our method, involving an agglomerative binary hierarchy and the use of the cophenetic matrix. Section 3 specifies the various options we have implemented with respect to extracted features, distances between features, hierarchy updating, and temporal constraints. Experimental results are reported in Section 4, and Section 5 contains concluding remarks.
2 Binary hierarchy for describing shot similarity
We assume that a segmentation of the video into shots is available, where each shot is represented by one or more extracted keyframes. The information contained in a shot (except its duration) reduces to the (average) signature computed from the corresponding keyframes. We build a spatio-temporal description of shot similarity through a binary agglomerative hierarchical time-constrained clustering.
2.1 Binary agglomerative hierarchical time-constrained clustering
To build a hierarchy following standard methods [12], we require a similarity measure $\mathrm{sim}$ between shots and a distance between shot clusters, called index of dissimilarity, $\delta$. The temporal constraint, as defined in [11], involves a temporal distance $d_T$. We introduce a temporal weighting function $w$ in order to have a general model for the temporal constraint. The formal definitions of these functions will be given in Section 3. The time-constrained distance $d_t$ between shots is defined (assuming that similarity is normalized between 0 and 100) by:

$$ d_t(s_i, s_j) = \begin{cases} 100 - w(d_T(s_i, s_j)) \cdot \mathrm{sim}(s_i, s_j) & \text{if } d_T(s_i, s_j) \leq T \\ +\infty & \text{otherwise} \end{cases} \qquad (1) $$

where $s_i$ and $s_j$ designate two shots and $T$ is the maximal temporal interval for considering any interaction between shots.
At the beginning of the process, each shot forms a cluster, and the time-constrained dissimilarity index $\delta_t$ between clusters is the time-constrained distance $d_t$ between shots. A symmetric time-constrained proximity matrix can be defined [8], using $d_t$, as a representation of the dissimilarity between clusters. The hierarchy is built by merging the two closest clusters at each step. The proximity matrix is updated according to the index of dissimilarity $\delta_t$ to take into account the newly created cluster. This step is iterated until the proximity matrix contains only infinite values.
The resulting binary time-constrained hierarchy provides a description of the spatio-temporal proximity of shots.
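As an illustration, the following is a minimal Python sketch of Equation (1); the names `sim`, `frame_pos`, `T_max`, and `w` are reconstructed notation rather than the paper's own, and measuring the temporal distance in frame numbers is an assumption.

```python
import numpy as np

def time_constrained_distance(sim, frame_pos, T_max, w=lambda dT: 1.0):
    """Pairwise time-constrained shot distance d_t of Eq. (1) (sketch).

    sim       -- (n, n) shot similarity matrix, normalized to [0, 100]
    frame_pos -- (n,) temporal positions of the shots (frame numbers, assumed)
    T_max     -- maximal temporal interval for shot interaction
    w         -- temporal weighting function (rectangular by default)
    """
    n = len(frame_pos)
    d = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(n):
            dT = abs(frame_pos[i] - frame_pos[j])
            if dT <= T_max:
                # Down-weighting the similarity of temporally distant shots
                # increases their distance; beyond T_max it stays infinite.
                d[i, j] = 100.0 - w(dT) * sim[i, j]
    return d
```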
2.2 Cophenetic dissimilarity criterion
In [8], another proximity matrix, called the cophenetic matrix, is proposed to capture the structure of the hierarchy. We will use the time-constrained version of this matrix, $C_t$, to define a criterion for sequence segmentation. The cophenetic matrix can be expressed as $C_t(i, j) = d_c(s_i, s_j)$, where $d_c$ is a so-called clustering distance defined as:

$$ d_c(s_i, s_j) = \delta_t(X_m, X_n), $$

where $\delta_t$ is the index of dissimilarity constructed on $d_t$, and $X_m$ and $X_n$ are the two clusters whose merging first places $s_i$ and $s_j$ in the same cluster. Assuming that the shot indices follow a temporal order, the cophenetic matrix leads us to the definition of our criterion for segmentation, called the breaking distance, calculated between two consecutive shots as $b(i) = C_t(i, i+1)$.
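Setting the time constraint aside, the cophenetic matrix and the breaking distance can be obtained with off-the-shelf hierarchical clustering; the sketch below uses SciPy and replaces infinite entries with a large finite penalty, which only approximates the time-constrained construction described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def breaking_distances(d):
    """Breaking distances b(i) = C(i, i+1) between consecutive shots (sketch).

    d -- square shot-distance matrix, e.g. from time_constrained_distance();
         shots are assumed to be indexed in temporal order.
    """
    d = np.where(np.isinf(d), 1e6, d)    # finite stand-in for the constraint
    y = squareform(d, checks=False)      # condensed form expected by linkage
    Z = linkage(y, method='complete')    # binary agglomerative hierarchy
    C = squareform(cophenet(Z))          # square cophenetic matrix
    return np.array([C[i, i + 1] for i in range(len(C) - 1)])
```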
2.3 Segmentation using the breaking distance
If the breaking distance between consecutive shots exceeds a given threshold $\tau$, then a sequence boundary is inserted between these two shots.

An example is presented in Fig. 1, where two different thresholds $\tau_1$ and $\tau_2$ are evaluated to perform two different segmentations into sequences (Fig. 2).
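The thresholding step then reduces to a comparison against $\tau$; the values below are purely illustrative stand-ins for $\tau_1$ and $\tau_2$.

```python
breaking = breaking_distances(d)      # d from the earlier sketches
for tau in (30.0, 60.0):              # hypothetical values for tau_1, tau_2
    # Index i in `boundaries` means a sequence break between shots i and i+1.
    boundaries = np.flatnonzero(breaking > tau)
    print(f"tau = {tau}: boundaries after shots {boundaries.tolist()}")
```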
2.4 Comparison with the STG method
We have formally proved that our method delivers the same segmentation into sequences as the STG method described in [11]. The advantage of our formulation is that it allows one to visualize the segmentation results according to the selected threshold value, which can then be appropriately tuned by the user. There is no need to rebuild the STG whenever the threshold is changed.
Figure 1. Thresholding the breaking distance values on excerpt 1 of the Avengers movie (upper row), detected sequence boundaries for $\tau_1$ (upper middle row) and $\tau_2$ (lower middle row), and manual segmentation (lower row), plotted against frame number.

Figure 2. Obtained sequence segmentation on excerpt 1 of the Avengers movie for one of the two thresholds; one of the detected sequences is an angle / reverse angle sequence, another is a fade out / fade in effect.
3 Description of implemented options
3.1 Signatures for shots
Three kinds of signatures are considered in practice: shot duration, color histogram, or region-color histogram. Color and region-color histograms are defined in a three-component color space with respectively 16, 4, and 4 levels, and 12 image blocks are considered for region-color histograms. The shot duration gives relevant information on the rhythm of the action and on the editing work.
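A possible implementation of the region-color signature is sketched below with OpenCV. The 16/4/4 quantization, the 12 blocks, and the averaging over keyframes follow the text; the HSV color space and the 4x3 block grid are assumptions, since the extracted text does not specify them.

```python
import cv2
import numpy as np

def shot_signature(keyframes, bins=(16, 4, 4)):
    """Average region-color histogram signature of a shot (sketch)."""
    sigs = []
    for img in keyframes:                    # BGR images, e.g. from cv2.imread
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)   # HSV is an assumption
        h, w = hsv.shape[:2]
        per_block = []
        for by in range(3):                  # 4x3 grid = 12 blocks (assumed)
            for bx in range(4):
                block = hsv[by * h // 3:(by + 1) * h // 3,
                            bx * w // 4:(bx + 1) * w // 4]
                hist = cv2.calcHist([block], [0, 1, 2], None, list(bins),
                                    [0, 180, 0, 256, 0, 256])
                per_block.append(hist.ravel() / hist.sum())
        sigs.append(np.concatenate(per_block))
    return np.mean(sigs, axis=0)             # average over the keyframes
```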
3.2 Distances between signatures
Various distances between signatures have been tested.
Comparisonbetween histogramscan be achieved using his-
togram intersection, euclidian distance,
-distance. The
distance chosen between shot durations is the Manhattan
distance.
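These dissimilarities can be written compactly as below; the $\chi^2$ expression is one common variant, given here as a sketch.

```python
import numpy as np

def hist_intersection(p, q):
    # Similarity in [0, 1] for normalized histograms; 1 - value is a distance.
    return np.minimum(p, q).sum()

def euclidean(p, q):
    return np.linalg.norm(p - q)

def chi2(p, q, eps=1e-10):
    # One common form of the chi-squared histogram distance.
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def duration_distance(d1, d2):
    # The Manhattan (L1) distance between scalar durations is |d1 - d2|.
    return abs(d1 - d2)
```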
3.3 Updating of the agglomerative binary hierarchy

In order to update the classification hierarchy, two algorithms are available [12]:

- the Complete Link method, where the index of dissimilarity between clusters is defined by $\delta(X_i, X_j) = \max_{s_m \in X_i,\, s_n \in X_j} d_t(s_m, s_n)$;

- Ward's method, where the index of dissimilarity between clusters is given by $\delta(X_i, X_j) = \frac{|X_i|\,|X_j|}{|X_i| + |X_j|}\, d(g_i, g_j)^2$, where $g_i$ is the gravity centre of cluster $X_i$ and $d$ may represent the distance computed either on the shot signatures or on the shot durations.

In both cases, the Lance and Williams formula, given by

$$ \delta(X_k, X_i \cup X_j) = \alpha_i\,\delta(X_k, X_i) + \alpha_j\,\delta(X_k, X_j) + \beta\,\delta(X_i, X_j) + \gamma\,|\delta(X_k, X_i) - \delta(X_k, X_j)|, $$

is used to update the proximity matrix. We have $\alpha_i = \alpha_j = 1/2$, $\beta = 0$, $\gamma = 1/2$ for the Complete Link method, and $\alpha_i = \frac{n_k + n_i}{n_k + n_i + n_j}$, $\alpha_j = \frac{n_k + n_j}{n_k + n_i + n_j}$, $\beta = \frac{-n_k}{n_k + n_i + n_j}$, $\gamma = 0$ for Ward's method, where $n_k$, $n_i$, $n_j$ denote the cluster sizes.
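A direct transcription of the Lance and Williams update with these two coefficient sets; the cluster sizes `n_k`, `n_i`, `n_j` follow the standard notation rather than the extracted text.

```python
def lance_williams_update(delta_ki, delta_kj, delta_ij,
                          n_k, n_i, n_j, method='complete'):
    """Dissimilarity between cluster k and the merged cluster i U j (sketch)."""
    if method == 'complete':
        a_i = a_j = 0.5
        beta, gamma = 0.0, 0.5
    elif method == 'ward':
        s = n_k + n_i + n_j
        a_i, a_j = (n_k + n_i) / s, (n_k + n_j) / s
        beta, gamma = -n_k / s, 0.0
    else:
        raise ValueError(f"unknown method: {method}")
    return (a_i * delta_ki + a_j * delta_kj + beta * delta_ij
            + gamma * abs(delta_ki - delta_kj))
```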
3.4 Temporal weighting function
The temporal weighting function $w$ is used to constrain the distance and the index of dissimilarity, as introduced in Equation (1). In [11], only one type of temporal weighting function was proposed, namely a rectangular function, which is not smooth. We have tested three smooth functions: linear, parabolic, and sinusoidal.
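The exact expressions of these functions were lost in extraction; the sketch below gives plausible shapes, all equal to 1 at $d_T = 0$ and vanishing beyond $T$, matching the role of $w$ in Equation (1).

```python
import numpy as np

# Hypothetical reconstructions: the paper names the shapes (rectangular,
# linear, parabolic, sinusoidal) but not their formulas.
def w_rectangular(dT, T_max):
    return np.where(dT <= T_max, 1.0, 0.0)

def w_linear(dT, T_max):
    return np.clip(1.0 - dT / T_max, 0.0, 1.0)

def w_parabolic(dT, T_max):
    return np.clip(1.0 - (dT / T_max) ** 2, 0.0, 1.0)

def w_sinusoidal(dT, T_max):
    return np.where(dT <= T_max, np.cos(0.5 * np.pi * dT / T_max), 0.0)
```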
4 Experimental results
We are evaluating our method on a three-hour video corpus. For this communication, four excerpts of two minutes each were selected. Three excerpts are taken from Avengers movies to evaluate the segmentation into sequences in different contexts. The first one comprises an angle / reverse angle editing effect and a content change with a dissolve effect. The second one includes a set change, and the third one involves color and rhythm changes. The obtained segmentations can be compared with a hand segmentation acting as ground truth, which is weighted as follows: 1 for main changes, 0.5 for secondary changes. The last excerpt is extracted from a news program to test the relevance of the built hierarchy.
Among the implemented options, three sets were selected for their relevant results: (1) color histogram intersection, rectangular temporal weighting function, and the Complete Link method; (2) color histogram intersection, parabolic temporal weighting function, and Ward's method based on cluster durations; (3) Manhattan distance on shot durations, parabolic weighting function, and Ward's method based on cluster durations.
Figure 3. Breaking distance values on excerpt 2 of the Avengers movie for two of the selected options sets (upper and middle rows), and manual segmentation (lower row), plotted against frame number.
Results obtained on the news program excerpt show that the clustering distance $d_c$ provides a correct description of the similarity between shots at different levels, even if the information distribution is not homogeneous across the various levels of the hierarchy. An adaptive thresholding applied to the breaking distance values would nevertheless be necessary to avoid heterogeneous results. Tests have shown that one of the three options sets yields the best description of the hierarchy.
Figure 4. Breaking distance values on excerpt 3 of the Avengers movie for two of the selected options sets (upper and middle rows), and manual segmentation (lower row), plotted against frame number.
In the processed excerpts, most of the sequence changes were correctly detected when the proper options were selected. In Fig. 1, one can see that all changes are detected with only one false alarm and that the angle / reverse angle effect is recognized, but also that selecting the threshold value is a rather critical issue. On excerpt 2, with a relevant threshold, we can predict all the boundaries with only one false alarm (Fig. 3). Using the options set best suited to hierarchy building, false alarm and miss rates increase on excerpt 2. The color and rhythm changes in excerpt 3 (Fig. 4) were detected better with one options set than with the others. Consequently, how to automatically select the proper options remains an open issue.
5 Conclusion

The method described in this paper, based on the cophenetic matrix, allows us to determine and visualize the sequence boundaries corresponding to all levels in the binary agglomerative time-constrained hierarchy. We implemented several options; selecting the most appropriate ones improved our results and gave a better description of the similarity of the shots through the hierarchy. Experiments on a larger scale will be undertaken in future work to select the best parameter sets and to evaluate alternative thresholding strategies.
References

[1] P. Aigrain, P. Joly, and V. Longueville. Medium knowledge-based macro-segmentation of video into sequences. In M. T. Maybury, editor, Intelligent Multimedia Information Retrieval, pages 159-173. AAAI/MIT Press, 1997.
[2] J. S. Boreczky and L. D. Wilcox. A hidden Markov model framework for video segmentation using audio and image features. In Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'97), Seattle, 1997.
[3] J. Carrive, F. Pachet, and R. Ronfard. Using description logics for indexing audiovisual documents. In ITC-IRST, editor, Int. Workshop on Description Logics (DL'98), pages 116-120, Trento, 1998.
[4] A. Dailianas, R. B. Allen, and P. England. Comparison of automatic video segmentation algorithms. In SPIE Photonics West, volume 2615, pages 2-16, Philadelphia, 1995.
[5] A. Hanjalic, R. L. Lagendijk, and J. Biemond. Automatically segmenting movies into logical story units. In Third Int. Conf. on Visual Information Systems (VISUAL'99), volume LNCS 1614, pages 229-236, Amsterdam, 1999.
[6] A. G. Hauptmann and M. A. Smith. Text, speech, and vision for video segmentation: the Informedia project. In AAAI Fall Symposium, Computational Models for Integrating Language and Vision, Boston, 1995.
[7] G. Iyengar and A. Lippman. Models for automatic classification of video sequences. In Photonics West '98 (Storage and Retrieval VI), volume SPIE 3312, pages 216-227, San Jose, 1998.
[8] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[9] J. R. Kender and B.-L. Yeo. Video scene segmentation via continuous video coherence. Technical report, IBM Research Division, 1997.
[10] R. Lienhart, S. Pfeiffer, and W. Effelsberg. Scene determination based on video and audio features. Technical report, University of Mannheim, November 1998.
[11] M. Yeung, B.-L. Yeo, and B. Liu. Extracting story units from long programs for video browsing and navigation. In Proc. of IEEE Int. Conf. on Multimedia Computing and Systems, Tokyo, 1996.
[12] J. Zupan. Clustering of Large Data Sets. Chemometrics Research Studies Series. John Wiley & Sons Ltd., 1982.
Acknowledgement. The images from the Avengers movie, part of the AIM corpus, were reproduced thanks to INA, Department of Innovation.