From Video Shot Clustering to Sequence Segmentation
Emmanuel Veneau, Rémi Ronfard
Institut National de l'Audiovisuel / IRISA-INRIA
4, avenue de l'Europe
94366 Bry-sur-Marne cedex, France
Patrick Bouthemy
IRISA-INRIA, Campus Universitaire de Beaulieu
35042 Rennes cedex, France
{eveneau,rronfard}@ina.fr, bouthemy@irisa.fr
Abstract
Segmenting video documents into sequences from elementary shots to supply an appropriate higher level description of the video is a challenging task. This paper presents a two-stage method. First, we build a binary agglomerative hierarchical time-constrained shot clustering. Second, based on the cophenetic criterion, a breaking distance between shots is computed to detect sequence changes. Various options are implemented and compared. Real experiments have proved that the proposed criterion can be efficiently used to achieve appropriate segmentation into sequences.
1 Introduction
Browsing and querying data in video documents requires first extracting and organizing information from the audio and video tracks. The first step in building a structured description is to segment the video document into elementary shots, which are usually defined as the smallest continuous units of a video document. Numerous methods for shot segmentation have been proposed (e.g., see [3]). Nevertheless, shots are often not the relevant level to describe pertinent events, and are too numerous to enable efficient indexing or browsing.
The grouping of shots into higher-level segments has been investigated through various methods which can be gathered into three main families. The first one is based on the principle of the Scene Transition Graph (STG) [9], which can be formulated in a continuous way [7], or according to alternate versions [4]. Methods of the second family [1, 2] use explicit models of video documents or rules related to editing techniques and film theory. In the third family [5, 8], emphasis is put on the joint use of features extracted from audio, video and textual information. These methods achieve shot grouping more or less through a combination of the segmentations performed for each track.
We present a method based on a so-called cophenetic criterion, which belongs to the first family. The sequel is organized as follows. Section 2 describes our method involving an agglomerative binary hierarchy and the use of the cophenetic matrix. Section 3 specifies the various options we have implemented with respect to extracted features, distance between features, hierarchy updating, and temporal constraints. Experimental results are reported in Section 4, and Section 5 contains concluding remarks.
2 Binary hierarchy for describing shot similarity
We assume that a segmentation of the video into shots is available, where each shot is represented by one or more extracted keyframes. The information representing a shot (except its duration) is given by the (average) signature computed from the corresponding keyframes. We build a spatio-temporal evaluation of shot similarity through a binary agglomerative hierarchical time-constrained clustering.
2.1 Binary agglomerative hierarchical time-constrained clustering
To build a hierarchy following usual methods [10], we need to define a similarity measure s between shots, and a distance between shot clusters, called index of dissimilarity δ. The temporal constraint, as defined in [9], involves a temporal distance d_t. We introduce a temporal weighting function W accounting for a general model of the temporal constraint. The formal definitions of all these functions will be given in Section 3. The time-constrained distance d between shots is defined (assuming that similarity is normalized between 0 and 100) by:

$$ d(i,j) = \begin{cases} (100 - s(i,j)) \cdot W(i,j) & \text{if } d_t(i,j) \le \Delta T \\ +\infty & \text{otherwise} \end{cases} \qquad (1) $$

where i and j designate two shots and ΔT is the maximal temporal interval for considering any interaction between shots.
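As an illustration, here is a minimal sketch of how the time-constrained distance of equation (1) could be computed. The array layout, the function names, and the assumption that W depends only on the temporal distance between the two shots are ours, not the paper's.

```python
import numpy as np

def time_constrained_distance(sim, times, weight, delta_t_max):
    """Sketch of the time-constrained shot distance of Eq. (1).

    sim         : (N, N) array of shot similarities s(i, j), normalized to [0, 100]
    times       : (N,) array of shot times (e.g. mid-shot timestamps, in seconds)
    weight      : callable weight(dt) -> temporal weighting W (assumed to depend
                  only on the temporal distance dt between the two shots)
    delta_t_max : maximal temporal interval Delta T for shot interaction
    """
    n = len(times)
    d = np.full((n, n), np.inf)      # pairs farther apart than Delta T stay infinite
    for i in range(n):
        for j in range(n):
            dt = abs(times[i] - times[j])
            if dt <= delta_t_max:
                d[i, j] = (100.0 - sim[i, j]) * weight(dt)
    return d
```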
At the beginning of the process, each shot forms a cluster, and the time-constrained dissimilarity index between clusters is then the time-constrained distance d between shots. A symmetric time-constrained N x N proximity matrix D = [d(i,j)] is considered [6], using δ to evaluate the dissimilarity between clusters. The hierarchy is built by merging the two closest clusters at each step. The matrix D is updated according to the index of dissimilarity δ to take into account each newly created cluster. This is iterated until the proximity matrix contains only infinite values. The resulting binary time-constrained hierarchy supplies a description of the spatio-temporal proximity of the extracted shots.
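A possible sketch of this agglomerative loop is given below, recording each merge and the dissimilarity level at which it occurs. A complete-link update is used here for concreteness (the options actually implemented are detailed in Section 3.3), and all names are illustrative.

```python
import numpy as np

def build_hierarchy(d):
    """Sketch of the binary agglomerative time-constrained clustering.

    d is the time-constrained proximity matrix (np.inf marks pairs that may
    never interact).  Returns the merges as (cluster_a, cluster_b, level) with
    clusters given as frozensets of shot indices.
    """
    n = d.shape[0]
    clusters = {i: frozenset([i]) for i in range(n)}   # active clusters, keyed by row id
    prox = d.astype(float).copy()
    merges = []
    while True:
        ids = sorted(clusters)
        best, pair = np.inf, None
        for a in ids:                                  # find the two closest clusters
            for b in ids:
                if a < b and prox[a, b] < best:
                    best, pair = prox[a, b], (a, b)
        if pair is None or not np.isfinite(best):      # only infinite values remain
            break
        a, b = pair
        merges.append((clusters[a], clusters[b], best))
        for c in ids:                                  # complete-link update of row a
            if c not in (a, b):
                prox[a, c] = prox[c, a] = max(prox[a, c], prox[b, c])
        clusters[a] = clusters[a] | clusters[b]
        del clusters[b]
        prox[b, :] = np.inf
        prox[:, b] = np.inf
    return merges
```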
2.2 Cophenetic dissimilarity criterion
In [6], another proximity matrix D_c, called the cophenetic matrix, is proposed to capture the structure of the hierarchy. We will use the time-constrained version of this matrix to define a criterion for the segmentation of the video into sequences. The cophenetic matrix is expressed as D_c = [d_c(i,j)], where d_c is the so-called clustering distance, i.e., the dissimilarity level at which two shots are first merged into the same cluster:

$$ d_c(i, j) = \delta(C_p, C_q), \qquad i \in C_p, \; j \in C_q, $$

where δ is the index of dissimilarity constructed from d, and C_p and C_q are the two clusters merged at that step. Assuming that the shot indices follow a temporal order, the cophenetic matrix leads to the definition of our criterion for sequence segmentation, called breaking distance, calculated between two consecutive shots as:

$$ d_b(i, i+1) = \min_{k \le i < l} \, d_c(k, l). $$
2.3 Segmentation using the breaking distance
If the breaking distance d_b between consecutive shots exceeds a given threshold τ, then a sequence boundary is inserted between these two shots. An example is presented in Fig. 1, where two different thresholds for the segmentation into sequences, τ1 = 20 and τ2 = 45, are considered. Fig. 2 displays results corresponding to thresholds τ1 and τ2.
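In code, the thresholding step is immediate; a sketch, where d_b is the hypothetical array of breaking distances from the previous sketch:

```python
def sequence_boundaries(d_b, tau):
    """Sketch: a sequence boundary is inserted between shots i and i+1
    whenever the breaking distance exceeds the threshold tau."""
    return [i + 1 for i, value in enumerate(d_b) if value > tau]

# e.g. sequence_boundaries(d_b, 20) and sequence_boundaries(d_b, 45)
# would correspond to the two thresholds tau1 and tau2 of Fig. 1.
```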
2.4 Comparison with the STG method
We have formally proved that our method delivers the same segmentation into sequences as the STG method described in [9]. Considering that the STG method treats inter-shot spacing in a binary way and implies a non-obvious setting of parameters [7], the advantage of our formulation is to smooth the effects of time, in the time-constrained distance, using continuous temporal weighting functions, and to consider a threshold parameter related to sequence segmentation and not to shot clustering. As a consequence, our approach allows one to visualize what the segmentation results are according to the selected threshold value, which can then be appropriately tuned by the user. There is no need to rebuild the STG whenever the threshold is changed.
Figure 1. Thresholding the breaking distance values on excerpt 1 of the Avengers movie (upper row), detected sequence boundaries for τ1 (upper middle row) and τ2 (lower middle row), and manual segmentation (lower row).
3 Description of implemented options
3.1 Signatures for shots
We have considered in practice three kinds of signatures: shot duration, color histograms, and region-based color histograms. Color and region-based color histograms are defined in the (Y, Cb, Cr) space with respectively 16, 4, and 4 levels, and 12 image blocks are considered for region-based histograms. The shot duration gives relevant information on the rhythm of the action and on the editing work.
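A sketch of how such signatures could be computed for one keyframe follows. The 3x4 block grid, the bin ranges and the normalization are assumptions on our part; the text above only specifies the (Y, Cb, Cr) quantization (16, 4, 4) and the number of blocks (12).

```python
import numpy as np

def keyframe_signatures(ycbcr, grid=(3, 4)):
    """Sketch of the color and region-based color signatures for one keyframe.

    ycbcr : (H, W, 3) array of a keyframe already converted to (Y, Cb, Cr).
    grid  : block layout for the region-based histograms; a 3x4 grid giving
            the 12 blocks mentioned above (the exact layout is an assumption).
    """
    bins = (16, 4, 4)                                  # quantization of Y, Cb, Cr
    ranges = ((0, 256), (0, 256), (0, 256))

    def hist(region):
        h, _ = np.histogramdd(region.reshape(-1, 3), bins=bins, range=ranges)
        return h.ravel() / region.reshape(-1, 3).shape[0]

    global_hist = hist(ycbcr)                          # plain color histogram
    rows, cols = grid
    block_hists = [hist(ycbcr[r[:, None], c])          # region-based histograms
                   for r in np.array_split(np.arange(ycbcr.shape[0]), rows)
                   for c in np.array_split(np.arange(ycbcr.shape[1]), cols)]
    return global_hist, np.concatenate(block_hists)
```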
3.2 Distances between signatures
Various distances between signatures have been tested. Comparison between histograms can be achieved using histogram intersection, the Euclidean distance, or the χ2-distance. The distance chosen between shot durations is the Manhattan distance.
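For reference, common forms of these comparison functions are sketched below; the exact variants used in the paper (e.g. the χ2 normalization) are not specified, so these are assumptions.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms, in [0, 1]; multiplying by 100
    matches the 0-100 similarity scale assumed in Eq. (1)."""
    return np.minimum(h1, h2).sum()

def euclidean_distance(h1, h2):
    return np.linalg.norm(h1 - h2)

def chi_square_distance(h1, h2, eps=1e-12):
    # one common form of the chi-square histogram distance
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def manhattan_distance(dur1, dur2):
    """Distance between shot-duration signatures."""
    return abs(dur1 - dur2)
```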
3.3 Updating of the agglomerative binary hierarchy
In order to update the classification hierarchy, two algorithms are available [10]:

• the Complete Link method, for which the index of dissimilarity between clusters is defined by:

$$ \delta(C_i, C_j) = \max_{x \in C_i, \, y \in C_j} d(x, y); $$

• the Ward's method, for which the index of dissimilarity between clusters is given by:

$$ \delta(C_i, C_j) = \frac{n_{C_i} \, n_{C_j}}{n_{C_i} + n_{C_j}} \, d^2(G_{C_i}, G_{C_j}), $$

where G_{C_i} is the gravity centre of cluster C_i and n_{C_i} represents either Cardinal(C_i) or Duration(C_i).

In both cases, the Lance and Williams formula, given by

$$ \delta(A \cup B, C) = a_1 \, \delta(A, C) + a_2 \, \delta(B, C) + a_3 \, \delta(A, B) + a_4 \, |\delta(A, C) - \delta(B, C)|, $$

is used to update the proximity matrix. We have a_1 = a_2 = a_4 = 1/2 and a_3 = 0 for the Complete Link method, and a_1 = (n_A + n_C)/(n_A + n_B + n_C), a_2 = (n_B + n_C)/(n_A + n_B + n_C), a_3 = -n_C/(n_A + n_B + n_C), a_4 = 0 for the Ward's method.
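A sketch of this update with the two sets of coefficients is given below; the function and parameter names are illustrative, and the Ward coefficients shown are the standard ones for the Lance and Williams recurrence.

```python
def lance_williams_update(d_ac, d_bc, d_ab, n_a, n_b, n_c, method="complete"):
    """Sketch of the Lance and Williams update giving delta(A u B, C) from the
    pre-merge dissimilarities.  Cluster sizes n_* may be cardinals or durations,
    as in the text."""
    if method == "complete":
        a1, a2, a3, a4 = 0.5, 0.5, 0.0, 0.5
    elif method == "ward":
        n = n_a + n_b + n_c
        a1, a2 = (n_a + n_c) / n, (n_b + n_c) / n
        a3, a4 = -n_c / n, 0.0
    else:
        raise ValueError(method)
    return a1 * d_ac + a2 * d_bc + a3 * d_ab + a4 * abs(d_ac - d_bc)
```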
Figure 2. Obtained sequence segmentation on excerpt 1 of the Avengers movie for threshold τ2. S3 is an angle/reverse-angle sequence. S5 is a fade-out/fade-in effect.
3.4 Temporal weighting function
The temporal weighting function is used to constrain the distance and the index of dissimilarity as introduced in equation (1). In [9], only one type of temporal weighting function was proposed, namely a rectangular function, which is not smooth. We have tested three smooth functions: linear, parabolic, and sinusoidal.
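The analytic forms of these weighting functions are not reproduced here; the sketch below shows one plausible family, normalized so that W equals 1 for temporally adjacent shots and increases with temporal distance up to ΔT, which is an assumption on our part.

```python
import numpy as np

def temporal_weight(dt, delta_t_max, shape="parabolic"):
    """Hypothetical temporal weighting functions W over [0, Delta T].
    The shapes below simply grow smoothly with temporal distance, so that
    temporally distant shots are harder to cluster."""
    x = np.clip(dt / delta_t_max, 0.0, 1.0)      # normalized temporal distance
    if shape == "rectangular":                   # the non-smooth function of [9]
        return 1.0
    if shape == "linear":
        return 1.0 + x
    if shape == "parabolic":
        return 1.0 + x ** 2
    if shape == "sinusoidal":
        return 1.0 + np.sin(0.5 * np.pi * x)
    raise ValueError(shape)
```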
4 Experimental results
We have evaluated our method on a three-hour video corpus. We report here results on four excerpts of two minutes. Three excerpts are taken from Avengers movies to evaluate the segmentation into sequences in different contexts. The first one comprises an angle/reverse-angle editing effect and a transition with a dissolve effect. The second one includes a set change, and the third one involves color and rhythm changes. The obtained segmentations can be compared with a hand segmentation acting as ground truth. In the plots displayed in Figures 1, 3 and 4, main sequence changes are represented by a value of 1 and secondary changes by a value of 0.5. The last excerpt is extracted from a news program to test the relevance of the built hierarchy.
Among the implemented options, three sets of descriptors and functions are selected: (O1) color histogram intersection, rectangular temporal weighting function, and Complete Link method; (O2) color histogram intersection, parabolic temporal weighting function, and Ward's method based on cluster duration; (O3) Manhattan distance on shot durations, parabolic weighting function, and Ward's method based on cluster duration.
Results obtained on the news program excerpt show that the clustering distance d_c provides a correct description of the similarity between shots at different levels, even if the information distribution is not homogeneous in the various levels of the hierarchy. An adaptive thresholding applied to breaking distance values would nevertheless be necessary to avoid heterogeneous results. Tests have shown that the best video segmentation into sequences is found using option set O3 rather than O1.
In the processed excerpts, most of the sequence changes were correctly detected when the proper options were selected. In Fig. 1, we can point out that, using τ1 and option O1, all changes are detected with only one false alarm, and the angle/reverse-angle effect is recognized. Selecting the threshold value is nevertheless a rather critical issue. On excerpt 2, with a relevant threshold, we extract all the correct boundaries with option O1, with only one false alarm (Fig. 3). Using option O2, false alarms and missed detections increase on excerpt 2. The color and rhythm changes in excerpt 3 (Fig. 4) have been better detected using option O2. Consequently, how to automatically select the proper option remains an open issue.
Figure 3. Breaking distance values on excerpt 2 of the Avengers movie using option O1 (upper row), option O3 (middle row), and manual segmentation (lower row).
Figure 4. Breaking distance values on excerpt 3 of the Avengers movie using option O1 (upper row), option O3 (middle row), and manual segmentation (lower row).
5 Conclusion
The method described in this paper, based on the cophenetic matrix, makes it possible to accurately and efficiently segment video documents into sequences by building a binary agglomerative time-constrained hierarchy. We have implemented several versions. Selecting the most appropriate one improved the results and gave a better description of the similarity of the shots through the hierarchy. Experiments on a larger base will be conducted in future work for selecting the best parameter set and evaluating alternative thresholding strategies.
References
[1] P. Aigrain, P. Joly, and V. Longueville. Medium knowledge-based macro-segmentation of video into sequences. In M. T. Maybury, editor, Intelligent Multimedia Information Retrieval, pages 159-173. AAAI/MIT Press, 1997.
[2] J. Carrive, F. Pachet, and R. Ronfard. Using description logics for indexing audiovisual documents. In ITC-IRST, editor, Int. Workshop on Description Logics (DL'98), pages 116-120, Trento, 1998.
[3] A. Dailianas, R. B. Allen, and P. England. Comparison of automatic video segmentation algorithms. In SPIE Photonics West, volume 2615, pages 2-16, Philadelphia, 1995.
[4] A. Hanjalic, R. L. Lagendijk, and J. Biemond. Automatically segmenting movies into logical story units. In Third Int. Conf. on Visual Information Systems (VISUAL'99), volume LNCS 1614, pages 229-236, Amsterdam, 1999.
[5] A. G. Hauptmann and M. A. Smith. Text, speech, and vision for video segmentation: The Informedia project. In AAAI Fall Symposium, Computational Models for Integrating Language and Vision, Boston, 1995.
[6] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[7] J. R. Kender and B.-L. Yeo. Video scene segmentation via continuous video coherence. Technical report, IBM Research Division, 1997.
[8] R. Lienhart, S. Pfeiffer, and W. Effelsberg. Scene determination based on video and audio features. Technical report, University of Mannheim, November 1998.
[9] M. Yeung, B.-L. Yeo, and B. Liu. Extracting story units from long programs for video browsing and navigation. In Proc. of IEEE Int. Conf. on Multimedia Computing and Systems, Tokyo, 1996.
[10] J. Zupan. Clustering of Large Data Sets. Chemometrics Research Studies Series. John Wiley & Sons Ltd., 1982.
Acknowledgements
Images from the Avengers movie, part of the AIM corpus, were reproduced thanks to INA, Department Innovation.
... In this paper, we assume that all shot descriptions are manually created. We leave for future work the important issue of automatically generating prose storyboards from existing movies, where a number of existing techniques can be used [4, 32, 9, 12, 11]. We also leave for future work the difficult problems of automatically generating movies from their prose storyboards, where existing techniques in virtual camera control can be used [13, 6, 15, 8]. ...
Article
Full-text available
The prose storyboard language is a formal language for describing movies shot by shot, where each shot is described with a unique sentence. The language uses a simple syntax and limited vocabulary borrowed from working practices in traditional movie-making, and is intended to be readable both by machines and humans. The language is designed to serve as a high-level user interface for intelligent cinematography and editing systems.
... • methods that propose a grouping of shots based on a similarity that is both physical and temporal [123,6]; ...
Article
In this work, we developed FindMeDIA, the purpose of which is to preserve the Moroccan cultural heritage in the form of movies and photographs. Therefore, we had to manage a joint image and video database. Since a video can be perceived as a sequence of fixed images, we have been treating a video as an extension which depends on the modelling of images in a quasi-transparent way. Our main objective was, on the one hand, to meet the genericity and flexibility needs allowing navigation with different types of visual data, and, on the other hand, to put forward a system which enables us to navigate by moving indistinctly between images and videos. As far as the modelling of video is concerned, we proposed FindViDEO. Both in its model and metamodel parts, FindViDEO is flexible and includes a large spectrum of applications and pre-existing models. For the sake of navigation, we reused a Galois' lattice technique on a database composed of still images and key-frames extracted from videos. The resulting FindMeDIA system is generic and enables us to use many image description techniques for navigation. To test the interest of these approaches, the modelling of key-frames (extracted from videos) as well as still images is carried out by ClickImAGE which proposes a semi-structured representation of data based on the content of images.
... I submitted several contributions at MPEG seminars and meetings [48,51]. I published two articles intended for a non-specialist audience [36,37] [47,10,11,29,50,12,14,14,15,16] and invited communications [52,53,54,55]. For this project, I gathered the analysis results of the different partners to build the shot breakdown of the film "Cours, Lola" in the form of a MySQL database that can be queried from the project web site. ...
Article
Full-text available
I review my research activities in Video Indexing and Action Recognition and sketch a research agenda for bringing those two lines of research together to address the difficult problem of recognizing actions in movies. I first present a series of older projects in Video Indexing, starting with the DIVAN project at INA and the MPEG expert group (1998-2000), and continuing at INRIA under the VIBES project (2001-2004). This research falls under the general approach of "computational media aesthetics", where we attempt to recognize film events based on our knowledge of filming styles and conventions (cinematography and editing). This is illustrated with two applications - the automatic segmentation of TV news into topics; and the automatic indexing of movies with their scripts. I then present my more recent research in Action Recognition with the MOVI group at INRIA (2005-2008). Building upon the GRIMAGE infrastructure, I present experiments in (1) learning and recognizing a small repertoire of full body gestures in 3D using "motion history volumes"; (2) segmenting a raw stream of 3D image sequences into recognizable "primitive actions" ; and (3) using statistical models learned in 3D for recovering primitive actions and relative camera positions from a single 2D video recording of similar actions.
... In the literature they are also named "video paragraphs" [19], "video segments" [54,43], "story units" [38,51,50] or "chapters" [39]. Existing scene segmentation techniques can be classified in two categories: the ones using only visual features and others using multimodal features. ...
... An edits-based analysis requires an explicit model of the video program or a set of generic style rules used in editing. Editing rules have been applied in [18]. Definition 2 is based on the semantic notion of common locale and time. ...
Article
Full-text available
Although various Logical Story Unit (LSU) segmentation methods based on visual content have been presented, systematic evaluation of their mutual dependencies and their performance has not been performed. The unifying framework presented in this paper shows that LSU segmentation methods can be classified into 4 essentially different types. LSUs are subjective and cannot be defined with full certainty. We therefore present definitions based on film theory to limit subjectivity. We then introduce an evaluation method measuring the quality of a segmentation method and its economic impact rather than the amount of errors. Furthermore, the inherent complexity of the segmentation problem given a visual feature is measured. Finally, we show to what extent LSU segmentation depends on the quality of shot boundary segmentation. We present results of an evaluation of the four segmentation method types under similar circumstances using an unprecedented amount of 20 hours of 17 complete videos in different genres. Tools and ground truths are available for interactive use via Internet.
Thesis
Professional quality videos of live staged performances are created by recording them from different appropriate viewpoints. These are then edited together to portray an eloquent story replete with the ability to draw out the intended emotion from the viewers. Creating such competent videos involves the combination of multiple high quality cameras and skilled camera operators. We present a thesis to make even the low budget productions adept and pleasant by producing professional quality videos sans a fully and expensively equipped crew of cameramen. A high resolution static camera replaces the plural camera crew and their efficient camera movements are then simulated by virtually panning - tilting - zooming within the original recordings. We show that multiple virtual cameras can be simulated by choosing different trajectories of cropping windows inside the original recording. One of the key novelties of this work is an optimization framework for computing the virtual camera trajectories using the information extracted from the original video based on computer vision techniques. The actors present on stage are considered as the most important elements of the scene. For the task of localizing and naming actors, we introduce generative models for learning view independent person and costume specific detectors from a set of labeled examples. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We demonstrate that such actor specific models can accurately localize actors despite changes in view point and occlusions, and significantly improve the detection recall rates over generic detectors. The dissertation then presents an offline algorithm for tracking objects and actors in long video sequences using these actor specific models. Detections are first performed to independently select candidate locations of the actor/object in each frame of the video. The candidate detections are then combined into smooth trajectories in an optimization step minimizing a cost function accounting for false detections and occlusions. Using the actor tracks, we propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically-pleasing sub-clips which can easily be edited together using off-the-shelf multi-clip video editing software.
Article
Full-text available
Professional quality videos of live staged performances are created by recording them from different appropriate viewpoints. These are then edited together to portray an eloquent story replete with the ability to draw out the intended emotion from the viewers. Creating such competent videos typically requires a team of skilled camera operators to capture the scene from multiple viewpoints. In this thesis, we explore an alternative approach where we automatically compute camera movements in post-production using specially designed computer vision methods. A high resolution static camera replaces the plural camera crew and their efficient camera movements are then simulated by virtually panning - tilting - zooming within the original recordings. We show that multiple virtual cameras can be simulated by choosing different trajectories of cropping windows inside the original recording. One of the key novelties of this work is an optimization framework for computing the virtual camera trajectories using the information extracted from the original video based on computer vision techniques. The actors present on stage are considered as the most important elements of the scene. For the task of localizing and naming actors, we introduce generative models for learning view independent person and costume specific detectors from a set of labeled examples. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We demonstrate that such actor specific models can accurately localize actors despite changes in view point and occlusions, and significantly improve the detection recall rates over generic detectors. The thesis then proposes an offline algorithm for tracking objects and actors in long video sequences using these actor specific models. Detections are first performed to independently select candidate locations of the actor/object in each frame of the video. The candidate detections are then combined into smooth trajectories by minimizing a cost function accounting for false detections and occlusions. Using the actor tracks, we then describe a method for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel convex optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically-pleasing sub-clips which can easily be edited together using off-the-shelf multi-clip video editing software. The proposed methods have been tested and validated on a challenging corpus of theatre recordings. They open the way to novel applications of computer vision methods for cost-effective video production of live performances including, but not restricted to, theatre, music and opera.
Article
TV programs have an underlying structure that is lost when these are broadcasted. The linear mode is the only available reading mode when viewing programs recorded using a Personal Video Recorder or through a TV-on-Demand service. The fast-forward/backward functions are the only available tools for browsing. In this context, program structuring becomes important in order to provide users with novel and useful browsing features. In addition to advanced browsing features, TV program structuring can also be used for summarization, indexing and querying, archiving, etc. This thesis addresses the problem of unsupervised TV program structuring. The idea is to automatically recover the original structure of the program by finding the start time of each part composing it. The proposed approach is completely unsupervised and addresses a large category of programs like TV games, magazines, news…. It is based on the detection of “separators” which are short audio/visual sequences that delimit the different parts of a program. To do so, audio and visual recurrences are first detected from a set of episodes of a same program. In order to extract the separators, the recurrences are then classified using decision trees. These are built based on attributes issued from techniques like applause detection, scenes segmentation, face and speaker detection and clustering.
Article
Logical units are semantic video segments above the shot level. Depending on the common semantics within the unit and data domain, different types of logical unit extraction algorithms have been presented in literature. Topic units are typically extracted for documentaries or news broadcasts while scenes are extracted for narrative-driven video such as feature films, sitcoms, or cartoons. Other types of logical units are extracted from home video and sports. Different algorithms in literature used for the extraction of logical units are reviewed in this paper based on the categories unit type, data domain, features used, segmentation method, and thresholds applied. A detailed comparative study is presented for the case of extracting scenes from narrative-driven video. While earlier comparative studies focused on scene segmentation methods only or on complete news-story segmentation algorithms, in this paper various visual features and segmentation methods with their thresholding mechanisms and their combination into complete scene detection algorithms are investigated. The performance of the resulting large set of algorithms is then evaluated on a set of video files including feature films, sitcoms, children's shows, a detective story, and cartoons.
Data
Full-text available
We address the problem of indexing broadcast audiovisual documents such as films and newscasts. Starting from a collection of shots, we aim at building automatically high-level descriptions of subsets of this collection, that can be used for annotating, indexing and accessing the documents. We propose to represent documents and high-level descriptions with the framework of description logics, enriched with temporal relations. We first define the problem as a classification problem. We then propose an algorithm to automatically classify subsequences of shots, based on a bottom-up construction of descriptions using the rule mechanism of the CLASSIC system.
Conference Paper
Full-text available
We address the problem of indexing broadcast audiovisual documents (such as films, news). Starting from a collection of so-called shots, we aim at building automatically high level descriptions of subsets of this collection, that can be used for annotating, indexing and accessing the document. We propose to represent documents and high level descriptions with the framework of description logics, enriched with temporal relations. We first define the problem as a classification problem. We then propose an algorithm to automatically classify sub-sequences of shots, based on a bottom-up construction of descriptions using the rule mechanism of the CLASSIC system.
Conference Paper
Full-text available
Determining automatically what constitutes a scene in a video is a challenging task, particularly since there is no precise definition of the term “scene”. It is left to the individual to set attributes shared by consecutive shots which group them into scenes. Certain basic attributes such as dialogs, like settings and continuing sounds are consistent indicators. We have therefore developed a scheme for identifying scenes by clustering shots according to detected dialogs, like settings and similar audio. Results from experiments show automatic identification of these types of scenes to be reliable
Article
We describe three technologies involved in creating a digital video library suitable for full-content search and retrieval. Image processing analyzes scenes, speech processing transcribes the audio signal, and natural language processing determines word relevance. The integration of these technologies enables us to include vast amounts of video data in the library.
Conference Paper
Content based browsing and navigation in digital video collections have been centered on sequential and linear presentation of images. To facilitate such applications, nonlinear and non-sequential access into video documents is essential, especially with long programs. For many programs, this can be achieved by identifying underlying story structures which are reflected both by visual content and temporal organization of composing elements. A new framework of video analysis and associated techniques are proposed to automatically parse long programs, to extract story structures and identify story units. The proposed analysis and representation contribute to the extraction of scenes and story units, each representing a distinct locale or event, that cannot be achieved by shot boundary detection alone. Analysis is performed on MPEG compressed video and without a priori models. The result is a compact representation that serves as a summary of the story and allows hierarchical organization of video documents
In extended video sequences, individual frames are grouped into shots which are defined as a sequence taken by a single camera, and related shots are grouped into scenes which are defined by a single dramatic event taken by a small number of related cameras. This hierarchical structure is deliberately constructed, dictated by the limitations and preferences of the human visual and memory systems. We present three novel high-level segmentation results derived from these considerations, some of which are analogous to those involved in the perception of the structure of music. First and primarily, we derive and demonstrate a method for measuring probable scene boundaries, by calculating a short term memory-based model of shot-to-shot "coherence". The detection of local minima in this continuous measure permits robust and flexible segmentation of the video into scenes, without the necessity for first aggregating shots into clusters. Second, and independently of the first, we then derive and demonstrate a one-pass on-the-fly shot clustering algorithm. Third, we demonstrate partially successful results on the application of these two new methods to the next higher, "theme", level of video structure.